# Product Review Opinion Mining

_Assignment for the University of Bath as part of MSc in Artificial Intelligence_

_Data source: Minqing Hu and Bing Liu, 2004. Department of Computer Science - University of Illinois at Chicago. (see readme.md in data folders for additional information)_

## 1. Task and Dataset Analysis

The task of opinion mining for product reviews consists of two separate but related challenges:

1. Feature Extraction
2. Sentiment Detection

The feature extraction task is concerned with defining which product features are being described in the reviews and additionally extracting sentiment bearing words. The sentiment detection task is then concerned with determining the polarity of reviews (whether sentiment is positive or negative towards the extracted features).

The 17 text files provided follow broadly the same structure, with each review sentence split by the characters '##'. Text before these characters holds the annotations, which are used as the "Gold Standard" product features and sentiment for this analysis. Text after these characters is the review sentence itself. In this analysis I consider the following information from the files:
- The Gold Standard product features
- The Gold Standard sentiment attached to the product features (though only the polarity, positive or negative)
- The sentence itself from which features and sentiment bearing words are extracted, and sentiment detected
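
As a minimal sketch of this line format, the function below splits one raw line into its annotated features, their polarities, and the sentence. The function name `parse_annotated_line` is hypothetical, and stripping whitespace from feature names is an addition that the notebook code further down does not perform.

```python
def parse_annotated_line(line):
    """Split one raw line into (features, polarities, sentence).

    Returns None for lines without the '##' separator (e.g. titles).
    """
    if "##" not in line:
        return None
    annotation, sentence = line.split("##", 1)
    features, polarities = [], []
    for item in annotation.split(","):
        if "[" not in item:
            continue  # empty annotation or unexpected token
        name, score = item.split("[", 1)
        features.append(name.strip())  # strip() is an assumption, see lead-in
        if "+" in score:
            polarities.append("pos")
        elif "-" in score:
            polarities.append("neg")
        else:
            polarities.append("neutral")
    return features, polarities, sentence.strip()


# A line taken from the Apex DVD player file
print(parse_annotated_line("format[+3]##no matter the format .\n"))
# (['format'], ['pos'], 'no matter the format .')
```

Only the sign of the score is kept, which mirrors the decision above to use polarity rather than sentiment strength.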

I will proceed with the analysis step by step using a single text file (Apex AD2600 Progressive-scan DVD player) as a demonstration, with evaluation of feature extraction and sentiment detection conducted across all files.

## 2. Data Parsing

Once the relevant libraries have been imported, the first step in the process is to load the text file and parse its contents into a dataframe, with the key task being to separate the Gold Standard features and sentiments from the sentences themselves:

In [1]:
import pandas as pd
import spacy
import nltk
import math
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [2]:
# Please change filepath if re-running code

with open('C:/Users/samjp/OneDrive/Desktop/Products/Apex AD2600 Progressive-scan DVD player.txt') as txt_file:
    df = pd.DataFrame({'Raw text': txt_file.readlines()})

df['Sentence'] = df['Raw text'].str.split('##').str[1]
df['Sentence'] = df['Sentence'].str.replace('\n', "")
df['Gold Standard'] = df['Raw text'].str.split('##').str[0]
df['Gold Standard'] = df['Gold Standard'].str.split(',')
df['Gold Standard Sentiment Score'] = [[] for _ in range(len(df))]
df['Gold Standard Feature'] = [[] for _ in range(len(df))]
df['Gold Standard Sentiment'] = [[] for _ in range(len(df))]
df = df.dropna()
df = df.reset_index(drop=True)

for i in range(len(df['Gold Standard'])):
    if df.loc[i, 'Gold Standard'] != ['']:
        for item in df.loc[i, 'Gold Standard']:
            if "[" in item:
                df.loc[i, 'Gold Standard Sentiment Score'].append(item.split("[")[1])
                df.loc[i, 'Gold Standard Feature'].append(item.split("[")[0])

for i in range(len(df['Gold Standard Sentiment Score'])):
    if df.loc[i, 'Gold Standard Sentiment Score'] != []:
        for item in df.loc[i, 'Gold Standard Sentiment Score']:
            if "+" in item:
                df.loc[i, 'Gold Standard Sentiment'].append("pos")
            elif "-" in item:
                df.loc[i, 'Gold Standard Sentiment'].append("neg")
            else:
                df.loc[i, 'Gold Standard Sentiment'].append("neutral")
df

Unnamed: 0,Raw text,Sentence,Gold Standard,Gold Standard Sentiment Score,Gold Standard Feature,Gold Standard Sentiment
0,"##repost from january 13 , 2004 with a better ...","repost from january 13 , 2004 with a better fi...",[],[],[],[]
1,##does your apex dvd player only play dvd audi...,does your apex dvd player only play dvd audio ...,[],[],[],[]
2,##or does it play audio and video but scrollin...,or does it play audio and video but scrolling ...,[],[],[],[]
3,##before you try to return the player or waste...,before you try to return the player or waste h...,[],[],[],[]
4,##no picture : \n,no picture :,[],[],[],[]
...,...,...,...,...,...,...
734,dvd player[+3]##i am really impressed by this ...,i am really impressed by this dvd player .,[dvd player[+3]],[+3]],[dvd player],[pos]
735,"##if it can fit in the drive bay , this dvd pl...","if it can fit in the drive bay , this dvd play...",[],[],[],[]
736,"play[+2], dvd[+2]##for instance , i made sever...","for instance , i made several back-ups of my d...","[play[+2], dvd[+2]]","[+2], +2]]","[play, dvd]","[pos, pos]"
737,format[+3]##no matter the format . \n,no matter the format .,[format[+3]],[+3]],[format],[pos]


Some sentences have multiple features and associated sentiments which are parsed as lists into the "Gold Standard Feature" and "Gold Standard Sentiment" columns respectively.

Here I have also removed the rows which do not contain "sentences" for the purposes of the analysis (i.e. those that do not contain "##").

## 3. Data Pre-processing and Feature Extraction

Next, I look to extract product features and sentiment-bearing words from the "Sentence" column. Pre-processing is performed first to standardise the data as much as possible. During this step, several procedures are performed:

- The sentence is converted to lowercase - this helps in all future steps of the process, as it ensures that if two words are the same, they will be counted as such, regardless of the case used.
- The sentence is tokenized using spaCy's NLP pipeline so that each token can be assigned a part of speech.
- Punctuation is removed as it is not likely to be a part of a product feature nor be indicative of sentiment. There may be some exceptions to this, for example, exclamation marks could indicate enthusiasm, but they could equally indicate anger. I have chosen to focus on the words themselves in this analysis as they are more likely to clearly indicate either positive or negative sentiment.
- Stopwords are also removed - while they are required to make a sentence readable, they are not likely to either be product features (generally nouns) or sentiment-bearing words (generally adjectives).
- Words are lemmatised - again this is to help standardise the words to be analysed so product features and sentiment can more easily be detected. Lemmatisation rather than stemming is more appropriate here as it is important that full (not stemmed) words are extracted as product features to aid readability in the product and feature summaries. 

The pre-processed sentence is then stored in the "Cleaned Sentence" column.


While pre-processing and Feature Extraction are separate steps, they are combined in the code here for efficiency.
Product Features are identified and extracted to the "Identified features" column in the following way:

- Nouns are extracted as potential product features
- Spacy's chunking algorithm is applied to identify noun-phrases
- The vast majority of product features within the text files are one- or two-word phrases, so noun phrases longer than two words are split using nltk's bigram function, which divides the noun phrase into overlapping 2-word chunks. Shorter noun phrases are kept whole.
- These bigrams are then added to the list of nouns as potential product features occurring in a given sentence.

Adjectives are also extracted and stored in the "Sentiment Words" column as the most likely parts of speech to be sentiment bearing in relation to features (nouns).

Here I chose to extract adjectives before applying an algorithm to determine sentence sentiment. From observations in the data, it is clear that although many sentences contain possible features, it is only a subset of these that display sentiment towards those features. Therefore, it is necessary to extract words that are likely to determine sentiment so the algorithm can be applied selectively to those sentences, reducing the possibility that sentences are labelled as having positive or negative sentiment where no sentiment is actually present.
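
The bigram step can be sketched without spaCy or nltk as below. This is an illustration only: the helper name `phrase_to_candidates` is hypothetical, and it splits on whitespace where the notebook uses `nltk.word_tokenize` and `nltk.bigrams`.

```python
def phrase_to_candidates(phrase):
    """Turn one noun phrase into candidate product features.

    Phrases longer than two tokens are split into overlapping bigrams;
    one- and two-word phrases are kept whole.
    """
    tokens = phrase.split()
    if len(tokens) > 2:
        return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return [phrase]


print(phrase_to_candidates("apex dvd player"))  # ['apex dvd', 'dvd player']
print(phrase_to_candidates("dvd player"))       # ['dvd player']
```

Splitting into overlapping bigrams means a three-word phrase contributes two candidates, which is one reason the raw candidate list is much longer than the list of genuine features and needs pruning.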

In [3]:
# Lower case
df['Sentence'] = df['Sentence'].str.lower()

# remove stopwords & punctuation, lemmatise + Extract Nouns, NPs and Adjectives

nlp = spacy.load("en_core_web_sm")
df['Cleaned Sentence'] = ''
df['Identified Features'] = [[] for _ in range(len(df))]
df['Sentiment Words'] = [[] for _ in range(len(df))]
for i in range(len(df['Sentence'])):
    doc = nlp(df.loc[i, 'Sentence'])
    cleaned_doc = ""
    for token in doc:
        if token.is_stop is False and token.is_punct is False:
            cleaned_doc += str(token.lemma_) + " "
    df.loc[i, 'Cleaned Sentence'] = cleaned_doc
    cleaned_doc = nlp(cleaned_doc)
    for token in cleaned_doc:
        if token.pos_ == 'NOUN':
            df.loc[i, 'Identified Features'].append(str(token))
        if token.pos_ == 'ADJ':
            df.loc[i, 'Sentiment Words'].append(str(token))
    for chunk in cleaned_doc.noun_chunks:
        nltk_tokens = nltk.word_tokenize(str(chunk))
        if len(nltk_tokens) > 2:
            bigram_list = list(nltk.bigrams(nltk_tokens))
            for bigram in bigram_list:
                bigram_string = str(bigram[0]) + " " + str(bigram[1])
                df.loc[i, 'Identified Features'].append(bigram_string)
        else:
            df.loc[i, 'Identified Features'].append(str(chunk))
df

Unnamed: 0,Raw text,Sentence,Gold Standard,Gold Standard Sentiment Score,Gold Standard Feature,Gold Standard Sentiment,Cleaned Sentence,Identified Features,Sentiment Words
0,"##repost from january 13 , 2004 with a better ...","repost from january 13 , 2004 with a better fi...",[],[],[],[],repost january 13 2004 well fit title,"[repost, well fit, fit title]",[]
1,##does your apex dvd player only play dvd audi...,does your apex dvd player only play dvd audio ...,[],[],[],[],apex dvd player play dvd audio video,"[player, audio, video, apex dvd, dvd player, d...",[]
2,##or does it play audio and video but scrollin...,or does it play audio and video but scrolling ...,[],[],[],[],play audio video scroll black white,"[scroll, white, audio video, video scroll, scr...","[audio, black]"
3,##before you try to return the player or waste...,before you try to return the player or waste h...,[],[],[],[],try return player waste hour call apex tech su...,"[return, player, waste, hour, call, support, p...","[simple, troubleshooting]"
4,##no picture : \n,no picture :,[],[],[],[],picture,"[picture, picture]",[]
...,...,...,...,...,...,...,...,...,...
734,dvd player[+3]##i am really impressed by this ...,i am really impressed by this dvd player .,[dvd player[+3]],[+3]],[dvd player],[pos],impress dvd player,"[player, dvd player]",[]
735,"##if it can fit in the drive bay , this dvd pl...","if it can fit in the drive bay , this dvd play...",[],[],[],[],fit drive bay dvd player play,"[player, fit drive, drive bay, bay dvd, dvd pl...",[]
736,"play[+2], dvd[+2]##for instance , i made sever...","for instance , i made several back-ups of my d...","[play[+2], dvd[+2]]","[+2], +2]]","[play, dvd]","[pos, pos]",instance up dvd movie dvd r w + r w play dvds,"[instance, w, play, dvds, instance, dvd movie,...",[]
737,format[+3]##no matter the format . \n,no matter the format .,[format[+3]],[+3]],[format],[pos],matter format,"[format, matter format]",[]


## 4. Feature Pruning

Now that I have a list of potential product features for each sentence, I conduct an analysis to determine which are most likely to be genuine product features, and prune those that are not:

In [4]:
feature_dict = {}
for features in df['Identified Features']:
    for feature in features:
        if feature_dict.get(feature) is None:
            feature_dict[feature] = 1
        else:
            feature_dict[feature] += 1

feature_df = pd.DataFrame.from_dict(feature_dict, orient='index', columns=['Count'])
feature_df = feature_df.sort_values(by="Count", ascending=False)
feature_df

Unnamed: 0,Count
player,164
dvd player,76
play,59
work,48
problem,47
...,...
disappear,1
slim design,1
screen image,1
3/4 screen,1


Clearly there are some words/phrases here that are potentially product features, for example "player" and "dvd player". Others, for example "problem" and "disappear", are not. Additionally, there are far more candidate features here than reviews in the dataset, which makes analysis of frequent features difficult. Pruning some features is therefore required, which I do in several ways, starting with:

- Pointwise mutual information (PMI) - This technique is used to prune redundant bigrams by measuring co-occurrence of the two words, i.e. the probability that the words occur together (determined by the number of times a particular bigram appears in the corpus) compared to the number of times the words appear in total. Bigrams with low PMI (<-3 in this analysis) are pruned from the feature set.
- Infrequent features - A feature (whether a bigram or unigram) is pruned if it appears in <1% of reviews. This eliminates a large portion of features that are not likely to contribute to the analysis as they occur in only a small number of review sentences.
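
For reference, the sketch below places the textbook definition of PMI, log2(p(x, y) / (p(x) p(y))), next to the score that the cell below actually computes, which divides by the sum of the two word probabilities rather than their product. Both function names are hypothetical, and the counts are taken from the feature table above, on the assumption that the "dvd player" score was computed from the "player" and "dvd" counts.

```python
import math


def standard_pmi(bigram_count, word_1_count, word_2_count, corpus_size):
    """Textbook PMI: log2 of p(x, y) / (p(x) * p(y))."""
    p_xy = bigram_count / corpus_size
    p_x = word_1_count / corpus_size
    p_y = word_2_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))


def notebook_score(bigram_count, word_1_count, word_2_count):
    """Score used in this notebook: the denominator is p(x) + p(y),
    so the corpus size cancels out of the ratio."""
    return math.log2(bigram_count / (word_1_count + word_2_count))


# "dvd player" appears 76 times, "player" 164 times and "dvd" 47 times
print(notebook_score(76, 164, 47))  # approximately -1.473
```

Because the notebook's score can never exceed zero for a bigram whose words are both counted, the -3 threshold is specific to that score; switching to the textbook definition would need a different cut-off.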

In [5]:
# Redundancy Pruning through pointwise mutual information (if PMI >-3)

corpus_size = 0
for sentence in df['Cleaned Sentence']:
    corpus_size += len(nltk.word_tokenize(sentence))

feature_df["PMI"] = 0.0  # float column; unigrams keep the default of 0.0
for index in feature_df.index:
    if len(nltk.word_tokenize(index)) == 2:
        bigram_count = feature_df.loc[index, "Count"]
        if index.split(" ")[0] in feature_df.index:
            word_1_count = feature_df.loc[index.split(" ")[0], "Count"]
        else:
            word_1_count = bigram_count
        if " " in index:
            if index.split(" ")[1] in feature_df.index:
                word_2_count = feature_df.loc[index.split(" ")[1], "Count"]
            else:
                word_2_count = bigram_count
        else:
            word_2_count = bigram_count
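        # Note: the denominator sums the two word probabilities, so corpus_size cancels;
        # textbook PMI would multiply them instead
        pmi = math.log(((bigram_count / corpus_size) / ((word_1_count/corpus_size) + (word_2_count/corpus_size))), 2)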
        feature_df.loc[index, "PMI"] = pmi

feature_df = feature_df[feature_df.PMI > -3]

# Pruning infrequent features (appear in <1% of reviews)

feature_df = feature_df[feature_df.Count > 0.01*len(df)]

print(f"Features remaining: {len(feature_df)}")
feature_df[:20]

Features remaining: 56


Unnamed: 0,Count,PMI
player,164,0.0
dvd player,76,-1.473172
play,59,0.0
work,48,0.0
problem,47,0.0
dvd,47,0.0
unit,36,0.0
disc,34,0.0
picture,34,0.0
feature,33,0.0


Next, the pruned feature list is applied to the dataframe (df). In doing this, some additional feature pruning takes place at a sentence level. 

- If a feature in a sentence is a strict subset of another feature in the same sentence, it is removed as a feature in that sentence. For example if the features of one sentence are [dvd, player, dvd player], both dvd and player would be pruned as features from that sentence. This reduces the number of features appearing per sentence where it is likely that they are referring to the same product feature.
- As mentioned above, the analysis is only concerned with features where some sentiment is present (which mirrors the text file data), therefore an additional column is added ("Sentiment Bearing Features"). For each sentence, this column is populated with the pruned features if and only if adjectives were extracted from the sentence. This column represents the final set of extracted product features.
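
The sentence-level subset rule can be sketched as a small standalone function. The name `prune_contained_features` is hypothetical; like the cell below, it treats "subset" as a plain substring test.

```python
def prune_contained_features(features):
    """Drop any feature that is a substring of another feature in the same sentence."""
    kept = []
    for feature in features:
        contained = any(
            feature != other and feature in other for other in features
        )
        if not contained:
            kept.append(feature)
    return kept


print(prune_contained_features(["dvd", "player", "dvd player"]))  # ['dvd player']
print(prune_contained_features(["play", "dvd player"]))           # ['dvd player']
```

The second call shows a side effect of substring matching: "play" is removed because it occurs inside "player", even though it is a different word. Comparing whole tokens instead would avoid this, at the cost of a slightly longer check.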

In [6]:
# Apply pruned features to data

df['Pruned Features'] = [[] for _ in range(len(df))]
for i in range(len(df['Identified Features'])):
    for feature in df.loc[i, 'Identified Features']:
        if feature in feature_df.index and feature not in df.loc[i, 'Pruned Features']:
            df.loc[i, 'Pruned Features'].append(feature)

# Remove those features that are subsets of others per sentence

for i in range(len(df['Pruned Features'])):
    if len(df.loc[i, 'Pruned Features']) > 1:
        for feature in df.loc[i, 'Pruned Features'].copy():
            if len(df.loc[i, 'Pruned Features']) > 1:
                other_features = df.loc[i, 'Pruned Features'].copy()
                other_features.remove(feature)
                for other_feature in other_features:
                    if feature in other_feature and feature in df.loc[i, 'Pruned Features']:
                        df.loc[i, 'Pruned Features'].remove(feature)

# Remove features with no sentiment

df['Sentiment Bearing Features'] = [[] for _ in range(len(df))]
for i in range(len(df['Pruned Features'])):
    if df.loc[i, 'Sentiment Words'] != []:
        for feature in df.loc[i, 'Pruned Features']:
            df.loc[i, 'Sentiment Bearing Features'].append(feature)

df

Unnamed: 0,Raw text,Sentence,Gold Standard,Gold Standard Sentiment Score,Gold Standard Feature,Gold Standard Sentiment,Cleaned Sentence,Identified Features,Sentiment Words,Pruned Features,Sentiment Bearing Features
0,"##repost from january 13 , 2004 with a better ...","repost from january 13 , 2004 with a better fi...",[],[],[],[],repost january 13 2004 well fit title,"[repost, well fit, fit title]",[],[],[]
1,##does your apex dvd player only play dvd audi...,does your apex dvd player only play dvd audio ...,[],[],[],[],apex dvd player play dvd audio video,"[player, audio, video, apex dvd, dvd player, d...",[],"[video, apex dvd, dvd player]",[]
2,##or does it play audio and video but scrollin...,or does it play audio and video but scrolling ...,[],[],[],[],play audio video scroll black white,"[scroll, white, audio video, video scroll, scr...","[audio, black]",[],[]
3,##before you try to return the player or waste...,before you try to return the player or waste h...,[],[],[],[],try return player waste hour call apex tech su...,"[return, player, waste, hour, call, support, p...","[simple, troubleshooting]","[return, player, hour, support]","[return, player, hour, support]"
4,##no picture : \n,no picture :,[],[],[],[],picture,"[picture, picture]",[],[picture],[]
...,...,...,...,...,...,...,...,...,...,...,...
734,dvd player[+3]##i am really impressed by this ...,i am really impressed by this dvd player .,[dvd player[+3]],[+3]],[dvd player],[pos],impress dvd player,"[player, dvd player]",[],[dvd player],[]
735,"##if it can fit in the drive bay , this dvd pl...","if it can fit in the drive bay , this dvd play...",[],[],[],[],fit drive bay dvd player play,"[player, fit drive, drive bay, bay dvd, dvd pl...",[],[dvd player],[]
736,"play[+2], dvd[+2]##for instance , i made sever...","for instance , i made several back-ups of my d...","[play[+2], dvd[+2]]","[+2], +2]]","[play, dvd]","[pos, pos]",instance up dvd movie dvd r w + r w play dvds,"[instance, w, play, dvds, instance, dvd movie,...",[],"[play, dvds]",[]
737,format[+3]##no matter the format . \n,no matter the format .,[format[+3]],[+3]],[format],[pos],matter format,"[format, matter format]",[],[format],[]


## 5. Precision and Recall

To test the effectiveness of the feature extraction, two metrics are commonly used:
- Precision - the proportion of predicted (extracted) features that match the Gold Standard features
- Recall - the proportion of Gold Standard features that were matched by the predicted features

These metrics are applied to each of the 17 text files separately, using the Gold Standard features as the reference. The output is generated below.
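
A small worked example may make the two metrics concrete. The sketch below counts matches sentence by sentence and then aggregates; the feature lists are made up for illustration, the function name is hypothetical, and returning 0.0 when a denominator is zero is a choice made here rather than something the notebook does.

```python
def precision_recall_from_lists(predicted_per_sentence, gold_per_sentence):
    """Aggregate true/false positives and false negatives over all sentences."""
    true_pos = false_pos = false_neg = 0
    for predicted, gold in zip(predicted_per_sentence, gold_per_sentence):
        true_pos += sum(1 for feature in predicted if feature in gold)
        false_pos += sum(1 for feature in predicted if feature not in gold)
        false_neg += sum(1 for feature in gold if feature not in predicted)
    # Guard against empty predictions or empty annotations (assumption: report 0.0)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical data: two sentences
predicted = [["dvd player", "picture"], ["remote"]]
gold = [["dvd player"], ["remote", "price", "menu"]]

precision, recall = precision_recall_from_lists(predicted, gold)
print(precision, recall)  # 2 of 3 predictions are correct; 2 of 4 gold features are found
```

Here precision is 2/3 and recall is 2/4: one extracted feature ("picture") is not annotated, and two annotated features ("price", "menu") are missed.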

In [7]:
# This function is a concatenation of the steps above

def feature_extraction(file):

    with open(file) as txt_file:
        df = pd.DataFrame({'Raw text': txt_file.readlines()})

    df['Sentence'] = df['Raw text'].str.split('##').str[1]
    df['Sentence'] = df['Sentence'].str.replace('\n', "")
    df['Gold Standard'] = df['Raw text'].str.split('##').str[0]
    df['Gold Standard'] = df['Gold Standard'].str.split(',')
    df['Gold Standard Sentiment Score'] = [[] for _ in range(len(df))]
    df['Gold Standard Feature'] = [[] for _ in range(len(df))]
    df['Gold Standard Sentiment'] = [[] for _ in range(len(df))]
    df = df.dropna()
    df = df.reset_index(drop=True)

    for i in range(len(df['Gold Standard'])):
        if df.loc[i, 'Gold Standard'] != ['']:
            for item in df.loc[i, 'Gold Standard']:
                if "[" in item:
                    df.loc[i, 'Gold Standard Sentiment Score'].append(item.split("[")[1])
                    df.loc[i, 'Gold Standard Feature'].append(item.split("[")[0])

    for i in range(len(df['Gold Standard Sentiment Score'])):
        if df.loc[i, 'Gold Standard Sentiment Score'] != []:
            for item in df.loc[i, 'Gold Standard Sentiment Score']:
                if "+" in item:
                    df.loc[i, 'Gold Standard Sentiment'].append("pos")
                elif "-" in item:
                    df.loc[i, 'Gold Standard Sentiment'].append("neg")
                else:
                    df.loc[i, 'Gold Standard Sentiment'].append("neutral")


    # Lower case
    df['Sentence'] = df['Sentence'].str.lower()

    # remove stopwords & punctuation, lemmatise + Extract Nouns, NPs and Adjectives

    nlp = spacy.load("en_core_web_sm")
    df['Cleaned Sentence'] = ''
    df['Identified Features'] = [[] for _ in range(len(df))]
    df['Sentiment Words'] = [[] for _ in range(len(df))]
    for i in range(len(df['Sentence'])):
        doc = nlp(df.loc[i, 'Sentence'])
        cleaned_doc = ""
        for token in doc:
            if token.is_stop is False and token.is_punct is False:
                cleaned_doc += str(token.lemma_) + " "
        df.loc[i, 'Cleaned Sentence'] = cleaned_doc
        cleaned_doc = nlp(cleaned_doc)
        for token in cleaned_doc:
            if token.pos_ == 'NOUN':
                df.loc[i, 'Identified Features'].append(str(token))
            if token.pos_ == 'ADJ':
                df.loc[i, 'Sentiment Words'].append(str(token))
        for chunk in cleaned_doc.noun_chunks:
            nltk_tokens = nltk.word_tokenize(str(chunk))
            if len(nltk_tokens) > 2:
                bigram_list = list(nltk.bigrams(nltk_tokens))
                for bigram in bigram_list:
                    bigram_string = str(bigram[0]) + " " + str(bigram[1])
                    df.loc[i, 'Identified Features'].append(bigram_string)
            else:
                df.loc[i, 'Identified Features'].append(str(chunk))

    # Build feature list (dataframe)

    feature_dict = {}
    for features in df['Identified Features']:
        for feature in features:
            if feature_dict.get(feature) is None:
                feature_dict[feature] = 1
            else:
                feature_dict[feature] += 1

    feature_df = pd.DataFrame.from_dict(feature_dict, orient='index', columns=['Count'])
    feature_df = feature_df.sort_values(by="Count", ascending=False)

    # Redundancy Pruning through pointwise mutual information (if PMI >-3)

    corpus_size = 0
    for sentence in df['Cleaned Sentence']:
        corpus_size += len(nltk.word_tokenize(sentence))

    feature_df["PMI"] = 0.0  # float column; unigrams keep the default of 0.0
    for index in feature_df.index:
        if len(nltk.word_tokenize(index)) == 2:
            bigram_count = feature_df.loc[index, "Count"]
            if index.split(" ")[0] in feature_df.index:
                word_1_count = feature_df.loc[index.split(" ")[0], "Count"]
            else:
                word_1_count = bigram_count
            if " " in index:
                if index.split(" ")[1] in feature_df.index:
                    word_2_count = feature_df.loc[index.split(" ")[1], "Count"]
                else:
                    word_2_count = bigram_count
            else:
                word_2_count = bigram_count
            pmi = math.log(((bigram_count / corpus_size) / ((word_1_count/corpus_size) + (word_2_count/corpus_size))), 2)
            feature_df.loc[index, "PMI"] = pmi

    feature_df = feature_df[feature_df.PMI > -3]

    # Pruning infrequent features (appear in <1% of reviews)

    feature_df = feature_df[feature_df.Count > 0.01*len(df)]

    # Apply pruned features to data

    df['Pruned Features'] = [[] for _ in range(len(df))]
    for i in range(len(df['Identified Features'])):
        for feature in df.loc[i, 'Identified Features']:
            if feature in feature_df.index and feature not in df.loc[i, 'Pruned Features']:
                df.loc[i, 'Pruned Features'].append(feature)

    # Remove those features that are subsets of others per sentence

    for i in range(len(df['Pruned Features'])):
        if len(df.loc[i, 'Pruned Features']) > 1:
            for feature in df.loc[i, 'Pruned Features'].copy():
                if len(df.loc[i, 'Pruned Features']) > 1:
                    other_features = df.loc[i, 'Pruned Features'].copy()
                    other_features.remove(feature)
                    for other_feature in other_features:
                        if feature in other_feature and feature in df.loc[i, 'Pruned Features']:
                            df.loc[i, 'Pruned Features'].remove(feature)

    # Remove features with no sentiment

    df['Sentiment Bearing Features'] = [[] for _ in range(len(df))]
    for i in range(len(df['Pruned Features'])):
        if df.loc[i, 'Sentiment Words'] != []:
            for feature in df.loc[i, 'Pruned Features']:
                df.loc[i, 'Sentiment Bearing Features'].append(feature)


    return df

In [8]:
def precision_recall(feature_extraction_dataframe):

    true_positive_count = 0 # when i predict the gold standard feature
    false_positives_count = 0 # when i predict something which isn't a gold standard feature
    false_negative_count = 0 # when i don't predict something which is a gold standard feature

    for i in range(len(feature_extraction_dataframe['Sentiment Bearing Features'])):
        for feature in feature_extraction_dataframe.loc[i, 'Sentiment Bearing Features']:
            if feature in feature_extraction_dataframe.loc[i, 'Gold Standard Feature']:
                true_positive_count += 1
            else:
                false_positives_count += 1
        for feature in feature_extraction_dataframe.loc[i, 'Gold Standard Feature']:
            if feature not in feature_extraction_dataframe.loc[i, 'Sentiment Bearing Features']:
                false_negative_count += 1

    precision = true_positive_count / (true_positive_count + false_positives_count)
    recall = true_positive_count / (true_positive_count + false_negative_count)

    return precision, recall

In [9]:
# Please change filepaths if you wish to re-run code

filepaths = ['C:/Users/samjp/OneDrive/Desktop/Products/Apex AD2600 Progressive-scan DVD player.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Canon G3.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Canon PowerShot SD500.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Canon S100.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Computer.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Creative Labs Nomad Jukebox Zen Xtra 40GB.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Diaper Champ.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Hitachi router.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/ipod.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Linksys Router.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/MicroMP3.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Nikon coolpix 4300.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Nokia 6600.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Nokia 6610.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/norton.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Router.txt',
             'C:/Users/samjp/OneDrive/Desktop/Products/Speaker.txt']

filenames = ['Apex AD2600 Progressive-scan DVD player',
             'Canon G3',
             'Canon PowerShot SD500',
             'Canon S100',
             'Computer',
             'Creative Labs Nomad Jukebox Zen Xtra 40GB',
             'Diaper Champ',
             'Hitachi router',
             'ipod',
             'Linksys Router',
             'MicroMP3',
             'Nikon coolpix 4300',
             'Nokia 6600',
             'Nokia 6610',
             'norton',
             'Router',
             'Speaker']

precision_recall_df = pd.DataFrame(columns=["Product", "Precision", "Recall"])

for i in range(len(filepaths)):
    extracted_df = feature_extraction(filepaths[i])
    product = filenames[i]
    precision, recall = precision_recall(extracted_df)
    precision_recall_df.loc[len(precision_recall_df.index)] = [product, precision, recall]
    
print(precision_recall_df)
print(precision_recall_df.mean(numeric_only=True))  # skip the non-numeric "Product" column

                                      Product  Precision    Recall
0     Apex AD2600 Progressive-scan DVD player   0.113475  0.149184
1                                    Canon G3   0.106061  0.269231
2                       Canon PowerShot SD500   0.074742  0.195946
3                                  Canon S100   0.109181  0.200913
4                                    Computer   0.111853  0.189266
5   Creative Labs Nomad Jukebox Zen Xtra 40GB   0.154362  0.298701
6                                Diaper Champ   0.077083  0.154812
7                              Hitachi router   0.183938  0.267925
8                                        ipod   0.038333  0.119792
9                              Linksys Router   0.066856  0.212670
10                                   MicroMP3   0.072711  0.134106
11                         Nikon coolpix 4300   0.155689  0.384236
12                                 Nokia 6600   0.153046  0.219616
13                                 Nokia 6610   0.200000  0.32



We can see that precision and recall are low for all products, with an average precision of 0.11 and recall of 0.21. This may be improved in further analysis by using approaches such as the infrequent feature identification and opinion word extraction methods described in Hu and Liu (2004, https://www.aaai.org/Papers/AAAI/2004/AAAI04-119.pdf).

## 6. Sentiment Analysis

Now that sentiment-bearing product features have been extracted from the dataset, I use a supervised learning approach (sklearn's Multinomial Naive Bayes algorithm) to determine whether sentiment is positive or negative. I use the Gold Standard sentiments provided and the cleaned sentence text to train the model, and apply this at a sentence level (i.e. without considering differing sentiment towards two different product features in the same sentence). Sentences annotated with both positive and negative features are therefore left unlabelled and excluded; handling them would be an interesting extension to explore in the future. I chose Naive Bayes as its bag-of-words approach tends to perform well in binary classification tasks, especially sentiment analysis.

A quirk of the dataset is that the Gold Standard features and sentiment do not necessarily match the extracted sentiment-bearing features (i.e. some extracted features are not labelled in the data and some labelled features are not extracted). In this case I only consider sentences with both extracted sentiment-bearing features and Gold Standard features and sentiment. This does somewhat reduce the size of the training/testing set; however, there is enough overlap to generate a reasonably sized dataset for all products.

The approach is applied to each of the 17 text files separately, with Laplace smoothing and an 80%/20% training/testing data split. Accuracy (the proportion of sentence sentiments correctly identified) is reported below.
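
To illustrate what Laplace smoothing does inside a multinomial Naive Bayes classifier, here is a minimal pure-Python sketch. It is not the sklearn implementation used below: the function names and the four training sentences are hypothetical, and tokenisation is a plain whitespace split rather than `CountVectorizer`'s default pattern.

```python
import math
from collections import Counter, defaultdict


def train_nb(sentences, labels, alpha=1.0):
    """Count documents per class and words per class for a bag-of-words model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for sentence, label in zip(sentences, labels):
        word_counts[label].update(sentence.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary, alpha


def predict_nb(model, sentence):
    """Return the class with the highest log prior plus smoothed log likelihood."""
    class_counts, word_counts, vocabulary, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in sentence.split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Laplace smoothing: add alpha to every count so no probability is zero
            smoothed = (word_counts[label][word] + alpha) / (
                total_words + alpha * len(vocabulary)
            )
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical cleaned sentences and sentence-level labels
train_sentences = ["great picture quality", "love player", "player stop work", "terrible remote"]
train_labels = ["pos", "pos", "neg", "neg"]

model = train_nb(train_sentences, train_labels, alpha=1.0)
print(predict_nb(model, "great player"))          # pos
print(predict_nb(model, "terrible player stop"))  # neg
```

Without smoothing, "great" would have zero probability under the negative class and the product of word probabilities would collapse to zero; adding `alpha` to each count keeps every class comparable.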

In [10]:
def sentiment_analyser_accuracy(feature_extraction_dataframe):

    feature_extraction_dataframe["Sentence sentiment"] = ""
    for i in range(len(feature_extraction_dataframe["Gold Standard Sentiment"])):
        if "pos" in feature_extraction_dataframe.loc[i, "Gold Standard Sentiment"] and "neg" not in feature_extraction_dataframe.loc[i, "Gold Standard Sentiment"]:
            feature_extraction_dataframe.loc[i, "Sentence sentiment"] = "pos"
        elif "neg" in feature_extraction_dataframe.loc[i, "Gold Standard Sentiment"] and "pos" not in feature_extraction_dataframe.loc[i, "Gold Standard Sentiment"]:
            feature_extraction_dataframe.loc[i, "Sentence sentiment"] = "neg"

    training_df = feature_extraction_dataframe.copy()
    training_df = training_df.drop(
        ["Raw text", "Sentence", "Gold Standard", "Gold Standard Sentiment Score", "Gold Standard Feature",
         "Gold Standard Sentiment", "Identified Features", "Sentiment Words", "Pruned Features",
         "Sentiment Bearing Features"], axis=1)

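    # Rows without a clear polarity were left blank above; mark them as missing so they can be dropped.
    # Assigning the result back avoids relying on an in-place change to a column selection.
    training_df['Sentence sentiment'] = training_df['Sentence sentiment'].replace('', np.nan)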
    training_df = training_df.dropna()
    training_df = training_df.reset_index(drop=True)

    training_data_size = math.ceil(len(training_df) * 0.8)
    training_data = training_df.sample(n=training_data_size, random_state=1) # Random state defined so results can be compared with different approaches
    testing_data = training_df.drop(training_data.index)
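    # Bag-of-words counts; the vocabulary is learned from the training sentences only
    vectorizer = CountVectorizer()
    train_X = vectorizer.fit_transform(training_data["Cleaned Sentence"])
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    train_X = pd.DataFrame(train_X.toarray(), columns=vectorizer.get_feature_names_out())
    train_y = training_data["Sentence sentiment"]
    model = MultinomialNB(alpha=1.0)  # alpha=1.0 corresponds to Laplace smoothing
    model.fit(train_X, train_y)

    # Test sentences are encoded with the training vocabulary; unseen words are ignored
    test_X = vectorizer.transform(testing_data["Cleaned Sentence"])
    test_X = pd.DataFrame(test_X.toarray(), columns=vectorizer.get_feature_names_out())
    test_y = testing_data["Sentence sentiment"]
    accuracy = model.score(test_X, test_y)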

    return accuracy
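
As a point of comparison, the same train/evaluate step can be sketched more compactly with a scikit-learn pipeline. This is an alternative sketch rather than the code used for the results in this report: the function name `naive_bayes_accuracy` and the toy sentences are invented, and because `train_test_split` shuffles differently from `DataFrame.sample`, it would not reproduce the exact figures reported here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def naive_bayes_accuracy(sentences, labels, test_size=0.2, seed=1):
    """Hypothetical helper: 80/20 split, bag-of-words counts, Laplace-smoothed Naive Bayes."""
    train_text, test_text, train_y, test_y = train_test_split(
        sentences, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # The pipeline fits the vocabulary on the training text only and keeps the counts sparse
    model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
    model.fit(train_text, train_y)
    return model.score(test_text, test_y)

# Invented sentences, for illustration only
toy_sentences = [
    "excellent picture quality", "great sound", "easy setup", "love the remote",
    "good value player", "poor picture quality", "terrible sound", "difficult setup",
    "hate the remote", "bad value player",
]
toy_labels = ["pos"] * 5 + ["neg"] * 5

accuracy = naive_bayes_accuracy(toy_sentences, toy_labels)
print(accuracy)
```

Stratifying the split keeps the proportion of positive and negative sentences similar in the training and testing sets, which may matter for the smaller product files.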

In [11]:
sentiment_accuracy_df = pd.DataFrame(columns=["Product", "Sentiment Accuracy"])

for i in range(len(filepaths)):
    extracted_df = feature_extraction(filepaths[i])
    product = filenames[i]
    accuracy = sentiment_analyser_accuracy(extracted_df)
    sentiment_accuracy_df.loc[len(sentiment_accuracy_df.index)] = [product, accuracy]
    
print(sentiment_accuracy_df)
# Average only the numeric column; the Product column holds strings
print(sentiment_accuracy_df["Sentiment Accuracy"].mean())

                                      Product  Sentiment Accuracy
0     Apex AD2600 Progressive-scan DVD player            0.882353
1                                    Canon G3            0.829787
2                       Canon PowerShot SD500            0.791667
3                                  Canon S100            0.764706
4                                    Computer            0.717391
5   Creative Labs Nomad Jukebox Zen Xtra 40GB            0.808511
6                                Diaper Champ            0.833333
7                              Hitachi router            0.578947
8                                        ipod            0.903226
9                              Linksys Router            0.702703
10                                   MicroMP3            0.750000
11                         Nikon coolpix 4300            0.741935
12                                 Nokia 6600            0.714286
13                                 Nokia 6610            0.803922
14        


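Naive Bayes provides a moderate degree of accuracy across these datasets, at 0.78 on average, although it varies considerably between products (from 0.58 for the Hitachi router to 0.90 for the ipod among the rows shown above). This is likely hampered by the number of excluded rows, and might improve if a larger dataset were used.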

## 7. Summary Generation

Now that sentiment bearing features have been extracted from the data and sentiment predicted for each sentence, summaries can be generated at a product level. In the previous step, I excluded sentences with sentiment bearing features but no corresponding Gold Standard feature and sentiment, so that accuracy figures could be reported correctly. The model is now applied to all sentences with sentiment bearing features so that summaries can be generated accurately. Importantly, the model is still trained on the same dataset as above, and data is vectorised based only on words appearing in cleaned sentences in the training set, so there is no data leakage from other data.
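
The point about vectorising with the training vocabulary only can be illustrated with a small sketch. The sentences are invented for illustration; the behaviour shown (words unseen at fit time are dropped by `transform`) is standard for `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences, for illustration only
train_sentences = ["player sound excellent", "battery life poor"]
new_sentences = ["screen excellent", "firmware buggy"]

vectorizer = CountVectorizer()
vectorizer.fit(train_sentences)  # vocabulary comes from the training sentences only

# New sentences are encoded against the fixed training vocabulary
new_X = vectorizer.transform(new_sentences)

print(sorted(vectorizer.vocabulary_))  # the six training words
print(new_X.toarray())  # "screen", "firmware" and "buggy" contribute nothing
```

The second new sentence shares no words with the training data, so its row is all zeros. For such a sentence the prediction rests on the class priors alone, which is worth keeping in mind when reading the summaries.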

Below I print out summaries for the 5 most frequent features for 2 products:

In [12]:
def sentiment_analyser_outputs(feature_extraction_dataframe):
    feature_extraction_dataframe["Sentence sentiment"] = ""
    for i in range(len(feature_extraction_dataframe["Gold Standard Sentiment"])):
        if "pos" in feature_extraction_dataframe.loc[i, "Gold Standard Sentiment"] and "neg" not in \
                feature_extraction_dataframe.loc[i, "Gold Standard Sentiment"]:
            feature_extraction_dataframe.loc[i, "Sentence sentiment"] = "pos"
        elif "neg" in feature_extraction_dataframe.loc[i, "Gold Standard Sentiment"] and "pos" not in \
                feature_extraction_dataframe.loc[i, "Gold Standard Sentiment"]:
            feature_extraction_dataframe.loc[i, "Sentence sentiment"] = "neg"

    training_df = feature_extraction_dataframe.copy()
    training_df = training_df.drop(
        ["Raw text", "Sentence", "Gold Standard", "Gold Standard Sentiment Score", "Gold Standard Feature",
         "Gold Standard Sentiment", "Identified Features", "Sentiment Words", "Pruned Features",
         "Sentiment Bearing Features"], axis=1)

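    # Rows without a clear polarity were left blank above; mark them as missing so they can be dropped.
    # Assigning the result back avoids relying on an in-place change to a column selection.
    training_df['Sentence sentiment'] = training_df['Sentence sentiment'].replace('', np.nan)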
    training_df = training_df.dropna()
    training_df = training_df.reset_index(drop=True)

    training_data_size = math.ceil(len(training_df) * 0.8)
    training_data = training_df.sample(n=training_data_size, random_state=1)
    vectorizer = CountVectorizer()
    train_X = vectorizer.fit_transform(training_data["Cleaned Sentence"])
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    train_X = pd.DataFrame(train_X.toarray(), columns=vectorizer.get_feature_names_out())
    train_y = training_data["Sentence sentiment"]
    model = MultinomialNB(alpha=1.0)
    model.fit(train_X, train_y)

    output_df = feature_extraction_dataframe.copy()
    output_df = output_df.drop(
        ["Raw text", "Gold Standard", "Gold Standard Sentiment Score", "Gold Standard Feature",
         "Gold Standard Sentiment", "Identified Features", "Sentiment Words", "Pruned Features"
            , "Sentence sentiment"], axis=1)

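    # Mark sentences with no sentiment bearing features as missing so they are dropped below
    for i in range(len(output_df["Sentiment Bearing Features"])):
        if output_df.loc[i, "Sentiment Bearing Features"] == []:
            output_df.loc[i, "Sentiment Bearing Features"] = np.nan  # np.NaN alias was removed in NumPy 2.0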
    output_df = output_df.dropna()
    output_df = output_df.reset_index(drop=True)

    test_X = vectorizer.transform(output_df["Cleaned Sentence"])
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    test_X = pd.DataFrame(test_X.toarray(), columns=vectorizer.get_feature_names_out())
    output_df["Predicted Sentiment"] = model.predict(test_X)

    return output_df

def generate_summary(product_name, outputs, top_x_features=5, number_sentences=5):
    feature_list = []
    pos_review_list = []
    neg_review_list = []
    pos_sentences = []
    neg_sentences = []
    for i in range(len(outputs)):
        for feature in outputs.loc[i, "Sentiment Bearing Features"]:
            if feature in feature_list:
                if outputs.loc[i, "Predicted Sentiment"] == "pos":
                    pos_review_list[feature_list.index(feature)] += 1
                    pos_sentences[feature_list.index(feature)].append(str(outputs.loc[i, "Sentence"]))
                else:
                    neg_review_list[feature_list.index(feature)] += 1
                    neg_sentences[feature_list.index(feature)].append(str(outputs.loc[i, "Sentence"]))

            else:
                feature_list.append(feature)
                if outputs.loc[i, "Predicted Sentiment"] == "pos":
                    pos_review_list.append(1)
                    neg_review_list.append(0)
                    pos_sentences.append([outputs.loc[i, "Sentence"]])
                    neg_sentences.append([])
                else:
                    neg_review_list.append(1)
                    pos_review_list.append(0)
                    neg_sentences.append([outputs.loc[i, "Sentence"]])
                    pos_sentences.append([])

    data_tuples = list(zip(feature_list, pos_review_list, neg_review_list, pos_sentences, neg_sentences))
    summary_df = pd.DataFrame(data_tuples,
                              columns=["Features", "Positive", "Negative", "Positive Sentences", "Negative Sentences"])
    summary_df["Total Reviews"] = summary_df["Positive"] + summary_df["Negative"]
    summary_df = summary_df.sort_values(by="Total Reviews", ascending=False)
    summary_df = summary_df.reset_index(drop=True)

    print(f"Product Name: {product_name}")
    for i in range(min(len(summary_df), top_x_features)):
        print("\n")
        print(f"Feature: {summary_df.loc[i, 'Features']}")
        print("\n")
        print(f"Positive: {summary_df.loc[i, 'Positive']}")
        for j in range(min(summary_df.loc[i, 'Positive'], number_sentences)):
            print(f"- {summary_df.loc[i, 'Positive Sentences'][j]}")
        print("\n")
        print(f"Negative: {summary_df.loc[i, 'Negative']}")
        for j in range(min(summary_df.loc[i, 'Negative'], number_sentences)):
            print(f"- {summary_df.loc[i, 'Negative Sentences'][j]}")
        print("\n")
    print("\n")
    print("\n")
    
    return ""

In [13]:
for i in range(5,7):
    extracted_df = feature_extraction(filepaths[i])
    product = filenames[i]
    outputs = sentiment_analyser_outputs(extracted_df)
    generate_summary(product, outputs, top_x_features=5, number_sentences=5)

Product Name: Creative Labs Nomad Jukebox Zen Xtra 40GB


Feature: player


Positive: 85
- like it 's predecessor , the quickly revised nx , this player boasts a decent size and weight , a relatively-intuitive navigational system that categorizes based on id3 tags , and excellent sound ( widely known to be better than ipod - not surprising considering the number of years creative has been in the audio peripheral business ) . 
- they player 's interface itself is also very easy to use . 
- i was a little concerned to be the black sheep buying this player instead of the incredibly overpriced apple i-pod . 
- much cheaper than i-pod good looking player ( beautiful blue back-lit screen ) if you 've read about the player , some have complained about the lack of a viewing hole for the face when the case is on , but this is good because the face does n't get damaged / scratched fast transfer rate 
- the creative labs zen xtra has all the features the i-pod has and if you get if from amazon yo

The loop above can be changed to for i in range(0,17) to display summaries for all 17 products. The number of features displayed can be changed with top_x_features (features are always shown from most to least frequent), and the number of positive and negative sentences printed per feature can be changed with number_sentences.
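
The per-feature counting done inside generate_summary could also be sketched with pandas grouping instead of parallel lists. This is an alternative sketch under stated assumptions: the function name `count_feature_sentiment` and the toy rows are invented, and the column names follow those used in this report.

```python
import pandas as pd

def count_feature_sentiment(outputs):
    """Hypothetical helper: count positive and negative sentences per feature."""
    # One row per (sentence, feature) pair
    exploded = outputs.explode("Sentiment Bearing Features")
    counts = (
        exploded.groupby(["Sentiment Bearing Features", "Predicted Sentiment"])
        .size()
        .unstack(fill_value=0)
        # Make sure both columns exist even if one polarity never occurs
        .reindex(columns=["pos", "neg"], fill_value=0)
    )
    counts["Total Reviews"] = counts["pos"] + counts["neg"]
    return counts.sort_values("Total Reviews", ascending=False)

# Invented rows, for illustration only
toy = pd.DataFrame({
    "Sentiment Bearing Features": [["player", "sound"], ["player"], ["battery"]],
    "Predicted Sentiment": ["pos", "neg", "neg"],
})

summary = count_feature_sentiment(toy)
print(summary)
```

Here "player" appears in one positive and one negative sentence, so it ranks first with two reviews in total.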

## 8. Conclusions

In this report I have presented an approach for extracting sentiment bearing features from raw text, pruning the features with a variety of tools, predicting the sentiment of sentences using supervised learning and providing summary outputs from the data. Throughout I have used precision and recall to evaluate feature extraction, and accuracy to evaluate sentiment prediction.

In the future I could look to develop this approach further. While I am confident that Naive Bayes is a good choice of algorithm for the task, the feature extraction step could be further refined with additional pruning steps or by merging features with similar meanings, perhaps using semantic modelling. Additionally, a more sophisticated approach for identifying sentiment words could be taken (rather than adjective extraction). For example, dependency parsing could identify references to particular features, which could then be extracted for further analysis.