# Opinion Mining and Sentiment Analysis - Exercises

**Text Mining unit**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Setup

### Import libraries

In [35]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from statsmodels.stats.contingency_tables import mcnemar
import os
from urllib.request import urlretrieve
import glob
import gzip
import json
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Download utility function

In [36]:
# Download a file from a URL, skipping files that already exist locally
def download(file, url):
    if not os.path.isfile(file):
        urlretrieve(url, file)

### Amazon Reviews Datasets Downloading

In [37]:
# Dataset filenames
DICT_DATASET = "reviews_Clothing_Shoes_and_Jewelry.json.gz"

# Dataset downloading
download(DICT_DATASET, "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Clothing_Shoes_and_Jewelry_5.json.gz")
download('All_Beauty.json.gz','http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz')
download("positive-words.txt", "https://raw.githubusercontent.com/unibodatascience/BBS-TextMining/3ad6643b698f652f200dfbf463a3cb49de8c0e9f/05%20-%20Opinion%20Mining%20with%20Python%20(part%201)/data/positive-words.txt")
download("negative-words.txt", "https://raw.githubusercontent.com/unibodatascience/BBS-TextMining/3ad6643b698f652f200dfbf463a3cb49de8c0e9f/05%20-%20Opinion%20Mining%20with%20Python%20(part%201)/data/negative-words.txt")

In [38]:
# Check if the files have been successfully downloaded
print(glob.glob("*"))

['negative-words.txt', 'All_Beauty.json.gz', 'positive-words.txt', 'reviews_Clothing_Shoes_and_Jewelry.json.gz', 'sample_data']


Within the exercises you will also make use of the Hu and Liu sentiment lexicon: run the following to load sets of positive and negative words.

In [39]:
def scan_hu_liu(f):
    for line in f:
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

def load_hu_liu(filename):
    with open(filename, "rb") as f:
        return set(scan_hu_liu(f))

hu_liu_pos = load_hu_liu("positive-words.txt")
hu_liu_neg = load_hu_liu("negative-words.txt")
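
The loader above keeps one word per line and drops blank lines and lines starting with `;`. The sketch below illustrates this on a made-up in-memory sample; the sample content and the name `scan_lexicon` are assumptions for illustration, not the real lexicon files.

```python
import io

def scan_lexicon(lines):
    """Yield one word per non-empty line, skipping ';' comment lines."""
    for raw in lines:
        word = raw.decode(errors="ignore").strip()
        if word and not word.startswith(";"):
            yield word

# Hypothetical content mimicking the layout the loader expects
sample = io.BytesIO(b"; header comment\n;\n\ngood\ngreat\n")
toy_words = set(scan_lexicon(sample))
print(sorted(toy_words))  # ['good', 'great']
```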

### Dataset loading

In [40]:
total_samples = 5000

In [41]:
def load_dataset(dataset_path):
    # We skip 3-star reviews, as they can confuse the model when
    # discriminating opinion polarity
    pos_overall_values = {5.0, 4.0}
    neg_overall_values = {1.0, 2.0}
    
    # Data structures
    data = []
    overall = []
    
    # Data loading: read all the dataset
    print("Loading json file...")
    
    # Build a balanced sample: keep at most total_samples / 2 positive and
    # total_samples / 2 negative reviews, skipping 3-star ones. For each review
    # kept, append its label (1 = positive, 0 = negative) to "overall".
    with gzip.open(dataset_path) as jsonfile:
        index = 0
        pos = 0
        neg = 0
        # Each line in the json file represents a review
        for line in jsonfile:
            review = json.loads(line)
            # Keep positive reviews until the positive quota is filled
            if review['overall'] in pos_overall_values and pos < int(total_samples / 2):
                index += 1
                # We keep track of the number of positive reviews read
                pos += 1
                # Review appending to our dataset
                data.append(review)
                # Label appending
                overall.append(1)
                if index >= total_samples:
                    break
            elif review['overall'] in neg_overall_values and neg < int(total_samples / 2):
                index += 1
                # We keep track of the number of negative reviews read
                neg += 1
                data.append(review)
                # Label appending
                overall.append(0)
                # We stop reading if we reached the maximum number of samples
                if index >= total_samples:
                    break
    
    # Select only the review text for each review object
    reviewtext = [value['reviewText'] for value in data]
    
    return reviewtext, overall

In [42]:
texts, labels  = load_dataset(DICT_DATASET)

Loading json file...


### Exercises

**1)** Create a pandas DataFrame containing the provided data (texts and labels).

In [43]:
data = pd.DataFrame({"text": texts, "label": labels})
data

| | text | label |
|---|---|---|
| 0 | This is a great tutu and at a really great pri... | 1 |
| 1 | I bought this for my 4 yr old daughter for dan... | 1 |
| 2 | What can I say... my daughters have it in oran... | 1 |
| 3 | We bought several tutus at once, and they are ... | 1 |
| 4 | Thank you Halo Heaven great product for Little... | 1 |
| ... | ... | ... |
| 4995 | way to small, if you have a larger butt like i... | 0 |
| 4996 | I was hoping 10M would fit me, I'm 150-lbs and... | 0 |
| 4997 | I've gained weight over the last several month... | 0 |
| 4998 | I'm giving one star! All I've owned in flip fl... | 0 |


In [44]:
len(data)

5000

**2)** Then, randomly split the loaded dataset into a training set and a test set
* using a hold-out approach (e.g. the training set contains 80% of the dataset reviews)
* and keeping the number of reviews per label balanced in both sets.

In [45]:
# Scikit-learn train/test split
train_x, test_x, train_y, test_y = train_test_split(data["text"], data["label"], test_size=0.2, random_state=42, stratify=data["label"])

**3)** Finally, verify the label distribution over the training and test sets.

In [46]:
train_y.value_counts()

1    2000
0    2000
Name: label, dtype: int64

In [47]:
test_y.value_counts()

1    500
0    500
Name: label, dtype: int64
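
The effect of `stratify` can be checked on a small made-up dataset. This is a minimal sketch: the toy texts and the `toy_` names are assumptions for illustration. With 10 reviews per label and `test_size=0.2`, each split should keep the two labels in equal numbers.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 10 positive and 10 negative placeholder reviews
toy_texts = [f"review {i}" for i in range(20)]
toy_labels = [1] * 10 + [0] * 10

toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)

# Count the labels in each split
train_counts = Counter(toy_train_y)
test_counts = Counter(toy_test_y)
print(train_counts, test_counts)
```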

**4)** A lexicon with sets of commonly used positive and negative words is provided in the variables `hu_liu_pos`
and `hu_liu_neg`, respectively.

Classify the reviews in the test set by first assigning to
each a score equal to the number of known positive words within it minus the number of negative words,
then return `1` for reviews with a positive score and `0` for reviews with a negative or null score.

To compute the score, add 1 for each positive word and -1 for each negative word; when the word is preceded by "very", add 2 or -2 instead.

Finally, evaluate the classification accuracy, i.e. the ratio between the number of correctly classified reviews and
the total count of test reviews.
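
The scoring rule can be worked through on a toy case. This is a minimal sketch under stated assumptions: the two tiny word sets stand in for the Hu and Liu lexicon, and whitespace splitting stands in for a real tokenizer.

```python
# Toy stand-ins for the real lexicon (assumption, for illustration only)
TOY_POSITIVE = {"good", "great"}
TOY_NEGATIVE = {"bad", "poor"}

def toy_score(tokens):
    """Sum +1/-1 per opinion word, doubled when the previous token is 'very'."""
    score = 0
    for i, tok in enumerate(tokens):
        weight = 2 if i > 0 and tokens[i - 1] == "very" else 1
        if tok in TOY_POSITIVE:
            score += weight
        elif tok in TOY_NEGATIVE:
            score -= weight
    return score

def toy_label(tokens):
    """Return 1 for a positive score, 0 for a negative or null score."""
    return 1 if toy_score(tokens) > 0 else 0

tokens = "very good fit but bad colour".split()
print(toy_score(tokens), toy_label(tokens))  # 1 1  ("very good" = +2, "bad" = -1)
```

A review whose positive and negative words cancel out gets a null score and is therefore labelled `0`.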

In [48]:
def sentiment_label(text):
    words = nltk.word_tokenize(text)
    score = 0

    for i, word in enumerate(words):
        # An opinion word counts double when preceded by "very"
        weight = 2 if i > 0 and words[i - 1] == "very" else 1

        # Negation is not handled: "not good" still scores +1
        if word in hu_liu_pos:
            score += weight
        # ... and "not bad" still scores -1
        elif word in hu_liu_neg:
            score -= weight

    # Positive score -> 1; negative or null score -> 0
    return 1 if score > 0 else 0

In [15]:
lexicon_label = test_x.apply(sentiment_label)
print(lexicon_label)

3717    1
1967    1
3943    1
1849    1
1576    1
       ..
562     1
512     1
3108    1
793     1
130     1
Name: text, Length: 1000, dtype: int64


In [49]:
np.mean(lexicon_label == test_y) # True is converted to 1 and False to 0

0.652

**5)** Create a bag-of-words (term count) vector space model from the training reviews, excluding words that appear in fewer than 5 documents, and extract the document-term matrix for them

In [50]:
vect = CountVectorizer(min_df=5)
train_dtm = vect.fit_transform(train_x)
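
To see what `min_df` does, the sketch below fits a `CountVectorizer` on four made-up documents with `min_df=2`, so that only words occurring in at least two documents survive. The toy corpus and the `toy_` names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good shoes", "good fit", "bad fit", "bad colour"]

# Keep only the words that appear in at least 2 documents
toy_vect = CountVectorizer(min_df=2)
toy_dtm = toy_vect.fit_transform(toy_docs)

toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)      # ['bad', 'fit', 'good']
print(toy_dtm.shape)  # (4, 3): one row per document, one column per kept word
```

"shoes" and "colour" each occur in a single document, so they are dropped from the vocabulary.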

**6)** Train a Bernoulli Naive Bayes classifier on the training reviews, using the representation created above

In [51]:
model = BernoulliNB(binarize=0.0)
model.fit(train_dtm, train_y);
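
With `binarize=0.0`, `BernoulliNB` thresholds its input so that any count above zero becomes 1; only the presence of a word matters, not how often it occurs. A minimal sketch on a made-up count matrix (all values here are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy document-term counts (4 documents, 3 terms) and their labels
toy_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 1, 0]])
toy_y = np.array([1, 1, 0, 0])
toy_presence = (toy_counts > 0).astype(int)

# Model fed raw counts, binarized internally at threshold 0.0
nb_counts = BernoulliNB(binarize=0.0).fit(toy_counts, toy_y)
# Model fed an already binary matrix (binarize=None skips the thresholding)
nb_binary = BernoulliNB(binarize=None).fit(toy_presence, toy_y)

same_probabilities = np.allclose(
    nb_counts.predict_proba(toy_counts),
    nb_binary.predict_proba(toy_presence),
)
print(same_probabilities)  # True
```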

**7)** Verify the accuracy of the classifier on the test set

In [52]:
test_dtm = vect.transform(test_x)
model.score(test_dtm, test_y)

0.828

**8)** Repeat steps 5 to 7, this time using a `TfidfVectorizer` and a Multinomial NB model

In [53]:
vect = TfidfVectorizer(min_df=5)
train_dtm = vect.fit_transform(train_x)

In [54]:
model = MultinomialNB()
model.fit(train_dtm, train_y);

In [55]:
test_dtm = vect.transform(test_x)
model.score(test_dtm, test_y)

0.873

**9)** Repeat steps 5 to 7, this time using a `TfidfVectorizer` that also includes bigrams and trigrams as features, and a Multinomial NB model as classifier.

In [56]:
vect = TfidfVectorizer(min_df=5, ngram_range=(1, 3))
train_dtm = vect.fit_transform(train_x)

In [57]:
model = MultinomialNB()
model.fit(train_dtm, train_y);

In [58]:
test_dtm = vect.transform(test_x)
model.score(test_dtm, test_y)

0.884
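
As an illustration of what `ngram_range=(1, 3)` adds to the feature space, the sketch below lists the features extracted from a single made-up sentence (the sentence and the `toy_` names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams, bigrams and trigrams, as in the exercise above
toy_ngram_vect = TfidfVectorizer(ngram_range=(1, 3))
toy_ngram_vect.fit(["not very good"])

toy_features = sorted(toy_ngram_vect.vocabulary_)
print(toy_features)
# ['good', 'not', 'not very', 'not very good', 'very', 'very good']
```

Multi-word features such as "not very good" let the model see some local context that single words cannot capture.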

**10)** Complete the following tasks:

1. Use the code cell below to load a new dataset into `beauty_df`, containing Amazon reviews of beauty products.

2. Then, create a new pandas DataFrame named `beauty_data` by selecting only the `reviewText` and `overall` columns.

3. Add a `label` column to the DataFrame whose value is `1` for reviews with 4 or 5 stars and `0` for reviews with 3 stars or fewer

4. Test the previous Multinomial NB model (used in point **9**) by calculating its mean accuracy on the beauty reviews contained in `beauty_data`

In [59]:
def parse(path):
    # Each line of the gzipped file is one JSON-encoded review
    with gzip.open(path, 'r') as g:
        for line in g:
            yield json.loads(line)

def getDF(path):
    # Index the reviews by position, then draw a random sample of 10000 rows
    df = {i: d for i, d in enumerate(parse(path))}
    return pd.DataFrame.from_dict(df, orient='index').sample(10000)

# Loading data into a pandas dataframe
beauty_df = getDF('All_Beauty.json.gz')
# Cast every review text to a string, so that missing values do not break the vectorizer
beauty_df["reviewText"] = beauty_df["reviewText"].apply(lambda x: np.str_(x))
beauty_df.head() # print first 5 entries

| | overall | verified | reviewTime | reviewerID | asin | reviewerName | reviewText | summary | unixReviewTime | vote | style | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 210251 | 5.0 | True | 03 19, 2016 | A10GNWWQYJ1WTA | B015C1185C | Isaidy collado | I apply it two days ago in my hand and its sti... | Exactly what I was looking for | 1458345600 | | | |
| 35786 | 1.0 | True | 01 29, 2014 | A26MUUA3U50QSQ | B000GLRREU | Joan | I have used Waterpiks for years but this new m... | HANDLE KEEPS LEAKING AFTER A FEW MONTHS. | 1390953600 | 2.0 | {'Size:': ' Ultra'} | |
| 178350 | 1.0 | True | 10 25, 2016 | A3OT1PB7LD2ISA | B00PJ5CSWY | Amazon Customer | Sticky | One Star | 1477353600 | | | |
| 230902 | 5.0 | True | 06 26, 2016 | A2EEWDFRHLT8FJ | B01BCR38IU | Bezhodeer | EXCELLENT! Covers the gray, doesn't rub off. | Five Stars | 1466899200 | | {'Format:': ' Health and Beauty'} | |
| 189006 | 1.0 | True | 12 8, 2016 | A13PRP3GY92NRA | B00VF344X0 | Ann | You get waht you pay for. Strong chemical smel... | One Star | 1481155200 | | | |


In [60]:
beauty_data = beauty_df[["reviewText", 'overall']].copy().reset_index() # select only the text and the score of each review
beauty_data

| | index | reviewText | overall |
|---|---|---|---|
| 0 | 210251 | I apply it two days ago in my hand and its sti... | 5.0 |
| 1 | 35786 | I have used Waterpiks for years but this new m... | 1.0 |
| 2 | 178350 | Sticky | 1.0 |
| 3 | 230902 | EXCELLENT! Covers the gray, doesn't rub off. | 5.0 |
| 4 | 189006 | You get waht you pay for. Strong chemical smel... | 1.0 |
| ... | ... | ... | ... |
| 9995 | 1488 | It is a solid, well-made product. Non-slip wit... | 5.0 |
| 9996 | 250975 | Like product. Ship time overly long | 4.0 |
| 9997 | 191306 | Absolutely fabulous! It cleans deep down with ... | 5.0 |
| 9998 | 102191 | The color is beautiful, but it clumps on my li... | 2.0 |


In [61]:
beauty_data["label"] = np.where(beauty_data["overall"] >= 4, 1, 0)

In [62]:
beauty_dtm = vect.transform(beauty_data["reviewText"])

In [63]:
model.score(beauty_dtm, beauty_data["label"])

0.8126

**11)** Perform the following tasks:
1. Get the predictions on `beauty_data` from the latest MultinomialNB model and from the first, unsupervised model (based on the opinion word lists)
2. Finally, compare the two models with a McNemar test, using the `mcnemar_pvalue` function provided below. Call the function with the two models' predictions as arguments (e.g. `mcnemar_pvalue(preds1, preds2)`).
  * This function returns the p-value of the test, whose null hypothesis is that the two models have the same proportion of errors
  * The p-value is the probability of observing a difference between the two models at least as large as the measured one if the null hypothesis were true. It is not the probability that the models have the same proportion of errors: a small p-value means the observed difference is unlikely to be due to chance alone, while a large one means the data are compatible with the null hypothesis.
  * E.g. a p-value close to 0 indicates that the difference between the two proportions of errors is statistically significant.
3. Decide whether the two models are similar
  * Set a confidence level of 0.95
  * Check if `p-value > (1 - confidence level)`; if so, the test gives no evidence that the two models differ
  

In [66]:
def mcnemar_pvalue(model1_predictions, model2_predictions):
    # define contingency table
    table = pd.crosstab(model1_predictions, model2_predictions)
    print(table)

    # calculate mcnemar test
    result = mcnemar(table)
    return result.pvalue

In [64]:
unsupervised_model_predictions = beauty_data["reviewText"].apply(sentiment_label)

In [65]:
supervised_model_predictions = model.predict(beauty_dtm)

In [67]:
# interpret the p-value
confidence_level = 0.95
alpha = 1 - confidence_level

pvalue = mcnemar_pvalue(supervised_model_predictions, unsupervised_model_predictions)
print("P-Value: " + str(pvalue))

if pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
else:
    print('Different proportions of errors (reject H0)')

reviewText     0     1
row_0                 
0           1757  1547
1           1901  4795
P-Value: 1.779659287245867e-09
Different proportions of errors (reject H0)
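
Note that the table printed above cross-tabulates the two models' raw predictions, so it compares how often each model predicts each class. When McNemar's test is used to compare error proportions, the table is usually built from per-review correctness instead (model A right or wrong against model B right or wrong). A minimal sketch of that variant follows, assuming the true labels are available; the function names and the toy data are assumptions for illustration, and the p-value is computed with the exact binomial form of the test using only the standard library.

```python
from fractions import Fraction
from math import comb

def correctness_table(y_true, preds_a, preds_b):
    """2x2 table: rows = model A correct/wrong, columns = model B correct/wrong."""
    table = [[0, 0], [0, 0]]
    for y, a, b in zip(y_true, preds_a, preds_b):
        table[int(a != y)][int(b != y)] += 1
    return table

def mcnemar_exact_pvalue(table):
    """Two-sided exact McNemar p-value, based on the two discordant cells only."""
    only_a_correct, only_b_correct = table[0][1], table[1][0]
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the models never disagree on correctness
    tail = sum(comb(n, k) for k in range(min(only_a_correct, only_b_correct) + 1))
    return float(min(Fraction(1), Fraction(2 * tail, 2 ** n)))

# Toy labels and predictions (made up for illustration)
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_preds_a = [1, 1, 1, 1, 0, 0, 0, 1]
toy_preds_b = [0, 0, 0, 1, 1, 1, 0, 0]

toy_table = correctness_table(toy_true, toy_preds_a, toy_preds_b)
toy_pvalue = mcnemar_exact_pvalue(toy_table)
print(toy_table)   # [[2, 5], [1, 0]]
print(toy_pvalue)  # 0.21875
```

A table built this way can also be passed to the `mcnemar` function from statsmodels that the notebook already imports.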
