# Opinion Mining and Sentiment Analysis - Exercises

**Text Mining unit**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Setup

### Import libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from statsmodels.stats.contingency_tables import mcnemar
import os
from urllib.request import urlretrieve
import glob
import gzip
import json
import nltk
nltk.download("punkt")

### Download utility function

In [None]:
# Download a file from a URL, unless it already exists locally
def download(file, url):
    if not os.path.isfile(file):
        urlretrieve(url, file)

### Dataset download

In [None]:
# Dataset filenames
DICT_DATASET = "reviews_Clothing_Shoes_and_Jewelry.json.gz"

# Dataset downloading
download(DICT_DATASET, "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Clothing_Shoes_and_Jewelry_5.json.gz")
download('All_Beauty.json.gz','http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/All_Beauty.json.gz')
download("positive-words.txt", "https://raw.githubusercontent.com/unibodatascience/BBS-TextMining/3ad6643b698f652f200dfbf463a3cb49de8c0e9f/05%20-%20Opinion%20Mining%20with%20Python%20(part%201)/data/positive-words.txt")
download("negative-words.txt", "https://raw.githubusercontent.com/unibodatascience/BBS-TextMining/3ad6643b698f652f200dfbf463a3cb49de8c0e9f/05%20-%20Opinion%20Mining%20with%20Python%20(part%201)/data/negative-words.txt")

In [None]:
# Check if the files have been successfully downloaded
print(glob.glob("*"))

Within the exercises you will also make use of the Hu and Liu sentiment lexicon: run the following to load sets of positive and negative words.

In [None]:
def scan_hu_liu(f):
    for line in f:
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

def load_hu_liu(filename):
    with open(filename, "rb") as f:
        return set(scan_hu_liu(f))

hu_liu_pos = load_hu_liu("positive-words.txt")
hu_liu_neg = load_hu_liu("negative-words.txt")

### Dataset loading

In [None]:
total_samples = 5000

In [None]:
def load_dataset(dataset_path):
    # 3-star ratings are skipped because they can confuse the model when
    # discriminating opinion polarity
    pos_overall_values = {5.0, 4.0}
    neg_overall_values = {1.0, 2.0}
    
    # Data structures
    data = []
    overall = []
    
    # Data loading: read the dataset until enough reviews are collected
    print("Loading json file...")
    
    # Reading dataset: we build our dataset by selecting only the reviews
    # that are not rated 3 stars, keeping at most total_samples / 2 reviews
    # per class. For each review added to our dataset we add its label
    # to "overall".
    with gzip.open(dataset_path) as jsonfile:
        index = 0
        pos = 0
        neg = 0
        # Each line in the json file represents a review
        for line in jsonfile:
            review = json.loads(line)
            # Positive review: keep it unless enough positive reviews were already read
            if review['overall'] in pos_overall_values and pos < int(total_samples / 2):
                index += 1
                # We keep track of the number of positive reviews read
                pos += 1
                # Review appending to our dataset
                data.append(review)
                # Label appending
                overall.append(1)
                if index >= total_samples:
                    break
            elif review['overall'] in neg_overall_values and neg < int(total_samples / 2):
                index += 1
                # We keep track of the number of negative reviews read
                neg += 1
                data.append(review)
                # Label appending
                overall.append(0)
                # We stop reading if we reached the maximum number of samples
                if index >= total_samples:
                    break
    
    # Select only the review text for each review object
    reviewtext = [value['reviewText'] for value in data]
    
    return reviewtext, overall

In [None]:
texts, labels = load_dataset(DICT_DATASET)

## Exercises

**1)** Create a pandas DataFrame containing the provided data (`texts` and `labels`).

**2)** Then, randomly split the loaded dataset into a training set and a test set
* using a hold-out approach (e.g. the training set contains 90% of the dataset reviews)
* and keeping the two sets balanced by label (stratified split).

**3)** Afterwards, verify the label distribution over the training and test sets.
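As a rough illustration of steps 1 to 3, the sketch below builds a DataFrame and performs a stratified hold-out split. It runs on a toy stand-in dataset: `toy_texts`, `toy_labels`, the column names and `random_state=42` are assumptions made for the example, not part of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's `texts` and `labels` (hypothetical data)
toy_texts = [f"review {i}" for i in range(20)]
toy_labels = [1] * 10 + [0] * 10

# One row per review, with its text and its polarity label
toy_df = pd.DataFrame({"text": toy_texts, "label": toy_labels})

# Hold-out split: 90% training, 10% test. `stratify` keeps the label
# proportions equal in both parts; `random_state` makes the split reproducible.
train_df, test_df = train_test_split(
    toy_df, test_size=0.1, stratify=toy_df["label"], random_state=42
)

# Label distribution in each split
train_counts = train_df["label"].value_counts()
test_counts = test_df["label"].value_counts()
print(train_counts)
print(test_counts)
```

Passing the label column to `stratify` is one way to keep both splits balanced; without it, a random split of a balanced dataset is only approximately balanced.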

**4)** A lexicon with sets of commonly used positive and negative words is provided in the variables `hu_liu_pos`
and `hu_liu_neg`, respectively.

Classify the reviews in the test set by first assigning to
each a score equal to the number of known positive words within it minus the number of negative words,
then return `1` for reviews with a positive score and `0` for reviews with a negative or null score.

To calculate the score, add `1` for each positive word and `-1` for each negative word, and add `2` or `-2` for each positive or negative word, respectively, that is preceded by the word "very".

Finally, evaluate the classification accuracy, i.e. the ratio between the number of correctly classified reviews and
the total number of test reviews.
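The scoring rule can be sketched on a tiny example as follows. This is only one possible reading: it assumes that a word preceded by "very" counts `2` (or `-2`) instead of `1` (or `-1`), uses a simple regex tokenizer rather than NLTK, and relies on hypothetical toy lexicons (`toy_pos`, `toy_neg`) in place of the Hu and Liu word sets.

```python
import re

# Tiny stand-in lexicons (hypothetical); the notebook loads the real ones
toy_pos = {"good", "great", "comfortable"}
toy_neg = {"bad", "poor", "cheap"}

def lexicon_score(text, pos_words, neg_words):
    # Lowercase the text and split it into alphabetic tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        # Assumption: a word right after "very" weighs 2 instead of 1
        weight = 2 if i > 0 and tokens[i - 1] == "very" else 1
        if token in pos_words:
            score += weight
        elif token in neg_words:
            score -= weight
    return score

def lexicon_predict(text, pos_words, neg_words):
    # 1 for a strictly positive score, 0 for a negative or null score
    return 1 if lexicon_score(text, pos_words, neg_words) > 0 else 0

toy_reviews = [
    "Very good shoes, great fit",
    "Poor quality and very cheap fabric",
    "It arrived on Monday",
]
toy_truth = [1, 0, 1]

toy_preds = [lexicon_predict(r, toy_pos, toy_neg) for r in toy_reviews]
# Accuracy: correctly classified reviews over the total number of reviews
accuracy = sum(p == t for p, t in zip(toy_preds, toy_truth)) / len(toy_truth)
print(toy_preds, accuracy)
```

The third toy review contains no lexicon word, so its score is null and it is classified as `0`, which shows how reviews without opinion words fall back to the negative class under this rule.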

**5)** Create a TF-IDF vector space model from the training reviews, excluding words that appear in fewer than 5 documents, and extract the document-term matrix for the training reviews.

**6)** Train a Bernoulli Naive Bayes classifier on the training reviews, using the representation created above.

**7)** Verify the accuracy of the classifier on the test set.
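A minimal sketch of steps 5 to 7 on toy data is given below. The toy corpus is hypothetical, and `min_df=2` is used only because the corpus is tiny; the exercise asks for a threshold of 5 documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy corpus standing in for the training and test reviews
toy_train = [
    "great shoes great fit",
    "great quality",
    "poor quality",
    "poor fit poor fabric",
]
toy_train_labels = [1, 1, 0, 0]
toy_test = ["great fit", "poor fabric quality"]
toy_test_labels = [1, 0]

# min_df=2 drops the terms found in fewer than 2 documents
vectorizer = TfidfVectorizer(min_df=2)
# Learn vocabulary and IDF weights on the training set only
X_train = vectorizer.fit_transform(toy_train)
# Reuse the same vocabulary and weights for the test set
X_test = vectorizer.transform(toy_test)

# BernoulliNB binarizes its input by default, so it only looks at
# whether a term occurs in a document, not at its TF-IDF weight
model = BernoulliNB()
model.fit(X_train, toy_train_labels)

# Mean accuracy on the test documents
accuracy = model.score(X_test, toy_test_labels)
print(X_train.shape, accuracy)
```

Calling `fit_transform` on the training set and only `transform` on the test set keeps test documents from influencing the vocabulary and the IDF weights.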

**8)** Repeat steps 5 to 7, this time using a TF-IDF vectorizer and a Multinomial NB model.

**9)** Repeat steps 5 to 7, this time using a TF-IDF vectorizer that also includes bigrams and trigrams as features, and a Multinomial NB model as classifier.
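To illustrate how n-gram features enter the model, the sketch below chains a TF-IDF vectorizer using `ngram_range=(1, 3)` with a Multinomial NB classifier. The toy sentences and the use of `make_pipeline` are assumptions made for the example; the exercise can equally be solved with separate vectorizer and classifier objects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus where word order carries the polarity
toy_train = [
    "not good at all",
    "really good shoes",
    "not worth the price",
    "really worth the price",
]
toy_train_labels = [0, 1, 0, 1]

# ngram_range=(1, 3) extracts unigrams, bigrams and trigrams
ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
ngram_model = make_pipeline(ngram_vectorizer, MultinomialNB())
ngram_model.fit(toy_train, toy_train_labels)

# The learned vocabulary now contains multi-word features
features = sorted(ngram_vectorizer.vocabulary_)
print(features)

# The pipeline vectorizes raw text before classifying it
toy_preds = ngram_model.predict(["not good", "really good"])
print(toy_preds)
```

With unigrams alone, "good" appears in both classes; bigrams such as "not good" give the classifier a feature that captures the negation.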

**10)** Complete the following tasks:

1. Use the code cell below to load a new dataset into `beauty_df`, containing Amazon reviews of beauty products.

2. Then, create a new pandas DataFrame named `beauty_data` by selecting only the `reviewText` and `overall` columns.

3. Add a `label` column to the DataFrame whose value is `1` for reviews with 4 or 5 stars and `0` for reviews with 3 stars or less.

4. Test the previous Multinomial NB model (used in point **9**) by calculating its mean accuracy on the beauty reviews contained in `beauty_data`.
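The column selection and labelling in tasks 2 and 3 can be sketched on a small hand-made frame. The `toy_beauty_df` data and its extra `verified` column are hypothetical and only stand in for the real `beauty_df`.

```python
import pandas as pd

# Hypothetical stand-in for the real `beauty_df`
toy_beauty_df = pd.DataFrame({
    "reviewText": ["Lovely scent", "Broke after a week", "It is okay", "Works well"],
    "overall": [5.0, 1.0, 3.0, 4.0],
    "verified": [True, True, False, True],
})

# Keep only the two columns of interest; copy so that adding a column
# later does not touch the original frame
toy_beauty_data = toy_beauty_df[["reviewText", "overall"]].copy()

# 1 for reviews with 4 or 5 stars, 0 for reviews with 3 stars or less
toy_beauty_data["label"] = (toy_beauty_data["overall"] >= 4).astype(int)
print(toy_beauty_data)
```

For task 4, the `score(X, y)` method of a fitted scikit-learn classifier returns the mean accuracy, provided the reviews are first transformed with the vectorizer fitted on the training set.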

In [None]:
def parse(path):
  # Yield one review (as a dict) for each line of the gzipped JSON file
  with gzip.open(path, 'r') as g:
    for l in g:
      yield json.loads(l)

def getDF(path):
  # Build a DataFrame with one row per review, then keep a random sample of 10000 rows
  return pd.DataFrame(list(parse(path))).sample(10000)

# Loading data into a pandas dataframe
beauty_df = getDF('All_Beauty.json.gz')
beauty_df["reviewText"] = beauty_df["reviewText"].astype(str) # cast every review text (missing values included) to a string
beauty_df.head() # show the first 5 entries

**11)** Perform the following tasks:

1. Get the prediction lists on `beauty_data` from the latest Multinomial NB model and from the first, unsupervised model (based on the opinion word lists).
2. Finally, compare the two models by performing a McNemar test using the `mcnemar_pvalue` function provided below. Call the function passing the two models' prediction lists as parameters (e.g. `mcnemar_pvalue(preds1, preds2)`).
   * This function returns the p-value of the test, where the null hypothesis is that the two models have the same error proportions.
   * The p-value is the probability of observing a difference between the two models at least as large as the measured one if the null hypothesis were true. It is not the probability that the models have the same proportion of errors; still, the greater the p-value, the weaker the evidence that the two models differ.
   * E.g. a p-value close to 0 signifies that the difference between the two proportions of errors is statistically significant.
3. Decide whether the two models are similar:
   * Set a confidence level of 0.95.
   * Check if `p-value > (1 - confidence level)`; if so, the null hypothesis cannot be rejected and the two models are considered similar.
  

In [None]:
def mcnemar_pvalue(model1_predictions, model2_predictions):
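    # Contingency table: rows are model 1 predictions, columns are model 2 predictions
    table = pd.crosstab(model1_predictions, model2_predictions)
    print(table)

    # McNemar's test (statsmodels uses the exact binomial version by default)
    result = mcnemar(table)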
    return result.pvalue
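The function above cross-tabulates the raw predictions of the two models. McNemar's test is often described instead on a table that records, for each example, whether each model classified it correctly. The sketch below shows that variant; it assumes the true labels are available, and `mcnemar_pvalue_correctness` and the toy lists are hypothetical names introduced for illustration only.

```python
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue_correctness(y_true, preds1, preds2):
    # Hypothetical variant: compare the models on correct/incorrect outcomes
    y_true = pd.Series(list(y_true))
    correct1 = pd.Series(list(preds1)) == y_true
    correct2 = pd.Series(list(preds2)) == y_true
    # Rows: model 1 correct or wrong; columns: model 2 correct or wrong.
    # reindex forces a full 2x2 table even when a combination never occurs.
    table = pd.crosstab(correct1, correct2).reindex(
        index=[True, False], columns=[True, False], fill_value=0
    )
    # Exact binomial test on the two discordant cells
    result = mcnemar(table.to_numpy(), exact=True)
    return result.pvalue

# Toy example: model 1 is always right, model 2 is wrong on 4 of 8 reviews
toy_truth = [1, 1, 1, 1, 0, 0, 0, 0]
toy_preds1 = [1, 1, 1, 1, 0, 0, 0, 0]
toy_preds2 = [1, 1, 0, 0, 0, 0, 1, 1]

p_value = mcnemar_pvalue_correctness(toy_truth, toy_preds1, toy_preds2)

# Decision rule from the exercise, with a confidence level of 0.95
confidence_level = 0.95
models_similar = p_value > (1 - confidence_level)
print(p_value, models_similar)
```

Only the discordant cells (one model right, the other wrong) enter the test, so with few disagreements, as in this toy example, even a large accuracy gap may not reach significance.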