# Opinion Mining & Sentiment Analysis: Exercises (part 2)

**Text Mining unit**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Setup

### Import libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from statsmodels.stats.contingency_tables import mcnemar
import os
from urllib.request import urlretrieve
import glob
import gzip
import json


### Utility functions

In [None]:
# Download a file from a URL (skipped if the file already exists locally)
def download(file, url):
    if not os.path.isfile(file):
        urlretrieve(url, file)

In [None]:
def scan_hu_liu(f):
    for line in f:
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

def load_hu_liu(filename):
    with open(filename, "rb") as f:
        return set(scan_hu_liu(f))

In [None]:
total_samples = 50000

def load_dataset(dataset_path):
    # We do not consider 3-star ratings, as they can confuse the model when
    # discriminating opinion polarity
    pos_overall_values = {5.0, 4.0}
    neg_overall_values = {1.0, 2.0}
    
    # Data structures
    data = []
    overall = []

    # Data loading: read all the dataset
    print("Loading json file...")
    
    # Reading dataset: we build our dataset by selecting
    # only the reviews which are not 3 stars rated. For each review added to our
    # dataset we add to "overall" the label for that review.
    with gzip.open(dataset_path) as jsonfile:
        index = 0
        pos = 0
        neg = 0
        # Each line in the json file represents a review
        for line in jsonfile:
            review = json.loads(line)
            # Positive review: keep it only while the positive quota is not yet filled
            if review['overall'] in pos_overall_values and pos < int(total_samples / 2):
                index += 1
                # We keep track of the number of positive reviews read
                pos += 1
                # Review appending to our dataset
                data.append(review)

                # Label appending
                overall.append(1)
                if index >= total_samples:
                    break
            elif review['overall'] in neg_overall_values and neg < int(total_samples / 2):
                index += 1
                # We keep track of the number of negative reviews read
                neg += 1
                data.append(review)
                # Label appending
                overall.append(0)
                # We stop reading if we reached the maximum number of samples
                if index >= total_samples:
                    break

    df = pd.DataFrame.from_dict(data)[["reviewText", "overall"]]
    df["reviewText"] = df["reviewText"].apply(lambda x: np.str_(x)) # encoding the strings as unicode ones
    df["overall"] = np.where(df["overall"] >= 4, 1, 0)
    return df

In [None]:
def mcnemar_pval(model1_predictions, model2_predictions):
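    # Cross-tabulate the two prediction vectors (rows: model 1, columns: model 2)
    table = pd.crosstab(model1_predictions, model2_predictions)
    print(table)

    # Run McNemar's test, which compares the off-diagonal (disagreement) cells
    result = mcnemar(table)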
    return result.pvalue

### Datasets Downloading

In [None]:
# Dataset downloading
download('AMAZON_FASHION.json.gz','http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/AMAZON_FASHION.json.gz')
download("Software.json.gz", "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Software.json.gz")
download("positive-words.txt", "https://raw.githubusercontent.com/unibodatascience/BBS-TextMining/3ad6643b698f652f200dfbf463a3cb49de8c0e9f/05%20-%20Opinion%20Mining%20with%20Python%20(part%201)/data/positive-words.txt")
download("negative-words.txt", "https://raw.githubusercontent.com/unibodatascience/BBS-TextMining/3ad6643b698f652f200dfbf463a3cb49de8c0e9f/05%20-%20Opinion%20Mining%20with%20Python%20(part%201)/data/negative-words.txt")

In [None]:
# Check if the files have been successfully downloaded
print(glob.glob("*"))

['Software.json.gz', 'Gift_Cards.json.gz', 'positive-words.txt', 'negative-words.txt', 'sample_data']


### Datasets loading

In [None]:
pos_words = load_hu_liu("positive-words.txt")
neg_words = load_hu_liu("negative-words.txt")

In [None]:
# Loading data into a pandas dataframe
reviews = load_dataset('AMAZON_FASHION.json.gz')
reviews.head() # print first 5 entries

Loading json file...


| | reviewText | overall |
|---|---|---|
| 0 | Exactly what I needed. | 1 |
| 1 | I agree with the other review, the opening is ... | 0 |
| 2 | Love these... I am going to order another pack... | 1 |
| 3 | too tiny an opening | 0 |
| 4 | Exactly what I wanted. | 1 |


In [None]:
reviews_B = load_dataset('Software.json.gz')
reviews_B.head() # print first 5 entries

Loading json file...


| | reviewText | overall |
|---|---|---|
| 0 | The materials arrived early and were in excell... | 1 |
| 1 | I am really enjoying this book with the worksh... | 1 |
| 2 | IF YOU ARE TAKING THIS CLASS DON"T WASTE YOUR ... | 0 |
| 3 | I have used LearnSmart and can officially say ... | 1 |
| 4 | Strong backgroung, good read, quite up to date... | 1 |


### Exercises

1) Check the cardinality of the two loaded datasets and their distribution by label (`overall` field)
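
One possible way to run this check is sketched below. It assumes a DataFrame shaped like the one returned by `load_dataset` (a `reviewText` column and a binary `overall` column); `toy_reviews` is a hypothetical stand-in for `reviews` or `reviews_B`.

```python
import pandas as pd

# Hypothetical stand-in for `reviews` / `reviews_B`
toy_reviews = pd.DataFrame({
    "reviewText": ["great fit", "too small", "love it", "poor quality"],
    "overall": [1, 0, 1, 0],
})

# Cardinality: number of reviews in the dataset
n_reviews = len(toy_reviews)

# Distribution by label: absolute counts and relative frequencies
label_counts = toy_reviews["overall"].value_counts()
label_shares = toy_reviews["overall"].value_counts(normalize=True)

print(n_reviews)
print(label_counts)
print(label_shares)
```

The same three lines can be applied unchanged to each of the two real DataFrames.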

2) Then randomly split the loaded `reviews` dataset into a training set and a test set
* using a hold-out approach (e.g. the training set contains 70% of the reviews)
* keeping the label distribution balanced in both splits
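
A minimal sketch of such a stratified hold-out split, using `train_test_split` from the imports above. The frame `toy_reviews` and the `random_state` value are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced stand-in for `reviews`
toy_reviews = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(20)],
    "overall": [1, 0] * 10,
})

# Hold-out split: 30% test, hence 70% training.
# `stratify` keeps the label proportions equal in both splits;
# `random_state` (arbitrary value) makes the shuffle reproducible.
train, test = train_test_split(
    toy_reviews,
    test_size=0.3,
    stratify=toy_reviews["overall"],
    random_state=42,
)

print(len(train), len(test))
print(train["overall"].value_counts())
print(test["overall"].value_counts())
```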

3) A lexicon with sets of commonly used positive and negative words is provided in the variables `pos_words` and `neg_words`, respectively.

Set up NLTK and define a scoring function that classifies a review: assign it a score equal to the number of known positive words it contains minus the number of known negative words, then return 1 if the score is positive and 0 if it is negative or zero.
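
The scoring rule can be sketched as follows. To stay runnable without downloading NLTK resources, this sketch uses a regular-expression tokenizer as a stand-in for an NLTK tokenizer; the names `tokenize`, `lexicon_label`, `toy_pos_words` and `toy_neg_words` are hypothetical.

```python
import re

# Tiny hypothetical lexicons standing in for `pos_words` and `neg_words`
toy_pos_words = {"good", "great", "love"}
toy_neg_words = {"bad", "poor", "hate"}

def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase runs of letters and apostrophes.
    # An NLTK tokenizer such as nltk.tokenize.word_tokenize could replace it.
    return re.findall(r"[a-z']+", text.lower())

def lexicon_label(text, pos_lexicon, neg_lexicon):
    tokens = tokenize(text)
    n_pos = sum(token in pos_lexicon for token in tokens)
    n_neg = sum(token in neg_lexicon for token in tokens)
    score = n_pos - n_neg
    # 1 only for a strictly positive score; negative or zero scores map to 0
    return 1 if score > 0 else 0

label_a = lexicon_label("Great shoes, I love them", toy_pos_words, toy_neg_words)
label_b = lexicon_label("Good fabric but bad, poor stitching", toy_pos_words, toy_neg_words)
label_c = lexicon_label("It arrived on Monday", toy_pos_words, toy_neg_words)
print(label_a, label_b, label_c)  # 1 0 0
```

The third review contains no lexicon word, so its score is zero and it falls into class 0, as the exercise prescribes.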

4) Apply the function to all the reviews in the test set of `reviews`

5) Compare the predicted labels with the known ones and compute the accuracy as the fraction of matches
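
Applying a scoring function to a test set and measuring accuracy might look like the sketch below. `toy_test` and `toy_label` are hypothetical stand-ins for the real test split and for the lexicon-based function; `preds_1` is the name the later exercises use for these predictions.

```python
import pandas as pd

# Hypothetical stand-in for the test split of `reviews`
toy_test = pd.DataFrame({
    "reviewText": ["love it", "bad zipper", "great price", "nothing special"],
    "overall": [1, 0, 1, 1],
})

def toy_label(text):
    # Hypothetical, simplified lexicon-based scoring function
    tokens = text.lower().split()
    n_pos = sum(t in {"love", "great"} for t in tokens)
    n_neg = sum(t in {"bad"} for t in tokens)
    return 1 if n_pos - n_neg > 0 else 0

# One predicted label per review
preds_1 = toy_test["reviewText"].apply(toy_label)

# Accuracy: share of predictions equal to the known labels
accuracy = (preds_1 == toy_test["overall"]).mean()
print(accuracy)  # 0.75
```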

6) Create a pipeline including a `CountVectorizer` to convert reviews into word count vectors and a `LogisticRegression` model

7) Train the model on the training set of the `reviews` dataset

8) Evaluate the model on the test set of the `reviews` dataset
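
Steps 6 to 8 can be sketched on toy data as follows. `LogisticRegression` and `make_pipeline` are not among the notebook's imports, so they are imported here; the toy texts, the variable names and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training and test data
train_texts = ["love this dress", "great quality", "terrible fit", "poor quality, very bad"]
train_labels = [1, 1, 0, 0]
test_texts = ["great dress", "bad fit"]
test_labels = [1, 0]

# Step 6: word counts followed by a logistic regression classifier
count_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Step 7: fit both the vectorizer vocabulary and the classifier on training data only
count_model.fit(train_texts, train_labels)

# Step 8: mean accuracy on held-out reviews
test_accuracy = count_model.score(test_texts, test_labels)
test_predictions = count_model.predict(test_texts)
print(test_accuracy, test_predictions)
```

Wrapping the vectorizer in the pipeline means the vocabulary is learned from the training reviews only, so no information from the test set leaks into the model.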

9) Repeat steps 6 to 8, replacing the `CountVectorizer` in the pipeline with a `TfidfVectorizer`

10) Repeat steps 6 to 8 as in the previous point, this time setting the `ngram_range` parameter of the `TfidfVectorizer` so that bigrams are included
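
The effect of `ngram_range` can be inspected on a couple of toy sentences before training anything. The sketch below (toy documents and variable names are assumptions) compares the vocabulary of a default `TfidfVectorizer` with one that also extracts bigrams, then builds the corresponding pipeline without fitting it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

toy_docs = ["not good at all", "good value"]

# Default: unigrams only, ngram_range=(1, 1)
unigram_vectorizer = TfidfVectorizer().fit(toy_docs)

# Unigrams and bigrams
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_docs)

unigram_features = sorted(unigram_vectorizer.vocabulary_)
bigram_features = sorted(bigram_vectorizer.vocabulary_)
print(unigram_features)
print(bigram_features)

# Pipeline for step 10; it is trained and evaluated like the count-based one
bigram_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
```

With bigrams, a feature such as `not good` becomes available, which lets the model treat a negated word differently from the word alone.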

11) Repeat the evaluation phase of the three models above, this time on all the reviews from the `reviews_B` dataset

12) Extract the predictions of each classifier on the `reviews` test set: those of the unsupervised, lexicon-based classifier are already available (`preds_1`), while those of the supervised models still need to be computed

13) Compare every pair of models using the provided `mcnemar_pval(p1, p2)` function

Choose a confidence level for rejecting the null hypothesis (H0), e.g. 95%, which means rejecting H0 when the p-value is below 0.05.

H0 states that the two classifiers have the same proportion of errors (they disagree with each other to the same extent in both directions).

**In each case, the closer the p-value is to 0, the stronger the evidence that the compared classifiers differ**
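
As an illustration of the decision rule, the sketch below runs McNemar's test on toy predictions. Unlike the provided `mcnemar_pval`, which cross-tabulates the raw predictions, this variant builds the table from per-sample correctness and therefore also needs the true labels; that choice, the toy arrays and all variable names are assumptions of this sketch.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical true labels and predictions of two classifiers
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
preds_a = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
preds_b = np.array([1, 0, 0, 0, 0, 1, 1, 1, 0, 0])

correct_a = preds_a == y_true
correct_b = preds_b == y_true

# 2x2 table: rows = model A correct / wrong, columns = model B correct / wrong
table = np.array([
    [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
])

# Exact (binomial) version of the test, suited to small discordant counts
result = mcnemar(table, exact=True)

alpha = 0.05  # significance level matching a 95% confidence level
reject_h0 = result.pvalue < alpha
print(table)
print(result.pvalue, reject_h0)
```

Here model A is right on five samples that model B gets wrong and the reverse never happens, yet with only five discordant samples the exact p-value is 0.0625, so H0 is not rejected at the 95% level.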