# Opinion Mining and Sentiment Analysis: Lab In Teams

**Text Mining unit**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The exercises must be completed in teams of up to 4 students, as assigned by the teacher. Different teams must not communicate with each other.

- You may consult the course material and the Web for advice.

- If you are still in doubt about anything, ask the teacher.

- At the end, the file must contain all the required results (as code cell outputs) along with all the commands necessary to reproduce them.

- The function of every command or group of related commands must be documented clearly and concisely.

- To work together, team members can log in to the same Google account created for your group and edit the notebook on Google Colab. Be careful not to overwrite changes made by other members of your group: to avoid this, each member can edit a separate copy of the notebook, and the results can be merged before the end of the test.

- You have 1.5 hours to complete the exercises.

## Setup

The following cell contains all the necessary imports.

In [None]:
import numpy as np
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

Run the following cells to download the necessary files.

In [None]:
import os
from urllib.request import urlretrieve
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [None]:
download("100k_reviews.tsv.gz", "https://www.dropbox.com/s/9fkjz84dnzfyimt/estore_reviews_100k.tsv.gz?dl=1")
download("positive-words.txt", "https://github.com/datascienceunibo/bbs-opinion-lab-2019/raw/master/positive-words.txt")
download("negative-words.txt", "https://github.com/datascienceunibo/bbs-opinion-lab-2019/raw/master/negative-words.txt")

In [None]:
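# Tokenizer models used by nltk.word_tokenize; recent NLTK releases may require "punkt_tab" instead
nltk.download("punkt")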

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Dataset

The `100k_reviews.tsv.gz` file contains a dataset of 100,000 reviews posted on Amazon.com about DVDs of movies and TV series. Each review is labeled with a score between 1 and 5 stars.

Run the following to correctly load the file into a pandas DataFrame.

In [None]:
data = pd.read_csv("100k_reviews.tsv.gz", sep="\t", compression="gzip")

In [None]:
data.head()

| Unnamed: 0 | text | stars |
|---|---|---|
| 0 | George Romero did the right thing when he pick... | 5 |
| 1 | OK, that makes it sound like something out of ... | 5 |
| 2 | - At a tribal village, a pensive Elizabeth Cur... | 5 |
| 3 | Wow! This has to be one of the more unusual mo... | 5 |
| 4 | Kevin Costner is one of those actors that I ne... | 5 |


In [None]:
data = data.sample(5000)
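
Because `sample` draws rows at random, each run selects a different subset, so the figures reported in the rest of the notebook will vary slightly between runs. A minimal sketch of a reproducible draw follows; the `toy` DataFrame and the seed value 42 are illustrative assumptions, not part of the lab material.

```python
import pandas as pd

# Hypothetical stand-in for the review DataFrame (not the real dataset).
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "stars": [1 + i % 5 for i in range(100)],
})

# Passing the same random_state selects the same rows on every run.
sample_a = toy.sample(10, random_state=42)
sample_b = toy.sample(10, random_state=42)

same_rows = sample_a.index.equals(sample_b.index)
print(same_rows)
```

The same idea would apply to the real data with `data.sample(5000, random_state=...)`, at the cost of results that no longer match the outputs recorded here.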

During the lab you will also use the Hu and Liu sentiment lexicon. Run the following to load the sets of positive and negative words.

In [None]:
def scan_hu_liu(f):
    for line in f:
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

def load_hu_liu(filename):
    with open(filename, "rb") as f:
        return set(scan_hu_liu(f))

hu_liu_pos = load_hu_liu("positive-words.txt")
hu_liu_neg = load_hu_liu("negative-words.txt")

## Exercises

**1)** Verify the distribution of the number of stars

In [None]:
data["stars"].value_counts()

5    2406
4    1305
3     696
2     343
1     250
Name: stars, dtype: int64
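
Absolute counts can be easier to compare as shares of the total. The sketch below shows one way to obtain them with `value_counts(normalize=True)`; the `stars` Series is made-up data used only for illustration.

```python
import pandas as pd

# Made-up star ratings, used only to illustrate the call.
stars = pd.Series([5, 5, 4, 3, 5, 1, 4, 5, 2, 5])

# normalize=True returns relative frequencies instead of counts;
# sort_index() orders the result by star value rather than by frequency.
shares = stars.value_counts(normalize=True).sort_index()
print(shares)
```

On the lab data the equivalent call would be `data["stars"].value_counts(normalize=True)`.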

**2)** Add a `label` column to the DataFrame whose value is `"pos"` for reviews with 4 or 5 stars and `"neg"` for reviews with 3 stars or less

In [None]:
data["label"] = np.where(data["stars"] >= 4, "pos", "neg")

data.head() # To visualize the new dataframe

| Unnamed: 0 | text | stars | label |
|---|---|---|---|
| 7724 | Here is a film that just might be less effecti... | 5 | pos |
| 72819 | I'm talking about Chaplain of course. In City... | 5 | pos |
| 5034 | This series was a excellent vampire experience... | 5 | pos |
| 9585 | The movie is about famed novelist and writer I... | 3 | neg |
| 95801 | If you're considering buying Season 1 of Enter... | 5 | pos |


**3)** Split the dataset randomly into a training set with 80\% of data and a test set with the remaining 20\%

In [None]:
trainset, testset = train_test_split(data, test_size=0.2)

print("training set shape: " + str(trainset.shape))
print("Test set shape: " + str(testset.shape))

training set shape: (4000, 3)
Test set shape: (1000, 3)
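
A plain random split does not guarantee that both parts keep the same share of `"pos"` and `"neg"` labels, and it changes on every run. As a sketch of one possible refinement, `train_test_split` accepts `stratify` and `random_state`; the `toy` DataFrame and the seed below are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 70 positive and 30 negative reviews.
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "label": ["pos"] * 70 + ["neg"] * 30,
})

# stratify keeps the label proportions equal in both parts;
# random_state makes the split repeatable.
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

print(train_part["label"].value_counts(normalize=True))
print(test_part["label"].value_counts(normalize=True))
```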


**4)** Create a function that accepts a text as input, counts the occurrences of positive and negative words from the Hu and Liu lexicon, and returns `"pos"` if there are more positive words than negative ones, or `"neg"` otherwise

In [None]:
def sentiment_label(text):
    words = nltk.word_tokenize(text)
    pos_count = sum(1 for word in words if word in hu_liu_pos)
    neg_count = sum(1 for word in words if word in hu_liu_neg)
    return "pos" if pos_count > neg_count else "neg"

In [None]:
# test
(sentiment_label("This is awesome!"),
 sentiment_label("This is horrible!"))

('pos', 'neg')
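
Note that `nltk.word_tokenize` keeps the original casing, so if the lexicon entries are lowercase, a capitalized opinion word such as "Great" would not be counted. The following is a sketch of a case-insensitive variant under stated assumptions: the two mini-lexicons and the regex tokenizer are hypothetical stand-ins chosen to keep the example self-contained, not the lab's lexicon or tokenizer.

```python
import re

# Hypothetical mini-lexicons standing in for hu_liu_pos / hu_liu_neg.
toy_pos = {"awesome", "great"}
toy_neg = {"horrible", "boring"}

def sentiment_label_ci(text, pos_words=toy_pos, neg_words=toy_neg):
    """Lexicon vote computed on lowercased tokens."""
    # Simple regex tokenizer (an assumption; the lab uses nltk.word_tokenize).
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    pos_count = sum(word in pos_words for word in words)
    neg_count = sum(word in neg_words for word in words)
    # As in the exercise, ties (including 0 vs 0) fall back to "neg".
    return "pos" if pos_count > neg_count else "neg"

labels = (
    sentiment_label_ci("AWESOME film, Great cast"),
    sentiment_label_ci("Boring and horrible"),
    sentiment_label_ci("It is a film"),
)
print(labels)
```

The third call shows that a text with no opinion words at all is labeled `"neg"`, which is worth keeping in mind when reading the accuracy of this approach.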

**4bis)** Apply the function above to the test reviews and compute the percentage of cases where the function output matches the known label

In [None]:
%%time
lexicon_label = testset["text"].apply(sentiment_label)

print(lexicon_label)

13300    pos
23877    pos
90444    pos
22763    pos
29277    neg
        ... 
68387    pos
46199    pos
77561    pos
62437    pos
4712     neg
Name: text, Length: 1000, dtype: object
CPU times: user 3.28 s, sys: 19.8 ms, total: 3.3 s
Wall time: 4.24 s


Alternatively, using `map`:

In [None]:
lexicon_label = list(map(sentiment_label, testset["text"]))

In [None]:
lexicon_label == testset["label"] # This way we check the calculated sentiment vs the known label

13300     True
23877     True
90444     True
22763    False
29277    False
         ...  
68387     True
46199     True
77561    False
62437    False
4712     False
Name: label, Length: 1000, dtype: bool

In [None]:
np.mean(lexicon_label == testset["label"]) # True is converted to 1 and False to 0

0.634
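
An accuracy figure is easier to judge next to a trivial baseline. The star counts in exercise 1 put about 74% of the sampled reviews at 4 or 5 stars, so a classifier that always answers `"pos"` would likely score close to that on the test set. The sketch below computes both numbers on made-up labels; `true_labels` and `predicted` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up labels: 7 positive and 3 negative reviews.
true_labels = pd.Series(["pos"] * 7 + ["neg"] * 3)
predicted = pd.Series(
    ["pos", "pos", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "pos"]
)

# Accuracy of the predictions: fraction of matching labels.
accuracy = np.mean(predicted == true_labels)

# Majority-class baseline: always predict the most frequent true label.
majority_label = true_labels.mode()[0]
baseline = np.mean(true_labels == majority_label)

print(accuracy, baseline)
```

In this toy case the predictions score below the baseline, which illustrates why the two should be compared on imbalanced data.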

**5)** Create a tf.idf vector space model from the training reviews, excluding words that appear in fewer than 3 documents, and extract the document-term matrix of the training reviews

In [None]:
vect = TfidfVectorizer(min_df=3)
train_dtm = vect.fit_transform(trainset["text"])
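
To see what `min_df=3` does, the sketch below fits two vectorizers on a tiny made-up corpus (`docs` is an assumption for illustration) and compares their vocabularies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus: only "great" occurs in at least 3 documents.
docs = ["great movie", "great plot", "great acting", "boring plot"]

vect_all = TfidfVectorizer()           # keeps every term
vect_min3 = TfidfVectorizer(min_df=3)  # drops terms found in fewer than 3 documents

vect_all.fit(docs)
dtm = vect_min3.fit_transform(docs)

vocab_all = sorted(vect_all.vocabulary_)
vocab_min3 = sorted(vect_min3.vocabulary_)

print(vocab_all)
print(vocab_min3)
print(dtm.shape)  # (number of documents, number of retained terms)
```

A document whose terms are all filtered out, like the last one here, still gets a row in the matrix, but that row contains only zeros.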

**6)** Train a logistic regression classifier on the training reviews, using the representation created above

In [None]:
%%time
model = LogisticRegression(max_iter=500)
model.fit(train_dtm, trainset["label"]);

CPU times: user 557 ms, sys: 586 ms, total: 1.14 s
Wall time: 1.13 s


**7)** Verify the accuracy of the classifier on the test set

In [None]:
test_dtm = vect.transform(testset["text"])
model.score(test_dtm, testset["label"])

0.778
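
Accuracy alone does not show how the errors are distributed between the two classes. One possible complement, sketched here on made-up labels (`y_true` and `y_pred` are assumptions), is a confusion matrix from `sklearn.metrics`.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for 10 reviews.
y_true = ["pos"] * 6 + ["neg"] * 4
y_pred = ["pos", "pos", "pos", "pos", "pos", "neg", "pos", "pos", "neg", "neg"]

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["neg", "pos"])
print(cm)
```

On the lab data the inputs would be `testset["label"]` and `model.predict(test_dtm)`.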

**8)** Extract the 10 words with the highest regression coefficient and the 10 words with the lowest coefficient

In [None]:
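# Positive coefficients push towards "pos" (the second entry of model.classes_).
coefs = pd.Series(
    model.coef_[0],
    # get_feature_names() was removed in scikit-learn 1.2;
    # on versions older than 1.0, use get_feature_names() instead.
    index=vect.get_feature_names_out()
).sort_values()
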

In [None]:
coefs.head(10)

boring   -2.958191
bad      -2.735670
but      -2.665958
worst    -2.657898
stupid   -1.872586
no       -1.847642
not      -1.767202
much     -1.610942
only     -1.598726
money    -1.597942
dtype: float64

In [None]:
coefs.tail(10)

enjoy        1.565964
favorite     1.597295
wonderful    1.633455
love         1.635895
dvd          1.716200
excellent    1.743889
you          1.756413
and          2.298236
best         2.309310
great        3.925792
dtype: float64

**9)** Create a function that accepts a text as input and returns a list containing only the words from the text that are also present in the Hu and Liu lexicon (each word must appear in the list as many times as it appears in the text)

In [None]:
hu_liu_all = hu_liu_pos | hu_liu_neg # union of two sets of positive and negative Hu&Liu opinion words

def tokenize_hu_liu(text):
    words = nltk.word_tokenize(text) # text tokenization in order to split words in the given text
    return [word for word in words if word in hu_liu_all]

In [None]:
# test
(tokenize_hu_liu("This is awesome awesome!"), # tokens: ["This", "is", "awesome", "awesome", "!"] --> ["awesome", "awesome"]
 tokenize_hu_liu("This is horrible!")) # tokens: ["This", "is", "horrible", "!"] --> ["horrible"]

(['awesome', 'awesome'], ['horrible'])

**10)** Repeat steps 5 to 7 with a tf.idf vectorizer that uses the function above to extract tokens from the text

In [None]:
vect = TfidfVectorizer(min_df=3, tokenizer=tokenize_hu_liu)
train_dtm = vect.fit_transform(trainset["text"])

In [None]:
%%time
model = LogisticRegression(max_iter=500)
model.fit(train_dtm, trainset["label"]);

CPU times: user 77.7 ms, sys: 34 µs, total: 77.8 ms
Wall time: 77.7 ms


In [None]:
test_dtm = vect.transform(testset["text"])
print(model.score(test_dtm, testset["label"]))

0.792
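
The vectorizer and the classifier can also be chained so that the same tokenization is applied at training and prediction time. The following is a sketch using scikit-learn's `make_pipeline` under stated assumptions: `toy_lexicon`, `tokenize_toy` and the four training texts are hypothetical stand-ins, not the lab's lexicon, tokenizer or data.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-lexicon standing in for hu_liu_all.
toy_lexicon = {"great", "awesome", "boring", "horrible"}

def tokenize_toy(text):
    """Keep only the words of the text that belong to the lexicon."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w in toy_lexicon]

# Made-up training data.
train_texts = ["A great film", "Awesome and great", "So boring", "Horrible, boring plot"]
train_labels = ["pos", "pos", "neg", "neg"]

pipe = make_pipeline(
    # token_pattern=None signals that the default pattern is unused
    # because a custom tokenizer is supplied.
    TfidfVectorizer(tokenizer=tokenize_toy, token_pattern=None),
    LogisticRegression(max_iter=500),
)
pipe.fit(train_texts, train_labels)

vocabulary = sorted(pipe.named_steps["tfidfvectorizer"].vocabulary_)
predictions = list(pipe.predict(["great great film", "boring story"]))
print(vocabulary)
print(predictions)
```

The vocabulary is bounded by the lexicon size, which is consistent with the much shorter training time reported above for the lexicon-restricted model.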
