# Opinion Mining and Sentiment Analysis: Teamwork

**Text Mining unit**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The exercises must be completed by teams of 2 or 3 people, as indicated by the teacher. Different teams should not communicate with each other
- It is allowed to consult course material and the Web for advice
- If still in doubt about anything, ask the teacher

- At the end, the file must contain all the required results (as code cell outputs) along with all the commands necessary to reproduce them.
- The function of every command or group of related commands must be documented clearly and concisely.
- The name of every variable defined in the commands (not counting the ones provided by the initial steps) must have the initial letters of your last names as a prefix (e.g. “sj_train_set” for Smith and Johnson). 
- To work together, all team members can access the Google account created for your group and edit the notebook on Google Colab, but be careful not to overwrite the changes made by the other members of your group (to avoid this, each member can edit a separate copy of the notebook and merge the results before the end of the test).
- You have 1.5 hours to complete the test.
- When finished, the team member whose surname comes first alphabetically will send the notebook file (with the .ipynb extension) by email, from the group's Google account, to the teachers (gianluca.moro@unibo.it; nicola.piscaglia@bbs.unibo.it) with “[BBS Teamwork] Your last names” as the subject, keeping a copy of the file for safety.

## Setup

The following cell contains all necessary imports

In [12]:
import numpy as np
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

Run the following to download the necessary files

In [13]:
import os
from urllib.request import urlretrieve
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [14]:
download("100k_reviews.tsv.gz", "https://www.dropbox.com/s/9fkjz84dnzfyimt/estore_reviews_100k.tsv.gz?dl=1")
download("positive-words.txt", "https://github.com/datascienceunibo/bbs-opinion-lab-2019/raw/master/positive-words.txt")
download("negative-words.txt", "https://github.com/datascienceunibo/bbs-opinion-lab-2019/raw/master/negative-words.txt")

In [15]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Dataset

The `100k_reviews.tsv.gz` file contains a dataset of 100,000 reviews posted on Amazon.com about DVDs of movies and TV series. Each review is labeled with a score between 1 and 5 stars.

Run the following to correctly load the file into a pandas DataFrame.

In [16]:
data = pd.read_csv("100k_reviews.tsv.gz", sep="\t", compression="gzip")

In [17]:
data.head()

| | text | stars |
|---|---|---|
| 0 | George Romero did the right thing when he pick... | 5 |
| 1 | OK, that makes it sound like something out of ... | 5 |
| 2 | - At a tribal village, a pensive Elizabeth Cur... | 5 |
| 3 | Wow! This has to be one of the more unusual mo... | 5 |
| 4 | Kevin Costner is one of those actors that I ne... | 5 |


In [18]:
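# Keep a random sample of 5,000 reviews; no random_state is set, so the sample differs at every run
data = data.sample(5000)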

In this teamwork you will also use the Hu and Liu sentiment lexicon. Run the following to load the sets of positive and negative words.

In [19]:
def scan_hu_liu(f):
    for line in f:
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

def load_hu_liu(filename):
    with open(filename, "rb") as f:
        return set(scan_hu_liu(f))

hu_liu_pos = load_hu_liu("positive-words.txt")
hu_liu_neg = load_hu_liu("negative-words.txt")
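
To see what the loader keeps and what it discards, the sketch below runs the same filtering logic on an in-memory byte stream. The sample content is made up for illustration and only mimics the assumed layout of the lexicon files (comment lines starting with `;`, then one word per line); `sample` and `sample_words` are hypothetical names.

```python
import io

def scan_hu_liu(f):
    # Same logic as above: yield non-empty lines that are not ";" comments
    for line in f:
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

# Hypothetical sample content, not the real lexicon file
sample = io.BytesIO(b"; header comment\n;\n\nawesome\ngood\n  great  \n")
sample_words = set(scan_hu_liu(sample))
print(sorted(sample_words))  # ['awesome', 'good', 'great']
```

Comment lines, blank lines and surrounding whitespace are dropped, so only the words themselves end up in the set.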

## Exercises

**1)** Verify the distribution of the number of stars

In [20]:
data["stars"].value_counts()

5    2397
4    1327
3     692
2     336
1     248
Name: stars, dtype: int64
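
Absolute counts can be easier to read as shares. The following sketch recomputes the relative frequencies from the counts printed above; `star_counts` and `star_share` are names introduced here for illustration. On the actual DataFrame, `data["stars"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the output above
star_counts = pd.Series({5: 2397, 4: 1327, 3: 692, 2: 336, 1: 248})

# Share of each score over the 5,000 sampled reviews
star_share = (star_counts / star_counts.sum()).round(3)
print(star_share)
```

The distribution is clearly skewed: reviews with 4 or 5 stars account for 3,724 of the 5,000 sampled reviews, about 74.5%.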

**2)** Add a `label` column to the DataFrame whose value is `"pos"` for reviews with 4 or 5 stars and `"neg"` for reviews with 3 stars or fewer

In [21]:
data["label"] = np.where(data["stars"] >= 4, "pos", "neg")

data.head() # To visualize the new dataframe

| | text | stars | label |
|---|---|---|---|
| 5671 | Great acting and thrilling story from the begi... | 5 | pos |
| 50118 | Blade Trinity is a wonderful movie but takes a... | 5 | pos |
| 40889 | Quite honestly one of the greatest comedies ev... | 5 | pos |
| 24318 | So. At long last, the controversial film "Bat... | 4 | pos |
| 53003 | To put it plainly, this movie was filmed in FU... | 2 | neg |
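
A quick way to check the threshold is to apply the same expression to one review per star value. This is a minimal sketch on a made-up DataFrame (`toy` is a hypothetical name), showing that 3 stars falls on the `"neg"` side.

```python
import numpy as np
import pandas as pd

# One hypothetical review per possible score
toy = pd.DataFrame({"stars": [1, 2, 3, 4, 5]})

# Same rule as above: 4 or 5 stars -> "pos", anything lower -> "neg"
toy["label"] = np.where(toy["stars"] >= 4, "pos", "neg")
print(toy)
```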


**3)** Split the dataset randomly into a training set with 80\% of data and a test set with the remaining 20\%

In [22]:
trainset, testset = train_test_split(data, test_size=0.2)

print("Training set shape: " + str(trainset.shape))
print("Test set shape: " + str(testset.shape))

Training set shape: (4000, 3)
Test set shape: (1000, 3)
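
The split above is random, so the results change from run to run. As an optional variation (not required by the exercise), `train_test_split` accepts `random_state` to make the split reproducible and `stratify` to keep the same label proportions in both parts. The sketch below uses a made-up DataFrame; `toy`, `toy_train` and `toy_test` are hypothetical names and the seed 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 15 positive and 5 negative reviews
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": ["pos"] * 15 + ["neg"] * 5,
})

# Fixed seed for reproducibility, stratified on the label column
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)
print(toy_train["label"].value_counts().to_dict())
print(toy_test["label"].value_counts().to_dict())
```

With stratification, the 3:1 ratio between `"pos"` and `"neg"` is preserved in both the training and the test part.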


**4)** Create a function which accepts a text as input, counts the occurrences of positive and negative words from the Hu \& Liu lexicon, and returns `"pos"` if there are more positive words than negative ones or `"neg"` otherwise

In [23]:
def sentiment_label(text):
    words = nltk.word_tokenize(text)
    pos_count = sum(1 for word in words if word in hu_liu_pos)
    neg_count = sum(1 for word in words if word in hu_liu_neg)
    return "pos" if pos_count > neg_count else "neg"

In [24]:
# test
(sentiment_label("This is awesome!"),
 sentiment_label("This is horrible!"))

('pos', 'neg')
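
One detail worth checking in the function above is letter case: `word in hu_liu_pos` is an exact string match, so a capitalized word is counted only if the lexicon contains that exact spelling. The sketch below illustrates the effect with a tiny made-up lexicon and a simple regex tokenizer standing in for `nltk.word_tokenize`. The names `toy_pos`, `toy_neg` and `toy_sentiment_label`, as well as the `lowercase` flag, are assumptions introduced for illustration and are not part of the exercise solution.

```python
import re

# Tiny stand-in lexicons (hypothetical; the real ones are hu_liu_pos / hu_liu_neg)
toy_pos = {"awesome", "great"}
toy_neg = {"horrible", "bad"}

def toy_sentiment_label(text, lowercase=False):
    # Simple regex tokenizer used here instead of nltk.word_tokenize
    words = re.findall(r"[A-Za-z']+", text)
    if lowercase:
        words = [w.lower() for w in words]
    pos_count = sum(1 for w in words if w in toy_pos)
    neg_count = sum(1 for w in words if w in toy_neg)
    # Ties and texts without any lexicon word fall back to "neg"
    return "pos" if pos_count > neg_count else "neg"

print(toy_sentiment_label("Awesome movie, great cast"))        # pos (only "great" matches)
print(toy_sentiment_label("AWESOME. GREAT."))                  # neg (no exact match)
print(toy_sentiment_label("AWESOME. GREAT.", lowercase=True))  # pos (both words match)
```

Whether lowercasing helps on the real data depends on the lexicon and the reviews, so it is worth measuring rather than assuming.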

**4b)** Apply the function above to the test reviews and compute the percentage of cases where its output matches the known label

In [25]:
%%time
lexicon_label = testset["text"].apply(sentiment_label)

print(lexicon_label)

48822    pos
48431    pos
50731    pos
38853    pos
53305    neg
        ... 
29815    pos
20647    pos
95909    neg
22381    pos
54693    pos
Name: text, Length: 1000, dtype: object
CPU times: user 1.82 s, sys: 9.47 ms, total: 1.83 s
Wall time: 1.84 s


Alternatively, the same labels can be computed with `map`:

In [26]:
lexicon_label = list(map(sentiment_label, testset["text"]))

In [27]:
lexicon_label == testset["label"] # This way we check the calculated sentiment vs the known label

48822     True
48431     True
50731     True
38853     True
53305    False
         ...  
29815    False
20647     True
95909    False
22381     True
54693     True
Name: label, Length: 1000, dtype: bool

In [28]:
np.mean(lexicon_label == testset["label"]) # True is converted to 1 and False to 0

0.666
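
A useful reference point for this accuracy is the majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The sketch below computes it from the label counts implied by the star distribution of exercise 1; `majority_baseline_accuracy` and `sample_labels` are hypothetical names, and the share in the actual test split may differ slightly because the split is random.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy obtained by always predicting the most frequent label
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Label counts derived from the star distribution shown in exercise 1
sample_labels = ["pos"] * (2397 + 1327) + ["neg"] * (692 + 336 + 248)
print(round(majority_baseline_accuracy(sample_labels), 4))  # 0.7448
```

On the whole sample, always answering `"pos"` would be right about 74.5% of the time, which is higher than the 0.666 obtained with the lexicon counts.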

**5)** Create a TF-IDF vector space model from the training reviews, excluding words that appear in fewer than 3 documents, and extract the corresponding document-term matrix

In [29]:
vect = TfidfVectorizer(min_df=3)
train_dtm = vect.fit_transform(trainset["text"])
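
To make the effect of `min_df=3` concrete, the sketch below fits the same kind of vectorizer on four made-up documents (`toy_docs`, `toy_vect` and `toy_dtm` are hypothetical names). Note that `min_df` counts documents, not occurrences: a word repeated twice in one document still counts once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "great" appears in 3 documents, "plot" in 2, the rest in 1
toy_docs = [
    "great movie great cast",
    "great plot",
    "bad plot",
    "great soundtrack",
]

toy_vect = TfidfVectorizer(min_df=3)
toy_dtm = toy_vect.fit_transform(toy_docs)

print(sorted(toy_vect.vocabulary_))  # ['great']
print(toy_dtm.shape)                 # (4, 1)
```

Only `great` survives the filter, and the third document ends up as an all-zero row because none of its words is in the vocabulary.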

**6)** Train a logistic regression classifier on the training reviews, using the representation created above

In [30]:
%%time
model = LogisticRegression(max_iter=500)
model.fit(train_dtm, trainset["label"]);

CPU times: user 445 ms, sys: 474 ms, total: 919 ms
Wall time: 495 ms


**7)** Verify the accuracy of the classifier on the test set

In [31]:
test_dtm = vect.transform(testset["text"])
model.score(test_dtm, testset["label"])

0.782
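
Because the two classes are unbalanced, a single accuracy figure can hide how each class is handled. As an optional complement, a confusion matrix breaks the predictions down by true and predicted label. The sketch below uses `sklearn.metrics.confusion_matrix` on made-up labels (`toy_true` and `toy_pred` are hypothetical); on the real data one would pass `testset["label"]` and `model.predict(test_dtm)` instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
toy_true = ["pos", "pos", "pos", "neg", "neg", "pos"]
toy_pred = ["pos", "pos", "neg", "neg", "pos", "pos"]

# Rows are true labels, columns are predicted labels, both ordered as ["neg", "pos"]
cm = confusion_matrix(toy_true, toy_pred, labels=["neg", "pos"])
print(cm)
# [[1 1]
#  [1 3]]
```

The diagonal holds the correct predictions (1 + 3 out of 6 here), while the off-diagonal cells show which class is confused with which.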

**8)** Extract the 10 words with the highest regression coefficient and the 10 words with the lowest coefficient

In [32]:
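# Pair each coefficient with its term and sort in ascending order
coefs = pd.Series(
    model.coef_[0],
    index=vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
).sort_values()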

In [33]:
coefs.head(10)

but       -2.602180
not       -2.577796
bad       -2.370608
plot      -2.085036
worst     -1.984539
nothing   -1.866736
even      -1.727888
poor      -1.724745
too       -1.713623
doesn     -1.615402
dtype: float64

In [34]:
coefs.tail(10)

favorite     1.399785
seen         1.407237
love         1.520499
wonderful    1.569970
excellent    1.727540
enjoy        1.755930
perfect      1.815803
will         1.952415
and          2.658624
great        3.664347
dtype: float64

**9)** Create a function which accepts a text as input and returns a list containing only the words from the text that are also present in the Hu and Liu lexicon (each word must appear in the list as many times as it appears in the text)

In [35]:
hu_liu_all = hu_liu_pos | hu_liu_neg # union of two sets of positive and negative Hu&Liu opinion words

def tokenize_hu_liu(text):
    words = nltk.word_tokenize(text) # text tokenization in order to split words in the given text
    return [word for word in words if word in hu_liu_all]

In [36]:
# test
(tokenize_hu_liu("This is awesome awesome!"), # tokens: ["This", "is", "awesome", "awesome", "!"] --> kept: ["awesome", "awesome"]
 tokenize_hu_liu("This is horrible!")) # tokens: ["This", "is", "horrible", "!"] --> kept: ["horrible"]

(['awesome', 'awesome'], ['horrible'])

**10)** Repeat points 5 to 7 with a TF-IDF vectorizer that uses the function above to extract tokens from the text

In [37]:
vect = TfidfVectorizer(min_df=3, tokenizer=tokenize_hu_liu)
train_dtm = vect.fit_transform(trainset["text"])

In [38]:
%%time
model = LogisticRegression(max_iter=500)
model.fit(train_dtm, trainset["label"]);

CPU times: user 89.1 ms, sys: 0 ns, total: 89.1 ms
Wall time: 90.2 ms


In [39]:
test_dtm = vect.transform(testset["text"])
print(model.score(test_dtm, testset["label"]))

0.786
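
The sketch below shows, on made-up data, how a custom tokenizer restricts the vocabulary of a `TfidfVectorizer` to lexicon words. The lexicon, the documents and all `toy_*` names are hypothetical, and a simple regex replaces `nltk.word_tokenize` to keep the example self-contained. Passing `token_pattern=None` only silences the warning that the default pattern is unused when a tokenizer is given.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

toy_lexicon = {"great", "bad", "boring"}  # hypothetical stand-in for hu_liu_all

def toy_tokenize(text):
    # Keep only tokens found in the lexicon, preserving repetitions
    return [w for w in re.findall(r"[a-z']+", text) if w in toy_lexicon]

toy_docs = ["Great film, GREAT cast", "A bad and boring film", "Nothing special"]

# The vectorizer lowercases the text (lowercase=True by default) before tokenizing
toy_vect = TfidfVectorizer(tokenizer=toy_tokenize, token_pattern=None)
toy_dtm = toy_vect.fit_transform(toy_docs)

print(sorted(toy_vect.vocabulary_))  # ['bad', 'boring', 'great']
print(toy_dtm.shape)                 # (3, 3)
```

Two things are visible here: the vocabulary shrinks to the lexicon words that occur in the corpus, which is consistent with the shorter training time above, and a document without any lexicon word (the third one) becomes an all-zero row.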
