# Opinion Mining and Sentiment Analysis: Teamwork

**Text Mining unit**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The exercises must be completed in teams of 2 or 3 people, as assigned by the teacher. Different teams must not communicate with each other
- You may consult the course material and the Web
- If still in doubt about anything, ask the teacher

- At the end, the notebook must contain all the required results (as code cell outputs) along with all the commands necessary to reproduce them.
- The purpose of every command, or group of related commands, must be documented clearly and concisely.
- The name of every variable defined in the commands (not counting the ones provided by the initial steps) must have the initial letters of your last names as a prefix (e.g. “sj_train_set” for Smith and Johnson). 
- To work together, all team members can sign in to the Google account created for your group and edit the notebook on Google Colab. Be careful not to overwrite the changes made by the other members of your group: to avoid this, each member can edit a separate copy of the notebook, and the team can merge the results before the end of the test.
- You have 1.5 hours to complete the test.
- When finished, the team member whose surname comes first alphabetically must send the notebook file (with the .ipynb extension) by email, from the group's Google account, to the teachers (gianluca.moro@unibo.it; nicola.piscaglia@bbs.unibo.it), with “[BBS Teamwork] Your last names” as the subject. Keep a copy of the file for safety.

## Setup

The following cell contains all the necessary imports.

In [None]:
import numpy as np
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

Run the following cells to download the necessary files.

In [None]:
import os
from urllib.request import urlretrieve
def download(file, url):
    """Download `url` to the local path `file`, unless the file already exists."""
    if not os.path.exists(file):
        urlretrieve(url, file)

In [None]:
download("100k_reviews.tsv.gz", "https://www.dropbox.com/s/9fkjz84dnzfyimt/estore_reviews_100k.tsv.gz?dl=1")
download("positive-words.txt", "https://raw.githubusercontent.com/unibodatascience/BBS-TextMining/gh-pages/05%20-%20Opinion%20Mining%20with%20Python%20(part%201)/data/positive-words.txt")
download("negative-words.txt", "https://raw.githubusercontent.com/unibodatascience/BBS-TextMining/gh-pages/05%20-%20Opinion%20Mining%20with%20Python%20(part%201)/data/negative-words.txt")

In [None]:
nltk.download("punkt")
nltk.download("punkt_tab")  # tokenizer data also required by word_tokenize in recent NLTK releases

## Dataset

The `100k_reviews.tsv.gz` file contains a dataset of 100,000 reviews posted on Amazon.com about DVDs of movies and TV series. Each review is labeled with a score between 1 and 5 stars.

Run the following to correctly load the file into a pandas DataFrame.

In [None]:
data = pd.read_csv("100k_reviews.tsv.gz", sep="\t", compression="gzip")

In [None]:
data.head()

In this teamwork you will also use the Hu and Liu sentiment lexicon. Run the following to load the sets of positive and negative words.

In [None]:
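def scan_hu_liu(f):
    """Yield the words in an open lexicon file, skipping blank and ';' comment lines."""
    for line in f:
        # The file is read as bytes; characters that cannot be decoded are dropped
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

def load_hu_liu(filename):
    """Load a Hu and Liu lexicon file into a set of words."""
    with open(filename, "rb") as f:
        return set(scan_hu_liu(f))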

hu_liu_pos = load_hu_liu("positive-words.txt")
hu_liu_neg = load_hu_liu("negative-words.txt")
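
To see what the loader keeps and what it discards, the following sketch feeds the same scanning logic a small in-memory file. The file content here is made up for illustration and is not the real lexicon; only the handling of blank lines and `;` comment lines mirrors the cell above.

```python
import io

def scan_hu_liu(f):
    # Same logic as the loader above: decode bytes, skip blanks and ";" comments
    for line in f:
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

# Hypothetical file content: a comment line, a blank line and three entries
sample_file = io.BytesIO(b"; header comment\n\ngood\ngreat\ngood\n")

# Collecting into a set removes the duplicated entry
sample_words = set(scan_hu_liu(sample_file))
print(sorted(sample_words))
```

Since the result is a set, membership tests such as `"good" in sample_words` are fast, which is convenient when checking every token of a long review.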

## Exercises

**1)** Inspect the distribution of the number of stars

**2)** Add a `label` column to the DataFrame whose value is `"pos"` for reviews with 4 or 5 stars and `"neg"` for reviews with 3 stars or less

**3)** Split the dataset randomly into a training set with 80\% of data and a test set with the remaining 20\%
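
As a hint on the pandas and scikit-learn calls involved in the first steps, here is a minimal sketch on a toy DataFrame. The column names `stars` and `text`, the variable names and the toy values are all assumptions made for illustration; check the output of `data.head()` for the actual column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; the column names "stars" and "text" are assumptions
toy = pd.DataFrame({
    "stars": [5, 4, 1, 3, 5, 2, 4, 5, 1, 5],
    "text": [f"review {i}" for i in range(10)],
})

# Number of reviews for each star rating, ordered by rating
star_counts = toy["stars"].value_counts().sort_index()
print(star_counts)

# Binary label derived from the rating: 4 or 5 stars -> "pos", otherwise "neg"
toy["label"] = np.where(toy["stars"] >= 4, "pos", "neg")

# Random 80/20 split; fixing random_state makes the split reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
print(len(toy_train), len(toy_test))
```

Passing `normalize=True` to `value_counts` would give proportions instead of absolute counts.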

**4)** Create a function which accepts a text as input, counts the occurrences of positive and negative words from the Hu and Liu lexicon, and returns `"pos"` if there are more positive words than negative ones, or `"neg"` otherwise
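
One possible shape for such a lexicon-based classifier is sketched below. It is only an illustration under stated assumptions: the two tiny word sets stand in for `hu_liu_pos` and `hu_liu_neg`, the regular-expression tokenizer is a simplification (the notebook imports `nltk`, whose `word_tokenize` could be used instead), and the name `lexicon_label` is hypothetical.

```python
import re

# Tiny stand-ins for the real lexicon sets (assumption for illustration)
toy_pos = {"good", "great", "love"}
toy_neg = {"bad", "boring", "awful"}

def lexicon_label(text, pos_words=toy_pos, neg_words=toy_neg):
    """Return "pos" if positive words outnumber negative ones, else "neg"."""
    # Lowercase the text and split it into word-like tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    # Count every occurrence, so repeated words are counted repeatedly
    n_pos = sum(token in pos_words for token in tokens)
    n_neg = sum(token in neg_words for token in tokens)
    # Ties (including texts with no lexicon words) fall on the "neg" side
    return "pos" if n_pos > n_neg else "neg"

print(lexicon_label("A great cast and a good story, but a boring ending"))
print(lexicon_label("Awful."))
```

With a pandas Series of texts, a function like this can be applied element-wise with `Series.apply`, and the result compared with the known labels.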

**5)** Apply the function above to the test reviews and compute the percentage of cases where the function output matches the known label

**6)** Create a tf.idf vector space model from the training reviews, excluding words that appear in fewer than 3 documents, and extract the document-term matrix for them

**7)** Train a logistic regression classifier on the training reviews, using the representation created above

**8)** Evaluate the accuracy of the classifier on the test set

**9)** Extract the 10 words with the highest regression coefficients and the 10 words with the lowest coefficients

**10)** Create a function which accepts a text as input and returns a list containing only the words from the text that are also present in the Hu and Liu lexicon (each distinct word must appear in the list as many times as it appears in the text)

**11)** Repeat points 6 to 8 with a tf.idf vectorizer which uses the function above to extract tokens from text