# Text Mining Project Work (Group 2)

**Text Classification and Sentiment Analysis**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The provided exercises must be completed by the students of Group 2.
- At the end, the file must contain all the required results (as code cell outputs) along with all the commands necessary to reproduce them.
- The function of every command or group of related commands must be documented clearly and concisely.
- The submission deadline is 1 July 2022.
- When finished, one team member must send the notebook file (with the .ipynb extension) by email, from a BBS email account, to the teacher (nicola.piscaglia@bbs.unibo.it) with the subject “[BBS Teamwork] Your last names”. Keep a copy of the file for safety.
- You are allowed to consult the teaching material and to search the Web for quick reference. 
- If still in doubt about anything, ask the teacher.
- Communicating with other teams is strictly forbidden. Ask the teacher for any clarification about the exercises.
- Each correctly developed point counts 2/30.

## Setup

Run the following to import some necessary packages and download all the needed files.

In [None]:
import os
from urllib.request import urlretrieve
import pandas as pd

In [None]:
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [None]:
download("reviews-books.csv.gz", "https://www.dropbox.com/s/otbdd2u7x9ylzku/reviews-books.csv.gz?dl=1")
download("reviews-electronics.csv.gz", "https://www.dropbox.com/s/6tjqidwp8cwqfkq/reviews-electronics.csv.gz?dl=1")
download("reviews-music.csv.gz", "https://www.dropbox.com/s/radmd3pjerw143z/reviews-music.csv.gz?dl=1")

In [None]:
download("positive-words.txt", "https://www.dropbox.com/s/pmju477pv8ayzho/positive-words.txt?dl=1")
download("negative-words.txt", "https://www.dropbox.com/s/yy4l1ezlrsar8cf/negative-words.txt?dl=1")

## Exercises

1) Load the dataset contained in the `reviews-books.csv.gz` file into a new dataframe named `reviews_A`. Then load the dataset contained in `reviews-electronics.csv.gz` into a new dataframe named `reviews_B`, and the dataset contained in `reviews-music.csv.gz` into a new dataframe named `reviews_C`. Finally, read the opinion word lists from the `positive-words.txt` and `negative-words.txt` files and store them in two new variables, `pos_words` and `neg_words` respectively.
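A minimal sketch of the loading step, run on throwaway stand-in files so that it is self-contained. The file names, the `text` column, the `load_words` helper and the handling of `;` comment lines are all assumptions made for illustration; only the `label` column is named in the exercises.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny stand-in for a gzipped reviews file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "toy-reviews.csv.gz")
with gzip.open(csv_path, "wt", encoding="utf-8") as f:
    f.write("text,label\n")
    f.write("A wonderful read,pos\n")
    f.write("Boring and far too long,neg\n")

# pandas infers gzip compression from the .gz extension.
toy_reviews = pd.read_csv(csv_path)

# Stand-in word list: one word per line, with an optional ';' comment header
# (whether the real files contain such a header is an assumption).
words_path = os.path.join(tmp_dir, "toy-words.txt")
with open(words_path, "w", encoding="utf-8") as f:
    f.write("; header comment\n\nwonderful\ngreat\n")


def load_words(path):
    """Read one word per line into a set, skipping blank and ';' lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith(";")}


toy_pos_words = load_words(words_path)
print(toy_reviews.shape, sorted(toy_pos_words))
```

Storing the words in a set keeps membership checks fast when every token of every review has to be looked up later.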

2) Print the first five rows of each of the three datasets. Then, print the cardinality of the three `reviews_X` datasets and the distribution of the `label` feature.
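As a hedged illustration of the inspection step, the sketch below uses a small hypothetical dataframe (the `text` column name and the rows are invented) and shows the three pandas calls that cover it.

```python
import pandas as pd

# Hypothetical stand-in for one of the reviews_X dataframes.
toy = pd.DataFrame({
    "text": ["Loved it", "Not for me", "Great value"],
    "label": ["pos", "neg", "pos"],
})

print(toy.head(5))                    # first five rows (fewer if the frame is shorter)
print(len(toy))                       # cardinality: number of reviews
label_counts = toy["label"].value_counts()
print(label_counts)                   # distribution of the label feature
```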

3) Split `reviews_A` into a train set and a test set, using the first half of the reviews as the train set and the second half as the test set.
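A positional split of this kind can be sketched with `iloc`. The dataframe below is a hypothetical stand-in; with an odd number of rows, integer division leaves the extra row in the test half, which is one possible convention rather than a requirement of the exercise.

```python
import pandas as pd

# Hypothetical stand-in for reviews_A.
toy = pd.DataFrame({
    "text": ["r1", "r2", "r3", "r4", "r5"],
    "label": ["pos", "neg", "pos", "neg", "pos"],
})

half = len(toy) // 2          # index of the first row of the second half
toy_train = toy.iloc[:half]   # first half, in original order
toy_test = toy.iloc[half:]    # second half, in original order
print(len(toy_train), len(toy_test))
```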

4) Classify the reviews in the `reviews_A` test set by first assigning each one a score equal to the number of known positive words it contains minus the number of known negative words, then returning "pos" for reviews with a positive score and "neg" for reviews with a negative or zero score.

Start with the setup of NLTK and the definition of the scoring function.
Then, apply the function to all the `reviews_A` in the test set.
Finally, compare the obtained labels with the known ones and compute the accuracy as the ratio of matches.
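The scoring rule described above can be sketched as follows. To keep the example free of downloads, a simple regular-expression tokenizer stands in for an NLTK tokenizer, and the word sets, texts and labels are invented for illustration.

```python
import re

# Hypothetical miniature opinion lexicons.
toy_pos = {"good", "great", "love"}
toy_neg = {"bad", "boring", "hate"}


def tokenize(text):
    # Stand-in for an NLTK tokenizer: lowercase, keep alphabetic runs.
    return re.findall(r"[a-z']+", text.lower())


def score(text):
    # Number of positive words minus number of negative words.
    tokens = tokenize(text)
    return sum(t in toy_pos for t in tokens) - sum(t in toy_neg for t in tokens)


def classify(text):
    # A zero score falls on the "neg" side, as the exercise specifies.
    return "pos" if score(text) > 0 else "neg"


texts = ["A great book, I love it", "Boring plot and bad writing", "It is a book"]
labels = ["pos", "neg", "pos"]

preds = [classify(t) for t in texts]
# Accuracy as the ratio of predictions that match the known labels.
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```

The third toy review contains no lexicon words, so it scores zero and is labelled "neg"; this is the typical failure mode of a purely lexicon-based classifier.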

5) Create a pipeline consisting of a `CountVectorizer`, to convert reviews into word count vectors, followed by a `LogisticRegression` model.

6) Train the model on all the `reviews_B` data.

7) Evaluate the model on the `reviews_A` test set.
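Points 5 to 7 can be sketched on toy data as shown here. The texts, labels and step names are invented, and `max_iter=1000` is an assumption added only to avoid convergence warnings; the exercise itself does not prescribe any hyperparameters.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for the training domain and the test domain.
train_texts = ["great sound quality", "terrible battery life",
               "works great", "broke after a week, terrible"]
train_labels = ["pos", "neg", "pos", "neg"]
test_texts = ["a great story", "terrible ending"]
test_labels = ["pos", "neg"]

count_model = Pipeline([
    ("vectorizer", CountVectorizer()),                  # text -> word counts
    ("classifier", LogisticRegression(max_iter=1000)),  # counts -> label
])

count_model.fit(train_texts, train_labels)
test_predictions = count_model.predict(test_texts)
accuracy = count_model.score(test_texts, test_labels)   # mean accuracy
print(test_predictions, accuracy)
```

Because the vectorizer is fitted inside the pipeline, the vocabulary comes only from the training texts; words that appear only in the test domain are ignored at prediction time.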

8) Create a new pipeline as above, but replace the `CountVectorizer` with a `TfidfVectorizer`.

9) Fit the new pipeline on all the `reviews_B` data.

10) Evaluate the new model on the `reviews_A` test set.

11) Repeat points 8, 9 and 10, but set the `ngram_range` parameter of the `TfidfVectorizer` to include bigrams.
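The TF-IDF variants differ from the count-based pipeline only in the vectorizer. A minimal sketch of the bigram version follows, on invented toy data and with the same assumed `max_iter=1000`; `ngram_range=(1, 2)` keeps unigrams and adds bigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data.
train_texts = ["great battery life", "terrible battery life",
               "great sound", "terrible sound"]
train_labels = ["pos", "neg", "pos", "neg"]

tfidf_bigram_model = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("classifier", LogisticRegression(max_iter=1000)),
])
tfidf_bigram_model.fit(train_texts, train_labels)

# The fitted vocabulary now contains two-word terms as well as single words.
vocabulary = tfidf_bigram_model.named_steps["vectorizer"].vocabulary_
bigrams = sorted(term for term in vocabulary if " " in term)
print(bigrams)
```

Dropping the `ngram_range` argument gives the plain TF-IDF pipeline of point 8, since the default range covers unigrams only.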

12) Repeat the evaluation of the three models above, this time on all the `reviews_C` data.

13) Get the predictions of each classification model used so far on the `reviews_A` test data (you should already have the predictions of the first, unsupervised model). Then, for each pair of the four classifiers considered so far, use McNemar's test to indicate whether the predictions of the two classifiers on the `reviews_A` test set are significantly different (consider a 95% confidence level: a p-value > 0.05 means the two models are not significantly different, i.e. statistically similar).

To obtain the p-value, call the provided `mcnemar_pval` function, passing the arrays of labels predicted by the two models and the array of true labels.

```
mcnemar_pval(model1_predictions, model2_predictions, reviews_A_test["label"])
```
Note: the McNemar test cannot be used to compare two models evaluated on different test data.


In [None]:
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pval(p1, p2, y_test):
    model1_errors = p1 != y_test
    model2_errors = p2 != y_test

    # define contingency table
    mc_table = pd.crosstab(model1_errors, model2_errors)
    
    print(mc_table)
    
    # calculate mcnemar test
    mc_result = mcnemar(mc_table)
    return mc_result.pvalue
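To make the p-value less of a black box, here is a standard-library sketch of an exact McNemar test. It is believed to match what the `mcnemar` function of `statsmodels` computes with its default `exact=True` setting, but treat that equivalence as an assumption. The function name `exact_mcnemar_pvalue` and the error vectors are hypothetical.

```python
from math import comb


def exact_mcnemar_pvalue(errors_1, errors_2):
    """Two-sided exact McNemar p-value from two boolean error vectors."""
    # Discordant pairs: reviews that exactly one of the two models gets wrong.
    b = sum(e1 and not e2 for e1, e2 in zip(errors_1, errors_2))
    c = sum(e2 and not e1 for e1, e2 in zip(errors_1, errors_2))
    n = b + c
    k = min(b, c)
    # Binomial(n, 0.5) probability of seeing at most k in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example: model 1 alone is wrong 4 times, model 2 alone once.
errors_1 = [True, True, True, True, False, False]
errors_2 = [False, False, False, False, False, True]
p_value = exact_mcnemar_pvalue(errors_1, errors_2)
print(p_value)  # 0.375
```

Only the discordant pairs matter: reviews that both models get right, or both get wrong, carry no information about which model is better.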



14) Build a 300-dimensional Word2Vec vector space on all the `reviews_A` data, using a window size equal to 7 and excluding all terms that appear fewer than 10 times.

15) Find the 25 words most similar to the word "*interesting*" in the vector space just built.

16) Extra: train/fine-tune a transformer-based model (e.g. BERT) on `reviews_A` training reviews and evaluate it on the `reviews_A` test reviews.