# Text Mining Project Work (Group 1)

**Text Classification and Sentiment Analysis**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The exercises below must be completed by the students of Group 1.
- When finished, the file must contain all the required results (as code cell outputs) along with all the commands needed to reproduce them.
- The purpose of every command or group of related commands must be documented clearly and concisely.
- The submission deadline is 1 July 2022.
- When finished, one team member must email the notebook file (with the .ipynb extension) from their BBS email account to the teacher (nicola.piscaglia@bbs.unibo.it), using “[BBS Teamwork] Your last names” as the subject. Keep your own copy of the file for safety.
- You are allowed to consult the teaching material and to search the Web for quick reference. 
- If you are still in doubt about anything, ask the teacher.
- Communicating with other teams is strictly forbidden. Ask the teacher for any clarification about the exercises.
- Each correctly developed point counts 2/30.

## Setup

The following cell contains the necessary imports.

In [None]:
import numpy as np
import pandas as pd
import gzip
import json
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import os
from urllib.request import urlretrieve

Run the following to download the necessary files

In [None]:
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [None]:
download("Magazine_Subscriptions.json.gz", "https://www.dropbox.com/s/g6om8q8c8pvirw8/Magazine_Subscriptions.json.gz?dl=1")

In [None]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Exercises

**1)** The `Magazine_Subscriptions.json.gz` file contains a dataset of reviews posted on Amazon.com about magazine subscriptions.
Each review is labeled with a score between 1 and 5 stars (the `overall` feature).

The text of each review is stored in the `reviewText` feature, which will be our input data along with `overall`.

Load the dataset into a new pandas DataFrame.
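As a rough illustration of the loading step, the sketch below reads a gzip-compressed file into a DataFrame. It assumes the file stores one JSON object per line, which should be checked against the real file. The helper `load_reviews` and the tiny stand-in file are hypothetical and exist only so the example runs without the actual dataset.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_reviews(path):
    """Read a gzip-compressed file with one JSON object per line (assumed layout)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return pd.DataFrame(records)


# Tiny stand-in file (hypothetical data) so the sketch runs on its own.
sample = [
    {"overall": 5.0, "reviewText": "Great magazine"},
    {"overall": 1.0, "reviewText": "Never arrived"},
    {"overall": 4.0, "reviewText": "Good value"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

sample_df = load_reviews(sample_path)
print(len(sample_df))    # number of rows
print(sample_df.head())  # first rows
```

If the one-object-per-line assumption holds, `pd.read_json(path, lines=True)` is a shorter alternative.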

**2)** Print the number of rows in the dataset and display the first 5 rows.

**3)** Undersample the data on the `overall` feature to obtain a class-balanced dataset.
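One possible way to balance classes is sketched below with pandas alone: every star class is randomly cut down to the size of the rarest one. This is a plain pandas stand-in, not the `RandomUnderSampler` imported in Setup, and the toy DataFrame and `random_state` value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical imbalanced toy data: five 5-star, three 4-star, two 1-star reviews.
toy = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(10)],
    "overall": [5, 5, 5, 5, 5, 4, 4, 4, 1, 1],
})

# Size of the rarest class: every class is reduced to this many rows.
n_min = int(toy["overall"].value_counts().min())

# Randomly keep n_min rows per star rating.
balanced = (
    toy.groupby("overall")
       .sample(n=n_min, random_state=42)
       .reset_index(drop=True)
)
print(balanced["overall"].value_counts())
```

With imbalanced-learn, `RandomUnderSampler(random_state=...).fit_resample(X, y)` should give an equivalent result, with `X` holding the features and `y` the `overall` column.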



**4)** Cast the `reviewText` column to Unicode strings.



**5)** Select only the `reviewText` and `overall` features from the data and put them in a new DataFrame.





**6)** Inspect the distribution of the star ratings.
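Points 4 to 6 can be sketched together on a small, made-up DataFrame: casting the text column, keeping the two relevant columns, and counting reviews per star rating. The toy rows and the extra `verified` column are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; note the non-string value in reviewText.
toy = pd.DataFrame({
    "reviewText": ["Great read", 12345, "Too many ads", "Great read again"],
    "overall": [5.0, 4.0, 2.0, 5.0],
    "verified": [True, True, False, True],
})

# Cast every review text to a Python (Unicode) string.
toy["reviewText"] = toy["reviewText"].astype(str)

# Keep only the two columns used in the following points.
reviews = toy[["reviewText", "overall"]]

# Count how many reviews fall under each star rating.
star_counts = reviews["overall"].value_counts().sort_index()
print(star_counts)
```

Note that `astype(str)` turns missing values into the literal string `"nan"`, so it may be worth checking for missing texts first.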

**7)** Remove from the dataframe the reviews rated with 3 stars.

**8)** Add a `label` column to the DataFrame whose value is `"pos"` for reviews with 4 stars, `"very_pos"` for 5-rated reviews, `"neg"` for reviews with 2 stars and `"very_neg"` for 1-rated reviews.
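The filtering and labeling in points 7 and 8 could look like the sketch below. The toy DataFrame and the name `star_to_label` are illustrative assumptions; the mapping itself follows the text of point 8.

```python
import pandas as pd

# Hypothetical toy data with one review per star rating.
toy = pd.DataFrame({
    "reviewText": ["awful", "meh", "fine", "good", "superb"],
    "overall": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Drop the neutral 3-star reviews.
toy = toy[toy["overall"] != 3].copy()

# Star rating -> label, as specified in point 8.
star_to_label = {1: "very_neg", 2: "neg", 4: "pos", 5: "very_pos"}
toy["label"] = toy["overall"].astype(int).map(star_to_label)
print(toy)
```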

**9)** Split the dataset randomly into a training set with 70% of data and a test set with the remaining 30%, stratifying the split by the `label` variable
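A stratified 70/30 split can be sketched with `train_test_split`, already imported in Setup. The toy data and the `random_state` value are assumptions; the latter only makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 positive and 10 negative reviews.
toy = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(20)],
    "label": ["pos"] * 10 + ["neg"] * 10,
})

# stratify keeps the label proportions identical in both subsets.
train_df, test_df = train_test_split(
    toy, test_size=0.3, stratify=toy["label"], random_state=42
)
print(len(train_df), len(test_df))
print(test_df["label"].value_counts())
```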

**10)** Create a TF-IDF vector space model from the training reviews, excluding words that appear in fewer than 3 documents and using bigrams in addition to single words. Then extract the document-term matrix for the training reviews.
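A minimal sketch of such a vector space model follows, using `TfidfVectorizer` with `min_df=3` (drop terms found in fewer than 3 documents) and `ngram_range=(1, 2)` (unigrams plus bigrams). The toy corpus is made up, and the default tokenizer is used here as an assumption; the assignment may intend a different one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
train_texts = [
    "great magazine great price",
    "great magazine arrived late",
    "great magazine for kids",
    "terrible service",
]
test_texts = ["great magazine", "terrible price"]

# min_df=3: keep only terms present in at least 3 training documents.
# ngram_range=(1, 2): use single words and pairs of consecutive words.
vect = TfidfVectorizer(min_df=3, ngram_range=(1, 2))

# Learn the vocabulary on the training set only, then reuse it for the test set.
X_train = vect.fit_transform(train_texts)
X_test = vect.transform(test_texts)

print(sorted(vect.vocabulary_))
print(X_train.shape, X_test.shape)
```

Fitting on the training reviews only and calling `transform` on the test reviews keeps test-set information out of the vocabulary and IDF weights.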

**11)** Train a logistic regression classifier on the training reviews, using the representation created above

**12)** Verify the accuracy of the classifier on the test set

**13)** Get the model predictions and print the confusion matrix
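Points 11 to 13 can be sketched together on a toy corpus: fit a logistic regression on TF-IDF features, measure accuracy on held-out texts, and build the confusion matrix. The texts, the default `TfidfVectorizer` settings (used for brevity instead of those in point 10), and `max_iter=1000` are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy data.
train_texts = ["love it", "really love it", "hate it", "really hate it"]
y_train = ["pos", "pos", "neg", "neg"]
test_texts = ["love", "hate"]
y_test = ["pos", "neg"]

vect = TfidfVectorizer()
X_train = vect.fit_transform(train_texts)
X_test = vect.transform(test_texts)

# Train the classifier on the training document-term matrix.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict on the test set and compare with the true labels.
preds = clf.predict(X_test)
acc = accuracy_score(y_test, preds)
cm = confusion_matrix(y_test, preds, labels=clf.classes_)

print(acc)
print(clf.classes_)
print(cm)  # rows: true labels, columns: predicted labels
```

`clf.score(X_test, y_test)` returns the same accuracy, and `ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_).plot()` draws the matrix if matplotlib is available.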

**14)** Train a deep learning model of your choice (excluding transformer-based models such as BERT) on the document-term representation built in point 10 and evaluate it on the test data. Try to maximize the model's accuracy. Using recurrent layers is optional.

**15)** Get the predictions of this latter model and compare them with those of the Logistic Regression model trained in point 11, using the McNemar test at a 95% confidence level (significance level 0.05): a p-value > 0.05 means the difference between the two models is not statistically significant.

Hint: convert the predictions of both models to integer arrays so that they can be compared.

To obtain the p-value, use the provided `mcnemar_pval` function, passing it the arrays of labels predicted by the two models and the array of true labels.

```
mcnemar_pval(model1_predictions, model2_predictions, y_test)
```

Note: the McNemar test requires both models to be evaluated on the same test set; it cannot compare models evaluated on different test data.
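To make the hint concrete, the sketch below maps string labels to integers and then computes an exact two-sided McNemar p-value from the discordant pairs, using only NumPy and the standard library. The names `label_to_int`, `encode`, and `exact_mcnemar_pvalue`, as well as the toy predictions, are illustrative assumptions. This is a stand-in to show what the test measures, not the provided `mcnemar_pval`.

```python
from math import comb

import numpy as np

# Hypothetical label encoding shared by both models and the ground truth.
label_to_int = {"very_neg": 0, "neg": 1, "pos": 2, "very_pos": 3}


def encode(labels):
    """Turn a sequence of string labels into an integer array."""
    return np.array([label_to_int[label] for label in labels])


def exact_mcnemar_pvalue(pred_a, pred_b, y_true):
    """Two-sided exact (binomial) McNemar p-value for two sets of predictions."""
    pred_a, pred_b, y_true = map(np.asarray, (pred_a, pred_b, y_true))
    a_wrong = pred_a != y_true
    b_wrong = pred_b != y_true
    # Discordant pairs: samples where exactly one of the two models is wrong.
    only_a_wrong = int(np.sum(a_wrong & ~b_wrong))
    only_b_wrong = int(np.sum(~a_wrong & b_wrong))
    n = only_a_wrong + only_b_wrong
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(only_a_wrong, only_b_wrong)
    # Under the null hypothesis, discordant pairs split 50/50 between the models.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


truth = ["very_neg", "neg", "pos", "very_pos", "very_neg", "neg", "pos", "very_pos"]
model_a = ["neg", "neg", "pos", "very_pos", "very_neg", "neg", "pos", "very_pos"]
model_b = ["very_neg", "very_neg", "neg", "pos", "very_neg", "neg", "pos", "very_pos"]

p_value = exact_mcnemar_pvalue(encode(model_a), encode(model_b), encode(truth))
print(p_value)
```

Only the samples on which the two models disagree in correctness influence the result, which is why both models must be scored on the same test examples.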

In [None]:
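from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pval(p1, p2, y_test):
    """Return the McNemar test p-value comparing two models' errors on y_test."""
    # Convert to plain arrays so the comparison is positional
    # (a pandas Series would otherwise be aligned on its index)
    y_true = np.asarray(y_test)
    model1_errors = np.asarray(p1) != y_true
    model2_errors = np.asarray(p2) != y_true

    print(model1_errors, model2_errors)

    # Contingency table: model 1 wrong? (rows) vs model 2 wrong? (columns)
    mc_table = pd.crosstab(model1_errors, model2_errors)

    print(mc_table)

    # Run the McNemar test on the table and return its p-value
    mc_result = mcnemar(mc_table)
    return mc_result.pvalue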



**16)** Extra: train or fine-tune a transformer-based model (e.g. BERT) on the training reviews and evaluate it on the test reviews.