# Text Mining Project Work (Group 6)

**Text Classification and Sentiment Analysis**

_Prof. Gianluca Moro, Dott. Ing. Nicola Piscaglia – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The exercises below must be completed by the students of Group 6.
- When finished, the file must contain all the required results (as code cell outputs) along with all the commands needed to reproduce them.
- The purpose of every command or group of related commands must be documented clearly and concisely.
- The submission deadline is the 1st July 2022.
- When finished, one team member must send the notebook file (with the .ipynb extension) by email, from a BBS email account, to the teacher (nicola.piscaglia@bbs.unibo.it) with the subject “[BBS Teamwork] Your last names”, and keep a copy of the file for safety.
- You are allowed to consult the teaching material and to search the Web for quick reference. 
- If still in doubt about anything, ask the teacher.
- Communicating with other teams is strictly forbidden. Ask the teacher for any clarification about the exercises.
- Each correctly developed point is worth 2/30.

## Setup

Run the following to import some necessary packages and download all the needed files.

In [1]:
import os
from urllib.request import urlretrieve
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import LogisticRegression

In [2]:
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [3]:
download("reviews-electronics.csv.gz", "https://www.dropbox.com/s/5aidj1ns3wiuchi/reviews-electronics.csv.gz?dl=1")
download("reviews-home.csv.gz", "https://www.dropbox.com/s/9dlvc0nntibibk3/reviews-home.csv.gz?dl=1")
download("reviews-books.csv.gz", "https://www.dropbox.com/s/otbdd2u7x9ylzku/reviews-books.csv.gz?dl=1")

In [4]:
download("positive-words.txt", "https://www.dropbox.com/s/pmju477pv8ayzho/positive-words.txt?dl=1")
download("negative-words.txt", "https://www.dropbox.com/s/yy4l1ezlrsar8cf/negative-words.txt?dl=1")

## Exercises

1) Load the dataset contained in the `reviews-electronics.csv.gz` file into a new dataframe named `reviews_A`. Then load the dataset contained in `reviews-home.csv.gz` into a new dataframe named `reviews_B`, and the dataset contained in `reviews-books.csv.gz` into a new dataframe named `reviews_C`. Finally, read the opinion word lists from the `positive-words.txt` and `negative-words.txt` files and store them in two new variables, `pos_words` and `neg_words` respectively.
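
The loading step can be sketched on throwaway files, so that it runs without the downloads. This is only an illustration: the `text` column name, the one-word-per-line layout, the `;` comment prefix and the `latin-1` encoding are assumptions about the files, and `load_word_list` is a hypothetical helper.

```python
import gzip
import os
import tempfile

import pandas as pd


def load_word_list(path, encoding="latin-1"):
    """Read one word per line, skipping blank lines and ';' comment lines (assumed layout)."""
    with open(path, encoding=encoding) as f:
        lines = (line.strip() for line in f)
        return {line for line in lines if line and not line.startswith(";")}


# Build tiny stand-in files so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "reviews-demo.csv.gz")
with gzip.open(csv_path, "wt", encoding="utf-8") as f:
    f.write("text,label\nGreat sound,pos\nBroke after a week,neg\n")

words_path = os.path.join(tmp_dir, "positive-demo.txt")
with open(words_path, "w", encoding="latin-1") as f:
    f.write("; comment line\n\ngood\ngreat\n")

# pandas infers gzip compression from the .gz extension.
reviews_demo = pd.read_csv(csv_path)
pos_demo = load_word_list(words_path)
print(reviews_demo.shape, sorted(pos_demo))
```

With the real files, the same calls would take the form `reviews_A = pd.read_csv("reviews-electronics.csv.gz")` and `pos_words = load_word_list("positive-words.txt")`. Storing the word lists as sets keeps membership checks fast when scoring many reviews.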

2) Print the first five rows of the three datasets. Then print the cardinality of the three `reviews_X` datasets and the distribution of the `label` feature.
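
A minimal sketch of the inspection step on a toy dataframe. The `text` column name and the toy rows are assumptions made for illustration; only the `label` column is named in the exercise.

```python
import pandas as pd

# Toy stand-in for one of the reviews_X dataframes; the "text" column name is assumed.
reviews_demo = pd.DataFrame({
    "text": ["good", "bad", "fine", "awful", "great", "nice"],
    "label": ["pos", "neg", "pos", "neg", "pos", "pos"],
})

first_rows = reviews_demo.head(5)                                   # first five rows
n_reviews = len(reviews_demo)                                       # cardinality (number of rows)
label_counts = reviews_demo["label"].value_counts()                 # absolute counts per label
label_shares = reviews_demo["label"].value_counts(normalize=True)   # relative frequencies

print(first_rows)
print(n_reviews)
print(label_counts)
```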

3) Split `reviews_A` into a train set and a test set, using the first half of the reviews as the train set and the second half as the test set.
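
A positional split of this kind can be sketched with `iloc`. The toy dataframe and its `text` column are assumptions; the exercise does not say where the extra row goes when the row count is odd, so here it ends up in the test set.

```python
import pandas as pd

# Toy stand-in for reviews_A with an odd number of rows.
reviews_demo = pd.DataFrame({
    "text": list("abcdefg"),
    "label": ["pos", "neg"] * 3 + ["pos"],
})

half = len(reviews_demo) // 2          # integer division: 7 rows -> 3
train_demo = reviews_demo.iloc[:half]  # first half, by position
test_demo = reviews_demo.iloc[half:]   # remaining rows
print(len(train_demo), len(test_demo))
```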

4) Classify the reviews in the `reviews_A` test set: first assign to each review a score equal to the sum of the scores of the known words within it, then return `"pos"` for reviews with a positive score and `"neg"` for reviews with a negative or zero score.

Score each word:
 - -1 if it is found in the negative word list
 - -2 if it is found in the negative word list and is preceded by the word "very"
 - +1 if it is found in the positive word list
 - +2 if it is found in the positive word list and is preceded by the word "very"


Start with the setup of NLTK and the definition of the scoring function.
Then, apply the function to all the reviews in the `reviews_A` test set.
Finally, compare the predicted labels with the known ones and compute the accuracy as the ratio of matches to the total number of reviews.
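
The scoring rule above could look roughly like the following sketch. To stay self-contained it uses a small regex tokenizer as a stand-in for an NLTK tokenizer; the function names, the toy word lists and the toy reviews are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase regex tokenizer, used here as a stand-in for an NLTK tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def score_review(text, pos_words, neg_words):
    """Sum the word scores: +/-1 for an opinion word, +/-2 when 'very' precedes it."""
    tokens = tokenize(text)
    score = 0
    for i, token in enumerate(tokens):
        weight = 2 if i > 0 and tokens[i - 1] == "very" else 1
        if token in pos_words:
            score += weight
        elif token in neg_words:
            score -= weight
    return score


def classify_review(text, pos_words, neg_words):
    """Return 'pos' for a strictly positive score, 'neg' otherwise (zero included)."""
    return "pos" if score_review(text, pos_words, neg_words) > 0 else "neg"


def accuracy(predicted, actual):
    """Share of positions where the predicted label matches the known one."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Toy word lists and reviews, for illustration only.
pos_demo = {"good", "great"}
neg_demo = {"bad", "noisy"}
texts = ["Very good screen, great price", "Good but very noisy", "It arrived on Monday"]
known = ["pos", "neg", "pos"]
predicted = [classify_review(t, pos_demo, neg_demo) for t in texts]
print(predicted, accuracy(predicted, known))
```

The third toy review contains no opinion words, so its score is zero and it is labelled `"neg"`, as the rule requires. Neutral or lexicon-free reviews are therefore a systematic source of errors for this baseline.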

5) Create a pipeline including a `CountVectorizer` to convert reviews into word count vectors (excluding words that appear in fewer than 3 documents) and a `LogisticRegression` model.

6) Train the model on all `reviews_B` data

7) Evaluate the model on the `reviews_A` test set
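
Points 5 to 7 can be sketched on toy data as follows. The toy texts stand in for `reviews_B` and the `reviews_A` test set, and the step names `"vectorizer"` and `"classifier"` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for reviews_B (training) and the reviews_A test set.
train_texts = ["good product", "good value", "good quality", "really good",
               "bad product", "bad value", "bad quality", "really bad"]
train_labels = ["pos"] * 4 + ["neg"] * 4
test_texts = ["good", "bad"]
test_labels = ["pos", "neg"]

model = Pipeline([
    ("vectorizer", CountVectorizer(min_df=3)),   # drop terms found in fewer than 3 documents
    ("classifier", LogisticRegression()),
])
model.fit(train_texts, train_labels)
test_accuracy = model.score(test_texts, test_labels)   # mean accuracy on the test data
vocabulary = sorted(model.named_steps["vectorizer"].vocabulary_)
print(vocabulary, test_accuracy)
```

Only `good` and `bad` occur in at least three toy documents, so `min_df=3` removes every other term. On the real data, the texts and labels would come from the corresponding dataframe columns.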

8) Create a new pipeline as above, but replacing the `CountVectorizer` in the pipeline with a `TfidfVectorizer` and the `LogisticRegression` model with a `MultinomialNB` one.

9) Fit the new pipeline on all the `reviews_B` data and evaluate the new model on the `reviews_A` test set.

10) Repeat points 8 and 9 but set the `ngram_range` parameter of the `TfidfVectorizer` to use only bigrams.
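
A possible shape for points 8 to 10, again on toy data. `make_tfidf_nb` is a hypothetical helper, and keeping `min_df=3` from point 5 is an assumption about what "as above" means.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def make_tfidf_nb(ngram_range=(1, 1)):
    """Hypothetical helper: TF-IDF features followed by multinomial Naive Bayes."""
    return Pipeline([
        ("vectorizer", TfidfVectorizer(min_df=3, ngram_range=ngram_range)),
        ("classifier", MultinomialNB()),
    ])


# Toy stand-in for the reviews_B training data.
train_texts = ["very good product", "very good value", "very good quality",
               "not good product", "not good value", "not good quality"]
train_labels = ["pos"] * 3 + ["neg"] * 3

unigram_model = make_tfidf_nb().fit(train_texts, train_labels)
bigram_model = make_tfidf_nb(ngram_range=(2, 2)).fit(train_texts, train_labels)  # bigrams only

unigram_vocabulary = sorted(unigram_model.named_steps["vectorizer"].vocabulary_)
bigram_vocabulary = sorted(bigram_model.named_steps["vectorizer"].vocabulary_)
bigram_prediction = bigram_model.predict(["a very good radio", "not good at all"])
print(bigram_vocabulary, list(bigram_prediction))
```

Setting `ngram_range=(2, 2)` makes the lower and upper n-gram bounds both equal to 2, so single words are no longer features. In the toy data this lets the model separate "very good" from "not good", which the unigram features alone cannot do.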

11) Repeat the evaluation of the three models above, this time on all the `reviews_C` data.

12) Tokenize the `reviews_A` train reviews and use them to build a 300-dimensional Word2Vec vector space, using a window size equal to 5 and excluding all the terms that appear fewer than 7 times.

13) Convert the tokenized training reviews into a list of lists of term indices in the Word2Vec model, leaving out terms not present in the model.
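
The conversion can be sketched with a plain dictionary standing in for the model vocabulary. In gensim 4.x the equivalent mapping is exposed as `model.wv.key_to_index`; the helper name and the toy vocabulary below are hypothetical.

```python
def to_index_sequences(tokenized_reviews, key_to_index):
    """Map each token to its vocabulary index, dropping out-of-vocabulary tokens."""
    return [
        [key_to_index[token] for token in review if token in key_to_index]
        for review in tokenized_reviews
    ]


# Plain dict standing in for the vocabulary mapping of a trained Word2Vec model.
key_to_index_demo = {"the": 0, "good": 1, "battery": 2}
tokenized_demo = [["the", "battery", "is", "good"], ["terrible"], []]
sequences_demo = to_index_sequences(tokenized_demo, key_to_index_demo)
print(sequences_demo)
```

A review made up entirely of unknown terms becomes an empty list, which the padding step then has to handle.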



14) Make all index sequences the same length (250 words for each review), trimming longer sequences to that size and padding shorter sequences with zero values.
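
A dependency-free sketch of trimming and padding. Both helpers are hypothetical, and padding and trimming at the end of the sequence is a choice made here, since the exercise does not say which side to use. Shifting every index by one is optional: it is shown because index 0 may otherwise refer to a real word in the vocabulary as well as to padding.

```python
def pad_or_trim(sequence, length=250, pad_value=0):
    """Cut a sequence to `length` items, or right-pad it with `pad_value`."""
    trimmed = list(sequence[:length])
    return trimmed + [pad_value] * (length - len(trimmed))


def shift_indices(sequences, offset=1):
    """Optional step: shift indices so that 0 is free to mean 'padding'."""
    return [[index + offset for index in seq] for seq in sequences]


# Toy index sequences: a short one, an empty one and one longer than 250.
sequences_demo = [[0, 2, 1], [], list(range(300))]
padded_demo = [pad_or_trim(seq) for seq in shift_indices(sequences_demo)]
print([len(seq) for seq in padded_demo])
```

Keras offers `pad_sequences` for the same purpose; by default it pads and truncates at the start of each sequence rather than at the end. If the indices are shifted, the embedding matrix fed to the network needs a matching extra row at position 0.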

15) Train an LSTM or GRU neural network of your choice on the training sequences defined above.

Finally, test the neural network on `reviews_A` test reviews. Try to maximize the accuracy on test data. 

16) Extra: train/fine-tune a transformer-based model (e.g. BERT) on `reviews_A` training reviews and evaluate it on the `reviews_A` test reviews.