# Case Study 3: Textual analysis of movie reviews

**Due Date: December 5, 2023, BEFORE the beginning of class at 12:00pm ET**

NOTE: There are always last minute issues submitting the case studies. DO NOT WAIT UNTIL THE LAST MINUTE!

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*o-qaS9WPD9ocA9Ofr85v5g.png">

**TEAM Members:** Please EDIT this cell and add the names of all the team members in your team

    member 1
    
    member 2
    
    ...

**Desired outcome of the case study.**
* In this case study we will look at movie reviews from the v2.0 polarity dataset, which comes from http://www.cs.cornell.edu/people/pabo/movie-review-data.
    * It contains written reviews of movies divided into positive and negative reviews.
* As in Case Study 2, the idea is to *analyze* the data set, make *conjectures*, support or refute those conjectures with *data*, and *tell a story* about the data!
    
**Required Readings:**
* This case study will be based upon the scikit-learn Python library
* We will build upon the tutorial "Working With Text Data" which can be found at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* In particular, this case study is quite similar to "Exercise 2: Sentiment Analysis on movie reviews" on the above web page.
* Read about deep learning at https://scikit-learn.org/stable/modules/neural_networks_supervised.html


**Case study assumptions:**
* You have access to a Python installation

**Required Python libraries:**
* Numpy (www.numpy.org) (should already be installed from Case Study 2)
* Matplotlib (matplotlib.org) (should already be installed from Case Study 2)
* Scikit-learn (scikit-learn.org).
* You are also welcome to use the Python Natural Language Processing Toolkit (www.nltk.org) (though it is not required).

**NOTE**
* Please don't forget to save the notebook frequently when working in IPython Notebook, otherwise the changes you made can be lost.

# Getting the data onto Colab example.

In [None]:
! wget https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

Look for the directory txt_sentoken

In [None]:
! tar xzf review_polarity.tar.gz
! ls

## Problem 1 (10 points): Complete Exercise 2: Sentiment Analysis on movie reviews from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

* Installing scikit-learn using Anaconda does not necessarily download the example source-code.
* Accordingly, you may need to download these directly from Github at https://github.com/scikit-learn/scikit-learn:
    * The data can be downloaded using doc/tutorial/text_analytics/data/movie_reviews/fetch_data.py
    * A skeleton for the solution can be found in doc/tutorial/text_analytics/skeletons/exercise_02_sentiment.py
    * A completed solution can be found in doc/tutorial/text_analytics/solutions/exercise_02_sentiment.py
* Here is a direct link to the code to help you out:  https://github.com/scikit-learn/scikit-learn/tree/main/doc/tutorial/text_analytics
* **It is ok to use the solution provided in the scikit-learn distribution as a starting place for your work.**

### Modify the solution to Exercise 2 so that it can run in this IPython notebook
* This will likely involve moving around data files and/or small modifications to the script.

In [None]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

## Problem 2 (10 points): Explore the scikit-learn TfidfVectorizer class

**Read the documentation for the TfidfVectorizer class at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.**
* Define the term frequency–inverse document frequency (TF-IDF) statistic (http://en.wikipedia.org/wiki/Tf%E2%80%93idf will likely help).
* Run the TfidfVectorizer class on the training data above (docs_train).
* Explore the min_df and max_df parameters of TfidfVectorizer. What do they mean? How do they change the features you get?
* Explore the ngram_range parameter of TfidfVectorizer. What does it mean? How does it change the features you get? (Note: large values of ngram_range may take a long time to run!)
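
As a starting point for this kind of exploration, here is a minimal sketch on a tiny made-up corpus (`toy_docs` and the helper `vocabulary_size` are illustrative names, not part of the assignment or of scikit-learn). It only counts how many features survive under a few settings; running the same idea on docs_train is left to you.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for docs_train (illustrative only)
toy_docs = [
    "a great movie with a great cast",
    "a dull movie with a weak plot",
    "great plot and great acting in this movie",
    "weak acting and a dull script",
]

def vocabulary_size(**kwargs):
    """Fit a TfidfVectorizer with the given settings and return the feature count."""
    vectorizer = TfidfVectorizer(**kwargs)
    vectorizer.fit(toy_docs)
    return len(vectorizer.vocabulary_)

baseline = vocabulary_size()                         # default settings
rare_removed = vocabulary_size(min_df=2)             # keep terms found in at least 2 documents
common_removed = vocabulary_size(max_df=0.5)         # drop terms found in more than 50% of documents
with_bigrams = vocabulary_size(ngram_range=(1, 2))   # unigrams plus bigrams

print("default:", baseline)
print("min_df=2:", rare_removed)
print("max_df=0.5:", common_removed)
print("ngram_range=(1, 2):", with_bigrams)
```

An integer min_df or max_df is read as a document count, while a float is read as a fraction of the documents.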

In [None]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

## Problem 3 (15 points): Machine learning algorithms


* Based upon Problem 2 pick some parameters for TfidfVectorizer
    * "fit" your TfidfVectorizer using docs_train
    * Compute "Xtrain", a Tf-idf-weighted document-term matrix using the transform function on docs_train
    * Compute "Xtest", a Tf-idf-weighted document-term matrix using the transform function on docs_test
    * Note, be sure to use the same Tf-idf-weighted class (**"fit" using docs_train**) to transform **both** docs_test and docs_train
* Examine two classifiers provided by scikit-learn
    * LinearSVC
    * KNeighborsClassifier
    * Which one performs better? Why do you think it might be working better?
* For a particular choice of parameters and classifier, look at 2 examples where the prediction was incorrect.
    * Can you conjecture on why the classifier made a mistake for this prediction?
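
The fit-once, transform-twice pattern described above can be sketched as follows. This is a minimal sketch on made-up data: `toy_train`, `toy_test`, and their labels are hypothetical stand-ins for docs_train and docs_test, and the classifier settings are arbitrary rather than recommended.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy stand-ins for docs_train / docs_test and their labels (1 = positive, 0 = negative)
toy_train = ["great fun film", "wonderful great acting", "boring awful film", "awful dull plot"]
toy_y_train = [1, 1, 0, 0]
toy_test = ["great wonderful film", "dull boring plot"]
toy_y_test = [1, 0]

vectorizer = TfidfVectorizer()
vectorizer.fit(toy_train)                  # learn vocabulary and IDF weights from training data only
X_train = vectorizer.transform(toy_train)
X_test = vectorizer.transform(toy_test)    # same fitted vectorizer, so the columns line up

for clf in (LinearSVC(), KNeighborsClassifier(n_neighbors=1)):
    clf.fit(X_train, toy_y_train)
    predictions = clf.predict(X_test)
    # Collect the misclassified documents so they can be read and discussed
    wrong = [doc for doc, p, y in zip(toy_test, predictions, toy_y_test) if p != y]
    print(type(clf).__name__, "accuracy:", clf.score(X_test, toy_y_test), "misclassified:", wrong)
```

Fitting a second vectorizer on the test documents would produce a different vocabulary, and the classifier's learned weights would no longer match the columns.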

In [None]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

## Problem 4 (15 points): Using pre-trained models from Hugging Face






#### Installing some necessary packages for this problem

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install transformers[torch]
!pip install evaluate

#### **Checking That The Notebook Has a GPU.**

#### **This is very important, as the fine-tuning will take very long, or the notebook may even crash, without having a GPU. Do not continue with the rest of problem 4 without seeing "Success!" in the printout from the cell below.**

In [None]:
import torch

if torch.cuda.is_available():
      device = torch.device("cuda:0")
      print(f"Success!\nUsing GPU: ({torch.cuda.get_device_name(device=device)})")
else:
      print("\nNo GPU found.\n\nDO NOT CONTINUE TO PROBLEM 4 or 5 UNTIL YOU SEE \"Success!\" PRINTED OUT.\n\nDo the following:\n1. Save your colab file.\n2. Click on Runtime -> Change runtime type -> T4 GPU -> OK (Don't worry about losing progress if the runtime needs to restart, we just saved the file in step #1) -> Save.\n3. Wait 5-10 sec.\n4. Click on Runtime -> View Resources. You should see \"GPU RAM\" as one of the charts.\n5. Rerun this cell.")
      device = torch.device("cpu")


### 1. Using Hugging Face Models "Off-The-Shelf"

Go to [hugging face's model hub](https://huggingface.co/models) and search for a sentiment model.

Can you find one for reviews, movies, or something else that fits the problem well?

Only use models that have the following:
- Have summary statistics of its performance on a test or validation dataset  
- Have a python API (you should see a window for it at the bottom of the page)
- Take **RAW TEXT** (not the vectorized words) as input into the model
- Output **POSITIVE** or **NEGATIVE** (if its output is pos/neg/neutral, this is fine, but you will need to transform this output to only pos/neg and describe how you handle **neutral** outputs)

Report the url of the model's page you found in the box below.

In the code cell below, evaluate the performance of your chosen pre-trained model on the v2.0 polarity dataset. What are the accuracy, F1 score, and other metrics?
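
If your chosen model also emits a neutral class, one possible way to reduce its output to pos/neg and then score it is sketched below. The label names, the scores in `fake_outputs`, and the rule in `to_binary` are all assumptions made for illustration; your model's output format may differ, and other ways of handling neutral are equally defensible.

```python
from sklearn.metrics import accuracy_score, f1_score

def to_binary(scores):
    """Map a dict of per-label scores to 1 (positive) or 0 (negative).

    Neutral is ignored: the larger of the positive and negative scores decides.
    """
    return 1 if scores["positive"] > scores["negative"] else 0

# Hypothetical model outputs for four reviews (not real predictions)
fake_outputs = [
    {"positive": 0.70, "neutral": 0.20, "negative": 0.10},
    {"positive": 0.10, "neutral": 0.60, "negative": 0.30},
    {"positive": 0.35, "neutral": 0.40, "negative": 0.25},
    {"positive": 0.05, "neutral": 0.15, "negative": 0.80},
]
y_true = [1, 0, 0, 0]
y_pred = [to_binary(s) for s in fake_outputs]

print("predictions:", y_pred)
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```

Whatever rule you choose, state it explicitly in your write-up, since it affects every metric you report.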




#### URL TO MODEL PAGE: _________________
#### Hugging Face Reported TEST/VALIDATION PERFORMANCE: _________________
#### v2.0 Polarity Dataset TEST PERFORMANCE: _________________

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer

In [None]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary


### 2. Using A PyTorch Model From Hugging Face

Here, we will be using a movie sentiment model made by JamesH. The model card can be found [here](https://huggingface.co/JamesH/Movie_review_sentiment_analysis_model).

Report the url of the model's page you found in the box below.

In the code cells below, evaluate the performance of your chosen pre-trained model on the v2.0 polarity dataset. What's the accuracy, F1 Score, and other metrics?

#### URL TO MODEL PAGE: _________________
#### Hugging Face Reported TEST/VALIDATION PERFORMANCE: _________________
#### v2.0 Polarity Dataset TEST PERFORMANCE: _________________

#### Loading the model

In [None]:
james_h_model = AutoModelForSequenceClassification.from_pretrained("JamesH/autotrain-third-project-1883864250")
james_h_tokenizer = AutoTokenizer.from_pretrained("JamesH/autotrain-third-project-1883864250")

#### Example API of how to interface with the JamesH model

In [None]:
input_phrase = "That movie wasn't good. For several reasons. Firstly, there isn't anything special about it"
with torch.no_grad():
  inputs = james_h_tokenizer(input_phrase, return_tensors="pt")
  outputs = james_h_model(**inputs)
  print(outputs)
  # outputs is a model output object; its logits hold one score per class
  # Convert the logits to probabilities that sum to 1 with a softmax
  logits = outputs["logits"]
  probabilities = torch.softmax(logits, dim=-1)
  print(probabilities)
  # Let's make it look more readable
  # The index order below is assumed; james_h_model.config.id2label shows the model's label mapping
  positive = probabilities[0,0]
  negative = probabilities[0,1]
  print("\n")
  print(f"Phrase: {input_phrase}\nPositive: {positive:.3f}\nNegative: {negative:.3f}")

  if positive > 0.5:
    print(f"The phrase has positive sentiment")
  else:
    print(f"The phrase has negative sentiment")

In [None]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

### 3. Fine-tuning a Pre-Trained Model On Our Dataset

Here, we will be using the JamesH model as the starting point for training a new model specifically for our dataset.

You will be tasked with finding hyperparameters for the fine-tuning process that improve on the original model's performance.

Then you will report the performance of the fine-tuned model.

In [None]:
import os
import pandas as pd
import random
import datasets
import numpy as np
import evaluate

df_train = pd.DataFrame(columns=['label', 'text'])
df_test = pd.DataFrame(columns=['label', 'text'])

pos_reviews = []
neg_reviews = []

# Getting all positive reviews from disk
for filename in os.listdir("./txt_sentoken/pos"):
    with open(f"./txt_sentoken/pos/{filename}", "r") as f:
        pos_reviews.append(f.read())

# Getting all negative reviews from disk
for filename in os.listdir("./txt_sentoken/neg"):
    with open(f"./txt_sentoken/neg/{filename}", "r") as f:
        neg_reviews.append(f.read())

# Randomly shuffle both lists for splitting into training and test sets
random.shuffle(pos_reviews)
random.shuffle(neg_reviews)

# Split each class into test and training rows, then build the DataFrames
# This dataset format is compatible with most Hugging Face wrappers in PyTorch
test_percentage = 0.2

train_rows = []
test_rows = []

# Label 0 = negative, label 1 = positive
for label, reviews in [(0, neg_reviews), (1, pos_reviews)]:
  n_test = int(len(reviews) * test_percentage)
  for i, text in enumerate(reviews):
    row = {"label": label, "text": text}
    # The first test_percentage of each shuffled list goes to the test set
    if i < n_test:
      test_rows.append(row)
    else:
      train_rows.append(row)

# Build each DataFrame once instead of concatenating row by row
df_train = pd.DataFrame(train_rows, columns=['label', 'text'])
df_test = pd.DataFrame(test_rows, columns=['label', 'text'])


dataset_train = datasets.Dataset.from_pandas(df_train)
dataset_test = datasets.Dataset.from_pandas(df_test)


def tokenize_function(examples):
    return james_h_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_train = dataset_train.map(tokenize_function, batched=True)
tokenized_test = dataset_test.map(tokenize_function, batched=True)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

### See the [documentation](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#transformers.TrainingArguments) for the hyperparameters that can be adjusted during fine-tuning.

### Play around with different parameters, specifically with:
- learning rate
- weight decay
- max gradient norm
- adam epsilon
- num_train_epochs


What happens when you change them? Does the fine-tuning take less time? More time? Does the performance improve or worsen? Try a few different configurations, record the results, and hypothesize (or if you can, explain!) why these hyperparameter changes had the observed effect.

*Note: Do NOT change the **output_dir**, **use_cpu**, or **evaluation_strategy** arguments.*
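
One way to keep such experiments organized is to enumerate the configurations up front and record a result for each. The sketch below is only a bookkeeping aid: the candidate values in `search_space` and the helper `make_configs` are hypothetical, not recommendations, and the actual training call is left as a comment.

```python
from itertools import product

# Hypothetical candidate values; these are not recommendations
search_space = {
    "learning_rate": [5e-5, 2e-5],
    "weight_decay": [0.0, 0.01],
    "num_train_epochs": [1, 3],
}

def make_configs(space):
    """Return one keyword-argument dict per combination in the search space."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]

configs = make_configs(search_space)
results = []  # fill in after each run, e.g. {"config": cfg, "accuracy": ..., "minutes": ...}

for cfg in configs:
    # Each cfg can be unpacked into TrainingArguments next to the fixed arguments, e.g.
    # TrainingArguments(output_dir="finetuned_movie_sentiment", use_cpu=False,
    #                   evaluation_strategy="epoch", **cfg)
    print(cfg)
```

If you run several configurations in one session, reload the pre-trained model before each run. Otherwise each run continues training the weights left by the previous one, and the comparison is no longer fair.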





In [None]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary
#   See https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#transformers.TrainingArguments

training_args = TrainingArguments(output_dir="finetuned_movie_sentiment", use_cpu=False, evaluation_strategy="epoch", )

#----------------------------------------------

trainer = Trainer(
    model=james_h_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)


trainer.train()

In the code cell below, evaluate the performance of the fine-tuned model on the v2.0 polarity dataset. What are the accuracy, F1 score, and other metrics?
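
Beyond accuracy, the other metrics can be computed from the model's logits in the same way compute_metrics does. The following is a minimal sketch with made-up logits and labels (both arrays are invented for illustration); it assumes the column order matches the fine-tuning labels, 0 = negative and 1 = positive.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

# Made-up logits for five reviews, shape (n_reviews, 2)
logits = np.array([[2.0, -1.0], [0.2, 0.9], [-0.5, 1.5], [1.2, 0.3], [0.1, 0.4]])
labels = np.array([0, 1, 1, 0, 0])

# The predicted class is the column with the largest logit
predictions = np.argmax(logits, axis=-1)
precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")

print("accuracy:", accuracy_score(labels, predictions))
print("precision:", precision, "recall:", recall, "F1:", f1)
print(confusion_matrix(labels, predictions))  # rows: true class, columns: predicted class
```

With a Trainer, arrays of this shape are available from `trainer.predict(tokenized_test)`, whose result exposes `predictions` and `label_ids`.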

In [None]:
james_h_finetuned_model = AutoModelForSequenceClassification.from_pretrained("./finetuned_movie_sentiment/checkpoint-500")

#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary


## Problem 5 (10 points): Accuracy is not everything!  How fast are the algorithms versus their accuracy?
**Compare the runtime of your baseline algorithms to the runtime of the pre-trained Hugging Face model and the fine-tuned Hugging Face model**

**The jupyter command %timeit can be used to measure how long a calculation takes https://ipython.readthedocs.io/en/stable/interactive/magics.html.**
* How long does it take to run the "predict" function on the entire v2.0 polarity dataset with the scikit-learn models and the Hugging Face models? Can you explain why? Make a table showing your results.
* Which method has the best ability in predicting the sentiment correctly? Can you explain why?
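
Outside of the %timeit magic, the same measurement can be taken with the standard library, which makes it easy to collect results into a table. This is a minimal sketch: `time_predict` is a hypothetical helper, and `fast_predict` and `slow_predict` are stand-ins that you would replace with the predict calls of your own models.

```python
import time

def time_predict(predict_fn, docs, repeats=3):
    """Return the best wall-clock time in seconds over several runs of predict_fn(docs)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(docs)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in predictors (hypothetical); swap in e.g. a fitted classifier's predict method
def fast_predict(docs):
    return [len(d) % 2 for d in docs]

def slow_predict(docs):
    time.sleep(0.01)  # simulate a heavier model
    return [len(d) % 2 for d in docs]

toy_docs = ["good film", "bad film"] * 100
timings = {name: time_predict(fn, toy_docs) for name, fn in
           [("fast stand-in", fast_predict), ("slow stand-in", slow_predict)]}

# Print a small results table
print(f"{'model':<15}{'seconds':>10}")
for name, seconds in timings.items():
    print(f"{name:<15}{seconds:>10.4f}")
```

Taking the best of several runs reduces the influence of one-off delays. When timing a model on a GPU, keep in mind that the first call often includes warm-up cost.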

In [None]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary


## Problem 6 (20 points): Business question

* Suppose you had a highly accurate machine learning algorithm that could detect the sentiment of NewsAPI articles. What kind of business could you build around that?
* Who would be your competitors, and what are their sizes?
* What would be the size of the market for your product?
* In addition, assume that your machine learning algorithm was slow to train, but fast in making predictions on new data. How would that affect your business plan?
* How could you use the cloud to support your product?

# Slides (for a 5-8 minute presentation) (20 points)


1. (5 points) Motivation about the data collection, why the topic is interesting to you.

2. (10 points) Communicating Results (figure/table)

3. (5 points) Story telling (How all the parts (data, analysis, result) fit together as a story?)


# Done

All set!

**What do you need to submit?**

* **Notebook File**: Save this IPython notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "ipython notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.


* **PPT Slides**: Please prepare PPT slides (for a 10-minute talk) to present the case study. For this case study, two randomly selected teams will be asked to present in class.

* **Report**: Please prepare a report (fewer than 10 pages) describing what you found in the data.
    * What data did you collect?
    * Why is this topic interesting or important to you? (Motivation)
    * How did you analyze the data?
    * What did you find in the data?

     (please include figures or tables in the report, but no source code)


*Please compress all the files into a single zipped file.*


**How to submit:**

        Please submit through canvas.wpi.edu

### DS3010 Case Study 3 Team ??

#### where ?? is your team number.
        
**Note: Each team needs to submit only one submission.**

# Grading Criteria:

**Total Points: 100**


---------------------------------------------------------------------------
**Notebook results:**
    Points: 80


    -----------------------------------
    Question 1:
    Points: 10
    -----------------------------------
    
    -----------------------------------
    Question 2:
    Points: 10
    -----------------------------------
        
    -----------------------------------
    Question 3:
    Points: 15
    -----------------------------------
  
    -----------------------------------
    Question 4:  
    Points: 15
    -----------------------------------

    -----------------------------------
    Question 5:  
    Points: 10
    -----------------------------------

    -----------------------------------
    Question 6:  
    Points: 20
    -----------------------------------

---------------------------------------------------------------------------
**Slides (for a 5-8 minute presentation): Story-telling**
    Points: 20


1. Motivation about the data collection, why the topic is interesting to you.
    Points: 5

2. Communicating Results (figure/table)
    Points: 10

3. Story telling (How all the parts (data, analysis, result) fit together as a story?)
    Points: 5
