# Lab 2: Transfer Learning & Transformers: Comparison of Transformer Architecture to Other Neural Networks in Evaluation of Movie Reviews and Ratings

Group Members:

- Parker Brown

- Suma Chackola

- Chris Peters

- Oliver Raney



**The execution of this lab was performed collaboratively across 4 computers. Therefore, while the individual cells are not all shown with the direct execution results, the code presented in those cells was utilized to produce the results in this notebook.**


<a id="top"></a>
## Contents
* <a href="#P1">1.0 Introduction & Dataset Overview</a>
* <a href="#P2">2.0 Transfer Learning Foundational Model </a>
* <a href="#P3">3.0 Splitting the Data </a>
* <a href="#P4">4.0 Training a Model from Scratch </a>
* <a href="#P5">5.0 Training a Model by Transfer Learning from Foundational Model </a>
* <a href="#P6">6.0 Fine-Tuning the Model </a>
* <a href="#P7">7.0 Results: Comparing All Investigated Models </a>
________________________________________________________________________________________________________


<a href="#top">Back to Top</a>
<a id="P1"></a>
## 1.0 Introduction & Dataset Overview
#### Give an overview of the dataset you have chosen to use. What is the classification task? Is this multi-task? Explain. What is the feature data? Who collected the data? Why? When? Is the data multi-modal? What evaluation criteria will you be using and why?

In this lab, we classify the text of movie reviews into an overall "positivity" rating of Positive, Neutral, or Negative. Because each review is evaluated as a whole rather than as isolated words or phrases, this is a sequence classification task: a sequence of text is assigned to one distinct category.

### Dataset Overview
The dataset we are using for this analysis comes from an IEEE Open Access repository at the following source: 
- Data Source: https://ieee-dataport.org/open-access/imdb-movie-reviews-dataset

The dataset compiles movie reviews from the Internet Movie Database (IMDb), https://www.imdb.com/, and contains 1 million reviews from 1150 movies spread across 17 genres. It also includes metadata such as the IMDb rating and movie rating. The data was compiled by Pal, Barigidad, and Mustafi and presented in a paper at the 2020 International Conference on Computing, Communication, and Security (ICCCS). In their analysis, they classified the genre of each movie from the content of its reviews, using word tokenization and a genre-specific keyword list, and from their results they built a "Movie Recommender" driven by a genre supplied by the user.

This is not a multi-modal dataset because it contains only textual data. We did not choose a multi-modal dataset, although other IMDb datasets do contain multi-modal data, such as https://arxiv.org/abs/1702.01992, which includes an image of each movie's poster in addition to the movie genre, rating, and other text.


### Classification Task
We used this dataset in Lab 1 of this course, where we performed sentiment analysis on the movie reviews and compared that sentiment to the movie positivity rating. For this lab, we take a different approach: neural networks trained on word-embedding vectorizations of the reviews. During training, the model processes the sequential text of each review together with its associated rating and learns how the vectorized sequence relates to positivity. Once that association is learned, the model classifies new reviews into the appropriate rating category using the same vectorization.
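
To make the vectorization step concrete, here is a toy sketch in plain Python of how a review can be turned into a fixed-length sequence of integer token ids, which is the form an embedding layer consumes. It is an illustration under stated assumptions, not the pipeline used in this notebook: the helper names (`build_vocab`, `encode`), the whitespace tokenizer, and the sequence length are all hypothetical.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary tokens

def build_vocab(texts, max_tokens=1000):
    """Assign an integer id to each of the most frequent lowercase tokens."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    most_common = counts.most_common(max_tokens - 2)
    return {tok: i + 2 for i, (tok, _) in enumerate(most_common)}

def encode(text, vocab, seq_len=8):
    """Convert text to a list of ids, truncated or padded to seq_len."""
    ids = [vocab.get(tok, UNK) for tok in text.lower().split()][:seq_len]
    return ids + [PAD] * (seq_len - len(ids))

# Hypothetical toy corpus; real reviews are far longer
reviews = ["This movie is full of suspense", "This movie is slow"]
vocab = build_vocab(reviews)
encoded = encode("this movie is great", vocab)
```

In this sketch, a word that is not in the vocabulary ("great") maps to the `UNK` id, and short reviews are right-padded with `PAD` so every sequence has the same length.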

In the dataset, reviews are rated on a scale of 1-10. We expect our classifier to produce output on a similar scale, which makes this a multi-class classification task. We plan to group the scores into three general categories:
- Score < 3.5 -> "Negative"
- 3.5 < Score < 6.5 -> "Neutral"
- 6.5 < Score -> "Positive"
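
The thresholds above can be sketched as a small helper. This is only an illustration, assuming ratings are integers from 1 to 10 (so the boundary values 3.5 and 6.5 never occur); the function name `rating_to_category` is hypothetical and does not appear elsewhere in this notebook.

```python
def rating_to_category(rating):
    """Map a 1-10 review rating to a positivity category (hypothetical helper)."""
    if rating < 3.5:
        return "Negative"
    if rating < 6.5:
        return "Neutral"
    return "Positive"

# Example: apply the mapping to a handful of ratings
categories = [rating_to_category(r) for r in [1, 3, 4, 6, 7, 10]]
```

With pandas, the same helper could be applied to a ratings column via `df['rating'].map(rating_to_category)`.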

### Evaluation Criteria



<a href="#top">Back to Top</a>
<a id="P2"></a>
## 2.0 Transfer Learning Foundational Model
#### Describe the foundational model that you will be using to transfer learn from. What tasks was this foundational model trained upon? Explain if the new task is within the same domain, across domains, etc. 

### Foundational Model: BERT

<a href="#top">Back to Top</a>
<a id="P3"></a>
## 3.0 Splitting the Data 
#### Split the data into training and testing. Be sure to explain how you performed this operation and why you think it is reasonable to split this particular dataset this way. For multi-task datasets, be sure to explain if it is appropriate to stratify within each task. If the dataset is already split for you, explain how the split was achieved and how it is stratified.


### Preparing Data

In [2]:
#imports
import os
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [3]:
#https://www.geeksforgeeks.org/how-to-iterate-over-files-in-directory-using-python/
# and https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe



directory = 'data/movie_dataset/2_reviews_per_movie_raw'

dfs = []

# Concatenate data into one pandas dataframe
for filename in os.listdir(directory):
    data = pd.read_csv(os.path.join(directory, filename), header='infer')
    dfs.append(data)
df = pd.concat(dfs, ignore_index=True)


df.head()



Unnamed: 0,username,rating,helpful,total,date,title,review
0,Imme-van-Gorp,7,102,123,30 January 2019,Unfortunately the ending ruined an otherwise ...,This movie is full of suspense. It makes you g...
1,sonofocelot-1,5,385,500,10 May 2016,...oh dear Abrams. Again.\n,I'll leave this review fairly concise. <br/><b...
2,mhodaee,5,110,143,4 August 2017,"Fantastic, gripping, thoroughly enjoyable, un...",I give the 5/10 out of the credit I owe to the...
3,fil-nik09,5,73,100,5 October 2016,Hmmm...\n,"First of all, I must say that I was expecting ..."
4,DVR_Brale,7,42,56,27 July 2016,Slow building & plot alternating claustrophob...,I've always loved movies with strong atmospher...


In [4]:
# retain only rating and review
df = df.drop(columns=['username', 'helpful', 'total', 'date','title'], errors='ignore')

#print unique ratings
df.rating.unique()


array(['7', '5', '9', '8', '10', 'Null', '6', '1', '4', '3', '2'],
      dtype=object)

In [5]:
#drop Null ratings
df = df[~df['rating'].str.contains('Null')]

# Convert "rating" to int
df = df.astype({'rating': 'int'})
df.rating.unique()



array([ 7,  5,  9,  8, 10,  6,  1,  4,  3,  2])

In [6]:

# Drop rows where "rating" is NaN or reviews are missing
df = df.dropna(subset=['rating'])
df = df.dropna(subset=['review'])


df_ratings = df['rating']
df_reviews = df['review']


df_reviews.head()

0    This movie is full of suspense. It makes you g...
1    I'll leave this review fairly concise. <br/><b...
2    I give the 5/10 out of the credit I owe to the...
3    First of all, I must say that I was expecting ...
4    I've always loved movies with strong atmospher...
Name: review, dtype: object

In [47]:
df['review'] = df['review'].str.lower() 
df.head()

Unnamed: 0,rating,review
0,7,this movie is full of suspense. it makes you g...
1,5,i'll leave this review fairly concise. <br/><b...
2,5,i give the 5/10 out of the credit i owe to the...
3,5,"first of all, i must say that i was expecting ..."
4,7,i've always loved movies with strong atmospher...
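
The sample rows show that some reviews still contain HTML line-break tags such as `<br/>`. As a possible extra cleaning step (an assumption, not something applied in the cells above), a regular expression could replace such tags with spaces before vectorization. The helper name `strip_html_tags` is hypothetical, and the pattern is deliberately simple: it would also remove any other text that happens to sit between `<` and `>`.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches simple tags like <br/> or <b>

def strip_html_tags(text):
    """Replace HTML-style tags with a space and collapse repeated whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

# Hypothetical example string modeled on the truncated sample row above
cleaned = strip_html_tags("i'll leave this review fairly concise. <br/><br/>oh dear.")
```

On a pandas column, a comparable one-liner would be `df['review'].str.replace(r"<[^>]+>", " ", regex=True)`.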


In [48]:
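#80/20 train test split of reviews (features) and ratings (labels),
#taken from df so that the lowercased review text is used
df_reviews_train, df_reviews_test, df_ratings_train, df_ratings_test = \
    train_test_split(df['review'], df['rating'], test_size=0.2, random_state=0)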

In [49]:
#split the full dataframe (rating and review together) with the same test_size and random_state
df_train, df_test = \
    train_test_split(df, test_size=0.2, random_state=0)
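
Because the split above is purely random, the class proportions in the test set can drift from those in the full data if the rating classes are imbalanced. The sketch below shows the idea of a stratified split in plain Python. It is illustrative only and was not used to produce any results in this notebook; the helper name `stratified_split` and the toy label counts are hypothetical. In scikit-learn, the same effect is available through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the test fraction
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 Negative, 20 Neutral, 70 Positive
labels = ["Negative"] * 10 + ["Neutral"] * 20 + ["Positive"] * 70
train_idx, test_idx = stratified_split(labels)
```

Each class contributes the same fraction of its rows to the test set, so a rare class is not left out of evaluation by chance.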

<a href="#top">Back to Top</a>
<a id="P4"></a>
## 4.0 Training a Model from Scratch 
#### Train a model from scratch to perform the classification task (this does NOT need to be a transformer). That is, do not use transfer learning for the classification task. Verify the model converges (even if the model is overfit). This does NOT need to mirror the foundational model. This model may be far less computational to train.

In [38]:
#!pip install --upgrade keras-nlp
#!pip install --upgrade keras 

In [40]:
#import os

#os.environ["KERAS_BACKEND"] = "jax" 

#import keras_nlp 
#import tensorflow
#import keras

In [44]:
#!pip uninstall keras keras-core keras-nlp
#!pip install keras==2.15.0 keras-nlp==0.6.3

#!pip install --upgrade keras-nlp
#import os
os.environ["KERAS_BACKEND"] = "jax"  # Or "tensorflow" or "torch"!

#from keras import keras_nlp
import tensorflow as tf
import keras

#TODO: I could pip install tensorflow, but the following code is causing issues. I have added it to my PATH environment variable, etc., but it is still a pain. See if you can get this to resolve.

The reason I introduced this code was to see how to persist the history of the models. This was one of my tasks.

In [45]:
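# Baseline: multi-hot bag-of-words features feeding a single softmax layer.
# The original cell raised "AttributeError: 'DataFrame' object has no attribute
# 'map'" because the two-argument .map(lambda x, y: ...) calls are tf.data.Dataset
# API, so the DataFrames are wrapped in tf.data.Dataset objects first.
def make_ds(frame, batch_size=32):
    # Class ids follow Section 1.0: 0 = Negative, 1 = Neutral, 2 = Positive
    labels = pd.cut(frame['rating'], bins=[0, 3.5, 6.5, 10], labels=False).values
    return tf.data.Dataset.from_tensor_slices(
        (frame['review'].astype(str).values, labels)).batch(batch_size)

train_ds = make_ds(df_train)
val_ds = make_ds(df_test)

multi_hot_layer = keras.layers.TextVectorization(
    max_tokens=4000, output_mode="multi_hot"
)
multi_hot_layer.adapt(train_ds.map(lambda x, y: x))

multi_hot_ds = train_ds.map(lambda x, y: (multi_hot_layer(x), y))
multi_hot_val_ds = val_ds.map(lambda x, y: (multi_hot_layer(x), y))

# We then learn a multinomial logistic regression over that layer, and that's
# our entire baseline model!

inputs = keras.Input(shape=(4000,), dtype="int32")
outputs = keras.layers.Dense(3, activation="softmax")(inputs)
baseline_model = keras.Model(inputs, outputs)
baseline_model.compile(loss="sparse_categorical_crossentropy", metrics=["accuracy"])
baseline_model.fit(multi_hot_ds, validation_data=multi_hot_val_ds, epochs=5)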

<a href="#top">Back to Top</a>
<a id="P5"></a>
## 5.0 Training a Model by Transfer Learning from Foundational Model 
#### Train a model by transfer learning from your foundational model. Verify that the new model converges. You only need to train a model using the bottleneck features for this step. 

<a href="#top">Back to Top</a>
<a id="P6"></a>
## 6.0 Fine-Tuning the Model 
#### Perform fine tuning upon the model by training some layers within the foundational model. Verify that the model converges. 



<a href="#top">Back to Top</a>
<a id="P7"></a>
## 7.0 Results: Comparing All Investigated Models 
#### Report the results of all models using the evaluation procedure that you argued for at the beginning of the lab. Compare the convergence of the models and the running time. Results should be reported with proper statistical comparisons and proper visualizations.