# Assignment 3: Text Classification Using the Stanford SST Sentiment Dataset
## Sofia Zaidman
## 4/15/23
### https://github.com/szaidman22/text-classification-stanford-movie-review-sentiment

## Get data

In [1]:
%%capture
#install aimodelshare library
! pip install aimodelshare==0.0.189

In [1]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [2]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

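# read_csv's squeeze= argument was removed in pandas 2.0;
# .squeeze("columns") turns the single-column DataFrame into a Series instead.
X_train = pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test = pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels = pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")

# One-hot encode the labels (one column per class)
y_train = pd.get_dummies(y_train_labels)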

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object

In [3]:
#checking size and makeup of data
print(len(X_train))
print(len(X_test))
print(y_train_labels.unique())
print(X_train[738])

6920
1821
['Positive' 'Negative']
A different movie -- sometimes tedious -- by a director many viewers would like to skip but film buffs should get to know .


## Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.

The dataset contains 8,741 movie reviews in total (6,920 for training and 1,821 for testing), each labeled as having either positive or negative sentiment.

Building a model to predict the sentiment of movie reviews would be practically useful for websites like Rotten Tomatoes or IMDB, which aggregate reviews from multiple sources and generate an overall rating for a film. It could also be useful as an input for a movie recommendation algorithm. 

Movie review websites would obviously benefit from a model like this, and so would the filmmakers, movie studios, and moviegoers who use aggregated review information to make choices about which films to make or see. 

## Run at least three prediction models to try to predict the SST sentiment dataset well.

First, the text data needs to be tokenized and padded to a fixed length.

In [4]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

# preprocessor tokenizes words and makes sure all documents have the same length
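def preprocessor(data, maxlen=40, max_words=10000):
    # max_words is unused here: the vocabulary size is already fixed by the
    # Tokenizer(num_words=10000) fitted above.
    sequences = tokenizer.texts_to_sequences(data)

    # Pad (or truncate) every document to exactly maxlen tokens
    X = pad_sequences(sequences, maxlen=maxlen)

    return X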

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
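
To make the padding step concrete, here is a minimal sketch that mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`, pad value 0) in plain Python. The helper name `pad_pre` and the toy sequences are made up for illustration; the real preprocessor still uses the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # drop tokens from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad on the left
    return padded


toy = [[5, 8], [1, 2, 3, 4, 5, 6]]   # made-up token ids
padded_toy = pad_pre(toy, maxlen=4)
print(padded_toy)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

With `maxlen=40`, a review longer than 40 tokens therefore loses its beginning rather than its end, which may matter if the sentiment is expressed early in the sentence.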


### Use an Embedding layer and LSTM layers in at least one model


I tried a couple of simple LSTM versions and decided to keep the dropout in this one, since it seems to prevent the more severe initial overfitting.

I submitted this model as #149 and it performed relatively well, with an accuracy of 0.807 on the test data. At the time of submission it was in the top 10 models, though to be fair many of the submitted models have almost identical accuracy.

In [24]:
from tensorflow.keras.layers import Dense, Embedding,Flatten,LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization

model = Sequential()
model.add(Embedding(10000, 10, input_length=40))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2)) 
model.add(Dense(2, activation='softmax'))

model.summary()

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

history = model.fit(preprocessor(X_train), y_train,
                    epochs=15,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_8 (Embedding)     (None, 40, 10)            100000    
                                                                 
 lstm_9 (LSTM)               (None, 32)                5504      
                                                                 
 dense_8 (Dense)             (None, 2)                 66        
                                                                 
Total params: 105,570
Trainable params: 105,570
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
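
The model above pairs a two-unit softmax output and one-hot labels with `binary_crossentropy`, while the Conv1D model later uses `categorical_crossentropy`. The sketch below suggests why the two give the same loss value in this two-class setting, assuming Keras averages the element-wise binary loss over the output units. The function names and the example probabilities are made up for illustration.

```python
import math


def categorical_ce(y_true, y_pred):
    """Categorical cross-entropy for a single example."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


def binary_ce_mean(y_true, y_pred):
    """Element-wise binary cross-entropy, averaged over the output units."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


y_true = [0.0, 1.0]   # one-hot label
y_pred = [0.3, 0.7]   # made-up softmax output (sums to 1)

cce = categorical_ce(y_true, y_pred)
bce = binary_ce_mean(y_true, y_pred)
print(cce, bce)       # both equal -log(0.7)
```

Because the two softmax outputs sum to 1, each element-wise binary term reduces to the same categorical term, so the choice of loss is unlikely to explain any accuracy difference between these models.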


In [5]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


In [26]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [6]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [7]:
#Instantiate Competition
mycompetition= ai.Competition(apiurl)

In [29]:
#Submit Model 1: 
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]

mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): LSTM
Provide any useful notes about your model (optional): LSTM with dropout

Your model has been submitted as model version 149

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
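
The submission step turns softmax rows into label strings through `argmax` and `y_train.columns`. A small sketch of that mapping in plain Python follows, assuming `pd.get_dummies` orders the label columns alphabetically (so index 0 is `Negative`). The probabilities are made up for illustration.

```python
labels = sorted({"Positive", "Negative"})      # stand-in for y_train.columns
probs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]   # made-up softmax rows


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predicted = [labels[argmax(row)] for row in probs]
print(predicted)  # ['Negative', 'Positive', 'Positive']
```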


### Use an Embedding layer and Conv1d layers in at least one model

I submitted this model and it performed slightly worse than the first. It learned much faster and reached 100% accuracy on the training data very quickly.

In [50]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Embedding, Flatten


model2 = Sequential()
model2.add(Embedding(10000, 100, input_length=40))
model2.add(layers.Conv1D(16, 4, activation='relu'))
# No BatchNormalization layer is part of this model (the summary below lists
# only the two Conv1D layers); model2.add(layers.BatchNormalization()) would
# insert one at this point.
model2.add(layers.Conv1D(8, 4, activation='relu'))
model2.add(Flatten())
model2.add(Dense(2, activation='softmax'))

model2.summary()

model2.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model2.fit(preprocessor(X_train), y_train,
                    epochs=5,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_29"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_29 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 conv1d_43 (Conv1D)          (None, 37, 16)            6416      
                                                                 
 conv1d_44 (Conv1D)          (None, 34, 8)             520       
                                                                 
 flatten_14 (Flatten)        (None, 272)               0         
                                                                 
 dense_22 (Dense)            (None, 2)                 546       
                                                                 
Total params: 1,007,482
Trainable params: 1,007,482
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
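
The shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes unpadded ("valid") convolutions with stride 1, which is the Keras default for `Conv1D`; the helper names are made up for illustration.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of an unpadded 1D convolution."""
    return (length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters


len1 = conv1d_out_len(40, 4)       # 37 time steps after the first Conv1D
len2 = conv1d_out_len(len1, 4)     # 34 time steps after the second Conv1D
flat = len2 * 8                    # 272 features after Flatten
params = [
    10000 * 100,                   # embedding table
    conv1d_params(100, 16, 4),     # first Conv1D
    conv1d_params(16, 8, 4),       # second Conv1D
    flat * 2 + 2,                  # Dense(2): weights + biases
]
print(len1, len2, flat, params, sum(params))
# 37 34 272 [1000000, 6416, 520, 546] 1007482
```

Almost all of the roughly one million parameters sit in the embedding table, which may help explain how quickly this model memorizes 6,920 training examples.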

In [51]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [54]:
#Submit Model 2: 
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]

mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): conv1d
Provide any useful notes about your model (optional): conv1d

Your model has been submitted as model version 152



### Use transfer learning with glove embeddings for at least one of these models

This model actually ended up being the best performing so far, with an accuracy of 0.81 on the test dataset. I increased the dropout rate from 0.2 to 0.4, which further reduced overfitting.

In [8]:
# Download Glove embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-04-16 13:34:23--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-16 13:34:23--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-16 13:34:24--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [9]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [10]:
import os

# Extract embedding data for 100 feature embedding matrix
glove_dir = os.getcwd()

embeddings_index = {}
# Each line holds a word followed by its 100 embedding coefficients
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [11]:
# Build embedding matrix
word_index = tokenizer.word_index

embedding_dim = 100 # change if you use txt files using larger number of features

embedding_matrix = np.zeros((10000, embedding_dim))
for word, i in word_index.items():
    if i < 10000:
        embedding_vector = embeddings_index.get(word)
        # Words not found in the embedding index keep their all-zero row.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
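
Before relying on the matrix, it can be useful to check how much of the tokenizer vocabulary actually has a GloVe vector, since missing words stay as all-zero rows. This is a minimal sketch; `glove_coverage` and the toy dictionaries are hypothetical stand-ins for `tokenizer.word_index` and `embeddings_index`.

```python
def glove_coverage(word_index, embeddings_index, num_words=10000):
    """Fraction of tokenizer words with index < num_words that have a vector."""
    kept = [word for word, i in word_index.items() if i < num_words]
    hits = sum(1 for word in kept if word in embeddings_index)
    return hits / len(kept) if kept else 0.0


# Toy data: tokenizer indices start at 1, as in the Keras Tokenizer
toy_word_index = {"the": 1, "movie": 2, "zzyzx": 3}
toy_embeddings = {"the": [0.1, 0.2], "movie": [0.3, 0.4]}

coverage = glove_coverage(toy_word_index, toy_embeddings)
print(f"{coverage:.2%} of kept words have a pre-trained vector")
```

On the real data the call would be `glove_coverage(tokenizer.word_index, embeddings_index)`.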

In [35]:
from tensorflow.keras.layers import Dense, Embedding,Flatten,LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization

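model4 = Sequential()
# NOTE: in the code shown here, embedding_matrix is not passed to this layer,
# and the summary below reports 0 non-trainable params, so the embedding is
# trainable and is not visibly initialised from the GloVe vectors.
model4.add(Embedding(10000, embedding_dim, input_length=40))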
model4.add(LSTM(16, dropout=0.4, recurrent_dropout=0.4)) 
model4.add(Dense(2, activation='softmax'))

model4.summary()

model4.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

history = model4.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_15"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_15 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 lstm_5 (LSTM)               (None, 16)                7488      
                                                                 
 dense_7 (Dense)             (None, 2)                 34        
                                                                 
Total params: 1,007,522
Trainable params: 1,007,522
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
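
In the cell above, `embedding_matrix` is never handed to the `Embedding` layer, and the summary lists every parameter as trainable, so the layer may be starting from random weights rather than from the GloVe vectors (unless they were set in a cell not shown). The sketch below is one possible way to load the matrix, using the `Constant` initializer from `tensorflow.keras.initializers`. The function name `glove_embedding_layer` and the toy matrix are hypothetical.

```python
import numpy as np


def glove_embedding_layer(embedding_matrix, trainable=False):
    """Return an Embedding layer initialised from a (vocab_size, dim) matrix."""
    # Imported lazily so the toy example below runs without TensorFlow.
    from tensorflow.keras.initializers import Constant
    from tensorflow.keras.layers import Embedding

    vocab_size, dim = embedding_matrix.shape
    return Embedding(vocab_size, dim,
                     embeddings_initializer=Constant(embedding_matrix),
                     trainable=trainable)   # False keeps the GloVe vectors frozen


# Toy stand-in for the real 10000 x 100 matrix built earlier
toy_matrix = np.zeros((5, 3), dtype="float32")
toy_matrix[1] = [0.1, 0.2, 0.3]
print(toy_matrix.shape)
```

In the notebook this would replace the plain `Embedding(...)` line with `model4.add(glove_embedding_layer(embedding_matrix))`. With `trainable=False`, the summary should then report the 1,000,000 embedding weights as non-trainable.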


In [48]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model4, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model4.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [49]:
#Submit Model 4: 
prediction_column_index=model4.predict(preprocessor(X_test)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]

mycompetition.submit_model(model_filepath = "model4.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): lstm
Provide any useful notes about your model (optional): lstm 100 trying again

Your model has been submitted as model version 158



### Discuss which models performed better and point out relevant hyper-parameter values for successful models.


In general, all of the models I tried performed very similarly. I experimented with the number of units in the LSTM and with the dropout rate. The number of units didn't seem to make much of a difference. The dropout rate had the clearest effect: higher dropout made the model fit the training data more slowly and reduced overfitting. My most successful models had the higher dropout rate of 0.4, so I view higher dropout as leading to more success. I could conceivably try to increase the dropout rate even further in subsequent models.
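
As a rough illustration of what a dropout rate of 0.4 means, the sketch below applies generic "inverted" dropout to a list of activations. It is not the exact Keras LSTM mechanism (there, `dropout` masks the layer inputs and `recurrent_dropout` masks the recurrent state); the function name and the all-ones activations are made up.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)                 # fixed seed for repeatability
activations = [1.0] * 10000            # made-up activations
dropped = inverted_dropout(activations, rate=0.4, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_value = sum(dropped) / len(dropped)
print(round(zero_fraction, 2), round(mean_value, 2))
```

About 40% of the units are silenced on each training step while the expected activation stays the same, which is one way to see why higher dropout makes the training accuracy climb more slowly.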

My best performing model also used the pre-trained GloVe embedding layer. There was only a difference of about 1 percentage point in accuracy between my best and second-best models, so the gain can't be attributed to the embeddings with confidence, but it seems plausible that the additional context contained in pre-trained embeddings helps limit overfitting to the training data.

Another thing I experimented with was stacking layers (LSTM and conv1d) and adding batch normalization layers to some of my conv1d models. This didn't seem to make a noticeable impact and often resulted in the models getting to 100% accuracy on the training data very quickly, which I didn't necessarily view as a good thing and took as a sign of overfitting.

Another thing I noticed when submitting my models to the leaderboard was that one of the top performers on the board used the adam optimizer rather than rmsprop. I tried a model using adam and saw no difference.

## After you submit your first three models, describe your best model with your team via your team slack channel


By Sunday, only one member of my team had also posted in the slack channel to discuss their best models. This member said that their best performing model had an embedding layer with conv1d layers. I had already tried my own models with conv1d layers, so I decided to take some inspiration from the best performing models on the leaderboard, not necessarily from my team. I saw that one of the best performers used two bidirectional LSTM layers, so I wanted to give that a try.

### Trying bidirectional

In [60]:
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional
from tensorflow.keras.models import Sequential

model5 = Sequential()
model5.add(Embedding(10000, embedding_dim, input_length=40))
# First bidirectional layer returns the full sequence so a second one can be stacked
model5.add(Bidirectional(LSTM(16, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)))
model5.add(Bidirectional(LSTM(8)))
model5.add(Dense(2, activation='softmax'))

model5.summary()

model5.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

history = model5.fit(preprocessor(X_train), y_train,
                    epochs=5,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_28"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_28 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 bidirectional_13 (Bidirecti  (None, 40, 32)           14976     
 onal)                                                           
                                                                 
 bidirectional_14 (Bidirecti  (None, 16)               2624      
 onal)                                                           
                                                                 
 dense_18 (Dense)            (None, 2)                 34        
                                                                 
Total params: 1,017,634
Trainable params: 1,017,634
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
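
The LSTM parameter counts in the summaries can be checked with a short calculation: each of the four gates has input weights, recurrent weights, and a bias, and a bidirectional wrapper holds two independent LSTMs. The helper names below are made up for illustration.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units + 1) * units)


def bidirectional_lstm_params(input_dim, units):
    """Forward and backward LSTMs each have their own weights."""
    return 2 * lstm_params(input_dim, units)


print(lstm_params(10, 32))                 # 5504  (first LSTM model)
print(lstm_params(100, 16))                # 7488  (single LSTM on 100-d embeddings)
print(bidirectional_lstm_params(100, 16))  # 14976 (first bidirectional layer)
print(bidirectional_lstm_params(32, 8))    # 2624  (second bidirectional layer)
```

The second bidirectional layer sees 32 input features because the first one concatenates its 16 forward and 16 backward outputs.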

In [51]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model5, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model5.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [52]:
#Submit Model 5: 
prediction_column_index=model5.predict(preprocessor(X_test)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]

mycompetition.submit_model(model_filepath = "model5.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): bidirectional lstm
Provide any useful notes about your model (optional): bidirectional lstm

Your model has been submitted as model version 159



Sadly, my attempt with bidirectional LSTM layers didn't result in a model with higher accuracy. Overall, the spread in accuracy across all my models is only about 2 percentage points, with my worst models coming in at 79% and my best at 81%. So in general I'm not seeing a large difference from any of my hyperparameter changes.
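
One way to judge whether a 2-point gap is meaningful is to estimate the sampling noise of an accuracy score. The sketch below uses the normal approximation to a binomial proportion and assumes the leaderboard accuracy is computed on all 1,821 test examples; the function name is made up for illustration.

```python
import math


def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


n_test = 1821  # number of rows in X_test
margins = {acc: accuracy_margin(acc, n_test) for acc in (0.79, 0.81)}
for acc, margin in margins.items():
    print(f"accuracy {acc:.2f} +/- {margin:.3f}")
```

Under these assumptions the margin is a little under 2 percentage points, so the 79% and 81% models are hard to distinguish on this test set alone.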

I'm going to try another bidirectional LSTM model, this time with different dropout rates on the two layers.

In [15]:
from tensorflow.keras.layers import Dense, Embedding,Flatten,LSTM, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization

model6 = Sequential()
model6.add(Embedding(10000, embedding_dim, input_length=40))
model6.add(Bidirectional(LSTM(16,dropout=0.6, recurrent_dropout=0.6,return_sequences=True)))
model6.add(Bidirectional(LSTM(8,dropout=0.2, recurrent_dropout=0.2,)))
model6.add(Dense(2, activation='softmax'))

model6.summary()

model6.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

history = model6.fit(preprocessor(X_train), y_train,
                    epochs=4,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 40, 100)           1000000   
                                                                 
 bidirectional_2 (Bidirectio  (None, 40, 32)           14976     
 nal)                                                            
                                                                 
 bidirectional_3 (Bidirectio  (None, 16)               2624      
 nal)                                                            
                                                                 
 dense_1 (Dense)             (None, 2)                 34        
                                                                 
Total params: 1,017,634
Trainable params: 1,017,634
Non-trainable params: 0
_________________________________________________________________
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [17]:
history = model6.fit(preprocessor(X_train), y_train,
                    epochs=2,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/2
Epoch 2/2


In [18]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model6, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model6.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [19]:
#Submit Model 6: 

prediction_column_index=model6.predict(preprocessor(X_test)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]

mycompetition.submit_model(model_filepath = "model6.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): bidirectional lstm
Provide any useful notes about your model (optional): bidirectional lstm with more dropout

Your model has been submitted as model version 200



This actually ended up being one of my worst performing models, though again the difference in accuracy was small. I trained this model for fewer epochs than my single-layer LSTM models (6 in total across the two fit calls, versus 15 and 10), so training for more epochs might be more effective.
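
Rather than guessing the number of epochs, the stopping point can be chosen from the validation loss. The sketch below shows patience-based early stopping in plain Python; the validation losses are made-up numbers, not results from this notebook, and `best_epoch` is a hypothetical helper. Keras offers the same idea through the `EarlyStopping` callback in `tensorflow.keras.callbacks`.

```python
def best_epoch(val_losses, patience=2):
    """Return (epoch with the lowest val loss so far, epoch where training stops)."""
    best, best_idx, waited = float("inf"), 0, 0
    for idx, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_idx, waited = loss, idx, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:                  # no improvement for too long
                return best_idx, idx
    return best_idx, len(val_losses)


made_up_val_losses = [0.62, 0.51, 0.47, 0.49, 0.52, 0.50]
stopping = best_epoch(made_up_val_losses)
print(stopping)  # (3, 5): best weights at epoch 3, stop after epoch 5
```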

## Discuss which models you tried and which models performed better and point out relevant hyper-parameter values for successful models.

Again, all of the models I tried performed very similarly, ranging from 78% to 81% accuracy. I tried a few LSTM-based models and some models with stacked conv1d layers, added batch normalization to one, and compared the adam and rmsprop optimizers. I trained my own embeddings and also used the GloVe embeddings. I also tried adding bidirectional LSTMs based on other top models I saw on the leaderboard. None of these changes caused a clear jump in accuracy.

When I investigated the top performing models, I found a good amount of variation in model architecture. Unfortunately, the highest performing model, which led by about 10 percentage points of accuracy, didn't have any model details submitted to the contest page. I wonder if it was based on transfer learning from a pre-trained sentiment classification model.

Ultimately, my best performing model had one LSTM layer with 0.4 dropout and used the pre-trained GloVe embedding layer. I found that the dropout hyperparameter had the most noticeable effect on how fast the model learned and on how much it overfit. If I were to continue refining my models, I would probably try a grid search over a range of dropout rates.
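
A grid search over dropout settings could be organized along these lines. This is only a sketch: `fake_score` is a hypothetical stand-in for a function that would build the LSTM model with the given rates, fit it, and return the validation accuracy, and the candidate values are arbitrary.

```python
from itertools import product


def grid_search(score_fn, dropouts, recurrent_dropouts):
    """Score every (dropout, recurrent_dropout) pair and return the best one."""
    results = {}
    for d, rd in product(dropouts, recurrent_dropouts):
        results[(d, rd)] = score_fn(d, rd)
    best = max(results, key=results.get)
    return best, results


def fake_score(dropout, recurrent_dropout):
    # Hypothetical placeholder: a real version would train a model and
    # return its validation accuracy instead of this made-up formula.
    return 1.0 - abs(dropout - 0.4) - abs(recurrent_dropout - 0.4)


best, results = grid_search(fake_score, [0.2, 0.4, 0.6], [0.2, 0.4, 0.6])
print(best, len(results))  # (0.4, 0.4) 9
```

Since the models differ by only a point or two, averaging each setting over several random seeds would probably be needed before trusting the winner.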
