**Timothy Ng**

## Introduction

This report summarizes my analysis of the Stanford Sentiment Treebank (SST) dataset, which comprises 8741 individual sentences extracted from movie reviews. Each sentence is labeled as either positive or negative. By analyzing this data, we can gain insight into how language composition affects sentiment, which is useful in a range of applications.

One practical benefit of building predictive models on this data is that they can automatically classify the sentiment of large volumes of text, such as product reviews or social media posts. This allows quicker and more efficient analysis of customer feedback, and businesses can use the resulting insights to improve their products, services, and marketing strategies. Such models can also be used in other applications, such as chatbots or virtual assistants, to better understand and respond to users' needs and emotions.

You can find the full code, and the model-testing notebook to come, here: https://github.com/timnyt/SST-Analysis

In [None]:
#install aimodelshare library
! pip install aimodelshare==0.0.189

In [2]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [3]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

# read_csv's squeeze argument was removed in pandas 2.0; .squeeze("columns") returns the same Series
X_train=pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test=pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels=pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")

# One-hot encode the y data
y_train = pd.get_dummies(y_train_labels)

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object

## 2. Preprocessing

For this exercise, we will use several deep learning models built with the Keras package, including LSTMs and word embeddings. Many Keras layers require the input text to be integer encoded. Our preprocessor, therefore, tokenizes each word in the corpus and pads every document to the same length.


In [4]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

# preprocessor integer-encodes words and pads/truncates every document to maxlen tokens
def preprocessor(data, maxlen=50, max_words=10000):
    # max_words is unused here; the vocabulary size is fixed by Tokenizer(num_words=10000) above
    sequences = tokenizer.texts_to_sequences(data)

    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 50)
(1821, 50)
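
To make the integer encoding concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is a simplified stand-in, not the Keras implementation: the helper names (`build_word_index`, `encode_and_pad`) and the toy sentences are made up for illustration. It assumes the Keras defaults used above: more frequent words get smaller indices starting at 1, unknown words are dropped, and padding and truncation both happen at the front of the sequence.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer; more frequent words get smaller indices (starting at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count, keeping first-seen order for ties
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode_and_pad(texts, word_index, maxlen=5):
    """Integer-encode each text, drop unknown words, then left-pad with zeros to maxlen."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]  # if too long, keep the last maxlen tokens
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

toy_corpus = ["the act is still charming", "the film is charming"]
toy_index = build_word_index(toy_corpus)

# "very" never appeared in the toy corpus, so it is silently skipped
toy_encoded = encode_and_pad(["the film is still very charming"], toy_index, maxlen=6)
print(toy_index)
print(toy_encoded)
```

Index 0 never appears in the word index, which is why it can safely serve as the padding value.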


## 3. Modeling

Much of the modeling for this competition was done in a separate notebook. For brevity, I will present only the three models that performed best.


In [5]:
#Separate validation data 
from sklearn.model_selection import train_test_split
x_train_split, x_val, y_train_split, y_val = train_test_split(
     X_train, y_train, test_size=0.2, random_state=42)

In [6]:
! pip install keras_tuner

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Model 1: Embedding + LSTM

In [9]:

from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten
import keras_tuner as kt

#Define model structure & parameter search space with function
def build_model(hp):
    model = keras.Sequential()
    model.add(Embedding(10000, 16, input_length=50))
    model.add(LSTM(units=hp.Int("units", min_value=32, max_value=512, step=32), #range 32-512 inclusive, minimum step between tested values is 32
                   return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
    model.add(Flatten())
    model.add(Dense(2, activation='softmax'))
    model.compile(
        optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"], #only positive or negative, therefore binary crossentropy
    )
    return model

#initialize the tuner (which will search through parameters)
tuner = kt.RandomSearch(
    hypermodel=build_model, 
    objective="val_accuracy", # objective to optimize
    max_trials=3, #max number of trials to run during search
    executions_per_trial=1, #higher number reduces variance of results; gauges model performance more accurately
    overwrite=True,
    directory="tuning_model",
    project_name="tuning_units",
)

tuner.search(preprocessor(x_train_split), y_train_split, epochs=3, validation_data=(preprocessor(x_val), y_val))

Trial 3 Complete [00h 06m 26s]
val_accuracy: 0.7536126971244812

Best val_accuracy So Far: 0.7731214165687561
Total elapsed time: 00h 16m 17s
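
As a rough sanity check on how much of the search space this run explored, the candidate values for `units` can be enumerated directly. This is a back-of-the-envelope sketch, assuming `hp.Int` with `step=32` yields every multiple of 32 between the two bounds inclusive, as the comment in the cell above describes.

```python
# Candidate LSTM sizes implied by hp.Int("units", min_value=32, max_value=512, step=32)
unit_candidates = list(range(32, 512 + 1, 32))
max_trials = 3  # same setting as the tuner above

# Upper bound on the share of the search space that the trials can cover
coverage = max_trials / len(unit_candidates)

print(len(unit_candidates), "candidate values:", unit_candidates)
print(f"at most {coverage:.1%} of the candidates are tried")
```

With only three trials, most candidate sizes are never evaluated, so the "best" value is best only among the few that were sampled.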


In [10]:
# Build model with best hyperparameters

# Get up to 5 of the best hyperparameter sets, best first.
best_hps = tuner.get_best_hyperparameters(5)
# Build the model with the best hp.
tuned_model = build_model(best_hps[0])
# Fit with the entire training set.
tuned_model.fit(x=preprocessor(X_train), y=y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f692a7f2190>

#### Save preprocessor function to local "preprocessor.zip" file

In [11]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [12]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(tuned_model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("tuned_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

### Generate predictions from X_test data and submit model to competition


In [13]:
#Set credentials using modelshare.org username/password
from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this Movie Review Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [14]:
#Instantiate Competition
import aimodelshare as ai
mycompetition= ai.Competition(apiurl)

In [15]:
#Submit Model: 

#-- Generate predicted y values (Model 1)
prediction_column_index=tuned_model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "tuned_model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 385

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
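
The submission cell above turns softmax outputs into label names in two steps: take the index of the largest probability in each row, then look up the one-hot column name at that index. Below is a small pure-Python sketch of that mapping. The probabilities and the column names are made-up values for illustration, not output from the model or the dataset.

```python
# Assumed column names, standing in for y_train.columns (illustration only)
label_columns = ["negative", "positive"]

# Hypothetical softmax outputs for three sentences; each row sums to 1
probabilities = [
    [0.91, 0.09],
    [0.30, 0.70],
    [0.48, 0.52],
]

def argmax(row):
    """Return the index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

# Step 1: column index of the most probable class for each sentence
predicted_index = [argmax(row) for row in probabilities]
# Step 2: translate column indices back to label names
predicted_labels = [label_columns[i] for i in predicted_index]
print(predicted_labels)
```

Submitting the raw probability array instead of the label list would skip both steps, so the leaderboard would receive values it cannot compare against the true labels.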


In [None]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

## Model 2: Embedding + Conv1D


In [16]:

from tensorflow import keras
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Flatten, Dense
import keras_tuner as kt


#Define model structure with function (hp is unused here, so every trial builds the same architecture)
def build_model(hp):
    model = keras.Sequential()
    model.add(Embedding(10000, 16, input_length=50))
    model.add(Conv1D(64,5, activation  = "relu"))
    model.add(GlobalMaxPooling1D())
    model.add(Flatten())
    model.add(Dense(units=2, activation='softmax'))
    model.compile(
        optimizer= "adam", loss="binary_crossentropy", metrics=["accuracy"],
    )
    return model


#initialize the tuner (which will search through parameters)
tuner = kt.RandomSearch(
    hypermodel=build_model, 
    objective="val_accuracy", # objective to optimize
    max_trials=3, #max number of trials to run during search
    executions_per_trial=1, #higher number reduces variance of results; gauges model performance more accurately
    overwrite=True,
    directory="tuning_model",
    project_name="tuning_units",
)

tuner.search(preprocessor(x_train_split), y_train_split, epochs=3, validation_data=(preprocessor(x_val), y_val))



Trial 1 Complete [00h 00m 12s]
val_accuracy: 0.7644508481025696

Best val_accuracy So Far: 0.7644508481025696
Total elapsed time: 00h 00m 12s


In [17]:
# Build model with best hyperparameters

# Get up to 5 of the best hyperparameter sets, best first.
best_hps = tuner.get_best_hyperparameters(5)
# Build the model with the best hp.
tuned_model = build_model(best_hps[0])
# Fit with the entire training set.
tuned_model.fit(x=preprocessor(X_train), y=y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f6926b183a0>

In [20]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(tuned_model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("tuned_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [22]:
#Submit Model 2: 

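#-- Generate predicted y values (Model 2)
prediction_column_index=tuned_model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels ("positive" or "negative"), as for Model 1
prediction_labels = [y_train.columns[i] for i in prediction_column_index]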

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "tuned_model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)



FileNotFoundError: ignored

## Model 3: Transfer Learning with GloVe Embedding Matrix

In [23]:
# Download GloVe embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-04-18 05:16:48--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-18 05:16:48--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-18 05:16:48--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [24]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [34]:
# Extract embedding data for 100 feature embedding matrix
import os
glove_dir = os.getcwd()

embeddings_index = {}
# GloVe files are UTF-8 text; the context manager closes the file even if parsing fails
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [35]:
# Build embedding matrix
embedding_dim = 100 # must match the vector size of the GloVe file loaded above (50, 100, 200, or 300)
max_words = 10000
word_index = tokenizer.word_index

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
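
The loop above applies two filters: a word must have an index below `max_words`, and it must exist in the GloVe lookup. The toy sketch below mirrors that logic in plain Python and also counts how many eligible words actually received a vector. All names and values here (`toy_word_index`, `toy_embeddings`, the 3-dimensional vectors) are hypothetical and chosen only to show both filters at work; the coverage count is an extra diagnostic that the original cell does not compute.

```python
# Toy stand-ins (assumptions): a 4-word tokenizer index and a 3-dimensional "GloVe" lookup
toy_word_index = {"the": 1, "film": 2, "charming": 3, "schmaltzy": 4}
toy_embeddings = {
    "the": [0.1, 0.2, 0.3],
    "film": [0.4, 0.5, 0.6],
    "schmaltzy": [0.7, 0.8, 0.9],
}
toy_max_words, toy_dim = 4, 3  # rows 0..3 only, mirroring max_words above

# Start from all zeros; row 0 is never assigned because word indices start at 1
toy_matrix = [[0.0] * toy_dim for _ in range(toy_max_words)]
eligible = 0  # words whose index fits inside the matrix
found = 0     # eligible words that also have a pretrained vector

for word, i in toy_word_index.items():
    if i >= toy_max_words:
        continue  # "schmaltzy" has a vector but its index is out of range
    eligible += 1
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)
        found += 1  # "charming" is eligible but missing, so its row stays zero

print(toy_matrix)
print(f"{found} of {eligible} eligible words have a pretrained vector")
```

Running the same count on the real `word_index` and `embeddings_index` would show how many rows of the frozen embedding layer are all zeros.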

In [44]:
embedding_layer = Embedding(10000, 100, weights=[embedding_matrix], input_length=50, trainable=False)

# Use transfer learning with GloVe embeddings
model = keras.Sequential()
model.add(embedding_layer)
model.add(GlobalMaxPooling1D())
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.summary()


Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 50, 100)           1000000   
                                                                 
 global_max_pooling1d_5 (Glo  (None, 100)              0         
 balMaxPooling1D)                                                
                                                                 
 flatten_5 (Flatten)         (None, 100)               0         
                                                                 
 dense_8 (Dense)             (None, 32)                3232      
                                                                 
 dense_9 (Dense)             (None, 2)                 66        
                                                                 
Total params: 1,003,298
Trainable params: 3,298
Non-trainable params: 1,000,000
________________________________________
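
The parameter counts in this summary can be reproduced by hand, which makes clear how little of the network is actually trained. The sketch below uses only the layer sizes shown above; the variable names are mine.

```python
vocab_size, embedding_dim = 10000, 100

# Embedding: one 100-dimensional vector per vocabulary index (frozen, trainable=False)
embedding_params = vocab_size * embedding_dim
# Dense(32): one weight per input feature per unit, plus one bias per unit
dense_hidden_params = embedding_dim * 32 + 32
# Dense(2): weights from the 32 hidden units, plus two biases
dense_output_params = 32 * 2 + 2

# GlobalMaxPooling1D and Flatten have no weights, so they contribute nothing
trainable = dense_hidden_params + dense_output_params
total = embedding_params + trainable

print(embedding_params, dense_hidden_params, dense_output_params)
print("trainable:", trainable, "total:", total)
```

Only the two dense layers are updated during training; the embedding weights stay fixed at their GloVe values.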

In [45]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# train the model
model.fit(preprocessor(X_train), y_train, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f691b966580>

In [46]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=True,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [47]:
#Submit Model 3: 

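#-- Generate predicted y values (Model 3)
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels, as for Model 1
prediction_labels = [y_train.columns[i] for i in prediction_column_index]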

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 392

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


## Model Discussion

After experimenting with different architectures, my best-performing model was a tuned sequential model with an embedding layer followed by an LSTM layer. It achieved a cross-validation accuracy of 0.79, slightly ahead of the embedding + Conv1D model at 0.78. The transfer learning model performed worst, with an accuracy of only 0.65, possibly because I didn't have enough time to tune it properly.

Interestingly, models with an LSTM layer tended to outperform those without, in both cross-validation score and leaderboard ranking. During tuning, the number of epochs had the most significant impact on performance. Surprisingly, dropout had less effect than I initially expected, and no dropout gave me the best results. Although the Keras tuner suggested otherwise regarding dropout, following its suggestion did not produce any significant improvement on the leaderboard.