<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---

## Stanford Sentiment Treebank - Movie Review Classification Competition
Let's share our models to a centralized leaderboard, so that we can collaborate and learn from the model experimentation process...

**Instructions:**
1. Get data in and set up X_train / X_test / y_train
2. Preprocess data using the Keras Tokenizer / write and save a preprocessor function
3. Fit a model on the preprocessed data and save the preprocessor function and model
4. Generate predictions from X_test data and submit the model to the competition
5. Repeat the submission process to improve your place on the leaderboard



https://github.com/yuxuanmaaa/advml/tree/main

## 1. Get data in and set up X_train, X_test, y_train objects

In [1]:
#install aimodelshare library
! pip install aimodelshare==0.0.189

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aimodelshare==0.0.189
  Downloading aimodelshare-0.0.189-py3-none-any.whl (967 kB)
Collecting shortuuid>=1.0.8
  Downloading shortuuid-1.0.11-py3-none-any.whl (10 kB)
Collecting tf2onnx
  Downloading tf2onnx-1.14.0-py3-none-any.whl (451 kB)
Collecting scikit-learn==1.2.1
  Downloading scikit_learn-1.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Collecting pydot==1.3.0
  Downloading pydot-1.3.0-py2.py3-none-any.whl (18 kB)
Collecting keras2onnx>=1.7.0
  Downloading keras2onnx-1.7.0-py3-non

In [1]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [None]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

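# squeeze("columns") turns each single-column frame into a Series
# (the squeeze= argument of read_csv was removed in pandas 2.0)
X_train = pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test = pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels = pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")

# One-hot encode the y data
y_train = pd.get_dummies(y_train_labels)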

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object
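
The one-hot step can be illustrated without the competition files. The sketch below uses NumPy and a made-up label list (the label names are assumptions, not taken from the dataset) to show what `pd.get_dummies` produces conceptually: one column per class, typically in sorted class order.

```python
import numpy as np

# Hypothetical labels; the real y_train_labels.csv values may differ.
labels = np.array(["Positive", "Negative", "Positive", "Negative", "Positive"])

# Sorted unique classes play the role of the get_dummies column names.
classes, class_index = np.unique(labels, return_inverse=True)

# Row i has a 1 in the column of its class and 0 elsewhere.
one_hot = np.eye(len(classes), dtype=int)[class_index]

print(classes)
print(one_hot)
```

Keeping the column order around matters later, because predictions come back as column positions and have to be mapped to label names again.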

## 2. Preprocess data using keras tokenizer / Write and Save Preprocessor function


In [None]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

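# preprocessor tokenizes words and makes sure all documents have the same length
def preprocessor(data, maxlen=40, max_words=10000):
    # max_words is kept for signature compatibility; the vocabulary size is set
    # by the Tokenizer(num_words=10000) defined above
    sequences = tokenizer.texts_to_sequences(data)

    # Pad or truncate every sequence to exactly maxlen token ids
    X = pad_sequences(sequences, maxlen=maxlen)

    return X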

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
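
Both outputs have 40 columns because `pad_sequences` pads short reviews and truncates long ones. With its default settings it does both at the front of the sequence. The following is a minimal NumPy sketch of that behaviour, not the Keras implementation; `pad_pre` and the toy token ids are hypothetical names introduced for illustration.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` tokens of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty document stays all padding
        trimmed = seq[-maxlen:]             # drop tokens from the front when too long
        out[row, -len(trimmed):] = trimmed  # right-align; padding stays on the left
    return out

# Hypothetical token ids, standing in for tokenizer.texts_to_sequences output.
toy_sequences = [[5, 8], [1, 2, 3, 4, 5, 6], []]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

Id 0 is reserved for padding, which is why the Keras tokenizer starts numbering words at 1.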


## 3. Fit model on preprocessed data and save preprocessor function and model


In [None]:
from tensorflow.keras.layers import SimpleRNN, LSTM
from tensorflow.keras.layers import Dense, Embedding,Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(LSTM(32, return_sequences=True, dropout=0.2))
model.add(LSTM(32, dropout=0.2))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_25"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_22 (Embedding)    (None, 40, 16)            160000    
                                                                 
 lstm_2 (LSTM)               (None, 40, 32)            6272      
                                                                 
 lstm_3 (LSTM)               (None, 32)                8320      
                                                                 
 flatten_1 (Flatten)         (None, 32)                0         
                                                                 
 dense_18 (Dense)            (None, 2)                 66        
                                                                 
Total params: 174,658
Trainable params: 174,658
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epo
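
The parameter counts in the summary can be reproduced by hand, which is a useful check when changing layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(10000, 16),  # Embedding(10000, 16)
    lstm_params(16, 32),          # first LSTM(32), fed by 16-dim embeddings
    lstm_params(32, 32),          # second LSTM(32), fed by 32-dim LSTM outputs
    dense_params(32, 2),          # Dense(2)
]
print(counts, sum(counts))
```

The Flatten layer contributes nothing, since the second LSTM already returns a flat vector of 32 values per review.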

#### Save preprocessor function to local "preprocessor.zip" file

In [None]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

## 4. Generate predictions from X_test data and submit model to competition


In [None]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [None]:
#Instantiate Competition

mycompetition= ai.Competition(apiurl)

In [None]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
#Note: model.predict returns class probabilities; argmax(axis=1) picks the column index of the most likely class
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 382

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
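
The prediction step above turns a matrix of class probabilities into label names. Here is a small stand-alone sketch of that mapping, using invented probabilities and assumed class names in place of `model.predict` output and `y_train.columns`.

```python
import numpy as np

# Hypothetical class names, in the same order as the one-hot columns.
class_names = np.array(["Negative", "Positive"])

# Made-up model outputs: one row per review, one probability per class.
probabilities = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

# Index of the largest probability in each row, then the matching label.
column_index = probabilities.argmax(axis=1)
predicted = class_names[column_index].tolist()
print(predicted)
```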


In [None]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,simplernn_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,concatenate_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,batchnormalization_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
0,92.32%,92.31%,92.40%,92.32%,unknown,,,unknown,,,,,,,,,,,,,,,,,,,,,,,,,rian,78
1,82.99%,82.91%,83.59%,82.99%,keras,,True,Sequential,6.0,3603906.0,1.0,,,,1.0,,,,,2.0,1.0,,1.0,,,1.0,,,str,Adam,14417664.0,,eminilkay,143
2,82.77%,82.75%,82.85%,82.76%,keras,True,True,Sequential,8.0,830274.0,1.0,,,,,1.0,3.0,,,,,,3.0,,,1.0,3.0,2.0,str,RMSprop,3323248.0,,emmayang,234
3,82.55%,82.51%,82.87%,82.55%,keras,,True,Sequential,4.0,172674.0,1.0,,,,,1.0,1.0,,,,,,1.0,,,1.0,1.0,,str,RMSprop,691568.0,,timnyt,327
4,82.55%,82.54%,82.57%,82.55%,keras,,True,Sequential,6.0,817762.0,1.0,,,,,1.0,2.0,,,,,,2.0,,1.0,,2.0,1.0,str,RMSprop,3272592.0,,sdp2158,195
5,82.00%,81.99%,82.03%,82.00%,keras,,True,Sequential,3.0,111138.0,1.0,,,,,,,,,1.0,,,1.0,,,1.0,,,str,RMSprop,445856.0,9.0,realdfy,184
6,81.78%,81.78%,81.79%,81.78%,keras,,True,Sequential,6.0,963856.0,1.0,1.0,1.0,,,1.0,,,,,,,2.0,,,1.0,,2.0,str,Adam,3856424.0,,francesyang,66
7,81.78%,81.78%,81.78%,81.78%,keras,,True,Sequential,3.0,2168577.0,1.0,,,,,,1.0,,,,,,1.0,,1.0,,1.0,,function,Adam,8675184.0,,rian,76
8,81.67%,81.65%,81.80%,81.66%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,,ran_liao,237
9,81.45%,81.42%,81.66%,81.45%,keras,,True,Sequential,6.0,235682.0,1.0,,,,,1.0,3.0,,,,,,1.0,,,1.0,3.0,,str,RMSprop,944400.0,,amsay99,85
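
The leaderboard reports accuracy, F1, precision, and recall. The notebook does not say how the per-class scores are averaged, so the sketch below assumes macro averaging (an unweighted mean over classes) and uses invented labels purely for illustration; `classification_metrics` is a hypothetical helper, not part of aimodelshare.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1_score": sum(f1s) / n,
    }

# Invented labels and predictions.
toy_true = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]
toy_pred = ["Positive", "Negative", "Negative", "Negative", "Positive", "Positive"]
metrics = classification_metrics(toy_true, toy_pred)
print(metrics)
```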


## 5. Repeat submission process to improve place on leaderboard


In [None]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

model2 = Sequential()
model2.add(layers.Embedding(10000, 16, input_length=40))
model2.add(layers.Conv1D(32, 4, activation='relu')) 
model2.add(layers.MaxPooling1D(3)) 
model2.add(layers.Conv1D(32, 4, activation='relu'))
model2.add(layers.GlobalMaxPooling1D())
model2.add(Dense(2, activation='softmax'))

model2.compile(optimizer=RMSprop(learning_rate=1e-4), loss='binary_crossentropy', metrics=['acc'])
history = model2.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
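
Model 2 pairs a two-unit softmax output with `binary_crossentropy`, whereas Model 1 used `categorical_crossentropy`. For one-hot targets and exactly two softmax outputs the two losses give the same value, because the second probability is one minus the first. The NumPy sketch below checks this on made-up numbers; it is a simplified illustration and omits details of the Keras implementation such as clipping.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob):
    # Per-sample loss: minus the log probability of the true class.
    return -(y_true * np.log(y_prob)).sum(axis=1)

def binary_crossentropy(y_true, y_prob):
    # Per-sample loss: element-wise binary cross-entropy, averaged over output units.
    per_unit = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return per_unit.mean(axis=1)

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot targets
y_prob = np.array([[0.7, 0.3], [0.4, 0.6]])   # made-up softmax outputs, rows sum to 1

cce = categorical_crossentropy(y_true, y_prob)
bce = binary_crossentropy(y_true, y_prob)
print(cce, bce)
```

The equality relies on the rows summing to 1, so it does not carry over to an output layer with independent sigmoid units.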


In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:

#Submit Model 2: 

#-- Generate predicted y values (Model 2)
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 380

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [None]:
# Compare two or more models 
data=mycompetition.compare_models([1, 2], verbose=1)
mycompetition.stylize_compare(data)

| | Model_1_Layer | Model_1_Shape | Model_1_Params | Model_2_Layer | Model_2_Shape | Model_2_Params |
|---|---|---|---|---|---|---|
| 0 | Embedding | [None, 40, 16] | 160000.0 | Embedding | [None, 40, 16] | 160000 |
| 1 | Flatten | [None, 640] | 0.0 | LSTM | [None, 40, 32] | 6272 |
| 2 | Dense | [None, 2] | 1282.0 | LSTM | [None, 32] | 8320 |
| 3 | | | | Flatten | [None, 32] | 0 |
| 4 | | | | Dense | [None, 2] | 66 |


In [None]:
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-04-18 04:48:04--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-18 04:48:04--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-18 04:48:04--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [None]:
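# wget appends ".1" when glove.6B.zip already exists in the working directory;
# use the file name that wget reported for your download
! unzip glove.6B.zip.1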

Archive:  glove.6B.zip.1
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [None]:
# Extract embedding data for 100 feature embedding matrix
import os
glove_dir = os.getcwd()

embeddings_index = {}
# GloVe files are UTF-8 text: one word per line followed by its vector
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.
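
The parsing loop relies on the GloVe text layout: each line holds a word followed by its vector components, separated by spaces. The sketch below runs the same logic on two invented three-dimensional lines held in memory, so it can be tried without the 822 MB download.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout; real glove.6B.100d.txt
# lines carry 100 numbers each.
sample = io.StringIO("the 0.1 -0.2 0.3\nfilm 0.5 0.0 -0.4\n")

toy_index = {}
for line in sample:
    values = line.split()
    # First token is the word, the rest are the vector components.
    toy_index[values[0]] = np.asarray(values[1:], dtype="float32")

print(len(toy_index), toy_index["film"])
```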


In [None]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features
max_words = 10000
maxlen = 40
word_index = tokenizer.word_index

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
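
A toy version of the same loop makes the resulting layout easier to see: row 0 stays zero for the padding id, words without a pretrained vector stay zero, and words whose id is at or above the vocabulary cap are skipped. All names and values below are invented for illustration.

```python
import numpy as np

# Hypothetical stand-ins for tokenizer.word_index and embeddings_index.
toy_word_index = {"the": 1, "film": 2, "zzyzx": 3, "rare": 4}
toy_vectors = {
    "the": np.array([0.1, 0.2], dtype="float32"),
    "film": np.array([0.3, 0.4], dtype="float32"),
    "rare": np.array([0.5, 0.6], dtype="float32"),
}
toy_max_words, toy_dim = 4, 2  # keep ids 0..3 only

toy_matrix = np.zeros((toy_max_words, toy_dim))
covered = 0
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if i < toy_max_words and vector is not None:
        toy_matrix[i] = vector
        covered += 1

print(toy_matrix)
print(f"{covered} of {toy_max_words - 1} kept words have a pretrained vector")
```

Counting covered words this way is a quick check on how much of the vocabulary actually benefits from the pretrained vectors.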

In [None]:
import tensorflow as tf
model3 = tf.keras.Sequential()
model3.add(tf.keras.layers.Embedding(max_words, embedding_dim, input_length=maxlen))
model3.add(tf.keras.layers.Flatten())
model3.add(tf.keras.layers.Dense(32, activation='relu'))
model3.add(tf.keras.layers.Dense(2, activation='sigmoid'))
model3.summary()

Model: "sequential_30"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_26 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 flatten_5 (Flatten)         (None, 4000)              0         
                                                                 
 dense_25 (Dense)            (None, 32)                128032    
                                                                 
 dense_26 (Dense)            (None, 2)                 66        
                                                                 
Total params: 1,128,098
Trainable params: 1,128,098
Non-trainable params: 0
_________________________________________________________________
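
This summary was printed before the embedding layer is frozen in the next cell, so it still lists every parameter as trainable. The arithmetic below sketches how the split should look once `trainable = False` is set on the embedding layer; the variable names are illustrative only.

```python
embedding = 10000 * 100   # max_words * embedding_dim
hidden = 4000 * 32 + 32   # flattened 40 * 100 inputs into Dense(32), plus biases
output = 32 * 2 + 2       # Dense(2), plus biases

total = embedding + hidden + output
trainable_after_freeze = hidden + output
non_trainable_after_freeze = embedding
print(total, trainable_after_freeze, non_trainable_after_freeze)
```

Calling `model3.summary()` again after freezing is a simple way to confirm the split.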


In [None]:
model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = False



model3.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model3.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)
model3.save_weights('pre_trained_glove_model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model3, framework='keras',
                          transfer_learning=True,
                          deep_learning=True)

with open("model3.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
#Submit Model 3: 

#-- Generate predicted y values (Model 3)
prediction_column_index=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 389

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [None]:
# Get leaderboard

data = mycompetition.get_leaderboard()
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,simplernn_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,concatenate_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,batchnormalization_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
0,92.32%,92.31%,92.40%,92.32%,unknown,,,unknown,,,,,,,,,,,,,,,,,,,,,,,,,rian,78
1,82.99%,82.91%,83.59%,82.99%,keras,,True,Sequential,6.0,3603906.0,1.0,,,,1.0,,,,,2.0,1.0,,1.0,,,1.0,,,str,Adam,14417664.0,,eminilkay,143
2,82.77%,82.75%,82.85%,82.76%,keras,True,True,Sequential,8.0,830274.0,1.0,,,,,1.0,3.0,,,,,,3.0,,,1.0,3.0,2.0,str,RMSprop,3323248.0,,emmayang,234
3,82.55%,82.51%,82.87%,82.55%,keras,,True,Sequential,4.0,172674.0,1.0,,,,,1.0,1.0,,,,,,1.0,,,1.0,1.0,,str,RMSprop,691568.0,,timnyt,327
4,82.55%,82.54%,82.57%,82.55%,keras,,True,Sequential,6.0,817762.0,1.0,,,,,1.0,2.0,,,,,,2.0,,1.0,,2.0,1.0,str,RMSprop,3272592.0,,sdp2158,195
5,82.00%,81.99%,82.03%,82.00%,keras,,True,Sequential,3.0,111138.0,1.0,,,,,,,,,1.0,,,1.0,,,1.0,,,str,RMSprop,445856.0,9.0,realdfy,184
6,81.89%,81.88%,81.91%,81.89%,keras,,True,Sequential,4.0,187138.0,1.0,,,,,1.0,1.0,,,,,,1.0,,,1.0,1.0,,str,RMSprop,749424.0,,timnyt,385
7,81.78%,81.78%,81.79%,81.78%,keras,,True,Sequential,6.0,963856.0,1.0,1.0,1.0,,,1.0,,,,,,,2.0,,,1.0,,2.0,str,Adam,3856424.0,,francesyang,66
8,81.78%,81.78%,81.78%,81.78%,keras,,True,Sequential,3.0,2168577.0,1.0,,,,,,1.0,,,,,,1.0,,1.0,,1.0,,function,Adam,8675184.0,,rian,76
9,81.67%,81.65%,81.80%,81.66%,keras,,True,Sequential,5.0,174658.0,1.0,,,,,1.0,2.0,,,,,,1.0,,,1.0,2.0,,str,RMSprop,699936.0,,ran_liao,237


In [None]:
# Compare two or more models
data=mycompetition.compare_models([1, 2, 3], verbose=1)
mycompetition.stylize_compare(data)

| | Model_1_Layer | Model_1_Shape | Model_1_Params | Model_2_Layer | Model_2_Shape | Model_2_Params | Model_3_Layer | Model_3_Shape | Model_3_Params |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Embedding | [None, 40, 16] | 160000.0 | Embedding | [None, 40, 16] | 160000 | Embedding | [None, 40, 16] | 160000.0 |
| 1 | Flatten | [None, 640] | 0.0 | LSTM | [None, 40, 32] | 6272 | LSTM | [None, 40, 256] | 279552.0 |
| 2 | Dense | [None, 2] | 1282.0 | LSTM | [None, 32] | 8320 | Flatten | [None, 10240] | 0.0 |
| 3 | | | | Flatten | [None, 32] | 0 | Dense | [None, 2] | 20482.0 |
| 4 | | | | Dense | [None, 2] | 66 | | | |


**1. Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.**

The SST data contains fully labeled parse trees with a sentiment score for each phrase. A predictive model built on this data could help companies or individuals understand how others feel. Restaurants could adjust their business model based on the reviews customers leave. Investors could adjust their investment models based on public comments or reactions to the current state of the economy.

**2. Discuss which models performed better and point out relevant hyper-parameter values for successful models.**

The best-performing model combined Flatten, LSTM, and Dense layers. The activation function also matters and probably played a key role here; I used softmax and tanh.

