# Text Classification Using the Stanford SST Sentiment Dataset

Projects in Advanced Machine Learning

# **Data Preparation**

1) Get data in and set up X_train, X_test, y_train objects

In [27]:
#install aimodelshare library
! pip install aimodelshare==0.0.189

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [28]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [3]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

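# read_csv's squeeze= argument was removed in pandas 2.0; .squeeze("columns") returns the same Series
X_train=pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test=pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels=pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")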

# One-hot encode the labels (one column per class)
y_train = pd.get_dummies(y_train_labels)

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object
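
To make the one-hot step concrete, here is a minimal pure-Python sketch of what pd.get_dummies produces for string labels: one column per class, in sorted order, with an indicator in the column of each row's class. The helper name one_hot and the toy labels are illustrative assumptions and are not part of the notebook.

```python
def one_hot(labels):
    """Return (sorted class names, one-hot rows) for a list of string labels."""
    classes = sorted(set(labels))
    column_of = {name: j for j, name in enumerate(classes)}
    rows = [
        [1 if column_of[label] == j else 0 for j in range(len(classes))]
        for label in labels
    ]
    return classes, rows


# Toy labels (hypothetical), using the same two class names as the dataset
toy_labels = ["Positive", "Negative", "Positive"]
toy_classes, toy_rows = one_hot(toy_labels)
print(toy_classes)
print(toy_rows)
```

The sorted column order is what later lets y_train.columns translate a predicted column index back into a label.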

In [7]:
print(X_train[1])

The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .


In [11]:
print(X_train[2])

Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .


In [26]:
print("X_train size:", len(X_train))
print("X_test size:", len(X_test))
print("total number of reviews:", len(X_train)+len(X_test))
print("y_train size:", len(y_train))

X_train size: 6920
X_test size: 1821
total number of reviews: 8741
y_train size: 6920


In [25]:
# Take a look at the y_train_labels
import numpy as np
from collections import Counter

Counter(y_train_labels)

Counter({'Positive': 3610, 'Negative': 3310})
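
The counts above also give a useful reference point: a model that always predicts the majority class would already be right on a little over half of the training reviews. The sketch below computes that baseline from the printed counts; it assumes nothing about the test set, whose class balance is not shown here.

```python
from collections import Counter

# Label counts copied from the Counter output above
label_counts = Counter({"Positive": 3610, "Negative": 3310})

# Always predicting the most frequent class gives this training accuracy
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(label_counts.values())
print(majority_label, round(baseline_accuracy, 4))
```

Any trained model should clear this bar by a comfortable margin before its accuracy is considered meaningful.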

# **Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.**



This dataset is the Stanford Sentiment Treebank (SST) movie review classification data. In total, there are 8741 reviews, of which 6920 are labeled.

The data comes in three files:
1. X_train: there are 6920 reviews in X_train.
2. X_test: there are 1821 reviews in X_test.
3. y_train_labels: y_train_labels contains one label for each of the 6920 reviews in X_train. There are two classes, Positive and Negative, with 3610 positive labels and 3310 negative labels.

Here, I will build a predictive model using this dataset. Building a label prediction model for a movie review classification dataset would be useful for several reasons:

* Automation: A label prediction model can automate the task of classifying movie reviews, saving time and effort compared to manual classification. This can be particularly useful when dealing with large datasets that would be impractical to classify manually.

* Consistency: A label prediction model can provide consistent and standardized classification, reducing the potential for human error and bias.

* Insights: By analyzing the model's predictions, we can gain insights into the factors that contribute to positive or negative reviews. This can help movie studios and filmmakers to better understand their audiences and improve their products.

* Recommendation systems: A label prediction model can be integrated into a recommendation system to suggest movies to users based on their preferences. This can improve the user experience and increase user engagement.

Overall, building a label prediction model for a movie review classification dataset can help to improve efficiency, consistency, and insights, leading to better decision-making and user experiences.

The model can be useful for researchers and analysts who are interested in understanding the patterns and trends in movie reviews. They can use the model to identify the most common positive and negative sentiments in the reviews and track how these sentiments change over time.

2) Preprocess data using keras tokenizer / Write and Save Preprocessor function

In [44]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

word_index = tokenizer.word_index

# preprocessor tokenizes words and makes sure all documents have the same length:
# shorter reviews are zero-padded and longer ones are truncated to maxlen tokens
def preprocessor(data, maxlen=40, max_words=10000):
    # max_words is unused here; the vocabulary size is fixed by the Tokenizer above

    sequences = tokenizer.texts_to_sequences(data)

    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
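
As a rough illustration of what the preprocessor does, the sketch below re-implements the two steps in plain Python: ranking words by frequency (index 0 is reserved for padding) and then padding or truncating each sequence at the front, which is the default behavior of pad_sequences. The function names, the whitespace tokenization, and the toy sentences are simplifying assumptions; the real Keras Tokenizer also filters punctuation.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to its 1-based frequency rank (most frequent word gets 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, num_words, maxlen):
    """Convert texts to fixed-length integer rows, padding and truncating at the front."""
    rows = []
    for text in texts:
        ids = [
            word_index[w]
            for w in text.lower().split()
            if w in word_index and word_index[w] < num_words
        ]
        ids = ids[-maxlen:]  # keep the last maxlen tokens ("pre" truncation)
        rows.append([0] * (maxlen - len(ids)) + ids)  # left-pad with zeros
    return rows


# Hypothetical toy corpus
toy_texts = ["a good film", "a bad film", "a film"]
toy_index = build_word_index(toy_texts)
toy_padded = texts_to_padded(toy_texts, toy_index, num_words=10, maxlen=4)
toy_truncated = texts_to_padded(["a good film a bad film"], toy_index, num_words=10, maxlen=4)
print(toy_padded)
print(toy_truncated)
```

Because truncation happens at the front, reviews longer than 40 tokens keep their final 40 tokens and lose their opening words.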


# **Run at least three prediction models to try to predict the SST sentiment dataset well.**

* Use an Embedding layer and LSTM layers in at least one model

* Use an Embedding layer and Conv1D layers in at least one model

* Use transfer learning with GloVe embeddings for at least one of these models

In [64]:
# Model 1: Embedding layer and LSTM layers

from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten
from tensorflow.keras.models import Sequential

model1 = Sequential()
model1.add(Embedding(10000, 16, input_length=40))
model1.add(LSTM(32, return_sequences=True, dropout=0.2))
model1.add(LSTM(32, dropout=0.2))
model1.add(Flatten())
model1.add(Dense(2, activation='softmax'))

model1.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model1.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
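
For a sense of the size of Model 1, its parameter count can be worked out by hand. The sketch below uses the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias) with the layer sizes from the cell above; the helper name lstm_params is an illustrative assumption.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


embedding_params = 10000 * 16        # vocabulary size x embedding dimension
lstm1_params = lstm_params(16, 32)   # first LSTM reads 16-dim embeddings
lstm2_params = lstm_params(32, 32)   # second LSTM reads the first LSTM's 32-dim outputs
dense_params = 32 * 2 + 2            # weights plus biases of the 2-unit output layer

model1_total = embedding_params + lstm1_params + lstm2_params + dense_params
print(lstm1_params, lstm2_params, model1_total)
```

The Flatten layer adds no parameters, and most of the total sits in the embedding table rather than in the recurrent layers.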


In [65]:
# Model 2: Embedding layer and Conv1d layers

from tensorflow.keras.layers import Dense, Embedding, Conv1D, GlobalMaxPooling1D

model2 = Sequential()
model2.add(Embedding(10000, 32, input_length=40))
model2.add(Conv1D(32, 7, activation='relu'))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(2, activation='softmax'))

model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history2 = model2.fit(preprocessor(X_train), y_train, epochs=10, batch_size=128, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
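
The shapes and parameter count of Model 2 can be checked with a few lines of arithmetic. The sketch below assumes the Conv1D default of no padding ("valid"), so a kernel of size 7 shortens a 40-token sequence to 34 positions before global max pooling reduces it to one value per filter.

```python
vocab_size, embed_dim, seq_len = 10000, 32, 40
filters, kernel_size, n_classes = 32, 7, 2

embedding_params = vocab_size * embed_dim
# Each filter spans kernel_size positions across all embed_dim channels, plus one bias
conv_params = kernel_size * embed_dim * filters + filters
conv_output_len = seq_len - kernel_size + 1  # "valid" padding, stride 1
dense_params = filters * n_classes + n_classes

model2_total = embedding_params + conv_params + dense_params
print(conv_output_len, conv_params, model2_total)
```

The total of 327,266 agrees with the num_params value that the leaderboard lists for version 71, the version number assigned to Model 2 on submission.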


In [48]:
# Model 3: transfer learning with glove embeddings

# Download Glove embedding matrix weights
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

! unzip glove.6B.zip 

--2023-04-13 02:15:00--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-13 02:15:00--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-13 02:15:00--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [54]:
# Extract embedding data for 100 feature embedding matrix
import os
glove_dir = os.getcwd()

embeddings_index = {}
# Use a context manager so the file is closed even if parsing fails
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features

max_words = 10000

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector


Found 400001 word vectors.
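
One number the cell above does not report is how many vocabulary words actually receive a pre-trained vector; the rest stay as all-zero rows. The following sketch shows the same parse-and-fill logic on a tiny in-memory stand-in for the GloVe file and counts the hits. The toy vectors, the word zzyzx, and the function names are hypothetical; plain lists are used instead of NumPy arrays to keep the example dependency-free.

```python
import io


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_matrix(word_index, vectors, max_words, dim):
    """Fill row i with the vector of the word ranked i; count how many rows were filled."""
    matrix = [[0.0] * dim for _ in range(max_words)]
    hits = 0
    for word, i in word_index.items():
        vec = vectors.get(word)
        if i < max_words and vec is not None:
            matrix[i] = vec
            hits += 1
    return matrix, hits


# Toy stand-in for a GloVe text file (3-dimensional vectors, made-up values)
toy_glove = io.StringIO("film 0.1 0.2 0.3\ngood 0.4 0.5 0.6\n")
toy_vectors = load_vectors(toy_glove)
toy_word_index = {"film": 1, "good": 2, "zzyzx": 3}
toy_matrix, toy_hits = build_matrix(toy_word_index, toy_vectors, max_words=4, dim=3)
print(toy_hits, "of", len(toy_word_index), "words have a pre-trained vector")
```

Applied to the real word_index, a low hit rate would be one possible reason for a frozen GloVe layer to underperform embeddings learned from scratch.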


In [62]:
model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=40))
model3.add(Flatten())
model3.add(Dense(32, activation='relu'))
model3.add(Dense(2, activation='softmax'))
model3.summary()

Model: "sequential_23"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_23 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 flatten_16 (Flatten)        (None, 4000)              0         
                                                                 
 dense_30 (Dense)            (None, 32)                128032    
                                                                 
 dense_31 (Dense)            (None, 2)                 66        
                                                                 
Total params: 1,128,098
Trainable params: 1,128,098
Non-trainable params: 0
_________________________________________________________________
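
The summary above reports all 1,128,098 parameters as trainable because it was printed before the Embedding layer is frozen in the next cell. As a quick check, the arithmetic below reproduces the per-layer counts and shows how many parameters remain trainable once the embedding weights are fixed.

```python
max_words, embedding_dim, seq_len = 10000, 100, 40

embedding_params = max_words * embedding_dim   # GloVe table, frozen after set_weights
flattened_size = seq_len * embedding_dim       # Flatten output: 40 x 100 values
hidden_params = flattened_size * 32 + 32       # Dense(32): weights plus biases
output_params = 32 * 2 + 2                     # Dense(2): weights plus biases

total_params = embedding_params + hidden_params + output_params
trainable_after_freeze = hidden_params + output_params
print(total_params, trainable_after_freeze)
```

After freezing, only the two Dense layers (128,098 parameters) are updated during training.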


In [63]:
# Set weights for the Embedding layer
model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = False

model3.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

history3 = model3.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

model3.save_weights('pre_trained_glove_model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
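
All three models pair a 2-unit softmax output with the binary_crossentropy loss. For exactly two classes and one-hot targets this gives the same value as categorical cross-entropy, because the two softmax outputs sum to 1. The sketch below checks this numerically on a single made-up prediction; it ignores the small clipping Keras applies to probabilities for numerical stability.

```python
import math


def binary_crossentropy(y_true, y_pred):
    """Element-wise binary cross-entropy, averaged over the output units."""
    terms = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for t, p in zip(y_true, y_pred)
    ]
    return sum(terms) / len(terms)


def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


y_true = [0.0, 1.0]   # one-hot target for the second class
y_pred = [0.3, 0.7]   # hypothetical softmax output (sums to 1)

bce = binary_crossentropy(y_true, y_pred)
cce = categorical_crossentropy(y_true, y_pred)
print(bce, cce)
```

With three or more classes the two losses differ, and categorical_crossentropy would be the usual choice for a softmax output.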


# **Discuss which models performed better and point out relevant hyper-parameter values for successful models.**

Model 1 is #10 on the leaderboard.

Model 2 is #3 on the leaderboard.

Model 3 is #39 on the leaderboard.

Model 2 performs better than the others. The relevant hyper-parameter values for Model 2 are:


*   **Embedding Dimension**: The embedding layer in the model defines the dimension of the dense embedding vectors used to represent each word in the input text. In this case, the embedding dimension is set to 32.

*   **Filter Size**: The Conv1D layer in the model uses a kernel of size 7 to perform convolution on the input sequence.

*   **Number of Filters**: The Conv1D layer in the model uses 32 filters to generate 32 feature maps.

*   **Pooling Method**: The GlobalMaxPooling1D layer in the model performs global max pooling on the feature maps generated by the Conv1D layer.

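*   **Output Units**: The final Dense layer has 2 units, one per class (Negative and Positive), with a softmax activation. It is the output layer rather than a hidden layer; the model has no hidden Dense layer.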

*   **Learning Rate**: The learning rate is a hyperparameter that controls the step size of the optimizer during training. In this case, the optimizer is RMSprop, which has a default learning rate of 0.001.

*   **Batch Size**: The batch size is the number of samples that are processed before the model is updated during training. In this case, the batch size is set to 128.

*   **Number of Epochs**: An epoch is one full pass over the training data. In this case, the model is trained for 10 epochs; because validation_split=0.2 is set, each pass covers the 80% of X_train that is used for fitting.
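
Batch size and epochs together determine how many weight updates a model receives. The sketch below works this out for the batch sizes used in this notebook (32, 128, and 256), assuming that validation_split=0.2 leaves 80% of the 6920 training reviews for fitting; the function name updates is illustrative.

```python
import math

n_total = 6920
n_fit = n_total * 8 // 10   # assumed: 80% kept for fitting under validation_split=0.2
n_val = n_total - n_fit     # remaining 20% held out for validation


def updates(batch_size, epochs=10):
    """Return (steps per epoch, total weight updates) for a given batch size."""
    steps_per_epoch = math.ceil(n_fit / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


for batch_size in (32, 128, 256):
    print(batch_size, updates(batch_size))
```

With the same 10 epochs, a batch size of 256 yields half as many updates as a batch size of 128, which is worth keeping in mind when comparing models that differ only in batch size.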



# **Submit your best three models to the leader board for the SST Model Share competition.**

In [71]:
# Save preprocessor
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


In [66]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this Movie Review Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [67]:
#Instantiate Competition
import aimodelshare as ai
mycompetition= ai.Competition(apiurl)

In [69]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model1 = model_to_onnx(model1, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model1.onnx", "wb") as f:
    f.write(onnx_model1.SerializeToString())

In [78]:
# Submit the best three models to the leader board

#Submit Model 1: 

# Note: model.predict returns class probabilities; argmax(axis=1) picks the column index of the most likely class
prediction_column_index=model1.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model1.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 70

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
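
The decoding step in the submission cell can be illustrated without Keras. The sketch below takes a few made-up probability rows, picks the index of the largest value in each row, and maps it to a class name. It assumes the column order Negative, Positive, which is the sorted order pd.get_dummies uses for string labels; the helper argmax and the toy probabilities are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a row (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


# Assumed column order of y_train (sorted class names)
class_names = ["Negative", "Positive"]

# Hypothetical softmax outputs for three reviews
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
toy_predictions = [class_names[argmax(row)] for row in toy_probs]
print(toy_predictions)
```

Submitting string labels rather than column indices keeps the predictions in the same form as y_train_labels.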


In [79]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
0,81.78%,81.78%,81.79%,81.78%,keras,,True,Sequential,6.0,963856.0,1.0,1.0,1.0,,1.0,,,,,,2.0,,1.0,,2.0,str,Adam,3856424.0,,francesyang,66
1,80.90%,80.89%,80.96%,80.90%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,26
2,80.57%,80.35%,82.02%,80.58%,keras,,True,Sequential,4.0,201154.0,1.0,,,,1.0,,,,,,2.0,,2.0,,,str,RMSprop,805360.0,,1jiahe,46
3,79.80%,79.63%,80.85%,79.81%,keras,,True,Sequential,5.0,193702.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,776112.0,,amsay99,43
4,80.13%,80.13%,80.16%,80.13%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,25
5,79.58%,79.46%,80.34%,79.59%,keras,,True,Sequential,4.0,206850.0,1.0,,,,1.0,1.0,,,,,1.0,,1.0,1.0,,str,RMSprop,828272.0,,jer2240,51
6,79.25%,79.06%,80.41%,79.26%,keras,,True,Sequential,5.0,174658.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,55
7,79.25%,79.21%,79.52%,79.26%,keras,,True,Sequential,5.0,287402.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,1150912.0,,amsay99,60
8,78.81%,78.64%,79.80%,78.82%,keras,,True,Sequential,4.0,82952.0,1.0,1.0,,,,,,,1.0,,1.0,,1.0,,1.0,str,RMSprop,332568.0,,amsay99,61
9,78.70%,78.62%,79.21%,78.71%,keras,,True,Sequential,6.0,963856.0,1.0,1.0,1.0,,1.0,,,,,,2.0,,1.0,,2.0,str,Adam,3856424.0,,francesyang,65
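
The accuracy column in the leaderboard output is stored as percent strings, so it has to be converted before any numeric comparison. The sketch below parses three rows copied from the output above (reduced to three columns for brevity) with the standard csv module and re-ranks them by accuracy. In the notebook itself the same idea would apply to the DataFrame returned by get_leaderboard; that adaptation is left as an assumption.

```python
import csv
import io

# Three rows taken from the leaderboard output above, trimmed to three columns
toy_csv = io.StringIO(
    "accuracy,username,version\n"
    "79.80%,amsay99,43\n"
    "80.13%,chachagsedaro,25\n"
    "79.58%,jer2240,51\n"
)

rows = list(csv.DictReader(toy_csv))
for row in rows:
    # "80.13%" -> 80.13
    row["accuracy_value"] = float(row["accuracy"].rstrip("%"))

ranked = sorted(rows, key=lambda r: r["accuracy_value"], reverse=True)
for row in ranked:
    print(row["username"], row["version"], row["accuracy_value"])
```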


In [74]:
# Save keras model2 to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model2 = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model2.SerializeToString())

In [80]:
# Submit the best three models to the leader board

# Submit Model 2

# Note: model.predict returns class probabilities; argmax(axis=1) picks the column index of the most likely class
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 71

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [81]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
0,81.78%,81.78%,81.79%,81.78%,keras,,True,Sequential,6.0,963856.0,1.0,1.0,1.0,,1.0,,,,,,2.0,,1.0,,2.0,str,Adam,3856424.0,,francesyang,66
1,80.90%,80.89%,80.96%,80.90%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,26
2,80.57%,80.35%,82.02%,80.58%,keras,,True,Sequential,4.0,201154.0,1.0,,,,1.0,,,,,,2.0,,2.0,,,str,RMSprop,805360.0,,1jiahe,46
3,80.57%,80.50%,81.06%,80.58%,keras,,True,Sequential,4.0,327266.0,1.0,1.0,,,,,,,1.0,,1.0,,1.0,,1.0,str,RMSprop,1309824.0,,mymstella,71
4,79.80%,79.63%,80.85%,79.81%,keras,,True,Sequential,5.0,193702.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,776112.0,,amsay99,43
5,80.13%,80.13%,80.16%,80.13%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,25
6,79.58%,79.46%,80.34%,79.59%,keras,,True,Sequential,4.0,206850.0,1.0,,,,1.0,1.0,,,,,1.0,,1.0,1.0,,str,RMSprop,828272.0,,jer2240,51
7,79.25%,79.06%,80.41%,79.26%,keras,,True,Sequential,5.0,174658.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,55
8,79.25%,79.21%,79.52%,79.26%,keras,,True,Sequential,5.0,287402.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,1150912.0,,amsay99,60
9,78.81%,78.64%,79.80%,78.82%,keras,,True,Sequential,4.0,82952.0,1.0,1.0,,,,,,,1.0,,1.0,,1.0,,1.0,str,RMSprop,332568.0,,amsay99,61


In [83]:
# Save keras model3 to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model3 = model_to_onnx(model3, framework='keras',
                          transfer_learning=True,
                          deep_learning=True)

with open("model3.onnx", "wb") as f:
    f.write(onnx_model3.SerializeToString())

In [84]:
# Submit the best three models to the leader board

# Submit Model 3

# Note: model.predict returns class probabilities; argmax(axis=1) picks the column index of the most likely class
prediction_column_index=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 72

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [85]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
0,81.78%,81.78%,81.79%,81.78%,keras,,True,Sequential,6.0,963856.0,1.0,1.0,1.0,,1.0,,,,,,2.0,,1.0,,2.0,str,Adam,3856424.0,,francesyang,66
1,80.90%,80.89%,80.96%,80.90%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,26
2,80.57%,80.35%,82.02%,80.58%,keras,,True,Sequential,4.0,201154.0,1.0,,,,1.0,,,,,,2.0,,2.0,,,str,RMSprop,805360.0,,1jiahe,46
3,80.57%,80.50%,81.06%,80.58%,keras,,True,Sequential,4.0,327266.0,1.0,1.0,,,,,,,1.0,,1.0,,1.0,,1.0,str,RMSprop,1309824.0,,mymstella,71
4,80.13%,80.13%,80.16%,80.13%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,25
5,79.80%,79.63%,80.85%,79.81%,keras,,True,Sequential,5.0,193702.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,776112.0,,amsay99,43
6,79.58%,79.46%,80.34%,79.59%,keras,,True,Sequential,4.0,206850.0,1.0,,,,1.0,1.0,,,,,1.0,,1.0,1.0,,str,RMSprop,828272.0,,jer2240,51
7,79.25%,79.06%,80.41%,79.26%,keras,,True,Sequential,5.0,174658.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,55
8,79.25%,79.21%,79.52%,79.26%,keras,,True,Sequential,5.0,287402.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,1150912.0,,amsay99,60
9,78.81%,78.64%,79.80%,78.82%,keras,,True,Sequential,4.0,82952.0,1.0,1.0,,,,,,,1.0,,1.0,,1.0,,1.0,str,RMSprop,332568.0,,amsay99,61


**3. After you submit your first three models, describe your best model with your team via your team slack channel**

* Fit and submit up to three more models after learning from your team.

In [87]:
# Model 4
from tensorflow.keras.layers import Dense, Embedding, Conv1D, GlobalMaxPooling1D

model4 = Sequential()
model4.add(Embedding(10000, 32, input_length=40))
model4.add(Conv1D(32, 7, activation='relu'))
model4.add(GlobalMaxPooling1D())
model4.add(Dense(2, activation='softmax'))

model4.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history4 = model4.fit(preprocessor(X_train), y_train, epochs=10, batch_size=256, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [88]:
# Save keras model4 to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model4 = model_to_onnx(model4, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model4.onnx", "wb") as f:
    f.write(onnx_model4.SerializeToString())

In [90]:
# Submit Model 4

# Note: model.predict returns class probabilities; argmax(axis=1) picks the column index of the most likely class
prediction_column_index=model4.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 4 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model4.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 73

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [91]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,embedding_layers,conv1d_layers,maxpooling1d_layers,dropout_layers,flatten_layers,lstm_layers,inputlayer_layers,bidirectional_layers,globalmaxpooling1d_layers,globalaveragepooling1d_layers,dense_layers,sigmoid_act,softmax_act,tanh_act,relu_act,loss,optimizer,memory_size,team,username,version
0,81.78%,81.78%,81.79%,81.78%,keras,,True,Sequential,6.0,963856.0,1.0,1.0,1.0,,1.0,,,,,,2.0,,1.0,,2.0,str,Adam,3856424.0,,francesyang,66
1,80.90%,80.89%,80.96%,80.90%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,26
2,80.57%,80.35%,82.02%,80.58%,keras,,True,Sequential,4.0,201154.0,1.0,,,,1.0,,,,,,2.0,,2.0,,,str,RMSprop,805360.0,,1jiahe,46
3,80.57%,80.50%,81.06%,80.58%,keras,,True,Sequential,4.0,327266.0,1.0,1.0,,,,,,,1.0,,1.0,,1.0,,1.0,str,RMSprop,1309824.0,,mymstella,71
4,80.13%,80.13%,80.16%,80.13%,keras,,True,Sequential,4.0,640130.0,1.0,,,1.0,,,,,,1.0,1.0,,1.0,,,str,RMSprop,3111632.0,,chachagsedaro,25
5,79.80%,79.63%,80.85%,79.81%,keras,,True,Sequential,5.0,193702.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,776112.0,,amsay99,43
6,79.58%,79.46%,80.34%,79.59%,keras,,True,Sequential,4.0,206850.0,1.0,,,,1.0,1.0,,,,,1.0,,1.0,1.0,,str,RMSprop,828272.0,,jer2240,51
7,79.25%,79.06%,80.41%,79.26%,keras,,True,Sequential,5.0,174658.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,699936.0,7.0,lprockop,55
8,79.25%,79.21%,79.52%,79.26%,keras,,True,Sequential,5.0,287402.0,1.0,,,,1.0,2.0,,,,,1.0,,1.0,2.0,,str,RMSprop,1150912.0,,amsay99,60
9,78.81%,78.64%,79.80%,78.82%,keras,,True,Sequential,4.0,82952.0,1.0,1.0,,,,,,,1.0,,1.0,,1.0,,1.0,str,RMSprop,332568.0,,amsay99,61


# **Discuss results**

Model4 is #12 on the leaderboard. So far, Model2 performs the best.

# **4. Discuss which models you tried and which models performed better and point out relevant hyper-parameter values for successful models.**

I tried 4 models in total. Model 1 uses an Embedding layer and LSTM layers. Model 2 uses an Embedding layer and Conv1D layers. Model 3 uses transfer learning with GloVe embeddings. Model 4 is based on Model 2 and on the discussion with my group; it differs from Model 2 only in its batch size (256 instead of 128). Among them, Model 2 performs the best, with the following hyper-parameter values:

*   **Embedding Dimension**: The embedding layer in the model defines the dimension of the dense embedding vectors used to represent each word in the input text. In this case, the embedding dimension is set to 32.

*   **Filter Size**: The Conv1D layer in the model uses a kernel of size 7 to perform convolution on the input sequence.

*   **Number of Filters**: The Conv1D layer in the model uses 32 filters to generate 32 feature maps.

*   **Pooling Method**: The GlobalMaxPooling1D layer in the model performs global max pooling on the feature maps generated by the Conv1D layer.

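*   **Output Units**: The final Dense layer has 2 units, one per class (Negative and Positive), with a softmax activation. It is the output layer rather than a hidden layer; the model has no hidden Dense layer.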

*   **Learning Rate**: The learning rate is a hyperparameter that controls the step size of the optimizer during training. In this case, the optimizer is RMSprop, which has a default learning rate of 0.001.

*   **Batch Size**: The batch size is the number of samples that are processed before the model is updated during training. In this case, the batch size is set to 128.

*   **Number of Epochs**: An epoch is one full pass over the training data. In this case, the model is trained for 10 epochs; because validation_split=0.2 is set, each pass covers the 80% of X_train that is used for fitting.