## Text Classification Using the Stanford SST Sentiment Dataset

### 1. Get data in and set up X_train, X_test, y_train objects



In [None]:
! pip install aimodelshare==0.0.189

**NOTE:** Restart the runtime after the installation

In [1]:
# Get the data in from the package 
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


## Discussion of the data set and the importance of text classification models:
 
The data used for the models below consist of movie reviews, each classified as negative or positive based on its commentary on the movie. A model that can accurately predict the sentiment behind commentary can be a very powerful tool for any type of textual feedback. Such a model could be applied across a wide range of fields that rely on commentary or feedback, such as public policy, mental health, and entertainment. It would let stakeholders measure the public response to a particular project and make changes as needed within a reasonable time, depending on the problem being tackled.

In [93]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

X_train=pd.read_csv("sst2_competition_data/X_train.csv", squeeze=True)
X_test=pd.read_csv("sst2_competition_data/X_test.csv", squeeze=True)

y_train_labels=pd.read_csv("sst2_competition_data/y_train_labels.csv", squeeze=True)

# One-hot encode the y data
y_train = pd.get_dummies(y_train_labels)

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object
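
The one-hot encoding above, and the later step that maps predictions back to labels with `y_train.columns[i]`, can be sketched in plain Python. This is a minimal illustration, not the notebook's actual data: the label strings and the probabilities are hypothetical, and the sorted column order is assumed to mirror what `pd.get_dummies` produces for string labels.

```python
# Toy labels standing in for y_train_labels (hypothetical values).
toy_labels = ["Positive", "Negative", "Positive", "Negative"]

# get_dummies-style columns: the sorted unique label values.
toy_columns = sorted(set(toy_labels))

# One row per example, with a 1 in the column that matches its label.
toy_one_hot = [[1 if label == col else 0 for col in toy_columns]
               for label in toy_labels]

def argmax(row):
    """Return the index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])

# Hypothetical softmax outputs from a two-class model.
toy_probabilities = [[0.2, 0.8], [0.9, 0.1]]

# Map each predicted column index back to its label name.
toy_decoded = [toy_columns[argmax(row)] for row in toy_probabilities]

print(toy_columns)   # ['Negative', 'Positive']
print(toy_one_hot)   # [[0, 1], [1, 0], [0, 1], [1, 0]]
print(toy_decoded)   # ['Positive', 'Negative']
```

Because the column order fixes which output unit corresponds to which label, the same column list has to be used for encoding and for decoding.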

## 2. Preprocess data using keras tokenizer / Write and Save Preprocessor function

In [74]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

# preprocessor tokenizes words and makes sure all documents have the same length
def preprocessor(data, maxlen=40, max_words=10000):

    sequences = tokenizer.texts_to_sequences(data)

    word_index = tokenizer.word_index
    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
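
To make the effect of `maxlen` concrete, here is a small plain-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`). The helper `pad_pre` and the token ids are hypothetical and only mimic the Keras behaviour for illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences defaults: pad and truncate at the front."""
    truncated = sequence[-maxlen:]                  # keep the last maxlen tokens
    padding = [value] * (maxlen - len(truncated))   # zeros go in front
    return padding + truncated

short_review = [12, 7, 431]        # hypothetical token ids
long_review = list(range(1, 9))    # eight hypothetical token ids

print(pad_pre(short_review, maxlen=5))   # [0, 0, 12, 7, 431]
print(pad_pre(long_review, maxlen=5))    # [4, 5, 6, 7, 8]
```

Under these defaults a long review loses its first words rather than its last ones, which is worth keeping in mind when choosing `maxlen`.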


In [4]:
# Save the preprocessor in a local file
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


## 3. Fit models on preprocessed data and save preprocessor function and model


#### Model 1:

In [5]:
from tensorflow.keras.layers import Dense, Embedding,Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 40, 16)            160000    
                                                                 
 flatten (Flatten)           (None, 640)               0         
                                                                 
 dense (Dense)               (None, 2)                 1282      
                                                                 
 flatten_1 (Flatten)         (None, 2)                 0         
                                                                 
 dense_1 (Dense)             (None, 2)                 6         
                                                                 
Total params: 161,288
Trainable params: 161,288
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 

Save the model above to an ".onnx" file

In [6]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [7]:

from aimodelshare.aws import set_credentials 
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground
set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [8]:
#Instantiate Competition
mycompetition= ai.Competition(apiurl)

### Submitting model 1: 

In [9]:
#-- Generate predicted y values (Model 1)
#Note: Keras predict returns class probabilities; argmax gives the predicted column index
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 399

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model 2: 

In [14]:
# Setting the hyperparameters

vocab_size = 10000  # Limit the vocabulary size to the top 10,000 words
maxlen = 40        # Set the maximum sequence length
embedding_dim = 32  # Dimension of the word embeddings
lstm_units = 36    # Number of LSTM units

In [20]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Flatten
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.layers import Bidirectional, TimeDistributed, BatchNormalization, Dropout

model2 = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=maxlen),
    Bidirectional(LSTM(lstm_units, return_sequences=True, dropout=0.2)),
    BatchNormalization(),
    Bidirectional(LSTM(lstm_units, return_sequences=True, dropout=0.2)),
    TimeDistributed(Dense(lstm_units, activation='relu')),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(2, activation='softmax')
])
model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model2.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
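
Model 2 pairs a two-unit softmax output with `binary_crossentropy`, while Model 1 used `categorical_crossentropy`. The sketch below is a hand-rolled check, not the Keras implementation: it assumes that binary cross-entropy is averaged over the output units and it ignores the small epsilon clipping Keras applies. Under those assumptions the two losses coincide for one-hot, two-class targets.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def binary_crossentropy(y_true, y_pred):
    """Element-wise binary cross-entropy, averaged over the output units."""
    per_unit = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
                for t, p in zip(y_true, y_pred)]
    return sum(per_unit) / len(per_unit)

toy_target = [0, 1]          # one-hot target (hypothetical example)
toy_softmax = [0.3, 0.7]     # softmax output, sums to 1

cce = categorical_crossentropy(toy_target, toy_softmax)
bce = binary_crossentropy(toy_target, toy_softmax)

print(math.isclose(cce, bce))             # True
print(math.isclose(cce, -math.log(0.7)))  # True
```

The equality relies on the two softmax outputs summing to 1; with more than two classes the losses would differ.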


In [21]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

### Submitting model 2:

In [22]:
#Submit Model 2: 

#-- Generate predicted y values (Model 2)
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 401

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model 3: 

In [69]:
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D

model3 = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=maxlen),
    Conv1D(filters=64, kernel_size=3, activation='relu', padding='same'),
    BatchNormalization(),
    Conv1D(filters=64, kernel_size=3, activation='relu', padding='same'),
    GlobalMaxPooling1D(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(2, activation='softmax')
])

model3.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model3.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [70]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model3, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model3.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [75]:
#Submit Model 3: 

#-- Generate predicted y values (Model 3)
prediction_column_index=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 418

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model 4:

Downloaded the pre-trained word vector document from here: https://nlp.stanford.edu/projects/glove/

In [34]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [27]:
# Function to load the downloaded pre-trained vector of words
def load_glove_embeddings(glove_file):
    embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove_file = '/content/drive/MyDrive/glove/glove.6B.100d.txt'  
glove_embeddings = load_glove_embeddings(glove_file)


In [50]:
texts = pd.read_csv("sst2_competition_data/X_train.csv", squeeze=True)
labels =pd.read_csv("sst2_competition_data/y_train_labels.csv", squeeze=True)

# Tokenize the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad the sequences
maxlen = 40
x_data = pad_sequences(sequences, maxlen=maxlen)

# Encode the labels
label_encoder = LabelEncoder()
integer_labels = label_encoder.fit_transform(labels)
y_data = to_categorical(integer_labels)


In [51]:
# creating the embedding matrix 
def create_embedding_matrix(glove_embeddings, word_index, embedding_dim):
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = glove_embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

word_index = tokenizer.word_index
embedding_dim = 100  # Update the embedding dimension based on the GloVe embeddings
embedding_matrix = create_embedding_matrix(glove_embeddings, word_index, embedding_dim)
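
The two steps above (parsing GloVe lines and filling the embedding matrix) can be illustrated on a toy scale. This is a plain-Python sketch with made-up vectors and a made-up vocabulary; it uses nested lists instead of NumPy arrays, and every `toy_` name is hypothetical.

```python
# Two made-up lines in GloVe text format: a word followed by its vector.
toy_glove_lines = ["good 0.5 0.1 -0.2", "bad -0.4 0.3 0.9"]

toy_embeddings = {}
for line in toy_glove_lines:
    values = line.split()
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]

# Hypothetical tokenizer vocabulary; Keras word indices start at 1.
toy_word_index = {"good": 1, "movie": 2, "bad": 3}
toy_dim = 3

# Row 0 is reserved for padding; words missing from GloVe keep a zero row.
toy_matrix = [[0.0] * toy_dim for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector

for row in toy_matrix:
    print(row)
# [0.0, 0.0, 0.0]
# [0.5, 0.1, -0.2]
# [0.0, 0.0, 0.0]
# [-0.4, 0.3, 0.9]
```

Here "movie" has no pre-trained vector, so its row stays at zero. With a frozen embedding layer, such words carry no signal into the model.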





In [52]:
from tensorflow.keras.initializers import Constant

# running the model based on the prior steps 
vocab_size = len(word_index) + 1
num_classes = len(label_encoder.classes_)

maxlen = 40  

model4 = Sequential([
    Embedding(vocab_size, embedding_dim, embeddings_initializer=Constant(embedding_matrix),
              input_length=maxlen, trainable=False),  # Set trainable to False to freeze the embeddings
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(num_classes, activation='softmax')
])

model4.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model4.summary()


Model: "sequential_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_15 (Embedding)    (None, 40, 100)           1383600   
                                                                 
 flatten_13 (Flatten)        (None, 4000)              0         
                                                                 
 dense_34 (Dense)            (None, 128)               512128    
                                                                 
 dropout_16 (Dropout)        (None, 128)               0         
                                                                 
 dense_35 (Dense)            (None, 2)                 258       
                                                                 
Total params: 1,895,986
Trainable params: 512,386
Non-trainable params: 1,383,600
_________________________________________________________________
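
The parameter counts in this summary can be reproduced by hand. The sketch below uses two hypothetical helper functions. The vocabulary size of 13,836 is inferred from the summary itself (1,383,600 embedding parameters divided by 100 dimensions) rather than stated anywhere in the notebook.

```python
def embedding_params(vocab_size, dim):
    """One vector of length dim per vocabulary entry."""
    return vocab_size * dim

def dense_params(inputs, units):
    """A weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

inferred_vocab = 1383600 // 100                # 13836, inferred from the summary

frozen = embedding_params(inferred_vocab, 100)  # non-trainable (trainable=False)
hidden = dense_params(40 * 100, 128)            # Flatten gives 40 * 100 inputs
output = dense_params(128, 2)

print(frozen)                    # 1383600
print(hidden + output)           # 512386
print(frozen + hidden + output)  # 1895986
```

Freezing the embedding leaves only the two dense layers to train, which is roughly a quarter of the total parameters.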


In [61]:
# x_data and y_data were tokenized, padded to maxlen=40, and one-hot encoded
# above with the full-vocabulary tokenizer that matches embedding_matrix.
epochs = 10
batch_size = 32
history = model4.fit(x_data, y_data,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split=0.2)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [62]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model4, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model4.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [76]:
#-- Generate predicted y values 
prediction_column_index=model4.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 4 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model4.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 419

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


Regarding model performance, right from the start we can see that the Conv1D layers helped a lot with the classification problem compared with the LSTM layers and with the plain vanilla model that used only dense layers at the beginning. There is a lot of information to be extracted via CNN techniques when applied to text classification. The transfer learning model also performed very well compared to the LSTM; note, however, that it relied on pre-trained word vectors (GloVe, linked above). The number of filters in the models with Conv1D layers had an effect on performance, as more filters were able to extract more meaningful relationships. The maximum length in the preprocessor function is another important hyperparameter, as smaller lengths seemed to extract more information. Making the LSTM layers bidirectional also helped classification performance. Kernel size also seemed to help the overall performance of the models with Conv1D layers.
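
As a rough illustration of what a Conv1D layer followed by global max pooling computes, here is a single-channel, single-filter sketch in plain Python. It is a simplification under stated assumptions: real embeddings have many channels and the layer learns many filters, and the sequence and kernel values here are hypothetical.

```python
def conv1d_same_relu(sequence, kernel):
    """Single-channel 1D convolution with 'same' zero padding and ReLU."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(sequence) + [0.0] * right
    outputs = []
    for start in range(len(sequence)):
        window = padded[start:start + k]
        activation = sum(w * x for w, x in zip(kernel, window))
        outputs.append(max(0.0, activation))   # ReLU
    return outputs

toy_sequence = [1.0, 2.0, 3.0, 4.0]   # hypothetical one-dimensional embeddings
toy_kernel = [1.0, 0.0, -1.0]         # hypothetical filter with kernel_size=3

feature_map = conv1d_same_relu(toy_sequence, toy_kernel)
pooled = max(feature_map)             # global max pooling over time steps

print(feature_map)   # [0.0, 0.0, 0.0, 3.0]
print(pooled)        # 3.0
```

With `padding='same'` the feature map keeps the input length, and global max pooling then reduces each filter to its single strongest response, regardless of where in the review it occurred.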

### Post-discussion models:

#### Model 1: CONV1D


In [66]:
model1_t = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=maxlen),
    Conv1D(filters=128, kernel_size=3, activation='relu', padding='same'),
    BatchNormalization(),
    Conv1D(filters=128, kernel_size=3, activation='relu', padding='same'),
    BatchNormalization(),
    Conv1D(filters=128, kernel_size=3, activation='relu', padding='same'),
    GlobalMaxPooling1D(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(2, activation='softmax')
])

model1_t.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model1_t.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [77]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model1_t, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model1_t.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [78]:
#Submit Model: 

#-- Generate predicted y values 
prediction_column_index=model1_t.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit post-discussion Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model1_t.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 421

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model 2: LSTM 

In [81]:
# Setting the hyperparameters

vocab_size = 10000  # Limit the vocabulary size to the top 10,000 words
maxlen = 20        # Set the maximum sequence length
embedding_dim = 24  # Dimension of the word embeddings
lstm_units = 15    # Number of LSTM units

In [82]:
def preprocessor(data, maxlen=20, max_words=10000):

    sequences = tokenizer.texts_to_sequences(data)

    word_index = tokenizer.word_index
    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 20)
(1821, 20)


In [83]:
model2_t = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=maxlen),
    Bidirectional(LSTM(lstm_units, return_sequences=True, dropout=0.2)),
    BatchNormalization(),
    Bidirectional(LSTM(lstm_units, return_sequences=True, dropout=0.2)),
    TimeDistributed(Dense(lstm_units, activation='relu')),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(2, activation='softmax')
])
model2_t.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model2_t.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [84]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2_t, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2_t.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [85]:
#Submit Model: 
#-- Generate predicted y values 
prediction_column_index=model2_t.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2_t.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 423

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model 3: CONV1D

In [96]:
# preprocessor tokenizes words and makes sure all documents have the same length
def preprocessor(data, maxlen=40, max_words=10000):

    sequences = tokenizer.texts_to_sequences(data)

    word_index = tokenizer.word_index
    X = pad_sequences(sequences, maxlen=maxlen)

    return X

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)


In [97]:
model3_t = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=40),
    Conv1D(filters=128, kernel_size=5, activation='relu', padding='same'),
    BatchNormalization(),
    Conv1D(filters=128, kernel_size=5, activation='relu', padding='same'),
    BatchNormalization(),
    Conv1D(filters=128, kernel_size=5, activation='relu', padding='same'),
    GlobalMaxPooling1D(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(2, activation='softmax')
])

model3_t.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model3_t.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [98]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model3_t, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model3_t.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [99]:
#Submit Model: 
#-- Generate predicted y values 
prediction_column_index=model3_t.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit post-discussion Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3_t.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 428

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


### Final Discussion:

After discussing which hyperparameters to change in order to improve overall performance, we can divide the best models into two categories for tuning: those using Conv1D layers and those using LSTM layers. For the Conv1D models, the two parameters that stood out the most were the kernel size and the number of filters, as increasing these helped the model's classification power. For the LSTM models, making the layers bidirectional improved performance compared to models that did not use it. The other relevant parameter for the LSTM models proved to be the maximum length argument, as smaller values performed better.
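
To give a sense of how kernel size and filter count change a Conv1D layer's capacity, the following sketch counts its weights. The helper is hypothetical, and the input width of 32 channels is an assumption chosen for illustration, not a value read from a model summary.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights for every (position, input channel, filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

assumed_channels = 32   # assumed embedding dimension feeding the first Conv1D

small = conv1d_params(kernel_size=3, in_channels=assumed_channels, filters=64)
large = conv1d_params(kernel_size=5, in_channels=assumed_channels, filters=128)

print(small)   # 6208
print(large)   # 20608
```

Moving from 64 filters of width 3 to 128 filters of width 5 roughly triples the parameters of the first convolutional layer. That is consistent with the added classification power noted above, but it also increases the risk of overfitting on a data set of this size.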