# Machine Learning Project - Inappropriate Language Classification - LSTM

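This notebook is kept separate from the others because the embedding layer is integrated directly into the model: the model adds its own embeddings in place of the count vectorizer. Furthermore, the LSTM model has a fixed input size, so longer inputs are truncated. With the default settings of `pad_sequences` used below (`truncating='pre'`, `padding='pre'`), tokens are removed from the start of a long sequence, and short sequences are zero-padded at the start.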

This Jupyter Notebook contains the following features:
1. Model Choice
    1. Using the Base Embedding Layer
        - Data Tokenization
        - Model building with embedding layer
    2. Using the GloVe embeddings
        - Data Embedding
        - Model building without embedding layer
2. Model Training
3. Model Testing

In [3]:
#Parameters
max_input_size = 100

## 1. Model Choice

### 1. Default Embeddings

In [None]:
# Create the tokenizer
max_words = 10000 # Max number of words to use in the tokenizer

from experiment_baseplate import get_text_data
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(get_text_data())
word_index = tokenizer.word_index
print("Number of known words: ", len(word_index))
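To make the role of `max_words` concrete, here is a minimal pure-Python sketch of the indexing rule that the Keras `Tokenizer` applies: words are ranked by frequency starting at index 1, `word_index` keeps every word seen, and only words whose index is below `num_words` survive when texts are converted to sequences. The helper names `build_word_index` and `texts_to_sequences` are illustrative, and the sketch skips the punctuation filtering that the real class performs by default.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and those with index >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

# Toy corpus, made up for illustration
texts = ["you are great", "you are rude", "you win"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=3)

print(len(word_index))  # 5: every distinct word is indexed
print(sequences)        # [[1, 2], [1, 2], [1]]: only indices 1 and 2 are kept
```

This is why the "Number of known words" printed above can exceed `max_words`: the cap is applied when texts are converted, not when `word_index` is built.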

In [None]:
from experiment_baseplate import load_split_data

X_train, y_train, X_validate, y_validate, X_test, y_test = load_split_data()

# Tokenize data
from tensorflow.keras.preprocessing.sequence import pad_sequences

def post_process(X_values):
    X_values = tokenizer.texts_to_sequences(X_values)
    return pad_sequences(X_values, maxlen=max_input_size)

X_train = post_process(X_train)
X_test = post_process(X_test)
X_validate = post_process(X_validate)
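The padding step fixes every input to `max_input_size` tokens. As a rough pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): shorter sequences get zeros on the left and longer ones lose their earliest tokens. `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncating='pre': drop the earliest tokens
        padded.append([value] * (maxlen - len(kept)) + kept)  # padding='pre'
    return padded

result = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)  # [[0, 0, 5, 3], [3, 4, 5, 6]]
```

Passing `truncating='post'` to `pad_sequences` would instead keep the beginning of long texts and drop the end.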

In [None]:
#Define layers
import tensorflow.keras.layers as tfl

embedding_dim = 200

lstm_layers = [
    tfl.Input(shape=(max_input_size,)),
    tfl.Embedding(max_words, embedding_dim),
    tfl.LSTM(64),
    tfl.Dropout(0.2),
    tfl.Dense(2, activation='softmax')
]

### 2. GloVe Embeddings

In [None]:
# Download the GloVe weights if needed
from experiment_baseplate import get_glove_model

get_glove_model()

In [2]:
from experiment_baseplate import get_split_glove_embedding

X_train, y_train, X_validate, y_validate, X_test, y_test = get_split_glove_embedding()

Loading GloVe model
Done loading GloVe model

Embedding data
Done Embedding data


In [4]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

def post_process(X_values):
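    # dtype='float32' keeps the embedding vectors intact; the pad_sequences
    # default (dtype='int32') would cast float values to integers
    return np.array(pad_sequences(X_values, maxlen=max_input_size, dtype='float32'))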

X_train = post_process(X_train)
X_test = post_process(X_test)
X_validate = post_process(X_validate)

In [5]:
#Define layers
import tensorflow.keras.layers as tfl

glove_embedding_dim = X_train.shape[2]

lstm_layers = [
    tfl.Input(shape=(max_input_size, glove_embedding_dim)),
    tfl.LSTM(64),
    tfl.Dropout(0.2),
    tfl.Dense(2, activation='softmax')
]

## 2. Model Training

In [6]:
#Build the model
from tensorflow.keras.models import Model

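if len(lstm_layers) < 2:
    raise ValueError("Not enough layers in your model!")

# Chain the layers with the functional API: each layer is called on the
# output tensor of the previous one, so the list ends up holding tensors
for i in range(1, len(lstm_layers)):
    lstm_layers[i] = lstm_layers[i](lstm_layers[i - 1])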


model = Model(inputs=lstm_layers[0], outputs=lstm_layers[-1])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 100, 200)]        0         
                                                                 
 lstm (LSTM)                 (None, 64)                67840     
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 2)                 130       
                                                                 
Total params: 67,970
Trainable params: 67,970
Non-trainable params: 0
_________________________________________________________________
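The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, which gives `4 * (input_dim + units + 1) * units` weights; a dense layer has `inputs * outputs` weights plus one bias per output. The short check below uses the shapes from the summary (200-dimensional input, 64 LSTM units, 2 output classes); the function names are illustrative.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input kernel, recurrent kernel, and bias."""
    return 4 * (input_dim + units + 1) * units

def dense_params(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs

lstm = lstm_params(200, 64)
dense = dense_params(64, 2)
print(lstm, dense, lstm + dense)  # 67840 130 67970
```

The dropout layer contributes no parameters, so the total matches the 67,970 reported by `model.summary()`.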


In [7]:
#Train the model
epochs = 20
batch_size = 64

# Early stopping regularization
from tensorflow.keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_validate, y_validate), callbacks=[es])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 9: early stopping


<keras.callbacks.History at 0x1d9a47768d0>
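`EarlyStopping` is created here without a `patience` argument, so it uses the Keras default of `patience=0` and halts training at the first epoch whose validation loss fails to improve on the best value so far. The sketch below is a simplified pure-Python model of that rule, not the Keras implementation, and the loss values are made up for illustration.

```python
def first_non_improving_epoch(val_losses):
    """Return the 1-based epoch where the loss first fails to beat the best so far."""
    best = float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
        else:
            return epoch
    return None  # the loss improved at every epoch

# Hypothetical validation losses: improve for eight epochs, then rise
losses = [0.40, 0.35, 0.32, 0.30, 0.29, 0.285, 0.28, 0.279, 0.281]
stop = first_non_improving_epoch(losses)
print(stop)  # 9
```

A larger `patience`, optionally combined with `restore_best_weights=True`, is a common way to make the stop less sensitive to a single noisy epoch.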

## 3. Model Testing

In [8]:
from experiment_baseplate import score
import time

start = time.time_ns()
X_test_predict = model.predict(X_test)
test_time = time.time_ns() - start

print("LSTM Model")
print(f"Test values\n\t{score( X_test_predict , y_test)} | inf_time : {test_time / X_test.shape[0]} ns")

LSTM Model
Test values
	accuracy : 0.8923022122956659 | precision : 0.8025218427323273 | recall : 0.6915935828877006 | f2 : 0.8908604519595025 | inf_time : 637437.5493386218 ns
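The `score` helper lives in `experiment_baseplate` and is not shown in this notebook. As a hedged illustration of how metrics like these are commonly derived from softmax outputs, the sketch below takes the arg-max class of each row and computes accuracy, precision, recall, and an F-beta score for the positive class. Everything in it (the function name, the toy data, one-hot labels, and treating class 1 as positive) is an assumption, and the project's helper may compute or average these values differently.

```python
def binary_scores(probabilities, one_hot_labels, beta=2.0):
    """Compute accuracy, precision, recall, and F-beta, assuming class 1 is positive."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    predicted = [argmax(row) for row in probabilities]
    actual = [argmax(row) for row in one_hot_labels]

    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))

    accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f_beta": f_beta}

# Toy softmax outputs and one-hot labels, made up for illustration
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [[1, 0], [0, 1], [0, 1], [0, 1]]
scores = binary_scores(probs, labels)
print(scores)
```

With `beta=2`, recall is weighted more heavily than precision, which suits a setting where missing inappropriate text is costlier than a false alarm.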


In [9]:
model.save_weights("./saved/lstm/lstm_gl/")