# Machine Learning Project - Inappropriate Language Classification - LSTM

This notebook is kept separate from the others because the embedding layer is integrated directly into the model: the model learns its own embeddings instead of using the count vectorizer features. The LSTM model also has a fixed input size, so inputs longer than `max_input_size` are truncated. With the default `pad_sequences` settings used below, tokens are dropped from the start of the sequence.

This Jupyter Notebook contains the following features:
1. Model Choice
    1. Using the Base Embedding Layer
        - Data Tokenisation
        - Model building with embedding layer
    2. Using the GloVe embeddings
        - Data Embedding
        - Model building without embedding layer
2. Model Training
3. Model Testing

In [3]:
#Parameters
max_input_size = 200

#Base LSTM
embedding_dim = 200

## 1. Model choice

### 1. Default Embeddings

In [5]:
#Create Tokenizer
max_words = 10000 # Max number of words to use in the tokenizer

from experiment_baseplate import get_text_data
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(get_text_data())
word_index = tokenizer.word_index
print("Number of known words: ", len(word_index))

Number of known words:  72629
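
The count printed above (72629) is larger than `max_words` (10000) because `word_index` records every word seen during fitting, while the `num_words` cap is only applied when texts are converted to sequences. The sketch below re-creates that behaviour in plain Python on a toy corpus. It is an illustration, not the Keras implementation: the helper names are hypothetical, and ties are broken alphabetically for simplicity.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Encode texts, dropping every word whose rank is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


toy_texts = ["you are great", "you are rude", "you rock"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=3)

print(len(toy_index))   # 5 known words in total
print(toy_sequences)    # [[1, 2], [1, 2], [1]]: only the two most frequent words survive
```

With `num_words=3` only ranks 1 and 2 are kept, which mirrors how a cap of 10000 keeps the most frequent words out of the 72629 known ones.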


In [24]:
from experiment_baseplate import load_split_data

X_train, y_train, X_validate, y_validate, X_test, y_test = load_split_data()

#Tokenize data
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
X_validate = tokenizer.texts_to_sequences(X_validate)

X_train = pad_sequences(X_train, maxlen=max_input_size)
X_test = pad_sequences(X_test, maxlen=max_input_size)
X_validate = pad_sequences(X_validate, maxlen=max_input_size)
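
The calls above pass only `maxlen`, so `pad_sequences` falls back to its defaults: short sequences are padded with zeros at the front and long sequences lose tokens from the front. The following plain-Python sketch mimics that default on a single sequence. The function `pad_or_truncate` is a hypothetical stand-in for illustration and assumes `maxlen` is positive.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation on one sequence."""
    truncated = sequence[-maxlen:]  # keep the last maxlen tokens
    return [value] * (maxlen - len(truncated)) + truncated


short = pad_or_truncate([5, 8, 2], maxlen=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

If the end of a comment matters less than its beginning, `pad_sequences` accepts `truncating='post'` to drop tokens from the end instead.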

In [7]:
#Define layers
import tensorflow.keras.layers as tfl

lstm_layers = [
    tfl.Input(shape=(max_input_size,)),
    tfl.Embedding(max_words, embedding_dim),
    tfl.LSTM(64),
    tfl.Dropout(0.2),
    tfl.Dense(2, activation='softmax')
]
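
The final `Dense(2, activation='softmax')` layer turns two raw scores into class probabilities, and the `categorical_crossentropy` loss used when the model is compiled compares them against one-hot labels. The sketch below works through both steps by hand. It assumes the labels returned by `load_split_data` are one-hot encoded, which this notebook does not show, and the clipping constant is an illustrative choice.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_label, probabilities, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_label, probabilities))


probs = softmax([2.0, 0.0])                      # raw scores for the two classes
loss = categorical_crossentropy([1, 0], probs)   # true class is index 0

print(probs)
print(loss)
```

The loss shrinks towards 0 as the probability of the true class approaches 1.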

### 2. GloVe Embeddings

In [None]:
'''
If needed download weights
'''
from experiment_baseplate import get_glove_model

get_glove_model()

In [15]:
from experiment_baseplate import get_split_glove_embedding

X_train, y_train, X_validate, y_validate, X_test, y_test = get_split_glove_embedding()

Loading GloVe model
Done loading GloVe model

Embedding data
Done Embedding data
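
The internals of `get_split_glove_embedding` are not shown in this notebook. As a rough idea of what embedding text with pretrained vectors can look like, the sketch below looks each token up in a word-to-vector table and falls back to zeros for unknown words. The table, its 3-dimensional vectors and the helper `embed_tokens` are all hypothetical; real GloVe vectors are much longer.

```python
# Hypothetical toy table standing in for a loaded GloVe model.
toy_glove = {
    "you": [0.1, 0.3, -0.2],
    "rock": [0.7, -0.1, 0.4],
}
EMBEDDING_SIZE = 3


def embed_tokens(tokens, table, size):
    """Return one vector per token, using zeros for out-of-vocabulary words."""
    return [table.get(token, [0.0] * size) for token in tokens]


embedded = embed_tokens("you totally rock".split(), toy_glove, EMBEDDING_SIZE)

for vector in embedded:
    print(vector)
```

Data embedded this way already consists of float vectors, which is why the outline pairs the GloVe variant with a model that has no embedding layer of its own.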


In [None]:
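from tensorflow.keras.preprocessing.sequence import pad_sequences

# GloVe features are floats; pad_sequences defaults to dtype='int32',
# which would truncate them, so request float32 explicitly.
X_train = pad_sequences(X_train, maxlen=max_input_size, dtype='float32')
X_test = pad_sequences(X_test, maxlen=max_input_size, dtype='float32')
X_validate = pad_sequences(X_validate, maxlen=max_input_size, dtype='float32')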

In [21]:
X_train[0].shape

(200,)

In [None]:
#Define layers (currently the same stack as in the default embeddings section, including the Embedding layer)
import tensorflow.keras.layers as tfl

lstm_layers = [
    tfl.Input(shape=(max_input_size,)),
    tfl.Embedding(max_words, embedding_dim),
    tfl.LSTM(64),
    tfl.Dropout(0.2),
    tfl.Dense(2, activation='softmax')
]

## 2. Model Training

In [8]:
#Build the model
from tensorflow.keras.models import Model

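# Raise instead of calling exit(), which would shut down the notebook kernel
if len(lstm_layers) < 2:
    raise ValueError("Not enough layers in your model!")

# Chain the layers: each layer is called on the output tensor of the previous one
for i in range(1, len(lstm_layers)):
    lstm_layers[i] = lstm_layers[i](lstm_layers[i - 1])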


model = Model(inputs=lstm_layers[0], outputs=lstm_layers[-1])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 200)]             0         
                                                                 
 embedding (Embedding)       (None, 200, 200)          2000000   
                                                                 
 lstm (LSTM)                 (None, 64)                67840     
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 2)                 130       
                                                                 
Total params: 2,067,970
Trainable params: 2,067,970
Non-trainable params: 0
_________________________________________________________________
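
The parameter counts in the summary can be checked by hand. The short calculation below uses the standard formulas for each layer type: an embedding table of vocabulary size times vector size, four gates per LSTM unit (each with input weights, recurrent weights and a bias), and a dense layer with weights plus biases. The variable names are chosen for illustration.

```python
max_words = 10000      # vocabulary size passed to the Embedding layer
embedding_dim = 200    # size of each embedding vector
lstm_units = 64        # units in the LSTM layer
num_classes = 2        # outputs of the final Dense layer

# One vector of length embedding_dim per vocabulary entry
embedding_params = max_words * embedding_dim

# Four gates, each with input weights, recurrent weights and a bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# One weight per (unit, class) pair plus one bias per class
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params

print(embedding_params)  # 2000000
print(lstm_params)       # 67840
print(dense_params)      # 130
print(total_params)      # 2067970
```

Almost all of the parameters sit in the embedding table, so lowering `max_words` or `embedding_dim` is the quickest way to shrink the model.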



In [9]:
epochs = 2
batch_size = 32

model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_validate, y_validate))

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1f8ca08eb80>

## 3. Model Testing

In [10]:
from experiment_baseplate import score

print("LSTM Model")
print("Validate values -> " + score( model.predict(X_validate) , y_validate))
print("Test values -> " + score( model.predict(X_test) , y_test))

LSTM Model
Validate values -> accuracy : 0.9185480853186978 | precision : 0.9134559675550405 | recall : 0.8990021382751248
Test values -> accuracy : 0.9158662841461893 | precision : 0.9111918604651162 | recall : 0.8946767518196089
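
The `score` helper comes from `experiment_baseplate` and its implementation is not shown here. One plausible way to obtain accuracy, precision and recall from softmax outputs is sketched below. It assumes one-hot labels and treats class index 1 as the positive class. Both are assumptions, and `classification_scores` is a hypothetical name.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


def classification_scores(predicted_probs, one_hot_labels, positive_class=1):
    """Compute accuracy, precision and recall from class probabilities."""
    predicted = [argmax(row) for row in predicted_probs]
    actual = [argmax(row) for row in one_hot_labels]
    pairs = list(zip(predicted, actual))

    tp = sum(p == positive_class and a == positive_class for p, a in pairs)
    fp = sum(p == positive_class and a != positive_class for p, a in pairs)
    fn = sum(p != positive_class and a == positive_class for p, a in pairs)
    correct = sum(p == a for p, a in pairs)

    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [[1, 0], [0, 1], [0, 1], [1, 0]]
scores = classification_scores(toy_probs, toy_labels)

print(scores)  # {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5}
```

Precision answers how many flagged texts were truly inappropriate, while recall answers how many inappropriate texts were caught.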
