# Long Short-Term Memory

In our [previous MLP experiment](./03-mlp.ipynb), we saw that a neural network could extract more from averaged Word2Vec embeddings than a simpler Logistic Regression. However, averaging throws away crucial information: the *order of words* in a sentence.

This time, we're diving deeper into neural networks with **Long Short-Term Memory (LSTM)** units. They are specifically designed to process sequences, remembering context and understanding how word order contributes to meaning. Let's see if harnessing this sequential power can push our accuracy even further.

## Data Preparation

In [1]:
import kagglehub
df = kagglehub.dataset_load(
    kagglehub.KaggleDatasetAdapter.PANDAS,
    'jp797498e/twitter-entity-sentiment-analysis',
    'twitter_training.csv',
    pandas_kwargs={'encoding': 'ISO-8859-1'},
)

df = df[df.columns[[2, 3]]]
df.columns = ['sentiment', 'text']

df['text'] = df['text'].astype(str)
df['sentiment'] = df['sentiment'].astype(str)
df = df.dropna()

df = df.loc[df['sentiment'] != 'Irrelevant']
display(df)

Unnamed: 0,sentiment,text
0,Positive,I am coming to the borders and I will kill you...
1,Positive,im getting on borderlands and i will kill you ...
2,Positive,im coming on borderlands and i will murder you...
3,Positive,im getting on borderlands 2 and i will murder ...
4,Positive,im getting into borderlands and i can murder y...
...,...,...
74676,Positive,Just realized that the Windows partition of my...
74677,Positive,Just realized that my Mac window partition is ...
74678,Positive,Just realized the windows partition of my Mac ...
74679,Positive,Just realized between the windows partition of...


In [2]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
x_train = train['text']
y_train = train['sentiment']
x_test = test['text']
y_test = test['sentiment']

## Tokenization

Before our neural network can understand text, we need to convert sentences into a numerical format it can process. In our MLP approach, we tokenized the text and then immediately averaged the word vectors. The LSTM pipeline is *a bit more complicated*.

The first step is *tokenization*, where we break down each sentence into individual words or sub-word units called "tokens." Then, we'll build a vocabulary of all unique tokens in our training data and assign a unique integer ID to each one, transforming our text into sequences of numbers.

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text'])
display(tokenizer.texts_to_sequences(["Hello World"]))

[[737, 162]]
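
To make the idea concrete, here is a minimal pure-Python sketch of what a word-level tokenizer does: count words, rank them by frequency, and hand out integer IDs starting at 1. The helper names (`build_vocab`, `texts_to_ids`), the regular expression and the toy corpus are assumptions made for illustration. The Keras `Tokenizer` filters punctuation and breaks ties differently, so its IDs will not match these.

```python
import re
from collections import Counter

WORD_PATTERN = r"[a-z0-9']+"  # simplified word splitter (an assumption)

def build_vocab(texts):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(WORD_PATTERN, text.lower()))
    # Sort by descending frequency, then alphabetically to break ties
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # ID 0 is left unused so it can later serve as the padding value
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def texts_to_ids(texts, vocab):
    """Replace every known word with its ID; unknown words are dropped."""
    return [
        [vocab[w] for w in re.findall(WORD_PATTERN, text.lower()) if w in vocab]
        for text in texts
    ]

toy_corpus = ["the game is great", "the game is bad", "great game"]
toy_vocab = build_vocab(toy_corpus)
toy_ids = texts_to_ids(["The game is awful"], toy_vocab)
print(toy_vocab)
print(toy_ids)
```

Note that "awful" never appeared in the toy corpus, so it simply disappears from the encoded sequence. The same happens to out-of-vocabulary words in our pipeline unless an explicit out-of-vocabulary token is configured.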

## Data Padding

The next step is called padding. LSTMs (like most neural networks) are trained on batches in which every input sequence has the same length. Since our sentences naturally vary in length, we need to **pad** our integer sequences to a **fixed length**.

This involves either truncating longer sequences or adding special tokens (usually zeros) to shorter sequences until they all reach a predetermined maximum length. We can use some *reasonable* number here, and adjust it later if needed.

In [4]:
maxlen = 96

from tensorflow.keras.utils import pad_sequences
x_train = pad_sequences(tokenizer.texts_to_sequences(x_train), maxlen=maxlen)
x_test = pad_sequences(tokenizer.texts_to_sequences(x_test), maxlen=maxlen)

display(x_train)

array([[    0,     0,     0, ...,   357,   155,   163],
       [    0,     0,     0, ...,    25,  1136,    10],
       [    0,     0,     0, ...,   827,  3072,  1119],
       ...,
       [    0,     0,     0, ...,   469,     7,  5669],
       [    0,     0,     0, ...,    36,    16, 14165],
       [    0,     0,     0, ...,     3,   121,    10]], dtype=int32)
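
The leading zeros in the output above come from the default behaviour of `pad_sequences`, which pads and truncates at the *start* of each sequence. The NumPy sketch below imitates that default on toy data; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` tokens of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]           # truncate from the front
        out[row, -len(kept):] = kept   # right-align the remaining tokens
    return out

toy_padded = pad_pre([[5, 3], [7, 1, 4, 9, 2, 8]], maxlen=4)
print(toy_padded)
```

Padding at the front keeps the real tokens next to the end of the sequence, which is the last thing the LSTM reads before producing its output.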

## Embedding Layer

Next, we need to transform these padded sequences into meaningful representations that capture their semantic relationships. That's where the **embedding** layer comes in. To build it, we may use our existing Word2Vec model.

In [5]:
import kagglehub
path = kagglehub.dataset_download(
    'leadbest/googlenewsvectorsnegative300', 
    path='GoogleNews-vectors-negative300.bin.gz'
)

from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format(path, binary=True)

Think of the embedding layer as a sophisticated, trainable lookup table: for each integer ID representing a token in our sequence, it looks up the corresponding dense vector. This way our neural network works with the *semantic* meaning of words rather than with arbitrary indices.

In [6]:
import numpy as np
embedding_matrix_shape = (len(tokenizer.word_index) + 1, wv.vector_size)
embedding_matrix = np.zeros(shape=embedding_matrix_shape)
for word, index in tokenizer.word_index.items():
    # Words missing from Word2Vec keep their all-zero row
    if word in wv:
        embedding_matrix[index] = wv.get_vector(word)
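
The lookup itself is nothing more than row indexing. The sketch below uses a tiny made-up matrix (three dimensions instead of 300, four tokens instead of the full vocabulary) to show how a batch of padded ID sequences turns into a batch of vector sequences. All values here are illustrative assumptions.

```python
import numpy as np

# One row per token ID; row 0 is reserved for padding
toy_matrix = np.array([
    [0.0, 0.0, 0.0],  # ID 0: padding
    [0.1, 0.2, 0.3],  # ID 1
    [0.4, 0.5, 0.6],  # ID 2
    [0.7, 0.8, 0.9],  # ID 3
])

# Two padded sequences of length 4
toy_batch = np.array([
    [0, 0, 2, 1],
    [0, 3, 3, 2],
])

# Integer-array indexing replaces each ID with its row of the matrix
toy_vectors = toy_matrix[toy_batch]
print(toy_vectors.shape)  # (batch, sequence length, embedding size)
```

The embedding layer performs the same lookup, except that its matrix is a trainable weight that the optimizer can adjust.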

## Label Encoding

In [7]:
from sklearn.preprocessing import LabelBinarizer
import pandas as pd

encoder = LabelBinarizer()
encoder.fit(df['sentiment'])

# The encoder is already fitted above, so we only transform here
y_train_encoded = pd.DataFrame(encoder.transform(y_train))
y_test_encoded = pd.DataFrame(encoder.transform(y_test))

## Building and Training the Model

Here comes the most interesting part. With our text now represented as sequences of dense semantic vectors (thanks to the embedding layer), we can introduce the star of this experiment - the **Long Short-Term Memory (LSTM)** layer.

Unlike a simple Dense layer that processes all its inputs at once, an LSTM processes each vector in our sequence one at a time. Internally, each LSTM unit contains a sophisticated set of "gates" – an input gate, a forget gate, and an output gate – along with a memory cell.

These gates learn to control the flow of information: what to remember from previous steps, what to discard, and what new information from the current word's vector is important enough to update the memory. This allows the LSTM to capture *context* and *dependencies* across the entire sequence.
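
To see the gates in action, here is a minimal NumPy sketch of a single LSTM time step, following the standard textbook formulation. The weight shapes, the gate ordering inside the stacked matrices and the random toy inputs are assumptions chosen for illustration. The Keras layer adds dropout, masking and optimized kernels on top of this.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b              # all four pre-activations at once
    i = sigmoid(z[..., :units])               # input gate: how much new info to write
    f = sigmoid(z[..., units:2 * units])      # forget gate: how much old memory to keep
    g = np.tanh(z[..., 2 * units:3 * units])  # candidate values for the memory cell
    o = sigmoid(z[..., 3 * units:])           # output gate: how much memory to expose
    c_t = f * c_prev + i * g                  # updated memory cell
    h_t = o * np.tanh(c_t)                    # updated hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units, steps = 3, 2, 5
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x_t in rng.normal(size=(steps, input_dim)):  # one toy "word vector" per step
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)
```

The hidden state `h` after the last step is what a layer like `LSTM(128)` passes on to the next layer, with 128 units instead of 2.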

In [8]:
from tensorflow.keras import layers, Sequential
num_classes = len(encoder.classes_)
model = Sequential([
    layers.Embedding(
        weights=[embedding_matrix],
        input_dim=embedding_matrix_shape[0], 
        output_dim=embedding_matrix_shape[1],
        trainable=True, # allow optimizer to tweak this layer too
        mask_zero=True, # ignore padding zeroes
    ),
    layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    layers.Dropout(0.2),
    layers.Dense(num_classes, activation='softmax'),
])

We can compile and train our model now, but before we do so, let's implement a regularisation technique called **early stopping**. Essentially, it's a callback that watches the training process and stops it when the monitored metric stops improving. That helps to reduce overfitting and saves us some computation cycles when training gets stuck at the same accuracy for too long.

In [9]:
from tensorflow.keras.callbacks import EarlyStopping
earlystop = EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True)
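
The logic behind the callback can be sketched in a few lines of plain Python. The function below is a hypothetical simplification (no `min_delta`, no weight handling) and the validation scores are made up. It only illustrates how `patience` interacts with the monitored metric.

```python
def run_with_early_stopping(val_scores, patience):
    """Return (epoch at which training stops, epoch with the best score), both 0-based."""
    best_score = float("-inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # give up: no improvement for `patience` epochs
    return len(val_scores) - 1, best_epoch  # ran through all epochs

toy_history = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72, 0.80]
stopped_at, best_at = run_with_early_stopping(toy_history, patience=3)
print(stopped_at, best_at)
```

In this toy run, training stops before the late jump to 0.80 is ever reached, which is the trade-off of a small `patience`. With `restore_best_weights=True`, the real callback additionally rolls the model back to the weights from the best epoch.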

Also, we could try tweaking the optimizer learning rate - the default one might be too high when fine-tuning pre-trained embeddings.

In [10]:
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.0001)

Everything is ready - let's start the training process.

In [11]:
from tensorflow import device
with device('/CPU:0'):
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train_encoded, epochs=35, batch_size=128, validation_split=0.1, callbacks=[earlystop]) 

Epoch 1/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 98s 280ms/step - accuracy: 0.4975 - loss: 1.0252 - val_accuracy: 0.6669 - val_loss: 0.8161
Epoch 2/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 96s 275ms/step - accuracy: 0.6696 - loss: 0.8008 - val_accuracy: 0.7111 - val_loss: 0.7120
Epoch 3/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 95s 272ms/step - accuracy: 0.7129 - loss: 0.7021 - val_accuracy: 0.7466 - val_loss: 0.6414
Epoch 4/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 97s 281ms/step - accuracy: 0.7468 - loss: 0.6350 - val_accuracy: 0.7747 - val_loss: 0.5757
Epoch 5/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 97s 279ms/step - accuracy: 0.7780 - loss: 0.5642 - val_accuracy: 0.7934 - val_loss: 0.5168
Epoch 6/35
347/347 ━━━━━━━━━━━━━━━━━━━━ 102s 294ms/step - accuracy: 0.8068 - loss: 0.4949 - val_accuracy: 0.8144 - val_loss: 0.4728
Epoch 7/3

## Result

In [15]:
from sklearn.metrics import classification_report
with device('/CPU:0'):
    y_pred_probs = model.predict(x_test, verbose=False)
    y_pred_labels = np.argmax(y_pred_probs, axis=1)
    y_true_labels = np.argmax(y_test_encoded.to_numpy(), axis=1)
    print(classification_report(y_true_labels, y_pred_labels, target_names=encoder.classes_))

              precision    recall  f1-score   support

    Negative       0.92      0.93      0.92      4605
     Neutral       0.89      0.91      0.90      3594
    Positive       0.92      0.89      0.91      4140

    accuracy                           0.91     12339
   macro avg       0.91      0.91      0.91     12339
weighted avg       0.91      0.91      0.91     12339



## Conclusion

We were able to achieve a final accuracy of **91%** - this was accomplished by combining LSTM units with pre-trained Word2Vec embeddings, enabling the model to leverage sequential information.

This result shows the value of sequence-aware models over simple embedding averaging for this dataset: the LSTM performs competitively with the heavily optimized CountVectorizer approach. Further improvements would likely require exploring more advanced architectures like [transformers](https://arxiv.org/abs/1706.03762).

But for now, this is more than enough for this class of tasks.