# Multilayer Perceptron

In our [previous attempt](./02-logistic-regression-word2vec.ipynb), we tried using Word2Vec to improve our sentiment classification, but instead of a higher score, we got a much, much worse result.

That happened because our existing architecture (logistic regression) was a poor fit for the new, seemingly much better, vectorization approach. But what if we change the architecture itself?

That gives us a nice opportunity to try a different paradigm called **deep learning** - a branch of machine learning and artificial intelligence that uses artificial neural networks to process data and learn patterns. These networks, inspired by the structure of the human brain, are built with multiple layers of interconnected nodes (**neurons**) that allow them to identify complex relationships and make predictions. 

## Data Preparation

In [21]:
%env PYTHONUNBUFFERED=1
import kagglehub
df = kagglehub.dataset_load(
    kagglehub.KaggleDatasetAdapter.PANDAS,
    'jp797498e/twitter-entity-sentiment-analysis',
    'twitter_training.csv',
    pandas_kwargs={'encoding': 'ISO-8859-1'},
)

df = df[df.columns[[2, 3]]]
df.columns = ['sentiment', 'text']

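# Drop missing rows first: astype(str) would turn NaN into the string 'nan',
# and dropna() would then have nothing left to remove.
df = df.dropna()
df['text'] = df['text'].astype(str)
df['sentiment'] = df['sentiment'].astype(str)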

df = df.loc[df['sentiment'] != 'Irrelevant']
display(df)

env: PYTHONUNBUFFERED=1


| | sentiment | text |
|---|---|---|
| 0 | Positive | I am coming to the borders and I will kill you... |
| 1 | Positive | im getting on borderlands and i will kill you ... |
| 2 | Positive | im coming on borderlands and i will murder you... |
| 3 | Positive | im getting on borderlands 2 and i will murder ... |
| 4 | Positive | im getting into borderlands and i can murder y... |
| ... | ... | ... |
| 74676 | Positive | Just realized that the Windows partition of my... |
| 74677 | Positive | Just realized that my Mac window partition is ... |
| 74678 | Positive | Just realized the windows partition of my Mac ... |
| 74679 | Positive | Just realized between the windows partition of... |


In [22]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
x_train = train['text']
y_train = train['sentiment']
x_test = test['text']
y_test = test['sentiment']

## Semantic Vectorization

Our vectorization routine also remains exactly the same. We are doing this so we can see how the change of *approach* affects the final result with the same data.

In [24]:
%env PYTHONUNBUFFERED=1
import kagglehub
path = kagglehub.dataset_download(
  'leadbest/googlenewsvectorsnegative300', 
  path='GoogleNews-vectors-negative300.bin.gz'
)

from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format(path, binary=True)

import numpy as np
from gensim.utils import simple_preprocess
def vectorize(text):
  tokens = simple_preprocess(text.lower(), deacc=True)
  token_vectors = [wv.get_vector(x) for x in tokens if x in wv]
  if token_vectors:
    return np.mean(token_vectors, axis=0)
  else:
    return np.zeros(wv.vector_size)

x_train = np.array(train['text'].map(vectorize).tolist())
x_test = np.array(test['text'].map(vectorize).tolist())

env: PYTHONUNBUFFERED=1
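
To make the averaging step easier to picture, here is a minimal sketch with made-up 3-dimensional vectors. The `toy_vectors` dictionary and the `mean_pool` helper are hypothetical stand-ins for `wv` and `vectorize`; the real embeddings are far larger, but the logic is the same: average the vectors of known tokens, and fall back to zeros when nothing is known.

```python
# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "game": [3.0, 2.0, 0.0],
}
VECTOR_SIZE = 3

def mean_pool(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return [0.0] * VECTOR_SIZE
    # zip(*known) groups the values dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]

print(mean_pool(["good", "game", "unknownword"]))  # [2.0, 1.0, 1.0]
print(mean_pool(["unknownword"]))                  # [0.0, 0.0, 0.0]
```

Note that the unknown token is simply skipped, so it does not pull the average towards zero.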


## Label Encoding

Before we proceed further, we need to transform our training targets as well! Neural networks do not work with text *directly* - instead, we need to encode our labels, turning them into a numerical representation.

In [25]:
import pandas as pd
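from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
# Learn the list of classes once, from the full dataset.
encoder.fit(df['sentiment'])
# Reuse the fitted encoder for both splits instead of refitting on y_train.
y_train_encoded = pd.DataFrame(encoder.transform(y_train))
y_test_encoded = pd.DataFrame(encoder.transform(y_test))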

This method is called "one-hot encoding", where each category is assigned a unique binary column. We have three categories - so our encodings will have a dimension of three each.
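
As a rough illustration of what the encoder does, here is a hand-rolled sketch in plain Python. The `one_hot` helper is hypothetical and is not how `LabelBinarizer` is implemented; it only mimics the result for three or more classes, with columns ordered alphabetically by class name.

```python
def one_hot(labels):
    """Return the sorted class list and one binary row per label."""
    classes = sorted(set(labels))
    column_of = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column_of[label]] = 1  # exactly one column is "hot"
        rows.append(row)
    return classes, rows

classes, encoded = one_hot(["Positive", "Negative", "Neutral", "Positive"])
print(classes)  # ['Negative', 'Neutral', 'Positive']
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each row sums to one, which is exactly the shape of target that a softmax output layer expects.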

In [26]:
display(y_train_encoded)

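| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 49347 | 1 | 0 | 0 |
| 49348 | 0 | 1 | 0 |
| 49349 | 1 | 0 | 0 |
| 49350 | 0 | 0 | 1 |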


## Building and Training the Model

Now, let's design our model structure. This time, we will use a **multilayer perceptron**. As its name states, it is a neural network that consists of multiple **layers** of neurons, which allows it to learn complex, non-linear relationships in data (unlike a logistic regression classifier, whose decision boundaries are linear).
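
A classic way to see why the extra layer matters is XOR: no single straight line separates its positive and negative cases, yet a tiny network with one hidden layer handles it. The sketch below uses hand-picked weights (an illustrative assumption, nothing here is trained) and the same ReLU activation we use later.

```python
def relu(z):
    """Rectified linear unit: pass positive values, clip negatives to zero."""
    return max(0.0, z)

def tiny_mlp(x1, x2):
    # Hidden layer: two neurons with hand-picked weights and biases.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Output neuron: a linear combination of the hidden activations.
    return h1 - 2.0 * h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), tiny_mlp(x1, x2))
# (0, 0) 0.0
# (0, 1) 1.0
# (1, 0) 1.0
# (1, 1) 0.0
```

Without the ReLU in the hidden layer, the whole network would collapse into a single linear function again, so the non-linearity is what makes stacking layers worthwhile.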

For this task, we will use three types of layers - input (declares the shape of the incoming vectors and passes them on), dense (a fully connected layer, where every neuron receives every output of the previous layer), and dropout (a special layer that helps against overfitting by randomly zeroing a fraction of the previous layer's outputs during training).
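
To get a feel for dropout, here is a minimal sketch in plain Python. The `dropout` function is a hypothetical helper, not the Keras implementation, but it follows the same documented behaviour: during training each value is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)` so the expected sum stays the same, while at inference time the layer does nothing.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` during training."""
    if not training:
        return list(activations)  # inference: pass everything through
    keep_probability = 1.0 - rate
    return [
        a / keep_probability if rng.random() >= rate else 0.0
        for a in activations
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10
print(dropout(layer_output, rate=0.2, training=True, rng=rng))
print(dropout(layer_output, rate=0.2, training=False, rng=rng))
```

Because a different random subset is silenced on every batch, the network cannot rely on any single neuron, which is what makes dropout act as a regularizer.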

In [27]:
from tensorflow.keras import layers, Sequential
num_classes = len(encoder.classes_)
model = Sequential([
  layers.Input(shape=(wv.vector_size,)),
  layers.Dense(128, activation='relu'),
  layers.Dropout(0.2),
  layers.Dense(128, activation='relu'),
  layers.Dropout(0.1),
  layers.Dense(num_classes, activation='softmax')
])
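
The final `softmax` activation turns the raw scores of the last layer into probabilities that sum to one. Here is a small sketch of that step in plain Python; the input scores are made up, and the class order is the alphabetical one reported by `encoder.classes_`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    highest = max(logits)
    # Subtracting the maximum does not change the result,
    # but it keeps exp() from overflowing on large scores.
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Negative", "Neutral", "Positive"]
probs = softmax([2.0, 0.5, 1.0])  # made-up scores for one tweet
predicted = classes[probs.index(max(probs))]

print([round(p, 3) for p in probs])
print(predicted)  # Negative
```

Picking the index of the largest probability is the same thing `np.argmax` does for us when we evaluate the model.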

Originally, there were only three layers - input, hidden, and output. But stacking two more hidden layers helped raise accuracy from the original 82% to 85%. This is most likely due to increased model capacity and (potentially) hierarchical feature learning.
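
"Capacity" can be made concrete by counting trainable parameters. The sketch below does the arithmetic by hand, assuming the 300-dimensional GoogleNews vectors as input and three output classes; `dense_params` is a hypothetical helper. Input and dropout layers add no trainable parameters, so only the dense layers are counted.

```python
def dense_params(n_inputs, n_units):
    """One weight per input-unit pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

input_size = 300             # assumption: 300-dimensional word2vec vectors
layer_units = [128, 128, 3]  # the three Dense layers of the model

total = 0
previous = input_size
for units in layer_units:
    count = dense_params(previous, units)
    print(f"Dense({units}): {count} parameters")
    total += count
    previous = units

print("Total:", total)
# Dense(128): 38528 parameters
# Dense(128): 16512 parameters
# Dense(3): 387 parameters
# Total: 55427
```

Calling `model.summary()` should report the same kind of breakdown for the real model.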

Let's try to compile and train our model now.

In [28]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train_encoded, epochs=50, batch_size=32, validation_split=0.1) 

Epoch 1/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.5856 - loss: 0.9010 - val_accuracy: 0.6647 - val_loss: 0.7766
Epoch 2/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 1s 596us/step - accuracy: 0.6687 - loss: 0.7731 - val_accuracy: 0.6868 - val_loss: 0.7307
Epoch 3/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 1s 863us/step - accuracy: 0.6915 - loss: 0.7157 - val_accuracy: 0.6925 - val_loss: 0.7076
Epoch 4/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 1s 759us/step - accuracy: 0.7166 - loss: 0.6691 - val_accuracy: 0.7150 - val_loss: 0.6784
Epoch 5/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.7365 - loss: 0.6272 - val_accuracy: 0.7231 - val_loss: 0.6393
Epoch 6/50
1388/1388 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.7513 - loss: 0.5895 - val_accuracy: 0.7382 - val_loss: 0.6148
Epoch 7/50

<keras.src.callbacks.history.History at 0x308d99cd0>
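
To read the `loss` column, it helps to know what `categorical_crossentropy` measures for a single example: the negative log of the probability the model assigned to the correct class. The sketch below is a plain-Python illustration with made-up predictions; Keras additionally averages this value over the examples it has seen.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Negative log-probability assigned to the true class."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probs)
        if t  # only the "hot" column contributes
    )

target = [0, 0, 1]  # the true class is the third one

print(categorical_crossentropy(target, [1 / 3, 1 / 3, 1 / 3]))  # ~1.0986, i.e. ln(3)
print(categorical_crossentropy(target, [0.1, 0.1, 0.8]))        # ~0.2231
print(categorical_crossentropy(target, [0.7, 0.2, 0.1]))        # ~2.3026
```

A model that guesses uniformly among three classes scores about 1.0986, so the 0.9010 reported for the first epoch already means the network is doing better than chance.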

## Result

In [29]:
from sklearn.metrics import classification_report
y_pred_probs = model.predict(x_test, verbose=0)
y_pred_labels = np.argmax(y_pred_probs, axis=1)
y_true_labels = np.argmax(y_test_encoded.to_numpy(), axis=1)
print(classification_report(y_true_labels, y_pred_labels, target_names=encoder.classes_))

              precision    recall  f1-score   support

    Negative       0.84      0.88      0.86      4547
     Neutral       0.88      0.79      0.83      3636
    Positive       0.83      0.87      0.85      4156

    accuracy                           0.85     12339
   macro avg       0.85      0.84      0.85     12339
weighted avg       0.85      0.85      0.85     12339
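
As a quick sanity check on the report, the F1 score is the harmonic mean of precision and recall, so we can recompute it from the two columns. The inputs below are the rounded values from the table, which means the last digit of a recomputed score could occasionally differ from what scikit-learn prints.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the report above.
report = {
    "Negative": (0.84, 0.88),
    "Neutral": (0.88, 0.79),
    "Positive": (0.83, 0.87),
}

scores = {name: f1(p, r) for name, (p, r) in report.items()}
for name, score in scores.items():
    print(name, round(score, 2))
# Negative 0.86
# Neutral 0.83
# Positive 0.85

macro_f1 = sum(scores.values()) / len(scores)
print("macro avg", round(macro_f1, 2))  # macro avg 0.85
```

The harmonic mean punishes imbalance: the Neutral class has the highest precision, but its lower recall drags its F1 below the other two.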



## Conclusion

This experiment yielded an 85% accuracy, suggesting that a deeper, non-linear model can extract more signal from these embeddings than a linear classifier. The addition of hidden layers and dropout underscores the "more capacity, better results" principle, at least to a point.

While this is a significant improvement over the 65% logistic regression baseline with these features, it still trails the 91% n-gram model. This suggests that to fully leverage Word2Vec's semantic richness, we need architectures capable of processing sequences, such as LSTMs or Transformers, which makes them a natural next step.
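
One way to see what averaged embeddings cannot capture is word order. In the sketch below (toy 2-dimensional vectors and a hypothetical `mean_pool` helper, for illustration only), two sentences with opposite meanings collapse into the same vector, so no classifier stacked on top of the average could tell them apart.

```python
# Made-up 2-dimensional vectors; real embeddings behave the same way here.
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [2.0, 2.0],
}

def mean_pool(tokens):
    """Average the token vectors, ignoring the order of the tokens."""
    vectors = [toy_vectors[t] for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

first = mean_pool("dog bites man".split())
second = mean_pool("man bites dog".split())

print(first)            # [1.0, 1.0]
print(first == second)  # True
```

Sequence models read the tokens one after another (or attend over their positions), which is why they are a reasonable candidate for recovering this lost information.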