# Word Embeddings

In our [previous notebook](./01-logistic-regression.ipynb), we built a solid sentiment classifier using logistic regression and a simple count vectorizer. While that model performed well by focusing on word counts and phrases, it treated each word largely as an independent entity, without understanding its meaning or its relationship to other words.

This time, we'll delve into **word embeddings**, a technique that represents words as dense vectors, capturing their semantic relationships and context within a large body of text. Today's goal is to integrate them into our logistic regression pipeline and see how they affect our sentiment analysis score.

<!-- Articles used:
- https://medium.com/swlh/sentiment-classification-using-word-embeddings-word2vec-aedf28fbb8ca
-->

## Data Preparation

We are going to use the same dataset as before.

In [1]:
from datasets import load_dataset
import numpy as np

train, test = load_dataset('stanfordnlp/imdb', split=['train', 'test'])
class_names = train.features['label'].names

x_train = np.array(train['text'])
y_train = np.array(train['label'])
x_test = np.array(test['text'])
y_test = np.array(test['label'])

But this time, we will add a preprocessing step. For each text sample, we are going to apply a technique called **semantic vectorization**. Its core idea is to map each word to a **word embedding** instead of a simple index.

What is a word embedding? Essentially, it is a mathematical representation of a word (or phrase) as a vector (a numerical array) in a high-dimensional space. These vectors capture the semantic meaning of the word by representing its relationships to other words in a corpus of text. That means that semantically similar words would be close in this high-dimensional space. 
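To make "close in this high-dimensional space" a bit more concrete: one common way to compare two embeddings is cosine similarity, the cosine of the angle between the two vectors. Below is a minimal sketch with hand-picked 3-dimensional toy vectors. The values and the `toy_vectors` name are assumptions for illustration only and do not come from any real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_vectors = {
    'car':     [0.9, 0.8, 0.1],
    'minivan': [0.8, 0.9, 0.2],
    'cereal':  [0.1, 0.2, 0.9],
}

sim_minivan = cosine_similarity(toy_vectors['car'], toy_vectors['minivan'])
sim_cereal = cosine_similarity(toy_vectors['car'], toy_vectors['cereal'])

# Vectors pointing in similar directions score higher than unrelated ones.
print('car vs minivan: %.2f' % sim_minivan)
print('car vs cereal:  %.2f' % sim_cereal)
```

A score near 1 means the vectors point in almost the same direction, while a score near 0 means they are close to orthogonal.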

This approach requires a separate vectorization model. We will start with a pre-trained word2vec model trained on Google News (`word2vec-google-news-300`). It contains 300-dimensional vectors for 3 million words and phrases.

In [2]:
from os import path
from huggingface_hub import snapshot_download
model_path = path.join(snapshot_download('fse/word2vec-google-news-300'), 'word2vec-google-news-300.model')

Let's see it in action by loading it and comparing the similarities of different words.

In [3]:
from gensim.models import KeyedVectors
wv = KeyedVectors.load(model_path)

pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]

for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06


That makes some sense, right? Communism has poor relations with cars.\
But what do those vectors look like?

In [4]:
display(wv.get_vector('hello'))

array([-0.05419922,  0.01708984, -0.00527954,  0.33203125, -0.25      ,
       -0.01397705, -0.15039062, -0.265625  ,  0.01647949,  0.3828125 ,
       -0.03295898, -0.09716797, -0.16308594, -0.04443359,  0.00946045,
        0.18457031,  0.03637695,  0.16601562,  0.36328125, -0.25585938,
        0.375     ,  0.171875  ,  0.21386719, -0.19921875,  0.13085938,
       -0.07275391, -0.02819824,  0.11621094,  0.15332031,  0.09082031,
        0.06787109, -0.0300293 , -0.16894531, -0.20800781, -0.03710938,
       -0.22753906,  0.26367188,  0.012146  ,  0.18359375,  0.31054688,
       -0.10791016, -0.19140625,  0.21582031,  0.13183594, -0.03515625,
        0.18554688, -0.30859375,  0.04785156, -0.10986328,  0.14355469,
       -0.43554688, -0.0378418 ,  0.10839844,  0.140625  , -0.10595703,
        0.26171875, -0.17089844,  0.39453125,  0.12597656, -0.27734375,
       -0.28125   ,  0.14746094, -0.20996094,  0.02355957,  0.18457031,
        0.00445557, -0.27929688, -0.03637695, -0.29296875,  0.19

Now we need to build a vectorization routine. For each sequence we will perform a simple tokenization, extract embeddings, and then squash them into a single **averaged** vector. 

But why?

The reason is simple - most traditional classifiers (like logistic regression) are fundamentally designed to work with fixed-size, flat feature vectors. They don't have an inherent mechanism to understand or process sequences of varying lengths or the temporal relationships within those sequences.
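As a small illustration of this point, the sketch below averages two token sequences of different lengths and checks that both end up as vectors of the same size. The 4-dimensional `toy_table` and the `average_tokens` helper are hypothetical stand-ins, not part of the notebook's pipeline.

```python
import numpy as np

# Hypothetical 4-dimensional embedding table; a real model such as the one
# used in this notebook would supply 300-dimensional vectors instead.
toy_table = {
    'great':  np.array([0.9, 0.1, 0.0, 0.3]),
    'movie':  np.array([0.1, 0.8, 0.2, 0.0]),
    'really': np.array([0.2, 0.0, 0.7, 0.1]),
    'boring': np.array([-0.8, 0.1, 0.1, 0.4]),
}

def average_tokens(tokens, table, dim=4):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

short_review = ['great', 'movie']
long_review = ['really', 'really', 'boring', 'movie', 'unknownword']

short_vec = average_tokens(short_review, toy_table)
long_vec = average_tokens(long_review, toy_table)

# Both reviews end up with the same shape, regardless of their length.
print(short_vec.shape, long_vec.shape)
```

Whatever the review length, the classifier always receives a flat vector with one value per embedding dimension.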

In [5]:
from gensim.utils import simple_preprocess

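def vectorize(text):
    # Tokenize the text; simple_preprocess also lowercases, and deacc=True strips accents
    tokens = simple_preprocess(text.lower(), deacc=True)
    # Keep only the tokens that exist in the embedding vocabulary
    token_vectors = [wv.get_vector(x) for x in tokens if x in wv]
    if token_vectors:
        # Squash all token vectors into a single averaged vector
        return np.mean(token_vectors, axis=0)
    else:
        # No known tokens: fall back to a zero vector of the same size
        return np.zeros(wv.vector_size)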

display(vectorize('Hello World!'))

array([-5.90820312e-02,  4.27246094e-02,  1.09664917e-01,  2.31933594e-01,
       -1.54785156e-01,  1.24206543e-02, -3.73535156e-02, -2.03613281e-01,
        4.36401367e-02,  2.67089844e-01, -6.06689453e-02, -8.30078125e-02,
        2.09960938e-02, -1.28173828e-01, -5.04455566e-02,  1.40136719e-01,
       -2.25830078e-02,  1.89941406e-01,  2.62695312e-01, -1.08276367e-01,
        1.48437500e-01,  1.24267578e-01,  1.52343750e-01, -1.44531250e-01,
        1.11083984e-01, -2.56347656e-02, -1.44958496e-01,  8.83789062e-02,
        4.87060547e-02, -2.24609375e-02,  2.72521973e-02, -4.12597656e-02,
       -1.77246094e-01, -6.59179688e-02,  3.29589844e-02, -1.29028320e-01,
        5.17578125e-02, -3.44543457e-02,  1.60156250e-01,  2.27050781e-01,
       -6.68334961e-02, -1.36718750e-02,  8.53271484e-02,  2.10449219e-01,
        8.93554688e-02,  2.57812500e-01, -1.57012939e-01,  2.33230591e-02,
       -9.74121094e-02,  1.65039062e-01, -3.05175781e-01,  1.96533203e-02,
        5.90209961e-02,  

## Building and Training the Model

Our final pipeline remains almost the same: a vectorizer followed by a logistic regression classifier. We will also define a small parameter grid for cross-validation, although it now contains only one parameter, the inverse regularization strength `C`.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

vectorizer = FunctionTransformer(lambda x: np.vstack([vectorize(seq) for seq in x]))
classifier = LogisticRegression()

pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier),
])

param_grid = {
    'classifier__C': [0.1, 1, 10],
}

The model is ready to be trained. This time, we can use **grid search**: our parameter grid is so small that we can afford an exhaustive search over every combination instead of a randomized one.
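To see what "exhaustive" means here, the following sketch enumerates every candidate in the grid by hand and counts the resulting fits. It is only an illustration of the idea, not how scikit-learn implements the search. The `grid` and `n_folds` names are assumptions, with `n_folds = 5` matching the default number of folds in `GridSearchCV`.

```python
from itertools import product

# Same values as the notebook's parameter grid; adding a second, hypothetical
# parameter to this dict would multiply the number of candidates.
grid = {
    'classifier__C': [0.1, 1, 10],
}
n_folds = 5  # default number of cross-validation folds in GridSearchCV

# Every combination of parameter values becomes one candidate.
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

for candidate in candidates:
    print(candidate)

# Each candidate is fitted once per fold.
total_fits = len(candidates) * n_folds
print('total fits:', total_fits)
```

With three candidates and five folds, that amounts to 15 fits, which is cheap enough to run in full.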

In [7]:
%%capture --no-stdout
from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=3)
cv.fit(x_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END .................classifier__C=0.1;, score=0.816 total time=  54.9s
[CV 4/5] END .................classifier__C=0.1;, score=0.811 total time=  51.0s
[CV 2/5] END .................classifier__C=0.1;, score=0.829 total time=  54.8s
[CV 5/5] END .................classifier__C=0.1;, score=0.823 total time=  48.2s
[CV 3/5] END .................classifier__C=0.1;, score=0.799 total time=  53.2s
[CV 1/5] END ...................classifier__C=1;, score=0.845 total time=  47.0s
[CV 2/5] END ...................classifier__C=1;, score=0.854 total time=  46.3s
[CV 3/5] END ...................classifier__C=1;, score=0.828 total time=  44.6s
[CV 4/5] END ...................classifier__C=1;, score=0.851 total time=  36.7s
[CV 5/5] END ...................classifier__C=1;, score=0.845 total time=  35.3s
[CV 1/5] END ..................classifier__C=10;, score=0.850 total time=  35.1s
[CV 2/5] END ..................classifier__C=10;,

## Result

In [8]:
from sklearn.metrics import classification_report
prediction = cv.best_estimator_.predict(x_test)
print(classification_report(y_test, prediction, target_names=class_names))

              precision    recall  f1-score   support

         neg       0.86      0.87      0.86     12500
         pos       0.87      0.86      0.86     12500

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000



## Conclusion

The model achieved a final accuracy of **86%**. While this demonstrates the basic application of word embeddings, it significantly underperforms the count-based approach from the previous notebook.

This suggests that for this specific dataset and task, simple averaging of word embeddings is less effective than a feature representation that explicitly captures local phrases (like n-grams).

The limitation is that averaging discards word order and local context. To keep the semantic power of word embeddings without losing sequential context, we need models that can learn the relationships between words in a sentence.
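The loss of word order is easy to reproduce on a toy example. In the sketch below, two token sequences with opposite meaning average to exactly the same vector, while their bigrams differ. The 2-dimensional vectors and the `toy_vectors_2d` and `mean_vector` names are hypothetical and chosen by hand for illustration.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
toy_vectors_2d = {
    'good': np.array([1.0, 0.0]),
    'bad':  np.array([-1.0, 0.0]),
    'not':  np.array([0.0, 1.0]),
}

def mean_vector(tokens):
    """Average the toy vectors of all tokens in the sequence."""
    return np.mean([toy_vectors_2d[t] for t in tokens], axis=0)

review_a = ['good', 'not', 'bad']   # "good, not bad"
review_b = ['bad', 'not', 'good']   # "bad, not good"

vec_a = mean_vector(review_a)
vec_b = mean_vector(review_b)

# Opposite sentiment, identical averaged features.
print(np.allclose(vec_a, vec_b))

# Bigrams keep the local order, so the two reviews stay distinguishable.
bigrams_a = list(zip(review_a, review_a[1:]))
bigrams_b = list(zip(review_b, review_b[1:]))
print(bigrams_a != bigrams_b)
```

Both checks print `True`: the averaged vectors are identical, yet the bigram features still tell the two reviews apart.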