# Word Embeddings

In our [previous notebook](./01-logistic-regression.ipynb), we built a solid sentiment classifier using logistic regression and a simple count vectorizer. While that model performed well by focusing on word counts and phrases, it treated each word largely as an independent entity, without understanding its underlying meaning or its relationship to other words.

This time, we'll delve into **word embeddings**, a technique that represents words as dense vectors that capture their semantic relationships and context within a large body of text. Today's goal is to integrate them into our logistic regression pipeline and see how they affect our sentiment analysis score.

<!-- Articles used:
- https://medium.com/swlh/sentiment-classification-using-word-embeddings-word2vec-aedf28fbb8ca
-->

## Data Preparation

We are going to use the same dataset as before.

In [1]:
from datasets import load_dataset
ds = load_dataset('stanfordnlp/imdb', split='train+test')
train, test = ds.train_test_split(test_size=0.2, seed=0).values()
display(train.to_pandas())

x_train = train['text']
y_train = train['label']
x_test = test['text']
y_test = test['label']

| | text | label |
|---|---|---|
| 0 | After the SuperFriends and Scooby Doo left the... | 1 |
| 1 | good job.that's how i would describe this anim... | 1 |
| 2 | Michael Cacoyannis has had a relatively long c... | 1 |
| 3 | I've just seen this film in a lovely air-condi... | 0 |
| 4 | My one-line summary hints that this is not a g... | 1 |
| ... | ... | ... |
| 39995 | \*\*\*SPOILERS\*\*\* \*\*\*SPOILERS\*\*\* After two so-so ... | 1 |
| 39996 | Way back in 1967, a certain director had no id... | 0 |
| 39997 | I saw this movie with my dad. I must have been... | 1 |
| 39998 | During my teens or should I say prime time I w... | 1 |


This time, we will add a preprocessing step. To each text sample we are going to apply a technique called **semantic vectorization**. Its core idea is to map each word to a **word embedding** instead of a simple index.

What is a word embedding? Essentially, it is a mathematical representation of a word (or phrase) as a vector (a numerical array) in a high-dimensional space. These vectors capture the semantic meaning of the word by representing its relationships to other words in a corpus of text. That means that semantically similar words would be close in this high-dimensional space. 
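To make "close in this high-dimensional space" concrete, here is a minimal sketch of cosine similarity, the measure commonly used to compare embeddings. The 3-dimensional vectors below are invented for illustration and are not taken from any real model; the code uses plain Python so it runs without any model download.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for illustration only
toy_vectors = {
    'car':     [0.9, 0.8, 0.1],
    'minivan': [0.8, 0.9, 0.2],
    'cereal':  [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors['car'], toy_vectors['minivan'])
sim_far = cosine_similarity(toy_vectors['car'], toy_vectors['cereal'])

print('car ~ minivan: %.2f' % sim_close)
print('car ~ cereal:  %.2f' % sim_far)
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score much lower. Real embeddings work the same way, only with hundreds of dimensions learned from text.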

This approach requires a separate vectorization model. We will start with a pre-trained word2vec model trained on Google News (`word2vec-google-news-300`). It contains 300-dimensional vectors for 3 million words and phrases.

In [2]:
from os import path
from huggingface_hub import snapshot_download
model_path = path.join(snapshot_download('fse/word2vec-google-news-300'), 'word2vec-google-news-300.model')


Let's see it in action by loading it and comparing the similarities of different words.

In [3]:
from gensim.models import KeyedVectors
wv = KeyedVectors.load(model_path)

pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]

for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

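ImportError: cannot import name 'triu' from 'scipy.linalg' (/opt/homebrew/Caskroom/miniconda/base/envs/ml/lib/python3.12/site-packages/scipy/linalg/__init__.py)

Note: the run captured above failed because of the environment, not the code. Older gensim releases import `triu` from `scipy.linalg`, which SciPy 1.13 removed. Upgrading gensim (4.3.3 or later) or pinning `scipy<1.13` should resolve it.

Once the cell runs, the similarity should drop as we move down the list. That makes some sense, right? Communism has poor relations with cars.\
But what do those vectors look like?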

In [None]:
display(wv.get_vector('hello'))

Now we need to build a vectorization routine. For each sequence we will perform a simple tokenization, extract embeddings, and then squash them into a single **averaged** vector. 

But why?

The reason is simple - most traditional classifiers (like logistic regression) are fundamentally designed to work with fixed-size, flat feature vectors. They don't have an inherent mechanism to understand or process sequences of varying lengths or the temporal relationships within those sequences.

In [None]:
from gensim.utils import simple_preprocess
import numpy as np

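def vectorize(text):
    # simple_preprocess lowercases, strips accents (deacc=True) and tokenizes
    tokens = simple_preprocess(text, deacc=True)
    # Keep only the tokens the embedding model knows
    token_vectors = [wv.get_vector(x) for x in tokens if x in wv]
    if token_vectors:
        # Average the word vectors into one fixed-size vector
        return np.mean(token_vectors, axis=0)
    else:
        # No known tokens: fall back to a zero vector of the same size
        return np.zeros(wv.vector_size)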

display(vectorize('Hello World!'))
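To see why averaging gives the classifier a fixed-size input regardless of review length, here is a toy version of the same routine. It is a sketch under stated assumptions: the 2-dimensional `toy_embeddings` table and the `mean_pool` helper are hypothetical stand-ins for the real model and for `vectorize`, written in plain Python so they run without the download.

```python
# Hypothetical 2-dimensional embeddings, invented for illustration only
toy_embeddings = {
    'great':  [1.0, 0.0],
    'movie':  [0.0, 1.0],
    'really': [0.2, 0.6],
}
DIM = 2  # dimensionality of the toy vectors

def mean_pool(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
    if not vectors:
        return [0.0] * DIM
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]

short = mean_pool(['great', 'movie'])                      # 2 tokens
long_ = mean_pool(['really', 'really', 'great', 'movie'])  # 4 tokens
unknown = mean_pool(['zzz'])                               # nothing in vocabulary

for name, vec in [('short', short), ('long', long_), ('unknown', unknown)]:
    print(name, len(vec), vec)
```

All three results have the same length, which is exactly what a flat-feature classifier such as logistic regression expects. The zero-vector fallback mirrors the `np.zeros` branch in `vectorize`.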

## Building and Training the Model

Our final pipeline will remain almost the same: a vectorizer followed by a logistic regression classifier. We will also define a small cross-validation grid, although it now contains only one parameter (the regularisation strength `C`).

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression

vectorizer = FunctionTransformer(lambda x: np.vstack([vectorize(seq) for seq in x]))
classifier = LogisticRegression()

pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier),
])

param_grid = {
    'classifier__C': [0.1, 1, 10],
}

The model is ready to be trained. This time, we can use **grid search**: our parameter grid is so tiny that we can afford an exhaustive search over every combination instead of a randomized one.

In [None]:
%%capture --no-stdout
from sklearn.model_selection import GridSearchCV
cv = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=3)
cv.fit(x_train, y_train)
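To get a feel for how small this search is, the sketch below enumerates the candidates by hand. It assumes 5-fold cross-validation, which is what `GridSearchCV` uses when its `cv` argument is left unset; `n_folds`, `candidates` and `total_fits` are illustrative names, not part of the pipeline above.

```python
from itertools import product

param_grid = {
    'classifier__C': [0.1, 1, 10],
}
n_folds = 5  # assumption: the default number of folds when `cv` is not given

# A grid search tries every combination of the listed values
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

# Each candidate is fitted once per fold
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print('candidates:', len(candidates), 'fits:', total_fits)
```

With a single parameter the product degenerates to a plain list, but the same enumeration shows why grids grow quickly: every extra parameter multiplies the number of candidates.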

## Result

In [None]:
from sklearn.metrics import classification_report
prediction = cv.best_estimator_.predict(x_test)
print(classification_report(y_test, prediction, target_names=ds.features['label'].names))

## Conclusion

The model achieved a final accuracy of **86%**. While this demonstrates the basic application of word embeddings, it falls short of the previous approach, a simple count vectorizer with n-grams, which reached 90%.

This suggests that for this specific dataset and task, the simple averaging of word embeddings, which loses word order and contextual nuances, is less effective than a feature representation that explicitly captures local phrases (like n-grams).
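The loss of word order can be shown on toy data. In the sketch below, the 2-dimensional `toy_embeddings` and the `mean_pool` and `bigrams` helpers are hypothetical and exist only for illustration: two token sequences with opposite sentiment average to the same vector, while their bigrams differ.

```python
import math

# Hypothetical 2-dimensional embeddings, invented for illustration only
toy_embeddings = {
    'not':  [-0.5, 0.25],
    'good': [1.0, 0.5],
    'bad':  [-1.0, 0.5],
}

def mean_pool(tokens):
    """Average the toy vectors of the given tokens, dimension by dimension."""
    vectors = [toy_embeddings[t] for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def bigrams(tokens):
    """Adjacent token pairs, the kind of feature an n-gram vectorizer counts."""
    return set(zip(tokens, tokens[1:]))

review_a = ['not', 'good', 'bad']  # "not good, bad"
review_b = ['not', 'bad', 'good']  # "not bad, good"

vec_a, vec_b = mean_pool(review_a), mean_pool(review_b)
same_vector = all(math.isclose(x, y) for x, y in zip(vec_a, vec_b))
bigrams_a, bigrams_b = bigrams(review_a), bigrams(review_b)

print('averaged vectors match:', same_vector)
print('bigram sets match:', bigrams_a == bigrams_b)
```

The averaged vectors are identical because a mean ignores the order of its inputs, so the classifier cannot tell the two reviews apart. The bigram sets differ, which is the local context an n-gram vectorizer keeps.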

To overcome this limitation and fully leverage the semantic power of word embeddings without losing sequential context, we need models that can learn the relationships between words in a sentence.