# Word2Vec Embeddings & t-SNE

## Description

This project extracts the text from the <a href="https://www.kaggle.com/hsankesara/medium-articles">Medium Articles</a> dataset, trains a <a href="https://www.pydoc.io/pypi/gensim-3.2.0/autoapi/models/word2vec/index.html">Gensim Word2Vec model</a> on it, and analyses the relationships between the resulting high-dimensional word vectors using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html">scikit-learn's TSNE module</a>.

Link to the Project: https://github.com/VETURISRIRAM/Word2Vec_t-SNE

## Word2Vec & t-SNE Introduction

Deep learning models designed for the Natural Language Processing domain require text data in numeric form (though some of them nowadays have their own internal word embedding layers and only require you to pass the text as it is). The overall idea is to transform the textual data into numbers that the models understand.

`Word2Vec` represents each word as a dense numeric vector in such a way that similar words get similar vectors. It maps the high-dimensional, vocabulary-sized representation of words into a much lower-dimensional space, where similar words sit near each other (in the same neighborhood) because their word vectors are also similar. In this project, the Word2Vec model from the <a href="https://pypi.org/project/gensim/">Gensim</a> library is used.

`t-Distributed Stochastic Neighbor Embedding (t-SNE)` is a dimensionality reduction technique like PCA or TruncatedSVD, but with a different approach. Imagine two distributions: one measures the pairwise similarities between the points in the original high-dimensional space, and the other measures the pairwise similarities between the corresponding points in the low-dimensional map. t-SNE places the low-dimensional points so that the second distribution matches the first as closely as possible, by minimizing the Kullback-Leibler divergence between them. The approach is computationally expensive, however.
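The two distributions can be illustrated with a small pure-Python sketch. This is a simplification and not what scikit-learn runs: it assumes a single fixed Gaussian bandwidth (`sigma`) instead of the per-point bandwidths that t-SNE derives from the perplexity, and it only evaluates the Kullback-Leibler divergence for two hand-made maps rather than optimizing anything. All names and toy points below are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_distribution(points, kernel):
    """Normalize kernel weights into a distribution over ordered pairs (i, j), i != j."""
    weights = {
        (i, j): kernel(squared_distance(points[i], points[j]))
        for i in range(len(points))
        for j in range(len(points))
        if i != j
    }
    total = sum(weights.values())
    return {pair: weight / total for pair, weight in weights.items()}


def gaussian_kernel(d2, sigma=1.0):
    """Similarity in the original space (assumption: one fixed sigma for all points)."""
    return math.exp(-d2 / (2 * sigma ** 2))


def student_t_kernel(d2):
    """Similarity in the low-dimensional map (Student-t with one degree of freedom)."""
    return 1.0 / (1.0 + d2)


def kl_divergence(p, q):
    """KL(P || Q) summed over all pairs."""
    return sum(p_ij * math.log(p_ij / q[pair]) for pair, p_ij in p.items() if p_ij > 0)


# Toy data: points 0 and 1 are close in the original space, point 2 is far away.
high_dim = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
good_map = [(0.0, 0.0), (0.1, 0.0), (6.0, 6.0)]  # keeps 0 and 1 together
bad_map = [(0.0, 0.0), (6.0, 6.0), (0.1, 0.0)]   # separates 0 and 1

p = pairwise_distribution(high_dim, gaussian_kernel)
kl_good = kl_divergence(p, pairwise_distribution(good_map, student_t_kernel))
kl_bad = kl_divergence(p, pairwise_distribution(bad_map, student_t_kernel))
print(kl_good < kl_bad)
```

The map that keeps the two neighbouring points together has the lower divergence, which is the quantity t-SNE tries to drive down.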

In this project, t-SNE is used to visualize the high-dimensional word vectors so that similar words show up as clusters.

In [None]:
import re
import spacy
import string
import datetime
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

Initialize stopwords.

In [None]:
# Run `import nltk; nltk.download("stopwords")` once if the stopword corpus is not installed.
STOPWORDS = set(stopwords.words("english"))

The script first reads the data; only the text in the `text` column is used afterwards.

In [None]:
# Read the data.
print("Started: ", datetime.datetime.now())
df = pd.read_csv("/kaggle/input/medium-articles/articles.csv")
print("Data Read: ", datetime.datetime.now())

Then, minor preprocessing is done: lowercasing, removing bracketed text like `[something]`, removing words that contain digits, stripping all punctuation characters, and removing stopwords.

In [None]:
def clean_text(text):
    """
    Function to clean the text by performing:
    1) Lowercase Operation.
    2) Removing text in square brackets.
    3) Removing words that contain digits.
    4) Removing punctuation.
    5) Removing stopwords.
    :param text: Raw Text.
    :return text: Clean Text.
    """

    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    text = re.sub(rf"[{re.escape(string.punctuation)}]", "", text)
    if len(text) > 3:
        text = " ".join([t for t in text.split() if t not in STOPWORDS])

        return text
    else:

        return ""
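To see what each step does, here is a self-contained sketch that mirrors the same regular expressions on one sample sentence. To avoid needing the NLTK corpus, it assumes a tiny hypothetical stopword set (`DEMO_STOPWORDS`); the sample sentence and the function name are also made up for illustration.

```python
import re
import string

# Hypothetical, tiny stopword set used only for this demo (the notebook uses NLTK's list).
DEMO_STOPWORDS = {"the", "a", "is", "of", "and", "in"}


def clean_text_demo(text, stopwords=DEMO_STOPWORDS):
    """Apply the same cleaning steps as clean_text, with a demo stopword set."""
    text = text.lower()                                   # lowercase
    text = re.sub(r"\[.*?\]", "", text)                   # drop [bracketed] text
    text = re.sub(r"\w*\d\w*", "", text)                  # drop words containing digits
    text = re.sub(rf"[{re.escape(string.punctuation)}]", "", text)  # drop punctuation
    if len(text) > 3:
        return " ".join(t for t in text.split() if t not in stopwords)
    return ""


sample = "The GPT2 model [citation needed] is a Transformer, trained in 2019!"
cleaned = clean_text_demo(sample)
print(cleaned)  # model transformer trained
```

Note that `gpt2` and `2019` disappear entirely, because the digit rule removes the whole word and not just the digits.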

In [None]:
# Clean the text.
df["text"] = df["text"].apply(clean_text)
print("Cleaned Text: ", datetime.datetime.now())

The words are then lemmatized to their root form and a corpus of documents is created.

In [None]:
def lemmatize_text_tokens(text, nlp):
    """
    Function to tokenize a sentence, lemmatize the tokens and return the lemmatized sentence.
    :param text: Raw Text.
    :param nlp: Spacy Object with "en" corpus loaded.
    :return lemmatized_text: Lemmatized sentence.
    """

    tokens = nlp(text)
    lemmatized_tokens = list()
    for token in tokens:
        lemmatized_token = token.lemma_
        lemmatized_tokens.append(lemmatized_token)
    lemmatized_text = " ".join(lemmatized_tokens)

    return lemmatized_text

In [None]:
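# Lemmatize the tokens.
# Note: the "en" shortcut only exists in spaCy 2.x; spaCy 3+ needs a full model name such as "en_core_web_sm".
nlp = spacy.load("en")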
df["lemmatized_text"] = df["text"].apply(lambda text_value: lemmatize_text_tokens(text_value, nlp))
print("Lemmatized the Documents' Tokens: ", datetime.datetime.now())

In [None]:
# Split the documents into tokens.
doc_sentences = [text.split() for text in list(df["lemmatized_text"])]
print("Split Token of Documents: ", datetime.datetime.now())

The word embeddings are created using the `Word2Vec` model. I used all but one of the cores of my system for this. While training the Word2Vec model, you can play around with hyperparameters like `min_count`, `window`, `size`, `workers`, etc. They define, respectively, the minimum number of occurrences a word needs in order to be kept in the vocabulary, the maximum distance between the current word and the predicted word, the size of the feature vectors, and the number of worker threads to use. (In Gensim 4 and later, `size` is named `vector_size`.)

In [None]:
w2v_model = Word2Vec(min_count=200,
                     window=5,
                     size=100,
                     workers=7)
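The hard-coded `workers=7` matches "all but one core" only on an eight-core machine. A minimal sketch of deriving the value at runtime, assuming the standard library's `multiprocessing.cpu_count()` is an acceptable measure of available cores (the variable name `n_workers` is hypothetical):

```python
import multiprocessing

# Leave one core free for the rest of the system, but never go below one worker.
n_workers = max(1, multiprocessing.cpu_count() - 1)
print(n_workers)
```

The result could then be passed as `workers=n_workers` when constructing the model.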

In [None]:
print("Building Vocabulary: ", datetime.datetime.now())
w2v_model.build_vocab(doc_sentences)
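The effect of `min_count` on the vocabulary can be sketched without Gensim: words occurring fewer than `min_count` times in the whole corpus are ignored. The helper below is a hypothetical stand-in that reproduces only this filtering rule on a toy corpus, not Gensim's actual vocabulary building.

```python
from collections import Counter


def surviving_vocabulary(sentences, min_count):
    """Count every token in the corpus and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: count for word, count in counts.items() if count >= min_count}


# Toy corpus in the same shape as doc_sentences: a list of token lists.
toy_sentences = [
    ["neural", "network", "learn"],
    ["deep", "neural", "network"],
    ["neural", "architecture"],
]
vocab = surviving_vocabulary(toy_sentences, min_count=2)
print(vocab)  # {'neural': 3, 'network': 2}
```

With `min_count=200`, as used in this notebook, only fairly frequent words end up in the model, which also keeps the later t-SNE plot readable.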

In [None]:
print("Training Word2Vec Model: ", datetime.datetime.now())
w2v_model.train(doc_sentences, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs)
print("Training Done: ", datetime.datetime.now())
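To make the `window` parameter concrete, the sketch below lists which words count as context for each position in a sentence. It is a simplified illustration with a hypothetical helper name: it shows only the windowing idea, while actual Word2Vec training involves further details, and Gensim defaults to the CBOW variant (predicting the centre word from its context) unless `sg=1` is set.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = context_pairs(["deep", "learning", "model", "train"], window=1)
for pair in pairs:
    print(pair)
```

A larger `window` relates each word to more distant neighbours, which tends to capture broader topical similarity at the cost of more training work.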

The following are the words most similar to the word `computer`.

In [None]:
w2v_model.init_sims(replace=True)
most_similar_words = w2v_model.wv.most_similar(positive=['computer'])
for similar_word in most_similar_words:
    print(similar_word)
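`most_similar` ranks words by the cosine similarity of their vectors. A minimal pure-Python sketch of that ranking follows; the three-dimensional vectors are invented toy values rather than output of the trained model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vector))
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Invented toy vectors, for illustration only.
toy_vectors = {
    "computer": [0.9, 0.1, 0.0],
    "laptop": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
    "software": [0.7, 0.3, 0.0],
}
ranked = rank_similar("computer", toy_vectors)
for word, score in ranked:
    print(word, round(score, 3))
```

As in the Gensim output, each result is a `(word, similarity)` tuple, with the highest similarity first.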

`t-SNE` is used to plot the model's (limited) vocabulary so that associations between words can be analyzed.

In [None]:
def plt_tsne(word2vec_model):
    """
    Function to plot the words from the model's vocabulary using t-SNE, so that word associations can be inspected.
    :param word2vec_model: Word2Vec Model.
    """

    labels = list()
    words = list()

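    # Note: `wv.vocab` is the Gensim 3.x API; Gensim 4+ exposes `wv.key_to_index` instead.
    for word in word2vec_model.wv.vocab:
        # Look vectors up through `.wv`; indexing the model directly is deprecated.
        words.append(word2vec_model.wv[word])
        labels.append(word)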

    tsne_model = TSNE(perplexity=25, n_components=2, init="pca", n_iter=2000, random_state=0)
    new_values = tsne_model.fit_transform(words)

    x, y = list(), list()

    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    plt.figure(figsize=(18, 18))

    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords="offset points",
                     ha="right",
                     va="bottom")
    plt.savefig("./tsne_plot_word2vec.png")
    plt.show()

In [None]:
plt_tsne(w2v_model)

## Interesting Observations

If you look at the image (`tsne_plot_word2vec.png`) created when the program runs, you will notice clusters of semantically similar words.

<img src="./tsne_plot_word2vec.png">

For example, there is one cluster with the words `"neural"`, `"network"`, and `"architecture"`, and another with the words `"machine"`, `"deep"`, `"learning"`, and `"learn"`.

This is a great way to identify semantic similarities between words in the corpus, and the resulting clusters make sense from a human perspective.