# Word Embeddings: `word2vec`

What are word embeddings? They are a way to transform words into a numerical format, specifically vectors, so they can be used as inputs to a model. Previously, you have seen a way to encode text as numbers using a bag-of-words approach, which uses the *frequency* of words to encode the text. However, what if we want to use the *meaning* of a word in our model instead? As humans, when we interpret text, we don't just look at how often a word appears; we also use the context of the words (which words appear before or after, which words have a similar meaning, etc.). <br> <br>
Introducing... **word embeddings**!! We embed a word (or a bi-gram, tri-gram, phrase, etc.) as a vector in a high-dimensional space. `word2vec` is a word embedding model that learns vector representations of words (also referred to as tokens). The figure below illustrates its two training approaches. <br> <br>
The continuous bag-of-words (CBOW) approach predicts a single word using the words that come before and after it. The number of words considered on each side of the target word is called the *window size*. For example, if we had the text `I went to the store to get some apples`, we may try to use the word vectors for `I`, `went`, `to`, `the`, `to`, `get`, `some`, `apples` to predict the word `store`. This would correspond to a *window size* of 4 because there are 4 words on either side of the target word. <br> <br>
In the skip-gram model, we input a word and predict what words are related or predict what words we expect would come before and after it. In the above example, we'd aim to predict the remaining words in the sentence from the word vector for `store`. 
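
To make the window idea concrete, here is a minimal sketch in plain Python that lists the context words around each target word. The helper name `context_windows` is made up for illustration and is not part of `gensim`; the real models learn vectors from pairs like these rather than just listing them.

```python
def context_windows(tokens, window_size):
    """Return (context, target) pairs for every position in a token list."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window_size` words on each side of the target
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        pairs.append((left + right, target))
    return pairs

example_tokens = "I went to the store to get some apples".split()
pairs = context_windows(example_tokens, window_size=4)

context, target = pairs[4]
print(target)   # store
print(context)  # ['I', 'went', 'to', 'the', 'to', 'get', 'some', 'apples']
```

CBOW would use `context` to predict `target`, while skip-gram would go the other way and use `target` to predict each word in `context`.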

![word_embedding](data/word_embedding.png)

These models are *neural networks*, the family of machine learning methods behind deep learning. The general idea is that interconnected nodes (neurons) are linked by weights that are learned during the training process. The structure is loosely inspired by the human brain, which is why they are called neural networks. Don't worry too much about the details for our purposes today.

# gensim

`gensim` is a popular Python package for natural language processing. It trains vector embeddings efficiently, can handle large corpora, and has many other NLP uses, such as topic modeling (which we will get into later today!).

In [None]:
# Uncomment the line below and run it if you do not already have gensim installed
# !pip install gensim

In [None]:
import gensim
import gensim.downloader as api

# Pre-trained Word Embeddings

Many word embedding models have already been trained on large corpora. `gensim` makes a number of these pre-trained models, built from different kinds of text, available for download. Here are some examples.

In [None]:
gensim_models = list(api.info()['models'].keys())
print(gensim_models)

We are going to take a closer look at the `word2vec-google-news-300` model: this is a word embedding model that is trained on Google News, where the embedding is 300 dimensions. Downloading this might take a while! The word embedding model is nearly 2 GB. 

In [None]:
wv = api.load('word2vec-google-news-300')

There are a variety of methods we can use to explore the embeddings in this model. For example, `.index_to_key` will give us a list of the words/phrases (called the vocabulary) that we have vector embeddings for in this model. 

In [None]:
# 5 example words in the vocabulary
wv.index_to_key[:5]

We can find how many word vectors we have in this model by taking the length of this list.

In [None]:
n_words = len(wv.index_to_key)
print(f"Number of words: {n_words}")
print(wv.index_to_key[:20])

The model is trained using a vocabulary of size 3 million! This is a huge model, which takes hours to train. This is why we used a pre-trained model - we likely don't have the resources to train this on our local machines.

Accessing the actual word vectors can be done by treating the word vector model as a dictionary. For example, let's take a look at the word vector for `"banana"`:

In [None]:
print(wv["banana"])
print(wv["banana"].size)

The word vector has a dimension of 300. To us, these numbers are pretty incomprehensible and uninterpretable, but we can now use these vectors to perform "calculations" on words to find words that are similar and dissimilar. A particularly interesting use of word embeddings is to find words similar by *analogy*. 
<br> <br>
What do we expect below, based on the context we are given? <br> 
#### **geese - goose + toad = ?**

![word_analogy](data/word_analogy.png)

 If you said toads, that's correct!

Note that if you try to use a word not in the vocabulary (a word that is not present in the corpus the model was trained on), it will result in an error. This is a limitation of `Word2Vec`: it cannot infer vector embeddings for words that it has never seen before.

In [None]:
# this will error
wv["nbvouphr"]

# Word Similarity

A semantic question we can ask is which words are similar to a given word, such as "banana". How does word similarity look in vector operations? We'd expect similar words to have vectors that are closer to each other in vector space.

There are many metrics of vector similarity - one of the most useful ones is the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). It ranges from -1 to 1: vectors pointing in the same direction have a cosine similarity of 1, orthogonal vectors have a cosine similarity of 0, and vectors pointing in opposite directions have a cosine similarity of -1. `gensim` provides a function that lets us find the most similar vectors to a queried vector - let's give it a shot! 
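
As a quick illustration, cosine similarity can be computed by hand with `numpy`. This is a minimal sketch on made-up 2-D vectors, not how `gensim` implements it internally; the helper name `cosine_sim` is chosen here for illustration. In general the value can fall anywhere from -1 to 1.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])   # parallel to a, different length
orthogonal = np.array([0.0, 3.0])       # at a right angle to a
opposite = np.array([-1.0, 0.0])        # points the other way

print(cosine_sim(a, same_direction))  # 1.0
print(cosine_sim(a, orthogonal))      # 0.0
print(cosine_sim(a, opposite))        # -1.0
```

Notice that the length of the vectors does not matter, only their direction.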

What are some things you notice about the words returned?

In [None]:
wv.most_similar('watermelon')

In [None]:
wv.most_similar('happy')

# Distance 

The number that appears next to each word is the cosine similarity between that word and the word we specified. `gensim` also provides a "distance" between two words, defined as 1 minus their cosine similarity. The higher the distance, the less "similar" those words are, since their vector embeddings are far from each other. We expect words with similar meanings and contexts to be near each other in vector representation. These patterns are part of what was learned by the neural network during the training process. 

Choose two words and find the distance between their vectors.

In [None]:
# Fill in the blanks with two words, make sure the words are a string
wv.distance(..., ...)

# Analogies

We briefly talked about this before, but now let's look at some more examples as well as how to actually perform the calculations.

`Paris : France :: Berlin : Germany`

Here, the analogy is between (Paris, France) and (Berlin, Germany), with "capital city" being the concept that connects them. We can abstract the "analogy" relationship to vector modeling. Let's pretend we're working with each of the vectors. Then, the analogy is

$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} \approx \mathbf{v}_{\text{Germany}} - \mathbf{v}_{\text{Berlin}}.$

The vector difference here represents the notion of "capital city". Presumably, going from the Paris vector to the France vector (i.e., the vector difference) will be the same as going from the Berlin vector to the Germany vector, if that difference carries similar semantic meaning.

Let's test this directly. We'll do so by rewriting the above expression:

$\mathbf{v}_{\text{France}} - \mathbf{v}_{\text{Paris}} + \mathbf{v}_{\text{Berlin}} \approx \mathbf{v}_{\text{Germany}}.$

We'll calculate the difference between France and Paris, add on Berlin, and find the closest vector to that quantity. Notice that, in all these operations, we set `norm=True` and then renormalize the result. That's because different vectors might be of different lengths, so the normalization puts everything on a common scale.
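
Before running this on the real model, it may help to see the same arithmetic on a toy example. The 3-D vectors below are invented for illustration (they are not taken from `word2vec`), chosen so that the "capital city" offset is identical for both countries.

```python
import numpy as np

# Hypothetical toy embeddings: axis 0 ~ "city", axis 1 ~ "country",
# axis 2 ~ "France (+) vs. Germany (-)"
toy = {
    "Paris":   np.array([1.0, 0.0, 1.0]),
    "France":  np.array([0.0, 1.0, 1.0]),
    "Berlin":  np.array([1.0, 0.0, -1.0]),
    "Germany": np.array([0.0, 1.0, -1.0]),
    "banana":  np.array([1.0, 1.0, 0.0]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

# France - Paris + Berlin, using normalized vectors, then renormalize
query = unit(toy["France"]) - unit(toy["Paris"]) + unit(toy["Berlin"])
query = unit(query)

# Cosine similarity to every word that was not part of the query
inputs = {"France", "Paris", "Berlin"}
scores = {w: float(np.dot(query, unit(v))) for w, v in toy.items() if w not in inputs}
best = max(scores, key=scores.get)
print(best)  # Germany
```

Real embeddings are never this tidy, which is why the analogy only holds approximately in practice.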

In [None]:
import numpy as np

# Calculate "capital city" vector difference
difference = wv.get_vector('France', norm=True) - wv.get_vector('Paris', norm=True) 
# Add on Berlin
difference += wv.get_vector('Berlin', norm=True)
# Renormalize vector
# in linear algebra, the norm of a vector is a way to measure the magnitude of a vector
difference /= np.linalg.norm(difference) 

In [None]:
# What is the most similar vector?
wv.most_similar(difference)

Here is a more concise way to do this same thing using a function we are already familiar with: `most_similar`

In [None]:
wv.most_similar(positive=["France", "Berlin"], negative=["Paris"])[:1]

## Try it out on your own [not covered]

## Challenge 1

Look up the `doesnt_match` function in `gensim`'s documentation. Use this function to identify which word doesn't match in the following group:

banana, apple, strawberry, happy

Then, try it on groups of words that you choose. Here are some suggestions:

1. A group of fruits, and a vegetable. Can it identify that the vegetable doesn't match?
2. A group of vehicles that travel by land, and a vehicle that travels by air (e.g., a plane or helicopter). Can it identify the vehicle that flies?
3. A group of scientists (e.g., biologist, physicist, chemist, etc.) and a person who does not study an empirical science (e.g., an artist). Can it identify the occupation that is not science based?

To be clear, `word2vec` does not learn the precise nature of the differences between these groups. However, the semantic differences correspond to similar words appearing near each other in large corpora.

In [None]:
# Your code here

## Challenge 2

Carry out the following word analogies:

1. Mouse : Mice :: Goose : ?
2. Kangaroo : Joey :: Cat : ?
3. United States : Dollar :: Mexico : ?
4. Happy : Sad :: Up : ?
5. California : Sacramento :: Canada : ?
6. California : Sacramento :: Washington : ?

What about something more abstract, such as:

7. United States : hamburger :: Canada : ?

Some work well, and others don't work as well. Try to come up with your own analogies!

In [None]:
# Your code here

In [None]:
import pandas as pd
import numpy as np
import re

# Training custom word embeddings

We've been using pre-trained word embeddings, but you can also train your own word embeddings using a corpus of your choice. Note that training a model on a large corpus will take a long time and be very computationally expensive, so we'll just be using a small corpus today as an example. Generally, larger corpora tend to produce better embeddings, but we can still get meaningful results from a smaller corpus.

First, let's load in and preprocess our text.

In [None]:
tweets_path = 'data/airline_tweets.csv'
tweets = pd.read_csv(tweets_path, sep=',')

In [None]:
def preprocess(text):
    """Preprocesses a string."""
    # Lowercase
    text = text.lower()
    # Replace URLs
    url_pattern = r'https?:\/\/.*[\r\n]*'
    url_repl = ' URL '
    text = re.sub(url_pattern, url_repl, text)
    # Replace digits
    digit_pattern = r'\d+'
    digit_repl = ' DIGIT '
    text = re.sub(digit_pattern, digit_repl, text)
    # Replace hashtags
    hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
    hashtag_repl = ' HASHTAG '
    text = re.sub(hashtag_pattern, hashtag_repl, text)
    # Replace users
    user_pattern = r'@(\w+)'
    user_repl = ' USER '
    text = re.sub(user_pattern, user_repl, text)
    # Remove blank spaces
    blankspace_pattern = r'\s+'
    blankspace_repl = ' '
    text = re.sub(blankspace_pattern, blankspace_repl, text).strip()
    return text

In [None]:
tweets['text_processed'] = tweets['text'].apply(preprocess)
tweets['text_processed'].head()

The `Word2Vec` module will allow us to create our own model.

In [None]:
from gensim.models import Word2Vec

This model takes in `sentences`, which is a list of lists: the outer list enumerates the documents, and each inner list enumerates the tokens within that document. So, we need to run a word tokenizer on each of the tweets. Let's use `nltk`'s word tokenizer:

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
sentences = [word_tokenize(tweet) for tweet in tweets['text_processed']]
sentences[0]

Now, we train the model. We are going to use CBOW (`sg=0`), which generally trains faster than skip-gram (`sg=1`). Take note of the other arguments we set:

In [None]:
model = Word2Vec(
    sentences=sentences,
    vector_size=30,
    window=5,
    min_count=1,
    sg=0)

The model is now trained! Let's take a look at some word vectors. We can access them using the `wv` attribute:

In [None]:
len(model.wv)

In [None]:
model.wv['worst']

Let's explore the word embeddings our model learned.

In [None]:
model.wv.most_similar('worst')

In [None]:
model.wv.distance('great', 'united')

## Extra: Sentiment Analysis [not covered]

In the previous module, we used the airline tweets dataset to perform sentiment classification: we tried to classify the sentiment of a text given the bag-of-words representation. Can we do something similar with a word embedding representation?

In the word embedding representation, we have an $N$-dimensional vector for each word in a tweet. How can we come up with a representation for the entire tweet?

The simplest approach we could take is to simply average the vectors together to come up with a "tweet representation". Let's see how this works for predicting sentiment classification.

First, we need to subset the dataset into the tweets which only have positive or negative sentiment:

In [None]:
tweets_binary = tweets[tweets['airline_sentiment'] != 'neutral']
y = tweets_binary['airline_sentiment']
print(y.value_counts(normalize=True))

Now, we need to compute the feature matrix. We will query the word vector in each tweet, and come up with an average for the sample:

In [None]:
vector_size = 30
X = np.zeros((len(y), vector_size))

# Enumerate over tweets
for idx, tweet in enumerate(tweets_binary['text_processed']):
    # Tokenize the current tweet
    tokens = word_tokenize(tweet)
    n_tokens = len(tokens)
    # Enumerate over tokens, obtaining word vectors
    for token in tokens:
        X[idx] += model.wv.get_vector(token)
    # Take the average (skip empty tweets to avoid dividing by zero)
    if n_tokens > 0:
        X[idx] /= n_tokens

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

We will be using logistic regression to classify tweets into positive or negative sentiment.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train) 
# If a red warning message appears, you can ignore it

In [None]:
print(f"Training accuracy: {lr.score(X_train, y_train)}")
print(f"Test accuracy: {lr.score(X_test, y_test)}")

While this performance is pretty good, it's not amazing. Here are some considerations we should keep in mind when looking at these results:

1. We used a word embedding on a relatively small corpus. A word embedding obtained from a very large corpus would perform better. The tricky part in doing this is that our smaller corpus may have some niche tokens that are not in the larger model, so we'd have to work around that.
2. We simply averaged word embeddings across tokens. When doing this, we lose meaning in the ordering of words. Other methods, such as `doc2vec`, have been proposed to address these concerns.
3. Word embeddings might be an overly complicated approach for the task at hand. In a tweet aimed at an airline, a person needs to convey their sentiment in only 140 characters. So they are more likely to use relatively simple words that easily convey sentiment, making a bag-of-words a natural approach.
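
The workaround mentioned in the first point (tokens missing from a larger pre-trained model) can be sketched as follows. This is one simple option under stated assumptions, not the only one: it skips unknown tokens and falls back to a zero vector when nothing is known. A plain dictionary stands in for the embedding model here, and the helper name `average_embedding` and the toy vectors are invented for illustration. `gensim`'s `KeyedVectors` supports the same `in` check.

```python
import numpy as np

def average_embedding(tokens, vectors, vector_size):
    """Average the vectors of known tokens, ignoring out-of-vocabulary ones."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No token is in the vocabulary: fall back to a zero vector
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

# Toy 2-D "embeddings" standing in for a real model
toy_vectors = {
    "great": np.array([1.0, 0.0]),
    "flight": np.array([0.0, 1.0]),
}

doc_vector = average_embedding(["great", "flight", "zzxq"], toy_vectors, vector_size=2)
print(doc_vector)  # [0.5 0.5]

unknown_only = average_embedding(["zzxq"], toy_vectors, vector_size=2)
print(unknown_only)  # [0. 0.]
```
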

It's important to note that we also lose out on the interpretability of the logistic regression model, because the actual dimensions of each word vector do not themselves have any meaning. 

Moral of the story: word embeddings are great, but always start with the simpler model! This is a good way to baseline other approaches, and it might actually work pretty well!

\newpage

# Topic Modeling

Topic modeling is an unsupervised machine learning technique used to identify clusters/groups of similar words within a body of text. While this is not clustering, you can think of it as something similar. Topic modeling is often used to characterize a collection of documents by uncovering the abstract "topics". It doesn't categorize documents into clusters, but rather groups together words/phrases that are similar, and we can use those words to determine which documents correspond to which "topic".

Consider genre classification. Some books may neatly fall into one genre, such as mystery, science fiction, etc. However, other books may be considered as incorporating multiple genres. You might have a fantasy novel which has mystery components to it, or a romance novel set in the future. In these cases, we don't want to cluster the fantasy novel into a "fantasy" bucket, and the romance novel in a "romance" bucket. We'd instead like to have some measure of assigning various topics, with different magnitudes to documents. This is the goal of topic modeling.


There are two common approaches to topic modeling: non-negative matrix factorization (NMF) and latent Dirichlet allocation (LDA). We will focus on LDA for today.

In [None]:
# This is just a plotting helper function for later
def plot_top_words(model, feature_names, n_top_words=10, n_row=2, n_col=5, normalize=False):
    """Plots the top words from a topic model.
    
    Parameters
    ----------
    model : topic model object (e.g., LatentDirichletAllocation, NMF)
        The trained topic model. It should have a components_ attribute.
    feature_names : array-like of strings
        The names of each token, as a list or array.
    n_top_words : int
        The number of words to plot for each topic.
    n_row : int
        The number of rows in the plot.
    n_col : int
        The number of columns in the plot.
    normalize : boolean
        If True, normalizes the components so that they sum to 1 along samples.
    """
    # Create figure
    fig, axes = plt.subplots(n_row, n_col, figsize=(3 * n_col, 5 * n_row), sharex=True)
    axes = axes.flatten()
    components = model.components_
    # Normalize components, if necessary
    if normalize:
        components = components / components.sum(axis=1)[:, np.newaxis]
    # Iterate over each topic
    for topic_idx, topic in enumerate(components):
        # Obtain the top words for each topic
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        # Get the token names
        top_features = [feature_names[i] for i in top_features_ind]
        # Get their values
        weights = topic[top_features_ind]

        # Plot the token weights as a bar plot
        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 20})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        
        # Customize plot
        for i in "top right left".split():
            ax.spines[i].set_visible(False)

    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)

    return fig, axes

## Dataset

We will be using a new dataset called the **20 Newsgroups** dataset. You can find the original page for this dataset [here](http://qwone.com/~jason/20Newsgroups/).

This dataset comprises around 18,000 newsgroup posts on 20 topics. The split between the train and test sets is based upon messages posted before and after a specific date. The newsgroups are as follows, with specific labels indicated:

* *Computers*
    * comp.graphics
    * comp.os.ms-windows.misc
    * comp.sys.ibm.pc.hardware
    * comp.sys.mac.hardware
    * comp.windows.x
* *Recreation*
    * rec.autos
    * rec.motorcycles
    * rec.sport.baseball
    * rec.sport.hockey
* *Science*
    * sci.crypt
    * sci.electronics
    * sci.med
    * sci.space
* *Miscellaneous*
    * misc.forsale
* *Politics*
    * talk.politics.misc
    * talk.politics.guns
    * talk.politics.mideast
* *Religion*
    * talk.religion.misc
    * alt.atheism
    * soc.religion.christian
    
Let's begin by importing the dataset. We'll use `scikit-learn` to do so.

In [None]:
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

%matplotlib inline

# Import fetcher function
from sklearn.datasets import fetch_20newsgroups

In [None]:
# Always check the documentation! 
# Note this may take a while to load
full_data, labels = fetch_20newsgroups(
    subset='train',
    shuffle=False,
    random_state=1,
    remove=("headers", "footers", "quotes"),
    return_X_y=True)

In [None]:
# Let's see some data samples
print(full_data[5])
print('\n\n--------\n\n')
print(full_data[50])
print('\n\n--------\n\n')
print(full_data[1000])

If we take a look at the labels, we see that they're integers, each specifying one of the 20 possible classes:

In [None]:
print(np.unique(labels))
print(labels.shape)

In [None]:
newsgroups = fetch_20newsgroups(
    subset='train',
    shuffle=True,
    random_state=1,
    remove=("headers", "footers", "quotes"))

In [None]:
list(newsgroups)

In [None]:
newsgroups.target_names

In [None]:
# Take a subset of the data to simplify analysis and save time
# Feel free to experiment with the entire dataset later if you would like
n_subsamples = 2000
data = full_data[:n_subsamples]

## Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a Bayesian model that captures how specific topics can generate documents. It is one of the oldest models applied to perform topic modeling.

One significant difference between LDA and NMF is that LDA is a *generative* model. This means that it can be used to *generate* new documents, by sampling from it. Assume we have a number of topics $T$. Then, we generate a new document as follows:

1. Choose a number of words $N$ according to a Poisson distribution. If you're not familiar with a Poisson distribution, don't worry - the only thing you need to know is that the outputs from a Poisson distribution can only be nonnegative integers (e.g., 0, 1, 2, 3 ...).
2. Choose a vector of values $\boldsymbol{\theta}=(\theta_1, \theta_2, \ldots, \theta_T)$ from a Dirichlet distribution. The details of a Dirichlet distribution aren't too important other than that it guarantees all of the $\theta_i$ add up to 1, and are positive. So, we can think of the $\theta_i$ as proportions, or probabilities.
3. For each of the $N$ words $w_n$:
    - Choose a topic $t_n$ according to a Multinomial distribution following $\boldsymbol{\theta}$. In other words, choose a topic according to the probabilities set by $\boldsymbol{\theta}$ (remember, we're thinking of these values as proportions, or probabilities).
    - Choose a word $w_n$ from a probability distribution $p(w_n|t_n)$ conditioned on $t_n$. This probability distribution is another Multinomial distribution.

LDA does not model the order of the words, so in the end, it produces a collection of words - just like the bag of words.

![lda](data/lda.png)

There's a lot of variables there, so let's consider a concrete example. Let's suppose we have two topics: soccer and basketball. These are $t_1$ and $t_2$. 

Some topics are more likely to contain certain words than others. For example, soccer is more likely to contain `liverpool` and `freekick`, but probably not `nba`. Basketball, meanwhile, will very likely contain `rebound` and `nba`. Furthermore, even though it's unlikely, a soccer topic might still refer to the `nba`. This unlikeliness is captured through the probabilities assigned in the distribution $p(w_n|t_n)$.

Next, each document might consist of multiple "proportions" of topics. We've already seen this in NMF, only this time, LDA captures this via a probability distribution rather than a matrix operation. So, Document 1 might mainly be about Soccer, and not really reference basketball - this would be reflected in the probabilities $\boldsymbol{\theta}=(0.9, 0.1)$. Meanwhile, another document might equally reference soccer and basketball, so we'd need a different set of parameters $\boldsymbol{\theta}=(0.5, 0.5)$.
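
The generative process can be sketched in a few lines of `numpy`. Everything below is made up for illustration: the six-word vocabulary, the word probabilities for the two topics, the Dirichlet parameter `alpha`, and the average document length. A real LDA model learns these quantities from data instead.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["liverpool", "freekick", "goal", "nba", "rebound", "dunk"]
# Rows are topics (soccer, basketball); each row is p(word | topic) and sums to 1
word_given_topic = np.array([
    [0.35, 0.30, 0.30, 0.01, 0.02, 0.02],   # soccer
    [0.01, 0.02, 0.07, 0.35, 0.30, 0.25],   # basketball
])
alpha = np.array([0.5, 0.5])   # hypothetical Dirichlet parameter
mean_length = 8                # hypothetical average document length

# Step 1: choose the number of words from a Poisson distribution
n_words = rng.poisson(mean_length)
# Step 2: choose the topic proportions from a Dirichlet distribution
theta = rng.dirichlet(alpha)
# Step 3: for each word, choose a topic, then a word from that topic
topics = rng.choice(len(theta), size=n_words, p=theta)
document = [vocab[rng.choice(len(vocab), p=word_given_topic[t])] for t in topics]

print(theta)
print(document)
```

Fitting LDA runs this logic in reverse: given only the documents, it estimates the topic proportions and the word probabilities that most plausibly generated them.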

Once again, we're going to use `scikit-learn` to perform LDA. This time, however, we'll use a `CountVectorizer`, since LDA explicitly models *counts*.

In [None]:
# Use a CountVectorizer
n_tokens = 1000
count_vectorizer = CountVectorizer(
    max_df=0.95,
    min_df=2,
    max_features=n_tokens,
    stop_words="english")

In [None]:
# Fit and transform CountVectorizer
counts = count_vectorizer.fit_transform(data)
print(counts.shape)

In [None]:
tokens = count_vectorizer.get_feature_names_out()

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
n_components = 10
random_state = 0

lda = LatentDirichletAllocation(
    n_components=n_components,
    max_iter=5,
    learning_method="online", # Use when dataset is large
    learning_offset=50.0, 
    random_state=random_state)

In [None]:
# Fit the LDA model
lda.fit(counts)

How can we analyze the trained model? The `lda` object comes with a `components_` attribute, which corresponds to the topic-word distribution. Let's look at the raw values first, and then plot them using the function we created above:
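
In `scikit-learn`, each row of `components_` can be read as pseudo-counts of how strongly each token is associated with a topic, and dividing each row by its sum turns it into a probability distribution over tokens. Here is a minimal sketch on a made-up 2-topic, 4-token matrix (the numbers and token names are invented, not output from the model above):

```python
import numpy as np

toy_tokens = np.array(["goal", "nba", "freekick", "rebound"])
# Hypothetical pseudo-counts: one row per topic, one column per token
toy_components = np.array([
    [8.0, 1.0, 6.0, 5.0],
    [2.0, 9.0, 1.0, 8.0],
])

# Normalize each row so that it sums to 1
toy_topic_word = toy_components / toy_components.sum(axis=1, keepdims=True)

# Top 2 tokens per topic, from highest to lowest probability
top_words = {}
for topic_idx, row in enumerate(toy_topic_word):
    top_idx = row.argsort()[::-1][:2]
    top_words[topic_idx] = [str(w) for w in toy_tokens[top_idx]]

print(toy_topic_word)
print(top_words)  # {0: ['goal', 'freekick'], 1: ['nba', 'rebound']}
```

Normalizing does not change the ranking of tokens within a topic; it only puts the weights of different topics on the same scale.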

In [None]:
lda.components_

Can you match up the target names with any of these plots below?

In [None]:
# This time, we're normalizing - what does this do?
fig, axes = plot_top_words(lda, tokens, normalize=True)
plt.show()

## Extra: Dimensionality Reduction [not covered]

In both NMF and LDA, we broke down the documents into topics. This was, in effect, a *change in representation*. We went from a document-term matrix (DTM) representation to a representation of *topics*. 

Because there are fewer topics than there are tokens, we can think of this as a *dimensionality reduction*. This is desirable for several reasons, the main one being that it's easier to interpret, say, 10 dimensions than it is to interpret 1000.

This is computationally true, as well: once we get to higher dimensions, it's harder to compare different vectors with each other, because they generally end up all close to orthogonal. This is known as the *curse of dimensionality*.
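
This effect is easy to see in a small simulation. The sketch below draws random Gaussian vectors (an assumption made for illustration; real document vectors are not Gaussian) and measures the average absolute cosine similarity between distinct pairs as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_vectors=200):
    """Average |cosine similarity| between distinct random vectors."""
    vectors = rng.standard_normal((n_vectors, dim))
    # Scale every vector to unit length so dot products are cosines
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = vectors @ vectors.T
    # Drop the diagonal, which compares each vector with itself
    off_diagonal = sims[~np.eye(n_vectors, dtype=bool)]
    return float(np.abs(off_diagonal).mean())

results = {dim: mean_abs_cosine(dim) for dim in (2, 10, 1000)}
for dim, value in results.items():
    print(f"dim={dim}: mean |cosine similarity| = {value:.3f}")
```

The average shrinks toward 0 as the dimension increases, meaning that random pairs of vectors look increasingly orthogonal and therefore harder to tell apart by similarity alone.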

Let's first transform the counts into the topic representation:

In [None]:
topic_representation = lda.transform(counts)
topic_representation.shape

Let's use a familiar similarity measure to calculate the similarity between pairs of documents.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Similarity of the first few docs
# Notice something special about the diagonal? What does it mean?
cosine_similarity(topic_representation[:4])

### Try it yourself: Finding Similar Documents
Calculate the cosine similarity between all pairs of documents, and find the two documents whose cosine similarity is the highest. What are these documents? Do they seem similar?

In [None]:
#Your code here