Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# WORD EMBEDDINGS FOR CLASSIFICATION

## Pretrained word embeddings

We can make use of pretrained word embeddings to represent our input text in a classification problem. Let's try it out with the embeddings we've trained in the word embeddings notebook, which have the advantage of having been trained on data that is similar to our classification task's data (reviews). You could try other embeddings (such as those available in [Gensim](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html)).

In [1]:
import gensim

wv = gensim.models.KeyedVectors.load("data/reviews_wv.txt")

Let's load data for our classification task.

In [2]:
import pandas as pd
import re

# Importing the dataset
dataset = pd.read_csv('data/restaurant_reviews.tsv', delimiter = '\t', quoting = 3)

dataset

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
995,I think food should have flavor and texture an...,0
996,Appetite instantly gone.,0
997,Overall I was not impressed and would not go b...,0
998,"The whole experience was underwhelming, and I ...",0


To make sure we only keep tokens (words) for which we can look up embeddings, we'll limit ourselves to lower-case alphabetic sequences. For that, we do some preprocessing:

In [3]:
# cleanup
corpus = []
for text in dataset['Review']:
    # replace non-alphabetic chars with spaces and convert to lower-case
    review = re.sub('[^a-zA-Z]', ' ', text).lower()
    # add review to corpus
    corpus.append(review)
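
As a quick check of what the cleanup does, here is a minimal sketch that applies the same substitution to single strings. The first string comes from the dataset preview above; the second is a made-up sentence (an assumption, not a row from the dataset) that shows how contractions are split.

```python
import re

# One review taken from the dataset preview above
raw = "Wow... Loved this place."

# Same substitution as in the cleanup loop: every non-letter becomes a space
cleaned = re.sub('[^a-zA-Z]', ' ', raw).lower()

# split() with no arguments collapses runs of whitespace, so the spaces
# left behind by the punctuation do not produce empty tokens
tokens = cleaned.split()
print(tokens)   # ['wow', 'loved', 'this', 'place']

# Hypothetical sentence: the apostrophe is also replaced by a space,
# so the contraction ends up as two separate tokens
contraction = re.sub('[^a-zA-Z]', ' ', "Didn't like it").lower().split()
print(contraction)   # ['didn', 't', 'like', 'it']
```

Fragments such as `t` may or may not have an embedding, depending on how the embeddings were trained.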

Now we can convert our "cleaned" corpus into embeddings.

#### Fixing the length of the input

The reviews in our corpus have variable length. However, we need to represent them with a fixed-length vector of features. One way to do it is to impose a limit on the number of word embeddings we want to include.

To convert words into their vector representations (embeddings), let's create an auxiliary function that takes in the number of embeddings we wish to include in the representation:

In [4]:
import numpy as np

def text_to_vector(embeddings, text, sequence_len):
    
    # split text into tokens
    tokens = text.split()
    
    # convert tokens to embedding vectors, up to sequence_len tokens
    vec = []
    n = 0
    i = 0
    while i < len(tokens) and n < sequence_len:   # while there are tokens and did not reach desired sequence length
        try:
            vec.extend(embeddings.get_vector(tokens[i]))
            n += 1
        except KeyError:
            pass   # simply ignore out-of-vocabulary tokens
        finally:
            i += 1
    
    # add blanks up to sequence_len, if needed
    for j in range(sequence_len - n):
        vec.extend(np.zeros(embeddings.vector_size,))
    
    return vec

The above *text_to_vector* function takes an *embeddings* dictionary, the *text* to convert, and the number of words *sequence_len* from *text* to consider. It returns a single vector with the appended embeddings of the first *sequence_len* words that exist in the *embeddings* dictionary (tokens for which no embedding is found are ignored). If the text has fewer than *sequence_len* words for which we have embeddings, blank (all-zero) embeddings are appended to fill the remaining positions.
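
The three behaviours (truncation, padding, and skipping unknown tokens) can be seen on a toy example. This is only a sketch: `ToyEmbeddings` is a hypothetical stand-in that exposes just `get_vector` and `vector_size`, the 2-dimensional vectors are made up, and the function is restated compactly with plain lists so that the block runs without the trained embeddings.

```python
class ToyEmbeddings:
    """Hypothetical stand-in exposing only what text_to_vector uses."""
    vector_size = 2

    def __init__(self, table):
        self.table = table

    def get_vector(self, token):
        return self.table[token]   # raises KeyError for unknown tokens


def text_to_vector(embeddings, text, sequence_len):
    # Compact restatement of the function above, using plain lists
    vec, n = [], 0
    for token in text.split():
        if n == sequence_len:
            break
        try:
            vec.extend(embeddings.get_vector(token))
            n += 1
        except KeyError:
            pass   # out-of-vocabulary token: skipped, does not count
    # pad with zeros for the positions that were not filled
    vec.extend([0.0] * embeddings.vector_size * (sequence_len - n))
    return vec


toy = ToyEmbeddings({"good": [1.0, 2.0], "food": [3.0, 4.0], "bad": [5.0, 6.0]})

truncated = text_to_vector(toy, "good food bad", 2)   # third word is dropped
padded = text_to_vector(toy, "good", 2)               # one zero vector is appended
skipped = text_to_vector(toy, "good zzz food", 2)     # "zzz" is ignored

print(truncated)   # [1.0, 2.0, 3.0, 4.0]
print(padded)      # [1.0, 2.0, 0.0, 0.0]
print(skipped)     # [1.0, 2.0, 3.0, 4.0]
```

Whatever the input text, the result always has `sequence_len * vector_size` values.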

To better decide how many word embeddings we wish to append, let's learn a bit more about the length of each review in our corpus.

In [5]:
from scipy import stats

lens = [len(c.split()) for c in corpus]
print(np.min(lens), np.max(lens), np.mean(lens), np.std(lens), stats.mode(lens))

1 32 11.04 6.312242073938545 ModeResult(mode=array([4]), count=array([80]))



So, we have reviews ranging from 1 to 32 tokens (words), with a mean length of 11.04 and a standard deviation of 6.31; the most frequent review length is 4 (80 reviews).

Let's limit reviews to, say, 10 tokens: longer reviews will get truncated, while shorter reviews will be padded with empty embeddings for the missing tokens. (Note: because *text_to_vector* skips out-of-vocabulary tokens, padding may also be applied to reviews with 10 or more tokens.)
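
To get a feel for what a given cut-off implies, one can count how many reviews fall on each side of it. The sketch below uses a small made-up list of token counts (an assumption, not the real corpus); in the notebook, the `lens` list computed earlier could be used instead.

```python
# Hypothetical token counts for six reviews (not taken from the real corpus)
lens = [1, 4, 4, 10, 12, 32]
cutoff = 10

longer = sum(1 for n in lens if n > cutoff)    # reviews that lose tokens
shorter = sum(1 for n in lens if n < cutoff)   # reviews that get padding
exact = len(lens) - longer - shorter           # reviews that fit exactly

print(longer, shorter, exact)   # 2 3 1
```

These counts are based on raw token counts only. Since out-of-vocabulary tokens are skipped, the number of reviews that actually receive padding can be higher.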

In [6]:
# convert corpus into dataset with appended embeddings representation
embeddings_corpus = []
for c in corpus:
    embeddings_corpus.append(text_to_vector(wv, c, 10))

X = np.array(embeddings_corpus)
y = dataset['Liked']

print(X.shape, y.shape)

(1000, 1500) (1000,)


As expected, our feature vectors have 1500 dimensions: 10 times the size of each embedding vector, which is 150 in this case.

Now we can use this feature representation to train a model! Try out training a Logistic Regression or a Support Vector Machine model.

In [None]:
# your code here


#### Aggregating word embeddings

Instead of appending word embeddings from a fixed number of tokens, we could use the embeddings of all the tokens in a review and take their mean. This way, we still get a fixed-length representation, whose size equals the embedding vector size (150 in our case).
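
The key property is that the mean has the same size no matter how many vectors go into it. Here is a minimal sketch with made-up 3-dimensional vectors and plain lists; `mean_vector` is a hypothetical helper that only averages vectors it is given and does no embedding lookup.

```python
def mean_vector(vectors):
    # zip(*vectors) groups the values by dimension, so each output
    # entry is the average of one dimension across all vectors
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


# Made-up "reviews" with 2 and 4 token vectors, respectively
short_review = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
long_review = [[1.0, 0.0, 2.0], [2.0, 0.0, 2.0], [3.0, 3.0, 2.0], [2.0, 1.0, 2.0]]

short_mean = mean_vector(short_review)
long_mean = mean_vector(long_review)

print(short_mean)   # [2.0, 3.0, 4.0]
print(long_mean)    # [2.0, 1.0, 2.0]
```

With the token vectors stacked in a NumPy array, the same result is obtained with `np.mean(vectors, axis=0)`.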

Implement the *text_to_mean_vector* function, which takes the embeddings dictionary and the text to convert, and returns the mean of the embeddings of its tokens.

In [None]:
def text_to_mean_vector(embeddings, text):
    # your code here
    pass


Use the above function to convert the corpus into a dataset with mean embeddings representation. The shape of the feature matrix *X* should be *(1000, 150)*.

In [None]:
# your code here


Now we can use this mean embeddings representation to train a model! Try out training a Logistic Regression or a Support Vector Machine model.

In [None]:
# your code here


It is also possible to use other aggregation functions, besides taking the mean of the word embeddings. For instance, we could take the element-wise *max*. Try it out and check if you notice any changes in the performance of the models!
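
To illustrate what element-wise *max* means, here is a small sketch on made-up 3-dimensional vectors; `max_vector` is a hypothetical helper, not part of any library.

```python
def max_vector(vectors):
    # for each dimension, keep the largest value found across all vectors
    return [max(dim) for dim in zip(*vectors)]


# Made-up token vectors for a three-token text
vectors = [[0.2, -1.0, 0.5],
           [0.9, -0.3, 0.1],
           [0.4, -0.8, 0.7]]

pooled = max_vector(vectors)
print(pooled)   # [0.9, -0.3, 0.7]
```

Note that each dimension of the result may come from a different token, whereas the mean blends all tokens in every dimension.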

In [None]:
# your code here
