# Word Embeddings and Applications

Notebook Contents:
- Loading word embeddings using the [gensim](https://radimrehurek.com/gensim/) package
- Using word embeddings for a text classification task
- Training a word2vec model using gensim

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Loading pre-trained word embeddings (GloVe)

In [None]:
import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-300')



In [None]:
len(wv)

400000

We can obtain the vector representation of a word:

In [None]:
vec_king = wv['king']
vec_king.shape

(300,)

## Calculate word similarity

The function `wv.similarity` returns the cosine similarity between the vectors of two words:

In [None]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'automobile'),   # an automobile is another word for a car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.50
'car'	'automobile'	0.60
'car'	'bicycle'	0.50
'car'	'airplane'	0.43
'car'	'cereal'	0.03
'car'	'communism'	0.02
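
The scores above are cosine similarities between word vectors. As a rough sketch of that computation, the snippet below uses NumPy on made-up three-dimensional vectors. The `toy` dictionary and the `cosine_similarity` helper are illustrative assumptions, not real GloVe values or part of the gensim API.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, for illustration only
toy = {
    "car": np.array([1.0, 0.0, 0.0]),
    "automobile": np.array([2.0, 0.0, 0.0]),  # same direction as "car"
    "cereal": np.array([0.0, 1.0, 0.0]),      # orthogonal to "car"
}

sim_same = cosine_similarity(toy["car"], toy["automobile"])
sim_orth = cosine_similarity(toy["car"], toy["cereal"])
print(sim_same, sim_orth)
```

Because cosine similarity depends only on direction, vectors that point the same way score close to 1 regardless of their length, and orthogonal vectors score 0.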


Print the 5 words most similar to “car” and “minivan” taken together

In [None]:
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

[('suv', 0.7696972489356995), ('vehicle', 0.7469112873077393), ('truck', 0.7312718629837036), ('cars', 0.7033854722976685), ('jeep', 0.6848679184913635)]


Which of the words below does not belong in the list?

In [None]:
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

car
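
One way to pick the odd word out is to compare each word with the average of the group and return the least similar one. The sketch below assumes that strategy. The `odd_one_out` helper and the two-dimensional `toy` vectors are hypothetical, and gensim's own implementation may differ in its details.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    # Normalise each vector to unit length so that long vectors do not dominate
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = unit.mean(axis=0)
    # Dot product of each unit vector with the mean of the group
    scores = unit @ centre
    return words[int(np.argmin(scores))]

# Hypothetical toy vectors, for illustration only
toy = {
    "water": np.array([1.0, 0.1]),
    "sea": np.array([0.9, 0.2]),
    "air": np.array([1.0, 0.0]),
    "car": np.array([0.0, 1.0]),
}

result = odd_one_out(["water", "sea", "air", "car"], toy)
print(result)
```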


## Word Analogy


In [None]:
print(wv.similar_by_vector(wv['spain'] - wv['madrid'] + wv['athens'], topn=10))

[('greece', 0.7637240886688232), ('athens', 0.7158880233764648), ('spain', 0.5469861030578613), ('greek', 0.5434280633926392), ('cyprus', 0.5079883933067322), ('bulgaria', 0.49355754256248474), ('portugal', 0.4708734154701233), ('hungary', 0.4684615135192871), ('crete', 0.4490693211555481), ('greeks', 0.4459525942802429)]


In [None]:
print(wv.similar_by_vector(wv['king'] - wv['man'] + wv['woman'], topn=10))

[('king', 0.8065858483314514), ('queen', 0.6896163821220398), ('monarch', 0.5575490593910217), ('throne', 0.5565374493598938), ('princess', 0.5518684387207031), ('mother', 0.5142154097557068), ('daughter', 0.5133156776428223), ('kingdom', 0.5025345087051392), ('prince', 0.5017740726470947), ('elizabeth', 0.4908031225204468)]
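
Note that `similar_by_vector` does not filter out the words used to build the query, which is why "athens", "spain", and "king" appear in the results above. In gensim, `wv.most_similar(positive=[...], negative=[...])` excludes the input words from its results. The sketch below shows the same idea on a hypothetical two-dimensional vocabulary. The `analogy` helper and the `toy` vectors are assumptions made for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors, topn=1):
    """Rank words by similarity to vectors[a] - vectors[b] + vectors[c].

    The three input words are skipped so they cannot appear in the result.
    """
    query = vectors[a] - vectors[b] + vectors[c]
    excluded = {a, b, c}
    ranked = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]

# Hypothetical toy vectors; the dimensions loosely mean (royalty, maleness)
toy = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

best = analogy("king", "man", "woman", toy)
print(best)
```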


## Using Word Embeddings for Text Classification Task

In this part, you will use the [News Aggregator Data Set](https://archive.ics.uci.edu/ml/datasets/News+Aggregator) to build a model that classifies an article as "business", "science and technology", "entertainment", or "health" based on its title.

We have prepared training ([train.csv](https://www.dl.dropboxusercontent.com/s/rs7sqtb87m30o17/train.csv)) and test ([test.csv](https://www.dl.dropboxusercontent.com/s/fu4wa76kiwlby7u/test.csv)) data sets in CSV format. The first line of each file is a header with two columns, "TITLE" and "CATEGORY". We will use the articles' titles for classification.

The category codes have the following meanings:

(b = business, t = science and technology, e = entertainment, m = health)

In [None]:
%%capture
!rm -f train.csv
!rm -f test.csv
!wget https://www.dl.dropboxusercontent.com/s/rs7sqtb87m30o17/train.csv
!wget https://www.dl.dropboxusercontent.com/s/fu4wa76kiwlby7u/test.csv

Loading the data

In [None]:
import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

Count the number of samples per category in the training and test sets

In [None]:
def get_stats(df):
    print(df["CATEGORY"].value_counts())

get_stats(df_train)

b    4530
e    4178
t    1225
m     739
Name: CATEGORY, dtype: int64


In [None]:
get_stats(df_test)

b    558
e    541
t    155
m     80
Name: CATEGORY, dtype: int64


### Extracting titles and labels

In [None]:
train_texts = df_train['TITLE']
y_train = df_train['CATEGORY']

test_texts = df_test['TITLE']
y_test = df_test['CATEGORY']

### Preprocess data

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
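from nltk.tokenize import word_tokenize

def preprocess(text):
    """Tokenize a title and return it as a lowercased, space-separated string."""
    text = text.strip()
    # Split punctuation and clitics into separate tokens, e.g. "that's" -> "that", "'s"
    tokens = word_tokenize(text)
    # Re-join with single spaces so that str.split() recovers the tokens later
    return " ".join(tokens).lower()

print(train_texts[0])
print(preprocess(train_texts[0]))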

Taco Bell reveals 'secret' ingredients of mystery beef that's 88 per cent cow
taco bell reveals 'secret ' ingredients of mystery beef that 's 88 per cent cow


In [None]:
train_clean_texts = [preprocess(t) for t in train_texts]
test_clean_texts = [preprocess(t) for t in test_texts]

### Building a logistic regression model with BoW features

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

vectorizer = CountVectorizer(max_features=20000)
X_train = vectorizer.fit_transform(train_clean_texts)
X_test = vectorizer.transform(test_clean_texts)

clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)

y_preds = clf.predict(X_test)

print(metrics.classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           b       0.91      0.96      0.94       558
           e       0.93      0.97      0.95       541
           m       0.90      0.66      0.76        80
           t       0.89      0.71      0.79       155

    accuracy                           0.92      1334
   macro avg       0.91      0.83      0.86      1334
weighted avg       0.92      0.92      0.91      1334
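
To make the bag-of-words features concrete, here is a minimal pure-Python sketch of the idea: each document becomes a vector of word counts over a shared vocabulary. The two example documents and the `bow_vector` helper are made up for illustration. `CountVectorizer` also applies its own tokenization and returns a sparse matrix, which this sketch does not replicate.

```python
from collections import Counter

# Hypothetical toy documents, already lowercased and tokenized by spaces
docs = ["taco bell reveals secret", "bell rings"]

# The vocabulary is the sorted set of all words seen in the documents
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bow_vector(doc) for doc in docs]
print(vocab)
print(matrix)
```

Each row has one entry per vocabulary word, so the feature dimension grows with the vocabulary. This is why the notebook caps it with `max_features=20000`.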



### Using averaged features derived from pre-trained word embeddings

We will represent each title by the average of the word vectors of its in-vocabulary words

In [None]:
import numpy as np

def sent2vec(s):
    """Return the average of the word vectors of the in-vocabulary words in s."""
    words = s.split()
    vectors = [wv[w] for w in words if w in wv]
    if not vectors:
        # No known words: fall back to a zero vector so every row has the same shape
        return np.zeros(wv.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

X_train_w2v = np.array([sent2vec(s) for s in train_clean_texts])
X_test_w2v = np.array([sent2vec(s) for s in test_clean_texts])

X_train_w2v.shape

(10672, 300)
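
As a small worked example of the averaging step, the sketch below runs it on a hypothetical two-dimensional embedding table. The names `toy_wv` and `average_vector` and the zero-vector fallback for sentences with no known words are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy embedding table, for illustration only
toy_wv = {
    "taco": np.array([1.0, 0.0]),
    "bell": np.array([0.0, 1.0]),
}
DIM = 2

def average_vector(sentence, embeddings, dim):
    """Average the vectors of the words in sentence that have an embedding."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        # Assumed fallback: a zero vector when no word is in the vocabulary
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

v_known = average_vector("taco bell unknownword", toy_wv, DIM)
v_empty = average_vector("unknownword", toy_wv, DIM)
print(v_known, v_empty)
```

Out-of-vocabulary words are skipped, so they do not affect the average. Word order is also lost, since the mean is the same for any permutation of the words.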

Training Logistic Regression with Word Embedding Features

In [None]:
clf = LogisticRegression(max_iter=500)
clf.fit(X_train_w2v, y_train)

y_preds = clf.predict(X_test_w2v)

print(metrics.classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           b       0.89      0.94      0.91       558
           e       0.94      0.96      0.95       541
           m       0.86      0.82      0.84        80
           t       0.84      0.63      0.72       155

    accuracy                           0.90      1334
   macro avg       0.88      0.84      0.86      1334
weighted avg       0.90      0.90      0.90      1334



## References

- [Word2Vec Model Tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py) on gensim documentation