# HW06: Word Embeddings

Remember that these homework assignments are graded for completion. **You can <span style="color:red">not</span> skip any section of this homework.**

**Essay Feedback**

Please provide feedback to two classmates' essays on Eduflow.

**Training word2vec**

In this section, we train a word2vec model using gensim. We train the model on text8, which consists of the first 100M bytes of plain text from a 2006 Wikipedia dump and is a common benchmark for evaluating language models.

In [1]:
import gensim.downloader as api

api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [2]:
dataset = api.load("text8")

In [3]:
from gensim.models import Word2Vec

##TODO train a word2vec model on this dataset, only consider words which appear at least 10 times in the corpus
model = Word2Vec(
    dataset,          # list of tokenized sentences
    workers=4,        # number of threads to run in parallel
    vector_size=300,  # word vector dimensionality
    min_count=10,     # minimum word count
    window=5,         # context window size
    sample=1e-3,      # downsampling threshold for frequent words
)
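
To make the `window` parameter more concrete, here is a minimal sketch of how (center, context) training pairs can be built from a tokenized sentence in the skip-gram formulation. The function `skipgram_pairs` and the toy sentence are hypothetical illustrations, not part of gensim. Note that gensim's default training mode is CBOW (`sg=0`), and real implementations add details such as frequent-word subsampling and negative sampling that this sketch leaves out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the context window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, only for illustration
toy_sentence = ["the", "king", "rules", "the", "land"]
pairs = skipgram_pairs(toy_sentence, window=1)
print(pairs)
```

A larger `window` produces more pairs per word and tends to capture broader topical context, while a smaller one focuses on immediate neighbors.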

**Word Similarities**

gensim models provide almost all the utilities you might need for standard word similarity tasks. They are available in the `.wv` (word vectors) attribute of the model; more details can be found [here](https://radimrehurek.com/gensim/models/keyedvectors.html).

In [4]:
word_vectors = model.wv

##TODO find the closest words to king
results = word_vectors.most_similar(positive=['king'])
for idx, (key, sim) in enumerate(results):
    print(f'{idx}. "{key}" with similarity {sim:.3f}')

0. kings with similarity 0.661
1. prince with similarity 0.646
2. throne with similarity 0.644
3. queen with similarity 0.633
4. regent with similarity 0.611
5. emperor with similarity 0.609
6. pharaoh with similarity 0.599
7. vii with similarity 0.595
8. sultan with similarity 0.594
9. aragon with similarity 0.589
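
The similarity scores reported by `most_similar` are cosine similarities between word vectors. As a rough illustration, cosine similarity can be computed by hand as below. The three-dimensional vectors are made-up toy values (the trained model uses 300 dimensions), and `cosine_similarity` is a hypothetical helper, not a gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, only for illustration
toy_king = [0.9, 0.8, 0.1]
toy_queen = [0.8, 0.9, 0.2]
toy_apple = [0.1, 0.0, 0.9]

print(f"king/queen: {cosine_similarity(toy_king, toy_queen):.3f}")
print(f"king/apple: {cosine_similarity(toy_king, toy_apple):.3f}")
```

Because the measure depends only on the angle between vectors, scaling a vector does not change its similarity to other words.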


Man is to king as woman is to X

In [8]:
##TODO find the closest word for the vector "woman" + "king" - "man"
results = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
for idx, (key, sim) in enumerate(results):
    print(f'{idx}. "{key}" with similarity {sim:.3f}')

0. "queen" with similarity 0.619
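
The analogy query can be pictured as plain vector arithmetic followed by a nearest-neighbor search. The sketch below uses made-up three-dimensional vectors and a hypothetical `analogy` helper. It is a simplification: gensim's `most_similar` also normalizes the input vectors before combining them, which is skipped here.

```python
import math

# Made-up toy embeddings with dimensions roughly meaning (royalty, male, female)
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.8, 0.9, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = {
        word: cosine(target, vec)
        for word, vec in vectors.items()
        if word not in positive and word not in negative
    }
    return max(candidates, key=candidates.get)


best = analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors)
print(best)
```

Excluding the query words matters in practice: without it, the nearest neighbor of the combined vector is often one of the input words themselves.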


**Evaluate Word Similarities** 

One common way to evaluate word2vec models is with word similarity tasks. Let's check how well our model does on one of those. We consider the [WordSim353](http://alfonseca.org/eng/research/wordsim353.html) benchmark, where the task is to determine how similar two words are and the predictions are compared against human ratings.

In [30]:
#!wget http://alfonseca.org/pubs/ws353simrel.tar.gz
#!tar xf ws353simrel.tar.gz

path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"

def load_data(path):
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip().split("\t")
            X.append((line[0], line[1])) # each entry in x contains two words, e.g. X[0] = (tiger, cat)
            y.append(float(line[-1])) # each entry in y is the annotation how similar two words are, e.g. Y[0] = 7.35
    return X, y

X, y = load_data(path)
print (X[:3], y[:3])

[('tiger', 'cat'), ('tiger', 'tiger'), ('plane', 'car')] [7.35, 10.0, 5.77]


In [33]:
##TODO compute how similar the pairs in the WordSim353 are according to our model
##TODO if a word is not present in our model, we assign similarity 0 for the respective word pair
y_w2v = []
for x1, x2 in X:
    if x1 in word_vectors and x2 in word_vectors:
        y_w2v.append(word_vectors.similarity(x1, x2))
    else:
        y_w2v.append(0)

for (x1, x2), sim in zip(X[:3], y_w2v[:3]):
    print(f'"{x1}", "{x2}": {sim:.3f}')



"tiger", "cat": 0.604
"tiger", "tiger": 1.000
"plane", "car": 0.428


In [41]:
import numpy as np
from scipy.stats import spearmanr

##TODO compute spearman's rank correlation between our prediction and the human annotations
sp = spearmanr(y, y_w2v)
print(f'Spearman correlation={sp.correlation:.3f} with pvalue={sp.pvalue:.1e}')



Spearman correlation=0.647 with pvalue=1.7e-25
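
Spearman's rank correlation only looks at the ordering of the scores, which is why it is suitable for comparing cosine similarities (at most 1) with human ratings on a 0 to 10 scale. A minimal sketch of the idea, assuming the standard definition as the Pearson correlation of the ranks: the helper names are hypothetical, and the fourth data point is invented to make the toy example less trivial (the first three come from the outputs above).

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(a), average_ranks(b))


toy_human = [7.35, 10.0, 5.77, 1.0]      # last value is invented
toy_model = [0.604, 1.000, 0.428, 0.5]   # last value is invented
rho = spearman(toy_human, toy_model)
print(f"rho = {rho:.3f}")
```

In the notebook itself, `scipy.stats.spearmanr` does this work and additionally reports a p-value.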


In [54]:
import spacy
en = spacy.load('en_core_web_sm')

##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings
y_spacy = []
for x1, x2 in X:
    y_spacy.append(en(x1).similarity(en(x2)))

for (x1, x2), sim in zip(X[:3], y_spacy[:3]):
    print(f'"{x1}", "{x2}": {sim:.3f}')

##TODO compute spearman's rank correlation between these similarities and the human annotations
# Don't worry if results are not too convincing for this experiment:
# en_core_web_sm ships without static word vectors, so its similarity scores are unreliable
sp = spearmanr(y, y_spacy)
print(f'Spearman correlation={sp.correlation:.3f} with pvalue={sp.pvalue:.1e}')




"tiger", "cat": 0.633
"tiger", "tiger": 1.000
"plane", "car": 0.713
Spearman correlation=0.031 with pvalue=6.6e-01


**Keras Embeddings**

In [55]:
# Import the AG News dataset (same as hw01).
# Download it first:
# !wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd

# The CSV has no header row, so header=None keeps the first article as data
df = pd.read_csv('train.csv', header=None)

df.columns = ["label", "title", "lead"]
label_map = {1: "world", 2: "sport", 3: "business", 4: "sci/tech"}
def replace_label(x):
    return label_map[x]
df["label"] = df["label"].apply(replace_label)
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000)  # only use 10K datapoints
df.head()

| | label | title | lead | text |
|---|---|---|---|---|
| 20637 | sci/tech | Court Bars Rubble Removal from Jerusalem Shrin... | Reuters - Israel's Supreme Court has\temporari... | Court Bars Rubble Removal from Jerusalem Shrin... |
| 26554 | business | Federated betting on Macy #39;s name | In a widely anticipated move, Federated Depart... | Federated betting on Macy #39;s name In a wide... |
| 42397 | sport | Soccer: US Captain Reyna Injured | MANCHESTER, England - United States team capta... | Soccer: US Captain Reyna Injured MANCHESTER, E... |
| 47069 | world | Iraqi children bear brunt of bombings | Some of the children cry, refusing to speak bu... | Iraqi children bear brunt of bombings Some of ... |
| 2035 | business | Confidence drops in Germany | Investor confidence in Germany, Europe #39;s b... | Confidence drops in Germany Investor confidenc... |


In [58]:
from keras.preprocessing.text import text_to_word_sequence

##TODO tokenize the text using text_to_word_sequence
df['toks'] = df['text'].apply(text_to_word_sequence)
print(df['toks'].head())


20637    [court, bars, rubble, removal, from, jerusalem...
26554    [federated, betting, on, macy, 39, s, name, in...
42397    [soccer, us, captain, reyna, injured, manchest...
47069    [iraqi, children, bear, brunt, of, bombings, s...
2035     [confidence, drops, in, germany, investor, con...
Name: toks, dtype: object


In [68]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

length_vocab = 1000
max_seq_length = 100

#TODO create a one_hot representation for each word and truncate/pad the sequences such that they are all of the same length
# one_hot expects a text string and returns one integer index per word,
# so encode each document as a whole and then pad the list of index sequences
X = [one_hot(" ".join(toks), length_vocab) for toks in df['toks']]
X = pad_sequences(X, maxlen=max_seq_length)
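
Despite its name, `one_hot` maps each word to an integer index via hashing, and `pad_sequences` then forces every index sequence to the same length. The following sketch imitates both steps in plain Python. The helpers `toy_hash_index` and `pad_or_truncate` are hypothetical, and the CRC32 hash is an assumption chosen for reproducibility, not the hash Keras uses. The padding helper follows the Keras defaults, which pad and truncate at the front of the sequence.

```python
import zlib


def toy_hash_index(word, vocab_size):
    """Map a word to an index in 1..vocab_size-1 (0 is reserved for padding)."""
    return zlib.crc32(word.encode("utf-8")) % (vocab_size - 1) + 1


def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` ids and left-pad shorter sequences with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


toy_tokens = ["soccer", "us", "captain", "reyna", "injured"]
toy_ids = [toy_hash_index(t, 1000) for t in toy_tokens]

print(pad_or_truncate(toy_ids, maxlen=8))  # shorter than 8: zeros are added in front
print(pad_or_truncate(toy_ids, maxlen=3))  # longer than 3: only the last 3 ids are kept
```

With a vocabulary size of 1000, distinct words can collide on the same index, which is a known trade-off of the hashing approach.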

In [70]:
from keras.models import Sequential
from keras.layers import Embedding

##TODO create a sequential model with just one embedding layer and show the model summary
model = Sequential()                        # create a sequential model
model.add(Embedding(
    input_dim=length_vocab,                 # one hot input dimension 
    output_dim=100,                         # embedding dimension
    input_length=max_seq_length             # sequence length
))
model.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 100)          100000    
                                                                 
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________
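
The parameter count and output shape in the summary follow directly from what an embedding layer does: it stores one trainable vector per vocabulary index and looks vectors up by index. A minimal sketch in plain Python, assuming random initial weights (the `embedding_table` and `embed` names are hypothetical and this is not how Keras implements the layer internally):

```python
import random

vocab_size, embedding_dim, seq_length = 1000, 100, 100

# One row of `embedding_dim` weights per vocabulary index
random.seed(0)
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]
num_params = vocab_size * embedding_dim  # matches "Param #" in the summary


def embed(token_ids):
    """Replace every token id by its row in the embedding table."""
    return [embedding_table[i] for i in token_ids]


# A hypothetical padded sequence: 97 padding zeros followed by three word ids
toy_ids = [0] * 97 + [12, 7, 12]
embedded = embed(toy_ids)
print(num_params, len(embedded), len(embedded[0]))
```

Each input sequence of 100 indices becomes a 100 x 100 matrix, which corresponds to the `(None, 100, 100)` output shape, where `None` stands for the batch dimension.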
