# Word Embedding Resonance Model Experiment

## Introduction

As explained in the paper, this data comes from **The Big Bad NLP Database** (https://datasets.quantumstat.com). We chose the AG News set, which has about 120k articles for topic classification - the topics being world, sports, business, and sci/tech. We also chose the Amazon Fine Food Reviews dataset to introduce text that we expect to differ from any of the text found in AG, in particular from the articles classified as sports.

The experiment proceeds as follows:
 - Gather only sports articles from AG
 - Conduct a 60/40 split over these sports articles to get the baseline corpus $B$ and the first target corpus $T_1$, respectively.
 - Randomly gather a sample from the Amazon Fine Food Reviews (we sample the same number of rows as $T_1$) to get the second target corpus, $T_2$
 - Train word embeddings over the three corpora, and compare resonance scores derived from the word embeddings between $B$ and $\{T_1, T_2\}$
 
Our hypothesis is that the resonance score will be higher for $T_1$ than for $T_2$.

## Preprocessing

We need to scale down, since we are running this experiment on a mere 2015 Mac rather than Google's Sycamore. Hence, after loading the full datasets, we do some analysis to determine an efficient way to cut out rows. See below:

In [1]:
import pandas as pd
import numpy as np
import random
import copy
import datetime

In [2]:
def loadData(f_name, col_names = None):
    if col_names is None:
        return pd.read_csv(f_name)
    else:
        return pd.read_csv(f_name, names = col_names, header = None)

ag   = loadData('./data/ag_train.csv', col_names=['Category', 'Description', 'Text'])
food = loadData('./data/food_reviews.csv', )
print('AG DATA  : keys: ', list(ag.keys()), ' length: ', len(ag), '\n')
print('FOOD DATA: keys: ', list(food.keys()), ' length: ', len(food))

AG DATA  : keys:  ['Category', 'Description', 'Text']  length:  120000 

FOOD DATA: keys:  ['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text']  length:  568454


#### Getting AG data in shape

In [3]:
# Sports category is 2 for AG so we will extract only those datapoints

ag_sports  = copy.deepcopy(ag.loc[ag['Category'] == 2])
ag_sports.drop(['Category', 'Description'], axis=1, inplace=True)

indices    = list(np.arange(0, len(ag_sports)))
train_ind  = random.sample(indices, round(0.6*len(indices)))
target_ind = list(set(indices) - set(train_ind))

assert(len(train_ind)+len(target_ind) == len(indices))

baseline_1 = copy.deepcopy(ag_sports.iloc[train_ind])
t1_1       = copy.deepcopy(ag_sports.iloc[target_ind])
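
The split above draws from the global `random` state, so the rows that land in $B$ and $T_1$ change from run to run. A minimal sketch of a reproducible 60/40 index split, assuming a fixed seed is acceptable for this experiment (the function name `splitIndices` and the seed value are illustrative and not part of the original notebook):

```python
import random

def splitIndices(n_rows, train_frac=0.6, seed=42):
    # A dedicated Random instance keeps the split independent of global state.
    rng = random.Random(seed)
    indices = list(range(n_rows))
    train_ind = sorted(rng.sample(indices, round(train_frac * n_rows)))
    train_set = set(train_ind)
    # Every index not drawn for the baseline goes to the target corpus.
    target_ind = [i for i in indices if i not in train_set]
    return train_ind, target_ind

train_ind, target_ind = splitIndices(10)
print(len(train_ind), len(target_ind))  # 6 4
```

The two returned lists could then be passed to `ag_sports.iloc[...]` in the same way as `train_ind` and `target_ind` above.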

We noticed that many of these articles have the news source and/or location as a prefix to the article text, such as:
- AP - Darin Erstad doubled in the go-ahead run in the eighth inning, lifting the Anaheim Angels to a 3-2 victory over the Detroit Tigers on Sunday. The win pulled Anaheim within a percentage point of Boston and Texas in the AL wild-card race.
- MILWAUKEE (Sports Network) - U.S. Ryder Cup captain Hal  Sutton finalized his team on Monday when he announced the  selections of Jay Haas and Stewart Cink as his captain's picks.
- ATHENS (Reuters) - At the beach volleyball, the 2004  Olympics is a sell-out, foot-stomping success.
- HAVEN, Wis. -- Perched high on the bluffs overlooking Lake Michigan, Whistling Straits is a massive, windswept landscape, as large a golf course as \$40 million can buy. It is complete with sand dunes that could double as ski slopes and deep bunkers that should require elevators.

However, others do not. We want to remove these prefixes because they can leak information to our model indicating that $B$ and $T_1$ were pulled from the same set, since the Amazon data does not come with such prefixes.

We clean this by splitting the string on '--' if it is present, and otherwise on '-', deleting the first element of the resulting list (the news source/location prefix), and rejoining the rest to recover the actual article. This is not the most efficient or elegant means of cleaning: a '-' can appear in an article that has no prefix, in which case the start of the article is chopped off. We accept this as a small loss compared to the number of rows that get cleaned. To drop rows that this method badly mangles, we keep only rows with more than 5 words.

In [4]:
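def cleanAG(df):
    # Assign through the column: writing to the rows yielded by iterrows()
    # is not guaranteed to modify the underlying DataFrame.
    df['Text'] = df['Text'].apply(stripSource)
    return df

def stripSource(text):
    # Prefer the '--' separator when present, otherwise fall back to '-'.
    if '--' in text:
        return removePrefix(text, '--')
    elif '-' in text:
        return removePrefix(text, '-')
    return text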

def removePrefix(text, char):
    temp = text.split(char)
    del temp[0]
    return (char).join(temp)

#def removeDoubleDash(pd_series):
#    return pd_series.str.split('--', expand = True)[1]

baseline_2 = cleanAG(baseline_1)
t1_2       = cleanAG(t1_1)
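
The hyphen caveat noted above could be narrowed by only stripping a prefix that sits at the very start of the text and is followed by a dash surrounded by whitespace. The following is a sketch of that alternative, not the method used in this notebook; the name `stripLeadingSource`, the pattern, and the 40-character cap on prefix length are all assumptions chosen for illustration:

```python
import re

# Hypothetical pattern: a short, hyphen-free run at the start of the string,
# followed by whitespace, one or two dashes, and more whitespace.
PREFIX_RE = re.compile(r'^\s*[^-]{1,40}?\s-{1,2}\s+')

def stripLeadingSource(text):
    # Remove at most one leading source/location prefix; leave other text alone.
    return PREFIX_RE.sub('', text, count=1)

print(stripLeadingSource('HAVEN, Wis. -- Perched high on the bluffs.'))
print(stripLeadingSource('The go-ahead run scored late - finally.'))
```

Because the prefix may not contain a hyphen, a sentence whose first dash is inside a word such as "go-ahead" is left untouched, whereas the split-based approach would cut it at that point.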

In [5]:
baseline_3 = copy.deepcopy(baseline_2[baseline_2['Text'].str.split().str.len() > 5])
print('Num Rows in B : ', len(baseline_3))
t1_3       = copy.deepcopy(t1_2[t1_2['Text'].str.split().str.len() > 5])
print('Num Rows in T1: ', len(t1_3))

Num Rows in B :  17155
Num Rows in T1:  11469


#### Getting Amazon Review data in shape

This was simple because the text is pretty clean already - we just draw a random sample of the same size as $T_1$ from the ~570k rows available, and then keep only the text for our $T_2$ corpus.

In [6]:
food_indices = np.random.randint(0, high = len(food), size = len(t1_3))
t2_1 = food.iloc[food_indices]
t2_2 = pd.DataFrame(t2_1['Text'])
print('Num Rows in T2: ', len(t2_2))

Num Rows in T2:  11469
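
One detail worth noting: `np.random.randint` draws with replacement, so the sample above may contain the same review more than once. A minimal sketch of drawing row positions without replacement, using only the standard library (the helper name `sampleRowPositions` and the seed are illustrative assumptions):

```python
import random

def sampleRowPositions(n_available, n_wanted, seed=0):
    # random.sample never returns the same position twice.
    rng = random.Random(seed)
    return rng.sample(range(n_available), n_wanted)

# Sizes taken from the outputs above: 568454 reviews, 11469 rows in T1.
positions = sampleRowPositions(568454, 11469)
print(len(positions), len(set(positions)))
```

The resulting list could be passed to `food.iloc[positions]`. pandas also offers `DataFrame.sample(n=..., random_state=...)`, which samples without replacement by default.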


## Let's Get To Some Word Embeddings

Even though it is probably best practice to clean up our text a little more (perhaps lemmatizing, removing stop words, lowercasing all tokens, stripping punctuation, etc.), I read in *Natural Language Processing in Action* (Lane et al.) that cleaning further may actually strip our rich dataset of valuable information. For example, when I told this to a colleague, he pondered for a minute and then recalled a paper he had read that was able to identify racist terminology by the way the word 'the' was used. In racist text, 'the' is typically used just before a demeaning word or phrase. Perhaps this was a key indicator for that model, even though 'the' is listed as a stopword almost everywhere.

So, without further ado, let's get to the meat and potatoes of this notebook and start training some word embeddings. The book mentioned above says 'The gensim word2vec model expects a list of sentences, where each sentence is broken up into tokens'. Hence, after loading in Word2Vec, I transform the data here.

In [7]:
from gensim.models.word2vec import Word2Vec

In [8]:
baseline = [value.split() for index, value in baseline_3['Text'].items()]
t1       = [value.split() for index, value in t1_3['Text'].items()]
t2       = [value.split() for index, value in t2_2['Text'].items()]

Now, we will set some parameters for training our Word2Vec models, but first I want to address the importance of **setting a seed so that the models for each corpus initialize the same way**. This is key: initialization of the weights in a neural net is a random process, so without a fixed seed we would have no consistency between models. However, after doing some research on the almighty StackExchange, I found that Word2Vec does this anyway (its `seed` parameter defaults to 1)...cheeky geniuses. To be clear, if I train within the same Jupyter Notebook kernel, the initialization is consistent; if I restart the kernel, I get different results. For our purposes this works just fine, because I will not be restarting the kernel between cells. Let's verify here:

In [9]:
test_sentences = baseline[:10]

toymodel1 = Word2Vec(
    test_sentences,
    workers   = 1,
    size      = 32,
    min_count = 1,
    window    = 4,
    sample    = 0.001)
toymodel2 = Word2Vec(
    test_sentences,
    workers   = 1,
    size      = 32,
    min_count = 1,
    window    = 4,
    sample    = 0.001)

toymodel1_dict = {key:toymodel1.wv[key] for idx, key in enumerate(toymodel1.wv.vocab)}
toymodel2_dict = {key:toymodel2.wv[key] for idx, key in enumerate(toymodel2.wv.vocab)}

In [10]:
print(toymodel1_dict['the'] == toymodel2_dict['the'])

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True]


In [11]:
k1 = list(toymodel1_dict.keys())
k2 = list(toymodel2_dict.keys())
assert(k1==k2)
for word in k1:
    if not np.array_equal(toymodel1_dict[word], toymodel2_dict[word]):
        print('Embeddings are Different For: ', word)
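
The kernel-restart caveat is consistent with a note in the gensim documentation: in Python 3, reproducibility between interpreter launches also requires the `PYTHONHASHSEED` environment variable, because string hashing is randomized per process. A minimal sketch, using only the standard library, of how string hashes become stable across fresh interpreters once that variable is fixed (the helper name is illustrative, and this demonstrates the hashing behavior only, not gensim itself):

```python
import os
import subprocess
import sys

def hashInFreshInterpreter(word, hashseed):
    # Launch a new Python process with a fixed PYTHONHASHSEED and report hash(word).
    env = dict(os.environ, PYTHONHASHSEED=str(hashseed))
    result = subprocess.run(
        [sys.executable, '-c', 'print(hash({!r}))'.format(word)],
        env=env, capture_output=True, text=True, check=True)
    return int(result.stdout)

# Two separate interpreter launches agree when the hash seed is pinned.
print(hashInFreshInterpreter('the', 0) == hashInFreshInterpreter('the', 0))  # True
```

For a notebook, the variable would have to be set before the kernel process starts, since it is read at interpreter startup.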

Alright, enough with the build-up...let's train word embeddings on each corpus.
- The vector length is set to 128; it's relatively small because the corpora are not huge.
- The number of workers is set to 1, as the documentation states that this eliminates 'ordering jitter' from OS thread scheduling (so that, given the same initial NN weights, training is reproducible).
- The window size is small (2) as well, since our sentences are not very long.
- The subsampling rate is the 'threshold for configuring which higher-frequency words are randomly downsampled'; we set it to 1e-5, the upper end of the range the gensim documentation describes as useful.
- As stated in *Natural Language Processing In Action*, the 'skip-gram approach works well with small corpora and rare terms', but I still choose CBOW (`sg = 0`) because it 'shows higher accuracies for frequent words and is much faster to train'.
- To further ensure quality embeddings, we increase the number of epochs from the default of 5 to 20.


In [12]:
def trainWordVecs(params, corpus):
    model = Word2Vec(
            corpus,
            size      = params['num_features'],
            min_count = params['min_word_count'],
            workers   = params['num_workers'],
            window    = params['window_size'],
            sample    = params['sample_rate'],
            sg        = params['skipgram'],
            iter      = params['epochs'])
    return {key:model.wv[key] for idx, key in enumerate(model.wv.vocab)}

params = {
'num_features': 128,
'min_word_count': 3,
'num_workers': 1,
'window_size': 2,
'sample_rate': 0.00001,
'skipgram': 0,
'epochs': 20}

baseline_dict = trainWordVecs(params, baseline)
t1_dict       = trainWordVecs(params, t1)
t2_dict       = trainWordVecs(params, t2)
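
The keyword names `size` and `iter`, and the `wv.vocab` attribute, belong to the gensim 3.x API this notebook was written against. In gensim 4.0 and later they are `vector_size`, `epochs`, and `wv.key_to_index`. A small sketch of translating the `params` dictionary above into gensim 4 style keyword arguments (the helper name `toGensim4Kwargs` is illustrative and not part of the original notebook):

```python
def toGensim4Kwargs(params):
    # Map this notebook's parameter names onto gensim >= 4.0 keyword names.
    return {
        'vector_size': params['num_features'],   # was `size` in gensim 3.x
        'min_count':   params['min_word_count'],
        'workers':     params['num_workers'],
        'window':      params['window_size'],
        'sample':      params['sample_rate'],
        'sg':          params['skipgram'],
        'epochs':      params['epochs'],         # was `iter` in gensim 3.x
    }

params = {
    'num_features': 128,
    'min_word_count': 3,
    'num_workers': 1,
    'window_size': 2,
    'sample_rate': 0.00001,
    'skipgram': 0,
    'epochs': 20}

print(toGensim4Kwargs(params))
```

With gensim 4 installed, the model could then be built as `Word2Vec(corpus, **toGensim4Kwargs(params))`, and the vocabulary iterated via `model.wv.key_to_index` instead of `model.wv.vocab`.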

Nice, now we find the words that appear in all three vocabularies (words seen fewer than 3 times in a corpus were already dropped by `min_word_count`):

In [13]:
baseline_words = set(baseline_dict.keys())
t1_words       = set(t1_dict.keys())
t2_words       = set(t2_dict.keys())

print('Computed {} word embeddings in baseline_mod.'.format(len(baseline_words)))
print('Computed {} word embeddings in t1_mod.'.format(len(t1_words)))
print('Computed {} word embeddings in t2_mod.'.format(len(t2_words)))

common_words = list(baseline_words & t1_words & t2_words)
print('Found {} common words.'.format(len(common_words)))

Computed 12937 word embeddings in baseline_mod.
Computed 9996 word embeddings in t1_mod.
Computed 16471 word embeddings in t2_mod.
Found 3304 common words.


## Results

Finally, after training, we need to compute the distances between the embeddings of the words found in the list `common_words`. We do so for all of the methods mentioned in the paper, and show the final results at the very bottom.
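
For reference, the resonance functions as implemented in `computeResonance` below can be written out as follows. This is a transcription of the code in this notebook rather than a restatement of the paper, and the symbols are introduced here for illustration: $W$ is the set of common words, $b_w$ and $t_w$ are the baseline and target embeddings of word $w$, $d$ is the Euclidean or Manhattan distance, $s$ is the cosine similarity, and $\lambda$ is the scaling constant passed in as `lambda`.

$$
\begin{aligned}
R_{\text{dist}} &= 100 - 100\,\tanh\!\left(\frac{1}{\lambda}\sum_{w \in W} d(b_w, t_w)\right) \\
R_{\text{cos-neg}} &= \frac{100}{1 + \exp\!\left(-\frac{1}{\lambda}\sum_{w \in W} s(b_w, t_w)\right)} \\
R_{\text{cos-pos}} &= 100\,\tanh\!\left(\frac{1}{\lambda}\sum_{w \in W} \max\bigl(0,\ s(b_w, t_w)\bigr)\right)
\end{aligned}
$$

In the first form a larger summed distance lowers the score, while in the two cosine forms a larger summed similarity raises it, so in all cases a higher $R$ means the target embeddings sit closer to the baseline.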

In [14]:
def getResults(input_dict):
    metric = input_dict['distance_metric']
    lamb = input_dict['lambda']
    common_words = input_dict['common_words']
    baseline_dict = input_dict['baseline_dict']
    t1_dict = input_dict['t1_dict']
    t2_dict = input_dict['t2_dict']
    t1_distances = []
    t2_distances = []
    for word in input_dict['common_words']:
        base_vec = getWordVec(baseline_dict, word)
        t1_vec   = getWordVec(t1_dict, word)
        t2_vec   = getWordVec(t2_dict, word)
        dist1    = getEmbeddingDistance(base_vec, t1_vec, metric)
        dist2    = getEmbeddingDistance(base_vec, t2_vec, metric)
        t1_distances.append(dist1)
        t2_distances.append(dist2)
    t1_resonance = computeResonance(np.array(t1_distances), lamb, metric)
    t2_resonance = computeResonance(np.array(t2_distances), lamb, metric)
    if input_dict['log_res']:
        saveResults(common_words, t1_distances, t2_distances)
    return t1_resonance, t2_resonance

def getWordVec(word_dict, word):
    return np.array(word_dict[word])

def getEmbeddingDistance(baseline_vec, target_vec, distance_metric):
    # Note: the two cosine options return a similarity (higher = closer),
    # whereas euclidean and manhattan return a distance (lower = closer).
    if distance_metric == 'euclidean':
        return np.linalg.norm((baseline_vec - target_vec))
    elif distance_metric == 'manhattan':
        return np.sum(abs(baseline_vec - target_vec))
    elif distance_metric == 'cosine_sim_neg':
        return np.dot(baseline_vec, target_vec)/(np.linalg.norm(baseline_vec)*np.linalg.norm(target_vec))
    elif distance_metric == 'cosine_sim_pos':
        return max(0, np.dot(baseline_vec, target_vec)/(np.linalg.norm(baseline_vec)*np.linalg.norm(target_vec)))
    # Fail loudly instead of silently returning None for a misspelled metric.
    raise ValueError('Unknown distance metric: {}'.format(distance_metric))

def computeResonance(distance_array, lamb, metric):
    # lamb rescales the summed distances/similarities before squashing to [0, 100].
    if metric == 'euclidean' or metric == 'manhattan':
        return 100 - 100*(np.tanh((1/lamb)*np.sum(distance_array)))
    elif metric == 'cosine_sim_neg':
        return 100/(1 + np.exp((-1/lamb)*np.sum(distance_array)))
    elif metric == 'cosine_sim_pos':
        return 100*(np.tanh((1/lamb)*np.sum(distance_array)))
    raise ValueError('Unknown distance metric: {}'.format(metric))
    
def saveResults(common_words, t1_distances, t2_distances):
    df = pd.DataFrame(list(zip(common_words, t1_distances, t2_distances)),
                      columns =['word', 't1_distance', 't2_distance'])
    df['t1>t2'] = np.where(df['t1_distance'] >= df['t2_distance'], 1, 0)
    df.to_csv('results/{}.csv'.format(datetime.datetime.now().strftime('%Y_%m_%d_%f')))
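
To make the direction of each score concrete, here is a toy example on 2-D vectors. It is a standalone, standard-library restatement of the Euclidean and clipped-cosine cases from `getEmbeddingDistance` and `computeResonance`, not the notebook's own functions; the vectors, the function names, and the value `lamb = 10` are made up for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosineSim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def resonanceFromDistances(distances, lamb):
    # Larger summed distance -> tanh closer to 1 -> lower resonance.
    return 100 - 100 * math.tanh(sum(distances) / lamb)

def resonanceFromSimilarities(sims, lamb):
    # Larger summed (clipped) similarity -> tanh closer to 1 -> higher resonance.
    return 100 * math.tanh(sum(max(0, s) for s in sims) / lamb)

base = (3.0, 4.0)
near = (4.0, 3.0)    # points in almost the same direction as base
far  = (-4.0, 3.0)   # orthogonal to base

lamb = 10
print(resonanceFromDistances([euclidean(base, near)], lamb),
      resonanceFromDistances([euclidean(base, far)], lamb))
print(resonanceFromSimilarities([cosineSim(base, near)], lamb),
      resonanceFromSimilarities([cosineSim(base, far)], lamb))
```

In both printed pairs the first number (the nearby vector) is the larger one, which is the behavior the hypothesis relies on.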

In [15]:
euclid = {'distance_metric': 'euclidean',
          'lambda':10000,
          'common_words': common_words,
          'baseline_dict': baseline_dict,
          't1_dict': t1_dict,
          't2_dict': t2_dict,
          'log_res': False}
manh = {'distance_metric': 'manhattan',
          'lambda':100000,
          'common_words': common_words,
          'baseline_dict': baseline_dict,
          't1_dict': t1_dict,
          't2_dict': t2_dict,
          'log_res': False}
cos_neg = {'distance_metric': 'cosine_sim_neg',
          'lambda':2500,
          'common_words': common_words,
          'baseline_dict': baseline_dict,
          't1_dict': t1_dict,
          't2_dict': t2_dict,
          'log_res': False}
cos_pos = {'distance_metric': 'cosine_sim_pos',
          'lambda':2700,
          'common_words': common_words,
          'baseline_dict': baseline_dict,
          't1_dict': t1_dict,
          't2_dict': t2_dict,
          'log_res': False}

t1_euclid, t2_euclid = getResults(euclid)
t1_manh, t2_manh = getResults(manh)
t1_cn, t2_cn = getResults(cos_neg)
t1_cp, t2_cp = getResults(cos_pos)

In [16]:
print('Euclidean : T1 Resonance: ', round(t1_euclid,2), ' T2 Resonance: ', round(t2_euclid,2))
print('Manhattan : T1 Resonance: ', round(t1_manh,2), ' T2 Resonance: ', round(t2_manh,2))
print('Cosine Neg: T1 Resonance: ', round(t1_cn,2), ' T2 Resonance: ', round(t2_cn,2))
print('Cosine Pos: T1 Resonance: ', round(t1_cp,2), ' T2 Resonance: ', round(t2_cp,2))

Euclidean : T1 Resonance:  70.54  T2 Resonance:  54.7
Manhattan : T1 Resonance:  72.84  T2 Resonance:  58.38
Cosine Neg: T1 Resonance:  73.73  T2 Resonance:  64.48
Cosine Pos: T1 Resonance:  74.23  T2 Resonance:  50.22
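
Because $\lambda$ is chosen separately for each metric, the four rows are not on a shared scale. One way to read the Euclidean row more directly is to invert the resonance formula used in `computeResonance` and recover the mean per-word distance. The sketch below assumes the printed scores (which are rounded, so the result is approximate), the value `lambda = 10000` from the `euclid` dictionary, and the 3304 common words reported earlier; the helper name is illustrative:

```python
import math

def meanDistanceFromResonance(resonance, lamb, n_words):
    # Invert R = 100 - 100*tanh(D/lamb) for the summed distance D,
    # then average over the number of common words.
    total = lamb * math.atanh(1 - resonance / 100)
    return total / n_words

t1_mean = meanDistanceFromResonance(70.54, 10000, 3304)
t2_mean = meanDistanceFromResonance(54.7, 10000, 3304)
print(t1_mean, t2_mean)
```

A smaller mean distance for $T_1$ than for $T_2$ corresponds to the higher $T_1$ resonance in the table above.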
