# Week 4 - Word Embeddings Supplemental

This notebook contains two additional word embedding possibilities.

For this notebook we will be using the following packages:

In [60]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
#Make sure you update it before starting this notebook
import lucem_illud #pip install -U git+git://github.com/Computational-Content-Analysis-2018/lucem_illud.git

#All these packages need to be installed from pip
import gensim#For word2vec, etc
import requests #For downloading our datasets
import nltk #For stop words and stemmers
import numpy as np #For arrays
import pandas #Gives us DataFrames
import matplotlib.pyplot as plt #For graphics
import seaborn #Makes the graphics look nicer
import sklearn.metrics.pairwise #For cosine similarity
import sklearn.manifold #For T-SNE
import sklearn.decomposition #For PCA
import copy

#gensim uses a couple of deprecated features
#we can't do anything about them so let's ignore them
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#This 'magic' command makes the plots work better
#in the notebook, don't use it outside of a notebook.
#Also you can ignore the warning
%matplotlib inline

import os #For looking through files
import os.path #For managing file paths

# Data

This notebook relies on a few data files that are not in the git repo due to their size. Please download [this](https://github.com/Computational-Content-Analysis-2018/Upcoming/raw/master/data/supplement.zip) (472MB) file and unzip it in the data directory.

# The Score Function

The score function is a simple calculation developed by [Matt Taddy](https://arxiv.org/pdf/1504.07295.pdf) to calculate the likelihood that a given text would have been generated by a word-embedding model. It sums, over each pair of nearby words in the text, the log-probability the model assigns to one word given the other, a quantity derived from inner products of the model's learned vectors.
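To make the idea concrete, here is a minimal toy sketch of a pairwise log-likelihood score. It is not gensim's implementation: gensim's `score` relies on hierarchical softmax, whereas this sketch uses a full softmax over a four-word vocabulary for readability. The vectors, the `VECTORS` table, and the helper names are all made up for illustration, and the same vectors are assumed to serve as both input and output embeddings.

```python
import math

# Hypothetical 2-d vectors, used as both input and output embeddings (an assumption)
VECTORS = {
    "python": (2.0, 0.0),
    "programming": (1.5, 0.5),
    "nurse": (0.0, 2.0),
    "patient": (0.5, 1.5),
}

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def log_prob(context, center):
    """log p(context | center) under a full softmax over the toy vocabulary."""
    logits = {w: dot(VECTORS[center], vec) for w, vec in VECTORS.items()}
    log_norm = math.log(sum(math.exp(x) for x in logits.values()))
    return logits[context] - log_norm

def sentence_score(sentence, window=2):
    """Sum log p(neighbor | word) over every ordered pair within the window."""
    total = 0.0
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                total += log_prob(sentence[j], center)
    return total

coherent = sentence_score(["python", "programming"])
mixed = sentence_score(["python", "nurse"])
print(coherent, mixed)
```

Both scores are negative because they are sums of log-probabilities. The pair whose vectors point in similar directions receives the higher (less negative) score.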

Here, we explore this using a model trained with millions of resumes from the CareerBuilder website (we can't share the private resumes...but we can share a model built with them :-):

In [4]:
resume_model  = gensim.models.word2vec.Word2Vec.load('../data/resumeAll.model')

We can examine the vocabulary of this model by building a word-index map:

In [5]:
vocab = resume_model.wv.index2word #list of words, ordered by their index in the model
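The list above maps an index to a word. If you also need to go from a word to its index, a dictionary is a convenient companion. The sketch below uses a short made-up list, `index_to_word`, as a stand-in for the model's vocabulary:

```python
# Hypothetical stand-in for the model's index-to-word list
index_to_word = ["the", "and", "management", "python"]

# Invert the list so that each word points back to its index
word_to_index = {word: i for i, word in enumerate(index_to_word)}

print(word_to_index["management"])
```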

Let's just load the sample and take a look at it. The sentences in each job description are already tokenized and normalized.

In [6]:
import ast #For safely parsing the stringified lists

sampleDF = pandas.read_csv('../data/SampleJobAds.csv', index_col = False)
#We need to convert the last couple columns from strings to lists
#ast.literal_eval only parses Python literals, unlike eval it cannot run arbitrary code
sampleDF['tokenized_sents'] = sampleDF['tokenized_sents'].apply(ast.literal_eval)
sampleDF['normalized_sents'] = sampleDF['normalized_sents'].apply(ast.literal_eval)
sampleDF

Unnamed: 0.1,Unnamed: 0,hiringOrganization_organizationName,jobDescription,jobLocation_address_region,jobLocation_geo_latitude,jobLocation_geo_longitude,qualifications,responsibilities,tokenized_sents,normalized_sents
0,158844,"Golfsmith International, Inc.","""Sales Associate Tracking Code 220425-971 Job ...",California,33.91918,-118.41647,,"""Ensure each Customer receives exceptional ser...","[[``, Sales, Associate, Tracking, Code, 220425...","[[sales, associate, tracking, code, job, descr..."
1,257645,Intel,For PHY system engineering team within the Wir...,,,,,,"[[For, PHY, system, engineering, team, within,...","[[for, phy, system, engineering, team, within,..."
2,107875,Florida Hospital,*RN Medical Oncology PCU Orlando - Nights* Flo...,Florida,28.53834,-81.37924,,,"[[*RN, Medical, Oncology, PCU, Orlando, -, Nig...","[[medical, oncology, pcu, orlando, florida, ho..."
3,202394,Hitachi Data Systems,Title: Specialist Sales Account Representative...,,,,,,"[[Title, :, Specialist, Sales, Account, Repres...","[[title, specialist, sales, account, represent..."
4,109675,Footprint Retail Services,**Footprint Retail Services** **Job Descriptio...,,,,,A Merchandiser must complete all assigned merc...,"[[**Footprint, Retail, Services**, **Job, Desc...","[[retail, job, title, retail, merchandiser, re..."
5,215973,Home Depot,Position Purpose: Provide outstanding service ...,Indiana,41.13060,-85.12886,,Provide outstanding service to ensure efficien...,"[[Position, Purpose, :, Provide, outstanding, ...","[[position, purpose, provide, outstanding, ser..."
6,207524,Home Depot,The Asset Protection Specialist is primarily r...,New Jersey,40.21455,-74.61932,Must be eighteen years of age or older. Must p...,,"[[The, Asset, Protection, Specialist, is, prim...","[[the, asset, protection, specialist, is, prim..."
7,64426,East West Bank,# Job Description East West Bank is one of the...,California,34.06862,-118.02757,,We are currently seeking a Customer Service Ce...,"[[#, Job, Description, East, West, Bank, is, o...","[[job, description, east, west, bank, is, one,..."
8,245192,IBM,Job Description IBM is seeking to hire a Senio...,,,,,,"[[Job, Description, IBM, is, seeking, to, hire...","[[job, description, ibm, is, seeking, to, hire..."
9,202429,Hitachi Data Systems,Title: Field Solutions Engineer Location: New ...,,,,,Job Functions;Specific duties in this role wil...,"[[Title, :, Field, Solutions, Engineer, Locati...","[[title, field, solutions, engineer, location,..."


Let's define a function to calculate the likelihood of each job description. The idea is borrowed from [Matt Taddy](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/deepir.ipynb), who shows how a document can be scored by the likelihood that a trained word2vec model assigns to its sentences. In other words, this analysis will show which job ads are most likely to find an appropriate pool of workers in the resume bank that generated our word embedding.

In [7]:
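def adprob(ad, model):
    #model.score returns one log-likelihood per sentence in the ad
    sen_scores = model.score(ad, len(ad))
    #Average over sentences to get a single score for the ad
    ad_score = sen_scores.mean()
    return ad_score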

Let's apply this function to every job description.

In [8]:
sampleDF['likelihood'] = sampleDF['normalized_sents'].apply(lambda x: adprob(x, resume_model))

Let's take a look at the 5 job descriptions with the highest likelihood.

In [9]:
for ad in sampleDF.sort_values(by = 'likelihood', ascending = False)['jobDescription'][:5]:
    print (ad + '\n\n')

Project Engineering including below jobs: 1. Hardware designing of DCS 2. Software configurations, programming, testing of DCS/PLC 3. Testing and FAT 4. Installation and commissioning. 5. Material ordering, approvals of datasheets. 6. HSE compliance as per HSE directives of HON. BE / B. Tech - Instrumentation / Control / Electronics. **Job:** **Engineering* **Title:** *Systems Engineer* **Location:** *IND-MH-Pune* **Requisition ID:** *00302235*


Like talking on the phone? Enjoy giving great customer service? Use those skills while working flexible,part time hours.


*# Positions:* 2 *Location:* US - UT - Orem *Category:* Engineering


Title: Respiratory Therapy, Intern Location: XX-XX-XX


Title: Position Opening at Illinois Wesleyan University Location: US-IL-Bloomington




Let's take a look at the 5 job descriptions that are least likely to be matched by the resumes.

In [10]:
for ad in sampleDF.sort_values(by = 'likelihood')['jobDescription'][:5]:
    print (ad + '\n\n')

Title: Contract Svc Admin Location: United States\-Michigan\-Caledonia Other Locations: JOB SUMMARY: ESSENTIAL DUTIES AND RESPONSIBILITIES: _ Other duties may be assigned:_ Prepare GPO rebate data for upload into STARR system oEnsure accurate data loads oProcess data into required reports oEnsure that GPOs are paid accurately and timely Prepare Sales Tracing data for upload into STARR system oEnsure accurate data loads oProcess data into required reports Prepare Rebate data for upload into STARR system oEnsure accurate data loads oProcess data into required reports oEnsure that distributors are accurately taking rebates against open invoices Manage work flow throughout the sales contracting process Ensure compliance with contract policies and performance requirements Assist in the preparation of charts and sales reports and analytics for customers and Management oQuarterly Reports oSales Tracing Answers customers and internal sale/customer service staff questions regarding pricing Mana

We can do the same for phrases corresponding to job skills.

In [11]:
adprob([["python", "programming"]], resume_model)

-14.355247

In [12]:
adprob([["basic", "programming"]], resume_model)

-8.93109

Basic programming appears to be more likely in this pool of resumes than python programming: its score is higher (less negative).
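Both phrases contain two tokens, so their scores are directly comparable. Sentences of different lengths are harder to compare, because every additional word pair adds another non-positive log-probability to the sum. One possible adjustment, which is not part of the original analysis, is to divide each sentence score by its token count. The helper `per_token_score` below is a hypothetical illustration that reuses the two scores reported above:

```python
def per_token_score(tokens, score):
    """Divide a sentence's log-likelihood by its number of tokens."""
    if not tokens:
        raise ValueError("cannot normalize the score of an empty sentence")
    return score / len(tokens)

# Scores reported above for the two example phrases
python_score = per_token_score(["python", "programming"], -14.355247)
basic_score = per_token_score(["basic", "programming"], -8.93109)
print(python_score, basic_score)
```

With equal lengths the ordering is unchanged. The normalization only matters when the sentences being compared differ in length.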

We can also do some simple statistics. Unfortunately, we don't have a large sample here. Nevertheless, let's first look at the mean likelihood score of each hiring organization. Some organizations will do well to hire on CareerBuilder...while others will not.

In [15]:
sampleDF.groupby("hiringOrganization_organizationName")[['likelihood']].mean().sort_values('likelihood', ascending = False)

Unnamed: 0_level_0,likelihood
hiringOrganization_organizationName,Unnamed: 1_level_1
Honeywell,-25.974749
Legrand North America,-40.500500
Franciscan St. Eilzabeth Health,-43.167206
Illinois Wesleyan University,-52.681591
Gold's Gym,-75.230896
Jewel-Osco,-80.965462
Ingersoll Rand,-81.494949
Kaiser Permanente,-85.523113
Hallmark Health,-86.037094
Dr Pepper Snapple Group,-99.421326


We can also look at the mean likelihood of each state.

In [16]:
sampleDF.groupby("jobLocation_address_region")[['likelihood']].mean().sort_values('likelihood', ascending = False)

Unnamed: 0_level_0,likelihood
jobLocation_address_region,Unnamed: 1_level_1
Utah,-40.5005
Montana,-63.730721
Connecticut,-67.623466
Kansas,-85.259682
Tennessee,-93.188822
Illinois,-93.989101
New Hampshire,-94.109772
Georgia,-102.27182
South Carolina,-103.527328
North Carolina,-108.747134


For a more serious study, you would want to increase the sample size.

## <span style="color:red">*Exercise 3a*</span>

<span style="color:red">**Do only 3a or 3b.** Construct cells immediately below this that calculate the scores for a small sample of documents from outside your corpus to identify which are *closest* to your corpus. Then calculate the scores for a few phrases or sentences to identify the ones most likely to have appeared in your corpus. Interrogate patterns associated with these document/phrase scores (e.g., which companies produced job ads most or least likely to find jobseekers in the resume corpus?). What do these patterns suggest about the boundaries of your corpus?</span>

I'm trying to see how media portrayals of meritocracy in Singapore have changed since 2011. I'm particularly interested in changes around 2017, when there was a controversial breach in the state's meritocracy doctrine: it reserved a Presidential election for Malay candidates. Here, I want to compare my main corpus (articles from The Straits Times and Business Times, two of Singapore's newspapers of record) with articles from The Online Citizen, a popular new media site.

In [89]:
# helper functions from above
def adprob(ad, model):
    sen_scores = model.score(ad, len(ad))
    ad_score = sen_scores.mean()
    return ad_score

In [90]:
# train the w2v model with the main corpus
stbtDF = pandas.read_csv("meritSTBT.csv")

stbtDF['tokenized_sents'] = stbtDF['text'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
stbtDF['normalized_sents'] = stbtDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, stopwordLst = lucem_illud.stop_words_basic, stemmer = None) for s in x])

In [91]:
#score() is only implemented for hierarchical softmax, so train with hs=1 and negative=0
stbtW2V = gensim.models.word2vec.Word2Vec(stbtDF['normalized_sents'].sum(), hs=1, negative=0)

In [95]:
# import the secondary sample
tocXLX = pandas.ExcelFile('meritTOC.xlsx')
tocDF = tocXLX.parse('Sheet1')
tocDF['tokenized_sents'] = tocDF['text'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
tocDF['normalized_sents'] = tocDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, stopwordLst = lucem_illud.stop_words_basic) for s in x])

In [96]:
#calculate likelihood of sample appearing in main corpus
tocDF['likelihood'] = tocDF['normalized_sents'].apply(lambda x: adprob(x, stbtW2V))

In [97]:
#print the 5 articles that are most likely to appear in main corpus
for x in tocDF.sort_values(by = 'likelihood', ascending = False)['title'][:5]:
    print (x + '\n\n')

Be the change you wish to see


“Urgency for us to find our identity and voice again”


Meritocracy’s demerits


No by-election even if minority MP quits GRC? Shouldn’t the constitution trump the PEA?


How did we become so racist?




In [98]:
#print the 5 articles that are least likely to appear in main corpus
for x in tocDF.sort_values(by = 'likelihood')['title'][:5]:
    print (x + '\n\n')

Don’t milk 2019’s bicentennial anniversary for political capital


The PAP and the Malay Singaporean: Between rhetoric, reality and meritocracy


Cronyism and aristocracy undermines PAP’s claim of meritocracy: SingFirst


Fair and just society? The statistics say otherwise


Reflections on life as a Malay/Muslim in a Chinese pond




It's difficult to see any clear differences from the titles, so let's compare the text body of the most likely against the least likely.

In [99]:
for x in tocDF.sort_values(by = 'likelihood', ascending = False)['text'][:5]:
    print (x + '\n\n')

1. We were promised Swiss standard of living. Instead we got the most expensive city to live in.
2. We were promised equality, instead we thrive in elitism.
3. We were told that our nation is built on meritocracy, instead we are ranked 5th in Crony Capitalism Index.
4. We pledged to build a democratic society but instead we got an autocratic government.
5. We spent billions of taxpayers money to increase productivity, but instead these measures are counter productive when we flood our workforce with cheap foreign labour.
6. We planned for a long term sustainable economic growth, but instead history has shown that our economic growth model is nothing meaningful except to increase our population digits by granting new citizens.
7. It is our objective to reach the world cup by 2010, but instead we are 7 years late and our football association is in the doldrums.
8. Our government pledged no one will be left behind but today, many are slipping through the cracks with reportedly high number

In [100]:
for x in tocDF.sort_values(by = 'likelihood')['text'][:5]:
    print (x + '\n\n')

Singapore’s Bicentennial in 2019 is historic for the country and for all Singaporeans – young and old, native and new citizens, from all sides of the political spectrum.
2019 will mark a special year in Singapore’s history, as it was 200 years ago that the modern Singapore was founded by Sir Stamford Raffles and Major William Farquhar of the British East India Company.
The history books reveal that on 29 January 1819, a small British party led by Raffles and Farquhar landed in Singapore at the mouth of the Singapore River, and helped establish a colonial port city.
Their arrival sparked the establishment of a jewel in the east for the British East Company along with Melaka and Penang, which then turned Singapore from an almost uninhabited island and pirate haunt into a flourishing international port by the late 1800s and early 1900s.
200 years on after Singapore’s founding, it looks very likely that Singapore’s Bicentennial in 2019 could well be politically milked for whatever it’s wor

Unfortunately, I can't see any clear differences between the two, bar their respective lengths. Deeper and subtler investigation would be called for.

# Linguistic Change

Below is code that aligns the dimensions of multiple embeddings arrayed over time or some other dimension and allows identification of semantic change as the word vectors change their loadings for focal words. This code comes from the approach piloted at Stanford by William Hamilton, Daniel Jurafsky and Jure Leskovec [here](https://arxiv.org/pdf/1605.09096.pdf).

In [48]:
def calc_syn0norm(model):
    """since syn0norm is now deprecated"""
    return (model.wv.syn0 / np.sqrt((model.wv.syn0 ** 2).sum(-1))[..., np.newaxis]).astype(np.float32)

def smart_procrustes_align_gensim(base_embed, other_embed, words=None):
    """Procrustes align two gensim word2vec models (to allow for comparison between same word across models).
    Code ported from HistWords <https://github.com/williamleif/histwords> by William Hamilton <wleif@stanford.edu>.
    (With help from William. Thank you!)
    First, intersect the vocabularies (see `intersection_align_gensim` documentation).
    Then do the alignment on the other_embed model.
    Replace the other_embed model's syn0 and syn0norm numpy matrices with the aligned version.
    Return other_embed.
    If `words` is set, intersect the two models' vocabulary with the vocabulary in words (see `intersection_align_gensim` documentation).
    """
    base_embed = copy.copy(base_embed)
    other_embed = copy.copy(other_embed)
    # make sure vocabulary and indices are aligned
    in_base_embed, in_other_embed = intersection_align_gensim(base_embed, other_embed, words=words)

    # get the embedding matrices
    base_vecs = calc_syn0norm(in_base_embed)
    other_vecs = calc_syn0norm(in_other_embed)

    # just a matrix dot product with numpy
    m = other_vecs.T.dot(base_vecs) 
    # SVD method from numpy
    u, _, v = np.linalg.svd(m)
    # another matrix operation
    ortho = u.dot(v) 
    # Replace original array with modified one
    # i.e. multiplying the embedding matrix (syn0norm)by "ortho"
    other_embed.wv.syn0norm = other_embed.wv.syn0 = (calc_syn0norm(other_embed)).dot(ortho)
    return other_embed
    
def intersection_align_gensim(m1,m2, words=None):
    """
    Intersect two gensim word2vec models, m1 and m2.
    Only the shared vocabulary between them is kept.
    If 'words' is set (as list or set), then the vocabulary is intersected with this list as well.
    Indices are re-organized from 0..N in order of descending frequency (=sum of counts from both m1 and m2).
    These indices correspond to the new syn0 and syn0norm objects in both gensim models:
        -- so that Row 0 of m1.syn0 will be for the same word as Row 0 of m2.syn0
        -- you can find the index of any word on the .index2word list: model.index2word.index(word) => 2
    The .vocab dictionary is also updated for each model, preserving the count but updating the index.
    """

    # Get the vocab for each model
    vocab_m1 = set(m1.wv.vocab.keys())
    vocab_m2 = set(m2.wv.vocab.keys())

    # Find the common vocabulary
    common_vocab = vocab_m1&vocab_m2
    if words: common_vocab&=set(words)

    # If no alignment necessary because vocab is identical...
    if not vocab_m1-common_vocab and not vocab_m2-common_vocab:
        return (m1,m2)

    # Otherwise sort by frequency (summed for both)
    common_vocab = list(common_vocab)
    common_vocab.sort(key=lambda w: m1.wv.vocab[w].count + m2.wv.vocab[w].count,reverse=True)

    # Then for each model...
    for m in [m1,m2]:
        # Replace old syn0norm array with new one (with common vocab)
        indices = [m.wv.vocab[w].index for w in common_vocab]
        old_arr = calc_syn0norm(m)
        new_arr = np.array([old_arr[index] for index in indices])
        m.wv.syn0norm = m.wv.syn0 = new_arr

        # Replace old vocab dictionary with new one (with common vocab)
        # and old index2word with new one
        m.wv.index2word = common_vocab
        old_vocab = m.wv.vocab
        new_vocab = {}
        for new_index,word in enumerate(common_vocab):
            old_vocab_obj=old_vocab[word]
            new_vocab[word] = gensim.models.word2vec.Vocab(index=new_index, count=old_vocab_obj.count)
        m.wv.vocab = new_vocab

    return (m1,m2)
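To see what the alignment step does in isolation, here is a minimal sketch of orthogonal Procrustes on toy matrices, using the same `svd` recipe as `smart_procrustes_align_gensim`. The data are random and purely illustrative: `other_vecs` is built as a rotated copy of `base_vecs`, so a successful alignment should map it back onto `base_vecs`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "base" embedding: 50 words in 3 dimensions (made-up data)
base_vecs = rng.normal(size=(50, 3))

# Build a random orthogonal matrix via QR and use it to rotate the base space
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
other_vecs = base_vecs @ rotation

# Same steps as in the alignment function above
m = other_vecs.T @ base_vecs
u, _, vt = np.linalg.svd(m)
ortho = u @ vt

# Apply the recovered transformation to the rotated embedding
aligned_vecs = other_vecs @ ortho

# Largest remaining difference between the aligned and base matrices
max_error = np.abs(aligned_vecs - base_vecs).max()
print(max_error)
```

With real embeddings the two spaces are never exact rotations of each other, so some residual difference remains after alignment. That residual is what the divergence measures below try to capture.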

In order to explore this, let's get some data that follows a time trend. We'll look at conference proceedings from the American Society of Clinical Oncology (ASCO).

In [39]:
ascoDF = pandas.read_csv("../data/ASCO_abstracts.csv", index_col=0)

Prepare for word2vec:

In [40]:
ascoDF['tokenized_sents'] = ascoDF['Body'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
ascoDF['normalized_sents'] = ascoDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, stopwordLst = lucem_illud.stop_words_basic) for s in x])

We will be creating many embeddings, so we have created this function to do most of the work. It creates two collections of embeddings: the original and the aligned.

In [49]:
def compareModels(df, category, sort = True):
    """If you are using time as your category sorting is important"""
    embeddings_raw = {}
    cats = sorted(set(df[category]))
    for cat in cats:
        #This can take a while
        print("Embedding {}".format(cat), end = '\r')
        subsetDF = df[df[category] == cat]
        #You might want to change the W2V parameters
        embeddings_raw[cat] = gensim.models.word2vec.Word2Vec(subsetDF['normalized_sents'].sum())
    #These are much quicker
    embeddings_aligned = {}
    for catOuter in cats:
        embeddings_aligned[catOuter] = [embeddings_raw[catOuter]]
        for catInner in cats:
            embeddings_aligned[catOuter].append(smart_procrustes_align_gensim(embeddings_aligned[catOuter][-1], embeddings_raw[catInner]))
    return embeddings_raw, embeddings_aligned

Now we generate the models

In [50]:
rawEmbeddings, comparedEmbeddings = compareModels(ascoDF, 'Year')

Embedding 2011

We need to compare them across all permutations, so we will define another function to help. We will be using 1 - cosine similarity, as that gives a more intuitive range of 0-2, with low values meaning little change and high values meaning a lot of change.
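As a quick illustration of that range, here is a minimal sketch of the distance in plain Python. The helper `cosine_distance` is written for this example only and is not part of `lucem_illud` or scikit-learn:

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Same direction, orthogonal, and opposite direction
same = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```

Vectors pointing the same way sit at the bottom of the range, orthogonal vectors in the middle, and opposite vectors at the top.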

In [57]:
def getDivergenceDF(word, embeddingsDict):
    cats = sorted(set(embeddingsDict.keys()))
    dists = {}
    for cat in cats:
        dists[cat] = []
        #cosine_similarity expects 2D arrays, so reshape each word vector to (1, n_dims)
        baseVec = embeddingsDict[cat][0][word].reshape(1, -1)
        for embed in embeddingsDict[cat][1:]:
            otherVec = embed[word].reshape(1, -1)
            dists[cat].append(np.abs(1 - sklearn.metrics.pairwise.cosine_similarity(baseVec, otherVec)[0,0]))
    return pandas.DataFrame(dists, index = cats)

Let's look at a couple of words:

In [58]:
targetWord = 'breast'

pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = False) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()

In [59]:
targetWord = 'triple'

pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = False) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()

We can also ask which words changed the most
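The ranking idea can be sketched on toy data before applying it to the real embeddings. Everything here is hypothetical: the `aligned` dictionary holds made-up 2-d vectors for two words across three periods, and each word's divergence is the sum of its cosine distances from the first period.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical aligned vectors: period -> word -> vector (values are made up)
aligned = {
    2011: {"trial": (1.0, 0.0), "triple": (1.0, 0.0)},
    2012: {"trial": (0.9, 0.1), "triple": (0.5, 0.5)},
    2013: {"trial": (1.0, 0.2), "triple": (0.0, 1.0)},
}

def total_divergence(word, aligned):
    """Sum the distances between the first period's vector and each later one."""
    periods = sorted(aligned)
    base = aligned[periods[0]][word]
    return sum(cosine_distance(base, aligned[p][word]) for p in periods[1:])

# Rank words from most to least divergent
words = sorted(aligned[2011])
ranked = sorted(words, key=lambda w: total_divergence(w, aligned), reverse=True)
print(ranked)
```

In this toy setup "triple" drifts away from its starting direction while "trial" barely moves, so "triple" is ranked first.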

In [54]:
def findDiverence(word, embeddingsDict):
    cats = sorted(set(embeddingsDict.keys()))
    #cosine_similarity expects 2D arrays, so reshape each word vector to (1, n_dims)
    baseVec = embeddingsDict[cats[0]][0][word].reshape(1, -1)
    dists = []
    for embed in embeddingsDict[cats[0]][1:]:
        dists.append(1 - sklearn.metrics.pairwise.cosine_similarity(baseVec, embed[word].reshape(1, -1))[0,0])
    return sum(dists)

def findMostDivergent(embeddingsDict):
    words = []
    for embeds in embeddingsDict.values():
        for embed in embeds:
            words += list(embed.wv.vocab.keys())
    words = set(words)
    print("Found {} words to compare".format(len(words)))
    return sorted([(w, findDiverence(w, embeddingsDict)) for w in words], key = lambda x: x[1], reverse=True)

In [62]:
wordDivergences = findMostDivergent(comparedEmbeddings)

Found 1760 words to compare

The most divergent words are:

In [56]:
wordDivergences[:10]

And the least divergent:

In [None]:
wordDivergences[-10:]

In [None]:
targetWord = wordDivergences[0][0]

pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = False) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()

In [None]:
targetWord = wordDivergences[-1][0]

pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = False) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()

## <span style="color:red">*Exercise 3b*</span>

<span style="color:red">**Do only 3a or 3b.** Construct cells immediately below this that align word embeddings over time or across domains/corpora. Interrogate the spaces that result and ask which words changed most and least over the entire period or between contexts/corpora. What does this reveal about the social game underlying your space?</span>

In [28]:
tocXLX = pandas.ExcelFile('meritTOC.xlsx')

In [29]:
tocDF = tocXLX.parse('Sheet1')

In [35]:
# tokenize
stbtDF = pandas.read_csv("meritSTBT.csv")

stbtDF['tokenized_sents'] = stbtDF['text'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
stbtDF['normalized_sents'] = stbtDF['tokenized_sents'].apply(lambda x: [lucem_illud.normalizeTokens(s, stopwordLst = lucem_illud.stop_words_basic) for s in x])

Unnamed: 0.1,Unnamed: 0,abstr,art_num,author,date,section,source,subj,text,title,url,tokenized_sents,normalized_sents
0,0,"According to the survey of 6,600 people aged 1...",1,"Wee, Lea\n","May 3, 2012\n",Mind Your Body\n,The Straits Times; Singapore\n,Mental disorders; Anxieties; Older people; Ch...,Young people in Singapore aged 18 to 29 are a...,Young & disturbed\n,http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[Young, people, in, Singapore, aged, 18, to, ...","[[young, people, singapore, aged, higher, risk..."
1,1,[...]his education was practically self-financ...,2,Wong Kim Hoh\n,"May 13, 2012\n",Think\n,The Straits Times; Singapore\n,,Wong Kim Hoh meets...\nChong Kee Hiong\nWhen ...,"To be a CEO, read martial arts novels: The ...",http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[Wong, Kim, Hoh, meets, ..., Chong, Kee, Hion...","[[wong, kim, hoh, meets, chong, kee, hiong, pr..."
2,2,The Danes - the only people in occupied Europe...,3,Janadas Devan\n,"May 22, 2012\n",Review\n,The Straits Times; Singapore\n,Immigration policy; Workforce; Aliens\n,"AS EVENTS of the past week have revealed, the...","When in Singapore, act like a S'porean: Int...",http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[AS, EVENTS, of, the, past, week, have, revea...","[[events, past, week, revealed, divisions, sin..."
3,3,While they agreed on respect for multi-racial ...,4,,"May 22, 2012\n",Prime News\n,The Straits Times; Singapore\n,Elections\n,WE HEARD YOU...\nREADERS have told us that th...,The news in 5 minutes\n,http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[WE, HEARD, YOU, ...], [READERS, have, told, ...","[[heard], [readers, told, us, find, quick, wra..."
4,4,Only in that way would people be prepared to c...,5,Wang Gungwu\n,"Jun 23, 2012\n",Saturday Special Report\n,The Straits Times; Singapore\n,Education; Families & family life\n,MERITOCRACY is a valued ideal in most of the ...,Meritocracy's enemies: China's long history...,http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[MERITOCRACY, is, a, valued, ideal, in, most,...","[[meritocracy, valued, ideal, modern, world], ..."
5,5,COMMUNITY groups like Jamiyah play a key role ...,6,"Chan, Robin\n","Jun 30, 2012\n",Singapore\n,The Straits Times; Singapore\n,Education; Business schools\n,COMMUNITY groups like Jamiyah play a key role...,"Community groups play a key role, says PM: ...",http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[COMMUNITY, groups, like, Jamiyah, play, a, k...","[[community, groups, like, jamiyah, play, key,..."
6,6,Mr Muhamad Ali Aman from the Motivasi Youth As...,7,"Ong, Andrea\n","Jul 1, 2012\n",News\n,The Straits Times; Singapore\n,Public housing\n,Questions about policies seen as discriminati...,Frank talk of policies concerning Malay-Musli...,http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[Questions, about, policies, seen, as, discri...","[[questions, policies, seen, discriminating, s..."
7,7,Modern Singapore had been established by Stamf...,8,"Latif, Asad\n","Jul 13, 2012\n",Review\n,The Straits Times; Singapore\n,Aliens; Immigration policy; Cities\n,ONE problem with foreign misfits - immigrants...,Confessions of a former foreign misfit\n,http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[ONE, problem, with, foreign, misfits, -, imm...","[[one, problem, foreign, misfits, immigrants, ..."
8,8,[...]they are sending a clear message that the...,9,"Chang, Rachel; Ong, Cheryl\n","Jul 21, 2012\n",Prime News\n,The Straits Times; Singapore\n,Income distribution; Children & youth; Standa...,AS MANY as eight out of 10 Singaporeans belie...,Poll flags concerns over wage gap\n,http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[AS, MANY, as, eight, out, of, 10, Singaporea...","[[many, eight, singaporeans, believe, standard..."
9,9,[...]his humble beginnings have given him an e...,10,"Chang, Rachel; Ong, Cheryl\n","Jul 21, 2012\n",Insight\n,The Straits Times; Singapore\n,Schools; Children & youth; Standard of living...,"ENTREPRENEUR Kenneth Li, 26, never doubted he...","Moving up in life, the Singapore way: Insig...",http://sfx.lib.uchicago.edu/sfx_local?url_ver...,"[[ENTREPRENEUR, Kenneth, Li, ,, 26, ,, never, ...","[[entrepreneur, kenneth, li, never, doubted, w..."
