The meaning of words often changes over time.  In this homework, you will explore this phenomenon by identifying shifts in word meaning over the space of one hundred years by examining word embeddings trained on historical data (largely published before 1923) and those trained on contemporary texts.

In [1]:
import re
from gensim.models import KeyedVectors
import operator
import pandas as pd
import numpy as np

In [2]:
wiki = KeyedVectors.load_word2vec_format("../data/glove.6B.50d.50K.txt", binary=False)

In [3]:
guten = KeyedVectors.load_word2vec_format("../data/gutenberg.200.vectors.50K.txt", binary=False)

Q1. Before we jump in, select 5 words whose senses you believe have changed over the period of the past 100 years. Ensure they are in the vocabulary of both models.  Explain the two different meanings they have.  This is an important step in stating your beliefs before you examine any empirical evidence; do not change these terms after you have run the models you develop below.  (Here we are only evaluating the rationales, not whether the terms *actually* undergo sense change, as measured below.)

In [4]:
# fill in terms here
terms=['google', 'apple', 'mouse', 'chip', 'windows']
for term in terms:
    if term not in wiki or term not in guten:
        print("%s missing!" % term)

**Q1 response**.

The words I chose share one line of reasoning: each has acquired an additional sense through technological change. `google` is a respelling of the mathematical term "googol", which predates the company, but the word became widely known only after the company's debut. `apple` and `mouse` had the prototypical senses of a fruit and an animal a century ago, and `chip` and `windows` likewise had only their everyday senses. In contemporary usage, all five are also associated with computing concepts such as `internet` and `web`.

`google`
- Meaning 1: variant spelling of the math term "googol"
- Meaning 2: tech company trademark; as a verb, to search the web

`apple`
- Meaning 1: fruit
- Meaning 2: tech company trademark

`mouse`
- Meaning 1: rodent
- Meaning 2: computer pointing device

`chip`
- Meaning 1: fragments; gambling chips; food
- Meaning 2: integrated circuit (microchip)

`windows`
- Meaning 1: opening in the wall
- Meaning 2: computer operating system

Q2. Find the words that have changed the most by calculating the number of words that overlap in their 50 nearest neighbors.  That is, let $\mathcal{N}_{guten}(\textrm{awesome})$ be the 50 nearest neighbors for the word "awesome" in the Gutenberg embeddings and $\mathcal{N}_{wiki}(\textrm{awesome})$ be the 50 nearest neighbors for "awesome" in the Wikipedia embeddings.  Calculate the size of $\mathcal{N}_{guten}(\textrm{awesome}) \cap \mathcal{N}_{wiki}(\textrm{awesome})$.  Under this method, the words that share the *fewest* neighbors have moved the furthest apart.  Display the 100 words that have moved the furthest apart and the 100 words that have remained the closest together, along with their intersection score.  

In [5]:
def find_words(vocab):
    """Map each word to the number of 50-nearest neighbors it shares in guten and wiki."""
    
    # overlap size for each word
    num = {}
    
    for word in vocab:
        
        # 50 nearest neighbors in the Gutenberg embeddings
        nguten = {k for k, _ in guten.most_similar(word, topn=50)}
        
        # 50 nearest neighbors in the Wikipedia embeddings
        nwiki = {k for k, _ in wiki.most_similar(word, topn=50)}
        
        # number of neighbors shared by both models
        num[word] = len(nguten & nwiki)
    
    # sort by overlap size, largest first (most stable words lead)
    sorted_num = dict(sorted(num.items(), key=lambda item: item[1], reverse=True))
    
    return sorted_num

In [6]:
find_words(terms)

{'mouse': 9, 'windows': 7, 'google': 1, 'apple': 1, 'chip': 0}
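To make the overlap score concrete without loading the full embeddings, here is a minimal sketch on toy data. The 2-d vectors and the helper names (`cosine`, `nearest`, `overlap`) are invented for illustration and are not part of the assignment; the sketch uses 2 nearest neighbors instead of 50 and plain cosine similarity in place of `most_similar`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn):
    """Set of the `topn` words most similar to `word`, excluding the word itself."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return set(others[:topn])

def overlap(word, old, new, topn=2):
    """Number of nearest neighbors `word` shares across the two spaces."""
    return len(nearest(word, old, topn) & nearest(word, new, topn))

# Hypothetical vectors: only "mouse" moves between the two spaces.
toy_old = {"mouse": (1.0, 0.1), "cat": (0.9, 0.2), "rat": (0.95, 0.05),
           "keyboard": (0.1, 1.0), "screen": (0.0, 0.9)}
toy_new = {"mouse": (0.2, 1.0), "cat": (0.9, 0.2), "rat": (0.95, 0.05),
           "keyboard": (0.1, 1.0), "screen": (0.0, 0.9)}

mouse_overlap = overlap("mouse", toy_old, toy_new)
cat_overlap = overlap("cat", toy_old, toy_new)
print(mouse_overlap, cat_overlap)
```

In this toy setup `mouse` swaps its animal neighbors for hardware neighbors, so it shares none of them across the two spaces, while `cat` keeps both of its neighbors.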

In [7]:
# get the shared vocab between two word lists
wiki_vocab = wiki.index_to_key
guten_vocab = guten.index_to_key
shared_vocab = list(set(wiki_vocab) & set(guten_vocab))

In [8]:
# compute the neighbor-overlap score for every shared word
overlaps = find_words(shared_vocab)

In [9]:
# 100 words that remained close together
list(overlaps.items())[:100]

[('38', 44),
 ('39', 44),
 ('37', 44),
 ('49', 43),
 ('48', 43),
 ('43', 43),
 ('59', 42),
 ('42', 41),
 ('46', 41),
 ('41', 41),
 ('36', 41),
 ('33', 38),
 ('6', 38),
 ('57', 37),
 ('32', 36),
 ('1869', 35),
 ('5', 35),
 ('55', 35),
 ('44', 35),
 ('1866', 34),
 ('7', 34),
 ('65', 34),
 ('1843', 34),
 ('1865', 34),
 ('45', 34),
 ('62', 34),
 ('56', 33),
 ('47', 33),
 ('1844', 33),
 ('1856', 33),
 ('2', 33),
 ('9', 33),
 ('8', 33),
 ('1854', 32),
 ('67', 32),
 ('1850', 32),
 ('52', 32),
 ('35', 32),
 ('68', 32),
 ('1902', 32),
 ('61', 32),
 ('1831', 31),
 ('10', 31),
 ('fifteen', 31),
 ('1840', 31),
 ('1907', 31),
 ('53', 31),
 ('1858', 31),
 ('4', 31),
 ('1845', 31),
 ('34', 31),
 ('14', 31),
 ('1899', 30),
 ('16', 30),
 ('fourteen', 30),
 ('1855', 30),
 ('21', 30),
 ('1861', 30),
 ('1829', 30),
 ('1908', 30),
 ('1859', 30),
 ('11', 30),
 ('31', 30),
 ('13', 30),
 ('77', 30),
 ('19', 30),
 ('63', 29),
 ('1', 29),
 ('54', 29),
 ('12th', 29),
 ('1876', 29),
 ('kentucky', 29),
 ('iowa', 2

In [10]:
# 100 words whose meanings shifted furthest apart (lowest overlap; ties appear in arbitrary order)
list(overlaps.items())[-100:]

[('mop', 0),
 ('wallop', 0),
 ('ter', 0),
 ('painless', 0),
 ('dimming', 0),
 ('kennard', 0),
 ('curt', 0),
 ('spanning', 0),
 ('aggravated', 0),
 ('mitigating', 0),
 ('snapshot', 0),
 ('gwen', 0),
 ('healer', 0),
 ('flagship', 0),
 ('mansoor', 0),
 ('ilk', 0),
 ('messed', 0),
 ('bracken', 0),
 ('reflective', 0),
 ('bran', 0),
 ('valdez', 0),
 ('corliss', 0),
 ('tenths', 0),
 ('md', 0),
 ('countenance', 0),
 ('dike', 0),
 ('lowe', 0),
 ('tagging', 0),
 ('e-mail', 0),
 ('rajah', 0),
 ('lombardo', 0),
 ('britton', 0),
 ('include', 0),
 ('recurrent', 0),
 ('delano', 0),
 ('spades', 0),
 ('decimal', 0),
 ('seppi', 0),
 ('transvaal', 0),
 ('milking', 0),
 ('doris', 0),
 ('payne', 0),
 ('breasted', 0),
 ('unseated', 0),
 ('patriarch', 0),
 ('multiplicity', 0),
 ('maneuver', 0),
 ('inexplicably', 0),
 ('promoter', 0),
 ('beryl', 0),
 ('thome', 0),
 ('dexter', 0),
 ('retainer', 0),
 ('compatriots', 0),
 ('dictating', 0),
 ('romney', 0),
 ('exploited', 0),
 ('steers', 0),
 ('nimrod', 0),
 ('sco

Now let's look at how much the candidate terms you defined above have changed their meaning as measured in these embeddings.  First, we can just print their neighborhoods:

In [11]:
def print_top(word):
    print("=== %s ===\n" % word)
    print("Gutenberg:")
    for k, v in guten.most_similar(word, topn=10):
        print("%.3f\t%s" % (v,k))

    print()
    print("Wikipedia:")
    for k, v in wiki.most_similar(word, topn=10):
        print("%.3f\t%s" % (v,k)) 
    print()

In [12]:
for term in terms:
    print_top(term)

=== google ===

Gutenberg:
0.542	miscellanies
0.533	selections
0.525	scans
0.520	underline
0.516	disclaims
0.512	derivable
0.503	epiblast
0.499	multiple
0.496	etexts
0.491	transcribed

Wikipedia:
0.894	yahoo
0.853	aol
0.845	microsoft
0.818	internet
0.818	web
0.809	facebook
0.793	ebay
0.791	netscape
0.791	online
0.782	software

=== apple ===

Gutenberg:
0.686	fruit
0.679	apples
0.662	apricot
0.662	onion
0.661	pear
0.656	cabbage
0.656	cherry
0.656	peach
0.648	bread-fruit
0.639	gum

Wikipedia:
0.754	blackberry
0.744	chips
0.743	iphone
0.733	microsoft
0.733	ipad
0.722	pc
0.720	ipod
0.719	intel
0.715	ibm
0.709	software

=== mouse ===

Gutenberg:
0.701	kitten
0.694	cat
0.641	dog
0.589	bird
0.580	caterpillar
0.576	puppy
0.559	butterfly
0.549	hen
0.545	squirrel
0.537	dormouse

Wikipedia:
0.797	monkey
0.781	bugs
0.773	cat
0.762	rabbit
0.750	worm
0.731	clone
0.727	robot
0.720	spider
0.710	bug
0.703	frog

=== chip ===

Gutenberg:
0.594	fireman
0.585	cal
0.578	barkeeper
0.568	johnny
0.562	denson
0

**check+**. Let's make this a little more precise.  Rank all terms by the overlap score you created above, so that words with scores closer to 0 (i.e., no overlap in nearest neighbors) are ranked higher (i.e., closer to position 1). Measure how good your guesses were by calculating their [mean reciprocal rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) within this list.  (Again, we're not evaluating how good your guesses were above, but rather the correctness of your implementation of MRR.)

In [13]:
# rank words by overlap score, ascending; method='min' gives tied words the
# lowest rank in their group, so every zero-overlap word is ranked 1st
pd.Series(overlaps).rank(method='min')

38           26507.0
39           26507.0
37           26507.0
49           26504.0
48           26504.0
              ...   
porcupine        1.0
crockett         1.0
licks            1.0
cob              1.0
aggregate        1.0
Length: 26509, dtype: float64

In [14]:
# map each word to its reciprocal rank (1 / rank)
reciprocal_ranks = (1/pd.Series(overlaps).rank(method='min')).to_dict()

In [15]:
# get the reciprocal ranks of selected terms
for word in terms:
    rr = reciprocal_ranks[word]
    print(f'The reciprocal rank of the word: {word}: {rr}')

The reciprocal rank of the word: google: 0.0001462629808395495
The reciprocal rank of the word: apple: 0.0001462629808395495
The reciprocal rank of the word: mouse: 4.24881033310673e-05
The reciprocal rank of the word: chip: 1.0
The reciprocal rank of the word: windows: 4.5964331678617394e-05


In [16]:
mrr = np.mean([reciprocal_ranks[word] for word in terms])
print(f'The mean reciprocal rank of selected terms is {mrr}')

The mean reciprocal rank of selected terms is 0.20007619567933776


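The MRR of the selected terms is about 0.2, but nearly all of it comes from `chip`: with zero overlapping neighbors it ties for rank 1 and contributes a reciprocal rank of 1.0 (1.0 / 5 = 0.2). The other four terms have reciprocal ranks below 0.0002, so they rank in the thousands and add almost nothing to the mean. Under this measure `chip` shows the strongest evidence of sense change; `google` and `apple` share only one neighbor each across the two models, and `mouse` and `windows` retain somewhat more of their neighborhoods.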
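The effect of ties on this score can be sketched without pandas. The block below is an illustrative pure-Python version of min-style tie ranking followed by MRR; the function names and toy scores are hypothetical, and it is meant to mirror the behavior of `rank(method='min')` used above, not to replace it.

```python
def min_ranks(scores):
    """Rank words by ascending score; tied words all get the lowest rank of their group."""
    first_pos = {}
    # the first 1-based position of each score in sorted order is its 'min' rank
    for pos, score in enumerate(sorted(scores.values()), start=1):
        first_pos.setdefault(score, pos)
    return {word: first_pos[score] for word, score in scores.items()}

def mean_reciprocal_rank(ranks, queries):
    """Average of 1 / rank over the query words."""
    return sum(1.0 / ranks[q] for q in queries) / len(queries)

# Hypothetical overlap scores: three words tie at zero overlap.
toy_scores = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 3}
toy_ranks = min_ranks(toy_scores)
toy_mrr = mean_reciprocal_rank(toy_ranks, ["a", "e"])
print(toy_ranks, toy_mrr)
```

Because every word tied at the lowest score receives rank 1, a single such word among the queries contributes a full 1.0 to the sum, which is why one term can dominate the mean.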