This notebook explores distributional similarity in a dataset of 10,000 Wikipedia articles (4.4M words), building high-dimensional, sparse representations for words from the distinct contexts they appear in. These representations let us find the words most similar to a given query, and they are interpretable: we can inspect the specific contexts that contribute most to the similarity between two words.

In [None]:
from collections import defaultdict, Counter
import math
import operator
import gzip

In [None]:
window=2
vocabSize=10000

In [None]:
filename="../data/wiki.10K.txt"

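# read the whole corpus, lowercase it, and split on single spaces; the file is closed on exit
with open(filename, encoding="utf-8") as f:
    wiki_data=f.read().lower().split(" ")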


In [None]:
# We'll only create word representations for the K most frequent words (K = vocabSize)

def create_vocab(data):
    word_representations={}
    vocab=Counter(data)  # count every token in a single pass

    topK=[k for k,v in vocab.most_common(vocabSize)]
    for k in topK:
        word_representations[k]=defaultdict(float)
    return word_representations

In [None]:
# word representation for a word = its unigram distributional context (the unigrams that show
# up in a window before and after each occurrence)

def count_unigram_context(data, word_representations):
    for i, word in enumerate(data):
        if word not in word_representations:
            continue
        start=i-window if i-window > 0 else 0
        end=i+window+1 if i+window+1 < len(data) else len(data)
        for j in range(start, end):
            if i != j:
                word_representations[word][data[j]]+=1

In [None]:
def count_directional_context(data, word_representations):
    for i, word in enumerate(data):
        if word not in word_representations:
            continue
        start=i-window if i-window > 0 else 0
        end=i+window+1 if i+window+1 < len(data) else len(data)
        left="L: %s" % ' '.join(data[start:i])
        right="R: %s" % ' '.join(data[i+1:end])
        
        word_representations[word][left]+=1
        word_representations[word][right]+=1

In [None]:
# normalize each word representation vector so that its L2 norm is 1.
# we do this so that the cosine similarity reduces to a simple dot product

def normalize(word_representations):
    for word in word_representations:
        total=0
        for key in word_representations[word]:
            total+=word_representations[word][key]*word_representations[word][key]
            
        total=math.sqrt(total)
        for key in word_representations[word]:
            word_representations[word][key]/=total
        

In [None]:
def dictionary_dot_product(dict1, dict2):
    dot=0
    for key in dict1:
        if key in dict2:
            dot+=dict1[key]*dict2[key]
    return dot
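
To see why normalizing first lets a plain dot product stand in for cosine similarity, here is a minimal sketch on two made-up sparse vectors. The helper names (`l2_normalize`, `sparse_dot`, `cosine`) and the counts are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def l2_normalize(vec):
    # Scale a sparse vector (dict of feature -> value) to unit L2 norm.
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()}

def sparse_dot(a, b):
    # Only features present in both dicts contribute to the dot product.
    return sum(v * b[k] for k, v in a.items() if k in b)

def cosine(a, b):
    # Textbook cosine: dot product divided by the product of the norms.
    norm_a = math.sqrt(sparse_dot(a, a))
    norm_b = math.sqrt(sparse_dot(b, b))
    return sparse_dot(a, b) / (norm_a * norm_b)

# Hypothetical raw context counts, invented for this example.
actor = {"L: is an": 3.0, "R: . he": 4.0}
politician = {"R: . he": 2.0, "L: was a": 1.0}

full = cosine(actor, politician)
shortcut = sparse_dot(l2_normalize(actor), l2_normalize(politician))
print(math.isclose(full, shortcut))  # True
```

Normalizing once up front means each of the many pairwise comparisons in `find_sim` needs only the dot product.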

In [None]:
def find_sim(word_representations, query):
    if query not in word_representations:
        print("'%s' is not in vocabulary" % query)
        return None
    
    scores={}
    for word in word_representations:
        cosine=dictionary_dot_product(word_representations[query], word_representations[word])
        scores[word]=cosine
    return scores

In [None]:
# Find the K words with highest cosine similarity to a query in a set of word_representations
def find_nearest_neighbors(word_representations, query, K):
    scores=find_sim(word_representations, query)
    if scores is not None:
        sorted_x = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
        for idx, (k, v) in enumerate(sorted_x[:K]):
            print("%s\t%s\t%.5f" % (idx,k,v))

Explore the difference between `count_unigram_context` and `count_directional_context` for determining what counts as "context".  `count_unigram_context` counts an individual unigram in the bag of words around a target as a "context" variable, while `count_directional_context` counts the sequence of words before and after the word as a single "context"--and specifies the direction it occurs (to the left or right of the word).
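
As a rough illustration of the two definitions, the sketch below extracts both kinds of features for a single occurrence of a target word in a toy sentence. The function `toy_contexts` and the sentence are assumptions made for this example; the window logic mirrors the counting functions above with `window=2`.

```python
def toy_contexts(tokens, target_index, window=2):
    """Return (unigram_features, directional_features) for one target position."""
    start = max(target_index - window, 0)
    end = min(target_index + window + 1, len(tokens))

    # Bag-of-words view: every neighbouring token is its own feature.
    unigram = [tokens[j] for j in range(start, end) if j != target_index]

    # Directional view: the whole left span and the whole right span
    # each form a single feature, tagged with its side.
    left = "L: %s" % " ".join(tokens[start:target_index])
    right = "R: %s" % " ".join(tokens[target_index + 1:end])
    return unigram, [left, right]

# Hypothetical pre-tokenized, lowercased sentence.
tokens = "john is an actor . he lives in paris".split(" ")
unigram, directional = toy_contexts(tokens, tokens.index("actor"))
print(unigram)      # ['is', 'an', '.', 'he']
print(directional)  # ['L: is an', 'R: . he']
```

The unigram view yields four loosely specific features for this occurrence, while the directional view yields two features that are more specific and therefore sparser.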

In [None]:
word_representations=create_vocab(wiki_data)
count_directional_context(wiki_data, word_representations)
normalize(word_representations)

In [None]:
find_nearest_neighbors(word_representations, "actor", 10)

In [None]:
# Let's find the contexts shared between two words that have the most contribution
# to the cosine similarity

def find_shared_contexts(word_representations, query1, query2, K):
    if query1 not in word_representations:
        print("'%s' is not in vocabulary" % query1)
        return None
    
    if query2 not in word_representations:
        print("'%s' is not in vocabulary" % query2)
        return None
    
    context_scores={}
    dict1=word_representations[query1]
    dict2=word_representations[query2]
    
    for key in dict1:
        if key in dict2:
            score=dict1[key]*dict2[key]
            context_scores[key]=score

    sorted_x = sorted(context_scores.items(), key=operator.itemgetter(1), reverse=True)
    for idx, (k, v) in enumerate(sorted_x[:K]):
        print("%s\t%s\t%.5f" % (idx,k,v))

In [None]:
find_shared_contexts(word_representations, "actor", "politician", 10)

We can see here that the single feature that has the most impact on similarity between these words is the directional n-gram ". he" (which would appear in text like "John is an actor **. He** ...").
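
The score printed for each shared context can be read as that context's term in the cosine sum. Assuming both vectors have already been scaled to unit length by `normalize`, and writing $u$ and $v$ for the two word vectors and $C$ for the set of contexts they share (notation introduced here only for illustration):

$$\cos(u, v) = \sum_{c \in C} u_c \, v_c$$

A context ranks highly only when it carries a large normalized weight in both vectors, and the per-context scores add up to the overall similarity.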

**Activity**: Find the nearest neighbors for other words above (in the `find_nearest_neighbors` cell); then find the shared contexts for a pair of nearest neighbors (as we did for actor/politician). What does this reveal about what drives similarity?