# Week Four Exercise: Working with word vectors

Since we're getting closer to real open issues in NLP, this exercise will be more open-ended than the last couple.

## Part 0: Setup

First let's load a set of 50D word vectors from GloVe. You can download them through NYU Classes under *Resources: Data for Exercises*. If you need the original zip file, which includes 300D vectors (~1GB, may overwhelm room wifi), click the period at the end of this sentence[.](http://nlp.stanford.edu/data/glove.6B.zip)

`glove_home` below specifies the location of the unzipped file. `words_to_load` specifies how many word vectors we want to load. The words are saved in frequency order, so specifying 50,000 means that we only want to work with the 50,000 most frequent words from the source corpus. You can load up to 400,000 words.

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

In [None]:
glove_home = '../'
words_to_load = 50000

import numpy as np

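# GloVe files are UTF-8; naming the encoding avoids relying on the platform default.
with open(glove_home + 'glove.6B.50d.txt', encoding='utf-8') as f:
    loaded_embeddings = np.zeros((words_to_load, 50))
    words = {}          # word -> row index in loaded_embeddings
    ordered_words = []  # row index -> word, in frequency order
    for i, line in enumerate(f):
        if i >= words_to_load:
            break

        s = line.split()
        # s[0] is the word; the remaining fields are its vector components.
        loaded_embeddings[i, :] = np.asarray(s[1:], dtype=float)
        words[s[0]] = i
        ordered_words.append(s[0])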

Here's how to look up a word:

In [None]:
loaded_embeddings[words['potato']]
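Because only the `words_to_load` most frequent words are in memory, indexing `words` directly raises a `KeyError` for anything outside that set. Below is a minimal sketch of a more forgiving lookup. The helper name `get_vector` and the toy data are assumptions made for illustration, and the lowercasing step assumes the uncased `glove.6B` vocabulary.

```python
import numpy as np

# Toy stand-ins for the notebook's `words` dict and `loaded_embeddings` matrix,
# so that this sketch runs on its own.
words = {'potato': 0, 'tomato': 1}
loaded_embeddings = np.array([[0.1, 0.2, 0.3],
                              [0.4, 0.5, 0.6]])

def get_vector(word, vocab=words, embeddings=loaded_embeddings):
    """Return the embedding for `word`, or None if the word was not loaded.

    `get_vector` is a hypothetical helper, not part of the exercise code.
    """
    index = vocab.get(word.lower())  # assumes an all-lowercase vocabulary
    if index is None:
        return None
    return embeddings[index]

print(get_vector('Potato'))    # found after lowercasing
print(get_vector('zucchini'))  # not in the toy vocabulary
```

In the notebook itself, drop the toy definitions and call `get_vector` against the real `words` and `loaded_embeddings`.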

## Part 1: The Semantic Orientation Method

The __semantic orientation__ method of [Turney and Littman 2003](http://doi.acm.org/10.1145/944012.944013) automatically scores words along a single semantic dimension, such as sentiment. It starts from two small seed sets of words that represent opposite ends of that dimension.

*Some code in this section was adapted from Stanford CS 224U*

Here's a sample pair of seed sets:

In [None]:
seed_pos = ['good', 'great', 'awesome', 'like', 'love']
seed_neg = ['bad', 'awful', 'terrible', 'hate', 'dislike']

Let's look up the embeddings for these words.

In [None]:
seed_pos_indices = [words[seed] for seed in seed_pos]
seed_neg_indices = [words[seed] for seed in seed_neg]
seed_pos_mat = loaded_embeddings[seed_pos_indices]
seed_neg_mat = loaded_embeddings[seed_neg_indices]

And write a function to score words along the axis.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def determine_coefficient(candidate_word):
    # cosine_similarity expects 2D inputs, so make the candidate a one-row matrix.
    candidate = loaded_embeddings[words[candidate_word]].reshape(1, -1)
    # Total similarity to the positive seeds minus total similarity to the negative seeds.
    pos_sim = np.sum(cosine_similarity(candidate, seed_pos_mat))
    neg_sim = np.sum(cosine_similarity(candidate, seed_neg_mat))
    return pos_sim - neg_sim

In [None]:
determine_coefficient('abhorrent')
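Written out, the score that `determine_coefficient` computes for a candidate word $w$ is shown below. The symbols are labels introduced here for convenience, not notation from the paper: $\mathbf{v}_w$ is the embedding of $w$, $P$ is the positive seed set, and $N$ is the negative seed set.

$$
\mathrm{score}(w) = \sum_{p \in P} \cos(\mathbf{v}_w, \mathbf{v}_p) - \sum_{n \in N} \cos(\mathbf{v}_w, \mathbf{v}_n)
$$

A positive score means the word sits closer, in total, to the positive seeds than to the negative ones. With five seeds on each side, scores fall between -10 and 10.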

And sort our vocabulary by its score along the axis. For now, we're only scoring frequent words, since this process can be slow.

In [None]:
from operator import itemgetter

scored_words = [(word, determine_coefficient(word)) for word in ordered_words[:5000]]
sorted_words = sorted(scored_words, key=itemgetter(1), reverse=True)

In [None]:
pp.pprint(sorted_words[:10])
pp.pprint(sorted_words[-10:])
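The comprehension above calls `determine_coefficient` once per word, which is why it is slow. One possible speed-up, sketched here on toy data, is to normalize every row once and get all the cosine similarities from a single matrix product. The function name `score_all` and the toy matrix are assumptions for illustration, not part of the exercise code.

```python
import numpy as np

def score_all(embeddings, pos_indices, neg_indices):
    """Score every row of `embeddings` at once.

    Returns, per row, the summed cosine similarity to the positive seed rows
    minus the summed cosine similarity to the negative seed rows.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against all-zero rows so we never divide by zero.
    unit = embeddings / np.maximum(norms, 1e-12)
    # Dot products of unit vectors are cosine similarities.
    pos_sim = unit @ unit[pos_indices].T  # shape (n_words, n_pos_seeds)
    neg_sim = unit @ unit[neg_indices].T  # shape (n_words, n_neg_seeds)
    return pos_sim.sum(axis=1) - neg_sim.sum(axis=1)

# Toy 2D "embeddings": row 0 is the positive seed, row 1 the negative seed.
toy = np.array([[1.0, 0.0],
                [-1.0, 0.0],
                [2.0, 0.0],   # same direction as the positive seed
                [0.0, 3.0]])  # orthogonal to both seeds
scores = score_all(toy, [0], [1])
print(scores)
```

In the notebook, `score_all(loaded_embeddings, seed_pos_indices, seed_neg_indices)` should return one score per loaded word, which can be paired with `ordered_words` and sorted as before. Passing the full matrix rather than a slice keeps the seed indices valid.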

Spend a few minutes exploring possible seed sets for semantic dimensions other than sentiment. What works? What doesn't? Why?

## Part 2: Analogy

Next, let's try to build a similar function for determining which words are likely to be good completions for an analogy. Our inputs will be a pair and a singleton word that together represent an analogy problem.

In [None]:
# good:best::bad:???
prompt_pair = ('good', 'best')
prompt_seed = 'bad'

Your code should produce a sorted list of words that could complete the analogy. Store it in `sorted_words` so that you can print it as follows. You're welcome to adapt code from above.

In [None]:
pp.pprint(sorted_words[:10])

Once you're done, try some different analogies. What tends to work? What doesn't? Why?