<a href="https://colab.research.google.com/github/spatank/InteractiveFictionCIS700/blob/master/NLP_for_Text_Adventure_Games_part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP for Text Adventure Games - part 2

In this notebook, we will introduce you to word embeddings.  Word embeddings are a way of representing the meaning of words using vectors.  This style of meaning representation is different from what we saw with WordNet.  Rather than manually organizing words into a hierarchy and breaking words into their distinct senses like WordNet does, word vectors are computed automatically based on their co-occurrence with other words in very large collections of texts (which are called _corpora_ in NLP).  Word vector representations do not give us hypernym, hyponym or antonym relationships by default, but they do give us an extremely straightforward way of computing the similarity of two words.  Similarity is computed by taking the cosine of the angle between the two vectors.  Vectors with a very small angle between them tend to correspond to words with very similar meanings.

Starting around 2013, several algorithms were developed to efficiently create word vectors.  The most famous of these is the Word2Vec algorithm developed by researchers at Google. 
If you'd like to learn more about how word vectors work, I recommend that you read the [Vector Semantics and Embeddings](https://web.stanford.edu/~jurafsky/slp3/6.pdf) chapter in the [Speech and Language Processing textbook](https://web.stanford.edu/~jurafsky/slp3/) by Jurafsky and Martin, who are releasing draft chapters for free online while they update the book.

## Pre-trained Word Embeddings and Magnitude 

We are going to install a software package called Magnitude that allows for the fast, efficient manipulation of word vectors.  If you'd like to learn more about it, you can read our [EMNLP 2018 paper about Magnitude](http://www.cis.upenn.edu/~ccb/publications/magnitude-fast-efficient-vector-embeddings-in-python.pdf), or you can read the [Magnitude developer documentation on GitHub](https://github.com/plasticityai/magnitude).

Then, we'll download a set of pre-trained word vectors that are stored in the Magnitude file format.  This file is several gigabytes large, so it will take a few minutes to download.

In [1]:
!pip3 install pymagnitude
!wget http://magnitude.plasticity.ai/glove/heavy/glove.6B.300d.magnitude
#!wget http://magnitude.plasticity.ai/word2vec/heavy/GoogleNews-vectors-negative300.magnitude

from pymagnitude import *
vectors = Magnitude("glove.6B.300d.magnitude")
#vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

Collecting pymagnitude
  Downloading https://files.pythonhosted.org/packages/0a/a3/b9a34d22ed8c0ed59b00ff55092129641cdfa09d82f9abdc5088051a5b0c/pymagnitude-0.1.120.tar.gz (5.4MB)
Building wheels for collected packages: pymagnitude
  Building wheel for pymagnitude (setup.py) ... done
  Created wheel for pymagnitude: filename=pymagnitude-0.1.120-cp36-cp36m-linux_x86_64.whl size=135918205 sha256=bed3105fbf044993670085742d8cfb5af5757f9d5faf95a34044cb64b2a47e61
  Stored in directory: /root/.cache/pip/wheels/a2/c7/98/cb48b9db35f8d1a7827b764dc36c5515179dc116448a47c8a1
Successfully built pymagnitude
Installing collected packages: pymagnitude
Successfully installed pymagnitude-0.1.120
--2020-02-01 04:21:42--  http://magnitude.plasticity.ai/glove/heavy/glove.6B.300d.magnitude
Resolving magnitude.plasticity.ai (magnitude.plasticity.ai)... 52.216.88.18
Connecting to magnitude.plasticity.ai (magnitude.plasticity.ai)|5

After the files have downloaded, we can start running Python and Magnitude!  We will load a set of vectors from the file that we just downloaded.

Once the vectors are loaded, the following code block shows how many vectors we have.  This is the number of words that have vector representations, which is the size of our **vocabulary**.

In [2]:
from pymagnitude import *
vectors = Magnitude("glove.6B.300d.magnitude")

print("The number of words with vector representations in this file is %s." % len(vectors))

The number of words with vector representations in this file is 400000.


## Word Vectors


We can see what the *dimensionality* of each vector is.  The dimensionality is just the length of the vector.  


In [3]:
vectors.dim

300

We can print out what a vector looks like.  It should have a bunch of real-valued numbers (positive or negative).  The number of values that we will see is *vectors.dim*.

In [4]:
vectors.query("troll").shape

(300,)

In [5]:
if "troll" in vectors:
  print(vectors.query("troll"))

[-0.0244411  0.0139053 -0.0925782 -0.00481   -0.05063    0.0802103
  0.0431387  0.0066427 -0.0033448  0.045577  -0.0484613  0.0071967
 -0.0044964  0.0520721 -0.0442212 -0.0151458  0.0492292  0.0470102
  0.0446977  0.0498603  0.0774734  0.0230924  0.0423744  0.0585978
 -0.0205354  0.0104937  0.0165054  0.0986903  0.0251441  0.0992028
 -0.0076652 -0.088784   0.0427017 -0.0448236  0.0673819  0.0090755
 -0.0025259 -0.0052761 -0.0015195  0.0118352  0.0006056  0.0918715
 -0.0879299  0.0160313 -0.0603762  0.0351043  0.0573174 -0.0273775
 -0.0387619 -0.0417181 -0.0443722  0.0006057 -0.0133144  0.0279044
 -0.0180323 -0.089854  -0.0289977  0.0308318  0.1055828  0.02625
  0.0264209  0.0280518  0.0556217 -0.0365645  0.0046129  0.0266582
 -0.0332288 -0.033619   0.0692844 -0.0413854 -0.0542569  0.1936116
  0.0003418  0.0223659 -0.0431944 -0.0068162  0.0140149 -0.0009213
  0.0102707 -0.0703579  0.033281  -0.0908879  0.0688744 -0.0347897
 -0.0144963  0.0502326  0.0899547  0.0121853  0.047636  -0.10852

That's what a troll looks like according to our model!  It looks just like [this](https://en.wikipedia.org/wiki/Troll_doll), right?  Well, not really, but the cool thing about vectors is that they allow us to say how similar two things are.  

## Vector similarity

Having vectors for two words allows us to see how similar they are.  We can ask, how similar are *trolls* and *ogres* versus *trolls* and *princesses*?  The result is a cosine similarity, which can range from -1 to 1; numbers closer to 1 indicate that the words are more similar.

In [6]:
print(vectors.similarity("trolls", "ogres"))
print(vectors.similarity("trolls", "princesses"))
print(vectors.similarity("princes", "princesses"))

0.6001749
0.28430298
0.5949442
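To make the similarity score less of a black box, here is a minimal sketch of cosine similarity written in plain Python.  The three-dimensional vectors are made up for illustration (they are not real GloVe values), and `cosine_similarity` is a helper defined here, not a Magnitude function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only (not real embeddings).
toy = {
    "troll": [0.9, 0.1, 0.3],
    "ogre": [0.8, 0.2, 0.4],
    "princess": [0.1, 0.9, 0.2],
}

sim_ogre = cosine_similarity(toy["troll"], toy["ogre"])
sim_princess = cosine_similarity(toy["troll"], toy["princess"])

# Vectors pointing in nearly the same direction get a score near 1.
print(sim_ogre > sim_princess)  # True
```

Because the score depends only on the angle, scaling a vector up or down does not change its similarity to other vectors.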


The Magnitude software allows you to query for the most similar word out of a list of words using the command *most_similar_to_given*, which takes a query word and then a list of other words to compare it to.

In [7]:
vectors.most_similar_to_given("troll", ["princess", "prince", "ogre", "knight"]) 

'ogre'

## Retrieving the Most Similar Words

We can also look for the word vectors that are most similar to a query word.  Here are the words that are most similar to *trolls*.  Try replacing the word *trolls* with whatever word you want, and re-running this cell (by pressing the play button again), and see what the most similar words are to the word that you entered.

The Magnitude package uses an approximate k-nearest neighbors algorithm to efficiently retrieve the most similar vectors.  

In [8]:
vectors.most_similar_approx("trolls", topn = 20)

[('goblins', 0.6100390574368859),
 ('ogres', 0.6001749071757398),
 ('elves', 0.55979357815532),
 ('troll', 0.5146559037664176),
 ('ghouls', 0.4178171184355435),
 ('hobbits', 0.41689985629806614),
 ('centaurs', 0.41310322261340815),
 ('unicorns', 0.40355174882397904),
 ('humanoids', 0.4018230024045053),
 ('undead', 0.3909448179681192),
 ('witches', 0.3878153878952091),
 ('mermaids', 0.3858897554172245),
 ('serpents', 0.37686437645467663),
 ('genies', 0.3591981580827621),
 ('dwarfs', 0.3572682090842818),
 ('gargoyles', 0.351698249026839),
 ('demigods', 0.34777196255617326),
 ('punks', 0.3431107973310006),
 ('weirdos', 0.3332822877547912),
 ('goliath', 0.3286498431144693)]
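For contrast with the approximate search, an exact nearest-neighbor search can be sketched as scoring every word in the vocabulary and sorting.  This is only an illustration over a made-up toy vocabulary; `most_similar_exact` and `cosine_similarity` are hypothetical helpers defined here, not part of Magnitude.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_exact(query, embeddings, topn=3):
    """Rank every other word by cosine similarity to `query` (brute force)."""
    scores = [
        (word, cosine_similarity(embeddings[query], vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors, for illustration only (not real embeddings).
toy = {
    "trolls": [0.9, 0.1, 0.3],
    "goblins": [0.85, 0.15, 0.35],
    "ogres": [0.8, 0.2, 0.4],
    "princesses": [0.1, 0.9, 0.2],
}

neighbors = most_similar_exact("trolls", toy, topn=2)
print([word for word, _ in neighbors])  # ['goblins', 'ogres']
```

Brute force compares the query against every vector, so its cost grows with the vocabulary size.  With hundreds of thousands of words, that cost is the motivation for approximate methods.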

## Finding the most similar command
For this part of the homework, we will ask you to write a method that takes in the player's command and finds the most similar command in the set of commands that your game knows how to parse.  

You can construct a sentence embedding for a command by taking the component-wise average of the vectors for the words in the command.  The sentence embedding will have the same length as a word embedding.  You can compare a player's command against each of the known commands by constructing vectors for all of them, and then using Magnitude's `similarity` function.

In [0]:
import numpy as np

# FOR YOU TO DO: Write these functions
def construct_sentence_vector(command, vectors):
  sentence_vector = np.zeros(shape=(vectors.dim,))
  for word in command.split():
    word_vector = vectors.query(word)
    # TODO - Do something
  return sentence_vector

def find_most_similar_command(user_command, known_commands, vectors):
  # TODO - Do something
  return known_commands[0]

construct_sentence_vector("get fish", vectors)
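As a hint about what "component-wise average" means, here is a small sketch using made-up three-dimensional vectors in plain Python.  The helper name `average_vectors` and the toy values are assumptions for illustration; your own solution should work with the vectors returned by `vectors.query`.

```python
def average_vectors(word_vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(word_vectors)
    # zip(*word_vectors) groups the first components together, then the second, etc.
    return [sum(components) / count for components in zip(*word_vectors)]

# Made-up 3-dimensional vectors, for illustration only (not real embeddings).
toy = {
    "get": [1.0, 0.0, 2.0],
    "fish": [3.0, 4.0, 0.0],
}

sentence = average_vectors([toy[word] for word in "get fish".split()])
print(sentence)  # [2.0, 2.0, 1.0]
```

The result has the same number of components as each word vector, which is why it can be compared against other sentence vectors with the same similarity measure.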

## Solving Word Analogy Problems (Optional)

Famously, word2vec was shown to be able to solve many word analogy problems like "***man*** is to ***king*** as ***woman*** is to **-----**".  It does this by performing some vector arithmetic.   We take the vector for *king*, subtract the vector for *man*, and then add the vector for *woman*:<p>+ *king* <p>- *man*<p>+ *woman*<p>The result is a vector.  To figure out what word is closest to it, we find the most similar word vectors to the vector that resulted from our arithmetic. 

Magnitude allows us to do this in the following way:

In [0]:
vectors.most_similar(positive = ["king", "woman"], negative = ["man"])
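The arithmetic behind this call can be sketched by hand.  The example below uses made-up three-dimensional vectors chosen so that the analogy works out exactly; real embeddings are noisier.  Leaving the three input words out of the candidate set is an assumption of this sketch, and `cosine_similarity` is a helper defined here, not a Magnitude function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only (not real embeddings).
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, 0.0],
}

# king - man + woman, computed one component at a time.
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Assumption of this sketch: skip the words that appear in the question itself.
candidates = {w: v for w, v in toy.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(best)  # queen
```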

These gender-based analogy problems also reveal that gender bias is encoded in word embeddings.  This was shown in the paper [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-d).  



## What Can You Do with a Sword? (Optional)

Nancy Fulda is one of our invited speakers for later in the term.  She and her co-authors wrote a research paper called [What Can You Do with a Rock? Affordance Extraction via Word Embeddings](https://www.ijcai.org/Proceedings/2017/0144.pdf) that developed an algorithm to extract verb associations for words using word embeddings.  

They start with a set of verb/noun pairs that are similar to text adventure commands like:

`[‘sing song’, ‘drink water’, ‘read book’, ‘eat food’, ‘wear coat’, ‘drive car’, ‘ride horse’, ‘give gift’, ‘attack enemy’, ‘say word’, ‘open door’, ‘climb tree’, ‘heal wound’, ‘cure disease’, ‘paint picture’]`

Their algorithm can then use this list of commands and a set of pre-trained word vectors to discover verbs associated with ‘sword’, returning the following:
`[‘vanquish’, ‘duel’, ‘unsheathe’, ‘wield’, ‘behead’, ‘battle’, ‘impale’, ‘overpower’]`.

If you're interested, you could reimplement their method for this week's homework.