<a href="https://colab.research.google.com/github/saralieber/CS_Studio/blob/master/Review_Ch7_Part2_Text_to_Vectors_Meanings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deriving meaning from words

Is there a way to represent all English words so they have this "closer in space is closer in meaning" property that we saw with colors represented as their RGB properties?

To answer this, we have to first think of what *meaning* means.

One theory (Distributional Hypothesis) popular among computational linguists is that linguistic items with similar contexts have similar meanings.

In other words, a word's meaning is just a big list of all the contexts it occurs in. Two words are closer in meaning if they share contexts.

How do we turn this insight into a system for creating general-purpose vectors that capture the meaning of words?

Let's use a small source text to begin with, such as this excerpt from Dickens:

    It was the best of times, it was the worst of times.

This spreadsheet tries to capture the context of words. 
![dickens contexts](http://static.decontextualize.com/snaps/best-of-times.png)

The spreadsheet has one column for every possible context, and one row for every word. The value in each cell is the number of times the word occurs in the given context. The numbers in a word's row constitute that word's vector; for example, the vector for the word `of` is

    [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
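The table can be rebuilt in a few lines of Python. This is a minimal sketch, not the method used to make the spreadsheet: it assumes a context is the pair (word before, word after), that `START` and `END` markers pad the sentence, and that columns are ordered by first appearance. The names `counts`, `contexts`, and `context_vector` are made up for illustration.

```python
import re
from collections import Counter, defaultdict

text = "It was the best of times, it was the worst of times."

# Lowercase the text, keep only the words, and pad with boundary markers (an assumption).
tokens = ["START"] + re.findall(r"[a-z]+", text.lower()) + ["END"]

counts = defaultdict(Counter)  # word -> Counter of the contexts it occurs in
contexts = []                  # every distinct context, in order of first appearance

for before, word, after in zip(tokens, tokens[1:], tokens[2:]):
    context = (before, after)
    if context not in contexts:
        contexts.append(context)
    counts[word][context] += 1

def context_vector(word):
    """Return one count per context column for the given word."""
    return [counts[word][c] for c in contexts]

print(context_vector("of"))  # [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
```

Under these assumptions there are ten distinct contexts, and the vector for `of` matches the one shown above.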

You could use the same distance formula we defined before to get useful information about which vectors in this space are similar to each other.

In particular, the vectors for `best` and `worst` are actually the same (a distance of zero), since they occur only in the same context (`the __ of`).
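To make the comparison concrete, here is a small sketch that assumes the distance formula in question is Euclidean distance. The vector for `of` is the one quoted above; the column position of the `the __ of` context in the `best` and `worst` vectors is an assumed ordering, and the variable names are for illustration only.

```python
import math

of_vec = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
best_vec = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # one occurrence, in the context `the __ of`
worst_vec = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # same single context as `best`

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(best_vec, worst_vec))  # 0.0
print(distance(best_vec, of_vec))     # sqrt(3), about 1.732
```

`best` and `of` share no contexts, so every non-zero cell contributes to the distance between them.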

Of course, the conventional way of thinking about "best" and "worst" is that they're *antonyms*, not *synonyms*. But they're also clearly two words of the same kind, with related meanings (through opposition), a fact that is captured by this distributional model.

In many texts, there will be many thousands if not millions of possible contexts. It turns out, though, that many of these dimensions (contexts) will end up being superfluous and can either be eliminated or combined with other dimensions without significantly affecting the predictive power of the resulting vectors.

The process of getting rid of superfluous dimensions in a vector space is called *dimensionality reduction*.
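The text does not name a particular method, so the sketch below uses truncated singular value decomposition (SVD), one common choice, as an illustration. The count matrix `X` is invented for the example and `k` is an arbitrary target dimension.

```python
import numpy as np

# Toy word-by-context count matrix: 4 words (rows) x 6 contexts (columns).
# The values are invented for illustration.
X = np.array([
    [2, 0, 1, 0, 0, 0],
    [2, 0, 1, 0, 0, 0],
    [0, 3, 0, 1, 0, 1],
    [0, 2, 0, 1, 1, 0],
], dtype=float)

# Factor X into U * diag(S) * Vt, with singular values sorted from largest to smallest.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                       # number of dimensions to keep
reduced = U[:, :k] * S[:k]  # each row is now a k-dimensional word vector

print(reduced.shape)  # (4, 2)
```

Words with identical rows in `X` (the first two here) still have identical reduced vectors, so the "same contexts, same vector" property survives the reduction.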

The question of how to identify a "context" is difficult to answer. 

You might want to...

*   Use the word before and after the given word (e.g., see example above)
*   Use a larger window (e.g., the two words before and after the given word)
*   Use a non-contiguous window (e.g., skip a word before and after the given word)
*   Look at larger syntactic structure: what are the syntactic contexts you find the word in?
*   Exclude certain "function" words like "the" and "of"
*   Lemmatize the words before you begin your analysis so two occurrences with different "forms" of the same word count as the same context

These are all questions open to research and debate.
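A couple of the choices listed above (window size and excluding function words) can be sketched as parameters of one small function. This is only an illustration: `window_contexts`, its parameters, and the `START`/`END` padding are assumptions, not part of any library.

```python
def window_contexts(tokens, size=1, stop_words=frozenset()):
    """Yield (word, context) pairs; a context is the `size` words on each side."""
    kept = [t for t in tokens if t not in stop_words]
    padded = ["START"] * size + kept + ["END"] * size
    for i in range(size, len(padded) - size):
        before = tuple(padded[i - size:i])
        after = tuple(padded[i + 1:i + 1 + size])
        yield padded[i], (before, after)

tokens = "it was the best of times it was the worst of times".split()

one_word = list(window_contexts(tokens, size=1))
two_words = list(window_contexts(tokens, size=2))
no_function = list(window_contexts(tokens, size=1, stop_words={"the", "of"}))

print(one_word[3])     # ('best', (('the',), ('of',)))
print(two_words[3])    # ('best', (('was', 'the'), ('of', 'times')))
print(no_function[2])  # ('best', (('was',), ('times',)))
```

With "the" and "of" removed, `best` and `worst` still share a context (`was __ times`), but a different one than before, which shows how much the resulting vectors depend on the definition of context.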




## GloVe vectors

You don't have to create your own word vectors from scratch. Many researchers have made downloadable databases of pre-trained vectors.

One such project is Stanford's [Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/). These 300-dimensional vectors are included with spaCy, and they're the vectors we'll use for this activity. They come with `en_core_web_md`.

In [0]:
import spacy
!python -m spacy download en_core_web_md # download the medium English model, which includes the word vectors
import en_core_web_md
nlp = en_core_web_md.load()

In [3]:
nlp.vocab.has_vector('frankenstein') # Check to make sure word vectors have been loaded

True

In [0]:
dogv = nlp.vocab['dog'].vector # get the 300-dimensional vector for dog

In [5]:
type(dogv)

numpy.ndarray

In [0]:
dog_list = dogv.tolist()

In [7]:
len(dog_list) # 300

300

In [8]:
dog_list[:10]

[-0.4017600119113922,
 0.37057000398635864,
 0.02128100022673607,
 -0.3412500023841858,
 0.04953800141811371,
 0.29440000653266907,
 -0.17375999689102173,
 -0.2798199951648712,
 0.06762199848890305,
 2.169300079345703]

In [0]:
# The following function gets the vector for a given string from spaCy's vocabulary

def get_vec(s: str) -> list:
  return nlp.vocab[s].vector.tolist()

In [10]:
get_vec('dog') == dog_list # should be True

True

In [11]:
# There is also a vector for words not in the vocab. It is all zeroes.

zero_vec = get_vec('askfsda') # Not in vocab
zero_vec.count(0) # 300 zeroes, i.e., all zeroes

300

In [12]:
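# Cosine similarity: the dot product of two vectors divided by the product of their lengths
import numpy as np

def cosine_similarity(a, b):
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The following checks whether the cosine similarity between `dog` and `puppy` is larger than the similarity between `trousers` and `octopus`
cosine_similarity(get_vec('dog'), get_vec('puppy')) > cosine_similarity(get_vec('trousers'), get_vec('octopus'))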
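One caveat for words that are not in the vocabulary: cosine similarity divides by the lengths of both vectors, so it is undefined when either one is the all-zero vector. Below is a minimal sketch of a guard, assuming NumPy is available and that returning `0.0` in that case is acceptable for your purposes. The name `safe_cosine_similarity` is hypothetical.

```python
import numpy as np

def safe_cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # assumption: treat the undefined case as "no similarity"
    return float(np.dot(a, b) / denom)

print(safe_cosine_similarity([1, 0], [0, 1]))  # 0.0 (perpendicular vectors)
print(safe_cosine_similarity([0, 0], [3, 4]))  # 0.0 (zero vector, guarded)
```

Whether `0.0`, `None`, or an exception is the right response to a zero vector depends on how the result is used downstream.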