# Intro to Word Vector Models

Hi all,

The aim here is to demo different NLP functionalities from spaCy. Below I demo its word-vector representation capability.

Stand-alone words (e.g., "bank") by themselves often can't convey much info because *context* is missing (e.g., "commercial bank" or "river bank").

For a large number of common words, some idea about context (and the relative probabilities of occurring in different contexts) can be had if models are trained on large enough corpora (e.g., the Wikipedia corpus, Google News, Google Books etc.).

Imagine building a colossally high-dimensional **vocabulary** space wherein words similar in meaning and context sit closer to each other than to other words.

**Word embeddings** are real-number co-ordinate vectors representing a vocab's words in such a space (see example below).

To be useful, a set of word vectors for a vocabulary should capture (a) the meaning of words, (b) the relationships between words, and (c) the contexts in which different words occur, all as the words are used naturally.

They allow us to implicitly include external information from the world in our language understanding models. They thus have broad NLP applications in areas like sentiment analysis, text classification etc.
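To make the idea concrete, here is a minimal sketch with a made-up 3-D vocab space. The coordinates are invented purely for illustration (real embeddings have hundreds of dimensions and are learned from corpora), and `toy_vectors` and `euclidean` are hypothetical names, not part of spaCy.

```python
import math

# Hypothetical 3-D "embeddings", invented for illustration only
toy_vectors = {
    "cat":    [0.9, 0.8, 0.1],
    "dog":    [0.8, 0.9, 0.2],
    "mango":  [0.1, 0.2, 0.9],
    "banana": [0.2, 0.1, 0.8],
}

def euclidean(u, v):
    """Straight-line distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_cat_dog = euclidean(toy_vectors["cat"], toy_vectors["dog"])
d_cat_mango = euclidean(toy_vectors["cat"], toy_vectors["mango"])

print("cat-dog distance:  ", round(d_cat_dog, 3))
print("cat-mango distance:", round(d_cat_mango, 3))
```

In this toy space "cat" sits much nearer to "dog" than to "mango", which is the kind of geometry a good embedding is meant to have.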

In [1]:
# setup chunk
# !python -m spacy download en_core_web_sm

import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

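# to use the medium model with static word vectors instead:
# !python -m spacy download en_core_web_md
# import en_core_web_md; nlp = en_core_web_md.load()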
import pandas as pd
import time

## Using Word Vec in spaCy

Using pre-trained models in spaCy is incredibly convenient, given that they come built in.

spaCy has a number of models of different sizes available for use, in several languages (including English, Polish, German, Spanish, Portuguese, French, Italian, and Dutch).

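For English, we'll install the medium-sized *en_core_web_md* model, which includes 20k unique vectors with 300 dimensions. Note that the setup chunk above actually loads the small *en_core_web_sm* model, which ships without static word vectors; there, `.vector` falls back to context-sensitive tensors, which would explain the 384 dimensions in the outputs below.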

### Invoking Word Vecs of Particular Tokens

Below I demo how to invoke basic word-vec functionality in spaCy. Behold.


In [2]:
# process a sentence using the model
sent = "Cat and Dog should be closer together in vocab space as should Mango and Banana."

# routine to display token attributes (text, lemma, POS tag, dependency) as a DF
def token_attrib(sent0):
	doc = nlp(sent0)

	text=[]; lemma=[]; postag=[]; depcy=[]

	for token in doc:
		text.append(token.text)
		lemma.append(token.lemma_)
		postag.append(token.pos_)
		depcy.append(token.dep_)

	test_df = pd.DataFrame({'text':text, 'lemma':lemma, 'postag':postag, 'depcy':depcy})
	return test_df

# display the output DF
token_attrib(sent)

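| | text | lemma | postag | depcy |
|---|---|---|---|---|
| 0 | Cat | cat | PROPN | nsubj |
| 1 | and | and | CCONJ | cc |
| 2 | Dog | dog | PROPN | conj |
| 3 | should | should | VERB | aux |
| 4 | be | be | VERB | ROOT |
| 5 | closer | close | ADJ | advmod |
| 6 | together | together | ADV | advmod |
| 7 | in | in | ADP | prep |
| 8 | vocab | vocab | NOUN | amod |
| 9 | space | space | NOUN | pobj |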


In [3]:
# It's that simple - all of the vectors and words are assigned after this point. Get the vector for 'Dog':
sent_ann = nlp(sent)  # annotate sent
print("Third token in sentence is: ", sent_ann[2], "\n")

# check word vec length
print("Word-vector shape for the token: ", sent_ann[2], " is:\n", sent_ann[2].vector.shape, "\n")  # 384x1

# view the first few word-vec co-ords for two tokens
print("First 10 word-vec co-ords for the token *", sent_ann[2], "* are:\n", sent_ann[2].vector[:10])
print("==========\n")
print("First 10 word-vec co-ords for the token *", sent_ann[0], "* are:\n", sent_ann[0].vector[:10])

Third token in sentence is:  Dog 

Word-vector shape for the token:  Dog  is:
 (384,) 

First 10 word-vec co-ords for the token * Dog * are:
 [-1.9289579   4.1713147   3.2283213   2.3073325  -0.2678644   0.39294454
  0.4357926  -3.7268226  -1.0738019   0.3235892 ]

First 10 word-vec co-ords for the token * Cat * are:
 [ 1.143025    0.07028291  2.5424826   1.9754298  -1.2145972   3.1527033
  0.9029567  -2.3866327   1.2338722  -0.5432149 ]


Although we humans can't tell that those numbers represent 'dog' and 'cat' respectively, machines have no such problem.

Further, recognize what we have just achieved - reducing words to location co-ordinates in vocabulary space! 

This is analogous to how reducing text under the bag-of-words (BOW) model to document-term matrices (DTMs) enabled the application of all manner of matrix ops to text collections.

In other words, with word embeddings giving word locations in vocab space, the entire toolkit of spatial analytics can be brought to bear to analyze things like:
- similarities between words (how close they are in this space)
- differences between words (how far apart they are)
- context around words (which words occur most *around* focal words)
- vector sums of word aggregates (in a sentence, for example)
- etc!
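As a rough sketch of the first two bullets, the snippet below ranks words by distance from a focal word in a made-up 2-D vocab space. The coordinates and the helper `nearest` are hypothetical; with real spaCy vectors the same logic would run over `token.vector` arrays. It assumes Python 3.8+ for `math.dist`.

```python
import math

# Hypothetical 2-D coordinates, invented for illustration only
toy_space = {
    "dog":    (1.0, 1.0),
    "cat":    (1.2, 0.9),
    "lily":   (4.0, 5.0),
    "rose":   (4.2, 4.8),
    "banana": (8.0, 1.0),
}

def nearest(focal, space):
    """Return the other words sorted from closest to farthest."""
    others = [w for w in space if w != focal]
    return sorted(others, key=lambda w: math.dist(space[focal], space[w]))

print(nearest("dog", toy_space))
print(nearest("rose", toy_space))
```

The first entry of each list is the focal word's nearest neighbour, and the last is the word farthest away.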

### Computing Word-Vec similarity

A quick dash to the slides for the concept of 'cosine similarity', and then let's return here.

Let's compute cosine similarity between words of interest, viz. dog, cat, banana, rose, lily and cauliflower. Behold.
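Before the spaCy run, a reminder of the computation itself: cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below uses a function name (`cosine_similarity`) and toy vectors of my own; it illustrates the formula and is not spaCy's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel
orthogonal     = cosine_similarity([1.0, 0.0], [0.0, 3.0])    # at right angles
opposite       = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # pointing opposite ways

print(same_direction, orthogonal, opposite)
```

Scores run from -1 (opposite directions) through 0 (unrelated) to 1 (same direction); only direction matters, not vector length. This sketch does not guard against all-zero vectors, which would cause a division by zero.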

In [4]:
# the higher the cosine-similarity score, the more the words contextually relate to each other
tokens = nlp("dog cat banana rose lily cauliflower")  # annotate doc

text1=[]; text2=[]; simil_score=[]
# loop over every pair of tokens and record their similarity
for token1 in tokens:
    for token2 in tokens:
        text1.append(token1.text)
        text2.append(token2.text)
        simil_score.append(token1.similarity(token2))

# build DF for neat output
simil_df = pd.DataFrame({'word1':text1, 'word2':text2, 'similarity': simil_score} )        
simil_df

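| | word1 | word2 | similarity |
|---|---|---|---|
| 0 | dog | dog | 1.0 |
| 1 | dog | cat | 0.503085 |
| 2 | dog | banana | 0.292889 |
| 3 | dog | rose | 0.15284 |
| 4 | dog | lily | 0.17279 |
| 5 | dog | cauliflower | 0.126985 |
| 6 | cat | dog | 0.503085 |
| 7 | cat | cat | 1.0 |
| 8 | cat | banana | 0.471543 |
| 9 | cat | rose | 0.072137 |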


Results above are along expected lines, I hope (e.g., dog sits closer to cat than to cauliflower).

Again, it is important to note that vocab spaces like the one above are built from common words found in massive online corpora, using algorithms such as:
- [Google's Word2vec model](https://code.google.com/archive/p/word2vec/),
- [Stanford's Global Vectors for Word Representation (GloVe) model](https://nlp.stanford.edu/projects/glove/),
- and even [Facebook's fastText algorithm](https://github.com/facebookresearch/fastText).

So they aren't infallible; they reflect the contexts on which they were trained.

Of course, one can (and often should) train word embeddings specific to one's business domain for better prediction and explanation. spaCy (and *gensim*) allow for such custom-building of vocabs.

### Vector Sums in Metric Vocab Spaces

Vectors can be added and subtracted in vector spaces (which vocab spaces are, in 300+ dimensions). They can also be scaled and multiplied via dot products; cross products and inverses, by contrast, are not defined for general high-dimensional vectors.

Below I demo what happens when we do vector summation over natural word groupings (e.g., words in a sentence) to obtain a document-level vocab-space vector.

spaCy makes this incredibly convenient via the `.vector` attribute of the annotated document. The result is the mean of the vectors of all component tokens.
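The averaging itself can be sketched without spaCy. The token vectors below are invented, and `mean_vector` is a hypothetical helper; this mirrors the idea of averaging token vectors rather than spaCy's internal code.

```python
def mean_vector(token_vectors):
    """Element-wise average of equal-length token vectors."""
    n = len(token_vectors)
    return [sum(coords) / n for coords in zip(*token_vectors)]

# Hypothetical 3-D vectors for a three-token "sentence"
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

doc_vector = mean_vector(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

Each coordinate of the document vector is simply the average of that coordinate across the tokens, so the result has the same dimensionality as a single word vector.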

Behold.

In [6]:
# simulate 4 sentences
sent1 = "Mozart's ninth symphony is considered his best."
sent2 = "The Unforgiven is a cult western."
sent3 = "Machine Learning basics for Managers."
sent4 = "Elon Musk wants to colonize Mars."

# use nlp() to annotate them into nlp objects
doc1 = nlp(sent1)
doc2 = nlp(sent2)
doc3 = nlp(sent3)
doc4 = nlp(sent4)

# view a few nums from the mean vector for the entire sentence 
# (useful for sentence classification etc.)
print(doc1.vector[:8],"\n")  
print(doc2.vector[:8],"\n")  
print(doc3.vector[:8],"\n")  

[ 0.15729392  0.31356293  0.5392628   1.5597563   1.2362208   0.4116027
  0.43689406 -0.04755433] 

[ 0.30408886  0.1734363   0.9671667   4.356767    1.422149   -0.48655793
 -0.18334758 -0.6775466 ] 

[ 0.4242241   1.6088556   1.5877622   1.8958215  -0.07814423  1.3804523
 -1.5191032  -2.0572953 ] 



In [7]:
# now examine which sents are closer to each other in context / meaning terms as per word-vec
sent_1=[]; sent_2=[]; simil_score=[]

# map sentence numbers to their raw text and annotated docs
sents = {1: sent1, 2: sent2, 3: sent3, 4: sent4}
docs = {1: doc1, 2: doc2, 3: doc3, 4: doc4}

for i in [1, 2, 3]:
    for j in [2, 3, 4]:
        sent_1.append(sents[i])
        sent_2.append(sents[j])
        simil_score.append(docs[i].similarity(docs[j]))
        
# store and display as DF
sent_simil_df = pd.DataFrame({'sent1':sent_1, 'sent2':sent_2, 'simil_score':simil_score})        
sent_simil_df

| | sent1 | sent2 | simil_score |
|---|---|---|---|
| 0 | Mozart's ninth symphony is considered his best. | The Unforgiven is a cult western. | 0.626263 |
| 1 | Mozart's ninth symphony is considered his best. | Machine Learning basics for Managers. | 0.281639 |
| 2 | Mozart's ninth symphony is considered his best. | Elon Musk wants to colonize Mars. | 0.415729 |
| 3 | The Unforgiven is a cult western. | The Unforgiven is a cult western. | 1.0 |
| 4 | The Unforgiven is a cult western. | Machine Learning basics for Managers. | 0.398735 |
| 5 | The Unforgiven is a cult western. | Elon Musk wants to colonize Mars. | 0.45667 |
| 6 | Machine Learning basics for Managers. | The Unforgiven is a cult western. | 0.398735 |
| 7 | Machine Learning basics for Managers. | Machine Learning basics for Managers. | 1.0 |
| 8 | Machine Learning basics for Managers. | Elon Musk wants to colonize Mars. | 0.596288 |


Well, what do you think? Do the similarities make common sense? 

We can compare similarity between text aggregates such as paragraphs and documents just as well. 

Recall from MKTR that we did cluster analysis on the DTM to find docs that were 'similar' in their word features.

This is different in that we're using externally trained models to implicitly provide context during the similarity score computation.
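A small sketch of that difference, under toy assumptions: two sentences with no words in common score zero under a bag-of-words comparison, while averaged word vectors can still place them close together. The 2-D vectors are invented stand-ins for a pre-trained model, and all helper names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bow_similarity(words1, words2):
    """Cosine similarity between raw word-count vectors (one DTM row each)."""
    c1, c2 = Counter(words1), Counter(words2)
    vocab = sorted(set(c1) | set(c2))
    return cosine([c1[w] for w in vocab], [c2[w] for w in vocab])

def embedding_similarity(words1, words2, vectors):
    """Cosine similarity between the mean word vectors of two word lists."""
    def mean(words):
        return [sum(x) / len(words) for x in zip(*(vectors[w] for w in words))]
    return cosine(mean(words1), mean(words2))

# Hypothetical 2-D "pre-trained" vectors, invented for illustration only
toy_vectors = {
    "puppy": [0.9, 0.1], "barks":  [0.8, 0.3],
    "dog":   [1.0, 0.2], "growls": [0.7, 0.4],
}

s1, s2 = ["puppy", "barks"], ["dog", "growls"]
bow_score = bow_similarity(s1, s2)
emb_score = embedding_similarity(s1, s2, toy_vectors)
print("bag-of-words:", bow_score)
print("embeddings:  ", round(emb_score, 3))
```

The bag-of-words score is zero because the two sentences share no tokens, whereas the embedding score is high because the (toy) vectors encode that the words are related.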

Okay, back to the slides.

Voleti