## Import packages

You'll first need to install either the medium or the large ```spaCy``` model!

Run the following in a terminal:

```
cd cds-language
source ./lang101/bin/activate
python -m spacy download en_core_web_md
deactivate
```

In [2]:
# preprocessing
import os
import pandas as pd
from tqdm import tqdm # progress bar

# nlp
import spacy
nlp = spacy.load("en_core_web_md") # medium model that contains word vectors which the small model does not contain

# gensim
from gensim.models import Word2Vec
import gensim.downloader

## Using pretrained vectors in ```spaCy```

spaCy's medium and large models come with pretrained word embeddings, so each word has a vector representing it. These embeddings are generally less accurate than the pretrained models available through Gensim, but they are simpler and more intuitive to use. Their usefulness is also limited because spaCy does not make it very clear how they were trained.

In [32]:
# Looking at the vector for the word "denmark"
nlp("denmark").vector

array([ 0.12209  , -0.017989 , -0.046495 , -0.011551 ,  0.56601  ,
       -0.41398  ,  0.33022  , -0.33376  , -0.13001  , -0.12592  ,
       -0.87791  , -0.23211  ,  0.062879 ,  0.36644  ,  0.054478 ,
        0.18169  , -0.17619  ,  0.5006   ,  0.70912  ,  0.072825 ,
        0.7663   ,  0.32764  ,  0.32388  , -0.39116  , -0.44868  ,
       -0.32976  , -0.076284 , -0.0095968, -0.15763  ,  0.66581  ,
       -0.3471   , -0.35091  ,  0.0083347,  0.47103  ,  0.25362  ,
        0.33329  ,  0.13503  ,  0.055926 ,  0.30558  , -0.10581  ,
        0.0025447, -0.22811  , -0.25086  ,  0.24853  ,  0.041999 ,
       -0.45543  ,  0.33007  , -0.63446  ,  0.20003  , -0.16201  ,
       -0.73001  ,  0.18834  , -0.08403  , -0.74857  , -0.041885 ,
        0.013566 ,  0.12618  , -0.086973 ,  0.6415   , -0.64083  ,
       -0.33979  , -0.30045  ,  0.4442   , -0.16814  ,  0.042421 ,
        0.38954  ,  0.13112  ,  0.050652 ,  0.028356 , -0.19597  ,
       -0.33335  ,  0.51083  ,  0.031252 , -0.46036  ,  0.9636

^This vector is an array of 300 numbers that encode the word "denmark". Each number is a weight learned during training. In a word2vec skip-gram model, for example, a logistic classifier is trained to predict the context words of each word, and the trained weights then become the vector for that word. Hence, we have 300 weights representing a single word. Every word in the vocabulary has its own representation, i.e. its own set of weights.

__Comparing individual words__

We can calculate similarity between words.

In [4]:
# Creating spacy nlp objects for different words.
banana = nlp("banana")
apple = nlp("apple")
scotland = nlp("scotland")
denmark = nlp("denmark")

__Inspect word similarities__

With spaCy we can investigate the similarity between word vectors

In [5]:
banana.similarity(apple)

0.5831844567891399

In [6]:
banana.similarity(scotland)

0.0703526672694443

^Banana and apple are fairly similar (0.58), while banana and Scotland are very dissimilar (0.07).

In [7]:
denmark.similarity(scotland)

0.5694890898124977
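
The similarity scores above are cosine similarities between the two vectors. As a minimal sketch of what that means, the function below computes the cosine of the angle between two vectors in plain Python. The three-dimensional `toy_*` vectors are invented for illustration and are not real spaCy embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", made up for this example only
toy_banana = [0.9, 0.1, 0.0]
toy_apple = [0.8, 0.3, 0.1]
toy_scotland = [0.0, 0.2, 0.9]

# Vectors pointing in a similar direction score close to 1,
# vectors pointing in unrelated directions score close to 0
print(cosine_similarity(toy_banana, toy_apple))
print(cosine_similarity(toy_banana, toy_scotland))
```

Because the score depends only on the direction of the vectors, not on their length, two words can be similar even if one vector is much larger in magnitude than the other.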

__Document similarities__

With spaCy we can also look at the similarity between sentences/documents, not just individual words. spaCy does this by taking the vector for each word in the document and then averaging these vectors, so we get an average vector for each document. We can then compare these averaged document vectors with each other to see how similar the documents are.

Keep in mind that the results should be taken with a grain of salt, because an averaged vector is only a coarse summary of a document. Once documents consist of many words, the average blurs the contribution of the individual words and is heavily influenced by common words that the documents share, so comparing documents in this way is not very useful.

In [33]:
doc1 = nlp("I like bananas")
doc2 = nlp("I like apples")
doc3 = nlp("I come from Scotland")
doc4 = nlp("I live in Denmark")

In [9]:
doc1.similarity(doc3)

0.6828703253926357

In [10]:
doc3.similarity(doc4)

0.8041838664871435
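
To make the averaging step concrete, here is a minimal sketch in plain Python. The two-dimensional `toy_vectors` are invented for illustration, and `average_vector` is a hypothetical helper, not part of spaCy. The example shows how two words that are dissimilar on their own can end up in documents that look fairly similar, simply because the documents share other words.

```python
import math

# Hypothetical 2-dimensional word vectors, made up for this example only
toy_vectors = {
    "i": [1.0, 1.0],
    "like": [0.9, 1.1],
    "bananas": [1.0, -1.0],
    "scotland": [-1.0, 1.0],
}

def average_vector(tokens, vectors):
    """Element-wise mean of the vectors for the given tokens."""
    rows = [vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The two content words point in opposite directions ...
word_sim = cosine_similarity(toy_vectors["bananas"], toy_vectors["scotland"])

# ... but the shared words "i" and "like" pull the document averages together
doc_sim = cosine_similarity(
    average_vector(["i", "like", "bananas"], toy_vectors),
    average_vector(["i", "like", "scotland"], toy_vectors),
)

print(word_sim, doc_sim)
```

This is one reason why short documents that share words such as "I" can receive a high similarity score even when their topics differ.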

## Working with ```gensim```

If we want more accurate word embeddings, we can work with the pretrained word embeddings available through gensim. Gensim provides several different pretrained models (see below).

__List available pretrained models__

In [11]:
list(gensim.downloader.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

^These are the different pretrained word embedding models provided by gensim. We can see that three different approaches are represented: word2vec, GloVe, and fastText. They differ in how they are trained on the data. Their output is essentially the same (a vector per word), but the algorithm used to calculate the embeddings/weights is not the same across the approaches. FastText models contain subword embeddings, which word2vec models do not. FastText splits words into subwords, which allows for greater generalizability: the model can build vectors for out-of-vocabulary words, which a word2vec model cannot do.
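
The subword idea can be sketched in a few lines. fastText represents a word by its character n-grams, with `<` and `>` marking the word boundaries. The helper `char_ngrams` and its parameter names `n_min` and `n_max` are hypothetical names chosen for this illustration, and the sketch leaves out details of the real implementation such as how the n-gram vectors are stored and combined.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares many n-grams with a known word,
# which is what lets a subword model build a vector for it
shared = set(char_ngrams("denmark")) & set(char_ngrams("denmarks"))
print(len(shared) > 0)
```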

__Download a pretrained model__

In [12]:
# Download the GloVe model that has been trained on the wiki gigaword corpus
pretrained_vectors = gensim.downloader.load('glove-wiki-gigaword-100')



In many cases it is useful to train your own word embedding model, but for downstream NLP tasks, e.g. text classification, we can just use a pretrained model that has been trained on a very large corpus to turn a text into a numerical representation. 

__Inspect vector for specific word__

In [13]:
# Now we can inspect the vector for the word denmark generated by this particular model
pretrained_vectors['denmark']

array([-0.28875  , -0.19655  ,  0.26046  ,  0.086723 ,  0.25918  ,
       -0.1897   , -0.54331  ,  0.009582 , -0.30836  , -0.0031624,
        0.33199  , -0.29428  , -0.24047  ,  1.19     , -0.084937 ,
        0.11623  , -0.21052  , -0.54361  , -0.99796  ,  0.12067  ,
        0.14138  ,  0.65072  ,  1.2077   ,  1.1735   ,  0.23783  ,
       -0.98251  ,  0.41053  ,  0.27652  ,  0.52805  , -0.48693  ,
       -0.8589   ,  0.35657  ,  0.71596  ,  0.17604  ,  0.52895  ,
       -0.2974   ,  0.44817  ,  0.40725  , -0.98995  , -0.90026  ,
       -0.57812  ,  0.050827 ,  0.32352  ,  0.087861 , -0.023458 ,
       -0.34776  ,  0.88943  ,  0.10766  ,  0.46515  , -0.20827  ,
        0.59546  ,  0.16455  , -0.45227  ,  0.6851   , -0.87772  ,
       -1.7848   , -0.37841  , -0.25611  ,  0.15408  ,  0.067509 ,
        0.71967  , -0.31071  , -0.15901  , -0.066492 ,  0.50181  ,
        0.99762  , -1.1725   ,  1.5181   ,  0.14916  , -0.11483  ,
        0.072389 , -0.66993  ,  0.36882  ,  0.37702  ,  0.3675

__Find most similar words to target__

With gensim we can find the most similar words to a target word rather than comparing two specific words. This allows us to explore the semantic space of a particular word. 

In [14]:
pretrained_vectors.most_similar('denmark')

[('sweden', 0.8624401688575745),
 ('norway', 0.828826367855072),
 ('netherlands', 0.8032277822494507),
 ('finland', 0.7628087997436523),
 ('austria', 0.7483422756195068),
 ('germany', 0.7414340972900391),
 ('belgium', 0.7279534935951233),
 ('hungary', 0.7076718807220459),
 ('luxembourg', 0.6797298192977905),
 ('switzerland', 0.6770632266998291)]

We can see that the most similar words to "denmark" are very close in terms of geography. 

__Compare specific words__

We can also compare two specific words.

In [15]:
pretrained_vectors.similarity('denmark', 'scotland')

0.61651856

In [16]:
pretrained_vectors.similarity('denmark', 'sweden')

0.86244005

__Vector algebra__

We can use word embeddings to study structural relationships. 

*Man* is to *king* as *woman* is to ...

In [34]:
pretrained_vectors.most_similar(positive=['king', 'woman'], 
                                negative=['man'], 
                                topn=1)

[('queen', 0.7698541283607483)]

This is just simple algebra:

[V(king) - V(man)] + V(woman)

We subtract the vector for man from the vector for king and add the vector for woman. The word whose vector lies closest to the result is queen.

Here we are working with gender as a dimension.

In [18]:
pretrained_vectors.most_similar(positive=['walk', 'swim'], 
                           negative=['walked'], 
                           topn=1)

[('swimming', 0.6574411988258362)]

Once again it is just simple algebra:


[V(walk) - V(walked)] + V(swim)

Here we are working with tense as a dimension.

In [19]:
pretrained_vectors.most_similar(positive=['berlin', 'denmark'], 
                           negative=['germany'], 
                           topn=1)

[('copenhagen', 0.7726544737815857)]

[V(berlin) - V(germany)] + V(denmark)

Here we are working with the relationship between a country and its capital as a dimension.
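
The analogies above all follow the same recipe: add the `positive` vectors, subtract the `negative` vectors, and look for the nearest remaining word. Below is a minimal sketch of that recipe in plain Python. The two-dimensional `toy_vectors` are invented so that one axis loosely stands for "royalty" and the other for "gender", and `analogy` is a hypothetical helper. It is a simplification, not gensim's implementation, which among other things normalizes the vectors before combining them.

```python
import math

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors):
    """Word closest to sum(positive) - sum(negative), skipping the input words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

# [V(king) - V(man)] + V(woman)
print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors))
# queen
```

Excluding the input words from the candidates matters: without it, the nearest neighbour of the result is often one of the query words themselves.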

__Odd one out!__

We can also take a list of words and find the word that does not match, i.e. the word that is most dissimilar to the others (furthest away in the embedding space).

In [20]:
pretrained_vectors.doesnt_match(["france", "germany", "dog", "japan"])



'dog'
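
One way to find the odd one out is to average the normalized vectors of all words in the list and return the word that is least similar to that average. The sketch below illustrates this idea in plain Python. The two-dimensional `toy_vectors` are invented for the example, and `odd_one_out` is a hypothetical helper rather than gensim's own code.

```python
import math

# Hypothetical 2-dimensional word vectors, made up for this example only
toy_vectors = {
    "france": [0.9, 0.1],
    "germany": [0.8, 0.2],
    "japan": [0.7, 0.3],
    "dog": [0.1, 0.9],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Word whose unit vector is least similar to the group's mean direction."""
    units = [unit(vectors[w]) for w in words]
    mean = unit([sum(column) / len(units) for column in zip(*units)])
    # For unit vectors the dot product equals the cosine similarity
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]

print(odd_one_out(["france", "germany", "dog", "japan"], toy_vectors))
# dog
```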

Word embedding models capture something meaningful about words and language in general, which can be seen from the word association tasks (e.g. man is to king as woman is to queen). They are also able to encode grammatical information (e.g. walk is to walked as swim is to swam). Hence, there is some kind of structural information in the language that the word embedding model is able to capture in terms of semantic relationships and co-occurrence.

## Train your own models

Training your own model with gensim is simple.

__Load data with pandas__

In [21]:
filename = os.path.join("..", "data", "labelled_data", "fake_or_real_news.csv")

In [22]:
data = pd.read_csv(filename)

In [23]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


__Tokenize with ```spaCy```__

In [24]:
sentences = []

for post in tqdm(data["text"]):
    # create a temporary list
    tmp_list = []
    # create spaCy doc object
    doc = nlp(post.lower()) # we need everything to be lowercase to not consider the same word as different words just because of casing
    # loop over
    for token in doc:
        tmp_list.append(token.text)
    # append tmp_list to sentences
    sentences.append(tmp_list)

100%|██████████| 6335/6335 [15:07<00:00,  6.98it/s]


__Train model with ```gensim```__

In [25]:
model = Word2Vec(sentences=sentences,  # input data: a list of tokenized documents
                 size=50,              # embedding size (the number of dimensions); this parameter is called vector_size in gensim 4.0 and later
                 window=3,             # context window (the maximum number of words before and after the target word)
                 sg=1,                 # cbow or skip-gram (cbow=0, sg=1)
                 negative=5,           # number of negative samples per positive example; more negative samples means longer training
                 min_count=3,          # remove rare words. Words that appear fewer than 3 times will be excluded.
                 workers=6)            # number of worker threads. More threads allow more parallel processing on multicore machines.
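
To get a feel for what `window` and `min_count` do, the sketch below generates the (target, context) word pairs that a skip-gram model would learn from. `skipgram_pairs` is a hypothetical helper written for this illustration, and `toy_sentences` is invented data. The sketch leaves out negative sampling and other details of the real training loop in gensim.

```python
from collections import Counter

def skipgram_pairs(sentences, window=3, min_count=1):
    """Yield (target, context) pairs from a list of tokenized sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    for sentence in sentences:
        # Drop rare words first, which is the role of min_count
        kept = [token for token in sentence if counts[token] >= min_count]
        for i, target in enumerate(kept):
            start = max(0, i - window)
            stop = min(len(kept), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    yield target, kept[j]

# Invented toy data: "spoke" and "left" occur only once each
toy_sentences = [["the", "president", "spoke"], ["the", "president", "left"]]

# With min_count=2 the rare words disappear before any pairs are built
pairs = list(skipgram_pairs(toy_sentences, window=1, min_count=2))
print(pairs)
```

A larger `window` produces more pairs per word and tends to capture broader, more topical associations, while a higher `min_count` shrinks the vocabulary and removes words for which there is too little evidence to learn a reliable vector.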

__Inspect most similar word__

In [35]:
model.wv.most_similar('obama', topn=10)

[('barack', 0.9361744523048401),
 ('administration', 0.8315683603286743),
 ('president', 0.7966960668563843),
 ('biden', 0.7546178102493286),
 ('rouhani', 0.7359713912010193),
 ('legacy', 0.7211123704910278),
 ('reclassify', 0.7160322666168213),
 ('congress', 0.7083786129951477),
 ('clouded', 0.7081670761108398),
 ('hardball', 0.6991458535194397)]

Part of the problem with word embedding models is that they are data-hungry: it takes a lot of data to create useful representations of words. This is something we often run into when we are training our own models, and it shows in the list above, where neighbours such as "reclassify", "clouded", and "hardball" have no obvious connection to "obama". Hence, when training our own model we need to consider whether we have enough data and how we perform the preprocessing.

__Compare words__

In [27]:
model.wv.similarity('jesus', 'god')

0.7994694