### Coding Challenge #3: Natural Language Processing

In this Coding Challenge, you will cover **Word2vec **which is a popular algorithm for building vector representations of words (i.e. word embeddings). The concept behind Word2Vec is quite straightforward - an assumption is made that the meaning of a word can be inferred by the *context it appears in* or *the company it keeps*. This is similar to stating: “tell me about your friends, and I will tell who you are”. 

If **2 **words  have very similar neighbors (meaning: the context in which it is used is similar), then the words are most likely quite similar.

In this Coding Challenge, you will go through the process of training a Word2vec model with a sample set of documents and then examine certain attributes of the model. After that, you will train a Word2vec model with a large corpus of text and then ascertain the similarity among words in the corpus.


In [1]:
# https://radimrehurek.com/gensim/install.html
!pip install --upgrade gensim

Collecting gensim
[?25l  Downloading https://files.pythonhosted.org/packages/33/33/df6cb7acdcec5677ed130f4800f67509d24dbec74a03c329fcbf6b0864f0/gensim-3.4.0-cp36-cp36m-manylinux1_x86_64.whl (22.6MB)
[K    100% |████████████████████████████████| 22.6MB 2.0MB/s 
[?25hRequirement not upgraded as not directly required: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.14.3)
Requirement not upgraded as not directly required: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (0.19.1)
Requirement not upgraded as not directly required: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.11.0)
Collecting smart-open>=1.2.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/4b/69/c92661a333f733510628f28b8282698b62cdead37291c8491f3271677c02/smart_open-1.5.7.tar.gz
Collecting boto>=2.32 (from smart-open>=1.2.1->gensim)
[?25l  Downloading https://files.pythonhosted.org/packages/bd/b7/a88a67002b1185ed9a8e8a6ef15266728c2

In [2]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /c

[nltk_data]    | Downloading package nps_chat to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/nps_chat.zip.
[nltk_data]    | Downloading package omw to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/omw.zip.
[nltk_data]    | Downloading package opinion_lexicon to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping corpora/opinion_lexicon.zip.
[nltk_data]    | Downloading package paradigms to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping corpora/paradigms.zip.
[nltk_data]    | Downloading package pil to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/pil.zip.
[nltk_data]    | Downloading package pl196x to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/pl196x.zip.
[nltk_data]    | Downloading package ppattach to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/ppattach.zip.
[nltk_data]    | Downloading package problem_reports to
[nltk_data]    |     /content/nltk_data...
[nltk_data]  

[nltk_data]    |   Unzipping corpora/ycoe.zip.
[nltk_data]    | Downloading package rslp to /content/nltk_data...
[nltk_data]    |   Unzipping stemmers/rslp.zip.
[nltk_data]    | Downloading package maxent_treebank_pos_tagger to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping taggers/maxent_treebank_pos_tagger.zip.
[nltk_data]    | Downloading package universal_tagset to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping taggers/universal_tagset.zip.
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data]    | Downloading package punkt to /content/nltk_data...
[nltk_data]    |   Unzipping tokenizers/punkt.zip.
[nltk_data]    | Downloading package book_grammars to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping grammars/book_grammars.zip.
[nltk_data]    | Downloading package sample_grammars to
[nltk_dat

True

**Step #1: ** Tokenize the sample set of documents



In [0]:
# Step 1

import gensim

raw_content = ['The dog ran up the steps and entered the owner\'s room to check if the owner was in the room.',
               'My name is Thomson Comer, commander of the Machine Learning program at Lambda school.',
               'I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program.',
               'Machine Learning is one of my favorite subjects.',
               'I am excited about taking the Machine Learning class at the Lambda school starting in April.',
               'When does the Machine Learning program kick-off at Lambda school?',
               'The batter hit the ball out off AT&T park into the pacific ocean.',
               'The pitcher threw the ball into the dug-out.']

In [0]:
tokens = [nltk.word_tokenize(doc) for doc in raw_content]

**Step #2: ** Train the Word2vec model with tokenized content; size of the word vectors is 5; the word should show-up at least once in the raw content

In [0]:
model = gensim.models.Word2Vec(tokens, size=5, min_count=1)

**Step #3: **Output the number of words as well as the list of words in the model's vocabulary

In [6]:
print('# of words:', len(model.wv.vocab))
print(list(model.wv.vocab))

# of words: 69
['The', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'owner', "'s", 'room', 'to', 'check', 'if', 'was', 'in', '.', 'My', 'name', 'is', 'Thomson', 'Comer', ',', 'commander', 'of', 'Machine', 'Learning', 'program', 'at', 'Lambda', 'school', 'I', 'am', 'creating', 'curriculum', 'for', 'will', 'be', 'teaching', 'full-time', 'one', 'my', 'favorite', 'subjects', 'excited', 'about', 'taking', 'class', 'starting', 'April', 'When', 'does', 'kick-off', '?', 'batter', 'hit', 'ball', 'out', 'off', 'AT', '&', 'T', 'park', 'into', 'pacific', 'ocean', 'pitcher', 'threw', 'dug-out']


**Step #4: **Output the vector of words for the following tokens: **a)** curriculum, **b)** ocean, and **c) **pitcher

In [7]:
for token in ['curriculum', 'ocean', 'pitcher']:
    print('{}: {}'.format(token, model.wv[token]))

curriculum: [ 0.07319207 -0.02311418  0.04490997  0.09295546 -0.05926471]
ocean: [ 0.07285784 -0.06638347 -0.01627363 -0.03860937 -0.03333457]
pitcher: [-0.02468206 -0.04679758  0.0996118  -0.02658264  0.08133115]


**Step #5:** Now we are going to train the model with more data - larger corpus i.e. the 20 newsgroups text dataset. Fetch the data from the training subset

*Reference*: http://scikit-learn.org/stable/datasets/index.html

In [0]:
from sklearn.datasets import fetch_20newsgroups

In [9]:
data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


**Step #6:** Output the metadata for the data that is fetched

In [10]:
print(list(data.target_names))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


**Step #7: ** Output the # of posts across the different categories

In [11]:
categories = list(data.target_names)
for c in range(len(categories)):
    print('{} Posts: {}'.format(categories[c], 
                                           data.target[data.target==c].shape[0]))

alt.atheism Posts: 480
comp.graphics Posts: 584
comp.os.ms-windows.misc Posts: 591
comp.sys.ibm.pc.hardware Posts: 590
comp.sys.mac.hardware Posts: 578
comp.windows.x Posts: 593
misc.forsale Posts: 585
rec.autos Posts: 594
rec.motorcycles Posts: 598
rec.sport.baseball Posts: 597
rec.sport.hockey Posts: 600
sci.crypt Posts: 595
sci.electronics Posts: 591
sci.med Posts: 594
sci.space Posts: 593
soc.religion.christian Posts: 599
talk.politics.guns Posts: 546
talk.politics.mideast Posts: 564
talk.politics.misc Posts: 465
talk.religion.misc Posts: 377


**Step #8**: Tokenize the body of text for each post

In [0]:
news_tokens = [nltk.word_tokenize(post) for post in data.data]

**Step #9**: Train the Word2vec model - words should show up at least 3 times in the corpus of text
and the size of each word vector is 200 (i.e. dimension = 200)

Reference" Scroll down to the section "A closer look at the parameter settings" to review the parameters that can be set

In [0]:
news_model = gensim.models.Word2Vec(news_tokens, size=200, min_count=3)

**Step #10**:  List the number of words in the model's vocabulary

In [14]:
print('# of words:', len(news_model.wv.vocab))

# of words: 40240


**Step #11:** Examine word similarity to the word "Christ"

In [15]:
news_model.wv.most_similar(positive=['Christ'])

[('Jesus', 0.9195095300674438),
 ('Father', 0.8961330652236938),
 ('Lord', 0.886461079120636),
 ('Son', 0.8793392181396484),
 ('Spirit', 0.8787533044815063),
 ('God', 0.8761974573135376),
 ('sin', 0.8591848611831665),
 ('Holy', 0.8402365446090698),
 ('death', 0.8378164768218994),
 ('resurrection', 0.8373886942863464)]

**Step #12**: Examine document similarity with Doc2vec to any body of text of your choice

*Reference*: https://radimrehurek.com/gensim/models/doc2vec.html

In [0]:
docs = []
for i, doc in enumerate(data.data):
    str_list = doc.split()
    T = gensim.models.doc2vec.TaggedDocument(str_list, [i])
    docs.append(T)

In [0]:
docmodel = gensim.models.Doc2Vec(docs, vector_size=200, min_count=5)

In [18]:
!wget https://raw.githubusercontent.com/PedramNavid/trump_speeches/master/data/speech_00.txt

--2018-06-13 22:54:45--  https://raw.githubusercontent.com/PedramNavid/trump_speeches/master/data/speech_00.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36098 (35K) [text/plain]
Saving to: ‘speech_00.txt’


2018-06-13 22:54:45 (2.01 MB/s) - ‘speech_00.txt’ saved [36098/36098]



In [19]:
def cleaned_text(text):
    punctuation = ['!', ',', '.', ':', ';', '?']
    lowercase = 'abcdefghijklmnopqrstuvwxyz'
    
    for char in set(text):
        if char not in punctuation and char not in lowercase:
            text = text.replace(char, ' ')

    return text

query = open('speech_00.txt').read().lower()
query = cleaned_text(query)

print(query[:1000])


remarks announcing candidacy for president in new york city trump: wow. whoa. that is some group of people. thousands. so nice, thank you very much. that s really nice. thank you. it s great to be at trump tower. it s great to be in a wonderful city, new york. and it s an honor to have everybody here. this is beyond anybody s expectations. there s been no crowd like this. and, i can tell, some of the candidates, they went in. they didn t know the air conditioner didn t work. they sweated like dogs.  laughter  they didn t know the room was too big, because they didn t have anybody there. how are they going to beat isis? i don t think it s gonna happen.  applause  our country is in serious trouble. we don t have victories anymore. we used to have victories, but we don t have them. when was the last time anybody saw us beating, let s say, china in a trade deal? they kill us. i beat china all the time. all the time.  applause  audience member: we want trump. we want trump. trump: when did 

In [0]:
vocab = list(docmodel.wv.vocab)
query = [w for w in query if w in vocab]
query = docmodel.infer_vector(query)

In [0]:
similar_docs = docmodel.docvecs.most_similar([query])

In [22]:
# Examine the first document in the list above to gauge the similarity
similar_docs[0]

(8048, 0.647288978099823)

In [23]:
docs[9025]

TaggedDocument(words=["What's", 'with', 'you', 'stupid', 'dorks', 'from', 'the', '"Western', 'Business', 'School"???!!!', 'First', 'there', 'was', 'that', 'Cary', 'asshole,', 'and', 'now', 'you.', "Don't", 'you', 'have', 'anything', 'better', 'to', 'do', 'instead', 'of', 'being', 'obnoxious,', 'antagonistic', 'little', 'shits', 'over', 'the', 'network???', 'Why', "don't", 'you', 'just', 'take', 'a', 'hike,', 'and', 'stop', 'embarrasing', 'yourself,', 'your', 'school,', 'and', 'Canada!'], tags=[9025])

**Stretch Goal: **

Download the pre-trained word vectors from Google. Access the pre-trained vectors via the following link: https://code.google.com/archive/p/word2vec

Load the pre-trained word vectors and train the **Word2vec** model

Examine the first 100 keys or words of the vocabulary

Outputs the vector representation for a select set of words - the words can be of your choice

Examine the similarity between words - the words can be of your choice

For example: 

model.similarity('house', 'bungalow')

model.similarity('house', 'umbrella')


In [35]:
!pip install googledrivedownloader

Collecting googledrivedownloader
  Downloading https://files.pythonhosted.org/packages/7e/41/d59b2a5fcc7afeb40f23091694bd6e6a63ad118c93f834353ee5100285d5/googledrivedownloader-0.3-py2.py3-none-any.whl
Installing collected packages: googledrivedownloader
Successfully installed googledrivedownloader-0.3


In [36]:
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='0B7XkCwpI5KDYNlNUTTlSS21pQmM',
                                    dest_path='./data/GoogleNews-vectors-negative300.bin.gz')

Downloading 0B7XkCwpI5KDYNlNUTTlSS21pQmM into ./data/GoogleNews-vectors-negative300.bin.gz... Done.


In [0]:
model = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [49]:
print(list(model.vocab)[:100])

['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said', 'was', 'the', 'at', 'not', 'as', 'it', 'be', 'from', 'by', 'are', 'I', 'have', 'he', 'will', 'has', '####', 'his', 'an', 'this', 'or', 'their', 'who', 'they', 'but', '$', 'had', 'year', 'were', 'we', 'more', '###', 'up', 'been', 'you', 'its', 'one', 'about', 'would', 'which', 'out', 'can', 'It', 'all', 'also', 'two', 'after', 'first', 'He', 'do', 'time', 'than', 'when', 'We', 'over', 'last', 'new', 'other', 'her', 'people', 'into', 'In', 'our', 'there', 'A', 'she', 'could', 'just', 'years', 'some', 'U.S.', 'three', 'million', 'them', 'what', 'But', 'so', 'no', 'like', 'if', 'only', 'percent', 'get', 'did', 'him', 'game', 'back', 'because', 'now', '#.#', 'before']


In [50]:
model['hello']

array([-0.05419922,  0.01708984, -0.00527954,  0.33203125, -0.25      ,
       -0.01397705, -0.15039062, -0.265625  ,  0.01647949,  0.3828125 ,
       -0.03295898, -0.09716797, -0.16308594, -0.04443359,  0.00946045,
        0.18457031,  0.03637695,  0.16601562,  0.36328125, -0.25585938,
        0.375     ,  0.171875  ,  0.21386719, -0.19921875,  0.13085938,
       -0.07275391, -0.02819824,  0.11621094,  0.15332031,  0.09082031,
        0.06787109, -0.0300293 , -0.16894531, -0.20800781, -0.03710938,
       -0.22753906,  0.26367188,  0.012146  ,  0.18359375,  0.31054688,
       -0.10791016, -0.19140625,  0.21582031,  0.13183594, -0.03515625,
        0.18554688, -0.30859375,  0.04785156, -0.10986328,  0.14355469,
       -0.43554688, -0.0378418 ,  0.10839844,  0.140625  , -0.10595703,
        0.26171875, -0.17089844,  0.39453125,  0.12597656, -0.27734375,
       -0.28125   ,  0.14746094, -0.20996094,  0.02355957,  0.18457031,
        0.00445557, -0.27929688, -0.03637695, -0.29296875,  0.19

In [43]:
print('house-bungalow similarity:', model.similarity('house', 'bungalow'))
print('house-umbrella similarity:', model.similarity('house', 'umbrella'))

house-bungalow similarity: 0.6878559817059837
house-umbrella similarity: 0.1358489851424372
