[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_11-NLP/blob/master/W11_CCS_AG_NLP_Coding_Challenge__3_Live_Coding_Solution.ipynb)

### Coding Challenge #3: Natural Language Processing

In this Coding Challenge, you will cover **Word2vec**, a popular algorithm for building vector representations of words (i.e. word embeddings). The concept behind Word2vec is quite straightforward: an assumption is made that the meaning of a word can be inferred from the *context it appears in* or *the company it keeps*. This is similar to stating: “tell me about your friends, and I will tell you who you are”.

If two words have very similar neighbors (that is, the contexts in which they are used are similar), then the words are most likely similar in meaning.
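
The "similar neighbors" idea can be illustrated without Word2vec itself. The sketch below is a plain count-based stand-in, not the Word2vec algorithm: it counts the words that appear near a target word and compares two targets with cosine similarity. The function names, the toy sentences, and the window size of 2 are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: "cat" and "dog" share more neighbors than "cat" and "car"
toy_sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]

cat = context_counts(toy_sentences, "cat")
dog = context_counts(toy_sentences, "dog")
car = context_counts(toy_sentences, "car")

print(cosine(cat, dog), cosine(cat, car))
```

In this toy corpus, "cat" and "dog" share two of their three neighbors, so they score higher than "cat" and "car". Word2vec learns dense vectors instead of raw counts, but it rests on the same intuition.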

In this Coding Challenge, you will go through the process of training a Word2vec model with a sample set of documents and then examine certain attributes of the model. After that, you will train a Word2vec model with a large corpus of text and then ascertain the similarity among words in the corpus.


In [1]:
# https://radimrehurek.com/gensim/install.html
#!pip install --upgrade gensim

In [2]:
%%time
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /Users/darwinm/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nl

**Step #1:** Tokenize the sample set of documents



In [3]:
%%time
# Step 1

import gensim

raw_content = ['The dog ran up the steps and entered the owner\'s room to check if the owner was in the room.',
               'My name is Thomson Comer, commander of the Machine Learning program at Lambda school.',
               'I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program.',
               'Machine Learning is one of my favorite subjects.',
               'I am excited about taking the Machine Learning class at the Lambda school starting in April.',
               'When does the Machine Learning program kick-off at Lambda school?',
               'The batter hit the ball out off AT&T park into the pacific ocean.',
               'The pitcher threw the ball into the dug-out.']

from nltk.tokenize import word_tokenize
sentences = [word_tokenize(text) for text in raw_content]
print(sentences)

[['The', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'the', 'owner', "'s", 'room', 'to', 'check', 'if', 'the', 'owner', 'was', 'in', 'the', 'room', '.'], ['My', 'name', 'is', 'Thomson', 'Comer', ',', 'commander', 'of', 'the', 'Machine', 'Learning', 'program', 'at', 'Lambda', 'school', '.'], ['I', 'am', 'creating', 'the', 'curriculum', 'for', 'the', 'Machine', 'Learning', 'program', 'and', 'will', 'be', 'teaching', 'the', 'full-time', 'Machine', 'Learning', 'program', '.'], ['Machine', 'Learning', 'is', 'one', 'of', 'my', 'favorite', 'subjects', '.'], ['I', 'am', 'excited', 'about', 'taking', 'the', 'Machine', 'Learning', 'class', 'at', 'the', 'Lambda', 'school', 'starting', 'in', 'April', '.'], ['When', 'does', 'the', 'Machine', 'Learning', 'program', 'kick-off', 'at', 'Lambda', 'school', '?'], ['The', 'batter', 'hit', 'the', 'ball', 'out', 'off', 'AT', '&', 'T', 'park', 'into', 'the', 'pacific', 'ocean', '.'], ['The', 'pitcher', 'threw', 'the', 'ball', 'into', 'the', 'dug-ou

**Step #2:** Train the Word2vec model on the tokenized content. Use word vectors of size 5, and keep every word that shows up at least once in the raw content.

In [4]:
%%time
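# Step 2
from gensim.models.word2vec import Word2Vec

# min_count=1 keeps every token; size=5 sets the dimensionality of each word vector.
# Note: gensim 4.0 renamed the `size` argument to `vector_size`.
model = Word2Vec(sentences, min_count=1, size=5)
dir(model)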

CPU times: user 12.6 ms, sys: 3.43 ms, total: 16 ms
Wall time: 55.4 ms


**Step #3:** Output the number of words as well as the list of words in the model's vocabulary

In [5]:
# Step 3
# Note: gensim 4.0 replaced `wv.vocab` with `wv.key_to_index`.
print(model)
print(list(model.wv.vocab))
print(len(model.wv.vocab))

Word2Vec(vocab=69, size=5, alpha=0.025)
['The', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'owner', "'s", 'room', 'to', 'check', 'if', 'was', 'in', '.', 'My', 'name', 'is', 'Thomson', 'Comer', ',', 'commander', 'of', 'Machine', 'Learning', 'program', 'at', 'Lambda', 'school', 'I', 'am', 'creating', 'curriculum', 'for', 'will', 'be', 'teaching', 'full-time', 'one', 'my', 'favorite', 'subjects', 'excited', 'about', 'taking', 'class', 'starting', 'April', 'When', 'does', 'kick-off', '?', 'batter', 'hit', 'ball', 'out', 'off', 'AT', '&', 'T', 'park', 'into', 'pacific', 'ocean', 'pitcher', 'threw', 'dug-out']
69


**Step #4:** Output the word vectors for the following tokens: **a)** curriculum, **b)** ocean, and **c)** pitcher

In [6]:
# Step 4
print(model.wv['curriculum', 'ocean', 'pitcher'])

[[ 0.0429332  -0.0387117  -0.0377904   0.07507995  0.03391911]
 [ 0.06513041 -0.09689079 -0.09367523 -0.0006524  -0.0025307 ]
 [ 0.00753346 -0.02386155 -0.0763839  -0.06631807  0.00179267]]
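
Word vectors like these are usually compared with cosine similarity. The following is a minimal sketch that assumes the three vectors printed above are copied in by hand; `cosine_similarity` is a hypothetical helper written here, not a gensim function. With only eight training sentences the vectors stay close to their random initialization, so the resulting scores should not be read as meaningful.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Values copied from the Step 4 output above
curriculum = [0.0429332, -0.0387117, -0.0377904, 0.07507995, 0.03391911]
ocean = [0.06513041, -0.09689079, -0.09367523, -0.0006524, -0.0025307]
pitcher = [0.00753346, -0.02386155, -0.0763839, -0.06631807, 0.00179267]

print(cosine_similarity(curriculum, ocean))
print(cosine_similarity(ocean, pitcher))
print(cosine_similarity(curriculum, pitcher))
```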


**Step #5:** Now we are going to train the model on a larger corpus: the 20 newsgroups text dataset. Fetch the data from the training subset.

*Reference*: http://scikit-learn.org/stable/datasets/index.html

In [7]:
%%time
# Step 5
from sklearn.datasets import fetch_20newsgroups
text_from_corpus = fetch_20newsgroups(subset='train')

CPU times: user 354 ms, sys: 148 ms, total: 502 ms
Wall time: 1.17 s


**Step #6:** Output the metadata for the data that is fetched (investigate the object and what you can do with it)

In [8]:
# Step 6
print(dir(text_from_corpus))
print(text_from_corpus.description)

['DESCR', 'data', 'description', 'filenames', 'target', 'target_names']
the 20 newsgroups by date dataset


**Step #7:** Output the number of posts across the different categories

In [9]:
# Step 7
len(text_from_corpus.data)

11314
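
The cell above reports the total number of posts. To break that total down by category, one option is to count the integer labels and map them back to their names. This is a sketch with small hypothetical stand-ins for `text_from_corpus.target` and `text_from_corpus.target_names`. The three names are real 20 newsgroups categories, but the labels and counts are made up.

```python
from collections import Counter

# Hypothetical stand-ins for text_from_corpus.target_names and text_from_corpus.target
target_names = ["rec.autos", "sci.space", "talk.politics.guns"]
target = [0, 2, 1, 0, 2, 2]

# Count how often each integer label occurs, then map labels to category names
label_counts = Counter(target)
posts_per_category = {
    target_names[label]: count for label, count in sorted(label_counts.items())
}

for name, count in posts_per_category.items():
    print(name, count)
```

With the real dataset, passing `text_from_corpus.target` and `text_from_corpus.target_names` in place of the toy lists should give the per-category counts.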

**Step #8**: Tokenize the body of text for each post

In [10]:
# Step 8
import string

def process_text(text):
  """Remove punctuation, lowercase, and tokenize text."""
  # TODO: check for special cases like "I'll"
  text = "".join([char.lower() for char in text
                  if char not in string.punctuation])
  return word_tokenize(text)

sentences = [process_text(document) for document in text_from_corpus.data]
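
The TODO in `process_text` is worth a closer look. Stripping punctuation before tokenizing merges contractions and hyphenated words into single tokens. The sketch below uses a simplified, stdlib-only variant (`process_text_simple`, a hypothetical name) that splits on whitespace instead of calling `word_tokenize`, which is enough to show the effect.

```python
import string


def process_text_simple(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


tokens = process_text_simple("I'll take the full-time class.")
print(tokens)  # ['ill', 'take', 'the', 'fulltime', 'class']
```

Here "I'll" becomes "ill" and "full-time" becomes "fulltime". The same thing happens to email addresses and message headers in the newsgroup posts.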

**Step #9**: Train the Word2vec model. Words should show up at least 3 times in the corpus of text, and the size of each word vector is 200 (i.e. dimension = 200).

*Reference*: Scroll down to the section "A closer look at the parameter settings" to review the parameters that can be set

In [11]:
# Step 9
news_model = Word2Vec(sentences, min_count=3, size=200)
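
The effect of `min_count` can be sketched as a simple frequency filter: count every token in the corpus and keep only those that occur at least `min_count` times. The helper `build_vocab` and the toy corpus are hypothetical, and gensim's own vocabulary building does more than this (for example, it can also downsample very frequent words).

```python
from collections import Counter


def build_vocab(sentences, min_count=3):
    """Return the set of tokens whose total frequency is at least `min_count`."""
    counts = Counter(token for tokens in sentences for token in tokens)
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical toy corpus of already tokenized sentences
toy_corpus = [
    ["machine", "learning", "program"],
    ["machine", "learning", "class"],
    ["machine", "learning", "school"],
    ["ball"],
]

print(sorted(build_vocab(toy_corpus, min_count=3)))  # ['learning', 'machine']
```

Raising `min_count` shrinks the vocabulary and drops rare tokens such as typos and message IDs, at the cost of losing vectors for uncommon words.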

**Step #10**:  List the number of words in the model's vocabulary

In [12]:
# Step 10
print(len(news_model.wv.vocab))

43312


**Step #11:** Examine word similarity to the word "Christ" (find other words most similar to it). Because `process_text` lowercases the corpus, the lookup key is `christ`.

In [13]:
# Step 11
news_model.wv.most_similar('christ')

[('jesus', 0.914254367351532),
 ('lord', 0.8938565850257874),
 ('spirit', 0.8803279399871826),
 ('father', 0.8511791825294495),
 ('son', 0.8467549681663513),
 ('disciples', 0.8440340161323547),
 ('satan', 0.8411128520965576),
 ('holy', 0.8389605283737183),
 ('messiah', 0.8368544578552246),
 ('himself', 0.8333228230476379)]

**Step #12**: Use Doc2vec to find the documents most similar to any body of text of your choice

*Reference*: https://radimrehurek.com/gensim/models/doc2vec.html

In [14]:
%%time
# Step 12

# We need to train a Doc2vec model on the 20 newsgroups dataset.
# The model expects "TaggedDocument" objects as input, so we first create those.

# Import TaggedDocument
from gensim.models.doc2vec import TaggedDocument

# Tokenize each of the posts within the newsgroups
sentences = [process_text(document) for document in text_from_corpus.data]

# Create a list of tagged documents
# Every item in tagged_documents_list is a tokenized post tagged with "sent_<index>"
tagged_documents_list = []
for i, sent in enumerate(sentences):
    tagged_documents_list.append(TaggedDocument(sent, ["sent_{}".format(i)]))

# Examine the first item in tagged_documents_list
# print(tagged_documents_list[0])

# Train the model with the list of tagged documents
# Size of each document vector is 300
doc2vec_model = gensim.models.doc2vec.Doc2Vec(tagged_documents_list,
                                              vector_size=300)

# Get the vector representation for a new document
# Note: the query is split on whitespace only, so unlike the training posts
# it is not lowercased and keeps its punctuation ("I", "cars.")
vec_representation = doc2vec_model.infer_vector('I love test driving luxury cars.'.split())
# print(vec_representation)

# Determine the documents (posts) most similar to the new document/post
doc2vec_model.docvecs.most_similar([vec_representation])

CPU times: user 2min 27s, sys: 3.27 s, total: 2min 31s
Wall time: 3min 22s
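
`most_similar` returns `(tag, score)` pairs, and the tags here follow the `"sent_{}"` pattern used when the tagged documents were built. To look a result up in `tagged_documents_list`, the index has to be recovered from the tag. A minimal sketch follows, with a hypothetical helper `tag_to_index` and made-up scores. Only the tag format comes from the cell above.

```python
def tag_to_index(tag, prefix="sent_"):
    """Recover the integer list index from a tag such as 'sent_8027'."""
    if not tag.startswith(prefix):
        raise ValueError("unexpected tag format: {!r}".format(tag))
    return int(tag[len(prefix):])


# Hypothetical result in the (tag, score) shape; the scores are made up
similar_docs = [("sent_8027", 0.61), ("sent_15", 0.58)]

indices = [tag_to_index(tag) for tag, _score in similar_docs]
print(indices)  # [8027, 15]
```

Each recovered index can then be used directly, as in `tagged_documents_list[indices[0]]`.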


In [15]:
# Examine the first document in the list above to gauge the similarity
print(tagged_documents_list[8027])

TaggedDocument(['nntppostinghost', 'surtifiuiono', 'from', 'thomas', 'parsli', 'thomaspifiuiono', 'subject', 're', 'my', 'gun', 'is', 'like', 'my', 'american', 'express', 'card', 'inreplyto', 'vikingiastateedu', 'dan', 'sorensons', 'message', 'of', 'mon', '19', 'apr', '1993', '085242', 'gmt', 'organization', 'dept', 'of', 'informatics', 'university', 'of', 'oslo', 'norway', '1qjmnuinnlmdclemhandheldcom', 'cmm0902734911642thomaspsurtifiuiono', 'viking734945095ponderouscciastateedu', 'cmm0902735132009thomaspsurtifiuiono', 'viking735209562ponderouscciastateedu', 'lines', '51', 'originator', 'thomaspsurtifiuiono', 'i', 'dont', 'remember', 'the', 'figures', 'exactly', 'but', 'there', 'were', 'about', '3500', 'deaths', 'in', 'texas', 'in', '1991', 'that', 'was', 'caused', 'by', 'guns', 'this', 'is', 'more', 'than', 'those', 'beeing', 'killed', 'in', 'caraccidents', 'yes', 'there', 'could', 'be', 'that', 'low', 'sentences', 'or', 'high', 'poverty', 'could', 'influence', 'the', 'figures', 'but

**Stretch Goal:**

Download the pre-trained word vectors from Google. Access the pre-trained vectors via the following link: https://code.google.com/archive/p/word2vec

Load the pre-trained word vectors and train the **Word2vec** model

Examine the first 100 keys or words of the vocabulary

Output the vector representation for a select set of words - the words can be of your choice

Examine the similarity between words - the words can be of your choice

For example: 

model.similarity('house', 'bungalow')

model.similarity('house', 'umbrella')
