# Chapter 3: Learning Distributed Word Embeddings and Using Them for NLP

<p>In this notebook, you'll learn to load text into TensorFlow by converting words to numbers. 
You'll learn how to train distributed word representations, also known as word embeddings, by 
building your first TensorFlow deep network for NLP. You'll compare these word embeddings to similar representations
built with Latent Semantic Indexing (LSI). You'll learn how to save your embeddings for reuse, and how to load
pre-trained embeddings that you download from the cloud. Finally, you'll learn how to use pre-trained embeddings
for your first NLP task: categorizing documents. Along the way we'll point out foundational NLP techniques 
that will stay useful as your skills grow.</p>

<i>A tip of the keyboard to the following sources that provided inspiration for this notebook:
    <ul>
        <li>1</li>
    </ul>
    </i>

## Table of Contents
(Internal Links)

## Imports

In [1]:
import tensorflow as tf
from tensorflow import keras
import nltk
import sklearn
import numpy as np

## Constants and Magic Numbers
To be used throughout our code

In [2]:
vocabLength = 10000 #The number of unique words in our corpus which we'll use as inputs
trainingEpochs = 1000 #How many training iterations we'll put our network through
embeddingDim = 100 #How many dimensions in each embedding vector?
skipgramWidth = 3 #How many words on either side of the target word should be included in its context?



## Load and Preprocess Text Corpus

Tip: Make sure you've downloaded the NLTK text corpora following the directions at <a href="https://www.nltk.org/data.html">https://www.nltk.org/data.html</a>

In [3]:
#Let's use the text of Melville's novel Moby Dick as our corpus. We'll load it from the NLTK corpus library.
#Here's what the first four sentences look like:
i = 0
for s in nltk.corpus.gutenberg.sents('melville-moby_dick.txt'):
    i += 1
    print(s)
    if i > 3:
        break

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']
['ETYMOLOGY', '.']
['(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar', 'School', ')']
['The', 'pale', 'Usher', '--', 'threadbare', 'in', 'coat', ',', 'heart', ',', 'body', ',', 'and', 'brain', ';', 'I', 'see', 'him', 'now', '.']


In [4]:
#Let's count how many unique words are in Moby Dick.
#We lower-case them first and remove punctuation. Here we're using the Python str.lower() method and
#a home-rolled punctuation tester to do this normalization. In other NLP tasks you'll do additional
#types of pre-processing, including stripping non-ASCII characters, stemming, and PoS tagging.
from nltk import FreqDist
from Chapter_03_utils import isPunctuation #Home-brewed function to test if token is punctuation

mobyDickWords = FreqDist(w.lower() for w in nltk.corpus.gutenberg.words('melville-moby_dick.txt') if not isPunctuation(w))
#Note: the token count printed below is taken over the raw corpus, so it still includes punctuation tokens
print("There are {} word tokens in Moby Dick, of which {} are unique types.".format(len(nltk.corpus.gutenberg.words('melville-moby_dick.txt')),len(mobyDickWords)))
#Phew! Melville was prolific!


There are 260819 word tokens in Moby Dick, of which 17152 are unique types.
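
The `isPunctuation` helper lives in `Chapter_03_utils` and isn't shown in this notebook. As a rough idea of what such a test could look like, here is a minimal sketch. The name `isPunctuationSketch` and its rule (every character is ASCII punctuation) are assumptions for illustration, and the real helper may behave differently.

```python
import string

def isPunctuationSketch(token):
    """Return True when the token is non-empty and made up only of ASCII punctuation."""
    # Hypothetical stand-in for Chapter_03_utils.isPunctuation; the real helper may differ.
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

# A few tokens in the style of the NLTK sentences printed earlier
tokens = ['The', 'pale', 'Usher', '--', 'threadbare', ',', 'heart', '.']

# Same normalization pattern as the cell above: drop punctuation, then lower-case
kept = [t.lower() for t in tokens if not isPunctuationSketch(t)]
print(kept)
```

Note that a token such as `don't` mixes letters and punctuation, so under this rule it is kept as a word.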


## Convert words to numbers
<i>This is the first step in preparing the text data to be fed into a neural network or other machine learning model.</i>

In [5]:
#Create a dictionary that maps each word to an integer code
#We'll use a home-grown function even though several alternatives are available, 
#including gensim.corpora.dictionary.Dictionary
from Chapter_03_utils import terms2ints, ints2terms

mobyDickTermsDict = terms2ints([term for (term, freq) in mobyDickWords.most_common(vocabLength)])
print("Let's inspect the terms and integer codes...")
print({t:i for (t, i) in mobyDickTermsDict.items() if i < 11})
mobyDickIntsDict = ints2terms(mobyDickTermsDict)
#Test a few lookups in the reverse dictionary to make sure it's working
for i in range(3):
    print("{}: {}".format(i, mobyDickIntsDict[i]))
#Sources:
#    https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
#    https://adventuresinmachinelearning.com/word2vec-keras-tutorial/
#    https://blog.cambridgespark.com/tutorial-build-your-own-embedding-and-use-it-in-a-neural-network-e9cde4a81296

Let's inspect the terms and integer codes...
{'the': 0, 'of': 1, 'and': 2, 'a': 3, 'to': 4, 'in': 5, 'that': 6, 'his': 7, 'it': 8, 'i': 9, 'he': 10}
0: the
1: of
2: and
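
The output above suggests that `terms2ints` assigns codes by position in the frequency-ranked list, starting at 0, and that `ints2terms` simply inverts that mapping. Neither function is shown here, so the following is only a sketch under that assumption. The names `terms2intsSketch` and `ints2termsSketch` are hypothetical.

```python
def terms2intsSketch(terms):
    """Map each term to its position in the list (assumed: most frequent term first)."""
    # Hypothetical stand-in for Chapter_03_utils.terms2ints
    return {term: code for code, term in enumerate(terms)}

def ints2termsSketch(termsDict):
    """Invert a term -> code dictionary into a code -> term dictionary."""
    # Hypothetical stand-in for Chapter_03_utils.ints2terms
    return {code: term for term, code in termsDict.items()}

termsDict = terms2intsSketch(['the', 'of', 'and'])
intsDict = ints2termsSketch(termsDict)

print(termsDict)    # forward lookup: term -> code
print(intsDict[2])  # reverse lookup: code -> term
```

Keeping both directions around makes it cheap to encode text for the network and to decode the network's integer outputs back into readable words.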


In [6]:
#Now we'll instantiate an encoder with our terms dictionary and reverse dictionary.
#The encoder can convert lists of input words into lists of integers.
from Chapter_03_utils import IntEncoder
enc = IntEncoder(mobyDickTermsDict, mobyDickIntsDict)
#print("call: {}".format(mobyDickTermsDict['call']))
sentence = 'Call me Ishmael'
result = enc.encode([word.lower() for word in nltk.word_tokenize(sentence) if not isPunctuation(word)])
print("Encoded sentence:", result)
print(mobyDickIntsDict[400])
print(mobyDickIntsDict[1000])

Encoded sentence: [400, 40, 1014]
call
invested
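
`IntEncoder` is another helper from `Chapter_03_utils` that isn't shown. Judging only from the calls made in this notebook (`encode`, `lookupCode`, `lookupTerm`), it might look roughly like the sketch below. The class name `IntEncoderSketch`, the `'<OOV>'` placeholder, and the choice of `len(termsDict)` as the out-of-vocabulary code are assumptions. The last one is inferred from the code 10000 that shows up later in the skip-gram pairs.

```python
class IntEncoderSketch:
    """Hypothetical stand-in for Chapter_03_utils.IntEncoder."""

    def __init__(self, termsDict, intsDict, oovTerm='<OOV>'):
        self.termsDict = termsDict          # term -> integer code
        self.intsDict = intsDict            # integer code -> term
        self.oovCode = len(termsDict)       # assumed: first unused code marks unknown words
        self.oovTerm = oovTerm              # assumed placeholder for unknown codes

    def lookupCode(self, term):
        """Return the term's code, or the out-of-vocabulary code if the term is unknown."""
        return self.termsDict.get(term, self.oovCode)

    def lookupTerm(self, code):
        """Return the term for a code, or the placeholder if the code is unknown."""
        return self.intsDict.get(code, self.oovTerm)

    def encode(self, terms):
        """Encode a list of terms as a list of integer codes."""
        return [self.lookupCode(t) for t in terms]

toyTerms = {'call': 0, 'me': 1}
toyInts = {0: 'call', 1: 'me'}
toyEnc = IntEncoderSketch(toyTerms, toyInts)
print(toyEnc.encode(['call', 'me', 'ishmael']))  # 'ishmael' is not in the toy vocabulary
print(toyEnc.lookupTerm(1))
```

Reserving a single code for every unknown word keeps the input range fixed, at the cost of treating all rare words as the same token.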


In [8]:
#Now we'll use our encoder object to encode the words from Moby Dick as a sequence of integers
words = [enc.lookupCode(w.lower()) for w in nltk.corpus.gutenberg.words('melville-moby_dick.txt') if not isPunctuation(w)]


## Define Deep Network for Word Embeddings
<i>Thanks to the <a href="https://adventuresinmachinelearning.com/word2vec-keras-tutorial/">Adventures in Machine Learning blog</a> for inspiring this section.</i>

In [9]:
from tensorflow.keras.preprocessing.sequence import make_sampling_table, skipgrams
sampling_table = make_sampling_table(vocabLength + 1) #Add one to accommodate the out-of-vocab marker
pairs, categories = skipgrams(words, vocabLength, window_size=skipgramWidth, sampling_table=sampling_table)
word_target, word_context = zip(*pairs)
word_target = np.array(word_target, dtype="int32")
word_context = np.array(word_context, dtype="int32")

print(pairs[:10], categories[:10])

[[19, 625], [6037, 3], [10000, 13], [7325, 12], [71, 47], [3312, 536], [10000, 6783], [1063, 7803], [49, 10000], [10000, 2]] [1, 1, 1, 1, 0, 1, 0, 0, 1, 1]
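
Each pair above is a `[target, context]` couple, and each category says whether the pair really occurred within the window (1) or was drawn at random as a negative example (0). To make the windowing idea concrete, here is a simplified sketch that builds only the positive pairs for a toy sequence. The function name `positivePairs` is made up for illustration. Unlike the Keras `skipgrams` function, it does no negative sampling, no frequency-based subsampling, and no shuffling.

```python
def positivePairs(sequence, windowSize):
    """Return [target, context] pairs for every word within windowSize of each target."""
    pairs = []
    for i, target in enumerate(sequence):
        # Clamp the window to the ends of the sequence
        lo = max(0, i - windowSize)
        hi = min(len(sequence), i + windowSize + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append([target, sequence[j]])
    return pairs

toySequence = [5, 8, 2, 9]  # pretend these are encoded words
toyPairs = positivePairs(toySequence, 1)
print(toyPairs)
```

One thing to keep in mind: the Keras documentation describes index 0 as a non-word that `skipgrams` skips, so an encoding that gives a real word the code 0 (as ours does for 'the') may lose that word's pairs.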


In [11]:
enc.lookupTerm(19)

'this'

### Define and build input layers using tf.keras functional API

In [25]:
from tensorflow.keras.layers import Input, Embedding, Reshape, Dense
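# Create one-word inputs for the target word and the context word
input_target = Input((1,))
input_context = Input((1,))

# Caution: the skip-gram pairs above include the code 10000 (the out-of-vocab marker), which falls outside
# the range 0..vocabLength-1 covered by input_dim=vocabLength; input_dim=vocabLength + 1 would cover it.
embedding = Embedding(vocabLength, embeddingDim, input_length=1, name='embedding')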

In [26]:
target = embedding(input_target)
target = Reshape((embeddingDim, 1))(target)
context = embedding(input_context)
context = Reshape((embeddingDim, 1))(context)

In [27]:
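# Set up a cosine similarity operation, which will be output in a secondary model.
# normalize=True L2-normalizes both vectors first, so the dot product becomes their cosine similarity.
from tensorflow.keras.layers import dot
similarity = dot([target, context], axes=1, normalize=True)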
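
It helps to see how a raw dot product differs from cosine similarity, because the Keras `dot` layer only gives the cosine when it is asked to normalize its inputs (`normalize=True`). The plain-Python sketch below uses made-up two-dimensional vectors, and the function names are illustrative only.

```python
import math

def dotProduct(u, v):
    """Sum of element-wise products; grows with the lengths of the vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosineSimilarity(u, v):
    """Dot product divided by both vector lengths; depends only on the angle between them."""
    return dotProduct(u, v) / (math.sqrt(dotProduct(u, u)) * math.sqrt(dotProduct(v, v)))

u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice as long
w = [-4.0, 3.0]  # perpendicular to u

print(dotProduct(u, v), cosineSimilarity(u, v))
print(dotProduct(u, w), cosineSimilarity(u, w))
```

Because `v` points the same way as `u`, their cosine similarity is at its maximum even though the raw dot product also reflects how long the vectors are. That is why cosine similarity is a common choice for comparing embeddings.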

In [28]:
# now perform the dot product operation to get a similarity measure
dot_product = dot([target, context], axes=1)
dot_product = Reshape((1,))(dot_product)
# add the sigmoid output layer
output = Dense(1, activation='sigmoid')(dot_product)

In [33]:
# create the primary training model
from tensorflow.keras.models import Model
model = Model(inputs=[input_target, input_context], outputs=[output])
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
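
To see what this training objective rewards, here is a small plain-Python sketch of the two pieces involved: a sigmoid that turns a similarity score into a probability, and the binary cross-entropy loss that compares that probability with the 0/1 category. The `weight` and `bias` values stand in for the single weight and bias of the `Dense(1)` layer and are made up for illustration. Keras also clips probabilities for numerical stability, which this sketch leaves out.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binaryCrossEntropy(label, probability):
    """Loss for one example: small when the probability agrees with the 0/1 label."""
    return -(label * math.log(probability) + (1 - label) * math.log(1.0 - probability))

score = 2.0              # hypothetical dot product of a target and context embedding
weight, bias = 1.0, 0.0  # hypothetical Dense(1) parameters

probability = sigmoid(weight * score + bias)
lossIfTruePair = binaryCrossEntropy(1, probability)   # the pair really occurred together
lossIfNegative = binaryCrossEntropy(0, probability)   # the pair was a random negative sample
print(probability, lossIfTruePair, lossIfNegative)
```

A high score is cheap when the pair is genuine and expensive when it is a negative sample, so training pushes the embeddings of words that share contexts toward each other and pushes random pairs apart.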

In [34]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 1, 100)       1000000     input_5[0][0]                    
                                                                 input_6[0][0]                    
__________________________________________________________________________________________________
reshape_3 (Reshape)             (None, 100, 1)       0           embedding[0][0]              

## Train Network with Corpus

## Examine What the Network Has Learned

## Train Latent Semantic Index Word Representations from Corpus

In [None]:
#Reference to consult: roshansantosh.wordpress.com, "Evaluating Term and Document Similarity Using Latent Semantic Analysis"

## Compare LSI Word Representations to Deep Learning Embeddings

## Save Trained Embeddings for Later Use

## Load Pre-Trained Embeddings

## Load Corpus of Categorized Documents

## Define Deep Network to Categorize Documents Using Pre-Trained Embeddings

## Train Network, Test Accuracy on Hold-out Set