# Introduction to Natural Language Processing (NLP) in TensorFlow

### Word Embeddings

Word embeddings, or word vectors, provide a way of mapping words from a vocabulary into a low-dimensional space, where words with similar meanings are close together. Let's play around with a set of pre-trained word vectors, to get used to their properties. There exist many sets of pretrained word embeddings; here, we use ConceptNet Numberbatch, which provides a relatively small download in an easy-to-work-with format (h5).

Q (note from class): Why not just assign each word its index in a big dictionary?<br>
A: Integer indices would impose an implicit, spurious relationship between words that happen to sit near each other in the dictionary.

Q: Why not one-hot encoding?<br>
A: You're not reducing dimensionality here. You're also ruining any chance of learning relationships between words - "toad", "frog", and "toaster" will all be orthogonal.
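
To make the orthogonality point concrete, here is a minimal sketch using a hypothetical three-word vocabulary (the vocabulary and variable names are assumptions for illustration, not part of the dataset used below). Every pair of distinct one-hot vectors has a dot product of zero, so "toad" looks no closer to "frog" than to "toaster".

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only)
vocab = ["toad", "frog", "toaster"]

# One-hot encoding: row i of the identity matrix represents word i
one_hot = np.eye(len(vocab))

# Pairwise dot products between all one-hot vectors
gram = one_hot @ one_hot.T
print(gram)
```

The off-diagonal entries are all zero: the encoding carries no notion of similarity, and its dimensionality grows with the vocabulary size.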

In [None]:
# Download word vectors => did through wget
from urllib.request import urlretrieve
import os
if not os.path.isfile('mini.h5'):
    print("Downloading Conceptnet Numberbatch word embeddings...")
    conceptnet_url = 'http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5'
    urlretrieve(conceptnet_url, 'mini.h5')

To read an `h5` file, we'll need to use the `h5py` package. Below, we use the package to open the `mini.h5` file we just downloaded. We extract from the file a list of utf-8-encoded words, as well as their $300$-dimensional vectors.

In [1]:
# Load the file and pull out words and embeddings
import h5py

with h5py.File("mini.h5", "r") as f:
    all_words = [word.decode('utf-8') for word in f['mat']['axis1'][:]]
    all_embeddings = f['mat']['block0_values'][:]
    
print("all_words dimensions: {0}".format(len(all_words)))
print("all_embeddings dimensions: {0}".format(all_embeddings.shape))

print(all_words[1337])

all_words dimensions: 362891
all_embeddings dimensions: (362891, 300)
/c/de/aufmachung


Now, `all_words` is a list of $V$ strings (what we call our *vocabulary*), and `all_embeddings` is a $V \times 300$ matrix. The strings are of the form `/c/language_code/word`—for example, `/c/en/cat` and `/c/es/gato`.

We are interested only in the English words. We use Python list comprehensions to pull out the indices of the English words, then extract just the English words (stripping the six-character `/c/en/` prefix) and their embeddings.

In [9]:
# Restrict our vocabulary to just the English words
# The first 6 characters are the '/c/en/' prefix
english_words = [word[6:] for word in all_words if word.startswith('/c/en/')]
english_word_indices = [i for i, word in enumerate(all_words) if word.startswith('/c/en/')]
english_embeddings = all_embeddings[english_word_indices]

print("english_words dimensions: {0}".format(len(english_words)))
print("english_embeddings dimensions: {0}".format(english_embeddings.shape))

print(english_words[1337])

english_words dimensions: 150875
english_embeddings dimensions: (150875, 300)
activated_carbon


The magnitude of a word vector is less important than its direction; the magnitude can be thought of as representing frequency of use, independent of the semantics of the word. 
Here, we are interested in semantics, so we *normalize* our vectors, dividing each by its length. 
The result is that all of our word vectors have length 1 and, as such, lie on the unit hypersphere in 300-dimensional space (the analogue of the unit circle in two dimensions). 
For unit vectors, the dot product equals the cosine of the angle between them, and so provides a measure of similarity (the bigger the cosine, the smaller the angle).

<img src="Figures/cosine_similarity.png" alt="cosine" style="width: 500px;"/>
<center>Figure adapted from *[Mastering Machine Learning with Spark 2.x](https://www.safaribooksonline.com/library/view/mastering-machine-learning/9781785283451/ba8bef27-953e-42a4-8180-cea152af8118.xhtml)*</center>
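
As a small sanity check of the cosine idea, the sketch below uses made-up 2-D vectors (not real word embeddings) and a hypothetical helper `cosine_similarity`. It shows that after normalizing, the dot product depends only on the angle between the vectors and not on their magnitudes.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, via the dot product of unit vectors."""
    u_hat = u / np.linalg.norm(u)
    v_hat = v / np.linalg.norm(v)
    return float(np.dot(u_hat, v_hat))

# Made-up 2-D vectors, chosen only for illustration
a = np.array([1.0, 0.0])
b = np.array([3.0, 3.0])   # 45 degrees from a, with a larger magnitude
c = np.array([0.0, 2.0])   # 90 degrees from a

sim_ab = cosine_similarity(a, b)   # cos(45 degrees)
sim_ac = cosine_similarity(a, c)   # cos(90 degrees)
print(sim_ab, sim_ac)
```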

In [18]:
import numpy as np

norms = np.linalg.norm(english_embeddings, axis = 1)
normalized_embeddings = english_embeddings.astype('float32') / norms.astype('float32').reshape([-1, 1])

print("normalized_embeddings dimensions: {0}".format(normalized_embeddings.shape))
print(norms.astype('float32').reshape([-1, 1]))

normalized_embeddings dimensions: (150875, 300)
[[57.5326  ]
 [57.29747 ]
 [57.21014 ]
 ...
 [57.384666]
 [56.97368 ]
 [57.471733]]


We want to look up words easily, so we create a dictionary that maps us from a word to its index in the word embeddings matrix.

In [26]:
index = {word: i for i, word in enumerate(english_words)}
print(np.dot(normalized_embeddings[0,:], normalized_embeddings[0,:]))
print(index["actionable"])
print(np.dot(normalized_embeddings[1330,:], normalized_embeddings[1330,:]))

0.99999994
1330
1.0


Now we are ready to measure the similarity between pairs of words. We use numpy to take dot products.

In [30]:
def similarity_score(w1, w2):
    score = np.dot(normalized_embeddings[index[w1],:], normalized_embeddings[index[w2],:])
    return score

# A word is as similar with itself as possible:
print('cat\tcat\t', similarity_score('cat', 'cat'))

# Closely related words still get high scores:
print('cat\tfeline\t', similarity_score('cat', 'feline'))
print('cat\tdog\t', similarity_score('cat', 'dog'))

# Unrelated words, not so much
print('cat\tmoo\t', similarity_score('cat', 'moo'))
print('cat\tfreeze\t', similarity_score('cat', 'freeze'))

# Antonyms are still considered related, sometimes more so than synonyms
print('antonyms\topposites\t', similarity_score('antonym', 'opposite'))
print('antonyms\tsynonyms\t', similarity_score('antonym', 'synonym'))

cat	cat	 1.0
cat	feline	 0.81995475
cat	dog	 0.59072405
cat	moo	 0.0039538294
cat	freeze	 -0.030225188
antonyms	opposites	 0.39410648
antonyms	synonyms	 0.46883985


We can also find, for instance, the most similar words to a given word.

In [31]:
def closest_to_vector(v, n):
    """Return n closest words to vector v"""
    all_scores = np.dot(normalized_embeddings, v)
    best_words = map(lambda i: english_words[i], reversed(np.argsort(all_scores)))
    return [next(best_words) for _ in range(n)]

def most_similar(w, n):
    """Return n closest words to word w"""
    return closest_to_vector(normalized_embeddings[index[w],:], n)

In [33]:
print(most_similar("cat", 10))
print(most_similar("dog", 10))
print(most_similar("duke", 10))

['cat', 'humane_society', 'kitten', 'feline', 'colocolo', 'cats', 'kitty', 'maine_coon', 'housecat', 'sharp_teeth']
['dog', 'dogs', 'wire_haired_dachshund', 'doggy_paddle', 'lhasa_apso', 'good_friend', 'puppy_dog', 'bichon_frise', 'woof_woof', 'golden_retrievers']
['duke', 'dukes', 'duchess', 'duchesses', 'ducal', 'dukedom', 'duchy', 'voivode', 'princes', 'prince']


We can also use `closest_to_vector` to find words "nearby" vectors that we create ourselves. This allows us to solve analogies. For example, in order to solve the analogy "man : brother :: woman : ?", we can compute a new vector `brother - man + woman`: the meaning of brother, minus the meaning of man, plus the meaning of woman. We can then ask which words are closest, in the embedding space, to that new vector.

In [40]:
def solve_analogy(a1, b1, a2):
    """Solve the analogy of a1 : b1 :: a2 : ?"""
    b2_vector = normalized_embeddings[index[b1],:] - normalized_embeddings[index[a1],:] + normalized_embeddings[index[a2],:]
    b2 = closest_to_vector(b2_vector, 10)
    return b2
    
print(solve_analogy("man", "brother", "woman"))
print(solve_analogy("man", "woman", "brother"))
print(solve_analogy("king", "man", "queen")) # didn't work as well...
print(solve_analogy("spain", "madrid", "france"))

['sister', 'brother', 'sisters', 'kid_sister', 'younger_brother', 'niece', 'nieces', 'sistren', 'stepsister', 'daughter']
['sister', 'brother', 'sisters', 'kid_sister', 'younger_brother', 'niece', 'nieces', 'sistren', 'stepsister', 'daughter']
['man', 'mans', 'woman', 'men', 'dude', 'brotherman', 'bloke', 'peter_pan', 'guy', 'mandem']
['paris', 'france', 'le_havre', 'in_france', 'montmartre', 'marseille', 'loire_valley', 'saone', 'lyonnais', 'jacques_chirac']


Three of these four results are quite good (the `king : man :: queen : ?` analogy fares worse), but in general, the results of these analogies can be disappointing. Try experimenting with other analogies, and see if you can think of ways to get around the problems you notice (e.g., modifications to the `solve_analogy` algorithm).
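
One possible modification, offered only as a sketch, is to skip the three query words when ranking candidates, since they often dominate the top of the list (as `brother` and `france` do in the output above). The example below is self-contained and runs on a hypothetical toy vocabulary with made-up 3-D vectors. The names `toy_words`, `toy_vectors`, and `solve_analogy_excluding` are assumptions for illustration and are not part of the notebook's code.

```python
import numpy as np

# Hypothetical toy vocabulary and 3-D embeddings, made up for illustration
toy_words = ["man", "woman", "brother", "sister", "uncle"]
toy_vectors = np.array([
    [1.0, 0.0, 0.0],   # man
    [0.0, 1.0, 0.0],   # woman
    [1.0, 0.0, 1.0],   # brother
    [0.0, 1.0, 1.0],   # sister
    [1.0, 0.0, 0.5],   # uncle
])
toy_normalized = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_index = {w: i for i, w in enumerate(toy_words)}

def solve_analogy_excluding(a1, b1, a2, n=1):
    """Solve a1 : b1 :: a2 : ?, skipping the three query words."""
    target = (toy_normalized[toy_index[b1]]
              - toy_normalized[toy_index[a1]]
              + toy_normalized[toy_index[a2]])
    scores = toy_normalized @ target
    # Rank all words from most to least similar to the target vector
    ranked = [toy_words[i] for i in np.argsort(scores)[::-1]]
    return [w for w in ranked if w not in (a1, b1, a2)][:n]

result = solve_analogy_excluding("man", "brother", "woman")
print(result)
```

To try the same idea on the real embeddings, the filtering step could be applied to the list returned by `closest_to_vector`, after requesting a few extra candidates.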

### Using word embeddings in deep models
Word embeddings are fun to play around with, but their primary use is that they allow us to think of words as existing in a continuous, Euclidean space; we can then use an existing arsenal of techniques for machine learning with continuous numerical data (like logistic regression or neural networks) to process text.

Let's take a look at an especially simple version of this. We'll perform *sentiment analysis* on a set of movie reviews: in particular, we will attempt to classify a movie review as positive or negative based on its text.

We will use a [Simple Word Embedding Model](http://people.ee.duke.edu/~lcarin/acl2018_swem.pdf) (SWEM, Shen et al. 2018) to do so. We will represent a review as the *mean* of the embeddings of the words in the review. Then we'll train a three-layer MLP (a neural network) to classify the review as positive or negative.

Download the `movie-simple.txt` file from Google Classroom into this directory. Each line of that file contains 

1. the numeral 0 (for negative) or the numeral 1 (for positive), followed by
2. a tab (the whitespace character), and then
3. the review itself.
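
To illustrate the line format described above, here is a minimal sketch that splits one line into its label and text. The helper `parse_review_line` and the sample line are made up for illustration; the sample is not taken from `movie-simple.txt`.

```python
def parse_review_line(line):
    """Split 'label<TAB>review text' into (label, text)."""
    # Split only on the first tab, in case the review itself contains tabs
    label_str, text = line.rstrip("\n").split("\t", 1)
    return int(label_str), text

# Made-up example line in the format described above
sample_line = "1\tA charming, funny film.\n"
label, text = parse_review_line(sample_line)
print(label, repr(text))
```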

In [41]:
import string
remove_punct=str.maketrans('','',string.punctuation)

# This function converts a line of our data file into
# a dict {'x': x, 'y': y}, where x is a 300-dimensional representation
# of the words in a review, and y is its label.
def convert_line_to_example(line):
    # Pull out the first character: that's our label (0 or 1)
    y = int(line[0])
    
    # Split the line into words using Python's split() function
    words = line[2:].translate(remove_punct).lower().split()
    
    # Look up the embeddings of each word, ignoring words not
    # in our pretrained vocabulary.
    embeddings = [normalized_embeddings[index[w]] for w in words
                  if w in index]
    
    # Take the mean of the embeddings
    x = np.mean(np.vstack(embeddings), axis=0)
    return {'x': x, 'y': y}

# Apply the function to each line in the file.
with open("movie-simple.txt", "r", encoding='utf-8', errors='ignore') as f:
    dataset = [convert_line_to_example(l) for l in f.readlines()]

In [47]:
print("Number of reviews in dataset: {}".format(len(dataset)))
print("Each one is a dictionary: {x: [300 long average embedding], y: label}")

Number of reviews in dataset: 1411
Each one is a dictionary: {x: [300 long average embedding], y: label}


Now that we have a dataset, let's shuffle it and do a train/test split. We use a quarter of the dataset for testing, 3/4 for training (but also ensure that we have a whole number of batches in our training set, to make the code nicer later).

In [53]:
import random
random.shuffle(dataset)

batch_size = 100
total_batches = len(dataset) // batch_size 
train_batches = 3*total_batches // 4
train, test = dataset[:train_batches*batch_size], dataset[train_batches*batch_size:]
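
As a quick check of the split arithmetic, the sketch below repeats the integer divisions with the dataset size reported earlier (1411 reviews). Because the training set is rounded down to a whole number of batches, the test set ends up somewhat larger than a quarter of the data.

```python
n_reviews = 1411      # dataset size reported above
batch_size = 100

total_batches = n_reviews // batch_size       # 14 full batches
train_batches = 3 * total_batches // 4        # (3 * 14) // 4 = 10
n_train = train_batches * batch_size          # 1000 training reviews
n_test = n_reviews - n_train                  # 411 test reviews

print(total_batches, train_batches, n_train, n_test)
```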

In [50]:
5 // 6

0

Time to build our MLP in TensorFlow. We'll use placeholders for `X` and `y` as usual.

In [52]:
import tensorflow as tf

# Placeholders for input
X = tf.placeholder(tf.float32, [None, 300])
y = tf.placeholder(tf.float32, [None, 1])

# Three-layer MLP
h1 = tf.layers.dense(X, 100, tf.nn.relu)
h2 = tf.layers.dense(h1, 20, tf.nn.relu)
logits = tf.layers.dense(h2, 1)
# Note: do not apply a nonlinearity to the logits themselves; the sigmoid below (and inside the loss) supplies it
probabilities = tf.sigmoid(logits)

# Loss and metrics
# Sigmoid cross-entropy: the two-class counterpart of softmax cross-entropy, using a single logit
# reduce_mean: average across all samples in the minibatch
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits = logits, labels = y))
# Round probabilities => get 0,1 prediction
# Compare to labels y
# Cast to float and get average accuracy
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(probabilities), y), tf.float32))

# Training
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

# Initialization of variables
init_op = tf.global_variables_initializer()
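
For intuition about the loss used above, here is a NumPy sketch of sigmoid cross-entropy computed directly from logits. It uses the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` and compares it with the textbook definition. The function name `sigmoid_xent` and the sample values are assumptions for illustration; this is not TensorFlow's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_xent(logits, labels):
    """Numerically stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

# Made-up logits and labels for illustration
logits = np.array([2.0, -1.0, 0.0])
labels = np.array([1.0, 0.0, 1.0])

stable = sigmoid_xent(logits, labels)

# Textbook definition: -z * log(p) - (1 - z) * log(1 - p), with p = sigmoid(x)
p = sigmoid(logits)
naive = -labels * np.log(p) - (1 - labels) * np.log(1 - p)

print(stable)
print(np.allclose(stable, naive))
```

Working from logits rather than probabilities avoids taking the log of a value that has already saturated to 0 or 1.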

We can now begin a session and train our model. We'll train for 250 epochs. When we're finished, we'll evaluate our accuracy on all the test data.

In [58]:
# Train
sess = tf.Session()
sess.run(init_op)

for epoch in range(250):
    for batch in range(train_batches):
        data = train[batch*batch_size:(batch+1)*batch_size]
        reviews = [sample['x'] for sample in data]
        labels  = [sample['y'] for sample in data]
        labels = np.array(labels).reshape([-1, 1])
    
        _, l, acc = sess.run([train_step, loss, accuracy], feed_dict={X: reviews, y: labels})
        
    if epoch % 10 == 0:
        print("Epoch: {0}\tLoss: {1:1.2f}\tAccuracy: {2:1.2f}".format(epoch, l, acc))    
    random.shuffle(train)

# Evaluate on test set
test_reviews = [sample['x'] for sample in test]
test_labels  = [sample['y'] for sample in test]
test_labels = np.array(test_labels).reshape([-1, 1])
acc = sess.run(accuracy, feed_dict={X: test_reviews, y: test_labels})

print("Final test accuracy: {0:1.2f}".format(acc))

Epoch: 0	Loss: 0.69	Accuracy: 0.60
Epoch: 10	Loss: 0.67	Accuracy: 0.60
Epoch: 20	Loss: 0.66	Accuracy: 0.59
Epoch: 30	Loss: 0.65	Accuracy: 0.62
Epoch: 40	Loss: 0.66	Accuracy: 0.53
Epoch: 50	Loss: 0.61	Accuracy: 0.70
Epoch: 60	Loss: 0.59	Accuracy: 0.75
Epoch: 70	Loss: 0.51	Accuracy: 0.84
Epoch: 80	Loss: 0.45	Accuracy: 0.88
Epoch: 90	Loss: 0.39	Accuracy: 0.88
Epoch: 100	Loss: 0.34	Accuracy: 0.93
Epoch: 110	Loss: 0.26	Accuracy: 0.91
Epoch: 120	Loss: 0.20	Accuracy: 0.97
Epoch: 130	Loss: 0.22	Accuracy: 0.90
Epoch: 140	Loss: 0.18	Accuracy: 0.96
Epoch: 150	Loss: 0.16	Accuracy: 0.96
Epoch: 160	Loss: 0.18	Accuracy: 0.91
Epoch: 170	Loss: 0.15	Accuracy: 0.96
Epoch: 180	Loss: 0.11	Accuracy: 0.96
Epoch: 190	Loss: 0.08	Accuracy: 0.98
Epoch: 200	Loss: 0.17	Accuracy: 0.92
Epoch: 210	Loss: 0.10	Accuracy: 0.97
Epoch: 220	Loss: 0.14	Accuracy: 0.94
Epoch: 230	Loss: 0.10	Accuracy: 0.96
Epoch: 240	Loss: 0.10	Accuracy: 0.98
Final test accuracy: 0.95


We can now examine what our model has learned, seeing how it responds to word vectors for different words:

In [63]:
# Check some words
words_to_test = ["exciting", "hated", "boring", "loved", "okay", "all_right"]

for word in words_to_test:
    print("{0} {1}".format(word, sess.run(probabilities, feed_dict={X: normalized_embeddings[index[word]].reshape(1, 300)})))

exciting [[0.99997854]]
hated [[1.1961651e-08]]
boring [[1.2153959e-06]]
loved [[0.99999964]]
okay [[0.03028856]]
all_right [[0.9836664]]


Try some words of your own!

In [64]:
sess.close()
tf.reset_default_graph()

This model works great for such a simple dataset, but does a little less well on something more complex. `movie-pang02.txt`, for instance, has 2000 longer, more complex movie reviews. It's in the same format as our simple dataset. On those longer reviews, this model achieves only 60-80% accuracy. (Increasing the number of epochs to, say, 1000, does help.)

### Recurrent Neural Networks (RNNs)

In the context of deep learning, natural language is commonly modeled with Recurrent Neural Networks (RNNs).
RNNs pass the output of a neuron back to the input of the next time step of the same neuron.
These directed cycles in the RNN architecture give them the ability to model temporal dynamics, making them particularly suited for modeling sequences (e.g. text).
We can visualize an RNN layer as follows:

<img src="Figures/basic_RNN.PNG" alt="basic_RNN" style="width: 80px;"/>
<center>Figure from *Understanding LSTMs*. https://colah.github.io/posts/2015-08-Understanding-LSTMs/</center>

We can unroll an RNN through time, making the sequence aspect of them more obvious:

<img src="Figures/unrolled_RNN.PNG" alt="basic_RNN" style="width: 400px;"/>
<center>Figure from *Understanding LSTMs*. https://colah.github.io/posts/2015-08-Understanding-LSTMs/</center>

#### RNNs in TensorFlow
How would we implement an RNN in TensorFlow? Given the different forms of RNNs, there are quite a few ways, but we'll stick to a simple one. 

In [65]:
# As always, import TensorFlow first
import tensorflow as tf

Let's assume we have our inputs in word embedding form already, say of dimensionality 100. We'll use a minibatch size of 16.

In [66]:
mb = 16
x_dim = 100

# Inputs
x1 = tf.placeholder(tf.float32, [mb, x_dim])

Define weight matrices for projecting the input, the previous state, and the output. Rather arbitrarily, let's pick a hidden layer size of 64.

In [67]:
h_dim = 64

# For projecting the input
U = tf.Variable(tf.truncated_normal([x_dim, h_dim], stddev=0.1))

# For projecting the previous state (W is h_dim x h_dim, i.e. 64x64)
W = tf.Variable(tf.truncated_normal([h_dim, h_dim], stddev=0.1))

# For projecting the output back to x_dim
V = tf.Variable(tf.truncated_normal([h_dim, x_dim], stddev=0.1))

Next, a function for one time step of the RNN.

In [70]:
def RNN_step(x, h):
    # add new input + memory from previous state
    h_next = tf.tanh(tf.matmul(x, U) + tf.matmul(h, W))
    
    # Project from hidden dimension to output
    output = tf.matmul(h_next, V)
    
    return output, h_next
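
In equation form, and assuming row-vector inputs as in `RNN_step`, one time step can be written as follows, where $x_t$ has shape $[mb, x\_dim]$, $h_t$ has shape $[mb, h\_dim]$, and $y_t$ has shape $[mb, x\_dim]$:

$$h_t = \tanh(x_t U + h_{t-1} W), \qquad y_t = h_t V$$

Note that this simple version has no bias terms, matching the code above.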

In [71]:
# Initialize hidden state to 0
h0 = tf.zeros([mb, h_dim])

# Forward pass of one RNN step for time step t=1
y1, h1 = RNN_step(x1, h0)

print("Output y1 dimensions: {0}".format(y1.shape))
print("Hidden state h1 dimensions: {0}".format(h1.shape))

Output y1 dimensions: (16, 100)
Hidden state h1 dimensions: (16, 64)


We can keep calling the `RNN_step` function to continue unrolling the RNN as far as we need to. For each step, we feed in the next input (a new placeholder) and get a new output.

In [72]:
x2 = tf.placeholder(tf.float32, [mb, x_dim]) # word 2

# Forward pass of one RNN step for time step t=2
y2, h2 = RNN_step(x2, h1)

print("Output y2 dimensions: {0}".format(y2.shape))
print("Hidden state h2 dimensions: {0}".format(h2.shape))

Output y2 dimensions: (16, 100)
Hidden state h2 dimensions: (16, 64)


Of course, in practice, you'd want to do this unrolling with a `for` loop, and the RNN functionality is more cleanly wrapped up in a class. 
We're not going to implement the class version here though, as TensorFlow already has these implemented: https://www.tensorflow.org/api_guides/python/contrib.rnn#Base_interface_for_all_RNN_Cells.
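
For intuition, the `for`-loop unrolling can be sketched in plain NumPy, independent of TensorFlow. This is only an illustration under stated assumptions: the weights and inputs are random stand-ins, there is no output projection, and nothing is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
mb, x_dim, h_dim, num_steps = 16, 100, 64, 10

# Randomly initialized weights (stand-ins for trained parameters)
U = rng.normal(scale=0.1, size=(x_dim, h_dim))
W = rng.normal(scale=0.1, size=(h_dim, h_dim))

# Random inputs standing in for a sequence of word embeddings
xs = rng.normal(size=(num_steps, mb, x_dim))

h = np.zeros((mb, h_dim))           # initial hidden state
hs = []
for x_t in xs:                      # one iteration per time step
    h = np.tanh(x_t @ U + h @ W)    # the same weights are reused at every step
    hs.append(h)

print(len(hs), hs[-1].shape)
```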

In [73]:
# Number of steps to unroll
num_steps = 10

# List of inputs and hidden states
xs = []
hs = []

# Build RNN
rnn = tf.contrib.rnn.BasicRNNCell(h_dim)

# Initialize hidden state to zero
h_t = tf.zeros([mb, h_dim])
    
for t in range(num_steps):
    x_t = tf.placeholder(tf.float32, [mb, x_dim])
    h_t, _ = rnn(x_t, h_t)
    
    xs.append(x_t)
    hs.append(h_t)

print("x dimensions:")
print([x_t.shape for x_t in xs])
print("\nh dimensions:")
print([h_t.shape for h_t in hs])

x dimensions:
[TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)]), TensorShape([Dimension(16), Dimension(100)])]

h dimensions:
[TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)]), TensorShape([Dimension(16), Dimension(64)])]


Note: in practice, people rarely use plain RNNs like this. When backpropagating through many time steps, the gradients can shrink toward zero (the vanishing gradient problem). LSTMs are a way around that.
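
The vanishing gradient effect can be seen in a scalar toy RNN, sketched below under made-up assumptions (a single recurrent weight `w = 0.5` and random inputs). By the chain rule, each step multiplies the gradient of the current state with respect to the initial state by `w * (1 - h_t**2)`, whose magnitude here is at most 0.5, so the gradient shrinks at least geometrically.

```python
import numpy as np

w = 0.5                 # recurrent weight (made-up value with |w| < 1)
num_steps = 50
rng = np.random.default_rng(0)
inputs = rng.normal(size=num_steps)

h = 0.0
grad = 1.0              # running value of d h_t / d h_0
grads = []
for x_t in inputs:
    h = np.tanh(w * h + x_t)
    # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
    grad *= w * (1.0 - h ** 2)
    grads.append(abs(grad))

print(grads[0], grads[9], grads[-1])
```

After 50 steps the gradient magnitude is bounded by `0.5 ** 50`, which is why information from early time steps has almost no influence on learning in this setting.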

#### Long Short-Term Memory (LSTM)
One popular type of RNN is the Long Short-Term Memory (LSTM) network.
We're not going to go into detail here about how LSTMs differ structurally from vanilla RNNs, but they are also sequence-modeling neural networks, with much better long-range modeling capabilities.
If you're curious, [this blog post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) does a fantastic job describing them.

### Other materials:
Like Reinforcement Learning, Natural Language Processing can easily fill several full courses on its own at most universities, with or without neural networks.
In fact, Prof Mohit Bansal has [taught](http://www.cs.unc.edu/~mbansal/teaching/nlp-course-fall17.html) [several](http://www.cs.unc.edu/~mbansal/teaching/nlp-seminar-spring18.html).

- [Fantastic introduction to LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Popular blog post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

## Survey of Machine Learning

Question: Have we covered the major fields of ML?

Answer: NLP, RL, Computer Vision are pretty major. Recommendation systems don't really fall into these buckets.

For computer vision: we didn't go into all the subfields, just touched on classification.