[View in Colaboratory](https://colab.research.google.com/github/wronnyhuang/notebooks/blob/master/encoder_classifier.ipynb)

# Universal sentence encoder used for text classification

This notebook illustrates how to access the Universal Sentence Encoder and use it for sentence classification tasks such as negative news.

The sentence embeddings can be used to compute sentence level meaning similarity as well as to enable better performance on downstream classification tasks using less supervised training data.

We'll demonstrate it on the use case of movie review sentiment from the IMDB dataset (50k labeled examples across train/test). We will train a few neural layers on top of the sentence encoding.

The goal of this demonstration is to show that the universal sentence encoder can be used effectively for negative news classification if we train it with enough data.

**Setup**
This section sets up the environment for access to the Universal Sentence Encoder on TF Hub and provides examples of applying the encoder to words, sentences, and paragraphs.

In [0]:
# Install the latest Tensorflow version.
!pip3 install --quiet "tensorflow>=1.7"
# Install TF-Hub.
!pip3 install --quiet tensorflow-hub
!pip3 install --quiet seaborn

In [0]:
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
from glob import glob
import pickle

In [0]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"]

Now load tensorflow hub's module for the universal sentence encoder and name it ```embed```.


In [0]:
# Import the Universal Sentence Encoder's TF Hub module
tf.reset_default_graph()
embed = hub.Module(module_url)

# Compute a representation for each message, showing various lengths supported.
word = "Elephant"
sentence = "I am a sentence for which I would like to get its embedding."
paragraph = (
    "Universal Sentence Encoder embeddings also support short paragraphs. "
    "There is no hard limit on how lon the paragraph is. Roughly, the longer "
    "the more 'diluted' the embedding will be.")
messages = [word, sentence, paragraph]

# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

with tf.Session() as session:
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  message_embeddings = session.run(embed(messages))

  for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
    print("Message: {}".format(messages[i]))
    print("Embedding size: {}".format(len(message_embedding)))
    message_embedding_snippet = ", ".join(
        (str(x) for x in message_embedding[:3]))
    print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

In [4]:
# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
INFO:tensorflow:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-large/3'.
INFO:tensorflow:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-large/3'.


# Load and embed dataset
***Only need to do this once per session!!***
*Skip to next section if already done*

Download and untar the IMDB dataset

In [5]:
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
! tar -xzf aclImdb_v1.tar.gz


Redirecting output to ‘wget-log’.


Extract text from files to a list of strings and save that list

In [0]:
# get list of filenames
data_root = 'aclImdb'
trainposfiles = glob(os.path.join(data_root, 'train', 'pos', '*.txt'))
trainnegfiles = glob(os.path.join(data_root, 'train', 'neg', '*.txt'))
testposfiles = glob(os.path.join(data_root, 'test', 'pos', '*.txt'))
testnegfiles = glob(os.path.join(data_root, 'test', 'neg', '*.txt'))

# initialize list of training/test data which we'll load all into memory
xtrain = []
ytrain = []
xtest = []
ytest = []

# loop through all training/test files and store the sentences/labels
for file in trainposfiles:
  with open(file) as f:
    xtrain.append(f.read())
    ytrain.append(1.) # 1 for positive sentiment, 0 for negative

for file in trainnegfiles:
  with open(file) as f:
    xtrain.append(f.read())
    ytrain.append(0.)

for file in testposfiles:
  with open(file) as f:
    xtest.append(f.read())
    ytest.append(1.)

for file in testnegfiles:
  with open(file) as f:
    xtest.append(f.read())
    ytest.append(0.)

Convert the messages to sentence embeddings using the encoder. Then save the resulting embeddings of all the train/test data into a pickle file
**This cell takes a while to execute, also must run on CPU (OOM error on GPU)**

In [0]:
# convert the messages (strings) into sentence embeddings using the encoder
with tf.Session() as sess:
  sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
  
  # train - embed sentences TAKES A WHILE
  xtrain_emb = np.empty((0,512), float)
  batch_size = 500
  n_batch = int(len(xtrain)/batch_size)
  for i in range(n_batch):
    batch = xtrain[(batch_size*i):(batch_size*(i+1))]
    batch_emb = sess.run(embed(batch))
    xtrain_emb = np.append(xtrain_emb, batch_emb, axis=0)
    print(i, n_batch, xtrain_emb.shape)
  ytrain = np.array(ytrain).reshape(-1,1) # reshape ytrain
  with open('train_embeddings.pkl', 'wb') as f: # save to disk
    pickle.dump((xtrain, xtrain_emb, ytrain), f)

  # test - embed sentences TAKES A WHILE
  xtest_emb = np.empty((0,512), float)
  batch_size = 500
  n_batch = int(len(xtest)/batch_size)
  for i in range(n_batch):
    batch = xtest[(batch_size*i):(batch_size*(i+1))]
    batch_emb = sess.run(embed(batch))
    xtest_emb = np.append(xtest_emb, batch_emb, axis=0)
    print(i, n_batch, xtest_emb.shape)
  ytest = np.array(ytest).reshape(-1,1) # reshape ytest
  with open('test_embeddings.pkl', 'wb') as f: # save to disk
    pickle.dump((xtest, xtest_emb, ytest), f)

Optional: upload the pickle files to dropbox

In [0]:
# setup dropbox access with a token
! curl "https://raw.ontent.com/andreafabrizi/Dropbox-Uploader/master/dropbox_uploader.sh" -o /bin/dropbox_uploader.sh
! chmod +x /bin/dropbox_uploader.sh
! dropbox_uploader.sh

In [36]:
# upload pickles to dropbox
! dropbox_uploader.sh upload t*_embeddings.pkl datasets/imdb_universal_sentence/
! dropbox_uploader.sh -q share datasets/imdb_universal_sentence

 > Uploading "/content/test_embeddings.pkl" to "/datasets/imdb_universal_sentence/test_embeddings.pkl"... DONE
> Skipping file "/content/train_embeddings.pkl", file exists with the same hash
https://www.dropbox.com/sh/86u0243mq6fhv2z/AAAJebdtVG5fgLuZVx6BzzREa?dl=0


# Train neural net on top of embedding layer

First download pickle files saved from previous section

In [0]:
! curl -L https://www.dropbox.com/sh/0243mq6fhv2z/AAAJebdtVG5fgLuZVx6BzzREa?dl=0 > tmp.zip
! unzip tmp.zip -d 

Load the embeddings of train/test data from pickle file

In [0]:
# load train/test embeddings to memory
with open('train_embeddings.pkl', 'rb') as f:
  xtrain, xtrain_emb, ytrain = pickle.load(f)
with open('test_embeddings.pkl', 'rb') as f:
  xtest, xtest_emb, ytest = pickle.load(f)

Get tensorboard running so we can visualize our results while training

In [0]:
# tensorboard
! git clone https://github.com/wronnyhuang/bin
! mv bin/* /bin/ && rm -r bin
! sh /bin/install_ngrok.sh
! sh /bin/tboard.sh . n

Add a neural network on top of the embedding layer with ```len(n_hidden)``` layers and ```n_hidden[i]``` hidden units in the ```i```th layer

In [0]:
n_hidden = [768, 1024]

# reset graph
tf.reset_default_graph()
embed = hub.Module(module_url)

# input placeholders
xinput = tf.placeholder(dtype=tf.float32, shape=[None, xtrain_emb.shape[1]])
yinput = tf.placeholder(dtype=tf.float32, shape=[None, 1])
linput = tf.placeholder(dtype=tf.float32) # lrn rate
tinput = tf.placeholder(dtype=tf.bool) # training mode flag

# add neural network layers
z = xinput
for nh in n_hidden:
  z = tf.layers.dense(inputs=z, units=nh, use_bias=False)
#   z = tf.layers.batch_normalization(z, training=tinput)
  z = tf.nn.relu(z)
  z = tf.layers.dropout(z, rate=0.5, training=tinput)

# logit layer
logits = tf.layers.dense(inputs=z, units=1)

# binary cross entropy
xent = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=yinput)
xent = tf.reduce_mean(xent)

# weight decay
wdec = tf.add_n([tf.nn.l2_loss(t) for t in tf.trainable_variables()])

# total loss with regularizers
criterion = xent + 5e-4*wdec

# step operation for the optimizer
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
#   train_step = tf.train.MomentumOptimizer(learning_rate=linput, momentum=.9).minimize(criterion)
#   train_step = tf.train.GradientDescentOptimizer(learning_rate=linput).minimize(criterion)
  train_step = tf.train.AdamOptimizer(learning_rate=linput).minimize(criterion)

# prediction and accuracy
prediction = tf.nn.sigmoid(logits)
accuracy = tf.reduce_mean(tf.cast(tf.equal(yinput, tf.round(prediction)), tf.float32))

# adding some operations for tensorboard visualization
tf.summary.scalar('accuracy', accuracy)
tf.summary.scalar('loss', criterion)
merged = tf.summary.merge_all()

Having built our static computational graph, we can now train it by feeding in some data and doing backprop. We'll run the training for ```n_epoch``` epochs with a batch size of ```batch_size```

In [122]:
n_epoch = 50
batch_size = 100

# prepare for minibatching
np.random.seed(1)
x_size = len(xtrain_emb)
n_batch = int(x_size/batch_size)

# prepare session
sess = tf.Session()
sess.run([tf.global_variables_initializer(), tf.tables_initializer()])

log_root = '512-'+'-'.join([str(nh) for nh in n_hidden])+'-1'
train_writer = tf.summary.FileWriter('./'+log_root+'/train', sess.graph)
test_writer = tf.summary.FileWriter('./'+log_root+'/test', sess.graph)

# train and test iteratively
global_step = 1
for i in range(n_epoch): # loop over epochs

  # perform test run
  summaries_test, acc, loss, ypred = sess.run([merged, accuracy, criterion, prediction],
                       feed_dict={xinput:xtest_emb, yinput:ytest, tinput:False})
  test_writer.add_summary(summaries_test, global_step)
  print('TEST: epoch='+str(i)+' acc='+str(acc)+' loss='+str(loss))

  # learning rate schedule
  if i<30: lr = 1e-2
  elif i<45: lr = 1e-3
  elif i<55: lr = 1e-4

  # perform training runs
  for j in range(n_batch): # loop over minibatches

    # acquire minibatches
    randidx = np.random.permutation(len(xtrain))
    batchidx = randidx[(j*batch_size):((j+1)*batch_size)]
    xbatch = xtrain_emb[batchidx]
    ybatch = ytrain[batchidx]

    # run backprop
    summaries, _, _ = sess.run([merged, train_step, update_ops],
                            feed_dict={xinput:xbatch, yinput:ybatch, linput:lr, tinput:True})    
    train_writer.add_summary(summaries, global_step)
    global_step += 1

TEST: epoch=0 acc=0.46812 loss=1.0691527
TEST: epoch=1 acc=0.83412 loss=0.47312164
TEST: epoch=2 acc=0.80968 loss=0.5102475
TEST: epoch=3 acc=0.83036 loss=0.47979146
TEST: epoch=4 acc=0.84112 loss=0.46518356
TEST: epoch=5 acc=0.83796 loss=0.46715677
TEST: epoch=6 acc=0.82624 loss=0.4889658
TEST: epoch=7 acc=0.83912 loss=0.47654745
TEST: epoch=8 acc=0.83292 loss=0.46943253
TEST: epoch=9 acc=0.8402 loss=0.4546898
TEST: epoch=10 acc=0.83796 loss=0.47060314
TEST: epoch=11 acc=0.83548 loss=0.47652918
TEST: epoch=12 acc=0.83392 loss=0.48103708
TEST: epoch=13 acc=0.83544 loss=0.48671293
TEST: epoch=14 acc=0.83812 loss=0.45280257
TEST: epoch=15 acc=0.83536 loss=0.46472287
TEST: epoch=16 acc=0.84052 loss=0.47275773
TEST: epoch=17 acc=0.83264 loss=0.51111686
TEST: epoch=18 acc=0.83212 loss=0.4753885
TEST: epoch=19 acc=0.82808 loss=0.49583513
TEST: epoch=20 acc=0.84 loss=0.46736062
TEST: epoch=21 acc=0.8356 loss=0.4710739
TEST: epoch=22 acc=0.838 loss=0.49092877
TEST: epoch=23 acc=0.83684 loss=0.

Save trained model to disk

In [0]:
# saving trained weights to disk
log_root = '512-'+'-'.join([str(nh) for nh in n_hidden])+'-1'
saver = tf.train.Saver()
save_path = saver.save(sess, os.path.join(log_root,'model.ckpt'))

# Evaluation

Take a look at a random sampling of the test results

In [123]:
np.random.seed(2)
idx = np.random.permutation(xtest_emb.shape[0])[:5]
for i in idx:
  print('Sample '+str(i)+' | truth='+str(ytest[i])+' prediction='+str(ypred[i]))
  print(xtest[i]+'\n')

Sample 15050 | truth=[0.] prediction=[0.06854259]
this movie is allegedly a comedy.so where did all the laughs go.did the forget to put them in,on the version i watched.as a football movie,it is mildly entertaining,i guess.maybe'm just a stick in the mud,with no discernible sense of humour.or maybe this movie just isn't funny.it is also annoying,with that way over the top "you're a winner"musical score.and the odd thing is,the team sucked through most of the season,only winning the last two games,and the last game meant nothing since they were not in the playoffs.so what is the point? are they celebrating mediocrity?I don't see it.if anybody knows,please let me know.anyway,this movie isn't great or even very good.i'm giving it a low 3*

Sample 9386 | truth=[1.] prediction=[0.9910898]
Ray is one of those movies that makes you pause. You actually think about what you heard or think about what you read about this man and it doesn't even come close. During my first viewing of Ray I forgot 

Let's take a look also at the errors

In [58]:
ypred_ = np.round(ypred)
false_positive = [(x,yp,yt) for x,yp,yp_,yt in zip(xtest,ypred,ypred_,ytest) if yp_==1 and yt==0]
false_negative = [(x,yp,yt) for x,yp,yp_,yt in zip(xtest,ypred,ypred_,ytest) if yp_==0 and yt==1]

print('False positives:\n')
idx = np.random.permutation(len(false_positive))[:5]
for i in idx:
  print('Sample '+str(i)+' | truth='+str(false_positive[i][2])+' prediction='+str(false_positive[i][1]))
  print(false_positive[i][0]+'\n')
  
print('False negatives:\n')
idx = np.random.permutation(len(false_negative))[:5]
for i in idx:
  print('Sample '+str(i)+' | truth='+str(false_negative[i][2])+' prediction='+str(false_negative[i][1]))
  print(false_negative[i][0]+'\n')
  

False positives:

Sample 236 | truth=[0.] prediction=[0.8386168]
If it wasn't for the terrific music, I would not hesitate to give this cinematic underachievement 2/10. But the music actually makes me like certain passages, and so I give it 5/10.

Sample 547 | truth=[0.] prediction=[0.6503953]
I am astonished at the major comments here for this OK surf film. It really stems from the "California Dreamin" school of barnyard to beach antics and isn't really plausible. The idea that the lead kid learn to ride a board SO well in a concrete wave pool that he beats the real surfers at their game in the real ocean, is just plain silly. In Australia where most urban teens do surf, this film was laughed at audiences took it all with a grain of sea salt. Made in the 80s but with its heart in the 60s, it is fun to watch and looks and sounds good, but it is not a in a classic class at all. Even the actors didn't outlast this. We're seriously in LIQUID BRIDGE or RIDE THE WILD SURF or BEACH BLANKET B

Let's try our own paragraphs

In [107]:
xinfer = ['i did like this movie',
          'i did not like this movie',
          'this movie was not good',
          'this movie was not bad',
          'i loved nothing in this movie',
          'i loved everything in this movie',
          'i hated nothing in this movie',
          'i hated everything in this movie',
          'tom thought this was a great movie, i disagree with him',
          'tom thought this was a great movie, i agree with him',
          'tom thought this was a great movie',
          'this was not my favorite. the story was good but the graphics left something to be desired',
          'it seems like every time i go to the theater this movie has the most people. when i finally decided to go watch it, i realized what i was missing out on',
          'this was my favorite. the story was good and the graphics were a force to be reckoned with',
          'people have said this was not so bad, they were not kidding',
         ]

# inference thru model
xinfer_emb = sess.run(embed(xinfer))

# xinfer_emb = sess.run(embed(xinfer))
ypred = sess.run(prediction, feed_dict={xinput:xinfer_emb, tinput:False})

# display results
for x,y in zip(xinfer,ypred):
  print('Prediction: '+str(y))
  print(x+'\n')

Prediction: [0.78410256]
i did like this movie

Prediction: [0.00273543]
i did not like this movie

Prediction: [0.00068496]
this movie was not good

Prediction: [0.7811971]
this movie was not bad

Prediction: [0.22123347]
i loved nothing in this movie

Prediction: [0.9904136]
i loved everything in this movie

Prediction: [0.00336299]
i hated nothing in this movie

Prediction: [0.00048355]
i hated everything in this movie

Prediction: [0.40297914]
tom thought this was a great movie, i disagree with him

Prediction: [0.8812856]
tom thought this was a great movie, i agree with him

Prediction: [0.8496319]
tom thought this was a great movie

Prediction: [0.09111233]
this was not my favorite. the story was good but the graphics left something to be desired

Prediction: [0.93032455]
it seems like every time i go to the theater this movie has the most people. when i finally decided to go watch it, i realized what i was missing out on

Prediction: [0.99503905]
this was my favorite. the story 