# Intro to Sparse Data and Embeddings

We're now going to look quickly at some textual data, to explore sparse data and work with embeddings.

The text data is based on *movie review data*.  Our task is to predict whether a review is generally *favorable* or *unfavorable*.

This data has already been processed into tf.Example format.  Let's download the training and test data.

In [None]:
!wget https://storage.googleapis.com/advanced-solutions-lab/mlcc/sparse_data_embedding/test.tfrecord -O /tmp/test.tfrecord
!wget https://storage.googleapis.com/advanced-solutions-lab/mlcc/sparse_data_embedding/train.tfrecord -O /tmp/train.tfrecord

Let's take a look at one example.

In [None]:
import tensorflow as tf
record_iterator = tf.python_io.tf_record_iterator(path='/tmp/train.tfrecord')
example = tf.train.Example()
# next(...) and print(...) work under both Python 2 and Python 3.
example.ParseFromString(next(record_iterator))
print(example)

We'll turn these string-valued terms into feature vectors by using a vocabulary.

The vocabulary is a list of each term we expect to see.  The first term listed will be mapped to the first coordinate in the feature vector, the second term to the second coordinate, and so on.

For the purposes of this exercise, we've created a small vocabulary that focuses on a limited set of terms.

Most of these terms were found to be strongly indicative of *favorable* or *unfavorable*, but some were just added because they're interesting.

Terms not appearing in this vocabulary are thrown away.
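To make the mapping concrete, here is a minimal NumPy sketch of a vocabulary-based encoding. The four-term vocabulary and the helper name `encode` are made up for illustration, and the sketch only records whether a term is present; it is not meant to mirror exactly what the feature column does internally.

```python
import numpy as np

# Hypothetical toy vocabulary; the exercise uses a longer list.
vocabulary = ["bad", "great", "boring", "fun"]
# The position of a term in the vocabulary is its coordinate in the vector.
term_to_index = {term: i for i, term in enumerate(vocabulary)}

def encode(terms, term_to_index):
  """Return a vector with a 1 for each vocabulary term that is present."""
  vector = np.zeros(len(term_to_index), dtype=np.float32)
  for term in terms:
    index = term_to_index.get(term)
    if index is not None:  # Terms outside the vocabulary are thrown away.
      vector[index] = 1.0
  return vector

review_terms = ["a", "great", "fun", "ride"]
encoded = encode(review_terms, term_to_index)
print(encoded)
```

Here "a" and "ride" are silently dropped, while "great" and "fun" switch on the second and fourth coordinates.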

We could of course use a larger vocabulary, and there are special tools for creating these.

We could also use a *feature hashing* approach that hashes each term, instead of creating an explicit vocabulary.  This works well in practice, but loses interpretability, which is useful for this exercise.
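As a rough illustration of the hashing idea, the sketch below maps each term to one of a fixed number of buckets. The helper `hash_bucket`, the choice of MD5, and the bucket count are all assumptions made for this example, not what any particular library does.

```python
import hashlib

def hash_bucket(term, num_buckets):
  """Map a term to a bucket index using a stable hash (MD5, as an assumption)."""
  # Python's built-in hash() can differ between processes, so we use hashlib
  # to get the same bucket for the same term on every run.
  digest = hashlib.md5(term.encode("utf-8")).hexdigest()
  return int(digest, 16) % num_buckets

num_buckets = 8  # Hypothetical; real systems typically use many more buckets.
terms = ["great", "boring", "fun", "great"]
buckets = [hash_bucket(term, num_buckets) for term in terms]
print(list(zip(terms, buckets)))
```

No vocabulary is needed, but two different terms can land in the same bucket, and a bucket index alone doesn't tell us which term produced it. That is the loss of interpretability mentioned above.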

# Task 1: Use a linear model with sparse inputs and an explicit vocabulary

First, we'll start with a linear model using the 54 terms in our small vocabulary -- always start simple!
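As an aside, here is a minimal sketch of what a linear (logistic) model over sparse term inputs computes. The toy vocabulary, the hand-picked weights, and the helper `predict_favorable` are hypothetical; a trained classifier learns its weights from data.

```python
import numpy as np

# Hypothetical toy vocabulary and hand-picked weights, for illustration only.
vocabulary = ["bad", "great", "boring", "fun"]
term_to_index = {term: i for i, term in enumerate(vocabulary)}
weights = np.array([-1.5, 2.0, -1.0, 1.0])
bias = 0.0

def predict_favorable(terms):
  """Probability of 'favorable' under a logistic linear model."""
  # Only the weights of terms that are actually present contribute to the
  # score, which is why sparse inputs are cheap for a linear model.
  present = [t for t in set(terms) if t in term_to_index]
  score = bias + sum(weights[term_to_index[t]] for t in present)
  return 1.0 / (1.0 + np.exp(-score))

p_good = predict_favorable(["great", "fun", "movie"])
p_bad = predict_favorable(["bad", "boring", "movie"])
print("favorable-looking review: %.3f" % p_good)
print("unfavorable-looking review: %.3f" % p_bad)
```

Each vocabulary term gets exactly one weight, so the model is easy to inspect: a positive weight pushes toward *favorable*, a negative one toward *unfavorable*.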

The `sparse_column_with_keys` feature column allows us to set up the string to feature vector mapping conveniently.

After you read through the code, run it and see how we do.

In [None]:
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython import display
from sklearn import metrics

tf.logging.set_verbosity(tf.logging.ERROR)

# First, set up a dictionary that allows us to parse out the features from the
# tf.Examples
features_to_types_dict = {
    "terms": tf.VarLenFeature(dtype=tf.string),
    "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32)}

# Create an input_fn that parses the tf.Examples from the given file pattern,
# and splits these into features and targets.
def _input_fn(input_file_pattern):
  features = tf.contrib.learn.io.read_batch_features(
    file_pattern=input_file_pattern,
    batch_size=25,
    features=features_to_types_dict,
    reader=tf.TFRecordReader)
  targets = features.pop("labels")
  return features, targets

informative_terms = [ "bad", "great", "best", "worst", "fun", "beautiful",
                      "excellent", "poor", "boring", "awful", "terrible",
                      "definitely", "perfect", "liked", "worse", "waste",
                      "entertaining", "loved", "unfortunately", "amazing",
                      "enjoyed", "favorite", "horrible", "brilliant", "highly",
                      "simple", "annoying", "today", "hilarious", "enjoyable",
                      "dull", "fantastic", "poorly", "fails", "disappointing",
                      "disappointment", "not", "him", "her", "good", "time",
                       "?", ".", "!", "movie", "film", "action", "comedy",
                       "drama", "family", "man", "woman", "boy", "girl" ]

# Create a feature column from "terms", using our informative terms.
terms_feature_column = tf.contrib.layers.sparse_column_with_keys(column_name="terms",
                                                                 keys=informative_terms)

feature_columns = [ terms_feature_column ]

classifier = tf.contrib.learn.LinearClassifier(
  feature_columns=feature_columns,
  optimizer=tf.train.AdagradOptimizer(
    learning_rate=0.1),
  gradient_clip_norm=5.0
)

classifier.fit(
  input_fn=lambda: _input_fn("/tmp/train.tfrecord"),
  steps=1000)

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn("/tmp/train.tfrecord"),
  steps=1000)
print("Training set metrics:")
for m in evaluation_metrics:
  print("{} {}".format(m, evaluation_metrics[m]))
print("---")

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn("/tmp/test.tfrecord"),
  steps=1000)

print("Test set metrics:")
for m in evaluation_metrics:
  print("{} {}".format(m, evaluation_metrics[m]))
print("---")

# Task 2: Use a deep model

The above model is a linear model.  It works quite well.  But can we do better with a neural net?  Let's try.

Swap in a [DNNClassifier](https://www.tensorflow.org/api_docs/python/tf/contrib/learn/DNNClassifier) for the Linear Classifier and try to run it.  What happens?

In [None]:
#@title Expand me for a possible solution

import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython import display
from sklearn import metrics

tf.logging.set_verbosity(tf.logging.ERROR)

# First, set up a dictionary that allows us to parse out the features from the
# tf.Examples
features_to_types_dict = {
    "terms": tf.VarLenFeature(dtype=tf.string),
    "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32)}

# Create an input_fn that parses the tf.Examples from the given file pattern,
# and splits these into features and targets.
def _input_fn(input_file_pattern):
  features = tf.contrib.learn.io.read_batch_features(
    file_pattern=input_file_pattern,
    batch_size=25,
    features=features_to_types_dict,
    reader=tf.TFRecordReader)
  targets = features.pop("labels")
  return features, targets

informative_terms = [ "bad", "great", "best", "worst", "fun", "beautiful",
                      "excellent", "poor", "boring", "awful", "terrible",
                      "definitely", "perfect", "liked", "worse", "waste",
                      "entertaining", "loved", "unfortunately", "amazing",
                      "enjoyed", "favorite", "horrible", "brilliant", "highly",
                      "simple", "annoying", "today", "hilarious", "enjoyable",
                      "dull", "fantastic", "poorly", "fails", "disappointing",
                      "disappointment", "not", "him", "her", "good", "time",
                       "?", ".", "!", "movie", "film", "action", "comedy",
                       "drama", "family", "man", "woman", "boy", "girl" ]

# Create a feature column from "terms", using our informative terms.
terms_feature_column = tf.contrib.layers.sparse_column_with_keys(column_name="terms",
                                                                 keys=informative_terms)

feature_columns = [ terms_feature_column ]

##################### Here's what we changed ##################################
classifier = tf.contrib.learn.DNNClassifier(                                  #
  feature_columns=feature_columns,                                            #
  hidden_units=[20,20],                                                       #
  optimizer=tf.train.AdagradOptimizer(                                        #
    learning_rate=0.1),                                                       #
  gradient_clip_norm=5.0                                                      #
)                                                                             #
###############################################################################

classifier.fit(
  input_fn=lambda: _input_fn("/tmp/train.tfrecord"),
  steps=1000)

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn("/tmp/train.tfrecord"),
  steps=1)
print("Training set metrics:")
for m in evaluation_metrics:
  print("{} {}".format(m, evaluation_metrics[m]))
print("---")

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn("/tmp/test.tfrecord"),
  steps=1)

print("Test set metrics:")
for m in evaluation_metrics:
  print("{} {}".format(m, evaluation_metrics[m]))
print("---")

# Task 3: Use an embedding with a deep model.

Ah, right: the DNN can't consume the sparse feature column directly.  We need to use an embedding, which serves as a nice adapter that lets sparse data be used as input to a deep neural net.

This is easy to set up using an `embedding_column`, which is a feature column that takes sparse data as input and gives a lower-dimensional dense vector as output, using an embedding to do the conversion.

Go ahead and insert an `embedding_column`, projecting into 2 dimensions.  Here's an example code snippet you might use:

`terms_embedding_column = tf.contrib.layers.embedding_column(terms_feature_column, dimension=2)`

`feature_columns = [ terms_embedding_column ]`

In practice, we might project to dimensions higher than 2, like 50 or 100.  But for now, 2 dimensions is easy to visualize.
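To see why an embedding acts as an adapter, the NumPy sketch below shows that looking up embedding rows for the terms that are present gives the same dense vector as multiplying a multi-hot vector by the embedding matrix. The sizes, the random matrix, and the choice to sum the rows are assumptions for illustration; a real embedding column may combine rows differently (for example by averaging).

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, dimension = 5, 2  # Hypothetical sizes.
# One row per vocabulary term; in a real model these values are learned.
embedding_matrix = rng.normal(size=(vocab_size, dimension))

# Sparse representation of an example: just the ids of the terms present.
present_term_ids = [1, 3]

# Dense route: build the multi-hot vector, then project it.
multi_hot = np.zeros(vocab_size)
multi_hot[present_term_ids] = 1.0
dense_via_matmul = multi_hot.dot(embedding_matrix)

# Sparse route: pick out the matching rows and add them up.
dense_via_lookup = embedding_matrix[present_term_ids].sum(axis=0)

print(dense_via_matmul)
print(dense_via_lookup)
```

Both routes produce the same 2-dimensional vector, but the lookup never materializes the long, mostly-zero input vector.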

In [None]:
#@title Expand for a possible solution

import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython import display
from sklearn import metrics

tf.logging.set_verbosity(tf.logging.ERROR)

# First, set up a dictionary that allows us to parse out the features from the
# tf.Examples
features_to_types_dict = {
    "terms": tf.VarLenFeature(dtype=tf.string),
    "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32)}

# Create an input_fn that parses the tf.Examples from the given file pattern,
# and splits these into features and targets.
def _input_fn(input_file_pattern):
  features = tf.contrib.learn.io.read_batch_features(
    file_pattern=input_file_pattern,
    batch_size=25,
    features=features_to_types_dict,
    reader=tf.TFRecordReader)
  targets = features.pop("labels")
  return features, targets


informative_terms = [ "bad", "great", "best", "worst", "fun", "beautiful",
                      "excellent", "poor", "boring", "awful", "terrible",
                      "definitely", "perfect", "liked", "worse", "waste",
                      "entertaining", "loved", "unfortunately", "amazing",
                      "enjoyed", "favorite", "horrible", "brilliant", "highly",
                      "simple", "annoying", "today", "hilarious", "enjoyable",
                      "dull", "fantastic", "poorly", "fails", "disappointing",
                      "disappointment", "not", "him", "her", "good", "time",
                       "?", ".", "!", "movie", "film", "action", "comedy",
                       "drama", "family", "man", "woman", "boy", "girl" ]

# Create a feature column from "terms", using our informative terms.
terms_feature_column = tf.contrib.layers.sparse_column_with_keys(column_name="terms",
                                                                 keys=informative_terms)

############################# Here's what we changed ###########################################
terms_embedding_column = tf.contrib.layers.embedding_column(terms_feature_column, dimension=2) #
feature_columns = [ terms_embedding_column ]                                                   #
################################################################################################

classifier = tf.contrib.learn.DNNClassifier(
  feature_columns=feature_columns,
  hidden_units=[10,10],
  optimizer=tf.train.AdagradOptimizer(
    learning_rate=0.1),
  gradient_clip_norm=5.0
)

classifier.fit(
  input_fn=lambda: _input_fn("/tmp/train.tfrecord"),
  steps=1000)

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn("/tmp/train.tfrecord"),
  steps=1000)
print("Training set metrics:")
for m in evaluation_metrics:
  print("{} {}".format(m, evaluation_metrics[m]))
print("---")

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn("/tmp/test.tfrecord"),
  steps=1000)

print("Test set metrics:")
for m in evaluation_metrics:
  print("{} {}".format(m, evaluation_metrics[m]))
print("---")

# Task 4: Convince yourself there's actually an embedding in there

The above model used something called an `embedding_column`, and it seemed to work, but this doesn't tell us much about what's going on internally.

How do we know this was actually using an embedding inside?  How would we check?

To start, let's look at the tensors in the model.

In [None]:
classifier.get_variable_names()

In [None]:
classifier.get_variable_value('dnn/input_from_feature_columns/terms_embedding/weights').shape

There's something interesting here, by the way: adding an `embedding_column` created a new layer in the model, and this layer is trainable along with the rest of the model just as any hidden layer is.

Spend some time manually checking the various layers and shapes to make sure everything is connected the way you expect.
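When checking shapes by hand, it can help to write down what you expect first. The sketch below is one way to do that for the Task 3 setup (54 vocabulary terms, a 2-dimensional embedding, and hidden layers of 10 and 10 units). The helper `dense_layer_shapes` is hypothetical, and the single output unit is an assumption about how the binary classifier is wired; compare the result against what the model actually reports.

```python
def dense_layer_shapes(input_size, hidden_units, num_outputs):
  """Return (weights_shape, biases_shape) for each fully connected layer."""
  shapes = []
  previous = input_size
  for units in list(hidden_units) + [num_outputs]:
    # Each layer maps `previous` inputs to `units` outputs, plus one bias
    # per output unit.
    shapes.append(((previous, units), (units,)))
    previous = units
  return shapes

vocab_size = 54          # Number of entries in informative_terms.
embedding_dimension = 2  # The dimension passed to embedding_column.

# The embedding holds one row per vocabulary term.
embedding_shape = (vocab_size, embedding_dimension)
# Assumption: binary classification uses a single output unit.
layer_shapes = dense_layer_shapes(embedding_dimension, [10, 10], 1)

print(embedding_shape)
for weights_shape, biases_shape in layer_shapes:
  print("weights %s, biases %s" % (weights_shape, biases_shape))
```

If a shape you see in the model differs from this list, that is a good prompt to work out which assumption was wrong.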

# Task 5: Examine the Embedding

Okay, let's now take a look at the actual space.  Run the following to see where the terms end up in the embedding space.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

embedding_matrix = classifier.get_variable_value('dnn/input_from_feature_columns/terms_embedding/weights')

# Set the figure size before drawing, so that it applies to this plot.
plt.figure(figsize=(12, 12))

for term_index in range(len(informative_terms)):
  # Create a one-hot encoding for our term.  It has 0's everywhere, except for
  # a single 1 in the coordinate that corresponds to that term.
  term_vector = np.zeros(len(informative_terms))
  term_vector[term_index] = 1
  # We'll now project that one-hot vector into the embedding space.
  embedding_xy = np.matmul(term_vector, embedding_matrix)
  plt.text(embedding_xy[0],
           embedding_xy[1],
           informative_terms[term_index])

# Do a little set-up to make sure the plot displays nicely.
plt.xlim(1.2 * embedding_matrix.min(), 1.2 * embedding_matrix.max())
plt.ylim(1.2 * embedding_matrix.min(), 1.2 * embedding_matrix.max())
plt.show()

Run this once with an embedding from a trained model.  Do things end up where you'd expect?

Re-train the model and run the embedding visualization again.  What stays the same?  What changes?

Finally, re-train the model again with just a single step.  (This will yield a terrible model.)  Run the embedding visualization again.  What do you see now, and why?

# Task 6:  Try to Improve the Model's Performance

Go ahead and try to change the model to improve performance.

This can be done by changing hyperparameters, or by using a different optimizer such as Adam (the models above use Adagrad).

But you might only gain one or two percentage points by following these strategies.

You might also get some wins by adding additional terms.

There's a full vocabulary file with all 30716 terms for this data set that you can use at: `https://storage.googleapis.com/advanced-solutions-lab/mlcc/sparse_data_embedding/terms.txt`

You can pick out additional terms from this vocabulary file, or use the whole thing via the `sparse_column_with_vocabulary_file` feature column.
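If you pick terms out of the file yourself, it can be handy to drop duplicates while keeping the terms in file order, so that the term-to-coordinate mapping is the same on every run. The sketch below assumes the file holds whitespace-separated terms; the helper `load_vocabulary` and the tiny stand-in file are made up so the example runs without the download.

```python
import os
import tempfile

def load_vocabulary(path):
  """Read whitespace-separated terms, dropping duplicates but keeping order."""
  seen = set()
  terms = []
  with open(path, "r") as f:
    for term in f.read().split():
      if term not in seen:
        seen.add(term)
        terms.append(term)
  return terms

# A tiny stand-in file so the sketch runs without downloading terms.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
  tmp.write("great\nbad\nfun\ngreat\n")
  demo_path = tmp.name

demo_terms = load_vocabulary(demo_path)
os.remove(demo_path)
print(demo_terms)
```

Deduplicating through a plain `set` also works, but the resulting order is arbitrary, which makes it harder to compare embeddings or weights across runs.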

In [None]:
!wget https://storage.googleapis.com/advanced-solutions-lab/mlcc/sparse_data_embedding/terms.txt -O /tmp/terms.txt

In [None]:
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython import display
from sklearn import metrics

tf.logging.set_verbosity(tf.logging.ERROR)

# First, set up a dictionary that allows us to parse out the features from the
# tf.Examples
features_to_types_dict = {
    "terms": tf.VarLenFeature(dtype=tf.string),
    "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32)}

# Create an input_fn that parses the tf.Examples from the given file pattern,
# and splits these into features and targets.
def _input_fn(input_file_pattern):
  features = tf.contrib.learn.io.read_batch_features(
    file_pattern=input_file_pattern,
    batch_size=100,
    features=features_to_types_dict,
    reader=tf.TFRecordReader)
  targets = features.pop("labels")
  return features, targets

# Create a feature column from "terms", using a full vocabulary file.
informative_terms = None
with open("/tmp/terms.txt", 'r') as f:
  # Convert it to set first to remove duplicates.
  informative_terms = list(set(f.read().split()))
terms_feature_column = tf.contrib.layers.sparse_column_with_keys(column_name="terms",
                                                                 keys=informative_terms)

terms_embedding_column = tf.contrib.layers.embedding_column(terms_feature_column, dimension=2)
feature_columns = [ terms_embedding_column ]

classifier = tf.contrib.learn.DNNClassifier(
  feature_columns=feature_columns,
  hidden_units=[10, 10],
  optimizer=tf.train.AdamOptimizer(
    learning_rate=0.001),
  gradient_clip_norm=1.0
)

classifier.fit(
  input_fn=lambda: _input_fn("/tmp/train.tfrecord"),
  steps=1000)

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn("/tmp/train.tfrecord"),
  steps=1000)
print("Training set metrics:")
for m in evaluation_metrics:
  print("{} {}".format(m, evaluation_metrics[m]))
print("---")

evaluation_metrics = classifier.evaluate(
  input_fn=lambda: _input_fn("/tmp/test.tfrecord"),
  steps=1000)

print("Test set metrics:")
for m in evaluation_metrics:
  print("{} {}".format(m, evaluation_metrics[m]))
print("---")

# A final word
One final thing to note here. We may have gotten a DNN solution with an embedding that was better than our original linear model, but the linear model was also pretty good and was quite a bit faster to train.
The faster training is because linear models do not have nearly as many parameters to update or layers to back-prop through.
In some applications, the speed of linear models may be a game changer, or linear models may be perfectly sufficient from a quality standpoint.
In other areas, the additional model complexity and capacity provided by DNNs might be more important.
The key is to remember to explore your problem sufficiently so that you know which space you're in.