## Content-Based Filtering Using Neural Networks

This lab relies on files created in the [content_based_preproc.ipynb](./content_based_preproc.ipynb) notebook. Be sure to complete the TODOs in that notebook and run the code there before completing this lab.  
Also, we'll be using the **python3** kernel from here on out so don't forget to change the kernel if it's still python2.

This lab illustrates:
1. how to build feature columns for a model using tf.feature_column
2. how to create custom evaluation metrics and add them to Tensorboard
3. how to train a model and make predictions with the saved model

Tensorflow Hub should already be installed. You can check using pip freeze.

In [1]:
%%bash
pip freeze | grep tensor

tensorboard==1.8.0
tensorflow==1.8.0


If 'tensorflow-hub' isn't one of the outputs above, then you'll need to install it by running the cell below. After the pip install finishes, click **"Reset Session"** on the notebook so that the Python environment picks up the new package.

In [2]:
%%bash
pip install tensorflow-hub

Collecting tensorflow-hub
  Downloading https://files.pythonhosted.org/packages/10/5c/6f3698513cf1cd730a5ea66aec665d213adf9de59b34f362f270e0bd126f/tensorflow_hub-0.4.0-py2.py3-none-any.whl (75kB)
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.4.0


In [3]:
import os
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
import shutil

PROJECT = 'qwiklabs-gcp-7eb10d4d720d2fdd' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'qwiklabs-gcp-7eb10d4d720d2fdd' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

  from ._conv import register_converters as _register_converters
W0421 13:34:48.102803 140563460257536 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


In [4]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


### Build the feature columns for the model.

To start, we'll load the list of categories, authors and article ids we created in the previous **Create Datasets** notebook.

In [10]:
category_list = open("categories.txt").read().splitlines()
author_list = open("authors.txt").read().splitlines()
content_id_list = open("content_ids.txt").read().splitlines()
mean_months_since_epoch = 523

In [11]:
print(category_list)
print(author_list)

['Stars & Kultur', 'News', 'Lifestyle']
['Christina Michlits', 'Mathias Kainz', 'Thomas  Trescher', 'Stefan Berndl', 'Anita Kattinger', 'Martina Salomon', 'Marlene Patsalidis', 'Georg Leyrer', 'Elisabeth Spitzer', 'Elisabeth Sereda', 'Gabriele Kuhn', 'Maria Zelenko', 'Elisabeth Mittendorfer', 'Cordula Puchwein', 'Daniela Wahl', 'Kid Möchel', 'Yvonne Widler', 'Moritz Gottsauner-Wolf', 'Stefan Hofer', 'Raffaela Lindorfer', 'Peter Temel', 'Wolfgang Atzenhofer', 'Heidi Strobl', 'Helmut Brandstätter', 'Sandra Lumetsberger', 'Alexander Huber', 'Mirad Odobasic', 'Irmgard Kischko', 'Daniela Davidovits', 'Bernhard Gaul', 'Ute Brühl', 'Margaretha Kopeinig', 'Ricardo Peyerl', 'Christine Klafl', 'Alexander Strecha', 'Julia Schrenk', 'Michaela Reibenwein', 'Stefanie Rachbauer', 'Hermann Sileitsch-Parzer', 'Andreas Anzenberger', 'Elisabeth Holzer', 'Franz Jandrasits', 'Jens Mattern', 'Sandra Baierl', 'Claudia Elmer', 'Günther Pavlovics', 'Bilal Baltaci', 'Brigitte Schokarth', 'Julia Pfligl', 'Bernha

In the cell below we'll define the feature columns to use in our model. If necessary, remind yourself of the [various feature columns](https://www.tensorflow.org/api_docs/python/tf/feature_column) available.  
For the embedded_title_column feature column, use a Tensorflow Hub Module to create an embedding of the article title. Since the articles and titles are in German, you'll want to use a German language embedding module.  
Explore the text embedding Tensorflow Hub modules [available here](https://alpha.tfhub.dev/). Filter by setting the language to 'German'. The 50 dimensional embedding should be sufficient for our purposes. 
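
Two of the columns in the next cell, the hashed categorical column and the embedding column, rest on simple ideas that are easy to try in plain Python. The sketch below is only an illustration: `hash_bucket` is a hypothetical helper that uses MD5 as a stand-in for TensorFlow's internal hash, and the embedding table holds random numbers where TensorFlow would learn the weights during training.

```python
import hashlib
import random

def hash_bucket(value, num_buckets):
    """Map a string to a bucket index in [0, num_buckets).

    MD5 is an assumption made for this sketch; it is not the hash
    TensorFlow uses, but it is deterministic across Python runs.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# A few author names for illustration; the lab reads the full list from authors.txt.
authors = ["Christina Michlits", "Mathias Kainz", "Stefan Berndl"]
num_buckets = len(authors) + 1
embedding_dim = 3

# One row of weights per bucket. Random here, learned in a real model.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(num_buckets)]

for name in authors:
    bucket = hash_bucket(name, num_buckets)
    print(name, "-> bucket", bucket, "->", embedding_table[bucket])
```

Two different strings can land in the same bucket, in which case they share one embedding row. A larger `num_buckets` makes such collisions less likely at the cost of a bigger table.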

In [22]:
#TODO: use a Tensorflow Hub module to create a text embedding column for the article "title".
# Use the module available at https://alpha.tfhub.dev/ filtering by German language.
# This is a pre-trained embedding model: it maps each title to a 50-dimensional numeric vector,
# so that similar titles end up with similar vectors.
embedded_title_column = hub.text_embedding_column(key='title',
                                                  module_spec="https://tfhub.dev/google/nnlm-de-dim50/1",
                                                  trainable=False)
    
    
#TODO: create an embedded categorical feature column for the article id; i.e. "content_id".
# First make a categorical column with hash buckets: each id string is hashed to an integer bucket.
content_id_column = tf.feature_column.categorical_column_with_hash_bucket(key='content_id', 
                                                                          hash_bucket_size=len(content_id_list)+1)
# Then map each bucket to a learned 10-dimensional dense vector
# (dimension=10 is the size of the vector, not a number of categories).
embedded_content_column = tf.feature_column.embedding_column(categorical_column=content_id_column, dimension=10)
# See https://www.tensorflow.org/guide/feature_columns#hashed_column
# Note: different ids can hash to the same bucket, in which case they share an embedding.

#TODO: create an embedded categorical feature column for the article "author"
# Same pattern: hash the author name into a bucket, then convert to an embedding.
author_column = tf.feature_column.categorical_column_with_hash_bucket(key='author', 
                                                                      hash_bucket_size=len(author_list)+1)
embedded_author_column = tf.feature_column.embedding_column(categorical_column=author_column, dimension=3)
# Each author gets (roughly) its own category via the hash bucket.
# The embedding size (3 here) is a hyperparameter that can be tuned.

#TODO: create a categorical feature column for the article "category"
# Only 3 known categories, so use a vocabulary list plus one out-of-vocabulary bucket.
# Start with a 'categorical' column, then wrap it in an indicator (one-hot) column.
# Note: 'category' here is the news category, not the column type, which is also called categorical.
cat_col_categorical = tf.feature_column.categorical_column_with_vocabulary_list(key="category",
                                                                                vocabulary_list=category_list,
                                                                                num_oov_buckets=1)
category_column = tf.feature_column.indicator_column(cat_col_categorical)
# The wrapping is needed because the input layer only accepts dense columns (indicator or embedding).

months_since_epoch_boundaries = list(range(400,700,20))
#TODO: create a bucketized numeric feature column of values for the "months since epoch"
# first make the numeric, THEN make the buckets
months_col = tf.feature_column.numeric_column(key="months_since_epoch")
months_since_epoch_bucketized = tf.feature_column.bucketized_column(source_column=months_col,
                                                                   boundaries=months_since_epoch_boundaries)

#TODO: create a crossed feature column using the "category" and "months since epoch" values
# Note: the cross takes the 'categorical' version of category (before the indicator wrapping)
# together with the bucketized months column.
crossed_months_since_category_column = \
tf.feature_column.indicator_column(
    tf.feature_column.crossed_column(
        keys=[cat_col_categorical, months_since_epoch_bucketized],
        # feature columns have no len(), so size the hash space from the underlying lists
        hash_bucket_size=len(months_since_epoch_boundaries) * (len(category_list) + 1)
    ) 
)

# final feature columns in use
feature_columns = [embedded_content_column,
                   embedded_author_column,
                   category_column,
                   embedded_title_column,
                   crossed_months_since_category_column] 
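
To see what the bucketized and crossed columns above compute, here is a minimal plain-Python sketch. It reuses the lab's boundaries and the three categories printed earlier. The helper names are hypothetical, and the cross is enumerated directly as an assumption for clarity, whereas `crossed_column` hashes each combination into `hash_bucket_size` slots.

```python
import bisect

boundaries = list(range(400, 700, 20))                 # same boundaries as the lab
categories = ["Stars & Kultur", "News", "Lifestyle"]   # from categories.txt

def bucketize(value, boundaries):
    """Return the bucket index, i.e. the number of boundaries <= value."""
    return bisect.bisect_right(boundaries, value)

def category_index(category, vocabulary):
    """Vocabulary index, with one extra out-of-vocabulary slot at the end."""
    return vocabulary.index(category) if category in vocabulary else len(vocabulary)

num_month_buckets = len(boundaries) + 1     # 15 boundaries give 16 buckets
num_category_slots = len(categories) + 1    # 3 categories plus the OOV slot

def cross_index(category, months):
    """Give every (category, month bucket) pair its own index."""
    return (category_index(category, categories) * num_month_buckets
            + bucketize(months, boundaries))

print(bucketize(523, boundaries))                # 523 is the mean used as the CSV default
print(cross_index("News", 523))
print(num_month_buckets * num_category_slots)    # number of distinct combinations
```

The last number is the count of distinct (category, month bucket) pairs, which is a useful reference point when choosing `hash_bucket_size` for the cross.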

### Create the input function.

Next we'll create the input function for our model. This input function reads the data from the csv files we created in the previous labs. 

In [15]:
record_defaults = [["Unknown"], ["Unknown"],["Unknown"],["Unknown"],["Unknown"],[mean_months_since_epoch],["Unknown"]]
column_keys = ["visitor_id", "content_id", "category", "title", "author", "months_since_epoch", "next_content_id"]
label_key = "next_content_id"
def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
      def decode_csv(value_column):
          columns = tf.decode_csv(value_column,record_defaults=record_defaults)
          features = dict(zip(column_keys, columns))          
          label = features.pop(label_key)         
          return features, label

      # Create list of files that match pattern
      file_list = tf.gfile.Glob(filename)

      # Create dataset from file list
      dataset = tf.data.TextLineDataset(file_list).map(decode_csv)

      if mode == tf.estimator.ModeKeys.TRAIN:
          num_epochs = None # indefinitely
          dataset = dataset.shuffle(buffer_size = 10 * batch_size)  # control train time by setting the number of steps later
      else:
          num_epochs = 1 # end-of-input after this

      dataset = dataset.repeat(num_epochs).batch(batch_size)
      return dataset.make_one_shot_iterator().get_next()
  return _input_fn
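
The `decode_csv` step is the heart of the input function: it splits a line into fields, fills empty fields with the defaults, names the fields, and separates the label from the features. A rough plain-Python equivalent, assuming the same column order as above, looks like this. The sample row is made up for illustration and `decode_line` is a hypothetical helper.

```python
import csv
import io

column_keys = ["visitor_id", "content_id", "category", "title",
               "author", "months_since_epoch", "next_content_id"]
defaults = ["Unknown", "Unknown", "Unknown", "Unknown", "Unknown", 523, "Unknown"]
label_key = "next_content_id"

def decode_line(line):
    """Split one CSV line into (features, label), filling empty fields with defaults."""
    fields = next(csv.reader(io.StringIO(line)))
    values = []
    for raw, default in zip(fields, defaults):
        if raw == "":
            values.append(default)          # missing value: fall back to the default
        elif isinstance(default, int):
            values.append(int(raw))         # the default's type decides the column type
        else:
            values.append(raw)
    features = dict(zip(column_keys, values))
    label = features.pop(label_key)
    return features, label

# Made-up row; the author field is left empty on purpose.
sample = '1001,299824032,News,"Some title, with a comma",,574,299837992'
features, label = decode_line(sample)
print(features)
print(label)
```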

### Create the model and train/evaluate


Next, we'll build our model, which recommends an article for a visitor to the Kurier.at website. Look through the code below. We use tf.feature_column.input_layer to turn the feature columns into the dense input layer of our network. This is a simple feed-forward network: the hidden_units parameter sets the number of hidden layers and the number of units in each.

Currently, we compute the accuracy between our predicted 'next article' and the actual 'next article' read by the visitor. Resolve the TODOs in the cell below by adding additional performance metrics to assess our model. You will need to 
* use the [tf.metrics library](https://www.tensorflow.org/api_docs/python/tf/metrics) to compute an additional performance metric
* add this additional metric to the metrics dictionary, and 
* include it in the tf.summary that is sent to Tensorboard.
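
The top-10 accuracy asked for in the TODOs counts a prediction as correct when the true label is anywhere among the 10 highest-scoring classes. The idea can be sketched in plain Python as follows. The logits and labels are made up, the helper names are hypothetical, and ties between scores are not treated specially here.

```python
def in_top_k(scores, target, k):
    """True if the target class is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return target in ranked[:k]

def top_k_accuracy(batch_scores, targets, k):
    """Fraction of examples whose target falls in the top k."""
    hits = [in_top_k(s, t, k) for s, t in zip(batch_scores, targets)]
    return sum(hits) / len(hits)

# Made-up logits for 3 examples over 5 classes.
logits = [
    [0.1, 2.0, 0.3, 1.5, -1.0],
    [3.0, 0.2, 0.1, 0.0, 0.5],
    [0.0, 0.1, 0.2, 0.3, 4.0],
]
labels = [3, 0, 1]

print(top_k_accuracy(logits, labels, k=1))   # plain accuracy
print(top_k_accuracy(logits, labels, k=2))   # more forgiving
```

With thousands of candidate articles, exact accuracy is a harsh measure, so a top-k metric gives a better picture of whether the model ranks the right article near the top.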

In [23]:
def model_fn(features, labels, mode, params):
  net = tf.feature_column.input_layer(features, params['feature_columns'])
  # Each call to tf.layers.dense adds one new fully connected layer on top of the previous one.
  for units in params['hidden_units']:
    net = tf.layers.dense(net, units=units, activation=tf.nn.relu)

  # Compute logits (1 per class).
  logits = tf.layers.dense(net, params['n_classes'], activation=None) 

  predicted_classes = tf.argmax(logits, 1)
  from tensorflow.python.lib.io import file_io
    
  with file_io.FileIO('content_ids.txt', mode='r') as ifp:
    content = tf.constant([x.rstrip() for x in ifp])
  predicted_class_names = tf.gather(content, predicted_classes) # gets the content id of the predicted classes
  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {
        'class_ids': predicted_classes[:, tf.newaxis],
        'class_names' : predicted_class_names[:, tf.newaxis],
        'probabilities': tf.nn.softmax(logits),
        'logits': logits,
    }
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)
  table = tf.contrib.lookup.index_table_from_file(vocabulary_file="content_ids.txt")
  labels = table.lookup(labels)
  # Compute loss.
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  # Compute evaluation metrics.
  accuracy = tf.metrics.accuracy(labels=labels,
                                 predictions=predicted_classes,
                                 name='acc_op')
  #TODO: Compute the top_10 accuracy, using the tf.nn.in_top_k and tf.metrics.mean functions in Tensorflow
  # tf.nn.in_top_k checks, for each example, whether the true label is among the 10 highest logits;
  # tf.metrics.mean then averages those hits into a (value, update_op) pair.
  top_10_accuracy = tf.metrics.mean(tf.nn.in_top_k(predictions=logits, targets=labels, k=10))

  #TODO: Add top_10_accuracy to the metrics dictionary
  metrics = {
    'accuracy': accuracy,
    'top_10_accuracy': top_10_accuracy
  }

  #TODO: Add the top_10_accuracy metric to the Tensorboard summary
  # Each metric is a (value, update_op) tuple; tf.summary.scalar needs a tensor, so pass element [1].
  tf.summary.scalar('accuracy', accuracy[1])
  tf.summary.scalar('top_10_accuracy', top_10_accuracy[1])
    
  if mode == tf.estimator.ModeKeys.EVAL:
      return tf.estimator.EstimatorSpec(
          mode, loss=loss, eval_metric_ops=metrics)

  # Create training op.
  assert mode == tf.estimator.ModeKeys.TRAIN

  optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
  return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
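
For a single example, the entries of the predictions dictionary built in `model_fn` boil down to a softmax over the logits, an argmax, and a lookup of the content id at that position. A minimal sketch with made-up content ids and logits:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical content ids (one per class) and logits for one example.
content_ids = ["299824032", "299837992", "299912085"]
logits = [0.5, 2.0, -1.0]

probabilities = softmax(logits)
class_id = max(range(len(logits)), key=lambda i: logits[i])   # plays the role of tf.argmax
class_name = content_ids[class_id]                            # plays the role of tf.gather

print(class_id, class_name, probabilities)
```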


### Train and Evaluate

In [25]:
outdir = 'content_based_model_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time
tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir = outdir,
    params={
     'feature_columns': feature_columns,
      'hidden_units': [200, 100, 50],  # this can be more or less to change the number of layers
      'n_classes': len(content_id_list)  # one output class per known content id
    })

train_spec = tf.estimator.TrainSpec(
    input_fn = read_dataset("training_set.csv", tf.estimator.ModeKeys.TRAIN),
    max_steps = 200)

eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset("test_set.csv", tf.estimator.ModeKeys.EVAL),
    steps = None,  # epochs is just 1
    start_delay_secs = 30,
    throttle_secs = 60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

INFO:tensorflow:Using default config.


INFO:tensorflow:Using config: {'_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_task_id': 0, '_save_checkpoints_steps': None, '_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd7397c4630>, '_num_ps_replicas': 0, '_evaluation_master': '', '_session_config': None, '_num_worker_replicas': 1, '_train_distribute': None, '_tf_random_seed': None, '_model_dir': 'content_based_model_trained', '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_is_chief': True, '_service': None, '_task_type': 'worker'}


INFO:tensorflow:Running training and evaluation locally (non-distributed).


INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 60 secs (eval_spec.throttle_secs) or training is finished.


INFO:tensorflow:Calling model_fn.


AttributeError: module 'tensorflow' has no attribute 'init_scope'

### Make predictions with the trained model. 

With the model now trained, we can make predictions by calling the predict method on the estimator. Let's look at how our model predicts on the first five examples of the training set.  
To start, we'll create a new file 'first_5.csv' which contains the first five elements of our training set. We'll also save the content ids of those examples (the second CSV column) to a file 'first_5_content_ids' so we can compare our results. 

In [None]:
%%bash
head -5 training_set.csv > first_5.csv
head first_5.csv
awk -F "\"*,\"*" '{print $2}' first_5.csv > first_5_content_ids
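
The awk command splits each line on commas (with optional surrounding quotes) and prints the second field, which is the content_id. Because content_id comes before the title column, commas inside a title do not disturb it. The same extraction can be sketched in Python with the csv module. The rows below are made up, and `first_n_content_ids` is a hypothetical helper that works on text rather than a file so the sketch stays self-contained.

```python
import csv
import io

def first_n_content_ids(csv_text, n=5):
    """Return the second column (content_id) of the first n CSV rows."""
    rows = csv.reader(io.StringIO(csv_text))
    return [row[1] for _, row in zip(range(n), rows)]

# Made-up rows in the same column order as training_set.csv.
sample = "\n".join([
    '1001,299824032,News,"Title A, with comma",Author A,574,299837992',
    '1001,299837992,News,"Title B",Author B,574,299912085',
    '1002,299912085,Lifestyle,"Title C",Author C,573,299824032',
    '1003,299824032,News,"Title A, with comma",Author A,574,299912085',
    '1004,299837992,News,"Title B",Author B,574,299824032',
    '1005,299912085,Lifestyle,"Title C",Author C,573,299837992',
])

print(first_n_content_ids(sample))
```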

Recall, to make predictions on the trained model we pass a list of examples through the input function. Complete the code below to make predictions on the examples contained in the "first_5.csv" file we created above. 

In [None]:
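#TODO: Use the predict method on our trained model to find the predictions for the examples contained in "first_5.csv".
# predict returns a generator, so wrap it in list() to collect all five results
output = list(estimator.predict(input_fn=read_dataset("first_5.csv", tf.estimator.ModeKeys.PREDICT)))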

In [None]:
import numpy as np
recommended_content_ids = [np.asscalar(d["class_names"]).decode('UTF-8') for d in output]
content_ids = open("first_5_content_ids").read().splitlines()

Finally, we'll map the content id back to the article title. We can then compare our model's recommendation for the first of our examples. This can all be done in BigQuery. Look through the query below and make sure it is clear what is being returned.

In [None]:
import google.datalab.bigquery as bq
recommended_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(recommended_content_ids[0])

current_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(content_ids[0])
recommended_title = bq.Query(recommended_title_sql).execute().result().to_dataframe()['title'].tolist()[0]
current_title = bq.Query(current_title_sql).execute().result().to_dataframe()['title'].tolist()[0]
print("Current title: {} ".format(current_title))
print("Recommended title: {}".format(recommended_title))
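
The two queries above differ only in the content id they filter on: custom dimension 10 holds the content id and custom dimension 6 holds the title. One way to avoid the duplication is a single template plus a small helper. This is a sketch only: `TITLE_QUERY_TEMPLATE` and `title_query` are hypothetical names, and the query text is copied from the cell above.

```python
TITLE_QUERY_TEMPLATE = """
#standardSQL
SELECT
  (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,
  UNNEST(hits) AS hits
WHERE
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = "{content_id}"
LIMIT 1"""

def title_query(content_id):
    """Build the title lookup query for one content id."""
    return TITLE_QUERY_TEMPLATE.format(content_id=content_id)

# Made-up content id for illustration.
print(title_query("299824032"))
```

With this helper, the recommended and current titles would come from `title_query(recommended_content_ids[0])` and `title_query(content_ids[0])` respectively.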

### Tensorboard

As usual, we can monitor the performance of our training job using Tensorboard. 

In [None]:
from google.datalab.ml import TensorBoard
TensorBoard().start('content_based_model_trained')

In [None]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print("Stopped TensorBoard with pid {}".format(pid))