# NOTEBOOK 2: Content-Based Filtering Using Neural Networks

__This notebook relies on files created in the__ [content_based_preproc.ipynb](./1 -content-based-preproc.ipynb) __notebook__. __Be sure to run the code in there before completing this notebook.__  
Also, you'll be using the **python3** kernel from here on out so don't forget to change the kernel if it's still Python2.

### Learning objectives
This notebook illustrates:
1. How to build feature columns for a model using tf.feature_column.
2. How to create custom evaluation metrics and add them to TensorBoard.
3. How to train a model and make predictions with the saved model.

Each learning objective corresponds to a __#TODO__ in the notebook, where you complete the cell's code before running it. If you get stuck, see the [solution notebook](../solutions/content_based_using_neural_networks.ipynb).

TensorFlow Hub should already be installed. You can confirm this with `pip freeze`.

In [1]:
%%bash
pip freeze | grep tensor

tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow @ file:///opt/conda/conda-bld/dlenv-tf-1-15-cpu_1647393499471/work/tensorflow-1.15.5-cp37-cp37m-linux_x86_64.whl
tensorflow-cloud==0.1.13
tensorflow-data-validation==0.23.1
tensorflow-datasets==1.2.0
tensorflow-estimator==1.15.1
tensorflow-hub==0.6.0
tensorflow-io==0.8.1
tensorflow-metadata==0.23.0
tensorflow-model-analysis==0.23.0
tensorflow-probability==0.8.0
tensorflow-serving-api==1.15.0
tensorflow-transform==0.23.0


Let's make sure you install the necessary version of tensorflow-hub. After doing the pip install below, click **"Restart the kernel"** on the notebook so that the Python environment picks up the new packages.

In [1]:
!pip3 install tensorflow-hub==0.7.0
!pip3 install --upgrade tensorflow==1.15.3
!pip3 install google-cloud-bigquery==1.10

Collecting google-cloud-core<0.30dev,>=0.29.0
  Using cached google_cloud_core-0.29.1-py2.py3-none-any.whl (25 kB)
Installing collected packages: google-cloud-core
  Attempting uninstall: google-cloud-core
    Found existing installation: google-cloud-core 2.2.2
    Uninstalling google-cloud-core-2.2.2:
      Successfully uninstalled google-cloud-core-2.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-cloud-translate 3.7.2 requires google-cloud-core<3.0.0dev,>=1.3.0, but you have google-cloud-core 0.29.1 which is incompatible.
google-cloud-storage 2.2.1 requires google-cloud-core<3.0dev,>=1.6.0, but you have google-cloud-core 0.29.1 which is incompatible.
google-cloud-spanner 3.13.0 requires google-cloud-core<3.0dev,>=1.4.1, but you have google-cloud-core 0.29.1 which is incompatible.
google-cloud-logging 3.0.0 requires google-cloud-core<3.0.0dev

#### **Note**: Please ignore any incompatibility warnings and errors and re-run the cell to view the installed tensorflow version.

In [2]:
import os
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
import shutil

PROJECT = 'YOUR_PROJECT_ID' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'YOUR_BUCKET' # REPLACE WITH YOUR BUCKET NAME WITHOUT gs://
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.15.3'

In [3]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


## Build the feature columns for the model

To start, you'll load the list of categories, authors and article ids you created in the previous **Create Datasets** notebook.

In [4]:
categories_list = open("categories.txt").read().splitlines()
authors_list = open("authors.txt").read().splitlines()
content_ids_list = open("content_ids.txt").read().splitlines()
mean_months_since_epoch = 523

In the cell below you'll define the feature columns to use in your model. If necessary, remind yourself of the [various feature columns](https://www.tensorflow.org/api_docs/python/tf/feature_column) available.  
For the embedded_title_column feature column, use a TensorFlow Hub module to create an embedding of the article title. Since the articles and titles are in German, you'll want to use a German-language embedding module.  
Explore the text embedding TensorFlow Hub modules [available here](https://alpha.tfhub.dev/), filtering by the language 'German'. The 50-dimensional embedding should be sufficient for your purposes.

In [5]:
embedded_title_column = hub.text_embedding_column(
    key="title", 
    module_spec="https://tfhub.dev/google/nnlm-de-dim50/1",
    trainable=False)

content_id_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="content_id",
    hash_bucket_size= len(content_ids_list) + 1)
embedded_content_column = tf.feature_column.embedding_column(
    categorical_column=content_id_column,
    dimension=10)

author_column = tf.feature_column.categorical_column_with_hash_bucket(key="author",
    hash_bucket_size=len(authors_list) + 1)
embedded_author_column = tf.feature_column.embedding_column(
    categorical_column=author_column,
    dimension=3)

category_column_categorical = tf.feature_column.categorical_column_with_vocabulary_list(
    key="category",
    vocabulary_list=categories_list,
    num_oov_buckets=1)
category_column = tf.feature_column.indicator_column(category_column_categorical)

months_since_epoch_boundaries = list(range(400,700,20))
months_since_epoch_column = tf.feature_column.numeric_column(
    key="months_since_epoch")
months_since_epoch_bucketized = tf.feature_column.bucketized_column(
    source_column = months_since_epoch_column,
    boundaries = months_since_epoch_boundaries)

crossed_months_since_category_column = tf.feature_column.indicator_column(tf.feature_column.crossed_column(
  keys = [category_column_categorical, months_since_epoch_bucketized], 
  hash_bucket_size = len(months_since_epoch_boundaries) * (len(categories_list) + 1)))

feature_columns = [embedded_content_column,
                   embedded_author_column,
                   category_column,
                   embedded_title_column,
                   crossed_months_since_category_column] 
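
To make the bucketized and crossed columns above more concrete, here is a minimal pure-Python sketch of what they compute conceptually. It is not TensorFlow's implementation: `bucket_index` is a hypothetical helper, and the three-item category list is made up for illustration (the notebook reads the real list from `categories.txt`).

```python
import bisect

# Same boundaries as in the cell above: 400, 420, ..., 680.
months_since_epoch_boundaries = list(range(400, 700, 20))


def bucket_index(value, boundaries):
    """Hypothetical helper: index of the bucket that `value` falls into.

    With n boundaries there are n + 1 buckets:
    (-inf, b0), [b0, b1), ..., [b_{n-1}, +inf).
    """
    return bisect.bisect_right(boundaries, value)


# One more bucket than there are boundaries.
num_buckets = len(months_since_epoch_boundaries) + 1

# Illustrative category list (an assumption, not the real categories.txt).
example_categories = ["News", "Lifestyle", "Stars & Kultur"]

# Same formula as the hash_bucket_size of the crossed column above.
crossed_hash_bucket_size = len(months_since_epoch_boundaries) * (
    len(example_categories) + 1
)

print(bucket_index(574, months_since_epoch_boundaries))
print(num_buckets, crossed_hash_bucket_size)
```

With these boundaries, a value of 574 lands in the bucket [560, 580), which has index 9. There are 16 buckets in total, and with the three example categories the crossed column would be hashed into 60 buckets.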

## Create the input function

Next you'll create the input function for your model. This input function reads the data from the csv files you created in the previous notebook. 

In [6]:
record_defaults = [["Unknown"], ["Unknown"],["Unknown"],["Unknown"],["Unknown"],[mean_months_since_epoch],["Unknown"]]
column_keys = ["visitor_id", "content_id", "category", "title", "author", "months_since_epoch", "next_content_id"]
label_key = "next_content_id"
def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
      def decode_csv(value_column):
          columns = tf.decode_csv(value_column,record_defaults=record_defaults)
          features = dict(zip(column_keys, columns))          
          label = features.pop(label_key)         
          return features, label

      # Create list of files that match pattern
      file_list = tf.io.gfile.glob(filename)

      # Create dataset from file list
      dataset = # TODO 1: Your code here

      if mode == tf.estimator.ModeKeys.TRAIN:
          num_epochs = None # indefinitely
          dataset = dataset.shuffle(buffer_size = 10 * batch_size)
      else:
          num_epochs = 1 # end-of-input after this

      dataset = dataset.repeat(num_epochs).batch(batch_size)
      return dataset.make_one_shot_iterator().get_next()
  return _input_fn
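
The decoding step inside the input function can be sketched in plain Python as follows. This is only an illustration of how a CSV line becomes a feature dictionary and a label, with defaults for missing fields; `decode_csv_line` is a hypothetical stand-in and does not use the TensorFlow API. The sample row is taken from the training set preview shown later in this notebook.

```python
import csv

column_keys = ["visitor_id", "content_id", "category", "title",
               "author", "months_since_epoch", "next_content_id"]
label_key = "next_content_id"
mean_months_since_epoch = 523

# One default per column, mirroring record_defaults in the cell above.
defaults = ["Unknown", "Unknown", "Unknown", "Unknown", "Unknown",
            mean_months_since_epoch, "Unknown"]


def decode_csv_line(line):
    """Hypothetical pure-Python stand-in for the decode_csv step."""
    fields = next(csv.reader([line]))
    values = []
    for raw, default in zip(fields, defaults):
        if raw == "":
            values.append(default)       # missing field: fall back to default
        elif isinstance(default, int):
            values.append(int(raw))      # numeric column
        else:
            values.append(raw)           # string column
    features = dict(zip(column_keys, values))
    label = features.pop(label_key)      # the label is not a model input
    return features, label


line = ('1013445690169368902,299827911,News,'
        '"""Vulkanausbrüche sind normal""",Michaela Reibenwein,574,299779564')
features, label = decode_csv_line(line)
print(features["title"], features["months_since_epoch"], label)
```

The label (`next_content_id`) is removed from the feature dictionary, so the model never sees the target as an input.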

## Create the model and train/evaluate


Next, you'll build your model, which recommends an article to a visitor of the Kurier.at website. Look through the code below. You use `tf.feature_column.input_layer` to turn the feature columns into the dense input layer of your network. The network is a stack of fully connected hidden layers whose sizes are set by the `hidden_units` parameter.

Currently, you compute the accuracy between your predicted 'next article' and the article the visitor actually read next. You'll also add a second performance metric, top 10 accuracy, to assess your model. To do this, you compute the top 10 accuracy metric, add it to the metrics dictionary below, and add it to tf.summary so that the value is also reported to TensorBoard.

In [7]:
def model_fn(features, labels, mode, params):
  net = tf.feature_column.input_layer(features, params['feature_columns'])
  for units in params['hidden_units']:
        net = tf.layers.dense(net, units=units, activation=tf.nn.relu)
   # Compute logits (1 per class).
  logits = tf.layers.dense(net, params['n_classes'], activation=None) 

  predicted_classes = tf.argmax(logits, 1)
  from tensorflow.python.lib.io import file_io
    
  with file_io.FileIO('content_ids.txt', mode='r') as ifp:
    content = tf.constant([x.rstrip() for x in ifp])
  predicted_class_names = tf.gather(content, predicted_classes)
  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {
        'class_ids': predicted_classes[:, tf.newaxis],
        'class_names' : predicted_class_names[:, tf.newaxis],
        'probabilities': tf.nn.softmax(logits),
        'logits': logits,
    }
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)
  table = tf.contrib.lookup.index_table_from_file(vocabulary_file="content_ids.txt")
  labels = table.lookup(labels)
  # Compute loss.
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  # Compute evaluation metrics.
  accuracy = # TODO 2: Your code here
  top_10_accuracy = tf.metrics.mean(tf.nn.in_top_k(predictions=logits, 
                                                   targets=labels, 
                                                   k=10))
  
  metrics = {
    'accuracy': accuracy,
    'top_10_accuracy' : top_10_accuracy}
  
  tf.summary.scalar('accuracy', accuracy[1])
  tf.summary.scalar('top_10_accuracy', top_10_accuracy[1])

  if mode == tf.estimator.ModeKeys.EVAL:
      return tf.estimator.EstimatorSpec(
          mode, loss=loss, eval_metric_ops=metrics)

  # Create training op.
  assert mode == tf.estimator.ModeKeys.TRAIN

  optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
  return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
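
As a rough illustration of what the top 10 accuracy metric measures, the following pure-Python sketch computes top-k accuracy on a toy batch. It is not the TensorFlow implementation: `top_k_accuracy` is a hypothetical helper, the logits and labels are made-up numbers, and the tie-handling rule is a simplifying choice for this sketch.

```python
def top_k_accuracy(logits, labels, k):
    """Fraction of rows whose true class is among the k highest scores.

    A class counts as "in the top k" when fewer than k classes score
    strictly higher than it (a tie-handling choice made for this sketch).
    """
    hits = 0
    for row, label in zip(logits, labels):
        target_score = row[label]
        num_higher = sum(1 for score in row if score > target_score)
        if num_higher < k:
            hits += 1
    return hits / len(labels)


# Toy batch: 3 examples, 4 classes (made-up numbers).
toy_logits = [
    [2.0, 0.5, 1.0, -1.0],
    [0.1, 0.2, 0.3, 0.4],
    [1.5, 3.0, 0.0, 2.0],
]
toy_labels = [0, 1, 2]

for k in (1, 2, 3, 4):
    print(k, top_k_accuracy(toy_logits, toy_labels, k))
```

With k=1 this reduces to ordinary accuracy. As k grows, more of the true labels fall inside the top-k set, so the metric can only stay the same or increase.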

## Train and Evaluate

In [8]:
outdir = 'content_based_model_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time
#tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir = outdir,
    params={
     'feature_columns': feature_columns,
      'hidden_units': [200, 100, 50],
      'n_classes': len(content_ids_list)
    })

# Provide input data for training
train_spec = tf.estimator.TrainSpec(
    input_fn = # TODO 3: Your code here
    max_steps = 2000)

eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset("test_set.csv", tf.estimator.ModeKeys.EVAL),
    steps = None,
    start_delay_secs = 30,
    throttle_secs = 60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'content_based_model_trained', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff326428b90>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.


Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.
INFO:tensorflow:Calling model_fn.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2022-04-27 14:33:59.576379: W tensorflow/core/graph/graph_constructor.cc:1491] Importing a graph with a lower producer version 26 into an existing graph with producer version 134. Shape inference will have run different parts of the graph with different producer versions.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2022-04-27 14:34:00.403736: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-04-27 14:34:00.409247: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200215000 Hz
2022-04-27 14:34:00.409555: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c341fd5480 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-04-27 14:34:00.409584: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into content_based_model_trained/model.ckpt.

INFO:tensorflow:loss = 9.657414, step = 1
INFO:tensorflow:global_step/sec: 8.52674
INFO:tensorflow:loss = 5.915848, step = 101 (11.730 sec)
INFO:tensorflow:global_step/sec: 8.7278
INFO:tensorflow:loss = 4.780274, step = 201 (11.458 sec)
INFO:tensorflow:global_step/sec: 8.7398
INFO:tensorflow:loss = 4.9286423, step = 301 (11.442 sec)
INFO:tensorflow:global_step/sec: 8.77428
INFO:tensorflow:loss = 4.652366, step = 401 (11.397 sec)
INFO:tensorflow:global_step/sec: 8.74748
INFO:tensorflow:loss = 5.425353, step = 501 (11.432 sec)
INFO:tensorflow:global_step/sec: 8.62954
INFO:tensorflow:loss = 5.481843, step = 601 (11.588 sec)
INFO:tensorflow:global_step/sec: 8.29585
INFO:tensorflow:loss = 4.631025, step = 701 (12.054 sec)
INFO:tensorflow:global_step/sec: 8.78159
INFO:tensorflow:loss = 4.6681156, step = 801 (11.388 sec)
INFO:tensorflow:global_step/sec: 8.73307
INFO:tensorflow:loss = 4.144435, step = 901 (11.450 sec)
INFO:tensorflow:global_step/sec: 8.75974
INFO:tensorflow:loss = 5.518801, step = 1001 (11.416 sec)
INFO:tensorflow:global_step/sec: 8.70351
INFO:tensorflow:loss = 4.6834345, step = 1101 (11.490 sec)
INFO:tensorflow:global_step/sec: 8.8177
INFO:tensorflow:loss = 4.931611, step = 1201 (11.341 sec)
INFO:tensorflow:global_step/sec: 8.82789
INFO:tensorflow:loss = 4.692869, step = 1301 (11.328 sec)
INFO:tensorflow:global_step/sec: 8.67418
INFO:tensorflow:loss = 5.488703, step = 1401 (11.529 sec)
INFO:tensorflow:global_step/sec: 8.75842
INFO:tensorflow:loss = 5.4614077, step = 1501 (11.417 sec)
INFO:tensorflow:global_step/sec: 8.71944
INFO:tensorflow:loss = 4.7395725, step = 1601 (11.469 sec)
INFO:tensorflow:global_step/sec: 8.63775
INFO:tensorflow:loss = 4.5634813, step = 1701 (11.581 sec)
INFO:tensorflow:global_step/sec: 8.54084
INFO:tensorflow:loss = 4.310399, step = 1801 (11.704 sec)
INFO:tensorflow:global_step/sec: 8.70934
INFO:tensorflow:loss = 5.2967205, step = 1901 (11.482 sec)
INFO:tensorflow:Saving checkpoints for 2000 into content_based_model_trained/model.ckpt.


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2022-04-27 14:37:53.648305: W tensorflow/core/graph/graph_constructor.cc:1491] Importing a graph with a lower producer version 26 into an existing graph with producer version 134. Shape inference will have run different parts of the graph with different producer versions.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2022-04-27T14:37:53Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2022-04-27-14:37:59
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.04125161, global_step = 2000, loss = 5.1096807, top_10_accuracy = 0.25665066
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2000: content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Loss for final step: 4.5537767.

({'accuracy': 0.04125161,
  'loss': 5.1096807,
  'top_10_accuracy': 0.25665066,
  'global_step': 2000},
 [])

Training takes a while to complete. At the end you should see a **top 10 accuracy** of roughly 25-30% (about 26% in the run above).
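
To put the top 10 accuracy in context, the short calculation below compares it with random guessing. It rests on a loud assumption: at step 1 the untrained network predicts roughly uniformly, so the initial loss is approximately the natural log of the number of classes. The exact class count is `len(content_ids_list)`, so treat the estimate here as a rough sanity check only.

```python
import math

# Values copied from the training log above.
initial_loss = 9.657414
top_10_accuracy = 0.25665066

# Assumption: near-uniform predictions at step 1, so
# cross-entropy loss ~= ln(number of classes).
estimated_n_classes = math.exp(initial_loss)

# Chance level for top-10 accuracy when guessing uniformly at random.
random_top_10 = 10 / estimated_n_classes

print(round(estimated_n_classes))
print(random_top_10)
print(top_10_accuracy / random_top_10)
```

Under that assumption there are on the order of fifteen thousand candidate articles, so a top 10 accuracy near 26% is a few hundred times better than chance even though the plain accuracy looks low.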

## Make predictions with the trained model 

With the model now trained, you can make predictions by calling the predict method on the estimator. Let's look at how your model predicts on the first five examples of the training set.  
To start, you'll create a new file 'first_5.csv' which contains the first five rows of your training set. You'll also save the content IDs of the articles being read (the second column) to a file 'first_5_content_ids' so you can compare them with the model's recommendations. 

In [9]:
%%bash
head -5 training_set.csv > first_5.csv
head first_5.csv
awk -F "\"*,\"*" '{print $2}' first_5.csv > first_5_content_ids

1013445690169368902,299827911,News,"""Vulkanausbrüche sind normal""",Michaela Reibenwein,574,299779564
1013445690169368902,299779564,Stars & Kultur,Geschenk: Nicole Kidman bekommt Traumhaus um 40 Mio. Dollar ,Elisabeth Spitzer,574,299777664
1022059616427871901,299798467,Lifestyle,Frau täuscht Tod vor um Fake-Liebhaber zu entkommen,Elisabeth Mittendorfer,574,299777082
1022059616427871901,299777082,Lifestyle,Die simple Strategie für strahlende Model-Haut,Maria Zelenko,574,299814775
1029992987987017563,299779564,Stars & Kultur,Geschenk: Nicole Kidman bekommt Traumhaus um 40 Mio. Dollar ,Elisabeth Spitzer,574,299826775


Recall that to make predictions with the trained model, you pass the examples through the input function. The code below makes predictions on the examples contained in the "first_5.csv" file you created above. 

In [10]:
output = list(estimator.predict(input_fn=read_dataset("first_5.csv", tf.estimator.ModeKeys.PREDICT)))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
2022-04-27 14:40:09.806367: W tensorflow/core/graph/graph_constructor.cc:1491] Importing a graph with a lower producer version 26 into an existing graph with producer version 134. Shape inference will have run different parts of the graph with different producer versions.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from content_based_model_trained/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


In [11]:
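import numpy as np
# Each prediction's "class_names" is a length-1 array of bytes; .item() extracts
# the Python scalar (np.asscalar is deprecated in newer NumPy releases).
recommended_content_ids = [d["class_names"].item().decode('UTF-8') for d in output]
content_ids = open("first_5_content_ids").read().splitlines()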

  


Finally, you map the content id back to the article title. Let's compare your model's recommendation for the first example. This can be done in BigQuery. Look through the query below and make sure it is clear what is being returned.

In [12]:
from google.cloud import bigquery
recommended_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(recommended_content_ids[0])

current_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(content_ids[0])
recommended_title = bigquery.Client().query(recommended_title_sql).to_dataframe()['title'].tolist()[0].encode('utf-8').strip()
current_title = bigquery.Client().query(current_title_sql).to_dataframe()['title'].tolist()[0].encode('utf-8').strip()
print("Current title: {} ".format(current_title))
print("Recommended title: {}".format(recommended_title))

Current title: b'"Vulkanausbr\xc3\xbcche sind normal"' 
Recommended title: b'Fahnenskandal von Mailand: Die Austria zeigt Flagge'
