# Fine-tuned Retrieval item-to-item model on full dataset

The fine-tuned Retrieval item-to-item model performed well on the train and test data. Top-10 accuracy is 57% on test and 69% on train, which indicates overfitting. We will now run this model on the full dataset to check whether the overfitting occurs there as well.

### Imports

In [1]:
! pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
! pip install -q tensorflow-recommenders
! pip install -q --upgrade tensorflow-datasets
! pip install -q scann


In [3]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_recommenders as tfrs

# import interactive table 
from google.colab import data_table
data_table.enable_dataframe_formatter()

# set seed
tf.random.set_seed(42)

### Preparing the dataset

In [7]:
# load full dataset
ratings, info = tfds.load("amazon_us_reviews/Outdoors_v1_00", split="train", with_info = True)

Downloading and preparing dataset 428.16 MiB (download: 428.16 MiB, generated: Unknown size, total: 428.16 MiB) to /root/tensorflow_datasets/amazon_us_reviews/Outdoors_v1_00/0.1.0...


Dataset amazon_us_reviews downloaded and prepared to /root/tensorflow_datasets/amazon_us_reviews/Outdoors_v1_00/0.1.0. Subsequent calls will reuse this data.


In [8]:
# Select the basic features.

products = ratings.map(lambda x: x['data']['product_title'])

In [9]:
# train-test split
tf.random.set_seed(42)
shuffled = ratings.shuffle(2_302_401, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(1_841_921)
test = shuffled.skip(1_841_921).take(460_480)
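
The hard-coded sizes above correspond to an 80/20 split of the 2,302,401 examples. A minimal sketch of that arithmetic, assuming the 80% train fraction implied by those numbers (the names `total_examples` and `train_fraction` are illustrative and do not appear in the notebook):

```python
# Illustrative names: derive the split sizes instead of hard-coding them.
total_examples = 2_302_401   # size of the full dataset, as passed to shuffle() above
train_fraction = 0.8         # assumption inferred from the hard-coded sizes

train_size = round(total_examples * train_fraction)
test_size = total_examples - train_size

print(train_size, test_size)
```

Deriving the sizes this way keeps `take()` and `skip()` consistent if the dataset size changes.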

In [10]:
# vocabulary to map raw feature values to embedding vectors
product_titles = products.batch(50_000)
unique_product_titles = np.unique(np.concatenate(list(product_titles)))

unique_product_titles[:10]

array([b'! 1pc Small S Navy Blue Replacement Band + Free Small Grey Band With Clasp for Fitbit FLEX Only /No tracker/ Wireless Activity Bracelet Sport Wristband Fit Bit Flex Bracelet Sport Arm Band Armband',
       b'! 1pc Small S Purple / Pink Replacement Band + Free Small Grey Band With Clasp for Fitbit FLEX Only /No tracker/ Wireless Activity Bracelet Sport Wristband Fit Bit Flex Bracelet Sport Arm Band Armband',
       b'! 1pc Small S Teal (Blue/Green) Replacement Band + Free Small Grey Band With Clasp for Fitbit FLEX Only /No tracker/ Wireless Activity Bracelet Sport Wristband Fit Bit Flex Bracelet Sport Arm Band Armband',
       b'! 2pcs Small S Red (Tangerine) Replacement Bands + 1pc Free Small Grey Band With Clasp for Fitbit FLEX Only /No tracker/ Wireless Activity Bracelet Sport Wristband Fit Bit Flex Bracelet Sport Arm Band Armband',
       b'! Large L 1pc Light Blue 1pc White Replacement Bands + 1pc Free Large Grey Band With Clasp for Fitbit FLEX Only /No tracker/ Wireless A

### Implementing the model

In [11]:
# dimensionality of the query and candidate representations:
embedding_dimension = 64

In [12]:
# develop query and candidate models 
product_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_product_titles, mask_token=None),
  tf.keras.layers.Embedding(len(unique_product_titles) + 1, embedding_dimension),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(32, activation = 'relu')
])

In [10]:
# define metrics
metrics = tfrs.metrics.FactorizedTopK(
  candidates=products.batch(1032).map(product_model)
)

In [11]:
# define task
task = tfrs.tasks.Retrieval(
  metrics=metrics
)

In [12]:
# create item-to-item model with query and candidate models using the same product_model

class AmazonModel(tfrs.Model):

  def __init__(self, user_model, product_model):
    super().__init__()
    # Item-to-item setup: the query and candidate towers are the same product model.
    self.product_model: tf.keras.Model = product_model
    self.user_model: tf.keras.Model = user_model
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # Query side: embed the product title with the shared product model.
    product_embeddings = self.product_model(features['data']["product_title"])
    # Candidate side: embed the same product title with the same model,
    # getting embeddings back.
    positive_product_embeddings = self.product_model(features['data']["product_title"])

    # The task computes the loss and the metrics.
    return self.task(product_embeddings, positive_product_embeddings, compute_metrics=not training)

In [13]:
# initiate model
item_item_model_full = AmazonModel(product_model, product_model)
item_item_model_full.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.5))

### Fitting and Evaluating the model

In [14]:
# shuffle, batch, and cache train and test data
cached_train = train.shuffle(1_841_921).batch(8192).cache()
cached_test = test.batch(4096).cache()

In [15]:
# train the model
item_item_model_full.fit(cached_train, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fada974fa10>

In [16]:
# evaluate model
item_item_model_full.evaluate(cached_test, return_dict=True)



{'factorized_top_k/top_1_categorical_accuracy': 0.959943950176239,
 'factorized_top_k/top_5_categorical_accuracy': 0.9611709713935852,
 'factorized_top_k/top_10_categorical_accuracy': 0.9611753225326538,
 'factorized_top_k/top_50_categorical_accuracy': 0.961257815361023,
 'factorized_top_k/top_100_categorical_accuracy': 0.961835503578186,
 'loss': 10995.7666015625,
 'regularization_loss': 0,
 'total_loss': 10995.7666015625}

Top-k accuracy on the test set is about 96% for every k from 1 to 100. The evaluation took over 5 hours to run.
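
For reference, top-k categorical accuracy is the fraction of test queries whose true candidate appears among the k highest-scoring candidates. A minimal pure-Python sketch of that definition, using made-up scores (the function `top_k_accuracy` and the toy data are illustrative assumptions, not the TFRS implementation):

```python
def top_k_accuracy(score_rows, true_indices, k):
    """Fraction of rows whose true candidate index is among the k highest scores."""
    hits = 0
    for scores, true_idx in zip(score_rows, true_indices):
        # Rank candidate indices by score, highest first.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(score_rows)

# Toy example: 3 queries scored against 4 candidates.
score_rows = [
    [0.9, 0.1, 0.3, 0.2],  # true candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # true candidate 1 is ranked second
    [0.5, 0.6, 0.7, 0.1],  # true candidate 3 is ranked last
]
true_indices = [0, 1, 3]

top1 = top_k_accuracy(score_rows, true_indices, k=1)
top2 = top_k_accuracy(score_rows, true_indices, k=2)
print(top1, top2)
```

Accuracy can only stay flat or rise as k grows, which matches the pattern in the results above.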

### Approximate prediction

#### BruteForce

The BruteForce layer retrieves the top candidates in response to a query. It computes the score between the query and every product, sorts the scores, and selects the Top-k recommendations.
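
The score-sort-select procedure described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the TFRS implementation: the function `brute_force_top_k`, the dot-product scoring, and the 2-dimensional embeddings are all made up for the example.

```python
def brute_force_top_k(query, candidates, k):
    """Score every candidate against the query and return the k best (score, name) pairs."""
    scored = []
    for name, embedding in candidates.items():
        # Dot product between the query embedding and the candidate embedding.
        score = sum(q * c for q, c in zip(query, embedding))
        scored.append((score, name))
    # Sort all candidates by score, highest first, and keep the top k.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up 2-dimensional embeddings, for illustration only.
candidates = {
    "water bottle": [1.0, 0.0],
    "life jacket": [0.0, 1.0],
    "bike saddle": [0.5, 0.5],
}
top = brute_force_top_k(query=[1.0, 0.2], candidates=candidates, k=2)
print(top)
```

Because every candidate is scored, the result is exact, but the cost grows linearly with the number of products.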

In [17]:
# recommending Top-10 products for customer 52228204

# Create a BruteForce index that takes in raw query features and
# recommends products out of the entire products dataset.
brute_force = tfrs.layers.factorized_top_k.BruteForce(item_item_model_full.product_model)
brute_force.index_from_dataset(
  tf.data.Dataset.zip((products.batch(4096), products.batch(4096).map(item_item_model_full.product_model)))
)

# Get recommendations.
_, titles = brute_force(tf.constant(["52228204"]))
print(f"Recommendations for user 52228204: {titles[0, :10]}")

Recommendations for user 52228204: [b'WaterVault Thermos Water Bottle - Double Insulated Copper Plated Stainless Steel - Keeps Hot 12 Hours, Cold up to 36 - BPA-Free (12oz, 17oz, 26oz, 1 liter) Assorted Colors'
 b'Stearns Sospenders Manual Inflatable Life Jacket'
 b'Vader Bicycle Cycling Bike Road Offroad MTB Mountain Saddle Seat'
 b'Shimano Acera SL-M310 Rapid Fire Shifter, Right (Black, 7-Speed)'
 b"Women's Cycling Biking Pants Underwear Bike Bicycle Padded Shorts"
 b'Safety Care Big Foot Ice Claws Snow & Ice Traction Cleats - Re-Design 10/01/2015 - Heavy Duty, Lightweight No-Slip Studded Grips Prevent Slipping on Snow & Ice - Fits All Shoes, Boots, & Monster Size Boot Styles - Great for Hiking, Walking, Hunting, Construction & Other Outdoor Jobs & Activities'
 b'Astronaut Freeze Dried Ice Cream, One Serving Pouch'
 b'Handlebar front Bag with draw cord and clip, \xe2\x80\x9cTour Guide\xe2\x80\x9d by Biria'
 b'Victorinox Swiss Army Knife'
 b'Thin Line Adjustable Shackle Paracord Survi

In [18]:
# use timeit to record the time it takes to retrieve recommendations
%timeit _, titles = brute_force(tf.constant(["52228204"]), k = 10)

5.29 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
# model serving: saving the model to G-Drive

# mount G-Drive
from google.colab import drive
drive.mount('/content/drive')

# Export the query model.
gdrive_path = '/content/drive/MyDrive/Models'
path = os.path.join(gdrive_path, "item_item_model_full")

# Save the index.
tf.saved_model.save(brute_force, path)

# Load it back; can also be done in TensorFlow Serving.
item_item_model_full_2 = tf.saved_model.load(path)

# Pass a user id in, get top predicted product titles back.
scores, titles = item_item_model_full_2(["52228204"])

print(f"Recommendations: {titles[0][:3]}")

Mounted at /content/drive




Recommendations: [b'WaterVault Thermos Water Bottle - Double Insulated Copper Plated Stainless Steel - Keeps Hot 12 Hours, Cold up to 36 - BPA-Free (12oz, 17oz, 26oz, 1 liter) Assorted Colors'
 b'Stearns Sospenders Manual Inflatable Life Jacket'
 b'Vader Bicycle Cycling Bike Road Offroad MTB Mountain Saddle Seat']


#### ScaNN

ScaNN is an approximate retrieval layer that is implemented in a similar way to BruteForce.
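
Because an approximate index can miss some of the true nearest neighbours, a common sanity check is to compare its results with the exact BruteForce results. A minimal sketch of such a recall@k check on plain Python lists (the function `recall_at_k` and the sample titles are illustrative assumptions, not part of the notebook or of TFRS):

```python
def recall_at_k(exact, approximate, k):
    """Share of the exact top-k items that also appear in the approximate top-k."""
    exact_top = set(exact[:k])
    approx_top = set(approximate[:k])
    return len(exact_top & approx_top) / len(exact_top)

# Made-up result lists, for illustration only.
exact_titles = ["water bottle", "life jacket", "bike saddle", "shifter"]
approx_titles = ["water bottle", "bike saddle", "life jacket", "ice cleats"]

recall_3 = recall_at_k(exact_titles, approx_titles, k=3)
recall_4 = recall_at_k(exact_titles, approx_titles, k=4)
print(recall_3, recall_4)
```

A recall of 1.0 means the approximate index returned the same set of items as exact search for that k, regardless of order.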

In [21]:
# adding ScaNN layer
scann_index = tfrs.layers.factorized_top_k.ScaNN(item_item_model_full.product_model)
scann_index.index_from_dataset(
  tf.data.Dataset.zip((products.batch(4096), products.batch(4096).map(item_item_model_full.product_model)))
)

# Get recommendations.
_, titles = scann_index(tf.constant(["52228204"]))
print(f"Recommendations for user 52228204: {titles[0, :10]}")

Recommendations for user 52228204: [b'WaterVault Thermos Water Bottle - Double Insulated Copper Plated Stainless Steel - Keeps Hot 12 Hours, Cold up to 36 - BPA-Free (12oz, 17oz, 26oz, 1 liter) Assorted Colors'
 b'Stearns Sospenders Manual Inflatable Life Jacket'
 b'Vader Bicycle Cycling Bike Road Offroad MTB Mountain Saddle Seat'
 b'Shimano Acera SL-M310 Rapid Fire Shifter, Right (Black, 7-Speed)'
 b"Women's Cycling Biking Pants Underwear Bike Bicycle Padded Shorts"
 b'Safety Care Big Foot Ice Claws Snow & Ice Traction Cleats - Re-Design 10/01/2015 - Heavy Duty, Lightweight No-Slip Studded Grips Prevent Slipping on Snow & Ice - Fits All Shoes, Boots, & Monster Size Boot Styles - Great for Hiking, Walking, Hunting, Construction & Other Outdoor Jobs & Activities'
 b'Astronaut Freeze Dried Ice Cream, One Serving Pouch'
 b'Handlebar front Bag with draw cord and clip, \xe2\x80\x9cTour Guide\xe2\x80\x9d by Biria'
 b'Victorinox Swiss Army Knife'
 b'Thin Line Adjustable Shackle Paracord Survi

In [22]:
# time how long ScaNN takes to retrieve Top-10 recommendations
%timeit _, titles = scann_index(tf.constant(["52228204"]), k = 10)


4.55 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [24]:
# exporting ScaNN layer

# Export the query model.
gdrive_path = '/content/drive/MyDrive/Models'
path = os.path.join(gdrive_path, "item_item_model_full")

# Save the index.
tf.saved_model.save(
    scann_index,
    path,
    options=tf.saved_model.SaveOptions(namespace_whitelist=["Scann"])
)

# Load it back; can also be done in TensorFlow Serving.
item_item_model_full_3 = tf.saved_model.load(path)

# Pass a user id in, get top predicted product titles back.
scores, titles = item_item_model_full_3(["52228204"])

print(f"Recommendations: {titles[0][:10]}")



Recommendations: [b'WaterVault Thermos Water Bottle - Double Insulated Copper Plated Stainless Steel - Keeps Hot 12 Hours, Cold up to 36 - BPA-Free (12oz, 17oz, 26oz, 1 liter) Assorted Colors'
 b'Stearns Sospenders Manual Inflatable Life Jacket'
 b'Vader Bicycle Cycling Bike Road Offroad MTB Mountain Saddle Seat'
 b'Shimano Acera SL-M310 Rapid Fire Shifter, Right (Black, 7-Speed)'
 b"Women's Cycling Biking Pants Underwear Bike Bicycle Padded Shorts"
 b'Safety Care Big Foot Ice Claws Snow & Ice Traction Cleats - Re-Design 10/01/2015 - Heavy Duty, Lightweight No-Slip Studded Grips Prevent Slipping on Snow & Ice - Fits All Shoes, Boots, & Monster Size Boot Styles - Great for Hiking, Walking, Hunting, Construction & Other Outdoor Jobs & Activities'
 b'Astronaut Freeze Dried Ice Cream, One Serving Pouch'
 b'Handlebar front Bag with draw cord and clip, \xe2\x80\x9cTour Guide\xe2\x80\x9d by Biria'
 b'Victorinox Swiss Army Knife'
 b'Thin Line Adjustable Shackle Paracord Survival Bracelet-7 Wri

Top-k accuracy on the test data is about 96% for every k from 1 through 100. BruteForce and ScaNN return the same Top-10 recommendations for the sample query, which suggests that the approximate index loses little or no accuracy here. BruteForce retrieved the Top-10 recommendations in 5.29 ms and ScaNN in 4.55 ms, so ScaNN takes about 14% less time per query (roughly 1.16x faster).
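
The relative difference between the two timings can be expressed either as time saved per query or as a speedup ratio. A short sketch of both calculations from the `%timeit` means reported above (the variable names are illustrative):

```python
brute_force_ms = 5.29  # mean BruteForce latency reported by %timeit
scann_ms = 4.55        # mean ScaNN latency reported by %timeit

# Fraction of the BruteForce time that ScaNN saves per query.
time_saved = (brute_force_ms - scann_ms) / brute_force_ms
# How many times faster ScaNN is than BruteForce.
speedup = brute_force_ms / scann_ms

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

These figures come from a single query timed on one machine, so they are best read as a rough indication rather than a benchmark.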