## Modeling Notebook 

We start with a retrieval model and use it as our baseline.
According to TensorFlow documentation: "The retrieval stage is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient."

Our first retrieval model is a user-item matrix factorization. Our concern is that the data are very sparse: most customers left only 1-2 reviews, spread across more than 46,000 products in a dataset of about 115k reviews. Because of this sparsity, we do not expect an SVD-type model to perform well.


### Imports

In [1]:
! pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
! pip install -q tensorflow-recommenders
! pip install -q --upgrade tensorflow-datasets
! pip install -q scann


In [3]:
import tensorflow as tf
print(tf.__version__)

2.10.0


In [4]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow_datasets as tfds
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_recommenders as tfrs

# import interactive table 
from google.colab import data_table
data_table.enable_dataframe_formatter()

# set seed
tf.random.set_seed(42)

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Preparing the dataset

In [6]:
# load data subset 
gdrive_path = '/content/drive/MyDrive/ModelingData'
path = os.path.join(gdrive_path, "ratings")

ratings = tf.data.Dataset.load(path)

In [7]:
ratings

<_LoadDataset element_spec={'data': {'review_date': TensorSpec(shape=(), dtype=tf.string, name=None), 'product_title': TensorSpec(shape=(), dtype=tf.string, name=None), 'verified_purchase': TensorSpec(shape=(), dtype=tf.int64, name=None), 'review_body': TensorSpec(shape=(), dtype=tf.string, name=None), 'review_headline': TensorSpec(shape=(), dtype=tf.string, name=None), 'product_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'product_parent': TensorSpec(shape=(), dtype=tf.string, name=None), 'vine': TensorSpec(shape=(), dtype=tf.int64, name=None), 'customer_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'total_votes': TensorSpec(shape=(), dtype=tf.int32, name=None), 'marketplace': TensorSpec(shape=(), dtype=tf.string, name=None), 'review_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'helpful_votes': TensorSpec(shape=(), dtype=tf.int32, name=None), 'product_category': TensorSpec(shape=(), dtype=tf.string, name=None), 'star_rating': TensorSpec(shape=(), dtype=tf.

In [8]:
len(ratings)

115120

In [10]:
# Select the basic features.
ratings = ratings.map(lambda x: {
    'product_title': x['data']['product_title'], 
    'customer_id': x['data']['customer_id']
})
products = ratings.map(lambda x: x['product_title'])

### Implementing User-Item Matrix Factorization as Baseline Model

In [11]:
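# train-test split: 80% train (92,096 rows) and 20% test (23,024 rows)
tf.random.set_seed(42)
# NOTE: the shuffle buffer (92,096) is smaller than the dataset (115,120 rows),
# so this is an approximate rather than a fully uniform shuffle.
shuffled = ratings.shuffle(92_096, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(92_096)
test = shuffled.skip(92_096).take(23_024)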
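The hard-coded sizes in the split correspond to an 80/20 division of the 115,120 rows. As a minimal sketch, the sizes could be derived from the row count instead of typed in by hand; `split_sizes` and `train_percent` are hypothetical names introduced here for illustration only.

```python
from typing import Tuple


def split_sizes(n_rows: int, train_percent: int = 80) -> Tuple[int, int]:
    """Return (train_size, test_size) for a simple take/skip split."""
    # Integer arithmetic avoids floating-point rounding surprises.
    train_size = n_rows * train_percent // 100
    return train_size, n_rows - train_size


train_size, test_size = split_sizes(115_120)
print(train_size, test_size)
```

The two values can then be passed to `take` and `skip` in place of the literals.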

In [12]:
# vocabulary to map raw feature values to embedding vectors
product_titles = products.batch(50_000)
customer_ids = ratings.batch(110_000).map(lambda x: x['customer_id'])

unique_product_titles = np.unique(np.concatenate(list(product_titles)))
unique_customer_ids = np.unique(np.concatenate(list(customer_ids)))

unique_product_titles[:10]

array([b'! Set 7 Colors Small S Replacement Bands + 1pc Free Small Grey Band With Clasp for Fitbit FLEX Only /No tracker/ 1pc Teal (Blue/Grey) 1pc Purple / Pink 1pc Red (Tangerine) 1pc Green 1pc Slate (Blue/Grey) 1pc Black 1pc Navy (Blue) Bands Wireless Activity Bracelet Sport Wristband Fit Bit Flex Bracelet Sport Arm Band Armband',
       b'! Small S 1pc Green 1pc Teal (Blue/Green) 1pc Red (Tangerine) Replacement Bands + 1pc Free Small Grey Band With Clasp for Fitbit FLEX Only /No tracker/ Wireless Activity Bracelet Sport Wristband Fit Bit Flex Bracelet Sport Arm Band Armband',
       b'! Small S 1pc Teal (Blue/Green) 1pc Purple / Pink Replacement Bands + 1pc Free Small Grey Band With Clasp for Fitbit FLEX Only /No tracker/ Wireless Activity Bracelet Sport Wristband Fit Bit Flex Bracelet Sport Arm Band Armband',
       b'"""SEASON SPECIAL"""THE ORIGINAL HEAVY DUTY BIG GRIZZLY COT-HEAVY DUTY QUALITY w/ IPHONE Holder & Drink Holder-High Quality Product-10 YEARS WARRANTY-84\xe2\x80\x9d L

In [13]:
# number of unique products 
len(unique_product_titles)

46438

In [14]:
# number of unique customers
len(unique_customer_ids)

109308
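The three counts printed so far (115,120 reviews, 46,438 unique product titles, 109,308 unique customers) make the sparsity concern concrete. The sketch below only does arithmetic on those counts; the variable names are ours.

```python
# Counts taken from the notebook outputs above.
n_reviews = 115_120
n_products = 46_438
n_customers = 109_308

# Average number of reviews per customer and per product title.
reviews_per_customer = n_reviews / n_customers
reviews_per_product = n_reviews / n_products

# Upper bound on the share of the customer x product matrix that is observed
# (an upper bound because a repeated customer-product pair would count twice).
density = n_reviews / (n_customers * n_products)

print(f"reviews per customer: {reviews_per_customer:.2f}")
print(f"reviews per product:  {reviews_per_product:.2f}")
print(f"matrix density:       {density:.6%}")
```

With roughly one review per customer, almost every user embedding is learned from a single interaction.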

In [15]:
# dimensionality of the query and candidate representations:
embedding_dimension = 32

In [16]:
# query model
user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_customer_ids, mask_token=None),
  # We add an additional embedding to account for unknown tokens.
  tf.keras.layers.Embedding(len(unique_customer_ids) + 1, embedding_dimension)
])

In [17]:
# candidate model
product_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_product_titles, mask_token=None),
  tf.keras.layers.Embedding(len(unique_product_titles) + 1, embedding_dimension)
])
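To illustrate why both towers use `len(vocabulary) + 1` embedding rows, here is a small NumPy sketch of what a string lookup followed by an embedding table does conceptually. It is not the Keras implementation; the toy vocabulary and the helper `embed` are hypothetical, and we assume index 0 is the single out-of-vocabulary bucket.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy vocabulary standing in for `unique_customer_ids` (hypothetical values).
vocabulary = ["A1", "B2", "C3"]
embedding_dimension = 4

# Index 0 is reserved for out-of-vocabulary (OOV) strings, so known tokens start at 1.
token_to_index = {token: i + 1 for i, token in enumerate(vocabulary)}

# One row per vocabulary entry plus one extra row for the OOV index.
embedding_table = rng.normal(size=(len(vocabulary) + 1, embedding_dimension))


def embed(tokens):
    """Map raw strings to embedding vectors via an integer lookup."""
    indices = np.array([token_to_index.get(t, 0) for t in tokens])
    return embedding_table[indices]


vectors = embed(["B2", "never-seen-id"])
print(vectors.shape)
```

Any ID that was not in the vocabulary at training time shares the same OOV row, so unseen customers all receive the same embedding.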

In [18]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=products.batch(128).map(product_model)
)

In [19]:
task = tfrs.tasks.Retrieval(
  metrics=metrics
)
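The top-k metric attached to the task can be understood with a small NumPy sketch: score each query against every candidate and check whether the true item is among the k highest scores. This is a conceptual illustration on hypothetical toy data, not the TFRS implementation; `top_k_accuracy` is a name we introduce here.

```python
import numpy as np


def top_k_accuracy(query_emb, candidate_emb, true_idx, k):
    """Share of queries whose true candidate is among the k highest scores."""
    scores = query_emb @ candidate_emb.T             # (n_queries, n_candidates)
    top_k = np.argsort(-scores, axis=1)[:, :k]       # indices of the k best candidates
    hits = (top_k == true_idx[:, None]).any(axis=1)  # True where the positive was retrieved
    return hits.mean()


# Toy data: 3 candidates in 2 dimensions and 2 queries (all values hypothetical).
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
queries = np.array([[0.9, 0.1], [0.2, 0.8]])
true_idx = np.array([0, 2])  # the second query's positive is ranked last

print(top_k_accuracy(queries, candidates, true_idx, k=1))
print(top_k_accuracy(queries, candidates, true_idx, k=3))
```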

In [24]:
class AmazonModel(tfrs.Model):

  def __init__(self, user_model, product_model):
    super().__init__()
    self.product_model: tf.keras.Model = product_model
    self.user_model: tf.keras.Model = user_model
    # NOTE: `task` is the module-level Retrieval task defined in the cell above.
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(features["customer_id"])
    # And pick out the product features and pass them into the product model,
    # getting embeddings back.
    positive_product_embeddings = self.product_model(features["product_title"])

    # The task computes the loss and the metrics.
    return self.task(user_embeddings, positive_product_embeddings, compute_metrics=not training)
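To make the loss more tangible, the sketch below shows an in-batch softmax loss in NumPy: every other item in the batch acts as a negative for a given user. We assume this is close to what the retrieval task computes by default, including summing rather than averaging over the batch, which would also explain the large loss values; treat both points as assumptions. `in_batch_softmax_loss` is a hypothetical helper.

```python
import numpy as np


def in_batch_softmax_loss(user_emb, item_emb):
    """Summed cross-entropy where the positive for row i is column i."""
    logits = user_emb @ item_emb.T                       # (batch, batch) affinity scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).sum()                     # sum of -log p(positive_i | user_i)


rng = np.random.default_rng(0)
users = rng.normal(size=(8, 32))
items = rng.normal(size=(8, 32))

random_loss = in_batch_softmax_loss(users, items)
# Users whose embeddings point at their own items should give a much lower loss.
aligned_loss = in_batch_softmax_loss(items * 5.0, items)
print(random_loss, aligned_loss)
```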

### Fitting and Evaluating the model

In [25]:
# initiate model
baseline_model = AmazonModel(user_model, product_model)
baseline_model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

In [26]:
# shuffle, batch, and cache train and test data
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

In [27]:
# train the model
baseline_model.fit(cached_train, epochs = 15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f018e5ab190>

In [28]:
# evaluate model
baseline_model.evaluate(cached_test, return_dict=True)



{'factorized_top_k/top_1_categorical_accuracy': 0.00017373176524415612,
 'factorized_top_k/top_5_categorical_accuracy': 0.000304030574625358,
 'factorized_top_k/top_10_categorical_accuracy': 0.00039089645724743605,
 'factorized_top_k/top_50_categorical_accuracy': 0.0006514940760098398,
 'factorized_top_k/top_100_categorical_accuracy': 0.0011292564449831843,
 'loss': 20335.0703125,
 'regularization_loss': 0,
 'total_loss': 20335.0703125}
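As a rough reference point for these numbers, we can compare each reported accuracy with a ranker that orders the 46,438 unique titles uniformly at random, which would place the true title in the top k with probability k / 46,438. This is our own simplifying assumption: the metric above is computed over `products`, which contains repeated titles, so the comparison is only indicative.

```python
# Values copied from the evaluation output above.
reported = {
    1: 0.00017373176524415612,
    5: 0.000304030574625358,
    10: 0.00039089645724743605,
    50: 0.0006514940760098398,
    100: 0.0011292564449831843,
}
n_unique_products = 46_438

for k, accuracy in reported.items():
    chance = k / n_unique_products  # assumed uniform-random reference
    print(f"top-{k:<3d} reported={accuracy:.5f}  random={chance:.5f}  ratio={accuracy / chance:.2f}")
```

Under this assumption the top-1 accuracy is above the random reference while the top-100 accuracy is below it.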

### Model serving and saving

Top-n recommendations are generated for customer_id 52228204. This customer's purchase history indicates a strong interest in biking, specifically mountain biking.

In [29]:
# recommending Top-10 products for customer 52228204

# Create a brute-force retrieval index that takes in raw query features and
# recommends products out of the entire products dataset.
index = tfrs.layers.factorized_top_k.BruteForce(baseline_model.user_model)
index.index_from_dataset(
  tf.data.Dataset.zip((products.batch(100), products.batch(100).map(baseline_model.product_model)))
)

# Get recommendations.
_, titles = index(tf.constant(["52228204"]))
print(f"Recommendations for user 52228204: {titles[0, :10]}")

Recommendations for user 52228204: [b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Po
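The repeated title in this output is what we would expect when the index holds one row per review rather than one row per unique title. The NumPy sketch below reproduces the effect on hypothetical toy data and shows that indexing unique titles removes it; `brute_force_top_k` and all values are illustrative, not part of the notebook.

```python
import numpy as np


def brute_force_top_k(query, candidate_emb, candidate_ids, k):
    """Exact top-k by dot product over every candidate row."""
    scores = candidate_emb @ query
    best = np.argsort(-scores)[:k]
    return [candidate_ids[i] for i in best]


# Toy stand-in for `products`: one entry per review, so titles repeat.
titles_per_review = np.array(["camera", "lantern", "camera", "camera", "helmet"])
title_vectors = {"camera": [1.0, 0.0], "lantern": [0.6, 0.6], "helmet": [0.0, 1.0]}
query = np.array([1.0, 0.2])

# Indexing one row per review returns the same title several times.
emb_all = np.array([title_vectors[t] for t in titles_per_review])
with_duplicates = brute_force_top_k(query, emb_all, titles_per_review, k=3)

# Indexing one row per unique title returns distinct recommendations.
unique_titles = np.unique(titles_per_review)
emb_unique = np.array([title_vectors[t] for t in unique_titles])
deduplicated = brute_force_top_k(query, emb_unique, unique_titles, k=3)

print(with_duplicates)
print(deduplicated)
```

In this notebook the analogous change would be to build the candidate dataset from `unique_product_titles` (for example with `tf.data.Dataset.from_tensor_slices`) instead of `products`; we have not run that variant here.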

In [30]:
# model serving: saving the model to G-Drive

# Export the query model.
gdrive_path = '/content/drive/MyDrive/Models'
path = os.path.join(gdrive_path, "baseline_model")

# Save the index.
tf.saved_model.save(index, path)

# Load it back; can also be done in TensorFlow Serving.
baseline_model_2 = tf.saved_model.load(path)

# Pass a user id in, get top predicted product titles back.
scores, titles = baseline_model_2(["52228204"])

print(f"Recommendations: {titles[0][:10]}")




Recommendations: [b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out

Next, we build an approximate retrieval index with the ScaNN package to speed up predictions.

In [31]:
# adding ScaNN layer
scann_index = tfrs.layers.factorized_top_k.ScaNN(baseline_model.user_model)
scann_index.index_from_dataset(
  tf.data.Dataset.zip((products.batch(100), products.batch(100).map(baseline_model.product_model)))
)

<tensorflow_recommenders.layers.factorized_top_k.ScaNN at 0x7f0190df4ed0>

The ScaNN layer performs approximate lookup. Retrieval is slightly less accurate, but it is much faster on a large candidate set.
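The speed/accuracy trade-off of approximate lookup can be illustrated with a small NumPy sketch. It partitions the candidates and scores only the partitions closest to the query, then measures how much of the exact top-k is recovered. This shows the partitioning idea only and is not the ScaNN algorithm; the random centroids, the sizes, and the helper names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.normal(size=(2000, 16))
queries = rng.normal(size=(50, 16))
k = 10


def exact_top_k(query, vectors, k):
    """Brute force: score every vector and keep the k best indices."""
    return np.argsort(-(vectors @ query))[:k]


# Partition step: assign each candidate to the best-scoring of a few random centroids
# (a crude stand-in for the learned partitioning an ANN library would use).
n_partitions = 20
centroids = candidates[rng.choice(len(candidates), n_partitions, replace=False)]
assignment = np.argmax(candidates @ centroids.T, axis=1)


def approx_top_k(query, k, n_probe):
    """Score only candidates in the n_probe partitions whose centroids score highest."""
    probe = np.argsort(-(centroids @ query))[:n_probe]
    pool = np.flatnonzero(np.isin(assignment, probe))
    local = exact_top_k(query, candidates[pool], k)
    return pool[local]


def recall(n_probe):
    """Average overlap between approximate and exact top-k results."""
    hits = [
        len(set(approx_top_k(q, k, n_probe)) & set(exact_top_k(q, candidates, k)))
        for q in queries
    ]
    return sum(hits) / (k * len(queries))


print(recall(n_probe=3), recall(n_probe=n_partitions))
```

Probing every partition reduces to the brute-force search, while probing fewer partitions scores fewer candidates at the cost of possibly missing some of the exact top-k.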

In [32]:
# Get recommendations.
_, titles = scann_index(tf.constant(["52228204"]))
print(f"Recommendations for user 52228204: {titles[0, :10]}")

Recommendations for user 52228204: [b'WATERFLY Travel Outdoor Cooler Thermal Neoprene Lunch Bag Picnic Tote Box Container Insulated Zip Out Removable School Carry Handle Tote Lunch Bag'
 b'ANCHEER AN-KB001 Breathable Neoprene Knee Support & Brace Black (L)'
 b'Himal Ultra Bright LED Lantern for Camping, Hiking or Any Type of Emergency (Red)'
 b'Ingear 2-in-1 LED Camping Lantern Flashlight - 200 Lumen, 3 Mode, Portable and Collapsible Aluminum Camp Light'
 b'Ingear 2-in-1 LED Camping Lantern Flashlight - 200 Lumen, 3 Mode, Portable and Collapsible Aluminum Camp Light'
 b'Attmu Ice Grips Snow Grips for Shoes and Boot, Traction Cleats for Snow and Ice, 2 Size'
 b'Attmu Ice Grips Snow Grips for Shoes and Boot, Traction Cleats for Snow and Ice, 2 Size'
 b'Hand Warmers - HotSnapZ Reusable Round & Pocket Warmers'
 b'Hand Warmers - HotSnapZ Reusable Round & Pocket Warmers'
 b'Hand Warmers - HotSnapZ Reusable Round & Pocket Warmers']


In [33]:
# exporting ScaNN layer

# Export the query model.
gdrive_path = '/content/drive/MyDrive/Models'
path = os.path.join(gdrive_path, "baseline_model")

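# Save the index.
# NOTE: this saves `index` (the BruteForce layer from In [29]), not `scann_index`,
# which is consistent with the output below matching the brute-force results.
# To export the ScaNN layer, pass `scann_index` here instead.
tf.saved_model.save(
    index,
    path,
    options=tf.saved_model.SaveOptions(namespace_whitelist=["Scann"])
)

# Load it back; can also be done in TensorFlow Serving.
baseline_model_2 = tf.saved_model.load(path)

# Pass a user id in, get top predicted product titles back.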
scores, titles = baseline_model_2(["52228204"])

print(f"Recommendations: {titles[0][:10]}")



Recommendations: [b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out Outdoor Sport Action Helmet Car DVR Camera Camcorder DVR Dv for Driving Sport Outdoor"'
 b'"Dbpower Mini Hd 720p 1.3mp Waterproof 120\xc2\xb0wide Angle Lcd Display Support 32g Sd Card Mini Size Portable Only 26.5g 220mah Rechargeable Battery Auto Power-off Usb/tv Out

We trained the baseline model for 20 epochs in total. Accuracy on the training data kept improving with more epochs, but the model performs poorly on the test data, which may indicate overfitting.

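In terms of recommendations, the model returned multiple copies of the same products. The likely cause is that the candidate index is built from `products`, which contains one entry per review (115,120 rows) rather than one per unique title (46,438). Even so, I think the flashlight recommendation is a solid one for a mountain biker.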