# Part 2: Neural Recommender

I want to be able to generate recommendations for a user based on all the other recipes they've liked. The basic requirement is that a new user should be able to select a sample of recipes they like and generate recommendations without re-training the model.
* User embeddings
* Item embeddings

## 1. Pre-Processing Recipe Features

In [1]:
%colors nocolor

In [140]:
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

import numpy as np
import pandas as pd

import ast
import math

df_recipes = pd.read_csv('data/RAW_recipes.csv', index_col="id")
for col in ["tags", "nutrition", "steps", "ingredients"]:
    df_recipes[col] = df_recipes[col].apply(ast.literal_eval)

df_recipes.head(1)

Unnamed: 0_level_0,name,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
137739,arriba baked winter squash mexican style,55,47892,2005-09-16,"[60-minutes-or-less, time-to-make, course, mai...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"[make a choice and proceed with recipe, depend...",autumn is my favorite time of year to cook! th...,"[winter squash, mexican seasoning, mixed spice...",7


In [144]:
# handle nans
df_recipes["description"] = df_recipes["description"].fillna("")
df_recipes = df_recipes.dropna(subset=["name"])

# Convert nutrition info to individual columns
NUTRITION_COLS = ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']
df_recipes[NUTRITION_COLS] = df_recipes["nutrition"].tolist()

# Preprocess numerical features
NUMERICAL_FEATURES = ["minutes", "n_steps", "n_ingredients"] + NUTRITION_COLS
TEXT_FEATURES = ["ingredients"]

numerical_pipeline = make_pipeline(
    FunctionTransformer(lambda x: np.sign(x) * np.log(np.abs(x)+1)),
    StandardScaler()
).set_output(transform="pandas")

df_recipes_numerical = numerical_pipeline.fit_transform(df_recipes[NUMERICAL_FEATURES])
for col in NUMERICAL_FEATURES:
    df_recipes[f"F_{col}"] = df_recipes_numerical[col]
    # because tensorflow expects float32
    df_recipes = df_recipes.astype({f"F_{col}": 'float32'})

# Join lists of strings
for col in ["tags", "steps", "ingredients"]:
    df_recipes[f"F_{col}"] = df_recipes[col].apply(lambda x: " ".join(x))

# Unchanged cols
for col in ["name", "description"]:
    df_recipes[f"F_{col}"] = df_recipes[col]

df_recipes.head(1)



Unnamed: 0_level_0,name,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,...,F_sugar,F_sodium,F_protein,F_saturated_fat,F_carbohydrates,F_tags,F_steps,F_ingredients,F_name,F_description
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
137739,arriba baked winter squash mexican style,55,47892,2005-09-16,"[60-minutes-or-less, time-to-make, course, mai...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"[make a choice and proceed with recipe, depend...",autumn is my favorite time of year to cook! th...,"[winter squash, mexican seasoning, mixed spice...",...,-0.406204,-1.972144,-1.403525,-2.027874,-0.567595,60-minutes-or-less time-to-make course main-in...,make a choice and proceed with recipe dependin...,winter squash mexican seasoning mixed spice ho...,arriba baked winter squash mexican style,autumn is my favorite time of year to cook! th...


In [146]:
feature_cols = [col for col in df_recipes.columns if col.startswith("F_")]
print(f"{len(feature_cols)} features")
print(feature_cols)

15 features
['F_minutes', 'F_n_steps', 'F_n_ingredients', 'F_calories', 'F_total_fat', 'F_sugar', 'F_sodium', 'F_protein', 'F_saturated_fat', 'F_carbohydrates', 'F_tags', 'F_steps', 'F_ingredients', 'F_name', 'F_description']


## 2. Pre-Processing Interaction Features

In this section we pre-process user-recipe interactions into a format where we each interaction is eriched with the features for:
1. The target recipe (i.e. one being interacted with)
2. The set of recipes the user has interacted with *excluding* the target recipe (and ignoring time dependencies). I'll call these *context* recipes.

In [147]:
df_interactions = pd.read_csv('data/RAW_interactions.csv')
df_interactions = df_interactions.drop(["review", "date"], axis=1)
df_interactions = df_interactions.astype({"rating": float})
df_interactions.head(1)

Unnamed: 0,user_id,recipe_id,rating
0,38094,40893,4.0


Step 2: Join recipe features onto the interactions by `recipe_id`

In [148]:
df_interactions_joined = pd.merge(df_interactions, df_recipes[feature_cols], how='inner', left_on=['recipe_id'], right_index=True)
df_interactions_joined.head(1)

Unnamed: 0,user_id,recipe_id,rating,F_minutes,F_n_steps,F_n_ingredients,F_calories,F_total_fat,F_sugar,F_sodium,F_protein,F_saturated_fat,F_carbohydrates,F_tags,F_steps,F_ingredients,F_name,F_description
0,38094,40893,4.0,2.322343,-1.177384,0.170344,-0.380026,-0.843167,-0.631142,0.533637,0.258756,-1.271503,0.188034,weeknight time-to-make course main-ingredient ...,"combine beans , onion , chilies , 1 / 2 teaspo...",great northern beans yellow onion diced green ...,white bean green chile pepper soup,easy soup for the crockpot.


Step 3: Aggregate each column on `user_id` into lists, such that each each row in the aggregation table respresents the set of recipes a user interacts with. This is a table *context recipes* for each user.

In [149]:
df_interactions_groupby = df_interactions_joined.groupby("user_id").agg(list)
# only keep groups with at least 2 interactions
df_interactions_groupby = df_interactions_groupby.rename(columns={col: col + "_list" for col in df_interactions_groupby.columns if col != "user_id"})
df_interactions_groupby = df_interactions_groupby.loc[df_interactions_groupby['recipe_id_list'].apply(len) > 2]
df_interactions_groupby.head(1)

Unnamed: 0_level_0,recipe_id_list,rating_list,F_minutes_list,F_n_steps_list,F_n_ingredients_list,F_calories_list,F_total_fat_list,F_sugar_list,F_sodium_list,F_protein_list,F_saturated_fat_list,F_carbohydrates_list,F_tags_list,F_steps_list,F_ingredients_list,F_name_list,F_description_list
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1533,"[116345, 32907, 14750, 24136, 63598, 83375, 35...","[5.0, 5.0, 5.0, 5.0, 4.0, 5.0, 5.0, 5.0, 5.0, ...","[0.4711129367351532, -0.5799463391304016, 1.28...","[0.46291449666023254, -0.07609347999095917, -1...","[1.5522392988204956, -0.4107809066772461, -0.4...","[0.24415189027786255, 0.18646495044231415, -0....","[0.19557610154151917, 0.40097182989120483, -0....","[0.5205592513084412, 0.44479694962501526, -0.3...","[0.34256595373153687, -0.22151923179626465, -0...","[1.0546941757202148, 0.9451212286949158, -0.54...","[0.03462296724319458, 0.7556794285774231, -0.4...","[0.09669177979230881, -0.39286482334136963, -0...",[time-to-make course main-ingredient cuisine p...,[combine all cashew crust ingredients in a sma...,[tilapia fillets egg white flour lemon cashews...,"[cashew crusted stuffed tilapia, indecent brea...",[this recipe was created for ready set cook 20...


Step 4: Assembling the training sample records

In [150]:
import random

def get_padding_value(feature_list):
    if isinstance(feature_list[0], str):
        return ""
    else:
        return 0


def sample_sequence_features(user_id, recipe_id, interactions_grouped, sequence_len=20):

    record = {}
    
    row = interactions_grouped.loc[user_id]
    target_index = row["recipe_id_list"].index(recipe_id)

    # work out sequence size
    n_samples = len(row["recipe_id_list"])
    sample_size = min(sequence_len, n_samples-1)    # -1 because we will always remove the target before sampling
    sample_idx = random.sample(range(sample_size), sample_size)

    for col in row.index:
        if col.endswith("_list"):
            feature_list = list(row[col])
            # remove target
            del feature_list[target_index]
            feature_sample = [feature_list[i] for i in sample_idx]
            padding_value = get_padding_value(feature_list)
            while len(feature_sample) < sequence_len:
                feature_sample.append(padding_value)
    
        record[col] = feature_sample
    return record


def create_dataset_record(user_id, recipe_id):
    # uses df_recipes, df_interactions_groupby, features_cols from global scope
    record = {"user_id": user_id, "recipe_id": recipe_id}
    record.update(df_recipes.loc[recipe_id][feature_cols])
    record.update(sample_sequence_features(user_id, recipe_id, df_interactions_groupby))
    return record


def records_to_lists(records):
    return {key: [i[key] for i in records] for key in records[0]}

In [170]:
# filter off user_ids with sequence len < 2
df_interactions_filtered = df_interactions[df_interactions["user_id"].isin(df_interactions_groupby.index)]
args = df_interactions_filtered[["user_id", "recipe_id"]].to_records(index=False)
records = []
for user_id, recipe_id in args[:10000]:
    records.append(create_dataset_record(user_id, recipe_id))

Finally, we use the `tf.data.Dataset` API to wrap the dataset, ready to be used for modelling.

In [171]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices(records_to_lists(records))

for record in dataset:
    print(record)
    break

{'user_id': <tf.Tensor: shape=(), dtype=int32, numpy=38094>, 'recipe_id': <tf.Tensor: shape=(), dtype=int32, numpy=40893>, 'F_minutes': <tf.Tensor: shape=(), dtype=float32, numpy=2.3223433>, 'F_n_steps': <tf.Tensor: shape=(), dtype=float32, numpy=-1.1773845>, 'F_n_ingredients': <tf.Tensor: shape=(), dtype=float32, numpy=0.17034367>, 'F_calories': <tf.Tensor: shape=(), dtype=float32, numpy=-0.38002646>, 'F_total_fat': <tf.Tensor: shape=(), dtype=float32, numpy=-0.84316725>, 'F_sugar': <tf.Tensor: shape=(), dtype=float32, numpy=-0.6311418>, 'F_sodium': <tf.Tensor: shape=(), dtype=float32, numpy=0.533637>, 'F_protein': <tf.Tensor: shape=(), dtype=float32, numpy=0.2587562>, 'F_saturated_fat': <tf.Tensor: shape=(), dtype=float32, numpy=-1.2715031>, 'F_carbohydrates': <tf.Tensor: shape=(), dtype=float32, numpy=0.18803358>, 'F_tags': <tf.Tensor: shape=(), dtype=string, numpy=b'weeknight time-to-make course main-ingredient preparation occasion soups-stews beans vegetables easy crock-pot-slow-c

## 3. Configuring the Model

Here I build a standard 2 tower model (similar to a the basic TF tutorial [here](https://www.tensorflow.org/recommenders/examples/basic_retrieval)), with a few twists:
* Rather than learning a table of fixed embeddings for each user and movie, I dynamically generate an embedding for the user and recipe solely based on content and context based features. This should mean that if I add new users and new recipes, I won't need to retrain the model from scratch
* A key piece of information that feeds into the user model is an aggregation over the set of other recipes they've liked. I share many of the embedding layers across the user and recipe towers.

In [155]:
from typing import List, Dict, Text
import tensorflow.python.keras.backend as K
import tensorflow_recommenders as tfrs


NUMERICAL = [
    'F_minutes',
    'F_n_steps',
    'F_n_ingredients',
    'F_calories',
    'F_sugar',
    'F_total_fat',
    'F_sodium',
    'F_protein',
    'F_saturated_fat',
    'F_carbohydrates'
]

NUMERICAL_HISTORY = [f"{num}_list" for num in NUMERICAL] #+ ["rating_list"]


class PoolingTextEmbedder(tf.keras.Model):
    """Currently masking of padding tokens""" 
    def __init__(self, vocabulary_list=List[str], embedding_dim=16, max_tokens=10_000):
        super().__init__()

        self.embedding_dim = embedding_dim
        self.text_vectorizor = tf.keras.layers.TextVectorization(max_tokens=max_tokens)
        self.text_vectorizor.adapt(vocabulary_list)
        self.embedding_layer = tf.keras.layers.Embedding(max_tokens, embedding_dim)

    def call(self, x):
        x = self.text_vectorizor(x)
        x = self.embedding_layer(x)
        x = tf.math.reduce_mean(x, axis=-2)
        return x


class UserModel(tf.keras.Model):

    def __init__(self, numerical_cols: List[str], text_embedding_layer: tf.keras.Model, output_dims: List[int]=[32]):
        super().__init__()

        # inputs
        self.numerical_cols = numerical_cols
        self.text_embedding_layer = text_embedding_layer
        
        # attention and pooling over sequence
        self.attention_dim = len(numerical_cols) + text_embedding_layer.embedding_dim
        self.key_layer = tf.keras.layers.Dense(self.attention_dim, activation="relu", name="dense_key")
        self.query_layer = tf.keras.layers.Dense(self.attention_dim, activation="relu", name="dense_query")
        self.attention_layer = tf.keras.layers.Attention(name="attention")
        
        # Use the ReLU activation for all but the last layer
        self.dense_layers = tf.keras.Sequential(name="user_dense_output")
        for dim in output_dims[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(dim, activation="relu"))
        self.dense_layers.add(tf.keras.layers.Dense(output_dims[-1]))

    def _pad_rank(self, x):
        return tf.expand_dims(x, axis=-1)

    def __call__(self, inputs):
        # Prepare text and numerical inputs
        text_inputs = self.text_embedding_layer(self._pad_rank(inputs["F_ingredients_list"]))
        numerical_inputs = [self._pad_rank(inputs[f]) for f in self.numerical_cols]
        x = tf.concat([text_inputs] + numerical_inputs, axis=-1)
        # apply attention -> dense layers
        x = self.attention_layer([self.query_layer(x), x, self.key_layer(x)])
        x = tf.math.reduce_mean(x, axis=-2)
        x = tf.reshape(x, [-1, self.attention_dim]) # hack required to handle single sample at inference
        x = self.dense_layers(x)
        return x


class ItemModel(tf.keras.Model):

    def __init__(self, numerical_cols: List[str], text_embedding_layer: tf.keras.Model, output_dims: List[str]=[32]):
        super().__init__()
    
        # inputs
        self.numerical_cols = numerical_cols
        self.text_embedding_layer = text_embedding_layer

        # Use the ReLU activation for all but the last layer
        self.dense_layers = tf.keras.Sequential(name="item_dense_output")
        for dim in output_dims[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(dim, activation="relu"))
        self.dense_layers.add(tf.keras.layers.Dense(output_dims[-1]))

    def __call__(self, inputs):
        text_inputs = self.text_embedding_layer(inputs["F_ingredients"])
        numerical_inputs = [tf.expand_dims(inputs[f], axis=-1) for f in self.numerical_cols]
        x = tf.concat([text_inputs] + numerical_inputs, axis=-1)
        # apply dense layers
        x = self.dense_layers(x)
        return x


# Based largely on this code: https://www.tensorflow.org/recommenders/examples/basic_retrieval
class SimpleRetrievalModel(tfrs.Model):

	def __init__(
		self,
        user_model: tf.keras.Model,
        item_model: tf.keras.Model,
		candidates
	):
		super().__init__()
		
		self.user_model = user_model
		self.item_model = item_model

		self.task = tfrs.tasks.Retrieval(
			metrics=tfrs.metrics.FactorizedTopK(
				candidates=candidates.batch(128).map(self.item_model),
				ks=(10,)
			)
		)

	def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
		user_embeddings = self.user_model(features)
		item_embeddings = self.item_model(features)
		return self.task(user_embeddings, item_embeddings)

In [156]:
ingredient_embedding_layer = PoolingTextEmbedder(dataset.map(lambda x: x["F_ingredients"]), embedding_dim=16)

user_model = UserModel(NUMERICAL_HISTORY, ingredient_embedding_layer, output_dims=[8])
item_model = ItemModel(NUMERICAL, ingredient_embedding_layer, output_dims=[8])

for batch in dataset.take(10).batch(2):
    tester = batch
    break

print("User model output example shape:", user_model(tester).shape)
print("Item model output example shape:", item_model(tester).shape)

User model output example shape: (2, 8)
Item model output example shape: (2, 8)


If we train this for a few epochs on a tiny dataset, the training loss goes down which gives me confidence the model is learning something. I just need to now scale the model up and use the full training set. Since I'm developing this locally, this will require a bigger machine...

In [182]:
# create a tiny train and test set
tf.random.set_seed(42)
shuffled = dataset.shuffle(1000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(7000)
validation = shuffled.skip(7000).take(1000)
test = shuffled.skip(8000).take(2000)

cached_train = train.batch(100).cache()
cached_validation = validation.batch(100).cache()
cached_test = test.batch(100).cache()

# create a dataset of recipes to pass as candidate recommendations
df_recipes_with_id_col = df_recipes
df_recipes_with_id_col["id"] = df_recipes_with_id_col.index
recipes_dataset = tf.data.Dataset.from_tensor_slices(df_recipes_with_id_col[["name"] + feature_cols].to_dict("list"))

model = SimpleRetrievalModel(
    user_model=user_model,
    item_model=item_model,
    candidates=recipes_dataset
)

In [183]:
es_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
history = model.fit(cached_train, epochs=20, validation_data=cached_validation, callbacks=[es_callback])

results = model.evaluate(cached_test, return_dict=True)
print(results)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
{'factorized_top_k/top_10_categorical_accuracy': 0.0, 'loss': 452.5555419921875, 'regularization_loss': 0, 'total_loss': 452.5555419921875}


For simplicity, we use a brute force nearest neighbour search method to recommend the top N recipes. We could use `ScANN` for more efficient serving...

In [None]:
# Create a model that takes in raw query features, and
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
# recommends movies out of the entire movies dataset.
index.index_from_dataset(
  tf.data.Dataset.zip((recipes_dataset.batch(20).map(lambda x: x["name"]), recipes_dataset.batch(20).map(model.item_model)))
)

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7f96bb169b20>

In [16]:
# generating recommendations for a single test user
index(test.take(1).get_single_element())

(<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
 array([[1.4075656, 1.383584 , 1.3598764, 1.3485157, 1.3378793, 1.3328927,
         1.3225005, 1.3119059, 1.3078291, 1.306123 ]], dtype=float32)>,
 <tf.Tensor: shape=(1, 10), dtype=string, numpy=
 array([[b'roasted turkey pesto panini', b'fatal attraction cocktail',
         b'yogurt dill sauce', b'cayenne mayonnaise dip',
         b'rich scrambled eggs for those not afraid of fat content',
         b'marinated tuna', b'agua', b'intrigue summer breeze',
         b'tuna parmesan spread', b'scrappy snack']], dtype=object)>)

Next steps: train a bigger model on the full dataset on a big machine...enter AWS.