# Part 2: Neural Recommender

I want to be able to generate recommendations for a user based on all the other recipes they've liked. The basic requirement is that a new user should be able to select a sample of recipes they like and generate recommendations without re-training the model.
* User embeddings
* Item embeddings

## 1. Pre-Processing Recipe Features

In [3]:
%colors nocolor

In [4]:
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

import numpy as np
import pandas as pd

import ast
import math

df_recipes = pd.read_csv('data/RAW_recipes.csv')
for col in ["tags", "nutrition", "steps", "ingredients"]:
    df_recipes[col] = df_recipes[col].apply(ast.literal_eval)

df_recipes.head(1)

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"[60-minutes-or-less, time-to-make, course, mai...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"[make a choice and proceed with recipe, depend...",autumn is my favorite time of year to cook! th...,"[winter squash, mexican seasoning, mixed spice...",7


In [5]:
# handle nans
df_recipes["description"] = df_recipes["description"].fillna("")
df_recipes = df_recipes.dropna(subset=["name"])

# Convert nutrition info to individual columns
NUTRITION_COLS = ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']
df_recipes[NUTRITION_COLS] = df_recipes["nutrition"].tolist()

# Preprocess numerical features
NUMERICAL_FEATURES = ["minutes", "n_steps", "n_ingredients"] + NUTRITION_COLS
TEXT_FEATURES = ["ingredients"]

numerical_pipeline = make_pipeline(
    FunctionTransformer(lambda x: np.sign(x) * np.log(np.abs(x)+1)),
    StandardScaler()
).set_output(transform="pandas")

df_recipes_numerical = numerical_pipeline.fit_transform(df_recipes[NUMERICAL_FEATURES])
for col in NUMERICAL_FEATURES:
    df_recipes[f"F_{col}"] = df_recipes_numerical[col]

# Join lists of strings
for col in ["tags", "steps", "ingredients"]:
    df_recipes[f"F_{col}"] = df_recipes[col].apply(lambda x: " ".join(x))

# Unchanged cols
for col in ["id", "name", "description"]:
    df_recipes[f"F_{col}"] = df_recipes[col]

df_recipes.head(1)



Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,...,F_sodium,F_protein,F_saturated_fat,F_carbohydrates,F_tags,F_steps,F_ingredients,F_id,F_name,F_description
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"[60-minutes-or-less, time-to-make, course, mai...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"[make a choice and proceed with recipe, depend...",autumn is my favorite time of year to cook! th...,...,-1.972144,-1.403524,-2.027874,-0.567595,60-minutes-or-less time-to-make course main-in...,make a choice and proceed with recipe dependin...,winter squash mexican seasoning mixed spice ho...,137739,arriba baked winter squash mexican style,autumn is my favorite time of year to cook! th...


In [6]:
feature_cols = [col for col in df_recipes.columns if col.startswith("F_")]
print(f"{len(feature_cols)} features:")
print(feature_cols)

16 features:
['F_minutes', 'F_n_steps', 'F_n_ingredients', 'F_calories', 'F_total_fat', 'F_sugar', 'F_sodium', 'F_protein', 'F_saturated_fat', 'F_carbohydrates', 'F_tags', 'F_steps', 'F_ingredients', 'F_id', 'F_name', 'F_description']


## 2. Pre-Processing Interaction Features

In this section we pre-process user-recipe interactions into a format where each interaction is enriched with the features for:
1. The target recipe (i.e. one being interacted with)
2. The set of recipes the user has interacted with *excluding* the target recipe (and ignoring time dependencies). I'll call these *context* recipes.
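The target/context split described above can be sketched on a toy example. This is a minimal illustration in plain Python rather than the pandas pipeline used below; the user and recipe IDs are made up.

```python
from collections import defaultdict

# Toy (user_id, recipe_id) interactions; the IDs are made up for illustration.
interactions = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "d")]

# Collect every recipe each user has interacted with.
recipes_by_user = defaultdict(list)
for user_id, recipe_id in interactions:
    recipes_by_user[user_id].append(recipe_id)

# One sample per interaction: the target recipe plus all *other* recipes
# the same user interacted with (the context recipes).
samples = [
    {
        "user_id": user_id,
        "target": recipe_id,
        "context": [r for r in recipes_by_user[user_id] if r != recipe_id],
    }
    for user_id, recipe_id in interactions
]

for sample in samples:
    print(sample)
```

Each interaction becomes its own training sample, so a user with `n` interactions contributes `n` rows, each with a different recipe held out as the target.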

In [7]:
df_interactions = pd.read_csv('data/RAW_interactions.csv')
df_interactions = df_interactions.drop(["review", "date"], axis=1)
df_interactions.head(1)

Unnamed: 0,user_id,recipe_id,rating
0,38094,40893,4


Step 2: Join recipe features onto the interactions by `recipe_id`

In [8]:
df_interactions_joined = pd.merge(df_interactions, df_recipes[["id"] + feature_cols], how='inner', left_on=['recipe_id'], right_on=['id'])
df_interactions_joined.head(1)

Unnamed: 0,user_id,recipe_id,rating,id,F_minutes,F_n_steps,F_n_ingredients,F_calories,F_total_fat,F_sugar,F_sodium,F_protein,F_saturated_fat,F_carbohydrates,F_tags,F_steps,F_ingredients,F_id,F_name,F_description
0,38094,40893,4,40893,2.322343,-1.177385,0.170344,-0.380026,-0.843167,-0.631142,0.533637,0.258756,-1.271503,0.188034,weeknight time-to-make course main-ingredient ...,"combine beans , onion , chilies , 1 / 2 teaspo...",great northern beans yellow onion diced green ...,40893,white bean green chile pepper soup,easy soup for the crockpot.


Step 3: Aggregate each column on `user_id` into lists, such that each row in the aggregated table represents the set of recipes a user has interacted with. This is a table of *context recipes* for each user.

In [9]:
df_interactions_groupby = df_interactions_joined.groupby("user_id").agg(list)
df_interactions_groupby = df_interactions_groupby.reset_index()
df_interactions_groupby = df_interactions_groupby.rename(columns={col: col + "_list" for col in df_interactions_groupby.columns if col != "user_id"})
# only keep users with more than 2 interactions
df_interactions_groupby = df_interactions_groupby.loc[df_interactions_groupby['recipe_id_list'].apply(len) > 2]
df_interactions_groupby.head(1)

Unnamed: 0,user_id,recipe_id_list,rating_list,id_list,F_minutes_list,F_n_steps_list,F_n_ingredients_list,F_calories_list,F_total_fat_list,F_sugar_list,F_sodium_list,F_protein_list,F_saturated_fat_list,F_carbohydrates_list,F_tags_list,F_steps_list,F_ingredients_list,F_id_list,F_name_list,F_description_list
0,1533,"[116345, 32907, 14750, 24136, 63598, 83375, 35...","[5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...","[116345, 32907, 14750, 24136, 63598, 83375, 35...","[0.4711129434497512, -0.5799463465672527, 1.28...","[0.4629144894108395, -0.0760934775964507, -1.1...","[1.552239356685329, -0.4107808961373495, -0.41...","[0.24415188448893518, 0.1864649534815458, -0.7...","[0.1955761015159676, 0.40097184283934584, -0.7...","[0.5205592569484367, 0.444796953040082, -0.360...","[0.3425659458571573, -0.2215192245871849, -0.1...","[1.054694179027774, 0.9451212086373081, -0.542...","[0.03462296668504492, 0.755679422126608, -0.44...","[0.0966917832022376, -0.392864824739948, -0.11...",[time-to-make course main-ingredient cuisine p...,[combine all cashew crust ingredients in a sma...,[tilapia fillets egg white flour lemon cashews...,"[116345, 32907, 14750, 24136, 63598, 83375, 35...","[cashew crusted stuffed tilapia, indecent brea...",[this recipe was created for ready set cook 20...


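Step 4: Join the context recipe table back onto the interaction table by `user_id`. Note that `pd.merge` defaults to an inner join, so interactions from users filtered out in step 3 are dropped.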

In [10]:
df_samples = pd.merge(df_interactions_joined, df_interactions_groupby, on="user_id")
df_samples.head(1)

Unnamed: 0,user_id,recipe_id,rating,id,F_minutes,F_n_steps,F_n_ingredients,F_calories,F_total_fat,F_sugar,...,F_sodium_list,F_protein_list,F_saturated_fat_list,F_carbohydrates_list,F_tags_list,F_steps_list,F_ingredients_list,F_id_list,F_name_list,F_description_list
0,38094,40893,4,40893,2.322343,-1.177385,0.170344,-0.380026,-0.843167,-0.631142,...,"[0.5336369886062798, 0.4751245248312128, -0.14...","[0.2587561767435663, 0.4764186786861827, -0.63...","[-1.271503074491975, -1.0734401952842334, -0.3...","[0.18803357608915794, 0.8077374177449108, -0.7...",[weeknight time-to-make course main-ingredient...,"[combine beans , onion , chilies , 1 / 2 teasp...",[great northern beans yellow onion diced green...,"[40893, 16954, 40753, 34513, 69545, 49064, 800...","[white bean green chile pepper soup, black b...","[easy soup for the crockpot., one of my favori..."


Step 5: remove the target recipe from each set of context recipes for each interaction row.

This caused me a lot of headaches, because it turns out the join in step 4 resulted in only *references* to the aggregate rows produced in step 3 being stored in the final joined table, `df_samples`. The data itself is only stored once. This means that directly modifying the contents of a joined cell modifies it for every other row belonging to that user. For example, removing an element from `F_minutes_list` for `user_1`, `recipe_a` has the side effect of removing the same element from `F_minutes_list` for `user_1`, `recipe_b`. This is because both cells reference the same underlying object in memory (you can check this using the built-in `id` function).

This is why I resorted to building a new list for each cell and writing it back with `.at`. That is extremely slow, taking ~0.1s per row (or ~25 hours for the whole thing), hence this ugly line of code:
```
    df_samples.at[i, col] = [get_list_sample_val_or_pad(j, row[col], sample_idx, padding) for j in range(max_sequence)]
```
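The shared-reference behaviour can be reproduced on a toy table. This is a minimal sketch with made-up column names and data, assuming (as observed above) that `pd.merge` copies references to the list objects rather than the lists themselves. It also shows one possible alternative to per-cell writes: building a fresh list per row in a single comprehension. I haven't benchmarked this against the full dataset.

```python
import pandas as pd

# Toy data; the column names and values are illustrative only.
interactions = pd.DataFrame({"user_id": [1, 1, 1], "recipe_id": [10, 20, 30]})
per_user = pd.DataFrame({"user_id": [1], "recipe_id_list": [[10, 20, 30]]})

merged = pd.merge(interactions, per_user, on="user_id")

# Every row for user 1 points at the same Python list object.
shared = merged.at[0, "recipe_id_list"] is merged.at[1, "recipe_id_list"]
print("rows share one list:", shared)

# Building a fresh list per row avoids mutating the shared object.
merged["context"] = [
    [r for r in recipes if r != target]
    for recipes, target in zip(merged["recipe_id_list"], merged["recipe_id"])
]
print(merged[["recipe_id", "context"]])
```

Because the comprehension never mutates the original lists, the aggregated table from step 3 is left intact and every row gets its own independent context.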

Other gotchas:
* Using `df.apply()` here didn't work too well either: the run time didn't scale linearly with the number of rows I applied it to, which was a bit worrying. I haven't tried running it for longer than 2 hours though.
* Another gotcha in `pandas` is that modifying an element by *chained indexing* sometimes leaves the original table unchanged. See [here](https://pandas.pydata.org/docs/user_guide/indexing.html#returning-a-view-versus-a-copy) for more details.
* For the sake of brevity, I've capped each context recipe set at 20 entries by sampling, padding shorter sets with default values (`""` for text, `0` for numbers).
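The sample-then-pad step from the last bullet can be sketched in isolation. This is a hypothetical variant rather than the notebook's exact logic: the helper names are my own, and it drops the target *before* sampling so that the context is filled to the full length whenever enough history exists.

```python
import random

def sample_context_indices(sequence_size, target_index, max_len, rng):
    """Pick up to max_len positions from the history, never the target's."""
    candidates = [i for i in range(sequence_size) if i != target_index]
    return rng.sample(candidates, min(max_len, len(candidates)))

def take_and_pad(values, indices, max_len, padding):
    """Gather values at the sampled positions and pad to a fixed length."""
    taken = [values[i] for i in indices]
    return taken + [padding] * (max_len - len(taken))

rng = random.Random(0)  # seeded only to make the sketch reproducible
recipe_ids = [101, 102, 103, 104, 105]
names = ["soup", "stew", "salad", "bread", "cake"]

# Sample positions once, then reuse them for every column so columns stay aligned.
idx = sample_context_indices(len(recipe_ids), target_index=2, max_len=8, rng=rng)
context_ids = take_and_pad(recipe_ids, idx, max_len=8, padding=0)
context_names = take_and_pad(names, idx, max_len=8, padding="")

print(context_ids)
print(context_names)
```

Sampling the positions once and reusing them across columns matters: if each `_list` column were sampled independently, the numerical and text features of a context recipe would no longer line up.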

For now, the remaining work is done on only 100 rows of data, just to show the model is learning *something* and to test that all the components for training, testing and inference work.

In [11]:
import random

max_sequence = 20

def get_list_sample_val_or_pad(i, lst, sample_idx, padding):
    if i < len(sample_idx):
        return lst[sample_idx[i]]
    else:
        return padding

# for i in range(len(df_samples)): # this should take ~25 hours...need to figure out how to do this faster
for i in range(100):
    if (i+1) % 10000 == 0:
        print(f"Processed {i+1} rows")
    row = df_samples.iloc[i]
    target_index = row["recipe_id_list"].index(row["recipe_id"])
    
    sequence_size = len(row["recipe_id_list"])
    sample_size = min(max_sequence, sequence_size)
    # sample positions from the user's whole history, not just the first `sample_size` entries
    sample_idx = random.sample(range(sequence_size), sample_size)
    # remove target index from sample
    if target_index in sample_idx:
        sample_idx.remove(target_index)

    for col in row.index:
        if col.endswith("_list"):
            # work out what value to pad rows with
            if isinstance(row[col][0], str):
                padding = ""
            else:
                padding = 0
            # build a new list instead of mutating the shared one; `j` avoids shadowing the row index `i`
            df_samples.at[i, col] = [get_list_sample_val_or_pad(j, row[col], sample_idx, padding) for j in range(max_sequence)]

In [12]:
for key, val in df_samples.iloc[0].items():
    if key.endswith("list"):
        print(key, len(val))

recipe_id_list 20
rating_list 20
id_list 20
F_minutes_list 20
F_n_steps_list 20
F_n_ingredients_list 20
F_calories_list 20
F_total_fat_list 20
F_sugar_list 20
F_sodium_list 20
F_protein_list 20
F_saturated_fat_list 20
F_carbohydrates_list 20
F_tags_list 20
F_steps_list 20
F_ingredients_list 20
F_id_list 20
F_name_list 20
F_description_list 20


Finally, we use the `tf.data.Dataset` API to wrap the dataset, ready to be used for modelling.

In [13]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices(df_samples.head(100).to_dict("list"))

## 3. Configuring the Model

Here I build a standard two-tower model (similar to the basic TF tutorial [here](https://www.tensorflow.org/recommenders/examples/basic_retrieval)), with a few twists:
* Rather than learning a table of fixed embeddings for each user and recipe, I dynamically generate an embedding for the user and recipe based solely on content- and context-based features. This should mean that if I add new users and new recipes, I won't need to retrain the model from scratch.
* A key piece of information that feeds into the user model is an aggregation over the set of other recipes they've liked. I share many of the embedding layers across the user and recipe towers.
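To make the idea concrete, here is a minimal NumPy sketch of a feature-based two-tower scorer. It is a heavy simplification of the model below, not a reimplementation: the weights are random stand-ins for learned layers, there is no text embedding, and the attention uses the raw features as query, key and value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, embed_dim = 6, 4

# Hypothetical weights standing in for the learned dense layers.
W_item = rng.normal(size=(n_features, embed_dim))
W_user = rng.normal(size=(n_features, embed_dim))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def item_embedding(features):
    """Item tower: map recipe feature vectors to embeddings."""
    return features @ W_item

def user_embedding(context_features):
    """User tower: dot-product self-attention over context recipes, then mean pooling."""
    scores = context_features @ context_features.T   # (n_context, n_context)
    attended = softmax(scores) @ context_features    # (n_context, n_features)
    return attended.mean(axis=0) @ W_user            # (embed_dim,)

# A brand-new user is just a new set of context recipes: no new parameters needed.
new_user_context = rng.normal(size=(5, n_features))
candidates = rng.normal(size=(10, n_features))

scores = item_embedding(candidates) @ user_embedding(new_user_context)
print(scores.shape)  # (10,)
```

Since neither tower looks up a per-user or per-recipe ID, scoring an unseen user only requires the features of the recipes they selected.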

In [14]:
from typing import List, Dict, Text
import tensorflow_recommenders as tfrs


NUMERICAL = [
    'F_minutes',
    'F_n_steps',
    'F_n_ingredients',
    'F_calories',
    'F_sugar',
    'F_total_fat',
    'F_sodium',
    'F_protein',
    'F_saturated_fat',
    'F_carbohydrates'
]

NUMERICAL_HISTORY = [f"{num}_list" for num in NUMERICAL]


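class PoolingTextEmbedder(tf.keras.Model):
    """Mean-pools token embeddings. Currently no masking of padding tokens."""
    # vocabulary_list: strings (or a tf.data.Dataset of strings) used to adapt the vectorizer
    def __init__(self, vocabulary_list, embedding_dim=16, max_tokens=10_000):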
        super().__init__()

        self.embedding_dim = embedding_dim
        self.text_vectorizor = tf.keras.layers.TextVectorization(max_tokens=max_tokens)
        self.text_vectorizor.adapt(vocabulary_list)
        self.embedding_layer = tf.keras.layers.Embedding(max_tokens, embedding_dim)

    def call(self, x):
        x = self.text_vectorizor(x)
        x = self.embedding_layer(x)
        x = tf.math.reduce_mean(x, axis=-2)
        return x


class UserModel(tf.keras.Model):

    def __init__(self, numerical_cols: List[str], text_embedding_layer: tf.keras.Model, output_dims: List[int]=[32]):
        super().__init__()

        # inputs
        self.numerical_cols = numerical_cols
        self.text_embedding_layer = text_embedding_layer
        
        # attention and pooling over sequence
        self.attention_dim = len(numerical_cols) + text_embedding_layer.embedding_dim
        self.key_layer = tf.keras.layers.Dense(self.attention_dim, activation="relu", name="dense_key")
        self.query_layer = tf.keras.layers.Dense(self.attention_dim, activation="relu", name="dense_query")
        self.attention_layer = tf.keras.layers.Attention(name="attention")
        
        # Use the ReLU activation for all but the last layer
        self.dense_layers = tf.keras.Sequential(name="user_dense_output")
        for dim in output_dims[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(dim, activation="relu"))
        self.dense_layers.add(tf.keras.layers.Dense(output_dims[-1]))

    def _pad_rank(self, x):
        return tf.expand_dims(x, axis=-1)

    def __call__(self, inputs):
        # Prepare text and numerical inputs
        text_inputs = self.text_embedding_layer(self._pad_rank(inputs["F_ingredients_list"]))
        numerical_inputs = [self._pad_rank(inputs[f]) for f in self.numerical_cols]
        x = tf.concat([text_inputs] + numerical_inputs, axis=-1)
        # apply attention -> dense layers
        x = self.attention_layer([self.query_layer(x), x, self.key_layer(x)])
        x = tf.math.reduce_mean(x, axis=-2)
        x = tf.reshape(x, [-1, self.attention_dim]) # hack required to handle single sample at inference
        x = self.dense_layers(x)
        return x


class ItemModel(tf.keras.Model):

    def __init__(self, numerical_cols: List[str], text_embedding_layer: tf.keras.Model, output_dims: List[int]=[32]):
        super().__init__()
    
        # inputs
        self.numerical_cols = numerical_cols
        self.text_embedding_layer = text_embedding_layer

        # Use the ReLU activation for all but the last layer
        self.dense_layers = tf.keras.Sequential(name="item_dense_output")
        for dim in output_dims[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(dim, activation="relu"))
        self.dense_layers.add(tf.keras.layers.Dense(output_dims[-1]))

    def __call__(self, inputs):
        text_inputs = self.text_embedding_layer(inputs["F_ingredients"])
        numerical_inputs = [tf.expand_dims(inputs[f], axis=-1) for f in self.numerical_cols]
        x = tf.concat([text_inputs] + numerical_inputs, axis=-1)
        # apply dense layers
        x = self.dense_layers(x)
        return x


# Based largely on this code: https://www.tensorflow.org/recommenders/examples/basic_retrieval
class SimpleRetrievalModel(tfrs.Model):

    def __init__(
        self,
        user_model: tf.keras.Model,
        item_model: tf.keras.Model,
        candidates
    ):
        super().__init__()

        self.user_model = user_model
        self.item_model = item_model

        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=candidates.batch(128).map(self.item_model),
                ks=(10,)
            )
        )

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        user_embeddings = self.user_model(features)
        item_embeddings = self.item_model(features)
        return self.task(user_embeddings, item_embeddings)

In [15]:
ingredient_embedding_layer = PoolingTextEmbedder(dataset.map(lambda x: x["F_ingredients"]), embedding_dim=16)

user_model = UserModel(NUMERICAL_HISTORY, ingredient_embedding_layer, output_dims=[8])
item_model = ItemModel(NUMERICAL, ingredient_embedding_layer, output_dims=[8])

for batch in dataset.take(10).batch(2):
    tester = batch
    break

print("User model output example shape:", user_model(tester).shape)
print("Item model output example shape:", item_model(tester).shape)

User model output example shape: (2, 8)
Item model output example shape: (2, 8)


If we train this for a few epochs on a tiny dataset, the training loss goes down, which gives me confidence the model is learning something. I now just need to scale the model up and use the full training set. Since I'm developing this locally, this will require a bigger machine...
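For intuition about what the loss measures, here is a NumPy sketch of an in-batch softmax loss. I'm assuming this is roughly what `tfrs.tasks.Retrieval` computes with its default settings (each row's own item is the positive, the other items in the batch act as negatives); the function name and the toy embeddings are my own.

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb):
    """Cross-entropy where row i's positive is item i; other items are negatives."""
    logits = user_emb @ item_emb.T                        # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.trace(log_probs)                           # summed over the batch

# Toy one-hot item embeddings so the behaviour is easy to reason about.
items = np.eye(4, 8)

uninformed_users = np.zeros((4, 8))  # scores every item equally
aligned_users = 3.0 * items          # each user points at its own item

loss_uninformed = in_batch_softmax_loss(uninformed_users, items)
loss_aligned = in_batch_softmax_loss(aligned_users, items)
print(loss_uninformed, loss_aligned)
```

The loss falls as each user embedding moves closer to its own target item than to the other items in the batch, which is the behaviour a decreasing training loss would indicate.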

In [20]:
# create a tiny train and test set
tf.random.set_seed(42)
shuffled = dataset.shuffle(100, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80)
test = shuffled.skip(80).take(20)

cached_train = train.batch(10).cache()
cached_test = test.batch(10).cache()

# create a dataset of recipes to pass as candidate recommendations
recipes_dataset = tf.data.Dataset.from_tensor_slices(df_recipes[["id", "name"] + feature_cols].to_dict("list"))

model = SimpleRetrievalModel(
    user_model=user_model,
    item_model=item_model,
    candidates=recipes_dataset.take(100)
)

model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
model.fit(cached_train, epochs=10)

results = model.evaluate(cached_test, return_dict=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


For simplicity, we use a brute-force nearest neighbour search to recommend the top N recipes. We could use `ScaNN` for more efficient serving...
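Conceptually, brute-force retrieval scores every candidate against the user embedding and keeps the best `k`. Here is a minimal NumPy sketch with made-up embeddings and names; the helper `brute_force_top_k` is hypothetical and not part of TFRS.

```python
import numpy as np

def brute_force_top_k(user_embedding, item_embeddings, item_names, k):
    """Score every candidate by dot product and return the k best."""
    scores = item_embeddings @ user_embedding
    top = np.argsort(-scores)[:k]
    return [(item_names[i], float(scores[i])) for i in top]

# Toy 2-d embeddings; the values are illustrative only.
item_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
item_names = ["apple tea", "americano", "lemon water", "dill dressing"]
user = np.array([1.0, 0.5])

top_2 = brute_force_top_k(user, item_embeddings, item_names, k=2)
print(top_2)
```

The cost is linear in the number of candidates per query, which is fine for a catalogue of this size but is the reason an approximate index becomes attractive at scale.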

In [17]:
# Create a brute-force index that takes in raw query features and
# recommends recipes out of the entire recipes dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
  tf.data.Dataset.zip((recipes_dataset.batch(20).map(lambda x: x["name"]), recipes_dataset.batch(20).map(model.item_model)))
)

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7ffa80d7af40>

In [18]:
# generating recommendations for a single test user
index(test.take(1).get_single_element())

(<tf.Tensor: shape=(1, 10), dtype=float32, numpy=
 array([[1.2600166, 1.2317915, 1.2290578, 1.2176449, 1.2156482, 1.2108135,
         1.2101247, 1.2046136, 1.1976283, 1.189898 ]], dtype=float32)>,
 <tf.Tensor: shape=(1, 10), dtype=string, numpy=
 array([[b'really easy vanilla sugar', b'lavender white tea',
         b'lemon verbena water', b'amarula coffee', b'apple tea',
         b'americano', b'butterscotch tea',
         b'vegan condensed milk substitute', b'white cactus',
         b'la bou creamy dill dressing']], dtype=object)>)

Next steps: train a bigger model on the full dataset on a big machine...enter AWS.