# Build a product recommender on Databricks: CF 01 - data preparation
> Preparing data for an end-to-end collaborative filtering-based recommender system

- toc: true
- badges: true
- comments: true
- categories: [retail, databricks, ETL]
- image:

The purpose of this notebook is to prepare the dataset we will use to explore collaborative filtering recommenders.  This notebook should be run on a **Databricks 7.1+ cluster**.

## Introduction 

Collaborative filters are an important enabler of modern recommendation experiences.  ***Customers like you also bought***-type recommendations provide us an important means of identifying products that are likely to be of interest based on the buying patterns of closely related customers:

<img src="https://brysmiwasb.blob.core.windows.net/demos/images/instacart_collabrecom.png" width="600">

In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import count, countDistinct, avg, log, lit, expr

import shutil

## Step 1: Load the Data

The basic building block of this kind of recommendation is customer transaction data. To provide us data of this type, we'll be using the popular [Instacart dataset](https://www.kaggle.com/c/instacart-market-basket-analysis). This dataset provides cart-level details on over 3 million grocery orders placed by over 200,000 Instacart users across a portfolio of nearly 50,000 products.

**NOTE** Due to the terms and conditions by which these data are made available, anyone interested in recreating this work will need to download the data files from Kaggle and upload them to a folder structure as described below.

The primary data files available for download are organized as follows under a pre-defined [mount point](https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs) that we have named */mnt/instacart*:

<img src='https://brysmiwasb.blob.core.windows.net/demos/images/instacart_filedownloads.png' width=250>



Read into dataframes, these files form the following data model which captures the products customers have included in individual transactions:

<img src='https://brysmiwasb.blob.core.windows.net/demos/images/instacart_schema2.png' width=300>

We will apply minimal transformations to this data, persisting it to the Delta Lake format for speedier access:

In [None]:
_ = spark.sql('CREATE DATABASE IF NOT EXISTS instacart')

**NOTE** The orders data set is pre-split into *prior* and *training* datasets.  Because date information in this dataset is very limited, we'll need to work with these pre-defined splits.  We'll treat the *prior* dataset as our ***calibration*** dataset and we'll treat the *training* dataset as our ***evaluation*** dataset. To minimize confusion, we'll rename these as part of our data preparation steps.
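
The renaming described in this note is a simple lookup. The plain-Python sketch below mirrors it on a few labels. The `rename_split` helper is hypothetical and not part of the notebook, and the `test` label is included on the assumption that the raw file may carry labels beyond *prior* and *train*. Any label outside the two we map becomes a null.

```python
# Hypothetical helper mirroring the eval_set renaming; not part of the notebook.
SPLIT_NAMES = {'prior': 'calibration', 'train': 'evaluation'}

def rename_split(eval_set):
    # Labels outside the mapping (or a missing label) yield None, like a SQL NULL.
    return SPLIT_NAMES.get(eval_set)

labels = ['prior', 'train', 'test']   # 'test' is an assumed extra label
renamed = [rename_split(label) for label in labels]
print(renamed)
```

If such unmapped labels exist in your copy of the data, it is worth deciding explicitly whether rows with a null *split* should be kept or filtered out.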

In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.orders')

# drop any old delta lake files that might have been created
shutil.rmtree('/dbfs/mnt/instacart/silver/orders', ignore_errors=True)

# define schema for incoming data
orders_schema = StructType([
  StructField('order_id', IntegerType()),
  StructField('user_id', IntegerType()),
  StructField('eval_set', StringType()),
  StructField('order_number', IntegerType()),
  StructField('order_dow', IntegerType()),
  StructField('order_hour_of_day', IntegerType()),
  StructField('days_since_prior_order', FloatType())
  ])

# read data from csv
orders = (
  spark
    .read
    .csv(
      '/mnt/instacart/bronze/orders',
      header=True,
      schema=orders_schema
      )
  )

# rename eval_set entries
orders_transformed = (
  orders
    .withColumn('split', expr("CASE eval_set WHEN 'prior' THEN 'calibration' WHEN 'train' THEN 'evaluation' ELSE NULL END"))
    .drop('eval_set')
  )

# write data to delta
(
  orders_transformed
    .write
    .format('delta')
    .mode('overwrite')
    .save('/mnt/instacart/silver/orders')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.orders
  USING DELTA
  LOCATION '/mnt/instacart/silver/orders'
  ''')

# present the data for review
display(
  spark.table('instacart.orders')
  )

order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,split
2539329,1,1,2,8,,calibration
2398795,1,2,3,7,15.0,calibration
473747,1,3,3,12,21.0,calibration
2254736,1,4,4,7,29.0,calibration
431534,1,5,4,15,28.0,calibration
3367565,1,6,2,7,19.0,calibration
550135,1,7,1,9,20.0,calibration
3108588,1,8,1,14,14.0,calibration
2295261,1,9,1,16,0.0,calibration
2550362,1,10,4,8,30.0,calibration


In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.products')

# drop any old delta lake files that might have been created
shutil.rmtree('/dbfs/mnt/instacart/silver/products', ignore_errors=True)

# define schema for incoming data
products_schema = StructType([
  StructField('product_id', IntegerType()),
  StructField('product_name', StringType()),
  StructField('aisle_id', IntegerType()),
  StructField('department_id', IntegerType())
  ])

# read data from csv
products = (
  spark
    .read
    .csv(
      '/mnt/instacart/bronze/products',
      header=True,
      schema=products_schema
      )
  )

# write data to delta
(
  products
    .write
    .format('delta')
    .mode('overwrite')
    .save('/mnt/instacart/silver/products')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.products
  USING DELTA
  LOCATION '/mnt/instacart/silver/products'
  ''')

# present the data for review
display(
  spark.table('instacart.products')
  )

product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
5,Green Chile Anytime Sauce,5,13
6,Dry Nose Oil,11,11
7,Pure Coconut Water With Orange,98,7
8,Cut Russet Potatoes Steam N' Mash,116,1
9,Light Strawberry Blueberry Yogurt,120,16
10,Sparkling Orange Juice & Prickly Pear Beverage,115,7


In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.order_products')

# drop any old delta lake files that might have been created
shutil.rmtree('/dbfs/mnt/instacart/silver/order_products', ignore_errors=True)

# define schema for incoming data
order_products_schema = StructType([
  StructField('order_id', IntegerType()),
  StructField('product_id', IntegerType()),
  StructField('add_to_cart_order', IntegerType()),
  StructField('reordered', IntegerType())
  ])

# read data from csv
order_products = (
  spark
    .read
    .csv(
      '/mnt/instacart/bronze/order_products',
      header=True,
      schema=order_products_schema
      )
  )

# write data to delta
(
  order_products
    .write
    .format('delta')
    .mode('overwrite')
    .save('/mnt/instacart/silver/order_products')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.order_products
  USING DELTA
  LOCATION '/mnt/instacart/silver/order_products'
  ''')

# present the data for review
display(
  spark.table('instacart.order_products')
  )

order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
2,30035,5,0
2,17794,6,1
2,40141,7,1
2,1819,8,1
2,43668,9,0
3,33754,1,1


In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.departments')

# drop any old delta lake files that might have been created
shutil.rmtree('/dbfs/mnt/instacart/silver/departments', ignore_errors=True)

# define schema for incoming data
departments_schema = StructType([
  StructField('department_id', IntegerType()),
  StructField('department', StringType())  
  ])

# read data from csv
departments = (
  spark
    .read
    .csv(
      '/mnt/instacart/bronze/departments',
      header=True,
      schema=departments_schema
      )
  )

# write data to delta
(
  departments
    .write
    .format('delta')
    .mode('overwrite')
    .save('/mnt/instacart/silver/departments')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.departments
  USING DELTA
  LOCATION '/mnt/instacart/silver/departments'
  ''')

# present the data for review
display(
  spark.table('instacart.departments')
  )

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk


In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.aisles')

# drop any old delta lake files that might have been created
shutil.rmtree('/dbfs/mnt/instacart/silver/aisles', ignore_errors=True)

# define schema for incoming data
aisles_schema = StructType([
  StructField('aisle_id', IntegerType()),
  StructField('aisle', StringType())  
  ])

# read data from csv
aisles = (
  spark
    .read
    .csv(
      '/mnt/instacart/bronze/aisles',
      header=True,
      schema=aisles_schema
      )
  )

# write data to delta
(
  aisles
    .write
    .format('delta')
    .mode('overwrite')
    .save('/mnt/instacart/silver/aisles')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.aisles
  USING DELTA
  LOCATION '/mnt/instacart/silver/aisles'
  ''')

# present the data for review
display(
  spark.table('instacart.aisles')
  )

aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
5,marinades meat preparation
6,other
7,packaged meat
8,bakery desserts
9,pasta sauce
10,kitchen supplies
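
The five load cells above share one pattern: drop the table, clear any old Delta files, read the CSV from *bronze*, write Delta to *silver*, and register the table. A minimal sketch of the naming conventions involved is shown below in plain Python. The `table_locations` helper is hypothetical; it only assembles strings and does not touch Spark or the file system. Note how the local-file call to `shutil.rmtree` addresses the mount under a `/dbfs` prefix, while Spark reads and writes the same location without it.

```python
# Hypothetical helper: assembles the paths and SQL used by each load cell.
MOUNT = '/mnt/instacart'

def table_locations(name, database='instacart'):
    silver = f'{MOUNT}/silver/{name}'
    return {
        'bronze_path': f'{MOUNT}/bronze/{name}',   # CSV source read by Spark
        'silver_path': silver,                     # Delta target written by Spark
        'local_path': f'/dbfs{silver}',            # same folder, as passed to shutil
        'drop_sql': f'DROP TABLE IF EXISTS {database}.{name}',
        'create_sql': f"CREATE TABLE {database}.{name} USING DELTA LOCATION '{silver}'",
    }

for table in ['orders', 'products', 'order_products', 'departments', 'aisles']:
    print(table_locations(table)['local_path'])
```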


## Step 2: Derive Product *Ratings*

For our collaborative filter (CF), we need a way to understand user preferences for individual products. In some scenarios, explicit user ratings, such as a 3 out of 5 stars rating, may be provided, but not every interaction receives a rating and in many transactional engagements the idea of asking customers for such ratings just seems out of place. In these scenarios, we might use other user-generated data to indicate product preferences. In the context of the Instacart dataset, the frequency of product purchases by a user may serve as such an indicator:

In [None]:
# drop any old delta lake files that might have been created
shutil.rmtree('/dbfs/mnt/instacart/gold/ratings__user_product_orders', ignore_errors=True)

# identify number of times product purchased by user
user_product_orders = (
  spark
    .table('instacart.orders')
    .join(spark.table('instacart.order_products'), on='order_id')
    .groupBy('user_id', 'product_id', 'split')
    .agg( count(lit(1)).alias('purchases') )
  )

# write data to delta
(
  user_product_orders
    .write
    .format('delta')
    .mode('overwrite')
    .save('/mnt/instacart/gold/ratings__user_product_orders')
  )

# display results
display(
  spark.sql('''
    SELECT * 
    FROM DELTA.`/mnt/instacart/gold/ratings__user_product_orders` 
    ORDER BY split, user_id, product_id
    ''')
)

user_id,product_id,split,purchases
1,196,calibration,10
1,10258,calibration,9
1,10326,calibration,1
1,12427,calibration,10
1,13032,calibration,3
1,13176,calibration,2
1,14084,calibration,1
1,17122,calibration,1
1,25133,calibration,8
1,26088,calibration,2
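
The aggregation above amounts to counting order lines per user, product, and split. A toy version in plain Python follows, with made-up rows standing in for the joined *orders* and *order_products* tables; the values are illustrative and not drawn from the dataset.

```python
from collections import Counter

# Made-up order lines as (user_id, product_id, split); illustrative only.
order_lines = [
    (1, 196, 'calibration'),
    (1, 196, 'calibration'),
    (1, 10258, 'calibration'),
    (1, 196, 'evaluation'),
    (2, 196, 'calibration'),
]

# Equivalent of grouping by user_id, product_id, and split, then counting rows.
purchases = Counter(order_lines)

for (user_id, product_id, split), n in sorted(purchases.items()):
    print(user_id, product_id, split, n)
```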


Using product purchases as *implied ratings* presents us with a scaling problem.  Consider a scenario where one user purchases a given product 10 times while another user purchases the same product 20 times.  Does the first user have a stronger preference for the product?  What if we knew the first customer had made 10 purchases in total, so that this product was included in each checkout event, while the second user had made 50 total purchases, only 20 of which included the product of interest?  Does our understanding of the users' preferences change in light of this additional information?

Rescaling our data to account for differences in overall purchase frequency will provide us a more reliable basis for the comparison of users. There are several options for doing this, but because of how we intend to measure the similarity between users (to provide the basis of collaborative filtering), our preference will be to use what is referred to as L2-normalization.

To understand L2-normalization, consider two users who have purchased products X and Y. The first user has purchased product X 10 times and product Y 5 times. The second user has purchased products X and Y 20 times each.  We might plot these purchases (with product X on the x-axis and product Y on the y-axis) as follows:

<img src='https://brysmiwasb.blob.core.windows.net/demos/images/lsh_norm01.png' width=380>

To determine similarity, we'll be measuring the (Euclidean) distance between the points formed at the intersection of these two axes, *i.e.* the peak of the two triangles in the graphic.  Without rescaling, the first user resides about 11 units from the origin and the second user resides about 28 units.  Calculating the distance between these two users in this space would provide a measure of both differing product preferences AND purchase frequencies. Rescaling the distance each user resides from the origin of the space eliminates the differences related to purchase frequencies, allowing us to focus on differences in product preferences:

<img src='https://brysmiwasb.blob.core.windows.net/demos/images/lsh_norm02.png' width=400>
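
The two-user example can be checked numerically. The sketch below uses only the Python standard library (`math.dist` requires Python 3.8 or later); the `l2_normalize` helper is a name introduced here for illustration.

```python
import math

def l2_normalize(vector):
    # Divide each component by the vector's Euclidean length (its L2-norm).
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

user_1 = [10, 5]    # purchases of products X and Y
user_2 = [20, 20]

# Distance of each user from the origin before rescaling
print(round(math.hypot(*user_1), 2), round(math.hypot(*user_2), 2))

# Distance between the two users before and after rescaling
raw_distance = math.dist(user_1, user_2)
scaled_distance = math.dist(l2_normalize(user_1), l2_normalize(user_2))
print(round(raw_distance, 2), round(scaled_distance, 2))
```

After rescaling, both users sit exactly one unit from the origin, so the distance that remains between them reflects only the mix of products they buy.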

The rescaling is achieved by calculating the Euclidean distance between each user and the origin (there's no need to limit ourselves to two dimensions for this math to work) and then dividing each product-specific value for that user by this distance, which is referred to as the L2-norm.  Here, we apply this L2-normalization to our implied ratings:

In [None]:
%sql
DROP VIEW IF EXISTS instacart.user_ratings;

CREATE VIEW instacart.user_ratings 
AS
  WITH ratings AS (
    SELECT
      split,
      user_id,
      product_id,
      SUM(purchases) as purchases
    FROM DELTA.`/mnt/instacart/gold/ratings__user_product_orders`
    GROUP BY split, user_id, product_id
    )
  SELECT
    a.split,
    a.user_id,
    a.product_id,
    a.purchases,
    a.purchases/b.l2_norm as normalized_purchases
  FROM ratings a
  INNER JOIN (
    SELECT
      split,
      user_id,
      POW( 
        SUM(POW(purchases,2)),
        0.5
        ) as l2_norm
    FROM ratings
    GROUP BY user_id, split
    ) b
    ON a.user_id=b.user_id AND a.split=b.split;
  
SELECT * FROM instacart.user_ratings ORDER BY user_id, split, product_id;

split,user_id,product_id,purchases,normalized_purchases
calibration,1,196,10,0.508328567775349
calibration,1,10258,9,0.457495710997814
calibration,1,10326,1,0.0508328567775348
calibration,1,12427,10,0.508328567775349
calibration,1,13032,3,0.1524985703326046
calibration,1,13176,2,0.1016657135550697
calibration,1,14084,1,0.0508328567775348
calibration,1,17122,1,0.0508328567775348
calibration,1,25133,8,0.4066628542202791
calibration,1,26088,2,0.1016657135550697


You may have noted that we elected to implement these calculations through a view.  Consider that the values for a user must be recalculated with each purchase event by that user, as that event changes the L2-norm by which each implied rating is adjusted. Persisting raw purchase counts in our base *ratings* table provides us an easy way to incrementally add new information to this table without having to re-traverse a user's entire purchase history.  Aggregating and normalizing the values in that table on the fly through a view gives us an easy way to extract normalized data with less ETL effort.
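
A rough illustration of this design in plain Python follows, with a dictionary standing in for the base *ratings* table and a function standing in for the view; both names are hypothetical. Recording a purchase touches a single counter, yet every normalized value for that user shifts because the L2-norm changes.

```python
import math
from collections import defaultdict

# Stand-in for the base table: raw purchase counts per (user_id, product_id).
purchase_counts = defaultdict(int)

def record_purchase(user_id, product_id):
    # Incremental update: one counter changes, no history is re-read.
    purchase_counts[(user_id, product_id)] += 1

def normalized_ratings(user_id):
    # Stand-in for the view: the L2-norm is recomputed on every read.
    counts = {p: n for (u, p), n in purchase_counts.items() if u == user_id}
    norm = math.sqrt(sum(n * n for n in counts.values()))
    return {p: n / norm for p, n in counts.items()}

for product_id in [196, 196, 196, 10258]:
    record_purchase(1, product_id)
before = normalized_ratings(1)

record_purchase(1, 10258)          # a single new purchase event
after = normalized_ratings(1)

print(before[196], after[196])     # the rating for product 196 moves too
```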

It's important to consider which data is included in these calculations. Depending on your scenario, it might be appropriate to limit the transaction history from which these *implied ratings* are derived to a period within which expressed preferences would be consistent with the user's preferences in the period over which the recommender might be used.  In some scenarios, this may mean limiting historical data to a month, quarter, year, etc.  In other scenarios, this may mean limiting historical data to periods with comparable seasonal components as the current or impending period.  For example, a user may have a strong preference for pumpkin spice flavored products in the Fall but may not be keen on them during the Summer months.  For demonstration purposes, we'll just use the whole transaction history as the basis of our ratings, but this is a point you'd want to carefully consider for a real-world implementation.
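
As a sketch of what limiting the history could look like, the plain-Python example below counts only purchases that fall inside a trailing window. Date information in the Instacart data is very limited, so the `order_date` values, the product names, and the `counts_within` helper are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Made-up purchases as (user_id, product, order_date); purely hypothetical.
purchases = [
    (1, 'pumpkin spice latte', date(2019, 10, 5)),
    (1, 'pumpkin spice latte', date(2019, 11, 2)),
    (1, 'lemonade', date(2020, 7, 14)),
    (1, 'lemonade', date(2020, 8, 1)),
]

def counts_within(purchases, as_of, days):
    # Count only purchases made in the `days` leading up to `as_of`.
    start = as_of - timedelta(days=days)
    return Counter(
        (user_id, product)
        for user_id, product, order_date in purchases
        if start < order_date <= as_of
    )

recent = counts_within(purchases, as_of=date(2020, 8, 31), days=90)
print(recent)
```

With a 90-day window ending in late summer, the autumn purchases drop out of the counts and therefore out of the implied ratings.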

## Step 3: Derive Naive Product *Ratings*

A common practice when evaluating a recommender is to compare it to a prior or alternative recommendation engine to see which better helps the organization achieve its goals. To provide us a starting point for such comparisons, we might consider using overall product popularity as the basis for making *naive* collaborative recommendations. Here, we calculate normalized product ratings based on overall purchase frequencies to enable this work:

In [None]:
%sql
DROP VIEW IF EXISTS instacart.naive_ratings;

CREATE VIEW instacart.naive_ratings 
AS
  WITH ratings AS (
    SELECT
      split,
      product_id,
      SUM(purchases) as purchases
    FROM DELTA.`/mnt/instacart/gold/ratings__user_product_orders`
    GROUP BY split, product_id
    )
  SELECT
    a.split,
    a.product_id,
    a.purchases,
    a.purchases/b.l2_norm as normalized_purchases
  FROM ratings a
  INNER JOIN (
    SELECT
      split,
      POW( 
        SUM(POW(purchases,2)),
        0.5
        ) as l2_norm
    FROM ratings
    GROUP BY split
    ) b
    ON a.split=b.split;
  
SELECT * FROM instacart.naive_ratings ORDER BY split, product_id;

split,product_id,purchases,normalized_purchases
calibration,1,1852,0.0017180921708043
calibration,2,90,8.349260009308546e-05
calibration,3,277,0.00025697166917538523
calibration,4,329,0.00030521183811805684
calibration,5,15,1.3915433348847576e-05
calibration,6,8,7.421564452718707e-06
calibration,7,30,2.7830866697695152e-05
calibration,8,165,0.00015306976683732333
calibration,9,156,0.0001447205068280148
calibration,10,2572,0.002386032971549
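
To make the popularity baseline concrete, the plain-Python sketch below normalizes made-up purchase totals for three products and ranks them. The product labels and counts are illustrative only and are not taken from the dataset.

```python
import math

# Made-up total purchases per product within one split; illustrative only.
product_purchases = {'A': 30, 'B': 40, 'C': 120}

# L2-norm taken across all products, mirroring the view's per-split grouping.
l2_norm = math.sqrt(sum(n * n for n in product_purchases.values()))

naive_ratings = {p: n / l2_norm for p, n in product_purchases.items()}

# Every user receives the same ranking: most popular products first.
ranking = sorted(naive_ratings, key=naive_ratings.get, reverse=True)
print(ranking)
```

Because the same norm divides every product's count, normalization leaves the popularity ordering unchanged; it only puts the values on a scale comparable to the per-user ratings.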
