# Market Basket Analysis with Spark FPGrowth (Data mining)
#### Dataset download > 
* #### [Instacart](https://www.kaggle.com/c/instacart-market-basket-analysis)

#### Library used >
* #### [FPGrowth](https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#fp-growth)

Mining frequent items, itemsets, subsequences, or other substructures is usually among the first steps in analyzing a large-scale dataset, and it has been an active research topic in data mining for years. See Wikipedia’s article on association rule learning for more information. spark.mllib provides a parallel implementation of FP-growth, a popular algorithm for mining frequent itemsets.

Market basket analysis can give a retailer insight into the purchase behavior of its buyers. This information helps the retailer understand buyers' needs and rearrange the store's layout accordingly, develop cross-promotional programs, or even attract new buyers (much like the cross-selling concept). An apocryphal early example tells of a supermarket chain that discovered in its analysis that male customers who bought diapers often bought beer as well, moved the diapers close to the beer coolers, and saw sales increase dramatically. Although this urban legend is only an example that professors use to illustrate the concept to students, the explanation of this imaginary phenomenon might be that fathers sent out to buy diapers often buy a beer as well, as a reward. This kind of analysis is often cited as an example of the use of data mining. A widely used example of cross-selling on the web with market basket analysis is Amazon.com's "customers who bought book A also bought book B", e.g. "People who read History of Portugal were also interested in Naval History".
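
To make the "bought A also bought B" idea concrete, the strength of such a rule is usually measured with support, confidence, and lift. The following is a minimal sketch in plain Python; the baskets and the helper names (`support`, `confidence`, `lift`) are invented for illustration and are not part of the Instacart data or the Spark API.

```python
from typing import FrozenSet, Iterable, List

# Toy baskets, invented for illustration (not taken from the Instacart data).
baskets: List[FrozenSet[str]] = [
    frozenset({"diapers", "beer", "chips"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "milk"}),
    frozenset({"beer", "chips"}),
    frozenset({"milk", "bread"}),
]


def support(itemset: Iterable[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of baskets that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               baskets: List[FrozenSet[str]]) -> float:
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent: Iterable[str], consequent: Iterable[str],
         baskets: List[FrozenSet[str]]) -> float:
    """Confidence of the rule relative to the baseline popularity of the consequent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


# Rule under test: {diapers} -> {beer}
print(round(support({"diapers", "beer"}, baskets), 3))       # 0.4
print(round(confidence({"diapers"}, {"beer"}, baskets), 3))  # 0.667
print(round(lift({"diapers"}, {"beer"}, baskets), 3))        # 1.111
```

A lift above 1 suggests that the two items appear together more often than they would if they were purchased independently.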

This is the first in a series of two notebooks. Its purpose is to prepare the dataset.

# Data engineering pipelines
Data engineering pipelines are commonly composed of these components:
![Data engineering pipeline](https://s3.us-east-2.amazonaws.com/databricks-dennylee/media/data-engineering-pipeline-3.png)

- Ingest Data: Bringing in the data from your source systems; often involving ETL processes (though we will skip this step in this demo for brevity)
- Explore Data: Now that you have cleansed data, explore it so you can get some business insight
- Train ML Model: Execute FP-growth for frequent pattern mining
- Review Association Rules: Review the generated association rules

In [1]:
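# schema types used to describe the incoming CSV files
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
# column functions (sum, coalesce, col, lit)
import pyspark.sql.functions as f
# window module, used below as w.Window
from pyspark.sql import window as w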


In [None]:
# Reading csv files in a dataframe
file_path = "abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/bronze/aisles/aisles.csv"
df = spark.read.csv(file_path, header=True, inferSchema=True)
display(df.limit(10))


# Ingest Data

The basic building block of this analysis is transactional data containing a customer identifier. The popular [Instacart dataset](https://www.kaggle.com/c/instacart-market-basket-analysis) provides a nice collection of such data, with over 3 million grocery orders placed by over 200,000 Instacart users over a nearly 2-year period across a portfolio of nearly 50,000 products.

**NOTE** Due to the terms and conditions by which these data are made available, anyone interested in recreating this work will need to download the data files from Kaggle and upload them to a folder structure as described below.

The primary data files available for download are organized as follows under a pre-defined [mount point](https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs) that we have named */mnt/instacart*:

<img src='https://brysmiwasb.blob.core.windows.net/demos/images/instacart_filedownloads.png' width=250>



Read into dataframes, these files form the following data model which captures the products customers have included in individual transactions:

<img src='https://brysmiwasb.blob.core.windows.net/demos/images/instacart_schema2.png' width=300>

We will apply minimal transformations to this data, persisting it to the Delta Lake format for speedier access:

In [None]:
_ = spark.sql('CREATE DATABASE IF NOT EXISTS instacart')


The orders data is pre-divided into *prior* and *training* evaluation sets, where the *training* dataset represents the last order placed in the overall sequence of orders associated with a given customer.  The *prior* dataset represents those orders that precede the *training* order.  In a previous set of notebooks built on this data, we relabeled the *prior* and *training* evaluation sets as *calibration* and *evaluation*, respectively, to better align terminology with how the data was being used.  Here, we will preserve the *prior* and *training* designations, as these better align with our current modeling needs.

We will add to this dataset a field, *days_prior_to_last_order*, which calculates the number of days from a given order to the order that represents the *training* instance. This field will help us when developing features around purchases taking place at different intervals prior to the final order.  All other tables will be brought into the database without schema changes, simply converting the underlying format from CSV to Delta Lake for better query performance later:

In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.orders')

# define schema for incoming data
orders_schema = StructType([
  StructField('order_id', IntegerType()),
  StructField('user_id', IntegerType()),
  StructField('eval_set', StringType()),
  StructField('order_number', IntegerType()),
  StructField('order_dow', IntegerType()),
  StructField('order_hour_of_day', IntegerType()),
  StructField('days_since_prior_order', FloatType())
  ])

# read data from csv
orders = (
  spark
    .read
    .csv(
      'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/bronze/orders',
      header=True,
      schema=orders_schema
      )
  )

# calculate days until final purchase 
win = (
  w.Window.partitionBy('user_id').orderBy(f.col('order_number').desc())
  )

orders_enhanced = (
    orders
      .withColumn(
        'days_prior_to_last_order', 
        f.sum('days_since_prior_order').over(win) - f.coalesce(f.col('days_since_prior_order'),f.lit(0))
        ) 
  )

# write data to delta
(
  orders_enhanced
    .write
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema','true')
    .save('abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/orders')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.orders
  USING DELTA
  LOCATION 'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/orders'
  ''')

# present the data for review
display(
  spark
    .table('instacart.orders')
    .orderBy('user_id','order_number')
    .limit(10)
  )
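
The window calculation for *days_prior_to_last_order* in the cell above can be checked by hand on a single customer. The following is a minimal sketch in plain Python of the same arithmetic, assuming the gaps are listed in ascending *order_number* and that the first order has no prior gap (`None`); the function name and the sample values are invented for illustration.

```python
from typing import List, Optional


def days_prior_to_last_order(days_since_prior: List[Optional[float]]) -> List[float]:
    """For each order, return the number of days between it and the customer's last order.

    `days_since_prior` holds one entry per order, in ascending order_number;
    the first entry is None because the first order has no prior order.
    """
    result: List[float] = []
    running = 0.0
    # Walk from the last order back to the first, mirroring the descending window.
    for gap in reversed(days_since_prior):
        result.append(running)   # days elapsed between this order and the last one
        running += gap or 0.0    # a missing gap (first order) contributes nothing
    result.reverse()
    return result


# One hypothetical customer with four orders
print(days_prior_to_last_order([None, 7.0, 14.0, 3.0]))  # [24.0, 17.0, 3.0, 0.0]
```

The last order is always 0 days from itself, and each earlier order accumulates the gaps of all orders that follow it, which is what the running sum minus the order's own gap produces in the Spark expression.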


In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.products')

# define schema for incoming data
products_schema = StructType([
  StructField('product_id', IntegerType()),
  StructField('product_name', StringType()),
  StructField('aisle_id', IntegerType()),
  StructField('department_id', IntegerType())
  ])

# read data from csv
products = (
  spark
    .read
    .csv(
     'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/bronze/products',
      header=True,
      schema=products_schema
      )
  )

# write data to delta
(
  products
    .write
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema','true')
    .save('abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/products')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.products
  USING DELTA
  LOCATION 'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/products'
  ''')

# present the data for review
display(
  spark.table('instacart.products').limit(10)
  )


In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.order_products')

# define schema for incoming data
order_products_schema = StructType([
  StructField('order_id', IntegerType()),
  StructField('product_id', IntegerType()),
  StructField('add_to_cart_order', IntegerType()),
  StructField('reordered', IntegerType())
  ])

# read data from csv
order_products = (
  spark
    .read
    .csv(
      'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/bronze/order_products',
      header=True,
      schema=order_products_schema
      )
  )

# write data to delta
(
  order_products
    .write
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema','true')
    .save('abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/order_products')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.order_products
  USING DELTA
  LOCATION 'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/order_products'
  ''')

# present the data for review
display(
  spark.table('instacart.order_products').limit(10)
  )


In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.departments')

# define schema for incoming data
departments_schema = StructType([
  StructField('department_id', IntegerType()),
  StructField('department', StringType())  
  ])

# read data from csv
departments = (
  spark
    .read
    .csv(
      'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/bronze/departments',
      header=True,
      schema=departments_schema
      )
  )

# write data to delta
(
  departments
    .write
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema','true')
    .save('abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/departments')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.departments
  USING DELTA
  LOCATION 'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/departments'
  ''')

# present the data for review
display(
  spark.table('instacart.departments').limit(10)
  )


In [None]:
# delete the old table if needed
_ = spark.sql('DROP TABLE IF EXISTS instacart.aisles')

# define schema for incoming data
aisles_schema = StructType([
  StructField('aisle_id', IntegerType()),
  StructField('aisle', StringType())  
  ])

# read data from csv
aisles = (
  spark
    .read
    .csv(
      'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/bronze/aisles',
      header=True,
      schema=aisles_schema
      )
  )

# write data to delta
(
  aisles
    .write
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema','true')
    .save('abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/aisles')
  )

# make accessible as spark sql table
_ = spark.sql('''
  CREATE TABLE instacart.aisles
  USING DELTA
  LOCATION 'abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart/silver/aisles'
  ''')

# present the data for review
display(
  spark.table('instacart.aisles').limit(10)
  )
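
The load cells above repeat the same read, write, and register steps, differing only in table name and schema. As a rough sketch, the path and DDL strings could be derived from the table name alone. The helper names `table_paths` and `register_statements` are hypothetical and not part of the original notebook; the base URI is copied from the cells above.

```python
from typing import Dict, List

# Base URI copied from the cells above; adjust it to your own storage account.
BASE_URI = "abfss://recommender@salabcommercedatalake.dfs.core.windows.net/instacart"


def table_paths(table: str, base_uri: str = BASE_URI) -> Dict[str, str]:
    """Return the bronze (CSV source) and silver (Delta target) locations for a table."""
    return {
        "bronze": f"{base_uri}/bronze/{table}",
        "silver": f"{base_uri}/silver/{table}",
    }


def register_statements(table: str, database: str = "instacart",
                        base_uri: str = BASE_URI) -> List[str]:
    """Return the DROP and CREATE statements used to (re)register a Delta table."""
    location = table_paths(table, base_uri)["silver"]
    return [
        f"DROP TABLE IF EXISTS {database}.{table}",
        f"CREATE TABLE {database}.{table} USING DELTA LOCATION '{location}'",
    ]


for statement in register_statements("aisles"):
    print(statement)
```

Each statement could then be passed to `spark.sql`, with the `DROP` issued before the Delta write and the `CREATE` after it, as in the cells above.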


# Combine Order Details

With our data loaded, we will flatten our order details through a view.  This will make access to our data during feature engineering significantly easier:

In [None]:
%%sql
DROP VIEW IF EXISTS instacart.order_details;

CREATE VIEW instacart.order_details as
  SELECT
    a.eval_set,
    a.user_id,
    a.order_number,
    a.order_id,
    a.order_dow,
    a.order_hour_of_day,
    a.days_since_prior_order,
    a.days_prior_to_last_order,
    b.product_id,
    c.aisle_id,
    c.department_id,
    b.reordered
  FROM instacart.orders a
  INNER JOIN instacart.order_products b
    ON a.order_id=b.order_id
  INNER JOIN instacart.products c
    ON b.product_id=c.product_id;
    
SELECT *
FROM instacart.order_details;
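
FP-growth works on transactions, each a collection of distinct items, so the flattened rows of *order_details* eventually need to be grouped back into one basket per *order_id*. The following is a minimal sketch of that grouping in plain Python; the rows and the `to_baskets` helper are invented for illustration. In Spark, the same grouping could presumably be expressed with `groupBy('order_id')` and `f.collect_set('product_id')`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical (order_id, product_id) rows, as they would come out of the flattened view.
rows = [(1, 101), (1, 102), (2, 101), (1, 101), (2, 103)]


def to_baskets(rows: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Group product ids by order id, dropping duplicates within an order."""
    baskets: Dict[int, Set[int]] = defaultdict(set)
    for order_id, product_id in rows:
        baskets[order_id].add(product_id)
    return dict(baskets)


for order_id, products in sorted(to_baskets(rows).items()):
    print(order_id, sorted(products))
```

Using a set per order removes repeated products, which matters because Spark's FP-growth implementation requires the items within a transaction to be unique.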
