# Market Basket Analysis using PySpark's Implementation of FPGrowth

FPGrowth is an algorithm for market basket analysis that serves the same purpose as the Apriori algorithm. I first used it when I ran into resource issues with Apriori, and I was impressed with its speed, so I am giving it a try on this dataset using PySpark. The [documentation for FPGrowth](https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html) is straightforward and describes the hyperparameters and the results.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
!pip install pyspark
!pip install pyspark_dist_explore # Used for a histogram

In [None]:
from pyspark import SparkContext
# Note to self: importing only the specific functions I need would be cleaner than importing the whole module as f.
from pyspark.sql import functions as f, SparkSession, Column
from pyspark_dist_explore import hist
import matplotlib.pyplot as plt
from pyspark.ml.fpm import FPGrowth



In [None]:
# Create a spark session. All sorts of settings can be specified here. 
spark = SparkSession.builder \
    .appName("arlUsingPyspark") \
    .getOrCreate()

# Read in the data 

I didn't end up using the customer ID number, but it is important to know that PySpark dataframes do not guarantee row order once they are sliced and diced. Matching the two CSV files in this dataset relies on their row order being preserved, because the basket file does not contain the customer ID number. Since I didn't need the customer ID number, I assigned a monotonically increasing ID number to each row as each file is read in.

This monotonically increasing ID number is unique and increasing but not consecutive, so it cannot be used directly to match the rows of the two dataframes. If I had needed the customer ID number to be associated with the basket, I would have had to use a window function ordered by the ID number to create a consecutive index, and then match on that index.
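To see why the generated IDs are increasing but not consecutive, here is a minimal plain-Python sketch of the layout that the Spark documentation describes for `monotonically_increasing_id()`: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The helper name `sketch_id` and the partition sizes are made up for illustration; this is not Spark's own code.

```python
# Illustration only: mimics the documented bit layout of
# monotonically_increasing_id(); it is not Spark's actual implementation.

def sketch_id(partition_id: int, row_in_partition: int) -> int:
    """Partition ID in the upper bits, row number in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition

# Hypothetical dataframe split into two partitions of three rows each.
ids = [sketch_id(p, r) for p in range(2) for r in range(3)]
print(ids)
```

In this sketch the IDs run 0, 1, 2 and then jump to 8589934592 at the partition boundary, which is why a row-by-row match needs a consecutive index built on top of the ID.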

In [None]:
df = spark.read.csv("/kaggle/input/groceries-dataset-for-market-basket-analysismba/basket.csv", header=True).withColumn("id", f.monotonically_increasing_id())
df_all = spark.read.csv("/kaggle/input/groceries-dataset-for-market-basket-analysismba/Groceries data.csv", header=True).withColumn("id", f.monotonically_increasing_id())

In [None]:
# show() is PySpark's rough equivalent of head(). It can be slow, so I try to skip it where I can.
df.show(5)
df_all.show(5)

In [None]:
# printSchema() shows the structure of the dataframe. This is important for debugging.
df.printSchema()
df_all.printSchema()

# How many baskets are there per customer? 

I wanted to look at the number of baskets each customer had in the dataset. 

In [None]:
num_baskets = df_all.groupBy("Member_number").count()
num_baskets.show(5)

# The distribution of the number of baskets

Create a histogram of the number of baskets using pyspark_dist_explore. This library creates fast visualizations directly from a PySpark dataframe and draws them on matplotlib axes.

In [None]:
fig, ax = plt.subplots()

hist(ax, num_baskets.select('count'), bins = 30, color=['blue'])

# Run PySpark's implementation of FPGrowth

The first step is to collect each basket into an array. FPGrowth requires each basket to be an array that looks like:

* ['item1', 'item2', 'item3']

The basket dataframe uses wide rather than long format: each item sits in its own column, and the unused columns are Null when a basket contains fewer items than there are item columns.

In [None]:
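# Use every column except the generated "id" as an item column, rather than hard-coding the column count.
item_cols = [c for c in df.columns if c != "id"]
df_basket = df.select("id", f.array([df[c] for c in item_cols]).alias("basket"))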
df_basket.printSchema()
df_basket.show(3, False) # False tells show to not truncate the columns when printing.

### There should not be any nulls in the array. Remove using array_except()

This will be the final dataframe used for FPGrowth. 

In [None]:
df_aggregated = df_basket.select("id", f.array_except("basket", f.array(f.lit(None))).alias("basket"))
df_aggregated.show(3, False)
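As a plain-Python sketch of what the two steps above do to a single row (not Spark code; the item names and the helper `row_to_basket` are invented for illustration): the wide row becomes a list and the nulls are dropped. According to the Spark documentation, `array_except()` also returns its result without duplicates, which is convenient because FPGrowth expects each item to appear at most once per basket. Keeping first-seen order is an assumption of this sketch.

```python
def row_to_basket(row):
    """Drop missing values and duplicates, keeping first-seen order."""
    return list(dict.fromkeys(item for item in row if item is not None))

# Hypothetical wide-format rows; None stands in for Spark's null.
wide_row = ["whole milk", "pastry", "salty snack", None, None]
print(row_to_basket(wide_row))
print(row_to_basket(["soda", "soda", None]))
```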

## Hyperparameters

The hyperparameters used in FPGrowth are minimum support, minimum confidence, and number of partitions. 

* minSupport - The minimum support for an itemset to be identified as frequent. Support is the fraction of baskets that contain the itemset.
* minConfidence - The minimum confidence for generating an association rule from an itemset. Confidence is how often the rule holds among the baskets that contain its antecedent.
* numPartitions - The number of partitions used to distribute the work. This is Spark-specific.

By default numPartitions is not set, and the number of partitions of the input dataset is used.
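To make support and confidence concrete, here is a small plain-Python sketch that computes both on a toy set of baskets. The baskets and the helper names `support` and `confidence` are invented for illustration; FPGrowth itself computes these values far more efficiently.

```python
# Toy transactions, invented for illustration (not from the groceries data).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

print(support({"milk"}, baskets))                # milk is in 3 of 5 baskets
print(support({"milk", "bread"}, baskets))       # milk and bread together: 2 of 5
print(confidence({"milk"}, {"bread"}, baskets))  # 2 of the 3 milk baskets also have bread
```

Read this way, a minSupport of 0.001 keeps any itemset that appears in at least 0.1% of the baskets.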

In [None]:
# Run FPGrowth and fit the model.
fp = FPGrowth(minSupport=0.001, minConfidence=0.001, itemsCol='basket', predictionCol='prediction')
model = fp.fit(df_aggregated)

In [None]:
# View a subset of the frequent itemset. 
model.freqItemsets.show(10, False)

In [None]:
# Use filter to view only the association rules with confidence above 0.15.
model.associationRules.filter(model.associationRules.confidence>0.15).show(20, False)

## Let's create a prediction based on the generated association rules

This is pretty similar to creating a prediction using other methods. The new data must store its baskets in a column with the same name as the `itemsCol` specified when the model was fit (`basket` here).

In [None]:
# Create a PySpark dataframe with one array column named 'basket'
columns = ['basket']
new_data = [(['ham', 'yogurt', 'light bulbs'],), (['jam', 'cocoa drinks', 'pet care'],)]
new_df = spark.createDataFrame(new_data, columns)
new_df.printSchema()
new_df.show(2,False)

# Predict!

Now that we have a new PySpark dataframe with data, we can predict. The first basket generates numerous predictions based on the association rules; the second basket does not generate any.

In [None]:
model.transform(new_df).show(5, False)
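The Spark documentation describes `transform` as checking each basket against the antecedent of every association rule: a rule applies when the basket contains all of its antecedent items, and the consequents of the applicable rules are collected as the prediction. A minimal plain-Python sketch of that logic follows. The rules here are hypothetical, not the ones learned by the model above, and the `predict` helper is invented for illustration.

```python
# Hypothetical rules as (antecedent, consequent) pairs; not the fitted model's rules.
rules = [
    ({"ham"}, {"whole milk"}),
    ({"yogurt"}, {"whole milk"}),
    ({"yogurt", "sausage"}, {"rolls/buns"}),
]

def predict(basket, rules):
    """Collect consequents of rules whose antecedent is fully contained in the basket."""
    basket = set(basket)
    predicted = []
    for antecedent, consequent in rules:
        if antecedent <= basket:  # the rule applies to this basket
            for item in consequent:
                # Skip items already in the basket or already predicted.
                if item not in basket and item not in predicted:
                    predicted.append(item)
    return predicted

print(predict(["ham", "yogurt", "light bulbs"], rules))
print(predict(["jam", "cocoa drinks", "pet care"], rules))
```

Under these made-up rules the first basket matches two rules and the second matches none, so it gets an empty prediction. A basket whose items never appear together as the antecedent of any generated rule behaves the same way.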