**Market basket analysis is a simple but important technique commonly used by retailers to provide product recommendations. It uses transactional datasets to determine which products are frequently purchased together. Retailers can use the recommendations to inform personalized cross-selling and upselling, helping to increase conversion and maximize the value of each customer.**

**You most likely have already seen market basket analysis in action while browsing through Amazon.com. An Amazon.com product page will usually have a section called “Customers who bought this item also bought,” presenting you with a list of items that are frequently bought together with the product you are currently browsing. That list is generated via market basket analysis. Market basket analysis is also used by brick-and-mortar retailers for store optimization by informing product placements and adjacencies in planograms. The idea is to drive more sales by placing complementary items next to each other.**

**Market basket analysis uses association rule learning to make recommendations. Association rules look for relationships between items using large transactional datasets [xiv]. Association rules are calculated from sets of two or more items, called itemsets. An association rule consists of an antecedent (if) and a consequent (then). For example, if someone buys cookies (antecedent), then the person is also more likely to buy milk (consequent). Popular association rule algorithms include Apriori, SETM, ECLAT, and FP-Growth. Spark MLlib includes a highly scalable implementation of FP-Growth for association rule mining [xv]. FP-Growth identifies frequent items and calculates item frequencies using a frequent pattern tree structure (“FP” stands for frequent pattern) [xvi].**
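
To make the antecedent/consequent idea concrete, here is a minimal plain-Python sketch that computes support and confidence for the cookies and milk rule. The transactions are made up for illustration and the helper names `support` and `confidence` are hypothetical; none of this is part of the Spark API.

```python
# Toy transactions (hypothetical data, not taken from the Instacart dataset).
transactions = [
    {"cookies", "milk"},
    {"cookies", "milk", "bread"},
    {"cookies", "juice"},
    {"bread", "milk"},
    {"cookies", "milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Cookies appear in 4 of the 5 baskets.
cookie_support = support({"cookies"}, transactions)
# Of those 4 baskets, 3 also contain milk.
rule_confidence = confidence({"cookies"}, {"milk"}, transactions)

print(f"support(cookies) = {cookie_support:.2f}")
print(f"confidence(cookies -> milk) = {rule_confidence:.2f}")
```

A rule is usually only interesting when both numbers are reasonably high: support says how often the antecedent occurs at all, and confidence says how often the consequent accompanies it.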

We will use the popular Instacart Online Grocery Shopping Dataset for our market basket analysis example using FP-Growth [xviii]. The dataset contains 3.4 million grocery orders for 50,000 products from 200,000 Instacart customers. You can download the dataset from www.instacart.com/datasets/grocery-shopping-2017. For FP-Growth, we only need the products and order_products__train tables.

##### uncomment and run as code cell to download the dataset
! wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [1]:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext("local", "Market Basket Analysis")
spark = SparkSession(sc)

In [2]:
! ls ../datasets/instacart_2017_05_01/

aisles.csv	 order_products__prior.csv  orders.csv
departments.csv  order_products__train.csv  products.csv


In [3]:
df_products = spark.read.csv("../datasets/instacart_2017_05_01/products.csv", header=True, inferSchema=True)

In [4]:
df_products.show()

+----------+--------------------+--------+-------------+
|product_id|        product_name|aisle_id|department_id|
+----------+--------------------+--------+-------------+
|         1|Chocolate Sandwic...|      61|           19|
|         2|    All-Seasons Salt|     104|           13|
|         3|Robust Golden Uns...|      94|            7|
|         4|Smart Ones Classi...|      38|            1|
|         5|Green Chile Anyti...|       5|           13|
|         6|        Dry Nose Oil|      11|           11|
|         7|Pure Coconut Wate...|      98|            7|
|         8|Cut Russet Potato...|     116|            1|
|         9|Light Strawberry ...|     120|           16|
|        10|Sparkling Orange ...|     115|            7|
|        11|   Peach Mango Juice|      31|            7|
|        12|Chocolate Fudge L...|     119|            1|
|        13|   Saline Nasal Mist|      11|           11|
|        14|Fresh Scent Dishw...|      74|           17|
|        15|Overnight Diapers..

In [5]:
df_order_products = spark.read.csv("../datasets/instacart_2017_05_01/order_products__train.csv", header=True, inferSchema=True)

In [6]:
df_order_products.show()

+--------+----------+-----------------+---------+
|order_id|product_id|add_to_cart_order|reordered|
+--------+----------+-----------------+---------+
|       1|     49302|                1|        1|
|       1|     11109|                2|        1|
|       1|     10246|                3|        0|
|       1|     49683|                4|        0|
|       1|     43633|                5|        1|
|       1|     13176|                6|        0|
|       1|     47209|                7|        0|
|       1|     22035|                8|        1|
|      36|     39612|                1|        0|
|      36|     19660|                2|        1|
|      36|     49235|                3|        0|
|      36|     43086|                4|        1|
|      36|     46620|                5|        1|
|      36|     34497|                6|        1|
|      36|     48679|                7|        1|
|      36|     46979|                8|        1|
|      38|     11913|                1|        0|


### create temporary tables

In [7]:
df_order_products.createOrReplaceTempView("order_products_train")
df_products.createOrReplaceTempView("products")

Note: the temporary views only make the DataFrames queryable through Spark SQL; the underlying DataFrames remain unchanged.

##### do joins on dataframe 

In [8]:
joined_data = spark.sql("select p.product_name, o.order_id \
from order_products_train o inner join products p \
on p.product_id = o.product_id")

In [9]:
from pyspark.sql.functions import max as spark_max
from pyspark.sql.functions import collect_set

In [10]:
df_baskets = joined_data.groupBy('order_id').agg(collect_set('product_name').alias('items'))

In [11]:
df_baskets.createOrReplaceTempView("baskets")

In [12]:
df_baskets.show(20,25)

+--------+-------------------------+
|order_id|                    items|
+--------+-------------------------+
|    1342|[Raw Shrimp, Seedless ...|
|    1591|[Cracked Wheat, Strawb...|
|    4519|[Beet Apple Carrot Lem...|
|    4935|                  [Vodka]|
|    6357|[Globe Eggplant, Panko...|
|   10362|[Organic Baby Spinach,...|
|   19204|[Reduced Fat Crackers,...|
|   29601|[Organic Red Onion, Sm...|
|   31035|[Organic Cripps Pink A...|
|   40011|[Organic Baby Spinach,...|
|   46266|[Uncured Beef Hot Dog,...|
|   51607|[Donut House Chocolate...|
|   58797|[Concentrated Butcher'...|
|   61793|[Raspberries, Green Se...|
|   67089|[Original Tofurky Deli...|
|   70863|[Extra Hold Non-Aeroso...|
|   88674|[Organic Coconut Milk,...|
|   91937|[No. 485 Gin, Monterey...|
|   92317|[Red Vine Tomato, Harv...|
|   99621|[Organic Baby Arugula,...|
+--------+-------------------------+
only showing top 20 rows
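
The `groupBy` plus `collect_set` step above turns one row per purchased product into one basket per order. A minimal plain-Python sketch of the same idea follows; the rows are hypothetical, and a `defaultdict` of sets stands in for the Spark aggregation.

```python
from collections import defaultdict

# Hypothetical (order_id, product_name) rows, shaped like the joined data above.
rows = [
    (1, "Banana"),
    (1, "Large Lemon"),
    (1, "Banana"),  # duplicate row: a set keeps a single copy
    (36, "Limes"),
    (36, "Large Lemon"),
    (38, "Vodka"),
]

# Group product names by order, dropping duplicates within each order.
baskets = defaultdict(set)
for order_id, product_name in rows:
    baskets[order_id].add(product_name)

for order_id, items in sorted(baskets.items()):
    print(order_id, sorted(items))
```

Collecting into a set rather than a list matters here, because FP-Growth in Spark expects the items within each transaction to be unique.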



In [13]:
from pyspark.ml.fpm import FPGrowth

FPGrowth expects a column in which each row is an array of unique items (one array per transaction), so we keep only the items column.

In [14]:
df_baskets = spark.sql("SELECT items from baskets").toDF("items")

In [15]:
df_baskets.show()

+--------------------+
|               items|
+--------------------+
|[Raw Shrimp, Seed...|
|[Cracked Wheat, S...|
|[Beet Apple Carro...|
|             [Vodka]|
|[Globe Eggplant, ...|
|[Organic Baby Spi...|
|[Reduced Fat Crac...|
|[Organic Red Onio...|
|[Organic Cripps P...|
|[Organic Baby Spi...|
|[Uncured Beef Hot...|
|[Donut House Choc...|
|[Concentrated But...|
|[Raspberries, Gre...|
|[Original Tofurky...|
|[Extra Hold Non-A...|
|[Organic Coconut ...|
|[No. 485 Gin, Mon...|
|[Red Vine Tomato,...|
|[Organic Baby Aru...|
+--------------------+
only showing top 20 rows



#### train FPGrowth 

In [16]:
fpgrowth = FPGrowth(
            itemsCol="items",
            minSupport=0.002,
            minConfidence=0
)
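
The `minSupport` value is a fraction of all baskets: with `minSupport=0.002`, an itemset is kept only if it appears in at least 0.2% of the orders. The sketch below illustrates that threshold on a few hypothetical baskets. It enumerates every subset by brute force, which is not how FP-Growth works internally and is only practical for tiny data; the variable names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; the real data has one basket per Instacart order.
baskets = [
    {"Banana", "Large Lemon", "Limes"},
    {"Banana", "Large Lemon"},
    {"Banana", "Strawberries"},
    {"Large Lemon", "Limes"},
    {"Vodka"},
]
min_support = 0.4  # an itemset must appear in at least 40% of the baskets

# Count every non-empty subset of every basket (brute force, tiny data only).
counts = Counter()
for basket in baskets:
    for size in range(1, len(basket) + 1):
        for subset in combinations(sorted(basket), size):
            counts[subset] += 1

# Convert the support fraction into a minimum number of baskets.
min_count = min_support * len(baskets)
frequent = {items: n for items, n in counts.items() if n >= min_count}

# Show itemsets with two or more items, most frequent first.
for items, n in sorted(frequent.items(), key=lambda kv: -kv[1]):
    if len(items) >= 2:
        print(list(items), n)
```

Lowering the threshold admits rarer itemsets at the cost of more candidates to examine, which is why a small value such as 0.002 is a sensible starting point for a dataset with many thousands of products.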

In [17]:
model = fpgrowth.fit(df_baskets)

In [18]:
most_popular_items = model.freqItemsets

In [19]:
most_popular_items.createOrReplaceTempView("mostPopularItems")

#### verify results 

In [31]:
spark.sql("select items, freq from mostPopularItems where size(items) >= 2 order by freq desc").show(20,55)

+----------------------------------------------+----+
|                                         items|freq|
+----------------------------------------------+----+
|[Organic Strawberries, Bag of Organic Bananas]|3074|
|[Organic Hass Avocado, Bag of Organic Bananas]|2420|
|[Organic Baby Spinach, Bag of Organic Bananas]|2236|
|                     [Organic Avocado, Banana]|2216|
|                [Organic Strawberries, Banana]|2174|
|                         [Large Lemon, Banana]|2158|
|                [Organic Baby Spinach, Banana]|2000|
|                        [Strawberries, Banana]|1948|
| [Organic Raspberries, Bag of Organic Bananas]|1780|
|   [Organic Raspberries, Organic Strawberries]|1670|
|  [Organic Baby Spinach, Organic Strawberries]|1639|
|                          [Limes, Large Lemon]|1595|
|  [Organic Hass Avocado, Organic Strawberries]|1539|
|       [Organic Avocado, Organic Baby Spinach]|1402|
|                [Organic Avocado, Large Lemon]|1349|
|                           

In [30]:
spark.sql("select items, freq from mostPopularItems where size(items) > 2 order by freq desc").show(20,65)

+-----------------------------------------------------------------+----+
|                                                            items|freq|
+-----------------------------------------------------------------+----+
|[Organic Hass Avocado, Organic Strawberries, Bag of Organic Ba...| 710|
|[Organic Raspberries, Organic Strawberries, Bag of Organic Ban...| 649|
|[Organic Baby Spinach, Organic Strawberries, Bag of Organic Ba...| 587|
|[Organic Raspberries, Organic Hass Avocado, Bag of Organic Ban...| 531|
|[Organic Hass Avocado, Organic Baby Spinach, Bag of Organic Ba...| 497|
|                  [Organic Avocado, Organic Baby Spinach, Banana]| 484|
|                           [Organic Avocado, Large Lemon, Banana]| 477|
|                                     [Limes, Large Lemon, Banana]| 452|
| [Organic Cucumber, Organic Strawberries, Bag of Organic Bananas]| 424|
|                            [Limes, Organic Avocado, Large Lemon]| 389|
|[Organic Raspberries, Organic Hass Avocado, Organi

Both lists show the items that are likely to be purchased together.

The FP-Growth model also generates association rules. The output includes the antecedent, consequent, and confidence (probability). The minimum confidence for generating an association rule is determined by the minConfidence parameter.

In [32]:
assoc_rules = model.associationRules

In [33]:
assoc_rules.createOrReplaceTempView("AssocRules")

In [34]:
spark.sql("select antecedent, consequent, confidence from AssocRules order by confidence desc").show(20,55)

+-------------------------------------------------------+------------------------+-------------------+
|                                             antecedent|              consequent|         confidence|
+-------------------------------------------------------+------------------------+-------------------+
|            [Organic Raspberries, Organic Hass Avocado]|[Bag of Organic Bananas]|  0.521099116781158|
|                        [Strawberries, Organic Avocado]|                [Banana]| 0.4643478260869565|
|           [Organic Hass Avocado, Organic Strawberries]|[Bag of Organic Bananas]| 0.4613385315139701|
|                  [Organic Lemon, Organic Hass Avocado]|[Bag of Organic Bananas]| 0.4519846350832266|
|                  [Organic Lemon, Organic Strawberries]|[Bag of Organic Bananas]| 0.4505169867060561|
|               [Organic Cucumber, Organic Hass Avocado]|[Bag of Organic Bananas]| 0.4404332129963899|
|[Organic Large Extra Fancy Fuji Apple, Organic Straw...|[Bag of Organic 

According to the output, customers who bought combinations of organic raspberries, organic avocados, and organic strawberries are also more likely to buy organic bananas. For example, about 52% of the orders containing both Organic Raspberries and Organic Hass Avocado also contain a Bag of Organic Bananas. As you can see, bananas are a very popular item. This kind of list could be the basis for “customers who bought this item also bought”-type recommendations.
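
As a quick sanity check, a rule's confidence can be recomputed from the frequent-itemset counts shown earlier: it is the frequency of the antecedent together with the consequent, divided by the frequency of the antecedent alone. The sketch below uses two counts copied from the outputs above. It assumes that the truncated three-item set with frequency 710 ends in Bag of Organic Bananas, which the matching confidence value suggests but the truncated output does not show.

```python
# Counts copied from the frequent-itemset outputs above.
freq_antecedent = 1539  # [Organic Hass Avocado, Organic Strawberries]
# Same pair plus a third item (assumed to be Bag of Organic Bananas).
freq_with_consequent = 710

# confidence(A -> C) = freq(A and C together) / freq(A)
confidence = freq_with_consequent / freq_antecedent
print(round(confidence, 4))
```

The result agrees with the confidence reported for the rule whose antecedent is [Organic Hass Avocado, Organic Strawberries] and whose consequent is [Bag of Organic Bananas].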