### This notebook builds recommendation model (1) using the ECFAR algorithm.

##### Author - Reshma
-- Association Mining Rule part

#### Get train users distinct item counts 
    * K = 15, the number of items predicted for the next basket
    * Select users who have at least 4 transaction orders
    * Primary rule: Association rule
    * Secondary rule: Collaborative filtering rule
    
Reference: Feiran Wang, Yiping Wen, Tianhang Guo, Jianxun Liu, and Buqing Cao. Collaborative filtering and association rule mining-based market basket recommendation on Spark. Concurrency and Computation: Practice and Experience.
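The settings above can be illustrated with a small plain-Python sketch. It assumes that "primary" and "secondary" mean the association-rule recommendations are taken first and collaborative-filtering recommendations only top up the basket until K items are reached; the exact combination used by ECFAR is defined in the cited paper and may differ. The helper names `eligible_users` and `fill_basket` are hypothetical and do not appear elsewhere in this notebook.

```python
from typing import Dict, Iterable, List, Set

K = 15  # number of items predicted for the next basket (from the notebook)


def eligible_users(order_counts: Dict[str, int], min_orders: int = 4) -> Set[str]:
    """Keep only users with at least `min_orders` orders."""
    return {user for user, n in order_counts.items() if n >= min_orders}


def fill_basket(primary: Iterable[int], secondary: Iterable[int], k: int = K) -> List[int]:
    """Assumed combination: primary items first, then secondary items, no duplicates."""
    basket: List[int] = []
    seen = set()
    for item in list(primary) + list(secondary):
        if len(basket) >= k:
            break
        if item not in seen:
            seen.add(item)
            basket.append(item)
    return basket
```

With this arrangement a user with few matching association rules still receives a full basket whenever the secondary list has enough distinct items.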

In [22]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Capstone_MBA').getOrCreate()

%matplotlib inline

In [2]:
## Define K

In [3]:
k = 15

In [16]:
# Import Data
dataDir = "/user/reshmask/capstone/"
data    = spark.read.csv(dataDir + "instacart.csv", header=True, inferSchema=True)

In [17]:
## Split into train and test sets: eval_set "prior" rows are used for training, eval_set "train" rows for testing
train_df = data.filter(data["eval_set"]=="prior")
test_df  = data.filter(data["eval_set"]=="train")

## Association Mining

### Organize Shopping Basket

To prepare the data for downstream processing, we organize it by shopping basket: each row of the DataFrame represents one order_id, and its items column holds an array of the distinct items in that order.
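As a rough illustration of the target shape, the following plain-Python sketch groups order lines into baskets the way `groupBy` plus `collect_set` does, including the removal of duplicate items within an order. The rows and the helper name `to_baskets` are made up for the example.

```python
from collections import defaultdict

# Toy (order_id, item_id) rows; item 10 appears twice in order 1
rows = [(1, 10), (1, 20), (1, 10), (2, 30)]


def to_baskets(rows):
    """Group item ids by order id, keeping each item once per order."""
    baskets = defaultdict(set)
    for order_id, item_id in rows:
        baskets[order_id].add(item_id)
    # Sort only to make the result deterministic; collect_set gives no order guarantee
    return {order_id: sorted(items) for order_id, items in baskets.items()}


baskets_toy = to_baskets(rows)
```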

In [6]:
from pyspark.sql.functions import collect_set, col, count, countDistinct, explode
baskets = train_df.groupBy('order_id').agg(collect_set('db_food_id').alias('items'))
baskets.createOrReplaceTempView('baskets')

In [7]:
baskets.show()

+--------+--------------------+
|order_id|               items|
+--------+--------------------+
|     148|[32903, 35394, 38...|
|     471|[43621, 16226, 25...|
|     496|[6167, 35086, 28308]|
|     833|[4701, 32903, 397...|
|    1088|[35143, 31794, 19...|
|    1238|[7076, 23438, 116...|
|    1580|[4701, 33546, 722...|
|    1645|[55799, 14418, 53...|
|    1829|[32903, 53945, 22...|
|    1959|[35299, 56803, 35...|
|    2122|[30132, 19770, 44...|
|    2142|[11358, 31739, 33...|
|    2366|[8128, 5387, 3231...|
|    2659|[32903, 34569, 53...|
|    2866|[7398, 35339, 345...|
|    3175|[31542, 33982, 35...|
|    3749|       [5335, 20711]|
|    3794|[34925, 28230, 31...|
|    3918|[35041, 43165, 11...|
|    3997|[32921, 10459, 55...|
+--------+--------------------+
only showing top 20 rows



### Frequent Itemsets

The FP-Growth (Frequent Pattern growth) algorithm is one of the fastest widely used approaches to frequent itemset mining. It improves on Apriori by removing its main bottleneck, the repeated generation and counting of candidate itemsets, and it parallelizes well on distributed engines such as MapReduce and Spark. It achieves this by compressing the transactions into a structure called an FP-tree.

Given a dataset of transactions, the first step of FP-Growth is to calculate item frequencies and identify frequent items. Unlike Apriori-like algorithms designed for the same purpose, the second step encodes the transactions in a prefix-tree (FP-tree) structure, so that candidate sets, which are usually expensive to generate, never have to be generated explicitly.
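The two steps can be sketched in plain Python on a toy dataset. This is a simplified illustration of building the tree only, not Spark's distributed implementation, and it omits the mining phase that extracts itemsets from the tree. The class `FPNode`, the function `build_fp_tree`, and the absolute threshold `min_count` are names chosen for this example.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, a count, and child nodes keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Step 1: count the transactions containing each item and keep the frequent ones
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in freq.items() if c >= min_count}

    # Step 2: insert each transaction as a path, most frequent items first,
    # so transactions sharing a prefix share tree nodes
    root = FPNode()
    for t in transactions:
        ordered = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-frequent[item], item),
        )
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent


toy = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
root, frequent = build_fp_tree(toy, min_count=2)
```

Here all four toy transactions start with the most frequent item `b`, so they collapse into a single branch below the root, which is the compression that makes candidate generation unnecessary.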

In [8]:
from pyspark.ml.fpm import FPGrowth

# Set the minimum thresholds for support (fraction of baskets) and confidence
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.00002, minConfidence=0.00002)

# Fit the model on the baskets DataFrame
model_mba = fpGrowth.fit(baskets)

# Get the frequent itemsets and register them as a temp view
mostPopularItemInABasket = model_mba.freqItemsets
mostPopularItemInABasket.createOrReplaceTempView("mostPopularItemInABasket")

# Display frequent itemsets without truncating the items column
model_mba.freqItemsets.show(truncate=False)

+---------------------+-----+
|items                |freq |
+---------------------+-----+
|[35010]              |19286|
|[35010, 54035]       |301  |
|[35010, 54035, 32923]|68   |
|[35010, 54035, 32924]|99   |
|[35010, 54035, 32903]|100  |
|[35010, 54035, 33482]|92   |
|[35010, 54035, 53945]|80   |
|[35010, 4705]        |96   |
|[35010, 43265]       |438  |
|[35010, 43265, 32923]|66   |
|[35010, 43265, 53614]|66   |
|[35010, 43265, 32924]|129  |
|[35010, 43265, 32903]|123  |
|[35010, 43265, 33482]|121  |
|[35010, 43265, 53615]|77   |
|[35010, 43265, 32912]|70   |
|[35010, 43265, 34569]|78   |
|[35010, 43265, 53945]|142  |
|[35010, 43265, 32884]|70   |
|[35010, 16222]       |311  |
+---------------------+-----+
only showing top 20 rows
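The confidence of a rule A → B is freq(A ∪ B) / freq(A). As a worked example, the sketch below applies that formula to three of the itemset counts printed above; the dictionary and the helper name `confidence` are introduced here for illustration only.

```python
# Counts copied from the freqItemsets output above
freq = {
    frozenset([35010]): 19286,
    frozenset([35010, 54035]): 301,
    frozenset([35010, 43265]): 438,
}


def confidence(antecedent, consequent, freq):
    """confidence(A -> B) = freq(A union B) / freq(A)."""
    a = frozenset(antecedent)
    return freq[a | frozenset(consequent)] / freq[a]


conf_54035 = confidence([35010], [54035], freq)  # 301 / 19286
conf_43265 = confidence([35010], [43265], freq)  # 438 / 19286
```

Both values are small (roughly 1.6% and 2.3%) because item 35010 appears in many baskets, which is why a very low minConfidence is needed for such rules to be kept.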



In [9]:
# Sort the association rules by confidence (ascending, the orderBy default)
primary_rule = model_mba.associationRules.orderBy("confidence")

In [10]:
asso_df = primary_rule.toPandas()

In [12]:
asso_df.to_csv('association_rule_results', index=False)