### This is a notebook for creating a reference model using the ECFAR algorithm.

##### Author - Reshma
-- Collaborative filtering part

#### Get train users distinct item counts 
* K = 15: the number of items predicted for the next basket
* Select users who have at least 4 transaction orders
* Primary rule: association rule
* Secondary rule: collaborative filtering rule
    
Reference: Feiran Wang, Yiping Wen, Tianhang Guo, Jianxun Liu, and Buqing Cao. Collaborative filtering and association rule mining-based market basket recommendation on Spark. Concurrency and Computation: Practice and Experience.
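
As a rough illustration of how the settings listed above might fit together, the sketch below filters users by order count and then fills a K-item basket from a primary candidate list before topping it up from a secondary one. The function names `eligible_users` and `fill_basket`, the `(user_id, order_id)` input layout, and the fill-then-top-up ordering are all assumptions made for illustration; this notebook does not show that step, and the cited paper may combine the two rules differently.

```python
from collections import defaultdict

K = 15          # number of items predicted for the next basket
MIN_ORDERS = 4  # minimum number of orders a user must have


def eligible_users(order_rows, min_orders=MIN_ORDERS):
    """Return the users with at least `min_orders` distinct orders.

    `order_rows` is an iterable of (user_id, order_id) pairs
    (a hypothetical layout, not the notebook's actual schema).
    """
    orders = defaultdict(set)
    for user_id, order_id in order_rows:
        orders[user_id].add(order_id)
    return {user for user, seen in orders.items() if len(seen) >= min_orders}


def fill_basket(primary, secondary, k=K):
    """Take items from `primary` first, then `secondary`, skipping duplicates."""
    basket = []
    for item in list(primary) + list(secondary):
        if len(basket) >= k:
            break
        if item not in basket:
            basket.append(item)
    return basket
```

Here `primary` would hold items suggested by association rules and `secondary` the items suggested by collaborative filtering, each already ordered from most to least preferred.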

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Capstone_MBA').getOrCreate()

%matplotlib inline

In [6]:
# Import Data
dataDir = "/user/reshmask/capstone/"
data    = spark.read.csv(dataDir + "instacart.csv", header=True, inferSchema=True)

### ALS

[Alternating Least Squares (ALS)](https://spark.apache.org/docs/latest/ml-collaborative-filtering.html) is the model we’ll use to fit our data and find similarities. ALS is an iterative optimization process: each iteration holds the item factors fixed while solving for the user factors, then holds the user factors fixed while solving for the item factors, moving closer to a factorized representation of our original data.
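
To make the alternating idea concrete, here is a minimal pure-Python sketch of ALS with a single latent factor per user and item. It is a toy under stated assumptions, not Spark's implementation: it uses plain L2 regularization, a dense loop over observed triples, and a hypothetical function name `als_rank1`. Each half-step is a closed-form one-dimensional ridge regression, so the regularized loss cannot increase from one iteration to the next.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iters=10):
    """Rank-1 ALS on observed (user, item, rating) triples.

    Returns (user_factors, item_factors, losses), where losses[t] is the
    regularized squared error after iteration t.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    losses = []
    for _ in range(iters):
        # Hold item factors fixed; solve for each user factor in closed form.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j, r) in ratings if a == i)
            den = reg + sum(v[j] ** 2 for (a, j, r) in ratings if a == i)
            u[i] = num / den
        # Hold user factors fixed; solve for each item factor in closed form.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b, r) in ratings if b == j)
            den = reg + sum(u[i] ** 2 for (i, b, r) in ratings if b == j)
            v[j] = num / den
        # Track the objective that both half-steps minimize.
        error = sum((r - u[i] * v[j]) ** 2 for (i, j, r) in ratings)
        penalty = reg * (sum(x * x for x in u) + sum(x * x for x in v))
        losses.append(error + penalty)
    return u, v, losses


# Made-up toy data: (user index, item index, purchase count).
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
user_f, item_f, history = als_rank1(toy, n_users=3, n_items=3)
```

The predicted value for a user-item pair is simply `user_f[i] * item_f[j]`; with more latent factors the scalar divisions become small linear solves.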

For implicit preference data, the algorithm used is based on “Collaborative Filtering for Implicit Feedback Datasets”, adapted for the blocked approach used here.

Essentially, instead of finding low-rank approximations to the rating matrix R, this finds approximations to a preference matrix P whose elements are 1 if r > 0 and 0 if r <= 0. The ratings then act as ‘confidence’ values that reflect the strength of the indicated user preferences, rather than as explicit ratings given to items.
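
The mapping from raw counts to preference and confidence can be sketched in a few lines. The helper names `preference` and `confidence` are hypothetical, and the linear form `1 + alpha * r` follows the paper named above; the `alpha` value used here is only an example.

```python
def preference(r):
    """Binary preference: 1 if the user interacted with the item, else 0."""
    return 1 if r > 0 else 0


def confidence(r, alpha=1.0):
    """Confidence in the preference, growing linearly with the count r."""
    return 1.0 + alpha * abs(r)


# Purchase counts like those in the user-item frequency table.
for count in (0, 1, 5, 17):
    print(count, preference(count), confidence(count))
```

In Spark, this mode is selected with `implicitPrefs=True` on `ALS`, with `alpha` scaling the confidence. The `ALS(...)` call in this notebook does not set `implicitPrefs`, so it runs with the default of `False` and fits the counts directly as explicit ratings.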

### Cold Start Predictions

When predictions are requested for cold-start users or items (ones that were not seen when the model was fit), the predictions are NaN, as shown in the summary below. Evaluating with the mean squared error then also produces NaN. To solve this problem, the affected rows can be dropped with <code>predictions.na.drop()</code>. A more streamlined way is to pass <code>coldStartStrategy="drop"</code> as a model parameter.
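
A small plain-Python sketch shows why one NaN prediction is enough to turn the whole metric into NaN, and how dropping those rows restores a usable value. The data and the `rmse` helper are made up for illustration and stand in for what `RegressionEvaluator` computes.

```python
import math


def rmse(pairs):
    """Root mean squared error over (label, prediction) pairs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))


# The last row mimics a cold-start user or item: its prediction is NaN.
pairs = [(1.0, 0.5), (2.0, 2.5), (3.0, float("nan"))]

with_nan = rmse(pairs)  # NaN propagates through the sum
dropped = rmse([(y, p) for y, p in pairs if not math.isnan(p)])

print(with_nan, dropped)
```

Note that dropping rows means the metric only covers users and items seen during training, so it says nothing about how well cold-start cases are handled.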

In [7]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [8]:
## split train and test dataset
train_df = data.filter(data["eval_set"]=="prior")
test_df  = data.filter(data["eval_set"]=="train")

In [10]:
## build user-item frequency matrix
train_als = train_df.groupby(['db_food_id', 'user_id']).count()
test_als  = test_df.groupby(['db_food_id', 'user_id']).count()

In [11]:
train_als.show(10)

+----------+-------+-----+
|db_food_id|user_id|count|
+----------+-------+-----+
|     40847|  45765|   17|
|     55383| 156184|    1|
|     32191| 110777|    5|
|     36574|  23297|    2|
|     35052| 134599|    2|
|     11852| 185905|    5|
|     10996|  87494|   10|
|     33393| 125187|    3|
|     39785| 204950|    6|
|     37677|  36894|    1|
+----------+-------+-----+
only showing top 10 rows



In [12]:
# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=10, regParam=0.01, userCol="user_id", itemCol="db_food_id", ratingCol="count",
          coldStartStrategy="drop", nonnegative = True)

In [13]:
#fit and predict
model_als   = als.fit(train_als)
predictions = model_als.transform(test_als)

In [14]:
# Evaluate the test-set predictions from the previous cell with RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="count", predictionCol="prediction")
rmse      = evaluator.evaluate(predictions)
print("RMSE=" + str(rmse))
predictions.show()

RMSE=4.130624211430937
+----------+-------+-----+----------+
|db_food_id|user_id|count|prediction|
+----------+-------+-----+----------+
|      1238|  10195|    1|0.93837816|
|      1238|  37389|    1| 0.7414965|
|      1238| 111286|    1| 1.3206091|
|      1238| 205187|    1|  3.215652|
|      1238| 164114|    1| 0.8272763|
|      1238| 165935|    1|0.88317794|
|      1238| 136805|    1| 0.9235882|
|      1238|  64208|    1| 2.1437192|
|      1238|  58526|    1| 0.9380222|
|      1238|  98144|    1| 0.8043657|
|      1238|  24457|    1| 1.2789723|
|      1238| 142267|    1|  0.523955|
|      1238| 161319|    1| 1.2288841|
|      1238|  14928|    1| 1.2893299|
|      1238| 195058|    1| 1.7174466|
|      1238|  32320|    1| 1.6573441|
|      1238|  16710|    1| 1.4726865|
|      1238| 180934|    1| 1.5805019|
|      1238|   7040|    1| 1.3913372|
|      1580|  44032|    1|0.69242245|
+----------+-------+-----+----------+
only showing top 20 rows



In [15]:
#explain parameters of the model
model_als.explainParams()

'coldStartStrategy: strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: nan,drop. (default: nan, current: drop)\nitemCol: column name for item ids. Ids must be within the integer value range. (default: item, current: db_food_id)\npredictionCol: prediction column name (default: prediction)\nuserCol: column name for user ids. Ids must be within the integer value range. (default: user, current: user_id)'

In [16]:
#item factors 
model_als.itemFactors.show(10, truncate = False)

+----+------------------------------------------------------------------------------------------------------------------+
|id  |features                                                                                                          |
+----+------------------------------------------------------------------------------------------------------------------+
|970 |[0.088116884, 0.0, 0.0, 0.035574753, 0.27838606, 0.0, 0.486654, 0.7738914, 0.0, 0.033094883]                      |
|980 |[0.0, 0.013763662, 0.0, 0.0, 0.3498928, 0.13102037, 1.0802736, 0.19282363, 0.0, 0.3486929]                        |
|1020|[0.065856196, 0.0, 0.55094784, 0.0, 0.0, 0.0, 0.91577226, 0.2944313, 0.0, 0.15547165]                             |
|1030|[0.19036083, 0.17601016, 0.0723071, 0.12571405, 0.47627324, 0.0, 0.042583797, 0.16411342, 0.048832636, 0.0]       |
|1060|[0.044638943, 0.0, 0.20901686, 0.2618084, 0.011542382, 0.0, 0.05577788, 0.22124489, 0.28142706, 0.15492912]       |
|1070|[0.0, 0.5380989, 0

In [28]:
predictions.show(10)

+----------+-------+-----+----------+
|db_food_id|user_id|count|prediction|
+----------+-------+-----+----------+
|      1238|  37389|    1|0.98203385|
|      1238|  10195|    1| 1.2160385|
|      1238| 111286|    1|  1.648296|
|      1238|  32320|    1| 2.8713043|
|      1238|  64208|    1| 2.3204348|
|      1238| 142267|    1| 0.4775596|
|      1238|  58526|    1| 1.3425761|
|      1238| 161319|    1| 1.3234239|
|      1238| 165935|    1| 0.6718278|
|      1238|  14928|    1| 1.3865718|
+----------+-------+-----+----------+
only showing top 10 rows



In [17]:
recs=model_als.recommendForAllUsers(15).toPandas()

In [18]:
import pandas as pd

# Expand each user's list of recommendations into one row per (user, item) pair
nrecs = recs.recommendations.apply(pd.Series) \
            .merge(recs, right_index=True, left_index=True) \
            .drop(["recommendations"], axis=1) \
            .melt(id_vars=['user_id'], value_name="recommendation") \
            .drop("variable", axis=1) \
            .dropna()
nrecs = nrecs.sort_values('user_id')
# Split each (db_food_id, rating) pair into two separate columns
nrecs = pd.concat([nrecs['recommendation'].apply(pd.Series), nrecs['user_id']], axis=1)
# Note: the second column holds the ALS predicted rating; it is labelled 'count' here
nrecs.columns = ['db_food_id', 'count', 'user_id']

In [21]:
nrecs.to_csv('collab_filter_results.csv', index= False)
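
The reshaping done before the CSV export turns one row per user, holding a list of (item, rating) pairs, into one row per pair. A minimal plain-Python sketch of the same idea, using made-up data and a hypothetical helper `flatten_recs`, may make the pandas chain easier to follow:

```python
def flatten_recs(recs):
    """Flatten {user_id: [(item_id, rating), ...]} into sorted flat rows.

    Each output row is (item_id, rating, user_id), matching the column
    order written to the CSV file.
    """
    rows = []
    for user_id in sorted(recs):
        for item_id, rating in recs[user_id]:
            rows.append((item_id, rating, user_id))
    return rows


# Made-up recommendations for two users.
example = {
    7: [(1238, 2.5), (1580, 1.9)],
    3: [(970, 3.1)],
}
flat = flatten_recs(example)
```

With K = 15, every user contributes 15 such rows to the output file.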