## Methodology

We begin by downsizing the interaction dataset to a more manageable size. We build our recommendation system with collaborative filtering, and matrix factorization is our go-to class of models because of its effectiveness. The predictions can be improved by assigning different regularization weights to the latent factors based on item popularity and user activity. We use the Alternating Least Squares (ALS) model from the PySpark library as our first simple model.
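To make the idea of alternating least squares concrete, here is a minimal pure-Python sketch with a single latent factor per user and item. It is not PySpark's implementation: the toy ratings, the function names, and the choice to scale the regularization term by each user's or item's rating count are assumptions made for illustration only.

```python
# Toy rank-1 ALS with count-weighted regularization (illustrative only).
# `ratings` maps (user, item) -> rating; all names here are hypothetical.

def als_rank1(ratings, reg=0.1, n_iter=20):
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}  # one latent factor per user
    item_f = {i: 1.0 for i in items}  # one latent factor per item

    for _ in range(n_iter):
        # Fix the item factors and solve each user's 1-D least-squares problem.
        for u in users:
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in rated)
            # Regularization grows with the number of ratings the user gave.
            den = sum(item_f[i] ** 2 for i, _ in rated) + reg * len(rated)
            user_f[u] = num / den
        # Fix the user factors and solve each item's 1-D least-squares problem.
        for i in items:
            rated = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in rated)
            # Regularization grows with the number of ratings the item received.
            den = sum(user_f[u] ** 2 for u, _ in rated) + reg * len(rated)
            item_f[i] = num / den
    return user_f, item_f


def squared_error(ratings, user_f, item_f):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())


# Made-up ratings: three users, three items, six observed entries.
toy = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 2, (2, 2): 1}
user_f, item_f = als_rank1(toy)
print(squared_error(toy, user_f, item_f))
```

Each pass holds one side fixed and solves the other in closed form, which is what makes the method "alternating". With more than one latent factor, each update becomes a small linear system instead of a scalar division.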

**Note:** PySpark is straightforward to set up on Google Colab, which we use to run the code. Click the "Open in Colab" badge under the picture to open this notebook in Google Colab.

![flowchart](./images/flowchart.png)

<a href="https://colab.research.google.com/github/ramilchai/capstone/blob/main/ALS_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import json
import pickle


The following commands set up PySpark on Google Colab.

In [None]:
# Run these lines to make PySpark work on Colab
!pip install pyspark
# The JDK is a system package, not a Python package, so it is installed with apt-get rather than pip
!apt-get install -y openjdk-8-jdk-headless -qq
!pip install mlflow

Load the DataFrame from the pickle file we saved in the previous notebook (`recommendation_book.ipynb`).

In [3]:
df = pd.read_pickle('/content/interact')

In [6]:
import pyspark
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import feature
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql.types import IntegerType

### Set up Spark session

Initialize a `SparkSession` object and import the dataset into a PySpark DataFrame.

In [7]:
spark = SparkSession\
        .builder\
        .appName('bookrec').config('spark.driver.host', 'localhost')\
        .getOrCreate()

In [8]:
df_sp = spark.createDataFrame(df)

In [9]:
df_sp.dtypes

[('user_id_num', 'bigint'), ('book_id', 'bigint'), ('rating', 'bigint')]
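Both id columns are `bigint`, while the ALS parameter documentation printed later in this notebook states that ids must be within the integer value range. A quick sanity check along the following lines may be useful before fitting. The helper name and the sample ids are hypothetical and only illustrate the range test.

```python
# Signed 32-bit integer bounds.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def ids_fit_in_int32(ids):
    """Return True if every id lies inside the signed 32-bit integer range."""
    return all(INT32_MIN <= i <= INT32_MAX for i in ids)


# Made-up ids for illustration only.
sample_user_ids = [0, 17, 2_147_483_647]
too_large_ids = [0, 2_147_483_648]

print(ids_fit_in_int32(sample_user_ids))
print(ids_fit_in_int32(too_large_ids))
```

On the real data, the same check could be applied to `df['user_id_num']` and `df['book_id']` before creating the Spark DataFrame.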

### First Simple Model (FSM)

We build our first simple model using ALS.

In [10]:
(training, test) = df_sp.randomSplit([0.8, 0.2])

In [15]:
als = ALS(maxIter=5,rank=4, regParam=0.01, userCol='user_id_num', itemCol='book_id', ratingCol='rating',
          coldStartStrategy='drop')
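The `coldStartStrategy='drop'` setting matters for evaluation: with the default `'nan'` strategy, a user or book that appears only in the test split gets a NaN prediction, and a single NaN makes the whole error metric NaN. The pure-Python sketch below mimics that effect on made-up rows; the row layout and helper name are hypothetical and do not reproduce Spark's internals.

```python
import math

# Made-up predictions; the second row stands for a user unseen during training.
rows = [
    {"user": 1, "item": 10, "rating": 4, "prediction": 3.5},
    {"user": 2, "item": 11, "rating": 2, "prediction": float("nan")},
    {"user": 3, "item": 10, "rating": 5, "prediction": 4.5},
]


def mean_squared_error(rows):
    """Average squared difference between rating and prediction."""
    return sum((r["rating"] - r["prediction"]) ** 2 for r in rows) / len(rows)


# Equivalent of 'drop': discard rows whose prediction is NaN before scoring.
kept = [r for r in rows if not math.isnan(r["prediction"])]

print(mean_squared_error(rows))  # NaN, because one prediction is NaN
print(mean_squared_error(kept))  # finite, computed on the remaining rows
```

Dropping rows keeps the metric computable, but it also means the reported error says nothing about users or books the model has never seen.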

In [16]:
fsm_model = als.fit(training)

In [17]:
predictions = fsm_model.transform(test)
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating',
                                predictionCol='prediction')
rmse = evaluator.evaluate(predictions)
print('Root-mean-square error = ' + str(rmse))

Root-mean-square error = 1.623987864634006


The RMSE of our FSM is about 1.62, which leaves considerable room for improvement.
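For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in the same units as the ratings themselves. The sketch below computes it by hand on made-up numbers; it is not the `RegressionEvaluator` code, only the same formula.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length, non-empty sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up ratings and predictions for illustration only.
actual = [5, 3, 4, 1]
predicted = [4.0, 3.0, 2.0, 2.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.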

### Tuning the Model with Cross-validation

Let's now optimize our FSM by using the built-in `CrossValidator` in PySpark with a suitable param grid and determine the optimal model.

In [18]:
als_model = ALS(userCol='user_id_num', itemCol='book_id', 
                ratingCol='rating', coldStartStrategy='drop')

In [19]:
params = ParamGridBuilder()\
          .addGrid(als_model.regParam, [0.01, 0.001, 0.1])\
          .addGrid(als_model.rank, [4, 10, 50]).build()

In [20]:
cv = CrossValidator(estimator=als_model, 
                    estimatorParamMaps=params,
                    evaluator=evaluator,
                    parallelism=4)

best_als_model = cv.fit(df_sp)    
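It helps to know how much work this search involves before running it. The sketch below enumerates the same grid with `itertools.product` and counts the model fits, assuming `CrossValidator` keeps its default of 3 folds since `numFolds` is not set above. The variable names are hypothetical.

```python
from itertools import product

# Same candidate values as the param grid above.
reg_params = [0.01, 0.001, 0.1]
ranks = [4, 10, 50]
num_folds = 3  # assumption: CrossValidator default, not set explicitly in the notebook

# Every combination of regParam and rank is one candidate model.
grid = [{"regParam": reg, "rank": rank} for reg, rank in product(reg_params, ranks)]

# Each candidate is trained once per fold.
total_fits = len(grid) * num_folds

print(len(grid), total_fits)
```

The count grows multiplicatively with every value added to the grid, which is why the search is run with `parallelism=4`.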

In [21]:
best_als_model.bestModel.rank

50

In [26]:
best_als_model.bestModel.__dict__

{'_defaultParamMap': {Param(parent='ALS_60d28e8bdf3b', name='blockSize', doc='block size for stacking input data in matrices. Data is stacked within partitions. If block size is more than remaining data in a partition then it is adjusted to the size of this data.'): 4096,
  Param(parent='ALS_60d28e8bdf3b', name='coldStartStrategy', doc="strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'."): 'nan',
  Param(parent='ALS_60d28e8bdf3b', name='itemCol', doc='column name for item ids. Ids must be within the integer value range.'): 'item',
  Param(parent='ALS_60d28e8bdf3b', name='predictionCol', doc='prediction column name.'): 'prediction',
  Param(parent='ALS_60d28e8bdf3b', name='userCol', doc='column name for user ids. Ids must be within the integer value range.'): 'user'},
 '_java_obj': JavaObject id=o57

### Best ALS Model

We finally get our best ALS model with the following hyperparameters: `maxIter`=5, `rank`=50, and `regParam`=0.1. Note that `maxIter` was not part of the search grid; it is kept at the FSM's value of 5.

In [37]:
best_als = ALS(maxIter=5,rank=50, regParam=0.1, userCol='user_id_num', itemCol='book_id', 
                ratingCol='rating', coldStartStrategy='drop')

In [38]:
best_als_model = best_als.fit(training)

In [39]:
best_als_predictions = best_als_model.transform(test)
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating',
                                predictionCol='prediction')
rmse = evaluator.evaluate(best_als_predictions)
print('Root-mean-square error = ' + str(rmse))

Root-mean-square error = 1.092242472532004


The RMSE of our best ALS model is about 1.09, a clear improvement over the FSM's 1.62.