# Spark Evaluation

This notebook uses MMLSpark. See the GitHub repository for installation details for your platform:
https://github.com/Azure/mmlspark

Evaluation with offline metrics is pivotal for assessing the quality of a recommender before it goes into production. Evaluation metrics are usually chosen based on the application scenario of the recommendation system. It is therefore important for data scientists and AI developers who build recommendation systems to understand how each evaluation metric is calculated and what it is for.

This notebook takes a deep dive into several commonly used evaluation metrics and illustrates how they are used in practice. The metrics covered here are for offline evaluation only.

## 0 Global settings

In [5]:
# set the environment path to find Recommenders
import sys
import os
sys.path.append("../../")
import pandas as pd
import pyspark
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

from mmlspark.RankingAdapter import RankingAdapter
from mmlspark.RankingEvaluator import RankingEvaluator

SUBMIT_ARGS = "--packages Azure:mmlspark:0.15 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

from pyspark.sql import SparkSession

spark = (
   SparkSession.builder.appName("Spark Evaluation Metrics")
   .master("local[*]")
   .config("spark.driver.memory", "1G")
   .getOrCreate()
)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("PySpark version: {}".format(pyspark.__version__))

Note: to run Spark code successfully with the Jupyter kernel, the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` must point to Python executables of the desired version. Detailed information can be found in the setup instructions, [SETUP.md](https://raw.githubusercontent.com/dciborow/Recommenders/master/SETUP.md).
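
One way to set these variables from inside the notebook is sketched below. It assumes that the interpreter running the notebook is also the one Spark should use, which may not hold in every setup, and it only takes effect if it runs before the `SparkSession` is created.

```python
import os
import sys

# Assumption: the Python interpreter running this notebook is the one that
# Spark should use for both the driver and the workers.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print("PYSPARK_PYTHON = {}".format(os.environ["PYSPARK_PYTHON"]))
print("PYSPARK_DRIVER_PYTHON = {}".format(os.environ["PYSPARK_DRIVER_PYTHON"]))
```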

In [7]:
COL_USER = "UserId"
COL_ITEM = "MovieId"
COL_RATING = "Rating"
COL_PREDICTION = "prediction"

## 1 Prepare data

For illustration purposes, a dummy dataset is created to demonstrate how the different evaluation metrics work.

The data has a schema that is common in recommendation problems: each row in the dataset is a (user, item, rating) tuple, where "rating" can be an ordinal rating score (e.g., discrete integers such as 1, 2, 3) or a floating-point number that quantifies the user's preference for that item.

For simplicity, the rating column in the dummy dataset used in this example contains ordinal ratings.

In [10]:
df_true = pd.DataFrame(
    {
        COL_USER: [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        COL_ITEM: [1, 2, 3, 1, 4, 5, 6, 7, 2, 5, 6, 8, 9, 10, 11, 12, 13, 14],
        COL_RATING: [5, 4, 3, 5, 5, 3, 3, 1, 5, 5, 5, 4, 4, 3, 3, 3, 2, 1],
    }
)
ratings = spark.createDataFrame(df_true)

ratings.filter(ratings[COL_USER] == 1).show()

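Build the recommendation model by fitting ALS on the ratings data. For simplicity, the same data is used here for fitting and for prediction.

Note that the cold start strategy is set to 'drop' so that rows with NaN predictions are removed and the evaluation metrics do not come out as NaN.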

In [12]:
als = ALS(maxIter=5, regParam=0.01, userCol=COL_USER, itemCol=COL_ITEM, ratingCol=COL_RATING,
          coldStartStrategy="drop")
predictions = als.fit(ratings).transform(ratings)

predictions.show()
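
The cold start note above can be illustrated without Spark. With the default cold start strategy, ALS returns NaN predictions for users or items it did not see during fitting, and a single NaN is enough to turn an averaged metric into NaN. The sketch below uses made-up (rating, prediction) pairs and a hypothetical helper `rmse_of`; it only mimics the effect of dropping those rows and is not how Spark implements it.

```python
import math

# Hypothetical (rating, prediction) pairs. The last prediction is NaN, as ALS
# would return for a user or item that was not seen during fitting.
pairs = [(5.0, 4.5), (3.0, 3.5), (4.0, float("nan"))]


def rmse_of(rows):
    """Root mean squared error over (label, prediction) pairs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in rows) / len(rows))


rmse_with_nan = rmse_of(pairs)

# Mimic coldStartStrategy="drop": discard rows whose prediction is NaN.
kept = [(y, p) for y, p in pairs if not math.isnan(p)]
rmse_after_drop = rmse_of(kept)

print(rmse_with_nan)    # nan
print(rmse_after_drop)  # 0.5
```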

## 2 Evaluation metrics

### 2.1 Rating metrics

In [15]:
rmse = RegressionEvaluator(metricName="rmse", labelCol=COL_RATING, predictionCol="prediction").evaluate(predictions)
print("The RMSE is {}".format(rmse))

r2 = RegressionEvaluator(metricName="r2", labelCol=COL_RATING, predictionCol="prediction").evaluate(predictions)
print("The R2 is {}".format(r2))

mae = RegressionEvaluator(metricName="mae", labelCol=COL_RATING, predictionCol="prediction").evaluate(predictions)
print("The MAE is {}".format(mae))

mse = RegressionEvaluator(metricName="mse", labelCol=COL_RATING, predictionCol="prediction").evaluate(predictions)
print("The MSE is {}".format(mse))
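
To make the calculation behind these four metrics explicit, here is a minimal sketch in plain Python. The ratings and predictions are made up for illustration and are not the output of the ALS model above; the `_manual` variable names are likewise illustrative. R2 is computed as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math

# Hypothetical true ratings and predictions (not produced by the ALS model).
y_true = [5.0, 4.0, 3.0, 1.0]
y_pred = [4.5, 4.0, 2.0, 2.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean squared error and its square root.
mse_manual = sum(e ** 2 for e in errors) / n
rmse_manual = math.sqrt(mse_manual)

# Mean absolute error.
mae_manual = sum(abs(e) for e in errors) / n

# R2 = 1 - (residual sum of squares) / (total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2_manual = 1 - ss_res / ss_tot

print("MSE:  {}".format(mse_manual))   # 0.5625
print("RMSE: {}".format(rmse_manual))  # 0.75
print("MAE:  {}".format(mae_manual))   # 0.625
print("R2:   {}".format(r2_manual))    # about 0.743
```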

### 2.2 Ranking metrics

The following code fits the model and creates the properly formatted output dataset for ranking evaluation.

In [18]:
output = RankingAdapter(k=5, recommender=als).fit(ratings).transform(ratings)

output.show()
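
As a rough picture of what a ranking-ready dataset looks like, the sketch below builds, for each user, the list of the top `K` items by true rating from the dummy data. It assumes that the adapter's output holds one row per user with a list of recommended items and a list of items the user actually rated; that layout is an assumption here, and the sketch does not reproduce MMLSpark's implementation.

```python
from collections import defaultdict

# The same (user, item, rating) tuples as the dummy dataset above.
users = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
items = [1, 2, 3, 1, 4, 5, 6, 7, 2, 5, 6, 8, 9, 10, 11, 12, 13, 14]
scores = [5, 4, 3, 5, 5, 3, 3, 1, 5, 5, 5, 4, 4, 3, 3, 3, 2, 1]

K = 5

# Group (item, rating) pairs by user.
by_user = defaultdict(list)
for user, item, rating in zip(users, items, scores):
    by_user[user].append((item, rating))

# For each user, keep the K items with the highest rating. sorted() is stable,
# so items with equal ratings keep their original order.
top_k_actual = {
    user: [item for item, _ in sorted(pairs, key=lambda p: -p[1])[:K]]
    for user, pairs in by_user.items()
}

for user, ranked in sorted(top_k_actual.items()):
    print(user, ranked)
```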

A few ranking metrics can then be calculated.

In [20]:
# prec = RankingEvaluator(k=5, metricName='precision').evaluate(output)
# print("The precision at k is {}".format(prec))

# recall = RankingEvaluator(k=5, metricName='recall').evaluate(output)
# print("The recall at k is {}".format(recall))

ndcg = RankingEvaluator(k=5, metricName='ndcgAt').evaluate(output)
print("The ndcg at k is {}".format(ndcg))

# Named map_at_k to avoid shadowing Python's built-in map().
map_at_k = RankingEvaluator(k=5, metricName='map').evaluate(output)
print("The map at k is {}".format(map_at_k))

fcp = RankingEvaluator(k=5, metricName='fcp').evaluate(output)
print("The fcp is {}".format(fcp))

mrr = RankingEvaluator(k=5, metricName='mrr').evaluate(output)
print("The mrr is {}".format(mrr))