<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Content Based Personalization
## LightGBM on Azure Databricks
This notebook provides a quick example of how to train a LightGBM model on Azure Databricks and deploy it using MML Spark for a content personalization scenario.<br><br>
[LightGBM](https://github.com/Microsoft/Lightgbm) \[1\] is a gradient boosting framework that uses tree-based learning algorithms.<br>
[MMLSpark](https://github.com/Azure/mmlspark) \[2\] allows LightGBM to be called in a Spark environment, which provides several advantages:
- Distributed computation for model development
- Easy integration into existing Spark workflows
- Model serving through Spark Serving \[3\]

## Global Settings and Imports

A Python script is provided to simplify setting up Azure Databricks with the correct
dependencies.<br> Run `python scripts/databricks_install.py -h` for more details.

In [1]:
import os
import sys

sys.path.append("../../")

import pyspark
from pyspark.ml import PipelineModel
from pyspark.ml.feature import FeatureHasher

from reco_utils.common.spark_utils import start_or_get_spark
from reco_utils.common.notebook_utils import is_databricks
from reco_utils.dataset.criteo import load_spark_df
from reco_utils.dataset.spark_splitters import spark_random_split
from scripts.databricks_install import MMLSPARK_INFO

# Setup MML Spark
if not is_databricks():
    # get the maven coordinates for MML Spark from databricks_install script
    packages = [MMLSPARK_INFO['maven']['coordinates']]
    spark = start_or_get_spark(packages=packages)
    dbutils = None

from mmlspark import ComputeModelStatistics
from mmlspark import LightGBMClassifier

print("System version: {}".format(sys.version))
print("PySpark version: {}".format(pyspark.version.__version__))
print("MMLSpark version: {}".format(MMLSPARK_INFO['maven']['coordinates']))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
PySpark version: 2.3.1
MMLSpark version: Azure:mmlspark:0.16


In [3]:
# Criteo data size, it can be "sample" or "full"
DATA_SIZE = "sample"

# LightGBM parameters
# More details on parameters: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
NUM_LEAVES = 32
NUM_ITERATIONS = 50
LEARNING_RATE = 0.1
FEATURE_FRACTION = 0.8
EARLY_STOPPING_ROUND = 10

# Model name
MODEL_NAME = 'finished.model'

## Data Preparation
The Criteo Display Advertising Challenge (DAC) dataset \[4\] is a well-known industry benchmark for developing click-through rate (CTR) prediction models and is frequently used in research papers. The original dataset contains over 45M rows, but there is also a down-sampled dataset with 100,000 rows (used by setting `DATA_SIZE = "sample"`).<br><br>
The dataset contains 1 label column and 39 feature columns, where 13 columns are integer values (int00-int12) and 26 columns are categorical features (cat00-cat25).<br><br>
The meaning of the individual columns is not provided, but for this case we can treat the integer and categorical values as features representing user and/or item content. The label is binary and is an example of implicit feedback indicating a user's interaction with an item. With this dataset we can demonstrate how to build a model that predicts the probability of a user interacting with an item based on the available user and item content features.


In [4]:
raw_data = load_spark_df(size=DATA_SIZE, spark=spark, dbutils=dbutils)
# visualize data
raw_data.limit(2).toPandas().head()

8.79MB [00:02, 4.07MB/s]                            


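| | label | int00 | int01 | int02 | int03 | int04 | int05 | int06 | int07 | int08 | ... | cat16 | cat17 | cat18 | cat19 | cat20 | cat21 | cat22 | cat23 | cat24 | cat25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 5 | 0 | 1382 | 4 | 15 | 2 | 181 | ... | e5ba7672 | f54016b9 | 21ddcdc9 | b1252a9d | 07b5194c | | 3a171ecb | c5c50484 | e8b83407 | 9727dd16 |
| 1 | 0 | 2 | 0 | 44 | 1 | 102 | 8 | 2 | 2 | 4 | ... | 07c540c4 | b04e4670 | 21ddcdc9 | 5840adea | 60f6221e | | 3a171ecb | 43f13e8b | e8b83407 | 731c3655 |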


### Feature Processing
The feature data has many missing values across both the integer and categorical feature fields. In addition, the categorical features have many distinct values, so effectively cleaning and representing the feature data is an important step prior to training a model.<br><br>
One of the simplest ways of handling features that have both missing values and high cardinality is the hashing trick. The [FeatureHasher](http://spark.apache.org/docs/latest/ml-features.html#featurehasher) transformer passes integer values through and hashes categorical features into a sparse vector of lower dimensionality, which can be used effectively by LightGBM.<br><br>
First, the dataset is split randomly into training and test sets, and then feature processing is applied to each set.
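
To make the hashing trick concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only and not Spark's implementation: the helper name `hash_features`, the use of MD5 as the hash function, and the vector size are assumptions chosen for the example. Numeric values are stored at the slot given by hashing the column name, categorical values set an indicator at the slot given by hashing `column=value`, and missing values are simply skipped.

```python
import hashlib


def hash_features(row, num_features=2 ** 18):
    """Map a dict of raw column values to a sparse {index: value} vector.

    Hypothetical helper for illustration; MD5 is an arbitrary choice of hash.
    """
    vector = {}
    for column, value in row.items():
        if value is None:
            # Missing values contribute nothing (implicitly zero).
            continue
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Numeric column: hash the column name, keep the value itself.
            key, weight = column, float(value)
        else:
            # Categorical column: hash "column=value", use an indicator of 1.0.
            key, weight = "{}={}".format(column, value), 1.0
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_features
        # Features that collide on the same slot are summed together.
        vector[index] = vector.get(index, 0.0) + weight
    return vector


# One row with a numeric value, a categorical value, and two missing fields.
row = {"int00": 1, "int01": None, "cat00": "e5ba7672", "cat01": None}
sparse = hash_features(row)
```

The output size is fixed by `num_features` regardless of how many distinct categorical values appear, which is what keeps high-cardinality columns manageable.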

In [5]:
raw_train, raw_test = spark_random_split(raw_data, ratio=0.8, seed=42)

In [6]:
columns = [c for c in raw_data.columns if c != 'label']
feature_processor = FeatureHasher(inputCols=columns, outputCol='features')

In [7]:
train = feature_processor.transform(raw_train)
test = feature_processor.transform(raw_test)

## Model Training
In MML Spark, the LightGBM implementation for binary classification is invoked using the `LightGBMClassifier` class with the objective set to `'binary'`. In this dataset the occurrence of positive labels is quite low, so setting the `isUnbalance` flag to `True` helps account for the class imbalance.<br><br>
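
As a rough intuition for what imbalance handling does, the sketch below re-weights the rarer positive class so that both classes carry the same total weight. This is a conceptual illustration under stated assumptions, not LightGBM's internal code: the helper name `positive_class_weight` is hypothetical, and it assumes positives are the minority class.

```python
def positive_class_weight(labels):
    """Return the weight that makes positives and negatives equally heavy.

    Hypothetical helper for illustration; assumes binary labels in {0, 1}.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    return negatives / positives


# Toy label set with 25% positives.
labels = [0] * 75 + [1] * 25
pos_weight = positive_class_weight(labels)

# Per-example weights: negatives keep weight 1.0, positives are up-weighted.
sample_weights = [pos_weight if y == 1 else 1.0 for y in labels]
```

With these weights, errors on the rare positive examples cost as much in aggregate as errors on the common negative examples.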

### Hyper-parameters
Below are some of the key [hyper-parameters](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters-Tuning.rst) \[5\] for training a LightGBM classifier on Spark:
- numLeaves: the maximum number of leaves in each tree
- numIterations: the number of boosting iterations
- learningRate: the shrinkage rate applied to each tree's contribution
- featureFraction: the fraction of features randomly selected for training each tree
- earlyStoppingRound: the number of consecutive rounds without improvement in the validation metric after which training stops, which helps avoid overfitting
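
The early stopping rule in particular is easy to misread, so here is a minimal sketch of the usual patience-based logic. It is not taken from LightGBM or MML Spark: the function name and the toy scores are hypothetical, and it assumes one validation score per boosting round where higher is better.

```python
def boosting_rounds_with_early_stopping(validation_scores, patience):
    """Return (rounds_run, best_round) for a sequence of per-round scores.

    Hypothetical helper for illustration; assumes higher scores are better.
    """
    best_score = float("-inf")
    best_round = 0
    for round_number, score in enumerate(validation_scores, start=1):
        if score > best_score:
            # Strict improvement resets the patience counter.
            best_score, best_round = score, round_number
        elif round_number - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return round_number, best_round
    return len(validation_scores), best_round


# Toy validation AUC per round; the score stops improving after round 3.
scores = [0.60, 0.65, 0.68, 0.67, 0.68, 0.66, 0.69]
rounds_run, best_round = boosting_rounds_with_early_stopping(scores, patience=3)
```

In this toy run, training halts after round 6 and the best round is round 3, so the later improvement at round 7 is never reached. A larger patience value trades extra training time for a lower chance of stopping too soon.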

In [8]:
lgbm = LightGBMClassifier(
    labelCol='label',
    featuresCol='features',
    objective='binary',
    isUnbalance=True,
    boostingType='gbdt',
    boostFromAverage=True,
    baggingSeed=42,
    numLeaves=NUM_LEAVES,
    numIterations=NUM_ITERATIONS,
    learningRate=LEARNING_RATE,
    featureFraction=FEATURE_FRACTION,
    earlyStoppingRound=EARLY_STOPPING_ROUND
)

### Model Training and Evaluation

In [9]:
model = lgbm.fit(train)
predictions = model.transform(test)

In [10]:
evaluator = (
    ComputeModelStatistics()
    .setScoredLabelsCol("prediction")
    .setLabelCol("label")
    .setEvaluationMetric("AUC")
)

evaluator.transform(predictions).show()

+---------------+------------------+
|evaluation_type|               AUC|
+---------------+------------------+
| Classification|0.6889596274427175|
+---------------+------------------+
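
For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison on toy data. It is an illustration of the definition only, not how `ComputeModelStatistics` computes it, and the function name and data are hypothetical.

```python
def roc_auc(labels, scores):
    """Compute AUC as the fraction of correctly ordered (positive, negative) pairs.

    Hypothetical helper for illustration; ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. The pairwise loop is quadratic, so it is only suitable for small examples like this one.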



## Model Saving and Loading
The full pipeline for operating on raw data including feature processing and model prediction can be saved and reloaded for use in another workflow.

In [None]:
# save model
pipeline = PipelineModel(stages=[feature_processor, model])
pipeline.write().overwrite().save(MODEL_NAME)
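
Reloading is the mirror image of saving. The following is a minimal sketch, assuming an active Spark session with the MML Spark package available so that the saved LightGBM stage can be resolved. The helper name `load_and_score` is hypothetical and not part of the original notebook.

```python
def load_and_score(model_path, raw_df):
    """Reload a saved pipeline and apply it to a raw Spark DataFrame.

    Hypothetical helper for illustration; requires an active Spark session.
    """
    # Imported here so that defining the function does not itself need Spark.
    from pyspark.ml import PipelineModel

    # Restores both the feature hashing stage and the trained model.
    pipeline = PipelineModel.load(model_path)
    # Runs feature processing and prediction in one call on raw rows.
    return pipeline.transform(raw_df)
```

In this notebook it could be called as `load_and_score(MODEL_NAME, raw_test)`, which should produce the same kind of prediction columns as the in-memory model because the feature processing stage is stored with it.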


## References
\[1\] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3146–3154. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf <br>
\[2\] MML Spark: https://mmlspark.blob.core.windows.net/website/index.html <br>
\[3\] MML Spark Serving: https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md <br>
\[4\] The Criteo dataset: http://labs.criteo.com/wp-content/uploads/2015/04/dac_sample.tar.gz <br>
\[5\] LightGBM Parameter Tuning: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html <br>
