<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# LightGBM: A Highly Efficient Gradient Boosting Decision Tree
This notebook gives a quick example of how to train a LightGBM model on Spark and deploy it using MML Spark for a content personalization scenario.<br>
LightGBM \[1\] is a gradient boosting framework that uses tree-based learning algorithms.<br>
MML Spark \[2\] allows LightGBM to be called in a Spark environment, which provides several advantages:
- Distributed computation for model development
- Easy integration into existing Spark workflows
- Model serving through Spark Serving \[3\]

## Global Settings and Imports

In [1]:
import os
import sys
from tempfile import TemporaryDirectory
sys.path.append("../../")

import pyspark
from pyspark.ml.feature import FeatureHasher
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType
import requests

from reco_utils.common.spark_utils import start_or_get_spark
from reco_utils.common.notebook_utils import is_databricks
from reco_utils.dataset.criteo_dac import load_spark_df
from reco_utils.dataset.spark_splitters import spark_random_split

print("System version: {}".format(sys.version))
print("PySpark version: {}".format(pyspark.version.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
PySpark version: 2.3.1


In [2]:
# Setup MML Spark
if not is_databricks():
    spark = start_or_get_spark(packages=['Azure:mmlspark:0.16'])

from mmlspark import ComputeModelStatistics
from mmlspark import DiscreteHyperParam
from mmlspark import HyperparamBuilder
from mmlspark import LightGBMClassifier
from mmlspark import RandomSpace
from mmlspark import RangeHyperParam
from mmlspark import TuneHyperparameters

## Data Preparation
The Criteo Display Advertising Challenge (DAC) dataset \[4\] is a well-known industry benchmark for developing click-through rate (CTR) prediction models and is frequently used in research papers. The original dataset is too large for a lightweight demo, so we use a smaller sample instead.<br><br>
The sample consists of 100,000 rows with 1 label column and 39 feature columns: 13 integer columns (int00-int12) and 26 categorical columns (cat00-cat25).<br><br>
The meaning of the columns is not provided, but here we can treat the integer and categorical values as features representing the user and/or item content. The label is binary and indicates a user interaction with an item, so the dataset is well suited to demonstrating how to build a model that predicts the likelihood of a user interacting with an item based on user and item content features.


In [3]:
raw_data = load_spark_df(size='sample', spark=spark)

### Feature Processing
The feature data has many missing values across both integer and categorical fields. In addition, the categorical features have many distinct values, so effectively cleaning and representing the feature data is an important step prior to training a model.<br>
One of the simplest ways to handle both missing values and high cardinality is the hashing trick. The FeatureHasher transformer passes integer values through and hashes categorical features into a sparse vector of lower dimensionality that LightGBM can use effectively.<br>
Lastly, the dataset is split randomly (75% for training, 25% for testing).

In [4]:
columns = [c for c in raw_data.columns if c != 'label']
feature_processor = FeatureHasher(inputCols=columns, outputCol='features')
data = feature_processor.transform(raw_data)
train, test = spark_random_split(data, ratio=0.75, seed=42)
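
To make the hashing trick more concrete, here is a minimal pure-Python sketch of the idea. It is an illustration under stated assumptions, not Spark's implementation: `hash_features` and `NUM_BUCKETS` are hypothetical names, and MD5 stands in for the MurmurHash3 function that Spark uses, so the indices will not match those produced by FeatureHasher.

```python
import hashlib

NUM_BUCKETS = 16  # illustrative only; far smaller than a realistic feature space


def hash_features(row, num_buckets=NUM_BUCKETS):
    """Map a dict of raw features to a sparse {index: value} vector."""
    vector = {}
    for name, value in row.items():
        if value is None:
            # Missing values contribute nothing (an implicit zero).
            continue
        if isinstance(value, (int, float)):
            # Numeric feature: hash the column name, keep the value itself.
            key, weight = name, float(value)
        else:
            # Categorical feature: hash "column=value" and use an indicator of 1.0.
            key, weight = "{}={}".format(name, value), 1.0
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_buckets
        # Features that collide are summed into the same slot.
        vector[index] = vector.get(index, 0.0) + weight
    return vector


# Toy row loosely shaped like the sample data (values are made up).
example_row = {"int00": 3, "int01": None, "cat00": "68fd1e64", "cat01": None}
hashed = hash_features(example_row)
print(hashed)
```

Missing values need no special imputation in this scheme, and the output dimensionality is fixed by the number of buckets rather than by the number of distinct categories.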

## Model Training
In MML Spark, the LightGBM implementation for binary classification is invoked using the LightGBMClassifier class with the objective set to 'binary'. In this dataset the proportion of positive labels is quite low, so setting the isUnbalance flag to True helps account for the imbalance.<br>
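
The general idea behind handling imbalance is to give the rarer class more weight during training. The sketch below assumes a simple negative-to-positive count ratio as the weight for positive examples. It illustrates the concept only and is not a reproduction of LightGBM's internal weighting. `positive_class_weight` is a hypothetical helper and the labels are toy data.

```python
def positive_class_weight(labels):
    """Return a weight for positives that balances them against negatives."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # With this weight, the total weight of positives equals that of negatives.
    return n_neg / n_pos


# Toy labels: 25 positives and 75 negatives.
labels = [1] * 25 + [0] * 75
weight = positive_class_weight(labels)
print(weight)  # prints 3.0
```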

### Hyper-parameters
Key hyper-parameters \[5\] for the LightGBM classifier on Spark are the number of leaves in each tree (numLeaves), the number of training iterations (numIterations), the learning rate (learningRate), and the fraction of features used when training each tree (featureFraction). Lastly, early stopping (earlyStoppingRound) can be useful for halting training at the point where overfitting begins to occur.
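
As a rough illustration of how an early-stopping rule behaves, the sketch below assumes a higher-is-better validation metric recorded once per boosting iteration and stops once `patience` iterations pass without improvement. The function name and the scores are hypothetical; this is not the library's code.

```python
def early_stop(validation_scores, patience):
    """Return (stop_iteration, best_iteration) for a higher-is-better metric."""
    best_score = float("-inf")
    best_iter = -1
    for i, score in enumerate(validation_scores):
        if score > best_score:
            # New best score: remember it and reset the patience window.
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop here.
            return i, best_iter
    # Never triggered: training ran through all iterations.
    return len(validation_scores) - 1, best_iter


# Made-up validation scores that peak at iteration 2 and then degrade.
scores = [0.60, 0.65, 0.66, 0.655, 0.65, 0.64]
print(early_stop(scores, patience=2))  # prints (4, 2)
```

A larger patience value tolerates longer plateaus at the cost of extra iterations. With `patience=20` the same scores run to the final iteration.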

In [5]:
NUM_LEAVES = 64
NUM_ITERATIONS = 100
LEARNING_RATE = 0.15
FEATURE_FRACTION = 0.8
EARLY_STOPPING_ROUND = 20

In [6]:
lgbm = LightGBMClassifier(
    labelCol='label',
    featuresCol='features',
    objective='binary',
    isUnbalance=True,
    boostingType='gbdt',
    boostFromAverage=True,
    numLeaves=NUM_LEAVES,
    numIterations=NUM_ITERATIONS,
    learningRate=LEARNING_RATE,
    featureFraction=FEATURE_FRACTION,
    earlyStoppingRound=EARLY_STOPPING_ROUND,
)

### Model Training and Evaluation

In [7]:
model = lgbm.fit(train)

evaluator = (
    ComputeModelStatistics()
    .setScoredLabelsCol("prediction")
    .setLabelCol("label")
    .setEvaluationMetric("AUC")
)

predictions = model.transform(test)
evaluator.transform(predictions).show()

+---------------+------------------+
|evaluation_type|               AUC|
+---------------+------------------+
| Classification|0.6716842093722328|
+---------------+------------------+



### Model Tuning

MML Spark supports hyper-parameter tuning over a specified parameter space, from which values can be randomly sampled (or taken from a grid of options) across continuous or discrete ranges. TuneHyperparameters can apply n-fold cross-validation with the given evaluation metric to identify the best set of parameters for the model more robustly.
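
Conceptually, random search draws a number of configurations from the space, scores each one, and keeps the best. The following is a minimal pure-Python sketch of that loop, not MML Spark's implementation. `sample_params`, `random_search`, and `toy_evaluate` are hypothetical names, range bounds are treated as inclusive by assumption, and the toy objective stands in for a mean cross-validated AUC.

```python
import random


def sample_params(space, rng):
    """Draw one configuration: tuples are (low, high) ranges, lists are discrete choices."""
    params = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            low, high = spec
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)  # integer range
            else:
                params[name] = rng.uniform(low, high)  # continuous range
        else:
            params[name] = rng.choice(spec)  # discrete set
    return params


def random_search(space, evaluate, num_runs, seed=42):
    """Evaluate num_runs random configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_runs):
        params = sample_params(space, rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


space = {
    "learningRate": (0.001, 1.0),
    "numIterations": (10, 100),
    "numLeaves": [32, 64, 128],
}


def toy_evaluate(params):
    # Stand-in objective that prefers learning rates near 0.1.
    return -abs(params["learningRate"] - 0.1)


best_params, best_score = random_search(space, toy_evaluate, num_runs=10)
print(best_params, best_score)
```

Fixing the seed makes the sampled configurations reproducible, which helps when comparing tuning runs.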

In [None]:
params = (
    HyperparamBuilder()
    .addHyperparam(lgbm, lgbm.learningRate, RangeHyperParam(0.001, 1.0))
    .addHyperparam(lgbm, lgbm.numIterations, RangeHyperParam(10, 100))
    .addHyperparam(lgbm, lgbm.numLeaves, DiscreteHyperParam([32, 64, 128]))
).build()
paramSpace = RandomSpace(params).space()

tuner = TuneHyperparameters(
    evaluationMetric="AUC", 
    models=[lgbm], 
    numFolds=5,
    numRuns=10, 
    parallelism=1,
    paramSpace=paramSpace, 
    seed=42
)

bestModel = tuner.fit(train)

In [None]:
print(bestModel.getBestModelInfo())
print(bestModel.getBestModel())

predictions = bestModel.transform(test)
evaluator.transform(predictions).show()

## Model Saving and Loading
The model can be saved and reloaded for use in another workflow.

In [8]:
with TemporaryDirectory() as tmp:
    save_file = os.path.join(tmp, r'finished.model')
    model.save(save_file)
    loaded_model = model.load(save_file)

In [9]:
# Re-evaluate performance using the reloaded model
predictions = loaded_model.transform(test)
evaluator.transform(predictions).show()

+---------------+------------------+
|evaluation_type|               AUC|
+---------------+------------------+
| Classification|0.6716842093722328|
+---------------+------------------+



## Model Deployment
MML Spark provides an easy way to spin up a server, built on top of Spark Streaming DataFrames, for deploying trained models. In this example, the server reads a request, parses it into the same format as the original raw data, applies feature processing, and then computes the probability of engagement given the user and item features provided. This probability is written back as the response to the original request.<br><br>
Content-based personalization can be accomplished by using this engagement prediction service as the key machine-learning component inside a larger system. To personalize content for a user, a set of items is selected for evaluation and item-content features are extracted for each. These item features can be combined with the user features for each user-item combination and sent to the engagement prediction service, which evaluates the probability that the user will engage with each item. The probabilities can then be used to rank the items and select the top-k results.

In [11]:
# Define spark serving input
input_df = (
    spark.readStream.server()
    .address("localhost", 8089, "predict")
    .load()
    .parseRequest(raw_data.schema)
)

# Process features and make predictions
get_pos_prob = udf(lambda x: float(x[1]))

processed_df = feature_processor.transform(input_df)
output_df = (
    loaded_model.transform(processed_df)
    .withColumn('p_eng', get_pos_prob(col('probability')).cast(FloatType()))
    .makeReply("p_eng")
)

# Define spark serving output and start server
checkpoint = TemporaryDirectory()
server = (
    output_df.writeStream.server()
    .replyTo("predict")
    .queryName("prediction")
    .option("checkpointLocation", "file://{}".format(checkpoint.name))
    .start()
)

In [12]:
server.status

{'message': 'Waiting for data to arrive',
 'isDataAvailable': False,
 'isTriggerActive': False}

In [13]:
query = raw_data.limit(1).collect()[0].asDict()
r = requests.post(data=query, url="http://localhost:8089/predict")
print("Response {}".format(r.text))

Response {"p_eng":0.16379395}
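
The ranking step described in the deployment overview could be sketched as follows. This is a hedged illustration: `rank_items` and `fake_score` are hypothetical, the feature names are made up, and the scoring function is injected so the example runs without a live server.

```python
def rank_items(user_features, items, score_fn, k):
    """Score every user-item pair and return the ids of the k highest-scoring items."""
    scored = []
    for item_id, item_features in items.items():
        # Combine user and item features into a single request payload.
        payload = dict(user_features)
        payload.update(item_features)
        scored.append((score_fn(payload), item_id))
    # Sort by descending score; ties are broken by item id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:k]]


def fake_score(payload):
    # Stand-in for a call to the engagement prediction service.
    return payload["item_quality"] * payload["user_affinity"]


user = {"user_affinity": 0.5}
items = {
    "a": {"item_quality": 0.2},
    "b": {"item_quality": 0.9},
    "c": {"item_quality": 0.6},
}
top = rank_items(user, items, fake_score, k=2)
print(top)  # prints ['b', 'c']
```

In a real system, `score_fn` would presumably wrap a request like the `requests.post` call above and read the `p_eng` field from the response.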


In [18]:
# Cleanup
server.stop()
checkpoint.cleanup()
server.status

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}