### MLFlow Model Selection Use Case: Claims Prediction Model

In this notebook, we build on the Claims Prediction Model Tracking use case and extend it to show how to pick the best model produced across multiple runs. Once the best model has been selected, we also "download" its prediction function from the tracking server and run predictions against data the model has not seen before.

> **NOTE:** New to MLFlow? Head over to the [Beginners Guide to MLFlow Concepts notebook](https://eastus2.azuredatabricks.net/?o=3428697504158163#notebook/789425822738632/) and [MLFlow Tracking Use Case: Claims Prediction Model](https://eastus2.azuredatabricks.net/?o=3428697504158163#notebook/789425822738708) to learn the prerequisites.

### Load Required Packages

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col
import mlflow
from mlflow import create_experiment
from mlflow import start_run
from mlflow import log_params
import mlflow.spark  # provides mlflow.spark.log_model, used below to log the best model
from mlflow import log_metric
from mlflow import end_run

### Querying and Preprocessing

> To reduce commentary, we have bundled multiple steps into a single command cell. For a more detailed explanation, please refer to the [MLFlow Tracking Use Case: Claims Prediction Model](https://eastus2.azuredatabricks.net/?o=3428697504158163#notebook/789425822738708).

In [0]:
# ============================================================================================================
# Specify predictor and response variables

numeric_X = ["accident_year", 
      "claim_trigger_year"]
string_X = ["claim_status", 
      "claim_cause_of_loss", 
      "claim_loss_location_stateprovince", 
      "claim_loss_location_country", 
      "underwriting_unit"]

y = ["total_paid_net_act_usd"]

# ============================================================================================================
# Pull data (only as specified in needed variables)

query_skeleton = "SELECT {data_vars} FROM dw_xle_pz.t_cps_claims WHERE original_currency = 'USD' LIMIT 10000"
uc_DF = spark.sql(query_skeleton.format(data_vars = ", ".join(numeric_X + string_X + y)))
print("Shape of queried data: ({rows}, {cols})".format(rows = uc_DF.count(), cols = len(uc_DF.columns)))

# ============================================================================================================
# Clean up nulls in dataset

# Treat empty strings in the string columns as nulls, then drop every row that contains a null
uc_DF = uc_DF.na.replace("", None, subset=string_X)
uc_DF = uc_DF.dropna()
print("Shape of null-removed data: ({rows}, {cols})".format(rows = uc_DF.count(), cols = len(uc_DF.columns)))

# ============================================================================================================

### Index + Encode String Columns and Assemble Feature Set

In [0]:
encoders = []
encoded_vars = []
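for col_name in string_X:
  # Loop variable is named col_name so it does not shadow the imported pyspark `col` function.
  # Index each string column directly on uc_DF (this step runs outside the Pipeline).
  uc_DF = (StringIndexer(inputCol=col_name, 
                         outputCol=col_name + "_ix", 
                         handleInvalid="skip")
           .fit(uc_DF)
           .transform(uc_DF))
  # One-hot encode the indexed column as a Pipeline stage
  encoders.append(OneHotEncoder(inputCol=col_name + "_ix", 
                                outputCol=col_name + "_enc"))
  encoded_vars.append(col_name + "_enc")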
  
assembler = (VectorAssembler()
             .setInputCols(numeric_X + encoded_vars)
             .setOutputCol("features"))

classifier = LinearRegression(featuresCol="features", 
                              labelCol=y[0], 
                              maxIter=10)

pipeline = Pipeline(stages=encoders + [assembler, classifier])
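
To make the index-then-encode step above more concrete, here is a minimal plain-Python sketch of what the two transformations do to a single string column. It is not Spark code. It assumes the default behaviours as we understand them: `StringIndexer` gives index 0 to the most frequent label, and `OneHotEncoder` drops the last category so that it is encoded as all zeros. The helper names `string_index` and `one_hot` and the sample `claim_status` values are hypothetical, and Spark emits sparse vectors rather than Python lists.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here; treat that as an assumption.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a list of 0.0/1.0 values.

    With drop_last=True the list has num_categories - 1 slots, so the
    last category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical claim_status values, for illustration only
statuses = ["open", "closed", "closed", "reopened", "closed", "open"]

mapping = string_index(statuses)
encoded = [one_hot(mapping[s], len(mapping)) for s in statuses]

for status, vector in zip(statuses, encoded):
    print(status, mapping[status], vector)
```

In the notebook, the `VectorAssembler` then concatenates these encoded vectors with the numeric columns into the single `features` column that `LinearRegression` consumes.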

### Create a Parameter Grid to track how model validation changes on MLFlow

In [0]:
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.elasticNetParam, [0.6, 0.85, 0.2, 0.05])
             .addGrid(classifier.regParam, [0.3, 0.7, 0.5, 0.1])
             .build())

# ============================================================================================================
# Train-validation split (3:1 ratio): 75% of the rows train each candidate, 25% score it
trainValidSplit = TrainValidationSplit(estimator=pipeline, 
                                       estimatorParamMaps=paramGrid, 
                                       evaluator=RegressionEvaluator(labelCol=y[0], 
                                                                     predictionCol="prediction", 
                                                                     metricName="rmse"), 
                                       trainRatio=0.75, 
                                       parallelism=4)
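
The grid and the train-validation split above can be pictured with a small plain-Python sketch. It is an illustration of the idea, not the Spark implementation: the grid is every pairing of the two parameter lists (4 x 4 = 16 candidates), each candidate is scored by RMSE on the held-out 25%, and the lowest score wins. The helpers `rmse` and `select_best` and the scores passed to them are hypothetical.

```python
from itertools import product
from math import sqrt

# Same values as in the ParamGridBuilder cell above
elastic_net_values = [0.6, 0.85, 0.2, 0.05]
reg_values = [0.3, 0.7, 0.5, 0.1]

# Every pairing of the two lists: 4 x 4 = 16 candidate settings
grid = [
    {"elasticNetParam": e, "regParam": r}
    for e, r in product(elastic_net_values, reg_values)
]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def select_best(candidates, scores):
    """Return the (candidate, score) pair with the lowest score."""
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


print(len(grid))

# Hypothetical validation scores for the first three candidates
best_params, best_score = select_best(grid[:3], [3.0, 1.5, 2.0])
print(best_params, best_score)
```

Because RMSE is an error measure, lower is better, which is why the run later logs `min(model.validationMetrics)` as `bestRMSE`.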

### Train a `LinearRegression` model and track prediction accuracies with MLFlow Runs
> **NOTE:** This is where this notebook deviates from the previous version of the use case.

> 1. We explicitly create the experiment instead of letting Databricks autogenerate it. This lets us control the experiment's identifier and makes it possible to "query" all of its runs (we will get to that in subsequent cells).

> 2. During the run, we now log the model itself (look for the `log_model()` call). This not only tracks the performance of the model within each run, but also saves the model's predictor function. Later, once we settle on the "best model", we can download its predictor directly and use it to predict on unseen data.

In [0]:
# Create a new experiment and track its ID
# (create_experiment raises an error if an experiment with this name already exists)
experiment_id = create_experiment(name="/Users/souradeep.sinha@axaxl.com/claims_sample_model_selection")

# Start Run
with start_run(experiment_id=experiment_id):
  
  # Fit the model. This will automatically track metrics and parameters
  model = trainValidSplit.fit(uc_DF)
  
  # Log an extra metric
  log_metric("bestRMSE", min(model.validationMetrics))
  
  # Log the best model to this run
  mlflow.spark.log_model(model.bestModel, "model-file")   
  
  # The `with` block ends the run on exit, so no explicit end_run() call is needed

> **NOTE:** 

> Experiments are tracked under user accounts. In the previous cell, the author of this notebook created the experiment under his own account path. Any other user must supply their own path for the experiment to be created successfully.

> Because we create the experiment explicitly, all experiment activity is "detached" from the current notebook, so the **`Runs`** panel will not reflect any activity. If you look in your **`Home`** folder instead, you should find the experiment under the same name, with all activity tracked and intact.
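
As a rough sketch of how the "best model" could later be located and downloaded, the snippet below picks the run with the lowest `bestRMSE` and builds the model URI for the `model-file` artifact logged above. The helpers `best_run`, `model_uri` and `load_best_model` are hypothetical names, and the example run records are made up. The sketch assumes that `mlflow.search_runs` returns one row per run with a `run_id` column and a `metrics.bestRMSE` column, and that runs which never logged the metric show up as missing or NaN. `load_best_model` is only defined here, not called, because it needs a tracking server and a Spark session.

```python
import math


def best_run(runs, metric_key="metrics.bestRMSE"):
    """Return the run record with the smallest metric, skipping runs that lack it."""
    scored = [
        r for r in runs
        if r.get(metric_key) is not None and not math.isnan(r[metric_key])
    ]
    if not scored:
        raise ValueError(f"no run has a value for {metric_key}")
    return min(scored, key=lambda r: r[metric_key])


def model_uri(run_id, artifact_path="model-file"):
    """Build a runs:/ URI pointing at the artifact path used in log_model above."""
    return f"runs:/{run_id}/{artifact_path}"


def load_best_model(experiment_id):
    """Assumed workflow; requires MLflow, a tracking server and Spark."""
    import mlflow
    import mlflow.spark

    runs_df = mlflow.search_runs(experiment_ids=[experiment_id])
    best = best_run(runs_df.to_dict("records"))
    return mlflow.spark.load_model(model_uri(best["run_id"]))


# Hypothetical run records, shaped like rows of the search_runs result
example_runs = [
    {"run_id": "run-a", "metrics.bestRMSE": 1250.0},
    {"run_id": "run-b", "metrics.bestRMSE": float("nan")},
    {"run_id": "run-c", "metrics.bestRMSE": 980.5},
]

chosen = best_run(example_runs)
print(chosen["run_id"], model_uri(chosen["run_id"]))
```

Keeping the selection logic separate from the MLflow calls makes it easy to check on plain data before pointing it at a real experiment.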

> **NOTE:** Interested in learning more about MLFlow? There's a wealth of documentation under [MLFlow](https://www.mlflow.org/docs/latest/tracking.html) and [Databricks](https://docs.databricks.com/applications/mlflow/quick-start.html) that you can review and use according to your needs.

> For questions, please reach out to the [Data Analytics Workbench email](mailto:EDS-DEEP-Workbench-Business-Support@axaxl.com) or our [support channel on Teams](https://teams.microsoft.com/l/channel/19%3afec947f0c9144bcbad26b5a245e0080e%40thread.skype/Ask%2520Workbench%2520Support?groupId=18b5e01c-2032-4dbb-ab5a-8aeb6f79968f&tenantId=53b7cac7-14be-46d4-be43-f2ad9244d901).