# Machine Learning Model Creation in Splice Machine
#### Starting the Spark Session

In [2]:
# Setup
from pyspark.sql import SparkSession
from splicemachine.spark import PySpliceContext

spark = SparkSession.builder.getOrCreate()
splice = PySpliceContext(spark)

## Importing MLflow Support
<blockquote><p class='quotation'><span style='font-size:15px'> As explained in <a href='./7.2 Splice MLflow Support.ipynb'>7.2 Splice MLflow Support</a>, using MLflow on Splice Machine is extremely easy. Refer to our <a href='https://pysplice.readthedocs.io/en/dbaas-4100/splicemachine.mlflow_support.html'>documentation</a> for the available functionality.<br><footer>Splice Machine</footer>
</blockquote>

In [4]:
# MLFlow Setup
from splicemachine.mlflow_support import *
mlflow.register_splice_context(splice)

## Starting an experiment
<blockquote><p class='quotation'><span style='font-size:15px'> Here we'll begin an experiment to keep track of our modeling efforts for this prediction task.<footer>Splice Machine</footer>
</blockquote>

In [8]:
mlflow.set_experiment('model_creation_demo')

INFO: 'model_creation_demo' does not exist. Creating a new experiment


## Starting a run
<blockquote><p class='quotation'><span style='font-size:15px'> Here we'll begin a run to keep track of our modeling efforts in this notebook specifically.<footer>Splice Machine</footer>
</blockquote>

In [10]:
#start our first MLFlow run
from datetime import datetime

tags = {'team': 'Splice Machine', 'purpose': 'fraud DEMO'}
mlflow.start_run(tags=tags, run_name='RF_run')

<ActiveRun: >

## Ingesting Data
<blockquote><p class='quotation'><span style='font-size:15px'> Ingesting the table created in <a href='./7.3 Data Exploration.ipynb'>7.3 Data Exploration</a>, we will begin constructing a very simple Machine Learning Model. <footer>Splice Machine</footer>
</blockquote>

In [12]:
sql_query = "SELECT * FROM cc_fraud_data"
df = splice.df(sql_query)

## Logging our first Parameter 
<blockquote><p class='quotation'><span style='font-size:15px'> We're utilizing MLFlow to keep track of the query we used to ingest the data for this modeling effort. <footer>Splice Machine</footer>
</blockquote>

In [13]:
# Logging our first parameter: the query we used to ingest our data
mlflow.log_param("ingest_query", sql_query)

## Selecting Our Features
<blockquote>Here we'll select only the features most strongly correlated with our target.<footer>Splice Machine</footer>
</blockquote>

In [9]:
import pandas as pd
# Sample up to 900 class-0 rows and 100 class-1 rows for the correlation analysis
pdf = pd.concat([
    df.filter(df.CLASS_RESULT == 0).limit(900).toPandas(),
    df.filter(df.CLASS_RESULT == 1).limit(100).toPandas(),
])
pdf = pdf.apply(pd.to_numeric)
corr = pdf.corr()

most_correlated = corr.abs()['CLASS_RESULT'].sort_values(ascending=False).reset_index()
most_correlated = most_correlated.iloc[1:].rename({"index":"feature","CLASS_RESULT":"correlation_to_target"}, axis = 1)
print(most_correlated)

                          feature  correlation_to_target
1                     TIME_OFFSET               0.861746
2    ROLLING_AVG_DAILY_TRANS_AMNT               0.834575
3                  MACD_TRANS_CNT               0.826945
4   ROLLING_AVG_WEEKLY_TRANS_AMNT               0.752052
5     ROLLING_AVG_DAILY_TRANS_CNT               0.739858
6        EXPECTED_DAILY_TRANS_CNT               0.708627
7                AROON_TRANS_AMNT               0.705022
8       EXPECTED_DAILY_TRANS_AMNT               0.696459
9                   RSI_TRANS_CNT               0.683260
10                DAILY_TRANS_CNT               0.663161
11     EXPECTED_WEEKLY_TRANS_AMNT               0.626029
12   ROLLING_AVG_WEEKLY_TRANS_CNT               0.567841
13                AROON_TRANS_CNT               0.545996
14      EXPECTED_WEEKLY_TRANS_CNT               0.538239
15               WEEKLY_TRANS_CNT               0.513323
16              WEEKLY_TRANS_AMNT               0.475171
17               DAILY_TRANS_AM

In [17]:
CORRELATION_CUTOFF = 0.05
#Logging this in mlflow
mlflow.log_param("correlation_cutoff", CORRELATION_CUTOFF)

feature_cols = list(most_correlated[most_correlated['correlation_to_target']>CORRELATION_CUTOFF]['feature'])
print(feature_cols)

['TIME_OFFSET', 'ROLLING_AVG_DAILY_TRANS_AMNT', 'MACD_TRANS_CNT', 'ROLLING_AVG_WEEKLY_TRANS_AMNT', 'ROLLING_AVG_DAILY_TRANS_CNT', 'EXPECTED_DAILY_TRANS_CNT', 'AROON_TRANS_AMNT', 'EXPECTED_DAILY_TRANS_AMNT', 'RSI_TRANS_CNT', 'DAILY_TRANS_CNT', 'EXPECTED_WEEKLY_TRANS_AMNT', 'ROLLING_AVG_WEEKLY_TRANS_CNT', 'AROON_TRANS_CNT', 'EXPECTED_WEEKLY_TRANS_CNT', 'WEEKLY_TRANS_CNT', 'WEEKLY_TRANS_AMNT', 'DAILY_TRANS_AMNT', 'CREDIT_SCORE', 'CURRENT_BALANCE', 'ADX_TRANS_CNT', 'CREDIT_LIMIT', 'ADX_TRANS_AMNT', 'MACD_BALANCE', 'RSI_BALANCE', 'AROON_BALANCE', 'ROLLING_AVG_BALANCE', 'MACD_TRANS_AMNT', 'ADX_BALANCE', 'RSI_TRANS_AMNT']
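
The ranking and cutoff steps above can be sketched without pandas. This is a minimal illustration, assuming plain Python lists as columns; the helper names `pearson` and `select_features` and the toy data are hypothetical and are not part of the notebook or of any Splice Machine API. Unlike `DataFrame.corr()`, which yields `NaN` for a constant column, this sketch treats such a column as having zero correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, cutoff):
    """Names of columns whose |correlation| with the target exceeds cutoff, strongest first."""
    scores = {
        name: abs(pearson(values, columns[target]))
        for name, values in columns.items()
        if name != target
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > cutoff]


# Hypothetical toy data, not taken from cc_fraud_data
toy_columns = {
    "CLASS_RESULT": [0, 1, 0, 1],
    "STRONG_SIGNAL": [1.0, 9.0, 2.0, 8.0],
    "MODERATE_SIGNAL": [1.0, 2.0, 1.0, 1.0],
    "NO_SIGNAL": [1.0, 1.0, 2.0, 2.0],
    "CONSTANT": [3.0, 3.0, 3.0, 3.0],
}
selected = select_features(toy_columns, "CLASS_RESULT", cutoff=0.05)
print(selected)
```

Because the absolute value is taken before filtering, strongly negative correlations are kept as well, which matches the `corr.abs()` call in the cell above.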


## Defining a Machine Learning Pipeline

<blockquote>We'll use Spark's <code>Pipeline</code> class to define a sequence of stages that prepare the dataset and fit the model.<br>
We'll then use <code>mlflow</code> to <code>log</code> our Pipeline stages. Both <code>log_pipeline_stages</code> and <code>log_feature_transformations</code> are custom Splice Machine functions for tracking Spark Pipelines. </blockquote>

In [21]:
%%time
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml import Pipeline,PipelineModel
from pyspark.ml.classification import RandomForestClassifier, MultilayerPerceptronClassifier

"""
The preprocessing stages for this example are: 
1) Vector assembling the feature columns 
2) Standardizing our feature columns
"""
max_depth = 5  
num_trees = 20

assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
scaler = StandardScaler(inputCol="features", outputCol='scaledFeatures')
rf = RandomForestClassifier(featuresCol = 'scaledFeatures', labelCol = 'CLASS_RESULT', maxDepth = max_depth, numTrees = num_trees)

# Pipeline to preprocess and model our data
mlpipe = Pipeline(stages=[assembler,scaler, rf])

# Custom Splice functions to add granularity and governance to your Spark Pipeline Models
mlflow.log_pipeline_stages(mlpipe)
mlflow.log_feature_transformations(mlpipe)

CPU times: user 91.8 ms, sys: 4.56 ms, total: 96.4 ms
Wall time: 1.14 s
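
To make the scaling stage concrete, here is a small sketch of what standardizing a single column involves. It assumes the behavior Spark documents for `StandardScaler` with default settings: each feature is divided by its corrected sample standard deviation, features are not centered unless `withMean` is enabled, and a zero-variance feature maps to 0.0. The function `scale_to_unit_std` and the sample values are hypothetical.

```python
import statistics


def scale_to_unit_std(column, with_mean=False):
    """Scale one numeric column to unit sample standard deviation.

    with_mean=False mirrors the assumed default of no centering.
    """
    std = statistics.stdev(column)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        return [0.0 for _ in column]  # zero-variance feature maps to 0.0
    center = statistics.mean(column) if with_mean else 0.0
    return [(value - center) / std for value in column]


# Hypothetical transaction amounts
amounts = [10.0, 20.0, 30.0]
scaled = scale_to_unit_std(amounts)
centered = scale_to_unit_std(amounts, with_mean=True)
print(scaled)
print(centered)
```

Tree-based models such as random forests split on feature thresholds, so this kind of rescaling generally matters less for them than for distance-based or gradient-based models; the stage is still useful if the classifier is later swapped out.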


## Separating our data for performance evaluation 
<blockquote> We are using a simple, single train/test split to assess the performance of our simple model. Of note, we have not investigated the class balance, and we are using untuned hyperparameters to predict the target variable. These can be adjusted as an exercise. <footer>Splice Machine</footer>
</blockquote>

In [19]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel
from splicemachine.stats import *

#splitting our data into a training and testing set
(train, test) = df.randomSplit([0.8, 0.2])

mlflow.lp("train_ratio", 0.80)
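
A weighted random split of this kind assigns each row independently, so the 80/20 ratio is approximate and the result changes between executions unless a seed is fixed (Spark's `randomSplit` accepts an optional `seed` argument for this). The sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper and is not how Spark implements the operation internally.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split at random, in proportion to the normalized weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1)
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                splits[index].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point rounding in the last bound
    return splits


rows = list(range(1000))  # hypothetical row identifiers
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

Logging the seed alongside `train_ratio` would make a run easier to reproduce.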

## Fitting our model 
<blockquote> Training our model and logging execution time using Splice's custom <code>with mlflow.timer('timer_name')</code> block function to track the time it takes to complete a block. Everything in the block will be timed, and then logged to mlflow under the timer name provided to the function. <footer>Splice Machine</footer>
</blockquote>

In [23]:
with mlflow.timer('training'):
    fitted_model = mlpipe.fit(train)
# Log the parameters of the fitted model
mlflow.log_model_params(fitted_model)
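
For readers curious how a timing block like this can work, the following sketch builds a comparable context manager with the standard library. It is an illustration only and not Splice Machine's implementation: the `timer` function here and the `logged_metrics` dictionary (a stand-in for the tracking server) are hypothetical.

```python
import time
from contextlib import contextmanager

logged_metrics = {}  # hypothetical stand-in for metrics sent to a tracking server


@contextmanager
def timer(name):
    """Time the enclosed block and record the elapsed seconds under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the block raises
        logged_metrics[name] = time.perf_counter() - start


with timer('training'):
    # Placeholder workload standing in for mlpipe.fit(train)
    checksum = sum(i * i for i in range(100_000))

print(logged_metrics)
```

Using `try`/`finally` means a failed fit still leaves a duration behind, which can help when diagnosing slow or interrupted runs.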


## Assessing our Model Performance
<blockquote> Making predictions on the test set, evaluating performance, and logging this to MLFlow. <footer>Splice Machine</footer>
</blockquote>

In [29]:
#Inference
predictions = fitted_model.transform(test)

#Performance Evaluation
binary_evaluator = SpliceBinaryClassificationEvaluator(spark, labelCol = "CLASS_RESULT")
binary_evaluator.input(predictions)
performance_metrics = binary_evaluator.get_results(as_dict = True)

#Logging Performance
mlflow.log_metrics(performance_metrics)

Current areaUnderROC: 0.8835821074497563
Current areaUnderPR: 0.7292677480397618
+-----+----+-------+
|     |True|  False|
+-----+----+-------+
| True|33.0|    7.0|
|False|10.0|25205.0|
+-----+----+-------+
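
The confusion matrix above can be turned into precision, recall, F1, and accuracy with a few lines of arithmetic. The printed table does not say which axis is predicted and which is actual, so this sketch assumes rows are predictions and columns are actual labels (33 true positives, 7 false positives, 10 false negatives, 25205 true negatives). If the axes are the other way around, precision and recall swap while F1 and accuracy stay the same. The function `confusion_metrics` is a hypothetical helper.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / total,
    }


# Counts read from the table above, under the stated axis assumption
metrics = confusion_metrics(tp=33.0, fp=7.0, fn=10.0, tn=25205.0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With so few positive rows in the test set, accuracy is dominated by the negative class, so precision, recall, and the area under the PR curve are the more informative numbers here.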



## Logging Artifacts of this Run
<blockquote> We can store the notebook associated with a particular run as well as the fitted model created by this run <footer>Splice Machine</footer>
</blockquote>

In [30]:
# Store the notebook for easy retrieval
mlflow.log_artifact('7.5 Model Creation.ipynb', 'training_notebook')
#Log the best model
mlflow.log_model(fitted_model, 'rf_model')

Saving artifact of size: 18.99 KB to Splice Machine DB
Saving artifact of size: 80.927 KB to Splice Machine DB


## Finish our run
<blockquote>Now we'll end our run, and view the results in the <a href="/mlflow">MLFlow UI</a>. We can look at our different runs, the parameters, metrics, tags and artifacts logged, and download our notebook directly. You'll know the run is complete from the small green check mark on the leftmost side of the run.</blockquote>

In [31]:
mlflow.end_run()

# Fantastic!
<blockquote> 
This notebook shows how our platform can be used to train and evaluate machine learning models! <br>
    Next Up: <a href='./7.6 Data Exploration.ipynb'>Using MLManager to Deploy Machine Learning Models</a>
<footer>Splice Machine</footer>
</blockquote>