# Let's Talk About Model Deployment
<blockquote>
    Model deployment is extremely important and often overly complex. We think that it should be as easy and straightforward as possible for Data Scientists to hand off their work to the Application Developers. We see that handoff happening in the database itself: straightforward APIs for the Data Scientists, and simple SQL statements for the Application Developers. When a model is deployed to the database, a table is created, and as soon as data enters that table, predictions are made immediately.<footer>Splice Machine</footer>
</blockquote><br><br>
<center><img class='log' src='https://splice-demo.s3.amazonaws.com/Database+Deployment.png' width='30%' style='z-index:5'></center><br><br>


#### Let's take a look at deploying some simple models to the database

In [None]:
# Setup
from pyspark.sql import SparkSession
from splicemachine.spark import PySpliceContext
from splicemachine.mlflow_support import *

spark = SparkSession.builder\
        .config('spark.dynamicAllocation.enabled','false')\
        .config('spark.executor.instances',2)\
        .getOrCreate()
splice = PySpliceContext(spark)
mlflow.register_splice_context(splice)

In [None]:
import warnings
warnings.simplefilter("ignore")
warnings.filterwarnings("ignore")

In [None]:
help(mlflow.deploy_db)

# Model Choice
<blockquote>
    With our MLManager API, we've abstracted away the model itself and made our functions model agnostic. Functions like <code>log_model</code> and <code>load_model</code> take any supported model type and handle the rest under the hood.<footer>Splice Machine</footer>
</blockquote>
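
To make "model agnostic" concrete, here is a minimal sketch of one common way to write a single function that accepts several model types: dispatching on the type of the object passed in. This is only an illustration of the idea and is not a description of how MLManager is implemented. The classes `SklearnLikeModel` and `SparkLikeModel` and the function `describe_model` are hypothetical names invented for this example.

```python
from functools import singledispatch


class SklearnLikeModel:
    """Hypothetical stand-in for a scikit-learn estimator."""


class SparkLikeModel:
    """Hypothetical stand-in for a Spark ML pipeline model."""


@singledispatch
def describe_model(model):
    # Fallback used when no handler is registered for the model's type
    raise TypeError(f'Unsupported model type: {type(model).__name__}')


@describe_model.register(SklearnLikeModel)
def _(model):
    return 'sklearn'


@describe_model.register(SparkLikeModel)
def _(model):
    return 'spark'


# The caller uses one function name regardless of the model library
print(describe_model(SklearnLikeModel()))
print(describe_model(SparkLikeModel()))
```

The caller never has to say which library a model came from; the function picks the right handler from the object's type.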

## We'll try it out with SKLearn, Spark and H2O
<blockquote>
   Because we're focusing on model deployment specifically, we will skip the logging of parameters and metrics etc. For more information on that, see some of our other <a href='./Examples'>MLManager tutorials</a><footer>Splice Machine</footer>
</blockquote>

In [None]:
%%sql
-- Ingest Data
drop table if exists cc_fraud_data; 
create table cc_fraud_data (
    time_offset integer,
    expected_weekly_trans_cnt double,
    expected_weekly_trans_amnt double,
    expected_daily_trans_cnt double,
    expected_daily_trans_amnt double,
    weekly_trans_cnt double,
    weekly_trans_amnt double,
    daily_trans_cnt double,
    daily_trans_amnt double,
    rolling_avg_weekly_trans_cnt double,
    rolling_avg_weekly_trans_amnt double,
    rolling_avg_daily_trans_cnt double,
    rolling_avg_daily_trans_amnt double,
    MACD_trans_amnt double,
    MACD_trans_cnt double,
    RSI_trans_amnt double,
    RSI_trans_cnt double,
    Aroon_trans_amnt double,
    Aroon_trans_cnt double,
    ADX_trans_amnt double,
    ADX_trans_cnt double,
    current_balance double,
    rolling_avg_balance double,
    MACD_balance double,
    Aroon_balance double,
    RSI_balance double,
    ADX_balance double,
    credit_score double,
    credit_limit double,
    amount decimal(10,2),
    class_result int
);

call SYSCS_UTIL.IMPORT_DATA (
     null,
     'cc_fraud_data',
     null,
     's3a://splice-demo/kaggle-fraud-data/creditcard.csv',
     ',',
     null,
     null,
     null,
     null,
     -1,
     's3a://splice-demo/kaggle-fraud-data/bad',
     null, 
     null);
     

In [None]:
# Set our Experiment
mlflow.set_experiment('simple model deployment')

## SKLearn
#### Build our SKLearn Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import mean_squared_error
from splicemachine.mlflow_support.utilities import get_user
import pandas as pd
import numpy as np

df = splice.df('select EXPECTED_WEEKLY_TRANS_CNT,WEEKLY_TRANS_CNT,RSI_TRANS_AMNT,AMOUNT, class_result from cc_fraud_data').limit(10000).toPandas()
# Split into train/test
train, test = train_test_split(df, test_size=0.2)

# Train, save and deploy
with mlflow.start_run(run_name='SKlearn'):
    model = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0)
    X_train,y_train = train[train.columns[:-1]], train[train.columns[-1]]
    y_train = y_train.map(lambda x: int(x)) # So the model outputs int format
    X_test,y_test = test[test.columns[:-1]], test[test.columns[-1]]
    
    model.fit(X_train,y_train)
    print('MSE:', mean_squared_error(y_test, model.predict(X_test)))
    run_id = mlflow.current_run_id()
    # Save the model for deployment or later use
    mlflow.log_model(model, 'sklearn_model')
    
# Deploy the model
schema = get_user()
splice.dropTableIfExists(f'{schema}.sklearn_model')
jid = mlflow.deploy_db(schema, 'sklearn_model', run_id, primary_key={'MOMENT_KEY': 'INT'}, df=df, model_cols=list(X_train.columns), 
                 create_model_table=True)
mlflow.watch_job(jid)

### That's it! 
<blockquote>
    It really was that easy. You may be thinking "well, now what? How do I see my model in action?" That's a great question, and that's easy too. If you look at the output above, you can see that a table called <code>SKLEARN_MODEL</code> was created for you with the columns of your model, the primary key you provided, and an extra column for the prediction.<br>
    To invoke your model, simply insert a row of data.<footer>Splice Machine</footer>
</blockquote>
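
If the idea of "insert a row, get a prediction" feels abstract, the following sketch reproduces the pattern on a small scale with Python's built-in `sqlite3` module. It is an analogy only: it assumes nothing about how Splice Machine wires a model into a table. The table `toy_model`, the trigger `toy_model_predict`, and the rule inside `toy_predict` are all hypothetical and stand in for a real trained model.

```python
import sqlite3


def toy_predict(expected_cnt, weekly_cnt, rsi_amnt, amount):
    """Hypothetical stand-in for a trained model: flag very large amounts."""
    return 1 if amount > 1000 else 0


conn = sqlite3.connect(':memory:')
# Allow user-defined functions inside triggers (ignored by older SQLite versions)
conn.execute('pragma trusted_schema = on')
conn.create_function('toy_predict', 4, toy_predict)

conn.executescript("""
create table toy_model (
    moment_key integer primary key,
    expected_weekly_trans_cnt real,
    weekly_trans_cnt real,
    rsi_trans_amnt real,
    amount real,
    prediction integer
);

-- After every insert, fill in the prediction column for the new row
create trigger toy_model_predict after insert on toy_model
begin
    update toy_model
    set prediction = toy_predict(new.expected_weekly_trans_cnt,
                                 new.weekly_trans_cnt,
                                 new.rsi_trans_amnt,
                                 new.amount)
    where moment_key = new.moment_key;
end;
""")

insert_sql = (
    'insert into toy_model (expected_weekly_trans_cnt, weekly_trans_cnt, '
    'rsi_trans_amnt, amount, moment_key) values (?, ?, ?, ?, ?)'
)
conn.execute(insert_sql, (1.5, 2.2, 2.5, 4.4, 1))
conn.execute(insert_sql, (2342.7, 3334.0, -23.1, 11010.9, 2))

rows = conn.execute(
    'select moment_key, prediction from toy_model order by moment_key'
).fetchall()
print(rows)  # [(1, 0), (2, 1)]
```

The application code only issues plain `insert` and `select` statements; it never calls the model directly.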

#### Let's use the model

In [None]:
%%sql
insert into sklearn_model (expected_weekly_trans_cnt, weekly_trans_cnt, rsi_trans_amnt, amount, moment_key) values (1.5, 2.2, 2.5, 4.4, 1);
insert into sklearn_model (expected_weekly_trans_cnt, weekly_trans_cnt, rsi_trans_amnt, amount, moment_key) values (2342.7, 3334.0, -23.1, 11010.9, 2);

#### View results

In [None]:
%%sql
select * from sklearn_model;

### Pretty Cool!
<blockquote>As you can see, the <code>deploy_db</code> function created the table and injected your ML model directly inside. It also added some automatic columns to track which model is making the predictions, who is using your model and when. If you deploy more complex models with probabilities, more columns will be created to handle that. We can tell MLManager which SKlearn function call to use by passing in the <code>library_specific</code> parameter. Let's try that next.<footer>Splice Machine</footer>
</blockquote>
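
As a rough sketch of what selecting a prediction function by name can look like, the snippet below looks up a method on a model object from a `predict_call` entry in a dictionary. It assumes nothing about MLManager's internals. `ToyClassifier` and `run_model` are hypothetical names made up for this example, and only the `library_specific={'predict_call': ...}` shape is taken from this notebook.

```python
class ToyClassifier:
    """Hypothetical model exposing sklearn-style predict and predict_proba."""

    def predict_proba(self, rows):
        # Toy rule: probability of fraud grows with the last feature (amount)
        result = []
        for row in rows:
            p_fraud = min(row[-1] / 10000.0, 1.0)
            result.append([1.0 - p_fraud, p_fraud])
        return result

    def predict(self, rows):
        # Predicted class index: 1 when fraud is the more likely class
        return [1 if p[1] > p[0] else 0 for p in self.predict_proba(rows)]


def run_model(model, rows, library_specific=None):
    # Fall back to 'predict' when no specific call is requested
    call_name = (library_specific or {}).get('predict_call', 'predict')
    return getattr(model, call_name)(rows)


rows = [[1.5, 2.2, 2.5, 4.4], [2342.7, 3334.0, -23.1, 11010.9]]
labels = run_model(ToyClassifier(), rows)
probs = run_model(ToyClassifier(), rows, {'predict_call': 'predict_proba'})
print(labels)  # [0, 1]
print(probs[1])  # [0.0, 1.0]
```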
    
#### Deploy model with complex output

In [None]:
# This SKLearn model can also output the probability of each class. Let's deploy our model so it contains those probabilities
print(f'Model prediction of {X_test.iloc[0].values} with probabilities:', model.predict_proba(X_test)[0], '\n')

splice.dropTableIfExists(f'{schema}.sklearn_model_probs')
# Deploy our model
jid = mlflow.deploy_db(schema, 'sklearn_model_probs', run_id, primary_key={'MOMENT_KEY': 'INT'}, df=df, model_cols=list(X_train.columns), 
                 create_model_table=True, classes=['no_fraud','fraud'], library_specific={'predict_call': 'predict_proba'}) # Added sklearn args
mlflow.watch_job(jid)
del df

#### Let's use the model

In [None]:
%%sql
insert into sklearn_model_probs (expected_weekly_trans_cnt, weekly_trans_cnt, rsi_trans_amnt, amount, moment_key) values (6.3, 2.9, 5.6, 1.8, 1);
select * from sklearn_model_probs;

### Great!
<blockquote>You can see one probability column for each class; these should match the local model's output for the same input, which the next cell prints. The prediction column contains the index of the predicted class. <br>So, a prediction of 0 means that the model is predicting the 1st class, no_fraud (remember that indexes start at 0!), just like the local model.<footer>Splice Machine</footer>
</blockquote>
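
The relationship between the probabilities, the prediction index, and the class names can be sketched in a few lines of plain Python. The probability values below are made up for illustration and are not output from the deployed model; `to_label` is a hypothetical helper, and `classes` mirrors the list passed to `deploy_db` above.

```python
classes = ['no_fraud', 'fraud']


def to_label(probabilities, classes):
    # The prediction is the index of the most probable class
    index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return index, classes[index]


index, label = to_label([0.97, 0.03], classes)
print(index, label)  # 0 no_fraud
```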

#### Show the Prediction from the model

In [None]:
print(model.predict([[6.3, 2.9, 5.6, 1.8]])[0])
print(model.predict_proba([[6.3, 2.9, 5.6, 1.8]])[0])

## SparkML
<blockquote>Let's try the same thing with Spark. Although Spark is typically used for big data processing, its <a href='https://spark.apache.org/docs/2.4.0/ml-classification-regression.html'>ML Libraries</a> come with some pretty powerful models as well, and they can scale to massive datasets too.<footer>Splice Machine</footer>
</blockquote>

#### Build and deploy a SparkML model

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from splicemachine.stats import SpliceMultiClassificationEvaluator

# Create our dataset
spark_df = splice.df('select EXPECTED_WEEKLY_TRANS_CNT,WEEKLY_TRANS_CNT,RSI_TRANS_AMNT,AMOUNT, class_result from cc_fraud_data')
spark_df.show(5)
train, test = spark_df.randomSplit([0.8,0.2])

with mlflow.start_run(run_name='spark'):
    # Set our feature vector to be all columns except the label
    va = VectorAssembler(inputCols = train.columns[:-1], outputCol='features')
    rf = RandomForestClassifier(labelCol='CLASS_RESULT', featuresCol='features')
    pipeline = Pipeline(stages=[va,rf])
    
    trainedModel = pipeline.fit(train)
    predictions = trainedModel.transform(test)
    # Log our model for deployment or future use
    mlflow.log_model(trainedModel, 'spark_model')
    
    ev = SpliceMultiClassificationEvaluator(spark, labelCol='CLASS_RESULT')
    ev.input(predictions)
    run_id = mlflow.current_run_id()
    
splice.dropTableIfExists(f'{schema}.spark_model')
# Deploy our model
jid = mlflow.deploy_db(schema, 'spark_model', run_id, primary_key={'MOMENT_KEY': 'INT'}, df=spark_df, model_cols=spark_df.columns[:-1], 
                 create_model_table=True, classes=['no_fraud','fraud'])
    
mlflow.watch_job(jid)

#### Try out our model

In [None]:
%%sql
insert into spark_model (expected_weekly_trans_cnt, weekly_trans_cnt, rsi_trans_amnt, amount, moment_key) values (5.1, 3.5, 1.4, 0.2, 1);
insert into spark_model (expected_weekly_trans_cnt, weekly_trans_cnt, rsi_trans_amnt, amount, moment_key) values (2.7, 4.0, 3.1, 1.9, 3);
select * from spark_model;

## Last but not least, H2O
<blockquote>Lastly, we'll build an H2O model for the same prediction task. <a href='http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2ocoxproportionalhazardsestimator'>H2O AI</a> is an extremely powerful, distributed ML framework with a plethora of Machine Learning models. These models can handle massive data, just like Spark, and they have very sophisticated algorithms. We've pre-configured <a href='http://docs.h2o.ai/sparkling-water/2.1/latest-stable/doc/pysparkling.html'>H2O PySparkling Water</a> into our system so you can use it immediately.<footer>Splice Machine</footer>
</blockquote>

#### Build an H2O model

In [None]:
# First, we start our PySparkling Cluster
from pysparkling import *
import h2o
# Create H2O Cluster
conf = H2OConf().setInternalClusterMode()
hc = H2OContext.getOrCreate(conf)

In [None]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Get data
hdf = hc.asH2OFrame(spark_df)
hdf['CLASS_RESULT'] = hdf['CLASS_RESULT'].asfactor()
train, test = hdf.split_frame(ratios=[0.8])

with mlflow.start_run(run_name='h2o'):
    model = H2ODeepLearningEstimator()
    model.train(x=train.columns[:-1],
                  y=train.columns[-1],
                  training_frame=train)
    print('logloss', model.logloss())
    
    mlflow.log_model(model, 'h2o_model')
    run_id = mlflow.current_run_id()
    
splice.dropTableIfExists(f'{schema}.h2o_model')
# Deploy our model
jid = mlflow.deploy_db(schema, 'h2o_model', run_id, primary_key={'MOMENT_KEY': 'INT'}, df=spark_df, model_cols=spark_df.columns[:-1], 
                 create_model_table=True, classes=['no_fraud','fraud'])
mlflow.watch_job(jid)

#### Invoke our model

In [None]:
%%sql
insert into h2o_model (expected_weekly_trans_cnt, weekly_trans_cnt, rsi_trans_amnt, amount, moment_key) values (6.7, 3.1, 5.6, 2.4, 1);
insert into h2o_model (expected_weekly_trans_cnt, weekly_trans_cnt, rsi_trans_amnt, amount, moment_key) values (4.9, 3.0, 1.4, 0.2, 2);
insert into h2o_model (expected_weekly_trans_cnt, weekly_trans_cnt, rsi_trans_amnt, amount, moment_key) values (5.6, 3.0, 4.5, 1.5, 3);
select * from h2o_model;

In [None]:
spark.stop()

### Amazing! 
<blockquote>Just like that, you've deployed 3 models to the database, one of them in two different ways! If you'd like to see everything you've learned put together in an end-to-end example, check out our <a href='./Examples'>Example</a> notebooks. <footer>Splice Machine</footer>
</blockquote>