<style>
#s {
}
h1, h2, h3, h4, h5, h6, table, button, a, p, blockquote {
font-family:Geneva;
}

.log {
transition: all .2s ease-in-out;
}

.log:hover {
transform: scale(1.05);
}
</style>
<div id='s' style='width:90%'>
<center><img class='log' src='https://splicemachine.com/wp-content/uploads/splice-logo-1.png' width='20%' style='z-index:5'></center>
<center><h1 class='log' style='font-size:40px; color:black;'>Welcome to Splice Machine MLManager</h1></center>
<center><h2 class='log' style='font-size:25px; color:grey;'>The data platform for intelligent applications</h2></center>
<center><img class='log' src='https://splice-demo.s3.amazonaws.com/splice-machine-data-science-process-h2o.png' width='40%' style='z-index:5'></center>
</div>
    
    

# In this notebook, we're going to take a look at using MLManager with [H2O](https://www.h2o.ai/) + [Spark](https://spark.apache.org/)
<h2 style='font-size:25px;  font-weight:bold'>What is <a href=http://docs.h2o.ai/sparkling-water/2.1/latest-stable/doc/pysparkling.html>PySparkling Water?</a> What is <a href=https://splicemachine.com/product/ml-manager/>MLManager?</a></h2>
<style>
blockquote{
  font-size: 15px;
  background: #f9f9f9;
  border-left: 10px solid #ccc;
  margin: .5em 10px;
  quotes: "\201C""\201D""\2018""\2019";
  padding: 10px 20px;
  line-height: 1.4;
}

blockquote:before {
  content: open-quote;
  display: inline;
  height: 0;
  line-height: 0;
  left: -10px;
  position: relative;
  top: 30px;
  bottom:30px;
  color: #ccc;
  font-size: 3em;
    display:none;

}

p{
  margin: 0;
}

footer{
  margin:0;
  text-align: right;
  font-size: 1em;
  font-style: italic;
}
</style>
<blockquote><p class='quotation'><b><br><span style='font-size:25px'>PySparkling</span></b> <br><br>PySparkling Water is an awesome H2O extension that allows you to run H2O clusters on top of existing Spark clusters. With Splice Machine, this integration is taken care of for you, so it's simple to start modeling with your new favorite library.</p><footer>Splice Machine</footer></blockquote><br>
<blockquote><p class='quotation'><b><br><span style='font-size:25px'>MLManager (+MLFlow)</span></b><br><br>As a data scientist constantly creating new models and testing new features, it is necessary to effectively track and manage those different ML runs. MLManager + MLFlow allow you to track entire <code>experiments</code> and individual <code>run</code> parameters and metrics. The way you organize your flow is unique to you, and the intuitive Python API allows you to organize your development process and run with it.</p>
     <center><img class='log' src='https://s3.amazonaws.com/splice-demo/mlflow+ui.png' width='40%' style='z-index:5'></center></blockquote>

# Let's get started
## In this notebook, we will see how to use Spark, H2O and MLManager to predict the sentiment of Amazon reviews, track everything in the [MLFlow UI](/mlflow), and deploy our models to production
This is an adaptation of the original [H2O Demo](http://docs.h2o.ai/h2o-tutorials/latest-stable/h2o-world-2017/nlp/index.html)

## Important imports and setup
* Create our Spark Session
* Create our Native Spark Data Source
* Create our PySparkling Water cluster
* Import our MLManager functionality

In [None]:
from pyspark.sql import SparkSession
from splicemachine.spark import PySpliceContext
from splicemachine.mlflow_support import *
from splicemachine.mlflow_support.utilities import get_user
from pysparkling import *
import h2o
import warnings
warnings.simplefilter("ignore")
warnings.filterwarnings("ignore")

# Spark Session
spark = SparkSession.builder.config('spark.driver.memoryOverhead',1000).config('spark.driver.memory','2g').getOrCreate()
# Native Spark Data Source
splice = PySpliceContext(spark)
# Register Splice so we can access database functions
mlflow.register_splice_context(splice)
# Create H2O Cluster
conf = H2OConf().setInternalClusterMode()
hc = H2OContext.getOrCreate(conf)
schema = get_user()

## View Our Spark UI
<blockquote>Now that we've created a SparkSession dedicated to this notebook, we can monitor the active jobs in the Spark UI. You can navigate to <code>/sparkmonitor/PORT</code> in the URL, replacing <code>PORT</code> with the port of your active Spark Session. You can also view the Spark UI right here in the notebook using our <code>get_spark_ui</code> function.</blockquote>

In [None]:
from splicemachine.notebook import get_spark_ui
help(get_spark_ui)

In [None]:
get_spark_ui()

# Great! Now let's import our data
<blockquote><p class='quotation'><b><br><span style='font-size:25px'>Importing Data</span></b> <br><br>There are a few easy ways to get data into Splice Machine, and we'll demonstrate two of them here. You can use the built-in <code>%%sql</code> magic to import data directly from external sources such as S3, or you can use H2O to read the data from S3, create a table from that dataframe, and insert the data using the <code>PySpliceContext</code> you created in the cell above.</p><footer>Splice Machine</footer></blockquote><br>

## Option 1: Direct Import from SQL
<blockquote><p class='quotation'><b><br><span style='font-size:25px'>SQL Import</span></b> <br><br>This method is simple: create your table, point it to an S3 location, and run the import command.</p><footer>Splice Machine</footer></blockquote><br>

In [None]:
%%sql
DROP TABLE IF EXISTS AMAZON_REVIEWS;
CREATE TABLE AMAZON_REVIEWS(
    PRODUCTID VARCHAR(250),
    USERID VARCHAR(250),
    SUMMARY VARCHAR(500),
    SCORE INT,
    HELPFULNESSDENOMINATOR BIGINT,
    ID INT,
    PROFILENAME VARCHAR(500),
    HELPFULNESSNUMERATOR BIGINT,
    REVIEW_TIME BIGINT,
    REVIEW VARCHAR(15000),
    PRIMARY KEY(ID)
);


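-- Import the data. Positional arguments, in order: schema, table,
-- insert column list, file or directory, column delimiter, character delimiter,
-- timestamp format, date format, time format, bad records allowed (-1 = no limit),
-- bad record directory, one-line records, charset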
call SYSCS_UTIL.IMPORT_DATA (
     null,
     'AMAZON_REVIEWS',
     null,
     's3a://splice-demo/AmazonReviews.csv',
     ',',
     null,
     null,
     null,
     null,
     -1,
     's3a://splice-demo/bad',
     null, 
     null);

In [None]:
%%sql
select top 10 * from AMAZON_REVIEWS

### Now we can easily import the data into a Spark or H2O Data Frame with the <code>PySpliceContext</code> (Native Spark Data Source)

In [None]:
# Get data from table into Spark Dataframe
df2 = splice.df(f'select * from {schema}.amazon_reviews')
hdf = hc.asH2OFrame(df2)
hdf.describe()
del hdf # Delete because we won't be using this frame

## Option 2: Import from H2O
<blockquote><p class='quotation'><b><br><span style='font-size:25px'>H2O Import</span></b> <br><br>This method is also straightforward, and may be preferable to data scientists: import your data using H2O, and then use the <code>PySpliceContext</code> to create the table from the dataframe and insert the data directly. You'll notice that this method doesn't directly involve any SQL.</p><footer>Splice Machine</footer></blockquote><br>

In [None]:
data_path = "https://splice-demo.s3.amazonaws.com/AmazonReviews.csv"
# Load data into H2O
reviews = h2o.import_file(data_path)
reviews.head()

## H2O offers great functions to convert H2OFrames into Pandas and Spark DataFrames

In [None]:
# Spark DataFrame
df = hc.asSparkFrame(reviews, copyMetadata=False)
df.limit(100).show()
del df._h2o_frame
print(type(df))
# Pandas DataFrame
pdf = reviews.head().as_data_frame()
display(pdf)
print(type(pdf))

## Nice!
<blockquote><p class='quotation'><b><br><span style='font-size:25px'>Create Table and Insert Data</span></b> <br><br>Now that we have our Spark DataFrame, we can create a table and insert data using <code>splice.createTable</code> and <code>splice.insert</code>.<br><b>Note: </b>If your code is hanging on the <code>insert</code>, your cluster may be out of memory. Try configuring your Spark or H2O cluster with more memory. Read about that <a href=https://docs.h2o.ai/sparkling-water/2.1/latest-stable/doc/configuration/configuration_properties.html>here</a> and <a href=https://spark.apache.org/docs/latest/configuration.html#available-properties>here</a>.</p></blockquote><br>


In [None]:
help(splice.createTable)
print('----------------------------------------------------------------------------------------------------------------')
help(splice.insert)

In [None]:
# Create the table
df = df.withColumnRenamed('Time', 'Review_Time')
df = df.withColumnRenamed('Text', 'Review')
splice._dropTableIfExists(f'{schema}.AMAZON_REVIEWS_H2O')
splice.createTable(df, f'{schema}.AMAZON_REVIEWS_H2O', to_upper=True, drop_table=True)
print('Inserting... ', end='')
splice.insert(df, f'{schema}.AMAZON_REVIEWS_H2O',to_upper=True) # Use to_upper to give the SQL table uppercase columns
print('Done.')

In [None]:
%%sql
select top 10 varchar(Summary) Summary, Score, HelpfulnessDenominator, Id  from AMAZON_REVIEWS_H2O;

# Awesome! Let's get modeling
<blockquote><p class='quotation'><b><br><span style='font-size:25px'>Modeling</span></b> <br><br>We're going to try three different ways to approach this problem, and track it all with MLManager.</p>
    <ol>
        <li>No-text model: We will try to predict whether a review is positive without using the text of the review, relying only on the other columns</li>
        <li>Using the reviews: We will use Word2Vec to create vectors from the text of the reviews. We will then train a model on that word-embedding feature vector</li>
        <li>Using the review summaries: We will use Word2Vec to create vectors from the text of the review summaries</li>
    </ol>
    Which do you think will perform the best?
    <footer>Splice Machine</footer></blockquote><br>

## First Attempt
Let's create a simple model using the non-review columns
<br>
<blockquote>
    First, let's start our mlflow experiment! We can start a run and log important parameters, tags, and metrics as they come.<br>
Next, we can turn this into a binary-classification problem by turning the score into a positive or negative review. We will say that 4- and 5-star reviews are positive, but you can change this and try other things!
<footer>Splice Machine</footer></blockquote><br>

In [None]:
# Set our mlflow experiment
mlflow.set_experiment('Sentiment Analysis')
# Create the binary label: scores of 4 and 5 become "1", everything else "0"
reviews["PositiveReview"] = (reviews["Score"] >= 4).ifelse("1", "0")
reviews["PositiveReview"].table()
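
The labelling rule is easy to experiment with outside of H2O. Here is a minimal plain-Python sketch of the same thresholding, assuming a cutoff of 4 stars as above. The function name `label_review` and the sample scores are made up for illustration and are not part of the H2O or Splice Machine APIs.

```python
# Illustrative only: a plain-Python version of the labelling rule used above.
POSITIVE_THRESHOLD = 4  # assumption: 4- and 5-star reviews count as positive


def label_review(score, threshold=POSITIVE_THRESHOLD):
    """Return "1" for a positive review and "0" otherwise."""
    return "1" if score >= threshold else "0"


sample_scores = [5, 1, 4, 3, 2]  # made-up scores, not taken from the dataset
labels = [label_review(s) for s in sample_scores]
print(labels)
```

Lowering `threshold` to 3 would also count neutral reviews as positive, which is one of the "other things" you could try.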


## We can now see our experiment in the MLFlow UI 

<blockquote>You can always navigate to <a href=/mlflow>/mlflow</a> to view the MLFlow UI, or you can view it right here in the notebook.</blockquote>

In [None]:
from splicemachine.notebook import get_mlflow_ui
get_mlflow_ui()

## Let's see our Data Correlation

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import colors
import numpy as np
import seaborn as sns
from pyspark.sql.types import FloatType, IntegerType

%matplotlib inline

pdf = reviews[['ProductId', 'UserId', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time','PositiveReview']].as_data_frame()
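# Pearson correlation of the numeric columns (string columns such as ProductId are skipped)
corr = pdf.select_dtypes(include='number').corr()

# Heatmap cells are one unit wide, so their centers sit at i + 0.5
ticks = [i + 0.5 for i in range(len(corr.columns))]
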
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Color Scheme
cmap = "Greens"

# Draw the heatmap
sns.heatmap(corr,  cmap=cmap, vmax=.3, center=0,
            square=False, linewidths=.5, cbar_kws={"shrink": .5})

plt.xticks(ticks, corr.columns)
plt.yticks(ticks, corr.columns)
plt.title('Sentiment Data correlation heatmap')
plt.show()
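
If you are wondering what the numbers in the heatmap mean: by default, pandas' `corr()` computes the Pearson correlation coefficient for each pair of columns. Below is a small, dependency-free sketch of that formula. The helper `pearson` and the two toy columns are invented for this example and do not come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


helpful_votes = [0, 1, 3, 5, 8]  # made-up values for illustration
positive_flag = [0, 0, 1, 1, 1]  # made-up labels for illustration
r = pearson(helpful_votes, positive_flag)
print(round(r, 3))
```

A value near 1 or -1 means a strong linear relationship, while a value near 0 means the two columns carry little linear information about each other.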

## We can see some of our features have decent correlation (remember that we aren't using the reviews yet). Let's try a basic model
### First, let's log some important information in our <code>run</code>
<blockquote>H2O has a <a href=https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html>lot</a> of algorithms; here we'll use a Gradient Boosting Estimator.<br>We'll log some things like our feature vector, label, train/test/validation split ratios, training time, and even the model and notebook themselves directly to <a href='/mlflow'>mlflow</a>.</blockquote>

In [None]:
%%time
from h2o.estimators import H2OGradientBoostingEstimator
from splicemachine.mlflow_support.utilities import get_user

RATIOS = [0.7,0.15]

# Start our run to keep track of important information
mlflow.start_run(run_name='simple_run')

predictors = ['ProductId', 'UserId', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time']
response = 'PositiveReview'

# lp is short for log_param
# lm is short for log_metric
mlflow.lp('predictors', predictors)
mlflow.lp('label', response)
mlflow.lp('source data table', f'{get_user()}.AMAZON_REVIEWS')

# Train Test Split
train,test,valid = reviews.split_frame(ratios=RATIOS)
# Log our ratios
mlflow.lp('ratios',RATIOS)

gbm_baseline = H2OGradientBoostingEstimator(stopping_metric = "AUC", stopping_tolerance = 0.001,
                                            stopping_rounds = 5, score_tree_interval = 10,
                                            model_id = "gbm_baseline.hex"
                                           )

mlflow.lp('model_type', gbm_baseline.__class__)

# Code block to time training
with mlflow.timer('train_time'):
    gbm_baseline.train(x = predictors, y = response, 
                       training_frame = train, validation_frame = test
                      )
# Log the model params to mlflow
mlflow.log_params(gbm_baseline.get_params())
# Log the model to MLFlow
mlflow.log_model(gbm_baseline, 'baseline_model')
# Log the training notebook to MLFlow
mlflow.log_artifact('MLManager NLP H2O Demo.ipynb', 'training_notebook')
gbm_baseline
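
A note on the split: with `ratios=[0.7, 0.15]`, `split_frame` returns three frames, and the last one receives the remaining share of roughly 15%. As we understand the H2O documentation, the split is probabilistic rather than exact, so the frame sizes only approximate the ratios. The sketch below shows that idea in plain Python. It is not H2O's implementation, and `split_rows` and its `seed` argument are hypothetical names used only here.

```python
import random


def split_rows(rows, ratios, seed=42):
    """Randomly assign rows to len(ratios) + 1 buckets.

    Each row lands in bucket i with probability ratios[i]; the final
    bucket receives whatever probability mass is left over.
    """
    if sum(ratios) >= 1:
        raise ValueError("ratios must sum to less than 1")
    rng = random.Random(seed)
    buckets = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        cumulative = 0.0
        for i, ratio in enumerate(ratios):
            cumulative += ratio
            if draw < cumulative:
                buckets[i].append(row)
                break
        else:
            buckets[-1].append(row)  # remainder bucket
    return buckets


train_rows, test_rows, valid_rows = split_rows(range(10_000), [0.7, 0.15])
print(len(train_rows), len(test_rows), len(valid_rows))
```

The three sizes always add up to the number of input rows, but each one will drift a little around 70%, 15% and 15%.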

## You can see above that H2O gives you loads of details about your model. Let's inspect it a bit more and log some results to MLFlow

In [None]:
%matplotlib inline  

# Print and Log Model params
params = dict(zip(gbm_baseline.summary().col_header[1:],
                    gbm_baseline.summary().cell_values[0][1:]))
print(gbm_baseline.summary())
mlflow.log_params(params)


In [None]:
from IPython.display import HTML, IFrame
#Plot and Log Scoring history
gbm_baseline.plot()
print("AUC on Validation Data: " + str(round(gbm_baseline.auc(valid = True), 3)))
# Log training and validation metrics over time
for step, row in gbm_baseline.scoring_history().iterrows():
    row_dict = row.to_dict()
    for r in row_dict:
        if 'train' in r or 'valid' in r:
            mlflow.log_metric(r, row_dict[r],step=step)

cur_run = mlflow.current_run_id()
cur_exp = mlflow.current_exp_id()
link = f'/mlflow/#/metric/training_auc?runs=["{cur_run}"]&experiment={cur_exp}&plot_metric_keys=[\"training_logloss\",\"validation_logloss\",\"training_rmse\",\"validation_rmse\"]'
display(HTML(f'<font size="+2">See your metrics plot <a target="_blank" href={link}>here</a></font>'))


In [None]:
# Print and Log Confusion Matrix
print(gbm_baseline.confusion_matrix(valid = True))
# Each call returns [[threshold, value]], so the rate itself is the second element
mlflow.lm('fpr', gbm_baseline.fpr(valid=True)[0][1])
mlflow.lm('tpr', gbm_baseline.tpr(valid=True)[0][1])
mlflow.lm('fnr', gbm_baseline.fnr(valid=True)[0][1])
mlflow.lm('tnr', gbm_baseline.tnr(valid=True)[0][1])
mlflow.lm('F0point5', gbm_baseline.F0point5(valid=True)[0][1])
mlflow.lm('F1', gbm_baseline.F1(valid=True)[0][1])
mlflow.lm('F2', gbm_baseline.F2(valid=True)[0][1])
mlflow.lm('auc', gbm_baseline.auc(valid = True))
mlflow.lp('threshold', gbm_baseline.F1(valid=True)[0][0]) # First element is the threshold
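
All of the rates logged above can be derived from the four cells of the confusion matrix. As a quick reference, here is a plain-Python sketch of the standard definitions. The helper names `binary_rates` and `f_beta` and the counts are invented for illustration; H2O computes these values for you.

```python
def binary_rates(tp, fp, tn, fn):
    """Standard rates derived from a binary confusion matrix."""
    return {
        "tpr": tp / (tp + fn),  # true positive rate (recall)
        "fnr": fn / (tp + fn),  # false negative rate
        "fpr": fp / (fp + tn),  # false positive rate
        "tnr": tn / (fp + tn),  # true negative rate (specificity)
    }


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


counts = {"tp": 80, "fp": 10, "tn": 30, "fn": 20}  # made-up counts
rates = binary_rates(**counts)
f1 = f_beta(counts["tp"], counts["fp"], counts["fn"], beta=1.0)
f2 = f_beta(counts["tp"], counts["fp"], counts["fn"], beta=2.0)
f0point5 = f_beta(counts["tp"], counts["fp"], counts["fn"], beta=0.5)
print(rates, round(f1, 3), round(f2, 3), round(f0point5, 3))
```

Note that `tpr + fnr` and `fpr + tnr` each sum to 1, which is a handy sanity check on logged metrics.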

In [None]:
# Plot and Log Variable Importance
gbm_baseline.varimp_plot()
for var in gbm_baseline.varimp():
    mlflow.lm(f'varimp_{var[0]}',var[-1])

In [None]:
# Partial Dependence Plot
pdp_helpfulness = gbm_baseline.partial_plot(train, cols = ["HelpfulnessNumerator"])

In [None]:
mlflow.end_run()

# There is room for improvement 
## Let's now Tokenize words in the Review
### But first, let's start our new run

In [None]:
import pandas as pd
# Start new mlflow run
mlflow.start_run(run_name='review_tokenizer')
mlflow.lp('source data table', f'{get_user()}.AMAZON_REVIEWS')
# Get common stop words from H2O
data_path = "https://splice-demo.s3.amazonaws.com/stop_words.csv"
STOP_WORDS = pd.read_csv(data_path, header=0)
STOP_WORDS = list(STOP_WORDS['STOP_WORD'])
print(STOP_WORDS)

In [None]:
# Inspect our reviews before tokenization
reviews['Text']

## Now we can train our Doc2Vec model
<blockquote>We are going to use the popular Gensim doc2vec implementation for scikit-learn. We use scikit-learn here because of its implementation of <code>Pipelines</code>, which allow us to create custom transformations on the data before training or running our model. This gives us a great deal of flexibility. We will use doc2vec, which is an extension of word2vec to full documents (sentences). We can also time how long it takes to train the vectorizer, and log the vector size so we can change it and see how performance changes. Then we can look up the most similar reviews for an input to see how well the model did.<footer>Splice Machine</footer></blockquote><br>

In [None]:
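import re

def tokenize(reviews):
    """Turn a one-element list holding the reviews into a list of token lists."""
    stops = set(STOP_WORDS)  # build the lookup set once, not once per review
    review_tokens = []
    for review in reviews[0]:
        # Replace anything that is not a letter with a space
        review = re.sub("[^a-zA-Z]", " ", review)
        # Lowercase and split on whitespace
        words = review.lower().split()
        # Drop stop words
        review_tokens.append([w for w in words if w not in stops])
    return review_tokens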
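
To see what the tokenizer does to a single review, here is a small standalone sketch of the same three steps (strip non-letters, lowercase and split, drop stop words). The stop-word set `DEMO_STOP_WORDS` and the function `tokenize_one` are hypothetical stand-ins so the example runs without the notebook's `STOP_WORDS` list.

```python
import re

# Hypothetical stop-word list, much shorter than the one loaded in the notebook
DEMO_STOP_WORDS = {"this", "is", "the", "a", "and"}


def tokenize_one(text, stop_words=DEMO_STOP_WORDS):
    """Tokenize a single review: letters only, lowercase, no stop words."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return [w for w in words if w not in stop_words]


tokens = tokenize_one("This coffee is GREAT, and the price (9.99) is fair!")
print(tokens)  # ['coffee', 'great', 'price', 'fair']
```

Digits and punctuation disappear entirely, so a price like `9.99` contributes no tokens.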

## Train Doc2Vec Model
<blockquote>This will take a few minutes to run as the model needs to generate mappings of every sentence to vectors.<footer>Splice Machine</footer></blockquote><br>

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline as skPipe
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from gensim.sklearn_api import W2VTransformer, D2VTransformer
import pandas as pd

print('Collecting dataset...', end='')
pdf = reviews['Text'].as_data_frame().astype('string')
print('Done.')

VECTOR_LENGTH = 50

d2v_model = skPipe(verbose=True,
                   steps = [
                       ('preprocessor', FunctionTransformer(tokenize, validate=False)),
                       ('doc2vec', D2VTransformer(size=VECTOR_LENGTH)),
                       ('postprocessor', FunctionTransformer(lambda X: X.astype('double'), validate=True))
                   ])
with mlflow.timer('doc2vec_train_time'):
    # tokenize and build vocab
    d2v_model.fit([pdf.dropna()['Text']])

## Now we can use the Doc2Vec Model to see the most similar sentences of an input

In [None]:
from copy import deepcopy
doc2vec_model = deepcopy(d2v_model.steps[1][1].gensim_model)

# Tokenize our input
inp = "This tastes delicious"
tokens = tokenize([[inp]])
# Infer a vector for the tokenized sentence
new_vector = doc2vec_model.infer_vector(tokens[0])
# Find the training reviews whose vectors are closest to it
sims = doc2vec_model.docvecs.most_similar([new_vector])
print(f'Most similar reviews (index, similarity): {sims}\n')
# Get the most similar review
index = sims[0][0]
output = pdf.dropna()['Text'].iloc[index]

print(f'Input: {inp}\nMost similar Output: {output}')
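
Under the hood, a "most similar" lookup like this ranks stored document vectors by their cosine similarity to the query vector. The sketch below shows that ranking on toy two-dimensional vectors. It is a simplified illustration rather than Gensim's code, and `cosine_similarity`, `most_similar` and the toy vectors are names and values made up for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, doc_vectors, topn=2):
    """Return the topn (index, similarity) pairs, best match first."""
    scored = [(idx, cosine_similarity(query, vec))
              for idx, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


toy_doc_vectors = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]  # made-up vectors
ranked = most_similar([1.0, 0.1], toy_doc_vectors)
print(ranked)
```

The first element of each pair is the position of the document, which is how the cell above maps a result back to the original review text.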

## Now we can save this model and end our run
<blockquote>We want to save this vectorizer as an <b>independent</b> <code>run</code>. This is because we may want to build more than one model that utilizes these word vectors. We don't want to duplicate those identical word vectors, so we can use the outputs for <b>more than one model</b>. This is the idea of a <i>feature store</i>, where we use features from one dataset on multiple models. This is crucial to creating efficient ML workflow systems.</blockquote>

In [None]:
mlflow.log_model(d2v_model, 'doc2vec_model')
mlflow.lp('vector_size', VECTOR_LENGTH)
exp_id = mlflow.current_exp_id()
mlflow.end_run()

In [None]:
get_mlflow_ui(exp_id)

## Let's vectorize our reviews
<blockquote>Now that we have a word embedding for each word in our vocabulary, we will aggregate the words for each review using the <code>transform</code> function.  This will give us one aggregated word embedding for each review.</blockquote>

In [None]:
# Calculate a vector for each review
review_vecs = h2o.H2OFrame(d2v_model.transform([pdf.fillna("")['Text']]))
# Add the review vectors to the original dataframe
# Add aggregated word embeddings 
ext_reviews = reviews.cbind(review_vecs)
ext_reviews.head()

## Model 2: GBM with Review vectors
<blockquote>
    Now we can train a GBM like before, but include the review vectors. This should hopefully improve performance! We'll log everything to mlflow so we can compare the results.
    <footer>Splice Machine</footer></blockquote><br>

In [None]:
from h2o.estimators import H2OGradientBoostingEstimator
mlflow.end_run()
mlflow.start_run(run_name='GBM with word vectors')
RATIOS = [0.7,0.15]
# Train Test Split
ext_train,ext_test,ext_valid = ext_reviews.split_frame(ratios=RATIOS)
# Log our ratios
mlflow.lp('ratios',RATIOS)
# Log what word vectors we're using
mlflow.lp('word vectors', 'reviews')

non_token_predictors = ['ProductId', 'UserId', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time']
predictors = non_token_predictors + review_vecs.names
response = 'PositiveReview'

mlflow.lp('label', response)
# There are a lot of predictors here (C1-C50 + features) so let's shorten that
mlflow.lp('predictors', non_token_predictors + [f'C1-C{len(review_vecs.columns)}'])

gbm_embeddings = H2OGradientBoostingEstimator(stopping_metric = "AUC", stopping_tolerance = 0.001,
                                              stopping_rounds = 5, score_tree_interval = 10,
                                              model_id = "gbm_embeddings.hex"
                                             )
with mlflow.timer('train_time'):
    gbm_embeddings.train(x = predictors, y = response, 
                       training_frame = ext_train, validation_frame = ext_test
                      )

# Log the model params to mlflow
mlflow.log_params(gbm_embeddings.get_params())
# Log the model to MLFlow
mlflow.log_model(gbm_embeddings, 'vectorized_model')
# Log the training notebook to MLFlow
mlflow.log_artifact('MLManager NLP H2O Demo.ipynb', 'training_notebook')
gbm_embeddings

## Just like before, let's log all of our outcomes

In [None]:
%matplotlib inline  
from IPython.display import HTML, IFrame

# Print and Log Model params
params = dict(zip(gbm_embeddings.summary().col_header[1:],
                    gbm_embeddings.summary().cell_values[0][1:]))
print(gbm_embeddings.summary())
mlflow.log_params(params)


#Plot and Log Scoring history
gbm_embeddings.plot()
print("AUC on Validation Data: " + str(round(gbm_embeddings.auc(valid = True), 3)))
# Log training and validation metrics over time
for step, row in gbm_embeddings.scoring_history().iterrows():
    row_dict = row.to_dict()
    for r in row_dict:
        if 'train' in r or 'valid' in r:
            mlflow.log_metric(r, row_dict[r],step=step)


# Print and Log Confusion Matrix
print(gbm_embeddings.confusion_matrix(valid = True))
# Each call returns [[threshold, value]], so the rate itself is the second element
mlflow.lm('fpr', gbm_embeddings.fpr(valid=True)[0][1])
mlflow.lm('tpr', gbm_embeddings.tpr(valid=True)[0][1])
mlflow.lm('fnr', gbm_embeddings.fnr(valid=True)[0][1])
mlflow.lm('tnr', gbm_embeddings.tnr(valid=True)[0][1])
mlflow.lm('F0point5', gbm_embeddings.F0point5(valid=True)[0][1])
mlflow.lm('F1', gbm_embeddings.F1(valid=True)[0][1])
mlflow.lm('F2', gbm_embeddings.F2(valid=True)[0][1])
mlflow.lm('auc', gbm_embeddings.auc(valid = True))
mlflow.lp('threshold', gbm_embeddings.F1(valid=True)[0][0]) # First element is the threshold


# Plot and Log Variable Importance
gbm_embeddings.varimp_plot()
for var in gbm_embeddings.varimp():
    mlflow.lm(f'varimp_{var[0]}',var[-1])
    
    
# Partial Dependence Plot
pdp_helpfulness = gbm_embeddings.partial_plot(ext_train, cols = ["HelpfulnessNumerator"])

In [None]:
old_run = cur_run
cur_run = mlflow.current_run_id()
cur_exp = mlflow.current_exp_id()
link = f'/mlflow/#/metric/training_auc?runs=["{cur_run}","{old_run}"]&experiment={cur_exp}&plot_metric_keys=[\"validation_logloss\"]'
display(HTML(f'<font size="+1">Compare your 2 runs <a target="_blank" href={link}>here</a></font>'))

In [None]:
gbm_embeddings.scoring_history()

In [None]:
mlflow.end_run()

In [None]:
print("Baseline AUC: " + str(round(gbm_baseline.auc(valid = True), 3)))
print("With Embeddings AUC: " + str(round(gbm_embeddings.auc(valid = True), 3)))
link = f'/mlflow/#/metric/training_auc?runs=["{cur_run}","{old_run}"]&experiment={cur_exp}&plot_metric_keys=[\"training_auc\",\"validation_auc\"]'
display(HTML(f'<font size="+1">See a metrics comparison <a target="_blank" href={link}>here</a></font>'))
IFrame(link.replace('\"','%22'),width='100%',height='800px')
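
The comparison links in this notebook are assembled by hand, with the double quotes escaped afterwards. One possible alternative is to build the query string from Python lists with the standard library, as sketched below. The URL layout is copied from the cells above, while the helper `metric_link` and the run ids are hypothetical, and we assume the MLFlow UI accepts fully percent-encoded values.

```python
import json
from urllib.parse import quote


def metric_link(run_ids, experiment_id, metric_keys,
                base="/mlflow/#/metric/training_auc"):
    """Build a metric-comparison link with percent-encoded JSON lists."""
    # Compact JSON (no spaces), then percent-encode quotes, brackets and commas
    runs = quote(json.dumps(run_ids, separators=(",", ":")))
    keys = quote(json.dumps(metric_keys, separators=(",", ":")))
    return f"{base}?runs={runs}&experiment={experiment_id}&plot_metric_keys={keys}"


demo_link = metric_link(["run_a", "run_b"], 1,
                        ["training_auc", "validation_auc"])
print(demo_link)
```

Because the lists go through `json.dumps`, adding a third run or another metric key only means appending to a Python list.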

# That's some great improvement! So what's next?
<blockquote>We included the customer reviews and developed a better model. We've logged everything to MLFlow for detailed comparisons. Now what?<br>
    Let's deploy our models to production so we can utilize what we've built. First, we'll deploy our word vectorizer model, and then deploy our GBM model. Finally, we'll create a feed from the first model to the second, so we can see the final predictions!
        <footer>Splice Machine</footer></blockquote><br>
<img src=https://splice-demo.s3.amazonaws.com/H2O+Model+Deployment+Flow.png>

## Step 1: Deploy Word2Vec Model

In [None]:
help(mlflow.deploy_db)

In [None]:
# Get the run_id from the name. Note that multiple runs can have the same name, so this returns a list
run_id = mlflow.get_run_ids_by_name('review_tokenizer')[0]

# We specify model_cols so the trigger knows which columns go into the model
jid = mlflow.deploy_db(schema, 'AMAZON_REVIEWS', run_id, classes=[f'C{i+1}' for i in range(VECTOR_LENGTH)], model_cols=['REVIEW'])
mlflow.watch_job(jid)

## Sweet! Now when we insert new data into our <code>AMAZON_REVIEWS</code> table, we will automatically have the review vectorization

In [None]:
%%sql

delete from amazon_reviews where ID=9993329;

insert into amazon_reviews (productid, userid, summary, score, helpfulnessdenominator, id, profilename, helpfulnessnumerator, review_time, review) 
    values ('B00141QYSQ', 'A1YS02UZZGRDCT', 'Do Not Buy', 2, 7, 9993329, 'Evan Eberhardt', 0, 1314489600, 'Nothing like what i expected');

select * from amazon_reviews where id=9993329;
delete from amazon_reviews where ID=9993329;

## Amazing! Up Next: GBM
<blockquote>
    Now we need to deploy our GBM model in a slightly different way. We will create a <b>new</b> table with the GBM model. This model will take input as the original features plus the vectorization of the review. 
    <br>To create a new table, you use the same <code>deploy_db</code> function but with a few extra parameters:
    <ul>
        <li><code>df</code>: The dataframe used to create the table with</li>
        <li><code>create_model_table</code>: a boolean to indicate that you'd like to create a new table</li>
        <li><code>primary_key</code>: The primary key(s) you'd like to use for the table, given as column names paired with their SQL types. Tables deployed with a model <b>require</b> a primary key</li>
    </ul>
        <footer>Splice Machine</footer></blockquote><br>



In [None]:
# Get the run_id from the name. Note that multiple runs can have the same name, so this returns a list
run_id = mlflow.get_run_ids_by_name('GBM with word vectors')[0]

deploy_df = hc.asSparkFrame(ext_reviews[predictors])

splice._dropTableIfExists(f'{schema}.gbm_w2v_model')

jid = mlflow.deploy_db(schema, 'gbm_w2v_model', run_id, df=deploy_df, create_model_table=True, primary_key={'REVIEW_ID': 'INT'}, classes=['negative', 'positive'])
mlflow.watch_job(jid)

## Almost Done!
<blockquote>
    Now we have 2 models deployed in the database:
    <ul>
        <li> A model that takes a sentence and converts it into a vector</li>
        <li> A model that takes the vector + a few other features and makes a prediction about the review </li>
    </ul>
    Now, all we need to do is connect them with a simple pipeline. We'll connect the 2 tables together like the image above to create a full cycle ML Pipeline
<footer>Splice Machine</footer></blockquote><br>

In [None]:
%%sql

CREATE TRIGGER WORD2VEC_PIPELINE
AFTER UPDATE
ON AMAZON_REVIEWS
REFERENCING NEW AS N
FOR EACH ROW
INSERT INTO GBM_W2V_MODEL (PRODUCTID,USERID,HELPFULNESSNUMERATOR,HELPFULNESSDENOMINATOR,TIME,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26,C27,C28,C29,C30,C31,C32,C33,C34,C35,C36,C37,C38,C39,C40,C41,C42,C43,C44,C45,C46,C47,C48,C49,C50,REVIEW_ID)
    values(N.PRODUCTID,N.USERID,N.HELPFULNESSNUMERATOR,N.HELPFULNESSDENOMINATOR,N.REVIEW_TIME,N.C1,N.C2,N.C3,N.C4,N.C5,N.C6,N.C7,N.C8,N.C9,N.C10,N.C11,N.C12,N.C13,N.C14,N.C15,N.C16,N.C17,N.C18,N.C19,N.C20,N.C21,N.C22,N.C23,N.C24,N.C25,N.C26,N.C27,N.C28,N.C29,N.C30,N.C31,N.C32,N.C33,N.C34,N.C35,N.C36,N.C37,N.C38,N.C39,N.C40,N.C41,N.C42,N.C43,N.C44,N.C45,N.C46,N.C47,N.C48,N.C49,N.C50,N.ID)
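
Typing out all 50 vector columns by hand is error-prone, especially if you change the vector size later. As a sketch, the same trigger statement could be generated in a Python cell. The table, trigger and column names are taken from the SQL above; how you then submit the string (for example by pasting it into a `%%sql` cell) is left open here.

```python
VECTOR_LENGTH = 50  # same vector size as the doc2vec step in this notebook

# Columns that have the same name in both tables
shared_cols = ["PRODUCTID", "USERID", "HELPFULNESSNUMERATOR", "HELPFULNESSDENOMINATOR"]
vector_cols = [f"C{i + 1}" for i in range(VECTOR_LENGTH)]

# Column order in the target table and the matching columns in AMAZON_REVIEWS
target_cols = shared_cols + ["TIME"] + vector_cols + ["REVIEW_ID"]
source_cols = shared_cols + ["REVIEW_TIME"] + vector_cols + ["ID"]

trigger_sql = "\n".join([
    "CREATE TRIGGER WORD2VEC_PIPELINE",
    "AFTER UPDATE",
    "ON AMAZON_REVIEWS",
    "REFERENCING NEW AS N",
    "FOR EACH ROW",
    f"INSERT INTO GBM_W2V_MODEL ({','.join(target_cols)})",
    f"    values({','.join('N.' + c for c in source_cols)})",
])
print(trigger_sql)
```

Keeping `target_cols` and `source_cols` side by side also makes the two renames (`REVIEW_TIME` to `TIME`, `ID` to `REVIEW_ID`) easy to spot.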
    

In [None]:
%%sql

delete from amazon_reviews where ID in (9993329,999350);
delete from GBM_W2V_MODEL where REVIEW_ID in (9993329,999350);

insert into amazon_reviews (productid, userid, summary, score, helpfulnessdenominator, id, profilename, helpfulnessnumerator, review_time, review) 
    values ('B00141QYSQ', 'A1YS02UZZGRDCT', 'Do Not Buy', 2, 7, 9993329, 'Evan Eberhardt', 0, 1314489600, 'Nothing like what i expected');

insert into amazon_reviews (productid, userid, summary, score, helpfulnessdenominator, id, profilename, helpfulnessnumerator, review_time, review) 
    values ('B0009XLVGA', 'A1NHQNQ3TVXTZF', 'An awesome choice', 5, 2, 999350, 'Evan Eberhardt', 0, 1314433500, 'You have to buy this! Its great!');


select REVIEW_ID, CUR_USER, EVAL_TIME, PREDICTION, "negative", "positive" from GBM_W2V_MODEL WHERE REVIEW_ID=9993329 or REVIEW_ID=999350

## Let's see which models are deployed

In [None]:
mlflow.get_deployed_models()

# Incredible!
## Let's recap
<blockquote>
    We:
    <ul>
        <li> Imported data from external sources in both SQL and Python</li>
        <li> Created a simple model with decent accuracy using H2O </li>
        <li> Improved that model drastically using a word2vec pipeline with SKLearn </li>
        <li> Tracked, compared, and persisted all of our model and run information in mlflow </li>
        <li> Deployed the better model, along with the standalone word2vec pipeline, to tables directly in the database </li>
        <li> Chained those tables together with simple triggers </li>
        <li> Made predictions on new Amazon reviews </li>
        <li> Viewed which models we have in production </li>
    </ul>
    
 That's quite the accomplishment. Congratulations!
<footer>Splice Machine</footer></blockquote><br>