# Splice + MLflow: What you need to know
<blockquote><p class='quotation'><span style='font-size:15px'>MLflow allows you to track experiments and share results with teammates easily.<br>At Splice Machine, MLflow is embedded directly into your database (MLManager). This means that all of the configuration is taken care of for you, and <b>everything</b> you track in MLflow is persisted to the database.<br><br>
    MLflow requires the NSDS (or ExtNSDS) as a parameter to connect to the database. If you are unfamiliar with our NSDS, check out the <a href="./7.1 Splice and Spark.ipynb">previous notebook</a> on using Splice Machine and Spark.<footer>Splice Machine</footer>
</blockquote>

#### Let's start our Spark Session

In [None]:
# Setup
from pyspark.sql import SparkSession
from splicemachine.spark import PySpliceContext

spark = SparkSession.builder.getOrCreate()
splice = PySpliceContext(spark)

import warnings
warnings.simplefilter("ignore")
warnings.filterwarnings("ignore")

## Importing MLflow Support
<blockquote><p class='quotation'><span style='font-size:15px'>Using MLflow on Splice is as easy as a single import. After importing, you immediately have access to the <code>mlflow</code> module. <br>You will have access to all of the functions in the standard MLflow API as well as some extra ones that are custom to Splice Machine.<br> You can check out our full <a href='https://pysplice.readthedocs.io/en/latest/splicemachine.mlflow_support.html'>documentation</a> for everything available and our <a href="https://www.github.com/splicemachine/pysplice">GitHub</a> repo to raise issues and ask questions. <br>After importing, you can register your Splice Context for access to even more functions.<br><br><footer>Splice Machine</footer>
</blockquote>

In [None]:
# MLFlow Setup
from splicemachine.mlflow_support import *
mlflow.register_splice_context(splice)

## Step 0: The MLflow UI
<blockquote> You can access the MLflow UI in 2 ways:
    <ul>
        <li>From the URL at <a href="/mlflow">/mlflow</a></li>
        <li>From the Notebook as an IFrame using the <code>get_mlflow_ui</code> function. You can also pass in an optional experiment ID and/or run ID to open the IFrame directly to your experiment/run.</li>
    </ul>
</blockquote>

In [None]:
from splicemachine.notebook import get_mlflow_ui
get_mlflow_ui()

## MLflow concepts
<blockquote>MLflow Tracking is organized around the concepts of <code>experiments</code> and <code>runs</code>:<br> 
    <ul>
        <li>Experiments can be thought of as the problem you are trying to track or solve (e.g. Performance Testing TPC-C).</li>
        <li>Runs are single executions of some piece of code (e.g. a single full execution of TPC-C with some database configuration). Experiments have multiple runs (1-to-many).</li>
    </ul>
</blockquote>

### Setting an Experiment
<blockquote>To start an Experiment, you can call <code>mlflow.set_experiment('EXP_NAME')</code> and pass in an experiment name.<br> 
    If the Experiment exists, it will be set as the <code>active</code> experiment. Otherwise, MLflow will create the Experiment for you and set it as active.

</blockquote>

In [None]:
help(mlflow.set_experiment)

In [None]:
mlflow.set_experiment('mlflow_api_demo')

#### View Your [Experiment](/mlflow)

In [None]:
exp_id = mlflow.client.get_experiment_by_name('mlflow_api_demo').experiment_id
get_mlflow_ui(exp_id)

### Starting a run
<blockquote>Once you have an Experiment, you can start your run by calling <code>mlflow.start_run(run_name='RUN_NAME')</code> and passing in a run name. You can also pass in the optional <code>tags</code> parameter as a dictionary to store key-value pairs associated with the run.<br> 
When you start a run, MLflow (MLManager) automatically logs some information for you:
    <ul>
        <li>Start Date</li>
        <li>Current User</li>
        <li>Run ID</li>
        <li>DB Transaction ID</li>
        <li>Source (where the run came from)</li>
    </ul>
</blockquote>

In [None]:
help(mlflow.start_run)

In [None]:
mlflow.start_run(run_name='First_pass_default_settings', tags={'team': 'pd', 'purpose':'performance testing'})

### Tracking Concepts
<blockquote>There are 4 main concepts when tracking a run:<br>
    <ul>
        <li><b>Tags</b>: Any key-value pair that likely won't be used for comparison between runs (non-measurable items). Only tags can be overwritten.</li>
        <li><b>Parameters</b>: Configuration options that were chosen before starting the run and that may have a measurable effect on the outcome</li>
        <li><b>Metrics</b>: The measured outcomes that can be compared between runs. Metrics have an optional <code>step</code> parameter if you want to track metrics over time for a specific run</li>
        <li><b>Artifacts</b>: Objects (files, images, notebooks, etc) to be associated with a run</li>
    </ul>
</blockquote>
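
The rules above can be sketched with a toy in-memory model. This is only an illustration of the semantics described here (tags overwritable, parameters write-once, metrics recorded per step); the `ToyRun` class and its method names are hypothetical and are not part of MLManager or MLflow.

```python
# Toy in-memory model of a run; NOT the MLManager or MLflow implementation.
from dataclasses import dataclass, field


@dataclass
class ToyRun:
    tags: dict = field(default_factory=dict)
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)    # name -> list of (step, value)
    artifacts: list = field(default_factory=list)  # names of attached files

    def set_tag(self, key, value):
        # Tags may be overwritten freely.
        self.tags[key] = value

    def log_param(self, key, value):
        # Parameters are write-once in this sketch.
        if key in self.params:
            raise ValueError(f"parameter {key!r} is already set")
        self.params[key] = value

    def log_metric(self, key, value, step=0):
        # Each metric keeps its history so it can be plotted over steps.
        self.metrics.setdefault(key, []).append((step, value))

    def log_artifact(self, name):
        # Only the name is recorded here; a real store would keep the file contents.
        self.artifacts.append(name)


run = ToyRun()
run.set_tag("teammates", "carol")
run.set_tag("teammates", "carol, daniel")   # overwriting a tag is allowed
run.log_param("spark executors", "5")
for i in range(3):
    run.log_metric("Build time", i * 3, step=i)

print(run.tags)
print(run.metrics["Build time"])
```

Logging the same parameter a second time raises an error in this sketch, which mirrors the "only tags can be overwritten" rule.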

In [None]:
help(mlflow.set_tag)
print('---------------------------------------------------------------------------------')
help(mlflow.lp)
print('---------------------------------------------------------------------------------')
help(mlflow.lm)
print('---------------------------------------------------------------------------------')
help(mlflow.log_artifact)

In [None]:
mlflow.set_tag('teammates', 'carol, daniel')

In [None]:
mlflow.lp('spark executors', '5')

In [None]:
mlflow.lm('execution time sec', 25)

In [None]:
# Setting metrics over "steps"
for i in range(10):
    mlflow.lm('Build time', i*3, step=i)

In [None]:
get_mlflow_ui(mlflow.current_exp_id(), mlflow.current_run_id())

### End Run
<blockquote>When you finish a run, call <code>mlflow.end_run()</code>.<br> You can tell that a run has ended in the MLflow UI because there is a green check mark next to it.</blockquote>

In [None]:
mlflow.end_run()

## Artifacts
<blockquote>Artifacts can be any file type. The artifact is serialized as a BLOB and stored in the database. Files with extensions such as <code>.txt</code>, <code>.pdf</code>, <code>.yaml</code>, <code>.jpeg</code>, etc. will be available for preview in the MLflow UI.<br>We can use some neat Jupyter tricks like <code>writefile</code> to make artifacts even more useful.
</blockquote>
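
To make "serialized as a BLOB" concrete, here is a minimal sketch that round-trips a file's bytes through a database column. It uses Python's built-in SQLite as a stand-in for Splice Machine, and the table and column names are made up for illustration; it says nothing about how MLManager actually lays out its tables.

```python
import sqlite3

# SQLite stands in for the database here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifacts (run_id TEXT, name TEXT, content BLOB)")

# Raw bytes of a small YAML file (any file type works, since only bytes are stored).
payload = "name: datatest\n".encode("utf-8")
conn.execute(
    "INSERT INTO artifacts VALUES (?, ?, ?)",
    ("run-1", "my_env.yaml", payload),
)

# Read the bytes back and decode them into the original text.
(stored,) = conn.execute(
    "SELECT content FROM artifacts WHERE run_id = ? AND name = ?",
    ("run-1", "my_env.yaml"),
).fetchone()
restored = stored.decode("utf-8")
conn.close()

print(restored)
```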

#### Write a yaml file

In [None]:
%%writefile my_env.yaml

name: datatest  
channels:
- defaults
- conda-forge
- ericmjl
dependencies:
- python=3.6
- colorama=0.3.9
- jupyter=1.0.0
- ipykernel=4.6.1
- jupyterlab=0.25.2
- pytest=3.1.3
- pytest-cov=2.5.1
- tinydb=3.3.1
- pyyaml=3.12
- pandas-summary=0.0.41
- environment_kernels=1.1
- missingno=0.3.7


#### Write a code snippet

In [None]:
%%writefile harm_mean.py
def harm_mean(nums, rnd=4):
    """
    Calculates the harmonic mean of n numbers rounded to rnd decimal places
    :param nums: List of numbers
    :param rnd: Number of decimal places to round the result
    """
    return round(len(nums)/sum([1/i for i in nums]),rnd)

## Put it together
#### Start a run, log our artifacts, view the results

In [None]:
!jupyter nbconvert --to html '7.2 Splice MLflow Support.ipynb'
with mlflow.start_run(run_name='environment_requirements'):
    run_id = mlflow.current_run_id()
    exp_id = mlflow.current_exp_id()
    mlflow.log_artifact('my_env.yaml', name='my_env.yaml')
    mlflow.log_artifact('harm_mean.py', name='harm_mean.py')
    mlflow.log_artifact('7.2 Splice MLflow Support.ipynb', name='training_notebook.ipynb')
    mlflow.log_artifact('7.2 Splice MLflow Support.html', name='training_notebook.html')

#### Click on one of your artifacts to render the results!

In [None]:
get_mlflow_ui(exp_id, run_id)

#### Another Artifact Example

In [None]:
import matplotlib.pyplot as plt
from random import random
with mlflow.start_run(run_name='my_plot'):
    plt.rcParams.update({
        "pgf.texsystem": "pdflatex",
        "pgf.preamble": [
             r"\usepackage[utf8x]{inputenc}",
             r"\usepackage[T1]{fontenc}",
             r"\usepackage{cmbright}",
             ]
    })

    plt.figure(figsize=(4.5, 2.5))
    plt.plot([random()*19 for _ in range(10)])
    plt.text(0.5, 3., "serif", family="serif")
    plt.text(0.5, 2., "monospace", family="monospace")
    plt.text(2.5, 2., "sans-serif", family="sans-serif")
    plt.xlabel(r"µ is not $\mu$")
    plt.tight_layout(pad=.5)

    plt.savefig("pgf_texsystem.png")
    mlflow.log_artifact('pgf_texsystem.png', 'results.png')
    rid = mlflow.current_run_id()
    eid = mlflow.current_exp_id()

In [None]:
get_mlflow_ui(eid,rid)

### Context Managers in Runs
<blockquote>There are 2 Context Managers in MLManager/MLflow: <code>start_run</code> and <code>timer</code>.<br>
Context managers enable some autologging and cleanup functions for you. To use a Context Manager, put the <code>with</code> keyword before the call, append a <code>:</code> after the call, and indent all lines that belong to it.<br>
Another great feature is that if the run fails for some reason, MLflow will track that for you.</blockquote>
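
For readers new to context managers, the sketch below shows how cleanup and failure tracking of this kind can work in principle. `toy_run`, `toy_timer`, and `run_log` are hypothetical names built on Python's `contextlib`; this is not how MLManager implements `start_run` or `timer`, only an illustration of the pattern.

```python
import time
from contextlib import contextmanager

# Hypothetical stand-in for stored run records; not MLManager's internals.
run_log = {}


@contextmanager
def toy_run(name):
    run_log[name] = {"status": "RUNNING"}
    try:
        yield run_log[name]
    except Exception:
        # The body raised: record the failure, then let the error propagate.
        run_log[name]["status"] = "FAILED"
        raise
    else:
        run_log[name]["status"] = "FINISHED"


@contextmanager
def toy_timer(record, key):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the elapsed seconds even if the timed block fails.
        record[key] = time.perf_counter() - start


with toy_run("ok") as record:
    with toy_timer(record, "run time"):
        time.sleep(0.1)

try:
    with toy_run("broken"):
        raise RuntimeError("boom")
except RuntimeError:
    pass

print(run_log["ok"]["status"], run_log["broken"]["status"])
```

The point of the pattern is that the status update happens whether or not the body succeeds, so no explicit end-of-run call is needed.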

In [None]:
with mlflow.start_run(run_name='run with context manager'):
    mlflow.lp('foo','bar')
    mlflow.lm('score', 92)

In [None]:
with mlflow.start_run(run_name='a run that failed'):
    raise Exception

In [None]:
from time import sleep
# Multiple context managers
with mlflow.start_run(run_name='using the timer too'):
    with mlflow.timer('run time'):
        sleep(2)
    print('done!')

#### Timers are stored as parameters by default, but can also be stored as metrics

In [None]:
from time import sleep
# Multiple context managers
with mlflow.start_run(run_name='using the timer as a metric'):
    with mlflow.timer('run time', param=False):
        sleep(2)
    print('done!')

### Nested Runs
<blockquote>MLflow supports the concept of <code>nested</code> runs. A nested run is a run that occurs underneath a parent run. In machine learning, this could be used for hyperparameter tuning (like choosing K in a k-means clustering algorithm), but it can be used for anything you find useful.<br> To use it, simply pass <code>nested=True</code> to the <code>start_run</code> function.</blockquote>

In [None]:
from random import randint, sample
from time import sleep
from tqdm.notebook import tqdm
exec_time = [1,3,5,2]
num_execs = []
with mlflow.start_run(run_name='parent run'):
    for i in tqdm(range(4)):
        with mlflow.start_run(run_name=f'child {i+1}', nested=True):
            with mlflow.timer('run time', param=False):
                sleep(exec_time[i])
            mlflow.set_tag('child', 'yes')
            mlflow.lp('num_executors', i+1)
            num_execs.append(i+1)
    # Plot results
    plt.figure(figsize=(4.5, 2.5))
    plt.plot(num_execs, exec_time)

    plt.ylabel('exec time')
    plt.xlabel('num executors')
    plt.tight_layout(pad=.5)
    plt.savefig("spark_results.png")
    mlflow.log_artifact('spark_results.png','spark_results.png')
    display(get_mlflow_ui(mlflow.current_exp_id()))

## Storing ML Models
<blockquote><p class='quotation'><span style='font-size:15px'>Just like everything else we've tracked so far, tracking ML Models is easy with Splice Machine's MLManager. The <code>log_model</code> and <code>load_model</code> functions are all you need. 
    <footer>Splice Machine</footer>   
</blockquote>

#### Let's try it out

In [None]:
from sklearn import svm
from sklearn import datasets
from sklearn.metrics import accuracy_score

# Start a run
with mlflow.start_run(run_name='my first model'):
    # Load some sklearn data
    digits = datasets.load_digits()

    # Build a simple model
    clf = svm.SVC(gamma=0.001, C=100.)
    # Log parameters to mlflow
    mlflow.lp('gamma', 0.001)
    mlflow.lp('C', 100.0)

    # Train the model
    with mlflow.timer('train_time'):
        clf.fit(digits.data[:-1], digits.target[:-1])

    # Predict on the same data the model was trained on
    preds = clf.predict(digits.data[:-1])

    # Measure accuracy on the training data
    acc = accuracy_score(digits.target[:-1], preds)
    print('Accuracy:',acc)
    # Log metric to mlflow
    mlflow.lm('accuracy', acc)
    
    # Save model
    mlflow.log_model(clf, 'clf_model')
    rid = mlflow.current_run_id()

#### Load our model back and make new predictions

In [None]:
loaded_model = mlflow.load_model(run_id=rid, name='clf_model')
display(loaded_model)
# Make a new prediction
new_data = [ 
    0.,  0., 12., 10.,  0.,  0.,  0.,  0.,  0.,  0., 14., 16., 16.,
    14.,  0.,  0.,  0.,  0., 13., 16., 15., 10.,  1.,  0.,  0.,  0.,
    11., 16., 16.,  7.,  0.,  0.,  0.,  0.,  0.,  4.,  7., 16.,  7.,
    0.,  0.,  0.,  0.,  0.,  4., 16.,  9.,  0.,  0.,  0.,  5.,  4.,
    12., 16.,  4.,  0.,  0.,  0.,  9., 16., 16., 10.,  0.,  0.
]
print('Prediction on new data:', loaded_model.predict([new_data])[0])

In [None]:
spark.stop()

# Fantastic!
<blockquote> 
Now you have all of the tools necessary to start tracking your experiments and sharing results! Again, feel free to check out our <a href="https://pysplice.readthedocs.io/en/latest/splicemachine.mlflow_support.html">documentation</a>!<br><br>
    Next Up: <a href='./7.3 Data Exploration.ipynb'>Using MLManager to explore and analyze your data</a>
<footer>Splice Machine</footer>
</blockquote>