# Welcome to the ninth MAST-ML tutorial notebook, model container hosting!

## In this notebook, we will learn about how MAST-ML can be used to:

1. [Set up MAST-ML, import dependencies, and set important variables](#task1)
2. [Standard machine learning setup](#task2)
3. [Perform uncertainty calibration](#task3)
4. [Fit domain model](#task4)
5. [Gather all files to build model in a container](#task5)
6. [Build and push a container with trained model](#task6)


Note that this notebook will not work on Google Colab due to the Docker dependency for building containers.

## Task 1: Set up MAST-ML, import dependencies, and set important variables<a name="task1"></a>

Create a clean environment, which will make building a container easier later. Run the following steps in a terminal:

1. python3 -m venv python_env
2. source python_env/bin/activate
3. pip install -U pip
4. pip install jupyterlab
5. jupyter lab
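
Once Jupyter is running, it can be worth confirming that the notebook kernel really uses the new environment. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is illustrative and not part of MAST-ML:

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print('Interpreter:', sys.executable)
print('Inside a virtual environment:', in_virtualenv())
```

If the second line prints False, the kernel is probably running from the base installation rather than from python_env.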

In [None]:
# Install mastml (in this case a specific branch)
!pip install git+https://github.com/uw-cmg/MAST-ML.git@dev_lane

Import all packages that will be used

In [None]:
from mastml.data_splitters import SklearnDataSplitter, NoSplit
from mastml.preprocessing import SklearnPreprocessor
from mastml.models import SklearnModel, HostedModel
from mastml.datasets import LocalDatasets
from mastml.domain import Domain
from pathlib import Path

import subprocess
import docker
import shutil
import glob
import os

Define standard variables that will be used

In [None]:
# Standard names and locations to be used
cal_name = 'calibration_run'  # Location to save calibration run
dom_name = 'domain_run'  # Location to save domain run
output = 'container_files'  # Directory holding the files for the container build
docker_username = 'leschultz'  # Docker Hub username
container_name = 'test'  # Container name
container_tag = 'dev_test'  # Container tag (or version)
target = 'E_regression_shift'  # The target variable
extra_columns = ['mat', 'group']  # Columns not used as features

# Full image reference on Docker Hub: username/name:tag
container = '{}/{}:{}'.format(
                              docker_username,
                              container_name,
                              container_tag,
                              )

Load Data

In [None]:
# Load the data in a standard manner
d = LocalDatasets(
                  file_path='./diffusion.csv',
                  target=target,
                  extra_columns=extra_columns,
                  as_frame=True
                  )
data_dict = d.load_data()  # The actual loading

# Data in a useful form
X = data_dict['X']  # The features
y = data_dict['y']  # The target
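
As a rough picture of what the loader returns, the feature matrix X holds every column that is neither the target nor listed in extra_columns. The sketch below illustrates that partition on plain column names. `split_columns` is a hypothetical helper and the feature column names are made up, so it describes the idea rather than the LocalDatasets implementation:

```python
def split_columns(columns, target, extra_columns):
    """Partition column names into features, target, and extras."""
    excluded = {target, *extra_columns}
    features = [c for c in columns if c not in excluded]
    return features, target, list(extra_columns)


# Hypothetical column names, for illustration only
columns = ['mat', 'group', 'feature_a', 'feature_b', 'E_regression_shift']
features, tgt, extras = split_columns(
    columns,
    target='E_regression_shift',
    extra_columns=['mat', 'group'],
)
print(features)
```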

## Task 2: Standard machine learning setup<a name="task2"></a>

In [None]:
# Regression metrics to include
metrics = [
           'r2_score',
           'mean_absolute_error',
           'root_mean_squared_error',
           'rmse_over_stdev',
           ]

In [None]:
# Data scaling that comes standard with many models
preprocessor = SklearnPreprocessor(
                                   preprocessor='StandardScaler',
                                   as_frame=True,
                                   )

# The type of regression model to use
model = SklearnModel(model='RandomForestRegressor')

## Task 3: Perform uncertainty calibration<a name="task3"></a>

In [None]:
# The type of cross validation to conduct
splitter = SklearnDataSplitter(
                               splitter='RepeatedKFold',
                               n_repeats=1,
                               n_splits=5
                               )

# Perform uncertainty quantification
splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram', 'Error'],
                  error_method='stdev_weak_learners',
                  recalibrate_errors=True,
                  )

# Rename the output directory
file_to_move = glob.glob('Ran*')[0]
subprocess.run(['mv', file_to_move, cal_name])
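
To make the two uncertainty options above concrete: with error_method='stdev_weak_learners', the raw uncertainty of a prediction is the spread of the individual trees' predictions, and recalibrate_errors=True then rescales it. The sketch below assumes a linear rescaling, a * sigma + b, with coefficients of the kind a calibration run would produce. The values of a and b and the tree predictions are invented for illustration, and this is not MAST-ML's internal code:

```python
import statistics


def raw_uncertainty(tree_predictions):
    """Spread of the weak learners' predictions for one sample."""
    # Assumption: population standard deviation across the trees
    return statistics.pstdev(tree_predictions)


def recalibrate(sigma, a, b):
    """Apply an assumed linear recalibration to an uncertainty estimate."""
    return a * sigma + b


# Invented predictions from four trees for a single sample
tree_predictions = [0.9, 1.1, 1.0, 1.2]
sigma = raw_uncertainty(tree_predictions)

# Invented coefficients; real ones would come from the calibration run
a, b = 1.5, 0.02
print('raw:', sigma)
print('recalibrated:', recalibrate(sigma, a, b))
```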

## Task 4: Fit domain model<a name="task4"></a>

In [None]:
# Domain with MADML
params = {'n_repeats': 2}
domain = ('madml', params)

# MADML has a default set of splitters (other sets can be added through params)
splitter = NoSplit()
splitter.evaluate(
                  X=X,
                  y=y,
                  models=[model],
                  preprocessor=preprocessor,
                  metrics=metrics,
                  plots=['Scatter', 'Histogram'],
                  domain=[domain],
                  )

# Rename the output directory
file_to_move = glob.glob('Ran*')[0]
subprocess.run(['mv', file_to_move, dom_name])
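
Both renaming steps rely on the mv command and on the first match of Ran* being the run that just finished. A more portable sketch uses shutil.move and picks the most recently modified match. `rename_latest_run` is a hypothetical helper, not a MAST-ML function, and the demonstration runs in a temporary directory so it does not touch real output:

```python
import glob
import os
import shutil
import tempfile


def rename_latest_run(pattern, new_name):
    """Move the most recently modified path matching pattern to new_name."""
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError('No path matches {!r}'.format(pattern))
    latest = max(matches, key=os.path.getmtime)
    shutil.move(latest, new_name)
    return new_name


# Demonstration in a scratch directory with a made-up run folder name
scratch = tempfile.mkdtemp()
os.mkdir(os.path.join(scratch, 'RandomForestRegressor_example'))
renamed = rename_latest_run(
    os.path.join(scratch, 'Ran*'),
    os.path.join(scratch, 'domain_run'),
)
print(os.path.isdir(renamed))
```

In the notebook itself, the equivalent calls would be rename_latest_run('Ran*', cal_name) and rename_latest_run('Ran*', dom_name).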

## Task 5: Gather all files to build model in a container<a name="task5"></a>

In [None]:
# Gather the standard objects to create a single model
cal_params = os.path.join(cal_name, 'recalibration_parameters_train.csv')
model_path = os.path.join(dom_name, 'RandomForestRegressor.pkl')
preprocessor_path = os.path.join(dom_name, 'StandardScaler.pkl')
domain_path = list(map(str, Path(dom_name).rglob('domain_*.pkl')))

files = [cal_params, model_path, preprocessor_path, *domain_path]

# Copy the files (create the output directory first if it does not exist)
os.makedirs(output, exist_ok=True)
for f in files:
    shutil.copy(f, os.path.join(output, os.path.basename(f)))

# Save the training features and target
X.to_csv(
         os.path.join(output, 'X_train.csv'),
         index=False
         )
y.to_csv(
         os.path.join(output, 'y_train.csv'),
         index=False
         )

## Task 6: Build and push a container with trained model<a name="task6"></a>

Build the container from the provided Dockerfile. You need to modify the Dockerfile and predict.py to match how your model behaves: consider the type of scaler you use, the model type, the domain assessments, the packages installed by pip, and so on.
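
Before building, it can help to confirm that the build directory holds everything the image needs. The sketch below checks for a Dockerfile, predict.py, and the files gathered in Task 5. `missing_from_context` is a hypothetical helper, and the exact file list is an assumption that depends on your own Dockerfile:

```python
import os
import tempfile


def missing_from_context(directory, required):
    """Return the required file names that are absent from directory."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Assumed file list; adjust to match your own Dockerfile and predict.py
required = [
    'Dockerfile',
    'predict.py',
    'RandomForestRegressor.pkl',
    'StandardScaler.pkl',
    'recalibration_parameters_train.csv',
    'X_train.csv',
    'y_train.csv',
]

# Demonstration on a scratch directory that only contains a Dockerfile
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, 'Dockerfile'), 'w').close()
print(missing_from_context(scratch, required))
```

In the notebook, calling missing_from_context(output, required) should return an empty list before the build cell runs.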

In [None]:
# Build container
client = docker.from_env()
image, _ = client.images.build(
                               path=output,
                               tag=container,
                               quiet=False
                               )

In [None]:
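# Push container (the repository must include the username to match the tag built above)
client.images.push(
                   repository='{}/{}'.format(docker_username, container_name),
                   tag=container_tag
                   )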

client.images.remove(image.id)
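
A failed push (for example, when not logged in to Docker Hub) does not necessarily raise an exception in the Docker SDK for Python; the failure can instead appear as an entry with an error key in the push output. As a hedged sketch, the helper below scans decoded output lines of the kind returned by client.images.push(..., stream=True, decode=True). `push_errors` is a hypothetical name, and the sample lines are invented rather than captured from a real push:

```python
def push_errors(lines):
    """Collect error messages from decoded push output lines."""
    return [line['error'] for line in lines if 'error' in line]


# Invented sample output, shaped like the decoded JSON lines of a push
sample = [
    {'status': 'The push refers to repository [docker.io/user/test]'},
    {'status': 'Preparing', 'id': 'abc123'},
    {'error': 'denied: requested access to the resource is denied'},
]

errors = push_errors(sample)
if errors:
    print('Push failed:', errors)
else:
    print('Push succeeded')
```

With a real client, the same helper could be applied to the generator returned by the push call, which makes an authentication or naming problem visible before the prediction step tries to use the hosted image.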

In [None]:
# Now predict on the training features to make sure the container runs
model = HostedModel(container)
preds = model.predict(X)
print(preds)