<a href="https://colab.research.google.com/github/datarobot-community/DRU-MLOps/blob/master/27May2021 - MLOps_III_DRUM_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLOps III - DRUM Laboratory

In this notebook we will:

* Build a simple regression model using Scikit-Learn
* Use DRUM to test & validate the model
* Use DRUM to score data in batch mode


## Use case to be addressed:

We will build a regression model to predict the median value of owner-occupied homes in the Boston area.

Let's begin by uploading a few resources we will need:

1. Training set: **boston_housing.csv**
2. Scoring set: **boston_housing_inference.csv**
3. Requirements file: **colab_requirements.txt**
4. File with hooks used by the model: **custom.py**

In [1]:
from google.colab import files
uploaded = files.upload()

Saving boston_housing.csv to boston_housing.csv
Saving boston_housing_inference.csv to boston_housing_inference.csv
Saving colab_requirements.txt to colab_requirements.txt
Saving custom.py to custom.py


In [2]:
!ls

boston_housing.csv	      colab_requirements.txt  sample_data
boston_housing_inference.csv  custom.py


We will now create variables that hold the names of the training and inference datasets so that later cells can refer to them. First, the name of the training dataset:

In [3]:
TRAINING = 'boston_housing.csv'

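We will also store the name of the inference dataset in a variable so we can pass it to the DRUM commands. In a notebook, IPython expands `$INFERENCE` inside `!` shell commands from the Python variable of the same name. Each `!` command runs in its own subshell, so `!export INFERENCE` would have no lasting effect; to expose the value as a real environment variable as well, we set it through `os.environ`:

In [4]:
INFERENCE = 'boston_housing_inference.csv'

In [5]:
import os
os.environ['INFERENCE'] = INFERENCE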

Let's install the Python modules we need using the requirements file:

In [6]:
!cat colab_requirements.txt

datarobot==2.24.0
datarobot-drum==1.6.0
PyYAML==5.4.1
xgboost==1.2.1
folium==0.2.1
imgaug==0.2.5

In [7]:
!pip install -r colab_requirements.txt -q

# 1.- Model Training

We will now build a very simple Scikit-Learn regression model (a random forest) using the Boston housing prices dataset.

In [8]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import pickle
import datetime

## load data

df = pd.read_csv(TRAINING)
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [9]:
## set features and target

X = df.drop('MEDV', axis=1)
y = df['MEDV']

## train the model
rf = RandomForestRegressor(n_estimators = 20, random_state = 801)
rf.fit(X,y)

## serialize the model

with open('rf.pkl', 'wb') as pkl:
    pickle.dump(rf, pkl)

print("Done!")    

Done!
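
DRUM will later need to load the model from the pickle file, so it can be worth confirming that the file round-trips cleanly. The sketch below is only an illustration on synthetic data: the feature names `f0` to `f2` and the file name `rf_check.pkl` are made up for this example and are not part of the lab. It trains a small random forest, pickles it, reloads it, and checks that both copies give the same predictions.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data (hypothetical feature names).
rng = np.random.default_rng(801)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f0", "f1", "f2"])
y_demo = X_demo["f0"] * 2.0 + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=20, random_state=801)
model.fit(X_demo, y_demo)

# Serialize to a temporary file, then load it back from disk.
path = os.path.join(tempfile.mkdtemp(), "rf_check.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)
with open(path, "rb") as fh:
    reloaded = pickle.load(fh)

# The reloaded model should reproduce the original predictions.
same = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print("Round trip OK:", same)
```

The same check could be run against `rf.pkl` and the real training frame once the cell above has finished.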


In [13]:
! ls

boston_housing.csv	      custom.py    sample_data
boston_housing_inference.csv  __pycache__  validation.log
colab_requirements.txt	      rf.pkl


# 2.- Model Testing

We will now use DRUM to test how the model performs by computing latency times and memory usage for several different test case sizes. A report is generated after this process is completed.
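
To make concrete what such a test measures, here is a minimal hand-rolled sketch that times `predict` for a few batch sizes. It is only an illustration on synthetic data, not how DRUM implements `perf-test`; the batch sizes, iteration counts, and the helper name `time_predictions` are assumptions chosen for this example, and memory usage is not measured.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def time_predictions(model, n_features, cases):
    """Time model.predict for each (samples, iterations) pair in `cases`.

    Returns a list of dicts with min/avg/max latency in seconds.
    """
    rng = np.random.default_rng(0)
    report = []
    for samples, iters in cases:
        batch = rng.normal(size=(samples, n_features))
        timings = []
        for _ in range(iters):
            start = time.perf_counter()
            model.predict(batch)
            timings.append(time.perf_counter() - start)
        report.append({
            "samples": samples,
            "iters": iters,
            "min": min(timings),
            "avg": sum(timings) / len(timings),
            "max": max(timings),
        })
    return report


# Small synthetic model with 13 features, the same count as the housing data.
rng = np.random.default_rng(801)
X_demo = rng.normal(size=(200, 13))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=200)
demo_model = RandomForestRegressor(n_estimators=20, random_state=801).fit(X_demo, y_demo)

report = time_predictions(demo_model, n_features=13, cases=[(1, 20), (1000, 5), (10000, 2)])
for row in report:
    print(row)
```

Running more iterations on small batches and fewer on large ones mirrors the pattern visible in the DRUM output, where the iteration count drops as the test case grows.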



In [11]:
!drum perf-test --code-dir ./ --input $INFERENCE --target-type regression 

DRUM performance test
Model:      /content
Data:       /content/boston_housing_inference.csv
# Features: 13
Preparing test data...



Running test case with timeout: 600
Running test case: 72 bytes - 1 samples, 100 iterations
Processing |################################| 100/100
Running test case with timeout: 600
Running test case: 0.1MB - 1447 samples, 50 iterations
Processing |################################| 50/50
Running test case with timeout: 600
Running test case: 10MB - 144742 samples, 5 iterations
Processing |################################| 5/5
Running test case with timeout: 600
Running test case: 50MB - 723711 samples, 1 iterations
Processing |################################| 1/1
Test is done stopping drum server
  size     samples   iters    min     avg     max     total (s)   used (MB)   total physical

# 3.- Model Validation: Handling of Missing Values

We will now validate the model to detect and address issues before deployment. We strongly encourage you to run these tests, which are the same ones that DataRobot performs automatically before deploying models.

Specifically, DRUM will test null value imputation by setting each feature in the dataset to "missing" and then feeding the features to the model. We will send the results to **validation.log**

In [12]:
!drum validation --code-dir ./ --input $INFERENCE --target-type regression > validation.log

  defaults = yaml.load(f)


In [14]:
!cat validation.log 



Validation checks results
      Test case          Status   Details
Basic batch prediction   PASSED          
Null value imputation    PASSED          
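
The null value imputation check can be approximated by hand, which helps when a model fails it and we need to find the offending feature. The sketch below is an assumption-laden illustration, not DRUM's implementation: it uses synthetic data, a hypothetical helper named `check_null_imputation`, and a median `SimpleImputer` in front of the random forest, since this notebook does not show how missing values are actually handled.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


def check_null_imputation(model, data):
    """Blank out one feature at a time and confirm the model still predicts.

    Returns a dict mapping each column name to True (finite predictions)
    or False (an error was raised or the output was not finite).
    """
    results = {}
    for column in data.columns:
        probe = data.copy()
        probe[column] = np.nan  # simulate a completely missing feature
        try:
            preds = model.predict(probe)
            results[column] = bool(np.all(np.isfinite(preds)))
        except ValueError:
            results[column] = False
    return results


# Synthetic data with hypothetical column names.
rng = np.random.default_rng(801)
X_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
y_demo = X_demo["a"] - X_demo["b"] + rng.normal(scale=0.1, size=100)

# The imputer learns medians at fit time and applies them to missing values later.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=20, random_state=801),
)
pipeline.fit(X_demo, y_demo)

results = check_null_imputation(pipeline, X_demo.head(10))
print(results)
```

Placing the imputer inside the pipeline means the fitted medians travel with the model, so a feature that arrives entirely empty at scoring time can still be filled in.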


# 4.- Batch Scoring with DRUM
<a id="setup_complete"></a>

We want to use our model to make predictions; to do this, we'll leverage DRUM and its ability to natively handle our Scikit-Learn model. All we need to do is tell DRUM where the model resides and what data we wish to score.  

DRUM provides native support for many frameworks. To use DRUM with model frameworks that are not supported out of the box, we just need to create some custom hooks that tell DRUM how to work with the model. In this example, we'll explain some very simple custom hooks and provide links to more complex examples.
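
The `custom.py` file uploaded at the start of this lab is not reproduced in the notebook, so the following is only a sketch of what such hooks might look like. The hook names and signatures (`load_model`, `transform`, `score`) follow our understanding of DRUM's Python hook interface and should be checked against the documentation for the installed version; the zero-fill preprocessing is a placeholder assumption, not the lab's actual logic.

```python
import os
import pickle

import pandas as pd


def load_model(code_dir):
    """Load the pickled model from the code directory.

    Assumes the model was saved as rf.pkl, as in the training step.
    """
    with open(os.path.join(code_dir, "rf.pkl"), "rb") as fh:
        return pickle.load(fh)


def transform(data, model):
    """Prepare incoming data before scoring.

    Placeholder preprocessing (an assumption): replace missing values with 0.
    """
    return data.fillna(0)


def score(data, model, **kwargs):
    """Return regression predictions in a single 'Predictions' column."""
    return pd.DataFrame({"Predictions": model.predict(data)})
```

For a natively supported Scikit-Learn model, only the hooks that change the default behavior would need to be defined; the rest can be left out.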

In [15]:
!drum score  --code-dir ./ --input $INFERENCE --target-type regression > predictions.csv

  defaults = yaml.load(f)


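Let's have a look at the predictions. Because we redirected DRUM's console output into predictions.csv, the file holds a whitespace-aligned table rather than comma-separated values, so `pd.read_csv` with default settings reads each row as a single text column: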

In [16]:
pd.read_csv("predictions.csv").head()

Unnamed: 0,Predictions
0,0 25.740
1,1 21.720
2,2 33.860
3,3 33.615
4,4 35.315


In [17]:
! head predictions.csv

   Predictions
0       25.740
1       21.720
2       33.860
3       33.615
4       35.315
5       26.570
6       21.330
7       23.975
8       17.100
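
Since the redirected file is whitespace-aligned rather than comma-separated, one way to load it into numeric form is to split on runs of whitespace. The sketch below parses a short inline sample in the same layout (the first three rows shown above) instead of the real file, so it can run on its own; replacing the `io.StringIO` object with the path `"predictions.csv"` should work the same way, assuming the file keeps this layout.

```python
import io

import pandas as pd

# A few lines in the same layout as the redirected DRUM output.
raw_text = """   Predictions
0       25.740
1       21.720
2       33.860
"""

# sep=r"\s+" splits on runs of whitespace; because the header has one field
# fewer than the data rows, pandas uses the first column as the index.
predictions = pd.read_csv(io.StringIO(raw_text), sep=r"\s+")
print(predictions["Predictions"].tolist())
```

The resulting frame has a numeric `Predictions` column indexed by row number, which is easier to join back to the inference data than the single text column produced by the default parser settings.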
