### Operationalizing ML Models
John Hoff  
Machine Learning Architect  
jhoff@productiveedge.com
# Step 1: Preparing the Data
![Step 1: Prepare](https://drive.google.com/uc?export=view&id=1WM-vhpcZEjXxlWAodTKGnP_18-gunQT8)

This step downloads the dataset used in the example and splits it into two tables: one for training the model and one, deliberately skewed, for testing the deployed model.

_Please Note: The "Run All" command is safe to run on this notebook._

In [2]:
import os.path
import pandas as pd
import random
from urllib import request
from zipfile import ZipFile

In [3]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip'
data_file = 'bank-additional.zip'
data_csv = 'bank-additional/bank-additional-full.csv'

original_data_table = 'bank_marketing'
training_data_table = 'bank_marketing_training'
skewed_data_table = 'bank_marketing_skewed'

# Adjusting these values alters the skew applied to the dataset for usage analysis
normal_split_ratio = 0.4
skewed_increase_factor = 1.5
skewed_decrease_factor = 0.5

seed = 1023

## Downloading the Dataset
The dataset is downloaded directly from the UCI Machine Learning Repository and loaded into a Hive table.

In [5]:
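# Download the archive only if it is not already present locally.
if not os.path.isfile(data_file):
  request.urlretrieve(data_url, data_file)

# The CSV inside the archive is semicolon-delimited.
with ZipFile(data_file) as data_zip:
  dataframe = pd.read_csv(data_zip.open(data_csv), sep=';')
  # sqlContext is provided by the notebook's Spark environment.
  original_dataframe = sqlContext.createDataFrame(dataframe)
  original_dataframe.write.saveAsTable(original_data_table, mode='overwrite')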
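
The cell above depends on the Spark `sqlContext` supplied by the notebook. As a minimal sketch of only the archive-reading part, the following uses the standard library to read a semicolon-delimited CSV out of a zip file. The helper name `read_semicolon_csv` and the two-row sample are hypothetical stand-ins, not part of the original notebook.

```python
import csv
import io
from zipfile import ZipFile

def read_semicolon_csv(zip_source, member):
    """Return the rows of a ';'-delimited CSV stored inside a zip archive."""
    with ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # ZipFile.open yields bytes; wrap it so the csv module can read text.
            text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
            return list(csv.DictReader(text, delimiter=';'))

# Hypothetical two-row sample standing in for bank-additional-full.csv.
sample = 'age;job;marital\n56;housemaid;married\n37;services;single\n'
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('bank-additional/sample.csv', sample)

rows = read_semicolon_csv(buffer, 'bank-additional/sample.csv')
print(rows[0]['job'], rows[1]['marital'])
```

Passing the wrong delimiter would leave each line as a single unparsed column, which is why the notebook sets `sep=';'` explicitly.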

## Creating Drift
A challenge of this implementation is the need for a drifted dataset that can be run against the model to detect concept drift.  I am randomly splitting the full dataset into a training set and a skewed set using the following criteria:

* **A record is less likely to be placed in the skewed dataset if the job is blue-collar.**  
  This represents a change in the underlying distribution of the job attribute and should be picked up by drift detection.

* **A record is more likely to be placed in the skewed dataset if the marital status is single.**  
  This represents a change in the underlying distribution of the marital attribute and should be picked up by drift detection.

* **A record that has an unknown job will always be placed in the skewed dataset.**  
  This represents a change in the underlying definition of the job attribute.  This is being used to demonstrate the durability of pipelines to changes in the attributes used for predictions.  Using an unknown attribute value in predictions should not completely break the model and should also be picked up by drift detection.
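
The three criteria above can be read as a per-row probability of landing in the skewed set. The sketch below spells that out; the function name `skew_probability` is hypothetical, and its defaults mirror the `normal_split_ratio`, `skewed_decrease_factor`, and `skewed_increase_factor` values set earlier in this notebook.

```python
def skew_probability(job, marital, base=0.4, decrease=0.5, increase=1.5):
    """Probability that a row is placed in the skewed set."""
    if job == 'unknown':
        return 1.0  # unknown jobs always go to the skewed set
    probability = base
    if job == 'blue-collar':
        probability *= decrease  # less likely to be skewed
    if marital == 'single':
        probability *= increase  # more likely to be skewed
    return probability

examples = [
    ('admin.', 'married'),
    ('blue-collar', 'married'),
    ('admin.', 'single'),
    ('blue-collar', 'single'),
    ('unknown', 'single'),
]
for job, marital in examples:
    print(f'{job:12s} {marital:8s} {skew_probability(job, marital):.2f}')
```

Note that the two factors combine multiplicatively, so a single blue-collar record ends up with a probability of about 0.30, below the baseline of 0.40.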

In [7]:
random.seed(seed)

training_rows = list()
skewed_rows = list()

# I am stepping through each row of the dataframe using rdd.collect().  This pulls every record
# to the driver and iterates through them in a non-parallel manner, which ensures that the
# random seed is applied consistently for reproducibility.
for row in original_dataframe.rdd.collect():
  
  # The initial chance for a row to end up in the skewed file is set as the normal split ratio.
  threshold = normal_split_ratio
  
  # If the row is blue-collar, they will be less likely to end up in the skewed file.
  if row.job == 'blue-collar':
    threshold *= skewed_decrease_factor
  
  # If the row is single, they will be more likely to end up in the skewed file.
  if row.marital == 'single':
    threshold *= skewed_increase_factor
  
  if row.job == 'unknown':
    # TODO: Randomly remove some data
    skewed_rows.append(row)
  else:
    if random.random() < threshold:
      # TODO: Randomly remove some data
      skewed_rows.append(row)
    else:
      # TODO: Randomly remove some data
      training_rows.append(row)

sqlContext.createDataFrame(training_rows).write.saveAsTable(training_data_table, mode='overwrite')
sqlContext.createDataFrame(skewed_rows).write.saveAsTable(skewed_data_table, mode='overwrite')
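
To illustrate why the rows are processed sequentially with a fixed seed, here is a minimal sketch without Spark. The function `split_rows` and the integer "rows" are hypothetical; the point is only that the same seed and the same row order produce the same split.

```python
import random

def split_rows(rows, seed, threshold=0.4):
    """Randomly assign each row to the training or skewed list."""
    # A private generator keeps the sequence independent of other code.
    rng = random.Random(seed)
    training, skewed = [], []
    for row in rows:
        if rng.random() < threshold:
            skewed.append(row)
        else:
            training.append(row)
    return training, skewed

rows = list(range(1000))
first = split_rows(rows, seed=1023)
second = split_rows(rows, seed=1023)
print(first == second)  # True: same seed and same row order
```

The result also depends on the order in which rows are visited, so reproducing the notebook's split assumes that the rows come back in the same order on each run.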

### Verifying Concept Drift for the `job` Attribute

In [9]:
%sql
select s.job, t.rate as training_rate, s.rate as skewed_rate
from (
  select job, format_number(count(*)/(select count(*) from bank_marketing_skewed), 3) as rate
  from bank_marketing_skewed group by job order by job asc
  ) s
left join (
  select job, format_number(count(*)/(select count(*) from bank_marketing_training), 3) as rate
  from bank_marketing_training group by job order by job asc
  ) t on s.job = t.job;

job,training_rate,skewed_rate
admin.,0.224,0.294
blue-collar,0.3,0.118
entrepreneur,0.034,0.038
housemaid,0.025,0.026
management,0.07,0.072
retired,0.041,0.042
self-employed,0.033,0.037
services,0.088,0.108
student,0.015,0.03
technician,0.147,0.187
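
For readers who prefer Python to SQL, the following sketch mirrors what the query above computes: the share of each category within each table, joined from the skewed side. The helper names and the toy job lists are hypothetical and only illustrate the calculation.

```python
from collections import Counter

def category_rates(values):
    """Share of each category, rounded to three decimals."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: round(count / total, 3) for category, count in counts.items()}

def compare_rates(skewed_values, training_values):
    """Left join on the skewed side: (training_rate, skewed_rate) per category."""
    skewed = category_rates(skewed_values)
    training = category_rates(training_values)
    # Categories missing from the training data get None, like a SQL null.
    return {c: (training.get(c), skewed[c]) for c in sorted(skewed)}

# Toy data, not taken from the real tables.
training_jobs = ['admin.', 'blue-collar', 'blue-collar', 'services']
skewed_jobs = ['admin.', 'admin.', 'services', 'unknown']

result = compare_rates(skewed_jobs, training_jobs)
for job, (training_rate, skewed_rate) in result.items():
    print(job, training_rate, skewed_rate)
```

Because the join starts from the skewed side, a category that exists only in the skewed data, such as `unknown` in this toy example, still appears, with no training rate.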


### Verifying Concept Drift for the `marital` Attribute

In [11]:
%sql
select t.marital, t.rate as training_rate, s.rate as skewed_rate
from (
  select marital, format_number(count(*)/(select count(*) from bank_marketing_training), 3) as rate
  from bank_marketing_training group by marital order by marital asc
  ) t
left join (
  select marital, format_number(count(*)/(select count(*) from bank_marketing_skewed), 3) as rate
  from bank_marketing_skewed
  group by marital
  order by marital asc
  ) s on t.marital = s.marital;

marital,training_rate,skewed_rate
divorced,0.121,0.1
married,0.668,0.516
single,0.21,0.382
unknown,0.002,0.002
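
As a rough way to summarize the shift in one number, the sketch below computes the total variation distance between the two `marital` distributions, using the rates reported in the table above. This metric is an illustrative choice, not the drift detection method used elsewhere in this example.

```python
# Rates copied from the query output above.
training_rate = {'divorced': 0.121, 'married': 0.668, 'single': 0.210, 'unknown': 0.002}
skewed_rate = {'divorced': 0.100, 'married': 0.516, 'single': 0.382, 'unknown': 0.002}

def total_variation_distance(p, q):
    """Half the sum of absolute differences between two discrete distributions."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

tvd = total_variation_distance(training_rate, skewed_rate)
print(f'{tvd:.2f}')
```

A value of 0 would mean identical distributions and 1 would mean no overlap at all; the value here, roughly 0.17, reflects the move from `married` toward `single` in the skewed set.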
