# Inference Pipeline Using Scikit-learn Models

1. Build the sklearn pipeline locally
   - **preprocessor**: preprocesses the numeric and categorical columns
     - *numeric*: imputation, scaling
     - *categorical*: imputation, one-hot encoding
   - **classifier**: LogisticRegression
   - tune the pipeline model
2. Register the artifacts
   - train/test data
   - pipeline model
3. Deploy the inference ML pipeline as an endpoint
4. Predict using the endpoint

In [None]:
SEED = 100

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

#  prepare raw data
This data will be used for training and testing the ML pipeline.

In [2]:
col_to_predict = 'survived'
col_primary_identifier = "user_id"

COLLIST_META = [col_primary_identifier]
COLLIST_FEATURE = ['age', 'fare', 'embarked', 'sex', 'pclass']
COLLIST_ALL     = [col_to_predict] + COLLIST_META + COLLIST_FEATURE
COLLIST_NUMERIC = ['age', 'fare']
COLLIST_CATEGORICAL = ['embarked', 'sex', 'pclass']

In [51]:
RAW_TRAIN_PATH = "./data_dir/train_data.csv"
RAW_TEST_PATH  = "./data_dir/test_data.csv"
RAW_VAL_PATH   = "./data_dir/val_data.csv"

In [52]:
# run only during first run
#!mkdir ./data_dir

In [53]:
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)
data["user_id"] = ["user_"+str(i) for i in range(len(data))]

X = data[COLLIST_ALL].copy()

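# shuffle once, then cut at 65% and 90% of the rows: train / test / val
shuffled = X.sample(frac=1, random_state=SEED)
train_end, test_end = int(.65*len(X)), int(.9*len(X))
train_data, test_data, val_data = shuffled.iloc[:train_end], shuffled.iloc[train_end:test_end], shuffled.iloc[test_end:]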
val_data = val_data[COLLIST_FEATURE]

# save the splits (note: only the numeric columns are written, without a header row)
train_data[COLLIST_NUMERIC].to_csv(path_or_buf=RAW_TRAIN_PATH, index=False, header=None)
test_data[COLLIST_NUMERIC].to_csv(path_or_buf=RAW_TEST_PATH, index=False, header=None)
val_data[COLLIST_NUMERIC].to_csv(path_or_buf=RAW_VAL_PATH, index=False, header=None)

print("raw_data: {}, train: {}, test: {}, val: {}".format(X.shape, train_data.shape, test_data.shape, val_data.shape))
print(X.shape)
X.head(2)

raw_data: (1309, 7), train: (850, 7), test: (328, 7), val: (131, 5)
(1309, 7)


Unnamed: 0,survived,user_id,age,fare,embarked,sex,pclass
0,1,user_0,29.0,211.3375,S,female,1
1,1,user_1,0.9167,151.55,S,male,1
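
The split sizes printed above follow from the two cut points at 65% and 90% of the shuffled rows. A minimal sketch, assuming a stand-in frame with the same row count (the `row_id` column and the variable names are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

SEED = 100
N_ROWS = 1309  # same row count as the Titanic frame above

# Stand-in frame; the notebook itself uses the Titanic columns.
frame = pd.DataFrame({"row_id": np.arange(N_ROWS)})

# Shuffle once, then cut at 65% and 90% of the rows.
shuffled = frame.sample(frac=1, random_state=SEED)
cut_train = int(0.65 * len(shuffled))
cut_test = int(0.90 * len(shuffled))

train_part = shuffled.iloc[:cut_train]
test_part = shuffled.iloc[cut_train:cut_test]
val_part = shuffled.iloc[cut_test:]

split_sizes = (len(train_part), len(test_part), len(val_part))
print(split_sizes)  # (850, 328, 131)
```

The three parts are disjoint and together cover every row, so roughly 65% goes to training, 25% to testing, and 10% to validation.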


In [54]:
!ls ./data_dir

test_data.csv  train_data.csv  val_data.csv


# 1. build sklearn pipeline locally
1. Pipeline setup
2. Train
3. Tune
4. Persist

## setup

In [55]:
X_train = train_data[COLLIST_FEATURE]
X_test  = test_data[COLLIST_FEATURE]
y_train = train_data[col_to_predict]
y_test  = test_data[col_to_predict]

X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

In [56]:
train_data[COLLIST_FEATURE].columns.values
#['age', 'fare', 'embarked', 'sex', 'pclass']
#  0,        1,       2,       3,      4

array(['age', 'fare', 'embarked', 'sex', 'pclass'], dtype=object)
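
Because the features are converted to plain arrays, the column transformer in the next cell has to address columns by position. One way to avoid hard-coding those positions is to derive them from the name lists. This is only a sketch; `positions` is a hypothetical helper, not something the notebook defines:

```python
COLLIST_FEATURE = ['age', 'fare', 'embarked', 'sex', 'pclass']
COLLIST_NUMERIC = ['age', 'fare']
COLLIST_CATEGORICAL = ['embarked', 'sex', 'pclass']


def positions(columns, all_columns):
    """Return the position of each name in `columns` within `all_columns`."""
    return [all_columns.index(name) for name in columns]


numeric_features = positions(COLLIST_NUMERIC, COLLIST_FEATURE)
categorical_features = positions(COLLIST_CATEGORICAL, COLLIST_FEATURE)
print(numeric_features, categorical_features)  # [0, 1] [2, 3, 4]
```

If the order of `COLLIST_FEATURE` ever changes, the derived indices change with it, whereas literal index lists would silently point at the wrong columns.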

In [57]:
# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = [0, 1]
categorical_features = [2, 3, 4]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])


categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

clf

Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbo...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])
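
To see what the preprocessing step hands to the classifier, it can help to run an equivalent transformer on a few rows. The sketch below rebuilds the same two branches on a made-up four-row array (the values and the `toy_*` names are illustrative assumptions):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column order: age, fare, embarked, sex, pclass (one missing age, one missing port).
toy_rows = np.array([
    [22.0,   7.25,  'S',    'male',   3],
    [38.0,   71.28, 'C',    'female', 1],
    [np.nan, 8.05,  'S',    'female', 3],
    [35.0,   53.10, np.nan, 'male',   1],
], dtype=object)

toy_numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
toy_categorical = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', toy_numeric, [0, 1]),
    ('cat', toy_categorical, [2, 3, 4])])

toy_matrix = toy_preprocessor.fit_transform(toy_rows)
print(toy_matrix.shape)  # (4, 9)
```

The nine output columns are the two scaled numeric features plus one indicator per observed category: three for embarked (`C`, `S`, and the `missing` placeholder), two for sex, and two for pclass.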

## model training

In [22]:
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.799


In [25]:
payload = val_data[0:1].values.tolist()
pred = clf.predict(payload)

print("payload    : ", payload)
print("prediction : ", pred)

payload    :  [[8.0, 36.75, 'S', 'male', 2]]
prediction :  [0]


## model tuning

In [26]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

# note: `iid` was removed in scikit-learn 0.24; drop the argument when running on newer versions
grid_search = GridSearchCV(clf, param_grid, cv=10, iid=False)
grid_search.fit(X_train, y_train)

print(("best logistic regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))

best logistic regression from grid search: 0.799
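
The keys in `param_grid` use double underscores to walk from the outer pipeline down to a nested estimator's parameter. A small sketch of how those names resolve, using a cut-down pipeline with only the numeric branch (the `demo_*` names are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
demo_preprocessor = ColumnTransformer(transformers=[('num', demo_numeric, [0, 1])])
demo_clf = Pipeline(steps=[('preprocessor', demo_preprocessor),
                           ('classifier', LogisticRegression(solver='lbfgs'))])

demo_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

# Every grid key must be a parameter name known to the pipeline.
available_params = demo_clf.get_params()
resolved = {key: available_params[key] for key in demo_grid}
n_candidates = len(ParameterGrid(demo_grid))

print(resolved)      # current values before any search
print(n_candidates)  # 8
```

With 8 candidate settings and `cv=10`, the search above fits the pipeline 80 times before refitting the best candidate on the full training set.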


In [27]:
model_main = grid_search.best_estimator_
model_main

Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [28]:
#model_main.predict(X_train)

## persist the model

In [29]:
import joblib

In [30]:
joblib.dump(model_main, 'sklearn_clf_pipeline_titanic_v1.pkl')

['sklearn_clf_pipeline_titanic_v1.pkl']
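
Before registering a persisted model, it is worth confirming that the file loads back and predicts the same way. A minimal round-trip sketch on synthetic data (the data, file name, and `demo_*` names are assumptions for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic binary problem standing in for the Titanic features.
rng = np.random.RandomState(100)
X_demo = rng.normal(size=(40, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_model = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='lbfgs'))]).fit(X_demo, y_demo)

# Dump to a temporary directory, load it back, and compare predictions.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(demo_model, model_path)
    restored_model = joblib.load(model_path)

same_predictions = np.array_equal(demo_model.predict(X_demo),
                                  restored_model.predict(X_demo))
print(same_predictions)
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that produced them, which is one reason the registration step below records the framework version.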

# 2. Register the artifacts
1. input and output datasets
2. pipeline model

In [31]:
from azureml.core import Workspace


ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

bais-ml
ml-resource-group
eastus
328d2afe-2a26-4e47-8e3b-db00b6ada105


## input and output datasets

Here, you will register the data used to create the model in your workspace.

In [60]:
# persisting the training data : features/label in .csv format
temp_X_train = pd.DataFrame(X_train)
temp_y_train = pd.DataFrame(y_train)

temp_X_train.to_csv(path_or_buf="features.csv", index=False)
temp_y_train.to_csv(path_or_buf="labels.csv", index=False)

pd.read_csv("./features.csv").head(2)
pd.read_csv("./labels.csv").head(2)

Unnamed: 0,0
0,0
1,0


In [37]:
import numpy as np

from azureml.core import Dataset

#np.savetxt('features.csv', X_train, delimiter=',')
#np.savetxt('labels.csv', y_train, delimiter=',')

datastore = ws.get_default_datastore()
datastore.upload_files(files=['./features.csv', './labels.csv'],
                       target_path='ds_sklearn_clf_titanic/',
                       overwrite=True)

input_dataset_clf = Dataset.Tabular.from_delimited_files(path=[(datastore, 'ds_sklearn_clf_titanic/features.csv')])
output_dataset_clf = Dataset.Tabular.from_delimited_files(path=[(datastore, 'ds_sklearn_clf_titanic/labels.csv')])

Uploading an estimated of 2 files
Uploading ./features.csv
Uploading ./labels.csv
Uploaded ./labels.csv, 1 files out of an estimated total of 2
Uploaded ./features.csv, 2 files out of an estimated total of 2
Uploaded 2 files


## Register model

In [58]:
#!ls -al

In [39]:
import sklearn

from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration


model = Model.register(workspace=ws,
                       model_name='titanic-sklearn-model',                # Name of the registered model in your workspace.
                       model_path='sklearn_clf_pipeline_titanic_v1.pkl',  # Local file to upload and register as a model.
                       model_framework=Model.Framework.SCIKITLEARN,  # Framework used to create the model.
                       model_framework_version=sklearn.__version__,  # Version of scikit-learn used to create the model.
                       sample_input_dataset=input_dataset_clf,
                       sample_output_dataset=output_dataset_clf,
                       resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5),
                       description='titanic clf model to predict survival.',
                       tags={'area': 'titanic', 'type': 'clf'})

print('Name:', model.name)
print('Version:', model.version)

Registering model titanic-sklearn-model
Name: titanic-sklearn-model
Version: 8


# 3. Deploy model

Deploy your model as a web service using [Model.deploy()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py#deploy-workspace--name--models--inference-config--deployment-config-none--deployment-target-none-). Web services take one or more models, load them in an environment, and run them on one of several supported deployment targets.

For this example, we will deploy your scikit-learn model to an Azure Container Instance (ACI).

In [40]:
service_name = 'service-titanic-sklearn-clf-v7'

service = Model.deploy(ws, service_name, [model], overwrite=True)
service.wait_for_deployment(show_output=True)

Running...........................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [62]:
# uncomment the line below to inspect the service logs

#print(service.get_logs())

# 4. Prediction

In [41]:
import json

In [49]:
payload = val_data[0:2].values.tolist()

input_payload = json.dumps({
    'data': payload,
    'method': 'predict'  # If you have a classification model, you can get probabilities by changing this to 'predict_proba'.
})

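# Decode the payload and score it with the local model as a check of the payload format.
# To score against the deployed endpoint instead, use: pred = service.run(input_payload)
parsed_data = json.loads(input_payload)["data"]
pred = model_main.predict(parsed_data)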

print("input_payload : ", input_payload)
print("prediction    : ", pred)

input_payload :  {"data": [[8.0, 36.75, "S", "male", 2], [37.0, 29.7, "C", "male", 1]], "method": "predict"}
prediction    :  [0 1]
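
Validation rows can contain missing values, and `json.dumps` writes those as a bare `NaN` token, which is not valid JSON. The sketch below builds the same `{'data': ..., 'method': ...}` payload while replacing NaN with `null`. `build_payload` is a hypothetical helper, and whether the deployed service accepts `null` for missing values is an assumption that should be verified against the endpoint:

```python
import json
import math


def build_payload(rows, method='predict'):
    """Serialize feature rows into the payload shape used above, mapping NaN to None."""
    clean_rows = [
        [None if isinstance(value, float) and math.isnan(value) else value
         for value in row]
        for row in rows
    ]
    # allow_nan=False makes json.dumps raise instead of emitting a NaN token.
    return json.dumps({'data': clean_rows, 'method': method}, allow_nan=False)


# Illustrative rows in the order age, fare, embarked, sex, pclass.
rows = [[8.0, 36.75, 'S', 'male', 2], [float('nan'), 29.7, 'C', 'male', 1]]
input_payload = build_payload(rows)
decoded = json.loads(input_payload)

print(input_payload)
# With the service deployed above (Azure ML SDK v1), the payload would be sent
# with: service.run(input_payload)
```

Passing `method='predict_proba'` to the helper produces the probability variant of the request mentioned in the cell above.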


# terminate endpoint

In [46]:
service.delete()