## Deployment of Machine Learning

Once a model works in a notebook, the next step is to bring it into a production environment. A notebook is built for exploration, while a deployment has to serve predictions to other systems, so it is better to extract only the essential artifacts (the trained model and the objects needed to prepare its input) and make them work with other components such as a web service.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [3]:
df = pd.read_csv('Telco-Customer-Churn.csv')

df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
df.TotalCharges = df.TotalCharges.fillna(0)

df.columns = df.columns.str.lower().str.replace(' ', '_')
string_columns = list(df.dtypes[df.dtypes == 'object'].index)

for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')
    
df.churn = (df.churn == 'yes').astype(int)


In [9]:
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

del df_train['churn']
del df_val['churn']
del df_test['churn']

In [22]:
df_train.head()

| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | ... | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1814 | 5442-pptjy | male | 0 | yes | yes | 12 | yes | no | no | no_internet_service | ... | no_internet_service | no_internet_service | no_internet_service | no_internet_service | two_year | no | mailed_check | 19.7 | 258.35 | 0 |
| 5946 | 6261-rcvns | female | 0 | no | no | 42 | yes | no | dsl | yes | ... | yes | yes | no | yes | one_year | no | credit_card_(automatic) | 73.9 | 3160.55 | 1 |
| 2389 | 6161-erdgd | male | 0 | yes | yes | 71 | yes | yes | dsl | yes | ... | yes | yes | yes | yes | one_year | no | electronic_check | 85.45 | 6300.85 | 0 |
| 3676 | 2364-ufrom | male | 0 | no | no | 30 | yes | no | dsl | yes | ... | no | yes | yes | no | one_year | no | electronic_check | 70.4 | 2044.75 | 0 |
| 611 | 4765-oxppd | female | 0 | yes | yes | 9 | yes | no | dsl | yes | ... | yes | yes | no | no | month-to-month | no | mailed_check | 65.0 | 663.05 | 1 |


In [10]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']

In [14]:
## Function train
def train(df_train, y_train, C=1.0): 
    ## Modify format from training dataframe to dictionary
    train_dict = df_train[categorical + numerical].to_dict(orient='records')

    ## Using DictVectorizer to create one-hot encoding for categorical variables
    dv = DictVectorizer(sparse=False)

    ## Fitting DictVectorizer on training dictionary
    dv.fit(train_dict)
    X_train = dv.transform(train_dict)

    ## Fitting Logistic Regression with solver 'liblinear' to training set
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=1)
    model.fit(X_train, y_train)


    return dv, model

# Function predict
def predict(df_val, dv, model):
    ## Modify format from validation dataframe to dictionary
    val_dict = df_val[categorical + numerical].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    ## Predict model on validation set, returning probabilities
    y_pred = model.predict_proba(X_val)[:, 1]

    return y_pred

In [12]:
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

C = 1.0
n_splits = 5

In [15]:
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)

scores = []

for train_idx, val_idx in kfold.split(df_train_full):
    df_train = df_train_full.iloc[train_idx]
    df_val = df_train_full.iloc[val_idx]

    y_train = df_train.churn.values
    y_val = df_val.churn.values

    dv, model = train(df_train, y_train, C=C)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

print(f'C={C} {np.mean(scores):.3f} +- {np.std(scores):.3f}')

C=1.0 0.841 +- 0.007


In [16]:
scores

[0.8423279509541489,
 0.8453247086478611,
 0.8335059201284366,
 0.8323627454115241,
 0.8521736060995889]

In [17]:
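# Note: df_train and y_train still hold the last fold's training split from the loop above.
# To retrain on all training data, pass df_train_full and df_train_full.churn.values instead.
dv, model = train(df_train, y_train, C=1.0)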
y_pred = predict(df_test, dv, model)

auc = roc_auc_score(y_test, y_pred)
auc

0.8573902845938012

### Saving and Loading the Model

In [18]:
import pickle

output_file = f"model_C={C}.bin"

In [19]:
# write a binary file
f_out = open(output_file, "wb")
# save the model and the dictionary vectorizer (we need that in order to run the model)
pickle.dump((dv, model), f_out)
# close the file
f_out.close()

In [None]:
# alternative
# this ensures that the file is closed, when the "with" statement is left
with open(output_file, "wb") as f_out:
    pickle.dump((dv, model), f_out)
    # do stuff
    
# do other stuff (file is closed)

### Model Loading from Pickle

In [20]:
model_file = "model_C=1.0.bin"

with open(model_file, "rb") as f_in:
    dv, model = pickle.load(f_in)

In [21]:
dv, model

(DictVectorizer(sparse=False),
 LogisticRegression(max_iter=1000, random_state=1, solver='liblinear'))

In [23]:
df_train.columns

Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

In [24]:
## Using loaded DictVectorizer and LogisticRegression to predict new customer

customer = {
    "customerid": "XXXG-00W0",
    "gender": "male",
    "seniorcitizen": 1,
    "partner": "yes",
    "dependents": "no",
    "tenure": 21,
    "phoneservice": "no",
    "multiplelines": "no_phone_service",
    "internetservice": "dsl",
    "onlinesecurity": "no",
    "onlinebackup": "no",
    "deviceprotection": "no",
    "techsupport": "no",
    "streamingtv": "no",
    "streamingmovies": "no",
    "contract": "month-to-month",
    "paperlessbilling": "yes",
    "paymentmethod": "electronic_check", 
    "monthlycharges": 30.55,
    "totalcharges": 43.12
}

In [25]:
# turn this customer into a feature matrix
X = dv.transform([customer])

In [26]:
# probability that this customer churns
model.predict_proba(X)[0,1]

0.3786074719362307

From this point on, the work moves from the notebook to `.py` files that can be deployed as a web service.
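The service code itself is not shown in this notebook. As a minimal sketch, assuming Flask is installed, such a file could look like the following. The route `/predict`, port 9696, the model file name, and the response keys follow the cells in this notebook; the factory name `create_app`, the `threshold` parameter, and the 0.5 cut-off are assumptions made for illustration.

```python
from flask import Flask, jsonify, request


def create_app(dv, model, threshold=0.5):
    """Build a Flask app that scores one customer per request.

    dv and model are the fitted DictVectorizer and classifier;
    threshold is an assumed cut-off for the churn decision.
    """
    app = Flask("churn")

    @app.route("/predict", methods=["POST"])
    def predict():
        # The request body is a single customer as a JSON object.
        customer = request.get_json()
        X = dv.transform([customer])
        y_pred = model.predict_proba(X)[0, 1]
        # Cast NumPy scalars to built-in types so they serialize to JSON.
        return jsonify({
            "churn_probability": float(y_pred),
            "churn": bool(y_pred >= threshold),
        })

    return app


# To serve the model saved earlier, one could run something like:
#
#     import pickle
#     with open("model_C=1.0.bin", "rb") as f_in:
#         dv, model = pickle.load(f_in)
#     create_app(dv, model).run(host="0.0.0.0", port=9696)
```

Passing `dv` and `model` into a factory, rather than loading them at import time, keeps the route easy to exercise with Flask's built-in test client before a real server is started.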

### Testing Flask Application

In [2]:
import requests

url = 'http://localhost:9696/predict'

In [4]:
customer = {
    "customerid": "XXXG-00W0",
    "gender": "male",
    "seniorcitizen": 1,
    "partner": "yes",
    "dependents": "no",
    "tenure": 21,
    "phoneservice": "no",
    "multiplelines": "no_phone_service",
    "internetservice": "dsl",
    "onlinesecurity": "no",
    "onlinebackup": "no",
    "deviceprotection": "no",
    "techsupport": "no",
    "streamingtv": "no",
    "streamingmovies": "no",
    "contract": "month-to-month",
    "paperlessbilling": "yes",
    "paymentmethod": "electronic_check", 
    "monthlycharges": 30.55,
    "totalcharges": 43.12
}

In [15]:
requests.post(url, json=customer)

<Response [200]>

In [16]:
response = requests.post(url, json=customer).json()
response

{'churn': False, 'churn_probability': 0.3786074719362307}

### Acting on the Prediction

Based on the prediction, trigger an action for the marketing team, such as sending a promotional email to a customer who is likely to churn.

In [13]:
customer['customerid']

'XXXG-00W0'

In [17]:
# response['churn'] is already a boolean, so no comparison with True is needed
if response['churn']:
    print(f"Sending promo email to {customer['customerid']}")
else:
    print('Enjoy our service')

Enjoy our service


### Random Baseline

### Ideal Model

### Putting Everything Together



### ROC Curve

Using scikit-learn for plotting the ROC curve

### AUC: Area under the ROC curve

Comparing multiple models with ROC curves

Score on ROC AUC

Interpretation of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example.
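To make this interpretation concrete, the sketch below computes AUC by brute force over every positive/negative pair, counting ties as half a win. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; on real data, `roc_auc_score` from scikit-learn is the practical choice.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 positive/negative pairs are ranked correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```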

### K-fold cross validation

### Full Retrain