<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Module 5:</span> Project Delivery</h1>
<hr>

Congratulations on making it to the final module of Project 3! 
* You've come a long way up to this point. 
* You've taken this project from a simple dataset all the way to a high-performing predictive model. 
* Most importantly, you came up with almost all of the mission-critical code on your own!

Now, we'll show you how to use your model to make predictions on brand new (**raw**) data and package your work into an executable script.

Before moving on, we also recommend opening your Companion Workbook for <span style="color:royalblue">Module 2: ABT Construction</span>.

<br><hr id="toc">

### In this module...

First, we'll import libraries and load our model from Module 4.

Then, we'll cover these steps:

1. [Confirm your model was saved correctly](#confirm)
2. [Write pre-modeling functions](#pre-model)
3. [Construct a model class](#model-class)
4. [Method 1: Jupyter notebook](#jupyter)
5. [Method 2: Executable script](#executable)

<br><hr>

### First, let's import libraries and load the model.

Start by importing the libraries that we'll need.

In [13]:
# print_function for compatibility with Python 3
from __future__ import print_function

# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd

# Pickle for reading model files
import pickle

# Scikit-Learn for Modeling
import sklearn
from sklearn.model_selection import train_test_split

Next, let's import a helper for the **area under ROC curve** (AUROC) metric (you'll see why soon). Since we don't need to plot the ROC curve itself, we can import the shortcut function <code style="color:steelblue">roc_auc_score()</code>, which computes the score directly.

In [14]:
# Area under ROC curve
from sklearn.metrics import roc_auc_score

Next, load the final model saved from Module 4.

In [15]:
# Load final_model.pkl as model
with open('final_model.pkl', 'rb') as f:
    model = pickle.load(f)

Great, let's begin.

<span id="confirm"></span>
# 1. Confirm your model was saved correctly

A quick sanity check is to confirm that our model was saved correctly.

<br>
**First, let's display the model object. We're confirming a few key details:**
* It should be a model <code style="color:steelblue">Pipeline</code>.
* The first step should be a <code style="color:steelblue">StandardScaler</code> preprocessing step.
* The second step should be a <code style="color:steelblue">RandomForestClassifier</code> model.

In [16]:
# Display model object
print(model)

Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.33, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=123,
            verbose=0, warm_start=False))])
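
If you'd rather not eyeball the printed output, the same three checks can be written as assertions. The sketch below is only an illustration: <code style="color:steelblue">check_pipeline()</code> is a hypothetical helper (not part of the course code), and it runs on a small stand-in pipeline round-tripped through pickle in memory rather than on the real <code style="color:steelblue">final_model.pkl</code>.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler


def check_pipeline(model):
    """Hypothetical helper: assert the expected layout, then return the step names."""
    # The loaded object should be a model Pipeline
    assert isinstance(model, Pipeline)

    # First step: StandardScaler; second step: RandomForestClassifier
    assert isinstance(model.steps[0][1], StandardScaler)
    assert isinstance(model.steps[1][1], RandomForestClassifier)

    return [name for name, _ in model.steps]


# Stand-in for the Module 4 model (assumption: it was built with make_pipeline)
stand_in = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=123),
)

# Round-trip through pickle in memory, mimicking save + load
restored = pickle.loads(pickle.dumps(stand_in))

step_names = check_pipeline(restored)
print(step_names)
```

With the real model, you would pass the object loaded from <code style="color:steelblue">final_model.pkl</code> to the helper instead of <code style="color:steelblue">restored</code>.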


**Next, load the same analytical base table that we imported at the beginning of Module 4.**

In [17]:
# Load analytical base table used in Module 4
df = pd.read_csv('analytical_base_table.csv')

**Next, split it into training and test sets.**
* Remember to first separate the dataframe into separate objects for the target variable (<code style="color:steelblue">y</code>) and the input features (<code style="color:steelblue">X</code>).
* <code style="color:steelblue">test_size=0.2</code> (exactly the same as in Module 4)
* <code style="color:steelblue">random_state=1234</code> (exactly the same as in Module 4)
* <code style="color:steelblue">stratify=df.status</code> (exactly the same as in Module 4)

In [18]:
# Create separate object for target variable
y = df['status']

# Create separate object for input features
X = df.drop('status',axis=1)

# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1234,stratify=df.status)
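
Why do these three arguments need to match Module 4 exactly? The same <code style="color:steelblue">random_state</code> reproduces the same split, so the test set here is the one the model never trained on, and <code style="color:steelblue">stratify</code> keeps the class balance equal across splits. Here's a small sketch on made-up toy data (the arrays below are not from the project):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target: 80 zeros and 20 ones, so 20% positives overall
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# First split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

# Second split with identical arguments
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

# Same random_state -> the exact same rows land in the test set
same_split = np.array_equal(X_te, X_te2)

# Stratification -> train and test keep the overall share of positives
print(same_split, y_toy.mean(), y_tr.mean(), y_te.mean())
```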


**Finally, use your model to predict <code style="color:steelblue">X_test</code> again.**
* Then, print the <code style="color:steelblue">roc_auc_score</code>.
* Remember the difference between <code style="color:steelblue">.predict()</code> and <code style="color:steelblue">.predict_proba()</code>

In [19]:
# Predict X_test
pred = model.predict_proba(X_test)

# Keep just the predicted probability of the positive class (1)
pred = pred[:, 1]

# Print AUROC
print( 'AUROC:', roc_auc_score(y_test, pred) )

AUROC: 0.991520189216
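
To see why the choice between <code style="color:steelblue">.predict()</code> and <code style="color:steelblue">.predict_proba()</code> matters for AUROC, here's a toy sketch with made-up numbers (not output from our model). The hard labels are produced by thresholding the positive-class probability at 0.5, which we assume is a reasonable stand-in for what <code style="color:steelblue">.predict()</code> returns in a two-class problem.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth
y_true = np.array([0, 0, 1, 1, 1])

# Shaped like predict_proba() output: columns are [P(class 0), P(class 1)]
proba = np.array([
    [0.90, 0.10],
    [0.70, 0.30],
    [0.65, 0.35],
    [0.55, 0.45],
    [0.20, 0.80],
])

# Probability of the positive class only
pos_proba = proba[:, 1]

# Hard 0/1 labels from a 0.5 threshold (assumed stand-in for predict())
hard_labels = (pos_proba >= 0.5).astype(int)

# AUROC rewards ranking, so probabilities carry more information than labels
auc_from_proba = roc_auc_score(y_true, pos_proba)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print('AUROC from probabilities:', auc_from_proba)
print('AUROC from hard labels:', auc_from_labels)
```

In this toy case every positive observation is ranked above every negative one, so the probabilities earn a perfect score, while the thresholded labels throw that ranking away and score lower.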


Let's load some brand new, **raw data** that we've never seen before.

In [20]:
raw_data = pd.read_csv('project_files/unseen_raw_data.csv')

print( raw_data.shape )
raw_data.head()

(750, 9)


Unnamed: 0,avg_monthly_hrs,department,filed_complaint,last_evaluation,n_projects,recently_promoted,salary,satisfaction,tenure
0,228,management,,0.735618,2,,high,0.805661,3.0
1,229,product,,1.0,4,,low,0.719961,4.0
2,196,sales,1.0,0.557426,4,,low,0.749835,2.0
3,207,IT,,0.715171,3,,high,0.987447,3.0
4,129,management,,0.484818,2,,low,0.441219,3.0


**Let's see what happens when we try to apply our model to this raw dataset.**

In [21]:
# Should throw an error
pred = model.predict_proba(raw_data)

ValueError: could not convert string to float: 'low'
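
The error tells us the model received text where it expected numbers. One quick way to find the offending columns is to list everything that isn't numeric. This is a sketch on a made-up miniature of the raw data, not the real file:

```python
import pandas as pd

# Toy rows shaped like the raw data (values are made up for illustration)
raw_toy = pd.DataFrame({
    'avg_monthly_hrs': [228, 229],
    'department': ['management', 'product'],
    'salary': ['high', 'low'],
    'satisfaction': [0.81, 0.72],
})

# Columns the model cannot consume directly because they are not numeric
non_numeric = raw_toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

These are the columns that the feature engineering step has to convert into dummy variables before the model can use them.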

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="pre-model"></span>
# 2. Write pre-modeling functions

All we need to do is write a few functions to **convert the raw data to the same format as the analytical base table**.

<br>
**Write a function called <code style="color:steelblue">clean_data()</code> that bundles together all of the data cleaning steps.**
* It's helpful to open up <span style="color:royalblue">Module 2: ABT Construction</span> and go from top to bottom.
* Only include steps that altered your dataframe! You can just copy-paste them here.
* Check the Answer Key to confirm your answer.

In [22]:
def clean_data(df):
    # Drop duplicates
    df = df.drop_duplicates()
    
    # Drop temporary workers (.copy() so the assignments below don't touch the original DataFrame)
    df = df[df.department != 'temp'].copy()
    
    # Missing filed_complaint values should be 0
    df['filed_complaint'] = df.filed_complaint.fillna(0)

    # Missing recently_promoted values should be 0
    df['recently_promoted'] = df.recently_promoted.fillna(0)
    
    # 'information_technology' should be 'IT'
    df['department'] = df.department.replace('information_technology', 'IT')

    # Fill missing values in department with 'Missing'
    df['department'] = df.department.fillna('Missing')

    # Indicator variable for missing last_evaluation
    df['last_evaluation_missing'] = df.last_evaluation.isnull().astype(int)
    
    # Fill missing values in last_evaluation with 0
    df['last_evaluation'] = df.last_evaluation.fillna(0)
    
    # Return cleaned dataframe
    return df

**Create a new DataFrame named <code style="color:steelblue">cleaned_data</code> using the function you just wrote.**
* Then display its first 5 rows.

In [23]:
# Create cleaned_data
cleaned_data = clean_data(raw_data)

# Display first 5 rows
cleaned_data.head()

Unnamed: 0,avg_monthly_hrs,department,filed_complaint,last_evaluation,n_projects,recently_promoted,salary,satisfaction,tenure,last_evaluation_missing
0,228,management,0.0,0.735618,2,0.0,high,0.805661,3.0,0
1,229,product,0.0,1.0,4,0.0,low,0.719961,4.0,0
2,196,sales,1.0,0.557426,4,0.0,low,0.749835,2.0,0
3,207,IT,0.0,0.715171,3,0.0,high,0.987447,3.0,0
4,129,management,0.0,0.484818,2,0.0,low,0.441219,3.0,0
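
One detail in the cleaning function is easy to get wrong when copy-pasting: the missingness indicator has to be created *before* the missing values are filled. Here's a toy sketch (made-up values) of why the order matters:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'last_evaluation': [0.74, np.nan, 0.56]})

# Flag missingness first, while the NaN is still there...
toy['last_evaluation_missing'] = toy.last_evaluation.isnull().astype(int)

# ...then fill, so the filled-in 0 can be told apart from a real score
toy['last_evaluation'] = toy.last_evaluation.fillna(0)

# Computing the indicator after filling finds nothing to flag
flag_after_fill = toy.last_evaluation.isnull().astype(int)

print(toy)
print(flag_after_fill.tolist())
```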


**Next, write a function called <code style="color:steelblue">engineer_features()</code> that compiles all of the feature engineering steps.**
* Pick up where you left off in <span style="color:Steelblue">Module 2: ABT Construction</span> and keep going from top to bottom.
* Only include steps that altered your dataframe!
* Check the Answer Key to confirm your answer.
* **Do not include any steps used to process the target variable**. We don't have that variable when predicting new, unseen observations.

In [24]:
def engineer_features(df):
    # Work on a copy so the DataFrame that was passed in is left unchanged
    df = df.copy()
    
    # Create indicator features
    df['underperformer'] = ((df.last_evaluation < 0.6) & (df.last_evaluation_missing == 0)).astype(int)
    df['unhappy'] = (df.satisfaction < 0.2).astype(int)
    df['overachiever'] = ((df.last_evaluation > 0.8) & (df.satisfaction > 0.7)).astype(int)
        
    # Create new dataframe with dummy features (pass the column list via columns=;
    # the second positional argument of get_dummies is prefix, not columns)
    df = pd.get_dummies(df, columns=['department', 'salary'])
    
    # Return augmented DataFrame
    return df

**Create a new DataFrame named <code style="color:steelblue">augmented_data</code> using the function you just wrote.**
* Then display its first 5 rows.
* Remember to pass in <code style="color:steelblue">cleaned_data</code>, not <code style="color:steelblue">raw_data</code>.

In [25]:
# Create augmented_data
augmented_data = engineer_features(cleaned_data)

# Display first 5 rows
augmented_data.head()

Unnamed: 0,avg_monthly_hrs,filed_complaint,last_evaluation,n_projects,recently_promoted,satisfaction,tenure,last_evaluation_missing,underperformer,unhappy,...,department_finance,department_management,department_marketing,department_procurement,department_product,department_sales,department_support,salary_high,salary_low,salary_medium
0,228,0.0,0.735618,2,0.0,0.805661,3.0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
1,229,0.0,1.0,4,0.0,0.719961,4.0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
2,196,1.0,0.557426,4,0.0,0.749835,2.0,0,1,0,...,0,0,0,0,0,1,0,0,1,0
3,207,0.0,0.715171,3,0.0,0.987447,3.0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,129,0.0,0.484818,2,0.0,0.441219,3.0,0,1,0,...,0,1,0,0,0,0,0,0,1,0


**Predict probabilities for <code style="color:steelblue">augmented_data</code> using your model.**
* Then print the first 5 predictions.

In [26]:
# Predict probabilities
pred = model.predict_proba(augmented_data)

# Print first 5 predictions
print(pred[:5])

[[ 1.    0.  ]
 [ 0.98  0.02]
 [ 1.    0.  ]
 [ 1.    0.  ]
 [ 0.    1.  ]]
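
One caveat worth keeping in mind: <code style="color:steelblue">pd.get_dummies()</code> only creates columns for categories that actually appear in the data it is given. That worked out here, but a smaller batch of new data could be missing a department or salary level entirely, and the model expects the same feature columns it was trained on. A possible safeguard, sketched below on toy data, is to reindex the new features against the training columns. Note that <code style="color:steelblue">train_columns</code> is a hypothetical list you would have to save yourself; it is not something the course files provide.

```python
import pandas as pd

# Hypothetical: the feature columns the model saw during training
train_columns = ['tenure', 'salary_high', 'salary_low', 'salary_medium']

# A new batch in which nobody has a 'medium' salary
new_batch = pd.DataFrame({'tenure': [3.0, 4.0], 'salary': ['high', 'low']})
encoded = pd.get_dummies(new_batch, columns=['salary'])

# Add any missing dummy columns as 0 and restore the training column order
aligned = encoded.reindex(columns=train_columns, fill_value=0)

print(list(encoded.columns))
print(list(aligned.columns))
```

Reindexing this way also drops any columns that are not in <code style="color:steelblue">train_columns</code>, so it is worth checking for unexpected new categories before relying on it.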


<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<br id="model-class">
# 3. Construct a model class

Great, now let's package these functions together into a single **model class**. 

<br>
**Add the <code style="color:steelblue">self.clean_data()</code> and <code style="color:steelblue">self.engineer_features()</code> methods to the code below.**
* These are the same functions you wrote earlier.
* You can copy-paste, but remember to add <code style="color:steelblue">self</code> as the first argument.

In [27]:
class EmployeeRetentionModel:
    
    def __init__(self, model_location):
        with open(model_location, 'rb') as f:
            self.model = pickle.load(f)
    
    def predict_proba(self, X_new, clean=True, augment=True):
        if clean:
            X_new = self.clean_data(X_new)
        
        if augment:
            X_new = self.engineer_features(X_new)
        
        return X_new, self.model.predict_proba(X_new)
    
    # Add functions here
    def clean_data(self, df):
        # Drop duplicates
        df = df.drop_duplicates()
    
        # Drop temporary workers (.copy() so the assignments below don't touch the original DataFrame)
        df = df[df.department != 'temp'].copy()
    
        # Missing filed_complaint values should be 0
        df['filed_complaint'] = df.filed_complaint.fillna(0)

        # Missing recently_promoted values should be 0
        df['recently_promoted'] = df.recently_promoted.fillna(0)
    
        # 'information_technology' should be 'IT'
        df['department'] = df.department.replace('information_technology', 'IT')

        # Fill missing values in department with 'Missing'
        df['department'] = df.department.fillna('Missing')

        # Indicator variable for missing last_evaluation
        df['last_evaluation_missing'] = df.last_evaluation.isnull().astype(int)
    
        # Fill missing values in last_evaluation with 0
        df['last_evaluation'] = df.last_evaluation.fillna(0)
    
        # Return cleaned dataframe
        return df
    
    def engineer_features(self, df):
        # Work on a copy so the DataFrame that was passed in is left unchanged
        df = df.copy()
        
        # Create indicator features
        df['underperformer'] = ((df.last_evaluation < 0.6) & (df.last_evaluation_missing == 0)).astype(int)
        df['unhappy'] = (df.satisfaction < 0.2).astype(int)
        df['overachiever'] = ((df.last_evaluation > 0.8) & (df.satisfaction > 0.7)).astype(int)
        
        # Create new dataframe with dummy features (columns= names the columns to encode)
        df = pd.get_dummies(df, columns=['department', 'salary'])
    
        # Return augmented DataFrame
        return df
    

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="jupyter"></span>
# 4. Method 1: Jupyter notebook

In this course, we cover two ways to deploy your models:
1. Keep it in Jupyter Notebook
2. Port it to an executable script

If you keep your model in Jupyter Notebook, you can directly use the model class you defined earlier.

First, simply initialize an instance of it:

In [28]:
# Initialize an instance
retention_model = EmployeeRetentionModel('final_model.pkl')

If implemented correctly, these next three statements should all work.

In [29]:
# Predict raw data
_, pred1 = retention_model.predict_proba(raw_data, clean=True, augment=True)

# Predict cleaned data
_, pred2 = retention_model.predict_proba(cleaned_data, clean=False, augment=True)

# Predict cleaned and augmented data
_, pred3 = retention_model.predict_proba(augmented_data, clean=False, augment=False)

By the way, <code style="color:steelblue">_, pred1 =</code> simply means we're throwing away the first object that's returned (which was <code style="color:steelblue">X_new</code>).

Their predictions should all be equivalent.

In [30]:
# Should be true
np.array_equal(pred1, pred2) and np.array_equal(pred2, pred3)

True
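
Keep in mind that <code style="color:steelblue">np.array_equal()</code> demands an exact, element-by-element match. That's what we want here, since all three calls feed the same features to the same model. If you ever compare predictions that took slightly different numerical routes, <code style="color:steelblue">np.allclose()</code> is the more forgiving check. A toy sketch of the difference:

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in floating point
a = np.array([0.1 + 0.2, 0.5])
b = np.array([0.3, 0.5])

# Exact element-by-element equality
exact_match = np.array_equal(a, b)

# Equality within a small tolerance
close_match = np.allclose(a, b)

print(exact_match, close_match)
```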

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="executable"></span>
# 5. Method 2: Executable script (optional)

We've included an example script in the <code style="color:crimson">project_files/</code> directory of the Workbook Bundle called <code style="color:crimson">retention_model.py</code>.

To run the script, you can call it from the command line:

<pre style="color:crimson; margin-bottom:30px">
EDS:Module 5 - Project Delivery EDS$ python project_files/retention_model.py project_files/unseen_raw_data.csv predictions.csv final_model.pkl True True
</pre>

This saves a new file that includes the predictions. It looks like this:

In [31]:
# Will only work after running the command above
predictions = pd.read_csv('predictions.csv')

predictions.head()

FileNotFoundError: File b'predictions.csv' does not exist
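
The contents of <code style="color:crimson">retention_model.py</code> aren't reproduced in this module, so here's a rough, hypothetical sketch of how a script could interpret the five command-line arguments used above. The argument names, and the assumption that the last two values are the <code style="color:steelblue">clean</code> and <code style="color:steelblue">augment</code> flags, are ours; check the bundled script for what it actually does.

```python
def parse_args(argv):
    """Hypothetical: map five positional arguments to named settings."""
    if len(argv) != 5:
        raise SystemExit('usage: <script> data_in data_out model_location clean augment')

    data_in, data_out, model_location, clean, augment = argv

    return {
        'data_in': data_in,
        'data_out': data_out,
        'model_location': model_location,
        # Command-line values arrive as strings, so compare against 'True' explicitly
        # (bool('False') would be True, because any non-empty string is truthy)
        'clean': clean == 'True',
        'augment': augment == 'True',
    }


# In a real script you would call parse_args(sys.argv[1:]) after importing sys
settings = parse_args([
    'project_files/unseen_raw_data.csv',
    'predictions.csv',
    'final_model.pkl',
    'True',
    'True',
])

print(settings)
```

From there, a script along these lines could read the input file, pass it to the model class from Section 3 with the two flags, and write the predictions to the output file.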

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<br>
## Next Steps

Congratulations on completing the final module for <span style="color:royalblue">Project 3: Employee Retention</span>!

As a reminder, here are a few things you did in this module:
* You confirmed your model was saved correctly.
* You compiled data cleaning and feature engineering functions from code you wrote in past modules.
* You learned how to package everything together in a custom model class.
* And you applied your model to raw data in Jupyter Notebook.

In the next project, <span style="color:royalblue">Project 4: Customer Segmentations</span>, you'll get to practice unsupervised learning for the first time, specifically the task of **clustering**. As you'll see, even for unsupervised learning, much of this machine learning workflow will remain the same.

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>