<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Machine Learning from End to End </span> </h1>
<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"> Employee Retention: A Classification Problem, Part 4: Project Delivery </h1>
<hr>

Up to this point, I've taken this project from a simple dataset all the way to a high-performing predictive model. Now I'll show how to use this model to make predictions on brand new (**raw**) data and how to package the work into an executable script.
<br><hr id="toc">

### In this module...

First, we'll import libraries and load our model from Part 3.

Then, we'll cover these steps:

1. [Confirm your model was saved correctly](#confirm)
2. [Write pre-modeling functions](#pre-model)
3. [Construct a model class](#model-class)
4. [Method 1: Jupyter notebook](#jupyter)
5. [Method 2: Executable script](#exectuable)

<br><hr>

### First, let's import libraries and load the model.

Start by importing the libraries that we'll need.

In [1]:
# Computing libraries
import numpy as np
import pandas as pd

# Pickle for reading model files
import pickle

# Scikit-Learn for Modeling
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

Next, load the final model saved in Part 3.

In [3]:
# Load final_model.pkl as model
with open('final_model.pkl', 'rb') as f:
    model = pickle.load(f)

Great, let's begin.

<span id="confirm"></span>
# 1. Confirm your model was saved correctly

A nice and quick sanity check I can do is confirm that the model was saved correctly.
<br>
**First, I'm going to display the model object. This helps to confirm a few key details:**
* It should be a model <code style="color:steelblue">Pipeline</code>.
* The first step should be a <code style="color:steelblue">StandardScaler</code> preprocessing step.
* The second step should be a <code style="color:steelblue">RandomForestClassifier</code> model.

Then I will load my <code style="color:steelblue">analytical_base_table</code>, split it into training and test sets, and predict on <code style="color:steelblue">X_test</code> again, just as I did in Part 3. The difference is that I now use <code style="color:steelblue">roc_auc_score</code> instead of <code style="color:steelblue">roc_curve</code> followed by <code style="color:steelblue">auc</code>. This skips computing the ROC curve as an intermediate step and goes straight to the AUROC, which is the metric I'm after.

In [4]:
# Display model object
model

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.33, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_i...imators=100, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False))])
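
The three checks listed above can also be done programmatically rather than by reading the printed object. The following is a minimal sketch, not the original notebook's code: it builds a small stand-in pipeline (the name `stand_in` is hypothetical) and round-trips it through pickle, because `final_model.pkl` is not available here. In the notebook, the same `isinstance` checks would be run on `model` directly.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline built here for illustration only; in the notebook you
# would run the checks on the `model` object loaded from final_model.pkl
stand_in = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=123),
)

# Round-trip through pickle, as saving and re-loading the file would do
restored = pickle.loads(pickle.dumps(stand_in))

# The same three checks described in the text
is_pipeline = isinstance(restored, Pipeline)
first_is_scaler = isinstance(restored.steps[0][1], StandardScaler)
second_is_forest = isinstance(restored.steps[1][1], RandomForestClassifier)

print(is_pipeline, first_is_scaler, second_is_forest)
```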

In [5]:
# Load analytical base table used in Part 3
abt_df = pd.read_csv('analytical_base_table.csv')

In [6]:
# Create separate object for target variable
y = abt_df['status']

# Create separate object for input features
X = abt_df.drop('status', axis=1)

# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234, stratify=y)


In [9]:
# Predict X_test
pred = model.predict_proba(X_test)

# Get just the prediction for the positive class (1)
pred = [p[1] for p in pred]

# Print AUROC
print('AUROC:', roc_auc_score(y_test, pred))

AUROC: 0.9915201892159932
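
To illustrate why the intermediate curve can be skipped, here is a small sketch on made-up labels and scores (the arrays are invented for this example and are not from the project data). It computes the area under the ROC curve both ways and shows that the results agree.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

# Two-step route: build the ROC curve, then take the area under it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_two_step = auc(fpr, tpr)

# One-step route: go straight to the area under the curve
auc_one_step = roc_auc_score(y_true, y_score)

print(auc_two_step, auc_one_step)
```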


Now I'm going to load some brand new, **raw data** that the model has never seen, and try to apply the model to it directly. As one might predict, this throws an error: the model was trained on the <code style="color:steelblue">analytical_base_table</code>, which already includes all the data cleaning and feature engineering we did in Part 2, whereas the raw data still contains text columns and missing values.

In [10]:
raw_data = pd.read_csv('project_files/unseen_raw_data.csv')

print( raw_data.shape )
raw_data.head()

(750, 9)


Unnamed: 0,avg_monthly_hrs,department,filed_complaint,last_evaluation,n_projects,recently_promoted,salary,satisfaction,tenure
0,228,management,,0.735618,2,,high,0.805661,3.0
1,229,product,,1.0,4,,low,0.719961,4.0
2,196,sales,1.0,0.557426,4,,low,0.749835,2.0
3,207,IT,,0.715171,3,,high,0.987447,3.0
4,129,management,,0.484818,2,,low,0.441219,3.0


In [11]:
# Should throw an error
pred = model.predict_proba(raw_data)

ValueError: could not convert string to float: 'low'
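
The error message points at a text value (`'low'`) that could not be converted to a number. One quick way to spot such columns before calling the model is sketched below on a few made-up rows (the `toy_raw` frame is hypothetical and only mimics the shape of the raw file).

```python
import pandas as pd

# Made-up rows mimicking the raw file's mix of numeric and text columns
toy_raw = pd.DataFrame({
    'avg_monthly_hrs': [228, 229],
    'department': ['management', 'product'],
    'salary': ['high', 'low'],
    'tenure': [3.0, 4.0],
})

# Columns that are not numeric cannot be passed to the model as they are
non_numeric = toy_raw.select_dtypes(exclude='number').columns.tolist()

print(non_numeric)
```

Any column listed here needs to be encoded (or dropped) before prediction, which is what the pre-modeling functions in the next section take care of.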

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="pre-model"></span>
# 2. Write pre-modeling functions

All we need to do is write a few functions to **convert the raw data to the same format as the analytical base table**. So I'll now write a function called <code style="color:steelblue">clean_data()</code> that bundles together all of the data cleaning steps. Once I do that I create a new DataFrame named <code style="color:steelblue">cleaned_data</code> using this new function.


In [12]:
def clean_data(df):
    # Drop duplicates (assign the result; drop_duplicates() is not in-place by default)
    df = df.drop_duplicates()
    # Drop temporary workers; copy so later assignments don't touch a slice of the input
    df = df[df.department != 'temp'].copy()
    # Missing filed_complaint values should be 0
    df['filed_complaint'] = df['filed_complaint'].fillna(0)
    # Missing recently_promoted values should be 0
    df['recently_promoted'] = df['recently_promoted'].fillna(0)
    # 'information_technology' should be 'IT'
    df['department'] = df['department'].replace('information_technology', 'IT')
    # Fill missing values in department with 'Missing'
    df['department'] = df['department'].fillna('Missing')
    # Indicator variable for missing last_evaluation
    df['last_evaluation_missing'] = df['last_evaluation'].isnull().astype(int)
    # Fill missing values in last_evaluation with 0
    df['last_evaluation'] = df['last_evaluation'].fillna(0)
    # Return cleaned dataframe
    return df
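
A detail that is easy to miss in cleaning functions like this one: `DataFrame.drop_duplicates()` returns a new DataFrame by default and leaves the original untouched, so its result has to be assigned. Similarly, taking an explicit `.copy()` after filtering rows avoids pandas' "value is trying to be set on a copy of a slice" warning on later column assignments. A minimal sketch on a made-up toy frame (the data and variable names are illustrative only):

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'department': ['sales', 'sales', 'temp', 'IT'],
    'filed_complaint': [1.0, 1.0, None, None],
})

# Calling drop_duplicates() without assigning leaves `toy` unchanged
toy.drop_duplicates()
rows_without_assignment = len(toy)

# Assigning the result keeps only the de-duplicated rows
deduped = toy.drop_duplicates()
rows_with_assignment = len(deduped)

# An explicit copy after filtering gives an independent frame,
# so assigning to its columns is unambiguous
filtered = deduped[deduped.department != 'temp'].copy()
filtered['filed_complaint'] = filtered['filed_complaint'].fillna(0)

print(rows_without_assignment, rows_with_assignment, len(filtered))
```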

In [13]:
# Create cleaned_data
cleaned_data = clean_data(raw_data)

# Display first 5 rows
cleaned_data.head()

Unnamed: 0,avg_monthly_hrs,department,filed_complaint,last_evaluation,n_projects,recently_promoted,salary,satisfaction,tenure,last_evaluation_missing
0,228,management,0.0,0.735618,2,0.0,high,0.805661,3.0,0
1,229,product,0.0,1.0,4,0.0,low,0.719961,4.0,0
2,196,sales,1.0,0.557426,4,0.0,low,0.749835,2.0,0
3,207,IT,0.0,0.715171,3,0.0,high,0.987447,3.0,0
4,129,management,0.0,0.484818,2,0.0,low,0.441219,3.0,0


Next, I write a function called <code style="color:steelblue">engineer_features()</code> that compiles all of the feature engineering steps. I leave out any steps that process the target variable, since that variable isn't available when predicting new, unseen observations. Then I create a new DataFrame named <code style="color:steelblue">augmented_data</code> using this function, remembering to pass in <code style="color:steelblue">cleaned_data</code> and not <code style="color:steelblue">raw_data</code>. Finally, as a check that both functions work, I predict probabilities on the result: the model should now run without errors and return values that are mostly close to 0 or 1.

In [14]:
def engineer_features(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    
    # Create indicator features
    df['underperformer'] = ((df['last_evaluation'] < 0.6) & (df['last_evaluation_missing'] == 0)).astype(int)
    df['unhappy'] = (df['satisfaction'] < 0.2).astype(int)
    df['overachiever'] = ((df['last_evaluation'] > 0.8) & (df['satisfaction'] > 0.7)).astype(int)
    
    # Create new dataframe with dummy features
    df = pd.get_dummies(df, columns=['department','salary'])
    
    # Return augmented DataFrame
    return df

In [17]:
# Create augmented_data
augmented_data = engineer_features(cleaned_data)

# Display first 5 rows
augmented_data.head()


Unnamed: 0,avg_monthly_hrs,filed_complaint,last_evaluation,n_projects,recently_promoted,satisfaction,tenure,last_evaluation_missing,underperformer,unhappy,...,department_finance,department_management,department_marketing,department_procurement,department_product,department_sales,department_support,salary_high,salary_low,salary_medium
0,228,0.0,0.735618,2,0.0,0.805661,3.0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
1,229,0.0,1.0,4,0.0,0.719961,4.0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
2,196,1.0,0.557426,4,0.0,0.749835,2.0,0,1,0,...,0,0,0,0,0,1,0,0,1,0
3,207,0.0,0.715171,3,0.0,0.987447,3.0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,129,0.0,0.484818,2,0.0,0.441219,3.0,0,1,0,...,0,1,0,0,0,0,0,0,1,0


In [18]:
# Predict probabilities
pred = model.predict_proba(augmented_data)

# Print first 5 predictions
pred[:5]

array([[1.  , 0.  ],
       [0.98, 0.02],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.  , 1.  ]])
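
One caveat worth keeping in mind with `pd.get_dummies`: it only creates columns for the categories present in the data it is given. If a batch of new data happens to lack a category that appeared in training (for example, no `high` salaries), the dummy columns will not line up with what the model expects. The sketch below shows one common way to guard against this, assuming the list of training columns was kept around. The toy frames and the name `train_columns` are illustrative and not part of the original project.

```python
import pandas as pd

# Toy "training" data containing every salary level (made up)
train = pd.DataFrame({'salary': ['low', 'medium', 'high']})

# Toy "new" data that happens to contain only one level (made up)
new = pd.DataFrame({'salary': ['low', 'low']})

# Columns the model would have seen during training
train_columns = pd.get_dummies(train, columns=['salary'], dtype=int).columns

# Dummies built from the new data alone are missing two columns
new_dummies = pd.get_dummies(new, columns=['salary'], dtype=int)

# Reindex to the training columns, filling absent categories with 0
aligned = new_dummies.reindex(columns=train_columns, fill_value=0)

print(list(new_dummies.columns))
print(list(aligned.columns))
```

With the columns aligned this way, the new data has the same layout as the training data regardless of which categories it contains.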

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<br id="model-class">
# 3. Construct a model class

Great, now let's package these functions together into a single **model class**. 

In [19]:
class EmployeeRetentionModel:
    
    def __init__(self, model_location):
        # Load the pickled model pipeline from disk
        with open(model_location, 'rb') as f:
            self.model = pickle.load(f)
    
    def predict_proba(self, X_new, clean=True, augment=True):
        # Optionally clean and/or augment the data before predicting
        if clean:
            X_new = self.clean_data(X_new)
        
        if augment:
            X_new = self.engineer_features(X_new)
        
        # Return the processed data along with the predicted probabilities
        return X_new, self.model.predict_proba(X_new)
    
    # Pre-modeling functions from section 2
    def clean_data(self, df):
        # Assign the result; drop_duplicates() is not in-place by default
        df = df.drop_duplicates()
        # Copy after filtering so later assignments don't touch a slice of the input
        df = df[df.department != 'temp'].copy()
        df['filed_complaint'] = df['filed_complaint'].fillna(0)
        df['recently_promoted'] = df['recently_promoted'].fillna(0)
        df['department'] = df['department'].replace('information_technology', 'IT')
        df['department'] = df['department'].fillna('Missing')
        df['last_evaluation_missing'] = df['last_evaluation'].isnull().astype(int)
        df['last_evaluation'] = df['last_evaluation'].fillna(0)
        return df
    
    def engineer_features(self, df):
        # Work on a copy so the caller's DataFrame is not modified
        df = df.copy()
        df['underperformer'] = ((df['last_evaluation'] < 0.6) & (df['last_evaluation_missing'] == 0)).astype(int)
        df['unhappy'] = (df['satisfaction'] < 0.2).astype(int)
        df['overachiever'] = ((df['last_evaluation'] > 0.8) & (df['satisfaction'] > 0.7)).astype(int)
        df = pd.get_dummies(df, columns=['department','salary'])
        return df

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<span id="jupyter"></span>
# 4. Method 1: Jupyter notebook

In this course, we will cover 2 different ways to deploy your models.
1. Keep it in Jupyter Notebook
2. Port it to an executable script

Since I prefer to work in Jupyter notebooks, I'm keeping the model in a notebook, where I can directly use the model class defined earlier.

To demonstrate, I simply initialize an instance of it:

In [20]:
# Initialize an instance
retention_model = EmployeeRetentionModel('final_model.pkl')

If implemented correctly, these next three statements should all work.

In [21]:
# Predict raw data
_, pred1 = retention_model.predict_proba(raw_data, clean=True, augment=True)

# Predict cleaned data
_, pred2 = retention_model.predict_proba(cleaned_data, clean=False, augment=True)

# Predict cleaned and augmented data
_, pred3 = retention_model.predict_proba(augmented_data, clean=False, augment=False)


By the way, <code style="color:steelblue">_, pred1 =</code> simply means we're throwing away the first object that's returned (which was <code style="color:steelblue">X_new</code>).

Their predictions should all be equivalent.

In [22]:
# Should be true
np.array_equal(pred1, pred2) and np.array_equal(pred2, pred3)

True

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>

<br>
## Next Steps

Congratulations on completing the final module for <span style="color:royalblue">Project 3: Employee Retention</span>!

As a reminder, here are a few things you did in this module:
* You confirmed your model was saved correctly.
* You compiled data cleaning and feature engineering functions from code you wrote in past modules.
* You learned how to package everything together in a custom model class.
* And you applied your model to raw data in Jupyter Notebook.

<div style="text-align:center; margin: 40px 0 40px 0; font-weight:bold">
<a href="#toc">Back to Contents</a>
</div>