# Project Delivery
Let's see how you can use your model to make predictions on brand-new raw data and package your work into an executable script that can be called from the command line.

In [131]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# Function for splitting training and test set
from sklearn.model_selection import train_test_split

# To save and load the final model on disk
# (sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package)
import joblib

# Area under ROC curve
from sklearn.metrics import roc_auc_score

In [132]:
model = joblib.load('Save/rfc_emp_retention.pkl') 

## 1. Confirm your model was saved correctly
### 1.1 Display the model object
It should be a RandomForestClassifier with the expected parameter values.

In [102]:
# Display model object
model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=5,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
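
Rather than only eyeballing the printed object, the same check can be done programmatically. The sketch below is illustrative only: it uses a stand-in, unfitted classifier saved to a temporary file (toy_model and expected are hypothetical names, not part of the project), and it assumes the standalone joblib package is installed.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the saved model: an unfitted classifier with the settings shown above
toy_model = RandomForestClassifier(n_estimators=200, min_samples_leaf=2, min_samples_split=5)

# Save and reload it, the same way the real model file would be handled
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(toy_model, path)
    loaded = joblib.load(path)

# Compare the hyperparameters we care about against the expected values
expected = {'n_estimators': 200, 'min_samples_leaf': 2, 'min_samples_split': 5}
params = loaded.get_params()
mismatches = {k: params[k] for k in expected if params[k] != expected[k]}

is_random_forest = isinstance(loaded, RandomForestClassifier)
print(is_random_forest, mismatches)
```

An empty mismatches dictionary means the reloaded object carries the hyperparameters you expected.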

### 1.2 Replicate model scores
Let's use it to predict the same test set as in the previous notebook to confirm that our model was saved correctly.

In [170]:
# Load analytical base table used in previous file
df = pd.read_csv('Files/analytical_base_table.csv')

**Note:** The split settings below (test_size, random_state, and stratify) should match those used previously.

In [172]:
# Create separate object for target variable
y = df.status

# Create separate object for input features
X = df.drop('status', axis=1)

# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=1234,
                                                    stratify=df.status)

In [171]:
train_mean = pd.read_pickle("Save/train_mean.pkl")
train_std = pd.read_pickle("Save/train_std.pkl")

In [173]:
## Standardize the train data set
X_train = (X_train - train_mean) / train_std

## Note: We use train_mean and train_std (not the test set's own statistics) to standardize the test data set
X_test = (X_test - train_mean) / train_std
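
To make the reasoning concrete, here is a small sketch on toy data. It assumes the saved statistics were computed as the column-wise mean and (sample) standard deviation of the training features; the actual notebook that produced train_mean.pkl and train_std.pkl is not shown here, and all names ending in _toy are hypothetical.

```python
import pandas as pd

# Toy training and test features
X_train_toy = pd.DataFrame({'tenure': [2.0, 4.0, 6.0], 'n_projects': [3.0, 5.0, 7.0]})
X_test_toy = pd.DataFrame({'tenure': [4.0], 'n_projects': [9.0]})

# Statistics come from the training set only
mean_toy = X_train_toy.mean()
std_toy = X_train_toy.std()  # pandas uses the sample standard deviation (ddof=1)

# The same statistics are applied to both sets
X_train_std_toy = (X_train_toy - mean_toy) / std_toy
X_test_std_toy = (X_test_toy - mean_toy) / std_toy

print(X_train_std_toy.mean())  # approximately 0 for each column
print(X_test_std_toy)
```

The standardized training columns have mean 0 and standard deviation 1 by construction, while the test columns generally do not, which is expected.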

In [174]:
# Predict X_test
y_pred_proba = model.predict_proba(X_test)[:,1]

In [175]:
# Print AUC
print('AUC:', roc_auc_score(y_test, y_pred_proba))

AUC: 0.990330649118


Looks good. A small difference from the earlier score is OK!

### 1.3 - Predict raw data
Just now, we loaded our analytical base table and applied our model to it.
* But if new data arrives in the same format as the original **raw data**, can you still apply your model to it?
* Wouldn't you need to first clean the new data the same way and engineer the same features?

First, let's load some brand new data that we've never seen before.

In [139]:
raw_data = pd.read_csv('Files/unseen_raw_data.csv')

print(raw_data.shape)
raw_data.head()

(750, 9)


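| | avg_monthly_hrs | department | filed_complaint | last_evaluation | n_projects | recently_promoted | salary | satisfaction | tenure |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 228 | management | NaN | 0.735618 | 2 | NaN | high | 0.805661 | 3.0 |
| 1 | 229 | product | NaN | 1.0 | 4 | NaN | low | 0.719961 | 4.0 |
| 2 | 196 | sales | 1.0 | 0.557426 | 4 | NaN | low | 0.749835 | 2.0 |
| 3 | 207 | IT | NaN | 0.715171 | 3 | NaN | high | 0.987447 | 3.0 |
| 4 | 129 | management | NaN | 0.484818 | 2 | NaN | low | 0.441219 | 3.0 |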


As you can see, this is in the original format of the raw data, before any cleaning or feature engineering.
* There are still missing values, wannabe indicator variables, and many other things to fix.
* **Note that it does not have the target variable.** When you actually apply your model to new observations, they won't have labels (because it's your job to predict them).

**So let's see what happens when we try to apply our model to this raw dataset.**

In [111]:
# Should throw an error
y_pred_proba = model.predict_proba(raw_data)[:,1]

ValueError: could not convert string to float: 'low'
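
The error comes from text columns such as salary being passed straight to the model. A quick way to see which columns would cause this is to list the non-numeric ones; the sketch below uses a hypothetical toy frame (toy_raw) with a few of the raw columns rather than the real file.

```python
import pandas as pd

# Toy frame with a few of the raw columns
toy_raw = pd.DataFrame({
    'avg_monthly_hrs': [228, 229],
    'department': ['management', 'product'],
    'salary': ['high', 'low'],
    'satisfaction': [0.805661, 0.719961],
})

# Columns that are not numeric cannot be fed to the model as they are
non_numeric = toy_raw.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```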

#### No surprise there. Let's fix it!

## 2. Write pre-modeling functions
All we need to do is write a few functions to **convert the raw data to the same format as the analytical base table.**
* That means we need to bundle together our data cleaning steps.
* Then we need to bundle together our feature engineering steps.
* **Note:** We can skip the exploratory analysis steps because we didn't alter our dataframe then.

### 2.1 Data Cleaning
Write a function called **clean_data()** that bundles together all of the data cleaning steps.
* It's helpful to open up the 'Employee Retention' notebook (.ipynb) and go through it from top to bottom.
* Only include steps that altered your dataframe! You can just copy-paste them here.

In [140]:
def clean_data(df):
    '''
       This function takes a raw dataframe and returns it 
       after performing all the data cleaning steps.
    '''
    
    # Drop duplicates
    df = df.drop_duplicates()

    # Drop temporary workers (copy so that later assignments do not act on a slice)
    df = df[df.department != 'temp'].copy()

    # Missing filed_complaint values should be 0
    df['filed_complaint'] = df.filed_complaint.fillna(0)

    # Missing recently_promoted values should be 0
    df['recently_promoted'] = df.recently_promoted.fillna(0)

    # 'information_technology' should be 'IT'
    # (assign the result back instead of relying on chained inplace=True)
    df['department'] = df.department.replace('information_technology', 'IT')

    # Fill missing values in department with 'Missing'
    df['department'] = df.department.fillna('Missing')

    # Indicator variable for missing last_evaluation
    df['last_evaluation_missing'] = df.last_evaluation.isnull().astype(int)

    # Fill missing values in last_evaluation with 0
    df['last_evaluation'] = df.last_evaluation.fillna(0)
    
    # Return cleaned dataframe
    return df

Create a new DataFrame named cleaned_data using the function you just wrote.

Then display its first 5 rows.

In [141]:
# Create cleaned_data
cleaned_data = clean_data(raw_data)

# Display first 5 rows
cleaned_data.head()

| | avg_monthly_hrs | department | filed_complaint | last_evaluation | n_projects | recently_promoted | salary | satisfaction | tenure | last_evaluation_missing |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 228 | management | 0.0 | 0.735618 | 2 | 0.0 | high | 0.805661 | 3.0 | 0 |
| 1 | 229 | product | 0.0 | 1.0 | 4 | 0.0 | low | 0.719961 | 4.0 | 0 |
| 2 | 196 | sales | 1.0 | 0.557426 | 4 | 0.0 | low | 0.749835 | 2.0 | 0 |
| 3 | 207 | IT | 0.0 | 0.715171 | 3 | 0.0 | high | 0.987447 | 3.0 | 0 |
| 4 | 129 | management | 0.0 | 0.484818 | 2 | 0.0 | low | 0.441219 | 3.0 | 0 |


As you can see, we no longer have missing values, wannabe indicator variables, or other structural issues that we fixed during data cleaning.
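
Looking at five rows is only a spot check. A count of missing values before and after cleaning gives a firmer confirmation. The sketch below applies a few of the same cleaning steps to a hypothetical toy frame (toy and toy_clean are illustrative names) rather than to the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with the kinds of gaps seen in the raw data
toy = pd.DataFrame({
    'filed_complaint': [np.nan, 1.0],
    'last_evaluation': [0.7, np.nan],
    'department': ['IT', None],
})

# Apply a subset of the cleaning steps to a copy
toy_clean = toy.copy()
toy_clean['filed_complaint'] = toy_clean['filed_complaint'].fillna(0)
toy_clean['department'] = toy_clean['department'].fillna('Missing')
toy_clean['last_evaluation_missing'] = toy_clean['last_evaluation'].isnull().astype(int)
toy_clean['last_evaluation'] = toy_clean['last_evaluation'].fillna(0)

# Total number of missing cells before and after cleaning
missing_before = int(toy.isnull().sum().sum())
missing_after = int(toy_clean.isnull().sum().sum())
print(missing_before, missing_after)
```

On the real data, the equivalent check would be cleaned_data.isnull().sum(), which should show zero for every column.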

### 2.2 Feature Engineering
Next, write a function called engineer_features() that compiles all of the feature engineering steps.
* Pick up where you left off in the 'Employee Retention' notebook and keep going from top to bottom.
* Only include steps that altered your dataframe!
* **Do not include any steps used to process the target variable.** We don't have that variable when predicting new, unseen observations.

In [142]:
def engineer_features(df):
    '''
       This function takes a cleaned dataframe as input and returns it
       after performing all the feature engineering steps.
    '''
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()

    # Create indicator features
    df['underperformer'] = ((df.last_evaluation < 0.6) & 
                            (df.last_evaluation_missing == 0)).astype(int)

    df['unhappy'] = (df.satisfaction < 0.2).astype(int)

    df['overachiever'] = ((df.last_evaluation > 0.8) & (df.satisfaction > 0.7)).astype(int)
        
    # Create new dataframe with dummy features
    df = pd.get_dummies(df, columns=['department', 'salary'])
    
    # Return augmented DataFrame
    return df

Create a new DataFrame named augmented_data using the function you just wrote.

Then display its first 5 rows.

**Remember to pass in cleaned_data, not raw_data.**

In [159]:
# Create augmented_data
augmented_data = engineer_features(cleaned_data)

# Display first 5 rows
augmented_data.head()

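| | avg_monthly_hrs | filed_complaint | last_evaluation | n_projects | recently_promoted | satisfaction | tenure | last_evaluation_missing | underperformer | unhappy | ... | department_finance | department_management | department_marketing | department_procurement | department_product | department_sales | department_support | salary_high | salary_low | salary_medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 228 | 0.0 | 0.735618 | 2 | 0.0 | 0.805661 | 3.0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 229 | 0.0 | 1.0 | 4 | 0.0 | 0.719961 | 4.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 196 | 1.0 | 0.557426 | 4 | 0.0 | 0.749835 | 2.0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 207 | 0.0 | 0.715171 | 3 | 0.0 | 0.987447 | 3.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 129 | 0.0 | 0.484818 | 2 | 0.0 | 0.441219 | 3.0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |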


### 2.3 Data Preprocessing
In this step, we apply the same preprocessing that we applied to the test data just before modeling.

**Remember that we standardized the data.**

In [160]:
## Note: We use train_mean and train_std to standardize any new data set
augmented_data_std = (augmented_data - train_mean) / train_std
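
One thing worth checking at this step: subtracting a Series from a DataFrame aligns on column names, so if the new data happens to lack a dummy column that existed in training (for example, a department with no employees in the new file), that column comes out as all NaN. The sketch below illustrates the issue and one possible guard, reindexing to the training columns. It is an assumption-laden toy example (all names ending in _toy are hypothetical), not a step taken from the original notebook.

```python
import pandas as pd

# Toy training statistics, indexed by training column names
train_mean_toy = pd.Series({'tenure': 3.0, 'department_IT': 0.1, 'department_sales': 0.3})
train_std_toy = pd.Series({'tenure': 1.5, 'department_IT': 0.3, 'department_sales': 0.45})

# New data in which nobody works in IT, so get_dummies never created department_IT
new_toy = pd.DataFrame({'tenure': [3.0, 6.0], 'department_sales': [1, 0]})

# Naive standardization: the missing column becomes NaN
naive_toy = (new_toy - train_mean_toy) / train_std_toy

# Guard: align to the training columns first, filling absent dummies with 0
aligned_toy = new_toy.reindex(columns=train_mean_toy.index, fill_value=0)
aligned_std_toy = (aligned_toy - train_mean_toy) / train_std_toy

print(naive_toy.isnull().sum())
print(aligned_std_toy)
```

Filling with 0 is only appropriate for dummy columns; a missing numeric feature would need a different fix.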

In [161]:
augmented_data_std.describe()

| | avg_monthly_hrs | filed_complaint | last_evaluation | n_projects | recently_promoted | satisfaction | tenure | last_evaluation_missing | underperformer | unhappy | ... | department_finance | department_management | department_marketing | department_procurement | department_product | department_sales | department_support | salary_high | salary_low | salary_medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | ... | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 | 740.0 |
| mean | 0.007014 | 0.007707 | 0.060717 | 0.024359 | -0.032071 | 0.021283 | 0.027987 | -0.043061 | -0.042975 | -0.004111 | ... | -0.00154 | 0.013034 | -0.020201 | 0.073833 | -0.016832 | 0.031326 | -0.012163 | 0.125797 | -0.006273 | -0.063284 |
| std | 0.998375 | 1.008486 | 0.966701 | 0.978446 | 0.885742 | 1.010111 | 1.025819 | 0.938918 | 0.979231 | 0.994729 | ... | 0.997554 | 1.02892 | 0.961187 | 1.285909 | 0.969028 | 1.015461 | 0.988482 | 1.170468 | 1.000465 | 0.989611 |
| min | -2.077494 | -0.406642 | -2.417939 | -1.455347 | -0.145713 | -2.314177 | -1.02492 | -0.326797 | -0.634434 | -0.315177 | ... | -0.236674 | -0.216067 | -0.246763 | -0.111006 | -0.254373 | -0.622329 | -0.418293 | -0.29836 | -0.976767 | -0.868481 |
| 25% | -0.874096 | -0.406642 | -0.395147 | -0.643647 | -0.145713 | -0.665331 | -0.339651 | -0.326797 | -0.634434 | -0.315177 | ... | -0.236674 | -0.216067 | -0.246763 | -0.111006 | -0.254373 | -0.622329 | -0.418293 | -0.29836 | -0.976767 | -0.868481 |
| 50% | -0.031717 | -0.406642 | 0.168131 | 0.168052 | -0.145713 | 0.175075 | -0.339651 | -0.326797 | -0.634434 | -0.315177 | ... | -0.236674 | -0.216067 | -0.246763 | -0.111006 | -0.254373 | -0.622329 | -0.418293 | -0.29836 | -0.976767 | -0.868481 |
| 75% | 0.910944 | -0.406642 | 0.815132 | 0.979752 | -0.145713 | 0.7935 | 0.345618 | -0.326797 | 1.576068 | -0.315177 | ... | -0.236674 | -0.216067 | -0.246763 | -0.111006 | -0.254373 | 1.606725 | -0.418293 | -0.29836 | 1.023695 | 1.151334 |
| max | 2.174513 | 2.458948 | 1.311132 | 2.603151 | 6.862203 | 1.518713 | 4.457231 | 3.059733 | 1.576068 | 3.172536 | ... | 4.224847 | 4.627788 | 4.05211 | 9.007706 | 3.930883 | 1.606725 | 2.390458 | 3.35136 | 1.023695 | 1.151334 |


Predict probabilities for augmented_data using your model.

In [162]:
# Predict probabilities
y_pred_proba = model.predict_proba(augmented_data_std)[:,1]

# Print first 5 predictions
print(y_pred_proba[:5])

[ 0.          0.09587491  0.00487793  0.          0.99759977]


## 3. Construct a model class
* Now let's package these functions together into a single model class.
* This is a convenient way to keep all of the logic for a given model in one place.

### 3.1 - Python classes
Remember how, when we were training our model, we imported LogisticRegression, RandomForestClassifier, XGBClassifier, etc.?

We called them "algorithms," but they are technically Python classes.

***Python classes*** are structures that allow us to group related code, logic, and functions in one place.
* Those familiar with object-oriented programming will recognize this concept.
* For our purposes, we only need to write a very basic, bare-bones class.

For example, each of those algorithms has fit() and predict_proba() methods that allow you to train and apply models, respectively.
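
As a rough illustration of that pattern, here is a deliberately trivial class with the same two method names. ConstantClassifier is a made-up toy, not a scikit-learn class: it "learns" the share of positive labels and predicts that share for every row.

```python
class ConstantClassifier:
    """Toy class: remembers the share of positive labels and always predicts it."""

    def __init__(self):
        # Nothing has been learned until fit() is called
        self.positive_rate = None

    def fit(self, X, y):
        # "Training" here is just computing the fraction of 1s in y
        self.positive_rate = sum(y) / len(y)
        return self

    def predict_proba(self, X):
        # One [P(class 0), P(class 1)] pair per input row
        p = self.positive_rate
        return [[1 - p, p] for _ in X]


clf = ConstantClassifier().fit([[1], [2], [3], [4]], [0, 0, 0, 1])
probs = clf.predict_proba([[5], [6]])
print(probs)
```

The point is only the structure: state stored on self in \__init\__() and fit(), and methods that use that state later.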

### 3.2 - Custom class
We're going to construct our own custom Python class for our employee retention model.
* Thankfully, it doesn't need to be nearly as complex as those other algorithm classes because we're not actually using this to train the model.
* Instead, we already have the model saved in the rfc_emp_retention.pkl file.
* We only need to include logic for cleaning data, feature engineering, and predicting new observations.

In [177]:
class EmployeeRetentionModel:
    
    def __init__(self, model_location, train_mean, train_std):
        '''
           model_location -> will be the file location of the saved final model.
           train_mean -> location of mean of the features in training data
           train_std -> location of standard deviation of the features in training data
        '''
        # Load the model
        self.model = joblib.load(model_location)
        
        # Load the train mean and train std dev
        self.train_mean = pd.read_pickle(train_mean)
        self.train_std = pd.read_pickle(train_std)
    
    def predict_proba(self, X_new, clean=True, augment=True):
        if clean:
            X_new = self.clean_data(X_new)
        
        if augment:
            X_new = self.engineer_features(X_new)
        
        ## Standardizing the data
        X_new = (X_new - self.train_mean)/ self.train_std
        
        return X_new, self.model.predict_proba(X_new)[:, 1]
    
    # Data cleaning steps (same logic as the standalone clean_data function)
    def clean_data(self, df):
        # Drop duplicates
        df = df.drop_duplicates()

        # Drop temporary workers (copy so that later assignments do not act on a slice)
        df = df[df.department != 'temp'].copy()

        # Missing filed_complaint values should be 0
        df['filed_complaint'] = df.filed_complaint.fillna(0)

        # Missing recently_promoted values should be 0
        df['recently_promoted'] = df.recently_promoted.fillna(0)

        # 'information_technology' should be 'IT'
        # (assign the result back instead of relying on chained inplace=True)
        df['department'] = df.department.replace('information_technology', 'IT')

        # Fill missing values in department with 'Missing'
        df['department'] = df.department.fillna('Missing')

        # Indicator variable for missing last_evaluation
        df['last_evaluation_missing'] = df.last_evaluation.isnull().astype(int)

        # Fill missing values in last_evaluation with 0
        df['last_evaluation'] = df.last_evaluation.fillna(0)

        # Return cleaned dataframe
        return df
    
    def engineer_features(self, df):
        # Work on a copy so the caller's dataframe is not modified
        df = df.copy()

        # Create indicator features
        df['underperformer'] = ((df.last_evaluation < 0.6) & 
                                (df.last_evaluation_missing == 0)).astype(int)

        df['unhappy'] = (df.satisfaction < 0.2).astype(int)

        df['overachiever'] = ((df.last_evaluation > 0.8) & (df.satisfaction > 0.7)).astype(int)

        # Create new dataframe with dummy features
        df = pd.get_dummies(df, columns=['department', 'salary'])

        # Return augmented DataFrame
        return df

#### Description
* The class is named EmployeeRetentionModel.
* The \__init\__() function loads the model, the training mean, and the training standard deviation as soon as the object is initialized.
* The predict_proba() function applies our model to new data and returns both the standardized data and the predicted probabilities.
* The clean and augment options let us skip data cleaning or feature engineering, so we can handle data that has already been processed.
* Finally, clean_data() and engineer_features() are the same functions that we wrote earlier.

## 4. Deploying the model
We will look at two ways to deploy the model.

### 4.1 Jupyter Notebook
The easiest way to deploy your model is to keep it in a Jupyter Notebook.

#### 4.1.1 Benefits
* It's seamless to update your model.
* You can perform ad-hoc data exploration and visualization.
* You can write detailed comments and documentation, with images, bullet lists, etc.

#### 4.1.2 Applying your model class
If you keep your model in Jupyter Notebook, you can directly use the model class you defined earlier.

First, simply initialize an instance of it:

In [178]:
# Initialize an instance
retention_model = EmployeeRetentionModel('Save/rfc_emp_retention.pkl', 'Save/train_mean.pkl', 'Save/train_std.pkl')

#### 4.1.3 - Predicting new data
If implemented correctly, these next three statements should all work.

In [179]:
# Predict raw data
_, pred1 = retention_model.predict_proba(raw_data, clean=True, augment=True)

# Predict cleaned data
_, pred2 = retention_model.predict_proba(cleaned_data, clean=False, augment=True)

# Predict cleaned and augmented data
_, pred3 = retention_model.predict_proba(augmented_data, clean=False, augment=False)

Writing _, pred1 = ... simply means that we discard the first object returned (X_new) and keep only the predictions.

Their predictions should all be equivalent.

In [180]:
# Should be true
np.array_equal(pred1, pred2) and np.array_equal(pred2, pred3)

True

### 4.2 Executable Script
However, there will definitely be situations in which Jupyter notebooks are not enough. The most common scenarios are if you need to integrate your model with other parts of an application or automate it.

For these use cases, you can simply package your model into an executable script.

#### 4.2.1 Benefits
You can call these scripts from any command line program. That means you can:
* Host your model on the cloud (like AWS)
* Integrate it as part of a web application (have other parts of the application call the script)
* Or automate it (schedule recurring jobs to run the script)

Let's see how to actually set up the script.

#### 4.2.2 Script structure
Refer to the retention_model.py file.
Once you open it (in the Jupyter file browser or in any text editor), you'll see that you've already written most of the code needed, such as:
* The libraries to import
* The custom model class, including the data cleaning, feature engineering, and predict probability functions

All that's left is a little bit of logic at the bottom to handle command line requests.

#### 4.2.3 Example usage
To run the script, you can call it from the command line:
* Assuming you know your way around a command line tool.
* You need to first navigate to the project folder from the command line.

Call the script as follows (assuming that you are in the "Employee Retention" folder):

python Files/retention_model.py Files/unseen_raw_data.csv Files/predictions.csv Save/rfc_emp_retention.pkl Save/train_mean.pkl Save/train_std.pkl True True
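
The contents of retention_model.py are not reproduced here, so the following is only a sketch of what the command-line handling at the bottom of such a script might look like. It assumes the seven positional arguments appear in the order used in the command above; the argument names, parse_args, and str_to_bool are hypothetical and chosen for illustration.

```python
import argparse


def str_to_bool(value):
    # Command-line arguments arrive as text, so 'True' must be converted explicitly
    return str(value).strip().lower() == 'true'


def parse_args(argv=None):
    # Positional arguments, in the same order as the example command (assumed)
    parser = argparse.ArgumentParser(description='Predict employee retention for new raw data.')
    parser.add_argument('data_location', help='CSV file with new observations')
    parser.add_argument('output_location', help='CSV file to write predictions to')
    parser.add_argument('model_location', help='pickled model file')
    parser.add_argument('train_mean_location', help='pickled training means')
    parser.add_argument('train_std_location', help='pickled training standard deviations')
    parser.add_argument('clean', type=str_to_bool, help='True to run data cleaning')
    parser.add_argument('augment', type=str_to_bool, help='True to run feature engineering')
    return parser.parse_args(argv)


# Parse the same arguments as the example command, passed as a list for demonstration
args = parse_args([
    'Files/unseen_raw_data.csv', 'Files/predictions.csv',
    'Save/rfc_emp_retention.pkl', 'Save/train_mean.pkl', 'Save/train_std.pkl',
    'True', 'True',
])
print(args.data_location, args.clean, args.augment)

# In a real script, something along these lines could follow (hypothetical, not run here):
#   model = EmployeeRetentionModel(args.model_location,
#                                  args.train_mean_location, args.train_std_location)
#   _, pred = model.predict_proba(pd.read_csv(args.data_location),
#                                 clean=args.clean, augment=args.augment)
#   pd.Series(pred).to_csv(args.output_location, index=False)
```

In the script itself, parse_args() would be called without a list so that it reads the real command-line arguments. The explicit string-to-boolean conversion matters because bool('False') is True in Python.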