# Employee Attrition: Basis to Create ML-Helper Lib

Analyzing a dataset from [HR Analytics](https://www.kaggle.com/lnvardanyan/hr-analytics) which contains employee information of a given company, we will hash out a library that can help us speed up and structure future machine learning projects.

### Objective

Given the following variables:

* satisfaction_level: The satisfaction of the employee
* last_evaluation: The score the employee received on their last evaluation
* number_project: The number of projects the employee has been involved in
* average_montly_hours: The average amount of hours the employee works each month
* time_spend_company: The amount of years the employee has worked there
* Work_accident: Boolean representing if the employee has been involved in an accident
* left: Our target variable, determines if the employee left the company or not
* promotion_last_5years: Boolean on whether the employee was promoted in the last 5 years or not
* sales: The name of the department the employee works in
* salary: The salary of the employee (can be low, medium or high)

We want to build a classification model that can determine which employee will likely leave the company in order to make the necessary changes to reduce employee attrition. We will use 80% of the data for training and the remaining 20% for validation of our modeling.

### Outline

We separate the project into 3 steps:

Data Loading and Exploratory Data Analysis: We load the data and analyze it to obtain an accurate picture of its features, its values (and whether any are incomplete or wrong) and its data types, among others. We also create different types of plots to help us understand the data and make model creation easier.

Feature Engineering / Modeling and Pipeline: Once we understand the data, we create some features and begin the modeling stage. Using different models (and ensembles) and a pipeline with different transformers, we aim to produce a model that meets our performance expectations, and then tune its hyperparameters on the training data.

Results and Conclusions: Finally, we use the tuned model to predict on the holdout set we separated initially, compare those predictions against the actual values to determine the performance of the model, and outline our conclusions.

### Helpers

As mentioned, this notebook contains many functions that help speed up the machine learning process and provide a formal structure to it. **These helpers are the basis for my package ML-Helper** and they can be used in your own projects by installing the package from [PyPI](https://pypi.org/project/ml-helper/) with `pip install ml-helper`.

If you wish to see a working example using these helpers through the package, please see my [kernel on time series regression](https://www.kaggle.com/akoury/bike-sharing-in-washington-d-c-using-ml-helper).

In [None]:
import timeit
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from tempfile import mkdtemp
from sklearn.base import clone
import matplotlib.pyplot as plt
from scipy.stats import variation
from sklearn.cluster import KMeans
from imblearn import FunctionSampler
from imblearn.combine import SMOTEENN
from sklearn.decomposition import PCA
from imblearn.pipeline import Pipeline
from vecstack import StackingTransformer
from scipy.stats import chi2_contingency
from sklearn.ensemble import IsolationForest
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE, SelectFromModel
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve, classification_report
from sklearn.metrics import accuracy_score as metric_scorer
from imblearn.under_sampling import RandomUnderSampler, RepeatedEditedNearestNeighbours
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, TimeSeriesSplit, StratifiedKFold
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer, PowerTransformer, OneHotEncoder, FunctionTransformer

warnings.filterwarnings('ignore')

### Setting Key Values

The following values are used throughout the code. This cell gives a central place where they can be managed.

In [None]:
MEMORY = mkdtemp()

KEYS = {
    'SEED': 1,
    'DATA_PATH' : '../input/turnover.csv',
    'TARGET': 'left',
    'METRIC': 'accuracy',
    'TIMESERIES': False,
    'SPLITS': 5,
    'ESTIMATORS': 150,
    'ITERATIONS': 500,
    'MEMORY': MEMORY
}

### Data Loading

Here we load the necessary data, print its first rows and describe its contents.

In [None]:
def read_data(input_path):
    return pd.read_csv(input_path)

data = read_data(KEYS['DATA_PATH'])

data.head()

In [None]:
data.describe()

### Data types

We review the data types for each column.

In [None]:
data.dtypes

### Missing Data

We check if there is any missing data.

In [None]:
def missing_data(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    
missing_data(data)
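
To illustrate what `missing_data` returns, here is a minimal sketch on a hypothetical toy frame (the columns `a` and `b` are made up for illustration and are not part of the HR dataset). The function is copied from the cell above.

```python
import numpy as np
import pandas as pd

def missing_data(df):
    # Number of nulls per column and their share of all rows
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

# Hypothetical toy frame: column 'a' has one missing value out of four rows
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': ['x', 'y', 'z', 'w']})
summary = missing_data(toy)
print(summary)
```

Column `a` should appear with a total of 1 and a share of 0.25, while `b` has no missing values.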

### Converting columns to their true categorical type
Now we convert the data types of numerical columns that are actually categorical.

In [None]:
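def convert_to_category(df, cols):
    for i in cols:
        df[i] = df[i].astype('category')
    return df

# Positions 5 to 7, which in the variable list above are Work_accident, left and promotion_last_5years
data = convert_to_category(data, data.columns[5:8])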

data.dtypes

### Defining Holdout Set for Validation

80% of the data will be used to train our model, while the remaining data will be used later on to validate the accuracy of our model.

In [None]:
# Fix the seed so the same holdout set is produced on every run
train_data, holdout = train_test_split(data, test_size=0.2, random_state=KEYS['SEED'])
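
Because the target is imbalanced, one option (not used in this notebook, shown only as an assumption-laden sketch) is to stratify the split so that both parts keep the same share of leavers. The toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 stayers (0) and 20 leavers (1)
toy = pd.DataFrame({'feature': np.arange(100), 'left': [0] * 80 + [1] * 20})

# stratify keeps the class proportions identical in both parts
train_toy, holdout_toy = train_test_split(
    toy, test_size=0.2, stratify=toy['left'], random_state=1
)

print(train_toy['left'].mean(), holdout_toy['left'].mean())
```

Both parts keep a 20% share of leavers, which makes the holdout score less sensitive to how the random split happened to fall.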

## Exploratory Data Analysis

Here we will perform all of the necessary data analysis, with different plots that will help us understand the data and therefore, create a better model.

Note that all of this analysis is performed only on the training data, so that information from the holdout set does not bias our modeling decisions.

We begin by plotting pairwise relationships between variables, as well as the distribution for each column in the diagonal.

In [None]:
pairplot = sns.pairplot(train_data, hue=KEYS['TARGET'], palette="husl")

### Boxplot of Numerical Variables

We review the distribution of scaled numerical data through a boxplot for each variable. The first 3 functions automatically obtain the numerical/categorical columns of our data. Since they are used throughout the notebook, they are defined here.

In [None]:
def types(df, types, exclude = None):
    types = df.select_dtypes(include=types)
    excluded = [KEYS['TARGET']]
    if exclude:
        for i in exclude:
            excluded.append(i)
    cols = [col for col in types.columns if col not in excluded]
    return df[cols]

def numericals(df, exclude = None):
    return types(df, [np.number], exclude)

def categoricals(df, exclude = None):
    return types(df, ['category', object], exclude)

def boxplot(df, exclude = []):
    plt.figure(figsize=(12,10))
    num = numericals(df, exclude)
    num = (num - num.mean())/num.std()
    ax = sns.boxplot(data=num, orient='h')
    
boxplot(train_data)

As we can see, there are only a few outliers in the time spent in company, so outlier treatment does not seem necessary.

### Coefficient of Variation

The coefficient of variation is a dimensionless measure of dispersion in data (the standard deviation divided by the mean); the lower the value, the less dispersion a feature has. We will flag columns whose coefficient of variation is below 0.05, since they carry little information and would probably contribute little to the model.

In [None]:
def coefficient_variation(df, threshold = 0.05, exclude = None):
    plt.figure(figsize=(8, 6))
    cols = numericals(df, exclude)
    variance = variation(cols)
    # Sort values and labels together so each bar keeps its own column name
    order = np.argsort(variance)[::-1]
    ax = sns.barplot(
        x=variance[order],
        y=cols.columns[order],
    )

    # A boolean mask keeps the column selection one-dimensional
    cols = cols.columns[variance < threshold].tolist()
    if len(cols) > 0:
        print(str(cols) + ' are invariant with a threshold of ' + str(threshold))
    else:
        print('No invariant columns')
    return cols

invariant = coefficient_variation(train_data, threshold = 0.05)
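
As a small worked example of the measure itself, the sketch below computes the coefficient of variation by hand and with `scipy.stats.variation` on two hypothetical columns (the values are made up for illustration).

```python
import numpy as np
from scipy.stats import variation

# Hypothetical toy columns: one widely spread, one nearly constant
spread = np.array([10.0, 20.0, 30.0, 40.0])
flat = np.array([100.0, 101.0, 99.0, 100.0])

# CV = standard deviation / mean (population std, which is scipy's default)
cv_spread = spread.std() / spread.mean()
cv_flat = flat.std() / flat.mean()

# scipy.stats.variation computes the same quantity column-wise
cv_scipy = variation(np.column_stack([spread, flat]))
print(cv_spread, cv_flat, cv_scipy)
```

With a threshold of 0.05, only the nearly constant column would be flagged as invariant.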

### Data Correlation

Now we analyze correlation in the data for both numerical and categorical columns and plot them, using a threshold of 70%.

For the numerical features we use Spearman correlation and for the categorical ones we use Cramér's V.

In [None]:
def correlated(df, threshold = 0.9):
    categoric = categorical_correlated(df, threshold)
    numeric = numerical_correlated(df, threshold)

    plt.figure(figsize=(12,10))
    sns.heatmap(categoric[1],cbar=True,fmt =' .2f', annot=True, cmap='viridis').set_title('Categorical Correlation', fontsize=30)

    plt.figure(figsize=(12,10))
    sns.heatmap(numeric[1],cbar=True,fmt =' .2f', annot=True, cmap='viridis').set_title('Numerical Correlation', fontsize=30)

    correlated_cols = categoric[0] + numeric[0]

    if(len(correlated_cols) > 0):
        print('The following columns are correlated with a threshold of ' + str(threshold) + ': ' + str(correlated_cols))

        if KEYS['TARGET'] in correlated_cols:
            print('The target variable is correlated, consider removing its correlated counterpart')
            correlated_cols.remove(KEYS['TARGET'])
    else:
        print('No correlated columns for the ' + str(threshold) + ' threshold')

    return correlated_cols

def numerical_correlated(df, threshold=0.9):
    corr_matrix = df.select_dtypes(include=[np.number]).corr(method='spearman').abs()
    # Keep only the upper triangle so each pair is checked once (np.bool was removed from NumPy)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    return [column for column in upper.columns if any(upper[column] > threshold)], corr_matrix

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1), (rcorr-1)))

def categorical_correlated(df, threshold=0.9):
    columns = df.select_dtypes(include=['object', 'category']).columns.tolist()
    # A float frame starts as all NaN and can be passed directly to the heatmap
    corr = pd.DataFrame(index=columns, columns=columns, dtype=float)
    for i in range(0, len(columns)):
        for j in range(i, len(columns)):
            if i == j:
                corr.loc[columns[j], columns[i]] = 1.0
            else:
                cell = cramers_v(df[columns[i]], df[columns[j]])
                # The measure is symmetric, so fill both halves of the matrix
                corr.loc[columns[j], columns[i]] = cell
                corr.loc[columns[i], columns[j]] = cell
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    return [column for column in upper.columns if any(abs(upper[column]) > threshold)], corr

correlated_cols = correlated(train_data, 0.7)
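
To get a feel for the range of the categorical measure, the sketch below applies the `cramers_v` function from the cell above to two hypothetical cases: a variable compared with itself, and two variables whose levels are combined evenly. The toy series are made up for illustration.

```python
from itertools import product

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Bias-corrected version, copied from the notebook cell above
    confusion_matrix = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

levels = ['a', 'b', 'c']

# Perfect association: the second variable is a copy of the first
x_same = pd.Series(levels * 10, name='x')
v_perfect = cramers_v(x_same, x_same.rename('y'))

# No association: every combination of levels appears equally often
pairs = pd.DataFrame(list(product(levels, levels)) * 3, columns=['x', 'y'])
v_independent = cramers_v(pairs['x'], pairs['y'])

print(v_perfect, v_independent)
```

Values close to 1 indicate that one categorical column almost determines the other, which is what the 0.7 threshold is meant to catch.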

### Underrepresented Features

Now we determine underrepresented features, meaning those where a single value accounts for more than 97% of the records.

In [None]:
def under_represented(df, threshold = 0.99):
    under_rep = []
    for column in df:
        counts = df[column].value_counts()
        majority_freq = counts.iloc[0]
        if (majority_freq / len(df)) > threshold:
            under_rep.append(column)

    if not under_rep:
        print('No underrepresented features')
    else:
        if KEYS['TARGET'] in under_rep:
            print('The target variable is underrepresented, consider rebalancing')
            under_rep.remove(KEYS['TARGET'])
        print(str(under_rep) + ' underrepresented')

    return under_rep

under_rep = under_represented(train_data, 0.97)
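
The check above boils down to the share of the most frequent value in each column. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    'promoted': [0] * 99 + [1],       # 99% of rows share one value
    'salary': ['low', 'high'] * 50,   # evenly split
})

# Share of the most frequent value per column
majority_share = {col: toy[col].value_counts(normalize=True).iloc[0] for col in toy}
flagged = [col for col, share in majority_share.items() if share > 0.97]
print(majority_share, flagged)
```

Only `promoted` crosses the 97% threshold here, so it is the only column that would be flagged.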

### Principal Component Analysis (PCA)

We plot PCA component variance to define the number of components we wish to consider in the pipeline.

In [None]:
def split_x_y(df):
    return df.loc[:, df.columns != KEYS['TARGET']], df.loc[:, KEYS['TARGET']]

def one_hot_encode(df, cols):
    for i in cols:
        dummies = pd.get_dummies(df[i], prefix=i, drop_first = True)
        df = pd.concat([df, dummies], axis = 1)
        df = df.drop(i, axis = 1)

    return df

def plot_pca_components(df, variance = 0.9, convert = False):
    X, y = split_x_y(df)

    if convert:
        X = one_hot_encode(X, categoricals(X))

    pca = PCA().fit(X)

    sns.set_style("whitegrid")
    plt.figure(figsize=(9, 7))
    plt.plot(np.cumsum(pca.explained_variance_ratio_))
    plt.xlabel('Number of Components')
    plt.ylabel('Variance (%)')
    plt.show()
    
plot_pca_components(train_data, convert = True)
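
Besides reading the number of components off the plot, it can be computed from the cumulative explained variance. The sketch below does this on hypothetical synthetic data with a 0.9 target. The data and the threshold are assumptions for illustration, not values taken from the HR dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical data: 3 informative directions plus 3 low-variance noise columns
latent = rng.normal(size=(200, 3)) * [5.0, 3.0, 1.0]
X_toy = np.hstack([latent, rng.normal(scale=0.1, size=(200, 3))])

pca = PCA().fit(X_toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# First position where the cumulative variance reaches 0.9, converted to a count
n_components = int(np.searchsorted(cumulative, 0.9) + 1)
print(cumulative.round(3), n_components)
```

scikit-learn can also do this selection itself when `n_components` is given as a fraction, for example `PCA(n_components=0.9)`.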

### Feature Importance

Here we plot feature importance using a random forest in order to get a sense of which features have the most importance.

In [None]:
def feature_importance(df, model, convert = False):
    X, y = split_x_y(df)

    if convert:
        X = one_hot_encode(X, categoricals(X))
    model.fit(X, y)
    importances = model.feature_importances_
    std = np.std([tree.feature_importances_ for tree in model.estimators_],axis=0)
    indices = np.argsort(importances)

    print("Feature ranking:")
    plt.figure(figsize=(16, 14))
    plt.title("Feature importances")
    plt.barh(range(X.shape[1]), importances[indices],color="r", xerr=std[indices], align="center")
    plt.yticks(range(X.shape[1]), [list(X)[i] for i in indices])
    plt.ylim([-1, X.shape[1]])
    plt.show()

feature_importance(train_data, RandomForestClassifier(n_estimators=KEYS['ESTIMATORS'], random_state = KEYS['SEED']), convert = True)

### Check target variable balance
We review the distribution of values in the target variable.

In [None]:
def target_distribution(df):
    plt.figure(figsize=(8,7))
    target_count = (df[KEYS['TARGET']].value_counts()/len(df))*100
    target_count.plot(kind='bar', title='Target Distribution (%)')

target_distribution(train_data)

Since 0 marks employees that stay and 1 marks employees that leave, rebalancing is worth trying because there is a large difference in the number of records for each class.

## Feature Engineering / Pipeline / Modeling

A number of different combinations of feature engineering steps and transformations will be performed in a pipeline with different models. Each combination will be cross-validated to review the performance of the model.

A feature called 'avg_time_per_project' is added to determine the average time each employee spends on a project.

In [None]:
def avg_time_pp(df):
    df = df.copy()
    df['avg_time_per_project'] = (df['average_montly_hours'] * 12 * df['time_spend_company'])/ df['number_project']
    df['avg_time_per_project'] = df['avg_time_per_project'].replace([np.inf, -np.inf], np.nan)
    df['avg_time_per_project'] = df['avg_time_per_project'].fillna(0)
    
    return df
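
As a quick check of the new feature, the sketch below runs `avg_time_pp` (copied from the cell above) on two hypothetical employees, one of them with zero projects to exercise the division-by-zero handling.

```python
import numpy as np
import pandas as pd

def avg_time_pp(df):
    df = df.copy()
    # Monthly hours * 12 months * years in the company, spread over the projects
    df['avg_time_per_project'] = (df['average_montly_hours'] * 12 * df['time_spend_company']) / df['number_project']
    df['avg_time_per_project'] = df['avg_time_per_project'].replace([np.inf, -np.inf], np.nan)
    df['avg_time_per_project'] = df['avg_time_per_project'].fillna(0)
    return df

# Hypothetical toy rows; the second employee has no projects
toy = pd.DataFrame({
    'average_montly_hours': [160, 200],
    'time_spend_company': [3, 2],
    'number_project': [4, 0],
})
result = avg_time_pp(toy)
print(result['avg_time_per_project'].tolist())
```

The first employee gets 160 * 12 * 3 / 4 = 1440 hours per project, and the division by zero for the second one is replaced by 0. The input frame is left untouched because the function works on a copy.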

We also drop some features, like the invariant, correlated and underrepresented ones.

In [None]:
def drop_features(df, cols):
    return df[df.columns.difference(cols)]

Here we standardize the numerical features and correct their skewness so that scale does not affect the modeling.

In [None]:
num_pipeline = Pipeline([ 
    ('power_transformer', PowerTransformer(method='yeo-johnson', standardize = True)),
])
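
To see what this step does, the sketch below applies the same transformer to a hypothetical right-skewed sample (drawn from an exponential distribution, an assumption made only for illustration) and compares the skewness before and after.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)

# Hypothetical right-skewed feature
skewed = rng.exponential(scale=2.0, size=(1000, 1))

transformer = PowerTransformer(method='yeo-johnson', standardize=True)
transformed = transformer.fit_transform(skewed)

skew_before = skew(skewed[:, 0])
skew_after = skew(transformed[:, 0])
print(skew_before, skew_after, transformed.mean(), transformed.std())
```

The transformed column should be much closer to symmetric, with zero mean and unit variance thanks to `standardize=True`.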

Now we one hot encode categorical features.

In [None]:
categorical_pipeline = Pipeline([
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
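
The `handle_unknown='ignore'` option matters during cross-validation, when a fold may contain a level that was not seen while fitting. A minimal sketch, where `very_high` is a hypothetical level that does not exist in the dataset:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Levels seen while fitting
train_salary = np.array([['low'], ['medium'], ['high']])
encoder = OneHotEncoder(handle_unknown='ignore').fit(train_salary)

# 'very_high' is a hypothetical level that was not seen during fit
encoded = encoder.transform(np.array([['low'], ['very_high']])).toarray()
print(encoder.categories_[0], encoded)
```

The unseen level is encoded as a row of zeros instead of raising an error. The sketch keeps the encoder's default sparse output and converts it with `toarray()`, which avoids depending on the name of the dense-output argument.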

Finally, we put them all together and try this pipeline with 3 different models: a simple logistic regression, a random forest and an extra-trees classifier.

In [None]:
pipe = Pipeline([
    ('avg_time_pp', FunctionTransformer(avg_time_pp, validate=False)),
    ('drop_features', FunctionTransformer(drop_features, kw_args={'cols': correlated_cols + under_rep}, validate=False)),
    ('column_transformer', ColumnTransformer([
        ('numerical_pipeline', num_pipeline, numericals(data, [KEYS['TARGET']]).columns),
        ('categorical_pipeline', categorical_pipeline, ['sales', 'salary']),
    ], remainder='passthrough')),
])

models = [
    {'name':'logistic_regression', 'model': LogisticRegression(solver = 'lbfgs', max_iter = KEYS['ITERATIONS'], random_state = KEYS['SEED'])},
    {'name':'random_forest', 'model': RandomForestClassifier(n_estimators = KEYS['ESTIMATORS'], random_state = KEYS['SEED'])},
    {'name': 'extra_tree', 'model': ExtraTreesClassifier(random_state = KEYS['SEED'])}
]

## Scores

Here you can see the scores of the different models across the entire cross-validation process for each pipeline. In certain cases errors can happen (for example, when a certain fold contains a sparse matrix), in which case the error is recorded in the Note column of the scores.

In [None]:
def pipeline(df, models, pipe, all_scores = pd.DataFrame(), splits = None, note = ''):
    if splits is None:
        splits = KEYS['SPLITS']

    for model in models:
        if len(all_scores) == 0 or len(all_scores[(all_scores['Model'] == model['name']) & (all_scores['Steps'] == ', '.join(pipe_steps(pipe)))]) == 0:
            try:
                start = timeit.default_timer()

                scores, cv_model = cross_val(df.copy(), model, pipe = pipe, splits = splits)

            except Exception as error:
                cv_model = pipe
                note = 'Error: ' + str(error)
                print(note)
                scores = np.array([0])

            all_scores = score(model['name'], scores, timeit.default_timer(), start, cv_model, note, all_scores)

        else:
            print(str(model['name']) + ' already trained on those parameters, ignoring')

    show_scores(all_scores)

    return all_scores

def cross_val(df, model, splits = None, pipe = None, grid = None):
    if splits is None:
        splits = KEYS['SPLITS']

    X, y = split_x_y(df)

    if KEYS['TIMESERIES']:
        folds = TimeSeriesSplit(n_splits = splits)
    else:
        folds = StratifiedKFold(n_splits = splits, shuffle = True, random_state=KEYS['SEED'])

    if pipe:
        pipe_cv = clone(pipe)
        pipe_cv.steps.append((model['name'], model['model']))
        model = pipe_cv

    if grid:
        model = RandomizedSearchCV(model, grid, scoring = KEYS['METRIC'], cv = folds, n_iter = 10, refit=True, return_train_score = False, error_score=0.0, n_jobs = -1, random_state = KEYS['SEED'])
        model.fit(X, y)
        scores = model.cv_results_['mean_test_score']
    else:
        scores = cross_val_score(model, X, y, scoring = KEYS['METRIC'], cv = folds, n_jobs = -1)

    return scores, model

def pipe_steps(pipe):
    return flatten([x[0] if not isinstance(x[1], ColumnTransformer) else [list(i[1].named_steps.keys()) for ind,i in enumerate(x[1].transformers)] for x in pipe.steps])

def flatten(pipe):
    flat = []
    for i in pipe:
        if isinstance(i,list): flat.extend(flatten(i))
        else: flat.append(i)
    return flat

def score(model, scores, stop, start, pipe, note = '', all_scores = pd.DataFrame()):
    if len(all_scores) == 0:
        all_scores  = pd.DataFrame(columns = ['Model', 'Mean', 'CV Score', 'Time', 'Cumulative', 'Pipe', 'Steps', 'Note'])

    if len(scores[scores > 0]) == 0:
        # Keep an existing error note instead of overwriting it
        note = note or 'Warning: No positive scores'
        mean = 0
        std = 0
    else:
        mean = np.mean(scores[scores > 0])
        std = np.std(scores[scores > 0])

    cumulative = stop - start
    if len(all_scores[all_scores['Model'] == model]) > 0:
        cumulative += all_scores[all_scores['Model'] == model].tail(1)['Cumulative'].values[0]

    row = {'Model': model, 'Mean': mean, 'CV Score': '{:.3f} +/- {:.3f}'.format(mean, std), 'Time': stop - start, 'Cumulative': cumulative, 'Pipe': pipe, 'Steps': ', '.join(pipe_steps(pipe)[:-1]), 'Note': note}
    # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
    return pd.concat([all_scores, pd.DataFrame([row])], ignore_index=True)

def show_scores(all_scores, top = False):
    # None removes the column width limit (-1 is deprecated in recent pandas versions)
    pd.set_option('display.max_colwidth', None)

    if top:
        a_s = all_scores.sort_values(['Mean'], ascending = False).groupby('Model').first()
        display(a_s.loc[:, ~a_s.columns.isin(['Mean', 'Pipe', 'Cumulative'])])
    else:
        display(all_scores.loc[:, ~all_scores.columns.isin(['Mean', 'Pipe', 'Cumulative'])])
            
all_scores = pipeline(train_data, models, pipe)
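
Stripped of the bookkeeping, the core of `cross_val` is a stratified k-fold evaluation of a pipeline that ends in a model. The sketch below shows that core on hypothetical synthetic data. It uses scikit-learn's own `Pipeline` and a `StandardScaler` as stand-ins, which is an assumption made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced data standing in for the HR dataset
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, weights=[0.75, 0.25], random_state=1
)

model = Pipeline([
    ('scaler', StandardScaler()),
    ('logistic_regression', LogisticRegression(max_iter=500)),
])

# Stratified folds keep the class ratio similar in every split
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X_toy, y_toy, scoring='accuracy', cv=folds)
print('{:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))
```

Because the transformers live inside the pipeline, they are refit on the training part of every fold, so no information from the validation part leaks into the preprocessing.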

### Binning and Polynomials
Now we try adding binning and polynomial features to our pipeline and see how it performs.

In [None]:
num_pipeline = Pipeline([ 
    ('power_transformer', PowerTransformer(method='yeo-johnson', standardize = True)),
    ('binning', KBinsDiscretizer(n_bins = 5, encode = 'onehot-dense')),
    ('polynomial', PolynomialFeatures(degree = 2, include_bias = False)),
])

categorical_pipeline = Pipeline([
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

pipe = Pipeline([
    ('avg_time_pp', FunctionTransformer(avg_time_pp, validate=False)),
    ('drop_features', FunctionTransformer(drop_features, kw_args={'cols': correlated_cols + under_rep}, validate=False)),
    ('column_transformer', ColumnTransformer([
        ('numerical_pipeline', num_pipeline, numericals(data, [KEYS['TARGET']]).columns),
        ('categorical_pipeline', categorical_pipeline, ['sales', 'salary']),
    ], remainder='passthrough'))
])

all_scores = pipeline(train_data, models, pipe, all_scores)

### SMOTEENN
To address the class imbalance present in the data, we add a combination of over- and under-sampling techniques, SMOTE and Edited Nearest Neighbours (SMOTEENN), to our pipeline and see how it performs.

In [None]:
num_pipeline = Pipeline([ 
    ('power_transformer', PowerTransformer(method='yeo-johnson', standardize = True))
])

categorical_pipeline = Pipeline([
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

pipe = Pipeline([
    ('avg_time_pp', FunctionTransformer(avg_time_pp, validate=False)),
    ('drop_features', FunctionTransformer(drop_features, kw_args={'cols': correlated_cols + under_rep}, validate=False)),
    ('column_transformer', ColumnTransformer([
        ('numerical_pipeline', num_pipeline, numericals(data, [KEYS['TARGET']]).columns),
        ('categorical_pipeline', categorical_pipeline, ['sales', 'salary']),
    ], remainder='passthrough')),
    ('combined_sampler', SMOTEENN(random_state = KEYS['SEED'])),
])

all_scores = pipeline(train_data, models, pipe, all_scores)

### PCA
We add Principal Component Analysis with 6 components as the last step of the pipeline and see how it performs.

In [None]:
num_pipeline = Pipeline([ 
    ('power_transformer', PowerTransformer(method='yeo-johnson', standardize = True))
])

categorical_pipeline = Pipeline([
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

pipe = Pipeline([
    ('avg_time_pp', FunctionTransformer(avg_time_pp, validate=False)),
    ('drop_features', FunctionTransformer(drop_features, kw_args={'cols': correlated_cols + under_rep}, validate=False)),
    ('column_transformer', ColumnTransformer([
        ('numerical_pipeline', num_pipeline, numericals(data, [KEYS['TARGET']]).columns),
        ('categorical_pipeline', categorical_pipeline, ['sales', 'salary']),
    ], remainder='passthrough')),
    ('pca', PCA(n_components = 6))
])

all_scores = pipeline(train_data, models, pipe, all_scores)

## Note:
Consider that in practice you would have a single pipeline cell with different steps that you comment/uncomment to see how each transformer affects the modeling. Since Kaggle must run the entire notebook when committing it, these changes are made separately in different cells.

### Pipeline Performance by Model
Here we can see the performance of each model in the different pipelines we created.

In [None]:
def plot_models(all_scores):
    sns.set_style("whitegrid")
    plt.figure(figsize=(16, 8))
    ax = sns.lineplot(x="Cumulative", y="Mean", hue="Model", style="Model", markers=True, dashes=False, data=all_scores)
    label = str(KEYS['METRIC']) + ' Score'
    ax.set(ylabel=label, xlabel='Time')
    
plot_models(all_scores)

### Top Pipelines per Model

Here we show the top pipelines per model.

In [None]:
show_scores(all_scores, top = True)

## Randomized Grid Search

Once we have a list of models, we perform a cross validated, randomized grid search on the best performing one to define the final model.

In [None]:
def top_pipeline(all_scores, index = 0):
    return all_scores.sort_values(by=['Mean'], ascending = False).iloc[index]['Pipe']
    
grid = {
    'random_forest__criterion': ['gini', 'entropy'],
    'random_forest__min_samples_leaf': [10, 20],
    'random_forest__min_samples_split': [5, 8],
    'random_forest__max_leaf_nodes': [30, 60],
}

final_scores, grid_pipe = cross_val(train_data, model = clone(top_pipeline(all_scores)), grid = grid)
final_scores
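
The keys of the grid follow scikit-learn's `<step name>__<parameter>` convention, which is why they start with `random_forest__`. A minimal sketch of the same idea on hypothetical synthetic data, with a reduced grid and a `StandardScaler` standing in for the real transformers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the training set
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=1)

toy_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('random_forest', RandomForestClassifier(n_estimators=20, random_state=1)),
])

# Each key is '<step name>__<parameter of that step>'
toy_grid = {
    'random_forest__criterion': ['gini', 'entropy'],
    'random_forest__min_samples_leaf': [10, 20],
}

search = RandomizedSearchCV(
    toy_pipe, toy_grid, n_iter=4, cv=3, scoring='accuracy', random_state=1
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

If the best pipeline ends in a different model, the prefix of the grid keys has to change to that step's name, otherwise the search raises an error about invalid parameters.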

### Best Parameters for the Model

In [None]:
print(grid_pipe.best_params_)
final_pipe = grid_pipe.best_estimator_

# Results
We evaluate the final model with the holdout, obtaining the definitive score of the model.

In [None]:
def predict(df, holdout, pipe):
    X_train, y_train = split_x_y(df)
    pipe.fit(X_train, y_train)

    X, y = split_x_y(holdout)

    return y, pipe.predict(X)

y, predictions = predict(train_data, holdout, final_pipe)
# Use a separate name so the score() helper defined earlier is not overwritten
holdout_score = metric_scorer(y, predictions)
holdout_score

## Receiver Operating Characteristic (ROC) / Area Under the Curve
Accuracy alone is not enough to review the performance of the model, so we plot the ROC curve of the model on the holdout data and print a classification report.

In [None]:
def plot_roc(fpr, tpr, roc_auc):
    plt.figure(figsize=(12, 6))
    plt.plot(fpr, tpr, label='ROC curve (AUC = {:.3f})'.format(roc_auc))
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.05])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve')
    plt.legend(loc='lower right')
    plt.show()

def roc(df, model, predictions):
    X, y = split_x_y(df)
    # Score the positive-class probabilities; hard labels would reduce the curve to a single threshold
    probabilities = model.predict_proba(X)[:, 1]
    roc_auc = roc_auc_score(y, probabilities)
    fpr, tpr, thresholds = roc_curve(y, probabilities)
    plot_roc(fpr, tpr, roc_auc)
    print(classification_report(y, predictions))
        
roc(holdout, final_pipe, predictions)
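
The area under the curve depends on whether it is computed from predicted probabilities or from hard labels. The sketch below compares both on a hypothetical set of 8 predictions (the values are made up for illustration).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true classes and predicted probabilities of leaving
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

# Hard labels at the usual 0.5 cut-off
labels = (proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, proba)    # uses the full ranking: 15 of 16 pairs ordered correctly
auc_labels = roc_auc_score(y_true, labels)  # only one threshold is evaluated
print(auc_proba, auc_labels)
```

The probabilities give 0.9375, while the hard labels give 0.75, because thresholding throws away the ranking information that the ROC curve is meant to summarize.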

## Stacked Model
Finally, we create a stacked model using the top 2 models obtained during the modeling phase and obtain the holdout results.

In [None]:
def stack_predict(df, holdout, pipes, amount = 2):
    X, y = split_x_y(df)
    X_test, y_test = split_x_y(holdout)

    pipe = Pipeline(top_pipeline(pipes).steps[:-1])
    X = pipe.fit_transform(X)
    X_test = pipe.transform(X_test)

    estimators = []

    for i in range(amount):
        estimators.append((str(i), top_pipeline(pipes, i).steps[-1][1]))

    regression = False

    if KEYS['METRIC'] in ['explained_variance', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'r2'] :
        regression = True

    stack = StackingTransformer(estimators, regression)
    stack.fit(X, y)

    S_train = stack.transform(X)
    S_test = stack.transform(X_test)

    final_estimator = estimators[0][1]
    final_estimator.fit(S_train, y)

    return final_estimator, y_test, final_estimator.predict(S_test)

stacked, y_stacked, predictions_stacked = stack_predict(train_data, holdout, all_scores, amount = 2)
score_stacked = metric_scorer(y_stacked, predictions_stacked)
score_stacked
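
For comparison, the same stacking idea can be sketched with scikit-learn's built-in `StackingClassifier` instead of `vecstack`. This is a swapped-in technique on hypothetical synthetic data, not the method used in `stack_predict`, and the choice of a logistic regression as final estimator is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the transformed training and holdout sets
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

stack = StackingClassifier(
    estimators=[
        ('random_forest', RandomForestClassifier(n_estimators=20, random_state=1)),
        ('extra_tree', ExtraTreesClassifier(n_estimators=20, random_state=1)),
    ],
    # The final estimator learns from the out-of-fold predictions of the base models
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
stacked_accuracy = stack.score(X_te, y_te)
print(stacked_accuracy)
```

In both approaches the second-level model is trained on out-of-fold predictions of the first-level models, which keeps it from simply memorizing their training-set outputs.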

In [None]:
print(classification_report(y_stacked, predictions_stacked))
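
To make the columns of the classification report concrete, the sketch below computes precision, recall and f1-score for the positive class on 10 hypothetical predictions (the labels are made up for illustration).

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 employees actually left (1), 6 stayed (0)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# The model flags 3 employees: 2 correctly and 1 incorrectly
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

precision = precision_score(y_true, y_pred)  # 2 of the 3 flagged employees really left
recall = recall_score(y_true, y_pred)        # 2 of the 4 leavers were found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(precision, recall, f1)
```

Here precision is 2/3, recall is 1/2 and the f1-score is 4/7, which shows how a model can look reasonable on one measure while missing half of the employees who leave.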

# Conclusions
The classification report obtained from our stacked model shows its precision (of the employees predicted to belong to a class, how many actually do), its recall (of the employees that actually belong to a class, how many the model identifies) and its f1-score (the harmonic mean of both). The weighted average for all of them is near perfect, which means that the model can classify which employees will leave the company with great efficacy.

As it was seen in the feature importance step, the most important features in determining employee attrition are their satisfaction level, the number of projects they had, the time spent in the company, their average monthly hours and the score on their last evaluation.

This information is extremely useful to the company and can be used to help them retain their talent and reduce financial losses, first by knowing that these are the factors that they must pay the most attention to, and second, because for each employee, they can obtain an accurate estimation on whether they will leave or not and take the necessary measures to prevent it.

Finally, from this analysis we obtained a set of functions and steps that can greatly speed up and structure our future machine learning endeavors. We hope that you find them useful.