This notebook documents stratified 10-fold cross-validation for three models: logistic regression (linear), SVC with an RBF kernel (non-linear), and random forest (ensemble). The cross-validation code is sourced from https://towardsdatascience.com/the-right-way-of-using-smote-with-cross-validation-92a8d09d00c7.

In [1]:
# Import dependencies.
import pandas as pd
from numpy import mean, std

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from project_pipeline import preprocess, perf_metrics

In [2]:
# Read in 'cleaned_mode.csv' data.
df = pd.read_csv('../resources/cleaned_mode.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   city                    19158 non-null  object 
 1   city_development_index  19158 non-null  float64
 2   gender                  19158 non-null  object 
 3   relevent_experience     19158 non-null  int64  
 4   enrolled_university     19158 non-null  object 
 5   education_level         19158 non-null  object 
 6   major_discipline        19158 non-null  object 
 7   experience              19158 non-null  object 
 8   company_size            19158 non-null  object 
 9   company_type            19158 non-null  object 
 10  last_new_job            19158 non-null  object 
 11  training_hours          19158 non-null  int64  
 12  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(9)
memory usage: 1.9+ MB


## Model performance based on random split

In [3]:
# Split data into train and test.
X_train, X_test, y_train, y_test = preprocess(df)

### Logistic Regression

In [4]:
# Implement a logistic regression model.
lr_model = LogisticRegression(solver='lbfgs', random_state=42)
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)

# Print the ROC AUC score and the classification report (which includes accuracy).
perf_metrics(y_test, y_pred)

ROC AUC: 0.74
Classification report:
              precision    recall  f1-score   support

        stay       0.90      0.74      0.81      3596
       leave       0.49      0.74      0.59      1194

    accuracy                           0.74      4790
   macro avg       0.69      0.74      0.70      4790
weighted avg       0.79      0.74      0.76      4790
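
The `perf_metrics` helper is imported from `project_pipeline` and its source is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, assuming it wraps scikit-learn's `roc_auc_score` and `classification_report`. The function name `perf_metrics_sketch`, its `target_names` parameter, and the toy labels are hypothetical and are not taken from the project code.

```python
from sklearn.metrics import classification_report, roc_auc_score


def perf_metrics_sketch(y_true, y_pred, target_names=('stay', 'leave')):
    """Hypothetical stand-in for `perf_metrics`; not the project's actual code."""
    # The helper receives hard 0/1 predictions, so the AUC is label-based.
    auc = roc_auc_score(y_true, y_pred)
    report = classification_report(y_true, y_pred,
                                   target_names=list(target_names))
    print(f'ROC AUC: {auc:.2f}')
    print('Classification report:')
    print(report)
    return auc, report


# Toy labels, purely for illustration.
auc, report = perf_metrics_sketch([0, 0, 0, 1, 1, 0], [0, 1, 0, 1, 0, 0])
```

Returning the values as well as printing them is a design choice of this sketch; it makes the helper easier to reuse when comparing models.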



### SVC

In [5]:
# Implement an SVC model with an RBF kernel.
svc_model = SVC(kernel='rbf', random_state=42)
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)

# Print the ROC AUC score and the classification report (which includes accuracy).
perf_metrics(y_test, y_pred)

ROC AUC: 0.74
Classification report:
              precision    recall  f1-score   support

        stay       0.89      0.74      0.81      3596
       leave       0.48      0.74      0.59      1194

    accuracy                           0.74      4790
   macro avg       0.69      0.74      0.70      4790
weighted avg       0.79      0.74      0.75      4790



### Random Forest

In [6]:
# Implement a random forest model with 500 trees.
rf_model = RandomForestClassifier(n_estimators=500, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

# Print the ROC AUC score and the classification report (which includes accuracy).
perf_metrics(y_test, y_pred)

ROC AUC: 0.66
Classification report:
              precision    recall  f1-score   support

        stay       0.83      0.84      0.83      3596
       leave       0.50      0.47      0.48      1194

    accuracy                           0.75      4790
   macro avg       0.66      0.66      0.66      4790
weighted avg       0.74      0.75      0.75      4790



## Model performance based on cross-validation

In [7]:
# Use `get_dummies` to encode all categorical features.
df = pd.get_dummies(df)
y = df.target
X = df.drop(columns='target')
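
As a small illustration of the encoding step above, `pd.get_dummies` one-hot encodes the object-typed columns and leaves numeric columns unchanged. The toy frame below borrows a few column names from the dataset, but its values are made up for the example.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the real data (values are made up).
toy = pd.DataFrame({
    'city_development_index': [0.92, 0.62, 0.78],
    'gender': ['Male', 'Female', 'Male'],
    'training_hours': [36, 47, 8],
    'target': [1.0, 0.0, 0.0],
})

# Object columns become one indicator column per category;
# numeric columns (including the target) pass through untouched.
encoded = pd.get_dummies(toy)
y_toy = encoded['target']
X_toy = encoded.drop(columns='target')
print(X_toy.columns.tolist())
```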

In [8]:
# Create all models.
models = {
    'lr': LogisticRegression(solver='lbfgs', random_state=42),
    'svc': SVC(kernel='rbf', random_state=42),
    'rf': RandomForestClassifier(n_estimators=500, random_state=42)
}

In [9]:
for name, model in models.items():
    # Prepare the cross-validation procedure.
    cv = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)

    # Create a pipeline that includes oversampler, scaler, and model.
    clf = imbpipeline(steps=[('oversampler', RandomOverSampler(random_state=42)),
                             ('scaler', StandardScaler()),
                             ('classifier', model)])

    # Evaluate model.
    roc_auc_scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    recall_scores = cross_val_score(clf, X, y, scoring='recall', cv=cv, n_jobs=-1)

    # Report performance.
    print(name)
    print('---')
    print(f'roc_auc: {mean(roc_auc_scores):.2f} ({std(roc_auc_scores):.2f})')
    print(f'recall: {mean(recall_scores):.2f} ({std(recall_scores):.2f})')
    print('---')

lr
---
roc_auc: 0.78 (0.02)
recall: 0.73 (0.03)
---
svc
---
roc_auc: 0.77 (0.01)
recall: 0.72 (0.02)
---
rf
---
roc_auc: 0.74 (0.01)
recall: 0.47 (0.02)
---
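
The loop above calls `cross_val_score` once per metric, so every model is fitted twice on each fold. One possible alternative, sketched here, is scikit-learn's `cross_validate`, which accepts several scorers and fits each fold once. To keep the example small and self-contained, it uses synthetic data from `make_classification`, five folds, and a plain scikit-learn `Pipeline` without the oversampler; these are assumptions of the sketch, not the notebook's setup. The notebook's `clf` pipeline could be passed in the same way.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.75, 0.25], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('classifier', LogisticRegression(random_state=42))])

# A single call evaluates both metrics on the same fitted folds.
results = cross_validate(pipe, X_demo, y_demo,
                         scoring=['roc_auc', 'recall'], cv=cv)

for metric in ('roc_auc', 'recall'):
    scores = results[f'test_{metric}']
    print(f'{metric}: {mean(scores):.2f} ({std(scores):.2f})')
```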


Cross-validation demonstrates the following key points:
- `Logistic regression results in an average ROC AUC score of about 0.78`, slightly above the 0.74 achieved with a single random split. Since it achieves the highest ROC AUC and recall scores of the three models investigated, it is used for further study of feature importance.
- `SVC with an RBF kernel results in an average ROC AUC score of about 0.77`, also slightly above the 0.74 achieved with a single random split.
- `Random forest classifier results in an average ROC AUC score of about 0.74`, which is 0.08 higher than the 0.66 achieved with a single random split. However, its average recall (0.47) is considerably lower than that of the other two models (0.73 and 0.72).
- `Since predicting which individuals will leave their current employment is an important objective of the analysis, the random forest classifier is not recommended for this classification task.`
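
One caveat when comparing the two sets of ROC AUC figures: the single-split numbers come from `perf_metrics(y_test, y_pred)`, which is given hard class labels, whereas `scoring='roc_auc'` in `cross_val_score` ranks samples by continuous scores. An AUC computed from 0/1 labels equals balanced accuracy (the mean of the per-class recalls), which matches the single-split outputs above, where ROC AUC equals the macro-average recall for every model. Part of the gap between the two evaluations may therefore reflect the metric's input and not only the evaluation scheme. The sketch below illustrates the difference on synthetic data; the dataset and variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo, random_state=42)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X_tr, y_tr)
labels = model.predict(X_te)              # hard 0/1 predictions
scores = model.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)
bal_acc = balanced_accuracy_score(y_te, labels)

print(f'AUC from labels:   {auc_from_labels:.3f}')
print(f'Balanced accuracy: {bal_acc:.3f}')
print(f'AUC from scores:   {auc_from_scores:.3f}')
```

For a like-for-like comparison with the cross-validated `roc_auc`, the single-split AUC would need to be computed from `predict_proba` or `decision_function` outputs.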