This notebook applies the preprocessing steps from the preliminary data exploration to investigate how different ways of handling missing values affect the performance of a logistic regression classifier:
- dropping all null values of the original dataset
- imputing null values using Datawig, with a training set that contains missing values
- imputing null values using Datawig, with a training set that contains no missing values

Preprocessing and model implementation are written as functions in the `project_pipeline` module.

In [1]:
import pandas as pd
from project_pipeline import clean_data, preprocess, lr_model

## Drop all null values

In [2]:
# Read in the original data.
df = pd.read_csv('../resources/train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [3]:
# Drop `enrollee_id` as before, then drop every row that contains a null value.
df.drop(columns='enrollee_id', inplace=True)
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8955 entries, 1 to 19155
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   city                    8955 non-null   object 
 1   city_development_index  8955 non-null   float64
 2   gender                  8955 non-null   object 
 3   relevent_experience     8955 non-null   object 
 4   enrolled_university     8955 non-null   object 
 5   education_level         8955 non-null   object 
 6   major_discipline        8955 non-null   object 
 7   experience              8955 non-null   object 
 8   company_size            8955 non-null   object 
 9   company_type            8955 non-null   object 
 10  last_new_job            8955 non-null   object 
 11  training_hours          8955 non-null   int64  
 12  target                  8955 non-null   float64
dtypes: float64(2), int64(1), object(10)
memory usage: 979.5+ KB


In [4]:
X_train, X_test, y_train, y_test = preprocess(clean_data(df))
lr_model(X_train, X_test, y_train, y_test)

ROC AUC: 0.73
Classification report:
              precision    recall  f1-score   support

        stay       0.92      0.84      0.88      1868
       leave       0.43      0.63      0.51       371

    accuracy                           0.80      2239
   macro avg       0.68      0.73      0.69      2239
weighted avg       0.84      0.80      0.82      2239



Relative to the earlier mode-imputation results, the accuracy on the test set increases from 0.74 to 0.80, but the recall for the `leave` class (individuals leaving their current employment) drops from 0.74 to 0.63. Since high recall for this class is an important objective of the analysis, dropping all null values is not considered for the rest of the study.
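One factor worth checking when reading these numbers is class balance. The following is a minimal sketch, not part of the original pipeline, that uses only counts copied from the `df.info()` and classification-report outputs in this notebook. It shows how much data `dropna` discards and what accuracy a trivial model that always predicts `stay` would reach on each test set. The variable names are illustrative assumptions.

```python
# Counts copied from the df.info() and classification-report outputs in this notebook.
rows_before, rows_after = 19158, 8955

# Test-set support per class as (stay, leave).
support = {
    "dropna": (1868, 371),
    "datawig": (3399, 1105),
}

# Fraction of the original rows that survive dropping every row with a null.
retained = rows_after / rows_before
print(f"rows kept after dropna: {retained:.1%}")

majority_accuracy = {}
for name, (stay, leave) in support.items():
    total = stay + leave
    # Accuracy of a trivial model that always predicts the majority class "stay".
    majority_accuracy[name] = stay / total
    print(f"{name}: leave share = {leave / total:.1%}, "
          f"always-'stay' accuracy = {stay / total:.1%}")
```

With these counts, the `leave` class makes up a smaller share of the test set after dropping nulls than in the Datawig runs, so part of the accuracy gain may simply reflect the changed class balance rather than a better model. This supports judging the approaches by `leave` recall rather than by accuracy alone.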

## Datawig imputation - training set with missing values

In [5]:
# Read in the Datawig-imputed data.
df = pd.read_csv('../resources/imputed.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18014 entries, 0 to 18013
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             18014 non-null  int64  
 1   city                    18014 non-null  object 
 2   city_development_index  18014 non-null  float64
 3   gender                  18014 non-null  object 
 4   relevent_experience     18014 non-null  object 
 5   enrolled_university     18014 non-null  object 
 6   education_level         18014 non-null  object 
 7   major_discipline        18014 non-null  object 
 8   experience              18014 non-null  object 
 9   company_size            18014 non-null  object 
 10  company_type            18014 non-null  object 
 11  last_new_job            18014 non-null  object 
 12  training_hours          18014 non-null  int64  
 13  target                  18014 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [6]:
X_train, X_test, y_train, y_test = preprocess(clean_data(df))
lr_model(X_train, X_test, y_train, y_test)

ROC AUC: 0.72
Classification report:
              precision    recall  f1-score   support

        stay       0.88      0.76      0.81      3399
       leave       0.48      0.69      0.56      1105

    accuracy                           0.74      4504
   macro avg       0.68      0.72      0.69      4504
weighted avg       0.78      0.74      0.75      4504



## Datawig imputation - training set without missing values

In [7]:
# Read in the Datawig-imputed data.
df = pd.read_csv('../resources/imputed_loop.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18014 entries, 0 to 18013
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             18014 non-null  int64  
 1   city                    18014 non-null  object 
 2   city_development_index  18014 non-null  float64
 3   gender                  18014 non-null  object 
 4   relevent_experience     18014 non-null  object 
 5   enrolled_university     18014 non-null  object 
 6   education_level         18014 non-null  object 
 7   major_discipline        18014 non-null  object 
 8   experience              18014 non-null  object 
 9   company_size            18014 non-null  object 
 10  company_type            18014 non-null  object 
 11  last_new_job            18014 non-null  object 
 12  training_hours          18014 non-null  int64  
 13  target                  18014 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [8]:
X_train, X_test, y_train, y_test = preprocess(clean_data(df))
lr_model(X_train, X_test, y_train, y_test)

ROC AUC: 0.72
Classification report:
              precision    recall  f1-score   support

        stay       0.88      0.75      0.81      3399
       leave       0.48      0.69      0.56      1105

    accuracy                           0.74      4504
   macro avg       0.68      0.72      0.69      4504
weighted avg       0.78      0.74      0.75      4504



With Datawig imputation, the logistic regression model reaches similar accuracy (0.74) to that obtained with mode imputation but lower recall for the `leave` class (0.69 versus 0.74). Both Datawig variants give nearly identical results. Since mode imputation is also computationally cheaper, it is used for the rest of the analysis.
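For reference, mode imputation can be expressed in a couple of lines of pandas, which is part of why it is so cheap. The sketch below is an assumption about how such a step might look, not the code in `project_pipeline`. The toy frame borrows column names from the dataset, but its values are made up for illustration.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "company_type": ["Pvt Ltd", "Pvt Ltd", None, "NGO"],
    "training_hours": [36, 47, 83, 52],
})

# mode() returns one row per tied value, so take the first row to get
# a single most-frequent value per column.
modes = toy.mode().iloc[0]

# Fill each column's nulls with that column's most frequent value.
imputed = toy.fillna(modes)
print(imputed)
```

In a train/test setting, the modes would normally be computed on the training split only and then applied to both splits, so that no information from the test set leaks into the imputation.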