<a href="https://colab.research.google.com/github/sumedhakoranga/employee_future_prediction/blob/main/feature_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset link: https://www.kaggle.com/tejashvi14/employee-future-prediction

# Uploading dataset

In [None]:
from google.colab import files

files.upload()

Saving Employee.csv to Employee.csv



# Initialization

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('Employee.csv')

X = df.drop(['LeaveOrNot'], axis=1)
y = df['LeaveOrNot']

# Preparing data

In [None]:
X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_full_train, y_full_train, test_size=0.25, random_state=42)
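
The two calls above split the data in stages, so the second `test_size` applies to what is left after the first split, not to the whole dataset. A minimal sketch of the arithmetic, using only the two fractions from the cell above, shows that this gives a 60/20/20 train/validation/test split:

```python
# Fractions passed to the two train_test_split calls above.
first_test_size = 0.2    # share of all rows held out as the test set
second_test_size = 0.25  # share of the remaining rows held out for validation

test_share = first_test_size
val_share = (1 - first_test_size) * second_test_size
train_share = (1 - first_test_size) * (1 - second_test_size)

print(f"train={train_share:.2f}, val={val_share:.2f}, test={test_share:.2f}")
# prints: train=0.60, val=0.20, test=0.20
```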

In [None]:
numerical = ['Age']
categorical = ['Education', 'JoiningYear', 'City', 'PaymentTier', 'Gender', 'EverBenched', 'ExperienceInCurrentDomain']

# Creating Pipeline

In [None]:
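def create_new_pipeline(numerical, categorical):
    """Build an unfitted preprocessing + logistic regression pipeline.

    numerical and categorical are lists of column names.
    """
    # Numerical columns: fill missing values with the column median.
    numerical_transformer = SimpleImputer(strategy='median')

    # Categorical columns: fill missing values with the most frequent value,
    # then one-hot encode, dropping the first level of each feature.
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoding', OneHotEncoder(drop='first'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numerical_transformer, numerical),
            ('categorical', categorical_transformer, categorical)
        ]
    )

    model = LogisticRegression()

    pipeline = Pipeline(
        steps=[
            ('preprocessing', preprocessor),
            ('model', model)
        ]
    )

    return pipeline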

# Deciding features to use based on their association with the target (calculated during EDA)

In [None]:
chi_square_drop_order = ['JoiningYear',
                         'ExperienceInCurrentDomain',
                         'PaymentTier',
                         'EverBenched',
                         'Education',
                         'Gender',
                         'City']
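
The ranking above comes from the EDA notebook, which is not shown here, so exactly how it was produced is not stated. As a rough sketch only, the Pearson chi-square statistic for one categorical feature against the target could be computed as below. The function name, the toy data, and the "weakest first" sort are all assumptions made for illustration, not the author's procedure. Comparing raw statistics across features with different numbers of levels is itself a simplification; p-values account for the degrees of freedom.

```python
from collections import Counter

def chi_square_statistic(feature, target):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)

    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n  # count expected under independence
            diff = observed[(f_value, t_value)] - expected
            statistic += diff ** 2 / expected
    return statistic

# Toy data, made up for illustration (not rows from Employee.csv).
toy_target = [1, 1, 1, 0, 0, 0, 1, 0]
toy_features = {
    "informative": ["a", "a", "a", "b", "b", "b", "a", "b"],    # mirrors the target
    "uninformative": ["x", "y", "x", "y", "x", "y", "y", "x"],  # independent of it
}

chi_scores = {name: chi_square_statistic(values, toy_target)
              for name, values in toy_features.items()}

# Assumed convention: weakest association first, i.e. the first candidate to drop.
toy_drop_order = sorted(chi_scores, key=chi_scores.get)
print(chi_scores)      # {'informative': 8.0, 'uninformative': 0.0}
print(toy_drop_order)  # ['uninformative', 'informative']
```

In practice, `scipy.stats.chi2_contingency` computes this statistic from a contingency table and also returns a p-value; note that it applies a continuity correction to 2x2 tables by default, so its result for the toy data would differ from the uncorrected value here.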

In [None]:
for i in range(len(chi_square_drop_order)):
    pipeline = create_new_pipeline(numerical, chi_square_drop_order[i:])

    pipeline.fit(X_train.drop(chi_square_drop_order[:i], axis=1), y_train)

    print(f'Features included: {chi_square_drop_order[i:]}')
    print(f'Training score: {pipeline.score(X_train.drop(chi_square_drop_order[:i], axis=1), y_train)}')
    print(f'Validation score: {pipeline.score(X_val.drop(chi_square_drop_order[:i], axis=1), y_val)}')
    print()
    print()

Features included: ['JoiningYear', 'ExperienceInCurrentDomain', 'PaymentTier', 'EverBenched', 'Education', 'Gender', 'City']
Training score: 0.802221426012182
Validation score: 0.8098818474758325


Features included: ['ExperienceInCurrentDomain', 'PaymentTier', 'EverBenched', 'Education', 'Gender', 'City']
Training score: 0.7284127552848442
Validation score: 0.7411385606874329


Features included: ['PaymentTier', 'EverBenched', 'Education', 'Gender', 'City']
Training score: 0.714797563597277
Validation score: 0.7346938775510204


Features included: ['EverBenched', 'Education', 'Gender', 'City']
Training score: 0.7312791114295951
Validation score: 0.7443609022556391


Features included: ['Education', 'Gender', 'City']
Training score: 0.7319957004657829
Validation score: 0.7551020408163265


Features included: ['Gender', 'City']
Training score: 0.7327122895019706
Validation score: 0.7551020408163265


Features included: ['City']
Training score: 0.6589036187746328
Validation score: 0.6680

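We can conclude that no feature can be dropped following the chi-square ordering: removing even the first feature, JoiningYear, lowers validation accuracy from about 0.81 to about 0.74, and no smaller subset recovers it.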

In [None]:
mutual_info_drop_order = ['ExperienceInCurrentDomain',
                          'Education',
                          'EverBenched',
                          'City',
                          'Gender',
                          'PaymentTier',
                          'JoiningYear']
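
As with the chi-square ranking, this ordering was calculated during EDA and the computation is not shown here. The sketch below illustrates what mutual information between one discrete feature and the target measures; the function name and toy data are hypothetical, and the result is in nats (natural logarithm).

```python
from collections import Counter
from math import log

def mutual_information(feature, target):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)

    mi = 0.0
    for (f_value, t_value), count in joint.items():
        p_joint = count / n
        # Probability of this pair if feature and target were independent.
        p_independent = (feature_totals[f_value] / n) * (target_totals[t_value] / n)
        mi += p_joint * log(p_joint / p_independent)
    return mi

# Toy data, made up for illustration (not rows from Employee.csv).
toy_target = [1, 1, 1, 0, 0, 0, 1, 0]
informative = ["a", "a", "a", "b", "b", "b", "a", "b"]    # mirrors the target
uninformative = ["x", "y", "x", "y", "x", "y", "y", "x"]  # independent of it

mi_informative = mutual_information(informative, toy_target)
mi_uninformative = mutual_information(uninformative, toy_target)
print(mi_informative, mi_uninformative)
```

A feature that fully determines a balanced binary target reaches the maximum of `log(2)` nats, while an independent feature scores 0. scikit-learn provides `sklearn.metrics.mutual_info_score` for the same quantity on two label arrays and `sklearn.feature_selection.mutual_info_classif` for scoring several features at once.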

In [None]:
for i in range(len(mutual_info_drop_order)):
    pipeline = create_new_pipeline(numerical, mutual_info_drop_order[i:])

    pipeline.fit(X_train.drop(mutual_info_drop_order[:i], axis=1), y_train)

    print(f'Features included: {mutual_info_drop_order[i:]}')
    print(f'Training score: {pipeline.score(X_train.drop(mutual_info_drop_order[:i], axis=1), y_train)}')
    print(f'Validation score: {pipeline.score(X_val.drop(mutual_info_drop_order[:i], axis=1), y_val)}')
    print()
    print()

Features included: ['ExperienceInCurrentDomain', 'Education', 'EverBenched', 'City', 'Gender', 'PaymentTier', 'JoiningYear']
Training score: 0.8029380150483698
Validation score: 0.8098818474758325


Features included: ['Education', 'EverBenched', 'City', 'Gender', 'PaymentTier', 'JoiningYear']
Training score: 0.7968470082407739
Validation score: 0.8120300751879699


Features included: ['EverBenched', 'City', 'Gender', 'PaymentTier', 'JoiningYear']
Training score: 0.7975635972769617
Validation score: 0.7980665950590763


Features included: ['City', 'Gender', 'PaymentTier', 'JoiningYear']
Training score: 0.7957721246864923
Validation score: 0.8023630504833512


Features included: ['Gender', 'PaymentTier', 'JoiningYear']
Training score: 0.790756001433178
Validation score: 0.7969924812030075


Features included: ['PaymentTier', 'JoiningYear']
Training score: 0.7832318165532067
Validation score: 0.7862513426423201


Features included: ['JoiningYear']
Training score: 0.7219634539591544
Valid

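We cannot justify removing any features under the mutual information ordering either: dropping ExperienceInCurrentDomain alone leaves validation accuracy essentially unchanged (about 0.812 versus 0.810), and every smaller subset scores below the full feature set.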