## IBM EMPLOYEE DATA ANALYSIS AND MACHINE LEARNING MODEL TRAINING TO PREDICT EMPLOYEE CHURN

### PROJECT DESCRIPTION

This project is for International Business Machines, popularly known as IBM. It was reported that there has been an increase in the rate at which employees are leaving the company. As data analysts at IBM, we are tasked with exploring and analysing the employee data to gain insight into employee characteristics, and with developing machine learning models that accurately predict the likelihood of an employee leaving the company.

The solution will help the HR department develop policies that will raise employee morale and overall satisfaction. This will help reduce the rate at which employees are leaving the company.

### HYPOTHESIS FORMULATION

**NULL HYPOTHESIS (H0):** There is no significant relationship between an employee's income and churn.

**ALTERNATE HYPOTHESIS (H1):** There is a significant relationship between an employee's income and churn.
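
A minimal sketch of how the income hypothesis could be tested, assuming a two-sample comparison of `MonthlyIncome` between employees who left (`Attrition == 1`) and those who stayed (`Attrition == 0`). The `toy` frame, its values, and the 0.05 significance level are illustrative assumptions, not results from the IBM data; in the notebook `employee_train` would take the place of `toy`.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical toy data for illustration only (not taken from train.csv).
toy = pd.DataFrame({
    "Attrition": [0, 0, 0, 0, 1, 1, 1, 1],
    "MonthlyIncome": [8000, 9500, 7200, 10100, 2500, 3100, 2800, 3600],
})

# Split monthly income by attrition status.
stayed = toy.loc[toy["Attrition"] == 0, "MonthlyIncome"]
left = toy.loc[toy["Attrition"] == 1, "MonthlyIncome"]

# Welch's t-test (equal_var=False) does not assume equal group variances.
t_stat, p_value = ttest_ind(stayed, left, equal_var=False)

alpha = 0.05  # assumed significance level
# A p-value below alpha indicates a statistically significant
# difference in mean income between the two groups.
significant = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {significant}")
```

Welch's variant is used here because the two groups are usually of very different sizes in churn data; whether a t-test is appropriate at all depends on the income distribution, which should be checked first.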

### BUSINESS QUESTIONS

1. What is the rate of churn?
2. What is the ratio of male employees to female employees in the company?
3. What is the percentage of males who churned and females who churned?
4. What is the most common education level in the company?
5. Which department has the highest churn rate?
6. How many workers worked overtime in the period?
7. What is the average age of males who churned and females who churned?
8. What are the highest and lowest monthly incomes?
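
The questions above map fairly directly onto pandas aggregations. The sketch below runs them on a small hypothetical frame (`toy`) that borrows column names from the dataset; the values are invented for illustration, and question 3 is interpreted here as the churn rate within each gender, which is an assumption.

```python
import pandas as pd

# Hypothetical toy data; in the notebook, employee_train would be used instead.
toy = pd.DataFrame({
    "Attrition": [1, 0, 0, 1, 0, 0],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Male"],
    "Education": [3, 3, 1, 2, 3, 4],
    "Department": ["Sales", "Sales", "Research & Development",
                   "Human Resources", "Research & Development", "Sales"],
    "OverTime": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "Age": [30, 41, 36, 28, 45, 33],
    "MonthlyIncome": [2500, 9000, 7000, 3000, 12000, 5000],
})

# Q1: overall churn rate (mean of a 0/1 column).
churn_rate = toy["Attrition"].mean()

# Q2: ratio of male to female employees.
gender_counts = toy["Gender"].value_counts()
male_to_female = gender_counts["Male"] / gender_counts["Female"]

# Q3: churn rate within each gender.
churn_by_gender = toy.groupby("Gender")["Attrition"].mean()

# Q4: most common education level.
most_common_education = toy["Education"].mode().iloc[0]

# Q5: department with the highest churn rate.
churn_by_department = toy.groupby("Department")["Attrition"].mean()
top_department = churn_by_department.idxmax()

# Q6: number of employees who worked overtime.
overtime_workers = int((toy["OverTime"] == "Yes").sum())

# Q7: average age of churned employees, by gender.
avg_age_churned = toy[toy["Attrition"] == 1].groupby("Gender")["Age"].mean()

# Q8: lowest and highest monthly income.
income_range = toy["MonthlyIncome"].agg(["min", "max"])

print(churn_rate, male_to_female, top_department, overtime_workers)
```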

### IMPORT PACKAGES AND LOAD DATA

In [1]:
# Import data handling libraries
import pandas as pd
import numpy as np

# Import statistical packages
from scipy.stats import ttest_ind
import scipy.stats as stats

# Import matplotlib and seaborn
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sns
plt.style.use("fivethirtyeight")

# Import machine learning models
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Import Pipeline, Scaler,Sampler, train_test_split, imputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from imblearn.combine import SMOTEENN 

# Import Encoders
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
from category_encoders import BinaryEncoder
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# import metrics
from sklearn.metrics import recall_score, mean_squared_log_error
from sklearn.metrics import precision_score, accuracy_score, f1_score 
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, roc_auc_score 

import os, pickle, joblib

import warnings
warnings.filterwarnings('ignore', message='The default value of numeric_only in DataFrame.corr is deprecated')

%matplotlib inline

In [17]:
# Load Train and Test Data

employee_train  = pd.read_csv("C:\\Users\\elvis_d\\DATA_ANALYTICS\\GITHUB\\IBM EMPLOYEE CHURN ANALYSIS\\EMPLOYEE-CHURN--SQL-POWER-BI-PYTHON-\\Datasets\\train.csv")
employee_test  = pd.read_csv("C:\\Users\\elvis_d\\DATA_ANALYTICS\\GITHUB\\IBM EMPLOYEE CHURN ANALYSIS\\EMPLOYEE-CHURN--SQL-POWER-BI-PYTHON-\\Datasets\\test.csv")
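
The absolute Windows paths above only resolve on the original author's machine. A minimal sketch of a more portable loader is shown below, assuming the two files keep the names `train.csv` and `test.csv` inside one folder. The helper `load_splits` and the temporary demo files are hypothetical and exist only to make the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_splits(data_dir):
    """Read train.csv and test.csv from data_dir and return both frames."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    return train, test


# Demo with tiny made-up files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Age": [41, 21], "Attrition": [0, 1]}).to_csv(
        Path(tmp) / "train.csv", index=False
    )
    pd.DataFrame({"Age": [43, 34]}).to_csv(Path(tmp) / "test.csv", index=False)
    demo_train, demo_test = load_splits(tmp)

print(demo_train.shape, demo_test.shape)
```

In the notebook, the call would be something like `load_splits("Datasets")` run from the repository root, so that the same cell works on any machine that has cloned the project.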

### EXPLORATORY DATA ANALYSIS

In [6]:
# Check sample of train data and display all columns to have a view

pd.options.display.max_columns = None
employee_train.sample(5, random_state=1)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
215,41,0,Travel_Rarely,896,Sales,6,3,Life Sciences,1,298,4,Female,75,3,3,Manager,4,Single,13591,14674,3,Y,Yes,18,3,3,80,0,16,3,3,1,0,0,0
663,21,1,Travel_Rarely,1427,Research & Development,18,1,Other,1,923,4,Female,65,3,1,Research Scientist,4,Single,2693,8870,1,Y,No,19,3,1,80,0,1,3,2,1,0,0,0
773,36,0,Travel_Rarely,796,Research & Development,12,5,Medical,1,1073,4,Female,51,2,3,Manufacturing Director,4,Single,8858,15669,0,Y,No,11,3,2,80,0,15,2,2,14,8,7,8
798,33,1,Travel_Rarely,1017,Research & Development,25,3,Medical,1,1108,1,Male,55,2,1,Research Scientist,2,Single,2313,2993,4,Y,Yes,20,4,2,80,0,5,0,3,2,2,2,2
629,28,0,Travel_Rarely,1169,Human Resources,8,2,Medical,1,869,2,Male,63,2,1,Human Resources,4,Divorced,4936,23965,1,Y,No,13,3,4,80,1,6,6,3,5,1,0,4


In [7]:
# Check basic info of train data

employee_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1058 entries, 0 to 1057
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1058 non-null   int64 
 1   Attrition                 1058 non-null   int64 
 2   BusinessTravel            1058 non-null   object
 3   DailyRate                 1058 non-null   int64 
 4   Department                1058 non-null   object
 5   DistanceFromHome          1058 non-null   int64 
 6   Education                 1058 non-null   int64 
 7   EducationField            1058 non-null   object
 8   EmployeeCount             1058 non-null   int64 
 9   EmployeeNumber            1058 non-null   int64 
 10  EnvironmentSatisfaction   1058 non-null   int64 
 11  Gender                    1058 non-null   object
 12  HourlyRate                1058 non-null   int64 
 13  JobInvolvement            1058 non-null   int64 
 14  JobLevel                

The train data has 1058 rows and 35 columns.

In [9]:
# Check for missing values in the dataset

employee_train.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

There are no missing values in the train data

In [10]:
# Check for duplicates in the data

employee_train.duplicated().sum()

0

There are no duplicates in the data

In [11]:
# Check the shape of the train data

print(f"train dataframe shape: {employee_train.shape}")

train dataframe shape: (1058, 35)


In [12]:
# Check sample of test data

employee_test.sample(5, random_state=1)

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
258,43,Travel_Frequently,1422,Sales,2,4,Life Sciences,1,1849,1,Male,92,3,2,Sales Executive,4,Married,5675,19246,1,Y,No,20,4,3,80,1,7,5,3,7,7,7,7
29,34,Travel_Rarely,1440,Sales,7,2,Technical Degree,1,1541,2,Male,55,3,1,Sales Representative,3,Married,2308,4944,0,Y,Yes,25,4,2,80,1,12,4,3,11,10,5,7
187,24,Travel_Frequently,897,Human Resources,10,3,Medical,1,1746,1,Male,59,3,1,Human Resources,4,Married,2145,2097,0,Y,No,14,3,4,80,1,3,2,3,2,2,2,1
293,48,Travel_Frequently,117,Research & Development,22,3,Medical,1,1900,4,Female,58,3,4,Manager,4,Divorced,17174,2437,3,Y,No,11,3,2,80,1,24,3,3,22,17,4,7
261,32,Travel_Frequently,1318,Sales,10,4,Marketing,1,1853,4,Male,79,3,2,Sales Executive,4,Single,4648,26075,8,Y,No,13,3,3,80,0,4,2,4,0,0,0,0


In [13]:
# Check shape of test data

print(f"test dataframe shape: {employee_test.shape}")

test dataframe shape: (412, 34)


In [14]:
# Check basic info of test data

employee_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412 entries, 0 to 411
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       412 non-null    int64 
 1   BusinessTravel            412 non-null    object
 2   DailyRate                 412 non-null    int64 
 3   Department                412 non-null    object
 4   DistanceFromHome          412 non-null    int64 
 5   Education                 412 non-null    int64 
 6   EducationField            412 non-null    object
 7   EmployeeCount             412 non-null    int64 
 8   EmployeeNumber            412 non-null    int64 
 9   EnvironmentSatisfaction   412 non-null    int64 
 10  Gender                    412 non-null    object
 11  HourlyRate                412 non-null    int64 
 12  JobInvolvement            412 non-null    int64 
 13  JobLevel                  412 non-null    int64 
 14  JobRole                   

In [15]:
# Check for missing values in test data

employee_test.isnull().sum()

Age                         0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithC

There are no missing values in the test data

In [16]:
# Check for duplicates in the test data

employee_test.duplicated().any()

False

There are no duplicates in the test data
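
The shape, missing-value, and duplicate checks above are repeated for each split, so they could be gathered into one helper. The sketch below is one possible way to do that; `quality_report` and the toy frames are hypothetical, and in the notebook `employee_train` and `employee_test` would be passed instead. It also lists the columns present in train but absent from test, which for this dataset should be the `Attrition` target (35 versus 34 columns).

```python
import pandas as pd


def quality_report(df):
    """Summarise size, missing values, and duplicate rows of a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical toy frames standing in for employee_train / employee_test.
toy_train = pd.DataFrame({
    "Age": [41, 21, 36],
    "Attrition": [0, 1, 0],
    "Gender": ["Female", "Female", "Male"],
})
toy_test = toy_train.drop(columns="Attrition")

for name, frame in {"train": toy_train, "test": toy_test}.items():
    print(name, quality_report(frame))

# Columns that exist in train but not in test (expected: the target only).
train_only_columns = sorted(set(toy_train.columns) - set(toy_test.columns))
print(train_only_columns)
```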

### UNIVARIATE ANALYSIS