<table align="left" width=100%>
    <tr>
        <td>
            <div align="middle">
                <font color="#21618C" size=5px>
                  <b>EMPLOYEE ATTRITION BY HR ANALYSIS
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Problem Statement

- We analyse a dataset describing the employees of an organisation and the factors that affect their chances of attrition: geographical, familial and economic conditions as well as job history.

- We build a model that predicts attrition from the relevant features and parameters, making it easier for the HR department to take appropriate decisions.

## Data Description 

AGE - Numerical Value

ATTRITION - Employee leaving the company (0=Current Employee, 1=Voluntary Resignation)

BUSINESS TRAVEL - (1=No Travel, 2=Travel Frequently, 3=Travel Rarely)

DAILY RATE - Salary Level

DEPARTMENT - (1=HR, 2=R&D, 3=Sales)

DISTANCE FROM HOME - The distance from work to home

EDUCATION - (1=Below College, 2=College, 3=Bachelor, 4=Master, 5=Doctor)

EMPLOYEE COUNT - Numerical Value

EDUCATION FIELD - (1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL)

EMPLOYEE NUMBER - Employee ID

ENVIRONMENT SATISFACTION - Satisfaction with the environment

GENDER - (1=FEMALE, 2=MALE)

HOURLY RATE - Hourly Salary

JOB INVOLVEMENT - (1=Low, 2=Medium, 3=High, 4=Very High)

JOB LEVEL - Level of Job

JOB ROLE - Position

JOB SATISFACTION - (1=Low, 2=Medium, 3=High, 4=Very High)

MARITAL STATUS - (1=Divorced, 2=Married, 3=Single)

MONTHLY INCOME - Monthly Salary

MONTHLY RATE - Monthly rate

NUMCOMPANIES WORKED - Number of companies worked	

OVER 18 - (Y=YES, N=NO)

OVERTIME - (YES, NO)

PERCENT SALARY HIKE - Percentage increase in salary

PERFORMANCE RATING - (1=Low, 2=Good, 3=Excellent, 4=Outstanding)

RELATIONSHIP SATISFACTION - (1=Low, 2=Medium, 3=High, 4=Very High)

STANDARD HOURS - Standard working hours

STOCK OPTIONS LEVEL - Stock options

TOTAL WORKING YEARS - Number of years worked

TRAINING TIMES LAST YEAR - Hours spent for training

WORK LIFE BALANCE - Time spent between work and personal life

YEARS AT COMPANY - Total number of years at the company

YEARS IN CURRENT ROLE - Number of years in current role

YEARS SINCE LAST PROMOTION - Years since last promotion

YEARS WITH CURRENT MANAGER - Years spent with current manager

## Table of Contents

1. **[Import Libraries](#import_lib)**
2. **[Set Options](#set_options)**
3. **[Read Data](#RD)**
4. **[Data Analysis and Preparation](#data_preparation)**
5. **[Base Model](#LogisticReg)**


## 1. Import Libraries

In [None]:
import warnings
warnings.filterwarnings("ignore")

import os

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import export_graphviz
import featuretools as ft

import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

import pydotplus
from IPython.display import Image  
import graphviz

## 2. Set Options 

In [None]:
# pd.options.display.max_columns = None

# pd.options.display.max_rows = None

# np.set_printoptions(suppress=True)

## 3. Read Data

In [None]:
df = pd.read_csv("IBM HR Data new.csv")
df.head()

## 4. Data Analysis and Preparation

In [None]:
df.shape

<table align="left">
    <tr>
        <td>
            <div align="left" style="font-size:120%">
                <font color="#21618C">
                    <b>We see that there are 23,436 observations and 37 features.</b>
                </font>
            </div>
        </td>
    </tr>
</table>




In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
total = df.isnull().sum()
percent = ((df.isnull().sum()/df.isnull().count())*100).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

print(missing_data)

In [None]:
df['Attrition'] = df['Attrition'].replace(['Current employee','Voluntary Resignation'],[0,1])

In [None]:
df[df['Education']==6]

In [None]:
df.drop(15655,inplace=True)

In [None]:
df['Education'] = df['Education'].replace([1,2,3,4,5],['Below college','College','Bachelors','Master','Doctorate'])

In [None]:
df.columns

In [None]:
df.drop(['EmployeeCount','EmployeeNumber','Application ID','Over18'],inplace=True,axis=1)

In [None]:
df[['EnvironmentSatisfaction','JobSatisfaction','PerformanceRating', 'RelationshipSatisfaction','StockOptionLevel','WorkLifeBalance']]

In [None]:
df['Gender'] = df['Gender'].replace(['Male','Female'],[0,1])

In [None]:
df[df['Gender']=='2']

In [None]:
df.drop(17027,inplace=True)

## Null Imputation

In [None]:
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [None]:
df['Attrition'] = df['Attrition'].fillna(df['Attrition'].mode()[0])

In [None]:
df['BusinessTravel'] = df['BusinessTravel'].fillna(df['BusinessTravel'].mode()[0])

In [None]:
df['DailyRate'] = df['DailyRate'].fillna(df['DailyRate'].mean())

In [None]:
df['Department'] = df['Department'].fillna(df['Department'].mode()[0])

In [None]:
df['DistanceFromHome'] = df['DistanceFromHome'].astype(float)
df['DistanceFromHome'] = df['DistanceFromHome'].fillna(df['DistanceFromHome'].mean())

In [None]:
df['Education'] = df['Education'].fillna(df['Education'].mode()[0])

In [None]:
df['EducationField'] = df['EducationField'].fillna(df['EducationField'].mode()[0])

In [None]:
df['EnvironmentSatisfaction'] = df['EnvironmentSatisfaction'].fillna(df['EnvironmentSatisfaction'].median())

In [None]:
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

In [None]:
df['HourlyRate'] = df['HourlyRate'].astype(float)
df['HourlyRate'] = df['HourlyRate'].fillna(df['HourlyRate'].mean())

In [None]:
df['JobInvolvement'] = df['JobInvolvement'].fillna(df['JobInvolvement'].median())

In [None]:
df['JobLevel'] = df['JobLevel'].fillna(df['JobLevel'].mean())

In [None]:
df['JobRole'] = df['JobRole'].fillna(df['JobRole'].mode()[0])

In [None]:
df['JobSatisfaction'] = df['JobSatisfaction'].astype(float)
df['JobSatisfaction'] = df['JobSatisfaction'].fillna(df['JobSatisfaction'].mean())

In [None]:
df['MaritalStatus'] = df['MaritalStatus'].fillna(df['MaritalStatus'].mode()[0])

In [None]:
df['MonthlyIncome'] = df['MonthlyIncome'].astype(float)
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())

In [None]:
df['MonthlyRate'] = df['MonthlyRate'].fillna(df['MonthlyRate'].mean())

In [None]:
df['NumCompaniesWorked'] = df['NumCompaniesWorked'].fillna(df['NumCompaniesWorked'].median())

In [None]:
df['OverTime'] = df['OverTime'].fillna(df['OverTime'].mode()[0])

In [None]:
df['PercentSalaryHike'] = df['PercentSalaryHike'].astype(float)
df['PercentSalaryHike'] = df['PercentSalaryHike'].fillna(df['PercentSalaryHike'].mean())

In [None]:
df['PerformanceRating'] = df['PerformanceRating'].fillna(df['PerformanceRating'].median())

In [None]:
df['RelationshipSatisfaction'] = df['RelationshipSatisfaction'].fillna(df['RelationshipSatisfaction'].mean())

In [None]:
df['StandardHours'] = df['StandardHours'].fillna(df['StandardHours'].mean())

In [None]:
df['StockOptionLevel'] = df['StockOptionLevel'].fillna(df['StockOptionLevel'].median())

In [None]:
df['TotalWorkingYears'] = df['TotalWorkingYears'].fillna(df['TotalWorkingYears'].median())

In [None]:
df['TrainingTimesLastYear'] = df['TrainingTimesLastYear'].fillna(df['TrainingTimesLastYear'].median())

In [None]:
df['WorkLifeBalance'] = df['WorkLifeBalance'].fillna(df['WorkLifeBalance'].mean())

In [None]:
df['YearsAtCompany'] = df['YearsAtCompany'].fillna(df['YearsAtCompany'].median())

In [None]:
df['YearsInCurrentRole'] = df['YearsInCurrentRole'].fillna(df['YearsInCurrentRole'].median())

In [None]:
df['YearsSinceLastPromotion'] = df['YearsSinceLastPromotion'].fillna(df['YearsSinceLastPromotion'].median())

In [None]:
df['YearsWithCurrManager'] = df['YearsWithCurrManager'].fillna(df['YearsWithCurrManager'].median())

In [None]:
df['Employee Source'] = df['Employee Source'].fillna(df['Employee Source'].mode()[0])

In [None]:
df.drop('StandardHours',axis=1,inplace=True)

In [None]:
df.isnull().sum()
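The imputation cells above repeat the same `fillna` pattern once per column. As a minimal sketch, assuming the per-column choice of mean, median or mode is the only thing that varies, the same idea can be driven from a dictionary. The helper name `impute_columns` and the toy frame are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def impute_columns(frame, strategies):
    """Return a copy of `frame` with NaNs filled column by column.

    `strategies` maps a column name to 'mean', 'median' or 'mode'.
    (Hypothetical helper, shown only to illustrate the pattern.)
    """
    out = frame.copy()
    for column, how in strategies.items():
        if how == "mean":
            fill_value = out[column].mean()
        elif how == "median":
            fill_value = out[column].median()
        elif how == "mode":
            fill_value = out[column].mode()[0]
        else:
            raise ValueError(f"unknown strategy: {how}")
        out[column] = out[column].fillna(fill_value)
    return out


# Toy frame standing in for the HR data (values are made up).
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0],
    "MonthlyIncome": [1000.0, 3000.0, np.nan],
    "Department": ["Sales", None, "Sales"],
})

filled = impute_columns(
    toy, {"Age": "mean", "MonthlyIncome": "median", "Department": "mode"}
)
print(filled)
```

Keeping the strategy per column in one place makes it easier to see, for instance, which ordinal columns are filled with a mean and which with a median.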

## Checking for Outliers

In [None]:
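# the IQR rule only applies to numeric columns
num_df = df.select_dtypes(include='number')
Q1 = num_df.quantile(0.25)
Q3 = num_df.quantile(0.75)
IQR = Q3 - Q1

In [None]:
# flag values outside the 1.5 * IQR fences, column by column
is_outlier = (num_df < (Q1 - 1.5 * IQR)) | (num_df > (Q3 + 1.5 * IQR))
total_out = is_outlier.sum()

In [None]:
percent_out = ((is_outlier.sum()/num_df.count())*100).sort_values(ascending=False)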

In [None]:
outliers = pd.concat([total_out, percent_out], axis=1, keys=['Total', 'Percent'])
outliers

In [None]:
col = ['BusinessTravel','Education','Department','EducationField','Gender','JobRole','MaritalStatus', 'OverTime','Employee Source']

In [None]:
dfnew = pd.get_dummies(df[col] , drop_first = True) 

In [None]:
df.drop(col,inplace=True,axis=1)

In [None]:
df=pd.concat([df,dfnew],axis=1)

In [None]:
df.T

In [None]:
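# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df['MonthlyIncome'], kde=True)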

In [None]:
df['MonthlyIncome'] = df['MonthlyIncome'].apply(lambda x : np.log(x+1))

In [None]:
# distribution after the log transform
sns.histplot(df['MonthlyIncome'], kde=True)

In [None]:
df['YearsAtcompany_1_10']=df['YearsAtCompany'][df['YearsAtCompany']<11]

In [None]:
df['YearsAtcompany_11+']=df['YearsAtCompany'][df['YearsAtCompany']>=11]

In [None]:
df['YearsAtcompany_1_10'] = df['YearsAtcompany_1_10'].fillna(0)

In [None]:
df['YearsAtcompany_11+'] = df['YearsAtcompany_11+'].fillna(0)

In [None]:
df['YearsAtcompany_1_10']=df['YearsAtcompany_1_10'].apply(lambda x: 1 if x>1 else x)

In [None]:
df['YearsAtcompany_11+']=df['YearsAtcompany_11+'].apply(lambda x: 1 if x>1 else x)

In [None]:
df['NumCompaniesWorked'].value_counts()

In [None]:
df['NumCompaniesWorked_0_5']=df['NumCompaniesWorked'][df['NumCompaniesWorked']<=5]

In [None]:
df['NumCompaniesWorked_6+']=df['NumCompaniesWorked'][df['NumCompaniesWorked']>5]

In [None]:
df['NumCompaniesWorked_0_5'] = df['NumCompaniesWorked_0_5'].fillna(0)

In [None]:
df['NumCompaniesWorked_6+'] = df['NumCompaniesWorked_6+'].fillna(0)

In [None]:
df['NumCompaniesWorked_0_5']=df['NumCompaniesWorked_0_5'].apply(lambda x: 1 if x>1 else x)

In [None]:
df['NumCompaniesWorked_6+']=df['NumCompaniesWorked_6+'].apply(lambda x: 1 if x>1 else x)

In [None]:
df['PerformanceRating'].value_counts()

In [None]:
df['PerformanceRating_3']=df['PerformanceRating'][df['PerformanceRating']==3]

In [None]:
df['PerformanceRating_4']=df['PerformanceRating'][df['PerformanceRating']==4]

In [None]:
df['PerformanceRating_3'] = df['PerformanceRating_3'].fillna(0)

In [None]:
df['PerformanceRating_4'] = df['PerformanceRating_4'].fillna(0)

In [None]:
df['PerformanceRating_3']=df['PerformanceRating_3'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['PerformanceRating_4']=df['PerformanceRating_4'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['StockOptionLevel'].value_counts()

In [None]:
df['StockOptionLevel0_1']=df['StockOptionLevel'][df['StockOptionLevel']<=1]

In [None]:
df['StockOptionLevel2_3']=df['StockOptionLevel'][df['StockOptionLevel']>1]

In [None]:
df['StockOptionLevel0_1'] = df['StockOptionLevel0_1'].fillna(0)

In [None]:
df['StockOptionLevel2_3'] = df['StockOptionLevel2_3'].fillna(0)

In [None]:
df['StockOptionLevel0_1']=df['StockOptionLevel0_1'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['StockOptionLevel2_3']=df['StockOptionLevel2_3'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['TotalWorkingYears'].value_counts()

In [None]:
df['TotalWorkingYears_1-10']=df['TotalWorkingYears'][df['TotalWorkingYears']<=10]

In [None]:
df['TotalWorkingYears_11+']=df['TotalWorkingYears'][df['TotalWorkingYears']>10]

In [None]:
df['TotalWorkingYears_1-10'] = df['TotalWorkingYears_1-10'].fillna(0)

In [None]:
df['TotalWorkingYears_11+'] = df['TotalWorkingYears_11+'].fillna(0)

In [None]:
df['TotalWorkingYears_1-10']=df['TotalWorkingYears_1-10'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['TotalWorkingYears_11+']=df['TotalWorkingYears_11+'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['TrainingTimesLastYear'].value_counts()

In [None]:
df['TrainingTimesLastYear_0_4']=df['TrainingTimesLastYear'][df['TrainingTimesLastYear']<=4]

In [None]:
df['TrainingTimesLastYear_5+']=df['TrainingTimesLastYear'][df['TrainingTimesLastYear']>4]

In [None]:
df['TrainingTimesLastYear_0_4'] = df['TrainingTimesLastYear_0_4'].fillna(0)

In [None]:
df['TrainingTimesLastYear_5+'] = df['TrainingTimesLastYear_5+'].fillna(0)

In [None]:
df['TrainingTimesLastYear_0_4']=df['TrainingTimesLastYear_0_4'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['TrainingTimesLastYear_5+']=df['TrainingTimesLastYear_5+'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['YearsInCurrentRole'].value_counts()

In [None]:
df['YearsInCurrentRole0_10']=df['YearsInCurrentRole'][df['YearsInCurrentRole']<=10]

In [None]:
df['YearsInCurrentRole11+']=df['YearsInCurrentRole'][df['YearsInCurrentRole']>10]

In [None]:
df['YearsInCurrentRole0_10'] = df['YearsInCurrentRole0_10'].fillna(0)

In [None]:
df['YearsInCurrentRole11+'] = df['YearsInCurrentRole11+'].fillna(0)

In [None]:
df['YearsInCurrentRole0_10']=df['YearsInCurrentRole0_10'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['YearsInCurrentRole11+']=df['YearsInCurrentRole11+'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['YearsSinceLastPromotion'].value_counts()

In [None]:
df['YearsSinceLastPromotion0_10']=df['YearsSinceLastPromotion'][df['YearsSinceLastPromotion']<=10]

In [None]:
df['YearsSinceLastPromotion11+']=df['YearsSinceLastPromotion'][df['YearsSinceLastPromotion']>10]

In [None]:
df['YearsSinceLastPromotion0_10'] = df['YearsSinceLastPromotion0_10'].fillna(0)

In [None]:
df['YearsSinceLastPromotion11+'] = df['YearsSinceLastPromotion11+'].fillna(0)

In [None]:
df['YearsSinceLastPromotion0_10']=df['YearsSinceLastPromotion0_10'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['YearsSinceLastPromotion11+']=df['YearsSinceLastPromotion11+'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['YearsWithCurrManager'].value_counts()

In [None]:
df['YearsWithCurrManager0_8']=df['YearsWithCurrManager'][df['YearsWithCurrManager']<=8]

In [None]:
df['YearsWithCurrManager9+']=df['YearsWithCurrManager'][df['YearsWithCurrManager']>8]

In [None]:
df['YearsWithCurrManager0_8'] = df['YearsWithCurrManager0_8'].fillna(0)

In [None]:
df['YearsWithCurrManager9+'] = df['YearsWithCurrManager9+'].fillna(0)

In [None]:
df['YearsWithCurrManager0_8']=df['YearsWithCurrManager0_8'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df['YearsWithCurrManager9+']=df['YearsWithCurrManager9+'].apply(lambda x: 1 if x>=1 else x)

In [None]:
df.columns

In [None]:
df.drop(['YearsWithCurrManager','YearsAtCompany', 'YearsInCurrentRole','YearsSinceLastPromotion','PerformanceRating','NumCompaniesWorked','StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear'],axis=1,inplace=True)
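Each binning step above builds a pair of 0/1 indicator columns by slicing, filling with 0 and then capping values at 1. A hedged, more compact sketch of the same idea uses boolean comparisons directly. One difference is worth noting: with the slicing recipe a raw value of 0 leaves both indicators at 0, whereas the comparison below places 0 in the lower bin. Which behaviour is intended is an assumption here, and the helper name `threshold_indicators` is hypothetical.

```python
import pandas as pd


def threshold_indicators(series, threshold):
    """Split `series` into two 0/1 indicators: <= threshold and > threshold."""
    low = (series <= threshold).astype(int)
    high = (series > threshold).astype(int)
    return low, high


# Made-up values standing in for a count column such as NumCompaniesWorked.
toy = pd.Series([0, 1, 5, 6, 9], name="NumCompaniesWorked")
low, high = threshold_indicators(toy, 5)

print(pd.DataFrame({"raw": toy, "0_5": low, "6+": high}))
```

Because the two indicators always sum to 1 in this version, one of them is redundant for a linear model and could be dropped.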

## BASE MODEL - Logistic Regression 

In [None]:
X=df.drop('Attrition',axis=1)
y=df['Attrition']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)

In [None]:
logreg_scaled_features = LogisticRegression()
logreg_scaled_features.fit(X_train,y_train)

In [None]:
y_pred = logreg_scaled_features.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)

# label the confusion matrix  
conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])

# set size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")
plt.show()

In [None]:
result = classification_report(y_test,y_pred)

# print the result
print(result)

In [None]:
# True Negatives are denoted by 'TN'
# Actual '0' values which are classified correctly
TN = cm[0,0]

# True Positives are denoted by 'TP'
# Actual '1' values which are classified correctly
TP = cm[1,1]

# False Negatives are denoted by 'FN'
# Actual '1' values which are classified wrongly as '0'
FN = cm[1,0]

# False Positives are denoted by 'FP'
# Actual '0' values which are classified wrongly as '1'
FP = cm[0,1]

In [None]:
print(TN,',',TP,',',FN,',',FP)
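The four counts read from the confusion matrix are enough to compute sensitivity and specificity, the two quantities on the ROC axes. The sketch below uses a made-up pair of label lists rather than the notebook's predictions; `y_true_demo` and `y_hat_demo` are illustrative names.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: two actual negatives, two actual positives.
y_true_demo = [0, 0, 1, 1]
y_hat_demo = [0, 1, 1, 1]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_hat_demo).ravel()

sensitivity = tp / (tp + fn)  # true positive rate (recall for class 1)
specificity = tn / (tn + fp)  # true negative rate

print("sensitivity:", sensitivity)
print("specificity:", specificity)
```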

In [None]:
plt.rcParams['figure.figsize']=(8,5)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
plt.plot(fpr,tpr)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

plt.plot([0, 1], [0, 1],'r--')

plt.title('ROC curve for logistic Regression')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')

plt.text(x = 0.05, y = 0.8, s = f"AUC Score: {round(metrics.roc_auc_score(y_test, y_pred), 4)}")
plt.grid(True)
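The ROC cell above passes hard 0/1 predictions to `roc_curve`, which yields at most three points on the curve. A hedged alternative is to pass the predicted probability of the positive class, so that every threshold contributes a point. The sketch below compares both on synthetic data from `make_classification`; the dataset and variable names are assumptions and the numbers are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR features and the Attrition target.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)           # 0/1 decisions at the 0.5 threshold
scores = clf.predict_proba(X_te)[:, 1]    # probability of class 1

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)

auc_hard = roc_auc_score(y_te, hard_labels)
auc_soft = roc_auc_score(y_te, scores)

print("points on curve (hard labels):", len(fpr_hard))
print("points on curve (probabilities):", len(fpr_soft))
print("AUC from hard labels:", round(auc_hard, 4))
print("AUC from probabilities:", round(auc_soft, 4))
```

The AUC computed from hard labels describes a single operating point, while the probability-based AUC summarises the ranking quality across all thresholds.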

In [None]:
cols = ['Model', 'AUC Score', 'Precision Score', 'Recall Score','Accuracy Score','f1-score']

# create an empty dataframe with these columns
result_tabulation = pd.DataFrame(columns = cols)
Logistic_regression = pd.Series({'Model': "Logistic regression ",
                     'AUC Score' : metrics.roc_auc_score(y_test, y_pred),
                 'Precision Score': metrics.precision_score(y_test, y_pred),
                 'Recall Score': metrics.recall_score(y_test, y_pred),
                 'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                  'f1-score': metrics.f1_score(y_test, y_pred)})


# appending our result table
# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
result_tabulation = pd.concat([result_tabulation, pd.DataFrame([Logistic_regression])], ignore_index = True)

# view the result table
result_tabulation
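The same five metrics are recomputed for every model in this notebook. As a minimal sketch, a small helper can build one row of the result table from the true and predicted labels. The function name `metrics_row` and the toy labels are hypothetical; the column names follow the `cols` list above.

```python
import pandas as pd
from sklearn import metrics


def metrics_row(model_name, y_true, y_hat):
    """Return one result-table row as a dict (hypothetical helper)."""
    return {
        "Model": model_name,
        "AUC Score": metrics.roc_auc_score(y_true, y_hat),
        "Precision Score": metrics.precision_score(y_true, y_hat),
        "Recall Score": metrics.recall_score(y_true, y_hat),
        "Accuracy Score": metrics.accuracy_score(y_true, y_hat),
        "f1-score": metrics.f1_score(y_true, y_hat),
    }


# Made-up labels, only to show the shape of the output.
rows = [metrics_row("toy model", [0, 1, 1, 0], [0, 1, 0, 0])]
table = pd.DataFrame(rows)
print(table)
```

Collecting the rows in a list and building the DataFrame once also avoids growing the table row by row.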

## Applying SMOTE

In [None]:
from collections import Counter
from imblearn.over_sampling import SMOTE

# named `smote` so it does not shadow statsmodels.api, which is imported above as `sm`
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
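The cell above oversamples the whole dataset before it is split, so synthetic minority rows in the training set can be derived from rows that later land in the test set. One hedged alternative is to split first and oversample only the training portion. The sketch below shows that ordering on made-up data. It uses plain random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE, and the names `X_demo`, `y_demo` and the `_bal` suffix are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data: 40 positives, 160 negatives.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
y_demo = pd.Series([1] * 40 + [0] * 160, name="Attrition")

# 1) Split first, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# 2) Oversample the minority class in the training part only.
train = pd.concat([X_tr, y_tr], axis=1)
minority = train[train["Attrition"] == 1]
majority = train[train["Attrition"] == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
train_bal = pd.concat([majority, minority_up])

X_tr_bal = train_bal.drop(columns="Attrition")
y_tr_bal = train_bal["Attrition"]

print("training class counts:", y_tr_bal.value_counts().to_dict())
print("test rows (untouched):", len(X_te))
```

With this ordering the test set keeps the original class imbalance, so the reported metrics reflect the distribution the model would actually face.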

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size = 0.30, random_state = 0)

In [None]:
logreg_scaled_features = LogisticRegression()
logreg_scaled_features.fit(X_train,y_train)

In [None]:
y_pred = logreg_scaled_features.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)

# label the confusion matrix  
conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])

# set size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")
plt.show()

In [None]:
result = classification_report(y_test,y_pred)

# print the result
print(result)

In [None]:
# True Negatives are denoted by 'TN'
# Actual '0' values which are classified correctly
TN = cm[0,0]

# True Positives are denoted by 'TP'
# Actual '1' values which are classified correctly
TP = cm[1,1]

# False Negatives are denoted by 'FN'
# Actual '1' values which are classified wrongly as '0'
FN = cm[1,0]

# False Positives are denoted by 'FP'
# Actual '0' values which are classified wrongly as '1'
FP = cm[0,1]

In [None]:
print(TN,',',TP,',',FN,',',FP)

In [None]:
plt.rcParams['figure.figsize']=(8,5)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
plt.plot(fpr,tpr)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

plt.plot([0, 1], [0, 1],'r--')

plt.title('ROC curve')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')

plt.text(x = 0.05, y = 0.8, s = f"AUC Score: {round(metrics.roc_auc_score(y_test, y_pred), 4)}")
plt.grid(True)

In [None]:
Logistic_regression_After_smote = pd.Series({'Model': "Logistic regression After Smote ",
                     'AUC Score' : metrics.roc_auc_score(y_test, y_pred),
                 'Precision Score': metrics.precision_score(y_test, y_pred),
                 'Recall Score': metrics.recall_score(y_test, y_pred),
                 'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                  'f1-score': metrics.f1_score(y_test, y_pred)})


# appending our result table
# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
result_tabulation = pd.concat([result_tabulation, pd.DataFrame([Logistic_regression_After_smote])], ignore_index = True)

# view the result table
result_tabulation

In [None]:
plt.rcParams['figure.figsize']=(28,10)

result_tabulation.plot(secondary_y=['Accuracy Score','Precision Score'], mark_right=True)

# one tick per row of the result table, labelled with the model name
plt.xticks(range(len(result_tabulation)), list(result_tabulation.Model))
plt.show()

In [None]:
result_tabulation.to_excel('result.xlsx')

In [None]:
dfnew=pd.concat([X_res,y_res],axis=1)

In [None]:
dfnew

In [None]:
dfnew.to_excel('New IBM.xlsx')

In [None]:
X=dfnew.drop('Attrition',axis=1)
y=dfnew['Attrition']

In [None]:
from sklearn.feature_selection import RFE
model = LogisticRegression()
# initialise RFE; keeping a single feature makes ranking_ a full ordering of all features
rfe = RFE(model, n_features_to_select=1)
# transform the data using RFE
X_rfe = rfe.fit_transform(X,y)  
# fit the model on the selected feature
model.fit(X_rfe,y)
print(rfe.support_)
print(rfe.ranking_)
# candidate numbers of features to try
nof_list=np.arange(1,67)            
high_score=0
high_score=0



In [None]:
nof=0           
score_list =[]
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0)
    model = LogisticRegression()
    rfe = RFE(model, n_features_to_select=nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
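The loop above chooses the number of features by scoring every candidate on the test split, which means the test set also takes part in the selection. scikit-learn's `RFECV` performs a comparable search using cross-validation on the data it is given. The sketch below runs it on synthetic data; the dataset, the `cv=5` setting and the variable names are assumptions, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 8 features, of which 3 are informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Recursive feature elimination, with the feature count picked by 5-fold CV.
selector = RFECV(
    LogisticRegression(max_iter=1000), step=1, cv=5, scoring="accuracy"
)
selector.fit(X_demo, y_demo)

n_selected = selector.n_features_
mask = selector.support_

print("number of selected features:", n_selected)
print("selected feature mask:", mask)
```

In the notebook's setting this would typically be fitted on `X_train` and `y_train` only, leaving the test split for the final evaluation.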

In [None]:
cols = list(X.columns)
model = LogisticRegression()
#Initializing RFE model
rfe = RFE(model, n_features_to_select=45)
#Transforming data using RFE
X_rfe = rfe.fit_transform(X,y)  
#Fitting the data to model
model.fit(X_rfe,y)              
temp = pd.Series(rfe.support_,index = cols)
selected_features_rfe = temp[temp==True].index
print(selected_features_rfe)

In [None]:
dfn=dfnew[['EnvironmentSatisfaction', 'JobInvolvement', 'JobSatisfaction',
       'MonthlyIncome', 'Gender', 'BusinessTravel_Travel_Frequently',
       'BusinessTravel_Travel_Rarely', 'Education_Doctorate',
       'Department_Research & Development', 'EducationField_Life Sciences',
       'EducationField_Marketing', 'EducationField_Medical',
       'EducationField_Other', 'EducationField_Technical Degree',
       'JobRole_Human Resources', 'JobRole_Laboratory Technician',
       'JobRole_Manager', 'JobRole_Manufacturing Director',
       'JobRole_Research Director', 'JobRole_Research Scientist',
       'JobRole_Sales Executive', 'OverTime_Yes',
       'Employee Source_Company Website', 'Employee Source_GlassDoor',
       'Employee Source_Indeed', 'Employee Source_Jora',
       'Employee Source_LinkedIn', 'Employee Source_Recruit.net',
       'Employee Source_Referral', 'Employee Source_Seek',
       'YearsAtcompany_11+', 'NumCompaniesWorked_0_5', 'NumCompaniesWorked_6+',
       'PerformanceRating_3', 'PerformanceRating_4', 'StockOptionLevel0_1',
       'StockOptionLevel2_3', 'TotalWorkingYears_11+',
       'TrainingTimesLastYear_0_4', 'TrainingTimesLastYear_5+',
       'YearsInCurrentRole0_10', 'YearsInCurrentRole11+',
       'YearsSinceLastPromotion0_10', 'YearsWithCurrManager0_8',
       'YearsWithCurrManager9+']]

In [None]:
X=dfn
y=dfnew['Attrition']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)

In [None]:
logreg_scaled_features = LogisticRegression()
logreg_scaled_features.fit(X_train,y_train)

In [None]:
y_pred = logreg_scaled_features.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)

# label the confusion matrix  
conf_matrix=pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])

# set size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")
plt.show()

In [None]:
result = classification_report(y_test,y_pred)

# print the result
print(result)

In [None]:
# True Negatives are denoted by 'TN'
# Actual '0' values which are classified correctly
TN = cm[0,0]

# True Positives are denoted by 'TP'
# Actual '1' values which are classified correctly
TP = cm[1,1]

# False Negatives are denoted by 'FN'
# Actual '1' values which are classified wrongly as '0'
FN = cm[1,0]

# False Positives are denoted by 'FP'
# Actual '0' values which are classified wrongly as '1'
FP = cm[0,1]

In [None]:
print(TN,',',TP,',',FN,',',FP)

In [None]:
plt.rcParams['figure.figsize']=(8,5)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
plt.plot(fpr,tpr)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

plt.plot([0, 1], [0, 1],'r--')

plt.title('ROC curve')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')

plt.text(x = 0.05, y = 0.8, s = f"AUC Score: {round(metrics.roc_auc_score(y_test, y_pred), 4)}")
plt.grid(True)

In [None]:
Logistic_regression_after_FS = pd.Series({'Model': "Logistic regression after feature selection",
                     'AUC Score' : metrics.roc_auc_score(y_test, y_pred),
                 'Precision Score': metrics.precision_score(y_test, y_pred),
                 'Recall Score': metrics.recall_score(y_test, y_pred),
                 'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                  'f1-score': metrics.f1_score(y_test, y_pred)})


# appending our result table
# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
result_tabulation = pd.concat([result_tabulation, pd.DataFrame([Logistic_regression_after_FS])], ignore_index = True)

# view the result table
result_tabulation

## Decision Tree

In [None]:
decision_tree_classification = DecisionTreeClassifier(criterion='entropy')

# train model
decision_tree = decision_tree_classification.fit(X_train, y_train)

In [None]:
decision_tree_pred = decision_tree.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, decision_tree_pred)

# label the confusion matrix  
conf_matrix = pd.DataFrame(data = cm, columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])

# set size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
plt.show()

In [None]:
# True Negatives are denoted by 'TN'
# Actual '0' values which are classified correctly
TN = cm[0,0]

# True Positives are denoted by 'TP'
# Actual '1' values which are classified correctly
TP = cm[1,1]

# False Negatives are denoted by 'FN'
# Actual '1' values which are classified wrongly as '0'
FN = cm[1,0]

# False Positives are denoted by 'FP'
# Actual '0' values which are classified wrongly as '1'
FP = cm[0,1]

In [None]:
print("Accuracy is:",metrics.accuracy_score(y_test,decision_tree_pred))

print('train score:',decision_tree.score(X_train,y_train))

print('test score:',decision_tree.score(X_test,y_test))

In [None]:
result = classification_report(y_test, decision_tree_pred)

# print the result
print(result)

In [None]:
# set the figure size
plt.rcParams['figure.figsize']=(8,5)

fpr, tpr, thresholds = roc_curve(y_test, decision_tree_pred)

# plot the ROC curve
plt.plot(fpr,tpr)

# set limits for x and y axes
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

# plot the diagonal, i.e. the ROC of a classifier that guesses at random
plt.plot([0, 1], [0, 1],'r--')

# add the AUC score
plt.text(x = 0.05, y = 0.8, s = f"AUC Score: {round(metrics.roc_auc_score(y_test, decision_tree_pred), 4)}")


# name the plot, and both axes
plt.title('ROC curve')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')

# plot the grid
plt.grid(True)

In [None]:
Decision_tree_metrics = pd.Series({'Model': "Decision Tree ",
                     'AUC Score' : metrics.roc_auc_score(y_test, decision_tree_pred),
                 'Precision Score': metrics.precision_score(y_test, decision_tree_pred),
                 'Recall Score': metrics.recall_score(y_test, decision_tree_pred),
                 'Accuracy Score': metrics.accuracy_score(y_test, decision_tree_pred),
                 
                  'f1-score':metrics.f1_score(y_test, decision_tree_pred)})



# appending our result table
# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
result_tabulation = pd.concat([result_tabulation, pd.DataFrame([Decision_tree_metrics])], ignore_index = True)

# view the result table
result_tabulation

## Pruned Decision Tree

In [None]:
pruned = DecisionTreeClassifier(criterion="entropy", max_depth=25)

# train the classifier
decision_tree_prune = pruned.fit(X_train,y_train)
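A `max_depth` limit only prunes the tree if the unconstrained tree would grow deeper than that limit. A quick way to check is to compare `get_depth()` and `get_n_leaves()` for a tree with and without the cap. The sketch below does this on synthetic data with a cap of 3; the dataset and the cap are assumptions chosen to make the effect visible, not the notebook's values.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Same settings except for the depth cap.
unconstrained = DecisionTreeClassifier(
    criterion="entropy", random_state=0
).fit(X_demo, y_demo)
capped = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, random_state=0
).fit(X_demo, y_demo)

print("unconstrained depth:", unconstrained.get_depth())
print("capped depth:", capped.get_depth())
print("leaves:", unconstrained.get_n_leaves(), "vs", capped.get_n_leaves())
```

If the depth of the earlier, unconstrained `decision_tree` is already below 25, the pruned tree here may turn out to be the same size, which the train and test scores further down would then reflect.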

In [None]:
decision_tree_prune_pred = decision_tree_prune.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, decision_tree_prune_pred)

# label the confusion matrix  
conf_matrix = pd.DataFrame(data = cm, columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])

# set size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
plt.show()

In [None]:
# True Negatives are denoted by 'TN'
# Actual '0' values which are classified correctly
TN = cm[0,0]

# True Positives are denoted by 'TP'
# Actual '1' values which are classified correctly
TP = cm[1,1]

# False Negatives are denoted by 'FN'
# Actual '1' values which are classified wrongly as '0'
FN = cm[1,0]

# False Positives are denoted by 'FP'
# Actual '0' values which are classified wrongly as '1'
FP = cm[0,1]

In [None]:
result = classification_report(y_test,decision_tree_prune_pred)

# print the result
print(result)

In [None]:
print("Accuracy is:",metrics.accuracy_score(y_test,decision_tree_prune_pred))

print('train score:',decision_tree_prune.score(X_train,y_train))

print('test score:',decision_tree_prune.score(X_test,y_test))

In [None]:
plt.rcParams['figure.figsize']=(8,5)

fpr, tpr, thresholds = roc_curve(y_test, decision_tree_prune_pred)

# plot the ROC curve
plt.plot(fpr,tpr)

# set limits for x and y axes
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

# plot the diagonal, i.e. the ROC of a classifier that guesses at random
plt.plot([0, 1], [0, 1],'r--')

# add the AUC score
plt.text(x = 0.05, y = 0.8, s = f"AUC Score: {round(metrics.roc_auc_score(y_test, decision_tree_prune_pred), 4)}")


# name the plot, and both axes
plt.title('ROC curve')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')

# plot the grid
plt.grid(True)

In [None]:
Pruned_Decision_tree_metrics = pd.Series({'Model': "Pruned Decision Tree ",
                     'AUC Score' : metrics.roc_auc_score(y_test, decision_tree_prune_pred),
                 'Precision Score': metrics.precision_score(y_test, decision_tree_prune_pred),
                 'Recall Score': metrics.recall_score(y_test, decision_tree_prune_pred),
                 'Accuracy Score': metrics.accuracy_score(y_test, decision_tree_prune_pred),
               
                  'f1-score':metrics.f1_score(y_test, decision_tree_prune_pred)})



# appending our result table
# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
result_tabulation = pd.concat([result_tabulation, pd.DataFrame([Pruned_Decision_tree_metrics])], ignore_index = True)

# view the result table
result_tabulation

In [None]:
param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [10, 20],
              "max_depth": [3, 5, 10, 20,25],
              "min_samples_leaf": [30, 100, 300],
              "max_leaf_nodes": [None,2,3,5],
              }

In [None]:
decision_tree_Gridsearch = DecisionTreeClassifier()
decision_tree_Gridsearch = GridSearchCV(decision_tree_Gridsearch, param_grid, cv=10)
decision_tree_Gridsearch.fit(X_train, y_train)

## Decision Tree with GridSearchCV

In [None]:
decision_tree_Gridsearch.best_params_

In [None]:
# rebuild the tree with every tuned parameter, including max_leaf_nodes, which the grid also searched
decision_tree_best_parameters = DecisionTreeClassifier(**decision_tree_Gridsearch.best_params_).fit(X_train, y_train)
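By default `GridSearchCV` refits the best parameter combination on the full data passed to `fit` (`refit=True`), so the tuned model is also available as `best_estimator_` without rebuilding it by hand. The sketch below checks this on synthetic data with a deliberately small grid; the dataset, grid and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid, much smaller than the one searched above.
grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 5]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=3)
search.fit(X_demo, y_demo)

# Already fitted on all of X_demo with the best parameters.
best_tree = search.best_estimator_

# Rebuilding by hand from best_params_ gives an equivalent model.
rebuilt = DecisionTreeClassifier(random_state=0, **search.best_params_)
rebuilt.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("same predictions:", (best_tree.predict(X_demo) == rebuilt.predict(X_demo)).all())
```

Unpacking `best_params_` with `**` passes every tuned parameter, so none of the searched settings is silently dropped.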

In [None]:
decision_tree_best_parameters_pred = decision_tree_best_parameters.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, decision_tree_best_parameters_pred)

# label the confusion matrix  
conf_matrix = pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])

# set the size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="YlGnBu")
plt.show()

In [None]:
result = classification_report(y_test,decision_tree_best_parameters_pred)

# print the result
print(result)
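
The figures in the classification report follow directly from the four cells of the confusion matrix. As a worked example, the sketch below computes them by hand; the counts are hypothetical values chosen for illustration, not results from this notebook.

```python
# hypothetical confusion-matrix counts (illustration only)
tn, fp = 230, 20   # row Actual:0 -> Predicted:0, Predicted:1
fn, tp = 30, 14    # row Actual:1 -> Predicted:0, Predicted:1

# share of predicted leavers who actually left
precision = tp / (tp + fp)

# share of actual leavers the model caught
recall = tp / (tp + fn)

# share of all employees classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

# harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(accuracy, 4), round(f1, 4))
```

With imbalanced classes such as attrition, accuracy can stay high while recall for class 1 is low, which is why the result table tracks all of these scores.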

In [None]:
plt.rcParams['figure.figsize']=(8,5)

fpr, tpr, thresholds = roc_curve(y_test, decision_tree_best_parameters_pred)

# plot the ROC curve
plt.plot(fpr,tpr)

# set limits for x and y axes
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

# plot the straight line showing worst prediction for the model
plt.plot([0, 1], [0, 1],'r--')

# add the AUC score as a text label on the plot
plt.text(x = 0.05, y = 0.8, s = f"AUC Score: {round(metrics.roc_auc_score(y_test, decision_tree_best_parameters_pred), 4)}")


# name the plot, and both axes
plt.title('ROC curve')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')

# plot the grid
plt.grid(True)

In [None]:
Decision_tree_GridSearch_metrics = pd.Series({'Model': "Decision Tree (GridSearchCV)",
                                              'AUC Score': metrics.roc_auc_score(y_test, decision_tree_best_parameters_pred),
                                              'Precision Score': metrics.precision_score(y_test, decision_tree_best_parameters_pred),
                                              'Recall Score': metrics.recall_score(y_test, decision_tree_best_parameters_pred),
                                              'Accuracy Score': metrics.accuracy_score(y_test, decision_tree_best_parameters_pred),
                                              'f1-score': metrics.f1_score(y_test, decision_tree_best_parameters_pred)})

# append this model's metrics as a new row of the result table
result_tabulation = pd.concat([result_tabulation, Decision_tree_GridSearch_metrics.to_frame().T.infer_objects()],
                              ignore_index = True)

# view the result table
result_tabulation
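
The ROC curves in this notebook are drawn from hard 0/1 predictions, which gives each curve at most one interior point. A curve over all thresholds needs scores, for example the positive-class probabilities from `predict_proba`. The following is a minimal sketch on synthetic data; `make_classification`, the tree settings and all `demo_*` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# imbalanced synthetic data as a stand-in for the attrition data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.84, 0.16], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=1)

demo_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                   random_state=1).fit(X_tr, y_tr)

# hard class labels versus probabilities of class 1
demo_labels = demo_tree.predict(X_te)
demo_scores = demo_tree.predict_proba(X_te)[:, 1]

# ROC points from labels (single operating point) and from scores (all thresholds)
fpr_l, tpr_l, _ = roc_curve(y_te, demo_labels)
fpr_s, tpr_s, _ = roc_curve(y_te, demo_scores)

auc_labels = roc_auc_score(y_te, demo_labels)
auc_scores = roc_auc_score(y_te, demo_scores)

print("ROC points from labels:", len(fpr_l), " from scores:", len(fpr_s))
print("AUC from labels:", round(auc_labels, 4), " from scores:", round(auc_scores, 4))
```

The two AUC values generally differ, so scores computed one way should not be compared with scores computed the other way in the same table.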

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier


In [None]:
# build a small random forest: 5 trees, each at most 30 levels deep
clf = RandomForestClassifier(n_estimators=5, max_depth=30)

# train the model on the training set
ran = clf.fit(X_train, y_train)

# predict the labels of the test set
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

In [None]:
# demonstration on random data: shift heatmap annotations away from their default position
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.transforms

np.random.seed(0)
data = np.random.randint(100, size=(5, 5))
akws = {"ha": 'left', "va": 'top'}
ax = sns.heatmap(data, annot=True, annot_kws=akws)

# add a small constant offset to the transform of every annotation
for t in ax.texts:
    trans = t.get_transform()
    offs = matplotlib.transforms.ScaledTranslation(0.75, 0.5,
                    matplotlib.transforms.IdentityTransform())
    t.set_transform(offs + trans)

plt.show()

In [None]:
cm = confusion_matrix(y_test, y_pred)

# label the confusion matrix  
conf_matrix = pd.DataFrame(data = cm, columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])

# set the size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap with centred annotations
akws = {"ha": 'center', "va": 'center'}
ax = sns.heatmap(conf_matrix, annot=True, annot_kws=akws, fmt='d', cmap="YlGnBu")

# nudge every annotation by a small constant offset
for t in ax.texts:
    trans = t.get_transform()
    offs = matplotlib.transforms.ScaledTranslation(-0.45, 0.45,
                    matplotlib.transforms.IdentityTransform())
    t.set_transform(offs + trans)

plt.show()


In [None]:
result = classification_report(y_test, y_pred)

# print the result
print(result)

In [None]:
print("Accuracy is:",metrics.accuracy_score(y_test,y_pred))

print('train score:',ran.score(X_train,y_train))

print('test score:',ran.score(X_test,y_test))

In [None]:
plt.rcParams['figure.figsize']=(8,5)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)

# plot the ROC curve
plt.plot(fpr,tpr)

# set limits for x and y axes
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

# plot the straight line showing worst prediction for the model
plt.plot([0, 1], [0, 1],'r--')

# add the AUC score as a text label on the plot
plt.text(x = 0.05, y = 0.8, s = f"AUC Score: {round(metrics.roc_auc_score(y_test, y_pred), 4)}")


# name the plot, and both axes
plt.title('ROC curve')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')

# plot the grid
plt.grid(True)

In [None]:
ran_metrics = pd.Series({'Model': "RandomForest",
                         'AUC Score': metrics.roc_auc_score(y_test, y_pred),
                         'Precision Score': metrics.precision_score(y_test, y_pred),
                         'Recall Score': metrics.recall_score(y_test, y_pred),
                         'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                         'f1-score': metrics.f1_score(y_test, y_pred)})

# append this model's metrics as a new row of the result table
result_tabulation = pd.concat([result_tabulation, ran_metrics.to_frame().T.infer_objects()],
                              ignore_index = True)

# view the result table
result_tabulation

## Random Forest Using Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score

# accuracy of a random forest on each of 10 cross-validation folds of the full data
print(cross_val_score(RandomForestClassifier(max_depth=25), X, y, cv=10))
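
The fold scores are easier to compare with the other models when summarised as a mean and a standard deviation. The following is a minimal sketch on synthetic data; the dataset, the `random_state` values and the `demo_*` names are assumptions for illustration. Passing an explicit `StratifiedKFold` with shuffling keeps the class ratio similar in every fold and makes the split reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# imbalanced synthetic data as a stand-in for X and y (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

# stratified, shuffled folds with a fixed seed
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

demo_forest = RandomForestClassifier(n_estimators=50, max_depth=25, random_state=0)
demo_cv_scores = cross_val_score(demo_forest, X_demo, y_demo, cv=demo_cv)

print("fold accuracies:", demo_cv_scores.round(3))
print("mean: %.3f  std: %.3f" % (demo_cv_scores.mean(), demo_cv_scores.std()))
```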


## Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test) 

In [None]:
cm = confusion_matrix(y_test, y_pred)

# label the confusion matrix  
conf_matrix = pd.DataFrame(data = cm, columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])

# set the size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
plt.show()

In [None]:
result = classification_report(y_test, y_pred)

# print the result
print(result)

In [None]:
print("Accuracy is:",metrics.accuracy_score(y_test,y_pred))

print('train score:',classifier.score(X_train,y_train))

print('test score:',classifier.score(X_test,y_test))

In [None]:
plt.rcParams['figure.figsize']=(8,5)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)

# plot the ROC curve
plt.plot(fpr,tpr)

# set limits for x and y axes
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

# plot the straight line showing worst prediction for the model
plt.plot([0, 1], [0, 1],'r--')

# add the AUC score as a text label on the plot
plt.text(x = 0.05, y = 0.8, s = f"AUC Score: {round(metrics.roc_auc_score(y_test, y_pred), 4)}")


# name the plot, and both axes
plt.title('ROC curve')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')

# plot the grid
plt.grid(True)

In [None]:
Naive_bayes = pd.Series({'Model': "Naive Bayes",
                         'AUC Score': metrics.roc_auc_score(y_test, y_pred),
                         'Precision Score': metrics.precision_score(y_test, y_pred),
                         'Recall Score': metrics.recall_score(y_test, y_pred),
                         'Accuracy Score': metrics.accuracy_score(y_test, y_pred),
                         'f1-score': metrics.f1_score(y_test, y_pred)})

# append this model's metrics as a new row of the result table
result_tabulation = pd.concat([result_tabulation, Naive_bayes.to_frame().T.infer_objects()],
                              ignore_index = True)

# view the result table
result_tabulation

In [None]:
plt.rcParams['figure.figsize']=(28,10)

result_tabulation.plot()

# label each tick with the corresponding model name, whatever the number of rows
plt.xticks(range(len(result_tabulation)), list(result_tabulation.Model))
plt.show()

In [None]:
result_tabulation.to_excel('result.xlsx')

## Ensemble

In [None]:
from sklearn.ensemble import AdaBoostClassifier


cls=AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=200)
cls.fit(X_train,y_train)
yp=cls.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, yp)

# label the confusion matrix  
conf_matrix = pd.DataFrame(data = cm, columns = ['Predicted:0','Predicted:1'], index = ['Actual:0','Actual:1'])

# set the size of the plot
plt.figure(figsize = (8,5))

# plot a heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
plt.show()

In [None]:
# report on the AdaBoost predictions (yp), not the earlier Naive Bayes predictions (y_pred)
result = classification_report(y_test, yp)

# print the result
print(result)

In [None]:
from sklearn import model_selection

In [None]:
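from sklearn.ensemble import BaggingClassifier

seed = 8
# shuffle=True is needed for random_state to have an effect in KFold
kfold = model_selection.KFold(n_splits = 3,
                              shuffle = True,
                              random_state = seed)

# initialize the base classifier
base_cls = DecisionTreeClassifier()

# number of base classifiers
num_trees = 500

# bagging classifier (base estimator passed positionally, since its keyword name differs across scikit-learn versions)
model = BaggingClassifier(base_cls,
                          n_estimators = num_trees,
                          random_state = seed)

# accuracy on each of the 3 cross-validation folds
results = model_selection.cross_val_score(model, X, y, cv = kfold)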
print("accuracy :") 
print(results.mean())

In [None]:
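seed = 8
# shuffle=True is needed for random_state to have an effect in KFold
kfold = model_selection.KFold(n_splits = 3,
                              shuffle = True,
                              random_state = seed)

# initialize the base classifier
base_cls = DecisionTreeClassifier()

# number of base classifiers
num_trees = 500

# AdaBoost classifier (base estimator passed positionally, since its keyword name differs across scikit-learn versions)
model = AdaBoostClassifier(base_cls,
                           n_estimators = num_trees,
                           random_state = seed)

# accuracy on each of the 3 cross-validation folds
results = model_selection.cross_val_score(model, X, y, cv = kfold)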
print("accuracy :") 
print(results.mean()) 
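
AdaBoost is usually paired with shallow trees, since a fully grown tree can already fit the training data perfectly and leaves little for later boosting rounds to correct. The sketch below compares a depth-1 stump with an unrestricted tree as the base estimator. The synthetic dataset, the depth of 1, the 50 rounds and the `demo_*` names are assumptions for illustration, and which variant scores higher depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for X and y (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=8)
demo_kfold = KFold(n_splits=3, shuffle=True, random_state=8)

demo_results = {}
for name, depth in [("stump", 1), ("full tree", None)]:
    base = DecisionTreeClassifier(max_depth=depth, random_state=8)
    # base estimator passed positionally (its keyword name differs across scikit-learn versions)
    booster = AdaBoostClassifier(base, n_estimators=50, random_state=8)
    # mean accuracy over the 3 folds for this base estimator
    demo_results[name] = cross_val_score(booster, X_demo, y_demo, cv=demo_kfold).mean()

print(demo_results)
```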