## Problem Statement:

Attrition is a problem that impacts all businesses, irrespective of geography, industry, and company size. Employee attrition leads to significant costs for a business, including the cost of business disruption and of hiring and training new staff. As such, there is great business interest in understanding the drivers of staff attrition and in minimizing it.

This data set presents an employee survey from IBM, indicating whether or not each employee left the company. The data set contains approximately 24,000 entries. Given the limited size of the data set, the model should only be expected to provide a modest improvement in the identification of attrition over a random allocation of attrition probabilities.

While some level of attrition in a company is inevitable, minimizing it and being prepared for the cases that cannot be helped will significantly improve the operations of most businesses. As a future development, with a sufficiently large data set, the model could be used to segment employees and to define “at risk” categories. This could generate new insights for the business on what drives attrition, insights that cannot be obtained from informational interviews with employees alone.

## Importing the necessary libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Increasing the display width

In [None]:
pd.set_option('display.max_columns',37)

In [None]:
df=pd.read_csv('../input/ibm-hr-analytics-classification/IBM_HR.csv')
df.head()

## Checking for the null values:

In [None]:
df.isnull().sum()[df.isnull().sum()!=0]

In [None]:
Null_values_percentage=(df.isnull().sum().sum()/len(df))*100
Null_values_percentage

In [None]:
### Inference: As null values amount to only about 1.5% of the dataset, we will drop the rows that contain them

In [None]:
df=df.dropna()
df.shape
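
The percentage computed above divides the total number of missing cells by the number of rows, so a row with several missing values is counted more than once. The sketch below, on a small made-up frame (`toy` and its values are hypothetical, not the IBM data), contrasts that figure with the share of rows that `dropna()` would actually remove.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names only mimic the real data set.
toy = pd.DataFrame({
    'Age': [30, np.nan, 41, 25],
    'Gender': ['Male', None, 'Female', 'Male'],
    'DailyRate': [800, 950, np.nan, 1100],
})

# Missing cells divided by the row count (the quantity computed in the notebook).
cells_per_row_pct = toy.isnull().sum().sum() / len(toy) * 100

# Share of rows containing at least one missing value (what dropna() removes).
rows_with_null_pct = toy.isnull().any(axis=1).mean() * 100

print(cells_per_row_pct, rows_with_null_pct)
```

When the two figures differ noticeably, the row-based one is the better guide to how much data is lost by dropping.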

## Checking for duplicate values:

In [None]:
df.drop_duplicates(keep='first',inplace=True)
df.shape

In [None]:
df.info()

## DATA VISUALIZATION

### Age

In [None]:
df['Age'].value_counts()

In [None]:
sns.distplot(df['Age'],hist=True,kde=True,color='k',bins=10)

In [None]:
# Majority of employees lie between the age range of 30 to 40

In [None]:
sns.catplot(x='Age',hue='Attrition',data=df,kind='count',height=15)

In [None]:
# Majority of attritions can be seen in 28 to 33 age group range

### Attrition - Target Variable

In [None]:
df['Attrition'].value_counts()

In [None]:
sns.countplot(x='Attrition',data=df,hue='Gender')

In [None]:
# Male employees outnumber female employees among those who left

### Business Travel

In [None]:
df['BusinessTravel'].value_counts()

In [None]:
sns.countplot(x='BusinessTravel',data=df,hue='Attrition')

In [None]:
sns.catplot(x='BusinessTravel',data=df,hue='Attrition',col='Department',kind='count',height=5)

In [None]:
# Across departments, employees who travel frequently ('Travel_Frequently') appear closest to attrition, most visibly in the HR department.

## <font color='red'>Later we need to transform Business Travel into a numerical column before model building</font>

### Daily Rate

In [None]:
df['DailyRate'].value_counts()

In [None]:
sns.distplot(df['DailyRate'],bins=10,color='k')

In [None]:
df['DailyRate'].mean()

In [None]:
df['DailyRate'].min()

In [None]:
df['DailyRate'].max()

In [None]:
# The average of daily rate is somewhere around 802,minimum is 102,and maximum is 1499.

### Department

In [None]:
df['Department'].value_counts()

In [None]:
sns.countplot(x=df['Department'])

In [None]:
# Around 60% employees are working in R&D Department

In [None]:
sns.catplot(x='Department',hue='Attrition',col='Gender',data=df,kind='count',height=10)

In [None]:
# Sales department has a high attrition rate

### DistanceFromHome

In [None]:
df['DistanceFromHome'].value_counts()

In [None]:
# df.info() shows that 'DistanceFromHome' has object dtype, so we convert it to a numeric type

In [None]:
df['DistanceFromHome']=pd.to_numeric(df['DistanceFromHome'],errors='coerce')

In [None]:
plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
sns.countplot(x=df['DistanceFromHome'])

In [None]:
sns.distplot(df['DistanceFromHome'],color='k',bins=10)

In [None]:
# From the above count plot we can see that there are multiple instances of some numbers in int and float,so we will convert all to a single datatype

In [None]:
df['DistanceFromHome']=df['DistanceFromHome'].astype('int')

In [None]:
df['DistanceFromHome'].value_counts()
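
One detail of the conversion above is worth illustrating: `errors='coerce'` silently turns unparseable entries into `NaN`, and `astype('int')` raises an error if any `NaN` remains. A minimal sketch on a hypothetical column (the values in `raw` are made up) shows how the coerced entries can be counted and dropped before casting.

```python
import pandas as pd

# Hypothetical raw column mixing ints, numeric strings, floats and a bad entry.
raw = pd.Series([2, '2', 10.0, '10', 'unknown'], name='DistanceFromHome')

numeric = pd.to_numeric(raw, errors='coerce')   # 'unknown' becomes NaN
n_bad = int(numeric.isna().sum())               # how many values were coerced

# Drop the coerced entries before casting, otherwise astype('int') raises.
clean = numeric.dropna().astype('int')
print(n_bad, clean.value_counts().to_dict())
```

After the cast, `2`, `'2'` and `2.0` all fall into the same category, which is the effect the notebook is after.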

In [None]:
plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
sns.countplot(x=df['DistanceFromHome'])

In [None]:
sns.distplot(df['DistanceFromHome'],color='k',bins=10)

In [None]:
df['DistanceFromHome'].mean()

In [None]:
df['DistanceFromHome'].min()

In [None]:
df['DistanceFromHome'].max()

In [None]:
# We can see that the avg distance from home is around 9Km, minimum is 1Km and maximum is 29Km.

In [None]:
sns.catplot(x='DistanceFromHome',hue='Attrition',col='Gender',data=df,kind='count',height=15,aspect=0.5)

In [None]:
# For both male and female employees, the attrition rate tends to be higher when the distance exceeds 10 km.

In [None]:
sns.catplot(x='DistanceFromHome',hue='Attrition',col='Department',data=df,kind='count',height=15,aspect=0.5)

In [None]:
# In all departments, the attrition rate tends to be higher when the distance exceeds 10 km.
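
Count plots over many distinct distances show headcounts, so rates have to be judged by eye. As a rough sketch, the distances can be grouped into bands and the attrition rate computed per band. The frame `toy`, the column `Left` and the bin edges are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample; 1 marks an employee who left, 0 one who stayed.
toy = pd.DataFrame({
    'DistanceFromHome': [1, 3, 8, 9, 12, 15, 22, 28],
    'Left':             [0, 0, 0, 1,  0,  1,  1,  1],
})

# Bin edges are an assumption for illustration: (0, 10], (10, 20], (20, 30] km.
toy['DistanceBand'] = pd.cut(toy['DistanceFromHome'], bins=[0, 10, 20, 30])

# The mean of a 0/1 flag within each band is the attrition rate of that band.
rate_by_band = toy.groupby('DistanceBand', observed=True)['Left'].mean()
print(rate_by_band)
```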

### Education

In [None]:
df['Education'].value_counts()

In [None]:
sns.countplot(x=df['Education'])

In [None]:
# Around 30% of employees have education level of 3

In [None]:
sns.catplot(x='Education',hue='Attrition',col='Gender',data=df,kind='count',height=10)

In [None]:
# For both male and female,attrition rate is higher for education level 1,2 and 3.

### EducationField

In [None]:
df['EducationField'].value_counts()

In [None]:
# As there is only 1 record in the 'Test' category, we will merge it into the 'Other' category.

In [None]:
df.loc[df['EducationField']=='Test','EducationField']='Other'

In [None]:
df['EducationField'].value_counts()

In [None]:
plt.xticks(rotation='vertical')
sns.countplot(x=df['EducationField'])

In [None]:
# Around 70% of employees are having 'Life Sciences' and 'Medical' education field.

In [None]:
sns.catplot(x='EducationField',hue='Attrition',col='Gender',data=df,kind='count',height=10)

In [None]:
# Attrition rate of female in 'HR' education field is less when compared to male,
# Attrition rate of female in 'Life Sciences' and 'Medical' is more when compared to male.

### EmployeeCount

In [None]:
df['EmployeeCount'].value_counts()

## <font color='red'>Since EmployeeCount holds the same value for the whole dataset, we will drop this feature before model building.
</font>

### EmployeeNumber

In [None]:
df['EmployeeNumber'].value_counts()

## <font color='red'>Since EmployeeNumber has 23141 distinct values and serves only as an identifier, we will drop this feature before model building.
</font>

### Application ID

In [None]:
df['Application ID'].value_counts()

## <font color='red'>Since the 'Application ID' is unique, we will drop this feature before model building.
</font>

### EnvironmentSatisfaction

In [None]:
df['EnvironmentSatisfaction'].value_counts()

In [None]:
sns.countplot(x=df['EnvironmentSatisfaction'])

In [None]:
# Count of environment satisfaction is more towards 3 and 4.

In [None]:
sns.catplot(x='EnvironmentSatisfaction',hue='Attrition',col='Gender',data=df,kind='count',height=10)

In [None]:
# For both male and female employees, the attrition rate is high when environment satisfaction is 1 or 2.

### Gender

In [None]:
df['Gender'].value_counts()

In [None]:
sns.countplot(x=df['Gender'])

In [None]:
# The male-to-female ratio is approximately 3:2

In [None]:
sns.catplot(x='Gender',hue='Attrition',kind='count',data=df,height=5)

In [None]:
# For better inference, lets calculate male and female attrition rate.

In [None]:
df.loc[(df['Gender']=='Female') & (df['Attrition']=='Voluntary Resignation')]

In [None]:
Female_Attrition_Rate=1420/9283
Female_Attrition_Rate

In [None]:
df.loc[(df['Gender']=='Male') & (df['Attrition']=='Voluntary Resignation')]

In [None]:
Male_Attrition_Rate=2243/13907
Male_Attrition_Rate

In [None]:
# Hence, Male attrition rate is slightly higher than Female attrition rate.
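
The two rates above are typed in from counts read off earlier outputs, so they would go stale if any cleaning step changed. A minimal sketch of deriving them directly from the data follows. The frame `toy` is hypothetical, and the label `'Current employee'` is an assumed name for the non-attrition class; only `'Voluntary Resignation'` is taken from the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the real data.
toy = pd.DataFrame({
    'Gender': ['Female'] * 4 + ['Male'] * 5,
    'Attrition': ['Current employee', 'Voluntary Resignation',
                  'Current employee', 'Current employee',
                  'Voluntary Resignation', 'Current employee',
                  'Voluntary Resignation', 'Current employee',
                  'Current employee'],
})

# Flag leavers, then average the flag within each gender to get the rate.
left = toy['Attrition'] == 'Voluntary Resignation'
attrition_rate = left.groupby(toy['Gender']).mean()
print(attrition_rate)
```

The same pattern works for any categorical column by swapping `toy['Gender']` for another grouping key.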

### HourlyRate

In [None]:
df['HourlyRate'].value_counts()

In [None]:
# From info we can see that HourlyRate has object dtype, so let's convert it to integer

In [None]:
df.info()

In [None]:
df['HourlyRate']=df['HourlyRate'].astype('int')

In [None]:
df.info()

In [None]:
sns.distplot(df['HourlyRate'],color='k',bins=10)

In [None]:
df['HourlyRate'].mean()

In [None]:
df['HourlyRate'].min()

In [None]:
df['HourlyRate'].max()

In [None]:
# Avg hourly rate is around 65 and max hourly rate is 100. The stated minimum of 65 looks inconsistent with an average of about 65; check it against the min() output above.

In [None]:
sns.catplot(x='HourlyRate',hue='Attrition',kind='count',data=df,height=15,aspect=1)

In [None]:
# There is no clear evidence that HourlyRate has any impact on attrition of employees.

### JobInvolvement

In [None]:
df['JobInvolvement'].value_counts()

In [None]:
sns.countplot(x=df['JobInvolvement'])

In [None]:
# Majority of employees lie in the job involvement 2 and 3

In [None]:
sns.catplot(x='JobInvolvement',hue='Attrition',col='Gender',data=df,kind='count')

In [None]:
# Job involvement 3 has a slightly higher attrition rate than the others.

### JobLevel

In [None]:
df['JobLevel'].value_counts()

In [None]:
sns.countplot(x=df['JobLevel'])

In [None]:
# Majority of employees lie in the job level 1 and 2

In [None]:
sns.catplot(x='JobLevel',hue='Attrition',col='Gender',data=df,kind='count')

In [None]:
# Attrition rate is higher in job level 1 and 2.

### JobRole

In [None]:
df['JobRole'].value_counts()

In [None]:
plt.xticks(rotation='vertical')
sns.countplot(x=df['JobRole'])

In [None]:
# Count of employees is more in job role as Sales Executive,Laboratory Technician,Research Scientist.

In [None]:
g=sns.catplot(x='JobRole',hue='Attrition',col='Gender',data=df,kind='count',height=7)
g.set_xticklabels(rotation=90)

In [None]:
# Job role as Sales Representative has the highest attrition rate for both male and female,
# Job role as HR has high rate of attrition in case of female gender.

### JobSatisfaction

In [None]:
df['JobSatisfaction'].value_counts()

In [None]:
sns.countplot(x=df['JobSatisfaction'])

In [None]:
# Job Satisfaction count for 3 and 4 are more than 1 and 2.

In [None]:
sns.catplot(x='JobSatisfaction',hue='Attrition',col='Gender',data=df,kind='count',height=7)

In [None]:
# Higher attrition rate can be seen in Job Satisfaction level 1 and 2.

### MaritalStatus

In [None]:
df['MaritalStatus'].value_counts()

In [None]:
sns.countplot(x=df['MaritalStatus'])

In [None]:
# Married employees form the largest group

In [None]:
sns.catplot(x='MaritalStatus',hue='Attrition',col='Gender',data=df,kind='count',height=7)

In [None]:
# The attrition rate among singles is higher for both male and female employees

### MonthlyIncome

In [None]:
df['MonthlyIncome'].value_counts()

In [None]:
# As the MonthlyIncome column has object dtype, we need to convert it to integer.

In [None]:
df['MonthlyIncome']=df['MonthlyIncome'].astype('int')

In [None]:
sns.distplot(df['MonthlyIncome'],bins=10,color='k')

In [None]:
df['MonthlyIncome'].mean()

In [None]:
df['MonthlyIncome'].min()

In [None]:
df['MonthlyIncome'].max()

In [None]:
# Minimum monthly income of employees is 1009 and maximum monthly income of employees is 19999 and avg monthly income of employees is 6507.
# Majority of employees are having monthly income lower than 5000.

### MonthlyRate

In [None]:
df['MonthlyRate'].value_counts()

In [None]:
sns.distplot(df['MonthlyRate'],bins=20,color='k')

In [None]:
df['MonthlyRate'].mean()

In [None]:
df['MonthlyRate'].min()

In [None]:
df['MonthlyRate'].max()

In [None]:
# Avg monthly rate of employees is around 14302,min monthly rate is 2094 and max monthly rate is 26999.

### NumCompaniesWorked

In [None]:
df['NumCompaniesWorked'].value_counts()

In [None]:
sns.countplot(x=df['NumCompaniesWorked'])

In [None]:
# Most employees have worked at only 1 company.

In [None]:
sns.catplot(x='NumCompaniesWorked',hue='Attrition',col='Gender',data=df,kind='count',height=7)

In [None]:
# It can be observed that employees who have worked in 1 company have higher attrition rate

### Over18

In [None]:
df['Over18'].value_counts()

## <font color='red'>As all the employees are over 18,so we will drop this feature before model building
</font>


### OverTime

In [None]:
df['OverTime'].value_counts()

In [None]:
sns.countplot(x=df['OverTime'])

In [None]:
# Approximately ratio of employees doing overtime and employees not doing overtime is 30:70

In [None]:
sns.catplot(x='OverTime',hue='Attrition',col='Gender',data=df,kind='count',height=7)

In [None]:
# A very high attrition rate is seen in employees who are doing overtime for both male and female.

In [None]:
sns.catplot(x='OverTime',hue='Gender',data=df,kind='count',height=7)

In [None]:
# Male employees outnumber female employees in both overtime groups; this plot shows headcount by gender, not attrition.

### PercentSalaryHike

In [None]:
df['PercentSalaryHike'].value_counts()

In [None]:
sns.countplot(x=df['PercentSalaryHike'])

In [None]:
# Majority of employees got a salary hike less than 15%

In [None]:
sns.catplot(x='PercentSalaryHike',hue='Attrition',col='Gender',data=df,kind='count',height=7)

In [None]:
# Higher attrition is observed in cases where the salary hike is less than 16% for male when compared to female.

### PerformanceRating

In [None]:
df['PerformanceRating'].value_counts()

In [None]:
sns.countplot(x=df['PerformanceRating'])

In [None]:
# There are very few employees who have performance rating 4.

In [None]:
sns.catplot(x='PerformanceRating',hue='Attrition',col='Gender',data=df,kind='count',height=7)

In [None]:
# Performance Rating 3 has higher rate of attrition for both male and female.

### RelationshipSatisfaction

In [None]:
df['RelationshipSatisfaction'].value_counts()

In [None]:
sns.countplot(x=df['RelationshipSatisfaction'])

In [None]:
# Count of employees having relationship satisfaction 3,4 are more than 1,2.

In [None]:
sns.catplot(x='RelationshipSatisfaction',hue='Attrition',col='Gender',data=df,kind='count',height=7)

In [None]:
# Higher attrition is observed in lower relationship satisfaction for both genders

### StandardHours

In [None]:
df['StandardHours'].value_counts()

## <font color='red'>Since StandardHours is set to 80 for all employees, we will drop this feature
</font>

### StockOptionLevel

In [None]:
df['StockOptionLevel'].value_counts()

In [None]:
sns.countplot(x=df['StockOptionLevel'])

In [None]:
# Many employees do not have a stock option level,
# As the stock options level increases the count of employees reduces.

In [None]:
sns.catplot(x='StockOptionLevel',hue='Attrition',col='Gender',data=df,kind='count',height=7)

In [None]:
# Higher attrition rate is observed in lower stock options level for both genders.

### TotalWorkingYears

In [None]:
df['TotalWorkingYears'].value_counts()

In [None]:
sns.distplot(df['TotalWorkingYears'],bins=10,color='k')

In [None]:
plt.figure(figsize=(10,10))
plt.xticks(rotation='vertical')
sns.countplot(x=df['TotalWorkingYears'])

In [None]:
# Maximum number of employees have total working years as 10 and the count decreases gradually after 10 years.

In [None]:
sns.catplot(x='TotalWorkingYears',hue='Attrition',data=df,kind='count',height=15)

In [None]:
# Higher attrition rate is observed for employees having total working years less than 10 years.

### TrainingTimesLastYear

In [None]:
df['TrainingTimesLastYear'].value_counts()

In [None]:
sns.countplot(x=df['TrainingTimesLastYear'])

In [None]:
# Most employees were trained 2 to 3 times last year

In [None]:
sns.catplot(x='TrainingTimesLastYear',hue='Attrition',col='Gender',data=df,kind='count',height=10)

In [None]:
# Higher attrition rate can be seen where number of trainings given to employees are less for both gender.

### WorkLifeBalance

In [None]:
df['WorkLifeBalance'].value_counts()

In [None]:
sns.countplot(x=df['WorkLifeBalance'])

In [None]:
# Count of employees having worklife balance as 3 is more wrt others

In [None]:
sns.catplot(x='WorkLifeBalance',hue='Attrition',col='Gender',data=df,kind='count',height=10)

In [None]:
# Lower work life balance has somewhat high rate of attrition

In [None]:
sns.catplot(x='WorkLifeBalance',hue='Attrition',col='Department',data=df,kind='count',height=7)

In [None]:
# HR Department has less attrition rate in any cases of work life balance

### YearsAtCompany

In [None]:
df['YearsAtCompany'].value_counts()

In [None]:
sns.distplot(df['YearsAtCompany'],bins=20,color='k')

In [None]:
# Most employees have worked at the company for less than 8 years

In [None]:
sns.catplot(x='YearsAtCompany',hue='Attrition',data=df,kind='count',height=15)

In [None]:
# We can see higher attrition rate for those employees who have worked for less than 10 years

### YearsInCurrentRole

In [None]:
df['YearsInCurrentRole'].value_counts()

In [None]:
sns.distplot(df['YearsInCurrentRole'],bins=20,color='k')

In [None]:
# Count of employees having 2 to 3 years in current role are more.

In [None]:
sns.catplot(x='YearsInCurrentRole',hue='Attrition',data=df,kind='count',height=10)

In [None]:
# After 5 years in same role,attrition rate gradually decreases with increase in years.

In [None]:
df.info()

### YearsSinceLastPromotion

In [None]:
df['YearsSinceLastPromotion'].value_counts()

In [None]:
sns.distplot(df['YearsSinceLastPromotion'],bins=20,color='k')

In [None]:
sns.countplot(x=df['YearsSinceLastPromotion'])

In [None]:
# Majority of employees are in the category of having 0,1 or 2 years since last promotion.

In [None]:
sns.catplot(x='YearsSinceLastPromotion',hue='Attrition',data=df,kind='count',height=10)

In [None]:
# Attrition rate is higher where Years since last promotion is less than 7

### YearsWithCurrManager

In [None]:
df['YearsWithCurrManager'].value_counts()

In [None]:
sns.distplot(df['YearsWithCurrManager'],bins=20,color='k')

In [None]:
plt.figure(figsize=(10,7))
plt.xticks(rotation='vertical')
sns.countplot(x=df['YearsWithCurrManager'])

In [None]:
# Most employees have been working with their current manager for around 2 years.

In [None]:
sns.catplot(x='YearsWithCurrManager',hue='Attrition',data=df,kind='count',height=10)

In [None]:
# As employees work more years with the same manager, they may grow attached to that manager and settle into a comfort zone.
# Hence, they tend to be retained for a longer period of time.
# There are a few exceptions where attrition is high even after many years, which may be due to internal disputes, so regular counselling should be considered.

### Employee Source

In [None]:
df['Employee Source'].value_counts()

In [None]:
# Since there is only 1 entry in 'Test', we will merge it into the 'Company Website' group

In [None]:
df.loc[df['Employee Source']=='Test','Employee Source']='Company Website'

In [None]:
df['Employee Source'].value_counts()

In [None]:
plt.xticks(rotation='vertical')
sns.countplot(x=df['Employee Source'])

In [None]:
# Around 25% of employees were sourced through the Company Website, so management should invest in strengthening this channel.

In [None]:
sns.catplot(x='Employee Source',hue='Attrition',col='Gender',data=df,kind='count',height=10)

In [None]:
# At the same time, the highest attrition is observed among employees who joined the organization through the company's website.
# Hence, the information presented on the website should be checked against the reality of the job.

## DATA CLEANING

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

### Dropping the unnecessary columns

In [None]:
df1=df.drop(['EmployeeCount','EmployeeNumber','Application ID','StandardHours','Over18'],axis=1)

In [None]:
df1.shape

In [None]:
df1.head()

## Checking the correlation of target variable with other features:

In [None]:
df1['Attrition']=df1['Attrition'].apply(lambda x:1 if x=='Voluntary Resignation' else 0)

In [None]:
plt.figure(figsize=(20,15))
corr = df1.select_dtypes(include='number').corr()  # correlate numeric columns only
ax = sns.heatmap(corr,cmap='rainbow',mask=abs(corr)<0.05,annot=True)
bottom,top = ax.get_ylim()
ax.set_ylim(bottom+0.5,top-0.5)
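
A large heatmap makes it hard to pick out the features most related to the target. As a complement, the target's row of the correlation matrix can be extracted and ranked by magnitude. The sketch below uses synthetic data: `toy`, `OverTime` and `Noise` are hypothetical, with `OverTime` deliberately constructed to be related to the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Hypothetical numeric frame: 'OverTime' drives the target, 'Noise' does not.
overtime = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    'OverTime': overtime,
    'Noise': rng.normal(size=n),
    'Attrition': (overtime + rng.normal(scale=0.5, size=n) > 0.8).astype(int),
})

# Correlation of every feature with the target, strongest (by magnitude) first.
target_corr = toy.corr()['Attrition'].drop('Attrition')
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Ranking by absolute value keeps strongly negative correlations near the top, which a plain sort would push to the bottom.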

### Based on the above correlation, let's derive some important factors responsible for attrition of employees.

# INFERENCES

In [None]:
# Impact of Age on Attrition of employees
sns.catplot(x='Age',hue='Attrition',data=df,kind='count',height=15)

### <font color='blue'>INFERENCE 1: The HR team should focus on young employees, particularly those under 35. Careful attention should be given to employees aged 18, 19 and 20, since at those ages the number of employees who left exceeds the number of current employees. The company therefore loses the investment it makes in training these candidates when they leave.</font>

In [None]:
# Impact of Job Level on Attrition of employees
sns.catplot(x='JobLevel',hue='Attrition',data=df,kind='count')

### <font color='blue'>INFERENCE 2: Employees having Job level 1 and 2 are not satisfied and are having a high attrition rate. So, HR should focus on those set of employees who are having lower job levels.
</font>

In [None]:
# Impact of Marital Status on Attrition of employees
sns.catplot(x='MaritalStatus',hue='Attrition',data=df,kind='count',height=7)

### <font color='blue'>INFERENCE 3: HR should focus more on employees who are singles as their attrition rate is higher.
</font>

In [None]:
# Monthly Income affecting Attrition rate:
sns.barplot(x='Attrition',y='MonthlyIncome',data=df)

In [None]:
sns.relplot(x='JobInvolvement',y='MonthlyIncome',hue='Attrition',data=df,size='MonthlyIncome')

### <font color='blue'>INFERENCE 4: The HR team should counsel employees whose job involvement is low (i.e., 1). Another important observation is that, irrespective of job involvement, employees with a lower monthly income show the highest attrition.
</font>

In [None]:
# Business Travel affecting attrition rate
sns.countplot(x='BusinessTravel',hue='Attrition',data=df)

### <font color='blue'>INFERENCE 5: Employees who travel for business are more likely to leave than employees who do not.
</font>

## Checking for outliers in the dataset:

In [None]:
df1.head()

In [None]:
# When checking for outliers we will consider only the continuous features
# Let's check the value counts of all the features

In [None]:
for i in df1.columns:
    print(i)
    print('value_counts :-','\n',df1[i].value_counts(),'\n'*3)

In [None]:
list=['Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','TotalWorkingYears','YearsAtCompany',
      'YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']

In [None]:
for i in list:
    sns.boxplot(y=df1[i])
    plt.show()

#### Monthly Income, Total Working Years, Years At Company, Years In Current Role, Years Since Last Promotion, Years with Current Manager have outliers.

In [None]:
# We will use Z-score to remove outliers

In [None]:
import scipy.stats as st
outliers = st.zscore(df1[['Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','TotalWorkingYears'
                          ,'YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']])

In [None]:
df1 = df1[(abs(outliers)<3).all(axis=1)]
df1.head()
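
To make the filtering rule concrete, here is a small sketch of the same z-score cut on synthetic data. The frame `toy` and its values are hypothetical; the threshold of 3 is the one used in the notebook.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(0)
# Hypothetical incomes: 200 typical values plus one extreme value.
toy = pd.DataFrame({
    'MonthlyIncome': np.append(rng.normal(5000, 1000, 200), 50000.0)
})

# Absolute z-score of each value, computed column by column.
z = np.abs(st.zscore(toy[['MonthlyIncome']].to_numpy()))

# Keep a row only if every column lies within 3 standard deviations.
keep = (z < 3).all(axis=1)
filtered = toy[keep]
print(len(toy), len(filtered))
```

Because `.all(axis=1)` requires every column to pass, a row is dropped as soon as a single feature is extreme, so the number of removed rows grows with the number of columns checked.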

In [None]:
for i in list:
    sns.boxplot(y=df1[i])
    plt.show()

## Checking for skewness of the continous features:

In [None]:
for i in list:
    print(i,' : ',df[i].skew())

In [None]:
for i in list:
    sns.distplot(df[i])
    plt.show()

In [None]:
# We will do boxcox transformation for fixing the skewness of the dataset

In [None]:
for i in list:
    df1[i]=st.boxcox(df1[i]+1)[0]
df1[list].skew()  # skewness of the transformed continuous columns only
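
The effect of the Box-Cox step can be checked on a single synthetic feature. In this sketch `years` is a hypothetical right-skewed variable, and the `+ 1` shift mirrors the notebook, since Box-Cox requires strictly positive input.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(0)
# Hypothetical right-skewed, non-negative feature (think of YearsAtCompany).
years = pd.Series(rng.exponential(scale=5.0, size=1000))

# boxcox returns the transformed values and the fitted lambda parameter.
transformed, lam = st.boxcox((years + 1).to_numpy())

skew_before = years.skew()
skew_after = pd.Series(transformed).skew()
print('skew before:', skew_before)
print('skew after :', skew_after, 'lambda:', lam)
```

Comparing the skewness before and after, column by column, is a quick way to confirm that the transformation helped for each feature.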

## CONVERTING CATEGORICAL COLUMNS TO NUMERICAL COLUMNS

In [None]:
df2=df1.copy()

In [None]:
df2.info()

### Encoding categorical columns:

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

In [None]:
df2['BusinessTravel'].value_counts()

In [None]:
# We will do get dummies or ohe for this column

In [None]:
df2['Gender'].value_counts()

In [None]:
df2['Gender']=le.fit_transform(df2['Gender'])

In [None]:
df2['JobRole'].value_counts()

In [None]:
# We will do get dummies or ohe for this column

In [None]:
df2['JobSatisfaction'].value_counts()

In [None]:
df2['JobSatisfaction']=df2['JobSatisfaction'].astype('int')

In [None]:
df2.info()

In [None]:
df2['MaritalStatus'].value_counts()

In [None]:
# We will do get dummies or ohe for this column

In [None]:
df2['OverTime'].value_counts()

In [None]:
df2['OverTime']=le.fit_transform(df2['OverTime'])

In [None]:
df2['PercentSalaryHike'].value_counts()

In [None]:
df2['PercentSalaryHike']=df2['PercentSalaryHike'].astype('int')

In [None]:
df2['Employee Source'].value_counts()

In [None]:
# We will do get dummies or ohe for this column

In [None]:
df2.info()

In [None]:
df2=pd.get_dummies(df2,drop_first=True)

In [None]:
df2.head()
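
For readers unfamiliar with `drop_first=True`, a small sketch on a hypothetical frame shows which columns `pd.get_dummies` produces and which category becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
toy = pd.DataFrame({
    'MaritalStatus': ['Single', 'Married', 'Divorced', 'Single'],
    'Age': [25, 40, 38, 29],
})

# drop_first=True drops the first category in sorted order ('Divorced' here),
# which is then represented by all dummy columns being zero.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters for the statsmodels logistic regression fitted later.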

In [None]:
Unscaled_data=df2.drop('Attrition',axis=1)
Unscaled_data

In [None]:
plt.figure(figsize=(50,35))
ax = sns.heatmap(df2.corr(),annot=True,mask=abs(df2.corr())<0.05)
bottom,top = ax.get_ylim()
ax.set_ylim(bottom+0.5,top-0.5)

### Marital Status Single, Business Travel Frequently, Job Role Manager, Department Sales are the sub features which are mostly contributing towards attrition of employees

In [None]:
# Now our dataset is cleaned and ready for processing

### Splitting the dataset into independent features 'X' and target variable 'y'

In [None]:
X=df2.drop('Attrition',axis=1)
y=df2['Attrition']

### Statistical Model:

In [None]:
import statsmodels.api as sm

In [None]:
X_con=sm.add_constant(X.astype(float))  # cast dummy columns to float so Logit receives numeric input

In [None]:
model=sm.Logit(y,X_con).fit()
result=model.summary()
result

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier as KNN
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
import lightgbm as lgb
from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier
from tpot import TPOTClassifier
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score


In [None]:
# Splitting dataset in train and test:

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.3, random_state=0)
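
With an imbalanced target, a purely random split can leave the two partitions with slightly different class proportions. One option, shown here as a sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical), is the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 15% positives.
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```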

In [None]:
# Data Scaling using standard scaler
# Apply classifier

In [None]:
pipeline_lr=Pipeline([('scalar1',StandardScaler()),
                     ('lr',LogisticRegression())])
pipeline_dt=Pipeline([('scaler2',StandardScaler()),
                     ('dt',DecisionTreeClassifier())])
pipeline_rf=Pipeline([('scalar3',StandardScaler()),
                     ('rfc',RandomForestClassifier())])
pipeline_knn=Pipeline([('scalar4',StandardScaler()),
                     ('knn',KNN())])
pipeline_xgbc=Pipeline([('scalar5',StandardScaler()),
                     ('xgboost',XGBClassifier())])
pipeline_lgbc=Pipeline([('scalar6',StandardScaler()),
                     ('lgbc',lgb.LGBMClassifier())])
pipeline_ada=Pipeline([('scalar7',StandardScaler()),
                     ('adaboost',AdaBoostClassifier())])
pipeline_sgdc=Pipeline([('scalar8',StandardScaler()),
                     ('sgradient',SGDClassifier())])
pipeline_nb=Pipeline([('scalar9',StandardScaler()),
                     ('nb',GaussianNB())])
pipeline_extratree=Pipeline([('scalar10',StandardScaler()),
                     ('extratree',ExtraTreesClassifier())])
pipeline_svc=Pipeline([('scalar11',StandardScaler()),
                     ('svc',SVC())])
pipeline_gbc=Pipeline([('scalar12',StandardScaler()),
                     ('GBC',GradientBoostingClassifier())])

In [None]:
# Lets make the list of pipelines

In [None]:
pipelines=[pipeline_lr,pipeline_dt,pipeline_rf,pipeline_knn,pipeline_xgbc,pipeline_lgbc,pipeline_ada,
           pipeline_sgdc,pipeline_nb,pipeline_extratree,pipeline_svc,pipeline_gbc]

In [None]:
pipe_dict={0:'Logistic Regression',1:'Decision Tree',2:'RandomForestClassifier',3:'KNN',4:'XGBC',5:'LGBC',6:'ADA',7:'SGDC',8:'NB',9:'ExtraTree',10:'SVC',11:'GBC'}

In [None]:
# Let's check whether the target variable is balanced or not:

In [None]:
sns.countplot(x=df2['Attrition'])
df2['Attrition'].value_counts()

In [None]:
# As the dataset is highly imbalanced, we will use SMOTE:

In [None]:
smote = SMOTE(sampling_strategy='auto')

In [None]:
X_sm, y_sm = smote.fit_resample(X_train,y_train)
print(X_sm.shape, y_sm.shape)
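
To give an intuition for what SMOTE does, the sketch below generates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. This is a simplified illustration in plain NumPy, not the imbalanced-learn implementation; `smote_like`, its parameters and the sample points are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority rows; a row is not its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class points in two dimensions.
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0], [2.5, 2.0]])
synthetic = smote_like(X_min, n_new=10)
print(synthetic.shape)
```

Every synthetic row lies on a line segment between two real minority rows, which is why oversampling is applied to the training split only and the test set is left untouched.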

### With SMOTE - Base model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix , accuracy_score, classification_report,roc_auc_score, roc_curve

### Logistic Regression:

In [None]:
lr=LogisticRegression()
lr.fit(X_sm,y_sm)
y_train_pred=lr.predict(X_sm)
y_train_prob=lr.predict_proba(X_sm)[:,1]

y_test_pred=lr.predict(X_test)
y_test_prob=lr.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_sm,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_sm,y_train_pred))
print('Classification Report-Train\n',classification_report(y_sm,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_sm,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,threshold= roc_curve(y_test,y_test_prob)
threshold[0]=threshold[0]-1
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax2=ax1.twinx()
ax2.plot(fpr,threshold,'-g')
ax2.set_ylabel('THRESHOLD')
plt.show()
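
The threshold curve plotted above can also be used to pick an operating point other than the default 0.5. One common heuristic, sketched here on made-up scores (`y_true` and `y_prob` are hypothetical), is Youden's J statistic, which selects the threshold that maximises TPR minus FPR.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Youden's J statistic: the threshold maximising TPR - FPR.
best = np.argmax(tpr - fpr)
print('threshold:', thresholds[best], 'TPR:', tpr[best], 'FPR:', fpr[best])
```

Whether this is the right criterion depends on the relative cost of missing a leaver versus flagging a stayer, which is a business decision rather than a statistical one.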

### Without SMOTE-Base Model

In [None]:
lr=LogisticRegression()
lr.fit(X_train,y_train)
y_train_pred=lr.predict(X_train)
y_train_prob=lr.predict_proba(X_train)[:,1]

y_test_pred=lr.predict(X_test)
y_test_prob=lr.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,threshold= roc_curve(y_test,y_test_prob)
threshold[0]=threshold[0]-1
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax2=ax1.twinx()
ax2.plot(fpr,threshold,'-g')
ax2.set_ylabel('THRESHOLD')
plt.show()

### Comparing Models ROC-AUC Curve:

In [None]:
# train models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# logistic regression
model1 = LogisticRegression()

# knn
model2 = KNeighborsClassifier()

# Random Forest Classifier
model3 = RandomForestClassifier()

# XGBClassifier
model4=XGBClassifier()

# fit model
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)
model4.fit(X_train, y_train)

# predict probabilities
pred_prob1 = model1.predict_proba(X_test)
pred_prob2 = model2.predict_proba(X_test)
pred_prob3 = model3.predict_proba(X_test)
pred_prob4 = model4.predict_proba(X_test)

In [None]:
from sklearn.metrics import roc_curve

# roc curve for models
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1[:,1], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2[:,1], pos_label=1)
fpr3, tpr3, thresh3 = roc_curve(y_test, pred_prob3[:,1], pos_label=1)
fpr4, tpr4, thresh4 = roc_curve(y_test, pred_prob4[:,1], pos_label=1)

# roc curve for tpr = fpr 
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)
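
The cell above builds the `tpr = fpr` reference line by scoring every sample with the same value. A small sketch of why this works, using made-up labels (`y_toy` and `constant_scores` are illustrative names): with constant scores, `roc_curve` returns only the two end points of the diagonal.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels containing both classes
y_toy = np.array([0, 0, 1, 1, 0, 1])
# Every sample gets the same score, so no threshold can separate the classes
constant_scores = np.zeros(len(y_toy))

chance_fpr, chance_tpr, _ = roc_curve(y_toy, constant_scores, pos_label=1)
print(chance_fpr, chance_tpr)
```

The same reference line could also be drawn directly with `plt.plot([0, 1], [0, 1])`.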

In [None]:
from sklearn.metrics import roc_auc_score

# auc scores
auc_score1 = roc_auc_score(y_test, pred_prob1[:,1])
auc_score2 = roc_auc_score(y_test, pred_prob2[:,1])
auc_score3 = roc_auc_score(y_test, pred_prob3[:,1])
auc_score4 = roc_auc_score(y_test, pred_prob4[:,1])

print(auc_score1, auc_score2,auc_score3, auc_score4)

In [None]:
# matplotlib
import matplotlib.pyplot as plt
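# 'seaborn' was renamed to 'seaborn-v0_8' in newer matplotlib releases
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')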

# plot roc curves
plt.plot(fpr1, tpr1, linestyle='--',color='orange', label='Logistic Regression')
plt.plot(fpr2, tpr2, linestyle='--',color='green', label='KNN')
plt.plot(fpr3, tpr3, linestyle='--',color='red', label='Random Forest')
plt.plot(fpr4, tpr4, linestyle='--',color='black', label='XGBC')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive Rate')

plt.legend(loc='best')
plt.savefig('ROC',dpi=300)
plt.show();

### Multiple Base Model Performance:

In [None]:
for i in pipelines:
    i.fit(X_sm,y_sm)
    y_pred=i.predict(X_test)
    print('Classification Report : ', i[1] ,'\n',(classification_report(y_test,y_pred)))
    print('f1-score : ', i[1],' : ',(f1_score(y_test,y_pred)))
    print('\n'*2,'------------------------------------------------------------------------------------------------')

## KNN

In [None]:
knn=KNN(n_neighbors=9)
knn.fit(X_train,y_train)
y_train_pred=knn.predict(X_train)
y_train_prob=knn.predict_proba(X_train)[:,1]

y_test_pred=knn.predict(X_test)
y_test_prob=knn.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,thresholds= roc_curve(y_test,y_test_prob)
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax2=ax1.twinx()
ax2.plot(fpr,thresholds,'-g')
ax2.set_ylabel('THRESHOLDS')
plt.show()

### Random Forest Classifier:

In [None]:
rf=RandomForestClassifier(max_depth=15, min_samples_leaf=10, min_samples_split=20,
                       n_estimators=5)
rf.fit(X_train,y_train)
y_train_pred=rf.predict(X_train)
y_train_prob=rf.predict_proba(X_train)[:,1]

y_test_pred=rf.predict(X_test)
y_test_prob=rf.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,thresholds= roc_curve(y_test,y_test_prob)
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax2=ax1.twinx()
ax2.plot(fpr,thresholds,'-g')
ax2.set_ylabel('THRESHOLD')
plt.show()

### XGBClassifier:

In [None]:
xgbc=XGBClassifier()
xgbc.fit(X_train,y_train)
y_train_pred=xgbc.predict(X_train)
y_train_prob=xgbc.predict_proba(X_train)[:,1]

y_test_pred=xgbc.predict(X_test)
y_test_prob=xgbc.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,thresholds= roc_curve(y_test,y_test_prob)
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax2=ax1.twinx()
ax2.plot(fpr,thresholds,'-g')
ax2.set_ylabel('THRESHOLDS')
plt.show()

### Feature Importance using different classifiers:

In [None]:
pipeline=[DecisionTreeClassifier(),RandomForestClassifier(),XGBClassifier(),
        ExtraTreesClassifier()]

In [None]:
for i in pipeline:
    i.fit(X,y)
    print(i)
    # Plot the eight largest feature importances for this classifier
    imp_features = pd.Series(i.feature_importances_,index=X.columns)
    plt.figure(figsize =(10,10))
    imp_features.nlargest(8).sort_values(ascending=True).plot(kind='barh')
    plt.show()

In [None]:
a=[]
for i in pipeline:
    i.fit(X,y)
    imp_features = pd.Series(i.feature_importances_,index=X.columns)
    # Names of the eight most important features, most important first
    x = pd.DataFrame(imp_features.nlargest(8).sort_values(ascending=False))
    a.append(x.index.values)
# One row per classifier, one column per rank
b=pd.DataFrame(a)

In [None]:
c=b.T
c

In [None]:
c[0]

In [None]:
d=pd.DataFrame()
for i in c.columns:
    d=pd.concat([d,c[i]],ignore_index=True)
print(d)

In [None]:
d = d.rename(columns={0: 'Imp_Features'})

In [None]:
d

In [None]:
d['Imp_Features'].value_counts()

In [None]:
d['Imp_Features'].unique()
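
The cells above stack the top features from each classifier into one column and count how often each name appears. A shorter route to the same tally is sketched below with `collections.Counter`. The feature lists are made up for illustration and are not results from this notebook; `top_features_per_model` and `votes` are hypothetical names.

```python
from collections import Counter

# Made-up top-3 lists for three hypothetical models (not results from this notebook)
top_features_per_model = {
    'model_a': ['Age', 'MonthlyIncome', 'OverTime'],
    'model_b': ['MonthlyIncome', 'DailyRate', 'Age'],
    'model_c': ['OverTime', 'MonthlyIncome', 'HourlyRate'],
}

# Count in how many top lists each feature name appears
votes = Counter(name for names in top_features_per_model.values() for name in names)
for name, count in votes.most_common():
    print(name, count)
```

Features that appear in most of the lists are the natural candidates for a reduced feature set.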

In [None]:
df2

In [None]:
new_X=df2[['DailyRate', 'Age', 'DistanceFromHome', 'MonthlyIncome',
       'TrainingTimesLastYear', 'TotalWorkingYears', 'MonthlyRate',
       'HourlyRate', 'PercentSalaryHike','BusinessTravel_Travel_Frequently', 'OverTime', 'StockOptionLevel']]
new_X

In [None]:
new_y=df2['Attrition']
new_y

### Building a model after finding optimum features

In [None]:
X

In [None]:
X_train,X_test,y_train,y_test=train_test_split(new_X,new_y, test_size=0.3, random_state=0)

In [None]:
rf=RandomForestClassifier(max_depth=5, min_samples_leaf=10, min_samples_split=20,
                       n_estimators=5)
rf.fit(X_train,y_train)
y_train_pred=rf.predict(X_train)
y_train_prob=rf.predict_proba(X_train)[:,1]

y_test_pred=rf.predict(X_test)
y_test_prob=rf.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,threshold= roc_curve(y_test,y_test_prob)
threshold[0]=threshold[0]-1
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax2=ax1.twinx()
ax2.plot(fpr,threshold,'-g')
ax2.set_ylabel('THRESHOLD')
plt.show()

## Hyperparameter Tuning: Random Forest

### GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
rfgrid=GridSearchCV(estimator=RandomForestClassifier(),
                   param_grid=[{'n_estimators': [5,10],
                               'max_depth':[5,10,15],
                               'min_samples_leaf':[10,50,100],
                               'min_samples_split': [20,100,200]}])

In [None]:
rfgrid_fit=rfgrid.fit(X_train,y_train)

In [None]:
print(rfgrid_fit.best_estimator_)

In [None]:
rfgrid_score=rfgrid_fit.score(X_train,y_train)
rfgrid_score
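
The score on the training data tends to be optimistic, because the winning parameters were chosen on that same data. The fitted search object also stores a cross-validated score, which is usually the more informative number. Below is a minimal sketch on synthetic data; the `scoring='roc_auc'` and `cv=3` settings and the `demo_` names are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [5, 10], 'max_depth': [5, 10]},
    scoring='roc_auc',  # assumed metric; the default would be accuracy
    cv=3,               # assumed number of folds
)
demo_grid.fit(Xtr, ytr)

print(demo_grid.best_params_)
print(demo_grid.best_score_)      # mean cross-validated AUC of the best setting
print(demo_grid.score(Xte, yte))  # AUC on held-out data, scored against the true labels
```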

In [None]:
pred=rfgrid_fit.predict(X_test)
pred

In [None]:
# Score against the true test labels; scoring against `pred` itself would always give 1.0
rfgrid_score_test=rfgrid_fit.score(X_test,y_test)
rfgrid_score_test

In [None]:
y_train_pred=rfgrid.predict(X_train)
y_train_prob=rfgrid.predict_proba(X_train)[:,1]

y_test_pred=rfgrid.predict(X_test)
y_test_prob=rfgrid.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,threshold= roc_curve(y_test,y_test_prob)
threshold[0]=threshold[0]-1
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax2=ax1.twinx()
ax2.plot(fpr,threshold,'-g')
ax2.set_ylabel('THRESHOLD')
plt.show()

### RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
rfrandomized=RandomizedSearchCV(estimator=RandomForestClassifier(),
                   param_distributions=[{'n_estimators': [1,5,10],
                               'max_depth':[5, 10,15],
                               'min_samples_leaf':[5,10, 50, 100],
                               'min_samples_split': [10,20,100,200]}])

In [None]:
rfrand_fit=rfrandomized.fit(X_train,y_train)

In [None]:
print(rfrand_fit.best_estimator_)

In [None]:
rfrand_score=rfrand_fit.score(X_train,y_train)
rfrand_score

In [None]:
y_train_pred=rfrand_fit.predict(X_train)
y_train_prob=rfrand_fit.predict_proba(X_train)[:,1]

y_test_pred=rfrand_fit.predict(X_test)
y_test_prob=rfrand_fit.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,threshold= roc_curve(y_test,y_test_prob)
threshold[0]=threshold[0]-1
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax2=ax1.twinx()
ax2.plot(fpr,threshold,'-g')
ax2.set_ylabel('THRESHOLD')
plt.show()

### Let's try using PCA to reduce dimensions:

In [None]:
dfpca=df2.drop('Attrition',axis=1)
dfpca

In [None]:
from sklearn.decomposition import PCA

### Before PCA, the dataset must be scaled:

In [None]:
# Create scaler: scaler
scaler = StandardScaler()
scaler.fit(dfpca)

In [None]:
# transform
data_scaled = scaler.transform(dfpca)

In [None]:
data_scaled

### PCA is sensitive to the scale of the features, so the data has been scaled
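
A small sketch of that sensitivity, on two made-up independent features with very different scales (`toy`, `ratio_raw` and `ratio_scaled` are illustrative names). Without scaling, the large-scale feature takes almost all of the explained variance; after standardizing, the two components share it roughly evenly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent made-up features: standard deviation 1 versus 1000
toy = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

ratio_raw = PCA().fit(toy).explained_variance_ratio_
ratio_scaled = PCA().fit(StandardScaler().fit_transform(toy)).explained_variance_ratio_

print('unscaled:', ratio_raw)
print('scaled  :', ratio_scaled)
```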

In [None]:
pca = PCA()

# fit PCA
pca.fit(data_scaled)
pd.DataFrame({'Eigenvalue':pca.explained_variance_,'Proportion Explained':pca.explained_variance_ratio_,'Cumulative Proportion Explained':np.cumsum(pca.explained_variance_ratio_)})


In [None]:
plt.figure(figsize=(15,8))
# Use the fitted number of components instead of a hard-coded 53
plt.bar(range(1,pca.n_components_+1),pca.explained_variance_ratio_)
plt.plot(range(1,pca.n_components_+1),np.cumsum(pca.explained_variance_ratio_),'r')
plt.show()

In [None]:
# PCA features
features = range(pca.n_components_)
features

In [None]:
# PCA transformed data
data_pca = pca.transform(data_scaled)
data_pca.shape

In [None]:
# PCA components variance ratios.
pca.explained_variance_ratio_

In [None]:
plt.figure(figsize=(15,12))
plt.bar(features, pca.explained_variance_ratio_)
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.show()

In [None]:
pca2 = PCA(n_components=20, svd_solver='full')

# fit PCA
pca2.fit(data_scaled)

# PCA transformed data
data_pca2 = pca2.transform(data_scaled)
data_pca2.shape
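
The cell above fixes the number of components at 20. One alternative, sketched here on synthetic data, is to pick the smallest number of components whose cumulative explained variance reaches a target. The 0.90 target and the `demo`, `n_keep` and `auto_pca` names are assumptions for illustration; passing a float between 0 and 1 as `n_components` with `svd_solver='full'` asks scikit-learn to make the same choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 10))           # synthetic stand-in for the scaled data
demo_scaled = StandardScaler().fit_transform(demo)

target = 0.90                               # illustrative variance target

# Manual route: first position where the cumulative ratio reaches the target
full_pca = PCA(svd_solver='full').fit(demo_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, target) + 1)

# Built-in route: a float n_components selects by explained variance
auto_pca = PCA(n_components=target, svd_solver='full').fit(demo_scaled)

print(n_keep, auto_pca.n_components_)
```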

In [None]:
Xpca=pd.DataFrame(data_pca2)

In [None]:
ypca=df2['Attrition']

In [None]:
Xpca.shape,ypca.shape

In [None]:
X_train,X_test,y_train,y_test=train_test_split(Xpca,ypca, test_size=0.3, random_state=0)

In [None]:
rf=RandomForestClassifier()
rf.fit(X_train,y_train)
y_train_pred=rf.predict(X_train)
y_train_prob=rf.predict_proba(X_train)[:,1]

y_test_pred=rf.predict(X_test)
y_test_prob=rf.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,thresholds= roc_curve(y_test,y_test_prob)
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr)
ax1.plot(fpr,fpr)
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax2=ax1.twinx()
ax2.plot(fpr,thresholds,'-g')
ax2.set_ylabel('THRESHOLDS')
plt.show()

### Final Accuracy of the model:

Train Accuracy - 100%
Test Accuracy - 97.975%