# Problem Statement

Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

## Education
1. 'Below College'
2. 'College'
3. 'Bachelor'
4. 'Master'
5. 'Doctor'

## EnvironmentSatisfaction
1. 'Low'
2. 'Medium'
3. 'High'
4. 'Very High'

## JobInvolvement
1. 'Low'
2. 'Medium'
3. 'High'
4. 'Very High'

## JobSatisfaction
1. 'Low'
2. 'Medium'
3. 'High'
4. 'Very High'

## PerformanceRating
1. 'Low'
2. 'Good'
3. 'Excellent'
4. 'Outstanding'

## RelationshipSatisfaction
1. 'Low'
2. 'Medium'
3. 'High'
4. 'Very High'

## WorkLifeBalance
1. 'Bad'
2. 'Good'
3. 'Better'
4. 'Best'
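
The integer codes listed above can be turned back into readable labels for plots and tables. The following is a minimal sketch, assuming the columns hold exactly the codes shown; the `sample` frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Label lookups copied from the lists above.
EDUCATION_LABELS = {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}
WORK_LIFE_BALANCE_LABELS = {1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'}

# Tiny illustrative frame (made-up rows, not real employees).
sample = pd.DataFrame({'Education': [2, 4, 1], 'WorkLifeBalance': [3, 1, 4]})

# map() replaces each code with its label; codes missing from the dict become NaN.
labelled = sample.assign(
    Education=sample['Education'].map(EDUCATION_LABELS),
    WorkLifeBalance=sample['WorkLifeBalance'].map(WORK_LIFE_BALANCE_LABELS),
)
print(labelled)
```

The same pattern would apply to the other rating columns by adding one dictionary per column.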

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')

In [None]:
data.head()

In [None]:
data.describe()

# Data Visualization

In [None]:
print('Unique values for below columns from dataset:')
print('========='*10)
print(data['Attrition'].unique())
print(data['Over18'].unique())
print(data['OverTime'].unique())
print('========='*10)

In [None]:
data['Attrition'] = data['Attrition'].apply(lambda x:1 if x == 'Yes' else 0)
data['Over18'] = data['Over18'].apply(lambda x:1)
data['OverTime'] = data['OverTime'].apply(lambda x:1 if x == 'Yes' else 0)

In [None]:
data.head()

Let's check the dataset for missing values.

In [None]:
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

The heatmap shows no missing values in the dataset, so no imputation is needed.
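
A heatmap can hide a handful of missing cells in a large frame, so a numeric count per column is a useful cross-check. This is a small sketch on a made-up frame with one deliberately missing value; on the real data the same two calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Age value, for illustration only.
sample = pd.DataFrame({'Age': [41, 49, np.nan], 'OverTime': ['Yes', 'No', 'Yes']})

# Number of missing cells in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
print('Any missing values:', bool(missing_per_column.any()))
```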

In [None]:
data.hist(bins=30,figsize=(20,20),color='b')
plt.show()

## Observations from the histograms above:
1. Most employees are between 30 and 40 years old.
2. The attrition classes are highly imbalanced: far fewer employees left the company than stayed.
3. Many employees live close to the office.
4. Most employees joined after completing a bachelor's degree; fewer hold a master's degree.
5. Most employees are satisfied with the work environment, which should make them less likely to leave the company.
6. Most employees earn a monthly income of up to about 5,000 dollars; the higher salaries probably belong to people in more senior positions.
7. A sizeable number of employees work overtime.
8. A salary hike of up to about 15 percent is typical.
9. Performance ratings are also highly imbalanced.
10. Many employees have spent several years with the company and do not appear to want to leave, as the YearsAtCompany plot suggests.

EmployeeCount, StandardHours, and Over18 have the same value for every employee, so these columns carry no information and can be dropped. EmployeeNumber is only an identifier, so it is dropped as well.
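
Columns that never change can also be found programmatically instead of by eye. The sketch below uses a made-up three-row frame; the column names mirror the dataset, but the values are illustrative.

```python
import pandas as pd

# Made-up rows: two constant columns and one that varies.
sample = pd.DataFrame({
    'EmployeeCount': [1, 1, 1],
    'StandardHours': [80, 80, 80],
    'Age': [41, 49, 37],
})

# A column with a single distinct value cannot help a model separate the classes.
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_columns)
```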

In [None]:
data.columns

In [None]:
data.drop(['EmployeeNumber','Over18','StandardHours','EmployeeCount'],axis=1,inplace=True)

Let's check the Attrition column.
From the histograms above, it looks like we are dealing with imbalanced data.
So let's plot the class counts and print the percentage for both classes.

In [None]:
fig = plt.figure(figsize=(20,2))
sns.countplot(y = data['Attrition'])
plt.title('Count for employee attrition')
plt.show()

In [None]:
df_left = data[data['Attrition'] == 1]
df_stayes = data[data['Attrition'] == 0]

In [None]:
print(f'Percentage of people who left the company : {len(df_left) / len(data) * 100:.2f} %')
print(f'Percentage of people who stayed with the company : {len(df_stayes) / len(data) * 100:.2f} %')
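
When one class dominates, accuracy alone can look good even for a model that never predicts the minority class. The sketch below illustrates this with made-up labels (the 85/15 split is an assumption chosen for the example, not a figure from the dataset).

```python
import numpy as np

# Illustrative labels only: 1 = left, 0 = stayed (counts are made up).
y_true = np.array([0] * 85 + [1] * 15)

# A "model" that always predicts the majority class.
majority_class = np.bincount(y_true).argmax()
y_baseline = np.full_like(y_true, majority_class)

# Accuracy is high, yet no leaver is ever identified.
baseline_accuracy = (y_baseline == y_true).mean()
recall_left = ((y_baseline == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
print(f'Baseline accuracy: {baseline_accuracy:.2f}')
print(f'Recall for the "left" class: {recall_left:.2f}')
```

This is why per-class precision and recall are worth reading alongside accuracy for the models that follow.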

In [None]:
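# Let's check the correlation between the numeric columns
# (text columns are excluded because they cannot be correlated directly)
plt.figure(figsize=(20,20))
sns.heatmap(data.select_dtypes(include='number').corr(),cmap='viridis',annot=True,fmt='.2f')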
plt.show()

1. Job level is strongly correlated with total working years
2. Monthly income is strongly correlated with job level
3. Monthly income is strongly correlated with total working years
4. Age is also positively correlated with monthly income
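
The strongest relationships in a correlation matrix can also be listed programmatically rather than read off a large heatmap. This is a sketch on synthetic data: the column names mirror the dataset, but the values are generated here so that `MonthlyIncome` depends on `JobLevel` by construction.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: income is built from job level plus noise.
rng = np.random.default_rng(0)
n = 200
job_level = rng.integers(1, 6, size=n)
sample = pd.DataFrame({
    'JobLevel': job_level,
    'MonthlyIncome': job_level * 4000 + rng.normal(0, 500, size=n),
    'DistanceFromHome': rng.integers(1, 30, size=n),
})

corr = sample.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (column, column) pairs and sort by absolute correlation.
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.head())
```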

In [None]:
plt.figure(figsize=(20,10))
# Pass the column as x explicitly; newer seaborn versions do not accept it positionally
sns.countplot(x=data['Age'],hue=data['Attrition'])
plt.show()

We can observe that attrition is highest among the youngest employees, roughly ages 18 to 29.
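
A count plot mixes two effects: how many employees there are at each age and how likely they are to leave. Comparing the attrition rate per age band separates the two. The sketch below uses made-up rows and band edges chosen for illustration; it assumes `Attrition` has already been encoded as 0/1 as done earlier in this notebook.

```python
import pandas as pd

# Made-up rows for illustration; Attrition is 1 for employees who left.
sample = pd.DataFrame({
    'Age':       [22, 27, 29, 34, 38, 45, 52, 58],
    'Attrition': [1,  1,  0,  0,  1,  0,  0,  0],
})

# Band edges are an assumption; intervals are right-inclusive, e.g. (17, 29].
bins = [17, 29, 39, 49, 60]
labels = ['18-29', '30-39', '40-49', '50-60']
sample['AgeBand'] = pd.cut(sample['Age'], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of employees who left in each band.
rate_by_band = sample.groupby('AgeBand', observed=False)['Attrition'].mean()
print(rate_by_band)
```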

In [None]:
plt.figure(figsize=[20,20])

plt.subplot(411)
sns.countplot(x=data['JobSatisfaction'],hue=data['Attrition'])
plt.subplot(412)
sns.countplot(x=data['MaritalStatus'],hue=data['Attrition'])
plt.subplot(413)
sns.countplot(x=data['JobRole'],hue=data['Attrition'])
plt.subplot(414)
sns.countplot(x=data['JobLevel'],hue=data["Attrition"])
# plt.subplot(415)
# sns.countplot(x=data['JobInvolvement'],hue=data["Attrition"])
plt.show()

1. Single employees tend to leave more often than married or divorced employees
2. Sales Representatives tend to leave more than employees in any other job role
3. Less involved employees tend to leave the company
4. Less experienced employees (low job level) tend to leave the company
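
Statements like these are easier to verify with row-normalized proportions than with raw bar heights, because the groups differ in size. A minimal sketch with made-up rows, assuming `Attrition` is encoded as 0/1:

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'MaritalStatus': ['Single', 'Single', 'Married', 'Married', 'Divorced', 'Single'],
    'Attrition':     [1, 0, 0, 0, 0, 1],
})

# normalize='index' makes each row sum to 1, so column 1 is the attrition rate per group.
rates = pd.crosstab(sample['MaritalStatus'], sample['Attrition'], normalize='index')
print(rates)
```

Swapping `MaritalStatus` for `JobRole` or `JobLevel` would give the corresponding rates for the other plots.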

In [None]:
plt.figure(figsize=(12,7))

# fill=True replaces the deprecated shade argument (seaborn 0.11 or newer)
sns.kdeplot(df_left['DistanceFromHome'],label='Employees who left',fill=True,color='r')
sns.kdeplot(df_stayes['DistanceFromHome'],label='Employees who stayed',fill=True,color='b')
plt.xlabel('Distance from Home')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(12,7))

# fill=True replaces the deprecated shade argument (seaborn 0.11 or newer)
sns.kdeplot(df_left['TotalWorkingYears'],label='Employees who left',fill=True,color='r')
sns.kdeplot(df_stayes['TotalWorkingYears'],label='Employees who stayed',fill=True,color='b')
plt.xlabel('Total working years')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(12,7))

# fill=True replaces the deprecated shade argument (seaborn 0.11 or newer)
sns.kdeplot(df_left['YearsWithCurrManager'],label='Employees who left',fill=True,color='r')
sns.kdeplot(df_stayes['YearsWithCurrManager'],label='Employees who stayed',fill=True,color='b')
plt.xlabel('Years with current manager')
plt.legend()
plt.show()

In [None]:
sns.boxplot(x= 'MonthlyIncome',y='Gender',data=data)

In [None]:
sns.boxplot(x='MonthlyIncome',y='JobRole',data=data)

In [None]:
x_cat = data.select_dtypes(include='object')
x_cat

In [None]:
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()
x_cat = onehotencoder.fit_transform(x_cat).toarray()

In [None]:
x_cat = pd.DataFrame(x_cat)
x_cat

In [None]:
# Numeric features only; drop the target without modifying a slice in place
x_numerical = data.select_dtypes(exclude='object').drop('Attrition',axis=1)
x_numerical

In [None]:
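x_data = pd.concat([x_cat,x_numerical],axis=1)
# The one-hot columns have integer names; newer scikit-learn versions reject mixed int/str column names
x_data.columns = x_data.columns.astype(str)
x_data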

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Rescale every feature to the [0, 1] range
scaler = MinMaxScaler()
x = scaler.fit_transform(x_data)

In [None]:
x

In [None]:
y = data['Attrition']
y

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.25)

In [None]:
print(f'Shape of X train :{X_train.shape}')
print(f'Shape of X test :{X_test.shape}')
print(f'Shape of y train :{y_train.shape}')
print(f'Shape of y test :{y_test.shape}')
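
With imbalanced classes, a plain random split can give the training and test sets different shares of the minority class, and an unseeded split changes on every run. A stratified, seeded split is one way to address both. The sketch below uses synthetic arrays; `X_demo`, `y_demo`, and the 80/20 class ratio are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels: 160 negatives and 40 positives (20 % minority).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 160 + [1] * 40)

# stratify keeps the class ratio equal in both parts; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
print(f'Minority share in train: {y_tr.mean():.2f}')
print(f'Minority share in test : {y_te.mean():.2f}')
```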

# Logistic Regression Classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
model = LogisticRegression()
model.fit(X_train,y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
# scikit-learn metrics expect the true labels first, then the predictions
print(f'Accuracy : {100 * accuracy_score(y_test,y_pred):.2f}')

In [None]:
# True labels first: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='d')

In [None]:
print('Classification report')
print('======='*10)
# True labels first; swapping the arguments would exchange precision and recall
print(classification_report(y_test,y_pred))
print('======='*10)

# Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier()
randomforest.fit(X_train,y_train)

In [None]:
y_pred = randomforest.predict(X_test)

In [None]:
# True labels first: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='d')

In [None]:
print('Classification report')
print('======='*10)
# True labels first; swapping the arguments would exchange precision and recall
print(classification_report(y_test,y_pred))
print('======='*10)

# XGBoost Classifier

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train,y_train)

In [None]:
y_pred = xgb.predict(X_test)

In [None]:
# True labels first: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='d')

In [None]:
print('Classification report')
print('======='*10)
# True labels first; swapping the arguments would exchange precision and recall
print(classification_report(y_test,y_pred))
print('======='*10)

# SMOTE - Oversampling Technique

In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=27,sampling_strategy=1.0)
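# fit_resample replaces fit_sample, which was removed from imbalanced-learn
# Only the training data is oversampled; the test set keeps its original class balance
X_train,y_train = sm.fit_resample(X_train,y_train)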
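
SMOTE balances the classes by creating synthetic minority rows, each placed on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below illustrates only that interpolation idea in NumPy; `smote_like_samples` is a hypothetical helper written for this example and is not how imbalanced-learn implements it.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is never its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base row and one of its k nearest neighbors for each new sample.
    base_idx = rng.integers(0, len(minority), size=n_new)
    neighbor_idx = neighbors[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return minority[base_idx] + gap * (minority[neighbor_idx] - minority[base_idx])

# Four made-up minority rows at the corners of the unit square.
minority_rows = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority_rows, n_new=5)
print(synthetic)
```

Because every synthetic row lies between two real minority rows, the new points stay inside the region the minority class already occupies.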

In [None]:
smote_logistic = LogisticRegression()
smote_logistic.fit(X_train,y_train)

In [None]:
smote_pred = smote_logistic.predict(X_test)

In [None]:
sns.heatmap(confusion_matrix(y_test,smote_pred),annot=True,fmt='d',cmap='YlGnBu')
# Separate file name so the random forest plot saved later does not overwrite this one
plt.savefig('logistic_after_oversample.png')
plt.show()

In [None]:
print('Classification report')
print('======='*10)
# True labels first; swapping the arguments would exchange precision and recall
print(classification_report(y_test,smote_pred))
print('======='*10)

In [None]:
randomforest = RandomForestClassifier()
randomforest.fit(X_train,y_train)

In [None]:
y_pred_smote = randomforest.predict(X_test)

In [None]:
# True labels first: rows are actual classes, columns are predicted classes
sns.heatmap(confusion_matrix(y_test,y_pred_smote),annot=True,fmt='d',cmap='YlGnBu')
plt.savefig('rand_after_oversample.png')
plt.show()