# <h1 style="font-family: Times New Roman; border-radius : 10px;padding: 25px; font-size: 40px; color: #FCF6F5; text-align: center; line-height: 0.50;background-color: #2BAE66"><b>IBM HR Analytics</b><br></h1>

<center>
    <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQoKWCb0545g__QBdCLP8_7IUmIjC2GFZtzBQ&usqp=CAU" alt="IBM HR Analytics" width="50%">
</center>

### Problem Statement :

IBM HR Analytics is a dataset with more than 30 features, a mix of categorical and discrete features stored as numerical and text data. As more data is stored in digital format and its value is recognised, the race to automate outdated systems and improve their speed & accuracy is on! Thus, this fictional dataset gives us an opportunity to automate the employee hiring & firing system of an organization. This is possible with the help of Data Science & Machine Learning techniques.
  
### Aim :
- To classify / predict whether an employee continues with the company or not!
- To draw insights about employee performance as well as employee retention / departure from the employee data!

### Dataset Attributes :
    
- **Age** : Numerical Discrete Data
- **Attrition** : Text Categorical Data
- **BusinessTravel** : Text Categorical Data
- **DailyRate** : Numerical Discrete Data
- **Department** : Text Categorical Data
- **DistanceFromHome** : Numerical Discrete Data
- **Education** : Numerical Categorical Data
    - **1 : Below College**
    - **2 : College**
    - **3 : Bachelor**
    - **4 : Master**
    - **5 : Doctor**
- **EducationField** : Text Categorical Data
- **EmployeeCount** : Numerical Categorical Data
- **EmployeeNumber** : Numerical Categorical Data
- **EnvironmentSatisfaction** : Numerical Categorical Data
    - **1 : Low**
    - **2 : Medium**
    - **3 : High**
    - **4 : Very High**
- **Gender** : Text Categorical Data
- **HourlyRate** : Numerical Discrete Data
- **JobInvolvement** : Numerical Categorical Data
    - **1 : Low**
    - **2 : Medium**
    - **3 : High**
    - **4 : Very High**
- **JobLevel** : Numerical Categorical Data
- **JobRole** : Text Categorical Data
- **JobSatisfaction** : Numerical Categorical Data
    - **1 : Low**
    - **2 : Medium**
    - **3 : High**
    - **4 : Very High**
- **MaritalStatus** : Text Categorical Data
- **MonthlyIncome** : Numerical Discrete Data
- **MonthlyRate** : Numerical Discrete Data
- **NumCompaniesWorked** : Numerical Discrete Data
- **Over18** : Text Categorical Data
- **OverTime** : Text Categorical Data
- **PercentSalaryHike** : Numerical Discrete Data
- **PerformanceRating** : Numerical Categorical Data
    - **1 : Low**
    - **2 : Good**
    - **3 : Excellent**
    - **4 : Outstanding**
- **RelationshipSatisfaction** : Numerical Categorical Data
    - **1 : Low**
    - **2 : Medium**
    - **3 : High**
    - **4 : Very High**
- **StandardHours** : Numerical Discrete Data
- **StockOptionLevel** : Numerical Categorical Data
- **TotalWorkingYears** : Numerical Discrete Data
- **TrainingTimesLastYear** : Numerical Discrete Data
- **WorkLifeBalance** : Numerical Categorical Data
    - **1 : Bad**
    - **2 : Good**
    - **3 : Better**
    - **4 : Best**
- **YearsAtCompany** : Numerical Discrete Data
- **YearsInCurrentRole** : Numerical Discrete Data
- **YearsSinceLastPromotion** : Numerical Discrete Data
- **YearsWithCurrManager** : Numerical Discrete Data

### Notebook Contents :
- Dataset Information
- Exploratory Data Analysis (EDA)
- Summary of EDA
- Feature Engineering (Data Balancing & Data Leakage)
- Modeling
- Conclusion

### What you will learn :
- Data Visualization
- Data Balancing using SMOTE
- Data Leakage
- Statistical Tests for Feature Selection
- Modeling and visualization of results for algorithms

### Let's get started!

# <center><div style="font-family: Times New Roman; border-radius : 10px; background-color: #2BAE66; color: #FCF6F5; padding: 12px; line-height: 1;">Dataset Information</div></center>

### Import the Necessary Libraries :

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.options.display.float_format = '{:.2f}'.format
import warnings
warnings.filterwarnings('ignore')

from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from datetime import datetime

In [None]:
data = pd.read_csv('/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
data.head()

### Data Info :

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.info()

In [None]:
sns.heatmap(data.isnull(),cmap = 'magma',cbar = False)

- **No null values** present in the data!
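
The heatmap gives a visual check; a numeric count per column is a useful complement. The following is a minimal sketch on a toy frame (the `toy` values are hypothetical and only stand in for `data`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data` (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    'Age': [41, 49, 37, 33],
    'Attrition': ['Yes', 'No', 'Yes', 'No'],
    'MonthlyIncome': [5993, 5130, np.nan, 2909],
})

# Count missing values per column; 0 means the column has no nulls.
null_counts = toy.isnull().sum()
print(null_counts)

# Single yes/no answer for the whole frame.
has_nulls = bool(null_counts.sum() > 0)
print('Any nulls :', has_nulls)
```

On the real dataset the same two calls on `data` should report zero for every column if the heatmap reading is right.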

In [None]:
data.describe()

In [None]:
yes = data[data['Attrition'] == 'Yes'].describe().T
no = data[data['Attrition'] == 'No'].describe().T

colors = ['#2BAE66','#FCF6F5']

fig,ax = plt.subplots(nrows = 1,ncols = 2,figsize = (10,10))
plt.subplot(1,2,1)
sns.heatmap(yes[['mean']],annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('Mean Values : Attrited Employees');

plt.subplot(1,2,2)
sns.heatmap(no[['mean']],annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',cbar = False,fmt = '.2f')
plt.title('Mean Values : Retained Employees');

fig.tight_layout(pad = 2)

- The heatmaps show the **mean** values of all the features for **Attrited Employees** and **Retained Employees**.
- For **Age**, the mean value for **staying employees** is **37.56**, i.e. higher than the **33.61** for **departing employees**.
- Similarly, **DailyRate** & **JobLevel** are higher for **staying employees** than for **departing employees**.
- **Staying employees** have higher values for features : **TotalWorkingYears**, **YearsAtCompany**, **YearsInCurrentRole** & **YearsWithCurrManager**.
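
The same per-class means can also be obtained in one step with `groupby`. This is a small sketch on hypothetical toy values, not the notebook's actual numbers:

```python
import pandas as pd

# Toy frame standing in for `data` (hypothetical values).
toy = pd.DataFrame({
    'Attrition': ['Yes', 'No', 'No', 'Yes', 'No'],
    'Age': [28, 40, 36, 30, 44],
    'JobLevel': [1, 3, 2, 1, 4],
})

# One row per Attrition class, one column per feature mean.
group_means = toy.groupby('Attrition')[['Age', 'JobLevel']].mean()
print(group_means)
```

Placing both classes in one table makes the differences easier to compare than two separate heatmaps.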

# <center><div style="font-family: Times New Roman; border-radius : 10px; background-color: #2BAE66; color: #FCF6F5; padding: 12px; line-height: 1;">Exploratory Data Analysis</div></center>

### Dividing features into Numerical and Categorical :

In [None]:
discrete_features = ['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
                 'PercentSalaryHike', 'StandardHours', 'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany', 
                 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
categorical_features = ['Attrition', 'BusinessTravel','Department', 'Education', 'EducationField', 'EmployeeCount','EmployeeNumber',
                    'EnvironmentSatisfaction', 'Gender', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
                    'MaritalStatus', 'Over18', 'OverTime', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
                    'WorkLifeBalance']

df1 = data.copy(deep = True)

- According to the dataset information, we divide the features into **categorical and discrete features**.
- A typical approach for this division of features can also be based on the datatypes of the elements of the respective attribute.

**Eg :** datatype = integer, attribute = numerical feature ; datatype = string, attribute = categorical feature

- We create a deep copy of the original dataset for experimenting with data, visualization and modeling.
- Modifications to the original dataset will not be reflected in this deep copy, and vice versa.
- We now LabelEncode the categorical features with text data.
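
The datatype-based split mentioned above can be sketched with `select_dtypes`. The toy frame is hypothetical; note that an integer-coded category such as **Education** lands in the numeric list, which is why the lists in this notebook were written by hand:

```python
import pandas as pd

# Toy frame standing in for `data` (hypothetical values).
toy = pd.DataFrame({
    'Age': [41, 49, 37],
    'Education': [2, 1, 4],           # integer codes, but categorical in meaning
    'Gender': ['Female', 'Male', 'Male'],
    'OverTime': ['Yes', 'No', 'Yes'],
})

# Numeric dtypes vs. everything else (the text columns here).
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('Numeric :', numeric_cols)
print('Text    :', text_cols)
```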

In [None]:
le = LabelEncoder()
l1 = []; l2 = []; text_categorical_features = []
print('Label Encoder Transformation')
for i in tqdm(categorical_features):
    if type(df1[i][0]) == str:
        text_categorical_features.append(i)
        df1[i] = le.fit_transform(df1[i])
        l1.append(list(df1[i].unique())); l2.append(list(le.inverse_transform(df1[i].unique())))
        print(i,' : ',df1[i].unique(),' = ',le.inverse_transform(df1[i].unique()))

- We store the label encoded transformations inside a dictionary that gives us the information about the encoded value and its original value!
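
As an alternative sketch, the same code-to-label mapping can be read directly from the encoder's fitted `classes_` attribute, whose positions are the integer codes. The `values` list is a hypothetical stand-in for one text column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for one text column, e.g. BusinessTravel.
values = ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel', 'Travel_Rarely']

le = LabelEncoder()
encoded = le.fit_transform(values)

# classes_ is sorted; the index of each class is its encoded value.
mapping = {code: str(label) for code, label in enumerate(le.classes_)}
print(mapping)
```

This only works if a separate encoder (or a copy of `classes_`) is kept per column, since refitting the same `LabelEncoder` overwrites `classes_`.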

In [None]:
tf1 = {}
for i in range(len(text_categorical_features)):
    tf1[text_categorical_features[i]] = {}
    for j,k in zip(l1[i],l2[i]):
        tf1[text_categorical_features[i]][j] = k

### Categorical Features :

#### Distribution of Categorical Features :

In [None]:
for i in range(5):
    fig, ax = plt.subplots(nrows = 1,ncols = 4,figsize = (15,3))
    a = 1
    for j in categorical_features[(i*4) : (i*4) + 4]:
        plt.subplot(1,4,a) 
        sns.distplot(df1[j],kde_kws = {'bw' : 1},color = colors[0]);
        plt.title('Distribution : ' + j)
        a += 1

- **EmployeeNumber** is just a unique identifying number with no repeated values. Hence, we will drop this feature.
- A **bimodal** distribution can be observed for **JobRole**.
- A lot of features have slightly **right-skewed** or **left-skewed** data distributions.
- **Over18** & **EmployeeCount** are single value features.
- We now drop the redundant features from the dataframe as well as from the list of categorical features. We also drop the **Attrition** feature as it is the target variable & will consider it separately.
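
Single value features and identifier-like features can also be flagged without plots by counting unique values. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame standing in for `df1` (hypothetical values).
toy = pd.DataFrame({
    'EmployeeNumber': [1, 2, 4, 5],
    'EmployeeCount': [1, 1, 1, 1],
    'Over18': ['Y', 'Y', 'Y', 'Y'],
    'JobLevel': [2, 2, 1, 1],
})

n_unique = toy.nunique()

# Exactly one distinct value: the column carries no information.
constant_cols = n_unique[n_unique == 1].index.tolist()

# As many distinct values as rows: behaves like an identifier.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print('Constant :', constant_cols)
print('ID-like  :', id_like_cols)
```

The ID-like check is only a heuristic; a genuinely continuous feature can also be unique per row, so each flagged column still needs a manual look.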

In [None]:
df1.drop(columns = ['EmployeeCount', 'EmployeeNumber', 'Over18'], inplace = True)
categorical_features.remove('EmployeeCount'); categorical_features.remove('EmployeeNumber') 
categorical_features.remove('Over18'); categorical_features.remove('Attrition')

### Discrete Features :

#### Distribution of Discrete Features :

In [None]:
for i in range(5):
    fig, ax = plt.subplots(nrows = 1,ncols = 3,figsize = (15,3))
    a = 1
    for j in discrete_features[(i*3) : (i*3) + 3]:
        plt.subplot(1,3,a) 
        sns.distplot(df1[j],kde_kws = {'bw' : 1},color = colors[0]);
        plt.title('Distribution : ' + j)
        a += 1

- **HourlyRate**, **DailyRate** & **MonthlyRate** display jagged, roughly flat curves that resemble the graphs usually found in **Time Series**. Note that these graphs show how the values are distributed, not how they change w.r.t. time.
- **DistanceFromHome**, **MonthlyIncome**, **NumCompaniesWorked**, **PercentSalaryHike**, **TotalWorkingYears**, **YearsAtCompany** & **YearsSinceLastPromotion** display a **right-skewed** data distribution.
- **YearsInCurrentRole** & **YearsWithCurrManager** have a **bimodal** data distribution. **StandardHours** is a single value feature.
- We now drop the redundant features from the dataframe as well as from the list of discrete features.

In [None]:
df1.drop(columns = ['StandardHours'], inplace = True)
discrete_features.remove('StandardHours')

### Target Variable Visualization (Attrition) : 

In [None]:
l = list(df1['Attrition'].value_counts())
circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]

fig = plt.subplots(nrows = 1,ncols = 2,figsize = (20,5))
plt.subplot(1,2,1)
plt.pie(circle,labels = list(tf1['Attrition'][j] for j in sorted(df1['Attrition'].unique())),autopct='%1.1f%%',startangle = 90,explode = (0.1,0),colors = colors,
       wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
plt.title('Attrited Employee (%)');

plt.subplot(1,2,2)
ax = sns.countplot(x = 'Attrition',data = df1, palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(tf1['Attrition'][j] for j in sorted(df1['Attrition'].unique()))
plt.title('Number of Attrited Employees');
plt.show()

- The dataset is **highly unbalanced**!
- A **5.2 : 1** ratio of **Retained Employees : Attrited Employees** is found!
- Due to this, predictions will be biased towards **Retention** cases.
- Visualizations will also display this bias, thus making it difficult to gain insight.
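
The imbalance ratio can be computed directly from the class counts. The sketch below uses a hypothetical toy series built to have the same 5.2 : 1 proportion, not the real counts:

```python
import pandas as pd

# Hypothetical toy target with the same proportion as reported above.
attrition = pd.Series(['No'] * 26 + ['Yes'] * 5, name='Attrition')

counts = attrition.value_counts()
ratio = counts['No'] / counts['Yes']

# Share of each class in percent.
share_pct = attrition.value_counts(normalize=True) * 100

print('Retained : Attrited =', round(ratio, 1), ': 1')
print(share_pct.round(1))
```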

- After dropping the single value features and removing the target feature, **Attrition**, we group the remaining 30 features according to their characteristics & by intuition. 
- They are divided into 5 groups as follows :
    - **General Employee Information**
    - **Employee Job Information**
    - **Employee - Company Information**
    - **Company Features**
    - **Finances**

In [None]:
l1 = ['Age', 'Gender','MaritalStatus', 'Education', 
      'DistanceFromHome', 'TotalWorkingYears', 'NumCompaniesWorked'] # General Employee Information

l2 = ['EducationField', 'Department', 'JobLevel', 'JobRole', 
      'JobInvolvement', 'OverTime', 'JobSatisfaction'] # Employee Job Information

l3 = ['YearsAtCompany', 'YearsInCurrentRole', 'YearsWithCurrManager', 
      'YearsSinceLastPromotion', 'TrainingTimesLastYear', 'WorkLifeBalance'] # Employee - Company Information

l4 = ['PercentSalaryHike', 'StockOptionLevel', 'BusinessTravel', 
      'PerformanceRating', 'EnvironmentSatisfaction', 'RelationshipSatisfaction'] # Company Information 

l5 = ['MonthlyIncome', 'HourlyRate', 'DailyRate', 'MonthlyRate'] # Finances

df2 = pd.DataFrame()
df2['Attrition'] = df1['Attrition']

- We create a dummy dataframe with the **Attrition** feature that can be used for storing features that need to be manipulated for drawing insights & visualization purposes!

**We will draw insights from the group of features by visualization techniques!**

### General Employee Information :

- It includes features that provide basic information about an employee!

- List of Features :
    - **Age**
    - **Gender**
    - **MaritalStatus**
    - **Education**
    - **DistanceFromHome**
    - **TotalWorkingYears**
    - **NumCompaniesWorked**

In [None]:
df2['Age_Group'] = [int(i/5) for i in df1['Age']]

plt.figure(figsize = (15,5))
ax = sns.countplot(x = 'Age_Group', data = df2, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height(), rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'], loc = 'upper right')
plt.title('Age');

- **Attrition** is present in nearly all the age groups.
- For **Age** values between **30 - 34**, the highest number of employees, **59**, have departed. Employees with **Age** values **25 - 29** come second with **53** employees discontinuing their jobs with the company.
- Age values from **20 - 24** & **35 - 39** display nearly the same number of attrited employees with **28** & **30** respectively.
- Employees above the age of **40** have also been relieved of their duties.
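
The group numbers on the x-axis come from `int(i/5)`, so group `6` means ages 30 - 34. A small sketch of the same binning with readable labels, using hypothetical ages:

```python
import pandas as pd

# Hypothetical ages; the real values come from df1['Age'].
ages = pd.Series([18, 24, 25, 29, 30, 34, 35, 60], name='Age')

# Same result as [int(i/5) for i in ages] for non-negative values.
age_group = ages // 5

# Turn each group number into a readable 'lower - upper' label.
lower = age_group * 5
labels = lower.astype(str) + ' - ' + (lower + 4).astype(str)

print(pd.DataFrame({'Age': ages, 'Age_Group': age_group, 'Label': labels}))
```

Using such labels as tick labels would avoid having to decode group numbers by hand.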

In [None]:
df2['DistanceFromHome_Group'] = [int(i/5) for i in df1['DistanceFromHome']]

plt.figure(figsize = (15,5))
ax = sns.countplot(x = 'DistanceFromHome_Group', data = df2, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height(), rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'], loc = 'upper right'); plt.title('DistanceFromHome');

for i in range(2):
    fig = plt.subplots(nrows = 1,ncols = 3,figsize = (15,15)); a = 1
    for j in range(3):
        plt.subplot(1,3,a)
        if i == 0:
            l = list(df2.loc[(df2['DistanceFromHome_Group'] == j)]['Attrition'].value_counts())
        else:
            l = list(df2.loc[(df2['DistanceFromHome_Group'] == (j+3))]['Attrition'].value_counts())
            
        circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
        plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df2['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
                colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
        if i == 0:
            plt.title('DistanceFromHome : ' + str(j*5) + ' - ' + str(j*5 + 4) + ' ('+ str(j) + ')');
        else:
            plt.title('DistanceFromHome : ' + str((j+3)*5) + ' - ' + str((j+3)*5 + 4) + ' ('+ str(j+3) + ')');
        a += 1

- From the 1st graph, employees living nearest to the company, i.e. within **0 - 4**, show the highest number of attritions. However, the percentage of attrition tells a different story.
- Employees living within the distance of **0 - 4** have the lowest attrition rate. As the value of **DistanceFromHome** increases, the attrition rate increases!
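
The gap between counts and rates can be made explicit in a single table. A minimal sketch with hypothetical toy values, assuming **Attrition** is already encoded as 0 / 1 as in `df2`:

```python
import pandas as pd

# Hypothetical toy data: group 0 is large, group 1 is small.
toy = pd.DataFrame({
    'DistanceFromHome_Group': [0] * 10 + [1] * 4,
    'Attrition': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0],
})

# Attrited count and group size side by side, then the rate in percent.
summary = toy.groupby('DistanceFromHome_Group')['Attrition'].agg(attrited='sum', total='count')
summary['rate_pct'] = summary['attrited'] / summary['total'] * 100

print(summary)
```

In this toy example group 0 has more attrited employees but a lower rate, which is the same effect described above.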

In [None]:
df2['TotalWorkingYears_Group'] = [int(i/5) for i in df1['TotalWorkingYears']]

plt.figure(figsize = (15,5))
ax = sns.countplot(x = 'TotalWorkingYears_Group', data = df2, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height(), rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'], loc = 'upper right')
plt.title('TotalWorkingYears');

- From the above visualization, we can say that employees **within their first 10 years of work experience** are highly prone to being removed!
- As the work experience increases, the chances of attrition reduce!

In [None]:
plt.figure(figsize = (15,5))
ax = sns.countplot(x = 'NumCompaniesWorked', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height(), rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'], loc = 'upper right')
plt.title('Number of Companies Worked For');

- A lot of volatility can clearly be seen between the **1st - 2nd** job.
- This volatility settles down after the 2nd job.
- However, once an employee has worked in more than **4 companies**, the chances of **attrition** increase drastically.

In [None]:
fig = plt.subplots(nrows = 1, ncols = 2, figsize = (15,5))

plt.subplot(1,2,1)
ax = sns.countplot(x = 'Gender',data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(tf1['Gender'][j] for j in sorted(df1['Gender'].unique()))
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('Gender');

plt.subplot(1,2,2)
ax = sns.countplot(x = 'MaritalStatus',data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(tf1['MaritalStatus'][j] for j in sorted(df1['MaritalStatus'].unique()))
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('Marital Status');

- According to the data, more **Male** employees have been removed than the **Female** employees. 
- **Single** employees have been attrited the most. **Married** employees occupy the 2nd place and **Divorced** come at the last position.

In [None]:
plt.figure(figsize = (15,5))
ax = sns.countplot(x = 'Education',data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(['Below College', 'College', 'Bachelor', 'Master', 'Doctor'])
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('Education');

fig = plt.subplots(nrows = 1,ncols = 3,figsize = (15,15))
for i in range(1,4):
    plt.subplot(1,3,i)
    l = list(df2.loc[(df1['Education'] == i)]['Attrition'].value_counts())
    
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df2['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('Education : ' + ['Below College', 'College', 'Bachelor', 'Master', 'Doctor'][i-1]);
    
fig = plt.subplots(nrows = 1,ncols = 2,figsize = (10,10))
for i in range(2):
    plt.subplot(1,2,i+1)
    l = list(df2.loc[(df1['Education'] == (i+4))]['Attrition'].value_counts())
    
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df2['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('Education : ' + ['Below College', 'College', 'Bachelor', 'Master', 'Doctor'][i+3]);

- Employees with a **Bachelor**'s degree have been discontinued the most, followed by employees with a **Master**'s degree.
- Surprisingly, employees with **Below College** education rank 4th out of 5 by count. But, they have the **highest attrition rate**.
- **Doctor** degree employees have been attrited the least number of times & also have the **lowest attrition rate**.

- We will only check for **Age** vs **Gender**, **Marital Status** & **Education**!

In [None]:
fig = plt.subplots(nrows = 1,ncols = 2,figsize = (15,5))
for i in range(len(['Gender', 'MaritalStatus'])):
    plt.subplot(1,2,i+1)
    ax = sns.boxplot(x = ['Gender', 'MaritalStatus'][i],y = 'Age',data = df1,hue = 'Attrition',palette = colors);
    plt.legend(['RE', 'AE'])
    ax.set_xticklabels(tf1[['Gender', 'MaritalStatus'][i]][j] for j in sorted(df1[['Gender', 'MaritalStatus'][i]].unique()))
    plt.title(['Gender', 'MaritalStatus'][i] + ' vs Age');
    
plt.figure(figsize = (15,5))
ax = sns.boxplot(x = 'Education',y = 'Age',data = df1,hue = 'Attrition',palette = colors);
plt.legend(['Retained Employees', 'Attrited Employees'])
ax.set_xticklabels(['Below College', 'College', 'Bachelor', 'Master', 'Doctor'])
plt.title('Education' + ' vs Age');

- Employees with an **Age** of less than **30** are most prone to attrition irrespective of their **Gender**, **MaritalStatus** & **Education**.
- Both **Gender** categories cover similar **Age** values, from somewhere **below 30** to **above 35**. When it comes to **MaritalStatus**, **Single** employees have a lower limit value for **Age** than **Married** & **Divorced** employees.
- When it comes to **Education**, employees with a **College** degree span a wide range of **Age** values, making them highly prone to removal from the company.
- The lower limit of **Age** for **Master** & **Doctor** degree employees is high. **Doctor** degree holders also have the narrowest range of values, making them the least targeted employees when it comes to removal of employees.

### Employee Job Information :

- It includes features that provide information about the job & its characteristics!

- List of Features :
    - **EducationField**
    - **Department**
    - **JobLevel**
    - **JobRole**
    - **JobInvolvement**
    - **OverTime**
    - **JobSatisfaction**

In [None]:
fig = plt.subplots(nrows = 2,ncols = 2,figsize = (25,10))

for i in range(4):
    plt.subplot(2,2,i+1)
    ax = sns.countplot(x = ['EducationField', 'Department', 'JobRole', 'OverTime'][i],data = df1, 
                       hue = 'Attrition', palette = colors,edgecolor = 'black')
    for rect in ax.patches:
        ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
    ax.set_xticklabels(tf1[['EducationField', 'Department', 'JobRole', 'OverTime'][i]][j] 
                       for j in sorted(df1[['EducationField', 'Department', 'JobRole', 'OverTime'][i]].unique()))
    plt.legend(['Retained Employees', 'Attrited Employees'])
    plt.title(['EducationField', 'Department', 'JobRole', 'OverTime'][i]);

fig = plt.subplots(nrows = 1,ncols = 3,figsize = (25,5))
plt.subplot(1,3,1)
ax = sns.countplot(x = 'JobLevel',data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('JobLevel');

plt.subplot(1,3,2)
ax = sns.countplot(x = 'JobInvolvement',data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(['Low', 'Medium','High','Very High'])
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('JobInvolvement');

plt.subplot(1,3,3)
ax = sns.countplot(x = 'JobSatisfaction',data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 2, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(['Low', 'Medium', 'High','Very High'])
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('JobSatisfaction');

- All these graphs follow much the same pattern: the more people in a category, the higher the number of employees removed.
- Hence, the raw counts can be deceiving as they do not show the complete picture. Thus, we will check the attrition percentage of each individual category.
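
The attrition percentage of every category can also be tabulated with `pd.crosstab` and `normalize = 'index'`. This is a sketch on hypothetical toy values; unlike indexing into `value_counts()`, it still returns a `Yes` column of 0 for a category with no attrited employees:

```python
import pandas as pd

# Hypothetical toy data; R&D has no attrited employees at all.
toy = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Sales', 'Sales', 'HR', 'HR',
                   'R&D', 'R&D', 'R&D', 'R&D'],
    'Attrition': ['Yes', 'No', 'No', 'No', 'Yes', 'No',
                  'No', 'No', 'No', 'No'],
})

# Each row sums to 100: the share of 'No' and 'Yes' within one category.
rates = pd.crosstab(toy['Department'], toy['Attrition'], normalize = 'index') * 100
print(rates)
```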

In [None]:
fig = plt.subplots(nrows = 1,ncols = 6,figsize = (20,20))
for i in range(len(df1['EducationField'].unique())):
    plt.subplot(1,6,i+1)
    l = list(df1.loc[(df1['EducationField'] == i)]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    b = list(tf1['EducationField'][k] for k in sorted(df1['EducationField'].unique()))
    plt.title('EducationField : ' + b[i]);

- We can see that employees with **EducationField** of **Human Resources**, **Technical Degree** & **Marketing** have a higher chance of being removed. 

In [None]:
fig = plt.subplots(nrows = 1,ncols = 6,figsize = (20,20))
c = list((sorted(df1['Department'].unique()) + sorted(df1['JobRole'].unique())[:3]))
for i in range(len(c)):
    
    plt.subplot(1,6,i+1)
    if i < 3:
        l = list(df1.loc[(df1['Department'] == i)]['Attrition'].value_counts())
        circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    elif i > 2:
        l = list(df1.loc[(df1['JobRole'] == (i - 3))]['Attrition'].value_counts())
        circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]

    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    
    if i < 3:
        b = list(tf1['Department'][k] for k in sorted(df1['Department'].unique()))
        plt.title('Department : ' + b[i]);
    elif i > 2:
        b = list(tf1['JobRole'][k] for k in sorted(df1['JobRole'].unique()))
        plt.title('JobRole : ' + b[i-3]);
        
fig = plt.subplots(nrows = 1,ncols = 6,figsize = (20,20))

for i in range(len(sorted(df1['JobRole'].unique())[3:])):
    
    plt.subplot(1,6,i+1)

    l = list(df1.loc[(df1['JobRole'] == (i+3))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    b = list(tf1['JobRole'][k] for k in sorted(df1['JobRole'].unique()))
    plt.title('JobRole : ' + b[i+3]);

- From the above pie charts, **Sales** & **Human Resources** **Department** employees have a higher probability of discontinuing with the company than **Research & Development** employees.
- When it comes to **JobRole**, out of the 9 roles, 4 roles display less than **7%** of attrition rate whereas the remaining 5 roles have an attrition rate more than **15%**. 

In [None]:
fig = plt.subplots(nrows = 1, ncols = 2, figsize = (10,10))
for i in range(len(df1['OverTime'].unique())):
    plt.subplot(1,2,i+1)
    l = list(df1.loc[(df1['OverTime'] == i)]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    b = list(tf1['OverTime'][k] for k in sorted(df1['OverTime'].unique()))
    plt.title('OverTime : ' + b[i]);

- We can see that people who work **OverTime** are prone to be discontinued from the company! They have a **30%** attrition rate, i.e. much higher than that of employees who do not work **OverTime**.

In [None]:
fig = plt.subplots(nrows = 1,ncols = 4,figsize = (15,15))
for i in range(len(df1['JobInvolvement'].unique())):
    plt.subplot(1,4,i+1)
    l = list(df1.loc[(df1['JobInvolvement'] == (i+1))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('JobInvolvement : ' + ['Low', 'Medium', 'High', 'Very High'][i]);
        
fig = plt.subplots(nrows = 1,ncols = 4,figsize = (15,15))
for i in range(len(df1['JobSatisfaction'].unique())):
    plt.subplot(1,4,i+1)
    l = list(df1.loc[(df1['JobSatisfaction'] == (i+1))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('JobSatisfaction : ' + ['Low', 'Medium', 'High', 'Very High'][i]);

- We can observe that the higher the **JobInvolvement**, the lower the attrition rate!
- A similar pattern can be observed for **JobSatisfaction**.

In [None]:
fig = plt.subplots(nrows = 1,ncols = 5,figsize = (15,15))

for i in range(len(df1['JobLevel'].unique())):
    plt.subplot(1,5,i+1)
    l = list(df1.loc[(df1['JobLevel'] == (i+1))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('JobLevel : ' + str(i+1));

- **JobLevel 1** has the highest attrition rate with **26.3%**. **JobLevel 3** comes at the 2nd position with **14.7%**.
- **JobLevel 4** has the lowest attrition rate with **4.7%**. 
- There seems to be no pattern. Hence, we will visualize the **JobLevel** with some features of the same group.

In [None]:
plt.subplots(nrows = 1, ncols = 3, figsize = (25,5))
for i in range(len(sorted(df1['JobLevel'].unique())[:3])):
    plt.subplot(1,3,i+1)
    ax = sns.countplot(x = 'JobRole',data = df1[(df1['JobLevel'] == (i+1))], palette = colors,edgecolor = 'black')
    for rect in ax.patches:
        ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 1, rect.get_height(), horizontalalignment='center', fontsize = 11)
    ax.set_xticklabels(list(tf1['JobRole'][k] for k in sorted(df1[(df1['JobLevel'] == (i+1))]['JobRole'].unique())))
    plt.title('JobRoles : JobLevel ' + str(i+1));
    
plt.subplots(nrows = 1, ncols = 2, figsize = (15,5))
for i in range(len(sorted(df1['JobLevel'].unique())[3:])):
    plt.subplot(1,2,i+1)
    ax = sns.countplot(x = 'JobRole',data = df1[(df1['JobLevel'] == (i+4))], palette = colors,edgecolor = 'black')
    for rect in ax.patches:
        ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 1, rect.get_height(), horizontalalignment='center', fontsize = 11)
    ax.set_xticklabels(list(tf1['JobRole'][k] for k in sorted(df1[(df1['JobLevel'] == (i+4))]['JobRole'].unique())))
    plt.title('JobRoles : JobLevel ' + str(i+4));

- **JobLevel 1** is dominated by the **Research Scientist** & **Laboratory Technician** roles. For **JobLevel 2**, **Sales Executive** is the most common role; the other 6 roles listed have much lower counts.
- **Sales Executive**, **Manufacturing Director** & **Healthcare Representative** rank 1st, 2nd & 3rd respectively in **JobLevel 3**. **Manager** is the most common role in **JobLevel 4**.
- Only **Manager** & **Research Director** roles are present in **JobLevel 5**. One pattern can be observed: as the **JobLevel** increases, the number of distinct **JobRole** values & their counts decrease.

### Employee - Company Information :

- It includes features that provide information about the employee's association with the company!

- List of Features :
    - **YearsAtCompany**
    - **YearsInCurrentRole**
    - **YearsWithCurrManager**
    - **YearsSinceLastPromotion**
    - **TrainingTimesLastYear**
    - **WorkLifeBalance**

In [None]:
df2['YearsAtCompany_Group'] = [int(i / 5) for i in df1['YearsAtCompany']]

plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'YearsAtCompany_Group', data = df2, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('YearsAtCompany');

- Clearly, employees that have been at the company for **0 - 4 (0)** years account for the highest number of attritions.
- As the employees gain experience at the company, attrition reduces.
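
A minimal sketch of the 5-year binning used for `YearsAtCompany_Group`, showing how each group index maps back to a range of years. The `years` values are hypothetical and the `labels` column is an illustrative addition, not part of the notebook.

```python
import pandas as pd

# Hypothetical tenure values in years (not taken from the dataset).
years = pd.Series([0, 3, 4, 5, 9, 10, 14, 40])

# Integer division maps 0-4 -> 0, 5-9 -> 1, 10-14 -> 2, and so on.
group = years // 5

# Readable label for each bucket, e.g. group 0 -> '0 - 4'.
labels = group.map(lambda g: f'{5 * g} - {5 * g + 4}')
print(pd.DataFrame({'years': years, 'group': group, 'label': labels}))
```

For non-negative integers, `years // 5` gives the same result as `int(i / 5)` applied element by element.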

In [None]:
plt.figure(figsize = (15, 5))

ax = sns.countplot(x = 'YearsInCurrentRole', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black');
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('YearsInCurrentRole');

- As expected by now, employees who have just started their current role (value **0**) are very volatile and look for an early exit.
- Another spike of attrition can be observed when employees complete **2 years** in their current role. Either the employees look for an improvement in their role, or the company has evaluated them and taken a call.
- This is followed by attrition in years **3** & **4**, probably a continuation of the attrition that started in year 2.
- One more significant spike can be observed in **year 7** of the current role, as the employees might look for an improvement or the company might decide to shake things up.

In [None]:
plt.figure(figsize = (15, 5))

ax = sns.countplot(x = 'YearsWithCurrManager', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black');
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('YearsWithCurrManager');

- This is a very similar visualization to the previous graph of **YearsInCurrentRole**. 
- Peaks of attrition can be found at the same values : **0**, **2** & **7**.

In [None]:
plt.figure(figsize = (15, 5))

ax = sns.countplot(x = 'YearsSinceLastPromotion', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black');
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('YearsSinceLastPromotion');

- We can see that a huge number of attrition cases can be found for value **0**. Most of these values probably represent the freshers in the company.
- **1** & **2** years since the last promotion have also recorded a significant number of attrition cases.
- **7** years since the last promotion also has a decent number of attrition cases. This value seems to have some correlation with the previous 2 graphs of **YearsInCurrentRole** & **YearsWithCurrManager**.

In [None]:
plt.figure(figsize = (15, 5))

ax = sns.countplot(x = 'TrainingTimesLastYear', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black');
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('TrainingTimesLastYear');

fig = plt.subplots(nrows = 1,ncols = 7,figsize = (25,25))

for i in range(len(df1['TrainingTimesLastYear'].unique())):
    plt.subplot(1,7,i+1)
    l = list(df1.loc[(df1['TrainingTimesLastYear'] == (i))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('TrainingTimesLastYear : ' + str(i));

- Values **3** & **4** have higher counts, but the attrition percentage tells us a different story.
- **TrainingTimesLastYear : 0**, **TrainingTimesLastYear : 4** & **TrainingTimesLastYear : 2** dominate the attrition percentage.
- It looks like training is essential, as the attrition percentage is highest, **27.8%**, when no training is conducted. This may point to a competency problem.
- For **TrainingTimesLastYear : 4**, the attrition percentage is also high at **21.1%**. This may be related to the difficulty of the training & its evaluation.

In [None]:
plt.figure(figsize = (15, 5))

ax = sns.countplot(x = 'WorkLifeBalance', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black');
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
ax.set_xticklabels(['Bad', 'Good', 'Better', 'Best'])
plt.title('WorkLifeBalance');

fig = plt.subplots(nrows = 1,ncols = 4,figsize = (15,15))

for i in range(len(df1['WorkLifeBalance'].unique())):
    plt.subplot(1,4,i+1)
    l = list(df1.loc[(df1['WorkLifeBalance'] == (i+1))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('WorkLifeBalance : ' + ['Bad', 'Good', 'Better', 'Best'][i]);

- As expected **Bad WorkLifeBalance** has resulted in a massive attrition percentage of **31.2%**. 
- Surprisingly, **Best WorkLifeBalance** has the 2nd highest value of attrition percentage. 

- We will check the **WorkLifeBalance** feature with the **JobRole** & **JobLevel** features of the **Employee Job Information**!

In [None]:
fig = plt.subplots(nrows = 2, ncols = 2, figsize = (25,10))
for i in range(4):
    plt.subplot(2,2,i+1)
    ax = sns.countplot(x = 'JobRole', data = df1[df1['WorkLifeBalance'] == (i+1)], palette = colors, edgecolor = 'black')
    for rect in ax.patches:
        ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 0.25, rect.get_height(), horizontalalignment='center', fontsize = 11)
    # Pass the tick labels as a list, as in the other cells
    ax.set_xticklabels(list(tf1['JobRole'][k] for k in sorted(df1[df1['WorkLifeBalance'] == (i+1)]['JobRole'].unique())))
    # No hue is used in this plot, so the Retained / Attrited legend is dropped
    plt.title(['Bad', 'Good', 'Better', 'Best'][i] + ' WorklifeBalance of Different JobRoles');

- **Laboratory Technician**, **Research Scientist** & **Sales Executive** have recorded high numbers for all the values of **WorkLifeBalance**.

In [None]:
fig = plt.subplots(nrows = 2, ncols = 2, figsize = (25,10))
for i in range(4):
    plt.subplot(2,2,i+1)
    ax = sns.countplot(x = 'JobLevel', data = df1[df1['WorkLifeBalance'] == (i+1)], palette = colors, edgecolor = 'black')
    for rect in ax.patches:
        ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height() + 0.25, rect.get_height(), horizontalalignment='center', fontsize = 11)
    ax.set_xticklabels(['JobLevel 1', 'JobLevel 2', 'JobLevel 3', 'JobLevel 4', 'JobLevel 5'])
    plt.title(['Bad', 'Good', 'Better', 'Best'][i] + ' WorklifeBalance');

- **JobLevel 1** & **JobLevel 2** record high values for all the **WorkLifeBalance** values.

### Company Information :

- It includes features that provide information about the company's characteristics w.r.t. employees!

- List of Features :
    - **PercentSalaryHike**
    - **StockOptionLevel**
    - **BusinessTravel**
    - **PerformanceRating**
    - **EnvironmentSatisfaction**
    - **RelationshipSatisfaction**

In [None]:
plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'PercentSalaryHike', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('PercentSalaryHike');

- We can see that low salary hikes of **11 - 14** have been given to a lot of employees, and hence the attrition count is high there as well.
- As the **PercentSalaryHike** increases, the number of attrited employees decreases!

In [None]:
plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'StockOptionLevel', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('StockOptionLevel');

- Same story as the **PercentSalaryHike** can be observed.
- Number of employees reduces as the **StockOptionLevel** increases.

In [None]:
plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'BusinessTravel', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels([tf1['BusinessTravel'][k] for k in sorted(df1['BusinessTravel'].unique())])
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('BusinessTravel');

fig = plt.subplots(nrows = 1,ncols = 3,figsize = (15,15))

for i in range(len(df1['BusinessTravel'].unique())):
    plt.subplot(1,3,i+1)
    l = list(df1.loc[(df1['BusinessTravel'] == i)]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('BusinessTravel : ' + tf1['BusinessTravel'][i]);

- We can see that the number of employees that **Travel_Rarely** is huge compared to **Non-Travel** & **Travel_Frequently**.
- When it comes to attrition rate, **Travel_Frequently** employees have a **25%** probability of being removed from the company.

In [None]:
plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'PerformanceRating', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(['Excellent', 'Outstanding'])
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('PerformanceRating');

fig = plt.subplots(nrows = 1,ncols = 2,figsize = (10,10))

for i in range(len(df1['PerformanceRating'].unique())):
    plt.subplot(1,2,i+1)
    l = list(df1.loc[(df1['PerformanceRating'] == (i+3))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('PerformanceRating : ' + ['Low', 'Good', 'Excellent', 'Outstanding'][i+2]);

- As expected, employees have more **Excellent** rating than **Outstanding**. But when it comes to attrition rate, values of **Excellent** & **Outstanding** are very close with **16.1%** & **16.4%**.
- No data of **Low** & **Good** **PerformanceRating** are recorded.

In [None]:
plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'EnvironmentSatisfaction', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(['Low', 'Medium', 'High', 'Very High'])
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('EnvironmentSatisfaction');

fig = plt.subplots(nrows = 1,ncols = 4,figsize = (15,15))

for i in range(len(df1['EnvironmentSatisfaction'].unique())):
    plt.subplot(1,4,i+1)
    l = list(df1.loc[(df1['EnvironmentSatisfaction'] == (i+1))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('EnvironmentSatisfaction : ' + ['Low', 'Medium', 'High', 'Very High'][i]);

- **High** & **Very High** **EnvironmentSatisfaction** values have been noted the most number of times.
- As expected, they have a low attrition rate as compared to **Low** & **Medium** **EnvironmentSatisfaction**.
- The attrition rate falls as the **EnvironmentSatisfaction** improves!

In [None]:
plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'RelationshipSatisfaction', data = df1, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
ax.set_xticklabels(['Low', 'Medium', 'High', 'Very High'])
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('RelationshipSatisfaction');

fig = plt.subplots(nrows = 1,ncols = 4,figsize = (15,15))

for i in range(len(df1['RelationshipSatisfaction'].unique())):
    plt.subplot(1,4,i+1)
    l = list(df1.loc[(df1['RelationshipSatisfaction'] == (i+1))]['Attrition'].value_counts())
    circle = [l[0] / sum(l) * 100,l[1] / sum(l) * 100]
    plt.pie(circle,labels = list(tf1['Attrition'][k] for k in sorted(df1['Attrition'].unique())),autopct = '%1.1f%%',startangle = 90,explode = (0.1,0),
            colors = colors, wedgeprops = {'edgecolor' : 'black','linewidth': 1,'antialiased' : True})
    plt.title('RelationshipSatisfaction : ' + ['Low', 'Medium', 'High', 'Very High'][i]);

- The above visualizations of **RelationshipSatisfaction** are very similar to those of **EnvironmentSatisfaction**.
- As the value of **RelationshipSatisfaction** improves, the attrition rate reduces. 

### Finances :

- It includes features that provide information about employee finances!

- List of Features :
    - **MonthlyIncome**
    - **HourlyRate**
    - **DailyRate**
    - **MonthlyRate**

In [None]:
df2['MonthlyIncome_Group'] = [int(i / 1000) for i in df1['MonthlyIncome']]
v1 = [df2['MonthlyIncome_Group'].value_counts()[i] for i in sorted(df2['MonthlyIncome_Group'].value_counts().index)]

plt.figure(figsize = (15,5))
ax = sns.lineplot(x = sorted(df2['MonthlyIncome_Group'].value_counts().index), y = v1, lw = 2, color = colors[0], marker = 'o', 
                  markersize = 10, markerfacecolor = colors[1], markeredgewidth = 2, markeredgecolor = colors[0], )
plt.xlabel('MonthlyIncome : Value*1000'); plt.ylabel('Count')
plt.title("MonthlyIncome");

plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'MonthlyIncome_Group', data = df2, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('MonthlyIncome');

- The graph highlights an overall decline in the count of the values.
- **MonthlyIncome** values between **1000 - 2000** are present in high numbers. Values between **3000 - 4000** come second, with more than 200 values present in this range.

In [None]:
df2['HourlyRate_Group'] = [int(i / 10) for i in df1['HourlyRate']]
v1 = [df2['HourlyRate_Group'].value_counts()[i] for i in sorted(df2['HourlyRate_Group'].value_counts().index)]

plt.figure(figsize = (15,5))
ax = sns.lineplot(x = sorted(df2['HourlyRate_Group'].value_counts().index), y = v1, lw = 2, color = colors[0], marker = 'o', 
                  markersize = 10, markerfacecolor = colors[1], markeredgewidth = 2, markeredgecolor = colors[0], )
plt.xlabel('HourlyRate : Value*10'); plt.ylabel('Count')
plt.title("HourlyRate");

plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'HourlyRate_Group', data = df2, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('HourlyRate');

- For **HourlyRate**, values between **30 - 100** are present with a count of more than **175** each.
- Attrition rate of these values is also low and very close to each other. 
- For **HourlyRate** of more than **100**, very few values are present and hence attrition is high as well.

In [None]:
df2['DailyRate_Group'] = [int(i / 100) for i in df1['DailyRate']]
v1 = [df2['DailyRate_Group'].value_counts()[i] for i in sorted(df2['DailyRate_Group'].value_counts().index)]

plt.figure(figsize = (15,5))
ax = sns.lineplot(x = sorted(df2['DailyRate_Group'].value_counts().index), y = v1, lw = 2, color = colors[0], marker = 'o', 
                  markersize = 10, markerfacecolor = colors[1], markeredgewidth = 2, markeredgecolor = colors[0], )
plt.xlabel('DailyRate : Value*100'); plt.ylabel('Count')
plt.title("DailyRate");

plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'DailyRate_Group', data = df2, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('DailyRate');

- The number of attrited employees is similar across the groups. A few drops in the count of values are present.
- Values between **600 - 700** have the lowest count.

In [None]:
df2['MonthlyRate_Group'] = [int(i / 1000) for i in df1['MonthlyRate']]
v1 = [df2['MonthlyRate_Group'].value_counts()[i] for i in sorted(df2['MonthlyRate_Group'].value_counts().index)]

plt.figure(figsize = (15,5))
ax = sns.lineplot(x = sorted(df2['MonthlyRate_Group'].value_counts().index), y = v1, lw = 2, color = colors[0], marker = 'o', 
                  markersize = 10, markerfacecolor = colors[1], markeredgewidth = 2, markeredgecolor = colors[0], )
plt.xlabel('MonthlyRate : Value*1000'); plt.ylabel('Count')
plt.title("MonthlyRate");

plt.figure(figsize = (15, 5))
ax = sns.countplot(x = 'MonthlyRate_Group', data = df2, hue = 'Attrition', palette = colors,edgecolor = 'black')
for rect in ax.patches:
    ax.text(rect.get_x() + rect.get_width() / 2, rect.get_height()+0.01, rect.get_height(), horizontalalignment='center', fontsize = 11)
plt.legend(['Retained Employees', 'Attrited Employees'])
plt.title('MonthlyRate');

- Values of attrited employees are very close to each other. 
- Values between **21000 - 22000** have the highest count.

- We will check the features of **Finances** with **Department** & **JobLevel** features of **Employee Job Information**.

In [None]:
fig = plt.subplots(nrows = 2,ncols = 2,figsize = (15,10))
for i in range(len(l5)):
    plt.subplot(2,2,i+1)
    ax = sns.boxplot(x = 'Department', y = l5[i], data = df1, hue = 'Attrition',palette = colors);
    plt.legend(['RE', 'AE'])
    ax.set_xticklabels([tf1['Department'][k] for k in sorted(df1['Department'].unique())])
    plt.title('Department vs ' + l5[i]);

- **MonthlyIncome** has many outlier values. These outliers probably come from **JobLevel 5**, which has few employees & a low attrition rate.
- For **HourlyRate**, the **Research & Development** & **Sales** departments occupy pretty much the same range of values for attrited & retained employees. The range of values for attrited employees in **Human Resources** is very small.
- The same pattern can be observed for the **Research & Development** & **Sales** departments for **DailyRate** & **MonthlyRate**.

In [None]:
fig = plt.subplots(nrows = 2,ncols = 2,figsize = (15,10))
for i in range(len(l5)):
    plt.subplot(2,2,i+1)
    ax = sns.boxplot(x = 'JobLevel', y = l5[i], data = df1, hue = 'Attrition',palette = colors);
    plt.legend(['RE', 'AE'])
    ax.set_xticklabels(['JobLevel 1', 'JobLevel 2', 'JobLevel 3', 'JobLevel 4', 'JobLevel 5'])
    plt.title('JobLevel vs ' + l5[i]);

- As expected, as the **JobLevel** increases, **MonthlyIncome** increases! The upper limit of each **JobLevel** is lower than the lower limit of the succeeding **JobLevel**.
- For **HourlyRate**, the upper limits of the **JobLevel** groups are very close to each other, so it does not separate the levels the way **MonthlyIncome** does.
- Pretty much the same can be observed for **DailyRate** & **MonthlyRate**, although the upper limit for **JobLevel 5** is clearly differentiable.

# <center><div style="font-family: Times New Roman; border-radius : 10px; background-color: #2BAE66; color: #FCF6F5; padding: 12px; line-height: 1;">Summary of EDA</div></center>

### Summary of Insights / Order / Values of features w.r.t target variable (Attrition) :


- **General Employee Information :**
    
    - **Age** : 20 - 44
    - **Gender**: Male > Female
    - **MaritalStatus** : Single > Married > Divorced
    - **Education** : Below College > Bachelor > College > Master > Doctor
    - **DistanceFromHome** : 20 - 24 > 15 - 19 > 25 - 29 > 10 - 14 > 5 - 9
    - **TotalWorkingYears** : Very high chances during the 1st 10 working years
    - **NumCompaniesWorked** : High chances during 1st - 2nd job. Chances increase by a huge margin after working in 4th company.
    
    
- **Employee Job Information :**
    
    - **EducationField** : Human Resources > Technical Degree > Marketing > Life Sciences > Medical > Other
    - **Department** : Sales > Human Resources > Research & Development
    - **JobLevel** : JobLevel 1 > JobLevel 3 > JobLevel 2 > JobLevel 5 > JobLevel 4. We can see that the **JobRoles** with high attrition rate are present in the **JobLevel** with high attrition rate.
    - **JobRole** : Sales Representative > Laboratory Technician > Human Resources > Sales Executive > Research Scientist > Healthcare Representative = Manufacturing Director > Manager > Research Director
    - **JobInvolvement** : Low > Medium > High > Very High
    - **OverTime** : Yes > No
    - **JobSatisfaction** : Low > Medium > High > Very High
    
    
- **Employee Company Information :**
    
    - **YearsAtCompany** : 0 - 4 > 5 - 9 > 10 - 14
    - **YearsInCurrentRole** : Some peaks of high attrition values are found without any pattern.
    - **YearsWithCurrManager** : Some peaks of high attrition values are found without any pattern.
    - **YearsSinceLastPromotion** : 0 > 1 > 2. Some other peaks are also found with significant values.
    - **TrainingTimesLastYear** : 0 > 4 > 2 > 3 > 1 > 5 > 6
    - **WorkLifeBalance** : Bad > Best > Good > Better
    
    
- **Company Features :**
    
    - **PercentSalaryHike** : 11 - 14 has the highest attrition rate. As the value increases, the number of attrited employees decreases.
    - **StockOptionLevel** : Number of employees reduces as the StockOptionLevel increases.
    - **BusinessTravel** : Travel_Frequently > Travel_Rarely > Non-Travel
    - **PerformanceRating** : Excellent = Outstanding. No values of Low & Good recorded.
    - **EnvironmentSatisfaction** : Low > Medium > High > Very High
    - **RelationshipSatisfaction** : Low > High > Medium > Very High
    
    
- **Finances :**
    
    - **MonthlyIncome** : 2000 - 3000
    - **HourlyRate** : 50 - 60. Values are very close to each other.
    - **DailyRate** : 300 - 400. Values are very close to each other.
    - **MonthlyRate** : Very close and small peaks are present. 


**According to the data, these insights / order / range of values can lead to attrition!**

# <center><div style="font-family: Times New Roman; border-radius : 10px; background-color: #2BAE66; color: #FCF6F5; padding: 12px; line-height: 1;">Feature Engineering</div></center>

- The dataset is **Unbalanced** with a bias towards **Retained Employees** in a ratio of **5.2 : 1** for **Retained Employees : Attrited Employees**. We will first balance the dataset using **SMOTE** (Synthetic Minority Over-sampling Technique)!

- In order to cope with unbalanced data, there are 2 options :

    - **Undersampling** : Trim down the majority samples of the target variable.
    - **Oversampling** : Increase the minority samples of the target variable to the majority samples.
    
- For best performance, a combination of undersampling and oversampling is often recommended.
- Here, however, we only oversample the minority class because the dataset is small.
- For data balancing, we will use **imblearn**.
- **PIP statement** : pip install imbalanced-learn

### Data Balancing using SMOTE :

In [None]:
import imblearn
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

In [None]:
cols = list(df1.columns)
cols.remove('Attrition')

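# Oversample the minority class until minority / majority = 0.85
over = SMOTE(sampling_strategy = 0.85)
# Defined for reference only; it is not added to the pipeline below
under = RandomUnderSampler(sampling_strategy = 0.1)
f1 = df1.loc[:,cols]
t1 = df1.loc[:,'Attrition']

# Only the oversampling step is applied
steps = [('over', over)]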
pipeline = Pipeline(steps=steps)
f1, t1 = pipeline.fit_resample(f1, t1)
Counter(t1)

### Calculation for Data Balancing :

- **Sampling Strategy** : It is a ratio that is the common parameter for oversampling and undersampling.
- **Sampling Strategy** : **( Samples of Minority Class ) / ( Samples of Majority Class )** after resampling


- In this case,

    - **Majority Class : Retained Employees** : 1233 samples
    - **Minority Class : Attrited Employees** : 237 samples


### Oversampling : Increase the minority class samples

- Sampling_Strategy = 0.85
- 0.85 = ( Minority Class Samples ) / 1233
- After oversampling, 

    - **Majority Class : Retained Employees** : 1233 samples
    - **Minority Class : Attrited Employees** : 1048 samples
    

- Final Class Samples :

    - **Majority Class : Retained Employees** : 1233 samples
    - **Minority Class : Attrited Employees** : 1048 samples
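
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the calculation using the class counts listed above; the variable names are illustrative.

```python
# Class counts stated above.
n_majority = 1233   # Retained Employees
n_minority = 237    # Attrited Employees
sampling_strategy = 0.85

# Target minority size = ratio * majority size, truncated to a whole sample.
target_minority = int(sampling_strategy * n_majority)

# Number of synthetic minority samples that have to be generated.
n_synthetic = target_minority - n_minority

print(target_minority, n_synthetic, round(n_majority / n_minority, 1))
```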


- Here, we balance the data by raising the minority class towards the size of the majority class. We only increase the minority data points because the dataset is small.
- For imbalanced datasets, we add minority samples to deal with the potential bias in the predictions. 
- **SMOTE** does not copy existing rows; it creates **synthetic data** by interpolating between neighbouring minority samples. We use this synthetic data for modeling so that the predictions are not skewed towards the majority target class value.
- Thus, evaluating models using **accuracy** alone can be misleading. Instead, we will go for **confusion matrix, ROC-AUC graph and ROC-AUC score** for model evaluation.
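
SMOTE builds each new minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The following is a minimal sketch of that single interpolation step only, not the **imblearn** implementation; the function name `smote_like_sample` and the two points are hypothetical.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a point on the line segment between x and a minority-class neighbour."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)
x = [30.0, 2000.0]          # hypothetical minority sample (e.g. Age, MonthlyIncome)
neighbor = [34.0, 2600.0]   # hypothetical nearest minority neighbour
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Every feature of the synthetic point lies between the corresponding values of the two original points.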

### Data Leakage : 

- **Data Leakage** occurs when information from outside the training data is used to create the model. It is one of the most ignored problems.
- In order to create robust models, solving data leakage is a must! Otherwise we end up with overly optimistic models that are practically useless & cannot be used in production, which has become common.
- Model performance degrades when **Data Leakage** is not dealt with & the model is deployed. It is a difficult concept to grasp because it seems quite trivial.
- A typical approach is to transform / modify the entire dataset, e.g. filling NaN values with the mean, median or mode, standardization, normalization, etc.
- When we execute the above process to make the dataset ready for modeling, we use values from the entire dataset & thus indirectly provide information from the to-be test data, i.e. from outside the training data.
- Thus, to avoid **Data Leakage**, it is advised to do the train-test split before any transformations. Fit the transformations on the training data only, then apply them to both the training & test data. Use of k-fold cross validation is also suggested!
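
A minimal sketch of leakage-free scaling in plain Python: the scaling parameters are learned from the training values only and then reused for the test values. The helper names `fit_min_max` / `apply_min_max` and the numbers are hypothetical.

```python
def fit_min_max(values):
    """Learn the scaling parameters from the training values only."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with previously learned parameters."""
    return [(v - lo) / (hi - lo) for v in values]

train_income = [2000, 4000, 6000]   # hypothetical training values
test_income = [3000, 7000]          # hypothetical held-out values

lo, hi = fit_min_max(train_income)  # parameters come from the training data only
train_scaled = apply_min_max(train_income, lo, hi)
test_scaled = apply_min_max(test_income, lo, hi)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.25]
```

A scaled test value can fall outside [0, 1] because the test data played no part in choosing the minimum and maximum, which is exactly the point.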

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(f1, t1, test_size = 0.15, random_state = 2)

### Correlation Matrix :

In [None]:
x_train_test = x_train.copy(deep = True)
x_train_test['Attrition'] = y_train

- In order to visualize the correlation matrix, we create a new dataframe that contains values from **x_train** & **y_train**.
- Thus, we reject anything outside the training data to avoid data leakage.

In [None]:
corr = x_train_test.corrwith(x_train_test['Attrition']).sort_values(ascending = False).to_frame()
corr.columns = ['Attrition']
plt.subplots(figsize = (7,10))
sns.heatmap(corr,annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black');
plt.title('Correlation w.r.t Attrition');

- None of the features display a strong positive or negative correlation with **Attrition**.
- Most of the features have values in the range **[-0.3, 0.14]**.

### Feature Selection for Categorical Features :

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif,chi2

#### Mutual Information Test :

In [None]:
features = x_train.loc[:,categorical_features]
target = pd.DataFrame(y_train)

best_features = SelectKBest(score_func = mutual_info_classif,k = 'all')
fit = best_features.fit(features,target)

featureScores = pd.DataFrame(data = fit.scores_,index = list(features.columns),columns = ['Mutual Information Score']) 

plt.subplots(figsize = (5,5))
sns.heatmap(featureScores.sort_values(ascending = False,by = 'Mutual Information Score'),annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',fmt = '.2f');
plt.title('Selection of Categorical Features');

- The Mutual Information Scores of the categorical features with **Attrition** are very low.
- According to the above scores, none of the features should be selected for modeling.
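
To give a feel for the scale of these scores, here is a sketch of the textbook mutual information between two discrete variables, in plain Python. It is an illustration of the definition only; `mutual_info_classif` may use a different estimator internally, so its scores need not match this formula exactly. The function name and toy data are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )

target = [0, 0, 1, 1]
mi_identical = mutual_information([0, 0, 1, 1], target)    # feature equals the target
mi_independent = mutual_information([0, 1, 0, 1], target)  # feature tells nothing about the target
print(mi_identical, mi_independent)
```

A feature that fully determines a balanced binary target scores ln 2 (about 0.69), while an independent feature scores 0, so values close to 0 indicate little shared information.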

#### Chi Squared Test :

In [None]:
features = x_train.loc[:,categorical_features]
target = pd.DataFrame(y_train)

best_features = SelectKBest(score_func = chi2,k = 'all')
fit = best_features.fit(features,target)

featureScores = pd.DataFrame(data = fit.scores_,index = list(features.columns),columns = ['Chi Squared Score']) 

plt.subplots(figsize = (5,5))
sns.heatmap(featureScores.sort_values(ascending = False,by = 'Chi Squared Score'),annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',fmt = '.2f');
plt.title('Selection of Categorical Features');

- From the above **Chi Squared Score Test**, we will drop the following features : **PerformanceRating**, **Department**, **JobRole**, **EducationField**, **BusinessTravel**, **MaritalStatus** & **Gender**.
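
A minimal sketch of how a chi-squared score of this kind can be computed for a single feature, assuming the same formulation as scikit-learn's `chi2`: the per-class sums of the feature are the observed values, compared with the sums expected if the feature were independent of the class. The function name `chi2_score` and the toy data are hypothetical.

```python
def chi2_score(feature, labels):
    """Chi-squared score of one non-negative feature against class labels."""
    n = len(feature)
    total = sum(feature)  # overall sum of the feature
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: sum of the feature over the rows of this class.
        observed = sum(f for f, y in zip(feature, labels) if y == cls)
        # Expected: the class share of the overall sum under independence.
        expected = total * labels.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score

labels = [1, 1, 0, 0]
print(chi2_score([1, 1, 0, 0], labels))  # feature follows the class
print(chi2_score([1, 0, 1, 0], labels))  # feature unrelated to the class
print(chi2_score([2, 2, 0, 0], labels))  # same pattern, larger codes
```

Because the score is built from sums of the feature values, it requires non-negative features and grows with their magnitude, so the scores of label-encoded categories depend on how the codes were assigned.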

### Feature Selection for Numerical Features :

#### ANOVA Test :

In [None]:
from sklearn.feature_selection import f_classif

features = x_train.loc[:,discrete_features]
target = pd.DataFrame(y_train)

best_features = SelectKBest(score_func = f_classif,k = 'all')
fit = best_features.fit(features,target)

featureScores = pd.DataFrame(data = fit.scores_,index = list(features.columns),columns = ['ANOVA Score']) 

plt.subplots(figsize = (5,5))
sns.heatmap(featureScores.sort_values(ascending = False,by = 'ANOVA Score'),annot = True,cmap = colors,linewidths = 0.4,linecolor = 'black',fmt = '.2f');
plt.title('Selection of Numerical Features');
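
The ANOVA score is the F-statistic: the variance between the class means divided by the variance within the classes. The sketch below, on synthetic data with hypothetical variable names, computes it by hand for one feature and checks it against `f_classif`.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(2)
n = 1000

# Hypothetical binary target and one numerical feature whose mean depends on it.
y = rng.integers(0, 2, size=n)
shifted = rng.normal(loc=y * 1.0, scale=1.0)

# Manual one-way ANOVA F-statistic.
groups = [shifted[y == c] for c in (0, 1)]
grand_mean = shifted.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1
df_within = n - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Same statistic from scikit-learn (expects a 2D feature matrix).
f_scores, p_values = f_classif(shifted.reshape(-1, 1), y)

print(f"manual F    : {f_manual:.2f}")
print(f"f_classif F : {f_scores[0]:.2f}")
```

A feature whose class means barely differ yields an F-statistic close to 1, which is why the features at the bottom of the heatmap are candidates for removal.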

- From the above **ANOVA Score Test**, we will drop the following features : **MonthlyRate**, **HourlyRate**, **NumCompaniesWorked**, **PercentSalaryHike**, **YearsSinceLastPromotion**, **DistanceFromHome** & **DailyRate**.
- We ready the datasets for data scaling by dropping the features based on the above statistical tests.

In [None]:
# Features rejected by the ANOVA and Chi Squared tests above
drop_features = ['MonthlyRate', 'HourlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 
                 'YearsSinceLastPromotion', 'DistanceFromHome','DailyRate',
                 'PerformanceRating', 'Department', 'JobRole', 'EducationField', 
                 'BusinessTravel', 'MaritalStatus' ,'Gender']

x_train = x_train.drop(columns = drop_features)
x_test = x_test.drop(columns = drop_features)

### Data Scaling :

In [None]:
from sklearn.preprocessing import MinMaxScaler,StandardScaler
mms = MinMaxScaler() # Normalization
ss = StandardScaler() # Standardization

# Normalization : fit on the training data only, then apply the same scaling to the test data
normalize_features = ['MonthlyIncome', 'TotalWorkingYears', 'YearsAtCompany', 
                      'YearsInCurrentRole', 'YearsWithCurrManager']
x_train[normalize_features] = mms.fit_transform(x_train[normalize_features])
x_test[normalize_features] = mms.transform(x_test[normalize_features])

# Standardization : fit on the training data only, then apply the same scaling to the test data
standardize_features = ['Age', 'Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 
                        'JobSatisfaction', 'OverTime', 'RelationshipSatisfaction', 'StockOptionLevel', 
                        'TrainingTimesLastYear', 'WorkLifeBalance']
x_train[standardize_features] = ss.fit_transform(x_train[standardize_features])
x_test[standardize_features] = ss.transform(x_test[standardize_features])

- A machine learning model does not understand the units of the feature values. It treats each input as a plain number without knowing what that value means. Thus, it becomes necessary to scale the data.

- We have 2 options for data scaling : 
    
    1) **Normalization** 
    
    2) **Standardization** 


- As many algorithms assume the data to be normally (Gaussian) distributed, **Normalization** is done for features whose data does not display a normal distribution, and **Standardization** is carried out for features that are normally distributed but whose range of values is much larger or smaller than that of the other features.

- In the above transformation, we fit the scalers on the training data and transform the test data using the parameters learned from the training data. The formulas of **Normalization** & **Standardization** use the **min & max** values and the **mean** & **standard deviation** respectively.

- If these statistical parameters were calculated using the complete dataset, information from the **to-be test data** would be shared with the training data and cause **Data Leakage**.
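
The point about fitting on the training data can be sketched as follows. The example uses a synthetic column (hypothetical values, not the IBM data) and checks that transforming the test data with a scaler fitted on the training data matches the textbook formulas evaluated with the training statistics only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(3)

# Hypothetical single-column train / test split.
train = rng.normal(50, 10, size=(200, 1))
test = rng.normal(50, 10, size=(50, 1))

# Fit both scalers on the training data only.
mms = MinMaxScaler().fit(train)
ss = StandardScaler().fit(train)

# Normalization : (x - min_train) / (max_train - min_train)
manual_norm = (test - train.min()) / (train.max() - train.min())

# Standardization : (x - mean_train) / std_train (population std, ddof = 0)
manual_std = (test - train.mean()) / train.std()

norm_matches = np.allclose(mms.transform(test), manual_norm)
std_matches = np.allclose(ss.transform(test), manual_std)

print("Normalization uses training min / max     :", norm_matches)
print("Standardization uses training mean / std  :", std_matches)
```

Because the test data never enters `fit`, normalized test values may fall slightly outside [0, 1]. That is expected and is the price of keeping the test data unseen.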

# <center><div style="font-family: Times New Roman; border-radius : 10px; background-color: #2BAE66; color: #FCF6F5; padding: 12px; line-height: 1;">Modeling</div></center>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay # replaces plot_roc_curve, which was removed in scikit-learn 1.2
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import precision_recall_curve

- The features selected by the above tests are used for modeling, with the data split into **85 - 15 train - test** groups.

In [None]:
def model(classifier,x_train,y_train,x_test,y_test):
    
    # Fit on the training data and predict class labels for the test data
    classifier.fit(x_train,y_train)
    prediction = classifier.predict(x_test)
    # Repeated stratified 10-fold cross validation on the training data
    cv = RepeatedStratifiedKFold(n_splits = 10,n_repeats = 3,random_state = 1)
    print("Cross Validation Score : ",'{0:.2%}'.format(cross_val_score(classifier,x_train,y_train,cv = cv,scoring = 'roc_auc').mean()))
    # Note : this score is computed from hard class predictions, not predicted probabilities
    print("ROC_AUC Score : ",'{0:.2%}'.format(roc_auc_score(y_test,prediction)))
    RocCurveDisplay.from_estimator(classifier, x_test,y_test)
    plt.title('ROC_AUC_Plot')
    plt.show()

def model_evaluation(classifier,x_test,y_test):
    
    # Confusion Matrix
    cm = confusion_matrix(y_test,classifier.predict(x_test))
    names = ['True Neg','False Pos','False Neg','True Pos']
    counts = [value for value in cm.flatten()]
    percentages = ['{0:.2%}'.format(value) for value in cm.flatten()/np.sum(cm)]
    labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(names,counts,percentages)]
    labels = np.asarray(labels).reshape(2,2)
    sns.heatmap(cm,annot = labels,cmap = colors,fmt ='')
    
    # Classification Report
    print(classification_report(y_test,classifier.predict(x_test)))
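
The `model` helper above passes class labels from `classifier.predict` to `roc_auc_score`. The sketch below, on synthetic data with a `LogisticRegression` stand-in (an assumption for illustration, not one of the notebook's models), shows how that number relates to the score obtained from predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical binary classification data with an 85 - 15 stratified split.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

labels = clf.predict(X_te)               # hard 0 / 1 predictions
scores = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)
bal_acc = balanced_accuracy_score(y_te, labels)

print(f"ROC AUC from labels        : {auc_from_labels:.4f}")
print(f"Balanced accuracy          : {bal_acc:.4f}")
print(f"ROC AUC from probabilities : {auc_from_scores:.4f}")
```

With binary labels the ROC curve has a single operating point, so the resulting area equals the balanced accuracy at the default threshold. The probability-based score, like the `roc_auc` scorer used in cross validation, summarises ranking quality across all thresholds, which is worth keeping in mind when the two columns of a results table are compared.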

### 1] XGBoostClassifier :

In [None]:
from xgboost import XGBClassifier

In [None]:
classifier_xgb = XGBClassifier(learning_rate= 0.01,max_depth = 3,n_estimators = 1000)

In [None]:
model(classifier_xgb,x_train.values,y_train.values,x_test.values,y_test.values)
model_evaluation(classifier_xgb,x_test.values,y_test.values)

### 2] LGBMClassifier :

In [None]:
from lightgbm import LGBMClassifier

In [None]:
classifier_lgbm = LGBMClassifier(learning_rate= 0.01,max_depth = 3,n_estimators = 1000)

In [None]:
model(classifier_lgbm,x_train.values,y_train.values,x_test.values,y_test.values)
model_evaluation(classifier_lgbm,x_test.values,y_test.values)

### 3] Decision Tree Classifier :

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
classifier_dt = DecisionTreeClassifier(random_state = 1000,max_depth = 4,min_samples_leaf = 1)

In [None]:
model(classifier_dt,x_train.values,y_train.values,x_test.values,y_test.values)
model_evaluation(classifier_dt,x_test.values,y_test.values)

### 4] RandomForest Classifier :

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
classifier_rf = RandomForestClassifier(max_depth = 4,random_state = 0)

In [None]:
model(classifier_rf,x_train.values,y_train.values,x_test.values,y_test.values)
model_evaluation(classifier_rf,x_test.values,y_test.values)

### ML Algorithm Results Table :

#### Results Table for models based on Statistical Test : 

|Sr. No.|ML Algorithm|Cross Validation Score|ROC AUC Score|F1 Score (Attrition)| F1 Score (No Attrition)|
|-|-|-|-|-|-|
|1|XGB Classifier|91.92%|88.25%|87%|89%|
|2|LGBM Classifier|92.14%|88.44%|88%|90%|
|3|Decision Tree Classifier|80.22%|78.68%|77%|81%|
|4|RandomForest Classifier|87.62%|81.72%|80%|84%|

# <center><div style="font-family: Times New Roman; border-radius : 10px; background-color: #2BAE66; color: #FCF6F5; padding: 12px; line-height: 1;">Conclusion</div></center>

- This is an extensive dataset with more than 30 features that poses a binary classification problem with multiple text and numerical features that are categorical & discrete in nature.


- This is another imbalanced dataset that needs to be dealt with using **SMOTE**. It provides us with a plethora of opportunities to work on EDA using visualizations to gain insights. Grouping the features together is key!


- We also aim to make the models robust by addressing the **Data Leakage** problem. Model performances are good as well. It also gives us chances to learn about varied code optimization techniques.

### References :
- [Image Source](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQoKWCb0545g__QBdCLP8_7IUmIjC2GFZtzBQ&usqp=CAU)
