# Introduction

<font color ='blue' >
Content:

1. [Load and Check Data](#1)
    
2. [Variable Description](#2)
    
3. [Outlier Detection](#3)
    
4. [Missing Value](#4)
    
5. [Basic Data Analysis and Feature Engineering](#5)
    
  5.1. [Numerical Variable](#6)
    
  5.2. [Categorical Variable](#7)
    
6. [Modeling](#8)
    
  6.1 [Hyperparameter Tuning -- Cross Validation Settings](#9)
    
  6.2 [Ensemble modelling with imbalanced and balanced datasets](#10)
        
    6.2.1 [Imbalanced Dataset](#11)
    
    6.2.2 [Over sampling Dataset](#12)
    
    6.2.3 [Under sampling the Dataset](#13)
    
    6.2.4 [Smote Dataset](#14)
    
    6.2.5 [Adasyn Dataset](#15)
    
7. [Accuracy Score Table](#16)
  7.1 [Best 10 Value Score Table](#17)
    
    
    

    

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.figure_factory as ff

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import scipy.stats as stats
import sklearn
from sklearn.model_selection import RandomizedSearchCV

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import warnings
warnings.filterwarnings("ignore")


<a id= '1'></a><br>
<font color ='blue' >
# Load and Check Data

Using the "IBM HR Analytics Employee Attrition & Performance" dataset, this notebook asks which factors affect employee attrition at IBM. The 'Attrition' feature is the target variable.

In [None]:
employee = pd.read_csv("/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [None]:
employee.head()

In [None]:
employee.info()

<a id= '2'></a><br>
<font color ='blue' >
# Variable Description

Feature descriptions are below:

    * AGE: Numerical Value
    * ATTRITION: Employee leaving the company (0=no, 1=yes)
    * BUSINESS TRAVEL: (1=No Travel, 2=Travel Frequently, 3=Travel Rarely)
    * DAILY RATE: Numerical Value - Salary Level
    * DEPARTMENT: (1=HR, 2=R&D, 3=Sales)
    * DISTANCE FROM HOME: Numerical Value - THE DISTANCE FROM WORK TO HOME
    * EDUCATION: Numerical Value
    * EDUCATION FIELD: (1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL)
    * ENVIRONMENT SATISFACTION: Numerical Value - SATISFACTION WITH THE ENVIRONMENT
    * GENDER: (1=FEMALE, 2=MALE)
    * HOURLY RATE: Numerical Value - HOURLY SALARY
    * JOB INVOLVEMENT: Numerical Value - JOB INVOLVEMENT
    * JOB LEVEL: Numerical Value - LEVEL OF JOB
    * JOB ROLE: (1=HC REP, 2=HR, 3=LAB TECHNICIAN, 4=MANAGER, 5=MANUFACTURING DIRECTOR, 6=RESEARCH DIRECTOR, 7=RESEARCH SCIENTIST, 8=SALES EXECUTIVE, 9=SALES REPRESENTATIVE)
    * JOB SATISFACTION: Numerical Value - SATISFACTION WITH THE JOB
    * MARITAL STATUS: (1=DIVORCED, 2=MARRIED, 3=SINGLE)
    * MONTHLY INCOME: Numerical Value - MONTHLY SALARY
    * MONTHLY RATE: Numerical Value - MONTHLY RATE
    * NUMCOMPANIES WORKED: Numerical Value - NO. OF COMPANIES WORKED AT
    * OVERTIME: (1=NO, 2=YES)
    * PERCENT SALARY HIKE: Numerical Value - PERCENTAGE INCREASE IN SALARY
    * PERFORMANCE RATING: Numerical Value - PERFORMANCE RATING
    * RELATIONSHIP SATISFACTION: Numerical Value - RELATIONSHIP SATISFACTION
    * STOCK OPTIONS LEVEL: Numerical Value - STOCK OPTIONS
    * TOTAL WORKING YEARS: Numerical Value - TOTAL YEARS WORKED
    * TRAINING TIMES LAST YEAR: Numerical Value - HOURS SPENT TRAINING
    * WORK LIFE BALANCE: Numerical Value - TIME SPENT BETWEEN WORK AND OUTSIDE
    * YEARS AT COMPANY: Numerical Value - TOTAL NUMBER OF YEARS AT THE COMPANY
    * YEARS IN CURRENT ROLE: Numerical Value - YEARS IN CURRENT ROLE
    * YEARS SINCE LAST PROMOTION: Numerical Value - LAST PROMOTION
    * YEARS WITH CURRENT MANAGER: Numerical Value - YEARS SPENT WITH CURRENT MANAGER


In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
layout = go.Layout(
    title='Overall Distribution of the Attrition Feature',
)
fig = go.Figure([go.Bar(x=employee["Attrition"].value_counts().index.values, y=employee["Attrition"].value_counts().values)],layout=layout)
fig.show()


    * Dataset structure: 1470 rows, 35 features
    * Data type: int64 and object
    * Imbalanced dataset: 1233 (84%) 'no' attrition and 237 (16%) 'yes' attrition
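
The class counts above imply that a model which always predicts 'no' would already be right about 84% of the time, which is why accuracy alone can mislead on this dataset. A minimal sketch of that arithmetic, using only the two counts quoted above (the variable names are illustrative):

```python
# Class counts quoted in the summary above.
counts = {"No": 1233, "Yes": 237}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(total)                        # 1470
print(round(shares["Yes"], 3))      # 0.161
print(round(baseline_accuracy, 3))  # 0.839
```

Any classifier trained later should be judged against this baseline rather than against 50%.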



<a id= '3'></a><br>
<font color ='blue' >
# Outlier Detection


Instead of dropping outliers feature by feature, I dropped only the rows that are outliers in several features at once (more than two features, using the 1.5 × IQR rule).
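
To make the rule concrete, here is a toy sketch (the data frame and its column names `a`, `b`, `c` are made up for illustration): a row is flagged only when it falls outside the 1.5 × IQR fences in more than two features.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical toy data: row 9 is extreme in all three columns,
# row 0 is extreme only in column "b".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [50, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "c": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
})

hits = Counter()
for col in toy.columns:
    q1, q3 = np.percentile(toy[col], [25, 75])
    step = 1.5 * (q3 - q1)
    outside = toy[(toy[col] < q1 - step) | (toy[col] > q3 + step)].index
    hits.update(outside)

# Keep only rows that are outliers in more than two features.
flagged = [int(idx) for idx, n in hits.items() if n > 2]
print(flagged)  # [9]
```

Row 0 survives because it is extreme in a single column only, which is the point of counting outliers per row instead of per feature.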

In [None]:
from collections import Counter
def detect_outliers(df,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
    
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    
    return multiple_outliers

In [None]:
# Features checked for outliers (each listed once, so no feature is counted twice)
outlier_features = ['Age','DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','PercentSalaryHike',
                    'TotalWorkingYears','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager','NumCompaniesWorked',
                    'Education','EnvironmentSatisfaction','JobInvolvement','JobLevel','JobSatisfaction','PerformanceRating',
                    'RelationshipSatisfaction','StockOptionLevel','TrainingTimesLastYear','WorkLifeBalance']
employee.loc[detect_outliers(employee, outlier_features)]


In [None]:
# drop outliers
employee = employee.drop(detect_outliers(employee, outlier_features), axis = 0).reset_index(drop = True)

<a id= '4'></a><br>
<font color ='blue' >
# Missing Value

We check for null values. If any exist, we will evaluate them and either drop them or impute them with an estimate.
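
As a hedged sketch of the two options mentioned above (dropping versus imputing), the snippet below uses a small made-up frame. The column names and the choice of median and most frequent value as estimates are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 45.0, 50.0],
    "Dept": ["Sales", "HR", None, "Sales"],
})

# Columns that contain at least one null value.
cols_with_nulls = toy.columns[toy.isnull().any()].tolist()

# Option 1: drop every row that has a null.
dropped = toy.dropna()

# Option 2: impute (median for numeric, most frequent value for text).
imputed = toy.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
imputed["Dept"] = imputed["Dept"].fillna(imputed["Dept"].mode()[0])

print(cols_with_nulls)               # ['Age', 'Dept']
print(len(dropped))                  # 2
print(imputed.isnull().sum().sum())  # 0
```

Dropping is simplest but loses rows; imputing keeps every row at the cost of introducing estimated values.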

In [None]:
employee.columns[employee.isnull().any()]

In [None]:
# Hooray! No column contains missing values.

<a id= '5'></a><br>
<font color ='blue' >
# Basic Data Analysis and Feature Engineering

I looked at the number of unique values of each feature. We will also benefit from this information when determining the data type of the features.

In [None]:
employee.nunique()

#### Dropped Feature

 Feature with 1470 unique values
    
    * 'EmployeeNumber'
    
   
 Features with 1 unique value
 
    * 'Over18'
    * 'StandardHours' 
    * 'EmployeeCount'
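
The same selection can be automated. A minimal sketch on a made-up frame (the helper name `uninformative_columns` is hypothetical): columns with a single unique value are constant, and columns with one unique value per row behave like identifiers.

```python
import pandas as pd

def uninformative_columns(df):
    """Return (constant_columns, id_like_columns) based on unique counts."""
    n_unique = df.nunique()
    constant = n_unique[n_unique == 1].index.tolist()
    id_like = n_unique[n_unique == len(df)].index.tolist()
    return constant, id_like

# Hypothetical toy data mirroring the situation described above.
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3, 4],
    "Over18": ["Y", "Y", "Y", "Y"],
    "Age": [30, 41, 30, 52],
})

constant, id_like = uninformative_columns(toy)
print(constant)  # ['Over18']
print(id_like)   # ['EmployeeNumber']
```

The identifier check is only a heuristic: a continuous measurement can also be unique in every row, so its result should be reviewed before dropping anything.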

In [None]:
employee.drop(['EmployeeCount','Over18','StandardHours','EmployeeNumber'],axis=1,inplace=True)

Categorical Variable:

    * 'BusinessTravel'
    * 'Department'
    * 'Education'
    * 'EducationField'
    * 'EnvironmentSatisfaction'
    * 'Gender'
    * 'JobInvolvement'
    * 'JobLevel'
    * 'JobRole'
    * 'JobSatisfaction'
    * 'MaritalStatus'
    * 'NumCompaniesWorked'
    * 'OverTime'
    * 'PerformanceRating'
    * 'RelationshipSatisfaction'
    * 'StockOptionLevel'
    * 'TrainingTimesLastYear'
    * 'WorkLifeBalance'
    * 'PercentSalaryHike'
    * 'DistanceFromHome'
    
Numerical Variable:

    * 'Age'
    * 'DailyRate'
    * 'YearsSinceLastPromotion'
    * 'HourlyRate'
    * 'MonthlyIncome'
    * 'MonthlyRate'
    * 'TotalWorkingYears'
    * 'YearsAtCompany'
    * 'YearsWithCurrManager'
    * 'YearsInCurrentRole'

 Target Variable:
 
    * 'Attrition' 


<a id= '6'></a><br>
<font color ='blue' >
## Numerical Variable

In [None]:
numerical_employee=employee.drop(['Attrition','BusinessTravel','Department','Education','EducationField','EnvironmentSatisfaction','Gender','JobInvolvement','JobLevel','JobRole','JobSatisfaction','MaritalStatus','NumCompaniesWorked','OverTime','PerformanceRating','RelationshipSatisfaction','StockOptionLevel','TrainingTimesLastYear','WorkLifeBalance','DistanceFromHome','PercentSalaryHike'],axis=1)

In [None]:
numerical_employee.head()

In [None]:
numerical_employee.info()

In [None]:
def datauret(a,numerical_employee):
    x = ["Yes", "No"]
    y = [numerical_employee[employee['Attrition']=='Yes'][a].mean(),numerical_employee[employee['Attrition']=='No'][a].mean()]
    
    trace = go.Bar(
        name=a,
        x=x,
        y=y,
    )
    
    return trace

In [None]:
def datahist(a,numerical_employee):
    
    trace = go.Histogram(
        name=a,
        x=numerical_employee[a],
        nbinsx=60,
    )
    
    return trace

In [None]:
data_numerical=list()
rate=numerical_employee
for i in range(len(rate.columns)):
    data_numerical.append(datahist(rate.columns[i],numerical_employee))
    


In [None]:

def visibleTF_s(number):
    liste=list()
    for i in range(len(data_numerical)):
        liste.append(False)
    liste[number]=True
    return liste

def button(attribute,number):
    return dict(label = attribute,method = 'update',args = [{'visible': visibleTF_s(number)},{'title': 'Numerical-Attribute relationship'}])

In [None]:
layout = go.Layout(
    barmode='stack',
    width=700,
    height=500,
    autosize=False,
    title='Numerical-Attribute relationship',
        
    xaxis=go.layout.XAxis(
        title=go.layout.xaxis.Title(
            #text='x Axis',
            font=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f6f'
            )
        )
    ),
    yaxis=go.layout.YAxis(
        title=go.layout.yaxis.Title(
            #text='y Axis',
            font=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f7f'
            )
        )
    )
)
rate=numerical_employee
updatemenus = list([dict(active=-1,buttons=[button(rate.columns[i],i) for i in range(len(rate.columns))])])

In [None]:
# Create figure
fig = go.Figure()

for i in range(len(numerical_employee.columns)):
    fig.add_trace(data_numerical[i])   

fig.update_layout(
    updatemenus=updatemenus)

fig.show()

In [None]:
import plotly.express as px
fig = px.scatter_matrix(numerical_employee)
fig.show()

In [None]:
numerical_employee.corr()

##### Based on the correlation matrix and the graphs above, I will create new features by combining the highly correlated ones with PCA.

#### Age-MonthlyIncome-TotalWorkingYears

Before applying PCA to the Age, MonthlyIncome and TotalWorkingYears features, we first determine the number of components n.
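
One way to pick n, sketched here on synthetic data rather than the real columns: fit a full PCA on standardized inputs and take the smallest number of components whose cumulative explained variance passes a chosen threshold. The 0.80 threshold and the synthetic features are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three strongly correlated synthetic features driven by one latent factor.
latent = rng.normal(size=500)
X_raw = np.column_stack([
    latent + 0.1 * rng.normal(size=500),
    2.0 * latent + 0.1 * rng.normal(size=500),
    -latent + 0.1 * rng.normal(size=500),
])

X_std = StandardScaler().fit_transform(X_raw)
cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)

# Smallest n whose cumulative explained variance reaches the threshold.
threshold = 0.80  # assumed cut-off, not taken from the notebook
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components)  # 1
```

Reading the elbow off a cumulative-variance plot, as done in the cells that follow, is the visual counterpart of this rule.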

In [None]:
numerical_employee[['Age','MonthlyIncome','TotalWorkingYears']].corr()

In [None]:
# PCA1--------------------------------Age---MonthlyIncome----TotalWorkingYears

X = StandardScaler().fit_transform(numerical_employee[['Age','MonthlyIncome','TotalWorkingYears']])
pca = PCA(n_components=2)
pca.fit(X)
X_pca=pca.transform(X)
print(pca.explained_variance_ratio_)
sum(pca.explained_variance_ratio_)


In [None]:
import matplotlib.pyplot as plt
pca=PCA(whiten=True).fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

In [None]:
# From the graph above, we chose n=1
X = StandardScaler().fit_transform(numerical_employee[['Age','MonthlyIncome','TotalWorkingYears']])
pca = PCA(n_components=1)
pca.fit(X)
X_pca=pca.transform(X)
numerical_employee['PCA1']=X_pca
numerical_employee.drop(['Age','MonthlyIncome','TotalWorkingYears'],axis=1,inplace=True)
employee['PCA1']=X_pca


#### YearsAtCompany--YearsInCurrentRole--YearsSinceLastPromotion--YearsWithCurrManager

Before applying PCA to the YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion and YearsWithCurrManager features, we first determine the number of components n.

In [None]:
numerical_employee[['YearsAtCompany','YearsWithCurrManager','YearsInCurrentRole','YearsSinceLastPromotion']].corr()

In [None]:
X = StandardScaler().fit_transform(numerical_employee[['YearsAtCompany','YearsWithCurrManager','YearsInCurrentRole','YearsSinceLastPromotion']])
pca = PCA(n_components=2)
pca.fit(X)
X_pca=pca.transform(X)
print(pca.explained_variance_ratio_)
sum(pca.explained_variance_ratio_)

In [None]:
pca=PCA(whiten=True).fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

In [None]:
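# We use PCA with 2 components, stored as the PCA2 and PCA3 features.
# Note: YearsSinceLastPromotion is not part of this projection, but it is still dropped from numerical_employee below.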

X = StandardScaler().fit_transform(numerical_employee[['YearsAtCompany','YearsWithCurrManager','YearsInCurrentRole']])
pca = PCA(n_components=2)
pca.fit(X)
X_pca=pca.transform(X)
numerical_employee['PCA2']=X_pca.T[0]
numerical_employee['PCA3']=X_pca.T[1]
numerical_employee.drop(['YearsAtCompany','YearsWithCurrManager','YearsInCurrentRole','YearsSinceLastPromotion'],axis=1,inplace=True)

employee['PCA2']=X_pca.T[0]
employee['PCA3']=X_pca.T[1]
#employee.drop(['YearsAtCompany','YearsWithCurrManager','YearsInCurrentRole','YearsSinceLastPromotion'],axis=1,inplace=True)

In [None]:
numerical_employee.corr()

#### HourlyRate, DailyRate, MonthlyRate
In a t-test, the p-values of the HourlyRate, DailyRate and MonthlyRate features are around 0.05 or above, so as continuous values they do not appear to differ significantly between employees who left and those who stayed. Nevertheless, I visualized these features to see whether they relate to attrition when treated as categories.
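
For reference, a two-sample t-test of this kind can be sketched as follows. The data are synthetic stand-ins for a rate feature split by attrition status, and the use of Welch's variant (`equal_var=False`) is an assumption, since the text does not say which t-test was run.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins: both groups are drawn from the same distribution.
rate_yes = rng.normal(loc=800, scale=400, size=237)
rate_no = rng.normal(loc=800, scale=400, size=1233)

# Welch's t-test: does the mean differ between the two groups?
t_stat, p_value = stats.ttest_ind(rate_yes, rate_no, equal_var=False)

alpha = 0.05
significant = bool(p_value < alpha)
print(round(float(t_stat), 3), round(float(p_value), 3), significant)
```

A p-value at or above the chosen `alpha` means the test gives no evidence that the group means differ, which is the situation described above for the three rate features.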

In [None]:
#DailyRate Feature
numerical_employee['Attrition']=employee['Attrition']

numerical_employee[numerical_employee['Attrition']=='Yes']['DailyRate']
y=np.array(numerical_employee[numerical_employee['Attrition']=='Yes']['DailyRate'])
n=np.array(numerical_employee[numerical_employee['Attrition']=='No']['DailyRate'])

hist_data = [y,n]
group_labels = ['distplot_yes','distplot_no'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels,show_hist=False,bin_size=25)
fig.show()

In [None]:
# Attrition is more likely below a DailyRate of about 830 and less likely above it, so 830 is used as the threshold value
employee["DailyRate"] = [1 if i < 830 else 2 for i in employee["DailyRate"]]

In [None]:
#HourlyRate Feature 
numerical_employee[numerical_employee['Attrition']=='Yes']['HourlyRate']
y=np.array(numerical_employee[numerical_employee['Attrition']=='Yes']['HourlyRate'])
n=np.array(numerical_employee[numerical_employee['Attrition']=='No']['HourlyRate'])

hist_data = [y,n]
group_labels = ['distplot_yes','distplot_no'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels,show_hist=False,bin_size=25)
fig.show()

In [None]:
#There are 2 threshold values: 45 and 73
employee["HourlyRate"] = [1 if i < 45 else 3 if i > 73 else 2 for i in employee["HourlyRate"]]

In [None]:
#MonthlyRate Feature
numerical_employee[numerical_employee['Attrition']=='Yes']['MonthlyRate']
y=np.array(numerical_employee[numerical_employee['Attrition']=='Yes']['MonthlyRate'])
n=np.array(numerical_employee[numerical_employee['Attrition']=='No']['MonthlyRate'])

hist_data = [y,n]
group_labels = ['distplot_yes','distplot_no'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels,show_hist=False,bin_size=25)
fig.show()

In [None]:
#There is 1 threshold value: 8500
employee["MonthlyRate"] = [1 if i < 8500 else 2 for i in employee["MonthlyRate"]]


In [None]:

#HourlyRate, DailyRate, MonthlyRate features will be considered as categorical features

In [None]:
numerical_employee.drop(['HourlyRate', 'DailyRate', 'MonthlyRate','Attrition'],axis=1,inplace=True)

<a id= '7'></a><br>
<font color ='blue' >
## Categorical Variable

In [None]:
Categorical_employee=employee.drop(['Attrition','Age','MonthlyIncome','YearsAtCompany','YearsWithCurrManager','YearsInCurrentRole','YearsSinceLastPromotion','TotalWorkingYears','PCA1','PCA2','PCA3'],axis=1)

In [None]:
def percent_attritionbarplot(x,employee):
    liste=employee.sort_values(by=x)[x].unique().tolist()
    listepercentyes=[]
    listepercentno=[]
    genele_etki=[]
    for i in range(len(liste)):
        a=(len(employee[employee[x]==liste[i]][employee['Attrition']=='Yes'])/len(employee[employee[x]==liste[i]]))*100
        b=100-a
        listepercentyes.append(a)
        listepercentno.append(b)
        geneleoran=len(employee[employee[x]==liste[i]])/len(employee)
        genele_etki.append(geneleoran*a)
        
    trace1 = go.Bar(
        x=liste,
        y=listepercentyes,
        name='Yes',
    )
    
    trace2 = go.Bar(
        x=liste,
        y=listepercentno,
        name='No',
    )

    data = [trace1, trace2]
    return data


In [None]:
def attritionbarplot2(x,employee):
    liste=employee.sort_values(by=x)[x].unique().tolist()
    listyes=[]
    for i in range(len(liste)):
        #a=len(employee[employee[x]==liste[i]][employee['Attrition']=='Yes'])
        a=len(employee[employee[x]==liste[i]])
        listyes.append(a)
    
    trace1 = go.Bar(
        x=liste,
        y=listyes,
        name='Yes',
    )
    
    return trace1

In [None]:
data_Categorical=list()
data_Categorical2=list()
rate=Categorical_employee
for i in range(len(rate.columns)):
    data_Categorical2.append(attritionbarplot2(rate.columns[i],employee))
    for j in range(2):
        data_Categorical.append(percent_attritionbarplot(rate.columns[i],employee)[j])


In [None]:
def visibleTF(number):
    liste=list()
    for i in range(len(data_Categorical)+len(data_Categorical2)):
        liste.append(False)
    liste[3*number-3]=True
    liste[3*number-2]=True
    liste[3*number-1]=True
    return liste

In [None]:
def button(attribute,number):
    return dict(label = attribute,method = 'update',args = [{'visible': visibleTF(number)},{'title': 'Categorical-Attrition percent relationship'}])

In [None]:
rate=Categorical_employee
updatemenus = list([dict(active=-1,buttons=[button(rate.columns[i],i+1) for i in range(len(rate.columns))])])

In [None]:
# Create figure
fig = go.Figure()

fig = make_subplots(rows=1, cols=2,
                    specs=[[{}, {}]],
                    subplot_titles=("Related Feature bar graph","Attrition percentage of feature "))

for i in range(len(data_Categorical2)):
    fig.add_trace(data_Categorical2[i],row=1, col=1) 
    fig.add_trace(data_Categorical[2*i],row=1, col=2)
    fig.add_trace(data_Categorical[2*i+1],row=1, col=2)   


fig.update_layout(
    updatemenus=updatemenus)

In [None]:
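rate=Categorical_employee
for i in range(len(rate.columns)):
    employ_tablosu=pd.crosstab(employee["Attrition"],employee[rate.columns[i]])
    # chi2_contingency tests whether Attrition and the feature are independent;
    # stats.chisquare(..., axis=None) would only test for equal cell counts
    chi2, p, dof, expected = stats.chi2_contingency(employ_tablosu)
    print(rate.columns[i], chi2, p)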

In [None]:
# First, the feature values were regrouped into categories based on the graphs above; the data was then finalized by converting these categories to dummy variables

In [None]:
employee_new=Categorical_employee
employee_new1=pd.DataFrame()      # Only used to check whether the new grouping shows a meaningful difference


In [None]:
#Attrition Feature
employee_new["Attritionr"] = [1 if i == 'Yes' else 0 for i in employee["Attrition"]]
employee_new1["Attritionr"] = employee_new["Attritionr"]

In [None]:
#BusinessTravel Feature
employee_new1["BusinessTravel"]=employee_new["BusinessTravel"]
employee_new = pd.get_dummies(employee_new,columns=["BusinessTravel"])
#Let's check the meaningful difference of new grouping
employee_new1[["BusinessTravel","Attritionr"]].groupby(["BusinessTravel"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#Department Feature
employee_new1["Department"]=employee_new["Department"]
employee_new = pd.get_dummies(employee_new,columns=["Department"])
#Let's check the meaningful difference of new grouping
employee_new1[["Department","Attritionr"]].groupby(["Department"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#Education Feature
employee_new["Educationr"] = [13 if i == 1 or i == 3 else 24 if i == 2 or i == 4 else 5 for i in employee_new["Education"]]
employee_new.drop(labels = ["Education"], axis = 1, inplace = True)
employee_new1["Educationr"]=employee_new["Educationr"]
employee_new = pd.get_dummies(employee_new,columns=["Educationr"])
#Let's check the meaningful difference of new grouping
employee_new1[["Educationr","Attritionr"]].groupby(["Educationr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#EducationField  Feature
employee_new["EducationFieldr"] = ['Other' if i == 'Medical' or i == 'Life Sciences' or i == 'Other' else 'Human Resources' if i == 'Human Resources' else 'Marketing' if i == 'Marketing' else 'Technical Degree' for i in employee_new["EducationField"]]
employee_new.drop(labels = ["EducationField"], axis = 1, inplace = True)
employee_new1["EducationFieldr"]=employee_new["EducationFieldr"]
employee_new = pd.get_dummies(employee_new,columns=["EducationFieldr"])
#Let's check the meaningful difference of new grouping
employee_new1[["EducationFieldr","Attritionr"]].groupby(["EducationFieldr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#EnvironmentSatisfaction  Feature
employee_new["EnvironmentSatisfactionr"] = [234 if i == 2 or i == 3 or i == 4 else 1 for i in employee_new["EnvironmentSatisfaction"]]
employee_new.drop(labels = ["EnvironmentSatisfaction"], axis = 1, inplace = True)
employee_new1["EnvironmentSatisfactionr"]=employee_new["EnvironmentSatisfactionr"]
employee_new = pd.get_dummies(employee_new,columns=["EnvironmentSatisfactionr"])
#Let's check the meaningful difference of new grouping
employee_new1[["EnvironmentSatisfactionr","Attritionr"]].groupby(["EnvironmentSatisfactionr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#Gender Feature
employee_new1["Gender"]=employee_new["Gender"]
employee_new = pd.get_dummies(employee_new,columns=["Gender"])
#Let's check the meaningful difference of new grouping
employee_new1[["Gender","Attritionr"]].groupby(["Gender"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#JobInvolvement Feature
employee_new1["JobInvolvement"]=employee_new["JobInvolvement"]
employee_new = pd.get_dummies(employee_new,columns=["JobInvolvement"])
#Let's check the meaningful difference of new grouping
employee_new1[["JobInvolvement","Attritionr"]].groupby(["JobInvolvement"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#JobLevel  Feature
employee_new["JobLevelr"] = [45 if i == 4 or i == 5 else 3 if i == 3 else 2 if i == 2 else 1 for i in employee_new["JobLevel"]]
employee_new.drop(labels = ["JobLevel"], axis = 1, inplace = True)
employee_new1["JobLevelr"]=employee_new["JobLevelr"]
employee_new = pd.get_dummies(employee_new,columns=["JobLevelr"])
#Let's check the meaningful difference of new grouping
employee_new1[["JobLevelr","Attritionr"]].groupby(["JobLevelr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#JobRole Feature
employee_new["JobRoler"] = ['HMM' if i == 'Manufacturing Director' or i == 'Healthcare Representative' or i == 'Manager' else 'Sales Executive' if i == 'Sales Executive' else 'Research Scientist' if i == 'Research Scientist' else 'Sales Representative' if i == 'Sales Representative' else 'Laboratory Technician' if i == 'Laboratory Technician' else 'Research Director' if i == 'Research Director' else 'Human Resources' for i in employee_new["JobRole"]]
employee_new.drop(labels = ["JobRole"], axis = 1, inplace = True)
employee_new1["JobRoler"]=employee_new["JobRoler"]
employee_new = pd.get_dummies(employee_new,columns=["JobRoler"])
#Let's check the meaningful difference of new grouping
employee_new1[["JobRoler","Attritionr"]].groupby(["JobRoler"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#JobSatisfaction  Feature
employee_new["JobSatisfactionr"] = [23 if i == 2 or i == 3 else 1 if i == 1 else 4  for i in employee_new["JobSatisfaction"]]
employee_new.drop(labels = ["JobSatisfaction"], axis = 1, inplace = True)
employee_new1["JobSatisfactionr"]=employee_new["JobSatisfactionr"]
employee_new = pd.get_dummies(employee_new,columns=["JobSatisfactionr"])
#Let's check the meaningful difference of new grouping
employee_new1[["JobSatisfactionr","Attritionr"]].groupby(["JobSatisfactionr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
# MaritalStatus Feature
employee_new1["MaritalStatus"]=employee_new["MaritalStatus"]
employee_new = pd.get_dummies(employee_new,columns=["MaritalStatus"])
#Let's check the meaningful difference of new grouping
employee_new1[["MaritalStatus","Attritionr"]].groupby(["MaritalStatus"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
# NumCompaniesWorked  Feature

employee_new["NumCompaniesWorkedr"] = ['2or3or4' if i == 2 or i == 3 or i == 4 else 1 if i == 1 else 0 if i == 0 else '5betw9'  for i in employee_new["NumCompaniesWorked"]]
employee_new.drop(labels = ["NumCompaniesWorked"], axis = 1, inplace = True)
employee_new1["NumCompaniesWorkedr"]=employee_new["NumCompaniesWorkedr"]
employee_new = pd.get_dummies(employee_new,columns=["NumCompaniesWorkedr"])
#Let's check the meaningful difference of new grouping
employee_new1[["NumCompaniesWorkedr","Attritionr"]].groupby(["NumCompaniesWorkedr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#OverTime
employee_new1["OverTime"]=employee_new["OverTime"]
employee_new = pd.get_dummies(employee_new,columns=["OverTime"])
#Let's check the meaningful difference of new grouping
employee_new1[["OverTime","Attritionr"]].groupby(["OverTime"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)


In [None]:
#RelationshipSatisfaction  Feature
employee_new["RelationshipSatisfactionr"] = [234 if i == 2 or i == 3 or i == 4 else 1 for i in employee_new["RelationshipSatisfaction"]]
employee_new.drop(labels = ["RelationshipSatisfaction"], axis = 1, inplace = True)
employee_new1["RelationshipSatisfactionr"]=employee_new["RelationshipSatisfactionr"]
employee_new = pd.get_dummies(employee_new,columns=["RelationshipSatisfactionr"])
#Let's check the meaningful difference of new grouping
employee_new1[["RelationshipSatisfactionr","Attritionr"]].groupby(["RelationshipSatisfactionr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#WorkLifeBalance  Feature
employee_new["WorkLifeBalancer"] = [1 if i == 1  else 2 if i == 2   else 34 for i in employee_new["WorkLifeBalance"]]
employee_new.drop(labels = ["WorkLifeBalance"], axis = 1, inplace = True)
employee_new1["WorkLifeBalancer"]=employee_new["WorkLifeBalancer"]
employee_new = pd.get_dummies(employee_new,columns=["WorkLifeBalancer"])
#Let's check the meaningful difference of new grouping
employee_new1[["WorkLifeBalancer","Attritionr"]].groupby(["WorkLifeBalancer"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#TrainingTimesLastYear Feature
employee_new["TrainingTimesLastYearr"] = [0 if i == 0  else '1betw3' if i > 0 and i < 4 else 4 if i == 4 else 5 if i == 5 else 6 for i in employee_new["TrainingTimesLastYear"]]
employee_new.drop(labels = ["TrainingTimesLastYear"], axis = 1, inplace = True)
employee_new1["TrainingTimesLastYearr"]=employee_new["TrainingTimesLastYearr"]
employee_new = pd.get_dummies(employee_new,columns=["TrainingTimesLastYearr"])
#Let's check the meaningful difference of new grouping
employee_new1[["TrainingTimesLastYearr","Attritionr"]].groupby(["TrainingTimesLastYearr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#StockOptionLevel Feature
employee_new["StockOptionLevelr"] = [0 if i == 0 else '1or2or3' for i in employee_new["StockOptionLevel"]]
employee_new.drop(labels = ["StockOptionLevel"], axis = 1, inplace = True)
employee_new1["StockOptionLevelr"]=employee_new["StockOptionLevelr"]
employee_new = pd.get_dummies(employee_new,columns=["StockOptionLevelr"])
#Let's check the meaningful difference of new grouping
employee_new1[["StockOptionLevelr","Attritionr"]].groupby(["StockOptionLevelr"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#DistanceFromHome Feature
employee_new["DistanceFromHomer"] = ['1betw8' if i < 9 else '9betw11' if i > 8 and i < 12 else '12up' for i in employee_new["DistanceFromHome"]]
employee_new.drop(labels = ["DistanceFromHome"], axis = 1, inplace = True)
employee_new1["DistanceFromHomer"]=employee_new["DistanceFromHomer"]
employee_new = pd.get_dummies(employee_new,columns=["DistanceFromHomer"])
#Let's check the meaningful difference of new grouping
employee_new1[["DistanceFromHomer","Attritionr"]].groupby(["DistanceFromHomer"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#PercentSalaryHike Feature
employee_new["PercentSalaryHiker"] = [11 if i == 11  else '12betw17' if i > 11 and i < 18 else '18betw21' if i > 17 and i < 22 else '22betw25' for i in employee_new["PercentSalaryHike"]]
employee_new.drop(labels = ["PercentSalaryHike"], axis = 1, inplace = True)
employee_new1["PercentSalaryHiker"]=employee_new["PercentSalaryHiker"]
employee_new = pd.get_dummies(employee_new,columns=["PercentSalaryHiker"])
#Let's check the meaningful difference of new grouping
employee_new1[["PercentSalaryHiker","Attritionr"]].groupby(["PercentSalaryHiker"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#DailyRate Feature
employee_new1["DailyRate"]=employee_new["DailyRate"]
employee_new = pd.get_dummies(employee_new,columns=["DailyRate"])
#Let's check the meaningful difference of new grouping
employee_new1[["DailyRate","Attritionr"]].groupby(["DailyRate"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#HourlyRate Feature
employee_new1["HourlyRate"]=employee_new["HourlyRate"]
employee_new = pd.get_dummies(employee_new,columns=["HourlyRate"])
#Let's check the meaningful difference of new grouping
employee_new1[["HourlyRate","Attritionr"]].groupby(["HourlyRate"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#MonthlyRate Feature
employee_new1["MonthlyRate"]=employee_new["MonthlyRate"]
employee_new = pd.get_dummies(employee_new,columns=["MonthlyRate"])
#Let's check the meaningful difference of new grouping
employee_new1[["MonthlyRate","Attritionr"]].groupby(["MonthlyRate"], as_index = False).mean().sort_values(by="Attritionr",ascending = False)

In [None]:
#PerformanceRating Feature is dropped
employee_new.drop(labels = ["PerformanceRating"], axis = 1, inplace = True)

In [None]:

employee_new = pd.concat([employee_new, numerical_employee],axis=1)
employee_new

<a id= '8'></a><br>
<font color ='blue' >
# MODELING

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

<a id= '9'></a><br>
<font color ='blue' >
## Train - Test Split -- Hyperparameter Tuning -- Cross Validation Settings

We will compare 5 machine learning classifiers and evaluate the mean accuracy of each using stratified cross-validation.

* Decision Tree
* SVM
* Random Forest
* KNN
* Logistic Regression

We tune each classifier with two search strategies: GridSearchCV and RandomizedSearchCV.
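
The difference between the two search strategies can be sketched on synthetic data. This is an illustration only: `make_classification` stands in for the employee data, and the small parameter grid is an assumption chosen to keep the run short. GridSearchCV fits every combination in the grid, while RandomizedSearchCV samples only `n_iter` of them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the employee data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.84, 0.16], random_state=42)

# Small illustrative grid: 4 * 3 = 12 combinations.
param_grid = {"max_depth": [1, 3, 5, 7],
              "min_samples_split": [10, 30, 50]}
cv = StratifiedKFold(n_splits=5)

# Exhaustive search: every combination is cross-validated.
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid=param_grid, cv=cv, scoring="accuracy")
grid.fit(X_demo, y_demo)

# Randomized search: only n_iter combinations are sampled and cross-validated.
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=42),
                          param_distributions=param_grid, n_iter=5,
                          cv=cv, scoring="accuracy", random_state=111)
rand.fit(X_demo, y_demo)

n_grid = len(grid.cv_results_["params"])
n_rand = len(rand.cv_results_["params"])
print("candidates tried:", n_grid, "vs", n_rand)
print("best CV accuracy:", grid.best_score_, "vs", rand.best_score_)
```

Randomized search is cheaper, but it can miss the best combination because it only evaluates a subset of the grid.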

<a id= '10'></a><br>
<font color ='blue' >
## Ensemble modelling with imbalanced and balanced dataset


The following function returns the best estimator and the best cross-validation score for each of the five machine learning algorithms.

In [None]:
def machinelearning_modeling(X_train,y_train,cv_method):
      
    random_state = 42
    classifier = [DecisionTreeClassifier(random_state = random_state),
                 SVC(random_state = random_state, probability=True ),
                 RandomForestClassifier(random_state = random_state),
                 LogisticRegression(random_state = random_state, solver = "liblinear"),  # liblinear supports both the l1 and l2 penalties searched below
                 KNeighborsClassifier()]

    dt_param_grid = {"min_samples_split" : range(10,500,20),
                    "max_depth": range(1,20,2)}

    svc_param_grid = {"kernel" : ["rbf"],
                     "gamma": [0.001, 0.01, 0.1, 1],
                     "C": [1,10,50,100,200,300,1000],
                     "probability" :[True]}

    rf_param_grid = {"max_features": [1,3,10],
                    "min_samples_split":[2,3,10],
                    "min_samples_leaf":[1,3,10],
                    "bootstrap":[False],
                    "n_estimators":[100,300],
                    "criterion":["gini"]}

    logreg_param_grid = {"C":np.logspace(-3,3,7),
                        "penalty": ["l1","l2"]}

    knn_param_grid = {"n_neighbors": np.linspace(1,19,10, dtype = int).tolist(),
                     "weights": ["uniform","distance"],
                     "metric":["euclidean","manhattan"]}
    classifier_param = [dt_param_grid,
                       svc_param_grid,
                       rf_param_grid,
                       logreg_param_grid,
                       knn_param_grid]
    
    ML_Models=["dtc","svm","rfc","lr","knc"]
    
    cv_result = []
    global cv_results 
    best_estimators = []

    if (cv_method=='GridSearchCV'):

        for i in range(len(classifier)):
            
            clf = GridSearchCV(classifier[i], param_grid=classifier_param[i], cv = StratifiedKFold(n_splits = 10), scoring = "accuracy", n_jobs = -1,verbose = 1)
            clf.fit(X_train,y_train)
            cv_result.append(clf.best_score_)
            best_estimators.append(clf.best_estimator_)

    elif (cv_method=='RandomizedSearchCV'):
        
        for i in range(len(classifier)):
            clf = RandomizedSearchCV(classifier[i], param_distributions=classifier_param[i], cv = StratifiedKFold(n_splits = 10), n_iter = 10,random_state = 111,scoring = 'accuracy')  # accuracy, to match the GridSearchCV branch and the Accuracy_Score table
            clf.fit(X_train,y_train)
            cv_result.append(clf.best_score_)
            best_estimators.append(clf.best_estimator_)

    cv_results = pd.DataFrame({"Cross Validation Means":cv_result, "ML Models":["DecisionTreeClassifier", "SVM","RandomForestClassifier","LogisticRegression","KNeighborsClassifier"]})
    fig = px.bar(cv_results, x='Cross Validation Means', y='ML Models',color='ML Models')
    fig.show()
    
    return best_estimators, cv_results

In [None]:
#For Accuracy Score Table 
#* Data_type records whether the data is imbalanced or balanced with one of the resampling techniques
#* The voting ensemble is added as an extra entry in the ML Algorithm column

columns_name = ['Data_type','CV method','ML Algorithm','Accuracy_Score']
Data_type=list()
CV_method=list()
ML_Algorithm=list()
Accuracy_Score=list()
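
The four parallel lists above always have to be appended to together. One alternative, sketched here with a hypothetical helper `record_scores` (not part of the original notebook), collects each result as a single row so the columns cannot fall out of step. The scores in the demo frame are made-up placeholders, not results from the employee data.

```python
import pandas as pd

def record_scores(rows, data_type, cv_method, cv_results, voting_label, voting_score):
    """Append one row per tuned model, plus one row for the voting ensemble."""
    for model, score in zip(cv_results["ML Models"], cv_results["Cross Validation Means"]):
        rows.append({"Data_type": data_type, "CV_method": cv_method,
                     "ML_Algorithm": model, "Accuracy_Score": score})
    rows.append({"Data_type": data_type, "CV_method": cv_method,
                 "ML_Algorithm": voting_label, "Accuracy_Score": voting_score})
    return rows

# Placeholder frame with the same column names that machinelearning_modeling returns.
demo_cv = pd.DataFrame({"Cross Validation Means": [0.80, 0.85],
                        "ML Models": ["DecisionTreeClassifier", "SVM"]})

rows = record_scores([], "Inbalanced data", "GridSearchCV",
                     demo_cv, "Voting(SVM,RFC,LR)", 0.86)
demo_results = pd.DataFrame(rows).round({"Accuracy_Score": 4})
print(demo_results)
```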

<a id= '11'></a><br>
<font color ='blue' >
### Imbalanced Dataset

In [None]:

X_train = employee_new.drop(labels = "Attritionr", axis = 1)
y_train = employee_new["Attritionr"]

In [None]:
### GridSearchCV

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size = 0.33, random_state = 42)
accuracy_GSCV=machinelearning_modeling(X_train,y_train,'GridSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("svm",accuracy_GSCV[0][1]),
                                        ("rfc",accuracy_GSCV[0][2]),
                                        ("lr",accuracy_GSCV[0][3])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_train, y_train)
print(accuracy_score(votingC.predict(X_test),y_test))
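
With `voting = "soft"`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest mean probability. A minimal sketch on synthetic data follows; the dataset and the untuned estimators are assumptions standing in for the tuned `best_estimators`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

voting_demo = VotingClassifier(
    estimators=[("rfc", RandomForestClassifier(n_estimators=50, random_state=42)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("knc", KNeighborsClassifier())],
    voting="soft")
voting_demo.fit(X_tr, y_tr)

# Reproduce soft voting by hand: average the members' probabilities,
# then take the class with the largest mean probability.
member_probas = np.mean([est.predict_proba(X_te) for est in voting_demo.estimators_], axis=0)
manual_pred = voting_demo.classes_[np.argmax(member_probas, axis=1)]

same = np.array_equal(manual_pred, voting_demo.predict(X_te))
print("manual average matches VotingClassifier:", same)
```

Soft voting requires every member to expose `predict_proba`, which is why the SVC in the modeling function is created with `probability=True`.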




In [None]:
for i in range(5):
    Data_type.append('Inbalanced data')
    CV_method.append('GridSearchCV')
    Accuracy_Score.append(accuracy_GSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_GSCV[1]['ML Models'][i])

Data_type.append('Inbalanced data')
CV_method.append('GridSearchCV')
ML_Algorithm.append('Voting(SVM,RFC,LR)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_test),y_test))


In [None]:
### RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
accuracy_RSCV=machinelearning_modeling(X_train,y_train,'RandomizedSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("knc",accuracy_RSCV[0][4]),
                                        ("rfc",accuracy_RSCV[0][2]),
                                        ("lr",accuracy_RSCV[0][3])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_train, y_train)
print(accuracy_score(votingC.predict(X_test),y_test))


In [None]:
for i in range(5):
    Data_type.append('Inbalanced data')
    CV_method.append('RandomizedSearchCV')
    Accuracy_Score.append(accuracy_RSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_RSCV[1]['ML Models'][i])

Data_type.append('Inbalanced data')
CV_method.append('RandomizedSearchCV')
ML_Algorithm.append('Voting(KNC,RFC,LR)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_test),y_test))

<a id= '12'></a><br>
<font color ='blue' >
### Over sampling Dataset

In [None]:
from sklearn.utils import resample
employee_no = employee_new[employee_new.Attritionr == 0]
employee_yes = employee_new[employee_new.Attritionr == 1]

employee_yes_up = resample(employee_yes,
                                     replace = True,
                                     n_samples = len(employee_no),
                                     random_state = 111)

employee_up = pd.concat([employee_no, employee_yes_up])
employee_up['Attritionr'].value_counts()

X_up = employee_up.drop('Attritionr', axis=1)
y_up = employee_up['Attritionr']
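
Up-sampling duplicates minority rows, so when the split happens after resampling, copies of the same row can land in both the training and the test part. A variant worth knowing, sketched below on a toy frame (the column names `feature` and `target` are placeholders, not columns of the employee data), splits first and up-samples only the training part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced frame: 30 positives, 170 negatives (illustrative values only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.normal(size=200),
                    "target": [1] * 30 + [0] * 170})

# 1) Split first, so the test rows are never seen by the resampler.
train, test = train_test_split(toy, test_size=0.33, random_state=42,
                               stratify=toy["target"])

# 2) Up-sample the minority class inside the training part only.
train_no = train[train["target"] == 0]
train_yes = train[train["target"] == 1]
train_yes_up = resample(train_yes, replace=True,
                        n_samples=len(train_no), random_state=111)
train_up = pd.concat([train_no, train_yes_up])

counts = train_up["target"].value_counts()
overlap = set(train_up.index) & set(test.index)
print(counts.to_dict(), "rows shared with test:", len(overlap))
```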

In [None]:
### GridSearchCV

In [None]:
X_up_train, X_up_test, y_up_train, y_up_test = train_test_split(X_up, y_up, test_size = 0.33, random_state = 42)
accuracy_up_GSCV=machinelearning_modeling(X_up_train,y_up_train,'GridSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("svm",accuracy_up_GSCV[0][1]),
                                        ("rfc",accuracy_up_GSCV[0][2]),
                                        ("knc",accuracy_up_GSCV[0][4])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_up_train, y_up_train)
print(accuracy_score(votingC.predict(X_up_test),y_up_test))

In [None]:
for i in range(5):
    Data_type.append('Balanced_up data')
    CV_method.append('GridSearchCV')
    Accuracy_Score.append(accuracy_up_GSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_up_GSCV[1]['ML Models'][i])

Data_type.append('Balanced_up data')
CV_method.append('GridSearchCV')
ML_Algorithm.append('Voting(SVM,RFC,KNC)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_up_test),y_up_test))

In [None]:
### RandomizedSearchCV

In [None]:
accuracy_up_RSCV=machinelearning_modeling(X_up_train,y_up_train,'RandomizedSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("rfc",accuracy_up_RSCV[0][2]),
                                        ("svm",accuracy_up_RSCV[0][1])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_up_train, y_up_train)
print(accuracy_score(votingC.predict(X_up_test),y_up_test))

In [None]:
for i in range(5):
    Data_type.append('Balanced_up data')
    CV_method.append('RandomizedSearchCV')
    Accuracy_Score.append(accuracy_up_RSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_up_RSCV[1]['ML Models'][i])

Data_type.append('Balanced_up data')
CV_method.append('RandomizedSearchCV')
ML_Algorithm.append('Voting(SVM,RFC)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_up_test),y_up_test))  # score on the up-sampled test split, matching the print above

<a id= '13'></a><br>
<font color ='blue' >
### Under-sampling Dataset

In [None]:
employee_no_down = resample(employee_no,
                                     replace = False,  # down-sample without replacement so no majority row is duplicated
                                     n_samples = len(employee_yes),
                                     random_state = 111)

employee_down = pd.concat([employee_yes, employee_no_down])
employee_down['Attritionr'].value_counts()

X_down = employee_down.drop('Attritionr', axis=1)
y_down = employee_down['Attritionr']

In [None]:
### GridSearchCV

In [None]:
X_down_train, X_down_test, y_down_train, y_down_test = train_test_split(X_down, y_down, test_size = 0.33, random_state = 42)
accuracy_down_GSCV=machinelearning_modeling(X_down_train,y_down_train,'GridSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("svm",accuracy_down_GSCV[0][1]),
                                        ("rfc",accuracy_down_GSCV[0][2]),
                                        ("lr",accuracy_down_GSCV[0][3])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_down_train, y_down_train)
print(accuracy_score(votingC.predict(X_down_test),y_down_test))

In [None]:
for i in range(5):
    Data_type.append('Balanced_down data')
    CV_method.append('GridSearchCV')
    Accuracy_Score.append(accuracy_down_GSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_down_GSCV[1]['ML Models'][i])

Data_type.append('Balanced_down data')
CV_method.append('GridSearchCV')
ML_Algorithm.append('Voting(SVM,RFC,LR)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_down_test),y_down_test))

In [None]:
### RandomizedSearchCV

In [None]:
accuracy_down_RSCV=machinelearning_modeling(X_down_train,y_down_train,'RandomizedSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("svm",accuracy_down_RSCV[0][1]),
                                        ("dtc",accuracy_down_RSCV[0][0])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_down_train, y_down_train)
print(accuracy_score(votingC.predict(X_down_test),y_down_test))

In [None]:
for i in range(5):
    Data_type.append('Balanced_down data')
    CV_method.append('RandomizedSearchCV')
    Accuracy_Score.append(accuracy_down_RSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_down_RSCV[1]['ML Models'][i])

Data_type.append('Balanced_down data')
CV_method.append('RandomizedSearchCV')
ML_Algorithm.append('Voting(SVM,DTC)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_down_test),y_down_test))

<a id= '14'></a><br>
<font color ='blue' >
### SMOTE Dataset

In [None]:
from imblearn.over_sampling import SMOTE
y=employee_new['Attritionr']
X = employee_new.drop('Attritionr', axis=1)

sm = SMOTE(random_state=27)
X_smote, y_smote = sm.fit_resample(X, y)

employee_smote = pd.concat([X_smote, y_smote],axis=1)

X_smote = employee_smote.drop('Attritionr', axis=1)
y_smote = employee_smote['Attritionr']
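
SMOTE creates synthetic minority rows rather than copying existing ones: for a minority sample it picks one of its nearest minority neighbours and interpolates between the two. The snippet below is a simplified NumPy sketch of that interpolation step, not imblearn's implementation; the function name `smote_like` and the toy points are assumptions made for illustration.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=27):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)           # sample to start from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one of its k neighbours
    lam = rng.random((n_new, 1))                              # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Four toy minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6)
print(X_new.shape)
print(X_new)
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies.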

In [None]:
### GridSearchCV

In [None]:
X_smote_train, X_smote_test, y_smote_train, y_smote_test = train_test_split(X_smote, y_smote, test_size = 0.33, random_state = 42)
accuracy_smote_GSCV=machinelearning_modeling(X_smote_train,y_smote_train,'GridSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("svm",accuracy_smote_GSCV[0][1]),
                                        ("rfc",accuracy_smote_GSCV[0][2]),
                                        ("lr",accuracy_smote_GSCV[0][3]),
                                        ("knc",accuracy_smote_GSCV[0][4])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_smote_train, y_smote_train)
print(accuracy_score(votingC.predict(X_smote_test),y_smote_test))

In [None]:
for i in range(5):
    Data_type.append('Balanced_smote data')
    CV_method.append('GridSearchCV')
    Accuracy_Score.append(accuracy_smote_GSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_smote_GSCV[1]['ML Models'][i])

Data_type.append('Balanced_smote data')
CV_method.append('GridSearchCV')
ML_Algorithm.append('Voting(SVM,RFC,LR,KNC)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_smote_test),y_smote_test))

In [None]:
### RandomizedSearchCV

In [None]:
accuracy_smote_RSCV=machinelearning_modeling(X_smote_train,y_smote_train,'RandomizedSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("lr",accuracy_smote_RSCV[0][3]),
                                         ("rfc",accuracy_smote_RSCV[0][2]),
                                        ("svm",accuracy_smote_RSCV[0][1])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_smote_train, y_smote_train)
print(accuracy_score(votingC.predict(X_smote_test),y_smote_test))

In [None]:
for i in range(5):
    Data_type.append('Balanced_smote data')
    CV_method.append('RandomizedSearchCV')
    Accuracy_Score.append(accuracy_smote_RSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_smote_RSCV[1]['ML Models'][i])

Data_type.append('Balanced_smote data')
CV_method.append('RandomizedSearchCV')
ML_Algorithm.append('Voting(SVM,RFC,LR)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_smote_test),y_smote_test))

<a id= '15'></a><br>
<font color ='blue' >
### ADASYN Dataset

In [None]:
from imblearn.over_sampling import ADASYN
y=employee_new['Attritionr']
X = employee_new.drop('Attritionr', axis=1)

ad = ADASYN(random_state=27)  # fixed seed so the resampling is reproducible, as for SMOTE above
X_adasyn, y_adasyn = ad.fit_resample(X, y)

employee_adasyn = pd.concat([X_adasyn, y_adasyn],axis=1)

X_adasyn = employee_adasyn.drop('Attritionr', axis=1)
y_adasyn = employee_adasyn['Attritionr']

In [None]:
### GridSearchCV

In [None]:
X_adasyn_train, X_adasyn_test, y_adasyn_train, y_adasyn_test = train_test_split(X_adasyn, y_adasyn, test_size = 0.33, random_state = 42)
accuracy_adasyn_GSCV=machinelearning_modeling(X_adasyn_train,y_adasyn_train,'GridSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("svm",accuracy_adasyn_GSCV[0][1]),
                                        ("rfc",accuracy_adasyn_GSCV[0][2]),
                                        ("lr",accuracy_adasyn_GSCV[0][3]),
                                        ("knc",accuracy_adasyn_GSCV[0][4])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_adasyn_train, y_adasyn_train)
print(accuracy_score(votingC.predict(X_adasyn_test),y_adasyn_test))

In [None]:
for i in range(5):
    Data_type.append('Balanced_adasyn data')
    CV_method.append('GridSearchCV')
    Accuracy_Score.append(accuracy_adasyn_GSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_adasyn_GSCV[1]['ML Models'][i])

Data_type.append('Balanced_adasyn data')
CV_method.append('GridSearchCV')
ML_Algorithm.append('Voting(SVM,RFC,LR,KNC)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_adasyn_test),y_adasyn_test))

In [None]:
### RandomizedSearchCV

In [None]:
accuracy_adasyn_RSCV=machinelearning_modeling(X_adasyn_train,y_adasyn_train,'RandomizedSearchCV')

In [None]:
votingC = VotingClassifier(estimators = [("rfc",accuracy_adasyn_RSCV[0][2]),
                                        ("svm",accuracy_adasyn_RSCV[0][1])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_adasyn_train, y_adasyn_train)
print(accuracy_score(votingC.predict(X_adasyn_test),y_adasyn_test))

In [None]:
for i in range(5):
    Data_type.append('Balanced_adasyn data')
    CV_method.append('RandomizedSearchCV')
    Accuracy_Score.append(accuracy_adasyn_RSCV[1]['Cross Validation Means'][i])
    ML_Algorithm.append(accuracy_adasyn_RSCV[1]['ML Models'][i])

Data_type.append('Balanced_adasyn data')
CV_method.append('RandomizedSearchCV')
ML_Algorithm.append('Voting(SVM,RFC)')
Accuracy_Score.append(accuracy_score(votingC.predict(X_adasyn_test),y_adasyn_test))
Accuracy_Score=np.round(Accuracy_Score,4)

<a id= '16'></a><br>
<font color ='blue' >
## Accuracy Score Table

In [None]:
Results = pd.DataFrame({"Data_type":Data_type, "CV_method":CV_method,"ML_Algorithm":ML_Algorithm,"Accuracy_Score":Accuracy_Score})

fig = go.Figure(data=[go.Table(
    header=dict(values=list(Results.columns),
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[Results.Data_type,Results.CV_method,Results.ML_Algorithm,Results.Accuracy_Score],
               fill_color='lavender',
               align='left'))
])

fig.show()


<a id= '17'></a><br>
<font color ='blue' >
### Best 10 Scores Table

In [None]:
# Sort by score in descending order and keep only the ten highest rows
Ascending_Score_best10=Results.sort_values('Accuracy_Score',ascending=False).head(10)

fig = go.Figure(data=[go.Table(
    header=dict(values=list(Ascending_Score_best10.columns),
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[Ascending_Score_best10.Data_type,Ascending_Score_best10.CV_method,Ascending_Score_best10.ML_Algorithm,Ascending_Score_best10.Accuracy_Score],
               fill_color='lavender',
               align='left'))
])

fig.show()