**BUSINESS PROBLEM**
“Attrition in human resources refers to the gradual loss of employees over time. In general, relatively high attrition is problematic for companies. HR professionals often assume a leadership role in designing company compensation programs, work culture and motivation systems that help the organization retain top employees.”

Our role is to uncover the factors that lead to employee attrition through Exploratory Data Analysis, and to use various classification models to predict whether an employee is likely to quit. This could greatly increase HR’s ability to intervene in time and remedy the situation to prevent attrition.

While this model can be run routinely to identify the employees who are most likely to quit, the key driver of success would be the human element: reaching out to the employee, understanding their current situation, and acting on the controllable factors that can prevent their attrition.

**HR ANALYTICS**
Human resource analytics (HR analytics) is an area in the field of analytics that refers to applying analytic processes to the human resource department of an organization in the hope of improving employee performance and therefore getting a better return on investment. HR analytics does not just deal with gathering data on employee efficiency. Instead, it aims to provide insight into each process by gathering data and then using it to make relevant decisions about how to improve these processes.

**DATASET**
This is a hypothetical dataset created by IBM data scientists. The dataset has 23,436 rows and 37 columns of numeric and categorical data describing each employee’s background and characteristics, and it is labelled (supervised learning) with whether they are still in the company or have gone to work somewhere else. Machine Learning models can help to understand and determine how these factors relate to workforce attrition.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns



class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'
    
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [None]:
pd.set_option('display.max_rows',100)
pd.set_option('display.max_columns',100)

In [None]:
#Read The Data

df = pd.read_csv('../input/capstone-projectibm-employee-attrition-prediction/IBM HR Data new.csv')

df.head(5)

In [None]:
df.shape

In [None]:
#Check for any null values

df.isnull().sum()[df.isnull().sum()!=0]

In [None]:
#Total number of null cells relative to the number of rows, as a percentage
Null_values_percentage=(df.isnull().sum().sum()/len(df))*100
print('Null values amount to',round(Null_values_percentage,3),'% of the number of rows in the dataset.')

**So about 1.5% of our data consists of null values**
* Instead of imputing them we can DROP them directly, because the share of null values is very small and dropping them won't affect our model.
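The share of missing data can be measured per cell or per row, and the row-level figure is the one that tells us how much data `dropna()` discards. A minimal sketch on a made-up toy frame (the column values here are hypothetical, not taken from the HR dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [34, np.nan, 45, 29],
    "Department": ["Sales", "HR", None, "Sales"],
    "MonthlyIncome": [5000, 4200, np.nan, 3900],
})

# Share of rows with at least one missing value: this is what dropna() removes
rows_with_null_pct = toy.isnull().any(axis=1).mean() * 100

# Share of individual cells that are missing
cells_null_pct = toy.isnull().sum().sum() / toy.size * 100

# Keep only the complete rows
cleaned = toy.dropna()

print(rows_with_null_pct, cells_null_pct, len(cleaned))
```

If the row-level share is small, as it is reported to be here, dropping is usually a reasonable choice; with a larger share, imputation would be worth considering.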

In [None]:
df.describe().T

*  **We can see that some outliers are present in the data. We can remove them if the model is not performing well.**

* **Removal of Outliers**


In [None]:
df.loc[(df.EnvironmentSatisfaction>df.EnvironmentSatisfaction.quantile(0.9))]

In [None]:
df.loc[(df.NumCompaniesWorked>10)] 
#Even over an entire working life, no one can realistically work at this many companies
#We can cap these values at a maximum of 10 companies

In [None]:
df.loc[(df.EnvironmentSatisfaction>df.NumCompaniesWorked.quantile(0.9))]

In [None]:
df.loc[(df.PerformanceRating>4),'PerformanceRating']


**Here the performance rating cannot be greater than 5, hence we will also cap these values at the maximum value.**

> ***If we look at the indexes of the rows that produce these outliers, they are the same rows. We could either drop these two rows or cap their outlier values. Here I have capped those values with the respective maximum values.***

* **There are some outliers in the dataset, and they are few enough to count, hence we replaced only those values using the following function. These outliers can be seen clearly in the max column of df.describe().**
* **We also checked beforehand whether only these few values exist or whether there are many.**
* **Still, after completing the model we can see that removing outliers makes little difference: after model building we got similar results as with the outliers.**
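The capping applied in the following cells can be expressed as one reusable helper. This is a sketch under assumptions: the helper name `cap_at_quantile` and the toy series are hypothetical, and it uses `Series.clip` rather than an element-wise `apply`.

```python
import pandas as pd

def cap_at_quantile(series: pd.Series, q: float = 0.9) -> pd.Series:
    """Replace values above the q-th quantile with that quantile."""
    upper = series.quantile(q)
    return series.clip(upper=upper)

# Toy column with one implausible value (40 companies)
companies = pd.Series([0.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 40.0])
capped = cap_at_quantile(companies)

print(capped.tolist())
```

Values at or below the threshold are left unchanged; only the extreme tail is pulled down, which keeps the rows in the dataset instead of dropping them.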

* **Treating outliers in the NumCompaniesWorked column**

In [None]:
#Cap values above the 90th percentile at the 90th percentile
cap = df.NumCompaniesWorked.quantile(0.9)
df.NumCompaniesWorked=df.NumCompaniesWorked.clip(upper=cap)
df.NumCompaniesWorked.value_counts()

* **Treating outliers in the EnvironmentSatisfaction column**

In [None]:
#Cap values above the 90th percentile at the 90th percentile
cap = df.EnvironmentSatisfaction.quantile(0.9)
df.EnvironmentSatisfaction=df.EnvironmentSatisfaction.clip(upper=cap)
df.EnvironmentSatisfaction.value_counts()

* **Treating outliers in the PerformanceRating column**

In [None]:
#Cap values above the 90th percentile at the 90th percentile
cap = df.PerformanceRating.quantile(0.9)
df.PerformanceRating=df.PerformanceRating.clip(upper=cap)
df.PerformanceRating.value_counts()

* **Treating outliers in the JobInvolvement column**

In [None]:
#Cap values above the 90th percentile at the 90th percentile
cap = df.JobInvolvement.quantile(0.9)
df.JobInvolvement=df.JobInvolvement.clip(upper=cap)
df.JobInvolvement.value_counts()

In [None]:
df.describe().T

In [None]:
import seaborn as sns
sns.boxplot(x=df['NumCompaniesWorked'])

In [None]:
sns.boxplot(x=df['EnvironmentSatisfaction'])

In [None]:
sns.boxplot(x=df['PerformanceRating'])

In [None]:
sns.boxplot(x=df['JobInvolvement'])

In [None]:
#Copied Original Dataset before EDA

In [None]:
df2=df.copy(deep=True) 
df2

* **Dropping Null Values.**

In [None]:
df=df.dropna() #Total 1.5% Null values are available In dataset.

In [None]:
df.isnull().sum().shape

In [None]:
df.isnull().sum()

* **227 null values are present in our dataset**
* Our dataset is large, so even if we drop the rows with null values it won't affect our model accuracy

**Mapped the categorical values of the target column to numerical values so the model can use them.**

In [None]:
df.Attrition=df.Attrition.apply(lambda x: 1 if x=='Voluntary Resignation' else 0)
df.Attrition.value_counts()

In [None]:
df.to_csv('HR_Analyst_new.csv') 
#Saved Final File after Cleaning of Data. 

In [None]:
#Baseline accuracy: the share of the majority class (always predicting no attrition)
counts=df.Attrition.value_counts()
round(counts.max()/counts.sum(),2)

* **Up to here we have dropped the null values and saved the file for visualisation.**

* **Here we check which numerical and categorical columns are present in the dataset**

In [None]:
numeric_ = df.select_dtypes(exclude=['object']).copy()
categor_ = df.select_dtypes(['object']).copy()
print(color.BOLD+color.RED+' CATEGORICAL COLUMNS- :'+color.END,categor_.columns,color.BOLD+'\nShape of Categorical Data-:'+color.END,categor_.shape)
print(color.BOLD+'NUMERICAL COLUMNS-: '+color.END,numeric_.columns,color.BOLD+'\nShape of Numerical Data-:'+color.END,numeric_.shape)

In [None]:
#DATA VISUALIZATION
sns.countplot(x=df['Attrition'])
fig = plt.gcf()
fig.set_size_inches(7,7)
plt.title('Attrition')

In [None]:
corr= numeric_.corr()
plt.subplots(figsize=[12,7])
sns.heatmap(corr,annot=True,mask=numeric_.corr()<0.3)

In [None]:
plt.figure(figsize=[10,8])
sns.histplot(df['Age'],kde=True,stat='density',color='k',bins=10)

In [None]:
# Majority of employees lie between the age range of 30 to 40

In [None]:
df['Age'].value_counts()

In [None]:
df.Attrition.value_counts()
#converting object to numeric

In [None]:
sns.catplot(x='Age',hue='Attrition',data=df,kind='count',height=10)

In [None]:
df.EducationField.value_counts()

In [None]:
sns.catplot(x='EducationField',hue='Attrition',kind='count',data=df,height=7)
plt.xticks(rotation=90)

In [None]:
df['Education'].value_counts()
sns.countplot(x=df['Education'])

In [None]:
plt.xticks(rotation='vertical')
sns.countplot(x=df['EducationField'])

In [None]:
# Around 30% of employees have education level of 3 and 
# Around 70% of employees are having 'Life Sciences' and 'Medical' education field.
# For both male and female,attrition rate is higher for education level 1,2 and 3.

In [None]:
#Correlation is only defined for numeric columns
corr=df.select_dtypes(exclude=['object']).corr()
import  seaborn as sns 
plt.figure(figsize=[20,15])
sns.heatmap(corr,annot=True,cmap='YlGnBu',fmt='.0%')

In [None]:
#Dropped Unnecessary Columns from Dataset.
df=df.drop(['Over18','EmployeeNumber','StandardHours','EmployeeCount'],axis=1)

In [None]:
df['Age_emp']=df.Age

In [None]:
df.drop('Age',axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df.BusinessTravel.value_counts()

In [None]:
plt.xticks(rotation='vertical')
sns.countplot(x=df['BusinessTravel'])

In [None]:
#Employees who travel rarely make up the largest group.

In [None]:
sns.barplot(x='BusinessTravel',y='Attrition',data=df)

In [None]:
#Employees who travel more have a higher attrition rate

In [None]:
df.Department.value_counts()

In [None]:
sns.barplot(x='Department',y='Attrition',data=df)

In [None]:
plt.xticks(rotation='vertical')
sns.countplot(x=df['Department'])

In [None]:
#The R&D department has the maximum number of employees.

In [None]:
df.DistanceFromHome.value_counts()

In [None]:
df.DistanceFromHome = pd.to_numeric(df.DistanceFromHome,errors='coerce')

In [None]:
plt.figure(figsize=[10,10])
plt.xticks(rotation='vertical')
sns.countplot(x=df['DistanceFromHome'])

In [None]:
categor_.columns

In [None]:
df.EducationField.value_counts()

In [None]:
def edufield(x):
    if  x=='Test':
        x='Other'
    return x
df.EducationField=df.EducationField.apply(edufield)
df.EducationField.value_counts()

In [None]:
plt.figure(figsize=[10,10])
plt.xticks(rotation='vertical')
sns.countplot(x=df['EducationField'])

In [None]:
df.Gender.value_counts()

In [None]:
def gender(x):
    if x=='Male':
        x=1
    elif x=='Female':
        x=0
    return x

In [None]:
df.Gender=df.Gender.apply(gender)
df.Gender = pd.to_numeric(df.Gender,errors='coerce')


In [None]:
plt.figure(figsize=[10,10])
plt.xticks(rotation='vertical')
sns.countplot(x=df['Gender'])

In [None]:
plt.figure(figsize=[7,7])
sns.barplot(x='Gender',y='Attrition',data=df)

In [None]:
#Attrition of males is higher compared with females.

In [None]:
df.HourlyRate.value_counts()
df.HourlyRate = pd.to_numeric(df.HourlyRate,errors='coerce')

In [None]:
df.HourlyRate.value_counts()
plt.figure(figsize=[15,10])
plt.xticks(rotation='vertical')
sns.countplot(x=df['HourlyRate'])

In [None]:
df.JobRole.value_counts()

In [None]:
plt.figure(figsize=[10,10])
plt.xticks(rotation='vertical')
sns.countplot(x=df['JobRole'],order=sorted(df['JobRole'].unique()))

In [None]:
plt.figure(figsize=[10,10])
plt.xticks(rotation='vertical')
sns.barplot(x='JobRole',y='Attrition',data=df,ci=80)

In [None]:
#Above we can see that the attrition of Sales Representatives is much higher.

In [None]:
df.JobSatisfaction.value_counts()

In [None]:
plt.figure(figsize=[10,10])
plt.xticks(rotation=0)
sns.barplot(x='JobSatisfaction',y='Attrition',data=df)

In [None]:
df.JobSatisfaction = pd.to_numeric(df.JobSatisfaction,errors='coerce')


In [None]:
df.MaritalStatus.value_counts()

In [None]:
plt.figure(figsize=[10,10])
plt.xticks(rotation=0)
sns.barplot(x='MaritalStatus',y='Attrition',data=df)

In [None]:
#From the bar chart above we can see that single employees have a higher attrition rate.

In [None]:
df.MonthlyIncome.value_counts()

In [None]:
df.MonthlyIncome = pd.to_numeric(df.MonthlyIncome,errors='coerce')


In [None]:
df.OverTime.value_counts()

In [None]:
def overtime(x):
    if x=='Yes':
        x=1
    elif x=='No':
        x=0
    return x

In [None]:
df.OverTime=df.OverTime.apply(overtime)
df.OverTime = pd.to_numeric(df.OverTime,errors='coerce')
df.info()

In [None]:
sns.barplot(x='OverTime',y='Attrition',data=df)

In [None]:
df.PercentSalaryHike.value_counts()
df.PercentSalaryHike = pd.to_numeric(df.PercentSalaryHike,errors='coerce')
df.info()

In [None]:
def empsou(x):
    if x=='Test':
        x='Referral'
    return x
df['Employee Source']= df['Employee Source'].apply(empsou)

In [None]:
plt.figure(figsize=[10,10])
plt.xticks(rotation=90)
sns.barplot(x='Employee Source',y='Attrition',data=df)

In [None]:
df['Employee Source'].value_counts()

In [None]:
#Attrition is higher for employees whose source is Referral or Jora.

In [None]:
df=df.drop(['Application ID'],axis=1)
df.head()

In [None]:
df.Attrition.value_counts()


In [None]:
df.Attrition.shape

In [None]:
dfn=df.copy()
y=df.Attrition

In [None]:
df3=pd.get_dummies(df)
df3

In [None]:
df3.to_csv('HR_Analyst_File.csv', index=False)

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
# Print all categorical (object) columns and their unique values
for col in dfn.columns:
    if dfn[col].dtype=='object':
        print(color.BOLD+str(col)+color.END+ ' : '+str(dfn[col].unique()))
        print(dfn[col].value_counts())
        print('-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x')

In [None]:
# Print all numeric columns and their unique values
for col in dfn.columns:
    if dfn[col].dtype=='int64' or dfn[col].dtype=='float64':
        print(color.BOLD+str(col)+color.END+ ' : '+str(dfn[col].unique()))
        print(dfn[col].value_counts())
        print('-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x')

In [None]:
dfn.drop('Attrition',axis=1,inplace=True)

In [None]:
dfnew=pd.get_dummies(dfn)
dfnew.shape

In [None]:
dfn.to_csv('HR_Analyst_2.csv', index=False)

In [None]:
dfnew.info()

In [None]:
X=dfnew
X.head()

In [None]:
y.head()

In [None]:
X.shape,y.shape

> **MODEL BUILDING**

In [None]:
from sklearn.model_selection import train_test_split,cross_val_score,cross_validate,KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.neighbors import KNeighborsRegressor
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor,AdaBoostClassifier,ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC,SVR
from sklearn import metrics

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.4, random_state=0)

In [None]:
from sklearn.ensemble  import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100,criterion='entropy',random_state=0,max_depth=5,min_samples_split=10000)
rf.fit(X_train,y_train)

In [None]:
#Training accuracy
round(rf.score(X_train,y_train),2)

In [None]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,rf.predict(X_test) )
cm

In [None]:
TN=cm[0][0]
TP=cm[1][1]
FN=cm[1][0]
FP=cm[0][1]
print('Model Testing Accuracy={}'.format(round((TP+TN)/(TP+TN+FP+FN),3)))
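Accuracy alone can hide poor performance on the minority class, so it helps to derive precision and recall from the same four counts. A minimal sketch with hypothetical counts (these numbers are illustrative and are not results from this notebook):

```python
# Hypothetical confusion-matrix counts, laid out as sklearn does:
# rows are true labels, columns are predicted labels
cm = [[850, 50],
      [60, 40]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted leavers, how many really left
recall = tp / (tp + fn)      # of the actual leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case the accuracy looks high while recall on the attrition class stays low, which is the typical pattern on imbalanced data.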

In [None]:
#StandardScaler

In [None]:
logmodel = LogisticRegression()
smote=SMOTE(sampling_strategy='minority',random_state=3)
#Oversample only the training split so the test set keeps its original class balance
X_train_sm,y_train_sm=smote.fit_resample(X_train,y_train)
pd.Series(y_train_sm).value_counts()

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report,roc_auc_score,roc_curve
def model_eval(algo,xtrain,xtest,ytrain,ytest):
    algo.fit(xtrain,ytrain)
    y_train_pred=algo.predict(xtrain)
    y_train_prob=algo.predict_proba(xtrain)[:,1]

    y_test_pred=algo.predict(xtest)
    y_test_prob=algo.predict_proba(xtest)[:,1]
    print(color.BOLD+"MODEL USED FOR CLASSIFICATION :"+color.END,algo)
    print(color.BOLD+'Confusion Matrix-Train:\n'+color.END,confusion_matrix(ytrain,y_train_pred))
    print(color.BOLD+'Accuracy Score-Train:\n'+color.END,accuracy_score(ytrain,y_train_pred))
    print(color.BOLD+'Classification Report-Train:\n'+color.END,classification_report(ytrain,y_train_pred))
    print(color.BOLD+'AUC Score-Train:\n'+color.END,roc_auc_score(ytrain,y_train_prob))
    print('\n')
    print(color.BOLD+'Confusion Matrix-Test:\n'+color.END,confusion_matrix(ytest,y_test_pred))
    print(color.BOLD+'Accuracy Score-Test:\n'+color.END,accuracy_score(ytest,y_test_pred))
    print(color.BOLD+'Classification Report-Test:\n'+color.END,classification_report(ytest,y_test_pred))
    print(color.BOLD+'AUC Score-Test:\n'+color.END,roc_auc_score(ytest,y_test_prob))
    print('\n')
    print(color.BOLD+'Plot'+color.END)
    fpr,tpr,thresholds= roc_curve(ytest,y_test_prob)
    fig,ax1 = plt.subplots()
    ax1.plot(fpr,tpr)
    ax1.plot(fpr,fpr)
    ax1.set_xlabel('False Positive Rate')
    ax1.set_ylabel('True Positive Rate')
    ax2=ax1.twinx()
    ax2.plot(fpr,thresholds,'-g')
    ax2.set_ylabel('THRESHOLDS')
    plt.show()
    print('-x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--')

In [None]:
lr=LogisticRegression()
knn=KNN()
rf=RandomForestClassifier()
svc=SVC()
rfc=RandomForestClassifier()
dt=DecisionTreeClassifier()
knn=KNN()
xgb=XGBClassifier()
adb=AdaBoostClassifier()
sgd=SGDClassifier()
gnb=GaussianNB()
et=ExtraTreesClassifier()
models=[]
models.append(('MVLC',lr))
models.append(('XGB',xgb))
models.append(('RFC',rf))
models.append(('DT',dt))
models.append(('ExtraTreesClassifier',et))
models.append(('KNNC',knn))
models.append(('AdaBoostClassifier',adb))
results=[]
names=[]
ypred=[]
for name,model in models:
    model.fit(X_train,y_train)
    ypred= model.predict(X_test)
    print(color.BOLD+name+color.END,'\n:')
    print(classification_report(y_test,ypred))
    kfold=KFold(shuffle=True,n_splits=10,random_state=0)
    cv_results=cross_val_score(model,X_train,y_train,cv=kfold,scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print(color.BOLD+"%s: %f (%f)"%(name,np.mean(cv_results)*100,np.var(cv_results,ddof=1))+color.END)
    print('-x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x-')
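The loop above reports the mean of the fold accuracies together with their sample variance. A small sketch of the same summary using only the standard library, with hypothetical fold scores, also shows the standard deviation, which is in the same units as the accuracy and is often easier to read:

```python
import statistics

# Hypothetical accuracy scores from a 10-fold cross-validation run
fold_scores = [0.91, 0.93, 0.92, 0.90, 0.94, 0.92, 0.91, 0.93, 0.92, 0.92]

mean_pct = statistics.mean(fold_scores) * 100   # same as np.mean(scores) * 100
sample_var = statistics.variance(fold_scores)   # same as np.var(scores, ddof=1)
sample_std = statistics.stdev(fold_scores)      # square root of the sample variance

print(f"mean accuracy: {mean_pct:.2f}%  variance: {sample_var:.6f}  std: {sample_std:.4f}")
```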
    

In [None]:
#Plotting feature importance in the model
from sklearn.ensemble import RandomForestClassifier
modelRF = RandomForestClassifier()
modelRF.fit(X,y)
print(modelRF.feature_importances_)
column_name = pd.Series(modelRF.feature_importances_,index=X.columns)
plt.figure(figsize =(10,10))
#Plot the 8 most important features
column_name.nlargest(8).sort_values(ascending=True).plot(kind='barh')
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report,roc_auc_score,roc_curve
def model_eval_2(algo,xtrain,xtest,ytrain,ytest):
    algo.fit(xtrain,ytrain)
    y_train_pred=algo.predict(xtrain)
    y_train_prob=algo.predict_proba(xtrain)[:,1]

    y_test_pred=algo.predict(xtest)
    y_test_prob=algo.predict_proba(xtest)[:,1]
    print(color.BOLD+"MODEL USED FOR CLASSIFICATION :"+color.END,algo)
    #print(color.BOLD+'Confusion Matrix-Train:\n'+color.END,confusion_matrix(ytrain,y_train_pred))
    print(color.BOLD+'Accuracy Score-Train:\n'+color.END,accuracy_score(ytrain,y_train_pred))
    
    print(color.BOLD+'AUC Score-Train:\n'+color.END,roc_auc_score(ytrain,y_train_prob))
    #print('\n')
    #print(color.BOLD+'Confusion Matrix-Test:\n'+color.END,confusion_matrix(ytest,y_test_pred))
    print(color.BOLD+'Accuracy Score-Test:\n'+color.END,accuracy_score(ytest,y_test_pred))
    #print(color.BOLD+'Classification Report-Test:\n'+color.END,classification_report(ytest,y_test_pred))
    print(color.BOLD+'AUC Score-Test:\n'+color.END,roc_auc_score(ytest,y_test_prob))
    #print('\n')
    #print(color.BOLD+'Plot'+color.END)
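    print('-x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x-')
    #Return the test metrics so callers can collect them
    return accuracy_score(ytest,y_test_pred),roc_auc_score(ytest,y_test_prob)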

In [None]:
list1=[xgb,rf,dt,et,knn,adb]

In [None]:
ws=[]
wos=[]
for i in list1:
    print(color.BLUE+'WITH SMOTE'+color.END)
    ws.append(model_eval_2(i,X_train_sm,X_test,y_train_sm,y_test))
    print(color.BLUE+'WITHOUT SMOTE'+color.END)
    wos.append(model_eval_2(i,X_train,X_test,y_train,y_test))


In [None]:
#Model comparison with box plots.

fig = plt.figure(figsize=[10,7])
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
#There is not much difference between the results with and without SMOTE, which is an oversampling technique.
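For context, SMOTE balances the classes by creating synthetic minority points that lie between an existing minority point and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step in plain Python; the helper name `smote_like_sample` and the toy points are hypothetical, and the imblearn implementation differs in its details.

```python
import math
import random

def smote_like_sample(minority, rng, k=2):
    """Return one synthetic point on the segment between a random minority
    point and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(p, base))  # nearest neighbours first
    neighbour = rng.choice(others[:k])
    lam = rng.random()  # interpolation weight in [0, 1)
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbour))

rng = random.Random(0)
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
synthetic = [smote_like_sample(minority, rng) for _ in range(5)]

print(synthetic)
```

Because every synthetic point is a blend of two real minority points, it adds no genuinely new information, which is one plausible reason tree ensembles can score about the same with and without it.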

In [None]:
models=[dt,adb,et,rf,xgb]
for i in models:
    i.fit(X,y)
    print(i)
    #Plot the 8 most important features of this model
    feature_ranks = pd.Series(i.feature_importances_,index=X.columns)
    plt.figure(figsize =(10,10))
    feature_ranks.nlargest(8).sort_values(ascending=True).plot(kind='barh')
    plt.show()

In [None]:
#Collect the 8 most important features of every model and count how often each one appears
top_features=[]
for i in models:
    i.fit(X,y)
    imp_features = pd.Series(i.feature_importances_,index=X.columns)
    top_features.extend(imp_features.nlargest(8).index)
d = pd.DataFrame({'Imp_Features': top_features})
print(d)
d['Imp_Features'].value_counts()
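The idea in the cell above is a simple vote: each model nominates its top features, and the features nominated most often are kept. A minimal sketch of that vote with `collections.Counter`, using hypothetical top-3 lists (the feature names come from the dataset, but the rankings are made up):

```python
from collections import Counter

# Hypothetical top-3 feature lists, one per fitted model
top_features_per_model = {
    "DecisionTree": ["MonthlyIncome", "Age_emp", "OverTime"],
    "RandomForest": ["MonthlyIncome", "Age_emp", "DailyRate"],
    "XGBoost": ["OverTime", "MonthlyIncome", "StockOptionLevel"],
}

# Count how many models placed each feature in their top list
votes = Counter(
    feature
    for features in top_features_per_model.values()
    for feature in features
)
ranked = votes.most_common()

print(ranked)
```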

In [None]:
new_X=dfnew[['DailyRate', 'Age_emp', 'DistanceFromHome', 'MonthlyIncome',
       'TrainingTimesLastYear', 'TotalWorkingYears', 'MonthlyRate',
       'HourlyRate', 'PercentSalaryHike','BusinessTravel_Travel_Frequently', 'OverTime', 'StockOptionLevel']]
new_X

In [None]:
new_y=df['Attrition']
new_y

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( new_X, new_y, test_size=0.4, random_state=0)

In [None]:
from sklearn.ensemble  import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=1000,criterion='gini',random_state=0,max_depth=5,min_samples_split=10000)
rf.fit(X_train,y_train)
rf.score(X_train,y_train)

In [None]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,rf.predict(X_test) )
cm

In [None]:
TN=cm[0][0]
TP=cm[1][1]
FN=cm[1][0]
FP=cm[0][1]
print('Model Testing Accuracy={}'.format((TP+TN)/(TP+TN+FP+FN)))

In [None]:
logmodel = LogisticRegression()
smote=SMOTE(sampling_strategy='minority',random_state=3)
#Oversample only the training split so the test set keeps its original class balance
X_train_sm,y_train_sm=smote.fit_resample(X_train,y_train)
pd.Series(y_train_sm).value_counts()

In [None]:
lr=LogisticRegression()
knn=KNN()
rf=RandomForestClassifier()
svc=SVC()
rfc=RandomForestClassifier()
dt=DecisionTreeClassifier()
xgb=XGBClassifier()
et=ExtraTreesClassifier()
models=[]
models.append(('MVLC',lr))
models.append(('XGB',xgb))
models.append(('KNNC',knn))
models.append(('RFC',rf))
models.append(('ExtraTreesClassifier',et))
models.append(('DT',dt))
results=[]
names=[]
ypred=[]
for name,model in models:
    model.fit(X_train,y_train)
    ypred= model.predict(X_test)
    print(color.BOLD+name+color.END,'\n')
    print(classification_report(y_test,ypred))
    kfold=KFold(shuffle=True,n_splits=10,random_state=0)
    cv_results=cross_val_score(model,X_train,y_train,cv=kfold,scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print(color.BOLD+"%s: %f (%f)"%(name,np.mean(cv_results)*100,np.var(cv_results,ddof=1))+color.END)
    print('\n')
    print(color.BOLD+'Plot'+color.END)
    print('-x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x--x-x-x-x-x-')
    

In [None]:

fig = plt.figure(figsize=[12,12])
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.xticks(rotation=90)
plt.show()

In [None]:
#Hyperparameter tuning of the Random Forest

In [None]:
from sklearn.model_selection import GridSearchCV
rfgridcv=GridSearchCV(estimator=RandomForestClassifier(),
param_grid=[{'n_estimators': [5,10,50],
                               'max_depth':[5,10,15,20],
                               'min_samples_leaf':[10,50,100],
                               'min_samples_split': [20,100,200]}])
rfgridcv.fit(X_train,y_train)
y_train_pred=rfgridcv.predict(X_train)
y_train_prob=rfgridcv.predict_proba(X_train)[:,1]

y_test_pred=rfgridcv.predict(X_test)
y_test_prob=rfgridcv.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,thresholds= roc_curve(y_test,y_test_prob)
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr,label='ROC curve')
#The diagonal is the chance level of a random classifier
ax1.plot(fpr,fpr,label='Chance level')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()

In [None]:
rfgrid=GridSearchCV(estimator=RandomForestClassifier(),
                   param_grid=[{'n_estimators': [5,10,50],
                               'max_depth':[5,10,15,20],
                               'min_samples_leaf':[10,50,100],
                               'min_samples_split': [20,100,200]}])
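To get a feel for the cost of this search, the number of candidate models can be counted directly from the grid. The sketch below assumes the scikit-learn defaults of 5-fold cross-validation for `GridSearchCV` and `n_iter=10` for `RandomizedSearchCV`, since neither is set explicitly in this notebook:

```python
from itertools import product

param_grid = {
    "n_estimators": [5, 10, 50],
    "max_depth": [5, 10, 15, 20],
    "min_samples_leaf": [10, 50, 100],
    "min_samples_split": [20, 100, 200],
}

# Every combination of values is one candidate model for the grid search
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)          # 3 * 4 * 3 * 3

cv_folds = 5                              # assumption: default cv
n_grid_fits = n_candidates * cv_folds

n_random_candidates = 10                  # assumption: default n_iter
n_random_fits = n_random_candidates * cv_folds

print(n_candidates, n_grid_fits, n_random_fits)
```

Under these assumptions the randomized search tries only a small sample of the grid, so it can miss the combination that the exhaustive search finds.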

In [None]:
rfgrid_fit=rfgrid.fit(X_train,y_train)
rfgrid_fit

In [None]:
print(rfgrid_fit.best_estimator_)

In [None]:
rfgrid_score=rfgrid_fit.score(X_train,y_train)
rfgrid_score

In [None]:

from sklearn.model_selection import RandomizedSearchCV
rfrs_cv=RandomizedSearchCV(estimator=RandomForestClassifier(),
                   param_distributions=[{'n_estimators': [5,10,50],
                               'max_depth':[5,10,15,20],
                               'min_samples_leaf':[10,50,100],
                               'min_samples_split': [20,100,200]}])
rfrs_cv.fit(X_train,y_train)
y_train_pred=rfrs_cv.predict(X_train)
y_train_prob=rfrs_cv.predict_proba(X_train)[:,1]

y_test_pred=rfrs_cv.predict(X_test)
y_test_prob=rfrs_cv.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,thresholds= roc_curve(y_test,y_test_prob)
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr,label='ROC curve')
#The diagonal is the chance level of a random classifier
ax1.plot(fpr,fpr,label='Chance level')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()

In [None]:
rfrandomized=RandomizedSearchCV(estimator=RandomForestClassifier(),
                   param_distributions=[{'n_estimators': [5,10,50],
                               'max_depth':[5,10,15,20],
                               'min_samples_leaf':[10,50,100],
                               'min_samples_split': [20,100,200]}])

In [None]:
rfrand_fit=rfrandomized.fit(X_train,y_train)
rfrand_fit

In [None]:
print(rfrand_fit.best_estimator_)

In [None]:
rfrand_score=rfrand_fit.score(X_train,y_train)
rfrand_score

In [None]:
#After finding the best estimators from both CVs
#Both hyperparameter tuning methods give the same estimators for RandomForestClassifier

In [None]:
#Using the GridSearchCV best estimator
rfrs_cv=RandomForestClassifier(max_depth=20, min_samples_leaf=10, min_samples_split=20,
                       n_estimators=50)
rfrs_cv.fit(X_train,y_train)
y_train_pred=rfrs_cv.predict(X_train)
y_train_prob=rfrs_cv.predict_proba(X_train)[:,1]

y_test_pred=rfrs_cv.predict(X_test)
y_test_prob=rfrs_cv.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,thresholds= roc_curve(y_test,y_test_prob)
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr,label='ROC curve')
#The diagonal is the chance level of a random classifier
ax1.plot(fpr,fpr,label='Chance level')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()

* **After hyperparameter tuning with the best estimator from GridSearchCV**
* I got train and test accuracy of **Train = 0.9589 & Test = 0.932**
* **This model is neither overfit nor underfit.**
* **The train and test accuracies are balanced, so this is the best generalized model from our model building.**

In [None]:

rf_randcv=RandomForestClassifier(max_depth=10, min_samples_leaf=10, min_samples_split=100,
                       n_estimators=5)

rf_randcv.fit(X_train,y_train)
y_train_pred=rf_randcv.predict(X_train)
y_train_prob=rf_randcv.predict_proba(X_train)[:,1]

y_test_pred=rf_randcv.predict(X_test)
y_test_prob=rf_randcv.predict_proba(X_test)[:,1]

print('Confusion Matrix-Train\n',confusion_matrix(y_train,y_train_pred))
print('Accuracy Score-Train\n',accuracy_score(y_train,y_train_pred))
#print('Classification Report-Train\n',classification_report(y_train,y_train_pred))
print('AUC Score-Train\n',roc_auc_score(y_train,y_train_prob))
print('\n'*2)
print('Confusion Matrix-Test\n',confusion_matrix(y_test,y_test_pred))
print('Accuracy Score-Test\n',accuracy_score(y_test,y_test_pred))
#print('Classification Report-Test\n',classification_report(y_test,y_test_pred))
print('AUC Score-Test\n',roc_auc_score(y_test,y_test_prob))
print('\n'*3)
print('Plot : AUC-ROC Curve')
fpr,tpr,thresholds= roc_curve(y_test,y_test_prob)
fig,ax1 = plt.subplots()
ax1.plot(fpr,tpr,label='ROC curve')
#The diagonal is the chance level of a random classifier
ax1.plot(fpr,fpr,label='Chance level')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()

* **After hyperparameter tuning with the best estimator from RandomizedSearchCV**
* I got train and test accuracy of **Train = 0.877 & Test = 0.868**
* **This model is neither overfit nor underfit.**


* From both hyperparameter tuning methods above we found:
* GridSearchCV model train accuracy > RandomizedSearchCV train accuracy
* GridSearchCV model test accuracy > RandomizedSearchCV test accuracy.
* So the accuracy of the GridSearchCV model is higher than that of the RandomizedSearchCV model.


* ***The GridSearchCV-tuned model is the generalized model from our model building.***
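The comparison above can be made explicit by looking at the gap between train and test accuracy for each tuned model, using the figures reported in this notebook. This is a small sketch of that arithmetic; how large a gap counts as overfitting is a judgment call that it does not settle.

```python
# Accuracies reported above for the two tuned random forests
reported = {
    "GridSearchCV": {"train": 0.9589, "test": 0.932},
    "RandomizedSearchCV": {"train": 0.877, "test": 0.868},
}

# Train-test gap: a small gap suggests the model generalizes well
gaps = {name: round(acc["train"] - acc["test"], 4) for name, acc in reported.items()}

# Model with the highest accuracy on unseen data
best_on_test = max(reported, key=lambda name: reported[name]["test"])

print(gaps, best_on_test)
```

The GridSearchCV model has the slightly larger gap but still the higher test accuracy, which supports choosing it as the final model.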