1. Obtaining the data is the first step in solving the problem.
1. Scrubbing or cleaning the data is the next step. This includes imputing missing or invalid data and fixing column names.
1. Exploring the data follows and gives further insight into what the dataset contains. This is where we look for outliers or odd values and examine the relationship each explanatory variable has with the response variable, for example with a correlation matrix.
1. Modeling the data gives us the power to predict whether an employee will leave.
1. Interpreting the data is last. Given all the results and analysis, what conclusions can be drawn? Which factors contributed most to employee turnover? What relationships between variables were found?

**Problem statement**
Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

The aim is to identify which factors contribute most to employee turnover and to create a model that can predict whether a given employee will leave the company.

Education 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor'

EnvironmentSatisfaction 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

JobInvolvement 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

JobSatisfaction 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

PerformanceRating 1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding'

RelationshipSatisfaction 1 'Low' 2 'Medium' 3 'High' 4 'Very High'

WorkLifeBalance 1 'Bad' 2 'Good' 3 'Better' 4 'Best'
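
The integer codes in the legend above can be turned back into readable labels when that helps with plots or tables. The following is a minimal sketch on a toy DataFrame; the names `EDUCATION`, `SATISFACTION`, `toy`, and `labeled` are made up for illustration and are not used elsewhere in the notebook.

```python
import pandas as pd

# Label maps copied from the legend above; the DataFrame is a toy stand-in.
EDUCATION = {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"}
SATISFACTION = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}

toy = pd.DataFrame({"Education": [1, 3, 5], "JobSatisfaction": [4, 2, 1]})

# Series.map replaces each integer code with its label from the dictionary.
labeled = toy.assign(
    Education=toy["Education"].map(EDUCATION),
    JobSatisfaction=toy["JobSatisfaction"].map(SATISFACTION),
)
print(labeled)
```

Keeping the original integer columns for modeling and mapping to labels only for display avoids losing the ordinal information.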

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import os
print(os.listdir("../input"))
import warnings
warnings.filterwarnings("ignore")

In [None]:
hr_data=pd.read_csv("../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [None]:
hr_data.head(5)

In [None]:
hr_data.info()

There is no missing data.

Move the Attrition column to the front

In [None]:
front =hr_data['Attrition']
hr_data.drop(labels=['Attrition'], axis=1,inplace = True)
hr_data.insert(0, 'Attrition', front)
hr_data.head()

In [None]:
hr_data.shape

In [None]:
# normalize=True returns proportions, so the row count does not need to be hard-coded
attrition_rate = hr_data.Attrition.value_counts(normalize=True)
attrition_rate

The data is imbalanced.

About 84% of employees stayed and 16% of employees left.
NOTE: When performing cross-validation, it's important to maintain this turnover ratio.
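
One way to keep the turnover ratio the same in every cross-validation fold is stratified splitting. The sketch below uses scikit-learn's `StratifiedKFold` on toy labels that mimic the 84/16 split; the arrays `X` and `y` and the choice of 4 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 84 stayers (0) and 16 leavers (1), mirroring the ratio above.
y = np.array([0] * 84 + [1] * 16)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=7)

# Share of leavers in the held-out part of each fold.
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_rates)
```

Each held-out fold should show a leaver share close to the overall 16%, which a plain unstratified split does not guarantee.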

In [None]:
attrition_summary = hr_data.groupby('Attrition')

Mean value of the other variables (stayed vs. left)

In [None]:
# numeric_only=True skips the string columns, which have no mean
attrition_summary.mean(numeric_only=True)

#### Correlation Matrix

In [None]:
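# numeric_only=True restricts the matrix to numeric columns (recent pandas versions raise on string columns)
correlation = hr_data.corr(numeric_only=True)
correlation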

In [None]:
f,ax =  plt.subplots(figsize=(20,20))
# reuse the correlation matrix computed in the previous cell
sns.heatmap(correlation,annot=True,linewidths=.4,ax=ax,fmt='.1f',cmap="Paired")

In [None]:
pd.set_option('display.max_rows',None)
import itertools

def corrank(df):
    # compute the correlation matrix once, then list every unique column pair
    corr = df.corr(numeric_only=True)
    pairs = pd.DataFrame([[(i, j), corr.loc[i, j]] for i, j in itertools.combinations(corr.columns, 2)],
                         columns=['pairs', 'corr'])
    return pairs.sort_values(by='corr', ascending=False)

# prints the correlation pairs in descending order (largest on top)
print(corrank(hr_data))

### Statistical Test for Correlation
##### One-Sample T-Test (Measuring Satisfaction Level)
A one-sample t-test checks whether a sample mean differs from the population mean. Let's test whether the average job satisfaction of employees who left differs from that of the entire employee population.

**Hypothesis Testing:** Is there a significant difference in mean satisfaction level between employees who left and the entire employee population?

**Null Hypothesis**:  There is no difference in satisfaction level between employees who left and the entire employee population.

**Alternate Hypothesis**:  There is a difference in satisfaction level between employees who left and the entire employee population.

In [None]:
hr_data['Attrition'] = hr_data['Attrition'].map({'No':0,'Yes':1})

**satisfaction comparison**

In [None]:
# Compare the mean satisfaction of employees who left against the mean of the whole employee population
population_satisfaction = hr_data['JobSatisfaction'].mean()
left_satisfaction = hr_data[hr_data['Attrition']==1]['JobSatisfaction'].mean()

print( 'The mean for the employee population is: ' + str(population_satisfaction) )
print( 'The mean for the employees who left is: ' + str(left_satisfaction) )

#### Conducting the T-Test

Let's conduct a t-test at 95% confidence level and see if it correctly rejects the null hypothesis that the sample comes from the same distribution as the employee population. To conduct a one sample t-test, we can use the stats.ttest_1samp() function:

In [None]:
import scipy.stats as stats
stats.ttest_1samp(a=hr_data[hr_data['Attrition']==1]['JobSatisfaction'],  # satisfaction of employees who left
                  popmean=population_satisfaction)  # mean satisfaction of the whole employee population
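
To make clear what `ttest_1samp` computes, the t statistic can be reproduced by hand as the difference between the sample mean and the reference mean, divided by the standard error of the sample mean. This is a minimal sketch using only the standard library; the function name `one_sample_t` and the `scores` list are hypothetical and not taken from the dataset.

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (mean - popmean) / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical satisfaction scores, not taken from the dataset.
scores = [1, 2, 2, 3, 3, 3, 4, 4]
t_stat, dof = one_sample_t(scores, popmean=3.0)
print(t_stat, dof)
```

A sample mean below the reference mean gives a negative t, so the sign of the statistic indicates the direction of the difference.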

#### T-Test Result

The test result shows the test statistic "t" is equal to 3.58. This test statistic tells us how much the sample mean deviates from the null hypothesis. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():

#### T-Test Quantile
If the t-statistic value we calculated above (3.58) is outside the quantiles, then we can reject the null hypothesis

In [None]:
# Degrees of freedom for a one-sample t-test: sample size minus one
degree_freedom = len(hr_data[hr_data['Attrition']==1]) - 1

LQ = stats.t.ppf(0.025,degree_freedom)  # Left (2.5%) quantile

RQ = stats.t.ppf(0.975,degree_freedom)  # Right (97.5%) quantile

print ('The t-distribution left quantile is: ' + str(LQ))
print ('The t-distribution right quantile is: ' + str(RQ))

#### One-Sample T-Test Summary
#### T-Test = 3.58 | P-Value = 0.0004125521 | Reject Null Hypothesis

**Reject the null hypothesis because:**

1. The t-statistic is outside the quantiles.
1. The p-value is lower than the 5% significance level.

Based on the one-sample t-test, there appears to be a statistically significant difference between the mean satisfaction of employees who left and that of the entire employee population. The very low p-value of 0.0004125521, well below the 5% significance level, supports rejecting the null hypothesis.

But this does not necessarily mean that there is practical significance. We would have to conduct more experiments or collect more data about the employees to come to a more accurate finding.
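
One common way to gauge practical significance is a standardized effect size such as Cohen's d. This is an addition for illustration and not part of the original analysis. The sketch below divides the gap between the sample mean and the reference mean by the sample standard deviation; the function name and the `scores` list are hypothetical.

```python
import statistics

def cohens_d_one_sample(sample, popmean):
    """Standardized distance between the sample mean and a reference mean."""
    return (statistics.mean(sample) - popmean) / statistics.stdev(sample)

# Hypothetical satisfaction scores, not taken from the dataset.
scores = [1, 2, 2, 3, 3, 3, 4, 4]
d = cohens_d_one_sample(scores, popmean=3.0)
print(round(d, 3))
```

As a rough rule of thumb, absolute values around 0.2 are usually read as small, 0.5 as medium, and 0.8 as large, so a tiny p-value can still go with a small effect when the sample is big.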

In [None]:
categorical_col = []
for column in hr_data.columns:
    if hr_data[column].dtype == object and len(hr_data[column].unique()) <= 50:
        categorical_col.append(column)
        print(f"{column} : {hr_data[column].unique()}")
        print("====================================")

BusinessTravel : Workers who travel a lot are more likely to quit than other employees.

Department : Workers in Research & Development are more likely to stay than workers in other departments.

EducationField : Workers with Human Resources and Technical degrees are more likely to quit than employees from other fields of education.

Gender : Male employees are more likely to quit.

JobRole : Laboratory Technicians, Sales Representatives, and Human Resources staff are more likely to quit than workers in other positions.

MaritalStatus : Single workers are more likely to quit than married or divorced workers.

OverTime : Workers who work overtime are more likely to quit than others.
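
Statements like these can be checked by computing the share of leavers within each level of a categorical column. The following is a minimal sketch on a toy frame; `toy` and `attrition_rate_by` are hypothetical names, and it assumes `Attrition` has already been mapped to 0/1 as is done earlier in this notebook.

```python
import pandas as pd

# Toy frame; in the notebook this would be hr_data with Attrition mapped to 0/1.
toy = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "No", "No", "No", "Yes"],
    "Attrition": [1, 0, 0, 0, 1, 1],
})

def attrition_rate_by(df, column):
    """Share of leavers within each level of a categorical column."""
    return df.groupby(column)["Attrition"].mean().sort_values(ascending=False)

rates = attrition_rate_by(toy, "OverTime")
print(rates)
```

Because `Attrition` is 0/1, the group mean is exactly the attrition rate for that group, which is also what the bar plots later in the notebook display.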

In [None]:
hr_data.columns


### Visualization helps us understand our data in detail

#### Age

In [None]:
plt.figure(figsize=(14,6))
sns.countplot(x=hr_data.Age,color='hotpink')  # pass the data by keyword; recent seaborn versions require it

In [None]:
# factorplot was renamed catplot and its size argument became height in newer seaborn releases
sns.catplot(data=hr_data,y='Age',x='Attrition',height=6,aspect=1,kind='box',palette='winter')

**Age:** Employees in the relatively young age bracket of 25-35 are more likely to leave. Hence, efforts should be made to clearly articulate the long-term vision of the company and how young employees fit into that vision, as well as to provide incentives, for instance in the form of clear paths to promotion.


In [None]:
f,ax = plt.subplots(figsize = (12,10))
sns.boxplot(x="Gender",y="Age",hue="BusinessTravel",data=hr_data,palette="hls")

In [None]:
sns.catplot(data=hr_data,x='BusinessTravel',y='Attrition',height=6,aspect=1,kind='bar',palette='winter')

**BusinessTravel** : Workers who travel a lot are more likely to quit than other employees.

In [None]:
sns.jointplot(x=hr_data.MonthlyIncome, y=hr_data.Age, height=8, kind="scatter")
plt.show()

In [None]:
sns.catplot(x="Attrition", y="MonthlyIncome", data=hr_data,hue='Gender',height=7)

In [None]:
plt.figure(figsize=(13,6))
plt.style.use('seaborn-colorblind')
plt.grid(True, alpha=0.5)
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 0, 'MonthlyIncome'], label = 'Active Employee',color='olive')
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 1, 'MonthlyIncome'], label = 'Employee Left',color='maroon')
plt.xlabel('Monthly Income')
plt.xlim(left=0)
plt.ylabel('Density')
plt.title('Monthly Income in Percent by Attrition Status');

**Monthly Income**: people on higher wages are less likely to leave the company. Hence, efforts should be made to gather information on industry benchmarks in the current local market to determine if the company is providing competitive wages.

**Years at Company**

In [None]:
#Distribution of Years at company
plt.figure(figsize=(8,8))
# histplot replaces the deprecated distplot for plain histograms
sns.histplot(hr_data["YearsAtCompany"].astype(int),color='lime', kde=False);

In [None]:
plt.figure(figsize=(13,6))
plt.style.use('seaborn-colorblind')
plt.grid(True, alpha=0.5)
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 0, 'YearsAtCompany'], label = 'Active Employee',color='orangered')
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 1, 'YearsAtCompany'], label = 'Employees Left',color='mediumblue')
plt.xlabel('YearsAtCompany')
plt.xlim(left=0)
plt.ylabel('Density')
plt.title('Years At Company in Percent by Attrition');

**YearsAtCompany**: Employees who hit their two-year anniversary should be identified as potentially having a higher risk of leaving.
A strategic "Retention Plan" should be drawn up for each risk category group.

In [None]:
plt.figure(figsize=(13,6))
plt.style.use('seaborn-colorblind')
plt.grid(True, alpha=0.5)
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 0, 'TotalWorkingYears'], label = 'Active Employee',color='cyan')
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 1, 'TotalWorkingYears'], label = 'Ex-Employees',color='limegreen')
plt.xlabel('TotalWorkingYears')
plt.xlim(left=0)
plt.ylabel('Density')
plt.title('Total Working Years in Percent by Attrition Status');

**TotalWorkingYears:** More experienced employees are less likely to leave. Employees with 5-8 years of experience should be identified as potentially having a higher risk of leaving.

In [None]:
plt.figure(figsize=(13,6))
plt.style.use('seaborn-colorblind')
plt.grid(True, alpha=0.5)
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 0, 'YearsWithCurrManager'], label = 'Active Employee',color='fuchsia')
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 1, 'YearsWithCurrManager'], label = 'Ex-Employees',color='darkblue')
plt.xlabel('YearsWithCurrManager')
plt.xlim(left=0)
plt.ylabel('Density')
plt.title('Years With Curr Manager in Percent by Attrition Status');

**YearsWithCurrManager**: A large number of leavers leave 6 months after their Current Managers. By using Line Manager details for each employee, one can determine which Managers have experienced the largest numbers of employees resigning over the past year. Several metrics can be used here to determine whether action should be taken with a Line Manager.

In [None]:
plt.figure(figsize=(13,6))
plt.style.use('seaborn-colorblind')
plt.grid(True, alpha=0.5)
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 0, 'PercentSalaryHike'], label = 'Active Employee',color='deeppink')
sns.kdeplot(hr_data.loc[hr_data['Attrition'] == 1, 'PercentSalaryHike'], label = 'Employees Left',color='darkgreen')
plt.xlabel('PercentSalaryHike')
plt.xlim(left=0)
plt.ylabel('Density')
plt.title('Percent Salary Hike in Percent by Attrition Status');

Employees who received a salary hike of 10% to 17% are more likely to leave the company.

#### Department

In [None]:
sns.catplot(data=hr_data,x='Department',y='Attrition',height=7,aspect=1,kind='bar',palette='cubehelix')

**Department :** Workers in Research & Development are more likely to stay than workers in other departments.

#### Education

In [None]:
plt.figure(figsize=(13,6))
ax = sns.countplot(data=hr_data,x='Education',hue='Gender',palette='Purples')
ax.set_xticklabels([ '1-Below College' , '2-College' , '3-Bachelor' ,'4-Master',  '5-Doctor'])
plt.show()

In [None]:
ax=sns.catplot(data=hr_data,x='Education',y='Attrition',height=7,aspect=1,kind='bar',palette="cubehelix")
ax.set_xticklabels([ '1-Below College' , '2-College' , '3-Bachelor' ,'4-Master',  '5-Doctor'])

Employees with Below College or Bachelor education tend to leave the company more than others.

In [None]:
labels=hr_data.EducationField.value_counts().index
sizes=hr_data.EducationField.value_counts().values
plt.figure(figsize=(7,7))
plt.pie(sizes,labels=labels,colors=["deepskyblue","darkorchid","hotpink","cyan","tomato","lime"],autopct="%1.1f%%")
plt.title("Education Field Counts",fontsize=18,color='maroon')

In [None]:
x=sns.catplot(data=hr_data,x='EducationField',y='Attrition',height=7,aspect=1,kind='bar')


**EducationField**: Workers with Human Resources and Technical degrees are more likely to quit than employees from other fields of education.

#### EnvironmentSatisfaction

In [None]:
plt.figure(figsize=(10,6))
ax = sns.countplot(data=hr_data,x='EnvironmentSatisfaction',hue='Gender',palette='bright')
ax.set_xticklabels([ '1-Low' , '2-Medium' , '3-High' , '4-Very High'])
plt.show()

In [None]:
ax=sns.catplot(data=hr_data,x='EnvironmentSatisfaction',y='Attrition',height=7,aspect=1,kind='bar',color='lightskyblue')
ax.set_xticklabels([ '1-Low' , '2-Medium' , '3-High' , '4-Very High'])

#### Job satisfaction

In [None]:
plt.figure(figsize=(12,6))
ax = sns.countplot(data=hr_data,x='JobSatisfaction',hue='Gender',palette='Accent')
ax.set_xticklabels([ '1-Low' , '2-Medium' , '3-High' , '4-Very High'])
plt.show()

In [None]:
plt.figure(figsize=(10,6))
ax=sns.violinplot(data=hr_data,x='JobSatisfaction',y='Attrition');
ax.set_xticklabels([ '1-Low' , '2-Medium' , '3-High' , '4-Very High'])

People with low environment satisfaction and low job satisfaction are more likely to leave the company.

#### PerformanceRating

In [None]:
plt.figure(figsize=(12,6))
ax = sns.countplot(data=hr_data,x='PerformanceRating',hue='Gender',palette='Set2')
#ax.set_xticklabels( ['1-Low' , '2-Good','3-Excellent' , '4-Outstanding'])
plt.show()

In [None]:
g=sns.catplot(data=hr_data,x='PerformanceRating',y='Attrition',height=6,aspect=1,kind='violin')

#### Work-life balance

In [None]:
plt.figure(figsize=(12,6))
ax = sns.countplot(data=hr_data,x='WorkLifeBalance',hue='Gender',palette='Set1')
ax.set_xticklabels([ '1-Bad', '2-Good', '3-Better', '4-Best'])
plt.show()

In [None]:
ax=sns.catplot(data=hr_data,x='WorkLifeBalance',y='Attrition',height=7,aspect=1,kind='bar')
ax.set_xticklabels([ '1-Bad', '2-Good', '3-Better', '4-Best'])

### Job Role vs Monthly Income

In [None]:
income=pd.DataFrame(hr_data.groupby("JobRole").MonthlyIncome.mean().sort_values(ascending=False))

In [None]:

fig = plt.figure(figsize=(13,8))
ax = sns.barplot(x=income.index,y=income.MonthlyIncome)
plt.xticks(rotation=90)
plt.xlabel("Job Roles")
plt.ylabel("Monthly Income")
plt.title("Job Roles with Monthly Income")
plt.show()

### MonthlyIncome and MonthlyRate

In [None]:
g = sns.pairplot(hr_data, vars=["MonthlyIncome", "MonthlyRate"],hue="Department",height=5)

#### Years at Company

#### Distance from Home

In [None]:
sns.catplot(data=hr_data,y='Attrition',x='DistanceFromHome',height=7,aspect=1,kind='bar')

People who live farther from work show a higher proportion of leavers than those who live closer.

### TotalWorkingYears vs YearsAtCompany

In [None]:
plt.figure(figsize=(10,10))
sns.jointplot(x=hr_data['TotalWorkingYears'], y=hr_data['YearsAtCompany'],kind='reg',
              height=8,color= 'mediumvioletred')

In [None]:
sns.catplot(data=hr_data,y='Attrition',x='NumCompaniesWorked',height=7,aspect=1,kind='bar')

People who have worked at more companies are more likely to leave.

#### OverTime


In [None]:
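# OverTime is still a Yes/No string here, so it goes on the x axis and the 0/1 Attrition rate on the y axis
sns.catplot(data=hr_data,x='OverTime',y='Attrition',height=6,aspect=1,kind='bar')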

**Over Time**: People who work overtime are more likely to leave the company. Hence, efforts should be made to scope projects appropriately upfront, with adequate support and staffing, so as to reduce the use of overtime.

### Monthly Income vs Job role,Job Level,Department

In [None]:
sns.swarmplot(x="Department", y="MonthlyIncome", hue="Attrition", data=hr_data);
plt.show()

sns.swarmplot(x="JobRole", y="MonthlyIncome", hue="Attrition", data=hr_data);
plt.xticks( rotation=90 )
plt.show()


sns.swarmplot(x="JobLevel", y="MonthlyIncome", hue="Attrition", data=hr_data);
plt.show()

In [None]:
age=pd.DataFrame(hr_data.groupby("Age")[["MonthlyIncome","DailyRate","MonthlyRate",'HourlyRate']].mean())
age["Count"]=hr_data.Age.value_counts(dropna=False)
age.reset_index(level=0, inplace=True)
age.head()

In [None]:
age.describe().plot(kind = "area",fontsize=15, figsize = (25,8), table = True,colormap="rainbow")
plt.xlabel('Statistics',)
plt.ylabel('Value')
plt.title("General Statistics of Rate")

#### YearsInCurrentRole vs MonthlyIncome

In [None]:
sns.relplot(y="YearsInCurrentRole", x="MonthlyIncome", hue='Department', size="JobSatisfaction",
            sizes=(40, 400), alpha=.5,  palette="cubehelix",
            height=8, data=hr_data)

In [None]:
sns.catplot(y="JobRole",x="Attrition", kind="bar",height=9, data=hr_data);

**JobRole** : Laboratory Technicians, Sales Representatives, and Human Resources staff are more likely to quit than workers in other positions.

### Overtime

In [None]:
sns.catplot(x="OverTime", y="Age", kind="swarm", data=hr_data);

In [None]:
sns.catplot(x="MaritalStatus", y="Attrition", kind="bar",height=7, data=hr_data,color='darkmagenta');

**MaritalStatus**: Single workers are more likely to quit than married or divorced workers.

### Create Model

### Preparing DataSet

#### Feature Encoding

We use Label Encoder to encode categorical labels with numerical values.

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Create a label encoder object
le = LabelEncoder()

In [None]:
hr_data.shape

Label encoding will be used for columns with two or fewer unique values.

In [None]:
le_count = 0
for col in hr_data.columns[1:]:
    if hr_data[col].dtype == 'object':
        if len(list(hr_data[col].unique())) <= 2:
            le.fit(hr_data[col])
            hr_data[col] = le.transform(hr_data[col])
            le_count += 1
print('{} columns were label encoded.'.format(le_count))

In [None]:
# convert rest of categorical variable into dummy
hr_data = pd.get_dummies(hr_data, drop_first=True)
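
To illustrate the two encoding steps on something small, the sketch below applies them to a toy frame. The column values are made up, and a plain dictionary map stands in for `LabelEncoder`, which assigns integers in sorted label order (so `No` becomes 0 and `Yes` becomes 1).

```python
import pandas as pd

toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "Yes"],       # two levels
    "Department": ["Sales", "R&D", "HR"],   # three levels
})

# Two-level column -> a single 0/1 column (what LabelEncoder produces here).
toy["OverTime"] = toy["OverTime"].map({"No": 0, "Yes": 1})

# Multi-level column -> dummy columns; drop_first removes one reference level.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first level avoids a redundant column, since membership in the dropped level is implied when all remaining dummies are zero.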

In [None]:
hr_data.shape

In [None]:
hr_data.head(3)

### Feature Scaling

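Feature scaling puts numerical variables on a similar scale, which many machine learning algorithms need to perform well. MinMaxScaler shrinks each feature into a fixed range such as 0 to 5, while StandardScaler rescales each feature to zero mean and unit variance. The code below uses StandardScaler. Note that the scaled array `X` is not passed to the models trained later, which are fit on the unscaled `hr_data`.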
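
As a rough illustration of what scaling does to a single feature, the sketch below applies min-max scaling to the range 0 to 5 and standardization by hand with NumPy. The array `x` holds hypothetical values; the formulas are meant to mirror what `MinMaxScaler(feature_range=(0, 5))` and `StandardScaler` compute per column.

```python
import numpy as np

x = np.array([20.0, 30.0, 40.0, 60.0])  # hypothetical feature values

# Min-max scaling to [0, 5]: shift to zero, divide by the range, stretch to the target range.
lo, hi = 0.0, 5.0
minmax = lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

# Standardization: subtract the mean and divide by the population standard deviation.
standard = (x - x.mean()) / x.std()

print(minmax)
print(standard)
```

Min-max scaling preserves the shape of the distribution inside fixed bounds, while standardization centers it at zero, which tends to suit models such as logistic regression or SVMs.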

In [None]:
#import the necessary modelling algos.
from sklearn.linear_model import LogisticRegression
#from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB

#model selection
from sklearn.model_selection import train_test_split
#from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,roc_curve,roc_auc_score
#from sklearn.model_selection import GridSearchCV

#from imblearn.over_sampling import SMOTE

#preprocess.
# Imputer is no longer available in sklearn.preprocessing and is not used in this notebook
from sklearn.preprocessing import MinMaxScaler,StandardScaler,LabelEncoder,OneHotEncoder

In [None]:
scaler=StandardScaler()
scaled_df=scaler.fit_transform(hr_data.drop('Attrition',axis=1))
X=scaled_df
Y=hr_data['Attrition'].to_numpy()  # as_matrix() was removed from pandas

In [None]:
# assign the target to a new dataframe and convert it to a numerical feature
#df_target = df_HR[['Attrition']].copy()
target = hr_data['Attrition'].copy()

In [None]:
# let's remove the target feature and redundant features from the dataset
hr_data.drop(['Attrition', 'EmployeeCount', 'EmployeeNumber',
            'StandardHours', 'JobRole_Research Scientist','Over18','DailyRate','HourlyRate','MonthlyRate','PercentSalaryHike','PerformanceRating',], axis=1, inplace=True)
print('Size of Full dataset is: {}'.format(hr_data.shape))

In [None]:
# Since we have class imbalance (i.e. more employees with Attrition=0 than Attrition=1)
# let's use stratify=target to keep the same class ratio in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(hr_data,
                                                    target,
                                                    test_size=0.25,
                                                    random_state=7,
                                                    stratify=target)  
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

### Evaluation Metric

Another important point when dealing with imbalanced classes is the choice of the right evaluation metrics.

Note that accuracy is not a good choice. Because the data is skewed, even an algorithm that always predicts the majority class will achieve very high accuracy. For example, with 20 observations of one class and 980 of another, a classifier that always predicts the majority class attains an accuracy of 98% but conveys no useful information.

Hence in these types of cases we may use other metrics such as:

'Precision'-- (true positives)/(true positives+false positives)

'Recall'-- (true positives)/(true positives+false negatives)

'F1 Score'-- The harmonic mean of 'precision' and 'recall'

'AUC ROC'-- The ROC curve plots 'sensitivity' (recall, the true positive rate) against '1-specificity' (the false positive rate), where specificity = (true negatives)/(true negatives+false positives)

'Confusion Matrix'-- Plot the entire confusion matrix
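
The definitions above can be checked on the 20 versus 980 example. This is a minimal sketch in plain Python; the helper `classification_metrics` is a hypothetical name, and returning 0.0 when a denominator is zero is a convention chosen here for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Always predicting the majority class on 20 positives and 980 negatives:
# no positives are predicted, so tp = fp = 0, fn = 20, tn = 980.
m = classification_metrics(tp=0, fp=0, fn=20, tn=980)
print(m)
```

Accuracy comes out at 0.98 while recall and F1 are both zero, which is why accuracy alone says little on imbalanced data.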

In [None]:
# Create base rate model
def base_rate_model(X) :
    y = np.zeros(X.shape[0])
    return y

In [None]:
# Check accuracy of base rate model
y_base_rate = base_rate_model(X_test)
from sklearn.metrics import accuracy_score
print ("Base rate accuracy is %2.2f" % accuracy_score(y_test, y_base_rate))

In [None]:
# Check accuracy of Logistic Model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=1)

model.fit(X_train, y_train)
print ("Logistic accuracy is %2.2f" % accuracy_score(y_test, model.predict(X_test)))

In [None]:
# Using 10 fold Cross-Validation to train our Logistic Regression Model
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
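# Stratified folds keep the turnover ratio in every fold; random_state only takes effect with shuffle=True
kfold = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=7)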
modelCV = LogisticRegression(class_weight = "balanced")
scoring = 'roc_auc'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

In [None]:
# Compare the Logistic Regression Model V.S. Base Rate Model V.S. Random Forest Model
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier


print ("---Base Model---")
base_roc_auc = roc_auc_score(y_test, base_rate_model(X_test))
print ("Base Rate AUC = %2.2f" % base_roc_auc)
print(classification_report(y_test, base_rate_model(X_test)))

# NOTE: By adding in "class_weight = balanced", the Logistic Auc increased by about 10%! This adjusts the threshold value
logis = LogisticRegression(class_weight = "balanced")
logis.fit(X_train, y_train)
print ("\n\n ---Logistic Model---")
logit_roc_auc = roc_auc_score(y_test, logis.predict(X_test))
print ("Logistic AUC = %2.2f" % logit_roc_auc)
print(classification_report(y_test, logis.predict(X_test)))

# Decision Tree Model
dtree = tree.DecisionTreeClassifier(
    #max_depth=3,
    class_weight="balanced",
    min_weight_fraction_leaf=0.01
    )
dtree = dtree.fit(X_train,y_train)
print ("\n\n ---Decision Tree Model---")
dt_roc_auc = roc_auc_score(y_test, dtree.predict(X_test))
print ("Decision Tree AUC = %2.2f" % dt_roc_auc)
print(classification_report(y_test, dtree.predict(X_test)))

# Random Forest Model
rf = RandomForestClassifier(
    n_estimators=1000, 
    max_depth=None, 
    min_samples_split=10, 
    class_weight="balanced"
    #min_weight_fraction_leaf=0.02 
    )
rf.fit(X_train, y_train)
print ("\n\n ---Random Forest Model---")
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print ("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))


# Ada Boost
ada = AdaBoostClassifier(n_estimators=400, learning_rate=0.1)
ada.fit(X_train,y_train)
print ("\n\n ---AdaBoost Model---")
ada_roc_auc = roc_auc_score(y_test, ada.predict(X_test))
print ("AdaBoost AUC = %2.2f" % ada_roc_auc)
print(classification_report(y_test, ada.predict(X_test)))
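
The AUC values in the cell above are computed from hard class predictions (`.predict()`), whereas the ROC curves further down are drawn from predicted probabilities, so the two need not agree. The sketch below shows the difference on made-up numbers with a small pairwise AUC helper; `pairwise_auc` is a hypothetical function that counts how often a positive example is scored above a negative one (ties count half), which is the usual rank-based definition of AUC for binary problems.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for six employees.
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

# Hard predictions at the default 0.5 threshold discard the ranking information.
labels = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

With scikit-learn, the probability-based figure would typically be obtained by passing `model.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of `model.predict(X_test)`.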

In [None]:
# Create ROC Graph
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, logis.predict_proba(X_test)[:,1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])
dt_fpr, dt_tpr, dt_thresholds = roc_curve(y_test, dtree.predict_proba(X_test)[:,1])
ada_fpr, ada_tpr, ada_thresholds = roc_curve(y_test, ada.predict_proba(X_test)[:,1])

plt.figure()

# Plot Logistic Regression ROC
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)

# Plot Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)

# Plot Decision Tree ROC
plt.plot(dt_fpr, dt_tpr, label='Decision Tree (area = %0.2f)' % dt_roc_auc)

# Plot AdaBoost ROC
plt.plot(ada_fpr, ada_tpr, label='AdaBoost (area = %0.2f)' % ada_roc_auc)

# Plot Base Rate ROC
plt.plot([0,1], [0,1], 'k--', label='Base Rate')  # line style is a separate positional argument, not part of the label

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()

### Calculating feature importance

In [None]:
feat_importances = pd.Series(rf.feature_importances_, index=hr_data.columns)
feat_importances = feat_importances.nlargest(20)
feat_importances.plot(kind='barh')
plt.show()