# **MINI PROJECT #3**

By Chau Tran

# **EMPLOYEE TURNOVER PREDICTION AND INFLUENCE FACTORS ANALYSIS**

**What are the key influence factors on employee turnover, their costs and how to prevent?**


## TABLE OF CONTENTS

* [SUMMARY](#summary)
* [DATASET](#dataset)
* [EDA](#eda)
* [FEATURE ENGINEERING](#feature_engineering)
* [MODELING](#modeling)

## SUMMARY<a id="summary"></a>

### Data questions
What is the adding value of kitchen feature onto house price of Ames between 2007 and 2010?

### Stake-holders
Marketing manager - House renovation company, property owners

### Objectives
This analysis examines the Ames Housing dataset and identifies what renovation features that likely bring significant adding values to the house price. In this case, we found kitchen quality as the main  feature that brought significant influence.

From building three linear regression models to predict the sale price and pick the best one, we found the adding value of kitchen quality.

By improving the kitchen quality, the house value (with the price of 200k before remodelling) could be increased by of 12%. Details of adding values are as below.

The findings help renovation company as the stake-holder to design marketing strategy on their renovation services  as well as help the property owners to make decision on whether or not to improve their house value before they sell their houses.

### Cost Of Employee Turnover Report

![title](figures/Kitchen_addingvalue_roi_01.png)
![title](figures/Kitchen_addingvalue_roi_02.png)

### Dataset<a id="dataset"></a>

https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Train dataset (from train.csv)

Rows: 1470

Columns: 32

Duplicated rows: 0

Column names to reformat: 0 

Numerical columns: 25

Ordinal columns: 14

Nominal columns: 7

Missing value columns: 0

In [258]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import KFold 
from sklearn.model_selection import GridSearchCV

import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score, recall_score, precision_score, f1_score, average_precision_score 
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report
from sklearn.metrics import plot_precision_recall_curve, precision_recall_curve

from sklearn.preprocessing import StandardScaler

sns.set(color_codes = True)

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# plt.style.use('fivethirtyeight')

%matplotlib inline

In [None]:
data_file = 'IBM-Employee-Attrition.csv'

# import cafe listings into dataframe
emp = pd.read_csv(data_file)

emp = emp.drop(columns=['EmployeeCount', 'EmployeeNumber','Over18','StandardHours'])


attrition_map={'Yes':1,'No':0}
emp['Attrition'] = emp['Attrition'].map(attrition_map)

In [None]:
emp.head()

## EDA<a id="eda"></a>

In [None]:
emp.shape

In [None]:
# get all numerical columns
numerical_dtypes = ['int16','int32', 'int64','float16','float32','float64']
num_cols = []
for i in emp.columns:
    if emp[i].dtype in numerical_dtypes:
        num_cols.append(i)
        
print(len(num_cols))
print(num_cols)

In [None]:
# get all category columns

# cat_cols = list(set(emp.columns) - set(num_cols))

cat_cols = emp.columns.difference(num_cols)
print(len(cat_cols))
print(cat_cols)

In [None]:
emp.describe().T

In [None]:
attri_yes = emp[emp.Attrition==1]
attri_no =  emp[emp.Attrition==0]

In [None]:
def plot_pair_charts(data1, data2, column_list, legend_text,fig_size):
    fig = plt.figure(figsize=fig_size)
    fig.subplots_adjust(hspace=1, wspace=0.5, top=0.96)
    fig.suptitle('Attrtion Yes vs No')

    for i, col in enumerate(list(data1[column_list]),1):
        # print(col)
        ax = fig.add_subplot(len(column_list), 3, i)
        plt.hist(data1[col],alpha=0.8)
        plt.hist(data2[col],alpha=0.8)
        plt.xlabel('{}'.format(col), size=15,labelpad=12.5)
        plt.ylabel('Frequecy', size=15, labelpad=12.5)
        plt.legend(legend_text)
        plt.xticks(rotation=45)
    
    #figname = 'High vs Low quality houses.png'
    # fig.savefig(figname,transparent=False, bbox_inches='tight', dpi=300)
    
    plt.show()

In [None]:
emp.EducationField.unique()

In [None]:
# plot bar charts of ordinal features 
cols_to_plot = cat_cols
plot_pair_charts(attri_no,attri_yes,cols_to_plot,legend_text=['Attrition No','Attrition Yes'],fig_size=(15, 30))   

In [None]:
# plot bar charts of numerical features 
cols_to_plot = num_cols
plot_pair_charts(attri_no,attri_yes,cols_to_plot,legend_text=['Attrition No','Attrition Yes'],fig_size=(15, 80))   

In [None]:
total_attrition = emp[(emp.Attrition==1)]

age_under_25 = emp[(emp.Age<25) & (emp.Attrition==1)]
age_25_40 = emp[(emp.Age>=25) & (emp.Age<=40) & (emp.Attrition==1)]
age_40_50 = emp[(emp.Age>40) & (emp.Age<=50) & (emp.Attrition==1)]
age_above_50 = emp[(emp.Age>50) & (emp.Attrition==1)]


age_under_25_male = age_under_25[age_under_25.Gender=='Male']
age_under_25_female = age_under_25[age_under_25.Gender=='Female']
age_25_40_male = age_25_40[age_25_40.Gender=='Male']
age_40_50_male = age_40_50[age_40_50.Gender=='Male']
age_above_50_male = age_above_50[age_above_50.Gender=='Male']

age_25_40_male_overtime = emp[(emp.Age>=25) & (emp.Age<=40) & (emp.Gender=='Male') & (emp.OverTime=='Yes') & (emp.Attrition==1)].shape[0] 

attri_by_departments = attri_yes.groupby('Department')



In [None]:
genZ = emp[(emp.Age<=22) & (emp.Attrition==1)]
genY_millenials = emp[(emp.Age>=23) & (emp.Age<=38) & (emp.Attrition==1)]
genX = emp[(emp.Age>=39) & (emp.Age<=54) & (emp.Attrition==1)]
boomers = emp[(emp.Age>54) & (emp.Attrition==1)]

genZ_male = genZ[genZ.Gender=='Male']
genY_millenials_male = genY_millenials[genY_millenials.Gender=='Male']
genX_male = genX[genX.Gender=='Male']
boomers_male = boomers[boomers.Gender=='Male']

In [None]:
age_under_25_male

In [None]:
age_25_40.shape[0] / total_attrition.shape[0] 

In [None]:
age_25_40_male / age_25_40 

In [None]:
age_25_40_male_overtime / age_25_40_male

In [None]:
overtime = emp[(emp.OverTime=='Yes') & (emp.Attrition==1)]
overtime.shape[0] / total_attrition


In [None]:
# ATTRITION YES vs NO

# set data to plot
group_names=['Churn', 'No Churn']
group_size=[attri_yes.shape[0],attri_no.shape[0]]
pcts = [f'{s} {l}\n\n({s*100/sum(group_size):.2f}%)' for s,l in zip(group_size, group_names)]

# Create colors
a, b, =[plt.cm.Reds, plt.cm.Blues]

# First Ring (outside)
fig, ax = plt.subplots()
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=1.3, labels=pcts, colors=[a(0.6), b(0.6)])
plt.setp(mypie, width=0.3, edgecolor='white')
plt.show()


In [None]:
# set data to plot

emp_by_department = pd.DataFrame(attri_yes.groupby('Department').agg({'Attrition':'sum'}))
# emp_by_department.index
# emp_by_department.loc['Sales']
# emp_by_department.loc['Sales'].Attrition

group_names=[n for n in emp_by_department.index]
group_size=[i for i in emp_by_department.Attrition]
pcts = [f'{s} {l}\n\n({s*100/sum(group_size):.2f}%)' for s,l in zip(group_size, group_names)]

# Create colors
a, b, c, d =[plt.cm.Blues, plt.cm.Reds, plt.cm.Greens, plt.cm.Purples]

# First Ring (outside)
fig, ax = plt.subplots()
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=1.3, labels=pcts, colors=[a(0.6), b(0.6), c(0.6), d(0.6)])
plt.setp( mypie, width=0.3, edgecolor='white')
plt.show()


In [None]:

group_names=['genZ - under 23', 'genY_millenials 23-38', 'genX 39 - 54', 'boomers above 54']
group_size=[genZ.shape[0],genY_millenials.shape[0],genX.shape[0],boomers.shape[0]]
pcts = [f'{s} {l}\n\n({s*100/sum(group_size):.2f}%)' for s,l in zip(group_size, group_names)]


subgroup_names=['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']
subgroup_size=[genZ_male.shape[0], genZ.shape[0]-genZ_male.shape[0],
               genY_millenials_male.shape[0], genY_millenials.shape[0]-genY_millenials_male.shape[0],
               genX_male.shape[0], genX.shape[0]-genX_male.shape[0],
               boomers_male.shape[0], boomers.shape[0]-boomers_male.shape[0]
              ]

# Create colors
a, b, c, d =[plt.cm.Blues, plt.cm.Reds, plt.cm.Greens, plt.cm.Purples]

# First Ring (outside)
fig, ax = plt.subplots()
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=1.3, labels=pcts, colors=[a(0.6), b(0.6), c(0.6), d(0.6)])
plt.setp( mypie, width=0.3, edgecolor='white')


# Second Ring (Inside)
mypie2, _ = ax.pie(subgroup_size, radius=1.3-0.3, labels=subgroup_names, labeldistance=0.7, colors=[a(0.5), a(0.2), b(0.5), b(0.2), c(0.5), c(0.2), d(0.5), d(0.2)])
plt.setp( mypie2, width=0.4, edgecolor='white')
plt.margins(0,0)

# show it
plt.show()


In [None]:
def bar_chart(train, feature):
    leave = train[train['Attrition']==1][feature].value_counts(normalize=True)*100
    stay = train[train['Attrition']==0][feature].value_counts(normalize=True)*100
    df = pd.DataFrame([leave,stay])
    df.index = ['Leave','Stay']
    df.plot(kind='bar',stacked=True, figsize=(10,5))
    plt.xticks(rotation=0)

In [None]:
bar_chart(emp, 'Gender')

In [None]:
bar_chart(emp, 'Department')

In [None]:
bar_chart(emp, 'BusinessTravel')

In [None]:
bar_chart(emp, 'EducationField')

In [None]:
# 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime

bar_chart(emp, 'JobRole')

In [None]:
bar_chart(emp, 'MaritalStatus')

In [None]:
bar_chart(emp, 'OverTime')

In [None]:
# plt.hist(genY_millenials.MonthlyIncome)
plt.hist(genZ.MonthlyIncome)
# plt.hist(boomers.MonthlyIncome)

# plt.hist(emp.Age)

In [None]:
# attri_yes.groupby('JobRole').plot.barh()

plt.hist(attri_yes.JobRole)
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.scatter(emp.JobLevel,emp.Age)
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.scatter(emp.MonthlyIncome,emp.Age)
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.scatter(emp.JobRole,emp.MonthlyIncome)
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.scatter(genY_millenials.Department,genY_millenials.MonthlyIncome)
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.scatter(emp_wage[emp_wage.Age<38].Age, emp_wage[emp_wage.Age<38].JobRole)

### OBSERVE RELATIONSHIPS BETWEEN MONTHLY INCOME AND OTHER VARIABLES

In [None]:
# plot MontlyIncome of all numerical columns

def plot_multi_charts(data, x_column_list, y, title, y_label, plot_type, figsize):
    fig = plt.figure(figsize=figsize)
    fig.subplots_adjust(hspace=0.75, wspace=0.4, top=0.96)
    fig.suptitle(title)

    for i, col in enumerate(list(data[x_column_list]),1):
        
        ax = fig.add_subplot(len(x_column_list), 3, i)
        
        if plot_type == 'scatter':
            plt.scatter(x=data[col], y=data[y])
            plt.xlabel('{}'.format(col), size=15,labelpad=12.5)
            plt.ylabel(y_label, size=15, labelpad=12.5)
        elif plot_type == 'bar':
            data.groupby(col).agg({y:'mean'}).sort_values(by=y).plot.bar(ax=ax)
    
#     figname = title + '.png'
#     fig.savefig(figname,transparent=False, bbox_inches='tight', dpi=300)
   
    plt.xticks(rotation=45)
    plt.show()
    

In [None]:
# Observe MonthlyIncome of all numerical features in Attrition Yes subset

plot_multi_charts(data=attri_yes, x_column_list=num_cols,y='MonthlyIncome', 
                  title='MonthlyIncome vs numerical features',y_label='MonthlyIncome',plot_type='scatter',figsize=(18,70))    

In [None]:
# Observe sale price of all category features in Attrition Yes subset

plot_multi_charts(data=attri_yes, x_column_list=cat_cols,y='MonthlyIncome', 
                  title='MonthlyIncome vs numerical features',y_label='MonthlyIncome',plot_type='scatter',figsize=(18,25))    

In [None]:
# Observe average sale price of all category features

plot_multi_charts(data=attri_yes, x_column_list=cat_cols, y='MonthlyIncome',
                  title='MEAN MonthlyIncome vs nominal features', y_label='Ave Monthly Income',plot_type='bar',figsize=(15,35))    

In [None]:
sns.pairplot(data=emp,hue='Attrition')

In [None]:
emp[(emp.OverTime=='Yes') & (emp.Attrition==1)].shape[0] / emp[(emp.Attrition==1)].shape[0]

In [None]:
emp['Attrition'].value_counts(normalize=True).plot(kind='bar')

xlocs, xlabs = plt.xticks()
xlocs=[i+1 for i in range(0,2)]
for i, v in enumerate(emp['Attrition'].value_counts(normalize=True)):
    plt.text(xlocs[i] - 1.1, v + 0.01, str(round(v,2)))

plt.show()

## FEATURES ENGINEERING<a id="feature_engineering"></a>

In [None]:
cat_cols

In [None]:
gender_map={'Male':1,'Female':0}
emp['GenderMale'] = emp['Gender'].map(gender_map)

overtime_map={'Yes':1,'No':0}
emp['OverTime'] = emp['OverTime'].map(overtime_map)

emp.head()

In [None]:
# get all numerical columns
numerical_dtypes = ['int16','int32', 'int64','float16','float32','float64']
        
num_cols = [i for i in emp.columns if emp[i].dtype in numerical_dtypes]        
        
print(len(num_cols))
print(num_cols)

In [None]:
# get all category columns
emp = emp.drop(columns=['Gender'])
cat_cols = emp.columns.difference(num_cols)

print(len(cat_cols))
print(cat_cols)

In [None]:
# HOT ENCODE nominal cols

emp = pd.get_dummies(data = emp, columns = cat_cols)
emp.head()

In [None]:
figure = plt.figure(figsize=(6,10))
sns.heatmap(emp.corr()[['Attrition']].sort_values('Attrition',ascending=False),annot=True, cmap='coolwarm', center=0);


In [None]:
abs(emp.corr()[['Attrition']]).sort_values('Attrition',ascending=False).head(15)

In [None]:

sns.distplot(emp.MonthlyIncome)

In [None]:
# Skew and kurt
print("Skewness: %f" % emp['MonthlyIncome'].skew())
print("Kurtosis: %f" % emp['MonthlyIncome'].kurt())

In [None]:
emp.MonthlyIncome.describe()

In [None]:
emp_income_fitler = emp[(emp.MonthlyIncome < 15000)]

In [None]:
emp['MonthlyIncome2'] = np.log1p(emp['MonthlyIncome'])

In [None]:
emp_income_fitler['MonthlyIncome3'] = np.log1p(emp_income_fitler['MonthlyIncome'])
emp_income_fitler['MonthlyIncome3']

In [None]:

sns.distplot(emp.MonthlyIncome2)

In [None]:
# Skew and kurt
print("Skewness: %f" % emp['MonthlyIncome2'].skew())
print("Kurtosis: %f" % emp['MonthlyIncome2'].kurt())

In [None]:
emp_wage = pd.read_csv(data_file)
emp_wage.groupby('JobRole').agg({'HourlyRate':'mean'})

# MODELING<a id="modeling"></a>

In [None]:
JobRole_encoded_cols = [c for c in emp if c.startswith('JobRole_')]
Department_encoded_cols = [c for c in emp if c.startswith('Department_')]
EducationField_encoded_cols = [c for c in emp if c.startswith('EducationField_')]
BusinessTravel_encoded_cols = [c for c in emp if c.startswith('BusinessTravel_')]
MaritalStatus_encoded_cols = [c for c in emp if c.startswith('MaritalStatus_')]
JobRole_encoded_cols

In [None]:
cols6 = ['MonthlyIncome','JobSatisfaction','EnvironmentSatisfaction','WorkLifeBalance', 'JobInvolvement',
           'TrainingTimesLastYear','NumCompaniesWorked','JobLevel', 'StockOptionLevel',
           'DistanceFromHome', 'YearsWithCurrManager', 'YearsAtCompany', 'YearsInCurrentRole', 
           'TotalWorkingYears','Age','OverTime','GenderMale']

# feature_cols = cols1 + cols2

filter_cols = ['HourlyRate','DailyRate','MonthlyRate']


target_col = 'Attrition'

# feature_cols = [c for c in emp.columns if c != target_col]

# feature_cols = [c for c in emp.columns if (c != target_col) & (c not in filter_cols)]

# feature_cols = cols3 + cols4

# feature_cols = cols5

feature_cols = cols6 + JobRole_encoded_cols + Department_encoded_cols + EducationField_encoded_cols + BusinessTravel_encoded_cols + MaritalStatus_encoded_cols



X = emp_income_fitler[feature_cols]
 
y = emp_income_fitler['Attrition']

# X = emp[feature_cols]

# y = emp['Attrition']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)



In [None]:
X_train.shape


In [None]:
print(feature_cols)

### SMOTE RESAMPLE

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter

#Oversampling the data
smote = SMOTE(random_state = 101)
X_sm, y_sm = smote.fit_resample(X, y)
Counter(y_sm)

X_train_sm, X_test_sm, y_train_sm, y_test_sm = train_test_split(X_sm, y_sm,test_size=0.2,random_state=42)


In [None]:
def plot_ROC_curve(model,X_test,y_test):

    # Generate the prediction values for each of the test observations using predict_proba() function rather than just predict
    preds = model.predict_proba(X_test)[:,1]

    # Store the false positive rate(fpr), true positive rate (tpr) in vectors for use in the graph
    fpr, tpr, _ = roc_curve(y_test, preds)

    # Store the Area Under the Curve (AUC) so we can annotate our graph with theis metric
    roc_auc = auc(fpr, tpr)

    # Plot the ROC Curve
    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', lw = lw, label = 'ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color = 'navy', lw = lw, linestyle = '--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc = "lower right")
    plt.show()


In [None]:
def display_scores(model, X_test, y_test, y_pred):
    
    # predict probabilities
    pred_probs = model.predict_proba(X_test)
    # keep probabilities for the positive outcome only
    pred_probs = pred_probs[:, 1]
    
    print('Accuracy is: ',round(accuracy_score(y_test, y_pred),2))
    print('F1 score is: ',round(f1_score(y_test, y_pred),2))
    print('Ave PR score: ',round(average_precision_score(y_test, pred_probs),2))

    cm = confusion_matrix(y_test,y_pred)
    
    sns.heatmap(cm/np.sum(cm),annot=True,fmt='.2%', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    
#     print(classification_report(y_test,y_pred,target_names=('Stay','Leave')))
    print(classification_report(y_test,y_pred))
    
    plot_precision_recall_curve(model, X_test, y_test)
    
    plot_ROC_curve(model,X_test,y_test)

## CROSS VALIDATON TEST MODELS

In [333]:
kf = KFold(n_splits=5, random_state=42, shuffle=True)

def cross_val_metrics(model, X_train, X_test, y_train, y_test) :
    scores = ['accuracy', 'precision', 'recall', 'f1', 'average_precision','roc_auc']
    print('\n Model:', model)
    for sc in scores:
        scores = cross_val_score(model, X_train, y_train, cv = 10, scoring = sc)
        print('[%s] : %0.5f (+/- %0.5f)'%(sc, scores.mean(), scores.std()))

    model.fit( X_train, y_train)
    y_pred = model.predict(X_test)
    print('\n', model, ' - Test score report')
    print('\n', classification_report(y_test, y_pred))


In [None]:
RF_clf = RandomForestClassifier()
test = cross_val_metrics(RF_clf,X_train,X_test,y_train,y_test) 
test = pd.DataFrame(test)
test.mean()

In [336]:
def cross_val_metrics2(X_train, X_test, y_train, y_test):
    
    models = [
              ('LogReg', LogisticRegression()), 
              ('RF', RandomForestClassifier()),
              ('GB', GradientBoostingClassifier()),
              ('XGB', xgb.XGBClassifier()),
              ('KNN', KNeighborsClassifier()),
              ('NB', GaussianNB())
             ] 

    scoring = ['accuracy', 'precision', 'recall', 'f1', 'average_precision','roc_auc']
    train_score_dfs = []
    test_score_dfs = []
    test_score_dict = {}
    target_names = ['No Churn', 'Churn']

    for name, model in models:
        
        # GET TRAIN SCORES
        kfold = KFold(n_splits=5, shuffle=True, random_state=42)
        cv_results = cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring)
        df1 = pd.DataFrame(cv_results)
        df1['model'] = name
        train_score_dfs.append(df1)
    
        # GET TEST SCORES
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        print(classification_report(y_test, y_pred, target_names=target_names))
        
        # predict probabilities
        pred_probs = model.predict_proba(X_test)
        # keep probabilities for the positive class only
        pred_probs = pred_probs[:, 1]
        
        test_score_dict = {'Accuracy': round(accuracy_score(y_test, y_pred),2),
                           'Precision': round(precision_score(y_test, y_pred),2),
                           'Recall': round(recall_score(y_test, y_pred),2),
                           'F1': round(f1_score(y_test, y_pred),2),
                           'Average PC': round(average_precision_score(y_test, pred_probs),2),
                           'ROC_AUC': round(roc_auc_score(y_test, y_pred),2)
                          }
        
        df2 = pd.DataFrame.from_dict(test_score_dict, orient='index').transpose()
        df2['Model'] = name
        test_score_dfs.append(df2)
    
    # combine all score sets into final df
    final_train_scores = pd.concat(train_score_dfs, ignore_index=True)
    final_test_scores = pd.concat(test_score_dfs, ignore_index=True)
    
    return final_train_scores, final_test_scores

In [331]:
train_scores, test_scores = cross_val_metrics2(X_train,X_test,y_train,y_test) 


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

LogReg
              precision    recall  f1-score   support

    No Churn       0.84      1.00      0.91       221
       Churn       0.83      0.11      0.19        47

    accuracy                           0.84       268
   macro avg       0.84      0.55      0.55       268
weighted avg       0.84      0.84      0.78       268

RF
              precision    recall  f1-score   support

    No Churn       0.85      0.98      0.91       221
       Churn       0.67      0.21      0.32        47

    accuracy                           0.84       268
   macro avg       0.76      0.60      0.62       268
weighted avg       0.82      0.84      0.81       268

KNN
              precision    recall  f1-score   support

    No Churn       0.83      0.94      0.88       221
       Churn       0.19      0.06      0.10        47

    accuracy                           0.79       268
   macro avg       0.51      0.50      0.49       268
weighted avg       0.71      0.79      0.74       268

GNB
 

In [324]:
train_scores


Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_average_precision,test_roc_auc,model
0,0.061833,0.017492,0.864486,0.9,0.243243,0.382979,0.629421,0.830203,LogReg
1,0.071255,0.014561,0.85514,0.5,0.064516,0.114286,0.395462,0.776485,LogReg
2,0.063851,0.012201,0.808411,0.444444,0.1,0.163265,0.351339,0.730603,LogReg
3,0.06165,0.011701,0.799065,0.25,0.02439,0.044444,0.423967,0.748484,LogReg
4,0.085737,0.014209,0.826291,0.4,0.055556,0.097561,0.442221,0.720339,LogReg
5,0.308649,0.036426,0.864486,0.9,0.243243,0.382979,0.610517,0.821118,RF
6,0.231881,0.044439,0.883178,0.75,0.290323,0.418605,0.60873,0.882426,RF
7,0.360777,0.042548,0.836449,0.692308,0.225,0.339623,0.570337,0.825647,RF
8,0.297048,0.04543,0.841121,0.818182,0.219512,0.346154,0.63031,0.829973,RF
9,0.265891,0.046234,0.835681,0.545455,0.166667,0.255319,0.469002,0.740035,RF


In [332]:
train_scores.groupby('model').mean().sort_values()

Unnamed: 0_level_0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_average_precision,test_roc_auc
model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
GNB,0.005514,0.012357,0.73056,0.355884,0.677755,0.463818,0.422402,0.767413
KNN,0.00572,0.025136,0.813848,0.389231,0.122617,0.185658,0.237149,0.606056
LogReg,0.064701,0.017888,0.830679,0.498889,0.097541,0.160507,0.448482,0.761223
RF,0.330024,0.050215,0.854056,0.764231,0.226443,0.347219,0.586326,0.828305
XGB,0.199756,0.016385,0.862481,0.67498,0.384606,0.488539,0.586603,0.810568


In [326]:
test_scores

Unnamed: 0,Accuracy,Precision,Recall,F1,Average PC,ROC_AUC,Model
0,0.84,0.83,0.11,0.19,0.41,0.55,LogReg
1,0.84,0.58,0.23,0.33,0.43,0.6,RF
2,0.79,0.19,0.06,0.1,0.18,0.5,KNN
3,0.71,0.32,0.57,0.41,0.42,0.66,GNB
4,0.83,0.52,0.34,0.41,0.44,0.64,XGB


In [317]:
final = cross_val_metrics2(Xs_train,Xs_test,ys_train,ys_test) 
final.sort_values(['F1','Average PC'],ascending=False)

Unnamed: 0,Accuracy,Precision,Recall,F1,Average PC,ROC_AUC,Model
0,0.89,0.65,0.51,0.57,0.63,0.73,LogReg
4,0.88,0.62,0.46,0.53,0.63,0.71,XGB
1,0.89,0.8,0.31,0.44,0.62,0.65,RF
3,0.3,0.16,0.92,0.28,0.37,0.56,GNB
2,0.85,0.5,0.15,0.24,0.27,0.56,KNN


In [312]:
final = cross_val_metrics2(X_train_sm,X_test_sm,y_train_sm,y_test_sm)   
final

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0,Accuracy,Precision,Recall,F1,Average PC,ROC_AUC,Model
0,0.81,0.85,0.77,0.81,0.87,0.81,LogReg
1,0.92,0.95,0.88,0.91,0.97,0.92,RF
2,0.69,0.68,0.72,0.7,0.71,0.69,KNN
3,0.8,0.76,0.87,0.81,0.93,0.8,GNB
4,0.93,0.95,0.9,0.92,0.98,0.93,XGB


In [338]:
final1, final2 = cross_val_metrics2(Xs_train_sm, Xs_test_sm, ys_train_sm, ys_test_sm) 
final2.sort_values(['ROC_AUC','Recall'],ascending=False)

LogReg
              precision    recall  f1-score   support

    No Churn       0.88      0.97      0.92       219
       Churn       0.97      0.87      0.92       223

    accuracy                           0.92       442
   macro avg       0.92      0.92      0.92       442
weighted avg       0.92      0.92      0.92       442

RF
              precision    recall  f1-score   support

    No Churn       0.89      0.95      0.92       219
       Churn       0.95      0.88      0.91       223

    accuracy                           0.91       442
   macro avg       0.92      0.91      0.91       442
weighted avg       0.92      0.91      0.91       442

GB
              precision    recall  f1-score   support

    No Churn       0.88      0.96      0.92       219
       Churn       0.96      0.87      0.91       223

    accuracy                           0.91       442
   macro avg       0.92      0.91      0.91       442
weighted avg       0.92      0.91      0.91       442

XGB
  

Unnamed: 0,Accuracy,Precision,Recall,F1,Average PC,ROC_AUC,Model
3,0.93,0.95,0.9,0.92,0.98,0.93,XGB
0,0.92,0.97,0.87,0.92,0.97,0.92,LogReg
1,0.91,0.95,0.88,0.91,0.97,0.91,RF
2,0.91,0.96,0.87,0.91,0.97,0.91,GB
4,0.9,0.92,0.88,0.9,0.95,0.9,KNN
5,0.68,0.62,0.94,0.75,0.92,0.68,NB


In [288]:
final.groupby('model').mean()

Unnamed: 0_level_0,1
model,Unnamed: 1_level_1
GNB,0.515


### TEST WITH UNBALANCED DATA BEFORE NORMALISATION

In [334]:
LG_clf = LogisticRegression()
cross_val_metrics(LG_clf,X_train,X_test,y_train,y_test)   

RF_clf = RandomForestClassifier()
cross_val_metrics(RF_clf,X_train,X_test,y_train,y_test) 

GB_clf = GradientBoostingClassifier()
cross_val_metrics(GB_clf,X_train,X_test,y_train,y_test)  

NB_clf = GaussianNB()
cross_val_metrics(NB_clf,X_train,X_test,y_train,y_test) 
 


 Model: LogisticRegression()


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

[accuracy] : 0.82508 (+/- 0.01661)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

[precision] : 0.48095 (+/- 0.40641)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

[recall] : 0.05380 (+/- 0.04776)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

[f1] : 0.09384 (+/- 0.08250)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

[average_precision] : 0.42489 (+/- 0.07776)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

[roc_auc] : 0.74366 (+/- 0.04834)

 LogisticRegression()  - Test score report

               precision    recall  f1-score   support

           0       0.84      1.00      0.91       221
           1       0.83      0.11      0.19        47

    accuracy                           0.84       268
   macro avg       0.84      0.55      0.55       268
weighted avg       0.84      0.84      0.78       268


 Model: RandomForestClassifier()
[accuracy] : 0.85407 (+/- 0.01625)
[precision] : 0.80000 (+/- 0.20710)
[recall] : 0.20614 (+/- 0.10004)
[f1] : 0.31745 (+/- 0.08970)
[average_precision] : 0.56973 (+/- 0.06353)
[roc_auc] : 0.80588 (+/- 0.04263)

 RandomForestClassifier()  - Test score report

               precision    recall  f1-score   support

           0       0.85      0.97      0.91       221
           1       0.59      0.21      0.31        47

    accuracy                           0.84       268
   macro avg       0.72      0.59      0.61       268
weighted avg       0.81   

### TEST WITH UNBALANCED DATA AFTER NORMALISATION

In [225]:
ss = StandardScaler()
Xs = ss.fit_transform(X)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs,y,test_size=0.2,random_state=42)


In [None]:
XGB_clf = xgb.XGBClassifier(scale_pos_weight=5.2)
cross_val_metrics(XGB_clf,Xs_train,Xs_test,ys_train,ys_test) 

LG_clf = LogisticRegression()
cross_val_metrics(LG_clf,Xs_train,Xs_test,ys_train,ys_test)   

RF_clf = RandomForestClassifier(class_weight='balanced')
cross_val_metrics(RF_clf,Xs_train,Xs_test,ys_train,ys_test)   

GB_clf = GradientBoostingClassifier()
cross_val_metrics(GB_clf,Xs_train,Xs_test,ys_train,ys_test)   

NB_clf = GaussianNB()
cross_val_metrics(NB_clf,Xs_train,Xs_test,ys_train,ys_test)   


KN_clf = KNeighborsClassifier()
cross_val_metrics(KN_clf,Xs_train,Xs_test,ys_train,ys_test) 


### TEST WITH SMOTE SAMPLE

In [None]:
# TEST WITH SMOTE SAMPLE

XGB_clf = xgb.XGBClassifier()
cross_val_metrics(XGB_clf,X_train_sm,X_test_sm,y_train_sm,y_test_sm)   

LG_clf = LogisticRegression()
cross_val_metrics(LG_clf,X_train_sm,X_test_sm,y_train_sm,y_test_sm)   

RF_clf = RandomForestClassifier()
cross_val_metrics(RF_clf,X_train_sm,X_test_sm,y_train_sm,y_test_sm) 

GB_clf = GradientBoostingClassifier()
cross_val_metrics(GB_clf,X_train_sm,X_test_sm,y_train_sm,y_test_sm)  

NB_clf = GaussianNB()
cross_val_metrics(NB_clf,X_train_sm,X_test_sm,y_train_sm,y_test_sm) 

KN_clf = KNeighborsClassifier()
cross_val_metrics(KN_clf,X_train_sm,X_test_sm,y_train_sm,y_test_sm) 

### TEST WITH SMOTE SAMPLE - NORMALISED

In [None]:
ss = StandardScaler()
Xs_sm = ss.fit_transform(X_sm)
Xs_train_sm, Xs_test_sm, ys_train_sm, ys_test_sm = train_test_split(Xs_sm,y_sm,test_size=0.2,random_state=42)


In [None]:
# TEST WITH SMOTE SAMPLE - NORMALISED

XGB_clf = xgb.XGBClassifier()
cross_val_metrics(XGB_clf,Xs_train_sm, Xs_test_sm, ys_train_sm, ys_test_sm) 

LG_clf = LogisticRegression()
cross_val_metrics(LG_clf,Xs_train_sm, Xs_test_sm, ys_train_sm, ys_test_sm)   

RF_clf = RandomForestClassifier()
cross_val_metrics(RF_clf,Xs_train_sm, Xs_test_sm, ys_train_sm, ys_test_sm)   

GB_clf = GradientBoostingClassifier()
cross_val_metrics(GB_clf,Xs_train_sm, Xs_test_sm, ys_train_sm, ys_test_sm)   

NB_clf = GaussianNB()
cross_val_metrics(NB_clf,Xs_train_sm, Xs_test_sm, ys_train_sm, ys_test_sm)  

KN_clf = KNeighborsClassifier()
cross_val_metrics(KN_clf,Xs_train_sm, Xs_test_sm, ys_train_sm, ys_test_sm)  

## HYPERPARAMETERS TUNING

In [None]:
xgb_cfl = xgb.XGBClassifier(n_jobs = -1)


# A parameter grid for XGBoost
param_grid = {
        'n_estimators' : [100, 200],
        'learning_rate' : [0.01, 0.02, 0.05, 0.1, 0.25],
        'min_child_weight': [1, 5, 7, 10],
        'gamma': [0.1, 0.5, 1, 1.5, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5, 10, 12]
        }

folds = 5
param_comb = 800

grid_search = GridSearchCV(estimator = xgb_cfl, param_grid = param_grid,cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(Xs_train_sm,ys_train_sm)
grid_search.best_params_

In [None]:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000],
    'class_weight':['balanced']
}

#  'max_features': [2, 3],


# Create a based model
RF = RandomForestClassifier()

# cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = RF, param_grid = param_grid,cv = 5, n_jobs = -1, verbose = 2, scoring='recall')
grid_search.fit(Xs_train_sm,ys_train_sm)
grid_search.best_params_

In [None]:
# Create the parameter grid based on the results of random search 
param_grid = {
    "n_estimators":[5,50,100,250,500,600],
    "max_depth":[1,3,5,7,9],
    "learning_rate":[0.01,0.1,1,10,100]
}

# Create a based model
gb = GradientBoostingClassifier()

# cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = gb, param_grid = param_grid,cv = 5, n_jobs = -1, verbose = 2)
grid_search.fit(Xs_train_sm,ys_train_sm)
grid_search.best_params_

In [None]:
# Create the parameter grid based on the results of random search 
# param_grid = {'penalty' : ['l1', 'l2'],
#                 'C' : np.logspace(-4, 4, 20),
#                 'solver' : ['liblinear']}

from sklearn.model_selection import RepeatedStratifiedKFold

param_grid = {'penalty' : ['l1', 'l2'],
                'C' : [100, 10, 1.0, 0.1, 0.01],
                'solver' : ['newton-cg', 'lbfgs', 'liblinear']}


# Create a based model
lr = LogisticRegression()

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = lr, param_grid = param_grid,cv = cv, n_jobs = -1, verbose = 2)
grid_result = grid_search.fit(X_train,y_train)
grid_search.best_params_

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


### TEST MODELS WITH TUNED HYPERPARAMETERS USING RESAMPLED DATA

In [None]:
def display_feature_importance(model, X):
    importances =model.feature_importances_

    imp_dict = dict(zip(X.columns, importances))
    score_df = pd.DataFrame(imp_dict.items(), columns=['feature', 'score'])
    score_df = score_df.sort_values('score',ascending=False)
    print(score_df.head(10))

    # plot the scores
    fig = plt.figure(figsize=(4,9))
    score_df = score_df.sort_values('score',ascending=True)
    plt.barh(score_df.feature, score_df.score)
    plt.xticks(rotation=90)
    plt.show()
    
     # figname = 'Feature importance.png'
    # fig.savefig(figname,transparent=False, bbox_inches='tight', dpi=300)
    
    return score_df
   

In [None]:
# xgb 

# xgb_clf = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#                            colsample_bytree=0.8, gamma=1.5, learning_rate=0.05,
#                            max_delta_step=0, max_depth=3, min_child_weight=7, missing=None,
#                            n_estimators=200, n_jobs=-1, nthread=None,
#                            objective='binary:logistic', random_state=0, reg_alpha=0,
#                            reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
#                            subsample=0.6)

# xgb_clf = xgb.XGBClassifier(colsample_bytree=0.8, gamma=1.5, learning_rate=0.05,
#                            max_depth=3, n_estimators=200, n_jobs=-1, 
#                            random_state=42)

# xgb_clf = xgb.XGBClassifier()

XGB_clf = xgb.XGBClassifier(colsample_bytree=0.6,
                             gamma=1,
                             learning_rate= 0.1,
                             max_depth=12,
                             min_child_weight=1,
                             n_estimators=200,
                             subsample=1.0)

XGB_clf.fit(Xs_train_sm,ys_train_sm)
ys_pred_sm = XGB_clf.predict(Xs_test_sm)
display_scores(XGB_clf, Xs_test_sm, ys_test_sm, ys_pred_sm)



In [None]:
XGB_score_df = display_feature_importance(XGB_clf, X)

In [None]:
from sklearn.inspection import permutation_importance

per_imp = permutation_importance(XGB_clf, Xs_train_sm,ys_train_sm, scoring='recall')
importances = per_imp.importances_mean
    
imp_dict = dict(zip(X.columns, importances))
score_df = pd.DataFrame(imp_dict.items(), columns=['feature', 'score'])
score_df = score_df.sort_values('score',ascending=False)
print(score_df.head(10))

# plot the scores
fig = plt.figure(figsize=(4,9))
score_df = score_df.sort_values('score',ascending=True)
plt.barh(score_df.feature, score_df.score)
plt.xticks(rotation=90)
plt.show()


In [None]:
from sklearn.feature_selection import SelectKBest, chi2, RFE

# Create an instance of SelectKBest
kbest = SelectKBest(score_func=chi2, k=4)

# Fit 
fit = kbest.fit(X, y)

# Print Score 
# Find Top 4 Features
importance = pd.DataFrame(fit.scores_, index=feature_cols, columns=['score'])

importance.sort_values('score',ascending=False).head(10)

In [None]:
# RANDOM FOREST CLASSIFIER TUNED

# RF_clf = RandomForestClassifier(bootstrap=True,
#                                  max_depth=20,
#                                  max_features='sqrt',
#                                  min_samples_leaf=2,
#                                  min_samples_split=2,
#                                  n_estimators=1200)


# RF_clf = RandomForestClassifier(bootstrap=True,
#                                  max_depth=90,
#                                  max_features=3,
#                                  min_samples_leaf=3,
#                                  min_samples_split=8,
#                                  n_estimators=1000)


# RF_clf = RandomForestClassifier(bootstrap=True,
#                                  max_depth=90,
#                                  max_features='auto',
#                                  min_samples_leaf=3,
#                                  min_samples_split=8,
#                                  n_estimators=100)
# 'bootstrap': True,
#  'class_weight': 'balanced',
#  'max_depth': 90,
#  'max_features': 'auto',
#  'min_samples_leaf': 3,
#  'min_samples_split': 8,
#  'n_estimators': 100}

RF_clf = RandomForestClassifier(class_weight='balanced')
    
RF_clf.fit(Xs_train_sm,ys_train_sm)
ys_pred_sm = RF_clf.predict(Xs_test_sm)
display_scores(RF_clf, Xs_test_sm, ys_test_sm, ys_pred_sm)

In [None]:
RF_score_df = display_feature_importance(RF_clf, X)


In [None]:
#GRADIENT BOOSTING


# {'learning_rate': 0.1, 'max_depth': 1, 'n_estimators': 250}
#  0.1, 'max_depth': 1, 'n_estimators': 500
# 'learning_rate': 1, 'max_depth': 5, 'n_estimators': 500
            
GB_clf = GradientBoostingClassifier(random_state=42, 
                                    learning_rate=1, 
                                    max_depth=5, 
                                    n_estimators=500)


GB_clf.fit(Xs_train_sm,ys_train_sm)
ys_pred_sm = GB_clf.predict(Xs_test_sm)
display_scores(GB_clf, Xs_test_sm, ys_test_sm, ys_pred_sm)
GB_score_df = display_feature_importance(GB_clf, X)

In [None]:

log_model = LogisticRegression(solver='newton-cg',C=1.0,penalty='l2')

log_model.fit(Xs_train_sm,ys_train_sm)
ys_pred_sm = log_model.predict(Xs_test_sm)
display_scores(log_model, Xs_test_sm, ys_test_sm, ys_pred_sm)


In [None]:
#libraries
import matplotlib
import squarify # pip install squarify (algorithm for treemap)

size_list = list(RF_score_df.score.head(10))
label_list = list(RF_score_df.feature.head(10))

# create a color palette, mapped to these values
cmap = matplotlib.cm.Reds
mini=min(size_list)
maxi=max(size_list)
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in size_list]

# colors=['Red','Yellow','Green', 'Purple']

squarify.plot(sizes=size_list, label=label_list, color=colors, alpha=.4 )
plt.axis('off')
plt.show()


In [None]:
# y_pred is an array of predictions
def bestThresshold(y_true,y_pred):
    best_thresh = None
    best_score = 0
    for thresh in np.arange(0.1, 0.501, 0.01):
        
        score = f1_score(y_true, np.array(y_pred)>thresh)
#         print(thresh)
#         print(np.array(y_pred)>thresh)
#         print(score)
        if score > best_score:
            best_thresh = thresh
            best_score = score
    return best_score , best_thresh

best_score , best_thresh = bestThresshold(ys_test_sm, ys_pred_sm)
print(best_score , best_thresh)

In [None]:
XGB_clf = xgb.XGBClassifier(scale_pos_weight=5.2)
cross_val_metrics(XGB_clf,Xs_train,Xs_test,ys_train,ys_test) 

best_score , best_thresh = bestThresshold(ys_test, y_pred)
print(best_score , best_thresh)

In [None]:
precision_recall_curve