# HRWorks - Case Study

**HRWorks supports several information technology (IT) companies in India with their talent acquisition. One of the challenges they face is that about 30% of the candidates who accept a job offer do not join the company. This leads to a significant loss of revenue and time, as the companies have to restart the recruitment process to meet their workforce demand.**

**HRWorks wants to find out whether a model can be built to predict the likelihood of a candidate joining the company.**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from imblearn.over_sampling import SMOTE


pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [None]:
df = pd.read_csv("../input/hrworks-dataset/hr_data.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

## 1. Identify and define the problem statement clearly, and explain why it is necessary for an organisation to solve the problem.

In [None]:
round(df.Status.value_counts()/df.shape[0]*100,2)

In [None]:
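plt.figure(figsize = (8,5))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x = 'Status', data = df, color = 'orange')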
plt.title('Number of Candidates who Joined / Not Joined')
plt.show()

### The Problem Statement for this Case Study is:
* According to the Case Study description, 30% of the candidates who accept the job offer do not join the company.
* In the Target column of our data, about 19% of the candidates who accepted the job offer did not join the company.
* This leads to a large loss of time and money, as HRWorks gets a commission from partner companies only after the successful onboarding of candidates.
* This may be one reason why many companies require candidates to sign a legal bond before joining, so that legal action can be taken if a candidate leaves before the bond period ends.
* A solution is needed to minimize these losses, as they may grow with the number of candidates.
* This can be done by classifying the candidates by their likelihood of joining, using Machine Learning Classification Algorithms.
* This will help HRWorks to prioritize candidates who are more likely to join the company.

## 2. Define any hypothesis if possible.

* Based on the data given, a hypothesis test can check whether the Gender and Location variables are independent of each other. Thus,
    H0: Gender and Location Features are independent
    H1: There is a dependency between Gender and Location Features
    
* We can apply the Chi-Squared Test of independence to test the Null Hypothesis.
* If p-value > 0.05, we fail to reject the Null Hypothesis: the data give no evidence of a dependency between Gender and Location Features.
* Else, we reject the Null Hypothesis and conclude that there is a dependency between Gender and Location Features.
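The test described above can be sketched as follows. This is a minimal illustration on randomly generated toy data standing in for the 'Gender' and 'Location' columns (the `toy` frame and its values are assumptions, not the HRWorks data); it uses `scipy.stats.chi2_contingency` on a contingency table built with `pd.crosstab`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data standing in for the 'Gender' and 'Location' columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=500, p=[0.8, 0.2]),
    "Location": rng.choice(["Chennai", "Bangalore", "Noida"], size=500),
})

# Contingency table of observed counts (rows: Gender, columns: Location)
observed = pd.crosstab(toy["Gender"], toy["Location"])

# Chi-squared test of independence
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
if p_value > alpha:
    decision = "fail to reject H0 (no evidence of a dependency)"
else:
    decision = "reject H0 (evidence of a dependency)"

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f} -> {decision}")
```

On the real data, the same call would take `pd.crosstab(df['Gender'], df['Location'])` as input.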

## 3. Do the EDA of the dataset and write the observations you got from the dataset.

### Checking for NaN Values

In [None]:
df.isna().sum()

In [None]:
df.dropna(inplace = True)

A total of 28 rows have NaN values, which is very few in comparison to the size of the dataset. **THUS, REMOVING THEM**

### Checking for Duplicate Rows

In [None]:
df.duplicated().sum()

**No Duplicated Rows Found**

### Statistical Summary of Numeric Features

In [None]:
df.describe().T

### Statistical Summary of Categorical Features

In [None]:
df.describe(include = 'object').T

### Primary Observations From Statistical Summary

* Duration to accept offer has a maximum value of 224 days and a median of 30 days. The maximum looks implausible, as companies rarely give about 8 months to make the decision.
* The dataset covers inexperienced to highly experienced candidates, with experience of up to 24 years.
* Candidates range from young to old, up to 60 years of age.
* One candidate got an exceptional Percent hike in CTC of 471%, which exceeded what was expected by more than 100%.
* Most of the candidates had job offers in Chennai.
* Almost none of the candidates were given a Joining bonus.
* Most of the candidates are male.
* About 81% of the candidates in this dataset joined the company.

### Univariate Analysis

#### Defining Functions for Univariate Analysis

##### Count Plot Function

In [None]:
def countp(x):
    plt.figure(figsize = (8,6))
    sns.countplot(x = x, data = df, palette='YlOrRd')

##### Count Function

In [None]:
def count(col):
    return df[col].value_counts()

##### Count Plot Function with hue as Target Variable for Bivariate Analysis

In [None]:
def countp_y(x):
    plt.figure(figsize = (8,6))
    sns.countplot(x = x, data = df, hue = 'Status', palette = 'Oranges_r')

##### Count with each class of Target Variable

In [None]:
def count_y(x):
    return df.groupby([x, 'Status'])[x].count()

### Analysis of 'Status'

In [None]:
print(count('Status'))


countp('Status')
plt.xlabel("Status")
plt.ylabel("Frequency")
plt.title("Status of Candidates")
plt.show()

* The majority of the candidates in this data joined the company, which is a good sign.
* It also means the data is imbalanced, which needs to be treated so that the model does not produce biased results.

### Analysis of 'Gender'

In [None]:
print(count('Gender'))

countp('Gender')
plt.xlabel("Gender of Candidates")
plt.show()

In [None]:
print(count_y('Gender'))

countp_y('Gender')
plt.xlabel("Gender")
plt.ylabel("Frequency")
plt.title("Gender of Candidates with Status")
plt.show()

* Most of the candidates are male, which is why most of the candidates who did not join are also male.

### Analysis of 'Location'

In [None]:
print(count('Location'))

plt.figure(figsize = (8,6))
sns.countplot(x = 'Location', data = df, palette='YlOrRd_r')
plt.xticks(rotation = 90)
plt.xlabel("Location")
plt.ylabel("Frequency")
plt.title("Location Offered")
plt.show()

In [None]:
print(count_y('Location'))

plt.figure(figsize = (8,6))
sns.countplot(x = 'Location', data = df, palette='YlOrRd_r', hue = 'Status')
plt.xticks(rotation = 90)
plt.xlabel("Location")
plt.ylabel("Frequency")
plt.title("Location Offered with Status")
plt.show()

* Chennai, Bangalore and Noida have the most candidates in comparison to other cities.
* A likely reason is that these cities are hubs for several top firms in India.
* Bangalore has fewer candidates than Noida but more candidates who did not join.
* This suggests that these candidates may be getting better options, as Bangalore houses the offices of many firms.

### Analysis of 'Joining Bonus'	

In [None]:
print(count('Joining Bonus'))

countp('Joining Bonus')
plt.xlabel("Joining Bonus")
plt.ylabel("Frequency")
plt.title("Joining Bonus to Candidates")
plt.show()

In [None]:
print(count_y('Joining Bonus'))

countp_y('Joining Bonus')
plt.xlabel("Joining Bonus")
plt.ylabel("Frequency")
plt.title("Joining Bonus given to Candidates with Status")
plt.show()

* Almost none of the candidates were offered a Joining Bonus.
* Some candidates who were offered a Joining Bonus still did not join the company.

### Analysis of 'DOJ Extended'

In [None]:
print(count('DOJ Extended'))

countp('DOJ Extended')
plt.xlabel("DOJ Extended")
plt.ylabel("Frequency")
plt.title("DOJ Extended for Candidates")
plt.show()

In [None]:
print(count_y('DOJ Extended'))

countp_y('DOJ Extended')
plt.xlabel("DOJ Extended")
plt.ylabel("Frequency")
plt.title("DOJ Extended for Candidates with Status")
plt.show()

* Almost 50% of the candidates were offered an extension of the Date of Joining.
* The other half were not offered one.
* The ratio of Joined to Not Joined is about the same in both cases, so this feature does not distinguish the two classes.

### Analysis of 'Offered band'

In [None]:
print(count('Offered band'))

countp('Offered band')
plt.xlabel("Offered Band")
plt.ylabel("Frequency")
plt.title("Offered Band to Candidates")
plt.show()

In [None]:
print(count_y('Offered band'))

countp_y('Offered band')
plt.xlabel("Offered Band")
plt.ylabel("Frequency")
plt.title("Offered Band to Candidates with Status")
plt.show()

* Most of the candidates were offered the E1 band, followed by E2, E3 and E4.
* Since most of them were offered E1, most of the candidates who did not join are also from E1.
* This suggests that most of the candidates are freshers or have a low level of experience.

### Analysis of 'LOB'

In [None]:
print(count('LOB'))

sns.countplot(x = 'LOB', data = df, palette='YlOrRd_r')
plt.xticks(rotation = 90)
plt.xlabel("Line Of Business")
plt.ylabel("Frequency")
plt.title("LOB of Candidates were Offered Job")
plt.show()

In [None]:
print(count_y('LOB'))

sns.countplot(x = 'LOB', data = df, palette='YlOrRd_r', hue = 'Status')
plt.xticks(rotation = 90)
plt.xlabel("Line Of Business")
plt.ylabel("Frequency")
plt.title("LOB of Candidates were Offered Job with Status")
plt.show()

* Most of the candidates were offered jobs in the Infrastructure Line of Business.
* Next comes ERS, followed by BFSI.
* Infra has about twice as many candidates who joined as BFSI, but almost the same number who did not join.
* This suggests that candidates are more likely to not join a BFSI firm.
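Comparisons like the one above are easier to read as rates than as raw counts. A minimal sketch, using a small hypothetical frame in place of `df` (the rows below are invented for illustration), computes the share of candidates who did not join within each category:

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be df[['LOB', 'Status']]
toy = pd.DataFrame({
    "LOB": ["INFRA"] * 6 + ["BFSI"] * 4,
    "Status": ["Joined"] * 5 + ["Not Joined"]
              + ["Joined"] * 2 + ["Not Joined"] * 2,
})

# Share of candidates per LOB whose Status is anything other than 'Joined'
not_join_rate = (
    (toy["Status"] != "Joined")
    .groupby(toy["LOB"])
    .mean()
    .sort_values(ascending=False)
)

print(not_join_rate)
```

The same pattern applies to any categorical column, such as 'Location' or 'Candidate Source', by changing the grouping column.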

### Analysis of 'Candidate Source'

In [None]:
print(count('Candidate Source'))

countp('Candidate Source')
plt.xlabel("Source")
plt.ylabel("Frequency")
plt.title("Candidate Source")
plt.show()

In [None]:
print(count_y('Candidate Source'))

countp_y('Candidate Source')
plt.xlabel("Source")
plt.ylabel("Frequency")
plt.title("Candidate Source with Status")
plt.show()

* Most of the candidates were onboarded directly by the firms.
* The number of candidates who did not join from the Direct channel is almost equal to the number from the Agency channel, even though fewer candidates were onboarded through the Agency channel.
* A possible reason is that a candidate applying through an agency gets several companies to choose from and selects the one that gives the best benefits.
* A candidate onboarded directly may have fewer or no other options at that moment, so fewer of them fail to join.

### Analysis of 'Candidate relocate actual'

In [None]:
print(count('Candidate relocate actual'))

countp('Candidate relocate actual')
plt.xlabel("Relocation")
plt.ylabel("Frequency")
plt.title("Relocation required by Candidate")
plt.show()

In [None]:
print(count_y('Candidate relocate actual'))

countp_y('Candidate relocate actual')
plt.xlabel("Relocation")
plt.ylabel("Frequency")
plt.title("Relocation required by Candidates with Status")
plt.show()

* Most of the candidates were not required to relocate.
* All of the candidates who were required to relocate joined, which suggests that they needed the job and were ready to work from anywhere.

### Analysis of Numeric Columns

In [None]:
df.describe().T

#### Functions for Univariate and Bivariate Analysis of Numeric Features

##### Function for Dist Plot

In [None]:
def dist(data, x):
    plt.figure(figsize = (8,6))
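    # sns.distplot is deprecated; histplot with a KDE overlay gives a comparable density plot
    sns.histplot(data[x], bins = 10, color = 'orange', kde = True, stat = 'density')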

##### Function for Box Plot of Classes of Target Variable 'Status'

In [None]:
def box(x):
    plt.figure(figsize = (8,6))
    sns.boxplot(x = 'Status', y = x, data = df, palette = 'autumn')

### Analysis of 'Age'

In [None]:
dist(df, 'Age')
plt.xlabel("Age")
plt.title("Distribution of Age")
plt.show()

In [None]:
box('Age')
plt.xlabel("Status")
plt.title("Age with Status")
plt.show()

* Most of the candidates were between 22 and 35 years old.
* There are a few outliers, with candidates aged up to 60 years.
* The 60-year-old candidate did not join.
* This may be an input error, as people generally retire at the age of 60.
* The Joined and Not Joined classes have a similar number of outliers (older candidates).

### Analysis of 'Pecent hike expected in CTC'

In [None]:
dist(df, 'Pecent hike expected in CTC')
plt.xlabel("Percent Hike Expected in CTC")
plt.title("Distribution of Percent Hike Expected in CTC")
plt.show()

In [None]:
box('Pecent hike expected in CTC')
plt.xlabel("Status")
plt.title("Percent Hike Expected with Status")
plt.show()

* Most of the candidates were expecting a per cent hike in CTC of about 50%.
* A few candidates were expecting a per cent hike of more than 300%.
* We cannot draw a conclusion from this, as they may have changed position or department, or been assigned a managerial post.

### Analysis of 'Percent hike offered in CTC'

In [None]:
dist(df, 'Percent hike offered in CTC')
plt.xlabel("Percent Hike Offered in CTC")
plt.title("Distribution of Percent Hike Offered in CTC")
plt.show()

In [None]:
box('Percent hike offered in CTC')
plt.xlabel("Status")
plt.title("Percent Hike Offered in CTC with Status")
plt.show()

* Candidates appear to have been offered a slightly lower per cent hike than they were expecting.
* This makes sense, as candidates often aspire to more than their calibre warrants.
* A few candidates were offered a high per cent hike in CTC, which can be related to the candidates who were expecting such hikes.
* This suggests that this set of candidates had applied for managerial or equivalent positions.

### Analysis of 'Percent difference CTC'

In [None]:
dist(df, 'Percent difference CTC')
plt.xlabel("Percent Difference CTC")
plt.title("Distribution of Percent Difference CTC")
plt.show()

In [None]:
box('Percent difference CTC')
plt.xlabel("Status")
plt.title("Percent Hike Difference with Status")
plt.show()

* Percent difference CTC leads to the same observations as the Percent hike expected and Percent hike offered features.
* Many values of this feature look incorrect, so we will remove them in the Outlier Analysis.

### Analysis of 'Rex in Yrs'

In [None]:
dist(df, 'Rex in Yrs')
plt.xlabel("Years of Experience")
plt.title("Distribution of Rex in Yrs")
plt.show()

In [None]:
box('Rex in Yrs')
plt.xlabel("Status")
plt.title("Rex of Yrs. with Status")
plt.show()

* Most of the candidates had 0 to 8 years of experience.
* There are also candidates with around 25 years of experience.
* Experienced candidates are more likely to join.
* A possible reason is that experienced people understand the work ethics of the corporate world and are less inclined to back out of an accepted offer.

### Analysis of 'Duration to accept offer'

In [None]:
dist(df, 'Duration to accept offer')
plt.xlabel("Duration to accept offer")
plt.title("Distribution of Duration to accept offer")
plt.show()

In [None]:
box('Duration to accept offer')
plt.xlabel("Status")
plt.title("Duration to accept offer with Status")
plt.show()

* Some of the candidates who did not join were given as many as 200 days to accept the offer.
* Such a candidate must have been of great importance to the company, as companies do not normally allow about 8 months to accept an offer.

### Analysis of 'Notice period'

In [None]:
dist(df, 'Notice period')
plt.xlabel("Notice Period")
plt.title("Distribution of Notice period")
plt.show()

In [None]:
box('Notice period')
plt.xlabel("Status")
plt.title("Notice Period with Status")
plt.show()

* The Notice Period varies from 0 days to 120 days.
* Candidates who did not join had a wider spread of Notice Periods, from 0 days to about 90 days.
* Candidates who joined mostly did not have a 0-day notice period (except one outlier).
* This suggests that candidates who joined are less bothered about the notice period.

### EDA Conclusions
* The data is imbalanced, so it needs to be treated adequately.

* Most of the candidates are male, which is why most of the candidates who did not join are also male.

* Chennai, Bangalore and Noida have the most candidates in comparison to other cities. Bangalore has fewer candidates than Noida but more who did not join, which suggests that these candidates may be getting better options, as Bangalore houses the offices of many firms.

* Almost none of the candidates were offered a Joining Bonus.

* Almost 50% of the candidates were offered an extension of the Date of Joining, while the other half were not.

* Most of the candidates were offered the E1 band, followed by E2, E3 and E4. Since most of them were offered E1, most of the candidates who did not join are also from E1. This suggests that most of the candidates are freshers or have a low level of experience.

* Most of the candidates were offered jobs in the Infrastructure Line of Business. Infra has about twice as many candidates who joined as BFSI, but almost the same number who did not join. This suggests that candidates are more likely to not join a BFSI firm.

* The number of candidates who did not join from the Direct channel is almost equal to the number from the Agency channel, even though fewer candidates were onboarded through the Agency channel. A possible reason is that a candidate applying through an agency gets several companies to choose from and selects the one that gives the best benefits.

* All of the candidates who were required to relocate joined, which suggests that they needed the job and were ready to work from anywhere.

* Most of the candidates were between 22 and 35 years old. The 60-year-old candidate did not join. This may be an input error, as people generally retire at the age of 60.

* Most of the candidates were expecting a per cent hike in CTC of about 50%, while a few were expecting more than 300%. We cannot draw a conclusion from this, as they may have changed position or department, or been assigned a managerial post.

* Candidates were offered a slightly lower per cent hike than they were expecting. This makes sense, as candidates often aspire to more than their calibre warrants. A few candidates were offered a high per cent hike in CTC, which can be related to the candidates who were expecting such hikes. This suggests that this set of candidates had applied for managerial or equivalent positions.

* Many values of Percent difference CTC look incorrect, so we will remove them in the Outlier Analysis.

* Most of the candidates had 0 to 8 years of experience. Experienced candidates are more likely to join, possibly because they understand the work ethics of the corporate world and are less inclined to back out of an accepted offer.

* Some of the candidates who did not join were given as many as 200 days to accept the offer, which suggests that such a candidate must have been of great importance to the company, as companies do not normally allow about 8 months to accept an offer.

* The Notice Period varies from 0 days to 120 days. Candidates who did not join had a wider spread of Notice Periods, from 0 days to about 90 days, while candidates who joined mostly did not have a 0-day notice period (except one outlier). This suggests that candidates who joined are less bothered about the notice period.

### Outlier Analysis

In [None]:
# Creating a copy of Dataframe to avoid accidental changes made to original dataframe

In [None]:
df_fin = df.copy()

#### Treating Outliers of 'Duration to accept offer'

In [None]:
print('Q1: ', df_fin['Duration to accept offer'].quantile(q = 0.25))
print('Q2: ', df_fin['Duration to accept offer'].quantile(q = 0.50))
print('Q3: ', df_fin['Duration to accept offer'].quantile(q = 0.75))

In [None]:
q3_dur = df_fin['Duration to accept offer'].quantile(q = 0.75)
q1_dur = df_fin['Duration to accept offer'].quantile(q = 0.25)

iqr = q3_dur - q1_dur

out_up = q3_dur + (1.5*iqr)
out_low = q1_dur - (1.5*iqr)

df_no_ot = df_fin[(df_fin['Duration to accept offer'] > out_low) & (df_fin['Duration to accept offer'] < out_up)]

In [None]:
df_no_ot.describe().T

#### Treating Outliers of 'Percent difference CTC'

In [None]:
print('Q1: ', df_no_ot['Percent difference CTC'].quantile(q = 0.25))
print('Q2: ', df_no_ot['Percent difference CTC'].quantile(q = 0.50))
print('Q3: ', df_no_ot['Percent difference CTC'].quantile(q = 0.75))

In [None]:
q3_pdiff = df_no_ot['Percent difference CTC'].quantile(q = 0.75)
q1_pdiff = df_no_ot['Percent difference CTC'].quantile(q = 0.25)

iqr = q3_pdiff - q1_pdiff

out_up = q3_pdiff + (1.5*iqr)
out_low = q1_pdiff - (1.5*iqr)

df_no_ot_2 = df_no_ot[(df_no_ot['Percent difference CTC'] > out_low) & (df_no_ot['Percent difference CTC'] < out_up)]

In [None]:
df_no_ot_2.describe().T
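The two outlier treatments above follow the same IQR rule, so they could be wrapped in one helper. A minimal sketch, assuming the same 1.5 x IQR fences and strict inequalities as in the cells above; the function names and the toy column are hypothetical:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(frame, column, k=1.5):
    """Keep only the rows whose value in `column` lies strictly inside the fences."""
    low, high = iqr_bounds(frame[column], k)
    mask = (frame[column] > low) & (frame[column] < high)
    return frame[mask]

# Hypothetical toy column with one extreme value
toy = pd.DataFrame({"days": [1, 2, 3, 4, 5, 6, 7, 8, 9, 200]})
low, high = iqr_bounds(toy["days"])
cleaned = drop_iqr_outliers(toy, "days")

print(low, high, len(cleaned))
```

Applied to the notebook, the two steps would read `drop_iqr_outliers(df_fin, 'Duration to accept offer')` followed by the same call on 'Percent difference CTC'.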

### Categorical Feature Encoding

* We will exclude SLNO and Candidate Ref columns from 'X' as they contribute nothing to the prediction
* We will also exclude 'Status' from 'X' as it is the target variable 'y' 

In [None]:
# .copy() gives X its own data, so later column assignments do not act on a slice of df_no_ot_2
X = df_no_ot_2[['DOJ Extended', 'Duration to accept offer',
       'Notice period', 'Offered band', 'Pecent hike expected in CTC',
       'Percent hike offered in CTC', 'Percent difference CTC',
       'Joining Bonus', 'Candidate relocate actual', 'Gender',
       'Candidate Source', 'Rex in Yrs', 'LOB', 'Location', 'Age']].copy()

##### Label Encoding 'Offered band' as it is an Ordinal Variable

In [None]:
labenc = LabelEncoder()
X['Offered band'] = labenc.fit_transform(X['Offered band'])
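`LabelEncoder` assigns codes in sorted (alphabetical) order of the labels, which matches the band order only as long as the band names sort correctly. A minimal sketch of an alternative that states the order explicitly, using an ordered pandas `Categorical`; the band list below is an assumption for illustration and should be replaced by the bands actually present in the data:

```python
import pandas as pd

# Assumed band order, lowest to highest (hypothetical; check df['Offered band'].unique())
band_order = ["E0", "E1", "E2", "E3"]

# Hypothetical toy values standing in for the 'Offered band' column
toy = pd.Series(["E1", "E3", "E0", "E2", "E1"], name="Offered band")

# Codes follow the position of each label in band_order;
# a label missing from band_order would get the code -1
codes = pd.Categorical(toy, categories=band_order, ordered=True).codes

print(list(codes))
```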

##### For the remaining categorical variables, we use the get_dummies function of Pandas

In [None]:
x = pd.get_dummies(X, drop_first = True)

##### Assigning and Encoding the target variable 

In [None]:
y = df_no_ot_2['Status'].apply(lambda x: 1 if x == 'Joined' else 0)

### Scaling Data

In [None]:
scaler = StandardScaler()

In [None]:
x_scale = scaler.fit_transform(x)

### Train Test Split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_scale, y, test_size = 0.3, random_state = 42)
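One caveat on the order of these steps: the scaler above is fitted on all rows before the split, so the test rows contribute to the scaling statistics. A minimal sketch of the alternative order (split first, fit the scaler on the training rows only), shown on random toy data with hypothetical variable names; `stratify` is added here as an assumption, to keep the class ratio equal in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features and binary target
rng = np.random.default_rng(42)
features = rng.normal(loc=50, scale=10, size=(200, 3))
target = rng.integers(0, 2, size=200)

# 1. Split first, so the test rows stay unseen
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.3, random_state=42, stratify=target
)

# 2. Learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)

# 3. Apply the same training statistics to the test rows
f_test_scaled = scaler.transform(f_test)

print(f_train_scaled.mean(axis=0).round(6), f_test_scaled.mean(axis=0).round(3))
```

With this order the reported scores may differ slightly from those obtained when scaling before the split.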

### Upsampling the Data
* Since the target variable is imbalanced, the model may produce results biased toward the majority class, so we need to resample the training data
* Here we use the oversampling technique SMOTE to upsample the minority class
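To make the idea behind SMOTE concrete: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function below is a simplified sketch of that interpolation step, not the `imblearn` implementation; its name and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, neighbour_idx = nn.kneighbors(minority)

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))            # random minority sample
        neighbour = rng.choice(neighbour_idx[base, 1:])  # one of its k neighbours
        gap = rng.random()                            # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Hypothetical toy minority class with two features
rng = np.random.default_rng(1)
minority = rng.normal(size=(20, 2))
new_points = smote_like_samples(minority, n_new=30)

print(new_points.shape)
```

Because every synthetic point is a convex combination of two real minority points, it always lies within the range already spanned by the minority class.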

In [None]:
sm = SMOTE(random_state=12)
x_train_sm, y_train_sm = sm.fit_resample(x_train, y_train)

## 4. Develop machine learning algorithms and compare different models.

##### Creating a Function for Confusion Matrix

In [None]:
def confusion_matrix(actuals, predicted):
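    # labels must be passed by keyword in current scikit-learn versions
    cm = metrics.confusion_matrix(actuals, predicted, labels = [1,0])
    plt.figure(figsize = (8,6))
    # The cells hold integer counts, so format them as integers
    sns.heatmap(cm, annot = True, fmt = 'd', xticklabels = ['Joined', 'Not Joined'], yticklabels = ['Joined', 'Not Joined'])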

    plt.ylabel("Actual Labels")
    plt.xlabel("Predicted Labels")

    plt.show()

### Logistic Regression

In [None]:
logreg = LogisticRegression()

In [None]:
logreg.fit(x_train_sm, y_train_sm)

In [None]:
y_pred = logreg.predict(x_test)

In [None]:
print(metrics.classification_report(y_test, y_pred))

##### Logistic Regression Accuracy = 64%

In [None]:
confusion_matrix(y_test, logreg.predict(x_test))

### Decision Tree Classifier
* We will go for Hyperparameter Tuning to select the best hyperparameter values 
* Here we will use GridSearchCV to find the best hyperparameter values
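As a self-contained illustration of how `GridSearchCV` works, the sketch below tunes a decision tree on synthetic data from `make_classification`; the dataset and the grid values are assumptions chosen to run quickly. The `max_features` range is derived from the number of columns and starts at 1, since an integer `max_features` must be at least 1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data in place of x_train_sm / y_train_sm
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

n_features = X_toy.shape[1]
param_grid = {
    "max_depth": [3, 5, 7],
    "max_features": list(range(1, n_features + 1)),  # 1 .. number of columns
    "criterion": ["gini", "entropy"],
}

# Every combination in the grid is scored with 3-fold cross-validated ROC-AUC
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, the search object predicts with the best estimator refitted on all the data passed to `fit`, which is why `clf_dtc.predict(...)` can be called directly.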

In [None]:
dtc = DecisionTreeClassifier()

In [None]:
# max_features must be at least 1, so the range starts at 1 rather than 0
param = [{'max_depth': list(range(10, 18)), 'max_features': list(range(1,32)), 'criterion': ['gini', 'entropy']}]

In [None]:
clf_dtc = GridSearchCV(dtc, cv = 3, param_grid = param, scoring = 'roc_auc')

In [None]:
clf_dtc.fit(x_train_sm, y_train_sm)

In [None]:
clf_dtc.best_params_

In [None]:
print(metrics.classification_report(y_test, clf_dtc.predict(x_test)))

##### Tuned Decision Tree Classifier Accuracy = 70%

In [None]:
confusion_matrix(y_test, clf_dtc.predict(x_test))

### Random Forest Classifier
* We will go for Hyperparameter Tuning to select the best hyperparameter values 
* Here we will use GridSearchCV to find the best hyperparameter values

In [None]:
rfc = RandomForestClassifier()

In [None]:
# max_features must be at least 1, so the range starts at 1 rather than 0
forest_params = [{'max_depth': list(range(10, 18)), 'max_features': list(range(1,32))}]

In [None]:
clf_rfc = GridSearchCV(rfc, cv = 3, param_grid = forest_params, scoring = 'roc_auc')

In [None]:
clf_rfc.fit(x_train_sm, y_train_sm)

In [None]:
clf_rfc.best_params_

In [None]:
print(metrics.classification_report(y_test, clf_rfc.predict(x_test)))

##### Tuned Random Forest Classifier Accuracy = 81%

In [None]:
confusion_matrix(y_test, clf_rfc.predict(x_test))

### K-Nearest Neighbor Classifier
* We will go for Hyperparameter Tuning to select the best hyperparameter values 
* Here we will use GridSearchCV to find the best hyperparameter values

In [None]:
knn = KNeighborsClassifier()

In [None]:
knn_params = [{'n_neighbors': range(5,10),'metric': ['canberra', 'euclidean', 'minkowski']}]

In [None]:
clf_knn = GridSearchCV(knn, cv = 3, param_grid = knn_params, scoring = 'roc_auc')

In [None]:
clf_knn.fit(x_train_sm, y_train_sm)

In [None]:
clf_knn.best_params_

In [None]:
print(metrics.classification_report(y_test, clf_knn.predict(x_test)))

##### Tuned KNN Classifier Accuracy = 74%

In [None]:
confusion_matrix(y_test, clf_knn.predict(x_test))

### Performance Comparison

### 1. Accuracy Scores

In [None]:
df_acc = pd.DataFrame({'Estimators': ['Logistic Regressor', 'Decision Tree Classifier', 'Random Forest Classifier', 
                                     'K-Nearest Neighbor Classifier'],
                     'Accuracy':['{0:.2f}'.format(metrics.accuracy_score(y_test, logreg.predict(x_test))), 
                                 '{0:.2f}'.format(metrics.accuracy_score(y_test, clf_dtc.predict(x_test))),
                                '{0:.2f}'.format(metrics.accuracy_score(y_test, clf_rfc.predict(x_test))),
                                '{0:.2f}'.format(metrics.accuracy_score(y_test, clf_knn.predict(x_test)))]})

In [None]:
df_acc

### 2. ROC - AUC Score

In [None]:
fig, ax_arr = plt.subplots(nrows = 2, ncols = 2, figsize = (15,15))

#LOGREG-----------------------------------------
probs = logreg.predict_proba(x_test)
preds = probs[:,1]
fprlog, tprlog, thresholdlog = metrics.roc_curve(y_test, preds)
roc_auclog = metrics.auc(fprlog, tprlog)

ax_arr[0, 0].plot(fprlog, tprlog, 'b', label = 'AUC = %0.2f' % roc_auclog)
ax_arr[0, 0].plot([0, 1], [0, 1],'r--')
ax_arr[0, 0].set_title('Receiver Operating Characteristic Logistic Regressor')
ax_arr[0, 0].set_ylabel('True Positive Rate',fontsize=13)
ax_arr[0, 0].set_xlabel('False Positive Rate',fontsize=13)
ax_arr[0, 0].legend(loc = 'lower right', prop={'size': 16})

#DECISION TREE ---------------------
probs = clf_dtc.predict_proba(x_test)
preds = probs[:,1]
fprdtree, tprdtree, thresholddtree = metrics.roc_curve(y_test, preds)
roc_aucdtree = metrics.auc(fprdtree, tprdtree)

ax_arr[0, 1].plot(fprdtree, tprdtree, 'b', label = 'AUC = %0.2f' % roc_aucdtree)
ax_arr[0, 1].plot([0, 1], [0, 1],'r--')
ax_arr[0, 1].set_title('Receiver Operating Characteristic Decision Tree Classifier')
ax_arr[0, 1].set_ylabel('True Positive Rate',fontsize=13)
ax_arr[0, 1].set_xlabel('False Positive Rate',fontsize=13)
ax_arr[0, 1].legend(loc = 'lower right', prop={'size': 16})

#RANDOM FOREST ---------------------
probs = clf_rfc.predict_proba(x_test)
preds = probs[:,1]
fprrfc, tprrfc, thresholdrfc = metrics.roc_curve(y_test, preds)
roc_aucrfc = metrics.auc(fprrfc, tprrfc)

ax_arr[1, 0].plot(fprrfc, tprrfc, 'b', label = 'AUC = %0.2f' % roc_aucrfc)
ax_arr[1, 0].plot([0, 1], [0, 1],'r--')
ax_arr[1, 0].set_title('Receiver Operating Characteristic Random Forest Classifier ')
ax_arr[1, 0].set_ylabel('True Positive Rate',fontsize=13)
ax_arr[1, 0].set_xlabel('False Positive Rate',fontsize=13)
ax_arr[1, 0].legend(loc = 'lower right', prop={'size': 16})

#KNN------------------------------------
probs = clf_knn.predict_proba(x_test)
preds = probs[:,1]
fprknn, tprknn, thresholdknn = metrics.roc_curve(y_test, preds)
roc_aucknn = metrics.auc(fprknn, tprknn)

ax_arr[1, 1].plot(fprknn, tprknn, 'b', label = 'AUC = %0.2f' % roc_aucknn)
ax_arr[1, 1].plot([0, 1], [0, 1],'r--')
ax_arr[1, 1].set_title('Receiver Operating Characteristic KNN ')
ax_arr[1, 1].set_ylabel('True Positive Rate',fontsize=13)
ax_arr[1, 1].set_xlabel('False Positive Rate',fontsize=13)
ax_arr[1, 1].legend(loc = 'lower right', prop={'size': 16})





plt.show()

### Based on the Accuracy Score and ROC - AUC values, we can conclude that, out of the four estimators, Random Forest performed best on the given dataset.

### Predictive Analysis Conclusions

* For the given dataset, the Random Forest Classifier with the tuned hyperparameters performed better than the others, while Logistic Regression performed worst.
* This result is based on the Accuracy Score and the ROC - AUC Score.
* Since the target variable is imbalanced, we can also try resampling by downsampling the majority class and build models on that data.
* We can look at more performance metrics to evaluate the models and select the best one.
* During Outlier Analysis, we removed outliers of 'Percent difference CTC' and 'Duration to accept offer'.
* We can also look for outliers in 'Percent hike offered in CTC' and 'Pecent hike expected in CTC', with caution, as we may lose data. We can keep alpha = 0.01 for this case.
* Trying different random states can change the results slightly.
* We can also perform more rigorous Hyperparameter Tuning to get a better model, but this will incur more computational cost.
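On the point about additional metrics: with about 81% of candidates joining, a model that always predicts 'Joined' would already reach roughly 81% accuracy, so accuracy alone says little about how well the 'Not Joined' class is detected. A minimal sketch on hypothetical labels (1 = Joined, 0 = Not Joined, as in the encoding of `y`) shows how balanced accuracy and the recall of the minority class expose this:

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels with an 80/20 split: 1 = Joined, 0 = Not Joined
y_true = np.array([1] * 8 + [0] * 2)

# Baseline that always predicts the majority class 'Joined'
y_majority = np.ones_like(y_true)

accuracy = metrics.accuracy_score(y_true, y_majority)
balanced = metrics.balanced_accuracy_score(y_true, y_majority)
recall_not_joined = metrics.recall_score(y_true, y_majority, pos_label=0)

print(accuracy, balanced, recall_not_joined)
```

For the models in this notebook, the same functions can be called with `y_test` and each model's predictions, for example `metrics.recall_score(y_test, clf_rfc.predict(x_test), pos_label=0)`.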