<h1>Predicting Job Change of Data Scientists using HR Analytics</h1>
<p>Task: Using the given dataset of candidates in the training program, predict whether a given candidate will work for the company or look for a new job.</p>

![](https://payslip.com/wp-content/uploads/2019/11/shutterstock_788621173.jpg)

<h1>Getting Started - Importing Libraries</h1>

In [None]:
from imblearn.over_sampling import SMOTE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

<h1>Preparing Data</h1>
<p>The data is split across a train file and a test file. Let us combine both to get a complete set for analysis, and drop the ID column since it is not a meaningful feature.</p>

In [None]:
train_data = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_train.csv") 
test_data = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_test.csv") 
data = pd.concat([train_data,test_data])
data.drop(["enrollee_id"],axis=1,inplace=True)
data.head()

<p>Let us go through the dataset documentation to understand what each column stands for.</p>

<ul>
    <li>enrollee_id : Unique ID for candidate</li>
    <li>city: City code</li>
    <li>city_development_index: Development index of the city (scaled)</li>
    <li>gender: Gender of candidate</li>
    <li>relevent_experience: Relevant experience of candidate</li>
    <li>enrolled_university: Type of University course enrolled if any</li>
    <li>education_level: Education level of candidate</li>
    <li>major_discipline: Education major discipline of candidate</li>
    <li>experience: Candidate total experience in years</li>
    <li>company_size: Number of employees in current employer's company</li>
    <li>company_type: Type of current employer</li>
    <li>last_new_job: Difference in years between previous job and current job</li>
    <li>training_hours: Training hours completed</li>
    <li>target: 0 – Not looking for job change, 1 – Looking for a job change</li>
</ul>

<h2>Looking at column data types</h2>
<p>In this work, we are dealing with many categorical columns and only a few numerical ones.</p>

In [None]:
ColumnsDescription = {"Name" : [],"Type" : []}
for cols in data.columns:
    ColumnsDescription["Name"].append(cols)
    ColumnsDescription["Type"].append(data[cols].dtypes)
pd.DataFrame(ColumnsDescription,index=None)

<p>There are a lot of missing values in the dataset. For categorical columns, we will replace them with the mode, and for numerical columns with the mean. First, let us look at the data type of each column and the number of non-null values in each.</p>

In [None]:
data.info()
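
<p>data.info() reports non-null counts, so the number of missing values has to be worked out by subtraction. A compact per-column table can be easier to scan. The sketch below runs on a made-up frame; the names <code>toy</code> and <code>missing</code> are illustrative and not part of the notebook.</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "training_hours": [10.0, np.nan, 35.0, 20.0],
    "city": ["city_1", "city_2", "city_1", "city_1"],
})

# One row per column: dtype, number of missing entries, and their share.
missing = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(missing)
```

<p>On the real data, replacing <code>toy</code> with <code>data</code> should give the same kind of table for every column.</p>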

<h1>Data Engineering</h1>
<ul>
    <li>Replace missing values in each categorical column with the mode of that column, then label encode the column.</li>
    <li>Replace missing values in each numerical column with the mean of that column.</li>
</ul>
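
<p>The two rules above can be sketched as one small helper. This is only an illustration on made-up data: the function name <code>impute_and_encode</code> and the toy frame are assumptions, and the notebook itself handles each column separately so that the fitted class lists stay available for plotting.</p>

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def impute_and_encode(df, categorical_cols, numerical_cols):
    """Return an imputed, label-encoded copy of df plus the fitted encoders."""
    out = df.copy()
    encoders = {}
    for col in categorical_cols:
        # Most frequent value fills the gaps, then labels become integers.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
        encoders[col] = LabelEncoder().fit(out[col])
        out[col] = encoders[col].transform(out[col])
    for col in numerical_cols:
        # Numerical gaps are filled with the column mean.
        out[col] = out[col].fillna(out[col].mean())
    return out, encoders

# Made-up example data.
toy = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "training_hours": [10.0, np.nan, 30.0, 20.0],
})
clean, encoders = impute_and_encode(toy, ["gender"], ["training_hours"])
print(clean)
print(encoders["gender"].classes_)
```

<p>Keeping the encoders in a dictionary makes it possible to map the integer codes back to their original labels later.</p>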

In [None]:
data['city'] = data['city'].fillna(data['city'].value_counts().index[0])
CityLabelEncoder = LabelEncoder().fit(data['city'])
CityList = CityLabelEncoder.classes_
data['city'] = CityLabelEncoder.transform(data['city'])

data['company_type'] = data['company_type'].fillna(data['company_type'].value_counts().index[0])
CTypeLabelEncoder = LabelEncoder().fit(data['company_type'])
CTypeList = CTypeLabelEncoder.classes_
data['company_type'] = CTypeLabelEncoder.transform(data['company_type'])

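# Bucket the raw size ranges into four coarse groups (assign back instead of a chained inplace call).
data['company_size'] = data['company_size'].replace(
    ['<10','10/49', '50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+'],
    ['Startup','Small','Small','Medium','Medium','Large','Large','Large'])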
data['company_size'] = data['company_size'].fillna(data['company_size'].value_counts().index[0])
CSizeLabelEncoder = LabelEncoder().fit(data['company_size'])
CSizeList = CSizeLabelEncoder.classes_
data['company_size'] = CSizeLabelEncoder.transform(data['company_size'])

data['education_level'] = data['education_level'].fillna(data['education_level'].value_counts().index[0])
EduLabelEncoder = LabelEncoder().fit(data['education_level'])
EduList = EduLabelEncoder.classes_
data['education_level'] = EduLabelEncoder.transform(data['education_level'])

data['enrolled_university'] = data['enrolled_university'].fillna(data['enrolled_university'].value_counts().index[0])
UniLabelEncoder = LabelEncoder().fit(data['enrolled_university'])
UniList = UniLabelEncoder.classes_
data['enrolled_university'] = UniLabelEncoder.transform(data['enrolled_university'])

data['gender'] = data['gender'].fillna(data['gender'].value_counts().index[0])
GenderLabelEncoder = LabelEncoder().fit(data['gender'])
GenderList = GenderLabelEncoder.classes_
data['gender'] = GenderLabelEncoder.transform(data['gender'])

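# Rows with no target (presumably those from the test file) are labelled with the majority class here,
# which adds unverified labels to the data.
data['target'] = data['target'].fillna(data['target'].value_counts().index[0])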

data['major_discipline'] = data['major_discipline'].fillna(data['major_discipline'].value_counts().index[0])
MajorLabelEncoder = LabelEncoder().fit(data['major_discipline'])
MajorList = MajorLabelEncoder.classes_
data['major_discipline'] = MajorLabelEncoder.transform(data['major_discipline'])

data['relevent_experience'] = data['relevent_experience'].fillna(data['relevent_experience'].value_counts().index[0])
ExpLabelEncoder = LabelEncoder().fit(data['relevent_experience'])
ExpList = ExpLabelEncoder.classes_
data['relevent_experience'] = ExpLabelEncoder.transform(data['relevent_experience'])

<p>A few categorical columns hold numbers mixed with open-ended string codes such as '>4' or 'never'. Replace those codes with fixed numeric strings, then convert the column to floating point values.</p>

In [None]:
# Map open-ended codes to numbers, fill gaps with the mode, then cast to float.
data['last_new_job'] = data['last_new_job'].replace(['>4','never'],['4','0'])
data['last_new_job'] = data['last_new_job'].fillna(data['last_new_job'].value_counts().index[0])
data['last_new_job'] = data['last_new_job'].astype(float)

data['experience'] = data['experience'].replace(['>20','<1'],['20','1'])
data['experience'] = data['experience'].fillna(data['experience'].value_counts().index[0])
data['experience'] = data['experience'].astype(float)
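
<p>To see what this conversion does on a small scale, here is a minimal sketch on a made-up series of <code>experience</code> codes. The sample values are invented for illustration; the boundary mapping ('>20' to 20, '<1' to 1) is the same one used in the cell above.</p>

```python
import pandas as pd

# Made-up sample of the string codes found in the `experience` column.
experience = pd.Series([">20", "<1", "5", None, "5", "12"])

# Map the open-ended buckets to their boundary values.
cleaned = experience.replace({">20": "20", "<1": "1"})
# Fill the gap with the most frequent code, then convert to float.
cleaned = cleaned.fillna(cleaned.mode().iloc[0]).astype(float)
print(cleaned.tolist())  # [20.0, 1.0, 5.0, 5.0, 5.0, 12.0]
```

<p>Note that this mapping caps the column at 20, so candidates with more than 20 years of experience become indistinguishable from those with exactly 20.</p>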

<p>Replace all missing values in numerical columns with their mean</p>

In [None]:
data['training_hours'] = data['training_hours'].fillna(data['training_hours'].mean())
data['training_hours'] = data['training_hours'].astype(float)
data['city_development_index'] = data['city_development_index'].fillna(data['city_development_index'].mean())

In [None]:
data.info()

In [None]:
data.head()

<p>Now that our data is prepared and ready, let us move onto EDA to understand our data better.</p>

<h1>EDA of the transformed dataset</h1>
<p>Plot a dashboard of bar charts to better understand how each categorical column and the target are distributed.</p>

In [None]:
plt.figure(figsize=(15,25))
plt.subplot(5,2,1)
genders,count = np.unique(data['gender'],return_counts=True)
plt.title("Gender Count")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.bar([GenderList[i] for i in genders],count)

plt.subplot(5,2,2)
exp,count = np.unique( data['relevent_experience'],return_counts=True)
plt.title("Experience")
plt.xlabel("Experience Level")
plt.ylabel("Count")
plt.bar([ExpList[i] for i in exp],count)

plt.subplot(5,2,3)
uni,count = np.unique( data['enrolled_university'],return_counts=True)
plt.title("University Enrolled")
plt.xlabel("University")
plt.ylabel("Count")
plt.bar([UniList[i] for i in uni],count)

plt.subplot(5,2,4)
edu,count = np.unique( data['education_level'],return_counts=True)
plt.title("Education Level")
plt.xlabel("Level")
plt.ylabel("Count")
plt.bar([EduList[i] for i in edu],count)

plt.subplot(5,2,5)
major,count = np.unique( data['major_discipline'],return_counts=True)
plt.title("Major Discipline")
plt.xlabel("Major")
plt.ylabel("Count")
plt.bar([MajorList[i] for i in major],count)

plt.subplot(5,2,6)
ct,count = np.unique( data['company_type'],return_counts=True)
plt.title("Company Type")
plt.xlabel("Type")
plt.ylabel("Count")
plt.bar([CTypeList[i] for i in ct],count)

plt.subplot(5,2,7)
cs,count = np.unique( data['company_size'],return_counts=True)
plt.title("Company Size")
plt.xlabel("Size")
plt.ylabel("Count")
plt.bar([CSizeList[i] for i in cs],count)

plt.subplot(5,2,8)
tg,count = np.unique( data['target'],return_counts=True)
plt.title("Target")
plt.xlabel("Target label")
plt.ylabel("Count")
plt.bar([["Not Looking","Looking"][int(i)] for i in tg],count)

plt.tight_layout()
plt.show()

print("Total cities : ",len(CityList))
print("Min City development index : {:.2f}".format(min(data['city_development_index'])))
print("Max City development index : ",max(data['city_development_index']))
print("Min difference between last and current job : ",min(data['last_new_job']))
print("Max difference between last and current job : ",max(data['last_new_job']))
print("Min training hours : ",min(data['training_hours']))
print("Max training hours : ",max(data['training_hours']))

<h1>Modeling</h1>
<p>The dataset is highly imbalanced. Use SMOTE to oversample the minority class by generating synthetic samples until both classes are equally represented.</p>
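
<p>To give an intuition for how SMOTE creates new samples, here is a simplified NumPy sketch of its core idea: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. The function <code>smote_like</code> is a hypothetical illustration and not the imblearn implementation.</p>

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # random minority sample
        nb = neighbours[base, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Four made-up minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like(X_min, n_new=5)
print(new_points)
```

<p>Because the new points are interpolated, every column is treated as continuous, so label-encoded columns can end up with fractional codes that do not correspond to any category.</p>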

In [None]:
X_org = data[data.columns[:len(data.columns)-1]].to_numpy()
Y_org = data[data.columns[len(data.columns)-1]].to_numpy()
over = SMOTE(random_state=42)
X, Y = over.fit_resample(X_org,Y_org)

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,shuffle=True,random_state=42)


<p>RandomForest is used as the classifier on the SMOTE-balanced data. Use RandomizedSearchCV() to find the best parameters.</p>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, removed in scikit-learn 1.2
from sklearn.model_selection import RandomizedSearchCV

model=RandomForestClassifier()
# Candidate values to sample from (no trailing comma, so this stays a dict).
distributions = dict(n_estimators = list(range(100,1001,100)),
                     max_depth = list(range(10,200,10)))
best_params = RandomizedSearchCV(model,distributions,random_state=0)
params = best_params.fit(X_train,Y_train)
print(params.best_params_)

In [None]:
model=RandomForestClassifier(n_estimators=params.best_params_['n_estimators'],max_depth=params.best_params_['max_depth'])
model.fit(X_train,Y_train)
YPred = model.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred) in that order.
print("Accuracy score : {:.2f}".format(accuracy_score(Y_test,YPred)))
print("Recall score : {:.2f}".format(recall_score(Y_test,YPred,average='macro',zero_division=1)))
print("Precision score : {:.2f}".format(precision_score(Y_test,YPred,zero_division=1)))
print("F1 score : {:.2f}".format(f1_score(Y_test,YPred,zero_division=1)))

disp = ConfusionMatrixDisplay.from_estimator(model,
                                             X_test,
                                             Y_test,
                                             display_labels=["Not Leaving Job","Leaving Job"])
plt.show()
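
<p>The scikit-learn metric functions take the true labels first and the predictions second. For a binary problem, accuracy and F1 give the same value either way, but precision and recall trade places when the arguments are swapped. A small hand-checkable example with made-up labels:</p>

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 2 true positives, 2 false negatives, 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Correct order: (y_true, y_pred).
precision = precision_score(y_true, y_pred)  # 2 / (2 + 1)
recall = recall_score(y_true, y_pred)        # 2 / (2 + 2)

# Swapped order: the two metrics exchange values.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print("precision: {:.2f}, recall: {:.2f}".format(precision, recall))
print("swapped precision: {:.2f}, swapped recall: {:.2f}".format(swapped_precision, swapped_recall))
```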