# **Predict the probability of a candidate looking for a new job**

## Introduction

A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for this training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because **this helps to reduce cost and time and to improve the quality of training, course planning, and the categorization of candidates**. Information on demographics, education, and experience is available from each candidate's signup and enrollment.

This dataset is also designed for HR research into the factors that lead a person to leave their current job. Using model(s) built on the candidates' current credentials, demographics, and experience data, we will **predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision**.

## Task Details

This dataset is designed to help understand the factors that lead a person to work for the company (leaving their current job). The goal of this task is to build model(s) that use the current credentials, demographics, and experience data to predict the probability that a candidate is looking for a new job or will work for the company.


## Note:

* The dataset is imbalanced, which may affect your results if you don't handle it.
* Most features are categorical (nominal, ordinal, binary), some with high cardinality, so encoding methods and techniques will help boost model performance.
* The missing-value imputation strategy may affect the results, so it can be part of your pipeline as well.
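
As a quick illustration of the first note, the class balance can be measured before any modelling. The sketch below uses a made-up label list (an assumption for illustration, not the real `target` column) and also computes inverse-frequency weights `n_samples / (n_classes * count)`, the same heuristic scikit-learn applies for `class_weight='balanced'`.

```python
from collections import Counter

# Hypothetical labels for illustration only:
# 75 candidates not looking for a change (0) and 25 looking for one (1).
labels = [0] * 75 + [1] * 25

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Share of each class in the data.
shares = {cls: cnt / n_samples for cls, cnt in counts.items()}

# Inverse-frequency weights: the rarer class gets the larger weight.
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

print(shares)
print(weights)
```

A strongly uneven share is a signal that plain accuracy will be misleading and that resampling or class weighting is worth considering.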

## Features

* enrollee_id: Unique ID for candidate
* city: City code
* city_development_index: Development index of the city (scaled)
* gender: Gender of candidate
* relevent_experience: Relevant experience of candidate
* enrolled_university: Type of university course enrolled in, if any
* education_level: Education level of candidate
* major_discipline: Major discipline of candidate's education
* experience: Candidate's total experience in years
* company_size: Number of employees in current employer's company
* company_type: Type of current employer
* last_new_job: Difference in years between previous job and current job
* training_hours: Training hours completed
* target: 0 – Not looking for job change, 1 – Looking for a job change

## Import required packages

In [None]:
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
import missingno as msno

from category_encoders.leave_one_out import LeaveOneOutEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

from imblearn.combine import SMOTEENN
from collections import Counter

# Load the data

In [None]:
train = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_train.csv")
test = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_test.csv")
train.head()

In [None]:
train.info()

In [None]:
train.isnull().sum() / len(train)

##### Some columns have a significant amount of missing data. Let's check if there is a pattern between these features by visualizing the missing data.

# Visualize the missing data

In [None]:
msno.matrix(train, figsize=(14, 6), fontsize=10)

In [None]:
msno.heatmap(train, figsize=(14, 6), fontsize=10)

* A value close to -1 means that when one variable appears, the other variable is most likely missing.
* A value close to 0 means that there is no dependency between the occurrence of missing values of the two variables.
* A value close to 1 means that when one variable is present, the other variable is very likely to be present.

The heatmap shows that the nullity correlation between company_size and company_type, and between major_discipline and education_level, is higher than for the other pairs of variables.
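
To make these values concrete, the nullity correlation can be reproduced by correlating the missing-value indicators of each column. The sketch below uses a small made-up frame (the rows are assumptions for illustration, not taken from the dataset) in which two columns are always missing together and a third is missing exactly when they are present.

```python
import pandas as pd

# Toy frame for illustration only: company_size and company_type are
# missing in the same rows; gender is missing in the opposite rows.
toy = pd.DataFrame({
    "company_size": ["50-99", None, "<10", None, "100-500", None],
    "company_type": ["Pvt Ltd", None, "NGO", None, "Pvt Ltd", None],
    "gender": [None, "Male", None, "Female", None, "Male"],
})

# Correlate the boolean "is missing" indicators column by column.
nullity_corr = toy.isnull().corr()
print(nullity_corr)
```

Here the company_size/company_type entry is 1 (present together, missing together), while the company_size/gender entry is -1 (one is present exactly when the other is missing).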

# Visual EDA

In [None]:
cities=train.groupby(["city"]).count().sort_values(by="target", ascending=False)[:10].index.to_list()

plt.figure(figsize=(12,4))
sns.countplot(x="city", hue='target', data=train[train["city"].isin(cities)])
plt.title('Top 10 City vs Target')
plt.show()

In [None]:
categories = ["gender","relevent_experience","enrolled_university","education_level","major_discipline","experience","company_size","company_type","last_new_job"]
titles = ["Gender", "Relevent Experience", "Enrolled University", "Education Level", "Major Discipline", "Experience", "Company Size", "Company Type", "Last New Job"]
for cat,tit in zip(categories, titles):
    plt.figure(figsize=(12,4))
    sns.countplot(x=cat, hue='target', data=train)
    plt.title(tit + ' vs Target')
    plt.show()

# Handling missing values

Here we start with the training dataset. Many methods can be used to handle missing data, including:

* Deleting rows. 
* Deleting columns. 
* Predicting missing values. 
* Assign a unique category

We will use the unique-category approach here, guided by what we learned from the visual EDA.

In [None]:
train["gender"] = train["gender"].fillna('Other')
train["enrolled_university"] = train["enrolled_university"].fillna('Other')
train["education_level"] = train["education_level"].fillna('Other')
train["major_discipline"] = train["major_discipline"].fillna('Other')
train["experience"] = train["experience"].fillna('-1')
train["company_size"] = train["company_size"].fillna('0')
train["company_type"] = train["company_type"].fillna('Other')
train["last_new_job"] = train["last_new_job"].fillna('-1')

print(train.isnull().sum())

We do the same operations for test data.

In [None]:
test["gender"] = test["gender"].fillna('Other')
test["enrolled_university"] = test["enrolled_university"].fillna('Other')
test["education_level"] = test["education_level"].fillna('Other')
test["major_discipline"] = test["major_discipline"].fillna('Other')
test["experience"] = test["experience"].fillna('-1')
test["company_size"] = test["company_size"].fillna('0')
test["company_type"] = test["company_type"].fillna('Other')
test["last_new_job"] = test["last_new_job"].fillna('-1')

print(test.isnull().sum())
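
Since the train and test cells repeat the same fills, the mapping could be kept in one place. The following is a minimal sketch, assuming a hypothetical helper named `fill_missing` and a small made-up frame; the placeholder values are the ones used in the cells above.

```python
import pandas as pd

# Placeholder category per column, copied from the fills above.
FILL_VALUES = {
    "gender": "Other",
    "enrolled_university": "Other",
    "education_level": "Other",
    "major_discipline": "Other",
    "experience": "-1",
    "company_size": "0",
    "company_type": "Other",
    "last_new_job": "-1",
}

def fill_missing(df):
    """Return a copy of df with missing values replaced by the placeholders."""
    # Only pass columns that exist in this frame.
    present = {col: val for col, val in FILL_VALUES.items() if col in df.columns}
    return df.fillna(value=present)

# Toy frame for illustration only.
toy = pd.DataFrame({
    "gender": ["Male", None],
    "experience": [None, ">20"],
    "company_size": ["<10", None],
    "training_hours": [36, 47],
})
filled = fill_missing(toy)
print(filled)
```

With such a helper, `train = fill_missing(train)` and `test = fill_missing(test)` would apply identical rules to both sets, which reduces the risk of the two cells drifting apart.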

# Preprocessing data

Since education_level is an ordinal feature, I encoded it with ordered integers. To make the difference between the categories more pronounced, I widened the numerical gaps between successive levels. Likewise, for company_size, I replaced each size range with the average of its bounds. Apart from these, I made some value replacements and data type corrections.

In [None]:
train["education_level"] = train["education_level"].replace({'Other': "0", 'Primary School': "2", 'High School': "4", 'Graduate': "8", 'Masters': "13", 'Phd': "20"})
train["experience"] = train["experience"].replace({'>20': "21", '<1': "0"})

train["company_size"] = train["company_size"].replace({'10000+': "10000-10000", '10/49': "10-49", '<10': "10-10", '0': "0-0"})
train['company_size'] = train['company_size'].apply(lambda x: int((int(x.split('-')[0]) + int(x.split('-')[1])) / 2))

train["last_new_job"] = train["last_new_job"].replace({'>4': "5", 'never': "0"})
train["relevent_experience"] = train["relevent_experience"].replace({'No relevent experience': "0", 'Has relevent experience': "1"})

LABELS = ["relevent_experience","education_level","experience","last_new_job"]

int_label = lambda x: x.astype('int64')
train[LABELS] = train[LABELS].apply(int_label, axis=0)

In [None]:
test["education_level"] = test["education_level"].replace({'Other': "0", 'Primary School': "2", 'High School': "4", 'Graduate': "8", 'Masters': "13", 'Phd': "20"})
test["experience"] = test["experience"].replace({'>20': "21", '<1': "0"})

test["company_size"] = test["company_size"].replace({'10000+': "10000-10000", '10/49': "10-49", '<10': "10-10", '0': "0-0"})
test['company_size'] = test['company_size'].apply(lambda x: int((int(x.split('-')[0]) + int(x.split('-')[1])) / 2))

test["last_new_job"] = test["last_new_job"].replace({'>4': "5", 'never': "0"})
test["relevent_experience"] = test["relevent_experience"].replace({'No relevent experience': "0", 'Has relevent experience': "1"})

test[LABELS] = test[LABELS].apply(int_label, axis=0)
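
The company_size conversion in the cells above can be traced by hand. The sketch below wraps the same replacement table and midpoint formula in a hypothetical function, `size_midpoint` (a name introduced here for illustration), so individual buckets can be checked.

```python
# Same replacements as in the preprocessing cells: irregular labels are
# rewritten as "low-high" ranges before the midpoint is taken.
SIZE_FIXES = {"10000+": "10000-10000", "10/49": "10-49", "<10": "10-10", "0": "0-0"}

def size_midpoint(bucket):
    """Convert a company_size label to the truncated average of its bounds."""
    cleaned = SIZE_FIXES.get(bucket, bucket)
    low, high = cleaned.split("-")
    return int((int(low) + int(high)) / 2)

for bucket in ["0", "<10", "10/49", "50-99", "100-500", "10000+"]:
    print(bucket, "->", size_midpoint(bucket))
```

Note that the missing-value placeholder '0' maps to 0 and '<10' maps to 10, so after this step the column is numeric and ordered by company size.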

# Split the data

In [None]:
X = train.drop(columns=['enrollee_id', 'target'])
y = train['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Encoding categorical variables

To prevent data leakage, categorical variables are encoded after the data is split into train and test sets. First, LeaveOneOutEncoder is used for the city variable because it is a high-cardinality categorical variable. Then OneHotEncoder is used for gender, enrolled_university, major_discipline, and company_type, since these are nominal variables.

In [None]:
pd.options.mode.chained_assignment = None  # default='warn'

encoder = LeaveOneOutEncoder(return_df=True)
X_train['city'] = encoder.fit_transform(X_train['city'], y_train)

X_test['city'] = encoder.transform(X_test['city'])
test['city'] = encoder.transform(test['city'])
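
To illustrate the idea behind leave-one-out encoding, the sketch below replaces each training row's category with the mean target of the *other* rows in the same category. It is a simplified hand-rolled version in pandas, not the category_encoders implementation; the function name, the toy data, and the fallback to the global mean for categories seen only once are assumptions of this sketch.

```python
import pandas as pd

def leave_one_out_encode(categories, target):
    """Mean target of the other rows in the same category, per row."""
    grouped = target.groupby(categories)
    sums = grouped.transform("sum")
    counts = grouped.transform("count")
    # Exclude the row's own target from its category mean.
    loo = (sums - target) / (counts - 1)
    # Categories with a single row have no "other" rows: use the global mean.
    return loo.where(counts > 1, target.mean())

# Toy data for illustration only.
city = pd.Series(["A", "A", "A", "B", "B", "C"])
y = pd.Series([1, 0, 1, 0, 1, 1])

encoded = leave_one_out_encode(city, y)
print(encoded.tolist())
```

Because a row's own label never enters its encoded value, the feature leaks less target information than a plain per-category mean would during training.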

In [None]:
columns = ['gender', 'enrolled_university', 'major_discipline', 'company_type']

# Note: scikit-learn >= 1.2 uses sparse_output and get_feature_names_out;
# older releases used sparse=False and get_feature_names instead.
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[columns]), columns=OH_encoder.get_feature_names_out(columns))
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[columns]), columns=OH_encoder.get_feature_names_out(columns))
cols_test = pd.DataFrame(OH_encoder.transform(test[columns]), columns=OH_encoder.get_feature_names_out(columns))

OH_cols_train.index = X_train.index
OH_cols_test.index = X_test.index
cols_test.index = test.index

temp_X_train = X_train.drop(columns, axis=1)
temp_X_test = X_test.drop(columns, axis=1)
temp_test = test.drop(columns, axis=1)

OH_X_train = pd.concat([temp_X_train, OH_cols_train], axis=1)
OH_X_test = pd.concat([temp_X_test, OH_cols_test], axis=1)
OH_test = pd.concat([temp_test, cols_test], axis=1) 

# Handling imbalanced dataset

I used the SMOTEENN method to address the imbalanced dataset. SMOTE can generate noisy samples by interpolating new points between marginal outliers and inliers. This can be mitigated by cleaning the space that results from over-sampling, which is what SMOTEENN does: it combines SMOTE over-sampling with Edited Nearest Neighbours cleaning.

In [None]:
counter = Counter(y_train)
print('Before :', counter)
smt = SMOTEENN(random_state=42)

X_train_sm, y_train_sm = smt.fit_resample(OH_X_train, y_train)
counter = Counter(y_train_sm)
print('After :', counter)
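
The interpolation step at the heart of SMOTE can be sketched in a few lines. This is not the imblearn implementation: the real algorithm first selects the neighbor from the k nearest minority-class samples, whereas here the neighbor is simply given, and the function name and feature values are made up for illustration.

```python
import random

def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the segment between sample and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)

# Two hypothetical minority-class rows with two numeric features each.
minority_sample = [0.62, 20.0]
minority_neighbor = [0.92, 60.0]

synthetic = interpolate(minority_sample, minority_neighbor, rng)
print(synthetic)
```

When the chosen neighbor is an outlier sitting inside the majority region, the synthetic point can land there too, which is the noise the cleaning step is meant to remove.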

# RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=80,max_features=6,max_samples=0.1,random_state=42)
rf.fit(X_train_sm, y_train_sm)
y_pred = rf.predict(OH_X_test)
print(classification_report(y_test, y_pred))
print('Roc auc score :',roc_auc_score(y_test, rf.predict_proba(OH_X_test)[:, 1]))

In [None]:
# Compute predicted probabilities: y_pred_prob
y_pred_prob = rf.predict_proba(OH_X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
#plt.figure(figsize=(8,8))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

In [None]:
# Create a pd.Series of features importances
importances = pd.Series(data=rf.feature_importances_,
                        index= X_train_sm.columns)

# Sort importances
importances_sorted = importances.sort_values()

# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightgreen', figsize=(12,8))
plt.title('Feature Importances')
plt.show()

# GradientBoostingClassifier

In [None]:
gbc = GradientBoostingClassifier(n_estimators=50,learning_rate=0.1,max_features='sqrt',min_samples_leaf=12, random_state=42)
gbc.fit(X_train_sm, y_train_sm)
y_pred = gbc.predict(OH_X_test)
print(classification_report(y_test, y_pred))
print('Roc auc score :',roc_auc_score(y_test, gbc.predict_proba(OH_X_test)[:, 1]))

In [None]:
# Compute predicted probabilities: y_pred_prob
y_pred_prob = gbc.predict_proba(OH_X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
#plt.figure(figsize=(8,8))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

In [None]:
# Create a pd.Series of features importances
importances = pd.Series(data=gbc.feature_importances_,
                        index= X_train_sm.columns)  # columns of the data gbc was fitted on

# Sort importances
importances_sorted = importances.sort_values()

# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightgreen', figsize=(12,8))
plt.title('Feature Importances')
plt.show()

# TEST

In [None]:
test_answers = np.load('../input/job-change-dataset-answer/jobchange_test_target_values.npy')

In [None]:
rf_pred = rf.predict(OH_test.drop(columns=['enrollee_id']))
print(classification_report(test_answers, rf_pred))
print('Roc auc score :',roc_auc_score(test_answers, rf.predict_proba(OH_test.drop(columns=['enrollee_id']))[:, 1]))

In [None]:
gbc_pred = gbc.predict(OH_test.drop(columns=['enrollee_id']))
print(classification_report(test_answers, gbc_pred))
print('Roc auc score :',roc_auc_score(test_answers, gbc.predict_proba(OH_test.drop(columns=['enrollee_id']))[:, 1]))