# **HR Analytics: Job Change of Data Scientists**

## **1. Context**

A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps reduce cost and time and improves the quality of training, course planning, and the categorization of candidates. Information on demographics, education, and experience is available from each candidate's signup and enrollment.


## **2. Problem Statement**
   * Understand the factors that lead a person to leave their current job.
   * Estimate the probability that a candidate is looking for a job.


In [None]:
# importing libraries.
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.style.use("tableau-colorblind10")
# for imbalance data
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier


import warnings
warnings.filterwarnings('ignore')

## **3. Train Data**

In [None]:
train_data = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_train.csv")
train_data.head()

   
## **4. Data Description**

**Features**

* **enrollee_id**: Unique ID for candidate
* **city**: City code
* **city_development_index**: Development index of the city (scaled)
* **gender**: Gender of candidate
* **relevent_experience**: Relevant experience of candidate
* **enrolled_university**: Type of university course enrolled in, if any
* **education_level**: Education level of candidate
* **major_discipline**: Education major discipline of candidate
* **experience**: Candidate's total experience in years
* **company_size**: Number of employees in current employer's company
* **company_type**: Type of current employer
* **last_new_job**: Difference in years between previous job and current job
* **training_hours**: Training hours completed
* **target**: 0 – Not looking for job change, 1 – Looking for a job change


In [None]:
print("Summary for train data \n")

print("Shape:\n", train_data.shape)
print("----"*10)

print("Description: \n", train_data.describe())
print("----"*10)

print(train_data.info())
print("----"*10)

# Let's take a look at unique values for each column/ features with object datatype.
print("Value counts")
for col in train_data.select_dtypes(object):
    print("----"*10)
    print(train_data[col].value_counts())

The description of **last_new_job** is the difference in years between the previous job and the current job. The column contains the value *`never`*, so let's change *`never`* into *`0`*.
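As a minimal sketch of this kind of cleanup, the snippet below works on a made-up Series rather than the real column. Mapping `>4` to `5` is an assumption made for illustration only; the notebook itself replaces just `never`.

```python
import pandas as pd

# Toy stand-in for the last_new_job column (values are made up for illustration).
last_new_job = pd.Series(["1", ">4", "never", "2", "never"])

# Map the two non-numeric labels to numeric strings, then cast to integers.
# Mapping ">4" to "5" is an assumption made for this sketch.
cleaned = last_new_job.replace({"never": "0", ">4": "5"}).astype(int)

print(cleaned.tolist())
```

Once every label is numeric, the column can be cast to an integer dtype and used in numeric summaries such as a correlation matrix.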

In [None]:
# For last_new_job column.
train_data['last_new_job'] = train_data['last_new_job'].replace("never", 0)

# Treat experience below 1 year as 0 and above 20 years as 21.
train_data['experience'] = train_data['experience'].replace({'<1': 0, '>20': 21})

## **5. Outliers**

In [None]:
# Training hours.
# To check outliers in Training hours.
plt.figure(figsize=(14, 8))
sns.histplot(train_data.training_hours, kde=True, fill=False)

plt.title("Distribution of training hours")
plt.show()

In [None]:
print("Skewness: ", train_data.training_hours.skew())
print("Median: ", train_data.training_hours.quantile(0.50))
# Note: this is the 95th percentile, not the third quartile (Q3).
p95 = train_data.training_hours.quantile(0.95)
print("95th percentile: ", p95)

# Cap outliers: values above the 95th percentile are set to that percentile.
train_data['training_hours'] = np.where(train_data.training_hours > p95, p95, train_data.training_hours)
print("Skewness after capping: ", train_data.training_hours.skew())
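
The capping step above can also be written with `Series.clip`. The following is a small sketch on made-up values, not on the real `training_hours` column; the names `hours`, `cap` and `capped` are hypothetical.

```python
import pandas as pd

# Toy training-hours values with one extreme observation (made up for illustration).
hours = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 40, 300])

# Cap everything above the 95th percentile at that percentile.
cap = hours.quantile(0.95)
capped = hours.clip(upper=cap)

print("cap:", cap)
print("max before:", hours.max(), "max after:", capped.max())
```

Capping keeps every row in the data and only limits the influence of the extreme values, which is why the skewness drops without the sample size changing.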

## **6. Missing Data**

In [None]:
# Missing Data.
percent = (train_data.isnull().sum() / train_data.shape[0] ) *100
missingData = pd.DataFrame(round(percent, 2), columns=['PercentMissingData'])
missingData

In [None]:
# Handling missing data.
# Fill all object-type columns with their most frequent value.

def filling_data(col, data):
    """
    Fill missing data in an object-dtype column with its mode value.
    """
    value = data[col].mode()[0]
    data[col] = data[col].fillna(value)

for col in train_data.select_dtypes(object):
    if train_data[col].isnull().sum() > 0:
        filling_data(col, train_data)
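
A compact variant of the same mode imputation is sketched below on a made-up frame. It relies on `DataFrame.fillna` accepting a `{column: value}` mapping; the frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in the categorical columns (made up for illustration).
df = pd.DataFrame({
    "gender": ["Male", np.nan, "Male", "Female"],
    "company_type": ["Pvt Ltd", "Pvt Ltd", np.nan, "NGO"],
    "training_hours": [10, 20, 30, 40],
})

# Most frequent value of each non-numeric column, as a {column: value} mapping.
categorical_cols = df.select_dtypes(exclude="number").columns
modes = {col: df[col].mode()[0] for col in categorical_cols}

# fillna with a dict fills each listed column with its own value.
df = df.fillna(value=modes)

print(df)
```

Mode imputation is simple, but it inflates the most frequent category, so columns with a large share of missing values may deserve a separate "unknown" category instead.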

## **7. Casting Datatypes**

In [None]:
print(train_data.dtypes)

train_data['experience'] = train_data.experience.astype(int)
# train_data['last_new_job'] = train_data.last_new_job.astype(int)
train_data['target'] = train_data.target.astype(int)
train_data['training_hours'] = train_data.training_hours.astype(int)

## **8. Data Visualization**

In [None]:
# Average total experience by education level.
# Select the column before aggregating so that non-numeric columns are not averaged.
grp_data = pd.DataFrame(train_data.groupby('education_level')['experience'].mean())
grp_data.sort_values(by='experience', ascending=False, inplace=True)

# Plot
grp_data.plot(kind='bar', figsize=(14, 8), title="Average Experience of Employee by Education level", fontsize=14)

plt.show()

**The highest average experience is observed at the highest education level (PhD).**

In [None]:
# Average experience of candidates by gender and education_level.
grp_data = pd.DataFrame(train_data.groupby(['education_level', 'gender'])['experience'].mean())

# Plot
grp_data.unstack().plot(kind='bar', figsize=(14, 8), title="Average of Total Experience by Gender and Education level", fontsize=13)

plt.xticks(rotation=40)
plt.show()

**On average, more than 20 years of experience is observed among PhD holders who did not specify their gender. Overall, the average experience of female candidates is lower than that of male candidates, irrespective of education level.**

In [None]:
# Observation for most common education major streams in candidates.
grp_data = pd.DataFrame(train_data.groupby(['major_discipline']).count()['enrollee_id'])
grp_data.sort_values(by='enrollee_id', ascending=False, inplace=True)

grp_data.plot(kind='bar', figsize=(14, 8), fontsize=14)

plt.ylabel("Count")
plt.title("Most Common Education major in Candidates ")

plt.xticks(rotation=45)
plt.show()

**As expected, most candidates come from a *Science, Technology, Engineering and Mathematics* (STEM) background, since this training is for a Data Science role.**

In [None]:
# Observation for most common education level in candidates.
grp_data = pd.DataFrame(train_data.groupby(['education_level']).count()['enrollee_id'])
grp_data.rename(columns={'enrollee_id':'EducationLevelCounts'}, inplace=True)
grp_data.sort_values(by='EducationLevelCounts', ascending=False, inplace=True)

grp_data.plot(kind='bar', figsize=(14, 8), fontsize=14)

plt.ylabel("Count")
plt.title("Most Common Degree in Candidates ")

plt.xticks(rotation=45)
plt.show()

**Most of the candidates are *Graduates*.**

In [None]:
# Examine the last_new_job variable.

fig, axs = plt.subplots(nrows = 2, ncols = 2, figsize=(20, 15))

# Number of observation for last_new_job.

ax1 = sns.countplot(x=train_data.last_new_job, color="#0779E4", ax=axs[0, 0], label='counts')
ax1.set_title("Number of observations for job change in years")
ax1.legend()


# Relation between years since the last job change and relevant experience?
b = pd.DataFrame(train_data.groupby(['last_new_job', 'relevent_experience']).count()['enrollee_id'])
b.rename(columns={'enrollee_id':'Counts'}, inplace=True)
ax2 = b.unstack().plot(ax=axs[0,1])
ax2.set_title("Most frequent job changes by relevant experience")


# Relation between year of difference between job change and education level?
c = pd.DataFrame(train_data.groupby(['last_new_job', 'education_level']).count()['enrollee_id'])
c.rename(columns={'enrollee_id':'Counts'}, inplace=True)
ax3 = c.unstack().plot(kind='line', ax=axs[1,0])
ax3.set_title("Most frequent job changes by Education Level")


# Who changes the job Most frequently?
d = pd.DataFrame(train_data.groupby(['last_new_job', 'gender']).count()['enrollee_id'])
d.rename(columns={'enrollee_id':'Counts'}, inplace=True)
ax4 = d.unstack().plot(kind='line', ax=axs[1,1])
ax4.set_title("Most frequent job changes by Gender")


plt.show()

Most enrolled candidates have a **1-year gap** between their previous job and their current job, and these are mostly **Graduate** and **Masters** candidates. Candidates with a 1-year gap mostly **have relevant experience** in Data Science.

In [None]:
# Observation for relevant experience.
a = train_data.groupby(by=['education_level','relevent_experience']).count()['enrollee_id']

# plot
a.unstack().plot(kind='bar', figsize=(14, 8), fontsize=16)

plt.ylabel("Counts")
plt.title("Relevant Experience in Data Science by Education level")
plt.xticks(rotation=45)

plt.show()

Most candidates with a **Graduate** or **Masters** degree have **relevant experience** in Data Science, compared with other education levels.

In [None]:
# Are candidates with more experience looking for a job change?

Per_exp_df = (train_data.groupby(['experience', 'target']).count()['enrollee_id']) / (train_data.groupby('experience').count()['enrollee_id'])*100
Per_exp_df = round(Per_exp_df, 2)
Per_exp_df.unstack().plot(kind='line', figsize=(14, 8), fontsize=14)

plt.ylabel("Percentage")
plt.title("Percentage of candidates with overall experience.")

plt.show()

As expected, candidates with **0** years of experience are the ones most often looking for a job, while candidates with **more experience** are less willing to change jobs, that is, they are **not looking for a job**.
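The percentage table behind this plot can also be built with a row-normalised crosstab. The sketch below uses made-up rows, so the numbers say nothing about the real dataset; only the column names `experience` and `target` are taken from the notebook.

```python
import pandas as pd

# Toy data: experience in years and the binary target (made up for illustration).
df = pd.DataFrame({
    "experience": [0, 0, 0, 0, 5, 5, 5, 5],
    "target":     [1, 1, 1, 0, 0, 0, 0, 1],
})

# Row-normalised crosstab: each experience level sums to 100%.
pct = pd.crosstab(df["experience"], df["target"], normalize="index") * 100

print(pct.round(2))
```

Because each row sums to 100%, the two target lines in the plot mirror each other; plotting only the `target == 1` column conveys the same information.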

In [None]:
# Observation for Graduates.

grads_df = train_data.loc[train_data.education_level == 'Graduate']

# Who is ready to change the Job?
a = pd.DataFrame(grads_df.groupby(['last_new_job', 'target']).count()['enrollee_id'])

a.unstack().plot(kind='line', figsize=(14, 8))

plt.title("Number of Graduates looking for a job change.")
plt.show()

Graduates with a gap of **1** or **5** **years** between their previous job and their current job are **looking for a job change**.

In [None]:
# Graduates who enrolled in Training are willing to change the Job?
enrollGrads_df = pd.DataFrame((grads_df.groupby(by=['enrolled_university', 'target']).count()['enrollee_id']))
total_enrolledDf = pd.DataFrame((grads_df.groupby(by=['enrolled_university']).count()['enrollee_id']))

a = round((enrollGrads_df / total_enrolledDf) * 100)
a.rename(columns={'enrollee_id':'GradsCounts'}, inplace=True)

# Visualization for the same.
a.unstack().plot(kind='bar',figsize=(14, 8), fontsize=16)

plt.ylabel("Percentage")
plt.title("Percentage of Graduates Enrolled in Course")
plt.xticks(rotation=0)
plt.show()

Graduates who take a **full time course** are mostly ready to change the job.

In [None]:
# Does training_hours influence the decision to change jobs?

a = pd.DataFrame(train_data.groupby(['training_hours','target']).count()['enrollee_id'])
a.rename(columns= {'enrollee_id': "Counts"}, inplace=True)
a.reset_index(inplace=True)

plt.figure(figsize=(14, 8))
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.scatterplot(x='Counts', y='training_hours', hue='target', data=a)
plt.title("Training Hours by target")
plt.show()

Candidates with **fewer hours** of training are more willing to change jobs than candidates with **more training hours**.
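One way to check an observation like this numerically is to compare summary statistics of `training_hours` per target class. The sketch below uses made-up rows that are deliberately constructed to show a difference, so it illustrates the technique only and not the real result.

```python
import pandas as pd

# Toy data (made up for illustration): training hours and the binary target.
df = pd.DataFrame({
    "training_hours": [10, 20, 30, 40, 50, 60],
    "target":         [1, 1, 1, 0, 0, 0],
})

# Count, mean and median of training hours within each target class.
summary = df.groupby("target")["training_hours"].agg(["count", "mean", "median"])

print(summary)
```

If the per-class means and medians on the real data are close to each other, the visual impression from the scatter plot should be treated with caution.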

In [None]:
# correlation (numeric columns only).
corr = train_data.select_dtypes(include=np.number).corr()
mask = np.triu(corr)

# correlation plot

plt.figure(figsize=(14, 8))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, center=0, fmt="0.2g", square=True, mask=mask, cmap='coolwarm')

plt.xlabel("Features")
plt.title("Correlation between variables.")

plt.show()

The **city_development_index** is correlated with the **experience**, **last_new_job**, and **target** features, and **last_new_job** is correlated with **experience** as well.

In [None]:
# Examine city_development_index.

print("Median: ", train_data.city_development_index.median())
print("Mean: ", train_data.city_development_index.mean())
print("Skewness: ", train_data.city_development_index.skew()) # should be between -1 and 1

# plot
plt.figure(figsize=(14, 8))
sns.histplot(train_data.city_development_index, fill=False, kde=True)
plt.title("Distribution plot")
plt.show()

In [None]:
labels = ['40%', '50%', '60%', '70%', '80%', '90%']
bins = [0.400, 0.500, 0.600, 0.700, 0.800, 0.900, 1.000]

train_data["city_development_percent"] = pd.cut(train_data.city_development_index, labels=labels, bins=bins)

b = train_data.groupby(by=['city_development_percent', 'target']).count()['enrollee_id']

# Visualization for the same.
b.unstack().plot(kind='bar', stacked=True, figsize=(14, 8), fontsize=16)

plt.ylabel("Count")
plt.title("Observation of number of candidates according to City Development index.")

plt.xticks(rotation=0)

plt.show()

Candidates from cities with a **city_development_index** **between 0.60 and 0.70** are mostly ready to change jobs.
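A stacked count plot shows how many candidates fall in each bin, while the share of job seekers per bin is easier to read from the mean of the 0/1 target. The sketch below reuses the same bins and labels on made-up rows; the column name `cdi_bin` is hypothetical.

```python
import pandas as pd

# Toy data (made up for illustration).
df = pd.DataFrame({
    "city_development_index": [0.45, 0.62, 0.65, 0.68, 0.92, 0.92, 0.93, 0.95],
    "target":                 [0,    1,    1,    0,    0,    0,    1,    0],
})

bins = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = ["40%", "50%", "60%", "70%", "80%", "90%"]
df["cdi_bin"] = pd.cut(df["city_development_index"], bins=bins, labels=labels)

# The mean of a 0/1 target is the share of candidates with target == 1.
# observed=True keeps only the bins that actually contain rows.
rate = df.groupby("cdi_bin", observed=True)["target"].agg(["count", "mean"])

print(rate)
```

Reading the `count` column next to the `mean` column helps avoid over-interpreting bins that contain only a handful of candidates.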

## **9. Data Normalization**

In [None]:

def encoder(data):
    
    # One-hot encoding (dummy variables) for low-cardinality categorical columns.
    cols = ['relevent_experience', 'enrolled_university', 'gender']
    
    for col in cols:
        a = pd.get_dummies(data[col])
        data = data.join(a)
    
    return data
        

def fequency_encoder_calculator(data):
    
    # Frequency encoding for categorical features with many distinct values.
    cols = ['city','major_discipline', 'company_size','company_type', 'last_new_job']
    
    for col in cols:
        fe = data.groupby(col).size() / data.shape[0]
        col_name = col + "_fe"
        data[col_name] = data[col].map(fe)
        
        

def StandardScalar(data):
    
    # Normalizing data using Z-score scaling.
    cols = data.columns
        
    for col in cols:
        mean = data[col].mean()
        std = data[col].std()
        data.loc[:, col] = (data.loc[:, col] - mean) / std
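
To make the two transformations above concrete, here is a small sketch on a made-up frame. The names `city_fe` and `training_hours_z` are chosen for this illustration only.

```python
import pandas as pd

# Toy frame (made up for illustration).
df = pd.DataFrame({
    "city": ["city_1", "city_1", "city_2", "city_1"],
    "training_hours": [10.0, 20.0, 30.0, 40.0],
})

# Frequency encoding: replace each category by its share of rows.
freq = df["city"].value_counts(normalize=True)
df["city_fe"] = df["city"].map(freq)

# Z-score scaling: subtract the mean and divide by the sample standard deviation
# (pandas' std() uses ddof=1 by default).
col = df["training_hours"]
df["training_hours_z"] = (col - col.mean()) / col.std()

print(df)
```

Note that scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so its output differs slightly from this hand-written version on small samples.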



In [None]:
# train_data 
train_data = encoder(train_data)
fequency_encoder_calculator(train_data)
# Ordinal encoding of education level, from lowest to highest.
train_data['education_level'] = train_data['education_level'].replace({'Primary School':0, 'High School':1, 'Graduate':2, 'Masters':3, 'Phd':4 })

In [None]:
train_data.head()

In [None]:
train_data.set_index('enrollee_id', inplace=True)
train_data.drop('city_development_percent', axis=1, inplace=True)
data = train_data.select_dtypes(exclude=object)
data.drop('target', axis=1, inplace=True)

In [None]:
data.head()

In [None]:
# standardization
cols = ['experience', 'training_hours']
a = data[cols]
StandardScalar(a)

for col in cols:
    data[col] = a[col]
    
data.head()

In [None]:
y = train_data['target']

In [None]:
# For imbalance dataset.

sm = SMOTE(random_state=1)
sm_data, sm_y = sm.fit_resample(data, y)
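
The cell above resamples the full dataset before it is split into train and test sets. A common alternative ordering, sketched below under stated assumptions, is to split first and resample only the training part so that the test set keeps the original class balance. To stay dependent only on scikit-learn, plain random oversampling via `sklearn.utils.resample` stands in for SMOTE here; the data and all variable names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 90 negatives, 10 positives (made up for illustration).
rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y = pd.Series([0] * 90 + [1] * 10, name="target")

# 1) Split first, so the test set keeps the original class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 2) Oversample the minority class in the training split only.
#    (Random oversampling is used here as a stand-in for SMOTE.)
train = X_tr.assign(target=y_tr)
majority = train[train["target"] == 0]
minority = train[train["target"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_bal = pd.concat([majority, minority_up])

print(train_bal["target"].value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With this ordering, synthetic or duplicated minority rows never end up in the test set, so test metrics reflect performance on data with the real class distribution.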

## **10. Model Evaluation**

In [None]:

def model_eval(model):
    
    print(model)
    print("--"*10)
    
    # fitting model
    model.fit(X_train, y_train)
    
    # prediction
    y_train_preds = model.predict(X_train)
    y_test_preds = model.predict(X_test)
    
    # ROC AUC score (computed here from the hard class predictions).
    train_auc = roc_auc_score(y_train, y_train_preds)
    test_auc = roc_auc_score(y_test, y_test_preds)
    
    print("Train ROC AUC: ", train_auc)
    print("Test ROC AUC: ", test_auc)
    
    
    # Overall Accuracy of Model.
    score = accuracy_score(y_test, y_test_preds)
    print("Accuracy score: ", score)
    print("--"*10)
    
    return y_test_preds
    
    
# splitting data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(sm_data, sm_y, test_size=0.30, random_state=0)

# LogisticRegression
LR_clf = LogisticRegression(random_state=0)

# GradientBoostingClassifier
Grd_clf = GradientBoostingClassifier(learning_rate=0.05, n_estimators=500, random_state=0)

# RandomForestClassifier
Random_clf = RandomForestClassifier(random_state=45)

# prediction from each classifiers
Y_predictions = {}

for model in [LR_clf, Grd_clf, Random_clf]:
    predicts = model_eval(model)
    Y_predictions[str(model).split("(")[0]] = predicts
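
ROC AUC is a ranking metric, so it is usually computed from predicted probabilities (for example `model.predict_proba(X)[:, 1]` in scikit-learn) rather than from hard class labels. The sketch below uses hypothetical probabilities to show that the two inputs can give different scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.7, 0.3, 0.8, 0.9])

# Hard labels obtained with a 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

auc_from_probs = roc_auc_score(y_true, y_prob)
auc_from_labels = roc_auc_score(y_true, y_pred)

print("AUC from probabilities:", round(auc_from_probs, 3))
print("AUC from hard labels:  ", round(auc_from_labels, 3))
```

Thresholding discards the ordering of candidates within each predicted class, which is why the label-based score is lower in this toy case.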

In [None]:
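# confusion matrix for test data with Random Forest Classifier.
# For binary labels, confusion_matrix(...).ravel() returns TN, FP, FN, TP in that order.
TN, FP, FN, TP = confusion_matrix(y_test, Y_predictions['RandomForestClassifier']).ravel()

# precision = TP / (TP + FP), recall = TP / (TP + FN)
precision = round((TP / (TP + FP)*100), 2)
recall = round((TP / (TP + FN)*100), 2)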

print("Precision: ",precision)
print("Recall: ", recall)
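
As a worked check of the definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), the sketch below computes both by hand on made-up labels and compares them with scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 4 negatives and 6 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision, "sklearn:", precision_score(y_true, y_pred))
print("recall:", recall, "sklearn:", recall_score(y_true, y_pred))
```

Here TN = 3, FP = 1, FN = 2 and TP = 4, so precision is 4/5 and recall is 4/6.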

In [None]:
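from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it.
disp = ConfusionMatrixDisplay.from_estimator(Random_clf, X_test, y_test, cmap=plt.cm.PuBu, normalize='true')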
disp.ax_.set_title("Confusion matrix for "+ str(Random_clf).split('(')[0])
plt.show()

In [None]:
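from sklearn.metrics import RocCurveDisplay

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it.
disp = RocCurveDisplay.from_estimator(Random_clf, X_test, y_test)
disp.ax_.set_title("ROC curve for "+ str(Random_clf).split('(')[0])
# Diagonal reference line for a random classifier.
sns.lineplot(x=[0, 1], y=[0, 1], color="r", linestyle="--", alpha=0.7, ax=disp.ax_)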
plt.show()