# <center> HR Analytics: Job Change of Data Scientists</center>

**In this notebook, I analyze and predict which employees are more likely to look for a new job. I do some data cleansing and quick visualization, deal with the imbalanced dataset, and finally build a model to predict which employees will leave their job.**

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv')

Let's look at our data first.

In [None]:
train.head()

In [None]:
print(train.shape)
print(test.shape)

In [None]:
train.info()

In [None]:
train.isnull().sum()

Our dataset has a lot of missing values, so we have to deal with them before we proceed.
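
To judge how serious the gaps are, it can help to look at the share of missing values per column rather than the raw counts. The snippet below is a minimal sketch on a made-up toy frame (the `toy` data is an assumption, not taken from the real dataset); the same expression should work on `train`.

```python
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    'gender': ['Male', None, 'Female', None],
    'experience': ['3', '>20', None, '5'],
    'city_development_index': [0.92, 0.62, 0.78, 0.91],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```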

# Handling missing values

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column:
# the chained in-place form may not update `train` in recent pandas versions.
# gender column
train['gender'] = train['gender'].fillna(train['gender'].mode()[0])
# enrolled_university column
train['enrolled_university'] = train['enrolled_university'].fillna(train['enrolled_university'].mode()[0])
# education_level column
train['education_level'] = train['education_level'].fillna(train['education_level'].mode()[0])
# major_discipline column
train['major_discipline'] = train['major_discipline'].fillna('Unknown_discipline')
# experience column (rows without experience are dropped)
train = train.dropna(subset=['experience'])
# company_size column
train['company_size'] = train['company_size'].fillna(train['company_size'].mode()[0])
# company_type column
train['company_type'] = train['company_type'].fillna('Unknown_type')
# last_new_job column
train['last_new_job'] = train['last_new_job'].fillna(train['last_new_job'].mode()[0])
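
The same fill values can also be learned once and reused, so that the test data is imputed with statistics from the training data. This is only a sketch of that idea: `fit_fill_values` is a hypothetical helper, and the toy frames are made up.

```python
import pandas as pd

def fit_fill_values(df, mode_cols, constant_fills):
    """Learn one fill value per column from `df` (hypothetical helper)."""
    fills = {col: df[col].mode()[0] for col in mode_cols}
    fills.update(constant_fills)
    return fills

# Toy frames standing in for `train` and `test`; values are made up.
train_toy = pd.DataFrame({
    'gender': ['Male', 'Male', None, 'Female'],
    'company_type': ['Pvt Ltd', None, 'NGO', 'Pvt Ltd'],
})
test_toy = pd.DataFrame({
    'gender': [None, 'Female'],
    'company_type': [None, 'NGO'],
})

fills = fit_fill_values(train_toy, mode_cols=['gender'],
                        constant_fills={'company_type': 'Unknown_type'})

# DataFrame.fillna accepts a dict mapping column name -> fill value.
train_filled = train_toy.fillna(fills)
test_filled = test_toy.fillna(fills)
print(test_filled)
```

Learning the values from the training frame only keeps information from the test data out of the preprocessing.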

In [None]:
train.isnull().sum()

# EDA

Let's do some quick EDA on our dataset.

In [None]:
sns.set_style("whitegrid")
sns.countplot(data=train, x='gender')

Our samples are dominated by men. Note that missing `gender` values were filled with the mode above, which adds to the largest category.

In [None]:
sns.boxplot(x='target', y='city_development_index', data=train)

People in cities with a lower development index tend to leave their job.
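
The boxplot can be backed with numbers by summarising the index per target class. A minimal sketch on made-up toy data (the `toy` frame is an assumption; the real check would use `train`):

```python
import pandas as pd

# Toy data standing in for `train`; the numbers are made up for illustration.
toy = pd.DataFrame({
    'city_development_index': [0.62, 0.55, 0.70, 0.92, 0.91, 0.89, 0.93, 0.78],
    'target':                 [1.0,  1.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0],
})

# Quartiles of the index for each target class (the values a boxplot draws).
summary = toy.groupby('target')['city_development_index'].describe()
print(summary[['25%', '50%', '75%']])
```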

In [None]:
plt.figure(figsize=(14,6))
order = ['<1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
         '13', '14', '15', '16', '17', '18', '19', '20', '>20']
sns.barplot(data=train, x='experience', y='target', order=order)

People with more experience tend to stay in their job.

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='major_discipline', y='target', data=train)

# Encoding Categorical Data

In [None]:
train.info()

Because our dataset has many categorical columns, we have to encode them as numbers so the model can use them.

In [None]:
# Making dictionaries for the ordinal features

relevent_experience_map = {
    'Has relevent experience':  1,
    'No relevent experience':    0
}

education_level_map = {
    'Primary School' :    0,
    'High School'    :    1,
    'Graduate'       :    2,
    'Masters'        :    3,
    'Phd'            :    4
}
    
experience_map = {
    '<1'      :    0,
    '1'       :    1, 
    '2'       :    2, 
    '3'       :    3, 
    '4'       :    4, 
    '5'       :    5,
    '6'       :    6,
    '7'       :    7,
    '8'       :    8, 
    '9'       :    9, 
    '10'      :    10, 
    '11'      :    11,
    '12'      :    12,
    '13'      :    13, 
    '14'      :    14, 
    '15'      :    15, 
    '16'      :    16,
    '17'      :    17,
    '18'      :    18,
    '19'      :    19, 
    '20'      :    20, 
    '>20'     :    21
} 

company_size_map = {
    '<10'          :    0,
    '10/49'        :    1,
    '50-99'        :    2,
    '100-500'      :    3,
    '500-999'      :    4,
    '1000-4999'    :    5,
    '5000-9999'    :    6,  # fixed: must rank below '10000+'
    '10000+'       :    7
}
    
last_new_job_map = {
    'never'        :    0,
    '1'            :    1, 
    '2'            :    2, 
    '3'            :    3, 
    '4'            :    4, 
    '>4'           :    5
}

In [None]:
# Transforming Categorical features into numerical features
train['relevent_experience'] = train['relevent_experience'].map(relevent_experience_map)
train['education_level'] = train['education_level'].map(education_level_map)
train['experience'] = train['experience'].map(experience_map)
train['company_size'] = train['company_size'].map(company_size_map)
train['last_new_job'] = train['last_new_job'].map(last_new_job_map)

# One-hot encode the remaining categories because they have no natural order
new_df = pd.get_dummies(train, columns = ['gender', 'enrolled_university', 'major_discipline', 'company_type'], drop_first=True)

# Dropping 'city' and 'enrollee_id' columns
new_df.drop(['enrollee_id', 'city'],axis=1, inplace=True)
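
One thing to watch with `.map`: any value that is not a key of the dictionary silently becomes `NaN`. The sketch below shows one way to catch that before encoding; `unmapped_values`, the toy map, and the toy series are all hypothetical and only serve as illustration.

```python
import pandas as pd

def unmapped_values(series, mapping):
    """Return the distinct non-null values of `series` that are not keys of `mapping`."""
    return sorted(set(series.dropna().unique()) - set(mapping))

# Toy map and data; '5+' is deliberately missing from the map.
last_new_job_toy_map = {'never': 0, '1': 1, '>4': 5}
raw = pd.Series(['never', '1', '>4', '5+', None])

missing_keys = unmapped_values(raw, last_new_job_toy_map)
print(missing_keys)

# Both the unmapped value and the original null end up as NaN after mapping.
encoded = raw.map(last_new_job_toy_map)
print(encoded.isnull().sum())
```

Running such a check on each ordinal column before mapping makes typos in the dictionaries visible early.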

In [None]:
new_df.head()

In [None]:
new_df.info()

Now that we have our final dataset, let's see how each feature correlates with the target.

In [None]:
new_df.corr()['target']
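
The correlation column is easier to read when it is sorted by absolute value and the target itself is dropped. A minimal sketch on made-up data (column names and values in `toy` are assumptions):

```python
import pandas as pd

# Toy frame standing in for `new_df`; the values are made up for illustration.
toy = pd.DataFrame({
    'target':       [0, 0, 1, 1],
    'feature_pos':  [0, 0, 1, 1],   # perfectly correlated with target
    'feature_none': [1, 0, 1, 0],   # uncorrelated with target
    'feature_neg':  [3, 2, 1, 0],   # strongly negatively correlated
})

# Drop the target's self-correlation, then sort by strength regardless of sign.
ranked = (toy.corr()['target']
             .drop('target')
             .sort_values(key=abs, ascending=False))
print(ranked)
```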

# Modelling

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=new_df)
plt.show()

In [None]:
X = new_df.drop('target', axis=1)
y = new_df['target']

Because our dataset is imbalanced, we will use SMOTE (Synthetic Minority Over-sampling Technique) to deal with it.

In [None]:
from imblearn.over_sampling import SMOTE

# Fixed seed so the resampling is reproducible
smote = SMOTE(random_state=101)
X_smote, y_smote = smote.fit_resample(X, y)
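
SMOTE creates each synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that core step on made-up points; `smote_like_sample` is a hypothetical helper and not the `imblearn` implementation.

```python
import numpy as np

def smote_like_sample(points, rng, k=2):
    """Draw one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    x = points[i]
    dists = np.linalg.norm(points - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # index 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # interpolation weight in [0, 1)
    return x + lam * (points[j] - x)

rng = np.random.default_rng(101)
# Made-up minority-class points in two dimensions.
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])

synthetic = smote_like_sample(minority, rng)
print(synthetic)
```

Because the new point lies on the segment between two real samples, it always stays inside the range spanned by the minority class.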

In [None]:
plt.figure(figsize=(6, 4))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=y_smote.value_counts().index.astype(int),
            y=y_smote.value_counts().values)
plt.title('After sampling')
plt.show()

Now that our dataset is ready, let's build the model!

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_smote,
                                                    y_smote,
                                                    test_size=0.2,
                                                    random_state=101)

In [None]:
# Fixed seed so the results are reproducible
RFC = RandomForestClassifier(random_state=101)
RFC.fit(X_train, y_train)
prediction = RFC.predict(X_test)
print(classification_report(y_test, prediction))
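
Because the resampling above happens before the split, the held-out part also contains synthetic samples, so the report may look more optimistic than performance on real data. A common alternative is to split first and oversample only the training part. The sketch below shows that order on synthetic data from `make_classification`; it uses plain random duplication from `sklearn.utils.resample` as a stand-in for SMOTE, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for X and y (roughly 75% / 25%).
X_toy, y_toy = make_classification(n_samples=1000, n_features=8,
                                   weights=[0.75, 0.25], random_state=101)

# 1. Split first, keeping the original class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=101)

# 2. Oversample the minority class in the training part only.
#    Random duplication is used here as a stand-in for SMOTE.
minority_mask = y_tr == 1
n_extra = int((~minority_mask).sum() - minority_mask.sum())
X_extra, y_extra = resample(X_tr[minority_mask], y_tr[minority_mask],
                            replace=True, n_samples=n_extra, random_state=101)
X_bal = np.vstack([X_tr, X_extra])
y_bal = np.concatenate([y_tr, y_extra])

# 3. Fit on the balanced training data, evaluate on the untouched test part.
model = RandomForestClassifier(random_state=101)
model.fit(X_bal, y_bal)
print(classification_report(y_te, model.predict(X_te)))
```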

# Test the model with the test dataset

In [None]:
# Apply the same cleansing to the test dataset as to the training dataset.
# Results are assigned back rather than using fillna(inplace=True) on a column.

# gender column
test['gender'] = test['gender'].fillna(test['gender'].mode()[0])
# enrolled_university column
test['enrolled_university'] = test['enrolled_university'].fillna(test['enrolled_university'].mode()[0])
# education_level column
test['education_level'] = test['education_level'].fillna(test['education_level'].mode()[0])
# major_discipline column
test['major_discipline'] = test['major_discipline'].fillna('Unknown_discipline')
# experience column (rows without experience are dropped and get no prediction)
test = test.dropna(subset=['experience'])
# company_size column
test['company_size'] = test['company_size'].fillna(test['company_size'].mode()[0])
# company_type column
test['company_type'] = test['company_type'].fillna('Unknown_type')
# last_new_job column
test['last_new_job'] = test['last_new_job'].fillna(test['last_new_job'].mode()[0])

In [None]:
# Transforming Categorical features into numerical features
test['relevent_experience'] = test['relevent_experience'].map(relevent_experience_map)
test['education_level'] = test['education_level'].map(education_level_map)
test['experience'] = test['experience'].map(experience_map)
test['company_size'] = test['company_size'].map(company_size_map)
test['last_new_job'] = test['last_new_job'].map(last_new_job_map)

# One-hot encode the remaining categories because they have no natural order
new_test_df = pd.get_dummies(test, columns = ['gender', 'enrolled_university', 'major_discipline', 'company_type'], drop_first=True)

# Dropping 'city' and 'enrollee_id' columns
new_test_df.drop(['enrollee_id', 'city'],axis=1, inplace=True)
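
One-hot encoding the test data separately can produce a different set of columns than the training data, for example when a category never appears in the test file. A minimal sketch of aligning the two with `reindex`, on made-up toy frames (the frame names and values are assumptions):

```python
import pandas as pd

# Toy frames; 'NGO' and 'Other' occur only in the training data.
train_toy = pd.DataFrame({'company_type': ['Pvt Ltd', 'NGO', 'Other']})
test_toy = pd.DataFrame({'company_type': ['Pvt Ltd', 'Pvt Ltd']})

train_enc = pd.get_dummies(train_toy, columns=['company_type'])
test_enc = pd.get_dummies(test_toy, columns=['company_type'])

# Give the test frame exactly the training columns, in the same order;
# dummy columns that are absent in the test data are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

In this notebook, the training columns would be `X.columns`.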

In [None]:
new_test_df.head()

In [None]:
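# Predict on the test dataset, with its columns aligned to the training features
prediction = RFC.predict(new_test_df.reindex(columns=X.columns, fill_value=0))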

In [None]:
# Create a DataFrame of predictions, one row per enrollee
prediction_df = pd.DataFrame({'enrollee_id':test['enrollee_id'],'target':prediction})
                        
prediction_df.head()

**That's it! Thank you for reading my first notebook on Kaggle. As I'm still new to this field, any advice is welcome!**