# HR Analytics: Job Change of Data Scientists

A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps reduce cost and time and improves the quality of training, course planning, and the categorization of candidates. Information on demographics, education, and experience is available from the candidates' signup and enrollment.

This dataset is also designed to help understand the factors that lead a person to leave their current job, for HR research. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.

>> This is my first notebook so I offer a very simple exploratory data analysis :).
I will update this notebook whenever I have new improvement ideas.
Feedback and criticism are very welcome

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
df = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')

## Exploratory Data Analysis

Let us have a look at our data sample

In [None]:
df.sample(10)

### Data Types

In [None]:
df.info()

Here we have 19158 entries and 14 columns: 10 categorical features, 3 numerical features, and 1 target. Let us split them into groups just in case.

In [None]:
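# Column names follow the spelling used in the dataset (e.g. 'relevent_experience').
cat_feat = ['city', 'gender', 'relevent_experience', 'enrolled_university', 'education_level',
            'major_discipline', 'experience', 'company_size', 'company_type', 'last_new_job']

num_feat = ['enrollee_id', 'city_development_index', 'training_hours']

# Map the numeric labels to readable names. Assign the result back:
# replace(..., inplace=True) returns None, so it must not be stored in a variable.
df['target'] = df['target'].replace({0.0: 'STAY', 1.0: 'CHANGE'})
target = 'target'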
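As a cross-check on the hand-written lists, the same split can be derived from the column dtypes. This is a minimal sketch on a tiny made-up frame (`toy`) that stands in for `aug_train.csv`; the column subset and all values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up.
toy = pd.DataFrame({
    'enrollee_id': [1, 2, 3],
    'city': ['city_103', 'city_40', 'city_21'],
    'city_development_index': [0.920, 0.776, 0.624],
    'gender': ['Male', None, 'Female'],
    'training_hours': [36, 47, 83],
    'target': [1.0, 0.0, 0.0],
})

features = toy.drop(columns='target')

# Non-numeric columns are treated as categorical features.
cat_cols = features.select_dtypes(exclude='number').columns.tolist()
# Numeric columns are treated as numerical features.
num_cols = features.select_dtypes(include='number').columns.tolist()

print(cat_cols)
print(num_cols)
```

Note that an identifier column such as `enrollee_id` ends up in the numeric group even though it carries no numeric meaning, so the automatic split still needs a manual review.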

### Missing Values

In [None]:
pd.DataFrame({
    '%': round(df.isnull().sum()*100/len(df),2)
})

According to the table above, several features have a serious missing-value problem. We will handle the missing values as we dive into each feature below.
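A sorted view can make it easier to see which features are affected most. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the percentages it prints are not the ones from this dataset.

```python
import pandas as pd

# Toy frame with made-up values; None marks a missing entry.
toy = pd.DataFrame({
    'gender': ['Male', None, 'Female', None],
    'company_size': [None, None, '<10', None],
    'training_hours': [36, 47, 83, 12],
})

# isna().mean() gives the fraction of missing entries per column.
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```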

## Closer Look into the Features

### Target

In [None]:
fig = px.pie(df,
             names='target')
fig.update_traces(textinfo='percent+label')
fig.update_layout(showlegend=False)
fig.show()

### City

In [None]:
df['city'].describe()

The city feature has many unique values, i.e. from city_1 to city_123. We can't plot every city here since there are too many of them.
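One possible workaround, not used in this notebook, is to keep only the most frequent cities and lump the rest into a single bucket. The sketch below shows the idea on a made-up series; the cutoff `top_n` and the label `'Other'` are hypothetical choices.

```python
import pandas as pd

# Made-up city codes for illustration.
city = pd.Series(['city_103', 'city_103', 'city_103', 'city_21', 'city_21',
                  'city_16', 'city_40'], name='city')

top_n = 2  # hypothetical cutoff
top = city.value_counts().nlargest(top_n).index

# Keep the most frequent cities and replace every other value with 'Other'.
city_grouped = city.where(city.isin(top), 'Other')
print(city_grouped.value_counts())
```

The grouped column has at most `top_n + 1` distinct values, which is small enough for a pie chart or a grouped histogram.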

### City Development Index

It is reasonable to guess that one motivation for changing jobs is to move to a more developed city. If this is true, we will see more people from cities with a development index below the Mid category expecting a job change. For simplicity, let's bin the city development index into five groups.

In [None]:
df['city_development_index'] = pd.cut(df['city_development_index'], 5, labels=['Lowest', 'Lower', 'Mid', 'Higher', 'Highest'])

In [None]:
fig = px.histogram(df,
                   title='City Development Index',
                   x='city_development_index',
                   color='target',
                   barmode='group',
                   category_orders={'city_development_index':['Lowest', 'Lower', 'Mid', 'Higher', 'Highest']})
fig.update_layout(legend={'orientation':'h'})
fig.show()

The plot supports our hypothesis: many people from the least developed cities (Lower and Lowest) expect a job change!
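Because `pd.cut` with an integer creates equal-width bins, the groups can hold very different numbers of people, so raw counts may be hard to compare. A rough sketch of how to inspect the bin edges and the share of job changers per bin follows; the frame `toy` and its values are made up, and `cdi` is a hypothetical short name for the development index.

```python
import pandas as pd

# Made-up development index values and targets.
toy = pd.DataFrame({
    'cdi': [0.45, 0.50, 0.62, 0.70, 0.80, 0.90, 0.92, 0.95],
    'target': ['CHANGE', 'CHANGE', 'CHANGE', 'STAY', 'STAY', 'STAY', 'CHANGE', 'STAY'],
})
labels = ['Lowest', 'Lower', 'Mid', 'Higher', 'Highest']

# retbins=True also returns the edges of the five equal-width bins.
groups, edges = pd.cut(toy['cdi'], 5, labels=labels, retbins=True)
toy['cdi_group'] = groups
print(edges)

# Share of people expecting a job change within each bin.
rate = (toy['target'] == 'CHANGE').groupby(toy['cdi_group'], observed=False).mean()
print(rate)
```

Comparing shares instead of counts keeps a large bin from dominating the picture just because it contains more people.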

### Gender

We will replace missing values with 'MISSING' so we can see the proportion of missing values in the plot.

In [None]:
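# Assign back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about and may not apply to df.
df['gender'] = df['gender'].fillna('MISSING')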

In [None]:
fig = px.pie(df,
             names='gender')
fig.update_traces(textinfo='percent+label')
fig.update_layout(showlegend=False)
fig.show()

This is very unexpected! Male makes up 69% of our data, Female only 6.46%, and 23.5% of the values are missing.

In [None]:
fig = px.histogram(df,
                   title='Job Change by Gender',
                   x='gender',
                   color='target',
                   barmode='group',
                   category_orders={'gender':['Male', 'Female', 'Other', 'MISSING']})
fig.update_layout(legend={'orientation':'h'})
fig.show()

From the plot above we can see that about one third of the people in each gender group expect a job change.
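Shares like this are easier to read from a table than from bar heights. One way to compute them is a row-normalized cross-tabulation, sketched here on made-up data (`toy`), so the numbers printed are not the ones from this dataset.

```python
import pandas as pd

# Made-up gender and target values for illustration.
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female',
               'MISSING', 'MISSING', 'MISSING'],
    'target': ['STAY', 'STAY', 'STAY', 'CHANGE', 'STAY', 'CHANGE',
               'STAY', 'STAY', 'CHANGE'],
})

# normalize='index' divides each row by its total, so every row sums to 1.
share = pd.crosstab(toy['gender'], toy['target'], normalize='index')
print(share.round(2))
```

The same call works for any other categorical feature by swapping the first argument.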

### Relevant Experience

For simplicity, we rename the values 'Has relevent experience' and 'No relevent experience' to 'Yes' and 'No', respectively.

In [None]:
df['relevent_experience'] = df['relevent_experience'].replace({'Has relevent experience':'Yes', 'No relevent experience':'No'})

In [None]:
fig = px.pie(df,
             title='Has relevant experience?',
             names='relevent_experience')
fig.update_traces(textinfo='percent+label')
fig.update_layout(showlegend=False)
fig.show()

In [None]:
fig = px.histogram(df,
                   title='Job Change by Relevant Experience',
                   x='relevent_experience',
                   color='target',
                   barmode='group')
fig.update_layout(legend={'orientation':'h'})
fig.show()

About a quarter of the people with relevant experience expect a job change, while about half of the people with no relevant experience do. I guess that is why people attend online training: they want a career shift.

### Enrolled University

In [None]:
df['enrolled_university'].describe()

In [None]:
df['enrolled_university'] = df['enrolled_university'].fillna("MISSING")

In [None]:
fig = px.pie(df,
             title='Enrolled University?',
             names='enrolled_university')
fig.update_traces(textinfo='percent+label')
fig.update_layout(showlegend=False)
fig.show()

In [None]:
fig = px.histogram(df,
                   title='Job Change by Enrollment in University',
                   x='enrolled_university',
                   color='target',
                   barmode='group')
fig.update_layout(legend={'orientation':'h'})
fig.show()

### Education Level

In [None]:
df['education_level'].describe()

In [None]:
df['education_level'] = df['education_level'].fillna("MISSING")

In [None]:
fig = px.pie(df,
             title='Education Level',
             names='education_level')
fig.update_traces(textinfo='percent+label')
fig.update_layout(showlegend=False)
fig.show()

### Major Discipline

In [None]:
df['major_discipline'].describe()

In [None]:
df['major_discipline'] = df['major_discipline'].fillna('MISSING')

In [None]:
fig = px.pie(df,
             title='Major Discipline',
             names='major_discipline')
fig.update_traces(textinfo='percent+label')
fig.update_layout(showlegend=False)
fig.show()

### Work Experience

In [None]:
df['experience'].describe()

In [None]:
df['experience'] = df['experience'].fillna('MISSING')

In [None]:
fig = px.histogram(df,
                   title='Work Experience',
                   x='experience',
                   color='target',
                   barmode='group',
                   category_orders={'experience':['MISSING','<1','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','>20']})
fig.update_layout(legend={'orientation':'h'})
fig.show()

### Company Size

In [None]:
df['company_size'].describe()

In [None]:
df['company_size'].unique()

In [None]:
# Normalize the size labels so they match the category order used in the plot below.
df['company_size'] = df['company_size'].replace({'10/49':'10-49','100-500':'100-499','10000+':'>10000'})

In [None]:
df['company_size'] = df['company_size'].fillna('MISSING')

In [None]:
fig = px.histogram(df,
                   title='Company Size',
                   x='company_size',
                   color='target',
                   barmode='group',
                   category_orders={'company_size':['MISSING','<10','10-49','50-99','100-499','500-999','1000-4999','5000-9999','>10000']})
fig.update_layout(legend={'orientation':'h'})
fig.show()

### Company Type

In [None]:
df['company_type'].describe()

In [None]:
df['company_type'].unique()

In [None]:
df['company_type'] = df['company_type'].fillna('MISSING')

In [None]:
fig = px.histogram(df,
                   title='Company Type',
                   x='company_type',
                   color='target',
                   barmode='group')
fig.update_layout(legend={'orientation':'h'})
fig.show()

### Last New Job

In [None]:
df['last_new_job'].describe()

In [None]:
df['last_new_job'].unique()

In [None]:
df['last_new_job'] = df['last_new_job'].fillna('MISSING')

In [None]:
fig = px.histogram(df,
                   title='Last New Job',
                   x='last_new_job',
                   color='target',
                   barmode='group',
                   category_orders={'last_new_job':['MISSING','never','1','2','3','4','>4']})
fig.update_layout(legend={'orientation':'h'})
fig.show()

### Training Hours

In [None]:
df['training_hours'].describe()

In [None]:
df['training_hours'] = df['training_hours'].astype(int)

In [None]:
fig = px.histogram(df,
                   title='Training Hours',
                   color='target',
                   barmode='group',
                   x='training_hours')
fig.update_layout(legend={'orientation':'h'})
fig.show()