# **Data Scientists Job Churn Analysis**<br>
**Introduction**:<br>Moving from one organization to another is a persistent problem in the private sector, and the trend is particularly visible in IT companies. Data Scientist is often called the golden job of the 21st century, and data scientists are in high demand. The following data was collected by the IBM HR department for churn analysis before employees were enrolled in a training session.

Feature Description<br>
<table style="width:100%; text-align:left">
  <tr><th>Feature Name</th><th>Type</th><th>Description</th></tr>
  <tr>
    <th>enrollee_id</th>
    <td>Continuous</td>
    <td>Unique identifier that identifies an individual<br> Values:<i>Too many</i></td>
  </tr>
  <tr>
    <th>city</th>
    <td>Categorical</td>
    <td>
      City the employee is from
      <br> Values:<i>Too many</i>
    </td>
  </tr>
  <tr>
    <th>city_development_index</th>
    <td>Continuous</td>
    <td>Development index of the city (scaled)</td>
  </tr>
  <tr>
    <th>gender</th>
    <td>Categorical</td>
    <td>
      Gender of candidate<br>
      Values: Male, Female, Other
    </td>
  </tr>
  <tr>
    <th>relevent_experience</th>
    <td>Categorical</td>
    <td>
      Relevant experience of candidate<br>
      Values: Has relevant experience, no relevant experience
    </td>
  </tr>
  <tr>
    <th>enrolled_university</th>
    <td>Categorical</td>
    <td>
      Type of university course enrolled in, if any<br>
      Values: Full-time, Part-time, no_enrollment
    </td>
  </tr>
  <tr><th>education_level</th>
    <td>Categorical</td>
    <td>Education level of candidate<br>
      Values: PhD, Graduate, Masters, High School, Primary School
    </td>
  </tr>
  <tr><th>major_discipline</th>
    <td>Categorical</td>
    <td>Education major discipline of candidate<br>
      Values: No major, STEM, Humanities, Business Degree, Arts, Others
    </td>
  </tr>
  <tr><th>experience</th>
    <td>Categorical</td>
    <td>Candidate total experience in years<br>
      Values: 2,3,4,5,6,7,8,9,>20
    </td>
  </tr>
  <tr><th>company_size</th>
    <td>Categorical</td>
    <td>Number of employees in the current employer's company
    </td>
  </tr>
  <tr><th>company_type</th>
    <td>Categorical</td>
    <td>Type of current employer
    </td>
  </tr>
  <tr><th>last_new_job</th>
    <td>Continuous</td>
    <td>Difference in years between previous job and current job</td>
  </tr>
  <tr><th>training_hours</th>
    <td>Continuous</td>
    <td>Training hours completed</td>
  </tr>
  <tr><th>target</th>
    <td>Categorical</td>
    <td>0 – Not looking for job change,<br> 1 – Looking for a job change</td>
  </tr>
</table>

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import LabelEncoder,MinMaxScaler
from imblearn.combine import SMOTETomek,SMOTEENN
from imblearn.under_sampling import TomekLinks
from imblearn.under_sampling import NearMiss
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, RFE
from sklearn.metrics import classification_report, accuracy_score
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth',None)

## Input file, read contents

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')

## Exploratory Data Analysis

Check the data type of the columns

In [None]:
data.info()

Print first 3 rows

In [None]:
data.head(3)

Check how the target class is distributed

In [None]:
data['target'].value_counts().plot(kind='bar')

*It is clearly evident that the dataset is heavily imbalanced*
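
To put a number on the imbalance, the share of each class can be computed with `value_counts(normalize=True)`. The sketch below uses a small made-up `toy_target` column (an assumption for illustration, not the real data); in the notebook the same calls would be applied to `data['target']`.

```python
import pandas as pd

# Hypothetical toy target column; the notebook would use data['target'] instead
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name='target')

# Share of each class as a fraction of all rows
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# Ratio between the largest and the smallest class share
imbalance_ratio = class_share.max() / class_share.min()
print('Imbalance ratio:', imbalance_ratio)
```

A ratio well above 1 means that plain accuracy can look good even for a model that mostly predicts the majority class.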

### Data Imputation

Check percentage of null values

In [None]:
data.isnull().sum()/(1.0*data.shape[0])*100

Remove the enrollee_id column: it is only a unique identifier and carries no information about the target

In [None]:
data.drop(['enrollee_id'],axis=1,inplace=True)

Impute the remaining (categorical) columns with their most frequent value

In [None]:
# Fill missing values column by column; assigning the result back avoids chained
# in-place operations, which newer pandas versions may not apply to the original frame
columns_to_impute = ['gender','major_discipline','enrolled_university','last_new_job',
                     'education_level','experience','company_type','company_size']
for column in columns_to_impute:
    data[column] = data[column].fillna(data[column].mode()[0])
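
As a minimal sketch of what mode imputation does, the example below applies it to a made-up single-column frame (`toy` and its values are assumptions for illustration). `mode()` ignores missing values by default and returns the most frequent entries in sorted order, so `[0]` picks the first of them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({'gender': ['Male', np.nan, 'Female', 'Male', np.nan]})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Most frequent non-missing value
most_frequent = toy['gender'].mode()[0]

# Replace the missing entries and assign the result back to the column
toy['gender'] = toy['gender'].fillna(most_frequent)
print(toy)
```

Mode imputation keeps every row, at the cost of inflating the most frequent category; columns with a large share of missing values are affected the most.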

### Insights from data

In [None]:
figure, axes = plt.subplots(6,2,figsize=(22,13))
col = 0
row=0
for column in data.columns:
  if column not in ['target']:
    if data.dtypes[column] not in [int,float]:
      # Categorical column: plot counts per target class for the 10 most frequent values
      ax = sns.countplot(data=data,x=column,hue='target',ax=axes[row,col],order=data[column].value_counts().iloc[:10].index)
      bars = ax.patches
      half = int(len(bars)/2)
      left_bars = bars[:half]
      right_bars = bars[half:]

      for left, right in zip(left_bars, right_bars):
          height_l = left.get_height()
          height_r = right.get_height()
          total = height_l + height_r

          ax.text(left.get_x() + left.get_width()/2., height_l + 40, '{0:.0%}'.format(height_l/total), ha="center")
          ax.text(right.get_x() + right.get_width()/2., height_r + 40, '{0:.0%}'.format(height_r/total), ha="center")
      
    else:
      sns.histplot(data=data,x=column,hue='target',ax=axes[row,col])
    col = 1 if col ==0 else 0
    row = row+1 if col == 0 else row
      
figure.tight_layout()

The following are the inferences drawn from the graph above:
<table>
<tr><th>Column</th><th>Inference</th></tr>
<tr><td>City</td><td>Data Scientists from city_21 are more likely to leave the company</td></tr>
<tr><td>City Development Index</td><td>Employees from cities with a development index of around 6.3 are more likely to leave</td></tr>
<tr><td>Gender</td><td>All genders are equally likely to leave</td></tr>
<tr><td>Relevant Experience</td><td>Employees who have no relevant experience are more likely to leave</td></tr>
<tr><td>Enrolled University</td><td>Employees enrolled in a full-time course are very likely to leave <br>- a full-time course leaves little room for a job, so this is expected</td></tr>
<tr><td>Education Level</td><td>Employees who have a Graduate degree are more likely to leave</td></tr>
<tr><td>Major Discipline</td><td>Employees from a STEM discipline are more likely to leave</td></tr>
<tr><td>Experience</td><td>Employees with fewer years of experience are more likely to leave</td></tr>
<tr><td>Company Size</td><td>Employees in a company of size 50-99 are more likely to leave</td></tr>
<tr><td>Company Type</td><td>Employees from a private company are more likely to leave</td></tr>
<tr><td>Last New Job</td><td>Employees whose last job change was 1 year ago are more likely to leave</td></tr>
<tr><td>Training Hours</td><td>Employees with fewer training hours are more likely to leave</td></tr>
</table>

Most of the features in the dataset are categorical and need to be encoded.<br> We use a LabelEncoder here.

In [None]:
col_encoder_map={}
for col in data.select_dtypes(include='O').keys():
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    col_encoder_map[col]=le
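
Keeping the fitted encoders, as `col_encoder_map` does, makes it possible to translate codes back into the original labels later. A minimal sketch with a made-up list of values (`toy_values` is an assumption for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values of a categorical column
toy_values = ['STEM', 'Arts', 'STEM', 'Humanities']

le = LabelEncoder()
codes = le.fit_transform(toy_values)

# Classes are stored in sorted order; each code is an index into classes_
print(list(le.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```

The codes follow alphabetical order, so they impose an ordering that may not mean anything for nominal features; tree-based models are usually less sensitive to this than linear models.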

### Check for Correlation Between the features

In [None]:
correlation_matrix = data.corr()
plt.figure(figsize=(20,10))
# Mask out the upper triangle (including the diagonal) so each pair appears once
mask = np.zeros_like(correlation_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(correlation_matrix,mask=mask,annot=True)

### Bi-variate analysis

In [None]:
# pairplot creates its own figure, so no explicit plt.figure() call is needed
sns.pairplot(data, corner=True, hue = 'target', palette='Accent')

### Feature Scaling - Scaling all features to range [0,1]

In [None]:
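scaler=MinMaxScaler(feature_range=(0,1))
train, test = train_test_split(data,test_size=0.30,stratify=data['target'])
X_train, X_test = train.drop(columns=['target']), test.drop(columns=['target'])
y_train, y_test = train['target'], test['target']
# Fit the scaler on the training split only and reuse it for the test split,
# so that test-set statistics do not leak into preprocessing.
# fit_transform/transform return new arrays; they are stored under separate names here,
# and the cells below keep working on the unscaled X_train / X_test.
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train),columns=X_train.columns,index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test),columns=X_test.columns,index=X_test.index)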
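
Min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below reproduces that formula with NumPy on made-up numbers (`train_hours` and `test_hours` are assumptions for illustration), taking the minimum and maximum from the training values only.

```python
import numpy as np

# Hypothetical training and test values of one feature
train_hours = np.array([10.0, 20.0, 50.0])
test_hours = np.array([5.0, 30.0])

# Statistics come from the training values only
lo, hi = train_hours.min(), train_hours.max()

train_scaled = (train_hours - lo) / (hi - lo)
test_scaled = (test_hours - lo) / (hi - lo)

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the training minimum and maximum, they can fall slightly outside [0, 1]; that is expected and is not a sign of a bug.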

### Feature Sampling

Since the dataset is heavily imbalanced, we combine oversampling of the minority class (SMOTE) with undersampling of the majority class (Tomek links) to tackle it.

In [None]:
# SMOTETomek oversamples the minority class with SMOTE, then removes Tomek links from the majority class
smk = SMOTETomek(random_state=42,tomek=TomekLinks(sampling_strategy='majority'))
X_train_over_sample,y_train_over_sample=smk.fit_resample(X_train,y_train)
X_train_over_sample = pd.DataFrame(X_train_over_sample, columns=X_train.columns)
# NearMiss undersampling is currently disabled; the resampled data is passed through unchanged
nr = NearMiss()
X_train_under_sample, y_train_under_sample= X_train_over_sample,y_train_over_sample
# X_train_under_sample, y_train_under_sample= nr.fit_resample(X_train_over_sample,y_train_over_sample)
# X_train_under_sample = pd.DataFrame(X_train_under_sample, columns=X_train.columns)

In [None]:
np.unique(y_train, return_counts=True)

In [None]:
np.unique(y_train_under_sample, return_counts=True)
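
The oversampling half of this step creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified NumPy illustration of that interpolation idea only; it is not imbalanced-learn's implementation, and `minority` and `synthetic_sample` are hypothetical names introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features each
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthetic_sample(samples, index, k, rng):
    """Interpolate between samples[index] and one of its k nearest neighbours."""
    distances = np.linalg.norm(samples - samples[index], axis=1)
    # Position 0 of the sorted order is the sample itself (distance 0), so skip it
    neighbours = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbours)]
    gap = rng.random()  # random position along the segment, in [0, 1)
    return samples[index] + gap * (neighbour - samples[index])

# One synthetic point per original minority sample
new_points = np.array([synthetic_sample(minority, i, k=2, rng=rng)
                       for i in range(len(minority))])
print(new_points)
```

Each synthetic point lies on the segment between two existing minority samples, which is why label-encoded categorical features can end up with fractional values after resampling.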

### Feature Selection

#### Variance Threshold

In [None]:
vt=VarianceThreshold(1)
vt.fit(X_train_under_sample, y_train_under_sample)
X_train_under_sample.columns[vt.get_support()]

In [None]:
for a,b in zip(X_train_under_sample,vt.variances_):
  print(a," -> ",b)
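
The variance of a feature depends on its scale, so a fixed threshold treats scaled and unscaled columns very differently. A minimal sketch on a made-up frame (`toy` and its column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Two hypothetical features with the same pattern but different scales
toy = pd.DataFrame({
    'scaled_feature': [0.0, 1.0, 0.0, 1.0],   # variance 0.25
    'raw_feature': [0.0, 10.0, 0.0, 10.0],    # variance 25
})

selector = VarianceThreshold(threshold=1)
selector.fit(toy)

# Only features whose variance exceeds the threshold are kept
kept = list(toy.columns[selector.get_support()])
print(dict(zip(toy.columns, selector.variances_)))
print(kept)
```

A feature bounded in [0, 1] has a variance of at most 0.25, so a threshold of 1 can only ever keep features that are on a larger scale.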

#### Recursive Feature Elimination

In [None]:
rfe=RFE(estimator=RandomForestClassifier(),n_features_to_select=7,step=1)
rfe.fit(X_train_under_sample, y_train_under_sample)
X_train_under_sample.columns[rfe.get_support()]

In [None]:
for a,b in zip(X_train_under_sample,rfe.ranking_):
  print(a," -> ",b)

#### Select K Best - using chi2

In [None]:
skb=SelectKBest(chi2,k=7)
skb.fit(X_train_under_sample, y_train_under_sample)
X_train_under_sample.columns[skb.get_support()]
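
The chi-squared score measures how strongly a non-negative feature depends on the class label, and `SelectKBest` keeps the `k` highest-scoring features. A small sketch on made-up data (`toy_X`, `toy_y` and the column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features: one follows the label, one is constant
toy_X = pd.DataFrame({
    'informative': [0, 0, 0, 5, 5, 5],
    'uninformative': [1, 1, 1, 1, 1, 1],
})
toy_y = [0, 0, 0, 1, 1, 1]

selector = SelectKBest(chi2, k=1)
selector.fit(toy_X, toy_y)

# Name of the single best-scoring feature
selected = list(toy_X.columns[selector.get_support()])
print(selected)
```

`chi2` rejects negative inputs, which is one reason it pairs naturally with label-encoded or min-max-scaled features.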

#### Using Random Forest to describe feature importances

In [None]:
rf=RandomForestClassifier(n_estimators=38)
rf.fit(X_train_under_sample,y_train_under_sample)
# One row per feature, indexed by feature name
feature_importance=pd.DataFrame({'Importance':rf.feature_importances_},
                                index=pd.Index(X_train_under_sample.columns,name='Feature'))
display(feature_importance.sort_values('Importance',ascending=False))

Majority Voting for Selecting best features

In [None]:
feature_selection_algo_features_map={'None':X_train_under_sample.columns,
                                     'Variance Threshold':X_train_under_sample.columns[vt.get_support()],
                                     'Recursive Feature Elimination':X_train_under_sample.columns[rfe.get_support()],
                                     'Select K Best': X_train_under_sample.columns[skb.get_support()],
                                     'Random Forest Classifier': list(feature_importance.sort_values('Importance',ascending=False).iloc[:6].index)
                                     }

In [None]:
col_vote_count_map={}
for algo in feature_selection_algo_features_map:
  if algo not in ['None']:
    for col in feature_selection_algo_features_map[algo]:
      if col in col_vote_count_map.keys():
        col_vote_count_map[col]+= 1
      else:
        col_vote_count_map[col]=1
lists=sorted(col_vote_count_map.items(), key= lambda x: x[1],reverse=True)
x, y = zip(*lists)
fig = plt.figure(figsize=(15,5))
plt.bar(x, y)
fig.suptitle('Features ordered by vote count across feature selection algorithms')
plt.xlabel('Feature Name')
plt.ylabel('Vote Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
selected_columns =[details[0] for details in sorted(col_vote_count_map.items(), key= lambda x: x[1],reverse=True)[:5]]
selected_columns
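
The voting step can also be expressed with `collections.Counter`. The sketch below uses made-up selections (the feature lists in `selections` are assumptions for illustration, not the notebook's results) to show how the vote counts and the top features are derived.

```python
from collections import Counter

# Hypothetical output of three feature selection methods
selections = {
    'Variance Threshold': ['city', 'experience', 'training_hours'],
    'Recursive Feature Elimination': ['city', 'city_development_index', 'experience'],
    'Select K Best': ['city', 'company_size', 'experience'],
}

# One vote per method that selected the feature
votes = Counter(feature for features in selections.values() for feature in features)
print(votes)

# Features with the most votes; ties keep first-seen order
top_two = [feature for feature, _ in votes.most_common(2)]
print(top_two)
```

When several features tie at the cut-off, the result depends on insertion order, so it is worth checking the vote counts before fixing the number of features to keep.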

## Model Building

In [None]:
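def evaluate_model(model,X,y,columns,params=None,cv=3):
  # Grid-search `params` on the given feature subset and return the best
  # cross-validated score (accuracy for classifiers) with the matching parameters
  grid_search = GridSearchCV(estimator=model,param_grid=params or {},n_jobs=-1,verbose=1,cv=cv)
  grid_search.fit(X[columns],y)
  return grid_search.best_score_, grid_search.best_params_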

### Decision Tree Classifier

#### Grid Search (hyper-parameter tuning) for the best parameters for each feature selection algorithm

In [None]:
dt_training_summary=pd.DataFrame(columns=['Feature Selection Algo','Best Score','Best Parameters'])
dt_training_summary = dt_training_summary.set_index('Feature Selection Algo')
dt=DecisionTreeClassifier()
params={"criterion":("gini","entropy"),"max_depth":[1,3,5,7,9,12,15,19,22,25]}
cv=StratifiedKFold(n_splits=3,shuffle=True)
for algo, columns in feature_selection_algo_features_map.items():                                
  best_score, best_params = evaluate_model(model=dt,X=X_train_under_sample,y=y_train_under_sample,params=params,columns=columns,cv=cv)
  dt_training_summary.loc[algo]=[best_score, best_params]
display(dt_training_summary.sort_values('Best Score',ascending=False))

In [None]:
selected_algo = dt_training_summary.sort_values('Best Score',ascending=False).iloc[0].name
selected_algo = dt_training_summary.sort_values('Best Score',ascending=False).iloc[1].name if selected_algo == 'None' else selected_algo
print('For Decision Tree Classifier, Feature Selection Algorithm: ',selected_algo," works the best")

#### Train the model using selected features and parameters

In [None]:
dt=DecisionTreeClassifier(criterion='entropy',max_depth=12)
cv=StratifiedKFold(n_splits=3,shuffle=True)
for train_index, test_index in cv.split(X_train_under_sample[selected_columns], y_train_under_sample):
  x_train_fold, x_test_fold = X_train_under_sample.iloc[train_index], X_train_under_sample.iloc[test_index]
  y_train_fold, y_test_fold = y_train_under_sample[train_index], y_train_under_sample[test_index]
  dt.fit(x_train_fold[selected_columns],y_train_fold)
  predictions=dt.predict(x_test_fold[selected_columns])
  print('Accuracy:',accuracy_score(y_test_fold,predictions))
# Evaluate on the held-out test set; scikit-learn metrics expect (y_true, y_pred)
predictions=dt.predict(X_test[selected_columns])
print(classification_report(y_test,predictions))
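
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy is symmetric in its arguments, but precision and recall are not: swapping the arguments swaps the two. A minimal sketch on made-up labels (`y_true` and `y_pred` below are assumptions for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 true positive, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# With the arguments swapped, "precision" is really the recall
swapped_precision = precision_score(y_pred, y_true)

print(precision, recall, swapped_precision)
```

The same applies to `classification_report`, whose per-class precision and recall columns trade places when the arguments are reversed.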

*The Decision Tree model seems to fit the data fairly well, with a small variance of about **5%**.*

### Logistic Regression

#### Grid Search for the best parameters for each feature selection algorithm

In [None]:
lr_training_summary=pd.DataFrame(columns=['Feature Selection Algo','Best Score','Best Parameters'])
lr_training_summary = lr_training_summary.set_index('Feature Selection Algo')
params={
  "dual":[False,True],
  "fit_intercept":[False,True],
  "max_iter":[1000,1250],
  "penalty":("l1","l2"),
  "tol":[0.0001,0.01,1.0],
  "warm_start":[False,True]
  }
lr=LogisticRegression()
cv=StratifiedKFold(n_splits=3,shuffle=True)
for algo, columns in feature_selection_algo_features_map.items():                                
  best_score, best_params = evaluate_model(model=lr,X=X_train_under_sample,y=y_train_under_sample,params=params,columns=columns,cv=cv)
  lr_training_summary.loc[algo]=[best_score, best_params]
display(lr_training_summary.sort_values('Best Score',ascending=False))

In [None]:
selected_algo = lr_training_summary.sort_values('Best Score',ascending=False).iloc[0].name
selected_algo = lr_training_summary.sort_values('Best Score',ascending=False).iloc[1].name if selected_algo == 'None' else selected_algo
print('For Logistic Regression, Feature Selection Algorithm: ',selected_algo," works the best")

#### Train the model using selected features and parameters

In [None]:
lr=LogisticRegression(dual=False,fit_intercept=True,max_iter=1000,penalty='l2',tol=0.0001,warm_start=False)
cv=StratifiedKFold(n_splits=3,shuffle=True)
for train_index, test_index in cv.split(X_train_under_sample[selected_columns], y_train_under_sample):
  x_train_fold, x_test_fold = X_train_under_sample.iloc[train_index], X_train_under_sample.iloc[test_index]
  y_train_fold, y_test_fold = y_train_under_sample[train_index], y_train_under_sample[test_index]
  lr.fit(x_train_fold[selected_columns],y_train_fold)
  # Predict with the logistic regression model (lr), not the decision tree fitted earlier
  predictions=lr.predict(x_test_fold[selected_columns])
  print('Accuracy:',accuracy_score(y_test_fold,predictions))
# Evaluate on the held-out test set; scikit-learn metrics expect (y_true, y_pred)
predictions=lr.predict(X_test[selected_columns])
print(classification_report(y_test,predictions))

*The Logistic Regression model seems to fit the data fairly well, with a medium variance of about **7-9%**.*