## Feature Description

enrollee_id: Unique ID for candidate

city: City code

city_development_index: Development index of the city (scaled)

gender: Gender of candidate

relevent_experience: Relevant experience of candidate

enrolled_university: Type of university course enrolled in, if any

education_level: Education level of candidate

major_discipline: Education major discipline of candidate

experience: Candidate's total experience in years

company_size: Number of employees in current employer's company

company_type: Type of current employer

last_new_job: Difference in years between previous job and current job

training_hours: Training hours completed

target: 0 – Not looking for a job change, 1 – Looking for a job change

I borrowed some ideas from https://www.kaggle.com/cemhansenol98/xgboost-prediction-eda-and-pivottablejs

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
from scipy import stats
from scipy.stats import norm, skew
from scipy.special import boxcox1p
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings('ignore')

In [None]:

# Reading relevant data from the files
hr_train=pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
hr_test=pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
hr_train.head()

In [None]:
print(f'Train_shape: {hr_train.shape}')
print(f'Test_shape:  {hr_test.shape}')

## Data Cleaning

In [None]:
hr_train.columns

In [None]:
hr_train.isnull().sum()

#### Enrollee ID

In [None]:
hr_train['enrollee_id'].nunique()

Enrollee ID is a unique identifier and carries no predictive information, so I'll set it as the index column instead of using it as a feature.

In [None]:
hr_train.set_index('enrollee_id',inplace=True)
hr_test.set_index('enrollee_id',inplace=True)

#### City

In [None]:
# hr_train['city']=hr_train['city'].apply(lambda x:x.split('_')[1])
# hr_test['city']=hr_test['city'].apply(lambda x:x.split('_')[1])

#### Gender

In [None]:
hr_train['gender'].value_counts(normalize=True)

About 90% of the values in this column are Male, so all missing values will be assumed to be Male.

In [None]:
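# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
hr_train['gender']=hr_train['gender'].fillna('Male')
hr_test['gender']=hr_test['gender'].fillna('Male')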
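
The same fill can be written without hard-coding the label, by reading the most frequent value from the training column. Below is a minimal sketch on a toy frame; `toy_train`, `toy_test` and their values are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for hr_train / hr_test (values are invented for illustration)
toy_train = pd.DataFrame({'gender': ['Male', 'Male', np.nan, 'Female', 'Male']})
toy_test = pd.DataFrame({'gender': [np.nan, 'Female']})

# Most frequent category, computed on the training data only (NaN is ignored)
most_common = toy_train['gender'].mode()[0]

# Apply the training mode to both frames
toy_train['gender'] = toy_train['gender'].fillna(most_common)
toy_test['gender'] = toy_test['gender'].fillna(most_common)

print(most_common)
print(toy_train['gender'].isnull().sum(), toy_test['gender'].isnull().sum())
```

Taking the mode from the training frame and reusing it on the test frame keeps both sets imputed with the same value.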

#### Enrolled University

In [None]:
hr_train['enrolled_university'].value_counts(normalize=True)

About 73% of the values in this column are no_enrollment, so missing values will be imputed with this value.

In [None]:
hr_train['enrolled_university']=hr_train['enrolled_university'].fillna('no_enrollment')
hr_test['enrolled_university']=hr_test['enrolled_university'].fillna('no_enrollment')

#### Education level

In [None]:
hr_train['education_level'].value_counts(normalize=True)

Here the levels are more evenly distributed, so missing values will be forward-filled from the previous valid observation.

In [None]:
# .ffill() replaces the deprecated fillna(method='ffill')
hr_train['education_level']=hr_train['education_level'].ffill()
hr_test['education_level']=hr_test['education_level'].ffill()
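
Forward fill copies the last non-missing value down the column, so the result depends on row order rather than on how similar two candidates are. The sketch below uses an invented toy series to show this, including the case where the first row is missing.

```python
import numpy as np
import pandas as pd

# Toy column (invented values); the first entry is missing
s = pd.Series([np.nan, 'Graduate', np.nan, 'Masters', np.nan])

filled = s.ffill()
print(filled.tolist())

# A leading gap has no earlier value to copy, so it stays missing
print(filled.isnull().sum())
```

In this notebook, `hr_train.dropna` later removes any leftover gaps from the training frame, but no equivalent step is applied to `hr_test`, so it may be worth checking `hr_test.isnull().sum()` after imputation.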

#### Major Discipline

In [None]:
hr_train['major_discipline'].value_counts(normalize=True)

About 88% of the values here are STEM, so the same value will be used to impute missing values.

In [None]:
hr_train['major_discipline']=hr_train['major_discipline'].fillna('STEM')
hr_test['major_discipline']=hr_test['major_discipline'].fillna('STEM')

#### experience

In [None]:
hr_train['experience'].value_counts(normalize=True)

Observations are spread across many classes, so missing values will be forward-filled from the previous valid observation.

In [None]:
hr_train['experience']=hr_train['experience'].ffill()
hr_test['experience']=hr_test['experience'].ffill()

#### Company Size

In [None]:
hr_train['company_size'].value_counts(normalize=True)

Missing observations will be back-filled from the next valid observation.

In [None]:
hr_train['company_size']=hr_train['company_size'].bfill()
hr_test['company_size']=hr_test['company_size'].bfill()

#### company_type

In [None]:
hr_train['company_type'].value_counts(normalize=True)

In [None]:
hr_train['company_type']=hr_train['company_type'].fillna('Pvt Ltd')
hr_test['company_type']=hr_test['company_type'].fillna('Pvt Ltd')

#### last new job

In [None]:
hr_train['last_new_job'].value_counts(normalize=True)

In [None]:
hr_train['last_new_job']=hr_train['last_new_job'].ffill()
hr_test['last_new_job']=hr_test['last_new_job'].ffill()

In [None]:
hr_train.dropna(inplace=True)
hr_train.isnull().sum()

## EDA

In [None]:
hr_train.describe(include='all')

In [None]:
## segregating columns into numeric, ordinal and categorical types

num_cols=list(hr_train.select_dtypes(include=['int64','float64']).dtypes.index)
ord_cols=['experience','education_level','company_size','last_new_job']
cat_cols=[x for x in hr_train.columns if x not in num_cols and x not in ord_cols]
num_cols.remove('target')

In [None]:
# pairplot creates its own figure, so a separate plt.figure call is not needed
sns.pairplot(hr_train,hue='target',markers=['s','o'],diag_kind='hist')

In [None]:
# checking cardinality of the categorical columns
for col in cat_cols:
    print(f'{col}: {hr_train[col].nunique()}')

In [None]:
# clustermap creates its own figure; restrict the correlation to numeric columns
sns.clustermap(hr_train.corr(numeric_only=True),annot=True,figsize=(10,8))
plt.show()

So the correlation between the target and the numerical features is not very strong.

In [None]:
# checking distribution of the target column
sns.countplot(x=hr_train['target'])

#### plotting categorical variables

In [None]:
i=1
plt.figure(figsize=(20,15))
col1=['burlywood','lime','mintcream','aquamarine','turquoise','paleturquoise']
for col in cat_cols:
    plt.subplot(2,3,i)
    labels=list(hr_train[col].value_counts().index)
    values=list(hr_train[col].value_counts().values)
    plt.pie(x=values,autopct='%.1f%%',labels=labels,pctdistance=0.5,colors=col1)
    plt.title(col,fontsize=20)
    i+=1

#### plotting ordinal variables 

In [None]:
i=1
plt.figure(figsize=(20,15))

for col in ord_cols:
    plt.subplot(2,2,i)
    labels=list(hr_train[col].value_counts().index)
    values=list(hr_train[col].value_counts().values)
    plt.pie(x=values,autopct='%.1f%%',labels=labels,pctdistance=0.5,colors=col1)
    plt.title(col,fontsize=20)
    i+=1

### **Encoding**

#### OneHot encoding

In [None]:
oh1=OneHotEncoder(handle_unknown='ignore')
train_cat=oh1.fit_transform(hr_train[cat_cols])
test_cat=oh1.transform(hr_test[cat_cols])
f_names=[j for i in oh1.categories_ for j in i]


#### Ordinal encoding

In [None]:
# Experience
ordinal_experience = {'<1':0, '1':1, '2':2, '3':3, '4':4, '5':5, '6':6, '7':7, '8':8, '9':9, '10':10,
                      '11':11, '12':12, '13':13, '14':14, '15':15, '16':16, '17':17, '18':18, '19':19, '20':20, '>20':21}
hr_train['experience']=hr_train['experience'].map(ordinal_experience)
hr_test['experience']=hr_test['experience'].map(ordinal_experience)

In [None]:
# company size
ordinal_company_size = {'<10':0, '10/49':1, '50-99':2, '100-500':3, '500-999':4, '1000-4999':5, '5000-9999':6, '10000+':7}
hr_train['company_size']=hr_train['company_size'].map(ordinal_company_size)
hr_test['company_size']=hr_test['company_size'].map(ordinal_company_size)

In [None]:
##education level
ordinal_education_level = {'Primary School':0, 'High School':1, 'Graduate':2, 'Masters':3, 'Phd':4}
hr_train['education_level']=hr_train['education_level'].map(ordinal_education_level)
hr_test['education_level']=hr_test['education_level'].map(ordinal_education_level)

In [None]:
## last_new_job
ordinal_last_new_job = {'never':0, '1':1, '2':2, '3':3, '4':4, '>4':5}
hr_train['last_new_job']=hr_train['last_new_job'].map(ordinal_last_new_job)
hr_test['last_new_job']=hr_test['last_new_job'].map(ordinal_last_new_job)
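
`Series.map` with a dictionary returns NaN for any label that is not a key, so a typo in a mapping silently creates missing values. The sketch below shows one way to list such labels; the toy series and the `'5+'` label are hypothetical and only serve the illustration.

```python
import pandas as pd

ordinal_last_new_job = {'never': 0, '1': 1, '2': 2, '3': 3, '4': 4, '>4': 5}

# Toy column (invented); '5+' is a hypothetical label missing from the mapping
toy = pd.Series(['never', '2', '>4', '5+'])

mapped = toy.map(ordinal_last_new_job)

# Labels that were present before mapping but became NaN afterwards
unmapped = toy[mapped.isnull() & toy.notnull()].unique().tolist()
print(mapped.tolist())
print(unmapped)
```

Running the same check on each ordinal column before the original values are overwritten would confirm that every category in the train and test frames is covered.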

In [None]:
ms=MinMaxScaler()
train_ord=ms.fit_transform(hr_train[ord_cols])
test_ord=ms.transform(hr_test[ord_cols])

#### Numeric variables

In [None]:
train_num=ms.fit_transform(hr_train[num_cols])
test_num=ms.transform(hr_test[num_cols])
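
`MinMaxScaler` rescales each column as `(x - min) / (max - min)`, using the minimum and maximum learned in `fit`. The small sketch below, on invented numbers, shows that `transform` reuses the training statistics, so test values outside the training range fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values for illustration
train = np.array([[10.0], [20.0], [50.0]])
test = np.array([[30.0], [70.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=10 and max=50
test_scaled = scaler.transform(test)         # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Note that the notebook refits the same `ms` object on the numeric columns after using it on the ordinal ones; using a separate scaler per column group would keep both fitted states available.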

In [None]:
#Joining the arrays 
feature_names=f_names+ord_cols+num_cols

X=np.concatenate([train_cat.toarray(),train_ord,train_num],axis=1)
y=hr_train['target']

features_test=np.concatenate([test_cat.toarray(),test_ord,test_num],axis=1)



#### Upsampling and train-test-split

In [None]:
sm=SMOTE(random_state=21)
X,y=sm.fit_resample(X,y)

In [None]:
#Train-test-split
# Note: test_size=0.9 holds out 90% of the resampled rows, so only 10% is used for training
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.9,random_state=23)
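
Because SMOTE runs before the split here, synthetic points built from a given row can end up in both `X_train` and `X_test`, which tends to make the held-out score optimistic. A common alternative is to split first and oversample only the training part. The sketch below illustrates that ordering on invented data; it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, and all names in it are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Invented imbalanced data: 80 rows of class 0, 20 rows of class 1
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

# 1. Hold out the evaluation rows first, keeping the class ratio
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=23)

# 2. Oversample the minority class in the training part only
minority = X_tr[y_tr == 1]
majority = X_tr[y_tr == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=21)

X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))

print(np.bincount(y_bal))  # training classes after oversampling
print(np.bincount(y_ev))   # evaluation set keeps the original ratio
```

With this ordering, the evaluation rows keep the original class imbalance, so the reported F1 score reflects the distribution the model would see on new data.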

### Model Building

In [None]:
# Baseline model
lr1=LogisticRegression()
lr1.fit(X_train,y_train)
pred=lr1.predict(X_test)
print(f1_score(y_test,pred))
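
For reference, `f1_score` is the harmonic mean of precision and recall for the positive class. The sketch below recomputes it by hand on invented labels to make the relationship explicit.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Invented labels and predictions for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 0, 0, 1]

p = precision_score(y_true, y_hat)  # TP / (TP + FP)
r = recall_score(y_true, y_hat)     # TP / (TP + FN)
f1_manual = 2 * p * r / (p + r)     # harmonic mean of precision and recall

print(p, r, f1_manual, f1_score(y_true, y_hat))
```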

In [None]:
# Random Forest classifier
rf=RandomForestClassifier(random_state=23,n_jobs=3)

kf=KFold(n_splits=5,shuffle=True,random_state=23)
p_grid={'n_estimators':[300,600],
            'max_depth':[5,7,9],
            'max_features':[0.2,0.6],
            'min_samples_leaf':[5,7,11]
           }
rf_grid=RandomizedSearchCV(rf,param_distributions=p_grid,cv=kf)

In [None]:
%%time
rf_grid.fit(X_train,y_train)

In [None]:
rf_grid.best_params_

In [None]:
pred=rf_grid.predict(X_test)
f1_score(y_test,pred)
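
With no `scoring` argument, `RandomizedSearchCV` ranks candidates by the estimator's default score (accuracy for classifiers), while the notebook reports F1. The sketch below, on synthetic data with a deliberately small hypothetical grid, shows how `scoring='f1'` and `random_state` could be passed so that selection matches the reported metric and the sampled combinations are repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=23)

# Small illustrative grid, not the one used in the notebook
demo_grid = {'n_estimators': [20, 40],
             'max_depth': [3, 5],
             'min_samples_leaf': [5, 7]}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=23),
    param_distributions=demo_grid,
    n_iter=4,            # number of sampled combinations (default is 10)
    scoring='f1',        # select on the same metric that is reported later
    cv=KFold(n_splits=3, shuffle=True, random_state=23),
    random_state=23)     # makes the sampled combinations repeatable
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```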

In [None]:
## catboost
params = {
        'subsample': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }

In [None]:
# Named cat_clf because this is a CatBoost model, not XGBoost
cat_clf = CatBoostClassifier(learning_rate=0.02, n_estimators=600, verbose = False)

In [None]:
folds = 3
param_comb = 5

# Plain (non-stratified) K-fold splitter
kf_cat = KFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(cat_clf, param_distributions=params, n_iter=param_comb, scoring='roc_auc', n_jobs=4, cv=kf_cat, verbose=3, random_state=42)


In [None]:
%%time
random_search.fit(X_train, y_train)

In [None]:
random_search.best_params_

In [None]:
model = CatBoostClassifier(learning_rate=0.02, n_estimators=600, verbose = False, subsample = 1, max_depth = 5)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
print(f1_score(y_test,y_pred))

In [None]:
# Probability of class 0 (not looking for a job change), i.e. that a candidate will work for the company
model.predict_proba(features_test)[:,0]
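
The `[:,0]` slice relies on column 0 of `predict_proba` corresponding to class 0. In scikit-learn classifiers, column `i` matches `classes_[i]`; the sketch below checks this on invented data, with `LogisticRegression` standing in for the CatBoost model (which is assumed, not verified here, to follow the same convention).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented one-feature data; LogisticRegression stands in for the CatBoost model
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Column i of predict_proba corresponds to clf.classes_[i]
col_of_class0 = list(clf.classes_).index(0.0)
print(clf.classes_, col_of_class0)
print(proba[:, col_of_class0].round(2))
```

Looking up the column through `classes_` instead of hard-coding `0` guards against a different label ordering.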

In [None]:
# Feature Importances
f_importances=pd.DataFrame()
f_importances['feature_names']=feature_names
f_importances['Coefficients']=model.feature_importances_
f_importances=f_importances.sort_values(by='Coefficients',ascending=True).set_index('feature_names')


In [None]:
f_importances[f_importances['Coefficients']>=0.1].plot.barh(figsize=(10,8))
plt.xlabel('Feature Importances')
