# HR Analytics notebook

<img src="https://www.digitalvidya.com/wp-content/uploads/2019/05/HR-Analytics.jpg" width=500 height=200>

* **Task type:** classification
* **Models used:** DNN, LGBM
* **Other methods used:** shap

In [None]:
import pandas as pd
import numpy as np

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 1. Import data

In [None]:
train = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')

test = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_test.csv')

In [None]:
train = train.dropna()
test = test.dropna()

# This is the simplest approach. Alternatively, you can impute N/A values (e.g. with the mean, median or an Nth percentile) to avoid discarding rows and distorting the data.
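
As an alternative to dropping rows, here is a minimal imputation sketch. The toy frame and the helper name `impute` are illustrative assumptions, not part of the original notebook; it fills numeric gaps with the median and categorical gaps with the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; column names mirror the dataset,
# the values are made up for illustration.
df = pd.DataFrame({
    'city_development_index': [0.92, np.nan, 0.62, 0.78],
    'training_hours': [36.0, 47.0, np.nan, 8.0],
    'gender': ['Male', None, 'Female', 'Male'],
})

def impute(frame):
    """Fill numeric gaps with the median and categorical gaps with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

df_imputed = impute(df)
print(df_imputed)
```

If you go this route, compute the fill values on the training data only and reuse them for the test data.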

# 2. EDA

In [None]:
# Here I will use a very powerful library which provides almost all the necessary EDA features out-of-the-box.

import pandas_profiling as pp
pp.ProfileReport(train)

**The dataset is quite small. That is convenient for an exercise, but the lack of data can lead to a poorly performing model.**

In [None]:
train['target'].value_counts()

**Another thing: the target is quite imbalanced. About 80% of the target values are '0', while 20% are '1'. Therefore, we need to evaluate the model not only on accuracy, but also on precision & recall (confusion matrix).**
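
To see why accuracy alone is misleading here, consider a minimal sketch on synthetic labels with the same 80/20 split (these are not the real data). A "model" that always predicts the majority class scores well on accuracy while finding none of the positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic labels with the 80/20 split described above (not the real data).
y_true = np.array([0] * 80 + [1] * 20)

# A "model" that ignores its input and always predicts the majority class.
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
# zero_division=0 avoids a warning when no positive predictions are made.
rec = recall_score(y_true, y_majority, zero_division=0)
prec = precision_score(y_true, y_majority, zero_division=0)
print(acc, rec, prec)
```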

# 3. Feature preparation

**Let's look at how the number of Data Scientists who change the job varies across features.**

In [None]:
print(pd.pivot_table(train, values='target',
                    columns=['relevent_experience'], aggfunc=np.sum).T.sort_values('target', ascending=False))

print(pd.pivot_table(train, values='target',
                    columns=['education_level'], aggfunc=np.sum).T.sort_values('target', ascending=False))

print(pd.pivot_table(train, values='target',
                    columns=['enrolled_university'], aggfunc=np.sum).T.sort_values('target', ascending=False))

print(pd.pivot_table(train, values='target',
                    columns=['gender'], aggfunc=np.sum).T.sort_values('target', ascending=False))

print(pd.pivot_table(train, values='target',
                    columns=['major_discipline'], aggfunc=np.sum).T.sort_values('target', ascending=False))

print(pd.pivot_table(train, values='target',
                    columns=['company_type'], aggfunc=np.sum).T.sort_values('target', ascending=False))

**These features show a lot of background-related differences. We will need to encode them manually to improve the model.**

**The number of people who change jobs varies significantly and inconsistently across categories!**

In [None]:
from sklearn.preprocessing import LabelEncoder

# I do this manually to give the model an explicit ordering for education, experience and similar features.

# However, later we will see the feature importances report in SHAP and notice interesting results.
experience_dict = {'Has relevent experience' : 1,
             'No relevent experience': 0}

education_dict = {'Graduate' : 2,
             'Masters' : 1,
             'Phd' : 0}

enrollment_dict = {'no_enrollment' : 2,
             'Full time course' : 1,
             'Part time course' : 0}

gender_dict = {'Male' : 2,
             'Female' : 1,
             'Other' : 0}

discipline_dict = {'STEM' : 5,
             'Humanities' : 4,
             'Business Degree' : 3,
             'Other' : 2,
             'No Major' : 1,
             'Arts' : 0 }

company_dict = {'Pvt Ltd' : 5,
             'Funded Startup' : 4,
             'Public Sector' : 3,
             'Early Stage Startup' : 2,
             'NGO' : 1,
             'Other' : 0 }


# Train encoding
le = LabelEncoder()
train['gender'] = train['gender'].map(gender_dict)
train['relevent_experience'] = train['relevent_experience'].map(experience_dict)
train['education_level'] = train['education_level'].map(education_dict)
train['enrolled_university'] = train['enrolled_university'].map(enrollment_dict)
train['major_discipline'] = train['major_discipline'].map(discipline_dict)
train['experience'] = le.fit_transform(train['experience'].astype(str))
train['company_size'] = le.fit_transform(train['company_size'].astype(str))
train['company_type'] = train['company_type'].map(company_dict)
train['last_new_job'] = le.fit_transform(train['last_new_job'].astype(str))
#train['city'] = le.fit_transform(train['city'].astype(str))

train = pd.get_dummies(train, columns=['city']) # I do one-hot encoding here, since a higher value of the encoded feature is not related to the 'importance' of a feature.

# Test encoding
test['gender'] = test['gender'].map(gender_dict)  # same mapping as for train, so the codes mean the same thing
test['relevent_experience'] = test['relevent_experience'].map(experience_dict)
test['education_level'] = test['education_level'].map(education_dict)
test['enrolled_university'] = test['enrolled_university'].map(enrollment_dict)
test['major_discipline'] = test['major_discipline'].map(discipline_dict)
test['experience'] = le.fit_transform(test['experience'].astype(str))
test['company_size'] = le.fit_transform(test['company_size'].astype(str))
test['company_type'] = test['company_type'].map(company_dict)
test['last_new_job'] = le.fit_transform(test['last_new_job'].astype(str))
#test['city'] = le.fit_transform(test['city'].astype(str))


test = pd.get_dummies(test, columns=['city'])
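
Two details of the encoding above are worth a closer look: label-encoding `experience` as strings sorts it alphabetically, and one-hot encoding train and test separately can produce different column sets. The sketch below is one possible way to handle both, on toy data. The frame names, the city values and the assumption that `experience` takes the values '<1', '1' to '20' and '>20' are illustrative.

```python
import pandas as pd

# Toy stand-ins for train/test; the city and experience values are illustrative.
train_df = pd.DataFrame({'city': ['city_21', 'city_103', 'city_16'],
                         'experience': ['<1', '5', '>20']})
test_df = pd.DataFrame({'city': ['city_103', 'city_999'],
                        'experience': ['>20', '5']})

# Assumption: experience is recorded as '<1', '1'..'20', '>20'.
# An explicit ordinal map keeps the numeric order that string sorting would lose.
experience_order = ['<1'] + [str(i) for i in range(1, 21)] + ['>20']
experience_map = {label: rank for rank, label in enumerate(experience_order)}
train_df['experience'] = train_df['experience'].map(experience_map)
test_df['experience'] = test_df['experience'].map(experience_map)

# One-hot encode separately, then force test onto the train columns:
# cities unseen in train are dropped, cities missing in test are filled with 0.
train_enc = pd.get_dummies(train_df, columns=['city'])
test_enc = pd.get_dummies(test_df, columns=['city'])
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```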

In [None]:
#train = train.drop('enrollee_id', axis=1)
#test = test.drop('enrollee_id', axis=1)

In [None]:
train['city_development_index'].value_counts()

In [None]:
X = train.drop('target', axis=1)
y = train['target']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33)

# Further in this notebook we will use 'val' for validation dataset, since we have all the corresponding data and columns unlike in the 'test' dataset.
# Test dataset does not contain the target and thus we will not be able to measure the performance of the model.
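
Because the target is imbalanced, it can help to keep the class ratio identical in both parts of the split and to fix the random seed for reproducibility. A minimal sketch on synthetic data follows. `X_demo`, `y_demo` and the seed value are assumptions for illustration; `stratify` and `random_state` are standard `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an 80/20 target, standing in for X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 800 + [1] * 200)

# stratify keeps the share of positives the same in both parts;
# random_state makes the split repeatable.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42)

print(y_tr.mean(), y_va.mean())
```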

In [None]:
#X_test = test
#y_test = test['target']

In [None]:
train['city_development_index'].value_counts()

# 4. Model building
## 4.1. Deep Neural Network

**We will use a very basic neural network here with 5 layers.**

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, ThresholdedReLU

**First, let's create a normalization layer for input data.**

In [None]:
norm = tf.keras.layers.LayerNormalization(
    epsilon=0.001,
    center=True,
    scale=True
)

**Here you can add whatever metrics you are interested in.**

In [None]:
metrics = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.AUC(name='auc'),
]

In [None]:
model = Sequential()

model.add(norm)
model.add(ThresholdedReLU(theta=10)) # Theta is a threshold: a value x passes through unchanged if x > theta, otherwise the output is 0.
model.add(Dense(200, activation='relu'))
model.add(Dense(100, activation='softmax'))
model.add(Dense(1, activation='sigmoid'))
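
To make the role of `theta` concrete, here is a small NumPy re-implementation of the thresholded ReLU rule. It is an illustration of the formula, not the Keras layer itself, and the sample values are made up.

```python
import numpy as np

def thresholded_relu(x, theta):
    """Pass values strictly above theta unchanged, zero out the rest."""
    x = np.asarray(x, dtype=float)
    return np.where(x > theta, x, 0.0)

activations = np.array([-2.0, 0.5, 3.0, 10.0, 12.5])
print(thresholded_relu(activations, theta=10))
```

With `theta=10`, only values above 10 survive. Since the layer sits right after a normalization layer, it may be worth checking how many inputs actually exceed the threshold.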

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=metrics)  # use the metrics list defined above
#model.summary()

In [None]:
model.fit(X_train, y_train,
          epochs=10) # A higher N of epochs doesn't improve the performance since the dataset is small.

In [None]:
model.evaluate(X_val, y_val)

**Not a stellar performance. Let's try Gradient Boosting.**

**To interpret the results of the Neural network, you can use:**

1) feature permutation;

2) SHAP library (see further);

3) LIME library.
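
The first option, feature permutation, can be sketched without any extra library. The helper `permutation_importance_manual` below is a hypothetical name, and the demo uses synthetic data with a logistic regression in place of the neural network. It measures how much accuracy drops when one column at a time is shuffled.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def permutation_importance_manual(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = accuracy_score(y, predict_fn(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link to the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - accuracy_score(y, predict_fn(X_perm))
    return drops / n_repeats

# Synthetic demo: only the first column carries signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X_demo, y_demo)

importances = permutation_importance_manual(clf.predict, X_demo, y_demo)
print(importances)
```

For the Keras model, `predict_fn` would need to turn probabilities into labels, for example `lambda X: (model.predict(X) > 0.5).astype(int).ravel()`.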

## 4.2. Gradient Boosting

In [None]:
#conda install -c conda-forge lightgbm 
from lightgbm import LGBMClassifier

In [None]:
# You can play with hyperparameters; pay special attention to num_leaves, max_depth and n_estimators.
# Note that max_depth=1 restricts every tree to a single split (2 leaves), so num_leaves=10 has no effect here.
lgbm = LGBMClassifier(objective='binary', num_leaves=10, learning_rate=0.05,
                      max_depth=1, n_estimators=50, boosting_type='goss')
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_val)

#cross_val_score(lgbm, X_test, y_test, cv=3)

In [None]:
pd.DataFrame(lgbm.feature_importances_, X_train.columns)

In [None]:
from sklearn.metrics import f1_score, confusion_matrix, recall_score, precision_score, accuracy_score

#confusion_matrix(y_test, y_pred)
print('Accuracy: %f, \nRecall: %f \nPrecision: %f'
      % (accuracy_score(y_val, y_pred), recall_score(y_val, y_pred), precision_score(y_val, y_pred)))
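
The confusion matrix mentioned in the EDA section shows where these scores come from. A minimal sketch with hand-made labels follows. In the notebook the inputs would be `y_val` and `y_pred`; the arrays below are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration; in the notebook these would be y_val and y_pred.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()
recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(tn, fp, fn, tp, recall, precision)
```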

# 6. Conclusion

## 6.1 Feature importances

In [None]:
import shap

X_importance = X_train

explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(X_importance)
shap.summary_plot(shap_values, X_importance)

## 6.2 Further steps

**Apparently, the most important feature in this task is the city development index. This is not ideal, because it dominates the other features.**

**As the performance of both the DNN and the LGBM model is far from perfect, the next steps might be as follows:**

1. Bootstrap the dataset to make it more balanced (see **Step 7**).

2. Add new features and engineer features based on the most important ones.

3. Experiment with neural networks: try recurrent networks or add different layers.

4. Use other conventional ML models and/or boosting libraries (e.g. CatBoost).



# 7. Bootstrapping

In [None]:
train[train['target'] == 0]['city_development_index'].describe()

In [None]:
train[train['target'] == 1]['city_development_index'].describe()

In [None]:
cdi = pd.DataFrame(train['city_development_index'].value_counts())
cdi.head(10)

**We need to take samples of the DataFrame with target=1 and pick several slices of the dataset with underrepresented 'city_development_index' values (with indices between 0.4 and 0.7).**

**Why? Because cities with lower indices have more Data Scientists who change jobs, and they are poorly represented in the original dataset.**

In [None]:
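def change(x):
    # Ignores x and returns a random development index in [0.4, 0.8) with 3 decimals.
    return np.random.randint(400, 800) / 1000

def change2(x):
    # Ignores x and returns a random encoded experience level between 0 and 20.
    return np.random.randint(0, 21)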

In [None]:
# Slice 1

insert1 = train[train['city_development_index'] == 0.897].sample(frac=1)
insert1['experience'] = insert1['experience'].apply(lambda x: change2(x))
#insert1['city_development_index'] = insert1['city_development_index'].apply(lambda x: change(x))

In [None]:
# Slice 2

insert2 = train[train['city_development_index'] == 0.926].sample(frac=1)
insert2['experience'] = insert2['experience'].apply(lambda x: change2(x))
#insert2['city_development_index'] = insert2['city_development_index'].apply(lambda x: change(x))

In [None]:
dfs = [train, insert1, insert2]
train_new = pd.concat(dfs)

In [None]:
y_train_new = train_new['target']
X_train_new = train_new.drop(['target'], axis=1)

In [None]:
lgbm2 = LGBMClassifier()
lgbm2.fit(X_train_new, y_train_new)
y_pred2 = lgbm2.predict(X_val)

print('Accuracy: %f, \nRecall: %f \nPrecision: %f'
      % (accuracy_score(y_val, y_pred2), recall_score(y_val, y_pred2), precision_score(y_val, y_pred2)))

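**Now we observe a spike in both accuracy and precision, so bootstrapping looks effective on this particular dataset. Keep in mind that `train_new` is built from the full `train` frame, which includes the validation rows, and that `lgbm2` uses default hyperparameters, so part of the gain may come from these two differences rather than from bootstrapping alone. Even though we have used only 2 samples, we can take it further and improve the model performance.**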
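
A common way to balance the classes without touching the validation data is to split first and oversample only the training part. The sketch below does this on synthetic data with `sklearn.utils.resample`. The frame `data`, its values and the seed are assumptions for illustration, and plain resampling with replacement is swapped in for the slice-based approach used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the encoded train frame (values are made up).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'city_development_index': rng.uniform(0.4, 0.95, size=1000),
    'experience': rng.integers(0, 22, size=1000),
    'target': np.array([0] * 800 + [1] * 200),
})

# Split first, so the validation rows are never seen during resampling.
fit_part, val_part = train_test_split(
    data, test_size=0.33, stratify=data['target'], random_state=42)

# Draw minority rows with replacement until both classes are the same size.
majority = fit_part[fit_part['target'] == 0]
minority = fit_part[fit_part['target'] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
fit_balanced = pd.concat([majority, minority_up])

print(fit_balanced['target'].value_counts())
```

A model fitted on `fit_balanced` and scored on `val_part` gives an estimate that is not inflated by rows shared between the two sets.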