# Plan:

We will treat each column in two phases:
### Data preprocessing:
* Imputing missing values.
* Handling categorical data.
    
### Data Visualization

Then we will test several models and choose the one with the best performance:
* Model creation: Logistic Regression, SVM, Gradient Boosting, KNN, Random Forest, and XGBoost classifiers.
* Hyperparameter tuning.
* Use the model to predict the target column in the test set.

Let's get started. First, let's import our usual libraries:

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
print("Setup complete")

Let's import our train and test data.

In [None]:
#training data
train = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_train.csv")
#testing data
test = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_test.csv')

Let's take a closer look at our data.

In [None]:
train.head()

In [None]:
test.head()

At first glance we can see that most of the data is categorical, and that several columns have missing values.
Let's take a look at the number of missing values in each column:

In [None]:
print("Train data missing values: \n",train.isna().sum())

In [None]:
print("Test data missing values: \n",test.isna().sum())
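
Raw counts are easier to compare when expressed as a share of the rows. The following is a minimal sketch on a made-up toy frame (the column values are placeholders, not the real data); the same expression can be applied to `train` or `test`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "company_size": ["50-99", np.nan, np.nan, np.nan],
    "training_hours": [36, 47, 83, 52],
})

# isna().mean() gives the fraction of missing rows per column;
# multiply by 100 for a percentage and sort with the worst column first.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```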

# City:

In [None]:
train["city"].unique()

We can see that every city is identified by a number with the prefix "city_", so first we have to remove the prefix and then convert the column from object type to integer.

In [None]:
train['city'] = train['city'].map(lambda row: row.replace('city_',''))
test['city'] = test['city'].map(lambda row: row.replace('city_',''))

In [None]:
train['city'] = train['city'].astype(int)
test['city'] = test['city'].astype(int)
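
As an alternative to `map` with a lambda, the prefix can also be stripped with the vectorized string accessor. This is only a sketch on a few toy values; it should give the same result as the two cells above.

```python
import pandas as pd

# Toy city labels in the same "city_<number>" format.
city = pd.Series(["city_103", "city_40", "city_21"])

# regex=False treats "city_" as a literal string, then we cast to integer.
city_num = city.str.replace("city_", "", regex=False).astype(int)
print(city_num.tolist())
```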

### The result:

In [None]:
print(train["city"].unique())
print("We have {} unique values.".format(train["city"].nunique()))

In [None]:
plt.figure(figsize=(16,8))
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
g = sns.histplot(train.city, kde=False, color="red")
g = (g.set(xlim=(0,185),xticks=range(0,190,10)))
plt.xlabel("City Number")
plt.ylabel("Count")
plt.show()

In [None]:
print("Most common cities are:\n",train['city'].value_counts())

In [None]:
test['city'].value_counts()

We can see that city number **103** has the most candidates.

In [None]:
train.sort_values(by='city')

Here we can see that each city has its own city_development_index, so the index carries no information beyond the city number and deleting this column won't make any difference to the model. Before dropping it, we can look at the development index of the city with the most candidates:

In [None]:
train.loc[train.city == 103,'city_development_index']

So the city with the most candidates is a well-developed city. But is there a relationship between the city development index and the chance that a candidate is looking for another job?

In [None]:
# int8 only holds values up to 127, and city numbers go higher, so use int16
train['city'] = train['city'].astype(np.int16)

In [None]:
sns.lineplot(x='target', y='city_development_index',data=train)
plt.show()

Candidates from cities with a low development index tend to look for a job change, and vice versa. Now let's drop this column from both the train and test data.

In [None]:
train = train.drop(labels='city_development_index', axis=1)
test = test.drop(labels='city_development_index', axis=1)
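
The claim that each city has a single development index can be checked directly before a column like this is dropped. Below is a small sketch on invented rows (the index values are placeholders); on the real data the same `groupby` would be run on `train` before the drop.

```python
import pandas as pd

# Toy rows: each city repeats the same index value.
toy = pd.DataFrame({
    "city": [103, 103, 40, 40, 21],
    "city_development_index": [0.92, 0.92, 0.776, 0.776, 0.624],
})

# Count the distinct index values seen for each city.
n_index_per_city = toy.groupby("city")["city_development_index"].nunique()

# If every city has exactly one value, the index is fully determined by the city.
is_redundant = bool((n_index_per_city == 1).all())
print(is_redundant)
```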

# Gender:

We clearly have to encode the categorical data and take care of the missing data. There are many ways to handle missing gender values, such as replacing them with the most common gender or deleting those rows.
I prefer to fill the missing gender values with "Other", since some candidates may identify as non-binary.
First let's take a look at our gender column:

In [None]:
plt.figure(figsize=(12,6))
sns.violinplot(x='gender', y='target', palette='Set2', data=train)
plt.show()

It looks like more men are not looking for a job change, but we can't actually conclude that from this violin plot, since most of the candidates are men.

In [None]:
train['gender'].hist()
plt.show()
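
Because the groups differ so much in size, comparing rates rather than raw counts gives a fairer picture. The sketch below uses invented toy rows and reports, per gender, the group size and the share of candidates with `target == 1`.

```python
import pandas as pd

# Toy data: many more men than women, as in the real column.
toy = pd.DataFrame({
    "gender": ["Male"] * 6 + ["Female"] * 2,
    "target": [0, 0, 0, 0, 1, 1, 0, 1],
})

# "count" is the group size, "mean" is the share looking for a job change.
summary = toy.groupby("gender")["target"].agg(["count", "mean"])
print(summary)
```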

Now let's fill the missing values with 'Other':

In [None]:
train['gender'] = train['gender'].fillna('Other')
test['gender'] = test['gender'].fillna('Other')

Since we don't have any missing values left in the gender column, let's encode it. For this one I will use sklearn's LabelEncoder:

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
label_encoder = LabelEncoder()
# Fit on the training data only, then reuse the same mapping for the test data
train["gender"] = label_encoder.fit_transform(train["gender"])
test["gender"] = label_encoder.transform(test["gender"])

#### Female : 0
#### Male : 1
#### Other : 2
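
One way to keep the train and test codes consistent is to fit the encoder once on the training column and reuse it, then read the mapping from `classes_`. This is a sketch on toy values; note that `transform` raises an error if the test column contains a label that was not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns; the test column deliberately lacks "Other".
train_gender = pd.Series(["Male", "Other", "Female", "Male"])
test_gender = pd.Series(["Female", "Male"])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_gender)  # learns the label order
test_codes = encoder.transform(test_gender)        # reuses the learned order

# classes_ is sorted, so the position of each label is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```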

Let's have a look at what we have achieved so far:

In [None]:
train.isna().sum()

In [None]:
train.head()

# Relevant experience:

Before making any decision we have to look at the values of this column:

In [None]:
train['relevent_experience'].unique()

In [None]:
test['relevent_experience'].unique()

So it only has two values, and no missing data.

### Encoding:

In [None]:
train["relevent_experience"] = train["relevent_experience"].map({"Has relevent experience":1, "No relevent experience":0})
test["relevent_experience"] = test["relevent_experience"].map({"Has relevent experience":1, "No relevent experience":0})

# Enrolled university:

In [None]:
train['enrolled_university'].unique()

This column has missing values, so let's take care of them before encoding.

In [None]:
sns.countplot(x='enrolled_university', data=train)
plt.show()

In [None]:
sns.countplot(x='enrolled_university', data=test)
plt.show()

#### Most of the candidates had no university enrollment

In [None]:
sns.lineplot(x='enrolled_university', y='target', palette='Set2', data=train)
plt.show()

#### We can see that candidates with no enrollment are less likely to be looking for a job change.

We can't tell whether the missing entries were simply left blank or whether those candidates had no enrollment. We also don't want to create a new value (like 'OTHER'), because it could introduce a pattern that doesn't exist.
I will fill the missing values with the no_enrollment value.

In [None]:
train["enrolled_university"]=train["enrolled_university"].fillna('no_enrollment')
test["enrolled_university"]=test["enrolled_university"].fillna('no_enrollment')
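
A quick way to sanity-check an imputation choice like this is to compare the target rate of the rows with missing values against the rate of each existing category. The sketch below uses invented toy rows, and the `"<missing>"` label is a temporary placeholder used only for the comparison.

```python
import numpy as np
import pandas as pd

# Toy rows with two missing enrollment values.
toy = pd.DataFrame({
    "enrolled_university": ["no_enrollment", "no_enrollment", "Full time course",
                            "Part time course", np.nan, np.nan],
    "target": [0, 0, 1, 0, 0, 1],
})

# Label the missing rows temporarily so they show up as their own group.
rates = (
    toy.assign(enrolled_university=toy["enrolled_university"].fillna("<missing>"))
       .groupby("enrolled_university")["target"]
       .mean()
)
print(rates)
```

If the missing group behaves very differently from `no_enrollment` on the real data, a separate category may be worth reconsidering.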

#### Encode:

In [None]:
# Fit on the training data only, then reuse the same mapping for the test data
train["enrolled_university"] = label_encoder.fit_transform(train["enrolled_university"])
test["enrolled_university"] = label_encoder.transform(test["enrolled_university"])

# Education level:

In [None]:
train['education_level'].unique()

In [None]:
train['education_level'].hist()
plt.show()

Most candidates are graduates.

In [None]:
sns.lineplot(x='education_level', y='target', palette='Set2', data=train)
plt.show()

Candidates with a Graduate or Masters degree are the most likely to look for a job change, while those with a PhD or only primary school education are the least likely.

In [None]:
train["education_level"]=train["education_level"].fillna('Other')
test["education_level"]=test["education_level"].fillna('Other')

In [None]:
# Fit on the training data only, then reuse the same mapping for the test data
train["education_level"] = label_encoder.fit_transform(train["education_level"])
test["education_level"] = label_encoder.transform(test["education_level"])

# Major Discipline:

In [None]:
train['major_discipline'].unique()

In [None]:
train['major_discipline'].hist()

Most of our candidates are STEM majors.

In [None]:
train['major_discipline'] = train['major_discipline'].fillna('Other')
test['major_discipline'] = test['major_discipline'].fillna('Other')

In [None]:
# Fit on the training data only, then reuse the same mapping for the test data
train["major_discipline"] = label_encoder.fit_transform(train["major_discipline"])
test["major_discipline"] = label_encoder.transform(test["major_discipline"])

In [None]:
train.head()

# Experience:

#### The experience variable is an object column holding years of experience, where some values carry a `<` or `>` operator to mark a lower or upper bound, so deleting the operators won't make a big difference.

First we have to convert the column to string.

In [None]:
train['experience'] = train['experience'].astype(str)
test['experience'] = test['experience'].astype(str)

In [None]:
train['experience'] = train['experience'].apply(lambda col: col.replace('>',''))
train['experience'] = train['experience'].apply(lambda col: col.replace('<',''))
test['experience'] = test['experience'].apply(lambda col: col.replace('>',''))
test['experience'] = test['experience'].apply(lambda col: col.replace('<',''))

The symbols are now removed:

In [None]:
train.head()

Let's fill the missing values with 0.

In [None]:
train['experience'] = train['experience'].apply(lambda col: col.replace('nan','0'))

In [None]:
test['experience'] = test['experience'].apply(lambda col: col.replace('nan','0'))

Convert the values to integers.

In [None]:
train['experience'] = pd.to_numeric(train['experience'])

In [None]:
test['experience'] = pd.to_numeric(test['experience'])
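
The same cleaning steps (strip the operators, fill missing values with 0, convert to integers) can be chained in one vectorized expression. This is a sketch on a few toy values, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Toy values with both operators and one missing entry.
experience = pd.Series([">20", "<1", "5", np.nan])

cleaned = (
    experience.str.replace(r"[<>]", "", regex=True)  # drop the < and > symbols
              .pipe(pd.to_numeric)                   # strings to numbers, NaN stays NaN
              .fillna(0)                             # missing experience becomes 0
              .astype(int)
)
print(cleaned.tolist())
```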

# Company size:

In [None]:
train['company_size'].unique()

#### We have 5938 missing values in the company_size column. Before encoding the categorical data we have to handle the missing values.

In [None]:
sns.countplot(y='company_size', data=train)
plt.show()

#### We can see that most candidates work in small companies (between 50 and 500 employees)

Identify each interval with a number:

In [None]:
train['company_size'] = train['company_size'].map({"50-99":0, "<10":1, "10000+":2, "5000-9999":3, "1000-4999":4, "10/49":5, "100-500":6, "500-999":7})
test['company_size'] = test['company_size'].map({"50-99":0, "<10":1, "10000+":2, "5000-9999":3, "1000-4999":4, "10/49":5, "100-500":6, "500-999":7})

In [None]:
train.shape

About 30% of candidates didn't report their company size, so we will put these candidates in a category of their own (8).

In [None]:
train['company_size'] = train['company_size'].fillna(8)
test['company_size'] = test['company_size'].fillna(8)
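
The mapping above assigns codes in no particular order. Since company sizes have a natural order, an ordinal encoding is a possible alternative; the sketch below is an assumption on my part rather than what this notebook uses, and the `-1` code for missing values is likewise just one possible choice.

```python
import numpy as np
import pandas as pd

# Categories from smallest to largest; "10/49" is spelled as in the dataset.
size_order = ["<10", "10/49", "50-99", "100-500", "500-999",
              "1000-4999", "5000-9999", "10000+"]
size_to_rank = {label: rank for rank, label in enumerate(size_order)}

# Toy values with one missing entry.
company_size = pd.Series(["50-99", "<10", np.nan, "10000+"])

# Unmapped (missing) values become NaN, which we then mark with -1.
encoded = company_size.map(size_to_rank).fillna(-1).astype(int)
print(encoded.tolist())
```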

# Company type:

In [None]:
sns.countplot(y='company_type', data=train)
plt.show()

#### Most candidates work in private limited companies (Pvt Ltd)

In [None]:
train['company_type'] = train['company_type'].map({"Pvt Ltd":0, "Funded Startup":1, "Early Stage Startup":2, "Public Sector":3, "NGO":4, "Other":5})
test['company_type'] = test['company_type'].map({"Pvt Ltd":0, "Funded Startup":1, "Early Stage Startup":2, "Public Sector":3, "NGO":4, "Other":5})

In [None]:
plt.figure(figsize=(12,6))
sns.violinplot(x='company_size', y='company_type',data=train)
plt.show()

#### Private limited companies come in all sizes, from fewer than ten people to more than 10,000! So we can't really find a relation between company size and type.
#### We will fill missing values in company_type with 0 (private limited company).

In [None]:
train['company_type'] = train['company_type'].fillna(0)
test['company_type'] = test['company_type'].fillna(0)

# Last new job:

In [None]:
train['last_new_job'].unique()

We assume that missing values are from candidates that had no job.

In [None]:
train['last_new_job'] = train['last_new_job'].fillna('never')
test['last_new_job'] = test['last_new_job'].fillna('never')

In [None]:
plt.figure(figsize=(12,6))
sns.violinplot(x='last_new_job', y='target', data=train)
plt.show()

The fewer years between the previous job and the current one, the more likely a candidate is to look for a job change. The same goes for candidates who have had only one job or are looking for a job for the first time in their career.

In [None]:
train['last_new_job'] = train['last_new_job'].map({"1":1, ">4": 5, "never":0, "4":4, "3":3, "2":2})
test['last_new_job'] = test['last_new_job'].map({"1":1, ">4": 5, "never":0, "4":4, "3":3, "2":2})

In [None]:
g = sns.catplot(y="target",x="last_new_job",data=train, kind="bar")

So we have taken care of all the categorical data and missing values. Let's take a look at what we have done so far:

In [None]:
train.head()

In [None]:
train.dtypes

#### Pre-modeling steps:

In [None]:
pred = train['target']

In [None]:
train.drop(['target'],axis=1,inplace=True)

In [None]:
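# int8 only holds values from -128 to 127, which would corrupt enrollee_id and city; int32 is safe
test = test.astype(np.int32)

In [None]:
train = train.astype(np.int32)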
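
Before casting to a compact integer type, it is worth checking that the values actually fit. The sketch below uses toy city numbers (the city plot earlier sets its x-axis up to 185) and lets pandas pick the smallest safe signed integer type through the `downcast` option of `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Toy city numbers; 180 is larger than the biggest value int8 can hold.
city = pd.Series([21, 103, 180])

int8_max = np.iinfo(np.int8).max  # upper limit of the int8 range
fits_in_int8 = bool(city.max() <= int8_max)

# downcast="integer" chooses the smallest signed integer dtype that holds every value.
safe = pd.to_numeric(city, downcast="integer")
print(int8_max, fits_in_int8, safe.dtype)
```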

# Models testing:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score

Let's try different models and see what works best:

In [None]:
KFold_Score = pd.DataFrame()
classifiers = ['Linear SVM', 'LogisticRegression', 'RandomForestClassifier', 'XGBoostClassifier','GradientBoostingClassifier']
models = [svm.SVC(kernel='linear'),
          LogisticRegression(max_iter = 1000),
          RandomForestClassifier(n_estimators=200, random_state=0),
          xgb.XGBClassifier(n_estimators=100),
          GradientBoostingClassifier(random_state=0)
         ]
j = 0
#for i in models:
    #model = i
    #cv = KFold(n_splits=5, random_state=0, shuffle=True)
    #KFold_Score[classifiers[j]] = (cross_val_score(model, train, np.ravel(pred), scoring = 'accuracy', cv=cv))
    #j = j+1

In [None]:
#mean = pd.DataFrame(KFold_Score.mean(), index=classifiers)
#KFold_Score = pd.concat([KFold_Score,mean.T])
#KFold_Score.index=['Fold 1','Fold 2','Fold 3','Fold 4','Fold 5','Mean']
#KFold_Score.T.sort_values(by=['Mean'], ascending = False)

I commented out the code because it takes a long time to run and commit.

When I ran it, Gradient Boosting gave the best result.
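
To illustrate the comparison without the long runtime, here is a scaled-down sketch on synthetic data from `make_classification`. The data, the reduced model list and the small `n_estimators` values are all assumptions chosen to keep it fast; the scores say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=20, random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

# The same 5-fold split is reused for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = pd.DataFrame({
    name: cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    for name, model in candidates.items()
})

# Mean accuracy per model, best first.
ranking = scores.mean().sort_values(ascending=False)
print(ranking)
```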

# Hyperparameter Tuning:

Let's initialize our model with these parameters:

In [None]:
mymodel = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=10)

We shouldn't use the enrollee_id since it is unique for each candidate.

In [None]:
predictors = [x for x in train.columns if x not in ["enrollee_id"]]

First we have to look for an optimal number of estimators:

In [None]:
param_test1 = {'n_estimators':range(10,100,10)}

Grid search:

In [None]:
from sklearn.model_selection import GridSearchCV
# The iid argument was removed from GridSearchCV in scikit-learn 0.24, so it is not passed here
CV_gbc = GridSearchCV(estimator=mymodel, param_grid=param_test1, scoring='roc_auc', n_jobs=4, cv=5)
CV_gbc.fit(train[predictors],pred)
CV_gbc.best_params_, CV_gbc.best_score_

As you can see, we got 90 as the optimal number of estimators for a learning rate of 0.1. That is the largest value in the range we searched and it is close to 100, so we will increase the learning rate to 0.2. (I tried working with 90 estimators, but when I increased the learning rate I got slightly better results. I won't include the whole process because it would take a lot of time to run.)
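
When the best value lands on the edge of the searched range, it helps to look at the whole score curve rather than only `best_params_`. The sketch below runs a tiny grid on synthetic data (the data and the three-value grid are assumptions made to keep it quick) and flags whether the winner sits at the upper edge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

grid = {"n_estimators": [10, 40, 70]}
search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, random_state=10),
    param_grid=grid, scoring="roc_auc", cv=3,
)
search.fit(X, y)

best = search.best_params_["n_estimators"]
# True means the optimum may lie beyond the range that was searched.
at_upper_edge = best == max(grid["n_estimators"])

# Mean cross-validated AUC for each candidate, in grid order.
mean_scores = dict(zip(grid["n_estimators"], search.cv_results_["mean_test_score"]))
print(best, at_upper_edge, mean_scores)
```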

In [None]:
mymodel0 = GradientBoostingClassifier(learning_rate=0.2, min_samples_split=500,min_samples_leaf=50,max_depth=8,max_features='sqrt',subsample=0.8,random_state=10)

In [None]:
from sklearn.model_selection import GridSearchCV
# The iid argument was removed from GridSearchCV in scikit-learn 0.24, so it is not passed here
CV_gbc0 = GridSearchCV(estimator=mymodel0, param_grid=param_test1, scoring='roc_auc', n_jobs=4, cv=5)
CV_gbc0.fit(train[predictors],pred)
CV_gbc0.best_params_, CV_gbc0.best_score_

The next step is to find the max_depth and min_samples_split.

In [None]:
param_test2 = {'max_depth':range(5,9,1), 'min_samples_split':range(400,1000,100)}
# The iid argument was removed from GridSearchCV in scikit-learn 0.24, so it is not passed here
gbc = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.2, n_estimators=70, max_features='sqrt', subsample=0.8, random_state=10), 
param_grid = param_test2, scoring='roc_auc', n_jobs=4, cv=5)
gbc.fit(train[predictors],pred)
gbc.best_params_, gbc.best_score_

So we got a maximum depth of 6 and a minimum samples split of 700.

Next we tune min_samples_leaf together with min_samples_split:

In [None]:
param_test3 = {'min_samples_split':range(400,1000,100), 'min_samples_leaf':range(30,71,10)}
# The iid argument was removed from GridSearchCV in scikit-learn 0.24, so it is not passed here
gsearch3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.2, n_estimators=70,max_depth=6, max_features='sqrt', subsample=0.8, random_state=10), 
param_grid = param_test3, scoring='roc_auc', n_jobs=4, cv=5)
gsearch3.fit(train[predictors],pred)
gsearch3.best_params_, gsearch3.best_score_

Let's write a function that reports the accuracy, the AUC score and the importance of each variable:

In [None]:
from sklearn.model_selection import cross_validate

AUC represents the probability that a random positive example is positioned to the right of a random negative example, that is, that the model gives it a higher score. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0. [https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc]
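
The probabilistic reading of AUC can be verified by hand on a small example: compare every positive score with every negative score and count how often the positive one is higher. The labels and scores below are toy values; the result is checked against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Difference for every (positive, negative) pair; ties count as half a win.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so both numbers come out as 8/9.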

In [None]:
def modelfit(alg, dtrain, pred, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], pred)

    # Predict on the training set
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:, 1]

    # Print the model report (these two scores are measured on the training data)
    print("\nModel Report")
    print("Accuracy :", metrics.accuracy_score(pred.values, dtrain_predictions))
    print("AUC Score (Train):", metrics.roc_auc_score(pred, dtrain_predprob))

    # Cross-validate only when requested, so cv_score is never used undefined
    if performCV:
        cv_score = cross_validate(alg, dtrain[predictors], pred, cv=cv_folds, scoring='roc_auc')
        print("cv Score: ", np.mean(cv_score['test_score']))

    # Plot the feature importances
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')
        plt.show()

Let's run the function on the model we have so far:

In [None]:
modelfit(gsearch3.best_estimator_, train, pred, predictors)

In [None]:
# max_features cannot exceed the number of predictors, so cap the range there
param_test4 = {'max_features':range(7, len(predictors) + 1, 2)}
# The iid argument was removed from GridSearchCV in scikit-learn 0.24, so it is not passed here
gsearch4 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.2, n_estimators=70,max_depth=6, min_samples_split=800, min_samples_leaf=40, subsample=0.8, random_state=10),
param_grid = param_test4, scoring='roc_auc', n_jobs=4, cv=5)
gsearch4.fit(train[predictors],pred)
gsearch4.best_params_, gsearch4.best_score_


With this we have the final tree-parameters as:

    min_samples_split: 800
    min_samples_leaf: 40
    max_depth: 6
    max_features: 11


The next step is to try different subsample values.

In [None]:
param_test5 = {'subsample':[0.6,0.7,0.75,0.8,0.85,0.9]}
# The iid argument was removed from GridSearchCV in scikit-learn 0.24, so it is not passed here
gsearch5 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.2, n_estimators=70,max_depth=6,min_samples_split=800, min_samples_leaf=40, subsample=0.8, random_state=10,max_features=11),
param_grid = param_test5, scoring='roc_auc', n_jobs=4, cv=5)
gsearch5.fit(train[predictors],pred)
gsearch5.best_params_, gsearch5.best_score_

We got 0.8 as the optimum subsample value.
Now, we need to lower the learning rate and increase the number of estimators to see if we get better results.

In [None]:
gbm_tuned_1 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=140,max_depth=6, min_samples_split=800,min_samples_leaf=40, subsample=0.8, random_state=10, max_features=11)
modelfit(gbm_tuned_1, train, pred, predictors)

We can see a slight improvement in accuracy and cv score. Let's decrease the learning rate and increase the number of estimators one more time.

In [None]:
gbm_tuned_2 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=1200,max_depth=6, min_samples_split=800,min_samples_leaf=40, subsample=0.8, random_state=10, max_features=11)
modelfit(gbm_tuned_2, train, pred, predictors)

In [None]:
gbm_tuned_3 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1500,max_depth=6, min_samples_split=800,min_samples_leaf=40, subsample=0.8, random_state=10, max_features=11)
modelfit(gbm_tuned_3, train, pred, predictors)

In [None]:
gbm_tuned_4 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1800,max_depth=6, min_samples_split=800,min_samples_leaf=40, subsample=0.8, random_state=10, max_features=11)
modelfit(gbm_tuned_4, train, pred, predictors)

Increasing the number of estimators got us a slightly better model.
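
Instead of refitting a new model for every number of estimators, the effect of adding trees can be traced from a single fit with `staged_predict_proba`, which yields the predictions after each boosting stage. The sketch below uses synthetic data and a held-out validation split, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = GradientBoostingClassifier(
    learning_rate=0.05, n_estimators=100, subsample=0.8, random_state=10
)
model.fit(X_fit, y_fit)

# Validation AUC after 1, 2, ..., 100 trees, computed from the single fitted model.
val_auc = [roc_auc_score(y_val, proba[:, 1])
           for proba in model.staged_predict_proba(X_val)]
best_n_trees = int(np.argmax(val_auc)) + 1
print(best_n_trees, max(val_auc))
```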

Let's fit the final model on the predictors only (without enrollee_id).

In [None]:
gbm_tuned_4.fit(train[predictors],pred)

Make predictions on the test set with the same model:

In [None]:
preds = gbm_tuned_4.predict(test[predictors])

Save the results to the task submission file:

In [None]:
# The column name must be exactly 'enrollee_id', without a trailing space
output = pd.DataFrame({'enrollee_id': test.enrollee_id, 'target': preds})
output.to_csv('./sample_submission.csv', index=False)
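
A few cheap checks on the submission frame can catch problems such as stray whitespace in column names or duplicated IDs before the file is written. The IDs and predictions below are toy values, and the checks themselves are only a suggestion.

```python
import pandas as pd

# Toy IDs and predictions standing in for test.enrollee_id and preds.
test_ids = pd.Series([32403, 9858, 31806], name="enrollee_id")
toy_preds = [0.0, 1.0, 0.0]

submission = pd.DataFrame({"enrollee_id": test_ids, "target": toy_preds})

# Column names should match exactly, with no leading or trailing spaces.
clean_names = all(col == col.strip() for col in submission.columns)
# Every candidate should appear once, with one prediction each.
ids_unique = bool(submission["enrollee_id"].is_unique)
no_missing = not bool(submission.isna().any().any())
print(clean_names, ids_unique, no_missing)
```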