# Notebook outline:

1. Data Analysis and Visualization
    - Dataset Information
    - Visualizations
2. Data Preprocessing
    - Data Encoding
    - Deal with Imbalanced Data using SMOTE
3. Model Training and Evaluation
    - Splitting data into train and test set
    - Training Base Models
    - Evaluation of Base Models
    - Hyperparameter Tuning
    - Evaluation of Tuned Models

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

SEED = 0

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Next, import the plotting libraries and silence warnings.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

I will load the dataset with pandas, the standard Python library for working with tabular data.

In [None]:
data = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_test.csv')

# 1. Data Analysis and Visualization

In [None]:
data.head(5)

In [None]:
test.head(3)

## Dataset Information

+ enrollee_id: Unique ID for enrollee
+ city: City code
+ city_development_index: Development index of the city (scaled)
+ gender: Gender of enrollee
+ relevent_experience: Relevant experience of enrollee
+ enrolled_university: Type of university course enrolled in, if any
+ education_level: Education level of enrollee
+ major_discipline: Major discipline of enrollee's education
+ experience: Enrollee's total experience in years
+ company_size: Number of employees in current employer's company
+ company_type: Type of current employer
+ last_new_job: Difference in years between previous job and current job
+ training_hours: Training hours completed
+ target: 0 – Not looking for job change, 1 – Looking for a job change

Next, check the data type of each column.

In [None]:
data.dtypes

**Categorical variables**

We need to deal with categorical variables, that is, columns whose values are not numbers.

A simple way of selecting all categorical columns is by checking their type.

In this dataset, only 4 columns are numerical and the remaining 10 columns are categorical.
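Selecting columns by type can be sketched with `DataFrame.select_dtypes`. The small frame below is made up for illustration and only borrows a few column names from the dataset; it is not the notebook's data.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented).
toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3],
    "city": ["city_1", "city_2", "city_1"],
    "training_hours": [36, 47, 83],
    "target": [1.0, 0.0, 0.0],
})

# Split the column names into numeric columns and everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Here `city` is the only non-numeric column, so it is the only one reported as categorical.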

In [None]:
data.isnull().sum()

... and the data contains a significant number of missing (NaN) values.

## Visualizations

In [None]:
# Class balance of the target (0 = not looking, 1 = looking for a job change)
sns.countplot(x='target', data=data)

Next, let's plot the frequency of each category, split by label, to check whether any feature helps distinguish a target of 0 (not looking for a job change) from 1 (looking for a job change).

In [None]:
#Frequency of each category separated by label
plt.figure(figsize=[15,18])
features = ['gender','relevent_experience','enrolled_university','education_level', 'major_discipline',
       'experience','company_size','company_type','last_new_job']
n=1
for f in features:
    plt.subplot(5,2,n)
    sns.countplot(x=f, hue='target', alpha=0.7, data=data)
    plt.title("Countplot of {}  by target".format(f))
    n=n+1
plt.tight_layout()
plt.show()

From these count plots, no single categorical variable appears to separate the two target values clearly. Furthermore, a correlation coefficient cannot be computed directly between categorical variables and the target, so the categories need to be encoded as numbers first.

In [None]:
np.array(data.columns[data.dtypes != object])

# 2. Data Preprocessing

In [None]:
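# work on copies so the raw frames stay untouched
df_train = data.copy()
df_test = test.copy()

# names of the numeric columns (the last one is 'target', which the test set lacks)
cols=np.array(data.columns[data.dtypes != object])
for i in df_train.columns:
    if i not in cols:
        # cast categorical columns to str; missing values become the string 'nan'
        df_train[i]=df_train[i].map(str)
        df_test[i]=df_test[i].map(str)
# keep only the categorical columns for encoding
df_train.drop(columns=cols,inplace=True)
df_test.drop(columns=np.delete(cols,len(cols)-1),inplace=True)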

In [None]:
df_train.columns

We will assign each categorical value a number, so, for example, the values [A, B, A, F] will map to [0, 1, 0, 2]. To do that we will use LabelEncoder from the sklearn.preprocessing package, as follows.
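As a minimal sketch of what `LabelEncoder` does, the example below encodes a four-item list. The list and the unseen label `"Z"` are illustrative only and do not come from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

values = ["A", "B", "A", "F"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# Codes are the positions of each value in the sorted list of unique labels.
print(list(codes))
print(list(encoder.classes_))

# The mapping can be reversed to recover the original labels.
print(list(encoder.inverse_transform(codes)))

# A label that was not present during fit cannot be transformed.
try:
    encoder.transform(["Z"])
except ValueError as err:
    print("unseen label:", err)
```

The last step matters when an encoder fitted on the training set is reused on a test set: any category that appears only in the test set raises a `ValueError`.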

## Data Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict

# numeric columns, set aside while encoding ('target' is last and absent from the test set)
cols=np.array(data.columns[data.dtypes != object])
# one LabelEncoder per column, created on first access
d = defaultdict(LabelEncoder)

# fit each encoder on the training column, then reuse it on the matching test column
df_train = df_train.apply(lambda x: d[x.name].fit_transform(x))
df_test=df_test.apply(lambda x: d[x.name].transform(x))
# restore the numeric columns alongside the encoded ones
df_train[cols]=data[cols]
df_test[np.delete(cols,len(cols)-1)]=test[np.delete(cols,len(cols)-1)]

Now examine the results, looking at the correlation between the encoded ("pseudo-categorical") variables and the target.

In [None]:
df_train.dtypes

In [None]:
df_test.columns

In [None]:
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
style.use('ggplot')
sns.set_style('whitegrid')
plt.subplots(figsize = (12,7))
# Plot the absolute correlations; mask the upper triangle (taken from seaborn example gallery)
corr = df_train.corr().abs()
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy releases
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, cmap=sns.diverging_palette(20, 220, n=200), annot=True, mask=mask, center=0)
plt.title("Heatmap of all the Features of Train data set", fontsize = 25)

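From the heatmap above, the target shows a comparatively high correlation with city_development_index. Because the heatmap plots absolute values, it shows the strength of this relationship but not its direction, so the signed correlation should be checked before concluding whether candidates from more developed cities are more or less likely to look for a job change.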
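A heatmap of absolute correlations hides the sign, so it can help to look at the signed value as well. The sketch below uses invented numbers purely to show the mechanics; it says nothing about the sign in the real data.

```python
import pandas as pd

# Invented values: in this toy frame the target is 1 for the low-index rows.
toy = pd.DataFrame({
    "city_development_index": [0.92, 0.90, 0.62, 0.55, 0.88, 0.60],
    "target": [0, 0, 1, 1, 0, 1],
})

signed = toy["city_development_index"].corr(toy["target"])
absolute = abs(signed)

# The absolute value reports only the strength; the sign gives the direction.
print("signed:", round(signed, 2))
print("absolute:", round(absolute, 2))
```

On the notebook's own data the same check would be `df_train.corr()['target']` without the `abs` step.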

In [None]:
# visualizing the features with high positive and negative correlation
f, axes = plt.subplots(nrows=3, ncols=3, figsize=(20,15))

f.suptitle('Features With High Correlation', size=35)
sns.boxplot(x="target", y="city", data=df_train, ax=axes[0,0])
sns.boxplot(x="target", y="gender", data=df_train, ax=axes[0,1])
sns.boxplot(x="target", y='relevent_experience', data=df_train, ax=axes[0,2])
sns.boxplot(x="target", y='enrolled_university', data=df_train, ax=axes[1,0])
sns.boxplot(x="target", y='education_level', data=df_train, ax=axes[1,1])
sns.boxplot(x="target", y='company_size', data=df_train, ax=axes[1,2])
sns.boxplot(x="target", y='company_type', data=df_train, ax=axes[2,0])
sns.boxplot(x="target", y='enrollee_id', data=df_train, ax=axes[2,1])
sns.boxplot(x="target", y='training_hours', data=df_train, ax=axes[2,2])

In [None]:
counts = data.target.value_counts()
not_change = counts[0]
change = counts[1]
perc_not_change = not_change*100/ sum(counts)
perc_change = change*100/ sum(counts)
print('There were {} not_change ({:.2f}%) and {} change ({:.2f}%).'.format(not_change, perc_not_change, change, perc_change))

From this we can clearly see that target 0 is in the majority, which will affect our model, so we will use SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic samples for the minority class 1 :)
    

## Deal with Imbalanced Data using SMOTE

In [None]:
X=df_train.drop(columns=['target']).values
y=df_train['target'].values

**Borderline-SMOTE SVM**

Hien Nguyen, et al. suggest an alternative to Borderline-SMOTE in which an SVM algorithm is used instead of KNN to identify misclassified examples on the decision boundary.

Their approach is summarized in the 2009 paper titled “Borderline Over-sampling For Imbalanced Data Classification.” An SVM is used to locate the decision boundary defined by the support vectors, and examples in the minority class that are close to the support vectors become the focus for generating synthetic examples.
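To make "generating synthetic examples" concrete, here is a minimal NumPy sketch of the basic SMOTE interpolation step: a new point is placed somewhere on the segment between a minority sample and one of its nearest minority neighbours. This is not the imbalanced-learn implementation, and it does not reproduce the SVM-based choice of seed points. The helper `synthesize` and the sample points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in 2-D.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.3], [3.0, 3.1]])

def synthesize(seed_idx, points, k=2, rng=rng):
    """Create one synthetic point between a seed and one of its k nearest neighbours."""
    seed = points[seed_idx]
    dists = np.linalg.norm(points - seed, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # skip the seed itself (distance 0)
    neighbour = points[rng.choice(neighbours)]
    lam = rng.random()  # position along the segment, in [0, 1)
    return seed + lam * (neighbour - seed)

new_point = synthesize(0, minority)
print(new_point)
```

In SVM-SMOTE the seeds would be the minority samples near the support vectors rather than an arbitrary index, which is what concentrates the new samples around the decision boundary.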

In [None]:
def oversample(X, y, ss=1):
    from collections import Counter
    from imblearn.over_sampling import SVMSMOTE

    # summarize class distribution
    print("Original class distribution:")
    counter = Counter(y)
    print(counter)

    # transform the dataset (n_jobs is omitted: it is deprecated in recent imbalanced-learn releases)
    X, y = SVMSMOTE(sampling_strategy=ss, random_state=SEED).fit_resample(X, y)

    print("Over sampling done using SVM SMOTE.\nNew class distribution is:")
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)

    return X, y

In [None]:
X, y = oversample(X,y)

# 3. Model Training and Evaluation

In [None]:
# imports for training and evaluation
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV

## Splitting data into train and test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)
print("Data splitting complete")

In [None]:
# helper functions
def evaluate(model, X, y):
    preds = model.predict(X)
    
    labels = [0,1]
    target_names = ["not_change","change"]
    
    cm = confusion_matrix(y, preds)
    cr = classification_report(y, preds, labels=labels, target_names=target_names)
    
    fig, ax = plt.subplots()
    print(cr)
    sns.heatmap(cm, annot=True, fmt="d", cmap='Blues', ax=ax)
    plt.show()
    
    return preds
    
def test_model(model, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, scorer=None):
    model.fit(X_train,y_train)
        
    train_preds = evaluate(model, X_train, y_train)
    test_preds = evaluate(model, X_test, y_test)
    
    return train_preds, test_preds

def plot_preds(model_names, y_true, preds_list, show_value=False, scorer='accuracy'):
    xval = model_names
    if scorer=='accuracy':
        yval = [accuracy_score(y_true, y_pred) for y_pred in preds_list]
    elif scorer in ['f1 score','f1']:
        yval = [f1_score(y_true, y_pred) for y_pred in preds_list]
    else:
        raise ValueError("scorer must be 'accuracy' or 'f1'")
    plt.figure(figsize=(12,6))
    # scores lie in [0, 1]; leave some headroom above the tallest bar
    plt.ylim(min(yval)*0.8, max(yval)*1.1)
    plt.ylabel(scorer)
    plt.xticks(rotation=45)

    # recent seaborn versions require x and y as keyword arguments
    s = sns.barplot(x=xval, y=yval)
    if show_value:
        for x,y in zip(range(len(yval)),yval):
            # small offset keeps the label just above the bar and inside the axes
            s.text(x,y+0.01,round(y,2),ha="center")

In [None]:
train_preds = dict()
test_preds = dict()

## Training Base Models

### A. Logistic Regression

In [None]:
train_preds["LR"],test_preds["LR"] = test_model(LogisticRegression());

### B. SVM - Support Vector Classifier

In [None]:
train_preds["SVC"],test_preds["SVC"] = test_model(SVC());

### C. kNN (k-Nearest Neighbors)

In [None]:
train_preds["KNN"],test_preds["KNN"] = test_model(KNeighborsClassifier());

### D. Random Forest

In [None]:
train_preds["RF"],test_preds["RF"] = test_model(RandomForestClassifier());

### E. LightGBM

In [None]:
train_preds["LGBM"],test_preds["LGBM"] = test_model(LGBMClassifier());

## Evaluation of Base Models

In [None]:
plot_preds(list(test_preds.keys()),y_test,list(test_preds.values()), 1)

## Hyperparameter Tuning

In [None]:
# helper functions
def best_params(model, grid, X_train=X_train, y_train=y_train):
    gscv=GridSearchCV(model,grid,scoring=make_scorer(f1_score),n_jobs=-1, verbose=1)
    grid_search=gscv.fit(X_train,y_train)
    bp = grid_search.best_params_ 
    print("\nBest Params for {}:".format(model))
    for k in bp:
        print(k,":",bp[k])
    print()
    return bp
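The helper above wraps `GridSearchCV` with an F1 scorer. A self-contained sketch of the same pattern on synthetic data follows. The dataset from `make_classification`, the grid values, and the names `X_toy`, `y_toy` and `search` are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Try each value of C with 5-fold cross-validation and keep the best mean F1.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

For a binary target, the string `"f1"` selects the same metric as `make_scorer(f1_score)`; `best_score_` is the mean cross-validated score of the winning combination, not a test-set score.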

In [None]:
tuned_train_preds = dict()
tuned_test_preds = dict()

### A. Logistic Regression

In [None]:
model = LogisticRegression()

solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1]

grid = dict(solver=solvers,penalty=penalty,C=c_values)

tuned_train_preds["LR"],tuned_test_preds["LR"] = test_model(LogisticRegression(**best_params(model,grid)));

### B. SVM

In [None]:
model = SVC()

C = [1.0, 0.1, 0.01, 0.05, 0.001]

grid = dict(C=C)

tuned_train_preds["SVC"],tuned_test_preds["SVC"] = test_model(SVC(**best_params(model,grid)));

### C. K-Nearest Neighbors

In [None]:
model = KNeighborsClassifier()

n_neighbors = [9,11,13,15]
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']

grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)

tuned_train_preds["KNN"],tuned_test_preds["KNN"] = test_model(KNeighborsClassifier(**best_params(model,grid)));

### D. Random Forest

In [None]:
model = RandomForestClassifier()

n_estimators = [50, 100, 500]
max_features = ['auto','sqrt', 'log2']
max_depth = [5,8,10,None]
min_samples_split = [3,5,7,9]
grid = dict(n_estimators=n_estimators,# max_features=max_features,
            max_depth=max_depth,min_samples_split=min_samples_split)

tuned_train_preds["RF"],tuned_test_preds["RF"] = test_model(RandomForestClassifier(**best_params(model,grid)));

### E. LightGBM

In [None]:
model = LGBMClassifier()

n_estimators = [40, 80, 160]
learning_rate = [0.01, 0.05, 0.1, 0.5]
max_depth = [5,7,9]
subsample = [0.5,0.7,0.9]  # note: LightGBM applies subsample only when subsample_freq > 0
grid = dict(n_estimators=n_estimators,learning_rate=learning_rate,
            max_depth=max_depth,subsample=subsample)

tuned_train_preds["LGBM"],tuned_test_preds["LGBM"] = test_model(LGBMClassifier(**best_params(model,grid)));

## Evaluation of Tuned Models

In [None]:
plot_preds(list(tuned_test_preds.keys()),y_test,list(tuned_test_preds.values()), 1)