# Importing required libraries

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
from matplotlib import pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Loading + Inspecting

In [None]:
df = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')


In [None]:
df.head()


In [None]:
df.columns

In [None]:
df.info()

Setting aside enrollee_id, which holds no particular meaning and can be discarded from the get-go, we have 13 columns including the target, so 12 features to work with. Most of our features are categorical.

Before proceeding, as usual let's check if there are any null values.

In [None]:
df.isnull().sum()

So, the numerical columns have no missing values; all NaN values are confined to the categorical columns.

Now let's look at the target in detail. A '1' indicates the employee made a career change, and a '0' indicates otherwise.

Clearly the dataset is imbalanced: people who don't pull off a career change outnumber those who do by almost 3 to 1. We will get back to this issue later.
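The imbalance can be checked directly with `value_counts`. Below is a minimal sketch on a toy target column (`toy_target` is a hypothetical stand-in for `df['target']`, not the real data):

```python
import pandas as pd

# Toy stand-in for df['target']; the real column comes from aug_train.csv
toy_target = pd.Series([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])

counts = toy_target.value_counts()                # absolute class counts
shares = toy_target.value_counts(normalize=True)  # class proportions
imbalance_ratio = counts.loc[0.0] / counts.loc[1.0]  # majority / minority

print(counts)
print(shares)
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```

On the real dataframe the same two calls on `df['target']` give the counts and proportions for each class.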

# EDA + Feature Selection

Let's plot some graphs and examine how our features relate to the target and to each other. We start with training_hours, as it's a fairly straightforward numerical column.

In [None]:
sns.histplot(x=df['training_hours'],kde=True)

The distribution is right-skewed with a long tail (high skew and kurtosis) rather than normal. The mean is about 65 hours and the median 47 hours.

In [None]:
print("Skewness" , df['training_hours'].skew())
print("Kurtosis" , df['training_hours'].kurt())

We apply a log transformation to reduce the skew.

In [None]:
df['training_hours'] = np.log(df['training_hours'])

Because the skew was this high, it is the log-transformed column that we will pass into the model.
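To see why a log transformation helps with a right-skewed column, here is a small sketch on synthetic data. The log-normal sample and its parameters are assumptions chosen for illustration, not the real training_hours values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "hours" (log-normal); NOT the real training_hours
toy_hours = pd.Series(rng.lognormal(mean=3.8, sigma=0.8, size=5000))

skew_before = toy_hours.skew()
skew_after = np.log(toy_hours).skew()  # the log compresses the long right tail

print(f"Skewness before log: {skew_before:.2f}")
print(f"Skewness after log:  {skew_after:.2f}")
```

Note that `np.log` requires strictly positive values; for a column that can contain zeros, `np.log1p` is a common alternative.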

Now, city_development_index is an interesting feature. Is there any correlation between how developed a city is and how likely people are to switch careers?

In [None]:
sns.countplot(x=df['city_development_index'])

Most rows come from cities at either the low end or the very high end of the index, which is analogous to the real world. Let's see whether people living in highly developed cities switch careers more often.

In [None]:
df_city = df.query('city_development_index>0.50')

In [None]:
df_city['target'].value_counts()

So out of 4777 positive values, 4740 belong to cities with a development index above 0.50. This suggests that people from more developed cities may be more likely to switch careers, although raw counts also depend on how many employees live in such cities. We will keep this feature as it is.
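Since raw counts depend on group sizes, a per-group switching rate is a fairer comparison. The sketch below uses a tiny hypothetical dataframe (`toy`) with the same two column names; the values are made up for illustration:

```python
import pandas as pd

# Hypothetical rows; only the column names match the real dataset
toy = pd.DataFrame({
    "city_development_index": [0.45, 0.48, 0.62, 0.62, 0.92, 0.92, 0.92, 0.92],
    "target":                 [1,    0,    1,    1,    0,    0,    1,    0],
})

# Flag rows above the 0.50 threshold, then compare counts and rates per group
toy["high_cdi"] = toy["city_development_index"] > 0.50
summary = toy.groupby("high_cdi")["target"].agg(
    switchers="sum",   # number of positive targets in the group
    total="count",     # group size
    rate="mean",       # share of positive targets in the group
)
print(summary)
```

In this toy example the high-index group has three times as many switchers, yet the switching rate is identical in both groups, which is why the rate column is worth checking on the real data.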

In [None]:
sns.countplot(x=df['gender'])

Most of the values unfortunately belong to one class :/ so it's probably not going to be a good feature, and further exploration is unwarranted.

Now let's move on to education_level, which intuitively seems like a good feature to have. Academically accomplished and studious employees might tend to chase new avenues.

In [None]:
df['education_level'].value_counts()

So we have 4 varying levels of education, with most employees being at the Graduate level. It would be interesting to check whether employees with a higher education level than that have a higher propensity for switching.

In [None]:
sns.countplot(x=df['education_level'],hue=df['target'])

Doesn't seem like it. It is still an important feature to have, as being "Graduate" or above still dramatically increases your chance of making a career switch.

Okay, let's move on to 'relevent_experience', which seems important for obvious reasons.

In [None]:
df['relevent_experience'].value_counts()

Wow, an overwhelming number of the employees have relevant experience. Let's see if having relevant experience is associated with switching.

In [None]:
sns.countplot(x=df['relevent_experience'],hue=df['target'])

Disappointingly, having relevant experience in the field is not a good indicator of whether an employee will switch. Nevertheless, it is a weak feature that can still aid in prediction.

Now let's look at the total 'experience' an employee has.

In [None]:
df['experience'].value_counts()

Let's use a mapping to replace the labels containing special characters ('<1' and '>20') with plain integers.

In [None]:
experience_map = {
    '<1'      :    0,
    '1'       :    1, 
    '2'       :    2, 
    '3'       :    3, 
    '4'       :    4, 
    '5'       :    5,
    '6'       :    6,
    '7'       :    7,
    '8'       :    8, 
    '9'       :    9, 
    '10'      :    10, 
    '11'      :    11,
    '12'      :    12,
    '13'      :    13, 
    '14'      :    14, 
    '15'      :    15, 
    '16'      :    16,
    '17'      :    17,
    '18'      :    18,
    '19'      :    19, 
    '20'      :    20, 
    '>20'     :    21
} 
df['experience'] = df['experience'].map(experience_map)


In [None]:
sns.histplot(x=df['experience'],hue=df['target'])

Let's look at last_new_job now, which is the difference in years between the previous and the current job.

In [None]:
last_new_job_map = {
    'never'        :    0,
    '1'            :    1, 
    '2'            :    2, 
    '3'            :    3, 
    '4'            :    4, 
    '>4'           :    5
}
df['last_new_job'] = df['last_new_job'].map(last_new_job_map)

In [None]:
sns.countplot(x=df['last_new_job'],hue=df['target'])

Again, no discernible trend.

Now let's move on to company_size and company_type.

In [None]:
df['company_size'].value_counts()

Special characters again, so once again we replace them with a map.

In [None]:
company_size_map = {
    '<10'          :    0,
    '10/49'        :    1, 
    '100-500'      :    2, 
    '1000-4999'    :    3, 
    '10000+'       :    4, 
    '50-99'        :    5, 
    '500-999'      :    6, 
    '5000-9999'    :    7
}
df['company_size'] = df['company_size'].map(company_size_map)

In [None]:
sns.countplot(x=df['company_size'],hue=df['target'])

No clear trend again. 
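One thing to keep in mind when reading the plot above: the integer codes in company_size_map follow the order in which the labels were listed, not headcount (for example '50-99' is coded 5 while '1000-4999' is coded 3). If an ordinal encoding is wanted, a size-ordered mapping is an option. The dictionary name `company_size_ordered` and the toy series below are hypothetical, shown only as a sketch:

```python
import pandas as pd

# Hypothetical alternative to company_size_map: codes increase with headcount
company_size_ordered = {
    '<10': 0, '10/49': 1, '50-99': 2, '100-500': 3,
    '500-999': 4, '1000-4999': 5, '5000-9999': 6, '10000+': 7,
}

# Toy column with one missing value, mimicking the raw company_size labels
toy_sizes = pd.Series(['50-99', '10000+', '<10', None, '100-500'])
encoded = toy_sizes.map(company_size_ordered)  # unmapped / missing stay NaN

print(encoded.tolist())
```

Tree-based models can cope with either coding, but a size-ordered one makes count plots and linear models easier to interpret.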

In [None]:
sns.stripplot(x=df['education_level'],y=df['training_hours'])

Hmm, people with a Graduate or Masters degree tend to have a higher number of training hours compared to the rest.

In [None]:
sns.countplot(x=df['major_discipline'],hue=df['target'])

STEM employees dominate the dataset, so consequently they account for the most employees who switch. Anyway, this tells us that most of the time, an employee who switched has a STEM background.

In [None]:
sns.stripplot(y=df['experience'],x=df['education_level'])

Employees with only a primary education level sorely lack experience, which could be a good predictor for classifying into the '0' class, i.e. no switch.

That should be enough EDA. Let's plot a heatmap just in case we missed any correlations.

In [None]:
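# Correlation is only defined for numeric columns; recent pandas versions raise on object columns
df_corr = df.select_dtypes(include='number').corr()
sns.heatmap(df_corr,annot=True)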

Looks like we didn't! We can proceed with cleaning and preprocessing.

# Cleaning + Preprocessing

First, let's separate out our label from the dataframe and drop it. We can also drop "enrollee_id" as it holds no usable information.

In [None]:
y = df['target']
df.drop(columns=['enrollee_id','target'],inplace=True)

Let's pull out the categorical columns in our dataframe, and apply Label Encoding on them.

In [None]:
cat_cols = list(df.select_dtypes(include='object').columns)
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

Writing a small helper function to check if there are any NULL values and calculate their proportion, column-wise.

In [None]:
def nan_calculator(df):
    """Print the count and share of missing values for every column that has any."""
    total = len(df)
    for col in df.columns[df.isna().any()]:
        null = df[col].isnull().sum()
        print(f'Missing values: {null} of {total} ({null / total:.1%}) in {col}')
nan_calculator(df)

Not many are missing, so they can be dealt with using a basic imputer.

Let's use a simple imputer with median as our strategy.

In [None]:
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(df)
df = imp.transform(df)
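As a quick illustration of what median imputation does, here is a sketch on a tiny hypothetical array (`toy` and `toy_imputer` are made-up names; the values are not from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two toy columns, each with one missing entry
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan],
                [5.0, 40.0]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
filled = toy_imputer.fit_transform(toy)

print(toy_imputer.statistics_)  # per-column medians learned during fit
print(filled)                   # NaNs replaced by the median of their column
```

Note that `transform` returns a NumPy array, so after imputation `df` is no longer a pandas DataFrame and the column names are lost.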

# Modelling

Using the classic 'train_test_split' to split our training dataset.

In [None]:
X_train,X_test,y_train,y_test = train_test_split(df,y,test_size=0.2)

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 402)
X_smote, Y_smote = smote.fit_resample(X_train,y_train)


Remember, we never had enough '1.0' target values. In the cell above we use SMOTE from the imbalanced-learn library to generate some synthetic minority-class samples.
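The core idea behind SMOTE is to create new minority-class points by interpolating between an existing point and one of its nearest minority-class neighbours. The sketch below is a simplified NumPy illustration of that idea on toy 2-D data; the helper `synth_sample` is hypothetical and is not imbalanced-learn's implementation:

```python
import numpy as np

rng = np.random.default_rng(402)

# Toy minority-class points in 2-D (hypothetical data)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synth_sample(points, k=2):
    """Return one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # index 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                       # position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([synth_sample(minority) for _ in range(5)])
print(synthetic)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.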

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_smote, Y_smote, test_size = 0.2 ,random_state = 42)

Okay, we are ready to pass our features to a model. Let's start with a simple Logistic Regression, and then try out increasingly complex models.

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
lr.score(X_val,y_val)




68 percent, not bad for a simple model. Let's try a decision tree next.

In [None]:
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,random_state=0)
clf.fit(X_train,y_train)
clf.score(X_val,y_val)


Clear improvement, to be expected as Decision Trees have a much higher model capacity. Trying Random Forests next, as they are an evolved version of Decision Trees.

In [None]:
rf = RandomForestClassifier(n_estimators=10, max_depth=None,
min_samples_split=2, random_state=0)
rf.fit(X_train,y_train)
rf.score(X_val,y_val)

Almost 85% now, again keeping in line with our expectations.

Finally, let's try out a Gradient Boosted Classifier.

In [None]:
xgb_reg = XGBClassifier(max_depth=5)
xgb_reg.fit(X_train,y_train)
xgb_reg.score(X_test,y_test)

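A dip, possibly overfitting, though note that this score is computed on X_test (the original hold-out, without SMOTE samples) rather than on X_val, so it is not directly comparable with the scores above. Let's try ExtraTreesClassifier next.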

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
etc = ExtraTreesClassifier()
etc.fit(X_train,y_train)
print(etc.score(X_val,y_val))
etc_pred = etc.predict(X_val)

A 2 percent improvement over Random Forests. Next we will try scikit-learn's VotingClassifier to combine our best-performing models and see if there is any improvement.

In [None]:
eclf1 = VotingClassifier(estimators=[
('rf', rf),('etc',etc),('xgb',xgb_reg)], voting='hard')
eclf1 = eclf1.fit(X_train, y_train)
print(eclf1.score(X_val,y_val))
pred = eclf1.predict(X_val)

Stays about the same, so we will choose ExtraTreesClassifier as our model since it performs the best.
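For reference, hard voting simply takes the majority label across the individual models for each sample. A minimal NumPy sketch with made-up predictions from three hypothetical classifiers:

```python
import numpy as np

# Hypothetical predictions from three classifiers for five samples
preds = np.array([
    [1, 0, 1, 0, 1],   # e.g. random forest
    [1, 1, 0, 0, 1],   # e.g. extra trees
    [0, 0, 1, 0, 1],   # e.g. gradient boosting
])

# Hard voting on binary labels: predict 1 when more than half of the models say 1
votes_for_1 = preds.sum(axis=0)
majority = (votes_for_1 > preds.shape[0] / 2).astype(int)

print(majority)
```

When the individual models mostly agree, as strong tree ensembles trained on the same features often do, the vote rarely changes the outcome, which is consistent with the score staying about the same.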

In [None]:
target_names = ['No career switch', "Successful career transition"]

In [None]:
print(classification_report(y_val,etc_pred,target_names=target_names))

Not bad, 87 percent on precision and recall. Seems reliable. 
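As a reminder of what these two numbers mean, precision and recall for the positive class can be computed by hand from true positives, false positives and false negatives. The labels below are hypothetical toy values, not the notebook's predictions:

```python
import numpy as np

# Toy ground truth and predictions (1 = career switch)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # correctly predicted switches
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted switch, actually none
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed switches

precision = tp / (tp + fp)  # of the predicted switches, how many were real
recall = tp / (tp + fn)     # of the real switches, how many were found

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

`classification_report` reports these same quantities per class, along with the F1 score and support.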

# Testing

Now let's see how well our model performs on the test set. First we have to apply the same transformations that we applied to our training set before splitting the dataset.
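A common way to guarantee that the test set receives exactly the same transformations is to fit the encoder and imputer on the training data only and then reuse them on the test data. The sketch below shows the pattern on two tiny hypothetical frames; it uses scikit-learn's OrdinalEncoder rather than the LabelEncoder used in this notebook, and all names and values are assumptions for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames with one categorical and one numeric column
toy_train = pd.DataFrame({"gender": ["Male", "Female", "Male", "Other"],
                          "experience": [3.0, None, 10.0, 5.0]})
toy_test = pd.DataFrame({"gender": ["Female", "Unknown"],
                         "experience": [None, 7.0]})

# Categories unseen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
imputer = SimpleImputer(strategy="median")

# Fit on the training data only
train_gender = encoder.fit_transform(toy_train[["gender"]])
train_exp = imputer.fit_transform(toy_train[["experience"]])

# Reuse the fitted objects on the test data (transform, not fit_transform)
test_gender = encoder.transform(toy_test[["gender"]])
test_exp = imputer.transform(toy_test[["experience"]])

print(test_gender.ravel(), test_exp.ravel())
```

With this pattern a given category always maps to the same integer in both sets, and missing test values are filled with the training median.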

In [None]:
df_test = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
df_test['training_hours'] = np.log(df_test['training_hours'])

# Reuse the mapping dictionaries defined during EDA instead of redefining them
df_test['experience'] = df_test['experience'].map(experience_map)
df_test['last_new_job'] = df_test['last_new_job'].map(last_new_job_map)
df_test['company_size'] = df_test['company_size'].map(company_size_map)
df_test.drop(columns=['enrollee_id'],inplace=True)

# NOTE: the encoders and the imputer are refit on the test data here, so the integer
# codes and medians may differ from the ones used on the training set
cat_cols = list(df_test.select_dtypes(include='object').columns)
for col in cat_cols:
    le = LabelEncoder()
    df_test[col] = le.fit_transform(df_test[col])
imp = SimpleImputer(missing_values=np.nan, strategy='median')
df_test = imp.fit_transform(df_test)

# NOTE: y[0:2129] are the first 2129 *training* labels, used because no test labels
# are loaded here; they do not correspond to the rows of df_test
X_train_1,X_test_1,y_train_1,y_test_1 = train_test_split(df_test,y[0:2129],test_size=0.2)
smote = SMOTE(random_state = 402)
X_smote, Y_smote = smote.fit_resample(X_train_1,y_train_1)
X_train_1, X_val_1, y_train_1, y_val_1 = train_test_split(X_smote, Y_smote, test_size = 0.2 ,random_state = 42)


In [None]:
etc.fit(X_train_1,y_train_1)

In [None]:
etc.score(X_val_1,y_val_1)


In [None]:
etc_pred_test = etc.predict(X_val_1)

In [None]:
print(classification_report(y_val_1,etc_pred_test,target_names=target_names))

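We get almost the same level of performance as on our training set. Keep in mind, though, that the labels used here were sliced from the training target and the model was refit on the test features, so this is not an independent measure of generalisation; scoring the model trained on aug_train.csv against true test labels would be needed to confirm it.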