# **HR Analytics: WHO IS LOOKING FOR THE DOOR?**

<img src="https://media.giphy.com/media/xThtadLubOnwcA43V6/giphy.gif">

**In this notebook, I will first look for missing values and fix a few things here and there. Second, I will visualize the processed data to see how each column relates to the target, and then drop the unnecessary columns. Finally, I will apply different models to the data and we will see how it turns out!**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
sample = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/sample_submission.csv')

In [None]:
print(train.shape)
print(test.shape)
print(sample.shape)
train.head(10)

Checking the null values in the training data (the test data gets the same check later on).

In [None]:
for column in train:
    print(column)
    # isnull().sum() on a single column is already one number
    print(train[column].isnull().sum())
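
The same per-column counts can also be read off in a single call. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are illustrative, not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female"],
    "company_size": [np.nan, np.nan, "50-99"],
    "target": [1.0, 0.0, 0.0],
})

# isnull() gives a boolean frame; sum() counts the True values per column
null_counts = toy.isnull().sum()
print(null_counts.sort_values(ascending=False))
```

Sorting the result puts the columns with the most missing values on top.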

Okay. Some columns have no missing values, but most of them have a lot. To continue with the analysis, I am going to mark these *null* values as Unknown and look at their effect on the outcome.

In [None]:
train.fillna('Unknown', inplace = True)
for column in train:
    print(column)
    print(train[column].isnull().sum())

In [None]:
train.head()

***City*** will probably be important to the decision process. BUT it is not usable in the *'city_xx'* form, so I need to strip the *'city_'* prefix from the number next to it.

In [None]:
# remove the literal 'city_' prefix and keep the number that follows it
train['city'] = train['city'].str.replace('city_', '', regex=False)
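
A detail worth knowing here: `str.lstrip` takes a set of characters to remove, not a literal prefix. It gives the expected result for labels like `city_103` because digits are not in that set, but it would over-strip other labels. A small sketch of the difference (the `strip_prefix` helper is a hypothetical name, not something from this notebook):

```python
def strip_prefix(label: str, prefix: str = "city_") -> str:
    """Remove `prefix` only when `label` actually starts with it."""
    return label[len(prefix):] if label.startswith(prefix) else label

# digits are not in the character set, so only the prefix goes
print("city_103".lstrip("city_"))
# 't' and 'y' are in the character set, so they are stripped as well
print("city_ty9".lstrip("city_"))
# only the literal prefix is removed
print(strip_prefix("city_ty9"))
```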

Also, ***experience*** and ***last_new_job*** should be important for the outcome, don't you think? I will delete the *'<'* and *'>'* symbols in front of the values; I do not think they matter that much. We should also get rid of the 'never' and 'Unknown' labels in the ***last_new_job*** column by setting them to 0.

In [None]:
train['experience'] = train['experience'].map(lambda x: x.lstrip('<>'))
train['last_new_job'] = train['last_new_job'].map(lambda x: x.lstrip('<>'))
train["last_new_job"]= train["last_new_job"].replace('never', 0)
train["last_new_job"]= train["last_new_job"].replace('Unknown', 0)
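
After this cleanup the columns still hold digit strings mixed with the integer 0. If a purely numeric column is wanted, one option (not what the cells above do) is to replace the labels with the string `'0'` and cast afterwards. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical values in the same shape as `experience` / `last_new_job`
raw = pd.Series(["<1", ">20", "5", "never", "Unknown"])

cleaned = (
    raw.str.lstrip("<>")                         # '<1' -> '1', '>20' -> '20'
       .replace({"never": "0", "Unknown": "0"})  # labels without a number become '0'
       .astype(int)                              # every entry is a digit string by now
)
print(cleaned.tolist())
```

Note that this maps '<1' to 1 and '>20' to 20, the same simplification the notebook makes.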

In [None]:
train.head()

Now, I will look at each column and its effect on the ***target***. By the way, I like to do this with big graphs.

I always like to make my own prediction about the data before I start to analyze it. So here it is: I do not think *gender, major or company type* will affect the outcome. We'll see!

In [None]:
sns.set(rc={'figure.figsize':(25,6)})

In [None]:
sns.barplot(x='city', y='target', data=train)
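
With a 0/1 `target` on the y-axis, the height of each bar in `sns.barplot` is the mean of `target` within that category (the default estimator), i.e. the share of people looking for a job change. The same numbers can be checked with a `groupby`. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows; the real data has many more cities
toy = pd.DataFrame({
    "city": ["103", "103", "21", "21", "21", "16"],
    "target": [0, 1, 1, 1, 0, 0],
})

# mean of a 0/1 column = fraction of ones in each group
rate = toy.groupby("city")["target"].mean().sort_values(ascending=False)
print(rate)
```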

In [None]:
sns.barplot(x='city_development_index', y='target', data=train)

In [None]:
sns.barplot(x='gender', y='target', data=train)

I think I was right about gender. Although there are differences between the groups, they are not as large as in the other categories.

In [None]:
sns.barplot(x='relevent_experience', y='target', data=train)

In [None]:
sns.barplot(x='enrolled_university', y='target', hue='education_level', data=train)

In [None]:
sns.barplot(x='major_discipline', y='target', data=train)

Two for two, everybody. I will be dropping *major_discipline* as well.

In [None]:
sns.barplot(x='experience', y='target', data=train)

Experience level makes a huge difference. We see that people early in their careers tend to look for a job change more often than others.

In [None]:
sns.barplot(x='company_size', y='target', data=train)

In [None]:
sns.barplot(x='company_type', y='target', data=train)

The null values that we had in the beginning are making a huge difference right now. I am going to stick with them and treat them as another category.

In [None]:
sns.barplot(x='last_new_job', y='target', data=train)

I thought this would have more of an effect, not gonna lie. But it is still valid.

In [None]:
sns.barplot(x='training_hours', y='target', data=train)

There are some spikes here and there. Overall, the target rate looks consistent enough across the different values that I can drop this column.
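
`training_hours` is a numeric column with many distinct values, so one bar per value is hard to read. Before deciding, it may help to bin the hours and compare the target rate per bin. A sketch with made-up numbers and arbitrary bin edges (both are assumptions for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    "training_hours": [5, 20, 60, 120, 200, 300],
    "target": [1, 1, 0, 1, 0, 0],
})

# Arbitrary, hypothetical bin edges: (0, 50], (50, 150], (150, 400]
bins = [0, 50, 150, 400]
toy["hours_bin"] = pd.cut(toy["training_hours"], bins=bins)

# share of target == 1 in each bin
rate_by_bin = toy.groupby("hours_bin", observed=True)["target"].mean()
print(rate_by_bin)
```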

So, I am going to drop the *gender, major_discipline and training_hours* columns.

In [None]:
train = train.drop(['gender', 'major_discipline', 'training_hours'], axis=1)

In [None]:
train.head()

Here we have all the information that we need, but it is not usable for models right now: we need to get rid of the string values. If we do that with *pandas.get_dummies* as is, several of the new columns will share the same name, *Unknown*. First, I need to fix that.

In [None]:
train["enrolled_university"]= train["enrolled_university"].replace('Unknown', 'Unknown_uni')
train["education_level"]= train["education_level"].replace('Unknown', 'Unknown_level')
train["experience"]= train["experience"].replace('Unknown', 0)
train["company_size"]= train["company_size"].replace('Unknown', 'Unknown_size')
train["company_type"]= train["company_type"].replace('Unknown', 'Unknown_type')
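
An alternative to renaming each label by hand is to let `pd.get_dummies` prefix the new columns: when it is called on a DataFrame with `columns=[...]`, it prepends the source column name to every dummy. A sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical rows with the same 'Unknown' label in two columns
toy = pd.DataFrame({
    "company_size": ["Unknown", "50-99", "Unknown"],
    "company_type": ["Pvt Ltd", "Unknown", "Pvt Ltd"],
    "target": [1, 0, 0],
})

# Each dummy is named '<column>_<label>', so the two 'Unknown's do not collide
dummies = pd.get_dummies(toy, columns=["company_size", "company_type"])
print(dummies.columns.tolist())
```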

In [None]:
train.head()

**NOW** we can create dummies.

In [None]:
experience = pd.get_dummies(train['relevent_experience'], drop_first=True)
university = pd.get_dummies(train['enrolled_university'], drop_first=False)
education = pd.get_dummies(train['education_level'], drop_first=False)
c_size = pd.get_dummies(train['company_size'], drop_first=False)
c_type = pd.get_dummies(train['company_type'], drop_first=False)

In [None]:
train = train.drop(['relevent_experience', 'enrolled_university', 'education_level', 'company_size', 'company_type'], axis=1)

In [None]:
train = pd.concat([train, experience, university, education, c_size, c_type], axis=1)
print(train.shape)
train.head()

**Now our training data is ready for training. But before I start with that, I need to apply exactly the same steps to the test data to make the two compatible. This is why it is so important to have a nice structure and pipeline in your notebook: you can get confused very easily if you don't.**

In [None]:
for column in test:
    print(column)
    print(test[column].isnull().sum())

Both datasets have *null* values in the same columns.

In [None]:
test.head()

In [None]:
test.fillna('Unknown', inplace = True)

# remove the literal 'city_' prefix, same as for the training data
test['city'] = test['city'].str.replace('city_', '', regex=False)
test['experience'] = test['experience'].map(lambda x: x.lstrip('<>'))
test['last_new_job'] = test['last_new_job'].map(lambda x: x.lstrip('<>'))
test["last_new_job"]= test["last_new_job"].replace('never', 0)
test["last_new_job"]= test["last_new_job"].replace('Unknown', 0)

test = test.drop(['gender', 'major_discipline', 'training_hours'], axis=1)
test["enrolled_university"]= test["enrolled_university"].replace('Unknown', 'Unknown_uni')
test["education_level"]= test["education_level"].replace('Unknown', 'Unknown_level')
test["experience"]= test["experience"].replace('Unknown', 0)
test["company_size"]= test["company_size"].replace('Unknown', 'Unknown_size')
test["company_type"]= test["company_type"].replace('Unknown', 'Unknown_type')

experience_test = pd.get_dummies(test['relevent_experience'], drop_first=True)
university_test = pd.get_dummies(test['enrolled_university'], drop_first=False)
education_test = pd.get_dummies(test['education_level'], drop_first=False)
c_size_test = pd.get_dummies(test['company_size'], drop_first=False)
c_type_test = pd.get_dummies(test['company_type'], drop_first=False)

test = test.drop(['relevent_experience', 'enrolled_university', 'education_level', 'company_size', 'company_type'], axis=1)
test = pd.concat([test, experience_test, university_test, education_test, c_size_test, c_type_test], axis=1)
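
One thing to watch: creating dummies separately for train and test only gives matching columns if every category appears in both files. If a label is missing from one of them, the column sets differ. A common safeguard is to reindex the test dummies to the training columns. A sketch on made-up data:

```python
import pandas as pd

train_toy = pd.get_dummies(pd.Series(["Graduate", "Masters", "Phd"]))
test_toy = pd.get_dummies(pd.Series(["Graduate", "Phd"]))  # no 'Masters' rows here

# Add the columns missing from test (filled with 0) and enforce the same order
test_aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

On the real data, the training columns would be taken without `target`, since the test file does not have it.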

In [None]:
train.shape, test.shape

**We are good to go.**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import metrics

In [None]:
x = train.drop(['target'], axis=1)
y = train['target']

In [None]:
x.shape, y.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
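
If the two classes are not equally frequent (an assumption, since the class balance is not printed above), passing `stratify=y` keeps the same class ratio in both parts of the split. A sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# both parts keep the 20% share of positives
print(y_tr.mean(), y_te.mean())
```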

**LOGISTIC REGRESSION**

In [None]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(x_train, y_train)
prediction_lr = logistic.predict(x_test)
print(classification_report(y_test,prediction_lr))

**DECISION TREE**

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
prediction_dt = tree.predict(x_test)
print(classification_report(y_test, prediction_dt))

**RANDOM FOREST**

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(x_train, y_train)
prediction_rf = forest.predict(x_test)
print(classification_report(y_test, prediction_rf))

Although it is clear which model is more successful, I like to look at their ROC curves to be sure.

In [None]:
sns.set(rc={'figure.figsize':(8,5)})

In [None]:
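# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
metrics.RocCurveDisplay.from_estimator(logistic, x_test, y_test)

In [None]:
metrics.RocCurveDisplay.from_estimator(tree, x_test, y_test)

In [None]:
metrics.RocCurveDisplay.from_estimator(forest, x_test, y_test)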
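
To put a single number next to each curve, `sklearn.metrics.roc_auc_score` computes the area under the ROC curve from the true labels and the predicted scores (for a fitted classifier, typically `predict_proba(...)[:, 1]`). A sketch on four made-up scores:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, not model output
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC = share of (positive, negative) pairs where the positive gets the higher score
auc = roc_auc_score(y_true, y_score)
print(auc)
```

Here three of the four positive/negative pairs are ranked correctly.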

I have applied three different machine learning methods to the data. The Random Forest classifier seems to be the most successful of them: it achieved 83% precision, 88% recall and an AUC of 0.78. That is not perfect, but I think we can call it a successful classification.

**That is it for this notebook. I hope you liked it. Let me know what you think; feedback is appreciated.**

<img src="https://media.giphy.com/media/xUPOqo6E1XvWXwlCyQ/giphy.gif">