# **HR Analytics Project** 

**by Mario Nascitini**

**Context and Content**

A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully complete courses conducted by the company. Many people sign up for the training. The company wants to know which candidates really want to work for it after training and which are looking for new employment elsewhere, because this helps reduce cost and time and improves the quality of training, course planning, and the categorization of candidates. Information on demographics, education, and experience is available from the candidates' signup and enrollment.

This dataset is also designed to help HR research understand the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.

The data is divided into train and test sets. The target isn't included in the test set, but a file with the test target values is available for related tasks. A sample submission corresponding to the enrollee_id values of the test set is also provided, with columns: enrollee_id, target

**Note:**

- The dataset is imbalanced.
- Most features are categorical (nominal, ordinal, binary), some with high cardinality.
- Imputation of missing values can be part of your pipeline as well.

**Features**

enrollee_id: Unique ID for candidate

city: City code

city_development_index: Development index of the city (scaled)

gender: Gender of candidate

relevent_experience: Relevant experience of candidate

enrolled_university: Type of university course enrolled in, if any

education_level: Education level of candidate

major_discipline: Education major discipline of candidate

experience: Candidate total experience in years

company_size: Number of employees in current employer's company

company_type: Type of current employer

last_new_job: Difference in years between previous job and current job

training_hours: Training hours completed

target: 0 – Not looking for job change, 1 – Looking for a job change

**Inspiration**

**Predict** the probability that a candidate will work for the company

**Interpret** the model(s) in a way that illustrates which features affect the candidate's decision

Please refer to the following task for more details:
https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015

# **Importing libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# **Loading data**

In [None]:
import os, seaborn as sns, pandas as pd, numpy as np
#os.chdir('/Utenti/marionascitini/Download/')

filepath_train = '../input/hr-analytics-job-change-of-data-scientists/aug_train.csv'
filepath_test = '../input/hr-analytics-job-change-of-data-scientists/aug_test.csv'
filepath_submission = '../input/hr-analytics-job-change-of-data-scientists/sample_submission.csv'

output_file='../output/kaggle/working/test_data_predictions.csv'
df_train = pd.read_csv(filepath_train, sep=',')
df_test = pd.read_csv(filepath_test, sep=',')
df_submission=pd.read_csv(filepath_submission,sep=',')

In [None]:
orig_test=df_test.copy()

# **Dataset description**

In [None]:
df_test.head()

In [None]:
df_train.info()

In [None]:
df_test.info()

**Output variable (desired target for prediction):**

target - Looking for job change? (binary: 1:"yes",0:"no")

In [None]:
df_train.dtypes.value_counts()

We have:

**3 quantitative (numerical) independent features**

**10 qualitative (categorical) independent features**

**1 output variable (binary: 1/0)**

# **Exploratory Data Analysis**

In [None]:
sns.countplot(x=df_train['target']);
print(df_train.target.value_counts(normalize=True))


> Note that the **dataset is unbalanced on the target variable**.

> **75%** of enrollees **are not looking for a job change**
> 
> **25%** of enrollees **are looking for a job change**

**Define the numerical and categorical features**

In [None]:
object_features = df_train.select_dtypes(include=['object', 'bool']).columns.values

target_feature=['target']


binary_features=list() 
for col in object_features:
    if len(df_train[col].unique())==2:
        binary_features.append(col)
binary_features=list(set(binary_features)-set(target_feature))

categorical_features=list(set(object_features)-set(binary_features)-set(target_feature))    

numerical_features=df_train.select_dtypes(include=['int64','float64']).columns.values
numerical_features=list(set(numerical_features)-set(target_feature))

ordinal_features=['education_level','experience','last_new_job','company_size','enrolled_university']
nominal_features=list(set(categorical_features)-set(ordinal_features))

In [None]:
print('numerical features:',numerical_features)
print('\n')
print('categorical binary features:',binary_features)
print('\n')
print('categorical nominal features:',nominal_features)
print('\n')
print('categorical ordinal features:',ordinal_features)
print('\n')
print('target feature:',target_feature)

In [None]:
df_uniques = pd.DataFrame([[i, len(df_train[i].unique())] for i in df_train.columns], columns=['Variable', 'Unique Values']).set_index('Variable')
df_uniques

**Unique values for numerical features**

In [None]:
for col in numerical_features:
    print(col, "(", len(df_train[col].unique()) , "values):\n", df_train[col].unique())
    

**Binary features unique values**

In [None]:
for col in binary_features:
    print(col, "(", len(df_train[col].unique()) , "values):\n", df_train[col].unique())

**Nominal features unique values**

In [None]:
for col in nominal_features:
    print(col, "(", len(df_train[col].unique()) , "values):\n", df_train[col].unique())



**Ordinal features unique values**

In [None]:
for col in ordinal_features:
    print(col, "(", len(df_train[col].unique()) , "values):\n", df_train[col].unique())


**Count of the missing values for each variable of the train set**

In [None]:
print('# of numerical features with missing values:\n',df_train[numerical_features].isnull().sum().sort_values(ascending=False))
print('\n # of binary features with missing values:\n',df_train[binary_features].isnull().sum().sort_values(ascending=False))
print('\n # of categorical features with missing values:\n',df_train[categorical_features].isnull().sum().sort_values(ascending=False))
print('\n # of target feature with missing values:\n',df_train[target_feature].isnull().sum().sort_values(ascending=False))

In [None]:
print('# of numerical features with missing values:\n',df_test[numerical_features].isnull().sum().sort_values(ascending=False))
print('\n # of binary features with missing values:\n',df_test[binary_features].isnull().sum().sort_values(ascending=False))
print('\n # of categorical features with missing values:\n',df_test[categorical_features].isnull().sum().sort_values(ascending=False))


We have missing values only in categorical features, and four of them have many missing values.
We can handle these missing values in four ways:

1) replace missing values with the "mode" value for each feature (see "fill_with_mode" function)
 
2) replace missing values with a specific "unknown" value for each feature (see "fill_with_unknown" function)

3) replace missing values by random sampling from the observed values (see the feature_engine RandomSampleImputer code)
 
**4) delete all rows with missing values (the option chosen here)**
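
To make the difference between these options concrete, here is a minimal sketch on a toy column. The data is hypothetical and only stands in for a categorical feature with gaps; option 3 is left out because it depends on the feature_engine imputer.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a categorical column with gaps (hypothetical data)
toy = pd.DataFrame({"company_type": ["Pvt Ltd", np.nan, "Pvt Ltd", "NGO", np.nan]})

# Option 1: fill with the most frequent value (the mode)
filled_mode = toy["company_type"].fillna(toy["company_type"].mode()[0])

# Option 2: fill with an explicit "Unknown" category
filled_unknown = toy["company_type"].fillna("Unknown")

# Option 4: drop every row that has a missing value
dropped = toy.dropna(axis=0)

print(filled_mode.tolist())
print(filled_unknown.tolist())
print(len(toy), "->", len(dropped), "rows after dropna")
```

Dropping rows is the simplest choice but shrinks the training set, while the two fill strategies keep every row at the cost of adding values that were never observed.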

In [None]:
df_train.dropna(axis=0,inplace=True)
#df_test.dropna(axis=0,inplace=True)

In [None]:
df_test.isnull().sum()

In [None]:
%pip install feature_engine

**For the final test data we do not delete rows with missing values; instead we use a RandomSampleImputer**

In [None]:

#Using feature_engine  RandomSampleImputer
from feature_engine.imputation import RandomSampleImputer
imputer = RandomSampleImputer(
        random_state=101,
        seed='general',
        seeding_method='add'
    )

# fit the imputer
#imputer.fit(df_train)
# transform the data
#df_train = imputer.transform(df_train)

imputer.fit(df_test)
df_test = imputer.transform(df_test)


#Using sklearn  SimpleImputer
#from sklearn.impute import SimpleImputer
#si=SimpleImputer(strategy="most_frequent")
#df_train[categorical_features]=si.fit_transform(df_train[categorical_features])
#df_test[categorical_features]=si.fit_transform(df_test[categorical_features])

#from sklearn.impute import SimpleImputer
#si=SimpleImputer(strategy='constant',fill_value='Unknown')
#df_train[categorical_features]=si.fit_transform(df_train[categorical_features])

# Coding manually
#def fill_with_mode(data,features):
#  for col in features:
#    if data[col].isnull().sum()>0:
#      data[col]=data[col].fillna(data[col].value_counts().index[0])

#def fill_with_unknown(data,features):
#    for col in features:
#        if data[col].isnull().sum()>0:
#            data[col]=data[col].fillna('Unknown')

#fill_with_mode(df_train,categorical_features)
#fill_with_mode(df_test,categorical_features)
#fill_with_unknown(df_train,categorical_features)
#fill_with_unknown(df_test,categorical_features)

In [None]:
df_test.isnull().sum()

In [None]:
print('# of numerical features with missing values:\n',df_train[numerical_features].isnull().sum().sort_values(ascending=False))
print('\n # of binary features with missing values:\n',df_train[binary_features].isnull().sum().sort_values(ascending=False))
print('\n # of categorical features with missing values:\n',df_train[categorical_features].isnull().sum().sort_values(ascending=False))
print('\n # of target feature with missing values:\n',df_train[target_feature].isnull().sum().sort_values(ascending=False))

In [None]:
print('# of numerical features with missing values:\n',df_test[numerical_features].isnull().sum().sort_values(ascending=False))
print('\n # of binary features with missing values:\n',df_test[binary_features].isnull().sum().sort_values(ascending=False))
print('\n # of categorical features with missing values:\n',df_test[categorical_features].isnull().sum().sort_values(ascending=False))


**Analyze "gender" feature**

In [None]:
df_test['gender'].value_counts()

The value "Other" for gender is rare enough to be negligible. Moreover, the target values for these records follow the same proportion as in the entire dataset.
So we can drop the training rows with gender="Other".
For the final test data, we replace "Other" with the most frequent value, "Male".

In [None]:
index_of_other_gender=df_train[df_train['gender']=='Other'].index 
df_train.drop(axis=0,index=index_of_other_gender,inplace=True)
#index_of_other_gender=df_test[df_test['gender']=='Other'].index 
#df_test.drop(axis=0,index=index_of_other_gender,inplace=True)
#df_test[df_test['gender']=='Other']='Male'
#df_train['gender'].value_counts()

In [None]:
df_test['gender'].value_counts()

In [None]:
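# Assign the result back: calling replace(..., inplace=True) on a selected column may not update df_test
df_test["gender"] = df_test["gender"].replace({"Other": "Male"})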


In [None]:
df_test['gender'].value_counts()

In [None]:
binary_features=list(set.union(set(binary_features),set(['gender'])))

In [None]:
binary_features

In [None]:
nominal_features=list(set(nominal_features)-set(['gender']))
nominal_features

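# Let's plot the distribution of values for binary features

In [None]:
for col in binary_features:
    # countplot draws on the current figure; catplot would open its own and leave this one empty
    plt.figure(figsize=(12, 4))
    sns.countplot(x=col, data=df_train)
    plt.title(col)
    plt.tight_layout()
    plt.show()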

# Let's plot the distribution of values for nominal features

In [None]:
for col in nominal_features:
    # countplot draws on the current figure; catplot would open its own and leave this one empty
    plt.figure(figsize=(12, 4))
    sns.countplot(x=col, data=df_train)
    plt.title(col)
    plt.tight_layout()
    plt.show()

# Let's plot the distribution of values for ordinal features

In [None]:
for col in ordinal_features:
    # countplot draws on the current figure; catplot would open its own and leave this one empty
    plt.figure(figsize=(12, 4))
    sns.countplot(x=col, data=df_train)
    plt.title(col)
    plt.tight_layout()
    plt.show()

# Let's encode the ordinal categorical features

We have two choices:

1) one-hot encoding

2) **a manual mapping onto ordered values (the option chosen here)**
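
As an alternative way to express the second option, the ordering can be handed to scikit-learn's `OrdinalEncoder` instead of a hand-written dictionary. This is only a sketch on a toy column (the rows are hypothetical); the category order is the one assumed for `education_level`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (hypothetical rows); the category order is stated explicitly
toy = pd.DataFrame({"education_level": ["Graduate", "Phd", "High School", "Masters"]})
order = [["Primary School", "High School", "Graduate", "Masters", "Phd"]]

encoder = OrdinalEncoder(categories=order)
# OrdinalEncoder codes start at 0, so add 1 to match a 1..5 manual mapping
toy["education_level_code"] = encoder.fit_transform(toy[["education_level"]]).ravel() + 1

print(toy)
```

Either approach keeps the ordering information that one-hot encoding would discard; the encoder version additionally raises an error on values that are not in the declared order.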

**One Hot Encoding**

In [None]:
#for col in ordinal_features:
#    df_train = pd.concat([df_train,pd.get_dummies(df_train[col], prefix=col)],axis=1)
#    df_train.drop([col],axis=1, inplace=True)

**Manual Mapping**

In [None]:
# Ordered integer codes for each ordinal feature; the same mapping is applied to train and test
ordinal_maps = {
    'education_level': {'Primary School': 1, 'High School': 2, 'Graduate': 3, 'Masters': 4, 'Phd': 5},
    'last_new_job': {'never': 0, '1': 1, '2': 2, '3': 3, '4': 4, '>4': 5},
    'experience': {'<1': 0, **{str(i): i for i in range(1, 21)}, '>20': 21},
    'company_size': {'<10': 1, '10/49': 2, '50-99': 3, '100-500': 4, '500-999': 5, '1000-4999': 6, '5000-9999': 7, '10000+': 8},
    'enrolled_university': {'no_enrollment': 1, 'Part time course': 2, 'Full time course': 3},
}
for col, mapping in ordinal_maps.items():
    df_train[col] = df_train[col].map(mapping)
    df_test[col] = df_test[col].map(mapping)


In [None]:
df_train["relevent_experience"] = df_train["relevent_experience"].replace({"Has relevent experience":"YES","No relevent experience":"NO"})

In [None]:
df_test["relevent_experience"] = df_test["relevent_experience"].replace({"Has relevent experience":"YES","No relevent experience":"NO"})

In [None]:
df_test

# Let's encode the binary features

In [None]:
binary_features

In [None]:
from sklearn.preprocessing import LabelEncoder,LabelBinarizer
le = LabelEncoder()
for col in binary_features:
    le.fit(df_train[col])
    df_train[col]=le.transform(df_train[col])
    #le.fit(df_test[col])
    df_test[col]=le.transform(df_test[col])


In [None]:
nominal_features

In [None]:
df_test

# Save the "enrollee_id" for final test predictions

In [None]:
id_test=df_test['enrollee_id']

# Let's drop 'enrollee_id' and 'city' features

In [None]:
df_train.drop(columns=['enrollee_id','city'],axis=1,inplace=True)
df_test.drop(columns=['enrollee_id','city'],axis=1,inplace=True)


In [None]:
nominal_features=['major_discipline','company_type']
nominal_features

In [None]:
print(df_train['major_discipline'].value_counts())
# Count missing values in the column itself (value_counts() never contains nulls)
print(df_train['major_discipline'].isnull().sum())

In [None]:
print(df_train['company_type'].value_counts())
# Count missing values in the column itself (value_counts() never contains nulls)
print(df_train['company_type'].isnull().sum())

# Let's encode nominal features

**Choose "Dummy encoding"**
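
When train and test frames are dummy-encoded separately, a category that appears in only one of them produces different column sets. The sketch below, on hypothetical toy frames, shows one way to align the test columns to the training columns with `reindex`; it is an illustration of the idea, not a step taken in this notebook.

```python
import pandas as pd

# Toy frames (hypothetical): the test frame lacks the "NGO" category
train = pd.DataFrame({"company_type": ["Pvt Ltd", "NGO", "Funded Startup"]})
test = pd.DataFrame({"company_type": ["Pvt Ltd", "Funded Startup"]})

train_enc = pd.get_dummies(train, columns=["company_type"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["company_type"], drop_first=True)

# Reindex the test frame on the training columns: categories absent from
# the test data become all-zero columns, and extra columns are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
print(list(test_aligned.columns))
```

With the real data, the `target` column would have to be excluded from the training columns before reindexing, since the test set does not contain it.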

In [None]:
df_train=pd.get_dummies(df_train,columns=nominal_features,drop_first=True)

In [None]:
df_test=pd.get_dummies(df_test,columns=nominal_features,drop_first=True)

# Build Model and Predict


In [None]:
#from sklearn.utils import resample

#def df_sample(data_frame,num_samples):
#    dataset_majority = data_frame[data_frame.target == 0]
#    dataset_minority = data_frame[data_frame.target == 1]
    # Downsample majority class
#    df_majority_downsampled = resample(dataset_majority, replace=False,
#                                   n_samples=num_samples, random_state=123)

#    data_frame_downsampled = pd.concat([df_majority_downsampled, dataset_minority])
#    return data_frame_downsampled

#dataset_majority = y_train[y_train == 0]
#dataset_minority = y_train[y_train == 1]

#Downsample majority class
#df_majority_downsampled = resample(dataset_majority, replace=False,
#                                   n_samples=4198, random_state=123)

#y_train_downsampled = pd.concat([df_majority_downsampled, dataset_minority])


In [None]:
#len_min_class=len(df_train[df_train['target']==1])
#len_maj_class=len(df_train[df_train['target']==0])

#print('length of minority class:', len_min_class)
#print('length of majority class:', len_maj_class)


In [None]:
#pip install imblearn

In [None]:
#df_train_resampled=df_sample(df_train,len_min_class)
#df_train_resampled.target.value_counts()
#df_train_resampled

# Prepare our dataset for train/test split

In [None]:
#X=df_train_resampled.drop(['target'],axis=1)
#y=df_train_resampled['target']
X=df_train.drop(['target'],axis=1)
y=df_train['target']
columns=X.columns


In [None]:
X.shape

# Oversampling minority class

**Now we have to address the imbalance of the target variable. We can "downsample" the majority class or "oversample" the minority class.
We tried both, and oversampling gave better model scores.**
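
One design point worth keeping in mind: if the data is resampled before the train/test split, resampled or synthetic rows can end up in the held-out set and make the scores look better than they would on real data. A common alternative is to split first and oversample only the training part. The sketch below illustrates that order on synthetic data; it uses plain random oversampling via `sklearn.utils.resample` rather than SMOTE, and all names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (hypothetical): 75 negatives, 25 positives
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=100), "target": [0] * 75 + [1] * 25})

# Split first, so the held-out rows are all original observations
train_part, holdout = train_test_split(
    data, test_size=0.2, stratify=data["target"], random_state=42
)

# Randomly oversample the minority class inside the training part only
majority = train_part[train_part["target"] == 0]
minority = train_part[train_part["target"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=123)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["target"].value_counts().to_dict())
print(holdout["target"].value_counts().to_dict())
```

The held-out set keeps the original class ratio, so the metrics computed on it reflect the imbalance the model will face in practice.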

In [None]:
from imblearn.over_sampling import SMOTE, ADASYN
X_resampled, y_resampled = SMOTE().fit_resample(X, y)


In [None]:
X_resampled.shape

In [None]:
y_resampled.value_counts()

# Train/Test Splitting

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.1, random_state=42)

In [None]:
y_train.value_counts()

# Scaling our train/test data

In [None]:
from sklearn.preprocessing import MinMaxScaler
mm=MinMaxScaler(copy=False)
cols=X_train.columns
X_train[cols] = mm.fit_transform(X_train[cols])
X_test[cols]=mm.transform(X_test[cols])
#X_train_scaled=pd.DataFrame(mm.fit_transform(X_train),columns=X_train.columns) 
#X_test_scaled=pd.DataFrame(mm.transform(X_test),columns=X_test.columns) 
#X_train_scaled=pd.DataFrame(X_train_scaled,columns=X.columns)
#X_test_scaled=pd.DataFrame(X_test_scaled,columns=X.columns)
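
The scaling step fits the scaler on the training split and only applies it to the test split. A minimal sketch with toy numbers (hypothetical values) shows what that means: new data is mapped with the minimum and maximum learned from the training data, so it can fall outside the 0 to 1 range.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy values (hypothetical)
train = pd.DataFrame({"training_hours": [10.0, 20.0, 30.0]})
new = pd.DataFrame({"training_hours": [15.0, 40.0]})

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30
new_scaled = scaler.transform(new)          # reuses that min and max

print(train_scaled.ravel())
print(new_scaled.ravel())
```

Calling `fit_transform` again on the new data would instead rescale it to its own range, which makes the features inconsistent with what the model saw during training.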

# Prepare modeling: import libraries and define score function 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import classification_report

def view_scores(test,pred):
  print("Accuracy score:", accuracy_score(test,pred))
  print("Classification report")
  print(classification_report(test,pred))

  # Confusion Matrix
  print("Confusion Matrix:")
  print(confusion_matrix(test,pred))

  conf_mat = confusion_matrix(test,pred)
  ax = plt.subplot()
  sns.heatmap(conf_mat, annot=True, ax=ax, fmt='d')
  #labels, title and ticks
  ax.set_xlabel('Predicted labels')
  ax.set_ylabel('True labels')
  ax.set_title('Confusion Matrix')
  ax.xaxis.set_ticklabels(['no', 'yes'])
  ax.yaxis.set_ticklabels(['no', 'yes'])
  plt.show()
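
The precision and recall figures in the classification report can be traced back to the four cells of the confusion matrix. Here is a small worked example with made-up labels and predictions, checked against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(tn, fp, fn, tp)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```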


# Simple Logistic Regression

In [None]:
# Standard logistic regression without downsampling
lr = LogisticRegression(C=1,solver='liblinear',penalty='l2',class_weight='balanced',random_state=31) 
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
y_proba_lr=lr.predict_proba(X_test)
view_scores(y_test,y_pred_lr)

In [None]:
df_test.shape

In [None]:
df_train.columns

# Logistic Regression with cross validation

In [None]:
from sklearn.linear_model import LogisticRegressionCV
lr_l1 = LogisticRegressionCV(Cs=10, cv=5, penalty='l1', solver='liblinear').fit(X_train, y_train)
lr_l2 = LogisticRegressionCV(Cs=10, cv=5, penalty='l2', solver='liblinear').fit(X_train, y_train)
y_pred_lr_l1=lr_l1.predict(X_test)
y_pred_lr_l2=lr_l2.predict(X_test)
view_scores(y_test,y_pred_lr_l1)
view_scores(y_test,y_pred_lr_l2)


# Decision Tree Classifier

In [None]:
# Decision Tree Classifier model
dt = DecisionTreeClassifier(criterion='entropy',splitter='best',max_depth=50)
dt=dt.fit(X_train,y_train)
y_pred_dt=dt.predict(X_test)
view_scores(y_test,y_pred_dt)

In [None]:
dt.tree_.node_count, dt.tree_.max_depth

In [None]:
def measure_error(y_true, y_pred, label):
    return pd.Series({'accuracy':accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred),
                      'recall': recall_score(y_true, y_pred),
                      'f1': f1_score(y_true, y_pred)},
                      name=label)

In [None]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

train_test_full_error = pd.concat([measure_error(y_train, y_train_pred, 'train'),
                              measure_error(y_test, y_test_pred, 'test')],
                              axis=1)

train_test_full_error


In [None]:
dt.feature_importances_
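
`feature_importances_` is a bare array in the order of the training columns, which makes it hard to read on its own. A minimal sketch, on synthetic data with hypothetical column names, of pairing the values with the column names and sorting them:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names
X_arr, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Label each importance with its column and sort from most to least important
importances = pd.Series(tree.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The same pattern should apply to the fitted tree here, using `X_train.columns` as the index.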

# Decision Tree Classifier with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth':range(1, dt.tree_.max_depth+1, 2),
              'max_features': range(1, len(dt.feature_importances_)+1)}

GR = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid,
                  scoring='accuracy',
                  n_jobs=-1)

GR = GR.fit(X_train, y_train)
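
After a grid search has been fitted, the selected hyperparameters and the cross-validated score are available as attributes. A short sketch on synthetic data follows; the grid values are hypothetical and much smaller than the grid used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6], "max_features": [2, 4, 6]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_)           # combination with the best mean CV score
print(round(search.best_score_, 3))  # that mean cross-validated accuracy
best_tree = search.best_estimator_   # refitted on all the data passed to fit
```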

In [None]:
y_pred_gr = GR.predict(X_test)
view_scores(y_test,y_pred_gr)

# Linear SVC

In [None]:
from sklearn.svm import LinearSVC

LSVC = LinearSVC()
LSVC.fit(X_train, y_train)
y_pred_svc=LSVC.predict(X_test)
view_scores(y_test,y_pred_svc)

# KNN

In [None]:
knn = KNeighborsClassifier(n_neighbors=10,weights='distance')
knn = knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
view_scores(y_test,y_pred_knn)

Now let's compare the ROC and precision-recall curves for these three models (logistic regression, decision tree, KNN).
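
Alongside the curves, a single ROC AUC number per model can make the comparison easier to read. The sketch below computes it for three classifiers on synthetic data; the dataset and the hyperparameters are hypothetical and chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=10),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # roc_auc_score needs the positive-class probability, not the hard label
    auc_scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

for name, auc in sorted(auc_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {auc:.3f}")
```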

In [None]:
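from sklearn import metrics
# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2
metrics.RocCurveDisplay.from_estimator(lr, X_test, y_test)
plt.show()

metrics.PrecisionRecallDisplay.from_estimator(lr, X_test, y_test)
plt.show()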



In [None]:
metrics.RocCurveDisplay.from_estimator(dt, X_test, y_test)
plt.show()

metrics.PrecisionRecallDisplay.from_estimator(dt, X_test, y_test)
plt.show()


In [None]:
from sklearn import metrics
metrics.RocCurveDisplay.from_estimator(knn, X_test, y_test)
plt.show()
metrics.PrecisionRecallDisplay.from_estimator(knn, X_test, y_test)
plt.show()


# Gradient Boost Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
error_list = list()

tree_list = [200, 400,500,600,700,800,900,1000]
for n_trees in tree_list:
    
    # Initialize the gradient boost classifier
    GBC = GradientBoostingClassifier(max_features=5,n_estimators=n_trees, random_state=42)

    # Fit the model
    print(f'Fitting model with {n_trees} trees')
    GBC.fit(X_train.values, y_train.values)
    y_pred_gbc = GBC.predict(X_test)

    # Get the error
    error = 1.0 - accuracy_score(y_test, y_pred_gbc)
    
    # Store it
    error_list.append(pd.Series({'n_trees': n_trees, 'error': error}))

error_df = pd.concat(error_list, axis=1).T.set_index('n_trees')

error_df
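
Refitting the model for every ensemble size repeats a lot of work. As a possible shortcut, `GradientBoostingClassifier.staged_predict` yields the predictions after each additional tree, so the error at several ensemble sizes can be read from a single fit. The sketch below uses synthetic data and a smaller, hypothetical set of checkpoints.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# Fit once with the largest number of trees of interest
gbc = GradientBoostingClassifier(n_estimators=100, max_features=5, random_state=42)
gbc.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
stage_errors = [1.0 - accuracy_score(y_te, pred) for pred in gbc.staged_predict(X_te)]

checkpoints = [20, 40, 60, 80, 100]
error_by_trees = pd.Series(
    {n: stage_errors[n - 1] for n in checkpoints}, name="error"
).rename_axis("n_trees")
print(error_by_trees)
```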

In [None]:
view_scores(y_test,y_pred_gbc)

In [None]:
metrics.RocCurveDisplay.from_estimator(GBC, X_test, y_test)
plt.show()
metrics.PrecisionRecallDisplay.from_estimator(GBC, X_test, y_test)
plt.show()


# Predict  test data contained in "aug_test.csv" 

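**The final test data must be scaled too, reusing the scaler already fitted on the training data.**

In [None]:
cols=df_test.columns
# transform only: refitting the scaler on the test data would change the scale the model was trained on
df_test[cols] = mm.transform(df_test[cols])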

In [None]:
test_predictions=GBC.predict(df_test)
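
`predict` returns hard 0/1 labels. Since the task description asks for the probability of a job change, `predict_proba` may be the more fitting call. A minimal sketch on synthetic data (the model and data here are hypothetical stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
model = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)

# Column order of predict_proba follows model.classes_
positive_col = list(model.classes_).index(1)
proba_job_change = model.predict_proba(X_toy[:5])[:, positive_col]

print(proba_job_change.round(3))
```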

In [None]:
df_test_pred=pd.DataFrame(test_predictions,columns=['target'])

In [None]:
df_id=pd.DataFrame(id_test)
df_id_new=df_id.reset_index()
df_id_new.drop('index',axis=1,inplace=True)

In [None]:
df_pred_final=pd.concat([df_id_new,df_test_pred],axis=1)

In [None]:
df_pred_final.target.value_counts()

In [None]:
sns.countplot(x=df_pred_final['target']);
print(df_pred_final.target.value_counts(normalize=True))


In [None]:
df_pred_final

# Write to submission csv file

In [None]:
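# index=False keeps only the enrollee_id and target columns, matching the sample submission
df_pred_final.to_csv('test_data_predictions.csv', index=False)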