## Machine Learning 2

This year's assignment is about predicting employee attrition. The dataset is available on Kaggle: [Employee Attrition competition](https://www.kaggle.com/competitions/playground-series-s3e3/data)

* The goal is to predict whether an employee will leave the company or not (`Attrition` column, binary classification).
* The dataset is artificially generated, but it is based on real data: [original data](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)
    * There you can find what each feature represents and which values it can take.
* You're given a training set and a test set.
    * Perform your analysis, experiments and model selection on the training set, and don't touch the test set until you're ready to submit your predictions.
    * You can hold out part of the training set as a validation set, but you should not use the test set for this purpose.
    * Once you're comfortable with your model's performance on the training set, you can use the test set to get a final estimate of its performance.
    * The CSV file that you submit should contain your model's predictions on the test set: as many rows as the test set, and a single column with the predictions (0 or 1).
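
The submission format described in the last bullet can be sketched as follows. This is only a minimal sketch: the helper name `build_submission`, the column name `Attrition` and the output filename are assumptions made for illustration, not requirements stated by the assignment.

```python
import pandas as pd


def build_submission(predictions, n_test_rows):
    """Return a single-column frame of 0/1 predictions, one row per test row.

    The column name "Attrition" is an assumption made for this sketch.
    """
    submission = pd.DataFrame({"Attrition": list(predictions)}).astype(int)
    if len(submission) != n_test_rows:
        raise ValueError("expected exactly one prediction per test row")
    if not submission["Attrition"].isin([0, 1]).all():
        raise ValueError("predictions must be 0 or 1")
    return submission


# Toy usage with made-up predictions for a 4-row test set.
submission = build_submission([0, 1, 0, 0], n_test_rows=4)
# submission.to_csv("submission.csv", index=False)  # hypothetical filename
```

The two checks mirror the two constraints in the text: the row count matches the test set, and every value is 0 or 1.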

##### Imports

In [1]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

##### Loading Data

In [38]:
data = pd.read_csv("train.csv")

##### EDA
Move `id` to the index and drop obsolete columns

In [39]:
data = data.set_index("id")

In [9]:
#data.describe(include='all')  # summary statistics of the features, part 1; Attrition: 12% True, 88% False

In [10]:
#data.iloc[:,10:24].describe(include='all')  # summary statistics of the features, part 2

In [40]:
data = data.drop(['EmployeeCount','Over18','StandardHours'], axis=1)  # drop columns that hold a single constant value
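
Constant columns like the three dropped above can also be found programmatically instead of by reading the `describe` output. The following is a small sketch on a toy frame; the values are made up and only the column names echo the dataset.

```python
import pandas as pd

# Toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "Age": [31, 45, 28],
})

# A column with exactly one distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy_reduced = toy.drop(columns=constant_cols)
```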

##### One Hot Encoding for Categorical Columns

In [42]:
data_cat = data.select_dtypes("O")  # keep only the categorical (object-dtype) columns

In [64]:
# drop="if_binary" keeps a single column for features with only two categories
# sparse_output replaces the older sparse argument (use sparse=False on scikit-learn < 1.2)
ohe = OneHotEncoder(sparse_output=False, drop="if_binary")

In [65]:
data_cat_ohe = ohe.fit_transform(data_cat)

In [66]:
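# reuse the id index of data_cat so that the later pd.concat aligns rows by id
data_cat_ohe = pd.DataFrame(data_cat_ohe, columns=ohe.get_feature_names_out(), index=data_cat.index)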

In [67]:
data_full = pd.concat([data, data_cat_ohe], axis=1)
data_full = data_full.drop(columns=data_cat.columns)

In [63]:
data_full = data_full.astype(int)
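
One detail in the concatenation step is worth illustrating: `pd.concat(..., axis=1)` pairs rows by index label, not by position. The sketch below uses two toy frames with made-up values to show what happens when the labels differ. Whether this matters for `train.csv` depends on whether its `id` values coincide with 0..n-1, which is not checked here.

```python
import pandas as pd

# Toy frames: "left" mimics a frame indexed by id, "right" a freshly built frame.
left = pd.DataFrame({"Age": [31, 45]}, index=pd.Index([10, 11], name="id"))
right = pd.DataFrame({"OverTime_Yes": [1, 0]})  # default index 0, 1

# The labels do not match, so concat returns the union of both indexes with NaNs.
misaligned = pd.concat([left, right], axis=1)

# Rebuilding the right-hand frame on the left index keeps rows paired one-to-one.
right_aligned = pd.DataFrame(right.to_numpy(), columns=right.columns, index=left.index)
aligned = pd.concat([left, right_aligned], axis=1)
```

In the misaligned case a later `astype(int)` would fail on the NaN values, which makes the problem visible early.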

##### Correlation analysis

In [123]:
corr = data_full.corr()[['Attrition']]  # correlation of every feature with Attrition
corr['AbsolutCorrValues'] = corr['Attrition'].abs()  # rank by strength, regardless of sign
corr.sort_values(by='AbsolutCorrValues', ascending=False).head()

| | Attrition | AbsolutCorrValues |
|---|---|---|
| Attrition | 1.0 | 1.0 |
| StockOptionLevel | -0.194018 | 0.194018 |
| MaritalStatus_Single | 0.175006 | 0.175006 |
| OverTime_Yes | 0.173965 | 0.173965 |
| Age | -0.161044 | 0.161044 |


##### Feature set X and target y

In [112]:
X = data_full.drop(columns=['Attrition'])
#X = data_full.loc[:,['StockOptionLevel','Age','JobInvolvement','TotalWorkingYears','JobLevel']]
y = data_full['Attrition']

##### Model Selection

In [117]:
def evaluation(y_true, y_pred): 
    cm = confusion_matrix(y_true, y_pred)
    tn = cm[0, 0]
    tp = cm[1, 1]
    fn = cm[1, 0]
    fp = cm[0, 1]

    f1 = f1_score(y_true, y_pred)  # f1 = 2 * precision * sensitivity / (precision + sensitivity)

    sensitivity = tp / (tp + fn)  # share of actual positives that were found (same as recall)
    specificity = tn / (tn + fp)  # share of actual negatives that were found
    precision = tp / (tp + fp)  # share of positive predictions that were correct

    acc = accuracy_score(y_true, y_pred)  # acc = (tp + tn) / (tp + tn + fp + fn)
       
    print('confusion matrix:\n',cm[0,],'\tall negatives in test set =',(tn+fp),' \n',cm[1,],'\tall positives in test set=',(tp+fn),' \n')
    print(f"f1-score: {f1:.2f}\n\nsensitivity: {sensitivity:.2f}\nspecificity: {specificity:.2f}\nprecision: {precision:.2f}\n\naccuracy: {acc:.2f}\n")

###### Model1: Single Decision Tree

In [118]:
#max_depth
#min_samples_leaf
#min_samples_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25, stratify = y)

tree = DecisionTreeClassifier(random_state = 42, max_depth = 16, min_samples_leaf = 3, min_samples_split = 3) 
tree.fit(X_train, y_train)
y_test_pred = tree.predict(X_test)

evaluation(y_test,y_test_pred)

confusion matrix:
 [316  54] 	all negatives in test set = 370  
 [36 14] 	all positives in test set= 50  

f1-score: 0.24

sensitivity: 0.28
specificity: 0.85
precision: 0.21

accuracy: 0.79
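
The metrics printed above can be re-derived by hand from the four confusion-matrix counts, which is a useful check on the formulas in `evaluation`. The sketch below uses plain arithmetic on the Model1 counts and no library calls.

```python
# Counts copied from the Model1 confusion matrix above.
tn, fp = 316, 54
fn, tp = 36, 14

sensitivity = tp / (tp + fn)  # 14 of 50 actual leavers were found
specificity = tn / (tn + fp)  # 316 of 370 actual stayers were found
precision = tp / (tp + fp)    # 14 of 68 positive predictions were correct
f1 = 2 * precision * sensitivity / (precision + sensitivity)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"f1-score: {f1:.2f}, sensitivity: {sensitivity:.2f}, "
      f"specificity: {specificity:.2f}, precision: {precision:.2f}, "
      f"accuracy: {accuracy:.2f}")
```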



###### Model2: Random Forest

In [119]:
#max_depth: 16
#min_samples_leaf: 1
#min_samples_split: 2
#n_estimators=100

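# note: this split uses random_state=42, while Model1 used random_state=0, so Model1 was scored on a different hold-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25, stratify = y)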

forest = RandomForestClassifier(random_state = 0, max_depth = 16, min_samples_leaf = 1, min_samples_split = 2, n_estimators=100) 
forest.fit(X_train, y_train)
y_test_pred = forest.predict(X_test)

evaluation(y_test,y_test_pred)

confusion matrix:
 [366   4] 	all negatives in test set = 370  
 [46  4] 	all positives in test set= 50  

f1-score: 0.14

sensitivity: 0.08
specificity: 0.99
precision: 0.50

accuracy: 0.88



###### Model3: XGBoost

In [120]:
xgb = XGBClassifier(use_label_encoder=False, objective='binary:logistic')
xgb.fit(X_train, y_train)
y_test_pred = xgb.predict(X_test)

evaluation(y_test, y_test_pred)

confusion matrix:
 [360  10] 	all negatives in test set = 370  
 [37 13] 	all positives in test set= 50  

f1-score: 0.36

sensitivity: 0.26
specificity: 0.97
precision: 0.57

accuracy: 0.89
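
Because only about 12% of the employees leave, accuracy on its own says little about these models. A quick sanity check is to compare against a trivial predictor that always answers "no attrition". The sketch below does that arithmetic with the class counts from the confusion matrices above. The baseline is a reference point for illustration, not a model trained in this notebook.

```python
# Class counts of the hold-out split, taken from the confusion matrices above.
n_negative, n_positive = 370, 50

# A predictor that always answers "no attrition" gets every negative right
# and every positive wrong.
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_sensitivity = 0 / n_positive  # it never identifies a leaver

print(f"baseline accuracy: {baseline_accuracy:.2f}, "
      f"baseline sensitivity: {baseline_sensitivity:.2f}")
```

The random forest (0.88) and XGBoost (0.89) accuracies are therefore close to what this baseline reaches, so sensitivity, precision and the f1-score are the more informative numbers when comparing the three models.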



