Assignment 5: Fundamentals of Machine Learning.

The libraries below let the scientist use data cleaning (pandas), visualization if needed (plotly, seaborn) and facilitate the creation of complex models that bring relevant insights about data,  such as correlations, multiple lineal models to train data sets and to be evaluated  among others.

In [1]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

1) Importing the dataset with Pandas
The data selected was the HR Employee Attrition because of its relevance. Companies face major issues when dealing with attrition. Demotivated employees, vast difference in wages, bad work life balance among others can cause attrition of employees, what can generate huge consequences for the correct and fluent operation of companies. Diminishing Attrition means saving valuable resources such as time and money in capacitation, induction, task delegation, among others.

In [2]:
df1 = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df1.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


The first step to understand the data set is to display the top head. After visualizing it is important to see the total number of rows, columns and to know with what type of data are we dealing. therefore df.info() will help up with this to carry on afterwards.

In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

After seeing this, the easiest approach would be to consider all columns that are integer values. Only Attrition, our dependent variable will be not integer at the beginning, but we will take care of this afterwards.

A relevant insight which will be inmediately useful is the total attrition:
from 1470 employees 237 have quitted the job, equivalent to 15,85 %.

In [4]:
df1['Attrition'].value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64

The COlumn attrition must be transformed to a numerical value,i.e. dummy variable so sk learn can work with this. Therefore binary dummy variables will be created for yes = 1, no = 0 and added as column. For the model the column Attrition is not longer relevant as it is now represented with  either Yes or No. (only one of these two will be included in the model as Yes/No are perfectly correlated = 0 ) 

In [5]:
dummies = pd.get_dummies(df1['Attrition'])  
df1 = pd.concat([df1, dummies], axis=1) 
df1.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,No,Yes
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,0,8,0,1,6,4,0,5,0,1
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,1,10,3,3,10,7,1,7,1,0
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,0,7,3,3,0,0,0,0,0,1
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,0,8,3,3,8,7,3,0,1,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,1,6,3,3,2,2,2,2,1,0


For the purpose of the analysis, the data will be subsetted to the columns that are relevant for me according to what I understood to be important while working in companies from different sectors to ensure durability of employees.
It must be mentioned that the predictive model will be built to predict "YES", attrition = 1 ( as seen in column belows)

In [6]:
df = df1[['Yes', 'Age', 'Education', 'MonthlyRate', 'WorkLifeBalance', 'YearsAtCompany', 'DistanceFromHome', 'YearsSinceLastPromotion']]
df.head()

Unnamed: 0,Yes,Age,Education,MonthlyRate,WorkLifeBalance,YearsAtCompany,DistanceFromHome,YearsSinceLastPromotion
0,1,41,2,19479,1,6,1,0
1,0,49,1,24907,3,10,8,1
2,1,37,2,2396,3,0,2,0
3,0,33,4,23159,3,8,3,3
4,0,27,1,16632,3,2,2,2


For Analysing variables it is relevant to understand from the beginning the correlation among them. 

In [7]:
df.corr()

Unnamed: 0,Yes,Age,Education,MonthlyRate,WorkLifeBalance,YearsAtCompany,DistanceFromHome,YearsSinceLastPromotion
Yes,1.0,-0.159205,-0.031373,0.01517,-0.063939,-0.134392,0.077924,-0.033019
Age,-0.159205,1.0,0.208034,0.028051,-0.02149,0.311309,-0.001686,0.216513
Education,-0.031373,0.208034,1.0,-0.026084,0.009819,0.069114,0.021042,0.054254
MonthlyRate,0.01517,0.028051,-0.026084,1.0,0.007963,-0.023655,0.027473,0.001567
WorkLifeBalance,-0.063939,-0.02149,0.009819,0.007963,1.0,0.012089,-0.026556,0.008941
YearsAtCompany,-0.134392,0.311309,0.069114,-0.023655,0.012089,1.0,0.009508,0.618409
DistanceFromHome,0.077924,-0.001686,0.021042,0.027473,-0.026556,0.009508,1.0,0.010029
YearsSinceLastPromotion,-0.033019,0.216513,0.054254,0.001567,0.008941,0.618409,0.010029,1.0


General notes: (please correct me if I am wrong in the interpretations)
It can be noted that NONE of these variables are strongly correlated to Attrition = Yes (leaving the company). Still relevant: Years at Company are negative correlated to the dependent variable Y, meaning that people that employees who have worked longer are not likely to leave (reasons could be afraid of change, stability, too old to change).
This is confirmed by the corr age & attrition since a major age means a lower attrition rage.

Now that the basis for the analysis is defined the model can be built.
First the independent variables X and the dependent variable Y will be defined  as axes.


In [8]:
X = df[['Yes', 'Age', 'Education', 'MonthlyRate', 'WorkLifeBalance', 'YearsAtCompany', 'DistanceFromHome', 'YearsSinceLastPromotion']]
y = df['Yes'] #create dependent variable y (attrition = yes)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1) 

X_train.head() 

Unnamed: 0,Yes,Age,Education,MonthlyRate,WorkLifeBalance,YearsAtCompany,DistanceFromHome,YearsSinceLastPromotion
1073,0,28,1,3173,2,8,29,1
1105,0,33,4,10589,1,2,8,2
538,0,41,3,19562,3,22,1,2
1300,0,34,2,22128,3,10,8,4
1382,0,31,2,3995,4,4,3,2


for the training we choose a test_size of 30% , 0.3 so we have enough data to validate the model after knowing the performance. We avoid also overfitting the model or underfitting.
Furthermore, stratifiying will let the model be less biased when testing the trained model and final model.

k-nearest neighbor is a non paramethric method used for classification. Non Paramethric means that it does not have previous assumptions / parameters related to the data.
What this method does basically consist of setting random points among data set and then going to the closest neighbours k_neighbour = 3 ( three neighbours). It will calculate the eucledian distance between the random points and nearest neighbours and then train the model according to this.   

In [9]:
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=3) #the KNN is created and n = 3 neighbours that will make the model fit.
knn = knn.fit(X_train, y_train) #this fits the k-nearest neigbor model with the train data

Evaluting accuracy of model with the score method.

In [10]:
knn.score(X_test, y_test) #calculate the fit on the *test* data

0.7913832199546486

The accuraccy is 79,13 % . As threeshold the baseline "did not leave (NO) can be used, 1470 employees, Attrition : No = 1233, Yes =237 , so we evaluate did not leave, Attrition NO = 83,78 % what means that the model is below this number, what can be explained of course because of less information used to train the data set.

Evaluate the predictive performance of your model on the test set

In [11]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[342,  28],
       [ 64,   7]], dtype=int64)

the confusion matrix shows the 30% of the test model, equivalent to a sample of = 441. The values represented are 0 for Attrition NO ( 342, 64), 1 for Attrition YES ( 28, 7) as shown above with Attrition.counts() 
Furthermore, we can confirm this with knn.classes as shown below: 

In [12]:
knn.classes_

array([0, 1], dtype=uint8)

the above matrix can be made prettier and a new dataframe will be created

In [13]:
conf_matrix = pd.DataFrame(cm, index=['Attrition NO (actual)', 'Attrition YES (actual)'], columns = ['Attrition NO (predicted)', 'Attrition YES (predicted)']) #make a dataframe, put labels on rows (index) and columns 
conf_matrix

Unnamed: 0,Attrition NO (predicted),Attrition YES (predicted)
Attrition NO (actual),342,28
Attrition YES (actual),64,7


With this matrix we can double check the already calculated precission of the model.

In [14]:
(342 + 7) /(342 + 28 + 64 + 7)

0.7913832199546486

79,13 % is once again confirmed. 

Furthermore, the precission and recall of the model can be measured. Its only calculated for one outcome, so we will carry on with Attrition = Yes, as this is the outcome defined in the beginning to be predicted.
According to the confussion Matrix, we can work with the following formula to calculate the precission of the prediction for attrition.
PRECISSION:

In [15]:
7 / 35

0.2

The precission of the model is of 20% recall, way below our accuraccy. so for the outcome Attrition = Yes, 20% would be correctness of the prediction. Possible reasons are that the occurance of Attrition is not to representative in the data set ( only a 15,85 % of the total data.

Now recall, the number of correct predicted Employees who are leaving the company:

In [16]:
7 / 71

0.09859154929577464

the recall is 9,85 %, which is very low.
It is clear demonstrated that the most common outcome is predicted better, because its richier in data. Thus, to predict the people who will stay could give us way better results in matter of precission and recall.