# 0. Section objectives
In this notebook, I document how the data are pre-processed before modelling. I first split the whole dataset into train (80%) and test (20%) subsets. I then apply a number of pre-processing methods to the train dataset and save their parameters, so that the test dataset can later be pre-processed with the parameters of the train data. The last two steps define the validation methods for recursive feature elimination and k-fold cross-validation.

Best-practice pre-processing is **typically done within the modelling to avoid overfitting the models during cross-validation**, but I decided to make it a separate step for two reasons: 1) I wanted to review the data after each step and add more (and more robust) methods to the pre-processing, which is very cumbersome to do within the modelling step. 2) The critical performance metrics will later be obtained from the test dataset, which the models never see during training, so the predictions and confusion matrices it yields remain a genuine out-of-sample check.

# 1. Prepare the work environment

## 1.1 Set general options

In [1]:
#Set seed
set.seed(100)

In [2]:
#General options
options(scipen = 999,
        readr.num_columns = 0,
        warn=-1)

## 1.2 Set working directory

In [3]:
#Set wd
setwd("C:/Users/veren/github/ML_Project_Predict_Employee_Performance")

## 1.3 Load libraries

In [4]:
library(caret)
library(tidyverse)

Loading required package: lattice
Loading required package: ggplot2
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v tibble  3.0.1     v dplyr   0.8.5
v tidyr   1.0.3     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0
v purrr   0.3.4     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
x purrr::lift()   masks caret::lift()


## 1.4 Import user-defined functions

In [5]:
#Load functions
load("03_Objects/ud_functions.RData")

## 1.5 Import workspace from EDA

In [6]:
#Load EDA workspace
load("03_Objects/eda.RData")

# 2. Separate into train and test
In this step, the dataset is split into two subsets, train and test. All modelling and cross-validation is performed on the train dataset. The test dataset serves for out-of-sample validation: the models are evaluated by predicting values for data they have never seen before.

In [7]:
#Separate data into train (80%; to have sufficient data for modelling) and test (20%)
split <- caret::createDataPartition(perf$performance, p = .8, list = FALSE)

In [8]:
#Create train_raw
train_raw <- perf[split, ]

In [9]:
#Create test_raw
test_raw <- perf[-split, ]

In [10]:
#Save test raw
write.csv(test_raw, "01_Data/test_raw.csv")   #this dataset is saved to make possible standalone predictions later on.
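
For a factor outcome, `createDataPartition` draws the split within each class, so the class proportions in train and test stay close to those of the full dataset. The following base-R sketch illustrates that idea on toy data; `stratified_split` and the toy outcome `y` are hypothetical names used only for this illustration, not part of the notebook or of caret.

```r
# Hypothetical helper: stratified train/test split in base R.
# It only illustrates the idea of sampling within each outcome class.
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)          # row indices per class
  train_idx <- lapply(idx_by_class, function(idx) {
    n_take <- round(p * length(idx))              # share p of each class
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

set.seed(100)
# Toy outcome with the same three ordered levels as `performance`
y <- factor(rep(c("C", "B", "A"), times = c(40, 50, 10)),
            levels = c("C", "B", "A"), ordered = TRUE)
train_idx <- stratified_split(y, p = 0.8)

table(y[train_idx])    # class counts in the train part
table(y[-train_idx])   # class counts in the test part
```

Sampling within classes matters most for the rarest class: a purely random split could leave very few A performers in one of the two subsets.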

# 3. Pre-processing train data

Data pre-processing is handled for train and test datasets separately. To avoid any training effects on the test dataset, we will save the parameters found from the train dataset and then use them to perform the same pre-processing steps on the test dataset later on.

Steps for train:
1. Separate data into target and predictors (this makes it easier to perform One-Hot-Encoding which is actually not of use here, but I wanted to keep the code anyway in case someone wants to try it with categorical predictors)
2. One-Hot-Encoding for categorical predictors (not necessary here)
3. Robust winsorizing of predictors (with robust z-values + MAD) to adjust extreme univariate values (threshold alpha = .01) and avoid issues during later centering and scaling
4. Re-integrate pre-processed predictors with target variable
5. Use Mahalanobis Distance to inspect (and if necessary, delete) extreme multivariate values, at alpha = .001
6. Use preProcess with method = c("zv", "nzv", "corr", "center", "scale", "knnImpute"). These methods remove predictors with zero or near-zero variance, remove highly correlated predictors, center and scale all predictors, and impute missing values with the k-nearest-neighbours algorithm.

Steps for test (see in notebook 05_Model Evaluation):
1. Separate data into target and predictors
2. One-Hot-Encoding for categorical predictors (not necessary here)
3. Robust winsorizing of predictors (with robust z-values + MAD parameters from train) to adjust extreme univariate values (threshold alpha = .01)
4. Re-integrate pre-processed predictors with target variable
5. Use preProcess (parameters of train)

Note that Mahalanobis distance is not applied to the test dataset. I use MD on the train dataset to find and delete multivariate outliers, so that extreme cases do not lead to overfitting the models. In the test dataset, entire observations must never be deleted, because every case has to receive a prediction, which is why I skip this step there.
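
The `md` function used for this check comes from `ud_functions.RData` and its code is not shown in this notebook. As a rough sketch of how a multivariate outlier check at alpha = .001 can work, the example below compares squared Mahalanobis distances with a chi-square cut-off. The function name `flag_md_outliers`, the use of classical means and covariances, and the toy data are all assumptions made for illustration.

```r
# Hypothetical sketch of a Mahalanobis outlier check; not the notebook's md().
# Assumptions: classical mean/covariance, chi-square cut-off, complete cases only.
flag_md_outliers <- function(df, alpha = 0.001) {
  x <- df[, sapply(df, is.numeric), drop = FALSE]   # numeric predictors only
  x <- x[complete.cases(x), , drop = FALSE]         # distances need complete rows
  d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
  cutoff <- qchisq(1 - alpha, df = ncol(x))         # critical squared distance
  rownames(x)[d2 > cutoff]                          # row names of flagged cases
}

set.seed(100)
x1 <- rnorm(200)
# Two strongly correlated predictors plus one case that breaks the pattern
toy <- data.frame(x1 = c(x1, 3),
                  x2 = c(x1 + rnorm(200, sd = 0.1), -3))
flagged <- flag_md_outliers(toy)
flagged
```

The added case is not extreme on either variable alone, but it contradicts the correlation between them, which is exactly the kind of observation a univariate method such as winsorizing cannot catch.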

## 3.1 Train

In [11]:
#Save target separately
target_train <- train_raw$performance
train <- train_raw[, -1]

In [12]:
#Process one-hot-encoding for categorical variables
dmy <- dummyVars(~ ., data = train, levelsOnly = T, fullRank = T)
train <- as.data.frame(predict(dmy, train))


summary(train)

    quality      service_orientation   innovation     organisation 
 Min.   :2.000   Min.   :2.000       Min.   :1.000   Min.   :2.00  
 1st Qu.:3.000   1st Qu.:3.000       1st Qu.:3.000   1st Qu.:3.00  
 Median :3.000   Median :4.000       Median :3.000   Median :3.00  
 Mean   :3.489   Mean   :3.709       Mean   :3.231   Mean   :3.21  
 3rd Qu.:4.000   3rd Qu.:4.000       3rd Qu.:4.000   3rd Qu.:4.00  
 Max.   :5.000   Max.   :5.000       Max.   :5.000   Max.   :5.00  
                                                     NA's   :6     
 problem_solving   curiosity      determination       analysis     
 Min.   :2.000   Min.   :0.4600   Min.   :0.5400   Min.   :0.2100  
 1st Qu.:3.000   1st Qu.:0.7500   1st Qu.:0.8225   1st Qu.:0.6025  
 Median :4.000   Median :0.8700   Median :0.9400   Median :0.7700  
 Mean   :3.529   Mean   :0.8435   Mean   :0.8973   Mean   :0.7508  
 3rd Qu.:4.000   3rd Qu.:0.9700   3rd Qu.:1.0000   3rd Qu.:0.9200  
 Max.   :5.000   Max.   :1.0000   Max.   :1.0000

In [13]:
#Save parameters of train
centers_train <- apply(train, 2, function(x) median(x, na.rm = T))
scales_train <- apply(train, 2, mad_na)

In [14]:
#Winsorizing (robust method with median and mad)
train <- wins_rob(train, centers = centers_train, scales = scales_train)

summary(train)

    quality      service_orientation   innovation     organisation 
 Min.   :2.000   Min.   :3.743       Min.   :2.743   Min.   :2.00  
 1st Qu.:3.000   1st Qu.:3.743       1st Qu.:3.000   1st Qu.:3.00  
 Median :3.000   Median :4.000       Median :3.000   Median :3.00  
 Mean   :3.489   Mean   :3.936       Mean   :3.058   Mean   :3.21  
 3rd Qu.:4.000   3rd Qu.:4.000       3rd Qu.:3.257   3rd Qu.:4.00  
 Max.   :5.000   Max.   :4.257       Max.   :3.257   Max.   :5.00  
                                                     NA's   :6     
 problem_solving   curiosity      determination       analysis     
 Min.   :2.000   Min.   :0.4600   Min.   :0.7114   Min.   :0.2100  
 1st Qu.:3.000   1st Qu.:0.7500   1st Qu.:0.8225   1st Qu.:0.6025  
 Median :4.000   Median :0.8700   Median :0.9400   Median :0.7700  
 Mean   :3.529   Mean   :0.8435   Mean   :0.9011   Mean   :0.7508  
 3rd Qu.:4.000   3rd Qu.:0.9700   3rd Qu.:1.0000   3rd Qu.:0.9200  
 Max.   :5.000   Max.   :1.0000   Max.   :1.0000
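
The winsorizing above relies on the user-defined functions `mad_na` and `wins_rob`, whose code is not part of this notebook. The sketch below shows one way such a step could look: values are clipped to the interval median ± z × MAD, and the same train parameters are reused for new data. The function name `winsorize_robust`, the cut-off `qnorm(1 - alpha / 2)` and the toy data are assumptions, so the results will not necessarily match the output shown above.

```r
# Hypothetical sketch of robust winsorizing; not the notebook's wins_rob().
# Assumption: the cut-off is the two-sided normal quantile for alpha = .01.
winsorize_robust <- function(df, centers, scales, alpha = 0.01) {
  z_max <- qnorm(1 - alpha / 2)                        # about 2.58
  for (col in names(df)) {
    lower <- centers[[col]] - z_max * scales[[col]]
    upper <- centers[[col]] + z_max * scales[[col]]
    df[[col]] <- pmin(pmax(df[[col]], lower), upper)   # clip; NAs stay NA
  }
  df
}

toy_train <- data.frame(x = c(1, 2, 3, 4, 5, 100),
                        y = c(0.5, 0.6, 0.7, 0.8, NA, 0.9))
toy_test  <- data.frame(x = c(-50, 3, 60),
                        y = c(0.1, 0.7, 2))

# Parameters are estimated on the train data only
centers <- sapply(toy_train, median, na.rm = TRUE)
scales  <- sapply(toy_train, mad, na.rm = TRUE)

train_w <- winsorize_robust(toy_train, centers, scales)
test_w  <- winsorize_robust(toy_test, centers, scales)   # reuses train parameters
```

In this sketch, a column whose MAD is zero would be collapsed onto its median, so a real implementation needs an explicit rule for that case.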

In [15]:
#Re-integrate data with target
train <- data.frame(performance = target_train, train)

In [16]:
#Mahalanobis Distance (delete multivariate extreme values)
md(train, showout = T)  #just show multivariate outliers; there are none

train <- md(train)

performance,quality,service_orientation,innovation,organisation,problem_solving,curiosity,determination,analysis,empowerment
<ord>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>


In [17]:
#Define pre-processing
prep <- caret::preProcess(train, method = c("zv", "nzv", "corr",
                                            "center", "scale",
                                            "knnImpute"))

In [18]:
#Pre-process train
train <- as.data.frame(predict(prep, train))


summary(train)
anyNA(train)

 performance    quality        service_orientation   innovation     
 C:63        Min.   :-2.1178   Min.   :-1.1626     Min.   :-1.8693  
 B:95        1st Qu.:-0.6955   1st Qu.:-1.1626     1st Qu.:-0.3437  
 A:24        Median :-0.6955   Median : 0.3819     Median :-0.3437  
             Mean   : 0.0000   Mean   : 0.0000     Mean   : 0.0000  
             3rd Qu.: 0.7268   3rd Qu.: 0.3819     3rd Qu.: 1.1819  
             Max.   : 2.1490   Max.   : 1.9264     Max.   : 1.1819  
  organisation      problem_solving       curiosity       determination    
 Min.   :-1.65816   Min.   :-2.251706   Min.   :-2.7788   Min.   :-1.8548  
 1st Qu.:-0.28804   1st Qu.:-0.779107   1st Qu.:-0.6776   1st Qu.:-0.7684  
 Median :-0.28804   Median : 0.546231   Median : 0.1919   Median : 0.3804  
 Mean   : 0.01459   Mean   :-0.002352   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 1.08209   3rd Qu.: 0.693491   3rd Qu.: 0.9164   3rd Qu.: 0.9670  
 Max.   : 2.45222   Max.   : 2.166090   Max.   : 1.1338   Max

# 4. Define cross-validation methods in caret framework

## 4.1 Control methods for Recursive Feature Elimination
Usually, RFE is not necessary for a dataset with so few predictors, because its primary purpose is to find the handful of strongest predictors in a large predictor pool.

However, as the original project contained up to 30 predictors and the challenge was to find the best predictors according to a principle of parsimony - i.e. it was deemed fundamental to achieve acceptably high predictive validity with the lowest number of predictors possible - I will show how RFE can be used to streamline the selection process of the client.

In [19]:
#Naive Bayes
rfe_nb <- rfeControl(functions = nbFuncs,
                     method = "repeatedcv",
                     repeats = 10,
                     verbose = FALSE)

In [20]:
#Linear Discriminant Analysis
rfe_lda <- rfeControl(functions = ldaFuncs,
                      method = "repeatedcv",
                      repeats = 10,
                      verbose = FALSE)

In [21]:
#Random Forest
rfe_rf <- rfeControl(functions = rfFuncs,
                     method = "repeatedcv",
                     repeats = 10,
                     verbose = FALSE)

In [22]:
#Others
rfe_gen <- rfeControl(functions = caretFuncs,
                      method = "repeatedcv",
                      repeats = 5,
                      verbose = FALSE)
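
The control objects above only configure the resampling; the elimination itself happens later, when they are passed to `caret::rfe`. To make the principle concrete, here is a deliberately simplified base-R sketch of the elimination loop: fit a model, rank the predictors, drop the weakest, and repeat. It uses a linear model with absolute t-values as the importance measure and no resampling, which is an assumption chosen for brevity and differs from the classification models in this project. `rfe_sketch` and the toy data are hypothetical.

```r
# Hypothetical sketch of the recursive elimination loop (no resampling).
# Assumption: importance = absolute t-value of each predictor in a linear model.
rfe_sketch <- function(df, outcome, n_keep = 2) {
  stopifnot(n_keep >= 1)
  predictors <- setdiff(names(df), outcome)
  while (length(predictors) > n_keep) {
    fit <- lm(reformulate(predictors, response = outcome), data = df)
    t_abs <- abs(coef(summary(fit))[predictors, "t value"])
    weakest <- names(which.min(t_abs))            # least important predictor
    predictors <- setdiff(predictors, weakest)    # drop it and refit
  }
  predictors
}

set.seed(100)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n, sd = 0.5)   # only x1, x2 matter

kept <- rfe_sketch(toy, outcome = "y", n_keep = 2)
kept
```

In practice the subset size is not fixed in advance: each candidate size is scored under resampling, and the most parsimonious subset with acceptable performance is chosen.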

## 4.2 Control method for k-fold cross-validation

k-fold cross-validation is a good way to estimate out-of-sample validity when there are few data points to work with, because each fold is held out once while the model is fitted on the remaining folds. In this case, I decided to use 10-fold cross-validation repeated 10 times, combined with SMOTE (Synthetic Minority Over-sampling Technique), a procedure of over- and undersampling typically used for small and/or imbalanced datasets that contain few examples of the class to be predicted (which, in this case, would be our A performers).
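
As an illustration of what "10 times repeated 10-fold" means in terms of row indices, the following base-R sketch builds the held-out sets for 182 rows (the size of the train dataset). `repeated_kfold` is a hypothetical helper; unlike caret, this simplified version does not stratify the folds by the outcome.

```r
# Hypothetical sketch: held-out row indices for repeated k-fold CV.
# Assumption: plain random folds, without stratification by class.
repeated_kfold <- function(n, k = 10, repeats = 10) {
  folds <- list()
  for (r in seq_len(repeats)) {
    # Assign every row to one of k folds, in a fresh random order per repeat
    fold_id <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      folds[[sprintf("Fold%02d.Rep%02d", f, r)]] <- which(fold_id == f)
    }
  }
  folds
}

set.seed(100)
folds <- repeated_kfold(n = 182, k = 10, repeats = 10)

length(folds)           # number of held-out sets (k * repeats)
range(lengths(folds))   # smallest and largest held-out set
```

Within one repeat, every row is held out exactly once; repeating the procedure with new random folds reduces the dependence of the estimate on a single partition.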

In [23]:
#Definition of CV method
ctrl <- caret::trainControl(method = "repeatedcv",
                            number = 10, repeats = 10,   #10-fold CV, repeated 10 times
                            selectionFunction = "best",
                            summaryFunction = multiClassSummary,
                            savePredictions = "final", classProbs = TRUE,
                            allowParallel = TRUE,
                            sampling = "smote")
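
With `sampling = "smote"`, caret applies the resampling inside each cross-validation iteration, so the held-out fold itself is left untouched. The core of SMOTE's over-sampling part is an interpolation between a minority-class case and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in base R; `smote_sketch`, the choice of `k = 5` and the toy data are assumptions for illustration, not the implementation caret calls.

```r
# Hypothetical sketch of the SMOTE interpolation step (over-sampling only).
# Assumptions: numeric predictors, Euclidean distance, more than k minority cases.
smote_sketch <- function(x_min, n_new, k = 5) {
  x_min <- as.matrix(x_min)
  stopifnot(nrow(x_min) > k)
  d <- as.matrix(dist(x_min))                      # pairwise distances
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x_min),
                  dimnames = list(NULL, colnames(x_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(nrow(x_min), 1)             # random minority case
    nn <- order(d[base, ])[2:(k + 1)]              # k nearest, skipping itself
    nb <- nn[sample.int(k, 1)]                     # pick one neighbour
    gap <- runif(1)                                # position along the segment
    synth[i, ] <- x_min[base, ] + gap * (x_min[nb, ] - x_min[base, ])
  }
  synth
}

set.seed(100)
# Toy minority class: 24 cases (the number of A performers in train)
minority <- cbind(curiosity = runif(24), determination = runif(24))
synth <- smote_sketch(minority, n_new = 10)
dim(synth)
```

Because every synthetic case lies on a line segment between two real minority cases, the new points stay within the range of the observed minority data.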

# 5. Save workspace
I save the workspace as an R object so that it can be imported into the following notebooks.

In [24]:
save.image(file = "03_Objects/pre_process.RData")