# Data Pre-processing

https://rpubs.com/sidTyson92/329310
1. Taking care of missing data

In [4]:
# Import the data set

dataset <- read.csv('Data.csv')
dataset

Country,Age,Salary,Purchased
<fct>,<int>,<int>,<fct>
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,,Yes
France,35.0,58000.0,Yes
Spain,,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes


In [5]:
# Age and Salary both contain missing values (NA).
# Replace each NA with the mean of the non-NA entries in that column.

dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)

dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Salary)
                           
dataset


Country,Age,Salary,Purchased
<fct>,<dbl>,<dbl>,<fct>
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,63777.78,Yes
France,35.0,58000.0,Yes
Spain,38.77778,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes


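How does the ave() function work here?

Read it like this: for each entry of dataset$Age, if the entry is NA, ifelse() takes the value from ave(dataset$Age, ...); otherwise it keeps the original entry. Called without a grouping variable, ave() applies FUN to the whole column and returns a vector of the same length, here filled with the column mean. FUN is a function of x that computes the mean while excluding the NA values (na.rm = TRUE).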

In [6]:
# define x = 1 2 3
x <- 1:3
# introduce a missing value
x[1] <- NA
# mean() returns NA when any value is missing
mean(x)

In [7]:
# mean of the non-NA values only (here 2.5)
mean(x, na.rm = TRUE)
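
To see why ave() fits inside ifelse(), note that without a grouping variable it returns a vector as long as its input, with every position holding the value computed by FUN. A minimal sketch on a toy vector (the names age, col_mean and imputed are illustrative, not part of the dataset):

```r
# Toy vector with one missing value (illustrative only)
age <- c(44, 27, NA, 38)

# Without a grouping variable, ave() applies FUN to the whole vector
# and recycles the result to the length of the input
col_mean <- ave(age, FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values, fill the NA with the column mean
imputed <- ifelse(is.na(age), col_mean, age)
print(imputed)
```

Because col_mean has one value per row, ifelse() can pick element by element between the original entry and the mean.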

2. Categorical data

Categorical data is non-numeric data that belongs to a fixed set of categories, like the Country column in the dataset.

In R versions before 4.0.0, read.csv() converts all string columns to categorical variables (factors) by default; from R 4.0.0 onward the default is stringsAsFactors = FALSE and strings stay as character columns. Even where the conversion is automatic it is not always wanted: a name column, for example, should not be a factor. Below is the code to make specific variables factor variables.

In [None]:
# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))

dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
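
A small sketch of what factor() does with levels and labels, on a toy vector rather than the dataset (the variable names are illustrative). The labels become the factor's level names, so the result is still a factor and not a numeric column:

```r
# Toy country vector (illustrative only)
country <- c("France", "Spain", "Germany", "Spain")

# Map each level to a label, in the order given by `levels`
encoded <- factor(country,
                  levels = c("France", "Spain", "Germany"),
                  labels = c(1, 2, 3))

print(levels(encoded))      # level names are the labels, stored as strings
print(as.integer(encoded))  # underlying integer codes
```

If a numeric column is needed later, as.numeric(as.character(encoded)) converts the labels back to numbers.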

3. Splitting into training and test datasets

For machine learning, part of the dataset is used to train the model and the rest is held back to test the model once training is done.

For this we need to install the caTools package:

In [9]:
install.packages('caTools')
library(caTools)

also installing the dependency 'bitops'



package 'bitops' successfully unpacked and MD5 sums checked
package 'caTools' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\sxw17\AppData\Local\Temp\RtmpasX4Xt\downloaded_packages


"package 'caTools' was built under R version 3.6.1"

In [11]:
set.seed(123) # fix the random seed so the split is reproducible; it can be omitted in real use
split = sample.split(dataset$Purchased,SplitRatio = 0.8)
training_set = subset(dataset,split == TRUE)
test_set = subset(dataset, split == FALSE)

training_set

test_set

,Country,Age,Salary,Purchased
,<fct>,<dbl>,<dbl>,<fct>
1,France,44.0,72000.0,No
2,Spain,27.0,48000.0,Yes
3,Germany,30.0,54000.0,No
4,Spain,38.0,61000.0,No
5,Germany,40.0,63777.78,Yes
7,Spain,38.77778,52000.0,No
8,France,48.0,79000.0,Yes
10,France,37.0,67000.0,Yes


,Country,Age,Salary,Purchased
,<fct>,<dbl>,<dbl>,<fct>
6,France,35,58000,Yes
9,Germany,50,83000,No


SplitRatio is the fraction of rows assigned to the training set; 0.8 gives the usual 80:20 split between training and test.

sample.split() takes the label column and returns a logical vector in which TRUE marks the training rows. The rows are chosen at random in the given split ratio, while the relative proportions of the labels are preserved.

subset() returns the rows of the dataset for which the condition is TRUE.
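
For comparison, a plain random split can be sketched in base R with sample(). This is a swapped-in alternative, not what sample.split() does: it ignores the class proportions of the label column. The toy data frame and all names below are illustrative.

```r
set.seed(123)  # reproducible row selection

# Toy data frame standing in for the real dataset (values are illustrative)
toy <- data.frame(Age = c(44, 27, 30, 38, 40, 35, 39, 48, 50, 37),
                  Purchased = c("No", "Yes", "No", "No", "Yes",
                                "Yes", "No", "Yes", "No", "Yes"))

# Pick 80% of the row indices at random for training
train_idx <- sample(seq_len(nrow(toy)), size = round(0.8 * nrow(toy)))

toy_train <- toy[train_idx, ]   # selected rows
toy_test  <- toy[-train_idx, ]  # all remaining rows

print(nrow(toy_train))
print(nrow(toy_test))
```

On a dataset with a rare class, a split like this can leave that class out of the test set entirely, which is the case a label-aware split is meant to avoid.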

4. Feature scaling

Feature scaling is needed when features have very different ranges, for example the Age and Salary columns: ages are in the tens while salaries are in the tens of thousands.

Training a model means minimizing some error, for example when fitting a line in linear regression, usually with an algorithm such as gradient descent. Many models also rely on the Euclidean distance between observations.

Without feature scaling, the feature with the larger values dominates that distance, so training is biased heavily towards it.

Hence we need feature scaling, which is done in the steps below:

In [13]:
training_set
test_set

,Country,Age,Salary,Purchased
,<fct>,<dbl>,<dbl>,<fct>
1,France,44.0,72000.0,No
2,Spain,27.0,48000.0,Yes
3,Germany,30.0,54000.0,No
4,Spain,38.0,61000.0,No
5,Germany,40.0,63777.78,Yes
7,Spain,38.77778,52000.0,No
8,France,48.0,79000.0,Yes
10,France,37.0,67000.0,Yes


,Country,Age,Salary,Purchased
,<fct>,<dbl>,<dbl>,<fct>
6,France,35,58000,Yes
9,Germany,50,83000,No


In [14]:
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])
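
scale() standardizes each column: it subtracts the column mean and divides by the column standard deviation. A minimal check of that formula on the two test-set ages, 35 and 50 (the variable names are illustrative):

```r
age <- c(35, 50)

# Standardize by hand: (x - mean) / standard deviation
by_hand <- (age - mean(age)) / sd(age)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
by_scale <- as.numeric(scale(age))

print(by_hand)
print(all.equal(by_hand, by_scale))
```

With only two rows, any column standardized this way comes out as roughly -0.707 and 0.707, which is why Age and Salary show identical values in the scaled test set.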

In [15]:
training_set

,Country,Age,Salary,Purchased
,<fct>,<dbl>,<dbl>,<fct>
1,France,0.90101716,0.9392746,No
2,Spain,-1.58847494,-1.337116,Yes
3,Germany,-1.14915281,-0.7680183,No
4,Spain,0.02237289,-0.1040711,No
5,Germany,0.31525431,0.1594,Yes
7,Spain,0.13627122,-0.9577176,No
8,France,1.48678,1.6032218,Yes
10,France,-0.12406783,0.4650265,Yes


In [16]:
test_set

,Country,Age,Salary,Purchased
,<fct>,<dbl>,<dbl>,<fct>
6,France,-0.7071068,-0.7071068,Yes
9,Germany,0.7071068,0.7071068,No
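
The code above scales the test set with its own mean and standard deviation. A common alternative is to reuse the training set's statistics, so that both sets end up on the same scale. A hedged sketch on toy vectors (train_age, test_age and the other names are illustrative; scale() stores the values it used in the "scaled:center" and "scaled:scale" attributes of its result):

```r
# Toy vectors standing in for the Age column of each set (illustrative only)
train_age <- c(44, 27, 30, 38, 40, 48, 37)
test_age  <- c(35, 50)

# Standardize the training data and keep the parameters that were used
train_scaled <- scale(train_age)
center <- attr(train_scaled, "scaled:center")  # training mean
spread <- attr(train_scaled, "scaled:scale")   # training standard deviation

# Apply the training parameters to the test data
test_scaled <- scale(test_age, center = center, scale = spread)

print(as.numeric(test_scaled))
```

Scaled this way, a test value keeps its position relative to the training data instead of being forced to mean 0 within a two-row set.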


Note: Many modeling libraries in R take care of feature scaling internally, so we might not always need to include this step.