## *Learn Maching Learning* series on Kaggle in R

This is my R code for the first sections 3 and 4 of the level 2 part of the *Learn Machine Learning* series on Kaggle. I've already done the Python one, which is on Kaggle located [here](https://www.kaggle.com/learn/machine-learning). The data used is from the [*Home Prices: Advanced Regression Techniques*](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition.

I had planned on doing all of level 2 but that was more difficult than I expected. One reason for that is the Python series told me what packages to use and gave me an outline of steps to follow. Since I'm doing this in R with no tutorial to follow, I had to research and decide which packages to use and steps to follow. That slowed me down but I definitely learned a lot. I also had problems with Jupyter and the R kernel, so I worked on this in RStudio while fixing those issues.

### Load and install packages and load the data
I learned since the last time that I need to do an if-else statement, not just if, when checking for packages before loading them.

In [11]:
# Install and load packages
if (!require("randomForest")) {
  install.packages("randomForest", repos="http://cran.rstudio.com/")
  library(randomForest)
} else {
  library(randomForest)
}

if (!require("dplyr")) {
  install.packages("dplyr", repos="http://cran.rstudio.com/")
  library(dplyr)
} else {
  library(dplyr)
}

if (!require("caTools")) {
  install.packages("caTools", repos="http://cran.rstudio.com/")
  library(caTools)
} else {
  library(caTools)
}

if (!require("rpart")) {
  install.packages("rpart", repos="http://cran.rstudio.com/")
  library(rpart)
} else {
  library(rpart)
}

if (!require("caret")) {
  install.packages("caret", repos="http://cran.rstudio.com/")
  library(rpart)
} else {
  library(rpart)
}

if (!require("xgboost")) {
  install.packages("xgboost", repos="http://cran.rstudio.com/")
  library(xgboost)
} else {
  library(xgboost)
}

# Save filepath to variable
training_data_filepath <- "C:/Development/Kaggle/House Prices - Advanced Regression Techniques/train.csv"

# Import data
dataset <- read.csv(training_data_filepath)

Loading required package: xgboost
"package 'xgboost' was built under R version 3.4.4"
Attaching package: 'xgboost'

The following object is masked from 'package:dplyr':

    slice



### XGBoost

I'm going to use XGBoost on the data, which requires some upfront work.

In [2]:
# Create the dataframe with only numeric predictors
nums <- unlist(lapply(dataset, is.numeric))
dataset_nums <- dataset[, nums]

# Create the dataframe with only categorical predictors
dataset_cat <- dataset[, !nums]

# Use dummyVars from the caret package to perform one-hot encoding
# on the categorical features
dummies <- dummyVars( ~ ., data=dataset_cat)
dataset_dummies <- as.data.frame(predict(dummies, newdata=dataset_cat))

# Impute missing numeric data using rfImpute
dataset_nums_impute <- rfImpute(SalePrice ~ ., dataset_nums)

     |      Out-of-bag   |
Tree |      MSE  %Var(y) |
 300 | 8.683e+08    13.77 |
     |      Out-of-bag   |
Tree |      MSE  %Var(y) |
 300 | 8.534e+08    13.53 |
     |      Out-of-bag   |
Tree |      MSE  %Var(y) |
 300 | 8.517e+08    13.50 |
     |      Out-of-bag   |
Tree |      MSE  %Var(y) |
 300 | 8.635e+08    13.69 |
     |      Out-of-bag   |
Tree |      MSE  %Var(y) |
 300 | 8.504e+08    13.48 |


In [3]:
dataset_combined <- cbind(dataset_nums_impute, dataset_dummies)

In [4]:
head(dataset_combined)

SalePrice,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,...,SaleType.CWD,SaleType.New,SaleType.Oth,SaleType.WD,SaleCondition.Abnorml,SaleCondition.AdjLand,SaleCondition.Alloca,SaleCondition.Family,SaleCondition.Normal,SaleCondition.Partial
208500,1,60,65,8450,7,5,2003,2003,196,...,0,0,0,1,0,0,0,0,1,0
181500,2,20,80,9600,6,8,1976,1976,0,...,0,0,0,1,0,0,0,0,1,0
223500,3,60,68,11250,7,5,2001,2002,162,...,0,0,0,1,0,0,0,0,1,0
140000,4,70,60,9550,7,5,1915,1970,0,...,0,0,0,1,1,0,0,0,0,0
250000,5,60,84,14260,8,5,2000,2000,350,...,0,0,0,1,0,0,0,0,1,0
143000,6,50,85,14115,5,5,1993,1995,0,...,0,0,0,1,0,0,0,0,1,0


#### Split the data set into training and test

This is simliar to before, using dataset_combined instead of dataset.

In [5]:
# Split data into training and validation data, for both predictors and target.
set.seed(42)
split <- sample.split(dataset_combined, SplitRatio=0.7)  # for training data
training_set <- subset(dataset_combined, split==TRUE)
test_set <- subset(dataset_combined, split==FALSE)

To show the number of unique values in each field:

    sapply(training_set_cat, function(x) {length(unique(x))})

To show the actual values, you can run the following:

    apply(training_set_cat, 2, function(x) {unique(x)})

In [7]:
# Create X and y from the training and test sets
X_train <- training_set %>%
    select(-c(Id, SalePrice))
X_test <- test_set %>%
    select(-c(Id, SalePrice))
y_train <- training_set$SalePrice
y_test <- test_set$SalePrice

In [32]:
# Create the XGBoost model and run
xgb <- xgboost(data.matrix(X_train), y_train, nrounds=25, verbose=0)

In [27]:
# Predict values
y_pred <- predict(xgb, data.matrix(X_test))

In [28]:
mae <- function(error)
{
  mean(abs(error))
}

error <- y_pred - y_test

mae(error)

In [31]:
# Create a function to get the MAE of XGBoost nround values
getMae_xgb <- function(X_train, y_train, X_test, y_test, n) {
  set.seed(42)
  xgb <- xgboost(data.matrix(X_train), y_train, nrounds=n, verbose=0)
  y_pred <- predict(xgb, data.matrix(X_test))
  error <- (y_pred - y_test)
  print(paste("nround of ", n, " has an MAE of ", mae(error), sep=""))
}

# Loop through multiple nround values
nrounds = c(1, 5, 10, 30, 50, 100, 500, 1000, 5000)

for (i in nrounds) {
  getMae_xgb(X_train, y_train, X_test, y_test, i)
}

[1] "nround of 1 has an MAE of 136694.404586062"
[1] "nround of 5 has an MAE of 39075.4283438924"
[1] "nround of 10 has an MAE of 21147.2548049544"
[1] "nround of 30 has an MAE of 19529.0262136959"
[1] "nround of 50 has an MAE of 19375.0769148633"
[1] "nround of 100 has an MAE of 19388.9438621156"
[1] "nround of 500 has an MAE of 19361.7373202591"
[1] "nround of 1000 has an MAE of 19361.7657940632"
[1] "nround of 5000 has an MAE of 19362.0505321042"


nround of 500 has the lowest MAE. There are other parameters I can tune and may try at a later date.

### Next steps

I'm going to work on the next few sections in R but I'm not sure how some of it will translate to R. I know I can do XGBoost but I'm not entirely sure about Partial Dependence Plots. I can also do Cross Validation and probably Data Leakage, but I'm unsure about Pipelines. I know I can do pipelines with %>% and I think I can do similar to pipelines in Python with that. I'll have to play around with it.

Before doing all that I think I'm going to do a post on how I set up my R environment. I like what I have set up now and would like to share it.