## *Learn Machine Learning* series on Kaggle in R

This is my R code for sections 3 and 4 of the level 2 part of the *Learn Machine Learning* series on Kaggle. I've already worked through the Python version, which is on Kaggle [here](https://www.kaggle.com/learn/machine-learning). The data is from the [*House Prices: Advanced Regression Techniques*](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition.

I had planned on doing all of level 2, but that was more difficult than I expected. One reason is that the Python series told me which packages to use and gave me an outline of steps to follow. Since I'm doing this in R with no tutorial to follow, I had to research and decide which packages to use and which steps to take. That slowed me down, but I definitely learned a lot. I also had problems with Jupyter and the R kernel, so I worked on this in RStudio while fixing those issues.

### Load and install packages and load the data
Since last time, I've learned that I need an if-else statement, not just an if, when checking for packages before loading them.

In [1]:
# Install and load packages
if (!require("randomForest")) {
  install.packages("randomForest", repos="http://cran.rstudio.com/")
  library(randomForest)
} else {
  library(randomForest)
}

if (!require("dplyr")) {
  install.packages("dplyr", repos="http://cran.rstudio.com/")
  library(dplyr)
} else {
  library(dplyr)
}

if (!require("caTools")) {
  install.packages("caTools", repos="http://cran.rstudio.com/")
  library(caTools)
} else {
  library(caTools)
}

if (!require("rpart")) {
  install.packages("rpart", repos="http://cran.rstudio.com/")
  library(rpart)
} else {
  library(rpart)
}

if (!require("caret")) {
  install.packages("caret", repos="http://cran.rstudio.com/")
  library(caret)
} else {
  library(caret)
}

# Save filepath to variable
training_data_filepath <- "C:/Development/Kaggle/House Prices - Advanced Regression Techniques/train.csv"

# Import data
dataset <- read.csv(training_data_filepath)

Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Loading required package: dplyr

Attaching package: 'dplyr'

The following object is masked from 'package:randomForest':

    combine

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Loading required package: caTools
"package 'caTools' was built under R version 3.4.4"
Loading required package: rpart
Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2

Attaching package: 'ggplot2'

The following object is masked from 'package:randomForest':

    margin
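
The five near-identical blocks above could be folded into one helper. This is a minimal sketch, not what the notebook ran: `load_or_install` is a hypothetical name, and it uses `requireNamespace()` to check for a package without attaching it, so a single `library()` call afterwards covers both the freshly installed and the already installed case. It is demonstrated on `stats` and `utils`, which ship with R, so nothing gets downloaded.

```r
# Hypothetical helper: install each package only if it is missing, then attach it
load_or_install <- function(pkgs, repos = "https://cran.rstudio.com/") {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE (without an error) if pkg is not installed
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg, repos = repos)
    }
    # character.only = TRUE lets library() take the name from a variable
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demo with packages that are always available in a standard R install
loaded <- load_or_install(c("stats", "utils"))
```

For this notebook the call would presumably be `load_or_install(c("randomForest", "dplyr", "caTools", "rpart", "caret"))`.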



### XGBoost

I'm going to use XGBoost on the data, which requires some upfront work: XGBoost needs all-numeric input, so the categorical predictors have to be one-hot encoded first.

In [33]:
# Create the dataframe with only numeric predictors
nums <- unlist(lapply(dataset, is.numeric))
dataset_nums <- dataset[, nums]

# Create the dataframe with only categorical predictors
dataset_cat <- dataset[, !nums]

# Use dummyVars from the caret package to perform one-hot encoding
# on the categorical features
dummies <- dummyVars( ~ ., data=dataset_cat)
dataset_dummies <- as.data.frame(predict(dummies, newdata=dataset_cat))

# Impute missing numeric data
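
The imputation step is left empty in the cell above. One simple option is median imputation, sketched below in base R on a made-up toy data frame. This is an assumption about what the step could look like, not the approach the notebook ends up using; `toy_nums` and `impute_median` are hypothetical names and the values are invented.

```r
# Toy stand-in for dataset_nums (values are made up for illustration)
toy_nums <- data.frame(
  LotFrontage = c(65, NA, 68, 60),
  MasVnrArea  = c(196, 0, NA, 0)
)

# Replace every NA in a numeric column with that column's median
impute_median <- function(df) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  df
}

toy_imputed <- impute_median(toy_nums)
```

Assigning into `df[]` keeps the result a data frame with the original column names. In a train/test setting, the medians would normally be computed on the training set only and then reused for the test set.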


In [34]:
dataset_combined <- cbind(dataset_nums, dataset_dummies)

In [35]:
head(dataset_combined)

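| Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | SaleType.CWD | SaleType.New | SaleType.Oth | SaleType.WD | SaleCondition.Abnorml | SaleCondition.AdjLand | SaleCondition.Alloca | SaleCondition.Family | SaleCondition.Normal | SaleCondition.Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | 65 | 8450 | 7 | 5 | 2003 | 2003 | 196 | 706 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 20 | 80 | 9600 | 6 | 8 | 1976 | 1976 | 0 | 978 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 60 | 68 | 11250 | 7 | 5 | 2001 | 2002 | 162 | 486 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 70 | 60 | 9550 | 7 | 5 | 1915 | 1970 | 0 | 216 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 60 | 84 | 14260 | 8 | 5 | 2000 | 2000 | 350 | 655 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 50 | 85 | 14115 | 5 | 5 | 1993 | 1995 | 0 | 732 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |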


#### Split the data set into training and test sets

This is the same as before.

In [2]:
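# Split data into training and test sets.
# sample.split() expects a vector and returns one TRUE/FALSE per element,
# so pass the target column (one value per row), not the whole data frame.
set.seed(42)
split <- sample.split(dataset$SalePrice, SplitRatio=0.7)  # TRUE marks training rows
training_set <- subset(dataset, split==TRUE)
test_set <- subset(dataset, split==FALSE)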
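
`sample.split()` from caTools returns one logical value per element of the vector it is given, so it is worth confirming that the split is row-wise and that the two subsets add up to the full data. The sketch below does the same kind of 70/30 split in base R on a made-up toy data frame, using `sample()` instead of caTools, and checks that every row lands in exactly one subset. `toy`, `toy_train`, and `toy_test` are hypothetical names.

```r
set.seed(42)

# Toy stand-in for the housing data: 10 rows, made-up prices
toy <- data.frame(
  Id        = 1:10,
  SalePrice = seq(100000, 190000, by = 10000)
)

# Draw 70% of the row indices at random for training
n_train    <- floor(0.7 * nrow(toy))
train_rows <- sample(nrow(toy), size = n_train)

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Sanity checks: sizes add up and no row is in both subsets
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
stopifnot(length(intersect(toy_train$Id, toy_test$Id)) == 0)
```

The same two `stopifnot()` checks can be run on `training_set` and `test_set` after the caTools split.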

#### Select only the numeric predictors and then impute missing data

In [3]:
# Create the training and test data frames with only numeric predictors
nums <- unlist(lapply(training_set, is.numeric))
training_set_nums <- training_set[, nums]
test_set_nums <- test_set[, nums]

In [4]:
# Create the training and test data frames with only categorical predictors
training_set_cat <- training_set[, !nums]
test_set_cat <- test_set[, !nums]

In [5]:
# Convert all the categorical values to factors
# training_set_factors <- apply(training_set[, !nums], 2, as.factor)

# Show the number of unique values in each field for the training set
sapply(training_set_cat, function(x) {length(unique(x))})

To show the actual values, you can run the following:

    apply(training_set_cat, 2, function(x) {unique(x)})

In [6]:
# Show the number of unique values in each field for the test set
sapply(test_set_cat, function(x) {length(unique(x))})
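
Comparing the counts by eye works, but the question that matters for encoding is whether the test set contains categories the training set never saw. A minimal base-R sketch of that check follows, on made-up toy data; `toy_train_levels`, `toy_test_levels`, and `unseen` are hypothetical names.

```r
# Toy stand-ins for training_set_cat and test_set_cat (values are made up)
toy_train_levels <- data.frame(
  SaleType = c("WD", "New", "WD"),
  Street   = c("Pave", "Pave", "Grvl"),
  stringsAsFactors = FALSE
)
toy_test_levels <- data.frame(
  SaleType = c("WD", "Oth"),
  Street   = c("Pave", "Pave"),
  stringsAsFactors = FALSE
)

# For each column, list the values that occur in test but not in training
unseen <- Map(
  function(train_col, test_col) {
    setdiff(unique(as.character(test_col)), unique(as.character(train_col)))
  },
  toy_train_levels, toy_test_levels
)

# Keep only the columns that actually have unseen values
unseen <- unseen[lengths(unseen) > 0]
```

Here `unseen` ends up holding a single entry, `SaleType`, with the value `"Oth"`. An empty list would mean every test category also appears in training.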

In [7]:
# Use dummyVars from the caret package to perform one-hot encoding
# on the categorical features
dummies_training <- dummyVars( ~ ., data=training_set_cat)
# fullRank=TRUE drops one level per factor to avoid perfectly collinear columns
dummies_training_fullRank <- dummyVars( ~ ., data=training_set_cat, fullRank=TRUE)

In [8]:
training_set_dummies <- as.data.frame(predict(dummies_training, newdata=training_set_cat))
training_set_dummies_fullRank <- as.data.frame(predict(dummies_training_fullRank, newdata=training_set_cat))

In [11]:
# Repeat the one-hot encoding for the test set
dummies_test <- dummyVars( ~ ., data=test_set_cat)
dummies_test_fullRank <- dummyVars( ~ ., data=test_set_cat, fullRank=TRUE)
test_set_dummies <- as.data.frame(predict(dummies_test, newdata=test_set_cat))
test_set_dummies_fullRank <- as.data.frame(predict(dummies_test_fullRank, newdata=test_set_cat))
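
One thing to watch with the cells above: because the test set gets its own `dummyVars` object, the training and test frames can end up with different columns if a category shows up in only one of them (for example when the columns are character vectors rather than factors that share a full set of levels). The sketch below shows the idea of learning the categories from the training data and reusing them for the test data. It is a base-R stand-in on made-up toy data, not caret's implementation; `one_hot`, `toy_train_cat`, and `toy_test_cat` are hypothetical names.

```r
# Toy categorical data (values are made up)
toy_train_cat <- data.frame(
  Street   = c("Pave", "Grvl", "Pave"),
  SaleType = c("WD", "New", "WD"),
  stringsAsFactors = FALSE
)
toy_test_cat <- data.frame(
  Street   = c("Pave", "Pave"),   # "Grvl" never appears in the test rows
  SaleType = c("New", "WD"),
  stringsAsFactors = FALSE
)

# Learn the categories of each column from the training data only
train_levels <- lapply(toy_train_cat, function(x) sort(unique(as.character(x))))

# Build one 0/1 column per (column, category) pair, using the supplied levels
one_hot <- function(df, levels_by_col) {
  blocks <- lapply(names(levels_by_col), function(col) {
    lv <- levels_by_col[[col]]
    m  <- outer(as.character(df[[col]]), lv, "==") * 1L
    colnames(m) <- paste(col, lv, sep = ".")
    m
  })
  as.data.frame(do.call(cbind, blocks))
}

toy_train_dummies <- one_hot(toy_train_cat, train_levels)
toy_test_dummies  <- one_hot(toy_test_cat, train_levels)

# Both frames now have identical columns, in the same order
stopifnot(identical(names(toy_train_dummies), names(toy_test_dummies)))
```

With caret, I believe the equivalent is to call `predict(dummies_training, newdata=test_set_cat)` instead of fitting a second `dummyVars` object, though I have not verified how it treats categories that only occur in the test set.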

### Next steps

I'm going to work on the next few sections in R, but I'm not sure how some of them will translate. I know I can do XGBoost, but I'm not entirely sure about partial dependence plots. I can also do cross-validation and probably data leakage, but I'm unsure about pipelines. I know I can chain steps with `%>%`, and I think that will let me do something similar to Python's pipelines. I'll have to play around with it.

Before doing all that, I think I'm going to write a post on how I set up my R environment. I like what I have set up now and would like to share it.