## *Learn Maching Learning* series on Kaggle in R

This is my R code for the level 1 part of the *Learn Machine Learning* series on Kaggle. I've already done the Python one, which is on Kaggle located [here](https://www.kaggle.com/learn/machine-learning). The data used is from the [*Home Prices: Advanced Regression Techniques*](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition.

Originally I had planned on doing both level 1 and level 2 at the same time, but I encountered some issues with my R install and I got busier than expected. I'm publishing level 1 now since it's done and while I've already started the level 2 part, I'll just publish it a little later.

### Load and install packages and load the data

In [1]:
# Install and load packages
if (!require("randomForest")) {
  install.packages("randomForest", repos="http://cran.rstudio.com/")
  library(randomForest)
}

if (!require("dplyr")) {
  install.packages("dplyr", repos="http://cran.rstudio.com/")
  library(dplyr)
}

if (!require("caTools")) {
  install.packages("caTools", repos="http://cran.rstudio.com/")
  library(caTools)
}

if (!require("rpart")) {
  install.packages("rpart", repos="http://cran.rstudio.com/")
  library(rpart)
}

# Save filepath to variable
training_data_filepath <- "C:/Development/Kaggle/House Prices - Advanced Regression Techniques/train.csv"

# Import data
dataset <- read.csv(training_data_filepath)

Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Loading required package: dplyr

Attaching package: 'dplyr'

The following object is masked from 'package:randomForest':

    combine

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Loading required package: caTools
Loading required package: rpart


### View some stats about the data

In [2]:
# View some stats and information about the data
summary(dataset)

       Id           MSSubClass       MSZoning     LotFrontage    
 Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
 1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
 Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
 Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
 3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
 Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
                                                 NA's   :259     
    LotArea        Street      Alley      LotShape  LandContour  Utilities   
 Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63    AllPub:1459  
 1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50    NoSeWa:   1  
 Median :  9478               NA's:1369   IR3: 10   Low:  36                 
 Mean   : 10517                           Reg:925   Lvl:1311                 
 3rd Qu.: 11602                                                              
 Max

### Split the data set into training and test, then create the predictor and target variables

In [3]:
# Split data into training and validation data, for both predictors and target.
set.seed(42)
split <- sample.split(dataset, SplitRatio=0.7)  # for training data
training_set <- subset(dataset, split==TRUE)
test_set <- subset(dataset, split==FALSE)

# Create the training and tests dataframe with the initial predictors
predictors <- c("LotArea", "YearBuilt", "X1stFlrSF", "X2ndFlrSF",
                "FullBath", "BedroomAbvGr", "TotRmsAbvGrd", "SalePrice")
training_set <- training_set %>%
  select(predictors)
test_set <- test_set %>%
  select(predictors)

# Create the predictor variable
X <- training_set %>%
  select(-SalePrice)

# Select the target variable and call it y
y <- training_set$SalePrice

### Predict values with a Decision Tree using rpart

In [4]:
# Fitting Decision Tree to the training data
formula=SalePrice ~ .

regressor <- rpart(formula=formula, data=training_set,
                   control=rpart.control(cp=.01))

# Get predicted prices
y_pred <- predict(regressor, test_set)

# View a summary of the predicted values
summary(y_pred)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 115718  115718  149822  175554  200484  480209 

### Create a function to get the Mean Absolute Error (or MAE)

In [5]:
# Calculating the Mean Absolute Error
mae <- function(error)
{
  mean(abs(error))
}

# Get the MAE
y_test <- test_set$SalePrice
error <- (y_test - y_pred)
mae(error)

### Create a function to compare the MAE for different cp values

In [6]:
# Create the function
getMae_rpart <- function(formula, training_data, test_data, n) {
  set.seed(42)
  regressor_rpart <- rpart(formula=formula, data=training_data,
                    control=rpart.control(cp=n))
  y_prediction <- predict(regressor_rpart, newdata=test_data)
  y_test <- test_data$SalePrice
  error <- (y_test - y_prediction)
  print(paste("cp of ", n, " has an MAE of ", mae(error), sep=""))
}

Set up the formula variable and cp values, then loop through the values and call the function.

In [7]:
# Set the formula variable
formula <- SalePrice ~ .

# Loop through multiple ntree values
cps <- c(.5, .1, .05, .02, .01, .005, .003, .001, .0005, .0001)

for (i in cps) {
  getMae_rpart(formula, training_set, test_set, i)
}

[1] "cp of 0.5 has an MAE of 57536.8983354404"
[1] "cp of 0.1 has an MAE of 40654.9088557541"
[1] "cp of 0.05 has an MAE of 36460.7134426164"
[1] "cp of 0.02 has an MAE of 33492.3580079057"
[1] "cp of 0.01 has an MAE of 29589.8455005301"
[1] "cp of 0.005 has an MAE of 29136.0138171344"
[1] "cp of 0.003 has an MAE of 29583.0145339228"
[1] "cp of 0.001 has an MAE of 27909.4547519322"
[1] "cp of 5e-04 has an MAE of 27597.8067312116"
[1] "cp of 1e-04 has an MAE of 27419.4284590988"


MAE continues to decrease as the cp decreases.

### Predict values with a Random Forest

In [8]:
# Fitting Random Forest Regression to the dataset
regressor <- randomForest(x=X, y=y, ntree=100)

# Predicting a new result
y_pred <- predict(regressor, newdata=test_set)

# Get the MAE
y_test <- test_set$SalePrice
error <- (y_pred - y_test)
mae(error)

### Create a function to compare the MAE for different ntree values

In [9]:
# Create the function
getMae_forest <- function(X, y, test_data, n) {
  set.seed(42)
  regressor <- randomForest(x=X, y=y, ntree=n)
  y_prediction <- predict(regressor, newdata=test_data)
  y_test <- test_data$SalePrice
  error <- (y_prediction - y_test)
  print(paste("ntree of ", n, " has an MAE of ", mae(error), sep=""))
}

# Loop through multiple ntree values
ntrees = c(1, 5, 10, 30, 50, 100, 500, 1000, 5000)

for (i in ntrees) {
  getMae_forest(X, y, test_set, i)
}

[1] "ntree of 1 has an MAE of 35761.9752775473"
[1] "ntree of 5 has an MAE of 25399.3227531454"
[1] "ntree of 10 has an MAE of 24226.9883834123"
[1] "ntree of 30 has an MAE of 23401.1638509278"
[1] "ntree of 50 has an MAE of 23610.084126271"
[1] "ntree of 100 has an MAE of 23260.3606851458"
[1] "ntree of 500 has an MAE of 23166.618382558"
[1] "ntree of 1000 has an MAE of 23113.7696443243"
[1] "ntree of 5000 has an MAE of 23172.7757985064"


ntree of 1000 has the lowest MAE.

That's all for this post. The more I use R, the more I like it. Python and R both have their advantages though.

Hopefully the second part doesn't take me nearly as long. Until then!