Source: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/

**To run the R code in a Jupyter notebook, run `conda install -c r r-essentials` in your terminal. This installs the R kernel and some commonly used R packages (e.g. dplyr, ggplot2).**


In [None]:
#Load the required packages 
library(tidyverse)
library(caret) 
library(leaps)

In [None]:
#Load the dataset

df_housing <- read.csv("train.csv", stringsAsFactors = FALSE)

In [None]:
#Get an overview of the train data set
#Click on white space below to expand out the code output

glimpse(df_housing)

# Method 1: Using stepAIC()
Selects the best model by AIC (Akaike Information Criterion). 
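
As a rough illustration of what AIC measures, the sketch below fits two models on R's built-in `mtcars` data (an assumption made only to keep the example self-contained; it is not the housing data) and checks that `AIC()` matches -2 × log-likelihood + 2 × number of estimated parameters. The helper `manual_aic()` is a hypothetical name used for this example.

```r
# Illustrative only: mtcars stands in for the housing data
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
manual_aic <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")   # regression coefficients plus the residual variance
  -2 * as.numeric(ll) + 2 * k
}

# Compare the hand-computed value with the built-in AIC()
aic_table <- data.frame(
  model  = c("fit_small", "fit_large"),
  manual = c(manual_aic(fit_small), manual_aic(fit_large)),
  AIC    = c(AIC(fit_small), AIC(fit_large))
)
aic_table
```

The model with the lower AIC is preferred. stepAIC() ranks models with extractAIC(), which for lm fits on the same data differs from AIC() only by an additive constant, so the ordering is the same.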

In [None]:
library(MASS)

#Fit your initial model to begin the regression 

full_model <- lm(SalePrice ~ ., data = df_housing)

There are three ways to run stepwise regression: (1) backward, (2) forward, (3) stepwise. Choose one by setting **direction**: (i) "both" (stepwise selection), (ii) "backward" (backward elimination), or (iii) "forward" (forward selection). 

**trace** can be either TRUE or FALSE. With TRUE, the results of each iteration are printed; with FALSE, the steps are hidden and you only get the final model. With many variables, TRUE produces a long list of steps. 
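
To make the effect of **direction** and **trace** concrete, here is a minimal sketch on R's built-in `swiss` data (an assumption made only so the example runs on its own; it is not the housing data). The object names are hypothetical.

```r
library(MASS)

# Illustrative only: swiss stands in for the housing data
full_fit <- lm(Fertility ~ ., data = swiss)
null_fit <- lm(Fertility ~ 1, data = swiss)

# Backward: start from the full model and drop terms
backward_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

# Forward: start from the intercept-only model; scope gives the largest model allowed
forward_fit <- stepAIC(null_fit, direction = "forward",
                       scope = list(lower = ~ 1, upper = formula(full_fit)),
                       trace = FALSE)

# Both: terms can be dropped and added back at each step
both_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

# With trace = FALSE the path taken is still stored in the anova component
backward_fit$anova
formula(backward_fit)
```

The three directions can end on different models, so it may be worth comparing the final formulas before settling on one.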

In [None]:
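# stepAIC() expects a fitted model object, not a formula

stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

backwards_model <- stepAIC(full_model, direction = "backward", trace = FALSE)

# Forward selection starts from the intercept-only model; scope sets the largest model allowed
forwards_model <- stepAIC(lm(SalePrice ~ 1, data = df_housing), direction = "forward", scope = formula(full_model), trace = FALSE)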

In [None]:
# Get the results of the final model 

summary(stepwise_model)

# Method 2: Using regsubsets()
regsubsets() has a tuning parameter, **nvmax**, which sets the maximum number of predictors to include in the model. It returns multiple models of different sizes, up to nvmax. You then need to compare the performance of these models to choose the best one. 

regsubsets() also has a method option, which can take the values "exhaustive" (the default), "backward", "forward" and "seqrep" (sequential replacement, a combination of forward and backward selection).

In [None]:
# nvmax = 5 is only an example value; set it to the largest model size you want to consider
models <- regsubsets(SalePrice ~ ., data = df_housing, nvmax = 5, method = "seqrep")

summary(models)
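
One possible way to compare the models returned by regsubsets() is to look at the criteria stored in its summary (adjusted R², Mallows' Cp, BIC). The sketch below uses R's built-in `swiss` data as a stand-in (an assumption for illustration only), and names such as `size_table` are hypothetical.

```r
library(leaps)

# Illustrative only: swiss stands in for the housing data
models <- regsubsets(Fertility ~ ., data = swiss, nvmax = 5, method = "seqrep")
res <- summary(models)

# One row per candidate model, with the criteria stored by summary()
size_table <- data.frame(
  n_predictors = rowSums(res$which) - 1,  # subtract 1 for the intercept column
  adj_r2 = res$adjr2,
  cp     = res$cp,
  bic    = res$bic
)
size_table

# Each criterion may point to a different model
best_by_adjr2 <- which.max(res$adjr2)
best_by_cp    <- which.min(res$cp)
best_by_bic   <- which.min(res$bic)

# Coefficients of the model chosen by BIC
coef(models, id = best_by_bic)
```

These criteria are computed on the training data, so they can disagree with each other and with a cross-validated error estimate.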

# Method 3: CV and GridSearch

The train() function (caret package) provides an easy workflow for stepwise selection using the leaps and MASS packages. Its method option can take the following values:

 - "leapBackward", to fit linear regression with backward selection
 - "leapForward", to fit linear regression with forward selection
 - "leapSeq", to fit linear regression with stepwise selection
 
You also need to specify the tuning parameter nvmax, which corresponds to the maximum number of predictors to be incorporated in the model.

For example, you can vary nvmax from 1 to 5. In this case, the function searches for the best model of each size, up to the best 5-variable model. That is, it searches for the best 1-variable model, the best 2-variable model, …, and the best 5-variable model.


We can use 5- or 10-fold cross-validation to estimate the average prediction error (RMSE) of each model. RMSE is then used to compare the models and automatically choose the best one, where the best model is the one that minimizes the RMSE.
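
To show what k-fold cross-validation does under the hood, here is a minimal base-R sketch. It assumes R's built-in `swiss` data and two hand-picked candidate formulas (both hypothetical choices for the example). Unlike caret, it does not redo the variable selection inside each fold.

```r
set.seed(42)

# Illustrative only: swiss stands in for the housing data
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(swiss)))

# Two candidate models to compare (hypothetical choices for the example)
candidates <- list(
  one_var = Fertility ~ Education,
  two_var = Fertility ~ Education + Catholic
)

cv_rmse <- sapply(candidates, function(f) {
  fold_rmse <- sapply(seq_len(k), function(i) {
    train_set <- swiss[fold_id != i, ]   # fit on k - 1 folds
    test_set  <- swiss[fold_id == i, ]   # evaluate on the held-out fold
    fit  <- lm(f, data = train_set)
    pred <- predict(fit, newdata = test_set)
    sqrt(mean((test_set$Fertility - pred)^2))
  })
  mean(fold_rmse)  # average RMSE across the k held-out folds
})

cv_rmse
names(which.min(cv_rmse))  # candidate with the lowest cross-validated RMSE
```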

In [None]:
#Set seed for reproducibility
set.seed(42)

#Set up k-fold cross-validation, indicate the number of folds you want
train_control <- trainControl(method = "cv", number = 10)

#Train the model, indicate the method of regression and range of nvmax numbers to try
step_model <- train(SalePrice ~ ., data = df_housing, 
                   method = "leapSeq",
                   tuneGrid = data.frame(nvmax = 5:50),
                   trControl = train_control)
step_model$results

**nvmax**: the number of variables in the model. For example, nvmax = 2 specifies the best 2-variable model.

**RMSE** and **MAE** are two different metrics measuring the prediction error of each model. The lower the RMSE and MAE, the better the model.

**Rsquared** is the squared correlation between the observed outcome values and the values predicted by the model. The higher the R squared, the better the model.
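
The three metrics can be computed by hand, which may help when reading the results table. The sketch below uses made-up observed and predicted values (purely hypothetical numbers), and computes Rsquared as the squared correlation, which is how caret's default summary reports it.

```r
# Illustrative only: toy observed and predicted values
observed  <- c(10, 12, 15, 20, 22)
predicted <- c(11, 11, 16, 18, 23)

rmse <- sqrt(mean((observed - predicted)^2))   # root mean squared error
mae  <- mean(abs(observed - predicted))        # mean absolute error
rsq  <- cor(observed, predicted)^2             # squared correlation

c(RMSE = rmse, MAE = mae, Rsquared = rsq)
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.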

In [None]:
#Display the best tuning values (nvmax) selected by the train() function

step_model$bestTune

In [None]:
#The function summary() reports the best set of variables for each model size 
summary(step_model$finalModel)
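
Once the best model size is known, the corresponding coefficients can be pulled out with coef(). The following is a minimal sketch on R's built-in `swiss` data (an assumption for illustration; the object names are hypothetical), using "leapBackward" and a small 5-fold setup to keep it quick.

```r
library(caret)
library(leaps)

set.seed(42)

# Illustrative only: swiss stands in for the housing data
cv_control <- trainControl(method = "cv", number = 5)
fit <- train(Fertility ~ ., data = swiss,
             method = "leapBackward",
             tuneGrid = data.frame(nvmax = 1:5),
             trControl = cv_control)

# Model size picked by cross-validation
best_size <- fit$bestTune$nvmax

# Coefficients of the model with that many predictors
coef(fit$finalModel, id = best_size)
```

With the selected predictors in hand, the model can also be refit with lm() to get standard errors and p-values, keeping in mind that these are optimistic after variable selection.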