
# Advanced Regression Techniques - best subset selection, forward and backward stepwise selection, ridge and lasso regression, PCR, PLS, bagging, random forest and boosting
## Ollie Thwaites
## 04/09/20

# 1. Introduction
Predicting the price of a house is useful for both the seller and buyer. 
The seller wants to ensure they are not under- or overvaluing the house, as that could lead to missed profit or no interest from buyers. 
Conversely, buyers will be interested in predicting a house price to see if they are getting a good deal on an undervalued house or if they are getting ripped off. 
This dataset offers an opportunity to utilise multiple linear regression and tree-based methods to predict the prices of houses in King County, which includes Seattle, in the state of Washington. 
These house prices are from May 2014 to May 2015. 

This is the Python equivalent of an analysis I previously did in R. R is my preferred language, but I wanted to learn some Python too, and I thought a good way to do that would be to take what I know in R and recreate it in Python. You can find the R project [here](https://www.kaggle.com/thwaiteso/advanced-regression-techniques-r).

The full code used in this analysis can be found on my [github](https://github.com/thwaiteso/Kaggle-Projects/blob/master/Python/Housing/Housing.py).

## 1.1 Import data
The dataset for this analysis can be found [here](https://www.kaggle.com/harlfoxem/housesalesprediction).


In [None]:
# Import packages
import pandas as pd
import numpy as np
from plotnine import * # using * saves me from writing plotnine before every ggplot use
!pip install dfply # seems like dfply is not installed on kaggle
from dfply import *
import folium
import statsmodels.api as sm
import itertools
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, KFold, ParameterGrid
from sklearn.preprocessing import scale
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

# import data
house_data = pd.read_csv("../input/housesalesprediction/kc_house_data.csv")

## 1.2 Tidy data
Let's look at an overview of the data:

In [None]:
# overview of data
house_data.info()

The data contain 21 variables describing various features of the house itself and its surrounding environment, but I have an issue with some of them. The number of bathrooms is a decimal (float), with some values ending in .25, .5 or .75. The number of floors also has some values ending in .5. After some research, I believe the decimals refer to how ‘full’ the bathroom is, i.e. whether it has a bath, shower, sink, toilet or a combination of all four. The problem with this system is that a rating of ‘1.25’ for the number of bathrooms implies one ‘full’ bathroom and one ‘.25’ bathroom, but it could also be any other combination of the decimal notations that adds up to 1.25. I have decided to leave the variable as is, but it highlights that recording data in a non-intuitive way can lead to problems with interpretation later on.
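To see which decimal endings actually occur, the whole and fractional parts of a column can be separated and counted. This is a minimal sketch on made-up values standing in for `house_data['bathrooms']`; the numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy values standing in for house_data['bathrooms'] (made up for illustration)
bathrooms = pd.Series([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.5, 3.75])

# Split each value into its whole and fractional parts
whole_part = bathrooms // 1
fraction_part = bathrooms % 1

# Count how often each fractional ending occurs
fraction_counts = fraction_part.value_counts().sort_index()
print(fraction_counts)
```

Running the same three lines on the real column would show how common each ending is, which helps judge how much the ambiguity matters.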

Finally, I’ll check if there is any missing data:

In [None]:
# are any data missing?
print('Number of missing data:', house_data.isnull().sum().sum()) 

Excellent, there are no missing data.
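The check above chains `.sum()` twice: the first sums the missing-value flags within each column and the second adds up those column totals. A small sketch on a toy data frame (with one missing value planted deliberately) shows the two steps separately:

```python
import numpy as np
import pandas as pd

# Toy data frame with one deliberately missing price (illustrative only)
toy = pd.DataFrame({'price': [221900.0, np.nan, 180000.0],
                    'bedrooms': [3, 3, 2]})

# First .sum(): number of missing values in each column
missing_per_column = toy.isnull().sum()

# Second .sum(): total number of missing values in the whole data frame
total_missing = missing_per_column.sum()

print(missing_per_column)
print('Number of missing data:', total_missing)
```

Keeping the per-column counts is useful when the total is not zero, since it shows which variables need attention.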

# 2. Exploratory Data Analysis

I am going to do some initial data visualisation to understand the trends in each variable, before moving on to the predictive models.

## 2.1 Price distribution

Firstly, I am going to look at the distribution of the house prices (*note that price has been log transformed*):

In [None]:
# note that ggplot can be used in python, using the plotnine library

price_plot_theme = theme(axis_title = element_text(size = 12.5), 
                         axis_text = element_text(size = 9))

price_plot = (ggplot(data = house_data, mapping = aes(x = 'price')) + 
  # note that the whole object is in brackets - (ggplot...)
  # also note that aes is preceded by mapping = and the column name is in quotes
  geom_histogram(color = 'black', bins = 30) +
  # makes a histogram; bins = 30 is the default that ggplot2 uses in R
  geom_vline(xintercept = house_data['price'].mean(), linetype = 'dashed', 
             color = 'red', size = 2) +
  # add line denoting mean house price - note that the mean is computed
  # differently than in R: you select the column from the data using [''],
  # then call .mean() on it
  # size = 2 replaces lwd = 2 from R
  theme_classic() + # white background, no gridlines
  xlab('Price (US$)') + # change x axis label
  ylab('Frequency') + # change y axis label
  price_plot_theme + # change size of axis titles and text
  scale_x_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', 
                                        '5,000,000'])) +
  scale_y_continuous(breaks = np.array(range(0, 3500, 500)),
                     labels = np.array(range(0, 3500, 500)),
                     limits = np.array([0, 3000]))
 # change x and y axis values
 # note that instead of using c() in R, numpy arrays are used
 # also note that in range(), the second number denotes when the range stops,
 # but that final number is not included - so if I want the numbers 0-3000
 # every 500, the stop number has to be 3500
 )
price_plot

The dashed red line denotes the mean house price: **$540,088**. Not many houses are priced over 1 million dollars; the majority are below that and are roughly centered around the mean price.
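Prices are plotted on a log scale because house prices tend to be right-skewed: a few very expensive houses pull the mean above the median. The sketch below uses made-up prices (not the King County data) to show how taking logs reduces that skew.

```python
import numpy as np
import pandas as pd

# Made-up prices with one very expensive house (illustrative only)
prices = pd.Series([200000, 300000, 400000, 450000,
                    500000, 650000, 900000, 3000000], dtype=float)

mean_price = prices.mean()      # pulled upwards by the expensive house
median_price = prices.median()  # robust to the expensive house

# Sample skewness before and after the log transform
skew_raw = prices.skew()
skew_log = np.log(prices).skew()

print('mean:', mean_price, 'median:', median_price)
print('skew (raw):', round(skew_raw, 2), 'skew (log):', round(skew_log, 2))
```

In this toy example the mean sits well above the median and the skewness drops after the transform, which is why a log axis gives a more symmetric histogram.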

## 2.2 House age

Next I want to see how many houses were built in a specific year and whether the age of the house has any effect on its current price.

In [None]:
# dfply is a python equivalent to dplyr and has pipes that can be used to
# chain multiple bits of code together

# houses built over time
# collate number of houses built per year
yearBuilt = (house_data >> # note that >> replaces %>% as the pipe
  group_by(X.yr_built) >>  # X refers to the data frame from the first line
  # and has to be called explicitly unlike in R
  summarize(rows = n(X.yr_built))
  )

yearBuilt_plot_theme = theme(axis_title = element_text(size = 12.5), 
                             axis_text = element_text(size = 10))
yearBuilt_plot = (ggplot(data = yearBuilt, 
                         mapping = aes(x = 'yr_built', y = 'rows')) +
  geom_bar(stat = 'identity', color = 'black') + # create bar chart
  theme_classic() + # white background, no gridlines
  xlab('Year') + # change x axis label
  ylab('Houses Built') + # change y axis label
  yearBuilt_plot_theme + # change the size of axis titles and axis text
  scale_x_continuous(breaks = np.array(range(1900, 2020, 10)),
                     labels = np.array(range(1900, 2020, 10)),
                     limits = np.array([1899, 2016])) +
  scale_y_continuous(breaks = np.array(range(0, 650, 50)),
                     labels = np.array(range(0, 650, 50)),
                     limits = np.array([0, 600]))
  # change x and y axis values
  )
yearBuilt_plot

You can see that there has generally been an **increase** in the number of houses built over time, but there are notable **declines** in the 1930s and 1970s. The fall in the 30s is likely linked to the Great Depression, which began with the Wall Street Crash in 1929. As for the 70s, the Cold War and the Vietnam War were ongoing and there was an oil crisis in 1973, all of which may have contributed to the decline in houses built during that time.
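Since the discussion is in terms of decades, it can help to aggregate the counts per decade rather than per year. This is a sketch in plain pandas (instead of dfply) on made-up build years; the `decade` column is a helper introduced here for illustration.

```python
import pandas as pd

# Made-up build years standing in for house_data['yr_built']
toy = pd.DataFrame({'yr_built': [1925, 1928, 1929, 1934, 1955, 1958,
                                 1962, 1975, 1988, 1989, 1991]})

# Integer division by 10 drops the final digit, giving the decade
toy['decade'] = (toy['yr_built'] // 10) * 10

# Number of houses built in each decade
houses_per_decade = toy.groupby('decade').size()
print(houses_per_decade)
```

Applied to the real data, dips such as those in the 1930s and 1970s would show up as lower counts than the neighbouring decades.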

In [None]:
# mean price of house per year built
# collate current mean price of house per year built
yearBuilt_price = (house_data >> # using house_data
  group_by(X.yr_built) >> # group all the data from the same year
  summarize(mean_price = mean(X.price)) >>
  # calculate current mean price for houses built in that year
  arrange(X.yr_built)
  )

yearBuilt_price_plot_theme = theme(axis_title = element_text(size = 12.5), 
                                   axis_text = element_text(size = 10))
yearBuilt_price_plot = (ggplot(data = yearBuilt_price, 
                               mapping = aes(x = 'yr_built', 
                                             y = 'mean_price')) +
  geom_bar(stat = 'identity', color = 'black') + # create bar chart
  theme_classic() + # white background, no gridlines
  xlab('Year') + # change x axis label
  ylab('Mean Price (US$)') + # change y axis label
  yearBuilt_price_plot_theme + 
  # change the size of axis titles and axis text
  geom_hline(yintercept = house_data['price'].mean(),
             linetype = 'dashed', color = 'red', size = 2) +
  # add line denoting mean house price
  scale_x_continuous(breaks = np.array(range(1900, 2020, 10)),
                     labels = np.array(range(1900, 2020, 10)),
                     limits = np.array([1899, 2016])) +
  scale_y_continuous(breaks = np.array(range(0, 900000, 100000)),
                     labels = np.array(['0', '100,000', '200,000', '300,000',
                                        '400,000', '500,000', '600,000',
                                        '700,000', '800,000']))
 # change x and y axis values
 )
yearBuilt_price_plot

As for the mean house price for houses built in a particular year, houses built pre-1940 and post-1980 are generally **worth more** than average now whereas houses built from 1940-1980 are **worth less** on average. Perhaps there is a ‘reverse sweet spot’, where the houses built in 1940-1980 are not old enough to have the charm or appeal of the older houses and they are not new enough to be suitable for modern lifestyles.
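The three eras described above can be compared numerically by binning the build year. The following is a sketch on made-up data, using `pd.cut` with bin edges chosen to match the pre-1940, 1940-1980 and post-1980 split; the prices are invented to illustrate the calculation and are not results from the dataset.

```python
import pandas as pd

# Made-up houses (illustrative only)
toy = pd.DataFrame({
    'yr_built': [1910, 1935, 1950, 1975, 1995, 2010],
    'price': [700000.0, 600000.0, 400000.0, 450000.0, 550000.0, 650000.0],
})

# Bins are right-inclusive: (1899, 1939], (1939, 1980], (1980, 2015]
era = pd.cut(toy['yr_built'], bins=[1899, 1939, 1980, 2015],
             labels=['pre-1940', '1940-1980', 'post-1980'])

# Mean price per era, next to the overall mean
mean_price_by_era = toy.groupby(era, observed=True)['price'].mean()
overall_mean = toy['price'].mean()

print(mean_price_by_era)
print('overall mean:', round(overall_mean))
```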

## 2.3 Surrounding houses

There are two variables in this dataset, ‘sqft_living15’ and ‘sqft_lot15’, which describe the average square footage of the living area and the average square footage of the lot of the closest 15 houses, respectively. The assumption here is that if a house is in a neighbourhood that has more expensive houses, it is likely that that house is also expensive. Let’s see how these variables affect house prices (*note that the two variables and price have been log transformed*):

In [None]:
# Square footage of living area of surrounding 15 houses
sqft_liv15_plot_theme = theme(axis_title = element_text(size = 12.5), 
                              axis_text = element_text(size = 10))
sqft_liv15_plot = (ggplot(data = house_data, 
                          mapping = aes(x = 'sqft_living15', y = 'price')) +
  geom_point(size = 1) +
  # add data as points
  theme_classic() + # white background, no gridlines
  xlab('Square Footage of Living Area of Nearest 15 Houses') + 
  # change x axis label
  ylab('Price (US$)') + # change y axis label
  sqft_liv15_plot_theme + # change the size of axis titles and axis text
  scale_x_continuous(trans = 'log',
                     breaks = np.array([500, 1000, 2500, 5000]),
                     labels = np.array(['500', '1,000', '2,500', '5,000'])) +
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000'])) 
 # change x and y axis values
 )
sqft_liv15_plot


Here we see that the mean square footage of the *living area* of the closest 15 houses is correlated with the price of a house. The assumption here is that larger houses are more expensive (*this is looked at later in **section 2.5***) and if there are many large houses close to a house, it is likely that that house is large and therefore expensive.

In [None]:
# Square footage of lot of surrounding 15 houses
sqft_lot15_plot_theme = theme(axis_title = element_text(size = 12.5), 
                              axis_text = element_text(size = 10))
sqft_lot15_plot = (ggplot(data = house_data, 
                          mapping = aes(x = 'sqft_lot15', y = 'price')) +
  geom_point(size = 1) +
  # add data as points
  theme_classic() + # white background, no gridlines
  xlab('Square Footage of Lot of Nearest 15 Houses') + 
  # change x axis label
  ylab('Price (US$)') + # change y axis label
  sqft_lot15_plot_theme + # change the size of axis titles and axis text
  scale_x_continuous(trans = 'log',
                     breaks = np.array([1000, 10000, 100000, 1000000]),
                     labels = np.array(['1,000', '10,000', '100,000', 
                                        '1,000,000'])) +
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000']))
 # change x and y axis values
 )
sqft_lot15_plot

Conversely, the mean square footage of the *lot* of the closest 15 houses has no obvious relationship with the price of a house. Most mean lot sizes are between 1,000 and 100,000 square feet, with little variation in house prices as this increases. I would expect sqft_living15 to be a better predictor of house prices than sqft_lot15 in the predictive models.
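A visual impression like this can be backed up with a correlation coefficient on the log scale. The sketch below uses synthetic data that is deliberately constructed so that only the living area drives price; it demonstrates the calculation, not a finding about the King County data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic log-scale variables: price depends on living15 but not on lot15
log_living15 = rng.normal(7.5, 0.3, n)
log_lot15 = rng.normal(9.0, 0.8, n)
log_price = 6.0 + 0.9 * log_living15 + rng.normal(0, 0.3, n)

toy = pd.DataFrame({'sqft_living15': np.exp(log_living15),
                    'sqft_lot15': np.exp(log_lot15),
                    'price': np.exp(log_price)})

# Log transform every column, then correlate each predictor with price
log_toy = np.log(toy)
corr_with_price = log_toy.corr()['price'].drop('price')
print(corr_with_price)
```

On the real data, the same last two lines applied to the `sqft_living15`, `sqft_lot15` and `price` columns would give the corresponding correlations.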

## 2.4 House quality and its surrounding area 
The next four variables I'll look at describe the quality of the house and its surrounding area. 
'waterfront' is a binary variable denoting if the house is on the waterfront ('1') or not ('0'). 

'view' is a variable describing the quality of the view from the house from 0-4. I couldn't find exactly what each value represents, but a higher number is better.  

'grade' denotes the quality of construction and design of the house, from 1-13 with [explanations](https://www.kingcounty.gov/depts/assessor/Reports/area-reports/2017/residential-westcentral/~/media/depts/assessor/documents/AreaReports/2017/Residential/013.ashx) as follows:  
Grades 1-3: Falls short of minimum building standards. Normally cabin or inferior structure.  
Grade 4: Generally older low quality construction. Does not meet code.  
Grade 5: Lower construction costs and workmanship. Small, simple design.  
Grade 6: Lowest grade currently meeting building codes. Low quality materials, simple designs.  
Grade 7: Average grade of construction and design. Commonly seen in plats and older subdivisions.  
Grade 8: Just above average in construction and design. Usually better materials in both the exterior and interior finishes.  
Grade 9: Better architectural design, with extra exterior and interior design and quality.  
Grade 10: Homes of this quality generally have high quality features. Finish work is better, and more design quality is seen in the floor plans and larger square footage.   
Grade 11: Custom design and higher quality finish work, with added amenities of solid woods, bathroom fixtures and more luxurious options.  
Grade 12: Custom design and excellent builders. All materials are of the highest quality and all conveniences are present.  
Grade 13: Generally custom designed and built. Approaching the Mansion level. Large amount of highest quality cabinet work, wood trim and marble; large entries.  

'condition' describes the condition of the house in terms of the amount and urgency of repairs and maintenance needed, with [explanations](https://www.kingcounty.gov/depts/assessor/Reports/area-reports/2017/residential-westcentral/~/media/depts/assessor/documents/AreaReports/2017/Residential/013.ashx) as follows:  
1: Poor - Many repairs needed. Showing serious deterioration.  
2: Fair - Some repairs needed immediately. Much deferred maintenance.  
3: Average - Depending upon age of improvement; normal amount of upkeep for the age of the home.  
4: Good - Condition above the norm for the age of the home. Indicates extra attention and care has been taken to maintain.  
5: Very Good - Excellent maintenance and updating on home. Not a total renovation.  

Clearly we should be expecting house prices to increase if the house is on a waterfront and to increase as each of the other three variables increase (*note that price has been log transformed*):


In [None]:
# waterfront
wft_plot_theme = theme(axis_title = element_text(size = 12.5),
                       axis_text = element_text(size = 10))

wft_plot = (ggplot(data = house_data, 
                   mapping = aes(x = house_data['waterfront'].astype('category'), 
                                 y = 'price')) + # note that instead of using
            # factor() in R, in python you specify the column and use 
            # astype('category')
  geom_boxplot() + # makes a boxplot
  geom_hline(yintercept = house_data['price'].mean(),
             linetype = 'dashed', color = 'red', size = 2) +
  # add line denoting mean house price
  theme_classic() + # white background, no gridlines
  xlab('') +
  ylab('Price (US$)') + # change y axis label
  wft_plot_theme + # change axis title and text size
  scale_x_discrete(labels = np.array(['Not on Waterfront', 'On Waterfront'])) +
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000']))
  # change y-axis values
  )
wft_plot

The median price for a house on the waterfront is **higher** than the average, but it isn’t significantly different from the median price for a house not on the waterfront. I wouldn’t expect ‘waterfront’ to be a particularly useful predictor.

In [None]:
# view
view_plot_theme = theme(axis_title = element_text(size = 12.5),
                        axis_text = element_text(size = 10))
view_plot = (ggplot(data = house_data, 
                    mapping = aes(x = house_data['view'].astype('category'), 
                                  y = 'price')) + 
  geom_boxplot() + # makes a boxplot
  geom_hline(yintercept = house_data['price'].mean(),
             linetype = 'dashed', color = 'red', size = 2) +
  # add line denoting mean house price
  theme_classic() + # white background, no gridlines
  xlab('View Rating') +
  ylab('Price (US$)') + # change y axis label
  view_plot_theme + # change axis title and text size
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000']))
  # change y-axis values
  )
view_plot

The rating of the view from a house has a marginal effect on median house price with a **slight increase** as view rating increases, but again it isn’t really significant and I don’t think it will be a useful predictor.

In [None]:
# grade
grd_plot_theme = theme(axis_title = element_text(size = 12.5),
                       axis_text = element_text(size = 10))
grd_plot = (ggplot(data = house_data, 
                   mapping = aes(x = house_data['grade'].astype('category'), 
                                 y = 'price')) + 
  geom_boxplot() + # makes a boxplot
  geom_hline(yintercept = house_data['price'].mean(),
             linetype = 'dashed', color = 'red', size = 2) +
  # add line denoting mean house price
  theme_classic() + # white background, no gridlines
  xlab('Grade') +
  ylab('Price (US$)') + # change y axis label
  grd_plot_theme + # change axis title and text size
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000']))
  # change y-axis values
  )
grd_plot

However, the grade of the house clearly has a large effect on median house price. There is a **significant increase** in median house price as the grade of the house increases, with the lowest priced houses in the top grades (12 and 13) being higher than the highest priced houses in the bottom grades (1-4). I would expect ‘grade’ to be a significant predictor of house prices.

In [None]:
# condition
cond_plot_theme = theme(axis_title = element_text(size = 12.5),
                        axis_text = element_text(size = 10))
cond_plot = (ggplot(data = house_data, 
                    mapping = aes(x = house_data['condition'].astype('category'), 
                                  y = 'price')) + 
  geom_boxplot() + # makes a boxplot
  geom_hline(yintercept = house_data['price'].mean(),
             linetype = 'dashed', color = 'red', size = 2) +
  # add line denoting mean house price
  theme_classic() + # white background, no gridlines
  xlab('Condition') +
  ylab('Price (US$)') + # change y axis label
  cond_plot_theme + # change axis title and text size
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000']))
  # change y-axis values
  )
cond_plot

Finally, the results from ‘condition’ are curious. The median price for houses in top condition (‘5’) is not above the mean house price, though there is an increase in median house price as condition improves. As with ‘waterfront’ and ‘view’, I don’t believe ‘condition’ will be a significant predictor.
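The box plots in this section compare each group's median with the overall mean (the red line). The same comparison can be tabulated directly. This is a sketch on made-up prices for three condition levels; the values are illustrative only.

```python
import pandas as pd

# Made-up prices for three condition levels (illustrative only)
toy = pd.DataFrame({
    'condition': [3, 3, 3, 4, 4, 5, 5, 5],
    'price': [400000.0, 500000.0, 600000.0, 450000.0,
              520000.0, 480000.0, 530000.0, 560000.0],
})

# Overall mean price: the equivalent of the dashed red line
overall_mean = toy['price'].mean()

# Median price within each level: the centre line of each box
median_by_condition = toy.groupby('condition')['price'].median()

summary = pd.DataFrame({
    'median_price': median_by_condition,
    'above_overall_mean': median_by_condition > overall_mean,
})
print(summary)
```

Swapping `'condition'` for `'waterfront'`, `'view'` or `'grade'` gives the same table for the other variables.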

## 2.5 House dimensions

After looking at the effects of the houses around a particular house and its external qualities, it makes sense to look at the dimensions of a house and its lot. Specifically, the square footage of the interior living space, lot, above ground area and below ground area, all of which are log transformed in addition to price:

In [None]:
# Square footage of interior living space
sqft_liv_plot_theme = theme(axis_title = element_text(size = 12.5), 
                            axis_text = element_text(size = 10))
sqft_liv_plot = (ggplot(data = house_data, 
                        mapping = aes(x = 'sqft_living', y = 'price')) +
  geom_point(size = 1) +
  # add data as points
  theme_classic() + # white background, no gridlines
  xlab('Square Footage of Living Area') + # change x axis label
  ylab('Price (US$)') + # change y axis label
  sqft_liv_plot_theme + # change the size of axis titles and axis text
  scale_x_continuous(trans = 'log',
                     breaks = np.array([500, 1000, 2500, 5000, 10000]),
                     labels = np.array(['500', '1,000', '2,500', '5,000',
                                        '10,000'])) +
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000'])) 
  # change x and y axis values
  )
sqft_liv_plot

In [None]:
# Square footage of above ground interior
sqft_abv_plot_theme = theme(axis_title = element_text(size = 12.5), 
                            axis_text = element_text(size = 10))
sqft_abv_plot = (ggplot(data = house_data, 
                        mapping = aes(x = 'sqft_above', y = 'price')) +
  geom_point(size = 1) +
  # add data as points
  theme_classic() + # white background, no gridlines
  xlab('Square Footage of Interior (Above Ground)') + # change x axis label
  ylab('Price (US$)') + # change y axis label
  sqft_abv_plot_theme + # change the size of axis titles and axis text
  scale_x_continuous(trans = 'log',
                     breaks = np.array([500, 1000, 2500, 5000, 7500]),
                     labels = np.array(['500', '1,000', '2,500', '5,000', 
                                        '7,500'])) +
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000'])) 
  # change x and y axis values
  )
sqft_abv_plot

Although it looks like I've repeated the same graph twice, these plots are different. Both the square footage of the total living area and the square footage above ground appear to be positively correlated with price, so you would expect these two variables to carry similar information.
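One way to check how much two predictors overlap is to correlate them with each other. The sketch below uses made-up values and assumes that the living area is the sum of the above ground and basement areas; that relationship is an assumption here and would need to be verified on the real columns.

```python
import pandas as pd

# Made-up areas in square feet (illustrative only)
toy = pd.DataFrame({
    'sqft_above': [1180, 2170, 770, 1050, 1680, 3890],
    'sqft_basement': [0, 400, 0, 910, 0, 1530],
})

# Assumption: total living area = above ground + below ground
toy['sqft_living'] = toy['sqft_above'] + toy['sqft_basement']

# Share of houses without a basement, where the two areas are identical
share_no_basement = (toy['sqft_basement'] == 0).mean()

# Pearson correlation between the two predictors
corr_living_above = toy['sqft_living'].corr(toy['sqft_above'])

print('share without basement:', share_no_basement)
print('correlation:', round(corr_living_above, 3))
```

A correlation close to 1 would suggest that the two variables are largely redundant in a linear model.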

In [None]:
# Square footage of land space
sqft_lot_plot_theme = theme(axis_title = element_text(size = 12.5), 
                            axis_text = element_text(size = 10))
sqft_lot_plot = (ggplot(data = house_data, 
                        mapping = aes(x = 'sqft_lot', y = 'price')) +
  geom_point(size = 1) +
  # add data as points
  theme_classic() + # white background, no gridlines
  xlab('Square Footage of Lot') + # change x axis label
  ylab('Price (US$)') + # change y axis label
  sqft_lot_plot_theme + # change the size of axis titles and axis text
  scale_x_continuous(trans = 'log',
                     breaks = np.array([1000, 10000, 100000, 1000000]),
                     labels = np.array(['1,000', '10,000', '100,000', 
                                        '1,000,000'])) +
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000'])) 
  # change x and y axis values
  )
sqft_lot_plot

As for the square footage of the lot, there isn’t a noticeable increase in house price as square footage goes up, with most lot sizes being between 1,000 and 100,000 square feet, much like the similar plot in **section 2.3**.

In [None]:
# Square footage of below ground interior
# create a subset with houses that have a below ground area
blw_data = house_data[house_data.sqft_basement > 0]

sqft_blw_plot_theme = theme(axis_title = element_text(size = 12.5), 
                            axis_text = element_text(size = 10))
sqft_blw_plot = (ggplot(data = blw_data, 
                        mapping = aes(x = 'sqft_basement', y = 'price')) +
  geom_point(size = 1) +
  # add data as points
  theme_classic() + # white background, no gridlines
  xlab('Square Footage of Interior (Below Ground)') + # change x axis label
  ylab('Price (US$)') + # change y axis label
  sqft_blw_plot_theme + # change the size of axis titles and axis text
  scale_x_continuous(trans = 'log',
                     breaks = np.array([10, 100, 1000, 5000]),
                     labels = np.array(['10', '100', '1,000', '5,000'])) +
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000'])) 
  # change x and y axis values
  )
sqft_blw_plot

For the below ground square footage, there isn’t a lot of variation with most basements being around 1,000 square feet, with perhaps a slight trend upwards in price as size increases.

## 2.6 Floors, bedrooms and bathrooms

Having now analysed how the size of the house, the areas within it and the lot affects house prices, I’ll now consider how the number of floors, bedrooms and bathrooms within a house affects its price, which is log transformed:

In [None]:
# floors
floor_plot_theme = theme(axis_title = element_text(size = 12.5),
                         axis_text = element_text(size = 10))
floor_plot = (ggplot(data = house_data, 
                     mapping = aes(x = house_data['floors'].astype('category'), 
                                   y = 'price')) + 
  geom_boxplot() +
  # makes a boxplot
  geom_hline(yintercept = house_data['price'].mean(),
             linetype = 'dashed', color = 'red', size = 2) +
  # add line denoting mean house price
  theme_classic() + # white background, no gridlines
  xlab('Number of Floors') + # change x axis label
  ylab('Price (US$)') + # change y axis label
  floor_plot_theme + # change axis title and text size
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000']))
 # change y-axis values
 )
floor_plot

The number of floors appears to have little effect on house prices. If anything, the median house price for 3 and 3.5 floors is slightly less than or roughly equal to 2 and 2.5 floors. I would therefore not expect ‘floors’ to be a strong predictor of house price in the predictive models.

In [None]:
# bedrooms
bed_plot_theme = theme(axis_title = element_text(size = 12.5),
                       axis_text = element_text(size = 10))
bed_plot = (ggplot(data = house_data, 
                   mapping = aes(x = house_data['bedrooms'].astype('category'), 
                                 y = 'price')) + 
  geom_boxplot() +
  # makes a boxplot
  geom_hline(yintercept = house_data['price'].mean(),
             linetype = 'dashed', color = 'red', size = 2) +
  # add line denoting mean house price
  theme_classic() + # white background, no gridlines
  xlab('Number of Bedrooms') + # change x axis label
  ylab('Price (US$)') + # change y axis label
  bed_plot_theme + # change axis title and text size
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                        2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000']))
  # change y-axis values
  )
bed_plot

This graph is interesting in that it highlights there are some houses with **zero** bedrooms and one house with **33** bedrooms. I believe that it doesn’t make sense for a house not to have at least one bedroom and the house with 33 is clearly an outlier, so these houses will be removed. As for the trend, there is generally an increase in median house price as the number of bedrooms increases.

In [None]:
# bathrooms
bath_plot_theme = theme(axis_title = element_text(size = 12.5),
                        axis_text = element_text(size = 10))
bath_plot = (ggplot(data = house_data, 
                    mapping = aes(x = house_data['bathrooms'].astype('category'), 
                                  y = 'price')) + 
  geom_boxplot() + # makes a boxplot
  geom_hline(yintercept = house_data['price'].mean(),
             linetype = 'dashed', color = 'red', size = 2) +
  # add line denoting mean house price
  theme_classic() + # white background, no gridlines
  xlab('Number of Bathrooms') + # change x axis label
  ylab('Price (US$)') + # change y axis label
  bath_plot_theme + # change axis title and text size
  scale_x_discrete(breaks = np.array([0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 
                                      5, 5.5, 6, 6.5, 7.5, 8]),
                   labels = np.array(['0', 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 
                                      5, 5.5, 6, 6.5, 7.5, 8])) +
  scale_y_continuous(trans = 'log',
                     breaks = np.array([100000, 250000, 500000, 1000000, 
                                2500000, 5000000]),
                     labels = np.array(['100,000', '250,000', '500,000', 
                                        '1,000,000', '2,500,000', '5,000,000']))
  # change x and y-axis values
  )
bath_plot

In [None]:
# given that this is housing, doesn't make sense to have 0 bedrooms,
# so they are removed, as well as the 33 bedroom house
# also remove any house with 0 bathrooms
house_data = house_data[(house_data.bedrooms < 33) & (house_data.bedrooms > 0) 
                        & (house_data.bathrooms > 0)]

There is a similar story on this graph, highlighting that some houses have zero bathrooms. Again this doesn’t seem realistic, so these houses will be removed too. As with the number of bedrooms, there is a general increase in median house price as the number of bathrooms increases.

## 2.7 Interactive map

This data set also provides latitude and longitude values for each house, meaning they can be plotted on a map. Each house is assigned to one of five price groups, and the colour of each group is shown in the layers tool on the bottom left, where price groups can also be selected or deselected. You can also click on an individual house to see its price, living area and lot area.

In [None]:
# create map
house_map = folium.Map([47.4, -122.25], tiles = 'cartodbpositron',
                       zoom_start = 9)

# create subsets of the original dataset, for each of the (arbitrarily)
# chosen price groups
# lower bounds are inclusive, so a house priced exactly on a boundary
# (e.g. $250,000) is not dropped from the map
price_group1_subset = house_data[house_data.price < 250000]
price_group2_subset = house_data[(house_data.price >= 250000) 
                                 & (house_data.price < 500000)]
price_group3_subset = house_data[(house_data.price >= 500000) 
                                 & (house_data.price < 750000)]
price_group4_subset = house_data[(house_data.price >= 750000) 
                                 & (house_data.price < 1000000)]
price_group5_subset = house_data[house_data.price >= 1000000]

# template for text shown in LayerControl
legend_text = '<span style = "color: {col};"> {txt} </span>'

# create price groups as FeatureGroups, so that each price group is its 
# own layer on the map
price_group1 = folium.FeatureGroup(name = legend_text.format(
    txt = 'Price: < $250,000', col = 'lawngreen'))
price_group2 = folium.FeatureGroup(name = legend_text.format(
    txt = 'Price: $250,000 - $500,000', col = 'mediumseagreen'))
price_group3 = folium.FeatureGroup(name = legend_text.format(
    txt = 'Price: $500,000 - $750,000', col = 'deepskyblue'))
price_group4 = folium.FeatureGroup(name = legend_text.format(
    txt = 'Price: $750,000 - $1,000,000', col = 'blue'))
price_group5 = folium.FeatureGroup(name = legend_text.format(
    txt = 'Price: > $1,000,000', col = 'navy'))

# add price_group1 to map
for i in range(0, len(price_group1_subset)):
    # for each house in this price group subset
    
    # template for text shown when the marker is clicked on 
    popup_text = '<b> Price: ${:,} </b> <br> House area (sqft): {:,} <br> Lot area (sqft): {:,}'
    popup_text = popup_text.format(
        price_group1_subset['price'].iloc[i], # find the price for this house
        price_group1_subset['sqft_living'].iloc[i], # find the sqft_living 
        price_group1_subset['sqft_lot'].iloc[i] # find the sqft_lot
        )
    
    # create popup with the text created previously and set the width & height
    iframe = folium.IFrame(popup_text, width = 175, height = 100)
    
    # add marker of this particular house
    folium.CircleMarker(
        location = [price_group1_subset.iloc[i]['lat'], 
                    price_group1_subset.iloc[i]['long']],
        popup = folium.Popup(iframe), # created previously
        radius = 1,
        opacity = 0.75,
        color = 'lawngreen').add_to(price_group1) # add to FeatureGroup
price_group1.add_to(house_map) # add FeatureGroup to Map

# repeat this for the other price groups

for i in range(0, len(price_group2_subset)):
    
    popup_text = '<b> Price: ${:,} </b> <br> House area (sqft): {:,} <br> Lot area (sqft): {:,}'
    popup_text = popup_text.format(
        price_group2_subset['price'].iloc[i],
        price_group2_subset['sqft_living'].iloc[i],
        price_group2_subset['sqft_lot'].iloc[i]
        )
    
    iframe = folium.IFrame(popup_text, width = 175, height = 100)
    
    folium.CircleMarker(
        location = [price_group2_subset.iloc[i]['lat'], 
                    price_group2_subset.iloc[i]['long']],
        popup = folium.Popup(iframe),
        radius = 1,
        opacity = 0.75,
        color = 'mediumseagreen').add_to(price_group2)
price_group2.add_to(house_map)

for i in range(0, len(price_group3_subset)):
    
    popup_text = '<b> Price: ${:,} </b> <br> House area (sqft): {:,} <br> Lot area (sqft): {:,}'
    popup_text = popup_text.format(
        price_group3_subset['price'].iloc[i],
        price_group3_subset['sqft_living'].iloc[i],
        price_group3_subset['sqft_lot'].iloc[i]
        )
    
    iframe = folium.IFrame(popup_text, width = 175, height = 100)
    
    folium.CircleMarker(
        location = [price_group3_subset.iloc[i]['lat'], 
                    price_group3_subset.iloc[i]['long']],
        popup = folium.Popup(iframe),
        radius = 1,
        opacity = 0.75,
        color = 'deepskyblue').add_to(price_group3)
price_group3.add_to(house_map)

for i in range(0, len(price_group4_subset)):
    
    popup_text = '<b> Price: ${:,} </b> <br> House area (sqft): {:,} <br> Lot area (sqft): {:,}'
    popup_text = popup_text.format(
        price_group4_subset['price'].iloc[i],
        price_group4_subset['sqft_living'].iloc[i],
        price_group4_subset['sqft_lot'].iloc[i]
        )
    
    iframe = folium.IFrame(popup_text, width = 175, height = 100)
    
    folium.CircleMarker(
        location = [price_group4_subset.iloc[i]['lat'], 
                    price_group4_subset.iloc[i]['long']],
        popup = folium.Popup(iframe),
        radius = 1,
        opacity = 0.75,
        color = 'blue').add_to(price_group4)
price_group4.add_to(house_map)

for i in range(0, len(price_group5_subset)):
    
    popup_text = '<b> Price: ${:,} </b> <br> House area (sqft): {:,} <br> Lot area (sqft): {:,}'
    popup_text = popup_text.format(
        price_group5_subset['price'].iloc[i],
        price_group5_subset['sqft_living'].iloc[i],
        price_group5_subset['sqft_lot'].iloc[i]
        )
    
    iframe = folium.IFrame(popup_text, width = 175, height = 100)
    
    folium.CircleMarker(
        location = [price_group5_subset.iloc[i]['lat'], 
                    price_group5_subset.iloc[i]['long']],
        popup = folium.Popup(iframe),
        radius = 1,
        opacity = 0.75,
        color = 'navy').add_to(price_group5)
price_group5.add_to(house_map)

# add option to toggle the layers (different price groups) on the map,
# the colour and price range of each price group is shown too
folium.LayerControl(position = 'bottomleft').add_to(house_map)

house_map

It is immediately apparent that the more expensive houses are generally located near water and cheaper houses are generally further inland, confirming the observations in **section 2.4**. For example, there are no houses worth less than $500,000 on Mercer Island. The more expensive houses are also generally further north than the cheapest houses; this is clear if you select only the cheapest and most expensive price groups. Therefore, we might expect that latitude will be a significant predictor of the price of a house.
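The five price groups behind the map layers could also be assigned in a single step. This is a minimal sketch rather than the code used for the map: it assumes a handful of hypothetical prices and uses `pd.cut` with left-closed bins, so a house priced at exactly a boundary (e.g. $250,000) still lands in a group. The group labels are made up for illustration.

```python
import pandas as pd

# hypothetical example prices, including values exactly on a boundary
prices = pd.Series([199000, 250000, 499999, 500000, 820000, 1000000, 2300000])

# bin edges for the five price groups; right = False makes each bin closed
# on the left, e.g. [250000, 500000), so boundary prices are not lost
bins = [0, 250000, 500000, 750000, 1000000, float('inf')]
labels = ['group1', 'group2', 'group3', 'group4', 'group5']
price_group = pd.cut(prices, bins = bins, labels = labels, right = False)

print(price_group.tolist())
```

A column created this way could then be used to loop over the groups once, instead of repeating the marker code for each group.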

# 3. Linear model approaches

In a linear model, a quantitative response can be predicted by one or more predictors, and you assume there is approximately a linear relationship between the response and the predictors. You typically fit a standard linear model using **least squares**.

Least squares is an approach to maximise the ‘closeness’ of the fit of the model to the data by minimising the **residual sum of squares** (RSS). The residual for an observation is the distance between the observed response and the predicted response from the model.
Consider the graph below:

In [None]:
# create mock data
example_var_x = [7, 6, 9, 4, 12, 5, 10, 3, 13]
example_var_y = [9, 10, 7, 12, 4, 11, 6, 3, 13]
example_data = pd.DataFrame({'Variable 1': example_var_x, 
                             'Variable 2': example_var_y})

# plot data, and highlight the residuals
residual_plot = (ggplot(mapping = aes(x = 'example_var_x', y = 'example_var_y'),
                        data = example_data) +
  geom_smooth(method = 'lm', se = False, colour = 'black') +
  geom_segment(aes(xend = 'Variable 1', yend = 8.35),
               colour = 'red') +
  geom_point() +
  theme_classic() + 
  theme(axis_title = element_text(size = 12.5),
        axis_text = element_text(size = 10))
  )
residual_plot

Here you can see the least squares estimate for the relationship between the data. The residuals are highlighted in red. The RSS is minimised such that any different intercept or slope would result in a higher RSS - i.e. the line you see cannot be changed or it will not fit the data as well. To calculate the RSS, each residual is squared and the squared residuals are then added together. Squaring removes any negative sign, as we are only concerned with the size of each distance, not its direction.
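To make the idea of minimising the RSS concrete, here is a minimal sketch using the same mock data. It assumes NumPy's `polyfit` as the least squares fitter (the plot uses `geom_smooth` instead), and the `rss` helper is a hypothetical name. Moving the fitted line up or tilting it should both give a larger RSS.

```python
import numpy as np

# the mock data from the plot
x = np.array([7, 6, 9, 4, 12, 5, 10, 3, 13])
y = np.array([9, 10, 7, 12, 4, 11, 6, 3, 13])

def rss(intercept, slope):
    # residual = observed response - predicted response
    residuals = y - (intercept + slope * x)
    # square each residual first, then add them together
    return np.sum(residuals ** 2)

# least squares estimates (degree 1 polynomial = straight line)
slope_hat, intercept_hat = np.polyfit(x, y, deg = 1)

rss_best = rss(intercept_hat, slope_hat)
rss_shifted = rss(intercept_hat + 0.5, slope_hat) # move the line up
rss_tilted = rss(intercept_hat, slope_hat + 0.1)  # make the line steeper

print(rss_best, rss_shifted, rss_tilted)
```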

However, there are alternatives to plain least squares that can improve the performance of the model. These approaches seek to improve prediction accuracy and model interpretability.

For my predictive modelling, the variables ‘id’, ‘date’ and ‘zipcode’ will be removed. Furthermore, the variables ‘price’, ‘sqft_living’, ‘sqft_lot’, ‘sqft_above’, ‘sqft_basement’, ‘sqft_living15’ and ‘sqft_lot15’ will be log-transformed (one is added to ‘sqft_basement’ first, because houses without a basement have a value of zero).

In [None]:
# first remove unnecessary variables ('id', 'date' and 'zipcode')
model_data = house_data.drop(columns = ['id', 'date', 'zipcode'])

# log transform some variables using NumPy
model_data['price'] = np.log(model_data['price'])
model_data['sqft_living'] = np.log(model_data['sqft_living'])
model_data['sqft_lot'] = np.log(model_data['sqft_lot'])
model_data['sqft_above'] = np.log(model_data['sqft_above'])
model_data['sqft_basement'] = np.log(model_data['sqft_basement'] + 1)
model_data['sqft_living15'] = np.log(model_data['sqft_living15'])
model_data['sqft_lot15'] = np.log(model_data['sqft_lot15'])

# create a way to track model stats, all values are initially nan but these 
# will be replaced when the test MSEs are calculated
# pandas DataFrame replacing matrix() in R
model_stats = pd.DataFrame(columns = ['Test MSE'],
                           index = [
                               'Best subset selection',
                               'Forward stepwise selection', 
                               'Backward stepwise selection',
                               'Ridge regression', 
                               'Lasso regression', 'PCR', 'PLS', 
                               'Single regression tree', 'Bagging', 
                               'Random forest', 'Boosting'
                               ]
                           )

# denote the x (predictors) and y (response) variables, for train and test set
# training set will be 75% of original dataset, test set is 25%
x = model_data.drop('price', axis = 1)
y = pd.DataFrame(model_data.price)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25,
                                                    random_state = 35)


## 3.1 Subset selection methods  
### 3.1.1 Best subset selection

Best subset selection takes the least squares approach, but as its name suggests it seeks to find the best **subset** of predictors. It does this by fitting a least squares regression for each possible combination of predictors - e.g. all of the models with only one of the predictors, all of the models with all of the combinations of two predictors, and so on. So, how do you determine what the ‘best’ model is? The most robust method is to validate each model under consideration and I will be using *k*-fold cross-validation.

In *k*-fold cross-validation the data you have is split into *k* folds (of approximately equal size) - essentially subsets of your data. For example, let’s assume *k* = 10. Your model would be trained on nine of the 10 folds and tested on the remaining fold. The **mean squared error** (MSE) is calculated from the test on the held-out fold - this is a value for how close your model’s predictions are to the true values, so a *low* MSE indicates a more accurate model. Then, a different fold is chosen to be tested on, and your model is trained using the remaining nine folds. This continues until your model has been tested on each of the *k* folds, and the MSEs are averaged. This approach is repeated for each model under consideration.
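The procedure described above can be sketched in a few lines. This is only an illustration: it assumes synthetic data rather than the housing data, and it uses scikit-learn's `KFold` to create the folds, whereas the stepwise code later in this notebook assigns folds by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# synthetic data (an assumption, not the housing data)
rng = np.random.default_rng(35)
X = rng.normal(size = (200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale = 0.3, size = 200)

# split the observations into k = 10 folds
kf = KFold(n_splits = 10, shuffle = True, random_state = 35)
fold_mses = []

for train_idx, test_idx in kf.split(X):
    # train on nine folds, test on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_mses.append(mean_squared_error(y[test_idx], predictions))

# the cross-validation error is the average of the k held-out MSEs
cv_mse = np.mean(fold_mses)
print('10-fold CV MSE: %.4f' % cv_mse)
```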

The ‘best’ model is the one that has the lowest MSE - that model has consistently been the most accurate across the validation process. I will be looking at many approaches to predicting the price of a house, so I have created a matrix to track the test MSE for each of these approaches.

So, let’s perform best subset selection using *k*-fold cross-validation, where *k* = 10:

In [None]:
# this takes about three hours to run on kaggle and it doesn't fully complete and nothing else works after it
# error message:
# Features: 130402/131071IOPub message rate exceeded.
# The notebook server will temporarily stop sending output
# to the client in order to avoid crashing it.
# To change this limit, set the config variable
# `--NotebookApp.iopub_msg_rate_limit`.

# Current values:
# NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
# NotebookApp.rate_limit_window=3.0 (secs)


# So, I will just put the test MSE that I got from my own computer, but if you want to repeat this the code is below

# EFS will consider every possible combination of features
# the 'best' number of features is determined by 10-fold cross-validation,
# the mean squared error on the held-out fold
#lm = LinearRegression()
#efs = EFS(lm, min_features = 1, max_features = 17, 
          #scoring = 'neg_mean_squared_error', cv = 10)

# fit to training data
#efs.fit(x_train, y_train)
#efs.best_feature_names_
# 16 features selected, only 'long' was not selected

# use the selected features to predict values in test set
#x_train_efs = efs.transform(x_train)
#x_test_efs = efs.transform(x_test)
#lm_house = lm.fit(x_train_efs, y_train)
#y_pred = lm_house.predict(x_test_efs)

lm_test_MSE = 0.06351712398582668
best_feature_names_ = ('bedrooms', 'bathrooms', 'sqft_living', 
                           'sqft_lot', 'floors', 'waterfront', 'view', 
                           'condition', 'grade', 'sqft_above', 'sqft_basement', 
                           'yr_built', 'yr_renovated', 'lat', 'sqft_living15', 'sqft_lot15')

In [None]:
print('Test set MSE: %.5f' % lm_test_MSE)
print('Number of features chosen:', len(best_feature_names_))
print('Features chosen:', best_feature_names_)

So, cross-validation has chosen a **16 variable** model. It is a subset, because the total number of predictors is 17, but really it’s not much different from just including all predictors. Fortunately, we can select a model with fewer predictors, if the cross-validation error of a simpler model is within one standard error of the best model. See below:

In [None]:
# because EFS doesn't work, I've recreated the table produced below

# extract stats of efs
#efs_stats = pd.DataFrame.from_dict(efs.get_metric_dict()).T

# dataframe to add the test mse and standard errors for the best model at
# each size
#efs_best_models = pd.DataFrame(columns = ['CV_score', 'std_error'])

# for each sized model
#for i in range(1, 17, 1):

    #indexes = [] # list to track the indexes in efs_stats of the models with
    # size i
    
    # iterate over all the models in efs_stats
    #for j in range(0, len(efs_stats)):
        
        # if the length of the model is equal to i
        #if len(efs_stats['feature_names'][j]) == i:
            
            # add the index (j) to indexes
            #indexes.append(j)
    
    # extract the index of the model with the lowest cross val score
    #min_avg_score_index = efs_stats[min(indexes):max(indexes)]['avg_score'].astype(float).idxmin(axis = 1)
    
    # use that index to extract the cv_score and std_error
    #efs_best_models.at[i, 'CV_score'] = np.square(efs_stats['avg_score'][min_avg_score_index])
    #efs_best_models.at[i, 'std_error'] = efs_stats['std_err'][min_avg_score_index]

# change columns to floats
#efs_best_models = efs_best_models.astype(float)


# recreated table
efs_best_models = efs_best_models1 = pd.DataFrame(data = {'CV_score': [0.07605759865727282,
                                                                       0.07567839005496807,
                                                                       0.07471145477652824,
                                                                       0.0735012308766475,
                                                                       0.07199211537679669,
                                                                       0.06895525155189056,
                                                                       0.06585606580426408,
                                                                       0.05823823994049927,
                                                                       0.050553675178276816,
                                                                       0.042796856332271004,
                                                                       0.0310474069290086,
                                                                       0.023926368298656284,
                                                                       0.016491804302500277,
                                                                       0.014544475614421114,
                                                                       0.012536455653603465,
                                                                       0.00866700519371302],
                                                          'std_error': [0.0035234919449906965,
                                                                        0.003613041317735649,
                                                                        0.0033761903770162245,
                                                                        0.0034232814493334184,
                                                                        0.003101720687690171,
                                                                        0.002839548464126151,
                                                                        0.003184049048283974,
                                                                        0.0033458532098801174,
                                                                        0.0033063287428112265,
                                                                        0.0029888377848824275,
                                                                        0.002766817317231102,
                                                                        0.002141453537275933,
                                                                        0.0013362830904681947,
                                                                        0.0013186797000082193,
                                                                        0.0014868116451182763,
                                                                        0.0012730036196126162,]},
                                                  index = list(range(1, 17)))


# plot scores with std errors
plot_bss_cv_theme = theme(axis_title = element_text(size = 12.5),
                          axis_text = element_text(size = 10))
plot_bss_cv = (ggplot(data = efs_best_models, 
                      mapping = aes(x = np.array(range(1, 17, 1)), 
                                    y = 'CV_score')) +
  geom_point() + # plot points
  geom_line() + # join points with line
  geom_errorbar(mapping = aes(ymin = efs_best_models['CV_score'] - 
                              efs_best_models['std_error'],
                              ymax = efs_best_models['CV_score'] + 
                              efs_best_models['std_error'])) +
  
  # add standard error bars
  geom_hline(yintercept = 0.00994000519371302, linetype = 'dashed',
             colour = 'red') + # add line one standard error above the
  # CV score of the best model
  theme_classic() + # white background, no gridlines
  xlab('Model Size') + # change x axis label
  ylab('Test MSE') + # change y axis label
  plot_bss_cv_theme + # change the size of axis titles and axis text
  scale_y_continuous(breaks = np.array(np.arange(0, 0.09, 0.01)),
                     labels = np.array(np.arange(0, 0.09, 0.01)))
  # change y-axis breaks and labels
  )
plot_bss_cv

The 16 variable model is clearly the best, as it has the lowest test MSE. Furthermore, no other model is within one standard error of this model. Therefore this 16 variable model and its test set MSE will be added to the matrix.
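The one-standard-error rule applied here can be written as a short helper. This is a minimal sketch: the function name is hypothetical and the cross-validation errors are made up for illustration, they are not the values from this analysis.

```python
import numpy as np

def one_standard_error_choice(cv_mse, cv_se):
    # hypothetical helper: cv_mse[i] and cv_se[i] belong to model size i + 1
    cv_mse = np.asarray(cv_mse, dtype = float)
    cv_se = np.asarray(cv_se, dtype = float)

    # model with the lowest cross-validation error
    best = np.argmin(cv_mse)

    # one standard error above the error of the best model
    threshold = cv_mse[best] + cv_se[best]

    # smallest model whose error does not exceed the threshold
    simplest = np.argmax(cv_mse <= threshold)
    return best + 1, simplest + 1, threshold

# made-up cross-validation errors for model sizes 1 to 6
cv_mse = [0.30, 0.18, 0.105, 0.102, 0.100, 0.101]
cv_se = [0.02, 0.01, 0.006, 0.006, 0.006, 0.006]

best_size, chosen_size, threshold = one_standard_error_choice(cv_mse, cv_se)
print(best_size, chosen_size, threshold)
```

In this made-up example the lowest error belongs to the five variable model, but the three variable model is within one standard error of it, so the simpler model would be chosen.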

In [None]:
# add test set MSE to matrix
model_stats.loc['Best subset selection', 'Test MSE'] = lm_test_MSE

### 3.1.2 Forward stepwise selection

The next of the subset selection approaches is forward stepwise selection. In this approach, you start with a model with no predictors. Then, you add one variable to the model - the variable that results in the largest reduction in RSS or the largest increase in R<sup>2</sup>. This continues until you have all of the predictors in the model. Cross-validation is used here too - the process of adding variables takes place on the training folds and the model is tested on the held-out fold.

However, forward stepwise selection may not necessarily select the optimal combination of predictors at each level. Consider a dataset with predictors a, b, and c, where the best one variable model is with predictor a and the best two variable model is with predictors b and c. Best subset selection would find this combination, but forward stepwise selection wouldn’t, because it has to choose predictor a for the first model and it builds on its previous selections.

The obvious question is then: why choose forward stepwise selection over best subset selection? The answer is that forward stepwise selection is faster than best subset selection. Best subset selection has to consider 2<sup>p</sup> models, where *p* equals the number of predictors, but forward stepwise selection considers 1+*p*(*p*+1)/2 models.
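To put numbers on this for the 17 predictors in this dataset, the two formulas can be evaluated directly (a small arithmetic sketch, with variable names chosen for illustration):

```python
p = 17 # number of predictors in this dataset

# best subset selection: every possible combination of predictors
best_subset_models = 2 ** p

# forward stepwise selection: the empty model, then p candidates at the
# first step, p - 1 at the second step, and so on
forward_stepwise_models = 1 + p * (p + 1) // 2

print('Best subset selection:', best_subset_models)
print('Forward stepwise selection:', forward_stepwise_models)
```

That is 131,072 models against 154. The first figure matches the 131,071 feature combinations in the EFS progress message above, plus the empty model.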

Let’s see how forward stepwise selection performs:

In [None]:
## Forward stepwise selection
# taken and adapted from:
# http://www.science.smith.edu/~jcrouser/SDS293/labs/lab9-py.html

# using 10-fold cross-validation, so set that up first
k = 10
np.random.seed(seed = 35) # set seed to ensure reproducibility
folds = np.random.choice(k, size = len(y), replace = True)

# dataframe to track cross-validation errors
crossVal_errors_fwd = pd.DataFrame(
    columns = range(1, k + 1),
    index = range(1, 18))

# function to fit a linear model
def processSubset(features, X_train, Y_train, X_test, Y_test):
    # fit to training set
    model = sm.OLS(Y_train, X_train[list(features)])
    model_fit = model.fit()
    
    # calculate test set MSE by using the fitted model to make predictions on 
    # the test set
    predictions = model_fit.predict(X_test[list(features)])
    test_MSE = np.mean((Y_test.subtract(predictions, axis = 0))**2)
    test_MSE = test_MSE.tolist()

    return {'Model': model_fit, 'Test MSE': test_MSE}

# function to perform forward stepwise selection
def forward_stepwise_selection(predictors, X_train, Y_train, X_test, Y_test):
    results = [] # empty for now
    
    # extract predictors that still need to be checked
    # checks if each predictor in x_train is also not in predictors
    remaining_predictors = [p for p in X_train.columns if p not in predictors]
    
    # for each of the remaining predictors
    for p in remaining_predictors:
        
        # call processSubset to fit a linear model using predictors + each
        # of the remaining predictors
        results.append(processSubset(predictors + [p], 
                                     X_train, Y_train, X_test, Y_test))
        
    # turn results into a dataframe
    models = pd.DataFrame(results)
    
    # choose the best model (with the lowest test set MSE)    
    models.sort_values('Test MSE', inplace = True, axis = 0)
    best_model = models[:1]

    return best_model
    
# now write a for loop performing cross-validation
# in the ith fold, the elements of folds that equal i are in the test set and 
# the remainder are in the training
# dataframe to track the cross_validation errors for each model size
models_crossVal_errors_fwd = pd.DataFrame(columns = ['Model', 'Test MSE'])

# iterate over each fold
for i in range(1, k + 1):
    
    # reset predictors
    predictors = []
    
    # iterate over each model size
    for j in range(1, len(x.columns) + 1):
        
        # create train sets using all but fold i and test sets using the
        # remaining fold i (named *_cv so that the 75/25 split created
        # earlier is not overwritten)
        x_train_cv = x[folds != (i - 1)]
        y_train_cv = y[folds != (i - 1)]
        x_test_cv = x[folds == (i - 1)]
        y_test_cv = y[folds == (i - 1)]
        
        # call forward_stepwise_selection, training on every fold except i
        # test on the ith fold
        fss = forward_stepwise_selection(
            predictors, x_train_cv, y_train_cv, x_test_cv, y_test_cv)
        
        # add model and test MSE for this fold i to models_crossVal_errors
        # pd.concat is used because DataFrame.append was removed in pandas 2.0
        models_crossVal_errors_fwd = pd.concat([models_crossVal_errors_fwd, fss])

        # in order to add the test MSE for this model, the index of fss needs
        # to be called but this changes over time, so the following code
        # extracts the relevant index
        # convert column to series
        fss_test_MSE_column = pd.Series(fss['Test MSE'])
        # extract index value as a list
        fss_test_MSE_index = fss.index.values.tolist()
        # convert element in fss_index to str
        index_str = [str(i) for i in fss_test_MSE_index]
        fss_index = str(''.join(index_str))                                    
         
        # add test MSE for this model size (j) and fold (i) to 
        # crossVal_errors_fwd - fss_index is finally converted to int
        crossVal_errors_fwd.at[j, i] = pd.Series(fss['Test MSE'][int(fss_index)])

        # extract predictors
        predictors = list(fss['Model'])[0].model.exog_names
        # this ensures that when the j loop runs again, the predictors start
        # with the best selection at the previous model size
       
# this results in a matrix of test MSEs, where the (j,i)th element is 
# equal to the test MSE for the ith cross-validation fold for the best
# j-variable model
# obtain a vector for which the jth element is the cross-validation
# error for the j-variable model by averaging all the errors for that size
crossVal_errors_fwd_mean = crossVal_errors_fwd.apply(np.mean, axis = 1)

# standard error of each model size test MSE (standard deviation of test set
# MSE divided by the square root of the number of folds) before mean is found 
crossVal_errors_fwd_se = pd.DataFrame(crossVal_errors_fwd_mean,
                                      columns = ['Test MSE'])
crossVal_errors_fwd_se['SE'] = crossVal_errors_fwd.sem(axis = 1)

# plot
plot_fwd_cv_theme = theme(axis_title = element_text(size = 12.5),
                          axis_text = element_text(size = 10))
plot_fwd_cv = (ggplot(data = crossVal_errors_fwd_se, 
                      mapping = aes(x = np.array(range(1, 18, 1)), 
                                    y = 'Test MSE')) +
  geom_point() + # plot points
  geom_line() + # join points with line
  geom_errorbar(mapping = aes(ymin = crossVal_errors_fwd_se['Test MSE'] - 
                              crossVal_errors_fwd_se['SE'],
                              ymax = crossVal_errors_fwd_se['Test MSE'] + 
                              crossVal_errors_fwd_se['SE'])) +
  
  # add standard error bars
  geom_hline(yintercept = 0.0659094255830648, linetype = 'dashed',
             colour = 'red') + # add line denoting the standard error
  # above the best model
  theme_classic() + # white background, no gridlines
  xlab('Model Size') + # change x axis label
  ylab('Test MSE') + # change y axis label
  plot_fwd_cv_theme + # change the size of axis titles and axis text
  scale_y_continuous(limits = np.array([0.05, 0.25])) 
  # change y-axis limits
  )
plot_fwd_cv
# the 12 variable model is the simplest model that has a test MSE within one
# standard error of the best model (16 variables)

The best model has 16 variables, but a 12 variable model is the simplest model within one standard error of the best model. Therefore, the 12 variable model and its test set MSE will be added to the matrix.

In [None]:
# therefore add the test set MSE of the 12 variable model to the matrix
model_stats.loc['Forward stepwise selection', 
                'Test MSE'] = crossVal_errors_fwd_se['Test MSE'][12]

### 3.1.3 Backward stepwise selection

This approach is the opposite to forward stepwise - it starts with a model with all predictors. It then removes the least useful predictor at each model size. As with forward stepwise selection, this approach is faster than best subset selection but it doesn’t necessarily find the true ‘best’ model. I will skip the analysis here, because it is very similar to the forward stepwise selection results.
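For comparison, scikit-learn (version 0.24 or later) ships a `SequentialFeatureSelector` that can perform backward selection. The sketch below is not the method used in this notebook: it assumes synthetic data in which only the first two of five predictors matter, and it scores each candidate removal by 5-fold cross-validated MSE.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# synthetic data (an assumption): only columns 0 and 1 affect the response
rng = np.random.default_rng(35)
X = rng.normal(size = (300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale = 0.5, size = 300)

# start from all five predictors and remove the least useful one at each
# step until two predictors remain
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select = 2,
                                direction = 'backward',
                                scoring = 'neg_mean_squared_error',
                                cv = 5)
sfs.fit(X, y)

# column indices of the predictors that were kept
selected = np.flatnonzero(sfs.get_support())
print('Selected predictor columns:', selected)
```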

In [None]:
## Backwards stepwise selection
# taken and adapted from:
# http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-py.html

# using 10-fold cross-validation, so set that up first
k = 10
np.random.seed(seed = 35)
folds = np.random.choice(k, size = len(y), replace = True)

# dataframe to track cross-validation errors
crossVal_errors_bwd = pd.DataFrame(
    columns = range(1, k + 1),
    index = range(1, 18)
    )

# function to perform backward stepwise selection
def backward_stepwise_selection(predictors, X_train, Y_train, X_test, Y_test):
    results = [] # empty for now
    
    # for each combination of predictors with one predictor removed
    for combo in itertools.combinations(predictors, len(predictors) - 1):
        # call processSubset on each combination and append to results
        results.append(processSubset(combo, X_train, Y_train, X_test, Y_test))
        
    # turn results into a dataframe
    models = pd.DataFrame(results)
    
    # choose the best model (with the lowest test set MSE)    
    models.sort_values('Test MSE', inplace = True, axis = 0)
    best_model = models[:1]

    return best_model

models_crossVal_errors_bwd = pd.DataFrame(columns = ['Model', 'Test MSE'])

# iterate over each fold
for i in range(1, k + 1):
    
    predictors = x_train.columns
    
    # iterate over each model size
    while(len(predictors) > 1):

        # create train sets using all but fold i and test sets using the
        # remaining fold i (named *_cv so that the 75/25 split created
        # earlier is not overwritten)
        x_train_cv = x[folds != (i - 1)]
        y_train_cv = y[folds != (i - 1)]
        x_test_cv = x[folds == (i - 1)]
        y_test_cv = y[folds == (i - 1)]
        
        # call backward_stepwise_selection, training on every fold except i
        # test on the ith fold
        bss = backward_stepwise_selection(
            predictors, x_train_cv, y_train_cv, x_test_cv, y_test_cv)
        
        # add model and test MSE for this fold i to models_crossVal_errors
        # pd.concat is used because DataFrame.append was removed in pandas 2.0
        models_crossVal_errors_bwd = pd.concat([models_crossVal_errors_bwd, bss])

        # in order to add the test MSE for this model, the index of bss needs
        # to be called but this changes over time, so the following code
        # extracts the relevant index
        # convert column to series
        bss_test_MSE_column = pd.Series(bss['Test MSE'])
        # extract index value as a list
        bss_test_MSE_index = bss.index.values.tolist()
        # convert element in fss_index to str
        index_str = [str(i) for i in bss_test_MSE_index]
        bss_index = str(''.join(index_str))                                    
         
        # add test MSE for this model size (j) and fold (i) to 
        # crossVal_errors_bwd - bss_index is finally converted to int
        crossVal_errors_bwd.at[(len(predictors) - 1), i] = pd.Series(
            bss['Test MSE'][int(bss_index)])

        # extract predictors        
        predictors = list(bss['Model'])[0].model.exog_names
        
# note that this means a 17 variable model is never looked at, but the point
# of this is that it should be a subset of predictors anyway, so I'm content
# with leaving the 17 variable model empty (best subset selection identified
# that the 17 variable was not the best anyway)

# this results in a matrix of test MSE, where the (i,j)th element is 
# equal to the test MSE for the ith cross-validation fold for the best
# j-variable model
# obtain a vector for which the jth element is the cross-validation
# error for the j-variable model by averaging all the errors for that size
crossVal_errors_bwd_mean = crossVal_errors_bwd.apply(np.mean, axis = 1)

# standard error of each model size test MSE (standard deviation of test set
# MSE divided by the square root of the number of folds) before mean is found
crossVal_errors_bwd_se = pd.DataFrame(crossVal_errors_bwd_mean,
                                      columns = ['Test MSE'])
crossVal_errors_bwd_se['SE'] = crossVal_errors_bwd.sem(axis = 1)

plot_bwd_cv_theme = theme(axis_title = element_text(size = 22.5),
                          axis_text = element_text(size = 20))
plot_bwd_cv = (ggplot(data = crossVal_errors_bwd_se, 
                      mapping = aes(x = np.array(range(1, 18, 1)), 
                                    y = 'Test MSE')) +
  geom_point() + # plot points
  geom_line() + # join points with line
  geom_errorbar(mapping = aes(ymin = crossVal_errors_bwd_se['Test MSE'] - 
                              crossVal_errors_bwd_se['SE'],
                              ymax = crossVal_errors_bwd_se['Test MSE'] + 
                              crossVal_errors_bwd_se['SE'])) +
  
  # add standard error bars
  geom_hline(yintercept = 0.06589432278232081, linetype = 'dashed',
             colour = 'red') + # add line denoting the standard error
  # above the best model
  theme_classic() + # white background, no gridlines
  xlab('Model Size') + # change x axis label
  ylab('Test MSE') + # change y axis label
  plot_bwd_cv_theme + # change the size of axis titles and axis text
  scale_y_continuous(limits = np.array([0.05, 0.3])) 
  # change y-axis limits
  )
plot_bwd_cv
# the 12 variable model is the simplest model that has a test MSE within one
# standard error of the best model (16 variables)

# therefore add the test set MSE of the 12 variable model to the matrix
# (taken from the backward stepwise results, not the forward ones)
model_stats.loc['Backward stepwise selection', 
                'Test MSE'] = crossVal_errors_bwd_se['Test MSE'][12]
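
The 'within one standard error' choice that I read off the plot can also be made in code. This is only a minimal sketch: it assumes a table shaped like `crossVal_errors_bwd` (one row per model size, one column per fold), and the function name `one_se_choice` and the toy numbers are made up for illustration.

```python
import pandas as pd

def one_se_choice(cv_errors):
    """Return (best size, threshold, simplest size within one SE of the best).

    cv_errors: DataFrame indexed by model size, with one column per fold.
    """
    cv_errors = cv_errors.astype(float)
    mean = cv_errors.mean(axis=1)   # CV error for each model size
    se = cv_errors.sem(axis=1)      # standard error across folds
    best_size = mean.idxmin()
    threshold = mean.loc[best_size] + se.loc[best_size]
    # smallest model whose CV error does not exceed the threshold
    chosen_size = mean[mean <= threshold].index.min()
    return best_size, threshold, chosen_size

# toy CV errors (made-up numbers): 4 model sizes, 3 folds
toy = pd.DataFrame(
    [[0.300, 0.320, 0.280],
     [0.105, 0.110, 0.100],
     [0.100, 0.100, 0.085],
     [0.100, 0.090, 0.080]],
    index=[1, 2, 3, 4], columns=[1, 2, 3])

best_size, threshold, chosen_size = one_se_choice(toy)
print('best size:', best_size)
print('one-SE threshold:', round(threshold, 4))
print('chosen size:', chosen_size)
```

In the toy table the 4-variable model has the lowest error, but the 3-variable model falls under the threshold, so the simpler model is the one chosen.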


## 3.2 Shrinkage methods

### 3.2.1 Ridge regression

The previous three approaches were methods for selecting subsets of predictors. Ridge regression is an approach that uses all predictors, but it shrinks the coefficient estimates (the slopes, but not the intercept) towards zero, which reduces their variance. Like least squares, ridge regression seeks to minimise RSS, but it adds another term called a **shrinkage penalty**: λ multiplied by the sum of the squared coefficients. The tuning parameter (*lambda*, λ) controls the relative impact of the two terms.

When λ = 0, the shrinkage penalty is not in effect, so ridge regression is just performing least squares. As λ increases, the shrinkage penalty grows and the coefficients approach zero. When λ is extremely large, the coefficients will be essentially zero (but never exactly) - leading to a null model with no predictors. How do you select a good value of λ?
Cross-validation!
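
To make the effect of λ concrete, here is a minimal sketch on made-up data (nothing here comes from the housing dataset). It uses the closed-form ridge solution on centred data, so the intercept is left unpenalised, and prints the size of the coefficient vector for a few values of λ. The helper name `ridge_coefs` is my own.

```python
import numpy as np

rng = np.random.default_rng(35)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# centre the data so that the intercept is not penalised
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefs(lam):
    # closed-form ridge solution: (X'X + lam * I)^-1 X'y
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

lambdas = [0.0, 1.0, 100.0, 1e4, 1e8]
norms = [np.linalg.norm(ridge_coefs(lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:>13.1f}   size of coefficients = {norm:.6f}")
```

With λ = 0 the coefficients are the least squares estimates, and they get steadily smaller as λ grows without ever being exactly zero.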

You can create a grid of λ values, then use 10-fold cross-validation to train a ridge regression model for every value of λ in the grid. The ‘best’ model with the ‘best’ λ value is the model that has the lowest cross-validation MSE. Let’s see how ridge regression performs:

In [None]:
# set grid of alpha values
grid = 10**np.linspace(10, -2, 100)*0.5

# perform ridge regression with 10-fold cross-validation to find the best alpha,
# scoring is mean squared error (MSE)
ridge_model = RidgeCV(alphas = grid, scoring = 'neg_mean_squared_error', cv = 10)
# fit to training data
ridge_model.fit(x_train, y_train)
# extract best alpha
ridge_best_alpha = ridge_model.alpha_

# new ridge model with best alpha
ridge2 = Ridge(alpha = ridge_best_alpha)
# fit to training data
ridge2.fit(x_train, y_train)
# predict on test data
mean_squared_error(y_test, ridge2.predict(x_test))

A test set MSE of 0.062 is the best so far! Let's take a look at the models so far:

In [None]:
# add test MSE to matrix
model_stats.loc['Ridge regression', 
                'Test MSE'] = mean_squared_error(y_test, ridge2.predict(x_test))

# show models so far
model_stats

So we can see that the ridge regression with the best λ is the best model so far (with the lowest test MSE). Best subset selection is the best of the subset selection methods with forward and backward stepwise selection arriving at the same test MSE, slightly worse than best subset selection. 

### 3.2.2 Lasso regression

This approach is also a shrinkage method, like ridge regression. In ridge regression, shrinking the coefficients towards zero improves accuracy, but because every predictor stays in the model it can be difficult to interpret. Lasso regression overcomes this by shrinking coefficients towards zero and forcing some to be exactly zero - it performs variable selection, like the subset methods, by removing variables.
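
The difference is easy to see on a small made-up example. In this sketch (synthetic data, with alpha values picked by hand purely for illustration) only the first two of six predictors affect the response. Ridge keeps a small non-zero coefficient for every predictor, while lasso sets the irrelevant ones to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(35)
n = 300
X = rng.normal(size=(n, 6))
# only the first two predictors matter
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print('ridge coefficients:', np.round(ridge.coef_, 3))
print('lasso coefficients:', np.round(lasso.coef_, 3))
print('coefficients exactly zero - ridge:', ridge_zeros, 'lasso:', lasso_zeros)
```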

Just like ridge regression, lasso regression uses λ. The same grid of λ values will be used, then 10-fold cross-validation will be used to train a lasso regression model for every value of λ in the grid. The ‘best’ model with the ‘best’ λ value is the model that has the lowest cross-validation MSE. Let’s see how lasso regression performs:

In [None]:
# from: https://nbviewer.jupyter.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb

# perform lasso regression with 10-fold cross-validation to find the best alpha,
# 'random_state = 35' ensures reproducible results
lasso_model = LassoCV(alphas = grid, cv = 10, random_state = 35)
# fit to training data
lasso_model.fit(x_train, y_train.values.ravel())
# extract best alpha
lasso_best_alpha = lasso_model.alpha_

# new lasso model with best alpha
lasso2 = Lasso(alpha = lasso_best_alpha)
# fit to training data
lasso2.fit(x_train, y_train)
# predict on test data
mean_squared_error(y_test, lasso2.predict(x_test))

A test set MSE of 0.065 is worse than ridge regression and best subset selection but is better than forward and backward stepwise selection. 

In [None]:
# add test MSE to matrix
model_stats.loc['Lasso regression', 
                'Test MSE'] = mean_squared_error(y_test, lasso2.predict(x_test))

## 3.3 Dimension reduction methods
### 3.3.1 Principal components regression
Principal components regression (PCR) is a dimension reduction technique.
The data start with one dimension for each predictor.
The first principal component is the direction through the data that captures the most variation - it is a linear combination of all the predictors. Each subsequent component captures the most remaining variation while being uncorrelated with the previous ones.
The hope is that only a few components are responsible for most of the variation in the data, so the response can be regressed on those few components instead of on all the predictors. If all principal components are included in a model, this is just least squares.
It is similar to ridge regression, in that all the predictors are included - i.e. it does not perform variable selection.
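
A minimal sketch of the idea on made-up data: scale the predictors, rotate them onto principal components, then regress on the first few components. I have wrapped the steps in a scikit-learn pipeline here, which is my choice for the illustration rather than how the rest of this notebook does it. The last row of output shows that using every component gives the same predictions as least squares.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(35)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)  # a correlated predictor
y = X @ np.array([1.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.3, size=200)

def pcr(n_components):
    # scale -> keep the first n_components principal components -> regress
    return make_pipeline(StandardScaler(), PCA(n_components=n_components),
                         LinearRegression())

# ordinary least squares on the scaled predictors, for comparison
ols_pred = make_pipeline(StandardScaler(),
                         LinearRegression()).fit(X, y).predict(X)

explained, gaps = [], []
for m in range(1, 5):
    model = pcr(m).fit(X, y)
    explained.append(model.named_steps['pca'].explained_variance_ratio_.sum())
    gaps.append(np.max(np.abs(model.predict(X) - ols_pred)))
    print(f"{m} component(s): variance explained = {explained[-1]:.3f}, "
          f"largest gap to least squares = {gaps[-1]:.2e}")
```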

I'll refer you to some other sources for a better and more in-depth explanation: [Hands on Machine Learning with R, Chapter 17](https://bradleyboehmke.github.io/HOML/pca.html)
and [An Introduction to Statistical Learning (page 231)](https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf).

You decide on the number of components by performing cross-validation.
PCR performs as follows:

In [None]:
# from: https://nbviewer.jupyter.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb

pca = PCA()
lm = LinearRegression()
# scale the predictors
x_transformed = pca.fit_transform(scale(x_train))
n = len(x_transformed)

# using 10-fold cross-validation - different way of choosing folds so not
# necessarily comparable with the other method, but it shouldn't make too
# much difference
folds2 = KFold(n_splits = 10, shuffle = True, random_state = 35)

# track MSE
pcr_mse = []

# calculate MSE for the intercept
pcr_intercept = -1*cross_val_score(lm, np.ones((n, 1)), y_train, cv = folds2,
                                   scoring = 'neg_mean_squared_error').mean()
pcr_mse.append(pcr_intercept)

# now calculate MSE cross-validation for all 17 principal components
for i in range(1, 18, 1):
    pcr_value = -1*cross_val_score(lm, x_transformed[:, :i], y_train, 
                                   cv = folds2,
                                   scoring = 'neg_mean_squared_error').mean()
    pcr_mse.append(pcr_value)


# plot
pcr_mse = pd.DataFrame(pcr_mse)
plot_pcr_cv_theme = theme(axis_title = element_text(size = 12.5),
                          axis_text = element_text(size = 10))
plot_pcr_cv = (ggplot(data = pcr_mse, 
                      mapping = aes(x = np.array(range(0, 18, 1)), 
                                    y = pcr_mse)) +
  geom_point() + # plot points
  geom_line() + # join points with line
  geom_hline(yintercept = min(pcr_mse[0]), linetype = 'dashed',
             colour = 'red') + # add line denoting the lowest
  # cross-validation MSE
  theme_classic() + # white background, no gridlines
  xlab('Number of Principal Components') + # change x axis label
  ylab('Test MSE') + # change y axis label
  plot_pcr_cv_theme # change the size of axis titles and axis text
  )
plot_pcr_cv

The lowest test MSE is with all 17 principal components, therefore it is essentially least squares. I'll now use all 17 components when training on all the training data:

In [None]:
# transform test data, keeping all 17 components
x_transformed_test = pca.transform(scale(x_test))[:, :17]
# train on training data
lm = LinearRegression()
lm.fit(x_transformed[:, :17], y_train)
# test on test data
pcr_predictions = lm.predict(x_transformed_test)
print(mean_squared_error(y_test, pcr_predictions))

# add test MSE to matrix
model_stats.loc['PCR', 'Test MSE'] = mean_squared_error(y_test, pcr_predictions)

A test set MSE of 0.0623 is the second best so far.

### 3.3.2 Partial least squares

Partial least squares (PLS) is also a dimension reduction technique. PCR was an unsupervised approach - the response was not used to help determine the principal component directions. So, the directions might have explained the predictors but those directions are not necessarily the best directions for predicting the response.

PLS overcomes this as a supervised approach by utilising the response to identify components that explain the predictors and the response. Once again, the number of components can be determined using cross-validation.

In [None]:
# from: https://nbviewer.jupyter.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb

n = len(x_train)

# using 10-fold cross-validation 
folds2 = KFold(n_splits = 10, shuffle = True, random_state = 35)

# track MSE
pls_mse = []

# now calculate cross-validated MSE for models with 1 to 17 PLS components
for i in range(1, 18, 1):
    pls = PLSRegression(n_components = i)
    # the scorer returns negative MSE, so multiply by -1 to get the MSE
    pls_value = -1*cross_val_score(pls, scale(x_train), y_train, cv = folds2,
                                   scoring = 'neg_mean_squared_error').mean()
    pls_mse.append(pls_value)

# plot
pls_mse = pd.DataFrame(pls_mse)
plot_pls_cv_theme = theme(axis_title = element_text(size = 12.5),
                          axis_text = element_text(size = 10))
plot_pls_cv = (ggplot(data = pls_mse, 
                      mapping = aes(x = np.array(range(1, 18, 1)), 
                                    y = pls_mse)) +
  geom_point() + # plot points
  geom_line() + # join points with line
  geom_hline(yintercept = min(pls_mse[0]), linetype = 'dashed',
             colour = 'red') + # add line denoting the lowest
  # cross-validation MSE
  theme_classic() + # white background, no gridlines
  xlab('Number of PLS Components') + # change x axis label
  ylab('Test MSE') + # change y axis label
  plot_pls_cv_theme # change the size of axis titles and axis text
  )
plot_pls_cv

The lowest test set MSE is again with all 17 components, but you can see that there is very little change after four components. Therefore, I am going to use four components when fitting to all the training data:

In [None]:
pls = PLSRegression(n_components = 4)
pls.fit(scale(x_train), y_train)
print(mean_squared_error(y_test, pls.predict(scale(x_test))))

# add test MSE to matrix
model_stats.loc['PLS', 'Test MSE'] = mean_squared_error(
    y_test, pls.predict(scale(x_test))
    )

This test set MSE is almost identical to PCR, but is slightly better and therefore becomes the second best so far:

In [None]:
model_stats

# 4. Tree-based approaches

Tree-based methods involve segmenting the predictor space into separate regions. By following the ‘rules’ at each internal node, you will end up at a terminal node corresponding to a particular region. The prediction is the mean (or sometimes median) value for that region. See my [R version](https://www.kaggle.com/thwaiteso/advanced-regression-techniques-r) for a good illustration.

## 4.1 Regression tree

Let's see how a single regression tree performs. 'ccp_alpha' is the complexity parameter used to determine whether the tree needs pruning. I will use cross-validation to choose the best alpha from a range. I have capped the tree at a maximum depth of three because when left to be as complex as it likes, the resulting plot of the tree is illegible:

In [None]:
# ccp_alpha is the complexity parameter used to determine pruning the tree
# vector of alpha values will be created and used
ccp_alphas = np.array(list(np.arange(0, 0.01, 0.0005)), 
                      dtype = 'float64')

# dataframe to track test MSE for the cross-validation using different alphas
single_tree_stats = pd.DataFrame(columns = ccp_alphas, 
                                 index = list(range(1, 11, 1)))

# iterate over each fold
for i in range(1, k + 1):
    
    # create train sets using all but fold i and test sets using the
    # remaining fold i
    x_train = x[folds != (i - 1)]
    y_train = y[folds != (i - 1)]
    x_test = x[folds == (i - 1)]
    y_test = y[folds == (i - 1)]
    
    # using each alpha    
    for ccp_alpha in ccp_alphas:
        # create tree - I have (arbitrarily) chosen a max depth of 3
        tree = DecisionTreeRegressor(random_state = 35, max_depth = 3,
                                     ccp_alpha = ccp_alpha)
        # fit to training values
        tree.fit(x_train, y_train)
        # create predictions using test set
        tree_predict = tree.predict(x_test)
        # calculate test set MSE
        tree_test_mse = mean_squared_error(y_test, tree_predict)
        # add to dataframe
        single_tree_stats.loc[i, ccp_alpha] = tree_test_mse

# find the mean test set MSE for each alpha
tree_mean = single_tree_stats.apply(np.mean, axis = 0)

# plot
plot_tree_cv_theme = theme(axis_title = element_text(size = 12.5),
                           axis_text = element_text(size = 10))
plot_tree_cv = (ggplot(data = pd.DataFrame(tree_mean), 
                      mapping = aes(x = ccp_alphas, 
                                    y = tree_mean)) +
  geom_point() + # plot points
  geom_line() + # join points with line
  geom_hline(yintercept = min(tree_mean[0:]), linetype = 'dashed',
             colour = 'red') + # add line denoting the lowest test set MSE
  theme_classic() + # white background, no gridlines
  xlab('Alpha') + # change x axis label
  ylab('Test MSE') + # change y axis label
  plot_tree_cv_theme + # change the size of axis titles and axis text
  scale_x_continuous(breaks = np.array(np.arange(0, 0.0125, 0.0025)),
                     labels = np.array(np.arange(0, 0.0125, 0.0025)),
                     limits = np.array([0, 0.01])) +
  scale_y_continuous(breaks = np.array(np.linspace(0.05, 0.11, 5)),
                     labels = np.array(np.linspace(0.05, 0.11, 5)),
                     # using linspace because arange was outputting floats
                     # with a large number of decimal places, linspace stops
                     # this from happening
                     limits = np.array([0.05, 0.11]))
  # change x and y axis labels
  )
plot_tree_cv

The best test set MSE was found with an alpha of 0, 0.0005 or 0.001. I will select 0, this means the tree is not pruned at all. When fitting a tree with alpha = 0 to all the training data I will once again cap its depth at three.
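
To show what `ccp_alpha` does in general, here is a minimal sketch on made-up data with no depth cap. The alpha values are picked by hand for illustration. A larger alpha penalises complexity more heavily, so more of the tree is pruned away and fewer leaves remain.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(35)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(scale=0.2, size=300)

alphas = [0.0, 0.0005, 0.005, 0.05]
leaf_counts = []
for alpha in alphas:
    # grow a full tree, then prune it back according to ccp_alpha
    tree = DecisionTreeRegressor(random_state=35, ccp_alpha=alpha).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())
    print(f"ccp_alpha = {alpha:.4f}   leaves = {leaf_counts[-1]}")
```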

In [None]:
# fit a tree to all training data - again going to use a max depth of 3
# I have tried using everything default, but the tree is massive and illegible
single_tree = DecisionTreeRegressor(random_state = 35, max_depth = 3,
                                    ccp_alpha = 0)

# fit to training values
single_tree.fit(x_train, y_train)
# create predictions using test set
single_tree_predict = single_tree.predict(x_test)
# calculate test set MSE
single_tree_test_mse = mean_squared_error(y_test, single_tree_predict)
fig, ax = plt.subplots(figsize = (17.5, 17.5))
plot_tree(single_tree, fontsize = 10, ax = ax, feature_names = x_train.columns)
plt.show()

You can see a number of internal nodes used to split the data, starting with determining if the grade of the house is below or above 8.5. The number of (training) samples in each node is shown. The predictions ('value = ') are the house prices which were log-transformed - for reference 12.5 is 268,337.30, 13 is 442,413.40, 13.5 is 729,416.40 and 14 is 1,202,604.
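
Those reference values are consistent with a natural log transform, so converting a prediction back to a price is just a matter of exponentiating. A quick sketch, assuming the natural log was used:

```python
import numpy as np

log_prices = np.array([12.5, 13.0, 13.5, 14.0])
prices = np.exp(log_prices)  # undo the natural log transform
for log_price, price in zip(log_prices, prices):
    print(f"{log_price:.1f} -> {price:,.2f}")
```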

So, how does the performance of this single tree stack up against the previous models?

In [None]:
# add test MSE to matrix
model_stats.loc['Single regression tree', 'Test MSE'] = single_tree_test_mse
model_stats

This single tree is the worst method so far. Fortunately, there are other methods that aggregate many trees to increase the predictive performance.

## 4.2 Bagging

‘Bagging’, aka bootstrap aggregation, is used to reduce the variance of a model. Bootstrapping is an approach where you take repeated samples, with replacement, from the same (training) data set. You train your model on each bootstrapped training set, and average their predictions.

In the context of trees, you can construct a tree for each bootstrapped training set and average their predictions. The trees are deep and unpruned - each tree therefore has high variance and low bias, but by averaging the trees you reduce the variance.
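
The procedure can be sketched by hand in a few lines. This uses made-up data and 100 trees (an arbitrary choice), and compares the average error of the individual trees with the error of their averaged prediction. It is only meant to illustrate the idea, and is not how `RandomForestRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(35)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)
X_new = np.linspace(0, 10, 50).reshape(-1, 1)
truth = np.sin(X_new[:, 0])  # noise-free signal, known because data is made up

n_trees = 100
tree_predictions = np.empty((n_trees, len(X_new)))
for b in range(n_trees):
    # bootstrap sample: draw n rows with replacement
    rows = rng.integers(0, len(X), size=len(X))
    # deep, unpruned tree fit to the bootstrapped training set
    tree = DecisionTreeRegressor(random_state=b).fit(X[rows], y[rows])
    tree_predictions[b] = tree.predict(X_new)

# the bagged prediction is the average over all trees
bagged_prediction = tree_predictions.mean(axis=0)

single_tree_mse = np.mean((tree_predictions - truth) ** 2, axis=1).mean()
bagged_mse = np.mean((bagged_prediction - truth) ** 2)
print(f"average MSE of a single tree: {single_tree_mse:.4f}")
print(f"MSE of the bagged prediction: {bagged_mse:.4f}")
```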

In [None]:
# bagging is a special case of a random forest, where the subset of predictors
# used is equal to the number of predictors - max_features = 17 denotes this
bag_model = RandomForestRegressor(max_features = 17, random_state = 35)
# fit to training data
bag_model.fit(x_train, y_train.values.ravel())
# make predictions using test data
bag_predict = bag_model.predict(x_test)
# calculate test set MSE
bag_test_set_mse = mean_squared_error(y_test, bag_predict)
print(bag_test_set_mse)

This test set MSE is a substantial improvement from a single tree, about twice as good and is actually the best performance so far. We're not done yet though.

In [None]:
# add test MSE to matrix
model_stats.loc['Bagging', 'Test MSE'] = bag_test_set_mse

## 4.3 Random forest

A random forest is similar to bagging, in that a tree is fit to each bootstrapped training set. However, where bagging considers every variable when making a split, a random forest only considers a *random subset* of the predictors. This may sound counter-intuitive - why would you not want to consider every variable at each split?

If you have one very strong predictor, it is highly likely that in bagging this variable will be responsible for the first split in almost every tree - therefore the trees will look quite similar and the predictions from them will be highly correlated. Averaging highly correlated predictions does not reduce the variance as much as averaging uncorrelated predictions. This is where random forest steps in - by considering a (random) subset of predictors at each split, the trees are *decorrelated*, because the other predictors are given more of a chance.

The number of predictors in each subset is therefore a factor to consider, but I will not tune it in this analysis. I will simply use five, the same value I used in R (where it was the default).

A fitted random forest also lets you estimate which variables are more important in predicting the response:

In [None]:
# growing a random forest is similar, but a lower value of max_features
# (mtry in R) is used - a value of 5 is used here, the same number I used in R
rf_model = RandomForestRegressor(max_features = 5, random_state = 35)
# fit to training data
rf_model.fit(x_train, y_train.values.ravel())

# extract importance of each variable
rf_var_importance = pd.DataFrame({'Importance': rf_model.feature_importances_}, 
                                 index = x_train.columns)

# plot
plot_rf_var_importance_theme = theme(axis_title = element_text(size = 12.5),
                                     axis_text = element_text(size = 10))
plot_rf_var_importance = (ggplot(data = rf_var_importance,
                                 mapping = aes(x = rf_var_importance.index,
                                               y = rf_var_importance)) +
  geom_col() + # plot bars
  theme_classic() + # white background, no gridlines
  xlab('Variable') + # change x axis label
  ylab('Importance of Variable') + # change y axis label
  plot_rf_var_importance_theme + # change the size of axis titles and axis text
  coord_flip() # flips to put variables on y axis, improving readability
  )
plot_rf_var_importance

The plot shows that latitude is by far the most important variable, with the grade of the house and the square footage of the living area rounding out the top three.

That’s all well and good, but is random forest an improvement over bagging?

In [None]:
# make predictions using test data
rf_predict = rf_model.predict(x_test)
# calculate test set MSE
rf_test_set_mse = mean_squared_error(y_test, rf_predict)
print(rf_test_set_mse)

# add test MSE to matrix
model_stats.loc['Random forest', 'Test MSE'] = rf_test_set_mse

Random forest is slightly worse than bagging but is still far superior to the linear model approaches.

## 4.4 Boosting

In boosting, trees are grown slowly using information from previously grown trees, unlike bagging where trees are grown independently. Boosting does not utilise bootstrap sampling.
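
For squared error loss, 'using information from previously grown trees' means each new tree is fit to the residuals left by the trees before it, and only a fraction of that tree is added to the model. Below is a minimal hand-rolled sketch on made-up data. The learning rate, depth and number of trees are arbitrary choices for illustration, and `GradientBoostingRegressor` does considerably more than this.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(35)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1  # shrinkage: the fraction of each tree that is added
n_trees = 200
max_depth = 2        # shallow trees, each one a weak learner

prediction = np.full(len(y), y.mean())  # start from a constant model
train_mse = [np.mean((y - prediction) ** 2)]
for _ in range(n_trees):
    # fit the next tree to what the current model still gets wrong
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
    # add a shrunken version of the new tree to the model
    prediction += learning_rate * tree.predict(X)
    train_mse.append(np.mean((y - prediction) ** 2))

for n in (0, 10, 50, 200):
    print(f"training MSE after {n:>3} trees: {train_mse[n]:.4f}")
```

The training error falls slowly with each tree, which is why the learning rate and the number of trees have to be tuned together.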

There are a number of parameters that can be changed to tune the performance:
1. The shrinkage parameter (λ) - this is the rate at which boosting learns
2. The depth of each tree
3. The minimum number of observations needed in a node
4. The ‘bag.fraction’ parameter (`subsample` in scikit-learn), which enables stochastic gradient boosting. Boosting tries to find the minimum of a loss function. If the loss function is U-shaped, that minimum is not difficult to find. However, if it is shaped differently, the global minimum might be behind multiple local minimums which the algorithm would normally stop at. With stochastic gradient boosting, a random subset of the training data is used to grow a tree, a different subset for the next tree, and so on. This speeds up the algorithm, and while it doesn’t guarantee that the global minimum is found, it does make it more likely that local minimums and plateaus can be overcome.

Therefore, I will use a number of different values for each parameter:

In [None]:
# create a dictionary for the different values for the different parameters
parameters = {'Learning rate': [0.001, 0.01, 0.1],
              'Depth': [1, 2, 3, 4, 5],
              'Min. Node Obs.': [5, 10, 15],
              'Bag fraction': [0.65, 0.8, 1]}

# ParameterGrid creates dictionaries for each of the different combinations
# of parameters
parameter_grid = ParameterGrid(parameters)

# dataframe to track test MSE
boost_stats = pd.DataFrame(columns = ['Test MSE']) 
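
As a quick check on what `ParameterGrid` produces, the sketch below rebuilds the same grid on its own and shows that it can be measured with `len` and indexed like a list, with indices starting at 0:

```python
from sklearn.model_selection import ParameterGrid

parameters = {'Learning rate': [0.001, 0.01, 0.1],
              'Depth': [1, 2, 3, 4, 5],
              'Min. Node Obs.': [5, 10, 15],
              'Bag fraction': [0.65, 0.8, 1]}
parameter_grid = ParameterGrid(parameters)

n_combinations = len(parameter_grid)  # 3 * 5 * 3 * 3 combinations
first_combination = parameter_grid[0]
last_combination = parameter_grid[n_combinations - 1]

print('number of combinations:', n_combinations)
print('first combination:', first_combination)
print('last combination:', last_combination)
```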

I'll then perform boosting, fitting a model to each combination of parameters:

In [None]:
# iterate over each combination of parameters, starting at index 0 so that
# the first combination is not skipped
for i, params in enumerate(parameter_grid):
    
    # create boosted model using parameters at i
    boost_model = GradientBoostingRegressor(
        learning_rate = params['Learning rate'],
        subsample = params['Bag fraction'], # enables
        # stochastic gradient boosting
        min_samples_leaf = params['Min. Node Obs.'],
        max_depth = params['Depth'],
        random_state = 35
        )
    
    # fit to training data
    boost_model.fit(x_train, y_train.values.ravel())
    # predict using test data
    boost_predict = boost_model.predict(x_test)
    # calculate test set MSE
    boost_test_set_mse = mean_squared_error(y_test, boost_predict)
    # add to dataframe
    boost_stats.loc[i, 'Test MSE'] = boost_test_set_mse

print('Lowest test set MSE is', min(boost_stats['Test MSE']))
# look up the parameters of the model with the lowest test set MSE (when I
# ran this, it was the last model used, at index 134)
best_index = int(boost_stats['Test MSE'].astype(float).idxmin())
print('The best model had these parameters:', parameter_grid[best_index])
# I won't repeat this with a different set of parameter values here - I have
# shown it can be done in R, and it is possible (if a bit trickier) in python

You could keep on fine-tuning the parameters almost indefinitely, but I will stop for now. I will take the best model and add it to the matrix.

In [None]:
# add test MSE to matrix
model_stats.loc['Boosting', 'Test MSE'] = min(boost_stats['Test MSE'])

# 5. Conclusion

I’ll recap what I’ve done in this analysis:  
1) I used 11 different methods to predict house prices, seven based on the linear model and four tree-based methods  
2) For each method, models were fit to training data, with some type of validation used to determine what the best model was  
3) The best model for a given method was then tested on a test set, with the resulting MSE placed in a matrix  
4) The matrix allows all of these models to be compared, with the results outlined below:

In [None]:
model_stats

The worst model, with the highest test MSE, was the single regression tree, followed by forward and backward stepwise selection. Lasso regression comes next, followed by best subset selection. PLS is the best of the linear model methods, beating ridge regression, which comes second. The best models were the remaining tree-based methods: boosting is the best model, with a notable improvement over second-placed random forest and a substantial decrease in test MSE compared to the worst models.

The tree-based approaches also highlighted that latitude was the most important predictor, followed by the square footage of the house and the grade of the house, and to a lesser extent longitude. This confirmed my early thoughts based on the exploratory data analysis in section 2.

If you’ve made it this far, thank you for taking the time to read my analysis. I welcome any feedback in the comments. Consider checking out my previous [analysis of sex differences in suicide rates](https://www.kaggle.com/thwaiteso/analysis-of-sex-differences-in-suicide-rates-r) or a [binary classification of heart disease](https://www.kaggle.com/thwaiteso/binary-classification-of-heart-disease-r).

