# TRAINING A NEURAL NETWORK WITH BOSTON HOUSING DATA

Adapted from Lewis (2016), Chapter 2


Info on dataset:

The Boston housing dataset has to do with a study carried out in 1978 concerning the median prices of housing in 506 residential areas of Boston, MA, USA.

Originally one of the motivations of the study was to check if the pollution levels were having an impact on these prices. The dataset contains a series of descriptive socio-economic variables on each residential area and also the measurements of a pollutant (nitrogen oxides concentration), as well as characteristics of the houses in each area.

There is also a “target” variable, the median price of the houses in each region (variable medv), whose values are assumed to depend on the values of the other descriptor variables.

The dataset contains both numeric and nominal variables.

More details on their meaning can be obtained on the help page associated with the dataset available in package MASS (Venables and Ripley, 2002).

(Torgo 2017)

## Load libraries and data

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(plotly) # for interactive visualizations
library(neuralnet) # for neural network model
library(deepnet) # for neural network model
library(NeuralNetTools) # for visualizing neural nets
library(MASS) # for data
library(mice) # for imputation
library(VIM) # for imputation
library(listviewer) # for viewing list objects
library(BBmisc) # for standardization/normalization
library(Metrics) # for model fit criteria

In [None]:
data("Boston", package = "MASS")

In [None]:
Boston_dt <- as.data.table(Boston)

In [None]:
Boston_dt

## Explore data

In [None]:
str(Boston_dt)

In [None]:
glimpse(Boston_dt)

Boston contains 506 rows and 14 columns.

```
crim
per capita crime rate by town.

zn
proportion of residential land zoned for lots over 25,000 sq.ft.

indus
proportion of non-retail business acres per town.

chas
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox
nitrogen oxides concentration (parts per 10 million).

rm
average number of rooms per dwelling.

age
proportion of owner-occupied units built prior to 1940.

dis
weighted mean of distances to five Boston employment centres.

rad
index of accessibility to radial highways.

tax
full-value property-tax rate per $10,000.

ptratio
pupil-teacher ratio by town.

black
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat
lower status of the population (percent).

medv
median value of owner-occupied homes in $1000s.
```

We select the ones we want to use:

- crim: per capita crime rate by town.
- indus: proportion of non-retail business acres per town.
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes.

In [None]:
Boston_dt[, c("zn", "chas", "rad", "black") := NULL]

In [None]:
Boston_dt

Check for missing values:

In [None]:
mdpat <- mice::md.pattern(Boston_dt)
mdpat

There are no missing values: all 506 cases are complete in every column.

We can confirm this with the VIM package. In the plot, blue corresponds to observed values, and it covers 100% of the cases.

In [None]:
VIM::aggr(Boston_dt, numbers = TRUE)

## Normalization

In [None]:
Boston_z <- Boston_dt[,BBmisc::normalize(.SD)]
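
By default, BBmisc::normalize uses method = "standardize", which centers each column to mean 0 and scales it to standard deviation 1. The following is a minimal base-R sketch of the same transformation, assuming the ten columns kept above. The names keep and boston_z_base are illustrative and do not appear in the original notebook.

```r
library(MASS)  # provides the Boston data frame

# Columns kept in the notebook after dropping zn, chas, rad and black
keep <- c("crim", "indus", "nox", "rm", "age", "dis",
          "tax", "ptratio", "lstat", "medv")

# scale() subtracts each column's mean and divides by its standard deviation
boston_z_base <- as.data.frame(scale(Boston[, keep]))

# Every column should now have mean ~0 and standard deviation 1
round(colMeans(boston_z_base), 10)
sapply(boston_z_base, sd)
```

Note that the target medv is standardized along with the predictors, so later error metrics are expressed in standard-deviation units rather than in $1000s.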

## Split data

In [None]:
set.seed(2016)
train <- Boston_z[,sample(.I, 400)]

In [None]:
data_train <- Boston_z[train]
data_test <- Boston_z[-train]

## Modelling with neuralnet package

### Train a model

In [None]:
formula1 <- reformulate(names(Boston_dt) %>% setdiff("medv"), "medv")
formula1

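The model is trained with the following settings:

- The weight update algorithm is resilient backpropagation with weight backtracking ("rprop+").
- The error function is the sum of squared errors ("sse").
- The activation function for the hidden neurons is logistic.
- The output neuron uses a linear activation function (linear.output).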
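
To make these settings concrete, here is a small sketch of the logistic activation and a sum-of-squared-errors function. The factor of 1/2 follows the convention that, to my understanding, neuralnet uses for "sse"; treat it as an assumption. The function names logistic and sse are illustrative.

```r
# Logistic (sigmoid) activation applied in the hidden neurons
logistic <- function(x) 1 / (1 + exp(-x))

# Sum of squared errors with a factor of 1/2
# (assumed to match neuralnet's "sse" convention)
sse <- function(observed, predicted) 0.5 * sum((observed - predicted)^2)

# The logistic function squashes any input into the interval (0, 1)
logistic(c(-2, 0, 2))

# Error for three toy observations
sse(c(1, 2, 3), c(1.5, 2, 2))
```

Because the output neuron is linear, predictions are not squashed into (0, 1), which is what a regression target such as medv requires.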

In [None]:
fit <- neuralnet::neuralnet(formula1,
                            data = data_train,
                            hidden = c(10, 12, 20),
                            algorithm = "rprop+",
                            err.fct = "sse",
                            act.fct = "logistic",
                            threshold = 0.1,
                            linear.output = TRUE,
                            lifesign = "full",
                            lifesign.step = 2000,
                            stepmax = 200000)

In [None]:
fit

In [None]:
NeuralNetTools::plotnet(fit, cex_val = 0.4, line_stag = 0)

We can rank the variables by importance using the Garson or Olden method. However, garson only supports models with a single hidden layer, so for this model we use olden:

In [None]:
NeuralNetTools::olden(fit)

In this run, the most important features are dis and age.

The weights are:

In [None]:
NeuralNetTools::neuralweights(fit)

### Predictive power

In [None]:
pred <- neuralnet::compute(fit, data_test[,!"medv"])

In [None]:
pred %>% listviewer::jsonedit(mode = "form")

Squared correlation between predicted and observed values:

In [None]:
cor(pred$net.result, data_test$medv)^2

Mean squared error:

In [None]:
Metrics::mse(pred$net.result, data_test$medv)

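The error function minimized in training is the sum of squared errors, which is proportional to the mean squared error. Because the errors are squared, a few outliers can dominate the loss, and the resulting model can struggle to capture the mechanism that generates the data.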
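
The sensitivity to outliers can be illustrated with a toy example that compares the mean squared error with the mean absolute error. The data and the helper names mse_toy and mae_toy are made up for illustration.

```r
# Toy observed values and predictions (hypothetical numbers)
observed  <- c(1, 2, 3, 4, 5)
predicted <- c(1.1, 1.9, 3.2, 3.8, 5.1)

mse_toy <- function(actual, pred) mean((actual - pred)^2)
mae_toy <- function(actual, pred) mean(abs(actual - pred))

# Replace the last observation with an outlier
observed_out <- observed
observed_out[5] <- 15

# How much each metric grows once the outlier is present
mse_ratio <- mse_toy(observed_out, predicted) / mse_toy(observed, predicted)
mae_ratio <- mae_toy(observed_out, predicted) / mae_toy(observed, predicted)
c(mse_ratio = mse_ratio, mae_ratio = mae_ratio)
```

A single outlier inflates the squared-error metric far more than the absolute-error metric, which is why a model fitted by minimizing squared error can be pulled toward such points.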

And root mean squared error:

In [None]:
Metrics::rmse(pred$net.result, data_test$medv)
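
Because medv was standardized together with the predictors, the MSE and RMSE above are in standard-deviation units of medv. A minimal sketch of converting an RMSE back to $1000s, assuming standardization by the full-data mean and standard deviation as in the normalization step. The predictions used here are hypothetical (observed values shifted by a constant), not output from the fitted network.

```r
library(MASS)  # provides the Boston data frame

medv_mean <- mean(Boston$medv)
medv_sd   <- sd(Boston$medv)

# Standardized target, as produced by the normalization step
medv_z <- (Boston$medv - medv_mean) / medv_sd

# Hypothetical standardized predictions: observed values plus a constant error
pred_z <- medv_z + 0.1
rmse_z <- sqrt(mean((pred_z - medv_z)^2))

# Back-transform the predictions to $1000s and recompute the RMSE
pred_orig <- pred_z * medv_sd + medv_mean
rmse_orig <- sqrt(mean((pred_orig - Boston$medv)^2))

# The two routes agree: RMSE in $1000s equals the standardized RMSE times sd(medv)
c(rmse_z = rmse_z, rmse_orig = rmse_orig, rescaled = rmse_z * medv_sd)
```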

## Modelling with deepnet package

### Train a model

In [None]:
library(deepnet)

In [None]:
X <- as.matrix(data_train[,!"medv"])
Y <- data_train[,medv]

In [None]:
set.seed(2016)
fitB <- deepnet::nn.train(x = X,
                         y = Y,
                         initW = NULL,
                         initB = NULL,
                         hidden = c(10, 12, 20),
                            learningrate = 0.58,
                            momentum = 0.74,
                            learningrate_scale = 1,
                            activationfun = "sigm",
                            output = "linear",
                            numepochs = 970,
                            batchsize = 60,
                            hidden_dropout = 0,
                            visible_dropout = 0)

The deepnet package gives you the ability to specify the starting values of the neuron weights (initW) and biases (initB).

We set both values to NULL so that the algorithm selects them at random. The DNN has the same topology as the one estimated before, i.e. three hidden layers with 10, 12 and 20 neurons in the first, second and third hidden layers respectively.
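
The topology determines how many parameters the network has to estimate. A rough count, assuming the nine predictors kept earlier as inputs, one output neuron and one bias per neuron (the names layers and n_params are illustrative):

```r
# Layer sizes: 9 inputs, three hidden layers (10, 12, 20), 1 output
layers <- c(9, 10, 12, 20, 1)

# Each layer contributes (incoming neurons + 1 bias) * outgoing neurons parameters
n_params <- sum((head(layers, -1) + 1) * tail(layers, -1))
n_params
```

Under these assumptions the network has more parameters than the 400 training rows, which is worth keeping in mind when interpreting the test-set metrics.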

To use the backpropagation algorithm, you have to specify a learning rate, momentum and learning rate scale. 

The learning rate sets the size of each weight update and therefore controls how quickly or slowly the neural network converges.

Briefly, momentum involves adding a weighted average of past gradients in the gradient descent updates.

It tends to dampen noise, especially in areas of high curvature of the error function.

Momentum can therefore help the network avoid becoming trapped in local minima.
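
The momentum update can be sketched on a toy quadratic loss. This is an illustration of the general technique, not deepnet's internal code, which may differ in detail. The function gd_momentum, the loss and the learning rate of 0.03 are all assumptions chosen for the example; only the momentum value 0.74 is taken from the model above.

```r
# Toy loss f(w) = 0.5 * (a[1] * w[1]^2 + a[2] * w[2]^2) with its minimum at (0, 0)
a <- c(1, 25)  # very different curvature in the two directions
grad <- function(w) a * w

gd_momentum <- function(w, learningrate, momentum, steps) {
  velocity <- c(0, 0)
  for (i in seq_len(steps)) {
    # The velocity is a decaying sum of past (scaled) gradients
    velocity <- momentum * velocity - learningrate * grad(w)
    w <- w + velocity
  }
  w
}

w0 <- c(5, 5)
w_plain    <- gd_momentum(w0, learningrate = 0.03, momentum = 0,    steps = 100)
w_momentum <- gd_momentum(w0, learningrate = 0.03, momentum = 0.74, steps = 100)

# Distance to the minimum after 100 steps, without and with momentum
dist_plain    <- sqrt(sum(w_plain^2))
dist_momentum <- sqrt(sum(w_momentum^2))
c(plain = dist_plain, momentum = dist_momentum)
```

In this toy setting, the run with momentum ends much closer to the minimum, because the accumulated velocity keeps the update moving along the flat direction where plain gradient descent takes very small steps.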

All three parameters are generally set by trial and error. We choose values of 0.58, 0.74 and 1 for the learning rate, momentum and learning rate scale respectively.

The next two lines specify the activation function for the hidden and output neurons. For the hidden neurons we use a logistic function ("sigm"); other options include "linear" or "tanh".

For the output neuron we use a linear activation function; other options include "sigm" and "softmax".

The model is trained for 970 epochs with a batch size of 60. No dropout is applied to the input layer (visible_dropout = 0) or to the hidden layers (hidden_dropout = 0).

In [None]:
fitB

### Predictive power

In [None]:
Xtest <- as.matrix(data_test[,!"medv"]) # matrix input, consistent with the training data X

In [None]:
predB <- deepnet::nn.predict(fitB, Xtest)

Squared correlation coefficient:

In [None]:
cor(predB, data_test$medv)^2

Mean squared error:

In [None]:
Metrics::mse(predB, data_test$medv)

Root mean squared error:

In [None]:
Metrics::rmse(predB, data_test$medv)

The squared correlation is higher and the MSE and RMSE are much lower for the second model with its tuned parameters.

## lm versus nn

Adapted from Ciaburro and Venkateswaran (2017), Chapter 2

How would a linear regression model on the same data perform?

In [None]:
Regression_Model <- lm(medv ~ ., data = data_train)

In [None]:
predict_lm <- predict(Regression_Model, data_test)

In [None]:
Metrics::mse(predict_lm, data_test$medv)

The MSE is much higher for lm than for both neural network models we trained above.
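
To compare the linear baseline on the same three metrics used for the networks, the following self-contained sketch rebuilds the data with base R and reports the squared correlation, MSE and RMSE. It assumes standardization via scale() and a 400-row training sample. The split indices may not match the ones drawn in the notebook, so the numbers are indicative only. All object names are illustrative.

```r
library(MASS)  # provides the Boston data frame

keep <- c("crim", "indus", "nox", "rm", "age", "dis",
          "tax", "ptratio", "lstat", "medv")
boston_z <- as.data.frame(scale(Boston[, keep]))

# 400 training rows, remaining 106 rows for testing
set.seed(2016)
train_idx <- sample(nrow(boston_z), 400)
train_df  <- boston_z[train_idx, ]
test_df   <- boston_z[-train_idx, ]

# Linear regression baseline on all predictors
lm_fit  <- lm(medv ~ ., data = train_df)
lm_pred <- predict(lm_fit, newdata = test_df)

# Same metrics as reported for the neural networks
lm_metrics <- c(r_squared = cor(lm_pred, test_df$medv)^2,
                mse       = mean((lm_pred - test_df$medv)^2),
                rmse      = sqrt(mean((lm_pred - test_df$medv)^2)))
lm_metrics
```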

## Neural Network packages on CRAN

We can get information on all packages:

In [None]:
pdb <- tools::CRAN_package_db() # exported function, so the triple colon is not needed

In [None]:
pdb_dt <- as.data.table(pdb)

Filter for neural networks and sort by last update date:

In [None]:
pdb_dt[grepl("(N|n)eural.*(N|n)et", Description), .(Package, Published, Description)][order(-Published)]
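
The filter relies on a regular expression that matches "neural" followed later by "net", with either letter case at the start of each word. A small sketch of how the pattern behaves, using a toy stand-in for the package database (the rows and package names are hypothetical):

```r
# Toy stand-in for the CRAN package database (hypothetical rows)
toy_db <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC"),
  Description = c("Training of neural networks.",
                  "Tools for Neural Network visualisation.",
                  "Linear mixed models."),
  stringsAsFactors = FALSE
)

# Same pattern as in the notebook
pattern <- "(N|n)eural.*(N|n)et"
hits <- grepl(pattern, toy_db$Description)

# Packages whose description mentions neural net(work)s
toy_db$Package[hits]
```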

The most popular, recent and actively developed packages for neural networks on CRAN are:

- keras
- tensorflow
- h2o
- RSNNS


The last release dates of the packages that we (and popular recent books on ML with R) used for modelling are:

In [None]:
pdb_dt[Package %in% c("neuralnet", "nnet", "deepnet"), .(Package, Published, Description)][order(-Published)]

These release dates suggest that they are no longer actively developed.