# 1 Data Import and Manipulation

We first import a dataset from our lab's GitHub repo. This is the dataset on housing prices and air pollution from [Harrison & Rubinfeld (1978)](https://www.sciencedirect.com/science/article/pii/0095069678900062). The dataset is also used throughout Wooldridge's undergraduate econometrics textbook, *Introductory Econometrics: A Modern Approach*. (There is an R package, [wooldridge](https://justinmshea.github.io/wooldridge/index.html), that collects all the datasets used in that book.)
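As an alternative to downloading the CSV, the data can probably be loaded straight from the wooldridge package. The sketch below assumes that the package is installed and that its `hprice2` data set corresponds to the CSV used in this notebook (the data dictionary further down points to `hprice2.des`, but we have not checked that the two files are identical).

```r
# Assumption: the wooldridge package's `hprice2` matches the lab's CSV.
if (requireNamespace("wooldridge", quietly = TRUE)) {
  # load the data set into the workspace as `hprice2`
  data("hprice2", package = "wooldridge")
  str(hprice2)
} else {
  message("Package 'wooldridge' is not installed; try install.packages('wooldridge').")
}
```

The rest of the notebook keeps using the CSV from the lab repo, so this cell is optional.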

After briefly inspecting the data, we prepare the dataset for modeling. We then conduct some linear regression analysis.

## 1.1 Data Import

In [None]:
# load data
data_url <- "https://github.com/tdmdal/datasets-teaching/raw/main/hprice/hprice.csv"
hprice <- read.csv(data_url)

## 1.2 Quick Inspection

Let's quickly inspect the data. The exploration done here is by no means complete or thorough.

In [None]:
# take a look at the structure of the data
str(hprice)

Data Dictionary ([Source](http://fmwww.bc.edu/ec-p/data/wooldridge/hprice2.des))

| Variable    | Description                         |
|-------------|-------------------------------------|
| 1. price    | median housing price, \$            |
| 2. crime    | crimes committed per capita         |
| 3. nox      | nitrous oxide, parts per 100 mill.  |
| 4. rooms    | avg number of rooms per house       |
| 5. dist     | weighted dist. to 5 employ centers  |
| 6. radial   | accessibility index to radial hghwys |
| 7. proptax  | property tax per \$1000             |
| 8. stratio  | average student-teacher ratio       |
| 9. lowstat  | % of people 'lower status'          |



In [None]:
# print the first few rows of the dataset
head(hprice)

In [None]:
# summary statistics
summary(hprice)

Let's focus on `price`, `nox`, `rooms` and `stratio` for this analysis.

In [None]:
# pairwise scatter plot
pairs(hprice[c("price", "nox", "rooms", "stratio")])

In [None]:
# correlation matrix
cor(hprice[c("price", "nox", "rooms", "stratio")])

In [None]:
# histogram and boxplot for price and log price
par(mfrow=c(2,2))
hist(hprice$price, main = "Histogram of price")
hist(log(hprice$price), main = "Histogram of log price")
boxplot(hprice$price)
boxplot(log(hprice$price))

In [None]:
# histogram and boxplot for nox and log nox
par(mfrow=c(2,2))
hist(hprice$nox)
hist(log(hprice$nox))
boxplot(hprice$nox)
boxplot(log(hprice$nox))

## 1.3 Data Manipulation (Preparation for Modeling)

In [None]:
# get rid of price outliers (keep only prices strictly between the 5th and 95th percentiles)
hprice_reg <- hprice[which(hprice$price < quantile(hprice$price, 0.95) & hprice$price > quantile(hprice$price, 0.05)), , drop = FALSE]
str(hprice_reg)

In [None]:
# create log price and log nox
hprice_reg["lprice"] <- log(hprice_reg["price"])
hprice_reg["lnox"] <- log(hprice_reg["nox"])

# 2 Modelling

We will start by running a simple regression to investigate the effect of air pollution on housing price.

$\log(price) = \beta_0 + \beta_1 \log(nox) + u$.
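Because both sides of this equation are in logs, $\beta_1$ can be read as an elasticity. A short derivation, holding $u$ fixed:

$$
\frac{\partial \log(price)}{\partial \log(nox)} = \beta_1
\quad\Longrightarrow\quad
\%\Delta price \approx \beta_1 \cdot \%\Delta nox
$$

For example, if the estimate were $\hat{\beta}_1 = -1$ (a made-up value, used only for illustration), a 1% increase in nox would go with roughly a 1% lower price.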

In [None]:
# setup a simple regression model
lr <- lm(formula = lprice ~ lnox, data = hprice_reg)

Let's run a multiple regression to investigate the effect of air pollution on housing price, but this time we control for rooms (and rooms squared) and student-teacher ratio.

$\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + \beta_3 rooms^2 + \beta_4 stratio + u$.

In [None]:
lr_multiple <- lm(lprice ~ lnox + rooms + I(rooms^2) + stratio, data = hprice_reg)
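With a squared term in the model, the effect of one more room is not constant: it equals the coefficient on rooms plus two times the coefficient on rooms squared times the number of rooms. The effect changes sign where this expression is zero. The sketch below computes that turning point. `rooms_turning_point()` is a hypothetical helper written for this note, and the small synthetic data set is made up so that the true turning point is 6.

```r
# Hypothetical helper (not part of the original notebook):
# turning point of a quadratic in `x`, i.e. -b_linear / (2 * b_quadratic)
rooms_turning_point <- function(fit, x = "rooms") {
  b <- coef(fit)
  b_lin  <- b[[x]]                        # coefficient on x
  b_quad <- b[[paste0("I(", x, "^2)")]]   # coefficient on I(x^2)
  -b_lin / (2 * b_quad)
}

# Synthetic check with made-up data: y = 2 - 0.6 * rooms + 0.05 * rooms^2
demo <- data.frame(rooms = seq(3, 9, by = 0.5))
demo$y <- 2 - 0.6 * demo$rooms + 0.05 * demo$rooms^2
demo_fit <- lm(y ~ rooms + I(rooms^2), data = demo)
print(rooms_turning_point(demo_fit))

# Applied to the model in this notebook, if it has been estimated
if (exists("lr_multiple")) print(rooms_turning_point(lr_multiple))
```

If the turning point lies outside the range of `rooms` seen in the data, the effect of rooms has the same sign for every house in the sample.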

# 3 Report & Graph

We report the regression results and plot a few graphs.

## 3.1 Report

In [None]:
# report the simple regression result
summary(lr)

In [None]:
# report the multiple regression result
summary(lr_multiple)

## 3.2 Graphs

In [None]:
# plot data and regression line for the simple regression
par(mfrow = c(1, 1))
plot(hprice_reg[c("lnox", "lprice")])
abline(coef(lr))

Plot a few diagnostic plots. See [here](https://data.library.virginia.edu/diagnostic-plots/) for what they are for.

In [None]:
# plot a few post regression Diagnostic Plots for the simple regression
par(mfrow = c(2, 2))
plot(lr)

In [None]:
# plot a few post regression Diagnostic Plots for the multiple regression
par(mfrow = c(2, 2))
plot(lr_multiple)

# 4 A Note on Predictive Analysis

We have so far seen a typical workflow for a causal regression analysis. A causal analysis investigates causal relationships between variables (e.g., whether x causes y while controlling for z).

On the other hand, a predictive analysis is mainly concerned with whether a model gives good predictions. It is therefore important to evaluate how an estimated model may perform in the real world, on data it has not seen. To obtain an unbiased evaluation of model performance, the dataset is often split into two subsets before the model is estimated: a training set and a test set (sometimes loosely called a validation set). The training set is used to estimate the model parameters (i.e., train the model), and the test set is used to evaluate the estimated model.

Below I will show you how to

1. randomly split the data into a training set and a test set.
2. train/estimate a linear regression model on the training set.
3. evaluate the estimated model on the test set, i.e., predict on the test set and obtain the evaluation measures of interest.

In [None]:
# set a random seed so you can reproduce the result
set.seed(123)

# proportion of data for training
prop_train <- 0.8

# total size of the raw data
size_total <- nrow(hprice)

# size of training data
size_train <- as.integer(prop_train * size_total)

# training and test data split
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html
train_idx <- sample(1:size_total, size = size_train)
hprice_train <- hprice[train_idx,]
hprice_test <- hprice[-train_idx,]

# prepare the training data
hprice_train["lprice"] <- log(hprice_train["price"])
hprice_train["lnox"] <- log(hprice_train["nox"])

# train/estimate a regression model
lr_train <- lm(lprice ~ lnox + rooms + I(rooms^2) + stratio, data = hprice_train)
summary(lr_train)

Let's take a look at the training MSE.

In [None]:
# predict lprice using the estimated model on training data
lprice_pred_train <- predict(lr_train, hprice_train)

# calculate MSE on training
MSE_train <- mean((lprice_pred_train - hprice_train$lprice)^2)
MSE_train

In [None]:
# verify the training MSE calculated above using the residuals already produced by lm()
mean(lr_train$residuals^2)

Now, let's see how our estimated model may perform in the real world. That is, evaluate the model on test data.

In [None]:
# prepare the test data
hprice_test["lprice"] <- log(hprice_test["price"])
hprice_test["lnox"] <- log(hprice_test["nox"])

# predict lprice using the estimated model on test data
lprice_pred_test <- predict(lr_train, hprice_test)

# calculate MSE on test
MSE_test <- mean((lprice_pred_test - hprice_test$lprice)^2)
MSE_test

Above, we obtained the MSE of the predicted `lprice` (i.e., $\log$ of price) on the test data. Let's see what the MSE is for the price prediction itself. Let's also calculate the RMSE (Root MSE) and MAE (Mean Absolute Error) of the price prediction, and see how they compare with the mean price (i.e., the mean of the median housing prices) in the test data.
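For reference, the three measures can be written as follows. The notation is ours, not from the notebook: $y_i$ is the observed price of test observation $i$, $\hat{y}_i$ is its predicted price (the exponential of the predicted `lprice`), and $n$ is the number of test observations.

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2,
\qquad
\text{RMSE} = \sqrt{\text{MSE}},
\qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left|\hat{y}_i - y_i\right|
$$

RMSE and MAE are in the same unit as price (dollars), which is why they, and not the MSE, are compared with the mean price.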

In [None]:
MSE_test_price <- mean((exp(lprice_pred_test) - hprice_test$price)^2)
MAE_test_price <- mean(abs(exp(lprice_pred_test) - hprice_test$price))
mean_price <- mean(hprice_test$price)
cat("MSE:", MSE_test_price, "\n")
cat("RMSE:", sqrt(MSE_test_price), "\n")
cat("MAE:", MAE_test_price, "($)\n")
cat("Mean Price:", mean_price, "($)\n")

cat("RMSE / Mean Price:", sqrt(MSE_test_price) / mean_price, "\n")
cat("MAE / Mean Price:", MAE_test_price / mean_price, "\n")

Relative to the mean of the median housing prices in the test data, the model's RMSE is about 25% and its MAE is about 17%. The model certainly has room to improve.
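One possible refinement, not used above: taking the exponential of a predicted log price tends to underpredict the expected price, because the average of `exp(u)` is larger than `exp` of the average of `u`. A common correction is Duan's smearing estimate, which multiplies the exponentiated prediction by the average of the exponentiated training residuals. The sketch below uses two hypothetical helpers, `smearing_factor()` and `predict_level()`, and a made-up synthetic data set. Whether the adjustment lowers the test error for the housing data is not something we have checked.

```r
# Hypothetical helpers (not part of the original notebook).

# Smearing factor: average of exp(residual) over the training sample
smearing_factor <- function(fit) mean(exp(residuals(fit)))

# Predict the level of the outcome from a model fitted to its log
predict_level <- function(fit, newdata) {
  exp(predict(fit, newdata)) * smearing_factor(fit)
}

# Synthetic illustration with made-up data (not the housing data)
set.seed(1)
demo <- data.frame(x = runif(200))
demo$ly <- 1 + 2 * demo$x + rnorm(200, sd = 0.5)
demo_fit <- lm(ly ~ x, data = demo)
print(smearing_factor(demo_fit))  # never below 1 when the model has an intercept

# Applied to this notebook's objects, if they exist
if (exists("lr_train") && exists("hprice_test")) {
  price_pred_adj <- predict_level(lr_train, hprice_test)
  cat("RMSE (smearing-adjusted):",
      sqrt(mean((price_pred_adj - hprice_test$price)^2)), "\n")
}
```

The factor is at least 1 because OLS residuals average to zero when the model includes an intercept, so the adjustment can only scale the predictions up.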