# 1 Data Import and Manipulation

We first import a dataset from our lab's GitHub repo. The dataset covers housing prices and air pollution and comes from [Harrison & Rubinfeld (1978)](https://www.sciencedirect.com/science/article/pii/0095069678900062). It is also used throughout Wooldridge's undergraduate econometrics textbook, *Introductory Econometrics: A Modern Approach*. (There is an R package ([wooldridge](https://justinmshea.github.io/wooldridge/index.html)) that collects all the datasets used in that book.)

After briefly inspecting the data, we prepare the dataset for modeling. We then conduct some linear regression analysis.

We will use the [Tidyverse](https://www.tidyverse.org/) way of handling the data, so let's first load the library. The Tidyverse consists of a set of libraries for data science (mostly for data manipulation).

In [None]:
library(tidyverse)

We will also need a few other libraries for plotting (some of them ggplot2 extensions) and creating tables.

In [None]:
# install a few ggplot related packages if they are not already installed
if (!require(GGally)) install.packages("GGally")
if (!require(gridExtra)) install.packages("gridExtra")
if (!require(ggthemes)) install.packages("ggthemes")
if (!require(ggfortify)) install.packages("ggfortify")

# install a table creation package if it's not already installed
if (!require(huxtable)) install.packages("huxtable")

# load them
library(GGally)
library(gridExtra)
library(ggthemes)
library(ggfortify)

library(huxtable)

## 1.1 Data Import

In [None]:
# load data
data_url <- "https://github.com/tdmdal/datasets-teaching/raw/main/hprice/hprice.csv"

# your code here


`hprice` is a tibble, which is basically R’s traditional `data.frame` with a few extra things. See [here](https://r4ds.had.co.nz/tibbles.html) for details.
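
As a minimal sketch of the import step, the example below assumes that `readr::read_csv()` (loaded with the Tidyverse) is the intended reader and that the file has a header row. To stay self-contained it reads a tiny made-up CSV from a temporary file; `toy_path` and `toy` are hypothetical names, and the numbers are not real data.

```r
library(readr)   # part of the tidyverse
library(tibble)

# Hypothetical stand-in for the remote file: a tiny CSV with made-up values.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("price,nox,rooms,stratio",
             "20000,5.0,6.0,15.0",
             "30000,4.5,7.0,18.0"),
           toy_path)

# read_csv() returns a tibble; it also accepts a URL in place of a file path.
toy <- read_csv(toy_path, show_col_types = FALSE)

is_tibble(toy)   # TRUE
dim(toy)         # 2 rows, 4 columns
```

The same call with `data_url` in place of `toy_path` should yield the `hprice` tibble used below, provided the file is reachable.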

## 1.2 Quick Inspection

Let's quickly inspect the data. The exploration here is by no means complete or thorough.

In [None]:
# take a look at the structure of the data
glimpse(hprice)

Data Dictionary ([Source](http://fmwww.bc.edu/ec-p/data/wooldridge/hprice2.des))

| Variable    | Description                          |
|-------------|--------------------------------------|
| 1. price    | median housing price, \$             |
| 2. crime    | crimes committed per capita          |
| 3. nox      | nitrous oxide, parts per 100 mill.   |
| 4. rooms    | avg number of rooms per house        |
| 5. dist     | weighted dist. to 5 employ centers   |
| 6. radial   | accessibility index to radial hghwys |
| 7. proptax  | property tax per \$1000              |
| 8. stratio  | average student-teacher ratio        |
| 9. lowstat  | % of people 'lower status'           |



In [None]:
# str can still be useful
str(hprice)

In [None]:
# print the first few rows of the dataset
head(hprice)

In [None]:
# summary statistics
summary(hprice)

Let's focus on `price`, `nox`, `rooms` and `stratio` for this analysis.

In [None]:
# pairwise scatter plot and correlation; density plot of each variable
# http://ggobi.github.io/ggally/articles/ggpairs.html
ggpairs(hprice[c("price", "nox", "rooms", "stratio")])

# you could try a different theme from ggthemes
# ggpairs(hprice[c("price", "nox", "rooms", "stratio")]) + theme_wsj()
# ggpairs(hprice[c("price", "nox", "rooms", "stratio")]) + theme_economist()

In [None]:
# histogram and boxplot for price and log price
p1 <- ggplot(hprice, aes(x=price)) + geom_histogram()
p2 <- ggplot(hprice, aes(x=log(price))) + geom_histogram()
p3 <- ggplot(hprice, aes(x=price)) + geom_boxplot()
p4 <- ggplot(hprice, aes(x=log(price))) + geom_boxplot()
grid.arrange(p1, p2, p3, p4, ncol=2)

In [None]:
# histogram and boxplot for nox and log nox
# try a different theme
p1 <- ggplot(hprice, aes(x=nox)) + geom_histogram() + theme_fivethirtyeight()
p2 <- ggplot(hprice, aes(x=log(nox))) + geom_histogram() + theme_fivethirtyeight()
p3 <- ggplot(hprice, aes(x=nox)) + geom_boxplot() + theme_fivethirtyeight()
p4 <- ggplot(hprice, aes(x=log(nox))) + geom_boxplot() + theme_fivethirtyeight()
grid.arrange(p1, p2, p3, p4, ncol=2)

## 1.3 Data Manipulation (Preparation for Modeling)

We manipulate the data frame (tibble) the Tidyverse way. See [here](https://dplyr.tidyverse.org/articles/base.html) for a comparison between the Tidyverse's dplyr approach and the base R approach.
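
The trimming and log-transform steps can be sketched with dplyr's `filter()` and `mutate()`. This is one possible approach on simulated data, not necessarily the intended solution; `toy` and `toy_reg` are hypothetical names, while `lprice` and `lnox` follow the column names used in the plotting code later on.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in for hprice (made-up values, not the real data).
toy <- tibble(price = round(exp(rnorm(200, mean = 10, sd = 0.4))),
              nox   = runif(200, min = 4, max = 9))

toy_reg <- toy %>%
  # keep rows whose price lies between the 5th and 95th percentiles
  filter(price >= quantile(price, 0.05),
         price <= quantile(price, 0.95)) %>%
  # add log-transformed columns
  mutate(lprice = log(price),
         lnox   = log(nox))

nrow(toy)       # 200
nrow(toy_reg)   # fewer rows after trimming both tails
```

Both quantiles are computed on the untrimmed `price` column because the two conditions sit in the same `filter()` call.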

In [None]:
# get rid of price outliers (outside 5th to 95th percentile)
# create log price and log nox
# your code here


# 2 Modelling

We will start by running a simple regression to investigate the effect of air pollution on housing price.

$\log(price) = \beta_0 + \beta_1 \log(nox) + u$.
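
Because both sides are in logs, $\beta_1$ is the elasticity of price with respect to nox: a 1% increase in nox is associated with a $\beta_1$% change in price. The sketch below fits this kind of model with base R's `lm()` on simulated data whose true elasticity is set to -1. The names `toy_reg` and `lr_toy` are hypothetical; the plotting code later in this notebook expects the fitted model to be called `lr`.

```r
set.seed(42)
# Simulated data (not the real hprice data); true slope is -1.
n <- 300
toy_reg <- data.frame(lnox = log(runif(n, min = 4, max = 9)))
toy_reg$lprice <- 11 - 1 * toy_reg$lnox + rnorm(n, sd = 0.3)

# Regress log price on log nox.
lr_toy <- lm(lprice ~ lnox, data = toy_reg)

coef(lr_toy)      # intercept and slope; the slope estimates beta_1
summary(lr_toy)   # standard errors, t statistics, R-squared
```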

In [None]:
# setup and run a simple regression model
# your code here


Let's run a multiple regression to investigate the effect of air pollution on housing price, but this time we control for rooms (and rooms squared) and student-teacher ratio.

$\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + \beta_3 rooms^2 + \beta_4 stratio + u$.
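
In an R formula the squared term needs to be wrapped in `I()`, because `^` otherwise has a formula meaning (crossing terms) and `rooms^2` would collapse to plain `rooms`. A minimal sketch on simulated data follows; `toy_reg` and `lr_multiple_toy` are hypothetical names (the notebook later expects `lr_multiple`), and the coefficients used to generate the data are made up.

```r
set.seed(7)
# Simulated data (not the real hprice data).
n <- 300
toy_reg <- data.frame(lnox    = log(runif(n, min = 4, max = 9)),
                      rooms   = runif(n, min = 4, max = 8),
                      stratio = runif(n, min = 12, max = 22))
toy_reg$lprice <- with(toy_reg,
                       12 - 0.8 * lnox - 0.5 * rooms + 0.06 * rooms^2 -
                         0.04 * stratio) + rnorm(n, sd = 0.2)

# I(rooms^2) makes R treat ^ as arithmetic squaring inside the formula.
lr_multiple_toy <- lm(lprice ~ lnox + rooms + I(rooms^2) + stratio,
                      data = toy_reg)

names(coef(lr_multiple_toy))
# "(Intercept)" "lnox" "rooms" "I(rooms^2)" "stratio"
```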

In [None]:
# setup and run a multiple regression model
# your code here


# 3 Report & Graph

We report the regression results and plot a few graphs.

## 3.1 Report
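
One way to put several models side by side is `huxreg()` from the huxtable package loaded earlier. The sketch below is an assumption about the intended approach, not a confirmed solution; it uses the built-in `mtcars` data purely for illustration, the model names `m1` and `m2` are hypothetical, and `huxreg()` relies on the broom package being installed.

```r
library(huxtable)

# Two small models on a built-in dataset, purely for illustration.
m1 <- lm(log(mpg) ~ log(wt), data = mtcars)
m2 <- lm(log(mpg) ~ log(wt) + hp + I(hp^2), data = mtcars)

# One column per model; named arguments become the column headers.
tbl <- huxreg("Simple" = m1, "Multiple" = m2)
tbl
```

With the models from section 2, the analogous call would pass `lr` and `lr_multiple` instead of `m1` and `m2`.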

In [None]:
# report results from two regression models in a single summary table
# your code here


## 3.2 Graphs

In [None]:
# plot data and regression line for the simple regression
ggplot(hprice_reg, aes(lnox, lprice)) +
  geom_point() +
  geom_smooth(method = "lm")

Plot a few diagnostic plots. See [here](https://data.library.virginia.edu/diagnostic-plots/) for what they are for.

In [None]:
# plot a few post regression Diagnostic Plots for the simple regression
autoplot(lr)

In [None]:
# plot a few post regression Diagnostic Plots for the multiple regression
# try a different theme
autoplot(lr_multiple) + theme_economist()