**Group Members:** Tony Liang (004), Wanxin Luo (003), Xuan Chen (004)

**Student Numbers:** 39356993, 33432808, 15734643


ECON 323 Quantitative Economic Modelling with Data Science Applications UBC 2023

# Boston Housing Price Prediction Proposal

In [23]:
# Imports of libraries
import pandas as pd
import numpy as np
# Preprocessing and Feature engineering
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate
# Models makeup

# linear models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
# tree-like models
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

## Introduction
In this project, we aim to **explore the impact of environmental factors on housing prices** using the [Boston Housing dataset](https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data). This dataset contains information on attributes such as crime rate, average number of rooms, and accessibility to highways, which are hypothesized to influence housing prices. The project has several parts: data cleaning, visualization, and model building. Our objective is to conduct exploratory data analysis (EDA) and then build a hedonic regression model with multiple inputs. We will use the Python techniques learned in this course to explore real-world data and answer economic questions.

By analyzing the data, we aim to answer economic questions related to the housing market and explore the real-world application of Python techniques. Note that the dataset has limitations, as it was collected almost 50 years ago, but it still gives us a good opportunity to apply our Python skills and gain insights into the housing market.

### Dataset Description

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The dataset columns are:

- `CRIM` - per capita crime rate by town
- `ZN` - proportion of residential land zoned for lots over 25,000 sq.ft.
- `INDUS` - proportion of non-retail business acres per town.
- `CHAS` - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- `NOX` - nitric oxides concentration (parts per 10 million)
- `RM` - average number of rooms per dwelling
- `AGE` - proportion of owner-occupied units built prior to 1940
- `DIS` - weighted distances to five Boston employment centres
- `RAD` - index of accessibility to radial highways
- `TAX` - full-value property-tax rate per \$10,000
- `PTRATIO` - pupil-teacher ratio by town
- `B` - $1000(B_k - 0.63)^2$, where $B_k$ is the proportion of Black residents by town
- `LSTAT` - % lower status of the population
- `MEDV` - median value of owner-occupied homes in \$1000s

The dataset is derived from https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data.

## Methods

This report strives to be trustworthy using the following steps: 

1. [Data cleaning](#data-cleaning)
2. [Thorough EDA](#eda)
3. [Building a multiple linear regression model](#model-fitting)

**Note**: this plan is subject to change after feedback from the ECON 323 instructor team.


### Data Cleaning

For the data cleaning step, we will check for and handle missing values in the dataset. We will also identify which variables are categorical and which are continuous. For instance, the `CHAS` variable is a dummy variable, encoded as 0 or 1, indicating whether the tract bounds the Charles River. In addition, we will apply any special treatment of outliers required by the method chosen in the [model fitting phase](#model-fitting).
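The cleaning checks described above could be sketched roughly as follows. The small DataFrame is invented purely for illustration (it is not the Boston data), and the rule "a column whose only values are 0 and 1 is a dummy variable" is an assumption of this sketch rather than a settled part of our method.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file (values are invented for illustration)
toy = pd.DataFrame({
    "CRIM": [0.02, 0.05, np.nan, 0.30],
    "CHAS": [0, 1, 0, 0],
    "RM": [6.5, 6.1, 7.0, 5.9],
    "MEDV": [24.0, 21.6, 34.7, 18.2],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Drop rows with any missing value and renumber the rows
clean = toy.dropna().reset_index(drop=True)

# Assumption: columns containing only 0 and 1 are dummy (categorical) variables
dummy_cols = [c for c in clean.columns if set(clean[c].unique()) <= {0, 1}]
continuous_cols = [c for c in clean.columns if c not in dummy_cols]

print(missing_counts)
print("dummy:", dummy_cols)
print("continuous:", continuous_cols)
```

Dropping incomplete rows is the simplest option; imputing missing values instead would keep more observations, at the cost of an extra modelling choice.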

### EDA

During the EDA phase, we will conduct a thorough examination of the Boston Housing dataset. One key step is to generate a correlation matrix, which helps us identify potential multicollinearity between the independent variables. In addition, we will use side-by-side box plots to visualize the distributions of the continuous variables and detect potential outliers or anomalies. We will also use other visualization techniques, such as scatter plots and histograms, to better understand the relationships between the variables and explore trends or patterns in the data. Overall, the goal of EDA is to gain insights into the data and inform our subsequent modelling steps.
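As a rough sketch of the numerical side of these checks, the snippet below computes a correlation matrix and flags outliers with the 1.5 × IQR rule, which is the same rule box-plot whiskers conventionally use. The data are simulated, the column names only mimic the real ones, and the 0.8 correlation threshold is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# Simulated data; LSTAT is constructed to be strongly correlated with RM
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, n)
lstat = 30 - 3.5 * rm + rng.normal(0, 1.0, n)
medv = 5 * rm - 0.5 * lstat + rng.normal(0, 2.0, n)
toy = pd.DataFrame({"RM": rm, "LSTAT": lstat, "MEDV": medv})
toy.loc[0, "MEDV"] = 500.0  # plant one obvious outlier

# Pairwise Pearson correlations between all columns
corr = toy.corr()

# Flag predictor pairs whose absolute correlation exceeds the (assumed) threshold
threshold = 0.8
predictors = ["RM", "LSTAT"]
high_pairs = [
    (a, b)
    for i, a in enumerate(predictors)
    for b in predictors[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

outliers = iqr_outliers(toy["MEDV"])
print(corr.round(2))
print("highly correlated predictors:", high_pairs)
print("outlier rows in MEDV:", list(outliers.index))
```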

### Model Fitting

In the model fitting phase, we will split the Boston Housing dataset into training and testing sets. We will then use the training set to select the relevant variables and build our final multiple linear regression model. The selection process can involve various techniques, such as stepwise regression or regularization, depending on the specific requirements of the project. Once we have the final model, we will use the testing set to evaluate its performance in terms of mean squared error (`MSE`). The goal is to ensure that the model can generalize well to new, unseen data and make accurate predictions. 
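A minimal sketch of this hold-out evaluation, using simulated data and plain NumPy so that each step is visible. The 75/25 split, the coefficients, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

# Simulated regression data: y = 10 + X @ true_beta + noise
rng = np.random.default_rng(323)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 10.0 + X @ true_beta + rng.normal(scale=1.0, size=n)

# Shuffle the row indices and keep 75% for training, 25% for testing
idx = rng.permutation(n)
n_train = int(0.75 * n)
train_idx, test_idx = idx[:n_train], idx[n_train:]

def add_intercept(A):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(A)), A])

# Fit on the training rows only (least-squares solution)
beta_hat, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

# Evaluate on rows the model has never seen
y_pred = add_intercept(X[test_idx]) @ beta_hat
test_mse = np.mean((y[test_idx] - y_pred) ** 2)
print(f"test MSE: {test_mse:.3f}")
```

Because the test rows play no part in fitting, the test MSE estimates how the model would perform on new data.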

To further explore the effects of using different regression methods, we will fit multiple models and compare them using common metrics: the Bayesian Information Criterion (BIC) and adjusted $R^2$ for inference (how well the model explains the effect of the explanatory variables on the variable of interest), and Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) for prediction. Our planned approach is to fit the following models:
1. Ordinary Least Squares (OLS) as baseline 
2. OLS with L2-norm regularization (Ridge Regression)
3. OLS with L1-norm regularization (Least Absolute Shrinkage and Selection Operator, or LASSO Regression)
4. Decision Tree Regression
5. Random Forest Regression
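
The comparison above might be organized along these lines. This is only a sketch on synthetic data from `make_regression`: the hyperparameters (`alpha`, `n_estimators`), the 5-fold cross-validation, and the helper names `adjusted_r2` and `gaussian_bic` are assumptions for illustration. The BIC shown is one common form for Gaussian linear models, defined up to an additive constant.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic, truly linear data (not the Boston data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "LASSO": Lasso(alpha=0.1),
    "Decision tree": DecisionTreeRegressor(random_state=0),
    "Random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Prediction metrics: scikit-learn reports negated errors, so flip the sign
scoring = {"rmse": "neg_root_mean_squared_error", "mae": "neg_mean_absolute_error"}
results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    results[name] = {"RMSE": -cv["test_rmse"].mean(), "MAE": -cv["test_mae"].mean()}

def adjusted_r2(y_true, y_pred, n_predictors):
    """R^2 penalized for the number of predictors."""
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - rss / tss
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

def gaussian_bic(y_true, y_pred, n_params):
    """BIC for a Gaussian linear model, up to an additive constant."""
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

# Inference metrics, computed in-sample for the OLS baseline
ols = LinearRegression().fit(X, y)
fitted = ols.predict(X)
adj_r2 = adjusted_r2(y, fitted, X.shape[1])
bic = gaussian_bic(y, fitted, X.shape[1] + 1)  # slopes plus intercept

for name, scores in results.items():
    print(f"{name:14s} RMSE={scores['RMSE']:.2f}  MAE={scores['MAE']:.2f}")
print(f"OLS adjusted R^2={adj_r2:.3f}, BIC={bic:.1f}")
```

Since the synthetic data are linear by construction, OLS should beat the tree-based models here; the real data need not behave this way.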

#### Load and Split data

To begin, we define a common function that loads the data and splits it into training and test sets for the modelling steps that follow.

In [2]:
# function to load and split data

# loads data from path, and specifies proportion of train data, with 1 - proportion of test data
# and target is the variable of interest (your y), then returns train and test data
# proportion is default to 0.5, and target default to None 
def split_data(data_path, proportion=0.5, target=None, random_state=123):
    """
    Loads data from path, and specifies proportion of train data, with 1 - proportion of test data
    and target is the variable of interest (your y), then returns train and test data
    proportion is default to 0.5, and target default to None.
    
    Optional argument:
    random_state = 123 (default), change to other number of your choice to assert reproducibility
    """
    # load the data
    data = pd.read_csv(data_path)
    # drop nas
    data = data.dropna()
    # inner function to split data into train and test portion
    def train_test_split(data, proportion):
        train = data.sample(frac = proportion, random_state=random_state)
        test = data.drop(train.index)
        # reset the index of both
        train = train.reset_index(drop=True)
        test = test.reset_index(drop=True)
        # asserting dimension matches (i.e. number of rows)
        assert train.shape[0] + test.shape[0] == data.shape[0]
        return train, test
    # split the data into train and test
    train, test = train_test_split(data, proportion)
    # further split train data to X and y
    def split_X_y(data):
        X = data.drop(columns=[target])
        y = data[target]
        return X, y
    X_train, y_train = split_X_y(train)
    # split test data to X and y
    X_test, y_test = split_X_y(test)
    # check dimension again
    assert X_train.shape[0] + X_test.shape[0] == data.shape[0]
    assert X_train.shape[1] == 13 and X_test.shape[1] == 13
    assert y_train.shape[0] == X_train.shape[0] and y_test.shape[0] == X_test.shape[0]
    # return the objects needed
    return X_train, X_test, y_train, y_test

In [3]:
# path to find data
path = "data/boston_housing_data.csv"
# Splits the data into X and y train and test portions
X_train, X_test, y_train, y_test = split_data(path, proportion = 0.75, target = "MEDV", random_state=20230325)

In [11]:
def data_preprocess(X_train):
    numeric_transf = make_pipeline
    return 0
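
The `data_preprocess` stub above is still a placeholder. One way it might eventually look is sketched below, using the preprocessing tools already imported at the top of the notebook. The function name `build_preprocessor`, the median imputation strategy, and the decision to pass the dummy variable through unscaled are all assumptions of this sketch, not decisions the team has made.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_preprocessor(numeric_cols, passthrough_cols):
    """Hypothetical helper: impute and scale numeric columns, keep the rest as is."""
    # Assumed strategy: fill missing values with the median, then standardize
    numeric_transf = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    return make_column_transformer(
        (numeric_transf, numeric_cols),
        ("passthrough", passthrough_cols),
    )

# Toy data with one missing value (invented for illustration)
toy = pd.DataFrame({
    "RM": [6.0, 7.0, np.nan, 5.0],
    "TAX": [300.0, 250.0, 400.0, 350.0],
    "CHAS": [0, 1, 0, 0],
})

preprocessor = build_preprocessor(["RM", "TAX"], ["CHAS"])
X_toy = preprocessor.fit_transform(toy)
print(X_toy)
```

In practice the transformer would be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into the scaling.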

#### Ordinary Least Squares (OLS)

First, we fit a plain OLS regression model as our baseline, so that we can compare the other regression methods against it and see what is gained or lost when regularization or tree-based methods with bootstrapping are applied. A generic linear regression model with $p$ explanatory variables is defined as:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i \quad \text{for} \quad i = 1, \dots, n$$

where $y_i$ is the dependent variable (the variable we are trying to explain or predict) for observation $i$, $\beta_j$ for $j = 1, \dots, p$ are the coefficients (weights) of the explanatory/independent variables $x_{ij}$, and $\beta_0$ is the intercept: the expected value of $y$ when all the independent variables equal 0.

The equation above can be written in matrix form:

$$Y = X\beta + \epsilon$$

where $X$ is the $n \times (p + 1)$ design matrix with a leading column of 1s (to represent the intercept term) followed by one column per independent variable $x_1, \dots, x_p$, $\beta$ is the vector of coefficients $\beta_0, \dots, \beta_p$, and $\epsilon$ is the vector of random errors.

Minimizing the sum of squared errors with respect to $\beta$ then gives the OLS estimate, provided $X^{T}X$ is invertible:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$

We can use this closed-form expression to fit our regression.
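
For completeness, here is a short derivation of this closed-form solution, written in the same symbols as above ($\hat{\beta}$ denotes the minimizing value of $\beta$). It assumes the columns of $X$ are linearly independent, so that $X^{T}X$ is invertible. The sum of squared errors is

$$S(\beta) = (Y - X\beta)^{T}(Y - X\beta) = Y^{T}Y - 2\beta^{T}X^{T}Y + \beta^{T}X^{T}X\beta$$

Setting its gradient with respect to $\beta$ to zero gives the normal equations:

$$\frac{\partial S}{\partial \beta} = -2X^{T}Y + 2X^{T}X\beta = 0 \quad \Longrightarrow \quad X^{T}X\hat{\beta} = X^{T}Y$$

Multiplying both sides by $(X^{T}X)^{-1}$ yields

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$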

In [67]:
# implementation of OLS in data

# known B = (XTX)^-1XT Y
def OLS(X_mat, y_mat):
    """
    Converts the inputs to numpy arrays and computes the OLS estimates from
    (X^TX)^-1 X^T y. Returns the intercept and the remaining beta estimates.
    """
    # work on a copy so the caller's DataFrame is not modified in place
    X = X_mat.copy()
    # add intercept column to matrix X (skipped if it is already present)
    if "intercept" not in X.columns:
        X.insert(0, "intercept", 1)
    X = X.to_numpy(dtype=float)
    y = y_mat.to_numpy(dtype=float)
    # solve (X^TX) beta = X^T y, equivalent to applying the inverse explicitly
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta[0], beta[1:]

In [98]:
# get the intercept and beta estimates
# scikit-learn imported function
model = LinearRegression().fit(X_train, y_train)
# self-defined function
intercept, estimates = OLS(X_train, y_train)

In [102]:
# Use this option to NOT show scientific notation (default shows scientific notation)
np.set_printoptions(suppress=True)
# print the estimates from self-defined func
print(f"From self defined OLS, the intercept is {round(intercept, 3)}, \nAnd beta estimates are: \n{np.round(estimates, decimals=3)}")

# prints the estimates from scikit-learn to inspect and compare
print(f"\nFrom scikit-learn function, the intercept is {round(model.intercept_, 3)}, \nAnd beta estimates are: ")
# X_train is no longer modified by OLS, so every coefficient belongs to a real feature
print(np.round(model.coef_, decimals=3))

From self defined OLS, the intercept is 38.811, 
And beta estimates are: 
[ -0.11    0.053   0.06    2.984 -20.39    3.191   0.004  -1.364   0.349
  -0.016  -0.817   0.008  -0.475]

From scikit-learn function, the intercept is 38.811, 
And beta estimates are: 
[ -0.11    0.053   0.06    2.984 -20.39    3.191   0.004  -1.364   0.349
  -0.016  -0.817   0.008  -0.475]


#### Ridge Regression

In [10]:
# Just use scikit-learn's builtin fun first

array([-1.03351936e-01,  5.47244860e-02,  8.02194753e-03,  2.68503265e+00,
       -8.99641629e+00,  3.24834784e+00, -6.04015337e-03, -1.22066869e+00,
        3.03911725e-01, -1.57026734e-02, -7.21610616e-01,  8.18229124e-03,
       -4.81799969e-01])
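
For comparison with the self-defined OLS function, ridge regression also has a closed-form solution, $(X^{T}X + \alpha I)^{-1}X^{T}Y$ on centered data. The sketch below is not the code that produced the array above, and the penalty value used there is not recorded here. The function name `ridge_closed_form` and the simulated data are assumptions for illustration. The intercept is left unpenalized by centering, which to our understanding corresponds to the objective documented for scikit-learn's `Ridge`.

```python
import numpy as np

def ridge_closed_form(X, y, alpha=1.0):
    """Hypothetical helper: ridge estimates with an unpenalized intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center X and y so the intercept drops out of the penalized problem
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) beta = Xc^T yc
    coefs = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    # Recover the intercept from the means
    intercept = y_mean - x_mean @ coefs
    return intercept, coefs

# Simulated data (not the Boston data)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 3.0 + X @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.5, size=100)

b0_ols, b_ols = ridge_closed_form(X, y, alpha=0.0)       # alpha = 0 reduces to OLS
b0_ridge, b_ridge = ridge_closed_form(X, y, alpha=50.0)  # larger alpha shrinks the slopes

print("OLS slopes:  ", np.round(b_ols, 3))
print("Ridge slopes:", np.round(b_ridge, 3))
```

The shrinkage of the slope coefficients as `alpha` grows is the L2 regularization effect that the ridge model adds to the OLS baseline.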

#### LASSO Regression

#### Decision Tree

#### Random Forest

## Division of Labor
Based on the previous discussions, the team has divided the responsibilities as follows:

- Tony: Coding
- Wanxin: Coding and some textual descriptions
- Xuan: Written section of the report

However, the team may make adjustments to the division of labor as needed during the project to ensure that all tasks are completed efficiently and effectively. Effective communication and collaboration within the team will be critical to ensure that everyone is working together towards the same goal.

## References

Vishal, V. (2017, October 27). Boston Housing Dataset. Kaggle. Retrieved March 14, 2023, from https://www.kaggle.com/datasets/altavish/boston-housing-dataset 