# Boston Housing Price Prediction Proposal

**Group Members:** Tony Liang, Wanxin Luo, Xuan Chen

**Student Numbers:** 39356993, 33432808, 15734643


ECON 323 Quantitative Economic Modelling with Data Science Applications UBC 2023

## Background
This is a quick demo that uses exploratory data analysis (EDA) to take a first look at the [Boston housing data](https://www.kaggle.com/datasets/altavish/boston-housing-dataset?resource=download) and inspect its distribution, then applies economic theory and statistical methods to analyze it. We will use the Boston Housing dataset both for EDA and for predictive analysis with a regression model.

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The columns are:

- `CRIM` - per capita crime rate by town
- `ZN` - proportion of residential land zoned for lots over 25,000 sq.ft.
- `INDUS` - proportion of non-retail business acres per town.
- `CHAS` - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- `NOX` - nitric oxides concentration (parts per 10 million)
- `RM` - average number of rooms per dwelling
- `AGE` - proportion of owner-occupied units built prior to 1940
- `DIS` - weighted distances to five Boston employment centres
- `RAD` - index of accessibility to radial highways
- `TAX` - full-value property-tax rate per \$10,000
- `PTRATIO` - pupil-teacher ratio by town
- `B` - $1000(B_k - 0.63)^2$, where $B_k$ is the proportion of Black residents by town
- `LSTAT` - percentage of the population of lower status
- `MEDV` - median value of owner-occupied homes in \$1000s

The dataset is derived from https://www.kaggle.com/datasets/fedesoriano/the-boston-houseprice-data.

**NOTE**: All of these variables are numeric, although some (such as `RAD`) are index values rather than direct measurements.

## Introduction
In this project, we aim to explore the impact of environmental factors on housing prices using the Boston Housing dataset. The dataset contains attributes such as crime rate, average number of rooms, and accessibility to highways, which are hypothesized to influence housing prices. The project has several parts: data cleaning, visualization, and model building. Our objective is to conduct exploratory data analysis (EDA) and then build a hedonic regression model with multiple inputs, using the Python techniques learned in this course.

By analyzing the data, we aim to answer economic questions about the housing market and to apply these techniques to real-world data. The dataset has limitations, since it was collected almost 50 years ago, but it still provides an excellent opportunity to apply our Python skills and gain insights into the housing market.

## Methods

To keep the analysis trustworthy, this report follows these steps:
1. [Data cleaning](#data-cleaning)
2. [Thorough EDA](#eda-phase)
3. [Building a multiple linear regression model](#model-fitting-phase)

### Data Cleaning

For the data cleaning step, we will check and handle the missing values in the dataset. We will also identify categorical and continuous variables. For instance, the `CHAS` variable is a dummy variable indicating whether the tract bounds the Charles River or not, and is encoded as 0 or 1.
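As a minimal sketch of this step, the snippet below counts missing values, flags 0/1 dummy columns such as `CHAS`, and fills the gaps. The toy frame, the helper names `find_binary_columns` and `impute`, and the median/mode imputation rule are all illustrative assumptions; the proposal itself does not yet fix how missing values will be handled.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: column names match the dataset,
# but the values are made up for illustration.
toy = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.2],
    "CHAS": [0.0, 1.0, np.nan, 0.0],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

def find_binary_columns(df):
    """Hypothetical helper: columns whose non-missing values are all 0 or 1."""
    return [c for c in df.columns if set(df[c].dropna().unique()) <= {0, 1}]

def impute(df, binary_cols):
    """Hypothetical helper: mode for dummy columns, median for the rest.

    The imputation rule is an assumption, not a decision made in the proposal.
    """
    out = df.copy()
    for col in out.columns:
        if col in binary_cols:
            # Fill a 0/1 dummy with its most frequent value, then store as int
            out[col] = out[col].fillna(out[col].mode().iloc[0]).astype(int)
        else:
            # Fill a continuous variable with its median
            out[col] = out[col].fillna(out[col].median())
    return out

missing_per_column = toy.isna().sum()    # number of NaNs in each column
binary_cols = find_binary_columns(toy)   # dummy variables such as CHAS
cleaned = impute(toy, binary_cols)
```

Dropping incomplete rows with `DataFrame.dropna()` would be a simpler alternative, at the cost of discarding observations.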

### EDA Phase

During the EDA phase, we will conduct a thorough examination of the Boston Housing dataset. One of the key steps is to generate a correlation matrix, which can help us identify potential multicollinearity among the independent variables. In addition, we will use side-by-side box plots to visualize the distributions of the continuous variables and detect potential outliers or anomalies. We will also use other visualization techniques, such as scatter plots and histograms, to better understand the relationships between the variables and explore trends or patterns in the data. Overall, the goal of EDA is to gain insights into the data and inform our subsequent modelling steps.
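The two numeric checks described here can be sketched without any plotting. The snippet below computes a correlation matrix with `DataFrame.corr()` and counts the points a box plot would draw as outliers, using the usual 1.5 × IQR whisker rule. The data is synthetic (generated to look loosely like `RM`, `LSTAT`, and `MEDV`), and `iqr_outliers` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic data: the relationships below are invented for illustration
# and are not estimates from the Boston Housing dataset.
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, n)
toy = pd.DataFrame({
    "RM": rm,
    "LSTAT": 40 - 4.5 * rm + rng.normal(0, 2, n),  # built to fall as RM rises
    "MEDV": 9 * rm - 34 + rng.normal(0, 3, n),     # built to rise with RM
})

# Pairwise Pearson correlations; large off-diagonal values between
# predictors hint at multicollinearity.
corr = toy.corr()

def iqr_outliers(series, k=1.5):
    """Hypothetical helper: True where a value lies beyond the box-plot whiskers."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Number of flagged points per column
outlier_counts = toy.apply(lambda s: int(iqr_outliers(s).sum()))
```

The box plots themselves could then be drawn with `toy.plot(kind="box")`, assuming matplotlib is installed.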

### Model Fitting Phase

In the model fitting phase, we will split the Boston Housing dataset into training and testing sets. We will then use the training set to select the relevant variables and build our final multiple linear regression model. The selection process can involve techniques such as stepwise regression or regularization, depending on the requirements of the project. Once we have the final model, we will use the testing set to evaluate its performance in terms of mean squared error (MSE). The goal is to ensure that the model generalizes well to new, unseen data and makes accurate predictions.
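The split, fit, and evaluate loop can be sketched as follows. This is an illustration on synthetic data, assuming an 80/20 split and plain ordinary least squares solved with `numpy.linalg.lstsq`; the variable selection step (stepwise regression or regularization) is left out, and the coefficients in `true_beta` are invented.

```python
import numpy as np

# Synthetic regression problem (assumed values, not the housing data)
rng = np.random.default_rng(323)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 5.0 + X @ true_beta + rng.normal(scale=0.1, size=n)

# Shuffle the row indices, then hold out the last 20% as the test set
idx = rng.permutation(n)
n_train = int(0.8 * n)
train, test = idx[:n_train], idx[n_train:]

def add_intercept(A):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(A)), A])

# Ordinary least squares fitted on the training rows only
beta_hat, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

# Test MSE: the mean of squared prediction errors on the held-out rows
y_pred = add_intercept(X[test]) @ beta_hat
test_mse = float(np.mean((y[test] - y_pred) ** 2))
```

Because the test rows play no part in estimating `beta_hat`, `test_mse` estimates the error on unseen data rather than the in-sample fit.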

## Division of Labor
Based on the previous discussions, the team has divided the responsibilities as follows:

- Tony: Coding
- Wanxin: Coding and some textual descriptions
- Xuan: Written section of the report

However, the team may make adjustments to the division of labor as needed during the project to ensure that all tasks are completed efficiently and effectively. Effective communication and collaboration within the team will be critical to ensure that everyone is working together towards the same goal.

### Loading the data

In [1]:
# import the libraries
import pandas as pd

In [2]:
# read the data
DATA_PATH = "data/boston_housing_data.csv"
data = pd.read_csv(DATA_PATH)
# report the shape of the data as (rows, columns)
print(f"\n The data is of size {data.shape}")
# show the first 5 rows (the default for head())
data.head()


 The data is of size (506, 14)


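| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | NaN | 36.2 |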


### Exploratory Data Analysis

This section checks the data for outliers, missing values, and data types.

In [3]:
# Summary of all the variables

# Note: every attribute here is a numerical variable
data.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 486.0 | 486.0 | 486.0 | 486.0 | 506.0 | 506.0 | 486.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 486.0 | 506.0 |
| mean | 3.611874 | 11.211934 | 11.083992 | 0.069959 | 0.554695 | 6.284634 | 68.518519 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.715432 | 22.532806 |
| std | 8.720192 | 23.388876 | 6.835896 | 0.25534 | 0.115878 | 0.702617 | 27.999513 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.155871 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.0819 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.175 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 7.125 | 17.025 |
| 50% | 0.253715 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 76.8 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.43 | 21.2 |
| 75% | 3.560263 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 93.975 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |
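In the summary above, six columns (`CRIM`, `ZN`, `INDUS`, `CHAS`, `AGE`, `LSTAT`) report a count of 486 against 506 rows, which implies 20 missing values in each, because `describe()` counts only non-missing entries. The sketch below shows that relationship on a small made-up frame and builds a per-column overview of data types and missingness; the toy values and the `overview` table layout are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with one missing value in CRIM and one in LSTAT
toy = pd.DataFrame({
    "CRIM": [0.006, 0.027, np.nan, 0.032, 0.069],
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    "LSTAT": [4.98, 9.14, 4.03, 2.94, np.nan],
})

# describe() reports non-missing counts, so rows minus count = missing values
counts = toy.describe().loc["count"]
missing_from_describe = (len(toy) - counts).astype(int)

# The same information computed directly, alongside each column's dtype
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
```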


In [None]:
data.