# Group 13 Project Proposal

## Real Estate Valuation

### Introduction
The UCI Machine Learning Repository offers a real estate valuation dataset from Prof. I-Cheng Yeh at Tamkang University. The dataset can be accessed at https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set.

It contains historical data from the real estate market in Sindian District, New Taipei City, spanning 2012-2013. Each row represents one property transaction with its corresponding feature columns, and the dataset includes 414 such sales records. It was collected to understand how six different factors affect the house price per unit area. This dataset was downloaded from Data Science Dojo, from an open-source repository.

The dataset has no null values. It has 414 rows and 7 columns, which are:
* **X1 transaction Date:** The transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.). It is a qualitative data type.
* **X2 house age:** The age of the house in years. It is a quantitative data type. 
* **X3 distance to the nearest MRT station:** The distance to the nearest Mass Rapid Transit (MRT) station in metres. It is a quantitative data type.
* **X4 number of convenience stores:** The number of convenience stores within walking distance in the house's living circle. It is a quantitative data type.
* **X5 latitude:** The geographic coordinate, latitude, in degrees. It is a quantitative data type.
* **X6 longitude:** The geographic coordinate, longitude, in degrees.  It is a quantitative data type.
* **Y house price of unit area:** The house price per unit area, in units of 10,000 New Taiwan Dollars per Ping, where Ping is a local unit (1 Ping = 3.3 square metres). For example, 29.3 = 293,000 New Taiwan Dollars per Ping. It is a quantitative data type.
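
The unit of the target variable can be made concrete with a small conversion. The sketch below is illustrative only: the helper names `price_per_ping` and `price_per_square_metre` are hypothetical, and it uses the 1 Ping = 3.3 square metres factor quoted in the dataset description.

```python
# Conversion factors taken from the dataset description.
NTD_PER_UNIT = 10_000        # Y is recorded in units of 10,000 New Taiwan Dollars
PING_IN_SQUARE_METRES = 3.3  # 1 Ping = 3.3 square metres


def price_per_ping(y: float) -> float:
    """Convert a Y value to New Taiwan Dollars per Ping."""
    return y * NTD_PER_UNIT


def price_per_square_metre(y: float) -> float:
    """Convert a Y value to New Taiwan Dollars per square metre."""
    return price_per_ping(y) / PING_IN_SQUARE_METRES


print(round(price_per_ping(29.3)))          # 293000
print(round(price_per_square_metre(29.3)))  # 88788
```

So the example value of 29.3 corresponds to roughly 88,788 New Taiwan Dollars per square metre.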


We want to predict the price of a house from its age, its distance from the nearest MRT station, and the number of nearby convenience stores. We plan to plot each variable against house price individually to examine the relationship between them, and then perform a regression analysis. This would help us identify which factors matter most in determining the price of a house and the extent to which each factor affects it.

### Preliminary exploratory data analysis

#### **Reading the dataset into Python**
First, we will import the dataset into Python to get an idea of what we are dealing with.

In [1]:
# Import the relevant Python packages.
import random

import altair as alt
import pandas as pd
import numpy as np
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [2]:
#importing dataset

houses = pd.read_excel("Real estate valuation data set.xlsx")
houses

| Unnamed: 0 | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|---|---|---|---|---|---|---|---|
| 0 | 2012.916667 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2012.916667 | 19.5 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 2013.583333 | 13.3 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 2013.500000 | 13.3 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 2012.833333 | 5.0 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 409 | 2013.000000 | 13.7 | 4082.01500 | 0 | 24.94155 | 121.50381 | 15.4 |
| 410 | 2012.666667 | 5.6 | 90.45606 | 9 | 24.97433 | 121.54310 | 50.0 |
| 411 | 2013.250000 | 18.8 | 390.96960 | 7 | 24.97923 | 121.53986 | 40.6 |
| 412 | 2013.000000 | 8.1 | 104.81010 | 5 | 24.96674 | 121.54067 | 52.5 |


#### **Cleaning and wrangling**
(rename the columns to remove spaces)
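
One possible way to carry out this renaming step is sketched below. The short names in `new_names` are hypothetical, since the proposal does not fix a naming scheme, and `sample` holds just two rows copied from the preview above so that the snippet runs on its own.

```python
import pandas as pd

# Two rows copied from the dataset preview, for illustration only.
sample = pd.DataFrame(
    {
        "X1 transaction date": [2012.916667, 2012.916667],
        "X2 house age": [32.0, 19.5],
        "X3 distance to the nearest MRT station": [84.87882, 306.59470],
        "X4 number of convenience stores": [10, 9],
        "X5 latitude": [24.98298, 24.98034],
        "X6 longitude": [121.54024, 121.53951],
        "Y house price of unit area": [37.9, 42.2],
    }
)

# Hypothetical replacement names without spaces (an assumption, not the authors' choice).
new_names = {
    "X1 transaction date": "transaction_date",
    "X2 house age": "house_age",
    "X3 distance to the nearest MRT station": "mrt_distance",
    "X4 number of convenience stores": "n_convenience_stores",
    "X5 latitude": "latitude",
    "X6 longitude": "longitude",
    "Y house price of unit area": "price_per_unit_area",
}

houses_clean = sample.rename(columns=new_names)
print(list(houses_clean.columns))
```

The same `rename` call would apply unchanged to the full `houses` data frame.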

#### **Examining the dataset**
(how many rows)
(taking only the columns we need)
(groupby count mean using profession)
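
A minimal sketch of these checks follows. It runs on nine rows copied from the preview above, restricted to the four columns the proposal uses. Because this dataset has no profession column, grouping by the number of convenience stores is an assumed stand-in for the group-by step.

```python
import pandas as pd

# Nine rows copied from the dataset preview: only the columns used in the proposal.
columns = [
    "X2 house age",
    "X3 distance to the nearest MRT station",
    "X4 number of convenience stores",
    "Y house price of unit area",
]
rows = [
    (32.0, 84.87882, 10, 37.9),
    (19.5, 306.59470, 9, 42.2),
    (13.3, 561.98450, 5, 47.3),
    (13.3, 561.98450, 5, 54.8),
    (5.0, 390.56840, 5, 43.1),
    (13.7, 4082.01500, 0, 15.4),
    (5.6, 90.45606, 9, 50.0),
    (18.8, 390.96960, 7, 40.6),
    (8.1, 104.81010, 5, 52.5),
]
sample = pd.DataFrame(rows, columns=columns)

# How many rows and columns, and how many missing values in total.
n_rows, n_cols = sample.shape
n_missing = int(sample.isna().sum().sum())

# Count of records and mean price for each number of convenience stores.
store_summary = sample.groupby("X4 number of convenience stores")[
    "Y house price of unit area"
].agg(["count", "mean"])

print(n_rows, n_cols, n_missing)
print(store_summary)
```

On the full data, `houses[columns]` would select the same four columns before summarizing.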

#### **Visualizing the data**
(age vs annual income with professions as colors)
(describe the strength, direction and linearity)
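
Strength and direction can be given a rough numerical summary with Pearson correlation coefficients, as in the sketch below. It uses nine rows copied from the preview above, so the values are illustrative and not results for the full dataset. Pearson correlation only captures linear association, so linearity still has to be judged from the scatterplots.

```python
import pandas as pd

# Nine rows copied from the dataset preview, for illustration only.
sample = pd.DataFrame(
    {
        "X2 house age": [32.0, 19.5, 13.3, 13.3, 5.0, 13.7, 5.6, 18.8, 8.1],
        "X3 distance to the nearest MRT station": [
            84.87882, 306.59470, 561.98450, 561.98450, 390.56840,
            4082.01500, 90.45606, 390.96960, 104.81010,
        ],
        "X4 number of convenience stores": [10, 9, 5, 5, 5, 0, 9, 7, 5],
        "Y house price of unit area": [
            37.9, 42.2, 47.3, 54.8, 43.1, 15.4, 50.0, 40.6, 52.5,
        ],
    }
)

target = "Y house price of unit area"

# Pearson correlation of each predictor with the price:
# the sign gives the direction, the magnitude gives the strength.
correlations = sample.drop(columns=target).corrwith(sample[target])
print(correlations.round(2))
```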

In [3]:
house_age_plot = (
    alt.Chart(houses)
    .mark_point()
    .encode(
        x=alt.X("X2 house age", scale=alt.Scale(zero=False)),
        y=alt.Y("Y house price of unit area", scale=alt.Scale(zero=False)),
    )
)
house_age_plot

In [4]:
house_mrt_plot = (
    alt.Chart(houses)
    .mark_point()
    .encode(
        x=alt.X("X3 distance to the nearest MRT station", scale=alt.Scale(zero=False)),
        y=alt.Y("Y house price of unit area", scale=alt.Scale(zero=False)),
    )
)
house_mrt_plot

In [5]:
house_store_plot = (
    alt.Chart(houses)
    .mark_point()
    .encode(
        x=alt.X("X4 number of convenience stores", scale=alt.Scale(zero=False)),
        y=alt.Y("Y house price of unit area", scale=alt.Scale(zero=False)),
    )
)
house_store_plot

### Methods
To perform a holistic and accurate analysis, we will use three variables from the dataset in a regression analysis.

- **House Age (in years)**: The age of a house is an important variable to consider when assessing house price. Previous studies have shown that houses depreciate over time and that the maintenance costs of upholding proper living conditions increase as the house ages. As such, people are willing to pay more for a younger house that does not carry these high maintenance costs.

Wilhelmsson, M. (2008). House price depreciation rates and level of maintenance. Journal of Housing Economics, 17(1), 88–101. https://doi.org/10.1016/j.jhe.2007.09.001 

- **Distance to the nearest MRT station**: MRT stations provide access to public transportation by train. These stations can be elevated or underground; in both cases, research has shown that house prices in the surrounding area are higher near a station. This reflects the convenience of having public transport close to home, especially in metropolitan and urban areas.

Hsu, K.-C. (2020). House Prices in the Peripheries of Mass Rapid Transit Stations Using the Contingent Valuation Method. Sustainability, 12(20), 8701. https://doi.org/10.3390/su12208701

- **Number of Convenience Stores**: Access to convenience stores within walking distance is important for getting groceries and meeting other daily household needs easily. The price of a house could vary accordingly.

We will predict house price from these variables using a regression model. We choose regression over classification because the price of a house is not a categorical variable; it is a numerical value that can fall anywhere within a range. Regression models are therefore better suited to this dataset. We will visualize the results with a scatterplot of house price versus house age, overlaid with the fitted regression curve.
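
The regression step could be sketched as follows. The proposal does not name a specific model, so the choice of scikit-learn's `LinearRegression` with standardized predictors is an assumption made for illustration. The data are nine rows copied from the preview above, and the `new_house` values are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = [
    "X2 house age",
    "X3 distance to the nearest MRT station",
    "X4 number of convenience stores",
]
target = "Y house price of unit area"

# Nine rows copied from the dataset preview, for illustration only.
rows = [
    (32.0, 84.87882, 10, 37.9),
    (19.5, 306.59470, 9, 42.2),
    (13.3, 561.98450, 5, 47.3),
    (13.3, 561.98450, 5, 54.8),
    (5.0, 390.56840, 5, 43.1),
    (13.7, 4082.01500, 0, 15.4),
    (5.6, 90.45606, 9, 50.0),
    (18.8, 390.96960, 7, 40.6),
    (8.1, 104.81010, 5, 52.5),
]
sample = pd.DataFrame(rows, columns=features + [target])
X, y = sample[features], sample[target]

# Standardize the predictors so the fitted coefficients are on a comparable scale.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)

# One coefficient per standardized predictor.
coefficients = pd.Series(
    model.named_steps["linearregression"].coef_, index=features
)

# Predict the price for a hypothetical house (values are made up).
new_house = pd.DataFrame([[10.0, 500.0, 5]], columns=features)
predicted = model.predict(new_house)

# In-sample R^2; a held-out test set would be needed for an honest estimate.
r_squared = model.score(X, y)

print(coefficients)
print(predicted, r_squared)
```

Standardizing first makes the coefficient magnitudes a rough guide to which variable matters most. On the full 414-row dataset, the model should be fitted on a training split and evaluated on a separate test split.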
 

### Expected Outcomes and Significance

Using the methods specified above, we hope to develop a model that can predict the price of any house from the values of the three variables specified above. We acknowledge that several other factors affect the price of real estate and that considering only three may lead to inaccuracies. Furthermore, our estimates of the extent to which each factor affects house price may also be inaccurate.

Nevertheless, there are certain predictions we can make using published research. Based on the studies cited in Methods, we expect house price to decrease as house age increases and as the distance to the nearest MRT station increases.

The findings from this model have real-life implications: a model that can predict house price from a number of different factors can help sellers evaluate the price of their properties. In general, this could make the real estate market more reliable over time, as sellers would have a consistent model to support their valuations. Our model could also lead to further questions. For example, what other variables could we use to evaluate house price? Should some variables be weighted, or prioritized, over others? Would these variables change from region to region?