# Group 13 Project Report

## Real Estate Valuation

### Introduction
The UCI Machine Learning Repository offers a real estate valuation dataset contributed by Prof. I-Cheng Yeh of Tamkang University. The dataset can be accessed via https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set.

It contains historical data from the real estate market in Sindian District, New Taipei City, spanning 2012-2013. Each row represents a property transaction record with corresponding feature columns. The dataset includes 414 property sales records and was collected to understand how six different factors affect the house price per unit area. Our copy was downloaded from Data Science Dojo, via an open source repository.

The dataset has no null values. It has 414 rows and 7 columns, which are:
* **X1 transaction Date:** The transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.). It is a qualitative data type.
* **X2 house age:** The age of the house in years. It is a quantitative data type. 
* **X3 distance to the nearest MRT station:** The distance to the nearest Mass Rapid Transit (MRT) station in metres. It is a quantitative data type.
* **X4 number of convenience stores:** The number of convenience stores within walking distance in the living circle. It is a quantitative data type.
* **X5 latitude:** The geographic coordinate, latitude, in degrees. It is a quantitative data type.
* **X6 longitude:** The geographic coordinate, longitude, in degrees.  It is a quantitative data type.
* **Y house price of unit area:** The house price per unit area, in units of 10,000 New Taiwan Dollars/Ping, where Ping is a local unit (1 Ping = 3.3 square metres). For example, 29.3 = 293,000 New Taiwan Dollars/Ping. It is a quantitative data type.
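The unit conversion in the last bullet can be sketched as a small helper. This is a minimal illustration rather than part of the original notebook: the function names `to_ntd_per_ping` and `to_ntd_per_sq_metre` are hypothetical, and the 3.3 square metres per Ping figure is the one quoted in the list above.

```python
NTD_PER_UNIT = 10_000      # Y is recorded in units of 10,000 NTD/Ping
SQ_METRES_PER_PING = 3.3   # conversion quoted in the dataset description


def to_ntd_per_ping(y: float) -> float:
    """Convert a raw Y value to New Taiwan Dollars per Ping."""
    return y * NTD_PER_UNIT


def to_ntd_per_sq_metre(y: float) -> float:
    """Convert a raw Y value to New Taiwan Dollars per square metre."""
    return to_ntd_per_ping(y) / SQ_METRES_PER_PING


# 29.3 is the example value used in the column description
print(round(to_ntd_per_ping(29.3)))
print(round(to_ntd_per_sq_metre(29.3)))
```

Keeping the two constants in one place makes it harder to drop a factor of ten when a slope or an error value is later translated back into New Taiwan Dollars.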


We want to be able to predict the price of a house using the house age, distance from the nearest metro station and proximity to convenience stores. We plan to plot the variables against the price of the house individually to determine the relationship between them and the price of the house and then perform a regression analysis. This would help us identify which variables have the greatest impact on the price of the house. This data analysis would help identify the factors which matter most when determining the price of the house and the specific extent to which these factors impact price. 

### Preliminary exploratory data analysis

#### **Reading the dataset into Python**
First, we will import the dataset into Python to get an idea of what we are dealing with. The dataset is read from the Excel file saved in the same folder as this notebook; it can also be accessed from this link: https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set.

However, before reading in the data, we will import some packages that might be helpful in our analysis.

In [1]:
# importing relevant Python packages
import random

import altair as alt
import pandas as pd
import numpy as np
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#importing dataset
houses = pd.read_excel("Real estate valuation data set.xlsx")
houses

Unnamed: 0,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2012.916667,19.5,306.59470,9,24.98034,121.53951,42.2
2,2013.583333,13.3,561.98450,5,24.98746,121.54391,47.3
3,2013.500000,13.3,561.98450,5,24.98746,121.54391,54.8
4,2012.833333,5.0,390.56840,5,24.97937,121.54245,43.1
...,...,...,...,...,...,...,...
409,2013.000000,13.7,4082.01500,0,24.94155,121.50381,15.4
410,2012.666667,5.6,90.45606,9,24.97433,121.54310,50.0
411,2013.250000,18.8,390.96960,7,24.97923,121.53986,40.6
412,2013.000000,8.1,104.81010,5,24.96674,121.54067,52.5


#### **Cleaning and wrangling**
As we can see from the table above, the data is already tidy. Every column holds a single variable, every row is a single observation and every value is in a single cell.
The column titles, however, are a bit unwieldy because they carry axis labels (X1, X2, ..., Y) and contain spaces. So, in the next step we will rename them to shorter, more Python-friendly names.

In [2]:
#wrangling the dataset
houses = houses.rename(
    columns={
        "X1 transaction date" : "transaction_date",
        "X2 house age" : "house_age",
        "X3 distance to the nearest MRT station" : "distance_nearest_MRT",
        "X4 number of convenience stores" : "number_convenience_stores",
        "X5 latitude" : "latitude",
        "X6 longitude" : "longitude",
        "Y house price of unit area" : "house_price"
    }
)
houses

Unnamed: 0,transaction_date,house_age,distance_nearest_MRT,number_convenience_stores,latitude,longitude,house_price
0,2012.916667,32.0,84.87882,10,24.98298,121.54024,37.9
1,2012.916667,19.5,306.59470,9,24.98034,121.53951,42.2
2,2013.583333,13.3,561.98450,5,24.98746,121.54391,47.3
3,2013.500000,13.3,561.98450,5,24.98746,121.54391,54.8
4,2012.833333,5.0,390.56840,5,24.97937,121.54245,43.1
...,...,...,...,...,...,...,...
409,2013.000000,13.7,4082.01500,0,24.94155,121.50381,15.4
410,2012.666667,5.6,90.45606,9,24.97433,121.54310,50.0
411,2013.250000,18.8,390.96960,7,24.97923,121.53986,40.6
412,2013.000000,8.1,104.81010,5,24.96674,121.54067,52.5


#### **Examining the dataset**
Looking at the dataset, we can see some columns that we won't need: `transaction_date`, `latitude` and `longitude`. Although these measures are very useful in predicting the price of a house in the real estate market (Mooya, 2016), we lack the background information for this particular dataset, such as monthly price trends by geographical location, that would let us explain the relationship between these variables and the house price. Therefore, we won't be using these variables in our data analysis.

First, we will perform some exploratory data analysis and preview the dataset using the info method.

In [3]:
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   transaction_date           414 non-null    float64
 1   house_age                  414 non-null    float64
 2   distance_nearest_MRT       414 non-null    float64
 3   number_convenience_stores  414 non-null    int64  
 4   latitude                   414 non-null    float64
 5   longitude                  414 non-null    float64
 6   house_price                414 non-null    float64
dtypes: float64(6), int64(1)
memory usage: 22.8 KB


There are a total of 414 data entries. All the columns are a `float64` type except the one that states the number of convenience stores which is an `int64` type. This makes sense because only a whole number can denote the number of convenience stores approachable by foot.

Next, we would like to create a training dataset that contains 75% of the observations and perform exploratory data analysis on that.

In [4]:
#training data
np.random.seed(10)
houses_train, houses_test = train_test_split(
    houses, train_size = 0.75)
houses_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 310 entries, 175 to 265
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   transaction_date           310 non-null    float64
 1   house_age                  310 non-null    float64
 2   distance_nearest_MRT       310 non-null    float64
 3   number_convenience_stores  310 non-null    int64  
 4   latitude                   310 non-null    float64
 5   longitude                  310 non-null    float64
 6   house_price                310 non-null    float64
dtypes: float64(6), int64(1)
memory usage: 19.4 KB


In [5]:
houses_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104 entries, 231 to 147
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   transaction_date           104 non-null    float64
 1   house_age                  104 non-null    float64
 2   distance_nearest_MRT       104 non-null    float64
 3   number_convenience_stores  104 non-null    int64  
 4   latitude                   104 non-null    float64
 5   longitude                  104 non-null    float64
 6   house_price                104 non-null    float64
dtypes: float64(6), int64(1)
memory usage: 6.5 KB


The training set and the testing set have 310 and 104 data entries respectively which corresponds to the 75%-25% split of the original data.
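The notebook fixes the split by seeding NumPy's global random generator. A common alternative, sketched below as an assumption rather than as what the notebook does, is to pass `random_state` directly to `train_test_split`, so that reproducibility does not depend on which other code has used the global generator. The list `rows` is a hypothetical stand-in for the 414 rows of the dataset.

```python
from sklearn.model_selection import train_test_split

rows = list(range(414))  # stand-in for the 414 row indices of the dataset

# random_state pins the shuffle for this call only
train_rows, test_rows = train_test_split(
    rows, train_size=0.75, random_state=10
)

print(len(train_rows), len(test_rows))
```

The two pieces always partition the original rows, so every observation ends up in exactly one of the training and testing sets.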

Now, a useful data exploration analysis could be to single out the predictor variables we will be using for the regression model and find their mean.

In [6]:
#weeding out other variables
houses_train_predictor = houses_train.loc[
    :,
    ["house_age", "distance_nearest_MRT", "number_convenience_stores", "house_price"]]
    
houses_train_predictor

Unnamed: 0,house_age,distance_nearest_MRT,number_convenience_stores,house_price
175,30.2,472.17450,3,36.5
10,34.8,405.21340,1,41.4
76,35.9,616.40040,3,36.8
172,6.6,90.45606,9,58.1
34,15.4,205.36700,7,55.1
...,...,...,...,...
369,20.2,2185.12800,3,22.8
320,13.5,4197.34900,0,18.6
15,35.7,579.20830,2,50.5
125,1.1,193.58450,6,48.6


In [7]:
#finding mean of the predictor variables
mean_predictor = pd.DataFrame(
    houses_train_predictor.mean()
).transpose()
mean_predictor

Unnamed: 0,house_age,distance_nearest_MRT,number_convenience_stores,house_price
0,18.03871,1137.675412,4.154839,37.490323


This gives us an overview of the training data. The average house price per Ping (a local unit, 1 Ping = 3.3 square metres) in the Sindian District is 374,903.23 New Taiwan Dollars. A typical house is around 18 years old. The nearest MRT station is on average more than a thousand metres away, and there are about four convenience stores in the vicinity of a typical house.

#### **Visualizing the data**

We will use scatterplots to show the correlation between each predictor variable we plan to use in the regression model and the target variable, i.e. the house price.

First, we plot house prices against their age.

In [8]:
price_age_plot = (
    alt.Chart(houses_train, title="House Price vs House Age")
    .mark_point()
    .encode(
        x=alt.X(
            "house_age", 
            title ="House Age (in years)",
            scale=alt.Scale(zero=False)),
        y=alt.Y(
            "house_price",
            title ="House Price (in 10000 New Taiwan Dollars/Ping)",
            scale=alt.Scale(zero=False)),
    )
)
price_age_plot



The scatter plot of house price against house age shows **two clusters of data**.
* The first cluster contains houses ranging from newly constructed to 25 years old. These properties show a **moderately strong negative correlation**, which suggests that the older the house, the lower its price. This is consistent with the common perception that newly built properties have a higher market price due to a lower need for maintenance. The design of new houses being in line with current trends may also be a factor (Wilhelmsson, 2008).
* The second cluster comprises houses between 25 and 45 years old. These houses show a **moderately weak positive correlation**. This might be because an established neighborhood with old houses has more facilities in the vicinity, which are sometimes valued more than the age of the house (Wing, 2005).

In addition, there is a particularly influential outlier near the point (10, 120). This is a ten-year-old house priced exorbitantly high at about 1,200,000 New Taiwan Dollars/Ping (120 × 10,000). This could affect our regression model significantly.
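One rough way to flag such a point numerically is the 1.5 × IQR rule. This rule is not used in the original notebook and is shown only as an illustrative sketch. The prices below are the nine values visible in the table previews plus a hypothetical value of 120.0 standing in for the outlier seen in the plot.

```python
import numpy as np

# Prices (10,000 NTD/Ping): the nine rows visible in the table previews
# plus one hypothetical extreme value of 120.0 standing in for the outlier.
prices = np.array([37.9, 42.2, 47.3, 54.8, 43.1, 15.4, 50.0, 40.6, 52.5, 120.0])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for closer inspection
flagged = prices[(prices < lower) | (prices > upper)].tolist()
print(flagged)
```

On a sample this small the fences are only indicative; a flagged value is a prompt to look at the observation, not a reason to drop it.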

In [9]:
price_mrt_plot = (
    alt.Chart(houses_train, title="House Price vs Distance to nearest MRT station")
    .mark_point()
    .encode(
        x=alt.X(
            "distance_nearest_MRT",
            title ="Distance from nearest MRT station (in meters)",
            scale=alt.Scale(zero=False)),
        y=alt.Y(
            "house_price",
            title ="House Price (in 10000 New Taiwan Dollars/Ping)",
            scale=alt.Scale(zero=False)),
    )
)
price_mrt_plot

The correlation between the price of a house and its distance from the nearest MRT station is strongly negative and roughly linear. This suggests that the **distance of a house from the nearest MRT station is a strong predictor of its market price**. However, note that this is a correlation and does not by itself show that being far from an MRT station causes a lower price, because correlation does not imply causation. The influential outlier is still visible in this scatterplot.

In [10]:
price_store_plot = (
    alt.Chart(houses_train, title="House Price vs Number of convenience stores")
    .mark_point()
    .encode(
        x=alt.X(
            "number_convenience_stores", 
            title ="Number of Convenience Stores",
            scale=alt.Scale(zero=False)),
        y=alt.Y("house_price", 
                title ="House Price (in 10000 New Taiwan Dollars/Ping)",
                scale=alt.Scale(zero=False)),
    )
)
price_store_plot

Since the number of convenience stores is a discrete quantitative variable, the points in the scatterplot line up at the integers marked on the x-axis. As we can see, there is a **non-linear relationship** between the two. A case study of Taipei, Taiwan on the nonlinear effect of convenience stores on residential property prices finds that "the residents in the neighbourhoods with lower-priced property may prefer accessibility to convenience stores where they can complete multiple tasks in one go, while those in the neighbourhoods with higher-priced property may be more mobile to access convenience stores in other suburbs en route from one place to another" (Chiang, Peng, & Chang, n.d.). This division would have a major influence on the regression model.

Another thing apparent from this scatterplot is the heavy right skew of prices for houses with only one convenience store nearby. Having a high number of convenience stores does not necessarily mean a higher house price, because a single store can fulfil the immediate need for household items. Once that condition is satisfied, a buyer potentially looks at other factors that affect the price.

Therefore, this variable is likely to be a weaker predictor of the price of a house.

### Multicollinearity
For the regression model to be more accurate, we want to make sure that the predictor variables are not strongly correlated among themselves. This is because, when predictors are strongly correlated, a small change in the data might result in a large change in the regression coefficients and make the regression model unreliable.

In [11]:
age_mrt_plot = (
    alt.Chart(houses_train, title="House Age vs Distance to nearest MRT station")
    .mark_point()
    .encode(
        x=alt.X(
            "distance_nearest_MRT",
            title ="Distance from nearest MRT station (in meters)",
            scale=alt.Scale(zero=False)),
        y=alt.Y(
            "house_age",
            title ="House Age (in Years)",
            scale=alt.Scale(zero=False)),
    )
)
age_mrt_plot

In [12]:
age_store_plot = (
    alt.Chart(houses_train, title="House Age vs Number of Convenience Stores")
    .mark_point()
    .encode(
        x=alt.X("house_age",
            title ="House Age (in Years)",
            scale=alt.Scale(zero=False)),
        y=alt.Y(
            "number_convenience_stores",
            title ="Number of Convenience Stores",
            scale=alt.Scale(zero=False)),
    )
)
age_store_plot

In [13]:
mrt_store_plot = (
    alt.Chart(houses_train, title="Number of Convenience Stores vs Distance to nearest MRT station")
    .mark_point()
    .encode(
        x=alt.X("distance_nearest_MRT",
            title ="Distance from Nearest MRT Station (in meters)",
            scale=alt.Scale(zero=False)),
        y=alt.Y(
            "number_convenience_stores",
            title ="Number of Convenience Stores",
            scale=alt.Scale(zero=False)),
    )
)
mrt_store_plot

Since the scatterplots show no strong correlation among the predictor variables, we can use them together in our regression model.
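A numeric check can complement a visual reading of the scatterplots. The sketch below computes a pairwise Pearson correlation matrix with NumPy. It runs on synthetic data generated for illustration only (the ranges and the link between distance and store count are assumptions, not values from the dataset), so the printed numbers say nothing about the real predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the three predictors (assumed ranges)
house_age = rng.uniform(0, 45, n)
distance = rng.uniform(20, 6500, n)
# Store count deliberately tied to distance, to show what correlation looks like
stores = np.clip(np.round(10 - distance / 700 + rng.normal(0, 1.5, n)), 0, 10)

# Columns are variables, rows are observations
X = np.column_stack([house_age, distance, stores])
corr = np.corrcoef(X, rowvar=False)

print(np.round(corr, 2))
```

Off-diagonal entries close to -1 or 1 would signal predictors that carry overlapping information; entries near 0 support using the predictors together.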

### Regression Analysis

Now, we will conduct linear regression, in which the model we build will predict the price of a house based on the three predictor variables: house age, distance from the nearest MRT station and the number of convenience stores accessible on foot.

To gain insight into the relationship between the response variable (dependent variable) and the predictor variable (independent variable), we perform separate regressions for each variable. This method enables us to identify the strength and nature of each predictor variable's impact on the response variable, as well as any outliers, influential observations, and non-linear relationships.

Multi-regression, on the other hand, is employed to determine the combined influence of multiple predictor variables on the response variable, while also accounting for the effect of other predictor variables. It allows us to recognize the key predictor variables and how they interact with one another. Multi-regression provides a more comprehensive perspective on the connection between predictor variables and the response variable, and it is a more effective approach for making predictions than individual regressions.

By conducting both the simple and multi linear regressions, we can gain a better understanding of the relationship between predictor variables and the response variable, resulting in more accurate predictions. 

#### Simple Linear Regression Models
##### **House Price vs House Age**
To begin our analysis, we will perform a linear regression between `house_price` and `house_age`.

In [14]:
#fit the linear regression model for price vs age
lm1 = LinearRegression()

X_train_age = pd.DataFrame(houses_train["house_age"])
y_train_age = houses_train["house_price"]

X_test_age = pd.DataFrame(houses_test["house_age"])
y_test_age = houses_test["house_price"]

lm1.fit(X_train_age, y_train_age)

In [15]:
#visualizing the regression line in house price vs house age
price_age_reg = price_age_plot + price_age_plot.transform_regression(
    "house_age", "house_price"
).mark_line(size=1)
price_age_reg



In [16]:
#finding the slope and intercept of the regression line
reg1 = ({"slope": lm1.coef_[0], "intercept": lm1.intercept_})
reg1

{'slope': -0.2968146704956151, 'intercept': 42.844476249714454}

This plot illustrates the correlation between house price and house age within the training dataset using a scatter plot. Each data point on the plot represents a house, with the x-axis displaying the age of the house in years, and the y-axis displaying the house price in 10,000 New Taiwan Dollars/Ping.

The plot includes a regression line, the best-fit line that represents the trend between the two variables. Its slope indicates the rate of change in house price for each additional year of house age. The slope of -0.2968 implies that the price decreases by approximately 3,000 New Taiwan Dollars/Ping for each year the house ages. Over the roughly 45-year age range in the data this adds up to a sizeable difference, which suggests that the age of a house is a **relevant predictor** of its price; how accurately it predicts is assessed with the RMSPE below.
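For a single predictor, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below implements that formula directly. The helper `fit_line` and the four data points are hypothetical and chosen for easy arithmetic; they are not taken from the dataset.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up ages (years) and prices (10,000 NTD/Ping), for illustration only
ages = [0.0, 10.0, 20.0, 30.0]
prices = [45.0, 42.0, 39.0, 36.0]

slope, intercept = fit_line(ages, prices)
ntd_change_per_year = slope * 10_000  # slope expressed in NTD/Ping per year
print(slope, intercept, ntd_change_per_year)
```

Multiplying the slope by 10,000 is the same step used in the text to turn a slope in the dataset's units into New Taiwan Dollars/Ping per year.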

In [17]:
#testing accuracy of the model
price_age_pred = houses_test.assign(
    predicted = lm1.predict(X_test_age))

price_age_RMSPE = np.sqrt(mean_squared_error(
    y_true = price_age_pred["house_price"],
    y_pred = price_age_pred["predicted"]))
price_age_RMSPE

13.300653236566939

As shown above, **the root mean squared prediction error (RMSPE) is 13.3.** This value is measured in the same units as our target variable, house price, and denotes the typical size of the prediction errors. In general, a lower RMSPE signifies that the regression line better fits the true values; however, the significance of the actual RMSPE value varies case by case because its units are those of the target variable.
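The RMSPE computed in the cell above is the square root of the mean squared difference between actual and predicted values. A minimal standalone sketch of that calculation follows; the helper name `rmspe` and the four price pairs are made up for illustration.

```python
import math


def rmspe(y_true, y_pred):
    """Root mean squared prediction error of y_pred against y_true."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices in 10,000 NTD/Ping, for illustration only
actual = [40.0, 50.0, 30.0, 45.0]
predicted = [43.0, 46.0, 30.0, 45.0]

error = rmspe(actual, predicted)
print(error)            # in units of 10,000 NTD/Ping
print(error * 10_000)   # the same error in NTD/Ping
```

Because the errors are squared before averaging, a few large misses raise the RMSPE more than many small ones, which is why outliers matter for this metric.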

As such, we can say that our model has an error of roughly 133,000 New Taiwan Dollars/Ping. This means that **the predicted values tend to be about 13.3 units away from the actual values**. This is a significant amount and indicates a **low accuracy**. This is in line with the preliminary analysis where we stated that the presence of two clusters of data might affect prediction accuracy.

It is important to recognize the few outliers present in the data; these data points, while important to include in the data, would notably bring up the RMSPE value. However, when observing the general trend, the RMSPE in this case is reasonable with the outliers taken into account.

##### **House Price vs Distance of nearest MRT stations**
Next, we will perform a linear regression between `house_price` and `distance_nearest_MRT`.

In [18]:
#fit the linear regression model for price vs distance from MRT
lm2 = LinearRegression()

X_train_mrt = pd.DataFrame(houses_train["distance_nearest_MRT"])
y_train_mrt = houses_train["house_price"]

X_test_mrt = pd.DataFrame(houses_test["distance_nearest_MRT"])
y_test_mrt = houses_test["house_price"]

lm2.fit(X_train_mrt, y_train_mrt)

In [19]:
#visualizing the regression line in house price vs distance from MRT
price_mrt_reg = price_mrt_plot + price_mrt_plot.transform_regression(
    "distance_nearest_MRT", "house_price"
).mark_line(size=1)
price_mrt_reg



In [20]:
#finding the slope and intercept of the regression line
reg2 = ({"slope": lm2.coef_[0], "intercept": lm2.intercept_})
reg2

{'slope': -0.007033742110671133, 'intercept': 45.492438034985376}

This plot displays a scatter plot of house prices versus the distance of the house from the nearest MRT station. Each data point on the plot represents a house, with the x-axis displaying the distance of the house to the nearest MRT station in metres, and the y-axis displaying the house price in 10,000 New Taiwan Dollars/Ping.

The regression line represents the linear relationship between the two variables, with the slope indicating the rate of change in house price for each additional unit of distance from the nearest MRT station. Here, a slope of about -0.007 indicates that the house price decreases by just 70 New Taiwan Dollars/Ping for each additional metre of distance from the MRT station.

**This might give the impression that this predictor variable is not very influential. However, the case is quite the opposite.** An additional metre of distance does not affect the price by much because it is a small unit but an additional kilometre (as shown by the scale of the scatterplot) can cause a huge difference. If calculated in kilometres, the slope of the regression line would become -7.0337 which means that every additional kilometre of distance between the house and MRT station can cause the house price to fall by more than 70,000 New Taiwan Dollars/Ping.
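The effect of changing the distance unit can be checked directly: refitting the same data with distance in kilometres multiplies the slope by 1,000 and leaves the intercept and the fitted line unchanged. The sketch below uses `numpy.polyfit` on synthetic data (the coefficients and noise level are assumptions for illustration, not the notebook's fit).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: distance in metres, price in 10,000 NTD/Ping
distance_m = rng.uniform(20, 6500, size=100)
price = 45.0 - 0.007 * distance_m + rng.normal(0, 5, size=100)

# polyfit with deg=1 returns [slope, intercept]
slope_m, intercept_m = np.polyfit(distance_m, price, deg=1)
slope_km, intercept_km = np.polyfit(distance_m / 1000, price, deg=1)

print(slope_m * 1000, slope_km)    # the two slopes agree
print(intercept_m, intercept_km)   # the intercept is unchanged
```

This is why the size of a raw coefficient says little about a predictor's importance on its own: it depends on the unit in which the predictor is measured.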

In [21]:
#testing accuracy of the model
price_mrt_pred = houses_test.assign(
    predicted = lm2.predict(X_test_mrt))

price_mrt_RMSPE = np.sqrt(mean_squared_error(
    y_true = price_mrt_pred["house_price"],
    y_pred = price_mrt_pred["predicted"]))
price_mrt_RMSPE

10.014938944882417

Here, **the root mean squared prediction error (RMSPE) is 10.01.** Hence, we can say that our model has an error of roughly 100,000 New Taiwan Dollars/Ping. This means that **the predicted values tend to be about 10.01 units away from the actual values**. This is also a significant amount, but the error is lower than that of the house price vs house age model, which makes this model **more accurate.** This is in line with the preliminary analysis where we stated that the variables have a strong negative linear trend.

##### **House Price vs Number of Convenience Stores** 
Next, we will perform a linear regression between `house_price` and `number_convenience_stores`.

In [22]:
#fit the linear regression model for price vs number of convenience store
lm3 = LinearRegression()

X_train_store = pd.DataFrame(houses_train["number_convenience_stores"])
y_train_store = houses_train["house_price"]

X_test_store = pd.DataFrame(houses_test["number_convenience_stores"])
y_test_store = houses_test["house_price"]

lm3.fit(X_train_store, y_train_store)

In [23]:
#visualizing the regression line in house price vs number of convenience stores
price_store_reg = price_store_plot + price_store_plot.transform_regression(
    "number_convenience_stores", "house_price"
).mark_line(size=1)
price_store_reg



In [24]:
#finding the slope and intercept of the regression line
reg3 = ({"slope": lm3.coef_[0], "intercept": lm3.intercept_})
reg3

{'slope': 2.691571981983652, 'intercept': 26.307275120016314}

This plot provides a visual representation of the relationship between the two variables, house price and the number of convenience stores accessible on foot. Along the x-axis, the number of convenience stores is measured in whole numbers, while along the y-axis the house price is displayed in 10,000 New Taiwan Dollars/Ping.

The regression line represents the linear relationship between the two variables, with the slope indicating the rate of change in house price for each additional convenience store in the vicinity. Here, a positive slope of 2.6915 indicates that the house price increases by around 27,000 New Taiwan Dollars/Ping with just one extra addition of a neighboring convenience store. 

This makes it quite an important predictor, and it hints at the high property prices in the downtown of a city, which has access to a diversified marketplace.

In [25]:
#testing accuracy of the model
price_store_pred = houses_test.assign(
    predicted = lm3.predict(X_test_store))

price_store_RMSPE = np.sqrt(mean_squared_error(
    y_true = price_store_pred["house_price"],
    y_pred = price_store_pred["predicted"]))
price_store_RMSPE

11.501936780459747

Here, **the root mean squared prediction error (RMSPE) is 11.5.** Hence, we can say that our model has an error of roughly 115,000 New Taiwan Dollars/Ping. This means that **the predicted values tend to be about 11.5 units away from the actual values**. This is again a significant error, higher than that of the house price vs distance from MRT stations model but lower than that of the house price vs house age model. This indicates that this model's relative accuracy lies between the other two models, which is **still low.** This is in line with the preliminary analysis where we stated that the variables have a roughly non-linear trend with high right skewness at `x=1` (where there is one convenience store in the neighborhood).

In addition to these insights, these plots can help identify any outliers or influential points that may affect the relationship between the variables. Overall, the plots provide valuable information about the relationship between house prices and the predictor variables, expressing the extent to which each predictor influences the response variable and enabling informed decisions in the real estate market.

#### Multivariable Linear Regression Model 

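We now fit a multivariable linear regression that uses all three predictors (house age, distance from the nearest MRT station and the number of convenience stores) at the same time to predict the house price.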
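Written out, a linear model with the three predictors used in this report has the form below. The symbol names are chosen here for illustration: $\hat{y}$ is the predicted house price, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \beta_3$ are the coefficients that least squares estimates by minimising the sum of squared residuals over the $n$ training observations.

```latex
\hat{y} = \beta_0
        + \beta_1 \, x_{\text{house\_age}}
        + \beta_2 \, x_{\text{distance\_nearest\_MRT}}
        + \beta_3 \, x_{\text{number\_convenience\_stores}},
\qquad
(\beta_0, \beta_1, \beta_2, \beta_3)
  = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Each coefficient describes the change in predicted price for a one-unit change in its predictor while the other two predictors are held fixed.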

In [26]:
#fit the linear regression model
mlm = LinearRegression()

X_train = pd.DataFrame(houses_train[[
    "house_age", "distance_nearest_MRT", "number_convenience_stores"]])
y_train = houses_train["house_price"]

X_test = pd.DataFrame(houses_test[[
    "house_age", "distance_nearest_MRT", "number_convenience_stores"]])
y_test = houses_test["house_price"]
    
mlm.fit(X_train, y_train)

In [27]:
#testing the accuracy of the regression model
houses_pred = houses_test.assign(
    predicted = mlm.predict(X_test))

houses_pred_RMSPE = np.sqrt(mean_squared_error(
    y_true = houses_pred["house_price"],
    y_pred = houses_pred["predicted"]
))
houses_pred_RMSPE

9.378637860750057

**The root mean squared prediction error (RMSPE) of the multivariable linear regression is 9.38.** This means that the model has an error of roughly 93,800 New Taiwan Dollars/Ping: the predicted values tend to be about 9.38 units away from the actual values.

The error for the multivariable regression is lower than that of every simple linear regression model we have built. This shows the value of using several variables to build a prediction model, as it leads to **higher accuracy**.
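One caveat is worth illustrating. On the training data, a least-squares model with more predictors can never have a larger error than a model using a subset of them, so a comparison on held-out data, as done above with the test set, is the informative one. The sketch below demonstrates the training-data property with `numpy.linalg.lstsq` on synthetic data; the coefficients, noise level and the helper `training_rmse` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Synthetic predictors and price (10,000 NTD/Ping); coefficients are made up
age = rng.uniform(0, 45, n)
distance = rng.uniform(20, 6500, n)
stores = rng.integers(0, 11, n).astype(float)
price = 43 - 0.25 * age - 0.005 * distance + 1.2 * stores + rng.normal(0, 8, n)


def training_rmse(*predictors):
    """Fit OLS with an intercept and return the RMSE on the same data."""
    X = np.column_stack([np.ones(n), *predictors])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    residuals = price - X @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))


single = {
    "age": training_rmse(age),
    "distance": training_rmse(distance),
    "stores": training_rmse(stores),
}
combined = training_rmse(age, distance, stores)
print(single, combined)
```

A lower error on the test set, as reported in this notebook, is therefore stronger evidence for the combined model than a lower training error would be.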

### Observations
Our analysis aimed to understand the relationship between house prices in New Taipei City and several predictor variables, including the age of the property, distance from the nearest MRT station, and the number of nearby convenience stores. We conducted individual regressions for each variable and found that all three were associated with house prices.

The regression analysis between house age and house price showed a weak correlation, with a negative slope indicating that the house's price decreases as the age of the house increases. Regression model 1 used only "house_age" as a predictor variable and resulted in an RMSPE score of `13.300653236566939`.

The regression analysis between house price and distance from the nearest MRT station showed the strongest negative correlation, with houses farther away from the MRT stations having lower prices. Regression model 2 used only "distance_nearest_MRT" as a predictor variable and resulted in an RMSPE score of `10.014938944882417`.

The regression analysis between house price and the number of nearby convenience stores showed a positive correlation, with houses near more convenience stores having higher prices. Regression model 3 used only "number_convenience_stores" as a predictor variable and resulted in an RMSPE score of `11.501936780459747`.

Performing multi-regression using all three predictor variables together resulted in the most accurate model for predicting house prices in this dataset. Model 4, which uses house_age, distance_nearest_MRT, and number_convenience_stores as predictor variables, resulted in the lowest RMSPE score of `9.378637860750057`.
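The four RMSPE scores reported above can be placed side by side and converted to New Taiwan Dollars/Ping with a few lines of Python. The values are copied from the notebook outputs; the dictionary keys are labels chosen here for readability.

```python
# RMSPE values as reported in this notebook, in units of 10,000 NTD/Ping
rmspe_by_model = {
    "house_age": 13.300653236566939,
    "distance_nearest_MRT": 10.014938944882417,
    "number_convenience_stores": 11.501936780459747,
    "all three predictors": 9.378637860750057,
}

# Sort from most to least accurate and convert to NTD/Ping
ranked = sorted(rmspe_by_model.items(), key=lambda item: item[1])
for name, value in ranked:
    print(f"{name:<28} {value:6.2f}  ~{value * 10_000:,.0f} NTD/Ping")
```

Sorting by error makes the ordering discussed in this section explicit: the combined model first, then distance, then convenience stores, then house age.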

### Discussion
The results from the regression analysis are in line with our project proposal's expectations as there are some significant inaccuracies in determining a house's price using only three variables. Nevertheless, the multi-regression using these primary factors still provided valuable insights into the Taipei housing market and the most critical factors contributing to a house's price. This outcome indicates that taking a more comprehensive approach to analyzing the relationship between predictor variables and the response variable can lead to a better understanding of the factors that contribute to house prices. The multi-regression model using all three predictor variables resulted in the lowest RMSPE score because it considers the combined influence of all three predictors on the response variable, enabling more accurate predictions of house prices.

Overall, data analysis showed that older houses and those farther away from the nearest MRT station tend to have lower prices, while houses with a higher number of nearby convenience stores tend to have higher prices. These insights are highly relevant for real estate agents and developers, who can utilize this information to make more informed decisions regarding their investments.

Moreover, the study may prompt further research into the impact of other factors, such as property size or number of bedrooms, on house prices in New Taipei City and other regions worldwide. Overall, the findings could have a significant impact on the real estate market in Taiwan, offering valuable insights for individuals and organizations involved in buying, selling, and developing properties.

#### **Bibliography:**

Chiang, Y.-H., Peng, T.-C., & Chang, C.-O. (n.d.). The nonlinear effect of convenience stores on residential property prices: A case study of Taipei, Taiwan. Habitat International. Retrieved from https://www.sciencedirect.com/science/article/pii/S0197397514001556

Hsu, K.-C. (2020). House Prices in the Peripheries of Mass Rapid Transit Stations Using the Contingent Valuation Method. Sustainability, 12(20), 8701. https://doi.org/10.3390/su12208701

Mooya, M.M. (2016). Standard Theory of Real Estate Market Value: Concepts and Problems. In: Real Estate Valuation Theory. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49164-5_1

Wilhelmsson, M. (2008). House price depreciation rates and level of maintenance. Journal of Housing Economics, 17(1), 88–101. https://doi.org/10.1016/j.jhe.2007.09.001

Wing, Chau Kwong, Wong, Siu Kei, & Yiu, Edward. (2005). Adjusting for Non-Linear Age Effects in the Repeat Sales Index. The Journal of Real Estate Finance and Economics, 31, 137–153. https://doi.org/10.1007/s11146-005-1369-6
