## ECON 570 Final Project
### Instructor: Ida Johnsson
### Group Members: Mingyu Zhao, Shang Gao, Yantong Li

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

## I. Introduction

-a. Background and Questions

Singapore is a Southeast Asian country with a land area of 280 square miles (for comparison, Los Angeles covers 503 square miles). It is one of the smallest countries in the world, and it is famous for its public housing system: 86% of Singaporeans live in public housing built by the Housing & Development Board (HDB) (Sau, 2011). HDB imposes some restrictions on public housing. One is that ownership of a flat expires 99 years after its construction; another is that flats under the new scheme can only be sold on the resale market after 10 years. Given these unique characteristics, we are interested in Covid's impact on public housing in Singapore. Specifically, we want to investigate whether Covid affected people's preferences across more and less prime districts, and whether it affected their preferences across house types or house models.

-b. Why should we care about these questions?

Answering these questions can help HDB understand which types and models of houses it should build more of to meet changing preferences. It can also help HDB identify the hottest housing market district in the post-Covid era.

-c. Are there any previous papers?

We did not find any paper on Covid's impact on HDB house prices, but many papers study Covid's impact on house prices in other countries. Qian, Qiu, and Zhang (2021) showed that house prices in China fell by 2.47% due to Covid. Anenberg and Ringo (2021) stated that demand in the housing market increased, and that this increase resulted in higher house prices in the United States. In addition, we follow Wen's (2005) idea that neighborhoods have a large set of characteristics, such as socioeconomic factors (neighbors' status, residents' income), public services (schools, hospitals, churches), and externalities (crime rate, traffic, noise), to evaluate Covid's impact on different districts in Singapore.

## II. Data

Our dataset comes from https://www.kaggle.com/datasets/denzilg/hdb-flat-prices-19902021-march and was contributed by a Singaporean. It includes HDB public housing resale data from 1990 to 2021, but we only consider data from 2019 to 2021 in our project. The contributor divided Singapore into 6 districts, numbered 1 to 6, where 1 is the most prime part of Singapore and 6 is the least prime. Besides the district division, the dataset also includes the house type and house model of each resold flat.

The dependent variable is the CPI-adjusted price (price cpi_adj). We decided not to remove any outliers because extreme values are common in house sale prices. The summary statistics below include the mean, the median, the standard deviation, the 25th percentile, and the 75th percentile of each numeric variable. For the dependent variable (price cpi_adj), the mean is 449109.149388, the median is 420500.395500, the standard deviation is 156432.740458, the 25th percentile is 335785.7386, and the 75th percentile is 530121.928.

-Meanings of Some Independent Variables

flat_type: HDB-specified flat type ['4 ROOM', '3 ROOM', '1 ROOM', '5 ROOM', 'EXECUTIVE', '2 ROOM', 'MULTI GENERATION']

storey: storey number

lease_rem: The number of years of remaining lease

flat_model: HDB-specified flat model (not the same as flat type) ['New Generation', 'Improved', 'Standard', 'Model A', 'Apartment', 'Maisonette', 'Model A-Maisonette', 'Simplified', 'Terrace', 'Improved-Maisonette', 'MULTI GENERATION', 'Premium Apartment', 'Multi Generation', 'Adjoined flat', 'Premium Maisonette', '2-room', 'Model A2', 'DBSS', 'Type S1', 'Type S2', 'Premium Apartment Loft']

In [3]:
data_source = "https://raw.githubusercontent.com/yantonglll/ECON570_Final_Project/main/ALL%20Prices%202019-2021%20mar.csv"
data = pd.read_csv(data_source)
data.head()

Unnamed: 0,month,town,town_dummy,flat_type,block,street_name,address,latitude,longitude,storey_range,...,price_psm_yearly,Core CPI,price cpi_adj,price_psm cpi_adj,bala lease pct,price lease_adj implied,price_psm lease_adj implied,price cpi_lease_adj implied,price_psm cpi_lease_adj implied,year_gni
0,2019-01,ANG MO KIO,2,5 ROOM,700B,ANG MO KIO AVE 6,700B ANG MO KIO AVE 6 SINGAPORE,1.369457,103.846276,19 TO 21,...,86.16086,99.961,794109.7028,7154.141466,92.2,826516.26898,7446.092513,826838.736104,7448.997622,78847
1,2019-01,ANG MO KIO,2,5 ROOM,316A,ANG MO KIO ST 31,316A ANG MO KIO ST 31 SINGAPORE,1.364621,103.84708,19 TO 21,...,81.395349,99.961,770300.4172,7002.731065,93.3,792282.958199,7202.572347,792592.069145,7205.382446,78847
2,2019-01,ANG MO KIO,2,4 ROOM,310B,ANG MO KIO AVE 1,310B ANG MO KIO AVE 1 SINGAPORE,1.364778,103.844221,25 TO 27,...,84.918478,99.961,750292.6141,7815.548064,95.0,757894.736842,7894.736842,758190.431091,7897.816991,78847
3,2019-01,ANG MO KIO,2,5 ROOM,315B,ANG MO KIO ST 31,315B ANG MO KIO ST 31 SINGAPORE,1.364079,103.847476,13 TO 15,...,76.955603,99.961,728284.0308,6620.763916,93.3,749067.524116,6809.704765,749359.774457,6812.361586,78847
4,2019-01,ANG MO KIO,2,5 ROOM,353,ANG MO KIO ST 32,353 ANG MO KIO ST 32 SINGAPORE,1.364015,103.851622,16 TO 18,...,81.705948,99.961,728284.0308,6620.763916,91.4,764638.949672,6951.263179,764937.275239,6953.975229,78847


-Summary Statistics

In [11]:
data.mean(numeric_only=True)

town_dummy                              3.986019
latitude                                1.368154
longitude                             103.841947
storey                                  8.629908
area_sqm                               97.475734
lease_start                          1995.629794
lease_rem                              74.916149
resale_price                       448966.534107
price_psm                            4637.302933
price_psm_yearly                       63.071376
Core CPI                               99.965318
price cpi_adj                      449109.149388
price_psm cpi_adj                    4638.786463
bala lease pct                         87.465088
price lease_adj implied            490849.372381
price_psm lease_adj implied          5085.421549
price cpi_lease_adj implied        491004.393064
price_psm cpi_lease_adj implied      5087.039705
year_gni                            75476.395129
covid_dum                               0.578541
dtype: float64

In [12]:
data.median(numeric_only=True)

town_dummy                              4.000000
latitude                                1.368377
longitude                             103.847206
storey                                  8.000000
area_sqm                               93.000000
lease_start                          1996.000000
lease_rem                              75.000000
resale_price                       420000.000000
price_psm                            4328.358209
price_psm_yearly                       58.717254
Core CPI                               99.952000
price cpi_adj                      420500.395500
price_psm cpi_adj                    4330.460722
bala lease pct                         88.500000
price lease_adj implied            453033.707865
price_psm lease_adj implied          4800.000000
price cpi_lease_adj implied        453288.009560
price_psm cpi_lease_adj implied      4801.614451
year_gni                            75000.000000
covid_dum                               1.000000
dtype: float64

In [13]:
data.std(numeric_only=True)

town_dummy                              1.335508
latitude                                0.042548
longitude                               0.071105
storey                                  5.821679
area_sqm                               24.272039
lease_start                            13.819732
lease_rem                              13.787720
resale_price                       156429.568400
price_psm                            1261.668212
price_psm_yearly                       16.934227
Core CPI                                0.213564
price cpi_adj                      156432.740458
price_psm cpi_adj                    1261.544423
bala lease pct                          6.436430
price lease_adj implied            162943.221987
price_psm lease_adj implied          1277.759841
price cpi_lease_adj implied        162937.583909
price_psm cpi_lease_adj implied      1277.494957
year_gni                             2994.466875
covid_dum                               0.493797
dtype: float64

In [14]:
data.quantile(q=0.25, axis=0, numeric_only=True, interpolation='linear')

town_dummy                              3.000000
latitude                                1.337799
longitude                             103.779077
storey                                  5.000000
area_sqm                               82.000000
lease_start                          1985.000000
lease_rem                              64.000000
resale_price                       335000.000000
price_psm                            3815.476190
price_psm_yearly                       50.502152
Core CPI                               99.865000
price cpi_adj                      335785.738600
price_psm cpi_adj                    3816.873921
bala lease pct                         82.400000
price lease_adj implied            374515.662651
price_psm lease_adj implied          4243.665768
price cpi_lease_adj implied        374726.885728
price_psm cpi_lease_adj implied      4245.005900
year_gni                            72418.000000
covid_dum                               0.000000
Name: 0.25, dtype: f

In [15]:
data.quantile(q=0.75, axis=0, numeric_only=True, interpolation='linear')

town_dummy                              5.000000
latitude                                1.395882
longitude                             103.899374
storey                                 11.000000
area_sqm                              113.000000
lease_start                          2010.000000
lease_rem                              89.000000
resale_price                       530000.000000
price_psm                            5074.626866
price_psm_yearly                       72.035900
Core CPI                              100.121000
price cpi_adj                      530121.928000
price_psm cpi_adj                    5075.811907
bala lease pct                         94.300000
price lease_adj implied            570514.285714
price_psm lease_adj implied          5558.823529
price cpi_lease_adj implied        570553.460586
price_psm cpi_lease_adj implied      5559.402836
year_gni                            78847.000000
covid_dum                               1.000000
Name: 0.75, dtype: f
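
The five separate calls above can also be collapsed into a single `describe()` call. The snippet below is a minimal sketch on a made-up toy frame (the `toy` name and its values are assumptions for illustration, not rows from the HDB data); it only shows which rows of the `describe()` output correspond to the statistics reported in this section.

```python
import pandas as pd

# Toy stand-in for the resale data; the values are made up for illustration.
toy = pd.DataFrame({
    "price cpi_adj": [300000.0, 350000.0, 420000.0, 530000.0, 800000.0],
    "town": ["A", "B", "C", "D", "E"],
})

# One call returns count, mean, std, min, the quartiles (25%, 50%, 75%) and max
# for every numeric column; non-numeric columns are skipped by default.
summary = toy.describe()
print(summary.loc[["mean", "50%", "std", "25%", "75%"], "price cpi_adj"])
```

The `50%` row is the median, so the same table covers every statistic quoted for the dependent variable.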

## III. Model

### III.1 Model 1:
We first investigate a model where Y, the dependent variable, is price cpi_adj. Independent variables, or the covariates, are town_dummy, covid dummy, area_sqm, lease_rem, storey, and flat_type. The model should look like this:

$\mathrm{price\ cpi\_adj}_i = \beta_0 + \beta_1 \mathrm{town\_dummy}_i + \beta_2 \mathrm{covid\_dummy}_i + \beta_3 \mathrm{area\_sqm}_i + \beta_4 \mathrm{lease\_rem}_i + \beta_5 \mathrm{storey}_i + \beta_6 \mathrm{flat\_type}_i + e_i$,  

where $e_i \sim N(0,\sigma^2)$

In [5]:
# data_sum is an alias of data, not a copy: columns added to data_sum also appear in data
data_sum = data

In [6]:
# create a dummy for covid
data_sum["covid_dum"] = (data_sum.month >= "2020-01").astype(int)
data_sum

# rename "price cpi_adj"
data_sum = data_sum.rename(columns = {"price cpi_adj":"price_cpi_adj"})
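
The three models in this notebook define the Covid period with slightly different rules: `>= "2020-01"` here, `between('2020-02','2021-03')` in Model 2, and `> '2020-01'` in Model 3. The sketch below, on a hypothetical four-month series (the `months` values and variable names are assumptions for illustration), shows how the rules differ. They disagree only on January 2020.

```python
import pandas as pd

months = pd.Series(["2019-12", "2020-01", "2020-02", "2021-03"])

# Zero-padded 'YYYY-MM' strings sort lexicographically in date order,
# so plain string comparison is enough here.
from_jan = (months >= "2020-01").astype(int)                     # Model 1 rule
feb_to_mar = months.between("2020-02", "2021-03").astype(int)    # Model 2 rule
after_jan = (months > "2020-01").astype(int)                     # Model 3 rule

print(pd.DataFrame({"month": months, "from_jan": from_jan,
                    "feb_to_mar": feb_to_mar, "after_jan": after_jan}))
```

Because the sample ends in March 2021, the Model 2 and Model 3 rules select the same months; only the Model 1 rule also counts January 2020 as a Covid month.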

Now we create dummies for town_dummy. Keep in mind that town_dummy ranges from 1 to 6, with 1 being the most prime area and 6 the least prime.

In [7]:
# create dummies for town_dummy
town_dum = pd.get_dummies(data_sum['town_dummy'])
town_dum

# Attach these dummies to dataframe
data_c = pd.concat([data_sum,town_dum], axis=1)
data_c

# Rename columns
data_rn1 = data_c.rename(columns = {1: "town_1",2: "town_2",3: "town_3",4: "town_4",5: "town_5",6: "town_6"})

#data_rn.columns
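
As an alternative to building the district dummies by hand, the formula interface can expand a column into indicators with `C(...)`. The snippet below is a minimal sketch on simulated data (the `toy` frame, the `price` column and the data-generating numbers are all assumptions for illustration); it is not the regression reported in this notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "town_dummy": rng.integers(1, 7, size=n),   # districts 1..6
    "area_sqm": rng.uniform(40, 150, size=n),
})
# Made-up data-generating process: each step away from district 1 costs 20,000.
toy["price"] = (500_000 - 20_000 * (toy["town_dummy"] - 1)
                + 3_000 * toy["area_sqm"] + rng.normal(0, 10_000, size=n))

# C(...) expands the column into indicator variables and drops the first
# level (district 1), which becomes the reference category.
fit = smf.ols("price ~ C(town_dummy) + area_sqm", data=toy).fit()
print(fit.params.filter(like="town_dummy"))
```

Each reported district coefficient is then a price difference relative to district 1, which matches how town_2 through town_6 are read in Model 1.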

In [8]:
# set variable "flat_type" to a categorical variable
data_rn1["flat_type"].describe()
data_rn1["flat_type"] = data_rn1["flat_type"].astype("category")

In [9]:
data_rn1["flat_type"] = data_rn1["flat_type"].cat.codes
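
To see what `cat.codes` does to flat_type, the sketch below applies the same two steps to a short hypothetical series containing each flat type once (the series itself is an assumption for illustration).

```python
import pandas as pd

flat_type = pd.Series(["4 ROOM", "EXECUTIVE", "1 ROOM", "MULTI GENERATION",
                       "2 ROOM", "5 ROOM", "3 ROOM"]).astype("category")

# Categories are sorted alphabetically, and cat.codes returns each value's
# position in that sorted list.
mapping = {cat: code for code, cat in enumerate(flat_type.cat.categories)}
print(mapping)
print(flat_type.cat.codes.tolist())
```

Entering these codes as one numeric regressor means Model 1 estimates a single slope for flat_type, that is, the same price step between each pair of consecutive codes. Wrapping the column in `C(flat_type)` in the formula would instead estimate a separate coefficient per type.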

#### Now let's run the regression with model 1:

In [10]:
# regression 1
est1 = smf.ols(formula="price_cpi_adj ~ covid_dum + town_2 + town_3 + town_4 + town_5 + town_6 + lease_rem + area_sqm + flat_type + storey", data=data_rn1).fit()

est1.summary()

0,1,2,3
Dep. Variable:,price_cpi_adj,R-squared:,0.775
Model:,OLS,Adj. R-squared:,0.775
Method:,Least Squares,F-statistic:,18140.0
Date:,"Mon, 02 May 2022",Prob (F-statistic):,0.0
Time:,15:34:14,Log-Likelihood:,-665020.0
No. Observations:,52641,AIC:,1330000.0
Df Residuals:,52630,BIC:,1330000.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1.467e+05,3070.423,-47.771,0.000,-1.53e+05,-1.41e+05
covid_dum,1.836e+04,657.696,27.916,0.000,1.71e+04,1.96e+04
town_2,-7.309e+04,2019.519,-36.193,0.000,-7.71e+04,-6.91e+04
town_3,-6.765e+04,2324.524,-29.104,0.000,-7.22e+04,-6.31e+04
town_4,-2.018e+05,1965.077,-102.707,0.000,-2.06e+05,-1.98e+05
town_5,-2.571e+05,1990.389,-129.190,0.000,-2.61e+05,-2.53e+05
town_6,-2.558e+05,2100.351,-121.784,0.000,-2.6e+05,-2.52e+05
lease_rem,3949.6076,27.677,142.703,0.000,3895.360,4003.855
area_sqm,3676.5236,45.953,80.006,0.000,3586.455,3766.592

0,1,2,3
Omnibus:,4850.771,Durbin-Watson:,0.722
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7319.515
Skew:,0.711,Prob(JB):,0.0
Kurtosis:,4.147,Cond. No.,1850.0


### III.2. Model 2

$\mathrm{price\ cpi\_adj}_i = \beta_0 + \beta_1 \mathrm{flat\_type}_i + \beta_2 \mathrm{storey}_i + \beta_3 \mathrm{area\_sqm}_i + \beta_4 \mathrm{lease\_rem}_i + \beta_5 \mathrm{district}_i + \beta_6 \mathrm{covid}_i + \beta_7 \mathrm{covid\_districts}_i + e_i$,  

where $e_i \sim N(0,\sigma^2)$

In [14]:
dummies = pd.get_dummies(data['town_dummy'])

In [15]:
dummies.rename(columns={1:'d1',2:'d2',3:'d3',4:'d4',5:'d5',6:'d6'},inplace = True)

In [16]:
dataTown=pd.concat([data,dummies.reindex(data.index)],axis=1)

In [17]:
# Covid period: 2020-02 through 2021-03, both inclusive (Model 1's covid_dum starts at 2020-01)
dataTown['Covid'] = dataTown.month.between('2020-02', '2021-03').astype(int)

In [18]:
dataTown['Covid2']=dataTown['d2']*dataTown['Covid']
dataTown['Covid3']=dataTown['d3']*dataTown['Covid']
dataTown['Covid4']=dataTown['d4']*dataTown['Covid']
dataTown['Covid5']=dataTown['d5']*dataTown['Covid']
dataTown['Covid6']=dataTown['d6']*dataTown['Covid']
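
The hand-built products above are interaction terms between the Covid dummy and the district dummies. The sketch below shows the same idea with the formula operator `*` on simulated data with three toy districts (the `toy` frame, the effect sizes and the helper `interaction` are assumptions for illustration, not estimates from the HDB data).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
toy = pd.DataFrame({
    "town_dummy": rng.integers(1, 4, size=n),   # three toy districts
    "Covid": rng.integers(0, 2, size=n),
})
# Made-up effects: Covid adds 10,000 in district 1, 4,000 in district 2,
# and lowers prices by 5,000 in district 3.
true_effect = toy["town_dummy"].map({1: 10_000, 2: 4_000, 3: -5_000})
toy["Y"] = (400_000 - 50_000 * (toy["town_dummy"] - 1)
            + true_effect * toy["Covid"] + rng.normal(0, 2_000, size=n))

# 'Covid * C(town_dummy)' expands to both main effects plus their products.
fit = smf.ols("Y ~ Covid * C(town_dummy)", data=toy).fit()

def interaction(k):
    # Locate the interaction row for district k without relying on term order.
    name = [p for p in fit.params.index if ":" in p and f"[T.{k}]" in p][0]
    return fit.params[name]

# Net Covid effect in district k = Covid coefficient + interaction coefficient.
net = {1: fit.params["Covid"]}
for k in (2, 3):
    net[k] = fit.params["Covid"] + interaction(k)
print({k: round(v) for k, v in net.items()})
```

The Covid main effect is the effect in the omitted district, and each interaction coefficient is the difference from it, so a negative net effect in one district can coexist with a positive main effect.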

In [19]:
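# Strip the word 'ROOM' (e.g. '4 ROOM' becomes '4 '); assign back instead of using chained inplace
dataTown['flat_type'] = dataTown['flat_type'].str.replace('ROOM', '', regex=False)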

In [22]:
dataTown['Y']=dataTown['price cpi_adj']

In [24]:
# Use a new name so the Model 1 result (est1) is not overwritten
est2 = smf.ols(formula="Y ~ flat_type + storey + area_sqm + lease_rem + d2 + d3 + d4 + d5 + d6 + Covid + Covid2 + Covid3 + Covid4 + Covid5 + Covid6", data=dataTown).fit()

In [25]:
print(est2.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                  0.778
Method:                 Least Squares   F-statistic:                     9205.
Date:                Mon, 02 May 2022   Prob (F-statistic):               0.00
Time:                        15:39:57   Log-Likelihood:            -6.6472e+05
No. Observations:               52641   AIC:                         1.329e+06
Df Residuals:                   52620   BIC:                         1.330e+06
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept     

### III.3 Model 3

$\mathrm{price\ cpi\_adj}_i = \beta_0 + \beta_1 \mathrm{flat\_type}_i + \beta_2 \mathrm{flat\_model}_i + \beta_3 \mathrm{area\_sqm}_i + \beta_4 \mathrm{lease\_rem}_i + \beta_5 \mathrm{covid}_i + \beta_6 \mathrm{covid\_flat\_model}_i + e_i$,  

where $e_i \sim N(0,\sigma^2)$

In [18]:
data_source = "https://raw.githubusercontent.com/yantonglll/ECON570_Final_Project/main/ALL%20Prices%202019-2021%20mar.csv"
df= pd.read_csv(data_source)

In [19]:
pd.set_option('display.max_columns', None)

##### Convert flat type to a categorical variable

In [20]:
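# Strip the word 'ROOM' (e.g. '4 ROOM' becomes '4 '); assign back instead of using chained inplace
df['flat_type'] = df['flat_type'].str.replace('ROOM', '', regex=False)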

In [21]:
# 1 for months after 2020-01, i.e. from 2020-02 onward
df['covid_dummy'] = (df.month > '2020-01').astype(int)

In [22]:
df['flat_model'].unique()

array(['Improved', 'Model A', 'DBSS', 'Standard', 'New Generation',
       'Apartment', 'Maisonette', 'Premium Apartment', 'Simplified',
       'Type S2', 'Type S1', 'Adjoined flat', 'Model A2', 'Terrace',
       'Premium Apartment Loft', 'Model A-Maisonette', 'Multi Generation',
       'Improved-Maisonette', 'Premium Maisonette', '2-room'],
      dtype=object)

In [23]:
len(df['flat_model'].unique())

20

In [24]:
flat_model_dummy = pd.get_dummies(df['flat_model'],drop_first = True)
#flat_model_dummy
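# Replace spaces with hyphens in the model names; assign back instead of using chained inplace
df['flat_model'] = df['flat_model'].str.replace(' ', '-', regex=False)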

In [25]:
# Map empty model names, if any, to '-'
df['flat_model'] = df['flat_model'].replace('', '-')
df = df.join(flat_model_dummy)

In [26]:
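# NOTE: this is a plain copy of covid_dummy, not a product with the flat-model
# indicators, so it is perfectly collinear with covid_dummy in the regression below
df['covid_flatmodel'] = df['covid_dummy']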

In [27]:
y = df['price cpi_adj']

In [28]:
result = smf.ols(formula = 'y~flat_type+area_sqm+covid_dummy+storey+lease_rem+flat_model+covid_flatmodel',data = df).fit()
result.summary()
#dataTown['Covid2']=dataTown['d2']*dataTown['Covid']

0,1,2,3
Dep. Variable:,y,R-squared:,0.641
Model:,OLS,Adj. R-squared:,0.641
Method:,Least Squares,F-statistic:,3362.0
Date:,"Thu, 05 May 2022",Prob (F-statistic):,0.0
Time:,19:35:55,Log-Likelihood:,-677300.0
No. Observations:,52641,AIC:,1355000.0
Df Residuals:,52612,BIC:,1355000.0
Df Model:,28,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-9.826e+04,4.62e+04,-2.127,0.033,-1.89e+05,-7731.702
flat_type[T.2 ],-4.751e+04,1.87e+04,-2.537,0.011,-8.42e+04,-1.08e+04
flat_type[T.3 ],9197.9452,1.86e+04,0.494,0.622,-2.73e+04,4.57e+04
flat_type[T.4 ],4.887e+04,1.9e+04,2.566,0.010,1.15e+04,8.62e+04
flat_type[T.5 ],6.86e+04,1.97e+04,3.484,0.000,3e+04,1.07e+05
flat_type[T.EXECUTIVE],5.712e+04,2.06e+04,2.779,0.005,1.68e+04,9.74e+04
flat_type[T.MULTI GENERATION],1.419e+05,2.51e+04,5.655,0.000,9.27e+04,1.91e+05
flat_model[T.Adjoined-flat],1.365e+05,4.3e+04,3.170,0.002,5.21e+04,2.21e+05
flat_model[T.Apartment],6.759e+04,4.23e+04,1.597,0.110,-1.54e+04,1.51e+05

0,1,2,3
Omnibus:,10778.572,Durbin-Watson:,0.584
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22931.775
Skew:,1.2,Prob(JB):,0.0
Kurtosis:,5.166,Cond. No.,1.02e+16
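
In the Model 3 cell, covid_flatmodel is assigned as a copy of covid_dummy, which is consistent with the very large condition number in the output (1.02e+16). The sketch below, on simulated data with three flat models (the `toy` frame, `covid_copy` and the effect sizes are assumptions for illustration), first shows the rank loss caused by a copied column and then one way a Covid by flat-model interaction could be specified. It is a sketch of the intended specification, not a re-estimate of Model 3.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1500
toy = pd.DataFrame({
    "flat_model": rng.choice(["Improved", "Model A", "DBSS"], size=n),
    "covid_dummy": rng.integers(0, 2, size=n),
})
# Made-up Covid effects by flat model.
effect = toy["flat_model"].map({"DBSS": 0, "Improved": 5_000, "Model A": 12_000})
toy["y"] = 400_000 + effect * toy["covid_dummy"] + rng.normal(0, 2_000, size=n)

# A copied dummy adds no information: three columns, but only rank 2.
toy["covid_copy"] = toy["covid_dummy"]
X = toy[["covid_dummy", "covid_copy"]].assign(const=1.0).to_numpy(dtype=float)
rank = np.linalg.matrix_rank(X)
print(rank)

# A genuine interaction multiplies the Covid dummy by each flat-model indicator.
fit = smf.ols("y ~ covid_dummy * C(flat_model)", data=toy).fit()
inter = fit.params.filter(like=":")
print(inter)
```

With this specification the covid_dummy coefficient is the Covid effect for the omitted flat model, and each interaction row is the additional Covid effect for that model.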


## IV. Findings

All three models show that Covid had a statistically significant positive impact on HDB house prices.

Focusing on Model 1, the regression output shows that more prime districts tend to have a smaller price gap relative to the most prime district (district 1). The coefficient on the covid dummy (1.836e+04) shows that house prices in the post-Covid period were higher by about 18,360 CPI-adjusted Singapore dollars, holding the other covariates fixed.

Turning to Model 2, which allows Covid's impact to differ by district, the coefficient on covid falls from the Model 1 estimate (1.836e+04, about 18,360) to 1008, and it is still significant. In other words, part of the Model 1 estimate reflected district-specific Covid effects, so including the interaction terms reduces potential omitted variable bias. The coefficients on the "covid_district" variables show that not every district in Singapore saw a price increase: District 3 had a decrease of 3546.89-1008=2538.89 CPI-adjusted Singapore dollars. The magnitudes of these coefficients show that District 4 had the largest price increase.
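
The district-level arithmetic can be written out explicitly. As a sketch, let $\beta_{covid}$ denote the coefficient on Covid and $\beta_{covid,k}$ the coefficient on the interaction term for district $k$ (this notation is introduced here for illustration and does not appear in the regression output). The net Covid effect in district $k$ is the sum of the two. Using the values quoted in the text for district 3:

$\Delta_3 = \beta_{covid} + \beta_{covid,3} = 1008 + (-3546.89) = -2538.89$

For district 1, the omitted category, the net effect is $\beta_{covid}$ alone.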

Turning to Model 3, where we measure people's preferences over house types and house models, the magnitudes of the coefficients show that the multi-generation flat type raises the house price the most (1.419e+05 relative to the baseline type). We might infer that the multi-generation type of HDB flat was more popular. The coefficient on the 2-room type (-4.751e+04) shows that a 2-room flat lowers the house price by about 47,510 CPI-adjusted Singapore dollars relative to the baseline type.

## V. Conclusion

In conclusion, we built three models in this project: Model 1 measures Covid's impact on the overall Singapore housing market, Model 2 measures Covid's impact on the housing market of different districts in Singapore, and Model 3 measures Covid's impact while accounting for different house types and house models. The regression results of all three models show that Covid had a positive impact on house prices in Singapore. At the district level, however, we found that Covid did not always have a positive impact: district 3 had a decrease of 2538.89 CPI-adjusted Singapore dollars after Covid. In Model 3, we found that multi-generation is the most preferred house type and 2-room is the least preferred.

We also learnt some lessons from this project. We often perceive that the pandemic damages the economy and slows social development, but besides the Covid shock itself there are many other factors that need to be considered. As we did in our project, one should include the relevant variables in the regression to reduce omitted variable bias and make the results more reliable.

In the future, when more data points are available, we might be able to answer this question more thoroughly. The pandemic has not ended yet, and it is unclear whether the effect (our coefficient on covid) will change over a longer time horizon.


## VI. References

- Anenberg, E., & Ringo, D. (2021). Housing market tightness during COVID-19: Increased demand or reduced supply? FEDS Notes, Board of Governors of the Federal Reserve System. Retrieved May 5, 2022, from https://www.federalreserve.gov/econres/notes/feds-notes/housing-market-tightness-during-covid-19-increased-demand-or-reduced-supply-20210708.htm
- Qian, X., Qiu, S., & Zhang, G. (2021). The impact of COVID-19 on housing price: Evidence from China. Finance Research Letters, 43, 101944. https://doi.org/10.1016/j.frl.2021.101944
- Wen, H. (2005). Characteristic price of urban house: Theory and empirical research. Economic Science Press, (4), 19-38.
- Housing & Development Board Website. https://www.hdb.gov.sg/cs/infoweb/homepage
- HDB flat prices 1990-2021 March. Kaggle. https://www.kaggle.com/datasets/denzilg/hdb-flat-prices-19902021-march