### Modelling

In [None]:
# Column names must match df_transactions; the project column is `project_name`.
nominal_vars = ['project_name', 'building_type', 'tenure']

for var in nominal_vars:
    kruskal_result = kruskal(*df_transactions[[var, 'price_psf']].groupby(var, observed=True)['price_psf'].apply(list), nan_policy='omit')
    print(f"Kruskal-Wallis result for {var} and price_psf: {kruskal_result}")

Kruskal-Wallis result for project_name and price_psf: KruskalResult(statistic=162116.11862761097, pvalue=0.0)
Kruskal-Wallis result for building_type and price_psf: KruskalResult(statistic=88069.13795613007, pvalue=0.0)
Kruskal-Wallis result for tenure and price_psf: KruskalResult(statistic=20043.441115929923, pvalue=0.0)


For the Kruskal-Wallis test (DV: continuous, IV: nominal), the null hypothesis H0 is that the population medians of all groups are equal (no association, since every group shares a similar median), while the alternative hypothesis H1 is that at least one group's population median differs from the others (an association, since the median depends on the group).

Since all computed p-values are below 0.05, we reject the null hypothesis H0 and conclude that each of the nominal variables is associated with `price_psf`.
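With p-values printed as 0.0 on a sample this large, significance alone says little about how strong each association is. The sketch below is not part of the original analysis: it recomputes the H statistic by hand on toy data and derives the epsilon-squared effect size, H / (n - 1), as one possible follow-up. The function names `kruskal_h` and `epsilon_squared` and the toy groups are hypothetical, and the code assumes the pooled data contain at least two distinct values.

```python
from collections import Counter


def kruskal_h(groups):
    """Kruskal-Wallis H statistic with tie correction (pure-Python sketch)."""
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    counts = Counter(pooled)
    rank_of = {}
    next_rank = 1
    for value in sorted(counts):
        tie_count = counts[value]
        rank_of[value] = next_rank + (tie_count - 1) / 2
        next_rank += tie_count

    # H = 12 / (n (n + 1)) * sum(R_i^2 / n_i) - 3 (n + 1)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Tie correction; equals 1 when there are no ties.
    tie_correction = 1 - sum(t**3 - t for t in counts.values()) / (n**3 - n)
    return h / tie_correction


def epsilon_squared(h, n):
    """Effect size for Kruskal-Wallis: 0 means no effect, 1 a perfect split."""
    return h / (n - 1)


# Hypothetical toy data: three perfectly separated groups.
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_stat = kruskal_h(toy_groups)
effect = epsilon_squared(h_stat, n=9)
print(f"H = {h_stat:.2f}, epsilon^2 = {effect:.2f}")
```

Applied to the notebook, `epsilon_squared` would take the `statistic` field of each `KruskalResult` together with the number of non-missing observations used in that test.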

In [None]:
px.scatter(df_transactions.query('year >= 1957'), x='spa_date', y='price_psf', trendline='ols', trendline_color_override='red').write_html(CHARTS_DIR / 'price_psf_scatter.html')


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result





In [None]:
px.box(df_transactions.query('year >= 1957'), x='year', y='price_psf', color='year', notched=True).write_html(CHARTS_DIR / 'price_psf_box_year.html')

In [None]:
px.box(df_transactions.query('year >= 1957'), x='month', y='price_psf', color='month', notched=True).write_html(CHARTS_DIR / 'price_psf_box_month.html')

In [None]:
px.box(df_transactions.query('year >= 1957'), x='day', y='price_psf', color='day', notched=True).write_html(CHARTS_DIR / 'price_psf_box_day.html')

In [None]:
px.bar(df_transactions.query('year > 1957')[['year', 'price_psf']].groupby('year').median().sort_values(by='year', ascending=True), text_auto=True).update_xaxes(tickmode='linear').write_html(CHARTS_DIR / 'median_price_psf_by_year.html')

In [None]:
px.bar(df_transactions.query('year > 1957')[['month', 'price_psf']].groupby('month').median().sort_values(by='month', ascending=True), text_auto=True).update_xaxes(tickmode='linear').write_html(CHARTS_DIR / 'median_price_psf_by_month.html')

In [None]:
px.bar(df_transactions.query('year > 1957')[['day', 'price_psf']].groupby('day').median().sort_values(by='day', ascending=True), text_auto=True).update_xaxes(tickmode='linear').write_html(CHARTS_DIR / 'median_price_psf_by_day.html')

Based on the plots above, the year has an obvious effect on the price, but the month and day do not. We therefore need a more objective measure to determine whether month and day affect the price.

In [None]:
ordinal_vars = ['floors', 'rooms', 'year', 'month', 'day']

for var in ordinal_vars:
    spearman_result = spearmanr(df_transactions[var], df_transactions['price_psf'], nan_policy='omit')
    print(f"Spearman result for {var} and price_psf: {spearman_result}")

Spearman result for floors and price_psf: SignificanceResult(statistic=0.06611489545829043, pvalue=1.3671703056186273e-282)
Spearman result for rooms and price_psf: SignificanceResult(statistic=0.23106137073587568, pvalue=0.0)
Spearman result for year and price_psf: SignificanceResult(statistic=0.6745377265610482, pvalue=0.0)
Spearman result for month and price_psf: SignificanceResult(statistic=0.043261692172534914, pvalue=5.069670789621408e-122)
Spearman result for day and price_psf: SignificanceResult(statistic=0.009710387604962123, pvalue=1.3618592797375146e-07)


However, the Spearman rank correlation test (DV: continuous, IV: ordinal) suggests otherwise. Here the null hypothesis H0 is that there is no monotonic correlation between the two variables, while the alternative hypothesis H1 is that there is one.

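Since the computed p-values are all far below 0.05, we reject the null hypothesis H0 and conclude that each variable is correlated with `price_psf`. Thus year, month and day all have a statistically significant relationship with the price, although the coefficients for month (0.043) and day (0.0097) are very small compared with the one for year (0.675).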
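It may help to see why such a small coefficient still yields a tiny p-value. The sketch below uses the standard transformation t = rho * sqrt((n - 2) / (1 - rho^2)) and, as a large-sample shortcut, treats t as standard normal. The helper name `spearman_significance` is hypothetical, and the row count is an assumption inferred from the missing-value figures printed further down, not a value the notebook reports for this test.

```python
import math


def spearman_significance(rho, n):
    """Approximate two-sided p-value for a Spearman coefficient.

    The normal approximation to the t distribution is only reasonable
    for large n; this is a sketch, not a replacement for scipy.
    """
    t_stat = rho * math.sqrt((n - 2) / (1 - rho**2))
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, p_value


# rho for `day` is copied from the output above; ASSUMED_N is an assumption.
RHO_DAY = 0.009710387604962123
ASSUMED_N = 294_567

t_stat, p_value = spearman_significance(RHO_DAY, ASSUMED_N)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, rho^2 = {RHO_DAY**2:.6f}")

# The same coefficient on a small sample would not be significant.
_, p_small = spearman_significance(RHO_DAY, 100)
print(f"p with n = 100: {p_small:.2f}")
```

The point of the comparison is that the p-value is driven largely by the sample size here, so the size of the coefficient is arguably the more informative quantity when deciding which date components to keep as features.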

In [None]:
df_transactions_notna = df_transactions.dropna()
df_transactions_notna

Unnamed: 0,project_name,spa_date,building_type,tenure,floors,rooms,land_area,built_up,price_psf,price,year,month,day
253,BANDAR BARU SRI PETALING,2020-07-30,TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,2.0,4.0,2196.0,2342.0,560.0,1230000.0,2020,7,30
257,BANDAR BARU SRI PETALING,2020-07-16,TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,1.5,3.0,1927.0,1187.0,379.0,730000.0,2020,7,16
258,BANDAR BARU SRI PETALING,2020-07-10,TERRACE HOUSE - CORNER LOT,LEASEHOLD,2.0,2.0,1711.0,672.0,251.0,430000.0,2020,7,10
261,BANDAR BARU SRI PETALING,2020-07-03,TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,1.0,3.0,1539.0,843.0,341.0,525000.0,2020,7,3
262,BANDAR BARU SRI PETALING,2020-07-03,TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,2.0,3.0,1647.0,1634.0,577.0,950000.0,2020,7,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294562,HERITAGE STATION HOTEL,1990-11-13,FLAT,LEASEHOLD,1.0,2.0,493.0,493.0,71.0,35000.0,1990,11,13
294563,IDAMAN PUTERI,2005-01-10,CONDOMINIUM,FREEHOLD,1.0,3.0,1454.0,1454.0,150.0,218025.0,2005,1,10
294564,KELAB LE CHATEAU II,2008-02-25,CONDOMINIUM,FREEHOLD,1.0,3.0,593.0,593.0,194.0,115000.0,2008,2,25
294565,MUTIARA SENTUL CONDOMINIUM,2009-08-10,APARTMENT,LEASEHOLD,1.0,2.0,1193.0,1193.0,197.0,235000.0,2009,8,10


In [None]:
fig1 = px.bar(df_transactions_notna['year'].value_counts().sort_index())
fig2 = px.bar(df_transactions['year'].value_counts().sort_index())

fig1.update_traces(name='NaNs removed')
fig2.update_traces(marker_color = 'rgba(0,0,0,0)', marker_line_color = 'black', name='NaNs included')

fig = go.Figure(fig1.data)
fig.add_traces(fig2.data)
fig.write_html(CHARTS_DIR / 'count_of_transactions_with_missing_values_by_year.html')

The chart above shows that dropping rows with NaNs outright removes a large share of the transactions from 2020 to 2023. Imputation is therefore required.
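The same comparison can be expressed as a number per year rather than read off the chart. The sketch below is library-free and runs on hypothetical toy rows (not the real data); `share_lost_by_year` is an illustrative name, and missing values are assumed to be represented as `None`.

```python
from collections import defaultdict


def share_lost_by_year(rows):
    """Fraction of rows per year that contain at least one missing value."""
    total = defaultdict(int)
    dropped = defaultdict(int)
    for row in rows:
        total[row["year"]] += 1
        if any(value is None for value in row.values()):
            dropped[row["year"]] += 1
    return {year: dropped[year] / total[year] for year in sorted(total)}


# Hypothetical toy rows mimicking the shape of df_transactions.
toy_rows = [
    {"year": 2020, "rooms": 4.0, "built_up": 2342.0},
    {"year": 2023, "rooms": None, "built_up": None},
    {"year": 2023, "rooms": None, "built_up": 657.0},
    {"year": 2023, "rooms": 3.0, "built_up": 1454.0},
]
shares = share_lost_by_year(toy_rows)
print(shares)
```

In the notebook itself, something like `df_transactions.isna().any(axis=1).groupby(df_transactions['year']).mean()` should give the equivalent per-year share directly in pandas.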

Before imputation, however, the categorical variables need to be converted into numerical variables. Approaches from the surveyed literature:
1. One-hot encoding (Parygin et al., 2018)
2. 

Parygin et al. (2018): https://iopscience.iop.org/article/10.1088/1742-6596/1015/3/032102/pdf
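To make the one-hot option concrete, the sketch below encodes a few `tenure` values by hand. It only illustrates what the encoding produces and is not the method used by Parygin et al. or by this notebook; `one_hot_encode` is a hypothetical helper, and missing values are assumed to have been handled beforehand.

```python
def one_hot_encode(values, prefix):
    """Return the sorted categories and one 0/1 indicator dict per value."""
    categories = sorted(set(values))
    encoded = [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]
    return categories, encoded


# Sample values in the style of the `tenure` column shown above.
tenure_values = ["LEASEHOLD", "FREEHOLD", "LEASEHOLD"]
categories, encoded = one_hot_encode(tenure_values, prefix="tenure")

for original, row in zip(tenure_values, encoded):
    print(original, row)
```

In pandas the usual route would be `pd.get_dummies`. One caveat worth weighing is that a high-cardinality column such as `project_name` would produce one indicator column per project, so a different encoding may be preferable there.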

In [None]:
print(
    f"Number of rows with missing values: {df_transactions['rooms'].isna().sum()}",
    f"\nPercentage of missing values: {df_transactions['rooms'].isna().sum() / len(df_transactions) * 100}",
)

Number of rows with missing values: 30670 
Percentage of missing values: 10.411892710317177


In [None]:
df_transactions[df_transactions['rooms'].isna()]

Unnamed: 0,project_name,spa_date,building_type,tenure,floors,rooms,land_area,built_up,price_psf,price
0,BANDAR BARU SRI PETALING,2023-06-09,TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,1,,2196,,342,750000
1,BANDAR BARU SRI PETALING,2023-06-01,TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,2,,753,,398,300000
2,BANDAR BARU SRI PETALING,2023-05-29,TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,2.5,,3197,,188,600000
3,BANDAR BARU SRI PETALING,2023-05-25,TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,2,,753,,531,400000
4,BANDAR BARU SRI PETALING,2023-05-22,SEMI-D,LEASEHOLD,2.5,,4801,,250,1200000
...,...,...,...,...,...,...,...,...,...,...
294507,PANTAI PANORAMA KONDO,2022-12-06,FLAT,LEASEHOLD,1,,657,657,297,195000
294521,WINSOR TOWER,2010-04-16,SERVICE RESIDENCE,FREEHOLD,1,,640,640,609,390000
294530,KAWASAN PERINDUSTRIAN TRISEGI,1999-09-27,FLAT,FREEHOLD,1,,511,511,88,45000
294532,TAMAN SUNGAI BESI (MEDIUM COST FLAT),1999-01-11,FLAT,FREEHOLD,1,,500,500,156,78000
