In [5]:
import numpy as np
import geopandas as gpd

In [6]:
PATH_HDB = "./processed_n/hdb_features.parquet"

In [7]:
hdb = gpd.read_parquet(PATH_HDB)
hdb.head()

Unnamed: 0,month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,resale_year,resale_age,LAT,LNG,X,Y,geometry,log_price,dist_mrt,dist_hcen,dist_scen,bus_count_400m
0,2023-01,ANG MO KIO,2 ROOM,01 TO 03,44.0,Improved,1979,267000.0,2023,44,1.362005,103.85388,30288.234663,38229.067463,POINT (30288.235 38229.067),12.495004,934.249034,148.674999,483.224591,10
1,2023-01,ANG MO KIO,2 ROOM,04 TO 06,49.0,Improved,1977,300000.0,2023,46,1.367908,103.847714,29602.047153,38881.891694,POINT (29602.047 38881.892),12.611538,248.134757,747.311654,570.211463,10
2,2023-01,ANG MO KIO,2 ROOM,04 TO 06,44.0,Improved,1978,280000.0,2023,45,1.366227,103.850086,29865.998046,38695.970271,POINT (29865.998 38695.97),12.542545,358.374108,687.704391,343.038217,7
3,2023-01,ANG MO KIO,2 ROOM,07 TO 09,44.0,Improved,1978,282000.0,2023,45,1.366227,103.850086,29865.998046,38695.970271,POINT (29865.998 38695.97),12.549662,358.374108,687.704391,343.038217,7
4,2023-01,ANG MO KIO,2 ROOM,01 TO 03,45.0,Improved,1986,289800.0,2023,37,1.374001,103.836432,28346.433332,39555.534275,POINT (28346.433 39555.534),12.576946,73.660649,674.629682,1353.591387,11


**OLS (Ordinary Least Squares)**

OLS is a fundamental regression method for estimating the relationship between a dependent variable and one or more independent variables.

**Core Idea** OLS finds the best-fitting line (or hyperplane) through a set of data points by minimizing the sum of squared residuals -- the differences between observed and predicted values.

**Mathematically** $$\min_\beta\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

where $$\hat{y}_i=\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+\dots+\beta_kx_{ki}$$

OLS chooses the coefficients $\beta_0, \beta_1, \dots, \beta_k$ that make the predicted values $\hat{y}_i$ as close as possible to the observed $y_i$ in this squared-error sense.
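
As a minimal sketch of the minimization above, the coefficients can be recovered on synthetic data with NumPy's least-squares solver. The data, the true coefficients and the name `X_demo` are made up for illustration and are not taken from the HDB dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y = 1.0 + 2.0*x1 - 0.5*x2 + noise
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept beta_0
X_demo = np.column_stack([np.ones(n), x1, x2])

# lstsq returns the beta that minimizes sum((y - X_demo @ beta)**2)
beta_hat, *_ = np.linalg.lstsq(X_demo, y, rcond=None)

residuals = y - X_demo @ beta_hat
print("beta_hat:", beta_hat.round(3))
print("sum of squared residuals:", float(residuals @ residuals))
```

The estimates should land close to the coefficients used to generate the data; `sm.OLS` solves the same problem and adds standard errors and test statistics on top.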

**In Python, `statsmodels` reports:**
- coefficients ($\beta$ values)
- standard errors
- R-squared (fit quality)
- t-tests & p-values (significance)

**Key Assumptions of OLS**
- `Linearity`: the relationship between predictors and outcome is linear.
- `Independence`: residuals are independent.
- `Homoscedasticity`: residuals have constant variance.
- `Normality`: residuals are normally distributed.
- `No multicollinearity`: predictors are not too strongly correlated with each other.

**Estimate**

$\log(\text{price})=\beta_0+\beta_1 D_{mrt}+\beta_2 D_{hcen}+\beta_3 D_{scen}+\beta_4\,\text{BusCount400m}+\beta_5\,\text{ResaleAge}+\epsilon$

In [4]:
import statsmodels.api as sm
X = hdb[['dist_mrt', 'dist_hcen', 'dist_scen', 'bus_count_400m', 'resale_age']]
X = sm.add_constant(X)
y = hdb['log_price']
ols = sm.OLS(y, X).fit()
print(ols.summary())

                            OLS Regression Results                            
Dep. Variable:              log_price   R-squared:                       0.181
Model:                            OLS   Adj. R-squared:                  0.181
Method:                 Least Squares   F-statistic:                     1139.
Date:                Mon, 10 Nov 2025   Prob (F-statistic):               0.00
Time:                        15:57:46   Log-Likelihood:                -2655.4
No. Observations:               25760   AIC:                             5323.
Df Residuals:                   25754   BIC:                             5372.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             13.4475      0.008   1598.

**Overall Model Fit**

| Metric                            | Interpretation                                                                                                                                                                                            |
| --------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **R-squared = 0.181**             | The model explains **18.1% of the variation** in HDB log prices. That is a modest share, suggesting that factors not in this model (e.g. floor area, lease left, town effects) play major roles. |
| **Adj. R² = 0.181**               | Adjusted for the number of predictors. It matches R² to three decimals because 5 predictors are negligible next to 25,760 observations; it does not show that every predictor contributes (`bus_count_400m` is not significant). |
| **F-statistic = 1139, p = 0.000** | The model as a whole is **statistically significant**. At least one predictor has a non-zero effect.                                                                                                      |
| **Observations = 25,760**         | Very large sample, so even small effects can become statistically significant.                                                                                                                            |


**Coefficients Interpretation**
| Variable           | Coefficient | Sign                 | p-value | Interpretation                                                                                                                                                             |
| ------------------ | ----------- | -------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **const**          | 13.4475     | +                    | 0.000   | Baseline log-price when all other variables = 0.                                                                                                                           |
| **dist_mrt**       | -7.16×10⁻⁵  | **–**                | 0.000   | For each additional meter **farther from the nearest MRT**, log(price) decreases by 0.0000716 (roughly 7% per kilometer), meaning **flats closer to MRT are more expensive**. |
| **dist_hcen**      | +4.99×10⁻⁵  | **+**                | 0.000   | Surprisingly, prices **increase** slightly as distance from healthcare centers grows. Possibly due to confounding (e.g. hospitals located in lower-cost or older estates). |
| **dist_scen**      | -1.53×10⁻⁵  | **–**                | 0.000   | Flats closer to recreation/sports facilities are **slightly more expensive**.                                                                                              |
| **bus_count_400m** | -0.0005     | ns (not significant) | 0.324   | Number of bus stops within 400m has **no statistically significant effect** on HDB prices once MRT proximity and age are controlled for.                                   |
| **resale_age**     | -0.0081     | **–**                | 0.000   | Each additional year of building age reduces log(price) by about **0.0081**, meaning roughly **0.8% lower price per year**, holding other factors constant.                |


**Diagnostics and Warnings**
| Diagnostic                            | Meaning                                                                                                                                                      |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Durbin–Watson = 0.429**             | Indicates **strong positive autocorrelation** between residuals of consecutive rows. The rows appear to be grouped by month and town, so this likely reflects neighboring flats having similar prices, but Durbin–Watson is a serial test, not a spatial one. |
| **Condition No. = 6.75e+03**          | Can signal **multicollinearity**, but can also arise simply because the predictors are on very different scales (distances in meters next to counts and years). Check `X.corr()` or VIFs before concluding. |
| **Omnibus/Jarque-Bera tests (p ≈ 0)** | Normality of the residuals is rejected. With a sample this large the tests flag even small departures, here some skewness. |
| **Kurtosis ≈ 3.3**                    | Slightly heavier tails than normal distribution.                                                                                                             |
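
The Durbin–Watson statistic compares each residual with the one in the previous row, so its value depends on how the rows are ordered. The sketch below uses synthetic residuals rather than the HDB model: it computes the statistic from its definition and shows that shuffling the rows pushes it back toward 2. `durbin_watson_stat` is a hypothetical helper name chosen for this example.

```python
import numpy as np

def durbin_watson_stat(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 mean no serial correlation."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(42)

# Synthetic AR(1) residuals: each value is 0.8 * previous + fresh noise
n = 5000
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal()

dw_ordered = durbin_watson_stat(e)                    # roughly 2 * (1 - 0.8)
dw_shuffled = durbin_watson_stat(rng.permutation(e))  # row order destroyed

print(f"ordered rows:  DW = {dw_ordered:.2f}")
print(f"shuffled rows: DW = {dw_shuffled:.2f}")
```

A low value on the HDB model therefore says that adjacent rows have similar residuals; whether that is a spatial effect depends on how the file is sorted.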


**Substantive Interpretation**
- Accessibility matters: Closer to MRT and recreation areas → higher prices.
- Age matters: resale_age is highly significant, at about 0.8% lower price per year of age. Its coefficient cannot be compared directly with the distance coefficients, which are per meter; the predictors would need to be standardized to rank effects by size.
- Bus stops add little: bus_count_400m is not significant once MRT distance and age are in the model, which is consistent with Singapore’s high integration of bus and MRT networks.
- Healthcare proximity effect is reversed: Possibly a socio-spatial artifact — hospitals may cluster in older, cheaper estates.

In [9]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
[variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

[25.325834987682676,
 1.0030270383090514,
 1.007136428593453,
 1.034052903207726,
 1.0092839366912048,
 1.0310478174403794]

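All predictor VIFs are close to 1 (the first value, 25.3, belongs to the constant), so multicollinearity is not a concern in this baseline model.

The remaining structural attributes (`storey_range`, `flat_type`, `flat_model`) need preprocessing before they can be added to the regression, because they are categorical strings, while OLS requires numerical inputs.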

In [33]:
hdb['storey_range'].unique()

array(['01 TO 03', '04 TO 06', '07 TO 09', '25 TO 27', '10 TO 12',
       '13 TO 15', '16 TO 18', '22 TO 24', '19 TO 21', '34 TO 36',
       '28 TO 30', '37 TO 39', '31 TO 33', '40 TO 42', '43 TO 45',
       '46 TO 48', '49 TO 51'], dtype=object)

In [31]:
hdb.columns

Index(['month', 'town', 'flat_type', 'storey_range', 'floor_area_sqm',
       'flat_model', 'lease_commence_date', 'resale_price', 'resale_year',
       'resale_age', 'LAT', 'LNG', 'X', 'Y', 'geometry', 'log_price',
       'dist_mrt', 'dist_hcen', 'dist_scen', 'bus_count_400m'],
      dtype='object')

In [34]:
# Convert `storey_range` to numeric midpoint
"""
The storey_range values are all well-formatted like "01 TO 03", "04 TO 06", etc.
We can safely convert each range into its midpoint floor number.
"""

import re

def storey_mid(x):
    nums = re.findall(r'\d+', x)
    if len(nums) == 2:
        low, high = map(int, nums)
        return (low + high)/2
    elif len(nums) == 1:
        return int(nums[0])
    else:
        return None

hdb['storey_mid'] = hdb['storey_range'].apply(storey_mid)
print(hdb[['storey_range', 'storey_mid']].drop_duplicates().sort_values('storey_mid'))

      storey_range  storey_mid
0         01 TO 03         2.0
1         04 TO 06         5.0
3         07 TO 09         8.0
20        10 TO 12        11.0
22        13 TO 15        14.0
81        16 TO 18        17.0
96        19 TO 21        20.0
92        22 TO 24        23.0
6         25 TO 27        26.0
333       28 TO 30        29.0
369       31 TO 33        32.0
115       34 TO 36        35.0
368       37 TO 39        38.0
414       40 TO 42        41.0
580       43 TO 45        44.0
1576      46 TO 48        47.0
15615     49 TO 51        50.0
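
The same midpoint conversion can also be written without a Python-level function. This is an alternative sketch on a small hand-made Series, assuming every valid value follows the `NN TO NN` pattern seen above; values that do not match become `NaN`. The names `storey_demo` and `storey_mid_demo` are illustrative only.

```python
import pandas as pd

# Small illustrative sample, not the full hdb column
storey_demo = pd.Series(["01 TO 03", "04 TO 06", "49 TO 51", "unknown"])

# Capture the two floor numbers, cast to float, and average them row-wise
bounds = storey_demo.str.extract(r"(\d+)\s+TO\s+(\d+)").astype(float)
storey_mid_demo = bounds.mean(axis=1)

print(storey_mid_demo.tolist())
```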


In [38]:
hdb['flat_type'].unique()

array(['2 ROOM', '3 ROOM', '4 ROOM', '5 ROOM', 'EXECUTIVE',
       'MULTI-GENERATION', '1 ROOM'], dtype=object)

In [53]:
# Create dummy variables for `flat_type` and `flat_model`
"""
Since both are categorical strings, OLS requires you to convert them into one-hot (dummy) variables.
"""

import pandas as pd

flat_type_dummies = pd.get_dummies(hdb['flat_type'], prefix='type', drop_first=True, dtype=float) # `drop_first=True` avoids the dummy variable trap
flat_model_dummies = pd.get_dummies(hdb['flat_model'], prefix='model', drop_first=True, dtype=float)

print(flat_type_dummies.columns)
print(flat_model_dummies.columns)

Index(['type_2 ROOM', 'type_3 ROOM', 'type_4 ROOM', 'type_5 ROOM',
       'type_EXECUTIVE', 'type_MULTI-GENERATION'],
      dtype='object')
Index(['model_3Gen', 'model_Adjoined flat', 'model_Apartment', 'model_DBSS',
       'model_Improved', 'model_Improved-Maisonette', 'model_Maisonette',
       'model_Model A', 'model_Model A-Maisonette', 'model_Model A2',
       'model_Multi Generation', 'model_New Generation',
       'model_Premium Apartment', 'model_Premium Apartment Loft',
       'model_Simplified', 'model_Standard', 'model_Terrace', 'model_Type S1',
       'model_Type S2'],
      dtype='object')
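
With `drop_first=True`, `pd.get_dummies` drops the first category in sorted order, and that category becomes the reference level against which the dummy coefficients are read. A small sketch on a hand-made Series (not the real column) makes the rule visible:

```python
import pandas as pd

# Illustrative values in arbitrary order
flat_type_demo = pd.Series(["3 ROOM", "1 ROOM", "2 ROOM", "3 ROOM"])

dummies_demo = pd.get_dummies(flat_type_demo, prefix="type",
                              drop_first=True, dtype=float)

# The reference level is the sorted-first category, which gets no column
reference = sorted(flat_type_demo.unique())[0]

print("reference level:", reference)
print("dummy columns:  ", dummies_demo.columns.tolist())
```

On the notebook's data, `sorted(hdb['flat_model'].unique())[0]` would name the reference flat model in the same way.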


In [67]:
"""
Test data error type
"""
# cols_base = ['dist_mrt', 'dist_hcen', 'dist_scen', 'bus_count_400m',
#              'resale_age', 'floor_area_sqm', 'storey_mid']

# X = pd.concat([hdb[cols_base], flat_type_dummies, flat_model_dummies], axis=1)

# # Coerce everything to numeric, kill infs
# X = X.apply(pd.to_numeric, errors='coerce')
# y = pd.to_numeric(hdb['log_price'], errors='coerce')

# X = X.replace([np.inf, -np.inf], np.nan)
# y = y.replace([np.inf, -np.inf], np.nan)

# # Drop rows with any NA in X or y (keep aligned)
# valid = X.notna().all(axis=1) & y.notna()
# X = X.loc[valid]
# y = y.loc[valid]

# # Add constant and fit
# X = sm.add_constant(X)
# model = sm.OLS(y.astype(float), X.astype(float)).fit()
# print(model.summary())

'\nTest data error type\n'

In [68]:
# Combine the original columns (including storey_mid) with the dummy variables
hdb_ols = pd.concat(
    [hdb,
    flat_type_dummies,
    flat_model_dummies],
    axis=1
)

# hdb_ols.to_parquet("processed_n/hdb_ols.parquet")

hdb_ols.head()

Unnamed: 0,month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,resale_year,resale_age,...,model_Model A2,model_Multi Generation,model_New Generation,model_Premium Apartment,model_Premium Apartment Loft,model_Simplified,model_Standard,model_Terrace,model_Type S1,model_Type S2
0,2023-01,ANG MO KIO,2 ROOM,01 TO 03,44.0,Improved,1979,267000.0,2023,44,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2023-01,ANG MO KIO,2 ROOM,04 TO 06,49.0,Improved,1977,300000.0,2023,46,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2023-01,ANG MO KIO,2 ROOM,04 TO 06,44.0,Improved,1978,280000.0,2023,45,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2023-01,ANG MO KIO,2 ROOM,07 TO 09,44.0,Improved,1978,282000.0,2023,45,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2023-01,ANG MO KIO,2 ROOM,01 TO 03,45.0,Improved,1986,289800.0,2023,37,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
# Select variables for OLS
"""
Three groups of predictors:
1. accessibility: dist_mrt, dist_hcen, dist_scen, bus_count_400m
2. structure: floor_area_sqm, resale_age, storey_mid
3. dummies: all columns from dummy creation (flat_type, flat_model)
"""
cols = ['dist_mrt', 'dist_hcen', 'dist_scen', 'bus_count_400m', 
        'resale_age', 'floor_area_sqm', 'storey_mid'] \
        + list(flat_type_dummies.columns) \
        + list(flat_model_dummies.columns)

X = hdb_ols[cols]
y = hdb_ols['log_price']
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:              log_price   R-squared:                       0.793
Model:                            OLS   Adj. R-squared:                  0.793
Method:                 Least Squares   F-statistic:                     3187.
Date:                Wed, 22 Oct 2025   Prob (F-statistic):               0.00
Time:                        11:40:58   Log-Likelihood:                 15082.
No. Observations:               25760   AIC:                        -3.010e+04
Df Residuals:                   25728   BIC:                        -2.984e+04
Df Model:                          31                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const           

**Model Overview**
| Metric                       | Interpretation                                                                                                                                      |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| **R-squared = 0.793**        | The model explains about **79.3 %** of the variation in `log_price`, meaning the added structure variables dramatically improved explanatory power. |
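| **Adj. R² = 0.793**          | Almost identical to R², so the penalty for extra regressors is negligible at this sample size. 32 regressors were supplied; `Df Model` is 31 because one column is an exact linear combination of others. |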
| **F-stat = 3187, p = 0.000** | The model as a whole is **highly significant**.                                                                                                     |
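| **n = 25,760**               | Large sample, so standard errors are small. The nonrobust standard errors still assume independent residuals, which the Durbin–Watson value for this model calls into question. |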

**This suggests that flat structure and physical characteristics account for most of the explained variation in HDB resale prices.**


**Key Variable Effects**

Accessibility Variables

| Variable           | Coef     | Sign                    | Meaning                                                                                                                   |
| ------------------ | -------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| **dist_mrt**       | –7.0e-05 | **Negative**, p < 0.001 | Each meter farther from MRT → slightly lower price, roughly 7% per kilometer.                                             |
| **dist_hcen**      | +2.0e-05 | Positive, p < 0.001     | Slightly higher prices farther from healthcare centers — possible socioeconomic pattern (hospitals often in older areas). |
| **dist_scen**      | –3.6e-05 | **Negative**, p < 0.001 | Nearness to recreational areas raises price.                                                                              |
| **bus_count_400m** | –0.0060  | **Negative**, p < 0.001 | Many nearby bus stops may correlate with dense, lower-value neighborhoods (redundant with MRT access).                    |


Structural Variables

| Variable           | Coef    | Interpretation                                                     |
| ------------------ | ------- | ------------------------------------------------------------------ |
| **resale_age**     | –0.0067 | Each extra year older → price ≈ 0.67 % lower (very significant).   |
| **floor_area_sqm** | +0.0078 | Each extra m² → price ≈ 0.78 % higher (strong size premium).       |
| **storey_mid**     | +0.0121 | Each higher floor → ≈ 1.2 % higher price (clear vertical premium). |

**Together, these structural features explain most of the improvement in R².**

Flat Type Effects

Baseline (omitted dummy) is `1 ROOM`: it is the first category in sorted order and the only flat type without a `type_*` column.
Compared with 1-ROOM:

| Type             | Coef                   | Interpretation                                        |
| ---------------- | ---------------------- | ----------------------------------------------------- |
| 2 ROOM           | –0.099 (weak)          | ~9 % lower than baseline, but not significant at 5 %. |
| 3 ROOM           | +0.067                 | Slightly higher, not sig.                             |
| 4 ROOM           | +0.131 (**p = 0.02**)  | ~13 % higher than 1-ROOM, significant.                |
| 5 ROOM           | +0.107 (marginal)      | ~11 % higher.                                         |
| EXECUTIVE        | +0.087 (ns)            | Higher, but within noise.                             |
| MULTI-GENERATION | +0.131 (**p = 0.009**) | ~13 % higher.                                         |

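**Interpretation: 4-room and multi-generation flats carry significant premiums over 1-room flats; the other type coefficients are not clearly different from zero. Floor area is already controlled for, so these are premiums at a given size.**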

Flat Model Effects

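Relative to the dropped reference model, which is the `flat_model` value that sorts first (ahead of `3Gen`). It cannot be Improved or Standard, since both have their own dummy columns.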

| Model                      | Coef         | Meaning                                                     |
| -------------------------- | ------------ | ----------------------------------------------------------- |
| **DBSS**                   | +0.287       | ≈ 33 % higher; DBSS (Design, Build and Sell Scheme) flats were built by private developers. |
| **Premium Apartment Loft** | +0.363       | ≈ 44 % higher (luxury premium).                             |
| **Terrace**                | +0.741       | ≈ +110 % (exp(0.741) ≈ 2.1×), a huge premium.               |
| **Type S1 / S2**           | +0.43 – 0.52 | Large positive effects (≈ 54–68 % higher).                  |
| **Maisonette variants**    | +0.25 – 0.28 | ≈ 28–32 % higher than baseline.                             |
| **Model A2**               | –0.061       | ≈ 6 % lower than baseline.                                  |

**Interpretation: high-end or rare models (Terrace, Loft, Maisonette, DBSS) strongly outperform standard flats, even controlling for size and age.**
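
Because the dependent variable is `log_price`, a dummy coefficient $\beta$ corresponds to a price ratio of $e^{\beta}$, that is a change of $100\,(e^{\beta}-1)$ percent. The sketch below applies this conversion to coefficients quoted in the flat-model table; `pct_effect` is a hypothetical helper name.

```python
import math

def pct_effect(beta):
    """Percentage change in price implied by a log-price coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)

# Coefficients as quoted in the flat-model table
quoted = {"DBSS": 0.287, "Premium Apartment Loft": 0.363,
          "Terrace": 0.741, "Model A2": -0.061}

for name, beta in quoted.items():
    print(f"{name:>24}: {pct_effect(beta):+.1f} %")
```

For small coefficients the shortcut $100\,\beta$ is close enough, which is why –0.0067 can be read directly as about –0.67 % per year; for large ones such as Terrace the exact conversion matters.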


**Diagnostics & Warnings**

| Diagnostic                  | Interpretation                                                                                                                                                                             |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
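| **Durbin–Watson = 0.699**   | Positive autocorrelation between consecutive rows, so residuals are not independent (common in housing data). Durbin–Watson is not itself a spatial test, but the result suggests checking a **spatial or GWR model**. |
| **Cond. No. = 1.06 × 10¹⁶** | Extremely high: the design matrix is numerically singular, which points to **perfect or near-perfect multicollinearity** among the dummy variables (flat types and models overlap). The overall fit is unaffected, but coefficients on the collinear columns are not uniquely determined and standard errors are inflated. |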
| **Omnibus/JB p ≈ 0**        | Residuals not perfectly normal (expected with large N).                                                                                                                                    |
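
A condition number of this size usually means that the design matrix is numerically singular. One way to check, sketched here on a tiny made-up matrix rather than the real `X`, is to compare the matrix rank with the number of columns: a rank deficit means at least one column is an exact linear combination of others. All values and names below are invented for the illustration.

```python
import numpy as np

# Made-up design: constant, one numeric predictor, and two dummies that
# flag exactly the same rows (mimicking a type dummy and a model dummy)
const = np.ones(6)
area = np.array([44.0, 67.0, 92.0, 110.0, 165.0, 171.0])
type_mg = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])
model_mg = type_mg.copy()

X_demo = np.column_stack([const, area, type_mg, model_mg])

rank = np.linalg.matrix_rank(X_demo)
print(f"columns = {X_demo.shape[1]}, rank = {rank}")
print(f"condition number = {np.linalg.cond(X_demo):.3g}")
```

On the notebook's objects the analogous check would compare `np.linalg.matrix_rank(X.values)` with `X.shape[1]`.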

**Practical Takeaways**
1. Structural factors dominate price variation; accessibility plays a secondary but still significant role.
2. Age and size have clear, interpretable economic meanings (–0.67 % per year, +0.78 % per m²).
3. High-end flat models yield large premiums beyond physical size — capturing design or exclusivity effects.
4. Multicollinearity warning: you might remove redundant dummies (e.g., rare flat types) or use regularized regression (Ridge/Lasso) for stability.
5. Next step: test a spatial error/lag model or Geographically Weighted Regression (GWR) to address the Durbin–Watson < 2 issue.

The OLS model now captures both accessibility and structural determinants of HDB resale prices:

**log(Price) = f(Age ↓, Area ↑, Floor ↑, MRT nearer ↑, recreation nearer ↑, high-end model ↑)**

explaining nearly 80 % of price variation.
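
Durbin–Watson only looks at neighboring rows, so a statistic built on spatial neighbors is a more direct check before moving to a spatial model. The following is a rough sketch of global Moran's I with binary distance-band weights, computed on a synthetic grid with synthetic residuals. The function name `morans_i`, the 1.5-unit band and the data are all assumptions for illustration; on the real data the `X`/`Y` coordinates and the model residuals would take their place, and a dedicated spatial statistics library would normally be used.

```python
import numpy as np

def morans_i(values, coords, band):
    """Global Moran's I with binary weights: w_ij = 1 if 0 < distance <= band."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    # Pairwise Euclidean distances (fine for small n; O(n^2) memory)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = ((dist > 0) & (dist <= band)).astype(float)
    return float(len(z) / w.sum() * (z @ w @ z) / (z @ z))

rng = np.random.default_rng(1)

# Synthetic 20 x 20 grid of locations
gx, gy = np.meshgrid(np.arange(20.0), np.arange(20.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])

# Residuals with a smooth spatial trend plus noise
resid = np.sin(coords[:, 0] / 3.0) + 0.3 * rng.normal(size=len(coords))

i_clustered = morans_i(resid, coords, band=1.5)
i_shuffled = morans_i(rng.permutation(resid), coords, band=1.5)

print(f"spatially clustered residuals: I = {i_clustered:.2f}")
print(f"shuffled residuals:            I = {i_shuffled:.2f}")
```

Values well above zero indicate that nearby locations have similar residuals; values near zero indicate no spatial pattern. The dense distance matrix used here would not scale to 25,760 points without a sparse neighbor structure.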


In [69]:
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

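# Assume X already includes all predictors (with constant)
# and that model was built as: model = sm.OLS(y, X).fit()

# Drop the constant column before computing VIFs.
# Caveat: variance_inflation_factor does not add an intercept itself, so
# without `const` these are uncentered VIFs, which come out larger for
# predictors whose mean is far from zero (the earlier check kept `const`).
X_vif = X.drop(columns='const', errors='ignore')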

# Compute VIF for each column
vif_data = pd.DataFrame()
vif_data['Variable'] = X_vif.columns
vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i)
                   for i in range(X_vif.shape[1])]

# Sort by descending VIF for clarity
vif_data.sort_values('VIF', ascending=False, inplace=True)
print(vif_data)

  vif = 1. / (1. - r_squared_i)


                        Variable         VIF
23        model_Multi Generation         inf
12         type_MULTI-GENERATION         inf
9                    type_4 ROOM  426.785073
5                 floor_area_sqm  371.169182
10                   type_5 ROOM  271.679665
20                 model_Model A  267.689129
8                    type_3 ROOM  202.032153
17                model_Improved  155.564452
11                type_EXECUTIVE   85.324102
24          model_New Generation   78.869236
25       model_Premium Apartment   69.225327
27              model_Simplified   25.593470
15               model_Apartment   24.494680
19              model_Maisonette   19.393021
7                    type_2 ROOM   18.846154
28                model_Standard   17.945829
3                 bus_count_400m   10.389755
4                     resale_age   10.382857
16                    model_DBSS    8.602434
22                model_Model A2    8.339069
2                      dist_scen    5.065646
1         

$R^2 = 0.79$ is high, but the VIFs confirm redundancy between flat type, flat model, and floor area, so individual coefficients in those groups should be read with caution.

To fix it:
- drop one of the overlapping sets
- or apply Ridge regression to stabilize coefficients.
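
Both options can be illustrated on made-up data with two identical predictor columns, a stand-in for the perfectly collinear dummy pair. This is a minimal sketch of closed-form ridge regression, $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$, not the notebook's model. The penalty `lam = 1.0` is an arbitrary illustrative choice, and the intercept is omitted because the synthetic data are centered.

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up data: y depends on x, and x enters the design twice
n = 200
x = rng.normal(size=n)
x = x - x.mean()
y = 2.0 * x + rng.normal(scale=0.1, size=n)
y = y - y.mean()
X_dup = np.column_stack([x, x])        # perfectly collinear columns

# Plain OLS is not identified: X'X has rank 1, so it cannot be inverted
print("rank of X'X:", np.linalg.matrix_rank(X_dup.T @ X_dup))

# Option 1: drop one of the redundant columns
beta_drop = np.linalg.lstsq(X_dup[:, :1], y, rcond=None)[0]

# Option 2: ridge adds lam to the diagonal, which makes the system solvable
lam = 1.0
beta_ridge = np.linalg.solve(X_dup.T @ X_dup + lam * np.eye(2), X_dup.T @ y)

print("drop one column:", beta_drop.round(3))
print("ridge:          ", beta_ridge.round(3))
```

Ridge splits the effect evenly between the duplicated columns, so the coefficients are stable but no longer have the one-variable interpretation that dropping a column preserves.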

| Source                                                 | Explanation                                                                                                                                                    |
| ------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **`type_MULTI-GENERATION` ↔ `model_Multi Generation`** | These two are *perfectly collinear* — they describe the same flats in different categorical sets. Hence both VIF = ∞.                                          |
| **`flat_type` ↔ `flat_model`**                         | Some flat models exist **only for specific flat types**, so their dummy variables are strongly correlated.                                                     |
| **`floor_area_sqm` ↔ flat_type**                       | Flat types roughly determine floor area (e.g., 2-Room ≈ 40 m², 5-Room ≈ 110 m²). So `floor_area_sqm` and `type_*` dummies encode nearly identical information. |
| **`type_4 ROOM`, `type_5 ROOM`, `model_Model A`**      | Mid-range flats and models overlap strongly — VIF 200–400 range indicates almost complete redundancy.                                                          |




**Interpretation of inf and very high VIFs**


| Range       | Meaning                                                                                             |
| ----------- | --------------------------------------------------------------------------------------------------- |
| `VIF = inf` | Perfect linear dependency — one variable is an exact linear combination of others. Must remove one. |
| `VIF > 100` | Catastrophic multicollinearity — coefficients become unstable; standard errors blow up.             |
| `VIF 10–50` | Manageable but still problematic if many predictors.                                                |
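
One caveat when reading these thresholds: a VIF is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors, and the result depends on whether that auxiliary regression includes an intercept. The sketch below, on made-up independent predictors with nonzero means, computes the VIF by hand both ways. `vif_by_hand` is a hypothetical helper, and the variant without an intercept is intended to mimic what happens when the constant column is dropped before the calculation.

```python
import numpy as np

def vif_by_hand(X, j, with_intercept=True):
    """VIF of column j from an auxiliary least-squares regression on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    if with_intercept:
        others = np.column_stack([np.ones(len(target)), others])
    fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
    ss_res = np.sum((target - fitted) ** 2)
    # Centered total sum of squares with an intercept, uncentered without
    ss_tot = np.sum((target - target.mean()) ** 2) if with_intercept else np.sum(target ** 2)
    return float(ss_tot / ss_res)  # equals 1 / (1 - R^2)

rng = np.random.default_rng(3)

# Two independent predictors with means far from zero (illustrative values)
X_demo = np.column_stack([rng.normal(50, 5, size=2000),
                          rng.normal(10, 2, size=2000)])

vif_centered = vif_by_hand(X_demo, 0, with_intercept=True)
vif_uncentered = vif_by_hand(X_demo, 0, with_intercept=False)

print(f"with intercept:    VIF = {vif_centered:.2f}")
print(f"without intercept: VIF = {vif_uncentered:.2f}")
```

Independent predictors get a VIF near 1 only when the intercept is included. This may be part of the reason the accessibility variables show VIFs near 1 in the first check, which kept `const`, and larger values in the second.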


In [72]:
print("Shape:", hdb_ols.shape)
print("CRS:", hdb_ols.crs)
print("Columns:", hdb_ols.columns.tolist())
hdb.head(3)

Shape: (25760, 46)
CRS: {"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json", "type": "ProjectedCRS", "name": "SVY21 / Singapore TM", "base_crs": {"name": "SVY21", "datum": {"type": "GeodeticReferenceFrame", "name": "SVY21", "ellipsoid": {"name": "WGS 84", "semi_major_axis": 6378137, "inverse_flattening": 298.257223563}}, "coordinate_system": {"subtype": "ellipsoidal", "axis": [{"name": "Geodetic latitude", "abbreviation": "Lat", "direction": "north", "unit": "degree"}, {"name": "Geodetic longitude", "abbreviation": "Lon", "direction": "east", "unit": "degree"}]}, "id": {"authority": "EPSG", "code": 4757}}, "conversion": {"name": "Singapore Transverse Mercator", "method": {"name": "Transverse Mercator", "id": {"authority": "EPSG", "code": 9807}}, "parameters": [{"name": "Latitude of natural origin", "value": 1.36666666666667, "unit": "degree", "id": {"authority": "EPSG", "code": 8801}}, {"name": "Longitude of natural origin", "value": 103.833333333333, "unit": "degree", "id

Unnamed: 0,month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,resale_year,resale_age,...,LNG,X,Y,geometry,log_price,dist_mrt,dist_hcen,dist_scen,bus_count_400m,storey_mid
0,2023-01,ANG MO KIO,2 ROOM,01 TO 03,44.0,Improved,1979,267000.0,2023,44,...,103.85388,30288.234663,38229.067463,POINT (30288.235 38229.067),12.495004,934.249034,148.674999,483.224591,10,2.0
1,2023-01,ANG MO KIO,2 ROOM,04 TO 06,49.0,Improved,1977,300000.0,2023,46,...,103.847714,29602.047153,38881.891694,POINT (29602.047 38881.892),12.611538,248.134757,747.311654,570.211463,10,5.0
2,2023-01,ANG MO KIO,2 ROOM,04 TO 06,44.0,Improved,1978,280000.0,2023,45,...,103.850086,29865.998046,38695.970271,POINT (29865.998 38695.97),12.542545,358.374108,687.704391,343.038217,7,5.0
