### Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import warnings

from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import Ridge, BayesianRidge, LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

sb.set()
warnings.filterwarnings('ignore')



### Importing dataset

In [20]:
data = pd.read_csv("cleaned-Housing.csv")
train_data = pd.read_csv("train-Housing.csv")
test_data = pd.read_csv("test-Housing.csv")

We label-encode `storey_range` because its categories have an ordering-like structure. The encoded column is then used as a numeric predictor in our regressions.

In [37]:
print(data['storey_range'].unique())
# Encode storey_range.
# Fit one encoder per column on the full dataset and reuse it for the splits,
# so that train and test share the same category-to-integer mapping.
categorical_cols = ['storey_range']
for col in categorical_cols:
    encoder = LabelEncoder()
    encoder.fit(data[col])
    data[col] = encoder.transform(data[col])
    train_data[col] = encoder.transform(train_data[col])
    test_data[col] = encoder.transform(test_data[col])

[ 5  2  4  0  7 10  8 13 12 15 16 20 19 17  3  1  6  9 11 14 18]
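A minimal sketch of why the encoder should be fitted once and reused, rather than re-fitted on each split. The toy frames and the raw labels such as `"01 TO 03"` are assumptions for illustration, not values taken from the dataset above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy splits with assumed raw labels (hypothetical, for illustration only)
toy_train = pd.DataFrame({"storey_range": ["01 TO 03", "04 TO 06", "10 TO 12"]})
toy_test = pd.DataFrame({"storey_range": ["10 TO 12", "01 TO 03"]})

# Fit once on every label that can occur, then reuse the same mapping
encoder = LabelEncoder()
encoder.fit(pd.concat([toy_train["storey_range"], toy_test["storey_range"]]))
toy_train["storey_code"] = encoder.transform(toy_train["storey_range"])
toy_test["storey_code"] = encoder.transform(toy_test["storey_range"])

# Re-fitting on the test split alone assigns different codes to the same label
refit_codes = LabelEncoder().fit_transform(toy_test["storey_range"])

print("shared mapping codes:", list(toy_test["storey_code"]))
print("re-fitted codes:     ", list(refit_codes))
```

With a shared encoder, `"10 TO 12"` keeps the same code in both splits. When the encoder is re-fitted on a split that lacks some categories, the codes shift and the model sees inconsistent inputs.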


These are the four candidate variables that we think may have a significant impact on `resale_price`:

In [42]:
candidates = ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']
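One quick way to screen candidate predictors is to look at their correlation with the target before fitting any model. The sketch below uses a small made-up frame (`toy`) with assumed values, so the numbers it prints say nothing about the real dataset. On the notebook's data the same call would be applied to `data[candidates + ['resale_price']]`.

```python
import pandas as pd

# Hypothetical toy data, only meant to show the call pattern
toy = pd.DataFrame({
    "floor_area_sqm": [31.0, 45.0, 67.0, 73.0, 90.0, 110.0],
    "storey_range": [5, 2, 4, 0, 7, 3],
    "resale_price": [9000.0, 20000.0, 40000.0, 47200.0, 60000.0, 80000.0],
})
toy_candidates = ["floor_area_sqm", "storey_range"]

# Pearson correlation of each candidate with the target
corr_with_price = (
    toy[toy_candidates + ["resale_price"]]
    .corr()["resale_price"]
    .drop("resale_price")
)
print(corr_with_price.sort_values(ascending=False))
```

Pearson correlation only measures linear association, so a low value does not rule out a variable that matters in a non-linear model.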

In [26]:
data.head()

| Unnamed: 0 | month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price | resale_year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01 | ANG MO KIO | 0 | 309 | ANG MO KIO AVE 1 | 5 | 31.0 | 6 | 1977 | 9000.0 | 1990 |
| 1 | 1990-01 | ANG MO KIO | 0 | 309 | ANG MO KIO AVE 1 | 2 | 31.0 | 6 | 1977 | 6000.0 | 1990 |
| 2 | 1990-01 | ANG MO KIO | 0 | 309 | ANG MO KIO AVE 1 | 5 | 31.0 | 6 | 1977 | 8000.0 | 1990 |
| 3 | 1990-01 | ANG MO KIO | 0 | 309 | ANG MO KIO AVE 1 | 4 | 31.0 | 6 | 1977 | 6000.0 | 1990 |
| 4 | 1990-01 | ANG MO KIO | 2 | 216 | ANG MO KIO AVE 1 | 2 | 73.0 | 19 | 1976 | 47200.0 | 1990 |


### Linear Regression Model Function

In [38]:
def linear_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predict and evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Print the performance
    print(f'Linear Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse}')
    print(f'R-squared Score: {r2}\n')
    print()

In [56]:
for var in candidates:
    linear_regression([var])
linear_regression(candidates)

Linear Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 10759844588.638773
R-squared Score: 0.4078269189637921

Linear Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 13728028202.8894
R-squared Score: 0.24447154505923774

Linear Regression using ['storey_range']:
Mean Squared Error (MSE): 17759011196.92869
R-squared Score: 0.0226245100467376

Linear Regression using ['resale_year']:
Mean Squared Error (MSE): 11805276560.963236
R-squared Score: 0.35029108125115316

Linear Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error (MSE): 4838937896.31295
R-squared Score: 0.7336867889311198



The MSE values for 'floor_area_sqm' and 'lease_commence_date' alone are both high (about 1.08 × 10^10 and 1.37 × 10^10), indicating that the single-predictor models do not fit the data well. The R^2 value for 'floor_area_sqm' (0.41) is higher than that for 'lease_commence_date' (0.24), so 'floor_area_sqm' is the better single predictor of 'resale_price'. 'resale_year' (0.35) falls in between, and 'storey_range' (0.02) explains almost nothing on its own. No single variable explains even half of the variance in 'resale_price', whereas the four predictors together reach an R^2 of about 0.73.
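To make the two metrics concrete, here is a small sketch that computes MSE and R^2 by hand and compares them with the scikit-learn functions used above. The arrays `y_true` and `y_pred` are made-up numbers, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# MSE: mean of the squared residuals (in squared target units)
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because MSE is in squared units of the target, its size depends on the price scale, while R^2 is unitless and can be compared across predictors directly.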

### Ridge Regression Model Function

In [53]:
def ridge_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the Ridge Regression model
    ridge_reg_model = Ridge(alpha=1.0)
    ridge_reg_model.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred_ridge = ridge_reg_model.predict(X_test)

    # Evaluate the model performance
    mse_ridge = mean_squared_error(y_test, y_pred_ridge)
    r2_ridge = r2_score(y_test, y_pred_ridge)

    # Print out the model performance metrics for Ridge Regression
    print(f'Ridge Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse_ridge}')
    print(f'R-squared Score: {r2_ridge}\n')

In [55]:
for var in candidates:
    ridge_regression([var])
ridge_regression(candidates)

Ridge Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 10759844588.339684
R-squared Score: 0.4078269189802526

Ridge Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 13728028201.250164
R-squared Score: 0.24447154514945402

Ridge Regression using ['storey_range']:
Mean Squared Error (MSE): 17759011201.01285
R-squared Score: 0.022624509821963956

Ridge Regression using ['resale_year']:
Mean Squared Error (MSE): 11805276559.323626
R-squared Score: 0.35029108134138975

Ridge Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error (MSE): 4838937897.134691
R-squared Score: 0.7336867888858949



### Bayesian Ridge Regression Model Function

In [61]:
def bayesian_ridge_regression(predictor):  
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the model
    model = BayesianRidge()
    model.fit(X_train, y_train)
    
    # Predict and evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Print the performance
    print(f'Bayesian Ridge Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse}')
    print(f'R-squared Score: {r2}\n')
    print()

In [62]:
for var in candidates:
    bayesian_ridge_regression([var])
bayesian_ridge_regression(candidates)

Bayesian Ridge Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 10759844302.771444
R-squared Score: 0.4078269346966342


Bayesian Ridge Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 13728027770.633022
R-squared Score: 0.24447156884866972


Bayesian Ridge Regression using ['storey_range']:
Mean Squared Error (MSE): 17759012326.357716
R-squared Score: 0.022624447888082933


Bayesian Ridge Regression using ['resale_year']:
Mean Squared Error (MSE): 11805276374.580502
R-squared Score: 0.350291091508814


Bayesian Ridge Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error (MSE): 4838938019.043185
R-squared Score: 0.7336867821766039




The Bayesian Ridge model places prior distributions over the weights and tunes its regularization parameters automatically from the data. Despite this, its predictive power is essentially the same as that of standard Ridge Regression and ordinary Linear Regression: with all four predictors, each model reaches an R^2 of about 0.734 and an MSE of about 4.84 billion.
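A plausible reason the three linear models agree so closely is that, with many rows and few predictors, a small regularization penalty barely moves the coefficients. The sketch below illustrates this on synthetic data (the coefficients 3 and -2 and the sample size are assumptions). It also prints the precisions that `BayesianRidge` estimates, exposed as `alpha_` (noise) and `lambda_` (weights).

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression, Ridge

# Synthetic data: many samples, two predictors (assumed setup)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=2000)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "bayesian_ridge": BayesianRidge(),
}

# Fit each model and collect its coefficients
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}
for name, coef in coefs.items():
    print(name, np.round(coef, 3))

# Precisions learned by the Bayesian model
bayes = models["bayesian_ridge"]
print("alpha_ (noise precision):", bayes.alpha_)
print("lambda_ (weight precision):", bayes.lambda_)
```

Regularization tends to matter more when predictors are numerous, strongly correlated, or the sample is small, none of which applies to a four-predictor model on a large dataset.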

In [63]:
average_resale_price = data['resale_price'].mean()

print(f"The average resale price is: {average_resale_price}")

The average resale price is: 285461.57954690786


It is important to note that even though the MSE seems high, it can still be acceptable because resale prices vary widely. Housing prices, such as those for HDB flats, can range significantly. An MSE in the billions looks large, but the MSE is expressed in squared price units, so it should be judged against the price range rather than read directly. The square root of the MSE gives the Root Mean Squared Error (RMSE), which is on the same scale as the prices themselves. For the four-predictor models above, an MSE of about 4.84 billion corresponds to an RMSE of roughly 69,600, meaning a typical prediction error of about 69,600. Against the average resale price of about 285,000, that is an error of roughly 24%, which may be a reasonable margin for certain applications, like a quick market estimate, but leaves room for improvement.
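The conversion from MSE to RMSE can be checked directly. The sketch below plugs in two numbers reported earlier in this notebook: the MSE of the four-predictor linear model and the average resale price. The variable names are chosen here for illustration.

```python
import math

# Values copied from the outputs above
mse_all_predictors = 4838937896.31295
average_resale_price = 285461.57954690786

# RMSE is on the same scale as the prices
rmse = math.sqrt(mse_all_predictors)

# Express the typical error as a fraction of the average price
relative_error = rmse / average_resale_price

print(f"RMSE: {rmse:.0f}")
print(f"RMSE as a share of the average price: {relative_error:.1%}")
```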

Thus, we moved on to try more complex models like Gradient Boosting and Random Forest.

### Gradient Boosting Regressor Model

In [74]:
def gradient_boosting_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the Gradient Boosting Regressor model
    gb_regressor = GradientBoostingRegressor()
    gb_regressor.fit(X_train, y_train)

    # Make predictions
    y_pred = gb_regressor.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Gradient Boosting Regression using {predictor}:')
    print(f"Mean Squared Error: {mse}")
    print(f'R-squared Score: {r2}\n')
    print()

In [None]:
for var in candidates:
    gradient_boosting_regression([var])
gradient_boosting_regression(candidates)

Compared to linear models, this model is more robust and can capture interactions between different variables, such as 'floor_area_sqm' and 'flat_type', effectively. It also gives more flexibility to consider more variables, both categorical and numerical.

We can see that the MSE also decreased with this model. Even though it is still high, the absolute error is a smaller percentage of a high 'resale_price' value.
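The point about interactions can be illustrated on synthetic data, where the target is the product of two features and therefore has no purely additive linear structure. Everything in this sketch (the data, the noise level, the split) is an assumption for illustration and is unrelated to the housing data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic target driven purely by an interaction between two features
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=2000)

# Simple holdout split
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# A linear model cannot represent the product term
lin_model = LinearRegression().fit(X_train, y_train)
lin_r2 = r2_score(y_test, lin_model.predict(X_test))

# Boosted trees can approximate it through successive splits
gb_model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
gb_r2 = r2_score(y_test, gb_model.predict(X_test))

print(f"Linear R^2: {lin_r2:.3f}")
print(f"Gradient Boosting R^2: {gb_r2:.3f}")
```

Passing `random_state` makes the boosted model reproducible between runs, which helps when comparing results across notebook sessions.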

We can see that the MSE increased after removing the outliers, which could be because of the following reasons:

The removed outliers may have been influential points that were actually well predicted by the model. If these points were not errors but valid extreme values, the model might have been leveraging them to better fit the overall trend. Without them, the model might not capture the full range of the data as well.

Tree-based models like Gradient Boosting can be sensitive to the removal of data points. The decision boundaries or split points might change significantly when outliers are removed, leading to a poorer fit to the remaining data.

### Random Forest Regressor Model

In [72]:
def random_forest_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Train the random forest
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)
    
    # Predict the response for the test predictors
    y_pred = rf.predict(X_test)
    
    # Check the goodness of fit on the test data
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Random Forest Regression using {predictor}:')
    print(f"Mean Squared Error: {mse}")
    print(f'R-squared Score: {r2}\n')
    print()
    
    # Plotting the trees is not useful: each has too many nodes to read, so the plots give no real insight
#     for i in range(min(2,len(rf.estimators_))):
#         plt.figure(figsize=(30,15), dpi=300)
#         plot_tree(rf.estimators_[i])
#         plt.show()

In [73]:
for var in candidates:
    random_forest_regression([var])
random_forest_regression(candidates)

Random Forest Regression using ['floor_area_sqm']:
Mean Squared Error: 10043238356.085766
R-squared Score: 0.447265678243711


Random Forest Regression using ['lease_commence_date']:
Mean Squared Error: 13205259030.104958
R-squared Score: 0.2732423910661945


Random Forest Regression using ['storey_range']:
Mean Squared Error: 17331908076.044346
R-squared Score: 0.04613033012905732


Random Forest Regression using ['resale_year']:
Mean Squared Error: 9681720670.235817
R-squared Score: 0.46716197322411346


Random Forest Regression using ['floor_area_sqm', 'lease_commence_date', 'storey_range', 'resale_year']:
Mean Squared Error: 1712660908.191451
R-squared Score: 0.9057429056322173




The Random Forest algorithm constructs multiple decision trees during training and outputs the mean prediction of the individual trees for regression tasks, or the class that is the mode of the classes for classification tasks.
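The averaging step for regression can be verified directly on a fitted scikit-learn forest, whose individual trees are available in `estimators_`. The data below is synthetic and the forest size is an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (assumed setup)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Prediction of the whole forest on a few rows
forest_pred = rf.predict(X[:5])

# Mean of the predictions of the individual trees on the same rows
tree_mean = np.mean([tree.predict(X[:5]) for tree in rf.estimators_], axis=0)

print(np.round(forest_pred, 4))
print(np.round(tree_mean, 4))
```

Averaging many trees, each trained on a bootstrap sample, reduces the variance of a single deep tree, which is one reason a forest usually generalizes better than one tree.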

Building the regression tree using `resale_year` and `lease_commence_date` instead of only `floor_area_sqm` raises the explained variance from around 60% to above 70%. This shows that `resale_year` and `lease_commence_date` are also quite relevant factors for `resale_price`.

The explained variance on the training set is around 79.5%, indicating that the model explains a substantial portion of the variance in the resale price data. The explained variance on the test set is around 74.29%, suggesting that the model generalizes well to unseen data.
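Comparing the training and test scores side by side is a simple way to judge generalization. The model functions in this notebook report only the test score, so the sketch below shows one way to compute both, using synthetic data and an arbitrary split as assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real predictors and target
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Score on the data the model has seen and on held-out data
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"Train R^2: {train_r2:.3f}")
print(f"Test R^2:  {test_r2:.3f}")
print(f"Gap:       {train_r2 - test_r2:.3f}")
```

A small gap between the two scores suggests the model generalizes, while a large gap points to overfitting.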