### Libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import warnings

from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import Ridge, BayesianRidge, LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

sb.set()
warnings.filterwarnings('ignore')

### Importing dataset

In [3]:
data = pd.read_csv("cleaned-Housing.csv")
train_data = pd.read_csv("train-Housing.csv")
test_data = pd.read_csv("test-Housing.csv")

In [4]:
data.head()

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,year
0,2012-01-03,ANG MO KIO,2 ROOM,172,ANG MO KIO AVE 4,06 TO 10,45,Improved,1986,250000.0,2012
1,2012-01-03,ANG MO KIO,2 ROOM,510,ANG MO KIO AVE 8,01 TO 05,44,Improved,1980,265000.0,2012
2,2012-01-03,ANG MO KIO,3 ROOM,610,ANG MO KIO AVE 4,06 TO 10,68,New Generation,1980,315000.0,2012
3,2012-01-03,ANG MO KIO,3 ROOM,474,ANG MO KIO AVE 10,01 TO 05,67,New Generation,1984,320000.0,2012
4,2012-01-03,ANG MO KIO,3 ROOM,604,ANG MO KIO AVE 5,06 TO 10,67,New Generation,1980,321000.0,2012


### Linear Regression Model Function

In [38]:
def linear_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predict and evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Print the performance
    print(f'Linear Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse}')
    print(f'R-squared Score: {r2}\n')

In [52]:
linear_regression(['floor_area_sqm'])

Linear Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 6673143504.148816
R-squared Score: 0.5520847532235249



In [53]:
linear_regression(['lease_commence_date'])

Linear Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 12041934922.93348
R-squared Score: 0.19172032651199544



In [54]:
linear_regression(['year'])

Linear Regression using ['year']:
Mean Squared Error (MSE): 14817369450.824999
R-squared Score: 0.005427398643804926



In [55]:
linear_regression(['floor_area_sqm', 'lease_commence_date', 'year'])

Linear Regression using ['floor_area_sqm', 'lease_commence_date', 'year']:
Mean Squared Error (MSE): 6554116589.753842
R-squared Score: 0.56007408684135



The MSE values for 'floor_area_sqm' and 'lease_commence_date' are both large in absolute terms, but MSE is measured in squared price units, so the R^2 values are easier to interpret. 'floor_area_sqm' alone gives an R^2 of about 0.55, far higher than 'lease_commence_date' (about 0.19), while 'year' alone explains almost nothing (about 0.005). Using all three predictors together raises R^2 only slightly, to about 0.56, so 'floor_area_sqm' carries most of the linear signal.
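To make the two metrics concrete, here is a minimal sketch that computes MSE and R^2 by hand. The prices are made-up values chosen for easy arithmetic, not rows from the dataset, and the helper names `mse_by_hand` and `r2_by_hand` are hypothetical.

```python
# Illustrative values only; not taken from the housing dataset.
y_true = [300000.0, 350000.0, 400000.0, 450000.0, 500000.0]
y_pred = [310000.0, 340000.0, 420000.0, 440000.0, 480000.0]


def mse_by_hand(actual, predicted):
    """Mean of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2_by_hand(actual, predicted):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


mse_value = mse_by_hand(y_true, y_pred)
r2_value = r2_by_hand(y_true, y_pred)

# A model that always predicts the mean price scores R^2 = 0 by construction.
r2_mean_only = r2_by_hand(y_true, [400000.0] * len(y_true))

print(mse_value, r2_value, r2_mean_only)
```

Because R^2 compares the model's squared error with that of always predicting the mean, it stays comparable across datasets, whereas the raw MSE grows with the square of the price scale.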

### Ridge Regression Model Function

In [11]:
def ridge_regression(data, predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the Ridge Regression model
    ridge_reg_model = Ridge(alpha=1.0)
    ridge_reg_model.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred_ridge = ridge_reg_model.predict(X_test)

    # Evaluate the model performance
    mse_ridge = mean_squared_error(y_test, y_pred_ridge)
    r2_ridge = r2_score(y_test, y_pred_ridge)

    # Print out the model performance metrics for Ridge Regression
    print(f'Ridge Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse_ridge}')
    print(f'R-squared Score: {r2_ridge}\n')

In [12]:
ridge_regression(data, ['floor_area_sqm'])

Ridge Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 6673143506.010233
R-squared Score: 0.5520847530985826



In [13]:
ridge_regression(data, ['lease_commence_date'])

Ridge Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 12041934921.445866
R-squared Score: 0.19172032661184724



In [47]:
ridge_regression(data, ['year'])

Ridge Regression using ['year']:
Mean Squared Error (MSE): 14817370219.780151
R-squared Score: 0.005427347029938523



In [48]:
ridge_regression(data, ['floor_area_sqm','lease_commence_date', 'year'])

Ridge Regression using ['floor_area_sqm', 'lease_commence_date', 'year']:
Mean Squared Error (MSE): 6554116577.730454
R-squared Score: 0.5600740876483846



The Ridge results are practically identical to those of plain linear regression: R^2 is about 0.55 for 'floor_area_sqm', about 0.19 for 'lease_commence_date', about 0.005 for 'year', and about 0.56 for all three together. With alpha=1.0 the penalty barely changes the fit, so regularization brings no improvement here. 'floor_area_sqm' remains the best single predictor.

Hence, we move on to try the Bayesian Ridge model.

### Bayesian Ridge Regression Model Function

In [15]:
def bayesian_ridge_regression(data, predictor):  
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Initialize and train the model
    model = BayesianRidge()
    model.fit(X_train, y_train)
    
    # Predict and evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Print the performance
    print(f'Bayesian Ridge Regression using {predictor}:')
    print(f'Mean Squared Error (MSE): {mse}')
    print(f'R-squared Score: {r2}\n')

In [16]:
bayesian_ridge_regression(data, ['floor_area_sqm'])

Bayesian Ridge Regression using ['floor_area_sqm']:
Mean Squared Error (MSE): 6673144464.012085
R-squared Score: 0.5520846887955093



In [17]:
bayesian_ridge_regression(data, ['lease_commence_date'])

Bayesian Ridge Regression using ['lease_commence_date']:
Mean Squared Error (MSE): 12041934298.783642
R-squared Score: 0.19172036840622841



In [51]:
bayesian_ridge_regression(data, ['year'])

Bayesian Ridge Regression using ['year']:
Mean Squared Error (MSE): 14817504814.509762
R-squared Score: 0.005418312752241583



In [18]:
bayesian_ridge_regression(data, ['floor_area_sqm', 'lease_commence_date', 'year'])

Bayesian Ridge Regression using ['floor_area_sqm', 'lease_commence_date', 'year']:
Mean Squared Error (MSE): 6554115919.068987
R-squared Score: 0.5600741318591098



The Bayesian Ridge model places prior distributions over the weights and tunes its regularization parameters automatically from the data. Despite this, its predictive power does not improve on standard Ridge Regression: the MSE and R^2 values are nearly identical for every set of predictors.
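The automatic tuning can be inspected through the fitted attributes `alpha_` (estimated noise precision) and `lambda_` (estimated weight precision) of scikit-learn's `BayesianRidge`. The sketch below is only an illustration on synthetic data, assumed here to resemble a single floor-area feature with a noisy linear price. It is not a rerun of the notebook's models.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, Ridge

# Synthetic data (an assumption for illustration): one feature on the scale of
# a floor area in square metres and a noisy linear price.
rng = np.random.default_rng(0)
X = rng.uniform(30, 150, size=(500, 1))
y = 4000.0 * X[:, 0] + 50000.0 + rng.normal(0.0, 20000.0, size=500)

bayes = BayesianRidge().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The ratio lambda_ / alpha_ plays the role of the ridge penalty strength.
print("noise precision alpha_   :", bayes.alpha_)
print("weight precision lambda_ :", bayes.lambda_)
print("implied ridge penalty    :", bayes.lambda_ / bayes.alpha_)
print("BayesianRidge slope      :", bayes.coef_[0])
print("Ridge(alpha=1.0) slope   :", ridge.coef_[0])
```

When the data determine the coefficients well, both penalties are negligible next to the information in the data, which is one plausible reason the two models report almost the same metrics.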

In [19]:
average_resale_price = data['resale_price'].mean()

print(f"The average resale price is: {average_resale_price}")

The average resale price is: 461391.20211721683


It is important to note that even though the MSE seems high, it can be acceptable because resale prices vary widely. Housing prices, such as those for HDB flats, range from tens of thousands to well over a million. An MSE in the billions looks large, but MSE is expressed in squared price units. Its square root, the Root Mean Squared Error (RMSE), is on the same scale as the prices themselves and describes the typical size of a prediction error. For the three-predictor linear model the RMSE is about 81,000, which is roughly 17.5% of the average resale price of about 461,000. An error of that size might be a reasonable margin for some applications, such as a quick market estimate, but it leaves clear room for improvement.

Thus, we moved on to try more complex models such as Gradient Boosting and Random Forest.
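As a rough check of that scale argument, the sketch below converts the four linear regression MSE values printed earlier into RMSE and expresses each as a share of the average resale price. The numbers are copied from the outputs above; the variable names are illustrative.

```python
import math

# Figures copied from the notebook outputs above.
average_resale_price = 461391.20211721683
linear_mse = {
    "floor_area_sqm": 6673143504.148816,
    "lease_commence_date": 12041934922.93348,
    "year": 14817369450.824999,
    "all three": 6554116589.753842,
}

# RMSE is the square root of MSE, so it is in the same units as the price.
linear_rmse = {name: math.sqrt(mse) for name, mse in linear_mse.items()}

for name, rmse in linear_rmse.items():
    share = rmse / average_resale_price
    print(f"{name:>20}: RMSE = {rmse:,.0f} ({share:.1%} of the average price)")
```

Reading the error as a fraction of the average price makes it easier to judge whether a model is accurate enough for a given use.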

### Gradient Boosting Regressor Model

In [30]:
def gradient_boosting_regression(predictor):
    # Prepare the data
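    # Copy the slices so that encoding does not modify train_data / test_data
    X_train = train_data[predictor].copy()
    X_test = test_data[predictor].copy()
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Encode categorical variables with one mapping shared by train and test
    categorical_cols = ['flat_type', 'storey_range', 'flat_model']
    for col in categorical_cols:
        if col not in X_train.columns:
            continue
        # Fit on the labels of both sets so that each category gets the same
        # integer everywhere and test-only labels do not raise an error
        encoder = LabelEncoder()
        encoder.fit(pd.concat([X_train[col], X_test[col]]))
        X_train[col] = encoder.transform(X_train[col])
        X_test[col] = encoder.transform(X_test[col])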

    # Initialize and train the Gradient Boosting Regressor model
    gb_regressor = GradientBoostingRegressor()
    gb_regressor.fit(X_train, y_train)

    # Make predictions
    y_pred = gb_regressor.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Mean Squared Error: {mse}")
    print(f'R-squared Score: {r2}\n')

In [57]:
gradient_boosting_regression(['flat_type', 'storey_range', 'flat_model', 'floor_area_sqm', 'lease_commence_date', 'year'])

Mean Squared Error: 3665597450.59099
R-squared Score: 0.7539574886042992



Compared to the linear models, this model is more flexible: it can capture interactions between variables such as 'floor_area_sqm' and 'flat_type', and it accepts both categorical and numerical predictors. Its R^2 of about 0.75 is well above the 0.56 of the best linear model, although it also uses three additional predictors.

We can see that the MSE also decreased with this model. Even though it is still high, the absolute error is a smaller percentage of a high 'resale_price' value.

We can see that the MSE has increased after removing the outliers, which could be because of the following reasons:
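Categorical predictors such as 'flat_type' only help if each category maps to the same integer in the training and test data. The sketch below shows one way to guarantee that. It uses scikit-learn's `OrdinalEncoder` rather than the `LabelEncoder` used in the function above, because `OrdinalEncoder` is designed for feature columns and can assign a placeholder code to categories that were not seen during fitting. The toy frames are assumed values, not rows from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (assumed values) standing in for train_data and test_data.
train = pd.DataFrame({"flat_type": ["2 ROOM", "3 ROOM", "4 ROOM", "3 ROOM"]})
test = pd.DataFrame({"flat_type": ["4 ROOM", "3 ROOM", "5 ROOM"]})

# Learn the category-to-integer mapping from the training data only, and map
# categories never seen in training to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["flat_type"]])
test_codes = encoder.transform(test[["flat_type"]])

print(train_codes.ravel())
print(test_codes.ravel())
```

Here "3 ROOM" and "4 ROOM" receive the same codes in both frames, and the unseen "5 ROOM" is flagged with -1 so that it can be handled explicitly.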

The removed outliers may have been influential points that were actually well-predicted by the model. If these points were not errors but valid extreme values, the model might have been leveraging these data points to better fit the overall trend. Without them, the model might not capture the full range of the data as well.

Tree-based models like Gradient Boosting can be sensitive to the removal of data points. The decision boundaries or split points might change significantly when outliers are removed, leading to a poorer fit to the remaining data.

### Random Forest Regressor Model

In [43]:
def random_forest_regression(predictor):
    # Prepare the data
    X_train = train_data[predictor]
    X_test = test_data[predictor]
    y_train = train_data['resale_price']
    y_test = test_data['resale_price']

    # Train the random forest
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)
    
    # Predict Response corresponding to Predictors
    y_train_pred = rf.predict(X_train)
    y_test_pred = rf.predict(X_test)
    
    # Check the Goodness of Fit (on Train Data)
    print("Goodness of Fit of Model \tTrain Dataset")
    print("Explained variance\t:", rf.score(X_train, y_train))
    print("Mean squared error \t:", mean_squared_error(y_train, y_train_pred))
    print()

    # Check the Goodness of Fit (on Test Data)
    print("Goodness of Fit of Model \tTest Dataset")
    print("Explained variance \t:", rf.score(X_test, y_test))
    print("Mean squared error \t:", mean_squared_error(y_test, y_test_pred))
    print()
    
    # Plot trees
#     for i in range(min(2,len(rf.estimators_))):
#         plt.figure(figsize=(30,15), dpi=300)
#         plot_tree(rf.estimators_[i])
#         plt.show()

In [45]:
random_forest_regression(['floor_area_sqm'])

Goodness of Fit of Model 	Train Dataset
Explained variance	: 0.6113498311971979
Mean squared error 	: 5854036186.56932

Goodness of Fit of Model 	Test Dataset
Explained variance 	: 0.6061041828675331
Mean squared error 	: 5868349720.9029665



In [56]:
random_forest_regression(['lease_commence_date'])

Goodness of Fit of Model 	Train Dataset
Explained variance	: 0.24127685760804374
Mean squared error 	: 11428253703.920877

Goodness of Fit of Model 	Test Dataset
Explained variance 	: 0.22995736699142033
Mean squared error 	: 11472270773.008965



In [49]:
random_forest_regression(['year'])

Goodness of Fit of Model 	Train Dataset
Explained variance	: 0.011606497976667796
Mean squared error 	: 14887659370.477146

Goodness of Fit of Model 	Test Dataset
Explained variance 	: 0.012279705302368993
Mean squared error 	: 14715282223.395002



In [50]:
random_forest_regression(['floor_area_sqm','lease_commence_date','year'])

Goodness of Fit of Model 	Train Dataset
Explained variance	: 0.7953065480009689
Mean squared error 	: 3083191443.984981

Goodness of Fit of Model 	Test Dataset
Explained variance 	: 0.7293796834423929
Mean squared error 	: 4031763197.4432244



The Random Forest algorithm constructs multiple decision trees during training and outputs the mean prediction of the individual trees for regression tasks, or the class that is the mode of the classes for classification tasks.
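The averaging step described above can be checked directly. This is a minimal sketch on synthetic data (an assumption for illustration, not the housing set): it predicts with every fitted tree in `estimators_` and compares the mean of those predictions with the output of the forest itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
X = rng.uniform(30, 150, size=(200, 1))
y = 4000.0 * X[:, 0] + rng.normal(0.0, 20000.0, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with every fitted tree, then average across trees.
X_new = np.array([[60.0], [100.0]])
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

print("mean of tree predictions:", manual_mean)
print("forest.predict          :", forest.predict(X_new))
```

Individual trees can disagree considerably for the same input. Averaging them reduces that variance, which is the main reason a forest usually generalizes better than a single deep tree.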

Building the random forest using `year` and `lease_commence_date` in addition to `floor_area_sqm` raises the explained variance on the test set from around 61% to around 73%. This shows that `year` and `lease_commence_date` are also relevant to `resale_price`, even though each is a weak predictor on its own.

On the training set the explained variance is around 79.5%, indicating that the model explains a substantial portion of the variance in the resale price data. On the test set it is around 72.9%, so the model generalizes reasonably well to unseen data, although the gap of about 6.6 percentage points suggests mild overfitting.