# **Introduction: California Housing Prices Dataset 📚**

Median house prices for California districts derived from the 1990 census.

**About Dataset**


The California Housing Prices dataset, commonly referred to as the "California Housing" dataset, is a popular dataset used in machine learning and data analysis. It is often featured on platforms like Kaggle for educational and practice purposes. Here is a brief introduction to the dataset:

**Source**

The dataset is based on data collected during the 1990 California census. It is often used as a benchmark dataset for regression problems, particularly in predicting housing prices.

**Features**

The dataset includes various features that can be used to predict the median house value for California districts. Some of the key features include:

* **longitude**: A measure of how far west a house is; values are negative in California, so a more negative value is farther west
* **latitude**: A measure of how far north a house is; a higher value is farther north
* **housingMedianAge**: Median age of a house within a block; a lower number is a newer building
* **totalRooms**: Total number of rooms within a block
* **totalBedrooms**: Total number of bedrooms within a block
* **population**: Total number of people residing within a block
* **households**: Total number of households, a group of people residing within a home unit, for a block
* **medianIncome**: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
* **medianHouseValue**: Median house value for households within a block (measured in US Dollars)
* **oceanProximity**: Location of the house w.r.t ocean/sea

**Target Variable**

The target variable is Median House Value, which represents the median house value for California districts (measured in US dollars).

**Dataset link**: https://www.kaggle.com/datasets/camnugent/california-housing-prices

# **Data preprocessing**

**Read the dataset from GitHub**

In [None]:
!git clone https://github.com/honganhdinh2002/California-Housing-Prices-Kaggle.git

In [None]:
import pandas as pd
import numpy as np

In [None]:
dataset = pd.read_csv('/content/California-Housing-Prices-Kaggle/california_housing_price.csv')

In [None]:
df = dataset.copy()
df.head()

Each row represents one district. There are 10 attributes: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.

The info() method gives a quick description of the data, in particular the total number of rows, each attribute's type, and the number of non-null values.

In [None]:
df.info()

**Shape of data**

In [None]:
nRow, nCol = df.shape
print("Shape of dataset {}".format(df.shape))
print(f"Rows: {nRow} \nColumns: {nCol}")

**Encode categorical values**

There are 10 attributes: 9 numeric and 1 categorical (ocean_proximity). The categorical attribute should be converted to a numeric one.

In [None]:
import plotly.express as ex

ex.pie(df ,names='ocean_proximity', title='Proportion of Locations of the house w.r.t ocean/sea')

In [None]:
df["ocean_proximity"].value_counts()

In [None]:
from sklearn.preprocessing import LabelEncoder

ocean_proximity_le = LabelEncoder()

In [None]:
df['ocean_proximity'] = ocean_proximity_le.fit_transform(df['ocean_proximity'])

In [None]:
print("ID\tRepresentation value")
for i in range(len(ocean_proximity_le.classes_)):
    print(f"{i}\t\t{ocean_proximity_le.classes_[i]}")
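To make the encoding step concrete, the sketch below reproduces the mapping in plain Python on a few made-up values: `LabelEncoder` sorts the distinct labels and assigns each one its position in that sorted order. The sample values and the helper name `encode_labels` are illustrative assumptions, not part of the notebook.

```python
def encode_labels(values):
    """Map each distinct label to its index in sorted order (mirrors LabelEncoder)."""
    classes = sorted(set(values))
    mapping = {label: idx for idx, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Made-up sample of ocean_proximity values
sample = ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "ISLAND"]
codes, classes = encode_labels(sample)

print(classes)  # ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY']
print(codes)    # [3, 1, 0, 1, 2]
```

Because the codes are integers, a model may read them as an ordering (0 < 1 < 2 < 3). Tree-based models are usually tolerant of this, while a linear model may be affected by it.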

In [None]:
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt


fig=plt.figure(figsize=(17, 4))
plt.subplot(132)
sns.boxplot( x=df["ocean_proximity"], y=df["median_house_value"], palette="Blues").set_title('Median House Value Boxplot by Ocean Proximity')

plt.tight_layout()
plt.show()

**Describe the dataset**

In [None]:
df.describe()

* The mean of the district median house values is above 200,000 USD
* The highest median house value is about 500,000 USD
* The lowest median house value is about 15,000 USD

**Handle missing data**

In [None]:
print(df.isna().sum())
df.isna().sum().plot(kind='bar')

In [None]:
df = df.dropna()
print(df.isna().sum())

**Check and remove duplicated data**

In [None]:
df.duplicated().value_counts()

In [None]:
# drop_duplicates(inplace=True) returns None, so assign the result back instead
df = df.drop_duplicates(keep="first")
df.duplicated().value_counts()

**Handle outlier values**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(18,8))
sns.boxplot(data=df)

**Handling median house value outliers**

In [None]:
df0 = df[(df['median_house_value'] <= 470000)]
df0.shape[0]

**Handling total rooms outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df, y='total_rooms', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df,x = 'median_house_value', s = 100, y='total_rooms', ax=ax[1])

In [None]:
df1 = df0[(df0['total_rooms'] <= 23000)]
df1.shape[0]

*225 rows (about 1%) were removed to handle total rooms outliers*

**Handling total bedrooms outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df1, y='total_bedrooms', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df1,x = 'median_house_value', s = 100, y='total_bedrooms', ax=ax[1])

In [None]:
df2 = df1[(df1['total_bedrooms'] < 3000)]
df2.shape

*49 rows (about 0.24%) were removed to handle total bedrooms outliers*

**Handling population outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df2, y='population', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df2,x = 'median_house_value', s = 100, y='population', ax=ax[1])

In [None]:
df3 = df2[(df2['population'] < 7500)]
df3.shape[0]

*37 rows (about 0.18%) were removed to handle population outliers*

**Handling households outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df3, y='households', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df3,x = 'median_house_value', s = 100, y='households', ax=ax[1])

In [None]:
df4 = df3[(df3['households'] < 2300)]
df4.shape[0]

*47 rows (about 0.23%) were removed to handle households outliers*

**Handling median income outliers**

In [None]:
fig, ax =plt.subplots(1,2, figsize=(12,8))
sns.boxplot(data=df4, y='median_income', ax=ax[0], color='#7209b7')
sns.scatterplot(data=df4,x = 'median_house_value', s = 100, y='median_income', ax=ax[1])

In [None]:
df5 = df4[(df4['median_income'] < 11)]
df5.shape[0]

*156 rows (about 0.76%) were removed to handle median income outliers*

In total, 514 rows (about 2.5%) were removed to handle outliers.
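The row counts above can be checked by applying the thresholds one after another and recording how many rows each step drops. The following is a minimal sketch on made-up data; the helper name `apply_upper_limits` and the toy values are assumptions for illustration, while the thresholds and comparison operators follow the cells above.

```python
import pandas as pd


def apply_upper_limits(frame, limits):
    """Apply (column, inclusive, limit) filters in order; report rows removed per step."""
    report = []
    current = frame
    for column, inclusive, limit in limits:
        mask = current[column] <= limit if inclusive else current[column] < limit
        kept = current[mask]
        report.append((column, len(current) - len(kept)))
        current = kept
    return current, report


# Toy data standing in for the real frame (values are made up)
toy = pd.DataFrame({
    "median_house_value": [100000, 480000, 250000, 300000, 150000],
    "total_rooms": [2000, 1500, 30000, 2500, 1800],
    "population": [900, 800, 1200, 9000, 700],
})

limits = [
    ("median_house_value", True, 470000),  # <= 470000
    ("total_rooms", True, 23000),          # <= 23000
    ("population", False, 7500),           # < 7500
]

filtered, report = apply_upper_limits(toy, limits)
for column, removed in report:
    print(f"{column}: removed {removed} rows ({removed / len(toy):.1%} of the original)")
print("rows kept:", len(filtered))
```

Running the same helper on the real frame with all six thresholds would give a per-step count that can be compared with the figures quoted in the notes.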

# **Visualize data**

In [None]:
df5.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
        s=df5["population"]/100, label="population", figsize=(15,8),
        c="median_house_value", cmap=plt.get_cmap("jet"),colorbar=True,
    )
plt.legend()
plt.show()

**Correlation**

In [None]:
df5 = df5[['longitude', 'latitude' ,'housing_median_age', 'total_rooms',
       'total_bedrooms','population', 'households', 'median_income', 'ocean_proximity', 'median_house_value']]

In [None]:
plt.figure(figsize = (12,8))
sns.heatmap(df5.corr() , annot = True , cmap = "YlGnBu")

Reading the heatmap (values are correlation magnitudes, so the sign is ignored):

* median income has a correlation of about 0.68 with median house value.
* total rooms has a correlation of about 0.16 with median house value.
* latitude has a correlation of about 0.14 with median house value.
* housing median age has a correlation of about 0.11 with median house value.
* longitude has a correlation of about 0.92 with latitude.
* longitude has a correlation of about 0.29 with ocean proximity.
* latitude has a correlation of about 0.20 with ocean proximity.
* median income has a correlation of about 0.24 with total rooms.
* housing median age has an average correlation of about 0.32 with total rooms, total bedrooms, population, and households.
* total rooms, total bedrooms, population, and households have an average correlation of about 0.92 with each other.

The target variable median_house_value is only mildly correlated with every feature except median_income, which makes median_income the most promising single predictor here.
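For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default. The sketch below writes that formula out with NumPy and checks it against `np.corrcoef`; the sample numbers and the helper name `pearson_r` are made up for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up values standing in for median_income and median_house_value
income = [1.5, 2.0, 3.5, 5.0, 8.0]
value = [90000, 120000, 180000, 260000, 400000]

r = pearson_r(income, value)
print(round(r, 4))
print(np.isclose(r, np.corrcoef(income, value)[0, 1]))
```

The coefficient always lies between -1 and 1; a negative sign means one variable tends to fall as the other rises.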

In [None]:
corr_matrix = df5.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

**Feature distributions**


In [None]:
# Histogram

fig = plt.figure(figsize=(15, 5))

plt.subplot(131)

(sns.histplot(df5["housing_median_age"], bins = "fd", stat = "density", color = "skyblue", alpha = 1)  # histplot replaces the deprecated distplot
    .set(xlabel = "Housing Median Age", ylabel = "Density", title = "Median House Age Histogram"));

plt.subplot(132)

(sns.histplot(df5["total_rooms"], bins = "fd", stat = "density", color = "skyblue", alpha = 1)  # histplot replaces the deprecated distplot
    .set(xlabel = "Total Rooms", ylabel = "Density", title = "Total Rooms Histogram"));

plt.subplot(133)

(sns.histplot(df5["total_bedrooms"], bins = "fd", stat = "density", color = "skyblue", alpha = 1)  # histplot replaces the deprecated distplot
    .set(xlabel = "Total Bedrooms", ylabel = "Density", title = "Total Bedrooms Histogram"));

plt.tight_layout()
plt.show()

# Boxplot

fig = plt.figure(figsize=(15, 5))

plt.subplot(131)
sns.boxplot(y=df5["housing_median_age"], color="skyblue").set_title('Median House Age Boxplot')

plt.subplot(132)
sns.boxplot(y=df5["total_rooms"], color="skyblue").set_title('Total Rooms Boxplot')

plt.subplot(133)
sns.boxplot(y=df5["total_bedrooms"], color="skyblue").set_title('Total Bedrooms Boxplot')

plt.tight_layout()
plt.show()


# Histogram

fig = plt.figure(figsize=(15, 5))

plt.subplot(131)

(sns.histplot(df5["population"], bins = "fd", stat = "density", color = "skyblue", alpha = 1)  # histplot replaces the deprecated distplot
    .set(xlabel = "Population", ylabel = "Density", title = "Population Histogram"));

plt.subplot(132)

(sns.histplot(df5["households"], bins = "fd", stat = "density", color = "skyblue", alpha = 1)  # histplot replaces the deprecated distplot
    .set(xlabel = "Households", ylabel = "Density", title = "Households Histogram"));

plt.subplot(133)

(sns.histplot(df5["median_income"], bins = "fd", stat = "density", color = "skyblue", alpha = 1)  # histplot replaces the deprecated distplot
    .set(xlabel = "Median Income", ylabel = "Density", title = "Median Income Histogram"));

plt.tight_layout()
plt.show()


# Boxplot

fig = plt.figure(figsize=(15, 5))

plt.subplot(131)
sns.boxplot(y=df5["population"], color="skyblue").set_title('Population Boxplot')

plt.subplot(132)
sns.boxplot(y=df5["households"], color="skyblue").set_title('Households Boxplot')

plt.subplot(133)
sns.boxplot(y=df5["median_income"], color="skyblue").set_title('Median Income Boxplot')

plt.tight_layout()
plt.show()

# **Split data**

In [None]:
df5.columns

In [None]:
final = df5[['longitude', 'latitude', 'housing_median_age',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity', 'median_house_value']]

In [None]:
final.tail(2)

In [None]:
from sklearn.model_selection import train_test_split

X = final.drop(['median_house_value'] , axis = 1).values
y = final['median_house_value'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)

**Scaling**

In [None]:
# from sklearn.preprocessing import StandardScaler
# scale = StandardScaler()

In [None]:
# X_train=scale.fit_transform(X_train)
# X_test=scale.transform(X_test)  # reuse the statistics fitted on the training set

In [None]:
# X_train.shape, y_train.shape
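If scaling is enabled, the scaling statistics are normally estimated on the training data only and then applied unchanged to the test data, so that no information from the test set leaks into preprocessing. The sketch below shows that idea with plain NumPy on made-up arrays; the helper names `fit_standardizer` and `standardize` are hypothetical and only mirror what a standard scaler does.

```python
import numpy as np


def fit_standardizer(train):
    """Estimate per-column mean and standard deviation from the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std


def standardize(data, mean, std):
    """Apply previously fitted statistics to any data set."""
    return (data - mean) / std


# Made-up training and test arrays with two features
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # uses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Tree-based models such as Random Forest and Gradient Boosting are generally insensitive to feature scaling, which may be why the scaling cells are left disabled here.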

# **Linear Regression**

Linear regression models are useful for prediction and to explain variation in the response variable. In this case, we will focus on prediction. So we will fit a predictive model to an observed data set of values of the response and explanatory variables.

To fit the linear model we use the Scikit-learn class **LinearRegression**; its **fit()** method trains the model on our data.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error , mean_absolute_percentage_error

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)

As shown above, the variable lr is a LinearRegression model trained on X_train and y_train. Training the model means finding the line (more precisely, the hyperplane) that best fits the training data, which is what **fit()** does; the **predict()** function then applies the fitted model to X_test to produce y_pred_lr.
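The split between fitting and predicting can be illustrated with ordinary least squares in NumPy. This is only a sketch on made-up data: the helper names `fit_linear` and `predict_linear` are hypothetical, and scikit-learn's own implementation differs in its details.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_p]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply fitted coefficients to new rows."""
    return coef[0] + X @ coef[1:]


# Made-up data generated from y = 10 + 2*x1 - 1*x2 without noise
X_toy = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y_toy = 10.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

coef = fit_linear(X_toy, y_toy)                              # the "fit" step
new_pred = predict_linear(coef, np.array([[5.0, 2.0]]))      # the "predict" step

print(np.round(coef, 6))
print(new_pred)
```

With noise-free data the fit recovers the generating coefficients; on real data it returns the coefficients that minimise the squared error on the training set.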

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn import metrics

# Round each metric to 2 decimal places
mae_linear = np.round(metrics.mean_absolute_error(y_test, y_pred_lr), 2)
mse_linear = np.round(metrics.mean_squared_error(y_test, y_pred_lr), 2)
rmse_linear = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_lr)), 2)


print('Mean Absolute Error:', mae_linear)
print('Mean Squared Error:', mse_linear)
print('Root Mean Squared Error:', rmse_linear)
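The three metrics printed above can be written out directly, which makes their relationship easier to see: MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE, so it is back in the units of the target (US dollars here). The sample values and the helper name `regression_errors` are made up for illustration.

```python
import numpy as np


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    return mae, mse, rmse


# Made-up targets and predictions
y_true_toy = [3.0, -0.5, 2.0, 7.0]
y_pred_toy = [2.5, 0.0, 2.0, 8.0]

mae, mse, rmse = regression_errors(y_true_toy, y_pred_toy)
print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}")
```

Because errors are squared before averaging, MSE and RMSE penalise large misses more heavily than MAE does.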

In [None]:
lr_frame = pd.DataFrame({"Y_test": y_test , "Y_pred" : y_pred_lr})
lr_frame.head(10)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(lr_frame[:50])
plt.legend(["Actual" , "Predicted"])

In [None]:
# residual plot
plt.figure(figsize=(12,7))
(sns.histplot((y_test-y_pred_lr), bins = "fd", stat = "density", color = "skyblue", alpha = 1)  # histplot replaces the deprecated distplot
    .set(xlabel = "(y_test-y_pred)", ylabel = "Density"
    , title = "Linear Regression Residual Plot"));

Linear Regression typically has fewer hyperparameters to tune than more complex models.

# **Random Forest**

Random Forest is an evolution of bagging. The Random Forest model provides an **improvement over bagged trees** by way of a small tweak that **decorrelates the trees**.

As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a **random sample of m predictors is chosen as split candidates from the full set of p predictors**. The idea behind this process is to decorrelate the trees, for example: Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated.

In Random Forest p is the full set of predictors and m is the predictors taken at each split. So, **the main difference between bagging and random forests is the choice of predictor subset size m**. For instance, if a random forest is built using m = p, then this amounts simply to bagging.
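The two sources of randomness described above can be sketched with NumPy: a bootstrap sample of rows for each tree, and a random subset of m out of p predictors at each split. The sizes below are illustrative assumptions (p = 8 matches the number of predictors used later, and m = p // 3 is one common rule of thumb for regression, not a value taken from this notebook).

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 10      # illustrative number of training rows
p = 8               # full set of predictors
m = max(1, p // 3)  # predictors considered at each split (assumed rule of thumb)


def draw_bootstrap(rng, n_samples):
    """Sample row indices with replacement, as done once per tree."""
    return rng.integers(0, n_samples, size=n_samples)


def draw_split_candidates(rng, p, m):
    """Sample m distinct predictor indices, as done at each split."""
    return rng.choice(p, size=m, replace=False)


rows = draw_bootstrap(rng, n_samples)
candidates = draw_split_candidates(rng, p, m)
all_candidates = draw_split_candidates(rng, p, p)  # m = p: every predictor is a candidate

print("bootstrap rows:", rows)
print("split candidates:", candidates)
print("m = p gives all predictors:", sorted(all_candidates.tolist()))
```

With m = p every split sees all predictors, which is the bagging case mentioned in the text.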

In [None]:
from sklearn.ensemble import RandomForestRegressor

regressor_rf = RandomForestRegressor(random_state=42)
regressor_rf.fit(X_train, y_train)
y_pred_forest = regressor_rf.predict(X_test)

In [None]:
# Round each metric to 2 decimal places
mae_forest = np.round(metrics.mean_absolute_error(y_test, y_pred_forest), 2)
mse_forest = np.round(metrics.mean_squared_error(y_test, y_pred_forest), 2)
rmse_forest = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_forest)), 2)


print('Mean Absolute Error:', mae_forest)
print('Mean Squared Error:', mse_forest)
print('Root Mean Squared Error:', rmse_forest)

**Tuning hyperparameters: Random Forest Regression**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': [1.0, 'sqrt']  # 'auto' was removed in scikit-learn 1.3; 1.0 uses all features
}

rf = RandomForestRegressor()
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

In [None]:
# Predict on the test set
y_pred_forest_cv = grid_search.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred_forest_cv)
rmse = np.sqrt(mse)

print(f"Mean Squared Error on Test Set: {mse}")
print(f"Root Mean Squared Error on Test Set: {rmse}")

Mean Squared Error on Test Set: 1897876002.331699
Root Mean Squared Error on Test Set: 43564.618698339356

After tuning, the RMSE, MSE, and MAE values do not change noticeably, so either model can be used.

In [None]:
regressor_rf_frame_cv = pd.DataFrame({"Y_test": y_test , "Y_pred" : y_pred_forest_cv})
regressor_rf_frame_cv.head(10)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(regressor_rf_frame_cv[:50])
plt.legend(["Actual" , "Predicted"])

# **Regression Tree**

Decision Trees methods involve stratifying or segmenting the predictor space into a number of simple regions. Decision Trees are **simple and useful for interpretation**, however they are typically **not competitive in terms of prediction accuracy**.

Decision trees can be applied to both **regression and classification problems**. For this task, we use **Regression Trees**.

**Advantages of Decision Trees**:

1. Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
2. Some people believe that decision trees more closely mirror human decision-making than other regression and classification approaches do.
3. Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
4. Trees can easily handle qualitative predictors without the need to create dummy variables.

**Disadvantages of Decision Trees**:

1. Unfortunately, trees generally do not have the same level of predictive accuracy as some other regression and classification approaches.
2. Additionally, trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree (high variance)

However, by aggregating many decision trees the predictive performance of trees can be substantially improved. We will evaluate these models in the next cells.
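The idea of segmenting the predictor space into simple regions can be shown with the smallest possible regression tree: one split on one feature, predicting the mean of the target on each side. The sketch below searches for the threshold that minimises the summed squared error; the data and the helper name `best_split` are made up for illustration and do not reproduce scikit-learn's implementation.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimising the summed squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best = None
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between equal values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:
        raise ValueError("x needs at least two distinct values")
    return best[1:]


# Made-up data with two clearly separated groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.1, 4.9])

threshold, left_mean, right_mean = best_split(x, y)
print(f"split at x <= {threshold}: predict {left_mean:.2f}, else predict {right_mean:.2f}")
```

A full tree repeats this search inside each resulting region, which is also why a small change in the data can move an early split and alter everything below it.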

In [None]:
from sklearn.tree import DecisionTreeRegressor

regressor_tree = DecisionTreeRegressor(random_state = 42)
regressor_tree.fit(X_train, y_train)

y_pred_tree = regressor_tree.predict(X_test)

In [None]:
# Round each metric to 2 decimal places
mae_tree = np.round(metrics.mean_absolute_error(y_test, y_pred_tree), 2)
mse_tree = np.round(metrics.mean_squared_error(y_test, y_pred_tree), 2)
rmse_tree = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_tree)), 2)


print('Mean Absolute Error:', mae_tree)
print('Mean Squared Error:', mse_tree)
print('Root Mean Squared Error:', rmse_tree)

**Tuning hyperparameters: Regression Tree**

In [None]:
from sklearn.model_selection import GridSearchCV


param_grid = {
    'max_depth': [None, 5, 10, 15, 20, 30],
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4, 8, 12]
}

grid_search = GridSearchCV(regressor_tree, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Hyperparameters:", grid_search.best_params_)
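A grid search tries every combination in the grid once per cross-validation fold, so its cost grows multiplicatively with each added value. The short sketch below counts the work for the grid above using only the standard library; the variable names are illustrative.

```python
from itertools import product

# Same grid and fold count as in the cell above
param_grid = {
    "max_depth": [None, 5, 10, 15, 20, 30],
    "min_samples_split": [2, 5, 10, 15, 20],
    "min_samples_leaf": [1, 2, 4, 8, 12],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combinations) * cv_folds

print(len(combinations), n_fits)  # 150 750
```

By default `GridSearchCV` also refits the best combination once on the whole training set (`refit=True`), which adds one more fit to this count.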

In [None]:
best_regressor_tree = grid_search.best_estimator_
y_pred_tree_cv = best_regressor_tree.predict(X_test)

In [None]:
# Round each metric to 2 decimal places
mae_tree_cv = np.round(metrics.mean_absolute_error(y_test, y_pred_tree_cv), 2)
mse_tree_cv = np.round(metrics.mean_squared_error(y_test, y_pred_tree_cv), 2)
rmse_tree_cv = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_tree_cv)), 2)


print('Mean Absolute Error:', mae_tree_cv)
print('Mean Squared Error:', mse_tree_cv)
print('Root Mean Squared Error:', rmse_tree_cv)

After tuning, the RMSE, MSE, and MAE values improve, so we keep the tuned model.

In [None]:
regressor_tree_frame_cv = pd.DataFrame({"Y_test": y_test, "Y_pred": y_pred_tree_cv})
regressor_tree_frame_cv.head(10)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(regressor_tree_frame_cv[:50])
plt.legend(["Actual" , "Predicted"])

In [None]:
# residual plot
plt.figure(figsize=(12,7))
(sns.histplot((y_test-y_pred_tree_cv), bins = "fd", stat = "density", color = "skyblue", alpha = 1)  # histplot replaces the deprecated distplot
    .set(xlabel = "(y_test-y_pred)", ylabel = "Density", title = "Regression Tree Residual Plot"));

**Regression Tree comparison**

In [None]:
models = ['Regression Tree' , 'Regression Tree CV']
data = [
    [mae_tree, mse_tree, rmse_tree],
    [mae_tree_cv, mse_tree_cv, rmse_tree_cv]
]

cols = ['mae' , 'mse', 'rmse']

# Use a separate name so the dataset variable df is not overwritten
tree_comparison = pd.DataFrame(data=data, index=models, columns=cols)
pd.options.display.float_format = '{:.2f}'.format

tree_comparison

# **Gradient Boosting Regression**

**Gradient Boosting Regression** is a powerful machine learning algorithm used for regression problems. It belongs to the category of Ensemble Learning, where a series of weak models (weak learners) are combined to create a strong model (strong learner).
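How weak learners add up to a strong one can be sketched in a few lines for squared-error loss: start from the mean, then repeatedly fit a one-split tree (a stump) to the current residuals and add a damped copy of its prediction. This is a simplified illustration on synthetic one-dimensional data, not scikit-learn's implementation; the helper names and the settings (50 rounds, learning rate 0.1) are assumptions.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Additive model: mean + sum of damped stumps fitted to residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of squared loss (up to a constant)
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Synthetic data: a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

base, stumps, fitted = boost(x, y)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

A smaller learning rate usually needs more rounds to reach the same fit, which is why `learning_rate` and `n_estimators` are tuned together in the grid search.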

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Create the Gradient Boosting model
gb_model = GradientBoostingRegressor()

# Train the model on the training set
gb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_gb = gb_model.predict(X_test)

# Evaluate performance using Mean Squared Error
mse_gb = mean_squared_error(y_test, y_pred_gb)

# Compute Root Mean Squared Error
rmse_gb = np.sqrt(mse_gb)

print("Mean Squared Error on Test Set:", mse_gb)
print("Root Mean Squared Error on Test Set:", rmse_gb)

**Tuning hyperparameters: Gradient Boosting Regression**

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameters to try
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Use GridSearchCV to search for the best hyperparameters
grid_search = GridSearchCV(estimator=GradientBoostingRegressor(), param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Use the model refitted with the best hyperparameters
best_gb_model = grid_search.best_estimator_

# Predict on the test set and re-evaluate performance
y_pred_gb_best = best_gb_model.predict(X_test)
mse_gb_best = mean_squared_error(y_test, y_pred_gb_best)
print("Mean Squared Error on Test Set (Best Model):", mse_gb_best)

rmse_gb_best = np.sqrt(mse_gb_best)
print("Root Mean Squared Error on Test Set (Best Model):", rmse_gb_best)


Mean Squared Error on Test Set (Best Model): 1710899232.3084118

Root Mean Squared Error on Test Set (Best Model): 41363.01768861179

In [None]:
gb_frame_cv = pd.DataFrame({"Y_test": y_test, "Y_pred": y_pred_gb_best})  # use the tuned model's predictions
gb_frame_cv.head(10)

In [None]:
# Round each metric to 2 decimal places
mae_gb_cv = np.round(metrics.mean_absolute_error(y_test, y_pred_gb_best), 2)
mse_gb_cv = np.round(metrics.mean_squared_error(y_test, y_pred_gb_best), 2)
rmse_gb_cv = np.round(np.sqrt(metrics.mean_squared_error(y_test, y_pred_gb_best)), 2)


print('Mean Absolute Error:', mae_gb_cv)
print('Mean Squared Error:', mse_gb_cv)
print('Root Mean Squared Error:', rmse_gb_cv)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(gb_frame_cv[:50])
plt.legend(["Actual" , "Predicted"])

# **Evaluation models**

In [None]:
!pip install tabulate

In [None]:
from tabulate import tabulate

# Assuming y_test, y_pred_lr, y_pred_forest, y_pred_tree are defined for the entire test dataset

# Collect one row of evaluation metrics per model
# (DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list first)
rows = []

# Calculate evaluation metrics for LR
mae_lr = mean_absolute_error(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)
rows.append({'Model': 'Linear Regression', 'MAE': f'{mae_lr:.2f}', 'MSE': f'{mse_lr:.2f}', 'RMSE': f'{rmse_lr:.2f}'})

# Calculate evaluation metrics for Random Forest
mae_rf = mean_absolute_error(y_test, y_pred_forest)
mse_rf = mean_squared_error(y_test, y_pred_forest)
rmse_rf = np.sqrt(mse_rf)
rows.append({'Model': 'Random Forest', 'MAE': f'{mae_rf:.2f}', 'MSE': f'{mse_rf:.2f}', 'RMSE': f'{rmse_rf:.2f}'})

# Calculate evaluation metrics for the tuned Regression Tree
mae_tree = mean_absolute_error(y_test, y_pred_tree_cv)
mse_tree = mean_squared_error(y_test, y_pred_tree_cv)
rmse_tree = np.sqrt(mse_tree)
rows.append({'Model': 'Regression Tree', 'MAE': f'{mae_tree:.2f}', 'MSE': f'{mse_tree:.2f}', 'RMSE': f'{rmse_tree:.2f}'})

# Calculate evaluation metrics for the tuned Gradient Boosting model
mae_gb = mean_absolute_error(y_test, y_pred_gb_best)
mse_gb = mean_squared_error(y_test, y_pred_gb_best)
rmse_gb = np.sqrt(mse_gb)
rows.append({'Model': 'Gradient Boosting', 'MAE': f'{mae_gb:.2f}', 'MSE': f'{mse_gb:.2f}', 'RMSE': f'{rmse_gb:.2f}'})

# Build the results table in one step
results_df = pd.DataFrame(rows, columns=['Model', 'MAE', 'MSE', 'RMSE'])

# Display the table
print(tabulate(results_df, headers='keys', tablefmt='fancy_grid'))


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Define labels and data for plotting
models = ['Linear Regression', 'Random Forest', 'Decision Tree', 'Gradient Boosting']
mae_values = [mae_lr, mae_rf, mae_tree, mae_gb]
mse_values = [mse_lr, mse_rf, mse_tree, mse_gb]
rmse_values = [rmse_lr, rmse_rf, rmse_tree, rmse_gb]

# Set up colors
colors = ['steelblue', 'forestgreen', 'darkorange', 'slategray']

# Set up bar width
bar_width = 0.2

# Set up positions for bars on x-axis (each figure has one bar per model,
# so bars and tick labels share the same positions)
positions = np.arange(len(models))

# Plotting MAE
plt.figure(figsize=(10, 5))
plt.bar(positions, mae_values, color=colors, width=bar_width, edgecolor='grey', label='MAE')
plt.xlabel('Models', fontweight='bold')
plt.xticks(positions, models)
plt.title('Mean Absolute Error (MAE) Comparison')
plt.legend()
plt.show()

# Plotting MSE
plt.figure(figsize=(10, 5))
plt.bar(positions, mse_values, color=colors, width=bar_width, edgecolor='grey', label='MSE')
plt.xlabel('Models', fontweight='bold')
plt.xticks(positions, models)
plt.title('Mean Squared Error (MSE) Comparison')
plt.legend()
plt.show()

# Plotting RMSE
plt.figure(figsize=(10, 5))
plt.bar(positions, rmse_values, color=colors, width=bar_width, edgecolor='grey', label='RMSE')
plt.xlabel('Models', fontweight='bold')
plt.xticks(positions, models)
plt.title('Root Mean Squared Error (RMSE) Comparison')
plt.legend()
plt.show()


**Results: which is the best model?**

As can be seen from the table above, **Gradient Boosting Regression** turned out to be the best model for this dataset because it has:
* the lowest root mean squared error


The average house price is a little over 200K USD. **Gradient Boosting** gives an average deviation between predicted and actual values of about 40K USD (roughly 18-19%), which is a reasonable result.

After Gradient Boosting, **RF** has an error of about 43K USD (roughly 20%), while **LR** and **DTR** are both above 25%.
