This notebook predicts house price per unit area. We first fit a linear regression model, then a polynomial regression model, and compare the results to see which one performs better.

First, let's get to know the data better and then make a linear regression model based on the dataset.

## Step 1: Importing the dataset and necessary libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df=pd.read_csv('/kaggle/input/real-estate-price-prediction/Real estate.csv')

## Step 2: Getting to know the data better

In [None]:
df.head()

In [None]:
df.drop('No', axis=1, inplace=True)

In [None]:
df.columns

We can see the features are 'transaction date', 'house age', 'distance to the nearest MRT station','number of convenience stores', 'latitude' and 'longitude'.

In [None]:
df.shape

We have 414 rows and 7 columns!

In [None]:
df.info()

There is no missing data and no categorical data.
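The same conclusion can be checked programmatically instead of read off `df.info()`. The sketch below uses a small hypothetical frame (`toy`, with one deliberately missing value) rather than the real dataset, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (assumption).
toy = pd.DataFrame({
    'X2 house age': [32.0, 19.5, np.nan, 13.3],
    'X4 number of convenience stores': [10, 9, 5, 5],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Names of columns that are not numeric (candidates for categorical data).
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()

print(missing_per_column)
print('non-numeric columns:', non_numeric)
```

On the real `df`, all counts being zero and an empty `non_numeric` list would confirm the statement above.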

In [None]:
df.describe()

## Step 3: Exploratory data analysis

In [None]:
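# displot is a figure-level function and creates its own figure,
# so a preceding plt.figure() call would only produce an empty plot.
sns.displot(x=df['Y house price of unit area'], kde=True, aspect=2, color='purple')
plt.xlabel('House price per unit area')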

Most house prices per unit area are around 40, and the distribution looks roughly normal.
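A visual impression of normality can be backed up with summary statistics. As a rough sketch, the example below computes skewness and excess kurtosis with pandas on synthetic, normally distributed prices (the mean of 38 and spread of 13 are assumptions, not values from the dataset); both statistics are close to 0 for a normal distribution.

```python
import numpy as np
import pandas as pd

# Synthetic prices drawn from a normal distribution (assumption, not real data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.normal(loc=38, scale=13, size=414))

skew = prices.skew()         # asymmetry; about 0 for a symmetric distribution
excess_kurt = prices.kurt()  # tail weight relative to normal; about 0 for normal

print(f'skewness: {skew:.3f}, excess kurtosis: {excess_kurt:.3f}')
```

Applying the same two calls to `df['Y house price of unit area']` would show how far the real target departs from normality.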

In [None]:
sns.jointplot(data=df, y=df['Y house price of unit area'], x=df['X1 transaction date'])

We can see that house prices increased slightly over time.

In [None]:
sns.jointplot(data=df, y=df['Y house price of unit area'], x=df['X2 house age'])

We can see that older houses (20-40 years) tend to be cheaper than newer ones (0-10 years).

In [None]:
sns.jointplot(data=df, y=df['Y house price of unit area'], x=df['X3 distance to the nearest MRT station'])

The farther a house is from the nearest MRT station, the cheaper it tends to be, since it is less convenient.

In [None]:
sns.jointplot(data=df, y=df['Y house price of unit area'], x=df['X4 number of convenience stores'])

The more convenience stores there are around a house, the more expensive it tends to be.

In [None]:
sns.jointplot(data=df, y=df['Y house price of unit area'], x=df['X5 latitude'])

In [None]:
sns.jointplot(data=df, y=df['Y house price of unit area'], x=df['X6 longitude'])

The higher the latitude and longitude, the more expensive the house tends to be.

In [None]:
df.corr()

In [None]:
sns.heatmap(df.corr(), annot=True,cmap='Greens')

We can see that 'number of convenience stores', 'latitude' and 'longitude' have a positive correlation with the house price, while 'distance to the nearest MRT station' has a negative one.
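Reading signs off a heatmap gets harder as the number of features grows, so it can help to extract only the target column of the correlation matrix and sort it. The sketch below does this on synthetic data; the column names `distance`, `stores` and `price` and the coefficients used to generate them are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: price falls with distance and rises with store count (assumption).
rng = np.random.default_rng(1)
n = 200
distance = rng.uniform(0, 5000, n)
stores = rng.integers(0, 11, n)
price = 45 - 0.005 * distance + 1.2 * stores + rng.normal(0, 3, n)

toy = pd.DataFrame({'distance': distance, 'stores': stores, 'price': price})

# Correlation of every feature with the target, most negative first.
target_corr = toy.corr()['price'].drop('price').sort_values()
print(target_corr)
```

With the real data, the equivalent call would be `df.corr()['Y house price of unit area']` followed by the same `drop` and `sort_values`.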

## A. Linear regression

## Step 4-A: Building a linear regression model

Defining the features and target variables:

In [None]:
X = df.drop('Y house price of unit area',axis=1)
y = df['Y house price of unit area']

Splitting the data into train and test:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
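To make the effect of `test_size=0.3` and `random_state=101` concrete, here is a small sketch on placeholder arrays with the same number of rows as the dataset (414) and six feature columns. The arrays `X_toy` and `y_toy` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 414 rows and 6 feature columns (assumption).
X_toy = np.arange(414 * 6).reshape(414, 6)
y_toy = np.arange(414)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101
)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` matters later in the notebook: reusing the same seed means both models are evaluated on the same test rows.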

Training the model:

In [None]:
from sklearn.linear_model import LinearRegression
LR= LinearRegression()
LR.fit(X_train, y_train)

The fitted coefficients are:

In [None]:
pd.DataFrame(LR.coef_, X.columns, columns=['Coefficient'])

## Step 5-A: Predicting the test data

In [None]:
y_pred=LR.predict(X_test)

## Step 6-A: Evaluating the model

In [None]:
from sklearn import metrics
MAE_simple= metrics.mean_absolute_error(y_test, y_pred)
MSE_simple= metrics.mean_squared_error(y_test, y_pred)
RMSE_simple=np.sqrt(MSE_simple)

pd.DataFrame([MAE_simple, MSE_simple, RMSE_simple], index=['MAE', 'MSE', 'RMSE'], columns=['Metrics'])

The mean absolute error is 5.39 and the root mean squared error is 6.79.
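The three metrics are easy to mix up, so here is a small worked example computed by hand with NumPy. The values in `y_true` and `y_hat` are made up for illustration. MAE and RMSE are in the same units as the target, MSE is in squared units, and RMSE can never be smaller than MAE.

```python
import numpy as np

# Made-up true values and predictions (assumption, for illustration only).
y_true = np.array([40.0, 35.0, 50.0, 25.0])
y_hat = np.array([38.0, 39.0, 44.0, 25.0])

errors = y_true - y_hat            # [2, -4, 6, 0]

mae = np.mean(np.abs(errors))      # (2 + 4 + 6 + 0) / 4 = 3.0
mse = np.mean(errors ** 2)         # (4 + 16 + 36 + 0) / 4 = 14.0
rmse = np.sqrt(mse)                # square root of 14, about 3.74

print(f'MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}')
```

Because squaring weights large errors more heavily, a wide gap between RMSE and MAE hints that a few predictions are far off.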

In [None]:
# Compare against the test targets, since y_pred covers only the test rows.
print('predicted mean:', np.mean(y_pred))
print('real mean:', y_test.mean())

We now check the residuals:

In [None]:
test_residuals=y_test-y_pred

In [None]:
sns.displot(x=test_residuals)

The test residuals appear roughly normally distributed with a mean close to 0, which is what we want.

In [None]:
sns.scatterplot(x=y_test, y=test_residuals)
plt.axhline(y=0, color='r', ls='--')

The residuals show no clear pattern when plotted against the test targets, which is what we wanted.

## B. Polynomial regression

## Step 4-B: Building a polynomial regression model

Defining the features and target variables:

In [None]:
X = df.drop('Y house price of unit area',axis=1)
y = df['Y house price of unit area']

Preprocessing:

In [None]:
from sklearn.preprocessing import PolynomialFeatures
PF=PolynomialFeatures(degree=2, include_bias=False)
poly_features=PF.fit_transform(X)

You can see that we chose degree 2. How did we decide this? Look at Step 7!

In [None]:
poly_features.shape
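The number of columns produced by `PolynomialFeatures` follows from a counting argument: with n input features and degree d, there are C(n + d, d) monomials in total, minus 1 when `include_bias=False` drops the constant term. The sketch below checks this for 6 features and degree 2 on a dummy array of zeros (the array is a placeholder, since only its shape matters here).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 6, 2

# Dummy input: only the number of columns matters for the output width.
dummy = np.zeros((3, n_features))
expanded = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(dummy)

# 6 linear terms + 6 squares + 15 pairwise products = 27 columns.
expected = comb(n_features + degree, degree) - 1

print(expanded.shape, expected)
```

The count grows quickly with the degree, which is one reason high-degree models overfit a dataset of only a few hundred rows.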

Splitting the data into train and test:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

Training the model:

In [None]:
from sklearn.linear_model import LinearRegression
polymodel=LinearRegression()
polymodel.fit(X_train, y_train)

## Step 5-B: Predicting the test data

In [None]:
y_pred=polymodel.predict(X_test)

## Step 6-B: Evaluating the model

In [None]:
MAE_Poly = metrics.mean_absolute_error(y_test,y_pred)
MSE_Poly = metrics.mean_squared_error(y_test,y_pred)
RMSE_Poly = np.sqrt(MSE_Poly)

pd.DataFrame([MAE_Poly, MSE_Poly, RMSE_Poly], index=['MAE', 'MSE', 'RMSE'], columns=['metrics'])

The mean absolute error is 4.30 and the root mean squared error is 5.30, both lower than for the previous model:

In [None]:
pd.DataFrame({'Poly Metrics': [MAE_Poly, MSE_Poly, RMSE_Poly], 'Simple Metrics':[MAE_simple, MSE_simple, RMSE_simple]}, index=['MAE', 'MSE', 'RMSE'])

## * Step 7-B: Adjusting model parameters

Now let's discuss how we chose the degree of the polynomial model:

In [None]:
train_RMSE_list=[]
test_RMSE_list=[]

for d in range(1,10):
    
    polynomial_converter= PolynomialFeatures(degree=d, include_bias=False)
    poly_features= polynomial_converter.fit_transform(X)
    
    X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)
    
    polymodel=LinearRegression()
    polymodel.fit(X_train, y_train)
    
    y_train_pred=polymodel.predict(X_train)
    y_test_pred=polymodel.predict(X_test)
    
    train_RMSE=np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
    
    test_RMSE=np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))
        
    train_RMSE_list.append(train_RMSE)
    test_RMSE_list.append(test_RMSE)

In [None]:
plt.plot(range(1,6), train_RMSE_list[:5], label='Train RMSE')
plt.plot(range(1,6), test_RMSE_list[:5], label='Test RMSE')

plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE')
plt.legend()

In [None]:
display(pd.DataFrame({'degree': list(range(1, 10)),'train_RMSE': train_RMSE_list,'test_RMSE':test_RMSE_list}).set_index('degree'))

We can see from the graph that degree=2 is the best choice for this model.
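The loop above picks the degree from a single train/test split. A different technique, not used in this notebook, is k-fold cross-validation, which averages the error over several splits and so depends less on one particular partition. The sketch below is a minimal version on synthetic data with a known quadratic relationship; `X_toy`, `y_toy` and the generating formula are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a genuinely quadratic target (assumption).
rng = np.random.default_rng(42)
X_toy = rng.uniform(-2, 2, size=(200, 2))
y_toy = (1 + 2 * X_toy[:, 0] - X_toy[:, 1] ** 2
         + 0.5 * X_toy[:, 0] * X_toy[:, 1]
         + rng.normal(0, 0.3, 200))

cv_rmse = {}
for d in range(1, 6):
    # The pipeline refits the polynomial expansion inside every fold.
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(
        model, X_toy, y_toy, cv=5, scoring='neg_root_mean_squared_error'
    )
    cv_rmse[d] = -scores.mean()  # scores are negated RMSE values

best_degree = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print('best degree:', best_degree)
```

On the real data, `X_toy` and `y_toy` would be replaced by `X` and `y`; whether cross-validation also favors degree 2 there would need to be checked.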

I hope you enjoyed this notebook! If you have any questions, please ask in the comments. If you liked the notebook, please upvote! Thanks!