Real Estate Valuation
            (Random Forest, XGBoost, and kNN)

Problem Statement: 
Predict real estate values in Sindian District, New Taipei City, using historical data on factors like location, size, and building age. 
A model is trained on 70% of the data and tested on the remaining 30% to estimate property prices. This is a regression problem: the target, house price per unit area, is a continuous value.

1. Random Forest

In [74]:
import pandas as pd # Imports the pandas library for data manipulation and analysis.
from sklearn.preprocessing import LabelEncoder # Imports the LabelEncoder class from scikit-learn, which converts categorical labels into numerical values (not needed below, since every column is numeric).
from sklearn.ensemble import RandomForestRegressor # Imports the RandomForestRegressor class from scikit-learn to create random forest regression models, an ensemble of decision trees.
from sklearn.model_selection import train_test_split # Imports the train_test_split function from scikit-learn to split the dataset into training and testing sets.
from sklearn.metrics import mean_absolute_error,mean_squared_error # Imports the MAE and MSE metrics used to evaluate each model.
from sklearn.neighbors import KNeighborsRegressor # Imports the KNeighborsRegressor class for k-nearest neighbors regression.

In [76]:
df=pd.read_csv('C:/Users/suvra/OneDrive/Desktop/Real estate valuation data set.csv')

In [78]:
df.head()
#Displays the first five rows of the dataframe `df`, providing a quick look at the structure and contents of the data.

Unnamed: 0,No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,1,2012.917,32.0,84.87882,10,24.98298,121.54024,37.9
1,2,2012.917,19.5,306.5947,9,24.98034,121.53951,42.2
2,3,2013.583,13.3,561.9845,5,24.98746,121.54391,47.3
3,4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
4,5,2012.833,5.0,390.5684,5,24.97937,121.54245,43.1


In [80]:
df.isna().sum()
# Returns the number of missing values (NaNs) in each column of the dataframe `df`.

No                                        0
X1 transaction date                       0
X2 house age                              0
X3 distance to the nearest MRT station    0
X4 number of convenience stores           0
X5 latitude                               0
X6 longitude                              0
Y house price of unit area                0
dtype: int64

In [82]:
df.info()
# Provides a concise summary of the dataframe df, including the number of non-null entries, data types of each column, and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   No                                      414 non-null    int64  
 1   X1 transaction date                     414 non-null    float64
 2   X2 house age                            414 non-null    float64
 3   X3 distance to the nearest MRT station  414 non-null    float64
 4   X4 number of convenience stores         414 non-null    int64  
 5   X5 latitude                             414 non-null    float64
 6   X6 longitude                            414 non-null    float64
 7   Y house price of unit area              414 non-null    float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB


In [84]:
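df.drop(['No','X1 transaction date'], axis=1, inplace=True)
# Drops the row-number column `No` and the `X1 transaction date` column in place, so neither is used as a feature.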

In [86]:
df.head()
#Displays the first five rows of the dataframe `df`, providing a quick look at the structure and contents of the data.

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,32.0,84.87882,10,24.98298,121.54024,37.9
1,19.5,306.5947,9,24.98034,121.53951,42.2
2,13.3,561.9845,5,24.98746,121.54391,47.3
3,13.3,561.9845,5,24.98746,121.54391,54.8
4,5.0,390.5684,5,24.97937,121.54245,43.1


In [88]:
df.shape
#Tells us how many rows and columns are in the data.

(414, 6)

In [90]:
from sklearn.preprocessing import StandardScaler

In [92]:
ss=StandardScaler()
# StandardScaler() is a preprocessing tool from the sklearn library that standardizes features by removing the mean and scaling to unit variance, making the data have a mean of 0 and a standard deviation of 1.

In [94]:
scale_cols=['X2 house age','X3 distance to the nearest MRT station','X5 latitude','X6 longitude']
df[scale_cols]=ss.fit_transform(df[scale_cols])
# Standardizes the four continuous features in a single call, so `ss` retains the mean and standard deviation of every scaled column. `X4 number of convenience stores` is left unscaled.
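
The scaler in this cell is fit on the full dataset before the train/test split, so the test rows contribute to the mean and standard deviation. One possible alternative is to wrap the scaler and the model in a scikit-learn Pipeline, so that the scaling statistics are learned from the training rows only. The sketch below runs on synthetic stand-in data: the column names `age`, `dist`, `stores`, the generated values, and the seed are all hypothetical, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns, not the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.uniform(0, 40, 200),
    "dist": rng.uniform(50, 5000, 200),
    "stores": rng.integers(0, 11, 200),
})
target = 50 - 0.3 * demo["age"] - 0.005 * demo["dist"] + rng.normal(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.3, random_state=0)

# Scale only the continuous columns; pass the count column through unchanged.
pre = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "dist"])],
    remainder="passthrough",
)
pipe = Pipeline([("pre", pre), ("rf", RandomForestRegressor(random_state=0))])

# fit() learns the scaler statistics from X_tr only; predict() reuses them.
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)
print(pipe_pred[:3])
```

The same fitted pipeline can later be applied to unseen records with `pipe.predict(...)`, without a separate scaling step.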

In [96]:
df.head()
#Displays the first five rows of the dataframe `df`, providing a quick look at the structure and contents of the data.

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
0,1.255628,-0.792495,10,1.12543,0.448762,37.9
1,0.157086,-0.616612,9,0.912444,0.401139,42.2
2,-0.387791,-0.414015,5,1.48686,0.688183,47.3
3,-0.387791,-0.414015,5,1.48686,0.688183,54.8
4,-1.117223,-0.549997,5,0.834188,0.592937,43.1


In [98]:
df.columns
# Returns the labels of columns in the DataFrame df as an Index object.

Index(['X2 house age', 'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')

In [100]:
X=df[['X2 house age', 'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude']]
Y=df['Y house price of unit area']
# Selects the five feature columns from the dataframe df to create the feature matrix X for the machine learning model.
# Selects the `Y house price of unit area` column from df as the target vector Y.

In [102]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.3)
# `train_test_split` splits the dataset into training and testing subsets, with 30% of the data allocated for testing and 70% for training.
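
Because no `random_state` is passed to `train_test_split`, every run draws a different partition, and the error metrics reported for each model change with it. A minimal sketch of how a fixed seed makes the split repeatable, using toy arrays in place of X and Y (the values and the seed 42 are arbitrary assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for X and Y (hypothetical values).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The same random_state yields the same partition on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(len(split_a[0]), len(split_a[1]))  # training rows, test rows
```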

In [104]:
rf=RandomForestRegressor()
# Initializes an instance of the RandomForestRegressor class, which is a machine learning model that uses an ensemble of decision trees to perform regression tasks.
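
To make the "ensemble of decision trees" idea concrete, the sketch below checks that a forest's regression output is the average of its individual trees' outputs. It uses synthetic data, and `n_estimators=25` and the seeds are illustrative choices, not the settings of the model above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=120)

forest = RandomForestRegressor(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_;
# averaging their outputs reproduces the forest's prediction.
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

print(np.allclose(manual_mean, forest.predict(X_demo[:5])))
```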

In [106]:
rf.fit(X_train,Y_train)
# Trains the RandomForestRegressor model rf on the training data X_train (features) and Y_train (target values).

In [108]:
Y_pred=rf.predict(X_test)
#  Uses the trained RandomForestRegressor model rf to predict target values for the test data X_test.

In [110]:
print("Random Forest Regressor Prediction")
print("mean_absolute_error:",mean_absolute_error(Y_test,Y_pred),",","mean_squared_error:",mean_squared_error(Y_test,Y_pred))                          
# prints the mean absolute error (MAE) and mean squared error (MSE) between the actual values Y_test and the predicted values Y_pred.

Random Forest Regressor Prediction
mean_absolute_error: 5.1730201333333286 , mean_squared_error: 85.1404742983855


2. XGBoost

In [120]:
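# Installs XGBoost (if it is not already available) and imports it.
!pip install xgboost
import xgboost as xgb
xg=xgb.XGBRegressor()
# Initializes an instance of the XGBRegressor class with default hyperparameters.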



In [122]:
xg.fit(X_train, Y_train)
# Trains the XGBRegressor model xg on the training data X_train (features) and Y_train (target values).
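
XGBoost is a gradient-boosted tree library. As a rough illustration of the boosting idea only, the sketch below fits shallow scikit-learn decision trees one after another, each to the residuals left by the previous ones, under squared-error loss. This is a simplified stand-in and not XGBoost's actual algorithm, which adds regularization and uses second-order gradient information. The data, learning rate, tree depth, and number of rounds are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data (hypothetical, for illustration only).
rng = np.random.default_rng(2)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # illustrative value
n_rounds = 50        # illustrative value

# Start from the mean, then repeatedly fit a shallow tree to the residuals.
boosted = np.full_like(y_demo, y_demo.mean())
mse_start = np.mean((y_demo - boosted) ** 2)
for _ in range(n_rounds):
    residuals = y_demo - boosted
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    boosted += learning_rate * small_tree.predict(X_demo)

mse_end = np.mean((y_demo - boosted) ** 2)
print(mse_start, mse_end)  # training error before and after boosting
```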

In [124]:
Y_pred1=xg.predict(X_test)
# Uses the trained XGBRegressor model xg to predict target values for the test data X_test.

In [126]:
print("XGBoost Regressor Prediction")
print("mean_absolute_error:",mean_absolute_error(Y_test,Y_pred1),",","mean_squared_error:",mean_squared_error(Y_test,Y_pred1))
# prints the mean absolute error (MAE) and mean squared error (MSE) between the actual values Y_test and the predicted values Y_pred1 from the XGBRegressor model.

XGBoost Regressor Prediction
mean_absolute_error: 5.33907176361084 , mean_squared_error: 91.14249604587283


3. kNN

In [128]:
knn=KNeighborsRegressor()
# Initializes an instance of the KNeighborsRegressor class, which uses the k-nearest neighbors algorithm for regression tasks.
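
With default settings, KNeighborsRegressor predicts the plain average of the targets of the 5 nearest training rows by Euclidean distance, which is also why feature scaling matters for this model. The sketch below reproduces one prediction by hand on synthetic data (the arrays and the query point are hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data and a single query point (hypothetical values).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1]
query = np.array([[0.2, -0.1]])

knn_demo = KNeighborsRegressor(n_neighbors=5).fit(X_demo, y_demo)

# Manual version: Euclidean distance to every training row,
# then the unweighted average of the 5 closest targets.
distances = np.linalg.norm(X_demo - query, axis=1)
nearest = np.argsort(distances)[:5]
manual_pred = y_demo[nearest].mean()

print(np.isclose(manual_pred, knn_demo.predict(query)[0]))
```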

In [130]:
knn.fit(X_train,Y_train)
# Trains the KNeighborsRegressor model knn using the training data X_train (features) and Y_train (target values).

In [132]:
Y_pred2=knn.predict(X_test)
# Uses the trained KNeighborsRegressor model knn to predict target values for the test data X_test.

In [134]:
print("kNN Regressor Prediction")
print("mean_absolute_error:",mean_absolute_error(Y_test,Y_pred2),",","mean_squared_error:",mean_squared_error(Y_test,Y_pred2))
# prints the mean absolute error (MAE) and mean squared error (MSE) between the actual values Y_test and the predicted values Y_pred2 from the KNeighborsRegressor model.

kNN Regressor Prediction
mean_absolute_error: 5.648960000000001 , mean_squared_error: 99.98673920000003


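Comparison of the performance of the 3 models: Random Forest, XGBoost, and kNN regressors

On this train/test split, the Random Forest (RF) regressor outperforms both XGBoost and kNN: it has the lowest MAE (mean_absolute_error, 5.17 vs. 5.34 and 5.65) and the lowest MSE (mean_squared_error, 85.14 vs. 91.14 and 99.99), making it the most accurate of the three. Because the split is random and no random_state is set, the exact figures will vary between runs.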
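
MSE is expressed in squared target units, which makes it hard to compare with MAE. Taking the square root gives the RMSE, which is in the same units as the house price. The short calculation below converts the three MSE values printed in the sections above; nothing is re-trained here.

```python
import math

# MSE values copied from the printed outputs of the three models.
reported_mse = {
    "Random Forest": 85.1404742983855,
    "XGBoost": 91.14249604587283,
    "kNN": 99.98673920000003,
}

# RMSE is the square root of MSE, in the same units as the target.
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f}")
```

The RMSE values (about 9.2 to 10.0) are noticeably larger than the MAE values (about 5.2 to 5.6), which suggests that a few test records have large errors, since squaring weights those errors more heavily.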

1st preference - Random Forest Regressor

2nd preference - XGBoost Regressor

3rd preference - kNN Regressor
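
The ranking above rests on a single random split of 414 records, so the gaps between models may partly reflect which rows landed in the test set. One way to get a steadier comparison is k-fold cross-validation. The sketch below assumes synthetic data and compares only the two scikit-learn models; an XGBRegressor could be added to the `candidates` dictionary in the same way. The fold count, seeds, and `n_estimators=50` are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(150, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "kNN": KNeighborsRegressor(),
}

cv_mae = {}
for name, model in candidates.items():
    # scikit-learn reports negated MAE so that higher is better; flip the sign.
    scores = -cross_val_score(model, X_demo, y_demo, cv=cv,
                              scoring="neg_mean_absolute_error")
    cv_mae[name] = (scores.mean(), scores.std())
    print(f"{name}: MAE = {scores.mean():.3f} +/- {scores.std():.3f}")
```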

Predicting on new records:

In [136]:
new_data1=pd.read_csv('C:/Users/suvra/Downloads/New_data1.csv')

In [138]:
new_data1

Unnamed: 0,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude
0,32.0,84.87882,10,24.98298,121.54024
1,19.5,306.5947,9,24.98034,121.53951


In [146]:
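new_data1['X2 house age']=ss.fit_transform(new_data1[['X2 house age']])
new_data1['X3 distance to the nearest MRT station']=ss.fit_transform(new_data1[['X3 distance to the nearest MRT station']])
new_data1['X5 latitude']=ss.fit_transform(new_data1[['X5 latitude']])
new_data1['X6 longitude']=ss.fit_transform(new_data1[['X6 longitude']])
# Caution: fit_transform re-estimates the mean and standard deviation from these two rows alone, so each scaled column becomes +1 and -1.
# To place new records on the scale the model was trained on, fit one scaler on the training features and call transform (not fit_transform) here.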

In [148]:
rf.predict(new_data1)
# Uses the trained RandomForestRegressor model rf to predict target values for the new input data new_data1.

array([49.81636667, 25.8535    ])
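
When new records are scaled with `fit_transform`, the scaler discards the statistics it learned earlier and re-estimates them from the new rows. With only two rows, every standardized column collapses to +1 and -1, whatever the raw values are. The sketch below contrasts that with reusing a scaler fit on training data, using a single hypothetical `age` column with made-up training values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training column and two new records.
train = pd.DataFrame({"age": [5.0, 13.3, 19.5, 32.0, 40.0]})
new = pd.DataFrame({"age": [32.0, 19.5]})

# Statistics learned once, from the training data.
scaler = StandardScaler().fit(train)

refit = StandardScaler().fit_transform(new).ravel()  # re-fit on the 2 new rows
reused = scaler.transform(new).ravel()               # training-scale values

print(np.round(refit, 3))   # the two rows collapse to +1 and -1
print(np.round(reused, 3))  # values relative to the training mean and std
```

Only the `reused` values are on the scale a model trained on `train` would expect.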

In [150]:
Y_pred # Displays the Random Forest predictions for the test set X_test.

array([48.51495833, 39.52676667, 51.61133333, 30.93197619, 18.784     ,
       46.90125   , 57.1515    , 26.8214    , 27.39525   , 26.301     ,
       44.4289    , 59.551     , 31.749     , 48.58161667, 44.915     ,
       45.595     , 49.551     , 25.1675    , 54.18383333, 57.454     ,
       26.8214    , 56.61794524, 51.78566667, 38.813     , 49.1846    ,
       40.533     , 46.715     , 40.88266667, 25.832     , 48.811125  ,
       50.98466667, 51.125     , 29.648     , 26.9295    , 53.66265   ,
       50.09675   , 42.51      , 35.4548    , 37.106     , 47.02725   ,
       30.77397619, 59.1405    , 50.09675   , 34.832     , 68.36446071,
       25.12066667, 25.08      , 14.636     , 21.436     , 39.523     ,
       40.302     , 46.66675   , 45.068     , 18.962     , 48.74575   ,
       25.1315    , 38.567     , 58.108     , 37.10633333, 27.291     ,
       49.70675   , 30.93197619, 26.517     , 61.889     , 51.00034524,
       33.7032    , 58.078     , 41.795     , 38.563     , 20.73

Interpretations:

For the two new records, the Random Forest model predicts a house price of unit area (`Y house price of unit area`) of 49.81636667 and 25.8535, respectively. The Y_pred array shown above holds the model's predictions for the test features X_test.

The 1st new prediction, 49.81636667, is numerically close to the first entry of Y_pred, 48.51495833. However, train_test_split shuffles the rows, so the first entry of Y_pred belongs to an arbitrary test record rather than to a record resembling the new one; the closeness is therefore not evidence of generalization. How well the model generalizes is measured by the test-set MAE and MSE reported in Section 1.

A more meaningful check is available: the two new records carry the same feature values as the first two rows of the original dataset, whose actual prices are 37.9 and 42.2. The predictions of 49.81636667 and 25.8535 differ from these by roughly 12 and 16 units. A likely cause is the scaling step in In [146]: calling fit_transform on only two rows re-estimates the mean and standard deviation from those rows, which maps every scaled column to +1 and -1 instead of the scale the model was trained on.