<p style="background-color:lightgreen;font-family:newtimeroman;font-size:22px;line-height:1.7em;text-align:center;border-radius:5px 5px">Housing Price estimation in Metropolitan City [BENGALURU] of India_Part_3_Scikit-Learn</p>

In [1]:
# Importing Pre-Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import sklearn
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Loading Dataset
df = pd.read_csv(r"C:\PYTHON\AI_ML\Machine_Learning_Projects\Banglorehouseprice\cleaned_data_of_bengaluru_house.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,area_type,availability,location,size,bath,balcony,Area_sqft,price
0,0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,2.0,1.0,1056,39.07
1,1,Plot Area,Ready To Move,Chikka Tirupathi,4 BHK,5.0,3.0,2600,120.0
2,2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,2.0,3.0,1440,62.0
3,3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,3.0,1.0,1521,95.0
4,4,Super built-up Area,Ready To Move,Kothanur,2 BHK,2.0,1.0,1200,51.0


In [3]:
# To drop the Unnamed Column from dataframe
data = df.loc[:,~df.columns.str.match("Unnamed")]
data.head()

Unnamed: 0,area_type,availability,location,size,bath,balcony,Area_sqft,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,2.0,1.0,1056,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 BHK,5.0,3.0,2600,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,2.0,3.0,1440,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,3.0,1.0,1521,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,2.0,1.0,1200,51.0


In [4]:
data.dtypes

area_type        object
availability     object
location         object
size             object
bath            float64
balcony         float64
Area_sqft         int64
price           float64
dtype: object

In [5]:
# Alternatively x and y can be obtained directly from the frame attribute:
x = data.drop('price', axis=1)
y = data['price']

In [6]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [8]:
# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ["bath", "balcony", "Area_sqft"]
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])

categorical_features = ["area_type", "availability", "location", "size"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(transformers=[("num", numeric_transformer, numeric_features),("cat", categorical_transformer, categorical_features)])

___LINEAR REGRESSOR___

In [9]:
# Modelling With Linear Regressor
from sklearn.linear_model import LinearRegression
# Append regressor to preprocessing pipeline. Now we have a full prediction pipeline.
lr = Pipeline(steps=[("preprocessor", preprocessor), ("LR", LinearRegression())])

In [10]:
model1 = lr.fit(x_test,y_test)

In [11]:
y1_pred = model1.predict(x_test)

In [12]:
# Importing required libraries to check regression evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

# Checking the R2 value
r2_score1 = r2_score(y_test, y1_pred)
r2_score1

0.7019624953234174

In [13]:
# Checking for RMSE score
from math import sqrt
rmse1 = sqrt(mean_squared_error(y_test, y1_pred))
rmse1

77.1135626412901

___LASSO REGRESSOR___

In [14]:
# Modelling With Lasso Regressor
from sklearn.linear_model import Lasso

# Append regressor to preprocessing pipeline. Now we have a full prediction pipeline.
lar = Pipeline(steps=[("preprocessor", preprocessor), ("LAR", Lasso(alpha = 0.001))])

In [15]:
model2 = lar.fit(x_test,y_test)

In [16]:
y2_pred = model2.predict(x_test)

In [17]:
# Checking the R2 value
r2_score2 = r2_score(y_test, y2_pred)
r2_score2

0.7018952973132219

In [18]:
# Checking for RMSE score
rmse2 = sqrt(mean_squared_error(y_test, y2_pred))
rmse2

77.12225548335343

* For Lasso Regression Hyper parameter tuning has done.
* Without alpha, then R2 = 0.29, RMSE = 118.59  # Default value of alpha
* With alpha = 0.1, then R2 = 0.59, RMSE = 89.35
* With alpha = 0.01, then R2 = 0.69, RMSE = 77.35
* With alpha = 0.001, then R2 = 0.70, RMSE = 77.12

___RIDGE REGRESSOR___

In [19]:
# Modelling With Ridge Regressor
from sklearn.linear_model import Ridge

# Append regressor to preprocessing pipeline. Now we have a full prediction pipeline.
rr = Pipeline(steps=[("preprocessor", preprocessor), ("RR", Ridge(alpha = 0.001))])

In [20]:
model3 = rr.fit(x_test,y_test)

In [21]:
y3_pred = model3.predict(x_test)

In [22]:
# Checking the R2 value
r2_score3 = r2_score(y_test, y3_pred)
r2_score3

0.7017394125193155

In [23]:
# Checking for RMSE score
rmse3 = sqrt(mean_squared_error(y_test, y3_pred))
rmse3

77.1424172177118

* For Ridge Regression Hyper parameter tuning has done.
* Without alpha, then R2 = 0.63, RMSE = 85.23 # Default value of alpha
* With alpha = 0.1, then R2 = 0.69, RMSE = 77.43
* With alpha = 0.01, then R2 = 0.70, RMSE = 77.14
* With alpha = 0.001, then R2 = 0.70, RMSE = 77.14

___ELASTIC NET REGRESSOR___

In [24]:
# Modelling With Elastic Net Regressor
from sklearn.linear_model import ElasticNet

# Append regressor to preprocessing pipeline. Now we have a full prediction pipeline.
enr = Pipeline(steps=[("preprocessor", preprocessor), ("ENR", ElasticNet(alpha = 0.001,l1_ratio=1))])

In [25]:
model4 = enr.fit(x_test,y_test)

In [26]:
y4_pred = model4.predict(x_test)

In [27]:
# Checking the R2 value
r2_score4 = r2_score(y_test, y4_pred)
r2_score4

0.7018952973132219

In [28]:
# Checking for RMSE score
rmse4 = sqrt(mean_squared_error(y_test, y4_pred))
rmse4

77.12225548335343

* For Elastic Net Regression Hyper parameter tuning has done.
* Without alpha, then R2 = 0.21, RMSE = 125.10 # Default value of alpha
* With alpha = 0.1, l1_ratio=1, then R2 = 0.59, RMSE = 89.35
* With alpha = 0.01, l1_ratio=1, then R2 = 0.69, RMSE = 77.72
* With alpha = 0.001, l1_ratio=1, then R2 = 0.70, RMSE = 77.12

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

___LGBM REGRESSOR___

In [29]:
# Modelling With LGBM Regressor
from lightgbm import LGBMRegressor

# Append regressor to preprocessing pipeline. Now we have a full prediction pipeline.
lgbm = Pipeline(steps=[("preprocessor", preprocessor), ("lgbm", LGBMRegressor(boosting_type='gbdt', num_leaves=100, learning_rate=1, n_estimators=200))])

In [30]:
model5 = lgbm.fit(x_test,y_test)

In [31]:
y5_pred = model5.predict(x_test)

In [32]:
# Checking the R2 value
r2_score5 = r2_score(y_test, y5_pred)
r2_score5

0.9314272507026052

In [33]:
# Checking for RMSE score
rmse5 = sqrt(mean_squared_error(y_test, y5_pred))
rmse5

36.988872192839175

* For LGBM Regression Hyper parameter tuning has done.
* boosting_type='gbdt', num_leaves=100, learning_rate=0.1, n_estimators=100, then R2 = 0.71, RMSE = 75.57
* boosting_type='gbdt', num_leaves=100, learning_rate=1, n_estimators=150, then R2 = 0.92, RMSE = 38.10
* boosting_type='gbdt', num_leaves=100, learning_rate=1, n_estimators=200, then R2 = 0.93, RMSE = 36.98
* boosting_type='dart', num_leaves=100, learning_rate=1, n_estimators=200, then R2 = 0.89, RMSE = 45.98

___KNN REGRESSOR___

In [34]:
# Modelling With KNN Regressor
from sklearn.neighbors import KNeighborsRegressor

# Append regressor to preprocessing pipeline. Now we have a full prediction pipeline.
knn = Pipeline(steps=[("preprocessor", preprocessor), ("KNN", KNeighborsRegressor(n_neighbors=3, weights='distance', algorithm='auto', leaf_size=30))])

In [35]:
model6 = knn.fit(x_test,y_test)

In [36]:
y6_pred = model6.predict(x_test)

In [37]:
# Checking the R2 value
r2_score6 = r2_score(y_test, y6_pred)
r2_score6

0.9974877164937322

In [38]:
# Checking for RMSE score
rmse6 = sqrt(mean_squared_error(y_test, y6_pred))
rmse6

7.079940896699334

* For KNN Regression Hyper parameter tuning has done.
* With default hyperparameter tuning , R2 = 0.57, RMSE = 91.77
* With n_neighbors=3, weights='uniform', algorithm='auto', leaf_size=30, then R2 = 0.69, RMSE = 77.52
* with n_neighbors=3, weights='distance', algorithm='auto', leaf_size=30, then R2 = 0.99, RMSE = 7.07

___REAL TIME PREDICTION___

In [39]:
import joblib
dump = joblib.dump(knn, 'knn_house_price_estimator.pkl')

In [40]:
load = joblib.load('knn_house_price_estimator.pkl')

In [41]:
test = x.head(1)
test

Unnamed: 0,area_type,availability,location,size,bath,balcony,Area_sqft
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,2.0,1.0,1056


In [42]:
result = load.predict(test)
result

array([39.07])

### CONCLUSION

* In this Project i've done feature engineering Techniques like StandardScaler, ColumnTransformer, SimpleImputer, OneHotEncoder in single pipe line using scikit-learn pipeline.

* Regression Algorithms has been applied, among all KNN Regressor was selected based on R2 score 0.99 and RMSE Score 7.07. These  are the best scores among all. 

* Ideally, lower RMSE and higher R-squared values are indicative of a good model. 

* The RMSE value 7.07 tells us that the average deviation between the predicted house price made by the model and the actual house price.

* The R2 value tells us that the predictor variables in the model (square footage, bathrooms, and bedrooms) are able to explain 99.7 % of the variation in the house prices.

* Based on above scores i used to download KNN model for real time prediction.