<a href="https://colab.research.google.com/github/radhakrishnan-omotec/avm-repository/blob/master/Real_Estate_Pricing_Using_Regression_Model_%F0%9F%8F%A0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b>1 <span style='color:#a6acd7'>|</span> Introduction</b>
![](https://images.pexels.com/photos/106399/pexels-photo-106399.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)

### What Is Real Estate?
Real estate is the land along with any permanent improvements attached to the land, whether natural or man-made—including water, trees, minerals, buildings, homes, fences, and bridges. Real estate is a form of real property. - [Investopedia](https://www.investopedia.com/terms/r/realestate.asp)

### What to Expect
In this notebook, I will analyze the factors that affect house prices by visualizing them with Plotly graphs. I will also build a model that predicts house prices based on a property's attributes.

### Dataset
This dataset contains real estate listings in the US, broken down by state and ZIP code.
#### Column Attributes
The realtor-data.csv file has 200k+ entries:
* **status** - Housing Status (on sale or other option)
* **price** - Price in USD
* **bed** - Bedroom count
* **bath** - Bathroom count
* **acre_lot** - Lot size in acres
* **full_address** - Full address
* **street** - Street name
* **city** - City name
* **state** - State name
* **zip_code** - Zip Code
* **house_size** - House size in sqft (square feet)
* **sold_date** - The date when the house was sold

# <b>2 <span style='color:#a6acd7'>|</span> Data Preprocessing</b>
### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingRegressor
from catboost import CatBoostRegressor
from sklearn.kernel_ridge import KernelRidge
from xgboost.sklearn import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, BayesianRidge, SGDRegressor, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV

### Exploring the dataset

In [None]:
df = pd.read_csv('../input/usa-real-estate-dataset/realtor-data.csv')
df.head()

In [None]:
df.info()

In [None]:
df.shape

Let's check whether there are duplicate rows in our data.

In [None]:
df.duplicated().sum()

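Oops, that's a lot of duplicate rows! We have little choice but to drop them: repeated listings can bias the model and can end up in both the training and test sets, which would make the scores look better than they really are.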

In [None]:
df.drop_duplicates(inplace=True)
df.shape

After looking at the data, I think it is better to drop sold_date, street, and full_address, because those columns are unlikely to help the regression.

In [None]:
df = df.drop(columns=['sold_date', 'street', 'full_address'])
df.head()

Now let's check for missing values.

In [None]:
df.isnull().sum()

In [None]:
df.dropna().shape

Simply dropping the rows with missing values is not a great idea: we have already dropped around 90% of our data as duplicates, and dropping missing values as well would leave just 13062 rows. The alternative is to impute missing values with the median (for numerical columns) or the mode (for non-numerical columns).

For exploratory data analysis, however, I think we should use the version with the missing rows dropped, so that we only analyze real, observed values.

In [None]:
df_nonull = df.dropna()

In [None]:
df['bed'] = df['bed'].fillna(df['bed'].median())
df['bath'] = df['bath'].fillna(df['bath'].median())
df['acre_lot'] = df['acre_lot'].fillna(df['acre_lot'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])
df['zip_code'] = df['zip_code'].fillna(df['zip_code'].median())
df['house_size'] = df['house_size'].fillna(df['house_size'].median())

We also need to convert our categorical columns into numerical ones, which we do with LabelEncoder.

In [None]:
from sklearn.preprocessing import LabelEncoder
print('Categorical columns: ')
for col in df.columns:
    if df[col].dtype == 'object':
        print(str(col))

        # Fit and transform on the same string-cast values so that any
        # remaining NaN is encoded consistently as the string 'nan'
        label = LabelEncoder()
        df[col] = label.fit_transform(df[col].astype(str))

        # Map each original category to its integer code, read directly
        # from the fitted encoder (codes follow the sorted class order)
        value_dict = {category: code for code, category in enumerate(label.classes_)}
        print(value_dict)

# <b>3 <span style='color:#a6acd7'>|</span> Exploratory Data Analysis</b>

Here we will use Plotly for interactive data analysis. We use the version of the dataset with missing rows dropped, so that we only analyze real, observed values.

In [None]:
from plotly.offline import iplot, init_notebook_mode
import plotly.express as px

init_notebook_mode(connected=True)

### Checking whether the number of beds, number of baths, and house size affect the price

#### House size & Price

In [None]:
fig = px.scatter(df_nonull, x="house_size", y="price", trendline="ols", color='price', title='Total Square Feet and its Price')
fig.show()

Looking at the graph above, there seems to be a mistake in the data: it would be crazy for someone to sell a 1 million square foot house for just 8 million, so we need to remove that row. We also remove the 60 million house, since its price is so high that it is definitely an outlier.

In [None]:
df_nonull = df_nonull.sort_values(by='house_size', ascending=False)
df_nonull = df_nonull.drop(10328)
df = df.drop(10328)

In [None]:
df_nonull = df_nonull.sort_values(by='price', ascending=False)
df_nonull = df_nonull.drop(40599)
df = df.drop(40599)
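
The row labels 10328 and 40599 used above are specific to this copy of the CSV. As a minimal sketch, the same kind of extreme rows can be located programmatically with `idxmax`. The frame `toy` and its values below are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for df_nonull; the numbers are invented for illustration.
toy = pd.DataFrame(
    {
        "house_size": [1200.0, 2500.0, 1_000_000.0, 1800.0, 9000.0],
        "price": [250_000.0, 600_000.0, 8_000_000.0, 60_000_000.0, 2_000_000.0],
    },
    index=[3, 7, 11, 15, 19],
)

# idxmax returns the index label of the largest value in a column.
largest_house = toy["house_size"].idxmax()
highest_price = toy["price"].idxmax()

# Drop both rows by label; no sorting is needed beforehand.
cleaned = toy.drop(index=[largest_house, highest_price])
print(largest_house, highest_price, cleaned.shape)
```

Looking the labels up this way avoids hard-coding them, although it is still worth inspecting the rows before dropping them.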

In [None]:
fig = px.scatter(df_nonull, x="house_size", y="price", trendline="ols", color='price', title='Total Square Feet and its Price')
fig.show()

#### Bed & Price

In [None]:
fig = px.scatter(df_nonull, x="bed", y="price", trendline="ols", color='price', title='Number of Beds and its Price')
fig.show()

#### Bath & Price

In [None]:
fig = px.scatter(df_nonull, x="bath", y="price", trendline="ols", color='price', title='Number of Baths and its Price')
fig.show()

Judging by the graphs above, the answer is yes: the number of beds, the number of baths, and the house size are all positively correlated with price, although the correlations are weak.
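
To put a number on how strong each relationship is, one option is to compute the Pearson correlation of each feature with price. The sketch below uses a small invented frame (`toy`), not the real data; in the notebook, `df_nonull` would take its place:

```python
import pandas as pd

# Invented values for illustration only; substitute df_nonull in the notebook.
toy = pd.DataFrame(
    {
        "bed": [2, 3, 3, 4, 5, 6],
        "bath": [1, 2, 2, 3, 4, 4],
        "house_size": [900, 1500, 1400, 2200, 3100, 3600],
        "price": [150_000, 320_000, 280_000, 410_000, 900_000, 750_000],
    }
)

# Pearson correlation of each feature with price (values lie in [-1, 1]).
corr_with_price = toy[["bed", "bath", "house_size"]].corrwith(toy["price"])
print(corr_with_price.sort_values(ascending=False))
```

Values close to 0 indicate a weak linear relationship, while values close to 1 indicate a strong positive one.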



### Now let's rank the states by median house price

In [None]:
order = df_nonull.groupby(by=['state'])['price'].median().sort_values(ascending=False).index

fig = px.box(df_nonull, x="state", y="price", points='all', color='state', title='State House Prices Ranked by Median')
fig.update_xaxes(categoryorder='array', categoryarray= list(order))
fig.show()

### Heatmap

In [None]:
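# df_nonull still contains text columns (status, city, state), so keep only the numeric ones before computing correlations
fig = px.imshow(df_nonull.select_dtypes(include='number').corr(), title='Heatmap of numerical values of the data')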
fig.show()

It seems that price has a weak correlation with the number of beds but moderate correlations with the number of baths and the house size. That is encouraging, as it might help us get a decent R² score.

# <b>4 <span style='color:#a6acd7'>|</span> Preparing the Data for Modelling</b>

### Let's standardize the data

In [None]:
df = (df-df.mean())/df.std()

### Separating the data

In [None]:
X = df.drop(columns='price')
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
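
Note that the cells above standardize every column, including price, using statistics from the full dataset before splitting. A common alternative, sketched below and not what this notebook does, is to split first, fit a `StandardScaler` on the training features only, and leave price in USD. The data here is synthetic, and `random_state=42` is an arbitrary choice made only for reproducibility:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names mirror the notebook, values are random.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "bed": rng.integers(1, 6, size=200),
        "bath": rng.integers(1, 4, size=200),
        "house_size": rng.normal(2000, 500, size=200),
    }
)
toy["price"] = 150 * toy["house_size"] + rng.normal(0, 20_000, size=200)

X_demo = toy.drop(columns="price")
y_demo = toy["price"]  # left in USD so error metrics stay interpretable

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.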

# <b>5 <span style='color:#a6acd7'>|</span> Training the Model</b>
### Trying out different regression models
We try several models to see which one gets the highest score, and then use that one for training.

In [None]:
models = {}
def train_validate_predict(regressor, x_train, y_train, x_test, y_test, index):
    model = regressor
    model.fit(x_train, y_train)

    y_pred = model.predict(x_test)

    r2 = r2_score(y_test, y_pred)
    models[index] = r2

In [None]:
model_list = [LinearRegression, Lasso, Ridge, BayesianRidge, DecisionTreeRegressor, LinearSVR, KNeighborsRegressor,
              RandomForestRegressor, GradientBoostingRegressor, ElasticNet, SGDRegressor, CatBoostRegressor, XGBRegressor,
             LGBMRegressor]
model_names = ['Linear Regression', 'Lasso', 'Ridge', 'Bayesian Ridge', 'Decision Tree Regressor', 'Linear SVR',
               'KNeighbors Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor', 'Elastic Net', 'SGD Regressor',
              'Cat Boost Regressor', 'XGB Regressor', 'LGBM Regressor']

index = 0
for regressor in model_list:
    train_validate_predict(regressor(), X_train, y_train, X_test, y_test, model_names[index])
    index+=1

In [None]:
models

From the scores above, XGB Regressor achieved the highest R² score, followed by Cat Boost Regressor. The best R² is around 0.65, meaning the model explains roughly 65% of the variance in price, which is not that good.
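
Because the ranking above comes from a single random train/test split, it can shift from run to run. One way to make the comparison steadier is repeated k-fold cross-validation with `RepeatedKFold`. This is a rough sketch on synthetic data with only three of the models; the fold counts and `random_state` values are arbitrary assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data standing in for X and y from the notebook.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# 5 folds repeated 2 times gives 10 R2 scores per model.
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, scoring="r2", cv=cv)
    cv_scores[name] = (scores.mean(), scores.std())

for name, (mean_r2, std_r2) in cv_scores.items():
    print(f"{name}: mean R2 = {mean_r2:.3f} (+/- {std_r2:.3f})")
```

Reporting the spread alongside the mean shows whether the gap between two models is larger than the run-to-run noise.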

# <b>6 <span style='color:#a6acd7'>|</span> Evaluating the Model</b>

Let's use XGB Regressor to train our model.

In [None]:
model = XGBRegressor()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

### Evaluating

In [None]:
print('MAE: ', mean_absolute_error(y_test, y_pred))
print('MSE: ', mean_squared_error(y_test, y_pred))
print('r2: ', r2_score(y_test, y_pred))
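
Since price was standardized together with the features, the MAE and MSE above are in units of standard deviations rather than USD. As a hedged sketch, an MAE in standardized units can be converted back to USD by multiplying by the standard deviation of price. The names `price_mean` and `price_std` are hypothetical: they stand for statistics that would have to be saved before standardizing, and the prices below are invented:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Invented prices in USD, for illustration only.
price = pd.Series([200_000.0, 350_000.0, 500_000.0, 1_200_000.0])

# Hypothetical names: statistics saved before standardizing.
price_mean, price_std = price.mean(), price.std()
y_true_std = (price - price_mean) / price_std

# Pretend predictions, expressed in standardized units.
y_pred_std = y_true_std + np.array([0.1, -0.2, 0.05, -0.3])

mae_std = mean_absolute_error(y_true_std, y_pred_std)
mae_usd = mae_std * price_std  # dividing the target by std divides MAE by std

# Cross-check by undoing the standardization on the predictions.
y_pred_usd = y_pred_std * price_std + price_mean
assert np.isclose(mae_usd, mean_absolute_error(price, y_pred_usd))
print(f"MAE: {mae_std:.4f} standardized units = {mae_usd:,.0f} USD")
```

R² is unaffected by this rescaling, so only MAE and MSE need converting (MSE scales with the square of the standard deviation).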

In [None]:
preds = pd.DataFrame({'y_pred': y_pred, 'y_test':y_test})
preds = preds.sort_values(by='y_test')
preds = preds.reset_index()

In [None]:
fig = px.line(preds, x=preds.index, y=preds.columns[1::], title='Predictions vs Actual Value')
fig.show()

The predictions are not great, especially for high-priced houses. Still, the result is not that bad, and with more feature engineering and hyperparameter tuning the model could become more reliable and accurate.
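
As one possible starting point for the hyperparameter tuning mentioned above, here is a minimal sketch using `RandomizedSearchCV`. To keep it self-contained it tunes scikit-learn's `GradientBoostingRegressor` on synthetic data rather than the XGB Regressor on the real data, and the search space is a hypothetical example, not a set of recommended values:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the notebook's X and y.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Hypothetical search space; the ranges are illustrative, not tuned values.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.03, 0.1, 0.3],
    "subsample": [0.7, 1.0],
}

# Sample 5 random combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="r2",
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Held-out R2:", search.score(X_te, y_te))
```

The same pattern should carry over to other scikit-learn compatible regressors by swapping the estimator and adjusting the parameter names to match.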