## **How to build a real estate price prediction website**  

This data science project series walks through the step-by-step process of building a real estate price prediction website.  

#### **Step-1:**  
We first build a model with sklearn and linear regression, using the Bangalore home prices dataset from Kaggle.  
[Bengaluru House Price Dataset](https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data)
#### **Step-2:**  
The second step is to write a Python Flask server that uses the saved model to serve HTTP requests.  
#### **Step-3:**
The third component is a website built in HTML, CSS and JavaScript that lets the user enter the home's square-foot area, number of bedrooms, etc., and calls the Python Flask server to retrieve the predicted price. 

#### **Summary**
During model building we cover many core data science concepts, such as data loading and cleaning, outlier detection and removal, feature engineering, dimensionality reduction, GridSearchCV for hyperparameter tuning, and k-fold cross-validation.  
In terms of technology and tools, this project covers:  
*  Python
*  Numpy and Pandas for data cleaning
*  Matplotlib for data visualization
*  Sklearn for model building
*  Jupyter Notebook, Visual Studio Code and PyCharm as IDEs
*  Python Flask for the HTTP server
*  HTML/CSS/JavaScript for the UI  

## **Building House Price Prediction Model**

### **To find the path of the dataset**

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



### **Importing required libraries and loading data**

In [None]:
# important python libraries for machine learning
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt # visualizing data
import seaborn as sns # visualizing data with stunning default theme
import sklearn # contain algorithms
plt.rcParams["figure.figsize"] = (20,10)
import warnings
warnings.filterwarnings('ignore')

# load dataset from input directory
df = pd.read_csv("../input/bengaluru-house-price-data/Bengaluru_House_Data.csv") 
df.head()


In [None]:
df.shape

#### **To check the categories in the area_type column**  
Checking the count of each category shows how balanced the categorical data is.

In [None]:
df.groupby('area_type')['area_type'].agg('count')

#### **To remove columns**  
To keep the project simple for beginners, we assume the *area_type, society, balcony, availability* columns are not useful and remove them.

In [None]:
#dropping some columns
df2 = df.drop(['area_type','society','balcony','availability'],axis='columns')
df2.head()

## **Data Cleaning**  
In this step, we handle null/missing values and duplicate rows.

In [None]:
#before dropping null value, lets check it column-wise
df2.isnull().sum()

In [None]:
# We could fill the missing values with the median, but
# they are few compared to the dataset size, so we drop those rows instead
df3 = df2.dropna()
df3.isnull().sum()
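
The comment above mentions median imputation as an alternative to dropping rows. The following is a minimal sketch of that idea on a toy frame; the values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, only to illustrate the idea
toy = pd.DataFrame({
    "bath": [2.0, np.nan, 3.0, 2.0],
    "price": [50.0, 60.0, 95.0, 48.0],
})

# Replace missing bath counts with the column median instead of dropping the row
bath_median = toy["bath"].median()
filled = toy.assign(bath=toy["bath"].fillna(bath_median))

print(filled)
```

Imputation keeps every row, which matters more when a large share of the data is missing; here the notebook drops rows because few are affected.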

In [None]:
# drop duplicate rows
df4 = df3.drop_duplicates()
print("Dataset size before dropping duplicate values: {} and after {}".format(df3.shape, df4.shape))

#### **To transform the column into appropriate datatype or category**

In [None]:
#lets check size column
df4['size'].unique()

In [None]:
# From the output above, the size column is not in a form an ML model can use:
# '4 Bedroom' and '4 BHK' mean the same thing, and so on. We create a new integer
# column (bhk) from the size column. We keep the size column for later use.
df4['bhk'] = df4['size'].apply(lambda x: int(x.split(' ')[0]))
df4.head()


In [None]:
df4['bhk'].unique()
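
As a small illustration of the conversion above, the same leading-number extraction can be written with pandas string methods. The sample strings below are hypothetical and only mimic the format of the size column.

```python
import pandas as pd

# Hypothetical sample values in the same "<number> <label>" format
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "4 BHK"])

# Take the token before the first space and convert it to an integer
bhk = sizes.str.split(" ").str[0].astype(int)

print(bhk.tolist())
```

Both '4 Bedroom' and '4 BHK' map to the same integer, which is the point of the new column.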

#### **To check total_sqft column**

In [None]:
df4.total_sqft.unique()

In [None]:
# the output above shows non-numeric entries in the total_sqft column
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [None]:
# we use the ~ (negation) operator to show the rows with non-numeric total_sqft
df4[~df4['total_sqft'].apply(is_float)].head(10)

In [None]:
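def convert_sqft_to_num(x):
    # returns the midpoint for ranges (x1-x2), a float for plain numbers,
    # and None for anything else (e.g. values with units)
    tokens = x.split('-')
    try:
        if len(tokens)==2:
            return (float(tokens[0]) + float(tokens[1]))/2
        return float(x)
    except ValueError:
        return None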

In [None]:
df5 = df4.copy()
df5['total_sqft'] = df5['total_sqft'].apply(convert_sqft_to_num)
df5.head()

In [None]:
df5.isnull().sum()

Applying our function created new missing values: it returns None for entries with units (such as meters) and converts only plain numbers or ranges (x1-x2) to float.
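
To make the behaviour concrete, here is a self-contained sketch of the same conversion rule applied to a few sample strings. The helper name `to_sqft` and the sample values are assumptions for illustration only.

```python
def to_sqft(value):
    """Return a float for plain numbers and 'low - high' ranges, else None."""
    parts = str(value).split("-")
    try:
        if len(parts) == 2:
            # A range: use the midpoint of the two bounds
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except ValueError:
        # Anything that is not a number or a numeric range (e.g. has units)
        return None

# Hypothetical sample entries
samples = ["1200", "2100 - 2850", "34.46Sq. Meter", "4125Perch"]
converted = [to_sqft(s) for s in samples]

print(converted)
```

The entries with units come back as None, which is why a second `dropna()` is needed after the conversion.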

In [None]:
# drop the missing values created by convert_sqft_to_num
df5=df5.dropna()
df5.isnull().sum()

## **Feature Engineering**  
In this step, we add a new feature that will help with outlier detection and removal later on.
We also refine the categorical data in the location column for one-hot encoding later on.

In [None]:
df6 = df5.copy()
# the price is given in lakh (1 lakh = 100,000 rupees); we compute rupees per sqft
df6["price_per_sqft"] = df6['price']*100000/df6['total_sqft']
df6.head()
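
A quick worked example of the unit conversion, using made-up numbers: a home priced at 50 lakh with 1000 sqft works out to 5000 rupees per sqft. The function name below is hypothetical.

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees

def price_per_sqft(price_lakh, total_sqft):
    """Convert a price in lakh to rupees, then divide by the area."""
    return price_lakh * LAKH / total_sqft

# Hypothetical home: 50 lakh, 1000 sqft
example = price_per_sqft(50, 1000)
print(example)
```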

#### **To check the number of unique categories in location column**

In [None]:
len(df6.location.unique())

In [None]:
# 1298 unique locations is a high-dimensionality problem: if we apply one-hot
# encoding to them, we get a very large number of features.
# We count the rows for each category and set a threshold for keeping a category.
# Categories with few rows (samples) are placed in an 'other' category.
df6.location = df6.location.apply(lambda x: x.strip()) # remove leading and trailing spaces
location_stats = df6.groupby('location')['location'].agg('count').sort_values(ascending=False)
location_stats

In [None]:
len(location_stats[location_stats <= 10])

In [None]:
# we put all 1057 of these categories with 10 or fewer rows into the 'other' category
location_stats_less_than_ten = location_stats[location_stats<=10]
df6.location = df6.location.apply(lambda x: 'other' if x in location_stats_less_than_ten else x)
df6.location.head(10)
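
The same grouping of rare categories can be sketched on a toy series. The labels and the cutoff of 1 are assumptions chosen to keep the example small; the notebook itself uses a cutoff of 10.

```python
import pandas as pd

# Hypothetical location labels
locations = pd.Series(["A", "A", "A", "B", "B", "C", "D"])
threshold = 1  # hypothetical cutoff for this toy example

# Count rows per category and collect the rare ones
counts = locations.value_counts()
rare = counts[counts <= threshold].index

# Keep frequent labels, replace rare ones with 'other'
grouped = locations.where(~locations.isin(rare), "other")

print(grouped.tolist())
```

After grouping, one-hot encoding produces one column per frequent location instead of one per distinct label.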

### **Outlier Detection and Removal**  
Outliers are data points that are either data errors or, sometimes, genuine extreme variation.
We can detect them with techniques such as:
* Standard deviation
* Domain knowledge

#### **Suppose we have been given a threshold of 300 sqft per bedroom**  
As data scientists, we check the dataset for outliers: homes with less than the 
given threshold of square feet per bedroom. We remove such samples as implausible.

In [None]:
df6[df6.total_sqft/df6.bhk <300].head()

In [None]:
# remove the outliers
df7 = df6[~(df6.total_sqft/df6.bhk <300)]
print("Data Size before outlier removal: {} and after: {}".format(df6.shape, df7.shape))
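
A toy version of the domain-knowledge filter, with invented rows, shows which homes the 300 sqft per bedroom rule removes.

```python
import pandas as pd

# Hypothetical homes: total area and number of bedrooms
homes = pd.DataFrame({
    "total_sqft": [1020.0, 600.0, 1407.0],
    "bhk": [6, 8, 4],
})
min_sqft_per_bhk = 300  # threshold from the text above

# Area available per bedroom: 170.0, 75.0 and 351.75
sqft_per_bhk = homes["total_sqft"] / homes["bhk"]
kept = homes[sqft_per_bhk >= min_sqft_per_bhk]

print(kept)
```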

#### **Now we check whether the price per sqft is plausible**

In [None]:
df7.price_per_sqft.describe()

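The summary above shows minimum and maximum price-per-sqft values that are implausible for the selected areas, which indicates wide variation in the data. To handle this, for each location we keep only the rows whose price per sqft lies within one standard deviation of that location's mean.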

In [None]:
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft <=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index=True)
    return df_out

In [None]:
df8 = remove_pps_outliers(df7)
df8.shape
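
The per-location filter can also be expressed with `groupby(...).transform`, which avoids building the result frame in a loop. This is a sketch on invented numbers, not a replacement for the function above; it uses `ddof=0` to match the default of `np.std`.

```python
import pandas as pd

# Hypothetical data: each location has one obviously extreme value
toy = pd.DataFrame({
    "location": ["X"] * 5 + ["Y"] * 5,
    "price_per_sqft": [4000, 4100, 4200, 4300, 20000,
                       9000, 9100, 9200, 9300, 500],
})

grouped = toy.groupby("location")["price_per_sqft"]
mean = grouped.transform("mean")
std = grouped.transform(lambda s: s.std(ddof=0))  # population std

# Keep rows within one standard deviation of their own location's mean
within_one_std = (toy["price_per_sqft"] > mean - std) & (toy["price_per_sqft"] <= mean + std)
filtered = toy[within_one_std].reset_index(drop=True)

print(filtered)
```

In this toy data the values 20000 (location X) and 500 (location Y) fall outside the band and are dropped.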

### **To compare the prices of 2- and 3-bedroom homes in the same location with equal area**  
This step shows how to reason about cleaning outliers and incorrect data out of the dataset.
If, for the same area and location, a 2-bedroom home costs more than a 3-bedroom home, the dataset likely still contains incorrect data.

In [None]:
def plot_scatter_chart(df,location):
    bhk2 = df[(df.location == location) & (df.bhk==2)]
    bhk3 = df[(df.location == location) & (df.bhk==3)]
    plt.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2.total_sqft,bhk2.price, color='blue', label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price")
    plt.title(location)
    plt.legend()

# we can check for different locations
plot_scatter_chart(df8,'Rajaji Nagar')

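Now we remove those n-bedroom apartments whose price_per_sqft is less than the mean price_per_sqft of the (n-1)-bedroom apartments in the same location. The comparison is made only when there are more than 5 such (n-1)-bedroom apartments.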

In [None]:
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')

df9=remove_bhk_outliers(df8)
df9.shape
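
The core comparison can be sketched with a merge on a toy frame. This simplified version is an illustration only: it omits the `count > 5` condition used above, and the data and column name `smaller_mean` are assumptions.

```python
import pandas as pd

# Hypothetical rows for one location
toy = pd.DataFrame({
    "location": ["X"] * 6,
    "bhk": [2, 2, 2, 3, 3, 3],
    "price_per_sqft": [5000, 5200, 5400, 4000, 5300, 6000],
})

# Mean price_per_sqft for each (location, bhk) group
stats = (toy.groupby(["location", "bhk"], as_index=False)["price_per_sqft"]
            .mean()
            .rename(columns={"price_per_sqft": "smaller_mean"}))

# Shift bhk by one so each row is matched with the mean of the bhk-1 group
stats["bhk"] += 1
merged = toy.merge(stats, on=["location", "bhk"], how="left")

# Rows without a bhk-1 group get NaN, and NaN comparisons are False
flagged = merged["price_per_sqft"] < merged["smaller_mean"]
cleaned = toy[~flagged.to_numpy()]

print(cleaned)
```

Here the 2-bedroom mean is 5200, so only the 3-bedroom row at 4000 is flagged and removed.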

In [None]:
# let's re-check the prices using the scatter plot function defined above
plot_scatter_chart(df9,'Rajaji Nagar')

We can compare the plot_scatter_chart output for df8 with the one for the new dataframe df9 shown above. The samples flagged as inconsistent by the rule have been removed.

### **Use a histogram to view the distribution of price_per_sqft**

In [None]:
plt.rcParams["figure.figsize"] = (20,10)
plt.hist(df9.price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

#### **We check the number of bathrooms in the dataset for outliers**
Suppose we agree in a team meeting to remove the samples (rows) with 
number_of_bathrooms > number_of_bedrooms + 2

In [None]:
# let's first check whether there are any such cases
df9[df9.bath>df9.bhk+2]

In [None]:
# remove the rows that exceed the bathroom threshold (keep bath <= bhk + 2,
# which matches the rule stated above)
data = df9[df9.bath<=df9.bhk+2]
data.shape

#### **Now our data is much cleaner**  
We remove the features (columns) that the machine learning model does not need,
such as 'size' and price_per_sqft. We needed them for outlier detection, but they aren't useful for the ML model. 

In [None]:
data = data.drop(['size','price_per_sqft'], axis=1)
data.head()

## **Building Machine-Learning Model**  
The location column still holds strings, which a machine-learning model cannot interpret. To convert the text into numeric values, we use one-hot encoding via pandas get_dummies.

In [None]:
dummies = pd.get_dummies(data.location)
dummies.head(3)

In [None]:
# to avoid the dummy variable trap, we drop one column ('other') from the dummy data;
# a row with 0 in every location column then represents 'other'
data2 = pd.concat([data,dummies.drop('other',axis='columns')], axis='columns')
data2.head(3)

In [None]:
# now we can drop the location column because it has been converted to numeric columns
data3 = data2.drop('location',axis='columns')
data3.head(3)
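
The three cells above can be condensed into a toy example. The location labels and areas are hypothetical; `dtype=int` is passed so the dummy columns are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical rows with a categorical location column
toy = pd.DataFrame({
    "location": ["A", "B", "other", "A"],
    "total_sqft": [1000.0, 1200.0, 900.0, 1500.0],
})

# One column per category, then drop 'other' to avoid the dummy variable trap
dummies = pd.get_dummies(toy["location"], dtype=int)
encoded = pd.concat(
    [toy.drop(columns="location"), dummies.drop(columns="other")],
    axis="columns",
)

print(encoded)
```

The row whose location was 'other' has 0 in both remaining location columns, so no information is lost by dropping that column.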

#### **Separating the dependent and independent features**

In [None]:
X = data3.drop('price',axis="columns")
Y = data3.price
print("Independent features: {}    Dependent feature: {}".format(X.shape, Y.shape))

#### **Splitting the data for training and testing**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
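
To see what `test_size=0.2` does, here is a small sketch on synthetic arrays (the names with the `_demo` suffix are made up): 10 samples are split into 8 for training and 2 for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible between runs.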

#### **Training and testing Linear Regression model**

In [None]:
from sklearn.linear_model import LinearRegression
#build model
lr_clf = LinearRegression()
# training our model
lr_clf.fit(X_train,Y_train)
#testing our model
lr_clf.score(X_test, Y_test)
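
For a regressor, `score` returns the coefficient of determination (R²). The sketch below computes R² by hand on synthetic data and compares it with `score` and `r2_score`; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: price roughly proportional to area, plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 3000, size=(100, 1))
y_demo = 0.05 * X_demo[:, 0] + rng.normal(0, 5, size=100)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo), r2_score(y_demo, pred))
```

A value near 1 means the model explains most of the variance in the target; a value near 0 means it does little better than predicting the mean.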

#### **Evaluate the machine learning model using cross-validation (repeated random splits via ShuffleSplit)**

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
cross_val_score(LinearRegression(), X, Y, cv=cv)
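
`ShuffleSplit` draws an independent random train/test split on each iteration, whereas `KFold` partitions the data so that every sample is tested exactly once. The following sketch, on synthetic data with made-up names, runs both with `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Synthetic linear data with a little noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Five independent random 80/20 splits
shuffle_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# Five folds that partition the data
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=0)

shuffle_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=shuffle_cv)
kfold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold_cv)

print(shuffle_scores.mean(), kfold_scores.mean())
```

Either splitter gives one R² score per split; looking at the spread of those scores is more informative than a single train/test split.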

#### **Using GridSearchCV to find the best algorithm for our model**

In [None]:
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearchcv(X,Y):
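    # note: the 'normalize' parameter (LinearRegression) and the 'mse' criterion
    # (DecisionTreeRegressor) were removed in scikit-learn 1.2; 'fit_intercept'
    # is used here as a stand-in, and 'mse' is now called 'squared_error'
    algos = {
        'linear_regression': {
            'model': LinearRegression(),
            'params': {
                'fit_intercept': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion': ['squared_error', 'friedman_mse'],
                'splitter': ['best', 'random']
            }
        }
    }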
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score = False)
        gs.fit(X,Y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
            })
            
    return pd.DataFrame(scores, columns=['model','best_score', 'best_params'])
    
find_best_model_using_gridsearchcv(X,Y)

The comparison shows that the Linear Regression model performs well, so we select it for price prediction.

In [None]:
# lr_clf is already trained
# we write a function to predict the price for a given home
def predict_price(location,area,bathroom, bedroom):
    # find the one-hot column for this location, if it exists
    matches = np.where(X.columns==location)[0]

    x=np.zeros(len(X.columns))
    x[0] = area
    x[1] = bathroom
    x[2] = bedroom
    if len(matches) > 0:
        x[matches[0]] = 1  # unknown locations stay all-zero, i.e. 'other'
    return lr_clf.predict([x])[0]

predict_price('1st Phase JP Nagar',1000,3,3)
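
The input vector passed to the model has the three numeric features first, followed by one 0/1 slot per location. A standalone sketch of that layout, with a hypothetical helper and hypothetical column names:

```python
import numpy as np

# Hypothetical column order: numeric features first, then location dummies
feature_names = ["total_sqft", "bath", "bhk", "loc a", "loc b"]

def build_features(location, sqft, bath, bhk, columns):
    """Build one model input row; unknown locations leave all dummies at 0."""
    x = np.zeros(len(columns))
    x[0], x[1], x[2] = sqft, bath, bhk
    if location in columns:
        x[columns.index(location)] = 1
    return x

known = build_features("loc b", 1000, 2, 3, feature_names)
unknown = build_features("somewhere else", 1000, 2, 3, feature_names)

print(known, unknown)
```

A location that has no column of its own is treated like the dropped 'other' category, since all location slots stay at 0.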

### **To export (save) the model to a pickle file for use in the website**

In [None]:
import pickle
model_file = "banglore_home_prices_model.pickle"
with open(model_file,'wb') as f:
    pickle.dump(lr_clf,f)
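
On the server side the model has to be read back with `pickle.load`. The round trip below is a sketch with a tiny stand-in model and a temporary file; the file name and the demo data are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model: learns y = 2x
demo_model = LinearRegression().fit(
    np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0])
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pickle")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)
    # This is what a server would do at startup
    with open(path, "rb") as f:
        restored = pickle.load(f)

restored_prediction = restored.predict(np.array([[4.0]]))[0]
print(restored_prediction)
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that created them.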

#### **We also store the column names for later use in the website**


In [None]:
import json
columns = {
    'data_columns' : [col.lower() for col in X.columns]
    }
with open('columns.json', 'w') as f:
    f.write(json.dumps(columns))
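
The website needs the column names to rebuild the input vector and to list the available locations. A minimal sketch of reading the file back, assuming the first three columns are the numeric features (the column subset below is hypothetical):

```python
import json
import os
import tempfile

# Hypothetical subset of the stored columns
columns = {"data_columns": ["total_sqft", "bath", "bhk", "1st phase jp nagar"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "columns.json")
    with open(path, "w") as f:
        json.dump(columns, f)
    # This is what a server would do at startup
    with open(path) as f:
        loaded = json.load(f)

data_columns = loaded["data_columns"]
locations = data_columns[3:]  # everything after the three numeric features

print(locations)
```

Because the names were lower-cased before saving, any location sent by the website should be lower-cased before it is looked up in this list.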