# Predicting Bengaluru Housing Price - Regression Model

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Importing other necessary libraries.
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams["figure.figsize"] = (22,12)
import seaborn as sns
sns.set_style("darkgrid")  # set_style applies the theme; assigning to sns.style has no effect

In [None]:
# Loading the dataset and storing it into the dataframe df.
df = pd.read_csv('/kaggle/input/bengaluru-house-price-data/Bengaluru_House_Data.csv')
pd.options.display.max_columns = 30

## About the Dataset

What are the things that a potential home buyer considers before purchasing a house? The location, the size of the property, vicinity to offices, schools, parks, restaurants, hospitals or the stereotypical white picket fence? What about the most important factor — the price?

Now with the lingering impact of demonetization, the enforcement of the Real Estate (Regulation and Development) Act (RERA), and the lack of trust in property developers in the city, housing units sold across India in 2017 dropped by 7 percent. In fact, the property prices in Bengaluru fell by almost 5 percent in the second half of 2017, said a study published by property consultancy Knight Frank.
For example, for a potential homeowner, over 9,000 apartment projects and flats for sale are available in the range of ₹42-52 lakh, followed by over 7,100 apartments that are in the ₹52-62 lakh budget segment, says a report by property website Makaan. According to the study, there are over 5,000 projects in the ₹15-25 lakh budget segment followed by those in the ₹34-43 lakh budget category.

Buying a home, especially in a city like Bengaluru, is a tricky choice. While the major factors are usually the same for all metros, there are others to be considered for the Silicon Valley of India. With its millennial crowd, vibrant culture, great climate and a slew of job opportunities, it is difficult to ascertain the price of a house in Bengaluru.

### Reference: Kaggle dataset description - https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

# Exploratory Data Analysis and Cleaning the Data.

EDA is the most crucial part. It's the first look at our data with the help of data visualization and descriptive statistics. We need to do it carefully in order to identify and eliminate outliers and missing values. Data cleaning goes hand in hand with EDA. We'll proceed with our EDA in the upcoming cell blocks.

The very first steps are getting an overall view of our DataFrame. We do it by calling the .info() and .describe() methods on our dataframe.
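
As a complement to `.info()` and `.describe()`, the sketch below builds a per-column summary of dtype, missing count and missing share. It runs on a small made-up frame: the `toy` data and the helper name `summarize_columns` are illustrative assumptions, not part of the dataset or of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_columns(frame):
    """Return dtype, null count and null percentage for every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_null": frame.isnull().sum(),
        "pct_null": (frame.isnull().mean() * 100).round(1),
    })

# Hypothetical toy data standing in for the real dataframe.
toy = pd.DataFrame({
    "size": ["2 BHK", "3 BHK", None, "4 Bedroom"],
    "bath": [2.0, np.nan, 3.0, np.nan],
    "price": [39.07, 120.0, 62.0, 95.0],
})
summary = summarize_columns(toy)
print(summary)
```

Calling the same helper on `df` would give a single table to scan instead of several separate outputs.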

In [None]:
# Checking the Info of our dataframe.
# Describing our Dataset
df.info()
# We get to see the column names along with the number of non-null values and their corresponding datatypes.

In [None]:
# Quick data description.
df.describe()

In [None]:
# Checking for null values. The society column has a lot of null values. The other one is balcony.
df.isnull().sum()

In [None]:
# Checking the head of the dataframe.
df.head()

In [None]:
#Initial Shape of the DataFrame.
df.shape

In [None]:
# Displaying the column names for the dataframe.
df.columns

In [None]:
df['area_type'].unique()

In [None]:
df['area_type'].value_counts()

In [None]:
df['availability'].unique()

In [None]:
df['availability'].nunique()

In [None]:
df['location'].nunique()

In [None]:
df['size'].nunique()

In [None]:
ax = sns.countplot(y="size", data=df)
# Have a quick look at the current dataset and the counts of the various categories listed in the size column.

In [None]:
df['society'].nunique()

In [None]:
df['total_sqft'].nunique()

In [None]:
df['bath'].nunique()

In [None]:
df['bath'].value_counts()

In [None]:
ax = sns.countplot(x="bath", data=df)
# bath is stored as a float; we would need to convert it to a string to treat it as a category.
# Nevertheless, we find that the mode is 2 bathrooms.
# There are some big outliers that need to be taken care of.

In [None]:
df['balcony'].nunique()

In [None]:
df['balcony'].value_counts()

In [None]:
ax = sns.countplot(x="balcony", data=df)

We are dropping the following columns:

1. society - too many unique values relative to the size of the dataset, so little is lost by dropping it.
2. availability - too many unique values relative to the size of the dataset, so little is lost by dropping it.
3. area_type - not used in our analysis.
4. balcony - not much use for our analysis. We might reconsider it if the results call for it.
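
One way to back the "too many unique values" argument with numbers is to compare each column's distinct-value count against the row count. The sketch below does this on made-up data; the frame `toy`, its values and the helper `cardinality_ratio` are illustrative assumptions.

```python
import pandas as pd

def cardinality_ratio(frame):
    """Share of distinct non-null values per column (1.0 means every row is unique)."""
    return (frame.nunique() / len(frame)).sort_values(ascending=False)

# Hypothetical toy data: one near-unique column, one low-cardinality column.
toy = pd.DataFrame({
    "society": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "area_type": ["Plot Area", "Built-up Area", "Plot Area",
                  "Plot Area", "Built-up Area", "Plot Area"],
})
ratios = cardinality_ratio(toy)
print(ratios)
```

A ratio close to 1.0 means the column behaves almost like an identifier, which makes it a weak candidate for one-hot encoding.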

In [None]:
# Dropping non-essential columns from the dataframe and storing it in df1. Keeping the original dataframe intact.
df1=df.drop(['society','availability','area_type','balcony'],axis = 'columns')
df1.head(5)

In [None]:
df1.columns

In [None]:
# Checking Null values in the existing data in the dataframe.
df1.isnull().sum()

In [None]:
# Since there are very few NA values, we simply drop those rows from our data.
df2=df1.dropna()
df2.isnull().sum()

In [None]:
# Checking shape after dropping nulls.
df2.shape

In [None]:
# Checking Unique values in the size Column.
df2['size'].unique()

In [None]:
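# Creating a new column 'BHK' by extracting the leading number from the 'size' column (e.g. '2 BHK' -> 2).
# This is for easier analysis of critical information from our dataset.
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the dropna() result.
df2 = df2.assign(BHK=df2['size'].apply(lambda x : int(x.split()[0])))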

In [None]:
# Head of the new dataframe.
df2.head(5)

In [None]:
# Checking the unique values in the 'BHK' column.
# With just a look we are able to see big outliers such as 43 bedrooms and 27 bedrooms.
# Even if these entries are not data-entry errors, they are outliers.
df2['BHK'].unique()

In [None]:
# Checking for those entries which have more than 10 bedrooms. Very few of them.
df2[df2.BHK>10].value_counts()

In [None]:
# Just 12 entries have more than 10 bedrooms. Very small proportion of our dataset. Can be easily dropped.
len(df2[df2.BHK>10])

In [None]:
df2.total_sqft.unique()
# The array of unique values in the total_sqft column.

In [None]:
df2.total_sqft.nunique()
# Number of unique values in the total_sqft column.
# Huge number of distinct square foot values, and some of them are ranges rather than single numbers.
# We can replace a range with a measure of central tendency, such as the mean of its two extremes.

In [None]:
# Function for identifying values that can be parsed as a float.
def is_float(num):
    try:
        float(num)
    except (TypeError, ValueError):  # catch only parsing failures, not every exception
        return False
    return True

In [None]:
# Looking at the values that aren't a proper float, i.e. uncleaned data such as ranges,
# values given in a different unit, mis-entered data etc.
# This is the unstructured part of the incoming data.

df2[~df2['total_sqft'].apply(is_float)].head(10)

# We see many range values in the output, so we have to replace these range values with their average in order to make them proper
# float type and also usable.

In [None]:
# We see there are 173 such ranged entries. We need to tackle these.
df2[~df2['total_sqft'].apply(is_float)]['total_sqft'].nunique()

In [None]:
# User-defined function that takes a range value and returns the average of the range.
# It is applied to the total_sqft column to convert a range of square feet into the average of its two ends.
# Anything that cannot be parsed (e.g. a value given in another unit) becomes None.

def convertion_of_square_feet_to_numerical_values(sqft):
    tokens = sqft.split('-')   # Split on '-' to detect ranges such as '1156 - 2278'.
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(sqft)
    except ValueError:
        return None

In [None]:
# Testing the Function - 1
convertion_of_square_feet_to_numerical_values('1156 - 2278') # We give range as an Input and get float as an output.

In [None]:
# Testing the Function - 2
convertion_of_square_feet_to_numerical_values('1015 - 1510')

In [None]:
# Testing the Function - 3
convertion_of_square_feet_to_numerical_values('3090 - 5002')

In [None]:
# Applying the function to a copy of the above dataframe, denoted by df3, and then displaying its head.
df3=df2.copy()
df3['total_sqft'] = df3['total_sqft'].apply(convertion_of_square_feet_to_numerical_values)
df3.head()

In [None]:
# Observing a few entries inside the dataframe and displaying them as a sub-dataframe.
# The rows with index labels 30, 378 and 1322 are displayed.
# The .loc indexer selects rows by their index label.

df3.loc[[30, 378, 1322]]

# Feature Engineering Begins
Feature engineering is the process of using domain knowledge, skills and experience to extract useful features from raw data. It is essential for improving the performance of our machine learning algorithms, and is sometimes described as applied machine learning.

Process of Feature Engineering - 

1. Brainstorming on the features to test.
2. Deciding what features we should create and what we should omit.
3. Creating new features.
4. Checking how the features might work with our model.
5. Improving the existing features.
6. Iteratively follow the above process till satisfactory results are reached.
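
As an illustration of step 3 (creating new features), the sketch below derives a price per square foot from a price quoted in lakhs. The two listings and the column name `price_per_sqft` are made up for the example; only the lakh-to-rupee conversion (1 lakh = 100,000) is fixed.

```python
import pandas as pd

LAKH = 100_000  # 1 lakh = 100,000 rupees

# Hypothetical listings: price in lakhs, area in square feet.
toy = pd.DataFrame({
    "price": [50.0, 120.0],
    "total_sqft": [1000.0, 1500.0],
})
# New feature: rupees per square foot.
toy["price_per_sqft"] = toy["price"] * LAKH / toy["total_sqft"]
print(toy)
```

A derived ratio like this puts listings of very different sizes on a common scale, which is what makes it useful for spotting outliers.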

In [None]:
df3.head(5)

In [None]:
# Price per squarefoot is a very important feature in Real Estate business.
# We have to remove outliers based on invalid Price per square foot data.
# The Prices are in lakhs so multiplying by 100000(1Lakh).

df4=df3.copy()
df4['Price_Per_sqft'] = df4['price']*100000/df4['total_sqft']
df4.head(5)

### Dealing with Dimensionality Curse
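
One-hot encoding creates one column per distinct location, so a long tail of rarely seen locations inflates the feature count without adding much signal. A minimal sketch of merging rare categories under a single label is shown below on made-up data; the helper `group_rare_categories`, its `min_count` parameter and the sample locations are assumptions for illustration.

```python
import pandas as pd

def group_rare_categories(values, min_count, other_label="Other Location"):
    """Replace categories seen at most `min_count` times with `other_label`."""
    counts = values.value_counts()
    rare = counts[counts <= min_count].index
    # where() keeps entries for which the condition holds and replaces the rest.
    return values.where(~values.isin(rare), other_label)

# Hypothetical location column: one frequent value, two rare ones.
locations = pd.Series(["Whitefield"] * 4 + ["Anekal", "Yelahanka"])
grouped = group_rare_categories(locations, min_count=1)
print(grouped.value_counts())
```

The vectorised `isin`/`where` pair avoids a Python-level lambda per row, which matters little here but scales better on larger frames.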

In [None]:
#Dimensionality Curse
len(df4.location.unique()) # Unique values of Location

In [None]:
df4.location = df4.location.apply(lambda x : x.strip()) #Strips external spaces in the location data
location_stats = df4.groupby('location')['location'].agg('count').sort_values(ascending = False)
location_stats
# We display the counts in descending order and find many locations at the very end having only a single entry.
# We need to combine them under one location name to deal with sparse data and reduce complexity.

In [None]:
# Any location which has 10 or fewer houses is called 'Other Location'. Putting them under the same umbrella.
location_stats_less_than_10 = location_stats[location_stats<=10]
location_stats_less_than_10
# 1052 locations have 10 or fewer houses among 1293 entries in location. Check the length of the series above.
# This is a huge amount and needs to be dealt with to increase the efficiency of our model.

In [None]:
#Current Number of Unique entries without replacement.
a = len(df4.location.unique())
print(a)

In [None]:
# Unique entries after replacing the locations having 10 or fewer entries with 'Other Location'
df4.location = df4.location.apply(lambda location : 'Other Location' if location in location_stats_less_than_10 else location)
b = len(df4.location.unique())
print(b)

In [None]:
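# Reduction in the number of unique locations after combining under the 'Other Location' label
print(a - b)
# 1051 is a big number considering the total number 1293. The merged locations had 10 or fewer entries each.
# The difference is one less than the number of merged locations because 'Other Location' counts as a new unique value.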

In [None]:
df4.head(10)

#### Outlier Reduction and Removal : Application of Domain Knowledge.
#### Domain Knowledge = Typical square_foot per bedroom = 300 (minimum) : this will help in the removal of outliers and unusual datapoints. We'll omit those values which don't satisfy the above criterion.

In [None]:
# Checking which entries have less than 300 sqft per bedroom. 300 is our base limit.
df4[df4.total_sqft/df4.BHK < 300].head()

## Examples of erroneous data

1. A 1000 sqft home with 6 bedrooms or a 600 sqft home with 8 bedrooms is very unusual, so such rows need to be removed.
2. A 1407 sqft home with 6 bedrooms or a 500 sqft home with 3 bedrooms is equally unusual, so such rows need to be removed.
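
The figures quoted in the list above can be checked against the 300 sqft per bedroom threshold with a few lines of plain Python. The variable names are illustrative; the numbers are the ones from the examples.

```python
MIN_SQFT_PER_BEDROOM = 300  # threshold taken from the domain rule above

# (total_sqft, bedrooms) pairs quoted in the examples.
examples = [(1000, 6), (600, 8), (1407, 6), (500, 3)]

flagged = []
for sqft, bedrooms in examples:
    per_bedroom = sqft / bedrooms
    is_outlier = per_bedroom < MIN_SQFT_PER_BEDROOM
    flagged.append(is_outlier)
    print(f"{sqft} sqft / {bedrooms} bedrooms = {per_bedroom:.1f} sqft per bedroom, outlier: {is_outlier}")
```

All four listings fall well below the threshold, which is why they are treated as erroneous.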

In [None]:
#Current Shape of our data frame.
df4.shape

In [None]:
# Creating a new dataframe df5 and removing all the unusual data points which give erroneous values for sqft per bedroom.
df5 = df4[~(df4.total_sqft/df4.BHK < 300)]
df5.shape
#Shape after removal of certain erroneous datapoints.

In [None]:
#Total number of unusual datapoints removed
df4.shape[0] - df5.shape[0]

In [None]:
#Detecting anomalies based on price per squarefoot
df5.Price_Per_sqft.describe()

In [None]:
describe_df = df5.groupby('location').describe()
describe_df

In [None]:
# Distribution of mean value of Total Sqft grouped by location.
plt.hist(describe_df['total_sqft']['mean'],bins = 40, color = 'skyblue', density = True )
plt.xlabel('Square foot', size = 15)
plt.ylabel('Density', size = 15)  # density=True plots a normalised density, not raw counts
# Observation - The mean flat size per location is most often in the range of 1200 - 1400 sqft

In [None]:
# Distribution of mean value of BHK grouped by location.
plt.hist(describe_df['BHK']['mean'],bins = 40, color = 'olivedrab', density = True )
plt.xlabel('BHK', size = 15)
plt.ylabel('Density', size = 15)  # density=True plots a normalised density, not raw counts
# Observation - 2 to 3 BHK flats dominate the distribution.

In [None]:
# Distribution of mean value of Price Per sqft grouped by location.
plt.hist(describe_df['Price_Per_sqft']['mean'],bins = 50, color = 'steelblue' )
plt.xlabel('Price per sqft', size = 15)
plt.ylabel('Count', size = 15)
# Observation - Average Price per square foot is in the range of Rs. 5000 to Rs. 6000.

In [None]:
# Distribution of mean value of baths grouped by location.
plt.hist(describe_df['bath']['mean'],bins = 40, color = 'yellowgreen' )
plt.xlabel('Bathrooms', size = 15)
plt.ylabel('Count', size = 15)
# Observation - Most locations average 2 to 3 bathrooms per household.

In [None]:
# Filtering the price per square foot data, location by location.
# Keeping only the data that lies within one standard deviation on either side of the mean.
# Everything outside that band is treated as an outlier and removed.

def remove_price_per_square_foot_outlier(df):
    df_out = pd.DataFrame()
    for key,sub_df in df.groupby('location'):
        
        m = np.mean(sub_df.Price_Per_sqft) #Mean of Price Per Square foot column.
        sd  =np.std(sub_df.Price_Per_sqft) #Standard Deviation of Price Per Square foot column.
        reduced_df = sub_df[(sub_df.Price_Per_sqft>(m-sd)) & (sub_df.Price_Per_sqft<=(m+sd))] 
        # Keeping those values that are in the range of 1 SD from the mean.
        df_out = pd.concat([df_out,reduced_df], ignore_index = True)
        
    return df_out
df6 = remove_price_per_square_foot_outlier(df5)
df6.shape #Displaying the Shape

In [None]:
#Outlier data removed, removing data beyond one standard deviation.
df5.shape[0] - df6.shape[0]
# Below we see the number of datapoints removed. We can change and include more datapoints.
# Say till 1.5 to 2 SD but for my analysis I have considered only 1 SD.

In [None]:
# For certain data points with the same sqft, 3BHK flats are priced lower than 2BHK flats. These are outliers and have to be removed.
# BHK is an important criterion for determining flat price.
# For the same sqft, the flat with more BHK is expected to have the higher price.
# We'll first inspect this for individual locations and later apply the filter to the entire dataframe, location by location, and observe the changes.

def plot_scatter_chart(df,location):
    bhk2 = df[(df.location==location) & (df.BHK==2)]  #Finding 2BHK flats
    bhk3 = df[(df.location==location) & (df.BHK==3)]  #Finding 3BHK flats
    matplotlib.rcParams['figure.figsize'] = (15,10)
    
    plt.scatter(bhk2.total_sqft,bhk2.price,color='midnightblue',label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='seagreen',label='3 BHK', s=50)
    
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()
    
plot_scatter_chart(df6,"Rajaji Nagar") #Plotting the scatterplot for Rajaji Nagar

In [None]:
#Plotting the same for Hebbal
plot_scatter_chart(df6,"Hebbal")

In [None]:
#Plotting the same for Yelahanka
plot_scatter_chart(df6,"Yelahanka")

In [None]:
df6['location'].value_counts().sort_values(ascending = False)

In [None]:
#Plotting the same for Whitefield
plot_scatter_chart(df6,"Whitefield")

In [None]:
#Plotting the same for Electronic City
plot_scatter_chart(df6,"Electronic City")

In [None]:
# A 2BHK flat should be priced lower than a 3BHK flat of the same sqft.
# Filtering out, within each location, those flats of say 2BHK whose price per square foot is less than the mean price per square foot of 1BHK flats.

In [None]:
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('BHK'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.Price_Per_sqft),
                'std': np.std(bhk_df.Price_Per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('BHK'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.Price_Per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')
df7 = remove_bhk_outliers(df6)
#df8 = df7.copy()
df7.shape

In [None]:
# Total number of anomalies removed
df6.shape[0] - df7.shape[0]

In [None]:
plot_scatter_chart(df7,"Rajaji Nagar") #Plotting the scatterplot for Rajaji Nagar

In [None]:
#Plotting the same for Hebbal
plot_scatter_chart(df7,"Hebbal")

In [None]:
#Plotting the same for Yelahanka
plot_scatter_chart(df7,"Yelahanka")

In [None]:
#Plotting the same for Whitefield
plot_scatter_chart(df7,"Whitefield")  # df7 is the dataframe after BHK outlier removal

In [None]:
#Plotting the same for Electronic City
plot_scatter_chart(df7,"Electronic City")  # df7 is the dataframe after BHK outlier removal

In [None]:
# Observing the cleaned Price Per Square Foot count distribution.
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)
plt.hist(df7.Price_Per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

### Removing bathroom specific outliers.
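
A rule of the form "bathrooms should not exceed bedrooms + 2" can be written with either a strict or an inclusive comparison, and the two treat the boundary case differently. The sketch below shows the difference on made-up listings; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical listings.
toy = pd.DataFrame({
    "BHK":  [2, 3, 4, 4],
    "bath": [2, 5, 7, 3],
})
strict = toy[toy.bath < toy.BHK + 2]      # also drops rows where bath == BHK + 2
inclusive = toy[toy.bath <= toy.BHK + 2]  # drops only rows where bath > BHK + 2
print(len(strict), len(inclusive))
```

The listing with 3 bedrooms and 5 bathrooms sits exactly on the boundary, so it survives only the inclusive filter.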

In [None]:
df7.bath.unique()

In [None]:
df7[df7['bath']>10]

In [None]:
plt.hist(df7.bath,rwidth=0.8)
plt.xlabel("Number of bathrooms")
plt.ylabel("Count")

In [None]:
# Domain knowledge: any listing with more bathrooms than bedrooms + 2 is treated as an outlier and removed.
df7[df7.bath>df7.BHK+2]
# Eg 1. 4 bedrooms and 7 bathrooms - outlier, unusual.
# Eg 2. 6 bedrooms and 9 bathrooms - outlier, unusual.

In [None]:
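# Note: bath < BHK + 2 is stricter than "bath > BHK + 2 is an outlier"; rows with bath == BHK + 2 are dropped too.
df8 = df7[df7.bath<df7.BHK+2]
df8.shape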

In [None]:
#Dropping other unnecessary features for model. Remain only with location, total_sqft, bath, price and BHK.
df10 = df8.drop(['size','Price_Per_sqft'],axis='columns')
df10.head(3)

In [None]:
#Location is a categorical column, need to convert it to a numerical using one hot encoding ---> dummies.
#Using pd.get_dummies() method.
dummies = pd.get_dummies(df10.location)
dummies.head(3)

In [None]:
# Getting rid of the dummy variable trap. Dropping the dummy column for the 'Other Location' category.
df11 = pd.concat([df10,dummies.drop('Other Location',axis ='columns')],axis ='columns')
df11.head(3)

In [None]:
# Location is now one-hot encoded, so we can drop the location column as we have its encoded columns.
df12 = df11.drop('location',axis= 'columns')
df12.head(3)

In [None]:
# Checking shape of df12.
df12.shape
# We have 7251 rows and 245 columns respectively after encoding.

In [None]:
# X holds our independent variables. Price is what we want to predict based on total_sqft, bath, BHK and the encoded location.
X = df12.drop(['price'],axis='columns')
X.head(3)

In [None]:
# Shape of Independent feature dataframe
X.shape

In [None]:
# y is our dependent variable. It depends on location, total_sqft, BHK and bath.
y = df12.price
y.head(5)

In [None]:
# Checking length of y.
len(y)

# Training and Testing the Model
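
The cells in this section report `LinearRegression.score`, which for scikit-learn regressors is the coefficient of determination R². As a reminder of what that number means, here is a minimal NumPy sketch of the formula on made-up prices; the helper `r_squared` and the sample values are illustrative assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical prices in lakhs.
y_true = [40.0, 60.0, 80.0, 100.0]
perfect = r_squared(y_true, y_true)        # predictions equal to the targets
baseline = r_squared(y_true, [70.0] * 4)   # always predicting the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean baseline scores below 0.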

In [None]:
# Splitting our dataset in training and testing dataset using sklearn's train_test_split method.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)

In [None]:
# Importing sklearn's Linear Regression model and fitting and training the model. Testing the same model and displaying the score.
from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
lr_clf.score(X_test,y_test)

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

cross_val_score(LinearRegression(), X, y, cv=cv)

# Identifying the best Model. Comparing Linear Regression, Lasso, Decision Tree and KNN Regressors.

In [None]:
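# Seeding NumPy's global RNG, which scikit-learn falls back to when random_state is not set.
# (Python's random.seed has no effect on scikit-learn estimators.)
np.random.seed(42)
# Best Model Selection using GridSearchCV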

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

def find_best_model_using_gridsearchcv(X,y):
    algos = {
        'Linear Regression' : {
            'model': LinearRegression(),
            'params': {
                'fit_intercept': [True, False]  # 'normalize' was removed from LinearRegression in scikit-learn 1.2
            }
        },
        'Lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'Decision Tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion' : ['squared_error','friedman_mse'],  # 'mse' was renamed to 'squared_error'
                'splitter': ['best','random']
                
            }
        },
        'KNN Regression' :{
            'model' : KNeighborsRegressor(),
            'params' : {
                'n_neighbors' : list(range(1, 21)),
                'weights' : ['uniform', 'distance']
            }
        }   
        }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'Model': algo_name,
            'Best_score': gs.best_score_,
            'Best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['Model','Best_score','Best_params'])

find_best_model_using_gridsearchcv(X,y)

In [None]:
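# Predict price function that predicts flat price based on location, sqft, bath and bhk using our model's prediction.
def predict_price(location,sqft,bath,bhk):
    # Positions of the dummy column for this location; empty if the location has no column
    # (e.g. it was merged into 'Other Location'), in which case all location flags stay 0.
    matches = np.where(X.columns==location)[0]

    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if len(matches) > 0:
        x[matches[0]] = 1

    return lr_clf.predict([x])[0]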

In [None]:
# Final Dataframe
df12.head(5)

# Predicting some flat values - based on Location, Sqft, BHK and Bathrooms. 
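
Each prediction needs an input row laid out exactly like the training features: the numeric values first, then a 1 in the dummy column of the chosen location. The sketch below shows that layout on a made-up, much shorter column list; `columns` and `build_feature_row` are assumptions for illustration and not the notebook's own objects.

```python
import numpy as np

# Hypothetical feature order: three numeric columns, then one dummy column per location.
columns = ["total_sqft", "bath", "BHK", "Hebbal", "Whitefield"]

def build_feature_row(location, sqft, bath, bhk):
    row = np.zeros(len(columns))
    row[0], row[1], row[2] = sqft, bath, bhk
    if location in columns[3:]:
        row[columns.index(location)] = 1.0  # switch on the matching dummy column
    return row

known = build_feature_row("Whitefield", 1500, 2, 2)
unknown = build_feature_row("Some Other Place", 1500, 2, 2)
print(known, unknown)
```

A location without its own dummy column leaves every location flag at 0, which matches the baseline category that was dropped during encoding ('Other Location' in this notebook).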

In [None]:
print(round(predict_price('1st Phase JP Nagar',1000, 2, 2),2),'Lakhs')

In [None]:
print(round(predict_price('1st Phase JP Nagar',2000, 3, 3),2),'Lakhs')

In [None]:
print(round(predict_price('Indira Nagar',2000, 3, 3),2),'Lakhs')

In [None]:
print(round(predict_price('Indira Nagar',2500, 3, 3),2),'Lakhs')

In [None]:
print(round(predict_price('5th Block Hbr Layout',8000, 6, 4),2),'Lakhs')

In [None]:
print(round(predict_price('5th Block Hbr Layout',3500, 4, 3),2),'Lakhs')

In [None]:
print(round(predict_price('Varthur',1600, 3, 2),2),'Lakhs')

In [None]:
print(round(predict_price('Vishveshwarya Layout',2000, 3, 3),2),'Lakhs')

In [None]:
print(round(predict_price('Whitefield',1500, 2, 2),2),'Lakhs')

In [None]:
print(round(predict_price('6th Phase JP Nagar',1250, 3, 2),2),'Lakhs')

# Conclusion
As is often said in the world of data science, data preparation is the most vital and the most tedious step of any project. Identifying the type of problem and then applying suitable models is comparatively easy, given the plethora of functionality available in standard libraries, yet domain knowledge plays a vital role in scaling up these tasks. I had a wonderful data preparation experience with this real-world dataset and will use the knowledge gained in my upcoming projects as well. Data preparation and EDA make up the major part of the work; understanding the models and selecting the best one among them comes after.
Thank You,
Regards
Abhilash Jash