### Bangalore House Price Prediction

### Import the Libraries

In [None]:
import pandas as pd
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib 
matplotlib.rcParams["figure.figsize"] = (20,10)

In [None]:
df1 = pd.read_csv("../input/bangalorehouseprices/bengaluru_house_prices.csv")
df1.head()

In [None]:
df1.shape

### Count of Each `area_type`

In [None]:
df1['area_type'].value_counts()


### Data Cleaning


In [None]:
df2 = df1.drop(['area_type' , 'society' , 'balcony' , 'availability'] , axis = 'columns')
df2.head()

In [None]:
df2.isnull().sum()

Since the dataset has about 13,000 rows and only a small number of NA values, we can simply drop the rows that contain them. Otherwise, we could fill the missing values, for example with the column median.
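
As an alternative to dropping rows, missing numeric values could be filled with the column median. The snippet below is a minimal sketch on a made-up frame; the values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing bathroom count
toy = pd.DataFrame({
    "bath": [2.0, 3.0, np.nan, 2.0, 4.0],
    "price": [50.0, 95.0, 60.0, 48.0, 120.0],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill the gap with the column median instead of losing the row
bath_median = toy["bath"].median()
imputed = toy.assign(bath=toy["bath"].fillna(bath_median))

print(len(dropped), len(imputed))
print(imputed["bath"].tolist())
```

Dropping is the simpler choice when few rows are affected; imputation keeps the rows at the cost of introducing estimated values.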

In [None]:
df3 = df2.dropna()

In [None]:
df3.isnull().sum()

In [None]:
df3['size'].unique()

## Feature Engineering

In [None]:
df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
# List the distinct bedroom counts extracted from the size column
df3.bhk.unique()

In [None]:
df3

In [None]:
df3['bhk'].unique()

In [None]:
df3[df3['bhk']>20]

In [None]:
df3.total_sqft.unique()

Convert `total_sqft` values written as hyphenated ranges to numbers

In [None]:
# Return True if x can be parsed as a float, False otherwise
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

**Looking at values that are not a valid float**

In [None]:
df3[~df3['total_sqft'].apply(is_float)]

**The output above shows that total_sqft can be a range (e.g. 2100-2850). In such cases we can take the average of the minimum and maximum values in the range. There are other cases, such as 34.46Sq. Meter, which could be converted to square feet with a unit conversion. I am going to drop such corner cases to keep things simple.**
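
The corner cases are dropped in this notebook, but a unit conversion could look roughly like the sketch below. The helper name `sqft_from_text` is hypothetical, and handling only the `Sq. Meter` suffix is an assumption made for illustration; 1 square meter is approximately 10.7639 square feet.

```python
import re

# Approximate conversion factor: 1 square meter is about 10.7639 square feet
SQM_TO_SQFT = 10.7639

def sqft_from_text(value):
    """Hypothetical helper: parse a total_sqft string into square feet, or None."""
    value = value.strip()
    # Range such as '2100 - 2850': average of the two endpoints
    parts = value.split("-")
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None
    # Number followed by the unit, such as '34.46Sq. Meter'
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*Sq\. Meter", value)
    if match:
        return float(match.group(1)) * SQM_TO_SQFT
    # Plain number, or something that cannot be parsed
    try:
        return float(value)
    except ValueError:
        return None

print(sqft_from_text("2100 - 2850"))
print(sqft_from_text("34.46Sq. Meter"))
print(sqft_from_text("unknown"))
```

Values in any other unit still come back as `None`, so they would be filtered out in the same way as before.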



**Convert to averages**

In [None]:
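def convert_sqft_to_num(x):
    # A range such as '2100 - 2850' becomes the average of its two endpoints
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # Values that cannot be parsed (e.g. '34.46Sq. Meter') become None
        return None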

In [None]:
convert_sqft_to_num('2000 - 4000')

In [None]:
df4 = df3.copy() #deep copy

In [None]:
df4['total_sqft'] = df4["total_sqft"].apply(convert_sqft_to_num)

In [None]:
df4 = df4[df4.total_sqft.notnull()]


In [None]:
df4

In [None]:
df4.total_sqft.unique()

### Feature Engineering

In [None]:
df5 = df4.copy()
# Price is in lakhs (1 lakh = 100,000 rupees), so multiply by 100000 to get rupees per sqft
df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']
df5.head()

In [None]:
df5.location.unique()

In [None]:
len(df5.location.unique())
# There are a lot of distinct locations. Encoding each one as its own
# column would create too many features (the curse of dimensionality).

**We will use a dimensionality reduction technique to reduce the number of locations**

**Here the dimension being reduced is a categorical variable (location)**

In [None]:
# Remove leading and trailing whitespace from the location names
df5.location = df5.location.apply(lambda x: x.strip())

In [None]:
location_stats = df5.groupby('location')['location'].agg('count').sort_values(ascending = False)

In [None]:
location_stats

**To reduce the number of locations, we label any location that has fewer than 10 data points as 'other'**
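
The same grouping can be expressed with `value_counts`, `isin`, and `where`. The snippet below is a minimal sketch on a toy Series; the location values are illustrative and the threshold is lowered to 2 so that the tiny sample shows the effect.

```python
import pandas as pd

# Toy location column (illustrative values); threshold lowered to 2 for the tiny sample
locations = pd.Series(
    ["Hebbal", "Hebbal", "Hebbal", "Whitefield", "Whitefield", "Yelahanka"]
)
min_count = 2

# Count how often each location occurs
counts = locations.value_counts()

# Locations with fewer than min_count rows
rare = counts[counts < min_count].index

# Keep the value where it is not rare; otherwise replace it with 'other'
grouped = locations.where(~locations.isin(rare), "other")

print(grouped.tolist())
# ['Hebbal', 'Hebbal', 'Hebbal', 'Whitefield', 'Whitefield', 'other']
```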

In [None]:
len(location_stats[location_stats < 10])

In [None]:
location_stas_less_than_10 = location_stats[location_stats < 10]
location_stas_less_than_10

In [None]:
len(df5.location.unique())

In [None]:
# Every location with fewer than 10 data points is relabeled as 'other'
# (`x in Series` checks the index, which holds the location names here)
df5.location = df5.location.apply(lambda x : 'other' if x in location_stas_less_than_10 else x)

In [None]:
len(df5.location.unique())

In [None]:
df5.head(10)

## Outlier Detection

Outliers are not necessarily errors; they are extremely large or small values that do not fit the rest of the data. For example, a 2 bedroom apartment of 5000 sq ft is highly unlikely.

**As a data scientist, when you talk with your business manager (who has expertise in real estate), they will tell you that a bedroom normally needs at least 300 sqft (i.e. a 2 bhk apartment is at minimum 600 sqft). If, for example, a 400 sqft apartment has 2 bhk, that looks suspicious and can be removed as an outlier. We will remove such outliers by setting a minimum threshold of 300 sqft per bhk.**



In [None]:
df5[df5.total_sqft/df5.bhk<300].head()


**Check the data points above. We have a 6 bhk apartment with 1020 sqft and an 8 bhk apartment with a total of 600 sqft. These are clear data errors that can be removed safely.**



In [None]:
df6 = df5[~(df5.total_sqft/df5.bhk<300)]
df6.shape

In [None]:
df6.price_per_sqft.describe()


**Clearly, a minimum price per square foot of 267 rupees and a maximum of 176470 rupees are not realistic**

**Now we can remove these extreme cases based on the standard deviation**

**The function below takes the data points of each location and keeps only those whose price_per_sqft lies within one standard deviation of that location's mean**

In [None]:
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))]
        df_out = pd.concat([df_out,reduced_df],ignore_index=True)
    return df_out
df7 = remove_pps_outliers(df6)
df7.shape

**The remove_pps_outliers function loops through the location subgroups. For example, a subdf could be all data points with "jayanagar" as the location. It calculates the mean (m) and standard deviation (st) of price_per_sqft for the rows in jayanagar, selects all points that lie between m-st and m+st, and adds them to df_out.**
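
The same one-standard-deviation filter can also be written without an explicit loop. The snippet below is a sketch on toy data with made-up numbers; it uses `groupby(...).transform` as an alternative to the loop-based function, not as a replacement for it.

```python
import pandas as pd

# Toy price_per_sqft values for two locations (hypothetical numbers)
toy = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 5,
    "price_per_sqft": [4000, 4200, 4100, 3900, 9000,
                       6000, 6100, 5900, 6050, 1000],
})

grouped = toy.groupby("location")["price_per_sqft"]
mean = grouped.transform("mean")
# ddof=0 gives the population standard deviation, which is what np.std computes by default
std = grouped.transform(lambda s: s.std(ddof=0))

# Keep rows within one standard deviation of their own location's mean
within = (toy["price_per_sqft"] > mean - std) & (toy["price_per_sqft"] <= mean + std)
filtered = toy[within]

print(filtered["price_per_sqft"].tolist())
# [4000, 4200, 4100, 3900, 6000, 6100, 5900, 6050]
```

In this toy sample the values 9000 (location A) and 1000 (location B) fall outside the band and are removed.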


This reduces the dataset by almost 2000 data points.

**One more thing we have to check is whether, for the same square foot area, the price of a 2 bhk apartment is greater than that of a 3 bhk apartment**

We are going to plot a scatter plot which will tell us how many of these types of points we have

In [None]:
def plot_scatter_chart(df,location):
    bhk2 = df[(df.location==location) & (df.bhk==2)]
    bhk3 = df[(df.location==location) & (df.bhk==3)]
    matplotlib.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK', s=50) # s is the marker size
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='+', color='green',label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()
    
plot_scatter_chart(df7,"Rajaji Nagar")


Around 1700 sq ft, some 2 bedroom apartments are priced higher than 3 bedroom apartments.

In [None]:
plot_scatter_chart(df7,"Hebbal")


We should also remove properties where, for the same location, the price of (for example) a 3 bedroom apartment is less than that of a 2 bedroom apartment with the same square foot area. For a given location, we will build a dictionary of stats per bhk, i.e.

{ <br>
    '1' : { <br>
        'mean': 4000,<br>
        'std': 2000,<br>
        'count': 34<br>
    },  <br>
    '2' : {<br>
        'mean': 4300,<br>
        'std': 2300,<br>
        'count': 22<br>
    },<br>
}<br>
Now we can remove those 2 BHK apartments whose price_per_sqft is less than the mean price_per_sqft of the 1 BHK apartments
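
To make the shape of this dictionary concrete, here is a minimal sketch that builds it for a single toy location. The numbers are made up for illustration; `ddof=0` is passed so that the standard deviation matches what `np.std` computes by default.

```python
import pandas as pd

# Toy rows for a single location (hypothetical numbers)
toy = pd.DataFrame({
    "bhk": [1, 1, 1, 2, 2],
    "price_per_sqft": [3800.0, 4000.0, 4200.0, 4100.0, 4500.0],
})

bhk_stats = {}
for bhk, bhk_df in toy.groupby("bhk"):
    bhk_stats[bhk] = {
        "mean": bhk_df["price_per_sqft"].mean(),
        "std": bhk_df["price_per_sqft"].std(ddof=0),  # population std
        "count": len(bhk_df),
    }

print(bhk_stats[1]["mean"], bhk_stats[1]["count"])
print(bhk_stats[2]["mean"], bhk_stats[2]["count"])
```

Note that the keys are the integer bedroom counts, so the stats for the next-smaller group can be looked up with `bhk_stats.get(bhk - 1)`.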



In [None]:
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')
df8 = remove_bhk_outliers(df7)
df8.shape


The outer loop iterates over each location group, and the inner loops iterate over every bedroom-count (bhk) group within that location.

The first inner loop stores the mean, standard deviation, and number of data points of each bhk group in the `bhk_stats` dictionary created in the outer loop, keyed by the bedroom count (e.g. `bhk_stats[2]` holds the statistics of the 2 bedroom group).

The second inner loop does the main work. The line `stats = bhk_stats.get(bhk-1)` fetches the statistics of the group with one fewer bedroom. For the 1 bedroom group this is `None`, because no 0 bedroom group exists in the dataframe. For the 3 bedroom group, it fetches the statistics of the 2 bedroom group so that we can compare against its mean.

The condition `if stats and stats['count']>5:` checks that the dictionary exists (indexing `None` would raise an error) and that the previous group has more than 5 data points, because we should not discard anything without a reasonable amount of data to compare it with.

Finally, `exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)` collects the index of every row in the current bhk group whose price_per_sqft is lower than the mean of the previous bhk group. Those rows are then dropped.
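
A small worked example may help to see which rows the rule removes. The sketch below applies the comparison to one toy location with made-up numbers: six 1 BHK rows (so the "more than 5" condition holds) and three 2 BHK rows.

```python
import pandas as pd

# Toy rows for one location (hypothetical numbers)
toy = pd.DataFrame({
    "bhk": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "price_per_sqft": [3800.0, 3900.0, 4000.0, 4000.0, 4100.0, 4200.0,
                       3500.0, 4300.0, 4600.0],
})

one_bhk = toy[toy["bhk"] == 1]
two_bhk = toy[toy["bhk"] == 2]

# The comparison is only made when the smaller group has more than 5 rows
enough_data = len(one_bhk) > 5
one_bhk_mean = one_bhk["price_per_sqft"].mean()

# 2 BHK rows priced below the 1 BHK mean are flagged for removal
flagged = two_bhk[two_bhk["price_per_sqft"] < one_bhk_mean] if enough_data else two_bhk.iloc[0:0]
cleaned = toy.drop(flagged.index)

print(one_bhk_mean)            # 4000.0
print(flagged.index.tolist())  # [6]
print(len(cleaned))            # 8
```

Only the 2 BHK row priced at 3500 per sqft falls below the 1 BHK mean of 4000, so it is the single row that gets dropped.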


In [None]:
plot_scatter_chart(df8,"Rajaji Nagar")


Plot the same scatter chart again to compare the 2 BHK and 3 BHK properties after outlier removal



In [None]:
plot_scatter_chart(df8,"Hebbal")


In [None]:
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)
plt.hist(df8.price_per_sqft,rwidth=0.8)
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")
# The distribution looks roughly bell-shaped (approximately normal)

In [None]:
df8.bath.unique()


In [None]:
plt.hist(df8.bath,rwidth = 0.8)
plt.xlabel("Number of bathrooms")
plt.ylabel("Count")

In [None]:
df8[df8.bath>10]


In [None]:
df8[df8.bath>df8.bhk+2]


In [None]:
# Rows with more than bhk + 2 bathrooms, held temporarily for inspection
df9 = df8[df8.bath>df8.bhk+2]


In [None]:
df9

In [None]:
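# Keep only rows with fewer than bhk + 2 bathrooms
# (rows with exactly bhk + 2 bathrooms are dropped as well)
df9 = df8[df8.bath<df8.bhk+2]
df9.shape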


In [None]:
df9.head(2)

In [None]:
df10 = df9.drop(['size','price_per_sqft'],axis='columns')
df10.head(3)


**size and price_per_sqft can be dropped: size is now captured by bhk, and price_per_sqft was used only for outlier detection. The dataset is now clean and ready for model training.**

In [None]:
df10.head()

### One Hot Encoding and Machine Learning Model

In [None]:

dummies = pd.get_dummies(df10.location)
dummies.head(3)

In [None]:
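# Drop one dummy column ('other') to avoid the dummy variable trap;
# a row with all remaining location columns set to 0 means 'other'
df11 = pd.concat([df10,dummies.drop('other',axis='columns')],axis='columns')
df11.head()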

### Model Building

In [None]:
df12 = df11.drop('location',axis = 'columns')
df12.head(10)

In [None]:
df12.shape


In [None]:
X = df12.drop(['price'],axis='columns')
X.head(3)

In [None]:
X.shape

In [None]:
y = df12.price
y.head(3)


In [None]:
len(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)

In [None]:
from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
# score() returns the coefficient of determination (R^2) on the held-out test set
lr_clf.score(X_test,y_test)
