<h1 style='color:red' align='center'>Data Science Regression Project: Predicting Home Prices in Bangalore</h1>

# Import Libraries


In [23]:
import pandas as pd 
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"] =(20,20)

# Read Data

In [24]:
# Read the data
df1 = pd.read_csv("data/bengaluru_house_prices.csv")

# Show the first five rows of the table
df1.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [25]:
df1.shape

(13320, 9)

In [26]:
# drop the unused variables 
df2 = df1.drop(['availability' , 'society' ,'balcony' , 'area_type' ] , axis='columns' )
df2.head()

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Kothanur,2 BHK,1200,2.0,51.0


In [27]:
# Drop rows that contain NaN values
df3 = df2.dropna()
# Confirm that no NaN values remain in any column
df3.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64
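
Counting missing values before dropping rows shows how much data `dropna()` discards (in this notebook, 13320 rows go down to 13246). Here is a minimal sketch on a made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing location and one missing bath value
toy = pd.DataFrame({
    "location": ["Uttarahalli", None, "Kothanur"],
    "bath": [2.0, 3.0, np.nan],
})

# Count missing values per column before dropping anything
missing_before = toy.isnull().sum()
print(missing_before)

# dropna() removes every row that has at least one missing value
clean = toy.dropna()
print(len(toy), "->", len(clean), "rows")
```

Applied to the notebook, the same check would be `df2.isnull().sum()` run before `df2.dropna()`.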

In [28]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13246 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    13246 non-null  object 
 1   size        13246 non-null  object 
 2   total_sqft  13246 non-null  object 
 3   bath        13246 non-null  float64
 4   price       13246 non-null  float64
dtypes: float64(2), object(3)
memory usage: 620.9+ KB


In [29]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
df3 = df3.copy()
# The leading token of `size` is the bedroom count (e.g. '2 BHK', '4 Bedroom')
df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
df3.bhk.unique()
# BHK: bedroom, hall (living room), kitchen
# RK: one room, one kitchen and a bathroom


array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18], dtype=int64)

In [30]:
df3['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)
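
The `object` dtype and the `'1133 - 1384'` entry above suggest that some `total_sqft` values are not plain numbers. One way to list them is a small helper that tries `float()` on each value. This is a sketch: the helper name `is_float` and the sample list are assumptions for illustration, and the last sample is a made-up example of a unit-suffixed entry.

```python
def is_float(text):
    """Return True if `text` parses as a plain number."""
    try:
        float(text)
    except ValueError:
        return False
    return True

# Illustrative values; the last one is a hypothetical unit-suffixed entry
samples = ["1056", "2600", "1133 - 1384", "774", "34.46Sq. Meter"]

# Keep only the strings that float() rejects
non_numeric = [s for s in samples if not is_float(s)]
print(non_numeric)
```

On the real column, `df3[~df3['total_sqft'].apply(is_float)]` would show the rows that need special handling.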

In [31]:
import re

def convert_total_sqft(value):
    """Convert a total_sqft string to a float, averaging ranges such as '1133 - 1384'."""
    # Pull out every number, including decimals such as '1177.5'
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', value)]
    if not numbers:
        # No digits at all: treat the value as missing
        return float('nan')
    if '-' in value:
        # Range format: use the average of the listed values
        return sum(numbers) / len(numbers)
    # Single value; any trailing unit text is ignored, not converted
    return numbers[0]



In [32]:
df4 = df3.copy()
df4['total_sqft'] = df4['total_sqft'].apply(convert_total_sqft)

In [33]:
df5 = df4.copy()
# price is assumed to be in lakhs (1 lakh = 100,000 rupees), so this gives rupees per square foot
df5['price_per_sqft'] = df5['price']*100000 / df5['total_sqft']

df5.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250.0
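
As a sanity check, the first row's `price_per_sqft` can be reproduced by hand. The sketch below assumes `price` is quoted in lakhs of rupees, which is what the factor of 100000 implies; the function name `price_per_sqft` is introduced here for illustration only.

```python
LAKH = 100_000  # assumption: `price` is quoted in lakhs of rupees

def price_per_sqft(price_lakh, total_sqft):
    """Convert a price in lakhs to rupees, then divide by the area."""
    return price_lakh * LAKH / total_sqft

# First row of the table above: price 39.07, total_sqft 1056.0
first_row = price_per_sqft(39.07, 1056.0)
print(round(first_row, 2))
```

The result agrees with the 3699.810606 shown for row 0.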


In [34]:
# Strip stray whitespace so identical locations are counted together
df5['location'] = df5['location'].apply(lambda x: x.strip())

location_stats = df5['location'].value_counts()
location_stats




Whitefield                        535
Sarjapur  Road                    392
Electronic City                   304
Kanakpura Road                    266
Thanisandra                       236
                                 ... 
Vasantapura main road               1
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
Abshot Layout                       1
Name: location, Length: 1293, dtype: int64

In [35]:
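# Locations with 10 or fewer listings (note the <= despite the variable name)
location_stats_less_than_10 = location_stats[location_stats <= 10]

# Group those rare locations under a single 'others' label
df5.location = df5['location'].apply(lambda x: 'others' if x in location_stats_less_than_10 else x)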

df5.location.value_counts()

others                2881
Whitefield             535
Sarjapur  Road         392
Electronic City        304
Kanakpura Road         266
                      ... 
Nehru Nagar             11
Banjara Layout          11
LB Shastri Nagar        11
Pattandur Agrahara      11
Narayanapura            11
Name: location, Length: 242, dtype: int64
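
The grouping of rare locations can also be written without `apply`, using `isin` and `where`. This is a sketch on made-up data: the `locations` Series and the threshold of 1 are assumptions chosen to keep the example small, whereas the notebook uses a threshold of 10.

```python
import pandas as pd

# Hypothetical toy data; one entry carries stray whitespace
locations = pd.Series(
    ["Whitefield"] * 3 + ["Kothanur"] * 2 + ["Abshot Layout", " Bapuji Layout "]
).str.strip()

counts = locations.value_counts()

MAX_RARE_COUNT = 1  # assumption for the toy data; the notebook uses 10
rare = counts[counts <= MAX_RARE_COUNT].index

# Keep a location if it is not rare, otherwise replace it with 'others'
grouped = locations.where(~locations.isin(rare), "others")
print(grouped.value_counts())
```

Collapsing rare categories this way keeps the number of distinct location labels small, which matters if the column is later one-hot encoded.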

## Remove Outliers 

### Outlier Removal Using Business Logic

As a data scientist, when you talk to your business manager (who has expertise in real estate), they will tell you that a home normally has about 300 sqft per bedroom (i.e. a 2 BHK apartment is at least 600 sqft). If, for example, a 400 sqft apartment is listed as 2 BHK, that seems suspicious and the row can be removed as an outlier. We will remove such outliers by setting the minimum threshold to 300 sqft per BHK.

In [37]:
df5[df5.total_sqft/df5.bhk<300]

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
9,others,6 Bedroom,1020.0,6.0,370.0,6,36274.509804
45,HSR Layout,8 Bedroom,600.0,9.0,200.0,8,33333.333333
58,Murugeshpalya,6 Bedroom,1407.0,4.0,150.0,6,10660.980810
68,Devarachikkanahalli,8 Bedroom,1350.0,7.0,85.0,8,6296.296296
70,others,3 Bedroom,500.0,3.0,100.0,3,20000.000000
...,...,...,...,...,...,...,...
13277,others,7 Bedroom,1400.0,7.0,218.0,7,15571.428571
13279,others,6 Bedroom,1200.0,5.0,130.0,6,10833.333333
13281,Margondanahalli,5 Bedroom,1375.0,5.0,125.0,5,9090.909091
13303,Vidyaranyapura,5 Bedroom,774.0,5.0,70.0,5,9043.927649


In [45]:

df6 = df5[~(df5.total_sqft/df5.bhk<300)]
df6.shape


(12462, 7)
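
The 300 sqft per bedroom rule used above can be tried on a small made-up frame to see which rows it drops. The `toy` DataFrame and the constant name `MIN_SQFT_PER_BHK` are assumptions for illustration.

```python
import pandas as pd

MIN_SQFT_PER_BHK = 300  # threshold taken from the business rule above

# Hypothetical listings: rows 1 and 3 have too little area per bedroom
toy = pd.DataFrame({
    "total_sqft": [600.0, 1020.0, 1200.0, 500.0],
    "bhk": [2, 6, 2, 3],
})

sqft_per_bhk = toy["total_sqft"] / toy["bhk"]

# Keep rows that meet the minimum area per bedroom
kept = toy[sqft_per_bhk >= MIN_SQFT_PER_BHK]
print(kept)
```

One design difference is worth knowing: `~(ratio < 300)`, as in the notebook, keeps rows whose ratio is NaN, while `ratio >= 300` drops them.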

In [48]:
df6.sample(5)

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
1715,others,2 BHK,1000.0,2.0,50.0,2,5000.0
9891,Margondanahalli,2 Bedroom,1152.0,1.0,66.0,2,5729.166667
5565,7th Phase JP Nagar,2 BHK,850.0,2.0,42.0,2,4941.176471
347,Bommasandra,3 BHK,1260.0,3.0,49.36,3,3917.460317
772,Banashankari Stage VI,2 BHK,1177.5,2.0,59.935,2,5090.021231
