# Data Science Regression Project: Predicting Home Prices in Bangalore


Dataset is downloaded from here: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

In [26]:
from matplotlib import pyplot as plt
import pandas as pd 
import numpy as np
%matplotlib inline
import matplotlib 
matplotlib.rcParams["figure.figsize"] = (20,10)

In [27]:
housedf=pd.read_csv('Bengaluru_House_Data.csv')
housedf.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price(lakhs)
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [28]:
housedf.shape

(13320, 9)

In [29]:
housedf['area_type'].unique()

array(['Super built-up  Area', 'Plot  Area', 'Built-up  Area',
       'Carpet  Area'], dtype=object)

<b>We keep only the features that matter for this analysis. The columns area_type, availability, and society are not needed, so we drop them.</b>

In [30]:
housedf1=housedf.drop(columns=['area_type', 'availability','society'])
housedf1.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs)
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Kothanur,2 BHK,1200,2.0,1.0,51.0


## Data Cleaning 

<b> The first step in data cleaning is always to check for null/NA values in all the columns and deal with them</b>

In [31]:
housedf1.isnull().sum()

location          1
size             16
total_sqft        0
bath             73
balcony         609
price(lakhs)      0
dtype: int64

<b>The data set is fairly large (13,320 rows), so we drop rows with missing values in columns where only a few are missing (say, fewer than 100). The 'balcony' column, however, has 609 missing values, so we fill those with the column mean (rounded down to a whole number) instead.</b>
   

In [32]:
housedf1['balcony'].unique()

array([ 1.,  3., nan,  2.,  0.])

In [33]:
housedf1['balcony'].value_counts()

2.0    5113
1.0    4897
3.0    1672
0.0    1029
Name: balcony, dtype: int64

In [34]:
mean_balcony=np.floor(housedf1['balcony'].mean())

In [35]:
housedf2=housedf1.copy()
housedf2.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs)
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0
4,Kothanur,2 BHK,1200,2.0,1.0,51.0


In [36]:
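# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame.
housedf2['balcony'] = housedf2['balcony'].fillna(mean_balcony)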
housedf2['balcony'].unique()

array([1., 3., 2., 0.])
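
The fill value is the floor of the mean, which is not the only reasonable choice for a small integer count. As a minimal sketch, the block below rebuilds the non-null balcony values from the `value_counts()` output printed above and compares the floor of the mean with the median and the mode. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the non-null balcony values from the value_counts() output shown above.
counts = {2.0: 5113, 1.0: 4897, 3.0: 1672, 0.0: 1029}
balcony = pd.Series(np.repeat(list(counts.keys()), list(counts.values())))

floor_mean = np.floor(balcony.mean())  # the fill value used in this notebook
median = balcony.median()
mode = balcony.mode().iloc[0]

print(floor_mean, median, mode)
```

On these counts the floor of the mean is 1 balcony, while the median and the mode are both 2, so the choice of statistic changes what the 609 filled rows look like.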

In [37]:
housedf2.dropna(inplace=True)
housedf2

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs)
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.00
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.00
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.00
4,Kothanur,2 BHK,1200,2.0,1.0,51.00
...,...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453,4.0,0.0,231.00
13316,Richards Town,4 BHK,3600,5.0,1.0,400.00
13317,Raja Rajeshwari Nagar,2 BHK,1141,2.0,1.0,60.00
13318,Padmanabhanagar,4 BHK,4689,4.0,1.0,488.00


## Feature Engineering

<b>Next we analyze the values in each column (their range and type) and modify existing features or add new ones.</b>

<b> Feature 1: Size </b>

In [38]:
housedf2['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

<b>Here we note that two different labels are used for the same thing, e.g. '2 BHK' and '2 Bedroom' mean the same. We extract the leading number so that the values become homogeneous.</b>

In [39]:
housedf2['bhk']=housedf2['size'].apply(lambda x:int(x.split(' ')[0]))
housedf2['bhk'].unique()
                                       

array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18], dtype=int64)
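
An alternative to splitting on a space is to extract the leading digits with a regular expression, which does not depend on the exact separator. This is a sketch on a few sample labels taken from the output above, not a replacement for the cell in this notebook.

```python
import pandas as pd

# Sample labels taken from the unique() output above.
sizes = pd.Series(['2 BHK', '4 Bedroom', '1 RK'])

# Capture the leading run of digits and convert it to a number.
bhk = pd.to_numeric(sizes.str.extract(r'^(\d+)', expand=False))

print(bhk.tolist())
```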

<b>Feature 2:total_sqft</b>

In [40]:
type(housedf2['total_sqft'][0])

str

In [41]:
housedf2['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

<b>The column has string data type, and a string can hold any value, including digits and special characters. We try to convert every value to float; values that are non-numeric or ranges raise an exception, which lets us single them out.</b>

In [42]:
def is_float(x):
    """Return True if x can be parsed as a float."""
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [43]:
housedf2[~housedf2['total_sqft'].apply(is_float)].head(15)

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs),bhk
30,Yelahanka,4 BHK,2100 - 2850,4.0,0.0,186.0,4
122,Hebbal,4 BHK,3067 - 8156,4.0,0.0,477.0,4
137,8th Phase JP Nagar,2 BHK,1042 - 1105,2.0,0.0,54.005,2
165,Sarjapur,2 BHK,1145 - 1340,2.0,0.0,43.49,2
188,KR Puram,2 BHK,1015 - 1540,2.0,0.0,56.8,2
410,Kengeri,1 BHK,34.46Sq. Meter,1.0,0.0,18.5,1
549,Hennur Road,2 BHK,1195 - 1440,2.0,0.0,63.77,2
648,Arekere,9 Bedroom,4125Perch,9.0,1.0,265.0,9
661,Yelahanka,2 BHK,1120 - 1145,2.0,0.0,48.13,2
672,Bettahalsoor,4 Bedroom,3090 - 5002,4.0,0.0,445.0,4


<b>The output above shows that total_sqft can be a range (e.g. 2100 - 2850). In such cases we take the average of the minimum and maximum values. Other cases, such as 34.46Sq. Meter, carry a unit and cannot be parsed as a number directly.</b>

In [44]:
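def convert_sqft_to_num(x):
    # A range such as '2100 - 2850' is replaced by its midpoint.
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # Values with units (e.g. '34.46Sq. Meter') cannot be parsed.
        return None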

In [45]:
housedf2['sqft']=housedf2['total_sqft'].apply(convert_sqft_to_num)
housedf2[housedf2['sqft'].isnull()]

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs),bhk,sqft
410,Kengeri,1 BHK,34.46Sq. Meter,1.0,0.0,18.5,1,
648,Arekere,9 Bedroom,4125Perch,9.0,1.0,265.0,9,
775,Basavanagara,1 BHK,1000Sq. Meter,2.0,1.0,93.0,1,
872,Singapura Village,2 BHK,1100Sq. Yards,2.0,1.0,45.0,2,
1019,Marathi Layout,1 Bedroom,5.31Acres,1.0,0.0,110.0,1,
1086,Narasapura,2 Bedroom,30Acres,2.0,2.0,29.5,2,
1400,Chamrajpet,9 BHK,716Sq. Meter,9.0,1.0,296.0,9,
1712,Singena Agrahara,3 Bedroom,1500Sq. Meter,3.0,1.0,95.0,3,
1743,Hosa Road,3 BHK,142.61Sq. Meter,3.0,1.0,115.0,3,
1821,Sarjapur,3 Bedroom,1574Sq. Yards,3.0,1.0,76.0,3,


In [46]:
housedf2['sqft'].isnull().sum()

46

<b>There are 46 entries that are either malformed or given in other units. We can either drop them or convert the different units into square feet. Since the data set is sufficiently large and such cases are few, we can safely drop them.</b>
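
If the rows were to be kept instead, the conversion option could look roughly like the sketch below. The function name `convert_with_units` and the `SQFT_PER_UNIT` table are hypothetical, the unit spellings are copied from the rows listed above, and reading "Perch" as a square perch of 272.25 sq ft is an assumption about what the data set means.

```python
import re

# Square feet per unit. Only units visible in the rows above are covered;
# treating "Perch" as a square perch (272.25 sq ft) is an assumption.
SQFT_PER_UNIT = {
    'Sq. Meter': 10.7639,
    'Sq. Yards': 9.0,
    'Acres': 43560.0,
    'Perch': 272.25,
}

NUMBER = r'(\d+(?:\.\d+)?)'

def convert_with_units(x):
    """Return the area in square feet, or None if the value cannot be parsed."""
    text = str(x).strip()
    # Range such as '2100 - 2850': take the midpoint.
    match = re.fullmatch(NUMBER + r'\s*-\s*' + NUMBER, text)
    if match:
        return (float(match.group(1)) + float(match.group(2))) / 2
    # Number followed by an optional unit such as '34.46Sq. Meter'.
    match = re.fullmatch(NUMBER + r'\s*(.*)', text)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2).strip()
    if unit == '':
        return value
    factor = SQFT_PER_UNIT.get(unit)
    return value * factor if factor is not None else None

print(convert_with_units('1100Sq. Yards'))
```

Converted rows would still need the same sanity checks as the rest of the data: row 1086, for example, lists 30 acres at a price of 29.5 lakhs, which looks more like a data entry problem than a real listing.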

In [47]:
housedf2=housedf2.dropna()
housedf2.head()

Unnamed: 0,location,size,total_sqft,bath,balcony,price(lakhs),bhk,sqft
0,Electronic City Phase II,2 BHK,1056,2.0,1.0,39.07,2,1056.0
1,Chikka Tirupathi,4 Bedroom,2600,5.0,3.0,120.0,4,2600.0
2,Uttarahalli,3 BHK,1440,2.0,3.0,62.0,3,1440.0
3,Lingadheeranahalli,3 BHK,1521,3.0,1.0,95.0,3,1521.0
4,Kothanur,2 BHK,1200,2.0,1.0,51.0,2,1200.0


<b>With the unparseable rows dropped, we can also drop the total_sqft column: the numeric sqft column replaces it, and building it also let us see which kinds of values we were about to drop. We drop the size column as well, since it is no longer required now that its number has been extracted into the bhk column.</b>

In [48]:
housedf2=housedf2.drop(columns=['total_sqft','size'])
housedf2.head()

Unnamed: 0,location,bath,balcony,price(lakhs),bhk,sqft
0,Electronic City Phase II,2.0,1.0,39.07,2,1056.0
1,Chikka Tirupathi,5.0,3.0,120.0,4,2600.0
2,Uttarahalli,2.0,3.0,62.0,3,1440.0
3,Lingadheeranahalli,3.0,1.0,95.0,3,1521.0
4,Kothanur,2.0,1.0,51.0,2,1200.0


<b>Adding a new feature: price per square foot</b>

In [49]:
# price(lakhs) is in lakhs (1 lakh = 100,000), so multiply before dividing by the area
housedf2['price_per_sqft'] = housedf2['price(lakhs)'] * 100000 / housedf2['sqft']
housedf2.head()

Unnamed: 0,location,bath,balcony,price(lakhs),bhk,sqft,price_per_sqft
0,Electronic City Phase II,2.0,1.0,39.07,2,1056.0,3699.810606
1,Chikka Tirupathi,5.0,3.0,120.0,4,2600.0,4615.384615
2,Uttarahalli,2.0,3.0,62.0,3,1440.0,4305.555556
3,Lingadheeranahalli,3.0,1.0,95.0,3,1521.0,6245.890861
4,Kothanur,2.0,1.0,51.0,2,1200.0,4250.0
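
As a quick check of the formula, the first row of the table can be worked by hand. The values below are copied from that row; the variable names are illustrative.

```python
price_lakhs = 39.07  # price(lakhs) of row 0 in the table above
sqft = 1056.0        # sqft of row 0

# 1 lakh = 100,000, so the result is a price per square foot in full currency units.
price_per_sqft = price_lakhs * 100_000 / sqft

print(round(price_per_sqft, 2))
```

The result should agree with the price_per_sqft value of about 3699.81 shown for row 0.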


In [50]:
housedf2_stats = housedf2['price_per_sqft'].describe()
housedf2_stats

count    1.320000e+04
mean     7.920759e+03
std      1.067272e+05
min      2.678298e+02
25%      4.267701e+03
50%      5.438331e+03
75%      7.317073e+03
max      1.200000e+07
Name: price_per_sqft, dtype: float64

<b>Next we examine location, which is a categorical variable. It has many distinct values, so we need a dimensionality reduction step to reduce the number of locations.</b>

In [56]:
location_stats = housedf2['location'].value_counts(ascending=False)
location_stats

Whitefield         532
Sarjapur  Road     392
Electronic City    302
Kanakpura Road     264
Thanisandra        232
                  ... 
adigondanhalli       1
Pillahalli           1
elachenahalli        1
Gulakamale           1
1Hanuman Nagar       1
Name: location, Length: 1298, dtype: int64

In [57]:
location_stats.values.sum()

13200

In [58]:
len(location_stats[location_stats>10])

240

In [59]:
len(location_stats)

1298

In [60]:
len(location_stats[location_stats<=10])

1058

## Dimensionality Reduction

<b>
Any location that has 10 or fewer data points is tagged as "other". This reduces the number of categories by a large margin. Later, when we do one-hot encoding, it leaves us with fewer dummy columns.</b>

In [61]:
location_stats_less_than_10 = location_stats[location_stats<=10]
location_stats_less_than_10

BTM 1st Stage       10
Thyagaraja Nagar    10
Ganga Nagar         10
Sadashiva Nagar     10
Basapura            10
                    ..
adigondanhalli       1
Pillahalli           1
elachenahalli        1
Gulakamale           1
1Hanuman Nagar       1
Name: location, Length: 1058, dtype: int64

In [62]:
housedf2.location = housedf2.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
len(housedf2.location.unique())

241

In [66]:
housedf2['location'].value_counts()

other                 2887
Whitefield             532
Sarjapur  Road         392
Electronic City        302
Kanakpura Road         264
                      ... 
Kodigehalli             11
Doddaballapur           11
Nehru Nagar             11
Pattandur Agrahara      11
LB Shastri Nagar        11
Name: location, Length: 241, dtype: int64
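
The effect on one-hot encoding can be seen on a toy example. The sketch below uses a handful of location names from this data set with made-up counts and a threshold of 1 standing in for the threshold of 10 used above. It groups the rare values with `Series.where` and `Series.isin`, a vectorized alternative to the `apply` call, and then counts the dummy columns that `pd.get_dummies` would produce.

```python
import pandas as pd

# Toy data: the location names are real, the counts are made up for illustration.
locations = pd.Series(
    ['Whitefield'] * 3 + ['Kothanur'] * 2 + ['Pillahalli', 'Gulakamale']
)
min_count = 1  # stand-in for the threshold of 10 used in this notebook

counts = locations.value_counts()
rare = counts[counts <= min_count].index

# Keep frequent locations and replace the rest with 'other'.
grouped = locations.where(~locations.isin(rare), 'other')

# Number of dummy columns before and after grouping.
print(pd.get_dummies(locations).shape[1], pd.get_dummies(grouped).shape[1])
```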

In [67]:
housedf2.head(12)

Unnamed: 0,location,bath,balcony,price(lakhs),bhk,sqft,price_per_sqft
0,Electronic City Phase II,2.0,1.0,39.07,2,1056.0,3699.810606
1,Chikka Tirupathi,5.0,3.0,120.0,4,2600.0,4615.384615
2,Uttarahalli,2.0,3.0,62.0,3,1440.0,4305.555556
3,Lingadheeranahalli,3.0,1.0,95.0,3,1521.0,6245.890861
4,Kothanur,2.0,1.0,51.0,2,1200.0,4250.0
5,Whitefield,2.0,1.0,38.0,2,1170.0,3247.863248
6,Old Airport Road,4.0,1.0,204.0,4,2732.0,7467.057101
7,Rajaji Nagar,4.0,1.0,600.0,4,3300.0,18181.818182
8,Marathahalli,3.0,1.0,63.25,3,1310.0,4828.244275
9,other,6.0,1.0,370.0,6,1020.0,36274.509804


## Outlier Removal Using Business Logic

<b>Suppose a business manager with real estate expertise tells us that a bedroom normally takes about 300 sqft (i.e. a 2 bhk apartment is at least 600 sqft). A 400 sqft apartment with 2 bhk then looks suspicious and can be removed as an outlier. We remove such outliers by setting our minimum threshold to 300 sqft per bhk.</b>
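
A minimal sketch of this rule on a few rows is shown below. Three rows copy the bhk and sqft values of rows 0, 2, and 9 from the table above, and the fourth is the hypothetical 400 sqft 2 bhk apartment from the text. The names `toy` and `MIN_SQFT_PER_BHK` are illustrative.

```python
import pandas as pd

# Rows 0, 2 and 9 from the table above, plus the hypothetical 400 sqft 2 bhk case.
toy = pd.DataFrame({
    'bhk': [2, 3, 6, 2],
    'sqft': [1056.0, 1440.0, 1020.0, 400.0],
})

MIN_SQFT_PER_BHK = 300

# Keep only rows with at least 300 sqft per bedroom.
kept = toy[toy['sqft'] / toy['bhk'] >= MIN_SQFT_PER_BHK]

print(kept)
```

Both the hypothetical apartment (200 sqft per bhk) and row 9 (1020 sqft for 6 bhk, or 170 sqft per bhk) fall below the threshold and are filtered out.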