#### Estimating housing values accurately can be difficult. The size of a house (in square feet) is not the only factor that influences its price.

#### This project demonstrates, step by step, how to build a housing price prediction website. It covers the following areas:
- Data Cleaning
- Feature Engineering
- Outlier Removal
- Model Building


#### Dataset: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

### Import required libraries

In [1]:
import pandas as pd

### Read the dataset

In [2]:
dt = pd.read_csv("Bengaluru_House_Data.csv") # Reading the data

In [3]:
# Copy the data to another variable so the original data is left unchanged
dt1=dt.copy()

### View the first and last 5 rows of the dataset.

In [4]:
dt1.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [5]:
dt1.tail()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.0
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,,400.0
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.0
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.0,488.0
13319,Super built-up Area,Ready To Move,Doddathoguru,1 BHK,,550,1.0,1.0,17.0


### Understanding the shape of the dataset

In [6]:
dt1.shape

(13320, 9)

The dataset has 13320 rows and 9 columns of data

### Check the data types of the columns for the dataset.

In [7]:
dt1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


#### Observations -

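    - Several columns contain null values: `location` (1 missing), `size` (16), `society` (5502), `bath` (73) and `balcony` (609). The other columns are complete.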

### Dropping the features that are not adding any information.

In [8]:
dt2 = dt1.drop(['area_type','society','balcony','availability'],axis='columns')
dt2.shape

(13320, 5)

In [9]:
dt2.head()

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Kothanur,2 BHK,1200,2.0,51.0


## Data Cleaning

### Handling NaN values

In [10]:
dt2.isnull().sum() # number of NaN values in each column

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In [11]:
# Drop the rows that contain NaN values, since they are few compared to the whole dataset
dt3 = dt2.dropna()
dt3.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64

In [12]:
dt3.shape

(13246, 5)
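
To put "few compared to the whole dataset" into numbers, the share of rows removed can be computed from the two shapes reported above. This is only a quick check using those row counts; the variable names are illustrative.

```python
rows_before = 13320  # dt2.shape[0], from the output above
rows_after = 13246   # dt3.shape[0], from the output above

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(f"dropped {rows_dropped} rows ({share_dropped:.2%} of the data)")
```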

In [13]:
dt3['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

#### Observation:
 - `3 BHK` and `3 Bedroom` describe the same number of bedrooms, so we create a new numeric column called `bhk` from the leading number in `size`. The same applies to every other bedroom count.

In [14]:
dt3 = dt3.copy() # work on an explicit copy so pandas does not raise SettingWithCopyWarning
dt3['bhk'] = dt3['size'].apply(lambda x : int(x.split(" ")[0]))


In [15]:
dt3.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0,4
2,Uttarahalli,3 BHK,1440,2.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0,3
4,Kothanur,2 BHK,1200,2.0,51.0,2


In [16]:
dt3['bhk'].unique()

array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18])
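
The same leading-number extraction could also be written with pandas string methods instead of `apply`. The sketch below runs on a toy frame named `sample` (an assumption for illustration, holding a few `size` values copied from the output above), not on the notebook's data.

```python
import pandas as pd

# Toy data: a few `size` strings taken from the unique values shown earlier
sample = pd.DataFrame({"size": ["2 BHK", "4 Bedroom", "3 BHK", "1 RK"]})

# Capture the leading digits and convert them to integers
sample["bhk"] = sample["size"].str.extract(r"^(\d+)", expand=False).astype(int)

print(sample)
```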

In [17]:
dt3['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

#### Observation:
- `total_sqft` contains range values such as `1133 - 1384`, so we will replace each range with the average of its two endpoints.

In [18]:
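def is_float(x):
    """Return True if x can be parsed as a float, otherwise False."""
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

# Show the rows whose total_sqft is not a plain number
dt3[~dt3['total_sqft'].apply(is_float)]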

Unnamed: 0,location,size,total_sqft,bath,price,bhk
30,Yelahanka,4 BHK,2100 - 2850,4.0,186.000,4
122,Hebbal,4 BHK,3067 - 8156,4.0,477.000,4
137,8th Phase JP Nagar,2 BHK,1042 - 1105,2.0,54.005,2
165,Sarjapur,2 BHK,1145 - 1340,2.0,43.490,2
188,KR Puram,2 BHK,1015 - 1540,2.0,56.800,2
...,...,...,...,...,...,...
12975,Whitefield,2 BHK,850 - 1060,2.0,38.190,2
12990,Talaghattapura,3 BHK,1804 - 2273,3.0,122.000,3
13059,Harlur,2 BHK,1200 - 1470,2.0,72.760,2
13265,Hoodi,2 BHK,1133 - 1384,2.0,59.135,2
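
As an alternative to a hand-written `is_float`, `pd.to_numeric` with `errors="coerce"` turns unparseable strings into NaN, which can then be used as a mask. This is a minimal sketch on a toy Series; `sample` and the value `"1200 sqft"` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: two plain numbers, one range from the output above,
# and one hypothetical value with unit text
sample = pd.Series(["1056", "2100 - 2850", "1200 sqft", "774"])

# Strings that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(sample, errors="coerce")

# Keep only the entries that failed to parse
non_numeric = sample[as_number.isna()]
print(non_numeric.tolist())
```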


In [19]:
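def convert_sqft_to_num(x):
    """Convert a total_sqft string to a float; a range 'a - b' becomes its average."""
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        return None  # values that cannot be parsed are filtered out afterwards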

dt4 = dt3.copy()
dt4.total_sqft = dt4.total_sqft.apply(convert_sqft_to_num)
dt4 = dt4[dt4.total_sqft.notnull()]
dt4.head(10)

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3
4,Kothanur,2 BHK,1200.0,2.0,51.0,2
5,Whitefield,2 BHK,1170.0,2.0,38.0,2
6,Old Airport Road,4 BHK,2732.0,4.0,204.0,4
7,Rajaji Nagar,4 BHK,3300.0,4.0,600.0,4
8,Marathahalli,3 BHK,1310.0,3.0,63.25,3
9,Gandhi Bazar,6 Bedroom,1020.0,6.0,370.0,6


In [20]:
dt4.loc[122]

location      Hebbal
size           4 BHK
total_sqft    5611.5
bath             4.0
price          477.0
bhk                4
Name: 122, dtype: object
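
Row 122 originally held the range `3067 - 8156`, so the value 5611.5 shown above should be the midpoint of that range. The arithmetic can be checked by hand with a short sketch (the variable names are illustrative):

```python
# Split the original range string for row 122 and average the two endpoints
low, high = (float(token) for token in "3067 - 8156".split("-"))
midpoint = (low + high) / 2

print(midpoint)  # 5611.5
```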

## Feature Engineering

In [21]:
# Compute the price per square foot
dt5 = dt4.copy()
dt5['price_per_sqft'] = dt5['price']*100000/dt5['total_sqft'] # price is in lakhs (1 lakh = 100000), so multiply by 100000
dt5.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250.0
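
As a worked example of the formula, the first row (price 39.07 lakhs, 1056 sq ft) can be recomputed by hand and compared with the 3699.810606 shown in the table. The variable names below are illustrative.

```python
price_lakhs = 39.07   # price of row 0, in lakhs
total_sqft = 1056.0   # total_sqft of row 0

# Convert lakhs to base currency units, then divide by the area
price_per_sqft = price_lakhs * 100000 / total_sqft

print(f"{price_per_sqft:.2f}")  # 3699.81
```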


#### Next, we analyze the categorical `location` variable. Because it has many distinct values, we apply a dimensionality reduction technique to reduce the number of locations.

In [22]:
len(dt5.location.unique())

1298

In [23]:
dt5.location = dt5.location.apply(lambda x: x.strip())
location_data = dt5['location'].value_counts(ascending=False)
location_data

Whitefield                                            533
Sarjapur  Road                                        392
Electronic City                                       304
Kanakpura Road                                        264
Thanisandra                                           235
                                                     ... 
Chaitanya Ananya                                        1
Gaundanapalya                                           1
BDS Layout                                              1
5th block Koramangala                                   1
bsk 6th stage 2ad block near sri conversation hall      1
Name: location, Length: 1287, dtype: int64

In [24]:
len(location_data[location_data<10]) # Number of locations that occur fewer than 10 times

1033

### Dimensionality Reduction

#### To greatly reduce the number of categories, we tag every location that has 10 or fewer data points as `other`.

In [25]:
locations_less_than_10 = location_data[location_data<=10]
locations_less_than_10

Nagappa Reddy Layout                                  10
Basapura                                              10
Dodsworth Layout                                      10
Dairy Circle                                          10
Sadashiva Nagar                                       10
                                                      ..
Chaitanya Ananya                                       1
Gaundanapalya                                          1
BDS Layout                                             1
5th block Koramangala                                  1
bsk 6th stage 2ad block near sri conversation hall     1
Name: location, Length: 1047, dtype: int64

In [26]:
len(dt5.location.unique())

1287

In [27]:
# `x in series` tests the Series index, which holds the location names here
dt5.location = dt5.location.apply(lambda x: 'other' if x in locations_less_than_10 else x)
len(dt5.location.unique())

241
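
The resulting count is consistent with the earlier outputs: 1287 unique locations minus the 1047 rare ones, plus the single `other` label, gives 241. The same grouping idea can be sketched on a toy frame using `isin` and `where`; the data, the name `toy` and the threshold of 1 are all assumptions chosen to keep the example small.

```python
import pandas as pd

# Toy data: A appears 3 times, B twice, C and D once each
toy = pd.DataFrame({"location": ["A"] * 3 + ["B"] * 2 + ["C", "D"]})

# Count occurrences and collect the labels at or below the (hypothetical) threshold
counts = toy["location"].value_counts()
rare = counts[counts <= 1].index

# Keep frequent labels as they are and replace rare ones with 'other'
toy["location"] = toy["location"].where(~toy["location"].isin(rare), "other")

print(toy["location"].value_counts())
```

Using `isin` on the index of rare labels makes the membership test explicit, rather than relying on how `in` behaves for a Series.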

In [28]:
dt5.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250.0
