**Problem statement: analyze the dataset of properties in Lahore, then build a model to predict price from location, area (sq. ft), bedrooms, baths...**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('Property_with_Feature_Engineering.csv')

In [3]:
df.head(2)

Unnamed: 0,property_id,location_id,page_url,property_type,price,price_bin,location,city,province_name,locality,...,area_marla,area_sqft,purpose,bedrooms,date_added,year,month,day,agency,agent
0,347795,8,https://www.zameen.com/Property/lahore_model_t...,House,220000000,Very High,Model Town,Lahore,Punjab,"Model Town, Lahore, Punjab",...,120.0,32670.12,For Sale,0,07-17-2019,2019,7,17,Real Biz International,Usama Khan
1,482892,48,https://www.zameen.com/Property/lahore_multan_...,House,40000000,Very High,Multan Road,Lahore,Punjab,"Multan Road, Lahore, Punjab",...,20.0,5445.02,For Sale,5,10-06-2018,2018,10,6,Khan Estate,mohsinkhan and B


In [7]:
## Checking which cities of Pakistan are present in the dataset
print(df['city'].unique())


['Lahore' 'Karachi' 'Islamabad' 'Faisalabad' 'Rawalpindi']


In [8]:
df.shape

(191393, 24)

In [9]:
data = df[df['city']=='Lahore']
data.shape

(58736, 24)

In [10]:
# dropping unnecessary columns

col_names = ["location_id","page_url","province_name","locality","area_marla","year","month","day","agency","agent","latitude","longitude","property_id","property_type","price_bin","purpose","date_added","city","area"]
data = data.drop(col_names, axis=1)

In [12]:
# reset the index after filtering; drop=True discards the old index instead of keeping it as a column
data = data.reset_index(drop=True)
data.head()

Unnamed: 0,price,location,baths,area_sqft,bedrooms
0,220000000,Model Town,0,32670.12,0
1,40000000,Multan Road,5,5445.02,5
2,9500000,Eden,0,2450.26,3
3,125000000,Gulberg,7,5445.02,8
4,21000000,Allama Iqbal Town,5,2994.76,6


In [13]:
data.isnull().sum()

price        0
location     0
baths        0
area_sqft    0
bedrooms     0
dtype: int64

In [16]:
data['bedrooms'].unique()

array([ 0,  5,  3,  8,  6,  4,  2,  7,  1, 10, 11,  9, 14, 12, 13, 18, 15,
       16, 25, 20])

The output above shows some properties with more than 10 bedrooms. Some of these may be typos, while others may have further inconsistencies, such as very few or no baths. Let's inspect the properties that have more than 13 bedrooms.

In [17]:
data[data['bedrooms']>13]

Unnamed: 0,price,location,baths,area_sqft,bedrooms
1099,100000000,Garden Town,0,10890.04,14
19710,650000000,Gulberg,0,25047.09,18
21817,175000000,Shah Jamal,0,26136.1,15
28454,350000000,Gulberg,0,17424.06,16
38558,175000000,Shah Jamal,0,26136.1,15
39985,960000000,Gulberg,0,43560.16,25
51998,960000000,Gulberg,0,43560.16,25
53342,1000000,Gulberg,0,21780.08,20
57580,300000,Habibullah Road,0,8712.03,20
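
Every row in the output above reports zero baths, which suggests that the missing bath counts, rather than the bedroom counts alone, are the unreliable part. A minimal sketch of how this could be quantified, assuming a small hypothetical frame `toy` in place of `data` (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "bedrooms": [14, 18, 5, 3, 20, 4],
    "baths":    [0,  0,  5, 0, 0,  3],
})

# Flag rows with no recorded baths, then count them per bedroom count.
toy["no_baths"] = toy["baths"] == 0
zero_bath_counts = toy.groupby("bedrooms")["no_baths"].sum()

# Share of large listings (more than 13 bedrooms) that report zero baths.
large = toy[toy["bedrooms"] > 13]
share_zero = large["no_baths"].mean()

print(zero_bath_counts)
print(share_zero)
```

Running the same two steps on `data` would show whether zero-bath rows are concentrated among the listings with many bedrooms.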


In [18]:
data['baths'].unique()

array([ 0,  5,  7,  6,  4,  3,  2,  8,  1, 10,  9, 11, 12, 15])

In [19]:
data = data.drop(data[(data['baths']==0) & (data['bedrooms'] > 3)].index)
data

Unnamed: 0,price,location,baths,area_sqft,bedrooms
0,220000000,Model Town,0,32670.12,0
1,40000000,Multan Road,5,5445.02,5
2,9500000,Eden,0,2450.26,3
3,125000000,Gulberg,7,5445.02,8
4,21000000,Allama Iqbal Town,5,2994.76,6
...,...,...,...,...,...
58730,32000,Allama Iqbal Town,0,2722.51,2
58732,185000,DHA Defence,6,5445.02,5
58733,150000,DHA Defence,5,2722.51,4
58734,70000,DHA Defence,3,5445.02,3


In the above cell we dropped the rows that have no baths but more than 3 bedrooms, since they are probably data-entry errors.
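
The same filter can also be written as a boolean mask that is inverted with `~`, which may be easier to read than dropping by index. This is only a sketch on a hypothetical frame `toy`; the names `suspect` and `cleaned` are assumptions introduced for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for `data`.
toy = pd.DataFrame({
    "baths":    [0, 5, 0, 7, 0],
    "bedrooms": [0, 5, 3, 8, 6],
})

# True where a listing has no baths but more than 3 bedrooms.
suspect = (toy["baths"] == 0) & (toy["bedrooms"] > 3)

# Keep every row that is NOT flagged as suspect.
cleaned = toy[~suspect]

print(suspect.sum(), "row(s) removed")
print(cleaned)
```

Keeping the mask in its own variable also makes it easy to check how many rows are affected before removing them.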

In [25]:
# each comparison needs its own parentheses because | binds more tightly than ==
data.drop(data[(data['bedrooms']==0) | (data['baths']==0)].index, inplace=True)


In the above cell we also dropped the properties that have either no bedrooms or no baths.
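
When two conditions are combined with `|` in pandas, each comparison should sit in its own parentheses, because Python's `|` operator has higher precedence than `==`. A minimal sketch on a hypothetical frame `toy` (the names `missing` and `kept` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for `data`.
toy = pd.DataFrame({
    "bedrooms": [0, 3, 4, 2],
    "baths":    [2, 0, 3, 1],
})

# Parenthesize each comparison so that `|` combines two boolean Series.
missing = (toy["bedrooms"] == 0) | (toy["baths"] == 0)

# Keep only listings with at least one bedroom and at least one bath.
kept = toy[~missing].reset_index(drop=True)

print(missing.tolist())
print(kept)
```

Resetting the index after the filter is optional; it is done here only so the remaining rows are numbered from zero again.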