<h1 style='color:#007fd4' align='center'>Data Science Regression Project: Predicting Home Prices in Hyderabad</h1>

The dataset was downloaded from Kaggle: https://www.kaggle.com/datasets/chilledwanker/makaan-house-prices-dataset

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib 
matplotlib.rcParams["figure.figsize"] = (20,10)

<h2 style='color:#007fd4' align='left'>Loading Hyderabad house price data into a DataFrame</h2>

In [2]:
df1 = pd.read_csv("Hyd_House_Prices.csv")
df1.head(10)

Unnamed: 0.1,Unnamed: 0,Builder Name,City,Locality,Property Type,Price in Crores,Area in Sqft,Construction Status,Age of property
0,0,Theme Ambience InfrastructuresBUILDER0,Hyderabad,Attapur,3 BHK Apartment,1.5,2090,Under Construction,Possession by Dec 2024
1,1,Elemental RealtyBUILDER0,Hyderabad,Patancheru,4 BHK Villa,1.48,1930,Under Construction,Possession by Nov 2024
2,2,R V Nirmaan Private LimitedBUILDER0,Hyderabad,Miyapur,3 BHK Apartment,1.27,1967,Under Construction,Possession by Nov 2024
3,3,Risinia BuildersBUILDER0,Hyderabad,Pragathi Nagar Kukatpally,2 BHK Apartment,0.7345,1312,Under Construction,Possession by May 2025
4,4,Cyber City OrianaBUILDER0,Hyderabad,Kukatpally,3 BHK Apartment,1.28,1480,Under Construction,Possession by Nov 2025
5,5,Shree Vasavi DevelopersBUILDER0,Hyderabad,Attapur,3 BHK Apartment,1.46,2005,Under Construction,Possession by Nov 2024
6,6,Jain ConstructionsBUILDER0,Hyderabad,Malkajgiri,3 BHK Apartment,1.63,2245,Under Construction,Possession by Nov 2023
7,7,Ira RealityBUILDER0,Hyderabad,Adibatla,4 BHK Villa,3.14,3705,Under Construction,Possession by Dec 2025
8,8,Hallmark BuildersBUILDER0,Hyderabad,Patighanpur,3 BHK Apartment,1.13,1885,Under Construction,Possession by Nov 2025
9,9,APR GroupBUILDER0,Hyderabad,Bachupally,4 BHK Villa,4.08,3400,Under Construction,Possession by Nov 2024


In [3]:
# Drop the redundant 'Unnamed: 0' column from df1
df1 = df1.drop(['Unnamed: 0'], axis=1)
df1.head(10)

Unnamed: 0,Builder Name,City,Locality,Property Type,Price in Crores,Area in Sqft,Construction Status,Age of property
0,Theme Ambience InfrastructuresBUILDER0,Hyderabad,Attapur,3 BHK Apartment,1.5,2090,Under Construction,Possession by Dec 2024
1,Elemental RealtyBUILDER0,Hyderabad,Patancheru,4 BHK Villa,1.48,1930,Under Construction,Possession by Nov 2024
2,R V Nirmaan Private LimitedBUILDER0,Hyderabad,Miyapur,3 BHK Apartment,1.27,1967,Under Construction,Possession by Nov 2024
3,Risinia BuildersBUILDER0,Hyderabad,Pragathi Nagar Kukatpally,2 BHK Apartment,0.7345,1312,Under Construction,Possession by May 2025
4,Cyber City OrianaBUILDER0,Hyderabad,Kukatpally,3 BHK Apartment,1.28,1480,Under Construction,Possession by Nov 2025
5,Shree Vasavi DevelopersBUILDER0,Hyderabad,Attapur,3 BHK Apartment,1.46,2005,Under Construction,Possession by Nov 2024
6,Jain ConstructionsBUILDER0,Hyderabad,Malkajgiri,3 BHK Apartment,1.63,2245,Under Construction,Possession by Nov 2023
7,Ira RealityBUILDER0,Hyderabad,Adibatla,4 BHK Villa,3.14,3705,Under Construction,Possession by Dec 2025
8,Hallmark BuildersBUILDER0,Hyderabad,Patighanpur,3 BHK Apartment,1.13,1885,Under Construction,Possession by Nov 2025
9,APR GroupBUILDER0,Hyderabad,Bachupally,4 BHK Villa,4.08,3400,Under Construction,Possession by Nov 2024


In [4]:
df1.shape

(5960, 8)

<h2 style='color:#007fd4' align='left'>Renaming columns for better readability</h2>

In [5]:
df1.rename(columns={
    'Builder Name': 'builder_name',
    'City': 'city',
    'Locality': 'locality',
    'Property Type': 'property_type',
    'Price in Crores': 'price_in_crores',
    'Area in Sqft': 'area_in_sqft',
    'Construction Status': 'construction_status',
    'Age of property': 'age_of_property',
}, inplace=True)

<h2 style='color:#007fd4' align='left'>Data Cleaning</h2>

In [6]:
df1.isnull().sum()

builder_name             0
city                     0
locality                 0
property_type            0
price_in_crores          0
area_in_sqft             0
construction_status      0
age_of_property        174
dtype: int64

In [7]:
df1.locality.value_counts()

locality
Miyapur             205
Kollur              202
Yacharam            164
Narsingi            149
Kokapet             137
                   ... 
Erragadda             1
Narepally             1
Rameswaram Banda      1
Masab Tank            1
Devatabowli           1
Name: count, Length: 299, dtype: int64

In [8]:
nulls = df1[df1.age_of_property.isnull()]
nulls.head(29)

Unnamed: 0,builder_name,city,locality,property_type,price_in_crores,area_in_sqft,construction_status,age_of_property
53,Alekya Infraa DevelopersAGENT0,Hyderabad,Aroor,Residential Plot,0.296,1665,New,
67,sellerVERIFIED OWNER,Hyderabad,Shadnagar,"Residential PlotShadnagar, Hyderabad",0.4,2250,Resale,
76,sellerVERIFIED OWNER,Hyderabad,Kadthal,"Residential PlotKadthal, Hyderabad",0.25,12150,Resale,
128,Alekya Infraa DevelopersAGENT0,Hyderabad,Nallagandla Gachibowli,Residential Plot,0.272,1530,New,
129,Alekya Infraa DevelopersAGENT0,Hyderabad,Aroor,Residential Plot,0.4,2250,New,
353,sellerVERIFIED OWNER,Hyderabad,Shamshabad,"Residential PlotShamshabad, Hyderabad",0.2562,1650,New,
360,Alekya Infraa DevelopersAGENT0,Hyderabad,Sadashivpet,Residential Plot,0.234,1620,New,
361,Alekya Infraa DevelopersAGENT0,Hyderabad,Sadashivpet,Residential Plot,0.221,1530,New,
362,Alekya Infraa DevelopersAGENT0,Hyderabad,Sadashivpet,Residential Plot,0.2119,1467,New,
374,Alekya Infraa DevelopersAGENT0,Hyderabad,Sadashivpet,Residential Plot,0.4875,3375,New,


In [9]:
# filter rows where Property Type contains 'Residential Plot'
df_residential_plots = df1[df1.property_type.str.contains('Residential Plot', case=False, na=False)]
df_residential_plots.shape, df_residential_plots.head()

((1857, 8),
                       builder_name       city     locality      property_type  \
 12           Beyond NatureBUILDER0  Hyderabad  Sadashivpet  Residential Plot    
 27        G square HousingBUILDER0  Hyderabad   Choutuppal  Residential Plot    
 40           Beyond NatureBUILDER0  Hyderabad  Sadashivpet  Residential Plot    
 50        G square HousingBUILDER0  Hyderabad   Choutuppal  Residential Plot    
 53  Alekya Infraa DevelopersAGENT0  Hyderabad        Aroor  Residential Plot    
 
     price_in_crores  area_in_sqft construction_status         age_of_property  
 12           0.6300          3150                 New  Possession by Feb 2024  
 27           0.9781          4797                 New  Possession by Apr 2024  
 40           0.3600          1800                 New  Possession by Feb 2024  
 50           0.4899          2403                 New  Possession by Apr 2024  
 53           0.2960          1665                 New                     NaN  )

In [10]:
df2 = df1[~df1.property_type.str.contains('Residential Plot', case=False, na=False)]
df2.shape

(4103, 8)

<h2 style='color:#007fd4' align='left'>Feature Engineering: Creating new columns for property type & bedrooms</h2>

In [11]:
# Take an explicit copy so the new column is not assigned to a slice of df1
df2 = df2.copy()
df2['bedrooms'] = df2['property_type'].str.extract(r'(\d+)\s*BHK', expand=False).astype(float)
df2


Unnamed: 0,builder_name,city,locality,property_type,price_in_crores,area_in_sqft,construction_status,age_of_property,bedrooms
0,Theme Ambience InfrastructuresBUILDER0,Hyderabad,Attapur,3 BHK Apartment,1.5000,2090,Under Construction,Possession by Dec 2024,3.0
1,Elemental RealtyBUILDER0,Hyderabad,Patancheru,4 BHK Villa,1.4800,1930,Under Construction,Possession by Nov 2024,4.0
2,R V Nirmaan Private LimitedBUILDER0,Hyderabad,Miyapur,3 BHK Apartment,1.2700,1967,Under Construction,Possession by Nov 2024,3.0
3,Risinia BuildersBUILDER0,Hyderabad,Pragathi Nagar Kukatpally,2 BHK Apartment,0.7345,1312,Under Construction,Possession by May 2025,2.0
4,Cyber City OrianaBUILDER0,Hyderabad,Kukatpally,3 BHK Apartment,1.2800,1480,Under Construction,Possession by Nov 2025,3.0
...,...,...,...,...,...,...,...,...,...
5955,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,2 BHK Apartment,0.9073,1315,Under Construction,Possession by Nov 2026,2.0
5956,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,3 BHK Apartment,1.5300,2230,Under Construction,Possession by Nov 2026,3.0
5957,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,2 BHK Apartment,0.9073,1315,Under Construction,Possession by Nov 2026,2.0
5958,TSR PROPERTIESAGENT0,Hyderabad,Narsingi,2 BHK ApartmentNars,0.6893,1130,Under Construction,2 Bathrooms,2.0


In [12]:
df2 = df2.dropna(subset=['bedrooms'])
df2.shape

(4099, 9)
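
As a quick illustration of the extraction step above, the following sketch applies the same pattern to a few made-up `property_type` strings. The sample values are assumptions modelled on the listings shown earlier, not rows taken from the dataset. Entries without a `BHK` token come back as `NaN`, which is why the `dropna` call removes them.

```python
import pandas as pd

# Hypothetical sample values, modelled on the messy strings in the dataset.
samples = pd.Series([
    "3 BHK Apartment ",
    "4 BHK VillaPatancheru, Hyderabad",
    "2 BHK ApartmentNars",
    "Studio Apartment",  # no BHK token, so the pattern does not match
])

# Same pattern as the notebook: digits, optional whitespace, then 'BHK'.
bedrooms = samples.str.extract(r'(\d+)\s*BHK', expand=False).astype(float)
print(bedrooms.tolist())
```

Casting to `float` first keeps the unmatched rows representable as `NaN`; the later cast to `int` is only safe once those rows have been dropped.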

In [13]:
df2.property_type.unique()

array(['3 BHK Apartment ', '4 BHK Villa ', '2 BHK Apartment ',
       '4 BHK Apartment ', '4 BHK Independent House ', '1 BHK Apartment ',
       '3 BHK Independent House ',
       '2 BHK ApartmentPragathi Nagar Kukatpally, Hyderabad',
       '3 BHK Independent FloorMalakpet, Hyderabad',
       '3 BHK ApartmentKollur, Hyderabad',
       '3 BHK ApartmentKokapet, Hyderabad',
       '3 BHK ApartmentAlwal, Hyderabad',
       '4 BHK VillaPatancheru, Hyderabad',
       '4 BHK VillaManneguda, Hyderabad',
       '2 BHK VillaNandigama, Hyderabad',
       '3 BHK VillaIDA Pashamylaram, Hyderabad', '4 BHK ApartmentNars',
       '3 BHK ApartmentSa', '2 BHK Independent HouseLaxmiguda, Hyderabad',
       '2 BHK ApartmentToli Chowki, Hyderabad', '5 BHK Apartment ',
       '2 BHK ApartmentBowrampet, Hyderabad',
       '3 BHK VillaKompally, Hyderabad',
       '2 BHK ApartmentBachupally, Hyderabad',
       '3 BHK VillaBowrampet, Hyderabad',
       '2 BHK ApartmentMiyapur, Hyderabad',
       '3 BHK Apartme

In [14]:
# Work on an explicit copy to avoid chained-assignment warnings
df2 = df2.copy()
# Keep only the property category, dropping the locality text fused onto it
df2['property_type'] = df2['property_type'].str.extract(
    r'(Apartment|Villa|Independent House|Independent Floor)', 
    expand=False
)
df2['price_in_crores'] = df2['price_in_crores'].round(2)
df2['bedrooms'] = df2['bedrooms'].astype(int)
df2.shape


(4099, 9)

In [15]:
df3 = df2.copy()
exclude_values = ['Resale ', 'New ', '1 Bathrooms', '2 Bathrooms', '3 Bathrooms', '4 Bathrooms',
                  '5 Bathrooms', '6 Bathrooms', '7 Bathrooms', '8 Bathrooms',
                  '9 Bathrooms', '10 Bathrooms', '11 Bathrooms', '12 Bathrooms']
df3 = df3[~df3['age_of_property'].isin(exclude_values)]
df3

Unnamed: 0,builder_name,city,locality,property_type,price_in_crores,area_in_sqft,construction_status,age_of_property,bedrooms
0,Theme Ambience InfrastructuresBUILDER0,Hyderabad,Attapur,Apartment,1.50,2090,Under Construction,Possession by Dec 2024,3
1,Elemental RealtyBUILDER0,Hyderabad,Patancheru,Villa,1.48,1930,Under Construction,Possession by Nov 2024,4
2,R V Nirmaan Private LimitedBUILDER0,Hyderabad,Miyapur,Apartment,1.27,1967,Under Construction,Possession by Nov 2024,3
3,Risinia BuildersBUILDER0,Hyderabad,Pragathi Nagar Kukatpally,Apartment,0.73,1312,Under Construction,Possession by May 2025,2
4,Cyber City OrianaBUILDER0,Hyderabad,Kukatpally,Apartment,1.28,1480,Under Construction,Possession by Nov 2025,3
...,...,...,...,...,...,...,...,...,...
5953,TSR PROPERTIESAGENT0,Hyderabad,Manikonda,Apartment,1.10,1750,Ready to move,4 years old,3
5954,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,Apartment,1.53,2230,Under Construction,Possession by Nov 2026,3
5955,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,Apartment,0.91,1315,Under Construction,Possession by Nov 2026,2
5956,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,Apartment,1.53,2230,Under Construction,Possession by Nov 2026,3


<h2 style='color:#007fd4' align='left'>Handling age of property column</h2>

In [16]:
df3.age_of_property.unique()

array(['Possession by Dec 2024', 'Possession by Nov 2024',
       'Possession by May 2025', 'Possession by Nov 2025',
       'Possession by Nov 2023', 'Possession by Dec 2025',
       'Possession by Feb 2026', 'Possession by Nov 2026',
       'Possession by May 2027', '0 - 1 year old', '2 - 3 years old',
       'Possession by Jun 2025', 'Possession by Aug 2025',
       'Possession by Oct 2023', 'Possession by Sep 2026',
       '6 - 7 years old', 'Possession by Jan 2027',
       'Possession by Sep 2023', '3 - 4 years old',
       'Possession by Mar 2024', '4 - 5 years old', '1 - 2 years old',
       'Possession by May 2024', 'Possession by Oct 2024',
       'Possession by May 2023', 'Possession by Apr 2025',
       'Possession by Feb 2024', 'Possession by Nov 2027',
       'Possession by Aug 2024', 'Possession by Feb 2025',
       '5 - 6 years old', 'Possession by Nov 2022',
       'Possession by Dec 2023', 'Possession by Jan 2025',
       '10 - 11 years old', '8 - 9 years old', 'Posses

In [17]:
df4 = df3.copy()

In [18]:
import re

def clean_age_property(value):
    text = str(value)

    # 'Possession by Dec 2024' -> '2024'
    if 'Possession by' in text:
        year_match = re.search(r'20(\d{2})', text)
        if year_match:
            return '20' + year_match.group(1)

    # '8 - 9 years old' -> '8 - 9'; '4 years old' -> '4'
    number_range = re.search(r'(\d+)\s*-\s*(\d+)', text)
    single_number = re.search(r'(\d+)', text)

    if number_range:
        return f"{number_range.group(1)} - {number_range.group(2)}"
    elif single_number:
        return single_number.group(1)
    else:
        return value

df4['age_of_property'] = df4['age_of_property'].apply(clean_age_property)

df4.age_of_property.unique()

array(['2024', '2025', '2023', '2026', '2027', '0 - 1', '2 - 3', '6 - 7',
       '3 - 4', '4 - 5', '1 - 2', '5 - 6', '2022', '10 - 11', '8 - 9',
       '2028', '2', '7 - 8', '14 - 15', '7', '9 - 10', '5', '2029', '4',
       '2020', '12 - 13', '13 - 14', '1', '11 - 12', '20 - 21', '11', '8',
       '18 - 19', '1899'], dtype=object)

In [19]:
def avg_property_age(x):
    # A range such as '8 - 9' is replaced by its midpoint
    tokens = x.split('-')
    if len(tokens) == 2:
        return (float(tokens[0])+float(tokens[1]))/2
    try:
        return float(x)
    except ValueError:
        # Not parseable as a number
        return None

In [20]:
avg_property_age('8 - 9')

8.5

In [21]:
df5 = df4.copy()
df5.age_of_property = df5.age_of_property.apply(lambda x: avg_property_age(str(x)))
df5.head()

Unnamed: 0,builder_name,city,locality,property_type,price_in_crores,area_in_sqft,construction_status,age_of_property,bedrooms
0,Theme Ambience InfrastructuresBUILDER0,Hyderabad,Attapur,Apartment,1.5,2090,Under Construction,2024.0,3
1,Elemental RealtyBUILDER0,Hyderabad,Patancheru,Villa,1.48,1930,Under Construction,2024.0,4
2,R V Nirmaan Private LimitedBUILDER0,Hyderabad,Miyapur,Apartment,1.27,1967,Under Construction,2024.0,3
3,Risinia BuildersBUILDER0,Hyderabad,Pragathi Nagar Kukatpally,Apartment,0.73,1312,Under Construction,2025.0,2
4,Cyber City OrianaBUILDER0,Hyderabad,Kukatpally,Apartment,1.28,1480,Under Construction,2025.0,3


In [22]:
df6 = df5.copy()
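# Values above 1800 are calendar years, not ages: convert them to an age
# relative to 2025 (possession years after 2025 become negative ages)
df6['age_of_property'] = df6['age_of_property'].apply(lambda x: 2025 - x if x > 1800 else x)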
df6.shape

(2872, 9)

In [23]:
df6 = df6[df6['age_of_property']<120]
df6.shape


(2871, 9)

In [24]:
df6['age_of_property'] = df6.age_of_property.astype(int)
df6.age_of_property.describe()

count    2871.000000
mean        1.145942
std         2.662414
min        -4.000000
25%         0.000000
50%         1.000000
75%         2.000000
max        20.000000
Name: age_of_property, dtype: float64
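
The age handling above can be condensed into a single function. The sketch below is not the notebook's code: `to_age_years` and its `reference_year` parameter are hypothetical names, and the reference year 2025 is taken from the subtraction used earlier. It shows why listings with a future possession date end up with negative ages, which is consistent with the minimum of -4 in the summary above.

```python
import re

def to_age_years(value, reference_year=2025):
    """Hypothetical helper: map a raw age string to an age in years."""
    text = str(value)
    # 'Possession by Dec 2024' -> reference_year - 2024
    year = re.search(r'(20\d{2})', text)
    if 'Possession by' in text and year:
        return float(reference_year - int(year.group(1)))
    # '8 - 9 years old' -> midpoint of the range
    rng = re.search(r'(\d+)\s*-\s*(\d+)', text)
    if rng:
        return (float(rng.group(1)) + float(rng.group(2))) / 2
    # '4 years old' -> 4.0
    single = re.search(r'(\d+)', text)
    return float(single.group(1)) if single else None

print(to_age_years('Possession by Dec 2024'))  # 1.0
print(to_age_years('8 - 9 years old'))         # 8.5
print(to_age_years('Possession by Nov 2026'))  # -1.0
```

Whether a negative age is a useful feature or should be clipped to zero is a modelling choice; the notebook keeps the negative values.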

In [25]:
df6['locality'] = df6['locality'].str.strip()
df6.locality.value_counts(ascending=False).head()

locality
Kollur         158
Miyapur        133
Narsingi       129
Hitech City    114
Kokapet        114
Name: count, dtype: int64

<h2 style="color:#007fd4">Outlier Removal Using Business Logic</h2>

In [26]:
df6[df6['area_in_sqft']/df6['bedrooms']<300]

Unnamed: 0,builder_name,city,locality,property_type,price_in_crores,area_in_sqft,construction_status,age_of_property,bedrooms
70,sellerVERIFIED OWNER,Hyderabad,Habsiguda,Apartment,0.55,1051,Ready to move,3,4
2266,Om Sri Sai Ram RentalsAGENT0,Hyderabad,Nacharam,Independent Floor,1.5,1000,Ready to move,7,4
3362,Om Sri Sai Ram RentalsAGENT0,Hyderabad,Nacharam,Independent Floor,1.5,1000,Ready to move,7,4
4135,MG PropertiesAGENT0,Hyderabad,Madhapur,Apartment,2.5,240,Ready to move,10,1
4466,MMAB Real EstateAGENT0,Hyderabad,Kurmaguda,Independent House,0.2,750,Ready to move,4,3


In [27]:
df7 = df6[~(df6['area_in_sqft']/df6['bedrooms']<300)]
df7.shape

(2866, 9)
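
The business rule above treats fewer than 300 sqft per bedroom as implausible. A minimal sketch on toy data (the rows are illustrative, loosely based on the listings shown above) makes the ratio explicit:

```python
import pandas as pd

# Toy listings; values are illustrative.
toy = pd.DataFrame({
    "area_in_sqft": [1051, 2090, 240, 1312],
    "bedrooms":     [4,    3,    1,   2],
})

MIN_SQFT_PER_BEDROOM = 300  # threshold used in the notebook

sqft_per_bedroom = toy["area_in_sqft"] / toy["bedrooms"]
# Equivalent to the negated '< 300' filter when no values are missing.
kept = toy[sqft_per_bedroom >= MIN_SQFT_PER_BEDROOM]

print(sqft_per_bedroom.round(2).tolist())
print(kept.index.tolist())
```

Here the 4-bedroom, 1051 sqft listing (262.75 sqft per bedroom) and the 240 sqft listing are dropped, while the other two are kept.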

<h2 style="color:#007fd4">Feature Engineering: Adding price per sqft column</h2>

In [28]:
df7['price_per_sqft'] = df7['price_in_crores']*10000000/df7['area_in_sqft']
df7

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df7['price_per_sqft'] = df7['price_in_crores']*10000000/df7['area_in_sqft']


Unnamed: 0,builder_name,city,locality,property_type,price_in_crores,area_in_sqft,construction_status,age_of_property,bedrooms,price_per_sqft
0,Theme Ambience InfrastructuresBUILDER0,Hyderabad,Attapur,Apartment,1.50,2090,Under Construction,1,3,7177.033493
1,Elemental RealtyBUILDER0,Hyderabad,Patancheru,Villa,1.48,1930,Under Construction,1,4,7668.393782
2,R V Nirmaan Private LimitedBUILDER0,Hyderabad,Miyapur,Apartment,1.27,1967,Under Construction,1,3,6456.532791
3,Risinia BuildersBUILDER0,Hyderabad,Pragathi Nagar Kukatpally,Apartment,0.73,1312,Under Construction,0,2,5564.024390
4,Cyber City OrianaBUILDER0,Hyderabad,Kukatpally,Apartment,1.28,1480,Under Construction,0,3,8648.648649
...,...,...,...,...,...,...,...,...,...,...
5953,TSR PROPERTIESAGENT0,Hyderabad,Manikonda,Apartment,1.10,1750,Ready to move,4,3,6285.714286
5954,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,Apartment,1.53,2230,Under Construction,-1,3,6860.986547
5955,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,Apartment,0.91,1315,Under Construction,-1,2,6920.152091
5956,TSR PROPERTIESAGENT0,Hyderabad,Nanakramguda,Apartment,1.53,2230,Under Construction,-1,3,6860.986547


In [29]:
df7.price_per_sqft.describe()

count     2866.000000
mean      7305.679027
std       2998.442533
min       1750.000000
25%       5608.769567
50%       6751.273941
75%       8289.474611
max      40909.090909
Name: price_per_sqft, dtype: float64
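
The conversion used above relies on 1 crore being 10,000,000 rupees. As a worked check, the first listing (1.50 crore, 2090 sqft) can be recomputed by hand; `price_per_sqft` here is a hypothetical helper name, not a function from the notebook.

```python
RUPEES_PER_CRORE = 10_000_000  # 1 crore = 10^7 rupees

def price_per_sqft(price_in_crores, area_in_sqft):
    """Price in rupees per square foot."""
    return price_in_crores * RUPEES_PER_CRORE / area_in_sqft

print(round(price_per_sqft(1.50, 2090), 2))  # 7177.03
```

This matches the value of 7177.033493 shown for row 0 in the table above.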

<h2 style='color:#007fd4'>Outlier Removal Using Standard Deviation and Mean</h2>

In [30]:
def remove_pps_outliers(df):
    # Within each locality, keep rows whose price per sqft lies within
    # one standard deviation of that locality's mean.
    kept = []
    for key, subdf in df.groupby('locality'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        kept.append(subdf[(subdf.price_per_sqft>(m-st)) & (subdf.price_per_sqft<=(m+st))])
    return pd.concat(kept,ignore_index=True)
df8 = remove_pps_outliers(df7)
df8.shape

(2238, 10)
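
The per-locality filter above can also be written without an explicit loop. The following is a sketch of an alternative, not the notebook's implementation: `remove_pps_outliers_vectorized` is a hypothetical name, and the toy data is made up. One difference to keep in mind is that this version preserves the original row order, whereas the loop-based version returns rows grouped by locality.

```python
import pandas as pd

def remove_pps_outliers_vectorized(df):
    """Sketch: keep rows within one population std of their locality mean."""
    grouped = df.groupby("locality")["price_per_sqft"]
    mean = grouped.transform("mean")
    # ddof=0 matches np.std, which the loop-based version uses.
    std = grouped.transform(lambda s: s.std(ddof=0))
    mask = (df["price_per_sqft"] > mean - std) & (df["price_per_sqft"] <= mean + std)
    return df[mask].reset_index(drop=True)

# Made-up data: locality A has one extreme listing, B is tightly clustered.
toy = pd.DataFrame({
    "locality": ["A"] * 5 + ["B"] * 3,
    "price_per_sqft": [5000, 5200, 5100, 4900, 12000, 7000, 7100, 6900],
})
result = remove_pps_outliers_vectorized(toy)
print(result)
```

In the toy data the 12000 listing in locality A is removed, but so are 6900 and 7100 in locality B: with a one-standard-deviation band, tightly clustered localities lose points that are not extreme in absolute terms.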

**Checking how 2-bedroom and 3-bedroom property prices compare within a given locality**

In [31]:
# def plot_scatter_chart(df,locality):
#     bedrooms2 = df[(df.locality==locality) & (df.bedrooms==2)]
#     bedrooms3 = df[(df.locality==locality) & (df.bedrooms==3)]
#     matplotlib.rcParams['figure.figsize'] = (15,10)
#     plt.scatter(bedrooms2.area_in_sqft,bedrooms2.price_in_crores,color='blue',label='2 Bedrooms', s=50)
#     plt.scatter(bedrooms3.area_in_sqft,bedrooms3.price_in_crores,marker='+', color='green',label='3 Bedrooms', s=50)
#     plt.xlabel("Total Square Feet Area")
#     plt.ylabel("Price (Crore Indian Rupees)")
#     plt.title(locality)
#     plt.legend()
    
# plot_scatter_chart(df8,"Kompally")


import plotly.express as px

def plot_scatter_chart(df, locality):
    bedrooms2 = df[(df.locality == locality) & (df.bedrooms == 2)]
    bedrooms3 = df[(df.locality == locality) & (df.bedrooms == 3)]
    
    plot_df = pd.concat([bedrooms2, bedrooms3])
    
    fig = px.scatter(
        plot_df,
        x="area_in_sqft",
        y="price_in_crores",
        color=plot_df["bedrooms"].astype(str) + " Bedrooms",
        title=locality,
        labels={
            "area_in_sqft": "Total Square Feet Area",
            "price_in_crores": "Price (Crore Indian Rupees)"
        },
        symbol="bedrooms", 
        size_max=10
    )
    
    fig.update_traces(marker=dict(size=10, opacity=0.8))
    fig.update_layout(
        legend_title_text="Bedroom Type",
        width=1000,
        height=550
    )
    
    fig.show()

plot_scatter_chart(df8, "Kompally")


In [32]:
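def remove_bedroom_outliers(df):
    # Within a locality, drop n-bedroom listings whose price per sqft is below
    # the mean price per sqft of (n-1)-bedroom listings, provided there are
    # more than 5 such (n-1)-bedroom listings to compare against.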
    exclude_indices = np.array([])
    for location, location_df in df.groupby('locality'):
        bedrooms_stats = {}
        for bedroom, bedroom_df in location_df.groupby('bedrooms'):
            bedrooms_stats[bedroom] = {
                'mean': np.mean(bedroom_df.price_per_sqft),
                'std': np.std(bedroom_df.price_per_sqft),
                'count': bedroom_df.shape[0]
            }
        for bedroom, bedroom_df in location_df.groupby('bedrooms'):
            stats = bedrooms_stats.get(bedroom-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bedroom_df[bedroom_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')
df9 = remove_bedroom_outliers(df8)
df9.shape

(1850, 10)

In [33]:
plot_scatter_chart(df9, "Kompally")

In [34]:
df9.price_in_crores.describe()

count    1850.000000
mean        1.564573
std         1.314433
min         0.270000
25%         0.860000
50%         1.255000
75%         1.770000
max        17.600000
Name: price_in_crores, dtype: float64

In [35]:
df9.columns

Index(['builder_name', 'city', 'locality', 'property_type', 'price_in_crores',
       'area_in_sqft', 'construction_status', 'age_of_property', 'bedrooms',
       'price_per_sqft'],
      dtype='object')

In [36]:
x = df9.drop(['builder_name', 'city', 'construction_status', 'price_in_crores', 'price_per_sqft'], axis=1)
y = df9['price_in_crores']

In [37]:
loc_dummies = pd.get_dummies(x['locality'])
locality_dummies = loc_dummies.drop('Yapral',axis='columns')
locality_dummies.head()

Unnamed: 0,AS Rao Nagar,Abids,Adibatla,Ameenpur,Ameerpet,Appa Junction Peerancheru,Attapur,Bachupally,Balanagar,Balapur,...,Shankarpalli,Shankarpally Road,Sultanpur,Tellapur,Trimalgherry,Tukkuguda,Turkayamjal,Uppal Kalan,Vanasthalipuram,Velmala
0,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [38]:
pro_type_dummies = pd.get_dummies(x['property_type'])
property_type_dummies = pro_type_dummies.drop('Independent Floor',axis='columns')
property_type_dummies.head()

Unnamed: 0,Apartment,Independent House,Villa
0,False,True,False
1,True,False,False
2,False,True,False
3,True,False,False
4,True,False,False
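
Dropping one dummy column per categorical feature, as done above with 'Yapral' and 'Independent Floor', makes that category the implicit baseline: it is represented by a row with no indicator set. This avoids perfectly collinear columns in the linear models. A minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({
    "property_type": ["Apartment", "Villa", "Independent Floor", "Apartment"]
})

dummies = pd.get_dummies(toy["property_type"])
# Drop one category so it becomes the baseline (all remaining indicators off).
encoded = dummies.drop("Independent Floor", axis="columns")

print(encoded)
# The 'Independent Floor' row has no indicator set in the remaining columns.
print(bool(encoded.iloc[2].any()))
```

A consequence for prediction is that the baseline category has no column to look up, so it must be encoded by leaving every indicator at zero.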


In [39]:
X = pd.concat([x.drop(['locality', 'property_type'], axis=1), locality_dummies, property_type_dummies], axis=1)
X

Unnamed: 0,area_in_sqft,age_of_property,bedrooms,AS Rao Nagar,Abids,Adibatla,Ameenpur,Ameerpet,Appa Junction Peerancheru,Attapur,...,Tellapur,Trimalgherry,Tukkuguda,Turkayamjal,Uppal Kalan,Vanasthalipuram,Velmala,Apartment,Independent House,Villa
0,5500,7,7,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,1100,7,2,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,1500,7,3,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
3,1400,7,3,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
4,950,7,2,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2233,3200,1,4,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True
2234,3900,1,4,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True
2235,3600,1,4,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True
2236,2780,0,3,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [41]:
from sklearn.linear_model import LinearRegression
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
lr_clf.score(X_test, y_test)

0.9414584010772888

In [42]:
from sklearn.tree import DecisionTreeRegressor
dt_clf = DecisionTreeRegressor(criterion='friedman_mse', splitter='best')
dt_clf.fit(X_train, y_train)
dt_clf.score(X_test, y_test)

0.9716283571949381

In [43]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

cross_val_score(LinearRegression(), X, y, cv=cv)

array([0.94607021, 0.9517526 , 0.94643653, 0.94875666, 0.9550611 ])
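
To compare models on these folds it helps to reduce the five scores to a mean and a spread. The sketch below simply reuses the scores printed above; nothing is refitted.

```python
import numpy as np

# R^2 scores copied from the cross_val_score output above.
scores = np.array([0.94607021, 0.9517526, 0.94643653, 0.94875666, 0.9550611])

print(f"mean R^2 = {scores.mean():.4f}, std = {scores.std():.4f}")
```

The mean is about 0.9496, and the small spread across folds suggests the linear model's score does not depend much on which 20% of the data is held out.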

In [44]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

def find_and_eval_models(X_train, y_train, X_test, y_test, cv=cv):
    algos = {
        'linear_regression': {
            'model': LinearRegression(),
            'params': {'fit_intercept': [True, False]}
        },
        'lasso': {
            'model': Lasso(max_iter=10000),
            'params': {'alpha': [0.01, 0.1, 1.0]}
        },
        'ridge': {
            'model': Ridge(),
            'params': {'alpha': [0.1, 1.0, 10.0]}
        },
        'elasticnet': {
            'model': ElasticNet(max_iter=10000),
            'params': {'alpha': [0.01, 0.1, 1.0], 'l1_ratio': [0.2, 0.5, 0.8]}
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {'criterion': ['squared_error','friedman_mse'], 'splitter': ['best','random']}
        },
        'random_forest': {
            'model': RandomForestRegressor(random_state=0),
            'params': {'n_estimators': [50,100], 'max_depth': [None, 10]}
        },
        'gradient_boosting': {
            'model': GradientBoostingRegressor(random_state=0),
            'params': {'n_estimators': [100], 'learning_rate': [0.1, 0.05], 'max_depth':[3,5]}
        },
        'knn': {
            'model': KNeighborsRegressor(),
            'params': {'n_neighbors': [3,5]}
        }
    }

    results = []
    for name, cfg in algos.items():
        gs = GridSearchCV(cfg['model'], cfg['params'], cv=cv, n_jobs=-1, return_train_score=False)
        gs.fit(X_train, y_train)
        best = gs.best_estimator_
        y_pred = best.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        results.append({
            'model': name,
            'best_cv_score': gs.best_score_,
            'test_r2': best.score(X_test, y_test),
            'test_rmse': rmse,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(results).sort_values(by='test_r2', ascending=False)

res_df = find_and_eval_models(X_train, y_train, X_test, y_test)
res_df

Unnamed: 0,model,best_cv_score,test_r2,test_rmse,best_params
6,gradient_boosting,0.967858,0.973231,0.217771,"{'learning_rate': 0.1, 'max_depth': 5, 'n_esti..."
5,random_forest,0.950801,0.972949,0.218915,"{'max_depth': None, 'n_estimators': 100}"
4,decision_tree,0.946958,0.972211,0.22188,"{'criterion': 'squared_error', 'splitter': 'be..."
2,ridge,0.941248,0.941568,0.32174,{'alpha': 0.1}
0,linear_regression,0.941526,0.939953,0.326158,{'fit_intercept': False}
3,elasticnet,0.83347,0.870291,0.479365,"{'alpha': 0.01, 'l1_ratio': 0.2}"
1,lasso,0.845853,0.866696,0.485963,{'alpha': 0.01}
7,knn,0.78115,0.794886,0.602809,{'n_neighbors': 3}


In [45]:
rf_model = RandomForestRegressor(max_depth=None, n_estimators=100)
rf_model.fit(X_train, y_train)
rf_model.score(X_test, y_test)

0.9727243825278595

In [46]:
X.columns

Index(['area_in_sqft', 'age_of_property', 'bedrooms', 'AS Rao Nagar', 'Abids',
       'Adibatla', 'Ameenpur', 'Ameerpet', 'Appa Junction Peerancheru',
       'Attapur',
       ...
       'Tellapur', 'Trimalgherry', 'Tukkuguda', 'Turkayamjal', 'Uppal Kalan',
       'Vanasthalipuram', 'Velmala', 'Apartment', 'Independent House',
       'Villa'],
      dtype='object', length=112)

In [47]:
def predict_cost(locality, property_type, area_in_sqft, age_of_property, bedrooms):
    z = np.zeros(len(X.columns))
    z[0] = area_in_sqft
    z[1] = age_of_property
    z[2] = bedrooms

    # Baseline categories dropped during encoding ('Yapral', 'Independent Floor')
    # have no column in X, so they are represented by leaving every dummy at 0.
    if locality in X.columns:
        z[X.columns.get_loc(locality)] = 1
    if property_type in X.columns:
        z[X.columns.get_loc(property_type)] = 1

    return rf_model.predict([z])[0]

In [48]:
X.value_counts()

area_in_sqft  age_of_property  bedrooms  AS Rao Nagar  Abids  Adibatla  Ameenpur  Ameerpet  Appa Junction Peerancheru  Attapur  Bachupally  Balanagar  Balapur  Bandlaguda Jagir  Banjara Hills  Beeramguda  Begumpet  Bolarum  Bongloor  Bowenpally  Bowrampet  Chandanagar  Gachibowli  Gagillapur  Gajularamaram  Gajulramaram Kukatpally  Gandi Maisamma  Gandipet  Gopanpally  Gowdavalli  Gundlapochampally  Harshaguda  Hayathnagar  Hitech City  Hyder Nagar  Indresham  Isnapur  Jeedimetla  Kadthal  Kardhanur  Karwan  Khajaguda Nanakramguda Road  Kismatpur  Kokapet  Kollur  Kompally  Kondapur  Kowkur  Krishna Reddy Pet  Kukatpally  Kushaiguda  LB Nagar  Laxmiguda  Madhapur  Mahadevpur Colony  Maheshwaram  Malkajgiri  Mallapur  Mamidipally  Manchirevula  Manikonda  Manneguda  Mansoorabad  Medchal  Miyapur  Mokila  Moti Nagar  Nacharam  Nagole  Nagulapally  Nallagandla Gachibowli  Nallakunta  Nanakramguda  Narsingi  Nawab Saheb Kunta  Neredmet  Nizampet  Osman Nagar  Padmarao Nagar  Patancheru  Pa

In [60]:
predict_cost('AS Rao Nagar','Apartment', 680, 0, 2)*10000000


X does not have valid feature names, but RandomForestRegressor was fitted with feature names



3782928.5714285728

In [50]:
predict_cost('Kompally','Apartment', 1700, 0, 3)*10000000


X does not have valid feature names, but RandomForestRegressor was fitted with feature names



10465000.00000001
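
The feature-name warning above appears because the model was fitted on a DataFrame with named columns but is given a plain list at prediction time. One way to avoid it is to build the input as a one-row DataFrame with the training column order. The sketch below is an assumption-laden illustration: `build_feature_row` is a hypothetical helper, and the column list is a shortened stand-in for the 112 columns of `X`.

```python
import numpy as np
import pandas as pd

def build_feature_row(columns, locality, property_type,
                      area_in_sqft, age_of_property, bedrooms):
    """Hypothetical helper: one-row DataFrame in the training column order."""
    row = pd.DataFrame(np.zeros((1, len(columns))), columns=columns)
    row.loc[0, ["area_in_sqft", "age_of_property", "bedrooms"]] = [
        area_in_sqft, age_of_property, bedrooms
    ]
    # Categories dropped as baselines have no column and stay all-zero.
    for name in (locality, property_type):
        if name in row.columns:
            row.loc[0, name] = 1
    return row

# Shortened, illustrative column list; the real X has 112 columns.
columns = ["area_in_sqft", "age_of_property", "bedrooms",
           "Kompally", "Miyapur", "Apartment", "Villa"]
row = build_feature_row(columns, "Kompally", "Apartment", 1700, 0, 3)
print(row)
# With the fitted model from the notebook: rf_model.predict(row)[0]
```

Passing `X.columns` as `columns` would keep the row aligned with the features the model was trained on.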

In [51]:
import pickle
with open('hyderabad_home_prices_model.pickle','wb') as f:
    pickle.dump(rf_model,f)

In [52]:
import json
columns = {
    'data_columns' : [col.lower() for col in X.columns]
}
with open("columns.json","w") as f:
    f.write(json.dumps(columns))
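
Because the column names are lower-cased before being written, any code that later reads `columns.json` has to lower-case the locality or property type before looking it up. The following sketch round-trips a shortened, illustrative column list through a temporary file; `column_index` is a hypothetical helper, not part of the notebook.

```python
import json
import os
import tempfile

# Illustrative subset of the feature names; the notebook writes all of X.columns.
feature_names = ["area_in_sqft", "age_of_property", "bedrooms", "Kompally", "Apartment"]

path = os.path.join(tempfile.mkdtemp(), "columns.json")
with open(path, "w") as f:
    json.dump({"data_columns": [c.lower() for c in feature_names]}, f)

with open(path) as f:
    data_columns = json.load(f)["data_columns"]

def column_index(name):
    """Return the position of a feature, or -1 if it is not a known column."""
    try:
        return data_columns.index(name.lower())
    except ValueError:
        return -1

print(column_index("Kompally"))  # 3
print(column_index("Yapral"))    # -1
```

Returning -1 for an unknown name lets the caller treat it as a baseline category and leave all indicators at zero.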