**Bengaluru House Prediction**

Kaggle Data URL
https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data/data

In [116]:
import numpy as np
import pandas as pd

In [117]:
data = pd.read_csv('Bengaluru_House_Data.csv')

In [118]:
data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [119]:
data.shape

(13320, 9)

In [120]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


This code iterates through each column of the data DataFrame.

For each column, it prints the frequency distribution of unique values in that column using **value_counts()**.

In [121]:
for column in data.columns:
    print(data[column].value_counts())
    print("*"*20)

area_type
Super built-up  Area    8790
Built-up  Area          2418
Plot  Area              2025
Carpet  Area              87
Name: count, dtype: int64
********************
availability
Ready To Move    10581
18-Dec             307
18-May             295
18-Apr             271
18-Aug             200
                 ...  
16-Oct               1
17-Jan               1
16-Nov               1
16-Jan               1
14-Jul               1
Name: count, Length: 81, dtype: int64
********************
location
Whitefield                         540
Sarjapur  Road                     399
Electronic City                    302
Kanakpura Road                     273
Thanisandra                        234
                                  ... 
3rd Stage Raja Rajeshwari Nagar      1
Chuchangatta Colony                  1
Electronic City Phase 1,             1
Chikbasavanapura                     1
Abshot Layout                        1
Name: count, Length: 1305, dtype: int64
********************
siz

In [122]:
# The code calculates and displays the number of missing values
data.isna().sum()

Unnamed: 0,0
area_type,0
availability,0
location,1
size,16
society,5502
total_sqft,0
bath,73
balcony,609
price,0


This line of code removes four columns ('area_type', 'availability', 'society', and 'balcony') from the data DataFrame.

With **`inplace=True`**, the changes are made directly to the original DataFrame rather than returned as a new DataFrame with the columns dropped.

In [123]:
data.drop(columns=['area_type','availability','society','balcony'],inplace=True)
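
To see the difference `inplace=True` makes, here is a minimal sketch on a toy DataFrame. The column names `a`, `b`, and `c` are made up for illustration and are not part of the housing data.

```python
import pandas as pd

# Toy frame; the column names are illustrative, not from the dataset.
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Without inplace, drop() returns a new DataFrame and leaves `toy` unchanged.
trimmed = toy.drop(columns=['b'])
cols_after_copy = list(toy.columns)

# With inplace=True, `toy` itself is modified and drop() returns None.
result = toy.drop(columns=['c'], inplace=True)
cols_after_inplace = list(toy.columns)
```

Because the in-place call returns `None`, assigning its result back to the variable would lose the DataFrame.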

In [None]:
print(data.columns)

Index(['location', 'size', 'total_sqft', 'bath', 'price'], dtype='object')


In [None]:
# Generates descriptive statistics for the columns in the DataFrame.
data.describe()

Unnamed: 0,bath,price
count,13247.0,13320.0
mean,2.69261,112.565627
std,1.341458,148.971674
min,1.0,8.0
25%,2.0,50.0
50%,2.0,72.0
75%,3.0,120.0
max,40.0,3600.0


In [None]:
data.head()

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Kothanur,2 BHK,1200,2.0,51.0


In [None]:
# Counts the occurrences of each unique value in the 'location' column of the data DataFrame.
data['location'].value_counts()

location,count
Whitefield,540
Sarjapur Road,399
Electronic City,302
Kanakpura Road,273
Thanisandra,234
...,...
3rd Stage Raja Rajeshwari Nagar,1
Chuchangatta Colony,1
"Electronic City Phase 1,",1
Chikbasavanapura,1


**data['location'].fillna('Sarjapur Road'):** This part specifically targets the 'location' column and uses the **fillna() method** to replace any missing values **(NaN)** with the string **'Sarjapur Road'.**

In [None]:
data['location'] = data['location'].fillna('Sarjapur Road')

In [None]:
# How many '2 BHK' or '3 Bedroom' entries there are.
data['size'].value_counts()

size,count
2 BHK,5199
3 BHK,4310
4 Bedroom,826
4 BHK,591
3 Bedroom,547
1 BHK,538
2 Bedroom,329
5 Bedroom,297
6 Bedroom,191
1 Bedroom,105


In [None]:
# Fill the 16 missing 'size' values with the most common value, '2 BHK'.
data['size'] = data['size'].fillna('2 BHK')

In [None]:
# This code is used to fill in any missing values (NaN) in the 'bath' column with the median value of that column.
data['bath'] = data['bath'].fillna(data['bath'].median())
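
A small sketch of the median fill on a toy Series (the values are invented for illustration). Note that `median()` skips missing values when it computes the fill value.

```python
import numpy as np
import pandas as pd

# Toy bathroom counts with one missing entry (illustrative values).
bath = pd.Series([2.0, np.nan, 3.0, 2.0, 5.0])

# NaN is skipped, so this is the median of [2, 2, 3, 5].
median_bath = bath.median()

# Replace the missing entry with that median.
filled = bath.fillna(median_bath)
```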

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    13320 non-null  object 
 1   size        13320 non-null  object 
 2   total_sqft  13320 non-null  object 
 3   bath        13320 non-null  float64
 4   price       13320 non-null  float64
dtypes: float64(2), object(3)
memory usage: 520.4+ KB


This code creates a new column called **'bhk'** in the data DataFrame by extracting the numerical part from the 'size' column.

**data['size'].str.split():** This splits the strings in the 'size' column into lists of words based on spaces. For example, '2 BHK' becomes **['2', 'BHK']** and '4 Bedroom' becomes **['4', 'Bedroom']**. Then **.str.get(0)** takes the first item of each list and **.astype(int)** converts it to an integer.

In [None]:
data['bhk'] = data['size'].str.split().str.get(0).astype(int)
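
The same chain can be tried step by step on a few sample strings. This is a sketch on a toy Series, not the full column.

```python
import pandas as pd

# A few 'size' strings of the kind seen in the dataset.
sizes = pd.Series(['2 BHK', '4 Bedroom', '3 BHK'])

# Step 1: split each string on whitespace into a list of words.
tokens = sizes.str.split()

# Step 2: take the first word of each list and convert it to an integer.
bhk = tokens.str.get(0).astype(int)
```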

In [None]:
# Display rows where the value in the 'bhk' column is greater than 20.
data[data.bhk > 20]

Unnamed: 0,location,size,total_sqft,bath,price,bhk
1718,2Electronic City Phase II,27 BHK,8000,27.0,230.0,27
4684,Munnekollal,43 Bedroom,2400,40.0,660.0,43


In [None]:
# Displays all the unique values present in the 'total_sqft' column of the data DataFrame.
data['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

The Python function **convertRange** converts values from the 'total_sqft' column into a consistent numerical format.

1.   **temp = x.split('-'):** This line splits the input x on the hyphen character. If x is a range like '1200-1400', temp will be the list ['1200', '1400']. If x is a single value like '1000', temp will be ['1000'].
2.   **if(len(temp) == 2): return (float(temp[0]) + float(temp[1]))/2:** This checks whether temp has two elements and, if so, returns the midpoint of the range.
3.   **try:** return float(x) **except:** return None: If x was not a range (the if condition was false), this block attempts to convert x directly to a floating-point number. If the conversion succeeds, the float value is returned; if it fails, None is returned.

In [None]:
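def convertRange(x):
    # Ranges such as '1133 - 1384' are replaced by their midpoint.
    temp = x.split('-')
    if len(temp) == 2:
        return (float(temp[0]) + float(temp[1])) / 2
    try:
        return float(x)
    except ValueError:
        # Entries that are not plain numbers become None (NaN in the column).
        return None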
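
To check the three branches, here is a sketch that restates the function so it runs on its own and calls it on sample inputs. The input `'34.46Sq. Meter'` is a hypothetical example of a value that is neither a number nor a range.

```python
def convertRange(x):
    # Ranges are replaced by their midpoint.
    temp = x.split('-')
    if len(temp) == 2:
        return (float(temp[0]) + float(temp[1])) / 2
    try:
        return float(x)
    except ValueError:
        return None

results = {
    'single': convertRange('1056'),             # plain number
    'range': convertRange('1133 - 1384'),       # midpoint of the range
    'unparsed': convertRange('34.46Sq. Meter'), # hypothetical unparseable value
}
```

`float()` ignores the spaces around each half of the range, which is why `'1133 - 1384'` converts without extra stripping.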

In [None]:
data['total_sqft'] = data['total_sqft'].apply(convertRange)

In [None]:
data.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3
4,Kothanur,2 BHK,1200.0,2.0,51.0,2


**Price Per Square Foot**

In [None]:
data['price_per_sqft'] = data['price'] * 100000 / data['total_sqft']
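
As a worked example of this formula, take the first row of the data. This sketch assumes the 'price' column is in lakhs (1 lakh = 100,000 rupees), which is what the factor of 100000 suggests.

```python
# First row of the dataset: price 39.07, total_sqft 1056.
price_lakh = 39.07     # assumed to be in lakhs of rupees
total_sqft = 1056.0

# Convert lakhs to rupees, then divide by the area.
price_per_sqft = price_lakh * 100000 / total_sqft
```

The result, about 3699.81 rupees per square foot, matches the first value in the 'price_per_sqft' column.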

In [None]:
data['price_per_sqft']

Unnamed: 0,price_per_sqft
0,3699.810606
1,4615.384615
2,4305.555556
3,6245.890861
4,4250.000000
...,...
13315,6689.834926
13316,11111.111111
13317,5258.545136
13318,10407.336319


In [None]:
data.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250.0


Counting the values in the 'location' column helps in understanding which locations are most common in the dataset and the overall distribution of properties across different locations.

In [None]:
data['location'].value_counts()

location,count
Whitefield,540
Sarjapur Road,399
Electronic City,302
Kanakpura Road,273
Thanisandra,234
...,...
Mango Garden Layout,1
Milk Colony,1
"Basnashankari,6th stage,",1
Near ullas theater,1


This code standardizes location names by removing extra spaces at the start and end of each name, and then counts ***how many properties are in each location.***

The **strip()** method removes leading and trailing whitespace characters like spaces, tabs, and newlines. It does not touch whitespace inside the string.

In [None]:
data['location'] = data['location'].apply(lambda x: x.strip())
location_count = data['location'].value_counts()

In [None]:
location_count_less_10 = location_count[location_count <= 10]

In [None]:
location_count_less_10

location,count
1st Block Koramangala,10
Dairy Circle,10
Nagadevanahalli,10
Sadashiva Nagar,10
Naganathapura,10
...,...
Xavier Layout,1
Ramanagara Channapatna,1
Maheswari Nagar,1
Hsr layout sector3,1


This code is used to replace the location names in the data DataFrame that have a low count with the string 'other'.

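**lambda x: 'other' if x in location_count_less_10 else x:** This is a small anonymous function that takes one argument x (which represents each location name). Because location_count_less_10 is a pandas Series, the test **x in location_count_less_10** checks whether x is one of its index labels, that is, one of the rare location names.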

Imagine you have a list of all the property locations, and you've counted how many properties are in each location. Some locations might only have one or two properties, while others have hundreds.

This code looks at each location name and checks if it's one of those locations with 10 or fewer properties. If it is, the code changes that location name to just say "other". If the location has more than 10 properties, the name stays the same.

So, it's like grouping all the less common locations together into one big group called "other". This makes it easier to work with the data because you don't have to deal with hundreds of different location names that only appear a few times.

In [None]:
data['location'] = data['location'].apply(lambda x: 'other' if x in location_count_less_10 else x)
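
Here is a sketch of the same grouping on a toy Series, using a threshold of 2 instead of 10 so the effect is visible on six rows. The `where`/`isin` variant is an alternative way to write it, not what the notebook uses.

```python
import pandas as pd

# Toy location column (a handful of names from the dataset).
locations = pd.Series(['Whitefield'] * 3 + ['Kothanur'] * 2 + ['Milk Colony'])

counts = locations.value_counts()
rare = counts[counts <= 2]  # threshold of 2 here; the notebook uses 10

# Same pattern as the notebook: `x in rare` tests the Series index labels.
grouped = locations.apply(lambda x: 'other' if x in rare else x)

# Alternative without a Python-level lambda.
alt = locations.where(~locations.isin(rare.index), 'other')
```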

In [None]:
data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,13274.0,13320.0,13320.0,13320.0,13274.0
mean,1559.626694,2.688814,112.565627,2.802778,7907.501
std,1238.405258,1.338754,148.971674,1.294496,106429.6
min,1.0,1.0,8.0,1.0,267.8298
25%,1100.0,2.0,50.0,2.0,4266.865
50%,1276.0,2.0,72.0,3.0,5434.306
75%,1680.0,3.0,120.0,3.0,7311.746
max,52272.0,40.0,3600.0,43.0,12000000.0


Calculates the **square feet per bedroom** for each entry by dividing the **total_sqft by the bhk** (bedrooms, hall, kitchen) count.

This code is looking at how much space you get for each bedroom in a house. It does this by dividing the total size of the house (in square feet) by the number of bedrooms it has.

Then, it gives you a summary of these numbers, like the average amount of space per bedroom, the smallest amount, the largest amount, and so on. This helps us see if there are any unusual cases where a house has a lot of bedrooms but very little space, or vice versa.

In [None]:
(data['total_sqft'] / data['bhk']).describe()

Unnamed: 0,0
count,13274.0
mean,575.074878
std,388.205175
min,0.25
25%,473.333333
50%,552.5
75%,625.0
max,26136.0


It keeps only the rows where the area per bedroom **(total_sqft divided by bhk)** is **greater than or equal to 300.** Rows where total_sqft is missing (NaN) fail this comparison, so they are dropped as well.


In [None]:
data = data[(data['total_sqft'] / data['bhk']) >= 300]
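
A sketch of this filter on a toy frame. The 2400 sq ft, 43-bedroom row is taken from the dataset; the other rows are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'total_sqft': [1200.0, 600.0, 2400.0, np.nan],
    'bhk': [2, 4, 43, 2],
})

# Area per bedroom: 600, 150, about 55.8, and NaN.
sqft_per_bhk = toy['total_sqft'] / toy['bhk']

# Only the first row passes; NaN >= 300 evaluates to False.
kept = toy[sqft_per_bhk >= 300]
```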

In [None]:
data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,12530.0,12530.0,12530.0,12530.0,12530.0
mean,1594.564544,2.559537,111.382401,2.650838,6303.979357
std,1261.271296,1.077938,152.077329,0.976678,4162.237981
min,300.0,1.0,8.44,1.0,267.829813
25%,1116.0,2.0,49.0,2.0,4210.526316
50%,1300.0,2.0,70.0,3.0,5294.117647
75%,1700.0,3.0,115.0,3.0,6916.666667
max,52272.0,16.0,3600.0,16.0,176470.588235


In [None]:
data.shape

(12530, 7)

In [None]:
data.price_per_sqft.describe()

Unnamed: 0,price_per_sqft
count,12530.0
mean,6303.979357
std,4162.237981
min,267.829813
25%,4210.526316
50%,5294.117647
75%,6916.666667
max,176470.588235


This code cleans up your **housing data** by removing properties that have a really high or really low price per square foot compared to other properties in the same location.

It does this by computing the **mean and standard deviation of the price per square foot** in each location and keeping only the properties whose price per square foot lies within one standard deviation of that location's mean.

***This helps to get rid of unusual listings that might be errors or don't fit the typical prices for that area.***

In [None]:
def remove_outliers_sqft(df):
    df_output = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)

        st = np.std(subdf.price_per_sqft)

        gen_df = subdf[(subdf.price_per_sqft > (m-st)) & (subdf.price_per_sqft <= (m+st))]
        df_output = pd.concat([df_output,gen_df], ignore_index=True)
    return df_output
data = remove_outliers_sqft(data)
data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,10301.0,10301.0,10301.0,10301.0,10301.0
mean,1508.440608,2.471702,91.286372,2.574896,5659.062876
std,880.694214,0.979449,86.342786,0.897649,2265.774749
min,300.0,1.0,10.0,1.0,1250.0
25%,1110.0,2.0,49.0,2.0,4244.897959
50%,1286.0,2.0,67.0,2.0,5175.600739
75%,1650.0,3.0,100.0,3.0,6428.571429
max,30400.0,16.0,2200.0,16.0,24509.803922
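
The same per-location filter can also be written without an explicit loop. The following is a sketch of an alternative using `groupby(...).transform`, not the notebook's own function; the name `remove_outliers_vectorized` and the toy data are made up. It uses `ddof=0` to match `np.std`, and unlike the loop version it keeps rows in their original order rather than sorted by location.

```python
import pandas as pd

def remove_outliers_vectorized(df):
    # Per-row copy of each location's mean and population std (ddof=0).
    grouped = df.groupby('location')['price_per_sqft']
    m = grouped.transform('mean')
    st = grouped.transform(lambda s: s.std(ddof=0))
    # Keep rows within one standard deviation of the location mean.
    mask = (df['price_per_sqft'] > m - st) & (df['price_per_sqft'] <= m + st)
    return df[mask].reset_index(drop=True)

# Toy data: location A has one very expensive listing.
toy = pd.DataFrame({
    'location': ['A'] * 5 + ['B'] * 3,
    'price_per_sqft': [4000.0, 5000.0, 5000.0, 6000.0, 20000.0,
                       3000.0, 3100.0, 2900.0],
})
cleaned = remove_outliers_vectorized(toy)
```

In location A the 20000 listing is dropped. In location B the spread is so small that both 3100 and 2900 fall just outside one standard deviation, which shows how strict a one-sigma cut can be for tightly clustered groups.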


Imagine you're looking at house prices in different neighborhoods. This function helps us find and remove houses that seem unusually cheap for their size compared to other houses with fewer bedrooms in the same neighborhood.

Here's the basic idea:
1.   It goes neighborhood by neighborhood.
2.   In each neighborhood, it looks at houses based on how many bedrooms they have (like 2 bedrooms, 3 bedrooms, etc.).
3.   For each number of bedrooms, it figures out the average price per square foot.
4.   Then, it checks if a house with, say, 3 bedrooms is cheaper per square foot than the average price per square foot for houses with 2 bedrooms in that same neighborhood. This comparison is only made when there are more than 5 two-bedroom houses to average over.
5.   If a house is cheaper per square foot than that average for houses with one fewer bedroom in the same area, the function marks it as a potential "outlier" (like a weirdly low price).
6.   Finally, it removes those marked houses from the list.

So, it's basically cleaning up the data by removing listings that seem like they might be errors or don't fit the typical pricing pattern for that area and number of bedrooms.

In [None]:
def bhk_outlier_remover(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')
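
To see the rule in action, here is a compact sketch of the same logic on a toy neighborhood. The function name `drop_cheaper_than_smaller_bhk` and the data are made up for illustration; it follows the steps above but is not the notebook's function.

```python
import pandas as pd

def drop_cheaper_than_smaller_bhk(df, min_count=5):
    exclude = []
    for _, loc_df in df.groupby('location'):
        # Mean price per sqft and listing count for each bhk value.
        stats = {bhk: (g['price_per_sqft'].mean(), len(g))
                 for bhk, g in loc_df.groupby('bhk')}
        for bhk, g in loc_df.groupby('bhk'):
            prev = stats.get(bhk - 1)
            # Compare only when the smaller-bhk group has enough listings.
            if prev and prev[1] > min_count:
                cheaper = g[g['price_per_sqft'] < prev[0]]
                exclude.extend(cheaper.index)
    return df.drop(index=exclude)

# Six 2 BHK listings at 5000 per sqft, two 3 BHK listings at 4000 and 6000.
toy = pd.DataFrame({
    'location': ['A'] * 8,
    'bhk': [2] * 6 + [3, 3],
    'price_per_sqft': [5000.0] * 6 + [4000.0, 6000.0],
})
result = drop_cheaper_than_smaller_bhk(toy)
```

The 3 BHK listing at 4000 is removed because it is below the 2 BHK mean of 5000; the one at 6000 stays.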

**This step is crucial for cleaning your dataset by removing potentially erroneous or unusual data points that could skew your analysis or model training later on.**

In [None]:
data = bhk_outlier_remover(data)

In [None]:
data.shape

(7361, 7)

In [None]:
data.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,1st Block Jayanagar,4 BHK,2850.0,4.0,428.0,4,15017.54386
1,1st Block Jayanagar,3 BHK,1630.0,3.0,194.0,3,11901.840491
2,1st Block Jayanagar,3 BHK,1875.0,2.0,235.0,3,12533.333333
3,1st Block Jayanagar,3 BHK,1200.0,2.0,130.0,3,10833.333333
4,1st Block Jayanagar,2 BHK,1235.0,2.0,148.0,2,11983.805668


In [None]:
data.drop(columns= ['size', 'price_per_sqft'], inplace=True)

In [None]:
data.head()

Unnamed: 0,location,total_sqft,bath,price,bhk
0,1st Block Jayanagar,2850.0,4.0,428.0,4
1,1st Block Jayanagar,1630.0,3.0,194.0,3
2,1st Block Jayanagar,1875.0,2.0,235.0,3
3,1st Block Jayanagar,1200.0,2.0,130.0,3
4,1st Block Jayanagar,1235.0,2.0,148.0,2


In [None]:
data.to_csv("Cleaned_data.csv")

**Supervised Learning Task**

Separating the input features (**X**) from the target you want to predict (**y**, the price) is a standard step in preparing data for **a machine learning model**, specifically for a **supervised learning** task like predicting house prices.

In [None]:
X = data.drop(columns=['price'])
y = data['price']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

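***Splitting the data into a training set and a test set allows you to train your model and then check how well it generalizes to new, unseen data.*** Here **test_size=0.2** holds out 20% of the rows for testing, and **random_state=0** makes the split reproducible.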

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
print(X_train.shape)
print(X_test.shape)

(5888, 4)
(1473, 4)
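
The two shapes follow from simple arithmetic on the 7361 cleaned rows. This sketch assumes scikit-learn rounds the test count up, which is consistent with the shapes printed above.

```python
import math

n_samples = 7361   # rows left after cleaning
test_size = 0.2

# 0.2 * 7361 = 1472.2, rounded up for the test set.
n_test = math.ceil(test_size * n_samples)

# The training set gets the remaining rows.
n_train = n_samples - n_test
```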


### Applying Linear Regression

**Linear Regression** is a way to find the relationship between two things: one that you want to predict (like house price) and one or more things that might affect it (like size or number of rooms). It draws a straight line through the data points to best show how changes in one thing affect the other.

**In simple terms, it helps you predict a value based on other known values.**

This code creates a transformer that will apply one-hot encoding to the 'location' column and leave all other columns untouched. This is a crucial step before training many machine learning models, as they typically require all input features to be in a numerical format.

In [None]:
column_trans = make_column_transformer((OneHotEncoder(sparse_output=False), ['location']), remainder='passthrough')

In [None]:
lr = LinearRegression()

**StandardScaler** is a preprocessing class from the scikit-learn library.

Think of **StandardScaler** as a tool that standardizes your numerical data. It does this by transforming each feature (each numerical column) so that it has:

1.   A mean of 0
2.   A standard deviation of 1

**scaler = StandardScaler()** creates an instance of this tool, ready to be used to transform your numerical data later in the machine learning pipeline.
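
A short sketch that checks these two properties on toy numbers and compares the result with the formula (value minus column mean) divided by column standard deviation. The array values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features: total_sqft and bhk.
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 4.0]])

scaled = StandardScaler().fit_transform(X)

# The same transformation written out by hand.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
```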

Think of a pipeline as a sequence of steps that your data will go through, from raw input to model prediction. This is a very powerful concept in machine learning.

When you train this pipe object (using pipe.fit(...)), it will first apply the column_trans to your data, then apply the scaler to the output of the transformer, and finally train the lr model on the scaled data. When you make predictions (using pipe.predict(...)), it will apply the same preprocessing steps in the same order before feeding the data to the trained model.

In [None]:
scaler = StandardScaler()
pipe = make_pipeline(column_trans, scaler, lr)

In [None]:
# This line of code is where the actual training of your machine learning model happens.
pipe.fit(X_train, y_train)

In [None]:
# This line of code is where your trained machine learning model makes predictions on the unseen testing data.
y_pred_lr = pipe.predict(X_test)

This line of code is used to evaluate the performance of your trained Linear Regression model by calculating the **R-squared score.**

The R-squared score tells you how well your model's predictions match the actual prices.

*   An R-squared score of 1 means your model perfectly predicts the house prices.
*   An R-squared score of 0 means your model is no better than simply predicting the average price of all houses. A negative score means it is worse than that.
*   An R-squared score between 0 and 1 indicates how much of the variation in prices your model is able to explain. A higher score is generally better.

By calculating **r2_score(y_test, y_pred_lr)**, you are comparing the actual prices in your test set (y_test) with the prices predicted by your model (y_pred_lr) to see how close they are and get a single number that summarizes the model's performance on unseen data.

In [None]:
r2_score(y_test, y_pred_lr)

0.8233783132019383
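
The score can be reproduced by hand as one minus the ratio of the squared prediction errors to the squared deviations from the mean. Here is a sketch on four made-up prices and predictions.

```python
import numpy as np

# Toy actual prices and predictions (illustrative values).
y_true = np.array([50.0, 70.0, 120.0, 60.0])
y_pred = np.array([55.0, 65.0, 110.0, 70.0])

# Squared errors of the model: 25 + 25 + 100 + 100 = 250.
ss_res = np.sum((y_true - y_pred) ** 2)

# Squared deviations from the mean price (75): 625 + 25 + 2025 + 225 = 2900.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot

# Always predicting the mean gives a score of exactly 0.
r2_mean_only = 1 - np.sum((y_true - y_true.mean()) ** 2) / ss_tot
```

Here the score is about 0.914, meaning the toy predictions explain roughly 91% of the variation in the toy prices.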

### Applying Lasso

In [None]:
lasso = Lasso()

In [None]:
pipe = make_pipeline(column_trans, scaler, lasso)

In [None]:
pipe.fit(X_train, y_train)

In [None]:
y_pred_lasso = pipe.predict(X_test)
r2_score(y_test, y_pred_lasso)

0.8128285650772719

### Applying Ridge

In [None]:
ridge = Ridge()

In [None]:
pipe = make_pipeline(column_trans, scaler, ridge)
pipe.fit(X_train, y_train)

In [None]:
y_pred_ridge = pipe.predict(X_test)
r2_score(y_test, y_pred_ridge)

0.8234146633312639

In [None]:
print('No Regularization: ', r2_score(y_test, y_pred_lr))
print('Lasso: ', r2_score(y_test, y_pred_lasso))
print('Ridge: ', r2_score(y_test, y_pred_ridge))

No Regularization:  0.8233783132019383
Lasso:  0.8128285650772719
Ridge:  0.8234146633312639


In [None]:
import pickle

In [None]:
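# `pipe` currently holds the Ridge pipeline; the with-block closes the file after writing.
with open('RidgeModel.pkl', 'wb') as f:
    pickle.dump(pipe, f)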
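
A saved pipeline can be loaded back and used for predictions. The sketch below shows the round trip on a small stand-in pipeline with invented data, using `pickle.dumps` and `pickle.loads` in memory so it does not depend on the file above.

```python
import pickle

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy features (total_sqft, bhk) and prices; illustrative values only.
X = np.array([[1000.0, 2.0], [1500.0, 3.0], [2000.0, 4.0], [1200.0, 2.0]])
y = np.array([50.0, 75.0, 100.0, 60.0])

toy_pipe = make_pipeline(StandardScaler(), Ridge()).fit(X, y)

# Serialize to bytes and restore, as pickle.dump / pickle.load do with a file.
restored = pickle.loads(pickle.dumps(toy_pipe))

# The restored pipeline gives the same predictions as the original.
same = np.allclose(toy_pipe.predict(X), restored.predict(X))
```

Pickled models are generally best loaded with the same scikit-learn version that saved them, and pickle files should only be loaded from sources you trust.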