
In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


Loading the data set

In [45]:
filepath = '/content/Bengaluru_House_Data.csv'
House_Data = pd.read_csv(filepath)
House_Data.head()


Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [46]:
print("Original shape:", House_Data.shape)

Original shape: (13320, 9)


**1. Identify and treat missing values in the 'society' column**


**Chosen imputation strategy:** Fill all missing values with a new category, "Unknown".


In [47]:
# Treat blank or whitespace-only strings as missing values.
House_Data['society'] = House_Data['society'].replace(r'^\s*$', np.nan, regex=True)
# Assign the result back instead of calling fillna(..., inplace=True) on the
# selected column: pandas warns that the chained inplace form stops working in 3.0.
House_Data['society'] = House_Data['society'].fillna('Unknown')


In [48]:
print("\nAfter society imputation - missing count:", House_Data['society'].eq('Unknown').sum())


After society imputation - missing count: 5502


**Justification**

Society is a categorical feature with hundreds of distinct names in the dataset.
A missing value most likely means the property is not part of a named society, which is common in real estate data.

Adding "Unknown" as a new category is standard practice for categorical features with structural missingness. It keeps all rows intact while handling the missing data meaningfully.
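The imputation step can be checked on a small example before trusting the count on the full dataset. The sketch below uses a toy DataFrame with made-up values (it is not the Bengaluru data) and the same two steps as the cell above; the names `toy`, `missing_before`, `missing_after` and `unknown_count` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'society' column; these five values are illustrative only.
toy = pd.DataFrame({"society": ["Coomee", None, "   ", "Theanmp", ""]})

# Count entries that are NaN/None or contain only whitespace, before imputing.
is_blank = toy["society"].str.strip().eq("")
missing_before = int((toy["society"].isna() | is_blank).sum())

# Same two steps as the notebook cell: blanks -> NaN, then NaN -> "Unknown".
toy["society"] = toy["society"].replace(r"^\s*$", np.nan, regex=True).fillna("Unknown")

missing_after = int(toy["society"].isna().sum())
unknown_count = int(toy["society"].eq("Unknown").sum())

print(missing_before, missing_after, unknown_count)
```

If the number of "Unknown" entries after imputation equals the number of missing or blank entries before it, no existing society name was overwritten.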

**2. Check for outliers in 'price' using IQR**




**Method used: Interquartile Range (IQR) method**

In [49]:
Q1 = House_Data['price'].quantile(0.25)
Q3 = House_Data['price'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
lower_bound = Q1 - 1.5 * IQR  # (will be negative, so effectively 0 for price)

print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")


Q1: 50.0, Q3: 120.0, IQR: 70.0


In [50]:
print(f"Outlier threshold (upper): {upper_bound}")



Outlier threshold (upper): 225.0


In [51]:
outliers = House_Data[House_Data['price'] > upper_bound]
print(f"Number of outliers: {len(outliers)}")


Number of outliers: 1276


**Justification:**

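IQR is a robust, non-parametric method that doesn't assume normally distributed data.

It is widely used for outlier detection in real estate datasets. Because the quartiles barely move when a few extreme prices are present, the thresholds themselves are resistant to extreme values.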
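As a rough illustration of the rule, the bounds can be wrapped in a small helper. This is a sketch, not part of the original notebook: the function name `iqr_bounds`, its parameter `k`, and the toy price series are assumptions made for the example. The last lines plug in the quartiles reported above (Q1 = 50, Q3 = 120) to show where the 225 threshold comes from.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Toy prices (illustrative only): one value is far above the rest.
toy_prices = pd.Series([10, 20, 30, 40, 1000])
toy_lower, toy_upper = iqr_bounds(toy_prices)
toy_outliers = toy_prices[(toy_prices < toy_lower) | (toy_prices > toy_upper)]

# Worked arithmetic with the quartiles reported in the notebook output.
q1, q3 = 50.0, 120.0
iqr = q3 - q1               # 70.0
lower = q1 - 1.5 * iqr      # 50 - 105
upper = q3 + 1.5 * iqr      # 120 + 105

print(toy_lower, toy_upper, toy_outliers.tolist())
print(lower, upper)
```

The lower bound for the real data comes out negative, so only the upper bound can flag any prices.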

**3. Based on your preprocessing, which row(s) would you consider dropping and why?**

I recommend dropping no rows.

Justification and reasoning:

Society column: Missing values (5,502 of the 13,320 rows, about 41%, by the count above) have been imputed with "Unknown". This is a valid categorical imputation strategy, so there is no need to drop rows; every record is preserved.

In [52]:
print(outliers[['location', 'size', 'total_sqft', 'price']])

                    location       size total_sqft  price
7               Rajaji Nagar      4 BHK       3300  600.0
9               Gandhi Bazar  6 Bedroom       1020  370.0
11                Whitefield  4 Bedroom       2785  295.0
18     Ramakrishnappa Layout      3 BHK       2770  290.0
22               Thanisandra  4 Bedroom       2800  380.0
...                      ...        ...        ...    ...
13306  Rajarajeshwari Nagara  4 Bedroom       1200  325.0
13311       Ramamurthy Nagar  7 Bedroom       1500  250.0
13315             Whitefield  5 Bedroom       3453  231.0
13316          Richards Town      4 BHK       3600  400.0
13318        Padmanabhanagar      4 BHK       4689  488.0

[1276 rows x 4 columns]


In [53]:
print(f"Price outliers detected: {len(outliers)} rows (prices > {upper_bound:.2f} lakhs)")
print("Final cleaned shape:", House_Data.shape)

Price outliers detected: 1276 rows (prices > 225.00 lakhs)
Final cleaned shape: (13320, 9)
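Since no rows are dropped, one possible way to keep the outlier information available for later steps is to record it in a boolean column rather than filtering. The sketch below runs on a toy DataFrame with a handful of prices; the column name `is_price_outlier` is hypothetical and does not appear in the original notebook, and the threshold is the upper bound reported above.

```python
import pandas as pd

# Toy prices in lakhs (illustrative subset, not the full dataset).
toy = pd.DataFrame({"price": [39.07, 120.0, 62.0, 600.0, 231.0]})

upper_bound = 225.0  # upper IQR threshold reported in the notebook output

# Flag outliers instead of dropping them, so the row count is unchanged.
toy["is_price_outlier"] = toy["price"] > upper_bound

n_flagged = int(toy["is_price_outlier"].sum())
share_flagged = float(toy["is_price_outlier"].mean())

print(toy.shape, n_flagged, share_flagged)
```

A flag like this leaves the decision about how to treat high-priced properties to whatever analysis or model comes next.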
