# House Price EDA Analysis

# ✅ **Exploratory Data Analysis (EDA)**

---

## **1️⃣ Understand the Problem**

Before touching the dataset:

* What is the business problem?
* What is the target variable?
* What questions do we want answered?
* What decisions depend on this data?

Example: *Predict customer churn → target variable = churn.*

---

## **2️⃣ Load the Data**

Typical Python steps:

```python
import pandas as pd
df = pd.read_csv("data.csv")
```

---

## **3️⃣ Basic Data Exploration**

Check shape, size, and first rows:

```python
df.shape
df.info()
df.head()
```

Key tasks:

* Understand columns
* Data types
* Memory usage

---

## **4️⃣ Descriptive Statistics**

```python
df.describe(include='all')
```

Look for:

* Mean, median, std
* Min / max
* Outliers
* Missing values
* Category distribution

---

## **5️⃣ Check Missing Values**

```python
df.isnull().sum()
df.isna().mean() * 100  # percentage
```

Identify:

* Which columns have missing?
* How much data is missing?

---

## **6️⃣ Data Type Correction**

Fix incorrect types:

* Convert object to datetime
* Convert floats to integers
* Convert categorical strings to category type

Example:

```python
df['date'] = pd.to_datetime(df['date'])
df['gender'] = df['gender'].astype('category')
```

---

## **7️⃣ Univariate Analysis (Single Column)**

### 🔹 Numerical Columns

* Histograms
* Boxplots

```python
import seaborn as sns

df['age'].hist()
sns.boxplot(x=df['salary'])  # pass the column as a keyword argument
```

Find:

* Distribution (normal, skewed)
* Outliers
* Central tendency

### 🔹 Categorical Columns

* Countplots
* Value counts

```python
df['gender'].value_counts()
sns.countplot(x='city', data=df)
```

---

## **8️⃣ Bivariate Analysis (Two Columns)**

**Numerical vs Numerical**

* Scatter plot
* Correlation

```python
df.corr(numeric_only=True)  # restrict to numeric columns; pandas 2.x raises on text columns otherwise
sns.scatterplot(x='age', y='salary', data=df)
```

**Numerical vs Categorical**

* Boxplot
* Violin plot

**Categorical vs Categorical**

* Crosstab

```python
pd.crosstab(df['gender'], df['churn'])
```

---

## **9️⃣ Multivariate Analysis**

* Pairplots (`sns.pairplot(df)`)
* Heatmaps (`sns.heatmap(corr)`)

Goal:

* Identify relationships between multiple variables
* Multi-column dependencies

---

## **🔟 Outlier Detection & Treatment**

Techniques:

* Boxplot (IQR method)
* Z-score method

Example (IQR):

```python
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
```
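The quartiles above are usually turned into lower and upper fences that flag outliers. A minimal sketch, assuming the conventional 1.5 × IQR multiplier and a made-up `salary` column (neither comes from the housing dataset):

```python
import pandas as pd

# Toy data; the salary values are made up for illustration.
df = pd.DataFrame({"salary": [30, 32, 35, 36, 38, 40, 41, 45, 300]})

q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile (an assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (df["salary"] < lower) | (df["salary"] > upper)
outliers = df[is_outlier]
df_no_outliers = df[~is_outlier]
print(outliers)
```

Whether to drop, cap, or keep the flagged rows depends on the problem; the sketch only separates them.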



## **1️⃣5️⃣ Final EDA Summary / Insights**

Write clear insights:

* Who are best customers?
* What features impact churn?
* Which products sell most?
* What patterns/outliers exist?

---




# 🎯 **Final EDA Flow (SUPER SHORT VERSION for Notes)**

1. Understand the problem
2. Load data
3. Basic exploration
4. Descriptive statistics
5. Missing values
6. Data type correction
7. Univariate analysis
8. Bivariate analysis
9. Multivariate analysis
10. Outlier detection
11. Feature engineering
12. Feature encoding
13. Feature scaling
14. Imbalance handling
15. Insights summary

## Problem statement

1. Business

Real estate business

2. Business problem

What are the things that a potential home buyer considers before purchasing a house? The location, the size of the property, vicinity to offices, schools, parks, restaurants, hospitals or the stereotypical white picket fence? What about the most important factor, the price?

Now with the lingering impact of demonetization, the enforcement of the Real Estate (Regulation and Development) Act (RERA), and the lack of trust in property developers in the city, housing units sold across India in 2017 dropped by 7 percent. In fact, the property prices in Bengaluru fell by almost 5 percent in the second half of 2017, said a study published by property consultancy Knight Frank.
For example, for a potential homeowner, over 9,000 apartment projects and flats for sale are available in the range of ₹42-52 lakh, followed by over 7,100 apartments that are in the ₹52-62 lakh budget segment, says a report by property website Makaan. According to the study, there are over 5,000 projects in the ₹15-25 lakh budget segment followed by those in the ₹34-43 lakh budget category.

Buying a home, especially in a city like Bengaluru, is a tricky choice. While the major factors are usually the same for all metros, there are others to be considered for the Silicon Valley of India. With its millennial crowd, vibrant culture, great climate and a slew of job opportunities, it is difficult to ascertain the price of a house in Bengaluru.

### 2. Data load
1. Data collection
- https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data 
2. Company data

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv("Bengaluru_House_Data.csv")

#### 3. Data Overview

In [2]:
# pd.read_csv(r'https://media.githubusercontent.com/media/shahil04/ds_materials/refs/heads/main/8.0_Machine%20Learning/ml_class/ml_projects/1.house_price_predictions/Bengaluru_House_Data.csv')

In [3]:
data.head()

|   | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|-----------|--------------|----------|------|---------|------------|------|---------|-------|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.0 |


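#### 1. About Data

This section describes the Bengaluru House Data dataset.

- **Shape:** The raw dataset contains 13,320 rows and 9 columns (see the `data.shape` output below); these counts change as cleaning proceeds.
- **Columns:**  
    - `area_type`: Type of area measurement (e.g., Super built-up Area, Plot Area)  
    - `availability`: Possession status, either "Ready To Move" or an expected date (e.g., "19-Dec")  
    - `location`: Area or locality of the property  
    - `size`: Number of bedrooms (e.g., "2 BHK", "4 Bedroom")  
    - `society`: Name of the housing society or project  
    - `total_sqft`: Total area in square feet  
    - `bath`: Number of bathrooms  
    - `balcony`: Number of balconies  
    - `price`: Price of the property (in lakhs)  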

We will also check for missing values, data types, and unique values in key columns to guide further cleaning and preprocessing steps. This foundational understanding helps in identifying potential issues such as outliers, inconsistent data, and the need for encoding categorical variables.

In [4]:
data.shape

(13320, 9)

As shown above, the dataset has 13,320 rows and 9 columns.

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


### Data Overview and Missing Values

The dataset initially contains 9 columns: 6 of object type and 3 of float type. The columns with missing values are:

- **location**: 1 missing value  
- **size**: 16 missing values  
- **society**: 5,502 missing values (about 41% of rows)  
- **bath**: 73 missing values  
- **balcony**: 609 missing values  

To handle these missing values, we can either replace them with appropriate statistics (mean, median, or mode) or drop columns that are not important for our analysis.
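To make the "impute or drop" decision concrete, the share of missing rows per column can be computed. A small sketch that uses only the non-null counts reported by `data.info()` above; the names `non_null` and `summary` are illustrative:

```python
import pandas as pd

# Non-null counts copied from the data.info() output above.
total_rows = 13320
non_null = pd.Series({
    "location": 13319,
    "size": 13304,
    "society": 7818,
    "bath": 13247,
    "balcony": 12711,
})

missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(2)

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary.sort_values("missing_pct", ascending=False))
```

On the live DataFrame, `data.isna().mean() * 100` gives the same percentages directly. With roughly 41% of its values missing, `society` is a natural candidate for dropping, while the other columns are missing less than 5% and can reasonably be imputed.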

#### Column-wise Data Types and Non-Null Counts

| Column         | Non-Null Count | Data Type |
|----------------|---------------|-----------|
| area_type      | 13,320        | object    |
| availability   | 13,320        | object    |
| location       | 13,319        | object    |
| size           | 13,304        | object    |
| society        | 7,818         | object    |
| total_sqft     | 13,320        | object    |
| bath           | 13,247        | float64   |
| balcony        | 12,711        | float64   |
| price          | 13,320        | float64   |

We will proceed by cleaning the data, handling missing values, and dropping columns that are not relevant for further analysis.

### 3. Check for null values and handle them

In [6]:
# null value check
data.isna().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

In [7]:
# Locations

# find the most frequently occurring location name to fill the missing value
data["location"].value_counts().index[0]

data['location'].mode()[0]

'Whitefield'


There was only one missing value in the `location` column, which was filled with `'Whitefield'`, the most frequent location. This maintains data consistency.

In [8]:
# fill the null value with the mode using fillna
data['location'] = data['location'].fillna(data['location'].mode()[0])

In [9]:
data.isnull().sum()

area_type          0
availability       0
location           0
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

In [10]:
data['size'].value_counts()

size
2 BHK         5199
3 BHK         4310
4 Bedroom      826
4 BHK          591
3 Bedroom      547
1 BHK          538
2 Bedroom      329
5 Bedroom      297
6 Bedroom      191
1 Bedroom      105
8 Bedroom       84
7 Bedroom       83
5 BHK           59
9 Bedroom       46
6 BHK           30
7 BHK           17
1 RK            13
10 Bedroom      12
9 BHK            8
8 BHK            5
11 BHK           2
10 BHK           2
11 Bedroom       2
27 BHK           1
19 BHK           1
43 Bedroom       1
16 BHK           1
14 BHK           1
12 Bedroom       1
13 BHK           1
18 Bedroom       1
Name: count, dtype: int64

In [11]:
data['size'].mode()[0]

'2 BHK'

In [12]:

data['size'] = data['size'].fillna(data['size'].mode()[0])

The `size` column has only 16 missing values, which are filled with the most frequent value, `2 BHK`, to maintain consistency.

In [13]:
data['bath'].median() # bath has 73 null values, so we will replace them with the median value.

2.0

In [14]:
data['bath'] = data['bath'].fillna(data['bath'].median())

Filled missing values in the `bath` column with the median number of bathrooms to maintain consistency and minimize the impact of outliers.

In [15]:
data['bath'].isna().sum()

np.int64(0)

In [16]:
data['balcony'] = data['balcony'].fillna(data['balcony'].median())

#### Drop the columns

In [17]:
data.drop('society',axis=1,inplace=True)

In [18]:
# drop null values
# data.dropna(axis=1)

In [19]:
data.isna().sum()

area_type       0
availability    0
location        0
size            0
total_sqft      0
bath            0
balcony         0
price           0
dtype: int64

### 4. Check for duplicate rows and remove them


In [20]:
data.duplicated().sum()

np.int64(569)

In [21]:
data[data.duplicated()]

|   | area_type | availability | location | size | total_sqft | bath | balcony | price |
|---|-----------|--------------|----------|------|------------|------|---------|-------|
| 971 | Super built-up Area | Ready To Move | Haralur Road | 3 BHK | 1464 | 3.0 | 2.0 | 56.0 |
| 1115 | Super built-up Area | Ready To Move | Haralur Road | 2 BHK | 1027 | 2.0 | 2.0 | 44.0 |
| 1143 | Super built-up Area | Ready To Move | Vittasandra | 2 BHK | 1246 | 2.0 | 1.0 | 64.5 |
| 1290 | Super built-up Area | Ready To Move | Haralur Road | 2 BHK | 1194 | 2.0 | 2.0 | 47.0 |
| 1394 | Super built-up Area | Ready To Move | Haralur Road | 2 BHK | 1027 | 2.0 | 2.0 | 44.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13285 | Super built-up Area | Ready To Move | VHBCS Layout | 2 BHK | 1353 | 2.0 | 2.0 | 110.0 |
| 13299 | Super built-up Area | 18-Dec | Whitefield | 4 BHK | 2830 - 2882 | 5.0 | 0.0 | 154.5 |
| 13311 | Plot Area | Ready To Move | Ramamurthy Nagar | 7 Bedroom | 1500 | 9.0 | 2.0 | 250.0 |
| 13313 | Super built-up Area | Ready To Move | Uttarahalli | 3 BHK | 1345 | 2.0 | 1.0 | 57.0 |


In [22]:
# drop the duplicate rows
data.drop_duplicates(inplace=True)

#### Check the distinct values in each column

In [23]:
data['area_type'].unique()  # show the unique value
data['area_type'].nunique() # count the unique values

# both at once: each unique value with its count
data['area_type'].value_counts()

area_type
Super built-up  Area    8279
Built-up  Area          2396
Plot  Area              1989
Carpet  Area              87
Name: count, dtype: int64

#### 5. Check the data types and change them

In [24]:
data.head()

|   | area_type | availability | location | size | total_sqft | bath | balcony | price |
|---|-----------|--------------|----------|------|------------|------|---------|-------|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | 1200 | 2.0 | 1.0 | 51.0 |


In [25]:
data['availability'].value_counts()

availability
Ready To Move    10139
18-May             290
18-Dec             283
18-Apr             269
18-Aug             187
                 ...  
16-Oct               1
17-Jan               1
16-Nov               1
16-Jan               1
14-Jul               1
Name: count, Length: 81, dtype: int64

In [26]:
# check the data types
data.dtypes

area_type        object
availability     object
location         object
size             object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

In [27]:
data['size'].str.split()[0]

data['size'].str.split()[0][0]


'2'

In [28]:
data['bhk'] = data['size'].str.split().str.get(0).astype(int)

In [29]:
data.dtypes

area_type        object
availability     object
location         object
size             object
total_sqft       object
bath            float64
balcony         float64
price           float64
bhk               int64
dtype: object

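In [36]:
# errors='ignore' keeps this cell safe to re-run once `size` has already been dropped;
# without it, a second run raises KeyError: "['size'] not found in axis"
data = data.drop(columns='size', errors='ignore')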

### Extracting Number of Bedrooms (`bhk`)

The `bhk` column is generated by extracting the numeric value from the `size` column (e.g., "2 BHK" becomes 2). This transformation standardizes the number of bedrooms as an integer feature, making it easier to analyze and model.

Converting categorical or object-type columns into numeric values (such as integers or floats) is a crucial preprocessing step. It enables more effective statistical analysis and machine learning, as most algorithms require numerical input.
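As an alternative to splitting on whitespace, the leading digits can be captured with a regular expression. A minimal sketch on a few `size` values from the value counts shown earlier; the variable names are illustrative, and this is not the approach the notebook itself uses:

```python
import pandas as pd

# Sample values taken from the `size` value counts shown earlier.
size = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "10 BHK"])

# Capture the leading run of digits; entries without digits would become NaN.
digits = size.str.extract(r"^(\d+)", expand=False)

# Convert to numbers, then to a nullable integer dtype so that missing
# values (if any) survive the cast instead of raising.
bhk = pd.to_numeric(digits).astype("Int64")
print(bhk.tolist())
```

The nullable `Int64` dtype is a design choice for robustness; a plain `astype(int)` is enough once missing values in `size` have been filled, as they are in this notebook.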

In [37]:
data.dtypes

area_type        object
availability     object
location         object
total_sqft       object
bath            float64
balcony         float64
price           float64
bhk               int64
dtype: object

In [38]:
data["total_sqft"].value_counts()[:]

total_sqft
1200           803
1100           209
1500           202
2400           196
600            178
              ... 
2505             1
567              1
1400 - 1421      1
4350             1
1443             1
Name: count, Length: 2117, dtype: int64

There are no null values left in the dataset.

Next, check for outliers.


In [None]:
data[data.bhk>20]

Only two rows have a `bhk` value greater than 20, so we treat them as outliers.

Now let's check for `total_sqft` column.

In [None]:
data['total_sqft'].unique()

To handle `total_sqft` values given as ranges (e.g., "1200-1500"), we take their mean. For other values, we convert them directly to float. This ensures all entries in `total_sqft` are numeric and consistent for analysis.

In [None]:
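def convertRange(x):
    # Ranges such as "1400 - 1421" become their midpoint; plain numbers are
    # cast to float; anything that cannot be parsed becomes None (NaN).
    try:
        temp = x.split('-')
        if len(temp) == 2:
            return (float(temp[0]) + float(temp[1])) / 2
        return float(x)
    except ValueError:
        return None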

In [None]:
data['total_sqft'] = data['total_sqft'].apply(convertRange)
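To see what the conversion does to individual entries, here is a self-contained sketch. The helper name `sqft_to_float` is hypothetical and mirrors the logic described above; the two ranges appear in the outputs earlier in this notebook, while `"34.46Sq. Meter"` is an assumed example of an entry that cannot be parsed:

```python
import pandas as pd

def sqft_to_float(value):
    """Return the area as a float; ranges become their midpoint, bad values None."""
    try:
        parts = str(value).split("-")
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(value)
    except ValueError:
        return None

# The last entry is an assumed example of unparseable text.
sample = pd.Series(["1056", "1400 - 1421", "2830 - 2882", "34.46Sq. Meter"])
converted = sample.apply(sqft_to_float)
print(converted)
```

Entries that cannot be parsed end up missing, so it is worth re-running `isna().sum()` on `total_sqft` after the conversion.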

In [None]:
data['total_sqft'].dtype

In [None]:
data.head()

### Price per square foot

In [None]:
data[['price', 'total_sqft']].corr()

### Creating the `price_per_sqft` Feature

A new column, `price_per_sqft`, is added to represent the price per square foot for each property. This feature enables more effective comparison of property values across different locations and sizes.

In [None]:
data['price_per_sqft'] = data['price']*100000 / data['total_sqft']
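As a quick check of the units: `price` is in lakhs (1 lakh = 100,000 rupees), so multiplying by 100,000 gives rupees before dividing by the area. A small worked sketch on the first two rows of `data.head()`; the constant name `LAKH` is illustrative:

```python
import pandas as pd

# First two rows of data.head(): price in lakhs, area in square feet.
sample = pd.DataFrame({"price": [39.07, 120.0], "total_sqft": [1056.0, 2600.0]})

LAKH = 100_000  # 1 lakh = 100,000 rupees
sample["price_per_sqft"] = sample["price"] * LAKH / sample["total_sqft"]
print(sample.round(2))
```

The first row works out to about ₹3,699.81 per square foot and the second to about ₹4,615.38.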

In [None]:
data['price_per_sqft']

In [None]:
data.describe()

In [None]:
data['location'].value_counts()

Next, preprocess the `location` column: clean up the names and group rarely occurring locations.

In [None]:
# Remove leading and trailing spaces from location names
data['location'] = data['location'].apply(lambda x : x.strip())
location_count = data['location'].value_counts()


In [None]:
location_count  

In [None]:
location_count_less_10 = location_count[location_count <= 10]
location_count_less_10

Locations with 10 or fewer occurrences (totaling 1,053 unique locations) are replaced with `'other'`. This reduces the number of unique categories in the `location` column, simplifying encoding and improving model performance.

In [None]:

data['location'] = data['location'].apply(lambda x: 'other' if x in location_count_less_10 else x)
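The same grouping can be written without a row-wise `apply`. A minimal sketch on toy data, assuming a threshold of 2 purely for illustration (the notebook uses 10); the variable names are hypothetical:

```python
import pandas as pd

# Toy data; the threshold of 2 is only for illustration (the notebook uses 10).
location = pd.Series(["Whitefield"] * 3 + ["Kothanur"] * 2 + ["Uttarahalli"])
threshold = 2

counts = location.value_counts()
rare = counts[counts <= threshold].index

# Keep frequent locations as they are; replace rare ones with a single bucket.
grouped = location.where(~location.isin(rare), "other")
print(grouped.value_counts())
```

`Series.where` keeps values where the condition is true and substitutes `"other"` elsewhere, which tends to be faster than a Python-level lambda on large columns.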

In [None]:
data['location'].value_counts()

In [None]:
data.describe()

### Outlier Detection in `total_sqft`

A review of the `total_sqft` column reveals some unrealistic values, such as properties with as little as 1 square foot of area. Such entries are clear outliers or data entry errors, as it is not feasible for any property, especially one with multiple bedrooms, to have such a small area.

To improve data quality and ensure reliable analysis, we remove properties where the average area per BHK (i.e., `total_sqft` divided by `bhk`) is less than 300 sqft. This threshold helps filter out likely outliers and data inconsistencies, resulting in a more accurate and trustworthy dataset for further exploration and modeling.

In [None]:
data = data[((data['total_sqft']/data['bhk']) >= 300)]
data.describe()

In [None]:
data.shape

In [None]:
def remove_outliers_sqft(df):
    # For each location, keep only the rows whose price_per_sqft lies within
    # one standard deviation of that location's mean.
    kept = []
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf['price_per_sqft'])
        st = np.std(subdf['price_per_sqft'])
        gen_df = subdf[(subdf['price_per_sqft'] > (m - st)) & (subdf['price_per_sqft'] <= (m + st))]
        kept.append(gen_df)
    # Concatenate once at the end instead of growing a DataFrame in the loop
    return pd.concat(kept, ignore_index=True)

data = remove_outliers_sqft(data)
data.describe()

In [None]:
data.shape

### Understanding `groupby` and Outlier Removal Logic

- **`groupby('location')`** splits the DataFrame into groups based on unique values in the `location` column.
    - **`key`**: Stores the name of the current location (a string, e.g., `'Whitefield'`).
    - **`subdf`**: Contains a DataFrame with all rows for that location.

- For each location, the function keeps only those rows where `price_per_sqft` is within one standard deviation of the mean for that location. This filters out unusually high or low property prices, keeping only the most typical values for each area.
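The same one-standard-deviation rule can be expressed without an explicit loop by using `groupby(...).transform`. A sketch on made-up numbers, assuming the population standard deviation (`ddof=0`), which is what `np.std` computes by default:

```python
import pandas as pd

# Toy data with made-up prices per square foot.
df = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B", "B"],
    "price_per_sqft": [100.0, 110.0, 120.0, 500.0, 50.0, 52.0, 54.0],
})

grouped = df.groupby("location")["price_per_sqft"]
mean = grouped.transform("mean")
# ddof=0 matches np.std's default (population standard deviation).
std = grouped.transform(lambda s: s.std(ddof=0))

pps = df["price_per_sqft"]
within_one_std = (pps > mean - std) & (pps <= mean + std)
filtered = df[within_one_std].reset_index(drop=True)
print(filtered)
```

In location A only the 500 entry is dropped, but in location B the rule keeps just the middle value, which shows how aggressive a one-standard-deviation band is for small or tightly clustered groups.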


In [None]:


def bhk_outlier_remove(df):
    exclude_indices = np.array([])

    for location, location_df in df.groupby('location'):
        bhk_stats = {}

        # Step 1: Calculate mean and std for each BHK in a location
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df['price_per_sqft']),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
  
        # Step 2: Compare each BHK price_per_sqft with the (BHK-1) stats
        
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(
                    exclude_indices,
                    bhk_df[bhk_df['price_per_sqft'] < stats['mean']].index.values
                )

    return df.drop(exclude_indices, axis='index')


### Understanding and Removing BHK Outliers

The `bhk_outlier_remove` function refines real estate data by removing properties where the `price_per_sqft` is unusually low compared to properties with fewer rooms (BHK) in the same location. For example, it addresses cases where a 3BHK is listed at a lower price per square foot than a 2BHK nearby, which is typically unrealistic.

#### How It Works: Two-Step Process Per Location

**1. Calculate BHK Statistics**  
For each unique location, properties are grouped by their BHK count. For every BHK group within a location, the function computes:
- Mean `price_per_sqft`
- Standard deviation of `price_per_sqft`
- Count of properties for that BHK type

This builds a statistical profile for each BHK size in every location.

**2. Compare and Exclude Outliers**  
For each BHK type in a location:
- The function looks up the mean `price_per_sqft` of the (BHK - 1) group (e.g., for 3BHK, it checks 2BHK stats).
- If a property's `price_per_sqft` is less than the mean of the smaller BHK group (and the smaller group has more than 5 properties), it is flagged as an outlier.
- The indices of these outlier properties are collected.

**Final Step: Clean the DataFrame**  
After processing all locations and BHK groups, the function removes all identified outlier rows, resulting in a cleaner dataset for further analysis and modeling.


In [None]:
data = bhk_outlier_remove(data)

In [None]:
data.shape

In [None]:
data

In [None]:
# `size` was already dropped after `bhk` was extracted, so only the helper column remains
data.drop(columns=['price_per_sqft'], inplace=True)

### Cleaned data

In [None]:
data.head()

In [None]:
data.to_csv("Cleaned_data.csv", index=False)  # index=False avoids an extra "Unnamed: 0" column on reload

In [None]:
X = data.drop(columns=['price'])
y = data['price']


## ✅ **UNIVARIATE ANALYSIS (5 Questions)**

1. What is the distribution of **area_type** in the dataset?
2. What is the frequency distribution of **BHK (size)**?
3. What is the distribution of **price** (minimum, maximum, median, skewness)?
4. Which **locations** have the highest number of listings?
5. What is the distribution of **total_sqft** values?
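The five questions above map onto a handful of pandas calls. A sketch that uses only the five rows shown by `data.head()` (with `bhk` already extracted), so the numbers are illustrative rather than findings about the full dataset:

```python
import pandas as pd

# The five rows shown by data.head(), with bhk extracted from size.
sample = pd.DataFrame({
    "area_type": ["Super built-up Area", "Plot Area", "Built-up Area",
                  "Super built-up Area", "Super built-up Area"],
    "location": ["Electronic City Phase II", "Chikka Tirupathi", "Uttarahalli",
                 "Lingadheeranahalli", "Kothanur"],
    "total_sqft": [1056.0, 2600.0, 1440.0, 1521.0, 1200.0],
    "bhk": [2, 4, 3, 3, 2],
    "price": [39.07, 120.0, 62.0, 95.0, 51.0],
})

area_type_counts = sample["area_type"].value_counts()                   # Q1
bhk_counts = sample["bhk"].value_counts().sort_index()                  # Q2
price_summary = sample["price"].agg(["min", "max", "median", "skew"])   # Q3
top_locations = sample["location"].value_counts().head(10)              # Q4
sqft_summary = sample["total_sqft"].describe()                          # Q5

print(area_type_counts, bhk_counts, price_summary, sep="\n\n")
```

On the full DataFrame the same calls apply unchanged; histograms or countplots can then visualise each result.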

---

## ✅ **BIVARIATE ANALYSIS (5 Questions)**

1. How does **price** vary with **total_sqft**?
2. What is the relationship between **BHK** and **price**?
3. How does **area_type** affect **average price**?
4. Is there a correlation between **bathroom count** and **price**?
5. How does **price** vary across different **locations**?
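These questions can be sketched with correlations and group-wise aggregates. The example again uses only the five `data.head()` rows, so the values are illustrative; choosing the mean for Q2 and Q3 and the median for Q5 is an assumption, not something the notebook prescribes:

```python
import pandas as pd

# The five rows shown by data.head(), with bhk extracted from size.
sample = pd.DataFrame({
    "area_type": ["Super built-up Area", "Plot Area", "Built-up Area",
                  "Super built-up Area", "Super built-up Area"],
    "location": ["Electronic City Phase II", "Chikka Tirupathi", "Uttarahalli",
                 "Lingadheeranahalli", "Kothanur"],
    "total_sqft": [1056.0, 2600.0, 1440.0, 1521.0, 1200.0],
    "bath": [2.0, 5.0, 2.0, 3.0, 2.0],
    "bhk": [2, 4, 3, 3, 2],
    "price": [39.07, 120.0, 62.0, 95.0, 51.0],
})

sqft_price_corr = sample["total_sqft"].corr(sample["price"])        # Q1
price_by_bhk = sample.groupby("bhk")["price"].mean()                # Q2
price_by_area_type = sample.groupby("area_type")["price"].mean()    # Q3
bath_price_corr = sample["bath"].corr(sample["price"])              # Q4
price_by_location = (
    sample.groupby("location")["price"].median().sort_values(ascending=False)
)                                                                   # Q5

print(price_by_bhk, price_by_area_type, price_by_location, sep="\n\n")
```

Scatter plots (Q1, Q4) and boxplots per category (Q2, Q3, Q5) are the usual visual counterparts of these summaries.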

---

## ✅ **MULTIVARIATE ANALYSIS (5 Questions)**

1. How do **BHK, total_sqft, and price** interact with each other?
2. How does **location** influence **price_per_sqft** when controlling for total_sqft?
3. Which combination of **area_type + BHK** results in the highest property prices?
4. How do **bathrooms, total_sqft, and price** correlate together?
5. What are the most important factors (BHK, sqft, location, area_type) that influence **price**?

---


# ✅ **PART 1: Data Cleaning Questions (Real World)**

### **1️⃣ Identify missing values. Which columns need major cleaning?**

Expected observation:

* **society** → many missing values
* **balcony** → missing values
* **availability** → mixed formats (dates + text)
* **size** → mixed (2 BHK, 4 Bedroom)


# ✅ **PART 3: EDA Questions & Insights**

The univariate, bivariate, and multivariate analysis below is based on a 17-row sample of the Bengaluru housing data. It follows the format typically expected of a data analyst in assignments, interviews, or case studies.

---

# ‚úÖ **üìå 1. UNIVARIATE ANALYSIS (One Variable at a Time)**

We analyze each column independently.

---

## **1️⃣ area_type**

**Categories present:**

* Super built-up Area (majority)
* Plot Area
* Built-up Area

👉 Insight: The market is dominated by *Super built-up Area* properties, meaning more apartments than plots.

---

## **2️⃣ availability**

Values include:

* Ready To Move
* Dates like 18-Feb, 18-May, 19-Dec

👉 Insight:
~70% of listings in the sample are **Ready To Move** properties → high demand market.

---

## **3️⃣ location**

Most frequent locations:

* Whitefield (3 occurrences)
* Others appear once each

👉 Insight:
Whitefield has the most listings in the sample → likely a prime investment zone.

---

## **4️⃣ size**

Values:

* 2 BHK
* 3 BHK
* 4 BHK / 4 Bedroom
* 6 Bedroom

Converted to numeric:

| BHK | Count |
| --- | ----- |
| 2   | 6     |
| 3   | 5     |
| 4   | 4     |
| 6   | 1     |

👉 Insight:
2 BHK is the most common configuration.

---

## **5️⃣ society**

* ~40% missing
* Values like Soiewre, Jaades, Brway G, Skityer

👉 Insight:
Society information is inconsistent → common in Indian real estate datasets.

---

## **6️⃣ total_sqft**

Distribution:

* Minimum = 1000 sqft
* Maximum = 3300 sqft
* Most values between **1100–1800 sqft**

👉 Insight:
Most properties are mid-sized apartments.

---

## **7️⃣ bath**

Bathroom distribution:

* 2, 3, 4, 5, 6

👉 Insight:
Most properties have **2 to 3 bathrooms**.

---

## **8️⃣ balcony**

Values:

* 0–3
* Missing in a few entries

---

## **9️⃣ price (Lakhs)**

* Minimum price = 38 lakhs
* Maximum price = 600 lakhs
* Most prices between **40–120 lakhs**

👉 Insight:
The price distribution is highly skewed due to premium localities.

---

# ‚úÖ **üìå 2. BIVARIATE ANALYSIS (Two Variables Together)**

Now we compare two variables to find relationships.

---

## **1️⃣ Price vs total_sqft**

* Larger sqft → Generally higher price
* Some outliers like

  * 1020 sqft → 370 lakhs (Gandhi Bazar) = premium area
  * 3300 sqft → 600 lakhs (Rajaji Nagar)

👉 Insight:
**Location is a stronger driver than sqft.**

---

## **2️⃣ Price vs BHK**

Trend:

* 2 BHK → mostly < 50 lakhs
* 3 BHK → 45–95 lakhs
* 4 BHK → 120–600 lakhs
* 6 Bedroom → 370 lakhs

👉 Insight:
Price does **not** increase linearly with BHK → location matters more.

---

## **3️⃣ Price vs area_type**

| area_type      | Price Range   |
| -------------- | ------------- |
| Super built-up | 38–204 lakhs  |
| Plot           | 120–370 lakhs |
| Built-up       | ~40–62 lakhs  |

👉 Insight:
**Plot Area** properties are the most expensive.

---

## **4️⃣ Price vs location**

Locations like:

* Rajaji Nagar → 600 lakhs
* Gandhi Bazar → 370 lakhs
* Whitefield → 38–70 lakhs

👉 Insight:
The market is highly location sensitive.

---

## **5️⃣ total_sqft vs BHK**

* 2 BHK → ~1000–1200 sqft
* 3 BHK → ~1300–1800 sqft
* 4 BHK → ~2600–3300 sqft

👉 Insight:
Sqft increases with BHK, but not always proportionally.

---

## **6️⃣ bathrooms vs price**

* 4–6 bathroom homes → very high prices
* 2–3 bathrooms → mid-range homes

👉 Insight:
Bathroom count is correlated with both luxury level and price.

---

# ‚úÖ **üìå 3. MULTIVARIATE ANALYSIS (Three or More Variables)**

Now we combine multiple features to understand deeper patterns.

---

## **1️⃣ Price vs BHK vs total_sqft**

Observation:

* High-price properties (200–600 lakhs) belong to:

  * 4 BHK / 6 Bedroom
  * 2600–3300 sqft
  * Premium locations (Rajaji Nagar, Gandhi Bazar)

👉 Insight:
Large size + high BHK in premium areas = luxury segment.

---

## **2️⃣ Price vs location vs sqft**

* Whitefield (mid-range area): 38–70 lakhs for 1170–1800 sqft
* Rajaji Nagar & Gandhi Bazar (premium): more than 350 lakhs even for 1000–3300 sqft

👉 Insight:
**Location + sqft** together explain price much better than either one alone.

---

## **3️⃣ Price per sqft vs location vs BHK**

Compute price_per_sqft:

| Location     | Price Per Sqft (Approx) |
| ------------ | ----------------------- |
| Rajaji Nagar | ₹18,181                 |
| Gandhi Bazar | ₹36,274                 |
| Whitefield   | ₹3,500–₹4,000           |
| Mysore Road  | ₹6,255                  |

👉 Insight:

* Gandhi Bazar is extremely premium even for smaller plots
* Whitefield is affordable despite being popular

---

## **4️⃣ Bathroom vs BHK vs price**

* 4 BHK + 4–6 bathrooms → >250 lakhs
* 3 BHK + 2–3 bathrooms → 60–95 lakhs
* 2 BHK + 2 bathrooms → <50 lakhs

👉 Insight:
Bath count is a strong indicator of luxury level.

---

## **5️⃣ area_type vs sqft vs price**

* Plot Area → very high price despite smaller size
* Super built-up → price depends on sqft
* Built-up → affordable category

👉 Insight:
Product type impacts price elasticity.

---

# ‚úÖ **üìå Summary Table**

| Type             | What You Learn                                          |
| ---------------- | ------------------------------------------------------- |
| **Univariate**   | Distribution of each variable                           |
| **Bivariate**    | Relationship between two variables (e.g., BHK vs price) |
| **Multivariate** | Combined effect (location + sqft + BHK on price)        |

---


## **📌 Q1. Which type of property (area_type) is most common?**

Expected:

* Super built-up Area → majority
* Plot Area → fewer
* Built-up Area → fewer

Insight:

* Market dominated by Super built-up properties → more apartments than land.

---

## **📌 Q2. Which location has the highest average price?**

From sample:

* Rajaji Nagar → 600 Lakh for 3300 sqft
* Gandhi Bazar → 370 Lakh

Typical EDA Chart:

* Bar chart of top 10 expensive locations

---

## **📌 Q3. Compare the BHK vs Price trend. Does price increase linearly?**

Observation:

* 2 BHK: 38–51 lakh
* 3 BHK: 48–95 lakh
* 4 BHK: 120–600 lakh
* 6 Bedroom: 370 lakh

Insight:

* Price does not increase linearly.
* 4 BHK prices vary hugely → location & sqft matter more than BHK.

💡 Interview question: *Why is 4 BHK variation so large?*
➡️ Because some are in premium areas (Rajaji Nagar), others in the outskirts.

---

## **📌 Q4. Are more bathrooms associated with higher prices?**

Expected trend:

* More bathrooms → higher price → correlated with BHK & total_sqft.

Plot:

* scatterplot(bath, price)

---

## **📌 Q5. How does total_sqft relate to price?**

Create scatter plot:

* Many outliers
* Strong correlation
* Price ≈ sqft * price_per_sqft (location dependent)

---

## **📌 Q6. Which locations have highest price per sqft?**

From sample:
Approx calculations:

* Rajaji Nagar ≈ ₹18,181 per sqft
* Whitefield ≈ ₹3,300–₹4,000 per sqft
* Mysore Road ≈ ₹6,255 per sqft
* Marathahalli ≈ ₹4,828 per sqft

Insight:

* Rajaji Nagar & Gandhi Bazar are premium localities.

---

## **📌 Q7. Identify unrealistic data points (Outliers)**

Examples from sample:

* 3300 sqft priced at 600 lakh (high but reasonable for premium area)
* 1020 sqft priced at 370 lakh (Plot + premium area → Gandhi Bazar)

Check:

* Bathroom > BHK by large difference (bath 6 for a 1020 sqft house → suspicious)
* Price too low or high w.r.t sqft

---

# ✅ **PART 4: Real-World Business Insights (Analyst-Level)**

### **📌 Insight 1: Location is the biggest price driver**

Rajaji Nagar & Gandhi Bazar are extremely premium.

### **📌 Insight 2: Price per sqft more stable than total price**

High variance in 3 BHK prices due to locality.

### **📌 Insight 3: Plots are far more expensive per sqft**

* Plot > Built-up > Super built-up

### **📌 Insight 4: Bathroom count strongly correlates with price**

More bathrooms → more luxurious property.

---

# ✅ **PART 5: 20 Real Interview Questions Based on This Dataset**

### **Data Cleaning & Transformation**

1. How will you handle missing societies & balcony values?
2. How will you convert "2 BHK" & "4 Bedroom" into numeric?
3. How will you parse availability dates like "19-Dec"?
4. How will you treat inconsistent sqft (ranges, text)?
5. How will you remove outliers? What methods?

### **EDA**

6. Which areas have highest average prices?
7. What is the distribution of price per sqft?
8. How does price vary by area_type?
9. Do more bathrooms indicate higher price?
10. Which features correlate most with price?
11. Which areas offer best value for money (low price per sqft)?
12. What is the relationship between total_sqft and BHK?
13. Are there unrealistic sqft-to-price points?
14. How does readiness of property (Ready to Move) affect price?
15. Which societies are most premium?

### **Feature Engineering**

16. How do you create price_per_sqft?
17. How do you group rare locations?
18. How do you encode high-cardinality location column?
19. How do you detect and treat outliers in price_per_sqft?

### **Business Insights**

20. If you were a real-estate consultant, how would you advise a buyer based on this EDA?

---


## Model Building 

The necessary libraries for model building and evaluation have been imported, including tools for data preprocessing, regression algorithms (Linear Regression, Lasso, Ridge), and performance metrics. This sets the stage for splitting the data, encoding categorical features, scaling, and training regression models.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score