# Exploratory Data Analysis - India Housing Price Prediction

## Objective
This notebook performs exploratory data analysis on the India housing dataset to understand:
- Data structure and quality
- Distribution of target variable (Price)
- Relationships between features and price
- Data issues that need to be addressed during preprocessing

## Dataset
- Source: Kaggle - India House Price Prediction
- Contains residential property records from various Indian cities
- Features include location, size, amenities, and transaction prices

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

df = pd.read_csv('../data/raw/india_housing_prices.csv')
print(f"Dataset Shape: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

## Initial Data Overview

The dataset contains 250,000 property records with 23 columns (including the `ID` column and the target). Each row represents a single property transaction.

**Key Observations:**
- Large dataset with sufficient samples for machine learning
- Mix of numerical and categorical features
- Target variable: `Price_in_Lakhs`
- Features include location (State, City, Locality), property characteristics (BHK, Size, Type), and amenities
- Locality column appears to be anonymized (Locality_84, Locality_490, etc.)

In [None]:
df.info()

## Data Quality Assessment

**Data Types:**
- 9 integer columns (int64)
- 2 float columns (Price_in_Lakhs, Price_per_SqFt)
- 12 categorical columns (object)

**Missing Values:**
- No missing values detected (all columns have 250,000 non-null entries)
- Data appears complete; non-null counts alone do not rule out duplicate rows or placeholder values

**Memory Usage:**
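- `df.info()` reports approximately 43.9 MB of memory; this is a shallow estimate that counts only the 8-byte references for object columns, so the true footprint is higher
- Still a manageable size for in-memory processing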
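
The checks above can be extended beyond what `df.info()` prints. The sketch below is one possible way to do that, not part of the original analysis: `quality_report` is a hypothetical helper name, and the tiny `demo` frame stands in for `df` so the snippet runs on its own.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarise basic quality indicators beyond the non-null counts."""
    return {
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        # deep=True also measures the string payload of object columns
        "memory_mb_deep": frame.memory_usage(deep=True).sum() / 1024**2,
        # deep=False is the shallow estimate (references only for objects)
        "memory_mb_shallow": frame.memory_usage(deep=False).sum() / 1024**2,
    }


# Tiny illustrative frame (hypothetical values); in the notebook, pass `df`.
demo = pd.DataFrame({
    "City": ["Pune", "Pune", "Chennai", None],
    "BHK": [2, 2, 3, 1],
})
report = quality_report(demo)
print(report)
```

Comparing the deep and shallow memory figures on the real `df` shows how much of the footprint comes from the string columns.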

In [None]:
df.describe()

## Statistical Summary of Numerical Features

**Target Variable (Price_in_Lakhs):**
- Mean: 254.59 Lakhs (approximately 2.55 Crores)
- Median: 253.87 Lakhs (close to mean, suggests relatively symmetric distribution)
- Range: 10 to 500 Lakhs (wide price range)
- Standard Deviation: 141.35 Lakhs (high variability in prices)

**Property Characteristics:**
- BHK: Average 3 BHK, ranging from 1 to 5 BHK
- Size: Average 2,750 sq ft, ranging from 500 to 5,000 sq ft
- Year Built: Properties built between 1990-2023, average year 2007
- Age: Average property age is 18 years

**Location & Amenities:**
- Floor Number: Ranges from ground floor (0) to 30th floor
- Total Floors: Buildings range from 1 to 30 floors
- Nearby Schools: 1 to 10 schools, average 5-6
- Nearby Hospitals: 1 to 10 hospitals, average 5-6

**Important Note:**
Price_per_SqFt is directly derived from Price_in_Lakhs and Size_in_SqFt, which could cause data leakage in modeling.
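
Whether `Price_per_SqFt` really is a deterministic function of the other two columns can be tested directly. The sketch below is a minimal check under assumptions: the unit of `Price_per_SqFt` is not stated in this notebook, so the helper (a hypothetical name, `is_derived_ratio`) tries two candidate scales, Lakhs per sq ft (`1.0`) and rupees per sq ft (`1e5`), and the 1% tolerance is an arbitrary allowance for rounding.

```python
import numpy as np
import pandas as pd


def is_derived_ratio(frame, price_col="Price_in_Lakhs", size_col="Size_in_SqFt",
                     ratio_col="Price_per_SqFt", scales=(1.0, 1e5), rtol=0.01):
    """Return the first scale for which ratio ~= scale * price / size, else None."""
    implied = frame[price_col] / frame[size_col]
    for scale in scales:
        # Candidate scales are assumptions about the (undocumented) unit
        if np.allclose(frame[ratio_col], scale * implied, rtol=rtol):
            return scale
    return None


# Illustrative rows (hypothetical values); in the notebook, pass `df`.
demo = pd.DataFrame({"Price_in_Lakhs": [100.0, 250.0],
                     "Size_in_SqFt": [1000, 2500]})
demo["Price_per_SqFt"] = demo["Price_in_Lakhs"] / demo["Size_in_SqFt"]
derived_scale = is_derived_ratio(demo)

# A column unrelated to price / size matches no candidate scale
unrelated_scale = is_derived_ratio(demo.assign(Price_per_SqFt=[5.0, 7.0]))
print(derived_scale, unrelated_scale)
```

A non-`None` result on the real data would support treating the column as leakage; `None` would only mean that neither assumed scale fits within the tolerance.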

In [None]:
categorical_cols = df.select_dtypes(include='object').columns
print("Categorical Features:")
print("-" * 50)
for col in categorical_cols:
    print(f"\n{col}:")
    print(f"  Unique values: {df[col].nunique()}")
    print(f"  Top 3 values: {df[col].value_counts().head(3).to_dict()}")

## Categorical Features Analysis

**Geographic Features:**
- **State**: 20 unique states, relatively balanced distribution (top state has ~12,681 properties)
- **City**: 42 cities across India, well distributed (top city has ~6,461 properties)
- **Locality**: 500 unique localities, highly granular but fairly uniform distribution

**Property Characteristics:**
- **Property_Type**: 3 types (Villa, Independent House, Apartment) - evenly distributed
- **Furnished_Status**: 3 categories (Unfurnished, Semi-furnished, Furnished) - balanced distribution
- **Facing**: 4 directions (West, North, South, East) - roughly equal distribution

**Binary Features:**
- **Parking_Space**: Nearly 50-50 split (No: 50.2%, Yes: 49.8%)
- **Security**: Nearly 50-50 split (Yes: 50.1%, No: 49.9%)
- **Availability_Status**: Balanced (Under_Construction: 50.0%, Ready_to_Move: 50.0%)

**Other Features:**
- **Public_Transport_Accessibility**: 3 levels (High, Low, Medium) - evenly distributed
- **Owner_Type**: 3 types (Broker, Owner, Builder) - balanced
- **Amenities**: 325 unique combinations (e.g., "Pool", "Clubhouse", "Garden")

**Key Insight:**
The dataset appears to be synthetically generated or heavily balanced, as most categorical features show near-perfect uniform distribution, which is unusual for real-world data.
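
The "near-perfect uniform" claim can be quantified by comparing each category's share with the uniform share 1/k. For example, the top state holds 12,681 of 250,000 rows (about 5.07%) against a uniform share of 5% for 20 states. The sketch below shows one way to compute that gap; `uniformity_gap` is a hypothetical helper and the `demo` series is made-up data.

```python
import pandas as pd


def uniformity_gap(series: pd.Series) -> float:
    """Largest absolute gap between any category's share and the uniform share 1/k."""
    shares = series.value_counts(normalize=True)
    return float((shares - 1.0 / len(shares)).abs().max())


# Made-up example with three nearly balanced categories
demo = pd.Series(["Villa"] * 34 + ["Apartment"] * 33 + ["Independent House"] * 33)
gap = uniformity_gap(demo)
print(f"Max deviation from uniform share: {gap:.4f}")
```

In the notebook this could be applied per column, for instance `{col: uniformity_gap(df[col]) for col in categorical_cols}`.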

In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.hist(df['Price_in_Lakhs'], bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Price (Lakhs)')
plt.ylabel('Frequency')
plt.title('Distribution of Property Prices')
plt.grid(axis='y', alpha=0.3)

plt.subplot(1, 2, 2)
plt.boxplot(df['Price_in_Lakhs'])
plt.ylabel('Price (Lakhs)')
plt.title('Price Distribution - Boxplot')
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Price Statistics:")
print(f"Mean: {df['Price_in_Lakhs'].mean():.2f} Lakhs")
print(f"Median: {df['Price_in_Lakhs'].median():.2f} Lakhs")
print(f"Skewness: {df['Price_in_Lakhs'].skew():.2f}")
print(f"Kurtosis: {df['Price_in_Lakhs'].kurtosis():.2f}")

## Target Variable Distribution Analysis

**Distribution Characteristics:**
- The histogram shows an extremely uniform distribution across all price ranges (10 to 500 Lakhs)
- Each price bin has approximately the same frequency (~5,000 properties)
- Mean (254.59) and Median (253.87) are nearly identical

**Statistical Measures:**
- **Skewness: 0.01** - Nearly zero, indicating an almost perfectly symmetric distribution
- **Kurtosis: -1.20** - Negative excess kurtosis indicates a flatter distribution than normal (platykurtic); -1.2 is exactly the excess kurtosis of a uniform distribution

**Boxplot Insights:**
- Quartiles: Q1 = 132.55 and Q3 = 376.88 Lakhs, giving an interquartile range (IQR) of 244.33 Lakhs
- No visible outliers
- Box spans approximately 50% of the total range

**Critical Observation:**
This uniform distribution is highly unusual for real estate data, which typically shows right-skewed distributions. It strongly suggests the dataset was synthetically generated with uniform random sampling across the price range. In real-world scenarios, most properties cluster at lower price points, with a small number of expensive properties creating a long right tail.
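
The uniform-sampling hypothesis can be checked against the closed-form moments of a continuous uniform distribution on [10, 500]. The sketch below computes those theoretical values; `uniform_summary` is a hypothetical helper, and treating the price as continuous over exactly this range is an assumption.

```python
import math


def uniform_summary(low: float, high: float) -> dict:
    """Theoretical summary statistics of a continuous Uniform(low, high)."""
    width = high - low
    return {
        "mean": (low + high) / 2,
        "std": width / math.sqrt(12),
        "q1": low + 0.25 * width,
        "q3": low + 0.75 * width,
        "skewness": 0.0,
        "excess_kurtosis": -6 / 5,
    }


expected = uniform_summary(10, 500)
for name, value in expected.items():
    print(f"{name:16s}: {value:8.2f}")
```

The theoretical mean (255.0), standard deviation (about 141.45), and quartiles (132.5 and 377.5) sit very close to the observed 254.59, 141.35, 132.55, and 376.88, which is what uniform sampling would produce.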

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop('ID')
correlation_matrix = df[numeric_cols].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Numerical Features', fontsize=14, pad=20)
plt.tight_layout()
plt.show()

print("\nCorrelations with Target Variable (Price_in_Lakhs):")
print("-" * 50)
price_corr = correlation_matrix['Price_in_Lakhs'].sort_values(ascending=False)
for feature, corr in price_corr.items():
    if feature != 'Price_in_Lakhs':
        print(f"{feature:35s}: {corr:6.4f}")

## Correlation Analysis

**Key Findings:**

**Moderate Correlation:**
- **Price_per_SqFt: 0.5556** - Moderate positive correlation with price
  - This is expected because the feature is computed from the target (Price_per_SqFt = Price / Size), which makes it data leakage
  - Must be removed before model training

**Weak/No Correlations:**
All other features show near-zero correlations with price (range: -0.0028 to 0.0027):
- Year_Built: 0.0027
- Total_Floors: 0.0013
- BHK: -0.0010
- Size_in_SqFt: -0.0025
- Nearby_Schools: 0.0002
- Nearby_Hospitals: -0.0028

**Critical Issue:**
The absence of meaningful correlations between property characteristics (Size, BHK, counts of nearby schools and hospitals) and Price is highly unusual. In real estate:
- Size typically has strong positive correlation with price
- BHK count correlates with price
- Location features significantly impact pricing

**Implication for Modeling:**
Since the numerical features show no linear relationship with price, any remaining signal would have to come from:
1. Categorical features (State, City, Property_Type)
2. The leaked Price_per_SqFt feature

For a realistic model, Price_per_SqFt must be excluded even though it is the only feature with a non-negligible correlation.
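
A correlation of about 0.55 for `Price_per_SqFt` does not require any real relationship in the data. The simulation below is a sketch under stated assumptions: price and size are drawn independently and uniformly from the ranges reported in the summary statistics (10 to 500 Lakhs, 500 to 5,000 sq ft), and the ratio is formed without any unit conversion, which does not affect the correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumption: price and size are independent uniform draws
price = rng.uniform(10, 500, n)
size = rng.uniform(500, 5000, n)
price_per_sqft = price / size  # derived feature, built from the target

corr_leak = np.corrcoef(price, price_per_sqft)[0, 1]
corr_size = np.corrcoef(price, size)[0, 1]

print(f"corr(price, price / size): {corr_leak:.3f}")
print(f"corr(price, size):         {corr_size:.3f}")
```

Under these assumptions the derived ratio correlates with price at roughly 0.55 while size itself stays near zero, which mirrors the observed pattern and is consistent with the correlation being purely an artifact of how the feature is constructed.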

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

df.boxplot(column='Price_in_Lakhs', by='Property_Type', ax=axes[0, 0])
axes[0, 0].set_title('Price by Property Type')
axes[0, 0].set_xlabel('Property Type')
axes[0, 0].set_ylabel('Price (Lakhs)')
plt.sca(axes[0, 0])
plt.xticks(rotation=45)

df.boxplot(column='Price_in_Lakhs', by='Furnished_Status', ax=axes[0, 1])
axes[0, 1].set_title('Price by Furnished Status')
axes[0, 1].set_xlabel('Furnished Status')
axes[0, 1].set_ylabel('Price (Lakhs)')

df.boxplot(column='Price_in_Lakhs', by='Parking_Space', ax=axes[1, 0])
axes[1, 0].set_title('Price by Parking Space')
axes[1, 0].set_xlabel('Parking Available')
axes[1, 0].set_ylabel('Price (Lakhs)')

df.boxplot(column='Price_in_Lakhs', by='Security', ax=axes[1, 1])
axes[1, 1].set_title('Price by Security')
axes[1, 1].set_xlabel('Security Available')
axes[1, 1].set_ylabel('Price (Lakhs)')

plt.suptitle('')
plt.tight_layout()
plt.show()

print("Average Price by Category:")
print("-" * 50)
# Same report for each feature: mean price per category, highest first
for col in ['Property_Type', 'Furnished_Status', 'Parking_Space', 'Security']:
    print(f"\n{col.replace('_', ' ')}:")
    print(df.groupby(col)['Price_in_Lakhs'].mean().sort_values(ascending=False))

## Categorical Features vs Price Analysis

**Observations:**

**Property Type:**
- Independent House: 255.37 Lakhs
- Apartment: 254.62 Lakhs
- Villa: 253.77 Lakhs
- Price difference: Only 1.6 Lakhs between highest and lowest (~0.6% variation)

**Furnished Status:**
- Unfurnished: 254.98 Lakhs
- Furnished: 254.45 Lakhs
- Semi-furnished: 254.33 Lakhs
- Price difference: Only 0.65 Lakhs (~0.25% variation)

**Parking Space:**
- With Parking: 254.75 Lakhs
- Without Parking: 254.43 Lakhs
- Price difference: Only 0.32 Lakhs (~0.12% variation)

**Security:**
- With Security: 255.12 Lakhs
- Without Security: 254.05 Lakhs
- Price difference: Only 1.07 Lakhs (~0.4% variation)

**Key Finding:**
All four categorical features examined here show nearly identical average prices across their categories. The price differences are minimal (less than 1% of the mean price), which indicates these features have negligible impact on pricing in this dataset.

**Expected Real-World Behavior:**
- Furnished properties typically command 10-20% premium
- Villas usually cost significantly more than apartments
- Security and parking add substantial value

The uniform pricing across categories is consistent with synthetic data in which price is assigned independently of these features.
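
The "price difference" figures above are the gap between the highest and lowest category means, expressed as a share of the mean price. The sketch below reproduces that arithmetic; `group_mean_spread` is a hypothetical helper, and the `demo` frame uses one row per property type set to the reported mean, so its overall mean is the unweighted average of the three values rather than the dataset's 254.59.

```python
import pandas as pd


def group_mean_spread(frame, by, target="Price_in_Lakhs"):
    """Return (max - min of group means, that gap as % of the overall mean)."""
    means = frame.groupby(by)[target].mean()
    spread = means.max() - means.min()
    return float(spread), float(100 * spread / frame[target].mean())


# One row per category, set to the reported means (illustrative only)
demo = pd.DataFrame({
    "Property_Type": ["Independent House", "Apartment", "Villa"],
    "Price_in_Lakhs": [255.37, 254.62, 253.77],
})
spread, spread_pct = group_mean_spread(demo, "Property_Type")
print(f"Spread: {spread:.2f} Lakhs ({spread_pct:.2f}% of mean)")
```

On the full data, calling `group_mean_spread(df, col)` for each categorical column gives a single comparable number per feature.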

In [None]:
top_states = df.groupby('State')['Price_in_Lakhs'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
top_states.plot(kind='barh')
plt.xlabel('Average Price (Lakhs)')
plt.ylabel('State')
plt.title('Top 10 States by Average Property Price')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("Top 10 States by Average Price:")
print("-" * 50)
for state, price in top_states.items():
    print(f"{state:20s}: {price:.2f} Lakhs")

# top_states holds only the 10 highest means, so this is the spread within the top 10
print(f"\nPrice Range Across Top 10 States: {top_states.max() - top_states.min():.2f} Lakhs")

## Geographic Price Analysis - States

**Top 5 States by Average Price (from the top-10 ranking):**
1. Karnataka: 257.41 Lakhs
2. Tamil Nadu: 256.66 Lakhs
3. Uttar Pradesh: 256.25 Lakhs
4. Madhya Pradesh: 255.96 Lakhs
5. Gujarat: 255.79 Lakhs

**Key Observations:**
- Price range across top 10 states: Only 2.73 Lakhs (approximately 1% variation)
- The top 10 states all sit within about 3 Lakhs of the overall mean of 254.59 Lakhs
- Minimal geographic price differentiation among these states

**Real-World Expectation:**
In actual real estate markets, states show significant price variations:
- States and territories with major metro areas (Maharashtra, Karnataka, Delhi) are typically 2-3x more expensive
- Tier-2 cities show 40-60% lower prices
- Rural/less developed states have substantially lower property values

**Dataset Limitation:**
The uniform pricing across states indicates that state-level location has minimal predictive power in this synthetic dataset. In real estate pricing models, location is typically one of the strongest predictors.
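
How much of the price variance a location column explains can be summarised with eta-squared, the ratio of between-group to total sum of squares. This is a swapped-in measure, not something the notebook computes; `eta_squared` is a hypothetical helper and the two `demo` frames are made-up extremes.

```python
import pandas as pd


def eta_squared(frame, by, target="Price_in_Lakhs"):
    """Share of target variance explained by group membership (0 to 1)."""
    overall = frame[target].mean()
    grouped = frame.groupby(by)[target]
    # Between-group sum of squares, weighted by group size
    ss_between = (grouped.count() * (grouped.mean() - overall) ** 2).sum()
    ss_total = ((frame[target] - overall) ** 2).sum()
    return float(ss_between / ss_total)


# Made-up extremes: state fully determines price vs. state is irrelevant
determined = pd.DataFrame({"State": ["A", "A", "B", "B"],
                           "Price_in_Lakhs": [100.0, 100.0, 200.0, 200.0]})
irrelevant = pd.DataFrame({"State": ["A", "A", "B", "B"],
                           "Price_in_Lakhs": [100.0, 200.0, 100.0, 200.0]})
eta_determined = eta_squared(determined, "State")
eta_irrelevant = eta_squared(irrelevant, "State")
print(eta_determined, eta_irrelevant)
```

Values close to zero for `State`, `City`, and `Locality` on the real data would back up the conclusion that location carries almost no signal here.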

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].scatter(df['Size_in_SqFt'], df['Price_in_Lakhs'], alpha=0.3, s=1)
axes[0, 0].set_xlabel('Size (Sq Ft)')
axes[0, 0].set_ylabel('Price (Lakhs)')
axes[0, 0].set_title('Size vs Price')
axes[0, 0].grid(alpha=0.3)

axes[0, 1].scatter(df['BHK'], df['Price_in_Lakhs'], alpha=0.3, s=1)
axes[0, 1].set_xlabel('BHK')
axes[0, 1].set_ylabel('Price (Lakhs)')
axes[0, 1].set_title('BHK vs Price')
axes[0, 1].grid(alpha=0.3)

axes[1, 0].scatter(df['Year_Built'], df['Price_in_Lakhs'], alpha=0.3, s=1)
axes[1, 0].set_xlabel('Year Built')
axes[1, 0].set_ylabel('Price (Lakhs)')
axes[1, 0].set_title('Year Built vs Price')
axes[1, 0].grid(alpha=0.3)

axes[1, 1].scatter(df['Age_of_Property'], df['Price_in_Lakhs'], alpha=0.3, s=1)
axes[1, 1].set_xlabel('Age of Property (Years)')
axes[1, 1].set_ylabel('Price (Lakhs)')
axes[1, 1].set_title('Property Age vs Price')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Scatter Plot Analysis - Key Features vs Price

**Size vs Price:**
- Shows completely random scatter with no discernible pattern
- Properties of all sizes (500-5000 sq ft) distributed uniformly across all price ranges
- Expected: Strong positive linear relationship (larger properties = higher prices)
- Actual: No visible relationship, consistent with the correlation of -0.0025

**BHK vs Price:**
- Vertical bands at each BHK value (1, 2, 3, 4, 5)
- Each BHK category spans the entire price range (10-500 Lakhs)
- Expected: Clear upward trend (more bedrooms = higher price)
- Actual: No relationship, all BHK types have visually indistinguishable price distributions

**Year Built vs Price:**
- Vertical stripes for each year from 1990 to 2023
- Every construction year shows full price range
- Expected: Newer properties command premium prices
- Actual: No age-based pricing differentiation

**Property Age vs Price:**
- Mirror image of Year Built (Age = 2025 - Year Built), so the two columns carry the same information
- Uniform price distribution across all property ages
- Expected: Depreciation effect on older properties
- Actual: Age has no impact on pricing
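
The relationship Age = 2025 - Year Built can be verified by checking that the two columns always sum to the same reference year. The sketch below is one way to do this; `age_reference_years` is a hypothetical helper and the `demo` rows are made up to follow the stated formula.

```python
import pandas as pd


def age_reference_years(frame, year_col="Year_Built", age_col="Age_of_Property"):
    """Return the distinct values of Year_Built + Age_of_Property."""
    return sorted((frame[year_col] + frame[age_col]).unique().tolist())


# Made-up rows following Age = 2025 - Year_Built
demo = pd.DataFrame({"Year_Built": [1990, 2007, 2023],
                     "Age_of_Property": [35, 18, 2]})
reference = age_reference_years(demo)
print(reference)
```

A single value in the result means the columns are perfectly collinear, in which case keeping only one of them for modeling is usually enough.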

**Conclusion:**
All four scatter plots show no visible dependence of price on these property characteristics. The price appears to be randomly assigned rather than determined by property features. This makes the dataset unsuitable for learning real-world pricing patterns, but acceptable for demonstrating an ML pipeline implementation.

In [None]:
print("BHK Distribution:")
print(df['BHK'].value_counts().sort_index())
print(f"\nAverage price by BHK:")
print(df.groupby('BHK')['Price_in_Lakhs'].agg(['mean', 'median', 'std']))

print("\n" + "="*60)
print("Size Distribution by BHK:")
print(df.groupby('BHK')['Size_in_SqFt'].agg(['mean', 'min', 'max']))