# Preliminary Data Inspection

In [2]:
# ==========================================================
# 01 - PRELIMINARY DATA INSPECTION
# Real Estate Investment Advisor (Phase 2)
# ==========================================================

# This notebook performs:
# ✔ Environment checks
# ✔ Logging & reproducibility setup
# ✔ Dataset loading with exception handling
# ✔ Basic structural inspection
# ✔ Missing values analysis
# ✔ Column-wise data type examination
# ✔ Duplicate detection
# ✔ Summary statistics (numerical + categorical)

print("Notebook Loaded: 01_preliminary_data_inspection.ipynb")

Notebook Loaded: 01_preliminary_data_inspection.ipynb


## Environment Checks

In [3]:
import sys, os, importlib, logging

In [4]:
REQUIRED_PACKAGES = [
    "numpy", "pandas", "scipy", "sklearn", "xgboost",
    "lightgbm", "shap", "mlflow",
    "matplotlib", "seaborn", "plotly", "folium"
]

print("Python Version:", sys.version)
print("Working Directory:", os.getcwd())

Python Version: 3.10.19 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 16:41:31) [MSC v.1929 64 bit (AMD64)]
Working Directory: C:\Users\uttam\LabMentix\real_estate_project\notebooks


In [5]:
missing = []
for pkg in REQUIRED_PACKAGES:
    try:
        importlib.import_module(pkg)
    except ImportError:
        missing.append(pkg)

if missing:
    print("\n⚠ Missing Packages:")
    for m in missing:
        print(" -", m)
    print("\nInstall them inside your env if required.")
else:
    print("\n All required packages available.")


 All required packages available.
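For reproducibility it can also help to record which versions were found, not only whether the import succeeded. The sketch below is a minimal example using the standard-library `importlib.metadata`; the helper `installed_version` and the `DIST_NAMES` mapping are hypothetical names introduced here, and the mapping is needed because the import name `sklearn` is distributed as `scikit-learn`.

```python
from importlib import metadata

# Import names mapped to distribution names where the two differ.
# (Hypothetical helper mapping; extend it if other packages need it.)
DIST_NAMES = {"sklearn": "scikit-learn"}


def installed_version(import_name):
    """Return the installed version string, or None if the distribution is absent."""
    dist_name = DIST_NAMES.get(import_name, import_name)
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of a few packages from REQUIRED_PACKAGES.
for name in ["numpy", "pandas", "sklearn"]:
    print(f"{name}: {installed_version(name)}")
```

Logging these versions alongside the Python version makes it easier to reproduce the notebook's outputs later.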


## Dataset Loading

In [6]:
import pandas as pd

# file path
DATA_PATH = "../dataset/india_housing_prices.csv"

In [9]:
import logging

# Create and configure logger
logger = logging.getLogger("RealEstateLogger")
logger.setLevel(logging.INFO)

# Log format
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

# Console handler
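# Guarded so that re-running this cell does not attach a second handler
# (which would print every log line twice).
if not logger.handlers:
    ch = logging.StreamHandler()
    ch.setFormatter(formatter)
    logger.addHandler(ch)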

In [10]:
def load_dataset(path):
    """Loads CSV or Parquet files safely with detailed logging."""
    try:
        if path.endswith(".csv"):
            df = pd.read_csv(path)
        elif path.endswith(".parquet"):
            df = pd.read_parquet(path)
        else:
            raise ValueError("Unsupported file format: " + path)
        
        logger.info(f"Dataset Loaded Successfully: {df.shape[0]} rows, {df.shape[1]} columns")
        return df
    
    except FileNotFoundError:
        logger.error(f"File Not Found at: {path}")
        raise
    
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception(f"Error while loading dataset from: {path}")
        raise

In [11]:
# Attempt loading
df = load_dataset(DATA_PATH)
df.head()

2025-12-08 15:21:29,972 - INFO - Dataset Loaded Successfully: 250000 rows, 23 columns


Unnamed: 0,ID,State,City,Locality,Property_Type,BHK,Size_in_SqFt,Price_in_Lakhs,Price_per_SqFt,Year_Built,...,Age_of_Property,Nearby_Schools,Nearby_Hospitals,Public_Transport_Accessibility,Parking_Space,Security,Amenities,Facing,Owner_Type,Availability_Status
0,1,Tamil Nadu,Chennai,Locality_84,Apartment,1,4740,489.76,0.1,1990,...,35,10,3,High,No,No,"Playground, Gym, Garden, Pool, Clubhouse",West,Owner,Ready_to_Move
1,2,Maharashtra,Pune,Locality_490,Independent House,3,2364,195.52,0.08,2008,...,17,8,1,Low,No,Yes,"Playground, Clubhouse, Pool, Gym, Garden",North,Builder,Under_Construction
2,3,Punjab,Ludhiana,Locality_167,Apartment,2,3642,183.79,0.05,1997,...,28,9,8,Low,Yes,No,"Clubhouse, Pool, Playground, Gym",South,Broker,Ready_to_Move
3,4,Rajasthan,Jodhpur,Locality_393,Independent House,2,2741,300.29,0.11,1991,...,34,5,7,High,Yes,Yes,"Playground, Clubhouse, Gym, Pool, Garden",North,Builder,Ready_to_Move
4,5,Rajasthan,Jaipur,Locality_466,Villa,4,4823,182.9,0.04,2002,...,23,4,9,Low,No,Yes,"Playground, Garden, Gym, Pool, Clubhouse",East,Builder,Ready_to_Move


## Basic Metadata Summary

In [12]:
print("==========================================")
print("BASIC DATASET OVERVIEW")
print("==========================================")

print("\n Shape of dataset:", df.shape)

print("\n Column Names:")
print(df.columns.tolist())

print("\n Data Types:")
print(df.dtypes)

print("\n Memory Usage:")
print(df.memory_usage(deep=True))

BASIC DATASET OVERVIEW

 Shape of dataset: (250000, 23)

 Column Names:
['ID', 'State', 'City', 'Locality', 'Property_Type', 'BHK', 'Size_in_SqFt', 'Price_in_Lakhs', 'Price_per_SqFt', 'Year_Built', 'Furnished_Status', 'Floor_No', 'Total_Floors', 'Age_of_Property', 'Nearby_Schools', 'Nearby_Hospitals', 'Public_Transport_Accessibility', 'Parking_Space', 'Security', 'Amenities', 'Facing', 'Owner_Type', 'Availability_Status']

 Data Types:
ID                                  int64
State                              object
City                               object
Locality                           object
Property_Type                      object
BHK                                 int64
Size_in_SqFt                        int64
Price_in_Lakhs                    float64
Price_per_SqFt                    float64
Year_Built                          int64
Furnished_Status                   object
Floor_No                            int64
Total_Floors                        int64
Age_of_Propert

## Missing Values Analysis

In [15]:
print("==========================================")
print(" MISSING VALUES ANALYSIS")
print("==========================================")
print(df.isnull().sum())
missing = df.isnull().sum().sort_values(ascending=False)
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)

missing_df = pd.DataFrame({
    "missing_count": missing,
    "missing_percent": missing_pct
})

missing_df[missing_df["missing_count"] > 0]

 MISSING VALUES ANALYSIS
ID                                0
State                             0
City                              0
Locality                          0
Property_Type                     0
BHK                               0
Size_in_SqFt                      0
Price_in_Lakhs                    0
Price_per_SqFt                    0
Year_Built                        0
Furnished_Status                  0
Floor_No                          0
Total_Floors                      0
Age_of_Property                   0
Nearby_Schools                    0
Nearby_Hospitals                  0
Public_Transport_Accessibility    0
Parking_Space                     0
Security                          0
Amenities                         0
Facing                            0
Owner_Type                        0
Availability_Status               0
dtype: int64


Unnamed: 0,missing_count,missing_percent


## Duplicate Rows

In [16]:
print("==========================================")
print(" DUPLICATE ROWS")
print("==========================================")

dupes = df.duplicated().sum()
print(f" Duplicate Rows Found: {dupes}")

 DUPLICATE ROWS
 Duplicate Rows Found: 0
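One caveat: if `ID` is a unique per-row identifier, as the preview above suggests, `df.duplicated()` can never flag two listings that are otherwise identical. A minimal sketch of a stricter check, using a small hypothetical frame in place of `df`:

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real dataset.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "City": ["Pune", "Pune", "Jaipur"],
    "BHK": [2, 2, 3],
    "Price_in_Lakhs": [95.0, 95.0, 180.5],
})

# Full-row comparison: the unique ID hides the repeated listing.
full_row_dupes = int(sample.duplicated().sum())

# Comparison on every column except the identifier.
content_dupes = int(sample.drop(columns=["ID"]).duplicated().sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicates ignoring ID: {content_dupes}")
```

On the real data the same idea would be `df.drop(columns=["ID"]).duplicated().sum()`.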


## Statistical Summary

In [17]:
print("==========================================")
print("NUMERICAL SUMMARY")
print("==========================================")
display(df.describe().T)

print("==========================================")
print("CATEGORICAL SUMMARY")
print("==========================================")
display(df.describe(include="object").T)


NUMERICAL SUMMARY


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,250000.0,125000.5,72168.927986,1.0,62500.75,125000.5,187500.25,250000.0
BHK,250000.0,2.999396,1.415521,1.0,2.0,3.0,4.0,5.0
Size_in_SqFt,250000.0,2749.813216,1300.606954,500.0,1623.0,2747.0,3874.0,5000.0
Price_in_Lakhs,250000.0,254.586854,141.349921,10.0,132.55,253.87,376.88,500.0
Price_per_SqFt,250000.0,0.130597,0.130747,0.0,0.05,0.09,0.16,0.99
Year_Built,250000.0,2006.520012,9.808575,1990.0,1998.0,2007.0,2015.0,2023.0
Floor_No,250000.0,14.9668,8.948047,0.0,7.0,15.0,23.0,30.0
Total_Floors,250000.0,15.503004,8.671618,1.0,8.0,15.0,23.0,30.0
Age_of_Property,250000.0,18.479988,9.808575,2.0,10.0,18.0,27.0,35.0
Nearby_Schools,250000.0,5.49986,2.878639,1.0,3.0,5.0,8.0,10.0


CATEGORICAL SUMMARY


Unnamed: 0,count,unique,top,freq
State,250000,20,Odisha,12681
City,250000,42,Coimbatore,6461
Locality,250000,500,Locality_296,567
Property_Type,250000,3,Villa,83744
Furnished_Status,250000,3,Unfurnished,83408
Public_Transport_Accessibility,250000,3,High,83705
Parking_Space,250000,2,No,125456
Security,250000,2,Yes,125233
Amenities,250000,325,Pool,10218
Facing,250000,4,West,62757


## Dataset Quality Warnings (Auto Alerts)

In [18]:
print("==========================================")
print(" QUALITY CHECK — AUTO ALERTS")
print("==========================================")

warnings = []

# Alert 1 — Too many missing values
if (df.isnull().mean() > 0.4).any():
    warnings.append("⚠ Some columns have more than 40% missing values.")

# Alert 2 — Duplicates
if dupes > 0:
    warnings.append(f"⚠ Dataset contains {dupes} duplicate rows.")

# Alert 3 - Suspicious pricing
# The price column is Price_in_Lakhs, so the 10 crore threshold is 1,000 lakhs.
if "Price_in_Lakhs" in df.columns:
    if df["Price_in_Lakhs"].max() > 1000:  # 10 crore = 1,000 lakhs
        warnings.append("⚠ Extremely high property prices detected.")

# Print warnings
if warnings:
    print("\n".join(warnings))
else:
    print(" No critical data quality issues detected.")


 QUALITY CHECK — AUTO ALERTS
 No critical data quality issues detected.


## Conclusion

## Main Points
- No missing values in any of the 23 columns
- No fully duplicated rows
- Mix of numeric (`int64`, `float64`) and categorical (`object`) columns
- Good volume: 250,000 rows (2.5 lakh)
- Ready for EDA and feature engineering

------
## Detailed Insights
### 1. **PRICE PATTERN CHECK**
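   - *Mean price* = **₹254.59 Lakhs (~₹2.55 Cr)**
   - Min = ₹10 Lakhs
   - Max = ₹500 Lakhs (₹5 Cr)

This range is plausible for Indian residential markets:
- Low end: cheaper flats in Tier-2 cities
- Mid range: metro apartments
- High end: villas and independent houses in high-demand zones

The dataset spans all three segments, which is useful for modelling. The quartiles (₹132.55, ₹253.87, and ₹376.88 Lakhs) are almost evenly spaced between the minimum and maximum, so prices appear close to uniformly distributed; this is worth confirming during EDA.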

### 2. **PPSF IS VERY LOW (Suspicious but Understandable)**

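- **Price_per_SqFt mean = 0.13**. The column is stored in lakhs per square foot (`Price_in_Lakhs / Size_in_SqFt`), so **0.13 lakhs per sq ft = ₹13,000/sqft**.

This is in a realistic range for:

  - Mumbai/Pune premium zones = 12k–18k
  - Chennai/Bangalore mid segments = 8k–16k

The PPSF column is therefore usable. It is rounded to two decimals (the minimum is 0.0), so recomputing it from `Price_in_Lakhs` and `Size_in_SqFt` preserves more precision.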
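The unit interpretation can be checked against the first three rows of the preview shown earlier. This is a minimal sketch: `sample` holds those three rows, and `PPSF_rupees` is a hypothetical column name introduced only for illustration.

```python
import pandas as pd

# First three rows of the df.head() preview earlier in this notebook.
sample = pd.DataFrame({
    "Size_in_SqFt": [4740, 2364, 3642],
    "Price_in_Lakhs": [489.76, 195.52, 183.79],
    "Price_per_SqFt": [0.10, 0.08, 0.05],
})

# Recompute price per square foot in lakhs, rounded like the stored column.
recomputed = (sample["Price_in_Lakhs"] / sample["Size_in_SqFt"]).round(2)

# Compare with a small tolerance to avoid floating-point equality issues.
matches = bool((recomputed - sample["Price_per_SqFt"]).abs().lt(0.005).all())

# 1 lakh = 100,000 rupees, so lakhs per sqft * 100,000 = rupees per sqft.
sample["PPSF_rupees"] = sample["Price_in_Lakhs"] * 100_000 / sample["Size_in_SqFt"]

print("Stored column matches Price_in_Lakhs / Size_in_SqFt:", matches)
print(sample[["Price_per_SqFt", "PPSF_rupees"]])
```

Running the same comparison on the full `df` would confirm whether the relationship holds for every row.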

### 3. **PROPERTY SIZE RANGE** - *Size_in_SqFt 500 → 5000 sqft* 
The dataset therefore includes:
- 1BHKs
- mid-sized apartments
- large penthouses
- independent houses & villas

This breadth gives the model examples across the full range of sizes it is likely to see.

### 4. **HIGH CARDINALITY LOCALITIES**
- Locality unique = 500

This is valuable because in India locality is one of the strongest drivers of:
- Price
- Appreciation
- Investment score

We will later engineer **Locality Features (very important)**:

- Locality demand index
- Average PPSF per locality
- Price deviation from locality average
- Infrastructure score
- Construction density

These features move the model toward an **AI property evaluator in the style of NoBroker / MagicBricks / 99acres.**
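Two of the locality features listed above, average PPSF per locality and price deviation from the locality average, can be sketched with a grouped transform. The rows below are hypothetical, and the new column names (`ppsf`, `locality_avg_ppsf`, `ppsf_dev_from_locality`) are illustrative choices rather than names used elsewhere in the project.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset, values are illustrative.
sample = pd.DataFrame({
    "Locality": ["Locality_1", "Locality_1", "Locality_2", "Locality_2"],
    "Price_in_Lakhs": [100.0, 200.0, 300.0, 500.0],
    "Size_in_SqFt": [1000, 1000, 2000, 2000],
})

# Unrounded price per square foot, in lakhs.
sample["ppsf"] = sample["Price_in_Lakhs"] / sample["Size_in_SqFt"]

# Average PPSF per locality, broadcast back onto every row of that locality.
sample["locality_avg_ppsf"] = sample.groupby("Locality")["ppsf"].transform("mean")

# Relative deviation of each listing from its locality average.
sample["ppsf_dev_from_locality"] = sample["ppsf"] / sample["locality_avg_ppsf"] - 1.0

print(sample)
```

When price is the prediction target, these averages should be computed on the training split only and then mapped onto validation and test rows. Otherwise the feature leaks the target.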

### 5. **Public Transport Accessibility**

Values = High / Medium / Low

This is a potentially strong feature because metro access tends to raise prices:
- Delhi / Bangalore / Hyderabad metro connectivity may increase prices by roughly 10–25%

The model can capture this uplift if it is present in the data.
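Because the three levels have a natural order, one option is to encode them as ordered integers. The sketch below assumes the ordering Low < Medium < High; the column name `transport_access_level` is hypothetical.

```python
import pandas as pd

# Hypothetical sample using the three values observed in the dataset.
sample = pd.DataFrame({
    "Public_Transport_Accessibility": ["High", "Low", "Medium", "High"],
})

# Assumed ordering: Low < Medium < High.
ACCESS_ORDER = ["Low", "Medium", "High"]

access = pd.Categorical(
    sample["Public_Transport_Accessibility"],
    categories=ACCESS_ORDER,
    ordered=True,
)

# Integer codes follow the category order; values outside ACCESS_ORDER become -1.
sample["transport_access_level"] = access.codes

print(sample)
```

Tree-based models can use the integer level directly. For linear models, one-hot encoding may be safer because it does not assume equal spacing between levels.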

### 6. **Amenities Column (most powerful raw feature)**

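- This column contains comma-separated combinations of **Playground, Gym, Garden, Pool, Clubhouse**. It has 325 distinct strings, which matches the number of ordered selections of one to five of these items (5 + 20 + 60 + 120 + 120 = 325), so the same set of amenities appears in many different orders. We will transform it into binary engineered features.
- Amenities can raise prices by roughly 10–30% in India.

This is expected to become a major feature when modeling appreciation.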
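A minimal sketch of the binary transformation, assuming every value is separated by a comma followed by a space as in the preview rows. The `has_` prefix and the `amenity_count` column are illustrative names.

```python
import pandas as pd

# Amenity strings taken from the preview and the categorical summary above.
sample = pd.DataFrame({
    "Amenities": [
        "Playground, Gym, Garden, Pool, Clubhouse",
        "Clubhouse, Pool, Playground, Gym",
        "Pool",
    ]
})

# One 0/1 column per amenity, regardless of the order inside each string.
amenity_flags = sample["Amenities"].str.get_dummies(sep=", ").add_prefix("has_")

# Total number of amenities per listing.
sample["amenity_count"] = amenity_flags.sum(axis=1)

print(amenity_flags)
print(sample["amenity_count"].tolist())
```

This collapses the 325 order-dependent strings into five order-independent flags, which can be joined back to the main frame with `pd.concat([df, amenity_flags], axis=1)`.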

### 7. **Availability Status**
- Ready_to_Move
- Under_Construction

In India:

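- Under-construction property is typically **cheaper**, but can **appreciate more**

- Ready-to-move (RTM) property is **stable** and lower-risk

With a suitable target, the ML model could **predict appreciation probability**, which is what real users want. This dataset has no historical price or appreciation column, so that target has to be engineered.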
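Whether the price gap between the two statuses is visible in this dataset can be checked with a grouped median. The rows below are hypothetical, so the printed numbers illustrate the method only and say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset, values are illustrative.
sample = pd.DataFrame({
    "Availability_Status": [
        "Ready_to_Move", "Under_Construction",
        "Ready_to_Move", "Under_Construction",
    ],
    "Price_in_Lakhs": [120.0, 90.0, 200.0, 150.0],
    "Size_in_SqFt": [1000, 1000, 2000, 2000],
})

# Compare price per square foot (in lakhs) so that property size does not dominate.
sample["ppsf"] = sample["Price_in_Lakhs"] / sample["Size_in_SqFt"]

median_ppsf = sample.groupby("Availability_Status")["ppsf"].median()

print(median_ppsf)
```

Applying the same grouping to the full `df` during EDA would show whether under-construction listings are actually priced lower here.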