## Project Title: Real Estate Investment Advisor: Predicting Property Profitability & Future Value



## Skills takeaway from this project

Python, Machine Learning, EDA, Data Analysis, Feature Engineering, Regression, Classification, Streamlit, MLflow, Model Evaluation, Feature Scaling, Domain Understanding.



## GITHUB LINK
https://github.com/tiwariabhi374-lang/Real-Investment-Projec

## DOMAIN : Real Estate / Investment / Financial Analytics



## Problem statement

Develop a machine learning application to assist potential investors in making real estate decisions. The system should:


1.  Classify whether a property is a "Good Investment" (Classification).

2.  Predict the estimated property price after 5 years (Regression).

Use the provided dataset to preprocess and analyze the data, engineer relevant features, and deploy a user-interactive application using Streamlit that provides investment recommendations and price forecasts. MLflow will be used for experiment tracking.




## Business Use Cases



1.  Empower real estate investors with intelligent tools to assess long-term returns.

2.  Support buyers in choosing high-return properties in developing areas.

3.  Help real estate companies automate investment analysis for listings.

4.  Improve customer trust in real estate platforms with data-backed predictions.




## Basic Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
df = pd.read_csv("/content/india_housing_prices.csv")

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# columns of the dataset
df.columns

In [None]:
df.shape

In [None]:
df.size

In [None]:
df.info()

In [None]:
## null values
df.isnull().sum()

## Dataset Description:
1. ID : Unique identifier for each property record
2. State : State where the property is located
3. City : City of the property
4. Locality : Specific neighborhood or locality
5. Property_Type : Type of property (Apartment, Villa, House, etc.)
6. BHK : Number of bedrooms, hall, kitchen
7. Size_in_SqFt : Area of the property in square feet
8. Price_in_Lakhs : Price of the property in lakhs of rupees (1 lakh = 100,000)
9. Price_per_SqFt : Price divided by area; normalized price metric
10. Year_Built : Year when the property was constructed
11. Furnished_Status : Furnishing level (Unfurnished, Semi, Fully)
12. Floor_No : Floor number of the property
13. Total_Floors : Total number of floors in the building
14. Age_of_Property : Age of the property (Current Year - Year_Built)
15. Nearby_Schools : Number or rating of nearby schools
16. Nearby_Hospitals : Number of nearby hospitals
17. Public_Transport_Accessibility : Access to buses/metro/train
18. Parking_Space : Number of parking spots available
19. Security : Security features (Gated, CCTV, Guard)
20. Amenities : Amenities available (Gym, Pool, Clubhouse)
21. Facing : Direction the property faces (North, South, etc.)
22. Owner_Type : Owner type (Individual, Builder, Agent)
23. Availability_Status : Current status (Available, Under Construction, Sold)
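
Two of the columns described above are derived from others: `Age_of_Property` from `Year_Built`, and `Price_per_SqFt` from price and area. The sketch below recomputes both on a toy frame. The toy rows, the reference year 2025, the helper name `add_derived_columns`, and the column name `Price_per_SqFt_INR` are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

LAKH = 100_000  # 1 lakh = 100,000 rupees


def add_derived_columns(df: pd.DataFrame, current_year: int) -> pd.DataFrame:
    """Recompute the derived columns from their source columns."""
    out = df.copy()
    out["Age_of_Property"] = current_year - out["Year_Built"]
    # Hypothetical column: price per square foot expressed in rupees.
    out["Price_per_SqFt_INR"] = out["Price_in_Lakhs"] * LAKH / out["Size_in_SqFt"]
    return out


# Toy rows for illustration only (not real listings).
toy = pd.DataFrame({
    "Year_Built": [2000, 2015],
    "Size_in_SqFt": [1000, 2000],
    "Price_in_Lakhs": [50.0, 150.0],
})
derived = add_derived_columns(toy, current_year=2025)
print(derived)
```

Comparing recomputed values like these against the stored columns is one way to spot inconsistent rows before modeling.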


**APPROACH**

## Step 1: Data Processing

In [None]:
# ---------------------------------------------
# STEP 1 : DATA PREPROCESSING FOR HOUSING DATA
# ---------------------------------------------

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Load Dataset
df = pd.read_csv("india_housing_prices.csv")

# -------------------------------
# 1.1 HANDLE DUPLICATES & MISSING VALUES
# -------------------------------

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Identify numerical and categorical columns
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = df.select_dtypes(include=['object']).columns.tolist()

# Impute numerical missing values with median
num_imputer = SimpleImputer(strategy="median")
df[num_cols] = num_imputer.fit_transform(df[num_cols])

# Store original (imputed) Price_in_Lakhs before scaling for target calculation
# This ensures CAGR_5Y is not constant and uses meaningful price values
df_original_price_for_target = df["Price_in_Lakhs"].copy()

# Impute categorical missing values with the most frequent value in each column
cat_imputer = SimpleImputer(strategy="most_frequent")
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

# -------------------------------
# 1.2 NORMALIZE / SCALE NUMERICAL FEATURES
# -------------------------------

scaler = StandardScaler()

scaled_cols = ["Size_in_SqFt", "Age_of_Property", "Price_in_Lakhs"] # Price_in_Lakhs itself is scaled as a feature

for col in scaled_cols:
    if col in df.columns:
        df[col] = scaler.fit_transform(df[[col]])

# -------------------------------
# 1.3 ENCODE CATEGORICAL FEATURES
# -------------------------------

# Frequency Encoding for high-cardinality columns
if "Locality" in df.columns:
    locality_freq = df["Locality"].value_counts().to_dict()
    df["Locality_Freq"] = df["Locality"].map(locality_freq)

if "City" in df.columns:
    city_freq = df["City"].value_counts().to_dict()
    df["City_Freq"] = df["City"].map(city_freq)

# One-hot encoding for low-cardinality categorical features
# Removed 'Property_Type' from low_card_cols so it remains for EDA plots
low_card_cols = ["State", "Availability_Status"]

df = pd.get_dummies(df, columns=[col for col in low_card_cols if col in df.columns],
                    drop_first=True)

# -------------------------------
# 1.4 FEATURE ENGINEERING
# -------------------------------

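# Price per SqFt
# NOTE: Price_in_Lakhs and Size_in_SqFt were standardized in step 1.2, so this is a ratio of
# z-scores rather than a real price per square foot. The dataset's own Price_per_SqFt column
# holds the unscaled metric. A scaled size of exactly 0 gives +/-inf, which is replaced by NaN.
if "Price_in_Lakhs" in df.columns and "Size_in_SqFt" in df.columns:
    df["Price_per_SqFt_calculated"] = (df["Price_in_Lakhs"] / df["Size_in_SqFt"]).replace([np.inf, -np.inf], np.nan)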

# Amenities count
if "Amenities" in df.columns:
    df["Amenities_Count"] = df["Amenities"].apply(lambda x: len(str(x).split(",")))

# Parking / Security binary encoding
if "Parking_Space" in df.columns:
    df["Has_Parking"] = df["Parking_Space"].apply(lambda x: 1 if str(x).lower() in ["yes", "true"] else 0)

if "Security" in df.columns:
    df["Has_Security"] = df["Security"].apply(lambda x: 1 if str(x).lower() in ["yes", "true"] else 0)

# School Density Score
if "Nearby_Schools" in df.columns:
    df["School_Density_Score"] = np.minimum(df["Nearby_Schools"] / 5, 1)

# -------------------------------
# 1.5 CREATE LABEL: GOOD INVESTMENT
# -------------------------------

# Estimate future price using assumed annual growth rate.
# To make CAGR_5Y dynamic and avoid a constant value,
# we introduce a small random variation to the growth rate.
DEFAULT_GROWTH_RATE = 0.06  # 6% default if city growth not estimated

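# Generate growth rates with a slight variation for each property.
# A seeded generator keeps the synthetic growth rates, and therefore the labels, reproducible.
# df_original_price_for_target (unscaled, imputed Price_in_Lakhs) is used for the price calculation below.
rng = np.random.default_rng(42)
growth_rates = rng.normal(DEFAULT_GROWTH_RATE, 0.01, len(df))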
# Clip growth rates to a reasonable range (e.g., 1% to 15%) to avoid extreme values
growth_rates = np.clip(growth_rates, 0.01, 0.15)

df["Estimated_Price_5Y"] = df_original_price_for_target * ((1 + growth_rates) ** 5)

# Compute CAGR based on generated Estimated_Price_5Y and original (unscaled) price
df["CAGR_5Y"] = ((df["Estimated_Price_5Y"] / df_original_price_for_target) ** (1/5)) - 1

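# Create Good Investment label (threshold = 8%).
# The growth rates are drawn from N(6%, 1%), so 8% sits two standard deviations above the mean:
# only roughly 2% of properties get label 1, and the classes are heavily imbalanced.
df["Good_Investment"] = (df["CAGR_5Y"] >= 0.08).astype(int)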

# -------------------------------
# FINAL: PREPROCESSED DATA
# -------------------------------

df.to_csv("processed_housing_data.csv", index=False)

print("✅ Step 1 completed: processed_housing_data.csv saved successfully.")
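
The label logic in section 1.5 rests on compound growth: a price grown at annual rate g for 5 years becomes price × (1 + g)^5, and the CAGR formula inverts that. The sketch below works one example with plain Python. The function names `future_price` and `cagr` are illustrative and do not appear in the notebook.

```python
def future_price(price_now: float, annual_rate: float, years: int = 5) -> float:
    """Compound a price forward at a fixed annual growth rate."""
    return price_now * (1 + annual_rate) ** years


def cagr(price_now: float, price_future: float, years: int = 5) -> float:
    """Compound annual growth rate implied by two prices `years` apart."""
    return (price_future / price_now) ** (1 / years) - 1


# A property worth 100 lakhs growing at 6% per year.
p5 = future_price(100.0, 0.06)
recovered = cagr(100.0, p5)

print(round(p5, 2))         # price after 5 years, in lakhs
print(round(recovered, 4))  # the growth rate recovered from the two prices
```

Because `Estimated_Price_5Y` is built from the sampled growth rate, `CAGR_5Y` recovers that same rate (up to floating-point error), so the label depends only on the sampled rate and not on any property feature.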


**Insights**


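*   Data processing removes duplicate rows and imputes missing values: the median for numerical columns and the most frequent value for categorical columns.

*   `Size_in_SqFt`, `Age_of_Property`, and `Price_in_Lakhs` are standardized with `StandardScaler`.
*   `Locality` and `City` are frequency-encoded; `State` and `Availability_Status` are one-hot encoded.

*   Feature engineering adds a price-per-square-foot ratio, an amenities count, parking and security flags, and a school density score.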
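
The preprocessing step computes `Locality_Freq` and `City_Freq` from the whole dataset before any train/test split, so the counts include rows that later end up in the test set. A minimal sketch of an alternative, assuming the counts are learned on the training rows only, is shown below. The helper names `fit_frequency_map` and `apply_frequency_map` and the toy city values are hypothetical.

```python
import pandas as pd


def fit_frequency_map(train_col: pd.Series) -> dict:
    """Learn category counts from the training rows only."""
    return train_col.value_counts().to_dict()


def apply_frequency_map(col: pd.Series, freq_map: dict) -> pd.Series:
    """Map categories to their training counts; unseen categories get 0."""
    return col.map(freq_map).fillna(0).astype(int)


# Toy data for illustration only.
train = pd.Series(["Pune", "Pune", "Delhi", "Mumbai", "Pune"], name="City")
test = pd.Series(["Delhi", "Jaipur"], name="City")

freq_map = fit_frequency_map(train)
train_encoded = apply_frequency_map(train, freq_map)
test_encoded = apply_frequency_map(test, freq_map)

print(train_encoded.tolist())
print(test_encoded.tolist())
```

The same fit-on-train, apply-to-test pattern applies to the imputers and the scaler; the modeling pipeline later in the notebook already follows it for imputation and scaling by wrapping them in a `Pipeline`.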



## Step 2: Exploratory Data Analysis (EDA)


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported

# Explicitly load the processed data to ensure CAGR_5Y is available
df = pd.read_csv("processed_housing_data.csv")

# Display settings
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 5)

# -----------------------------
# 1. Price Trends by City
# -----------------------------

city_price = df.groupby('City')['Price_in_Lakhs'].mean().sort_values(ascending=False)

plt.figure(figsize=(12, 6))
city_price.plot(kind='bar')
plt.title("Average Property Price by City")
plt.ylabel("Average Price_in_Lakhs (standardized)")  # the processed file stores the scaled price
plt.xlabel("City")
plt.xticks(rotation=45)
plt.show()

# -----------------------------
# 2. Correlation between Area and Investment Return
# -----------------------------

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Size_in_SqFt', y='CAGR_5Y', hue='Good_Investment')
plt.title("Size in SqFt vs. CAGR 5Y")
plt.xlabel("Size in SqFt")
plt.ylabel("CAGR 5Y")
plt.show()

# Correlation value
corr_area = df['Size_in_SqFt'].corr(df['CAGR_5Y'])
print("Correlation between Size in SqFt and CAGR 5Y:", corr_area)

# Removed: Section for 'Crime_Rate' as the column is not available.

# Removed: Section for 'Infrastructure_Score' and 'Resale_Value' as 'Infrastructure_Score' is not available.


**Insights**


*   Price trends by city: the bar chart ranks cities by their average `Price_in_Lakhs`.

*   The scatter plot and the correlation coefficient show how property size (`Size_in_SqFt`) relates to the investment return (`CAGR_5Y`).
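
The correlation printed above is the Pearson coefficient. The sketch below computes it by hand and checks it against `Series.corr`. It also simulates the setup from Step 1, where the growth rate is drawn at random independently of property size, to illustrate why a value near zero would be expected there. The sample size, the seed, and the uniform size range are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Perfectly linear toy relationship.
size = pd.Series([500, 1000, 1500, 2000])
ret = pd.Series([0.05, 0.06, 0.07, 0.08])
r_linear = pearson_r(size, ret)

# Growth rates sampled independently of size, as in the label construction.
rng = np.random.default_rng(0)
sim_size = rng.uniform(500, 5000, 10_000)
sim_growth = rng.normal(0.06, 0.01, 10_000)
r_independent = pearson_r(sim_size, sim_growth)

print(r_linear, r_independent)
```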



## Distribution of Prices, Area, Price per Sqft, Outliers & Relationships

In [None]:
# Robust EDA: handles missing / variably-named columns gracefully
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported if not already
import numpy as np # Ensure numpy is imported if not already

sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 5)


# Show available columns to help debugging
print("Available columns:", list(df.columns))

# Helper to choose first matching column from candidates
def get_col(df, candidates):
    for c in candidates:
        if c in df.columns:
            return c
    return None

# Try common variants for the columns we need
price_col = get_col(df, ['Price', 'Price_in_Lakhs', 'price', 'price_in_lakhs', 'TotalPrice'])
area_col  = get_col(df, ['Area', 'Size_in_SqFt', 'Size', 'area', 'Builtup_Area', 'Total_Sqft'])
prop_type_col = get_col(df, ['Property_Type', 'Property type', 'property_type', 'Type', 'propertytype'])
price_psq_col = get_col(df, ['Price_per_SqFt', 'Price_per_sqft', 'price_per_sqft', 'price_per_sqft_in_lakhs'])
price_per_sqft_computed = False

print(f"Using columns -> price: {price_col}, area: {area_col}, property_type: {prop_type_col}, price_per_sqft: {price_psq_col}")

# Compute Price_per_SqFt if missing and if price & area exist
if price_psq_col is None:
    if price_col and area_col:
        # convert to numeric and avoid division by zero / NaN
        df[price_col] = pd.to_numeric(df[price_col], errors='coerce')
        df[area_col]  = pd.to_numeric(df[area_col], errors='coerce')
        # If Price is in lakhs and area is in sqft, this will be lakhs per sqft, which is fine as a relative measure.
        df['Price_per_SqFt'] = df[price_col] / df[area_col]
        price_psq_col = 'Price_per_SqFt'
        price_per_sqft_computed = True
        print("Computed Price_per_SqFt as", price_psq_col)
    else:
        print("Price_per_SqFt not available and cannot be computed (missing price or area).")

# Ensure numeric types for plotting
for col in [price_col, area_col, price_psq_col]:
    if col and col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# 1) Distribution of property prices
if price_col and price_col in df.columns:
    plt.figure(figsize=(10,5))
    sns.histplot(df[price_col].dropna(), kde=True)
    plt.title("Distribution of Property Prices (" + price_col + ")")
    plt.xlabel("Price")
    plt.show()
else:
    print("Skipping price distribution plot: price column not found.")

# 2) Distribution of property sizes (area)
if area_col and area_col in df.columns:
    plt.figure(figsize=(10,5))
    sns.histplot(df[area_col].dropna(), kde=True)
    plt.title("Distribution of Property Sizes (" + area_col + ")")
    plt.xlabel("Area (sqft)")
    plt.show()
else:
    print("Skipping area distribution plot: area column not found.")

# 3) How does price per sqft vary by property type?
if price_psq_col and price_psq_col in df.columns and prop_type_col and prop_type_col in df.columns:
    plt.figure(figsize=(12,6))
    sns.boxplot(data=df, x=prop_type_col, y=price_psq_col)
    plt.title(f"Price per Sqft ({price_psq_col}) by Property Type ({prop_type_col})")
    plt.xticks(rotation=45)
    plt.show()
elif price_psq_col and price_psq_col in df.columns:
    print("Property type column not found; showing overall distribution of price per sqft instead.")
    plt.figure(figsize=(10,5))
    sns.histplot(df[price_psq_col].dropna(), kde=True)
    plt.title("Distribution of Price per Sqft (" + price_psq_col + ")")
    plt.show()
else:
    print("Skipping price-per-sqft by property type: necessary columns missing.")

# 4) Relationship between property size and price (scatter + correlation)
if price_col and price_col in df.columns and area_col and area_col in df.columns:
    plt.figure(figsize=(10,6))
    sns.scatterplot(data=df, x=area_col, y=price_col, alpha=0.4)
    plt.title(f"Property Size ({area_col}) vs Price ({price_col})")
    plt.xlabel("Area (sqft)")
    plt.ylabel("Price")
    plt.show()
    corr = df[area_col].corr(df[price_col])
    print(f"Correlation between {area_col} and {price_col}: {corr:.4f}")
else:
    print("Skipping size vs price scatter: missing area or price column.")

# 5) Outliers: detect using IQR for price per sqft and area (if present)
def detect_outliers_iqr(series):
    s = series.dropna()
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    out = s[(s < lower) | (s > upper)]
    return lower, upper, out

if price_psq_col and price_psq_col in df.columns:
    low, high, outliers = detect_outliers_iqr(df[price_psq_col])
    print(f"Price per sqft ({price_psq_col}) IQR bounds: lower={low:.3f}, upper={high:.3f}, outliers_count={len(outliers)}")
    plt.figure(figsize=(10,4))
    sns.boxplot(x=df[price_psq_col].dropna())
    plt.title("Boxplot: Price per Sqft (" + price_psq_col + ")")
    plt.show()
else:
    print("Skipping price per sqft outlier detection: column not found.")

if area_col and area_col in df.columns:
    low_a, high_a, outliers_a = detect_outliers_iqr(df[area_col])
    print(f"Area ({area_col}) IQR bounds: lower={low_a:.1f}, upper={high_a:.1f}, outliers_count={len(outliers_a)}")
    plt.figure(figsize=(10,4))
    sns.boxplot(x=df[area_col].dropna())
    plt.title("Boxplot: Area (" + area_col + ")")
    plt.show()
else:
    print("Skipping area outlier detection: area column not found.")

# OPTIONAL: If property_type missing, show top few property types if variant exists
if not prop_type_col:
    # try to infer a 'type' from other columns like 'Property Category' variants
    alt_prop_type_col = get_col(df, ['Property Category', 'Category', 'House_Type'])
    if alt_prop_type_col:
        print(f"Using alternative property-type column: {alt_prop_type_col}")
        prop_type_col = alt_prop_type_col
    else:
        print("No property-type column found in dataset; some grouped analyses skipped.")

# Summary of findings placeholders (replace with actual numeric summaries as needed)
print("\nSummary (quick stats):")
for col in [price_col, area_col, price_psq_col]:
    if col and col in df.columns:
        print(f" - {col}: count={df[col].notna().sum()}, mean={df[col].mean():.3f}, median={df[col].median():.3f}, std={df[col].std():.3f}")


**Insights**


*   The plots cover the distributions of price, area, and price per square foot, and the relationship between size and price.
*   The IQR rule is used to detect outliers in price per square foot and in area.
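
The IQR rule flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The sketch below works a small example and adds one possible follow-up, capping values at the bounds. Capping is not something the notebook does; it is shown only as an option, and the helper names `iqr_bounds` and `clip_to_iqr` are hypothetical.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) IQR fences for a numeric series."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_to_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside the IQR fences instead of dropping them."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


# Toy values: Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7.
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]
capped = clip_to_iqr(prices)

print(lower, upper)
print(outliers.tolist())
print(capped.tolist())
```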



## Location-based Analysis


In [None]:
# ------------------------------------------
# Load the original dataset for this EDA cell so the original column names and
# unscaled values are available: 'State' was one-hot encoded and several
# numerical columns were scaled during preprocessing.
# ------------------------------------------
df_original_for_eda = pd.read_csv("/content/india_housing_prices.csv", engine='python')

# ------------------------------------------
# 1. Average price per sq ft by state
# ------------------------------------------
avg_price_sqft_state = (
    df_original_for_eda.groupby("State")["Price_per_SqFt"]
    .mean()
    .reset_index()
    .sort_values("Price_per_SqFt", ascending=False)
)

print("\nAverage Price per Sq Ft by State:")
print(avg_price_sqft_state)


# ------------------------------------------
# 2. Average property price by city
# ------------------------------------------
avg_price_city = (
    df_original_for_eda.groupby("City")["Price_in_Lakhs"]
    .mean()
    .reset_index()
    .sort_values("Price_in_Lakhs", ascending=False)
)

print("\nAverage Property Price by City:")
print(avg_price_city)


# ------------------------------------------
# 3. Median age of properties by locality
# ------------------------------------------
# Use 'Age_of_Property' directly as it exists in the original dataset and represents age
if "Age_of_Property" not in df_original_for_eda.columns:
    raise ValueError("Dataset must contain 'Age_of_Property' column for this analysis.")

median_age_locality = (
    df_original_for_eda.groupby("Locality")["Age_of_Property"]
    .median()
    .reset_index()
    .sort_values("Age_of_Property")
)

print("\nMedian Age of Properties by Locality:")
print(median_age_locality)


# ------------------------------------------
# 4. BHK distribution across cities
# ------------------------------------------
bhk_distribution = (
    df_original_for_eda.groupby(["City", "BHK"])
    .size()
    .reset_index(name="Count")
    .sort_values(["City", "BHK"])
)

print("\nBHK Distribution Across Cities:")
print(bhk_distribution)


# ------------------------------------------
# 5. Price trends for top 5 most expensive localities
# ------------------------------------------
# First find average price for each locality
locality_avg_price = (
    df_original_for_eda.groupby("Locality")["Price_in_Lakhs"]
    .mean()
    .reset_index()
    .sort_values("Price_in_Lakhs", ascending=False)
)

# Select top 5
top_5_localities = locality_avg_price.head(5)["Locality"].tolist()

print("\nTop 5 Most Expensive Localities:")
print(top_5_localities)

# Filter data for only these localities
df_top5_eda = df_original_for_eda[df_original_for_eda["Locality"].isin(top_5_localities)]

# Use 'Year_Built' as the time axis: this gives the average price by construction year,
# not a trend of transaction prices over time.
if "Year_Built" in df_top5_eda.columns:
    price_trends = (
        df_top5_eda.groupby(["Locality", "Year_Built"])["Price_in_Lakhs"]
        .mean()
        .reset_index()
        .sort_values(["Locality", "Year_Built"])
    )
else:
    price_trends = "Dataset has no Year_Built column to compute trends."

print("\nPrice Trends for Top 5 Localities:")
print(price_trends)


**Insights**


*   Average price per square foot by state.
*   Average property price by city.
*   Median age of properties by locality.
*   BHK distribution across cities.
*   Price trends (by year built) for the top 5 most expensive localities.
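
A ranking by mean price can be dominated by localities with very few listings. The sketch below, on toy data, reports mean, median, and listing count together and drops localities under a minimum count. The helper name `locality_price_summary` and the `min_listings` threshold are assumptions for illustration.

```python
import pandas as pd


def locality_price_summary(df: pd.DataFrame, min_listings: int = 2) -> pd.DataFrame:
    """Mean, median, and listing count per locality, ignoring sparse localities."""
    summary = (
        df.groupby("Locality")["Price_in_Lakhs"]
        .agg(mean_price="mean", median_price="median", listings="count")
        .reset_index()
    )
    kept = summary[summary["listings"] >= min_listings]
    return kept.sort_values("mean_price", ascending=False).reset_index(drop=True)


# Toy listings for illustration only.
toy = pd.DataFrame({
    "Locality": ["A", "A", "B", "C", "C", "C"],
    "Price_in_Lakhs": [100.0, 200.0, 900.0, 50.0, 60.0, 70.0],
})
summary = locality_price_summary(toy)
print(summary)
```

Locality B has the highest price but only one listing, so it is excluded from the ranking here.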





In [None]:
import pandas as pd

# Step 1: Load your dataset
# Replace the filename with the actual path to your CSV
df_original_for_eda = pd.read_csv("india_housing_prices.csv", engine='python')

# Step 2: Inspect the first few rows to confirm column names
print(df_original_for_eda.head())
print(df_original_for_eda.columns)


## Feature Relationship & Correlation - Full Analysis Code

In [None]:
# -----------------------------------------------------------
# 1. Correlation among all numeric features
# -----------------------------------------------------------
numeric_df = df.select_dtypes(include=['int64', 'float64'])

correlation_matrix = numeric_df.corr()

print("\nCorrelation Between Numeric Features:")
print(correlation_matrix)


# -----------------------------------------------------------
# 2. Relationship: Nearby Schools vs Price per sq ft
# -----------------------------------------------------------
if "Nearby_Schools" in df.columns:
    school_relation = (
        df.groupby("Nearby_Schools")["Price_per_SqFt"] # Corrected column name
        .mean()
        .reset_index()
        .sort_values("Price_per_SqFt", ascending=False) # Corrected column name
    )
else:
    school_relation = "Column 'Nearby_Schools' not found."

print("\nNearby Schools vs Price per Sq Ft:")
print(school_relation)


# -----------------------------------------------------------
# 3. Relationship: Nearby Hospitals vs Price per sq ft
# -----------------------------------------------------------
if "Nearby_Hospitals" in df.columns:
    hospital_relation = (
        df.groupby("Nearby_Hospitals")["Price_per_SqFt"] # Corrected column name
        .mean()
        .reset_index()
        .sort_values("Price_per_SqFt", ascending=False) # Corrected column name
    )
else:
    hospital_relation = "Column 'Nearby_Hospitals' not found."

print("\nNearby Hospitals vs Price per Sq Ft:")
print(hospital_relation)


# -----------------------------------------------------------
# 4. Price variation by furnished status
# -----------------------------------------------------------
if "Furnished_Status" in df.columns: # Corrected column name
    furnished_price = (
        df.groupby("Furnished_Status")["Price_in_Lakhs"] # Corrected column name
        .mean()
        .reset_index()
        .sort_values("Price_in_Lakhs", ascending=False) # Corrected column name
    )
else:
    furnished_price = "Column 'Furnished_Status' not found."

print("\nPrice Variation by Furnished Status:")
print(furnished_price)


# -----------------------------------------------------------
# 5. Price per sq ft by property facing direction
# -----------------------------------------------------------
if "Facing" in df.columns:
    facing_price_sqft = (
        df.groupby("Facing")["Price_per_SqFt"] # Corrected column name
        .mean()
        .reset_index()
        .sort_values("Price_per_SqFt", ascending=False) # Corrected column name
    )
else:
    facing_price_sqft = "Column 'Facing' not found."

print("\nPrice per Sq Ft by Property Facing Direction:")
print(facing_price_sqft)


**Insights**
Feature correlations, and the relationship between price and nearby facilities (schools, hospitals), furnished status, and facing direction.
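
A full correlation matrix is hard to read when there are many columns. One way to summarize it, sketched below on toy data, is to rank features by the absolute value of their correlation with a single target column. The helper name `rank_by_correlation` and the toy values are hypothetical.

```python
import pandas as pd


def rank_by_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of every numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "Price_per_SqFt": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Nearby_Schools": [1, 2, 3, 4, 5],
    "Age_of_Property": [5, 4, 3, 2, 2],
    "Floor_No": [3, 1, 4, 1, 5],
    "Facing": ["North", "South", "East", "West", "North"],  # non-numeric, ignored
})
ranked = rank_by_correlation(toy, "Price_per_SqFt")
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as property age in this toy example, near the top.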

## Investment / Amenities / Ownership Analysis: Full Python Code

In [None]:
# ---------------------------------------------------------------------
# 1. Properties by Owner Type
# ---------------------------------------------------------------------
if "Owner_Type" in df.columns:
    owner_type_count = df["Owner_Type"].value_counts().reset_index()
    owner_type_count.columns = ["Owner_Type", "Count"]
else:
    owner_type_count = "Column 'Owner_Type' not found."

print("\nProperties by Owner Type:")
print(owner_type_count)


# ---------------------------------------------------------------------
# 2. Properties by Availability Status
# ---------------------------------------------------------------------
if "Availability_Status" in df.columns: # Corrected 'Availability' to 'Availability_Status'
    availability_count = df["Availability_Status"].value_counts().reset_index()
    availability_count.columns = ["Availability_Status", "Count"]
else:
    availability_count = "Column 'Availability_Status' not found." # Corrected message

print("\nProperties by Availability Status:")
print(availability_count)


# ---------------------------------------------------------------------
# 3. Parking Space vs Property Price
# ---------------------------------------------------------------------
parking_cols = [col for col in df.columns if "parking" in col.lower()]

if parking_cols:
    parking_col = parking_cols[0]
    parking_price_relation = (
        df.groupby(parking_col)["Price_in_Lakhs"] # Corrected 'Price' to 'Price_in_Lakhs'
        .mean()
        .reset_index()
        .sort_values("Price_in_Lakhs", ascending=False) # Corrected 'Price' to 'Price_in_Lakhs'
    )
else:
    parking_price_relation = "No parking-related column found."

print("\nParking Space vs Property Price:")
print(parking_price_relation)


# ---------------------------------------------------------------------
# 4. Amenities vs Price per Sq Ft
# ---------------------------------------------------------------------
amenity_keywords = ["Pool", "Gym", "Garden", "Lift", "Security", "Club", "Play", "Parking"]

amenity_cols = [
    col for col in df.columns
    if any(keyword.lower() in col.lower() for keyword in amenity_keywords)
]

amenity_price_effect = {}

if amenity_cols:
    for col in amenity_cols:
        # Check if 'Amenities' column exists before trying to access it
        if col == "Amenities": # 'Amenities' itself is a string of comma-separated values, not a binary amenity
            # This amenity column is not suitable for binary grouping in this way
            continue

        if df[col].nunique() <= 2:  # Only binary amenities
            effect = df.groupby(col)["Price_per_SqFt"].mean().reset_index() # Corrected 'Price_per_sqft'
            amenity_price_effect[col] = effect
else:
    amenity_price_effect = "No amenity columns found."

print("\nAmenities' Effect on Price per Sq Ft:")
print(amenity_price_effect)


# ---------------------------------------------------------------------
# 5. Public Transport Accessibility vs Price per Sq Ft / Investment Potential
# ---------------------------------------------------------------------
transport_cols = [col for col in df.columns if "Transport" in col or "Metro" in col or "Bus" in col]

if transport_cols:
    transport_col = transport_cols[0]  # Using the first relevant column

    # Relationship with price per sq ft
    transport_price_relation = (
        df.groupby(transport_col)["Price_per_SqFt"] # Corrected 'Price_per_sqft'
        .mean()
        .reset_index()
        .sort_values("Price_per_SqFt", ascending=False) # Corrected 'Price_per_sqft'
    )

    # Investment potential = price appreciation or price difference across levels
    # Here we use Price_per_SqFt as proxy since appreciation data isn't available
    investment_potential = transport_price_relation.copy()
else:
    transport_price_relation = "No transport-related column found."
    investment_potential = "No transport-related column found."

print("\nPublic Transport Accessibility vs Price per Sq Ft:")
print(transport_price_relation)

print("\nPublic Transport Accessibility vs Investment Potential:")
print(investment_potential)


**Insights**
Investment and ownership analysis: property counts by owner type and availability status, and how parking space, amenities, and public transport accessibility relate to price and price per square foot.
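
The code above skips the `Amenities` column because it holds a comma-separated string rather than a binary flag. A minimal sketch of one way to make it usable, assuming amenities are separated by commas with optional spaces, is to expand it into one indicator column per amenity with `Series.str.get_dummies`. The helper name `expand_amenities`, the `Amenity_` prefix, and the toy values are hypothetical.

```python
import pandas as pd


def expand_amenities(amenities: pd.Series) -> pd.DataFrame:
    """Turn 'Gym, Pool' style strings into one 0/1 column per amenity."""
    cleaned = (
        amenities.fillna("")
        .str.replace(r"\s*,\s*", ",", regex=True)  # drop spaces around commas
        .str.strip()
    )
    return cleaned.str.get_dummies(sep=",").add_prefix("Amenity_")


# Toy values for illustration only; the last row has no amenities recorded.
toy = pd.Series(["Gym, Pool", "Pool", "Clubhouse,Gym", None])
flags = expand_amenities(toy)
print(flags)
```

Each resulting column is binary, so it can be grouped against `Price_per_SqFt` in the same way as the other binary amenity columns.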

## DATA CLEANING

**Common Real Estate Dataset Analyses**


* Descriptive statistics: Average, median, and distribution of housing prices by city, neighborhood, or property type.

*  Trend analysis: Price changes over time (monthly/quarterly/yearly).

*  Feature impact: How variables like square footage, number of bedrooms, or proximity to amenities affect price.

*  Segmentation: Grouping properties into clusters (e.g., affordable, mid-range, luxury).

*  Predictive modeling: Building regression or machine learning models to forecast housing prices.
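
The segmentation idea in the list above can be sketched with quantile bins: `pd.qcut` splits prices into equally sized groups, which are then labeled affordable, mid-range, and luxury. The use of three equal-frequency bins, the helper name `segment_by_price`, and the toy prices are assumptions; the notebook does not prescribe a segmentation method.

```python
import pandas as pd


def segment_by_price(prices: pd.Series,
                     labels=("Affordable", "Mid-range", "Luxury")) -> pd.Series:
    """Assign each price to an equal-frequency segment."""
    return pd.qcut(prices, q=len(labels), labels=list(labels))


# Toy prices in lakhs, for illustration only.
prices = pd.Series([20.0, 35.0, 50.0, 80.0, 120.0, 400.0])
segments = segment_by_price(prices)
print(segments.tolist())
```

Quantile bins adapt to the price distribution of the data at hand; fixed rupee thresholds would be an alternative when segments need to stay comparable across cities.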











In [None]:
# --- Setup ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib # Import joblib to save models

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression # For classification
from sklearn.ensemble import RandomForestClassifier # For classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc # Added classification metrics

# Optional: statsmodels for detailed regression summaries
import statsmodels.api as sm

pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
sns.set(style='whitegrid', context='talk')

# --- Load dataset ---
# Using the preprocessed dataset from Step 1
df = pd.read_csv('processed_housing_data.csv')

# Ensure 'Good_Investment' is int for classification
df['Good_Investment'] = df['Good_Investment'].astype(int)

# --- Basic normalization of columns (adjust these mappings to your actual schema) ---
# The preprocessor pipeline handles these for modeling

# --- Quick data health check ---
print('Shape:', df.shape)
print('Missing values (top 10):\n', df.isna().sum().sort_values(ascending=False).head(10))

# --- Predictive modeling (Classification for Good_Investment) ---
print('\n=== Predictive modeling (Classification for Good_Investment) ===')

target = 'Good_Investment' # Changed target to classification label

# Select features
# Using columns that are likely to be good predictors and are not targets or IDs
# Exclude original Price_in_Lakhs as it's correlated with the target through CAGR_5Y
# Exclude 'Estimated_Price_5Y' and 'CAGR_5Y' as they are derived from price and used to create the target
# Exclude 'ID', 'Locality', 'City', 'State' as their frequency encoded versions or one-hot encoded versions are used.
# Exclude 'Amenities' as Amenities_Count is derived.
# Exclude 'Price_per_SqFt_calculated' if original 'Price_per_SqFt' is kept.

feature_candidates = [
    'BHK', 'Size_in_SqFt', 'Price_per_SqFt', 'Year_Built', 'Furnished_Status',
    'Floor_No', 'Total_Floors', 'Age_of_Property', 'Nearby_Schools', 'Nearby_Hospitals',
    'Public_Transport_Accessibility', 'Parking_Space', 'Security', 'Amenities_Count',
    'Facing', 'Owner_Type', 'Locality_Freq', 'City_Freq', 'Has_Parking', 'Has_Security',
    'School_Density_Score'
]
# Add one-hot encoded state and availability status features, dynamically check existing columns
state_cols = [col for col in df.columns if col.startswith('State_')]
availability_cols = [col for col in df.columns if col.startswith('Availability_Status_')]
property_type_cols = [col for col in df.columns if col.startswith('Property_Type_')]

feature_candidates.extend(state_cols)
feature_candidates.extend(availability_cols)
feature_candidates.extend(property_type_cols)

# Ensure no target-related columns or highly correlated identifiers are in features
features = [f for f in feature_candidates if f in df.columns and f not in ['ID', target, 'Estimated_Price_5Y', 'CAGR_5Y']]

X_cols = features

# Drop rows with missing target
model_df = df[X_cols + [target]].copy()
model_df = model_df[model_df[target].notna()]

# Split numeric and categorical
numeric_features = model_df[X_cols].select_dtypes(include=np.number).columns.tolist()
categorical_features = model_df[X_cols].select_dtypes(include='object').columns.tolist()

# Preprocess
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Keep other columns if any
)

X = model_df[X_cols]
y = model_df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y # Stratify for classification target
)

# Model 1: Logistic Regression
logreg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42, class_weight='balanced')) # Changed to LogisticRegression and added class_weight
])

# Model 2: Random Forest Classifier
rf_classifier = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=100, max_depth=None, random_state=42, n_jobs=-1, class_weight='balanced'
    )) # Changed to RandomForestClassifier and added class_weight
])

# Fit and evaluate Logistic Regression
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
y_prob_logreg = logreg.predict_proba(X_test)[:, 1] # For ROC curve

print('\nLogistic Regression performance (Classification):')
print('Accuracy:', accuracy_score(y_test, y_pred_logreg))
print('Classification Report:\n', classification_report(y_test, y_pred_logreg))

# Fit and evaluate Random Forest Classifier
rf_classifier.fit(X_train, y_train)
y_pred_rf_classifier = rf_classifier.predict(X_test)
y_prob_rf_classifier = rf_classifier.predict_proba(X_test)[:, 1] # For ROC curve

print('\nRandom Forest Classifier performance (Classification):')
print('Accuracy:', accuracy_score(y_test, y_pred_rf_classifier))
print('Classification Report:\n', classification_report(y_test, y_pred_rf_classifier))

# Save the best classification model (RandomForestClassifier in this case) for the Streamlit app
joblib.dump(rf_classifier, 'investment_model.pkl')
print("Saved investment_model.pkl")

# Pass variables to the next cell for plotting
# Using Random Forest Classifier results for the evaluation plot
global y_test_for_plot, y_pred_for_plot, y_prob_for_plot
y_test_for_plot = y_test
y_pred_for_plot = y_pred_rf_classifier
y_prob_for_plot = y_prob_rf_classifier

print('\n=== Done ===')


**Insights**


*   Data cleaning: descriptive analysis of the average, median and distribution of prices by city, property type and neighbourhood.

*  Trend analysis: how housing prices change over time, by month and by year.

*   Feature impact: how house design, number of bedrooms, location, amenities and accessibility affect price and affordability.

*  Segmentation of the properties by price and by price per sqft into tiers such as Average, Median and Luxury.

*  Regression and classification models to predict property prices and investment quality, evaluated for accuracy.
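
The segmentation idea in the list above can be sketched with `pandas.qcut`. This is a minimal illustration on made-up listings, assuming three equal-sized tiers. The tier names (`Low`, `Mid`, `Luxury`), the tertile split and the helper column `ppsf` are assumptions for the example, not thresholds taken from the notebook.

```python
import pandas as pd

# Made-up listings for illustration only; column names follow the dataset.
listings = pd.DataFrame({
    "Price_in_Lakhs": [40.0, 55.0, 90.0, 150.0, 210.0, 320.0],
    "Size_in_SqFt": [800, 1000, 1200, 1500, 1750, 2000],
})

# Price per square foot in rupees (1 lakh = 100,000 rupees).
listings["ppsf"] = listings["Price_in_Lakhs"] * 100_000 / listings["Size_in_SqFt"]

# Split into three equal-sized tiers; the tier names are placeholders.
listings["segment"] = pd.qcut(listings["ppsf"], q=3, labels=["Low", "Mid", "Luxury"])

print(listings[["ppsf", "segment"]])
```

Quantile-based tiers adapt to whatever price range the data has. Fixed rupee thresholds would be the alternative if the tiers need to mean the same thing across cities.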





## DATA VISUALIZATION


*  Heatmaps showing price variation across regions.

*  Boxplots comparing property types.

*  Time-series line charts of average prices.

*  Scatter plots of price vs. square footage.



In [None]:
import matplotlib.pyplot as plt # Import matplotlib.pyplot
import seaborn as sns
import pandas as pd # Ensure pandas is imported

# --- 1. Heatmap: price variation across regions by Property Type ---
# Average price by city and property type
pivot = df.pivot_table(values='Price_in_Lakhs',
                       index='City',
                       columns='Property_Type',
                       aggfunc='mean')

plt.figure(figsize=(15, 10)) # Increased figure size for better readability
sns.heatmap(pivot, cmap="YlOrRd", annot=True, fmt=".1f") # Added annot=True and fmt for values
plt.title("Heatmap of Average Housing Prices (Lakhs) by City & Property Type")
plt.tight_layout()
plt.show()

# --- 2. Boxplots: comparing property types ---
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Property_Type', y='Price_in_Lakhs')
plt.title("Price Distribution by Property Type")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# --- 3. Time-series line chart: average prices by Year_Built ---
# Use Year_Built as a proxy for time trend if a transaction date is not available
# Note: This represents prices of properties built in a given year, not market transaction trends over time.
yearly_prices = df.groupby('Year_Built')['Price_in_Lakhs'].mean().reset_index()

plt.figure(figsize=(12, 6))
sns.lineplot(data=yearly_prices, x='Year_Built', y='Price_in_Lakhs')
plt.title("Average Housing Prices (Lakhs) by Year Built")
plt.xlabel("Year Built")
plt.ylabel("Average Price (Lakhs)")
plt.tight_layout()
plt.show()

# --- 4. Scatter plot: price vs. square footage ---
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Size_in_SqFt', y='Price_in_Lakhs', alpha=0.6)
plt.title("Scatter Plot of Price (Lakhs) vs. Square Footage")
plt.xlabel("Square Footage")
plt.ylabel("Price (Lakhs)")
plt.tight_layout()
plt.show()


**Insights**


*  Heatmap showing how average price varies across cities and property types.

*  Boxplots comparing price distributions across property types.

*  Line chart of average price by year built, used as a proxy for a time trend.

*  Scatter plot of price vs. square footage.




## STEP 3 : MODEL DEVELOPMENT

## TEST AND TRAIN THE MODEL

Trained classification model with high accuracy for investment prediction.


In [None]:
import pandas as pd

df = pd.read_csv("india_housing_prices.csv")
print(df.columns)

In [None]:
# -----------------------
# 1) Price-based Rule
# -----------------------
median_price = df["Price_in_Lakhs"].median()
df["rule_price"] = df["Price_in_Lakhs"] <= median_price

# -----------------------
# 2) PPSF (Locality Median)
# -----------------------
df["locality_ppsf_median"] = df.groupby("Locality")["Price_per_SqFt"].transform("median")
df["rule_ppsf"] = df["Price_per_SqFt"] <= df["locality_ppsf_median"]

# -----------------------
# 3) Multi-Factor Score
# -----------------------
df["locality_area_median"] = df.groupby("Locality")["Size_in_SqFt"].transform("median")

df["score"] = (
    (df["BHK"] >= 3).astype(int) +
    (df["Availability_Status"].isin(["Ready to Move", "ReadyToMove"])).astype(int) +
    (df["Size_in_SqFt"] >= df["locality_area_median"]).astype(int)
)

df["rule_multifactor"] = df["score"] >= 2

# -----------------------
# 4) Final Label
# -----------------------
df["Good_Investment"] = (
    df["rule_price"] |
    df["rule_ppsf"] |
    df["rule_multifactor"]
).map({True: "Good", False: "Bad"})

df.to_csv("housing_dataset_with_labels.csv", index=False)

print("✔ Good Investment target created successfully!")


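**Insights**
The "Good_Investment" target is built from three rules: a price-based rule (price at or below the overall median), a price-per-sqft rule (at or below the locality median), and a multi-factor score (BHK, availability and size).
A property is labelled "Good" if it passes any one of these rules.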
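
Because the final label is the logical OR of three rules, it is worth checking how balanced the resulting classes are. The sketch below uses made-up boolean rules, assuming each one is true for about half the rows and that the rules are independent. Neither assumption is guaranteed to hold for the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Three independent made-up boolean rules, each true for about half the rows.
rules = pd.DataFrame({
    "rule_price": rng.random(n) < 0.5,
    "rule_ppsf": rng.random(n) < 0.5,
    "rule_multifactor": rng.random(n) < 0.5,
})

# Same combination as in the cell above: any passing rule gives "Good".
label = rules.any(axis=1).map({True: "Good", False: "Bad"})

# Share of each class. With independent 50% rules, roughly
# 1 - 0.5**3 = 87.5% of rows end up labelled "Good".
print(label.value_counts(normalize=True))
```

On the real data, `df["Good_Investment"].value_counts(normalize=True)` gives the same check. A heavily skewed split is a reason to report per-class metrics alongside accuracy.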

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib

# Load updated dataset
df = pd.read_csv("housing_dataset_with_labels.csv")

# Encode categorical columns
cat_cols = ["State", "City", "Locality", "Property_Type",
            "Furnished_Status", "Facing", "Owner_Type",
            "Availability_Status"]

for col in cat_cols:
    df[col] = df[col].astype(str)
    df[col] = LabelEncoder().fit_transform(df[col])

# Create target variable
y = df["Good_Investment"].map({"Good": 1, "Bad": 0})

# Select features
X = df[[
    "Size_in_SqFt",
    "Price_in_Lakhs",
    "Price_per_SqFt",
    "BHK",
    "City",
    "Locality",
    "Availability_Status"
]]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(
    n_estimators=300,
    max_depth=12,
    min_samples_split=4,
    random_state=42
)

model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)
print("🔥 Model Accuracy:", acc)
print("\n📊 Classification Report:\n", classification_report(y_test, y_pred))

# Save model
joblib.dump(model, "good_investment_model.pkl")


**Insights**
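Classification model: the Random Forest predicts the "Good Investment" label with about 98% accuracy on the test split. Because the label is derived by rule from price, size, BHK and availability, which are also model inputs, the model is largely relearning those rules.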
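
A headline accuracy figure is easier to interpret next to a trivial baseline. The sketch below is not the notebook's pipeline. It uses synthetic data with an imbalanced target (an assumption) and compares a majority-class `DummyClassifier` with a random forest on both accuracy and balanced accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(600, 3))
# Made-up imbalanced target: roughly 84% positives, driven by the first feature.
y = (X[:, 0] > -1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

for name, clf in [("baseline", baseline), ("random forest", forest)]:
    pred = clf.predict(X_test)
    print(name, accuracy_score(y_test, pred), balanced_accuracy_score(y_test, pred))
```

The baseline already scores well on plain accuracy when one class dominates, while its balanced accuracy stays at 0.5. That makes the gap between the two models easier to read.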

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

# Use the variables made available from the previous cell's classification model
# Check if variables are defined
if 'y_test_for_plot' in globals() and 'y_pred_for_plot' in globals() and 'y_prob_for_plot' in globals():
    y_test = y_test_for_plot
    y_pred = y_pred_for_plot
    y_prob = y_prob_for_plot

    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()

    # Classification report
    print("Classification Report:\n", classification_report(y_test, y_pred))

    # ROC Curve
    # y_prob is already calculated and available from previous cell
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(6,4))
    plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
    plt.plot([0,1],[0,1],'--',color='gray')
    plt.title("ROC Curve")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()
elif 'y_test' not in globals(): # More specific check for initial state
    print("Error: Classification model variables (y_test, y_pred, y_prob) are not defined. Please ensure the model training cell (m1yRBfE2qUVi) was executed successfully.")
else:
    print("Error: An unexpected issue occurred with variable definitions. Please check the model training cell.")


**Insights**


*  Confusion matrix comparing the predicted labels against the actual labels.
*  ROC curve showing the trade-off between the true positive rate and the false positive rate of the investment classifier.
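
To make the two plots concrete, the sketch below shows how precision, recall and the area under the ROC curve follow from a handful of made-up predictions. The arrays are illustrative only and do not come from the notebook's model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Tiny made-up example: 1 = good investment, 0 = not.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.1, 0.95])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted "good" that really are good
recall = tp / (tp + fn)     # share of actual "good" that were found
auc_score = roc_auc_score(y_true, y_prob)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} auc={auc_score:.3f}")
```

The AUC here equals the fraction of (positive, negative) pairs in which the positive example receives the higher probability, which is 14 of 15 pairs for these numbers.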



## Regression model with low RMSE for price forecasting.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

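# y_test and y_pred must hold actual and predicted prices from a regression model
# (not the 0/1 labels left over from the classification cells above)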
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R²: {r2:.2f}")

# Residuals
residuals = y_test - y_pred

plt.figure(figsize=(10,6))
sns.scatterplot(x=y_pred, y=residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.title("Residuals vs Predicted Prices")
plt.xlabel("Predicted Price")
plt.ylabel("Residuals")
plt.tight_layout()
plt.show()

# Distribution of residuals
plt.figure(figsize=(10,6))
sns.histplot(residuals, bins=40, kde=True)
plt.title("Distribution of Residuals")
plt.xlabel("Residual")
plt.tight_layout()
plt.show()

**Insights**
* Regression model with low RMSE for price forecasting.
* Residuals vs. predicted price plot and the distribution of residuals.
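
The three regression metrics used above can be worked out by hand on a small example. The prices below are made up. The formulas are the standard definitions, written with NumPy so each step is visible.

```python
import numpy as np

# Made-up actual and predicted prices, in lakhs.
y_true = np.array([50.0, 80.0, 120.0, 200.0])
y_hat = np.array([55.0, 75.0, 130.0, 190.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# MAE: mean absolute residual.
mae = np.mean(np.abs(residuals))

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.3f}")
```

RMSE and MAE are in the same unit as the target (lakhs here), so "low RMSE" only has meaning relative to typical property prices in the dataset.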

## STEP 4: ML FLOW INTEGRATION

ML Integration means connecting Machine Learning models with an application so that the model can be used in the real world.



*  Loading the trained model

*  Sending user input to the model

*  Getting predictions from the model

*  Showing the results to the user (UI, app, dashboard)
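
The four steps above can be sketched without any tracking server. This is a minimal illustration using `joblib` and a stand-in linear model trained on made-up numbers. The file name `demo_model.pkl` and the training data are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Train a stand-in model on made-up data (price in lakhs vs. size and BHK).
train = pd.DataFrame({
    "Size_in_SqFt": [600, 900, 1200, 1500, 1800],
    "BHK": [1, 2, 2, 3, 3],
})
prices = [30.0, 45.0, 60.0, 75.0, 90.0]
model = LinearRegression().fit(train, prices)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)

    # 1) Load the trained model.
    loaded = joblib.load(path)

# 2) Send user input to the model, using the same column names as in training.
user_input = pd.DataFrame([{"Size_in_SqFt": 1000, "BHK": 2}])

# 3) Get the prediction and 4) show it to the user.
predicted = float(loaded.predict(user_input)[0])
print(f"Predicted price: {predicted:.1f} lakhs")
```

In an app, step 4 would be a UI call such as a Streamlit metric instead of `print`. The key constraint is that the input columns match the ones used at training time.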



In [None]:
pip install mlflow


In [None]:
import mlflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")

In [None]:
"""
Step 4: MLflow Integration
Track experiments with different models, log parameters/metrics/artifacts,
and use the MLflow Model Registry to manage best models.
"""

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import numpy as np # Import numpy for np.sqrt
import joblib # Import joblib to save the best model

# -----------------------------
# 1. Load dataset (example)
# -----------------------------
df = pd.read_csv("india_housing_prices.csv")

# Adjust column names to match your dataset
X = df[["Size_in_SqFt", "BHK", "Age_of_Property"]]
y = df["Price_in_Lakhs"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------
# 2. Set MLflow experiment
# -----------------------------
mlflow.set_tracking_uri("./mlruns")  # Changed to local directory for tracking
mlflow.set_experiment("HousingPricePrediction")

# -----------------------------
# 3. Define models to compare
# -----------------------------
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
}

# -----------------------------
# 4. Train, evaluate, log runs
# -----------------------------
for model_name, model in models.items():
    with mlflow.start_run(run_name=model_name):
        # Fit model
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Metrics
        # RMSE is the square root of the MSE (avoids the 'squared' argument, which not every scikit-learn version accepts)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)

        # Log parameters (if available)
        if hasattr(model, "get_params"):
            params = model.get_params()
            for p, v in params.items():
                mlflow.log_param(p, v)

        # Log metrics
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)

        # Log model artifact
        mlflow.sklearn.log_model(model, "model")

        print(f"{model_name}: RMSE={rmse:.3f}, R2={r2:.3f}")

# -----------------------------
# 5. Register best model and save locally for Streamlit
# -----------------------------
# Find best run by lowest RMSE
client = MlflowClient()
experiment = client.get_experiment_by_name("HousingPricePrediction")
runs = client.search_runs([experiment.experiment_id], order_by=["metrics.rmse ASC"], max_results=1)

best_run = runs[0]
best_run_id = best_run.info.run_id
best_model_name = "HousingPriceModel"

# Register model
result = mlflow.register_model(
    f"runs:/{best_run_id}/model",
    best_model_name
)

# Transition to Production stage
client.transition_model_version_stage(
    name=best_model_name,
    version=result.version,
    stage="Production"
)

print(f"Best model registered: {best_model_name}, version {result.version}, stage=Production")

# Load the best model from MLflow and save it locally for the Streamlit app
loaded_best_model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/model")
joblib.dump(loaded_best_model, 'regression_model.pkl')
print("Saved regression_model.pkl from the best MLflow run.")


**Insights**

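* MLflow is installed and used to track experiments, log parameters, metrics and model artifacts, and manage the best model through the Model Registry.

* An MLflow experiment is set up, and several regression models are defined and compared by RMSE.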

## STREAMLIT APP

A Streamlit app is a simple, interactive web application built using Python, mainly used for:

✅ Data Science
✅ Machine Learning Projects
✅ Dashboards
✅ Model Deployment
✅ Data Visualizations

It lets you turn your Python code into a beautiful web interface without needing HTML, CSS, or JavaScript.

In [None]:
!pip install streamlit

In [None]:
# app.py
import os
import json
import math
import time
import numpy as np
import pandas as pd
import streamlit as st
import altair as alt
from datetime import datetime

# Optional: MLflow integration for loading registered models
try:
    import mlflow
    import mlflow.sklearn
    from mlflow.tracking import MlflowClient
    MLFLOW_AVAILABLE = True
except Exception:
    MLFLOW_AVAILABLE = False

# -----------------------------
# Config
# -----------------------------
CSV_PATH = "india_housing_prices.csv"  # change to your data path
FEATURES = ["Size_in_SqFt", "BHK", "Age_of_Property"]  # must match the columns the model was trained on
TARGET_PRICE = "Price_in_Lakhs"  # numeric target in lakhs
CATEGORICALS = ["State", "City", "Locality"]
DATE_COL = "Listing_Date"  # change if you have listing_date
MODEL_NAME_REG = "HousingPriceModel"  # MLflow registered regression model
MODEL_NAME_CLS = "InvestmentClassifier"  # MLflow registered classifier (if you have one)

# Growth assumptions for 5-year projection (simple compounding)
DEFAULT_CAGR = 0.06  # 6% annual
YEARS_FORWARD = 5

# -----------------------------
# Helpers
# -----------------------------
@st.cache_data
def load_data(path: str) -> pd.DataFrame:
    if not os.path.exists(path):
        st.error(f"Dataset not found at: {path}")
        return pd.DataFrame()
    df = pd.read_csv(path)
    # normalize optional
    # Coerce numerics for safety
    for col in [TARGET_PRICE, "Size_in_SqFt", "BHK", "Age_of_Property"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    # parse dates if present
    if DATE_COL in df.columns:
        df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
    return df

def get_mlflow_model(model_name: str):
    if not MLFLOW_AVAILABLE:
        return None, None
    try:
        mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "http://127.0.0.1:5000"))
        client = MlflowClient()
        versions = client.search_model_versions(f"name='{model_name}'")
        # pick latest in Production
        prod = [v for v in versions if v.current_stage == "Production"]
        if not prod:
            return None, None
        # latest by last_updated_timestamp
        best = sorted(prod, key=lambda x: x.last_updated_timestamp, reverse=True)[0]
        uri = f"models:/{model_name}/{best.current_stage}"
        model = mlflow.sklearn.load_model(uri)
        return model, best
    except Exception as e:
        st.warning(f"Could not load MLflow model '{model_name}': {e}")
        return None, None

def compute_feature_importance(model, feature_names):
    # Supports scikit-learn tree-based models or linear models
    try:
        if hasattr(model, "feature_importances_"):
            imp = model.feature_importances_
            return pd.DataFrame({"feature": feature_names, "importance": imp}).sort_values("importance", ascending=False)
        elif hasattr(model, "coef_"):
            coef = model.coef_
            if np.ndim(coef) == 1:
                imp = np.abs(coef)
            else:
                imp = np.abs(coef).mean(axis=0)
            return pd.DataFrame({"feature": feature_names, "importance": imp}).sort_values("importance", ascending=False)
        else:
            return pd.DataFrame({"feature": feature_names, "importance": np.nan})
    except Exception:
        return pd.DataFrame({"feature": feature_names, "importance": np.nan})

def baseline_regressor_predict(area, bhk, age, base_ppsf=10000.0):
    # Simple heuristic: price = ppsf * area * adjustment
    # Adjust for bhk and age
    bhk_adj = 1.0 + 0.08 * max(0, (bhk - 2))
    age_adj = max(0.6, 1.0 - 0.02 * max(0, age - 5))
    price = base_ppsf * area * bhk_adj * age_adj
    # Convert to lakhs
    return price / 100000.0

def baseline_classifier_score(price_lakhs, city_avg_lakhs):
    # Score 0-1 based on how the price compares to city average
    # Cheaper than average -> higher score
    if city_avg_lakhs is None or np.isnan(city_avg_lakhs):
        return 0.5
    ratio = price_lakhs / max(1e-6, city_avg_lakhs)
    score = np.clip(1.2 - ratio, 0.0, 1.0)
    return score

def project_price_5y(current_price_lakhs, cagr=DEFAULT_CAGR, years=YEARS_FORWARD):
    return current_price_lakhs * ((1 + cagr) ** years)

def city_avg_price(df: pd.DataFrame, city: str) -> float:
    if df.empty or TARGET_PRICE not in df.columns or "City" not in df.columns:
        return np.nan
    sub = df[(df["City"] == city) & df[TARGET_PRICE].notna()]
    if sub.empty:
        return np.nan
    return sub[TARGET_PRICE].mean()

# -----------------------------
# UI
# -----------------------------
st.set_page_config(page_title="Housing Investment Insights", layout="wide")
st.title("Housing Investment Insights")

df = load_data(CSV_PATH)
if df.empty:
    st.stop()

with st.sidebar:
    st.header("Filter dataset")
    states = sorted(df["State"].dropna().unique().tolist()) if "State" in df.columns else []
    cities = sorted(df["City"].dropna().unique().tolist()) if "City" in df.columns else []
    localities = sorted(df["Locality"].dropna().unique().tolist()) if "Locality" in df.columns else []

    sel_state = st.selectbox("State", options=["All"] + states)
    sel_city = st.selectbox("City", options=["All"] + cities)
    sel_locality = st.selectbox("Locality", options=["All"] + localities)

    min_area = st.number_input("Min area (sqft)", min_value=0, value=0, step=50)
    max_area = st.number_input("Max area (sqft)", min_value=0, value=0, step=50, help="0 means no upper bound")
    min_price = st.number_input("Min price (lakhs)", min_value=0.0, value=0.0, step=1.0)
    max_price = st.number_input("Max price (lakhs)", min_value=0.0, value=0.0, step=1.0, help="0 means no upper bound")
    sel_bhk = st.multiselect("BHK", options=sorted(df["BHK"].dropna().unique().tolist()) if "BHK" in df.columns else [], default=[])

# Apply filters
filtered = df.copy()
if sel_state != "All" and "State" in filtered.columns:
    filtered = filtered[filtered["State"] == sel_state]
if sel_city != "All" and "City" in filtered.columns:
    filtered = filtered[filtered["City"] == sel_city]
if sel_locality != "All" and "Locality" in filtered.columns:
    filtered = filtered[filtered["Locality"] == sel_locality]
if "Size_in_SqFt" in filtered.columns and min_area > 0:
    filtered = filtered[filtered["Size_in_SqFt"] >= min_area]
if "Size_in_SqFt" in filtered.columns and max_area > 0:
    filtered = filtered[filtered["Size_in_SqFt"] <= max_area]
if TARGET_PRICE in filtered.columns and min_price > 0:
    filtered = filtered[filtered[TARGET_PRICE] >= min_price]
if TARGET_PRICE in filtered.columns and max_price > 0:
    filtered = filtered[filtered[TARGET_PRICE] <= max_price]
if "BHK" in filtered.columns and sel_bhk:
    filtered = filtered[filtered["BHK"].isin(sel_bhk)]

st.subheader("Filtered dataset preview")
st.dataframe(filtered.head(50), use_container_width=True)

# -----------------------------
# User input form: property details
# -----------------------------
st.subheader("Enter property details")
colA, colB, colC, colD = st.columns(4)
with colA:
    in_state = st.selectbox("Input state", options=states if states else ["Unknown"])
with colB:
    in_city = st.selectbox("Input city", options=cities if cities else ["Unknown"])
with colC:
    in_locality = st.text_input("Input locality", value=(localities[0] if localities else ""))
with colD:
    in_bhk = st.number_input("BHK", min_value=1, max_value=10, value=2, step=1)

colE, colF, colG, colH = st.columns(4)
with colE:
    in_area = st.number_input("Area (sqft)", min_value=100, max_value=20000, value=1000, step=50)
with colF:
    in_age = st.number_input("Age of Property (years)", min_value=0, max_value=100, value=5, step=1)
with colG:
    in_current_price = st.number_input("Current price (lakhs)", min_value=0.0, value=50.0, step=0.5)
with colH:
    in_cagr = st.number_input("Assumed CAGR for 5 years", min_value=0.0, max_value=0.25, value=DEFAULT_CAGR, step=0.01)

# -----------------------------
# Models: load MLflow models or baseline
# -----------------------------
reg_model, reg_meta = get_mlflow_model(MODEL_NAME_REG) if MLFLOW_AVAILABLE else (None, None)
cls_model, cls_meta = get_mlflow_model(MODEL_NAME_CLS) if MLFLOW_AVAILABLE else (None, None)

# Prepare input vector
def input_vector():
    # Column names must match the ones the regression model was trained on
    vec = pd.DataFrame([{
        "Size_in_SqFt": in_area,
        "BHK": in_bhk,
        "Age_of_Property": in_age
    }])
    return vec

# -----------------------------
# Predictions
# -----------------------------
st.markdown("---")
st.subheader("Results")

# Regression: Estimated Price after 5 Years
if reg_model is not None:
    try:
        pred_current_lakhs = float(reg_model.predict(input_vector())[0])
    except Exception as e:
        st.warning(f"Regression model prediction failed, using baseline. Error: {e}")
        pred_current_lakhs = baseline_regressor_predict(in_area, in_bhk, in_age)
else:
    pred_current_lakhs = baseline_regressor_predict(in_area, in_bhk, in_age)

pred_5y_lakhs = project_price_5y(pred_current_lakhs, cagr=in_cagr, years=YEARS_FORWARD)

# Classification: Is this a Good Investment?
if cls_model is not None:
    try:
        # If model has predict_proba, use probability for confidence
        if hasattr(cls_model, "predict_proba"):
            proba = cls_model.predict_proba(input_vector())
            confidence = float(np.max(proba))
            label = int(cls_model.predict(input_vector())[0])
        else:
            # fallback: decision_function or predict
            label = int(cls_model.predict(input_vector())[0])
            confidence = 0.5
    except Exception as e:
        st.warning(f"Classifier prediction failed, using heuristic. Error: {e}")
        city_avg = city_avg_price(df, in_city)
        score = baseline_classifier_score(pred_current_lakhs, city_avg)
        label = 1 if score >= 0.5 else 0
        confidence = float(score)
else:
    city_avg = city_avg_price(df, in_city)
    score = baseline_classifier_score(pred_current_lakhs, city_avg)
    label = 1 if score >= 0.5 else 0
    confidence = float(score)

inv_text = "Good Investment" if label == 1 else "Not Ideal Right Now"

met1, met2, met3 = st.columns(3)
with met1:
    st.metric("Predicted current fair price (lakhs)", f"{pred_current_lakhs:,.2f}")
with met2:
    st.metric("Estimated price after 5 years (lakhs)", f"{pred_5y_lakhs:,.2f}")
with met3:
    st.metric("Classification", inv_text)

st.caption(f"Model confidence: {confidence:0.2f}")

# Feature importance
st.subheader("Model feature importance")
reg_imp = compute_feature_importance(reg_model, FEATURES) if reg_model is not None else pd.DataFrame({"feature": FEATURES, "importance": [np.nan]*len(FEATURES)})
st.dataframe(reg_imp, use_container_width=True)
if reg_imp["importance"].notna().any():
    chart_imp = alt.Chart(reg_imp).mark_bar().encode(
        x=alt.X("importance:Q", title="Importance"),
        y=alt.Y("feature:N", sort="-x", title="Feature")
    ).properties(height=200)
    st.altair_chart(chart_imp, use_container_width=True)

# -----------------------------
# Visual insights
# -----------------------------
st.markdown("---")
st.subheader("Visual insights")

# Location-wise heatmap (City vs BHK by avg price)
if {"City", "BHK", TARGET_PRICE}.issubset(filtered.columns):
    heat = (
        filtered.dropna(subset=["City", "BHK", TARGET_PRICE])
        .groupby(["City", "BHK"], as_index=False)[TARGET_PRICE]
        .mean()
        .rename(columns={TARGET_PRICE: "AvgPriceLakhs"})
    )
    heat_chart = alt.Chart(heat).mark_rect().encode(
        x=alt.X("City:N", title="City"),
        y=alt.Y("BHK:O", title="BHK"),
        color=alt.Color("AvgPriceLakhs:Q", title="Avg Price (Lakhs)", scale=alt.Scale(scheme="viridis")),
        tooltip=["City", "BHK", "AvgPriceLakhs"]
    ).properties(height=300)
    st.altair_chart(heat_chart, use_container_width=True)
else:
    st.info("Heatmap requires City, BHK, and Price_in_Lakhs columns.")

# Trend charts: monthly average prices by city
if DATE_COL in filtered.columns and "City" in filtered.columns and TARGET_PRICE in filtered.columns:
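    # Monthly average price per city
    trend = (
        filtered.dropna(subset=[DATE_COL, "City", TARGET_PRICE])
        .assign(month=lambda d: d[DATE_COL].dt.to_period("M").dt.to_timestamp())
        .groupby(["City", "month"], as_index=False)[TARGET_PRICE]
        .mean()
        .rename(columns={TARGET_PRICE: "AvgPriceLakhs"})
    )
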
    # Select top 8 cities by recent avg
    recent = trend[trend["month"] == trend["month"].max()].sort_values("AvgPriceLakhs", ascending=False)["City"].head(8).tolist()
    trend_sel = trend[trend["City"].isin(recent)]
    line = alt.Chart(trend_sel).mark_line(point=True).encode(
        x=alt.X("month:T", title="Month"),
        y=alt.Y("AvgPriceLakhs:Q", title="Avg Price (Lakhs)"),
        color="City:N",
        tooltip=["City", "month", "AvgPriceLakhs"]
    ).properties(height=350)
    st.altair_chart(line, use_container_width=True)
else:
    st.info("Trend chart requires Listing_Date, City, and Price_in_Lakhs columns.")

# PPSF by state (bar)
if {"State", "Size_in_SqFt", TARGET_PRICE}.issubset(filtered.columns):
    ppsf_df = filtered.dropna(subset=["State", "Size_in_SqFt", TARGET_PRICE]).copy()
    ppsf_df["Price_per_SqFt"] = (ppsf_df[TARGET_PRICE] * 100000.0) / ppsf_df["Size_in_SqFt"]
    ppsf_state = ppsf_df.groupby("State", as_index=False)["Price_per_SqFt"].mean().sort_values("Price_per_SqFt", ascending=False)
    bar_ppsf = alt.Chart(ppsf_state).mark_bar().encode(
        x=alt.X("Price_per_SqFt:Q", title="Avg Price per Sq Ft"),
        y=alt.Y("State:N", sort="-x"),
        tooltip=["State", "Price_per_SqFt"]
    ).properties(height=400)
    st.altair_chart(bar_ppsf, use_container_width=True)
else:
    st.info("PPSF chart requires State, Size_in_SqFt, and Price_in_Lakhs columns.")

# -----------------------------
# Confidence and rationale
# -----------------------------
st.markdown("---")
st.subheader("Model confidence and rationale")
city_avg_val = city_avg_price(df, in_city)
st.write(f"- City average price (lakhs): {city_avg_val:,.2f}" if not np.isnan(city_avg_val) else "- City average price (lakhs): N/A")
st.write(f"- Input BHK: {in_bhk}, Area: {in_area} sqft, Age: {in_age} years")
st.write(f"- Assumed CAGR for projection: {in_cagr*100:.1f}% over {YEARS_FORWARD} years")

st.success("Tip: Compare the predicted current fair price vs. your listed price and the city average. If the predicted fair price is below city average and projection looks strong, confidence will be higher.")

# Footer
st.markdown("---")
st.caption("This app integrates dataset filters, model predictions, visual insights, and MLflow-based model management (when available). Replace feature names or MLflow model names to match your environment.")

**Insights**
* The Streamlit app is installed and used to serve the model predictions and the projected price growth after 5 years.

* It charts monthly average prices by city, with sidebar filters for state, city and locality.
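
The 5-year projection in the app is plain compound growth, and the same formula can be inverted to recover an annual growth rate from two prices. The sketch below illustrates both directions. The function names `project_price` and `implied_cagr` are introduced here for illustration and are not part of the app.

```python
def project_price(current_price_lakhs: float, cagr: float, years: int = 5) -> float:
    """Compound the current price forward at a constant annual rate."""
    return current_price_lakhs * (1.0 + cagr) ** years


def implied_cagr(current_price_lakhs: float, future_price_lakhs: float, years: int = 5) -> float:
    """Invert the compounding formula to recover the annual growth rate."""
    return (future_price_lakhs / current_price_lakhs) ** (1.0 / years) - 1.0


# 50 lakhs growing at 6% a year for 5 years.
future = project_price(50.0, 0.06)
print(f"{future:.2f}")                      # 66.91
print(f"{implied_cagr(50.0, future):.4f}")  # 0.0600
```

A constant rate is a simplifying assumption. The projected figure is only as reliable as the growth rate the user types in.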

## Target Variables

In [None]:
# ---------------------------------------------------------------
#   FULL CODE: Future Price Prediction (All 4 Methods Combined)
# ---------------------------------------------------------------

import pandas as pd
import joblib


# ---------------------------------------------------------------
# 1️⃣  Fixed Rate Future Price Prediction (default = 8%)
# ---------------------------------------------------------------
def future_price_fixed(current_price, rate=0.08, years=5):
    """
    Simple fixed rate appreciation model.
    """
    return current_price * ((1 + rate) ** years)


# ---------------------------------------------------------------
# 2️⃣  Location-Based Growth Model
# ---------------------------------------------------------------
location_growth_rates = {
    "Mumbai": 0.09,
    "Delhi": 0.06,
    "Bangalore": 0.08,
    "Hyderabad": 0.10,
    "Pune": 0.07,
    "Chennai": 0.06,
    "Kolkata": 0.05
}

def future_price_location(current_price, location, years=5):
    """
    Predict future price using location-specific growth rate.
    """
    rate = location_growth_rates.get(location, 0.06)  # Default = 6%
    return current_price * ((1 + rate) ** years)


# ---------------------------------------------------------------
# 3️⃣  Property-Type Based Growth Model
# ---------------------------------------------------------------
property_growth_rates = {
    "Apartment": 0.07,
    "Villa": 0.10,
    "Plot": 0.12,
    "Independent House": 0.08
}

def future_price_property(current_price, property_type, years=5):
    """
    Predict future price using property-type growth rate.
    """
    rate = property_growth_rates.get(property_type, 0.07)  # Default = 7%
    return current_price * ((1 + rate) ** years)


# ---------------------------------------------------------------
# 4️⃣  ML Model Based Future Price Prediction (Best Approach)
# ---------------------------------------------------------------
def future_price_ml(model, input_features, current_price, years=5):
    """
    Predict appreciation rate using ML model, then compute future price.
    """
    predicted_rate = model.predict(input_features)[0]   # Example: 0.085 (8.5%)
    future_price = current_price * ((1 + predicted_rate) ** years)

    return future_price, predicted_rate



# ---------------------------------------------------------------
# 5️⃣  Combined Function (Runs All Methods)
# ---------------------------------------------------------------
def calculate_future_prices(current_price, location, property_type, input_features=None, ml_model=None):
    """
    Returns a dictionary with all future price predictions.
    """
    results = {}

    # ---- METHOD 1: FIXED RATE (8%) ----
    results["Fixed Method (8%)"] = future_price_fixed(current_price)

    # ---- METHOD 2: LOCATION BASED ----
    results["Location Method"] = future_price_location(current_price, location)

    # ---- METHOD 3: PROPERTY TYPE BASED ----
    results["Property Type Method"] = future_price_property(current_price, property_type)

    # ---- METHOD 4: ML BASED ----
    if ml_model is not None and input_features is not None:
        ml_price, ml_rate = future_price_ml(ml_model, input_features, current_price)
        results["ML Method Future Price"] = ml_price
        results["ML Predicted Growth Rate"] = ml_rate

    return results



# ---------------------------------------------------------------
# 6️⃣  EXAMPLE USAGE (Comment out in Streamlit)
# ---------------------------------------------------------------
if __name__ == "__main__":

    # Example input features (location and property type match the call below)
    input_data = pd.DataFrame({
        "Area": [1200],
        "BHK": [3],
        "Bathrooms": [2],
        "Location": ["Hyderabad"],
        "Property_Type": ["Apartment"]
    })

    # Load ML appreciation model (if available)
    try:
        ml_model = joblib.load("growth_rate_model.pkl")
    except FileNotFoundError:
        ml_model = None
        print("⚠️ ML growth model not found. Skipping ML-based prediction.")

    # Run all predictions
    results = calculate_future_prices(
        current_price = 5000000,
        location = "Hyderabad",
        property_type = "Apartment",
        input_features = input_data,
        ml_model = ml_model
    )

    # Print results
    for method, value in results.items():
        if "Rate" in method:
            print(f"{method}: {value * 100:.2f}%")
        else:
            print(f"{method}: ₹ {value:,.0f}")
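
As a quick sanity check of the compound-growth formula shared by the rule-based methods above, the sketch below recomputes the 5-year value of a ₹50 lakh property at 8% and then inverts the formula to recover the annual rate. `implied_annual_rate` is a hypothetical helper added for illustration; it is not part of the module above.

```python
def future_price_fixed(current_price, rate=0.08, years=5):
    """Compound appreciation: price * (1 + rate) ** years."""
    return current_price * ((1 + rate) ** years)


def implied_annual_rate(current_price, future_price, years=5):
    """Hypothetical helper: invert the formula to recover the annual rate."""
    return (future_price / current_price) ** (1 / years) - 1


price_now = 5_000_000
price_later = future_price_fixed(price_now)              # 8% for 5 years
rate_back = implied_annual_rate(price_now, price_later)  # should be ~0.08

print(f"Future price: {price_later:,.0f}")
print(f"Implied annual rate: {rate_back:.2%}")
```

Since 1.08 ** 5 is about 1.4693, the property appreciates by roughly 47% over five years, and the recovered rate matches the 8% input.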


In [None]:
import pandas as pd

df = pd.read_csv("india_housing_prices.csv")

# -----------------------------------------------------
# Rule 1: Price ≤ Median Price → GOOD
# -----------------------------------------------------
median_price = df["Price_in_Lakhs"].median()
df["rule_price"] = df["Price_in_Lakhs"] <= median_price

# -----------------------------------------------------
# Rule 2: PPSF ≤ Locality Median → GOOD
# -----------------------------------------------------
df["locality_ppsf_median"] = df.groupby("Locality")["Price_per_SqFt"].transform("median")
df["rule_ppsf"] = df["Price_per_SqFt"] <= df["locality_ppsf_median"]

# -----------------------------------------------------
# Rule 3: Multi-Factor Score
# Conditions:
#   +1 if BHK ≥ 3
#   +1 if Ready to Move
#   +1 if Area ≥ locality median area
# -----------------------------------------------------
df["locality_area_median"] = df.groupby("Locality")["Size_in_SqFt"].transform("median")

df["score"] = (
    (df["BHK"] >= 3).astype(int) +
    (df["Availability_Status"].isin(["Ready to Move", "ReadyToMove"])).astype(int) +
    (df["Size_in_SqFt"] >= df["locality_area_median"]).astype(int)
)

df["rule_multi"] = df["score"] >= 2     # Threshold

# -----------------------------------------------------
# FINAL TARGET: GOOD if any rule is TRUE
# -----------------------------------------------------
df["Good_Investment"] = (
    df["rule_price"] |
    df["rule_ppsf"] |
    df["rule_multi"]
).map({True: "Good", False: "Bad"})

# Save
df.to_csv("housing_dataset_with_labels.csv", index=False)

print("✔ 'Good_Investment' column created successfully!")
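
Because the final label is the OR of three rules, it can be worth checking how many rows each rule passes and how balanced the resulting classes are before training. The sketch below applies the same three rules to a small made-up table; the rows and values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names; values are illustrative only
toy = pd.DataFrame({
    "Locality": ["A", "A", "A", "B", "B", "B"],
    "Price_in_Lakhs": [40, 60, 90, 50, 120, 200],
    "Price_per_SqFt": [4000, 5000, 6000, 5000, 8000, 10000],
    "Size_in_SqFt": [1000, 1200, 1500, 1000, 1500, 2000],
    "BHK": [2, 3, 3, 1, 3, 4],
    "Availability_Status": [
        "Ready to Move", "Under Construction", "Ready to Move",
        "Ready to Move", "Under Construction", "Under Construction",
    ],
})

# Same three rules as in the labeling cell
ppsf_median = toy.groupby("Locality")["Price_per_SqFt"].transform("median")
area_median = toy.groupby("Locality")["Size_in_SqFt"].transform("median")
score = (
    (toy["BHK"] >= 3).astype(int)
    + toy["Availability_Status"].isin(["Ready to Move", "ReadyToMove"]).astype(int)
    + (toy["Size_in_SqFt"] >= area_median).astype(int)
)
rules = pd.DataFrame({
    "rule_price": toy["Price_in_Lakhs"] <= toy["Price_in_Lakhs"].median(),
    "rule_ppsf": toy["Price_per_SqFt"] <= ppsf_median,
    "rule_multi": score >= 2,
})

# Share of rows passing each rule, and share labeled "Good" by the OR
rule_share = rules.mean()
good_share = rules.any(axis=1).mean()

print(rule_share)
print(f"Share labeled Good: {good_share:.0%}")
```

On this toy sample every row ends up labeled Good, even though no single rule passes more than two thirds of the rows. Running the same two summary lines on the real `df` would show whether the OR inflates the positive class in the actual data.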


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib

# Load updated dataset
df = pd.read_csv("housing_dataset_with_labels.csv")

# Categorical columns to encode
cat_cols = [
    "State", "City", "Locality", "Property_Type",
    "Furnished_Status", "Facing", "Owner_Type",
    "Availability_Status"
]

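# Encode categoricals, keeping each fitted encoder so the same
# category-to-integer mapping can be reused at prediction time
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col].astype(str))

# Save the encoders next to the model (the filename is an assumption)
joblib.dump(encoders, "label_encoders.pkl")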

# Target
y = df["Good_Investment"].map({"Good": 1, "Bad": 0})

# Features
X = df[[
    "Size_in_SqFt",
    "Price_in_Lakhs",
    "Price_per_SqFt",
    "BHK",
    "City",
    "Locality",
    "Availability_Status"
]]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(
    n_estimators=300,
    max_depth=12,
    min_samples_split=4,
    random_state=42
)

model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)
print("🔥 Model Accuracy:", acc)
print("\n📊 Classification Report:\n", classification_report(y_test, y_pred))

# Save model
joblib.dump(model, "good_investment_model.pkl")
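
The cell above encodes categorical columns before training, so the app needs the same category-to-integer mapping at prediction time. One possible way to keep the encoding and the classifier together is a scikit-learn `Pipeline`, sketched below on made-up rows. This swaps `LabelEncoder` for `OrdinalEncoder` and uses a smaller feature set and forest than the notebook; it is an illustration under those assumptions, not the notebook's method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up training rows; column names follow the feature list above
train = pd.DataFrame({
    "Size_in_SqFt": [900, 1200, 1500, 2000, 1100, 1800],
    "BHK": [2, 3, 3, 4, 2, 4],
    "City": ["Pune", "Mumbai", "Pune", "Delhi", "Mumbai", "Delhi"],
    "Availability_Status": ["Ready to Move", "Under Construction"] * 3,
})
labels = [0, 1, 1, 1, 0, 1]  # illustrative targets (1 = Good, 0 = Bad)

cat_cols = ["City", "Availability_Status"]

# Unseen categories are mapped to -1 instead of raising at prediction time
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
preprocess = ColumnTransformer(
    [("cat", encoder, cat_cols)],
    remainder="passthrough",  # numeric columns pass through unchanged
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(train, labels)

# A new row with a city that was not in the training data
new_row = pd.DataFrame({
    "Size_in_SqFt": [1300],
    "BHK": [3],
    "City": ["Chennai"],
    "Availability_Status": ["Ready to Move"],
})
prediction = pipeline.predict(new_row)
print(prediction)
```

Saving the fitted `pipeline` with `joblib.dump` stores the encoder and the classifier in a single file, so the Streamlit app can pass raw category strings straight to `predict`.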


## CONCLUSION
* Data processing: checked for and handled missing values and removed duplicates.
* Data cleaning: cleaned the categorical features and missing values.
* Data visualization: heatmap, trends in price and price per square foot, scatter plots, and box plots.
* Model development: trained and tested the models, measured their accuracy, and predicted the investment outcome for properties.
* MLflow integration.
* Streamlit application to register the model and predict housing prices.