#### In house price prediction, our primary evaluation metric is RMSLE (Root Mean Squared Logarithmic Error). It is calculated by taking the standard RMSE on the natural logarithm of both the predicted and actual values.
    - Interpretation: Penalizes relative errors. It considers the percentage difference. A $10k error on a $100k home is penalized similarly to a $100k error on a $1M home.
    - Outlier sensitivity: Low. High-priced outliers (e.g., Mansions) do not explode the error metric.
    - Target distribution: Ideal for highly right-skewed targets like home prices.
    - Directional Bias: Slightly penalizes under-predictions more than over-predictions.
Since a buyer's perception of "overpaying" is usually relative to the total value of the home, RMSLE perfectly aligns the model's objective with human intuition.

####  In contrast, RMSE and MAE:
        - Penalize absolute errors.
        - Have an outlier sensitivity that is very high (RMSE) or Moderate (MAE).
        - Give a symmetric penalty.

In [2]:
import pandas as pd
import numpy as np

# 1. Load the Dataset
# We will use a reliable GitHub mirror of the Kaggle House Prices train dataset
path = "datasets/test.csv"
train_df = pd.read_csv(path)

# Drop Id from features as requested
train_df = train_df.drop('Id', axis=1)

# ---------------------------------------------------------
# METRIC JUSTIFICATION: RMSLE vs RMSE Example
# ---------------------------------------------------------
print("--- RMSLE vs RMSE Demonstration ---")
from sklearn.metrics import root_mean_squared_error, root_mean_squared_log_error

# Hypothetical Scenario: $10k error on a Cheap home vs an Expensive home
actuals = [100_000, 1_000_000]
predictions = [110_000, 1_010_000] # $10k error for both

rmse = root_mean_squared_error(actuals, predictions)
rmsle = root_mean_squared_log_error(actuals, predictions)

print(f"Absolute RMSE for both combined: ${rmse:.2f}")
print(f"RMSLE (Relative error): {rmsle:.4f}")
print("RMSLE naturally normalizes the errors across huge price differences.\n")

# ---------------------------------------------------------
# DATA OVERVIEW & BASELINE REPORT
# ---------------------------------------------------------
print("--- Dataset Overview ---")
print(f"Shape (Rows, Columns): {train_df.shape}\n")

# Target Summary
target = train_df['SalePrice']
print("--- Target (SalePrice) Summary ---")
print(f"Mean:   ${target.mean():,.0f}")
print(f"Median: ${target.median():,.0f}")
print(f"Skew:   {target.skew():.2f}")
print()

# Feature Typology
# Exclude the target from the feature count
features = train_df.drop('SalePrice', axis=1)
num_features = features.select_dtypes(include=[np.number]).columns
cat_features = features.select_dtypes(include=['object']).columns

print("--- Feature Types ---")
print(f"Numeric features:     {len(num_features)}")
print(f"Categorical features: {len(cat_features)}")
print()

# Missing Values Table
print("--- Top 10 Missing Features ---")
missing_counts = features.isnull().sum()
missing_pct = (missing_counts / len(features)) * 100

# Create a DataFrame for easy viewing
missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Percentage': missing_pct
})

# Sort and get top 10
top_missing = missing_df[missing_df['Missing Count'] > 0].sort_values(by='Missing Count', ascending=False).head(10)

# Format the percentage for display
top_missing['Percentage'] = top_missing['Percentage'].map('{:.1f}%'.format)

print(top_missing)

--- RMSLE vs RMSE Demonstration ---
Absolute RMSE for both combined: $10000.00
RMSLE (Relative error): 0.0678
RMSLE naturally normalizes the errors across huge price differences.

--- Dataset Overview ---
Shape (Rows, Columns): (1459, 79)



KeyError: 'SalePrice'