# Data Cleaning & Feature Engineering

This notebook transforms the raw scraped data into a clean, model-ready dataset through systematic cleaning and feature engineering.

## Load Data

In [23]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('../data/02_intermediate/house_data_with_city.csv')
print(f"Initial dataset: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

Initial dataset: 949 rows, 9 columns


Unnamed: 0,URL,Price,Address,Bedrooms,Bathrooms,House_Size,Land_Size,Description,City
0,https://ikman.lk/en/ad/swimming-pool-with-luxu...,69000000,Kesbewa,5,5,"2,800.0 sqft",20.0 perches,✳️ Brand New Super Luxury Modern Villa’s For S...,Kesbewa
1,https://ikman.lk/en/ad/modern-house-for-sale-i...,200000000,,9,8,"9,000.0 sqft",20.0 perches,luxury New House for sale in Thalawathugoda€12...,Thalawathugoda
2,https://ikman.lk/en/ad/luxury-house-for-sale-i...,39000000,Kesbewa,4,4,"2,680.0 sqft",6.2 perches,"✅ පිලියන්දල ටවුමට නුදුරුව, කැස්බැව හංදියට ඇවිද...",Kesbewa
3,https://ikman.lk/en/ad/kaduwela-nawagamuwa-two...,19000000,Kaduwela Nawagamuwa,4,2,"2,280.0 sqft",10.5 perches,TWO STORY HOUSE FOR SALE IN KADUWELA NAWAGAMU...,Kaduwela
4,https://ikman.lk/en/ad/single-storied-best-hou...,22000000,"Galwarusawa Rd, Athurugiriya",3,2,"1,366.0 sqft",8.15 perches,Single Storied House For Sale Galwarusawa Roa...,Athurugiriya


## Data Quality Assessment

Analyze missing values and data completeness before processing.

In [24]:
print("Missing Values Analysis:")
print("=" * 50)
missing_summary = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percent': (df.isnull().sum() / len(df) * 100).round(2)
})
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
if len(missing_summary) > 0:
    print(missing_summary.to_string(index=False))
else:
    print("No missing values detected!")

print(f"\nTotal missing cells: {df.isnull().sum().sum()}")
print(f"Data completeness: {(1 - df.isnull().sum().sum()/(df.shape[0]*df.shape[1]))*100:.2f}%")

Missing Values Analysis:
 Column  Missing_Count  Missing_Percent
Address             77             8.11

Total missing cells: 77
Data completeness: 99.10%


## Step 1: Clean House_Size & Land_Size

Strip the unit suffixes (`sqft`, `perches`) and thousands separators, then convert both columns to floats so they can be used as numeric features.

In [25]:
def clean_size(value):
    """Extract numeric value from size strings"""
    if pd.isna(value):
        return np.nan
    value_str = str(value)
    value_str = re.sub(r'[^\d.]', '', value_str)
    try:
        return float(value_str) if value_str else np.nan
    except ValueError:
        return np.nan

df['House_Size'] = df['House_Size'].apply(clean_size)
df['Land_Size'] = df['Land_Size'].apply(clean_size)

print(f"House_Size - Non-null: {df['House_Size'].notna().sum()}, Mean: {df['House_Size'].mean():.2f}")
print(f"Land_Size - Non-null: {df['Land_Size'].notna().sum()}, Mean: {df['Land_Size'].mean():.2f}")

House_Size - Non-null: 949, Mean: 3027.20
Land_Size - Non-null: 949, Mean: 12.58
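To make the cleaning rule concrete, the sketch below re-declares a compact variant of `clean_size` and runs it on strings in the formats shown in the preview table. The sample strings are illustrative, not rows from the dataset. It assumes, as the notebook does, that every House_Size is in sqft and every Land_Size in perches: the regex throws the unit away, so a value listed in another unit would pass through unconverted.

```python
import re

import numpy as np
import pandas as pd


def clean_size(value):
    """Strip everything except digits and dots, then parse as float."""
    if pd.isna(value):
        return np.nan
    digits = re.sub(r"[^\d.]", "", str(value))
    try:
        return float(digits) if digits else np.nan
    except ValueError:
        # Reached when stripping leaves something like "1.2.3"
        return np.nan


# Illustrative inputs: two well-formed sizes, a placeholder string, a missing value
samples = ["2,800.0 sqft", "6.2 perches", "N/A", None]
cleaned = [clean_size(s) for s in samples]
print(cleaned)
```

Strings with no digits and missing values both come back as NaN, so they surface in the later missing-value handling instead of raising.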


## Step 1b: Outlier Detection & Handling

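Remove extreme outliers from the size columns using the IQR method. A multiplier of 3.0 (rather than the conventional 1.5) is used so that only far-out values are dropped.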

In [26]:
def remove_outliers_iqr(df, column, multiplier=3.0):
    """Remove outliers using IQR method with adjustable sensitivity"""
    if df[column].notna().sum() == 0:
        return df
    
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    
    # Keep lower bound at 0 for sizes (can't be negative)
    lower_bound = max(0, lower_bound)
    
    initial_count = len(df)
    outliers = ((df[column] < lower_bound) | (df[column] > upper_bound)) & df[column].notna()
    df_filtered = df[~outliers].copy()
    removed = initial_count - len(df_filtered)
    
    print(f"{column}:")
    print(f"  Range: [{lower_bound:.2f}, {upper_bound:.2f}]")
    print(f"  Removed: {removed} outliers ({removed/initial_count*100:.2f}%)")
    
    return df_filtered

initial_size = len(df)

# Remove outliers from size columns
df = remove_outliers_iqr(df, 'House_Size', multiplier=3.0)
df = remove_outliers_iqr(df, 'Land_Size', multiplier=3.0)

total_removed = initial_size - len(df)
print(f"\nTotal rows removed: {total_removed} ({total_removed/initial_size*100:.2f}%)")
print(f"Remaining rows: {len(df)}")

House_Size:
  Range: [0.00, 8000.00]
  Removed: 18 outliers (1.90%)
Land_Size:
  Range: [0.00, 34.20]
  Removed: 23 outliers (2.47%)

Total rows removed: 41 (4.32%)
Remaining rows: 908
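As a worked example of how the bounds are derived, the sketch below applies the same rule to six made-up values. The helper name `iqr_bounds` and the toy data are assumptions for illustration; the logic mirrors `remove_outliers_iqr`, including the clamp of the lower bound at zero.

```python
import pandas as pd


def iqr_bounds(values, multiplier=3.0):
    """Return (lower, upper) IQR fences, with the lower fence clamped at 0."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = max(0.0, q1 - multiplier * iqr)
    upper = q3 + multiplier * iqr
    return lower, upper


# Toy sizes: five plausible values and one extreme value
sizes = pd.Series([1000, 2000, 3000, 4000, 5000, 50000], dtype=float)

# Q1 = 2250, Q3 = 4750, IQR = 2500, so the fences are [0, 12250]
lower, upper = iqr_bounds(sizes)

# Same mask shape as the notebook: rows outside the fences are dropped, NaN rows are kept
outliers = (sizes < lower) | (sizes > upper)
kept = sizes[~outliers]
print(lower, upper, len(kept))
```

With the conventional multiplier of 1.5 the upper fence on this toy data would be 8500, which shows how the 3.0 setting widens the accepted range.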


## Step 2: Logic Check & Filtering

Drop rows whose bedroom, bathroom, or price values are missing or implausible, so they do not distort training.

In [27]:
initial_count = len(df)

# Convert to numeric
df['Bedrooms'] = pd.to_numeric(df['Bedrooms'], errors='coerce')
df['Bathrooms'] = pd.to_numeric(df['Bathrooms'], errors='coerce')
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

# Remove rows with critical missing values
print("Removing rows with missing critical values:")
before = len(df)
df = df.dropna(subset=['Bedrooms', 'Bathrooms', 'Price'])
print(f"  Rows with NaN in critical columns: {before - len(df)}")

# Apply logical filters
print("\nApplying logical filters:")
before = len(df)
df = df[df['Bedrooms'] > 0]
print(f"  Bedrooms = 0: {before - len(df)}")

before = len(df)
df = df[df['Bathrooms'] > 0]
print(f"  Bathrooms = 0: {before - len(df)}")

before = len(df)
df = df[(df['Bedrooms'] <= 15) & (df['Bathrooms'] <= 15)]
print(f"  Unrealistic bedroom/bathroom counts (>15): {before - len(df)}")

before = len(df)
df = df[(df['Price'] >= 1_000_000) & (df['Price'] <= 2_000_000_000)]
print(f"  Price outliers: {before - len(df)}")

removed_count = initial_count - len(df)
print(f"\n{'='*50}")
print(f"Total removed: {removed_count} rows ({removed_count/initial_count*100:.1f}%)")
print(f"Remaining: {len(df)} rows")
print(f"Price range: LKR {df['Price'].min():,.0f} - LKR {df['Price'].max():,.0f}")
print(f"Bedrooms range: {df['Bedrooms'].min():.0f} - {df['Bedrooms'].max():.0f}")
print(f"Bathrooms range: {df['Bathrooms'].min():.0f} - {df['Bathrooms'].max():.0f}")

Removing rows with missing critical values:
  Rows with NaN in critical columns: 2

Applying logical filters:
  Bedrooms = 0: 0
  Bathrooms = 0: 0
  Unrealistic bedroom/bathroom counts (>15): 0
  Price outliers: 0

Total removed: 2 rows (0.2%)
Remaining: 906 rows
Price range: LKR 1,500,000 - LKR 850,000,000
Bedrooms range: 2 - 10
Bathrooms range: 1 - 6
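The same checks can also be written as a table of named rules, which keeps the thresholds in one place. This is a sketch of an alternative layout, not the notebook's code: the rule names, `apply_rules`, and the toy rows are hypothetical. Unlike the sequential version above, each count here is computed against the full input, so a row failing two rules is counted under both.

```python
import pandas as pd

# Thresholds copied from the filtering cell above
rules = {
    "bedrooms_positive": lambda d: d["Bedrooms"] > 0,
    "bathrooms_positive": lambda d: d["Bathrooms"] > 0,
    "rooms_at_most_15": lambda d: (d["Bedrooms"] <= 15) & (d["Bathrooms"] <= 15),
    "price_in_range": lambda d: d["Price"].between(1_000_000, 2_000_000_000),
}


def apply_rules(frame, rules):
    """Return the rows passing every rule plus a per-rule failure count."""
    keep = pd.Series(True, index=frame.index)
    failures = {}
    for name, rule in rules.items():
        passed = rule(frame)
        failures[name] = int((~passed).sum())
        keep &= passed
    return frame[keep], failures


# Made-up rows: one valid, one with 0 bedrooms, one underpriced, one with 20 bedrooms
toy = pd.DataFrame({
    "Bedrooms": [3, 0, 4, 20],
    "Bathrooms": [2, 1, 3, 2],
    "Price": [25_000_000, 30_000_000, 500_000, 40_000_000],
})
valid, failures = apply_rules(toy, rules)
print(failures)
print(len(valid))
```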


## Step 3: Feature Engineering - Text Extraction

Derive two binary flags from the free-text description by keyword matching: `Is_Brand_New` and `Is_Modern`.

In [41]:
def extract_binary_features(description):
    """Extract key property attributes from text"""
    if pd.isna(description):
        return 0, 0
    
    desc_lower = str(description).lower()
    
    is_brand_new = 1 if any(keyword in desc_lower for keyword in ['brand new', 'newly built', 'new house']) else 0
    is_modern = 1 if any(keyword in desc_lower for keyword in ['modern', 'luxury', 'contemporary']) else 0
    
    return is_brand_new, is_modern

df[['Is_Brand_New', 'Is_Modern']] = df['Description'].apply(
    lambda x: pd.Series(extract_binary_features(x))
)

print(f"Brand New: {df['Is_Brand_New'].sum()} ({df['Is_Brand_New'].mean()*100:.1f}%)")
print(f"Modern/Luxury: {df['Is_Modern'].sum()} ({df['Is_Modern'].mean()*100:.1f}%)")

Brand New: 181 (20.0%)
Modern/Luxury: 377 (41.6%)
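A vectorized variant of the keyword matching is sketched below. It uses the same keyword lists but differs in one respect that should be treated as an assumption: it matches on word boundaries, whereas the cell above matches plain substrings (so `modern` also fires on `modernized`). The helper name `keyword_flag` and the sample descriptions are illustrative.

```python
import re

import pandas as pd

BRAND_NEW = ["brand new", "newly built", "new house"]
MODERN = ["modern", "luxury", "contemporary"]


def keyword_flag(text, keywords):
    """Return 1 where any keyword appears as a whole word or phrase, else 0."""
    # Non-capturing group so pandas does not warn about match groups
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return text.str.contains(pattern, case=False, na=False, regex=True).astype(int)


descriptions = pd.Series([
    "Brand New Super Luxury Modern Villa",
    "Single storied house for sale",
    None,  # missing descriptions yield 0 for both flags
])
is_brand_new = keyword_flag(descriptions, BRAND_NEW)
is_modern = keyword_flag(descriptions, MODERN)
print(is_brand_new.tolist(), is_modern.tolist())
```

Either approach only sees English keywords, so listings written in Sinhala will always receive 0 for both flags.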


## Step 4: Feature Engineering - City Tiering

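Group the cities into six tiers by median listing price, replacing a high-cardinality categorical column with a single ordinal feature. Tier 1 holds the most expensive cities and Tier 6 the least expensive.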

In [42]:
city_median_prices = df.groupby('City')['Price'].median().sort_values(ascending=False)
print(f"Total unique cities: {len(city_median_prices)}")
print(f"\nTop 10 most expensive cities:")
print(city_median_prices.head(10))

Total unique cities: 57

Top 10 most expensive cities:
City
Colombo 4     300000000.0
Colombo 7     235000000.0
Nawala        133000000.0
Kohuwala      132500000.0
Kotte         126000000.0
Colombo 1     118000000.0
Dehiwala      115000000.0
Kalubowila     97500000.0
Colombo 3      97000000.0
Colombo 2      95500000.0
Name: Price, dtype: float64


In [None]:
n_tiers = 6
city_median_prices = df.groupby('City')['Price'].median().sort_values(ascending=False)

# Tier 1 = Luxury (highest price), Tier 6 = Budget (lowest price)
tier_labels = {1: 'Luxury', 2: 'Premium', 3: 'Upper-Mid', 4: 'Mid-Range', 5: 'Affordable', 6: 'Budget'}

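# NOTE: if duplicates='drop' actually merges bin edges, fewer than n_tiers bins remain
# and qcut raises a ValueError because the label count no longer matches the bins.
city_tiers = pd.qcut(city_median_prices, q=n_tiers, labels=range(n_tiers, 0, -1), duplicates='drop')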
city_tier_map = city_tiers.to_dict()

df['City_Tier'] = df['City'].map(city_tier_map)

print(f"City distribution across tiers:")
for tier in range(1, n_tiers+1):
    tier_cities = [city for city, t in city_tier_map.items() if t == tier]
    tier_count = (df['City_Tier'] == tier).sum()
    cities_list = ', '.join(sorted(tier_cities))
    tier_median = df[df['City'].isin(tier_cities)]['Price'].median()
    print(f"  Tier {tier} ({tier_labels[tier]}): {len(tier_cities)} cities, {tier_count} properties, Median: LKR {tier_median:,.0f}")
    print(f"    Cities: {cities_list}")

City distribution across tiers:
  Tier 1 (Luxury): 10 cities, 92 properties, Median: LKR 125,000,000
    Cities: Colombo 1, Colombo 2, Colombo 3, Colombo 4, Colombo 7, Dehiwala, Kalubowila, Kohuwala, Kotte, Nawala
  Tier 2 (Premium): 9 cities, 166 properties, Median: LKR 74,750,000
    Cities: Battaramulla, Borella, Colombo 5, Colombo 6, Mount Lavinia, Nugegoda, Pelawatte, Rajagiriya, Ratmalana
  Tier 3 (Upper-Mid): 9 cities, 151 properties, Median: LKR 59,000,000
    Cities: Angoda, Avissawella, Boralesgamuwa, Colombo 10, Hokandara, Moratuwa, Pannipitiya, Talawatugoda, Thalawathugoda
  Tier 4 (Mid-Range): 10 cities, 233 properties, Median: LKR 39,000,000
    Cities: Bokundara, Colombo 13, Colombo 15, Dematagoda, Kesbewa, Madapatha, Maharagama, Malabe, Pettah, Piliyandala
  Tier 5 (Affordable): 9 cities, 203 properties, Median: LKR 29,500,000
    Cities: Athurugiriya, Colombo 12, Diyagama, Fort, Kahathuduwa, Kolonnawa, Kottawa, Mattegoda, Polgasowita
  Tier 6 (Budget): 10 cities, 61 pr
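Because the tiers above are built from median prices over every row, the tier assigned to a test listing is influenced by test prices. One way to avoid that is sketched below: fit the tier map on training rows only, then apply it to any other frame. The function names, the `default_tier` fallback for cities absent from training, and the toy data are all assumptions, not part of the notebook. Ranking the medians first guarantees distinct bin edges, so `duplicates='drop'` is not needed.

```python
import pandas as pd


def fit_city_tiers(train_df, n_tiers=6):
    """Map each training city to a tier; tier 1 = highest median price."""
    medians = train_df.groupby("City")["Price"].median()
    # rank(method="first") breaks ties, so qcut always sees distinct values
    ranks = medians.rank(method="first", ascending=False)
    tiers = pd.qcut(ranks, q=n_tiers, labels=list(range(1, n_tiers + 1)))
    return tiers.astype(int).to_dict()


def apply_city_tiers(df, tier_map, default_tier):
    """Look up tiers; cities not seen in training get default_tier."""
    return df["City"].map(tier_map).fillna(default_tier).astype(int)


# Toy data with placeholder city names and prices
train = pd.DataFrame({
    "City": ["A", "A", "B", "C", "D"],
    "Price": [100, 120, 80, 50, 20],
})
test = pd.DataFrame({"City": ["B", "Z"]})  # "Z" never appears in training

tier_map = fit_city_tiers(train, n_tiers=2)
test_tiers = apply_city_tiers(test, tier_map, default_tier=2)
print(tier_map)
print(test_tiers.tolist())
```

In this notebook that would mean performing the train-test split before Step 4 and stratifying on something other than `City_Tier`, which is a larger restructuring than the sketch shows.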

## Step 5: Encoding

Cast the categorical tier labels to plain integers so the model sees `City_Tier` as an ordinal feature (1 = most expensive, 6 = least expensive).

In [44]:
df['City_Tier'] = df['City_Tier'].astype(int)

print(f"City_Tier encoding complete")
print(f"Value range: {df['City_Tier'].min()} to {df['City_Tier'].max()}")
print(f"\nCity Tier vs Median Price:")
print(df.groupby('City_Tier')['Price'].median().sort_index())

City_Tier encoding complete
Value range: 1 to 6

City Tier vs Median Price:
City_Tier
1    125000000.0
2     74750000.0
3     59000000.0
4     39000000.0
5     29500000.0
6     19000000.0
Name: Price, dtype: float64


## Step 5b: Train-Test Split

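Split the data before imputation and scaling so that the imputation medians and the scaler are learned from training rows only. `City_Tier` is the exception: it was derived in Step 4 from median prices over the full dataset, so it still carries some target information from the test rows.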

In [45]:
# Create target variable before split
df['Price_Log'] = np.log(df['Price'])

# Prepare features and target
feature_cols = ['Bedrooms', 'Bathrooms', 'House_Size', 'Land_Size', 'Is_Brand_New', 'Is_Modern', 'City_Tier']
X = df[feature_cols].copy()
y = df['Price_Log'].copy()

# Split into train and test sets (80-20 split, stratified by City_Tier)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=df['City_Tier']
)

print(f"Training set: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")
print(f"\nTraining set price range: LKR {np.exp(y_train.min()):,.0f} - LKR {np.exp(y_train.max()):,.0f}")
print(f"Test set price range: LKR {np.exp(y_test.min()):,.0f} - LKR {np.exp(y_test.max()):,.0f}")

# City tier distribution in train/test
print(f"\nCity Tier distribution:")
print("Train set:")
print(X_train['City_Tier'].value_counts().sort_index())
print("\nTest set:")
print(X_test['City_Tier'].value_counts().sort_index())

Training set: 724 samples (79.9%)
Test set: 182 samples (20.1%)

Training set price range: LKR 1,500,000 - LKR 850,000,000
Test set price range: LKR 10,500,000 - LKR 270,000,000

City Tier distribution:
Train set:
City_Tier
1     73
2    133
3    121
4    186
5    162
6     49
Name: count, dtype: int64

Test set:
City_Tier
1    19
2    33
3    30
4    47
5    41
6    12
Name: count, dtype: int64
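Since the target is the natural log of the price, anything predicted in log space has to be exponentiated to get back to LKR. The sketch below shows the round trip using the training-set minimum and maximum from the output above; `predicted_log` is an illustrative number, not a model output.

```python
import numpy as np

# Training-set price range reported above, in LKR
prices = np.array([1_500_000, 850_000_000], dtype=float)

log_prices = np.log(prices)      # roughly 14.22 and 20.56
recovered = np.exp(log_prices)   # inverse transform returns the original prices

# A hypothetical prediction in log space converts back the same way
predicted_log = 17.5
predicted_price = np.exp(predicted_log)

print(log_prices.round(2))
print(f"LKR {predicted_price:,.0f}")
```

An error of `d` in log space corresponds to a multiplicative factor of `exp(d)` on the price, so errors on this target are relative rather than absolute.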


## Step 6: Scaling & Normalization

Impute and standardize the numeric features using statistics computed on the training set only, so that no information from the test set leaks into preprocessing.

In [46]:
# Handle missing values in size columns (impute with training median)
train_house_median = X_train['House_Size'].median()
train_land_median = X_train['Land_Size'].median()

X_train_filled = X_train.copy()
X_test_filled = X_test.copy()

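# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is deprecated and has no effect under pandas Copy-on-Write.
X_train_filled['House_Size'] = X_train_filled['House_Size'].fillna(train_house_median)
X_train_filled['Land_Size'] = X_train_filled['Land_Size'].fillna(train_land_median)
X_test_filled['House_Size'] = X_test_filled['House_Size'].fillna(train_house_median)
X_test_filled['Land_Size'] = X_test_filled['Land_Size'].fillna(train_land_median)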

print(f"Imputed missing House_Size with: {train_house_median:.2f}")
print(f"Imputed missing Land_Size with: {train_land_median:.2f}")

# Fit scaler ONLY on training data
scaler = StandardScaler()
columns_to_scale = ['Bedrooms', 'Bathrooms', 'House_Size', 'Land_Size']

scaler.fit(X_train_filled[columns_to_scale])

# Transform both train and test with the same scaler
X_train_scaled = X_train_filled.copy()
X_test_scaled = X_test_filled.copy()

X_train_scaled[columns_to_scale] = scaler.transform(X_train_filled[columns_to_scale])
X_test_scaled[columns_to_scale] = scaler.transform(X_test_filled[columns_to_scale])

print(f"\nScaled features: {columns_to_scale}")
print(f"\nTraining set statistics (after scaling):")
for col in columns_to_scale:
    print(f"  {col}: Mean={X_train_scaled[col].mean():.2f}, Std={X_train_scaled[col].std():.2f}")

print(f"\nTest set statistics (after scaling):")
for col in columns_to_scale:
    print(f"  {col}: Mean={X_test_scaled[col].mean():.2f}, Std={X_test_scaled[col].std():.2f}")

print(f"\nTarget variable (Price_Log) statistics:")
print(f"  Training: Mean={y_train.mean():.2f}, Std={y_train.std():.2f}, Skewness={y_train.skew():.2f}")
print(f"  Test: Mean={y_test.mean():.2f}, Std={y_test.std():.2f}, Skewness={y_test.skew():.2f}")

Imputed missing House_Size with: 2800.00
Imputed missing Land_Size with: 9.60

Scaled features: ['Bedrooms', 'Bathrooms', 'House_Size', 'Land_Size']

Training set statistics (after scaling):
  Bedrooms: Mean=0.00, Std=1.00
  Bathrooms: Mean=-0.00, Std=1.00
  House_Size: Mean=-0.00, Std=1.00
  Land_Size: Mean=0.00, Std=1.00

Test set statistics (after scaling):
  Bedrooms: Mean=-0.07, Std=0.92
  Bathrooms: Mean=-0.05, Std=0.91
  House_Size: Mean=-0.12, Std=0.86
  Land_Size: Mean=0.01, Std=1.06

Target variable (Price_Log) statistics:
  Training: Mean=17.69, Std=0.70, Skewness=0.44
  Test: Mean=17.65, Std=0.64, Skewness=0.41
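The training standard deviations print as 1.00 only after rounding. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` reports the sample standard deviation (`ddof=1`), so the exact value is `sqrt(n / (n - 1))`. The sketch below reproduces this with NumPy on synthetic numbers; the random column is a stand-in, not the notebook's data.

```python
import numpy as np

n = 724  # number of training rows in this notebook
rng = np.random.default_rng(0)
x = rng.normal(loc=2800, scale=900, size=n)  # synthetic stand-in for one feature

# Standardize the way StandardScaler does: subtract the mean, divide by the ddof=0 std
z = (x - x.mean()) / x.std(ddof=0)

# Measure the spread the way pandas does by default (ddof=1)
sample_std = z.std(ddof=1)
expected = np.sqrt(n / (n - 1))

print(f"{sample_std:.6f} {expected:.6f}")
```

For `n = 724` this ratio is about 1.000691, which is the `std` value pandas shows for a scaled training column when it is printed at full precision.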


## Step 7: Correlation Analysis

Analyze feature correlations to understand relationships and detect multicollinearity.

In [47]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate correlation matrix on scaled training data
correlation_matrix = X_train_scaled.corr()

print("Feature Correlation Matrix:")
print("=" * 70)
print(correlation_matrix.round(3))

# Identify highly correlated pairs (|corr| > 0.7)
print(f"\n{'='*70}")
print("Highly Correlated Feature Pairs (|correlation| > 0.7):")
print("=" * 70)
high_corr_found = False
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.7:
            print(f"  {correlation_matrix.columns[i]} <-> {correlation_matrix.columns[j]}: {corr_val:.3f}")
            high_corr_found = True

if not high_corr_found:
    print("  No highly correlated pairs found. Features are relatively independent.")

# Correlation with target
train_data_with_target = X_train_scaled.copy()
train_data_with_target['Price_Log'] = y_train.values

target_correlation = train_data_with_target.corr()['Price_Log'].sort_values(ascending=False)
print(f"\n{'='*70}")
print("Feature Correlation with Target (Price_Log):")
print("=" * 70)
for feature, corr in target_correlation.items():
    if feature != 'Price_Log':
        print(f"  {feature}: {corr:.3f}")

Feature Correlation Matrix:
              Bedrooms  Bathrooms  House_Size  Land_Size  Is_Brand_New  \
Bedrooms         1.000      0.607       0.614      0.315        -0.053   
Bathrooms        0.607      1.000       0.667      0.162         0.042   
House_Size       0.614      0.667       1.000      0.407        -0.039   
Land_Size        0.315      0.162       0.407      1.000        -0.205   
Is_Brand_New    -0.053      0.042      -0.039     -0.205         1.000   
Is_Modern        0.071      0.218       0.216     -0.053         0.265   
City_Tier       -0.310     -0.388      -0.458     -0.276         0.166   

              Is_Modern  City_Tier  
Bedrooms          0.071     -0.310  
Bathrooms         0.218     -0.388  
House_Size        0.216     -0.458  
Land_Size        -0.053     -0.276  
Is_Brand_New      0.265      0.166  
Is_Modern         1.000     -0.105  
City_Tier        -0.105      1.000  

Highly Correlated Feature Pairs (|correlation| > 0.7):
  No highly correlated pair

## Step 8: Prepare Final Datasets

Combine features and target into final train/test datasets ready for modeling.

In [48]:
# Add target variable to create complete datasets
train_data = X_train_scaled.copy()
train_data['Price_Log'] = y_train.values

test_data = X_test_scaled.copy()
test_data['Price_Log'] = y_test.values

print(f"Final Training Dataset: {train_data.shape[0]} rows, {train_data.shape[1]} columns")
print(f"Final Test Dataset: {test_data.shape[0]} rows, {test_data.shape[1]} columns")

print(f"\nFinal features:")
for col in train_data.columns:
    if col != 'Price_Log':
        print(f"  - {col}")
print(f"\nTarget variable: Price_Log (log-transformed price)")

print(f"\nTraining set summary:")
print(train_data.describe())

print(f"\nTest set summary:")
print(test_data.describe())

Final Training Dataset: 724 rows, 8 columns
Final Test Dataset: 182 rows, 8 columns

Final features:
  - Bedrooms
  - Bathrooms
  - House_Size
  - Land_Size
  - Is_Brand_New
  - Is_Modern
  - City_Tier

Target variable: Price_Log (log-transformed price)

Training set summary:
           Bedrooms     Bathrooms    House_Size     Land_Size  Is_Brand_New  \
count  7.240000e+02  7.240000e+02  7.240000e+02  7.240000e+02    724.000000   
mean   6.869888e-17 -1.472119e-16 -8.342007e-17  2.441264e-16      0.209945   
std    1.000691e+00  1.000691e+00  1.000691e+00  1.000691e+00      0.407550   
min   -1.884979e+00 -1.926143e+00 -2.282153e+00 -2.003928e+00      0.000000   
25%   -9.817861e-01 -1.044089e+00 -6.893001e-01 -7.178361e-01      0.000000   
50%   -7.859279e-02 -1.620348e-01 -5.215899e-02 -2.401449e-01      0.000000   
75%    8.246005e-01  7.200194e-01  5.053394e-01  3.477827e-01      0.000000   
max    5.340567e+00  2.484128e+00  4.089258e+00  4.242803e+00      1.000000   

        Is_

## Save Processed Data & Artifacts

In [49]:
import os

# Create directories if they don't exist
os.makedirs('../data/03_processed', exist_ok=True)
os.makedirs('../models', exist_ok=True)

# Save training and test datasets
train_path = '../data/03_processed/train_data.csv'
test_path = '../data/03_processed/test_data.csv'

train_data.to_csv(train_path, index=False)
test_data.to_csv(test_path, index=False)

print(f"✓ Training data saved to: {train_path}")
print(f"  Shape: {train_data.shape[0]} rows, {train_data.shape[1]} columns")

print(f"\n✓ Test data saved to: {test_path}")
print(f"  Shape: {test_data.shape[0]} rows, {test_data.shape[1]} columns")

# Save preprocessing artifacts for production use
preprocessing_artifacts = {
    'scaler': scaler,
    'imputation_values': {
        'House_Size_median': train_house_median,
        'Land_Size_median': train_land_median
    },
    'feature_columns': columns_to_scale,  # only the four scaled numeric columns, not the full feature list
    'tier_labels': tier_labels,
    'city_tier_map': city_tier_map
}

artifacts_path = '../models/preprocessing_artifacts.pkl'
joblib.dump(preprocessing_artifacts, artifacts_path)

print(f"\n✓ Preprocessing artifacts saved to: {artifacts_path}")
print(f"  Contains: scaler, imputation values, feature columns, city mappings")

print(f"\n{'='*70}")
print(f"SUMMARY")
print(f"{'='*70}")
print(f"✓ Training samples: {len(train_data)}")
print(f"✓ Test samples: {len(test_data)}")
print(f"✓ Features: {len([col for col in train_data.columns if col != 'Price_Log'])}")
print(f"✓ Target: Price_Log (log-transformed price)")
print(f"✓ All artifacts saved and ready for modeling!")

✓ Training data saved to: ../data/03_processed/train_data.csv
  Shape: 724 rows, 8 columns

✓ Test data saved to: ../data/03_processed/test_data.csv
  Shape: 182 rows, 8 columns

✓ Preprocessing artifacts saved to: ../models/preprocessing_artifacts.pkl
  Contains: scaler, imputation values, feature columns, city mappings

SUMMARY
✓ Training samples: 724
✓ Test samples: 182
✓ Features: 7
✓ Target: Price_Log (log-transformed price)
✓ All artifacts saved and ready for modeling!
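The saved artifacts are meant to be reapplied to new listings at prediction time. The sketch below shows one way that could look. `prepare_listing`, the `default_tier` fallback, and the toy artifacts are assumptions for illustration; in practice the dictionary would come from `joblib.load('../models/preprocessing_artifacts.pkl')`, and the text flags and size cleaning would already have been applied to the raw listing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_listing(raw, artifacts, default_tier=6):
    """Turn one cleaned listing (a dict) into a model-ready feature row."""
    scaled_cols = artifacts["feature_columns"]  # the scaled numeric columns
    medians = artifacts["imputation_values"]
    row = pd.DataFrame([raw])
    row["House_Size"] = row["House_Size"].fillna(medians["House_Size_median"])
    row["Land_Size"] = row["Land_Size"].fillna(medians["Land_Size_median"])
    # Cities missing from the saved map fall back to default_tier (an assumption)
    row["City_Tier"] = (
        row["City"].map(artifacts["city_tier_map"]).fillna(default_tier).astype(int)
    )
    scaled = pd.DataFrame(
        artifacts["scaler"].transform(row[scaled_cols]),
        columns=scaled_cols,
        index=row.index,
    )
    rest = row[["Is_Brand_New", "Is_Modern", "City_Tier"]]
    return pd.concat([scaled, rest], axis=1)


# Toy stand-in for the saved artifacts, with the same keys as the dictionary above
toy_train = pd.DataFrame({
    "Bedrooms": [3.0, 4.0, 5.0],
    "Bathrooms": [2.0, 3.0, 4.0],
    "House_Size": [2000.0, 2800.0, 3600.0],
    "Land_Size": [8.0, 10.0, 12.0],
})
artifacts = {
    "scaler": StandardScaler().fit(toy_train),
    "imputation_values": {"House_Size_median": 2800.0, "Land_Size_median": 10.0},
    "feature_columns": list(toy_train.columns),
    "city_tier_map": {"Kesbewa": 4},
}

listing = {
    "Bedrooms": 4.0, "Bathrooms": 3.0, "House_Size": np.nan, "Land_Size": 10.0,
    "Is_Brand_New": 1, "Is_Modern": 1, "City": "Kesbewa",
}
features = prepare_listing(listing, artifacts)
print(features)
```

The column order of the returned row matches the `feature_cols` list used for training, which matters for models that rely on positional features.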


## Pipeline Summary

**Data Quality Improvements:**
- Missing value analysis and handling
- Outlier removal using IQR method (3x multiplier)
- Cleaned numeric fields (House_Size, Land_Size)
- Comprehensive validation (bedrooms, bathrooms, price ranges)
- Removed invalid/extreme entries

**Feature Engineering:**
- Extracted 2 binary features from descriptions (Is_Brand_New, Is_Modern)
- Created City_Tier to reduce cardinality (6 tiers from Luxury to Budget)
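- Log-transformed Price to reduce right skew in the target
- Standardized the four count and size features (Bedrooms, Bathrooms, House_Size, Land_Size) with StandardScaler; the binary flags and City_Tier are left unscaled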

**Best Practices Applied:**
- ✅ Train-Test split (80-20) BEFORE scaling to prevent data leakage
- ✅ Scaler fitted only on training data
- ✅ Stratified split to maintain City_Tier distribution
- ✅ Saved preprocessing artifacts for production deployment
- ✅ Correlation analysis to detect multicollinearity
- ✅ Comprehensive validation and quality checks

**Output Files:**
- `train_data.csv` - Training dataset ready for modeling
- `test_data.csv` - Test dataset for final evaluation
- `preprocessing_artifacts.pkl` - Scaler and mappings for production use

**Ready for Modeling!** 🎯