# Improved Real Estate Price Prediction Preprocessing
## Tunisian Real Estate Dataset

**Key Improvements:**
1. Separate pipelines for rent and sale transactions
2. Better handling of high-cardinality features (region)
3. Feature engineering (price per sqm, region aggregations)
4. Proper treatment of ordinal features
5. Rare category grouping

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib
import os
import warnings
warnings.filterwarnings('ignore')

## 1. Load and Explore Data

In [2]:
# Update this path to your data location
DATA_PATH = "C:\\Users\\user\\OneDrive\\Bureau\\Data Mining Projecy\\Tunisan-Real-Estate-Price-Prediction-Platform\\ML\\data\\processed\\cleaned_real_estate.csv"
df = pd.read_csv(DATA_PATH)

print(f"Total records: {len(df):,}")
print(f"\nTransaction distribution:")
print(df['transaction'].value_counts())
print(f"\nRent: {len(df[df['transaction']=='rent']):,} ({len(df[df['transaction']=='rent'])/len(df)*100:.1f}%)")
print(f"Sale: {len(df[df['transaction']=='sale']):,} ({len(df[df['transaction']=='sale'])/len(df)*100:.1f}%)")

Total records: 9,744

Transaction distribution:
transaction
rent    5636
sale    4108
Name: count, dtype: int64

Rent: 5,636 (57.8%)
Sale: 4,108 (42.2%)


## 2. Feature Engineering

Derive additional features from surface, room counts, and region:

In [3]:
# Feature 1: Price per square meter. It is derived from the target, so it is
# kept for exploration only and is not part of the model features defined later.
df['price_per_sqm'] = df['price'] / df['surface']

# Feature 2: Total rooms (rooms + bathrooms)
df['total_rooms'] = df['rooms'] + df['bathrooms']

# Feature 3: Surface category (small, medium, large)
df['surface_category'] = pd.cut(df['surface'], 
                                  bins=[0, 80, 150, 300, 10000],
                                  labels=['small', 'medium', 'large', 'very_large'])

# Feature 4: Region frequency (how common is this region?)
region_counts = df['region'].value_counts()
df['region_frequency'] = df['region'].map(region_counts)

# Feature 5: Group rare regions (< 20 occurrences) into 'other'
df['region_grouped'] = df['region'].copy()
rare_regions = region_counts[region_counts < 20].index
df.loc[df['region'].isin(rare_regions), 'region_grouped'] = 'other_region'

print(f"Original regions: {df['region'].nunique()}")
print(f"Grouped regions: {df['region_grouped'].nunique()}")
print(f"Rare regions grouped: {len(rare_regions)}")

Original regions: 267
Grouped regions: 100
Rare regions grouped: 168
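
The cell above computes `region_frequency` and the rare-region list on the full dataset, before the train-test split and before rent and sale are separated. A minimal sketch of a split-aware variant follows, assuming the counts are learned on training rows only and then reused on other rows. The helper names `fit_region_stats` and `apply_region_stats` are hypothetical, and the threshold is lowered to 2 only to keep the toy data small.

```python
import pandas as pd


def fit_region_stats(train_regions: pd.Series, min_count: int = 20):
    """Learn region counts and the set of frequent regions from training rows only."""
    counts = train_regions.value_counts()
    frequent = set(counts[counts >= min_count].index)
    return counts, frequent


def apply_region_stats(regions: pd.Series, counts: pd.Series, frequent: set) -> pd.DataFrame:
    """Apply the training-time statistics to any split (train, test, or new data)."""
    # Regions that are rare or unseen in training collapse into 'other_region'.
    grouped = regions.where(regions.isin(frequent), "other_region")
    # Unseen regions get a frequency of 0.
    frequency = regions.map(counts).fillna(0).astype(int)
    return pd.DataFrame({"region_grouped": grouped, "region_frequency": frequency})


# Toy data (hypothetical rows, not taken from the real dataset).
train = pd.Series(["Ariana", "Ariana", "Akouda", "Ariana"])
test = pd.Series(["Ariana", "Akouda", "Sousse"])  # "Sousse" never appears in train

counts, frequent = fit_region_stats(train, min_count=2)
result = apply_region_stats(test, counts, frequent)
print(result)
```

Keeping `counts` and `frequent` around also makes it possible to apply the same grouping to new listings at prediction time.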


## 3. Target Encoding for High Cardinality Features

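Region has too many distinct values to one-hot encode comfortably, so each region is replaced by a smoothed average of the target price in that region. The cell below only defines the helper; it is fit on the training split in section 7: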

In [4]:
def create_target_encoding(df, column, target, smoothing=1.0):
    """
    Create a smoothed target (mean) encoding for a categorical column.

    Categories with few samples are pulled towards the global mean;
    `smoothing` acts as the number of pseudo-observations at the global mean.
    Returns a tuple: (dict mapping category -> smoothed mean, global mean).
    """
    # Calculate global mean
    global_mean = df[target].mean()

    # Calculate aggregations per category
    agg = df.groupby(column)[target].agg(['mean', 'count'])

    # Smooth the means (for categories with few samples, pull towards global mean)
    agg['smoothed_mean'] = (
        (agg['count'] * agg['mean'] + smoothing * global_mean) /
        (agg['count'] + smoothing)
    )

    return agg['smoothed_mean'].to_dict(), global_mean
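
To make the smoothing concrete, here is a small sketch on toy data. It uses the same formula as the helper above, `(count * mean + smoothing * global_mean) / (count + smoothing)`, wrapped in a hypothetical function `smoothed_means` so the block runs on its own. The regions and prices are invented for illustration.

```python
import pandas as pd

# Region "A" has four listings at 100, region "B" has a single listing at 600.
toy = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B"],
    "price": [100.0, 100.0, 100.0, 100.0, 600.0],
})


def smoothed_means(df, column, target, smoothing):
    """Smoothed per-category mean, using the same formula as the notebook helper."""
    global_mean = df[target].mean()  # 200.0 for the toy data
    agg = df.groupby(column)[target].agg(["mean", "count"])
    return (agg["count"] * agg["mean"] + smoothing * global_mean) / (agg["count"] + smoothing)


light = smoothed_means(toy, "region", "price", smoothing=1.0)
heavy = smoothed_means(toy, "region", "price", smoothing=10.0)

print(light)  # A -> 120.0, B -> 400.0
print(heavy)  # both regions move much closer to the global mean of 200.0
```

With `smoothing=1.0` the single-listing region "B" moves from 600 to 400, while the four-listing region "A" only moves from 100 to 120. A larger smoothing value shrinks both more strongly, which is a trade-off to tune rather than a fixed rule.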

## 4. Split Data by Transaction Type

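**Critical Step:** Rent and sale prices sit on very different scales (training means of roughly 1,470 TND versus 478,000 TND, as the summary in section 12 shows), so the two transaction types are separated and modeled independently: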

In [5]:
# Split by transaction type
df_rent = df[df['transaction'] == 'rent'].copy()
df_sale = df[df['transaction'] == 'sale'].copy()

print(f"Rent dataset: {len(df_rent):,} records")
print(f"Sale dataset: {len(df_sale):,} records")

# Drop transaction column as it's no longer needed
df_rent = df_rent.drop(columns=['transaction'])
df_sale = df_sale.drop(columns=['transaction'])

Rent dataset: 5,636 records
Sale dataset: 4,108 records


## 5. Define Feature Groups (Target Encodings Are Fit After the Split)

In [6]:
# Target encodings are fit on the training set only (to avoid data leakage),
# so this cell only defines the target and the feature groups

TARGET = 'price'

# Define feature groups
CATEGORICAL_COLS = [
    'property_type',
    'city',
    'surface_category'
]

NUMERICAL_COLS = [
    'surface',
    'region_frequency'
]

# Room counts are ordered integers; they are kept numeric (and scaled later)
# rather than one-hot encoded
ORDINAL_COLS = [
    'rooms',
    'bathrooms',
    'total_rooms'
]

# High cardinality features for target encoding
TARGET_ENCODE_COLS = [
    'region_grouped'
]

ALL_FEATURES = CATEGORICAL_COLS + NUMERICAL_COLS + ORDINAL_COLS + TARGET_ENCODE_COLS

print(f"Total features: {len(ALL_FEATURES)}")
print(f"Categorical: {CATEGORICAL_COLS}")
print(f"Numerical: {NUMERICAL_COLS}")
print(f"Ordinal: {ORDINAL_COLS}")
print(f"Target encoded: {TARGET_ENCODE_COLS}")

Total features: 9
Categorical: ['property_type', 'city', 'surface_category']
Numerical: ['surface', 'region_frequency']
Ordinal: ['rooms', 'bathrooms', 'total_rooms']
Target encoded: ['region_grouped']


## 6. Train-Test Split (Separate for Each Transaction Type)

In [7]:
# RENT data split
X_rent = df_rent[ALL_FEATURES]
y_rent = df_rent[TARGET]

X_rent_train, X_rent_test, y_rent_train, y_rent_test = train_test_split(
    X_rent, y_rent,
    test_size=0.2,
    random_state=42
)

print("RENT SPLIT:")
print(f"  Train: {len(X_rent_train):,} samples")
print(f"  Test:  {len(X_rent_test):,} samples")

# SALE data split
X_sale = df_sale[ALL_FEATURES]
y_sale = df_sale[TARGET]

X_sale_train, X_sale_test, y_sale_train, y_sale_test = train_test_split(
    X_sale, y_sale,
    test_size=0.2,
    random_state=42
)

print("\nSALE SPLIT:")
print(f"  Train: {len(X_sale_train):,} samples")
print(f"  Test:  {len(X_sale_test):,} samples")

RENT SPLIT:
  Train: 4,508 samples
  Test:  1,128 samples

SALE SPLIT:
  Train: 3,286 samples
  Test:  822 samples
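
The splits above are purely random. One optional refinement, sketched here under the assumption that a balanced price distribution across train and test is desirable, is to stratify on quantile bins of the target. The toy frame and the choice of five bins are illustrative, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for one transaction type (hypothetical values).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "surface": rng.uniform(40, 300, size=200),
    "price": rng.lognormal(mean=13, sigma=0.8, size=200),
})

# Bin the continuous target into quintiles; the bins are used only for stratification.
price_bins = pd.qcut(toy["price"], q=5, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    toy[["surface"]], toy["price"],
    test_size=0.2,
    random_state=42,
    stratify=price_bins,
)

# Each price quintile contributes the same number of rows to the test set.
test_bin_counts = price_bins.loc[y_test.index].value_counts().sort_index()
print(test_bin_counts)
```

Whether this helps depends on the data; with heavy-tailed prices it can reduce the gap between train and test target statistics.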


## 7. Fit Target Encodings on Training Data Only

In [8]:
# Create target encodings for RENT
rent_region_encoding, rent_region_global = create_target_encoding(
    pd.concat([X_rent_train, y_rent_train], axis=1),
    'region_grouped',
    'price'
)

# Create target encodings for SALE  
sale_region_encoding, sale_region_global = create_target_encoding(
    pd.concat([X_sale_train, y_sale_train], axis=1),
    'region_grouped',
    'price'
)

print("Target encodings created successfully!")
print(f"\nRent - example region encodings:")
for region, value in list(rent_region_encoding.items())[:5]:
    print(f"  {region}: {value:.2f} TND")
    
print(f"\nSale - example region encodings:")
for region, value in list(sale_region_encoding.items())[:5]:
    print(f"  {region}: {value:.2f} TND")

Target encodings created successfully!

Rent - example region encodings:
  Ain Zaghouan Nord: 1617.42 TND
  Ain Zaghouan Sud: 1154.46 TND
  Ain Zaghouen: 1241.57 TND
  Akouda: 1232.24 TND
  Ariana: 950.28 TND

Sale - example region encodings:
  Ain Zaghouan Nord: 422716.83 TND
  Ain Zaghouan Sud: 615920.98 TND
  Ain Zaghouen: 297845.37 TND
  Akouda: 455418.39 TND
  Ariana: 583383.42 TND


## 8. Apply Target Encodings

In [9]:
# Apply to RENT data (assign returns a new frame instead of writing into a slice)
X_rent_train = X_rent_train.assign(region_encoded=X_rent_train['region_grouped'].map(rent_region_encoding).fillna(rent_region_global))
X_rent_test = X_rent_test.assign(region_encoded=X_rent_test['region_grouped'].map(rent_region_encoding).fillna(rent_region_global))

# Apply to SALE data
X_sale_train = X_sale_train.assign(region_encoded=X_sale_train['region_grouped'].map(sale_region_encoding).fillna(sale_region_global))
X_sale_test = X_sale_test.assign(region_encoded=X_sale_test['region_grouped'].map(sale_region_encoding).fillna(sale_region_global))

# Drop the original region column
X_rent_train = X_rent_train.drop(columns=['region_grouped'])
X_rent_test = X_rent_test.drop(columns=['region_grouped'])
X_sale_train = X_sale_train.drop(columns=['region_grouped'])
X_sale_test = X_sale_test.drop(columns=['region_grouped'])

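# Update feature lists (guarded so re-running the cell does not add a duplicate)
if 'region_encoded' not in NUMERICAL_COLS:
    NUMERICAL_COLS.append('region_encoded')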
print("Target encoding applied!")
print(f"\nUpdated numerical columns: {NUMERICAL_COLS}")

Target encoding applied!

Updated numerical columns: ['surface', 'region_frequency', 'region_encoded']
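
The fit and apply steps in sections 7 and 8 are repeated by hand for each split. A minimal sketch of the same idea as a scikit-learn style transformer follows, so the mapping is learned in `fit` and reused in `transform`. The class name `SmoothedTargetEncoder` is hypothetical and the toy prices are invented. Recent scikit-learn releases also provide a built-in `TargetEncoder` in `sklearn.preprocessing`, which may be worth comparing.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SmoothedTargetEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: smoothed mean encoding of one categorical column."""

    def __init__(self, column="region_grouped", smoothing=1.0):
        self.column = column
        self.smoothing = smoothing

    def fit(self, X, y):
        # Align the target with X by position, then aggregate per category.
        target = pd.Series(np.asarray(y, dtype=float), index=X.index)
        self.global_mean_ = float(target.mean())
        stats = target.groupby(X[self.column]).agg(["mean", "count"])
        smoothed = (
            stats["count"] * stats["mean"] + self.smoothing * self.global_mean_
        ) / (stats["count"] + self.smoothing)
        self.mapping_ = smoothed.to_dict()
        return self

    def transform(self, X):
        # Categories not seen during fit fall back to the training global mean.
        encoded = X[self.column].map(self.mapping_).fillna(self.global_mean_)
        return X.drop(columns=[self.column]).assign(region_encoded=encoded)


# Toy data (hypothetical values).
X_train = pd.DataFrame({"region_grouped": ["Ariana", "Ariana", "Akouda"],
                        "surface": [80.0, 100.0, 120.0]})
y_train = pd.Series([900.0, 1100.0, 1300.0])
X_test = pd.DataFrame({"region_grouped": ["Akouda", "Sousse"],
                       "surface": [90.0, 70.0]})

encoder = SmoothedTargetEncoder(smoothing=1.0).fit(X_train, y_train)
encoded_test = encoder.transform(X_test)
print(encoded_test)
```

Because the encoder only sees training rows in `fit`, it keeps the leakage protection of the manual approach while removing the duplicated rent and sale code.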


## 9. Create Preprocessing Pipelines

In [10]:
# Create the same preprocessor for both rent and sale
def create_preprocessor():
    return ColumnTransformer(
        transformers=[
            (
                'categorical',
                OneHotEncoder(handle_unknown='ignore', sparse_output=False),
                CATEGORICAL_COLS
            ),
            (
                'numerical',
                StandardScaler(),
                NUMERICAL_COLS
            ),
            (
                'ordinal',
                StandardScaler(),  # Counts are already ordered integers; 'passthrough' would also work
                ORDINAL_COLS
            )
        ]
    )

preprocessor_rent = create_preprocessor()
preprocessor_sale = create_preprocessor()

print("Preprocessing pipelines created!")

Preprocessing pipelines created!


## 10. Fit and Transform Data

In [11]:
# RENT preprocessing
X_rent_train_prepared = preprocessor_rent.fit_transform(X_rent_train)
X_rent_test_prepared = preprocessor_rent.transform(X_rent_test)

print("RENT data preprocessed:")
print(f"  Train shape: {X_rent_train_prepared.shape}")
print(f"  Test shape:  {X_rent_test_prepared.shape}")

# SALE preprocessing
X_sale_train_prepared = preprocessor_sale.fit_transform(X_sale_train)
X_sale_test_prepared = preprocessor_sale.transform(X_sale_test)

print("\nSALE data preprocessed:")
print(f"  Train shape: {X_sale_train_prepared.shape}")
print(f"  Test shape:  {X_sale_test_prepared.shape}")

RENT data preprocessed:
  Train shape: (4508, 36)
  Test shape:  (1128, 36)

SALE data preprocessed:
  Train shape: (3286, 37)
  Test shape:  (822, 37)


## 11. Save Preprocessed Data and Pipelines

In [12]:
# Create output directories
OUTPUT_DIR = "C:\\Users\\user\\OneDrive\\Bureau\\Data Mining Projecy\\Tunisan-Real-Estate-Price-Prediction-Platform\\ML"
os.makedirs(f"{OUTPUT_DIR}/data/prepared/rent", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/data/prepared/sale", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/data/preprocessing", exist_ok=True)

# Save RENT data
np.save(f"{OUTPUT_DIR}/data/prepared/rent/X_train.npy", X_rent_train_prepared)
np.save(f"{OUTPUT_DIR}/data/prepared/rent/X_test.npy", X_rent_test_prepared)
np.save(f"{OUTPUT_DIR}/data/prepared/rent/y_train.npy", y_rent_train.to_numpy())
np.save(f"{OUTPUT_DIR}/data/prepared/rent/y_test.npy", y_rent_test.to_numpy())

# Save SALE data
np.save(f"{OUTPUT_DIR}/data/prepared/sale/X_train.npy", X_sale_train_prepared)
np.save(f"{OUTPUT_DIR}/data/prepared/sale/X_test.npy", X_sale_test_prepared)
np.save(f"{OUTPUT_DIR}/data/prepared/sale/y_train.npy", y_sale_train.to_numpy())
np.save(f"{OUTPUT_DIR}/data/prepared/sale/y_test.npy", y_sale_test.to_numpy())

# Save preprocessors
joblib.dump(
    preprocessor_rent,
    f"{OUTPUT_DIR}/data/preprocessing/preprocessor_rent.joblib"
)
joblib.dump(
    preprocessor_sale,
    f"{OUTPUT_DIR}/data/preprocessing/preprocessor_sale.joblib"
)

# Save target encodings
joblib.dump(
    {'region_encoding': rent_region_encoding, 'region_global': rent_region_global},
    f"{OUTPUT_DIR}/data/preprocessing/target_encodings_rent.joblib"
)
joblib.dump(
    {'region_encoding': sale_region_encoding, 'region_global': sale_region_global},
    f"{OUTPUT_DIR}/data/preprocessing/target_encodings_sale.joblib"
)

print("✅ All data and pipelines saved successfully!")
print(f"\nOutput directory: {OUTPUT_DIR}")
print("\nFiles saved:")
print("  - Rent: X_train, X_test, y_train, y_test")
print("  - Sale: X_train, X_test, y_train, y_test")
print("  - Preprocessors: preprocessor_rent.joblib, preprocessor_sale.joblib")
print("  - Target encodings: target_encodings_rent.joblib, target_encodings_sale.joblib")

✅ All data and pipelines saved successfully!

Output directory: C:\Users\user\OneDrive\Bureau\Data Mining Projecy\Tunisan-Real-Estate-Price-Prediction-Platform\ML

Files saved:
  - Rent: X_train, X_test, y_train, y_test
  - Sale: X_train, X_test, y_train, y_test
  - Preprocessors: preprocessor_rent.joblib, preprocessor_sale.joblib
  - Target encodings: target_encodings_rent.joblib, target_encodings_sale.joblib
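
Before moving on, it can be useful to confirm that saved artifacts reload cleanly. The sketch below does a round trip in a temporary directory, assuming the same `np.save` and `joblib.dump` calls as above. The array and the encoding dictionary are small stand-ins, and `pathlib` is used so path separators are handled consistently across operating systems.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Stand-in artifacts (hypothetical contents with the same structure as the notebook's).
X_train = np.arange(12, dtype=float).reshape(4, 3)
encodings = {"region_encoding": {"Ariana": 950.28}, "region_global": 1469.09}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data" / "prepared" / "rent"
    out_dir.mkdir(parents=True, exist_ok=True)

    # Save, then immediately load back from disk.
    np.save(out_dir / "X_train.npy", X_train)
    joblib.dump(encodings, out_dir / "target_encodings_rent.joblib")

    X_loaded = np.load(out_dir / "X_train.npy")
    encodings_loaded = joblib.load(out_dir / "target_encodings_rent.joblib")

# The reloaded objects should match what was written.
arrays_match = np.array_equal(X_loaded, X_train)
encodings_match = encodings_loaded == encodings
print(arrays_match, encodings_match)
```

Note that joblib files are pickles, so they should be loaded with library versions compatible with the ones used to save them.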


## 12. Summary Statistics

In [13]:
print("=" * 70)
print("PREPROCESSING SUMMARY")
print("=" * 70)

print("\n📊 DATASET SIZES:")
print(f"  Rent Train:  {X_rent_train_prepared.shape[0]:,} samples x {X_rent_train_prepared.shape[1]} features")
print(f"  Rent Test:   {X_rent_test_prepared.shape[0]:,} samples x {X_rent_test_prepared.shape[1]} features")
print(f"  Sale Train:  {X_sale_train_prepared.shape[0]:,} samples x {X_sale_train_prepared.shape[1]} features")
print(f"  Sale Test:   {X_sale_test_prepared.shape[0]:,} samples x {X_sale_test_prepared.shape[1]} features")

print("\n💰 PRICE STATISTICS:")
print("  RENT:")
print(f"    Train mean: {y_rent_train.mean():.2f} TND")
print(f"    Train std:  {y_rent_train.std():.2f} TND")
print(f"    Test mean:  {y_rent_test.mean():.2f} TND")
print(f"    Test std:   {y_rent_test.std():.2f} TND")
print("  SALE:")
print(f"    Train mean: {y_sale_train.mean():.2f} TND")
print(f"    Train std:  {y_sale_train.std():.2f} TND")
print(f"    Test mean:  {y_sale_test.mean():.2f} TND")
print(f"    Test std:   {y_sale_test.std():.2f} TND")

print("\n🔧 FEATURE ENGINEERING:")
print(f"  Original features: 7")
print(f"  Engineered features added: {len(ALL_FEATURES) - 7}")
print(f"  Total features before encoding: {len(ALL_FEATURES)}")
print(f"  Total features after encoding: {X_rent_train_prepared.shape[1]}")

print("\n✅ Ready for model training!")

PREPROCESSING SUMMARY

📊 DATASET SIZES:
  Rent Train:  4,508 samples x 36 features
  Rent Test:   1,128 samples x 36 features
  Sale Train:  3,286 samples x 37 features
  Sale Test:   822 samples x 37 features

💰 PRICE STATISTICS:
  RENT:
    Train mean: 1469.09 TND
    Train std:  1401.08 TND
    Test mean:  1458.82 TND
    Test std:   1357.72 TND
  SALE:
    Train mean: 478367.84 TND
    Train std:  396574.43 TND
    Test mean:  488662.96 TND
    Test std:   423537.30 TND

🔧 FEATURE ENGINEERING:
  Original features: 7
  Engineered features added: 2
  Total features before encoding: 9
  Total features after encoding: 36

✅ Ready for model training!


## Next Steps

The prepared arrays and fitted preprocessors are now on disk under the output directory, so separate models can be trained:

1. **For RENT model:**
   - Load: `data/prepared/rent/X_train.npy`, `data/prepared/rent/y_train.npy`
   - Train models (Linear Regression, Random Forest, XGBoost, etc.)
   - Evaluate on `data/prepared/rent/X_test.npy`, `data/prepared/rent/y_test.npy`

2. **For SALE model:**
   - Load: `data/prepared/sale/X_train.npy`, `data/prepared/sale/y_train.npy`
   - Train models (Linear Regression, Random Forest, XGBoost, etc.)
   - Evaluate on `data/prepared/sale/X_test.npy`, `data/prepared/sale/y_test.npy`

3. **For predictions on new data:**
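   - Apply the same feature engineering (`total_rooms`, `surface_category`, `region_frequency`, `region_grouped`)
   - Load the matching target encodings (`target_encodings_rent.joblib` or `target_encodings_sale.joblib`) and map `region_grouped` to `region_encoded`, falling back to the stored global mean for unseen regions
   - Load the appropriate preprocessor (`preprocessor_rent.joblib` or `preprocessor_sale.joblib`)
   - Transform with the preprocessor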
   - Predict with your trained model
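
The prediction steps for new data can be sketched as follows. This is an illustrative outline, not the notebook's own code: the helper `prepare_new_listings` is hypothetical, and it assumes the region counts and the rare-region list were also persisted, which the save cell above does not do. The dictionaries below are small stand-ins for the saved artifacts.

```python
import pandas as pd


def prepare_new_listings(raw, region_counts, rare_regions, region_encoding, region_global):
    """Rebuild the engineered columns for unseen listings (hypothetical helper)."""
    out = raw.copy()
    out["total_rooms"] = out["rooms"] + out["bathrooms"]
    # Same bins and labels as the training notebook.
    out["surface_category"] = pd.cut(
        out["surface"],
        bins=[0, 80, 150, 300, 10000],
        labels=["small", "medium", "large", "very_large"],
    )
    # Regions never seen before get a frequency of 0.
    out["region_frequency"] = out["region"].map(region_counts).fillna(0)
    # Rare regions collapse into 'other_region'; unknown regions use the global mean.
    grouped = out["region"].where(~out["region"].isin(rare_regions), "other_region")
    out["region_encoded"] = grouped.map(region_encoding).fillna(region_global)
    return out.drop(columns=["region"])


# Stand-ins for persisted artifacts (hypothetical values).
region_counts = {"Ariana": 120, "Tiny": 4}
rare_regions = ["Tiny"]
region_encoding = {"Ariana": 950.28, "other_region": 1200.0}
region_global = 1469.09

new_listings = pd.DataFrame({
    "property_type": ["apartment", "villa"],
    "city": ["Tunis", "Sousse"],
    "region": ["Ariana", "Nowhere"],  # "Nowhere" is not in any mapping
    "surface": [95.0, 320.0],
    "rooms": [3, 5],
    "bathrooms": [1, 3],
})

features = prepare_new_listings(
    new_listings, region_counts, rare_regions, region_encoding, region_global
)
print(features)
```

The resulting frame has the column names the fitted preprocessor expects, so it could then be passed to `preprocessor_rent.transform(features)` (or the sale equivalent) and on to the trained model's `predict`.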