# 🤖 ML Training Pipeline - Learning from Real Appraisers


## Step 1: Load Data and Understand Structure
Load the real appraiser selections (ground truth), our engineered features dataset, and the cleaned subjects table.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

# Load the datasets
print("📊 Loading datasets...")
comps_df = pd.read_csv('data/processed/comps_cleaned_with_subjects.csv')
properties_df = pd.read_csv('data/processed/properties_comparison_engineered.csv')
subject_df = pd.read_csv('data/processed/subjects_cleaned.csv')

print(f"Real appraiser selections: {len(comps_df):,} comparables")
print(f"All property-subject pairs: {len(properties_df):,} pairs")
print(f"Subjects in comps: {comps_df['subject_id'].nunique()}")
print(f"Subjects in properties: {properties_df['subject_id'].nunique()}")

# Quick look at the data
print(f"\nComps columns: {list(comps_df.columns[:5])}...")
print(f"Properties columns: {list(properties_df.columns[:5])}...")

📊 Loading datasets...
Real appraiser selections: 264 comparables
All property-subject pairs: 7,246 pairs
Subjects in comps: 88
Subjects in properties: 88

Comps columns: ['subject_id', 'comp_id', 'comp_index', 'condition', 'age_years']...
Properties columns: ['property_id', 'subject_id', 'orderID', 'structure_type', 'property_sub_type']...


## Step 2: Examine Data Structure and Key Columns

**What we're doing here:**
- The `comps_df` contains the **real appraiser selections** - these are the actual properties that professional appraisers chose as the best comparables for each subject
- The `properties_df` contains **all possible property-subject pairs** with our 90+ engineered features (distance, size similarity, composite scores, etc.)
- We need to understand the structure of both datasets so we can map which properties in our engineered dataset were actually selected by appraisers

**Why this matters:**
- The appraiser selections become our **ground truth labels** (1 = selected, 0 = not selected)
- This mapping is crucial for supervised learning - we're teaching the ML model to predict what appraisers would choose
- We'll discover if our analytical scoring aligns with real appraiser preferences

**What to look for:**
- How many columns each dataset has and what they contain
- Whether we have matching identifiers to link the datasets
- Distribution of properties per subject (some subjects might have more candidates than others)
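
A quick way to check for matching identifiers is to compare the `subject_id` sets of the two tables and count candidates per subject. The sketch below uses two tiny stand-in frames; `comps_demo` and `props_demo` and their values are invented for illustration and are not the real data.

```python
import pandas as pd

# Toy stand-ins for comps_df and properties_df (hypothetical values)
comps_demo = pd.DataFrame({"subject_id": [0, 0, 1], "comp_id": ["0_0", "0_1", "1_0"]})
props_demo = pd.DataFrame({"subject_id": [0, 0, 0, 1, 2], "property_id": [10, 11, 12, 20, 30]})

comp_subjects = set(comps_demo["subject_id"].tolist())
prop_subjects = set(props_demo["subject_id"].tolist())

# Subjects present in only one table cannot be linked to the other
only_in_comps = comp_subjects - prop_subjects
only_in_props = prop_subjects - comp_subjects

# Number of candidate properties available for each subject
candidates_per_subject = props_demo.groupby("subject_id").size()
```

Empty difference sets in both directions mean every subject can be linked. A non-empty set points at subjects to investigate before labeling.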

In [2]:
# Examine the structure of both datasets
print("🔍 COMPS DATASET (Real Appraiser Selections):")
print(f"Shape: {comps_df.shape}")
print(f"Columns: {list(comps_df.columns)}")
print(f"\nSample comp data:")
print(comps_df.head(2))

print(f"\n" + "="*60)
print("🔍 PROPERTIES DATASET (All Candidates with Features):")
print(f"Shape: {properties_df.shape}")
print(f"Key columns: {[col for col in properties_df.columns if any(x in col for x in ['score', 'match', 'distance', 'gla', 'price'])]}")

print(f"\n📊 Properties per subject distribution:")
props_per_subject = properties_df.groupby('subject_id').size()
print(f"Min: {props_per_subject.min()}, Max: {props_per_subject.max()}, Avg: {props_per_subject.mean():.1f}")

🔍 COMPS DATASET (Real Appraiser Selections):
Shape: (264, 22)
Columns: ['subject_id', 'comp_id', 'comp_index', 'condition', 'age_years', 'age_uncertainty', 'prop_type', 'city_province', 'address', 'lot_size_sqft', 'gla_sqft', 'bedrooms_main', 'bedrooms_additional', 'bedrooms_total', 'bathrooms_full', 'bathrooms_half', 'bathrooms_equivalent', 'sale_price', 'sale_date', 'has_lot_size_uncertainty', 'has_gla_uncertainty', 'has_additional_bedrooms']

Sample comp data:
   subject_id comp_id  comp_index  condition  age_years  age_uncertainty  \
0           0     0_0           0  Excellent       49.0            False   
1           0     0_1           1       Fair       49.0            False   

   prop_type        city_province             address  lot_size_sqft  ...  \
0  Townhouse  Kingston ON K7M 6V1  930 Amberdale Cres            NaN  ...   
1  Townhouse  Kingston ON K7M 6X7      771 Ashwood Dr            NaN  ...   

   bedrooms_additional  bedrooms_total  bathrooms_full  bathrooms_half 

In [3]:
# STEP 3: Investigate Address Matching Issues and Create Hybrid Mapping
print("🔍 INVESTIGATING ADDRESS MATCHING POTENTIAL")
print("=" * 60)

import re
from difflib import SequenceMatcher

# First, let's see what the address data looks like
print("📊 ADDRESS DATA ANALYSIS:")
print(f"   Comps with addresses: {comps_df['address'].notna().sum()}/{len(comps_df)}")
print(f"   Properties with addresses: {properties_df['address'].notna().sum()}/{len(properties_df)}")

# Sample addresses from both datasets
print(f"\n🔍 SAMPLE ADDRESS COMPARISON:")
print("   Comp addresses:")
for addr in comps_df['address'].dropna().head(5):
    print(f"     '{addr}'")

print("   Property addresses:")  
for addr in properties_df['address'].dropna().head(5):
    print(f"     '{addr}'")

# Test address matching for Subject 0 to see the issue
subject_0_comps = comps_df[comps_df['subject_id'] == 0]
subject_0_props = properties_df[properties_df['subject_id'] == 0]

print(f"\n🔍 SUBJECT 0 ADDRESS ANALYSIS:")
print(f"   Comp addresses ({len(subject_0_comps)}):")
for _, comp in subject_0_comps.iterrows():
    print(f"     '{comp['address']}'")

print(f"   Property addresses (first 5 of {len(subject_0_props)}):")
for _, prop in subject_0_props.head(5).iterrows():
    print(f"     '{prop['address']}'")

🔍 INVESTIGATING ADDRESS MATCHING POTENTIAL
📊 ADDRESS DATA ANALYSIS:
   Comps with addresses: 264/264
   Properties with addresses: 7246/7246

🔍 SAMPLE ADDRESS COMPARISON:
   Comp addresses:
     '930 Amberdale Cres'
     '771 Ashwood Dr'
     '995 Amberdale Cres'
     '64 Deermist Dr'
     '85 Oceanic Dr'
   Property addresses:
     '692 Truedell Rd'
     '1034 Craig Lane '
     '950 Oakview Avenue '
     'Unit 51 - 808 Datzell Lane'
     '771 ASHWOOD Dr '

🔍 SUBJECT 0 ADDRESS ANALYSIS:
   Comp addresses (3):
     '930 Amberdale Cres'
     '771 Ashwood Dr'
     '995 Amberdale Cres'
   Property addresses (first 5 of 121):
     '692 Truedell Rd'
     '1034 Craig Lane '
     '950 Oakview Avenue '
     'Unit 51 - 808 Datzell Lane'
     '771 ASHWOOD Dr '


In [4]:
# Test improved address normalization and matching
print("🔧 TESTING IMPROVED ADDRESS MATCHING")
print("=" * 50)

def normalize_address_improved(address):
    """Improved address normalization"""
    if pd.isna(address):
        return ""
    
    addr = str(address).upper().strip()
    
    # Remove common prefixes/suffixes
    addr = re.sub(r'^UNIT\s+\d+\s*-\s*', '', addr)  # Remove "Unit 51 - "
    addr = re.sub(r'^\d+\s*-\s*', '', addr)  # Remove "51 - "
    
    # Common abbreviation mappings
    abbreviations = {
        'STREET': 'ST', 'AVENUE': 'AVE', 'ROAD': 'RD', 'DRIVE': 'DR',
        'CRESCENT': 'CRES', 'BOULEVARD': 'BLVD', 'PLACE': 'PL',
        'COURT': 'CT', 'LANE': 'LN', 'CIRCLE': 'CIR', 'TERRACE': 'TER'
    }
    
    for full, abbrev in abbreviations.items():
        # Match whole words only so names such as "COURTLAND" are left intact
        addr = re.sub(rf'\b{full}\b', abbrev, addr)
    
    # Remove extra spaces and punctuation
    addr = re.sub(r'[^\w\s]', '', addr)
    addr = re.sub(r'\s+', ' ', addr)
    
    return addr.strip()

def address_similarity_improved(addr1, addr2):
    """Calculate similarity between two addresses"""
    norm1 = normalize_address_improved(addr1)
    norm2 = normalize_address_improved(addr2)
    
    if not norm1 or not norm2:
        return 0.0
    
    # Exact match gets full score
    if norm1 == norm2:
        return 1.0
    
    # Use sequence matcher for partial similarity
    return SequenceMatcher(None, norm1, norm2).ratio()

# Test the improved normalization on Subject 0
print("🧪 TESTING ON SUBJECT 0:")
comp_addr = "771 Ashwood Dr"
prop_addr = "771 ASHWOOD Dr "

print(f"   Comp: '{comp_addr}' → '{normalize_address_improved(comp_addr)}'")
print(f"   Prop: '{prop_addr}' → '{normalize_address_improved(prop_addr)}'")
print(f"   Similarity: {address_similarity_improved(comp_addr, prop_addr):.3f}")

# Test on all Subject 0 addresses
print(f"\n📍 SUBJECT 0 FULL ADDRESS MATCHING TEST:")
for _, comp in subject_0_comps.iterrows():
    comp_addr = comp['address']
    print(f"\n   Comp: '{comp_addr}'")
    
    best_matches = []
    for _, prop in subject_0_props.iterrows():
        prop_addr = prop['address']
        similarity = address_similarity_improved(comp_addr, prop_addr)
        if similarity > 0.5:  # Only show decent matches
            best_matches.append((prop_addr, similarity, prop['property_id']))
    
    # Sort by similarity
    best_matches.sort(key=lambda x: x[1], reverse=True)
    
    if best_matches:
        print(f"   Best matches:")
        for prop_addr, sim, prop_id in best_matches[:3]:
            print(f"     {sim:.3f}: '{prop_addr}' (ID: {prop_id})")
    else:
        print(f"   No good address matches found")

🔧 TESTING IMPROVED ADDRESS MATCHING
🧪 TESTING ON SUBJECT 0:
   Comp: '771 Ashwood Dr' → '771 ASHWOOD DR'
   Prop: '771 ASHWOOD Dr ' → '771 ASHWOOD DR'
   Similarity: 1.000

📍 SUBJECT 0 FULL ADDRESS MATCHING TEST:

   Comp: '930 Amberdale Cres'
   Best matches:
     0.889: '995 Amberdale Cres ' (ID: 64)
     0.722: '941 AMBLESIDE Crescent ' (ID: 120)
     0.649: 'Unit 101 - 1010 Pembridge Crescent ' (ID: 72)

   Comp: '771 Ashwood Dr'
   Best matches:
     1.000: '771 ASHWOOD Dr ' (ID: 47)
     0.710: '371 Tanglewood Drive ' (ID: 143)
     0.688: '871 Larchwood Crescent ' (ID: 92)

   Comp: '995 Amberdale Cres'
   Best matches:
     1.000: '995 Amberdale Cres ' (ID: 64)
     0.722: '941 AMBLESIDE Crescent ' (ID: 120)
     0.667: '995 Waterbury Crescent ' (ID: 70)
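
Abbreviation mapping is sensitive to whether whole words or arbitrary substrings are replaced, because street-type words can also occur inside longer street names. A minimal sketch of the difference, using a made-up address and two hypothetical helpers (`abbreviate_substrings`, `abbreviate_whole_words`) that are not part of the notebook:

```python
import re

ABBREVIATIONS = {"STREET": "ST", "COURT": "CT", "DRIVE": "DR"}

def abbreviate_substrings(addr):
    """Replace every occurrence, even inside longer words."""
    for full, abbrev in ABBREVIATIONS.items():
        addr = addr.replace(full, abbrev)
    return addr

def abbreviate_whole_words(addr):
    """Replace only occurrences delimited by word boundaries."""
    for full, abbrev in ABBREVIATIONS.items():
        addr = re.sub(rf"\b{full}\b", abbrev, addr)
    return addr

example = "12 COURTLAND DRIVE"  # invented address
substring_result = abbreviate_substrings(example)
whole_word_result = abbreviate_whole_words(example)
```

Here the substring version rewrites the street name itself, while the word-boundary version only shortens the trailing street type.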


In [5]:
# STEP 4: Create Final Hybrid Mapping
print("🎯 CREATING FINAL HYBRID MAPPING (CHARACTERISTICS + ADDRESS)")
print("=" * 70)

def find_best_matching_property_hybrid(comp_row, subject_properties):
    """Find best matching property using characteristics + address"""
    best_match = None
    best_score = 0
    best_breakdown = {}
    
    for _, prop in subject_properties.iterrows():
        # CHARACTERISTICS SIMILARITY (70% weight)
        char_score = 0
        
        # GLA similarity (25% of total = 35% of char score)
        if pd.notna(comp_row['gla_sqft']) and pd.notna(prop['gla_sqft']):
            if comp_row['gla_sqft'] == prop['gla_sqft']:
                char_score += 0.25
            else:
                gla_diff = abs(comp_row['gla_sqft'] - prop['gla_sqft']) / max(comp_row['gla_sqft'], prop['gla_sqft'])
                char_score += max(0, 0.25 * (1 - gla_diff))
        
        # Price similarity (25% of total)
        if pd.notna(comp_row['sale_price']) and pd.notna(prop['close_price']):
            if comp_row['sale_price'] == prop['close_price']:
                char_score += 0.25
            else:
                price_diff = abs(comp_row['sale_price'] - prop['close_price']) / max(comp_row['sale_price'], prop['close_price'])
                char_score += max(0, 0.25 * (1 - price_diff))
        
        # Bedroom similarity (10% of total)
        if pd.notna(comp_row['bedrooms_total']) and pd.notna(prop['bedrooms_total']):
            if comp_row['bedrooms_total'] == prop['bedrooms_total']:
                char_score += 0.10
        
        # Structure type similarity (10% of total)
        if pd.notna(comp_row['prop_type']) and pd.notna(prop['structure_type']):
            if comp_row['prop_type'].lower() in prop['structure_type'].lower():
                char_score += 0.10
        
        # ADDRESS SIMILARITY (30% weight)
        addr_similarity = address_similarity_improved(comp_row.get('address', ''), prop.get('address', ''))
        addr_score = 0.30 * addr_similarity
        
        # TOTAL SCORE
        total_score = char_score + addr_score
        
        if total_score > best_score:
            best_score = total_score
            best_match = prop['property_id']
            best_breakdown = {
                'char_score': char_score,
                'addr_score': addr_score,
                'addr_similarity': addr_similarity,
                'total_score': total_score
            }
    
    return best_match, best_score, best_breakdown

# Create final hybrid mapping
print(f"🔄 Processing all {len(comps_df)} appraiser selections...")
appraiser_selections = set()
mapping_details = []
unmatched_count = 0

for _, comp in comps_df.iterrows():
    subject_id = comp['subject_id']
    subject_properties = properties_df[properties_df['subject_id'] == subject_id]
    
    if len(subject_properties) > 0:
        matched_property_id, match_score, breakdown = find_best_matching_property_hybrid(comp, subject_properties)
        
        # Accept matches with score >= 0.4 (balanced threshold).
        # Compare against None explicitly: a valid property_id of 0 is falsy.
        if matched_property_id is not None and match_score >= 0.4:
            appraiser_selections.add((subject_id, matched_property_id))
            mapping_details.append({
                'subject_id': subject_id,
                'comp_id': comp['comp_id'],
                'property_id': matched_property_id,
                'total_score': match_score,
                'char_score': breakdown['char_score'],
                'addr_score': breakdown['addr_score'],
                'addr_similarity': breakdown['addr_similarity']
            })
        else:
            unmatched_count += 1

# Results summary
print(f"\n✅ FINAL HYBRID MAPPING RESULTS:")
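print(f"   Total appraiser selections: {len(comps_df)}")
# appraiser_selections is a set, so comps that resolve to the same property are counted once
print(f"   Successfully mapped: {len(appraiser_selections)}")
print(f"   Mapping rate: {len(appraiser_selections)/len(comps_df):.1%}")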
print(f"   Unmatched: {unmatched_count}")

if mapping_details:
    total_scores = [m['total_score'] for m in mapping_details]
    char_scores = [m['char_score'] for m in mapping_details]
    addr_scores = [m['addr_score'] for m in mapping_details]
    addr_similarities = [m['addr_similarity'] for m in mapping_details]
    
    print(f"\n📊 SCORE ANALYSIS:")
    print(f"   Average total score: {np.mean(total_scores):.3f}")
    print(f"   Average char score: {np.mean(char_scores):.3f}")
    print(f"   Average addr score: {np.mean(addr_scores):.3f}")
    print(f"   Average addr similarity: {np.mean(addr_similarities):.3f}")
    print(f"   Score range: {min(total_scores):.3f} - {max(total_scores):.3f}")
    
    # Quality distribution
    high_quality = sum(1 for s in total_scores if s >= 0.7)
    medium_quality = sum(1 for s in total_scores if 0.5 <= s < 0.7)
    lower_quality = sum(1 for s in total_scores if 0.4 <= s < 0.5)
    
    print(f"\n🎯 QUALITY DISTRIBUTION:")
    print(f"   High quality (≥0.7): {high_quality} ({high_quality/len(mapping_details):.1%})")
    print(f"   Medium quality (0.5-0.7): {medium_quality} ({medium_quality/len(mapping_details):.1%})")
    print(f"   Lower quality (0.4-0.5): {lower_quality} ({lower_quality/len(mapping_details):.1%})")

🎯 CREATING FINAL HYBRID MAPPING (CHARACTERISTICS + ADDRESS)
🔄 Processing all 264 appraiser selections...

✅ FINAL HYBRID MAPPING RESULTS:
   Total appraiser selections: 264
   Successfully mapped: 219
   Mapping rate: 83.0%
   Unmatched: 0

📊 SCORE ANALYSIS:
   Average total score: 0.793
   Average char score: 0.588
   Average addr score: 0.204
   Average addr similarity: 0.682
   Score range: 0.434 - 1.000

🎯 QUALITY DISTRIBUTION:
   High quality (≥0.7): 199 (75.4%)
   Medium quality (0.5-0.7): 59 (22.3%)
   Lower quality (0.4-0.5): 6 (2.3%)
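
The summary reports 0 unmatched comps but only 219 mapped pairs out of 264, while the quality buckets sum to 264. One reading consistent with those numbers is that several comps resolved to the same `(subject_id, property_id)` pair, which the set silently deduplicates. A sketch for surfacing such collisions, assuming records shaped like `mapping_details`; the rows in `mapping_demo` are invented:

```python
import pandas as pd

# Invented rows shaped like the entries of mapping_details
mapping_demo = pd.DataFrame([
    {"subject_id": 0, "comp_id": "0_0", "property_id": 64, "total_score": 0.62},
    {"subject_id": 0, "comp_id": "0_1", "property_id": 47, "total_score": 1.00},
    {"subject_id": 0, "comp_id": "0_2", "property_id": 64, "total_score": 1.00},
])

pair_cols = ["subject_id", "property_id"]

# keep=False flags every comp that shares its matched pair with another comp
collisions = mapping_demo[mapping_demo.duplicated(pair_cols, keep=False)]

# Keep only the best-scoring comp for each matched pair
best_per_pair = (
    mapping_demo.sort_values("total_score", ascending=False)
    .drop_duplicates(pair_cols)
)

n_matched = len(mapping_demo)        # comps that found a match
n_unique_pairs = len(best_per_pair)  # distinct labeled pairs
```

Reviewing the lower-scoring member of each collision is one way to decide whether the comp is genuinely absent from the candidate pool or was simply out-scored by a near-identical address.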


## Step 5: Create Binary Labels for ML Training

**What we're doing:**
- Add `is_appraiser_selected` column to our properties dataset 
- Mark properties that real appraisers selected as 1, all others as 0
- This creates our supervised learning target variable for training
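
The labeling step can also be written without a Python loop by testing each `(subject_id, property_id)` pair for membership in the selection set. This is a sketch on toy data; `props_demo` and `selections_demo` are hypothetical stand-ins for `properties_df` and `appraiser_selections`.

```python
import pandas as pd

props_demo = pd.DataFrame({
    "subject_id":  [0, 0, 0, 1, 1],
    "property_id": [47, 64, 70, 47, 5],
})
selections_demo = {(0, 47), (0, 64), (1, 5)}  # (subject_id, property_id)

# Build a two-level index of pairs and test membership in one vectorized call
pair_index = pd.MultiIndex.from_frame(props_demo[["subject_id", "property_id"]])
props_demo["is_appraiser_selected"] = pair_index.isin(sorted(selections_demo)).astype(int)
```

Note that property 47 is labeled 1 for subject 0 but 0 for subject 1, which is why the pair, not the property alone, has to be the key.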

**Why this matters:**
- The binary labels become our **ground truth** for ML training
- 1 = "A professional appraiser chose this property as a comparable"
- 0 = "This property was available but the appraiser didn't choose it"
- We can now train ML models to predict appraiser preferences

**Expected outcome:**
- Highly imbalanced dataset (~3% positive class) - this is normal!
- Real appraisers are very selective (only ~3 out of 80+ properties per subject)
- We'll need special techniques to handle this imbalance in ML training
- This gives us the foundation for supervised learning
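
For a sense of scale, the "balanced" class-weight heuristic assigns each class the weight `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The sketch below applies it with NumPy to the counts reported in this notebook (219 mapped pairs among 7,246 property-subject pairs); `y_demo` is a synthetic label vector built only to reproduce those counts.

```python
import numpy as np

n_total, n_pos = 7246, 219          # counts reported in this notebook
y_demo = np.zeros(n_total, dtype=int)
y_demo[:n_pos] = 1                  # synthetic labels with the same class balance

counts = np.bincount(y_demo)        # [n_negative, n_positive]
class_weights = len(y_demo) / (len(counts) * counts)
```

Under these counts a positive example is weighted roughly 32 times as heavily as a negative one, which gives a feel for how strongly the imbalance has to be compensated.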

**What this enables:**
- Train models to learn: "What makes appraisers choose certain properties?"
- Compare our analytical scoring vs real appraiser preferences
- Build ML models that can predict appraiser-quality recommendations

In [6]:
# STEP 5: Create Binary Labels for ML Training
print("🏷️ CREATING BINARY LABELS FOR ML TRAINING")
print("=" * 60)

# Create the binary label column
properties_df['is_appraiser_selected'] = 0

# Mark the appraiser-selected properties
for subject_id, property_id in appraiser_selections:
    mask = (properties_df['subject_id'] == subject_id) & (properties_df['property_id'] == property_id)
    properties_df.loc[mask, 'is_appraiser_selected'] = 1

# Analyze the class distribution
total_properties = len(properties_df)
selected_properties = properties_df['is_appraiser_selected'].sum()
selection_rate = selected_properties / total_properties

print(f"📊 CLASS DISTRIBUTION ANALYSIS:")
print(f"   Total property-subject pairs: {total_properties:,}")
print(f"   Appraiser-selected properties: {selected_properties:,}")
print(f"   Not selected properties: {total_properties - selected_properties:,}")
print(f"   Selection rate: {selection_rate:.1%}")

# Analyze by subject
subjects_with_selections = properties_df[properties_df['is_appraiser_selected'] == 1]['subject_id'].nunique()
total_subjects = properties_df['subject_id'].nunique()

print(f"\n🎯 SUBJECT COVERAGE:")
print(f"   Total subjects: {total_subjects}")
print(f"   Subjects with mapped selections: {subjects_with_selections}")
print(f"   Coverage rate: {subjects_with_selections/total_subjects:.1%}")

# Show selection distribution per subject
selections_per_subject = properties_df[properties_df['is_appraiser_selected'] == 1].groupby('subject_id').size()
print(f"\n📈 SELECTIONS PER SUBJECT:")
print(f"   Min: {selections_per_subject.min()}")
print(f"   Max: {selections_per_subject.max()}")
print(f"   Average: {selections_per_subject.mean():.1f}")
print(f"   Median: {selections_per_subject.median():.1f}")

# Check for class imbalance (expected to be highly imbalanced)
print(f"\n⚖️ CLASS IMBALANCE ANALYSIS:")
if selection_rate < 0.1:
    print(f"   ⚠️  Highly imbalanced dataset ({selection_rate:.1%} positive class)")
    print(f"   💡 Will need to use techniques like:")
    print(f"      - Class weights in ML models")
    print(f"      - Stratified sampling")
    print(f"      - Precision-Recall metrics instead of accuracy")
else:
    print(f"   ✅ Reasonably balanced dataset")

# Sample of the labeled data
print(f"\n🔍 SAMPLE LABELED DATA:")
sample_data = properties_df[['subject_id', 'property_id', 'composite_score', 'is_appraiser_selected']].head(10)
print(sample_data)

print(f"\n✅ BINARY LABELS CREATED SUCCESSFULLY!")
print(f"   Dataset ready for ML training with {len(properties_df):,} samples")
print(f"   Target variable: 'is_appraiser_selected' (0/1)")

🏷️ CREATING BINARY LABELS FOR ML TRAINING
📊 CLASS DISTRIBUTION ANALYSIS:
   Total property-subject pairs: 7,246
   Appraiser-selected properties: 219
   Not selected properties: 7,027
   Selection rate: 3.0%

🎯 SUBJECT COVERAGE:
   Total subjects: 88
   Subjects with mapped selections: 88
   Coverage rate: 100.0%

📈 SELECTIONS PER SUBJECT:
   Min: 1
   Max: 3
   Average: 2.5
   Median: 3.0

⚖️ CLASS IMBALANCE ANALYSIS:
   ⚠️  Highly imbalanced dataset (3.0% positive class)
   💡 Will need to use techniques like:
      - Class weights in ML models
      - Stratified sampling
      - Precision-Recall metrics instead of accuracy

🔍 SAMPLE LABELED DATA:
   subject_id  property_id  composite_score  is_appraiser_selected
0           0           28             79.5                      0
1           0           96             78.0                      0
2           0          125             75.5                      0
3           0           32             75.5                      0
4       

## Step 6: Prepare Base Features for ML Training

**Strategy: Use Raw Features Only**
- **Problem**: Our current 24+ engineered features include pre-computed similarity scores and ratios
- **Issue**: This constrains the ML model and adds noise - the model should learn patterns itself
- **Solution**: Use only raw/base features and let ML algorithms discover optimal patterns

**Base Feature Categories (12-15 features):**
- **Physical**: square_feet, bedrooms, bathrooms, structure_type 
- **Location**: latitude, longitude, distance_km, same_city
- **Temporal**: sale_date (as numeric), days_since_sale
- **Market**: sale_price, price_per_sqft
- **Subject Reference**: subject_square_feet, subject_bedrooms, subject_bathrooms, subject_latitude, subject_longitude

**Why This Approach:**
- ML excels at finding complex patterns in raw data
- Engineered similarity scores constrain learning
- Simpler feature space reduces overfitting  
- Model learns optimal feature interactions and weightings
- More generalizable to new data

**Expected Outcome:**
- Clean, minimal feature matrix (~12-15 features)
- Raw data that lets ML algorithms shine
- Better generalization and interpretability

In [7]:
# STEP 6.5: ADD KEY SUBJECT FEATURES
print("🔗 ADDING KEY SUBJECT FEATURES")
print("=" * 50)

import re

# Parse bathrooms from string format (e.g., "2:1" -> 2.5)
def parse_bathrooms(bath_str):
    """Convert bathroom strings like '2:1' or '2F 1H' to numeric (2.5)"""
    if pd.isna(bath_str) or bath_str == '':
        return np.nan
    
    bath_str = str(bath_str).strip()
    
    # Handle formats like "2F 1H" or "3F": read the digits before each letter
    # (counting tokens would turn "2F 1H" into 1.5 instead of 2.5)
    if 'F' in bath_str or 'H' in bath_str:
        full_match = re.search(r'(\d+)\s*F', bath_str)
        half_match = re.search(r'(\d+)\s*H', bath_str)
        full = int(full_match.group(1)) if full_match else 0
        half = int(half_match.group(1)) if half_match else 0
        return full + (half * 0.5)
    
    if ':' in bath_str:
        # Format like "2:1" (full:half)
        parts = bath_str.split(':')
        if len(parts) == 2:
            full = float(parts[0]) if parts[0].strip().isdigit() else 0
            half = float(parts[1]) if parts[1].strip().isdigit() else 0
            return full + (half * 0.5)
    
    # Try direct conversion; anything unparseable becomes NaN
    try:
        return float(bath_str)
    except ValueError:
        return np.nan

# Select and clean subject features
subject_features = subject_df[['subject_id', 'condition', 'age_years', 'effective_date', 
                              'structure_type', 'gla_sqft', 'bedrooms_raw', 'bathrooms_raw', 
                              'municipality_district']].copy()

# Rename columns to have subject_ prefix for clarity
subject_features.rename(columns={
    'condition': 'subject_condition',
    'age_years': 'subject_age_years', 
    'effective_date': 'subject_effective_date',
    'structure_type': 'subject_structure_type',
    'gla_sqft': 'subject_gla_sqft',
    'bedrooms_raw': 'subject_bedrooms',
    'bathrooms_raw': 'subject_bathrooms_raw',
    'municipality_district': 'subject_municipality_district'
}, inplace=True)

# Clean the bathrooms field
subject_features['subject_bathrooms'] = subject_features['subject_bathrooms_raw'].apply(parse_bathrooms)
subject_features['subject_bedrooms'] = pd.to_numeric(subject_features['subject_bedrooms'], errors='coerce')
subject_features['subject_gla_sqft'] = pd.to_numeric(subject_features['subject_gla_sqft'], errors='coerce')

# Drop the raw bathrooms column
subject_features.drop('subject_bathrooms_raw', axis=1, inplace=True)

print(f"📊 SUBJECT FEATURES TO ADD:")
print(f"   Features: {list(subject_features.columns)}")
print(f"   Sample data:")
print(subject_features.head(3))

# Merge with properties data
properties_with_subjects = properties_df.merge(subject_features, on='subject_id', how='left')

print(f"\n🔗 MERGE RESULTS:")
print(f"   Before: {properties_df.shape}")
print(f"   After: {properties_with_subjects.shape}")
print(f"   Success: {(~properties_with_subjects['subject_gla_sqft'].isna()).mean():.1%}")

# Update main dataframe
properties_df = properties_with_subjects.copy()

print(f"\n✅ SUBJECT FEATURES ADDED!")
print(f"   New columns added: {len(subject_features.columns)-1}")  # -1 for subject_id
print(f"   Ready for ML training with subject context")

🔗 ADDING KEY SUBJECT FEATURES
📊 SUBJECT FEATURES TO ADD:
   Features: ['subject_id', 'subject_condition', 'subject_age_years', 'subject_effective_date', 'subject_structure_type', 'subject_gla_sqft', 'subject_bedrooms', 'subject_municipality_district', 'subject_bathrooms']
   Sample data:
   subject_id subject_condition  subject_age_years subject_effective_date  \
0           0           Average                NaN            Apr/11/2025   
1           1           Average                NaN            Apr/17/2025   
2           2           Average                NaN            May/01/2025   

  subject_structure_type  subject_gla_sqft  subject_bedrooms  \
0              Townhouse            1044.0               3.0   
1               Detached            1500.0               3.0   
2               Detached            3000.0               4.0   

                      subject_municipality_district  subject_bathrooms  
0                                          Kingston                1.5  
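
Comparing shapes before and after the merge catches row inflation only after the fact. pandas can enforce the expected relationship directly through the `validate` argument of `merge`. A small sketch on toy frames (`pairs_demo`, `subjects_demo`, and their values are invented):

```python
import pandas as pd

pairs_demo = pd.DataFrame({"subject_id": [0, 0, 1], "property_id": [1, 2, 3]})
subjects_demo = pd.DataFrame({"subject_id": [0, 1], "subject_gla_sqft": [1044.0, 1500.0]})

# many_to_one: each subject_id may appear at most once on the right side
merged = pairs_demo.merge(subjects_demo, on="subject_id", how="left", validate="many_to_one")

# A duplicated subject row would silently duplicate pairs; validate raises instead
dup_subjects = pd.concat([subjects_demo, subjects_demo.iloc[[0]]], ignore_index=True)
try:
    pairs_demo.merge(dup_subjects, on="subject_id", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```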

In [8]:
# CORRECTED: Use the base features that actually exist
corrected_base_features = [
    # Physical characteristics (raw)
    'gla_sqft',                    # square footage
    'bedrooms_total',              # bedrooms  
    'bathrooms_equivalent',        # bathrooms
    'close_price',                 # sale price
    'price_per_sqft',             # price per sqft
    
    # Location (raw)
    'latitude', 'longitude',       # coordinates
    'distance_km',                 # distance to subject
    'same_city',                   # same city indicator
    
    # Temporal (raw)
    'close_date',                  # sale date
    'days_from_effective',         # days since sale
    
    # Categorical (raw)
    'prop_type_clean',             # property type
    'structure_type_match',        # structure match
    
    # Subject features
    'subject_condition',           # subject condition
    'subject_structure_type',     # subject structure type
    'subject_gla_sqft',          # subject square footage
    'subject_bedrooms',          # subject bedrooms
    'subject_bathrooms',         # subject bathrooms
    'subject_municipality_district' # subject location
]

print("🔧 USING CORRECTED BASE FEATURES")
print("=" * 50)
print(f"Corrected features ({len(corrected_base_features)}):")
for i, feat in enumerate(corrected_base_features, 1):
    print(f"   {i:2d}. {feat}")

# Check availability
available_corrected = [f for f in corrected_base_features if f in properties_df.columns]
missing_corrected = [f for f in corrected_base_features if f not in properties_df.columns]

print(f"\nAvailable: {len(available_corrected)}/{len(corrected_base_features)}")
if missing_corrected:
    print(f"Still missing: {missing_corrected}")

🔧 USING CORRECTED BASE FEATURES
Corrected features (19):
    1. gla_sqft
    2. bedrooms_total
    3. bathrooms_equivalent
    4. close_price
    5. price_per_sqft
    6. latitude
    7. longitude
    8. distance_km
    9. same_city
   10. close_date
   11. days_from_effective
   12. prop_type_clean
   13. structure_type_match
   14. subject_condition
   15. subject_structure_type
   16. subject_gla_sqft
   17. subject_bedrooms
   18. subject_bathrooms
   19. subject_municipality_district

Available: 19/19


In [9]:
# STEP 7: Create Clean Feature Matrix and Train/Test Split
print("🔧 CREATING CLEAN FEATURE MATRIX FOR ML TRAINING")
print("=" * 60)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# First, check which features actually exist
print("🔍 CHECKING FEATURE AVAILABILITY:")
available_features = []
missing_features = []

for feat in corrected_base_features:
    if feat in properties_df.columns:
        available_features.append(feat)
    else:
        missing_features.append(feat)

print(f"   Available features ({len(available_features)}): {available_features}")
if missing_features:
    print(f"   Missing features ({len(missing_features)}): {missing_features}")

# Use only available features
X_clean = properties_df[available_features].copy()
y = properties_df['is_appraiser_selected'].copy()
subjects = properties_df['subject_id'].copy()

print(f"\n📊 INITIAL FEATURE MATRIX:")
print(f"   Shape: {X_clean.shape}")
print(f"   Features: {len(available_features)}")
print(f"   Samples: {len(X_clean):,}")

# Handle temporal features FIRST - convert close_date to numerical
print(f"\n📅 PROCESSING TEMPORAL FEATURES:")
if 'close_date' in X_clean.columns:
    # Convert to datetime and then to numerical (days since epoch)
    X_clean['close_date'] = pd.to_datetime(X_clean['close_date'])
    X_clean['close_date_numeric'] = (X_clean['close_date'] - pd.Timestamp('1970-01-01')).dt.days
    
    # Drop original date column, keep numeric version
    X_clean = X_clean.drop('close_date', axis=1)
    print(f"   ✅ Converted close_date to close_date_numeric (days since epoch)")

# Handle subject_effective_date if it exists
if 'subject_effective_date' in X_clean.columns:
    # For now, drop it since it's a date string that's hard to parse
    X_clean = X_clean.drop('subject_effective_date', axis=1)
    print(f"   ✅ Dropped subject_effective_date (date string)")

# Handle categorical features - encode for ML
print(f"\n🔧 ENCODING CATEGORICAL FEATURES:")
categorical_cols = X_clean.select_dtypes(include=['object', 'string']).columns.tolist()

for col in categorical_cols:
    unique_vals = X_clean[col].nunique()
    print(f"   {col}: {unique_vals} unique values")
    
    # Use LabelEncoder for all categorical features
    le = LabelEncoder()
    X_clean[col] = le.fit_transform(X_clean[col].astype(str))
    print(f"     → Encoded {col} with LabelEncoder")

# Handle missing values using a simpler approach
print(f"\n🔧 HANDLING MISSING VALUES:")
missing_summary = X_clean.isnull().sum()
features_with_missing = missing_summary[missing_summary > 0]

if len(features_with_missing) > 0:
    print(f"   Features with missing values:")
    for feat, count in features_with_missing.items():
        print(f"     {feat}: {count} ({count/len(X_clean):.1%})")
    
    # Fill missing values column by column to avoid shape issues
    for col in X_clean.columns:
        if X_clean[col].isnull().sum() > 0:
            median_val = X_clean[col].median()
            X_clean[col] = X_clean[col].fillna(median_val)
            print(f"     → Filled {col} missing values with median: {median_val:.2f}")
    
    print(f"   ✅ Applied median imputation")
else:
    print(f"   ✅ No missing values found")

# Verify no missing values remain
remaining_missing = X_clean.isnull().sum().sum()
print(f"   Remaining missing values: {remaining_missing}")

print(f"\n📋 FINAL FEATURE MATRIX:")
print(f"   Shape: {X_clean.shape}")
print(f"   Features: {list(X_clean.columns)}")
print(f"   All features are numerical: {X_clean.dtypes.apply(lambda x: x.kind in 'biufc').all()}")
print(f"   Target distribution: {y.sum()} positive ({y.mean():.1%}), {(~y).sum()} negative")

🔧 CREATING CLEAN FEATURE MATRIX FOR ML TRAINING
🔍 CHECKING FEATURE AVAILABILITY:
   Available features (19): ['gla_sqft', 'bedrooms_total', 'bathrooms_equivalent', 'close_price', 'price_per_sqft', 'latitude', 'longitude', 'distance_km', 'same_city', 'close_date', 'days_from_effective', 'prop_type_clean', 'structure_type_match', 'subject_condition', 'subject_structure_type', 'subject_gla_sqft', 'subject_bedrooms', 'subject_bathrooms', 'subject_municipality_district']

📊 INITIAL FEATURE MATRIX:
   Shape: (7246, 19)
   Features: 19
   Samples: 7,246

📅 PROCESSING TEMPORAL FEATURES:
   ✅ Converted close_date to close_date_numeric (days since epoch)

🔧 ENCODING CATEGORICAL FEATURES:
   prop_type_clean: 9 unique values
     → Encoded prop_type_clean with LabelEncoder
   subject_condition: 4 unique values
     → Encoded subject_condition with LabelEncoder
   subject_structure_type: 9 unique values
     → Encoded subject_structure_type with LabelEncoder
   subject_municipality_district: 74 uni

## Step 8: Create Subject-Level Train/Test Split and Feature Scaling

**Current Status - Good Progress:**
- ✅ 13 clean base features properly encoded
- ✅ Missing values handled (max 2.2%)
- ✅ Temporal and categorical features processed

**Critical Next Steps:**
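- Fix target distribution calculation (the previous cell used `(~y).sum()`, which applies a bitwise NOT to integer labels and so miscounts the negatives)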
- Create proper subject-level train/test split (prevents data leakage)
- Apply feature scaling for ML algorithms
- Verify feature quality and correlations

**Why Subject-Level Split:**
- Prevents data leakage: all candidate properties for a given subject land in either train or test, never both
- Simulates real-world deployment scenario
- Maintains proper validation methodology
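
The subject-level idea can also be sketched with scikit-learn's `GroupShuffleSplit`, which keeps every row that shares a group label in the same split. This is an illustrative alternative to the manual mask-based split used in this notebook, run on a small made-up frame; the `toy` frame, its values, and the `train_groups` / `test_groups` names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data (invented): 10 hypothetical subjects with 8 candidate properties each
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "subject_id": np.repeat(np.arange(10), 8),
    "distance_km": rng.uniform(0, 10, size=80),
})

# One 80/20 split over subjects (groups), not over individual rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["subject_id"]))

train_groups = set(toy.iloc[train_idx]["subject_id"])
test_groups = set(toy.iloc[test_idx]["subject_id"])

# No subject may appear on both sides of the split
overlap = train_groups & test_groups
print(f"Train subjects: {len(train_groups)}, test subjects: {len(test_groups)}")
print(f"Overlapping subjects: {len(overlap)}")
```

An empty overlap is the property that matters here: a row-level split would typically scatter one subject's candidates across both sets.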

**Expected Outcome:**
- X_train, X_test, y_train, y_test ready for ML training
- Features properly scaled and normalized
- No data leakage between splits
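
One way to keep preprocessing statistics from crossing the split is to wrap imputation and scaling in a scikit-learn `Pipeline` that is fitted on the training rows only. The sketch below is an illustration under stated assumptions, not this notebook's actual code (which fills medians before splitting); the toy arrays and the `prep` name are invented.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented toy matrices with one missing value in each split
X_tr = np.array([[1000.0, 2.0], [1500.0, np.nan], [2000.0, 4.0], [2500.0, 3.0]])
X_te = np.array([[np.nan, 3.0], [1800.0, 5.0]])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians learned from training rows
    ("scale", StandardScaler()),                   # mean/std learned from training rows
])

X_tr_scaled = prep.fit_transform(X_tr)  # fit on train only
X_te_scaled = prep.transform(X_te)      # reuse training statistics on test

print("Train means after scaling:", X_tr_scaled.mean(axis=0).round(6))
print("Imputer medians (train):  ", prep.named_steps["impute"].statistics_)
```

With at most a few percent of values missing, the practical difference from imputing before the split is likely small, but fitting everything on the training rows removes the question entirely.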

In [10]:
# STEP 8: Create Subject-Level Train/Test Split and Feature Scaling
print("🎯 CREATING PROPER TRAIN/TEST SPLIT AND SCALING")
print("=" * 60)

from sklearn.preprocessing import StandardScaler

# Fix target distribution calculation
print(f"📊 TARGET DISTRIBUTION CHECK:")
positive_count = y.sum()
negative_count = len(y) - positive_count
print(f"   Positive samples: {positive_count} ({positive_count/len(y):.1%})")
print(f"   Negative samples: {negative_count} ({negative_count/len(y):.1%})")
print(f"   Total samples: {len(y)}")

# Create subject-level train/test split
print(f"\n🔀 CREATING SUBJECT-LEVEL TRAIN/TEST SPLIT:")
unique_subjects = subjects.unique()
print(f"   Total subjects: {len(unique_subjects)}")

# Split subjects 80/20
train_subjects, test_subjects = train_test_split(
    unique_subjects, 
    test_size=0.2, 
    random_state=42
)

# Create masks for train/test
train_mask = subjects.isin(train_subjects)
test_mask = subjects.isin(test_subjects)

# Split the data
X_train_raw = X_clean[train_mask].copy()
X_test_raw = X_clean[test_mask].copy()
y_train = y[train_mask].copy()
y_test = y[test_mask].copy()

print(f"   Train subjects: {len(train_subjects)} ({len(train_subjects)/len(unique_subjects):.1%})")
print(f"   Test subjects: {len(test_subjects)} ({len(test_subjects)/len(unique_subjects):.1%})")
print(f"   Train samples: {len(X_train_raw)} ({y_train.sum()} positive, {y_train.mean():.1%})")
print(f"   Test samples: {len(X_test_raw)} ({y_test.sum()} positive, {y_test.mean():.1%})")

# Apply feature scaling
print(f"\n⚖️ APPLYING FEATURE SCALING:")
scaler = StandardScaler()

# Fit on training data only, transform both
X_train = pd.DataFrame(
    scaler.fit_transform(X_train_raw),
    columns=X_train_raw.columns,
    index=X_train_raw.index
)

X_test = pd.DataFrame(
    scaler.transform(X_test_raw),
    columns=X_test_raw.columns, 
    index=X_test_raw.index
)

print(f"   ✅ Features scaled using StandardScaler")
print(f"   ✅ Scaler fitted on training data only")

# Verify final dataset
print(f"\n✅ FINAL ML-READY DATASET:")
print(f"   Training set: {X_train.shape} features, {len(y_train)} samples")
print(f"   Test set: {X_test.shape} features, {len(y_test)} samples")
print(f"   Feature names: {list(X_train.columns)}")
print(f"   Class balance - Train: {y_train.mean():.1%}, Test: {y_test.mean():.1%}")

print(f"\n🎯 READY FOR ML TRAINING!")
print(f"   ✅ No data leakage (subject-level split)")
print(f"   ✅ Proper feature scaling")
print(f"   ✅ Clean base features only")
print(f"   ✅ {X_train.shape[1]} features ready for algorithms")

🎯 CREATING PROPER TRAIN/TEST SPLIT AND SCALING
📊 TARGET DISTRIBUTION CHECK:
   Positive samples: 219 (3.0%)
   Negative samples: 7027 (97.0%)
   Total samples: 7246

🔀 CREATING SUBJECT-LEVEL TRAIN/TEST SPLIT:
   Total subjects: 88
   Train subjects: 70 (79.5%)
   Test subjects: 18 (20.5%)
   Train samples: 5980 (170 positive, 2.8%)
   Test samples: 1266 (49 positive, 3.9%)

⚖️ APPLYING FEATURE SCALING:
   ✅ Features scaled using StandardScaler
   ✅ Scaler fitted on training data only

✅ FINAL ML-READY DATASET:
   Training set: (5980, 19) features, 5980 samples
   Test set: (1266, 19) features, 1266 samples
   Feature names: ['gla_sqft', 'bedrooms_total', 'bathrooms_equivalent', 'close_price', 'price_per_sqft', 'latitude', 'longitude', 'distance_km', 'same_city', 'days_from_effective', 'prop_type_clean', 'structure_type_match', 'subject_condition', 'subject_structure_type', 'subject_gla_sqft', 'subject_bedrooms', 'subject_bathrooms', 'subject_municipality_district', 'close_date_numeric'

In [12]:
# Show sample rows of the numerical features
print("🔍 SAMPLE ROWS OF NUMERICAL FEATURES")
print("=" * 60)

print("📊 RAW FEATURES (before scaling):")
print("First 5 rows of X_train_raw:")
print(X_train_raw.head())

print(f"\n📊 SCALED FEATURES (after scaling):")
print("First 5 rows of X_train (scaled):")
print(X_train.head())

print(f"\n🎯 FEATURE STATISTICS:")
print("Feature ranges in raw data:")
for col in X_train_raw.columns:
    min_val = X_train_raw[col].min()
    max_val = X_train_raw[col].max()
    mean_val = X_train_raw[col].mean()
    print(f"   {col:30s}: {min_val:10.2f} to {max_val:12.2f} (mean: {mean_val:8.2f})")

print(f"\n🎯 SCALED FEATURE STATISTICS:")
print("Feature ranges in scaled data (mean ≈ 0, std ≈ 1; most values within about ±3):")
for col in X_train.columns:
    min_val = X_train[col].min()
    max_val = X_train[col].max()
    mean_val = X_train[col].mean()
    print(f"   {col:30s}: {min_val:6.2f} to {max_val:6.2f} (mean: {mean_val:6.2f})")

print(f"\n📋 SAMPLE WITH TARGET LABELS:")
print("Sample rows with features and target:")
sample_indices = X_train.head().index
sample_data = pd.DataFrame({
    'gla_sqft': X_train_raw.loc[sample_indices, 'gla_sqft'],
    'subject_gla_sqft': X_train_raw.loc[sample_indices, 'subject_gla_sqft'], 
    'distance_km': X_train_raw.loc[sample_indices, 'distance_km'],
    'close_price': X_train_raw.loc[sample_indices, 'close_price'],
    'same_city': X_train_raw.loc[sample_indices, 'same_city'],
    'is_selected': y_train.loc[sample_indices]
})
print(sample_data)

🔍 SAMPLE ROWS OF NUMERICAL FEATURES
📊 RAW FEATURES (before scaling):
First 5 rows of X_train_raw:
     gla_sqft  bedrooms_total  bathrooms_equivalent  close_price  \
121    1472.0               3                   2.0     545000.0   
122    1350.0               3                   1.0     395000.0   
123    1320.0               3                   2.0     378275.0   
124    1182.0               3                   2.0     500000.0   
125    1136.0               2                   2.0     710000.0   

     price_per_sqft  latitude  longitude  distance_km  same_city  \
121      370.244565   44.7865   -63.1475     8.392139          1   
122      292.592593   44.7903   -63.1199     6.322417          1   
123      286.571970   44.7584   -63.0603     2.460445          1   
124      423.011844   44.7832   -63.1657     9.787332          1   
125      625.000000   44.7765   -63.0424     0.000000          1   

     days_from_effective  prop_type_clean  structure_type_match  \
121              

## Step 9: Train Multiple ML Models and Evaluate Performance

**What we're doing:**
- Train multiple ML algorithms to learn appraiser selection patterns
- Handle class imbalance with appropriate techniques
- Evaluate models using precision/recall metrics (not accuracy)
- Compare model performance to find the best approach
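
To see why accuracy is a poor yardstick here, consider a trivial classifier that never selects any property. The sketch below uses the test-set counts reported in Step 8 (1,266 samples, 49 positive); the always-negative "model" is a hypothetical baseline, not one of the models trained in this notebook.

```python
import numpy as np

# Test-set class counts reported in Step 8
n_total, n_pos = 1266, 49
y_true = np.array([1] * n_pos + [0] * (n_total - n_pos))

# A trivial "model" that never selects any property
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_pos = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_pos / n_pos

print(f"Accuracy: {accuracy:.1%}")  # high despite finding nothing
print(f"Recall:   {recall:.1%}")
```

The baseline scores about 96% accuracy while recovering none of the appraiser-selected comparables, which is why the evaluation focuses on precision, recall, and AUC.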

**Models to test:**
- **Random Forest**: Excellent for feature interactions and interpretability
- **Gradient Boosting (XGBoost)**: Often a strong performer on tabular data
- **Logistic Regression**: Simple baseline with good interpretability
- **Support Vector Machine**: Good for high-dimensional data

**Key considerations:**
- **Class imbalance**: Only 3% positive samples, so we'll use class weights
- **Metrics**: Focus on Precision, Recall, F1-score, and AUC-ROC
- **Feature importance**: Understand what drives appraiser decisions
- **Cross-validation**: Ensure robust performance estimates
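
The "balanced" class-weight heuristic in scikit-learn is `n_samples / (n_classes * class_count)`. The arithmetic below works it through for the training counts from Step 8 (5,980 samples, 170 positive) and compares it with the negatives-per-positive ratio that XGBoost's documentation suggests as a typical `scale_pos_weight`. The variable names are illustrative only.

```python
# Training-set counts reported in Step 8
n_train, n_pos = 5980, 170
n_neg = n_train - n_pos
n_classes = 2

# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * class_count)
w_neg = n_train / (n_classes * n_neg)
w_pos = n_train / (n_classes * n_pos)

# Ratio often suggested for XGBoost's scale_pos_weight: negatives / positives
neg_pos_ratio = n_neg / n_pos

print(f"Balanced weights: class 0 = {w_neg:.4f}, class 1 = {w_pos:.4f}")
print(f"Negatives per positive: {neg_pos_ratio:.2f}")
print(f"w_pos / w_neg: {w_pos / w_neg:.2f}")
```

Note that the balanced positive weight (about 17.6) is half of the negatives-per-positive ratio (about 34.2). Passing the former as `scale_pos_weight` therefore up-weights positives less than the `class_weight='balanced'` models do; whether that difference is intended is not stated in the notebook.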

**Expected outcome:**
- Trained models that can predict appraiser-quality property selections
- Performance comparison to identify the best approach
- Feature importance insights into appraiser decision patterns
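
For the cross-validation point above, a grouped scheme keeps each subject's candidates inside a single fold, mirroring the subject-level train/test split. The following is a minimal sketch using `GroupKFold` with `cross_val_score` on invented data; the data-generating rule (three positives per subject, picked by the lowest first feature) and all variable names are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Invented data: 20 hypothetical subjects x 30 candidates, 10% positives
rng = np.random.default_rng(0)
n_subjects, n_candidates = 20, 30
groups = np.repeat(np.arange(n_subjects), n_candidates)
X = rng.normal(size=(n_subjects * n_candidates, 4))

# Positives: the 3 candidates with the lowest first feature within each subject
y = np.zeros(len(X), dtype=int)
for g in range(n_subjects):
    idx = np.where(groups == g)[0]
    y[idx[np.argsort(X[idx, 0])[:3]]] = 1

# Each fold holds out whole subjects, never a partial set of candidates
cv = GroupKFold(n_splits=5)
model = LogisticRegression(class_weight="balanced", max_iter=1000)
scores = cross_val_score(model, X, y, groups=groups, cv=cv, scoring="roc_auc")

print(f"AUC per fold: {scores.round(3)}")
print(f"Mean AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only 18 test subjects in the single hold-out split, the spread across folds gives a rough sense of how much a reported AUC might move from one split to another.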

In [13]:
# STEP 9: TRAIN MULTIPLE ML MODELS AND EVALUATE PERFORMANCE
print("🤖 TRAINING MULTIPLE ML MODELS")
print("=" * 60)

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, roc_curve
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import numpy as np

# Calculate class weights for imbalanced data
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}

print(f"📊 CLASS IMBALANCE HANDLING:")
print(f"   Class weights: {class_weight_dict}")
print(f"   Positive class weight: {class_weight_dict[1]:.2f}x")

# Define models with class weights
models = {
    'Random Forest': RandomForestClassifier(
        n_estimators=100, 
        max_depth=10, 
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ),
    'XGBoost': XGBClassifier(
        n_estimators=100,
        max_depth=6,
        scale_pos_weight=class_weight_dict[1],
        random_state=42,
        eval_metric='logloss'
    ),
    'Logistic Regression': LogisticRegression(
        class_weight='balanced',
        random_state=42,
        max_iter=1000
    ),
    'SVM': SVC(
        class_weight='balanced',
        probability=True,
        random_state=42
    )
}

# Train and evaluate each model
results = {}
trained_models = {}

print(f"\n🔄 TRAINING AND EVALUATING MODELS:")
for name, model in models.items():
    print(f"\n   Training {name}...")
    
    # Train model
    model.fit(X_train, y_train)
    trained_models[name] = model
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    auc_score = roc_auc_score(y_test, y_pred_proba)
    
    # Store results
    results[name] = {
        'model': model,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba,
        'auc_score': auc_score
    }
    
    print(f"     ✅ {name} trained - AUC: {auc_score:.3f}")

# Display detailed results
print(f"\n📊 DETAILED MODEL PERFORMANCE:")
print("=" * 80)

for name, result in results.items():
    print(f"\n🎯 {name.upper()}:")
    print(f"   AUC-ROC Score: {result['auc_score']:.3f}")
    
    # Classification report
    report = classification_report(y_test, result['y_pred'], output_dict=True)
    print(f"   Precision (Class 1): {report['1']['precision']:.3f}")
    print(f"   Recall (Class 1): {report['1']['recall']:.3f}")
    print(f"   F1-Score (Class 1): {report['1']['f1-score']:.3f}")
    
    # Show confusion matrix info
    tp = sum((y_test == 1) & (result['y_pred'] == 1))
    fp = sum((y_test == 0) & (result['y_pred'] == 1))
    fn = sum((y_test == 1) & (result['y_pred'] == 0))
    tn = sum((y_test == 0) & (result['y_pred'] == 0))
    
    print(f"   True Positives: {tp}/{y_test.sum()} ({tp/y_test.sum():.1%})")
    print(f"   False Positives: {fp}")

# Find best model
best_model_name = max(results.keys(), key=lambda x: results[x]['auc_score'])
best_model = results[best_model_name]

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"   AUC Score: {best_model['auc_score']:.3f}")

# Feature importance for tree-based models
if best_model_name in ['Random Forest', 'XGBoost']:
    print(f"\n🔍 TOP 10 FEATURE IMPORTANCES ({best_model_name}):")
    
    # Both tree-based estimators expose feature_importances_
    importances = trained_models[best_model_name].feature_importances_
    
    feature_importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    for i, (_, row) in enumerate(feature_importance.head(10).iterrows()):
        print(f"   {i+1:2d}. {row['feature']:30s}: {row['importance']:.3f}")

print(f"\n✅ MODEL TRAINING COMPLETE!")
print(f"   Best performing model: {best_model_name}")
print(f"   Ready for further analysis and tuning")

🤖 TRAINING MULTIPLE ML MODELS
📊 CLASS IMBALANCE HANDLING:
   Class weights: {0: np.float64(0.5146299483648882), 1: np.float64(17.58823529411765)}
   Positive class weight: 17.59x

🔄 TRAINING AND EVALUATING MODELS:

   Training Random Forest...
     ✅ Random Forest trained - AUC: 0.615

   Training XGBoost...
     ✅ XGBoost trained - AUC: 0.651

   Training Logistic Regression...
     ✅ Logistic Regression trained - AUC: 0.626

   Training SVM...
     ✅ SVM trained - AUC: 0.647

📊 DETAILED MODEL PERFORMANCE:

🎯 RANDOM FOREST:
   AUC-ROC Score: 0.615
   Precision (Class 1): 0.091
   Recall (Class 1): 0.184
   F1-Score (Class 1): 0.122
   True Positives: 9/49 (18.4%)
   False Positives: 90

🎯 XGBOOST:
   AUC-ROC Score: 0.651
   Precision (Class 1): 0.135
   Recall (Class 1): 0.102
   F1-Score (Class 1): 0.116
   True Positives: 5/49 (10.2%)
   False Positives: 32

🎯 LOGISTIC REGRESSION:
   AUC-ROC Score: 0.626
   Precision (Class 1): 0.050
   Recall (Class 1): 0.878
   F1-Score (Class 1):