<a href="https://www.kaggle.com/code/bborya/home-data-using-xgboost-2025-04?scriptVersionId=248168273" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# House Prices Prediction Notebook
This notebook predicts house prices for the Kaggle "Home Data for ML Course" competition using XGBoost. 

It was originally a working draft but has been cleaned up with help from Grok (from xAI) into a modular, readable pipeline. 

It scores 13937 with the setup below. Ranking 136 out of 5580 (as of 2025-06-30)

 - Data Prep: Loads train/test data, imputes missing values (mostly medians for numeric, 'None' for categorical, but some exceptions for modes and 0s).
 - Feature Engineering: Adds derived features (TotalSF, HouseAge, TotalBath etc) and neighborhood stats (Neighborhood_MedianPrice, Neighborhood_Qual_MedianPrice, PricePerSqFt).
 - Encoding: One-hot encodes categoricals with a 1% frequency filter to reduce noise.
 - Feature Selection: Drops low-importance features (~80% retention) using XGBoost importances.
   - Top features are: Neighborhood_Qual_MedianPrice 43%, TotalSF 8%, BsmtQual_Ex 3%)
 - Tuning: RandomizedSearchCV with 50 iterations, 5-fold CV, and broad XGBoost params (max_depth etc)

## Data Prep
### Load Data

In [1]:
# Toggle for log transforming the target variable
USE_LOG_TRANSFORM = False  # Set to True to enable log transform, False to disable
EVAL_METRIC = 'mae'  # Evaluation metric: 'mae', 'rmse', 'mape', etc.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV

# Global metric mapping
# Maps EVAL_METRIC to: (function to compute the metric, sklearn CV scoring string, display name)
# - Function: Computes error (e.g., mean_absolute_error for MAE) between predictions and true values.
# - CV Scoring: Negative version (e.g., 'neg_mean_absolute_error') for sklearn’s cross_val_score, which maximizes scores (we minimize errors).
# - Name: Readable label (e.g., 'MAE') for debug outputs.
METRIC_MAP = {
    'mae': (mean_absolute_error, 'neg_mean_absolute_error', 'MAE'),
    'rmse': (lambda y_true, y_pred: mean_squared_error(y_true, y_pred, squared=False), 'neg_root_mean_squared_error', 'RMSE')
}


def load_and_prepare_data(train_path, test_path, target_col='SalePrice', debug=False):
    """
    Load and prepare training and test data for modeling.
    
    Args:
        train_path (str): Path to training CSV file.
        test_path (str): Path to test CSV file.
        target_col (str): Name of the target column (default: 'SalePrice').
        debug (bool): If True, print debug information (default: False).
    
    Returns:
        tuple: (X_train, y_train, X_test) - Training features, target, and test features.
    """
    # Load data
    X_train = pd.read_csv(train_path, index_col='Id')
    X_test = pd.read_csv(test_path, index_col='Id')
    
    # Separate target and drop rows with missing target
    X_train = X_train.dropna(subset=[target_col])
    y_train = X_train[target_col]
    if USE_LOG_TRANSFORM:
        y_train = np.log1p(y_train)  # Apply log transform if enabled
        if debug:
            print("Log transform enabled for target")

    X_train = X_train.drop(columns=[target_col])
    
    if debug:
        print("Training data shape:", X_train.shape)
        print("Test data shape:", X_test.shape)
        print("Target summary:\n", y_train.describe())
    
    return X_train, y_train, X_test

# Usage
train_path = '/kaggle/input/home-data-for-ml-course/train.csv'
test_path = '/kaggle/input/home-data-for-ml-course/test.csv'
X_train, y_train, X_test = load_and_prepare_data(train_path, test_path, debug=False)

### Impute Missing Data

In [2]:
def impute_missing_data(X_train, X_test, debug=False):
    """
    Impute missing values in training and test datasets based on column types and context.
    
    Args:
        X_train (pd.DataFrame): Training data with features.
        X_test (pd.DataFrame): Test data with features.
        debug (bool): If True, print debug information (default: False).
    
    Returns:
        tuple: (X_train_imputed, X_test_imputed) - Imputed training and test data.
    """
    X_train_imputed = X_train.copy()
    X_test_imputed = X_test.copy()
    
    # Define imputation strategies
    numeric_strategies = {
        'median': ['LotFrontage', 'MasVnrArea'],  # Continuous numeric
        'ref_col': {'GarageYrBlt': 'YearBuilt'},  # Use value from reference column
        'zero': [col for col in X_train.select_dtypes(include=['int64', 'float64']).columns 
                 if col not in ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']]
    }
    
    categorical_strategies = {
        'none': ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 
                 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'MasVnrType', 
                 'FireplaceQu', 'PoolQC', 'Fence', 'Alley', 'MiscFeature'],
        'mode': [col for col in X_train.select_dtypes(include=['object']).columns 
                 if col not in ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 
                                'BsmtFinType2', 'GarageType', 'GarageFinish', 'GarageQual', 
                                'GarageCond', 'MasVnrType', 'FireplaceQu', 'PoolQC', 'Fence', 
                                'Alley', 'MiscFeature']]
    }
    
    # Numeric imputation
    # Medians:
    for col in numeric_strategies['median']:
        median_val = X_train_imputed[col].median()
        X_train_imputed[col] = X_train_imputed[col].fillna(median_val)
        X_test_imputed[col] = X_test_imputed[col].fillna(median_val)
    # Using other Column -- e.g. Year built for Garage Built
    for col, ref_col in numeric_strategies['ref_col'].items():
        X_train_imputed[col] = X_train_imputed[col].fillna(X_train_imputed[ref_col])
        X_test_imputed[col] = X_test_imputed[col].fillna(X_test_imputed[ref_col])
    # Fill with Zeros
    X_train_imputed[numeric_strategies['zero']] = X_train_imputed[numeric_strategies['zero']].fillna(0)
    X_test_imputed[numeric_strategies['zero']] = X_test_imputed[numeric_strategies['zero']].fillna(0)
    
    # Categorical imputation
    for col in categorical_strategies['none']:
        X_train_imputed[col] = X_train_imputed[col].fillna('None')
        X_test_imputed[col] = X_test_imputed[col].fillna('None')
    
    for col in categorical_strategies['mode']:
        mode_val = X_train_imputed[col].mode()[0]
        X_train_imputed[col] = X_train_imputed[col].fillna(mode_val)
        X_test_imputed[col] = X_test_imputed[col].fillna(mode_val)
    
    if debug:
        print("Missing values after imputation - Train:", X_train_imputed.isnull().sum().sum())
        print("Missing values after imputation - Test:", X_test_imputed.isnull().sum().sum())
    
    return X_train_imputed, X_test_imputed

# Usage
X_train_imputed, X_test_imputed = impute_missing_data(X_train, X_test, debug=False)

# Feature Engineering

In [3]:
def engineer_features(train_data, X_test, debug=False):
    """Engineer additional features for training and test datasets.
    
    Args:
        train_data (pd.DataFrame): Training data with features and target ('SalePrice').
        X_test (pd.DataFrame): Test data with features only.
        debug (bool): If True, print debug information (default: False).
    
    Returns:
        tuple: (X_train_enriched, X_test_enriched) - Feature-enriched training and test data.
    """
    X_train_enriched = train_data.drop(columns=['SalePrice'], errors='ignore').copy()
    X_test_enriched = X_test.copy()
    
    # 1. Neighborhood Pricing Features
    agg_stats = ['min', 'max', 'median', 'count']
    neigh_stats = train_data.groupby('Neighborhood', observed=False)['SalePrice'].agg(agg_stats)
    for stat in agg_stats:
        col_name = f'Neighborhood_{stat.capitalize()}Price'
        X_train_enriched[col_name] = X_train_enriched['Neighborhood'].map(neigh_stats[stat])
        X_test_enriched[col_name] = X_test_enriched['Neighborhood'].map(neigh_stats[stat])
    
    neigh_qual_stats = train_data.groupby(['Neighborhood', 'OverallQual'])['SalePrice'].median()
    X_train_enriched['Neighborhood_Qual_MedianPrice'] = X_train_enriched.apply(
        lambda row: neigh_qual_stats.get((row['Neighborhood'], row['OverallQual']), 
                                        neigh_stats['median'][row['Neighborhood']]), axis=1
    )
    X_test_enriched['Neighborhood_Qual_MedianPrice'] = X_test_enriched.apply(
        lambda row: neigh_qual_stats.get((row['Neighborhood'], row['OverallQual']), 
                                        neigh_stats['median'][row['Neighborhood']]), axis=1
    )
    
    train_data['PricePerSqFt'] = train_data['SalePrice'] / train_data['GrLivArea']
    neigh_price_sqft = train_data.groupby('Neighborhood')['PricePerSqFt'].median()
    X_train_enriched['Neighborhood_PricePerSqFt'] = X_train_enriched['Neighborhood'].map(neigh_price_sqft)
    X_test_enriched['Neighborhood_PricePerSqFt'] = X_test_enriched['Neighborhood'].map(neigh_price_sqft)
    
    # Derived Features
    # House Age
    X_train_enriched['HouseAge'] = np.maximum(X_train_enriched['YrSold'] - X_train_enriched['YearBuilt'], 0)
    X_test_enriched['HouseAge'] = np.maximum(X_test_enriched['YrSold'] - X_test_enriched['YearBuilt'], 0)

    # Total SqFt (Living + Basement + Garage)
    X_train_enriched['TotalSF'] = (X_train_enriched['GrLivArea'] + X_train_enriched['TotalBsmtSF'] + 
                                  X_train_enriched['GarageArea'])
    X_test_enriched['TotalSF'] = (X_test_enriched['GrLivArea'] + X_test_enriched['TotalBsmtSF'] + 
                                 X_test_enriched['GarageArea'])

    # Total Bathrooms
    X_train_enriched['TotalBath'] = (X_train_enriched['FullBath'] + 0.5 * X_train_enriched['HalfBath'] + 
                                    X_train_enriched['BsmtFullBath'] + 0.5 * X_train_enriched['BsmtHalfBath'])
    X_test_enriched['TotalBath'] = (X_test_enriched['FullBath'] + 0.5 * X_test_enriched['HalfBath'] + 
                                   X_test_enriched['BsmtFullBath'] + 0.5 * X_test_enriched['BsmtHalfBath'])
    
    # Remodel Age: Years since last remodel (0 if no remodel)
    X_train_enriched['RemodAge'] = np.maximum(X_train_enriched['YrSold'] - X_train_enriched['YearRemodAdd'], 0)
    X_test_enriched['RemodAge'] = np.maximum(X_test_enriched['YrSold'] - X_test_enriched['YearRemodAdd'], 0)
    
    # Outdoor Space: Total square footage of outdoor areas
    outdoor_cols = ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea']
    X_train_enriched['OutdoorSF'] = X_train_enriched[outdoor_cols].sum(axis=1)
    X_test_enriched['OutdoorSF'] = X_test_enriched[outdoor_cols].sum(axis=1)
    
    # Quality-Condition Interaction: Product of overall quality and condition
    X_train_enriched['QualCond'] = X_train_enriched['OverallQual'] * X_train_enriched['OverallCond']
    X_test_enriched['QualCond'] = X_test_enriched['OverallQual'] * X_test_enriched['OverallCond']
    
    # Lot Utilization: Ratio of total square footage to lot area
    X_train_enriched['LotRatio'] = X_train_enriched['TotalSF'] / X_train_enriched['LotArea']
    X_test_enriched['LotRatio'] = X_test_enriched['TotalSF'] / X_test_enriched['LotArea']
    
    # Basement Finish Ratio: Proportion of basement that’s finished
    X_train_enriched['BsmtFinRatio'] = (X_train_enriched['BsmtFinSF1'] + X_train_enriched['BsmtFinSF2']) / \
                                       X_train_enriched['TotalBsmtSF'].replace(0, np.nan).fillna(0)
    X_test_enriched['BsmtFinRatio'] = (X_test_enriched['BsmtFinSF1'] + X_test_enriched['BsmtFinSF2']) / \
                                      X_test_enriched['TotalBsmtSF'].replace(0, np.nan).fillna(0)
    
    # Garage Score: Quality * Condition of garage (numeric mapping)
    qual_map = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0}
    X_train_enriched['GarageScore'] = (X_train_enriched['GarageQual'].map(qual_map, na_action='ignore').fillna(0) * 
                                      X_train_enriched['GarageCond'].map(qual_map, na_action='ignore').fillna(0))
    X_test_enriched['GarageScore'] = (X_test_enriched['GarageQual'].map(qual_map, na_action='ignore').fillna(0) * 
                                     X_test_enriched['GarageCond'].map(qual_map, na_action='ignore').fillna(0))
    
    if debug:
        print("Neighborhood stats sample:\n", neigh_stats.head())
        print("Engineered features summary:\n", 
              X_train_enriched[['HouseAge', 'TotalSF', 'TotalBath', 'RemodAge', 'OutdoorSF', 
                              'QualCond', 'LotRatio', 'BsmtFinRatio', 'GarageScore', 
                              'Neighborhood_MedianPrice', 'Neighborhood_PricePerSqFt']].describe())
    
    return X_train_enriched, X_test_enriched

# Usage
train_data = X_train_imputed.copy()
train_data['SalePrice'] = y_train
X_train_enriched, X_test_enriched = engineer_features(train_data, X_test_imputed, debug=True)

Neighborhood stats sample:
                  min     max    median  count
Neighborhood                                 
Blmngtn       159895  264561  191000.0     17
Blueste       124000  151000  137500.0      2
BrDale         83000  125000  106000.0     16
BrkSide        39300  223500  124300.0     58
ClearCr       130000  328000  200250.0     28
Engineered features summary:
           HouseAge       TotalSF    TotalBath     RemodAge    OutdoorSF  \
count  1460.000000   1460.000000  1460.000000  1460.000000  1460.000000   
mean     36.547945   3045.873288     2.210616    22.950685   184.088356   
std      30.250152    959.534673     0.785399    20.639875   166.418528   
min       0.000000    334.000000     1.000000     0.000000     0.000000   
25%       8.000000   2396.750000     2.000000     4.000000    45.000000   
50%      35.000000   2939.500000     2.000000    14.000000   164.000000   
75%      54.000000   3575.750000     2.500000    41.000000   266.250000   
max     136.000000  

## One-Hot Encoding

In [4]:
def one_hot_encode_data(X_train, X_test, min_freq=0.01, debug=False):
    """Perform one-hot encoding on training and test datasets, dropping rare categories.
    
    Args:
        X_train (pd.DataFrame): Training data with features.
        X_test (pd.DataFrame): Test data with features.
        min_freq (float): Minimum frequency threshold for categories to keep (default: 0.01, (1%)).
        debug (bool): If True, print debug information (default: False).
    
    Returns:
        tuple: (X_train_oh, X_test_oh) - One-hot encoded training and test data.
    """
    categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
    
    # One-hot encode with frequency filtering
    X_train_oh = pd.get_dummies(X_train, columns=categorical_cols, sparse=False, dtype=int)
    X_test_oh = pd.get_dummies(X_test, columns=categorical_cols, sparse=False, dtype=int)
    
    # Filter rare categories (less than min_freq in training data)
    for col in X_train_oh.columns:
        if col not in X_train.select_dtypes(exclude=['object']).columns:  # Only encoded cols
            freq = X_train_oh[col].mean()  # Proportion of 1s
            if freq < min_freq:
                X_train_oh.drop(col, axis=1, inplace=True)
                if col in X_test_oh.columns:
                    X_test_oh.drop(col, axis=1, inplace=True)
    
    # Align train and test columns
    X_train_oh, X_test_oh = X_train_oh.align(X_test_oh, join='outer', axis=1, fill_value=0)
    
    if debug:
        original_cols = X_train.shape[1]
        encoded_cols = X_train_oh.shape[1] - X_train.select_dtypes(exclude=['object']).columns.size
        print(f"Train shape after encoding: {X_train_oh.shape}")
        print(f"Test shape after encoding: {X_test_oh.shape}")
        print(f"Original categorical columns: {len(categorical_cols)}")
        print(f"Encoded columns added: {encoded_cols}")
        print(f"Sample categorical columns: {categorical_cols[:5]}")
    
    return X_train_oh, X_test_oh

# Usage
X_train_oh, X_test_oh = one_hot_encode_data(X_train_enriched, X_test_enriched, min_freq=0.01, debug=True)

Train shape after encoding: (1460, 232)
Test shape after encoding: (1459, 232)
Original categorical columns: 43
Encoded columns added: 181
Sample categorical columns: ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour']


## Train / Test Split

In [5]:
from sklearn.model_selection import train_test_split

def split_train_validation(X, y, train_size=0.8, random_state=0, debug=False):
    """
    Split data into training and validation sets.
    
    Args:
        X (pd.DataFrame): Feature data to split.
        y (pd.Series): Target data to split.
        train_size (float): Proportion of data for training (default: 0.8).
        random_state (int): Seed for reproducibility (default: 0).
        debug (bool): If True, print debug information (default: False).
    
    Returns:
        tuple: (X_train, X_valid, y_train, y_valid) - Split training and validation data.
    """
    # Ensure target aligns with features
    y = y.reindex(X.index)
    
    # Split into training and validation sets
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=train_size, test_size=1 - train_size, random_state=random_state
    )
    
    if debug:
        print("Train split shape:", X_train.shape)
        print("Validation split shape:", X_valid.shape)
    
    return X_train, X_valid, y_train, y_valid

# Usage
X_train_split, X_valid_split, y_train_split, y_valid_split = split_train_validation(
    X_train_oh, y_train, debug=False
)

## Evaluate Model

In [6]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import cross_val_score
import numpy as np

def evaluate_model(X_train, y_train, X_valid, y_valid, cv_folds=5, random_state=0, debug=False):
    """
    Evaluate an XGBoost model with cross-validation and validation set metrics.
    
    Args:
        X_train (pd.DataFrame): Training features.
        y_train (pd.Series): Training target.
        X_valid (pd.DataFrame): Validation features.
        y_valid (pd.Series): Validation target.
        cv_folds (int): Number of cross-validation folds (default: 5).
        random_state (int): Seed for reproducibility (default: 0).
        debug (bool): If True, print debug information (default: False).
    
    Returns:
        dict: Evaluation metrics (RMSE, MAE, CV MAE mean/std), and the fitted model.
    """

    # Model with early stopping for validation set evaluation
    model_with_es = XGBRegressor(
        n_estimators=1000, # High value since early stopping will optimize
        random_state=random_state, 
        objective='reg:squarederror',
        eval_metric=EVAL_METRIC, 
        early_stopping_rounds=10, 
        learning_rate=0.05
    )

    # Fit model with early stopping using validation set
    model_with_es.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

    # Validation set predictions and metrics
    y_pred = model_with_es.predict(X_valid)

    # Map eval_metric to scoring function
    metric_func, cv_scoring, metric_name = METRIC_MAP.get(EVAL_METRIC, METRIC_MAP['mae'])  # Default to MAE
    valid_metric = metric_func(y_valid, y_pred)

    # Required for CV scoring
    model_no_es = XGBRegressor(
        n_estimators=model_with_es.best_iteration, 
        random_state=random_state,
        objective='reg:squarederror', 
        eval_metric=EVAL_METRIC
    )
    cv_scores = cross_val_score(model_no_es, X_train, y_train, cv=cv_folds, scoring=cv_scoring, n_jobs=-1)
    metrics = {
        'valid_metric': valid_metric,
        'cv_metric_mean': -cv_scores.mean(),
        'cv_metric_std': cv_scores.std()
    }

    # Calculate translated metric in original price scale if log transform is on
    if USE_LOG_TRANSFORM and debug:
        y_valid_orig = np.expm1(y_valid)
        y_pred_orig = np.expm1(y_pred)
        if EVAL_METRIC == 'mae':
            translated_metric = mean_absolute_error(y_valid_orig, y_pred_orig)
            metrics['translated_mae'] = translated_metric
        elif EVAL_METRIC == 'rmse':
            translated_metric = mean_squared_error(y_valid_orig, y_pred_orig, squared=False)
            metrics['translated_rmse'] = translated_metric
    
    if debug:
        print(f"Validation {metric_name}: {valid_metric:.4f}{' (log scale)' if USE_LOG_TRANSFORM else ''}")
        print(f"{cv_folds}-Fold CV {metric_name}: {metrics['cv_metric_mean']:.4f} ± {metrics['cv_metric_std']:.4f}{' (log scale)' if USE_LOG_TRANSFORM else ''}")
        if USE_LOG_TRANSFORM and EVAL_METRIC in ['mae', 'rmse']:
            print(f"Translated Validation {metric_name}: {translated_metric:.2f}")
        print(f"Best iteration: {model_with_es.best_iteration}")
    return metrics, model_with_es

# Usage
metrics, baseline_model = evaluate_model(
    X_train_split, y_train_split, X_valid_split, y_valid_split, debug=True
)

Validation MAE: 16472.3345
5-Fold CV MAE: 16508.0094 ± 1402.9419
Best iteration: 88


## Drop Features

In [7]:
import numpy as np
import pandas as pd

def drop_features(X_train, X_valid, X_test, model, threshold=None, target_retention=0.6, use_known_threshold=False, debug=False):
    """Drop low-importance features based on a threshold or automatically determine it.
    
    Args:
        X_train (pd.DataFrame): Training features.
        X_valid (pd.DataFrame): Validation features.
        X_test (pd.DataFrame): Test features.
        model (XGBRegressor): Fitted model with feature importances.
        threshold (float, optional): Feature importance threshold to drop below. If None, auto-determine.
        target_retention (float): Target fraction of features to retain (default: 0.6, i.e., 60%).
        use_known_threshold (bool): If True, use 0.00055 instead of searching (default: False).
        debug (bool): If True, print debug information and optionally plot threshold vs. features (default: False).
    
    Returns:
        tuple: (X_train_reduced, X_valid_reduced, X_test_reduced, used_threshold) - Reduced datasets and threshold used.
    """
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    
    if threshold is not None:
        used_threshold = threshold
    elif use_known_threshold:
        used_threshold = 0.00055  # Known good value from previous runs
    else:
        # Automatic threshold search
        thresholds = np.linspace(0.0001, 0.01, 100)
        feature_counts = [sum(importances >= t) for t in thresholds]
        target_count = int(target_retention * len(importances))
        used_threshold = thresholds[np.argmin([abs(count - target_count) for count in feature_counts])]
    
    # Drop low-importance features
    low_imp_features = importances[importances < used_threshold].index.tolist()
    X_train_reduced = X_train.drop(columns=low_imp_features)
    X_valid_reduced = X_valid.drop(columns=low_imp_features)
    X_test_reduced = X_test.drop(columns=low_imp_features)
    
    if debug:
        print(f"Used threshold: {used_threshold:.5f}")
        print(f"Dropped {len(low_imp_features)} features out of {len(importances)}")
        print(f"Remaining features: {X_train_reduced.shape[1]} (Retention: {X_train_reduced.shape[1]/len(importances):.2%})")
        print("Top 10 feature importances:\n", importances.nlargest(10))
        if threshold is None and not use_known_threshold:
            # Extended debug info for threshold search
            print("\nThreshold vs. Features Kept (sample):")
            for t, count in list(zip(thresholds, feature_counts))[::20]:  # Show every 20th for brevity
                print(f"Threshold {t:.5f}: {count} features ({count/len(importances):.2%} retention)")
    
    return X_train_reduced, X_valid_reduced, X_test_reduced, used_threshold

# Usage
X_train_red, X_valid_red, X_test_red, thresh = drop_features(
    X_train_split, X_valid_split, X_test_oh, baseline_model, 
    threshold=None, target_retention=0.8, use_known_threshold=False, debug=True
)

Used threshold: 0.00010
Dropped 81 features out of 232
Remaining features: 151 (Retention: 65.09%)
Top 10 feature importances:
 Neighborhood_Qual_MedianPrice    0.432691
TotalSF                          0.080904
BsmtQual_Ex                      0.029020
KitchenQual_TA                   0.023776
Fence_GdPrv                      0.017003
CentralAir_N                     0.012003
KitchenQual_Gd                   0.009883
RemodAge                         0.009352
Exterior2nd_HdBoard              0.009247
TotalBath                        0.008582
dtype: float32

Threshold vs. Features Kept (sample):
Threshold 0.00010: 151 features (65.09% retention)
Threshold 0.00210: 73 features (31.47% retention)
Threshold 0.00410: 41 features (17.67% retention)
Threshold 0.00610: 20 features (8.62% retention)
Threshold 0.00810: 12 features (5.17% retention)


## Tune Model

I readjusted numbers after running these initial params and seeing their best results:
```
param_dist = {
    'learning_rate': [0.01, 0.025, 0.05, 0.1],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 3, 5], 
    'gamma': [0, 0.1, 0.5, 1] 
    }
```

```
Best parameters: {'subsample': 0.8, 'min_child_weight': 1, 'max_depth': 5, 'learning_rate': 0.025, 'gamma': 0.1, 'colsample_bytree': 0.7}
Best CV MAE: 14450.3097
Validation MAE: 15343.2544
Best iteration: 348
```


In [8]:
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

def tune_model(X_train, y_train, X_valid, y_valid, random_state=0, debug=False):
    """Tune an XGBoost model using RandomizedSearchCV on a reduced feature set.
    
    Args:
        X_train (pd.DataFrame): Training features (reduced).
        y_train (pd.Series): Training target.
        X_valid (pd.DataFrame): Validation features (reduced).
        y_valid (pd.Series): Validation target.
        random_state (int): Seed for reproducibility (default: 0).
        debug (bool): If True, print debug information (default: False).
    
    Returns:
        XGBRegressor: Best tuned model.
    """
    metric_func, cv_scoring, metric_name = METRIC_MAP.get(EVAL_METRIC, METRIC_MAP['mae'])
    scoring = cv_scoring
    
    base_model = XGBRegressor(
        n_estimators=1000, objective='reg:squarederror', eval_metric=EVAL_METRIC,
        early_stopping_rounds=10, random_state=random_state 
        # nthread=1 # Force single-threaded XGBoost
    )
    
    # Expanded parameter grid
    param_dist = {
        'learning_rate': [0.0125, 0.01875, 0.025, .0375, 0.05],
        'max_depth': [4, 5, 6],
        'subsample': [0.75, 0.8, 0.85],
        'colsample_bytree': [0.65, 0.7, 0.75],
        'min_child_weight': [1, 2], 
        'gamma': [0, 0.1, 0.2] 
    }
    
    random_search = RandomizedSearchCV(
        base_model, param_dist, n_iter=50, cv=5, scoring=scoring,
        random_state=random_state, n_jobs=2, verbose=0
    )
    random_search.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
    best_model = random_search.best_estimator_
    
    y_pred = best_model.predict(X_valid)
    valid_metric = metric_func(y_valid, y_pred)
    
    if USE_LOG_TRANSFORM and debug:
        y_valid_orig = np.expm1(y_valid)
        y_pred_orig = np.expm1(y_pred)
        if EVAL_METRIC == 'mae':
            translated_metric = mean_absolute_error(y_valid_orig, y_pred_orig)
        elif EVAL_METRIC == 'rmse':
            translated_metric = mean_squared_error(y_valid_orig, y_pred_orig, squared=False)
    
    if debug:
        print("Best parameters:", random_search.best_params_)
        print(f"Best CV {metric_name}: {-random_search.best_score_:.4f}{' (log scale)' if USE_LOG_TRANSFORM else ''}")
        print(f"Validation {metric_name}: {valid_metric:.4f}{' (log scale)' if USE_LOG_TRANSFORM else ''}")
        if USE_LOG_TRANSFORM and EVAL_METRIC in ['mae', 'rmse']:
            print(f"Translated Validation {metric_name}: {translated_metric:.2f}")
        print(f"Best iteration: {best_model.best_iteration}")
    
    return best_model

# Usage
tuned_model = tune_model(X_train_red, y_train_split, X_valid_red, y_valid_split, debug=True)

Best parameters: {'subsample': 0.75, 'min_child_weight': 1, 'max_depth': 5, 'learning_rate': 0.025, 'gamma': 0.2, 'colsample_bytree': 0.65}
Best CV MAE: 14392.7112
Validation MAE: 15292.1151
Best iteration: 347


## Submission

In [9]:
import pandas as pd

def generate_submission(X_train, y_train, X_test, model, output_file='submission.csv', debug=False):
    """Fit final model and generate test predictions, reversing log transform if enabled.
    
    Args:
        X_train (pd.DataFrame): Full training features (reduced).
        y_train (pd.Series): Full training target.
        X_test (pd.DataFrame): Test features (reduced).
        model (XGBRegressor): Tuned model with best parameters.
        output_file (str): Path to save the submission CSV (default: 'submission.csv').
        debug (bool): If True, print debug information (default: False).
    
    Returns:
        pd.DataFrame: Submission DataFrame with Id and SalePrice.
    """
    final_model = XGBRegressor(
        n_estimators=model.best_iteration, learning_rate=model.learning_rate,
        max_depth=model.max_depth, subsample=model.subsample,
        colsample_bytree=model.colsample_bytree, objective='reg:squarederror',
        eval_metric=EVAL_METRIC, random_state=0
    )
    
    final_model.fit(X_train, y_train, verbose=False)
    test_preds = final_model.predict(X_test)
    if USE_LOG_TRANSFORM:
        test_preds = np.expm1(test_preds)  # Reverse log transform if enabled
    
    submission = pd.DataFrame({
        'Id': X_test.index,
        'SalePrice': test_preds
    })
    submission.to_csv(output_file, index=False)
    
    if debug:
        print(f"Submission saved to {output_file}")
        print(f"Log transform: {'On' if USE_LOG_TRANSFORM else 'Off'}")
        print("Sample predictions:\n", submission.head())
        print("Test prediction summary:\n", pd.Series(test_preds, name='SalePrice').describe().round(2))
    
    return submission

# Usage
X_full_red = pd.concat([X_train_red, X_valid_red])
y_full = pd.concat([y_train_split, y_valid_split])
submission = generate_submission(X_full_red, y_full, X_test_red, tuned_model, debug=True)

Submission saved to submission.csv
Log transform: Off
Sample predictions:
      Id      SalePrice
0  1461  134649.156250
1  1462  159860.062500
2  1463  183229.359375
3  1464  190165.484375
4  1465  184458.328125
Test prediction summary:
 count      1459.00
mean     178746.20
std       76499.46
min       48647.50
25%      128537.95
50%      158215.00
75%      208677.87
max      574844.00
Name: SalePrice, dtype: float64
