#**House Prices Prediction**
###*Dataset - House Sales Prediction (Kaggle)*
***Link to Dataset - https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data***



##***1. Setup and dataset loading***


##Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib
warnings.filterwarnings('ignore')


##**Creating Functions to process data and perform Feature Engineering**

###*Processing train.csv to add features and remove unwanted items (like outliers) for model training*

In [2]:
class HousePricePreprocessor:
    def __init__(self):
        self.label_encoders = {}
        self.scaler = StandardScaler()
        self.numerical_imputer = SimpleImputer(strategy='median')
        self.categorical_imputer = SimpleImputer(strategy='most_frequent')
        self.numerical_cols = []
        self.categorical_cols = []

    def load_data(self, file_path):
        """Load the dataset"""
        try:
            df = pd.read_csv(file_path)
            print(f"Data loaded successfully! Shape: {df.shape}")
            return df
        except Exception as e:
            print(f"Error loading data: {e}")
            return None

    def basic_info(self, df):
        """Display basic information about the dataset"""
        print("\n" + "="*50)
        print("BASIC DATA INFORMATION")
        print("="*50)
        print(f"Dataset Shape: {df.shape}")
        print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
        print(f"Duplicate rows: {df.duplicated().sum()}")

        # Missing values summary
        missing = df.isnull().sum()
        missing_percent = (missing / len(df)) * 100
        missing_df = pd.DataFrame({
            'Missing_Count': missing,
            'Missing_Percent': missing_percent
        }).sort_values('Missing_Count', ascending=False)

        missing_features = missing_df[missing_df['Missing_Count'] > 0]
        if len(missing_features) > 0:
            print(f"\nFeatures with missing values: {len(missing_features)}")
            print("\nTop 10 features with most missing values:")
            print(missing_features.head(10))
        else:
            print("\nNo missing values found!")

        return missing_df

    def identify_column_types(self, df):
        """Identify numerical and categorical columns"""
        self.numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
        self.categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

        # Remove target variable if present
        if 'SalePrice' in self.numerical_cols:
            self.numerical_cols.remove('SalePrice')

        print(f"\nNumerical columns ({len(self.numerical_cols)}): {self.numerical_cols[:5]}...")
        print(f"Categorical columns ({len(self.categorical_cols)}): {self.categorical_cols[:5]}...")

        return self.numerical_cols, self.categorical_cols

    def handle_missing_values(self, df):
        """Handle missing values in the dataset"""
        print("\n" + "="*50)
        print("HANDLING MISSING VALUES")
        print("="*50)

        df_clean = df.copy()

        # Special handling for specific columns based on domain knowledge
        special_na_cols = {
            'Alley': 'No_Alley',
            'BsmtQual': 'No_Basement',
            'BsmtCond': 'No_Basement',
            'BsmtExposure': 'No_Basement',
            'BsmtFinType1': 'No_Basement',
            'BsmtFinType2': 'No_Basement',
            'FireplaceQu': 'No_Fireplace',
            'GarageType': 'No_Garage',
            'GarageFinish': 'No_Garage',
            'GarageQual': 'No_Garage',
            'GarageCond': 'No_Garage',
            'PoolQC': 'No_Pool',
            'Fence': 'No_Fence',
            'MiscFeature': 'None'
        }

        # Fill special NA columns
        for col, fill_value in special_na_cols.items():
            if col in df_clean.columns:
                df_clean[col] = df_clean[col].fillna(fill_value)
                print(f" {col}: Filled {df[col].isnull().sum()} missing values with '{fill_value}'")

        # Handle LotFrontage (often missing, use median by neighborhood)
        if 'LotFrontage' in df_clean.columns:
            df_clean['LotFrontage'] = df_clean.groupby('Neighborhood')['LotFrontage'].transform(
                lambda x: x.fillna(x.median())
            )
            print("LotFrontage: Filled with neighborhood median")

        # Handle GarageYrBlt
        if 'GarageYrBlt' in df_clean.columns:
            df_clean['GarageYrBlt'] = df_clean['GarageYrBlt'].fillna(df_clean['YearBuilt'])
            print("GarageYrBlt: Filled with YearBuilt")

        # Handle MasVnrArea and MasVnrType
        if 'MasVnrArea' in df_clean.columns:
            df_clean['MasVnrArea'] = df_clean['MasVnrArea'].fillna(0)
        if 'MasVnrType' in df_clean.columns:
            df_clean['MasVnrType'] = df_clean['MasVnrType'].fillna('None')
            print("MasVnr columns: Handled missing values")

        # Fill remaining numerical columns with median
        remaining_num_cols = [col for col in self.numerical_cols if df_clean[col].isnull().sum() > 0]
        if remaining_num_cols:
            df_clean[remaining_num_cols] = self.numerical_imputer.fit_transform(df_clean[remaining_num_cols])
            print(f"Numerical columns: Filled {len(remaining_num_cols)} columns with median")

        # Fill remaining categorical columns with mode
        remaining_cat_cols = [col for col in self.categorical_cols if df_clean[col].isnull().sum() > 0]
        if remaining_cat_cols:
            for col in remaining_cat_cols:
                mode_value = df_clean[col].mode()[0] if not df_clean[col].mode().empty else 'Unknown'
                df_clean[col] = df_clean[col].fillna(mode_value)
            print(f"Categorical columns: Filled {len(remaining_cat_cols)} columns with mode")

        print(f"\nMissing values after cleaning: {df_clean.isnull().sum().sum()}")
        return df_clean

    def create_new_features(self, df):
        """Create new features from existing ones"""
        print("\n" + "="*50)
        print("FEATURE ENGINEERING")
        print("="*50)

        df_features = df.copy()

        # Age features
        if 'YearBuilt' in df_features.columns:
            df_features['House_Age'] = 2023 - df_features['YearBuilt']
            print("Created: House_Age")

        if 'YearRemodAdd' in df_features.columns:
            df_features['Years_Since_Remod'] = 2023 - df_features['YearRemodAdd']
            df_features['Was_Remodeled'] = (df_features['YearBuilt'] != df_features['YearRemodAdd']).astype(int)
            print("Created: Years_Since_Remod, Was_Remodeled")

        # Total area features
        if all(col in df_features.columns for col in ['1stFlrSF', '2ndFlrSF']):
            df_features['Total_SF'] = df_features['1stFlrSF'] + df_features['2ndFlrSF']
            print("Created: Total_SF")

        if 'TotalBsmtSF' in df_features.columns and 'Total_SF' in df_features.columns:
            df_features['Total_Area'] = df_features['Total_SF'] + df_features['TotalBsmtSF']
            print("Created: Total_Area")

        # Bathroom features
        bathroom_cols = ['FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath']
        if all(col in df_features.columns for col in bathroom_cols):
            df_features['Total_Bathrooms'] = (df_features['FullBath'] +
                                            df_features['BsmtFullBath'] +
                                            0.5 * df_features['HalfBath'] +
                                            0.5 * df_features['BsmtHalfBath'])
            print("Created: Total_Bathrooms")

        # Porch features
        porch_cols = ['OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch']
        available_porch_cols = [col for col in porch_cols if col in df_features.columns]
        if available_porch_cols:
            df_features['Total_Porch_SF'] = df_features[available_porch_cols].sum(axis=1)
            print("Created: Total_Porch_SF")

        # Quality features
        quality_cols = ['OverallQual', 'OverallCond']
        if all(col in df_features.columns for col in quality_cols):
            df_features['Overall_Quality_Score'] = df_features['OverallQual'] * df_features['OverallCond']
            print("Created: Overall_Quality_Score")

        # Garage features
        if 'GarageCars' in df_features.columns:
            df_features['Has_Garage'] = (df_features['GarageCars'] > 0).astype(int)
            print("Created: Has_Garage")

        # Basement features
        if 'TotalBsmtSF' in df_features.columns:
            df_features['Has_Basement'] = (df_features['TotalBsmtSF'] > 0).astype(int)
            print("Created: Has_Basement")

        print(f"\nNew dataset shape: {df_features.shape}")
        return df_features

    def encode_categorical_features(self, df):
        """Encode categorical features"""
        print("\n" + "="*50)
        print("ENCODING CATEGORICAL FEATURES")
        print("="*50)

        df_encoded = df.copy()

        # Update categorical columns list
        categorical_cols = df_encoded.select_dtypes(include=['object']).columns.tolist()

        # Ordinal encoding for quality/condition columns
        ordinal_mappings = {
            'ExterQual': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
            'ExterCond': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
            'BsmtQual': {'No_Basement': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
            'BsmtCond': {'No_Basement': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
            'HeatingQC': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
            'KitchenQual': {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
            'FireplaceQu': {'No_Fireplace': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
            'GarageQual': {'No_Garage': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5},
            'GarageCond': {'No_Garage': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
        }

        # Apply ordinal encoding
        ordinal_encoded = []
        for col, mapping in ordinal_mappings.items():
            if col in df_encoded.columns:
                df_encoded[col] = df_encoded[col].map(mapping)
                ordinal_encoded.append(col)
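                # Guard the removal: the column may already be numeric and absent from the list
                if col in categorical_cols:
                    categorical_cols.remove(col)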

        print(f"Ordinal encoded: {len(ordinal_encoded)} columns")

        # Label encoding for remaining categorical columns
        label_encoded = []
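        for col in categorical_cols:
            if col in df_encoded.columns:
                values = df_encoded[col].astype(str)
                if col in self.label_encoders:
                    # Reuse the encoder fitted earlier (e.g. on train.csv) so the integer
                    # codes stay consistent; categories unseen at fit time become -1
                    le = self.label_encoders[col]
                    code_map = {cls: idx for idx, cls in enumerate(le.classes_)}
                    df_encoded[col] = values.map(code_map).fillna(-1).astype(int)
                else:
                    le = LabelEncoder()
                    df_encoded[col] = le.fit_transform(values)
                    self.label_encoders[col] = le
                label_encoded.append(col)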

        print(f"Label encoded: {len(label_encoded)} columns")
        print(f"Total categorical features processed: {len(ordinal_encoded) + len(label_encoded)}")

        return df_encoded

    def scale_features(self, df, target_col='SalePrice'):
        """Scale numerical features"""
        print("\n" + "="*50)
        print("SCALING FEATURES")
        print("="*50)

        df_scaled = df.copy()

        # Get numerical columns (excluding target)
        numerical_cols = df_scaled.select_dtypes(include=[np.number]).columns.tolist()
        if target_col in numerical_cols:
            numerical_cols.remove(target_col)

        # Scale numerical features
        df_scaled[numerical_cols] = self.scaler.fit_transform(df_scaled[numerical_cols])

        print(f"Scaled {len(numerical_cols)} numerical features")
        print("Scaling method: StandardScaler (mean=0, std=1)")

        return df_scaled

    def preprocess_features(self, df):
        """
        Preprocess features for prediction (used by Streamlit)
        This method handles single prediction inputs
        """
        print("\n" + "="*50)
        print("PREPROCESSING FEATURES FOR PREDICTION")
        print("="*50)

        df_processed = df.copy()

        # Convert Yes/No to 1/0 for categorical columns
        yes_no_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
        for col in yes_no_cols:
            if col in df_processed.columns:
                df_processed[col] = df_processed[col].map({'Yes': 1, 'No': 0})

        # Create basic feature engineering (simplified version)
        # Age feature (assuming current year is 2023)
        if 'yearbuilt' in df_processed.columns:
            df_processed['House_Age'] = 2023 - df_processed['yearbuilt']

        # Total bathrooms (simplified)
        if 'bathrooms' in df_processed.columns:
            df_processed['Total_Bathrooms'] = df_processed['bathrooms']

        # Has garage
        if 'parking' in df_processed.columns:
            df_processed['Has_Garage'] = (df_processed['parking'] > 0).astype(int)

        # Has basement
        if 'basement' in df_processed.columns:
            df_processed['Has_Basement'] = df_processed['basement']

        # Quality score (simplified)
        if 'bedrooms' in df_processed.columns and 'bathrooms' in df_processed.columns:
            df_processed['Overall_Quality_Score'] = df_processed['bedrooms'] * df_processed['bathrooms']

        print(f"Processed features shape: {df_processed.shape}")
        return df_processed

    def remove_outliers(self, df, target_col='SalePrice'):
        """Remove outliers using IQR method"""
        print("\n" + "="*50)
        print("REMOVING OUTLIERS")
        print("="*50)

        df_clean = df.copy()
        initial_shape = df_clean.shape[0]

        # Remove outliers from target variable if present
        if target_col in df_clean.columns:
            Q1 = df_clean[target_col].quantile(0.25)
            Q3 = df_clean[target_col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            df_clean = df_clean[(df_clean[target_col] >= lower_bound) &
                               (df_clean[target_col] <= upper_bound)]

            removed = initial_shape - df_clean.shape[0]
            print(f"Removed {removed} outliers from {target_col}")
            print(f"Remaining samples: {df_clean.shape[0]} ({100*df_clean.shape[0]/initial_shape:.1f}%)")

        return df_clean

    def get_feature_importance_data(self, df, target_col='SalePrice'):
        """Prepare data for feature importance analysis"""
        if target_col not in df.columns:
            print(f"Target column '{target_col}' not found")
            return None, None

        X = df.drop(columns=[target_col])
        y = df[target_col]

        print(f"Prepared data for modeling: X{X.shape}, y{y.shape}")
        return X, y

    def full_preprocessing_pipeline(self, file_path, remove_outliers=True):
        """Complete preprocessing pipeline"""
        print("HOUSE PRICES DATA PREPROCESSING PIPELINE")
        print("="*60)

        # Step 1: Load data
        df = self.load_data(file_path)
        if df is None:
            return None

        # Step 2: Basic info
        self.basic_info(df)

        # Step 3: Identify column types
        self.identify_column_types(df)

        # Step 4: Handle missing values
        df = self.handle_missing_values(df)

        # Step 5: Feature engineering
        df = self.create_new_features(df)

        # Step 6: Encode categorical features
        df = self.encode_categorical_features(df)

        # Step 7: Remove outliers (optional)
        if remove_outliers:
            df = self.remove_outliers(df)

        # Step 8: Scale features
        df = self.scale_features(df)

        print("\n" + "="*60)
        print("PREPROCESSING COMPLETED SUCCESSFULLY!")
        print(f"Final dataset shape: {df.shape}")
        print("="*60)

        return df
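
The `remove_outliers` step keeps only rows whose target lies within 1.5 × IQR of the quartiles. As a minimal sketch of that rule, the snippet below computes the fences on a handful of made-up prices; the helper name `iqr_bounds` and the sample values are assumptions for illustration and do not come from `train.csv`.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Made-up prices for illustration only (hypothetical, not from the dataset)
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 160_000, 180_000, 755_000])

lower, upper = iqr_bounds(prices)
kept = prices[(prices >= lower) & (prices <= upper)]

print(f"Fences: [{lower:,.0f}, {upper:,.0f}]")
print(f"Kept {len(kept)} of {len(prices)} rows")
```

Here Q1 is 125,000 and Q3 is 170,000, so the fences are 57,500 and 237,500 and only the 755,000 sale is dropped.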

##**Processing test.csv and training model on train.csv to predict house prices**

###*Using different models (Random Forest, Gradient Boosting) to analyse which model predicts more accurately*

In [3]:
def train_and_predict(preprocessor, processed_train_df, test_file_path):
    """
    Simple function to train model and make predictions
    Uses already processed training data to avoid duplicate preprocessing
    """
    print("\n" + "="*60)
    print("TRAINING MODEL AND MAKING PREDICTIONS")
    print("="*60)

    # 1. Use already processed training data
    print("Using pre-processed training data...")
    train_df = processed_train_df

    if train_df is None:
        print("No processed training data provided")
        return None, None  # two values, so the caller's tuple unpacking does not fail

    # 2. Prepare training data
    X, y = preprocessor.get_feature_importance_data(train_df, 'SalePrice')

    if X is None:
        print("No target variable found")
        return None, None  # two values, so the caller's tuple unpacking does not fail

    # 3. Split data for validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    print(f"Data split - Train: {X_train.shape}, Validation: {X_val.shape}")

    # 4. Train models and select best one
    models = {
        'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)
    }

    best_model = None
    best_score = float('inf')
    best_name = ""

    print("\nTraining models...")
    for name, model in models.items():
        # Train model
        model.fit(X_train, y_train)

        # Validate
        val_pred = model.predict(X_val)
        rmse = np.sqrt(mean_squared_error(y_val, val_pred))
        r2 = r2_score(y_val, val_pred)

        print(f"   {name}: RMSE=${rmse:,.0f}, R²={r2:.4f}")

        if rmse < best_score:
            best_score = rmse
            best_model = model
            best_name = name

    print(f"\nBest model: {best_name} (RMSE: ${best_score:,.0f})")

    # SAVE THE BEST MODEL (not just 'model')
    joblib.dump(best_model, 'house_price_model.pkl')

    # SAVE FEATURE NAMES AND SCALER FOR STREAMLIT
    save_feature_names_and_scaler(preprocessor, X_train)

    # 5. Process test data
    print("\nProcessing test data...")
    test_df = preprocessor.load_data(test_file_path)

    if test_df is None:
        print("Failed to load test data")
        return None, None  # two values, so the caller's tuple unpacking does not fail

    # Store test IDs if they exist
    test_ids = None
    if 'Id' in test_df.columns:
        test_ids = test_df['Id'].copy()
        test_df = test_df.drop('Id', axis=1)

    # Apply same preprocessing to test data (without target operations)
    original_numerical_cols = preprocessor.numerical_cols.copy()
    original_categorical_cols = preprocessor.categorical_cols.copy()

    # Identify columns
    preprocessor.identify_column_types(test_df)

    # Handle missing values
    test_df = preprocessor.handle_missing_values(test_df)

    # Create new features
    test_df = preprocessor.create_new_features(test_df)

    # Encode categorical features
    test_df = preprocessor.encode_categorical_features(test_df)

    # Align test data with training data features
    train_features = X.columns.tolist()

    # Add missing columns with zeros
    for col in train_features:
        if col not in test_df.columns:
            test_df[col] = 0
            print(f"Added missing column: {col}")

    # Remove extra columns
    extra_cols = [col for col in test_df.columns if col not in train_features]
    if extra_cols:
        test_df = test_df.drop(columns=extra_cols)
        print(f"Removed extra columns: {extra_cols}")

    # Reorder columns to match training data
    test_df = test_df[train_features]

    # Scale test data using the same scaler fitted on training data
    numerical_cols = test_df.select_dtypes(include=[np.number]).columns.tolist()
    test_df[numerical_cols] = preprocessor.scaler.transform(test_df[numerical_cols])

    print(f"Test data processed: {test_df.shape}")

    # 6. Make predictions
    print("\nMaking predictions...")
    predictions = best_model.predict(test_df)

    # 7. Create submission file
    if test_ids is not None:
        submission = pd.DataFrame({
            'Id': test_ids,
            'SalePrice': predictions
        })
    else:
        submission = pd.DataFrame({
            'SalePrice': predictions
        })

    # Save predictions
    submission.to_csv('house_price_predictions.csv', index=False)

    print("Predictions saved to 'house_price_predictions.csv'")
    print(f"Prediction stats:")
    print(f"   Mean: ${predictions.mean():,.0f}")
    print(f"   Median: ${np.median(predictions):,.0f}")
    print(f"   Min: ${predictions.min():,.0f}")
    print(f"   Max: ${predictions.max():,.0f}")

    # 8. Show feature importance
    if hasattr(best_model, 'feature_importances_'):
        print(f"\nTOP 10 MOST IMPORTANT FEATURES:")
        feature_importance = pd.DataFrame({
            'Feature': train_features,
            'Importance': best_model.feature_importances_
        }).sort_values('Importance', ascending=False)

        for i, row in feature_importance.head(10).iterrows():
            print(f"   {row['Feature']:<20} {row['Importance']:.4f}")

    print("\nFiles created for Streamlit:")
    print("- house_price_model.pkl (trained model)")
    print("- feature_names.pkl (feature names)")
    print("- scaler.pkl (fitted scaler)")

    return submission, best_model
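
`train_and_predict` ranks the candidate models by validation RMSE and also reports R². To make those two numbers concrete, here is a small sketch that computes both by hand with NumPy and picks the lower-RMSE candidate. The validation targets, the two prediction vectors, and the names `model_a` and `model_b` are invented for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical validation targets and predictions from two candidate models
y_val = [200_000, 150_000, 300_000, 250_000]
candidates = {
    "model_a": [210_000, 140_000, 290_000, 260_000],
    "model_b": [230_000, 150_000, 300_000, 250_000],
}

scores = {name: rmse(y_val, pred) for name, pred in candidates.items()}
best_name = min(scores, key=scores.get)  # lowest RMSE wins, as in the loop above

for name, pred in candidates.items():
    print(f"{name}: RMSE={scores[name]:,.0f}, R2={r2(y_val, pred):.3f}")
print(f"Best: {best_name}")
```

Because RMSE squares each error, `model_b` is penalised for its single 30,000 miss even though three of its four predictions are exact.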


##**Saving Feature names for appropriate scaling**

In [4]:
def save_feature_names_and_scaler(preprocessor, X_train):
    """
    Save feature names and scaler for later use in Streamlit
    """
    import pickle

    # Save feature names
    with open('feature_names.pkl', 'wb') as f:
        pickle.dump(X_train.columns.tolist(), f)

    # Save the fitted scaler
    with open('scaler.pkl', 'wb') as f:
        pickle.dump(preprocessor.scaler, f)

    print(f"Saved {len(X_train.columns)} feature names and scaler")
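
For the saved artifacts to be useful later, the consuming app has to read them back with `pickle.load`. The sketch below shows that round trip in a temporary directory; the helpers `save_artifact` and `load_artifact` and the three feature names are hypothetical stand-ins, not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path

def save_artifact(obj, path):
    """Serialize any picklable object to the given path."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_artifact(path):
    """Load a pickled object; only use this on files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical feature list standing in for X_train.columns.tolist()
feature_names = ["LotArea", "OverallQual", "Total_SF"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_names.pkl"
    save_artifact(feature_names, path)
    restored = load_artifact(path)

print(restored)
```

The same pattern applies to the fitted scaler. Since unpickling can execute arbitrary code, these files should only be loaded from a trusted source.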


##**processed_house_data.csv and house_price_predictions.csv created using Random Forest**

In [6]:
if __name__ == "__main__":
    preprocessor = HousePricePreprocessor()

    #from google.colab import drive
    #drive.mount('/content/drive')

    # File paths
    train_file_path = '/content/drive/MyDrive/train.csv'
    test_file_path = '/content/drive/MyDrive/test.csv'

    processed_df = preprocessor.full_preprocessing_pipeline(train_file_path)

    if processed_df is not None:
        processed_df.to_csv('processed_house_data.csv', index=False)
        print("Processed data saved as 'processed_house_data.csv'")

        X, y = preprocessor.get_feature_importance_data(processed_df)
        if X is not None:
            print(f"\nReady for modeling with {X.shape[1]} features!")

        predictions, model = train_and_predict(preprocessor, processed_df, test_file_path)

        if predictions is not None:
            print("\n house_price_predictions.csv created.")
            print("Predictions:")
            print(predictions.head())
        else:
            print("Something went wrong. Please check your file paths and data.")


HOUSE PRICES DATA PREPROCESSING PIPELINE
Data loaded successfully! Shape: (1460, 81)

BASIC DATA INFORMATION
Dataset Shape: (1460, 81)
Memory Usage: 3.86 MB
Duplicate rows: 0

Features with missing values: 19

Top 10 features with most missing values:
              Missing_Count  Missing_Percent
PoolQC                 1453        99.520548
MiscFeature            1406        96.301370
Alley                  1369        93.767123
Fence                  1179        80.753425
MasVnrType              872        59.726027
FireplaceQu             690        47.260274
LotFrontage             259        17.739726
GarageQual               81         5.547945
GarageFinish             81         5.547945
GarageType               81         5.547945

Numerical columns (37): ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual']...
Categorical columns (43): ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour']...

HANDLING MISSING VALUES
 Alley: Filled 1369 missing values with 'No_Alley'
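
The log above lists 259 missing `LotFrontage` values, which `handle_missing_values` fills with the median of each `Neighborhood`. The following sketch reproduces that idea on a toy frame (the neighborhoods `A`, `B`, `C` and their values are made up) and shows why a fallback is still needed when a whole group has no observed value.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: neighborhood C has no observed LotFrontage
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan, np.nan],
})

# Step 1: fill each gap with the median of its own neighborhood
filled = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# Step 2: a group with no observed values has a NaN median, so fall back
# to the overall median of the observed values
filled = filled.fillna(toy["LotFrontage"].median())

print(filled.tolist())
```

In the notebook, the second step is covered by the later median imputation of any numerical columns that still contain missing values.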

Key Steps involved:
- Cleaned and prepared the training dataset (`train.csv`) by removing outliers and imputing missing values.
- Engineered relevant features for better model performance.
- Processed the test dataset (`test.csv`) in a similar fashion to ensure consistent input.
- The preprocessed training data is also saved in `processed_house_data.csv` for reference.
- Trained multiple models including Random Forest and Gradient Boosting Regressors.
- Selected Random Forest as the final model based on its lower validation RMSE.
- Saved the trained model (`house_price_model.pkl`), feature names (`feature_names.pkl`), and scaler (`scaler.pkl`) for deployment.
- Generated predictions on the test set and saved them in `house_price_predictions.csv`.
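
Keeping the test input consistent with training mostly comes down to giving the model the same columns in the same order. One compact way to sketch the add-missing, drop-extra, and reorder steps is `DataFrame.reindex`; the helper name `align_to_training_columns` and the small frames below are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

def align_to_training_columns(df, train_features):
    """Return df with exactly the training columns, in training order.

    Columns missing from df are added and filled with 0; columns that
    were not seen during training are dropped.
    """
    return df.reindex(columns=train_features, fill_value=0)

# Hypothetical training feature list and a mismatched test frame
train_features = ["LotArea", "OverallQual", "Has_Garage"]
test_df = pd.DataFrame({
    "OverallQual": [7, 5],
    "LotArea": [8450, 9600],
    "Extra": [1, 2],  # not a training feature, will be dropped
})

aligned = align_to_training_columns(test_df, train_features)
print(aligned)
```

Filling an absent column with 0 is a simplification: for a standardized feature, 0 corresponds to the training mean only if the fill happens after scaling.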