# 🏠 Predicting House Prices with a Clean ML Pipeline


This notebook walks through an end-to-end ML workflow for predicting house prices
with the Ames Housing dataset, **with the target column (`SalePrice`) kept out of the features**.


## 📂 Data Loading (from runtime upload)

In [1]:
import pandas as pd
df = pd.read_csv('train.csv')
df.columns = df.columns.str.strip()  # Remove leading/trailing whitespace from column names
df.head()


,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


## 🎯 Preparing Features and Target

In [2]:

# Drop target column from features
X = df.drop(columns='SalePrice')
y = df['SalePrice']

print('Is SalePrice in X?', 'SalePrice' in X.columns)  # Should be False


Is SalePrice in X? False


## 🔍 Defining Numeric and Categorical Features

In [3]:

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print('Numeric features:', numeric_features)
print('Categorical features:', categorical_features)


Numeric features: ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
Categorical features: ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageT
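
The numeric list printed above includes `Id`, which is a row identifier rather than a property of the house. One possible refinement, sketched here on a toy frame, is to leave identifier columns out of the feature lists; `ColumnTransformer` drops unlisted columns by default (`remainder='drop'`). The names `X_demo`, `id_columns`, `numeric_demo`, and `categorical_demo` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for X (assumption: the real frame has 80 columns).
X_demo = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "MSZoning": pd.Series(["RL", "RL", "RM"], dtype="object"),
})

# Assumption: columns treated as identifiers rather than features.
id_columns = ["Id"]

# Same dtype-based selection as the notebook, minus the identifiers.
numeric_demo = [
    col
    for col in X_demo.select_dtypes(include=["int64", "float64"]).columns
    if col not in id_columns
]
categorical_demo = X_demo.select_dtypes(include=["object"]).columns.tolist()

print("Numeric features:", numeric_demo)
print("Categorical features:", categorical_demo)
```

Whether dropping `Id` changes the score on this dataset is not shown here; the sketch only illustrates how the lists could be built.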

## ⚙️ Preprocessing Pipeline

In [4]:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])


## ✂️ Train-Test Split

In [5]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print('Train size:', X_train.shape, '| Test size:', X_test.shape)


Train size: (1168, 80) | Test size: (292, 80)


## 🚀 Model Pipeline and Hyperparameter Tuning

In [6]:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(random_state=42))
])

param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [3, 5],
    'model__learning_rate': [0.05, 0.1]
}

grid = GridSearchCV(pipeline, param_grid, cv=3, scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)


Best parameters: {'model__learning_rate': 0.1, 'model__max_depth': 3, 'model__n_estimators': 200}
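
The grid above has 2 × 2 × 2 = 8 candidates, so `cv=3` means 24 fits plus one final refit on all training rows. Because scikit-learn maximizes scores, `neg_root_mean_squared_error` stores the RMSE with its sign flipped. The sketch below shows how the cross-validated RMSE could be read back; it uses synthetic data from `make_regression` and a smaller, assumed grid so it runs quickly, and `demo_grid` and `cv_rmse` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for the housing features).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=42
)

demo_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    {"n_estimators": [20, 40], "max_depth": [2, 3]},  # 4 candidates
    cv=3,
    scoring="neg_root_mean_squared_error",
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the negated RMSE, so flip the sign to report it.
cv_rmse = -demo_grid.best_score_
print("Best parameters:", demo_grid.best_params_)
print("Cross-validated RMSE:", cv_rmse)
```

Reporting the cross-validated RMSE next to the test RMSE gives a quick check on whether the two estimates roughly agree.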


## 🧪 Model Evaluation

In [7]:

from sklearn.metrics import mean_squared_error
import numpy as np

y_pred = grid.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print('Test RMSE:', rmse)


Test RMSE: 26681.34661453705
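
RMSE is expressed in the same units as the target, so the value above is in the same units as `SalePrice`. As a small worked example of the formula, and of one way to put the number in context, the sketch below computes RMSE by hand on three made-up prices and divides it by the mean true price. All values and names here are illustrative assumptions.

```python
import numpy as np

# Made-up true and predicted prices; each prediction is off by 10,000.
y_true_demo = np.array([200000.0, 150000.0, 300000.0])
y_pred_demo = np.array([210000.0, 140000.0, 290000.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse_demo = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# Error relative to the average price, as a rough scale-free summary.
relative_rmse = rmse_demo / y_true_demo.mean()

print("RMSE:", rmse_demo)
print("RMSE / mean price:", round(relative_rmse, 4))
```

Applying the same ratio to `rmse` and `y_test.mean()` in the notebook would express the test error as a fraction of a typical sale price.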


## 🔮 Predicting Price for a New House with Proper Data Types

In [9]:

# Start from one row of placeholder values: 0 for numeric columns, 'None' for categorical ones
defaults = {col: 'None' if X_train[col].dtype == 'object' else 0
            for col in X_train.columns}
new_house_full = pd.DataFrame([defaults], columns=X_train.columns)

# Cast to X_train's dtypes so the pipeline sees the same column types as in training
new_house_full = new_house_full.astype(X_train.dtypes.to_dict())

# Set specific values
new_house_full.loc[0, 'LotArea'] = 8450
new_house_full.loc[0, 'YearBuilt'] = 2003  # Example year
new_house_full.loc[0, '1stFlrSF'] = 856
new_house_full.loc[0, '2ndFlrSF'] = 854
new_house_full.loc[0, 'GrLivArea'] = 1710
new_house_full.loc[0, 'GarageCars'] = 2
new_house_full.loc[0, 'FullBath'] = 2
new_house_full.loc[0, 'TotRmsAbvGrd'] = 8
new_house_full.loc[0, 'Neighborhood'] = 'CollgCr'
new_house_full.loc[0, 'HouseStyle'] = '2Story'

# Predict the price
prediction = grid.predict(new_house_full)
print('Predicted price:', prediction)


Predicted price: [148821.74844651]
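
Filling every unspecified numeric field with 0 puts columns such as `OverallQual` and `YrSold` far outside the range seen in training, which may pull the prediction away from a realistic value. A possible alternative, sketched below on a toy frame, is to start from a "typical" training row (median for numeric columns, most frequent value for categorical ones) and then override the known fields. The helper `typical_row` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for X_train.
train_demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "OverallQual": [7, 6, 7, 7],
    "Neighborhood": pd.Series(
        ["CollgCr", "Veenker", "CollgCr", "Crawfor"], dtype="object"
    ),
})


def typical_row(frame):
    """Return a one-row DataFrame of medians (numeric) and modes (other)."""
    values = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            values[col] = frame[col].median()
        else:
            values[col] = frame[col].mode().iloc[0]
    return pd.DataFrame([values], columns=frame.columns)


# Start from typical values, then set what is known about the new house.
new_house_demo = typical_row(train_demo)
new_house_demo.loc[0, "LotArea"] = 8450

print(new_house_demo)
```

In the notebook, the equivalent would be to call such a helper on `X_train` and then apply the specific values set above before calling `grid.predict`.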


## 💡 Conclusion
✅ Clean pipeline free of target column errors  
✅ Model evaluated on a held-out test set (RMSE ≈ 26,681)  
✅ Ready to predict house prices for new samples!