# Project 2: House Price Prediction (Regression)

**Goal:** Predict house prices using regression models and compare model performance.

**Dataset example:** Kaggle House Prices (or any dataset with a `SalePrice` column).

**Instructions:**
- Place `house_prices.csv` in the same directory as the notebook before running it.
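
A quick sanity check before running the remaining cells can catch a missing file or a missing target column early. The sketch below is only an illustration: `check_dataset` is a hypothetical helper name, and it assumes the default file name and `SalePrice` target mentioned above.

```python
from pathlib import Path

import pandas as pd


def check_dataset(path="house_prices.csv", target="SalePrice"):
    """Return a list of problems found with the CSV at `path` (empty if none)."""
    csv_path = Path(path)
    if not csv_path.is_file():
        return [f"{csv_path} not found (working directory: {Path.cwd()})"]
    # Read only the header row; that is enough to inspect the column names.
    columns = pd.read_csv(csv_path, nrows=0).columns
    if target not in columns:
        return [f"column {target!r} is missing from {csv_path}"]
    return []


# Report any problems; prints nothing if the file looks usable.
for problem in check_dataset():
    print("Problem:", problem)
```

An empty list means the file exists and contains the target column; it does not validate the remaining columns.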


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

sns.set_theme(style='whitegrid')

In [None]:
df = pd.read_csv('house_prices.csv')
print(df.shape)
df.head()

In [None]:
if 'Id' in df.columns:
    df = df.drop('Id', axis=1)

y = df['SalePrice']
X = df.drop('SalePrice', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Train:', X_train.shape, 'Test:', X_test.shape)

In [None]:
# 'number' matches every numeric dtype, not only int64 and float64
num_cols = X.select_dtypes(include='number').columns.tolist()
cat_cols = X.select_dtypes(include=['object']).columns.tolist()

# Numeric: median-impute, then standardize. Categorical: fill with 'NA', then one-hot encode.
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='NA')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_cols), ('cat', cat_transformer, cat_cols)])

In [None]:
lr_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', LinearRegression())])

lr_pipeline.fit(X_train, y_train)
preds_lr = lr_pipeline.predict(X_test)

print('Linear Regression RMSE:', np.sqrt(mean_squared_error(y_test, preds_lr)))
print('Linear Regression R2:', r2_score(y_test, preds_lr))

In [None]:
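from sklearn.base import clone

# clone() gives this pipeline its own unfitted copy of the preprocessor,
# so fitting it does not refit the instance used inside lr_pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', clone(preprocessor)),
    ('regressor', RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)),
])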

rf_pipeline.fit(X_train, y_train)
preds_rf = rf_pipeline.predict(X_test)

print('Random Forest RMSE:', np.sqrt(mean_squared_error(y_test, preds_rf)))
print('Random Forest R2:', r2_score(y_test, preds_rf))

In [None]:
residuals = y_test - preds_rf
plt.figure(figsize=(6,4))
sns.scatterplot(x=preds_rf, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted (Random Forest)')
plt.show()

### Next Steps / Enhancements
- Tune hyperparameters with `GridSearchCV`
- Try other models: Gradient Boosting, XGBoost, CatBoost
- Perform feature selection and rank feature importances
- Save the model with `joblib` and deploy it with Streamlit or FastAPI