# Capstone Project: Predicting Housing Prices
### Berkeley AI/ML Professional Certificate
---
This notebook explores housing price prediction using regression and tree-based models. The goal is to understand which factors drive home prices and how well machine learning can predict them.

## 1. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset (Kaggle Ames Housing dataset)
url = "https://raw.githubusercontent.com/selva86/datasets/master/AmesHousing.csv"
df = pd.read_csv(url)
df.head()

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Basic info
df.info()
df.describe().T.head(10)

In [None]:
# Missing values
missing = df.isnull().mean().sort_values(ascending=False)
missing[missing > 0].head(10)

In [None]:
# Distribution of SalePrice
plt.figure(figsize=(8,5))
sns.histplot(df['SalePrice'], kde=True)
plt.title('Distribution of Home Sale Prices')
plt.show()

In [None]:
# Correlation with SalePrice
corr = df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)
corr.head(10)

In [None]:
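# Heatmap of pairwise correlations among the 10 features most correlated with SalePrice
top_corr_features = corr.index[1:11]  # skip index 0, which is SalePrice itself
plt.figure(figsize=(10,8))
sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="coolwarm")
plt.title("Correlations Among the Top 10 SalePrice-Correlated Features")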
plt.show()

## 3. Data Preprocessing

In [None]:
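# Keep only columns with at least 70% non-missing values
# (thresh is a count of non-missing rows, so cast it to int)
df = df.dropna(axis=1, thresh=int(len(df) * 0.7))

# Fill remaining missing numeric values with the column median
# (categorical NaNs are left as-is and become all-zero dummy rows below)
df = df.fillna(df.median(numeric_only=True))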

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded.shape
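
The cell above imputes and encodes the full dataset before the train/test split, so the medians are computed with test rows included. One possible alternative, sketched here under assumptions, is to wrap the preprocessing in a scikit-learn `Pipeline` so that every statistic is learned from the training rows only. The toy frame and its column names (`GrLivArea`, `LotFrontage`, `Neighborhood`) are hypothetical stand-ins, not the notebook's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the housing data (illustrative columns only).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "LotFrontage": rng.normal(70, 20, n),
    "Neighborhood": rng.choice(["NAmes", "CollgCr", "OldTown"], n),
})
toy["SalePrice"] = 100 * toy["GrLivArea"] + rng.normal(0, 10_000, n)
# Inject some missing values so the imputer has work to do.
toy.loc[rng.choice(n, 20, replace=False), "LotFrontage"] = np.nan

X_toy = toy.drop(columns="SalePrice")
y_toy = toy["SalePrice"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

# Numeric columns: median imputation, then scaling.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

pipe = Pipeline([("preprocess", preprocess), ("model", Ridge(alpha=1.0))])
pipe.fit(X_tr, y_tr)  # medians, scales, and categories come from training rows only
toy_r2 = pipe.score(X_te, y_te)
print(f"Test R2 on toy data: {toy_r2:.3f}")
```

Scaling is included because Ridge and Lasso penalties are sensitive to feature scale; tree-based models would not need that step.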

## 4. Train/Test Split

In [None]:
X = df_encoded.drop('SalePrice', axis=1)
y = df_encoded['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 5. Baseline and Models

In [None]:
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.001),
    'DecisionTree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42)
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    r2 = r2_score(y_test, preds)
    results[name] = {'RMSE': rmse, 'R2': r2}

pd.DataFrame(results).T
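
The comparison above relies on a single train/test split, so the ranking can shift with the random seed. As a rough sketch, `cross_val_score` (imported in the setup cell) could give a more stable estimate by averaging over several folds. The synthetic data from `make_regression` is an assumption made so the example runs on its own; in the notebook, `X_train` and `y_train` would take its place.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)

candidates = {
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so RMSE is reported negated; flip the sign back.
    scores = -cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = (scores.mean(), scores.std())

for name, (mean, std) in cv_rmse.items():
    print(f"{name}: RMSE = {mean:.1f} +/- {std:.1f}")
```

Reporting the standard deviation alongside the mean helps show whether the gap between two models is larger than the fold-to-fold variation.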

## 6. Hyperparameter Tuning (GridSearch)

In [None]:
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10, None]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
best_preds = best_model.predict(X_test)
best_rmse = np.sqrt(mean_squared_error(y_test, best_preds))
best_r2 = r2_score(y_test, best_preds)

best_rmse, best_r2, grid.best_params_

## 7. Findings and Interpretation
### Key Insights:
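- Size-related features (square footage, number of rooms) are strongly positively correlated with sale price.
- Location-related variables (Neighborhood) also play a large role; because Neighborhood is categorical, its effect enters the models through the one-hot encoded columns rather than the numeric correlation table.
- Among the models compared, **Random Forest with tuned hyperparameters performed best**, with the lowest RMSE and highest R² on the held-out test set.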
- Linear models provide interpretability, while tree-based models capture complex non-linear relationships.
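
One way to check which features a tree-based model actually relies on is to inspect the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical column names (`living_area`, `n_rooms`, `noise`) so it runs standalone; it illustrates the technique and is not a result from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500
X_imp = pd.DataFrame({
    "living_area": rng.normal(1500, 400, n),  # hypothetical: main driver of the target
    "n_rooms": rng.integers(3, 10, n),        # hypothetical: weaker effect
    "noise": rng.normal(0, 1, n),             # unrelated to the target
})
y_imp = 100 * X_imp["living_area"] + 2_000 * X_imp["n_rooms"] + rng.normal(0, 5_000, n)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_imp, y_imp)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X_imp.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook, the same pattern would presumably apply to `best_model.feature_importances_` indexed by `X_train.columns`. Impurity-based importances can be spread thinly across one-hot columns of a single categorical variable, so summing them per original feature may give a fairer picture of variables such as Neighborhood.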

### Recommendations:
- Real estate professionals could use the model's predictions as a starting point for pricing, alongside local market knowledge.
- Policymakers could identify affordability trends by analyzing influential features.
- Next steps: try gradient boosting methods (XGBoost, LightGBM) for further performance gains.