# Practice Lab: Multiple Linear Regression

**Module 2 - Lesson 3**  
**Date:** November 11, 2025

---

## 🎯 Learning Objectives

In this practice lab, you will:
- ✅ Build multiple linear regression models with 2+ features
- ✅ Understand matrix representation (X, θ)
- ✅ Handle categorical variables (one-hot encoding)
- ✅ Detect and address multicollinearity
- ✅ Perform what-if scenario analysis
- ✅ Compare OLS vs Gradient Descent
- ✅ Visualize hyperplanes (3D regression planes)
- ✅ Interpret coefficients and feature importance

---

## 📊 Real-World Scenario: Real Estate Valuation

You're a data scientist at **Metro Realty**, a real estate company. Your goal is to build an accurate house price prediction model considering multiple factors:

**Why Multiple Features?**
- House price depends on size, location, age, amenities, etc.
- Using only one feature (e.g., square footage) ignores other important factors
- Multiple regression captures the combined effect of all features

**Dataset Features:**
- `house_id`: Unique identifier
- `sqft`: Square footage
- `bedrooms`: Number of bedrooms
- `bathrooms`: Number of bathrooms
- `age_years`: Age of house in years
- `garage_spaces`: Garage capacity
- `location`: Neighborhood (categorical: Urban/Suburban/Rural)
- `has_pool`: Pool present (categorical: Yes/No)
- `condition`: House condition score (1-10)
- `price_usd`: **TARGET** - House price in USD

---

## 🔧 Setup: Run Me First!

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Set random seeds for reproducibility
np.random.seed(42)

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ Setup complete!")
print(f"NumPy: {np.__version__} | Pandas: {pd.__version__}")

## 📦 Generate Synthetic Real Estate Dataset

In [None]:
# Generate 200 house records
n_samples = 200

# Generate continuous features
sqft = np.random.uniform(1000, 4000, n_samples).round(0)
bedrooms = np.random.choice([2, 3, 4, 5, 6], n_samples)
bathrooms = np.random.choice([1, 1.5, 2, 2.5, 3, 3.5], n_samples)
age_years = np.random.uniform(0, 50, n_samples).round(0)
garage_spaces = np.random.choice([0, 1, 2, 3], n_samples, p=[0.1, 0.3, 0.4, 0.2])
condition = np.random.uniform(4, 10, n_samples).round(1)

# Generate categorical features
location = np.random.choice(['Urban', 'Suburban', 'Rural'], n_samples, p=[0.4, 0.4, 0.2])
has_pool = np.random.choice(['Yes', 'No'], n_samples, p=[0.3, 0.7])

# Generate price based on realistic formula with correlations
base_price = (
    50000 +                                    # Base cost
    sqft * 150 +                              # $150 per sqft
    bedrooms * 20000 +                        # $20k per bedroom
    bathrooms * 15000 +                       # $15k per bathroom
    -age_years * 2000 +                       # Depreciation $2k/year
    garage_spaces * 10000 +                   # $10k per garage space
    condition * 5000 +                        # $5k per condition point
    (location == 'Urban') * 80000 +           # Urban premium $80k
    (location == 'Suburban') * 40000 +        # Suburban premium $40k
    (has_pool == 'Yes') * 25000               # Pool adds $25k
)

# Add realistic noise (standard deviation = 12% of each house's base price)
noise = np.random.normal(0, base_price * 0.12, n_samples)
price_usd = (base_price + noise).round(2)

# Ensure minimum price
price_usd = np.maximum(price_usd, 150000)

# Create DataFrame
df = pd.DataFrame({
    'house_id': [f'H{i:04d}' for i in range(1, n_samples + 1)],
    'sqft': sqft,
    'bedrooms': bedrooms,
    'bathrooms': bathrooms,
    'age_years': age_years,
    'garage_spaces': garage_spaces,
    'location': location,
    'has_pool': has_pool,
    'condition': condition,
    'price_usd': price_usd
})

print("✅ Real estate dataset generated!")
print(f"\n📊 Dataset: {df.shape[0]} houses, {df.shape[1]} columns")
print(f"\n💰 Price range: ${df['price_usd'].min():,.0f} - ${df['price_usd'].max():,.0f}")
print(f"   Average price: ${df['price_usd'].mean():,.0f}")
print("\n🏠 Sample houses:")
df.head()

## 📊 Data Dictionary

In [None]:
print("📚 Data Dictionary\n")
print("Continuous Features:")
print("  sqft           - House size in square feet (1,000-4,000)")
print("  bedrooms       - Number of bedrooms (2-6)")
print("  bathrooms      - Number of bathrooms (1-3.5)")
print("  age_years      - Age of house (0-50 years)")
print("  garage_spaces  - Garage capacity (0-3 cars)")
print("  condition      - House condition score (4-10)")
print("\nCategorical Features:")
print("  location       - Neighborhood type (Urban/Suburban/Rural)")
print("  has_pool       - Swimming pool present (Yes/No)")
print("\nTarget Variable:")
print("  price_usd      - 🎯 House sale price in USD")

# Dataset info
print("\n" + "="*60)
df.info()

## 📈 Concept 1: Handling Categorical Variables

**Problem:** Regression models require numerical inputs, but we have categorical features (location, has_pool).

**Solutions:**
1. **Binary encoding** for 2 categories (0/1)
2. **One-hot encoding** for 3+ categories (create binary columns)
3. **Dummy variable trap:** Drop one category to avoid perfect multicollinearity
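
The dummy variable trap in point 3 can be checked numerically. The sketch below uses a small made-up `location` column (an assumption for illustration, not the lab's dataset) and compares the rank of the design matrix with and without `drop_first=True`. With an intercept column and all three dummies, the dummies sum to the intercept column, so the matrix loses a rank.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only
toy = pd.DataFrame(
    {"location": ["Urban", "Suburban", "Rural", "Urban", "Rural", "Suburban"]}
)

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy["location"], dtype=int)
# Reduced encoding: the first category alphabetically ('Rural') is dropped
reduced = pd.get_dummies(toy["location"], drop_first=True, dtype=int)


def design_rank(dummies):
    """Return (rank, n_columns) of the matrix [intercept column, dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return np.linalg.matrix_rank(X), X.shape[1]


full_rank, full_cols = design_rank(full)
reduced_rank, reduced_cols = design_rank(reduced)

print(f"Full one-hot: rank {full_rank} of {full_cols} columns")
print(f"drop_first:   rank {reduced_rank} of {reduced_cols} columns")
```

A rank below the number of columns means the coefficients are not uniquely determined, which is the perfect multicollinearity that dropping one category avoids.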

In [None]:
# Before encoding
print("🔍 BEFORE Encoding:")
print(df[['house_id', 'location', 'has_pool', 'price_usd']].head(10))

# Create copy for encoding
df_encoded = df.copy()

# Binary encoding: has_pool (Yes=1, No=0)
df_encoded['has_pool_binary'] = (df_encoded['has_pool'] == 'Yes').astype(int)

# One-hot encoding: location (3 categories → 2 dummy variables)
# dtype=int gives 0/1 columns; recent pandas versions otherwise return booleans
location_dummies = pd.get_dummies(df_encoded['location'], prefix='location', drop_first=True, dtype=int)
df_encoded = pd.concat([df_encoded, location_dummies], axis=1)

print("\n\n✅ AFTER Encoding:")
print(df_encoded[['house_id', 'has_pool_binary', 'location_Suburban', 'location_Urban', 'price_usd']].head(10))

print("\n\n📊 Encoding Explanation:")
print("  has_pool: Yes → 1, No → 0")
print("  \n  location (dropped 'Rural' to avoid dummy trap):")
print("    Rural     → location_Suburban=0, location_Urban=0")
print("    Suburban  → location_Suburban=1, location_Urban=0")
print("    Urban     → location_Suburban=0, location_Urban=1")

# Verify encoding
print("\n\n🔍 Verification:")
verification = df_encoded.groupby('location')[['location_Suburban', 'location_Urban']].mean()
print(verification)

## 📊 Concept 2: Detecting Multicollinearity

**Multicollinearity:** When independent variables are highly correlated with each other.

**Problems:**
- Unstable coefficients
- Can't isolate individual effects
- Unrealistic what-if scenarios

**Detection:** Correlation matrix and VIF (Variance Inflation Factor)
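
The VIF for feature *j* is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature *j* on all the other features. The sketch below is one minimal way to compute it with scikit-learn. The helper name `compute_vif` and the synthetic columns `a`, `b`, `c` are assumptions made for illustration. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def compute_vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of a 2D array."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # R^2 of predicting column j from the remaining columns
        r2 = LinearRegression().fit(others, target).score(others, target)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic example: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vif = compute_vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

Unlike a pairwise correlation matrix, VIF also catches a feature that is a combination of several others, so the two checks complement each other.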

In [None]:
# Select numeric features for correlation analysis
numeric_cols = ['sqft', 'bedrooms', 'bathrooms', 'age_years', 'garage_spaces', 
                'condition', 'has_pool_binary', 'location_Suburban', 'location_Urban', 'price_usd']

# Calculate correlation matrix
corr_matrix = df_encoded[numeric_cols].corr()

# Visualize
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='RdYlGn', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix: Checking for Multicollinearity', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Identify high correlations (excluding target)
print("\n⚠️ Checking for Multicollinearity (|correlation| > 0.7):\n")
feature_cols = [col for col in numeric_cols if col != 'price_usd']
found_issues = False

for i, col1 in enumerate(feature_cols):
    for col2 in feature_cols[i+1:]:
        corr_val = corr_matrix.loc[col1, col2]
        if abs(corr_val) > 0.7:
            print(f"   ⚠️ {col1:20s} ↔ {col2:20s}: {corr_val:+.3f}")
            found_issues = True

if not found_issues:
    print("   ✅ No severe multicollinearity detected!")

# Show correlations with target
print("\n\n📊 Correlations with Target (price_usd):\n")
target_corr = corr_matrix['price_usd'].sort_values(ascending=False)
for feature, corr in target_corr.items():
    if feature != 'price_usd':
        strength = 'Strong' if abs(corr) > 0.7 else 'Moderate' if abs(corr) > 0.4 else 'Weak'
        print(f"   {feature:20s}: {corr:+.3f} ({strength})")

## 🏗️ Concept 3: Building Multiple Linear Regression Model

**Equation:**
$$\text{Price} = \theta_0 + \theta_1 \times \text{sqft} + \theta_2 \times \text{bedrooms} + ... + \theta_n \times \text{feature}_n$$

**Matrix Form:**
$$\hat{y} = X\theta$$
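
To make the matrix form concrete, the sketch below evaluates $\hat{y} = X\theta$ for two houses by hand. The parameter values in `theta` are hypothetical, chosen only for illustration. The key step is the column of ones prepended to `X`, which lets $\theta_0$ act as the intercept inside a single matrix product.

```python
import numpy as np

# Hypothetical parameters: intercept, $ per sqft, $ per bedroom
theta = np.array([50_000.0, 150.0, 20_000.0])

# Two houses described by (sqft, bedrooms)
features = np.array([
    [2000.0, 3.0],
    [3000.0, 4.0],
])

# Prepend a column of ones so theta[0] is added to every prediction
X = np.column_stack([np.ones(len(features)), features])

y_hat = X @ theta
print(y_hat)
```

Each row of `X` times `theta` reproduces the scalar equation: 50,000 + 150 × 2,000 + 20,000 × 3 = 410,000 for the first house and 580,000 for the second.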

In [None]:
# Prepare features and target
feature_columns = ['sqft', 'bedrooms', 'bathrooms', 'age_years', 'garage_spaces', 
                   'condition', 'has_pool_binary', 'location_Suburban', 'location_Urban']

X = df_encoded[feature_columns]
y = df_encoded['price_usd']

# Split into train/test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"📊 Dataset Split:")
print(f"   Training set: {X_train.shape[0]} houses ({X_train.shape[0]/len(df)*100:.0f}%)")
print(f"   Test set:     {X_test.shape[0]} houses ({X_test.shape[0]/len(df)*100:.0f}%)")
print(f"   Features:     {X_train.shape[1]}")

# Display feature matrix structure
print("\n\n🔍 Feature Matrix (X) - First 5 rows:")
print(X_train.head())

print("\n\n🎯 Target Vector (y) - First 5 values:")
print(y_train.head())

## 🧮 Method 1: Ordinary Least Squares (OLS)

**Formula:** $\theta = (X^T X)^{-1} X^T y$

**Characteristics:**
- ✅ Exact solution (closed-form)
- ✅ Fast for small/medium datasets
- ❌ Computationally expensive with many features (solving the normal equations scales roughly cubically with the number of features)
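
The closed-form solution can be reproduced with NumPy alone. The sketch below uses synthetic data with assumed "true" parameters and solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, then cross-checks against `np.linalg.lstsq`. It is an illustration of the formula, not a description of how scikit-learn computes its fit internally.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic features and assumed true parameters (intercept, slope 1, slope 2)
features = rng.uniform(0, 10, size=(n, 2))
true_theta = np.array([3.0, 2.0, -1.0])

X = np.column_stack([np.ones(n), features])      # design matrix with intercept column
y = X @ true_theta + rng.normal(0, 0.1, size=n)  # targets with a little noise

# Normal equations: solve (X^T X) theta = X^T y without computing the inverse
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, generally safer when X^T X is ill-conditioned
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Normal equations:", np.round(theta_normal, 3))
print("lstsq:           ", np.round(theta_lstsq, 3))
```

Both routes agree on well-conditioned data like this. They start to differ when features are nearly collinear, which is one reason multicollinearity matters for OLS.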

In [None]:
# Train model using OLS (default in sklearn)
model_ols = LinearRegression()
model_ols.fit(X_train, y_train)

# Extract parameters
intercept = model_ols.intercept_
coefficients = model_ols.coef_

print("🏗️ Multiple Linear Regression Model (OLS)\n")
print("="*70)
print(f"\n📐 Equation:\n")
print(f"   Price = {intercept:,.2f}")
for feature, coef in zip(feature_columns, coefficients):
    print(f"         {coef:+,.2f} × {feature}")

print(f"\n\n💡 Coefficient Interpretation (holding all other features constant):\n")
for feature, coef in zip(feature_columns, coefficients):
    if coef > 0:
        print(f"   ↗️ {feature:20s}: Each unit increases price by ${abs(coef):,.2f}")
    else:
        print(f"   ↘️ {feature:20s}: Each unit decreases price by ${abs(coef):,.2f}")

# Make predictions
y_pred_train_ols = model_ols.predict(X_train)
y_pred_test_ols = model_ols.predict(X_test)

# Evaluate
r2_train = r2_score(y_train, y_pred_train_ols)
r2_test = r2_score(y_test, y_pred_test_ols)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train_ols))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test_ols))
mae_train = mean_absolute_error(y_train, y_pred_train_ols)
mae_test = mean_absolute_error(y_test, y_pred_test_ols)

print(f"\n\n📊 Model Performance (OLS):\n")
print(f"   Training Set:")
print(f"     R²:   {r2_train:.4f} ({r2_train*100:.2f}% variance explained)")
print(f"     RMSE: ${rmse_train:,.2f}")
print(f"     MAE:  ${mae_train:,.2f}")
print(f"\n   Test Set:")
print(f"     R²:   {r2_test:.4f} ({r2_test*100:.2f}% variance explained)")
print(f"     RMSE: ${rmse_test:,.2f}")
print(f"     MAE:  ${mae_test:,.2f}")

## 🏃 Method 2: Gradient Descent

**Approach:** Iteratively update weights to minimize error

**Formula:** $\theta \leftarrow \theta - \alpha \nabla_\theta \text{MSE}(\theta)$

**Characteristics:**
- ✅ Scales to large datasets
- ✅ Avoids explicit matrix inversion, so it still runs when features are highly correlated (although convergence slows)
- ❌ Approximate solution
- ❌ Requires hyperparameter tuning (learning rate)
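
The update rule can be written out in a few lines of NumPy. The sketch below is plain batch gradient descent on MSE, not the stochastic variant that `SGDRegressor` uses. The function name `gradient_descent`, the step size, the iteration count and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, n_iters=500):
    """Batch gradient descent on MSE. X must already contain a ones column."""
    n = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = X @ theta - y
        grad = (2.0 / n) * (X.T @ residuals)  # gradient of MSE w.r.t. theta
        theta -= alpha * grad                 # theta <- theta - alpha * grad
    return theta


# Synthetic data loosely shaped like the lab's sqft and age_years columns
rng = np.random.default_rng(0)
n = 200
raw = np.column_stack([
    rng.uniform(1000, 4000, n),  # sqft
    rng.uniform(0, 50, n),       # age_years
])
y = 50_000 + 150 * raw[:, 0] - 2_000 * raw[:, 1] + rng.normal(0, 10_000, n)

# Standardize features so one step size suits every direction
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X = np.column_stack([np.ones(n), scaled])

theta_gd = gradient_descent(X, y)
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Gradient descent:", np.round(theta_gd, 2))
print("Least squares:   ", np.round(theta_exact, 2))
```

On standardized features the iterates settle on the least-squares solution. On the raw columns, whose squared values run into the millions, the same step size of 0.1 would overshoot and diverge, which is why scaling comes first.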

In [None]:
# Gradient Descent requires feature scaling!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model using SGD (Stochastic Gradient Descent)
model_gd = SGDRegressor(max_iter=1000, learning_rate='constant', eta0=0.01, random_state=42)
model_gd.fit(X_train_scaled, y_train)

# Extract parameters
intercept_gd = model_gd.intercept_[0]
coefficients_gd = model_gd.coef_

print("🏃 Multiple Linear Regression Model (Gradient Descent)\n")
print("="*70)
print(f"\n📐 Equation (on scaled features):\n")
print(f"   Price = {intercept_gd:,.2f}")
for feature, coef in zip(feature_columns, coefficients_gd):
    print(f"         {coef:+,.2f} × {feature}_scaled")

# Make predictions
y_pred_train_gd = model_gd.predict(X_train_scaled)
y_pred_test_gd = model_gd.predict(X_test_scaled)

# Evaluate
r2_train_gd = r2_score(y_train, y_pred_train_gd)
r2_test_gd = r2_score(y_test, y_pred_test_gd)
rmse_train_gd = np.sqrt(mean_squared_error(y_train, y_pred_train_gd))
rmse_test_gd = np.sqrt(mean_squared_error(y_test, y_pred_test_gd))
mae_train_gd = mean_absolute_error(y_train, y_pred_train_gd)
mae_test_gd = mean_absolute_error(y_test, y_pred_test_gd)

print(f"\n\n📊 Model Performance (Gradient Descent):\n")
print(f"   Training Set:")
print(f"     R²:   {r2_train_gd:.4f} ({r2_train_gd*100:.2f}% variance explained)")
print(f"     RMSE: ${rmse_train_gd:,.2f}")
print(f"     MAE:  ${mae_train_gd:,.2f}")
print(f"\n   Test Set:")
print(f"     R²:   {r2_test_gd:.4f} ({r2_test_gd*100:.2f}% variance explained)")
print(f"     RMSE: ${rmse_test_gd:,.2f}")
print(f"     MAE:  ${mae_test_gd:,.2f}")

## 🆚 Comparing OLS vs Gradient Descent

In [None]:
# Create comparison table
comparison = pd.DataFrame({
    'Metric': ['Training R²', 'Test R²', 'Training RMSE', 'Test RMSE', 'Training MAE', 'Test MAE'],
    'OLS': [
        f"{r2_train:.4f}",
        f"{r2_test:.4f}",
        f"${rmse_train:,.2f}",
        f"${rmse_test:,.2f}",
        f"${mae_train:,.2f}",
        f"${mae_test:,.2f}"
    ],
    'Gradient Descent': [
        f"{r2_train_gd:.4f}",
        f"{r2_test_gd:.4f}",
        f"${rmse_train_gd:,.2f}",
        f"${rmse_test_gd:,.2f}",
        f"${mae_train_gd:,.2f}",
        f"${mae_test_gd:,.2f}"
    ]
})

print("\n⚖️ OLS vs Gradient Descent Comparison:\n")
print(comparison.to_string(index=False))

# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# OLS
axes[0].scatter(y_test, y_pred_test_ols, alpha=0.6, s=50)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Price (USD)', fontsize=12)
axes[0].set_ylabel('Predicted Price (USD)', fontsize=12)
axes[0].set_title(f'OLS (R²={r2_test:.4f})', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Gradient Descent
axes[1].scatter(y_test, y_pred_test_gd, alpha=0.6, s=50, color='green')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', linewidth=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Price (USD)', fontsize=12)
axes[1].set_ylabel('Predicted Price (USD)', fontsize=12)
axes[1].set_title(f'Gradient Descent (R²={r2_test_gd:.4f})', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
print("   Both methods produce similar results on this dataset!")
print("   OLS is slightly better for small data; GD scales better to large data.")

## 📊 Visualizing 3D Regression Plane

For 2 features, regression creates a **plane** in 3D space.
For 3+ features, it creates a **hyperplane** (can't visualize directly).

In [None]:
# Build simple model with only 2 features for visualization
X_viz = df_encoded[['sqft', 'bedrooms']]
y_viz = df_encoded['price_usd']

model_viz = LinearRegression()
model_viz.fit(X_viz, y_viz)

# Create mesh grid
sqft_range = np.linspace(df_encoded['sqft'].min(), df_encoded['sqft'].max(), 20)
bedrooms_range = np.linspace(df_encoded['bedrooms'].min(), df_encoded['bedrooms'].max(), 20)
sqft_grid, bedrooms_grid = np.meshgrid(sqft_range, bedrooms_range)

# Predict prices for grid (a DataFrame keeps the feature names the model was fitted with)
X_grid = pd.DataFrame({'sqft': sqft_grid.ravel(), 'bedrooms': bedrooms_grid.ravel()})
price_grid = model_viz.predict(X_grid).reshape(sqft_grid.shape)

# 3D plot
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot of actual data
ax.scatter(df_encoded['sqft'], df_encoded['bedrooms'], df_encoded['price_usd'], 
           c='red', marker='o', s=30, alpha=0.6, label='Actual Houses')

# Surface plot of regression plane
ax.plot_surface(sqft_grid, bedrooms_grid, price_grid, alpha=0.3, cmap='viridis')

ax.set_xlabel('Square Feet', fontsize=12)
ax.set_ylabel('Bedrooms', fontsize=12)
ax.set_zlabel('Price (USD)', fontsize=12)
ax.set_title('3D Regression Plane: Price ~ SqFt + Bedrooms', fontsize=14, fontweight='bold')
ax.legend()

plt.tight_layout()
plt.show()

print("\n📐 Equation (2D plane):")
print(f"   Price = {model_viz.intercept_:,.2f} + {model_viz.coef_[0]:,.2f}×SqFt + {model_viz.coef_[1]:,.2f}×Bedrooms")
print("\n💡 With 3+ features, we get a HYPERPLANE (can't visualize, but same concept!)")

## 🔮 Concept 4: What-If Scenario Analysis
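
Because the model is linear, changing one feature by $\Delta x$ while holding the others fixed shifts the prediction by exactly $\theta_j \, \Delta x$, whatever the baseline house looks like. The sketch below checks this on a small synthetic dataset. The two-column `toy` frame and its generating coefficients are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with two features (illustrative, not the lab's dataset)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "sqft": rng.uniform(1000, 4000, 100),
    "condition": rng.uniform(4, 10, 100),
})
price = 50_000 + 150 * toy["sqft"] + 5_000 * toy["condition"] + rng.normal(0, 5_000, 100)

model = LinearRegression().fit(toy, price)

# What-if: raise condition from 6 to 9, keep sqft fixed
baseline = pd.DataFrame({"sqft": [2000.0], "condition": [6.0]})
renovated = baseline.copy()
renovated["condition"] = 9.0

delta_predicted = model.predict(renovated)[0] - model.predict(baseline)[0]
delta_from_coef = model.coef_[1] * (9.0 - 6.0)  # coef_[1] belongs to 'condition'

print(f"Difference of predictions: {delta_predicted:,.2f}")
print(f"Coefficient x change:      {delta_from_coef:,.2f}")
```

The two numbers match, so a what-if on a single feature is really a reading of that feature's coefficient. The scenario is only as realistic as the assumption that the other features can stay fixed.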

In [None]:
# Scenario: Renovating a house - what's the ROI?
print("🏡 What-If Scenario: House Renovation Analysis\n")
print("="*70)

# Baseline house
baseline = pd.DataFrame({
    'sqft': [2000],
    'bedrooms': [3],
    'bathrooms': [2],
    'age_years': [20],
    'garage_spaces': [1],
    'condition': [6],
    'has_pool_binary': [0],
    'location_Suburban': [1],
    'location_Urban': [0]
})

baseline_price = model_ols.predict(baseline)[0]
print(f"\n🏠 Baseline House:")
print(f"   2,000 sqft, 3 bed, 2 bath, 20 years old, 1 garage, condition=6")
print(f"   Suburban, No pool")
print(f"   Predicted Price: ${baseline_price:,.2f}")

# Scenario 1: Add a pool
with_pool = baseline.copy()
with_pool['has_pool_binary'] = 1
pool_price = model_ols.predict(with_pool)[0]
pool_roi = pool_price - baseline_price

print(f"\n\n💧 Scenario 1: Add Swimming Pool")
print(f"   New Price: ${pool_price:,.2f}")
print(f"   Price Increase: ${pool_roi:,.2f}")
print(f"   ROI: {(pool_roi/baseline_price)*100:.1f}%")

# Scenario 2: Renovate (improve condition from 6 to 9)
renovated = baseline.copy()
renovated['condition'] = 9
renovated_price = model_ols.predict(renovated)[0]
renovation_roi = renovated_price - baseline_price

print(f"\n\n🔨 Scenario 2: Full Renovation (condition 6→9)")
print(f"   New Price: ${renovated_price:,.2f}")
print(f"   Price Increase: ${renovation_roi:,.2f}")
print(f"   ROI: {(renovation_roi/baseline_price)*100:.1f}%")

# Scenario 3: Add garage space
with_garage = baseline.copy()
with_garage['garage_spaces'] = 2
garage_price = model_ols.predict(with_garage)[0]
garage_roi = garage_price - baseline_price

print(f"\n\n🚗 Scenario 3: Add Garage Space (1→2 cars)")
print(f"   New Price: ${garage_price:,.2f}")
print(f"   Price Increase: ${garage_roi:,.2f}")
print(f"   ROI: {(garage_roi/baseline_price)*100:.1f}%")

# Comparison
print(f"\n\n" + "="*70)
print(f"\n🏆 Best ROI: ", end="")
max_roi = max(pool_roi, renovation_roi, garage_roi)
if max_roi == pool_roi:
    print(f"Add Pool (${pool_roi:,.2f})")
elif max_roi == renovation_roi:
    print(f"Renovation (${renovation_roi:,.2f})")
else:
    print(f"Add Garage (${garage_roi:,.2f})")

## 📊 Feature Importance Analysis
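
Raw coefficients are measured in dollars per unit of each feature, so their magnitudes depend on the units: one extra bedroom is a much bigger step than one extra square foot. One common way to compare features on equal footing is to refit on standardized features, so each coefficient becomes dollars per one standard deviation. The sketch below shows the ranking flip on synthetic data. The `toy` frame and its generating coefficients are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative): sqft has a small per-unit effect but a wide range
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "sqft": rng.uniform(1000, 4000, 300),
    "bedrooms": rng.integers(2, 7, 300).astype(float),
})
price = 50_000 + 150 * toy["sqft"] + 20_000 * toy["bedrooms"] + rng.normal(0, 10_000, 300)

raw_model = LinearRegression().fit(toy, price)
std_model = LinearRegression().fit(StandardScaler().fit_transform(toy), price)

summary = pd.DataFrame({
    "feature": toy.columns,
    "raw_coef": raw_model.coef_,           # dollars per unit
    "standardized_coef": std_model.coef_,  # dollars per one standard deviation
})
print(summary.to_string(index=False))
```

By raw magnitude `bedrooms` looks far more important, yet per standard deviation `sqft` moves the price more. Which view is appropriate depends on the question being asked.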

In [None]:
# Create feature importance DataFrame
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': coefficients,
    'Abs_Coefficient': np.abs(coefficients)
}).sort_values('Abs_Coefficient', ascending=False)

print("\n📊 Feature Importance (by coefficient magnitude):\n")
print(feature_importance[['Feature', 'Coefficient']].to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
colors = ['green' if x > 0 else 'red' for x in feature_importance['Coefficient']]
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'], color=colors, alpha=0.7)
plt.xlabel('Coefficient Value ($ impact)', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance: Impact on House Price', fontsize=14, fontweight='bold')
plt.axvline(0, color='black', linewidth=1)
plt.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

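print(f"\n\n💡 Top 3 Most Important Features:")
# enumerate gives the rank; the DataFrame index still holds the pre-sort position
for rank, (_, row) in enumerate(feature_importance.head(3).iterrows(), start=1):
    print(f"   {rank}. {row['Feature']:20s}: ${abs(row['Coefficient']):,.2f} per unit")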

## 🎯 Practice Exercise 1: Predict Price for New House

In [None]:
# YOUR CODE HERE
# New house specs:
# - 2,500 sqft
# - 4 bedrooms
# - 2.5 bathrooms
# - 5 years old
# - 2 garage spaces
# - Condition: 8
# - Urban location
# - Has pool

new_house = pd.DataFrame({
    'sqft': [2500],
    'bedrooms': [4],
    'bathrooms': [2.5],
    'age_years': [5],
    'garage_spaces': [2],
    'condition': [8],
    'has_pool_binary': [1],
    'location_Suburban': [0],
    'location_Urban': [1]
})

predicted_price = model_ols.predict(new_house)[0]

print(f"\n🏠 New House Prediction:")
print(f"   Predicted Price: ${predicted_price:,.2f}")
print(f"   Recommended listing range: ${predicted_price*0.97:,.2f} - ${predicted_price*1.03:,.2f}")

## 🎯 Practice Exercise 2: Multicollinearity Challenge

**Task:** Add a feature that is highly correlated with existing columns (price per sqft, which is derived from both `price_usd` and `sqft`) and observe the impact on the coefficients.
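
A feature that nearly duplicates an existing predictor shows the textbook symptom of multicollinearity: the fitted coefficient swings widely from one sample to the next. The sketch below measures this with bootstrap resampling on synthetic data. The `sqft_m2` column (the same area in square metres with small measurement noise) and the helper `sqft_coef_spread` are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 200
sqft = rng.uniform(1000, 4000, n)
price = 50_000 + 150 * sqft + rng.normal(0, 40_000, n)

# Hypothetical near-duplicate of sqft: area in square metres plus small noise
sqft_m2 = sqft * 0.0929 + rng.normal(0, 2.0, n)


def sqft_coef_spread(X, y, n_boot=200):
    """Standard deviation of the first coefficient across bootstrap resamples."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # sample rows with replacement
        coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_[0])
    return np.std(coefs)


spread_alone = sqft_coef_spread(sqft.reshape(-1, 1), price)
spread_with_dup = sqft_coef_spread(np.column_stack([sqft, sqft_m2]), price)

print(f"Spread of sqft coefficient, sqft only:    {spread_alone:.2f}")
print(f"Spread of sqft coefficient, with sqft_m2: {spread_with_dup:.2f}")
```

The coefficient on `sqft` becomes far less stable once its near-copy enters the model, even though the predictions themselves barely change.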

In [None]:
# YOUR CODE HERE
# Create a new feature: price_per_sqft = price_usd / sqft
# Add it to the model and compare results

df_multicollinear = df_encoded.copy()
df_multicollinear['price_per_sqft'] = df_multicollinear['price_usd'] / df_multicollinear['sqft']

# Add to features
feature_columns_mc = feature_columns + ['price_per_sqft']
X_mc = df_multicollinear[feature_columns_mc]
y_mc = df_multicollinear['price_usd']

# Split and train
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X_mc, y_mc, test_size=0.2, random_state=42)
model_mc = LinearRegression()
model_mc.fit(X_train_mc, y_train_mc)

# Compare coefficients
print("\n⚠️ Multicollinearity Impact:\n")
comparison = pd.DataFrame({
    'Feature': feature_columns,
    'Original_Coef': coefficients,
    'With_Multicollinear_Coef': model_mc.coef_[:len(feature_columns)],
    'Change': model_mc.coef_[:len(feature_columns)] - coefficients
})
print(comparison.to_string(index=False))

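print(f"\n\n💡 Notice how the coefficients SHIFT when we add price_per_sqft!")
print(f"   price_per_sqft is computed from the target (price_usd / sqft), so it leaks the target into the features")
print(f"   and also depends on sqft. In a real project, never build features from the value you are predicting.")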

## 🎯 Practice Exercise 3: Identify Best Value Houses

**Task:** Find houses where actual price is significantly lower than predicted (good deals!).

In [None]:
# YOUR CODE HERE
# Calculate residuals and find top 10 underpriced houses

df_analysis = df_encoded.copy()
df_analysis['predicted_price'] = model_ols.predict(X)
df_analysis['residual'] = df_analysis['price_usd'] - df_analysis['predicted_price']
df_analysis['value_score'] = df_analysis['residual'] / df_analysis['predicted_price'] * 100

# Top 10 best deals (actual < predicted)
best_deals = df_analysis.nsmallest(10, 'residual')[[
    'house_id', 'sqft', 'bedrooms', 'location', 'price_usd', 'predicted_price', 'residual', 'value_score'
]]

print("\n🔥 Top 10 Best Value Houses (Underpriced):\n")
print(best_deals.to_string(index=False))

print(f"\n\n💡 Interpretation:")
print(f"   Negative residual = Actual price LOWER than predicted")
print(f"   These are potential great deals for buyers!")
print(f"   Sellers might be motivated or house needs minor repairs.")

## 📚 Key Concepts Summary

### What You Learned:

1. **✅ Multiple Linear Regression**
   - Uses 2+ features for better predictions
   - Equation: $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$
   - Matrix form: $\hat{y} = X\theta$

2. **✅ Handling Categorical Variables**
   - Binary encoding for 2 categories (0/1)
   - One-hot encoding for 3+ categories
   - Drop one category to avoid dummy variable trap

3. **✅ Multicollinearity**
   - When features are highly correlated
   - Causes unstable coefficients
   - Detect with correlation matrix and VIF
   - Solution: Remove redundant features

4. **✅ Estimation Methods**
   - **OLS:** Exact, fast for small data, matrix-based
   - **Gradient Descent:** Iterative, scales to large data, requires scaling

5. **✅ Visualization**
   - 2 features → plane (3D)
   - 3+ features → hyperplane (can't visualize)

6. **✅ What-If Scenarios**
   - Predict impact of changes
   - Calculate ROI for renovations
   - Must respect feature correlations

7. **✅ Feature Importance**
   - Coefficient magnitude shows impact per unit of that feature (units differ, so compare across features with care)
   - Positive = increases target
   - Negative = decreases target

---

## 🎉 Congratulations!

You've mastered **Multiple Linear Regression**!

**Next Steps:**
- ✅ Learn about polynomial regression (curved relationships)
- ✅ Explore regularization (Ridge, Lasso) to prevent overfitting
- ✅ Practice feature engineering
- ✅ Study logistic regression (classification)

---

## 📦 Library Versions

In [None]:
# Document versions for reproducibility
import sys
import sklearn

print("Library Versions:")
print(f"  Python: {sys.version}")
print(f"  NumPy: {np.__version__}")
print(f"  Pandas: {pd.__version__}")
print(f"  Matplotlib: {plt.matplotlib.__version__}")
print(f"  Seaborn: {sns.__version__}")
print(f"  Scikit-learn: {sklearn.__version__}")
print(f"\nRandom Seed: 42")