# 📘 California Housing Price Prediction - Linear Regression Assignment

This notebook implements the complete ML pipeline for both:
1. **Simple Linear Regression** (single feature)
2. **Multiple Linear Regression** (all features)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
df = pd.read_csv('housing.csv')

---

# 🧪 Task 1: Simple Linear Regression (Single Feature)

---

## Step 1️⃣: Data Retrieval and Collection

**Objective:** Load the California Housing dataset and display basic information.

In [None]:
# Display basic dataset information
print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nData Types:")
print(df.dtypes)
print("\nFirst 5 rows:")
df.head()

In [None]:
# Summary statistics
df.describe()

## Step 2️⃣: Data Cleaning

**Objective:** Handle missing values and verify data integrity.

In [None]:
# Check for missing values
print("Missing Values per Column:")
print(df.isnull().sum())
print("\nTotal missing values:", df.isnull().sum().sum())

In [None]:
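# Handle missing values in total_bedrooms (if any)
n_missing = df['total_bedrooms'].isnull().sum()
if n_missing > 0:
    print(f"Filling {n_missing} missing values in 'total_bedrooms' with median")
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which may not update df in recent pandas versions
    df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())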

# Drop rows with remaining missing values
df_clean = df.dropna()
print(f"\nDataset shape after cleaning: {df_clean.shape}")
print(f"Missing values after cleaning: {df_clean.isnull().sum().sum()}")

**Explanation:**
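- The `total_bedrooms` column typically contains missing values
- We fill them with the **median**, which is less sensitive to outliers than the mean
- Linear regression needs numeric inputs; any non-numeric column (for example `ocean_proximity`, if your copy of `housing.csv` includes it) must be excluded or encoded, which is why Task 2 selects numeric columns only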
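
To illustrate why the median is the more robust fill value, here is a minimal sketch on made-up numbers. The `bedrooms` list is hypothetical and is not taken from the dataset; it simply contains one extreme value.

```python
import statistics

# Hypothetical bedroom counts for six districts, one of them an extreme outlier
bedrooms = [280, 310, 350, 400, 435, 6445]

mean_value = statistics.mean(bedrooms)      # pulled upward by the outlier
median_value = statistics.median(bedrooms)  # stays near the typical district

print(f"Mean:   {mean_value:.1f}")    # Mean:   1370.0
print(f"Median: {median_value:.1f}")  # Median: 375.0
```

A single outlier moves the mean far away from every typical value, while the median barely reacts, so median imputation is less likely to insert unrealistic numbers.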

## Step 3️⃣: Feature Design

**Objective:** Select `housing_median_age` as the single input feature.

**Why `housing_median_age`?**
- Represents the median age of houses in a district
- Provides a simple, interpretable relationship with house prices
- Lets us check whether districts with older or newer housing differ in median house value

In [None]:
# Select single feature and target
X_simple = df_clean[['housing_median_age']].values
y = df_clean['median_house_value'].values

print("Feature (X) shape:", X_simple.shape)
print("Label (y) shape:", y.shape)
print("\nFeature statistics:")
print(f"  Min age: {X_simple.min():.2f}")
print(f"  Max age: {X_simple.max():.2f}")
print(f"  Mean age: {X_simple.mean():.2f}")

## Step 4️⃣: Algorithm Selection

**Algorithm:** Linear Regression

**Why Linear Regression?**
- Predicting a **continuous numerical value** (house price)
- Simple, interpretable, serves as a baseline model
- Assumes linear relationship: $y = mx + b$
  - $y$ = predicted house value
  - $x$ = housing median age
  - $m$ = slope (coefficient)
  - $b$ = y-intercept

## Step 5️⃣: Loss Function Selection

**Loss Function:** Mean Squared Error (MSE)

**What is MSE?**
- Measures the average squared difference between predicted and actual values
- Formula: $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
- **Lower MSE = Better performance**
- Squaring penalizes larger errors more heavily
- OLS training minimizes the sum of squared residuals, which is equivalent to minimizing MSE on the training set
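
The formula can be checked by hand. The sketch below uses a hypothetical helper, `mean_squared_error_manual`, and invented house values; it compares two prediction sets with the same total absolute error to show how squaring punishes one large miss.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared residuals, following the MSE formula."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

# Hypothetical house values in dollars
y_true = [200_000, 300_000, 400_000]
small_errors = [210_000, 290_000, 410_000]   # three misses of 10,000 each
one_big_error = [200_000, 300_000, 430_000]  # a single miss of 30,000

mse_small = mean_squared_error_manual(y_true, small_errors)
mse_big = mean_squared_error_manual(y_true, one_big_error)

print(f"{mse_small:,.0f}")  # 100,000,000
print(f"{mse_big:,.0f}")    # 300,000,000
```

Both prediction sets are off by 30,000 dollars in total, yet the one concentrated error triples the MSE.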

## Step 6️⃣: Model Learning (Training)

**Objective:** Split data and train the linear regression model.

In [None]:
# Split data (80% train, 20% test)
X_train_simple, X_test_simple, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

print("Training set size:", X_train_simple.shape[0])
print("Testing set size:", X_test_simple.shape[0])

# Create and train the model
model_simple = LinearRegression()
model_simple.fit(X_train_simple, y_train)

print("\n✓ Model training complete!")

**Learning Process:**
- Uses **Ordinary Least Squares (OLS)** to find best-fit line
- Minimizes sum of squared residuals
- Calculates optimal coefficient and intercept
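
For a single feature, the OLS solution has a closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept follows from the means. The sketch below is illustrative only; `fit_simple_ols` and the toy data are assumptions, not part of the notebook, and `LinearRegression` should reach an equivalent fit through a general least-squares solver.

```python
def fit_simple_ols(x, y):
    """Closed-form OLS for one feature: returns (slope, intercept)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Toy data lying exactly on the line value = 2000 * age + 130000
ages = [10, 20, 30, 40]
values = [150_000, 170_000, 190_000, 210_000]

slope, intercept = fit_simple_ols(ages, values)
print(slope, intercept)  # 2000.0 130000.0
```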

## Step 7️⃣: Model Evaluation

**Objective:** Evaluate model performance on test set.

In [None]:
# Make predictions
y_pred_simple = model_simple.predict(X_test_simple)

# Calculate metrics
mse_simple = mean_squared_error(y_test, y_pred_simple)
r2_simple = r2_score(y_test, y_pred_simple)
rmse_simple = np.sqrt(mse_simple)

print("="*60)
print("TASK 1: SIMPLE LINEAR REGRESSION - EVALUATION")
print("="*60)
print(f"Mean Squared Error (MSE):       {mse_simple:,.2f} (squared dollars)")
print(f"Root Mean Squared Error (RMSE): ${rmse_simple:,.2f}")
print(f"R² Score:                       {r2_simple:.4f}")
print("="*60)

**Interpretation:**
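- **MSE:** Average squared prediction error, in squared dollars (lower is better)
- **RMSE:** Square root of MSE, in dollars, so it reads as a typical prediction error
- **R² Score:** Proportion of variance explained; at most 1, higher is better, and it can fall below 0 on test data when the model does worse than always predicting the mean
- Since we use only ONE feature, expect limited performance: age alone cannot account for income, location, or size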
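
The R² score compares the model's squared error with that of a baseline that always predicts the mean of the actual values. The sketch below uses a hypothetical helper, `r2_manual`, on toy numbers to show the three reference cases: a perfect fit, the mean baseline, and a model worse than the baseline.

```python
def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

y_true = [100, 200, 300]

r2_perfect = r2_manual(y_true, [100, 200, 300])   # exact predictions
r2_baseline = r2_manual(y_true, [200, 200, 200])  # always predict the mean
r2_worse = r2_manual(y_true, [300, 200, 100])     # trend predicted backwards

print(r2_perfect, r2_baseline, r2_worse)  # 1.0 0.0 -3.0
```

A negative value therefore does not signal a calculation error; it means the model is less accurate than the mean baseline on that data.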

---

## 📈 Model Interpretation - Task 1

In [None]:
# Extract parameters
coefficient = model_simple.coef_[0]
intercept = model_simple.intercept_

print("="*60)
print("MODEL PARAMETERS")
print("="*60)
print(f"Coefficient (Slope): {coefficient:,.2f}")
print(f"Intercept:           ${intercept:,.2f}")
print("="*60)
print(f"\nModel Equation:")
print(f"House Value = {coefficient:,.2f} × Age + ${intercept:,.2f}")

**What does the Coefficient represent?**
- Change in house value for each additional year of age
- Positive → house value increases with age
- Negative → house value decreases with age (depreciation)

**What does the Intercept mean?**
- Predicted house value when age = 0 (newly built)
- Baseline house value in the model

## 📊 Visualization - Task 1

In [None]:
# Create visualizations
plt.figure(figsize=(14, 5))

# Plot 1: Regression Line
plt.subplot(1, 2, 1)
plt.scatter(X_test_simple, y_test, alpha=0.5, label='Actual Prices', color='blue', s=20)
plt.plot(X_test_simple, y_pred_simple, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Housing Median Age (years)', fontsize=12)
plt.ylabel('Median House Value ($)', fontsize=12)
plt.title('Simple Linear Regression: Age vs House Price', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Predicted vs Actual
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_simple, alpha=0.5, color='green', s=20)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
         'r--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Actual House Value ($)', fontsize=12)
plt.ylabel('Predicted House Value ($)', fontsize=12)
plt.title('Predicted vs Actual Values', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
---

# 🧪 Task 2: Multiple Linear Regression (All Features)

---

## Step 1️⃣ & 2️⃣: Data Retrieval and Cleaning

**Using the same cleaned dataset from Task 1.**

In [None]:
print("Reusing cleaned dataset from Task 1")
print(f"Dataset shape: {df_clean.shape}")
print(f"Missing values: {df_clean.isnull().sum().sum()}")

## Step 3️⃣: Feature Design

**Objective:** Use ALL available features (except target) to predict house prices.

**Why use multiple features?**
- House prices are influenced by many factors, not just age
- Multiple features let the model account for several of these factors at once
- Expected to improve prediction accuracy over the single-feature model
- Features such as income, location, and number of rooms are likely to be important predictors

In [None]:
# Select all numeric features except target
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
if 'median_house_value' in numeric_cols:
    numeric_cols.remove('median_house_value')

print(f"Number of features: {len(numeric_cols)}")
print(f"Feature names: {numeric_cols}")

# Create feature matrix
X_multiple = df_clean[numeric_cols].values
y = df_clean['median_house_value'].values

print(f"\nFeature matrix shape: {X_multiple.shape}")
print(f"Target vector shape: {y.shape}")

**Note on Feature Scaling:**
- Features have different scales (e.g., `total_rooms` vs `median_income`)
- OLS predictions do not depend on feature scale, but the magnitude of each coefficient depends on its feature's units
- Proceeding without scaling for simplicity
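
The following sketch illustrates that point on synthetic data: changing the unit of one feature rescales its coefficient by the inverse factor and leaves the predictions unchanged. Everything here is an assumption made for illustration: the data are random, `fit_predict` is a hypothetical helper, and NumPy's `np.linalg.lstsq` stands in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.1, size=50)

def fit_predict(X, y):
    """Fit OLS with an intercept; return (coefficients, fitted values)."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, A @ beta

beta_raw, pred_raw = fit_predict(X, y)

# Express the first feature in a unit that is 1000 times smaller
X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0
beta_scaled, pred_scaled = fit_predict(X_scaled, y)

print("Same predictions:", np.allclose(pred_raw, pred_scaled))
print("Coefficient ratio:", beta_raw[1] / beta_scaled[1])
```

The ratio printed on the last line should be close to 1000, mirroring the change of unit.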

## Step 4️⃣: Algorithm Selection

**Algorithm:** Multiple Linear Regression

**Why?**
- Same as Task 1: predicting continuous values
- Now using multiple input features
- Model: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

## Step 5️⃣: Loss Function Selection

**Loss Function:** Mean Squared Error (MSE)

Same as Task 1 - minimized across all features during training.

## Step 6️⃣: Model Learning (Training)

In [None]:
# Split data (same random_state for fair comparison)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multiple, y, test_size=0.2, random_state=42
)

print("Training set size:", X_train_multi.shape[0])
print("Testing set size:", X_test_multi.shape[0])
print(f"Number of features: {X_train_multi.shape[1]}")

# Train the model
model_multiple = LinearRegression()
model_multiple.fit(X_train_multi, y_train_multi)

print("\n✓ Multiple Linear Regression training complete!")

**Learning Process:**
- Uses OLS to find optimal coefficients
- Solves a least-squares problem that minimizes MSE on the training set, fitting all coefficients jointly
- Each feature gets its own coefficient
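
In matrix form, the OLS solution can be written as shown below. This assumes the feature matrix $X$ carries a leading column of ones (so that $\beta_0$ is part of $\beta$), that $x_i^\top$ denotes its $i$-th row, and that $X^\top X$ is invertible. It is the textbook normal-equation form; libraries generally use a numerically more stable least-squares solver instead of forming the inverse explicitly.

$$
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 = (X^\top X)^{-1} X^\top y
$$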

## Step 7️⃣: Model Evaluation

In [None]:
# Make predictions
y_pred_multi = model_multiple.predict(X_test_multi)

# Calculate metrics
mse_multi = mean_squared_error(y_test_multi, y_pred_multi)
r2_multi = r2_score(y_test_multi, y_pred_multi)
rmse_multi = np.sqrt(mse_multi)

print("="*60)
print("TASK 2: MULTIPLE LINEAR REGRESSION - EVALUATION")
print("="*60)
print(f"Mean Squared Error (MSE):       {mse_multi:,.2f} (squared dollars)")
print(f"Root Mean Squared Error (RMSE): ${rmse_multi:,.2f}")
print(f"R² Score:                       {r2_multi:.4f}")
print("="*60)

**Interpretation:**
- **MSE/RMSE:** Expected to be clearly lower than in Task 1
- **R² Score:** Expected to be much higher (closer to 1)
- Multiple features account for more of the variation in price

---

## 📈 Model Interpretation - Task 2

In [None]:
# Extract parameters
coefficients = model_multiple.coef_
intercept_multi = model_multiple.intercept_

print("="*60)
print("MODEL PARAMETERS - MULTIPLE LINEAR REGRESSION")
print("="*60)
print(f"Intercept: ${intercept_multi:,.2f}\n")
print("Coefficients:")
print("-" * 60)

# Create DataFrame for better visualization
coef_df = pd.DataFrame({
    'Feature': numeric_cols,
    'Coefficient': coefficients
}).sort_values('Coefficient', ascending=False)

for idx, row in coef_df.iterrows():
    print(f"  {row['Feature']:<25} {row['Coefficient']:>15,.2f}")
    
print("="*60)

**What does each coefficient represent?**

Each coefficient shows the **change in house value** for a **one-unit increase** in that feature, **holding all other features constant**.

**Interpretation:**
- **Positive coefficients:** Feature increase → higher predicted house value
- **Negative coefficients:** Feature increase → lower predicted house value
- **Magnitude:** Depends on the feature's units, so a larger absolute value signals a stronger influence only when features share a common scale

**Important:**
- Coefficients are NOT directly comparable because the features have different scales
- Features with large numeric ranges tend to get small coefficients
- For a fair comparison, standardize the features before fitting
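
A minimal sketch of that standardization step, on synthetic data: each feature is rescaled to zero mean and unit variance before fitting, so every coefficient measures the effect of a one-standard-deviation change. The feature names, the generating coefficients, and the `ols_coefs` helper are all hypothetical, and `np.linalg.lstsq` is used in place of `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
rooms = rng.uniform(500, 5000, size=n)  # large numeric range
income = rng.uniform(1, 10, size=n)     # small numeric range
y = 60.0 * rooms + 20_000.0 * income + rng.normal(0, 1_000, size=n)

X = np.column_stack([rooms, income])

def ols_coefs(X, y):
    """Fit OLS with an intercept and return only the feature coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

raw = ols_coefs(X, y)

# Standardize: subtract each column's mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols_coefs(X_std, y)

print("Raw coefficients:         ", raw)
print("Standardized coefficients:", standardized)
```

In this toy setup the raw coefficient of `income` is far larger than that of `rooms`, yet after standardization `rooms` has the larger coefficient, because its values vary over a much wider range.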

## 📊 Visualization - Task 2

In [None]:
# Visualize predicted vs actual
plt.figure(figsize=(14, 5))

# Plot 1: Predicted vs Actual
plt.subplot(1, 2, 1)
plt.scatter(y_test_multi, y_pred_multi, alpha=0.5, color='purple', s=20)
plt.plot([y_test_multi.min(), y_test_multi.max()], 
         [y_test_multi.min(), y_test_multi.max()], 
         'r--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Actual House Value ($)', fontsize=12)
plt.ylabel('Predicted House Value ($)', fontsize=12)
plt.title('Multiple Regression: Predicted vs Actual', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Feature Coefficients
plt.subplot(1, 2, 2)
colors = ['green' if x > 0 else 'red' for x in coef_df['Coefficient']]
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors, alpha=0.7)
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Coefficients (Unscaled)', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

---
---

# 🔄 Model Comparison: Task 1 vs Task 2

---

In [None]:
# Create comparison table
comparison = pd.DataFrame({
    'Metric': ['MSE', 'RMSE', 'R² Score', 'Number of Features'],
    'Simple Linear Regression': [
        f"{mse_simple:,.2f}",   # squared dollars, so no $ prefix
        f"${rmse_simple:,.2f}",
        f"{r2_simple:.4f}",
        "1"
    ],
    'Multiple Linear Regression': [
        f"{mse_multi:,.2f}",    # squared dollars, so no $ prefix
        f"${rmse_multi:,.2f}",
        f"{r2_multi:.4f}",
        str(len(numeric_cols))
    ]
})

print("="*80)
print("MODEL COMPARISON")
print("="*80)
print(comparison.to_string(index=False))
print("="*80)

# Calculate improvements
mse_improvement = ((mse_simple - mse_multi) / mse_simple) * 100
# Report the absolute R² gain: a percentage change is unstable when the
# baseline R² is close to zero (or negative)
r2_gain = r2_multi - r2_simple

print(f"\nIMPROVEMENTS:")
print(f"  MSE reduced by: {mse_improvement:.2f}%")
print(f"  R² increased by: {r2_gain:.4f} (absolute)")

## 📝 Analysis Summary

### Which model performs better?
**Multiple Linear Regression (Task 2)** performs significantly better:
- **Lower MSE/RMSE:** More accurate predictions
- **Higher R² Score:** Explains more variance in house prices

### Why does using multiple features help?
1. **Captures complexity:** House prices depend on many factors (income, location, size, age)
2. **Reduces unexplained variance:** Single feature leaves much information unused
3. **Better fit on unseen data:** The extra features carry real signal, so test-set error drops, although adding features does not guarantee this in general

### Which model is easier to interpret?
**Simple Linear Regression (Task 1)** is easier to interpret:
- **One relationship:** Clear understanding of age vs price
- **Simple visualization:** Easy to plot and explain
- **Single coefficient:** Straightforward meaning

**Multiple Linear Regression** is more powerful but:
- **Many coefficients:** Harder to understand individual effects
- **Multicollinearity:** Correlated features (e.g., `total_rooms` and `total_bedrooms`) make individual coefficients harder to interpret
- **Scaling issues:** Coefficients not directly comparable

### Conclusion:
**Trade-off between interpretability and performance:**
- Use **simple** regression for understanding individual relationships
- Use **multiple** regression for accurate predictions and capturing real-world complexity

---

## ✅ Assignment Complete!