# 🏠 House Price Detective: Your First ML Project

**Welcome!** You're about to build something real - a system that predicts house prices.

---

## 📖 The Story

Imagine you're a real estate agent's assistant. Your boss asks:

> *"We have a new house: 1,800 sqft, 3 bedrooms, 11 years old. What should we price it at?"*

Instead of guessing, you'll build a **smart system** that learns from past house sales to predict prices.

**Today you'll learn:**
- How to find patterns in data
- How to make predictions
- How to know if your predictions are good

No scary math. No complicated jargon. Just you, data, and problem-solving! 🎯

---

## ⏰ Time Expectation

This will take about **2 hours** if you're new to this. That's okay!

**How to use this notebook:**
1. Read each section carefully
2. Run the example code
3. Try the exercises
4. Don't skip ahead - each part builds on the previous one

---

## 🎯 What You'll Build

By the end, you'll have:
- ✅ A working house price predictor
- ✅ Understanding of how ML actually works
- ✅ Code you wrote yourself (no copy-paste!)
- ✅ Confidence to tackle more ML problems

---

**Ready? Let's start! 🚀**

*💡 Tip: Click the ▶️ button on the left of each code cell to run it*

---

# 🔧 Step 0: Getting Ready

Before we start, we need some tools. Think of this like opening your toolbox before fixing something.

**What we're loading:**
- `numpy` - helps us do math with lots of numbers at once
- `matplotlib` - helps us draw charts and graphs

**Just run the cell below** (click the ▶️ button). You'll see a green checkmark when it's done!

In [None]:
# Import our tools
import numpy as np
import matplotlib.pyplot as plt

# This makes sure we get the same results every time
np.random.seed(42)

print("✅ Tools loaded successfully!")
print("\nYou're ready to start!")

---

# 📊 Step 1: Meet Your Data

## The Scenario

Your boss gives you records of **10 houses** that were recently sold. For each house, you know:
- **Size** (how many square feet)
- **Bedrooms** (how many bedrooms)
- **Age** (how old the house is)
- **Price** (what it sold for, in thousands of dollars)

Your job: Use these past sales to predict prices for **new houses**.

Let's look at the data!

In [None]:
# Our past house sales (10 houses)
sizes = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
bedrooms = np.array([3, 3, 3, 4, 2, 3, 4, 5, 3, 3])
ages = np.array([15, 10, 12, 8, 20, 14, 5, 3, 18, 9])
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

print("📋 Our Past House Sales:\n")
print("House | Size (sqft) | Bedrooms | Age (years) | Price ($1000s)")
print("-" * 65)
for i in range(len(sizes)):
    print(f"  {i+1}   |    {sizes[i]:4d}    |    {bedrooms[i]}     |     {ages[i]:2d}      |    ${prices[i]}k")

print(f"\n✅ We have data from {len(sizes)} house sales!")

## 🤔 Quick Observation

Look at the table above. Do you notice any patterns?

- **Bigger houses** tend to cost more (Houses 7 & 8 are the two largest and the two most expensive)
- **Older houses** might cost less (House 5 is oldest and cheapest)
- **More bedrooms** might mean higher price

**Your brain just did machine learning!** You looked at data and found patterns.

Now, let's teach the computer to do the same thing! 🧠➡️💻

## 📈 Let's Visualize!

Numbers in a table are boring. Let's make a picture!

We'll create a **scatter plot** - each dot is one house.

In [None]:
# Create a scatter plot: Size vs Price
plt.figure(figsize=(10, 6))
plt.scatter(sizes, prices, color='blue', s=100, alpha=0.6)
plt.xlabel('Size (square feet)', fontsize=12)
plt.ylabel('Price ($1000s)', fontsize=12)
plt.title('House Size vs Price', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

print("\n👀 Look at the picture:")
print("- Each blue dot is one house")
print("- Dots go UP and to the RIGHT")
print("- This means: bigger house = higher price!")
print("\n💡 Can you imagine drawing a line through these dots?")
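
If you'd like to put a number on "up and to the right", one option is the correlation coefficient. The sketch below uses `np.corrcoef`, which the rest of this notebook doesn't rely on, and repeats the data so it runs on its own. Values near +1 mean the dots climb together; values near 0 mean no clear straight-line trend.

```python
import numpy as np

# Same 10 houses as above, repeated so this cell is self-contained
sizes = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between size and price (always between -1 and +1)
correlation = np.corrcoef(sizes, prices)[0, 1]

print(f"Correlation between size and price: {correlation:.2f}")
```

A clearly positive value backs up what your eyes already saw in the scatter plot.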

---

# 🎯 Step 2: Drawing a Line (The Simple Way)

## Why a Line?

Look at those dots. They're not perfectly straight, but they follow a **general upward trend**.

If we draw a line through them, we can:
1. See the pattern clearly
2. Predict prices for NEW houses

## The Line Equation

Remember from school: `y = mx + b`

For houses:
```
Price = m × Size + b
```

Where:
- **m** (slope) = how much price increases per square foot
- **b** (intercept) = starting price (the value of the line if the house were 0 sqft)

**Our mission:** Find the BEST values for m and b!

## 🧮 The Secret Formula

Mathematicians figured out the BEST way to find m and b:

**For slope (m):**
```
m = sum of [(size - average_size) × (price - average_price)]
    ÷ sum of [(size - average_size)²]
```

**For intercept (b):**
```
b = average_price - (m × average_size)
```

Don't worry if this looks scary! We'll code it step by step.

**Think of it like this:**
- We're finding how Size and Price move together
- Then we're making a line that fits the pattern best

## 🎓 Your First Exercise: Calculate the Line

Let's code the formula! I'll guide you through each step.

**Step by step:**
1. Calculate the average size
2. Calculate the average price
3. Use the formula to find m (slope)
4. Use the formula to find b (intercept)

**The cell below shows the completed formulas. Try writing each line yourself before you run it!**

In [None]:
# Step 1: Calculate averages
print("Step 1: Finding averages...")
average_size = np.mean(sizes)  # This calculates the average
average_price = np.mean(prices)

print(f"Average house size: {average_size:.0f} square feet")
print(f"Average price: ${average_price:.0f}k")
print()

# Step 2: Calculate how much each house differs from average
print("Step 2: Calculating differences...")
size_differences = sizes - average_size
price_differences = prices - average_price
print("✓ Got the differences")
print()

# Step 3: Calculate slope (m)
print("Step 3: Calculating slope (m)...")
# Formula: multiply the differences, sum them up,
# then divide by the sum of the squared size_differences
numerator = np.sum(size_differences * price_differences)
denominator = np.sum(size_differences ** 2)
m = numerator / denominator

print(f"Slope (m) = {m:.4f}")
print(f"This means: For every extra square foot, price goes up by ${m:.4f}k")
print()

# Step 4: Calculate intercept (b)
print("Step 4: Calculating intercept (b)...")
# Formula: b = average_price - (m × average_size)
b = average_price - (m * average_size)

print(f"Intercept (b) = {b:.2f}")
print()

print("="*50)
print(f"\n🎉 YOUR LINE EQUATION:")
print(f"Price = {m:.4f} × Size + {b:.2f}")
print(f"\nIn plain English:")
print(f"Price = ${m*1000:.2f} per square foot + ${b*1000:.0f} base")
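
A quick way to double-check your hand-coded formula is to compare it with NumPy's built-in line fit. The sketch below assumes `np.polyfit` with `deg=1` (a degree-1 polynomial is a straight line), which returns the slope first and the intercept second. The names `m_manual`, `b_manual`, `m_check`, and `b_check` are just illustrative.

```python
import numpy as np

# Same 10 houses as above, repeated so this cell is self-contained
sizes = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# The formula from this step, written compactly
size_diff = sizes - sizes.mean()
price_diff = prices - prices.mean()
m_manual = np.sum(size_diff * price_diff) / np.sum(size_diff ** 2)
b_manual = prices.mean() - m_manual * sizes.mean()

# NumPy's least-squares line fit: returns [slope, intercept] for deg=1
m_check, b_check = np.polyfit(sizes, prices, deg=1)

print(f"By hand:    m = {m_manual:.4f}, b = {b_manual:.2f}")
print(f"np.polyfit: m = {m_check:.4f}, b = {b_check:.2f}")
print("Match:", np.allclose([m_manual, b_manual], [m_check, b_check]))
```

If the two rows agree, your formula is doing the same job as the library function.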

## 🤔 What Does This Mean?

Your equation tells a story:

- **The slope (m)** ≈ 0.11 means: Each extra square foot adds about $110 to the price
- **The intercept (b)** ≈ 98 means: The line starts at about $98k, a base value even for a tiny house

**This makes sense!** 
- Bigger houses cost more ✅
- There's always a minimum price ✅

Let's see how well your line fits the data!

In [None]:
# Let's draw your line on the graph!
plt.figure(figsize=(10, 6))

# Plot the actual houses (blue dots)
plt.scatter(sizes, prices, color='blue', s=100, alpha=0.6, label='Actual Houses')

# Draw your line (red line)
line_x = np.linspace(sizes.min(), sizes.max(), 100)
line_y = m * line_x + b
plt.plot(line_x, line_y, color='red', linewidth=3, label='Your Prediction Line')

plt.xlabel('Size (square feet)', fontsize=12)
plt.ylabel('Price ($1000s)', fontsize=12)
plt.title('Your Line Fits the Data!', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print("\n👀 Look at your line:")
print("- It goes through the MIDDLE of the dots")
print("- Some dots are above it, some below")
print("- That's okay! No line is perfect.")
print("\n✨ You just did Machine Learning!")

---

# 🔮 Step 3: Make Your First Prediction!

## The Moment of Truth

Remember the question your boss asked?

> *"We have a new house: 1,800 sqft. What should we price it at?"*

Now you can answer! Just plug the size into your equation:

```
Price = m × Size + b
Price = m × 1800 + b
```

Let's calculate it!

In [None]:
# Make a prediction for a 1,800 sqft house
new_house_size = 1800

# Use your equation to predict the price:
# predicted_price = m * new_house_size + b
predicted_price = m * new_house_size + b

print("🏠 New House Information:")
print(f"   Size: {new_house_size} square feet")
print()
print("🔮 Your Prediction:")
print(f"   Price: ${predicted_price:.2f}k")
print(f"   That's ${predicted_price * 1000:.0f}")
print()
print("💡 Does this seem reasonable?")
print(f"   Looking at our data:")
print(f"   - Smallest house (1100 sqft) costs $199k")
print(f"   - Largest house (2450 sqft) costs $324k")
print(f"   - Your prediction is in between! ✅")

---

# 📊 Step 4: How Good Is Your Prediction?

## The Honesty Check

You made a prediction. But is it **good**?

Let's check by looking at our 10 houses where we KNOW the actual prices.

For each house:
1. Use your line to predict the price
2. Compare with the actual price
3. Calculate the error (difference)

**Error = Actual Price - Predicted Price**

Small errors = Good! 🎯  
Big errors = Need improvement 🤔

In [None]:
# Predict prices for all our training houses
predicted_prices = m * sizes + b

# Calculate errors
errors = prices - predicted_prices

print("📊 Prediction Quality Check:\n")
print("House | Actual | Predicted | Error")
print("-" * 45)
for i in range(len(sizes)):
    error_symbol = "✅" if abs(errors[i]) < 20 else "⚠️"
    print(f"  {i+1}   | ${prices[i]:3d}k  |  ${predicted_prices[i]:6.1f}k  | {errors[i]:+6.1f}k {error_symbol}")

print("\n💡 Understanding Errors:")
print(f"   Positive error = We predicted too LOW")
print(f"   Negative error = We predicted too HIGH")
print(f"\n   Average error: ${np.mean(np.abs(errors)):.1f}k")
print(f"   (On average, we're off by ${np.mean(np.abs(errors))*1000:.0f})")

## 📐 Measuring Overall Quality: R² Score

There's a special number called **R² (R-squared)** that tells us how good our line is.

**R² usually goes from 0 to 1:**
- R² = 1.0 means PERFECT predictions (every dot on the line)
- R² = 0.8 means pretty good! (80% of the price variation explained)
- R² = 0.5 means okay, but could be better
- R² = 0 means the line is no better than always guessing the average price

(On brand-new data, R² can even drop below 0 if a model does worse than guessing the average.)

Let's calculate yours!

In [None]:
# Calculate R² score
# This measures: "How much of the price variation does size explain?"

# Total variation in prices
total_variation = np.sum((prices - np.mean(prices)) ** 2)

# Unexplained variation (our errors)
unexplained_variation = np.sum(errors ** 2)

# R² = 1 - (unexplained / total)
r_squared = 1 - (unexplained_variation / total_variation)

print("🎯 Your Model's Performance:\n")
print(f"   R² Score: {r_squared:.4f}")
print(f"   That's {r_squared*100:.1f}%")
print()

if r_squared > 0.8:
    print("   🌟 EXCELLENT! Your model is really good!")
    print("   Size is a strong predictor of price.")
elif r_squared > 0.6:
    print("   ✅ GOOD! Your model works well.")
    print("   Size helps predict price, but there's room to improve.")
else:
    print("   📚 LEARNING! Your model is okay.")
    print("   We might need more features (like bedrooms, age).")

print(f"\n💡 What this means:")
print(f"   {r_squared*100:.0f}% of price differences can be explained by size alone.")
print(f"   The other {(1-r_squared)*100:.0f}% comes from other factors (bedrooms, age, location, etc.)")
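
To get a feel for the R² scale, it can help to score a few deliberately simple "models". The sketch below is only an illustration: `r_squared_of` is a hypothetical helper (not part of NumPy), and the three guesses are made up for the comparison.

```python
import numpy as np

# Same prices as above, repeated so this cell is self-contained
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

def r_squared_of(actual, predicted):
    """R² = 1 - (unexplained variation / total variation)."""
    total = np.sum((actual - actual.mean()) ** 2)
    unexplained = np.sum((actual - predicted) ** 2)
    return 1 - unexplained / total

# Guess 1: predictions that match every actual price exactly
r2_perfect = r_squared_of(prices, prices.astype(float))

# Guess 2: always predict the average price
r2_average = r_squared_of(prices, np.full(len(prices), prices.mean()))

# Guess 3: always predict $100k, which is worse than guessing the average
r2_bad = r_squared_of(prices, np.full(len(prices), 100.0))

print(f"Perfect predictions:   R² = {r2_perfect:.2f}")
print(f"Always the average:    R² = {r2_average:.2f}")
print(f"Always $100k:          R² = {r2_bad:.2f}")
```

The last case shows that a poor enough guess gives a negative R², which is why 0 is better read as "no better than the average" than as the lowest possible score.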

---

# 🎯 Step 5: Can We Do Better? (Adding More Clues)

## The Problem

Right now we're only using **Size** to predict price.

But we have more information!
- Number of bedrooms
- Age of the house

**Idea:** What if we use ALL the information?

Instead of:
```
Price = m × Size + b
```

We'll use:
```
Price = w1×Size + w2×Bedrooms + w3×Age + w0
```

More information → often better predictions! 🎯

## 🧮 The Math (Don't Panic!)

When we have multiple features, we use **matrix math**.

**Think of it like this:**
- We organize all our data into a big table
- We use a special formula (Normal Equation) to find ALL the weights at once
- It's like our simple formula, but for multiple features

**The good news:** Python does the heavy lifting! We just need to set it up right.

Let's do it step by step...

## 📋 Step 5.1: Organize Your Data

We need to arrange our data in a special way called a **feature matrix**.

Think of it like a spreadsheet:

| Bias | Size | Bedrooms | Age |
|------|------|----------|-----|
| 1    | 1400 | 3        | 15  |
| 1    | 1600 | 3        | 10  |
| ...  | ...  | ...      | ... |

**Note:** The "Bias" column of 1's is for the intercept (like the 'b' in y=mx+b)

Let's create this!

In [None]:
# Create the feature matrix
print("🔧 Building feature matrix...\n")

# Create a column of ones (for the bias/intercept)
bias_column = np.ones(len(sizes))

# Stack all features together
# np.column_stack puts arrays side by side
X = np.column_stack([bias_column, sizes, bedrooms, ages])

# Our target (what we're predicting)
y = prices

print("Feature Matrix (X):")
print("   Shape:", X.shape, "(10 houses, 4 columns: bias + 3 features)")
print("\n   First 3 houses:")
print("   Bias | Size | Beds | Age")
print("   " + "-" * 24)
for i in range(3):
    print(f"    {X[i,0]:.0f}  | {X[i,1]:.0f} | {X[i,2]:.0f}    | {X[i,3]:.0f}")

print("\n✅ Feature matrix ready!")
print("   Now we can find the best weights for ALL features at once!")

## 🎯 Step 5.2: Find the Best Weights

We'll use the **Normal Equation** - a mathematical way to find the perfect weights.

**The formula:**
```
weights = (X^T × X)^(-1) × X^T × y
```

Where:
- `X^T` means "transpose" (flip rows and columns)
- `^(-1)` means "inverse" (like division for matrices)
- `×` means matrix multiplication

**Don't memorize this!** Just understand: it finds the best weights in one calculation.

Let's code it:

In [None]:
# Calculate the best weights using Normal Equation
print("🧮 Calculating optimal weights...\n")

# Step 1: Transpose X (flip it)
X_transpose = X.T

# Step 2: Calculate X^T × X
XtX = X_transpose @ X  # @ means matrix multiplication

# Step 3: Calculate inverse of (X^T × X)
XtX_inverse = np.linalg.inv(XtX)

# Step 4: Calculate X^T × y
Xty = X_transpose @ y

# Step 5: Final calculation
weights = XtX_inverse @ Xty

print("✅ Found the best weights!\n")
print("Your Prediction Formula:")
print("="*60)
print(f"Price = {weights[0]:.2f}")
print(f"      + {weights[1]:.4f} × Size")
print(f"      + {weights[2]:.2f} × Bedrooms")
print(f"      + {weights[3]:.2f} × Age")
print("="*60)

print("\n💡 What each weight means:")
print(f"   {weights[0]:.2f} = Base price")
print(f"   {weights[1]:.4f} = Price increase per square foot (${weights[1]*1000:.2f})")
print(f"   {weights[2]:.2f} = Price change per bedroom (${weights[2]*1000:.0f})")
print(f"   {weights[3]:.2f} = Price change per year of age (${weights[3]*1000:.0f})")

if weights[3] < 0:
    print("\n   🏚️ Notice: Age has a NEGATIVE weight!")
    print("   This makes sense: older houses typically cost less.")
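
The Normal Equation is great for learning, but explicitly inverting a matrix can lose accuracy when features are on very different scales (sizes in the thousands, bedrooms in single digits). As a cross-check, the sketch below solves the same least-squares problem with `np.linalg.lstsq`, which avoids the explicit inverse. The names `weights_normal` and `weights_lstsq` are just illustrative.

```python
import numpy as np

# Same 10 houses as above, repeated so this cell is self-contained
sizes = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
bedrooms = np.array([3, 3, 3, 4, 2, 3, 4, 5, 3, 3])
ages = np.array([15, 10, 12, 8, 20, 14, 5, 3, 18, 9])
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

X = np.column_stack([np.ones(len(sizes)), sizes, bedrooms, ages])
y = prices

# Normal Equation, exactly as in the cell above
weights_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver: same goal, no explicit matrix inverse.
# It returns (solution, residuals, rank, singular values); we keep the first.
weights_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare the predictions each set of weights produces
max_gap = np.max(np.abs(X @ weights_normal - X @ weights_lstsq))
print("Weights (Normal Equation):", np.round(weights_normal, 4))
print("Weights (lstsq):          ", np.round(weights_lstsq, 4))
print(f"Largest gap between the two sets of predictions: {max_gap:.2e}")
```

On a small, well-behaved dataset like this one the two approaches should agree closely; the solver mainly matters as datasets get larger or features become more alike.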

## 🔮 Step 5.3: Make Better Predictions!

Now let's predict using ALL the information!

**Remember your boss's question?**
> *"1,800 sqft, 3 bedrooms, 11 years old - what's the price?"*

Now we can give a better answer using size, bedrooms, AND age!

In [None]:
# Make predictions using ALL features
all_predictions = X @ weights

# New house prediction
new_house = np.array([1, 1800, 3, 11])  # [bias, size, bedrooms, age]
new_prediction = new_house @ weights

print("🏠 New House:")
print(f"   Size: 1800 sqft")
print(f"   Bedrooms: 3")
print(f"   Age: 11 years")
print()
print("🔮 Predictions:")
print(f"   Using ONLY size: ${predicted_price:.2f}k")  # From earlier
print(f"   Using ALL features: ${new_prediction:.2f}k")
print()
print(f"   💡 Difference: ${abs(new_prediction - predicted_price):.2f}k")
print("   More information can help, but let's check!")

## 📊 Step 5.4: Check the Improvement

Let's see if using more features actually made our predictions better!

We'll compare:
- **Simple model** (only size)
- **Multiple model** (size + bedrooms + age)

In [None]:
# Calculate R² for the multiple feature model
multi_errors = y - all_predictions
multi_unexplained = np.sum(multi_errors ** 2)
multi_r_squared = 1 - (multi_unexplained / total_variation)

print("📊 MODEL COMPARISON\n")
print("="*60)
print(f"Simple Model (Size only):")
print(f"   R² Score: {r_squared:.4f} ({r_squared*100:.1f}% of variance explained)")
print(f"   Average error: ${np.mean(np.abs(errors)):.1f}k")
print()
print(f"Multiple Model (Size + Bedrooms + Age):")
print(f"   R² Score: {multi_r_squared:.4f} ({multi_r_squared*100:.1f}% of variance explained)")
print(f"   Average error: ${np.mean(np.abs(multi_errors)):.1f}k")
print("="*60)
print()

improvement = (multi_r_squared - r_squared) * 100
if improvement > 5:
    print(f"🎉 BIG IMPROVEMENT! +{improvement:.1f}%")
    print("   Adding more features really helped!")
elif improvement > 0:
    print(f"✅ IMPROVEMENT! +{improvement:.1f}%")
    print("   The extra features fit these 10 houses a bit better!")
else:
    print(f"🤔 Hmm, not much change.")
    print("   Size alone was already pretty good!")
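
One caution about this comparison: when R² is measured on the same houses the model learned from, adding a feature can never make it go down, even if the feature is meaningless. The sketch below illustrates this by adding a column of random numbers. The helper `training_r_squared` and the random "noise" feature are made up for this demonstration.

```python
import numpy as np

# Same 10 houses as above, repeated so this cell is self-contained
sizes = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
bedrooms = np.array([3, 3, 3, 4, 2, 3, 4, 5, 3, 3])
ages = np.array([15, 10, 12, 8, 20, 14, 5, 3, 18, 9])
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

def training_r_squared(X, y):
    """Fit least-squares weights on (X, y) and score R² on the same data."""
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ weights
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones(len(sizes))
X_size = np.column_stack([ones, sizes])
X_all = np.column_stack([ones, sizes, bedrooms, ages])

# A feature that carries no real information: random numbers
rng = np.random.default_rng(0)
X_noise = np.column_stack([X_all, rng.normal(size=len(sizes))])

r2_size = training_r_squared(X_size, prices)
r2_all = training_r_squared(X_all, prices)
r2_noise = training_r_squared(X_noise, prices)

print(f"Size only:               R² = {r2_size:.4f}")
print(f"Size + bedrooms + age:   R² = {r2_all:.4f}")
print(f"...plus random numbers:  R² = {r2_noise:.4f}")
```

So a higher training R² is a good sign, but the real proof is how the model does on houses it has never seen.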

## 📈 Step 5.5: Visualize Your Improvement

In [None]:
# Create a comparison plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Simple Model
ax1.scatter(predicted_prices, prices, alpha=0.6, s=100, color='blue')
ax1.plot([prices.min(), prices.max()], [prices.min(), prices.max()], 
         'r--', linewidth=2, label='Perfect Prediction')
ax1.set_xlabel('Predicted Price', fontsize=11)
ax1.set_ylabel('Actual Price', fontsize=11)
ax1.set_title(f'Simple Model\n(R² = {r_squared:.3f})', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Multiple Model
ax2.scatter(all_predictions, prices, alpha=0.6, s=100, color='green')
ax2.plot([prices.min(), prices.max()], [prices.min(), prices.max()], 
         'r--', linewidth=2, label='Perfect Prediction')
ax2.set_xlabel('Predicted Price', fontsize=11)
ax2.set_ylabel('Actual Price', fontsize=11)
ax2.set_title(f'Multiple Model\n(R² = {multi_r_squared:.3f})', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n👀 Look at the plots:")
print("   Left (simple): More scattered")
print("   Right (multiple): Closer to the red line = better!")
print("\n   🎯 Points closer to the red line = more accurate predictions!")

---

# 🎓 Step 6: The Final Test

## The Real Challenge

Your boss comes back with TWO more houses:

**House A:**
- 1,800 sqft
- 3 bedrooms  
- 11 years old
- **Actual price:** $290k

**House B:**
- 2,000 sqft
- 4 bedrooms
- 7 years old
- **Actual price:** $368k

**Challenge:** Predict both prices and see how close you get!

In [None]:
# Test houses
test_house_A = np.array([1, 1800, 3, 11])
test_house_B = np.array([1, 2000, 4, 7])

# Make predictions
prediction_A = test_house_A @ weights
prediction_B = test_house_B @ weights

# Actual prices
actual_A = 290
actual_B = 368

# Calculate errors
error_A = actual_A - prediction_A
error_B = actual_B - prediction_B

print("🏠 FINAL TEST RESULTS\n")
print("="*70)
print("\nHouse A (1800 sqft, 3 bed, 11 years):")
print(f"   Your Prediction: ${prediction_A:.2f}k")
print(f"   Actual Price: ${actual_A}k")
print(f"   Error: ${error_A:.2f}k ({abs(error_A)/actual_A*100:.1f}% off)")
if abs(error_A) < 20:
    print("   ✅ Excellent prediction!")
elif abs(error_A) < 40:
    print("   👍 Pretty good!")
else:
    print("   📚 Room for improvement.")

print("\nHouse B (2000 sqft, 4 bed, 7 years):")
print(f"   Your Prediction: ${prediction_B:.2f}k")
print(f"   Actual Price: ${actual_B}k")
print(f"   Error: ${error_B:.2f}k ({abs(error_B)/actual_B*100:.1f}% off)")
if abs(error_B) < 20:
    print("   ✅ Excellent prediction!")
elif abs(error_B) < 40:
    print("   👍 Pretty good!")
else:
    print("   📚 Room for improvement.")

print("\n" + "="*70)
avg_error = (abs(error_A) + abs(error_B)) / 2
print(f"\nAverage Error: ${avg_error:.2f}k")
if avg_error < 15:
    print("\n🌟 AMAZING! You built a really good predictor!")
elif avg_error < 30:
    print("\n✅ GREAT JOB! Your model works well!")
else:
    print("\n👍 GOOD START! With more data, you'd do even better!")
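
Since these two houses were not part of the 10 the model learned from, they are also a fair way to compare the simple model with the multiple model. The sketch below refits both models and lines up their errors side by side. It uses `np.polyfit` and `np.linalg.lstsq` as shortcuts for the formulas from Steps 2 and 5, and the `test_houses` dictionary is just an illustrative way to hold the data.

```python
import numpy as np

# Same 10 houses as above, repeated so this cell is self-contained
sizes = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
bedrooms = np.array([3, 3, 3, 4, 2, 3, 4, 5, 3, 3])
ages = np.array([15, 10, 12, 8, 20, 14, 5, 3, 18, 9])
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Simple model: best-fit line through size vs price
m, b = np.polyfit(sizes, prices, deg=1)

# Multiple model: least-squares weights for [bias, size, bedrooms, age]
X = np.column_stack([np.ones(len(sizes)), sizes, bedrooms, ages])
weights, *_ = np.linalg.lstsq(X, prices, rcond=None)

# The two test houses: (size, bedrooms, age, actual price in $1000s)
test_houses = {"A": (1800, 3, 11, 290), "B": (2000, 4, 7, 368)}

abs_errors = {}
for name, (size, beds, age, actual) in test_houses.items():
    simple_pred = m * size + b
    multi_pred = np.array([1, size, beds, age]) @ weights
    abs_errors[name] = (abs(actual - simple_pred), abs(actual - multi_pred))
    print(f"House {name}: actual ${actual}k | "
          f"simple ${simple_pred:.1f}k (off by {abs_errors[name][0]:.1f}k) | "
          f"multiple ${multi_pred:.1f}k (off by {abs_errors[name][1]:.1f}k)")
```

Two houses are far too few to crown a winner, but this is the same idea data scientists use at scale: judge a model on data it has not seen.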

---

# 🎉 Congratulations! You Built a Machine Learning System!

## What You Accomplished Today:

✅ **Understood the problem** - predict house prices  
✅ **Explored the data** - looked at patterns  
✅ **Built a simple model** - using just size  
✅ **Made predictions** - answered real questions  
✅ **Measured accuracy** - knew how good you were  
✅ **Improved your model** - added more features  
✅ **Tested on new data** - validated your work  

## This Is Real Machine Learning!

You did what data scientists do:
1. **Understand the problem**
2. **Explore the data**
3. **Build a model**
4. **Evaluate performance**
5. **Improve and iterate**

---

## 💭 Reflection Questions

Take a moment to think:

**1. What surprised you most?**
```
Write your answer here:

```

**2. What was the hardest part?**
```
Write your answer here:

```

**3. What did you learn?**
```
Write your answer here:

```

**4. What would you try next?**
```
Ideas:
- Add more features? (location, garage, etc.)
- Get more data? (more houses to learn from)
- Try different types of models?

Your answer:

```

---

## 🚀 What's Next?

You now understand **linear regression** - one of the most important ML techniques!

**From here, you can:**
- Try this with different datasets
- Learn about other ML algorithms
- Build more complex models
- Explore deep learning

But remember: **Everything in ML builds on what you learned today!**

---

## 📤 Share Your Work

**To submit:**
1. File → Download → Download .ipynb
2. Send the notebook
3. Include your reflections

---

# 🎓 You're Now a Machine Learning Practitioner!

**Well done!** You didn't just watch videos - you **built something real**.

Keep learning, keep building, keep growing! 🌱

---

*Questions? Confused about something? That's totally normal! Reach out and ask.* 💚