# Practice Lab: Introduction to Regression

**Module 2 - Lesson 1**  
**Date:** November 11, 2025

---

## 🎯 Learning Objectives

In this practice lab, you will:
- ✅ Understand what regression is and when to use it
- ✅ Differentiate between regression and classification problems
- ✅ Work with continuous target variables
- ✅ Visualize relationships between independent and dependent variables
- ✅ Understand simple vs multiple regression
- ✅ Practice identifying appropriate use cases for regression

---

## 📊 Real-World Scenario: E-Commerce Product Pricing

You work for **TechMart**, an online electronics retailer. Your task is to understand how different product features affect pricing so you can:
- Price new products competitively
- Identify overpriced/underpriced items
- Understand which features customers value most

**Dataset Features:**
- `product_id`: Unique product identifier
- `brand_score`: Brand reputation score (1-10)
- `screen_size`: Screen size in inches
- `storage_gb`: Storage capacity in GB
- `ram_gb`: RAM in GB
- `battery_hours`: Battery life in hours
- `camera_mp`: Camera megapixels
- `rating`: Customer rating (1-5 stars)
- `price_usd`: **TARGET** - Product price in USD

---

## 🔧 Setup: Run Me First!

This cell imports all necessary libraries and sets up our environment for reproducible results.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Set random seeds for reproducibility
np.random.seed(42)

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("✅ Setup complete!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 📦 Generate Synthetic E-Commerce Dataset

We'll create a realistic dataset of 150 electronic products with features that correlate with price.

In [None]:
# Set sample size
n_samples = 150

# Generate base features with realistic distributions
brand_score = np.random.uniform(3, 10, n_samples).round(1)
screen_size = np.random.choice([5.5, 6.1, 6.5, 6.7, 7.0], n_samples)
storage_gb = np.random.choice([64, 128, 256, 512, 1024], n_samples)
ram_gb = np.random.choice([4, 6, 8, 12, 16], n_samples)
battery_hours = np.random.uniform(8, 24, n_samples).round(1)
camera_mp = np.random.choice([12, 16, 48, 64, 108], n_samples)
rating = np.random.uniform(3.0, 5.0, n_samples).round(1)

# Generate price as a function of features (with noise)
# Base price formula: realistic pricing model for electronics
base_price = (
    100 +                           # Base cost
    brand_score * 50 +              # Brand premium ($50 per point)
    screen_size * 30 +              # Screen cost ($30 per inch)
    storage_gb * 0.15 +             # Storage cost ($0.15 per GB)
    ram_gb * 25 +                   # RAM cost ($25 per GB)
    battery_hours * 5 +             # Battery cost ($5 per hour)
    camera_mp * 2 +                 # Camera cost ($2 per MP)
    rating * 40                     # Rating premium ($40 per star)
)

# Add realistic noise (standard deviation = 15% of each product's base price)
noise = np.random.normal(0, base_price * 0.15, n_samples)
price_usd = (base_price + noise).round(2)

# Enforce a $200 price floor (this also rules out negative prices)
price_usd = np.maximum(price_usd, 200)

# Create DataFrame
df = pd.DataFrame({
    'product_id': [f'PROD-{i:04d}' for i in range(1, n_samples + 1)],
    'brand_score': brand_score,
    'screen_size': screen_size,
    'storage_gb': storage_gb,
    'ram_gb': ram_gb,
    'battery_hours': battery_hours,
    'camera_mp': camera_mp,
    'rating': rating,
    'price_usd': price_usd
})

print("✅ Dataset generated successfully!")
print(f"\n📊 Dataset shape: {df.shape[0]} products, {df.shape[1]} columns")
print("\n🔍 First 5 products:")
df.head()

## 📊 Data Dictionary

Understanding your data is crucial before any analysis!

In [None]:
# Display data dictionary
data_dict = pd.DataFrame({
    'Column': df.columns,
    'Type': df.dtypes.values,
    'Non-Null Count': df.count().values,
    'Description': [
        'Unique product identifier',
        'Brand reputation score (1-10, higher = premium brand)',
        'Screen size in inches',
        'Storage capacity in GB',
        'RAM memory in GB',
        'Battery life in hours',
        'Camera resolution in megapixels',
        'Average customer rating (1-5 stars)',
        '🎯 TARGET: Product price in USD'
    ]
})

print("📚 Data Dictionary\n")
print(data_dict.to_string(index=False))

## 📈 Concept 1: Understanding Continuous Variables

**Regression predicts continuous values** (can take any number within a range, including decimals).

Let's examine our target variable: **price_usd**

In [None]:
# Statistical summary of our target variable
print("📊 Price Statistics:")
print(f"   Mean Price: ${df['price_usd'].mean():,.2f}")
print(f"   Median Price: ${df['price_usd'].median():,.2f}")
print(f"   Std Dev: ${df['price_usd'].std():,.2f}")
print(f"   Min Price: ${df['price_usd'].min():,.2f}")
print(f"   Max Price: ${df['price_usd'].max():,.2f}")
print(f"   Price Range: ${df['price_usd'].max() - df['price_usd'].min():,.2f}")

# Visualize price distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df['price_usd'], bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(df['price_usd'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: ${df["price_usd"].mean():.2f}')
axes[0].axvline(df['price_usd'].median(), color='green', linestyle='--', linewidth=2, label=f'Median: ${df["price_usd"].median():.2f}')
axes[0].set_xlabel('Price (USD)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Price Distribution (Continuous Variable)', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Box plot
axes[1].boxplot(df['price_usd'], vert=True)
axes[1].set_ylabel('Price (USD)', fontsize=12)
axes[1].set_title('Price Box Plot (Shows Outliers)', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insight: Price is CONTINUOUS - it can be $599.99, $600.00, $600.01, etc.")
print("   This is PERFECT for regression (not classification!)")

## 🆚 Concept 2: Regression vs Classification

**Understanding the difference is crucial!**

- **Regression**: Predicts continuous numbers (prices, temperatures, sales)
- **Classification**: Predicts categories (spam/not spam, high/medium/low price tier)

In [None]:
# Create a classification version for comparison
# Convert continuous price into categories (this is what we DON'T want for regression!)
def categorize_price(price):
    if price < 500:
        return 'Budget'
    elif price < 800:
        return 'Mid-Range'
    else:
        return 'Premium'

df['price_category'] = df['price_usd'].apply(categorize_price)

# Comparison table
print("🔍 Same Product, Different Problem Types:\n")
comparison = df[['product_id', 'brand_score', 'storage_gb', 'price_usd', 'price_category']].head(10)
print(comparison.to_string(index=False))

print("\n" + "="*70)
print("🎯 REGRESSION PROBLEM (What we want):")
print("   Goal: Predict exact price (e.g., $654.32)")
print("   Output: Continuous number")
print("   Example: 'This product should cost $654.32'")

print("\n❌ CLASSIFICATION PROBLEM (Different use case):")
print("   Goal: Predict price category (e.g., 'Mid-Range')")
print("   Output: Category label")
print("   Example: 'This product is in the Mid-Range category'")
print("="*70)

# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Regression view (continuous)
axes[0].scatter(df['storage_gb'], df['price_usd'], alpha=0.6, s=50)
axes[0].set_xlabel('Storage (GB)', fontsize=12)
axes[0].set_ylabel('Price (USD) - Continuous', fontsize=12)
axes[0].set_title('REGRESSION: Continuous Output', fontsize=14, fontweight='bold')
axes[0].grid(alpha=0.3)

# Classification view (categories)
category_map = {'Budget': 0, 'Mid-Range': 1, 'Premium': 2}
df['price_category_numeric'] = df['price_category'].map(category_map)
colors = df['price_category'].map({'Budget': 'green', 'Mid-Range': 'orange', 'Premium': 'red'})
axes[1].scatter(df['storage_gb'], df['price_category_numeric'], alpha=0.6, s=50, c=colors)
axes[1].set_xlabel('Storage (GB)', fontsize=12)
axes[1].set_ylabel('Price Category', fontsize=12)
axes[1].set_yticks([0, 1, 2])
axes[1].set_yticklabels(['Budget', 'Mid-Range', 'Premium'])
axes[1].set_title('CLASSIFICATION: Categorical Output', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Clean up temporary column
df.drop(['price_category', 'price_category_numeric'], axis=1, inplace=True)
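
The tiering step above can also be written with `pandas.cut`, which makes the bin edges explicit. This is a minimal sketch on a handful of made-up prices (the `example_prices` values are hypothetical, chosen to sit on and around the tier boundaries); the thresholds are the ones used in `categorize_price`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen to sit on and around the tier boundaries
example_prices = pd.Series([349.00, 499.99, 500.00, 799.99, 800.00, 1250.50])

# right=False makes each bin closed on the left, [low, high),
# which matches "price < 500" and "price < 800" in categorize_price()
tiers = pd.cut(
    example_prices,
    bins=[-np.inf, 500, 800, np.inf],
    labels=['Budget', 'Mid-Range', 'Premium'],
    right=False,
)

print(pd.DataFrame({'price_usd': example_prices, 'tier': tiers}).to_string(index=False))
```

Once the target has been binned like this, the exact dollar amount is gone: a $500.00 and a $799.99 product both become `Mid-Range`, which is why the two problem types call for different models.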

## 📊 Concept 3: Exploratory Data Analysis - Visualizing Relationships

Before building models, we need to understand how features relate to price.

In [None]:
# Select numeric features for correlation analysis
numeric_features = ['brand_score', 'screen_size', 'storage_gb', 'ram_gb', 
                    'battery_hours', 'camera_mp', 'rating', 'price_usd']

# Calculate correlation matrix
correlation_matrix = df[numeric_features].corr()

# Visualize correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Identify top correlations with price
price_correlations = correlation_matrix['price_usd'].sort_values(ascending=False)
print("\n📊 Features Correlated with Price (sorted by strength):\n")
for feature, corr in price_correlations.items():
    if feature != 'price_usd':
        strength = 'Strong' if abs(corr) > 0.7 else 'Moderate' if abs(corr) > 0.4 else 'Weak'
        direction = 'Positive' if corr > 0 else 'Negative'
        print(f"   {feature:20s}: {corr:+.3f} ({strength} {direction})")
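
To make the numbers in the heatmap less of a black box, here is a small sketch of the Pearson correlation that `DataFrame.corr()` computes by default, worked out by hand on synthetic data. The arrays `x_demo` and `y_demo` are stand-ins invented for this illustration, not columns from the lab's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(3, 10, 200)                      # stand-in for a feature
y_demo = 100 + 50 * x_demo + rng.normal(0, 60, 200)   # stand-in for price

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
x_dev = x_demo - x_demo.mean()
y_dev = y_demo - y_demo.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# NumPy's built-in version, for comparison
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]

print(f"manual r = {r_manual:.4f}, np.corrcoef r = {r_numpy:.4f}")
```

Because the deviations are divided by their own spread, the result always lies between -1 and +1 regardless of the units of either variable.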

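## 📈 Concept 4: Simple Linear Regression (One Predictor)

Let's predict price using **only one feature**: brand score, one of the strongest drivers of price in our generating formula (check the correlation ranking above to see where it actually lands).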

In [None]:
# Simple linear regression: Price ~ Brand Score
X_simple = df[['brand_score']]
y = df['price_usd']

# Fit model
model_simple = LinearRegression()
model_simple.fit(X_simple, y)

# Get parameters
intercept = model_simple.intercept_
coefficient = model_simple.coef_[0]

print("📐 Simple Linear Regression Results:\n")
print(f"   Equation: Price = {intercept:.2f} + {coefficient:.2f} × Brand_Score")
print(f"   \n   Interpretation:")
print(f"   - Intercept (predicted price at brand score 0): ${intercept:.2f}")
print(f"   - Each additional brand score point adds: ${coefficient:.2f}")
print(f"   \n   Example prediction:")
print(f"   - Brand Score = 7.5 → Price = ${intercept:.2f} + ${coefficient:.2f} × 7.5 = ${intercept + coefficient * 7.5:.2f}")

# Make predictions
y_pred_simple = model_simple.predict(X_simple)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(df['brand_score'], df['price_usd'], alpha=0.5, s=50, label='Actual Prices')
plt.plot(df['brand_score'], y_pred_simple, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Brand Score', fontsize=12)
plt.ylabel('Price (USD)', fontsize=12)
plt.title('Simple Linear Regression: Price vs Brand Score', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Model performance
r2_simple = r2_score(y, y_pred_simple)
rmse_simple = np.sqrt(mean_squared_error(y, y_pred_simple))
mae_simple = mean_absolute_error(y, y_pred_simple)

print(f"\n📊 Model Performance:")
print(f"   R² Score: {r2_simple:.3f} ({r2_simple*100:.1f}% of variance explained)")
print(f"   RMSE: ${rmse_simple:.2f}")
print(f"   MAE: ${mae_simple:.2f}")
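
For a single predictor, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below checks that against `np.polyfit` on synthetic data; `brand_demo`, `price_demo`, and the coefficients used to generate them are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
brand_demo = rng.uniform(3, 10, 100)                         # hypothetical brand scores
price_demo = 400 + 50 * brand_demo + rng.normal(0, 40, 100)  # hypothetical prices

# Closed-form ordinary least squares for one predictor
slope_hat = np.cov(brand_demo, price_demo, ddof=1)[0, 1] / np.var(brand_demo, ddof=1)
intercept_hat = price_demo.mean() - slope_hat * brand_demo.mean()

# np.polyfit with deg=1 returns [slope, intercept] for the same problem
slope_np, intercept_np = np.polyfit(brand_demo, price_demo, deg=1)

print(f"closed form: price = {intercept_hat:.2f} + {slope_hat:.2f} * brand")
print(f"np.polyfit : price = {intercept_np:.2f} + {slope_np:.2f} * brand")
```

Both routes minimize the same sum of squared residuals, so they agree up to floating-point error.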

## 📈 Concept 5: Multiple Linear Regression (Multiple Predictors)

Let's improve predictions by using **multiple features** simultaneously!

In [None]:
# Multiple linear regression: Price ~ All Features
feature_columns = ['brand_score', 'screen_size', 'storage_gb', 'ram_gb', 
                   'battery_hours', 'camera_mp', 'rating']
X_multiple = df[feature_columns]
y = df['price_usd']

# Fit model
model_multiple = LinearRegression()
model_multiple.fit(X_multiple, y)

# Get parameters
intercept_multi = model_multiple.intercept_
coefficients_multi = model_multiple.coef_

print("📐 Multiple Linear Regression Results:\n")
print(f"   Equation: Price = {intercept_multi:.2f}")
for feature, coef in zip(feature_columns, coefficients_multi):
    print(f"             {coef:+.2f} × {feature}")

print(f"\n   Coefficient Interpretation (holding the other features fixed):")
for feature, coef in zip(feature_columns, coefficients_multi):
    impact = "increases" if coef > 0 else "decreases"
    print(f"   - Each unit of {feature:20s} {impact} price by ${abs(coef):.2f}")

# Make predictions
y_pred_multiple = model_multiple.predict(X_multiple)

# Model performance
r2_multiple = r2_score(y, y_pred_multiple)
rmse_multiple = np.sqrt(mean_squared_error(y, y_pred_multiple))
mae_multiple = mean_absolute_error(y, y_pred_multiple)

print(f"\n📊 Model Performance:")
print(f"   R² Score: {r2_multiple:.3f} ({r2_multiple*100:.1f}% of variance explained)")
print(f"   RMSE: ${rmse_multiple:.2f}")
print(f"   MAE: ${mae_multiple:.2f}")
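
Multiple regression is the same least-squares idea with more columns. As a rough sketch of what happens under the hood, the example below builds a design matrix with a column of ones for the intercept and solves it with `np.linalg.lstsq`. The data is synthetic and noise-free on purpose (an assumption for this illustration), so the solver should recover the generating coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
ram_demo = rng.choice([4, 6, 8, 12, 16], n).astype(float)
storage_demo = rng.choice([64, 128, 256, 512, 1024], n).astype(float)

# Noise-free target so the true coefficients are known: 100, 25, 0.15
price_demo = 100 + 25 * ram_demo + 0.15 * storage_demo

# Design matrix: a column of ones (intercept) followed by one column per feature
design = np.column_stack([np.ones(n), ram_demo, storage_demo])

# Least-squares solution of design @ beta ≈ price_demo
beta, *_ = np.linalg.lstsq(design, price_demo, rcond=None)

print(f"intercept = {beta[0]:.2f}, ram_gb = {beta[1]:.2f}, storage_gb = {beta[2]:.2f}")
```

With noisy data, as in this lab, the recovered coefficients land near the generating values rather than exactly on them.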

## 🆚 Concept 6: Comparing Simple vs Multiple Regression

**Which model is better?**

In [None]:
# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Metric': ['R² Score', 'RMSE', 'MAE', 'Number of Features'],
    'Simple Regression': [
        f"{r2_simple:.3f}",
        f"${rmse_simple:.2f}",
        f"${mae_simple:.2f}",
        "1 (brand_score only)"
    ],
    'Multiple Regression': [
        f"{r2_multiple:.3f}",
        f"${rmse_multiple:.2f}",
        f"${mae_multiple:.2f}",
        f"{len(feature_columns)} features"
    ],
    'Winner': [
        'Multiple' if r2_multiple > r2_simple else 'Simple',
        'Multiple' if rmse_multiple < rmse_simple else 'Simple',
        'Multiple' if mae_multiple < mae_simple else 'Simple',
        '-'
    ]
})

print("\n📊 Model Comparison:\n")
print(comparison_df.to_string(index=False))

# Visualize predictions comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Simple regression predictions
axes[0].scatter(y, y_pred_simple, alpha=0.5, s=30)
axes[0].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Price (USD)', fontsize=12)
axes[0].set_ylabel('Predicted Price (USD)', fontsize=12)
axes[0].set_title(f'Simple Regression (R²={r2_simple:.3f})', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Multiple regression predictions
axes[1].scatter(y, y_pred_multiple, alpha=0.5, s=30, color='green')
axes[1].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Price (USD)', fontsize=12)
axes[1].set_ylabel('Predicted Price (USD)', fontsize=12)
axes[1].set_title(f'Multiple Regression (R²={r2_multiple:.3f})', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n💡 Key Insight:")
print(f"   Multiple regression R² improved by {(r2_multiple - r2_simple)*100:.1f} percentage points!")
print(f"   Error (MAE) reduced by ${mae_simple - mae_multiple:.2f}")
print(f"   \n   On this training data, using multiple features gives a closer fit!")
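
One caveat about the comparison above: both models were scored on the same rows they were fitted on, and with ordinary least squares the training R² cannot go down when features are added, even useless ones. A fairer comparison holds out some rows. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and the deliberately useless third column are assumptions made up for this example), using the `train_test_split` helper already imported in the setup cell.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 200
X_demo = np.column_stack([
    rng.uniform(3, 10, n),    # informative feature
    rng.uniform(8, 24, n),    # informative feature
    rng.normal(size=n),       # pure noise, unrelated to the target
])
y_demo = 100 + 50 * X_demo[:, 0] + 5 * X_demo[:, 1] + rng.normal(0, 40, n)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit each model on the training rows only, then score on both splits
results = {}
for name, cols in {'one feature': [0], 'all three': [0, 1, 2]}.items():
    model = LinearRegression().fit(X_train[:, cols], y_train)
    results[name] = (
        model.score(X_train[:, cols], y_train),  # R² on rows the model saw
        model.score(X_test[:, cols], y_test),    # R² on held-out rows
    )

for name, (r2_train, r2_test) in results.items():
    print(f"{name:12s} train R² = {r2_train:.3f}   test R² = {r2_test:.3f}")
```

The held-out score is the one that indicates how the model is likely to do on products it has never seen.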

## 🎯 Practice Exercise 1: Predict Price for New Product

**Scenario:** TechMart is launching a new smartphone with these specs:
- Brand Score: 8.5
- Screen Size: 6.5 inches
- Storage: 256 GB
- RAM: 8 GB
- Battery: 18 hours
- Camera: 48 MP
- Rating: 4.5 stars

**Task:** Use the multiple regression model to predict the price.

In [None]:
# YOUR CODE HERE
# Create a DataFrame for the new product
# Use model_multiple.predict() to get the price

new_product = pd.DataFrame({
    'brand_score': [8.5],
    'screen_size': [6.5],
    'storage_gb': [256],
    'ram_gb': [8],
    'battery_hours': [18],
    'camera_mp': [48],
    'rating': [4.5]
})

# Make prediction
predicted_price = model_multiple.predict(new_product)[0]

print(f"\n🎯 Prediction for New Product:")
print(f"   Predicted Price: ${predicted_price:.2f}")
print(f"   \n   Recommended retail price range: ${predicted_price*0.95:.2f} - ${predicted_price*1.05:.2f}")
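
It can help to see that `predict()` is doing nothing more than the regression equation: intercept plus the sum of coefficient times feature value. The sketch below verifies this on a tiny made-up dataset (`X_toy` and `y_toy` are hypothetical numbers, not rows from the lab).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: columns are brand_score and storage_gb
X_toy = np.array([[6.0, 128], [7.5, 256], [9.0, 512], [5.0, 64], [8.0, 256]])
y_toy = np.array([520.0, 640.0, 810.0, 430.0, 670.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

new_row = np.array([[8.5, 256]])

# The regression equation written out: intercept + features · coefficients
by_hand = toy_model.intercept_ + new_row @ toy_model.coef_
by_predict = toy_model.predict(new_row)

print(f"by hand: {by_hand[0]:.2f}   predict(): {by_predict[0]:.2f}")
```

The same check works on `model_multiple` from the lab, using `intercept_multi` and `coefficients_multi`.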

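## 🎯 Practice Exercise 2: Feature Importance

**Task:** Identify which feature has the **largest per-unit effect** on price by examining coefficients. Keep in mind that each coefficient is in dollars per unit of its own feature (per GB, per star, per inch), so the ranking depends on the units as well as on how much the feature matters.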

In [None]:
# YOUR CODE HERE
# Create a DataFrame of features and their coefficients
# Sort by absolute coefficient value

feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': coefficients_multi,
    'Abs_Coefficient': np.abs(coefficients_multi)
}).sort_values('Abs_Coefficient', ascending=False)

print("\n📊 Feature Importance (by coefficient magnitude):\n")
print(feature_importance[['Feature', 'Coefficient']].to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'], color='steelblue')
plt.xlabel('Coefficient Value ($ impact)', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance: Impact on Price', fontsize=14, fontweight='bold')
plt.axvline(0, color='black', linewidth=0.8)
plt.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print(f"\n💡 Largest per-unit coefficient: {feature_importance.iloc[0]['Feature']}")
print(f"   Impact: ${abs(feature_importance.iloc[0]['Coefficient']):.2f} per unit")
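
Raw coefficients are hard to compare across features with very different scales: one GB of storage is a tiny step, one rating star is a large one. One common workaround, sketched here as an assumption rather than as this lab's method, is to multiply each coefficient by its feature's standard deviation, which gives "dollars per typical swing in the feature". The data below is synthetic and reuses the lab's generating coefficients for storage and rating.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
storage_demo = rng.choice([64, 128, 256, 512, 1024], n).astype(float)
rating_demo = rng.uniform(3.0, 5.0, n)
price_demo = 100 + 0.15 * storage_demo + 40 * rating_demo + rng.normal(0, 10, n)

# Ordinary least squares with an intercept column
design = np.column_stack([np.ones(n), storage_demo, rating_demo])
beta, *_ = np.linalg.lstsq(design, price_demo, rcond=None)
raw = dict(zip(['storage_gb', 'rating'], beta[1:]))

# Scale each coefficient by its feature's standard deviation:
# dollars per one standard deviation instead of dollars per unit
spread = {'storage_gb': storage_demo.std(), 'rating': rating_demo.std()}
scaled = {name: raw[name] * spread[name] for name in raw}

for name in raw:
    print(f"{name:12s} raw = {raw[name]:7.2f} $/unit   scaled = {scaled[name]:7.2f} $/std")
```

In this sketch the two rankings disagree: rating has the larger raw coefficient, but storage moves the price more across its typical range.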

## 🎯 Practice Exercise 3: Identify Overpriced/Underpriced Products

**Task:** Find products where actual price differs significantly from predicted price.

In [None]:
# YOUR CODE HERE
# Calculate residuals (actual - predicted)
# Find products with largest positive and negative residuals

df['predicted_price'] = y_pred_multiple
df['residual'] = df['price_usd'] - df['predicted_price']
df['price_difference'] = df['residual'].abs()

# Top 5 overpriced (actual > predicted)
overpriced = df.nlargest(5, 'residual')[['product_id', 'brand_score', 'storage_gb', 
                                          'price_usd', 'predicted_price', 'residual']]
print("\n💰 Top 5 OVERPRICED Products (Actual > Predicted):\n")
print(overpriced.to_string(index=False))

# Top 5 underpriced (actual < predicted)
underpriced = df.nsmallest(5, 'residual')[['product_id', 'brand_score', 'storage_gb', 
                                            'price_usd', 'predicted_price', 'residual']]
print("\n🔥 Top 5 UNDERPRICED Products (Actual < Predicted):\n")
print(underpriced.to_string(index=False))

# Visualize residuals
plt.figure(figsize=(10, 6))
plt.scatter(df['predicted_price'], df['residual'], alpha=0.5, s=50)
plt.axhline(0, color='red', linestyle='--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Predicted Price (USD)', fontsize=12)
plt.ylabel('Residual (Actual - Predicted)', fontsize=12)
plt.title('Residual Plot: Identifying Pricing Anomalies', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n💡 Interpretation:")
print(f"   Positive residuals = OVERPRICED (consider discount)")
print(f"   Negative residuals = UNDERPRICED (potential to increase price)")

# Cleanup
df.drop(['predicted_price', 'residual', 'price_difference'], axis=1, inplace=True)
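
Taking the top five residuals always returns five products, whether or not any of them is unusual. A possible refinement, shown here only as a sketch, is to standardize the residuals and flag those beyond a cutoff. The cutoff of 2 standard deviations is an assumed rule of thumb, and the `predicted_demo` and `actual_demo` arrays are synthetic with one anomaly planted on purpose.

```python
import numpy as np

rng = np.random.default_rng(5)
predicted_demo = rng.uniform(500, 1500, 100)
actual_demo = predicted_demo + rng.normal(0, 30, 100)
actual_demo[7] += 400   # plant one clearly overpriced product at index 7

residual_demo = actual_demo - predicted_demo

# Standardize: how many standard deviations is each residual from the mean?
z_scores = (residual_demo - residual_demo.mean()) / residual_demo.std(ddof=1)

threshold = 2.0  # assumption: a common rule of thumb, not a value from this lab
flagged = np.flatnonzero(np.abs(z_scores) > threshold)

for i in flagged:
    label = 'overpriced' if residual_demo[i] > 0 else 'underpriced'
    print(f"index {i:3d}: residual = {residual_demo[i]:+8.2f}, z = {z_scores[i]:+.2f} ({label})")
```

Some ordinary rows will also cross the cutoff by chance, so a flag is best treated as a prompt for a closer look rather than a verdict.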

## 🎯 Practice Exercise 4: What-If Scenario

**Scenario:** If we improve the camera from 48 MP to 108 MP, how much can we increase the price?

**Task:** Calculate the price difference for this upgrade.

In [None]:
# YOUR CODE HERE
# Create two products: one with 48 MP, one with 108 MP
# Keep all other features identical
# Calculate price difference

# Base product
base_specs = {
    'brand_score': [8.0],
    'screen_size': [6.5],
    'storage_gb': [256],
    'ram_gb': [8],
    'battery_hours': [18],
    'camera_mp': [48],  # Base camera
    'rating': [4.5]
}

# Upgraded product
upgraded_specs = base_specs.copy()
upgraded_specs['camera_mp'] = [108]  # Upgraded camera

# Predict prices
base_price = model_multiple.predict(pd.DataFrame(base_specs))[0]
upgraded_price = model_multiple.predict(pd.DataFrame(upgraded_specs))[0]
price_increase = upgraded_price - base_price

print(f"\n📸 What-If Scenario: Camera Upgrade Analysis")
print(f"   " + "="*50)
print(f"   Base Camera (48 MP):      ${base_price:.2f}")
print(f"   Upgraded Camera (108 MP): ${upgraded_price:.2f}")
print(f"   " + "="*50)
print(f"   Price Increase:           ${price_increase:.2f}")
print(f"   Percentage Increase:      {(price_increase/base_price)*100:.1f}%")

# Verify with coefficient
camera_coef = coefficients_multi[feature_columns.index('camera_mp')]
expected_increase = camera_coef * (108 - 48)
print(f"\n   ✅ Verification using coefficient:")
print(f"   Camera coefficient: ${camera_coef:.2f} per MP")
print(f"   Expected increase: ${camera_coef:.2f} × {108-48} MP = ${expected_increase:.2f}")
print(f"   Matches prediction: {np.isclose(price_increase, expected_increase)}")

## 🎯 Practice Exercise 5: Real-World Use Cases

**Task:** Match each scenario to the correct ML approach (Regression or Classification).

In [None]:
# Create quiz scenarios
scenarios = [
    ("Predict tomorrow's temperature in Celsius", "Regression", "Continuous value (e.g., 22.5°C)"),
    ("Classify emails as spam or not spam", "Classification", "Category (spam/not spam)"),
    ("Estimate house selling price", "Regression", "Continuous value (e.g., $347,250)"),
    ("Determine if a tumor is malignant or benign", "Classification", "Category (malignant/benign)"),
    ("Forecast next month's sales revenue", "Regression", "Continuous value (e.g., $1,245,678.90)"),
    ("Predict customer churn (will leave or stay)", "Classification", "Category (churn/retain)"),
    ("Estimate student's final exam score (0-100)", "Regression", "Continuous value (e.g., 87.5)"),
    ("Categorize customers as high/medium/low value", "Classification", "Category (high/med/low)"),
]

print("\n🎓 Regression vs Classification Quiz\n")
print("="*80)

for i, (scenario, correct_answer, reason) in enumerate(scenarios, 1):
    print(f"\n{i}. {scenario}")
    print(f"   Answer: {correct_answer}")
    print(f"   Why: {reason}")

print("\n" + "="*80)
print("\n💡 Key Takeaway:")
print("   - Regression: Predicting NUMBERS (continuous values)")
print("   - Classification: Predicting CATEGORIES (discrete labels)")

## 📚 Key Concepts Summary

### What You Learned:

1. **✅ Regression Definition**
   - Predicts continuous numerical values
   - Different from classification (which predicts categories)
   - Used for: prices, temperatures, sales, scores, etc.

2. **✅ Simple vs Multiple Regression**
   - **Simple**: One predictor (e.g., price from brand score only)
   - **Multiple**: Multiple predictors (e.g., price from brand, storage, RAM, etc.)
   - Multiple regression usually fits better when the extra features carry real signal; confirm on held-out data

3. **✅ Key Components**
   - **Target Variable (y)**: What we're predicting (price)
   - **Features (X)**: What we use to predict (brand, storage, RAM, etc.)
   - **Coefficients**: Change in the target per one-unit change in a feature, holding the others fixed
   - **Intercept**: Predicted value when all features are zero

4. **✅ Model Evaluation**
   - **R² Score**: Share of variance explained (at most 1, higher is better; can be negative on held-out data)
   - **RMSE/MAE**: Typical prediction error in dollars (lower is better)
   - **Residuals**: Difference between actual and predicted

5. **✅ Practical Applications**
   - Pricing strategy (our example)
   - What-if scenario analysis
   - Identifying overpriced/underpriced items
   - Feature importance analysis
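
As a compact reference for the evaluation metrics listed above, here is a sketch that computes MAE, RMSE, and R² directly from their definitions. The four actual and predicted prices are made up for the example.

```python
import numpy as np

# Hypothetical actual and predicted prices
y_true = np.array([500.0, 650.0, 720.0, 900.0])
y_hat = np.array([520.0, 630.0, 700.0, 950.0])

errors = y_true - y_hat                        # residuals: actual - predicted

mae = np.abs(errors).mean()                    # mean absolute error
rmse = np.sqrt((errors ** 2).mean())           # root mean squared error

ss_res = (errors ** 2).sum()                   # variation the model leaves unexplained
ss_tot = ((y_true - y_true.mean()) ** 2).sum() # total variation around the mean
r2 = 1 - ss_res / ss_tot                       # share of variation explained

print(f"MAE = ${mae:.2f}, RMSE = ${rmse:.2f}, R² = {r2:.3f}")
```

RMSE squares the errors before averaging, so the single $50 miss pulls it above MAE; the gap between the two is a quick hint about how uneven the errors are.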

---

## 🎉 Congratulations!

You've completed the **Introduction to Regression** practice lab!

**Next Steps:**
- ✅ Practice with your own datasets
- ✅ Try different feature combinations
- ✅ Explore non-linear relationships (coming in future lessons)
- ✅ Learn about regularization to prevent overfitting

---

## 📦 Library Versions

In [None]:
# Document versions for reproducibility
import sys
import sklearn

print("Library Versions:")
print(f"  Python: {sys.version}")
print(f"  NumPy: {np.__version__}")
print(f"  Pandas: {pd.__version__}")
print(f"  Matplotlib: {plt.matplotlib.__version__}")
print(f"  Seaborn: {sns.__version__}")
print(f"  Scikit-learn: {sklearn.__version__}")
print(f"\nRandom Seed: 42")