# 010: Linear Regression

Linear regression models the relationship between:

- **Independent variables** (features, X): factors that influence the outcome
- **Dependent variable** (target, y): the value we want to predict

### 📊 Visual Concept

```mermaid
graph LR
    A[Features X] --> B[Linear Regression Model]
    B --> C[Prediction y]
    D[Training Data] --> B
    style B fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
```

### The Linear Equation

For simple linear regression (one feature):

$$y = \beta_0 + \beta_1 x + \epsilon$$

Where:
- $y$ = observed target value
- $x$ = feature value
- $\beta_0$ = intercept (y-value when x=0)
- $\beta_1$ = slope (change in y per unit change in x)
- $\epsilon$ = error term (residual)

For multiple linear regression:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

### 🎯 Linear Regression Workflow

```mermaid
graph TD
    A[Collect Data] --> B[Exploratory Data Analysis]
    B --> C[Feature Selection]
    C --> D[Train-Test Split]
    D --> E[Fit Linear Model]
    E --> F[Make Predictions]
    F --> G[Evaluate Performance]
    G --> H{Good Performance?}
    H -->|Yes| I[Deploy Model]
    H -->|No| J[Feature Engineering]
    J --> E
    style E fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style I fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
```

### When to Use Linear Regression?

✅ **Use when:**
- Relationship between features and target is approximately linear
- You need interpretable results
- Continuous target variable
- Fast training and prediction needed

❌ **Don't use when:**
- Relationship is highly non-linear
- Many categorical features with high cardinality
- Target has complex interactions
- Presence of severe outliers (consider robust regression)

### 🏭 Real-World Applications

**Post-Silicon Validation:**
- Predicting device yield from test parameters
- Estimating power consumption from voltage/frequency
- Forecasting test time from complexity metrics
- Correlating performance with process variations

**General AI/ML:**
- Stock price prediction
- Real estate valuation
- Sales forecasting
- Risk assessment

---
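
To make the single-feature equation concrete, here is a minimal sketch that recovers the intercept and slope using the textbook closed-form estimates. The toy data and the names `x_toy`, `y_toy`, `beta_0_hat`, and `beta_1_hat` are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np

# Hypothetical toy data generated from y = 2 + 3x with no noise
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 + 3.0 * x_toy

# Closed-form least-squares estimates for one feature:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x_toy.mean(), y_toy.mean()
beta_1_hat = np.sum((x_toy - x_mean) * (y_toy - y_mean)) / np.sum((x_toy - x_mean) ** 2)
beta_0_hat = y_mean - beta_1_hat * x_mean

print(f"intercept = {beta_0_hat:.2f}, slope = {beta_1_hat:.2f}")
```

Because the toy data has no noise, the estimates match the generating intercept and slope; with noisy data they would only approximate them.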

## 2. Setup and Data Preparation

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

### 📝 What's Happening in This Code?

**Purpose:** Import essential libraries and configure the environment

**Key Points:**
1. **NumPy & Pandas**: Core libraries for numerical computing and data manipulation
2. **Matplotlib & Seaborn**: Visualization libraries for plotting graphs and charts
3. **Scikit-learn**: Industry-standard ML library providing LinearRegression and evaluation metrics
4. **Configuration**: Set random seed for reproducibility and configure plot styling for consistent visuals

**Why This Matters:**
- Setting random seed ensures your results are reproducible across runs
- Warning suppression keeps output clean and focused on results
- Pre-configuring plotting style creates professional visualizations

### 2.1 Generate Synthetic Dataset

Let's create a dataset that simulates **post-silicon validation scenarios** - semiconductor test parameters and device yield.

### 📝 What's Happening in This Code?

**Purpose:** Create realistic synthetic data mimicking post-silicon validation scenarios

**Key Points:**
1. **Feature Generation**: Temperature, voltage, current, and pressure represent typical manufacturing test parameters
2. **Linear Relationship**: Yield is calculated as a linear combination of features with added random noise
3. **Domain Knowledge**: Coefficients are illustrative choices that encode plausible directions of effect (e.g., high current reduces yield, voltage positively affects yield)
4. **Data Validation**: Clipping ensures yield stays within realistic bounds (60-100%)

**Why This Approach:**
- Synthetic data lets us control ground truth and understand model behavior
- Loosely mimics the kind of parametric data found in STDF files from semiconductor testing
- Allows experimenting without exposing proprietary test data

In [None]:
def generate_semiconductor_data(n_samples=1000, noise_level=0.1):
    """
    Generate synthetic semiconductor test data
    
    Scenario: Predict chip yield based on manufacturing parameters
    """
    np.random.seed(42)
    
    # Manufacturing parameters (features)
    temperature = np.random.uniform(20, 30, n_samples)  # °C
    voltage = np.random.uniform(3.0, 3.6, n_samples)     # V
    current = np.random.uniform(0.4, 0.6, n_samples)     # A
    pressure = np.random.uniform(0.95, 1.05, n_samples)  # atm
    
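    # True relationship (linear with some noise)
    # yield = f(temperature, voltage, current, pressure)
    # The intercept keeps yields around 73-88, inside the 60-100 clip range below;
    # a larger intercept would push every row above 100 and flatten the target.
    yield_pct = (
        40 +                           # intercept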
        0.5 * temperature +            # higher temp slightly increases yield
        10 * voltage +                 # voltage has strong positive effect
        -20 * current +                # high current reduces yield
        5 * pressure +                 # pressure has moderate effect
        np.random.normal(0, noise_level * 10, n_samples)  # random noise
    )
    
    # Clip yield to realistic range
    yield_pct = np.clip(yield_pct, 60, 100)
    
    # Create DataFrame
    df = pd.DataFrame({
        'temperature': temperature,
        'voltage': voltage,
        'current': current,
        'pressure': pressure,
        'yield_pct': yield_pct
    })
    
    return df

# Generate data
df = generate_semiconductor_data(n_samples=1000)

print(f"Dataset shape: {df.shape}")
print(f"\nFirst 5 rows:")
print(df.head())
print(f"\nDataset statistics:")
print(df.describe())
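
Since clipping can silently flatten a target, it may be worth checking how many rows end up on a clip bound before modeling. The sketch below uses a hypothetical helper, `clip_fraction`, and made-up example values; neither comes from the original notebook.

```python
import numpy as np

def clip_fraction(values, lower, upper):
    """Return the fraction of values at or beyond either clip bound."""
    values = np.asarray(values, dtype=float)
    return float(np.mean((values <= lower) | (values >= upper)))

# Hypothetical example values, clipped the same way as the yield column
example = np.clip(np.array([55.0, 72.0, 88.0, 104.0]), 60, 100)
frac = clip_fraction(example, 60, 100)
print(f"{frac:.0%} of rows hit a clip bound")
```

On the notebook's data the call would be `clip_fraction(df['yield_pct'], 60, 100)`. A value near zero suggests the linear signal survived the clipping; a large value means the regression is fitting a mostly constant target.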

### 2.2 Exploratory Data Analysis (EDA)

### 📝 What's Happening in This Code?

**Purpose:** Visualize relationships between features and target variable

**Key Points:**
1. **Scatter Plots**: Show individual data points to reveal linear/non-linear patterns
2. **Trend Lines**: Red dashed lines indicate the best-fit linear relationship for each feature
3. **Visual Inspection**: Helps identify which features have strong predictive power
4. **Assumption Check**: Verifies linearity assumption before building model

**What to Look For:**
- Points clustering around trend line = strong linear relationship
- Scattered points = weak relationship or non-linearity
- Outliers that deviate significantly from the trend

In [None]:
# Visualize relationships
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

features = ['temperature', 'voltage', 'current', 'pressure']
for idx, feature in enumerate(features):
    ax = axes[idx // 2, idx % 2]
    ax.scatter(df[feature], df['yield_pct'], alpha=0.5, s=20)
    ax.set_xlabel(feature.capitalize(), fontsize=12)
    ax.set_ylabel('Yield %', fontsize=12)
    ax.set_title(f'Yield vs {feature.capitalize()}', fontsize=14)
    ax.grid(True, alpha=0.3)
    
    # Add trend line
    z = np.polyfit(df[feature], df['yield_pct'], 1)
    p = np.poly1d(z)
    ax.plot(df[feature], p(df[feature]), "r--", alpha=0.8, linewidth=2)

plt.tight_layout()
plt.show()

print("📊 Scatter plots show relationships between features and yield")

### 📝 What's Happening in This Code?

**Purpose:** Calculate correlations and identify multicollinearity issues

**Key Points:**
1. **Correlation Matrix**: Heatmap shows linear relationships between all variables (ranges from -1 to +1)
2. **Feature Importance Indicator**: High correlation with target = potentially useful feature
3. **Multicollinearity Detection**: High correlation between features can cause model instability
4. **Color Coding**: Red = positive correlation, Blue = negative correlation, White = no correlation

**Interpretation Guide:**
- |r| > 0.7: Strong correlation
- 0.4 < |r| < 0.7: Moderate correlation  
- |r| < 0.4: Weak correlation
- r close to 0: No linear relationship

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, fmt='.2f')
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold')
plt.show()

print("\n📊 Correlation with yield:")
print(correlation_matrix['yield_pct'].sort_values(ascending=False))
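
As a rough illustration of what each cell of the heatmap contains, the sketch below computes Pearson's r by hand and labels it with the thresholds from the interpretation guide. The helper names `pearson_r` and `strength_label` are assumptions for this example, and the handling of the exact boundary values (0.4 and 0.7) is a choice the guide leaves open.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(np.sum(a_c * b_c) / np.sqrt(np.sum(a_c ** 2) * np.sum(b_c ** 2)))

def strength_label(r):
    """Map |r| to the labels used in the interpretation guide."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "strong"
    if magnitude >= 0.4:
        return "moderate"
    return "weak"

# Hypothetical data: one clean linear pair and one unrelated pair
rng = np.random.default_rng(0)
u = rng.normal(size=500)
v_linear = 2.0 * u + rng.normal(scale=0.1, size=500)
v_noise = rng.normal(size=500)

r_linear = pearson_r(u, v_linear)
r_noise = pearson_r(u, v_noise)
print(strength_label(r_linear), strength_label(r_noise))
```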

---

## 3. Linear Regression from Scratch

Let's implement linear regression using the Normal Equation (Ordinary Least Squares).

### 🧮 Mathematical Foundation

```mermaid
graph TD
    A[Training Data X, y] --> B[Add Intercept Column]
    B --> C[Compute X^T X]
    C --> D[Compute Inverse]
    D --> E[Compute X^T y]
    E --> F[β = inv X^T X × X^T y]
    F --> G[Extract β0 intercept and β1...βn coefficients]
    style F fill:#FF5722,stroke:#333,stroke-width:2px,color:#fff
```

### The Normal Equation

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Where:
- $\hat{\beta}$ = estimated coefficients (including intercept)
- $X$ = feature matrix (with column of 1s for intercept)
- $y$ = target vector
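
For readers who want the missing step, here is a short derivation sketch, assuming $X^T X$ is invertible (that is, the columns of $X$ are linearly independent). The symbol $J$ for the squared-error cost is introduced here for illustration. Least squares minimizes

$$J(\beta) = (y - X\beta)^T (y - X\beta)$$

Setting the gradient with respect to $\beta$ to zero gives

$$\nabla_\beta J = -2 X^T (y - X\beta) = 0 \quad\Rightarrow\quad X^T X \hat{\beta} = X^T y$$

and multiplying both sides by $(X^T X)^{-1}$ yields the Normal Equation stated above.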

### 📝 What's Happening in This Code?

**Purpose:** Implement linear regression from scratch to understand the mathematics

**Key Points:**
1. **Normal Equation Method**: Closed-form solution that directly computes optimal coefficients (no iterations needed)
2. **Matrix Operations**: Uses linear algebra (transpose, inverse, dot product) to solve the equation
3. **Intercept Handling**: Adds column of 1s to feature matrix to learn the bias term
4. **Educational Value**: Understanding the math helps debug issues and appreciate sklearn's optimizations

**Why From Scratch:**
- Builds deep understanding of what's happening "under the hood"
- Helps recognize when assumptions are violated
- Demonstrates that ML isn't magic - it's mathematics
- Validates that our implementation matches sklearn (coming next)

In [None]:
class LinearRegressionScratch:
    """Linear Regression implemented from scratch"""
    
    def __init__(self):
        self.coefficients = None
        self.intercept = None
    
    def fit(self, X, y):
        """
        Fit linear model using Normal Equation
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
        y : array-like, shape (n_samples,)
        """
        # Add column of ones for intercept
        X_with_intercept = np.c_[np.ones(X.shape[0]), X]
        
        # Normal equation: β = (X^T X)^-1 X^T y
        XtX = X_with_intercept.T @ X_with_intercept
        XtX_inv = np.linalg.inv(XtX)
        Xty = X_with_intercept.T @ y
        
        beta = XtX_inv @ Xty
        
        # Extract intercept and coefficients
        self.intercept = beta[0]
        self.coefficients = beta[1:]
        
        return self
    
    def predict(self, X):
        """Make predictions"""
        return X @ self.coefficients + self.intercept
    
    def score(self, X, y):
        """Calculate R² score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - (ss_res / ss_tot)

# Test our implementation
X = df[features].values
y = df['yield_pct'].values

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train from scratch
lr_scratch = LinearRegressionScratch()
lr_scratch.fit(X_train, y_train)

# Predictions
y_pred_scratch = lr_scratch.predict(X_test)
r2_scratch = lr_scratch.score(X_test, y_test)

print("="*60)
print("LINEAR REGRESSION FROM SCRATCH")
print("="*60)
print(f"Intercept: {lr_scratch.intercept:.4f}")
print(f"\nCoefficients:")
for feat, coef in zip(features, lr_scratch.coefficients):
    print(f"  {feat:12s}: {coef:8.4f}")
print(f"\nR² Score on test set: {r2_scratch:.4f}")
print("="*60)
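
The explicit inverse is fine for teaching, but it is not the only way to solve the Normal Equation. The sketch below, on hypothetical data (`X_demo`, `y_demo`), compares it with `np.linalg.solve` and `np.linalg.lstsq`, which avoid forming the inverse and tend to behave better when features are nearly collinear.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))  # hypothetical features
y_demo = 1.5 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Design matrix with a leading column of ones for the intercept
A = np.c_[np.ones(len(X_demo)), X_demo]

# 1. Explicit inverse, as in the class above
beta_inv = np.linalg.inv(A.T @ A) @ A.T @ y_demo

# 2. Solve the linear system X^T X beta = X^T y without forming the inverse
beta_solve = np.linalg.solve(A.T @ A, A.T @ y_demo)

# 3. Least-squares solver that works on A directly
beta_lstsq, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_lstsq), np.allclose(beta_solve, beta_lstsq))
```

On well-conditioned data like this, all three agree to floating-point precision; the differences only show up when `A.T @ A` is close to singular.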

---

## 4. Linear Regression with Scikit-learn

Now let's use the industry-standard scikit-learn implementation.

### 📝 What's Happening in This Code?

**Purpose:** Compare our implementation with production-ready sklearn

**Key Points:**
1. **Industry Standard**: Sklearn's LinearRegression is optimized, tested, and widely used in production
2. **API Simplicity**: Consistent fit/predict interface used across all sklearn models
3. **Validation**: Comparing coefficients proves our scratch implementation is mathematically correct
4. **Production Choice**: In real projects, always prefer sklearn for robustness and performance

**Key Takeaway:**
- Both implementations produce the same results up to floating-point precision (validates our understanding)
- Sklearn adds: numerical stability, edge case handling, performance optimizations
- Understanding the math helps you debug and extend models confidently

In [None]:
# Train with scikit-learn
lr_sklearn = LinearRegression()
lr_sklearn.fit(X_train, y_train)

# Predictions
y_pred_sklearn = lr_sklearn.predict(X_test)

# Compare with our implementation
print("="*60)
print("COMPARISON: Scratch vs Scikit-learn")
print("="*60)
print(f"{'Metric':<20} {'Scratch':<15} {'Scikit-learn':<15}")
print("-"*60)
print(f"{'Intercept':<20} {lr_scratch.intercept:>14.4f} {lr_sklearn.intercept_:>14.4f}")

for i, feat in enumerate(features):
    print(f"{feat:<20} {lr_scratch.coefficients[i]:>14.4f} {lr_sklearn.coef_[i]:>14.4f}")

r2_sklearn = lr_sklearn.score(X_test, y_test)
print("-"*60)
print(f"{'R² Score':<20} {r2_scratch:>14.4f} {r2_sklearn:>14.4f}")
print("="*60)
print("\n✅ Both implementations produce identical results!")

---

## 5. Model Evaluation

### 📊 Metrics Overview

```mermaid
graph LR
    A[Model Predictions] --> B[Evaluation Metrics]
    B --> C[MAE - Average Error]
    B --> D[RMSE - Penalizes Large Errors]
    B --> E[R² - Variance Explained]
    B --> F[MAPE - Percentage Error]
    style B fill:#9C27B0,stroke:#333,stroke-width:2px,color:#fff
```

### 5.1 Key Regression Metrics

1. **Mean Absolute Error (MAE)**: Average absolute difference
   $$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

2. **Mean Squared Error (MSE)**: Average squared difference
   $$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

3. **Root Mean Squared Error (RMSE)**: Square root of MSE
   $$RMSE = \sqrt{MSE}$$

4. **R² Score (Coefficient of Determination)**: Proportion of variance explained
   $$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
   
   - R² = 1: Perfect predictions
   - R² = 0: Model no better than mean
   - R² < 0: Model worse than mean
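
The formulas above can be sketched directly in NumPy. The function name `regression_metrics` and the four-point example are assumptions for illustration; MAPE is included using the usual definition, the mean of $|y_i - \hat{y}_i| / |y_i|$ times 100, which assumes no true value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R² and MAPE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mae": float(np.mean(np.abs(errors))),
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "r2": float(1 - ss_res / ss_tot),
        "mape": float(np.mean(np.abs(errors / y_true)) * 100),  # undefined if any y_true is 0
    }

# Hypothetical values: errors are 1, 0, -1, -1
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 10])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```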

### 📝 What's Happening in This Code?

**Purpose:** Quantify model performance using multiple metrics

**Key Points:**
1. **Multiple Metrics**: Each metric reveals different aspects of performance (no single metric tells the full story)
2. **MAE**: Easy to interpret (average prediction error in original units)
3. **RMSE**: Penalizes large errors more than MAE (sensitive to outliers)
4. **R²**: Shows how much variance your model explains (1 is a perfect fit, 0 matches predicting the mean, and it can go negative on held-out data)

**Choosing Metrics:**
- **MAE**: When all errors are equally important
- **RMSE**: When large errors are especially bad (e.g., safety-critical systems)
- **R²**: For comparing models and communicating with non-technical stakeholders
- **MAPE**: When relative errors matter more than absolute

In [None]:
# Calculate all metrics
mae = mean_absolute_error(y_test, y_pred_sklearn)
mse = mean_squared_error(y_test, y_pred_sklearn)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_sklearn)

# Calculate additional metrics
mape = np.mean(np.abs((y_test - y_pred_sklearn) / y_test)) * 100  # Mean Absolute Percentage Error
residuals = y_test - y_pred_sklearn

print("="*60)
print("MODEL EVALUATION METRICS")
print("="*60)
print(f"Mean Absolute Error (MAE):        {mae:.4f}")
print(f"Mean Squared Error (MSE):         {mse:.4f}")
print(f"Root Mean Squared Error (RMSE):   {rmse:.4f}")
print(f"R² Score:                         {r2:.4f}")
print(f"Mean Absolute Percentage Error:   {mape:.2f}%")
print("="*60)

# Interpretation
print("\n📊 Interpretation:")
print(f"  - On average, predictions are off by {mae:.2f} percentage points")
print(f"  - Model explains {r2*100:.2f}% of variance in yield")
if r2 > 0.9:
    print(f"  - 🎯 Excellent fit!")
elif r2 > 0.7:
    print(f"  - ✅ Good fit")
elif r2 > 0.5:
    print(f"  - ⚠️  Moderate fit - consider feature engineering")
else:
    print(f"  - ❌ Poor fit - linear model may not be appropriate")

### 5.2 Cross-Validation

Cross-validation provides a more robust estimate of model performance.

### 📝 What's Happening in This Code?

**Purpose:** Evaluate model on multiple train-test splits for robust performance estimation

**Key Points:**
1. **5-Fold CV**: Data split into 5 parts; each part serves as test set once while others train the model
2. **More Reliable Estimate**: A single train-test split might be lucky/unlucky; CV averages across multiple splits (it does not change the model itself)
3. **Spread Across Folds**: Standard deviation shows how much performance varies across folds (a rough indicator, not a formal confidence interval)
4. **Production Readiness**: Consistent CV scores suggest the model will generalize to new data drawn from the same distribution

**When to Use CV:**
- Small to medium datasets (where single split might not represent population)
- Hyperparameter tuning (coming in notebook 043)
- Model selection (comparing different algorithms)
- When you need confidence in performance estimates

In [None]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation
cv_scores = cross_val_score(lr_sklearn, X, y, cv=5, 
                            scoring='r2')
cv_rmse_scores = -cross_val_score(lr_sklearn, X, y, cv=5,
                                  scoring='neg_root_mean_squared_error')

print("="*60)
print("CROSS-VALIDATION RESULTS (5-Fold)")
print("="*60)
print(f"R² Scores per fold: {cv_scores}")
print(f"Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"\nRMSE per fold: {cv_rmse_scores}")
print(f"Mean RMSE: {cv_rmse_scores.mean():.4f} (+/- {cv_rmse_scores.std() * 2:.4f})")
print("="*60)
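
To show what happens inside a k-fold run, here is a from-scratch sketch using only NumPy. The function `kfold_r2` and the demo data are hypothetical. One difference from the cell above is an assumption of this sketch: it shuffles the rows before splitting, whereas `cross_val_score` with `cv=5` splits a regression dataset in its existing order.

```python
import numpy as np

def kfold_r2(X, y, k=5, seed=0):
    """Return one R² per fold, fitting ordinary least squares on the remaining folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit on the training folds (column of ones adds the intercept)
        A_train = np.c_[np.ones(len(train_idx)), X[train_idx]]
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold
        y_hat = np.c_[np.ones(len(test_idx)), X[test_idx]] @ beta
        ss_res = np.sum((y[test_idx] - y_hat) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return np.array(scores)

# Hypothetical demo data with a known linear signal
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + X_demo @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=100)

fold_scores = kfold_r2(X_demo, y_demo)
print(fold_scores.round(3), round(float(fold_scores.mean()), 3))
```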

### 5.3 Visualization of Results

### 📝 What's Happening in This Code?

**Purpose:** Visual diagnostics to understand model behavior and validate assumptions

**Key Points:**
1. **Predicted vs Actual**: Ideal model has all points on diagonal line; scatter indicates prediction errors
2. **Residual Plot**: Random scatter around zero = good; patterns indicate model problems (non-linearity, heteroscedasticity)
3. **Residual Distribution**: Should be bell-shaped (normal); skewed = violated assumptions
4. **Q-Q Plot**: Points on diagonal = normal distribution; deviations indicate non-normality

**Red Flags to Watch:**
- Funnel shape in residuals → heteroscedasticity (variance increases with prediction)
- Curved pattern in residuals → non-linearity (need polynomial features)
- Heavy tails in Q-Q plot → outliers affecting model

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Predicted vs Actual
ax1 = axes[0, 0]
ax1.scatter(y_test, y_pred_sklearn, alpha=0.6, s=50)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
         'r--', lw=2, label='Perfect Prediction')
ax1.set_xlabel('Actual Yield %', fontsize=12)
ax1.set_ylabel('Predicted Yield %', fontsize=12)
ax1.set_title('Predicted vs Actual', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Residuals vs Predicted
ax2 = axes[0, 1]
ax2.scatter(y_pred_sklearn, residuals, alpha=0.6, s=50)
ax2.axhline(y=0, color='r', linestyle='--', lw=2)
ax2.set_xlabel('Predicted Yield %', fontsize=12)
ax2.set_ylabel('Residuals', fontsize=12)
ax2.set_title('Residual Plot', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

# 3. Residual Distribution
ax3 = axes[1, 0]
ax3.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
ax3.axvline(x=0, color='r', linestyle='--', lw=2)
ax3.set_xlabel('Residuals', fontsize=12)
ax3.set_ylabel('Frequency', fontsize=12)
ax3.set_title('Residual Distribution', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3)

# 4. Q-Q Plot (check normality of residuals)
ax4 = axes[1, 1]
from scipy import stats
stats.probplot(residuals, dist="norm", plot=ax4)
ax4.set_title('Q-Q Plot (Normality Check)', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 Diagnostic Plots:")
print("  1. Predicted vs Actual: Points should cluster around diagonal")
print("  2. Residual Plot: Should show random scatter (no patterns)")
print("  3. Residual Distribution: Should be approximately normal")
print("  4. Q-Q Plot: Points should follow diagonal line")
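
The funnel-shape check can be complemented with a crude numeric indicator: the correlation between fitted values and absolute residuals. This is only a rough sketch, not a formal test such as Breusch-Pagan; the helper `abs_resid_trend` and the simulated residuals are assumptions for illustration.

```python
import numpy as np

def abs_resid_trend(y_pred, residuals):
    """Pearson correlation between fitted values and |residuals|."""
    return float(np.corrcoef(y_pred, np.abs(residuals))[0, 1])

# Hypothetical fitted values and two kinds of simulated residuals
rng = np.random.default_rng(0)
fitted = rng.uniform(60, 100, size=2000)
constant_noise = rng.normal(scale=1.0, size=2000)               # same spread everywhere
growing_noise = rng.normal(scale=0.05 * (fitted - 59), size=2000)  # spread grows with prediction

trend_constant = abs_resid_trend(fitted, constant_noise)
trend_growing = abs_resid_trend(fitted, growing_noise)
print(f"constant variance: {trend_constant:.3f}, growing variance: {trend_growing:.3f}")
```

A value near zero is consistent with constant variance, while a clearly positive value hints at the funnel pattern. On the notebook's model the call would be `abs_resid_trend(y_pred_sklearn, residuals)`.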

---

## 6. Model Assumptions & Diagnostics

Linear regression assumes:
1. **Linearity**: Relationship between X and y is linear
2. **Independence**: Observations are independent
3. **Homoscedasticity**: Constant variance of residuals
4. **Normality**: Residuals are normally distributed
5. **No multicollinearity**: Features are not highly correlated

Let's check these assumptions:

In [None]:
def check_assumptions(X, y, y_pred, feature_names):
    """Comprehensive assumption checking"""
    
    residuals = y - y_pred
    
    print("="*60)
    print("ASSUMPTION DIAGNOSTICS")
    print("="*60)
    
    # 1. Linearity (already visualized)
    print("\n1. LINEARITY")
    print("   ✓ Check scatter plots above - relationships appear linear")
    
    # 2. Independence (Durbin-Watson test)
    from statsmodels.stats.stattools import durbin_watson
    dw_stat = durbin_watson(residuals)
    print(f"\n2. INDEPENDENCE (Durbin-Watson)")
    print(f"   DW Statistic: {dw_stat:.4f}")
    print(f"   Interpretation: ", end='')
    if 1.5 < dw_stat < 2.5:
        print("✓ No significant autocorrelation")
    else:
        print("⚠️  Possible autocorrelation detected")
    
    # 3. Homoscedasticity (Constant Variance)
    print(f"\n3. HOMOSCEDASTICITY")
    print(f"   Check residual plot above:")
    print(f"   ✓ Residuals should be randomly scattered")
    print(f"   ✓ No 'funnel' or 'cone' shape")
    
    # 4. Normality of Residuals
    from scipy.stats import shapiro
    stat, p_value = shapiro(residuals)
    print(f"\n4. NORMALITY (Shapiro-Wilk Test)")
    print(f"   Test Statistic: {stat:.4f}")
    print(f"   P-value: {p_value:.4f}")
    if p_value > 0.05:
        print(f"   ✓ Residuals appear normally distributed (p > 0.05)")
    else:
        print(f"   ⚠️  Residuals may not be normal (p < 0.05)")
    
    # 5. Multicollinearity (VIF)
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    
    print(f"\n5. MULTICOLLINEARITY (VIF)")
    print(f"   {'Feature':<15} {'VIF':<10} {'Status'}")
    print(f"   {'-'*40}")
    
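    # VIF expects a design matrix that includes an intercept column; without it,
    # features whose mean is far from zero get badly inflated values.
    X_const = np.column_stack([np.ones(X.shape[0]), X])
    vif_data = pd.DataFrame()
    vif_data["Feature"] = feature_names
    vif_data["VIF"] = [variance_inflation_factor(X_const, i + 1) for i in range(X.shape[1])]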
    
    for _, row in vif_data.iterrows():
        status = "✓ Good" if row['VIF'] < 5 else "⚠️  High" if row['VIF'] < 10 else "❌ Very High"
        print(f"   {row['Feature']:<15} {row['VIF']:<10.2f} {status}")
    
    print(f"\n   VIF < 5: No multicollinearity")
    print(f"   VIF 5-10: Moderate multicollinearity")
    print(f"   VIF > 10: High multicollinearity (consider removing features)")
    
    print("="*60)

# Run diagnostics
check_assumptions(X_test, y_test, y_pred_sklearn, features)
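
To see what the VIF measures, the sketch below computes it from its definition, $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the other features with an intercept. The function name `vif_from_scratch` and the demo columns are hypothetical.

```python
import numpy as np

def vif_from_scratch(X):
    """VIF per column: regress each feature on the others (with intercept)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.c_[np.ones(len(X)), others]
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Hypothetical columns: a and b are independent, c is nearly a copy of a
rng = np.random.default_rng(0)
a = rng.normal(3.3, 0.05, 500)
b = rng.normal(25, 3, 500)
c = a + rng.normal(0, 0.01, 500)

vif_independent = vif_from_scratch(np.c_[a, b])
vif_collinear = vif_from_scratch(np.c_[a, b, c])
print(vif_independent.round(2))
print(vif_collinear.round(2))
```

Independent columns give values close to 1, while the near-duplicate pair pushes the VIF of both `a` and `c` well above 10.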

---

## 7. Feature Importance and Interpretation

In [None]:
# Standardize features to compare coefficient magnitudes
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on scaled data
lr_scaled = LinearRegression()
lr_scaled.fit(X_train_scaled, y_train)

# Plot feature importance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Original coefficients
ax1.barh(features, lr_sklearn.coef_)
ax1.set_xlabel('Coefficient Value', fontsize=12)
ax1.set_title('Original Coefficients', fontsize=14, fontweight='bold')
ax1.axvline(x=0, color='black', linestyle='--', linewidth=1)
ax1.grid(True, alpha=0.3)

# Standardized coefficients (fair comparison)
colors = ['green' if c > 0 else 'red' for c in lr_scaled.coef_]
ax2.barh(features, np.abs(lr_scaled.coef_), color=colors, alpha=0.7)
ax2.set_xlabel('Absolute Coefficient (Standardized)', fontsize=12)
ax2.set_title('Feature Importance (Standardized)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Interpretation
print("="*60)
print("FEATURE IMPORTANCE")
print("="*60)
importance_df = pd.DataFrame({
    'Feature': features,
    'Coefficient': lr_sklearn.coef_,
    'Abs_Std_Coef': np.abs(lr_scaled.coef_)
}).sort_values('Abs_Std_Coef', ascending=False)

print(importance_df.to_string(index=False))
print("="*60)

print("\n📊 Interpretation:")
most_important = importance_df.iloc[0]
print(f"  - Most influential: {most_important['Feature']}")
print(f"  - Effect: {most_important['Coefficient']:.4f} change in yield per unit change")

for _, row in importance_df.iterrows():
    direction = "increases" if row['Coefficient'] > 0 else "decreases"
    print(f"  - {row['Feature']}: {direction} yield")
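
The link between the two bar charts is simple: a standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation. The sketch below checks this on hypothetical data; `ols_coefficients`, `X_demo`, and `y_demo` are names introduced for the example.

```python
import numpy as np

def ols_coefficients(X, y):
    """Least-squares slopes (intercept fitted but not returned)."""
    A = np.c_[np.ones(len(X)), X]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

# Hypothetical features on very different scales
rng = np.random.default_rng(0)
X_demo = np.c_[rng.uniform(20, 30, 300), rng.uniform(3.0, 3.6, 300)]
y_demo = 5 + 0.5 * X_demo[:, 0] + 10 * X_demo[:, 1] + rng.normal(0, 0.5, 300)

# Population standard deviation (ddof=0), which is what StandardScaler uses
sigma = X_demo.std(axis=0)
X_scaled = (X_demo - X_demo.mean(axis=0)) / sigma

raw_coef = ols_coefficients(X_demo, y_demo)
std_coef = ols_coefficients(X_scaled, y_demo)
print(np.allclose(std_coef, raw_coef * sigma))
```

This is why a feature with a large raw coefficient but a narrow range (like voltage here) can rank below one with a small coefficient and a wide range.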

---

## 8. Real-World Project: Post-Silicon Validation Yield Prediction

Now let's apply everything to a realistic **post-silicon validation** scenario.

### 🔬 Post-Silicon Validation Context

```mermaid
graph TD
    A[Silicon Chip Manufacturing] --> B[Wafer Testing]
    B --> C[Die Level Tests]
    C --> D[Electrical Parameters]
    C --> E[Environmental Conditions]
    C --> F[Position on Wafer]
    D --> G[Yield Prediction Model]
    E --> G
    F --> G
    G --> H[Pass/Fail Decision]
    H --> I[Yield Optimization]
    style G fill:#FF6F00,stroke:#333,stroke-width:2px,color:#fff
    style I fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
```

### 📝 What's Happening in This Code?

**Purpose:** Generate realistic post-silicon validation data mimicking real STDF files

**Key Points:**
1. **Realistic Parameters**: Voltage, current, frequency, and temperature with ranges chosen to resemble typical semiconductor test conditions
2. **Wafer-Level Effects**: Systematic variations across wafers (process differences)
3. **Position Effects**: Edge dies have lower yield (thermal and mechanical stress)
4. **Non-linear Terms**: Temperature and humidity enter as absolute deviations from nominal, so yield is not purely linear in the raw features (more realistic than simple linear relationships)

**Post-Silicon Validation Scenarios:**
- **Electrical Tests**: VDD, VSS, leakage current, operating frequency
- **Environmental**: Temperature and humidity during test
- **Spatial**: Die position affects yield (edge effects)
- **Process**: Wafer-to-wafer variation from manufacturing

In [None]:
def generate_realistic_stdf_data(n_devices=5000):
    """
    Generate realistic STDF data for semiconductor test
    
    Real-world scenario: Predict final yield based on early-stage test parameters
    """
    np.random.seed(42)
    
    # Test parameters (realistic ranges for semiconductor testing)
    data = {
        # Electrical tests
        'vdd_voltage': np.random.normal(3.3, 0.05, n_devices),      # V
        'vss_voltage': np.random.normal(0.0, 0.02, n_devices),      # V
        'leakage_current': np.random.lognormal(-3, 0.5, n_devices), # µA
        'freq_mhz': np.random.normal(1000, 50, n_devices),          # MHz
        
        # Environmental conditions
        'temp_celsius': np.random.normal(25, 3, n_devices),         # °C
        'humidity_pct': np.random.normal(45, 5, n_devices),         # %
        
        # Process parameters
        'wafer_id': np.random.randint(1, 26, n_devices),            # Wafer 1-25
        'die_x': np.random.randint(0, 50, n_devices),               # Position X
        'die_y': np.random.randint(0, 50, n_devices),               # Position Y
    }
    
    df = pd.DataFrame(data)
    
    # Complex yield formula (realistic dependencies)
    base_yield = 85
    
    yield_score = (
        base_yield +
        50 * (df['vdd_voltage'] - 3.3) +              # Voltage deviation
        -10 * df['leakage_current'] +                  # Leakage is bad
        0.01 * (df['freq_mhz'] - 1000) +              # Frequency matters
        -0.2 * abs(df['temp_celsius'] - 25) +         # Temp deviation
        -0.1 * abs(df['humidity_pct'] - 45) +         # Humidity deviation
        np.random.normal(0, 2, n_devices)              # Random variation
    )
    
    # Add wafer-level systematic variation
    wafer_effects = df['wafer_id'] * 0.1 - 1.25
    yield_score += wafer_effects
    
    # Add position effects (edge dies have lower yield)
    # The penalty grows with distance from the wafer center at (25, 25)
    edge_penalty = (
        np.abs(df['die_x'] - 25) +
        np.abs(df['die_y'] - 25)
    ) * -0.05
    yield_score += edge_penalty
    
    # Clip to realistic range
    df['yield_score'] = np.clip(yield_score, 60, 100)
    
    return df

# Generate realistic data
stdf_df = generate_realistic_stdf_data(5000)

print("="*60)
print("REALISTIC STDF DATASET")
print("="*60)
print(f"Total devices: {len(stdf_df)}")
print(f"\nFeatures:\n{stdf_df.columns.tolist()}")
print(f"\nSample data:")
print(stdf_df.head(10))
print(f"\nStatistics:")
print(stdf_df.describe())

### 📝 What's Happening in This Code?

**Purpose:** Train and evaluate model on realistic post-silicon validation data

**Key Points:**
1. **Feature Scaling**: StandardScaler normalizes features (different units: voltage, frequency, position) to same scale
2. **Train-Test Split**: 80-20 split ensures model evaluated on unseen data
3. **Feature Importance**: Identifies which test parameters most strongly predict yield
4. **Business Metrics**: R², RMSE, MAE translate to actionable insights for manufacturing optimization

**Real-World Application:**
- Early prediction of yield reduces test time and cost
- Identifies problematic process parameters for optimization
- Enables data-driven decisions in manufacturing
- Can integrate with ATE (Automatic Test Equipment) systems

In [None]:
# Prepare data for modeling
feature_cols = ['vdd_voltage', 'vss_voltage', 'leakage_current', 'freq_mhz',
                'temp_celsius', 'humidity_pct', 'wafer_id', 'die_x', 'die_y']

X_stdf = stdf_df[feature_cols].values
y_stdf = stdf_df['yield_score'].values

# Train-test split
X_train_stdf, X_test_stdf, y_train_stdf, y_test_stdf = train_test_split(
    X_stdf, y_stdf, test_size=0.2, random_state=42
)

# Scale features
scaler_stdf = StandardScaler()
X_train_stdf_scaled = scaler_stdf.fit_transform(X_train_stdf)
X_test_stdf_scaled = scaler_stdf.transform(X_test_stdf)

# Train model
lr_stdf = LinearRegression()
lr_stdf.fit(X_train_stdf_scaled, y_train_stdf)

# Predictions
y_pred_stdf = lr_stdf.predict(X_test_stdf_scaled)

# Evaluate
r2_stdf = r2_score(y_test_stdf, y_pred_stdf)
rmse_stdf = np.sqrt(mean_squared_error(y_test_stdf, y_pred_stdf))
mae_stdf = mean_absolute_error(y_test_stdf, y_pred_stdf)

print("="*60)
print("STDF YIELD PREDICTION MODEL - RESULTS")
print("="*60)
print(f"R² Score:  {r2_stdf:.4f}")
print(f"RMSE:      {rmse_stdf:.4f} yield points")
print(f"MAE:       {mae_stdf:.4f} yield points")
print("="*60)

# Feature importance for STDF model
importance_stdf = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': np.abs(lr_stdf.coef_)
}).sort_values('Importance', ascending=False)

print("\nTop 5 Most Important Features:")
print(importance_stdf.head().to_string(index=False))

In [None]:
# Visualize STDF model results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Predicted vs Actual
ax1 = axes[0, 0]
ax1.scatter(y_test_stdf, y_pred_stdf, alpha=0.4, s=20)
ax1.plot([y_test_stdf.min(), y_test_stdf.max()], 
         [y_test_stdf.min(), y_test_stdf.max()], 'r--', lw=2)
ax1.set_xlabel('Actual Yield Score', fontsize=12)
ax1.set_ylabel('Predicted Yield Score', fontsize=12)
ax1.set_title(f'STDF Prediction Results (R²={r2_stdf:.3f})', 
              fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# 2. Feature Importance
ax2 = axes[0, 1]
ax2.barh(importance_stdf['Feature'], importance_stdf['Importance'])
ax2.set_xlabel('Importance (Abs Coefficient)', fontsize=12)
ax2.set_title('Feature Importance for Yield Prediction', 
              fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

# 3. Error Distribution
ax3 = axes[1, 0]
errors = y_test_stdf - y_pred_stdf
ax3.hist(errors, bins=50, edgecolor='black', alpha=0.7)
ax3.axvline(x=0, color='r', linestyle='--', lw=2)
ax3.set_xlabel('Prediction Error', fontsize=12)
ax3.set_ylabel('Frequency', fontsize=12)
ax3.set_title('Error Distribution', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3)

# 4. Residual Plot
ax4 = axes[1, 1]
ax4.scatter(y_pred_stdf, errors, alpha=0.4, s=20)
ax4.axhline(y=0, color='r', linestyle='--', lw=2)
ax4.set_xlabel('Predicted Yield Score', fontsize=12)
ax4.set_ylabel('Residuals', fontsize=12)
ax4.set_title('Residual Plot', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ STDF yield prediction model complete!")
print("\n💡 Business Impact:")
print(f"   - Typical prediction error (MAE): {mae_stdf:.2f} yield points")
print(f"   - Explains {r2_stdf*100:.1f}% of yield variation on the test set")
print(f"   - Top factors: {', '.join(importance_stdf.head(3)['Feature'].tolist())}")

---

## 9. Advanced Topics

### 🚀 Extending Linear Regression

```mermaid
graph TD
    A[Basic Linear Regression] --> B[Polynomial Features]
    A --> C[Regularization]
    A --> D[Feature Engineering]
    B --> E[Capture Non-linearity]
    C --> F[Ridge - L2]
    C --> G[Lasso - L1]
    C --> H[ElasticNet]
    D --> I[Interactions]
    D --> J[Transformations]
    style A fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
```

### 9.1 Polynomial Features (Handling Non-linearity)

### 📝 What's Happening in This Code?

**Purpose:** Extend linear regression to capture non-linear relationships

**Key Points:**
1. **Polynomial Features**: Creates squared terms (x²) and interaction terms (x₁ × x₂) from original features
2. **Still Linear Model**: Despite polynomial features, it's still linear in coefficients (linear regression can fit it)
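3. **Curse of Dimensionality**: 9 features → 54 polynomial features (9 original + 9 squared + 36 pairwise interactions; the count grows quickly with more features or a higher degree)
4. **Trade-off**: Better fit on training data, but a higher risk of overfitting, which shows up as worse performance on test data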

**When to Use:**
- Clear non-linear patterns in scatter plots
- Domain knowledge suggests quadratic/interaction effects (e.g., voltage × current = power)
- Linear model underperforming despite good data
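
The growth in feature count can be checked with a short calculation. A full degree-d expansion of n features, without the bias column, has C(n + d, d) − 1 terms. The sketch below uses a hypothetical helper, `count_poly_features` (an illustrative name, not a library function), to reproduce the 54 features for degree 2 and show what higher degrees would cost.

```python
from math import comb


def count_poly_features(n_features: int, degree: int) -> int:
    """Count monomials of total degree 1..degree in n_features variables.

    This equals the number of columns in a full polynomial expansion
    without the constant (bias) column. Illustrative helper only.
    """
    return comb(n_features + degree, degree) - 1


# Feature counts for the 9 original features at increasing degrees
for d in (1, 2, 3, 4):
    print(f"degree={d}: {count_poly_features(9, d)} features")
```

For 9 input features this gives 9, 54, 219, and 714 columns for degrees 1 through 4, which is why degree 2 is usually the practical starting point.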

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_stdf_scaled)
X_test_poly = poly.transform(X_test_stdf_scaled)

print(f"Original features: {X_train_stdf_scaled.shape[1]}")
print(f"Polynomial features: {X_train_poly.shape[1]}")
print("(Includes interactions and squares)")

# Train polynomial model
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train_stdf)
y_pred_poly = lr_poly.predict(X_test_poly)

# Compare
r2_poly = r2_score(y_test_stdf, y_pred_poly)
rmse_poly = np.sqrt(mean_squared_error(y_test_stdf, y_pred_poly))

print("\n" + "="*60)
print("POLYNOMIAL FEATURES COMPARISON")
print("="*60)
print(f"{'Metric':<20} {'Linear':<15} {'Polynomial':<15}")
print("-"*60)
print(f"{'R² Score':<20} {r2_stdf:<15.4f} {r2_poly:<15.4f}")
print(f"{'RMSE':<20} {rmse_stdf:<15.4f} {rmse_poly:<15.4f}")
print("="*60)

if r2_poly > r2_stdf:
    print("✅ Polynomial features improved performance!")
else:
    print("⚠️  Linear model is sufficient for this data")

### 9.2 Regularization Preview (Ridge & Lasso)

When you have many features, regularization helps prevent overfitting.
We'll cover this in detail in **012_Ridge_Lasso_ElasticNet.ipynb**.

### 📝 What's Happening in This Code?

**Purpose:** Introduce regularization techniques for preventing overfitting

**Key Points:**
1. **Ridge (L2)**: Shrinks coefficients toward zero but keeps all features (adds penalty: α∑β²)
2. **Lasso (L1)**: Shrinks some coefficients exactly to zero = automatic feature selection (adds penalty: α∑|β|)
3. **Alpha Parameter**: Controls regularization strength (higher α = more regularization)
4. **When Needed**: Many features, multicollinearity, or polynomial features (54 features here!)
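
Written out in the notation of the linear equation from the start of this notebook, the two penalties are added to the squared-error loss as shown here. This is a sketch of the standard formulations: $m$ (number of training samples) and $\hat{y}_i$ (the model's prediction for sample $i$) are symbols introduced for this illustration, and the intercept $\beta_0$ is conventionally left unpenalized. The $\frac{1}{2m}$ scaling on the Lasso loss follows scikit-learn's convention, so the same `alpha` value does not mean the same regularization strength for `Ridge` and `Lasso`.

$$\text{Ridge:} \quad \min_{\beta} \; \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{n} \beta_j^2$$

$$\text{Lasso:} \quad \min_{\beta} \; \frac{1}{2m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{n} |\beta_j|$$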

**Preview of Notebook 012:**
- Detailed math behind regularization
- Cross-validated alpha selection
- Feature selection with Lasso
- ElasticNet (combination of Ridge + Lasso)

In [None]:
from sklearn.linear_model import Ridge, Lasso

# Ridge regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_poly, y_train_stdf)
y_pred_ridge = ridge.predict(X_test_poly)
r2_ridge = r2_score(y_test_stdf, y_pred_ridge)

# Lasso regression (L1 regularization)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_poly, y_train_stdf)
y_pred_lasso = lasso.predict(X_test_poly)
r2_lasso = r2_score(y_test_stdf, y_pred_lasso)

print("="*60)
print("REGULARIZATION COMPARISON (Preview)")
print("="*60)
print(f"{'Model':<20} {'R² Score':<15} {'Non-zero Coefs'}")
print("-"*60)
print(f"{'Linear':<20} {r2_stdf:<15.4f} {len(feature_cols)}")
print(f"{'Polynomial':<20} {r2_poly:<15.4f} {X_train_poly.shape[1]}")
print(f"{'Ridge':<20} {r2_ridge:<15.4f} {X_train_poly.shape[1]}")
print(f"{'Lasso':<20} {r2_lasso:<15.4f} {np.sum(lasso.coef_ != 0)}")
print("="*60)
print("\n💡 Lasso performs feature selection by setting some coefficients to zero")
print("   → More in notebook 012!")

---

## 10. Key Takeaways

### ✅ When Linear Regression Works Well:
- Linear relationships between features and target
- Continuous target variable
- Need for interpretability
- Fast training/prediction required
- Features are not highly correlated

### ⚠️ Limitations:
- Cannot capture non-linear relationships (use polynomial features or other models)
- Sensitive to outliers (consider robust regression)
- Assumes linear additive effects
- Multicollinearity causes unstable coefficients

### 🎯 Best Practices:
1. Always visualize relationships first (EDA)
2. Check model assumptions
3. Use cross-validation for robust evaluation
4. Scale features for fair coefficient comparison
5. Test for multicollinearity (VIF)
6. Examine residual plots
7. Consider regularization for many features
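
Practices 3 and 5 can be sketched in a few lines. The example below runs on synthetic data built for illustration (`x3` is deliberately almost a copy of `x1`), and `compute_vif` is a hypothetical helper that derives each VIF from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data: x3 is nearly a copy of x1 (collinear pair)
rng = np.random.default_rng(0)
n_samples = 500
x1 = rng.normal(size=n_samples)
x2 = rng.normal(size=n_samples)
x3 = x1 + rng.normal(scale=0.05, size=n_samples)
X_demo = np.column_stack([x1, x2, x3])
y_demo = 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n_samples)

# Practice 3: 5-fold cross-validation. Putting the scaler inside the
# pipeline means it is refit on each training fold (no leakage).
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv_r2 = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print(f"CV R²: {cv_r2.mean():.3f} ± {cv_r2.std():.3f}")


# Practice 5: variance inflation factor per feature (illustrative helper)
def compute_vif(X: np.ndarray) -> np.ndarray:
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R² of feature j explained by all the other features
        r2_j = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)


vif_values = compute_vif(X_demo)
for name, v in zip(["x1", "x2", "x3"], vif_values):
    print(f"VIF({name}) = {v:.1f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity. Here the collinear pair `x1` and `x3` should stand out, while `x2` should stay close to 1.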

### 📈 Next Models to Learn:
When linear regression isn't enough, progress to:
- **011_Polynomial_Regression.ipynb** - Handle non-linear relationships
- **012_Ridge_Lasso_ElasticNet.ipynb** - Regularization techniques
- **013_Logistic_Regression.ipynb** - Classification problems
- **016_Decision_Trees.ipynb** - Complex non-linear patterns

---

## 11. Real-World Projects

### 🔬 Post-Silicon Validation Projects

#### Project 1: Device Power Consumption Predictor
**Objective:** Predict power consumption from voltage, frequency, and temperature
- Load STDF power test data
- Engineer features: voltage × current, frequency bins
- Build linear regression model
- Validate against specifications
- **Business Value:** Early detection of high-power devices, yield optimization

#### Project 2: Test Time Estimator
**Objective:** Estimate test execution time from test complexity metrics
- Features: number of test patterns, vector count, clock frequency
- Handle time-series aspects (sequential tests)
- Predict total test time per device
- **Business Value:** ATE scheduling optimization, capacity planning

#### Project 3: Parametric Yield Prediction
**Objective:** Predict final yield based on early parametric test results
- Features: electrical parameters (Vdd, Vss, leakage, frequency)
- Wafer-level and die-level spatial features
- Environmental conditions (temp, humidity)
- **Business Value:** Early yield prediction, process optimization feedback

#### Project 4: Voltage-Frequency Characterization
**Objective:** Model voltage-frequency operating curves (V-F curves)
- Non-linear relationship (may need polynomial features)
- Device-to-device variation modeling
- Guardband prediction for production test
- **Business Value:** Optimized test limits, reduced guardbands, higher yield
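
As a starting point for Project 4, the sketch below compares a straight-line fit with a degree-2 polynomial fit on a curved voltage-frequency relationship. The data is synthetic: the coefficients, noise level, and voltage range are made up for illustration and do not come from any real device, and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic V-F data (illustrative coefficients, not measurements)
rng = np.random.default_rng(42)
voltage = rng.uniform(0.7, 1.1, size=400)        # supply voltage in volts
dv = voltage - 0.7
fmax_mhz = 1000 + 3000 * dv - 3500 * dv**2 + rng.normal(scale=15, size=400)

X_vf = voltage.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_vf, fmax_mhz, test_size=0.2, random_state=42
)

# Straight-line model vs. model with an added squared voltage term
linear_vf = make_pipeline(StandardScaler(), LinearRegression())
quad_vf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)

r2_linear_vf = r2_score(y_te, linear_vf.fit(X_tr, y_tr).predict(X_te))
r2_quad_vf = r2_score(y_te, quad_vf.fit(X_tr, y_tr).predict(X_te))

print(f"Linear    R²: {r2_linear_vf:.4f}")
print(f"Quadratic R²: {r2_quad_vf:.4f}")
```

On real characterization data, the residual plot of the straight-line model is a quick way to judge whether the extra polynomial term is justified.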

### 💡 General AI/ML Projects

#### Project 5: Sales Forecasting
**Objective:** Predict monthly sales from marketing spend and seasonality
- Features: advertising budget, month, previous sales
- Time series considerations
- Interpretable coefficients for business decisions

#### Project 6: Real Estate Price Prediction
**Objective:** Estimate house prices from features
- Features: square footage, bedrooms, location, age
- Feature engineering: price per sq ft, neighborhood clusters
- Compare linear vs non-linear models

#### Project 7: Customer Lifetime Value (CLV)
**Objective:** Predict customer value from behavior metrics
- Features: purchase frequency, average order value, tenure
- Interaction effects (frequency × value)
- Segmentation for targeted marketing

#### Project 8: Energy Consumption Forecasting
**Objective:** Predict building energy usage
- Features: temperature, occupancy, time of day, day of week
- Seasonal patterns and trends
- Optimization recommendations

---

### 🎯 Project Implementation Template

```python
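# 0. Imports used by the steps below
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load and explore data
df = pd.read_csv('your_data.csv')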
# or for STDF: df = parse_stdf_file('test_results.stdf')

# 2. EDA and visualization
# - Scatter plots
# - Correlation matrix
# - Distribution checks

# 3. Feature engineering
# - Create interactions
# - Transform variables
# - Handle missing data

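# 4. Train-test split ('target' is a placeholder column name; use your own)
X = df.drop(columns=['target']).values
y = df['target'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)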

# 5. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 6. Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# 7. Evaluate
y_pred = model.predict(X_test_scaled)
print(f"R²: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

# 8. Interpret and deploy
# - Feature importance
# - Diagnostic plots
# - Save model for production
```
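
For the "Save model for production" step, one common approach is to bundle the scaler and the regression in a single scikit-learn pipeline and persist it with `joblib`. The sketch below is a minimal illustration on synthetic stand-in data: the file name `linear_model.joblib` and the temporary directory are assumptions, not a recommended layout.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own training set
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(200, 3))
y_train_demo = (
    X_train_demo @ np.array([1.5, -2.0, 0.5])
    + rng.normal(scale=0.1, size=200)
)

# Bundle scaling and regression so they are saved and applied together
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train_demo, y_train_demo)

# Persist to disk (temporary directory here, for illustration only)
model_path = os.path.join(tempfile.mkdtemp(), "linear_model.joblib")
joblib.dump(pipeline, model_path)

# Later, e.g. in a scoring script: load and predict on raw, unscaled inputs
loaded = joblib.load(model_path)
X_new = rng.normal(size=(5, 3))
preds_original = pipeline.predict(X_new)
preds_loaded = loaded.predict(X_new)
print("Predictions match after reload:", np.allclose(preds_original, preds_loaded))
```

Saving the whole pipeline rather than the bare model avoids a common deployment bug, where new data is passed to the model without the scaling that was applied during training.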

---

**Congratulations! 🎉**

You've mastered linear regression - the foundation of machine learning!

**Next Steps:**
- → **011_Polynomial_Regression.ipynb** for non-linear relationships
- → **041_Feature_Engineering_Masterclass.ipynb** for advanced feature creation
- → **042_Model_Evaluation_Metrics.ipynb** for deeper metric understanding

---

**Notebook Complete!** ✅