# Linear Regression Exercise Solutions

---

This notebook contains detailed solutions to all exercises from the Linear Regression: Univariate notebook. Each solution includes:
- Complete working code
- Detailed explanations of the approach
- Business interpretation of results
- Key insights and takeaways

**Prerequisites:** Make sure you have run the main linear regression notebook first to have the necessary data and models loaded.

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Recreate the banking dataset and model from the main notebook
# This ensures we have all necessary variables for the exercises

banking_url = "https://raw.githubusercontent.com/umatter/EDFB/main/data/banking.csv"
dataset = pd.read_csv(banking_url)

# Prepare the data exactly as in the main notebook
df_banking = dataset.copy()
df_banking['was_previously_contacted'] = (df_banking['pdays'] != 999).astype(int)
df_banking['pdays_clean'] = df_banking['pdays'].replace(999, np.nan)
df_banking['pdays_clean'] = df_banking['pdays_clean'].fillna(df_banking['pdays_clean'].median())

feature_cols_cat = ['marital', 'education', 'housing', 'loan', 'contact', 'poutcome']
feature_cols_num = ['age', 'previous', 'pdays_clean', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'was_previously_contacted']
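# dtype=float keeps the dummy columns numeric (pandas >= 2.0 creates bool dummies by default),
# so X_df_banking.values is a float array rather than an object array
X_df_banking = pd.get_dummies(df_banking[feature_cols_cat + feature_cols_num], drop_first=True, dtype=float)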
X_banking = X_df_banking.values
y_banking = np.log1p(df_banking['duration']).values.reshape(-1,1)

# Split the data
X_train_banking, X_test_banking, y_train_banking, y_test_banking = train_test_split(
    X_banking, y_banking, test_size=0.2, random_state=0
)

# Train the model
model_banking = LinearRegression()
model_banking.fit(X_train_banking, y_train_banking)
y_test_predicted_banking = model_banking.predict(X_test_banking)

print("Banking dataset prepared successfully!")
print(f"Features: {X_df_banking.shape[1]}")
print(f"Training samples: {X_train_banking.shape[0]}")
print(f"Test samples: {X_test_banking.shape[0]}")

## Exercise 1: Understanding Model Coefficients

**Task:** Interpret the coefficients from the banking dataset model and understand their business meaning.

### Solution Approach:
1. Extract coefficients and feature names
2. Create a readable DataFrame
3. Identify most influential features
4. Provide business interpretation

In [None]:
# Exercise 1 Solution

# Create DataFrame with feature names and coefficients
coef_df = pd.DataFrame({
    'Feature': X_df_banking.columns,
    'Coefficient': model_banking.coef_[0],
    'Abs_Coefficient': np.abs(model_banking.coef_[0])
})

# Sort by absolute value to see most influential features
coef_df_sorted = coef_df.sort_values('Abs_Coefficient', ascending=False)

print("=== MODEL COEFFICIENTS ANALYSIS ===")
print(f"Intercept: {model_banking.intercept_[0]:.4f}")
print("\nAll coefficients (sorted by absolute magnitude):")
print(coef_df_sorted.to_string(index=False))

print("\n=== TOP 5 POSITIVE COEFFICIENTS ===")
top_positive = coef_df[coef_df['Coefficient'] > 0].nlargest(5, 'Coefficient')
print(top_positive[['Feature', 'Coefficient']].to_string(index=False))

print("\n=== TOP 5 NEGATIVE COEFFICIENTS ===")
top_negative = coef_df[coef_df['Coefficient'] < 0].nsmallest(5, 'Coefficient')
print(top_negative[['Feature', 'Coefficient']].to_string(index=False))

In [None]:
# Visualize the most important coefficients
plt.figure(figsize=(12, 8))
top_features = coef_df_sorted.head(15)  # Top 15 most influential

colors = ['red' if x < 0 else 'blue' for x in top_features['Coefficient']]
plt.barh(range(len(top_features)), top_features['Coefficient'], color=colors, alpha=0.7)
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.xlabel('Coefficient Value')
plt.title('Top 15 Most Influential Features (by Absolute Coefficient Value)')
plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

### Business Interpretation:

**Key Insights from Coefficients:**

1. **Positive Coefficients (increase call duration):**
   - Features with positive coefficients tend to increase the log duration of calls
   - These might indicate more engaged customers or complex inquiries

2. **Negative Coefficients (decrease call duration):**
   - Features with negative coefficients tend to decrease call duration
   - These might indicate quick decisions or less complex situations

3. **Magnitude Interpretation:**
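   - Since we're predicting log(1 + duration), a coefficient of 0.1 means that a one-unit increase in the feature raises duration by approximately 10% (exactly exp(0.1) - 1 ≈ 10.5%), holding the other features fixed
   - Coefficient magnitudes are only directly comparable when features share a scale; the features here are unscaled, so a large coefficient on a narrow-range feature does not by itself mean a large impact on call duration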
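
The percentage reading of a log-scale coefficient can be made exact with exp(β) - 1. The sketch below is only an illustration: the helper name `pct_change_from_log_coef` is hypothetical, and because the notebook's target is `log1p(duration)`, the result strictly applies to `duration + 1`, which is close to `duration` for all but the shortest calls.

```python
import numpy as np

def pct_change_from_log_coef(beta, delta=1.0):
    """Percent change in the un-logged target for a `delta`-unit increase in a feature."""
    return (np.exp(beta * delta) - 1.0) * 100.0

# Compare the "coefficient x 100" shortcut with the exact conversion
for beta in [0.01, 0.1, 0.5, -0.3]:
    exact = pct_change_from_log_coef(beta)
    print(f"beta = {beta:+.2f}: shortcut {beta * 100:+.1f}%, exact {exact:+.1f}%")
```

The shortcut is accurate for small coefficients (0.1 gives about 10.5%) but drifts for larger ones (0.5 gives about 64.9%, not 50%).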

**Business Applications:**
- Use high-impact features for call center resource planning
- Identify customer segments that require more time
- Optimize call routing based on expected duration
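
Ranking features by raw coefficient size can mislead when features are measured on different scales. One common remedy is to express each coefficient per standard deviation of its feature. The following is a minimal sketch on simulated data; the feature names `age` and `rate` and all numbers are assumptions chosen for illustration, not values from the banking dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(40, 10, n)       # wide-range feature (simulated)
rate = rng.normal(0.0, 0.01, n)   # narrow-range feature (simulated)
y = 0.02 * age + 5.0 * rate + rng.normal(0, 0.1, n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), age, rate])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

raw = beta[1:]                                      # effect per one unit of each feature
per_sd = raw * np.array([age.std(), rate.std()])    # effect per one standard deviation

for name, r, s in zip(["age", "rate"], raw, per_sd):
    print(f"{name:>4}: raw = {r:.4f}, per-SD = {s:.4f}")
```

Here `rate` has the larger raw coefficient, yet a typical one-standard-deviation move in `age` shifts the target more, so the two rankings disagree.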

## Exercise 2: Residual Analysis

**Task:** Perform a comprehensive residual analysis to check model assumptions.

### Solution Approach:
1. Calculate residuals for training and test sets
2. Create diagnostic plots
3. Test for normality
4. Check for patterns indicating assumption violations

In [None]:
# Exercise 2 Solution

# Calculate residuals
y_train_pred = model_banking.predict(X_train_banking)
y_test_pred = model_banking.predict(X_test_banking)

residuals_train = y_train_banking.ravel() - y_train_pred.ravel()
residuals_test = y_test_banking.ravel() - y_test_pred.ravel()

print("=== RESIDUAL ANALYSIS ===")
print(f"Training residuals - Mean: {residuals_train.mean():.6f}, Std: {residuals_train.std():.4f}")
print(f"Test residuals - Mean: {residuals_test.mean():.6f}, Std: {residuals_test.std():.4f}")

# Create comprehensive diagnostic plots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Residuals vs Fitted (Training)
axes[0,0].scatter(y_train_pred, residuals_train, alpha=0.5)
axes[0,0].axhline(y=0, color='red', linestyle='--')
axes[0,0].set_xlabel('Fitted Values')
axes[0,0].set_ylabel('Residuals')
axes[0,0].set_title('Residuals vs Fitted (Training)')

# 2. Residuals vs Fitted (Test)
axes[0,1].scatter(y_test_pred, residuals_test, alpha=0.5, color='orange')
axes[0,1].axhline(y=0, color='red', linestyle='--')
axes[0,1].set_xlabel('Fitted Values')
axes[0,1].set_ylabel('Residuals')
axes[0,1].set_title('Residuals vs Fitted (Test)')

# 3. Histogram of residuals (Training)
axes[0,2].hist(residuals_train, bins=50, alpha=0.7, density=True)
axes[0,2].set_xlabel('Residuals')
axes[0,2].set_ylabel('Density')
axes[0,2].set_title('Distribution of Residuals (Training)')

# 4. Q-Q plot (Training)
from scipy import stats
stats.probplot(residuals_train, dist="norm", plot=axes[1,0])
axes[1,0].set_title('Q-Q Plot (Training)')

# 5. Residuals over time/index (Test)
axes[1,1].plot(residuals_test, alpha=0.7)
axes[1,1].axhline(y=0, color='red', linestyle='--')
axes[1,1].set_xlabel('Observation Index')
axes[1,1].set_ylabel('Residuals')
axes[1,1].set_title('Residuals Over Index (Test)')

# 6. Scale-Location plot (uses raw residuals; the textbook version uses standardized residuals)
sqrt_abs_resid = np.sqrt(np.abs(residuals_test))
axes[1,2].scatter(y_test_pred, sqrt_abs_resid, alpha=0.5, color='green')
axes[1,2].set_xlabel('Fitted Values')
axes[1,2].set_ylabel('√|Residuals|')
axes[1,2].set_title('Scale-Location Plot (Test)')

plt.tight_layout()
plt.show()

In [None]:
# Statistical tests for normality
from scipy.stats import shapiro, jarque_bera, anderson

print("=== NORMALITY TESTS ===")

# Shapiro-Wilk test (good for smaller samples)
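if len(residuals_test) <= 5000:  # SciPy warns that the p-value may be inaccurate above 5000 samples
    shapiro_stat, shapiro_p = shapiro(residuals_test)
    print(f"Shapiro-Wilk Test: statistic={shapiro_stat:.4f}, p-value={shapiro_p:.6f}")
    print(f"Interpretation: {'Residuals appear normal' if shapiro_p > 0.05 else 'Residuals deviate from normality'}")
else:
    # Make the skip visible instead of silently printing nothing
    print(f"Shapiro-Wilk Test: skipped (n={len(residuals_test)} exceeds 5000)")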

# Jarque-Bera test
jb_stat, jb_p = jarque_bera(residuals_test)
print(f"\nJarque-Bera Test: statistic={jb_stat:.4f}, p-value={jb_p:.6f}")
print(f"Interpretation: {'Residuals appear normal' if jb_p > 0.05 else 'Residuals deviate from normality'}")

# Anderson-Darling test
ad_stat, ad_critical, ad_significance = anderson(residuals_test, dist='norm')
print(f"\nAnderson-Darling Test: statistic={ad_stat:.4f}")
# A test never "accepts" the null hypothesis; it either rejects it or fails to reject it
for crit, sig in zip(ad_critical, ad_significance):
    print(f"  At {sig}% significance: critical value = {crit:.4f}, {'REJECT normality' if ad_stat > crit else 'fail to reject normality'}")

### Residual Analysis Interpretation:

**What to Look For:**

1. **Residuals vs Fitted:**
   - Should show random scatter around zero
   - Patterns indicate model misspecification
   - Funnel shapes indicate heteroscedasticity

2. **Normality of Residuals:**
   - Histogram should be approximately bell-shaped
   - Q-Q plot points should follow the diagonal line
   - Statistical tests provide formal assessment

3. **Scale-Location Plot:**
   - Visual check for homoscedasticity (constant variance)
   - Should show random scatter, not increasing/decreasing trend
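
A funnel shape can also be summarized numerically. The sketch below is a crude check, not a formal test: it fits a line to simulated data and correlates the absolute residuals with the fitted values. The data, the helper name `abs_resid_corr`, and the noise settings are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(1, 10, n)

def abs_resid_corr(y):
    """Correlation between fitted values and |residuals| from a simple linear fit."""
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    return np.corrcoef(fitted, np.abs(resid))[0, 1]

y_const = 2.0 * x + rng.normal(0, 1.0, n)               # constant noise variance
y_funnel = 2.0 * x + rng.normal(0, 1.0, n) * 0.5 * x    # noise grows with x

corr_const = abs_resid_corr(y_const)
corr_funnel = abs_resid_corr(y_funnel)
print(f"Constant variance: corr = {corr_const:.3f}")
print(f"Funnel shape:      corr = {corr_funnel:.3f}")
```

A correlation near zero is consistent with constant variance, while a clearly positive value matches the widening funnel seen in a residuals-vs-fitted plot.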

**Common Issues and Solutions:**
- **Non-normality:** Consider transformations or robust regression
- **Heteroscedasticity:** Use weighted least squares or robust standard errors
- **Patterns in residuals:** Add missing variables or interaction terms
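
As one illustration of the heteroscedasticity remedy, weighted least squares divides each observation by its noise standard deviation before fitting. This is a minimal sketch on simulated data in which the noise scale `sigma` is assumed known; in practice it would have to be estimated, which this example does not attempt.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(1, 10, n)
sigma = 0.5 * x                                   # noise sd grows with x (assumed known)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n) * sigma

X = np.column_stack([np.ones(n), x])

# Ordinary least squares: every observation weighted equally
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares: scale rows by 1/sigma, i.e. weights of 1/sigma^2
w = 1.0 / sigma
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

print(f"OLS intercept, slope: {beta_ols}")
print(f"WLS intercept, slope: {beta_wls}")
```

Both estimators are unbiased here; the weighted fit simply relies more on the low-noise observations, which generally gives tighter estimates.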

## Exercise 3: Feature Engineering

**Task:** Create new features and see if they improve model performance.

### Solution Approach:
1. Create interaction terms between numerical features
2. Add polynomial (squared) terms
3. Train new model with engineered features
4. Compare performance with original model

In [None]:
# Exercise 3 Solution

print("=== FEATURE ENGINEERING ===")

# Start with original features
X_engineered = X_df_banking.copy()
print(f"Original features: {X_engineered.shape[1]}")

# 1. Create interaction terms between key numerical features
numerical_features = ['age', 'previous', 'pdays_clean', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed']
available_numerical = [col for col in numerical_features if col in X_engineered.columns]

print(f"\nCreating interactions between: {available_numerical}")

# Create some meaningful interactions (not all combinations to avoid overfitting)
interaction_pairs = [
    ('age', 'previous'),  # Age and previous contacts
    ('emp_var_rate', 'cons_conf_idx'),  # Economic indicators
    ('cons_price_idx', 'euribor3m'),  # Economic indicators
]

for feat1, feat2 in interaction_pairs:
    if feat1 in X_engineered.columns and feat2 in X_engineered.columns:
        interaction_name = f'{feat1}_x_{feat2}'
        X_engineered[interaction_name] = X_engineered[feat1] * X_engineered[feat2]
        print(f"Created interaction: {interaction_name}")

# 2. Add polynomial (squared) terms for key numerical features
polynomial_features = ['age', 'previous', 'emp_var_rate', 'cons_conf_idx']
for feat in polynomial_features:
    if feat in X_engineered.columns:
        squared_name = f'{feat}_squared'
        X_engineered[squared_name] = X_engineered[feat] ** 2
        print(f"Created polynomial term: {squared_name}")

print(f"\nTotal features after engineering: {X_engineered.shape[1]}")
print(f"Added {X_engineered.shape[1] - X_df_banking.shape[1]} new features")

In [None]:
# Split the engineered dataset
X_eng_train, X_eng_test, y_eng_train, y_eng_test = train_test_split(
    X_engineered.values, y_banking, test_size=0.2, random_state=0
)

# Train model with engineered features
model_engineered = LinearRegression()
model_engineered.fit(X_eng_train, y_eng_train)

# Make predictions
y_eng_train_pred = model_engineered.predict(X_eng_train)
y_eng_test_pred = model_engineered.predict(X_eng_test)

# Calculate performance metrics
print("=== MODEL COMPARISON ===")

# Original model performance
r2_orig_train = r2_score(y_train_banking, y_train_pred)
r2_orig_test = r2_score(y_test_banking, y_test_pred)
rmse_orig_train = np.sqrt(mean_squared_error(y_train_banking, y_train_pred))
rmse_orig_test = np.sqrt(mean_squared_error(y_test_banking, y_test_pred))

# Engineered model performance
r2_eng_train = r2_score(y_eng_train, y_eng_train_pred)
r2_eng_test = r2_score(y_eng_test, y_eng_test_pred)
rmse_eng_train = np.sqrt(mean_squared_error(y_eng_train, y_eng_train_pred))
rmse_eng_test = np.sqrt(mean_squared_error(y_eng_test, y_eng_test_pred))

comparison_df = pd.DataFrame({
    'Metric': ['R² Train', 'R² Test', 'RMSE Train', 'RMSE Test'],
    'Original Model': [r2_orig_train, r2_orig_test, rmse_orig_train, rmse_orig_test],
    'Engineered Model': [r2_eng_train, r2_eng_test, rmse_eng_train, rmse_eng_test],
    'Improvement': [
        r2_eng_train - r2_orig_train,
        r2_eng_test - r2_orig_test,
        rmse_orig_train - rmse_eng_train,  # Negative means worse (higher RMSE)
        rmse_orig_test - rmse_eng_test
    ]
})

print(comparison_df.to_string(index=False, float_format='%.6f'))

# Check for overfitting
print("\n=== OVERFITTING CHECK ===")
print(f"Original model - Train/Test R² gap: {r2_orig_train - r2_orig_test:.6f}")
print(f"Engineered model - Train/Test R² gap: {r2_eng_train - r2_eng_test:.6f}")
print(f"Gap increase: {(r2_eng_train - r2_eng_test) - (r2_orig_train - r2_orig_test):.6f}")

In [None]:
# Visualize the most important new features
new_features = [col for col in X_engineered.columns if col not in X_df_banking.columns]
new_feature_indices = [X_engineered.columns.get_loc(col) for col in new_features]
new_feature_coefs = model_engineered.coef_[0][new_feature_indices]

new_coef_df = pd.DataFrame({
    'Feature': new_features,
    'Coefficient': new_feature_coefs,
    'Abs_Coefficient': np.abs(new_feature_coefs)
}).sort_values('Abs_Coefficient', ascending=False)

print("\n=== NEW ENGINEERED FEATURES IMPORTANCE ===")
print(new_coef_df.to_string(index=False))

# Plot new features
if len(new_features) > 0:
    plt.figure(figsize=(10, 6))
    colors = ['red' if x < 0 else 'blue' for x in new_coef_df['Coefficient']]
    plt.barh(range(len(new_coef_df)), new_coef_df['Coefficient'], color=colors, alpha=0.7)
    plt.yticks(range(len(new_coef_df)), new_coef_df['Feature'])
    plt.xlabel('Coefficient Value')
    plt.title('Coefficients of Engineered Features')
    plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
    plt.tight_layout()
    plt.show()

### Feature Engineering Insights:

**Key Findings:**

1. **Performance Impact:**
   - Compare R² and RMSE improvements
   - Check if improvements are consistent between train/test

2. **Overfitting Risk:**
   - Monitor train/test performance gap
   - Large gaps indicate overfitting

3. **Feature Importance:**
   - Interaction terms can capture non-linear relationships
   - Polynomial terms model curvature in relationships

**Business Applications:**
- Interaction terms reveal how features work together
- Polynomial terms capture diminishing returns or accelerating effects
- Balance complexity with interpretability
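
To make the point about curvature concrete, the sketch below fits a straight line and a quadratic to simulated data with diminishing returns. The data-generating numbers and the helper name `fit_r2` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(0, 5, n)
y = 3.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, n)   # gains flatten, then reverse

def fit_r2(X, y):
    """Least-squares fit; returns in-sample R² and the coefficient vector."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    return r2, beta

X_lin = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x**2])

r2_lin, _ = fit_r2(X_lin, y)
r2_quad, beta_quad = fit_r2(X_quad, y)

print(f"Linear fit R²:    {r2_lin:.4f}")
print(f"Quadratic fit R²: {r2_quad:.4f}")
print(f"Estimated squared-term coefficient: {beta_quad[2]:.4f}")
```

The negative coefficient on the squared term is what encodes diminishing returns. In-sample R² can never decrease when a term is added, so the comparison that matters for model choice is the one on held-out data.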

## Exercise 4: Cross-Validation

**Task:** Use cross-validation to get a more robust estimate of model performance.

### Solution Approach:
1. Implement 5-fold cross-validation
2. Compare with baseline model
3. Analyze stability of performance
4. Discuss implications

In [None]:
# Exercise 4 Solution

from sklearn.model_selection import cross_val_score, KFold

print("=== CROSS-VALIDATION ANALYSIS ===")

# Set up cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# 1. Cross-validate original model
cv_scores_r2 = cross_val_score(model_banking, X_banking, y_banking.ravel(), 
                               cv=cv, scoring='r2')
cv_scores_rmse = -cross_val_score(model_banking, X_banking, y_banking.ravel(), 
                                  cv=cv, scoring='neg_root_mean_squared_error')

print("Original Model Cross-Validation Results:")
print(f"R² scores: {cv_scores_r2}")
print(f"R² mean: {cv_scores_r2.mean():.6f} ± {cv_scores_r2.std():.6f}")
print(f"RMSE scores: {cv_scores_rmse}")
print(f"RMSE mean: {cv_scores_rmse.mean():.6f} ± {cv_scores_rmse.std():.6f}")

# 2. Cross-validate engineered model
cv_scores_eng_r2 = cross_val_score(model_engineered, X_engineered.values, y_banking.ravel(), 
                                   cv=cv, scoring='r2')
cv_scores_eng_rmse = -cross_val_score(model_engineered, X_engineered.values, y_banking.ravel(), 
                                      cv=cv, scoring='neg_root_mean_squared_error')

print("\nEngineered Model Cross-Validation Results:")
print(f"R² scores: {cv_scores_eng_r2}")
print(f"R² mean: {cv_scores_eng_r2.mean():.6f} ± {cv_scores_eng_r2.std():.6f}")
print(f"RMSE scores: {cv_scores_eng_rmse}")
print(f"RMSE mean: {cv_scores_eng_rmse.mean():.6f} ± {cv_scores_eng_rmse.std():.6f}")

# 3. Baseline model (predict mean)
dummy_regressor = DummyRegressor(strategy='mean')
cv_scores_dummy_r2 = cross_val_score(dummy_regressor, X_banking, y_banking.ravel(), 
                                     cv=cv, scoring='r2')
cv_scores_dummy_rmse = -cross_val_score(dummy_regressor, X_banking, y_banking.ravel(), 
                                        cv=cv, scoring='neg_root_mean_squared_error')

print("\nBaseline Model (Mean Prediction) Cross-Validation Results:")
print(f"R² mean: {cv_scores_dummy_r2.mean():.6f} ± {cv_scores_dummy_r2.std():.6f}")
print(f"RMSE mean: {cv_scores_dummy_rmse.mean():.6f} ± {cv_scores_dummy_rmse.std():.6f}")

In [None]:
# Visualize cross-validation results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# R² comparison
cv_data_r2 = [cv_scores_dummy_r2, cv_scores_r2, cv_scores_eng_r2]
labels = ['Baseline\n(Mean)', 'Original\nModel', 'Engineered\nModel']

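# Set tick labels separately; the `labels` keyword was renamed `tick_labels` in Matplotlib 3.9
ax1.boxplot(cv_data_r2)
ax1.set_xticklabels(labels)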
ax1.set_ylabel('R² Score')
ax1.set_title('Cross-Validation R² Comparison')
ax1.grid(True, alpha=0.3)

# RMSE comparison
cv_data_rmse = [cv_scores_dummy_rmse, cv_scores_rmse, cv_scores_eng_rmse]

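# Set tick labels separately; the `labels` keyword was renamed `tick_labels` in Matplotlib 3.9
ax2.boxplot(cv_data_rmse)
ax2.set_xticklabels(labels)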
ax2.set_ylabel('RMSE')
ax2.set_title('Cross-Validation RMSE Comparison')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical significance test
from scipy.stats import ttest_rel

print("\n=== STATISTICAL SIGNIFICANCE TESTS ===")
t_stat, p_value = ttest_rel(cv_scores_r2, cv_scores_dummy_r2)
print(f"Original vs Baseline (R²): t-statistic={t_stat:.4f}, p-value={p_value:.6f}")
print(f"Significant improvement: {'Yes' if p_value < 0.05 else 'No'}")

t_stat, p_value = ttest_rel(cv_scores_eng_r2, cv_scores_r2)
print(f"\nEngineered vs Original (R²): t-statistic={t_stat:.4f}, p-value={p_value:.6f}")
print(f"Significant improvement: {'Yes' if p_value < 0.05 else 'No'}")

In [None]:
# Model stability analysis
print("\n=== MODEL STABILITY ANALYSIS ===")

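def coefficient_of_variation(scores):
    # Use |mean| so the ratio stays non-negative. The ratio is unstable when the mean
    # is close to zero (as for the baseline's R²), so read that entry with caution.
    mean_abs = np.abs(scores.mean())
    return scores.std() / mean_abs if mean_abs != 0 else np.inf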

stability_df = pd.DataFrame({
    'Model': ['Baseline', 'Original', 'Engineered'],
    'R² CV': [coefficient_of_variation(cv_scores_dummy_r2),
              coefficient_of_variation(cv_scores_r2),
              coefficient_of_variation(cv_scores_eng_r2)],
    'RMSE CV': [coefficient_of_variation(cv_scores_dummy_rmse),
                coefficient_of_variation(cv_scores_rmse),
                coefficient_of_variation(cv_scores_eng_rmse)]
})

print("Coefficient of Variation (lower = more stable):")
print(stability_df.to_string(index=False, float_format='%.6f'))

# Performance summary
summary_df = pd.DataFrame({
    'Model': ['Baseline', 'Original', 'Engineered'],
    'Mean R²': [cv_scores_dummy_r2.mean(), cv_scores_r2.mean(), cv_scores_eng_r2.mean()],
    'Std R²': [cv_scores_dummy_r2.std(), cv_scores_r2.std(), cv_scores_eng_r2.std()],
    'Mean RMSE': [cv_scores_dummy_rmse.mean(), cv_scores_rmse.mean(), cv_scores_eng_rmse.mean()],
    'Std RMSE': [cv_scores_dummy_rmse.std(), cv_scores_rmse.std(), cv_scores_eng_rmse.std()]
})

print("\nPerformance Summary:")
print(summary_df.to_string(index=False, float_format='%.6f'))

### Cross-Validation Insights:

**Key Benefits of Cross-Validation:**

1. **Robust Performance Estimates:**
   - Reduces dependence on specific train/test split
   - Shows the spread of scores across folds (mean ± standard deviation), a rough gauge of how uncertain the estimate is

2. **Model Stability Assessment:**
   - Low standard deviation indicates stable performance
   - High variation suggests overfitting or data sensitivity

3. **Statistical Significance:**
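   - Paired t-tests on fold scores give a rough indication of whether an improvement is significant; the folds share training data, so the p-values tend to be optimistic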
   - Helps avoid false conclusions from random variation

**Business Implications:**
- More reliable performance estimates for production deployment
- Better understanding of model reliability
- Informed decisions about model complexity trade-offs
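
What `cross_val_score` does can be spelled out by hand. The sketch below runs 5-fold cross-validation with plain NumPy on simulated data; the data, the fold construction, and the variable names are assumptions for illustration, not scikit-learn's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 1000, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # intercept + 3 features
y = X @ np.array([1.0, 2.0, 3.0, -1.5]) + rng.normal(0, 1.0, n)

# Shuffle once, then cut the shuffled indices into k roughly equal folds
indices = rng.permutation(n)
folds = np.array_split(indices, k)

scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Fit on the k-1 training folds only
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

    # Score R² on the held-out fold
    resid = y[test_idx] - X[test_idx] @ beta
    ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
    scores.append(1.0 - np.sum(resid**2) / ss_tot)

scores = np.array(scores)
print(f"Fold R² scores: {np.round(scores, 4)}")
print(f"Mean ± std: {scores.mean():.4f} ± {scores.std():.4f}")
```

Every observation is used for testing exactly once, which is why the mean across folds depends less on any single split than one train/test evaluation does.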

## Exercise 5: Synthetic Data Generation

**Task:** Create your own synthetic dataset with known relationships and test your model.

### Solution Approach:
1. Generate synthetic data with known coefficients
2. Add realistic noise and outliers
3. Test model's ability to recover true coefficients
4. Experiment with different noise levels

In [None]:
# Exercise 5 Solution

print("=== SYNTHETIC DATA GENERATION ===")

# Set parameters
n_samples = 500
true_coefficients = np.array([2.0, 3.0, -1.5])  # Known true coefficients
true_intercept = 1.0

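def generate_synthetic_data(n_samples, true_coef, true_intercept, noise_level=1.0, outlier_fraction=0.05):
    """
    Generate synthetic data with a known linear relationship.

    Expects three coefficients in true_coef (one per generated feature).
    Returns X (n_samples x 3 feature matrix), y (noisy target with a fraction
    of outliers), and y_true (the noise-free target).
    """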
    # Generate features from different distributions to make it realistic
    X1 = np.random.normal(0, 1, n_samples)  # Standard normal
    X2 = np.random.uniform(-2, 2, n_samples)  # Uniform
    X3 = np.random.exponential(1, n_samples)  # Exponential (right-skewed)
    
    X = np.column_stack([X1, X2, X3])
    
    # Generate target with known relationship
    y_true = true_intercept + X @ true_coef
    
    # Add noise
    noise = np.random.normal(0, noise_level, n_samples)
    y = y_true + noise
    
    # Add outliers
    n_outliers = int(outlier_fraction * n_samples)
    outlier_indices = np.random.choice(n_samples, n_outliers, replace=False)
    y[outlier_indices] += np.random.normal(0, 5 * noise_level, n_outliers)
    
    return X, y, y_true

# Generate synthetic data with different noise levels
noise_levels = [0.5, 1.0, 2.0, 3.0]
results = []

for noise_level in noise_levels:
    print(f"\n--- Noise Level: {noise_level} ---")
    
    # Generate data
    X_syn, y_syn, y_true_syn = generate_synthetic_data(
        n_samples, true_coefficients, true_intercept, noise_level
    )
    
    # Split data
    X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(
        X_syn, y_syn, test_size=0.2, random_state=42
    )
    
    # Train model
    model_syn = LinearRegression()
    model_syn.fit(X_train_syn, y_train_syn)
    
    # Evaluate
    y_pred_syn = model_syn.predict(X_test_syn)
    r2_syn = r2_score(y_test_syn, y_pred_syn)
    rmse_syn = np.sqrt(mean_squared_error(y_test_syn, y_pred_syn))
    
    # Compare estimated vs true coefficients
    estimated_coef = model_syn.coef_
    estimated_intercept = model_syn.intercept_
    
    coef_error = np.abs(estimated_coef - true_coefficients)
    intercept_error = abs(estimated_intercept - true_intercept)
    
    print(f"True coefficients: {true_coefficients}")
    print(f"Estimated coefficients: {estimated_coef}")
    print(f"Coefficient errors: {coef_error}")
    print(f"True intercept: {true_intercept:.3f}, Estimated: {estimated_intercept:.3f}, Error: {intercept_error:.3f}")
    print(f"R²: {r2_syn:.6f}, RMSE: {rmse_syn:.6f}")
    
    results.append({
        'noise_level': noise_level,
        'r2': r2_syn,
        'rmse': rmse_syn,
        'coef_error_mean': coef_error.mean(),
        'coef_error_max': coef_error.max(),
        'intercept_error': intercept_error
    })

In [None]:
# Visualize results
results_df = pd.DataFrame(results)

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# R² vs noise level
axes[0,0].plot(results_df['noise_level'], results_df['r2'], 'bo-')
axes[0,0].set_xlabel('Noise Level')
axes[0,0].set_ylabel('R²')
axes[0,0].set_title('R² vs Noise Level')
axes[0,0].grid(True, alpha=0.3)

# RMSE vs noise level
axes[0,1].plot(results_df['noise_level'], results_df['rmse'], 'ro-')
axes[0,1].set_xlabel('Noise Level')
axes[0,1].set_ylabel('RMSE')
axes[0,1].set_title('RMSE vs Noise Level')
axes[0,1].grid(True, alpha=0.3)

# Coefficient error vs noise level
axes[1,0].plot(results_df['noise_level'], results_df['coef_error_mean'], 'go-', label='Mean Error')
axes[1,0].plot(results_df['noise_level'], results_df['coef_error_max'], 'g^-', label='Max Error')
axes[1,0].set_xlabel('Noise Level')
axes[1,0].set_ylabel('Coefficient Error')
axes[1,0].set_title('Coefficient Recovery Error vs Noise Level')
axes[1,0].legend()
axes[1,0].grid(True, alpha=0.3)

# Intercept error vs noise level
axes[1,1].plot(results_df['noise_level'], results_df['intercept_error'], 'mo-')
axes[1,1].set_xlabel('Noise Level')
axes[1,1].set_ylabel('Intercept Error')
axes[1,1].set_title('Intercept Recovery Error vs Noise Level')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n=== SUMMARY TABLE ===")
print(results_df.to_string(index=False, float_format='%.6f'))

In [None]:
# Demonstrate with a specific example (medium noise)
X_demo, y_demo, y_true_demo = generate_synthetic_data(
    n_samples, true_coefficients, true_intercept, noise_level=1.0
)

# Visualize the data
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

feature_names = ['X1 (Normal)', 'X2 (Uniform)', 'X3 (Exponential)']

for i in range(3):
    axes[i].scatter(X_demo[:, i], y_demo, alpha=0.6, label='Observed')
    axes[i].scatter(X_demo[:, i], y_true_demo, alpha=0.6, color='red', s=10, label='True (no noise)')
    axes[i].set_xlabel(feature_names[i])
    axes[i].set_ylabel('Target')
    axes[i].set_title(f'Target vs {feature_names[i]}')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Train final model and show detailed results
model_demo = LinearRegression()
model_demo.fit(X_demo, y_demo)

print("\n=== DETAILED COEFFICIENT RECOVERY ===")
coef_comparison = pd.DataFrame({
    'Feature': ['X1', 'X2', 'X3', 'Intercept'],
    'True Value': list(true_coefficients) + [true_intercept],
    'Estimated Value': list(model_demo.coef_) + [model_demo.intercept_],
    'Absolute Error': list(np.abs(model_demo.coef_ - true_coefficients)) + [abs(model_demo.intercept_ - true_intercept)],
    'Relative Error (%)': [
        abs(est - true) / abs(true) * 100 if true != 0 else np.inf
        for est, true in zip(list(model_demo.coef_) + [model_demo.intercept_], 
                           list(true_coefficients) + [true_intercept])
    ]
})

print(coef_comparison.to_string(index=False, float_format='%.6f'))

### Synthetic Data Insights:

**Key Observations:**

1. **Coefficient Recovery:**
   - Linear regression recovers the true coefficients closely when noise is low
   - Accuracy decreases with higher noise levels
   - Outliers can significantly impact estimates

2. **Noise Impact:**
   - Higher noise reduces R² and increases RMSE
   - Coefficient estimation becomes less accurate
   - The estimation error grows roughly in proportion to the noise standard deviation

3. **Model Validation:**
   - Synthetic data provides ground truth for testing
   - Helps understand model limitations
   - Useful for algorithm development and debugging

**Practical Applications:**
- Test new algorithms on known problems
- Understand impact of data quality on model performance
- Generate training data when real data is limited
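
The link between noise and coefficient accuracy can be checked against theory. For a simple regression with a fixed design, the standard error of the slope is σ / sqrt(Σ(xᵢ - x̄)²). The sketch below compares that formula with the spread of slope estimates over repeated simulations; the setup (one feature, no outliers, 500 repetitions) is an assumption chosen to keep the example small and differs from the exercise's three-feature data.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_reps = 500, 500
x = rng.normal(0, 1, n)                  # design held fixed across repetitions
X = np.column_stack([np.ones(n), x])
sxx = np.sum((x - x.mean()) ** 2)

noise_sds = [0.5, 1.0, 2.0]
empirical, theoretical = [], []
for sigma in noise_sds:
    slopes = []
    for _ in range(n_reps):
        y = 1.0 + 2.0 * x + rng.normal(0, sigma, n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes.append(beta[1])
    empirical.append(np.std(slopes))             # spread of the estimated slopes
    theoretical.append(sigma / np.sqrt(sxx))     # textbook standard error

for sigma, e, t in zip(noise_sds, empirical, theoretical):
    print(f"sigma = {sigma}: empirical SE = {e:.4f}, theoretical SE = {t:.4f}")
```

Doubling the noise standard deviation doubles the theoretical standard error, and the simulated spread should track it closely. Outliers, which this sketch leaves out, would inflate the spread further.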

## Exercise 6: Time Series Forecasting Analysis

**Task:** Analyze the Bitcoin forecasting model more deeply.

### Solution Approach:
1. Calculate directional accuracy
2. Create cumulative returns comparison
3. Test different lag lengths
4. Discuss trading implications
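
Steps 1 and 2 can be sketched on simulated returns before turning to the Bitcoin data. Everything here is an assumption for illustration: the returns are random draws, the "forecast" is constructed to be weakly informative, and the sign-based long/short rule ignores transaction costs.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 250
actual = rng.normal(0.0, 0.03, n)                       # simulated daily log returns
predicted = 0.1 * actual + rng.normal(0.0, 0.03, n)     # stand-in for a weak forecast

# Directional accuracy: share of days on which the predicted sign matches the actual sign
directional_accuracy = np.mean(np.sign(actual) == np.sign(predicted))

# Sign strategy: go long when the forecast is positive, short when it is negative
position = np.sign(predicted)
strategy_returns = position * actual

# Log returns add up over time, so cumulative performance is a running sum
cum_buy_hold = np.cumsum(actual)
cum_strategy = np.cumsum(strategy_returns)

print(f"Directional accuracy: {directional_accuracy:.3f}")
print(f"Cumulative log return (buy and hold): {cum_buy_hold[-1]:.4f}")
print(f"Cumulative log return (sign strategy): {cum_strategy[-1]:.4f}")
```

Directional accuracy above 0.5 does not guarantee a profitable strategy, because the size of the returns on correctly and incorrectly predicted days also matters.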

In [None]:
# Exercise 6 Solution
# First, let's recreate the Bitcoin data and model from the main notebook

print("=== BITCOIN TIME SERIES ANALYSIS ===")

# Load Bitcoin data
btc_url = "https://raw.githubusercontent.com/umatter/EDFB/main/data/data_BTC.csv"

try:
    data_btc = pd.read_csv(btc_url)
except Exception as e:
    print("Falling back to KaggleHub dataset due to:", repr(e))
    try:
        import kagglehub
    except Exception:
        import sys, subprocess
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "kagglehub"])
        import kagglehub
    path = kagglehub.dataset_download("nguynchtrai/data-btc")
    import os, glob
    candidates = sorted(glob.glob(os.path.join(path, "*.csv")))
    if not candidates:
        raise RuntimeError("No CSV files found in Kaggle dataset directory: " + path)
    print("Loaded from:", candidates[0])
    data_btc = pd.read_csv(candidates[0])

# Standardize columns
if 'Timestamp' in data_btc.columns and 'Close' in data_btc.columns:
    data_btc['Date'] = pd.to_datetime(data_btc['Timestamp'], unit='ms')
    data_btc = data_btc.sort_values('Date').reset_index(drop=True)
    data_btc = data_btc.rename(columns={'Close': 'BTC-USD.Close'})
elif 'Date' in data_btc.columns and 'BTC-USD.Close' in data_btc.columns:
    try:
        data_btc['Date'] = pd.to_datetime(data_btc['Date'])
    except Exception:
        pass

data_btc = data_btc.drop(columns=['Timestamp'], errors='ignore')
data_btc = data_btc.sort_values('Date').reset_index(drop=True)

print(f"Bitcoin data loaded: {len(data_btc)} observations")
print(f"Date range: {data_btc['Date'].min()} to {data_btc['Date'].max()}")

In [None]:
# Function to create lagged features and train model
def create_lagged_features_btc(data, lag):
    df = data.copy()
    df['ret'] = np.log(df['BTC-USD.Close']).diff()
    for i in range(1, lag+1):
        df[f'lag_ret_{i}'] = df['ret'].shift(i)
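    # Drop only rows missing the return or one of its lags, so NaNs in unrelated columns do not remove data
    lag_cols = [f'lag_ret_{i}' for i in range(1, lag+1)]
    df = df.dropna(subset=['ret'] + lag_cols).reset_index(drop=True)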
    return df

def train_btc_model(data, lag):
    # Create features
    data_lagged = create_lagged_features_btc(data, lag)
    
    # Prepare features and target
    feature_cols = [f'lag_ret_{i}' for i in range(1, lag+1)]
    X = data_lagged[feature_cols].values
    y = data_lagged['ret'].values
    
    # Chronological split
    split_idx = int(len(data_lagged) * 0.8)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    # Train model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    return {
        'model': model,
        'X_train': X_train, 'X_test': X_test,
        'y_train': y_train, 'y_test': y_test,
        'y_pred': y_pred,
        'data': data_lagged,
        'split_idx': split_idx
    }

# Test different lag lengths
lag_lengths = [1, 3, 5, 10]
lag_results = {}

print("\n=== TESTING DIFFERENT LAG LENGTHS ===")

for lag in lag_lengths:
    print(f"\n--- Lag Length: {lag} ---")
    
    result = train_btc_model(data_btc, lag)
    lag_results[lag] = result
    
    # Calculate metrics
    r2 = r2_score(result['y_test'], result['y_pred'])
    rmse = np.sqrt(mean_squared_error(result['y_test'], result['y_pred']))
    
    # Directional accuracy
    direction_correct = np.sum(np.sign(result['y_test']) == np.sign(result['y_pred']))
    directional_accuracy = direction_correct / len(result['y_test'])
    
    # Naive (persistence) baseline: predict that today's return equals yesterday's return
    naive_pred = result['X_test'][:, 0]  # Column 0 holds lag_ret_1
    rmse_naive = np.sqrt(mean_squared_error(result['y_test'], naive_pred))
    
    print(f"R²: {r2:.6f}")
    print(f"RMSE: {rmse:.6f}")
    print(f"RMSE (naive): {rmse_naive:.6f}")
    print(f"RMSE ratio (model/naive): {rmse/rmse_naive:.3f}")
    print(f"Directional accuracy: {directional_accuracy:.3f} ({direction_correct}/{len(result['y_test'])} correct)")
    
    lag_results[lag]['metrics'] = {
        'r2': r2, 'rmse': rmse, 'rmse_naive': rmse_naive,
        'directional_accuracy': directional_accuracy
    }

In [None]:
# Compare lag lengths
comparison_data = []
for lag, result in lag_results.items():
    m = result['metrics']  # named `m` to avoid shadowing the imported sklearn `metrics` module
    comparison_data.append({
        'Lag Length': lag,
        'R²': m['r2'],
        'RMSE': m['rmse'],
        'RMSE Ratio': m['rmse'] / m['rmse_naive'],
        'Directional Accuracy': m['directional_accuracy']
    })

comparison_df = pd.DataFrame(comparison_data)
print("\n=== LAG LENGTH COMPARISON ===")
print(comparison_df.to_string(index=False, float_format='%.6f'))

# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# R² comparison
axes[0,0].plot(comparison_df['Lag Length'], comparison_df['R²'], 'bo-')
axes[0,0].set_xlabel('Lag Length')
axes[0,0].set_ylabel('R²')
axes[0,0].set_title('R² vs Lag Length')
axes[0,0].grid(True, alpha=0.3)

# RMSE ratio comparison
axes[0,1].plot(comparison_df['Lag Length'], comparison_df['RMSE Ratio'], 'ro-')
axes[0,1].axhline(y=1, color='black', linestyle='--', alpha=0.5, label='Naive baseline')
axes[0,1].set_xlabel('Lag Length')
axes[0,1].set_ylabel('RMSE Ratio (Model/Naive)')
axes[0,1].set_title('RMSE Ratio vs Lag Length')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# Directional accuracy
axes[1,0].plot(comparison_df['Lag Length'], comparison_df['Directional Accuracy'], 'go-')
axes[1,0].axhline(y=0.5, color='black', linestyle='--', alpha=0.5, label='Random guess')
axes[1,0].set_xlabel('Lag Length')
axes[1,0].set_ylabel('Directional Accuracy')
axes[1,0].set_title('Directional Accuracy vs Lag Length')
axes[1,0].legend()
axes[1,0].grid(True, alpha=0.3)

# RMSE absolute values
axes[1,1].plot(comparison_df['Lag Length'], comparison_df['RMSE'], 'mo-')
axes[1,1].set_xlabel('Lag Length')
axes[1,1].set_ylabel('RMSE')
axes[1,1].set_title('RMSE vs Lag Length')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Cumulative returns analysis (using best performing model)
best_lag = comparison_df.loc[comparison_df['Directional Accuracy'].idxmax(), 'Lag Length']
print(f"\n=== CUMULATIVE RETURNS ANALYSIS (Lag {best_lag}) ===")

best_result = lag_results[best_lag]
y_test = best_result['y_test']
y_pred = best_result['y_pred']

# Create trading strategies
# Strategy 1: Perfect foresight (always on the right side of each move; upper bound)
perfect_returns = np.abs(y_test)

# Strategy 2: Model predictions
# Trade on the predicted direction: long if the predicted return is positive, short otherwise
model_returns = np.where(y_pred > 0, y_test, -y_test)

# Strategy 3: Naive (always long)
naive_returns = y_test

# Strategy 4: Random (50% chance of going long)
np.random.seed(42)
random_signals = np.random.choice([-1, 1], size=len(y_test))
random_returns = random_signals * y_test

# Calculate cumulative returns
cum_perfect = np.cumsum(perfect_returns)
cum_model = np.cumsum(model_returns)
cum_naive = np.cumsum(naive_returns)
cum_random = np.cumsum(random_returns)

# Plot cumulative returns
plt.figure(figsize=(15, 8))
plt.plot(cum_perfect, label='Perfect Foresight (Upper Bound)', linewidth=2)
plt.plot(cum_model, label=f'Model Strategy (Lag {best_lag})', linewidth=2)
plt.plot(cum_naive, label='Naive (Always Long)', linewidth=1, alpha=0.7)
plt.plot(cum_random, label='Random Strategy', linewidth=1, alpha=0.7)
plt.xlabel('Time Period')
plt.ylabel('Cumulative Log Returns')
plt.title('Cumulative Returns Comparison: Trading Strategies')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Calculate strategy statistics
strategies = {
    'Perfect Foresight': perfect_returns,
    'Model Strategy': model_returns,
    'Naive (Buy & Hold)': naive_returns,
    'Random': random_returns
}

strategy_stats = []
for name, returns in strategies.items():
    total_return = np.sum(returns)
    volatility = np.std(returns)
    # Per-period Sharpe ratio (not annualized, risk-free rate assumed to be zero)
    sharpe_ratio = np.mean(returns) / np.std(returns) if np.std(returns) > 0 else 0
    max_drawdown = np.min(np.cumsum(returns) - np.maximum.accumulate(np.cumsum(returns)))
    
    strategy_stats.append({
        'Strategy': name,
        'Total Return': total_return,
        'Volatility': volatility,
        'Sharpe Ratio': sharpe_ratio,
        'Max Drawdown': max_drawdown
    })

strategy_df = pd.DataFrame(strategy_stats)
print("\nStrategy Performance Comparison:")
print(strategy_df.to_string(index=False, float_format='%.6f'))

In [None]:
# Detailed analysis of model predictions
print("\n=== DETAILED PREDICTION ANALYSIS ===")

# Prediction accuracy by magnitude
pred_magnitude = np.abs(y_pred)
actual_magnitude = np.abs(y_test)

# Bin predictions by confidence (magnitude)
confidence_bins = np.percentile(pred_magnitude, [0, 25, 50, 75, 100])
bin_labels = ['Low', 'Medium-Low', 'Medium-High', 'High']

print("Directional Accuracy by Prediction Confidence:")
for i, label in enumerate(bin_labels):
    mask = (pred_magnitude >= confidence_bins[i]) & (pred_magnitude < confidence_bins[i+1])
    if i == len(bin_labels) - 1:  # Include the maximum value in the last bin
        mask = pred_magnitude >= confidence_bins[i]
    
    if np.sum(mask) > 0:
        accuracy = np.mean(np.sign(y_test[mask]) == np.sign(y_pred[mask]))
        count = np.sum(mask)
        avg_magnitude = np.mean(pred_magnitude[mask])
        print(f"{label} Confidence: {accuracy:.3f} ({count} predictions, avg magnitude: {avg_magnitude:.6f})")

# Market regime analysis
print("\nDirectional Accuracy by Market Regime:")

# Define regimes based on actual return magnitude
low_vol_mask = actual_magnitude < np.percentile(actual_magnitude, 33)
med_vol_mask = (actual_magnitude >= np.percentile(actual_magnitude, 33)) & (actual_magnitude < np.percentile(actual_magnitude, 67))
high_vol_mask = actual_magnitude >= np.percentile(actual_magnitude, 67)

regimes = {
    'Low Volatility': low_vol_mask,
    'Medium Volatility': med_vol_mask,
    'High Volatility': high_vol_mask
}

for regime_name, mask in regimes.items():
    if np.sum(mask) > 0:
        accuracy = np.mean(np.sign(y_test[mask]) == np.sign(y_pred[mask]))
        count = np.sum(mask)
        avg_return = np.mean(np.abs(y_test[mask]))
        print(f"{regime_name}: {accuracy:.3f} ({count} periods, avg |return|: {avg_return:.6f})")

### Time Series Forecasting Insights:

**Key Findings:**

1. **Lag Length Impact:**
   - More lags don't always improve performance
   - Risk of overfitting with too many lags
   - Optimal lag length depends on market dynamics

2. **Directional Accuracy:**
   - Often more important than precise magnitude prediction
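   - An edge of only a few points over 50% can be profitable, but only if it survives transaction costs
   - Predictions with larger magnitude may be more accurate; the confidence-bin breakdown above tests this directly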

3. **Trading Strategy Performance:**
   - Model-based strategies can outperform naive approaches
   - Transaction costs and market impact not considered
   - Risk management crucial for practical implementation
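
Whether an edge over 50% is distinguishable from coin-flipping depends on the number of test periods. A minimal sketch of a one-sided test using the normal approximation is shown here; the function name `directional_accuracy_z` and the counts in the usage line are illustrative assumptions, not results from this notebook.

```python
import math

def directional_accuracy_z(n_correct, n_total):
    """z-score and one-sided p-value for H0: hit rate = 0.5 (normal approximation)."""
    accuracy = n_correct / n_total
    # Standard error of a proportion under H0 (p = 0.5)
    se = math.sqrt(0.25 / n_total)
    z = (accuracy - 0.5) / se
    # Upper-tail probability of a standard normal
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 530 correct directional calls out of 1000 test periods
z, p = directional_accuracy_z(530, 1000)
print(f"z = {z:.2f}, one-sided p = {p:.3f}")
```

With the notebook's own numbers, `n_correct` and `n_total` would correspond to `direction_correct` and `len(result['y_test'])`. The test treats periods as independent, which is itself an approximation for financial returns.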

**Practical Trading Implications:**

1. **Position Sizing:**
   - Scale positions based on prediction confidence
   - Larger positions when model is more certain

2. **Risk Management:**
   - Set stop-losses and take-profits
   - Monitor maximum drawdown
   - Consider volatility regimes

3. **Model Limitations:**
   - Linear models may miss complex patterns
   - Market conditions change over time
   - Need for regular model retraining
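
One way to scale positions by prediction confidence is to map the predicted return to a position between -1 and 1. The sketch below assumes that the magnitude of the prediction is a usable confidence proxy; `sized_positions`, the 90th-percentile scale, and the synthetic arrays are illustrative choices rather than part of the original solution.

```python
import numpy as np

def sized_positions(y_pred, scale):
    """Map predicted returns to positions in [-1, 1], proportional to the prediction."""
    return np.clip(y_pred / scale, -1.0, 1.0)

# Synthetic stand-ins for model output (not data from this notebook)
rng = np.random.default_rng(42)
train_pred = rng.normal(0, 0.002, 500)   # predictions on the training window
test_pred = rng.normal(0, 0.002, 100)    # predictions on the test window
test_ret = rng.normal(0, 0.03, 100)      # realized returns on the test window

# Calibrate the scale on training predictions only, to avoid look-ahead
scale = np.percentile(np.abs(train_pred), 90)
positions = sized_positions(test_pred, scale)
strategy_returns = positions * test_ret
print(f"Mean |position|: {np.abs(positions).mean():.3f}")
```

Calibrating `scale` on the training window keeps the sizing rule free of information from the test period.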

**Recommendations:**
- Combine with other indicators and models
- Implement proper backtesting with transaction costs
- Consider ensemble methods for robustness
- Regular performance monitoring and model updates
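
A backtest with transaction costs can be sketched by charging a fee whenever the position changes. The function name `net_strategy_returns`, the proportional cost of 0.001 per unit traded, and the demo arrays are assumptions for illustration; real costs depend on the venue and on trade size.

```python
import numpy as np

def net_strategy_returns(y_pred, y_true, cost_per_unit=0.001):
    """Sign-based strategy returns net of a proportional cost on position changes."""
    positions = np.sign(y_pred)                        # +1 long, -1 short, 0 flat
    # Units traded each period, starting from a flat (zero) position
    turnover = np.abs(np.diff(positions, prepend=0))
    return positions * y_true - cost_per_unit * turnover

# Small hand-made example (hypothetical numbers)
y_pred_demo = np.array([0.01, -0.02, -0.01, 0.03])
y_true_demo = np.array([0.02, 0.01, -0.03, 0.01])
net = net_strategy_returns(y_pred_demo, y_true_demo)
print(net, net.sum())
```

Flipping from long to short trades two units, so a strategy that changes direction often pays proportionally more than one that holds its position.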

## Exercise 7: Model Comparison

**Task:** Compare linear regression with Ridge and Lasso regression.

### Solution Approach:
1. Implement Ridge and Lasso regression
2. Tune hyperparameters
3. Compare performance metrics
4. Analyze feature selection effects

In [None]:
# Exercise 7 Solution

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

print("=== REGULARIZED REGRESSION COMPARISON ===")

# Use the banking dataset with engineered features for this comparison
X_comparison = X_engineered.values
y_comparison = y_banking.ravel()

# Split the data
X_train_comp, X_test_comp, y_train_comp, y_test_comp = train_test_split(
    X_comparison, y_comparison, test_size=0.2, random_state=42
)

# Standardize features (important for regularized regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_comp)
X_test_scaled = scaler.transform(X_test_comp)

print(f"Dataset: {X_train_scaled.shape[0]} training, {X_test_scaled.shape[0]} test samples")
print(f"Features: {X_train_scaled.shape[1]}")

# Define models and hyperparameter grids
models = {
    'Linear Regression': {
        'model': LinearRegression(),
        'params': {}
    },
    'Ridge Regression': {
        'model': Ridge(),
        'params': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
    },
    'Lasso Regression': {
        'model': Lasso(max_iter=2000),
        'params': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
    }
}

# Train and evaluate models
results = {}

for name, config in models.items():
    print(f"\n--- Training {name} ---")
    
    if config['params']:  # If hyperparameters to tune
        # Use GridSearchCV for hyperparameter tuning
        grid_search = GridSearchCV(
            config['model'], config['params'], 
            cv=5, scoring='r2', n_jobs=-1
        )
        grid_search.fit(X_train_scaled, y_train_comp)
        best_model = grid_search.best_estimator_
        best_params = grid_search.best_params_
        print(f"Best parameters: {best_params}")
        print(f"Best CV score: {grid_search.best_score_:.6f}")
    else:
        # No hyperparameters to tune
        best_model = config['model']
        best_model.fit(X_train_scaled, y_train_comp)
        best_params = {}
    
    # Make predictions
    y_train_pred = best_model.predict(X_train_scaled)
    y_test_pred = best_model.predict(X_test_scaled)
    
    # Calculate metrics
    train_r2 = r2_score(y_train_comp, y_train_pred)
    test_r2 = r2_score(y_test_comp, y_test_pred)
    train_rmse = np.sqrt(mean_squared_error(y_train_comp, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test_comp, y_test_pred))
    
    # Store results
    results[name] = {
        'model': best_model,
        'params': best_params,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'overfitting': train_r2 - test_r2
    }
    
    print(f"Train R²: {train_r2:.6f}, Test R²: {test_r2:.6f}")
    print(f"Train RMSE: {train_rmse:.6f}, Test RMSE: {test_rmse:.6f}")
    print(f"Overfitting (R² gap): {train_r2 - test_r2:.6f}")

In [None]:
# Create comparison table
comparison_data = []
for name, result in results.items():
    comparison_data.append({
        'Model': name,
        'Train R²': result['train_r2'],
        'Test R²': result['test_r2'],
        'Train RMSE': result['train_rmse'],
        'Test RMSE': result['test_rmse'],
        'Overfitting': result['overfitting'],
        'Best Params': str(result['params'])
    })

comparison_table = pd.DataFrame(comparison_data)
print("\n=== MODEL COMPARISON SUMMARY ===")
print(comparison_table.to_string(index=False, float_format='%.6f'))

# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

models_list = list(results.keys())
train_r2_values = [results[model]['train_r2'] for model in models_list]
test_r2_values = [results[model]['test_r2'] for model in models_list]
train_rmse_values = [results[model]['train_rmse'] for model in models_list]
test_rmse_values = [results[model]['test_rmse'] for model in models_list]

# R² comparison
x_pos = np.arange(len(models_list))
width = 0.35

axes[0,0].bar(x_pos - width/2, train_r2_values, width, label='Train', alpha=0.8)
axes[0,0].bar(x_pos + width/2, test_r2_values, width, label='Test', alpha=0.8)
axes[0,0].set_xlabel('Model')
axes[0,0].set_ylabel('R²')
axes[0,0].set_title('R² Comparison')
axes[0,0].set_xticks(x_pos)
axes[0,0].set_xticklabels(models_list, rotation=45)
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# RMSE comparison
axes[0,1].bar(x_pos - width/2, train_rmse_values, width, label='Train', alpha=0.8)
axes[0,1].bar(x_pos + width/2, test_rmse_values, width, label='Test', alpha=0.8)
axes[0,1].set_xlabel('Model')
axes[0,1].set_ylabel('RMSE')
axes[0,1].set_title('RMSE Comparison')
axes[0,1].set_xticks(x_pos)
axes[0,1].set_xticklabels(models_list, rotation=45)
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# Overfitting comparison
overfitting_values = [results[model]['overfitting'] for model in models_list]
axes[1,0].bar(models_list, overfitting_values, alpha=0.8, color='red')
axes[1,0].set_xlabel('Model')
axes[1,0].set_ylabel('Overfitting (Train R² - Test R²)')
axes[1,0].set_title('Overfitting Comparison')
axes[1,0].tick_params(axis='x', rotation=45)
axes[1,0].grid(True, alpha=0.3)

# Test R² vs Overfitting scatter
axes[1,1].scatter(test_r2_values, overfitting_values, s=100, alpha=0.7)
for i, model in enumerate(models_list):
    axes[1,1].annotate(model, (test_r2_values[i], overfitting_values[i]), 
                      xytext=(5, 5), textcoords='offset points')
axes[1,1].set_xlabel('Test R²')
axes[1,1].set_ylabel('Overfitting')
axes[1,1].set_title('Test Performance vs Overfitting')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Analyze feature selection (Lasso) and regularization effects
print("\n=== FEATURE SELECTION ANALYSIS ===")

# Get coefficients from each model
feature_names = X_engineered.columns

coef_comparison = pd.DataFrame({
    'Feature': feature_names,
    'Linear': results['Linear Regression']['model'].coef_,
    'Ridge': results['Ridge Regression']['model'].coef_,
    'Lasso': results['Lasso Regression']['model'].coef_
})

# Count non-zero coefficients
print("Non-zero coefficients:")
for model in ['Linear', 'Ridge', 'Lasso']:
    non_zero = np.sum(np.abs(coef_comparison[model]) > 1e-6)
    print(f"{model}: {non_zero}/{len(feature_names)} ({non_zero/len(feature_names)*100:.1f}%)")

# Show features selected by Lasso
lasso_selected = coef_comparison[np.abs(coef_comparison['Lasso']) > 1e-6]
print(f"\nFeatures selected by Lasso ({len(lasso_selected)}):")
lasso_selected_sorted = lasso_selected.reindex(
    lasso_selected['Lasso'].abs().sort_values(ascending=False).index
)
print(lasso_selected_sorted[['Feature', 'Lasso']].to_string(index=False, float_format='%.6f'))

# Visualize coefficient comparison for top features
top_features = coef_comparison.reindex(
    coef_comparison['Linear'].abs().sort_values(ascending=False).index
).head(15)

fig, ax = plt.subplots(figsize=(12, 8))
x_pos = np.arange(len(top_features))
width = 0.25

ax.bar(x_pos - width, top_features['Linear'], width, label='Linear', alpha=0.8)
ax.bar(x_pos, top_features['Ridge'], width, label='Ridge', alpha=0.8)
ax.bar(x_pos + width, top_features['Lasso'], width, label='Lasso', alpha=0.8)

ax.set_xlabel('Features')
ax.set_ylabel('Coefficient Value')
ax.set_title('Coefficient Comparison: Top 15 Features')
ax.set_xticks(x_pos)
ax.set_xticklabels(top_features['Feature'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='black', linestyle='-', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Regularization path analysis
print("\n=== REGULARIZATION PATH ANALYSIS ===")

# Ridge regularization path
alphas_ridge = np.logspace(-3, 3, 50)
ridge_coefs = []
ridge_scores = []

for alpha in alphas_ridge:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train_comp)
    ridge_coefs.append(ridge.coef_)
    ridge_scores.append(ridge.score(X_test_scaled, y_test_comp))

ridge_coefs = np.array(ridge_coefs)

# Lasso regularization path
alphas_lasso = np.logspace(-3, 1, 50)
lasso_coefs = []
lasso_scores = []

for alpha in alphas_lasso:
    lasso = Lasso(alpha=alpha, max_iter=2000)
    lasso.fit(X_train_scaled, y_train_comp)
    lasso_coefs.append(lasso.coef_)
    lasso_scores.append(lasso.score(X_test_scaled, y_test_comp))

lasso_coefs = np.array(lasso_coefs)

# Plot regularization paths
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Ridge path
for i in range(min(10, ridge_coefs.shape[1])):  # Plot first 10 features
    axes[0,0].plot(alphas_ridge, ridge_coefs[:, i], alpha=0.7)
axes[0,0].set_xscale('log')
axes[0,0].set_xlabel('Alpha')
axes[0,0].set_ylabel('Coefficient Value')
axes[0,0].set_title('Ridge Regularization Path')
axes[0,0].grid(True, alpha=0.3)

# Lasso path
for i in range(min(10, lasso_coefs.shape[1])):  # Plot first 10 features
    axes[0,1].plot(alphas_lasso, lasso_coefs[:, i], alpha=0.7)
axes[0,1].set_xscale('log')
axes[0,1].set_xlabel('Alpha')
axes[0,1].set_ylabel('Coefficient Value')
axes[0,1].set_title('Lasso Regularization Path')
axes[0,1].grid(True, alpha=0.3)

# Ridge scores
axes[1,0].plot(alphas_ridge, ridge_scores, 'b-', linewidth=2)
axes[1,0].set_xscale('log')
axes[1,0].set_xlabel('Alpha')
axes[1,0].set_ylabel('Test R²')
axes[1,0].set_title('Ridge: Test Performance vs Regularization')
axes[1,0].grid(True, alpha=0.3)

# Lasso scores
axes[1,1].plot(alphas_lasso, lasso_scores, 'r-', linewidth=2)
axes[1,1].set_xscale('log')
axes[1,1].set_xlabel('Alpha')
axes[1,1].set_ylabel('Test R²')
axes[1,1].set_title('Lasso: Test Performance vs Regularization')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

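# Locate the alpha with the highest test R² on each path.
# Diagnostic only: picking alpha on the test set leaks test information,
# so rely on the cross-validated alphas from GridSearchCV for model selection.
best_ridge_idx = np.argmax(ridge_scores)
best_lasso_idx = np.argmax(lasso_scores)

print(f"Ridge alpha with highest test R²: {alphas_ridge[best_ridge_idx]:.6f} (R² = {ridge_scores[best_ridge_idx]:.6f})")
print(f"Lasso alpha with highest test R²: {alphas_lasso[best_lasso_idx]:.6f} (R² = {lasso_scores[best_lasso_idx]:.6f})")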

### Regularized Regression Insights:

**Key Differences Between Methods:**

1. **Linear Regression:**
   - No regularization, can overfit with many features
   - All features retained with potentially large coefficients
   - Best when you have few features relative to samples

2. **Ridge Regression (L2 Regularization):**
   - Shrinks coefficients toward zero but doesn't eliminate them
   - Handles multicollinearity well
   - Good when all features are somewhat relevant
   - Reduces overfitting while maintaining interpretability

3. **Lasso Regression (L1 Regularization):**
   - Performs automatic feature selection by setting coefficients to zero
   - Creates sparse models (fewer features)
   - Good when only subset of features are truly important
   - Can be unstable with highly correlated features
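
The contrast between L2 shrinkage and L1 selection is easiest to see in the special case of orthonormal features, where both estimators have closed forms in terms of the OLS coefficients. The sketch below works under that assumption; the penalty `lam` refers to the textbook objectives written in the comments, which are not on the same scale as scikit-learn's `alpha`, and the coefficient values are made up.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # Minimizer of ||y - Xb||^2 + lam * ||b||^2 when X'X = I:
    # every coefficient is scaled by the same factor and never reaches zero
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    # Minimizer of 0.5 * ||y - Xb||^2 + lam * ||b||_1 when X'X = I:
    # soft-thresholding, which sets small coefficients exactly to zero
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])  # hypothetical OLS coefficients
lam = 0.5

ridge_coefs_demo = ridge_shrink(beta_ols, lam)
lasso_coefs_demo = lasso_shrink(beta_ols, lam)
print("Ridge:", ridge_coefs_demo)
print("Lasso:", lasso_coefs_demo)
print("Non-zero (ridge, lasso):",
      np.count_nonzero(ridge_coefs_demo), np.count_nonzero(lasso_coefs_demo))
```

Ridge keeps all four coefficients at two thirds of their OLS size, while Lasso subtracts a fixed amount from each and drops the two whose magnitude is below `lam`.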

**When to Use Each Method:**

- **Linear Regression:** Few features relative to samples, interpretability crucial
- **Ridge:** Many features, multicollinearity, want to keep all features
- **Lasso:** Many features, want automatic feature selection, sparse solutions

**Business Applications:**
- **Ridge:** Risk modeling where all factors matter
- **Lasso:** Marketing attribution with many channels
- **Linear:** Simple pricing models with few key factors
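
A variation on the tuning step in this exercise is to put the scaler inside a scikit-learn `Pipeline`, so that each cross-validation fold is standardized using only its own training portion. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`); it is one possible setup and not the configuration used in the solution above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1, 1])
y_demo = X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(size=200)

# make_pipeline names each step after its lowercased class: 'standardscaler', 'ridge'
pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# The scaler is refit inside every CV split, so validation folds never influence it
search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)
print(search.best_params_, f"CV R² = {search.best_score_:.3f}")
```

The fitted `search.best_estimator_` bundles scaler and model, so it can be applied to raw test features without a separate `transform` call.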

## Exercise 8: Business Impact Analysis

**Task:** Quantify the business value of your duration prediction model.

### Solution Approach:
1. Create engagement categories based on call duration
2. Calculate classification accuracy for each category
3. Estimate business value of correct predictions
4. Develop actionable insights

In [None]:
# Exercise 8 Solution

print("=== BUSINESS IMPACT ANALYSIS ===")

# Define engagement thresholds (log duration)
low_threshold = np.log(2 * 60)    # 2 minutes = 120 seconds
high_threshold = np.log(5 * 60)   # 5 minutes = 300 seconds

print(f"Engagement thresholds:")
print(f"Low engagement: < {low_threshold:.3f} (< 2 minutes)")
print(f"Medium engagement: {low_threshold:.3f} - {high_threshold:.3f} (2-5 minutes)")
print(f"High engagement: > {high_threshold:.3f} (> 5 minutes)")

def categorize_engagement(log_duration):
    """
    Categorize engagement based on log duration
    """
    return np.where(log_duration < low_threshold, 'Low',
                   np.where(log_duration < high_threshold, 'Medium', 'High'))

# Categorize actual and predicted durations
y_test_categories = categorize_engagement(y_test_banking.ravel())
y_pred_categories = categorize_engagement(y_test_predicted_banking.ravel())

# Calculate distribution of engagement levels
engagement_dist = pd.Series(y_test_categories).value_counts(normalize=True).sort_index()
print(f"\nActual engagement distribution:")
for level, pct in engagement_dist.items():
    count = pd.Series(y_test_categories).value_counts()[level]
    print(f"{level}: {pct:.3f} ({count} calls)")

# Create confusion matrix
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test_categories, y_pred_categories, 
                     labels=['Low', 'Medium', 'High'])

print(f"\nConfusion Matrix:")
cm_df = pd.DataFrame(cm, 
                    index=['Actual Low', 'Actual Medium', 'Actual High'],
                    columns=['Pred Low', 'Pred Medium', 'Pred High'])
print(cm_df)

# Calculate accuracy for each engagement level
print(f"\nClassification Report:")
print(classification_report(y_test_categories, y_pred_categories))

In [None]:
# Visualize confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues', 
            cbar_kws={'label': 'Number of Calls'})
plt.title('Engagement Level Prediction: Confusion Matrix')
plt.ylabel('Actual Engagement')
plt.xlabel('Predicted Engagement')
plt.show()

# Calculate precision, recall, and F1 for each class
from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, support = precision_recall_fscore_support(
    y_test_categories, y_pred_categories, labels=['Low', 'Medium', 'High']
)

metrics_df = pd.DataFrame({
    'Engagement Level': ['Low', 'Medium', 'High'],
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1,
    'Support': support
})

print("\nDetailed Metrics by Engagement Level:")
print(metrics_df.to_string(index=False, float_format='%.3f'))

# Visualize metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

metrics_to_plot = ['Precision', 'Recall', 'F1-Score']
colors = ['skyblue', 'lightcoral', 'lightgreen']

for i, metric in enumerate(metrics_to_plot):
    axes[i].bar(metrics_df['Engagement Level'], metrics_df[metric], 
               color=colors[i], alpha=0.8)
    axes[i].set_ylabel(metric)
    axes[i].set_title(f'{metric} by Engagement Level')
    axes[i].set_ylim(0, 1)
    axes[i].grid(True, alpha=0.3)
    
    # Add value labels on bars
    for j, v in enumerate(metrics_df[metric]):
        axes[i].text(j, v + 0.02, f'{v:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

In [None]:
# Business value calculation
print("\n=== BUSINESS VALUE ESTIMATION ===")

# Define business values (hypothetical but realistic)
business_values = {
    'cost_per_call': 5.0,  # Cost to make a follow-up call
    'revenue_low': 50.0,   # Expected revenue from low engagement customer
    'revenue_medium': 150.0,  # Expected revenue from medium engagement customer
    'revenue_high': 400.0,    # Expected revenue from high engagement customer
    'conversion_rate_low': 0.05,    # 5% conversion rate for low engagement
    'conversion_rate_medium': 0.15,  # 15% conversion rate for medium engagement
    'conversion_rate_high': 0.35,   # 35% conversion rate for high engagement
}

print("Business assumptions:")
for key, value in business_values.items():
    print(f"{key}: {value}")

# Calculate expected value for each engagement level
expected_values = {
    'Low': business_values['revenue_low'] * business_values['conversion_rate_low'] - business_values['cost_per_call'],
    'Medium': business_values['revenue_medium'] * business_values['conversion_rate_medium'] - business_values['cost_per_call'],
    'High': business_values['revenue_high'] * business_values['conversion_rate_high'] - business_values['cost_per_call']
}

print(f"\nExpected value per follow-up call:")
for level, value in expected_values.items():
    print(f"{level} engagement: ${value:.2f}")

# Strategy 1: Follow up with all customers (baseline)
total_calls_baseline = len(y_test_categories)
baseline_value = sum(expected_values[level] * np.sum(y_test_categories == level) 
                    for level in ['Low', 'Medium', 'High'])

print(f"\nBaseline strategy (follow up with all {total_calls_baseline} customers):")
print(f"Total expected value: ${baseline_value:.2f}")
print(f"Average value per call: ${baseline_value/total_calls_baseline:.2f}")

# Strategy 2: Only follow up with predicted high engagement customers
high_pred_mask = y_pred_categories == 'High'
high_pred_count = np.sum(high_pred_mask)
high_pred_actual = y_test_categories[high_pred_mask]

strategy2_value = sum(expected_values[level] * np.sum(high_pred_actual == level) 
                     for level in ['Low', 'Medium', 'High'])

print(f"\nStrategy 2 (follow up only with predicted high engagement):")
print(f"Calls made: {high_pred_count} ({high_pred_count/total_calls_baseline*100:.1f}% of total)")
print(f"Total expected value: ${strategy2_value:.2f}")
print(f"Average value per call: ${strategy2_value/high_pred_count:.2f}" if high_pred_count > 0 else "No calls made")
print(f"Value vs baseline: ${strategy2_value - baseline_value:.2f}")

# Strategy 3: Follow up with predicted medium and high engagement
med_high_pred_mask = (y_pred_categories == 'Medium') | (y_pred_categories == 'High')
med_high_pred_count = np.sum(med_high_pred_mask)
med_high_pred_actual = y_test_categories[med_high_pred_mask]

strategy3_value = sum(expected_values[level] * np.sum(med_high_pred_actual == level) 
                     for level in ['Low', 'Medium', 'High'])

print(f"\nStrategy 3 (follow up with predicted medium + high engagement):")
print(f"Calls made: {med_high_pred_count} ({med_high_pred_count/total_calls_baseline*100:.1f}% of total)")
print(f"Total expected value: ${strategy3_value:.2f}")
print(f"Average value per call: ${strategy3_value/med_high_pred_count:.2f}" if med_high_pred_count > 0 else "No calls made")
print(f"Value vs baseline: ${strategy3_value - baseline_value:.2f}")

In [None]:
# ROI analysis and sensitivity testing
print("\n=== ROI ANALYSIS ===")

strategies = {
    'Baseline (All)': {
        'calls': total_calls_baseline,
        'value': baseline_value,
        'cost': total_calls_baseline * business_values['cost_per_call']
    },
    'High Only': {
        'calls': high_pred_count,
        'value': strategy2_value,
        'cost': high_pred_count * business_values['cost_per_call']
    },
    'Medium + High': {
        'calls': med_high_pred_count,
        'value': strategy3_value,
        'cost': med_high_pred_count * business_values['cost_per_call']
    }
}

roi_data = []
for strategy_name, data in strategies.items():
    revenue = data['value'] + data['cost']  # Add back the cost to get gross revenue
    profit = data['value']
    roi = (profit / data['cost']) * 100 if data['cost'] > 0 else 0
    
    roi_data.append({
        'Strategy': strategy_name,
        'Calls': data['calls'],
        'Cost': data['cost'],
        'Revenue': revenue,
        'Profit': profit,
        'ROI (%)': roi,
        'Profit per Call': profit / data['calls'] if data['calls'] > 0 else 0
    })

roi_df = pd.DataFrame(roi_data)
print(roi_df.to_string(index=False, float_format='%.2f'))

# Visualize ROI comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Profit comparison
axes[0,0].bar(roi_df['Strategy'], roi_df['Profit'], alpha=0.8, color='green')
axes[0,0].set_ylabel('Total Profit ($)')
axes[0,0].set_title('Total Profit by Strategy')
axes[0,0].tick_params(axis='x', rotation=45)
axes[0,0].grid(True, alpha=0.3)

# ROI comparison
axes[0,1].bar(roi_df['Strategy'], roi_df['ROI (%)'], alpha=0.8, color='blue')
axes[0,1].set_ylabel('ROI (%)')
axes[0,1].set_title('Return on Investment by Strategy')
axes[0,1].tick_params(axis='x', rotation=45)
axes[0,1].grid(True, alpha=0.3)

# Calls vs Profit
axes[1,0].scatter(roi_df['Calls'], roi_df['Profit'], s=100, alpha=0.8)
for i, strategy in enumerate(roi_df['Strategy']):
    axes[1,0].annotate(strategy, (roi_df['Calls'].iloc[i], roi_df['Profit'].iloc[i]),
                      xytext=(5, 5), textcoords='offset points')
axes[1,0].set_xlabel('Number of Calls')
axes[1,0].set_ylabel('Total Profit ($)')
axes[1,0].set_title('Calls vs Profit')
axes[1,0].grid(True, alpha=0.3)

# Profit per call
axes[1,1].bar(roi_df['Strategy'], roi_df['Profit per Call'], alpha=0.8, color='orange')
axes[1,1].set_ylabel('Profit per Call ($)')
axes[1,1].set_title('Profit per Call by Strategy')
axes[1,1].tick_params(axis='x', rotation=45)
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Sensitivity analysis
print("\n=== SENSITIVITY ANALYSIS ===")

# Test different cost scenarios
cost_scenarios = [2.0, 5.0, 10.0, 15.0]  # Different cost per call
sensitivity_results = []

for cost in cost_scenarios:
    # Recalculate expected values
    exp_val_low = business_values['revenue_low'] * business_values['conversion_rate_low'] - cost
    exp_val_med = business_values['revenue_medium'] * business_values['conversion_rate_medium'] - cost
    exp_val_high = business_values['revenue_high'] * business_values['conversion_rate_high'] - cost
    
    # Baseline strategy value
    baseline_val = (exp_val_low * np.sum(y_test_categories == 'Low') +
                   exp_val_med * np.sum(y_test_categories == 'Medium') +
                   exp_val_high * np.sum(y_test_categories == 'High'))
    
    # High-only strategy value
    high_only_val = (exp_val_low * np.sum(high_pred_actual == 'Low') +
                    exp_val_med * np.sum(high_pred_actual == 'Medium') +
                    exp_val_high * np.sum(high_pred_actual == 'High'))
    
    sensitivity_results.append({
        'Cost per Call': cost,
        'Baseline Profit': baseline_val,
        'High-Only Profit': high_only_val,
        'Improvement': high_only_val - baseline_val,
        'Improvement %': ((high_only_val - baseline_val) / abs(baseline_val)) * 100 if baseline_val != 0 else 0
    })

sensitivity_df = pd.DataFrame(sensitivity_results)
print("Sensitivity to Cost per Call:")
print(sensitivity_df.to_string(index=False, float_format='%.2f'))

# Plot sensitivity
plt.figure(figsize=(12, 6))
plt.plot(sensitivity_df['Cost per Call'], sensitivity_df['Baseline Profit'], 
         'b-o', label='Baseline (All Customers)', linewidth=2)
plt.plot(sensitivity_df['Cost per Call'], sensitivity_df['High-Only Profit'], 
         'r-o', label='High Engagement Only', linewidth=2)
plt.axhline(y=0, color='black', linestyle='--', alpha=0.5)
plt.xlabel('Cost per Call ($)')
plt.ylabel('Total Profit ($)')
plt.title('Profit Sensitivity to Cost per Call')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Find break-even point
break_even_cost = business_values['revenue_high'] * business_values['conversion_rate_high']
print(f"\nBreak-even cost per call for high engagement: ${break_even_cost:.2f}")
print(f"Current cost assumption: ${business_values['cost_per_call']:.2f}")
print(f"Safety margin: {((break_even_cost - business_values['cost_per_call']) / business_values['cost_per_call']) * 100:.1f}%")

### Business Impact Analysis Insights:

**Key Business Findings:**

1. **Model Performance by Engagement Level:**
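   - High engagement customers are the most valuable; the per-class recall above shows how reliably the model identifies them
   - Medium engagement customers have a smaller but still positive expected value per call
   - Under the stated assumptions, low engagement customers have negative expected value (50 × 0.05 - 5 = -2.50 per call)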

2. **Strategic Recommendations:**
   - **Selective Follow-up:** Only contact predicted medium/high engagement customers
   - **Resource Optimization:** Focus limited resources on highest-value prospects
   - **Cost Management:** Monitor cost per call to maintain profitability

3. **Financial Impact:**
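   - Selective strategies raise ROI and profit per call by concentrating calls on high-value segments
   - Total profit rises only to the extent that the skipped customers have negative expected value (here, the low engagement segment)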
   - Break-even analysis helps set cost thresholds
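
The break-even logic behind these findings can be sketched in a few lines. The figures below are hypothetical and are not taken from the notebook's `business_values`; the helper `expected_profit_per_call` is likewise an illustrative name.

```python
# Hypothetical inputs for illustration only (not the notebook's business_values)
revenue_per_conversion = 200.0   # revenue from one converted customer ($)
conversion_rate = 0.15           # share of calls that convert
cost_per_call = 12.0             # assumed cost of one follow-up call ($)

def expected_profit_per_call(revenue, rate, cost):
    """Expected profit of one call: expected revenue minus the call cost."""
    return revenue * rate - cost

# Break-even cost: the call cost at which expected profit is exactly zero
break_even_cost = revenue_per_conversion * conversion_rate
profit = expected_profit_per_call(revenue_per_conversion, conversion_rate, cost_per_call)
safety_margin_pct = (break_even_cost - cost_per_call) / cost_per_call * 100

print(f"Break-even cost per call: ${break_even_cost:.2f}")
print(f"Expected profit per call: ${profit:.2f}")
print(f"Safety margin: {safety_margin_pct:.1f}%")
```

A segment is worth calling only while its cost per call stays below this break-even value.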

**Implementation Considerations:**

1. **Model Accuracy Trade-offs:**
   - False positives: Waste resources on low-value customers
   - False negatives: Miss high-value opportunities
   - Optimize threshold based on business costs

2. **Operational Changes:**
   - Train call center staff on engagement indicators
   - Implement real-time scoring system
   - Monitor and update model performance regularly

3. **Risk Management:**
   - Test strategies on small samples first
   - Monitor customer satisfaction impacts
   - Have fallback procedures for model failures
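
One way to "optimize the threshold based on business costs" is to scan candidate thresholds and keep the one with the lowest total cost of false positives and false negatives. This is a minimal sketch on synthetic scores; the cost values and the `total_cost` helper are assumptions for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: 1 = high-value customer, scores mimic model outputs
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=1000), 0, 1)

# Hypothetical business costs
cost_false_positive = 10.0   # wasted call to a low-value customer
cost_false_negative = 40.0   # missed high-value opportunity

def total_cost(threshold):
    """Total cost of contacting every customer whose score is >= threshold."""
    contacted = scores >= threshold
    false_pos = np.sum(contacted & (y_true == 0))
    false_neg = np.sum(~contacted & (y_true == 1))
    return false_pos * cost_false_positive + false_neg * cost_false_negative

# Evaluate a grid of thresholds and keep the cheapest one
thresholds = np.linspace(0, 1, 101)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]
print(f"Cost-minimizing threshold: {best_threshold:.2f}")
```

In practice the threshold should be chosen on validation data, not on the data used to fit the model.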

**Success Metrics:**
- Increase in profit per call
- Improvement in conversion rates
- Reduction in wasted follow-up calls
- Overall ROI improvement

## Exercise 9: Advanced Diagnostics

**Task:** Perform advanced model diagnostics to identify potential issues.

### Solution Approach:
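1. Calculate Cook's distance and leverage to find influential observations
2. Check multicollinearity using the Variance Inflation Factor (VIF)
3. Test for heteroscedasticity (Breusch-Pagan, White)
4. Test for autocorrelation in residuals (Durbin-Watson)
5. Test residual normality (Jarque-Bera, D'Agostino)
6. Provide improvement recommendations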

In [None]:
# Exercise 9 Solution

print("=== ADVANCED MODEL DIAGNOSTICS ===")

# First, let's use statsmodels for more detailed diagnostics
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson

# Prepare data for statsmodels (add constant for intercept)
X_train_sm = sm.add_constant(X_train_banking)
X_test_sm = sm.add_constant(X_test_banking)

# Fit OLS model using statsmodels
model_sm = sm.OLS(y_train_banking.ravel(), X_train_sm).fit()

print("Model Summary:")
print(model_sm.summary())

# Get predictions and residuals
y_train_pred_sm = model_sm.predict(X_train_sm)
residuals_sm = model_sm.resid

In [None]:
# 1. Cook's Distance Analysis
print("\n=== COOK'S DISTANCE ANALYSIS ===")

# Calculate influence measures
influence = OLSInfluence(model_sm)
cooks_d = influence.cooks_distance[0]
leverage = influence.hat_matrix_diag
studentized_residuals = influence.resid_studentized_external

# Identify influential observations
n = len(X_train_sm)
p = X_train_sm.shape[1]
cooks_threshold = 4 / n  # Common threshold
leverage_threshold = 2 * p / n  # Common threshold

influential_cooks = np.where(cooks_d > cooks_threshold)[0]
high_leverage = np.where(leverage > leverage_threshold)[0]
outliers = np.where(np.abs(studentized_residuals) > 3)[0]  # |t| > 3

print(f"Cook's distance threshold: {cooks_threshold:.6f}")
print(f"Leverage threshold: {leverage_threshold:.6f}")
print(f"Influential observations (Cook's D): {len(influential_cooks)} ({len(influential_cooks)/n*100:.2f}%)")
print(f"High leverage observations: {len(high_leverage)} ({len(high_leverage)/n*100:.2f}%)")
print(f"Outliers (|studentized residual| > 3): {len(outliers)} ({len(outliers)/n*100:.2f}%)")

# Plot diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Cook's distance plot
axes[0,0].stem(range(len(cooks_d)), cooks_d, basefmt=" ")
axes[0,0].axhline(y=cooks_threshold, color='red', linestyle='--', label=f'Threshold ({cooks_threshold:.4f})')
axes[0,0].set_xlabel('Observation Index')
axes[0,0].set_ylabel("Cook's Distance")
axes[0,0].set_title("Cook's Distance")
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# Leverage plot
axes[0,1].scatter(range(len(leverage)), leverage, alpha=0.6)
axes[0,1].axhline(y=leverage_threshold, color='red', linestyle='--', label=f'Threshold ({leverage_threshold:.4f})')
axes[0,1].set_xlabel('Observation Index')
axes[0,1].set_ylabel('Leverage')
axes[0,1].set_title('Leverage Values')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# Residuals vs Leverage
axes[1,0].scatter(leverage, studentized_residuals, alpha=0.6)
axes[1,0].axhline(y=3, color='red', linestyle='--', alpha=0.7)
axes[1,0].axhline(y=-3, color='red', linestyle='--', alpha=0.7)
axes[1,0].axvline(x=leverage_threshold, color='red', linestyle='--', alpha=0.7)
axes[1,0].set_xlabel('Leverage')
axes[1,0].set_ylabel('Studentized Residuals')
axes[1,0].set_title('Residuals vs Leverage')
axes[1,0].grid(True, alpha=0.3)

# Cook's distance vs Leverage
axes[1,1].scatter(leverage, cooks_d, alpha=0.6)
axes[1,1].axhline(y=cooks_threshold, color='red', linestyle='--', alpha=0.7)
axes[1,1].axvline(x=leverage_threshold, color='red', linestyle='--', alpha=0.7)
axes[1,1].set_xlabel('Leverage')
axes[1,1].set_ylabel("Cook's Distance")
axes[1,1].set_title("Cook's Distance vs Leverage")
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Show most influential observations
if len(influential_cooks) > 0:
    print(f"\nTop 10 most influential observations (Cook's Distance):")
    top_influential = np.argsort(cooks_d)[-10:][::-1]
    for i, idx in enumerate(top_influential):
        print(f"{i+1}. Index {idx}: Cook's D = {cooks_d[idx]:.6f}, Leverage = {leverage[idx]:.6f}, Studentized Residual = {studentized_residuals[idx]:.3f}")

In [None]:
# 2. Multicollinearity Analysis (VIF)
print("\n=== MULTICOLLINEARITY ANALYSIS (VIF) ===")

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature.
# X_train_sm may be a NumPy array (no .iloc), so convert explicitly.
# The constant stays in the design matrix because variance_inflation_factor
# does not add an intercept itself; the constant is simply not reported.
X_for_vif = np.asarray(X_train_sm, dtype=float)
vif_data = []

for i in range(X_for_vif.shape[1] - 1):
    try:
        vif = variance_inflation_factor(X_for_vif, i + 1)  # column 0 is the constant
    except Exception:
        vif = np.inf  # Treat a numerical failure as perfect multicollinearity
    vif_data.append({
        'Feature': X_df_banking.columns[i],
        'VIF': vif
    })

vif_df = pd.DataFrame(vif_data)
vif_df = vif_df.sort_values('VIF', ascending=False)

print("Variance Inflation Factors:")
print("VIF > 10: High multicollinearity")
print("VIF > 5: Moderate multicollinearity")
print("VIF < 5: Low multicollinearity")
print()
print(vif_df.to_string(index=False, float_format='%.3f'))

# Identify problematic features
high_vif = vif_df[vif_df['VIF'] > 10]
moderate_vif = vif_df[(vif_df['VIF'] > 5) & (vif_df['VIF'] <= 10)]

print(f"\nFeatures with high multicollinearity (VIF > 10): {len(high_vif)}")
if len(high_vif) > 0:
    print(high_vif[['Feature', 'VIF']].to_string(index=False))

print(f"\nFeatures with moderate multicollinearity (5 < VIF <= 10): {len(moderate_vif)}")
if len(moderate_vif) > 0:
    print(moderate_vif[['Feature', 'VIF']].to_string(index=False))

# Visualize VIF
plt.figure(figsize=(12, 8))
colors = ['red' if vif > 10 else 'orange' if vif > 5 else 'green' for vif in vif_df['VIF']]
plt.barh(range(len(vif_df)), vif_df['VIF'], color=colors, alpha=0.7)
plt.yticks(range(len(vif_df)), vif_df['Feature'])
plt.xlabel('Variance Inflation Factor (VIF)')
plt.title('Multicollinearity Analysis: VIF by Feature')
plt.axvline(x=5, color='orange', linestyle='--', alpha=0.7, label='Moderate threshold (5)')
plt.axvline(x=10, color='red', linestyle='--', alpha=0.7, label='High threshold (10)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# 3. Heteroscedasticity Tests
print("\n=== HETEROSCEDASTICITY TESTS ===")

# Breusch-Pagan test
bp_stat, bp_pvalue, bp_fstat, bp_fpvalue = het_breuschpagan(residuals_sm, X_train_sm)
print(f"Breusch-Pagan Test:")
print(f"  LM statistic: {bp_stat:.6f}")
print(f"  p-value: {bp_pvalue:.6f}")
print(f"  Interpretation: {'Heteroscedasticity detected' if bp_pvalue < 0.05 else 'Homoscedasticity (constant variance)'}")

# White test
white_stat, white_pvalue, white_fstat, white_fpvalue = het_white(residuals_sm, X_train_sm)
print(f"\nWhite Test:")
print(f"  LM statistic: {white_stat:.6f}")
print(f"  p-value: {white_pvalue:.6f}")
print(f"  Interpretation: {'Heteroscedasticity detected' if white_pvalue < 0.05 else 'Homoscedasticity (constant variance)'}")

# 4. Autocorrelation Test (Durbin-Watson)
print("\n=== AUTOCORRELATION TEST ===")

dw_stat = durbin_watson(residuals_sm)
print(f"Durbin-Watson statistic: {dw_stat:.6f}")
print(f"Interpretation:")
print(f"  DW ≈ 2: No autocorrelation")
print(f"  DW < 2: Positive autocorrelation")
print(f"  DW > 2: Negative autocorrelation")
if dw_stat < 1.5:
    print(f"  Result: Strong positive autocorrelation detected")
elif dw_stat < 1.8:
    print(f"  Result: Moderate positive autocorrelation")
elif dw_stat > 2.5:
    print(f"  Result: Strong negative autocorrelation detected")
elif dw_stat > 2.2:
    print(f"  Result: Moderate negative autocorrelation")
else:
    print(f"  Result: No significant autocorrelation")

# 5. Normality Tests (additional)
print("\n=== ADDITIONAL NORMALITY TESTS ===")

from scipy.stats import jarque_bera, normaltest

# Jarque-Bera test
jb_stat, jb_pvalue = jarque_bera(residuals_sm)
print(f"Jarque-Bera Test:")
print(f"  Statistic: {jb_stat:.6f}")
print(f"  p-value: {jb_pvalue:.6f}")
print(f"  Interpretation: {'Residuals deviate from normality' if jb_pvalue < 0.05 else 'Residuals appear normal'}")

# D'Agostino's normality test
dag_stat, dag_pvalue = normaltest(residuals_sm)
print(f"\nD'Agostino's Normality Test:")
print(f"  Statistic: {dag_stat:.6f}")
print(f"  p-value: {dag_pvalue:.6f}")
print(f"  Interpretation: {'Residuals deviate from normality' if dag_pvalue < 0.05 else 'Residuals appear normal'}")

In [None]:
# 6. Model Improvement Recommendations
print("\n=== MODEL IMPROVEMENT RECOMMENDATIONS ===")

recommendations = []

# Based on influential observations
if len(influential_cooks) > len(X_train_sm) * 0.05:  # More than 5% influential
    recommendations.append(
        f"🔍 INFLUENTIAL OBSERVATIONS: {len(influential_cooks)} observations have high Cook's distance. "
        "Consider investigating these data points for errors or outliers."
    )

# Based on multicollinearity
if len(high_vif) > 0:
    recommendations.append(
        f"⚠️ MULTICOLLINEARITY: {len(high_vif)} features have VIF > 10. "
        "Consider removing highly correlated features or using regularization (Ridge/Lasso)."
    )

# Based on heteroscedasticity
if bp_pvalue < 0.05 or white_pvalue < 0.05:
    recommendations.append(
        "📊 HETEROSCEDASTICITY: Non-constant variance detected. "
        "Consider using robust standard errors, weighted least squares, or transforming the target variable."
    )

# Based on autocorrelation
if abs(dw_stat - 2) > 0.5:
    recommendations.append(
        "🔄 AUTOCORRELATION: Residuals show autocorrelation. "
        "Consider adding lagged variables or using time series models if data has temporal structure."
    )

# Based on normality
if jb_pvalue < 0.05:
    recommendations.append(
        "📈 NON-NORMALITY: Residuals deviate from normality. "
        "Consider transforming the target variable (log, Box-Cox) or using robust regression methods."
    )

# Model complexity
if len(X_df_banking.columns) > len(X_train_sm) / 10:  # More features than 10% of observations
    recommendations.append(
        "🎯 MODEL COMPLEXITY: High feature-to-observation ratio. "
        "Consider feature selection, dimensionality reduction (PCA), or regularization."
    )

if len(recommendations) == 0:
    print("✅ MODEL DIAGNOSTICS PASSED: No major issues detected!")
    print("The model appears to satisfy linear regression assumptions reasonably well.")
else:
    print("Issues detected and recommendations:")
    for i, rec in enumerate(recommendations, 1):
        print(f"\n{i}. {rec}")

# Specific actionable steps
print("\n=== SPECIFIC ACTIONABLE STEPS ===")

action_steps = [
    "1. DATA QUALITY:",
    "   - Investigate influential observations for data entry errors",
    "   - Consider robust regression methods for outlier handling",
    "   - Validate extreme values with domain experts",
    "",
    "2. FEATURE ENGINEERING:",
    "   - Remove or combine highly correlated features (VIF > 10)",
    "   - Consider polynomial or interaction terms for non-linearity",
    "   - Apply feature scaling/normalization",
    "",
    "3. MODEL ALTERNATIVES:",
    "   - Try regularized regression (Ridge/Lasso) for multicollinearity",
    "   - Consider robust regression for outliers",
    "   - Explore non-linear models if assumptions are severely violated",
    "",
    "4. VALIDATION:",
    "   - Use cross-validation for robust performance estimates",
    "   - Monitor model performance on new data",
    "   - Implement model retraining procedures"
]

for step in action_steps:
    print(step)

# Summary statistics
print("\n=== DIAGNOSTIC SUMMARY ===")
summary_stats = {
    'Total Observations': len(X_train_sm),
    'Features': X_train_sm.shape[1] - 1,  # Exclude constant
    'Influential Obs (%)': len(influential_cooks) / len(X_train_sm) * 100,
    'High VIF Features': len(high_vif),
    'Heteroscedasticity p-value': min(bp_pvalue, white_pvalue),
    'Durbin-Watson Stat': dw_stat,
    'Normality p-value': jb_pvalue,
    'Model R²': model_sm.rsquared,
    'Adjusted R²': model_sm.rsquared_adj
}

for key, value in summary_stats.items():
    if 'p-value' in key:
        print(f"{key}: {value:.6f}")
    elif '%' in key or 'R²' in key or 'Stat' in key:
        print(f"{key}: {value:.3f}")
    else:
        print(f"{key}: {value}")

### Advanced Diagnostics Insights:

**Key Diagnostic Tools:**

1. **Cook's Distance:**
   - Identifies observations that heavily influence model coefficients
   - High values indicate potential outliers or data errors
   - Threshold: 4/n (where n = sample size)

2. **Variance Inflation Factor (VIF):**
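   - Measures how much a coefficient's variance is inflated by correlation with the other features: VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all other features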
   - VIF > 10: High multicollinearity (problematic)
   - VIF > 5: Moderate multicollinearity (concerning)

3. **Heteroscedasticity Tests:**
   - Breusch-Pagan: Tests whether the residual variance depends linearly on the features
   - White Test: Also covers non-linear forms by adding squares and cross-products of the features
   - Violation affects standard errors and confidence intervals

4. **Durbin-Watson Test:**
   - Tests for autocorrelation in residuals
   - Important for time series or ordered data
   - Values near 2 indicate no autocorrelation
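
To make the VIF definition concrete, the following sketch computes it by hand with NumPy on synthetic data. The function name `vif_by_hand` and the variables `x1`, `x2`, `x3` are illustrative assumptions; the notebook itself uses `variance_inflation_factor` from statsmodels.

```python
import numpy as np

def vif_by_hand(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)                    # independent of x1
x3 = x1 + rng.normal(scale=0.1, size=500)    # nearly a copy of x1

vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], np.round(vifs, 2))))
```

The independent feature `x2` should show a VIF close to 1, while `x1` and `x3` inflate each other well past the threshold of 10.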

**Common Issues and Solutions:**

1. **Influential Observations:**
   - Investigate for data errors
   - Consider robust regression methods
   - Use outlier detection algorithms

2. **Multicollinearity:**
   - Remove highly correlated features
   - Use regularization (Ridge/Lasso)
   - Apply principal component analysis (PCA)

3. **Heteroscedasticity:**
   - Transform target variable (log, square root)
   - Use weighted least squares
   - Apply robust standard errors

4. **Non-normality:**
   - Transform variables (Box-Cox, log)
   - Use robust regression methods
   - Consider non-parametric alternatives
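
As an illustration of the "robust standard errors" remedy, this sketch compares classical OLS standard errors with White (HC0) sandwich standard errors using plain NumPy. The data are synthetic and the variable names are assumptions; statsmodels offers the same idea through the `cov_type` argument of `fit` (for example `cov_type="HC3"`).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(1, 10, size=n)
# Synthetic heteroscedastic data: noise standard deviation grows with x
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=n)

# Ordinary least squares via the normal equations
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors assume one constant error variance
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) sandwich standard errors use each squared residual instead
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("Coefficients:    ", np.round(beta, 3))
print("Classical SE:    ", np.round(se_classical, 4))
print("Robust (HC0) SE: ", np.round(se_robust, 4))
```

The coefficient estimates are identical in both cases; only the uncertainty attached to them changes.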

**Best Practices:**
- Always check assumptions before interpreting results
- Use multiple diagnostic tests for robustness
- Document any assumption violations and remedial actions
- Consider the business impact of model limitations
- Regularly re-evaluate model diagnostics with new data

## Reflection Questions - Detailed Answers

### 1. When might linear regression not be appropriate?

**Non-linear relationships:**
- When the relationship between features and target is curved, exponential, or has other non-linear patterns
- Example: Population growth, compound interest, or diminishing returns

**Categorical outcomes:**
- Linear regression predicts continuous values, not categories
- Use logistic regression for binary outcomes, multinomial logistic regression for multiple categories

**Assumption violations:**
- Severe heteroscedasticity (non-constant variance)
- Strong autocorrelation in residuals
- Non-normal residuals with small sample sizes
- Extreme multicollinearity

**Alternative approaches:**
- Polynomial regression for curved relationships
- Tree-based models for complex interactions
- Neural networks for highly non-linear patterns
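
A quick way to see the first alternative in action is to fit a straight line and a quadratic to a curved pattern and compare their in-sample fit. The diminishing-returns data below are synthetic, and `r_squared` is a small helper written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 200)
# Synthetic diminishing-returns pattern: the response flattens as x grows
y = 10 * np.sqrt(x) + rng.normal(scale=1.0, size=x.size)

def r_squared(y_true, y_pred):
    """In-sample share of variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Polynomial regression is still linear in its coefficients
linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_fit = np.polyval(np.polyfit(x, y, deg=2), x)

r2_linear = r_squared(y, linear_fit)
r2_quadratic = r_squared(y, quadratic_fit)
print(f"Linear R²:    {r2_linear:.3f}")
print(f"Quadratic R²: {r2_quadratic:.3f}")
```

The in-sample R² of the quadratic can never be lower than that of the line, so the comparison should be confirmed on held-out data before adopting the more flexible model.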

### 2. How do you balance model complexity with interpretability?

**Start simple:**
- Begin with basic linear model
- Add complexity only when justified by performance gains
- Use statistical tests to validate additional features

**Use regularization:**
- Ridge regression maintains all features but shrinks coefficients
- Lasso automatically selects important features
- Elastic Net combines both approaches
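
The shrinkage effect of Ridge can be shown with its closed-form solution. This is a sketch on synthetic, nearly collinear features; `ridge_coefficients` is an illustrative helper, not the scikit-learn `Ridge` estimator imported at the top of the notebook.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # almost collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution: (X'X + alpha*I)^(-1) X'y.
    Assumes centered data, so no intercept is fitted."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

# Center features and target so the intercept drops out
Xc = X - X.mean(axis=0)
yc = y - y.mean()

for alpha in [0.0, 1.0, 10.0, 100.0]:
    coefs = ridge_coefficients(Xc, yc, alpha)
    print(f"alpha={alpha:>6}: coefficients = {np.round(coefs, 3)}")
```

With `alpha=0` the result is ordinary least squares, which can split the effect erratically between the two near-duplicate features; larger `alpha` values pull the coefficients toward smaller, more stable values.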

**Business context matters:**
- Regulatory environments may require interpretable models
- High-stakes decisions need explainable predictions
- Operational teams need to understand model logic

**Practical strategies:**
- Create separate models for different purposes (simple for explanation, complex for prediction)
- Use feature importance rankings to focus on key drivers
- Provide model summaries at different technical levels

### 3. What are the key assumptions of linear regression and why do they matter?

**Linearity:**
- Relationship between features and target is linear
- Violation: Biased predictions, poor fit
- Business impact: Wrong understanding of factor relationships

**Independence:**
- Observations are independent of each other
- Violation: Underestimated standard errors, overconfident predictions
- Business impact: False confidence in model reliability

**Homoscedasticity:**
- Constant variance of residuals
- Violation: Unreliable confidence intervals
- Business impact: Incorrect uncertainty estimates for decisions

**Normality:**
- Residuals follow normal distribution
- Violation: Invalid hypothesis tests, poor confidence intervals
- Business impact: Unreliable statistical inference

**No multicollinearity:**
- Features are not highly correlated
- Violation: Unstable coefficients, difficult interpretation
- Business impact: Wrong conclusions about factor importance

### 4. How would you explain R² to a non-technical business stakeholder?

**Simple explanation:**
"R² tells us what percentage of the variation in our target variable is explained by our model. It's like asking: 'How much of the ups and downs in our data can we predict using our features?'"

**Practical interpretation:**
- R² = 0.80 means "Our model explains 80% of why values vary"
- R² = 0.30 means "Our model captures 30% of the pattern, 70% is due to other factors"

**Business context:**
- Higher R² = More predictable outcomes
- Lower R² = More uncertainty, need additional factors
- Perfect R² (1.0) is rare in real business data

**Avoid common misconceptions:**
- R² doesn't prove causation
- Higher R² doesn't always mean better business decisions
- R² can be misleading with small samples or many features
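
For a technical audience, the same idea can be shown as a calculation. The sketch below computes R² and adjusted R² by hand on made-up numbers; the helper names are assumptions for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_features):
    """Penalizes R² for the number of features used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

y_true = [10, 12, 15, 19, 24]      # made-up values for illustration
y_pred = [11, 12, 14, 20, 23]

r2 = r_squared(y_true, y_pred)
r2_adj = adjusted_r_squared(r2, n_obs=5, n_features=1)
print(f"R²: {r2:.3f}")
print(f"Adjusted R² (1 feature): {r2_adj:.3f}")
```

Adjusted R² addresses the last misconception above: it drops when a feature adds less explanatory power than its cost in degrees of freedom.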

### 5. In what business scenarios would you prefer RMSE over R² as an evaluation metric?

**When absolute errors matter:**
- Financial forecasting: $1000 error has real cost regardless of scale
- Inventory management: Overstocking/understocking has direct costs
- Resource planning: Wrong headcount predictions affect operations

**When errors must be expressed in the target's units:**
- RMSE has same units as target variable
- Easier to interpret business impact
- Can set acceptable error thresholds

**When stakeholders need concrete numbers:**
- "Typical error is about $500" vs "Model explains 85% of variance"
- RMSE directly relates to business costs
- Easier to set performance targets

**Examples:**
- Sales forecasting: RMSE shows the typical dollar error, with large misses weighted more heavily
- Demand planning: RMSE indicates typical unit shortage/surplus
- Budget planning: RMSE reveals expected deviation from targets
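
The difference between the two metrics shows up when the target is rescaled. In this sketch on synthetic sales figures (all values and names are hypothetical), RMSE follows the units of the target while R² stays the same.

```python
import numpy as np

rng = np.random.default_rng(11)
# Synthetic monthly sales in dollars and hypothetical forecasts
actual_usd = rng.uniform(20_000, 50_000, size=24)
forecast_usd = actual_usd + rng.normal(scale=1_500, size=24)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Unitless share of variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Same forecasts expressed in thousands of dollars
actual_k, forecast_k = actual_usd / 1000, forecast_usd / 1000

print(f"RMSE: ${rmse(actual_usd, forecast_usd):,.0f} (or {rmse(actual_k, forecast_k):.2f} thousand $)")
print(f"R² in dollars:   {r_squared(actual_usd, forecast_usd):.4f}")
print(f"R² in thousands: {r_squared(actual_k, forecast_k):.4f}")
```

Because RMSE carries units, it can be compared directly against a business tolerance such as a maximum acceptable forecast error.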

### 6. How might you improve the Bitcoin forecasting model?

**Additional features:**
- Market sentiment indicators (fear/greed index)
- Trading volume and volatility measures
- Macroeconomic indicators (interest rates, inflation)
- Social media sentiment and news analysis

**Advanced techniques:**
- GARCH models for volatility clustering
- ARIMA models for time series patterns
- Ensemble methods combining multiple models
- Deep learning for complex pattern recognition

**Risk management:**
- Implement position sizing based on prediction confidence
- Add stop-loss and take-profit mechanisms
- Consider transaction costs and market impact
- Regular model retraining and performance monitoring

**Practical considerations:**
- Real-time data feeds for timely predictions
- Backtesting with realistic trading constraints
- Stress testing under different market conditions
- Integration with existing trading infrastructure
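
Backtesting a forecasting model should respect time order: each evaluation window must come after the data used for training. Below is a minimal walk-forward split generator in pure Python; `walk_forward_splits` is a hypothetical helper written for this sketch (scikit-learn's `TimeSeriesSplit` provides a comparable tool).

```python
def walk_forward_splits(n_obs, initial_train, test_size):
    """Yield (train_indices, test_indices) pairs in time order.
    Each test window starts right after its training window ends."""
    start = initial_train
    while start + test_size <= n_obs:
        yield list(range(0, start)), list(range(start, start + test_size))
        start += test_size  # expand the training window for the next fold

splits = list(walk_forward_splits(n_obs=10, initial_train=4, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx}")
```

Refitting the model on each training window and scoring it on the following test window gives a performance estimate that does not leak future information.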

## Final Takeaways

**Linear regression strengths:**
- Simple, interpretable, and fast
- Good baseline for more complex models
- Well-understood statistical properties
- Effective when assumptions are met

**Key success factors:**
- Always check and validate assumptions
- Focus on business value, not just statistical metrics
- Use appropriate evaluation methods (cross-validation)
- Consider model limitations in decision-making

**Best practices:**
- Start simple, add complexity gradually
- Document assumptions and limitations
- Regular model monitoring and updates
- Clear communication with stakeholders

**Remember:** The best model is not always the most complex one, but the one that provides reliable, actionable insights for business decisions while being appropriately validated and understood by its users.