# Week 1: Linear Regression - Integrated Capstone Project

## Overview
This notebook covers fundamental concepts of linear regression applied to your capstone project dataset:
- Simple and Multiple Linear Regression
- Polynomial Terms
- Interaction Terms
- Multicollinearity Detection
- Variance Inflation Factor (VIF)
- Working with Categorical and Continuous Features

## Learning Objectives
By the end of this notebook, you will be able to:
1. Build and interpret linear regression models
2. Create polynomial and interaction terms
3. Detect and handle multicollinearity
4. Calculate and interpret VIF
5. Work with mixed data types (categorical and continuous)

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

### Load Your Dataset

**Instructions:** Replace the sample data below with your own capstone project dataset.

```python
# Example: Load your data
# df = pd.read_csv('your_data.csv')
```

In [None]:
# Sample dataset for demonstration purposes
# Replace this with your actual capstone project data
np.random.seed(42)
n_samples = 200

# Generate sample data
df = pd.DataFrame({
    'feature1': np.random.randn(n_samples) * 10 + 50,
    'feature2': np.random.randn(n_samples) * 5 + 30,
    'feature3': np.random.randn(n_samples) * 8 + 40,
    'category': np.random.choice(['A', 'B', 'C'], n_samples),
    'binary_feature': np.random.choice([0, 1], n_samples)
})

# Create target variable with some relationship to features
df['target'] = (2 * df['feature1'] + 
                1.5 * df['feature2'] + 
                0.5 * df['feature3'] + 
                np.random.randn(n_samples) * 10)

print("Dataset Shape:", df.shape)
df.head()

In [None]:
# Basic data exploration
print("\nDataset Info:")
df.info()

print("\nBasic Statistics:")
df.describe()

## 2. Simple Linear Regression

We'll start with a simple linear regression using one predictor variable.
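
With a single predictor, the least-squares line has a closed form: the slope is the covariance of the predictor and target divided by the predictor's variance, and the intercept makes the line pass through the two means. The sketch below checks this on hypothetical toy data (`x_toy` and `y_toy` are illustrative names, not part of the capstone dataset) and compares the result with `np.polyfit`.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 plus noise
rng = np.random.default_rng(0)
x_toy = rng.normal(50, 10, size=200)
y_toy = 2.0 * x_toy + 1.0 + rng.normal(0, 5, size=200)

# Closed-form OLS estimates for one predictor
slope = np.cov(x_toy, y_toy, ddof=1)[0, 1] / np.var(x_toy, ddof=1)
intercept = y_toy.mean() - slope * x_toy.mean()

# Cross-check against NumPy's degree-1 least-squares fit
slope_ref, intercept_ref = np.polyfit(x_toy, y_toy, deg=1)

print(f"closed form: slope={slope:.4f}, intercept={intercept:.4f}")
print(f"np.polyfit:  slope={slope_ref:.4f}, intercept={intercept_ref:.4f}")
```

`LinearRegression` solves the same least-squares problem, so its `coef_` and `intercept_` should agree with these values on the same data.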

In [None]:
# Simple Linear Regression: feature1 vs target
X_simple = df[['feature1']]
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_simple, y, test_size=0.2, random_state=42)

# Fit model
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)

# Predictions
y_pred = simple_model.predict(X_test)

# Metrics
print("Simple Linear Regression Results:")
print(f"Coefficient: {simple_model.coef_[0]:.4f}")
print(f"Intercept: {simple_model.intercept_:.4f}")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
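
The three metrics printed above can also be computed directly from their definitions, which may help when interpreting them. This is a minimal sketch on hypothetical toy values (`y_true` and `y_hat` are illustrative names), not output from the notebook's model.

```python
import numpy as np

# Hypothetical toy values: four observations and their predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true - y_hat

# MAE: average size of the errors, in the units of the target
mae = np.mean(np.abs(errors))
# RMSE: like MAE, but large errors are penalized more heavily
rmse = np.sqrt(np.mean(errors**2))
# R^2: share of the target's variance explained, relative to predicting the mean
r2 = 1 - np.sum(errors**2) / np.sum((y_true - y_true.mean())**2)

print(f"MAE={mae:.4f}, RMSE={rmse:.4f}, R2={r2:.4f}")
# MAE=0.5000, RMSE=0.7071, R2=0.9000
```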

In [None]:
# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, alpha=0.5, label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Feature 1')
plt.ylabel('Target')
plt.title('Simple Linear Regression: Feature 1 vs Target')
plt.legend()
plt.show()

## 3. Multiple Linear Regression

Now let's use multiple features to predict the target variable.

In [None]:
# Multiple Linear Regression
X_multiple = df[['feature1', 'feature2', 'feature3']]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_multiple, y, test_size=0.2, random_state=42)

# Fit model
multiple_model = LinearRegression()
multiple_model.fit(X_train, y_train)

# Predictions
y_pred = multiple_model.predict(X_test)

# Display coefficients
print("Multiple Linear Regression Results:")
print("\nCoefficients:")
for feature, coef in zip(X_multiple.columns, multiple_model.coef_):
    print(f"  {feature}: {coef:.4f}")
print(f"\nIntercept: {multiple_model.intercept_:.4f}")
print(f"\nR² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")

In [None]:
# Visualize actual vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Multiple Linear Regression: Actual vs Predicted')
plt.show()

## 4. Polynomial Terms

Polynomial features let a linear model capture non-linear relationships: the model is still linear in its coefficients, but its inputs now include powers of the original feature.
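
To see why this works, the sketch below fits hypothetical toy data with a curved relationship twice: once with a straight line and once with an added squared column. Both fits use ordinary least squares via `np.linalg.lstsq`; the helper `fit_r2` and the toy arrays are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np

# Hypothetical toy data with a curved relationship: y = 3 + 0.5*x^2 + noise
rng = np.random.default_rng(1)
x_toy = rng.uniform(-3, 3, size=300)
y_toy = 3.0 + 0.5 * x_toy**2 + rng.normal(0, 0.3, size=300)

def fit_r2(design, target):
    """Least-squares fit with an intercept column; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(design)), design])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return coef, 1 - resid.var() / target.var()

# Straight line: columns [1, x]
_, r2_linear = fit_r2(x_toy[:, None], y_toy)
# Degree 2: columns [1, x, x^2]; the solver is unchanged, only the inputs differ
coef_quad, r2_quad = fit_r2(np.column_stack([x_toy, x_toy**2]), y_toy)

print(f"R2 with x only:      {r2_linear:.3f}")
print(f"R2 with x and x^2:   {r2_quad:.3f}")
print(f"estimated x^2 coef:  {coef_quad[2]:.3f}")
```

The straight line explains almost none of the variance here, while the degree-2 fit recovers a squared-term coefficient close to the 0.5 used to generate the data.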

In [None]:
# Create polynomial features (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['feature1']])

# Create DataFrame with polynomial features
poly_feature_names = poly.get_feature_names_out(['feature1'])
X_poly_df = pd.DataFrame(X_poly, columns=poly_feature_names)

print("Polynomial Features:")
print(X_poly_df.head())

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Fit polynomial model
poly_model = LinearRegression()
poly_model.fit(X_train, y_train)

# Predictions
y_pred = poly_model.predict(X_test)

print("\nPolynomial Regression Results (Degree 2):")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

In [None]:
# Compare different polynomial degrees
degrees = [1, 2, 3, 4]
results = []

for degree in degrees:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(df[['feature1']])
    
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results.append({'Degree': degree, 'R²': r2, 'RMSE': rmse})

results_df = pd.DataFrame(results)
print("\nComparison of Polynomial Degrees:")
print(results_df)

## 5. Interaction Terms

Interaction terms capture the combined effect of two or more features.

In [None]:
# Create interaction terms manually
df_interaction = df.copy()
df_interaction['feature1_x_feature2'] = df['feature1'] * df['feature2']
df_interaction['feature1_x_feature3'] = df['feature1'] * df['feature3']
df_interaction['feature2_x_feature3'] = df['feature2'] * df['feature3']

print("Dataset with Interaction Terms:")
print(df_interaction[['feature1', 'feature2', 'feature3', 
                       'feature1_x_feature2', 'feature1_x_feature3', 
                       'feature2_x_feature3']].head())

In [None]:
# Model with interaction terms
X_interaction = df_interaction[['feature1', 'feature2', 'feature3',
                                 'feature1_x_feature2', 'feature1_x_feature3', 
                                 'feature2_x_feature3']]

X_train, X_test, y_train, y_test = train_test_split(X_interaction, y, test_size=0.2, random_state=42)

interaction_model = LinearRegression()
interaction_model.fit(X_train, y_train)
y_pred = interaction_model.predict(X_test)

print("Model with Interaction Terms:")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

print("\nCoefficients:")
for feature, coef in zip(X_interaction.columns, interaction_model.coef_):
    print(f"  {feature}: {coef:.4f}")

## 6. Multicollinearity Detection

Multicollinearity occurs when predictor variables are highly correlated with each other.
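
The practical cost is unstable coefficient estimates. The simulation below is a rough illustration under assumed settings (hypothetical helper `slope_spread`, 300 simulated datasets of 200 rows each): it measures how much the estimated coefficient of `x1` varies across datasets when `x2` is independent of `x1` versus when `x2` is almost a copy of it.

```python
import numpy as np

rng = np.random.default_rng(2)

def slope_spread(make_x2, n=200, n_repeats=300):
    """Std. dev. of the estimated x1 coefficient across simulated datasets."""
    estimates = []
    for _ in range(n_repeats):
        x1 = rng.normal(size=n)
        x2 = make_x2(x1, n)
        # True model: both coefficients equal 1, noise has unit variance
        y_sim = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        A = np.column_stack([np.ones(n), x1, x2])
        coef, *_ = np.linalg.lstsq(A, y_sim, rcond=None)
        estimates.append(coef[1])
    return np.std(estimates)

# x2 drawn independently of x1
spread_independent = slope_spread(lambda x1, n: rng.normal(size=n))
# x2 is x1 plus a little noise, so the two are almost perfectly correlated
spread_collinear = slope_spread(lambda x1, n: x1 + rng.normal(scale=0.1, size=n))

print(f"independent predictors: {spread_independent:.3f}")
print(f"collinear predictors:   {spread_collinear:.3f}")
```

The true coefficient is the same in both cases; only the precision of its estimate changes, which is why individual coefficients become hard to interpret under multicollinearity even when predictions remain reasonable.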

In [None]:
# Correlation matrix
correlation_matrix = df[['feature1', 'feature2', 'feature3', 'target']].corr()

print("Correlation Matrix:")
print(correlation_matrix)

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix Heatmap')
plt.show()

In [None]:
# Identify highly correlated predictor pairs
# The target is excluded: a strong feature-target correlation is desirable,
# not a sign of multicollinearity
threshold = 0.8
predictor_corr = correlation_matrix.drop(index='target', columns='target')
high_correlation_pairs = []

for i in range(len(predictor_corr.columns)):
    for j in range(i+1, len(predictor_corr.columns)):
        if abs(predictor_corr.iloc[i, j]) > threshold:
            high_correlation_pairs.append({
                'Feature 1': predictor_corr.columns[i],
                'Feature 2': predictor_corr.columns[j],
                'Correlation': predictor_corr.iloc[i, j]
            })

if high_correlation_pairs:
    print(f"\nHighly Correlated Feature Pairs (|correlation| > {threshold}):")
    for pair in high_correlation_pairs:
        print(f"  {pair['Feature 1']} <-> {pair['Feature 2']}: {pair['Correlation']:.4f}")
else:
    print(f"\nNo predictor pairs with |correlation| > {threshold}")

## 7. Variance Inflation Factor (VIF)

VIF quantifies how much the variance of a coefficient estimate is inflated by that predictor's correlation with the other predictors. As a rule of thumb, a VIF above 10 indicates problematic multicollinearity.
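
For predictor j, the VIF is 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all the other predictors (with an intercept). The sketch below computes this directly with NumPy; `vif_by_hand` and the toy columns `a`, `b`, `c` are hypothetical names used only for illustration.

```python
import numpy as np

def vif_by_hand(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical toy predictors: b is nearly a copy of a, c is unrelated to both
rng = np.random.default_rng(3)
a = rng.normal(50, 10, size=500)
b = a + rng.normal(0, 2, size=500)
c = rng.normal(30, 5, size=500)

vifs_toy = vif_by_hand(np.column_stack([a, b, c]))
print(dict(zip(['a', 'b', 'c'], vifs_toy.round(2))))
```

The two nearly duplicate columns get large VIFs, while the unrelated column stays close to 1.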

In [None]:
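# Calculate VIF for each feature
# variance_inflation_factor does not add an intercept itself, so include a
# constant column; without it, features with non-zero means get inflated VIFs
from statsmodels.tools.tools import add_constant

X_vif = add_constant(df[['feature1', 'feature2', 'feature3']])

vif_data = pd.DataFrame()
vif_data['Feature'] = X_vif.columns
vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
# The VIF of the constant itself is not meaningful, so drop that row
vif_data = vif_data[vif_data['Feature'] != 'const'].reset_index(drop=True)

print("Variance Inflation Factor (VIF):")
print(vif_data)
print("\nInterpretation (rules of thumb):")
print("  VIF = 1: No correlation with the other predictors")
print("  VIF = 1-5: Moderate correlation")
print("  VIF = 5-10: High correlation")
print("  VIF > 10: Problematic multicollinearity")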

In [None]:
# Visualize VIF
plt.figure(figsize=(10, 6))
plt.bar(vif_data['Feature'], vif_data['VIF'])
plt.axhline(y=5, color='orange', linestyle='--', label='High correlation (VIF=5)')
plt.axhline(y=10, color='red', linestyle='--', label='Problematic (VIF=10)')
plt.xlabel('Features')
plt.ylabel('VIF')
plt.title('Variance Inflation Factor by Feature')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 8. Working with Categorical Features

Linear regression requires numerical inputs, so we need to encode categorical variables.
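
One-hot encoding creates one 0/1 column per category level. Because those columns always sum to 1, they are perfectly collinear with the intercept, so one level is usually dropped and becomes the reference level. The sketch below shows both encodings on a hypothetical four-row frame (`toy`).

```python
import pandas as pd

# Hypothetical toy frame with a three-level categorical column
toy = pd.DataFrame({'category': ['A', 'B', 'C', 'A']})

# Full one-hot encoding: one column per level; each row sums to 1
full = pd.get_dummies(toy, columns=['category'], dtype=int)
# drop_first=True removes the first level ('A'), which becomes the reference
reduced = pd.get_dummies(toy, columns=['category'], drop_first=True, dtype=int)

print(full)
print(reduced)
```

With the reduced encoding, a row of all zeros means category `A`, and each remaining coefficient is read as the difference from `A`.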

In [None]:
# One-Hot Encoding for categorical variables
print("Original categorical feature:")
print(df['category'].value_counts())

# Create dummy variables
df_encoded = pd.get_dummies(df, columns=['category'], drop_first=True, dtype=int)

print("\nDataset after One-Hot Encoding:")
print(df_encoded.head())
print(f"\nNew shape: {df_encoded.shape}")

In [None]:
# Model with categorical and continuous features
feature_cols = ['feature1', 'feature2', 'feature3', 'binary_feature']
# Add dummy variables if they exist
dummy_cols = [col for col in df_encoded.columns if col.startswith('category_')]
feature_cols.extend(dummy_cols)

X_mixed = df_encoded[feature_cols]
y = df_encoded['target']

X_train, X_test, y_train, y_test = train_test_split(X_mixed, y, test_size=0.2, random_state=42)

mixed_model = LinearRegression()
mixed_model.fit(X_train, y_train)
y_pred = mixed_model.predict(X_test)

print("Model with Categorical and Continuous Features:")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

print("\nCoefficients:")
coef_df = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': mixed_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(coef_df)

## 9. Feature Importance Analysis

In [None]:
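# Standardize features to compare coefficients
# Split first, then fit the scaler on the training set only,
# so that test-set statistics do not leak into the model
X_train, X_test, y_train, y_test = train_test_split(X_mixed, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

scaled_model = LinearRegression()
scaled_model.fit(X_train_scaled, y_train)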

# Feature importance based on standardized coefficients
importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': np.abs(scaled_model.coef_)
}).sort_values('Importance', ascending=False)

print("Feature Importance (based on standardized coefficients):")
print(importance_df)

In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Absolute Coefficient Value (Standardized)')
plt.ylabel('Feature')
plt.title('Feature Importance in Linear Regression')
plt.tight_layout()
plt.show()

## 10. Model Diagnostics

In [None]:
# Residual analysis
X_train, X_test, y_train, y_test = train_test_split(X_multiple, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

residuals_train = y_train - y_pred_train
residuals_test = y_test - y_pred_test

# Create diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Residuals vs Fitted
axes[0, 0].scatter(y_pred_test, residuals_test, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')

# 2. Histogram of Residuals
axes[0, 1].hist(residuals_test, bins=30, edgecolor='black')
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Residuals')

# 3. Q-Q Plot
from scipy import stats
stats.probplot(residuals_test, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot')

# 4. Scale-Location Plot
standardized_residuals = np.sqrt(np.abs(residuals_test / np.std(residuals_test)))
axes[1, 1].scatter(y_pred_test, standardized_residuals, alpha=0.5)
axes[1, 1].set_xlabel('Fitted Values')
axes[1, 1].set_ylabel('√|Standardized Residuals|')
axes[1, 1].set_title('Scale-Location Plot')

plt.tight_layout()
plt.show()

## 11. Summary and Next Steps

### Key Takeaways:

1. **Simple Linear Regression**: Uses one predictor to model the relationship with the target
2. **Multiple Linear Regression**: Uses several predictors at once; each coefficient is the effect of its feature with the others held constant
3. **Polynomial Terms**: Capture non-linear relationships in the data
4. **Interaction Terms**: Model the combined effect of features
5. **Multicollinearity**: Check correlation between predictors
6. **VIF**: Quantifies multicollinearity (VIF > 10 is problematic)
7. **Categorical Features**: Use one-hot encoding to include in regression models
8. **Model Diagnostics**: Residual plots help validate model assumptions

### For Your Capstone Project:

1. Load your actual dataset
2. Explore relationships between features and target
3. Check for multicollinearity using correlation matrix and VIF
4. Create meaningful polynomial and interaction terms
5. Handle categorical variables appropriately
6. Validate model assumptions using diagnostic plots
7. Document your findings and interpretations

### Next Week Preview:
In Week 2, we'll explore regularization techniques (Ridge, Lasso, Elastic Net) to improve model performance and handle multicollinearity.

## 12. Exercises for Your Dataset

Apply the following to your capstone project dataset:

1. **Data Exploration**
   - Load your dataset
   - Identify continuous and categorical features
   - Check for missing values and outliers

2. **Simple Analysis**
   - Create scatter plots of each feature vs target
   - Identify potential non-linear relationships

3. **Multicollinearity Check**
   - Calculate correlation matrix
   - Compute VIF for all features
   - Decide which features to keep/remove

4. **Model Building**
   - Build a baseline multiple regression model
   - Add polynomial terms where appropriate
   - Add interaction terms based on domain knowledge

5. **Evaluation**
   - Compare model performance metrics
   - Analyze residual plots
   - Interpret coefficients in context of your problem
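
As a possible starting point for the data exploration exercise, the sketch below splits columns by type, counts missing values, and flags outliers. The frame `data` is a hypothetical stand-in for your dataset, and the 1.5 × IQR rule is one common convention for outliers, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a capstone dataset
data = pd.DataFrame({
    'age': [23.0, 35.0, np.nan, 41.0, 29.0, 120.0],
    'income': [40.0, 52.0, 61.0, np.nan, 48.0, 50.0],
    'region': ['north', 'south', 'south', None, 'east', 'north'],
})

# Split columns by dtype
continuous_cols = data.select_dtypes(include='number').columns.tolist()
categorical_cols = data.select_dtypes(exclude='number').columns.tolist()

# Missing values per column
missing_counts = data.isna().sum()

# Flag outliers with the 1.5 * IQR rule; comparisons with NaN evaluate to False
q1 = data[continuous_cols].quantile(0.25)
q3 = data[continuous_cols].quantile(0.75)
iqr = q3 - q1
outlier_mask = (data[continuous_cols] < q1 - 1.5 * iqr) | (data[continuous_cols] > q3 + 1.5 * iqr)

print("Continuous:", continuous_cols)
print("Categorical:", categorical_cols)
print("Missing values:\n", missing_counts)
print("Outliers per column:\n", outlier_mask.sum())
```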

In [None]:
# Space for your capstone project work
# TODO: Replace sample data with your actual dataset
# TODO: Apply the concepts learned above to your data
# TODO: Document your findings and insights