# Linear Regression Modeling Guide

## How Linear Regression Works
Linear regression models the relationship between:
- **Independent variables** (features like age, smoking status)
- **Dependent variable** (medical charges) 

It does this by finding the coefficients (β) that minimize the sum of squared residuals:
```math
charges = β₀ + β₁(age) + β₂(smoker) + ... + ε
```
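As a minimal sketch of what "minimizing the sum of squared residuals" means, the block below fits the equation above with NumPy's least-squares solver. The data is synthetic, and the "true" coefficients (2000, 250, 20000) are made-up values chosen only for illustration.

```python
import numpy as np

# Synthetic data (hypothetical values): charges = 2000 + 250*age + 20000*smoker + noise
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, size=n).astype(float)
smoker = rng.integers(0, 2, size=n).astype(float)
charges = 2000 + 250 * age + 20000 * smoker + rng.normal(0, 500, size=n)

# Design matrix with a leading column of ones for the intercept β₀
X = np.column_stack([np.ones(n), age, smoker])

# Least-squares solution: the β that minimizes ||charges - Xβ||²
beta, *_ = np.linalg.lstsq(X, charges, rcond=None)

def sum_squared_residuals(b):
    """Sum of squared residuals for a candidate coefficient vector b."""
    residuals = charges - X @ b
    return float(residuals @ residuals)

print("β₀, β₁, β₂ =", np.round(beta, 1))
print("SSR at the fitted β:", round(sum_squared_residuals(beta), 1))
print("SSR after nudging β₁:", round(sum_squared_residuals(beta + np.array([0.0, 5.0, 0.0])), 1))
```

Any change to the fitted coefficients, such as the nudge in the last line, gives a larger sum of squared residuals.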
## Step-by-Step Implementation
1. Data Preparation
```python
import pandas as pd

# One-hot encode categorical variables; skip this step if it has already been done
df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)

# Separate features (X) and target (y)
X = df.drop(columns=['charges'])
y = df['charges']
```
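To see which columns the encoding step produces, here is a small sketch on made-up rows that reuse the insurance column names. The values are invented. `dtype=float` is an optional choice made here so the dummy columns are numeric rather than boolean.

```python
import pandas as pd

# Toy rows with the same column names as the insurance data (values are made up)
df = pd.DataFrame({
    "age": [19, 45, 33, 52],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northeast", "northwest", "southeast"],
    "charges": [16884.92, 7500.00, 4449.46, 30000.00],
})

# drop_first=True drops the first category of each variable (female, no, northeast),
# which becomes the baseline the remaining dummy columns are compared against
encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True, dtype=float)

X = encoded.drop(columns=["charges"])
y = encoded["charges"]
print(list(X.columns))
```

The `smoker_yes` column created here is the one the interpretation step later looks up by name.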
2. Train-Test Split with Standardization
- **Why Standardize?**
    - Standardization transforms features to:
        - **Mean = 0**
        - **Standard Deviation = 1**

- **Key Benefits**
    - Fair Feature Comparison: Prevents variables with larger scales (e.g., age 0-100) from dominating those with smaller scales (e.g., BMI 15-40)
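    - Model Performance: Critical for distance-based algorithms (SVM, KNN) and for regularized regression (Ridge, Lasso); helps gradient descent converge faster. Plain least-squares predictions do not change with feature scale.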
    - Interpretable Coefficients: Enables direct comparison of feature importance weights

- **Common Scaling Techniques**

    | Technique          | Formula                  | Best For                 |
    |--------------------|--------------------------|--------------------------|
    | **Z-score**        | (x - μ)/σ               | Most regression models   |
    | **Min-Max**        | (x - min)/(max - min)    | Neural networks          |
    | **Robust Scaling** | (x - median)/IQR         | Data with outliers       |

- **Best Practices**
    - Fit Standardizer on Training Data Only: Calculate mean/std from training set; apply same transformation to test set
    - Leave the Target Variable Unscaled (in this regression): predictions and coefficients then stay in dollars
    - Handle New Data: Use saved standardization parameters in production
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split data (80% train, 20% test)
# random_state ensures reproducible splits (sets the seed for shuffling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
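The three techniques from the table can be compared side by side. This is a sketch on a made-up BMI-like column containing one extreme value; the numbers are invented. Each scaler is fit on the training column only and then reused for an unseen value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical BMI-like training column with one extreme value (60)
x_train = np.array([[18.0], [22.0], [25.0], [27.0], [30.0], [60.0]])
x_new = np.array([[24.0]])  # unseen value, transformed with training statistics only

results = {}
for name, scaler in [
    ("Z-score", StandardScaler()),
    ("Min-Max", MinMaxScaler()),
    ("Robust", RobustScaler()),
]:
    scaled_train = scaler.fit_transform(x_train)  # statistics learned from training data
    scaled_new = scaler.transform(x_new)          # same statistics reused, no refit
    results[name] = (scaled_train, scaled_new)
    print(f"{name:8s} train={np.round(scaled_train.ravel(), 2)} new={np.round(scaled_new.ravel(), 2)}")
```

The printed rows show how strongly the extreme value compresses the rest of the column under each technique.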
3. Modeling
```python
from sklearn.linear_model import LinearRegression

# Initialize and train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
```
4. Model Evaluation

Success Criteria:
- R² > 0.75
- All VIF < 5

**R² (Coefficient of Determination):**  
Measures the proportion of variance in the dependent variable explained by the independent variables. On training data R² lies between 0 and 1; on held-out data it can be negative when the model predicts worse than the mean. Higher values indicate a better fit.

**VIF (Variance Inflation Factor):**  
Quantifies multicollinearity among features. A VIF below 5 suggests low correlation between predictors, indicating stable and reliable coefficient estimates.
```python
from sklearn.metrics import r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate R²
y_pred = model.predict(X_test_scaled)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")

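# Check VIF on the standardized features: statsmodels does not add an intercept,
# so the columns need to be centered for the VIF values to be meaningful
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns
vif_data["VIF"] = [variance_inflation_factor(X_train_scaled, i) for i in range(X_train_scaled.shape[1])]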
print(vif_data)
```
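Both metrics can also be computed by hand, which shows what they measure. The helper names `r_squared` and `vif` and all data values below are hypothetical and used only for this sketch. R² is 1 - SS_res / SS_tot, and the VIF of feature i is 1 / (1 - R²ᵢ), where R²ᵢ comes from regressing feature i on the remaining features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def r_squared(y, y_hat):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def vif(X, i):
    """VIF_i = 1 / (1 - R²_i), with R²_i from regressing column i on the other columns."""
    others = np.delete(X, i, axis=1)
    r2_i = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2_i)

y_true = np.array([3000.0, 12000.0, 8000.0, 25000.0, 40000.0])  # hypothetical charges
y_good = np.array([3500.0, 11000.0, 9000.0, 24000.0, 38000.0])  # close to the truth
y_bad = np.array([40000.0, 3000.0, 25000.0, 8000.0, 12000.0])   # worse than predicting the mean

print("R² (good predictions):", round(r_squared(y_true, y_good), 3))
print("R² (bad predictions): ", round(r_squared(y_true, y_bad), 3))
print("Matches sklearn:", np.isclose(r_squared(y_true, y_good), r2_score(y_true, y_good)))

# Three features: age, bmi, and a noisy near-copy of bmi that creates multicollinearity
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=300).astype(float)
bmi = rng.normal(30, 6, size=300)
bmi_copy = bmi + rng.normal(0, 1, size=300)
X = np.column_stack([age, bmi, bmi_copy])

vifs = [vif(X, i) for i in range(X.shape[1])]
for name, value in zip(["age", "bmi", "bmi_copy"], vifs):
    print(f"VIF {name}: {value:.1f}")
```

In this sketch the two nearly identical BMI columns inflate each other's VIF, while the unrelated age column stays close to 1.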
5. Model Interpretation
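A. Dollar-Value Conversion: translates regression coefficients into real-world monetary impact by taking the raw coefficient values from the model and converting them to annual USD amounts. Because the model was fit on standardized features, each raw coefficient is the change in charges per one standard deviation of its feature; dividing by the scaler's per-feature standard deviation gives the change per original unit.
```python
# Coefficients of a model fit on standardized features are dollars per one standard deviation.
# Dividing by scaler.scale_ converts them to dollars per original unit (e.g., per year of age).
per_unit_coef = model.coef_ / scaler.scale_

coefficients = pd.DataFrame({
    'feature': X_train.columns,
    'impact': model.coef_,
    'dollar_impact': [f"${x:,.2f}" for x in per_unit_coef]
})
```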
B. Smoking Impact Statement
```python
smoking_effect = coefficients[coefficients['feature'] == 'smoker_yes']
print(f"Smoking increases annual costs by {smoking_effect['dollar_impact'].values[0]}")
```
C. Feature Importance
```python
# Rank features by the magnitude of their standardized coefficient
coefficients['abs_impact'] = coefficients['impact'].abs()
print(coefficients.sort_values('abs_impact', ascending=False))
```



In [None]:
# Load prepared and preprocessed data
import pandas as pd

df = pd.read_csv('../data/insurance_cleaned.csv')

In [None]:
# Todo: Split features/target


In [None]:
# Todo: Standardize features


In [None]:
# Todo: Train model using linear regression


In [None]:
# Todo: Evaluate and check R2 and VIF


In [None]:
# Todo: Model Interpretation: e.g. dollar interpretation and feature importance

In [None]:
# Save model artifacts
import joblib

joblib.dump(model, '../models/insurance_pricing_model.pkl')
joblib.dump(scaler, '../models/scaler.pkl')

print("\nModel saved for production use!")
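The saved artifacts might later be reloaded along these lines. This is a sketch under stated assumptions: the training rows are made-up stand-ins, a temporary directory replaces the `../models/` folder, and only the file names follow the cell above. The key point is that new data is transformed with the saved scaler and the scaler is never refit.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Stand-in training data (hypothetical): columns are age and a smoker flag
X_train = np.array([[25, 0], [40, 1], [33, 0], [58, 1], [47, 0], [19, 1]], dtype=float)
y_train = np.array([3000, 28000, 5000, 40000, 9000, 17000], dtype=float)

scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "insurance_pricing_model.pkl"
    scaler_path = Path(tmp) / "scaler.pkl"
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)

    # "Production" side: reload both artifacts and reuse the saved scaling parameters
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

    X_new = np.array([[52, 1]], dtype=float)  # a new applicant (made-up values)
    prediction = loaded_model.predict(loaded_scaler.transform(X_new))

print(f"Predicted annual charges: ${prediction[0]:,.2f}")
```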


## Decision Trees for Medical Cost Prediction
### Why Consider Decision Trees?
| Feature           | Linear Regression         | Decision Tree                |
|-------------------|--------------------------|------------------------------|
| Interpretability  | High (coefficients)      | Moderate (tree paths)        |
| Non-linearity     | Needs feature engineering| Handles automatically        |
| Outliers          | Sensitive                | Robust to feature outliers   |
| Categorical Vars  | Requires one-hot encoding | Needs numeric codes in scikit-learn (label encoding suffices) |
### Implementation Steps
1. Data Prep Differences
- No need to standardize features (trees are scale-invariant)
- Handle missing values (scikit-learn trees accept np.nan natively from version 1.3; impute on older versions)
- Keep categoricals ordinal (label encoding suffices)

2. Model Training
```python
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    max_depth=5,       # Control complexity
    min_samples_leaf=50 # Prevent overfitting
)
tree.fit(X_train, y_train)  # No scaling needed
```
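The claim that trees need no scaling can be checked directly. The sketch below uses synthetic data with made-up coefficients, fits the same tree on raw and on standardized features, and compares the predictions on the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic insurance-like data (all values hypothetical)
rng = np.random.default_rng(1)
n = 400
age = rng.integers(18, 65, size=n).astype(float)
bmi = np.round(rng.normal(30, 6, size=n), 1)
smoker = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([age, bmi, smoker])
y = 2000 + 250 * age + 20000 * smoker + rng.normal(0, 1000, size=n)

X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed for both fits, so only the feature scale differs
params = dict(max_depth=5, min_samples_leaf=50, random_state=0)
tree_raw = DecisionTreeRegressor(**params).fit(X, y)
tree_scaled = DecisionTreeRegressor(**params).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)
print("Same predictions with and without scaling:", np.allclose(pred_raw, pred_scaled))
```

Standardization preserves the ordering of each feature's values, so the tree partitions the samples the same way; only the split thresholds are expressed in different units.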
3. Key Hyperparameters
| Parameter         | Purpose                  | Insurance Use Case         |
|-------------------|--------------------------|----------------------------|
| max_depth         | Limits tree layers       | 4-6 for interpretability   |
| min_samples_leaf  | Minimum data per leaf    | ≥50 for stable groups      |
| ccp_alpha         | Cost-complexity pruning  | Optimize via CV            |
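One possible way to "optimize via CV" for `ccp_alpha` is sketched below on synthetic data. The candidate grid is an assumption made for this example: eight quantiles of the pruning path from a fully grown tree, searched with 5-fold cross-validation on MAE.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic insurance-like data (all values hypothetical)
rng = np.random.default_rng(2)
n = 600
age = rng.integers(18, 65, size=n).astype(float)
smoker = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([age, smoker])
y = 2000 + 250 * age + 20000 * smoker + rng.normal(0, 2000, size=n)

# Candidate alphas come from the cost-complexity pruning path of a fully grown tree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
quantiles = np.quantile(path.ccp_alphas, np.linspace(0, 1, 8))
candidate_alphas = np.unique(np.clip(quantiles, 0, None))  # guard against tiny negative values

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
    scoring="neg_mean_absolute_error",  # MAE in dollars, negated so that higher is better
)
search.fit(X, y)

best_alpha = search.best_params_["ccp_alpha"]
print("Best ccp_alpha:", best_alpha)
print("Leaves in the pruned tree:", search.best_estimator_.get_n_leaves())
print("Cross-validated MAE:", round(-search.best_score_, 2))
```

Larger values of `ccp_alpha` prune more aggressively, so the search trades tree size against cross-validated error.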
4. Interpretation Tools
A. Feature Importance
Shows which factors (age, smoking, etc.) most influence costs

```python
pd.DataFrame({
    'feature': X_train.columns,
    'importance': tree.feature_importances_
}).sort_values('importance', ascending=False)
```
B. Tree Visualization
```python
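import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
# Use the training columns, passed as a plain list of names
plot_tree(tree, feature_names=list(X_train.columns), filled=True)
plt.show()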
```
Example path:
Smoker? → Yes → Age ≥ 45 → $28,000
(Concrete pricing rules)
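Paths like the one above can also be read off as plain text with scikit-learn's `export_text`. This sketch uses synthetic data and a shallow tree; the feature names and dollar figures are made up, so the printed thresholds will not match the example path.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic insurance-like data (all values hypothetical)
rng = np.random.default_rng(4)
n = 500
age = rng.integers(18, 65, size=n).astype(float)
bmi = np.round(rng.normal(30, 6, size=n), 1)
smoker = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([age, bmi, smoker])
y = 2000 + 250 * age + 20000 * smoker + rng.normal(0, 1000, size=n)

feature_names = ["age", "bmi", "smoker_yes"]
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50, random_state=0).fit(X, y)

# Each root-to-leaf path is one pricing rule; "value" is the mean charge in that leaf
rules = export_text(tree, feature_names=feature_names)
print(rules)
```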

5. Pros/Cons for Insurance

✓ Advantages
- Automatic non-linear pattern detection
- Native interaction effects (e.g., smoking × age)
- Explainable rules for regulators

✗ Limitations
- Tendency to overfit without tuning
- Coarser dollar estimates than regression (every case in a leaf receives the same prediction)
- Instability with small data changes

### When to Choose Over Regression
- Non-linear cost relationships exist
- Interaction terms are suspected (e.g., smoking + obesity)
- Interpretable rules are prioritized over exact $ amounts

### Performance Metrics
- Maintain R² > 0.7 (similar to linear regression)
- Monitor MAE (Mean Absolute Error) in dollars
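- Use cross-validation for stability checks (out-of-bag error is only available for bagged ensembles such as random forests, not for a single tree)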
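A sketch of how these checks might be computed for a single tree is shown below, on synthetic data with made-up coefficients. Stability is assessed here with 5-fold cross-validation, since a lone decision tree has no out-of-bag samples.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic insurance-like data (all values hypothetical)
rng = np.random.default_rng(5)
n = 1000
age = rng.integers(18, 65, size=n).astype(float)
smoker = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([age, smoker])
y = 2000 + 250 * age + 20000 * smoker + rng.normal(0, 2000, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=50, random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)  # average absolute error in dollars

# Spread of R² across folds: a large spread suggests an unstable tree
cv_r2 = cross_val_score(tree, X_train, y_train, cv=5, scoring="r2")

print(f"Test R²: {r2:.3f}")
print(f"Test MAE: ${mae:,.2f}")
print(f"CV R²: mean={cv_r2.mean():.3f}, std={cv_r2.std():.3f}")
```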



In [None]:
# Todo: **Data Preparation**
#     - [ ] Load the cleaned insurance dataset.
#     - [ ] Split data into features (`X`) and target (`y`).
#     - [ ] Perform train-test split (e.g., 80% train, 20% test).
#     - [ ] Encode categorical variables (label encoding if needed).

In [None]:
# Todo:  **Model Training**
#     - [ ] Import and initialize `DecisionTreeRegressor` with reasonable hyperparameters (`max_depth`, `min_samples_leaf`).
#     - [ ] Fit the model on the training data.

In [None]:
# Todo: **Model Evaluation**
#     - [ ] Predict on the test set.
#     - [ ] Calculate R² and MAE (Mean Absolute Error) for model performance.
#     - [ ] Compare results to linear regression baseline.

In [None]:
# Todo: **Interpretation**
#     - [ ] Display feature importances.
#     - [ ] Visualize the tree structure (optional, for small trees).
#     - [ ] Summarize key decision rules (e.g., "If smoker and age > 45, then...").

In [None]:
# Todo: **(Optional) Model Saving**
#     - [ ] Save the trained tree model for future use.

## Alternative ML Models for Medical Cost Prediction (Optional)
| Model                          | Why Use It?                                                                                           | Key Hyperparameters / Implementation Example                                                                                                                                                                                                                                   | Reference                                                                                                                                                                                                                      |
|---------------------------------|-------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Gradient Boosting (XGBoost/LightGBM)** | - Handles non-linear relationships<br>- Captures feature interactions<br>- Robust to outliers        | `from xgboost import XGBRegressor`<br>`model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)`                                                                                                                        | [XGBoost Docs](https://xgboost.readthedocs.io/en/stable/)<br>Chen & Guestrin (2016) KDD Paper                                                                                           |
| **Random Forest**               | - Reduces overfitting (ensemble)<br>- Feature importance<br>- Works with mixed data types              | `from sklearn.ensemble import RandomForestRegressor`<br>`rf = RandomForestRegressor(n_estimators=100, max_features='sqrt')`                                                                                                            | [Scikit-learn RF Docs](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)<br>Breiman (2001) ML Journal                                                  |
| **Neural Networks (PyTorch)**   | - Captures complex non-linear patterns<br>- Handles high-dimensional data<br>- GPU-accelerated        | `import torch, torch.nn as nn`<br>`class InsuranceNet(nn.Module): ...`<br>(full architecture not shown here)                                                                                                                           | [PyTorch Tutorials](https://pytorch.org/tutorials/)<br>Goodfellow et al. (2016) Deep Learning Ch. 6                                                                                     |
| **Support Vector Regression (SVR)** | - Effective in high-dimensional spaces<br>- Robust to outliers with proper kernel                     | `from sklearn.svm import SVR`<br>`svr = SVR(kernel='rbf', C=100, epsilon=0.1)`                                                                                                                                                        | Smola & Schölkopf (2004) Tutorial<br>[Scikit-learn SVR Docs](https://scikit-learn.org/stable/modules/svm.html#regression)                                                               |

<br>

**Model Selection Guide**

| Model              | Best When...                        | Computational Cost | Interpretability |
|--------------------|-------------------------------------|--------------------|------------------|
| Linear Regression  | Linear relationships exist          | Low                | High             |
| Decision Trees     | Non-linearities/interactions        | Medium             | Medium           |
| XGBoost            | Large dataset, need accuracy        | High               | Medium           |
| Neural Nets        | Very complex patterns               | Very High          | Low              |
