# Insurance Claim Risk Classification

## 🎯 Problem Statement  
Develop a binary classifier to predict whether an insurance applicant will file **high-cost claims** (above \$10,000/year) based on their demographic and health profile. This helps insurers:  
1. Flag high-risk applicants for additional medical underwriting  
2. Design tiered insurance plans  
3. Optimize risk pools  

**Key Difference from Regression**:  
While the regression model predicts exact costs, this classifier identifies risk categories for decision automation.

## 📊 Target Variable Engineering  
Create a binary label from the existing `charges` column:  
```python
df['high_risk'] = (df['charges'] > 10000).astype(int)  # Threshold based on 75th percentile
```
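
The comment above ties the \$10,000 threshold to the 75th percentile of `charges`. A minimal sketch for checking that assumption and the resulting class balance is shown here; the synthetic, right-skewed `charges` values are a stand-in for the real column, so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real `charges` column (assumption, not project data)
rng = np.random.default_rng(0)
df = pd.DataFrame({"charges": rng.lognormal(mean=9.0, sigma=0.8, size=1_000)})

threshold = 10_000
df["high_risk"] = (df["charges"] > threshold).astype(int)

p75 = df["charges"].quantile(0.75)       # empirical 75th percentile
positive_rate = df["high_risk"].mean()   # share of rows labelled high-risk

print(f"75th percentile of charges: {p75:,.0f}")
print(f"Share labelled high_risk at {threshold:,}: {positive_rate:.1%}")
```

If the printed percentile is far from the threshold, either the threshold or the comment probably needs updating.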

## Proposed Analysis Flow
### 1. Baseline Classification

**Models to Compare**:

| Model               | Pros                                      | Cons                                      | Best Use Cases                          |
|---------------------|------------------------------------------|------------------------------------------|-----------------------------------------|
| **Logistic Regression** | - Simple to implement<br>- Highly interpretable<br>- Fast training time | - Linear decision boundary<br>- Struggles with complex patterns<br>- Requires feature scaling | Baseline models<br>When interpretability is critical<br>Low-dimensional data |
| **Decision Tree**   | - No need for feature scaling<br>- Handles non-linear relationships<br>- Interpretable rules | - Prone to overfitting<br>- Unstable (small changes affect tree)<br>- Poor generalization | Rule-based systems<br>Feature importance analysis<br>Data with mixed types |
| **Random Forest**   | - Reduces overfitting vs single trees<br>- Handles high dimensions well<br>- Feature importance scores | - Less interpretable than single trees<br>- Slower prediction time<br>- Memory intensive | Medium-large datasets<br>When accuracy > interpretability<br>Non-linear relationships |
| **XGBoost**        | - Strong accuracy on tabular data<br>- Built-in regularization<br>- Handles missing values | - Complex hyperparameter tuning<br>- Computationally expensive<br>- Black box nature | Competitions (e.g. Kaggle)<br>Imbalanced datasets<br>When performance is critical |
| **LightGBM**       | - **Faster training** than XGBoost<br>- **Lower memory** usage<br>- Handles categorical features natively | - More prone to overfitting on small data<br>- Less robust to noisy data | **Large datasets**<br>**Real-time applications**<br>When speed is critical |
| **SVM**            | - Effective in high dimensions<br>- Versatile kernel options<br>- Good for small datasets | - Computationally heavy<br>- Difficult to tune<br>- Poor scalability | Small-medium datasets<br>Clear margin separation<br>Text classification |
| **Naive Bayes**    | - Extremely fast<br>- Works well with small data<br>- Low memory usage | - Strong independence assumptions<br>- Poor with correlated features<br>- Underfitting risk | Text classification<br>Real-time predictions<br>As a baseline model |
| **Neural Networks** | - Handles complex patterns<br>- Feature extraction capability<br>- State-of-the-art performance | - Requires large data<br>- Computationally expensive<br>- Black box | Image/audio/text data<br>When other models plateau<br>With sufficient resources |

**Key Selection Criteria** (rough rules of thumb for tabular data; actual rankings depend on dataset size and tuning):

- **Interpretability**:  
  `Logistic Regression > Decision Tree > Random Forest > XGBoost > Neural Nets`  

- **Training Speed**:  
  `Naive Bayes > Logistic Regression > Decision Tree > Random Forest > XGBoost`; among the boosting libraries, LightGBM typically trains faster than XGBoost  

- **Accuracy**:  
  `XGBoost ≈ LightGBM > Random Forest`; neural networks become competitive only with enough data  

- **Memory Efficiency**:  
  `LightGBM > Random Forest > XGBoost`  

- **Categorical Data**:  
  LightGBM handles categorical features natively; the other models listed generally require encoding  

- **Handling Non-linearity**:  
  `Neural Nets > XGBoost > SVM > Random Forest > Decision Tree`
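
Rankings like these are best confirmed on the data at hand. The sketch below compares three of the candidate models with 5-fold cross-validated F1; the synthetic dataset from `make_classification` and the hyperparameters are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in for the insurance features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4,
    weights=[0.75, 0.25], random_state=42)

candidates = {
    # Logistic regression is scale-sensitive, so it gets a scaler
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, clf in candidates.items():
    f1 = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1")
    scores[name] = f1.mean()
    print(f"{name:<20} mean F1 = {f1.mean():.3f} (+/- {f1.std():.3f})")
```

The same loop can be extended with XGBoost or LightGBM classifiers if those packages are installed.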

## 🛠️ Exploratory Data Analysis (EDA)
Example: compare smokers with non-smokers. The cross-tabulation below gives, for each smoking status, the percentage of applicants in each `high_risk` class (each row sums to 100).

```python
pd.crosstab(df['smoker'], df['high_risk'], normalize='index')*100
```

**Metrics**:
```python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```

### 2. Feature Importance Analysis (Notebook Section 2)

Identify top risk drivers using:
```python
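import shap
# Assumes `model` is a fitted tree-based estimator (e.g. a random forest);
# TreeExplainer does not accept a scikit-learn Pipeline directly.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)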
```

### 3. Business Rule Extraction

Convert model insights into underwriting rules:
```python
from sklearn.tree import export_text
print(export_text(decision_tree, feature_names=list(X.columns)))  # feature_names expects a list
```
Example rule:  
`IF smoker=YES AND BMI>30 THEN high_risk_prob=82%`

## Expected Deliverables
1. Model Performance Report: 
- Precision/recall for high-risk class

2. Risk Probability Calculator:
```python
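def risk_calculator(age, bmi, smoker, children=0):
    # Build a named row so values line up with the training columns (assumes `model` was fit on `X`)
    row = pd.DataFrame([{'age': age, 'bmi': bmi, 'children': children, 'smoker_yes': int(smoker)}])
    return model.predict_proba(row[X.columns])[0][1]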
```
3. Underwriting Recommendations: 
- List of high-risk combinations to screen




# Introduction to Loss Functions in Classification

## What is a Loss Function?
A loss function measures how wrong a model's predictions are compared to the true values. It gives the model a way to calculate its mistakes and improve during training.

## Key Properties of Loss Functions
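- **Differentiable**: Gradient-based models (logistic regression, neural networks, gradient boosting) need a loss whose gradients can be computed; tree split criteria such as Gini impurity are an exception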
- **Task-Specific**: Different problems need different loss types
- **Robustness**: Some handle unusual data points better than others

## Common Loss Functions

### Cross-Entropy Loss
Used by logistic regression and neural networks. It heavily penalizes confident but wrong predictions.
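
To make the "confident but wrong" penalty concrete, here is a small sketch of binary cross-entropy in plain Python. The function name and the clipping constant `eps` are illustrative choices, not part of any library API.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over paired labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# True label is 1 in both cases; only the predicted probability differs
mild = binary_cross_entropy([1], [0.4])              # unsure and wrong
confident_wrong = binary_cross_entropy([1], [0.01])  # confident and wrong

print(f"p=0.40 -> loss {mild:.3f}")
print(f"p=0.01 -> loss {confident_wrong:.3f}")
```

The second loss is roughly five times the first, even though both predictions fall on the wrong side of 0.5.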

### Gini Impurity
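Used as the split criterion by decision trees and random forests rather than as a loss minimized by gradient descent. It measures how often a randomly drawn sample from a node would be misclassified if it were labeled at random according to the node's class distribution.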
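
As a quick illustration, Gini impurity for a node can be computed as one minus the sum of squared class proportions. The helper below is a sketch written for this guide, not a library function.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini_impurity([1, 1, 1, 1])    # single class, no impurity
mixed = gini_impurity([0, 0, 1, 1])   # 50/50 split, maximum for two classes
skewed = gini_impurity([0, 0, 0, 1])  # 75/25 split

print(pure, mixed, skewed)
```

A tree picks the split that most reduces the weighted impurity of the resulting child nodes.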

### Hinge Loss
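Used by support vector machines (SVMs). It is zero for points classified correctly with a margin of at least 1 and grows linearly for points inside the margin or on the wrong side, so only those points influence the fit.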
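
A minimal sketch of hinge loss follows, assuming labels encoded as -1/+1 and raw decision-function scores (not probabilities). The function name is illustrative.

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss; labels in {-1, +1}, scores are raw decision values."""
    losses = [max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)]
    return sum(losses) / len(losses)

far_correct = hinge_loss([1], [2.5])     # beyond the margin: no loss
inside_margin = hinge_loss([1], [0.3])   # correct side, but within the margin
wrong_side = hinge_loss([1], [-1.0])     # misclassified

print(far_correct, inside_margin, wrong_side)
```

Points that are already well beyond the margin contribute nothing, which is why only points near or across the boundary shape the solution.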

### Focal Loss
Specialized for imbalanced data. Pays more attention to hard-to-classify examples.
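
One way to see the effect is to compare focal loss with plain cross-entropy on an easy and a hard example. The sketch below uses the commonly cited form `-(1 - p_t)^gamma * log(p_t)`; the class-balancing alpha term is omitted for brevity, and `gamma=2.0` is an illustrative default rather than a recommendation.

```python
import math

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    """Focal loss for one example; y is 0/1 and p is the predicted P(y=1)."""
    p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
    p_t = p if y == 1 else 1 - p    # probability assigned to the true class
    return -((1 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(1, 0.9)                # well-classified example
hard = focal_loss(1, 0.1)                # badly misclassified example
ce_easy = focal_loss(1, 0.9, gamma=0.0)  # gamma=0 reduces to cross-entropy

print(f"easy: focal {easy:.5f} vs cross-entropy {ce_easy:.5f}")
print(f"hard: focal {hard:.3f}")
```

The easy example's loss shrinks by a factor of 100 relative to cross-entropy, while the hard example keeps most of its weight.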

## How Models Choose Loss
Most library implementations default to a suitable loss function based on:
- Number of classes (binary vs multiclass)
- Type of problem (classification vs regression)
- Model architecture (trees vs neural networks)

## Practical Recommendations
1. Start with default loss functions - they usually work well
2. For imbalanced data, use weighted or specialized losses
3. For custom needs, some models allow creating your own loss
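
For recommendation 2, a common weighting heuristic is inverse class frequency, `n_samples / (n_classes * class_count)`, which is what scikit-learn's `class_weight='balanced'` option computes. The helper name and the toy labels below are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels: 8 low-risk (0) and 2 high-risk (1) applicants
y_toy = [0] * 8 + [1] * 2
weights = balanced_class_weights(y_toy)
print(weights)
```

Each high-risk example then counts four times as much as a low-risk one, so both classes contribute equally to the total loss.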


# Classification Evaluation Metrics

## Core Metrics

### 1. **Accuracy**
```
Accuracy = (TP + TN) / (TP + TN + FP + FN)
```
- **Pros**: Easy to interpret
- **Cons**: Misleading for imbalanced datasets
- **When to use**: Balanced classes, quick sanity check

### 2. **Precision**
```
Precision = TP / (TP + FP)
```
- Measures: "How many selected items are relevant?"
- Focus: False positives
- **Use case**: Spam detection (minimize false positives)

### 3. **Recall (Sensitivity)**
```
Recall = TP / (TP + FN)
```
- Measures: "How many relevant items are selected?"
- Focus: False negatives
- **Use case**: Cancer detection (minimize false negatives)

### 4. **F1-Score**
```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```
- Harmonic mean of precision and recall
- **Best for**: Imbalanced datasets
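
The four formulas above can be checked on a small worked example. The counts below are hypothetical: 1,000 applicants, 50 of them truly high-risk, and a model that finds only 10 of those.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for an imbalanced problem (5% positives)
metrics = classification_metrics(tp=10, tn=940, fp=10, fn=40)
for name, value in metrics.items():
    print(f"{name:<9} {value:.3f}")
```

Accuracy comes out at 0.95 while recall is only 0.20, which is the kind of gap that makes accuracy misleading on imbalanced data.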

## Advanced Metrics

### 5. **ROC-AUC**
- Measures model's ability to distinguish classes
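- **Range**: 0 to 1, where 0.5 corresponds to random ranking and 1.0 to perfect ranking; values below 0.5 mean the ranking is worse than random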
- **Good for**: Binary classification with balanced data
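
ROC-AUC can be read as the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The pairwise sketch below computes it directly from that definition; it is quadratic in the number of samples and meant for illustration, not for large datasets.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(y_demo, scores_demo)
auc_inverted = roc_auc_pairwise(y_demo, [1 - s for s in scores_demo])
print(auc, auc_inverted)
```

Inverting the scores flips the AUC around 0.5, which shows that values below 0.5 are possible.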

### 6. **Precision-Recall Curve**
- Better than ROC for imbalanced data
- Shows tradeoff between precision/recall
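
The tradeoff can be seen by sweeping the decision threshold over a handful of scores. The labels, scores, and thresholds below are made up for illustration; precision is set to 1.0 by convention when nothing is predicted positive.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_demo = [0, 0, 1, 0, 1, 1]
scores_demo = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

curve = {thr: precision_recall_at(y_demo, scores_demo, thr) for thr in (0.3, 0.5, 0.8)}
for thr, (precision, recall) in curve.items():
    print(f"threshold {thr:.1f}: precision {precision:.2f}, recall {recall:.2f}")
```

Raising the threshold trades recall for precision; which point to pick depends on the relative cost of false positives and false negatives.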

### 7. **Confusion Matrix**
|                     | Predicted Negative | Predicted Positive |
|---------------------|--------------------|--------------------|
| **Actual Negative** | True Negative (TN) | False Positive (FP)|
| **Actual Positive** | False Negative (FN)| True Positive (TP) |
- Visualizes all error types
- Foundation for other metrics
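
The four cells can be counted directly from label pairs. This sketch returns them in the order `(tn, fp, fn, tp)`, the same order that `sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()` yields for binary labels; the toy labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

y_true_demo = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```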

## Special Cases

### For Multi-class:
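- **Macro-average**: Computes the metric per class, then takes the unweighted mean, so every class counts equally
- **Micro-average**: Sums TP/FP/FN over all classes first, then computes the metric, so frequent classes dominate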
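
A small example shows how far the two averages can diverge. The per-class counts below are hypothetical: one frequent class predicted well and two rare classes predicted poorly.

```python
# Hypothetical per-class true-positive and false-positive counts
per_class = {
    "A": {"tp": 90, "fp": 10},  # frequent class, precision 0.9
    "B": {"tp": 5, "fp": 5},    # rare class, precision 0.5
    "C": {"tp": 1, "fp": 9},    # rare class, precision 0.1
}

# Macro: average the per-class precisions
class_precisions = [c["tp"] / (c["tp"] + c["fp"]) for c in per_class.values()]
macro_precision = sum(class_precisions) / len(class_precisions)

# Micro: pool the counts, then compute precision once
total_tp = sum(c["tp"] for c in per_class.values())
total_fp = sum(c["fp"] for c in per_class.values())
micro_precision = total_tp / (total_tp + total_fp)

print(f"macro precision: {macro_precision:.2f}")
print(f"micro precision: {micro_precision:.2f}")
```

Macro precision is 0.50 and micro precision is 0.80 here, because the micro average is dominated by the frequent class.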

### For Imbalanced Data:
- **Cohen's Kappa**: Measures agreement accounting for chance
- **MCC (Matthews Correlation)**: Balanced measure (-1 to +1)
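
Both measures can be computed from the confusion-matrix counts. The sketch below uses the standard binary formulas and the same hypothetical counts as an imbalanced screening problem (1,000 applicants, 50 positives); the function names are illustrative.

```python
import math

def matthews_corrcoef_counts(tp, tn, fp, fn):
    """MCC from counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cohen_kappa_counts(tp, tn, fp, fn):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - chance) / (1 - chance) if chance != 1 else 0.0

mcc = matthews_corrcoef_counts(tp=10, tn=940, fp=10, fn=40)
kappa = cohen_kappa_counts(tp=10, tn=940, fp=10, fn=40)
print(f"MCC = {mcc:.3f}, kappa = {kappa:.3f}")
```

Both values land near 0.3 even though accuracy for these counts is 0.95, reflecting how little of the agreement goes beyond what the class imbalance alone would produce.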

## Metric Selection Guide

| Scenario                  | Recommended Metrics                  |
|---------------------------|--------------------------------------|
| Balanced binary class     | Accuracy, ROC-AUC                   |
| Imbalanced binary class   | F1, Precision-Recall Curve          |
| Multi-class               | Macro-F1, Weighted Accuracy         |
| High FP cost              | Precision                           |
| High FN cost              | Recall                              |
| Comprehensive evaluation  | Confusion Matrix + Multiple Metrics |

# Classification Model Implementation Guide

## 1. Data Preparation
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Define features and target
X = df[['age', 'bmi', 'smoker', 'children']]  # Features
y = df['high_risk']  # Target (binary)

# One-hot encode categoricals (smoker: yes/no → smoker_yes).
# Note: get_dummies appends encoded columns last, so the column order
# becomes age, bmi, children, smoker_yes.
X = pd.get_dummies(X, drop_first=True)

# Split data (70% train, 30% test), preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```
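
Because `pd.get_dummies` moves encoded columns to the end, the column order after encoding differs from the order in which the features were selected. The sketch below demonstrates this on a tiny hypothetical frame and shows one way to build new rows by name so they always match the training layout.

```python
import pandas as pd

# Tiny hypothetical frame mirroring the feature columns used in this guide
demo = pd.DataFrame({
    "age": [19, 45, 33],
    "bmi": [27.9, 28.5, 22.7],
    "smoker": ["yes", "no", "no"],
    "children": [0, 2, 1],
})

encoded = pd.get_dummies(demo, drop_first=True)
print(list(encoded.columns))

# Build new rows by column name, then reorder to the training layout
new_row = pd.DataFrame([{"age": 45, "bmi": 28.5, "children": 2, "smoker_yes": 1}])
new_row = new_row[encoded.columns]
```

The printed order is `age, bmi, children, smoker_yes`, so any positional input passed to the model later has to follow that order rather than the original selection order.
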

## 2. Model Training

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

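# Create pipeline with preprocessing + model
# (tree models do not need scaling; the scaler is kept so a scale-sensitive model can be swapped in)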
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        n_estimators=100,
        max_depth=5,
        class_weight='balanced',  # Handles class imbalance
        random_state=42
    )
)

# Train model
model.fit(X_train, y_train)
```

## 3. Prediction

```python
# Predict on test set
y_pred = model.predict(X_test)  # Class predictions (0/1)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability scores

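# Predict for a new applicant; build a DataFrame so values line up with the
# training columns (age, bmi, children, smoker_yes after get_dummies)
new_applicant = pd.DataFrame(
    [{'age': 45, 'bmi': 28.5, 'children': 2, 'smoker_yes': 1}])[X.columns]
risk_prediction = model.predict(new_applicant)
risk_probability = model.predict_proba(new_applicant)[0][1]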
```

## 4. Evaluation

```python
from sklearn.metrics import classification_report, roc_auc_score

# Classification metrics
print(classification_report(y_test, y_pred, target_names=['Low Risk', 'High Risk']))

# AUC-ROC score
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.2f}")

# Feature importance
importances = model.named_steps['randomforestclassifier'].feature_importances_
pd.Series(importances, index=X.columns).sort_values().plot(kind='barh')
```

## 5. Model Deployment

```python
import joblib

# Save model
joblib.dump(model, 'risk_classifier.pkl')

# Load in production
production_model = joblib.load('risk_classifier.pkl')

# API endpoint example (Flask)
from flask import Flask, request, jsonify
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = production_model.predict([data['features']])
    return jsonify({'risk_level': int(prediction[0])})
```
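
An endpoint like the one above trusts the caller to send features in the right order. One possible hardening step, sketched here as an alternative design, is to accept named fields and validate them before prediction. `EXPECTED_FEATURES` and `parse_features` are hypothetical names, and the column order is assumed to match the training matrix.

```python
# Assumed training column order (hypothetical constant for this sketch)
EXPECTED_FEATURES = ["age", "bmi", "children", "smoker_yes"]

def parse_features(payload):
    """Return feature values as floats in training order, or raise ValueError."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    missing = [name for name in EXPECTED_FEATURES if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    try:
        return [float(payload[name]) for name in EXPECTED_FEATURES]
    except (TypeError, ValueError) as exc:
        raise ValueError("all feature values must be numeric") from exc

row = parse_features({"smoker_yes": 1, "age": 45, "children": 2, "bmi": 28.5})
print(row)
```

Inside the Flask handler, a `ValueError` from this helper could be turned into a 400 response instead of letting the model fail on malformed input.
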

## 6. Monitoring (Optional)

```python
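# Drift detection (feature-wise Kolmogorov-Smirnov tests)
from alibi_detect.cd import KSDrift

# KSDrift expects NumPy arrays rather than DataFrames.
# In production, replace X_test with a batch of recent incoming data.
drift_detector = KSDrift(X_train.to_numpy(dtype=float), p_val=0.05)
drift = drift_detector.predict(X_test.to_numpy(dtype=float))
print(f"Data drift detected: {drift['data']['is_drift']}")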
```
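
If `alibi_detect` is not available, the same idea can be sketched with SciPy's two-sample Kolmogorov-Smirnov test, applied per feature. The synthetic reference and current batches, the feature names, and the Bonferroni-corrected significance level are all assumptions of this sketch.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
feature_names = ["age", "bmi"]

# Synthetic stand-ins: the current batch has a shifted bmi distribution
reference = rng.normal(loc=[40.0, 30.0], scale=[12.0, 6.0], size=(500, 2))
current = rng.normal(loc=[40.0, 36.0], scale=[12.0, 6.0], size=(500, 2))

alpha = 0.05 / len(feature_names)  # Bonferroni correction across features
drifted = {}
for j, name in enumerate(feature_names):
    result = ks_2samp(reference[:, j], current[:, j])
    drifted[name] = bool(result.pvalue < alpha)
    print(f"{name}: KS statistic {result.statistic:.3f}, drift={drifted[name]}")
```

A per-feature report like this also shows which input moved, which helps when deciding whether retraining is needed.
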

```mermaid
graph TD
    A[Raw Data] --> B[Preprocessing]
    B --> C[Train/Test Split]
    C --> D[Model Training]
    D --> E[Evaluation]
    E --> F[Deployment]
    F --> G[Monitoring]
```

