# Sun Life Machine Learning Engineer - Take Home Assessment

This notebook demonstrates an end-to-end solution for an insurance application quoting process.

Key deliverables:
- Realistic synthetic data generation
- BMI calculation
- Supervised ML model to approximate BMI
- Evaluation on held-out data
- Business rule-based quoting logic
- Operationalization approach for production deployment


## Imports

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
# Note: root_mean_squared_error requires scikit-learn >= 1.4
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score

import joblib

## 1. Synthetic Data Generation

In [None]:
# Fix random seed for reproducibility
np.random.seed(30)

N = 2000  # Number of synthetic applicants

# Create basic demographic attributes
application_ids = [f"APP_{i:05d}" for i in range(1, N + 1)]
genders = np.random.choice(["Male", "Female"], size=N)
ages = np.random.randint(18, 80, size=N)

# Height distributions vary slightly by gender to simulate realism
heights = []
for g in genders:
    base_height = np.random.normal(175, 7) if g == "Male" else np.random.normal(162, 6)
    heights.append(np.clip(base_height, 145, 200))  # Avoid unrealistic extremes

heights = np.array(heights)

# Generate BMI from a reasonable distribution, then derive weight
bmi_values = np.clip(np.random.normal(27, 5, size=N), 15, 45)
weights = bmi_values * ((heights / 100) ** 2)

# Add small noise to simulate measurement variation
weights += np.random.normal(0, 1.5, size=N)

# Assemble DataFrame
applicants_df = pd.DataFrame({
    "applicationID": application_ids,
    "gender": genders,
    "height": heights,  # centimetres
    "weight": weights,  # kilograms
    "age": ages         # years
})

# Persist dataset
applicants_df.to_csv("synthetic_insurance_applicants.csv", index=False)

applicants_df.head()

## 2. BMI Calculation

In [None]:
def calculate_bmi(height_cm: float, weight_kg: float) -> float:
    """Return BMI in kg/m^2 from height (cm) and weight (kg); NaN if height is not positive."""
    if height_cm <= 0:
        return np.nan
    height_m = height_cm / 100
    return weight_kg / (height_m ** 2)

# Apply BMI calculation row-wise
applicants_df["BMI"] = applicants_df.apply(
    lambda row: calculate_bmi(row["height"], row["weight"]),
    axis=1
)

applicants_df.head()

## 3. Machine Learning Model

In [None]:
# Define feature matrix and target
X = applicants_df[["gender", "height", "weight", "age"]]
y = applicants_df["BMI"]

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

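# Preprocessing:
# - Scale numerical features (tree models do not need this, but it is harmless
#   and keeps the pipeline valid if a scale-sensitive estimator is swapped in)
# - One-hot encode categorical feature (gender)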
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["height", "weight", "age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"])
    ]
)

# Random Forest chosen as a robust default for nonlinear tabular data
model = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

# Combine preprocessing and model into single pipeline
pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", model)
])

# Train model
pipeline.fit(X_train, y_train)

# Save trained artifact for reuse
joblib.dump(pipeline, "bmi_prediction_model.joblib")

## 4. Model Evaluation

In [None]:
# Generate predictions on test data
preds = pipeline.predict(X_test)

# Compute standard regression metrics
mae = mean_absolute_error(y_test, preds)
rmse = root_mean_squared_error(y_test, preds)
r2 = r2_score(y_test, preds)

print("Evaluation Metrics (Test Set)")
print("MAE :", round(mae, 4))
print("RMSE:", round(rmse, 4))
print("R2  :", round(r2, 4))
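
For reference, the three metrics reported above can be written out directly. This is a plain-Python sketch of the standard definitions, not scikit-learn's implementation; the function name `regression_metrics` is introduced here for illustration only.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error: average size of the error, in BMI units
    mae = sum(abs(e) for e in errors) / n

    # Root mean squared error: penalises large errors more heavily than MAE
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: share of the target's variance explained by the predictions
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2}


metrics = regression_metrics([20.0, 25.0, 30.0], [21.0, 25.0, 29.0])
print(metrics)
```

The sketch assumes the true values are not all identical, since `ss_tot` would otherwise be zero.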

## 5. Business Rule Engine

In [None]:
def generate_quote(applicant: pd.Series):
    """Apply rule-based logic to determine insurance quote."""
    
    age = applicant["age"]
    bmi = applicant["BMI"]
    gender = applicant["gender"]

    # Default quote
    quote = 600
    reason = "BMI is in the right range"

    # Age-based rules
    if 18 <= age <= 39 and (bmi < 17 or bmi > 37.5):
        quote = 800
        reason = "Age is between 18 and 39 and BMI is either less than 17 or greater than 37.5"

    elif 40 <= age <= 59 and (bmi < 18 or bmi > 37.5):
        quote = 900
        reason = "Age is between 40 and 59 and BMI is either less than 18 or greater than 37.5"

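    # Note: age == 60 matches neither this band nor the 40-59 band above,
    # so a 60-year-old always receives the default quote.
    elif age > 60 and (bmi < 18 or bmi > 44.5):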
        quote = 18000
        reason = "Age is greater than 60 and BMI is either less than 18 or greater than 44.5"

    # Gender discount rule
    if gender == "Female":
        quote *= 0.9
        reason += ". A 10% discount is applied because the applicant's gender is female."

    return round(quote, 2), reason


# Demonstrate logic on sample applicants
for _, sample in applicants_df.sample(3, random_state=1).iterrows():
    q, r = generate_quote(sample)
    print(sample["applicationID"], "|", q, "|", r)

## 6. Operationalization Plan

If this system were to move beyond a notebook into a production insurance workflow, I would focus on reliability and traceability first, then scalability.

### Ingestion & Validation

Applications would arrive via API or batch upload. At ingestion:

- Validate required fields (applicationID, gender, height, weight, age)
- Enforce basic range checks (e.g., height > 0 and age within plausible limits)
- Reject and log invalid records rather than silently correcting them
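
A minimal sketch of these checks, assuming records arrive as plain dictionaries. The function name `validate_application` and the numeric limits in `RANGE_CHECKS` are hypothetical and chosen for illustration; real limits would come from underwriting guidelines.

```python
from typing import Any

REQUIRED_FIELDS = ("applicationID", "gender", "height", "weight", "age")

# Hypothetical limits for illustration only
RANGE_CHECKS = {
    "height": (50.0, 250.0),  # cm
    "weight": (20.0, 350.0),  # kg
    "age": (18, 120),         # years
}


def validate_application(record: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []

    # Required-field check
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            errors.append(f"missing field: {field}")

    # Range checks on numeric fields
    for field, (low, high) in RANGE_CHECKS.items():
        value = record.get(field)
        if value is None:
            continue  # already reported as missing
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field} is not numeric: {value!r}")
        elif not low <= value <= high:
            errors.append(f"{field} out of range [{low}, {high}]: {value}")

    return errors


record = {"applicationID": "APP_00001", "gender": "Female",
          "height": -5, "weight": 60.0, "age": 30}
print(validate_application(record))
```

The function only reports problems and never modifies the record, so the caller can reject and log it as described above.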

Raw submissions should be stored unchanged in a database or object storage. This ensures every quote decision can be reconstructed later.


### Scoring Layer

The serialized sklearn Pipeline (preprocessing + model) would be loaded inside a lightweight service layer (e.g., FastAPI). Keeping preprocessing inside the pipeline helps prevent training-serving skew, because the same fitted transformations are applied at training and at inference time.

For each new application:

1. Validate inputs  
2. Compute BMI (or predict it if required)  
3. Pass enriched record to the quoting engine  

This keeps model logic isolated from business rules.
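
The three steps can be sketched as one function that receives each component as an argument, so the validation, BMI and rule logic stay independently replaceable. The name `score_application` and the `demo_*` stand-ins are hypothetical; in the real service the stand-ins would be replaced by the notebook's `calculate_bmi` and `generate_quote` (which only indexes by key, so it should also accept a plain dict).

```python
from typing import Any, Callable, Mapping


def score_application(
    record: Mapping[str, Any],
    validate: Callable[[Mapping[str, Any]], list],
    compute_bmi: Callable[[float, float], float],
    quote_rules: Callable[[Mapping[str, Any]], tuple],
) -> dict:
    """Run validate -> BMI -> quoting for one application and return a result dict."""
    # Step 1: validate inputs; invalid records are rejected, not corrected
    errors = validate(record)
    if errors:
        return {"applicationID": record.get("applicationID"),
                "status": "rejected", "errors": errors}

    # Step 2: enrich a copy of the record with BMI
    enriched = dict(record)
    enriched["BMI"] = compute_bmi(record["height"], record["weight"])

    # Step 3: pass the enriched record to the quoting engine
    quote, reason = quote_rules(enriched)
    return {"applicationID": record["applicationID"], "status": "quoted",
            "BMI": enriched["BMI"], "quote": quote, "reason": reason}


# Simplified stand-ins, for demonstration only
def demo_validate(record):
    return [] if record.get("height", 0) > 0 else ["height must be positive"]


def demo_bmi(height_cm, weight_kg):
    return weight_kg / (height_cm / 100) ** 2


def demo_rules(applicant):
    return 600, "BMI is in the right range"


result = score_application(
    {"applicationID": "APP_00001", "gender": "Male",
     "height": 175.0, "weight": 70.0, "age": 30},
    demo_validate, demo_bmi, demo_rules,
)
print(result)
```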


### Business Rules Engine

The rule engine should remain separate from the ML model since pricing rules may change more frequently and may require compliance review.

Each quote decision should log:
- Input snapshot
- Model version
- Rule version
- Final quote and reason string

Versioning both the model and the rules lets any past quote be traced to the exact logic that produced it, which supports regulatory and audit requirements.
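
One possible shape for such a log entry is sketched below, as a JSON line per decision. The class `QuoteDecision`, the function `build_decision_log`, and the version labels are hypothetical names for illustration; the sketch assumes the input record contains only JSON-serializable values (NumPy scalars would need converting first).

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class QuoteDecision:
    application_id: str
    inputs: dict        # snapshot of the validated input record
    model_version: str  # e.g. an artifact hash or release tag (assumed scheme)
    rule_version: str
    quote: float
    reason: str
    decided_at: str     # ISO 8601 timestamp, UTC


def build_decision_log(record: dict, quote: float, reason: str,
                       model_version: str, rule_version: str,
                       decided_at: Optional[str] = None) -> str:
    """Serialize one quote decision as a single JSON line."""
    decision = QuoteDecision(
        application_id=record["applicationID"],
        inputs=dict(record),  # copy, so later changes to record do not alter the log
        model_version=model_version,
        rule_version=rule_version,
        quote=quote,
        reason=reason,
        decided_at=decided_at or datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(decision), sort_keys=True)


line = build_decision_log(
    {"applicationID": "APP_00001", "gender": "Male", "age": 30},
    600, "BMI is in the right range", "model-v1", "rules-v1",
)
print(line)
```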


### Monitoring & Governance

Once deployed, I would monitor:

- Input data distribution shifts (age, height, weight)
- Quote distribution changes
- API latency and failure rates
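
As one way to quantify input distribution shift, the sketch below computes the Population Stability Index (PSI) between a training-time baseline and a recent sample of a single numeric feature. PSI is a technique introduced here for illustration, not something the notebook itself uses; the function name, the equal-width binning and the `eps` floor are assumptions.

```python
import math
from typing import Sequence


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               n_bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI between a baseline sample and a recent sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline sample has no spread")
    width = (hi - lo) / n_bins

    def bin_fractions(values: Sequence[float]) -> list:
        counts = [0] * n_bins
        for v in values:
            # Equal-width bins over the baseline range; values outside
            # that range are clamped into the first or last bin
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each fraction at eps so the logarithm is always defined
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline_ages = list(range(18, 80))
recent_ages = [age + 15 for age in baseline_ages]  # simulated shift
print(population_stability_index(baseline_ages, baseline_ages))
print(population_stability_index(baseline_ages, recent_ages))
```

Identical samples give a PSI of zero, and the value grows as the recent sample drifts from the baseline. Any alerting threshold would need to be chosen and validated for this workload.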


### Testing & Deployment

- Unit tests for BMI calculation and rule boundaries  
- Integration test for full scoring pipeline  
- Version-controlled deployments with rollback capability  
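
A sketch of what the unit tests for the BMI calculation and rule boundaries might look like. To keep the block runnable on its own, it includes condensed restatements of the notebook's `calculate_bmi` and `generate_quote` (NaN via `float("nan")`, reason strings shortened); in a real test module these would be imported instead. The helper `applicant` is hypothetical.

```python
import math


def calculate_bmi(height_cm, weight_kg):
    """Condensed copy of the notebook function."""
    if height_cm <= 0:
        return float("nan")
    return weight_kg / (height_cm / 100) ** 2


def generate_quote(applicant):
    """Condensed restatement of the notebook's rules; reason strings shortened."""
    age, bmi, gender = applicant["age"], applicant["BMI"], applicant["gender"]
    quote, reason = 600, "default"
    if 18 <= age <= 39 and (bmi < 17 or bmi > 37.5):
        quote, reason = 800, "age 18-39"
    elif 40 <= age <= 59 and (bmi < 18 or bmi > 37.5):
        quote, reason = 900, "age 40-59"
    elif age > 60 and (bmi < 18 or bmi > 44.5):
        quote, reason = 18000, "age > 60"
    if gender == "Female":
        quote *= 0.9
    return round(quote, 2), reason


def applicant(age, bmi, gender="Male"):
    """Build a minimal applicant record for the rule tests."""
    return {"age": age, "BMI": bmi, "gender": gender}


def test_bmi_formula():
    assert math.isclose(calculate_bmi(175, 70), 22.857, abs_tol=1e-3)
    assert math.isnan(calculate_bmi(0, 70))


def test_rule_boundaries():
    assert generate_quote(applicant(39, 38))[0] == 800
    assert generate_quote(applicant(40, 38))[0] == 900
    assert generate_quote(applicant(39, 17))[0] == 600     # 17 is not below 17
    assert generate_quote(applicant(39, 17.5))[0] == 600   # lower bound is 17 for 18-39
    assert generate_quote(applicant(40, 17.5))[0] == 900   # lower bound is 18 for 40-59
    assert generate_quote(applicant(61, 45))[0] == 18000
    assert generate_quote(applicant(60, 45))[0] == 600     # age 60 matches no band


def test_female_discount():
    assert generate_quote(applicant(30, 25, "Female"))[0] == 540.0


if __name__ == "__main__":
    test_bmi_formula()
    test_rule_boundaries()
    test_female_discount()
    print("all tests passed")
```

The function names follow the `test_*` convention, so a runner such as pytest would collect them without changes.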

Given the lightweight model and workload, horizontal scaling behind a simple API service would be sufficient.

This approach keeps the system maintainable, traceable, and production-ready without unnecessary complexity.

Thank you for taking the time to review this submission. I appreciate the opportunity to work through this assignment and demonstrate my approach.