## Jupyter Notebook Template for Take-home Challenges

### Classification (Tabular)

| Challenge type | How to recognize | Recommended split | Baseline model | Primary metrics | Must-do checks | Common pitfalls |
|---|---|---|---|---|---|---|
| **Binary Classification** | target has 2 classes (0/1, yes/no) | Stratified random; Time/Group if needed | Logistic Regression | ROC-AUC + LogLoss; PR-AUC if rare | prevalence, leakage scan, threshold plan | using accuracy only; ignoring threshold |
| **Multiclass Classification** | target has >2 classes | Stratified random; Time/Group if needed | Multinomial Logistic Regression | Macro F1 + LogLoss | class balance, confusion matrix | micro-only metrics hide minorities |
| **Highly Imbalanced Classification** | positive rate very low / “rare event” | Stratified + Time/Group if needed | Logistic (class_weight) | PR-AUC; Recall@Precision | thresholding + calibration | reporting ROC-AUC only |
| **Event Prediction (future event)** | label defined over future window (churn/fraud) | Time split preferred | Logistic / GBM | PR-AUC or Recall@Precision | window definition, censoring | leakage from post-event features |
| **Cost-Sensitive Classification** | prompt gives FP/FN costs | Same as above + align to decision | Logistic | expected cost / utility | cost matrix, threshold by cost | optimizing generic metric |
| **Group Leakage / Repeated Measures** | many rows per user/account/session | Group split (GroupKFold) | Logistic | same as above | confirm group key | splitting rows not groups |


### Regression (Tabular)

| Challenge type | How to recognize | Recommended split | Baseline model | Primary metrics | Must-do checks | Common pitfalls |
|---|---|---|---|---|---|---|
| **Regression (continuous)** | numeric target with many unique values | Random; Time/Group if needed | Ridge / ElasticNet | MAE or RMSE; R² secondary | outliers, target skew, residuals | optimizing R² only |
| **Count Regression** | target non-negative integers (skewed) | Random/Time/Group as needed | PoissonRegressor / Ridge on log1p(y) | MAE or deviance | zeros %, transform choice | negative predictions |
| **Time-to-Value / Horizon Regression** | target is “next week/month value” | Time split | Ridge / GBM | MAE/RMSE by horizon | leakage via future aggregates | random split inflates |
| **Group Leakage Regression** | multiple rows per entity | Group split | Ridge | MAE/RMSE | group consistency | memorizing IDs |
| **Missing-Heavy Regression** | lots of NaNs | Same as task | Ridge + imputer | MAE/RMSE | missingness patterns | imputing using whole dataset |


### Time Series

| Challenge type | How to recognize | Recommended split | Baseline model | Primary metrics | Must-do checks | Common pitfalls |
|---|---|---|---|---|---|---|
| **Forecasting (univariate)** | single series + dates | Rolling/expanding time split | Naive / seasonal naive | MAE/RMSE/SMAPE | seasonality, gaps | random split |
| **Forecasting (panel)** | many entities each has series | Time split + group by entity | Per-entity baseline / Ridge with lags | MAE/RMSE/SMAPE | lag construction from past only | leakage via future lags |
| **Time Drift / Non-stationary** | performance changes over time | Time split + time-sliced eval | Regularized linear | time-bucket metrics | drift checks | random CV inflates |


### Ranking / Recommender

| Challenge type | How to recognize | Recommended split | Baseline model | Primary metrics | Must-do checks | Common pitfalls |
|---|---|---|---|---|---|---|
| **Ranking / Top-K Retrieval** | “rank”, “top-k”, “search” | Time split; group by user/query | LightGBM ranker / Logistic | NDCG@K, MAP@K, Recall@K | define K + protocol | mixing users across folds |
| **Recommendation (implicit)** | user–item interactions, clicks | Time split; group by user | Popularity → MF/logistic | Recall@K / NDCG@K | negative sampling | evaluating on seen items |


### NLP / Text

| Challenge type | How to recognize | Recommended split | Baseline model | Primary metrics | Must-do checks | Common pitfalls |
|---|---|---|---|---|---|---|
| **Text Classification** | text column + class label | Stratified; Group by author/user if needed | TF-IDF + Logistic | F1 / ROC-AUC; LogLoss | dedup/near-dup, leakage | leakage via duplicates |
| **Text Regression** | text → numeric | Random/Group/Time as needed | TF-IDF + Ridge | MAE/RMSE | outliers, leakage words | overfitting, heavy models |


### Unsupervised / Weakly Supervised

| Challenge type | How to recognize | Recommended split | Baseline model | Primary metrics | Must-do checks | Common pitfalls |
|---|---|---|---|---|---|---|
| **Anomaly Detection (no labels)** | “detect anomalies”, no target | Time holdout if temporal | IsolationForest / robust z-score | Precision@K (if labels) / manual review | score + threshold strategy | no eval protocol |
| **Survival / Time-to-Event** | time-to-event + censoring | Time split | CoxPH (if available) | C-index | censoring handling | treating as standard classification |


### Cross-cutting Constraints (add-on labels)

| Constraint | When it appears | What to change in workflow |
|---|---|---|
| **Explainability required** | “interpret”, “justify”, “stakeholders” | favor linear/GAM; coefficient stability; simple FE |
| **Production/deployment** | “deploy”, “latency”, “monitoring” | pipeline reproducibility, simple model, logging/monitoring notes |
| **High-cardinality categoricals** | IDs or huge unique counts | cap rare categories; hashing if needed; avoid leakage via ID memorization |
| **Missing-heavy** | many NaNs / structured missingness | imputation inside pipeline; consider missing indicators |
| **Cost-sensitive** | explicit FP/FN costs | optimize threshold by expected cost, not generic metric |
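
The cost-sensitive rows above say to pick the threshold by expected cost rather than a generic metric. A minimal sketch of that idea follows; the helper name `best_threshold_by_cost`, the threshold grid, and the toy costs are assumptions for illustration, not part of the original template.

```python
import numpy as np

def best_threshold_by_cost(y_true, p1, cost_fp, cost_fn):
    """Return (threshold, total_cost) minimizing cost_fp*FP + cost_fn*FN on a fixed grid."""
    y_true = np.asarray(y_true)
    p1 = np.asarray(p1)
    thresholds = np.linspace(0.0, 1.0, 101)  # hypothetical grid; refine if needed
    costs = []
    for th in thresholds:
        pred = (p1 >= th).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))  # false alarms
        fn = np.sum((pred == 0) & (y_true == 1))  # missed positives
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))  # first grid point with the lowest cost
    return thresholds[best], costs[best]

# Toy data with hypothetical costs: a missed positive is 5x worse than a false alarm
y_demo = np.array([0, 0, 0, 0, 1, 1, 0, 1])
p_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])
th_cost, total_cost = best_threshold_by_cost(y_demo, p_demo, cost_fp=1.0, cost_fn=5.0)
print(th_cost, total_cost)
```

In practice the threshold should be chosen on validation predictions (for example out-of-fold probabilities), not on the holdout set that is used for the final report.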


In [None]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

from dataclasses import dataclass
from typing import Optional, Dict, Any, Tuple

from sklearn.model_selection import (
    train_test_split, StratifiedKFold, KFold,
    GroupKFold, GroupShuffleSplit, TimeSeriesSplit, cross_validate
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import (
    roc_auc_score, average_precision_score, log_loss, f1_score,
    mean_absolute_error, mean_squared_error, r2_score,
    confusion_matrix, precision_recall_curve
)

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)

RANDOM_STATE = 42


## End-to-End Workflow (General Template)

**Plan (time-boxed)**
1) `Quick Overview`: define target; check data leakage; check dtypes/ missingness/ dupes; decide split (random/time/group)
2) `Lightweight EDA` (2–3 checks only): target sanity + missingness + 1 relationship
3) `Feature Engineering` (minimal, interpretable)
4) `Data Split`
5) `Baseline pipeline`: ColumnTransformer + (LogisticRegression or Ridge)
6) `Cross-Validation` with appropriate split + metric
7) Holdout Evaluation + quick error analysis
8) (Optional) Calibration/thresholding (if probabilistic classification)
9) Sanity-check with a simple non-linear model
10) Explainability (coefficients / importances) + wrap-up summary


In [None]:
PATH = "YOUR_FILE.csv"
TARGET = "target"
TIME_COL = None     # e.g. "timestamp" (set if time leakage matters)
GROUP_COL = None    # e.g. "user_id" (set if multiple rows per entity)
DROP_COLS = []      # e.g. IDs, post-event fields, text blobs, etc.
TASK = "auto"       # "auto" | "classification" | "regression"
TEST_SIZE = 0.2

In [None]:
df = pd.read_csv(PATH)
df.head()

### Quick Overview
- define target
- check data leakage
- check data dtypes, missingness, dupes
- decide data split (stratified/ time/ group)
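
One cheap way to run the leakage check listed above is to score every numeric column on its own against the target and flag anything that is suspiciously perfect. The sketch below assumes a binary target; the helper `single_feature_auc`, the 0.99 cutoff, and the `refund_amount` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_auc(df, target):
    """ROC-AUC of each numeric column used alone as a score (binary target assumed)."""
    out = {}
    num_cols = df.select_dtypes(include=[np.number]).columns.drop(target, errors="ignore")
    for col in num_cols:
        mask = df[col].notna()
        if df.loc[mask, target].nunique() < 2:
            continue  # AUC is undefined when only one class remains
        auc = roc_auc_score(df.loc[mask, target], df.loc[mask, col])
        out[col] = max(auc, 1 - auc)  # direction-agnostic strength
    return pd.Series(out).sort_values(ascending=False)

# Toy frame: "refund_amount" stands in for a post-event field that leaks the label
rng = np.random.default_rng(0)
demo = pd.DataFrame({"target": rng.integers(0, 2, 200)})
demo["noise"] = rng.normal(size=200)
demo["refund_amount"] = demo["target"] * 10 + rng.normal(scale=0.1, size=200)

leak_scan = single_feature_auc(demo, "target")
suspects = leak_scan[leak_scan > 0.99].index.tolist()
print(leak_scan)
print("suspects:", suspects)
```

A flagged column is not proof of leakage, only a prompt to check when and how that field is populated.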

In [None]:
df.head()

In [None]:
# data shape
df.shape

In [None]:
# data dtypes
df.dtypes
df.dtypes.value_counts()

In [None]:
# data missing %
df.isna().mean().sort_values(ascending=False)

In [None]:
# duplicated rows
df.duplicated().sum()

In [None]:
# target distribution
y = df[TARGET]
sns.histplot(data=df, x=y)

In [None]:
# target value counts
y.value_counts()
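
The config cell allows `TASK = "auto"`, but nothing in the template resolves it. One possible heuristic is sketched below; the function `infer_task` and the `max_classes` cutoff are assumptions, and a count target with few distinct values would be misread as classification, so treat the result as a suggestion to confirm by eye.

```python
import pandas as pd

def infer_task(y: pd.Series, max_classes: int = 20) -> str:
    """Heuristic: non-numeric, boolean, or low-cardinality integer targets -> classification."""
    y = y.dropna()
    if pd.api.types.is_bool_dtype(y) or not pd.api.types.is_numeric_dtype(y):
        return "classification"
    is_integer_like = bool((y % 1 == 0).all())
    if is_integer_like and y.nunique() <= max_classes:
        return "classification"
    return "regression"

print(infer_task(pd.Series([0, 1, 1, 0])))        # small set of integer labels
print(infer_task(pd.Series(["yes", "no", "no"])))  # string labels
print(infer_task(pd.Series([0.5, 1.2, 3.3])))      # continuous values
```

A typical use would be `TASK = infer_task(df[TARGET]) if TASK == "auto" else TASK`, run once right after loading the data.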

### Lightweight EDA
- target distribution/ prevalence
- data distribution

### Feature Engineering

### Build Pipelines - Baseline Models
- data split (random vs. time vs. group -> avoid leakage)
- data preprocessing (ColumnTransformer)
- ML model (e.g. Logistic Regression)

#### 1. Data Split

- Split first, then fit preprocessors on train only: scaling, encoding, and imputation must be learned from X_train.

- `random split`
- Don't use a random split for time-ordered or grouped data:
  1. If rows are time-ordered, or there are multiple rows per user, a random split can leak information.
  2. Time-ordered data => use a time-based split.
  3. Multiple rows per user => use GroupShuffleSplit / GroupKFold.
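
The "split first, fit on train only" rule can be illustrated with a scaler on synthetic data. This is a small sketch with made-up numbers; `leaky_scaler` is shown only as the anti-pattern to avoid.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 1))  # synthetic single feature

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)    # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # test rows reuse the train mean/std

leaky_scaler = StandardScaler().fit(X_demo)  # anti-pattern: test rows influence the statistics

print("train-only mean:", scaler.mean_, "| full-data mean:", leaky_scaler.mean_)
```

Putting the scaler inside a `Pipeline` gives the same guarantee automatically, because `fit` is only ever called on the training portion.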

In [None]:
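test_size = TEST_SIZE                       # reuse the value from the config cell
X = df.drop(columns=[TARGET] + DROP_COLS)   # also drop IDs / post-event fields listed in DROP_COLS
y = df[TARGET]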

In [None]:
# imbalanced target classification problem
# stratify = y: preserve the class distribution (very important when y is rare!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=RANDOM_STATE, stratify=y)

In [None]:
# other problems
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=RANDOM_STATE)

- `time-series data split`
- should preserve time order and ensure no leakage
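
To make "preserve time order" concrete, the sketch below prints the index ranges produced by scikit-learn's `TimeSeriesSplit` on twelve rows that are assumed to be already sorted by time. The toy array is an assumption; on real data, sort by `TIME_COL` first.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_sorted = np.arange(12).reshape(-1, 1)  # stand-in for rows already sorted by time
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_sorted))

for train_idx, valid_idx in folds:
    # every training index precedes every validation index (expanding window)
    print("train:", train_idx.min(), "-", train_idx.max(),
          "| valid:", valid_idx.min(), "-", valid_idx.max())
```

Each fold trains on an expanding prefix and validates on the block that follows it, so no future row is ever used to predict the past.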


In [None]:
### version 1: split by row % 
df[TIME_COL] = pd.to_datetime(df[TIME_COL], errors="coerce")
df = df.sort_values(TIME_COL).reset_index(drop=True)

train_size = 0.8
split_idx = int(train_size * len(df))

train = df.iloc[:split_idx].copy()
test  = df.iloc[split_idx:].copy()

In [None]:
### version 2: split by a time cutoff

df[TIME_COL] = pd.to_datetime(df[TIME_COL], errors="coerce")
df = df.sort_values(TIME_COL)

cutoff = df[TIME_COL].quantile(0.8)   # or pick an explicit date like pd.Timestamp("2024-10-01")

train = df[df[TIME_COL] < cutoff].copy()
test  = df[df[TIME_COL] >= cutoff].copy()

In [None]:
### version 3: per-entity cutoff (common in finance)

ID_COL = GROUP_COL   # entity key (e.g. "user_id"); GROUP_COL must be set in the config cell
df[TIME_COL] = pd.to_datetime(df[TIME_COL])
df = df.sort_values([ID_COL, TIME_COL])

def split_group(g, train_frac=0.8):
    n = len(g)
    k = int(train_frac * n)
    return g.iloc[:k], g.iloc[k:]

parts = [split_group(g) for _, g in df.groupby(ID_COL, sort=False)]
train = pd.concat([p[0] for p in parts]).copy()
test  = pd.concat([p[1] for p in parts]).copy()

In [None]:
X_train = train.drop(columns=[TARGET])
y_train = train[TARGET]
X_test  = test.drop(columns=[TARGET])
y_test  = test[TARGET]

- `group split`
- split by ID so that all rows from the same group stay entirely in either train or test, never both
- this prevents leakage when rows within a group are correlated
- if rows are split randomly, the model can "cheat" by seeing very similar samples from the same group in both train and test
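
A quick way to confirm the group key is respected is to intersect the group IDs on each side of the split. The sketch below uses a made-up frame with a hypothetical `user_id` column and compares a row-level split with `GroupShuffleSplit`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split

demo = pd.DataFrame({
    "user_id": np.repeat(np.arange(20), 5),  # 20 hypothetical users, 5 rows each
    "x": np.arange(100),
})

# Row-level random split: the same user usually lands on both sides
row_train, row_test = train_test_split(demo, test_size=0.2, random_state=42)
row_overlap = set(row_train["user_id"]) & set(row_test["user_id"])

# Group-aware split: each user is entirely in train or entirely in test
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
tr_idx, te_idx = next(gss.split(demo, groups=demo["user_id"]))
grp_overlap = set(demo["user_id"].iloc[tr_idx]) & set(demo["user_id"].iloc[te_idx])

print("users shared by row split:  ", len(row_overlap))
print("users shared by group split:", len(grp_overlap))
```

The same intersection check works as a one-line assertion after any real split.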

In [None]:
### train/ test group split (single holdout)

from sklearn.model_selection import GroupShuffleSplit

X = df.drop(columns=[TARGET])
y = df[TARGET]
groups = df["UserID"]  # or SessionID / PatientID / Ticker, etc.

gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=RANDOM_STATE)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

In [None]:
### group cross-validation

from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = df.drop(columns=[TARGET])
y = df[TARGET]
groups = df["UserID"]

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000))
])

gkf = GroupKFold(n_splits=5)
scores = cross_val_score(pipe, X, y, cv=gkf, groups=groups, scoring="roc_auc")
print(scores.mean(), scores.std())

#### 2. Build Preprocessing + Baseline Pipeline

`ColumnTransformer`
- Numeric -> Impute + StandardScaler
- Categorical -> Impute + OneHotEncoder

`Model`
- Logistic Regression (Classification)
- Ridge (Regression)

In [None]:
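### detect numeric vs. categorical columns from X.dtypes

# remember to EXCLUDE ID columns (and GROUP_COL / TIME_COL if they are not features)
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()

# categorical data: object/string and category columns
# note: bool columns fall into neither list; cast them (e.g. to int) if they should be features
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()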

In [None]:
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")), # robust choice: impute missing values with median
    ("scaler", StandardScaler()) # important for linear models: Logistic Regression, Ridge, etc.
])

`One Hot Encoder`

In [None]:
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")), # impute missing values with the most frequent category
    ("onehot", OneHotEncoder(handle_unknown="ignore")), # prevents errors when new categories appear
])

`Ordinal Encoder`

- It maps each category to an integer (e.g. red->0, blue->1)
- works well when categories are truly ordered (e.g. low < medium < high)
- risk: it introduces a fake "order" for unordered categories, which linear models will treat as meaningful
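
For truly ordered categories, the order has to be given explicitly, otherwise `OrdinalEncoder` sorts the labels alphabetically. A small sketch with a hypothetical `size` column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["low", "high", "medium", "low"]})  # toy data

enc = OrdinalEncoder(
    categories=[["low", "medium", "high"]],   # explicit order: low < medium < high
    handle_unknown="use_encoded_value",
    unknown_value=-1,                         # unseen categories map to -1
)
encoded = enc.fit_transform(sizes)
unseen = enc.transform(pd.DataFrame({"size": ["huge"]}))

print(encoded.ravel())  # low -> 0, medium -> 1, high -> 2
print(unseen.ravel())   # category not seen during fit -> -1
```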

In [None]:
from sklearn.preprocessing import OrdinalEncoder  # not included in the import cell at the top

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ord", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])


In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols),
    ],
    remainder="drop", # drop any columns not specified above
    verbose_feature_names_out=False # nicer feature names
)

In [None]:
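if TASK == "classification":
    model = LogisticRegression(max_iter=2000)
elif TASK == "regression":
    model = Ridge(alpha=1.0, random_state=RANDOM_STATE)
else:
    # TASK = "auto" must be resolved before this cell, otherwise `model` is undefined
    raise ValueError(f'Set TASK to "classification" or "regression" (got {TASK!r})')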

In [None]:
pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", model)
])

### Cross Validation
- (stratified/ time/ group) CV with proper metrics (ROC-AUC, PR-AUC)

In [None]:
n_splits = 5

In [None]:
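# pick ONE splitter that matches the split decision made earlier
# (assigning all three in a row would silently keep only the last one)
if TIME_COL is not None:
    cv = TimeSeriesSplit(n_splits=n_splits)   # assumes X_train is sorted by time
elif GROUP_COL is not None:
    cv = GroupKFold(n_splits=n_splits)        # needs `groups` passed to cross_validate
elif TASK == "classification":
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_STATE)
else:
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_STATE)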

In [None]:
def make_scoring(task: str, y_train: pd.Series) -> Dict[str, Any]:
    if task == "regression":
        return {
            "mae": "neg_mean_absolute_error",
            "rmse": "neg_root_mean_squared_error",
            "r2": "r2",
        }

    # For binary: PR-AUC and ROC-AUC are common. For multiclass, fallback to logloss + f1_macro.
    n_classes = y_train.nunique(dropna=True)
    if n_classes == 2:
        return {
            "roc_auc": "roc_auc",
            "pr_auc": "average_precision",
            "logloss": "neg_log_loss",
        }
    else:
        return {
            "f1_macro": "f1_macro",
            "logloss": "neg_log_loss",
        }

scoring = make_scoring(TASK, y_train)

- True targets: $y_i$
- Predictions: $\hat{y}_i$
- Number of samples: $n$
- Mean of true targets: $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$
- Binary classification: $y_i\in\{0,1\}$, predicted probability: $\hat{p}_i=P(y_i=1\mid x_i)$
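
With the notation above, MAE, MSE, RMSE, $R^2$, and binary LogLoss can be computed by hand and compared against scikit-learn. The numbers below are made up purely for the check.

```python
import numpy as np
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error, r2_score

# Toy regression targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_hat))
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Toy binary labels and predicted probabilities of the positive class
y_bin = np.array([0, 1, 1, 0])
p_hat = np.array([0.1, 0.8, 0.6, 0.3])
logloss = -np.mean(y_bin * np.log(p_hat) + (1 - y_bin) * np.log(1 - p_hat))

print(mae, mean_absolute_error(y_true, y_hat))
print(mse, mean_squared_error(y_true, y_hat))
print(r2, r2_score(y_true, y_hat))
print(logloss, log_loss(y_bin, p_hat))
```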

In [None]:
from IPython.display import display, Math, Markdown
display(Math(r"\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|"))
display(Math(r"\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2"))
display(Math(r"R^2=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}"))
display(Math(r"\mathrm{Precision}=\frac{TP}{TP+FP}\quad,\quad \mathrm{Recall}=\frac{TP}{TP+FN}"))
display(Math(r"\mathrm{F1}=\frac{2\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}=\frac{2TP}{2TP+FP+FN}"))
display(Math(r"\mathrm{LogLoss}=-\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{p}_i)+(1-y_i)\log(1-\hat{p}_i)\right]"))

In [None]:
# group labels for the training rows; only used by group-aware splitters such as GroupKFold
groups_cv = X_train[GROUP_COL] if GROUP_COL is not None else None

In [None]:
cv_results = cross_validate(
    pipe,                    # CV will refit preprocessors + model inside each fold
    X_train,                 # CV will further split train data into K folds internally
    y_train,
    cv=cv,                   # the splitting strategy (e.g. KFold, StratifiedKFold, TimeSeriesSplit, GroupKFold, etc.)
    scoring=scoring,         # If a list or dict: multiple metrics (common)
    n_jobs=-1,               # parallelize across CPU cores (use all cores available), speed up CV, especially with many folds/ models
    groups=groups_cv,        # Only if CV splitter is group-aware (e.g. GroupKFold, StratifiedGroupKFold, GroupShuffleSplit)
    return_train_score=False # If True, will also get train scores per fold -> useful to diagnose overfitting, but adds compute
)

For each fold split produced by `cv`:
- split X_train, y_train into (train_fold, valid_fold)
- fit the pipeline on train_fold
- `pipe` includes preprocessing (`ColumnTransformer`: imputing/ scaling/ encoding) + `model`
- preprocessing is fit only on the fold's training data -> no leakage
- score on valid_fold using the requested `scoring` metrics
- store each fold's scores
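
The steps above can be written out as an explicit loop and compared with `cross_validate`. This is a sketch on synthetic data from `make_classification`; the `demo_*` names are assumptions, and the real notebook would use `pipe`, `X_train`, and `y_train` instead.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
demo_pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=1000))])
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, valid_idx in demo_cv.split(X_demo, y_demo):
    fold_pipe = clone(demo_pipe)                          # fresh, unfitted copy per fold
    fold_pipe.fit(X_demo[train_idx], y_demo[train_idx])   # scaler + model see train_fold only
    p_valid = fold_pipe.predict_proba(X_demo[valid_idx])[:, 1]
    manual_scores.append(roc_auc_score(y_demo[valid_idx], p_valid))

auto_scores = cross_validate(demo_pipe, X_demo, y_demo, cv=demo_cv, scoring="roc_auc")["test_score"]
print(np.round(manual_scores, 4))
print(np.round(auto_scores, 4))
```

The two score arrays should agree fold by fold, which is a useful sanity check that the pipeline, not a pre-fitted transformer, is what gets cross-validated.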

### Fit Baseline on Train -> Evaluate on Holdout + Error Analysis

In [None]:
pipe.fit(X_train, y_train)

In [None]:
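if TASK == "regression":
    pred = pipe.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE = sqrt(MSE)
    r2 = r2_score(y_test, pred)
    print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")

    # Error analysis: rows with the 10 worst absolute errors
    err = np.abs(pred - y_test.values)
    worst = np.argsort(-err)[:10]
    worst_rows = X_test.iloc[worst].assign(y_true=y_test.values[worst], y_pred=pred[worst])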

In [None]:
if TASK == "classification":
    proba = pipe.predict_proba(X_test)
    
    # Binary probability
    if proba.shape[1] == 2:
        p1 = proba[:, 1]
        roc = roc_auc_score(y_test, p1)
        pr = average_precision_score(y_test, p1)
        ll = log_loss(y_test, p1)
        
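        # Threshold setting (0.5 is only a default; assumes labels are encoded as 0/1)
        th = 0.5
        preds = (p1 >= th).astype(int)
        cm = confusion_matrix(y_test, preds)
        print(cm)  # a bare expression inside an if-block is not displayed, so print it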
        
    # Multi-class probability
    else:
        preds = pipe.predict(X_test)
        f1 = f1_score(y_test, preds, average="macro")
        ll = log_loss(y_test, proba)
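
The fixed `th = 0.5` above is only a default. For the "Recall@Precision" metric listed in the tables, one option is to take the threshold from `precision_recall_curve`. The helper `threshold_for_precision` and the toy scores below are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, p1, min_precision):
    """Pick the threshold with the highest recall among those meeting min_precision.

    Returns (threshold, precision, recall), or None if the target is unreachable.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, p1)
    # precision/recall have one more entry than thresholds; ignore the final (1, 0) point
    ok = precision[:-1] >= min_precision
    if not ok.any():
        return None
    best = int(np.argmax(np.where(ok, recall[:-1], -1.0)))
    return thresholds[best], precision[best], recall[best]

# Toy labels and predicted probabilities
y_demo = np.array([0, 0, 0, 0, 1, 1, 0, 1])
p_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])
th_sel, prec_sel, rec_sel = threshold_for_precision(y_demo, p_demo, min_precision=0.7)
print(th_sel, prec_sel, rec_sel)
```

As with any tuned threshold, it is safer to select it on validation or out-of-fold predictions and only then report the resulting confusion matrix on the holdout set.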