saratmoka/COMBSS-GLM-Python

COMBSS-GLM: Best Subset Selection for Generalised Linear Models

Python implementation of COMBSS (Continuous Optimisation for Best Subset Selection) for generalised linear models, as described in the accompanying paper.




Overview

Best subset selection — finding the optimal subset of k predictors that maximises the likelihood — is fundamental for interpretable and parsimonious statistical modelling, but is NP-hard in general. COMBSS overcomes this computational barrier by reformulating the discrete combinatorial problem as a continuous optimisation over the hypercube $[0,1]^p$ via a Boolean relaxation, and solving it with a Frank–Wolfe homotopy algorithm.

The key innovation is the use of Danskin's envelope theorem to compute the gradient of the relaxed objective exactly, requiring only a single ridge-penalised GLM solve per iteration. This makes the method scalable to high-dimensional settings with $p \gg n$.


Features

  • Binary logistic regression (two-class classification)
  • Multinomial logistic regression (multi-class, C > 2)
  • Linear regression (continuous response)
  • LOOCV-based ridge penalty selection via combss_cv
  • Support for mandatory variables (always included in every model)
  • Column normalisation for numerical stability
  • Warm-started inner solves with scikit-learn's L-BFGS-B solver
  • Gradient computation via Danskin's envelope theorem (no Hessian)
  • Frank–Wolfe homotopy with auto-calibrated geometric delta schedule

Dependencies

numpy
pandas
scikit-learn

Install with:

pip install numpy pandas scikit-learn

No further installation is needed — clone the repository and import directly.


Quick Start

Multinomial logistic regression (Khan SRBCT, 4 classes)

Searches for the smallest gene subset achieving accurate classification across four childhood tumour types in the Khan SRBCT dataset (p = 2308 genes, n = 63 training samples). The inclusion path is returned for k = 1, …, 20.

import numpy as np
import pandas as pd
import combss

train = pd.read_csv('data/Khan_train.csv')
test  = pd.read_csv('data/Khan_test.csv')

y_train = train.iloc[:, 0].values.astype(int)
X_train = train.iloc[:, 1:].values.astype(float)
y_test  = test.iloc[:, 0].values.astype(int)
X_test  = test.iloc[:, 1:].values.astype(float)

n, C = len(y_train), len(np.unique(y_train))
X_fw = np.hstack([np.ones((n, 1)), X_train])   # prepend intercept

result = combss.fw(
    X_fw, y_train,
    q          = 20,             # search k = 1 .. 20
    Niter      = 50,
    model_type = 'multinomial',
    C          = C,
)

for k, feat in enumerate(result.models, 1):
    print(f'k={k:2d}  genes={feat.tolist()}')

Binary logistic regression

Searches for the best feature subset of size k = 1, …, 15 for a two-class classification problem. Labels should be binary (0/1 or 1/2).

import combss

# X_fw : (n, p+1) with intercept column at index 0
# y    : (n,) binary labels {0, 1}

result = combss.fw(
    X_fw, y,
    q          = 15,
    Niter      = 50,
    model_type = 'logit',
    C          = 2,
)

Linear regression

Searches for the best feature subset of size k = 1, …, 15 for a continuous response. The inner solve uses an exact closed-form solution, making the method efficient even in high-dimensional settings (p ≫ n).

import numpy as np
import combss

# X_fw : (n, p+1) with intercept column at index 0
# y    : (n,) continuous response

result = combss.fw(
    X_fw, y,
    q          = 15,
    Niter      = 50,
    model_type = 'linear',
)

for k, feat in enumerate(result.models, 1):
    print(f'k={k:2d}  features={feat.tolist()}')
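A note on why the linear case stays fast when p ≫ n: for λ > 0 the ridge solution can be computed through an n × n system instead of a p × p one. The sketch below is a generic identity check, not code from this repository:

```python
import numpy as np

# For lam > 0, the primal ridge solution
#     beta = (X.T X + lam I_p)^{-1} X.T y    (p x p solve)
# equals the dual form
#     beta = X.T (X X.T + lam I_n)^{-1} y    (n x n solve)
# so the closed-form inner solve scales with n, not p.
rng = np.random.default_rng(0)
n, p, lam = 20, 500, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(beta_primal, beta_dual))  # True
```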

Lambda selection via LOOCV

Selects the ridge penalty λ from a candidate grid using leave-one-out cross-validation. Feature selection runs once per λ on the full data; only the lightweight refit step is cross-validated.

import combss_cv

# Classification (logit / multinomial)
best_lam, best_lam_per_k, cv_df = combss_cv.cv_select_lambda(
    X_train, y_train,
    q           = 15,
    C           = 4,
    model_type  = 'multinomial',
)
# cv_df columns: lambda, k, loocv_acc, selected

# Linear regression (C not required)
best_lam, best_lam_per_k, cv_df = combss_cv.cv_select_lambda(
    X_train, y_train,
    q           = 15,
    model_type  = 'linear',
)
# cv_df columns: lambda, k, loocv_mse, selected

Repository Structure

combss_glm/
├── combss.py            Main module — fw() algorithm (logit, multinomial, linear)
├── combss_cv.py         LOOCV-based lambda selection
├── multinomial_grad.py  Internal: multinomial gradient and inner solver
├── combss_demo.ipynb    Interactive notebook demonstrating both examples
├── data/
│   ├── Khan_train.csv                       Khan SRBCT training set (n=63, p=2308)
│   ├── Khan_test.csv                        Khan SRBCT test set    (n=20)
│   ├── n-200-p1000Replica1.csv              Simulated binary train (n=200, p=1000)
│   └── n-200-p1000Test-Replica1.csv        Simulated binary test  (n=1000)
└── examples/
    ├── example_multinomial_khan.py     Khan SRBCT full inclusion path
    ├── example_binary_logit.py         High-dim binary logit (p=1000)
    ├── example_lambda_cv.py            LOOCV lambda selection on Khan
    └── example_linear.py               High-dim linear regression (p=1000)

Examples

See the examples/ folder and examples/README.md for full self-contained scripts with detailed descriptions and expected output.

Example 1 — Khan SRBCT: multinomial, p=2308

python examples/example_multinomial_khan.py

The Khan SRBCT dataset (Khan et al., Nature Medicine 7, 673–679, 2001) contains expression levels of 2308 genes measured on tissue samples from four childhood tumour types. COMBSS searches for the best gene subset of size k = 1, …, 20. Selected output:

k =  1  Test acc = 0.450  genes: [1954]
k =  2  Test acc = 0.700  genes: [1954, 1955]
k =  5  Test acc = 0.950  genes: [246, 1389, 1645, 1954, 1955]
k =  8  Test acc = 0.950  genes: [187, 246, 509, 1389, 1645, 1954, 1955, 2050]
k = 12  Test acc = 1.000  genes: [187, 246, 509, 545, 910, 1074, 1319, 1389, 1645, 1954, 1955, 2050]

COMBSS achieves perfect test accuracy (20/20) with only 12 genes out of 2308. By contrast, the multinomial group Lasso requires 35 genes to reach the same performance.

Method                 Genes selected   Test accuracy
COMBSS (k = 8)         8                0.95
COMBSS (k = 12)        12               1.00
Group Lasso (λ.1se)    28               0.95
Group Lasso (λ.min)    30               0.95
Group Lasso (best λ)   35               1.00

Example 2 — Simulated binary logit: n=200, p=1000

python examples/example_binary_logit.py

Simulated data with 10 truly active features (variables 1–10) among 1000. COMBSS recovers all 10 by k=10 with ~88% test accuracy.

Example 3 — Lambda selection via LOOCV

python examples/example_lambda_cv.py

Runs combss_cv.cv_select_lambda on the Khan data with a grid of 7 lambda values. Returns the best lambda per k (individually) and overall (mean across k). Ridge regularisation helps most at small k (k=3,4) where it selects different genes and improves LOOCV accuracy.

Example 4 — Linear regression: high-dimensional feature selection (p=1000)

python examples/example_linear.py

Simulated data: n=100, p=1000, 10 truly active features, AR(1) correlation ρ=0.8, SNR=5. Two demonstrations in one script:

  • Part 1: combss.fw with model_type='linear' — feature selection path for k=1,…,15, with OLS refit and test MSE at each size.
  • Part 2: combss_cv.cv_select_lambda with model_type='linear' — ridge penalty selection via the exact hat-matrix LOOCV shortcut (cost O(nk²)).

COMBSS recovers the exact true support {1,…,10} at k=10, where test MSE is minimised.
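The hat-matrix LOOCV shortcut referred to above is the standard identity for ridge/OLS: the leave-one-out residual is e_i = (y_i − ŷ_i)/(1 − h_ii), where h_ii is the i-th diagonal of H = X(XᵀX + λI)⁻¹Xᵀ, so LOOCV needs one fit rather than n. A self-contained check of the identity (this is the assumed form, not code from combss_cv):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, lam = 30, 5, 0.1
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# One fit: hat matrix, residuals, and the shortcut LOOCV MSE.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(k), X.T)
resid = y - H @ y
loocv_mse_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: refit with each observation held out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(k), Xi.T @ yi)
    errs.append((y[i] - X[i] @ beta) ** 2)

print(np.isclose(loocv_mse_shortcut, np.mean(errs)))  # True
```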


Module Reference

combss.fw(X, y, q, ...)

Runs the COMBSS Frank–Wolfe homotopy for model sizes k = 1, …, q.

Parameter            Default    Description
X                    required   Design matrix (n, p+1) with intercept column at index 0
y                    required   Labels: {0,1} for logit; {1,…,C} for multinomial; real-valued for linear
q                    required   Maximum subset size
Niter                50         Number of homotopy iterations
lam                  0          Ridge regularisation parameter λ ≥ 0
scale                True       Column-normalise X before running
model_type           'logit'    'logit', 'multinomial', or 'linear'
C                    2          Number of classes (required for multinomial/logit; ignored for linear)
inner_tol            1e-4       Inner solver convergence tolerance
mandatory_features   None       List of 1-indexed features always included
verbose              True       Print calibration and per-k progress

Returns a Result object with:

  • .models — list of length q; models[k-1] is a sorted array of 1-indexed selected feature indices for model size k
  • .niters — list of iteration counts (one per k)

combss_cv.cv_select_lambda(X, y, q, ...)

Selects the ridge penalty λ via leave-one-out cross-validation. Feature selection (the expensive COMBSS run) is performed once per λ on the full data; only the lightweight refit step is cross-validated. For linear regression, LOOCV uses the exact hat-matrix shortcut (cost O(nk²); no per-fold loop). For classification, a logistic regression is refit for each held-out observation.

Parameter      Default                 Description
X              required                Feature matrix (n, p), no intercept column
y              required                Class labels or continuous response
q              required                Maximum subset size
C              None                    Number of classes (required for logit/multinomial; omit for linear)
lambda_grid    [0]+logspace(-3,1,10)   Candidate λ values
Niter          50                      COMBSS homotopy iterations
model_type     'multinomial'           'logit', 'multinomial', or 'linear'
inner_tol      1e-4                    Inner solver tolerance
lambda_refit   0                       Ridge penalty for LOOCV refit
verbose        True                    Print per-λ progress

Returns (best_lambda, best_lambda_per_k, results_df).

  • For classification: results_df has columns lambda, k, loocv_acc, selected; best λ maximises mean LOOCV accuracy.
  • For linear: results_df has columns lambda, k, loocv_mse, selected; best λ minimises mean LOOCV MSE.

Algorithm Details

The Frank–Wolfe homotopy algorithm (Algorithm 1 in the paper) operates as follows:

  1. Initialise t = (k/p, …, k/p) — centroid of the k-sparse simplex T_k
  2. For i = 1, …, N (homotopy loop):
    • Set δ_i = min(δ_min · r^i, δ_max) with r = (δ_max/δ_min)^(1/N)
    • Set α_i = i/N (Frank–Wolfe step size)
    • Solve the ridge-penalised GLM inner problem at current t
    • Compute the Danskin gradient: ∇f(t)_j ∝ −‖ξ_j‖² / t_j³
    • Find the FW vertex: s = argmin_{s ∈ T_k} ⟨∇f(t), s⟩ (k smallest grad)
    • Update: t ← (1−α) t + α s
  3. Select the k features with the largest final t values (equivalently, the k features in the last FW vertex s)
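The vertex search and convex-combination update in steps 2 and 3 can be sketched in a few lines. The gradient is supplied externally here (a random stand-in, since the exact Danskin gradient requires the inner GLM solve); only the Frank–Wolfe mechanics are illustrated:

```python
import numpy as np

def fw_step(t, grad, k, alpha):
    # FW vertex: the linear minimiser over T_k puts mass 1 on the
    # k coordinates with the smallest gradient, 0 elsewhere.
    s = np.zeros_like(t)
    s[np.argsort(grad)[:k]] = 1.0
    # Convex-combination update keeps t inside [0, 1]^p.
    return (1 - alpha) * t + alpha * s

p, k, N = 8, 3, 20
t = np.full(p, k / p)               # centroid initialisation
rng = np.random.default_rng(2)
grad = rng.standard_normal(p)       # stand-in for the Danskin gradient

for i in range(1, N + 1):
    t = fw_step(t, grad, k, alpha=i / N)

support = np.sort(np.argsort(t)[-k:])   # k largest final t values
print(support)
```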

The penalty schedule (δ_min, δ_max) is auto-calibrated from λ_max(X^T X) estimated by power iteration, so no tuning is needed in practice.
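A generic power-iteration estimate of λ_max(XᵀX), sketched here for illustration (not the repository's implementation): alternating products with X and Xᵀ avoid ever forming the p × p Gram matrix.

```python
import numpy as np

def lambda_max(X, n_iter=200, seed=0):
    # Power iteration on X^T X using only matvecs with X and X^T.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = X.T @ (X @ v)
        v = w / np.linalg.norm(w)
    return v @ (X.T @ (X @ v))      # Rayleigh quotient at convergence

X = np.random.default_rng(3).standard_normal((50, 200))
est = lambda_max(X)
exact = np.linalg.eigvalsh(X.T @ X).max()
print(np.isclose(est, exact, rtol=1e-3))  # True
```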
