saratmoka/COMBSS-GLM-Python

COMBSS-GLM: Best Subset Selection for Generalised Linear Models

Python implementation of COMBSS (Continuous Optimisation for Best Subset Selection) for generalised linear models, as described in the accompanying paper.




Overview

Best subset selection — finding the optimal subset of k predictors that maximises the likelihood — is fundamental for interpretable and parsimonious statistical modelling, but is NP-hard in general. COMBSS overcomes this computational barrier by reformulating the discrete combinatorial problem as a continuous optimisation over the hypercube $[0,1]^p$ via a Boolean relaxation, and solving it with a Frank–Wolfe homotopy algorithm.

The key innovation is the use of Danskin's envelope theorem to compute the gradient of the relaxed objective exactly, requiring only a single ridge-penalised GLM solve per iteration. This makes the method scalable to high-dimensional settings with $p \gg n$.


Features

  • Binary logistic regression (two-class classification)
  • Multinomial logistic regression (multi-class, C > 2)
  • Linear regression (continuous response)
  • LOOCV-based ridge penalty selection via combss_cv
  • Support for mandatory variables (always included in every model)
  • Column normalisation for numerical stability
  • Warm-started inner solves with scikit-learn's L-BFGS-B solver
  • Gradient computation via Danskin's envelope theorem (no Hessian)
  • Frank–Wolfe homotopy with auto-calibrated geometric delta schedule

Dependencies

numpy
pandas
scikit-learn

Install with:

pip install numpy pandas scikit-learn

No further installation is needed — clone the repository and import directly.


Quick Start

Multinomial logistic regression (Khan SRBCT, 4 classes)

Searches for the smallest gene subset achieving accurate classification across four childhood tumour types in the Khan SRBCT dataset (p = 2308 genes, n = 63 training samples). The inclusion path is returned for k = 1, …, 20.

import numpy as np
import pandas as pd
import combss

train = pd.read_csv('data/Khan_train.csv')
test  = pd.read_csv('data/Khan_test.csv')

y_train = train.iloc[:, 0].values.astype(int)
X_train = train.iloc[:, 1:].values.astype(float)
y_test  = test.iloc[:, 0].values.astype(int)
X_test  = test.iloc[:, 1:].values.astype(float)

n, C = len(y_train), len(np.unique(y_train))
X_fw = np.hstack([np.ones((n, 1)), X_train])   # prepend intercept

result = combss.fw(
    X_fw, y_train,
    q          = 20,             # search k = 1 .. 20
    Niter      = 50,
    model_type = 'multinomial',
    C          = C,
)

for k, feat in enumerate(result.models, 1):
    print(f'k={k:2d}  genes={feat.tolist()}')

Binary logistic regression

Searches for the best feature subset of size k = 1, …, 15 for a two-class classification problem. Labels should be binary (0/1 or 1/2).

import combss

# X_fw : (n, p+1) with intercept column at index 0
# y    : (n,) binary labels {0, 1}

result = combss.fw(
    X_fw, y,
    q          = 15,
    Niter      = 50,
    model_type = 'logit',
    C          = 2,
)

Linear regression

Searches for the best feature subset of size k = 1, …, 15 for a continuous response. The inner solve uses an exact closed-form solution, making the method efficient even in high-dimensional settings (p ≫ n).

import numpy as np
import combss

# X_fw : (n, p+1) with intercept column at index 0
# y    : (n,) continuous response

result = combss.fw(
    X_fw, y,
    q          = 15,
    Niter      = 50,
    model_type = 'linear',
)

for k, feat in enumerate(result.models, 1):
    print(f'k={k:2d}  features={feat.tolist()}')
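A note on why the linear case stays fast when p ≫ n: for λ > 0 the ridge solution can be computed through an n × n system instead of a p × p one. The sketch below is a generic identity check, not code from this repository:

```python
import numpy as np

# For lam > 0, the primal ridge solution
#     beta = (X.T X + lam I_p)^{-1} X.T y    (p x p solve)
# equals the dual form
#     beta = X.T (X X.T + lam I_n)^{-1} y    (n x n solve)
# so the closed-form inner solve scales with n, not p.
rng = np.random.default_rng(0)
n, p, lam = 20, 500, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(beta_primal, beta_dual))  # True
```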

Lambda selection via LOOCV

Selects the ridge penalty λ from a candidate grid using leave-one-out cross-validation. Feature selection runs once per λ on the full data; only the lightweight refit step is cross-validated.

import combss_cv

# Classification (logit / multinomial)
best_lam, best_lam_per_k, cv_df = combss_cv.cv_select_lambda(
    X_train, y_train,
    q           = 15,
    C           = 4,
    model_type  = 'multinomial',
)
# cv_df columns: lambda, k, loocv_acc, selected

# Linear regression (C not required)
best_lam, best_lam_per_k, cv_df = combss_cv.cv_select_lambda(
    X_train, y_train,
    q           = 15,
    model_type  = 'linear',
)
# cv_df columns: lambda, k, loocv_mse, selected

Repository Structure

combss_glm/
├── combss.py            Main module — fw() algorithm (logit, multinomial, linear)
├── combss_cv.py         LOOCV-based lambda selection
├── multinomial_grad.py  Internal: multinomial gradient and inner solver
├── combss_demo.ipynb    Interactive notebook demonstrating both examples
├── data/
│   ├── Khan_train.csv                       Khan SRBCT training set (n=63, p=2308)
│   ├── Khan_test.csv                        Khan SRBCT test set    (n=20)
│   ├── n-200-p1000Replica1.csv              Simulated binary train (n=200, p=1000)
│   └── n-200-p1000Test-Replica1.csv        Simulated binary test  (n=1000)
└── examples/
    ├── example_multinomial_khan.py     Khan SRBCT full inclusion path
    ├── example_binary_logit.py         High-dim binary logit (p=1000)
    ├── example_lambda_cv.py            LOOCV lambda selection on Khan
    └── example_linear.py               High-dim linear regression (p=1000)

Examples

See the examples/ folder and examples/README.md for full self-contained scripts with detailed descriptions and expected output.

Example 1 — Khan SRBCT: multinomial, p=2308

python examples/example_multinomial_khan.py

The Khan SRBCT dataset (Khan et al., Nature Medicine 7, 673–679, 2001) contains expression levels of 2308 genes measured on tissue samples from four childhood tumour types. COMBSS searches for the best gene subset of size k = 1, …, 20. Selected output:

k =  1  Test acc = 0.450  genes: [1954]
k =  2  Test acc = 0.700  genes: [1954, 1955]
k =  5  Test acc = 0.950  genes: [246, 1389, 1645, 1954, 1955]
k =  8  Test acc = 0.950  genes: [187, 246, 509, 1389, 1645, 1954, 1955, 2050]
k = 12  Test acc = 1.000  genes: [187, 246, 509, 545, 910, 1074, 1319, 1389, 1645, 1954, 1955, 2050]

COMBSS achieves perfect test accuracy (20/20) with only 12 genes out of 2308. By contrast, the multinomial group Lasso requires 35 genes to reach the same performance.

Method                 Genes selected   Test accuracy
COMBSS (k = 8)         8                0.95
COMBSS (k = 12)        12               1.00
Group Lasso (λ.1se)    28               0.95
Group Lasso (λ.min)    30               0.95
Group Lasso (best λ)   35               1.00

Example 2 — Simulated binary logit: n=200, p=1000

python examples/example_binary_logit.py

Simulated data with 10 truly active features (variables 1–10) among 1000. COMBSS recovers all 10 by k=10 with ~88% test accuracy.

Example 3 — Lambda selection via LOOCV

python examples/example_lambda_cv.py

Runs combss_cv.cv_select_lambda on the Khan data with a grid of 7 lambda values. Returns the best lambda per k (individually) and overall (mean across k). Ridge regularisation helps most at small k (k=3,4) where it selects different genes and improves LOOCV accuracy.

Example 4 — Linear regression: high-dimensional feature selection (p=1000)

python examples/example_linear.py

Simulated data: n=100, p=1000, 10 truly active features, AR(1) correlation ρ=0.8, SNR=5. Two demonstrations in one script:

  • Part 1: combss.fw with model_type='linear' — feature selection path for k=1,…,15, with OLS refit and test MSE at each size.
  • Part 2: combss_cv.cv_select_lambda with model_type='linear' — ridge penalty selection via the exact hat-matrix LOOCV shortcut (cost O(nk²)).

COMBSS recovers the exact true support {1,…,10} at k=10, where test MSE is minimised.
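The hat-matrix LOOCV shortcut referred to above is the standard identity for ridge/OLS: the leave-one-out residual is e_i = (y_i − ŷ_i)/(1 − h_ii), where h_ii is the i-th diagonal of H = X(XᵀX + λI)⁻¹Xᵀ, so LOOCV needs one fit rather than n. A self-contained check of the identity (this is the assumed form, not code from combss_cv):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, lam = 30, 5, 0.1
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# One fit: hat matrix, residuals, and the shortcut LOOCV MSE.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(k), X.T)
resid = y - H @ y
loocv_mse_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: refit with each observation held out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(k), Xi.T @ yi)
    errs.append((y[i] - X[i] @ beta) ** 2)

print(np.isclose(loocv_mse_shortcut, np.mean(errs)))  # True
```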


Module Reference

combss.fw(X, y, q, ...)

Runs the COMBSS Frank–Wolfe homotopy for model sizes k = 1, …, q.

Parameter            Default    Description
X                    required   Design matrix (n, p+1) with intercept column at index 0
y                    required   Labels: {0,1} for logit; {1,…,C} for multinomial; real-valued for linear
q                    required   Maximum subset size
Niter                50         Number of homotopy iterations
lam                  0          Ridge regularisation parameter λ ≥ 0
scale                True       Column-normalise X before running
model_type           'logit'    'logit', 'multinomial', or 'linear'
C                    2          Number of classes (required for multinomial/logit; ignored for linear)
inner_tol            1e-4       Inner solver convergence tolerance
mandatory_features   None       List of 1-indexed features always included
verbose              True       Print calibration and per-k progress

Returns a Result object with:

  • .models — list of length q; models[k-1] is a sorted array of 1-indexed selected feature indices for model size k
  • .niters — list of iteration counts (one per k)

combss_cv.cv_select_lambda(X, y, q, ...)

Selects the ridge penalty λ via leave-one-out cross-validation. Feature selection (the expensive COMBSS run) is performed once per λ on the full data; only the lightweight refit step is cross-validated. For linear regression, LOOCV uses the exact hat-matrix shortcut (cost O(nk²); no per-fold loop). For classification, a logistic regression is refit for each held-out observation.

Parameter      Default                 Description
X              required                Feature matrix (n, p), no intercept column
y              required                Class labels or continuous response
q              required                Maximum subset size
C              None                    Number of classes (required for logit/multinomial; omit for linear)
lambda_grid    [0]+logspace(-3,1,10)   Candidate λ values
Niter          50                      COMBSS homotopy iterations
model_type     'multinomial'           'logit', 'multinomial', or 'linear'
inner_tol      1e-4                    Inner solver tolerance
lambda_refit   0                       Ridge penalty for LOOCV refit
verbose        True                    Print per-λ progress

Returns (best_lambda, best_lambda_per_k, results_df).

  • For classification: results_df has columns lambda, k, loocv_acc, selected; best λ maximises mean LOOCV accuracy.
  • For linear: results_df has columns lambda, k, loocv_mse, selected; best λ minimises mean LOOCV MSE.

Algorithm Details

The Frank–Wolfe homotopy algorithm (Algorithm 1 in the paper) operates as follows:

  1. Initialise t = (k/p, …, k/p) — centroid of the k-sparse simplex T_k
  2. For i = 1, …, N (homotopy loop):
    • Set δ_i = min(δ_min · r^i, δ_max) with r = (δ_max/δ_min)^(1/N)
    • Set α_i = i/N (Frank–Wolfe step size)
    • Solve the ridge-penalised GLM inner problem at current t
    • Compute the Danskin gradient: ∇f(t)_j ∝ −‖ξ_j‖² / t_j³
    • Find the FW vertex: s = argmin_{s ∈ T_k} ⟨∇f(t), s⟩ (k smallest grad)
    • Update: t ← (1−α) t + α s
  3. Select the k features with the largest final t values (equivalently, the k features in the last FW vertex s)
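The vertex search and convex-combination update in steps 2 and 3 can be sketched in a few lines. The gradient is supplied externally here (a random stand-in, since the exact Danskin gradient requires the inner GLM solve); only the Frank–Wolfe mechanics are illustrated:

```python
import numpy as np

def fw_step(t, grad, k, alpha):
    # FW vertex: the linear minimiser over T_k puts mass 1 on the
    # k coordinates with the smallest gradient, 0 elsewhere.
    s = np.zeros_like(t)
    s[np.argsort(grad)[:k]] = 1.0
    # Convex-combination update keeps t inside [0, 1]^p.
    return (1 - alpha) * t + alpha * s

p, k, N = 8, 3, 20
t = np.full(p, k / p)               # centroid initialisation
rng = np.random.default_rng(2)
grad = rng.standard_normal(p)       # stand-in for the Danskin gradient

for i in range(1, N + 1):
    t = fw_step(t, grad, k, alpha=i / N)

support = np.sort(np.argsort(t)[-k:])   # k largest final t values
print(support)
```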

The penalty schedule (δ_min, δ_max) is auto-calibrated from λ_max(X^T X) estimated by power iteration, so no tuning is needed in practice.
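A generic power-iteration estimate of λ_max(XᵀX), sketched here for illustration (not the repository's implementation): alternating products with X and Xᵀ avoid ever forming the p × p Gram matrix.

```python
import numpy as np

def lambda_max(X, n_iter=200, seed=0):
    # Power iteration on X^T X using only matvecs with X and X^T.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = X.T @ (X @ v)
        v = w / np.linalg.norm(w)
    return v @ (X.T @ (X @ v))      # Rayleigh quotient at convergence

X = np.random.default_rng(3).standard_normal((50, 200))
est = lambda_max(X)
exact = np.linalg.eigvalsh(X.T @ X).max()
print(np.isclose(est, exact, rtol=1e-3))  # True
```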
