Python implementation of COMBSS (Continuous Optimisation for Best Subset Selection) for generalised linear models, as described in the paper.
Best subset selection — finding the optimal subset of k predictors that
maximises the likelihood — is fundamental for interpretable and parsimonious
statistical modelling, but is NP-hard in general. COMBSS overcomes this
computational barrier by reformulating the discrete combinatorial problem as a
continuous optimisation over the hypercube [0, 1]^p.
The key innovation is the use of Danskin's envelope theorem to compute the
gradient of the relaxed objective exactly, requiring only a single
ridge-penalised GLM solve per iteration. This makes the method scalable to
high-dimensional settings where p ≫ n.

Features:
- Binary logistic regression (two-class classification)
- Multinomial logistic regression (multi-class, C > 2)
- Linear regression (continuous response)
- LOOCV-based ridge penalty selection via `combss_cv`
- Support for mandatory variables (always included in every model)
- Column normalisation for numerical stability
- Warm-started inner solves with scikit-learn's L-BFGS-B solver
- Gradient computation via Danskin's envelope theorem (no Hessian)
- Frank–Wolfe homotopy with auto-calibrated geometric delta schedule
Requires:

- numpy
- pandas
- scikit-learn

Install with:

```bash
pip install numpy pandas scikit-learn
```

No further installation is needed — clone the repository and import directly.
Searches for the smallest gene subset achieving accurate classification across four childhood tumour types in the Khan SRBCT dataset (p = 2308 genes, n = 63 training samples). The inclusion path is returned for k = 1, …, 20.
```python
import numpy as np
import pandas as pd
import combss

train = pd.read_csv('data/Khan_train.csv')
test = pd.read_csv('data/Khan_test.csv')
y_train = train.iloc[:, 0].values.astype(int)
X_train = train.iloc[:, 1:].values.astype(float)
y_test = test.iloc[:, 0].values.astype(int)
X_test = test.iloc[:, 1:].values.astype(float)

n, C = len(y_train), len(np.unique(y_train))
X_fw = np.hstack([np.ones((n, 1)), X_train])  # prepend intercept

result = combss.fw(
    X_fw, y_train,
    q=20,          # search k = 1 .. 20
    Niter=50,
    model_type='multinomial',
    C=C,
)

for k, feat in enumerate(result.models, 1):
    print(f'k={k:2d} genes={feat.tolist()}')
```

Searches for the best feature subset of size k = 1, …, 15 for a two-class classification problem. Labels should be binary (0/1 or 1/2).
```python
import combss

# X_fw : (n, p+1) with intercept column at index 0
# y    : (n,) binary labels {0, 1}
result = combss.fw(
    X_fw, y,
    q=15,
    Niter=50,
    model_type='logit',
    C=2,
)
```

Searches for the best feature subset of size k = 1, …, 15 for a continuous response. The inner solve uses an exact closed-form solution, making the method efficient even in high-dimensional settings (p ≫ n).
```python
import numpy as np
import combss

# X_fw : (n, p+1) with intercept column at index 0
# y    : (n,) continuous response
result = combss.fw(
    X_fw, y,
    q=15,
    Niter=50,
    model_type='linear',
)

for k, feat in enumerate(result.models, 1):
    print(f'k={k:2d} features={feat.tolist()}')
```

Selects the ridge penalty λ from a candidate grid using leave-one-out cross-validation. Feature selection runs once per λ on the full data; only the lightweight refit step is cross-validated.
```python
import combss_cv

# Classification (logit / multinomial)
best_lam, best_lam_per_k, cv_df = combss_cv.cv_select_lambda(
    X_train, y_train,
    q=15,
    C=4,
    model_type='multinomial',
)
# cv_df columns: lambda, k, loocv_acc, selected

# Linear regression (C not required)
best_lam, best_lam_per_k, cv_df = combss_cv.cv_select_lambda(
    X_train, y_train,
    q=15,
    model_type='linear',
)
# cv_df columns: lambda, k, loocv_mse, selected
```

```
combss_glm/
├── combss.py                        Main module — fw() algorithm (logit, multinomial, linear)
├── combss_cv.py                     LOOCV-based lambda selection
├── multinomial_grad.py              Internal: multinomial gradient and inner solver
├── combss_demo.ipynb                Interactive notebook demonstrating both examples
├── data/
│   ├── Khan_train.csv               Khan SRBCT training set (n=63, p=2308)
│   ├── Khan_test.csv                Khan SRBCT test set (n=20)
│   ├── n-200-p1000Replica1.csv      Simulated binary train (n=200, p=1000)
│   └── n-200-p1000Test-Replica1.csv Simulated binary test (n=1000)
└── examples/
    ├── example_multinomial_khan.py  Khan SRBCT full inclusion path
    ├── example_binary_logit.py      High-dim binary logit (p=1000)
    ├── example_lambda_cv.py         LOOCV lambda selection on Khan
    └── example_linear.py            High-dim linear regression (p=1000)
```
See the `examples/` folder and `examples/README.md` for full self-contained scripts with detailed descriptions and expected output.
```bash
python examples/example_multinomial_khan.py
```

The Khan SRBCT dataset (Khan et al., Nature Medicine 7, 673–679, 2001) contains expression levels of 2308 genes measured on tissue samples from four childhood tumour types. COMBSS searches for the best gene subset of size k = 1, …, 20. Selected output:
```
k =  1   Test acc = 0.450   genes: [1954]
k =  2   Test acc = 0.700   genes: [1954, 1955]
k =  5   Test acc = 0.950   genes: [246, 1389, 1645, 1954, 1955]
k =  8   Test acc = 0.950   genes: [187, 246, 509, 1389, 1645, 1954, 1955, 2050]
k = 12   Test acc = 1.000   genes: [187, 246, 509, 545, 910, 1074, 1319, 1389, 1645, 1954, 1955, 2050]
```
COMBSS achieves perfect test accuracy (20/20) with only 12 genes out of 2308. By contrast, the multinomial group Lasso requires 35 genes to reach the same performance.
| Method | Genes selected | Test accuracy |
|---|---|---|
| COMBSS (k = 8) | 8 | 0.95 |
| COMBSS (k = 12) | 12 | 1.00 |
| Group Lasso (λ.1se) | 28 | 0.95 |
| Group Lasso (λ.min) | 30 | 0.95 |
| Group Lasso (best λ) | 35 | 1.00 |
```bash
python examples/example_binary_logit.py
```

Simulated data with 10 truly active features (variables 1–10) among 1000. COMBSS recovers all 10 by k = 10 with ~88% test accuracy.
```bash
python examples/example_lambda_cv.py
```

Runs `combss_cv.cv_select_lambda` on the Khan data with a grid of 7 lambda values. Returns the best lambda per k (individually) and overall (mean across k). Ridge regularisation helps most at small k (k = 3, 4), where it selects different genes and improves LOOCV accuracy.
```bash
python examples/example_linear.py
```

Simulated data: n=100, p=1000, 10 truly active features, AR(1) correlation ρ=0.8, SNR=5. Two demonstrations in one script:

- Part 1: `combss.fw` with `model_type='linear'` — feature selection path for k = 1, …, 15, with OLS refit and test MSE at each size.
- Part 2: `combss_cv.cv_select_lambda` with `model_type='linear'` — ridge penalty selection via the exact hat-matrix LOOCV shortcut (cost O(nk²)).
COMBSS recovers the exact true support {1,…,10} at k=10, where test MSE is minimised.
`combss.fw` runs the COMBSS Frank–Wolfe homotopy for model sizes k = 1, …, q.
| Parameter | Default | Description |
|---|---|---|
| `X` | — | Design matrix (n, p+1) with intercept column at index 0 |
| `y` | — | Labels: {0,1} for logit; {1,…,C} for multinomial; real-valued for linear |
| `q` | — | Maximum subset size |
| `Niter` | 50 | Number of homotopy iterations |
| `lam` | 0 | Ridge regularisation parameter λ ≥ 0 |
| `scale` | `True` | Column-normalise X before running |
| `model_type` | `'logit'` | `'logit'`, `'multinomial'`, or `'linear'` |
| `C` | 2 | Number of classes (required for multinomial/logit; ignored for linear) |
| `inner_tol` | 1e-4 | Inner solver convergence tolerance |
| `mandatory_features` | `None` | List of 1-indexed features always included |
| `verbose` | `True` | Print calibration and per-k progress |
Returns a `Result` object with:

- `.models` — list of length q; `models[k-1]` is a sorted array of 1-indexed selected feature indices for model size k
- `.niters` — list of iteration counts (one per k)
`combss_cv.cv_select_lambda` selects the ridge penalty λ via leave-one-out cross-validation. Feature selection (the expensive COMBSS run) is performed once per λ on the full data; only the lightweight refit step is cross-validated. For linear regression, LOOCV uses the exact hat-matrix shortcut (cost O(nk²); no per-fold loop). For classification, a logistic regression is refit for each held-out observation.
| Parameter | Default | Description |
|---|---|---|
| `X` | — | Feature matrix (n, p) — no intercept column |
| `y` | — | Class labels or continuous response |
| `q` | — | Maximum subset size |
| `C` | `None` | Number of classes (required for logit/multinomial; omit for linear) |
| `lambda_grid` | `[0] + logspace(-3, 1, 10)` | Candidate λ values |
| `Niter` | 50 | COMBSS homotopy iterations |
| `model_type` | `'multinomial'` | `'logit'`, `'multinomial'`, or `'linear'` |
| `inner_tol` | 1e-4 | Inner solver tolerance |
| `lambda_refit` | 0 | Ridge penalty for LOOCV refit |
| `verbose` | `True` | Print per-λ progress |
Returns `(best_lambda, best_lambda_per_k, results_df)`.

- For classification: `results_df` has columns `lambda, k, loocv_acc, selected`; the best λ maximises mean LOOCV accuracy.
- For linear: `results_df` has columns `lambda, k, loocv_mse, selected`; the best λ minimises mean LOOCV MSE.
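The hat-matrix shortcut for linear LOOCV can be sketched in a few lines. This is a standalone illustration of the identity, not the package's implementation, and `ridge_loocv_mse` is a hypothetical helper name: for ridge regression the leave-one-out residual equals e_i / (1 - h_ii), so no refitting loop is needed.

```python
import numpy as np

def ridge_loocv_mse(X, y, lam):
    """Exact LOOCV MSE for ridge regression via the hat-matrix identity
    e_i_loo = e_i / (1 - h_ii), where h_ii = diag(X (X'X + lam I)^{-1} X').
    One factorisation, O(n k^2) for k columns; no per-fold loop."""
    n, k = X.shape
    G = X.T @ X + lam * np.eye(k)
    beta = np.linalg.solve(G, X.T @ y)       # full-data ridge fit
    A = np.linalg.solve(G, X.T)              # (k, n): G^{-1} X'
    h = np.einsum('ij,ji->i', X, A)          # leverages h_ii without forming H
    loo_resid = (y - X @ beta) / (1.0 - h)   # exact leave-one-out residuals
    return np.mean(loo_resid ** 2)

# Check against the brute-force per-fold loop on a small problem
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.standard_normal((30, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(30)
lam = 0.5

fast = ridge_loocv_mse(X, y, lam)
slow = np.mean([
    (y[i] - X[i] @ np.linalg.solve(
        np.delete(X, i, 0).T @ np.delete(X, i, 0) + lam * np.eye(4),
        np.delete(X, i, 0).T @ np.delete(y, i))) ** 2
    for i in range(30)
])
assert np.isclose(fast, slow)
```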
The Frank–Wolfe homotopy algorithm (Algorithm 1 in the paper) operates as follows:

- Initialise t = (k/p, …, k/p) — the centroid of the k-sparse simplex T_k
- For i = 1, …, N (homotopy loop):
  - Set δ_i = min(δ_min · r^i, δ_max) with r = (δ_max/δ_min)^(1/N)
  - Set α_i = i/N (Frank–Wolfe step size)
  - Solve the ridge-penalised GLM inner problem at the current t
  - Compute the Danskin gradient: ∇f(t)_j ∝ −‖ξ_j‖² / t_j³
  - Find the FW vertex: s = argmin_{s ∈ T_k} ⟨∇f(t), s⟩ (the k smallest gradient entries)
  - Update: t ← (1−α_i) t + α_i s
- Select the k features with the largest final t values (equivalently, the k features in the last FW vertex s)
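The loop above can be sketched structurally. The gradient here is a stand-in passed in as a callable rather than the paper's exact Danskin gradient, and `fw_homotopy_sketch` is an illustrative name, not the package's code:

```python
import numpy as np

def fw_homotopy_sketch(grad, p, k, N=50, delta_min=1e-2, delta_max=1e2):
    """Structural sketch of the Frank-Wolfe homotopy loop.
    `grad(t, delta)` stands in for the Danskin gradient of the relaxed
    objective; the vertex step and geometric delta schedule follow the
    pseudocode above."""
    t = np.full(p, k / p)                     # centroid of the k-sparse simplex
    r = (delta_max / delta_min) ** (1.0 / N)  # geometric ratio of the schedule
    s = np.zeros(p)
    for i in range(1, N + 1):
        delta = min(delta_min * r ** i, delta_max)
        alpha = i / N                         # Frank-Wolfe step size
        g = grad(t, delta)
        s = np.zeros(p)
        s[np.argsort(g)[:k]] = 1.0            # FW vertex: k smallest gradients
        t = (1 - alpha) * t + alpha * s       # convex combination stays in T_k
    return np.flatnonzero(s)                  # features in the last FW vertex

# Toy gradient whose first three coordinates dominate (illustration only)
scores = np.array([-10.0, -10.0, -10.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
selected = fw_homotopy_sketch(
    lambda t, d: scores / np.maximum(t, 1e-8) ** 3, p=10, k=3)
print(selected)  # → [0 1 2]
```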
The penalty schedule (δ_min, δ_max) is auto-calibrated from λ_max(X^T X) estimated by power iteration, so no tuning is needed in practice.
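Estimating λ_max(X^T X) by power iteration is standard and matrix-free; a minimal sketch (the helper name `lam_max_xtx` is illustrative, not the package's API):

```python
import numpy as np

def lam_max_xtx(X, n_iter=100, seed=0):
    """Estimate lambda_max(X'X) by power iteration, matrix-free:
    only X @ v and X.T @ (X @ v) products are formed, never X'X itself."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = X.T @ (X @ v)          # one multiply by X'X
        v = w / np.linalg.norm(w)
    return v @ (X.T @ (X @ v))     # Rayleigh quotient at the converged vector

X = np.diag([3.0, 2.0, 1.0])       # X'X has eigenvalues 9, 4, 1
print(lam_max_xtx(X))              # → 9.0 (up to floating point)
```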