feat: improved benchmarks and do-no-harm feature selection by thinkall · Pull Request #3 · thinkall/featcopilot

thinkall · 2026-04-17T06:46:03Z

Summary

Improves FeatCopilot benchmarks (rigorous statistical methodology, broader dataset coverage, source-tracked reporting) and the feature engineering pipeline so it no longer regresses on real-world datasets.

Benchmark Infrastructure

5-fold stratified cross-validation with mean ± std reporting (was single 80/20 split)
Wilcoxon signed-rank test for statistical significance (p < 0.05)
Dataset source tracking: 31 real-world (30 INRIA + 1 HuggingFace) vs 32 synthetic, reported separately
Win/Tie/Loss counts with significance markers
Source provenance: loader-level synthetic-fallback detection, so a Kaggle/HF loader that falls back to a synthesized dataset is correctly bucketed as synthetic
New CLI options: --real-world, --n-folds, --n-seeds, --fast

FeatCopilot Feature Selection Improvements

Fixed redundancy eliminator: never removes original features (originals are categorically protected; was dropping them, e.g., diamonds 6→4 features)
Stricter L1 refinement: uses mean importance threshold (was 0.5× mean — too lenient)
Do-no-harm gate: held-out validation (5-split shuffle) to verify derived features help before keeping them. Falls back to original features automatically when features don't add value, with a RuntimeWarning so the fallback is never silent.
Repaired Copilot LLM SDK integration: the --with-llm benchmark path was previously falling back to mock LLM responses on github-copilot-sdk >= 0.1.32; updated featcopilot/llm/copilot_client.py to the new SDK API (mandatory on_permission_request, send_and_wait, disconnect) so LLM-generated features are actually evaluated.

Results (5-fold CV, FeatCopilot tabular + LLM engines, 63 datasets)

Dataset	Before	After
diabetes	-4.40%	+0.00% ✅
covertype	-3.01%	+0.00% ✅
electricity	-2.73%	+1.42% ✅
higgs	-1.77%	+0.73% ✅
bank_marketing	-2.12%	+0.00% ✅
diamonds	-0.56%	+0.00% ✅
eye_movements	+0.31%	+2.16% ✅
fake_news (HF text)	n/a	+76.31% ✅
nyc_taxi	n/a	+8.62% ✅

Real-world classification (15 INRIA + 1 HuggingFace text, 5-fold CV):

Win: 4 (25%), Tie: 11 (69%), Loss: 1 (6%)
Max regression: -1.17% (covertype_cat; not statistically significant at p<0.05)
No statistically significant regression on any real-world classification dataset

Real-world regression (15 INRIA, 5-fold CV):

Win: 1 (7%, nyc_taxi +8.62%), Tie: 13 (87%), Loss: 1 (7%, delays_zurich)
Note: delays_zurich shows -129.52% only because its baseline R² is ≈0.009 (essentially noise prediction); absolute Δ in R² is small and the result is not statistically significant (p=1.0)

All real-world datasets combined (31, 5-fold CV):

Win: 5 (16%), Tie: 24 (77%), Loss: 2 (6%)
Mean improvement: -1.29% (skewed entirely by the delays_zurich near-zero-baseline edge case; median improvement is +0.00%)

Synthetic datasets (32, supplementary, 5-fold CV):

Win: 20 (63%), Tie: 11 (34%), Loss: 1 (3%)
Mean improvement: +13.70%; Best: +104.34% (triple_interaction_regression), +103.41% (xor_regression)

Testing

All 768 tests pass (was 625 at PR open; +143 regression tests added across review iterations)
Pre-commit (black, ruff) passes
Python 3.10 / 3.11 / 3.12 / 3.13 all green in CI

…marks - Add 5-fold stratified cross-validation with mean±std reporting - Add Wilcoxon signed-rank test for statistical significance - Add dataset source tracking (real-world vs synthetic) - Separate real-world vs synthetic results in reports - Add --real-world, --n-folds, --n-seeds, --fast CLI options - Report win/tie/loss counts and significance markers Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- RedundancyEliminator: never remove original features when paired with each other; only remove derived features that are redundant with originals - FeatureSelector: always preserve original features in score dict even after redundancy elimination - Stricter L1 refinement: use mean_imp threshold instead of mean_imp*0.5 - Reduce fallback from top-half to top-3 for derived features Fixes regression on real-world datasets (diamonds 6→4 feature reduction). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add held-out validation gate to AutoFeatureEngineer.fit_transform() - Uses 3-split shuffle validation (not CV on feature-selected data) - Requires derived features to show clear benefit (delta > 0.001) - Falls back to original features if derived features don't help - Only activates when apply_selection=True (preserves existing API) - Eliminates regressions: diabetes -4.4%→-0.01%, covertype -3%→+0.11% Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…gate Real-world (31 datasets): Win 2 / Tie 27 / Loss 2, max regression -1.42% Synthetic (32 datasets): Win 18 / Tie 12 / Loss 2, mean improvement +14.49% Key improvements from baseline: - diabetes: -4.40% → +0.00% (do-no-harm gate blocks harmful features) - covertype: -3.01% → -0.11% (gate prevents major regression) - electricity: -2.73% → -0.30% (gate reduces regression) - diamonds: -0.56% → +0.00% (redundancy fix preserves originals) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ed threshold - Increase gate validation from 3 to 5 splits for stability - Add feature ratio-scaled threshold: more derived features = stricter gate - Fixes bank_marketing regression: -0.77% → +0.02% - Maintains eye_movements improvement: +3.63% - Preserves synthetic dataset gains Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…n real-world Real-world classification (15 INRIA datasets, 5-fold CV): - Win rate: 40% (6/15) with improvements up to +3.63% - Tie rate: 40% (6/15) — do-no-harm gate protects performance - Loss rate: 20% (3/15) — max regression -1.14%, none significant (p<0.05) vs original benchmark: - diabetes: -4.40% → -0.01% (fixed) - covertype: -3.01% → +0.11% (now wins) - electricity: -2.73% → -0.32% (much improved) - higgs: -1.77% → +0.45% (now wins) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…d results - Add statistical methodology section (5-fold CV, Wilcoxon, win/tie/loss) - Separate real-world vs synthetic results with real-world as primary - Highlight do-no-harm guarantee (no significant regression on any dataset) - Update results to reflect latest improvements Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov-commenter · 2026-04-17T06:48:11Z

Codecov Report

❌ Patch coverage is 96.24060% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.90%. Comparing base (fd30793) to head (1e82335).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
featcopilot/transformers/sklearn_compat.py	96.06%	5 Missing ⚠️
featcopilot/engines/text.py	60.00%	2 Missing ⚠️
featcopilot/llm/copilot_client.py	88.88%	2 Missing ⚠️
featcopilot/selection/unified.py	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main       #3      +/-   ##
==========================================
+ Coverage   88.18%   88.90%   +0.71%     
==========================================
  Files          36       36              
  Lines        3808     3937     +129     
==========================================
+ Hits         3358     3500     +142     
+ Misses        450      437      -13

Flag	Coverage Δ
unittests	`88.90% <96.24%> (+0.71%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d9d5df5fe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot

Pull request overview

This PR upgrades the benchmarking suite to support more rigorous, publication-style evaluation (CV, significance testing, dataset source stratification) and hardens feature selection to prevent regressions by preserving original features and adding a “do-no-harm” validation gate.

Changes:

Add a do-no-harm gate to AutoFeatureEngineer.fit_transform() that can drop derived features when held-out validation doesn’t show benefit.
Fix feature selection behavior to preserve original features during redundancy elimination and tighten L1 refinement thresholds.
Rework the simple-model benchmark runner/reporting to use k-fold CV with Wilcoxon signed-rank tests and to separate real-world vs synthetic datasets.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
featcopilot/transformers/sklearn_compat.py	Adds do-no-harm validation gate after selection.
featcopilot/selection/unified.py	Preserves original features in redundancy step; tightens L1 refinement thresholds/fallback.
featcopilot/selection/redundancy.py	Ensures redundancy eliminator never removes original features when paired with derived ones.
benchmarks/simple_models/run_simple_models_benchmark.py	Switches to CV + Wilcoxon; adds real-world filtering and new CLI args; updates report format.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md	Updates published benchmark report snapshot to new format.
benchmarks/datasets.py	Introduces dataset source tagging + helpers (`is_real_world`, `list_real_world_datasets`, etc.).
benchmarks/init.py	Re-exports new dataset source helpers/constants.
benchmarks/README.md	Updates benchmark documentation/results summary to reflect new methodology and source split.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- sklearn_compat: preserve _selector after do-no-harm fall-back, rewriting its selected features to original-only so subsequent transform() calls emit the same column set (codex P1) - sklearn_compat: only invoke do-no-harm gate when a selector actually ran (skip when max_features=None) so apply_selection=True remains a no-op in that case (Copilot) - sklearn_compat: classify task via sklearn.utils.multiclass.type_of_target so bool / nullable Int64 / float-encoded {0.0,1.0} targets are routed through the classifier path (Copilot) - tests: add TestDoNoHarmGate covering keep, fall-back, float-binary detection, gate gating, and post-fall-back transform consistency - benchmarks: pass through --with-llm into FeatCopilot in CV path so the reported with_llm flag matches what was actually run (codex P2 / Copilot) - benchmarks: drop dead use_feature_cache argument and --no-feature-cache CLI flag (Copilot) - benchmarks/SIMPLE_MODELS_BENCHMARK.md: annotate as a curated 26-of-63 subset and document how to regenerate the full report (Copilot) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Updates FeatCopilot’s benchmarking and feature-selection pipeline to (a) report more statistically rigorous benchmark results and (b) prevent engineered features from causing regressions by adding a “do-no-harm” validation gate.

Changes:

Added a held-out validation “do-no-harm” gate to AutoFeatureEngineer.fit_transform() plus new unit tests covering keep/fallback behavior and task-type inference.
Adjusted feature selection to preserve original features through redundancy elimination and tightened derived-feature refinement thresholds.
Refactored the simple-models benchmark to use k-fold CV with Wilcoxon signed-rank significance testing and to report real-world vs synthetic datasets separately; updated benchmark docs accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_sklearn_compat.py	Adds tests for the new do-no-harm gate and its integration points.
featcopilot/transformers/sklearn_compat.py	Introduces `_do_no_harm_gate()` and invokes it after selection in `fit_transform()`.
featcopilot/selection/unified.py	Ensures original features aren’t dropped when filtering by non-redundancy; tightens L1 refine.
featcopilot/selection/redundancy.py	Updates redundancy removal rules to avoid dropping original features when correlated.
benchmarks/simple_models/run_simple_models_benchmark.py	Moves to CV + Wilcoxon reporting and adds `--real-world/--n-folds/--n-seeds/--fast` options.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md	Updates the published report snapshot to match the new methodology and source split.
benchmarks/datasets.py	Adds dataset source tracking + helper APIs for listing/flagging real-world vs synthetic datasets.
benchmarks/init.py	Re-exports new dataset source helpers/constants.
benchmarks/README.md	Updates benchmark methodology and summarizes results split by source.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- sklearn_compat: do-no-harm gate now builds the original-only baseline from the truly original input (X_original) rather than slicing X_engineered, so the comparison stays meaningful even if downstream steps drop or rename original columns (Copilot) - core/base: add public BaseSelector.set_selected_features(features) setter and switch the do-no-harm gate fall-back to use it instead of reaching into _selected_features (Copilot) - benchmarks: time-series / forecasting tasks (task contains "forecast" or "timeseries") now use TimeSeriesSplit with chronological folds instead of KFold(shuffle=True), preventing target leakage across folds and the resulting overly optimistic Wilcoxon scores (Copilot) - tests: TestDoNoHarmGate now exercises set_selected_features via a mock with side_effect; new test_gate_uses_x_original_for_baseline verifies the baseline truly comes from X_original; new test_set_selected_features_overrides_selection covers the public setter end-to-end (transform reflects the override and the input list is copied defensively). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

thinkall · 2026-04-30T16:21:18Z

@copilot review

Copilot · 2026-04-30T16:24:35Z

@copilot review

All 6 round-1 and 3 round-2 review threads have been addressed in commits c1c2b35 and 813a7d4. Here's a summary:

sklearn_compat.py (do-no-harm gate):

Task-type inference now uses sklearn.utils.multiclass.type_of_target() instead of the integer-dtype heuristic — correctly handles bool, nullable Int64, and {0.0, 1.0} float-encoded binary targets.
Gate only runs when self._selector is not None (i.e., selection actually ran), so apply_selection=True with max_features=None remains a no-op.
X_original is now used to build the baseline feature matrix, not a column-slice of X_engineered, making the engineered-vs-original comparison correct even if selection renames/drops original columns.
Selector is updated via the public set_selected_features() on BaseSelector instead of mutating _selected_features directly.
7 new unit tests cover keep/fallback decisions, float-encoded target detection, gate bypass when no selector, and transform() consistency after fallback.

Benchmark runner:

with_llm is now passed through to apply_featcopilot, so --with-llm works correctly.
use_feature_cache dead parameter removed from run_single_benchmark() and CLI.
Time-series/forecast tasks now use TimeSeriesSplit (chronological ordering preserved); --n-seeds is a no-op for those tasks with a first-iteration guard.

690 tests pass.

The previous push (813a7d4) did not auto-trigger the Copilot reviewer even though the branch ruleset has review_on_push: true. This empty commit forces a fresh push event so the latest changes get reviewed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR upgrades FeatCopilot’s benchmarking workflow to be more publication-oriented (cross-validation + significance testing + real-world vs synthetic separation) and tightens the feature engineering/selection pipeline with a “do-no-harm” gate and stronger protections for original features.

Changes:

Add a held-out “do-no-harm” validation gate to AutoFeatureEngineer.fit_transform() and a public selector hook to enforce original-only fallback consistently.
Fix/adjust feature selection to preserve original features during redundancy elimination and apply stricter refinement thresholds.
Rework the simple-models benchmark to use k-fold CV + Wilcoxon testing and to report real-world vs synthetic results separately, with new CLI flags.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
tests/test_sklearn_compat.py	Adds unit tests for the do-no-harm gate and post-fallback transform consistency.
tests/test_selection.py	Adds coverage for `set_selected_features()` overriding downstream `transform()`.
featcopilot/transformers/sklearn_compat.py	Implements the do-no-harm gate and invokes it after feature selection in `fit_transform()`.
featcopilot/selection/unified.py	Ensures original features are preserved through redundancy filtering; tightens L1 refinement selection threshold.
featcopilot/selection/redundancy.py	Updates redundancy removal logic to never drop original features when correlated with derived features.
featcopilot/core/base.py	Adds `BaseSelector.set_selected_features()` public API for post-fit selection overrides.
benchmarks/simple_models/run_simple_models_benchmark.py	Switches to k-fold CV + Wilcoxon test; adds `--real-world`, `--n-folds`, `--n-seeds`, `--fast`; improves report structure.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md	Updates the benchmark snapshot/report format to match the new methodology and data-source split.
benchmarks/datasets.py	Introduces dataset source tagging and helper APIs (`is_real_world`, `list_real_world_datasets`, `list_synthetic_datasets`).
benchmarks/init.py	Re-exports dataset source constants/helpers via the benchmarks package API.
benchmarks/README.md	Updates benchmark documentation to reflect CV + significance testing + source-separated reporting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- sklearn_compat: gate exception path is now fail-closed — always logs a warning (not just verbose) and conservatively falls back to original- only columns + pins the selector via set_selected_features. - sklearn_compat: new _encode_for_gate helper ordinal-encodes scalar non-numeric columns (bool/category/datetime/object scalars) so signal in categorical originals contributes to the baseline. Object columns containing arrays / lists / dicts are skipped to avoid row-id-like codes fooling the gate model. - benchmarks: time-series seed handling now emits an explicit one-shot warning when n_seeds > 1 and the task is time-series, then runs a single pass and records the effective n_seeds (1) in the results instead of the misleading raw n_seeds. - tests: removed unused _make_engineered_frame helper; added coverage for the gate's exception fallback, categorical baseline encoding, unsupported-object skip, no-usable-columns no-op, ndarray X_original, regression target routing, mismatched-index alignment, verbose fall- back warning, plus redundancy / unified selector branch coverage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR upgrades FeatCopilot’s benchmarking workflow to cross-validated, statistically tested reporting (with real-world vs synthetic separation), and strengthens the feature-selection pipeline with a “do-no-harm” validation gate plus fixes to preserve original features.

Changes:

Add a held-out validation “do-no-harm” gate to AutoFeatureEngineer.fit_transform() (with encoding support for categorical/bool/datetime) and a public set_selected_features() hook for selectors.
Fix feature-selection behavior to never drop original features during redundancy elimination and tighten L1 refinement thresholds.
Rework the simple-model benchmark to use k-fold CV + Wilcoxon signed-rank significance testing, add dataset source tracking, and extend CLI/reporting.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`featcopilot/transformers/sklearn_compat.py`	Adds do-no-harm gate + gate encoding helper and wires gate into `fit_transform()`.
`featcopilot/core/base.py`	Exposes `BaseSelector.set_selected_features()` for post-fit selection overrides.
`featcopilot/selection/unified.py`	Ensures original features are preserved through redundancy filtering; tightens L1 refinement threshold.
`featcopilot/selection/redundancy.py`	Changes redundancy removal rules to never remove originals when correlated with derived features.
`tests/test_sklearn_compat.py`	Adds comprehensive unit tests for do-no-harm gate behavior and encoding edge cases.
`tests/test_selection.py`	Adds coverage for selection overrides, safety branch, and redundancy eliminator branches.
`benchmarks/simple_models/run_simple_models_benchmark.py`	Switches benchmark to CV + seeds + Wilcoxon; adds real-world filtering and richer report output.
`benchmarks/datasets.py`	Introduces dataset source tagging and helper APIs to list/filter real-world vs synthetic datasets.
`benchmarks/__init__.py`	Re-exports new dataset source helpers/constants.
`benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md`	Updates report snapshot and reproduction instructions.
`benchmarks/README.md`	Updates benchmark methodology and summarizes results with real-world vs synthetic separation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

This PR makes the benchmark suite statistically rigorous (k-fold CV with paired significance testing and source-aware reporting) and strengthens FeatCopilot’s feature selection pipeline with a fail-closed “do-no-harm” gate plus safer selection/refinement behaviors, while also modernizing the codebase to Python 3.10+ typing and tooling.

Changes:

Reworks simple_models benchmarks to use k-fold CV (with seeds), fold-local preprocessing (leakage-safe), Wilcoxon signed-rank testing, and real-world vs synthetic reporting.
Adds/extends feature selection safeguards: categorical protection in redundancy elimination, stricter L1 refinement, a do-no-harm validation gate (with gate_n_jobs), and selector pinning via BaseSelector.set_selected_features.
Updates project metadata/tooling for Python 3.10+ and expands test coverage for benchmark preprocessing/encoding and selection/gate behavior.

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
tests/test_sklearn_compat.py	Adds extensive end-to-end tests for the do-no-harm gate and `_encode_for_gate`, plus `gate_n_jobs` plumbing.
tests/test_selection.py	Adds tests for selector override behavior, categorical baseline scoring defaults, redundancy behavior, and deprecation warning coverage.
tests/test_benchmark_encoding.py	New tests for leakage-safe fold-local encoding/imputation and report bucketing/CLI validation.
pyproject.toml	Raises required Python to 3.10+, updates dev tool versions, and adjusts Black/Ruff/Mypy targets.
featcopilot/utils/validation.py	Modernizes typing (`X
featcopilot/utils/parallel.py	Modernizes typing and imports (`collections.abc.Callable`).
featcopilot/utils/models.py	Modernizes typing for cached model metadata helpers.
featcopilot/utils/cache.py	Modernizes typing for cache API and internal helpers.
featcopilot/transformers/sklearn_compat.py	Adds do-no-harm gate + `_encode_for_gate`, introduces `gate_n_jobs`, and updates typing/get_params.
featcopilot/stores/rule_store.py	Modernizes typing for rule store API.
featcopilot/stores/feast_store.py	Modernizes typing for Feast store integration APIs.
featcopilot/stores/base.py	Modernizes typing for base feature store interfaces.
featcopilot/selection/unified.py	Ensures original features aren’t dropped post-redundancy elimination; tightens L1 refinement thresholds.
featcopilot/selection/statistical.py	Modernizes typing and uses `zip(..., strict=True)` for safety.
featcopilot/selection/redundancy.py	Enforces “never drop originals” rules, deprecates `original_preference` with `FutureWarning`, and updates typing/docs.
featcopilot/selection/importance.py	Modernizes typing and minor internal dict initialization cleanup.
featcopilot/llm/transform_rule_generator.py	Modernizes typing across generator APIs.
featcopilot/llm/semantic_engine.py	Modernizes typing across engine config and methods.
featcopilot/llm/openai_client.py	Modernizes typing across OpenAI client config and APIs.
featcopilot/llm/litellm_client.py	Modernizes typing across LiteLLM client config and APIs.
featcopilot/llm/explainer.py	Modernizes typing across explainer APIs.
featcopilot/llm/copilot_client.py	Modernizes typing across Copilot client APIs.
featcopilot/llm/code_generator.py	Modernizes typing across code generator APIs.
featcopilot/engines/timeseries.py	Modernizes typing in time series engine APIs.
featcopilot/engines/text.py	Modernizes typing and minor dict initialization cleanup.
featcopilot/engines/tabular.py	Modernizes typing in tabular engine APIs.
featcopilot/engines/relational.py	Modernizes typing in relational engine APIs.
featcopilot/core/transform_rule.py	Modernizes typing in transform rule model/methods.
featcopilot/core/registry.py	Modernizes typing in registry getter APIs.
featcopilot/core/feature.py	Modernizes typing in feature and feature set structures.
featcopilot/core/base.py	Adds `BaseSelector.set_selected_features` hook and modernizes typing across base interfaces.
docs/getting-started/installation.md	Updates documented minimum Python version to 3.10+.
benchmarks/simple_models/run_simple_models_benchmark.py	Implements leakage-safe CV benchmarking + Wilcoxon testing + source-aware report generation + new CLI flags.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md	Updates the benchmark report snapshot to new format/content.
benchmarks/datasets.py	Adds dataset source tracking and helpers for real-world vs synthetic subsets.
benchmarks/compare_tools/run_fe_tools_comparison.py	Modernizes typing in FE tools comparison benchmark utilities.
benchmarks/automl/run_automl_benchmark.py	Modernizes typing and minor strict zip safety for column sanitization.
benchmarks/init.py	Re-exports new dataset source helpers/constants.
benchmarks/README.md	Updates benchmark methodology and summary reporting to match new scientific workflow.
README.md	Updates documented minimum Python version to 3.10+.
AGENTS.md	Updates repo contributor guidance to Python 3.10+ conventions.
.pre-commit-config.yaml	Updates hook revisions and Ruff hook configuration.
.github/workflows/tests.yml	Updates CI test matrix to Python 3.10–3.13.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Four findings, all valid: 1. `RedundancyEliminator` accepted but did not document the deprecated `original_preference` parameter. Added a clearly-marked "**Deprecated, accepted for backward compatibility only.**" entry to the Parameters section so callers understand why it's accepted, that it has no effect, and that any non-None value triggers a `FutureWarning`. 2. `preprocess_X_train_test` would leave NaNs in the output if a numeric column was entirely missing in the train fold: `median()` returns NaN and `fillna(NaN)` is a no-op, contradicting the docstring's "all numeric, no NaN" guarantee and breaking downstream model fitting (`Input X contains NaN`). Added a guard that falls back to the `0.0` sentinel when `train_median` is NaN — the column becomes a constant (useless but safe). 3. `preprocess_target` only label-encoded `object` and `category` targets. Pandas `string`/`StringDtype` (and other non-numeric extension dtypes) would fall through and return a non-numeric ndarray that breaks sklearn estimators. Switched the condition to `pd.api.types.is_numeric_dtype(y) or pd.api.types.is_bool_dtype(y)` so all non-numeric, non-bool targets get encoded uniformly. 4. `warnings.filterwarnings("ignore")` in `main()` mutated the global warnings filter and was never restored. Tests and other programmatic callers of `main()` would have warnings silently suppressed for the rest of the process. Moved the call under the `if __name__ == "__main__"` guard so it only triggers from the CLI entrypoint, leaving `main()` itself side-effect free. Tests added (all 5 in `tests/test_benchmark_encoding.py`): - `test_preprocess_X_train_test_handles_all_missing_train_column` - `test_preprocess_target_handles_pandas_string_dtype` - `test_preprocess_target_passes_through_numeric_classification` - `test_preprocess_target_passes_through_bool_classification` - `test_main_does_not_mutate_global_warnings_filter` Verified: pre-commit clean, full suite 753 tests passing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two findings on commit 61f0e9c, both valid: 1. `run_single_benchmark` discarded `loader_name` (4th element of `load_dataset`'s return tuple) and derived `source` solely from `is_real_world(dataset_name)` (registry key). For loaders that sometimes fetch real data and sometimes fall back to a synthesized dataset (the historical pattern for Kaggle/OpenML/HF loaders in `benchmarks/datasets.py`), this would mislabel a synthetic-fallback run as `real_world` and skew the real-world vs synthetic split. Added a new `_infer_source(dataset_name, loader_name)` helper that downgrades to `synthetic` when the loader's canonical name carries a synthetic-fallback marker (`synthetic` or `-style`, case-insensitive), even if the registry tags the dataset as real-world. Wired it into `run_single_benchmark` so both the printed Source line and the stored `results["source"]` reflect the actual data that was loaded. Also started persisting `loader_name` in the result dict for traceability. 2. `generate_report()` treated any result without `source == "real_world"` as synthetic, including legacy cached results loaded via `--report-only` that pre-date the `source` field. All such datasets would silently get bucketed as synthetic, producing a misleading report. Added a `_resolve_source(result)` helper that prefers the explicit `source` field when present and falls back to `is_real_world(result["dataset"])` when missing. Switched both the `real_world` and `synthetic` filters in `generate_report` to use it. This is strictly more accurate than the previous behaviour for stable real-world datasets (INRIA, `fake_news`); the only remaining gap — detecting a historical synthetic-fallback event for a legacy result — isn't fixable from the cache alone because `loader_name` wasn't persisted, so we accept it. Tests added (6 in `tests/test_benchmark_encoding.py`): - `test_infer_source_synthetic_dataset_always_synthetic` - `test_infer_source_real_world_no_marker_returns_real_world` - `test_infer_source_real_world_with_synthetic_fallback_marker` (covers `(synthetic)`, `(HF-style)`, case-insensitive variants) - `test_resolve_source_uses_explicit_field_when_present` - `test_resolve_source_falls_back_to_registry_for_legacy_results` (covers known real_world / synthetic / unknown / missing dataset) - `test_generate_report_buckets_legacy_results_via_registry` (end-to-end: generate report from rows without `source`, verify both buckets are populated correctly via the registry) Verified: pre-commit clean, full suite 759 tests passing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address Copilot round-23 review on commit d6db161 (1 finding, valid). The `original_preference` parameter was originally between `original_features` and `verbose` in `RedundancyEliminator.__init__` (see commit ea3be9b for the legacy signature). Round 17 (b0fc56e) re-introduced it as an explicit deprecated parameter but appended it **after** `verbose`, silently breaking the positional contract: # Legacy positional caller intent: RedundancyEliminator(0.95, "pearson", scores, originals, 0.5, True) # ^pref ^verbose After round 17, positions 5 and 6 had been swapped, so: - `0.5` would bind to `verbose` (truthy bool) — wrong behavior - `True` would bind to `original_preference` — silent FutureWarning for code that never asked to be deprecated Fix: move `original_preference` back between `original_features` and `verbose`, restoring the legacy positional order. Update the Parameters docstring to document the slot is intentional. Add a regression test (`test_positional_argument_order_matches_legacy`) that: 1. Uses `inspect.signature` to assert the positional order matches the legacy contract — fails loudly if the order is ever changed again. 2. End-to-end smoke: passing all-positional in the legacy order binds `verbose=True` cleanly with no `FutureWarning`. Verified: pre-commit clean, full suite 760 tests passing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Two related bugs surfaced when re-running the full `--all` benchmark to refresh the PR description metrics: 1. After ~3 hours benchmarking 63 datasets, ``save_cache`` raised ``TypeError: Object of type bool is not JSON serializable`` on the ``significant`` field. ``significant = p_value < 0.05`` returns ``numpy.bool_`` (since ``p_value`` is a numpy float from scipy), and the default ``json.JSONEncoder`` doesn't accept ``numpy.bool_`` on NumPy <2 because it isn't a Python ``bool`` subclass for json's purposes. The existing converter handled ``np.floating`` / ``np.integer`` / ``np.ndarray`` / nested dicts but missed ``np.bool_``. Added an explicit ``np.bool_`` cast (matters: it must come BEFORE the ``np.integer`` check on NumPy <2 where ``np.bool_`` was a subclass of ``np.generic`` but not ``np.integer``) and extended the nested ``dict`` converter to handle ``np.bool_`` too. 2. When ``save_cache`` raised, the in-memory results were lost — the markdown report had not been written yet because ``main()`` called ``save_cache`` BEFORE ``generate_report``. After 3 hours of compute, one TypeError discarded everything. Swapped the order so the report is written first; cache failure now only loses the JSON cache, not the markdown summary. Tests added (`tests/test_benchmark_encoding.py`): - `test_save_cache_handles_numpy_bool_in_significant_field` — directly exercises the previously-failing path with ``np.bool_(True)`` in both a top-level field and a nested dict, asserts the JSON round-trip produces native Python ``bool``s. - `test_main_generates_report_even_when_save_cache_would_fail` — patches ``save_cache`` to always raise and asserts ``main()`` still writes the markdown report so a long benchmark run isn't fully wasted. Verified: pre-commit clean, full suite 762 tests passing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The benchmark's ``--with-llm`` mode was silently falling back to mock LLM responses (returning hard-coded ``{col1}_{col2}_ratio``-style features) because ``featcopilot.llm.copilot_client`` was incompatible with the current ``github-copilot-sdk`` API. Three breaking changes had accumulated: 1. **``create_session`` is now keyword-only and requires ``on_permission_request``** (since SDK 0.1.32). The old code passed a positional dict, which raised ``TypeError: takes 1 positional argument but 2 were given`` — caught by the ``except Exception`` block and silently mocked. 2. **``session.send`` was redesigned in SDK 0.3.x** to take a positional ``str`` and not the old ``{"prompt": ...}`` dict, AND a new ``send_and_wait`` helper now handles the send → wait-for-idle → collect-assistant-message pattern internally. The old hand-rolled ``send + asyncio.Event(idle)`` loop produced empty responses. 3. **``session.destroy()`` was deprecated in SDK 0.3.x**, replaced with ``session.disconnect()``. ``destroy()`` still works but emits a ``DeprecationWarning`` on every test run. Together (1) was the catastrophic failure: every benchmark with ``--with-llm`` was running on mock features. The benchmark log header correctly reported ``LLM enabled: True`` so the failure was invisible to anyone not specifically looking for ``WARNING - copilot_client.py`` fallback messages. Fixes: - ``start()`` imports ``PermissionHandler`` from ``copilot.session`` (it isn't re-exported from the top-level ``copilot`` package) and passes ``on_permission_request=PermissionHandler.approve_all`` plus ``model`` and ``streaming`` as keyword arguments to ``create_session``. - ``send_prompt()`` calls the new ``session.send_and_wait(prompt, timeout=...)``, which returns ``SessionEvent | None``; reads ``event.data.content`` for the assistant text. - ``stop()`` prefers ``session.disconnect()`` when available, falling back to ``session.destroy()`` for older SDKs. Tests added (``tests/test_llm_modules.py``): - ``test_start_passes_on_permission_request_to_create_session`` — uses a fake ``copilot`` module to assert the new kwargs contract is exercised; would catch a regression to the old positional-dict call. - ``test_send_prompt_uses_send_and_wait`` — asserts the prompt is passed as a positional ``str`` and the SDK's ``send_and_wait`` helper is used. - ``test_stop_prefers_disconnect_over_destroy`` — pins the new preferred teardown path. - ``test_stop_falls_back_to_destroy_for_old_sdk`` — exercises the graceful fallback for SDKs that pre-date ``disconnect``. Verified: - Live ``send_prompt`` round-trip returns a real LLM response (not ``{col1}_{col2}_ratio`` mock output). - All 214 pre-existing ``test_llm_modules.py`` tests still pass; +4 new. - Pre-commit clean (black, ruff). - Existing ``DeprecationWarning: destroy() is deprecated`` warning no longer surfaces in the test suite output. Note: this only fixes the LLM-engine path. ``--with-llm`` benchmarks were producing misleading numbers prior to this fix; the PR description metrics will be regenerated on a fresh ``--with-llm`` run with the real LLM connected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Regenerate `SIMPLE_MODELS_BENCHMARK_LLM.md` and `SIMPLE_MODELS_CACHE_LLM.json` from a fresh `--all --with-llm` run on the current code (`e9be75f`). Prior to commit `e9be75f`, the `--with-llm` mode silently fell back to mock LLM responses (incompatible with `github-copilot-sdk >= 0.1.32`'s mandatory `on_permission_request` kwarg). The previously-committed `SIMPLE_MODELS_BENCHMARK_LLM.md` from 2026-02-25 was generated under the old single-split methodology AND with the broken LLM connection, making it neither comparable to the new 5-fold CV reports nor accurate about the LLM engine's contribution. This run uses: - 5-fold stratified CV × 1 seed (matches the no-LLM benchmark report) - Real Copilot LLM (`gpt-5.2`, no mock fallbacks; verified by 0 `Could not connect to Copilot` warnings across the entire run) - All 63 datasets (31 real-world + 32 synthetic) Headline metrics (full numbers in the report file): - **Real-world (31)**: Win 5 / Tie 24 / Loss 2; max regression -1.17% on `covertype_cat` (not statistically significant at p<0.05). The apparent -129.52% on `delays_zurich` is a near-zero-baseline (R² ≈ 0.009) artifact and is also non-significant (p=1.0). - **Synthetic (32)**: Win 20 / Tie 11 / Loss 1; mean +13.70%; best +104.34% on `triple_interaction_regression`. - Overall mean: +6.32%; median: +0.16%. The PR description has been updated to cite these regenerated metrics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Address Copilot review on commit e9be75f. The previous converter used ``float(v)`` for both ``np.floating`` and ``np.integer``, turning counts like ``n_samples=5000`` into ``5000.0`` in the cache and silently losing precision past ``2**53`` for large integer counts. Fix: split ``np.integer`` -> ``int(v)`` from ``np.floating`` -> ``float(v)``, and apply the same distinction inside the nested-dict conversion. Refactored both top-level and nested-dict conversion through a single ``_to_native`` helper to eliminate the duplicated isinstance ladder and avoid future drift between the two paths (the previous code had to repeat the ``np.bool_`` / ``np.floating`` / ``np.integer`` ladder verbatim in the nested-dict branch — easy to get wrong, and we already had this happen with ``np.bool_`` itself). The ordering ``np.bool_`` -> ``np.integer`` -> ``np.floating`` is intentional and preserved: on NumPy <2.0 ``np.bool_`` is a subtype of ``np.generic`` but not ``np.integer``, so the ``np.bool_`` check must come first to avoid the ``json`` encoder later raising ``TypeError: Object of type bool is not JSON serializable`` (the bug fixed in commit 6c21494). Tests added (`tests/test_benchmark_encoding.py`): - `test_save_cache_preserves_int_fidelity_for_np_integer` — exercises ``np.int32`` / ``np.int64`` count fields plus a ``2**60+1`` value that would lose precision through float64; asserts on-disk JSON has no decimal point on int fields and the round-trip preserves ``isinstance(int)``. Also covers nested dict mixed-type case. Verified: pre-commit clean, full suite 767 tests passing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

thinkall · 2026-05-02T23:22:58Z

@copilot please review the entire PR (not just the latest commit). The PR has accumulated many changes across multiple iterations and the review should consider the full diff against main.

Copilot

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address Copilot review on commit d501114. Round 24 restored the ``original_preference`` parameter in its legacy positional slot but inadvertently dropped ``self.original_preference = original_preference`` from ``__init__``. Legacy callers that read the attribute after construction (for logging, hyperparameter introspection, or sklearn's ``get_params`` / ``set_params`` round-trip) would crash with ``AttributeError``. Fix: re-add ``self.original_preference = original_preference``, with an explanatory comment that the attribute is preserved purely for read-only backward compatibility — the value has no effect on behavior because originals are categorically protected from removal. Tests added (`tests/test_selection.py`): - `test_original_preference_attribute_preserved_for_legacy_readers` — asserts the attribute exists with the default ``None`` after a zero-arg construction, and reflects the caller-provided value when one is explicitly passed. Verified: pre-commit clean, redundancy + selection test set passing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

thinkall · 2026-05-02T23:40:02Z

@copilot please review the entire PR (not just the latest commit).

thinkall · 2026-05-02T23:58:49Z

@copilot please review the entire PR (head commit 085371e). Earlier review request did not trigger the workflow.

Copilot

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address Copilot review on commit 085371e. ``send_prompt`` previously only caught the built-in ``TimeoutError``. While that matches what the current ``github-copilot-sdk`` raises, two adjacent exception classes can also surface from async timeouts: 1. ``asyncio.TimeoutError`` — was a separate exception class on Python 3.10 (only became an alias for the built-in ``TimeoutError`` in 3.11+). On 3.10, an ``asyncio.wait_for`` timeout inside the SDK would NOT be caught by ``except TimeoutError`` and would propagate to callers as an unhandled crash. ``pyproject.toml`` declares ``requires-python = ">=3.10"``, so 3.10 is in scope. 2. ``concurrent.futures.TimeoutError`` — what some thread/process bridges (e.g., ``run_in_executor`` chains) surface; cheap to include and prevents the timeout handler from being bypassed. Fix: catch ``(TimeoutError, asyncio.TimeoutError, concurrent.futures.TimeoutError)``. Added ``import concurrent.futures`` at the module level. Inline comment documents why each class is included. Tests added (`tests/test_llm_modules.py`): - ``test_send_prompt_handles_builtin_timeout_error`` — current SDK behaviour (already worked). - ``test_send_prompt_handles_asyncio_timeout_error`` — Python-3.10 scenario; would have failed before this commit. - ``test_send_prompt_handles_concurrent_futures_timeout_error`` — thread-bridge scenario; cheap defensive coverage. All three patch ``send_and_wait`` with ``side_effect=<TimeoutError>`` and assert the sentinel ``"timed out"`` message is returned, not an unhandled exception. Verified: pre-commit clean, 7 ``TestCopilotClientSDKCompatibility`` tests pass (4 prior + 3 new). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

thinkall · 2026-05-03T01:07:03Z

@copilot please review the entire PR (not just the latest commit).

Copilot

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address Copilot round-30 review on commit 3fdce5e (2 findings). 1. **`preprocess_data()` only handled `object`/`category`** The legacy `preprocess_data()` filtered with `select_dtypes(include=["object", "category"])`, which silently leaves pandas `string`/`StringDtype` columns unencoded. Callers that rely on the helper's "model-ready" output contract would get a non-numeric frame back and crash inside sklearn. Fix: switch to `select_dtypes(exclude=[np.number, "bool", "boolean"])` so the legacy path mirrors the production `_label_encode_non_numeric` contract — anything that isn't numeric or a bool gets label-encoded, including `StringDtype`, ArrowString, and any future non-numeric extension dtype. 2. **`significant = p_value < 0.05` returned `numpy.bool_`** `p_value` is a NumPy scalar (`scipy.stats.wilcoxon` output), so the bare comparison yields `numpy.bool_`. That makes the in-memory results dict non-JSON-serializable by default, and forced `save_cache`'s `_to_native` helper to special-case it (the bug chain that crashed a 3-hour run before commit 6c21494). Fix: cast at the source — `significant = bool(p_value < 0.05)`. Inline comment notes the rationale so a future refactor doesn't silently revert it. Tests added (`tests/test_benchmark_encoding.py`): - `test_preprocess_data_encodes_pandas_string_dtype` — exercises a frame with mixed `object`, `category`, `string`/`StringDtype`, and `bool` columns; asserts `string_ext` ends up integer-typed (encoded) and bool stays bool. - `test_significant_field_is_native_python_bool` — runs `run_single_ benchmark` end-to-end on a tiny synthetic dataset (with `load_dataset` patched so the test stays fast) and asserts `type(results["significant"]) is bool` and `json.dumps({"significant": sig})` succeeds without a custom encoder. Verified: pre-commit clean, full suite 773 tests passing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

thinkall · 2026-05-03T03:21:36Z

@copilot please review the entire PR (not just the latest commit).

thinkall · 2026-05-03T04:00:05Z

@copilot please review the entire PR (head 1e82335). Earlier review request did not trigger the workflow.

Copilot

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

thinkall · 2026-05-03T04:26:24Z

@copilot please review the entire PR (not just the latest commit).

Copilot

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

thinkall and others added 7 commits April 16, 2026 15:03

chatgpt-codex-connector Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread featcopilot/transformers/sklearn_compat.py Outdated

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

thinkall requested a review from Copilot April 30, 2026 15:30

Copilot started reviewing on behalf of thinkall April 30, 2026 15:30 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

thinkall and others added 2 commits April 30, 2026 23:42

Merge branch 'main' into improve/benchmark-and-feature-selection

5219b88

Copilot AI review requested due to automatic review settings April 30, 2026 15:52

Copilot AI reviewed Apr 30, 2026

View reviewed changes

Comment thread featcopilot/transformers/sklearn_compat.py

Comment thread featcopilot/transformers/sklearn_compat.py Outdated

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

Copilot started work on behalf of thinkall April 30, 2026 16:21 View session

Copilot finished work on behalf of thinkall April 30, 2026 16:25

thinkall requested a review from Copilot April 30, 2026 23:41

Copilot started reviewing on behalf of thinkall April 30, 2026 23:42 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

Comment thread featcopilot/transformers/sklearn_compat.py Outdated

Comment thread featcopilot/transformers/sklearn_compat.py Outdated

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

Comment thread tests/test_sklearn_compat.py Outdated

thinkall and others added 2 commits May 1, 2026 08:04

chore: trigger Copilot re-review for round 3 fixes

aa5fb57

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

thinkall requested a review from Copilot May 1, 2026 02:49

Copilot started reviewing on behalf of thinkall May 1, 2026 02:49 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

Comment thread featcopilot/selection/redundancy.py

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

Copilot AI reviewed May 1, 2026

View reviewed changes

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

Copilot AI reviewed May 1, 2026

View reviewed changes

Comment thread featcopilot/selection/redundancy.py Outdated

Copilot AI reviewed May 2, 2026

View reviewed changes

jianglibigdata approved these changes May 2, 2026

View reviewed changes

thinkall and others added 2 commits May 2, 2026 12:12

Copilot AI reviewed May 2, 2026

View reviewed changes

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

thinkall and others added 2 commits May 3, 2026 01:20

Copilot AI reviewed May 2, 2026

View reviewed changes

Comment thread featcopilot/selection/redundancy.py

Copilot AI reviewed May 3, 2026

View reviewed changes

Comment thread featcopilot/llm/copilot_client.py Outdated

Copilot AI reviewed May 3, 2026

View reviewed changes

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

Copilot AI reviewed May 3, 2026

View reviewed changes

Comment thread .pre-commit-config.yaml

Copilot AI reviewed May 3, 2026

View reviewed changes

thinkall mentioned this pull request May 3, 2026

chore: bump version to 0.3.7 #4

Merged

Conversation

thinkall commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark Infrastructure

FeatCopilot Feature Selection Improvements

Results (5-fold CV, FeatCopilot tabular + LLM engines, 63 datasets)

Testing

Uh oh!

codecov-commenter commented Apr 17, 2026 • edited by codecov Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thinkall commented Apr 30, 2026

Uh oh!

Copilot AI commented Apr 30, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

thinkall commented Apr 17, 2026 •

edited

Loading

codecov-commenter commented Apr 17, 2026 •

edited by codecov Bot

Loading