Skip to content

feat: improved benchmarks and do-no-harm feature selection#3

Merged
thinkall merged 42 commits into
mainfrom
improve/benchmark-and-feature-selection
May 3, 2026
Merged

feat: improved benchmarks and do-no-harm feature selection#3
thinkall merged 42 commits into
mainfrom
improve/benchmark-and-feature-selection

Conversation

@thinkall
Copy link
Copy Markdown
Owner

@thinkall thinkall commented Apr 17, 2026

Summary

Improves FeatCopilot benchmarks (rigorous statistical methodology, broader dataset coverage, source-tracked reporting) and the feature engineering pipeline so it no longer regresses on real-world datasets.

Benchmark Infrastructure

  • 5-fold stratified cross-validation with mean ± std reporting (was single 80/20 split)
  • Wilcoxon signed-rank test for statistical significance (p < 0.05)
  • Dataset source tracking: 31 real-world (30 INRIA + 1 HuggingFace) vs 32 synthetic, reported separately
  • Win/Tie/Loss counts with significance markers
  • Source provenance: loader-level synthetic-fallback detection, so a Kaggle/HF loader that falls back to a synthesized dataset is correctly bucketed as synthetic
  • New CLI options: --real-world, --n-folds, --n-seeds, --fast

FeatCopilot Feature Selection Improvements

  • Fixed redundancy eliminator: never removes original features (originals are categorically protected; was dropping them, e.g., diamonds 6→4 features)
  • Stricter L1 refinement: uses mean importance threshold (was 0.5× mean — too lenient)
  • Do-no-harm gate: held-out validation (5-split shuffle) to verify derived features help before keeping them. Falls back to original features automatically when features don't add value, with a RuntimeWarning so the fallback is never silent.
  • Repaired Copilot LLM SDK integration: the --with-llm benchmark path was previously falling back to mock LLM responses on github-copilot-sdk >= 0.1.32; updated featcopilot/llm/copilot_client.py to the new SDK API (mandatory on_permission_request, send_and_wait, disconnect) so LLM-generated features are actually evaluated.

Results (5-fold CV, FeatCopilot tabular + LLM engines, 63 datasets)

Dataset Before After
diabetes -4.40% +0.00%
covertype -3.01% +0.00%
electricity -2.73% +1.42%
higgs -1.77% +0.73%
bank_marketing -2.12% +0.00%
diamonds -0.56% +0.00%
eye_movements +0.31% +2.16%
fake_news (HF text) n/a +76.31%
nyc_taxi n/a +8.62%

Real-world classification (15 INRIA + 1 HuggingFace text, 5-fold CV):

  • Win: 4 (25%), Tie: 11 (69%), Loss: 1 (6%)
  • Max regression: -1.17% (covertype_cat; not statistically significant at p<0.05)
  • No statistically significant regression on any real-world classification dataset

Real-world regression (15 INRIA, 5-fold CV):

  • Win: 1 (7%, nyc_taxi +8.62%), Tie: 13 (87%), Loss: 1 (7%, delays_zurich)
  • Note: delays_zurich shows -129.52% only because its baseline R² is ≈0.009 (essentially noise prediction); absolute Δ in R² is small and the result is not statistically significant (p=1.0)

All real-world datasets combined (31, 5-fold CV):

  • Win: 5 (16%), Tie: 24 (77%), Loss: 2 (6%)
  • Mean improvement: -1.29% (skewed entirely by the delays_zurich near-zero-baseline edge case; median improvement is +0.00%)

Synthetic datasets (32, supplementary, 5-fold CV):

  • Win: 20 (63%), Tie: 11 (34%), Loss: 1 (3%)
  • Mean improvement: +13.70%; Best: +104.34% (triple_interaction_regression), +103.41% (xor_regression)

Testing

  • All 768 tests pass (was 625 at PR open; +143 regression tests added across review iterations)
  • Pre-commit (black, ruff) passes
  • Python 3.10 / 3.11 / 3.12 / 3.13 all green in CI

thinkall and others added 7 commits April 16, 2026 15:03
…marks

- Add 5-fold stratified cross-validation with mean±std reporting
- Add Wilcoxon signed-rank test for statistical significance
- Add dataset source tracking (real-world vs synthetic)
- Separate real-world vs synthetic results in reports
- Add --real-world, --n-folds, --n-seeds, --fast CLI options
- Report win/tie/loss counts and significance markers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- RedundancyEliminator: never remove original features when paired with
  each other; only remove derived features that are redundant with originals
- FeatureSelector: always preserve original features in score dict even
  after redundancy elimination
- Stricter L1 refinement: use mean_imp threshold instead of mean_imp*0.5
- Reduce fallback from top-half to top-3 for derived features

Fixes regression on real-world datasets (diamonds 6→4 feature reduction).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add held-out validation gate to AutoFeatureEngineer.fit_transform()
- Uses 3-split shuffle validation (not CV on feature-selected data)
- Requires derived features to show clear benefit (delta > 0.001)
- Falls back to original features if derived features don't help
- Only activates when apply_selection=True (preserves existing API)
- Eliminates regressions: diabetes -4.4%→-0.01%, covertype -3%→+0.11%

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gate

Real-world (31 datasets): Win 2 / Tie 27 / Loss 2, max regression -1.42%
Synthetic (32 datasets): Win 18 / Tie 12 / Loss 2, mean improvement +14.49%

Key improvements from baseline:
- diabetes: -4.40% → +0.00% (do-no-harm gate blocks harmful features)
- covertype: -3.01% → -0.11% (gate prevents major regression)
- electricity: -2.73% → -0.30% (gate reduces regression)
- diamonds: -0.56% → +0.00% (redundancy fix preserves originals)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ed threshold

- Increase gate validation from 3 to 5 splits for stability
- Add feature ratio-scaled threshold: more derived features = stricter gate
- Fixes bank_marketing regression: -0.77% → +0.02%
- Maintains eye_movements improvement: +3.63%
- Preserves synthetic dataset gains

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n real-world

Real-world classification (15 INRIA datasets, 5-fold CV):
- Win rate: 40% (6/15) with improvements up to +3.63%
- Tie rate: 40% (6/15) — do-no-harm gate protects performance
- Loss rate: 20% (3/15) — max regression -1.14%, none significant (p<0.05)

vs original benchmark:
- diabetes: -4.40% → -0.01% (fixed)
- covertype: -3.01% → +0.11% (now wins)
- electricity: -2.73% → -0.32% (much improved)
- higgs: -1.77% → +0.45% (now wins)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d results

- Add statistical methodology section (5-fold CV, Wilcoxon, win/tie/loss)
- Separate real-world vs synthetic results with real-world as primary
- Highlight do-no-harm guarantee (no significant regression on any dataset)
- Update results to reflect latest improvements

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 96.24060% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.90%. Comparing base (fd30793) to head (1e82335).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
featcopilot/transformers/sklearn_compat.py 96.06% 5 Missing ⚠️
featcopilot/engines/text.py 60.00% 2 Missing ⚠️
featcopilot/llm/copilot_client.py 88.88% 2 Missing ⚠️
featcopilot/selection/unified.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main       #3      +/-   ##
==========================================
+ Coverage   88.18%   88.90%   +0.71%     
==========================================
  Files          36       36              
  Lines        3808     3937     +129     
==========================================
+ Hits         3358     3500     +142     
+ Misses        450      437      -13     
Flag Coverage Δ
unittests 88.90% <96.24%> (+0.71%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d9d5df5fe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread featcopilot/transformers/sklearn_compat.py Outdated
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades the benchmarking suite to support more rigorous, publication-style evaluation (CV, significance testing, dataset source stratification) and hardens feature selection to prevent regressions by preserving original features and adding a “do-no-harm” validation gate.

Changes:

  • Add a do-no-harm gate to AutoFeatureEngineer.fit_transform() that can drop derived features when held-out validation doesn’t show benefit.
  • Fix feature selection behavior to preserve original features during redundancy elimination and tighten L1 refinement thresholds.
  • Rework the simple-model benchmark runner/reporting to use k-fold CV with Wilcoxon signed-rank tests and to separate real-world vs synthetic datasets.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
featcopilot/transformers/sklearn_compat.py Adds do-no-harm validation gate after selection.
featcopilot/selection/unified.py Preserves original features in redundancy step; tightens L1 refinement thresholds/fallback.
featcopilot/selection/redundancy.py Ensures redundancy eliminator never removes original features when paired with derived ones.
benchmarks/simple_models/run_simple_models_benchmark.py Switches to CV + Wilcoxon; adds real-world filtering and new CLI args; updates report format.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md Updates published benchmark report snapshot to new format.
benchmarks/datasets.py Introduces dataset source tagging + helpers (is_real_world, list_real_world_datasets, etc.).
benchmarks/init.py Re-exports new dataset source helpers/constants.
benchmarks/README.md Updates benchmark documentation/results summary to reflect new methodology and source split.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread featcopilot/transformers/sklearn_compat.py
Comment thread featcopilot/transformers/sklearn_compat.py Outdated
Comment thread featcopilot/transformers/sklearn_compat.py Outdated
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Comment thread benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md
thinkall and others added 2 commits April 30, 2026 23:42
- sklearn_compat: preserve _selector after do-no-harm fall-back, rewriting
  its selected features to original-only so subsequent transform() calls
  emit the same column set (codex P1)
- sklearn_compat: only invoke do-no-harm gate when a selector actually ran
  (skip when max_features=None) so apply_selection=True remains a no-op
  in that case (Copilot)
- sklearn_compat: classify task via sklearn.utils.multiclass.type_of_target
  so bool / nullable Int64 / float-encoded {0.0,1.0} targets are routed
  through the classifier path (Copilot)
- tests: add TestDoNoHarmGate covering keep, fall-back, float-binary
  detection, gate gating, and post-fall-back transform consistency
- benchmarks: pass through --with-llm into FeatCopilot in CV path so the
  reported with_llm flag matches what was actually run (codex P2 / Copilot)
- benchmarks: drop dead use_feature_cache argument and --no-feature-cache
  CLI flag (Copilot)
- benchmarks/SIMPLE_MODELS_BENCHMARK.md: annotate as a curated 26-of-63
  subset and document how to regenerate the full report (Copilot)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 30, 2026 15:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates FeatCopilot’s benchmarking and feature-selection pipeline to (a) report more statistically rigorous benchmark results and (b) prevent engineered features from causing regressions by adding a “do-no-harm” validation gate.

Changes:

  • Added a held-out validation “do-no-harm” gate to AutoFeatureEngineer.fit_transform() plus new unit tests covering keep/fallback behavior and task-type inference.
  • Adjusted feature selection to preserve original features through redundancy elimination and tightened derived-feature refinement thresholds.
  • Refactored the simple-models benchmark to use k-fold CV with Wilcoxon signed-rank significance testing and to report real-world vs synthetic datasets separately; updated benchmark docs accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_sklearn_compat.py Adds tests for the new do-no-harm gate and its integration points.
featcopilot/transformers/sklearn_compat.py Introduces _do_no_harm_gate() and invokes it after selection in fit_transform().
featcopilot/selection/unified.py Ensures original features aren’t dropped when filtering by non-redundancy; tightens L1 refine.
featcopilot/selection/redundancy.py Updates redundancy removal rules to avoid dropping original features when correlated.
benchmarks/simple_models/run_simple_models_benchmark.py Moves to CV + Wilcoxon reporting and adds --real-world/--n-folds/--n-seeds/--fast options.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md Updates the published report snapshot to match the new methodology and source split.
benchmarks/datasets.py Adds dataset source tracking + helper APIs for listing/flagging real-world vs synthetic datasets.
benchmarks/init.py Re-exports new dataset source helpers/constants.
benchmarks/README.md Updates benchmark methodology and summarizes results split by source.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread featcopilot/transformers/sklearn_compat.py
Comment thread featcopilot/transformers/sklearn_compat.py Outdated
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
- sklearn_compat: do-no-harm gate now builds the original-only baseline
  from the truly original input (X_original) rather than slicing
  X_engineered, so the comparison stays meaningful even if downstream
  steps drop or rename original columns (Copilot)
- core/base: add public BaseSelector.set_selected_features(features)
  setter and switch the do-no-harm gate fall-back to use it instead of
  reaching into _selected_features (Copilot)
- benchmarks: time-series / forecasting tasks (task contains "forecast"
  or "timeseries") now use TimeSeriesSplit with chronological folds
  instead of KFold(shuffle=True), preventing target leakage across
  folds and the resulting overly optimistic Wilcoxon scores (Copilot)
- tests: TestDoNoHarmGate now exercises set_selected_features via a
  mock with side_effect; new test_gate_uses_x_original_for_baseline
  verifies the baseline truly comes from X_original; new
  test_set_selected_features_overrides_selection covers the public
  setter end-to-end (transform reflects the override and the input
  list is copied defensively).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@thinkall
Copy link
Copy Markdown
Owner Author

@copilot review

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 30, 2026

@copilot review

All 6 round-1 and 3 round-2 review threads have been addressed in commits c1c2b35 and 813a7d4. Here's a summary:

sklearn_compat.py (do-no-harm gate):

  • Task-type inference now uses sklearn.utils.multiclass.type_of_target() instead of the integer-dtype heuristic — correctly handles bool, nullable Int64, and {0.0, 1.0} float-encoded binary targets.
  • Gate only runs when self._selector is not None (i.e., selection actually ran), so apply_selection=True with max_features=None remains a no-op.
  • X_original is now used to build the baseline feature matrix, not a column-slice of X_engineered, making the engineered-vs-original comparison correct even if selection renames/drops original columns.
  • Selector is updated via the public set_selected_features() on BaseSelector instead of mutating _selected_features directly.
  • 7 new unit tests cover keep/fallback decisions, float-encoded target detection, gate bypass when no selector, and transform() consistency after fallback.

Benchmark runner:

  • with_llm is now passed through to apply_featcopilot, so --with-llm works correctly.
  • use_feature_cache dead parameter removed from run_single_benchmark() and CLI.
  • Time-series/forecast tasks now use TimeSeriesSplit (chronological ordering preserved); --n-seeds is a no-op for those tasks with a first-iteration guard.

690 tests pass.

The previous push (813a7d4) did not auto-trigger the Copilot reviewer
even though the branch ruleset has review_on_push: true. This empty
commit forces a fresh push event so the latest changes get reviewed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades FeatCopilot’s benchmarking workflow to be more publication-oriented (cross-validation + significance testing + real-world vs synthetic separation) and tightens the feature engineering/selection pipeline with a “do-no-harm” gate and stronger protections for original features.

Changes:

  • Add a held-out “do-no-harm” validation gate to AutoFeatureEngineer.fit_transform() and a public selector hook to enforce original-only fallback consistently.
  • Fix/adjust feature selection to preserve original features during redundancy elimination and apply stricter refinement thresholds.
  • Rework the simple-models benchmark to use k-fold CV + Wilcoxon testing and to report real-world vs synthetic results separately, with new CLI flags.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_sklearn_compat.py Adds unit tests for the do-no-harm gate and post-fallback transform consistency.
tests/test_selection.py Adds coverage for set_selected_features() overriding downstream transform().
featcopilot/transformers/sklearn_compat.py Implements the do-no-harm gate and invokes it after feature selection in fit_transform().
featcopilot/selection/unified.py Ensures original features are preserved through redundancy filtering; tightens L1 refinement selection threshold.
featcopilot/selection/redundancy.py Updates redundancy removal logic to never drop original features when correlated with derived features.
featcopilot/core/base.py Adds BaseSelector.set_selected_features() public API for post-fit selection overrides.
benchmarks/simple_models/run_simple_models_benchmark.py Switches to k-fold CV + Wilcoxon test; adds --real-world, --n-folds, --n-seeds, --fast; improves report structure.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md Updates the benchmark snapshot/report format to match the new methodology and data-source split.
benchmarks/datasets.py Introduces dataset source tagging and helper APIs (is_real_world, list_real_world_datasets, list_synthetic_datasets).
benchmarks/init.py Re-exports dataset source constants/helpers via the benchmarks package API.
benchmarks/README.md Updates benchmark documentation to reflect CV + significance testing + source-separated reporting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread featcopilot/transformers/sklearn_compat.py Outdated
Comment thread featcopilot/transformers/sklearn_compat.py Outdated
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Comment thread tests/test_sklearn_compat.py Outdated
thinkall and others added 2 commits May 1, 2026 08:04
- sklearn_compat: gate exception path is now fail-closed — always logs
  a warning (not just verbose) and conservatively falls back to original-
  only columns + pins the selector via set_selected_features.
- sklearn_compat: new _encode_for_gate helper ordinal-encodes scalar
  non-numeric columns (bool/category/datetime/object scalars) so signal
  in categorical originals contributes to the baseline. Object columns
  containing arrays / lists / dicts are skipped to avoid row-id-like
  codes fooling the gate model.
- benchmarks: time-series seed handling now emits an explicit one-shot
  warning when n_seeds > 1 and the task is time-series, then runs a
  single pass and records the effective n_seeds (1) in the results
  instead of the misleading raw n_seeds.
- tests: removed unused _make_engineered_frame helper; added coverage
  for the gate's exception fallback, categorical baseline encoding,
  unsupported-object skip, no-usable-columns no-op, ndarray X_original,
  regression target routing, mismatched-index alignment, verbose fall-
  back warning, plus redundancy / unified selector branch coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades FeatCopilot’s benchmarking workflow to cross-validated, statistically tested reporting (with real-world vs synthetic separation), and strengthens the feature-selection pipeline with a “do-no-harm” validation gate plus fixes to preserve original features.

Changes:

  • Add a held-out validation “do-no-harm” gate to AutoFeatureEngineer.fit_transform() (with encoding support for categorical/bool/datetime) and a public set_selected_features() hook for selectors.
  • Fix feature-selection behavior to never drop original features during redundancy elimination and tighten L1 refinement thresholds.
  • Rework the simple-model benchmark to use k-fold CV + Wilcoxon signed-rank significance testing, add dataset source tracking, and extend CLI/reporting.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
featcopilot/transformers/sklearn_compat.py Adds do-no-harm gate + gate encoding helper and wires gate into fit_transform().
featcopilot/core/base.py Exposes BaseSelector.set_selected_features() for post-fit selection overrides.
featcopilot/selection/unified.py Ensures original features are preserved through redundancy filtering; tightens L1 refinement threshold.
featcopilot/selection/redundancy.py Changes redundancy removal rules to never remove originals when correlated with derived features.
tests/test_sklearn_compat.py Adds comprehensive unit tests for do-no-harm gate behavior and encoding edge cases.
tests/test_selection.py Adds coverage for selection overrides, safety branch, and redundancy eliminator branches.
benchmarks/simple_models/run_simple_models_benchmark.py Switches benchmark to CV + seeds + Wilcoxon; adds real-world filtering and richer report output.
benchmarks/datasets.py Introduces dataset source tagging and helper APIs to list/filter real-world vs synthetic datasets.
benchmarks/__init__.py Re-exports new dataset source helpers/constants.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md Updates report snapshot and reproduction instructions.
benchmarks/README.md Updates benchmark methodology and summarizes results with real-world vs synthetic separation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Comment thread benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md Outdated
Comment thread featcopilot/transformers/sklearn_compat.py Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes the benchmark suite statistically rigorous (k-fold CV with paired significance testing and source-aware reporting) and strengthens FeatCopilot’s feature selection pipeline with a fail-closed “do-no-harm” gate plus safer selection/refinement behaviors, while also modernizing the codebase to Python 3.10+ typing and tooling.

Changes:

  • Reworks simple_models benchmarks to use k-fold CV (with seeds), fold-local preprocessing (leakage-safe), Wilcoxon signed-rank testing, and real-world vs synthetic reporting.
  • Adds/extends feature selection safeguards: categorical protection in redundancy elimination, stricter L1 refinement, a do-no-harm validation gate (with gate_n_jobs), and selector pinning via BaseSelector.set_selected_features.
  • Updates project metadata/tooling for Python 3.10+ and expands test coverage for benchmark preprocessing/encoding and selection/gate behavior.

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_sklearn_compat.py Adds extensive end-to-end tests for the do-no-harm gate and _encode_for_gate, plus gate_n_jobs plumbing.
tests/test_selection.py Adds tests for selector override behavior, categorical baseline scoring defaults, redundancy behavior, and deprecation warning coverage.
tests/test_benchmark_encoding.py New tests for leakage-safe fold-local encoding/imputation and report bucketing/CLI validation.
pyproject.toml Raises required Python to 3.10+, updates dev tool versions, and adjusts Black/Ruff/Mypy targets.
featcopilot/utils/validation.py Modernizes typing (`X
featcopilot/utils/parallel.py Modernizes typing and imports (collections.abc.Callable).
featcopilot/utils/models.py Modernizes typing for cached model metadata helpers.
featcopilot/utils/cache.py Modernizes typing for cache API and internal helpers.
featcopilot/transformers/sklearn_compat.py Adds do-no-harm gate + _encode_for_gate, introduces gate_n_jobs, and updates typing/get_params.
featcopilot/stores/rule_store.py Modernizes typing for rule store API.
featcopilot/stores/feast_store.py Modernizes typing for Feast store integration APIs.
featcopilot/stores/base.py Modernizes typing for base feature store interfaces.
featcopilot/selection/unified.py Ensures original features aren’t dropped post-redundancy elimination; tightens L1 refinement thresholds.
featcopilot/selection/statistical.py Modernizes typing and uses zip(..., strict=True) for safety.
featcopilot/selection/redundancy.py Enforces “never drop originals” rules, deprecates original_preference with FutureWarning, and updates typing/docs.
featcopilot/selection/importance.py Modernizes typing and minor internal dict initialization cleanup.
featcopilot/llm/transform_rule_generator.py Modernizes typing across generator APIs.
featcopilot/llm/semantic_engine.py Modernizes typing across engine config and methods.
featcopilot/llm/openai_client.py Modernizes typing across OpenAI client config and APIs.
featcopilot/llm/litellm_client.py Modernizes typing across LiteLLM client config and APIs.
featcopilot/llm/explainer.py Modernizes typing across explainer APIs.
featcopilot/llm/copilot_client.py Modernizes typing across Copilot client APIs.
featcopilot/llm/code_generator.py Modernizes typing across code generator APIs.
featcopilot/engines/timeseries.py Modernizes typing in time series engine APIs.
featcopilot/engines/text.py Modernizes typing and minor dict initialization cleanup.
featcopilot/engines/tabular.py Modernizes typing in tabular engine APIs.
featcopilot/engines/relational.py Modernizes typing in relational engine APIs.
featcopilot/core/transform_rule.py Modernizes typing in transform rule model/methods.
featcopilot/core/registry.py Modernizes typing in registry getter APIs.
featcopilot/core/feature.py Modernizes typing in feature and feature set structures.
featcopilot/core/base.py Adds BaseSelector.set_selected_features hook and modernizes typing across base interfaces.
docs/getting-started/installation.md Updates documented minimum Python version to 3.10+.
benchmarks/simple_models/run_simple_models_benchmark.py Implements leakage-safe CV benchmarking + Wilcoxon testing + source-aware report generation + new CLI flags.
benchmarks/simple_models/SIMPLE_MODELS_BENCHMARK.md Updates the benchmark report snapshot to new format/content.
benchmarks/datasets.py Adds dataset source tracking and helpers for real-world vs synthetic subsets.
benchmarks/compare_tools/run_fe_tools_comparison.py Modernizes typing in FE tools comparison benchmark utilities.
benchmarks/automl/run_automl_benchmark.py Modernizes typing and minor strict zip safety for column sanitization.
benchmarks/init.py Re-exports new dataset source helpers/constants.
benchmarks/README.md Updates benchmark methodology and summary reporting to match new scientific workflow.
README.md Updates documented minimum Python version to 3.10+.
AGENTS.md Updates repo contributor guidance to Python 3.10+ conventions.
.pre-commit-config.yaml Updates hook revisions and Ruff hook configuration.
.github/workflows/tests.yml Updates CI test matrix to Python 3.10–3.13.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread featcopilot/selection/redundancy.py
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Four findings, all valid:

1. `RedundancyEliminator` accepted but did not document the deprecated
   `original_preference` parameter. Added a clearly-marked
   "**Deprecated, accepted for backward compatibility only.**" entry to
   the Parameters section so callers understand why it's accepted, that
   it has no effect, and that any non-None value triggers a
   `FutureWarning`.

2. `preprocess_X_train_test` would leave NaNs in the output if a
   numeric column was entirely missing in the train fold: `median()`
   returns NaN and `fillna(NaN)` is a no-op, contradicting the
   docstring's "all numeric, no NaN" guarantee and breaking downstream
   model fitting (`Input X contains NaN`). Added a guard that falls
   back to the `0.0` sentinel when `train_median` is NaN — the column
   becomes a constant (useless but safe).

3. `preprocess_target` only label-encoded `object` and `category`
   targets. Pandas `string`/`StringDtype` (and other non-numeric
   extension dtypes) would fall through and return a non-numeric
   ndarray that breaks sklearn estimators. Switched the condition to
   `pd.api.types.is_numeric_dtype(y) or pd.api.types.is_bool_dtype(y)`
   so all non-numeric, non-bool targets get encoded uniformly.

4. `warnings.filterwarnings("ignore")` in `main()` mutated the global
   warnings filter and was never restored. Tests and other programmatic
   callers of `main()` would have warnings silently suppressed for the
   rest of the process. Moved the call under the
   `if __name__ == "__main__"` guard so it only triggers from the CLI
   entrypoint, leaving `main()` itself side-effect free.

Tests added (all 5 in `tests/test_benchmark_encoding.py`):
- `test_preprocess_X_train_test_handles_all_missing_train_column`
- `test_preprocess_target_handles_pandas_string_dtype`
- `test_preprocess_target_passes_through_numeric_classification`
- `test_preprocess_target_passes_through_bool_classification`
- `test_main_does_not_mutate_global_warnings_filter`

Verified: pre-commit clean, full suite 753 tests passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Two findings on commit 61f0e9c, both valid:

1. `run_single_benchmark` discarded `loader_name` (4th element of
   `load_dataset`'s return tuple) and derived `source` solely from
   `is_real_world(dataset_name)` (registry key). For loaders that
   sometimes fetch real data and sometimes fall back to a synthesized
   dataset (the historical pattern for Kaggle/OpenML/HF loaders in
   `benchmarks/datasets.py`), this would mislabel a synthetic-fallback
   run as `real_world` and skew the real-world vs synthetic split.

   Added a new `_infer_source(dataset_name, loader_name)` helper that
   downgrades to `synthetic` when the loader's canonical name carries a
   synthetic-fallback marker (`synthetic` or `-style`, case-insensitive),
   even if the registry tags the dataset as real-world. Wired it into
   `run_single_benchmark` so both the printed Source line and the stored
   `results["source"]` reflect the actual data that was loaded. Also
   started persisting `loader_name` in the result dict for traceability.

2. `generate_report()` treated any result without `source == "real_world"`
   as synthetic, including legacy cached results loaded via
   `--report-only` that pre-date the `source` field. All such datasets
   would silently get bucketed as synthetic, producing a misleading
   report.

   Added a `_resolve_source(result)` helper that prefers the explicit
   `source` field when present and falls back to
   `is_real_world(result["dataset"])` when missing. Switched both the
   `real_world` and `synthetic` filters in `generate_report` to use it.
   This is strictly more accurate than the previous behaviour for stable
   real-world datasets (INRIA, `fake_news`); the only remaining gap —
   detecting a historical synthetic-fallback event for a legacy result —
   isn't fixable from the cache alone because `loader_name` wasn't
   persisted, so we accept it.

Tests added (6 in `tests/test_benchmark_encoding.py`):
- `test_infer_source_synthetic_dataset_always_synthetic`
- `test_infer_source_real_world_no_marker_returns_real_world`
- `test_infer_source_real_world_with_synthetic_fallback_marker`
  (covers `(synthetic)`, `(HF-style)`, case-insensitive variants)
- `test_resolve_source_uses_explicit_field_when_present`
- `test_resolve_source_falls_back_to_registry_for_legacy_results`
  (covers known real_world / synthetic / unknown / missing dataset)
- `test_generate_report_buckets_legacy_results_via_registry`
  (end-to-end: generate report from rows without `source`, verify both
  buckets are populated correctly via the registry)

Verified: pre-commit clean, full suite 759 tests passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread featcopilot/selection/redundancy.py Outdated
Address Copilot round-23 review on commit d6db161 (1 finding, valid).

The `original_preference` parameter was originally between
`original_features` and `verbose` in `RedundancyEliminator.__init__`
(see commit ea3be9b for the legacy signature). Round 17 (b0fc56e)
re-introduced it as an explicit deprecated parameter but appended it
**after** `verbose`, silently breaking the positional contract:

    # Legacy positional caller intent:
    RedundancyEliminator(0.95, "pearson", scores, originals, 0.5, True)
    #                                                          ^pref ^verbose

After round 17, positions 5 and 6 had been swapped, so:
- `0.5` would bind to `verbose` (truthy bool) — wrong behavior
- `True` would bind to `original_preference` — silent FutureWarning
  for code that never asked to be deprecated

Fix: move `original_preference` back between `original_features` and
`verbose`, restoring the legacy positional order. Update the Parameters
docstring to document the slot is intentional. Add a regression test
(`test_positional_argument_order_matches_legacy`) that:

1. Uses `inspect.signature` to assert the positional order matches the
   legacy contract — fails loudly if the order is ever changed again.
2. End-to-end smoke: passing all-positional in the legacy order binds
   `verbose=True` cleanly with no `FutureWarning`.

Verified: pre-commit clean, full suite 760 tests passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

thinkall and others added 2 commits May 2, 2026 12:12
Two related bugs surfaced when re-running the full `--all` benchmark to
refresh the PR description metrics:

1. After ~3 hours benchmarking 63 datasets, ``save_cache`` raised
   ``TypeError: Object of type bool is not JSON serializable`` on the
   ``significant`` field. ``significant = p_value < 0.05`` returns
   ``numpy.bool_`` (since ``p_value`` is a numpy float from scipy),
   and the default ``json.JSONEncoder`` doesn't accept ``numpy.bool_``
   on NumPy <2 because it isn't a Python ``bool`` subclass for json's
   purposes. The existing converter handled ``np.floating`` / ``np.integer``
   / ``np.ndarray`` / nested dicts but missed ``np.bool_``. Added an
   explicit ``np.bool_`` cast (matters: it must come BEFORE the
   ``np.integer`` check on NumPy <2 where ``np.bool_`` was a subclass
   of ``np.generic`` but not ``np.integer``) and extended the nested
   ``dict`` converter to handle ``np.bool_`` too.

2. When ``save_cache`` raised, the in-memory results were lost — the
   markdown report had not been written yet because ``main()`` called
   ``save_cache`` BEFORE ``generate_report``. After 3 hours of compute,
   one TypeError discarded everything. Swapped the order so the report
   is written first; cache failure now only loses the JSON cache, not
   the markdown summary.

Tests added (`tests/test_benchmark_encoding.py`):
- `test_save_cache_handles_numpy_bool_in_significant_field` — directly
  exercises the previously-failing path with ``np.bool_(True)`` in both
  a top-level field and a nested dict, asserts the JSON round-trip
  produces native Python ``bool``s.
- `test_main_generates_report_even_when_save_cache_would_fail` — patches
  ``save_cache`` to always raise and asserts ``main()`` still writes the
  markdown report so a long benchmark run isn't fully wasted.

Verified: pre-commit clean, full suite 762 tests passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The benchmark's ``--with-llm`` mode was silently falling back to mock
LLM responses (returning hard-coded ``{col1}_{col2}_ratio``-style features)
because ``featcopilot.llm.copilot_client`` was incompatible with the
current ``github-copilot-sdk`` API. Three breaking changes had
accumulated:

1. **``create_session`` is now keyword-only and requires
   ``on_permission_request``** (since SDK 0.1.32). The old code passed a
   positional dict, which raised
   ``TypeError: takes 1 positional argument but 2 were given`` — caught
   by the ``except Exception`` block and silently mocked.

2. **``session.send`` was redesigned in SDK 0.3.x** to take a positional
   ``str`` and not the old ``{"prompt": ...}`` dict, AND a new
   ``send_and_wait`` helper now handles the send → wait-for-idle →
   collect-assistant-message pattern internally. The old hand-rolled
   ``send + asyncio.Event(idle)`` loop produced empty responses.

3. **``session.destroy()`` was deprecated in SDK 0.3.x**, replaced with
   ``session.disconnect()``. ``destroy()`` still works but emits a
   ``DeprecationWarning`` on every test run.

Together (1) was the catastrophic failure: every benchmark with
``--with-llm`` was running on mock features. The benchmark log header
correctly reported ``LLM enabled: True`` so the failure was invisible
to anyone not specifically looking for ``WARNING - copilot_client.py``
fallback messages.

Fixes:

- ``start()`` imports ``PermissionHandler`` from ``copilot.session``
  (it isn't re-exported from the top-level ``copilot`` package) and
  passes ``on_permission_request=PermissionHandler.approve_all`` plus
  ``model`` and ``streaming`` as keyword arguments to ``create_session``.
- ``send_prompt()`` calls the new ``session.send_and_wait(prompt,
  timeout=...)``, which returns ``SessionEvent | None``; reads
  ``event.data.content`` for the assistant text.
- ``stop()`` prefers ``session.disconnect()`` when available, falling
  back to ``session.destroy()`` for older SDKs.

Tests added (``tests/test_llm_modules.py``):

- ``test_start_passes_on_permission_request_to_create_session`` — uses
  a fake ``copilot`` module to assert the new kwargs contract is
  exercised; would catch a regression to the old positional-dict call.
- ``test_send_prompt_uses_send_and_wait`` — asserts the prompt is
  passed as a positional ``str`` and the SDK's ``send_and_wait`` helper
  is used.
- ``test_stop_prefers_disconnect_over_destroy`` — pins the new
  preferred teardown path.
- ``test_stop_falls_back_to_destroy_for_old_sdk`` — exercises the
  graceful fallback for SDKs that pre-date ``disconnect``.

Verified:
- Live ``send_prompt`` round-trip returns a real LLM response (not
  ``{col1}_{col2}_ratio`` mock output).
- All 214 pre-existing ``test_llm_modules.py`` tests still pass; +4 new.
- Pre-commit clean (black, ruff).
- Existing ``DeprecationWarning: destroy() is deprecated`` warning no
  longer surfaces in the test suite output.

Note: this only fixes the LLM-engine path. ``--with-llm`` benchmarks
were producing misleading numbers prior to this fix; the PR description
metrics will be regenerated on a fresh ``--with-llm`` run with the
real LLM connected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
thinkall and others added 2 commits May 3, 2026 01:20
Regenerate `SIMPLE_MODELS_BENCHMARK_LLM.md` and
`SIMPLE_MODELS_CACHE_LLM.json` from a fresh `--all --with-llm` run on
the current code (`e9be75f`).

Prior to commit `e9be75f`, the `--with-llm` mode silently fell back to
mock LLM responses (incompatible with `github-copilot-sdk >= 0.1.32`'s
mandatory `on_permission_request` kwarg). The previously-committed
`SIMPLE_MODELS_BENCHMARK_LLM.md` from 2026-02-25 was generated under
the old single-split methodology AND with the broken LLM connection,
making it neither comparable to the new 5-fold CV reports nor accurate
about the LLM engine's contribution.

This run uses:
- 5-fold stratified CV × 1 seed (matches the no-LLM benchmark report)
- Real Copilot LLM (`gpt-5.2`, no mock fallbacks; verified by 0
  `Could not connect to Copilot` warnings across the entire run)
- All 63 datasets (31 real-world + 32 synthetic)

Headline metrics (full numbers in the report file):
- **Real-world (31)**: Win 5 / Tie 24 / Loss 2; max regression -1.17% on
  `covertype_cat` (not statistically significant at p<0.05). The
  apparent -129.52% on `delays_zurich` is a near-zero-baseline (R²
  ≈ 0.009) artifact and is also non-significant (p=1.0).
- **Synthetic (32)**: Win 20 / Tie 11 / Loss 1; mean +13.70%; best
  +104.34% on `triple_interaction_regression`.
- Overall mean: +6.32%; median: +0.16%.

The PR description has been updated to cite these regenerated metrics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address Copilot review on commit e9be75f. The previous converter used
``float(v)`` for both ``np.floating`` and ``np.integer``, turning
counts like ``n_samples=5000`` into ``5000.0`` in the cache and
silently losing precision past ``2**53`` for large integer counts.

Fix: split ``np.integer`` -> ``int(v)`` from ``np.floating`` ->
``float(v)``, and apply the same distinction inside the nested-dict
conversion. Refactored both top-level and nested-dict conversion
through a single ``_to_native`` helper to eliminate the duplicated
isinstance ladder and avoid future drift between the two paths
(the previous code had to repeat the ``np.bool_`` / ``np.floating``
/ ``np.integer`` ladder verbatim in the nested-dict branch — easy to
get wrong, and we already had this happen with ``np.bool_`` itself).

The ordering ``np.bool_`` -> ``np.integer`` -> ``np.floating`` is
intentional and preserved: on NumPy <2.0 ``np.bool_`` is a subtype of
``np.generic`` but not ``np.integer``, so the ``np.bool_`` check must
come first to avoid the ``json`` encoder later raising
``TypeError: Object of type bool is not JSON serializable`` (the bug
fixed in commit 6c21494).

Tests added (`tests/test_benchmark_encoding.py`):
- `test_save_cache_preserves_int_fidelity_for_np_integer` — exercises
  ``np.int32`` / ``np.int64`` count fields plus a ``2**60+1`` value
  that would lose precision through float64; asserts on-disk JSON has
  no decimal point on int fields and the round-trip preserves
  ``isinstance(int)``. Also covers nested dict mixed-type case.

Verified: pre-commit clean, full suite 767 tests passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@thinkall
Copy link
Copy Markdown
Owner Author

thinkall commented May 2, 2026

@copilot please review the entire PR (not just the latest commit). The PR has accumulated many changes across multiple iterations and the review should consider the full diff against main.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread featcopilot/selection/redundancy.py
Address Copilot review on commit d501114. Round 24 restored the
``original_preference`` parameter in its legacy positional slot but
inadvertently dropped ``self.original_preference = original_preference``
from ``__init__``. Legacy callers that read the attribute after
construction (for logging, hyperparameter introspection, or sklearn's
``get_params`` / ``set_params`` round-trip) would crash with
``AttributeError``.

Fix: re-add ``self.original_preference = original_preference``, with
an explanatory comment that the attribute is preserved purely for
read-only backward compatibility — the value has no effect on
behavior because originals are categorically protected from removal.

Tests added (`tests/test_selection.py`):
- `test_original_preference_attribute_preserved_for_legacy_readers` —
  asserts the attribute exists with the default ``None`` after a
  zero-arg construction, and reflects the caller-provided value when
  one is explicitly passed.

Verified: pre-commit clean, redundancy + selection test set passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@thinkall
Copy link
Copy Markdown
Owner Author

thinkall commented May 2, 2026

@copilot please review the entire PR (not just the latest commit).

@thinkall
Copy link
Copy Markdown
Owner Author

thinkall commented May 2, 2026

@copilot please review the entire PR (head commit 085371e). Earlier review request did not trigger the workflow.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread featcopilot/llm/copilot_client.py Outdated
Address Copilot review on commit 085371e. ``send_prompt`` previously
only caught the built-in ``TimeoutError``. While that matches what the
current ``github-copilot-sdk`` raises, two adjacent exception classes
can also surface from async timeouts:

1. ``asyncio.TimeoutError`` — was a separate exception class on
   Python 3.10 (only became an alias for the built-in ``TimeoutError``
   in 3.11+). On 3.10, an ``asyncio.wait_for`` timeout inside the SDK
   would NOT be caught by ``except TimeoutError`` and would propagate
   to callers as an unhandled crash. ``pyproject.toml`` declares
   ``requires-python = ">=3.10"``, so 3.10 is in scope.

2. ``concurrent.futures.TimeoutError`` — what some thread/process
   bridges (e.g., ``run_in_executor`` chains) surface; cheap to
   include and prevents the timeout handler from being bypassed.

Fix: catch ``(TimeoutError, asyncio.TimeoutError,
concurrent.futures.TimeoutError)``. Added ``import
concurrent.futures`` at the module level. Inline comment documents
why each class is included.

Tests added (`tests/test_llm_modules.py`):
- ``test_send_prompt_handles_builtin_timeout_error`` — current SDK
  behaviour (already worked).
- ``test_send_prompt_handles_asyncio_timeout_error`` — Python-3.10
  scenario; would have failed before this commit.
- ``test_send_prompt_handles_concurrent_futures_timeout_error`` —
  thread-bridge scenario; cheap defensive coverage.

All three patch ``send_and_wait`` with ``side_effect=<TimeoutError>``
and assert the sentinel ``"timed out"`` message is returned, not an
unhandled exception.

Verified: pre-commit clean, 7 ``TestCopilotClientSDKCompatibility``
tests pass (4 prior + 3 new).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@thinkall
Copy link
Copy Markdown
Owner Author

thinkall commented May 3, 2026

@copilot please review the entire PR (not just the latest commit).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
Address Copilot round-30 review on commit 3fdce5e (2 findings).

1. **`preprocess_data()` only handled `object`/`category`**

   The legacy `preprocess_data()` filtered with
   `select_dtypes(include=["object", "category"])`, which silently
   leaves pandas `string`/`StringDtype` columns unencoded. Callers that
   rely on the helper's "model-ready" output contract would get a
   non-numeric frame back and crash inside sklearn.

   Fix: switch to `select_dtypes(exclude=[np.number, "bool",
   "boolean"])` so the legacy path mirrors the production
   `_label_encode_non_numeric` contract — anything that isn't numeric
   or a bool gets label-encoded, including `StringDtype`, ArrowString,
   and any future non-numeric extension dtype.

2. **`significant = p_value < 0.05` returned `numpy.bool_`**

   `p_value` is a NumPy scalar (`scipy.stats.wilcoxon` output), so the
   bare comparison yields `numpy.bool_`. That makes the in-memory
   results dict non-JSON-serializable by default, and forced
   `save_cache`'s `_to_native` helper to special-case it (the bug
   chain that crashed a 3-hour run before commit 6c21494).

   Fix: cast at the source — `significant = bool(p_value < 0.05)`.
   Inline comment notes the rationale so a future refactor doesn't
   silently revert it.

Tests added (`tests/test_benchmark_encoding.py`):

- `test_preprocess_data_encodes_pandas_string_dtype` — exercises a
  frame with mixed `object`, `category`, `string`/`StringDtype`, and
  `bool` columns; asserts `string_ext` ends up integer-typed (encoded)
  and bool stays bool.
- `test_significant_field_is_native_python_bool` — runs `run_single_
  benchmark` end-to-end on a tiny synthetic dataset (with `load_dataset`
  patched so the test stays fast) and asserts
  `type(results["significant"]) is bool` and
  `json.dumps({"significant": sig})` succeeds without a custom encoder.

Verified: pre-commit clean, full suite 773 tests passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@thinkall
Copy link
Copy Markdown
Owner Author

thinkall commented May 3, 2026

@copilot please review the entire PR (not just the latest commit).

@thinkall
Copy link
Copy Markdown
Owner Author

thinkall commented May 3, 2026

@copilot please review the entire PR (head 1e82335). Earlier review request did not trigger the workflow.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .pre-commit-config.yaml
@thinkall
Copy link
Copy Markdown
Owner Author

thinkall commented May 3, 2026

@copilot please review the entire PR (not just the latest commit).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 45 out of 46 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants