
feat: Paper results pipeline and robustness improvements #46

Merged

Mattdl merged 14 commits into main from feat/paper_results on Mar 13, 2026
Conversation

Mattdl (Collaborator) commented Mar 13, 2026

Summary

Adds end-to-end support for reproducing paper benchmark results — from running experiments to generating publication-ready LaTeX tables — while fixing several robustness issues discovered during large-scale multilingual evaluation runs.

Changes

New: Paper results pipeline

  • examples/run_paper_results.py — Script to run the full benchmark suite across multilingual and monolingual model configurations (BM25, JobBERT-v2/v3, Qwen3, ConTeXTMatch, CurriculumMatch).
  • examples/generate_paper_table.py — Script to load saved results and generate a formatted LaTeX comparison table with model grouping, short names, bold-best highlighting, and optional dataset count rows.
  • src/workrb/metrics/reporting.py — New format_results_latex() function (~290 lines) that builds a complete LaTeX table environment from multiple BenchmarkResults, with support for model groups, \midrule separators, column renaming/reordering, \resizebox, and per-group dataset count rows.
  • src/workrb/results.py — New BenchmarkResults.get_dataset_counts() method to report how many datasets contribute to each task group or task score after language filtering.
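The core of the table builder can be sketched as follows. This is a simplified illustration, not the actual ~290-line implementation: the input shape (`{group: {model: {column: score}}}`) is an assumption, and features such as column renaming, `\resizebox`, and dataset count rows are omitted.

```python
def format_results_latex(results, columns):
    """Sketch: render {group: {model: {column: score}}} as a LaTeX table."""
    # Best score per column, used for bold-best highlighting.
    best = {c: max(s[c] for models in results.values() for s in models.values())
            for c in columns}
    lines = [r"\begin{tabular}{l" + "c" * len(columns) + "}",
             r"\toprule",
             "Model & " + " & ".join(columns) + r" \\",
             r"\midrule"]
    for gi, (group, models) in enumerate(results.items()):
        if gi:
            lines.append(r"\midrule")  # \midrule separator between model groups
        for model, scores in models.items():
            cells = [rf"\textbf{{{scores[c]:.1f}}}" if scores[c] == best[c]
                     else f"{scores[c]:.1f}" for c in columns]
            lines.append(model + " & " + " & ".join(cells) + r" \\")
    lines += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(lines)
```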

New: Duplicate handling in ranking datasets

  • DuplicateStrategy enum (ALLOW, RAISE, RESOLVE) replaces the old boolean allow_duplicate_queries / allow_duplicate_targets flags in RankingDataset.
  • RESOLVE mode (new default) deterministically deduplicates:
    • Targets: keeps first occurrence and remaps all indices.
    • Queries: merges target_indices via set union for identical query texts.
  • All ranking tasks (melo, mels, skillnorm, etc.) updated to use the new API.
  • tests/test_duplicate_strategy.py — 210-line test suite covering RAISE, ALLOW, and RESOLVE strategies for both queries and targets.
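The RESOLVE semantics can be sketched as a standalone helper (the function name and signature here are hypothetical; the real logic lives inside `RankingDataset`):

```python
def resolve_duplicates(queries, query_target_indices, targets):
    """Sketch of RESOLVE: dedupe targets (keep first occurrence, remap indices),
    then merge identical query texts via set union of their target indices."""
    first_seen, new_targets, remap = {}, [], {}
    for old_idx, target in enumerate(targets):
        if target not in first_seen:
            first_seen[target] = len(new_targets)
            new_targets.append(target)
        remap[old_idx] = first_seen[target]
    merged, order = {}, []
    for query, idxs in zip(queries, query_target_indices):
        if query not in merged:
            merged[query] = set()
            order.append(query)
        merged[query].update(remap[i] for i in idxs)
    # sorted(...) keeps the merged index ordering deterministic.
    return order, [sorted(merged[q]) for q in order], new_targets
```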

New: Graceful handling of unsupported dataset configs

  • DatasetConfigNotSupported exception — New exception type in base.py for datasets that dynamically produce 0 queries or targets (e.g., an ESCO language/version lacking skill alternatives).
  • Task._load_datasets() catches DatasetConfigNotSupported and logs a warning instead of crashing, then updates self.dataset_ids to reflect only successfully loaded datasets.
  • RankingDataset validation raises DatasetConfigNotSupported when query or target count is 0.
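The catch-and-skip pattern looks roughly like this (the `load_datasets` helper and `loader` callback are illustrative stand-ins for `Task._load_datasets` and its internal loading machinery):

```python
import logging

class DatasetConfigNotSupported(Exception):
    """Raised when a dataset config dynamically yields 0 queries or targets."""

def load_datasets(dataset_ids, loader):
    """Sketch: skip unsupported configs with a warning instead of crashing,
    keeping only the ids that actually loaded."""
    loaded = {}
    for ds_id in dataset_ids:
        try:
            loaded[ds_id] = loader(ds_id)
        except DatasetConfigNotSupported as exc:
            logging.warning("Skipping dataset %s: %s", ds_id, exc)
    return loaded
```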

New: ConTeXTMatch query batching for large ranking tasks

  • ConTeXTMatchModel._compute_rankings() now scores queries in configurable chunks (scoring_batch_size, default 32) to prevent OOM from the (num_queries, num_targets, seq_len) intermediate tensor.
  • Targets are encoded once and reused across all query chunks.
  • Renamed batch_size parameter to encode_batch_size for clarity.
  • tests/test_models/test_contextmatch_model.py — 108-line test suite for the ConTeXTMatch model including batching behavior.
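The chunking idea can be sketched independently of the model (the helper name and the `score_fn` callback are hypothetical; in the real `_compute_rankings()`, targets are pre-encoded tensors and scoring is done with the cross-attention scorer):

```python
def chunked_rankings(queries, targets, score_fn, scoring_batch_size=32):
    """Sketch of chunked query scoring: targets are passed in once (already
    encoded) and queries are scored chunk by chunk, so peak memory is bounded
    by the chunk size rather than the full (num_queries, num_targets, seq_len)
    intermediate tensor."""
    scores = []
    for start in range(0, len(queries), scoring_batch_size):
        chunk = queries[start:start + scoring_batch_size]
        scores.extend(score_fn(chunk, targets))  # one chunk at a time
    return scores
```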

Fix: Version-dependent ESCO language support

  • ESCO.get_supported_languages(version) returns the correct language set per major.minor version (e.g., Icelandic/Norwegian/Arabic/Ukrainian only available from v1.1+).
  • All ranking tasks now call ESCO.get_supported_languages(self.esco_version) instead of the static SUPPORTED_ESCO_LANGUAGES tuple.
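A sketch of the version-dependent lookup; the base language set below is a made-up placeholder, and only the v1.1 additions (Icelandic, Norwegian, Arabic, Ukrainian) come from this PR's notes:

```python
# Hypothetical baseline set for illustration only.
BASE_LANGUAGES = ("en", "de", "fr", "es", "it", "nl")
# Available from ESCO v1.1+ per the PR description.
V1_1_ADDITIONS = ("is", "no", "ar", "uk")

def get_supported_languages(version):
    """Return the supported language set for an ESCO major.minor version string."""
    major, minor = (int(part) for part in version.split(".")[:2])
    languages = set(BASE_LANGUAGES)
    if (major, minor) >= (1, 1):
        languages.update(V1_1_ADDITIONS)
    return sorted(languages)
```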

Fix: Miscellaneous

  • Cast prediction matrices to .float() before .numpy() to avoid dtype errors with bfloat16 models.
  • Use sorted(set(...)) instead of list(set(...)) for deterministic index ordering.
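The determinism fix in miniature (the `.float()` cast before `.numpy()` is not shown here, since it requires a torch tensor):

```python
def deterministic_indices(indices):
    """list(set(...)) yields an arbitrary, hash-dependent ordering;
    sorted(set(...)) both deduplicates and gives a reproducible order."""
    return sorted(set(indices))
```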

Files changed (22)

| Area | Files |
| --- | --- |
| Examples | examples/run_paper_results.py, examples/generate_paper_table.py |
| Metrics & Reporting | src/workrb/metrics/reporting.py, src/workrb/metrics/__init__.py, src/workrb/metrics/classification.py, src/workrb/metrics/ranking.py |
| Models | src/workrb/models/bi_encoder.py |
| Results | src/workrb/results.py |
| Tasks (core) | src/workrb/tasks/abstract/base.py, src/workrb/tasks/abstract/ranking_base.py, src/workrb/tasks/__init__.py |
| Tasks (ranking) | job2skill.py, skill2job.py, skill_extraction.py, jobnorm.py, skillnorm.py, melo.py, mels.py |
| Data | src/workrb/data/esco.py |
| Tasks (classification) | src/workrb/tasks/classification/job2skill.py |
| Tests | tests/test_duplicate_strategy.py, tests/test_models/test_contextmatch_model.py |

Test plan

  • Run tests/test_duplicate_strategy.py — validates RAISE, ALLOW, and RESOLVE strategies for both query and target deduplication
  • Run tests/test_models/test_contextmatch_model.py — validates ConTeXTMatch encoding and batched scoring
  • Run existing test suite to confirm no regressions in ranking/classification tasks
  • Verify examples/run_paper_results.py executes end-to-end with at least one model
  • Verify examples/generate_paper_table.py generates valid LaTeX output from saved results

Breaking changes

  • RankingDataset.__init__ signature changed: allow_duplicate_queries / allow_duplicate_targets replaced by duplicate_query_strategy / duplicate_target_strategy (enum-based). Any external callers using the old boolean flags will need to update.
  • ConTeXTMatchModel.encode() parameter renamed from batch_size to encode_batch_size.

@Mattdl Mattdl merged commit 18ea566 into main Mar 13, 2026
2 checks passed
@Mattdl Mattdl deleted the feat/paper_results branch March 13, 2026 14:11