feat: Paper results pipeline and robustness improvements by Mattdl · Pull Request #46 · techwolf-ai/workrb

Mattdl · 2026-03-13T13:33:49Z

Summary

Adds end-to-end support for reproducing paper benchmark results — from running experiments to generating publication-ready LaTeX tables — while fixing several robustness issues discovered during large-scale multilingual evaluation runs.

Changes

New: Paper results pipeline

examples/run_paper_results.py — Script to run the full benchmark suite across multilingual and monolingual model configurations (BM25, JobBERT-v2/v3, Qwen3, ConTeXTMatch, CurriculumMatch).
examples/generate_paper_table.py — Script to load saved results and generate a formatted LaTeX comparison table with model grouping, short names, bold-best highlighting, and optional dataset count rows.
src/workrb/metrics/reporting.py — New format_results_latex() function (~290 lines) that builds a complete LaTeX table environment from multiple BenchmarkResults, with support for model groups, \midrule separators, column renaming/reordering, \resizebox, and per-group dataset count rows.
src/workrb/results.py — New BenchmarkResults.get_dataset_counts() method to report how many datasets contribute to each task group or task score after language filtering.
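
For illustration, the bold-best highlighting mentioned above can be sketched as follows. This is a simplified stand-in, not the actual format_results_latex() implementation: the function name format_latex_rows, its arguments, the model names, and the numbers are all hypothetical, and it picks the best score across all models rather than per model group.

```python
def format_latex_rows(scores, decimals=1):
    """Render {model: {column: value}} as LaTeX table rows, bolding the best value per column."""
    columns = list(next(iter(scores.values())))
    # Highest score per column (assumes higher is better).
    best = {c: max(row[c] for row in scores.values()) for c in columns}
    lines = []
    for model, row in scores.items():
        cells = []
        for c in columns:
            text = f"{row[c]:.{decimals}f}"
            if row[c] == best[c]:
                text = rf"\textbf{{{text}}}"
            cells.append(text)
        lines.append(" & ".join([model, *cells]) + r" \\")
    return "\n".join(lines)


# Made-up placeholder scores, not benchmark results.
toy_scores = {
    "ModelA": {"TaskX": 10.0, "TaskY": 20.0},
    "ModelB": {"TaskX": 15.0, "TaskY": 5.0},
}
table_rows = format_latex_rows(toy_scores)
print(table_rows)
```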

New: Duplicate handling in ranking datasets

DuplicateStrategy enum (ALLOW, RAISE, RESOLVE) replaces the old boolean allow_duplicate_queries / allow_duplicate_targets flags in RankingDataset.
RESOLVE mode (new default) deterministically deduplicates:
- Targets: keeps first occurrence and remaps all indices.
- Queries: merges target_indices via set union for identical query texts.
All ranking tasks (melo, mels, skillnorm, etc.) updated to use the new API.
tests/test_duplicate_strategy.py — 210-line test suite covering RAISE, ALLOW, and RESOLVE strategies for both queries and targets.
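
A minimal sketch of the RESOLVE behavior described above, assuming queries and targets are plain strings and labels are lists of target indices. Only the enum member names come from this PR; the enum values, the helper names resolve_targets and resolve_queries, and the toy data are assumptions for illustration and may differ from the code in ranking_base.py.

```python
from enum import Enum


class DuplicateStrategy(Enum):
    # Member names follow the PR; the string values are assumed.
    ALLOW = "allow"
    RAISE = "raise"
    RESOLVE = "resolve"


def resolve_targets(targets, query_target_indices):
    """Keep the first occurrence of each target and remap all label indices."""
    first_index = {}      # target text -> index in the deduplicated list
    remap = []            # old index -> new index
    unique_targets = []
    for text in targets:
        if text not in first_index:
            first_index[text] = len(unique_targets)
            unique_targets.append(text)
        remap.append(first_index[text])
    remapped = [sorted({remap[i] for i in idxs}) for idxs in query_target_indices]
    return unique_targets, remapped


def resolve_queries(queries, query_target_indices):
    """Merge target indices via set union for identical query texts."""
    merged = {}  # dicts preserve insertion order, so the first occurrence wins
    for text, idxs in zip(queries, query_target_indices):
        merged.setdefault(text, set()).update(idxs)
    unique_queries = list(merged)
    return unique_queries, [sorted(merged[q]) for q in unique_queries]


targets = ["python", "java", "python", "sql"]
queries = ["backend dev", "data analyst", "backend dev"]
labels = [[0], [3], [1, 2]]

targets, labels = resolve_targets(targets, labels)
queries, labels = resolve_queries(queries, labels)
```

Targets are resolved first so that the query merge operates on already remapped indices; the order used inside RankingDataset is not stated in this PR.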

New: Graceful handling of unsupported dataset configs

DatasetConfigNotSupported exception — New exception type in base.py for datasets that dynamically produce 0 queries or targets (e.g., an ESCO language/version lacking skill alternatives).
Task._load_datasets() catches DatasetConfigNotSupported and logs a warning instead of crashing, then updates self.dataset_ids to reflect only successfully loaded datasets.
RankingDataset validation raises DatasetConfigNotSupported when query or target count is 0.
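
The skip-and-warn flow above can be sketched roughly as below. Only the exception name is taken from this PR; build_dataset, load_datasets, and the dictionary layout are hypothetical stand-ins for RankingDataset validation and Task._load_datasets().

```python
import logging

logger = logging.getLogger(__name__)


class DatasetConfigNotSupported(Exception):
    """Raised when a dataset configuration yields no queries or no targets."""


def build_dataset(dataset_id, queries, targets):
    # Stand-in for the validation step: an empty side means the config is unsupported.
    if len(queries) == 0 or len(targets) == 0:
        raise DatasetConfigNotSupported(
            f"{dataset_id}: {len(queries)} queries, {len(targets)} targets"
        )
    return {"id": dataset_id, "queries": queries, "targets": targets}


def load_datasets(raw):
    """Load every dataset, skipping unsupported configs with a warning."""
    loaded = {}
    for dataset_id, (queries, targets) in raw.items():
        try:
            loaded[dataset_id] = build_dataset(dataset_id, queries, targets)
        except DatasetConfigNotSupported as exc:
            logger.warning("Skipping unsupported dataset config: %s", exc)
    # The returned ID list reflects only the datasets that actually loaded.
    return loaded, list(loaded)


raw = {"en": (["q1"], ["t1", "t2"]), "is": ([], ["t1"])}
datasets, dataset_ids = load_datasets(raw)
```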

New: ConTeXTMatch query batching for large ranking tasks

ConTeXTMatchModel._compute_rankings() now scores queries in configurable chunks (scoring_batch_size, default 32) to prevent OOM from the (num_queries, num_targets, seq_len) intermediate tensor.
Targets are encoded once and reused across all query chunks.
Renamed the batch_size parameter of encode() to encode_batch_size for clarity.
tests/test_models/test_contextmatch_model.py — 108-line test suite for the ConTeXTMatch model including batching behavior.
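
The chunking idea can be sketched in plain Python as below. This is a simplified illustration, not the ConTeXTMatchModel code: it scores with a dot product over toy vectors instead of token-level tensors, and toy_encode, chunks, and compute_rankings are hypothetical helpers. Only the scoring_batch_size name and its default of 32 come from this PR.

```python
def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def toy_encode(text):
    # Toy stand-in for a real encoder: two hand-made features per text.
    return [float(len(text)), float(text.count("e"))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compute_rankings(queries, targets, encode=toy_encode, scoring_batch_size=32):
    """Score all queries against all targets, one query chunk at a time."""
    target_vecs = [encode(t) for t in targets]  # encoded once, reused for every chunk
    scores = []
    for query_chunk in chunks(queries, scoring_batch_size):
        query_vecs = [encode(q) for q in query_chunk]
        # Only this chunk's (chunk_size x num_targets) block is built at a time.
        scores.extend([[dot(q, t) for t in target_vecs] for q in query_vecs])
    return scores


queries = ["nurse", "software engineer", "teacher", "welder", "chef"]
targets = ["care", "code", "teach"]
batched = compute_rankings(queries, targets, scoring_batch_size=2)
unbatched = compute_rankings(queries, targets, scoring_batch_size=len(queries))
```

Chunking changes peak memory, not results: each query row is scored independently, so the batched and unbatched score matrices are identical.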

Fix: Version-dependent ESCO language support

ESCO.get_supported_languages(version) returns the correct language set per major.minor version (e.g., Icelandic/Norwegian/Arabic/Ukrainian only available from v1.1+).
All ranking tasks now call ESCO.get_supported_languages(self.esco_version) instead of the static SUPPORTED_ESCO_LANGUAGES tuple.
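
A rough sketch of a version-gated lookup, assuming version strings of the form "major.minor[.patch]". The base language set below is an illustrative subset rather than the real ESCO list, and the function is a free-standing stand-in for ESCO.get_supported_languages(); only the four v1.1+ additions are taken from the description above.

```python
# Illustrative subset only; the real base set lives in src/workrb/data/esco.py.
BASE_LANGUAGES = frozenset({"en", "fr", "de"})
# Icelandic, Norwegian, Arabic, Ukrainian: available from v1.1 onward per this PR.
ADDED_IN_1_1 = frozenset({"is", "no", "ar", "uk"})


def parse_major_minor(version):
    """Turn '1.1.0' or 'v1.2' into a comparable (major, minor) tuple."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def get_supported_languages(version):
    """Return the sorted language codes available for the given ESCO version."""
    languages = set(BASE_LANGUAGES)
    if parse_major_minor(version) >= (1, 1):
        languages |= ADDED_IN_1_1
    return sorted(languages)


old_langs = get_supported_languages("1.0.9")
new_langs = get_supported_languages("1.1.0")
```

Comparing integer tuples avoids the string-comparison pitfall where "1.10" would sort before "1.2".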

Fix: Miscellaneous

Cast prediction matrices to .float() before .numpy() to avoid dtype errors with bfloat16 models.
Use sorted(set(...)) instead of list(set(...)) for deterministic index ordering.
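
A small illustration of the ordering fix, using a hypothetical helper merge_indices. Set iteration order is an implementation detail in Python, so list(set(...)) gives no ordering guarantee, while sorted(set(...)) always yields the same sequence.

```python
def merge_indices(*index_lists):
    """Union several index lists into one deterministic, ascending list."""
    merged = set()
    for indices in index_lists:
        merged.update(indices)
    return sorted(merged)


unique_indices = merge_indices([7, 3, 7], [1, 3])
```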

Files changed (22)

| Area | Files |
| --- | --- |
| Examples | `examples/run_paper_results.py`, `examples/generate_paper_table.py` |
| Metrics & Reporting | `src/workrb/metrics/reporting.py`, `src/workrb/metrics/__init__.py`, `src/workrb/metrics/classification.py`, `src/workrb/metrics/ranking.py` |
| Models | `src/workrb/models/bi_encoder.py` |
| Results | `src/workrb/results.py` |
| Tasks (core) | `src/workrb/tasks/abstract/base.py`, `src/workrb/tasks/abstract/ranking_base.py`, `src/workrb/tasks/__init__.py` |
| Tasks (ranking) | `job2skill.py`, `skill2job.py`, `skill_extraction.py`, `jobnorm.py`, `skillnorm.py`, `melo.py`, `mels.py` |
| Data | `src/workrb/data/esco.py` |
| Tasks (classification) | `src/workrb/tasks/classification/job2skill.py` |
| Tests | `tests/test_duplicate_strategy.py`, `tests/test_models/test_contextmatch_model.py` |

Test plan

Run tests/test_duplicate_strategy.py — validates RAISE, ALLOW, and RESOLVE strategies for both query and target deduplication
Run tests/test_models/test_contextmatch_model.py — validates ConTeXTMatch encoding and batched scoring
Run existing test suite to confirm no regressions in ranking/classification tasks
Verify examples/run_paper_results.py executes end-to-end with at least one model
Verify examples/generate_paper_table.py generates valid LaTeX output from saved results

Breaking changes

RankingDataset.__init__ signature changed: allow_duplicate_queries / allow_duplicate_targets replaced by duplicate_query_strategy / duplicate_target_strategy (enum-based). Any external callers using the old boolean flags will need to update.
ConTeXTMatchModel.encode() parameter renamed from batch_size to encode_batch_size.

Mattdl added 14 commits March 9, 2026 16:44

chore: run paper results example

aeaf417

feat: deduplication strategy for queries and targets to enable datasets with duplicate targets

171c402

fix: default resolve duplicates in ranking and query

47bb2b1

feat: make ESCO language support version dependent

400a65b

fix: DatasetConfigNotSupported exceptions are introduced and skipped to solve dynamic dataset loading that can result in 0-length query or target tasks.

063500c

fix: linter fixes

4d904e8

fix: update dataset_ids in Task based on filtering non-supported ones

93572fd

feat: ConteXTMatch query batching for out-of-memory on large ranking tasks

2899803

fix: cast prediction matrices explicitly to float

14d1baa

chore: adding JobBERTv2 to paper results

42ada46

feat: latex table results reporting for paper example

25b9e56

fix: latex table result set best results in bold for different model groups

4b0f38c

feat: enable row with dataset counts for results table example

7d14f77

feat: NDCG metrics for paper

da3c94e

Mattdl merged commit 18ea566 into main Mar 13, 2026
2 checks passed

Mattdl deleted the feat/paper_results branch March 13, 2026 14:11
