Report errors with original DEM indexing #186

Merged
noajshu merged 21 commits into quantumlib:main from noajshu:main
Feb 17, 2026

Contributor

@noajshu noajshu commented Feb 15, 2026

A big flaw in Tesseract's API is that error indices are always expressed in terms of the internal errors vector, which does not correspond directly to the errors in the (flattened) input DEM given by the user. This causes a variety of problems. For example, a DEM generated with --dem-out (or by similar manual processing steps in Python) may bear little resemblance to the input DEM: all the targets are stripped of separators, and so on. This makes it annoying to use Tesseract-calibrated error models for downstream tasks like matching-based decoding.

Here we adopt the principle that the user interface to Tesseract/Simplex decoders should always be in terms of the error indices from the original flattened DEM as provided by the user. This is now true across C++, CLI, and Python APIs.

  • Added index-mapping support to DEM preprocessing in common:
    • merge_indistinguishable_errors(..., error_index_map)
    • remove_zero_probability_errors(..., error_index_map)
    • error_index_map maps original error index to new preprocessed index
    • error_index_map maps removed / redundant errors to std::numeric_limits<size_t>::max()
  • Updated both decoders (TesseractDecoder, SimplexDecoder) to maintain:
    • dem_error_to_error (original flattened DEM index -> internal index)
    • error_to_dem_error (internal error index -> original flattened DEM index)
  • predicted_errors_buffer reports errors back with original flattened DEM error indices.
  • cost_from_errors and observables-from-errors methods now:
    • accept original flattened DEM indices
    • throw on unmapped/removed indices (those that map to std::numeric_limits<size_t>::max())
  • Updated Python bindings to use the new helpers.
  • Updated pybind common wrappers to pass required map args to common preprocessing functions.
  • Updated --dem-out in both CLI binaries:
    • keep original flattened DEM in scope
    • emit updated probabilities by iterating original DEM instruction order
    • preserve original error instruction tags and arbitrary formatting (e.g. D0 ^ D0 D1) when writing estimated DEM output.
  • Updated tests
  • Added AGENTS guidance to run Python Bazel tests
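
To make the index-map convention above concrete, here is a minimal standalone sketch. It is not the code from this PR: the helper names (drop_zero_probability, invert_map, to_internal) and the use of a plain probability vector in place of a DEM are assumptions made for illustration. Only the sentinel convention, std::numeric_limits<size_t>::max() for removed errors, is taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Sentinel for original errors that have no internal counterpart.
constexpr std::size_t kUnmapped = std::numeric_limits<std::size_t>::max();

// Hypothetical stand-in for a preprocessing pass: drops zero-probability
// errors and records, for every original index, its new index (or kUnmapped).
std::vector<double> drop_zero_probability(
    const std::vector<double>& probs,
    std::vector<std::size_t>& error_index_map) {
  std::vector<double> kept;
  error_index_map.assign(probs.size(), kUnmapped);
  for (std::size_t i = 0; i < probs.size(); ++i) {
    if (probs[i] > 0.0) {
      error_index_map[i] = kept.size();
      kept.push_back(probs[i]);
    }
  }
  return kept;
}

// Builds the reverse direction (internal index -> original index).
// Assumption: if several original errors share one internal index,
// the first original index is reported.
std::vector<std::size_t> invert_map(
    const std::vector<std::size_t>& error_index_map,
    std::size_t num_internal) {
  std::vector<std::size_t> inverse(num_internal, kUnmapped);
  for (std::size_t orig = 0; orig < error_index_map.size(); ++orig) {
    const std::size_t internal = error_index_map[orig];
    if (internal == kUnmapped) continue;
    assert(internal < num_internal);
    if (inverse[internal] == kUnmapped) inverse[internal] = orig;
  }
  return inverse;
}

// Translates an original DEM index to an internal one, throwing if the
// index is out of range or the error was removed during preprocessing.
std::size_t to_internal(const std::vector<std::size_t>& error_index_map,
                        std::size_t dem_index) {
  if (dem_index >= error_index_map.size() ||
      error_index_map[dem_index] == kUnmapped) {
    throw std::invalid_argument(
        "DEM error index is out of range or was removed");
  }
  return error_index_map[dem_index];
}
```

With probabilities {0.1, 0.0, 0.2}, the forward map in this sketch sends original index 1 to the sentinel, and the inverse map sends internal index 1 back to original index 2. Callers that only ever see original indices therefore never need to know how many errors were dropped.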

@noajshu noajshu requested a review from a team as a code owner February 15, 2026 22:03
@noajshu noajshu requested review from LalehB and removed request for a team February 15, 2026 22:03
Collaborator

@LalehB LalehB left a comment

LGTM!
Thanks Noah

Comment thread src/tesseract.cc Outdated
#include <cassert>
#include <functional> // For std::hash (though not strictly necessary here, but good practice)
#include <iostream>
#include <numeric>
Collaborator

super nit: for a more robust build you can also include <limits>, since this file and simplex.cc both use numeric_limits but currently rely on a transitive include

Contributor Author

Ah got it, I'll #include <limits> as well, thanks!

@noajshu noajshu merged commit dc11cb0 into quantumlib:main Feb 17, 2026
7 checks passed

2 participants