txtextracteval is a Python toolkit for comparing text extraction methods on images of printed documents, particularly those captured by mobile phones, which often suffer from quality issues. It provides an end-to-end pipeline to evaluate and compare various approaches, including:
- Traditional OCR engines (Tesseract and EasyOCR)
- Vision-enabled Large Language Model APIs (from providers like Google Gemini, OpenAI, or Anthropic)
- Local Transformer models (e.g., TrOCR, SmolVLM via HuggingFace, or via Ollama)
The toolkit automates image preprocessing (applying transformations like blur, rotation, brightness changes), runs the configured extraction methods on the original and transformed images, computes accuracy (CER, WER) and performance (latency, cost) metrics against ground truth text, and generates a consolidated Markdown report for easy comparison.
The goal is to help teams understand which method works best for different image qualities and scenarios, guiding informed decisions for real-world applications.
- Multi-Method Comparison: Evaluate OCR engines, API providers, and local model pipelines side-by-side.
- Image Degradation Simulation: Apply configurable transformations (blur, brightness, rotation) to test robustness against common mobile capture issues.
- Comprehensive Metrics: Calculate Character Error Rate (CER), Word Error Rate (WER), latency (seconds per extraction), and estimated cost (for API methods). A short sketch of the CER/WER computation follows this list.
- YAML Configuration: Define experiments reproducibly using a clear YAML format.
- Command-Line Interface: Run evaluations easily using `uv run txtextracteval` with either a config file or direct image/GT arguments.
- Detailed Markdown Reports: Automatically generate reports including transformed image previews, extracted text outputs, and summary tables for easy analysis.
- Extensible: Designed with a modular structure to facilitate adding new extraction methods or image transformations.
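For reference, CER and WER are edit-distance ratios against the ground truth. Here is a minimal sketch of the standard computation (plain dynamic-programming Levenshtein distance; the toolkit's actual implementation may differ, e.g., by delegating to a library):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: same computation over whitespace-split tokens."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

For example, `cer("hello world", "helo world")` is 1/11 ≈ 0.09, while `wer` on the same pair is 1/2 = 0.5.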
- Clone the repository:

  ```bash
  git clone https://github.com/wri/txtextracteval.git
  cd txtextracteval
  ```

- Create and activate virtual environment: `uv` handles this seamlessly. Run:

  ```bash
  uv venv
  ```

  This creates a `.venv` directory (if it doesn't exist); activate it with `source .venv/bin/activate`, or let `uv run` use it automatically.

- Install dependencies: Sync the environment with the project's dependencies listed in `pyproject.toml`:

  ```bash
  uv sync
  ```

  This will install `txtextracteval` and all its dependencies (like `opencv-python`, `pytesseract`, `transformers`, `google-genai`, etc.).

- LLM API provider: If using the `llm_api` method with `provider: gemini`, `provider: openai`, or `provider: anthropic`, you need an API key.
  - Create a file named `.env` in the project root directory.
  - Add your API key to the `.env` file:

    ```bash
    GOOGLE_API_KEY="YOUR_API_KEY_HERE"
    ANTHROPIC_API_KEY="YOUR_API_KEY_HERE"
    OPENAI_API_KEY="YOUR_API_KEY_HERE"
    ```

  The application uses `python-dotenv` to load this key automatically. Alternatively, set the corresponding API key environment variable directly in your shell.
- Ollama: If using the `llm_api` method with `provider: ollama`, ensure you have an Ollama instance running locally.
  - Follow the Ollama installation guide.
  - Make sure the desired multimodal model (e.g., `llava`, or `gemma3` with vision support) is pulled: `ollama pull gemma3`.
  - The default endpoint used by the tool is `http://localhost:11434`. You can override this in the configuration if needed.
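Before a full run, you can quickly check that the environment is wired up; a minimal sketch (assuming `requests` is installed, the `.env` setup above, and the standard Ollama REST API on the default endpoint):

```python
import os

import requests
from dotenv import load_dotenv

# Confirm an API key is visible (loaded from .env, as the toolkit does internally).
load_dotenv()
print("GOOGLE_API_KEY set:", bool(os.getenv("GOOGLE_API_KEY")))

# Confirm the local Ollama instance is reachable; /api/tags lists pulled models.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
print("Ollama models:", [m["name"] for m in resp.json().get("models", [])])
```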
Experiments are defined using a YAML file (e.g., `config.yaml`). Here's an example structure:

```yaml
# List of input image paths or path to a directory
images:
  - data/samples/receipt.jpg
  - data/samples/document.png

# List of corresponding ground truth text file paths
# Must match the order of images if both are lists
ground_truth:
  - data/samples/receipt.txt
  - data/samples/document.txt

# List of extraction methods to evaluate
methods:
  - type: tesseract # Uses local Tesseract installation
    config: # Optional Tesseract settings
      lang: eng # Language (e.g., eng+fra for multiple)
      psm: 3 # Page Segmentation Mode
  - type: hf_ocr # Uses Hugging Face transformers
    config:
      # Optional: Specify model (defaults to ds4sd/SmolDocling-256M-preview)
      model: microsoft/trocr-base-printed
      # device: 0 # Optional: Specify GPU device index (defaults to CPU: -1)
  - type: llm_api # Uses LLM APIs
    config:
      provider: gemini # 'gemini', 'openai', 'anthropic', or 'ollama'
      model: gemini-1.5-flash-latest # Specific Gemini model
      # api_key_env: CUSTOM_GEMINI_KEY_VAR # Optional: Override default env var name
      prompt: "Extract the text content accurately." # Optional: Custom prompt
  - type: llm_api
    config:
      provider: ollama
      model: llava # Model running on local Ollama instance
      endpoint: http://127.0.0.1:11434 # Optional: Override default Ollama endpoint
      timeout: 180 # Optional: Timeout in seconds for API call (default: 120)

# Optional: List of image transformations to apply
transformations:
  - name: blur # Name matching registered transform function
    params:
      kernel_size: 3 # Parameters for the transform function
  - name: rotate
    params:
      angle: -5
  - name: brightness
    params:
      factor: 0.7 # Decrease brightness by 30%

# Optional: List of metrics to calculate (defaults shown)
metrics:
  - cer
  - wer
  - latency # Automatically included via ExtractionResult
  - cost # Automatically included via ExtractionResult

# Output configuration
output:
  directory: ./evaluation_results/run_01 # Where to save report and variants
  report_filename: comparison_report.md # Name of the Markdown report
```
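To catch path or alignment mistakes before kicking off a long run, a few lines of Python can sanity-check a config (a minimal sketch assuming PyYAML and the list-form fields shown above):

```python
import pathlib

import yaml  # PyYAML

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Assumes the list form shown above (images may also be given as a directory).
assert len(cfg["images"]) == len(cfg["ground_truth"]), "images/ground_truth length mismatch"
for img, gt in zip(cfg["images"], cfg["ground_truth"]):
    for path in (img, gt):
        assert pathlib.Path(path).exists(), f"missing file: {path}"

print("Methods:", [m["type"] for m in cfg["methods"]])
```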
Run experiments using the CLI:

```bash
# Ensure your virtual environment is active or use uv run directly
uv run txtextracteval --config_file path/to/your/config.yaml
```

Options:
- `--config_file` (`-c`): Path to the YAML configuration file (required for full experiments).
- `--src_img`: Path to a single source image (for quick test runs). Requires `--gt_file`.
- `--gt_file`: Path to the ground truth text for the single source image. Requires `--src_img`.
- `--out_dir`: Output directory path. Overrides the `directory` setting in the config file if provided.
- `--verbose` (`-v`): Enable detailed DEBUG level logging.
- `--version`: Show the version number and exit.
- `--help`: Show help message and exit.
Simple Run Example (without config file):
This runs only the default Tesseract method with no transformations on a single image.
```bash
uv run txtextracteval --src_img data/samples/receipt.jpg --gt_file data/samples/receipt.txt --out_dir ./simple_run_output
```

The tool generates the following in the specified output directory:
- Markdown Report (`<report_filename>.md`): Contains:
  - Experiment overview.
  - A summary table comparing average metrics (latency, cost, CER, WER) across all methods.
  - Detailed results for each source image, broken down by image variant (original + transformations).
  - For each variant, an embedded preview of the transformed image.
  - A table showing the extracted text (truncated), metrics, latency, cost, and any errors for each method applied to that variant.
- Image Variants: Saved versions of the original image and each applied transformation (e.g., `image1_original.png`, `image1_blur_k3.png`). These are linked in the report.
To run the test suite (requires `pytest` and `pytest-mock`):

```bash
uv run pytest
```

Fair warning: the test suite is neither complete nor functioning particularly well yet.
This codebase is designed to be extensible:

- Extractors: Add a new class inheriting from `BaseExtractor` in `src/txtextracteval/extractors/`, implement the `extract` method, and register the class key/type in `EXTRACTOR_REGISTRY` within `src/txtextracteval/runner.py` (see the sketch after this list).
- Transformations: Add a new function in `src/txtextracteval/transforms/` (e.g., `opencv_transforms.py`) that takes a NumPy array image and parameters, returns a transformed NumPy array, and register its name/function in `TRANSFORM_REGISTRY` within `src/txtextracteval/runner.py`.
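To make the contracts concrete, here is a minimal sketch of a hypothetical extractor; the module paths, the `BaseExtractor` interface, and the `ExtractionResult` fields are assumptions based on the names above, so check the actual source before copying:

```python
# Illustrative only -- a trivial extractor to use as a registration template.
import time

import numpy as np

from txtextracteval.extractors.base import BaseExtractor  # assumed module path
from txtextracteval.models import ExtractionResult        # assumed result type


class DummyExtractor(BaseExtractor):
    """Returns a fixed string instead of running real OCR."""

    def extract(self, image: np.ndarray) -> ExtractionResult:
        start = time.perf_counter()
        text = "hello world"  # a real extractor would run OCR on `image` here
        return ExtractionResult(text=text, latency=time.perf_counter() - start, cost=0.0)


# Then, in src/txtextracteval/runner.py:
# EXTRACTOR_REGISTRY["dummy"] = DummyExtractor
```

A new transformation follows the same pattern; for example, a hypothetical inversion transform:

```python
# Illustrative only -- e.g., in src/txtextracteval/transforms/opencv_transforms.py
import cv2
import numpy as np


def invert(image: np.ndarray) -> np.ndarray:
    """Invert pixel intensities (dark text on light paper becomes the reverse)."""
    return cv2.bitwise_not(image)


# Then, in src/txtextracteval/runner.py:
# TRANSFORM_REGISTRY["invert"] = invert
```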
- Refactor the approach to using HuggingFace models
- Revisit the testing suite
This project is licensed under the MIT License - see the LICENSE file for details.