New feature: run OCR and embed text in the generated PDF #27

## OCR support in FairScan

[OCR](https://en.wikipedia.org/wiki/Optical_character_recognition) is a commonly requested feature for FairScan, at least among advanced users.
OCR can be quite valuable, but it is also a complex feature to implement well, especially on-device, without compromising UX, performance or reliability.
This issue is mainly about doing OCR right, not just adding an OCR checkbox.
This is not a short-term promise.

## What FairScan aims to provide (ideal outcome)

The goal is to generate **searchable PDFs with an invisible text layer**:
- The scanned page remains a raster image.
- OCR text is embedded invisibly on top of the image.
- Users can:
  - select and copy text,
  - search text inside the PDF,
  - see search highlights at the correct locations.
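For reference, PDF supports this through text rendering mode 3 (the `3 Tr` operator), which draws text with neither fill nor stroke while keeping it selectable and searchable. Below is a minimal, language-neutral sketch in Python of the content-stream operators for one invisible word. The function names and the `F1` font resource are hypothetical, and the sketch assumes ASCII-only text and a font already declared in the page resources; it is not how FairScan generates PDFs today.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text: str, x: float, y: float, font_size: float,
                       font_name: str = "F1") -> str:
    """Build content-stream operators that place `text` invisibly at (x, y).

    (x, y) is the baseline start in PDF points. `font_name` is a hypothetical
    resource name that must exist in the page's font resources.
    """
    return (
        "BT\n"                                   # begin text object
        f"/{font_name} {font_size:g} Tf\n"       # select font and size
        "3 Tr\n"                                 # rendering mode 3: invisible
        f"1 0 0 1 {x:g} {y:g} Tm\n"              # text matrix: translate only
        f"({escape_pdf_string(text)}) Tj\n"      # show the string
        "ET\n"                                   # end text object
    )


print(invisible_text_ops("Total (EUR)", 50, 870, 12))
```

A real implementation would also need to stretch each word to the width of its OCR bounding box (for example with horizontal scaling) so that selection and search highlights line up with the image.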

In the UX, OCR should feel transparent to the user when scanning a document:
- No dedicated OCR screen.
- OCR is triggered automatically, with no user interaction.
- OCR could be mentioned on the export screen, for example to confirm that detected text has been added to the document, e.g. "Text: 42k characters".

OCR will probably be opt-in since many users don't know what it is and might be confused by a longer processing time. It would be activated in the application settings.

## What is explicitly out of scope (at least initially)

To keep UX simple and reliable, the following are **out of scope** for a first OCR version:

- Real-time OCR during capture.
- Handwritten text recognition.
- Perfect character-level placement.
- Automatic language detection.

Non-goals:
- manual OCR tuning per page
- bounding boxes shown to the user
- editable text layer

## Challenges with OCR on Android

Key challenges include:
- OCR engines are very sensitive to image quality.
- Correctly mapping OCR bounding boxes from image space to PDF coordinates.
- Preserving baselines, font sizes and alignment well enough for good text selection.
- Performance on mobile devices.

Poor placement or slow execution may lead to a worse user experience than having no OCR at all.
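To illustrate the coordinate-mapping challenge, here is a minimal sketch in Python (used only as a language-neutral notation). It assumes the simplest case: the scanned image fills the whole PDF page, with no rotation or margins. OCR engines typically report boxes in image pixels with the origin at the top-left and y growing downward, while PDF user space has its origin at the bottom-left with y growing upward. The `Box` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box in image pixels, origin top-left, y pointing down."""
    left: float
    top: float
    right: float
    bottom: float


def image_box_to_pdf(box: Box, image_w: float, image_h: float,
                     page_w: float, page_h: float):
    """Return (x, y, width, height) in PDF points.

    (x, y) is the lower-left corner of the box in PDF user space.
    Assumes the image covers the full page with no rotation.
    """
    sx = page_w / image_w          # horizontal scale: points per pixel
    sy = page_h / image_h          # vertical scale: points per pixel
    x = box.left * sx
    y = page_h - box.bottom * sy   # flip the y axis
    width = (box.right - box.left) * sx
    height = (box.bottom - box.top) * sy
    return (x, y, width, height)


# A word box on a 1000x2000 px image placed on a 500x1000 pt page.
print(image_box_to_pdf(Box(100, 200, 300, 260), 1000, 2000, 500, 1000))
```

Note that the box's lower edge is not the text baseline: descenders sit below it, which is one reason good placement is harder than this scaling suggests.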

## Dependency on image post-processing

OCR results will depend on FairScan's image processing including:
- document detection and cropping,
- contrast / brightness adjustment,
- deskewing,
- sharpening?
- denoising?
- OCR-friendly resolution (300 dpi)

As of today, FairScan has no deskewing, sharpening, or denoising.
Automatic contrast and brightness adjustment should be improved to be more reliable (see https://github.com/pynicolas/FairScan/issues/80).
Management of captured images should be revamped to keep images at their original resolution; this could be done as part of https://github.com/pynicolas/FairScan/issues/70.

Improving post-processing of captured images benefits all users, even without OCR.
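To give a sense of what the 300 dpi target means in pixels, here is a small Python sketch (language-neutral arithmetic; the function names are hypothetical). It assumes the physical page size is known or guessed, which is itself an assumption since a camera capture carries no reliable physical scale. A4 (210 × 297 mm) is used as the example.

```python
MM_PER_INCH = 25.4


def pixels_for_dpi(page_w_mm: float, page_h_mm: float, dpi: int = 300):
    """Pixel dimensions needed to represent a page at the given resolution."""
    return (round(page_w_mm / MM_PER_INCH * dpi),
            round(page_h_mm / MM_PER_INCH * dpi))


def effective_dpi(image_w_px: int, page_w_mm: float) -> float:
    """Resolution actually achieved when an image spans the page width."""
    return image_w_px / (page_w_mm / MM_PER_INCH)


print(pixels_for_dpi(210, 297))           # A4 at 300 dpi
print(round(effective_dpi(1240, 210)))    # a downscaled A4 page
```

An A4 page at 300 dpi needs roughly 2480 × 3508 pixels, so an image downscaled to 1240 pixels wide only reaches about 150 dpi. This is why keeping captured images at their original resolution matters for OCR.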

## OCR engines

Requirements for an OCR engine:
- open source
- offline
- can run on Android without a massive integration effort
- can run in a reasonable amount of time on a mobile device, e.g. < 1 second per page on recent devices
- doesn't have a huge impact on the size of the APK

Known options:
- [Tesseract](https://github.com/tesseract-ocr): usable on Android via [Tesseract4Android](https://github.com/adaptech-cz/Tesseract4Android)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR): [maybe hard to integrate](https://github.com/PaddlePaddle/PaddleOCR/blob/main/deploy/lite/readme.md)

## Managing languages

OCR engines like Tesseract have separate models per language. Those models can be heavy: several megabytes per language for [Tesseract "fast" models](https://github.com/tesseract-ocr/tessdata_fast).
A possible way to manage languages:
- APKs don't include any model
- To activate OCR in the app settings, the user has to trigger the download of one or more language models
- When a document is scanned, all installed language models are used (that might be refined later)

FairScan would then require the Android `INTERNET` permission, so the privacy policy would have to be updated accordingly.
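As a rough sketch of the "use all installed models" idea, assuming Tesseract is the engine: Tesseract stores each model as `<code>.traineddata` and accepts several languages joined with `+` (for example `eng+fra`). The Python below is only a language-neutral illustration; the function names and the directory layout are hypothetical, and an Android implementation would do the equivalent in app storage.

```python
from pathlib import Path


def installed_languages(tessdata_dir) -> list:
    """List language codes for which a .traineddata file is present.

    'osd' is skipped because it is Tesseract's orientation and script
    detection model, not a language.
    """
    codes = (p.stem for p in Path(tessdata_dir).glob("*.traineddata"))
    return sorted(code for code in codes if code != "osd")


def tesseract_language_arg(languages) -> str:
    """Join language codes the way Tesseract expects, e.g. 'eng+fra'."""
    return "+".join(languages)
```

With `eng.traineddata` and `fra.traineddata` downloaded, the resulting language string would be `eng+fra`. Running several languages at once is likely to increase processing time, which is one reason this approach might need refining later.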

