```{contents}
```

## OCR

OCR (**Optical Character Recognition**) is the process of converting **images of text** into **machine-readable digital text**.

Example:

| Input               | Output                  |
| ------------------- | ----------------------- |
| Photo of a receipt  | Structured digital text |
| Scanned PDF         | Searchable text         |
| Handwriting picture | Recognized characters   |

OCR is used for:

* Searching and archiving scanned documents
* Invoice and receipt processing
* Passport/ID scanning
* License plate recognition (ANPR)
* Bank cheque processing
* Digitizing printed books
* Reading text in images with multimodal LLMs (GPT-4o, Gemini)

---

### OCR Pipeline (Step-by-Step)

Modern OCR systems have **four main components**:

---

### **1. Image Preprocessing**

Goal: Improve image quality for recognition.

Techniques:

* Grayscale conversion
* Binarization (Otsu thresholding)
* Noise removal
* Deskewing (fix tilted scans)
* Contrast enhancement
* Resizing
* Smoothing/sharpening

Example:

```
Original → cleaned → easier for model to detect text
```
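As a rough illustration of two of the techniques listed above, the sketch below converts RGB pixels to grayscale and then binarizes them with Otsu's method. It is a minimal pure-Python version that works on a flat list of pixel values. The function names (`to_grayscale`, `otsu_threshold`, `binarize`) and the toy pixel data are assumptions made for this example; a real pipeline would more likely rely on an image library.

```python
def to_grayscale(rgb_pixels):
    """Convert (r, g, b) tuples to 0-255 luminance using the BT.601 weights."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def otsu_threshold(gray_pixels):
    """Return the threshold that maximizes the between-class variance."""
    hist = [0] * 256
    for p in gray_pixels:
        hist[p] += 1

    total = len(gray_pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of the intensities of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the dark class yet
        if w1 == 0:
            break     # no pixels left for the bright class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray_pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in gray_pixels]


# Toy "image": four dark ink pixels followed by six light paper pixels.
rgb = [(20, 20, 20)] * 4 + [(230, 230, 230)] * 6
gray = to_grayscale(rgb)
t = otsu_threshold(gray)
print(t, binarize(gray, t))
```

Otsu's method needs no hand-tuned threshold, which is why it is a common default for scans with a clear ink/paper split. It tends to struggle with uneven lighting, where adaptive (local) thresholding is usually preferred.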

---

### **2. Text Detection (Where is the text?)**

The system finds **regions containing text**.

Modern text detectors:

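* **EAST** (Efficient and Accurate Scene Text Detector)
* **CRAFT** (Character Region Awareness for Text Detection)
* **DB / DBNet** (Differentiable Binarization): a fast segmentation-based detector that also handles curved text
* **YOLO-based text detectors**

**MMOCR (OpenMMLab)** is a toolbox that packages detectors such as DB rather than a detector itself. **TrOCR** is a recognition model, not a detector: it expects an image that has already been cropped to a line of text.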

Output:

* bounding boxes (quadrilateral or rotated)

Example:

```
[ x1, y1, x2, y2, x3, y3, x4, y4 ]
```
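To make the box format above concrete, the following sketch takes one quadrilateral in the flat `[x1, y1, ..., x4, y4]` layout and derives the axis-aligned crop rectangle and the polygon area. The helper names and the sample coordinates are hypothetical; the only assumption about the detector is that it returns the four corners in order around the box.

```python
def quad_to_bbox(quad):
    """Return (left, top, right, bottom) of the smallest upright box around the quad."""
    xs = quad[0::2]  # x1, x2, x3, x4
    ys = quad[1::2]  # y1, y2, y3, y4
    return min(xs), min(ys), max(xs), max(ys)


def quad_area(quad):
    """Polygon area via the shoelace formula (corners must be in order)."""
    points = list(zip(quad[0::2], quad[1::2]))
    twice_area = 0
    for i, (x_a, y_a) in enumerate(points):
        x_b, y_b = points[(i + 1) % len(points)]
        twice_area += x_a * y_b - x_b * y_a
    return abs(twice_area) / 2


# A 100 x 40 pixel text region, corners listed clockwise from the top-left.
quad = [10, 20, 110, 20, 110, 60, 10, 60]
print(quad_to_bbox(quad))
print(quad_area(quad))
```

The upright box is what you would pass to a crop call before recognition. Comparing its area with the polygon area gives a quick hint of how strongly the text is rotated.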

---

### **3. Text Recognition (What does the text say?)**

Once text regions are extracted, they are fed into a **recognition model**.

Two classes:

---

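#### **A. Traditional OCR (Older, feature- and rule-based)**

* Example: Tesseract's legacy engine (developed at HP, later sponsored by Google). Tesseract 4 and later also ship an LSTM-based recognizer, so current versions are no longer purely classical.
* Works best on clean printed text
* Typical techniques:

  * character segmentation
  * template or feature matching
  * HMMs (Hidden Markov Models)

Limitations:

* fails on blurry or angled text
* poor on handwriting
* struggles with decorative or unusual fonts
* not robust on real-world photos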

---

#### **B. Deep Learning OCR (Modern SOTA)**

##### Architectures:

1. **CRNN (Convolutional Recurrent Neural Network)**
   CNN → RNN → CTC decoder
   (Used for scene text)

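2. **Transformer-based OCR**

   * **TrOCR (Microsoft)**: image Transformer encoder with a text Transformer decoder
   * Donut (OCR-free document understanding: maps a document image directly to structured output)
   * LayoutLMv3 (document understanding model that consumes text and boxes from an OCR engine rather than performing OCR itself)

3. **Vision Encoders + Sequence Decoders**
   Encoder: ViT or CNN
   Decoder: autoregressive Transformer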

These models cope far better than classical engines with:

* handwriting
* noisy images
* rotated text
* multilingual content
* natural scenes
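The CTC decoder at the end of the CRNN pipeline above can be sketched as a greedy decode: pick the most likely class per time step, collapse consecutive repeats, then drop blanks. This is a minimal illustration, not the decoder of any particular library. The alphabet, the choice of index 0 as the blank class, and the one-hot frame scores are assumptions made for the example.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    # Most likely class index for every time step.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]

    decoded = []
    previous = blank
    for idx in best_path:
        # Emit a character only when it is not blank and not a repeat
        # of the class predicted at the previous time step.
        if idx != blank and idx != previous:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)


def one_hot_frames(path, num_classes):
    """Helper for the demo: turn a class-index path into per-frame scores."""
    return [[1.0 if i == k else 0.0 for i in range(num_classes)] for k in path]


# Index 0 is the CTC blank, shown here as "-".
alphabet = ["-", "h", "e", "l", "o"]

# Frame-level path "hh-e-ll-l-oo": the blank between the two l-runs keeps both l's.
path = [1, 1, 0, 2, 0, 3, 3, 0, 3, 0, 4, 4]
frames = one_hot_frames(path, len(alphabet))
print(ctc_greedy_decode(frames, alphabet))
```

The blank class is what lets CTC output doubled letters: without a blank between them, two adjacent runs of the same character collapse into one.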

---

### **4. Post-Processing**

Improves text accuracy:

* Spell correction
* Dictionary lookup
* Language modeling
* Layout reconstruction (paragraphs, tables)
* Regex extraction (emails, IDs, prices)

Example:

```
"Ths is an exmple" → "This is an example"
```
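A correction like the one above can be approximated with a dictionary lookup plus fuzzy matching. The sketch below uses `difflib` from the Python standard library for the lookup and a regular expression for field extraction. The tiny vocabulary, the `0.8` similarity cutoff, and the price pattern are illustrative assumptions; production systems typically use much larger dictionaries or a language model.

```python
import difflib
import re

# Hypothetical mini-dictionary; a real system would load a full word list.
VOCABULARY = ["an", "example", "is", "this", "total"]

# Matches dollar amounts such as $12 or $12.50.
PRICE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")


def correct_word(word, vocabulary, cutoff=0.8):
    """Replace a word with its closest dictionary entry, if one is similar enough."""
    lower = word.lower()
    if lower in vocabulary:
        return word
    matches = difflib.get_close_matches(lower, vocabulary, n=1, cutoff=cutoff)
    if not matches:
        return word  # leave unknown tokens (names, numbers) untouched
    fixed = matches[0]
    # Preserve a leading capital letter from the OCR output.
    return fixed.capitalize() if word[0].isupper() else fixed


def correct_text(text, vocabulary):
    """Apply word-level correction to whitespace-separated tokens."""
    return " ".join(correct_word(w, vocabulary) for w in text.split())


def extract_prices(text):
    """Pull price-like strings out of recognized text."""
    return PRICE_PATTERN.findall(text)


print(correct_text("Ths is an exmple", VOCABULARY))
print(extract_prices("Coffee $3.50 Tax $0.28 Totl $3.78"))
```

Keeping the cutoff fairly high avoids "correcting" tokens that are simply absent from the dictionary, such as names or product codes.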

---

### Modern OCR Systems (What Real Organizations Use)

| Tool                  | Type                                  | Notes                                        |
| --------------------- | ------------------------------------- | -------------------------------------------- |
| **Tesseract**         | open-source, classical + LSTM (v4+)   | free, best for clean scans                   |
| **Google Vision API** | deep learning, cloud                  | robust on photos and documents               |
| **AWS Textract**      | deep learning, cloud                  | forms, receipts, tables                      |
| **Azure Read**        | deep learning, cloud                  | handles handwriting                          |
| **PaddleOCR**         | open-source deep learning toolkit     | supports 80+ languages                       |
| **MMOCR**             | open-source deep learning toolbox     | academic & industrial use                    |
| **TrOCR**             | open-source transformer recognizer    | strong on printed and handwritten text lines |

---

### Example: OCR Using Tesseract (Python)

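#### Install:

`pytesseract` is only a Python wrapper. The Tesseract engine itself must be installed separately, for example through your system's package manager.

```bash
pip install pytesseract pillow
```

#### Code: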

```python
import pytesseract
from PIL import Image

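# Load the image with Pillow.
img = Image.open("text_image.jpg")

# Run Tesseract on the image and return the recognized text as a string.
text = pytesseract.image_to_string(img)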

print(text)
```

---

### Example: OCR Using Transformers (TrOCR – SOTA)

#### Install:

```bash
pip install transformers pillow torch
```

#### Code:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch

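# Load the processor (image preprocessing + tokenizer) and the encoder-decoder model.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# TrOCR expects a 3-channel image cropped to a single line of text.
image = Image.open("handwriting.jpg").convert("RGB")

# Convert the image to pixel tensors and generate token IDs without tracking gradients.
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs)

# Decode the generated token IDs back into a string.
text = processor.batch_decode(generated, skip_special_tokens=True)[0]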
print(text)
```

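This checkpoint is fine-tuned for **handwriting** and expects one cropped line of text per image. For full pages or **complex scenes**, run a text detector first and pass each detected line to the model.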

---

### How OCR Models Handle Real-World Challenges

OCR must handle:

* rotated text → rotated boxes
* curved text → segmentation
* low contrast → image enhancement
* handwriting → transformer decoders
* multi-language → multilingual vocab
* scanned PDFs → layout modeling
* natural scenes → CNN + ViT encoders

Modern approaches solve these using:

* attention
* self-supervision
* synthetic data generation
* convolution + transformer hybrids

---

### Final Summary

**OCR systems convert images of text into digital text using a pipeline of:**

1. **Preprocessing**
2. **Text Detection** (EAST, CRAFT, DB)
3. **Text Recognition** (CRNN, Transformers, TrOCR)
4. **Post-processing** (spell correction, layout)

Modern OCR uses **deep learning** and can read:

* handwriting
* curved text
* noisy photos
* multi-language content
* real-world images (street signs, menus, documents)
