# OCR API

# Introduction

This notebook builds an OCR API that extracts text from images, using an open-source model hosted on Hugging Face.

# Model Info

The model used is **Nanonets-OCR** (the `nanonets/Nanonets-OCR-s` checkpoint), an image-to-markdown OCR model. Rather than returning plain text, it converts documents into structured markdown and tags elements such as tables, equations, images, watermarks, and page numbers.

Most publicly available image-to-text models are limited to plain text extraction. They often fail to differentiate between key elements such as watermarks, signatures, or page numbers.

In our use case, the provided images are colorful, which makes it difficult for conventional OCR tools to capture the text accurately. We chose Nanonets-OCR because it extracts text effectively even from complex, colorful images.

For more information about the model, see: [Nanonets-OCR](https://nanonets.com/research/nanonets-ocr-s/)
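To make the semantic tagging concrete, here is a minimal sketch of how a downstream step might separate tagged elements from the body text. It assumes the model emits `<watermark>…</watermark>` and `<page_number>…</page_number>` tags as requested by the prompt used later in this notebook. The function name `split_semantic_tags` and the sample string are hypothetical and not part of the original pipeline.

```python
import re

# Matches the two tag types requested in the OCR prompt.
TAG_PATTERN = re.compile(r"<(watermark|page_number)>(.*?)</\1>", re.DOTALL)


def split_semantic_tags(markdown_text):
    """Return (body, tags): the text without tagged spans, plus the tag contents."""
    tags = {"watermark": [], "page_number": []}

    def _collect(match):
        # Record the tag content and drop the tagged span from the body.
        tags[match.group(1)].append(match.group(2).strip())
        return ""

    body = TAG_PATTERN.sub(_collect, markdown_text)
    # Collapse the blank lines left behind by removed tags.
    body = re.sub(r"\n{3,}", "\n\n", body).strip()
    return body, tags


# Hypothetical model output, for illustration only.
sample = (
    "<watermark>OFFICIAL COPY</watermark>\n"
    "# Title\n"
    "Some text.\n"
    "<page_number>9/22</page_number>"
)
body, tags = split_semantic_tags(sample)
print(body)
print(tags)
```

Keeping the tag contents in a dictionary instead of discarding them lets later stages decide whether page numbers or watermarks matter for their task.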

# Use

We wrapped the model in a simple **Flask API**, which the next stage of our pipeline calls. Because the model runs best on a GPU, exposing it as an API lets it run on a GPU machine while the other stages call it over HTTP.

For setup instructions, please refer to the README file.
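As a rough illustration of how a later pipeline stage might call the API, the sketch below posts an image to the `/ocr` route defined further down in this notebook. That route expects a multipart form field named `image` and returns JSON with an `extracted_text` key. The helper names `build_multipart` and `request_ocr`, the timeout value, and the example URL are assumptions made for illustration. The sketch uses only the standard library.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field_name, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def request_ocr(base_url, image_path, timeout=600):
    """POST an image to <base_url>/ocr and return the extracted text."""
    path = Path(image_path)
    body, content_type = build_multipart("image", path.name, path.read_bytes())
    req = urllib.request.Request(
        base_url.rstrip("/") + "/ocr",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for the API's 400 and 500 responses.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["extracted_text"]


# Example call (placeholder URL; substitute the tunnel address printed by the server):
# text = request_ocr("https://<your-subdomain>.ngrok-free.app", "page.png")
```

The generous timeout is a guess: generation on a large page can take a while, so a client with a short default timeout may give up before the server responds.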


In [1]:
!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.4.0-py3-none-any.whl.metadata (8.1 kB)
Downloading pyngrok-7.4.0-py3-none-any.whl (25 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.4.0


# Ngrok Token Setup

Copy your **ngrok authtoken** from the ngrok dashboard:

https://dashboard.ngrok.com/get-started/your-authtoken

Then replace the `<YOUR_NGROK_AUTH_TOKEN>` placeholder in the command below with it.



In [None]:
!ngrok config add-authtoken <YOUR_NGROK_AUTH_TOKEN>  # Replace this token with your token

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml                                
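As an alternative to pasting the token into a notebook cell, the token could be read from an environment variable and registered through pyngrok. This is only a sketch: the variable name `NGROK_AUTHTOKEN` and the helper names `read_ngrok_token` and `configure_ngrok` are assumptions, while `ngrok.set_auth_token` is pyngrok's call for storing the token.

```python
import os


def read_ngrok_token(env=None, var_name="NGROK_AUTHTOKEN"):
    """Return the authtoken from the environment, rejecting empty or placeholder values."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token or token.startswith("<"):
        raise RuntimeError(f"Set {var_name} to your ngrok authtoken first")
    return token


def configure_ngrok(token):
    """Register the token with pyngrok (imported lazily; installed in the first cell)."""
    from pyngrok import ngrok

    ngrok.set_auth_token(token)


# Typical use inside the notebook:
# configure_ngrok(read_ngrok_token())
```

Reading the token from the environment keeps it out of the saved notebook, which matters if the notebook is shared or committed.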


In [3]:
from flask import Flask, request, jsonify
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
import torch
import os

# Load model and processor once when app starts
model_path = "nanonets/Nanonets-OCR-s"

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

2025-10-02 17:17:37.238923: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1759425457.598081      36 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1759425457.705817      36 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [None]:
from pyngrok import ngrok

# Reuse `model`, `tokenizer`, and `processor` loaded in the previous cell;
# loading the checkpoint a second time only costs time and GPU memory.

# Flask app
app = Flask(__name__)

def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. 
    Return the tables in html format. Return the equations in LaTeX representation. 
    If there is an image in the document and image caption is not present, 
    add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. 
    Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. 
    Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. 
    Prefer using ☐ and ☑ for check boxes."""

    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    
    output_text = processor.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    return output_text[0]

@app.route("/")
def home():
    return "Welcome to OCR LLM!"


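from werkzeug.utils import secure_filename  # Werkzeug is installed with Flask


@app.route("/ocr", methods=["POST"])
def ocr_api():
    if "image" not in request.files:
        return jsonify({"error": "No image uploaded"}), 400

    image_file = request.files["image"]
    # Sanitize the client-supplied name so it cannot escape the uploads directory
    filename = secure_filename(image_file.filename or "")
    if not filename:
        return jsonify({"error": "Empty or invalid filename"}), 400

    os.makedirs("uploads", exist_ok=True)
    image_path = os.path.join("uploads", filename)
    image_file.save(image_path)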

    try:
        result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
        return jsonify({"extracted_text": result})
    except Exception as e:
        return jsonify({"error": str(e)}), 500
    finally:
        if os.path.exists(image_path):
            os.remove(image_path)  # cleanup temp file


if __name__ == "__main__":
    # app.run(host="0.0.0.0", port=5000)
    public_url = ngrok.connect(5000)
    print(f" * ngrok tunnel available at {public_url}")
    app.run(debug=False)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

 * ngrok tunnel available at NgrokTunnel: "https://fcb2c5a30639.ngrok-free.app" -> "http://localhost:5000"
 * Serving Flask app '__main__'
 * Debug mode: off


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.