# Nemotron Parse and NeMoRetriever OCR Evaluation with NV Ingest

**Important Notes**: 
1. To run this notebook, you need to create an NGC account and obtain an API key.

# Introduction

NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. For more information, see [What is NVIDIA Ingest?](https://docs.nvidia.com/nemo/retriever/extraction/overview/).

# Architecture

NV Ingest architecture

![arch](./nv_ingest_image.png)

# Getting started

## Clone the repository and log into Docker

To spin up this blueprint, you need an NGC API key. After you get your API key, paste it into the `.env` file and run the cell below.

In [3]:
%%bash

# Paste your NGC API key in .env file before running the cell below

[ -d "nv-ingest" ] || git clone https://github.com/nvidia/nv-ingest

cd nv-ingest
git checkout 25.9.0

# Replace the placeholder with your own key. Never commit or share a real key.
export NGC_API_KEY="<YOUR_NGC_API_KEY>"

cat > .env << 'EOF'
NGC_API_KEY=<YOUR_NGC_API_KEY>
DATASET_ROOT=./data
NV_INGEST_ROOT=./nv-ingest

# NeMo Retriever OCR Configuration (high-performance OCR)
# Reference: https://build.nvidia.com/nvidia/nemoretriever-ocr-v1/deploy
OCR_IMAGE=nvcr.io/nim/nvidia/nemoretriever-ocr-v1
OCR_TAG=1.2.1
OCR_MODEL_NAME=scene_text_ensemble
EOF
echo "Created .env file"

echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin

HEAD is now at b63eecea docs: Fix helm link in release notes (#1037)


Created .env file
Login Succeeded
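
Rather than hardcoding the key in a notebook cell, the values can be read back from the `.env` file when they are needed. The following is a minimal sketch, assuming the file contains only simple `KEY=VALUE` lines and `#` comments; `parse_env_text` is a hypothetical helper written for this example and is not part of NV-Ingest.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and `#` comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder content for illustration only (not a real key).
sample = """
# NGC credentials
NGC_API_KEY=<YOUR_NGC_API_KEY>
DATASET_ROOT=./data
"""
env = parse_env_text(sample)
print(sorted(env))
```

In the notebook, the same function could be applied to the contents of `nv-ingest/.env`, for example with `pathlib.Path("nv-ingest/.env").read_text()`.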


## Spin up the Blueprint

Choose **ONE** of the following two options to spin up NV-Ingest:

- **Option A**: Standard NIM Pipeline (uses YOLOX + Deplot + NeMo Retriever OCR)
- **Option B**: NemoRetriever Parse Pipeline (uses state-of-the-art NemoRetriever Parse model)

---

## Option A: Standard NIM Pipeline

This option uses the multi-NIM approach with specialized models for each task:
- **YOLOX**: Table structure detection
- **Deplot**: Chart extraction  
- **NeMo Retriever OCR**: High-performance text extraction

**NOTE**: This step can take about 10 minutes. The following steps are how we suggest running and monitoring your progress.

1. Open up a new terminal window and run the following:

```bash
cd nv-ingest
docker compose --profile retrieval --profile table-structure up
```

This will run each service and output persistent logs.

2. In a second terminal window, run:

```bash
cd nv-ingest
docker compose logs -f nv-ingest-ms-runtime
```

This will show you persistent logs for the main `nv-ingest` service.

**To stop this deployment later:**
```bash
cd nv-ingest
docker compose --profile retrieval --profile table-structure down
```

---

## Option B: NemoRetriever Parse Pipeline

This option uses the **NemoRetriever Parse** model for state-of-the-art PDF extraction:
- Single model handles text, tables, and document structure
- Generally better accuracy for complex documents
- Higher GPU memory requirements

**Reference**: [NemoRetriever Parse Documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/nemoretriever-parse/)

**NOTE**: This step can take about 10-15 minutes on first startup as NIM containers pull and load models.

1. Open up a new terminal window and run the following:

```bash
cd nv-ingest
docker compose --profile retrieval --profile table-structure --profile nemoretriever-parse up
```

This will run each service including NemoRetriever Parse and output persistent logs.

2. In a second terminal window, run:

```bash
cd nv-ingest
docker compose logs -f nv-ingest-ms-runtime
```

This will show you persistent logs for the main `nv-ingest` service.

**To stop this deployment later:**
```bash
cd nv-ingest
docker compose --profile retrieval --profile table-structure --profile nemoretriever-parse down
```
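
The two options differ only in the additional `nemoretriever-parse` profile. As a sketch, the commands shown above can be assembled from a list of profiles; `compose_command` is a hypothetical helper for this notebook, and the profile names are taken from the commands above.

```python
import shlex

# Profiles shared by both options, as used in the commands above.
BASE_PROFILES = ["retrieval", "table-structure"]


def compose_command(action: str, use_parse: bool = False) -> list[str]:
    """Build the `docker compose` argument list for Option A or Option B."""
    profiles = list(BASE_PROFILES)
    if use_parse:
        profiles.append("nemoretriever-parse")
    cmd = ["docker", "compose"]
    for profile in profiles:
        cmd += ["--profile", profile]
    cmd.append(action)
    return cmd


print(shlex.join(compose_command("up")))
print(shlex.join(compose_command("down", use_parse=True)))
```

The returned list can be passed to `subprocess.run(..., cwd="nv-ingest")` if you prefer to start or stop the deployment from Python.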

---


### Checking for completion
The services have started correctly if the first part of your `nv-ingest-ms-runtime` logs looks similar to the following:

```bash
nv-ingest-ms-runtime-1  | INFO:     Uvicorn running on http://0.0.0.0:7670 (Press CTRL+C to quit)
nv-ingest-ms-runtime-1  | INFO:     Started parent process [20]
nv-ingest-ms-runtime-1  | INFO:     Started server process [40]
```

and the final lines look similar to the following:

```bash
nv-ingest-ms-runtime-1  | 2024-10-16 02:36:11,162 - DEBUG - parent_receive started child_thread
nv-ingest-ms-runtime-1  | 2024-10-16 02:36:11,162 - DEBUG - parent_receive started child_thread
nv-ingest-ms-runtime-1  | 2024-10-16 02:36:11,163 - DEBUG - parent_receive started child_thread
```
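
If you would rather check the logs programmatically than read them by eye, a simple approach is to search for the Uvicorn startup line shown above. This is a minimal sketch; `runtime_is_ready` is a hypothetical helper, and treating that single line as the readiness signal is an assumption rather than a documented guarantee.

```python
# Marker taken from the example startup logs above.
READY_MARKER = "Uvicorn running on"


def runtime_is_ready(log_text: str, marker: str = READY_MARKER) -> bool:
    """Return True if any log line contains the startup marker."""
    return any(marker in line for line in log_text.splitlines())


sample_logs = (
    "nv-ingest-ms-runtime-1  | INFO:     Uvicorn running on http://0.0.0.0:7670 (Press CTRL+C to quit)\n"
    "nv-ingest-ms-runtime-1  | INFO:     Started parent process [20]\n"
)
print(runtime_is_ready(sample_logs))
```

The log text could come from `docker compose logs nv-ingest-ms-runtime`, captured with `subprocess.run(..., capture_output=True, text=True)`.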

After everything is up and running, run `docker ps` and `nvidia-smi` to check the containers and GPU usage.

In [5]:
!docker ps

CONTAINER ID   IMAGE                        COMMAND                  CREATED         STATUS         PORTS                                         NAMES
54e72e9297eb   zilliz/attu:v2.3.5           "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes   0.0.0.0:3001->3000/tcp, [::]:3001->3000/tcp   milvus-attu
28563a922382   milvusdb/milvus:v2.5.3-gpu

After you run `docker ps` you should see output similar to the following that lists your container images and the status of each. If any status includes `starting`, wait for the container to start before you proceed.

```text
CONTAINER ID   IMAGE                                        COMMAND                 CREATED            STATUS                      PORTS          
9869e432cc04   zilliz/attu:v2.3.5                           "docker-entrypoint.s…"  About an hour ago  Up About an hour            0.0.0.0:3001...
e02baf85ccc5   otel/opentelemetry-collector-contrib:0.91.0  "/otelcol-contrib --…"  About an hour ago  Up About an hour            0.0.0.0:4317...
4c3be36de11b   milvusdb/milvus:v2.5.3-gpu                   "/tini -- milvus run…"  About an hour ago  Up About an hour (healthy)  0.0.0.0:9091...
...
```
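
The "wait until nothing is `starting`" check can also be scripted. The sketch below assumes the input comes from `docker ps --format "{{.Names}}\t{{.Status}}"`, which prints one tab-separated name and status per line; `containers_still_starting` is a hypothetical helper, and the second container name in the sample is made up for illustration.

```python
def containers_still_starting(ps_output: str) -> list[str]:
    """Return the names of containers whose status still mentions 'starting'."""
    pending = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        if "starting" in status.lower():
            pending.append(name)
    return pending


# Sample input in the assumed "name<TAB>status" format.
sample_ps = (
    "milvus-attu\tUp 5 minutes\n"
    "example-nim-1\tUp 10 seconds (health: starting)\n"
)
print(containers_still_starting(sample_ps))
```

An empty list means no container reports a `starting` status, so it is reasonable to proceed.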


In [None]:
!nvidia-smi


# Interacting with NV Ingest

There are two ways to interact with `nv-ingest`: a Python client and a CLI. Let's use the Python client first.

## Installing the Python Client

In [2]:
!pip install nv-ingest-client==25.9.0 pymilvus[bulk_writer,model] minio tritonclient langchain_milvus

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
    python3-xyz, where xyz is the package you are trying to
    install.

    If you wish to install a non-Debian-packaged Python package,
    create a virtual environment using python3 -m venv path/to/venv.
    Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
    sure you have python3-full installed.

    If you wish to install a non-Debian packaged Python application,
    it may be easiest to use pipx install xyz, which will manage a
    virtual environment for you. Make sure you have pipx installed.

    See /usr/share/doc/python3.12/README.venv for more information.

note: If you believe this is a mistake, please contact your Python installation or OS dist

## Using the Python client

Each ingest job will include a set of stages. These stages define and configure the operations that will be performed during ingestion of the specified input files.

- `extract` : Performs multimodal extractions from a document, including text, images, and tables.
- `split` : Chunks the text into smaller pieces, which is useful for storing in a vector database for retrieval applications.
- `dedup` : Identifies duplicate images in a document so they can be filtered out to reduce data redundancy.
- `filter` : Filters out images that are likely not useful, based on heuristics such as size and aspect ratio.
- `embed` : Passes the text or table extractions through the `nvidia/nv-embedqa-e5-v5` NIM to obtain their embeddings.
- `store` : Saves the extracted tables or images to MinIO, the object store used by Milvus.

In [1]:
from nv_ingest_client.client import Ingestor

# Load a sample PDF to demonstrate NV-Ingest usage.
ingestor = (
    Ingestor(message_client_hostname="host.docker.internal", message_client_port=7670)
    # .files() accepts a single path, a list of paths, or a wildcard pattern such as /some/path/*.pdf
    .files("./nv-ingest/data/multimodal_test.pdf")
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=True,
    )
    .split(
        tokenizer="meta-llama/Llama-3.2-1B",
        chunk_size=1024,
        chunk_overlap=150,
    )
    # Compute embeddings for the text and table extractions.
    .embed(text=True, tables=True)
)

# The result is a List[List[Dict]]: each outer item is a file, and each inner item is an element extracted from that file.
generated_metadata = ingestor.ingest()

ModuleNotFoundError: No module named 'nv_ingest_client'
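
The `ModuleNotFoundError` above means the client package is not installed in the environment the kernel is running in, which is consistent with the `externally-managed-environment` error from the install cell. One way to confirm what is missing before running the pipeline is sketched below. `missing_modules` is a hypothetical helper, and the import names other than `nv_ingest_client` are assumed to match the packages in the install command.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if find_spec(name) is None]


# Import names assumed to correspond to the packages in the pip command above.
required = ["nv_ingest_client", "pymilvus", "minio", "tritonclient", "langchain_milvus"]
print(missing_modules(required))
```

If the list is not empty, installing the packages into a virtual environment (as the pip error message suggests) and pointing the notebook kernel at that environment is one way to resolve it.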

In [13]:
from nv_ingest_client.util.process_json_files import ingest_json_results_to_blob

# generated_metadata is the result of a batch of submitted files. We sample the first file metadata here for demonstration purposes.
ingest_json_results_to_blob(generated_metadata[0])

'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs. Section One\r\nThis is the first section of the document. It has some more placeholder text to show how \r\nthe document looks like. The text is not meant to be meaningful or informat

## Explore the Outputs

Let's explore elements of the NV-Ingest output. When data flows through an NV-Ingest pipeline, a number of extractions and transformations are performed. As the data is enriched, it is stored in a rich metadata hierarchy. The end result is a list of dictionaries, each of which represents one extracted element. The items most commonly pulled from a dictionary in this hierarchy are the extracted content and the text representation of that content. The next few cells demonstrate interacting with the metadata, pulling out these elements, and visualizing them. Note that a value of -1 in a positional field means that positional information is not applicable; non-negative values represent valid positional data.

For a more complete description of the metadata elements, see the [data dictionary](https://github.com/NVIDIA/nv-ingest/blob/main/docs/docs/extraction/content-metadata.md).
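
Before inspecting individual elements, it can help to see how many elements of each type a run produced. The sketch below walks the nested list-of-lists structure described above and tallies the top-level `document_type` of each element. `count_by_type` is a hypothetical helper, and `sample_results` is made-up data that only mimics the shape of the real output.

```python
from collections import Counter


def count_by_type(results: list[list[dict]]) -> Counter:
    """Count extracted elements by their top-level `document_type`."""
    counts = Counter()
    for file_elements in results:      # one inner list per input file
        for element in file_elements:  # one dict per extracted element
            counts[element.get("document_type", "unknown")] += 1
    return counts


# Made-up data with the same shape as the ingest result.
sample_results = [[
    {"document_type": "text", "metadata": {}},
    {"document_type": "structured", "metadata": {}},
    {"document_type": "structured", "metadata": {}},
]]
print(count_by_type(sample_results))
```

With real results, the call would be `count_by_type(generated_metadata)`.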

In [14]:
import copy


def redact_metadata_helper(metadata: dict) -> dict:
    """Return a copy of `metadata` with `metadata["content"]` and `metadata["embedding"]` redacted for readability."""

    # Deep-copy so the nested "metadata" dict of the original result is not overwritten.
    text_metadata_redact = copy.deepcopy(metadata)
    text_metadata_redact["metadata"]["content"] = "<---Redacted for readability--->"
    text_metadata_redact["metadata"]["embedding"] = "<---Redacted for readability--->"

    return text_metadata_redact

## Explore Output - Text

This cell depicts the full metadata hierarchy for a text extraction with redacted content to ease readability. Notice the following sections are populated with information:

- `content` - The raw extracted content, text in this case - this section will always be populated with a successful job.
- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.
- `source_metadata` - Describes the source document that is the basis of the ingest job.
- `text_metadata` - Contains information about the text extraction, including the detected language, among others - this section will only exist when `metadata['content_metadata']['type'] == 'text'`.

In [15]:
redacted_text_metadata = redact_metadata_helper(generated_metadata[0][0])  # First file, first element of elements found within file [0][0]. There are 9 total
redacted_text_metadata

{'document_type': 'text',
 'metadata': {'content': '<---Redacted for readability--->',
  'content_url': '',
  'embedding': '<---Redacted for readability--->',
  'source_metadata': {'source_name': './nv-ingest/data/multimodal_test.pdf',
   'source_id': './nv-ingest/data/multimodal_test.pdf',
   'source_location': '',
   'source_type': 'PDF',
   'collection_id': '',
   'date_created': '2025-06-17T18:13:04.715946',
   'last_modified': '2025-06-17T18:13:04.715805',
   'summary': '',
   'partition_id': -1,
   'access_level': -1},
  'content_metadata': {'type': 'text',
   'description': 'Unstructured text from PDF document.',
   'page_number': -1,
   'hierarchy': {'page_count': 3,
    'page': -1,
    'block': -1,
    'line': -1,
    'span': -1,
    'nearby_objects': {'text': {'content': [], 'bbox': [], 'type': []},
     'images': {'content': [], 'bbox': [], 'type': []},
     'structured': {'content': [], 'bbox': [], 'type': []}}},
   'subtype': ''},
  'audio_metadata': None,
  'text_metadata

## Explore Output - Tables

This cell depicts the full metadata hierarchy for a table extraction with redacted content to ease readability. Notice the following sections are populated with information:

- `content` - The raw extracted content, a base64 encoded image of the extracted table in this case - this section will always be populated with a successful job.
- `content_metadata` - Describes the type of extraction and its position in the broader document - this section will always be populated with a successful job.
- `source_metadata` - Describes the source and storage path of an extracted table in an S3 compliant object store.
- `table_metadata` - Contains the text representation of the table, positional data, and other useful elements - this section will only exist when `metadata['content_metadata']['type'] == 'structured'`.

Note that `table_metadata` stores both chart and table extractions. They are distinguished by `metadata['content_metadata']['subtype']`.
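
Because tables and charts share the `structured` document type, separating them means grouping on the `subtype` field. The following is a minimal sketch under that assumption. `group_structured_by_subtype` is a hypothetical helper, the sample elements are made up, and the `'chart'` subtype value is assumed (only `'table'` appears in the output shown in this notebook).

```python
from collections import defaultdict


def group_structured_by_subtype(elements: list[dict]) -> dict[str, list[dict]]:
    """Group `structured` elements by `metadata.content_metadata.subtype`."""
    groups = defaultdict(list)
    for element in elements:
        if element.get("document_type") != "structured":
            continue  # skip text, image, and other element types
        subtype = element["metadata"]["content_metadata"].get("subtype", "")
        groups[subtype].append(element)
    return dict(groups)


# Made-up elements that mimic the shape of one file's results.
sample_elements = [
    {"document_type": "text", "metadata": {"content_metadata": {"subtype": ""}}},
    {"document_type": "structured", "metadata": {"content_metadata": {"subtype": "table"}}},
    {"document_type": "structured", "metadata": {"content_metadata": {"subtype": "chart"}}},
]
groups = group_structured_by_subtype(sample_elements)
print({subtype: len(items) for subtype, items in groups.items()})
```

With real results, the helper would be applied to one file at a time, for example `group_structured_by_subtype(generated_metadata[0])`.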

In [16]:
redacted_table_metadata = redact_metadata_helper(generated_metadata[0][2])  # First file, third element within file [0][2]. There are 9 total
redacted_table_metadata

{'document_type': 'structured',
 'metadata': {'content': '<---Redacted for readability--->',
  'content_url': '',
  'embedding': '<---Redacted for readability--->',
  'source_metadata': {'source_name': './nv-ingest/data/multimodal_test.pdf',
   'source_id': './nv-ingest/data/multimodal_test.pdf',
   'source_location': '',
   'source_type': 'PDF',
   'collection_id': '',
   'date_created': '2025-06-17T18:13:04.715946',
   'last_modified': '2025-06-17T18:13:04.715805',
   'summary': '',
   'partition_id': -1,
   'access_level': -1},
  'content_metadata': {'type': 'structured',
   'description': 'Structured table extracted from PDF document.',
   'page_number': 0,
   'hierarchy': {'page_count': 3,
    'page': 0,
    'block': -1,
    'line': -1,
    'span': -1,
    'nearby_objects': {'text': {'content': [], 'bbox': [], 'type': []},
     'images': {'content': [], 'bbox': [], 'type': []},
     'structured': {'content': [], 'bbox': [], 'type': []}}},
   'subtype': 'table'},
  'audio_metadata'

## Using the CLI

The CLI is another way to interact with `nv-ingest`. Notice that the tasks are encoded in the `--task` flag. The outputs are stored in a `processed_docs` folder.

In [17]:
%%bash

nv-ingest-cli \
  --doc nv-ingest/data/multimodal_test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_tables": "true", "extract_images": "true", "extract_charts": "true"}' \
  --client_host=host.docker.internal \
  --client_port=7670

python-dotenv could not parse statement starting at line 1
python-dotenv could not parse statement starting at line 2
python-dotenv could not parse statement starting at line 3
python-dotenv could not parse statement starting at line 4
python-dotenv could not parse statement starting at line 5
python-dotenv could not parse statement starting at line 6
python-dotenv could not parse statement starting at line 7
python-dotenv could not parse statement starting at line 8
python-dotenv could not parse statement starting at line 9
python-dotenv could not parse statement starting at line 10
python-dotenv could not parse statement starting at line 11
python-dotenv could not parse statement starting at line 12
python-dotenv could not parse statement starting at line 13
python-dotenv could not parse statement starting at line 14
python-dotenv could not parse statement starting at line 15
python-dotenv could not parse statement starting at line 16
python-dotenv could not parse statement starting 
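
The `--task` value in the CLI call above is a task name followed by a JSON object, which is easy to mistype when quoting by hand. As a sketch, the value can be generated with `json.dumps` so the JSON is always well formed. `build_task_value` is a hypothetical helper, and the options are copied from the command above.

```python
import json


def build_task_value(task_name: str, options: dict) -> str:
    """Return a `<task>:<json>` string suitable for the CLI's --task flag."""
    return f"{task_name}:{json.dumps(options)}"


# Options copied from the CLI example above.
extract_options = {
    "document_type": "pdf",
    "extract_method": "pdfium",
    "extract_tables": "true",
    "extract_images": "true",
    "extract_charts": "true",
}
task_value = build_task_value("extract", extract_options)
print(f"--task={task_value}")
```

When the command is launched with `subprocess.run` and an argument list, the string can be passed as a single argument without any extra shell quoting.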