# Reranking top pages from PDF using LlamaParse and ZeroEntropy

In this guide, we’ll build a simple workflow that parses PDF documents into text with LlamaParse, indexes the pages in ZeroEntropy, queries for the top pages, and reranks the results.

---

### Prerequisites
- Python 3.8+
- `zeroentropy` client (`pip install zeroentropy`)
- `llama_cloud_services` client (`pip install llama_cloud_services`)
- A ZeroEntropy API key ([Get yours here](https://dashboard.zeroentropy.dev))
- A LlamaParse API key ([Get yours here](https://docs.cloud.llamaindex.ai/api_key))
- A .env file with the following: 

```bash
ZEROENTROPY_API_KEY=your_api_key_here
LLAMAPARSE_API_KEY=your_api_key_here
```
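
A quick way to catch a missing or misnamed key before any API call is to check the environment up front. The sketch below uses only the standard library. `missing_keys` is a hypothetical helper, not part of either SDK, and it assumes the variables have already been loaded into the environment (for example with `python-dotenv`).

```python
import os

# Variable names taken from the .env file described above
REQUIRED_KEYS = ("ZEROENTROPY_API_KEY", "LLAMAPARSE_API_KEY")

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]

# Check the real environment and report what is missing, if anything
missing = missing_keys()
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching real credentials.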

---

### What You’ll Learn
- How to use LlamaParse to convert PDF documents into usable text
- How to use ZeroEntropy to semantically index the text docs
- How to query your docs using semantic search (top pages)
- How to rerank your results using the reranker

---

### Directory Structure

This guide expects a directory like this:

```bash
zcookbook/
├── guides/
│   └── reranker_quickstart/
│       ├── rerank_llamaparsed_pages.ipynb
│       └── sample_docs/
│           ├── doc1.pdf
│           ├── doc2.pdf
│           ├── doc3.pdf
│           └── ...
├── LICENSE
└── README.md
```

### Setting up your ZeroEntropy and LlamaParse clients

First, install dependencies:

```bash
!pip install zeroentropy python-dotenv llama_cloud_services
```

Now load your API keys and initialize the clients:

In [None]:
from zeroentropy import AsyncZeroEntropy, ConflictError
from llama_cloud_services import LlamaParse
from dotenv import load_dotenv
import os

# Load ZEROENTROPY_API_KEY and LLAMAPARSE_API_KEY from the .env file
load_dotenv()

api_key = os.getenv("ZEROENTROPY_API_KEY")
if not api_key:
    raise ValueError("API Key not found. Make sure your .env file has ZEROENTROPY_API_KEY.")

# We use the AsyncZeroEntropy client so that multiple documents can be added concurrently.
# If you only need to add a single document, you can use the synchronous client instead.
zclient = AsyncZeroEntropy(api_key=api_key)

# Initialize the LlamaParse client, which parses the PDF documents into text
llama_api_key = os.getenv("LLAMAPARSE_API_KEY")
if not llama_api_key:
    raise ValueError("API Key not found. Make sure your .env file has LLAMAPARSE_API_KEY.")
llamaParser = LlamaParse(
    api_key=llama_api_key,
    num_workers=1,       # if multiple files passed, split in `num_workers` API calls
    result_type="text",
    verbose=True,
    language="en",       # optionally define a language, default=en
)


### Adding a collection to the ZeroEntropy client

In [48]:
collection_name = "pdf_docs_demo_vn"
await zclient.collections.add(collection_name=collection_name)

Now define a function to acquire the paths for all the PDF files:

In [None]:
def get_file_names(directory_path):
    try:
        # Check if path exists and is a directory
        if not os.path.exists(directory_path):
            raise FileNotFoundError(f"Directory not found: {directory_path}")
        if not os.path.isdir(directory_path):
            raise NotADirectoryError(f"Path is not a directory: {directory_path}")
            
        # Get list of PDF files (excluding directories and other file types)
        file_names = [
            os.path.join(directory_path, f)
            for f in os.listdir(directory_path)
            if os.path.isfile(os.path.join(directory_path, f))
            and f.lower().endswith(".pdf")
        ]
        
        return file_names
    
    except PermissionError:
        raise PermissionError(f"Permission denied accessing directory: {directory_path}")
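
A more compact variant of the same idea can be written with `pathlib`. This is only a sketch: `list_pdf_paths` is a hypothetical name, and sorting the result is an added assumption so that the order is the same on every run.

```python
from pathlib import Path

def list_pdf_paths(directory_path):
    """Return the sorted paths of the .pdf files directly inside `directory_path`."""
    directory = Path(directory_path)
    if not directory.is_dir():
        raise NotADirectoryError(f"Path is not a directory: {directory_path}")
    # Keep regular files only, and match the extension case-insensitively
    return sorted(
        str(path)
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Calling `list_pdf_paths("./sample_docs")` would then return the same kind of list that `get_file_names` produces.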

Let’s use LlamaParse to parse all .pdf files in our sample folder into text:

In [50]:
folder_path = "./sample_docs"
file_names = get_file_names(folder_path)
print(file_names)
text_data = llamaParser.parse(file_names)

['./sample_docs/dashboard-sp-500-factor.pdf', './sample_docs/annual-report-multi-page.pdf', './sample_docs/S-P-Global-2024-PageFifty.pdf', './sample_docs/S-P-Global-2024-PageNine.pdf', './sample_docs/annual-report-sg-en-spy-PageSeven.pdf']


Getting job results: 100%|██████████| 5/5 [02:30<00:00, 30.02s/it]


### Organizing your documents

Once parsing finishes, we build a list of documents, each represented as a list of its page texts.

In [57]:
# docs[i] is the list of page texts for the i-th parsed document
docs = [[page.text for page in doc.pages] for doc in text_data]

# Print the 11th page of the first document
print(docs[0][10])

For use with institutions only, not for use with retail investors.


                                                                                                                                                                                                                                             Index Dashboard: S&P 500® Factor Indices

S&P 500 Quality FCF Aristocrats                                                                                                                                                                                                                                             June 2025
Description
The S&P 500 Quality FCF Aristocrats Index measures the performance of companies in the S&P 500 that have had positive free cash flow (FCF) for at least 10 consecutive years
and simultaneously have high FCF margin and high FCF return on invested capital (ROIC). As of June 30, 2025 the index comprised 99 constituents.

Index Statistics                1M      3M 
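
Indexing a fixed page such as `docs[0][10]` raises `IndexError` when the first document has fewer than 11 pages, so it can help to inspect the structure first. The sketch below uses stand-in objects that mimic the `.pages` and `.text` attributes used above, so it runs without calling LlamaParse. `pages_as_text` and `summarize_docs` are hypothetical helper names.

```python
from types import SimpleNamespace

def pages_as_text(parsed_docs):
    """Convert parsed results into a list of documents, each a list of page strings."""
    return [[page.text for page in doc.pages] for doc in parsed_docs]

def summarize_docs(docs, file_names):
    """Return (file name, page count, total characters) for each document."""
    return [
        (name, len(pages), sum(len(text) for text in pages))
        for name, pages in zip(file_names, docs)
    ]

# Stand-in data shaped like the parser output (made-up content)
fake_parsed = [
    SimpleNamespace(pages=[SimpleNamespace(text="page one"), SimpleNamespace(text="page two")]),
    SimpleNamespace(pages=[SimpleNamespace(text="only page")]),
]
docs_demo = pages_as_text(fake_parsed)
for row in summarize_docs(docs_demo, ["a.pdf", "b.pdf"]):
    print(row)
```

With real data, `summarize_docs(docs, file_names)` gives a quick overview of how many pages each PDF produced.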

### Uploading documents to ZeroEntropy
We’ll now define functions that upload the documents as text pages asynchronously.

In [58]:
import asyncio
from tqdm.asyncio import tqdm

sem = asyncio.Semaphore(16)
async def add_document_with_pages(collection_name: str, filename: str, pages: list, doc_index: int):
    """Add a single document with multiple pages to the collection."""
    async with sem:  # Limit concurrent operations
        for retry in range(3):  # Retry logic
            try:
                response = await zclient.documents.add(
                    collection_name=collection_name,
                    path=filename,  # Use the actual filename as path
                    content={
                        "type": "text-pages",
                        "pages": pages  # Send list of strings directly
                    }
                )
                return response
            except ConflictError:
                print(f"Document '{filename}' already exists in collection '{collection_name}'")
                break
            except Exception as e:
                if retry == 2:  # Last retry
                    print(f"Failed to add document '{filename}': {e}")
                    return None
                await asyncio.sleep(0.1 * (retry + 1))  # Linear backoff: 0.1 s, then 0.2 s

async def upload_documents_async(docs: list, file_names: list, collection_name: str):
    """
    Upload documents asynchronously to ZeroEntropy collection.
    
    Args:
        docs: 2D array where docs[i] contains the list of pages (strings) for document i
        file_names: Array where file_names[i] contains the path for document i
        collection_name: Name of the collection to add documents to
    """
    
    # Validate input arrays have same length
    if len(docs) != len(file_names):
        raise ValueError("docs and file_names must have the same length")
    
    # Print starting message
    print(f"Starting upload of {len(docs)} documents...")
    
    # Create tasks for all documents
    tasks = [
        add_document_with_pages(collection_name, file_names[i], docs[i], i)
        for i in range(len(docs))
    ]
    
    # Execute all tasks concurrently with progress bar
    results = await tqdm.gather(*tasks, desc="Uploading Documents")
    
    # Count successful uploads
    successful = sum(1 for result in results if result is not None)
    print(f"Successfully uploaded {successful}/{len(docs)} documents")
    
    return results
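
The pattern used here (a semaphore to cap concurrency, plus a few retries with a growing delay) can be tried in isolation. The following is a minimal sketch with a simulated upload instead of a real API call. `with_retries`, `demo`, and the failure counts are all made up for illustration.

```python
import asyncio

async def with_retries(make_call, semaphore, attempts=3, base_delay=0.01):
    """Run `make_call()` under `semaphore`; return None if every attempt fails."""
    async with semaphore:
        for attempt in range(attempts):
            try:
                return await make_call()
            except Exception:
                if attempt == attempts - 1:
                    return None
                # Wait a little longer after each failed attempt
                await asyncio.sleep(base_delay * (attempt + 1))

async def demo():
    semaphore = asyncio.Semaphore(2)   # at most two calls in flight
    in_flight = 0
    peak = 0
    failures_left = {"doc-1": 2}       # doc-1 fails twice, then succeeds

    def make_upload(name):
        async def upload():
            nonlocal in_flight, peak
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await asyncio.sleep(0.01)  # simulate network latency
                if failures_left.get(name, 0) > 0:
                    failures_left[name] -= 1
                    raise RuntimeError("transient error")
                return f"{name}: ok"
            finally:
                in_flight -= 1
        return upload

    names = [f"doc-{i}" for i in range(5)]
    results = await asyncio.gather(
        *(with_retries(make_upload(name), semaphore) for name in names)
    )
    return results, peak

results, peak = asyncio.run(demo())
print(results, "peak concurrency:", peak)
```

Inside a notebook, where an event loop is already running, replace `asyncio.run(demo())` with `await demo()`.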

### Querying documents with ZeroEntropy
First, upload the documents:

In [60]:
await upload_documents_async(docs, file_names, "pdf_docs_demo_vn")

Starting upload of 5 documents...


Uploading Documents: 100%|██████████| 5/5 [00:01<00:00,  3.02it/s]

Document './sample_docs/annual-report-multi-page.pdf' already exists in collection 'pdf_docs_demo_vn'
Document './sample_docs/annual-report-sg-en-spy-PageSeven.pdf' already exists in collection 'pdf_docs_demo_vn'
Document './sample_docs/dashboard-sp-500-factor.pdf' already exists in collection 'pdf_docs_demo_vn'
Document './sample_docs/S-P-Global-2024-PageNine.pdf' already exists in collection 'pdf_docs_demo_vn'
Document './sample_docs/S-P-Global-2024-PageFifty.pdf' already exists in collection 'pdf_docs_demo_vn'
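Successfully uploaded 0/5 documents

[None, None, None, None, None]

In this run, every upload raised `ConflictError` because the documents were already in the collection from an earlier run, so the helper returned `None` for each one and the success count is 0/5.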

Now query for the top 5 pages:

In [61]:
response = await zclient.queries.top_pages(
    collection_name="pdf_docs_demo_vn",
    query="What are the top 100 stocks in the S&P 500?",
    k=5,
)
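
Before reranking, it can be useful to look at what came back. Each entry in `response.results` exposes `path`, `page_index`, and `score`, which are the attributes the reranking function in this guide reads. The sketch below formats such entries. It uses stand-in objects with made-up values so that it runs without an API call, and `format_top_pages` is a hypothetical helper.

```python
from types import SimpleNamespace

def format_top_pages(results):
    """Return one line per result: rank, score, path, and page index."""
    return [
        f"{rank}. score={r.score:.4f}  {r.path} (page {r.page_index})"
        for rank, r in enumerate(results, start=1)
    ]

# Stand-in results with made-up values; with a live client, pass response.results
fake_results = [
    SimpleNamespace(path="a.pdf", page_index=3, score=0.91),
    SimpleNamespace(path="b.pdf", page_index=0, score=0.42),
]
for line in format_top_pages(fake_results):
    print(line)
```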

Now let's define a function to rerank the pages in the response:

In [86]:
async def rerank_top_pages_with_metadata(query: str, top_pages_response, collection_name: str):
    """
    Rerank the results from a top_pages query and return re-ordered list with metadata.
    
    Args:
        query: The query string to use for reranking
        top_pages_response: The response object from zclient.queries.top_pages()
        collection_name: Name of the collection to fetch page content from
    
    Returns:
        List of dicts with 'path', 'page_index', and 'rerank_score' in reranked order
    """
    
    # Fetch page content and store metadata for each result
    documents = []
    metadata = []
    
    for result in top_pages_response.results:
        # Fetch the actual page content
        page_info = await zclient.documents.get_page_info(
            collection_name=collection_name,
            path=result.path,
            page_index=result.page_index,
            include_content=True
        )
        
        # Use the page text, falling back to a placeholder for empty pages
        page_content = (page_info.page.content or "").strip()
        documents.append(page_content or "No content available")
        metadata.append({
            "path": result.path,
            "page_index": result.page_index,
            "original_score": result.score
        })
    
    if not documents:
        raise ValueError("No documents found to rerank")
    
    # Perform reranking
    rerank_response = await zclient.models.rerank(
        model="zerank-1",
        query=query,
        documents=documents
    )
    
    # Create re-ordered list with metadata
    reranked_results = []
    for rerank_result in rerank_response.results:
        original_metadata = metadata[rerank_result.index]
        reranked_results.append({
            "path": original_metadata["path"],
            "page_index": original_metadata["page_index"],
            "rerank_score": rerank_result.relevance_score
        })
    
    return reranked_results
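
The key step in this function is the mapping back from reranker output to page metadata: each rerank result carries an `index` into the `documents` list, and the same index selects the matching metadata entry. The sketch below isolates that step. `toy_rerank` is a stand-in that scores by word overlap and is not how `zerank-1` works, and `attach_metadata` is a hypothetical helper.

```python
from types import SimpleNamespace

def toy_rerank(query, documents):
    """Stand-in reranker: score each document by the share of query words it contains."""
    query_words = set(query.lower().split())
    scored = [
        SimpleNamespace(
            index=i,
            relevance_score=len(query_words & set(doc.lower().split())) / max(len(query_words), 1),
        )
        for i, doc in enumerate(documents)
    ]
    # Best match first, like the result order used in the function above
    return sorted(scored, key=lambda r: r.relevance_score, reverse=True)

def attach_metadata(rerank_results, metadata):
    """Use each result's `index` to look up the metadata of the page it refers to."""
    return [
        {**metadata[r.index], "rerank_score": r.relevance_score}
        for r in rerank_results
    ]

documents = ["bond yields and rates", "largest stocks in the index", "stocks"]
metadata = [
    {"path": "a.pdf", "page_index": 0},
    {"path": "a.pdf", "page_index": 1},
    {"path": "b.pdf", "page_index": 0},
]
reranked = attach_metadata(toy_rerank("largest stocks", documents), metadata)
for item in reranked:
    print(item)
```

Because `documents` and `metadata` are built in lockstep, they must always have the same length and order for this lookup to be correct.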

Run the function and see the results!

In [87]:
reranked_results = await rerank_top_pages_with_metadata(
    query="What are the top 100 stocks in the S&P 500?",
    top_pages_response=response,
    collection_name="pdf_docs_demo_vn"
)

# Display results
print("Reranked Results with Metadata:")
for i, result in enumerate(reranked_results, 1):
    print(f"Rank {i}: {result['path']} (Page {result['page_index']}) - Score: {result['rerank_score']:.4f}")

Reranked Results with Metadata:
Rank 1: ./sample_docs/dashboard-sp-500-factor.pdf (Page 9) - Score: 0.8472
Rank 2: ./sample_docs/dashboard-sp-500-factor.pdf (Page 12) - Score: 0.8311
Rank 3: ./sample_docs/dashboard-sp-500-factor.pdf (Page 8) - Score: 0.7941
Rank 4: ./sample_docs/annual-report-sg-en-spy-PageSeven.pdf (Page 0) - Score: 0.7837
Rank 5: ./sample_docs/dashboard-sp-500-factor.pdf (Page 4) - Score: 0.4511


### ✅ That's It!

You’ve now built a working pipeline that parses PDFs into text with LlamaParse, indexes the pages in ZeroEntropy, retrieves the top pages for a query, and reranks them with the ZeroEntropy reranker.