# Build Smart Document Understanding Agents with Tensorlake and OpenAI Agents SDK
*Author: [Antaripa Saha](https://x.com/doesdatmaksense)*

In this example, you will learn how to build smart agents that understand documents using Tensorlake and the OpenAI Agents SDK. To learn more about agentic applications, [check out the Tensorlake docs](https://docs.tensorlake.ai/use-cases/agents-and-rag-workflows/agents-understand-docs).

## Step 0: Prerequisites

1. Install the [Tensorlake SDK](https://pypi.org/project/tensorlake/)
2. Import necessary packages
3. Set your [Tensorlake API Key](https://docs.tensorlake.ai/platform/authentication)

**Note:** Learn more with the [Tensorlake docs](https://docs.tensorlake.ai/).

In [None]:
!pip install tensorlake openai-agents

In [None]:
from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models import (
    ParsingOptions,
    StructuredExtractionOptions,
    EnrichmentOptions,
    ParseStatus,
    ChunkingStrategy,
    TableOutputMode,
    TableParsingFormat,
    PartitionStrategy
)

# OpenAI Agents SDK
from agents import Agent, Runner

from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum

import time
import json

In [None]:
%env TENSORLAKE_API_KEY=YOUR_TENSORLAKE_API_KEY
%env OPENAI_API_KEY=YOUR_OPENAI_API_KEY
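
Before moving on, it can help to confirm that both keys were actually replaced with real values. The check below is a minimal sketch, not part of either SDK: `missing_keys` is a hypothetical helper, and treating any value that starts with `YOUR_` as unset is an assumption based on the placeholders used above.

```python
import os

# The two environment variables set in the cell above.
REQUIRED_KEYS = ("TENSORLAKE_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return required keys that are unset, empty, or still a YOUR_... placeholder."""
    env = os.environ if env is None else env
    missing = []
    for key in required:
        value = env.get(key, "").strip()
        # Assumption: an unchanged placeholder looks like YOUR_<NAME>.
        if not value or value.startswith("YOUR_"):
            missing.append(key)
    return missing


problems = missing_keys()
if problems:
    print("Set these before continuing: " + ", ".join(problems))
```

Passing a plain dictionary instead of relying on `os.environ` makes the helper easy to try out without touching real credentials.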

## Step 1: Specify Structured Data Extraction

Create a simple Pydantic model to specify what structured data you want extracted from the document.

In [None]:
class ResearchPaperSchema(BaseModel):
    """Schema focusing on the most critical information from the research papers"""

    title: str = Field(description="Title of the research paper")
    authors: List[str] = Field(description="List of author names")
    abstract: str = Field(description="Abstract of the paper")

    research_problem: str = Field(description="What problem does this paper solve?")
    main_approach: str = Field(description="What is the main approach or method used?")
    key_contributions: List[str] = Field(description="What are the 3-5 most important contributions?")

    methodology_summary: str = Field(description="Brief summary of the research methodology")
    datasets_used: Optional[List[str]] = Field(description="Datasets mentioned in the paper", default=None)
    evaluation_metrics: Optional[List[str]] = Field(description="How do they measure success?", default=None)

    related_work_summary: Optional[str] = Field(description="Brief summary of how this relates to existing work", default=None)
    limitations: Optional[List[str]] = Field(description="What limitations do the authors acknowledge?", default=None)

## Step 2: Parse the Document
To use the Tensorlake Python SDK, you need to:

1. Create a Tensorlake Client
2. Specify a file path of the document that you want to parse
3. Upload the document to Tensorlake Cloud
4. Specify parsing options; if none are specified, the default options are used
5. Initiate the parsing job and wait until it completes successfully

In [None]:
# Create a Tensorlake client; it reads the `TENSORLAKE_API_KEY` environment variable you set above
doc_ai = DocumentAI()

file_path = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/Jasper%20and%20Stells-%20distillation%20of%20SOTA%20embedding%20models.pdf"

# Configure parsing options for academic papers
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.PAGE
)

# Configure structured extraction
structured_extraction_options = StructuredExtractionOptions(
    schema_name="Research Paper Analysis",
    json_schema=ResearchPaperSchema
)

# Parse the document with the specified extraction options
parse_id = doc_ai.parse(file_path, parsing_options=parsing_options, structured_extraction_options=[structured_extraction_options])

print(f"Parse job submitted with ID: {parse_id}")

# Wait for completion
result = doc_ai.wait_for_completion(parse_id)

Parse job submitted with ID: parse_mhwzN6NGcjbMhDDLw8wfG
waiting 5 s…
parse status: processing
waiting 5 s…
parse status: processing
waiting 5 s…
parse status: processing
waiting 5 s…
parse status: processing
waiting 5 s…
parse status: successful
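
The `waiting 5 s…` lines in the output above show the job being polled until it finishes. The following is a minimal sketch of that polling pattern, not the SDK's implementation: `poll_until_done`, the `get_status` callable, the timeout, and the `"failure"` terminal state are assumptions made for illustration.

```python
import time


def poll_until_done(get_status, interval_s=5.0, timeout_s=300.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    # Assumption: "successful" appears in the output above; "failure" is a guess.
    terminal = {"successful", "failure"}
    waited = 0.0
    while True:
        status = get_status()
        print(f"parse status: {status}")
        if status in terminal:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still '{status}' after {waited:.0f} s")
        print(f"waiting {interval_s:.0f} s…")
        sleep(interval_s)
        waited += interval_s


# Simulated job that finishes on the third check; the no-op sleep keeps the demo instant.
fake_statuses = iter(["processing", "processing", "successful"])
final = poll_until_done(lambda: next(fake_statuses), sleep=lambda seconds: None)
```

Injecting `sleep` as a parameter is a design choice that lets the loop be exercised without real delays.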


# Understanding Tensorlake Parsing Output

In one single DocumentAI API call, Tensorlake returns both the full markdown content of the document and the structured data in JSON format.

## Review the Structured Data

In [None]:
print(json.dumps(result.structured_data[0].data, indent=2))

{
  "abstract": "A crucial component in many deep learning applications, such as Frequently Asked Questions (FAQ) and Retrieval-Augmented Generation (RAG), is dense retrieval. In this process, embedding models transform raw text into numerical vectors. However, the embedding models that currently excel on text embedding benchmarks, like the Massive Text Embedding Benchmark (MTEB), often have numerous parameters and high vector dimensionality. This poses challenges for their application in real-world scenarios. To address this issue, we propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple larger teacher embedding models through three carefully designed losses. Meanwhile, we utilize Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively. Our student model named Jasper with 2 billion parameters, built upon the Stella embedding model, obtained the No.3 position 

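Several fields in `ResearchPaperSchema` are optional and default to `None`, so it can be useful to see which ones actually came back populated before handing the data to an agent. This is a small sketch under that assumption; `field_coverage` is a hypothetical helper, and the sample dictionary is illustrative rather than real extraction output.

```python
def field_coverage(data, expected_fields):
    """Split expected field names into populated ones and missing or empty ones."""
    populated, empty = [], []
    for name in expected_fields:
        value = data.get(name)
        # Treat None, empty strings, and empty containers as "not extracted".
        if value in (None, "", [], {}):
            empty.append(name)
        else:
            populated.append(name)
    return populated, empty


# Illustrative sample only; in the notebook, pass result.structured_data[0].data.
sample = {
    "title": "Jasper and Stella: distillation of SOTA embedding models",
    "datasets_used": None,
    "limitations": [],
}
populated, empty = field_coverage(sample, ["title", "abstract", "datasets_used", "limitations"])
print("populated:", populated)
print("missing or empty:", empty)
```

In the notebook, the expected names could come from the schema itself, for example `list(ResearchPaperSchema.model_fields)` if Pydantic v2 is installed.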
## Review the Markdown Chunks

In [None]:
# Get the markdown from extracted data
for index, chunk in enumerate(result.chunks):
    print(f"Chunk {index}:")
    print(chunk.content)

Chunk 0:

arXiv:2412.19048v2 [cs.IR] 23 Jan 2025

## Jasper and Stella: distillation of SOTA embedding models

Dun Zhang1, Jiacheng Li1; Ziyang Zeng1,2, Fulong Wang1 1 NovaSearch Team
2Beijing University of Posts and Telecommunications infgrad@163.com jcli.nlp@gmail.com ziyang1060@bupt.edu.cn wangfl1989@163.com

## Abstract

A crucial component in many deep learning applications, such as Frequently Asked Ques- tions (FAQ) and Retrieval-Augmented Gener- ation (RAG), is dense retrieval. In this pro- cess, embedding models transform raw text into numerical vectors. However, the embed- ding models that currently excel on text embed- ding benchmarks, like the Massive Text Embed- ding Benchmark (MTEB), often have numer- ous parameters and high vector dimensionality. This poses challenges for their application in real-world scenarios. To address this issue, we propose a novel multi-stage distillation frame- work that enables a smaller student embedding model to distill multiple larger teacher
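
When the chunks are later pasted into an agent's instructions, labeling each chunk and capping the total size keeps the prompt traceable and bounded. The sketch below assumes only that each chunk exposes a `.content` string, as in the loop above; `build_context` and its `max_chars` budget are hypothetical and not part of the Tensorlake SDK.

```python
from types import SimpleNamespace


def build_context(chunks, max_chars=None):
    """Join chunk contents with labels, stopping before max_chars is exceeded."""
    parts, used = [], 0
    for index, chunk in enumerate(chunks):
        block = f"[Chunk {index}]\n{chunk.content.strip()}"
        if max_chars is not None and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block) + 2  # account for the blank-line separator
    return "\n\n".join(parts)


# Stand-in chunks for demonstration; in the notebook, pass result.chunks.
demo_chunks = [
    SimpleNamespace(content="## Abstract\nA crucial component in many deep learning applications ..."),
    SimpleNamespace(content="## 1 Introduction\n..."),
]
print(build_context(demo_chunks))
```

A character budget is only a rough proxy for tokens; a tokenizer-based limit would be more precise if the model's context window is a concern.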

# Create agent using OpenAI Agents SDK

For this example, we're going to create two question-answering agents to compare the effectiveness of the LLM when given only a link to the PDF versus when given Tensorlake's parsed output (structured data and markdown chunks). A third agent then compares their answers.

## Step 1: Create a Basic Agent

This agent is given only the link to the PDF document in its instructions; it receives none of the parsed content.

In [None]:
def create_qa_agent_basic(document: str):
    """Create a Q&A agent specialized for research paper analysis."""
    return Agent(
        name="Research Paper Q&A Basic Agent",
        instructions=f"""
You are a knowledgeable and precise assistant designed to answer questions based on the content of academic research papers. Your goal is to help users understand and extract relevant insights from the document linked to below.

Document to parse:
{document}

Capabilities:
- Accurately summarize and interpret sections, tables, and figures.
- Understand technical terminology, methodologies, and experimental setups.
- Identify and explain findings, results, and conclusions.
- Recognize document structure (abstract, introduction, methods, results, discussion, references).
- Extract insights from equations, data, and complex diagrams when described.

Guidelines:
- Always ground your answers in the content of the document.
- Use direct quotes or paraphrased explanations from the paper when helpful.
- If a question cannot be answered from the document, clearly state that.
- Be concise but informative. Use structured responses (e.g., bullet points or short summaries) when appropriate.

Answer all questions as an expert reader of the paper, supporting your responses with references to the content where necessary.
""")

## Step 2: Create an Agent that leverages Tensorlake results

This agent references only the output of Tensorlake's parsing of the PDF: the structured data and the markdown chunks.

In [None]:
def create_qa_agent(markdown_chunks: str, structured_data: str):
    """Create a Q&A agent specialized for research paper analysis."""
    return Agent(
        name="Research Paper Q&A Agent",
        instructions=f"""
You are a knowledgeable and precise assistant designed to answer questions based on the content of academic research papers. Your goal is to help users understand and extract relevant insights from the document provided below.
Use both the markdown chunks and structured data as reference material.

Markdown Chunks:
{markdown_chunks}

Structured Data:
{structured_data}

Capabilities:
- Accurately summarize and interpret sections, tables, and figures.
- Understand technical terminology, methodologies, and experimental setups.
- Identify and explain findings, results, and conclusions.
- Recognize document structure (abstract, introduction, methods, results, discussion, references).
- Extract insights from equations, data, and complex diagrams when described.

Guidelines:
- Always ground your answers in the content of the document.
- Use direct quotes or paraphrased explanations from the paper when helpful.
- If a question cannot be answered from the document, clearly state that.
- Be concise but informative. Use structured responses (e.g., bullet points or short summaries) when appropriate.

Answer all questions as an expert reader of the paper, supporting your responses with references to the content where necessary.
""")

## Step 3: Create a Comparison Agent

This agent will compare the results from the two other agents and provide an analysis of what information may have been missed by only leveraging the PDF instead of the parsed Tensorlake output.

In [None]:
def compare_results(basic_results: str, results: str):
    """Create a result comparator for basic and advanced agent results."""
    return Agent(
        name="Research Paper Result Comparator Agent",
        instructions=f"""
You are a knowledgeable and precise assistant designed to compare the results from two agents that answered the provided questions about an academic research paper. Your goal is to help users understand which results are more complete and accurate based on the questions and the two outputs below.

Basic Agent Results:
{basic_results}

Advanced Agent Results:
{results}

Capabilities:
- Accurately determine which results are more complete and accurate.
- Compare the accuracy of the results.

Guidelines:
- Always ground your answers in the content of the results.
- Use direct quotes or paraphrased explanations from the results when helpful.
- Be concise but informative. Use structured responses (e.g., bullet points or short summaries) when appropriate.

Provide a concise summary of which results are more complete and accurate.
""")

## Step 4: Run and Test the Agent

You can ask questions about the document in natural language and get detailed answers.

In [None]:
# pass the extracted chunks to the agent
markdown_chunks = ""
for chunk in result.chunks:
    markdown_chunks += chunk.content + "\n\n"

# pass the structured data to the agent
structured_data = json.dumps(result.structured_data[0].data, indent=2)

# Ask questions (joined with newlines so the two questions do not run together)
questions = "\n".join([
    "What is the paper about? What are the key points from the paper that we can further leverage?",
    "Describe the architecture of Jasper. How is it structured?"
])

# Create Q&A agent
basic_agent = create_qa_agent_basic(file_path)
agent = create_qa_agent(markdown_chunks, structured_data)

# Time the agent runs rather than agent construction, which is nearly instant
basic_start_time = time.time()
basic_result = await Runner.run(basic_agent, questions)
basic_end_time = time.time()

advanced_start_time = time.time()
advanced_result = await Runner.run(agent, questions)
advanced_end_time = time.time()

basic_seconds = basic_end_time - basic_start_time
advanced_seconds = advanced_end_time - advanced_start_time
print(f"Basic Agent took {basic_seconds:.2f} seconds to run")
print(f"Advanced Agent took {advanced_seconds:.2f} seconds to run")
print(f"Advanced / Basic run-time ratio: {advanced_seconds / basic_seconds:.2f}")

# Compare the final answers (not the full run result objects)
comparator_agent = compare_results(basic_result.final_output, advanced_result.final_output)
comparison_result = await Runner.run(comparator_agent, questions)

print(f" {comparison_result.final_output}")

 ### Summary Comparison

#### Paper Overview

- **Basic Agent Results**:
  - Focuses on the distillation of large embedding models, specifically Jasper and Stella.
  - Aims at efficient distillation for deployment on resource-limited devices.

- **Advanced Agent Results**:
  - Describes a multi-stage distillation framework for reducing model size while maintaining performance, focusing on dense retrieval applications.
  - Discusses the Jasper model built on the Stella model, achieving competitive performance on the Massive Text Embedding Benchmark (MTEB).

**More Complete**: Advanced Agent Results provide a more detailed context, including specific applications and performance benchmarks.

#### Key Points

- **Basic Agent Results**:
  - Emphasizes model size reduction and performance trade-offs.
  - Discusses use cases for real-time applications.

- **Advanced Agent Results**:
  - Details a multi-stage distillation framework with custom loss functions.
  - Introduces Matryoshka Represe

# Next Steps

Now that you have the basics down, check out one of these other resources to dive deeper into document parsing with Tensorlake:
- [Python SDK and API Docs](https://docs.tensorlake.ai/)
- [Blog](https://tensorlake.ai/blog)
- [YouTube Channel](https://tensorlake.ai/blog)
- [Community Slack](https://tensorlakecloud.slack.com/)