A tool that converts scientific PDFs into plain text for your LLM-related needs, such as building RAGs or agents for academic knowledge. It was developed in collaboration with the LlamaIndex team.

tensorsense/Retrieval-Framework

Retrieval Framework

This is a tool that converts scientific PDFs into plain text for your LLM-related needs.

  • Convert PDFs to LaTeX using the Mathpix API, which is tailored to scientific papers.
  • Extract images and tables from the LaTeX and replace them with text using a multimodal LLM.
    • The prompts are designed to extract every value and relationship represented in each table or graph, minimizing information loss.
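
The replacement step described above can be sketched roughly as follows. This is an illustration only, not the project's parser: the regular expression, `replace_floats`, and `fake_describe` are all hypothetical names introduced here, and `fake_describe` stands in for the multimodal LLM call.

```python
import re
from typing import Callable

# Matches whole table environments and standalone \includegraphics commands.
# This pattern is an illustrative assumption, not the project's actual parser.
FLOAT_PATTERN = re.compile(
    r"\\begin\{table\}.*?\\end\{table\}|\\includegraphics(?:\[[^\]]*\])?\{[^}]*\}",
    re.DOTALL,
)


def replace_floats(latex: str, describe: Callable[[str], str]) -> str:
    """Replace each table or image reference with the text returned by `describe`."""
    return FLOAT_PATTERN.sub(lambda match: describe(match.group(0)), latex)


def fake_describe(snippet: str) -> str:
    # Hypothetical stand-in for a multimodal LLM call.
    kind = "table" if snippet.startswith("\\begin{table}") else "image"
    return f"[description of {kind}]"


sample = (
    "Intro text.\n"
    "\\begin{table}\n a & b \\\\ 1 & 2\n\\end{table}\n"
    "See \\includegraphics[width=5cm]{fig1.png} for details."
)
result = replace_floats(sample, fake_describe)
print(result)
```

Passing the describing function in as a callback keeps the text substitution separate from the model call, so the substitution can be checked without any API access.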

Using with LlamaIndex 🦙

See hierarchical_retrieval.ipynb for an example LlamaIndex workflow.

It uses hierarchical retrieval: a query is first matched against the text descriptions generated by GPT, and each matching description is then used to retrieve the original table or image.
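
The idea can be illustrated with a toy sketch, assuming each table or image is stored alongside its generated description. It does not reproduce the notebook: `DescribedItem`, `tokenize`, and `retrieve_original` are hypothetical names, and plain word overlap stands in for whatever similarity measure the real index uses.

```python
from dataclasses import dataclass


@dataclass
class DescribedItem:
    description: str  # text generated for the table or image
    original: str     # reference to the original table or image


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words without punctuation."""
    return {word.strip(".,;:()").lower() for word in text.split()} - {""}


def retrieve_original(query: str, items: list[DescribedItem]) -> str:
    """Match the query against descriptions, then return the best item's original."""
    query_words = tokenize(query)
    best = max(items, key=lambda item: len(query_words & tokenize(item.description)))
    return best.original


items = [
    DescribedItem("Table of model accuracy on the validation set", "table_1.tex"),
    DescribedItem("Line plot of training loss over epochs", "figure_2.png"),
]
match = retrieve_original("accuracy on validation set", items)
```

The search runs over the descriptions, but the value handed back is the original asset, which is the two-level lookup the workflow relies on.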

Basic usage

  1. Set MATHPIX_APP_ID and MATHPIX_APP_KEY in your environment. We suggest using a .env file.
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(".env"))  # read local .env file
  2. Instantiate a text model and a vision model. This tool uses LlamaIndex abstractions to interface with LLMs.
from llama_index.llms import OpenAI
from llama_index.multi_modal_llms import OpenAIMultiModal

text_model = OpenAI()
vision_model = OpenAIMultiModal(max_new_tokens=4096)

Next, pass those models to the converter.

converter = MathpixPdfConverter(text_model=text_model, vision_model=vision_model)
  3. Convert the PDF and write out the result.
from pathlib import Path

pdf_path = Path("path/to/file.pdf")

pdf_result = converter.convert(pdf_path)

# Write the converted plain text to disk
with Path("output.txt").open("w") as f:
    f.write(pdf_result.content)
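
Since the first step depends on both Mathpix variables being present, it can help to check for them before starting a conversion. The helper below is a small sketch; `missing_env_vars` is a hypothetical name and not part of this repository.

```python
import os

REQUIRED_VARS = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments after `load_dotenv` checks the real process environment; a non-empty return value lists the variables that still need to be set.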

Custom workflow

In order to persist intermediate results or run processing in parallel, you can use MathpixProcessor and MathpixResultParser directly.

processor = MathpixProcessor()
parser = MathpixResultParser(text_model=text_model, vision_model=vision_model)

mathpix_result = processor.submit_pdf(pdf_path)
mathpix_result = processor.await_result(mathpix_result)
pdf_result = parser.parse_result(mathpix_result)
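
One possible way to run several documents in parallel with these two classes is sketched below. It uses only the `submit_pdf`, `await_result`, and `parse_result` calls shown above; `convert_many` is a hypothetical helper, and the sketch assumes that `submit_pdf` returns without waiting for the conversion to finish and that both objects are safe to share across threads, neither of which is confirmed here.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_many(processor, parser, pdf_paths, max_workers=4):
    """Submit all PDFs first, then await and parse each result in a worker thread."""
    # Assumption: submit_pdf returns before the remote conversion completes,
    # so submitting everything up front lets the conversions overlap.
    submitted = [processor.submit_pdf(path) for path in pdf_paths]

    def finish(mathpix_result):
        ready = processor.await_result(mathpix_result)
        return parser.parse_result(ready)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with pdf_paths.
        return list(pool.map(finish, submitted))
```

With the objects created above, `convert_many(processor, parser, [pdf_path])` would return a list of parsed results in the same order as the input paths.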
