Retrieval Framework

This tool converts scientific PDFs into plain text that can be fed to an LLM.

Convert the PDF to LaTeX using the Mathpix API, which is tailored to scientific papers.
Extract images and tables from the LaTeX and replace them with text descriptions generated by a multimodal LLM.
- The prompts are designed to extract all values and relationships represented in each table or graph, minimizing information loss.
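
The second step can be pictured as a substitution pass over the LaTeX source. The sketch below is only an illustration under stated assumptions, not this tool's actual implementation: `replace_figures`, `describe`, and `fake_describe` are hypothetical names, and `describe` stands in for the multimodal LLM call. It also assumes that images appear in the LaTeX as `\includegraphics` commands.

```python
import re
from typing import Callable

# Matches \includegraphics with an optional [options] part and captures the path.
INCLUDEGRAPHICS = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")


def replace_figures(latex: str, describe: Callable[[str], str]) -> str:
    """Replace each image command with the text returned by describe(path)."""
    def _substitute(match):
        # group(1) is the image path inside the braces
        return describe(match.group(1))

    return INCLUDEGRAPHICS.sub(_substitute, latex)


def fake_describe(path: str) -> str:
    # Hypothetical stand-in for a vision model call.
    return f"[description of {path}]"


sample = r"See \includegraphics[width=0.5\textwidth]{fig1.png} for details."
result = replace_figures(sample, fake_describe)
print(result)
```

Passing the describing function in as an argument keeps the substitution logic testable without any network access.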

Using with LlamaIndex 🦙

See hierarchical_retrieval.ipynb for an example LlamaIndex workflow.

It uses hierarchical retrieval: queries are matched against the text descriptions generated by GPT, and the matching descriptions are then used to retrieve the original tables and images.
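
The idea can be shown with a toy example that does not depend on LlamaIndex. This is a minimal sketch, not the notebook's code: `Node`, `score`, and `retrieve` are hypothetical names, the file paths are made up, and simple word overlap stands in for real embedding-based similarity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    summary: str   # generated text description (what gets searched)
    original: str  # reference to the original table or image (what gets returned)


def score(query: str, text: str) -> int:
    """Count distinct query words that also occur in the text (toy relevance)."""
    text_words = set(text.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in text_words)


def retrieve(query: str, nodes: list, top_k: int = 1) -> list:
    """Rank nodes by their summaries, then return the linked originals."""
    ranked = sorted(nodes, key=lambda n: score(query, n.summary), reverse=True)
    return [n.original for n in ranked[:top_k]]


nodes = [
    Node("table comparing accuracy of retrieval methods", "tables/table_1.tex"),
    Node("line chart of training loss over epochs", "images/figure_2.png"),
]
hits = retrieve("training loss chart", nodes)
print(hits)
```

The search runs over the summaries only, while the caller receives the original artifact each summary points to.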

Basic usage

Set MATHPIX_APP_ID and MATHPIX_APP_KEY in your environment. We suggest using a .env file.

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(".env"))  # read local .env file
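
Since a missing credential typically only surfaces once a request is made, it can help to check the environment right after loading it. The following is a small sketch using only the standard library; `missing_mathpix_vars` is a hypothetical helper, not part of this tool.

```python
import os

REQUIRED_VARS = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY")


def missing_mathpix_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_mathpix_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```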

Instantiate a text and a vision model. This tool uses LlamaIndex abstractions to interface with LLMs.

from llama_index.llms import OpenAI
from llama_index.multi_modal_llms import OpenAIMultiModal

text_model = OpenAI()
vision_model = OpenAIMultiModal(max_new_tokens=4096)

Next, pass those models to the converter.

converter = MathpixPdfConverter(text_model=text_model, vision_model=vision_model)

Convert the PDF and write the resulting text to a file.

from pathlib import Path

pdf_path = Path("path/to/file.pdf")

pdf_result = converter.convert(pdf_path)

# Write the converted plain text to disk
with Path("output.txt").open("w") as f:
    f.write(pdf_result.content)

Custom workflow

To persist intermediate results or run processing in parallel, use MathpixProcessor and MathpixResultParser directly.

processor = MathpixProcessor()
parser = MathpixResultParser(text_model=text_model, vision_model=vision_model)

mathpix_result = processor.submit_pdf(pdf_path)
mathpix_result = processor.await_result(mathpix_result)
pdf_result = parser.parse_result(mathpix_result)
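
One way to process several PDFs at once is to submit them all first and then wait for the results in worker threads. The sketch below is an assumption-laden illustration: `process_many` is a hypothetical helper, the three steps are passed in as plain callables, and it assumes that submitting returns quickly while awaiting blocks.

```python
from concurrent.futures import ThreadPoolExecutor


def process_many(pdf_paths, submit, await_result, parse, max_workers=4):
    """Submit all PDFs, wait for the results concurrently, then parse them."""
    # Submit everything up front so the remote conversions can overlap.
    pending = [submit(path) for path in pdf_paths]
    # Wait in worker threads; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        finished = list(pool.map(await_result, pending))
    # Parse sequentially to keep this step simple.
    return [parse(result) for result in finished]


# Demonstration with stand-in callables instead of real API calls.
demo = process_many(
    ["a.pdf", "b.pdf"],
    submit=lambda path: path.upper(),
    await_result=lambda handle: handle + "!",
    parse=lambda result: result,
)
print(demo)
```

With the objects above, a call might look like `process_many(paths, processor.submit_pdf, processor.await_result, parser.parse_result)`. Whether these methods are safe to call from several threads is not stated in this README, so treat that as something to verify.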
