MegaParse is a powerful and versatile parser that handles various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on avoiding information loss during parsing.
- Versatile Parser: Handles various types of documents with ease.
- No Information Loss: Focuses on avoiding information loss during parsing.
- Fast and Efficient: Designed with speed and efficiency at its core.
- Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
- Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.
- Files: ✅ PDF ✅ PowerPoint ✅ Word
- Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images
pip install megaparse
- Add your OpenAI API key to the .env file
- Install poppler on your computer (images and PDFs)
- Install tesseract on your computer (images and PDFs)
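Before running the parser, it can help to confirm that the prerequisites above are actually visible to Python. The following is a minimal sketch, not part of MegaParse: the helper name `missing_prerequisites` is hypothetical, and it assumes that Poppler exposes the `pdftoppm` executable, that Tesseract exposes `tesseract`, and that the OpenAI key is stored under the conventional name `OPENAI_API_KEY`. It only inspects the process environment, so a key that lives solely in the .env file will be reported as missing until that file has been loaded.

```python
import os
import shutil

# Assumed executable names: Poppler ships `pdftoppm`, Tesseract ships `tesseract`.
REQUIRED_TOOLS = {"poppler": "pdftoppm", "tesseract": "tesseract"}


def missing_prerequisites(tools=REQUIRED_TOOLS, env_var="OPENAI_API_KEY"):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []
    for name, executable in tools.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(executable) is None:
            problems.append(f"{name}: executable '{executable}' not found on PATH")
    # Assumed variable name; adjust it if your .env file uses a different key.
    if not os.environ.get(env_var):
        problems.append(f"{env_var} is not set in the environment")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Missing:", problem)
```

Running the script prints one line per missing prerequisite and nothing when every check passes.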
from megaparse import MegaParse
megaparse = MegaParse(file_path="./test.pdf")
document = megaparse.load()
print(document.content)
megaparse.save_md(document.content, "./test.md")
- Create an account on Llama Cloud and get your API key.
- Call MegaParse with the `llama_parse_api_key` parameter
from megaparse import MegaParse
megaparse = MegaParse(file_path="./test.pdf", llama_parse_api_key="llx-your_api_key")
document = megaparse.load()
print(document.content)
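To keep the key out of source control, it can be read from the environment instead of being written inline. The sketch below assumes the key is stored in a variable named `LLAMA_CLOUD_API_KEY`; that name and the two helper functions are illustrative choices, not something MegaParse requires. The `MegaParse(...)` call and `load()` are the same ones used in the example above.

```python
import os


def llama_parse_key(env_var="LLAMA_CLOUD_API_KEY"):
    """Fetch the LlamaParse key from the environment, failing early if it is absent."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before running the parser")
    return key


def parse_with_llama(file_path):
    """Parse one file with the LlamaParse backend, using the key from the environment."""
    # Imported lazily so llama_parse_key() can be used without megaparse installed.
    from megaparse import MegaParse

    megaparse = MegaParse(file_path=file_path, llama_parse_api_key=llama_parse_key())
    return megaparse.load()
```

With the variable exported, `parse_with_llama("./test.pdf")` returns the same kind of document object as the example above.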
| Parser | Diff |
|---|---|
| LMM megaparse | 36 |
| Megaparse with LLamaParse and GPTCleaner | 74 |
| Megaparse with LLamaParse | 97 |
| Unstructured Augmented Parse | 99 |
| LLama Parse | 102 |
| Megaparse | 105 |
Lower is better
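This README does not state how the Diff score is computed. As an illustration only, one plausible way to get a "lower is better" number is to count the lines that differ between a hand-written reference Markdown file and a parser's output. The function below is a sketch under that assumption, using Python's standard `difflib`; it is not the project's benchmark code, and `line_diff_count` is a hypothetical name.

```python
import difflib


def line_diff_count(reference: str, candidate: str) -> int:
    """Count lines added or removed when turning `reference` into `candidate`."""
    diff = difflib.ndiff(reference.splitlines(), candidate.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are ignored.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))
```

Under this metric, identical texts score 0, and each changed line contributes two (one removal and one addition).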
- Improve Table Parsing
- Improve Image Parsing and description
- Add TOC for Docx
- Add Hyperlinks for Docx
- Order Headers for Docx to Markdown