A Streamlit application for processing and comparing PDFs using Unstructured.io and OpenAI.
- Upload and compare two PDFs
- Highlight differences between documents
- Ask questions about PDF content using AI
- Export differences to Excel
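The comparison feature can be pictured with a small sketch. This is not the app's actual code: it assumes both PDFs have already been converted to plain text (by Unstructured.io or any other extractor), and the helper name `diff_lines` is hypothetical. It uses only `difflib` from the Python standard library.

```python
import difflib


def diff_lines(old_text: str, new_text: str) -> list[tuple[str, str]]:
    """Return (status, line) pairs, where status is 'removed' or 'added'.

    Hypothetical helper for illustration; the app may compare documents differently.
    """
    changes = []
    for entry in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if entry.startswith("- "):
            changes.append(("removed", entry[2:]))
        elif entry.startswith("+ "):
            changes.append(("added", entry[2:]))
        # Entries starting with "  " are unchanged lines; "? " entries are ndiff hints.
    return changes


if __name__ == "__main__":
    first = "Invoice 2024\nTotal: 100 EUR\nThank you"
    second = "Invoice 2024\nTotal: 120 EUR\nThank you"
    for status, line in diff_lines(first, second):
        print(f"{status}: {line}")
```

A line-based diff like this is coarse; highlighting changes inside a line would need a finer-grained comparison.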
- Clone this repository:
      git clone <repository-url>
      cd pdfparser
- Set up the environment using uv:
  - Ensure you have uv installed. If not, install it with:
        curl -LsSf https://astral.sh/uv/install.sh | sh
  - Create a virtual environment and install dependencies:
        uv venv
        source .venv/bin/activate
        uv pip install -e .
  - Note: These commands are written for Zsh (the default shell on macOS) and work in iTerm2 on a MacBook M4.
- Configure environment variables:
  - Copy the .env.example file to .env:
        cp .env.example .env
  - Add your API keys for Unstructured.io and OpenAI to the .env file.
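Before launching the app, it can help to confirm that both keys are actually visible to the process. The sketch below assumes the keys are stored as `UNSTRUCTURED_API_KEY` and `OPENAI_API_KEY`; those names are assumptions, so check .env.example for the names this project really uses. `missing_keys` is a hypothetical helper, not part of the app.

```python
import os

# Assumed variable names; check .env.example for the real ones.
REQUIRED_KEYS = ("UNSTRUCTURED_API_KEY", "OPENAI_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are unset or blank in the given mapping."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

Run on its own, this only sees variables already exported in the shell; it does not read the .env file itself.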
Run the application with:
streamlit run pdf_processor.py
Then:
- Upload two PDFs in the "Upload" tab
- View differences in the "Comparison" tab
- Ask questions in the "Chat" tab
If you encounter errors related to Metal Performance Shaders when running on Apple Silicon Macs (e.g., MacBook M4), use one of these solutions:
- Reinstall PyTorch from the CPU wheel index:
      uv pip install --upgrade torch --extra-index-url https://download.pytorch.org/whl/cpu
- Set these environment variables in the shell, then start the app:
      export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
      export CUDA_VISIBLE_DEVICES=""
      streamlit run pdf_processor.py
- Add these lines to your .env file:
      PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
      CUDA_VISIBLE_DEVICES=
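The same two settings can also be applied from Python. This is a minimal sketch, assuming the variables need to be in place before PyTorch is imported; `apply_compat_env` is a hypothetical helper and not part of pdf_processor.py.

```python
import os

# Values taken from the compatibility settings described above.
COMPAT_ENV = {
    "PYTORCH_MPS_HIGH_WATERMARK_RATIO": "0.0",
    "CUDA_VISIBLE_DEVICES": "",
}


def apply_compat_env(env=os.environ):
    """Set the compatibility variables unless they are already defined."""
    for name, value in COMPAT_ENV.items():
        env.setdefault(name, value)
    return env


if __name__ == "__main__":
    apply_compat_env()
    # Import torch (or anything that imports it) only after this point.
    print({name: os.environ[name] for name in COMPAT_ENV})
```

Using `setdefault` keeps any value the user has already exported, so the sketch does not override an explicit choice.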
You may see an error like:
failed assertion `Error: MLIR pass manager failed'
This is related to GPU acceleration on Apple Silicon Macs. Use the compatibility options above to resolve it.
Warnings about "missing ScriptRunContext" are normal when running scripts directly. They can be safely ignored, as noted in the warning messages themselves.