This project comprises two Python scripts that use the PyMuPDF library to extract data from PDF documents. One script extracts highlighted text; the other extracts specific data based on text location and context.
- script2.py: Extracts data based on location and common text patterns in PDF documents. It does not process highlight annotations.
- main.py: Extracts highlighted text from a PDF by processing its highlight annotations.
- Python 3.x
- Pip (Python package installer)
- Create a Virtual Environment:
python -m venv venv
- Activate the Virtual Environment:
- On Windows:
.\venv\Scripts\activate
- On macOS/Linux:
source venv/bin/activate
Install the required packages using pip:
pip install -r requirements.txt
Note: The requirements.txt file should contain all the necessary libraries, including PyMuPDF, pandas, python-dateutil, and any others used in the project.
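As a quick sanity check after installation, a small script along these lines can report which of the libraries named above cannot be imported. It is only a sketch: the import names are assumptions about how each package is exposed (PyMuPDF is typically imported as fitz, python-dateutil as dateutil), and missing_modules is a hypothetical helper that is not part of this project.

```python
import importlib.util

# Assumed import names for PyMuPDF, pandas, and python-dateutil.
REQUIRED_MODULES = ["fitz", "pandas", "dateutil"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If anything is reported as missing, re-running pip install -r requirements.txt inside the activated virtual environment is the first thing to try.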
To use the scripts, navigate to the project directory and run:
- For extracting data based on location and common text patterns:
python script2.py
- For extracting highlighted text:
python main.py
- script2.py: Extracts text based on predefined locations and patterns. Useful for structured documents where the data layout is consistent.
- main.py: Utilizes PyMuPDF to process PDF documents and extract highlighted text. It identifies highlighted annotations and retrieves the corresponding text.
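The two approaches described above can be sketched roughly as follows. This is not the code in script2.py or main.py, only an illustration of the general techniques with PyMuPDF, and it assumes that annotation vertices and word boxes share the same page coordinates. All function names, the example date pattern, and the box coordinates are hypothetical. PyMuPDF is imported inside the functions that need it so that the pure helpers can be tried without a PDF at hand.

```python
import re

# Hypothetical pattern: ISO-style dates such as 2024-01-31.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def quad_bounds(points):
    """Bounding box (x0, y0, x1, y1) of the four corner points of a quad."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def words_in_box(words, box):
    """Return the words whose centre lies inside `box`.

    `words` uses the tuple layout of page.get_text("words"):
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    x0, y0, x1, y1 = box
    picked = []
    for w in words:
        cx = (w[0] + w[2]) / 2
        cy = (w[1] + w[3]) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            picked.append(w[4])
    return picked


def find_dates(text):
    """Pattern-based extraction: every match of DATE_PATTERN in `text`."""
    return DATE_PATTERN.findall(text)


def extract_region(pdf_path, box, page_index=0):
    """Location-based extraction: text inside a fixed box on one page."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        words = doc[page_index].get_text("words")
    return " ".join(words_in_box(words, box))


def extract_highlights(pdf_path):
    """Return (page_number, text) for each highlight annotation."""
    import fitz  # PyMuPDF

    results = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            words = page.get_text("words")
            for annot in page.annots():
                if annot.type[1] != "Highlight":
                    continue
                # A highlight may span several lines: four vertices per quad.
                vertices = annot.vertices or []
                pieces = []
                for i in range(0, len(vertices), 4):
                    box = quad_bounds(vertices[i:i + 4])
                    pieces.extend(words_in_box(words, box))
                results.append((page_number, " ".join(pieces)))
    return results
```

For example, extract_highlights("sample.pdf") would return one entry per highlight in a file named sample.pdf (a placeholder name), and find_dates(extract_region("sample.pdf", (50, 50, 300, 120))) would look for dates inside an arbitrarily chosen region of the first page. Selecting words by their centre point is a simple heuristic; words that only partly overlap a highlight may be included or missed.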