sidathrashen/PyMuPDF-Data-Extraction


PyMuPDF-Data-Extraction

Overview

This project comprises two Python scripts that use the PyMuPDF library to extract data from PDF documents. One extracts highlighted text; the other extracts specific data based on where text sits on the page and the context around it.

  • script2.py: Extracts data based on location and common text patterns in PDF documents. It does not process highlighted annotations.
  • main.py: Extracts highlighted text from a PDF by processing its highlight annotations.
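
As a rough illustration of the location-and-pattern approach described for script2.py, the sketch below pairs a clipped text read with regular expressions. It is not the code in script2.py: the field names, label patterns, and the helper names `extract_fields` and `read_region` are hypothetical and chosen only for this example. The PyMuPDF import is deferred so the pattern matching can be tried on plain strings.

```python
import re

# Hypothetical label patterns for illustration; the real script may look
# for entirely different labels.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*(\S+)", re.IGNORECASE
    ),
    "date": re.compile(
        r"Date\s*:\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.IGNORECASE
    ),
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return {field name: first captured value, or None if not found}."""
    result = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


def read_region(pdf_path, page_number, box):
    """Return the text inside box (x0, y0, x1, y1) on a zero-based page."""
    import fitz  # PyMuPDF, imported lazily so extract_fields works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        return page.get_text("text", clip=fitz.Rect(*box))
```

For a document with a consistent layout, one might call `extract_fields(read_region("sample.pdf", 0, (0, 0, 600, 200)))`, where the file name and box coordinates are placeholders.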

Installation and Setup

Prerequisites

  • Python 3.x
  • Pip (Python package installer)

Setting Up a Python Virtual Environment

  1. Create a Virtual Environment:
    python -m venv venv
  2. Activate the Virtual Environment:
    • On Windows:
      .\venv\Scripts\activate
    • On macOS/Linux:
      source venv/bin/activate

Installing Dependencies

Install the required packages using pip:

pip install -r requirements.txt

Note: The requirements.txt file should list every library the project uses, including PyMuPDF, pandas, and python-dateutil.
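
To confirm the installation worked, a small check like the one below can report which of the packages named above are not importable. This is an optional sketch, not part of the repository; the helper name `missing_packages` is hypothetical. Note that the import names differ from the pip names for two of the packages.

```python
import importlib.util

# pip package name -> import name. PyMuPDF has traditionally been imported
# as "fitz", and python-dateutil is imported as "dateutil".
REQUIRED_MODULES = {
    "PyMuPDF": "fitz",
    "pandas": "pandas",
    "python-dateutil": "dateutil",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```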

Usage

To use the scripts, navigate to the project directory and run:

  • For extracting data based on location and common text patterns:
    python script2.py
  • For extracting highlighted text:
    python main.py

Functionality and Features

  • script2.py: Extracts text based on predefined locations and patterns. Useful for structured documents where the data layout is consistent.
  • main.py: Uses PyMuPDF to process PDF documents and extract highlighted text. It identifies highlight annotations and retrieves the text they cover.
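
The highlight extraction described for main.py could look roughly like the sketch below. It is one possible approach under stated assumptions, not the repository's code: it collects each highlight annotation's quad points, converts them to bounding boxes, and keeps the page words that overlap those boxes. The helper names (`rects_intersect`, `quads_to_boxes`, `words_in_boxes`, `extract_highlights`) are hypothetical.

```python
def rects_intersect(a, b):
    """True when two (x0, y0, x1, y1) boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def quads_to_boxes(points):
    """Turn a flat list of quad corner points (4 per quad) into boxes."""
    boxes = []
    for i in range(0, len(points) - 3, 4):
        xs = [p[0] for p in points[i:i + 4]]
        ys = [p[1] for p in points[i:i + 4]]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def words_in_boxes(words, boxes):
    """Join the words (x0, y0, x1, y1, text, ...) that overlap any box."""
    picked = [
        w[4] for w in words
        if any(rects_intersect(w[:4], box) for box in boxes)
    ]
    return " ".join(picked)


def extract_highlights(pdf_path):
    """Return a list of (page number, highlighted text) pairs."""
    import fitz  # PyMuPDF, imported lazily so the helpers work without it

    results = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            words = page.get_text("words")
            for annot in page.annots():
                if annot.type[0] != fitz.PDF_ANNOT_HIGHLIGHT:
                    continue
                boxes = quads_to_boxes(annot.vertices or [])
                results.append((page.number, words_in_boxes(words, boxes)))
    return results
```

Highlight boxes can extend slightly into neighbouring lines, so a plain overlap test like this one may pick up extra words; requiring a minimum overlap area is one way to tighten it.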

Demo Videos
