PyMuPDF-Data-Extraction

Overview

This project consists of two Python scripts that use the PyMuPDF library to extract data from PDF documents. One script extracts highlighted text; the other extracts specific data based on where text sits on the page and the text around it.

script2.py: Extracts data based on text location and common text patterns in PDF documents. It does not process highlight annotations.
main.py: Extracts highlighted text from a PDF by processing its highlight annotations.

Installation and Setup

Prerequisites

Python 3.x
pip (Python package installer)

Setting Up a Python Virtual Environment

Create a Virtual Environment:
```
python -m venv venv
```
Activate the Virtual Environment:
- On Windows:
```
.\venv\Scripts\activate
```
- On macOS/Linux:
```
source venv/bin/activate
```

Installing Dependencies

Install the required packages using pip:

pip install -r requirements.txt

Note: The requirements.txt file should list every library the project needs, including PyMuPDF, pandas, and python-dateutil.

Usage

To use the scripts, navigate to the project directory and run the command for the task you need.

To extract data based on location and common text patterns:
```
python script2.py
```
To extract highlighted text:
```
python main.py
```

Functionality and Features

script2.py: Extracts text from predefined page locations and by matching text patterns. It works best on structured documents whose data layout is consistent from file to file.
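The location-and-pattern approach can be illustrated with a minimal sketch. This is not the code in script2.py: the function names, the region coordinates, and the ISO-date pattern are all hypothetical and chosen only for illustration. The word tuples follow the layout returned by PyMuPDF's `page.get_text("words")`.

```python
import re

# Hypothetical pattern for illustration: ISO-style dates such as 2024-01-15.
DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def words_in_region(words, region):
    """Join the words whose boxes lie fully inside region (x0, y0, x1, y1).

    `words` uses the tuple layout of PyMuPDF's page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    rx0, ry0, rx1, ry1 = region
    inside = [w for w in words
              if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1]
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    inside.sort(key=lambda w: (round(w[1]), w[0]))
    return " ".join(w[4] for w in inside)


def find_dates(text):
    """Return every substring of text that matches DATE_PATTERN."""
    return DATE_PATTERN.findall(text)


def extract_dates_from_pdf(path, region, page_number=0):
    """Read one page with PyMuPDF and return the dates found inside region."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    with fitz.open(path) as doc:
        words = doc[page_number].get_text("words")
    return find_dates(words_in_region(words, region))
```

A call such as `extract_dates_from_pdf("invoice.pdf", (40, 90, 200, 120))` would return the dates printed inside that rectangle on the first page; the file name and coordinates here are placeholders. Keeping the region filter and the pattern match separate from the PDF access makes each piece easy to check on its own.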
main.py: Uses PyMuPDF to find the highlight annotations in a PDF document and retrieve the text each one covers.
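One common way to recover highlighted text with PyMuPDF is to read each highlight annotation's quad points and collect the words that fall under them. The sketch below shows that technique under stated assumptions; it is not the code in main.py, and the helper names are hypothetical. It assumes unrotated pages and picks a word when its centre lies inside a highlighted rectangle.

```python
def quad_rects(vertices):
    """Turn a flat list of quad points (4 per quad) into bounding rectangles."""
    rects = []
    for i in range(0, len(vertices) - 3, 4):
        xs = [x for x, _ in vertices[i:i + 4]]
        ys = [y for _, y in vertices[i:i + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def highlighted_text(words, rects):
    """Join the words whose centre point lies inside any of the rectangles.

    `words` uses the tuple layout of PyMuPDF's page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    picked = []
    for x0, y0, x1, y1, text, *_ in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if any(rx0 <= cx <= rx1 and ry0 <= cy <= ry1
               for rx0, ry0, rx1, ry1 in rects):
            picked.append(text)
    return " ".join(picked)


def extract_highlights(path):
    """Return (page_number, text) pairs for every highlight annotation."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    results = []
    with fitz.open(path) as doc:
        for page in doc:
            words = page.get_text("words")
            for annot in page.annots():
                # annot.type is a (code, name) pair; keep highlights only.
                if annot.type[0] != fitz.PDF_ANNOT_HIGHLIGHT:
                    continue
                if not annot.vertices:
                    continue
                points = [(p[0], p[1]) for p in annot.vertices]
                text = highlighted_text(words, quad_rects(points))
                results.append((page.number + 1, text))
    return results
```

Testing the word centre rather than requiring full containment tolerates highlights drawn slightly tighter than the word boxes, at the cost of occasionally picking up a word that is only half covered.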

sidathrashen/PyMuPDF-Data-Extraction
