PDF Underlined Text Extractor

This repository provides tools for extracting underlined text from PDF documents. It locates the underlined regions of a PDF, crops them out as images, and reads them with OCR, returning the recognized text as structured data.

How It Works

The process involves several key steps:

1. XML Structuring: The PDF is converted into structured XML using PyQuery. This makes the document's content easier to query and manipulate.
2. Component Extraction: The XML components that denote underlining are identified and extracted, so that only the underlined parts of the document are carried forward.
3. Image Slice: The underlined sections of the PDF are sliced out and held in memory as PNG images, ready for optical character recognition.
4. Optical Character Recognition (OCR): pytesseract performs OCR on the sliced images to convert them into text.
5. Results Compilation: The extracted text is compiled into an array containing every underlined text element from the original PDF.
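
The component extraction and image slice steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in main.py: it assumes the XML uses a pdfminer-style layout in which each page holds rect and line elements with a bbox="x0,y0,x1,y1" attribute in PDF points, and it uses the standard library's ElementTree instead of PyQuery so that it runs without extra dependencies. The function names and the three thresholds are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical thresholds, in PDF points (1 pt = 1/72 inch).
MAX_THICKNESS = 2.0   # taller shapes are treated as boxes, not underlines
MIN_LENGTH = 5.0      # shorter shapes are treated as specks
TEXT_HEIGHT = 14.0    # how much of the page to keep above each underline


def find_underlines(xml_text):
    """Return (page_id, (x0, y0, x1, y1)) for each thin, wide rect or line."""
    root = ET.fromstring(xml_text)
    hits = []
    for page in root.iter("page"):
        for shape in page.iter():
            if shape.tag not in ("rect", "line"):
                continue
            bbox = shape.get("bbox")
            if not bbox:
                continue
            x0, y0, x1, y1 = (float(v) for v in bbox.split(","))
            if (y1 - y0) <= MAX_THICKNESS and (x1 - x0) >= MIN_LENGTH:
                hits.append((page.get("id"), (x0, y0, x1, y1)))
    return hits


def crop_box_pixels(bbox, page_height, dpi=150):
    """Map an underline bbox (PDF points, origin bottom-left) to a pixel
    crop box (left, upper, right, lower) with the origin at the top-left."""
    x0, y0, x1, y1 = bbox
    scale = dpi / 72.0
    top = max(page_height - (y1 + TEXT_HEIGHT), 0.0)
    bottom = page_height - y0
    return (round(x0 * scale), round(top * scale),
            round(x1 * scale), round(bottom * scale))


# Assumed sample input: one underline and one page-sized border box.
SAMPLE = """<pages>
<page id="1" bbox="0,0,612,792">
  <rect linewidth="0" bbox="72,700,200,700.8"/>
  <rect linewidth="0" bbox="50,50,562,742"/>
</page>
</pages>"""

if __name__ == "__main__":
    for page_id, bbox in find_underlines(SAMPLE):
        print(page_id, crop_box_pixels(bbox, page_height=792))
```

Each crop box could then be used to cut the matching region out of a rendered page image before passing it to pytesseract. The 14-point text height is a guess that would need tuning to the font size of the document.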

Installation and Setup

To use this repository, you will need Python and several dependencies, including pytesseract for OCR capabilities.

Installing Python Dependencies

Ensure you have Python installed, then set up a virtual environment for the project (optional but recommended):

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install required Python libraries:

pip install -r requirements.txt

Installing Pytesseract

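pytesseract is a Python wrapper around the Tesseract OCR engine, so Tesseract itself must be installed on your system.

For Ubuntu: Run the following commands in your terminal to update your package list and install Tesseract OCR and its development libraries: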

sudo apt update
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

For Mac: Use Homebrew to install Tesseract by running the following command:

brew install tesseract

For Windows:

Download the installer from the Tesseract at UB Mannheim page.
Installing Tesseract into the default directory (C:\Program Files\Tesseract-OCR) is recommended. If pytesseract does not find the tesseract executable automatically after installation, set its path in your Python script:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
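
To check whether the executable is discoverable before hard-coding a path, something like the following may help. It is a minimal sketch using only the standard library; find_tesseract is a hypothetical helper, not part of this repository or of pytesseract, and the fallback location is the default Windows directory mentioned above.

```python
import shutil
from pathlib import Path

# Default install location on Windows, as described above.
DEFAULT_WINDOWS_PATH = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract(fallback=DEFAULT_WINDOWS_PATH):
    """Return the path to the tesseract executable, or None if not found."""
    # shutil.which searches the directories listed in PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try the fallback location.
    if Path(fallback).is_file():
        return str(fallback)
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found; install it or set the path manually.")
    else:
        # The result can be assigned to pytesseract.pytesseract.tesseract_cmd.
        print("Tesseract found at", path)
```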

sasha-korovkina/pdfUnderlinedExtractor
