Extracts text from module handbooks. Currently it only handles descriptions of competencies and requirements, but it can easily be extended for other purposes. (Only tested with module handbooks of Department VI of Beuth University of Applied Sciences Berlin.)
The algorithm searches each page of the PDF for specific keywords to identify areas that contain relevant data, and dynamically creates a bounding box around each area where such data is expected. Module handbooks at Beuth University are usually formatted as tables, so every desired data field has a descriptor/keyword, e.g. "Modulnummer" or "Lernziele/Kompetenzen". To calculate the bounding box correctly, the algorithm uses the next row in the same column as the terminator.
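The bounding-box idea described above can be illustrated with a small, library-free sketch. It is not the project's actual code: the `Box` type, the `value_bbox` helper, the coordinates, and the "Voraussetzungen" row are assumptions made for illustration. It assumes the usual PDF coordinate system (origin at the bottom-left, y growing upward) and that the value cell lies to the right of its keyword.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Axis-aligned rectangle in PDF points (origin bottom-left, y grows upward)."""
    x0: float
    y0: float
    x1: float
    y1: float


def value_bbox(keyword: Box, next_keyword: Optional[Box], page: Box) -> Box:
    """Return the region assumed to hold the value belonging to `keyword`.

    The region starts at the right edge of the keyword cell, spans to the
    right edge of the page, and runs from the top of the keyword down to
    the top of the next row's keyword (the terminator). If there is no
    next row on the page, it runs down to the bottom of the page.
    """
    left = keyword.x1
    right = page.x1
    top = keyword.y1
    bottom = next_keyword.y1 if next_keyword is not None else page.y0
    return Box(left, bottom, right, top)


# Hypothetical positions on an A4 page (595 x 842 points).
page = Box(0, 0, 595, 842)
lernziele = Box(50, 600, 180, 615)        # "Lernziele/Kompetenzen" keyword
voraussetzungen = Box(50, 480, 180, 495)  # assumed keyword of the next row

print(value_bbox(lernziele, voraussetzungen, page))
```

A box computed this way could then be handed to a PDF library to pull out the text inside it. Using the next keyword as the terminator lets the box grow or shrink with the amount of text in each row.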
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Python 3.7
- pipenv
- (optional) pyenv, to automatically install the required Python version
- If pyenv is not installed, Python 3.7 must already be installed; otherwise pyenv will install it
Set up a Python virtual environment and download all dependencies
$ pipenv install --dev
Enter the virtual environment
$ pipenv shell
Show help
$ python -m pdfextract -h
Extract descriptions of competencies and requirements to the console
$ python -m pdfextract example.pdf
You can add an output directory with the -o parameter. For each module, a folder is created containing a competencies.txt and a requirements.txt file, which hold the corresponding data.
$ python -m pdfextract -o ./out example.pdf
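The resulting directory layout can be sketched as follows. This is only an illustration, not the tool's implementation: the `write_module` helper, the use of a module ID as the folder name, and the sample texts are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_module(out_dir: Path, module_id: str, fields: Dict[str, str]) -> Path:
    """Create one folder per module and write its two text files.

    `module_id` as the folder name is an assumption; the file names
    follow the description above.
    """
    module_dir = out_dir / module_id
    module_dir.mkdir(parents=True, exist_ok=True)
    for name in ("competencies", "requirements"):
        # A missing field results in an empty file rather than an error.
        (module_dir / f"{name}.txt").write_text(fields.get(name, ""), encoding="utf-8")
    return module_dir


# Hypothetical extracted data for a single module.
with tempfile.TemporaryDirectory() as tmp:
    module_dir = write_module(
        Path(tmp),
        "M01",
        {"competencies": "Sample competencies text", "requirements": "None"},
    )
    written = sorted(p.name for p in module_dir.iterdir())

print(written)
```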
Unfortunately, there are no tests at the moment. :(
- Python 3.7
- pipenv - Python Development Workflow for Humans
- pdfquery - A fast and friendly PDF scraping library
- Timo Raschke - Initial work - traschke
This project is licensed under the MIT License - see the LICENSE file for details