Extracts competencies and requirements of modules from module handbooks of Beuth University of Applied Sciences Berlin.

traschke/bht-module-handbook-extractor

Module Handbook Extractor

Extracts text from module handbooks. It currently handles only the descriptions of competencies and requirements, but can easily be extended for other purposes. (Only tested with module handbooks of Department VI of Beuth University of Applied Sciences Berlin.)

How does it work?

The algorithm searches each page of the PDF for specific keywords to identify the areas that contain relevant data. Module handbooks at Beuth University are usually formatted as tables, so each desired data field has a descriptor/keyword, e.g. "Modulnummer" or "Lernziele/Kompetenzen". Based on these keywords, the algorithm dynamically creates a bounding box on each page around the area where the relevant data is expected. To calculate the bounding box correctly, it uses the next row of the column as the terminator.
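The bounding-box idea can be illustrated with a minimal sketch. This is not the project's actual code: the `Label` class, the `bounding_box_for` function, and the coordinate convention (origin at the top-left, y growing downwards) are assumptions made for illustration. It assumes the row labels of a page have already been located, and it takes the top edge of the next label as the terminator of the current row.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical representation of a keyword found on a page.
# Assumed coordinates: origin at the top-left, y grows downwards.
@dataclass
class Label:
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)

def bounding_box_for(
    keyword: str,
    labels: List[Label],
    page_width: float,
    page_height: float,
) -> Optional[BBox]:
    """Return the area to the right of `keyword`, ending at the next row."""
    # Order the labels from top to bottom so "next row" is well defined.
    rows = sorted(labels, key=lambda label: label.y0)
    for i, label in enumerate(rows):
        if label.text == keyword:
            # The next label's top edge terminates the box; on the last
            # row, fall back to the bottom of the page.
            bottom = rows[i + 1].y0 if i + 1 < len(rows) else page_height
            return (label.x1, label.y0, page_width, bottom)
    return None  # keyword does not occur on this page

labels = [
    Label("Modulnummer", 50, 100, 150, 115),
    Label("Lernziele/Kompetenzen", 50, 140, 150, 155),
]
box = bounding_box_for("Modulnummer", labels, page_width=595, page_height=842)
```

Text extracted from inside `box` would then be treated as the value of the "Modulnummer" field. Many PDF libraries place the origin at the bottom-left instead, in which case the comparisons on the y-axis would need to be flipped.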

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

  • Python 3.7
  • pipenv
  • (optional) pyenv to automatically install the required Python version
    • If pyenv is not installed, Python 3.7 must already be present; otherwise pyenv will install it

Installing

Set up a Python virtual environment and install all dependencies

$ pipenv install --dev

Running

Enter the virtual environment

$ pipenv shell

Show help

$ python -m pdfextract -h

Extraction

Extract descriptions of competencies and requirements and print them to the console

$ python -m pdfextract example.pdf

You can specify an output directory with the -o parameter. A folder is created for each module, containing a competencies.txt and a requirements.txt file that hold the corresponding data.

$ python -m pdfextract -o ./out example.pdf
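The resulting directory layout can be sketched as follows. This is an illustration only, not the project's implementation: the `write_module_output` function, the shape of the `modules` dictionary, and the use of the module name as the folder name are assumptions. Only the file names competencies.txt and requirements.txt come from the description above.

```python
from pathlib import Path
from typing import Dict

def write_module_output(out_dir: str, modules: Dict[str, Dict[str, str]]) -> None:
    """Write one folder per module with two text files (hypothetical helper)."""
    for module_name, fields in modules.items():
        # Assumption: the folder is named after the module.
        module_dir = Path(out_dir) / module_name
        module_dir.mkdir(parents=True, exist_ok=True)
        # Missing fields result in empty files.
        (module_dir / "competencies.txt").write_text(
            fields.get("competencies", ""), encoding="utf-8"
        )
        (module_dir / "requirements.txt").write_text(
            fields.get("requirements", ""), encoding="utf-8"
        )
```

Calling `write_module_output("./out", {"M01": {"competencies": "...", "requirements": "..."}})` would create `./out/M01/competencies.txt` and `./out/M01/requirements.txt`, where `M01` is a made-up module name.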

Running the tests

Unfortunately, there are no tests at the moment. :(

Built With

Authors

  • Timo Raschke - Initial work - traschke

License

This project is licensed under the MIT License. See the LICENSE file for details.
