SLUB Document Classification and Similarity Analysis

This project provides a library for bibliographic document classification and similarity analysis.

It contains a selection of methods that support:

  • pre-processing of bibliographic metadata and full-text documents,
  • training of multi-label multi-class classification models,
  • integrating and using hierarchical subject classifications (pruning methods, performance scores),
  • similarity analysis and clustering.

A detailed description including tutorials and examples can be found in the API documentation, which needs to be generated as described below.

Installation

This project requires Python v3.8 or above and uses pip for dependency management. It also uses PyTorch to train artificial neural networks on GPUs. Make sure to install the latest NVIDIA graphics drivers and check the further requirements.
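The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of slub_docsa: the helper names `python_is_supported` and `gpu_status` are hypothetical, and the check only assumes that PyTorch, if installed, is importable as `torch`.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def gpu_status():
    """Describe whether PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check also works without PyTorch

    return "cuda available" if torch.cuda.is_available() else "cuda not available"


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("PyTorch / GPU:", gpu_status())
```

If `gpu_status()` reports that CUDA is not available even though a GPU is present, the graphics driver installation is the first thing to check.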

Via Python Package Installer (not available yet)

Once the package is published to PyPI, install it via:

  • python3 -m pip install slub_docsa

From Source

Download the source code by checking out the repository:

  • git clone https://git.slub-dresden.de/lod/maschinelle-klassifizierung/docsa.git

Use make to install the Python dependencies, run the tests, and generate the documentation by executing the following commands:

  • make install or make install-test
    (installs slub_docsa package and downloads all required runtime / test dependencies via pip)
  • make test
    (runs tests to verify correct installation, requires test dependencies)
  • make docs
    (generates API documentation, requires test dependencies)
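When the make targets listed above are run from a script rather than by hand, they can be chained so that a failing step stops the sequence. This is a sketch under assumptions: `run_make_targets` is a hypothetical helper, and it presumes `make` is on the PATH and the script runs from the repository root.

```python
import subprocess


def run_make_targets(targets, runner=subprocess.run):
    """Run each make target in order, stopping at the first failure."""
    completed = []
    for target in targets:
        # check=True makes subprocess.run raise CalledProcessError
        # when make exits with a non-zero status
        runner(["make", target], check=True)
        completed.append(target)
    return completed


# Example call, using the targets named in this README:
# run_make_targets(["install-test", "test", "docs"])
```

The `runner` parameter exists only so the command sequence can be inspected or tested without actually invoking make.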

From Source using Ubuntu 20.04

Install essentials like python3, pip and make:

  • apt-get update
    (updates the Ubuntu package index)
  • apt-get install -y make python3 python3-pip
    (installs python3, pip and make)

Optionally, set up a python virtual environment:

  • apt-get install -y python3-venv
  • python3 -m venv /path/to/venv
  • source /path/to/venv/bin/activate
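The virtual environment can also be created from Python using the standard library `venv` module, which is what `python3 -m venv` calls internally. In this sketch, `create_virtualenv` is a hypothetical helper name and `/path/to/venv` remains a placeholder, as above.

```python
import venv
from pathlib import Path


def create_virtualenv(path, with_pip=True):
    """Create a virtual environment, similar to `python3 -m venv <path>`."""
    env_dir = Path(path)
    # venv.create defaults to with_pip=False, whereas the command line
    # installs pip by default, so the flag is passed explicitly here
    venv.create(env_dir, with_pip=with_pip)
    # pyvenv.cfg is written at the root of every environment venv creates
    return env_dir / "pyvenv.cfg"


# Example call:
# create_virtualenv("/path/to/venv")
```

Activating the environment still has to happen in the shell via the `source` command shown above, since activation modifies the current shell session.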

Run make commands as provided above:

  • make install-test
  • make test

Documentation

Further documentation of this project can be found at the following locations:

Development

Python Virtual Environment

Download all developer dependencies and install the slub_docsa package via pip in development mode:

  • make install-dev

This links your local checkout into the Python environment so that changes to source files take effect immediately without reinstalling; see pip install -e.
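One way to confirm that the development install points at your local checkout is to ask Python where it would import the package from. This is a minimal sketch: `module_origin` is a hypothetical helper, and it assumes the import name matches the package name `slub_docsa`.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Assumption: the package is importable as "slub_docsa"
    print(module_origin("slub_docsa"))
```

After a development-mode install, the printed path should lie inside your cloned repository rather than in the environment's site-packages directory. A result of `None` means the package is not installed in the active environment.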

Container Environment

This project also provides container images for development. You can use docker, but also other container runtimes, e.g., podman.

Install a Container Runtime

  • Either, install docker and docker-compose:

  • Or, set up podman in Fedora 34 including the Nvidia container runtime:

    • Install the Nvidia graphics drivers and check that they are working by running nvidia-smi
    • Install podman and podman-compose from repositories via dnf install podman podman-compose
    • Install the nvidia container runtime using the centos8 repositories via dnf install nvidia-container-runtime, see installation instructions
    • Set no-cgroups = true in /etc/nvidia-container-runtime/config.toml, which is required since Nvidia does not yet support cgroups v2
    • Check your CUDA version with nvidia-smi, e.g., 11.4
    • Identify the matching cuda docker image, e.g., nvidia/cuda:11.4.1-base-centos8
    • Verify gpu support in podman via podman run --security-opt=label=disable --rm nvidia/cuda:11.4.1-base-centos8 nvidia-smi
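The CUDA version check in the steps above can be scripted. The sketch below assumes that `nvidia-smi` prints a header containing `CUDA Version: X.Y`, as recent drivers do; the helper names are hypothetical. It deliberately does not derive an image tag, since the matching `nvidia/cuda` image still has to be looked up by hand.

```python
import re
import shutil
import subprocess

# Assumption: the nvidia-smi header contains text like "CUDA Version: 11.4"
CUDA_VERSION_RE = re.compile(r"CUDA Version:\s*(\d+)\.(\d+)")


def parse_cuda_version(smi_output):
    """Extract (major, minor) from nvidia-smi output, or None if absent."""
    match = CUDA_VERSION_RE.search(smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def detect_cuda_version():
    """Run nvidia-smi if it is on the PATH and parse its CUDA version."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return parse_cuda_version(result.stdout)
```

`detect_cuda_version()` returns `None` both when `nvidia-smi` is missing and when its output contains no version line, so a `None` result indicates that the driver setup needs a closer look.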

Set Up the Development Environment

  • Docker images for development can be found in the code/docker/devel directory.
  • Run build.sh gpu to build these docker images with GPU support.
  • Run up.sh gpu and down.sh gpu to start and shut down the development container.
  • Run shell_python.sh gpu to enter the python container with GPU support.
  • Run shell_annif.sh to enter the Annif container.

Set up Visual Studio Code, which supports many useful features during development:

Continuous Integration

  • The CI pipeline can be triggered by running make coverage and make lint. Together, these commands run automated tests using pytest, enforce code guidelines using pylint and flake8, and check for common security issues using bandit.