scriptotek/otsrdflib

MIT license

Ordered Turtle Serializer for rdflib

An extension to the rdflib Turtle serializer that adds order. This is useful if you keep Turtle files under version control and want clean diffs, or if you just want to publish a tidy, ordered file that is easier for humans to inspect.

$ pip install otsrdflib

Getting started:

from rdflib import Graph
from otsrdflib import OrderedTurtleSerializer

graph = Graph()
serializer = OrderedTurtleSerializer(graph)
with open('out.ttl', 'wb') as fp:
    serializer.serialize(fp)

Class order

By default, classes are ordered alphabetically by their URIs.

A custom order can be imposed by adding classes to the class_order attribute. For a SKOS vocabulary, for instance, you might want to sort the concept scheme first, followed by the other elements of the vocabulary:

from otsrdflib import OrderedTurtleSerializer
from rdflib import Graph
from rdflib.namespace import Namespace, SKOS

ISOTHES = Namespace('http://purl.org/iso25964/skos-thes#')

graph = Graph()
serializer = OrderedTurtleSerializer(graph)
serializer.class_order = [
    SKOS.ConceptScheme,
    SKOS.Concept,
    ISOTHES.ThesaurusArray,
]
with open('out.ttl', 'wb') as fp:
    serializer.serialize(fp)

Any class not included in the class_order list will be sorted alphabetically at the end, after the classes included in the list.
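The ordering rule described above can be sketched in plain Python, without rdflib. This only illustrates the rule and is not the library's implementation: the helper class_sort_key, the plain-string URIs, and the extra classes in the sample list are assumptions made for the example.

```python
# Illustrative sketch only: class_sort_key is a hypothetical helper,
# not part of otsrdflib. URIs are plain strings to keep it self-contained.
SKOS = 'http://www.w3.org/2004/02/skos/core#'
ISOTHES = 'http://purl.org/iso25964/skos-thes#'

class_order = [
    SKOS + 'ConceptScheme',
    SKOS + 'Concept',
    ISOTHES + 'ThesaurusArray',
]


def class_sort_key(class_uri):
    """Listed classes sort first, in list order; the rest sort alphabetically after them."""
    if class_uri in class_order:
        return (0, class_order.index(class_uri), '')
    return (1, 0, class_uri)


classes = [
    SKOS + 'Collection',
    ISOTHES + 'ThesaurusArray',
    SKOS + 'Concept',
    ISOTHES + 'ConceptGroup',
    SKOS + 'ConceptScheme',
]

ordered = sorted(classes, key=class_sort_key)
for uri in ordered:
    print(uri)
```

The three listed classes come out first in the order given, followed by the two unlisted classes in alphabetical order of their full URIs.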

Instance order

By default, instances of a class are ordered alphabetically by their URIs.

A custom order can be imposed by defining functions that generate sort keys from the URIs. For instance, you could define a function that extracts the numeric last part of a URI so that instances are sorted numerically:

serializer.sorters = [
    ('.*?/[^0-9]*([0-9.]+)$', lambda x: float(x[0])),
]

The first element of the tuple ('.*?/[^0-9]*([0-9.]+)$') is the regular expression to be matched against the URIs, while the second element (lambda x: float(x[0])) is the function that generates the sort key. In this case, it returns the first captured group as a float.
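To see what such a sorter does, here is a minimal standalone sketch that uses only the re module. It assumes that the key function receives the tuple of captured groups (as the x[0] indexing suggests) and that the pattern is applied with a search; the example.org URIs and the sort_key helper are made up for illustration.

```python
import re

# Same pattern and key function as in the sorter above.
pattern = '.*?/[^0-9]*([0-9.]+)$'
make_key = lambda x: float(x[0])

# Hypothetical URIs whose last part ends in a number.
uris = [
    'http://example.org/c10',
    'http://example.org/c9',
    'http://example.org/c2.5',
]


def sort_key(uri):
    # Assumption: the key function is called with the captured groups.
    match = re.search(pattern, uri)
    return make_key(match.groups())


alphabetical = sorted(uris)
numeric = sorted(uris, key=sort_key)

print(alphabetical)
print(numeric)
```

Sorted alphabetically, c10 comes before c2.5 and c9; with the numeric key, the order follows the numbers 2.5, 9 and 10.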

The patterns in sorters are tried against instances of any class. You can also define patterns that are only tried against instances of a specific class. Let's say you only wanted to sort instances of SKOS.Concept this way:

from rdflib.namespace import SKOS

serializer.sorters_by_class = {
    SKOS.Concept: [
        ('.*?/[^0-9]*([0-9.]+)$', lambda x: float(x[0])),
    ]
}

For a slightly more complicated example, let's look at Dewey. Classes in the main schedules are described by URIs like http://dewey.info/class/001.433/e23/, and we will use the class number (001.433) for sorting. But there are also table classes like http://dewey.info/class/1--0901/e23/. We want to sort these at the end, after the main schedules. To achieve this, we define two sorters, one that matches the table classes and one that matches the main schedule classes:

serializer.sorters = [
    (r'/([0-9A-Z\-]+)\-\-([0-9.\-;:]+)/e', lambda x: 'T{}--{}'.format(x[0], x[1])),  # table numbers
    (r'/([0-9.\-;:]+)/e', lambda x: 'A' + x[0]),  # main schedule numbers
]

By prefixing the table numbers with 'T' and the main schedule numbers with 'A', we ensure the table numbers are sorted after the main schedule numbers.
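The effect of the two prefixes can be checked with a small standalone sketch. It assumes that sorters are tried in list order, that the first matching pattern wins, and that the key function receives the captured groups; the sort_key helper and its fallback to the raw URI are hypothetical and only stand in for whatever the serializer does internally.

```python
import re

sorters = [
    (r'/([0-9A-Z\-]+)\-\-([0-9.\-;:]+)/e', lambda x: 'T{}--{}'.format(x[0], x[1])),  # table numbers
    (r'/([0-9.\-;:]+)/e', lambda x: 'A' + x[0]),  # main schedule numbers
]


def sort_key(uri):
    # Assumption: sorters are tried in order and the first match wins.
    for pattern, make_key in sorters:
        match = re.search(pattern, uri)
        if match:
            return make_key(match.groups())
    return uri  # hypothetical fallback: sort by the URI itself


uris = [
    'http://dewey.info/class/1--0901/e23/',
    'http://dewey.info/class/004/e23/',
    'http://dewey.info/class/001.433/e23/',
]

for uri in uris:
    print(uri, '->', sort_key(uri))

ordered = sorted(uris, key=sort_key)
```

Under these assumptions the order of the sorters matters: the main schedule pattern would also match a table URI such as 1--0901, so the table sorter has to come first in the list.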

Changes in version 0.5

  • The topClasses attribute was renamed to class_order to better reflect its content and comply with PEP8. It is now empty by default, since the previous default list was fairly arbitrary.
  • A sorters_by_class attribute was added to allow sorters to be defined per class.
