Merge pull request #86 from scossu/refactor_dstructs
Refactor dstructs
scossu committed Mar 28, 2019
2 parents 44ffaa9 + db7e356 commit b4dfa0e
Showing 93 changed files with 5,731 additions and 83,836 deletions.
25 changes: 21 additions & 4 deletions .gitignore
@@ -106,8 +106,25 @@ venv.bak/
# Pytest
.pytest_cache/

# Default LAKEsuperior data directories
/data
#/lakesuperior/store/base_lmdb_store.c
#/lakesuperior/store/ldp_rs/lmdb_triplestore.c
# Default Lakesuperior data directories
lakesuperior/data/ldprs_store
lakesuperior/data/ldpnr_store

# Cython business.
/cython_debug
/lakesuperior/store/*.c
/lakesuperior/store/*.html
/lakesuperior/store/ldp_rs/*.c
/lakesuperior/store/ldp_rs/*.html
/lakesuperior/model/*.c
/lakesuperior/model/*/*.html
/lakesuperior/model/*/*.c
/lakesuperior/model/*.html
/lakesuperior/util/*.c
/lakesuperior/util/*.html
!ext/lib

# Vim CTags file.
tags

!.keep
12 changes: 12 additions & 0 deletions .gitmodules
@@ -0,0 +1,12 @@
[submodule "ext/lmdb"]
path = ext/lmdb
url = https://github.com/LMDB/lmdb.git
[submodule "ext/tpl"]
path = ext/tpl
url = https://github.com/troydhanson/tpl.git
[submodule "ext/spookyhash"]
path = ext/spookyhash
url = https://github.com/centaurean/spookyhash.git
[submodule "ext/collections-c"]
path = ext/collections-c
url = https://github.com/srdja/Collections-C.git
3 changes: 3 additions & 0 deletions .travis.yml
@@ -3,11 +3,14 @@ language: python
matrix:
include:
- python: 3.6
dist: xenial
sudo: true
- python: 3.7
dist: xenial
sudo: true

install:
- pip install Cython==0.29.6 cymem
- pip install -e .
script:
- python setup.py test
14 changes: 12 additions & 2 deletions MANIFEST.in
@@ -1,7 +1,17 @@
include README.rst
include LICENSE
include fcrepo
graft ext
include bin/*
include ext/lmdb/libraries/liblmdb/mdb.c
include ext/lmdb/libraries/liblmdb/lmdb.h
include ext/lmdb/libraries/liblmdb/midl.c
include ext/lmdb/libraries/liblmdb/midl.h
include ext/collections-c/src/*.c
include ext/collections-c/src/include/*.h
include ext/tpl/src/tpl.c
include ext/tpl/src/tpl.h
include ext/spookyhash/src/*.c
include ext/spookyhash/src/*.h

graft lakesuperior/data/bootstrap
graft lakesuperior/endpoints/templates
graft lakesuperior/etc.defaults
45 changes: 23 additions & 22 deletions README.rst
@@ -3,43 +3,44 @@ Lakesuperior

|build status| |docs| |pypi| |codecov|

Lakesuperior is an alternative `Fedora
Repository <http://fedorarepository.org>`__ implementation.
Lakesuperior is a Linked Data repository software. It is capable of storing and
managing large volumes of files and their metadata regardless of their
format, size, ethnicity, gender identity or expression.

Fedora is a mature repository software system historically adopted by
major cultural heritage institutions. It exposes an
`LDP <https://www.w3.org/TR/ldp-primer/>`__ endpoint to manage
any type of binary files and their metadata in Linked Data format.
Lakesuperior is an alternative `Fedora Repository
<http://fedorarepository.org>`__ implementation. Fedora is a mature repository
software system historically adopted by major cultural heritage institutions
which extends the `Linked Data Platform <https://www.w3.org/TR/ldp-primer/>`__
protocol.

Guiding Principles
------------------

Lakesuperior aims at being an uncomplicated, efficient Fedora 4
implementation.
Lakesuperior aims at being a reliable and efficient Fedora 4 implementation.

Its main goals are:

- **Reliability:** Based on solid technologies with stability in mind.
- **Efficiency:** Small memory and CPU footprint, high scalability.
- **Ease of management:** Tools to perform monitoring and maintenance
included.
- **Ease of management:** Tools to perform migration, monitoring and
maintenance included.
- **Simplicity of design:** Straight-forward architecture, robustness
over features.

Key features
------------

- Drop-in replacement for Fedora4
- Very stable persistence layer based on
`LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
ACID-compliant writes guarantee consistency of data.
- Term-based search and SPARQL Query API + UI
- No performance penalty for storing many resources under the same
container, or having one resource link to many URIs
- Extensible provenance metadata tracking
- Multi-modal access: HTTP (REST), command line interface and native Python
API.
- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.
- Stores binary files and RDF metadata in one repository.
- Multi-modal access: REST/LDP, command line and native Python API.
- (`almost <fcrepo4_deltas>`_) Drop-in replacement for Fedora4
- Very stable persistence layer based on
`LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
ACID-compliant writes guarantee consistency of data.
- Term-based search and SPARQL Query API + UI
- No performance penalty for storing many resources under the same
container, or having one resource link to many URIs
- Extensible provenance metadata tracking
- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.

Installation & Documentation
----------------------------
@@ -50,7 +51,7 @@ With Docker::
cd lakesuperior
docker-compose up

With pip (assuming you are familiar with it)::
With pip (requires a C compiler to be installed)::

pip install lakesuperior

File renamed without changes.
2 changes: 1 addition & 1 deletion docker/docker_entrypoint
@@ -7,4 +7,4 @@ coilmq &
if [ ! -d /data/ldpnr_store ] && [ ! -d /data/ldprs_store ]; then
echo yes | lsup-admin bootstrap
fi
exec ./fcrepo
exec fcrepo
2 changes: 1 addition & 1 deletion docs/api.rst
@@ -10,7 +10,7 @@ The Lakesuperior API modules of most interest for a client are:
- :mod:`lakesuperior.api.query`
- :mod:`lakesuperior.api.admin`

:mod:`lakesuperior.model.ldpr` is used to manipulate resources.
:mod:`lakesuperior.model.ldp.ldpr` is used to manipulate resources.

The full API docs are listed below.

8 changes: 4 additions & 4 deletions docs/apidoc/lakesuperior.model.rst
@@ -7,31 +7,31 @@ Submodules
lakesuperior\.model\.ldp\_factory module
----------------------------------------

.. automodule:: lakesuperior.model.ldp_factory
.. automodule:: lakesuperior.model.ldp.ldp_factory
:members:
:undoc-members:
:show-inheritance:

lakesuperior\.model\.ldp\_nr module
-----------------------------------

.. automodule:: lakesuperior.model.ldp_nr
.. automodule:: lakesuperior.model.ldp.ldp_nr
:members:
:undoc-members:
:show-inheritance:

lakesuperior\.model\.ldp\_rs module
-----------------------------------

.. automodule:: lakesuperior.model.ldp_rs
.. automodule:: lakesuperior.model.ldp.ldp_rs
:members:
:undoc-members:
:show-inheritance:

lakesuperior\.model\.ldpr module
--------------------------------

.. automodule:: lakesuperior.model.ldpr
.. automodule:: lakesuperior.model.ldp.ldpr
:members:
:undoc-members:
:show-inheritance:
4 changes: 0 additions & 4 deletions docs/cli.rst
@@ -39,10 +39,6 @@ available in Windows. Windows users should look for alternative WSGI servers
to run the single-threaded service (``lsup-server``) over multiple processes
and/or threads.

**Note:** This is the only command line tool that is not added to the ``PATH``
environment variable in Unix systems (because it is not cross-platform). It
must be invoked by using its full path.

``lsup-admin``
--------------

19 changes: 15 additions & 4 deletions docs/fcrepo4_deltas.rst
@@ -8,10 +8,11 @@ clients will use it.
Not yet implemented (but in the plans)
--------------------------------------

- Various headers handling (partial)
- AuthN/Z
- Fixity check
- Blank nodes
- Various headers handling (partial)
- AuthN and WebAC-based authZ
- Fixity check
- Blank nodes (at least partly working, but untested)
- Multiple byte ranges for the ``Range`` request header

Potentially breaking changes
----------------------------
@@ -62,6 +63,16 @@ regardless of whether the tombstone exists or not.
Lakesuperior will return ``405`` only if the tombstone actually exists,
``404`` otherwise.

``Limit`` Header
~~~~~~~~~~~~~~~~

Lakesuperior does not support the ``Limit`` header, which in FCREPO can be used
to limit the number of "child" resources displayed for a container graph. In
FCREPO this header seems to serve a mostly cosmetic purpose, compensating for
performance limitations (displaying a page with many thousands of children in
the UI can take minutes). Since Lakesuperior already offers options in the
``Prefer`` header to not return any children, this option is not implemented.
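
As a rough illustration of the ``Prefer`` alternative, the sketch below builds
(but does not send) a request asking for a container representation without
its children. It uses the ``PreferMinimalContainer`` preference URI from the
W3C LDP specification. The host, port, path and the helper name
``minimal_container_request`` are hypothetical, and the exact preference URIs
that Lakesuperior honors should be checked against its own documentation.

```python
import urllib.request

# Preference URI defined by the W3C LDP specification.
LDP_MINIMAL = "http://www.w3.org/ns/ldp#PreferMinimalContainer"


def minimal_container_request(url):
    """Build a GET request asking for a container without its children.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    req = urllib.request.Request(url, headers={"Accept": "text/turtle"})
    req.add_header(
        "Prefer", f'return=representation; include="{LDP_MINIMAL}"'
    )
    return req


# The URL below is an assumption, not a documented default.
req = minimal_container_request("http://localhost:8000/ldp/my_container")
print(req.get_header("Prefer"))
```

Passing the resulting object to ``urllib.request.urlopen`` would perform the
actual request against a running server.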

Web UI
~~~~~~

8 changes: 8 additions & 0 deletions docs/release_notes.rst
@@ -2,6 +2,14 @@
Release Notes
=============

1.0 Alpha 19 HOTFIX
-------------------

*October 10, 2018*

A hotfix release was necessary to adjust settings for the source to build
correctly on Read The Docs and Docker Hub, and to package correctly on PyPI.

1.0 Alpha 18
------------

99 changes: 99 additions & 0 deletions docs/structures.rst
@@ -0,0 +1,99 @@
Data Structure Internals
========================

**(Draft)**

Lakesuperior has its own methods for handling in-memory graphs. These methods
rely on C data structures and are therefore much faster than Python/RDFLib
objects.

The graph data model modules are in :py:mod:`lakesuperior.model.graph`.

The Graph Data Model
--------------------

Triples are stored in a C hash set. Each triple is represented by a pointer to
a ``BufferTriple`` structure stored in a temporary memory pool. This pool is
tied to the life cycle of the ``SimpleGraph`` object it belongs to.

A triple structure contains three pointers to ``Buffer`` structures, which
contain a serialized version of an RDF term. These structures are stored in the
``SimpleGraph`` memory pool as well.

Each ``SimpleGraph`` object has a ``_terms`` property and a ``_triples``
property. These are C hash sets holding addresses of unique terms and
triples inserted in the graph. If the same term is entered more than once,
in any position in any triple, the first one entered is used and is pointed to
by the triple. This makes the graph data structure very compact.

In summary, the pointers can be represented this way::

<serialized term data in mem pool (x3)>
^ ^ ^
| | |
<Term structures in mem pool (x3)>
^ ^ ^
| | |
<Term struct addresses in _terms set (x3)>
^ ^ ^
| | |
<Triple structure in mem pool>
^
|
<address of triple in _triples set>

Let's say we insert the following triples in a ``SimpleGraph``::

<urn:s:0> <urn:p:0> <urn:o:0>
<urn:s:0> <urn:p:1> <urn:o:1>
<urn:s:0> <urn:p:1> <urn:o:2>
<urn:s:0> <urn:p:0> <urn:o:0>

The memory pool contains the following byte arrays of raw data, displayed in
the following list with their relative addresses (simplified to 8-bit
addresses and fixed-length byte strings for readability)::

0x00 <urn:s:0>
0x09 <urn:p:0>
0x12 <urn:o:0>

0x1b <urn:s:0>
0x24 <urn:p:1>
0x2d <urn:o:1>

0x36 <urn:s:0>
0x3f <urn:p:1>
0x48 <urn:o:2>

0x51 <urn:s:0>
0x5a <urn:p:0>
0x63 <urn:o:0>

However, the ``_terms`` set contains only ``Buffer`` structures pointing to
unique addresses::

0x00
0x09
0x12
0x24
0x2d
0x48

The other copies remain in the pool but are never referenced. They are
deallocated en masse when the ``SimpleGraph`` object is garbage collected.

The ``_triples`` set would then contain 3 unique entries pointing to the unique
term addresses::

0x00 0x09 0x12
0x00 0x24 0x2d
0x00 0x24 0x48

(In reality the addresses would belong to the structures pointing to the raw
data; this is just an illustrative example.)

The advantage of this approach is that the memory pool is contiguous and
append-only (until it gets purged), so adding to it is cheap. The sets, which
must maintain uniqueness and are the target of most operations (lookup,
adding, removing, slicing, copying, etc.), contain much less data and are
therefore faster.
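
The scheme can be mimicked in pure Python to make the bookkeeping concrete.
The following is only a toy model under simplifying assumptions: the class
``ToyGraph`` and its methods are hypothetical, offsets into a ``bytearray``
stand in for C pointers, and a dict and a set stand in for the C hash sets.
It is not the Cython implementation.

```python
class ToyGraph:
    """Toy model of an append-only pool plus deduplicating sets."""

    def __init__(self):
        self.pool = bytearray()  # append-only raw term data
        self._terms = {}         # serialized term -> offset of its first copy
        self._triples = set()    # unique (s, p, o) tuples of term offsets

    def _store(self, term):
        # Every inserted term is appended to the pool, duplicates included.
        offset = len(self.pool)
        self.pool += term
        # Only the first copy is registered; later copies resolve to it.
        return self._terms.setdefault(term, offset)

    def add(self, s, p, o):
        self._triples.add(tuple(self._store(t) for t in (s, p, o)))


g = ToyGraph()
for s, p, o in [
    (b"<urn:s:0>", b"<urn:p:0>", b"<urn:o:0>"),
    (b"<urn:s:0>", b"<urn:p:1>", b"<urn:o:1>"),
    (b"<urn:s:0>", b"<urn:p:1>", b"<urn:o:2>"),
    (b"<urn:s:0>", b"<urn:p:0>", b"<urn:o:0>"),
]:
    g.add(s, p, o)

# Unique term offsets and unique triples, printed as hexadecimal offsets.
print([hex(a) for a in sorted(g._terms.values())])
print(sorted(tuple(hex(a) for a in t) for t in g._triples))
```

With the four triples from the example above, the pool grows to twelve
9-byte entries, while the sets hold only six term offsets and three triples.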
1 change: 1 addition & 0 deletions ext/collections-c
Submodule collections-c added at 719fd8
