Cython (#77)
N.B. This merge has so many commits, many of which are experimental, that squashing is sensible. The Cython branch will be left in place to preserve the full commit history.

* Bare-bones base LMDB store.

* Add _open_env method.

* Link header files.

* Remove pure Python base LMDB store.

* Module path adjustments and further base LMDB store enhancements.

* Start work on LmdbTriplestore.

* WIP init dbi method.

* WIP: pxd file.

* LmdbTriplestore compiles, links and imports correctly.

* Add LMDB libraries. [ci skip]

* Separate DBI labels and config; LmdbTriplestore imports and inits.
[ci skip]

* Simplify error messages; rearrange declarations. [skip ci]

* WIP: get methods.
[ci skip]

* Simplify return code check further; run get_value.
[ci skip]

* Add put method; properly raise exceptions; pass unsigned char in cpdef.
[ci skip]

* Basic put & get method working; minimal one-off test script.

* Use transaction ctx.

* Rearrange and expand txn and cursor management methods.

* Add stats.

* Start moving functions to PYX file; remove LexicalSequence.

* Add _from_key().

* Add low-level lookup methods.
[ci skip]

* Change _check() format.

* Move _triple_keys method.
[ci skip]

* Inline _to_key(); add _to_triple_key().
[ci skip]

* Add logic for all unbound lookup.
[ci skip]

* _triple_keys() returns tuples.
[ci skip]

* Add _remove() method; separate Cython and Python triple_keys() methods.

* Add key_exists() method.
[ci skip]

* Make message in _check() optional.

* Add _all_terms().

* Add _index_triple().
[ci skip]

* Remove methods moved to Cython module from Python module.
[ci skip]

* Adapt bind() to new framework.
[ci skip]

* Adapt bind(), namespace(), prefix(), namespaces(), contexts().
[ci skip]

* Implement destroy(); minor adjustments.
[ci skip]

* Remove redundant methods and comment out non-core ones in Python module.
[ci skip]

* Add destroy definition in pxd file.
[ci skip]

* Replace all instances of TxnManager with txn_ctx in tests.
[ci skip]

* Remove redundant TxnManager class and encode/decode functions.
[ci skip]

* Adjust store management functions; retype some cpdefs.
[ci skip]

* Remove TxnManager everywhere.
[ci skip]

* Add Cython to dev requirements; rename requirements_rtd to _dev.
[ci skip]

* Some pre-test fixes.
[ci skip]

* Properly open and close store; avoid double free segfault.
[ci skip]

* Complete testing environment and transaction lifecycle.
[ci skip]

* Fix some tests; use default transaction in _cur_open().
[ci skip]

* Protect from deadlock in txn context; pass rollback test.
[ci skip]

* Insert meaningful data.
[ci skip]

* Replace string assignments with memcpy.
[ci skip]

* Fix typos.
[ci skip]

* Rework exception handling.
[ci skip]

* Use memcmp for array slice comparison.
[skip ci]

* Add specific exception for overwriting existing key.
[ci skip]

* Fix lookup_2bound; make memcpy syntax consistent.
[ci skip]

* Fix segfaults and a number of other issues.
[ci skip]

* Correct duplicate check.
[ci skip]

* Align lookup_1bound and lookup_2bound process.
[ci skip]

* Re-engineer ResultSet.
[ci skip]

* Fix _get_dup_data.
[ci skip]

* Fix ResultSet.to_tuple().
[ci skip]

* First attempt to fix all-unbound lookup.
[ci skip]

* Fix all-unbound lookup and broader issues.

* Ensure all cursors are closed.
* get_dbi() returns a handle.

[ci skip]

* Fix all-unbound lookup.
[ci skip]

* Refactor array notation.
[ci skip]

* _from_key() returns tuple.
[ci skip]

* Do not truncate terms in _add.
[ci skip]

* Put index cursor generation in right order.
[ci skip]

* Catch all KeyNotFoundErrors in lookups.
[ci skip]

* Fix simple delete test.
[ci skip]

* Fix namespace methods.
[ci skip]

* Change a bunch of function signatures to make more sense.
[ci skip]

* Move add_graph to Cython module.

* Pass add_context() and contexts() tests.

* Fix several issues with context handling; add a lot of logging.
[ci skip]

* Add tests:

* Adding graph with a RO transaction
* Deleting triples for not found graph

* Fix adding and deleting context, and ???c lookup.
[ci skip]

* Fix delete triples without a context.
[ci skip]

* Fix stats.
[ci skip]

* Correct txn reference in _append(). All LMDB tests pass locally.

* Adapt MetadataStore to BaseLmdbStore; fix test setup.

* Various integration fixes.
[ci skip]

* Fix s p o c lookup.
[ci skip]

* Fix RDF store setup and teardown.
[ci skip]

* Fix segfault on remove((s, p, o), c).
[ci skip]

* Expose txn_id() to Python; add a lot more debugging statements.

* Add cleanup tests for LMDB Store; remove `as` in txn_ctx calls.

* Fix triples not being completely deleted. All store tests pass.
[ci skip]

* Enable MetadataStore.

* Clear stale readers.
[ci skip]

* Fix transaction closing that got out of whack somehow.
[ci skip]

* Fix graph identifiers for contexts in triples().
[ci skip]

* Avoid a memcpy() in _from_key().
[ci skip]

* Return all contexts for triples() even when filtering by context.
[ci skip]

* Fix lookup for s p o ?.
[ci skip]

* Properly clean up context.
[ci skip]

* More tests.
[ci skip]

* Recurse into descendants when forgetting resource.
[ci skip]

* Temp fix for version test; fix get_raw; fix all_terms.
[ci skip]

* Fix get_inbound_rel()... really. All tests pass locally.
[ci skip]

* Comment out all debug statements.

* Reset metadata store at bootstrap.

* Replace RDFlib dataset methods with straight store methods.

* Logging adjustments.

* Move destroy() method to Cython module.

* Start using local transactions.

* Test script for concurrent transactions.

* Resolve some multiprocessing issues:

* Start GUnicorn via shell script
* Start one dbenv per process
* Move WSGI config to a module
* Resolve EAGAIN error with multi-worker server

* Add C files.

* Travis adjustments:

* Bump up Python version (min. 3.6, up to 3.7)
* Add Cython dependency
* Install LMDB libraries system-wide in Travis
* Remove py-lmdb from dev dependencies

* Open the metadata store in tests; don't segfault if txn started in a
closed store.

* Temporary fix for leaked readers on GUnicorn worker restarts.

* Remove Cython-generated C sources from VCS.

* Align console script naming; some cleanup.

* Conftest fixes.

* Add three indices but don't use memcpy. 10% write performance penalty.

* Save memcpy's.

* Move triples() inside Cython module.

* Lookup performance improvements:

* Use secondary indices
* Reduce complexity on lookup_2bound
* Parallelize lookup result triple assembly

* Avoid duplicate function calls in check_ref_int(); adjust logging.

* Replace IMR Graph class with lightweight SimpleGraph and Imr.

* Merge performance branch into simple_graph.

* Correct errors with SimpleGraph.

* Raise error if triple inserted is invalid.

* Parallelize data assembly in lookup_2bound.

* Fix offsets in lookup_1bound().

* Rename test folder to control execution order (ugly but effective).

* Move sandbox.

* Fix tests.

* Implement custom pickler; re-add C source bundles.

* Add term module.

* Add TPL header file.

* Add TPL header file.

* More performance optimizations

* Use OpenSSL C SHA1 function
* Separate from_key and from_trp_key

* Setuptools adjustments; support Python 3.7 in Travis.

* Setuptools adjustments; support Python 3.7 in Travis.

* Move metadata store out of ldp_rs store.

* Setuptools adjustments; support Python 3.7 in Travis.

* Fix memory leak in term module.

* Fix memory leak in term module.

* Comment out or delete code related to RDF hashing.

* Some code cleanup.

* Remove dependency from Cython for non-developers.

* Update documentation.

* Subtle rebranding.

* Minor cleanup.
scossu committed Oct 4, 2018
1 parent a688c50 commit acca1a1
Showing 81 changed files with 82,670 additions and 2,512 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -103,5 +103,11 @@ venv.bak/
# mypy
.mypy_cache/

# Pytest
.pytest_cache/

# Default LAKEsuperior data directories
/data
#/lakesuperior/store/base_lmdb_store.c
#/lakesuperior/store/ldp_rs/lmdb_triplestore.c
!ext/lib
13 changes: 9 additions & 4 deletions .travis.yml
@@ -1,7 +1,12 @@
sudo: false
language: python
python:
- "3.5"
- "3.6"
matrix:
include:
- python: 3.6
- python: 3.7
dist: xenial
sudo: true

install:
- pip install -e .
script:
@@ -15,6 +20,6 @@ deploy:
on:
tags: true
branch: master
python: "3.5"
python: "3.6"
distributions: "bdist_wheel"

1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,5 +1,6 @@
include README.rst
include LICENSE
include fcrepo
graft lakesuperior/data/bootstrap
graft lakesuperior/endpoints/templates
graft lakesuperior/etc.defaults
10 changes: 5 additions & 5 deletions README.rst
@@ -1,9 +1,9 @@
LAKEsuperior
Lakesuperior
============

|build status| |docs| |pypi| |codecov|

LAKEsuperior is an alternative `Fedora
Lakesuperior is an alternative `Fedora
Repository <http://fedorarepository.org>`__ implementation.

Fedora is a mature repository software system historically adopted by
@@ -14,7 +14,7 @@ any type of binary files and their metadata in Linked Data format.
Guiding Principles
------------------

LAKEsuperior aims at being an uncomplicated, efficient Fedora 4
Lakesuperior aims at being an uncomplicated, efficient Fedora 4
implementation.

Its main goals are:
@@ -33,9 +33,9 @@ Key features
- Very stable persistence layer based on
`LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
ACID-compliant writes guarantee consistency of data.
- Term-based search (*planned*) and SPARQL Query API + UI
- Term-based search and SPARQL Query API + UI
- No performance penalty for storing many resources under the same
container
container, or having one resource link to many URIs
- Extensible provenance metadata tracking
- Multi-modal access: HTTP (REST), command line interface and native Python
API.
11 changes: 9 additions & 2 deletions conftest.py
@@ -1,3 +1,4 @@
import logging
import pytest

from os import makedirs, path
@@ -35,17 +36,19 @@ def db(app):
'''
Set up and tear down test triplestore.
'''
makedirs(data_dir, exist_ok=True)
env.app_globals.rdfly.bootstrap()
env.app_globals.nonrdfly.bootstrap()
print('Initialized data store.')
env.app_globals.rdf_store.open_env(
env.app_globals.rdf_store.env_path)

yield env.app_globals.rdfly

# TODO improve this by using tempfile.TemporaryDirectory as a context
# manager.
print('Removing fixture data directory.')
rmtree(data_dir)
env.app_globals.rdf_store.close_env()
env.app_globals.rdf_store.destroy()


@pytest.fixture
@@ -56,3 +59,7 @@ def rnd_img():
return random_image(8, 256)


@pytest.fixture(autouse=True)
def disable_logging():
"""Disable logging in all tests."""
logging.disable(logging.INFO)
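
Reviewer note: the effect of the new ``disable_logging`` fixture can be sketched outside pytest. This is a minimal illustration using only the standard library, not code from the commit; the logger name ``demo`` and the ``ListHandler`` and ``capture`` helpers are hypothetical. ``logging.disable(logging.INFO)`` suppresses records at ``INFO`` and below process-wide, while ``WARNING`` and above still get through.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: collect emitted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture(disable_level=None):
    """Log one message per level and return the messages that were emitted."""
    logger = logging.getLogger('demo')
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    if disable_level is not None:
        # Same call as in the fixture: mute this level and everything below.
        logging.disable(disable_level)
    try:
        logger.debug('debug message')
        logger.info('info message')
        logger.warning('warning message')
    finally:
        # Restore normal behavior so later code is not affected.
        logging.disable(logging.NOTSET)
        logger.removeHandler(handler)
    return handler.messages


baseline = capture()
quiet = capture(logging.INFO)
```

Unlike this sketch, the fixture shown in the diff does not reset the level afterwards, so the setting stays in effect once the first test has run.
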
31 changes: 16 additions & 15 deletions docs/about.rst
@@ -1,7 +1,7 @@
About LAKEsuperior
About Lakesuperior
==================

LAKEsuperior is an alternative `Fedora
Lakesuperior is an alternative `Fedora
Repository <http://fedorarepository.org>`__ implementation.

Fedora is a mature repository software system historically adopted by
@@ -12,7 +12,7 @@ any type of binary files and their metadata in Linked Data format.
Guiding Principles
------------------

LAKEsuperior aims at being an uncomplicated, efficient Fedora 4
Lakesuperior aims at being an uncomplicated, efficient Fedora 4
implementation.

Its main goals are:
@@ -33,54 +33,55 @@ Key features
- Very stable persistence layer based on
`LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
ACID-compliant writes guarantee consistency of data.
- Term-based search (*planned*) and SPARQL Query API + UI
- Term-based search and SPARQL Query API + UI
- No performance penalty for storing many resources under the same
container; no
`kudzu <https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml>`__
container; no `kudzu
<https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml>`__
pairtree segmentation [#]_
- Extensible :doc:`provenance metadata <model>` tracking
- :doc:`Multi-modal access <architecture>`: HTTP
(REST), command line interface and native Python API.
- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.
- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick [#]_.

Implementation of the official `Fedora API
specs <https://fedora.info/spec/>`__ (Fedora 5.x and beyond) is not
foreseen in the short term, however it would be a natural evolution of
this project if it gains support.

Please make sure you read the :doc:`Delta
document <fcrepo4_deltas>` for divergences with the
official Fedora4 implementation.
Please make sure you read the :doc:`Delta document <fcrepo4_deltas>` for
divergences with the official Fedora4 implementation.

Target Audience
---------------

LAKEsuperior is for anybody who cares about preserving data in the long
Lakesuperior is for anybody who cares about preserving data in the long
term.

Less vaguely, LAKEsuperior is targeted at who needs to store large
Less vaguely, Lakesuperior is targeted at who needs to store large
quantities of highly linked metadata and documents.

Its Python/C environment and API make it particularly well suited for
academic and scientific environments who would be able to embed it in a
Python application as a library or extend it via plug-ins.

LAKEsuperior is able to be exposed to the Web as a `Linked Data
Lakesuperior is able to be exposed to the Web as a `Linked Data
Platform <https://www.w3.org/TR/ldp-primer/>`__ server. It also acts as
a SPARQL query (read-only) endpoint, however it is not meant to be used
as a full-fledged triplestore at the moment.

In its current status, LAKEsuperior is aimed at developers and hands-on
In its current status, Lakesuperior is aimed at developers and hands-on
managers who are interested in evaluating this project.

Status and development
----------------------

LAKEsuperior is in **alpha** status. Please see the `project
Lakesuperior is in **alpha** status. Please see the `project
issues <https://github.com/scossu/lakesuperior/issues>`__ list for a
rudimentary road map.

--------------

.. [#] However if your client splits pairtrees upstream, such as Hyrax does,
that obviously needs to change to get rid of the path segments.
.. [#] Your mileage may vary depending on the variety of your triples.
2 changes: 1 addition & 1 deletion docs/api.rst
@@ -4,7 +4,7 @@ API Documentation
Main Interface
--------------

The LAKEsuperior API modules of most interest for a client are:
The Lakesuperior API modules of most interest for a client are:

- :mod:`lakesuperior.api.resource`
- :mod:`lakesuperior.api.query`
14 changes: 7 additions & 7 deletions docs/architecture.rst
@@ -1,14 +1,14 @@
LAKEsuperior Architecture
Lakesuperior Architecture
=========================

LAKEsuperior is written in Python. It is not excluded that parts of the
Lakesuperior is written in Python. It is not excluded that parts of the
code may be rewritten in `Cython <http://cython.readthedocs.io/>`__ for
performance.

Multi-Modal Access
------------------

LAKEsuperior services and data are accessible in multiple ways:
Lakesuperior services and data are accessible in multiple ways:

- Via HTTP. This is the canonical way to interact with LDP resources
and conforms quite closely to the Fedora specs (currently v4).
@@ -17,18 +17,18 @@ LAKEsuperior services and data are accessible in multiple ways:
- Via a Python API. This method allows to use Python scripts to access
the same methods available to the two methods above in a programmatic
way. It is possible to write Python plugins or even to embed
LAKEsuperior in a Python application, even without running a web
Lakesuperior in a Python application, even without running a web
server.

Architecture Overview
---------------------

.. figure:: assets/lakesuperior_arch.png
:alt: LAKEsuperior Architecture
:alt: Lakesuperior Architecture

LAKEsuperior Architecture
Lakesuperior Architecture

The LAKEsuperior REST API provides access to the underlying Python API.
The Lakesuperior REST API provides access to the underlying Python API.
All REST and CLI operations can be replicated by a Python program
accessing this API.

10 changes: 5 additions & 5 deletions docs/cli.rst
@@ -1,9 +1,9 @@
Command Line Reference
======================

LAKEsuperior comes with some command-line tools aimed at several purposes.
Lakesuperior comes with some command-line tools aimed at several purposes.

If LAKEsuperior is installed via ``pip``, all tools can be invoked as normal
If Lakesuperior is installed via ``pip``, all tools can be invoked as normal
commands (i.e. they are in the virtualenv ``PATH``).

The tools are currently not directly available on Docker instances (*TODO add
@@ -16,7 +16,7 @@ This is the main server command. It has no parameters. The command spawns
Gunicorn workers (as many as set up in the configuration) and can be sent in
the background, or started via init script.

The tool must be run in the same virtual environment LAKEsuperior
The tool must be run in the same virtual environment Lakesuperior
was installed in (if it was)—i.e.::

source <virtualenv root>/bin/activate
@@ -44,7 +44,7 @@ self-documented, so this is just a redundant overview::
check_fixity [STUB] Check fixity of a resource.
check_refint Check referential integrity.
cleanup [STUB] Clean up orphan database items.
migrate Migrate an LDP repository to LAKEsuperior.
migrate Migrate an LDP repository to Lakesuperior.
stats Print repository statistics.

All entries marked ``[STUB]`` are not yet implemented, however the
@@ -65,7 +65,7 @@ The command has no options but prompts the user for a few settings
interactively (N.B. this may change in favor of parameters).

The benchmark tool is able to create RDF sources, or non-RDF, or an equal mix
of them, via POST or PUT, in the currently running LAKEsuperior server. It
of them, via POST or PUT, in the currently running Lakesuperior server. It
runs single-threaded.

The RDF sources are randomly generated graphs of consistent size and
6 changes: 3 additions & 3 deletions docs/contributing.rst
@@ -1,14 +1,14 @@
Contributing to LAKEsuperior
Contributing to Lakesuperior
============================

LAKEsuperior has been so far a single person’s off-hours project (with much
Lakesuperior has been so far a single person’s off-hours project (with much
very valuable input from several sides). In order to turn into anything close
to a Beta release and eventually to a production-ready implementation, it
needs some community love.

Contributions are welcome in all forms, including ideas, issue reports,
or even just spinning up the software and providing some feedback.
LAKEsuperior is meant to live as a community project.
Lakesuperior is meant to live as a community project.

.. _dev_setup:

14 changes: 7 additions & 7 deletions docs/discovery.rst
@@ -1,7 +1,7 @@
Resource Discovery & Query
==========================

LAKEsuperior offers several way to programmatically discover resources and
Lakesuperior offers several way to programmatically discover resources and
data.

LDP Traversal
@@ -20,12 +20,12 @@ SPARQL Query
------------

A `SPARQL <https://www.w3.org/TR/sparql11-query/>`__ endpoint is available in
LAKEsuperior both as an API and a Web UI.
Lakesuperior both as an API and a Web UI.

.. figure:: assets/lsup_sparql_query_ui.png
:alt: LAKEsuperior SPARQL Query Window
:alt: Lakesuperior SPARQL Query Window

LAKEsuperior SPARQL Query Window
Lakesuperior SPARQL Query Window

The UI is based on `YASGUI <http://about.yasgui.org/>`__.

@@ -57,7 +57,7 @@ query in a graph with more than a few thousands resources::

What the RDFLib implementation does is going over every single graph in the
repository and perform the ``?s ?p ?o`` query on each of them. Since
LAKEsuperior creates several graphs per resource, this can run for a very long
Lakesuperior creates several graphs per resource, this can run for a very long
time in any decently sized data set.
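
The cost described above can be pictured with a toy model. This is only an illustrative sketch in plain Python, not RDFLib's or Lakesuperior's actual code: ``build_store``, ``scan_all_graphs`` and ``build_subject_index`` are hypothetical names, and the store is assumed to be a plain dictionary holding three small named graphs per resource.

```python
def build_store(n_resources, graphs_per_resource=3):
    """Build a toy quad store: {graph name: set of (s, p, o) triples}."""
    graphs = {}
    for i in range(n_resources):
        subject = f'res:{i}'
        for j in range(graphs_per_resource):
            graphs[f'graph:{i}/{j}'] = {
                (subject, f'pred:{j}', f'obj:{i}-{j}'),
            }
    return graphs


def scan_all_graphs(graphs):
    """Emulate an all-unbound pattern per graph: visit every graph in turn."""
    visited = 0
    quads = []
    for name, triples in graphs.items():
        visited += 1
        quads.extend((name, *triple) for triple in triples)
    return visited, quads


def build_subject_index(graphs):
    """Build a subject -> quads index once (assumption: a term-based lookup
    can rely on some index of this kind)."""
    index = {}
    for name, triples in graphs.items():
        for s, p, o in triples:
            index.setdefault(s, []).append((name, s, p, o))
    return index


store = build_store(1000)
visited, quads = scan_all_graphs(store)
index = build_subject_index(store)
hits = index.get('res:42', [])
```

In this model the per-graph scan touches every graph in the store, so its work grows with the number of resources, whereas a lookup for one bound subject only returns the handful of quads attached to it.
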

The solution to this is either to omit the graph query, or use a term search,
@@ -67,9 +67,9 @@ Term Search
-----------

.. figure:: assets/lsup_term_search.png
:alt: LAKEsuperior Term Search Window
:alt: Lakesuperior Term Search Window

LAKEsuperior Term Search Window
Lakesuperior Term Search Window

This feature provides a discovery tool focused on resource subjects and based
on individual term match and comparison. It tends to be more manageable than