Cython (#77)

N.B. This merge has so many commits, many of which are experimental, that squashing will be sensible. The Cython branch will be left to preserve the full commit history.

* Bare-bones base LMDB store.
* Add _open_env method.
* Link header files.
* Remove pure Python base LMDB store.
* Module path adjustments and further base LMDB store enhancements.
* Start work on LmdbTriplestore.
* WIP init dbi method.
* WIP: pxd file.
* LmdbTriplestore compiles, links and imports correctly.
* Add LMDB libraries. [ci skip]
* Separate DBI labels and config; LmdbTriplestore imports and inits. [ci skip]
* Simplify error messages; rearrange declarations. [skip ci]
* WIP: get methods. [ci skip]
* Simplify return code check further; run get_value. [ci skip]
* Add put method; properly raise exceptions; pass unsigned char in cpdef. [ci skip]
* Basic put & get method working; minimal one-off test script.
* Use transaction ctx.
* Rearrange and expand txn and cursor management methods.
* Add stats.
* Start moving functions to PYX file; remove LexicalSequence.
* Add _from_key().
* Add low-level lookup methods. [ci skip]
* Change _check() format.
* Move _triple_keys method. [ci skip]
* Inline _to_key(); add _to_triple_key(). [ci skip]
* Add logic for all unbound lookup. [ci skip]
* _triple_keys() returns tuples. [ci skip]
* Add _remove() method; separate Cython and Python triple_keys() methods.
* Add key_exists() method. [ci skip]
* Make message in _check() optional.
* Add _all_terms().
* Add _index_triple(). [ci skip]
* Remove methods moved to Cython module from Python module. [ci skip]
* Adapt bind() to new framework. [ci skip]
* Adapt bind(), namespace(), prefix(), namespaces(), contexts(). [ci skip]
* Implement destroy(); minor adjustments. [ci skip]
* Remove redundant methods and comment out non-core ones in Python module. [ci skip]
* Add destroy definition in pxd file. [ci skip]
* Replace all instances of TxnManager with txn_ctx in tests. [ci skip]
* Remove redundant TxnManager class and encode/decode functions. [ci skip]
* Adjust store management functions; retype some cpdefs. [ci skip]
* Remove TxnManager everywhere. [ci skip]
* Add Cython to dev requirements; rename requirements_rtd to _dev. [ci skip]
* Some pre-test fixes. [ci skip]
* Properly open and close store; avoid double free segfault. [ci skip]
* Complete testing environment and transaction lifecycle. [ci skip]
* Fix some tests; use default transaction in _cur_open(). [ci skip]
* Protect from deadlock in txn context; pass rollback test. [ci skip]
* Insert meaningful data. [ci skip]
* Replace string assignments with memcpy. [ci skip]
* Fix typos. [ci skip]
* Rework exception handling. [ci skip]
* Use memcmp for array slice comparison. [skip ci]
* Add specific exception for overwriting existing key. [ci skip]
* Fix lookup_2bound; make memcpy syntax consistent. [ci skip]
* Fix segfaults and a number of other issues. [ci skip]
* Correct duplicate check. [ci skip]
* Align lookup_1bound and lookup_2bound process. [ci skip]
* Re-engineer ResultSet. [ci skip]
* Fix _get_dup_data. [ci skip]
* Fix ResultSet.to_tuple(). [ci skip]
* First attempt to fix all-unbound lookup. [ci skip]
* Fix all-unbound lookup and broader issues.
* Ensure all cursors are closed.
* get_dbi() returns a handle. [ci skip]
* Fix all-unbound lookup. [ci skip]
* Refactor array notation. [ci skip]
* _from_key() returns tuple. [ci skip]
* Do not truncate terms in _add. [ci skip]
* Put index cursor generation in right order. [ci skip]
* Catch all KeyNotFoundErrors in lookups. [ci skip]
* Fix simple delete test. [ci skip]
* Fix namespace methods. [ci skip]
* Change a bunch of function signatures to make more sense. [ci skip]
* Move add_graph to Cython module.
* Pass add_context() and contexts() tests.
* Fix several issues with context handling; add a lot of logging. [ci skip]
* Add tests:
  * Adding graph with a RO transaction
  * Deleting triples for not found graph
* Fix adding and deleting context, and ???c lookup. [ci skip]
* Fix delete triples without a context. [ci skip]
* Fix stats. [ci skip]
* Correct txn reference in _append(). All LMDB tests pass locally.
* Adapt MetadataStore to BaseLmdbStore; fix test setup.
* Various integration fixes. [ci skip]
* Fix s p o c lookup. [ci skip]
* Fix RDF store setup and teardown. [ci skip]
* Fix segfault on remove((s, p, o), c). [ci skip]
* Expose txn_id() to Python; add a lot more debugging statements.
* Add cleanup tests for LMDB Store; remove `as` in txn_ctx calls.
* Fix triples not being completely deleted. All store tests pass. [ci skip]
* Enable MetadataStore.
* Clear stale readers. [ci skip]
* Fix transaction closing that got out of whack somehow. [ci skip]
* Fix graph identifiers for contexts in triples(). [ci skip]
* Avoid a memcpy() in _from_key(). [ci skip]
* Return all contexts for triples() even when filtering by context. [ci skip]
* Fix lookup for s p o ?. [ci skip]
* Properly clean up context. [ci skip]
* More tests. [ci skip]
* Recurse into descendants when forgetting resource. [ci skip]
* Temp fix for version test; fix get_raw; fix all_terms. [ci skip]
* Fix get_inbound_rel()... really. All tests pass locally. [ci skip]
* Comment out all debug statements.
* Reset metadata store at bootstrap.
* Replace RDFlib dataset methods with straight store methods.
* Logging adjustments.
* Move destroy() method to Cython module.
* Start using local transactions.
* Test script for concurrent transactions.
* Resolve some multiprocessing issues:
  * Start GUnicorn via shell script
  * Start one dbenv per process
  * Move WSGI config to a module
  * Resolve EAGAIN error with multi-worker server
* Add C files.
* Travis adjustments:
  * Bump up Python version (min. 3.6, up to 3.7)
  * Add Cython dependency
  * Install LMDB libraries system-wide in Travis
  * Remove py-lmdb from dev dependencies
* Open the metadata store in tests; don't segfault if txn started in a closed store.
* Temporary fix for leaked readers on GUnicorn worker restarts.
* Remove Cython-generated C sources from VCS.
* Align console script naming; some cleanup.
* Conftest fixes.
* Add three indices but don't use memcpy. 10% write performance penalty.
* Save memcpy's.
* Move triples() inside Cython module.
* Lookup performance improvements:
  * Use secondary indices
  * Reduce complexity on lookup_2bound
  * Parallelize lookup result triple assembly
* Avoid duplicate function calls in check_ref_int(); adjust logging.
* Replace IMR Graph class with lightweight SimpleGraph and Imr.
* Merge performance branch into simple_graph.
* Correct errors with SimpleGraph.
* Raise error if triple inserted is invalid.
* Parallelize data assembly in lookup_2bound.
* Fix offsets in lookup_1bound().
* Rename test folder to control execution order (ugly but effective).
* Move sandbox.
* Fix tests.
* Implement custom pickler; re-add C source bundles.
* Add term module.
* Add TPL header file.
* Add TPL header file.
* More performance optimizations:
  * Use OpenSSL C SHA1 function
  * Separate from_key and from_trp_key
* Setuptools adjustments; support Python 3.7 in Travis.
* Setuptools adjustments; support Python 3.7 in Travis.
* Move metadata store out of ldp_rs store.
* Setuptools adjustments; support Python 3.7 in Travis.
* Fix memory leak in term module.
* Fix memory leak in term module.
* Comment out or delete code related to RDF hashing.
* Some code cleanup.
* Remove dependency from Cython for non-developers.
* Update documentation.
* Subtle rebranding.
* Minor cleanup.
scossu · Oct 4, 2018 · acca1a1
1 parent a688c50
commit acca1a1
Showing 81 changed files with 82,670 additions and 2,512 deletions.
diff --git a/.gitignore b/.gitignore
@@ -103,5 +103,11 @@ venv.bak/
 # mypy
 .mypy_cache/
 
+# Pytest
+.pytest_cache/
+
 # Default LAKEsuperior data directories
 /data
+#/lakesuperior/store/base_lmdb_store.c
+#/lakesuperior/store/ldp_rs/lmdb_triplestore.c
+!ext/lib
diff --git a/.travis.yml b/.travis.yml
@@ -1,7 +1,12 @@
+sudo: false
 language: python
-python:
-  - "3.5"
-  - "3.6"
+matrix:
+    include:
+    - python: 3.6
+    - python: 3.7
+      dist: xenial
+      sudo: true
+
 install:
   - pip install -e .
 script:
@@ -15,6 +20,6 @@ deploy:
     on:
         tags: true
         branch: master
-        python: "3.5"
+        python: "3.6"
     distributions: "bdist_wheel"
 
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,5 +1,6 @@
 include README.rst
 include LICENSE
+include fcrepo
 graft lakesuperior/data/bootstrap
 graft lakesuperior/endpoints/templates
 graft lakesuperior/etc.defaults
diff --git a/README.rst b/README.rst
@@ -1,9 +1,9 @@
-LAKEsuperior
+Lakesuperior
 ============
 
 |build status| |docs| |pypi| |codecov|
 
-LAKEsuperior is an alternative `Fedora
+Lakesuperior is an alternative `Fedora
 Repository <http://fedorarepository.org>`__ implementation.
 
 Fedora is a mature repository software system historically adopted by
@@ -14,7 +14,7 @@ any type of binary files and their metadata in Linked Data format.
 Guiding Principles
 ------------------
 
-LAKEsuperior aims at being an uncomplicated, efficient Fedora 4
+Lakesuperior aims at being an uncomplicated, efficient Fedora 4
 implementation.
 
 Its main goals are:
@@ -33,9 +33,9 @@ Key features
 -  Very stable persistence layer based on
    `LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
    ACID-compliant writes guarantee consistency of data.
--  Term-based search (*planned*) and SPARQL Query API + UI
+-  Term-based search and SPARQL Query API + UI
 -  No performance penalty for storing many resources under the same
-   container
+   container, or having one resource link to many URIs
 -  Extensible provenance metadata tracking
 -  Multi-modal access: HTTP (REST), command line interface and native Python
    API.

diff --git a/conftest.py b/conftest.py
@@ -1,3 +1,4 @@
+import logging
 import pytest
 
 from os import makedirs, path
@@ -35,17 +36,19 @@ def db(app):
     '''
     Set up and tear down test triplestore.
     '''
-    makedirs(data_dir, exist_ok=True)
     env.app_globals.rdfly.bootstrap()
     env.app_globals.nonrdfly.bootstrap()
     print('Initialized data store.')
+    env.app_globals.rdf_store.open_env(
+            env.app_globals.rdf_store.env_path)
 
     yield env.app_globals.rdfly
 
     # TODO improve this by using tempfile.TemporaryDirectory as a context
     # manager.
     print('Removing fixture data directory.')
-    rmtree(data_dir)
+    env.app_globals.rdf_store.close_env()
+    env.app_globals.rdf_store.destroy()
 
 
 @pytest.fixture
@@ -56,3 +59,7 @@ def rnd_img():
     return random_image(8, 256)
 
 
+@pytest.fixture(autouse=True)
+def disable_logging():
+    """Disable logging in all tests."""
+    logging.disable(logging.INFO)
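Comment on the ``db`` fixture in conftest.py: the TODO about ``tempfile.TemporaryDirectory`` could be approached roughly as sketched below. This is only a sketch under stated assumptions, not the project's code: ``open_env`` and ``close_env`` are hypothetical callables standing in for the store's own open and close methods, and the generator mimics the set up / ``yield`` / tear down shape of a pytest fixture without depending on pytest.

```python
import os
import tempfile


def db_fixture_sketch(open_env, close_env):
    """Generator shaped like a pytest fixture: set up, yield, tear down.

    ``open_env`` and ``close_env`` are hypothetical stand-ins for the
    store's open/close methods; they are not part of any real API.
    """
    # The context manager removes the directory when the block exits,
    # including when a test raises.
    with tempfile.TemporaryDirectory(prefix="lsup_test_") as data_dir:
        open_env(data_dir)
        try:
            yield data_dir
        finally:
            # Close the environment before the directory is deleted.
            close_env()
```

In a real test suite the function would presumably be decorated with ``@pytest.fixture`` and call the store methods directly; the ``try``/``finally`` keeps the close step ahead of the directory removal so no open handle points at a deleted path.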
diff --git a/docs/about.rst b/docs/about.rst
@@ -1,7 +1,7 @@
-About LAKEsuperior
+About Lakesuperior
 ==================
 
-LAKEsuperior is an alternative `Fedora
+Lakesuperior is an alternative `Fedora
 Repository <http://fedorarepository.org>`__ implementation.
 
 Fedora is a mature repository software system historically adopted by
@@ -12,7 +12,7 @@ any type of binary files and their metadata in Linked Data format.
 Guiding Principles
 ------------------
 
-LAKEsuperior aims at being an uncomplicated, efficient Fedora 4
+Lakesuperior aims at being an uncomplicated, efficient Fedora 4
 implementation.
 
 Its main goals are:
@@ -33,54 +33,55 @@ Key features
 -  Very stable persistence layer based on
    `LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
    ACID-compliant writes guarantee consistency of data.
--  Term-based search (*planned*) and SPARQL Query API + UI
+-  Term-based search and SPARQL Query API + UI
 -  No performance penalty for storing many resources under the same
-   container; no
-   `kudzu <https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml>`__
+   container; no `kudzu
+   <https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml>`__
    pairtree segmentation [#]_ 
 -  Extensible :doc:`provenance metadata <model>` tracking
 -  :doc:`Multi-modal access <architecture>`: HTTP
    (REST), command line interface and native Python API.
--  Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.
+-  Fits in a pocket: you can carry 50M triples in an 8Gb memory stick [#]_.
 
 Implementation of the official `Fedora API
 specs <https://fedora.info/spec/>`__ (Fedora 5.x and beyond) is not
 foreseen in the short term, however it would be a natural evolution of
 this project if it gains support.
 
-Please make sure you read the :doc:`Delta
-document <fcrepo4_deltas>` for divergences with the
-official Fedora4 implementation.
+Please make sure you read the :doc:`Delta document <fcrepo4_deltas>` for
+divergences with the official Fedora4 implementation.
 
 Target Audience
 ---------------
 
-LAKEsuperior is for anybody who cares about preserving data in the long
+Lakesuperior is for anybody who cares about preserving data in the long
 term.
 
-Less vaguely, LAKEsuperior is targeted at who needs to store large
+Less vaguely, Lakesuperior is targeted at who needs to store large
 quantities of highly linked metadata and documents.
 
 Its Python/C environment and API make it particularly well suited for
 academic and scientific environments who would be able to embed it in a
 Python application as a library or extend it via plug-ins.
 
-LAKEsuperior is able to be exposed to the Web as a `Linked Data
+Lakesuperior is able to be exposed to the Web as a `Linked Data
 Platform <https://www.w3.org/TR/ldp-primer/>`__ server. It also acts as
 a SPARQL query (read-only) endpoint, however it is not meant to be used
 as a full-fledged triplestore at the moment.
 
-In its current status, LAKEsuperior is aimed at developers and hands-on
+In its current status, Lakesuperior is aimed at developers and hands-on
 managers who are interested in evaluating this project.
 
 Status and development
 ----------------------
 
-LAKEsuperior is in **alpha** status. Please see the `project
+Lakesuperior is in **alpha** status. Please see the `project
 issues <https://github.com/scossu/lakesuperior/issues>`__ list for a
 rudimentary road map.
 
 --------------
 
 .. [#] However if your client splits pairtrees upstream, such as Hyrax does,
    that obviously needs to change to get rid of the path segments.
+
+.. [#] Your mileage may vary depending on the variety of your triples.
diff --git a/docs/api.rst b/docs/api.rst
@@ -4,7 +4,7 @@ API Documentation
 Main Interface
 --------------
 
-The LAKEsuperior API modules of most interest for a client are:
+The Lakesuperior API modules of most interest for a client are:
 
 - :mod:`lakesuperior.api.resource`
 - :mod:`lakesupeiror.api.query`

diff --git a/docs/architecture.rst b/docs/architecture.rst
@@ -1,14 +1,14 @@
-LAKEsuperior Architecture
+Lakesuperior Architecture
 =========================
 
-LAKEsuperior is written in Python. It is not excluded that parts of the
+Lakesuperior is written in Python. It is not excluded that parts of the
 code may be rewritten in `Cython <http://cython.readthedocs.io/>`__ for
 performance.
 
 Multi-Modal Access
 ------------------
 
-LAKEsuperior services and data are accessible in multiple ways:
+Lakesuperior services and data are accessible in multiple ways:
 
 -  Via HTTP. This is the canonical way to interact with LDP resources
    and conforms quite closely to the Fedora specs (currently v4).
@@ -17,18 +17,18 @@ LAKEsuperior services and data are accessible in multiple ways:
 -  Via a Python API. This method allows to use Python scripts to access
    the same methods available to the two methods above in a programmatic
    way. It is possible to write Python plugins or even to embed
-   LAKEsuperior in a Python application, even without running a web
+   Lakesuperior in a Python application, even without running a web
    server.
 
 Architecture Overview
 ---------------------
 
 .. figure:: assets/lakesuperior_arch.png
-   :alt: LAKEsuperior Architecture
+   :alt: Lakesuperior Architecture
 
-   LAKEsuperior Architecture
+   Lakesuperior Architecture
 
-The LAKEsuperior REST API provides access to the underlying Python API.
+The Lakesuperior REST API provides access to the underlying Python API.
 All REST and CLI operations can be replicated by a Python program
 accessing this API.
 

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -1,9 +1,9 @@
 Command Line Reference
 ======================
 
-LAKEsuperior comes with some command-line tools aimed at several purposes.
+Lakesuperior comes with some command-line tools aimed at several purposes.
 
-If LAKEsuperior is installed via ``pip``, all tools can be invoked as normal
+If Lakesuperior is installed via ``pip``, all tools can be invoked as normal
 commands (i.e. they are in the virtualenv ``PATH``). 
 
 The tools are currently not directly available on Docker instances (*TODO add
@@ -16,7 +16,7 @@ This is the main server command. It has no parameters. The command spawns
 Gunicorn workers (as many as set up in the configuration) and can be sent in
 the background, or started via init script.
 
-The tool must be run in the same virtual environment LAKEsuperior
+The tool must be run in the same virtual environment Lakesuperior
 was installed in (if it was)—i.e.::
 
     source <virtualenv root>/bin/activate
@@ -44,7 +44,7 @@ self-documented, so this is just a redundant overview::
       check_fixity  [STUB] Check fixity of a resource.
       check_refint  Check referential integrity.
       cleanup       [STUB] Clean up orphan database items.
-      migrate       Migrate an LDP repository to LAKEsuperior.
+      migrate       Migrate an LDP repository to Lakesuperior.
       stats         Print repository statistics.
 
 All entries marked ``[STUB]`` are not yet implemented, however the
@@ -65,7 +65,7 @@ The command has no options but prompts the user for a few settings
 interactively (N.B. this may change in favor of parameters).
 
 The benchmark tool is able to create RDF sources, or non-RDF, or an equal mix
-of them, via POST or PUT, in the currently running LAKEsuperior server. It
+of them, via POST or PUT, in the currently running Lakesuperior server. It
 runs single-threaded.
 
 The RDF sources are randomly generated graphs of consistent size and

diff --git a/docs/contributing.rst b/docs/contributing.rst
@@ -1,14 +1,14 @@
-Contributing to LAKEsuperior
+Contributing to Lakesuperior
 ============================
 
-LAKEsuperior has been so far a single person’s off-hours project (with much
+Lakesuperior has been so far a single person’s off-hours project (with much
 very valuable input from several sides). In order to turn into anything close
 to a Beta release and eventually to a production-ready implementation, it
 needs some community love.
 
 Contributions are welcome in all forms, including ideas, issue reports,
 or even just spinning up the software and providing some feedback.
-LAKEsuperior is meant to live as a community project.
+Lakesuperior is meant to live as a community project.
 
 .. _dev_setup:
 

diff --git a/docs/discovery.rst b/docs/discovery.rst
@@ -1,7 +1,7 @@
 Resource Discovery & Query
 ==========================
 
-LAKEsuperior offers several way to programmatically discover resources and
+Lakesuperior offers several way to programmatically discover resources and
 data.
 
 LDP Traversal
@@ -20,12 +20,12 @@ SPARQL Query
 ------------
 
 A `SPARQL <https://www.w3.org/TR/sparql11-query/>`__ endpoint is available in
-LAKEsuperior both as an API and a Web UI.
+Lakesuperior both as an API and a Web UI.
 
 .. figure:: assets/lsup_sparql_query_ui.png
-   :alt: LAKEsuperior SPARQL Query Window
+   :alt: Lakesuperior SPARQL Query Window
 
-   LAKEsuperior SPARQL Query Window
+   Lakesuperior SPARQL Query Window
 
 The UI is based on `YASGUI <http://about.yasgui.org/>`__.
 
@@ -57,7 +57,7 @@ query in a graph with more than a few thousands resources::
 
 What the RDFLib implementation does is going over every single graph in the
 repository and perform the ``?s ?p ?o`` query on each of them. Since
-LAKEsuperior creates several graphs per resource, this can run for a very long
+Lakesuperior creates several graphs per resource, this can run for a very long
 time in any decently sized data set.
 
 The solution to this is either to omit the graph query, or use a term search,
@@ -67,9 +67,9 @@ Term Search
 -----------
 
 .. figure:: assets/lsup_term_search.png
-   :alt: LAKEsuperior Term Search Window
+   :alt: Lakesuperior Term Search Window
 
-   LAKEsuperior Term Search Window
+   Lakesuperior Term Search Window
 
 This feature provides a discovery tool focused on resource subjects and based
 on individual term match and comparison. It tends to be more manageable than