Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SQLiteStore #368

Merged
merged 71 commits into from Jan 22, 2019
Merged

Add SQLiteStore #368

merged 71 commits into from Jan 22, 2019

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented Dec 21, 2018

Fixes #365

Adds an SQLite-backed MutableMapping store. Performs operations on the underlying SQLite database that treat it like and expose it to users as a key-value store. Should provide a very portable and reliable storage format. Also should be a useful template for users trying to implement a store for their own database that supports SQL.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@jakirkham jakirkham changed the title WIP: Add sqlite store WIP: Add SQLiteStore Dec 21, 2018
Implements a key-value store using SQLite. As this is a builtin module
in Python and a common database to use in various languages, this should
have high utility and be very portable. Not to mention many databases
provide an SQLite language on top regardless of the internal
representation. So this can be a great template for users wishing to
work with Zarr in their preferred database.
Try using the `SQLiteStore` everywhere one would use another store and
make sure that it behaves correctly. This includes simple key-value
store usage, creating hierarchies, and storing arrays.
Provide a few examples of how one might use `SQLiteStore` to store
arrays or groups. These examples are taken with minor modifications from
the `LMDBStore` examples.
Includes a simple example borrowed from `LMDBStore`'s tutorial example,
which shows how to create and use an `SQLiteStore`.
Otherwise we may end up opening a different databases' files and try to
use them with SQLite only to run into errors. This caused the doctests
to fail previously. Changing the extension as we have done should avoid
these conflicts.
@jakirkham jakirkham changed the title WIP: Add SQLiteStore Add SQLiteStore Dec 21, 2018
@jakirkham jakirkham added this to the v2.3 milestone Dec 21, 2018
Instead of opening, committing, and closing the SQLite database for
every operation, limit these to user requested operations. Namely commit
only when the user calls `flush`. Also close only when the user calls
`close`. This should make operations with SQLite much more performant
than when we automatically committed and closed after every user
operation.
As users need to explicitly close the `SQLiteStore` to commit changes
and serialize them to the SQLite database, make sure to point this out
in the docs.
Appears some of these commands work without capitalization. However as
the docs show commands as capitalized, ensure that we are doing the same
thing as well. That way this won't run into issues with different SQL
implementations or older versions of SQLite that are less forgiving.
Plus this should match closer to what users familiar with SQL expect.
@jakirkham jakirkham mentioned this pull request Dec 21, 2018
Make use of `in` instead of repeating the same logic in `__delitem__`.
As we now keep the database open between operations, this is much
simpler than duplicating the key check logic. Also makes it a bit easier
to understand what is going on.
This was needed when the `import` of `sqlite3` was only here to ensure
that it existed (even though it wasn't used). Now we make use of
`sqlite3` where it is being imported. So there is no need to tell flake8
to not worry about the unused import as there isn't one.
Make sure that everything intended to be added to the `SQLiteStore`
database has been written to disk before attempting to pickle it. That
way we can be sure other processes constructing their own `SQLiteStore`
have access to the same data and not some earlier representation.
No need to normalize the path when there isn't one (e.g. `:memory:`).
Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, all looks good. Couple of tiny comments.

zarr/storage.py Outdated Show resolved Hide resolved
docs/release.rst Outdated Show resolved Hide resolved
docs/release.rst Outdated Show resolved Hide resolved
zarr/storage.py Outdated
kwargs.setdefault('timeout', 5.0)
kwargs.setdefault('detect_types', 0)
kwargs.setdefault('isolation_level', None) # autocommit
kwargs.setdefault('check_same_thread', False) # disallow writing from other threads
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the docs:

By default, check_same_thread is True and only the creating thread may use the connection. If set False, the returned connection may be shared across multiple threads. When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption.

Maybe we should set this to True. The docs say any multithreaded use needs to be fully synchronized by the user. There is no mechanism in zarr for doing this - the synchronizer classes only prevent writing the same chunk at the same time, but they allow different chunks to be written concurrently, which the sqlite docs suggest could still cause corruption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that was probably an oversight on my part. Have corrected this to True.

I think if we just make sure to flush after every write operation, we should be safe allowing this though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After giving this some thought, it seems reasonable for us to always flush after mutating the database. For example other processes could be accessing the same database and we want to ensure they all see the same thing. This isn't too different from how we handle other stores.

Have pushed a few commits to tidy things up and implement this flushing with each mutation behavior. Also have added an update function to allow more efficient submission of multiple changes (so we only commit once).

Switched back to setting check_same_thread to False by default as we now always serialize data if a change is made.

jakirkham and others added 7 commits December 21, 2018 09:48
Fix a typo.

Co-Authored-By: jakirkham <jakirkham@gmail.com>
Include author and original issue in changelog entry.

Co-Authored-By: jakirkham <jakirkham@gmail.com>
The default value for `check_same_thread` was previously set to `False`
when in reality we want this check enabled. So set `check_same_thread`
to `True`.
As users could change the setting of things like `check_same_thread` or
they may try to access the same database from multiple threads or
processes, make sure to flush any changes that would mutate the
database.
As we now always commit after an operation that mutates the data, there
is no need to commit before pickling the `SQLiteStore` object. After all
the data should already be up-to-date in the database.
As everything should already be flushed to the database whenever the
state is mutated, there is no need to perform this before closing.
alimanfoo and others added 2 commits January 2, 2019 20:38
@jakirkham
Copy link
Member Author

Thanks for taking a look @alimanfoo. Have tried to fix the few nits and answer your questions. Please let me know if you have more thoughts. :)

Adds a simple check to ensure SQLite is new enough to enable thread-safe
sharing of connections before setting `check_same_thread=True`. If
SQLite is not new enough, set `check_same_thread=False`.
zarr/storage.py Outdated Show resolved Hide resolved
As there are some concerns about keeping operations on the SQLite
database sequential for thread-safety, acquire an internal lock when a
DML operation occurs. This should ensure that only one modification can
occur at a time regardless of whether the connection uses the serialized
threading mode or not.
Uses all the same tests we use for SQLiteStore's on disk except it
special cases the pickling test to ensure the `SQLiteStore` cannot be
pickled if it is in-memory.
Simply use the `Connection`'s default arguments implicitly instead of
explicitly setting them in the constructor.
Make sure to inherit directly from `unittest.TestCase` as well.
Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jakirkham. I think it would be worth allowing isolation_level to be overridden by the user via kwargs into __init__. Otherwise looks good to go.

@jakirkham
Copy link
Member Author

Not sure whether that is a good idea. If we do that, we will need to add explicit transactions in the code and add additional code to check whether transactions should or should not be used (spoiler: they can't be used with autocommit mode, but need to be used otherwise). Though I could be missing something.

ref: http://charlesleifer.com/blog/going-fast-with-sqlite-and-python/
ref: https://stackoverflow.com/q/15856976

@alimanfoo
Copy link
Member

Not sure whether that is a good idea. If we do that, we will need to add explicit transactions in the code and add additional code to check whether transactions should or should not be used (spoiler: they can't be used with autocommit mode, but need to be used otherwise).

I was really just thinking to allow someone who really knows what they're doing and wants to control transactions themselves to be able to do so. Exposing isolation_level as a parameter allows them to do so. If they set isolation_level to anything other than None (autocommit), then transaction management is entirely up to them. They can do that by accessing the .db and .cursor members of the store.

It is really just about leaving the door open for people to experiment with other values of isolation_level. I don't think there would be any need to change any other code within the store, the point is that the user/application might want to make their own decisions about where to place transaction boundaries. Although we could add some documentation like, "N.B., if you set an isolation_level other than None then you are responsible for beginning and committing transactions."

I don't feel strongly about this btw, if you think this is something to consider for later then happy to proceed as-is.

@jakirkham
Copy link
Member Author

jakirkham commented Jan 15, 2019

Well if their goal is to circumvent what we are doing with transactions, they can override self.db's isolation_level property. Personally would rather discuss with a user of this functionality to figure out what they are after. That avoids introducing features with rough edges.

@alimanfoo
Copy link
Member

alimanfoo commented Jan 15, 2019 via email

@jakirkham
Copy link
Member Author

Thanks @alimanfoo.

@jhamman, it looks like you have built some stuff on top of this PR. Are you happy with the contents here (for merging) or are there some things we still need to discuss/address?

@jhamman
Copy link
Member

jhamman commented Jan 16, 2019

So I don't know SQL really at all so I'm not going to much use on the review here. You could say I'm as happy as I'm going to get here :) I'll give a quick browse just to be sure though...

I originally branched off this work thinking there may be some useful bits but none of those were used.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one little question

check_same_thread = False

# keep a lock for serializing mutable operations
self.lock = Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value in allowing this lock (or the lock type) to be specified by the user? For example, we've found this sort of thing to be useful at times when using dask-distributed for I/O operations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alimanfoo and I were a little concerned by some of the wording in the docs about sqlite3 and thread safety as discussed here. Having this lock may be (overly) cautious. It's not clear what the right answer is here. Would be happy to hear other thoughts on this if you have any.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine for now. We can certainly revisit this down the road if additional functionality is needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jhamman. SGTM

@jakirkham
Copy link
Member Author

Would someone like to do the honors here? ;)

@jhamman
Copy link
Member

jhamman commented Jan 21, 2019

@jakirkham - I'd be happy to see you merge this. It sounds like @alimanfoo has signed off on this going in.

@alimanfoo
Copy link
Member

Yep @jakirkham I think you should get the satisfaction of clicking the merge button here 😄

@jakirkham jakirkham merged commit 43f7fae into zarr-developers:master Jan 22, 2019
@jakirkham jakirkham deleted the add_sqlite_store branch January 22, 2019 12:52
@jakirkham
Copy link
Member Author

Thanks all 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants