
Absolute lack of thread-safety of SqlBackend should be documented. #981

Open
merijn opened this issue Nov 4, 2019 · 26 comments

Comments

@merijn
Contributor

merijn commented Nov 4, 2019

Right now SqlBackend has almost no documentation; there should be some big, bright warning saying "No way, no how can you ever under any conditions assume that a single SqlBackend is remotely safe to use from separate threads".

I've been bitten by cryptic errors in the past and only NOW by browsing the source do I realise how utterly unsafe SqlBackend is in this regard.

@parsonsmatt
Collaborator

Dang. That's a surprising thing to hear! Thanks for reporting it.

@merijn
Contributor Author

merijn commented Nov 4, 2019

@parsonsmatt the issue boils down to most (all?) query functions being built on top of rawQueryRes, which in turn uses getStmtConn, which, as you can see, maintains a map of Statements in an IORef, indexed by the query text. This means that all queries with the same text map to a single Statement that is used for preparing and stepping. So if any threads ever run the same query, they will interfere with each other's preparation and stepping, and generally just cause arbitrary failures by racing each other.
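In sketch form, the caching logic amounts to the following (a paraphrase of the internals, not a verbatim copy; treat the field names as approximate):

import Data.IORef (readIORef, writeIORef)
import qualified Data.Map as Map
import Data.Text (Text)
import Database.Persist.Sql (SqlBackend (..), Statement)

-- Every caller asking for the same query text gets the same cached
-- Statement, no matter which thread (or conduit stage) it is on.
getStmtConnSketch :: SqlBackend -> Text -> IO Statement
getStmtConnSketch conn sql = do
    stmtMap <- readIORef (connStmtMap conn)
    case Map.lookup sql stmtMap of
        Just stmt -> pure stmt
        Nothing   -> do
            stmt <- connPrepare conn sql
            writeIORef (connStmtMap conn) (Map.insert sql stmt stmtMap)
            pure stmt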

@merijn
Contributor Author

merijn commented Nov 4, 2019

I'm actually somewhat baffled as to what prompted this design in the first place...

@parsonsmatt
Collaborator

Hell, you're entirely right.

I think I get the intent - we want to prepare a statement once, then execute it multiple times. But sharing it across multiple threads is problematic.

Gating access behind an MVar or TVar would introduce locking/blocking on queries, which also seems bad.

Changing it to Map (ThreadId, Text) (IORef Statement) would be a relatively easy change to make and should stop 'bad' access patterns:

- pure $ Map.lookup sqlText stmtMap
+ threadId <- myThreadId
+ pure $ Map.lookup (threadId, sqlText) stmtMap

This fixes the multi-threaded access problem, but it isn't obvious to me that it is "good" relative to the initial design goal of having the SqlBackend store a map of Statements.
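Spelled out, that first variant might look like this (a hypothetical sketch; the re-keyed cache is passed explicitly here, whereas persistent's real field is connStmtMap :: IORef (Map Text Statement)):

import Control.Concurrent (ThreadId, myThreadId)
import Data.IORef (IORef, readIORef, writeIORef)
import qualified Data.Map as Map
import Data.Text (Text)
import Database.Persist.Sql (SqlBackend (..), Statement)

lookupOrPrepare
    :: SqlBackend
    -> IORef (Map.Map (ThreadId, Text) Statement)
    -> Text
    -> IO Statement
lookupOrPrepare conn cache sql = do
    threadId <- myThreadId
    stmtMap  <- readIORef cache
    case Map.lookup (threadId, sql) stmtMap of
        Just stmt -> pure stmt                 -- this thread's own Statement
        Nothing   -> do
            stmt <- connPrepare conn sql       -- prepare a thread-local copy
            writeIORef cache (Map.insert (threadId, sql) stmt stmtMap)
            pure stmt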

Could also have:

- Map Sql (IORef Statement)
+ Map Sql (IO Statement)

where the IO action has a way of mapping the Statements to appropriate ThreadIds.

@merijn
Contributor Author

merijn commented Nov 5, 2019

Ok, bad news first:

After having thought about it overnight, I have concluded that things are even worse than I initially suggested. It's not just the concurrent case that's broken, the single-threaded case is nonsense too. Say I have a large dataset that I stream through conduit via, e.g., selectSourceRes, and somewhere further down my conduit pipeline I do some more lookups while processing elements. If any of these downstream components execute a query that is already running in an upstream conduit, then everything is broken, because they'll be racing on the same Statement, resulting in garbage output.
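A sketch of the shape of pipeline I mean, assuming a Quux entity defined elsewhere via persistent's usual Template Haskell (the names are illustrative):

import Conduit (mapMC, runConduit, sinkNull, (.|))
import Control.Monad.Logger (NoLoggingT)
import Control.Monad.Reader (ReaderT)
import Control.Monad.Trans.Resource (ResourceT)
import Database.Persist
import Database.Persist.Sql (SqlBackend)

-- Both stages issue the same SQL text against one SqlBackend, so they
-- step the single cached Statement from two places at once.
repro :: ReaderT SqlBackend (NoLoggingT (ResourceT IO)) ()
repro = runConduit $
       selectSource ([] :: [Filter Quux]) []             -- outer streaming query
    .| mapMC (\_ -> selectList ([] :: [Filter Quux]) []) -- inner, identical SQL
    .| sinkNull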

This means your suggested fix 1) wouldn't actually fix it and 2) would trigger version bumps across the entire set of persistent packages, since SqlBackend is part of the public API.

Now, on to the good news:

I believe it's possible to implement a workaround that keeps the current types/API, with negligible impact on uses that are currently safe, and only a minor performance impact (in the form of additional statement preparation/finalisation, which generally shouldn't be that costly) on uses that are currently utterly broken and unsafe anyway.

@merijn
Contributor Author

merijn commented Nov 5, 2019

Spoke too soon: since persistent exports almost all of its internals, I think a version bump is unavoidable...

@merijn
Contributor Author

merijn commented Nov 5, 2019

I think the right way to go is to change getStmtConn to return IO (Acquire Statement) instead of IO Statement. This removes the Statement from the map for the duration of the run, so concurrent identical statements create a new Statement to use (and thus no longer conflict), and the Acquire finalizer then either inserts the Statement back into the map (if it's not there) or finalizes it (if there's already a Statement in the map).
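Roughly, the sketch would be (hypothetical code; connStmtMap, connPrepare, and stmtFinalize are the internal field names as I read them today):

import Control.Monad (unless)
import Data.Acquire (Acquire, mkAcquire)
import Data.IORef (atomicModifyIORef')
import qualified Data.Map as Map
import Data.Text (Text)
import Database.Persist.Sql (SqlBackend (..), Statement (..))

getStmtAcquire :: SqlBackend -> Text -> Acquire Statement
getStmtAcquire conn sql = mkAcquire claim release
  where
    -- Claim the cached Statement by removing it from the map, so a
    -- concurrent identical query is forced to prepare its own copy.
    claim = do
        mstmt <- atomicModifyIORef' (connStmtMap conn) $ \m ->
            (Map.delete sql m, Map.lookup sql m)
        maybe (connPrepare conn sql) pure mstmt
    -- On release, put the Statement back unless another copy was cached
    -- in the meantime; finalize the duplicate instead of leaking it.
    release stmt = do
        reinserted <- atomicModifyIORef' (connStmtMap conn) $ \m ->
            if Map.member sql m
                then (m, False)
                else (Map.insert sql stmt m, True)
        unless reinserted (stmtFinalize stmt)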

The happy path (i.e. code that's currently correct) will just keep reusing the same prepared statement from the map, whereas N concurrent queries with the same statement end up creating N-1 new prepared statements for the duration of their query. The duplicate prepared statements are unavoidable, so any slowdown there is just a fact of life.

Alternatively, the Statement map could actually keep a pool of prepared Statements for concurrent use, but it seems like that would make it way too easy to leak lots of memory in the form of prepared statements, and the cost of preparing a few extra statements seems relatively small compared to leaking lots of memory.

Unfortunately, a change like this will have a pretty large ripple effect across persistent, since getStmtConn is passed to lots of different functions in the SqlBackend, all of which will need their types adjusted.

On the upside this change should be fairly mechanical and simple.

@merijn
Contributor Author

merijn commented Nov 5, 2019

Ok, so I actually did figure out how to make a backwards-compatible fix to getStmtConn, only to have that idea torpedoed at the last second, because I just realised that insert (for SQLite; MySQL uses a similar approach) relies on last_insert_rowid() to work, which is unreliable/broken under concurrent usage.

So without a fix for that issue my fix is somewhat useless (well, it still fixes a whole bunch of issues, including running the same query multiple times in the same thread and concurrent non-insert operations).

@merijn
Contributor Author

merijn commented Nov 5, 2019

This can probably be solved by extending InsertSqlResult with a case that wraps the queries in a transaction, which should presumably fix both the MySQL and SQLite cases. Alternatively, we could just always wrap multi-query inserts in a transaction, which is presumably fairly cheap.
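For reference, the first three constructors below are InsertSqlResult as it exists in persistent today; the fourth is the hypothetical addition I mean:

import Data.Text (Text)
import Database.Persist (PersistValue)

data InsertSqlResult
    = ISRSingle Text                   -- one statement that returns the key
    | ISRInsertGet Text Text           -- INSERT, then e.g. last_insert_rowid()
    | ISRManyKeys Text [PersistValue]
    | ISRInsertGetInTransaction Text Text
      -- ^ hypothetical: run the INSERT and the key lookup inside a single
      --   transaction so concurrent inserts cannot interleave between them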

merijn added a commit to merijn/persistent that referenced this issue Nov 5, 2019
@merijn
Contributor Author

merijn commented Nov 6, 2019

I migrated my own code to use a Pool until I figure out how to tackle the stuff I discussed here. (Which has in turn spawned #983 and #984, because I'm a busy beaver.)
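For anyone else hitting this, the Pool migration is roughly the following (a sketch with arbitrary names and pool size; createSqlitePool and runSqlPool are the standard persistent-sqlite API):

{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent.Async (mapConcurrently_)
import Control.Monad.Logger (runNoLoggingT)
import Control.Monad.Reader (ReaderT)
import Database.Persist.Sqlite (SqlBackend, createSqlitePool, runSqlPool)

-- Each concurrent task checks its own SqlBackend out of the pool, so
-- no two threads ever share a connection (or its statement cache).
main :: IO ()
main = do
    pool <- runNoLoggingT $ createSqlitePool "db.sqlite" 8
    mapConcurrently_ (\i -> runSqlPool (doWork i) pool) [1 .. 8 :: Int]
  where
    doWork :: Int -> ReaderT SqlBackend IO ()
    doWork _ = pure ()  -- placeholder for the real per-thread queries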

@TomMD

TomMD commented Feb 21, 2020

So can this issue now be closed thanks to #983 and #984? I see #917 but am not clear on how that relates.

@merijn
Contributor Author

merijn commented Feb 21, 2020

Eh, no, this is very much not solved. #983 and #984 just exist so you can more easily use one connection per thread with the RawSqlite database. It's still completely unsound to use the same connection from multiple threads.

#917 is the exact error you get when you try to use the same persistent-sqlite SqlBackend from multiple threads concurrently.

@yitz-zoomin

I'm a little lost. Why would anyone ever imagine that they can use the same SqlBackend concurrently? Isn't using a DB connection pool standard practice (and not just in Haskell)? Frankly, it never crossed my mind to do anything else. In my opinion, the original suggestion of a "big, bright warning" is the best solution.

@merijn
Contributor Author

merijn commented Feb 22, 2020

I'm a little lost. Why would anyone ever imagine that they can use the same SqlBackend concurrently?

SQLite database connections are thread-safe when used directly. It's not obvious that SqlBackend does things that would break this thread-safety.

Additionally, what the current SqlBackend does is far worse than simply "not being thread-safe": it's not safe to have the same query running twice (in the same thread) on the same SqlBackend either.

"Why would you ever do that?"

Well, if you try to combine the results of multiple queries in a streaming way via Conduit, for one. Suppose we have two conduits foos :: ConduitT () Foo m r and bars :: ConduitT () Bar m r, which both happen to internally stream all Quux persistent entities from the database via selectSource. Now, if we compose these two conduits and run the resulting Conduit using a single SqlBackend (they run in the same thread, so it should be fine!), we get intermittent crashes, if not garbled results.
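Concretely, the scenario is something like this (Foo, Bar, Quux, and the conversion functions are hypothetical, for illustration only):

import Conduit (ConduitT, ZipSource (..), mapC, (.|))
import Control.Monad.Reader (ReaderT)
import Control.Monad.Trans.Resource (ResourceT)
import Database.Persist (entityVal, selectSource)
import Database.Persist.Sql (SqlBackend)

-- Two independently written conduits that each stream the Quux table.
foos :: ConduitT () Foo (ReaderT SqlBackend (ResourceT IO)) ()
foos = selectSource [] [] .| mapC (quuxToFoo . entityVal)

bars :: ConduitT () Bar (ReaderT SqlBackend (ResourceT IO)) ()
bars = selectSource [] [] .| mapC (quuxToBar . entityVal)

-- Interleaving them runs two copies of the same SELECT at once on one
-- SqlBackend, so both race on the single cached Statement.
both :: ConduitT () (Foo, Bar) (ReaderT SqlBackend (ResourceT IO)) ()
both = getZipSource ((,) <$> ZipSource foos <*> ZipSource bars)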

@merijn
Contributor Author

merijn commented Feb 22, 2020

Oh, actually, I think you don't even need to have the same query running twice from the same thread. Simply having a premature abort in your conduit (i.e. not consuming all the query output) will leave the query in an inconsistent state and crash you on the next invocation if I understand the code correctly.
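Something like this, reusing the imports and the hypothetical Quux entity from the earlier sketch (takeC and sinkList also come from the Conduit module):

-- takeC aborts the stream after one row; as described above, the cached
-- Statement is then left mid-query, and the immediately following use
-- of the same SQL text finds it in an inconsistent state.
leaky :: ReaderT SqlBackend (NoLoggingT (ResourceT IO)) ()
leaky = do
    _ <- runConduit $
        selectSource ([] :: [Filter Quux]) [] .| takeC 1 .| sinkList
    _ <- selectList ([] :: [Filter Quux]) []  -- same SQL text, same Statement
    pure ()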

@yitz-zoomin

I admit that I have never used the SQLite backend, and I'm not familiar with whatever specialized thread-safety guarantees you get there. But in general, DB connections are not thread-safe, and pools are industry-wide standard practice. Within that context, the current design of SqlBackend makes perfect sense: there is explicit support for pools, and there is also a lower-level API without pools that obviously puts the obligation on client code to provide whatever guarantees are needed from pools for your backend.

Does it make sense to make a global breaking change for all backends just to make it a bit easier to leverage a backend-specific concurrency feature?

@merijn
Contributor Author

merijn commented Feb 22, 2020

there is explicit support for pools, and there is also a lower-level API without pools that obviously puts the obligation on client code to provide whatever guarantees are needed from pools for your backend.

I disagree that this is clearly communicated. Many of the persistent functions do not really make clear how to use them together with a pool. Using a pool is what I initially did in my own code, but that kept getting stuck, so I went back to just having a single connection.

Does it make sense to make a global breaking change for all backends just to make it a bit easier to leverage a backend-specific concurrency feature?

Anyway, besides the docs being rather lacking on how to use things, this remark glosses over the fact that (as I pointed out in the other comment) the current SqlBackend design isn't safe in a non-concurrent setting either: running the same query multiple times in a conduit is broken, and not fully consuming the result of persistent's ConduitT queries leaves those queries in a broken state.

Even after switching back to using pools with one SqlBackend per connection, it still cost me a considerable amount of time (about a week's worth of digging through the internals of persistent and resourcet) to find an approach that I was sure was safe for single-threaded use.

In my code I currently use one SqlBackend per streaming query, because I don't think the current implementation can be trusted when running multiple streaming queries simultaneously if there is any chance that those queries might be the same.

@yitz-zoomin

I completely agree that the documentation should be clarified.

I should not have used the word "obviously". It's only obvious when you're coming to persistent as an experienced user of a particular DB where re-using a connection non-sequentially in any way is never safe.

@ygale

ygale commented Mar 31, 2020

Thinking more about what @merijn is saying about conduits - this is an inherent part of the persistent API, and it is a deep problem. For most backends, the issue is not the prepared statement cache; it's the selectSource family of methods. The persistent API does not provide any way to guarantee that the same connection will not be re-used before an active query finishes.

I am most familiar with the PostgreSQL backend. That backend currently does not attempt to do streaming. It just dumps all the results down the conduit all at once. So if you use selectList and gulp the entire firehose at once at the other end, you are OK. That's probably what most people do most of the time. But actually, PostgreSQL and the postgresql-simple library do support streaming query results. I've always dreamed of implementing that for persistent. The observations of @merijn show that it wouldn't work. The use of conduits in the persistent API is a lie.

@merijn
Contributor Author

merijn commented Mar 31, 2020

Well, conduit used to support eager cleanup, but that was gotten rid of. I made an issue on resourcet about this once, and Snoyman recommended using with around the conduit to enforce termination, but that can be tricky if you want composable conduits like selectSource. In my own code I defined newtype Region m a = Region (StateT (IO ()) m a), where the StateT collects resource cleanups from conduit queries. I then have runRegionConduit, which runs a ConduitT a b (Region m) r and performs the cleanups collected by the StateT before returning. This doesn't handle exceptions, but I rely on the underlying ResourceT to handle those.
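A rough reconstruction of that pattern, for the curious (only the names come from the description above; the bodies are one plausible implementation, not a published API):

{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Conduit (ConduitT, runConduit)
import Control.Monad.IO.Class (MonadIO, liftIO)
import Control.Monad.Trans.State.Strict (StateT, modify', runStateT)
import Data.Void (Void)

-- Accumulates the cleanup actions registered by streaming queries.
newtype Region m a = Region (StateT (IO ()) m a)
    deriving (Functor, Applicative, Monad, MonadIO)

-- Streaming queries register their finalizers here instead of waiting
-- for the enclosing ResourceT to run them.
registerCleanup :: Monad m => IO () -> Region m ()
registerCleanup io = Region (modify' (<> io))

-- Runs the conduit, then eagerly runs the collected cleanups.
runRegionConduit :: MonadIO m => ConduitT () Void (Region m) r -> m r
runRegionConduit c = do
    (r, cleanup) <- runRegion (runConduit c)
    liftIO cleanup
    pure r
  where
    runRegion (Region s) = runStateT s (pure ())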

The statement cache is actually also an issue, because the current API makes it easy to share (and even encourages sharing) a single connection across an entire Conduit. If different parts of the conduit run the same query at different points (as an intermediate step), then those queries will trash each other's results, or crash (if you're lucky).

@ygale

ygale commented Mar 31, 2020

If you share a connection across a conduit, you are totally trashed for most backends long before you get to the prepared statement cache. For most backends, DB connections cannot be re-used while there is an active query. I believe the only reason persistent gives the appearance of generally working is that no one is actually doing streaming in practice.

Once you fix that, and use a pool to avoid multi-threading issues, the current prepared statement caching will work just fine the way it is.

EDIT: To clarify - we should first focus on how to provide a way to prevent re-use of connections in the selectSource family of methods. It is a serious problem that affects all backends. But that will certainly be a breaking change, so at that point I have no objection to also improving prepared statement caching to work more conveniently for multi-threading with the SQLite backend.

@codygman
Contributor

codygman commented Mar 5, 2021

Once you fix that, and use a pool to avoid multi-threading issues, the current prepared statement caching will work just fine the way it is.

@ygale My team is using Persistent streaming queries with a pool and still having issues. I'm looking into it currently, though; hopefully I can post something useful.

I believe that for some reason persistent is returning connections to the Pool too quickly from other asynchronous processes.

@codygman
Contributor

To clarify - we should first focus on how to provide a way to prevent re-use of connections in the selectSource family of methods.

Couldn't selectSource take a Pool as input, so that each time part of the pipeline receives input it first acquires a SqlBackend from the pool?
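One hypothetical shape for that (selectSourcePool is illustrative, not an existing API; Pool comes from resource-pool). It pins a pooled connection for the lifetime of the stream:

import Conduit (ConduitT, bracketP, transPipe)
import Control.Monad.Reader (runReaderT)
import Control.Monad.Trans.Resource (ResourceT)
import Data.Pool (Pool, putResource, takeResource)
import Database.Persist
import Database.Persist.Sql (SqlBackend)

-- Checks a SqlBackend out of the pool for the lifetime of the stream,
-- so no other conduit stage can touch the same connection.
selectSourcePool
    :: (PersistEntity record, PersistEntityBackend record ~ SqlBackend)
    => Pool SqlBackend
    -> [Filter record] -> [SelectOpt record]
    -> ConduitT () (Entity record) (ResourceT IO) ()
selectSourcePool pool filters opts =
    bracketP
        (takeResource pool)
        (\(conn, localPool) -> putResource localPool conn)
        (\(conn, _) ->
            transPipe (`runReaderT` conn) (selectSource filters opts))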

Also relevant, this is how streaming-postgresql-simple does it:

doFold :: forall row m.
          (MonadIO m,MonadMask m,MonadResource m)
       => FoldOptions
       -> RowParser row
       -> Connection
       -> Query
       -> Stream (Of row) m ()
doFold FoldOptions{..} parser conn q = do
    stat <- liftIO (withConnection conn LibPQ.transactionStatus)
    case stat of
      LibPQ.TransIdle    ->
        bracket (liftIO (beginMode transactionMode conn))
                (\_ -> ifInTransaction $ liftIO (commit conn))
                (\_ -> go `onException` ifInTransaction (liftIO (rollback conn)))
      LibPQ.TransInTrans -> go
      LibPQ.TransActive  -> liftIO (fail "foldWithOpts FIXME:  PQ.TransActive")
         -- This _shouldn't_ occur in the current incarnation of
         -- the library,  as we aren't using libpq asynchronously.
         -- However,  it could occur in future incarnations of
         -- this library or if client code uses the Internal module
         -- to use raw libpq commands on postgresql-simple connections.
      LibPQ.TransInError -> liftIO (fail "foldWithOpts FIXME:  PQ.TransInError")
         -- This should be turned into a better error message.
         -- It is probably a bad idea to automatically roll
         -- back the transaction and start another.
      LibPQ.TransUnknown -> liftIO (fail "foldWithOpts FIXME:  PQ.TransUnknown")
         -- Not sure what this means.
  where
    ifInTransaction m = ... snip ...

@merijn
Contributor Author

merijn commented Mar 15, 2021

Ok, so I actually did figure out how to make a backwards-compatible fix to getStmtConn, only to have that idea torpedoed at the last second, because I just realised that insert (for SQLite; MySQL uses a similar approach) relies on last_insert_rowid() to work, which is unreliable/broken under concurrent usage.

So, it looks like the new SQLite release this week (3.35, which adds support for RETURNING) would allow getting rid of last_insert_rowid(), which means that if someone knows how to avoid the two-step ID process in MySQL, #982 could be made to work.

@ygale, a question: in #918 you commented that you were opposed to that PR, because:

This is a breaking change for all persistent users to implement a backend-specific feature. Please implement this as part of the SQLite backend, not as part of persistent.

I wonder what makes you say it's a breaking change? If there is only ever one statement that is prepared/used (i.e. the happy path), the semantics and the statement cache stay the same as they are now. The only change in my proposed PR is that a second statement gets prepared iff the existing one is already in use.

Which, I guess, is technically a change in semantics, but given that the current semantics are "produce random garbage and/or crashes", I'd hardly call that a breaking change.

@parsonsmatt
Collaborator

FWIW I'm totally fine with breaking changes, especially if they improve the UX of the library, and especially especially if they fix a bug, and super duper especially if it wasn't even possible to use the old code path in the first place.

@NorfairKing
Contributor

Ran into this in prod for the first time a year ago (or so); today I fixed one of the ways I've run into it.
Is there anything I can do to help here?
