Thoughts on the use of threads in SpinoDB #4

Open
falcon027 opened this issue Aug 16, 2022 · 8 comments

@falcon027
Contributor

When using Node.js, SpinoDB runs on the main thread, which blocks the event loop and degrades application performance.
I was wondering if it would be possible to make the calls to the database asynchronous in order to free up the event loop. To go further with speeding up the database, each collection could have its own thread to parallelise operations even more.
What are your thoughts on this? Let's discuss it in the comments. @supercamel
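For context, here is a minimal sketch of how a blocking call could be moved off the event loop in the native bindings, using node-addon-api's `Napi::AsyncWorker`. The `runQuery` and `findAsync` names are made up for illustration and are not SpinoDB's actual API:

```cpp
#include <napi.h>
#include <string>

// Hypothetical stand-in for a blocking database call.
std::string runQuery(const std::string& query);

class QueryWorker : public Napi::AsyncWorker {
public:
    QueryWorker(Napi::Function& callback, std::string query)
        : Napi::AsyncWorker(callback), query_(std::move(query)) {}

    // Runs on a libuv worker thread; the event loop stays free.
    void Execute() override { result_ = runQuery(query_); }

    // Runs back on the main thread; delivers the result to JS.
    void OnOK() override {
        Callback().Call({Env().Null(), Napi::String::New(Env(), result_)});
    }

private:
    std::string query_;
    std::string result_;
};

// JS-facing entry point: findAsync(query, callback)
Napi::Value FindAsync(const Napi::CallbackInfo& info) {
    std::string query = info[0].As<Napi::String>();
    Napi::Function cb = info[1].As<Napi::Function>();
    (new QueryWorker(cb, query))->Queue();  // AsyncWorker deletes itself
    return info.Env().Undefined();
}
```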

@supercamel
Owner

Loading and saving definitely need to become async / threaded. I've not had issues with queries slowing down the event loop myself, though. Even with several million documents, queries on properly indexed data should execute in under 100µs.

Have you done any benchmarking to find out if there is a particular query that is problematic? To me it seems likely that Spino isn't able to look up an index for the query. You might be able to restructure the data somehow to make it easier to index; then you shouldn't have any performance issues at all.

@supercamel
Owner

The whole query parser / executor mechanism is not ideal, and I've got some ideas to improve it which I expect will improve performance a little. Right now, it parses the query into a syntax tree, then traverses the tree once to find an index, and then again for each document it checks.

It should instead pick an index during the parsing phase and compile the query into a flat list of instructions. This should be faster and improve query performance.
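To make the idea concrete, here is a rough sketch of what "compile into a flat list of instructions" could look like. The `Document` type and the example query are invented for illustration, not SpinoDB's actual types:

```cpp
#include <functional>
#include <string>
#include <vector>

struct Document { int age; std::string name; };

// One compiled step: a predicate the executor runs in sequence.
using Instruction = std::function<bool(const Document&)>;

struct CompiledQuery {
    std::string indexField;           // chosen once, at parse time
    std::vector<Instruction> steps;   // flat list, no tree traversal
    bool matches(const Document& d) const {
        for (const auto& step : steps)
            if (!step(d)) return false;  // short-circuit on first failure
        return true;
    }
};

// Compiling { age: { $gt: 18 }, name: "alice" } would yield something like:
CompiledQuery compileExample() {
    CompiledQuery q;
    q.indexField = "age";  // index picked during parsing, not per document
    q.steps.push_back([](const Document& d) { return d.age > 18; });
    q.steps.push_back([](const Document& d) { return d.name == "alice"; });
    return q;
}
```

The tree is walked once at parse time; after that, matching a document is just a linear scan over the steps.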

This is probably the next performance-related change I want to make.

@falcon027
Contributor Author

You are exactly right; the latency is quite good and not a problem. My idea was to increase throughput by multithreading queries to get more requests per second. Currently, I get about 2000 requests per second, and because RAM is still expensive, the more requests per second one instance can handle, the less RAM I need to serve the same number of customers. This is particularly true when the database is large, which makes each copy of the data more expensive.

@falcon027
Contributor Author

Do you have any thoughts about this? @supercamel

@supercamel
Owner

> Do you have any thoughts about this? @supercamel

There is some overhead involved in spawning / synchronising threads. The queries themselves are essentially non-blocking. It's not clear to me that multithreading will be beneficial for queries.

However, the load and save functions do block, for quite a while too, depending on the data size. This is, I think, problematic. If I go to a threaded model, it will be to solve this problem. Loading must block queries, because it doesn't make sense to run queries while the db is still being loaded, but it shouldn't block the main loop. And saving must block the create/update/delete functions. So I think this justifies going to some kind of threaded model, but I don't think it will help you with RAM usage or query throughput.
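A minimal sketch of that locking scheme, using `std::shared_mutex`; this illustrates the idea above, not SpinoDB's actual implementation:

```cpp
#include <mutex>
#include <shared_mutex>

class Database {
public:
    void load() {
        // Run this on a background thread: it blocks queries and writes
        // via the lock, but never blocks the main event loop itself.
        std::unique_lock<std::shared_mutex> lock(dataMutex_);
        // ... read snapshot from disk and replay the journal ...
    }

    void save() {
        // Blocks create/update/delete; reads may continue, and since
        // writers are held off, the data stays consistent while writing.
        std::lock_guard<std::mutex> lock(writeMutex_);
        // ... write snapshot to disk ...
    }

    void insert(/* document */) {
        std::lock_guard<std::mutex> wl(writeMutex_);
        std::unique_lock<std::shared_mutex> dl(dataMutex_);
        // ... mutate the collection ...
    }

    void find(/* query */) const {
        std::shared_lock<std::shared_mutex> lock(dataMutex_);  // many readers at once
        // ... execute the query ...
    }

private:
    mutable std::shared_mutex dataMutex_;  // guards the data structures
    std::mutex writeMutex_;                // serialises mutations against save()
};
```

Here `load()` could be handed to a `std::thread`, so the main loop stays responsive while queries simply wait on the lock.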

When this happens it will be a major version increment because it will be a significant change to the current API.

@falcon027
Contributor Author

I think the simplest approach would be to split the collections across threads and add a separate log for each. This could speed up log replay by parallelising the I/O and the execution. And perhaps one could do it without having to handle locking between collections. What are your thoughts on this? A rough sketch follows the diagram below.
[Diagram]
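To illustrate, here is a rough sketch of per-collection log replay, assuming one journal file per collection; the file layout and `Collection` API are invented for illustration:

```cpp
#include <string>
#include <thread>
#include <vector>

struct Collection {
    std::string name;
    void replayLog(const std::string& path);  // parse and apply each entry
};

void loadDatabase(std::vector<Collection>& collections,
                  const std::string& dataDir) {
    std::vector<std::thread> workers;
    workers.reserve(collections.size());
    // Each collection replays its own log: I/O and execution overlap.
    for (auto& c : collections)
        workers.emplace_back([&c, &dataDir] {
            c.replayLog(dataDir + "/" + c.name + ".log");
        });
    for (auto& w : workers) w.join();  // database is ready once all finish
}
```

Since each thread owns exactly one collection during replay, no locking between collections is needed.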

@supercamel
Owner

This is a very interesting idea.

So, I suppose we would have a directory for the data. In that directory would be the data file which contains the saved state of the database, and a log for each collection.

Each collection would need a worker thread and some kind of inter-thread communication mechanism.

What about find queries and cursors? Getting results from queries seems like an interesting challenge. I suppose a cursor could run in the collection thread. A find query might return a handle to a cursor. An action might be 'create a cursor with this query', which would return the cursor handle. Another action might be 'get the next result for this cursor handle'. The handle would be invalidated by drop or insert queries.
Something like this. What do you think?
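A minimal sketch of that cursor-handle protocol, using one `std::promise` per request for the reply; all names are illustrative, and the queue that carries requests to the collection thread is omitted:

```cpp
#include <cstdint>
#include <future>
#include <map>
#include <string>

using CursorHandle = uint64_t;

// Requests posted to a collection's worker thread.
struct CreateCursor { std::string query; std::promise<CursorHandle> reply; };
struct NextResult  { CursorHandle handle; std::promise<std::string> reply; };

// Runs entirely on the collection thread, so cursors need no locking.
class CollectionWorker {
public:
    void handle(CreateCursor& req) {
        CursorHandle h = nextHandle_++;
        cursors_[h] = 0;  // scan position of the first match for req.query
        req.reply.set_value(h);
    }
    void handle(NextResult& req) {
        auto it = cursors_.find(req.handle);
        if (it == cursors_.end()) {   // handle was invalidated
            req.reply.set_value("");  // empty string signals "no more results"
            return;
        }
        req.reply.set_value("{}");    // next matching document (placeholder)
        ++it->second;
    }
    void invalidateAll() { cursors_.clear(); }  // called on insert/drop
private:
    CursorHandle nextHandle_ = 1;
    std::map<CursorHandle, size_t> cursors_;    // handle -> scan position
};
```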

@falcon027
Contributor Author

I agree, this could work as you described. My only worry is the latency of the inter-thread communication. But if that isn't a problem, then this should increase throughput and log replay speed by quite a bit. This would be really nice if implemented. If there is any way that I can support this effort, let me know (my C++ knowledge is unfortunately somewhat limited).
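For what it's worth, the round-trip cost of a simple condition-variable hand-off can be measured with a tiny ping-pong benchmark like the generic sketch below (not tied to SpinoDB); on typical hardware it lands at a few microseconds or less, which is in the same ballpark as the sub-100µs query times mentioned above:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;  // 0 = main's turn to send, 1 = worker's turn to reply
    const int iterations = 100000;

    std::thread worker([&] {
        for (int i = 0; i < iterations; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return turn == 1; });
            turn = 0;  // "reply" to the main thread
            cv.notify_one();
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::unique_lock<std::mutex> lock(m);
        turn = 1;  // "request" to the worker
        cv.notify_one();
        cv.wait(lock, [&] { return turn == 0; });
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    std::printf("round trip: ~%lld ns\n",
        static_cast<long long>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count()
            / iterations));
    worker.join();
}
```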
