slow query event processing, most recent N events within bounding box between lat + lon #524
Hey, I'm looking over your example now... it might take a while to give it a full run. My initial observation is that your geo $between filters out less than half of your collection: removing the limit, the find results averaged roughly 55-65k documents over several runs. I'm not sure whether that's the random sample-generation algorithm or whether you expect real-world results to be similar. So for each of those 200 queries/transforms you are taking roughly 60k documents and sorting them, which is an expensive op, and then limiting, which is cheap but masks the volume of processing done previously. My guess is that most of the time spent in your querying goes to Array.prototype.sort(). Are your geo coordinate filters supposed to let through so many results? A significant number of them return all 100k objects, though the average is as mentioned above.
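The cost pattern described above can be sketched in plain JavaScript (an illustrative shape, not the repo's actual code; the dataset size and window bounds are assumptions):

```javascript
// Illustrative sketch (assumed data, not the repo's code): a wide bounding
// box over 100k uniformly spread points keeps tens of thousands of matches,
// and sorting that whole match set just to keep the newest 2 is the
// expensive step that the subsequent limit then hides.
const N = 100000;
const events = Array.from({ length: N }, (_, i) => ({
  latitude: Math.random() * 180 - 90,
  longitude: Math.random() * 360 - 180,
  ts: i // monotonically increasing timestamp stand-in
}));

// A wide query window: roughly 2/3 of latitudes and 2/3 of longitudes.
const matches = events.filter(
  e => e.latitude >= -60 && e.latitude <= 60 &&
       e.longitude >= -120 && e.longitude <= 120
);

// O(m log m) over the full ~44k-element match set, then a cheap limit of 2.
const newestTwo = matches.slice().sort((a, b) => b.ts - a.ts).slice(0, 2);
console.log(matches.length, newestTwo.length);
```

The limit itself costs almost nothing; the sort over the full match set is where the per-query time goes, which is why throughput collapses for large windows.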
First: thanks for taking a look. Really appreciate your time! That totally makes sense given the type of queries. The real data is not quite uniform, but random enough, and those big windows kill the sort. Clients can pass big, tiny, or any size query windows, and the sizes are not known beforehand. I ended up coding a custom prototype to handle those query-size domains better, and it can deliver 20-30k queries per second, but it has tuning parameters (number of grid buckets, and a switch count) which may need to change as the data store ages (the best settings differ for 100k events vs. 500k events vs. 1 million events, etc.). I have an idea that might work faster with lokijs.
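A minimal version of that grid-bucket idea can be sketched in plain JavaScript (an assumed reconstruction of the general approach, not the actual prototype; `CELLS` plays the role of the "number of grid buckets" tuning parameter, and the switch-count fallback is omitted):

```javascript
// Grid-bucket sketch (assumed details, not the actual prototype): events are
// hashed into fixed lat/lon cells up front, so a query only scans the cells
// overlapping its bounding box instead of the whole collection. The bucket
// count is the tuning parameter that may need retuning as the store grows.
const CELLS = 32; // buckets per axis

const rowOf = lat => Math.min(CELLS - 1, Math.floor(((lat + 90) / 180) * CELLS));
const colOf = lon => Math.min(CELLS - 1, Math.floor(((lon + 180) / 360) * CELLS));

function buildGrid(events) {
  const grid = Array.from({ length: CELLS * CELLS }, () => []);
  for (const e of events) {
    grid[rowOf(e.latitude) * CELLS + colOf(e.longitude)].push(e);
  }
  return grid;
}

function queryNewestN(grid, lowLat, highLat, lowLon, highLon, n) {
  const candidates = [];
  for (let r = rowOf(lowLat); r <= rowOf(highLat); r++) {
    for (let c = colOf(lowLon); c <= colOf(highLon); c++) {
      for (const e of grid[r * CELLS + c]) {
        if (e.latitude >= lowLat && e.latitude <= highLat &&
            e.longitude >= lowLon && e.longitude <= highLon) {
          candidates.push(e);
        }
      }
    }
  }
  // Only the candidate set is sorted, which for small windows is far
  // smaller than the full collection.
  return candidates.sort((a, b) => b.ts - a.ts).slice(0, n);
}

// tiny demo
const demoGrid = buildGrid([{ latitude: 1, longitude: 2, ts: 42 }]);
console.log(queryNewestN(demoGrid, 0, 5, 0, 5, 1));
```

For small windows this avoids the full-collection sort entirely; for windows that cover most cells it degrades back toward the behaviour measured above, which is presumably where the "switch count" cutover to a different strategy comes in.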
Actually, this same approach may work great with mongodb as well, since it's facing the same sorting CPU-hog problem for large windows.

update:

second update: performance:
Nice performance gains! I haven't had time to look it over yet, but I probably will over the weekend. I think your domain-specific algorithms show the most promise, but I think I will add some functionality to improve options in the future, like:
So (at least for your first example) you might do something like:

```javascript
var sortedColl = coll.chain().simplesort("ts", true)
  .instanceCollection({ indices: ["latitude", "longitude"] });

// for (idx = 0; idx < MAX_ITERATIONS; idx++) {
var results = sortedColl
  .find({
    $and: [
      { latitude: { $between: [options.lowerLatitude, options.upperLatitude] } },
      { longitude: { $between: [options.lowerLongitude, options.upperLongitude] } }
    ]
  })
  .limit(2)
  .data();
// }
```

The idea being to move the common sort cost up front and utilize the index. If, at any point in a sorted query chain, you expect multiple subsequent queries to be run against the interim results, you might use this instancing. Unfortunately only the first part of the $and filter is able to use the binary index (the longitude filter would be done via an array scan of the interim results), so it may be more efficient to just have a single index for the instance. I will have to look into your new hybrid approach and see if there is any way to help out there.
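The "first filter uses the binary index, second filter is an array scan" behaviour described above can be sketched in plain JavaScript (this mimics the shape of the work with hypothetical helpers; it is not lokijs's actual implementation):

```javascript
// Sketch (hypothetical helpers, not lokijs internals): a binary search over a
// pre-sorted latitude array narrows candidates in O(log n); longitude is then
// checked by a linear scan of the interim results.
function lowerBound(arr, key, val) {
  let lo = 0, hi = arr.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (arr[mid][key] < val) lo = mid + 1; else hi = mid;
  }
  return lo;
}

function upperBound(arr, key, val) {
  let lo = 0, hi = arr.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (arr[mid][key] <= val) lo = mid + 1; else hi = mid;
  }
  return lo;
}

function betweenIndexed(sortedByLat, latLo, latHi, lonLo, lonHi) {
  // Phase 1: binary-indexed $between on latitude.
  const from = lowerBound(sortedByLat, "latitude", latLo);
  const to = upperBound(sortedByLat, "latitude", latHi);
  // Phase 2: array scan of the interim results for longitude.
  return sortedByLat
    .slice(from, to)
    .filter(e => e.longitude >= lonLo && e.longitude <= lonHi);
}

const byLat = [
  { latitude: -5, longitude: 0 },
  { latitude: 0, longitude: 50 },
  { latitude: 5, longitude: 10 }
]; // pre-sorted by latitude (standing in for the binary index)
console.log(betweenIndexed(byLat, -1, 10, -20, 20));
```

This is why a second index on longitude buys nothing inside the chain: once the first range filter has produced an interim array, that array is no longer in longitude order.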
I like that flexibility. By allowing the client author to rearrange how they query data, they can get immediate speedups (time and then space, vs. space and then time).

Side note: I'm very happy with this issue (please feel free to close it out whenever you are as well). I've had to move on to reviewing replicated local caches so I can better handle scaling (if this feature is on lokijs' roadmap, please let me know). I'm leaning towards an architecture where:

I can stomach the 20-40ms propagation inconsistency per store (I need to implement and test to see whether that is realistic). Your feedback helped me review the query, get a handle on the internal slowdown, and make progress. I can't thank you enough @obeliskos
…nce method ( #525 #524 ) Before this commit, coll.chain().simplesort('username').data() would not use the binary index to speed up results... it will now (with no filters). Resultset (chain) now contains an 'instance' method which will create an anonymous collection populated with the objects in the current resultset. Those docs are stripped of $loki and meta properties, so those will be reassigned by the new collection. A collection options object can be passed to instance(). Terminates the chain by returning the anonymous collection (not linked to the database). Updated jsdoc and rebuilt minified.
No problem... it's helpful to see a variety of use cases in the wild and adapt where it makes sense. Loki is preparing to shift focus to a new 2.0 version where all options are open for refactoring, architecture suggestions, or contributions. If you think it makes sense for your purposes, feel free to make requests once we set that up.
I created a little repo to try out lokijs, and while it was very fast to get up and running, I hit a snag.
My query performance is pretty painful (9 queries per second) with 100k elements in the collection, uniformly spread over an area, and random query windows from small to large.
My environment is nodejs, focused on a server application using lokijs.
Here's the create collection and index setup (source):
Here's the query (source):
reference:
the repo
As a comparison, local mongodb with 2d and ts indexes can handle 300-400 queries per second (still too slow for my needs).