slow query event processing, most recent N events within bounding box between lat + lon #524

Closed
victusfate opened this issue Jan 5, 2017 · 6 comments


@victusfate

victusfate commented Jan 5, 2017

I created a little repo to try out lokijs, and while it was very fast to get up and running, I hit a snag.
My query performance is pretty painful (9 queries per second) with 100k elements in the collection, uniformly spread over an area, and random query windows ranging from small to large.

My environment is Node.js, focused on a server application using lokijs.

Here's the create collection and index setup (source):

// Get the documents collection
let collection = db.addCollection(sCollection, {
  unique:  ['id']
  // indices: ['ts','latitude','longitude']
});

// add indexes
// collection.createIndex({ ts: 1, loc: "2dsphere" });
collection.ensureIndex('ts');
collection.ensureIndex('latitude');
collection.ensureIndex('longitude');

Here's the query (source):

const locQuery = (options) => {
  return collection.chain().find({
    $and: [
      {
        latitude: {
          $between: [options.lowerLatitude, options.upperLatitude]
        }
      },
      {
        longitude: {
          $between: [options.lowerLongitude, options.upperLongitude]
        }
      },
    ]
  }).simplesort('ts', true).limit(options.N).data();
}

reference:
the repo

As a comparison, local MongoDB with 2d and ts indexes can handle 300-400 queries per second (still too slow for my needs).

@victusfate victusfate changed the title event processing (most recent N events within bounding box between lat + lon) slow query event processing, most recent N events within bounding box between lat + lon Jan 5, 2017
@obeliskos
Collaborator

Hey, I'm looking over your example now... might take a while to give it a full run.

Initial observation is that your geo $between filters out less than half of your collection size. Removing the limit, the find results averaged roughly 55-65k over several runs. Not sure if it's the random sample generation algorithm or whether you expect real-world results to be similar.

So for each of those 200 queries/transforms you are taking those roughly 60k documents and sorting them, which is an expensive op, and then limiting, which is not expensive but masks the volume of processing done previously.

So I would guess most of the time spent in your querying is used up by array.sort().

So are your geo coordinate filters supposed to let through so many results? A significant number of them return all 100k objects, but the average is as mentioned above.

@victusfate
Author

victusfate commented Jan 6, 2017

First: thanks for taking a look. Really appreciate your time!

Totally makes sense based on the type of queries. The real data is not quite uniform but random enough. And those big windows kill the sort.

Clients can pass big, tiny, or any size query windows, and the size isn't known beforehand. I ended up coding a custom prototype to try to handle those query size ranges better, and it can deliver 20-30k queries per second, but it has tuning parameters (number of grid buckets and a switch count) which may need to change as the data store ages (different optimal settings for 100k events vs. 500k events vs. 1 million events, etc.).

I have an idea that might work faster with lokijs:

  • check the query's lat × lon area and whether it's > some threshold, or better yet run the geo query with just a quick count and use that value to decide which method to use (update: count is slow)
  • if the area or count is large, do the geo filtering on the client on chunks of the most recent events, stepping back further in time until the request is satisfied (update: tried area, got some help from this heuristic)
  • if it's smaller than that threshold, I can probably get good performance from the original query (update: works great for a small number of contained events)

Actually, this same approach may work well with MongoDB too, since it faces the same CPU-hogging sort problem for large windows.

update:
I got a chance to try out someQuery.filteredrows.length as a path selector, but for large queries that overlap most of the data set it's slow (74 queries per second). So a direct count won't cut it as a selector. I'll need to fall back to an overlap-area heuristic, or something smarter like a spatial grid / 2D histogram as in my custom approach.
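
A minimal sketch of what that area-heuristic path selection might look like (the threshold value and the runLookBackQuery helper are assumptions of mine, not code from the repo):

// Hypothetical sketch: pick a query strategy based on the bounding-box area.
const AREA_THRESHOLD = 0.000004; // tuning parameter, needs measurement per data set

const queryEvents = (options) => {
  const area = (options.upperLatitude - options.lowerLatitude) *
               (options.upperLongitude - options.lowerLongitude);

  if (area > AREA_THRESHOLD) {
    // large window: walk back through recent events and filter client-side
    return runLookBackQuery(collection, options);
  }
  // small window: the original indexed $between query stays fast
  return locQuery(options);
};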

@victusfate
Author

victusfate commented Jan 6, 2017

second update:
I updated my test repo with the hybrid approach described above, using a heuristic on the search area. If area > threshold, use an iterative look-back with client-side filtering. Data returned from between tStart and tEnd is already sorted, which is great.

performance:

  • large windows: up to 1800 queries per second
  • fixed medium-ish windows (area ~ 0.000004): 700 queries per second
  • uniform random smaller windows (area 0-0.000004): 1900 queries per second
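
A rough sketch of how that iterative look-back could work (the step size, stopping condition, and options.oldestTs are assumptions, not the repo's actual code):

// Hypothetical sketch of the large-window path: step back through time in
// chunks of recent events and filter the bounding box on the client.
const runLookBackQuery = (collection, options) => {
  const STEP_MS = 60 * 1000; // look back one minute at a time (assumption)
  const results = [];
  let tEnd = Date.now();

  while (results.length < options.N && tEnd > options.oldestTs) {
    const tStart = tEnd - STEP_MS;
    // the binary index on ts keeps the range find cheap; the sort only
    // touches the current chunk
    const chunk = collection.chain()
      .find({ ts: { $between: [tStart, tEnd] } })
      .simplesort('ts', true)
      .data();

    for (const doc of chunk) {
      if (doc.latitude >= options.lowerLatitude && doc.latitude <= options.upperLatitude &&
          doc.longitude >= options.lowerLongitude && doc.longitude <= options.upperLongitude) {
        results.push(doc);
        if (results.length === options.N) break;
      }
    }
    tEnd = tStart - 1; // step back, avoiding the inclusive $between boundary
  }
  return results; // most recent matches first
};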

@obeliskos
Collaborator

obeliskos commented Jan 6, 2017

Nice performance gains! I haven't had time to look it over yet, but I probably will over the weekend.

I think your domain-specific algorithms show the most promise, but I will add some functionality to improve the options in the future, like:

  • chained simplesort (when no filters have been applied) will return sorted docs by using the binary index if one is applied on that field.
  • a new chain op to instance results into a new anonymous collection

So (at least for your first example) you might do something like:

var sortedColl = coll.chain().simplesort("ts", true).instanceCollection({ indices: ["latitude", "longitude"] });
//for(idx=0; idx<MAX_ITERATIONS; idx++) {
var results = sortedColl
  .find({
    $and: [
        { latitude: { $between: [options.lowerLatitude, options.upperLatitude] } },
        { longitude: { $between: [options.lowerLongitude, options.upperLongitude] } }
    ]
  })
  .limit(2)
  .data();
//}

The idea is to move the common sort cost up front and utilize the index. If at any point in a sorted query chain you expect multiple subsequent queries to be run against the interim results, you might use this instancing. Unfortunately, only the first part of the $and filter is able to use a binary index (the longitude filter would be done via an array scan of the interim results), so it may be more efficient to just have a single index for the instance.
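
For example, a single-index variant of the instancing above (using the same hypothetical instanceCollection call), where only the latitude filter is expected to hit the binary index:

// Index only latitude; the longitude $between would fall back to an array
// scan of the interim results anyway, per the note above.
var sortedColl = coll.chain().simplesort("ts", true).instanceCollection({ indices: ["latitude"] });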

I will have to look into your new hybrid approach and see if there is any way to help out there.

@victusfate
Author

victusfate commented Jan 9, 2017

I like that flexibility. By allowing the client author to rearrange how they query data, they can get immediate speedups (time then space vs. space then time).

Side note, I'm very happy with this issue (please feel free to close it out whenever you are as well)

I've had to move on to reviewing replicated local caches so I can better handle scaling (if this feature is on lokijs' roadmap, please let me know).

I'm leaning towards an architecture where:

  • application writes are replicated to redis
  • pubsub allows local db stores to update their cache
  • local db stores allow near instant query time

I can stomach the 20-40ms propagation inconsistency per store (need to implement and test to see if this is realistic)
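
A minimal sketch of that write path and cache fan-out, assuming the ioredis client and an 'events' channel (both are my assumptions, not part of this issue):

// Hypothetical sketch: each app instance keeps a local LokiJS cache that is
// updated via Redis pub/sub whenever another instance writes an event.
const Redis = require('ioredis');
const loki = require('lokijs');

const db = new loki('cache.db');
const events = db.addCollection('events', { unique: ['id'] });

const pub = new Redis();
const sub = new Redis();

// Writer side: persist locally and broadcast the write to the other caches.
const writeEvent = (doc) => {
  events.insert(doc);                          // local write
  pub.publish('events', JSON.stringify(doc));  // replicate to peers
};

// Reader side: apply writes published by other instances to the local cache.
sub.subscribe('events');
sub.on('message', (channel, message) => {
  const doc = JSON.parse(message);
  if (!events.by('id', doc.id)) {
    events.insert(doc);
  }
});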

Your feedback helped me review the query, get a handle on the internal slowdown, and make progress. I can't thank you enough @obeliskos

obeliskos added a commit that referenced this issue Jan 9, 2017
…nce method ( #525 #524 )

before this commit, coll.chain().simplesort('username').data() would not
use binary index to speed up results.... it will now (with no filters).

resultset (chain) now contains an 'instance' method which will create an
anonymous collection populated with objects in the current resultset...
those docs were stripped of $loki and meta properties so those will be
reassigned by new collection.  Can pass collection options object to
instance().  Terminates chain by returning anonymous collection (not
linked to database).

updated jsdoc and rebuilt minified
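
Based on that commit description, usage would presumably look something like this (a sketch inferred from the commit text, not verified against the released API; someTimestamp is a placeholder):

// Index-aware sort: with a binary index on 'ts' and no filters applied,
// the chained simplesort can now use the index directly.
const recent = collection.chain().simplesort('ts', true).limit(10).data();

// Instancing: materialize the current resultset into an anonymous collection
// (fresh $loki/meta assigned, not linked to the database), optionally passing
// collection options such as indices.
const snapshot = collection.chain()
  .find({ ts: { $gt: someTimestamp } })
  .instance({ indices: ['latitude', 'longitude'] });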
@obeliskos
Copy link
Collaborator

No problem... it's helpful to see a variety of use cases in the wild and adapt where it makes sense.

Loki is preparing to shift focus to a new 2.0 version where all options are open for refactoring, architecture suggestions, or contributions. If you think it makes sense for your purposes, feel free to make requests once we set that up.
