Slow query(and it gets slower with readMode: "outdated") #5195

Open
thelinuxlich opened this issue Dec 11, 2015 · 12 comments

Comments

@thelinuxlich

So I have this query which returns 13k items before the .group() and 569 after:

r.table("pageviews").between([7,r.time(2015,12,1,"-02:00")],[7,r.now()],{index: "client_id_created_at"})
.eqJoin("session_id",r.table("sessions")).zip().merge(row => {
  return {
                            point: row("point").default(null),
                            pageview_count: row("pageviews"),
                            is_unique: row("is_unique"),
                            is_rejected: row("pageviews").lt(2),
                            session_duration: row("updated_at").toEpochTime().sub(row("created_at").toEpochTime()),
                            session_properties: row("properties"),
                            session_referer_type: row("referer")("type").default(null),
                            session_referer_domain: row("referer")("domain").default(null),
                            session_referer_query: row("referer")("query").default(null),
                            session_referer_path: row("referer")("path").default(null),
                            origin: row("origin").default(null),
                            weather: row("weather").default({
                                temperature: -1,
                                type: "null",
                                weather_icon: "null"
                            }),
                            device: row("device"),
                            is_multi_device: row.hasFields("next_device"),
                            geo: row("geo").default({
                                city: "null",
                                country: "null",
                                country_code: "null",
                                latitude: -1,
                                longitude: -1,
                                region: "null",
                                region_code: "null"
                            })
};
}).group("href")

This takes approximately 5 seconds when run from a Node.js script on the same machine as the database. I have a cluster of 3 n1-standard-4 Google Cloud instances, each with a 100 GB SSD disk (the database is using 4 GB on disk, and the cache is using 10 GB of the 15 GB of available RAM).

This is supposed to run on hundreds of millions of records in the worst case, so it's alarming that it takes 5 seconds for such a small dataset.
Also, I tried setting readMode to "outdated" and, to my surprise, it slowed the query down to 7 seconds.

I'm using the latest RethinkDB version and rethinkdbdash.
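For reference, a minimal rethinkdbdash timing harness along these lines reproduces the measurement (a sketch only: the host/port values and the trimmed merge body are assumptions, not the exact script used):

// Sketch: connection options are assumptions; only two merged fields are shown,
// the full merge body is in the query above.
const r = require("rethinkdbdash")({db: "aidax", host: "localhost", port: 28015});

const start = Date.now();
r.table("pageviews")
  .between([7, r.time(2015, 12, 1, "-02:00")], [7, r.now()], {index: "client_id_created_at"})
  .eqJoin("session_id", r.table("sessions")).zip()
  .merge(row => ({
    is_rejected: row("pageviews").lt(2),
    session_duration: row("updated_at").toEpochTime().sub(row("created_at").toEpochTime())
  }))
  .group("href")
  .run()
  .then(result => {
    // rethinkdbdash returns grouped data as an array of {group, reduction} objects
    console.log("groups:", result.length, "elapsed:", Date.now() - start, "ms");
    return r.getPoolMaster().drain();
  });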

@thelinuxlich
Author

Running it successive times doesn't change the speed. Removing the eqJoin().zip() part reduces the time to 1.8 seconds; with readMode: "outdated" it rises to 2.1 seconds.

@thelinuxlich
Author

An average document of the pageviews table:

{
  "actions": 0,
  "campaign_params": {},
  "category": null,
  "client_id": 7,
  "created_at": Thu Dec 03 2015 16:15:18 GMT-02:00,
  "deep_link": false,
  "domain": "delivery.veran.com.br",
  "end_ts": 1449170115516,
  "ended_at": Thu Dec 03 2015 16:15:36 GMT-02:00,
  "href": "http://delivery.veran.com.br/produto…",
  "id": "000147a8-5b2b-4c16-9e4c-4eac40664562-7-browser031220151600-1449170096530",
  "nav_type": "normal",
  "referer": {
    "domain": "delivery.veran.com.br",
    "path": "/",
    "query": null,
    "type": "internal"
  },
  "scroll": [
    {"percentage": {"pixel_depth": 763, "timing": 7.32, "value": 25}},
    {"percentage": {"pixel_depth": 0, "timing": 7.32, "value": 50}},
    {"percentage": {"pixel_depth": 0, "timing": 7.32, "value": 75}},
    {"percentage": {"pixel_depth": 1063, "timing": 7.82, "value": 100}}
  ],
  "session_id": "000147a8-5b2b-4c16-9e4c-4eac40664562-7-browser031220151600",
  "ssl": false,
  "timing": {
    "ttdr": 2.71,
    "ttfb": 0.28,
    "tti": 1.98
  },
  "title": "supermercado delivery veran - agora também somos delivery!",
  "ts": 1449170096530,
  "visitor_id": "000147a8-5b2b-4c16-9e4c-4eac40664562"
}

Session documents should be roughly half that size.

Also, when running this query, all four cores go to ~90% while memory usage stays the same (8 GB consumed). This is under a write load of ~50 writes/sec.

It is also impossible to run this query in the Data Explorer; it always hangs.

@thelinuxlich
Author

I moved the secondary index into the primary id as an array (on a temporary table, just for testing), but client_id is incremental, so all the data went into only one shard (could this be an issue for secondary indexes too?).
Running the same query, I now see a slight performance increase with readMode: "outdated" (4.2 seconds), and without it the time is about the same (5.1 seconds).
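A sketch of what that test setup might look like ("pageviews_by_client" is a hypothetical name for the temporary table, the connection defaults are assumptions, and key collisions between pageviews sharing the same [client_id, created_at] pair are ignored):

// Sketch only: copy pageviews into a test table keyed on [client_id, created_at].
const r = require("rethinkdbdash")({db: "aidax"});

r.tableCreate("pageviews_by_client").run()
  .then(() => r.table("pageviews_by_client").insert(
    // turn the secondary-index pair into an array primary key
    r.table("pageviews").merge(row => ({id: [row("client_id"), row("created_at")]}))
  ).run())
  .then(() => r.table("pageviews_by_client")
    // between() now runs against the primary index, so no {index: ...} optarg is needed
    .between([7, r.time(2015, 12, 1, "-02:00")], [7, r.now()])
    .count()
    .run())
  .then(count => {
    console.log("rows in range:", count);
    return r.getPoolMaster().drain();
  });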

@thelinuxlich
Author

I rebalanced the test table and ran the query again: 4.2 seconds without and 4.6 seconds with readMode: "outdated". It seems outdated reads will always be slower when the data is evenly distributed among the shards.
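(For reference, rebalancing is a single ReQL call on the table; shown here against the hypothetical test table name from the sketch above:)

// Sketch: redistribute the existing data evenly across the table's shards.
r.table("pageviews_by_client").rebalance()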

@danielmewes
Member

To benefit from outdated reads, you could try this:

  • Replicate the table "sessions" so that each server hosting a primary of "pageviews" also has a full replica of the "sessions" table.
  • Use read_mode="outdated" for "sessions", but the normal "single" read mode for "pageviews".

i.e. something like this:

r.table("pageviews").between([7,r.time(2015,12,1,"-02:00")],[7,r.now()],{index: "client_id_created_at"})
  .eqJoin("session_id", r.table("sessions", {readMode: "outdated"})).zip().merge(........

This will allow the eqJoin queries to be served locally, rather than having to go to another server, since each server holding a shard of "pageviews" will also have a full copy of the "sessions" table.
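For the first bullet, a reconfigure along these lines would put a full replica of "sessions" on each server (a sketch; shards: 1 and replicas: 3 are assumptions matching the 3-server cluster described above):

// Sketch: on a 3-server cluster, replicas: 3 places a copy of "sessions" on every server,
// so eqJoin lookups from any "pageviews" shard can be answered locally.
r.table("sessions").reconfigure({shards: 1, replicas: 3})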

I suspect that the eqJoin slowdown will be addressed by #5115 .
I'm unsure why this is still taking 1.8 seconds with that part removed.

Could you post the full query you tested when you removed the .eqJoin.zip and got the 1.8 seconds?

@danielmewes added this to the 2.3-polish milestone Dec 11, 2015
@thelinuxlich
Author

All tables are already replicated evenly across all servers.

@thelinuxlich
Author

With readMode: "outdated" on the sessions table and removing this optarg from pageviews I get 3,6 seconds

@thelinuxlich
Author

The query that returned in 1.8 seconds:

r.db("aidax").table("pageviews").between([7,r.time(2015,12,1,"-02:00")],[7,r.now()],{index: "client_id_created_at"})
.merge({session: r.db("aidax").table("sessions").get(r.row("session_id"))})
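For comparison, the same merge-based variant with the suggested outdated read mode applied to "sessions" only would look roughly like this (a sketch combining the query above with the earlier suggestion):

// Sketch: outdated reads on "sessions" only; "pageviews" keeps the default read mode.
r.db("aidax").table("pageviews")
  .between([7, r.time(2015,12,1,"-02:00")], [7, r.now()], {index: "client_id_created_at"})
  .merge({session: r.db("aidax").table("sessions", {readMode: "outdated"}).get(r.row("session_id"))})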

@thelinuxlich
Author

After talking with @danielmewes, it seems I should increase the CPU cores of my cluster and rerun the queries. I will do that and report back soon, but this is really sad; I didn't know RethinkDB wasn't suitable for big datasets (although these days a million records is nothing).

@danielmewes
Member

I believe this is just joins being slow, i.e. #5115.

RethinkDB can generally handle billions of rows, though it's probably not the right system for running analytics at such scales. It's better suited for OLTP-style queries where you only touch a small selection of rows at a time.

@thelinuxlich
Author

I don't intend to process billions of rows in a single query, but my point stands: in 2015, a million rows is nothing.

@thelinuxlich
Author

@danielmewes, I've replaced the three servers of my cluster with 16-core instances (they were 4-core); running the first query from this issue gives me roughly the same time :(
