Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] Optimize session initialization for multi-region deployment #3049

Open
kmuthukk opened this issue Dec 2, 2019 · 0 comments
Open

[YSQL] Optimize session initialization for multi-region deployment #3049

kmuthukk opened this issue Dec 2, 2019 · 0 comments
Assignees
Projects

Comments

@kmuthukk
Copy link
Collaborator

@kmuthukk kmuthukk commented Dec 2, 2019

Doing:
~/tserver/bin/ysqlsh -h tserver-ip-addr
for the very first time against a "tserver-ip-addr" in west-coast, when the yb-master leader is in east coast (note: 70ms ping latency between east/west) takes about 45 seconds just to get the prompt

ysqlsh (11.2-YB-2.0.5.3-b0)
Type "help" for help.

yugabyte=#

Doing it a second time is significantly faster.


Details:

Enabled tracing of RPCs to master leader, and we get about 583 RPCs (mostly a mix of Read/GetTableLocations) calls. As you can these calls arrive about 70ms apart. And so 583*70ms = 40sec -- explains why ysqlsh takes about 40sec to launch the very first time against a TServer. Subsequent ysqlsh connections to the same TServer are much faster.

I1128 01:27:25.132483 29222 yb_rpc.cc:378] Call yb.tserver.TabletServerService.Read 172.151.22.178:49748 => 172.152.40.149:7100 (request call id 429) took 0ms. Trace:
I1128 01:27:25.204210 29221 yb_rpc.cc:378] Call yb.master.MasterService.GetTableLocations 172.151.22.178:49748 => 172.152.40.149:7100 (request call id 430) took 0ms. Trace:
I1128 01:27:25.276927 25370 yb_rpc.cc:378] Call yb.tserver.TabletServerService.Read 172.151.22.178:49748 => 172.152.40.149:7100 (request call id 431) took 0ms. Trace:
I1128 01:27:25.348722 24995 yb_rpc.cc:378] Call yb.master.MasterService.GetTableLocations 172.151.22.178:49748 => 172.152.40.149:7100 (request call id 432) took 0ms. Trace:
I1128 01:27:25.421555 29222 yb_rpc.cc:378] Call yb.tserver.TabletServerService.Read 172.151.22.178:49748 => 172.152.40.149:7100 (request call id 433) took 0ms. Trace:
I1128 01:27:25.493427 25368 yb_rpc.cc:378] Call yb.master.MasterService.GetTableLocations 172.151.22.178:49748 => 172.152.40.149:7100 (request call id 434) took 0ms. Trace:
I1128 01:27:25.566165 24981 yb_rpc.cc:378] Call yb.tserver.TabletServerService.Read 172.151.22.178:49748 => 172.152.40.149:7100 (request call id 435) took 0ms. Trace:
I1128 01:27:25.638027 29221 yb_rpc.cc:378] Call yb.master.MasterService.GetTableLocations 172.151.22.178:49748 => 172.152.40.149:7100 (request call id 436) took 0ms. Trace

Turned on some debug logging to print the protos for these calls to from yb-tserver to yb-master when a ysqlsh connects to a TServer for the first time:

yb.master.MasterService.GetTableLocations (290 calls)
yb.tserver.TabletServerService.Read (277 calls)

A typical call to yb.master.MasterService.GetTableLocations looks like:

I1130 03:35:10.172482 21501 master_service.cc:319] table {
  table_id: "000000010000300080000000000004ec"
}
partition_key_start: ""
max_returned_locations: 64

Sometimes same table_id calls come to yb-master with slightly different arguments:

For example:

I1130 03:35:10.180092 20093 master_service.cc:319] table {
  table_id: "000030a90000300080000000000004eb"
}
max_returned_locations: 2147483647
require_tablets_running: true

vs:
another call to same table_id, but notice that the proto payload is slightly different.

I1130 03:35:10.180485 21501 master_service.cc:319] table {
  table_id: "000030a90000300080000000000004eb"
}
partition_key_start: ""
max_returned_locations: 64

By count the 2 table ids see most of the GetTableLocations calls.

$ grep -A1 master_service yb-master.INFO | grep -v master_service | grep -v "\-\-" | sort | uniq  -c
      3   table_id: "000000010000300080000000000004ec"
      5   table_id: "000000010000300080000000000004ee"
      1   table_id: "00000001000030008000000000000a6f"
      5   table_id: "00000001000030008000000000000b94"
      6   table_id: "000030a90000300080000000000004e1"
      9   table_id: "000030a90000300080000000000004eb"
      2   table_id: "000030a9000030008000000000000a29"
     10   table_id: "000030a9000030008000000000000a2b"
    117   table_id: "000030a9000030008000000000000a32"  (pg_index)
     10   table_id: "000030a9000030008000000000000a38"
    120   table_id: "000030a9000030008000000000000a3a"  (pg_rewrite)
      1   table_id: "000030a9000030008000000000000a5f"
      1   table_id: "000030a9000030008000000000000a85"

Checked the top 2 offenders above and they are:

- http://localhost:7000/table?id=000030a9000030008000000000000a3a shows that it is the yugabyte.pg_rewrite table.
- http://localhost:7000/table?id=000030a9000030008000000000000a32 shows that it is the yugabyte.pg_index table.

So we are not only call GetTableLocations many times, but we are calling it many times on the same tables. Most of these calls are to pg_index and pg_rewrite tables.

Now, coming to the 277 yb.tserver.TabletServerService.Read calls to yb-master, nearly 120 of them are those involving this table and its index.

yugabyte.pg_rewrite (000030a9000030008000000000000a3a)
yugabyte.pg_rewrite_rel_rulename_index (000030a9000030008000000000000a85)

The proto for 120 of these calls looks something like this:

include_trace: false
consistency_level: STRONG
pgsql_batch {
  client: YQL_CLIENT_PGSQL
  schema_version: 0
  targets {    column_id: 0  }
  targets {    column_id: 1  }
  targets {    column_id: 2  }
  targets {    column_id: 3  }
  targets {    column_id: 4  }
  targets {    column_id: 5  }
  targets {    column_id: 6  }
  targets {    column_id: 7  }
  targets {    column_id: -8  }
  column_refs {    ids: 0    ids: 1    ids: 2    ids: 3    ids: 4    ids: 5    ids: 6    ids: 7  }
  is_forward_scan: true
  is_aggregate: false
  limit: 1024
  return_paging_state: true
  table_id: "000030a9000030008000000000000a3a"
  index_request {
    targets {      column_id: 3    }
    column_refs {      ids: 3    }
    range_column_values {
      value {
        uint32_value: 12048    --->  each call is only different in this value which seems to increase by 3 (12051, 12054, 12057, etc.)
      }
    }
    table_id: "000030a9000030008000000000000a85"
  }
}

The schema for the pg_rewrite table:

yugabyte=# \d pg_rewrite
               Table "pg_catalog.pg_rewrite"
   Column   |     Type     | Collation | Nullable | Default
------------+--------------+-----------+----------+---------
 rulename   | name         |           | not null |
 ev_class   | oid          |           | not null |
 ev_type    | "char"       |           | not null |
 ev_enabled | "char"       |           | not null |
 is_instead | boolean      |           | not null |
 ev_qual    | pg_node_tree |           |          |
 ev_action  | pg_node_tree |           |          |
Indexes:
    "pg_rewrite_oid_index" PRIMARY KEY, lsm (oid)
    "pg_rewrite_rel_rulename_index" UNIQUE, lsm (ev_class, rulename)
And sample row count and sample rows:
yugabyte=# select count(*) from pg_rewrite;
 count
-------
   121
(1 row)

and sample rows:

yugabyte=# select ev_class, rulename from pg_rewrite limit 10;
 ev_class | rulename
----------+----------
    12048 | _RETURN
    12051 | _RETURN
    12054 | _RETURN
    12057 | _RETURN
    12060 | _RETURN
    12063 | _RETURN
    12066 | _RETURN
    12069 | _RETURN
    12072 | _RETURN
    12075 | _RETURN
(10 rows)

And, the remaining big chunk (90 odd of the Read calls) are of this form to pg_index (000030a9000030008000000000000a32).

I1130 03:35:10.380697 20093 tablet_service.cc:1197] tablet_id: "00000000000000000000000000000000"
include_trace: false
consistency_level: STRONG
pgsql_batch {
  client: YQL_CLIENT_PGSQL
  schema_version: 0
  targets {    column_id: 0  }
   ...
  targets {    column_id: 19  }
  targets {    column_id: -8  }
  column_refs {
    ids: 0
    ids: 1
    ...
    ids: 19
  }
  is_forward_scan: true
  is_aggregate: false
  limit: 1024
  return_paging_state: true
  range_column_values {
    value {
      uint32_value: 6114   --> The calls only differ in this value.
    }
  }
  table_id: "000030a9000030008000000000000a32"
}

For what it's worth, there were only 116 rows in that table:

yugabyte=# select count(*) from pg_index;
 count
-------
   116
(1 row)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
SQL Support
  
In progress
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.