
Commit 9882265

Author: Mikael Ronström
WL#12924: Disk data improvements
WL#8069: Ensure that the first LCP executes well after a long idle period
BUG#30276755: sync_lsn can wait forever
Reviewed-by: Mauritz Sundell <mauritz.sundell@oracle.com>

1) Add 3 ndbinfo tables:
   pgman_time_track_stats
     Tracks get_page latencies and the read and write latencies of PGMAN
     disk accesses.
   diskstat
     Reports disk statistics for PGMAN page accesses during the last second.
   diskstat_history
     Same as diskstat, but reports statistics for the last 20 seconds, one
     row per second.

2) Add new configuration variable MaxDiskDataLatency.
   Track the mean latency of file operations and make it possible to set
   the maximum acceptable disk data latency in the new config variable
   MaxDiskDataLatency. If the mean latency exceeds this value we abort 20%
   of all disk accesses; if it exceeds 2 times this value we abort 2 out
   of 5, and so forth: if it is X times this value we abort X out of X + 1
   disk access requests. If it exceeds 5 times this value, all disk
   accesses are aborted. The default is 0, meaning no maximum disk data
   latency is enforced.

   Add new configuration variable DiskDataUsingSameDisk. By default this
   is set to true, meaning we try to balance the load from writing disk
   data checkpoints against the load from writing in-memory checkpoints.
   If it is set to 0, the disk data checkpoint speed is calculated
   independently of in-memory checkpoint writes.

3) Fixed a bug in sync_lsn that caused log waits to wait forever in some
   situations (BUG#30276755).

4) Report m_max_sync_req_lsn in DUMP command.

5) Report duration of NDBFS requests in microseconds rather than
   milliseconds in DUMP command.

6) Improved latency of get_page requests in a number of different ways,
   particularly when the WAL rule had to be applied.

7) Started preparations to handle adaptive checkpoint speed also in PGMAN:
   - Keep track of the number of dirty pages to assess the current visible
     need for checkpointing.
   - Keep track of the number of pageouts in the last LCP to estimate
     normal checkpoint speed.
   - Provide information on how much time disk data uses compared to
     in-memory pages.
   - Provide information on how long the last LCP took.

8) First step towards performing LCP writes even before they are
   requested. The idea is the following: the next fragment to perform an
   LCP is the next in table id order and fragment order, so we can find
   the next fragment that will execute an LCP. By starting to write the
   pages of this fragment early, the LCP is likely to run much faster once
   the request to perform the LCP arrives in PGMAN. We write at most four
   fragments ahead of the fragment currently being checkpointed. This
   strikes a balance between staying ahead for smooth checkpoint speed and
   not writing so much that pages are written again when checkpointing the
   current fragment.

9) Needed to track the number of outstanding prepare LCP writes.

10) Made it possible to run checkpointing also in normal operation, during
    times when no disk data checkpoint has been requested. Both the IO
    parallelism and the IO rate are recalculated once every 100
    milliseconds, in order to even out the IO load during checkpoints.
    Modern NVMe drives can write millions of IO operations per second, so
    it is very easy to overwhelm the drives with checkpoint flushes that
    make it hard to perform normal disk IO for user operations. We
    therefore track how fast we need to write disk data checkpoints and
    in-memory checkpoints to ensure an even IO load. It is also important
    to find a good balance between disk data checkpointing and in-memory
    checkpointing.

11) We incurred a lot of extra latency by putting newly dirtied pages
    first in the dirty list. Writes of the most recently dirtied pages
    then came first, leading to many unnecessary invocations of the WAL
    rule.
    Fixed this by inserting pages last in the list. In addition, every
    time a page is made dirty (even when already dirty), it is moved to
    the end of its current dirty list, to minimise the risk of having to
    apply the WAL rule during LCP execution. We also ensured that during
    the prepare LCP phase we never attempt to write pages that need the
    WAL rule applied or that are ready to deliver a get_page callback;
    the prepare phase mostly has enough work to do anyway. Finally, we
    try to avoid writing pages in the SL_CALLBACK list during LCP
    execution. To guarantee progress, this rule is not applied to the
    last page in the dirty list, and it is applied at most 32 times to
    avoid breaking real-time execution rules. We also never skip two
    SL_CALLBACK pages in a row, since that would risk looping on those
    two pages, and a skipped page is only moved one step forward to
    avoid giving it too high a priority.

12) TSMAN had a major bottleneck in that it was protected by a single
    mutex, held for quite some time during inserts, while preparing a
    pageout, after a completed pageout, and in a few more places. This
    is a serious bottleneck for disk data. The solution breaks up the
    protection in a few steps. First, a Tablespace_client takes a lock
    only on its own instance, so Tablespace_clients do not interfere
    with each other. When TSMAN needs to manipulate data structures used
    by any Tablespace_client, it can lock all instances; this is a rare
    operation, occurring only when adding files to a tablespace, adding
    a tablespace, or dropping a tablespace. Allocating and freeing
    extents requires a bit more protection since we are working on a
    free list, so an allocate-extent lock protects these operations in
    addition to the instance lock. Finally, to keep the instances from
    bouncing into each other we protect the extent pages themselves: a
    fixed number of mutexes per tablespace data file protects extent
    page accesses, with a very simple hash function deciding which mutex
    to use for a specific extent page.

13) The extra PGMAN worker differs in how it discovers that LCPs start
    and end; PGMAN needs code to handle this as well.

14) Ensure that SYNC_EXTENT_PAGES_REQ is sent with FIRST_LCP also when
    no disk data tables are present.

15) Handle the case of a completely empty LCP.

16) Added more jams around the BUSY state in PGMAN. Added rules to keep
    the LCP speed down so that an LCP takes at least 10 seconds.

17) Fixed a bug where the variable m_lcp_ongoing wasn't properly
    handled when flushing the page cache for restarts. Reorganised debug
    printouts a bit and standardised on printing the instance number in
    parentheses at the start of each printout.

18) Avoid setting BUSY for extent pages, since that would require list
    changes.

19) Minor adaptions to decrease aggressiveness in writing LCPs for disk
    data.

20) More fine-tuning of adaptive control parameters.

21) Occasionally when running testRedo -n CheckLCPStartsAfterSR we move
    through the REDO log so fast that we haven't finished opening the
    next file when starting to write into it. In this test case the REDO
    log size is set to 16 * 16 MB; changed to 4 * 64 MB to minimise this
    risk.

22) Avoid trying to take client locks before calling execFSCLOSECONF
    when already holding them.

23) New error code 1518 for overload when going beyond
    MaxDiskDataLatency.

24) Slowed down the speed a bit more to increase the length of LCPs in
    normal operation.

The aim of this worklog is to balance the disk write rate against the
actual required disk write rate.
This makes it possible to sustain very heavy write rates also on disk data
columns in NDB, particularly when using modern NVMe drives. It has been
tested heavily with YCSB on large rows, both on modern NVMe drives
supporting millions of IOPS and on older SSD drives supporting at most
20,000 IOPS. The worklog also adds ndbinfo tables to track usage, and
debugging info to track any problems in the new code. It integrates the
Adaptive Redo Control algorithm for in-memory data with the checkpoint
algorithm for disk data columns.
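The latency-based throttling of item 2 can be sketched roughly as follows. This is an illustrative model, not the actual NDB implementation: all names are hypothetical, and since the commit text gives both "20% of all accesses" at the first threshold and the general "X out of X + 1" formula, the sketch follows the X-out-of-X+1 formulation with the stated full cutoff beyond 5 times the limit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the MaxDiskDataLatency throttling rule.
// Returns X such that X out of X + 1 requests should be aborted:
//   mean <= max (or max == 0, i.e. disabled)  -> 0 (abort nothing)
//   mean == X * max (X in 1..5)               -> X
//   mean  > 5 * max                           -> UINT32_MAX (abort all)
uint32_t abort_ratio(uint64_t mean_latency_us, uint64_t max_latency_us)
{
  if (max_latency_us == 0 || mean_latency_us <= max_latency_us)
    return 0;                        // throttling disabled or under the limit
  if (mean_latency_us > 5 * max_latency_us)
    return UINT32_MAX;               // abort all disk access requests
  return (uint32_t)(mean_latency_us / max_latency_us);
}

// Per-request decision: abort X out of every X + 1 requests,
// using a running request sequence number to spread the aborts.
bool should_abort(uint64_t seq, uint64_t mean_latency_us,
                  uint64_t max_latency_us)
{
  uint32_t x = abort_ratio(mean_latency_us, max_latency_us);
  if (x == 0) return false;
  if (x == UINT32_MAX) return true;
  return (seq % (x + 1)) < x;        // e.g. x == 2: abort 2 of every 3
}
```

In the real code the rejected requests would presumably fail with the new error code 1518 mentioned in item 23.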
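The dirty-list change of item 11 can be illustrated with a minimal sketch (not the real PGMAN structures; `DirtyList`, `make_dirty` and `next_to_write` are invented names): a page made dirty goes to the tail of the list, and a page dirtied again while already dirty is moved back to the tail, so writes proceed oldest-first and rarely hit the WAL rule on freshly dirtied pages.

```cpp
#include <cassert>
#include <cstdint>
#include <list>

// Illustrative sketch of the fixed dirty-list ordering.
struct DirtyList
{
  std::list<uint32_t> pages;   // page ids; head = oldest dirty page

  void make_dirty(uint32_t page_id)
  {
    pages.remove(page_id);     // re-dirty: drop any existing entry
    pages.push_back(page_id);  // always (re)insert at the tail
  }

  // The next write candidate is the oldest dirty page, which has had
  // the most time for its log record to reach disk (WAL rule).
  uint32_t next_to_write() const { return pages.front(); }
};
```

With the original head insertion, the most recently dirtied page would be written first, which is exactly the page most likely to still be waiting on the WAL rule.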
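The extent-page protection of item 12 amounts to a small fixed pool of mutexes per tablespace data file, with a simple hash of the extent page number selecting the mutex. A sketch under assumed names (`ExtentPageLocks` and a pool size of 16 are illustrative, not taken from the TSMAN code):

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Illustrative sketch: a fixed amount of mutexes per tablespace data
// file protecting extent page accesses from each other.
class ExtentPageLocks
{
public:
  static const uint32_t NUM_MUTEXES = 16;  // fixed pool size (assumed)

  // Very simple hash: page number modulo pool size. Pages hashing to
  // the same slot share a mutex; distinct pages rarely contend.
  uint32_t hash(uint32_t extent_page_no) const
  {
    return extent_page_no % NUM_MUTEXES;
  }

  void lock(uint32_t extent_page_no)   { m_mutex[hash(extent_page_no)].lock(); }
  void unlock(uint32_t extent_page_no) { m_mutex[hash(extent_page_no)].unlock(); }

private:
  std::mutex m_mutex[NUM_MUTEXES];
};
```

The design trades a little false sharing (two pages mapping to one mutex) for a bounded, allocation-free lock table, instead of one global TSMAN mutex serializing all extent page accesses.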
1 parent 71543a6 commit 9882265

34 files changed (+3380, -541 lines)

mysql-test/suite/ndb/r/ndbinfo.result

Lines changed: 13 additions & 1 deletion
@@ -84,9 +84,12 @@ table_id table_name comment
 37 stored_tables Information about stored tables
 38 processes Process ID and Name information for connected nodes
 39 config_nodes All nodes of current cluster configuration
+40 pgman_time_track_stats Time tracking of reads and writes of disk data pages
+41 diskstat Disk data statistics for last second
+42 diskstats_1sec Disk data statistics history for last few seconds
 SELECT COUNT(*) FROM ndb$tables;
 COUNT(*)
-40
+43
 SELECT * FROM ndb$tables WHERE table_id = 2;
 table_id table_name comment
 2 test for testing
@@ -126,6 +129,9 @@ table_id table_name comment
 37 stored_tables Information about stored tables
 38 processes Process ID and Name information for connected nodes
 39 config_nodes All nodes of current cluster configuration
+40 pgman_time_track_stats Time tracking of reads and writes of disk data pages
+41 diskstat Disk data statistics for last second
+42 diskstats_1sec Disk data statistics history for last few seconds
 SELECT * FROM ndb$tables WHERE table_name = 'LOGDESTINATION';
 table_id table_name comment
 SELECT COUNT(*) FROM ndb$tables t, ndb$columns c
@@ -156,6 +162,8 @@ table_id table_name
 25 cpustat_50ms
 16 dict_obj_info
 10 diskpagebuffer
+41 diskstat
+42 diskstats_1sec
 19 disk_write_speed_aggregate
 18 disk_write_speed_base
 29 frag_locks
@@ -166,6 +174,7 @@ table_id table_name
 15 membership
 9 nodes
 14 operations
+40 pgman_time_track_stats
 3 pools
 38 processes
 7 resources
@@ -260,6 +269,9 @@ table_id
 37
 38
 39
+40
+41
+42
 
 TRUNCATE ndb$tables;
 ERROR HY000: Table 'ndb$tables' is read only

scripts/mysql_system_tables.sql

Lines changed: 66 additions & 0 deletions
@@ -690,6 +690,16 @@ PREPARE stmt FROM @str;
 EXECUTE stmt;
 DROP PREPARE stmt;
 
+SET @str=IF(@have_ndbinfo,'DROP VIEW IF EXISTS `ndbinfo`.`diskstat`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
+SET @str=IF(@have_ndbinfo,'DROP VIEW IF EXISTS `ndbinfo`.`diskstats_1sec`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
 SET @str=IF(@have_ndbinfo,'DROP VIEW IF EXISTS `ndbinfo`.`error_messages`','SET @dummy = 0');
 PREPARE stmt FROM @str;
 EXECUTE stmt;
@@ -735,6 +745,11 @@ PREPARE stmt FROM @str;
 EXECUTE stmt;
 DROP PREPARE stmt;
 
+SET @str=IF(@have_ndbinfo,'DROP VIEW IF EXISTS `ndbinfo`.`pgman_time_track_stats`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
 SET @str=IF(@have_ndbinfo,'DROP VIEW IF EXISTS `ndbinfo`.`processes`','SET @dummy = 0');
 PREPARE stmt FROM @str;
 EXECUTE stmt;
@@ -985,6 +1000,28 @@ PREPARE stmt FROM @str;
 EXECUTE stmt;
 DROP PREPARE stmt;
 
+# ndbinfo.ndb$diskstat
+SET @str=IF(@have_ndbinfo,'DROP TABLE IF EXISTS `ndbinfo`.`ndb$diskstat`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
+SET @str=IF(@have_ndbinfo,'CREATE TABLE `ndbinfo`.`ndb$diskstat` (`node_id` INT UNSIGNED COMMENT "node_id",`block_instance` INT UNSIGNED COMMENT "Block instance",`pages_made_dirty` INT UNSIGNED COMMENT "Pages made dirty last second",`reads_issued` INT UNSIGNED COMMENT "Reads issued last second",`reads_completed` INT UNSIGNED COMMENT "Reads completed last second",`writes_issued` INT UNSIGNED COMMENT "Writes issued last second",`writes_completed` INT UNSIGNED COMMENT "Writes completed last second",`log_writes_issued` INT UNSIGNED COMMENT "Log writes issued last second",`log_writes_completed` INT UNSIGNED COMMENT "Log writes completed last second",`get_page_calls_issued` INT UNSIGNED COMMENT "get_page calls issued last second",`get_page_reqs_issued` INT UNSIGNED COMMENT "get_page calls that triggered disk IO issued last second",`get_page_reqs_completed` INT UNSIGNED COMMENT "get_page calls that triggered disk IO completed last second") COMMENT="Disk data statistics for last second" ENGINE=NDBINFO CHARACTER SET latin1','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
+# ndbinfo.ndb$diskstats_1sec
+SET @str=IF(@have_ndbinfo,'DROP TABLE IF EXISTS `ndbinfo`.`ndb$diskstats_1sec`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
+SET @str=IF(@have_ndbinfo,'CREATE TABLE `ndbinfo`.`ndb$diskstats_1sec` (`node_id` INT UNSIGNED COMMENT "node_id",`block_instance` INT UNSIGNED COMMENT "Block instance",`pages_made_dirty` INT UNSIGNED COMMENT "Pages made dirty per second",`reads_issued` INT UNSIGNED COMMENT "Reads issued per second",`reads_completed` INT UNSIGNED COMMENT "Reads completed per second",`writes_issued` INT UNSIGNED COMMENT "Writes issued per second",`writes_completed` INT UNSIGNED COMMENT "Writes completed per second",`log_writes_issued` INT UNSIGNED COMMENT "Log writes issued per second",`log_writes_completed` INT UNSIGNED COMMENT "Log writes completed per second",`get_page_calls_issued` INT UNSIGNED COMMENT "get_page calls issued per second",`get_page_reqs_issued` INT UNSIGNED COMMENT "get_page calls that triggered disk IO issued per second",`get_page_reqs_completed` INT UNSIGNED COMMENT "get_page calls that triggered disk IO completed per second",`seconds_ago` INT UNSIGNED COMMENT "Seconds ago that this measurement was made") COMMENT="Disk data statistics history for last few seconds" ENGINE=NDBINFO CHARACTER SET latin1','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
 # ndbinfo.ndb$frag_locks
 SET @str=IF(@have_ndbinfo,'DROP TABLE IF EXISTS `ndbinfo`.`ndb$frag_locks`','SET @dummy = 0');
 PREPARE stmt FROM @str;
@@ -1073,6 +1110,17 @@ PREPARE stmt FROM @str;
 EXECUTE stmt;
 DROP PREPARE stmt;
 
+# ndbinfo.ndb$pgman_time_track_stats
+SET @str=IF(@have_ndbinfo,'DROP TABLE IF EXISTS `ndbinfo`.`ndb$pgman_time_track_stats`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
+SET @str=IF(@have_ndbinfo,'CREATE TABLE `ndbinfo`.`ndb$pgman_time_track_stats` (`node_id` INT UNSIGNED COMMENT "node_id",`block_number` INT UNSIGNED COMMENT "Block number",`block_instance` INT UNSIGNED COMMENT "Block instance",`upper_bound` INT UNSIGNED COMMENT "Upper bound in microseconds",`page_reads` BIGINT UNSIGNED COMMENT "Number of disk reads in this range",`page_writes` BIGINT UNSIGNED COMMENT "Number of disk writes in this range",`log_waits` BIGINT UNSIGNED COMMENT "Number of waits due to WAL rule in this range (log waits)",`get_page` BIGINT UNSIGNED COMMENT "Number of waits for get_page in this range") COMMENT="Time tracking of reads and writes of disk data pages" ENGINE=NDBINFO CHARACTER SET latin1','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
 # ndbinfo.ndb$pools
 SET @str=IF(@have_ndbinfo,'DROP TABLE IF EXISTS `ndbinfo`.`ndb$pools`','SET @dummy = 0');
 PREPARE stmt FROM @str;
@@ -1470,6 +1518,18 @@ PREPARE stmt FROM @str;
 EXECUTE stmt;
 DROP PREPARE stmt;
 
+# ndbinfo.diskstat
+SET @str=IF(@have_ndbinfo,'CREATE OR REPLACE DEFINER=`root`@`localhost` SQL SECURITY INVOKER VIEW `ndbinfo`.`diskstat` AS SELECT * FROM `ndbinfo`.`ndb$diskstat`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
+# ndbinfo.diskstats_1sec
+SET @str=IF(@have_ndbinfo,'CREATE OR REPLACE DEFINER=`root`@`localhost` SQL SECURITY INVOKER VIEW `ndbinfo`.`diskstats_1sec` AS SELECT * FROM `ndbinfo`.`ndb$diskstats_1sec`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
 # ndbinfo.error_messages
 SET @str=IF(@have_ndbinfo,'CREATE OR REPLACE DEFINER=`root`@`localhost` SQL SECURITY INVOKER VIEW `ndbinfo`.`error_messages` AS SELECT error_code, error_description, error_status, error_classification FROM `ndbinfo`.`ndb$error_messages`','SET @dummy = 0');
 PREPARE stmt FROM @str;
@@ -1524,6 +1584,12 @@ PREPARE stmt FROM @str;
 EXECUTE stmt;
 DROP PREPARE stmt;
 
+# ndbinfo.pgman_time_track_stats
+SET @str=IF(@have_ndbinfo,'CREATE OR REPLACE DEFINER=`root`@`localhost` SQL SECURITY INVOKER VIEW `ndbinfo`.`pgman_time_track_stats` AS SELECT * FROM `ndbinfo`.`ndb$pgman_time_track_stats`','SET @dummy = 0');
+PREPARE stmt FROM @str;
+EXECUTE stmt;
+DROP PREPARE stmt;
+
 # ndbinfo.processes
 SET @str=IF(@have_ndbinfo,'CREATE OR REPLACE DEFINER=`root`@`localhost` SQL SECURITY INVOKER VIEW `ndbinfo`.`processes` AS SELECT DISTINCT node_id, CASE node_type WHEN 0 THEN "NDB" WHEN 1 THEN "API" WHEN 2 THEN "MGM" ELSE NULL END AS node_type, node_version, NULLIF(process_id, 0) AS process_id, NULLIF(angel_process_id, 0) AS angel_process_id, process_name, service_URI FROM `ndbinfo`.`ndb$processes` ORDER BY node_id','SET @dummy = 0');
 PREPARE stmt FROM @str;

storage/ndb/include/kernel/signaldata/LCP.hpp

Lines changed: 3 additions & 2 deletions
@@ -256,8 +256,9 @@ struct SyncExtentPagesReq
   enum LcpOrder
   {
     FIRST_LCP = 0,
-    INTERMEDIATE_LCP = 1,
-    END_LCP = 2
+    END_LCP = 1,
+    RESTART_SYNC = 2,
+    FIRST_AND_END_LCP = 3
   };
   Uint32 senderData;
   Uint32 senderRef;

storage/ndb/include/kernel/signaldata/NextScan.hpp

Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,5 @@
 /*
-   Copyright (c) 2003, 2017, Oracle and/or its affiliates. All rights reserved.
+   Copyright (c) 2003, 2019, Oracle and/or its affiliates. All rights reserved.
 
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License, version 2.0,
@@ -71,6 +71,7 @@ class NextScanConf {
 
 class NextScanRef {
   friend class Dbtux;
+  friend class Dbtup;
   friend class Dblqh;
 public:
   STATIC_CONST( SignalLength = 4 );

storage/ndb/include/kernel/signaldata/PgmanContinueB.hpp

Lines changed: 4 additions & 2 deletions
@@ -1,5 +1,5 @@
 /*
-   Copyright (c) 2005, 2017, Oracle and/or its affiliates. All rights reserved.
+   Copyright (c) 2005, 2019, Oracle and/or its affiliates. All rights reserved.
 
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License, version 2.0,
@@ -40,7 +40,9 @@ class PgmanContinueB {
     STATS_LOOP = 0,
     BUSY_LOOP = 1,
     CLEANUP_LOOP = 2,
-    LCP_LOOP = 3
+    LCP_LOOP = 3,
+    CALC_STATS_LOOP = 4,
+    TRACK_LCP_SPEED_LOOP = 5
   };
 };
 

storage/ndb/include/mgmapi/mgmapi_config_parameters.h

Lines changed: 2 additions & 0 deletions
@@ -249,6 +249,8 @@
 #define CFG_DB_RESERVED_TRANS_BUFFER_MEM 666
 
 #define CFG_DB_TRANSACTION_MEM 667
+#define CFG_DB_MAX_DD_LATENCY 668
+#define CFG_DB_DD_USING_SAME_DISK 669
 
 #define CFG_NODE_ARBIT_RANK 200
 #define CFG_NODE_ARBIT_DELAY 201

storage/ndb/src/common/debugger/signaldata/IsolateOrd.cpp

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ printISOLATE_ORD(FILE * output, const Uint32 * theData, Uint32 len, Uint16 recei
   }
   else
   {
-    fprintf(output, " nodesToIsolate in signal section");
+    fprintf(output, " nodesToIsolate in signal section\n");
   }
   return true;
 }

storage/ndb/src/kernel/blocks/PgmanProxy.cpp

Lines changed: 14 additions & 0 deletions
@@ -177,6 +177,20 @@ PgmanProxy::sendEND_LCPCONF(Signal* signal, Uint32 ssId)
  * thread is used. These are extent pages.
  */
 
+void
+PgmanProxy::get_extent_page(Page_cache_client& caller,
+                            Signal* signal,
+                            Page_cache_client::Request& req,
+                            Uint32 flags)
+{
+  ndbrequire(blockToInstance(caller.m_block) == 0);
+  SimulatedBlock* block = globalData.getBlock(caller.m_block);
+  Pgman* worker = (Pgman*)workerBlock(c_workers - 1); // extraWorkerBlock();
+  Page_cache_client pgman(block, worker);
+  pgman.get_extent_page(signal, req, flags);
+  caller.m_ptr = pgman.m_ptr;
+}
+
 int
 PgmanProxy::get_page(Page_cache_client& caller,
                      Signal* signal,

storage/ndb/src/kernel/blocks/PgmanProxy.hpp

Lines changed: 5 additions & 0 deletions
@@ -84,6 +84,11 @@ class PgmanProxy : public LocalProxy {
   int get_page(Page_cache_client& caller,
                Signal*, Page_cache_client::Request& req, Uint32 flags);
 
+  void get_extent_page(Page_cache_client& caller,
+                       Signal*,
+                       Page_cache_client::Request& req,
+                       Uint32 flags);
+
   void update_lsn(Signal *signal,
                   Page_cache_client& caller,
                   Local_key key, Uint64 lsn);
