/
query_script_split.html
465 lines (429 loc) · 19.9 KB
/
query_script_split.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
<p>
QueryScript Flow Control: <strong>split</strong> statement
</p>
<h3>SYNOPSIS</h3>
<p>
Single table operations, <a href="#autodetect">autodetect</a> mode:
<blockquote><pre>split (<i>statement operating on single table</i>)
statement;
</pre></blockquote>
Multiple tables operations; <a href="#explicit">explicit declaration</a> of <i>splitting table</i>:
<blockquote><pre>split (schema_name.table_name: <i>statement operating on multiple tables</i>)
statement;
</pre></blockquote>
Statementless split, <a href="#statementless">manual</a> mode:
<blockquote><pre>split (schema_name.table_name)
statement;
</pre></blockquote>
Provide <a href="#parameters">parameters</a> to split operation:
<blockquote><pre>split ({foo:bar}: <i>statement</i>)
statement;
</pre></blockquote>
Discussion on the various flavors of <i>split</i> <a href="#flavors">follows</a>.
</p>
<h3>DESCRIPTION</h3>
<p>
<i>split</i> automagically breaks a query into subparts -- smaller chunks -- and works these in
steps. It alleviates the load caused by large operations by turning them into smaller ones.
</p>
<p>
Consider the following query: we realize we must <strong>UPDATE</strong> a column on all rows:
<blockquote><pre>split (UPDATE sakila.rental SET rental_date = rental_date + INTERVAL 6 HOUR)
pass;
</pre></blockquote>
To execute a "normal" <strong>UPDATE</strong> of the above form would mean,
assuming the table is very large,
issuing a very large transaction. Such transaction could take hours to complete, by which time locks
are accumulating, performance is degrading, and an attempt attempt at rollback can make for an even
larger overhead.
</p>
<p>
A solution to such a problem is in the form of <i>chunking</i>: splitting the query into many smaller ones,
each operating on a distinct group of rows. Not only the query, but the transaction itself
(assuming <strong>AUTOCOMMIT=1</strong>) is broken into smaller transactions. Each such transaction is
quick to complete, and has better chance at not making for any locks. One may choose to "rest" in between
chunks, making for the availability of system resources.
</p>
<p>
The above <i>split</i> code does just that: it automagically breaks the query, by:
<ul>
<li>Analyzing the query, <a href="#autodetect">detecting</a>, if possible, the table on which the split is done
(with a multi-table query <a href="#explicit">explicit</a> instruction is required)</li>
<li>Analyzing the table, detecting best method of splitting it up into smaller parts. This is done
by choosing the best UNIQUE KEY by which to work out the splitting process.</li>
<li>Rewriting the query so as to add a filtering condition placeholder: the expression which
limits.</li>
<li>Determining the particular chunks by running over the actual rows, and issuing query on each chunk.</li>
</ul>
Thus, <i>split</i> works by selecting a particular table used by the SQL action statement, and by breaking it apart.
This table is called the <i>splitting table</i> or the <i>chunking table</i>.
</p>
<p>
<i>split</i> looks like a looping construct: the <i>statement</i> gets executed once per each
chunk of the original query.
As with looping constructs, it respects, among others, the following statements:
<ul>
<li><a href="query_script_break.html">break</a>: terminate <i>split</i> execution: this means skipping any remaining
chunks.</li>
<li><a href="query_script_throttle.html">throttle</a>: control loop execution time by sleeping in between iterations, time of sleep
proportional to time of execution.</li>
</ul>
</p>
<p>
<i>split</i> defaults to chunks of <strong>1,000</strong> rows each. This can be configured via the size <a href="#parameters">parameter</a>.
</p>
<p>
Use of <a href="query_script_variables.html#expansion">expanded variables</a> is allowed within the split statement, as well as within
the table definition and the query params. See EXAMPLES for more on this.
</p>
<a name="magic_variables"></a>
<h4>Magic variables</h4>
<p>
<i>split</i> introduces magic variables, which are available within a <i>split</i> iteration statement. These are:
<ul>
<li>
<strong>$split_columns</strong>: comma separated list of columns by which the <i>split</i>
algorithm splits the table.
</li>
<li>
<strong>$split_index</strong>: name of index used by this <i>split</i>
operation (this index implies the names of columns as presented in <strong>$split_columns</strong>).
</li>
<li>
<strong>$split_min</strong>: minimum values of <strong>$split_columns</strong>. This
is the starting point for the split operation.
</li>
<li>
<strong>$split_max</strong>: minimum values of <strong>$split_columns</strong>. This
is the ending point for the split operation.
</li>
<li>
<strong>$split_range_start</strong>: per chunk, values of <strong>$split_columns</strong>
indicating chunk's lower boundary.
</li>
<li>
<strong>$split_range_end</strong>: per chunk, values of <strong>$split_columns</strong>
indicating chunk's upper boundary.
</li>
<li>
<strong>$split_step</strong>: iteration counter.
Value is <strong>1</strong> for <strong>1st iteration</strong>
<strong>2</strong> for <strong>2nd iteration</strong>, etc.
</li>
<li><strong>$split_rowcount</strong>: the number of rows affected by current <i>split</i> step.</li>
<li>
<strong>$split_total_rowcount</strong>:
the total number of rows affected so far by the <i>split</i> statement.
<br/>This is an accumulation of <strong>$split_rowcount</strong>
</li>
<li>
<strong>$split_clause</strong>:
the computed filtering clause. See following discussion.
</li>
<li>
<strong>$split_total_elapsed_time</strong>:
total number of seconds elapsed since <i>split</i> operation started.
This includes possible <a href="query_script_throttle.html">throttling</a>
or <a href="query_script_sleep.html">sleeping</a> time.
</li>
<li>
<strong>$split_table_schema</strong>: schema for <i>splitting table</i>.
</li>
<li>
<strong>$split_table_name</strong>: the <i>splitting table</i>, by which <i>split</i> works out the smaller steps.
</li>
</ul>
</p>
<a name="flavors"></a>
<h3>Flavors</h3>
<a name="autodetect"></a>
<h4>Single table, autodetect mode</h4>
<p>
<i>split</i> can analyze statements involving single table, and automatically identify the referenced table. This makes for the simplest
and cleanest syntax:
<blockquote><pre>split (DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR)
SELECT $split_total_rowcount AS 'rows deleted so far';
</pre></blockquote>
<blockquote><pre>
create table world.City_dup like world.City;
split (insert into world.City_dup select * from world.City)
{
throttle 2;
}
</pre></blockquote>
</p>
<a name="explicit"></a>
<h4>Multiple tables operations; explicit declaration of splitting table</h4>
<p>
When multiple tables are involved, the user must specify the <i>splitting table</i>:
<blockquote><pre>
split (<strong>sakila.film</strong>: UPDATE sakila.film, sakila.film_category SET film.rental_rate = film.rental_rate * 1.10 WHERE film.film_id = film_category.film_id AND film_category.category_id = 3)
sleep 0.5;
</pre></blockquote>
The user is always allowed to specify the splitting table, even on single table operations:
<blockquote><pre>split (<strong>sakila.rental</strong>: DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR) {
SELECT $split_total_rowcount AS 'rows deleted so far';
}
</pre></blockquote>
The user may still specify the splitting table even when the statement operates on a single table; in which case the specified table
must indeed be the table being operated on. However, with statement operating on single table it is best to let split() figure out the
table name.
</p>
<a name="statementless"></a>
<h4>Statementless split, manual mode</h4>
<p>
<i>split</i> can also accept just the <i>splitting table</i>, without a query. In this "manual mode" the table is being
split and iterated, but no "actual" query is issued.
</p>
<p>
The loop construct, however, is iterated, and the <a href="#magic_variables">magic variables</a> are available. This allows the user
to manually execute what would have been automatic, or otherwise act in unconventional manner. Consider:
<blockquote><pre>split (<strong>sakila.rental</strong>) {
DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR AND <strong>:${split_statement}</strong>;
}
</pre></blockquote>
In the above example, the user builds the splitting of the <strong>DELETE</strong> query manually.
<br/>In other use cases, the user may be interested in the metadata of the splitting process (see EXAMPLES).
The metadata is provided by <i>split</i>'s <a href="#magic_variables">magic variables</a>.
</p>
<a name="parameters"></a>
<h4>Providing parameters</h4>
<p>
<i>split</i> accepts parameters for operation. Normally, <i>split</i> does everything on its own, and does not require instruction.
However, the user is given the choice of fine tuning <i>split</i>'s operation by providing any combination of the following paramaters:
<ul>
<li>
<strong>index</strong>:
explicit name of index to use for splitting. The index must exist under given name
and must be <strong>UNIQUE</strong>, or else an error is thrown.
<br/>By default <i>split</i> chooses the best index
for splitting the table without hint.
</li>
<li>
<strong>size</strong>: number of rows used in each step (minimum: 100; maximum: 50,000; default: 1,000)
</li>
<li>
<strong>start</strong>: starting point for the <i>split</i> operation.
This is a comma delimited list of values (quoted on multiple values).
<br/>The count of values must match the
number of columns in the index picked by <i>split</i>. Thus, if <i>split</i> picks an
AUTO_INCREMENT PRIMARY KEY for the operation, the value is merely a single integer.
<br/>It makes most sense to use this parameters in conjunction with <strong>index</strong>.
<br/>Compound values such as <strong>'2013-07-05,12,smith'</strong> are valid.
<br/>Values do not have to strictly exist in the table: the <i>split</i> operation will begin
<i>as of these values</i>, meaning starting at the first row with these exact values
or larger.
<br/>An error is thrown when the number of values specified does not match the
number of columns covered by the splitting key.
<br/>All data types are supported, including textual.
</li>
<li>
<strong>stop</strong>: ending point for the <i>split</i> operation.
This is a comma delimited list of values (quoted on multiple values).
<br/>The count of values must match the
number of columns in the index picked by <i>split</i>. Thus, if <i>split</i> picks an
AUTO_INCREMENT PRIMARY KEY for the operation, the value is merely a single integer.
<br/>It makes most sense to use this parameters in conjunction with <strong>index</strong>.
<br/>Compound values such as <strong>'2013-07-05,12,smith'</strong> are valid.
<br/>Values do not have to strictly exist in the table: the <i>split</i> operation will run
<i>up to these values</i>, meaning stopping at the first row with these exact values
and excluding any larger.
<br/>An error is thrown when the number of values specified does not match the
number of columns covered by the splitting key.
<br/>All data types are supported, including textual.
</li>
<li>
<strong>table</strong>: explicit table & schema name, when multiple statements are used.
This parameter is not required, though allowed, when the statement operates on a single table.
</li>
</ul>
</p>
<p>
In the following example, the <strong>rental</strong> table has an
AUTO_INCREMENT PRIMARY KEY column called <strong>rental_id</strong>.
The <i>split</i> operation starts with <strong>rental_id</strong> value of 1200, works
till the end of table, and uses chunks of <strong>500</strong> rows at a time.
<blockquote><pre>split ({start: 1200, size: 500} : DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR)
throttle 2;
</pre></blockquote>
In the above example, the user builds the splitting of the <strong>DELETE</strong> query manually.
</p>
<p>
The following example shows how to use the <strong>start</strong> parameter
in the case of a compound key. The <strong>film_actor</strong> table has a compound
PRIMARY KEY on <strong>(`actor_id`,`film_id`)</strong>.
The <i>split</i> operation starts with <strong>actor_id = 12 & film_id = 631</strong>.
<blockquote><pre>split({start:'12,631'} : UPDATE sakila.film_actor set last_update=NOW())
throttle 2;
</pre></blockquote>
In the above the values <strong>'12,631'</strong> are quoted: quotes must be added for
multiple values; they are allowed but not required on single values (as can be seen
on the previous example).
<br/>Since there is no limitation on the type of columns, it is possible that they are
textual. The case where you would have a comma (<strong>","</strong>) in one of your
<strong>start/stop</strong> column values is <i>not supported</i>; any comma is
interpreted as a columns separator.
</p>
<a name="limitations"></a>
<h3>LIMITATIONS</h3>
<p>
<i>split</i> accepts these types of statements:
<ul>
<li>DELETE FROM table_name ...</li>
<li>DELETE FROM table_name USING <multi table syntax> ...</li>
<li>UPDATE table_name SET ...</li>
<li>UPDATE <multiple tables> SET ...</li>
<li>INSERT INTO some_table SELECT ... FROM <single or multiple tables> ...</li>
<li>REPLACE INTO some_table SELECT ... FROM <single or multiple tables> ...</li>
<li>SELECT ... FROM <multiple tables> ...</li>
</ul>
</p>
<p>
The following limitations apply to the split statement:
<ul>
<li>
You should avoid using index hints on the <i>splitting table</i>.
</li>
<li>
At current, <i>split</i> does not accept the <strong>DELETE FROM tbl.* ...</strong> syntax. Use <strong>DELETE FROM tbl ...</strong> instead.
</li>
<li>
Statements with <strong>DISTICT</strong> will probably result with unexpected results.
Statements with <strong>GROUP BY</strong> may also behave unexpectedly, depending on the statement.
</li>
</ul>
</p>
<p>
<i>split</i> is furthermore subject to the following limitations:
<ul>
<li>A <i>split</i> statement cannot be nested within another <i>split</i> statement. To clarify, there is no problem with
nesting other loop constructs such as <a href="query_script_while.html">while</a>, <a href="query_script_foreach.html">foreach</a> etc.
</li>
<li>Aliasing the <i>splitting table</i> is not allowed.</li>
<li>
For <a href="#autodetect">table autodetection</a> to work, the statement must work on a single table only, and must not
contain index hints, derived tables or subqueries. You may always choose to <a href="#explicit">explicitly</a> declare the <i>splitting table</i>
using the <strong>split (<strong>schema_name.table_name</strong>: the statement) {...}</strong> variation.
</li>
</ul>
</p>
<h3>EXAMPLES</h3>
<p>
Mike is resigned. Assign all mike's issues to Jon:
<blockquote><pre>call run("
<strong>split (update sakila.rental set staff_id = 2 where staff_id = 1)
select $split_total_rowcount as 'processed issues';</strong>
");
+------------------+
| processed issues |
+------------------+
| 479 |
+------------------+
1 row in set (0.07 sec)
+------------------+
| processed issues |
+------------------+
| 983 |
+------------------+
1 row in set (0.09 sec)
...
+------------------+
| processed issues |
+------------------+
| 8040 |
+------------------+
1 row in set (0.40 sec)
</pre></blockquote>
</p>
<p>
Create denormalized table, fill it:
<blockquote><pre>
CREATE TABLE sakila.denormalized_film_category (
film_id smallint unsigned NOT NULL,
category_id tinyint unsigned NOT NULL,
last_update timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
film_title varchar(255),
category_name varchar(255),
PRIMARY KEY (film_id,category_id)
);
split (<strong>sakila.film_category</strong>:
INSERT INTO sakila.denormalized_film_category
SELECT
film_id,
category_id,
film_category.last_update,
film.title,
category.name
FROM
sakila.film_category
JOIN sakila.film USING (film_id)
JOIN sakila.category USING (category_id)
)
{
SELECT <strong>$split_total_rowcount</strong> AS 'total rows generated so far';
throttle 2;
}
</pre></blockquote>
The above uses the sample <strong>sakila</strong> database. It just so happens that the number of rows in <strong>sakila.film_category</strong>
is exactly <strong>1,000</strong>, which makes for a single step.
</p>
<p>
Walk through a table (no particular statement to execute); watch the <a href="#magic_variables">magic variables</a>:
<blockquote><pre>call run("
<strong>split(sakila.film_actor) {
select
$split_step as step, $split_columns as columns,
$split_min as min_value, $split_max as max_value,
$split_range_start as range_start, $split_range_end as range_end
}</strong>
");
+------+----------------------+-----------+-------------+-------------+------------+
| step | columns | min_value | max_value | range_start | range_end |
+------+----------------------+-----------+-------------+-------------+------------+
| 1 | `actor_id`,`film_id` | '1','1' | '200','993' | '1','1' | '39','293' |
+------+----------------------+-----------+-------------+-------------+------------+
+------+----------------------+-----------+-------------+-------------+------------+
| step | columns | min_value | max_value | range_start | range_end |
+------+----------------------+-----------+-------------+-------------+------------+
| 2 | `actor_id`,`film_id` | '1','1' | '200','993' | '39','293' | '76','234' |
+------+----------------------+-----------+-------------+-------------+------------+
+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns | min_value | max_value | range_start | range_end |
+------+----------------------+-----------+-------------+-------------+-------------+
| 3 | `actor_id`,`film_id` | '1','1' | '200','993' | '76','234' | '110','513' |
+------+----------------------+-----------+-------------+-------------+-------------+
+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns | min_value | max_value | range_start | range_end |
+------+----------------------+-----------+-------------+-------------+-------------+
| 4 | `actor_id`,`film_id` | '1','1' | '200','993' | '110','513' | '146','278' |
+------+----------------------+-----------+-------------+-------------+-------------+
+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns | min_value | max_value | range_start | range_end |
+------+----------------------+-----------+-------------+-------------+-------------+
| 5 | `actor_id`,`film_id` | '1','1' | '200','993' | '146','278' | '183','862' |
+------+----------------------+-----------+-------------+-------------+-------------+
+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns | min_value | max_value | range_start | range_end |
+------+----------------------+-----------+-------------+-------------+-------------+
| 6 | `actor_id`,`film_id` | '1','1' | '200','993' | '183','862' | '200','993' |
+------+----------------------+-----------+-------------+-------------+-------------+
</pre></blockquote>
</p>
<p>
Use expanded variables as table & schema names in split statement. In this example we update all tables called <strong>rental</strong>
in any database.
<blockquote><pre>foreach($tbl, $scm: table like rental) {
split(update :${scm}.:${tbl} set rental_date = rental_date + interval 1 day) {
throttle 1;
}
}
</pre>
</blockquote>
</p>
<h3>SEE ALSO</h3>
<a href="query_script_foreach.html">foreach</a>,
<a href="query_script_break.html">break</a>,
<a href="query_script_throttle.html">throttle</a>,
<a href="candidate_keys.html">candidate_keys</a>
<h3>AUTHOR</h3>
Shlomi Noach