doc/templates/components/query_script_split.html


<p>
QueryScript Flow Control: <strong>split</strong> statement
</p>

<h3>SYNOPSIS</h3>
<p>
Single table operations, <a href="#autodetect">autodetect</a> mode:
<blockquote><pre>split (<i>statement operating on single table</i>)
  statement;
</pre></blockquote>
Multiple tables operations; <a href="#explicit">explicit declaration</a> of <i>splitting table</i>:
<blockquote><pre>split (schema_name.table_name: <i>statement operating on multiple tables</i>)
  statement;
</pre></blockquote>
Statementless split, <a href="#statementless">manual</a> mode:
<blockquote><pre>split (schema_name.table_name)
  statement;
</pre></blockquote>
Provide <a href="#parameters">parameters</a> to split operation:
<blockquote><pre>split ({foo:bar}: <i>statement</i>)
  statement;
</pre></blockquote>
Discussion on the various flavors of <i>split</i> <a href="#flavors">follows</a>.
</p>

<h3>DESCRIPTION</h3>
<p> 
<i>split</i> automagically breaks a query into subparts -- smaller chunks -- and works these in
steps. It alleviates the load caused by large operations by turning them into smaller ones.
</p>

<p>
Consider the following query: we realize we must <strong>UPDATE</strong> a column on all rows:
<blockquote><pre>split (UPDATE sakila.rental SET rental_date = rental_date + INTERVAL 6 HOUR)
  pass;
</pre></blockquote>

To execute a "normal" <strong>UPDATE</strong> of the above form would mean, 
assuming the table is very large, 
issuing a very large transaction. Such transaction could take hours to complete, by which time locks
are accumulating, performance is degrading, and an attempt attempt at rollback can make for an even
larger overhead. 
</p>

<p>
A solution to such a problem is in the form of <i>chunking</i>: splitting the query into many smaller ones, 
each operating on a distinct group of rows. Not only the query, but the transaction itself 
(assuming <strong>AUTOCOMMIT=1</strong>) is broken into smaller transactions. Each such transaction is
quick to complete, and has better chance at not making for any locks. One may choose to "rest" in between
chunks, making for the availability of system resources. 
</p>

<p>
The above <i>split</i> code does just that: it automagically breaks the query, by:
<ul> 
	<li>Analyzing the query, <a href="#autodetect">detecting</a>, if possible, the table on which the split is done
		(with a multi-table query <a href="#explicit">explicit</a> instruction is required)</li>
	<li>Analyzing the table, detecting best method of splitting it up into smaller parts. This is done
		by choosing the best UNIQUE KEY by which to work out the splitting process.</li>
	<li>Rewriting the query so as to add a filtering condition placeholder: the expression which
		limits.</li>
	<li>Determining the particular chunks by running over the actual rows, and issuing query on each chunk.</li>
</ul> 
Thus, <i>split</i> works by selecting a particular table used by the SQL action statement, and by breaking it apart.
This table is called the <i>splitting table</i> or the <i>chunking table</i>.
</p>
<p>
<i>split</i> looks like a looping construct: the <i>statement</i> gets executed once per each
chunk of the original query.
As with looping constructs, it respects, among others, the following statements:
<ul>
	<li><a href="query_script_break.html">break</a>: terminate <i>split</i> execution: this means skipping any remaining
		chunks.</li>
	<li><a href="query_script_throttle.html">throttle</a>: control loop execution time by sleeping in between iterations, time of sleep
		proportional to time of execution.</li>
</ul>
</p>

<p>
	<i>split</i> defaults to chunks of <strong>1,000</strong> rows each. This can be configured via the size <a href="#parameters">parameter</a>.
</p>

<p>
	Use of <a href="query_script_variables.html#expansion">expanded variables</a> is allowed within the split statement, as well as within
	the table definition and the query params. See EXAMPLES for more on this. 
</p>

<a name="magic_variables"></a>
<h4>Magic variables</h4>
<p>
<i>split</i> introduces magic variables, which are available within a <i>split</i> iteration statement. These are:
<ul>
	<li>
		<strong>$split_columns</strong>: comma separated list of columns by which the <i>split</i> 
		algorithm splits the table.
	</li>
	<li>
		<strong>$split_index</strong>: name of index used by this <i>split</i> 
		operation (this index implies the names of columns as presented in <strong>$split_columns</strong>).
	</li>
	<li>
		<strong>$split_min</strong>: minimum values of <strong>$split_columns</strong>. This
		is the starting point for the split operation.
	</li>
	<li>
		<strong>$split_max</strong>: minimum values of <strong>$split_columns</strong>. This
		is the ending point for the split operation.
	</li>
	<li>
		<strong>$split_range_start</strong>: per chunk, values of <strong>$split_columns</strong>
		indicating chunk's lower boundary.
	</li>
	<li>
		<strong>$split_range_end</strong>: per chunk, values of <strong>$split_columns</strong>
		indicating chunk's upper boundary.
	</li>
	<li>
		<strong>$split_step</strong>: iteration counter. 
		Value is <strong>1</strong> for <strong>1st iteration</strong>
		<strong>2</strong> for <strong>2nd iteration</strong>, etc.
	</li>
	<li><strong>$split_rowcount</strong>: the number of rows affected by current <i>split</i> step.</li>
	<li>
		<strong>$split_total_rowcount</strong>: 
		the total number of rows affected so far by the <i>split</i> statement.
		<br/>This is an accumulation of <strong>$split_rowcount</strong>
	</li>
	<li>
		<strong>$split_clause</strong>: 
		the computed filtering clause. See following discussion.
	</li>
	<li>
		<strong>$split_total_elapsed_time</strong>: 
		total number of seconds elapsed since <i>split</i> operation started.
		This includes possible <a href="query_script_throttle.html">throttling</a> 
		or <a href="query_script_sleep.html">sleeping</a> time.
	</li>
	<li>
		<strong>$split_table_schema</strong>: schema for <i>splitting table</i>.
	</li>
	<li>
		<strong>$split_table_name</strong>: the <i>splitting table</i>, by which <i>split</i> works out the smaller steps.
	</li>
</ul>
</p>

<a name="flavors"></a>
<h3>Flavors</h3>
<a name="autodetect"></a>
<h4>Single table, autodetect mode</h4>
<p>
	<i>split</i> can analyze statements involving single table, and automatically identify the referenced table. This makes for the simplest
	and cleanest syntax:
<blockquote><pre>split (DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR)
  SELECT $split_total_rowcount AS 'rows deleted so far';
</pre></blockquote>
<blockquote><pre>
create table world.City_dup like world.City;
split (insert into world.City_dup select * from world.City)
{
  throttle 2;
}
</pre></blockquote>
</p>

<a name="explicit"></a>
<h4>Multiple tables operations; explicit declaration of splitting table</h4>
<p>
	When multiple tables are involved, the user must specify the <i>splitting table</i>:
<blockquote><pre>
split (<strong>sakila.film</strong>: UPDATE sakila.film, sakila.film_category SET film.rental_rate = film.rental_rate * 1.10 WHERE film.film_id = film_category.film_id AND film_category.category_id = 3)
  sleep 0.5;
</pre></blockquote>
The user is always allowed to specify the splitting table, even on single table operations:
<blockquote><pre>split (<strong>sakila.rental</strong>: DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR) {
  SELECT $split_total_rowcount AS 'rows deleted so far';
}
</pre></blockquote>
	The user may still specify the splitting table even when the statement operates on a single table; in which case the specified table
	must indeed be the table being operated on. However, with statement operating on single table it is best to let split() figure out the
	table name.
</p>

<a name="statementless"></a>
<h4>Statementless split, manual mode</h4>
<p>
	<i>split</i> can also accept just the <i>splitting table</i>, without a query. In this "manual mode" the table is being
	split and iterated, but no "actual" query is issued.
</p>
<p>
	The loop construct, however, is iterated, and the <a href="#magic_variables">magic variables</a> are available. This allows the user
	to manually execute what would have been automatic, or otherwise act in unconventional manner. Consider: 
<blockquote><pre>split (<strong>sakila.rental</strong>) {
  DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR AND <strong>:${split_statement}</strong>;
}
</pre></blockquote>
	In the above example, the user builds the splitting of the <strong>DELETE</strong> query manually.
	<br/>In other use cases, the user may be interested in the metadata of the splitting process (see EXAMPLES).
	The metadata is provided by <i>split</i>'s <a href="#magic_variables">magic variables</a>.
</p>

<a name="parameters"></a>
<h4>Providing parameters</h4>
<p>
	<i>split</i> accepts parameters for operation. Normally, <i>split</i> does everything on its own, and does not require instruction.
	However, the user is given the choice of fine tuning <i>split</i>'s operation by providing any combination of the following paramaters:
	<ul>
		<li>
			<strong>index</strong>: 
			explicit name of index to use for splitting. The index must exist under given name
			and must be <strong>UNIQUE</strong>, or else an error is thrown. 
			<br/>By default <i>split</i> chooses the best index
			for splitting the table without hint.
		</li>
		<li>
			<strong>size</strong>: number of rows used in each step (minimum: 100; maximum: 50,000; default: 1,000)
		</li>
		<li>
			<strong>start</strong>: starting point for the <i>split</i> operation.
			This is a comma delimited list of values (quoted on multiple values). 
			<br/>The count of values must match the 
			number of columns in the index picked by <i>split</i>. Thus, if <i>split</i> picks an
			AUTO_INCREMENT PRIMARY KEY for the operation, the value is merely a single integer.
			<br/>It makes most sense to use this parameters in conjunction with <strong>index</strong>.
			<br/>Compound values such as <strong>'2013-07-05,12,smith'</strong> are valid.
			<br/>Values do not have to strictly exist in the table: the <i>split</i> operation will begin
			<i>as of these values</i>, meaning starting at the first row with these exact values 
			or larger. 
			<br/>An error is thrown when the number of values specified does not match the
			number of columns covered by the splitting key.
			<br/>All data types are supported, including textual.
		</li>
		<li>
			<strong>stop</strong>: ending point for the <i>split</i> operation.
			This is a comma delimited list of values (quoted on multiple values). 
			<br/>The count of values must match the 
			number of columns in the index picked by <i>split</i>. Thus, if <i>split</i> picks an
			AUTO_INCREMENT PRIMARY KEY for the operation, the value is merely a single integer.
			<br/>It makes most sense to use this parameters in conjunction with <strong>index</strong>.
			<br/>Compound values such as <strong>'2013-07-05,12,smith'</strong> are valid.
			<br/>Values do not have to strictly exist in the table: the <i>split</i> operation will run
			<i>up to these values</i>, meaning stopping at the first row with these exact values 
			and excluding any larger. 
			<br/>An error is thrown when the number of values specified does not match the
			number of columns covered by the splitting key.
			<br/>All data types are supported, including textual.
		</li>
		<li>
			<strong>table</strong>: explicit table & schema name, when multiple statements are used.
			This parameter is not required, though allowed, when the statement operates on a single table. 
		</li>
	</ul>
</p>
<p>
	In the following example, the <strong>rental</strong> table has an 
	AUTO_INCREMENT PRIMARY KEY column called <strong>rental_id</strong>.
	The <i>split</i> operation starts with <strong>rental_id</strong> value of 1200, works
	till the end of table, and uses chunks of <strong>500</strong> rows at a time.
<blockquote><pre>split ({start: 1200, size: 500} : DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR)
  throttle 2;
</pre></blockquote>
In the above example, the user builds the splitting of the <strong>DELETE</strong> query manually.
</p>

<p>
	The following example shows how to use the <strong>start</strong> parameter
	in the case of a compound key. The <strong>film_actor</strong> table has a compound
	PRIMARY KEY on <strong>(`actor_id`,`film_id`)</strong>.
	The <i>split</i> operation starts with <strong>actor_id = 12 & film_id = 631</strong>.
<blockquote><pre>split({start:'12,631'} : UPDATE sakila.film_actor set last_update=NOW())
  throttle 2;
</pre></blockquote>
	In the above the values <strong>'12,631'</strong> are quoted: quotes must be added for
	multiple values; they are allowed but not required on single values (as can be seen
	on the previous example).	
	<br/>Since there is no limitation on the type of columns, it is possible that they are
	textual. The case where you would have a comma (<strong>","</strong>) in one of your
	<strong>start/stop</strong> column values is <i>not supported</i>; any comma is 
	interpreted as a columns separator. 
</p>

<a name="limitations"></a>
<h3>LIMITATIONS</h3>
<p>
	<i>split</i> accepts these types of statements:
	<ul>
		<li>DELETE FROM table_name ...</li>
		<li>DELETE FROM table_name USING &lt;multi table syntax&gt; ...</li>
		<li>UPDATE table_name SET ...</li>
		<li>UPDATE &lt;multiple tables&gt; SET ...</li>
		<li>INSERT INTO some_table SELECT ... FROM &lt;single or multiple tables&gt; ...</li>
		<li>REPLACE INTO some_table SELECT ... FROM &lt;single or multiple tables&gt; ...</li>
		<li>SELECT ... FROM &lt;multiple tables&gt; ...</li>
	</ul>
</p>
<p>
	The following limitations apply to the split statement:
	<ul>
		<li>
			You should avoid using index hints on the <i>splitting table</i>.
		</li>
		<li>
			At current, <i>split</i> does not accept the <strong>DELETE FROM tbl.* ...</strong> syntax. Use <strong>DELETE FROM tbl ...</strong> instead.
		</li>
		<li>
			Statements with <strong>DISTICT</strong> will probably result with unexpected results. 
			Statements with <strong>GROUP BY</strong> may also behave unexpectedly, depending on the statement.
		</li>
	</ul>
</p>
<p>
	<i>split</i> is furthermore subject to the following limitations:
	<ul>
		<li>A <i>split</i> statement cannot be nested within another <i>split</i> statement. To clarify, there is no problem with
			nesting other loop constructs such as <a href="query_script_while.html">while</a>, <a href="query_script_foreach.html">foreach</a> etc.
		</li>
		<li>Aliasing the <i>splitting table</i> is not allowed.</li>
		<li>
			For <a href="#autodetect">table autodetection</a> to work, the statement must work on a single table only, and must not
			contain index hints, derived tables or subqueries. You may always choose to <a href="#explicit">explicitly</a> declare the <i>splitting table</i>
			using the <strong>split (<strong>schema_name.table_name</strong>: the statement) {...}</strong> variation.
		</li>
	</ul>
</p>

<h3>EXAMPLES</h3>
<p>
Mike is resigned. Assign all mike's issues to Jon:
<blockquote><pre>call run("
  <strong>split (update sakila.rental set staff_id = 2 where staff_id = 1)
    select $split_total_rowcount as 'processed issues';</strong>
");
  
+------------------+
| processed issues |
+------------------+
|              479 |
+------------------+
1 row in set (0.07 sec)

+------------------+
| processed issues |
+------------------+
|              983 |
+------------------+
1 row in set (0.09 sec)

...

+------------------+
| processed issues |
+------------------+
|             8040 |
+------------------+
1 row in set (0.40 sec)
</pre></blockquote>
</p>

<p>
Create denormalized table, fill it:
<blockquote><pre>
CREATE TABLE sakila.denormalized_film_category (
  film_id smallint unsigned NOT NULL,
  category_id tinyint unsigned NOT NULL,
  last_update timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  film_title varchar(255),
  category_name varchar(255),
  PRIMARY KEY (film_id,category_id)
);

split (<strong>sakila.film_category</strong>:
  INSERT INTO sakila.denormalized_film_category
  SELECT
    film_id,
    category_id,
    film_category.last_update,
    film.title,
    category.name
  FROM
    sakila.film_category
    JOIN sakila.film USING (film_id)
    JOIN sakila.category USING (category_id)
  ) 
{
  SELECT <strong>$split_total_rowcount</strong> AS 'total rows generated so far';
  throttle 2;
}
</pre></blockquote>
The above uses the sample <strong>sakila</strong> database. It just so happens that the number of rows in <strong>sakila.film_category</strong>
is exactly <strong>1,000</strong>, which makes for a single step.
</p>


<p>
Walk through a table (no particular statement to execute); watch the <a href="#magic_variables">magic variables</a>:
<blockquote><pre>call run("
  <strong>split(sakila.film_actor) { 
    select 
      $split_step as step, $split_columns as columns, 
      $split_min as min_value, $split_max as max_value, 
      $split_range_start as range_start, $split_range_end as range_end
  }</strong>
");

+------+----------------------+-----------+-------------+-------------+------------+
| step | columns              | min_value | max_value   | range_start | range_end  |
+------+----------------------+-----------+-------------+-------------+------------+
|    1 | `actor_id`,`film_id` | '1','1'   | '200','993' | '1','1'     | '39','293' |
+------+----------------------+-----------+-------------+-------------+------------+

+------+----------------------+-----------+-------------+-------------+------------+
| step | columns              | min_value | max_value   | range_start | range_end  |
+------+----------------------+-----------+-------------+-------------+------------+
|    2 | `actor_id`,`film_id` | '1','1'   | '200','993' | '39','293'  | '76','234' |
+------+----------------------+-----------+-------------+-------------+------------+

+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns              | min_value | max_value   | range_start | range_end   |
+------+----------------------+-----------+-------------+-------------+-------------+
|    3 | `actor_id`,`film_id` | '1','1'   | '200','993' | '76','234'  | '110','513' |
+------+----------------------+-----------+-------------+-------------+-------------+

+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns              | min_value | max_value   | range_start | range_end   |
+------+----------------------+-----------+-------------+-------------+-------------+
|    4 | `actor_id`,`film_id` | '1','1'   | '200','993' | '110','513' | '146','278' |
+------+----------------------+-----------+-------------+-------------+-------------+

+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns              | min_value | max_value   | range_start | range_end   |
+------+----------------------+-----------+-------------+-------------+-------------+
|    5 | `actor_id`,`film_id` | '1','1'   | '200','993' | '146','278' | '183','862' |
+------+----------------------+-----------+-------------+-------------+-------------+

+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns              | min_value | max_value   | range_start | range_end   |
+------+----------------------+-----------+-------------+-------------+-------------+
|    6 | `actor_id`,`film_id` | '1','1'   | '200','993' | '183','862' | '200','993' |
+------+----------------------+-----------+-------------+-------------+-------------+
</pre></blockquote>
</p>

<p>
Use expanded variables as table & schema names in split statement. In this example we update all tables called <strong>rental</strong>
in any database.
<blockquote><pre>foreach($tbl, $scm: table like rental) {
  split(update :${scm}.:${tbl} set rental_date = rental_date + interval 1 day) {
    throttle 1;
  }
}
</pre>
</blockquote>
</p>
  
<h3>SEE ALSO</h3>
<a href="query_script_foreach.html">foreach</a>,
<a href="query_script_break.html">break</a>,
<a href="query_script_throttle.html">throttle</a>,
<a href="candidate_keys.html">candidate_keys</a>

<h3>AUTHOR</h3>
Shlomi Noach