In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'

In [None]:
# Setup: create the Spark session and the Splice Machine context
import os
import pyspark
from splicemachine.spark.context import PySpliceContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# Make sure PySpark tells workers to use Python 3, not 2, if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
# JDBC_HOST is set in the previous cell
jdbc_host = os.environ['JDBC_HOST']

conf = SparkConf()
sc = pyspark.SparkContext(conf=conf)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# JDBC connection string for the Splice Machine database
splicejdbc = f"jdbc:splice://{jdbc_host}:1527/splicedb;user=splice;password=admin"

splice = PySpliceContext(spark, splicejdbc)


<link rel="stylesheet" href="https://doc.splicemachine.com/zeppelin/css/zepstyles.css" />

# Explaining and Hinting

In this notebook we'll dig into the explain and hint capabilities that we've briefly seen so far. We'll see how they can help us:

* Understand more deeply how the optimizer plans to run a query
* Influence that plan when necessary


## Understanding the Query Execution Plan

This section describes more fully the information in the explain plan for a query. The key pieces of information in a plan include:

*  Ordering of the joins and other steps in the query
*  Use of Tables vs Indexes
*  Need for IndexLookup, which can slow a query down
*  Join Strategies employed
*  Estimated row counts and costs at each step
*  Presence of predicate pushdowns where available
*  Indication of which *engine* will run the query: *control* or *Spark*

We'll delve a bit deeper into pushing down predicates and join ordering/strategies to help you understand plans.

### Explain and Predicates

Let's start with a query variant that is based on the `index_example` table from a previous tutorial. Click the  <img class="inline" src="https://doc.splicemachine.com/zeppelin/images/zepPlayIcon.png" alt="Run Zep Paragraph Icon"> *Run* button in the next paragraph to display the plan for this query:


In [None]:
%%sql 

explain select a.i, a.j from
    index_example a
    ,index_example b --splice-properties joinStrategy=sortmerge
     where a.i = b.i
     and a.j = 700000

<br />

You'll notice that two lines at the far right of the plan contain *preds=*. *Preds* is short for *predicates*, which in databases are true/false conditions that are tested during query execution.

Starting on the bottom line, we see an `IndexScan` with the preds specification on it; this is called a *predicate pushdown*. A pushdown means that when we perform this `IndexScan`, we bring this predicate (`A.J = 700000`) along with us and return ONLY the rows that match. Predicate pushdowns are extremely efficient when performed on keyed results (primary keys or indexes), because only the minimal number of rows is passed up to the next step.

The other kind of predicate shown here is of the form `[(A.I[4:2] = B.I[4:1])]`. You can ignore the numbers for now; the key part is `A.I = B.I`.  You can see that this is the join predicate, required for the actual join operation.

The main takeaway, as with most databases, is that when you can *push down* a predicate that filters out a lot of data with a keyed filter, the scan for that step becomes efficient. If the filter is not keyed, this is a potential opportunity for adding an index.
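
To make the two kinds of predicate easier to spot, here is a minimal Python sketch that pulls the `preds=` entries out of plan text and strips the bracketed position numbers. The sample plan is hand-written and heavily abbreviated, not real `explain` output: the join predicate is copied from the text above, while the operator name `MergeSortJoin`, the index name `IJ`, the `n=` values, and the `[0:1]` position marker are assumptions made for illustration. The helper name `extract_predicates` is likewise hypothetical.

```python
import re

# Hand-written, abbreviated stand-in for explain output (NOT real output).
# Only the join predicate text is taken from the notebook; the rest is assumed.
PRED_SAMPLE_PLAN = """\
MergeSortJoin(n=3,preds=[(A.I[4:2] = B.I[4:1])])
  ->  TableScan[INDEX_EXAMPLE](n=2)
  ->  IndexScan[IJ](n=1,preds=[(A.J[0:1] = 700000)])
"""

# Greedy match so the capture runs to the last closing bracket on the line.
PRED_RE = re.compile(r"preds=\[(.*)\]")
# Position markers such as [4:2] that can be ignored when reading a plan.
POSITION_RE = re.compile(r"\[\d+:\d+\]")


def extract_predicates(plan_text):
    """Return (operator, predicate) pairs for plan lines that carry preds=."""
    found = []
    for line in plan_text.splitlines():
        match = PRED_RE.search(line)
        if not match:
            continue
        # The operator name is everything before the first parenthesis.
        operator = line.replace("->", "").strip().split("(", 1)[0]
        predicate = POSITION_RE.sub("", match.group(1))
        found.append((operator, predicate))
    return found


for operator, predicate in extract_predicates(PRED_SAMPLE_PLAN):
    print(f"{operator}: {predicate}")
```

In this sketch, a predicate attached to a scan step is a pushdown, and a predicate attached to a join step is the join condition.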

### Join Ordering

The actual join ordering is part of the optimization process: do I get a better cost when I start with table A and join B with it, or the other way around?

Smart join ordering depends a lot on the situation.  Generally speaking, the sooner you can filter out rows (thus working with fewer rows at each step of the query), the faster the query will run.

When you look at an explain plan, if you are unsure of the ordering, remember that the order is *bottom up*. Another way to view this is to look at the step counts on each row of the plan (n=1, n=2, etc.), which indicate the table ordering being used.
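
As a rough illustration of reading a plan bottom up, the following Python sketch sorts the steps of a plan by their `n=` value. The sample plan is hand-written and abbreviated, not real `explain` output; the operator names and `n=` values are assumptions, and `execution_order` is a hypothetical helper name.

```python
import re

# Hand-written, abbreviated stand-in for explain output (NOT real output).
ORDER_SAMPLE_PLAN = """\
Cursor(n=5)
  ->  ScrollInsensitive(n=4)
    ->  MergeSortJoin(n=3)
      ->  TableScan[INDEX_EXAMPLE](n=2)
      ->  IndexScan[IJ](n=1)
"""

# Operator name, optional [object] suffix, then the step counter n=<number>.
STEP_RE = re.compile(r"([A-Za-z]+(?:\[[^\]]*\])?)\(n=(\d+)")


def execution_order(plan_text):
    """Return operator names sorted by ascending n, i.e. read bottom up."""
    steps = []
    for line in plan_text.splitlines():
        match = STEP_RE.search(line)
        if match:
            steps.append((int(match.group(2)), match.group(1)))
    return [name for _, name in sorted(steps)]


for position, name in enumerate(execution_order(ORDER_SAMPLE_PLAN), start=1):
    print(position, name)
```

Under these assumptions, the scans listed first are the tables the join starts from.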

## Influencing the Query Execution Plan with Hints

If your query is still slower than you expect, or if you just want to try out plan alternatives to see what would happen, you can use Splice Machine *query hints*.

We introduced hints in an earlier tutorial, *Tuning for Performance.* To recap: you add a hint to a query by appending a specially formatted *comment*. These hints must always be placed at the end of a line, and can be used either after a table name or after a `FROM` clause. Most hints are used for these reasons:

<table class="splicezep">
    <col />
    <col />
    <thead>
        <tr>
            <th>Hint Type</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td class="ItalicFont">Join Order</td>
            <td>Indicates that the join order of the tables in the plan should be exactly the same as entered in the query SQL (first to last)</td>
        </tr>
        <tr>
            <td class="ItalicFont">Join Strategy</td>
            <td><p>Explicitly specifies the join strategy to use:</p>
                <ul>
                    <li><code>broadcast</code></li>
                    <li><code>sortmerge</code></li>
                    <li><code>merge</code></li>
                    <li><code>nestedloop</code></li>
                </ul>
            </td>
        </tr>
        <tr>
            <td class="ItalicFont">Index Selection</td>
            <td>Explicitly specifies the use of a specific index, or explicitly specifies to NOT use an index</td>
        </tr>
    </tbody>
</table>

### Syntax Matters

Note that in the example below, the comma separating the tables to be joined must come AFTER the index hint, and thus is on the next line. Likewise, if a semicolon terminates your SQL statement, you must put the semicolon AFTER the hint, on the next line. The reason is that a hint is a comment, so anything that follows it on the same line is treated as part of that comment.

<div class="noteIcon">
    <p>Hints must be specified exactly; any misspelling or any extra text can result in the hint not working because it is considered a comment; for example, you <strong>must</strong> spell joinOrder and joinStrategy in exactly that way.</p>
    <p>Splice Machine <strong>strongly recommends</strong> that you run an <code>explain</code> on any query that contains a hint before actually executing the query, so you can verify that the hint is correctly specified.</p>
</div>
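
Because a misspelled hint is silently treated as a comment, a quick pre-check of the SQL text can catch obvious slips before you run `explain`. The Python sketch below is one possible way to do that; it is not a Splice Machine API. The helper `check_hints` is hypothetical, and it recognizes only the hint names and join strategies mentioned in this notebook, so "unrecognized" here means "not covered in this notebook" rather than "invalid".

```python
# Hint names and join strategies named in this notebook (assumed non-exhaustive).
KNOWN_HINT_NAMES = {"joinOrder", "joinStrategy", "index"}
KNOWN_STRATEGIES = {"broadcast", "sortmerge", "merge", "nestedloop"}
HINT_MARKER = "--splice-properties "


def check_hints(sql):
    """Return a list of human-readable problems found in hint comments."""
    problems = []
    for lineno, line in enumerate(sql.splitlines(), start=1):
        if HINT_MARKER not in line:
            continue
        hint_text = line.split(HINT_MARKER, 1)[1].strip()
        # A trailing comma or semicolon would be swallowed by the comment.
        if hint_text.endswith((",", ";")):
            problems.append(
                f"line {lineno}: move the trailing '{hint_text[-1]}' to the next line"
            )
            hint_text = hint_text[:-1]
        for item in hint_text.split(","):
            name, _, value = item.strip().partition("=")
            if name not in KNOWN_HINT_NAMES:
                problems.append(f"line {lineno}: unrecognized hint name '{name}'")
            elif name == "joinStrategy" and value not in KNOWN_STRATEGIES:
                problems.append(f"line {lineno}: unrecognized join strategy '{value}'")
    return problems


BAD_SQL = """\
select a.i from
    index_example a --splice-properties joinstrategy=sortmerge,
    index_example b
"""
for problem in check_hints(BAD_SQL):
    print(problem)
```

A check like this complements, but does not replace, running `explain` on the hinted query.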

Here's the syntax to use for each hint type:

<table class="splicezep">
    <col />
    <col />
    <col />
    <thead>
        <tr>
            <th>Hint Type</th>
            <th>Syntax Example</th>
            <th>Usage Notes</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td class="ItalicFont">Join Order</td>
            <td class="CodeFont">joinOrder=fixed</td>
            <td>On the <code>FROM</code> line in the query</td>
        </tr>
        <tr>
            <td class="ItalicFont">Join Strategy</td>
            <td class="CodeFont">joinStrategy=broadcast</td>
            <td>After the right-hand-side table. This is typically used with <code>joinOrder=fixed</code> to control which tables are joined.</td>
       </tr>
        <tr>
            <td class="ItalicFont">Index Selection</td>
            <td class="CodeFont">index=ix</td>
            <td>After the specified table</td>
        </tr>
        <tr>
            <td class="ItalicFont">No index</td>
            <td class="CodeFont">index=null</td>
            <td>After the specified table</td>
        </tr>
    </tbody>
</table>
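
If you generate queries from code, the placement rules in the table can be applied mechanically. The following Python sketch assembles a hinted `FROM` clause so that each hint ends its line and each separating comma starts the next line. The function `build_from_clause` and its parameters are hypothetical names chosen for this illustration; the hint names and values come from the table above.

```python
def build_from_clause(tables, join_order_fixed=False):
    """Build a FROM clause with hints placed at the end of each line.

    tables: list of (table_expression, {hint_name: value}) in join order.
    """
    first_line = "from"
    if join_order_fixed:
        # The join order hint goes on the FROM line itself.
        first_line += " --splice-properties joinOrder=fixed"
    lines = [first_line]
    for position, (table, hints) in enumerate(tables):
        # The separating comma starts the line, so it follows the previous hint.
        prefix = "  ," if position else "   "
        line = prefix + table
        if hints:
            rendered = ", ".join(f"{name}={value}" for name, value in hints.items())
            line += " --splice-properties " + rendered
        lines.append(line)
    return "\n".join(lines)


clause = build_from_clause(
    [
        ("index_example b", {"index": "ij"}),
        ("index_example a", {"index": "null", "joinStrategy": "nestedloop"}),
    ],
    join_order_fixed=True,
)
print(clause)
```

Starting each continuation line with the comma means no separator ever lands inside a hint comment.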

Click <img class="inline" src="https://doc.splicemachine.com/zeppelin/images/zepPlayIcon.png" alt="Run Zep Paragraph Icon"> in the next paragraph  to see a full example:


In [None]:
%%sql 

explain select count(*) from
  (select a.i, a.j from --splice-properties joinOrder=fixed
    index_example b --splice-properties index=ij
    ,index_example a --splice-properties index=null, joinStrategy=nestedloop
     where a.j = 700000) x


### Examples of When to Hint

If the optimizer doesn't give you the execution plan that you were expecting, you can supply hints to guide it. You can also use hints as an experimental tool to discover what happens when a different plan is chosen: you'll typically find that the cost `explain` shows for the hinted plan is higher than the cost of the plan the optimizer chose on its own.
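
One way to run such an experiment is to capture the `explain` text for the unhinted and hinted versions of a query and compare their top-level cost estimates. The Python sketch below assumes that plan lines carry a `totalCost=` field and that the first such field, reading top down, represents the whole plan; both are assumptions, as are the helper names. The two sample plans and their cost numbers are invented placeholders, not measured results.

```python
import re

# Assumed field name for the cost estimate on a plan line.
COST_RE = re.compile(r"totalCost=([0-9.]+(?:[eE][+-]?\d+)?)")


def top_level_cost(plan_text):
    """Return the first totalCost found reading top down, or None if absent."""
    match = COST_RE.search(plan_text)
    return float(match.group(1)) if match else None


def cost_increase(optimizer_plan, hinted_plan):
    """Return hinted cost minus optimizer cost (positive means the hint costs more)."""
    return top_level_cost(hinted_plan) - top_level_cost(optimizer_plan)


# Invented placeholder plans; the numbers are NOT real explain output.
OPTIMIZER_PLAN = "ScrollInsensitive(n=4,totalCost=10.0)\n  ->  IndexScan[IJ](n=1,totalCost=4.0)"
HINTED_PLAN = "ScrollInsensitive(n=4,totalCost=25.0)\n  ->  TableScan[INDEX_EXAMPLE](n=1,totalCost=20.0)"

print(cost_increase(OPTIMIZER_PLAN, HINTED_PLAN))
```

Keep in mind that these are estimates: a plan with a higher estimated cost can still run faster in practice, which is the case worth reporting.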

If you find that your plan (after hinting) runs faster, you should report this to support@splicemachine.com so we can determine if you've found a bug.

## Where to Go Next

Our next tutorial, <a href="./2.8%20TPCH-1%20Tutorial.ipynb">Running the TPCH-1 Benchmark Queries</a>, walks you through loading the TPCH-1 benchmark data and running the benchmark queries.
