In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'

# Statistics

Database statistics are a form of metadata (data about data) that helps the Splice Machine query optimizer select the most efficient approach to running a query, based on information that has been gathered about the tables involved in the query.


## Collecting Statistics

You can collect statistics on a schema or table using the `analyze` command. As a best practice, collect statistics on a table after you create an index on it, or after you modify more than 10% of its data. Here are some of the tools available for working with statistics in Splice Machine:

* By default, statistics are collected on all columns of a table. To collect them on only some columns, use the built-in stored procedure `SYSCS_UTIL.DISABLE_COLUMN_STATISTICS(schema, table, column)` to exclude each column you want to skip.

* You can query the `SYSTABLESTATISTICS` system view to view statistics for specific tables. 

* You can query the `SYSCOLUMNSTATISTICS` system view to view statistics for specific columns in a table.

* The `ANALYZE TABLE` command collects statistics for a specific table in the current schema. It also collects statistics for any indexes associated with the table in the schema.

* The `ANALYZE SCHEMA` command collects statistics for every table in the schema. It also collects statistics for any indexes associated with every table in the schema.
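
The best practice above (re-collect after creating an index, or after modifying more than 10% of the data) can be sketched as a small decision helper. This is only an illustration: `should_analyze` and `analyze_statement` are hypothetical names, the row counts are ones you would have to track yourself, and treating an empty table as "any change counts" is an assumption rather than Splice Machine behavior.

```python
def should_analyze(total_rows, modified_rows, index_created=False, threshold=0.10):
    """Return True when the rule of thumb suggests re-collecting statistics."""
    if index_created:
        # A newly created index has no statistics of its own yet.
        return True
    if total_rows <= 0:
        # Assumption: for an empty table, any modification counts as a change.
        return modified_rows > 0
    # Re-collect once more than `threshold` (10% by default) of the rows changed.
    return modified_rows / total_rows > threshold


def analyze_statement(schema, table=None):
    """Build ANALYZE TABLE for one table, or ANALYZE SCHEMA when no table is given."""
    if table is None:
        return f"ANALYZE SCHEMA {schema}"
    return f"ANALYZE TABLE {schema}.{table}"


# 150 of 1000 rows changed (15%), so the helper recommends re-analyzing.
if should_analyze(total_rows=1000, modified_rows=150):
    print(analyze_statement("test", "index_example"))
```

The printed statement can be pasted into a `%%sql` cell; the helper itself never talks to the database.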

## Interpreting Analyze Output

The following table summarizes what you'll see in the output of the `analyze` command:

<table summary="List of columns in the output of the analyze table command.">
            <col />
            <col />
            <thead>
                <tr>
                    <th>Value</th>
                    <th>Description</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td><code>schemaName</code></td>
                    <td>The name of the schema.</td>
                </tr>
                <tr>
                    <td><code>tableName</code></td>
                    <td>The name of the table.</td>
                </tr>
                <tr>
                    <td><code>partition</code></td>
                    <td>The Splice Machine partition. We merge the statistics for all table partitions, so the partition will show as <code>-All-</code> when you specify one of the non-merged type values for the <code>statsType</code> parameter.</td>
                </tr>
                <tr>
                    <td><code>rowsCollected</code></td>
                    <td>The total number of rows collected for the table.</td>
                </tr>
                <tr>
                    <td><code>partitionSize</code></td>
                    <td>The combined size of the table's partitions.</td>
                </tr>
                <tr>
                    <td><code>statsType</code></td>
                    <td><p>The type of statistics, which is one of these values:</p>
                        <table>
                            <col />
                            <col />
                            <tbody>
                                <tr>
                                    <td>0</td>
                                    <td>Full table (not sampled) statistics that reflect the unmerged partition values.</td>
                                </tr>
                                <tr>
                                    <td>1</td>
                                    <td>Sampled statistics that reflect the unmerged partition values.</td>
                                </tr>
                                <tr>
                                    <td>2</td>
                                    <td>Full table (not sampled) statistics that reflect the table values after all partitions have been merged.</td>
                                </tr>
                                <tr>
                                    <td>3</td>
                                    <td>Sampled statistics that reflect the table values after all partitions have been merged.</td>
                                </tr>
                            </tbody>
                        </table>
                    </td>
                </tr>
                <tr>
                    <td><code>sampleFraction</code></td>
                    <td>
                        <p>The sampling fraction, expressed as a value from <code>0.0</code> to <code>1.0</code>:</p>
                        <ul>
                            <li>If <code>statsType=0</code> or <code>statsType=2</code> (full statistics), this value is not used and is shown as <code>0</code>.</li>
                            <li>If <code>statsType=1</code> or <code>statsType=3</code> (sampled statistics), this value is the fraction of rows to be sampled. A value of <code>0</code> means no rows, and a value of <code>1</code> means all rows (same as full statistics).</li>
                        </ul>
                    </td>
                </tr>
            </tbody>
        </table>
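
When reading many rows of `analyze` output, it can help to decode the `statsType` value programmatically. The sketch below is a hypothetical helper, not part of Splice Machine. It relies only on the four values listed in the table above; reading the low bit as "sampled" and the next bit as "merged" is an observation about those four values, not a documented encoding.

```python
def describe_stats_type(stats_type):
    """Decode a statsType value (0-3) into sampled/merged flags."""
    if stats_type not in (0, 1, 2, 3):
        raise ValueError(f"unexpected statsType: {stats_type!r}")
    return {
        "sampled": bool(stats_type & 1),  # 1 and 3 are sampled statistics
        "merged": bool(stats_type & 2),   # 2 and 3 reflect merged partitions
    }


for value in range(4):
    print(value, describe_stats_type(value))
```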


        
Press *Shift + Enter* in the following cell to run SQL that:

* creates a schema
* creates a table
* inserts some records
* runs the `analyze table` command to collect statistics for the table

In [None]:
%%sql 

create schema test;
set schema test;

create table index_example (i int primary key, j int);
insert into index_example values (1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9),(10,10);
insert into index_example select i+10,j+10 from index_example;
insert into index_example select i+20,j+20 from index_example;
insert into index_example select i+40,j+40 from index_example;
insert into index_example select i+80,j+80 from index_example;
insert into index_example select i+160,j+160 from index_example;
insert into index_example select i+320,j+320 from index_example;
insert into index_example select i+640,j+640 from index_example;
insert into index_example select i+1280,j+1280 from index_example;
insert into index_example select i+2560,j+2560 from index_example;
insert into index_example select i+6000,j+6000 from index_example;
insert into index_example select i+12000,j+12000 from index_example;
insert into index_example select i+24000,j+24000 from index_example;
insert into index_example select i+48000,j+48000 from index_example;
insert into index_example select i+96000,j+96000 from index_example;
insert into index_example select i+200000,j+200000 from index_example;
insert into index_example select i+400000,j+400000 from index_example;
insert into index_example select i+800000,j+800000 from index_example;

analyze table test.index_example;
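
Each `insert ... select` in the cell above copies every existing row with its key shifted by a fixed offset, so the table doubles in size each time. The following sketch replays that arithmetic in plain Python, without touching the database, to work out how many rows the table should hold and to check that no shifted key can collide with an existing primary key. The offsets are copied from the SQL; the variable names are illustrative.

```python
# Offsets used by the INSERT ... SELECT statements, in order.
offsets = [10, 20, 40, 80, 160, 320, 640, 1280, 2560,
           6000, 12000, 24000, 48000, 96000,
           200000, 400000, 800000]

row_count = 10  # rows after the initial INSERT ... VALUES (keys 1..10)
max_key = 10    # largest primary key value so far

for offset in offsets:
    # The smallest new key is 1 + offset, so the insert is collision-free
    # whenever offset >= max_key.
    if offset < max_key:
        raise ValueError(f"offset {offset} would collide with existing keys")
    row_count *= 2      # every existing row is copied once
    max_key += offset   # the largest key is shifted by the offset

print(row_count, max_key)  # prints: 1310720 1591120
```

The first number gives a figure to compare against the `rowsCollected` value in the `analyze table` output.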

## For Further Exploration and Understanding

To gain a better understanding of using statistics, try using the `ANALYZE` command as follows:

1. Create a table with some data, similar to the example in this notebook.

2. Run an `EXPLAIN` plan on a sample `SELECT` query. Notice any discrepancies, such as row counts, in the `EXPLAIN` output.

3. Run `ANALYZE` on the table.

4. Run `EXPLAIN` plan again and verify that it picked up the correct statistics.

5. Collect statistics on a specific set of columns. This may optimize for certain cases in which statistics only need to be collected for a few columns.

6. Drop statistics for the schema, and then recalculate schema statistics.
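
The steps above can be laid out as an ordered list of statements to paste into `%%sql` cells. This is a sketch under several assumptions: `exploration_statements` is a hypothetical helper; step 5 assumes, as its name suggests, that `SYSCS_UTIL.DISABLE_COLUMN_STATISTICS` turns collection off for the named column; the `SYSCS_UTIL.DROP_SCHEMA_STATISTICS` call in step 6 is an assumed procedure name that does not appear in this notebook and should be checked against the Splice Machine documentation; and passing upper-case names to the procedures assumes the identifiers were created unquoted.

```python
def exploration_statements(schema, table, query, skip_column):
    """Return the SQL for steps 2-6 as a list of strings, in execution order."""
    qualified = f"{schema}.{table}"
    # Assumption: unquoted identifiers are stored in upper case.
    s, t, c = schema.upper(), table.upper(), skip_column.upper()
    return [
        f"EXPLAIN {query}",                                                 # step 2
        f"ANALYZE TABLE {qualified}",                                       # step 3
        f"EXPLAIN {query}",                                                 # step 4
        f"CALL SYSCS_UTIL.DISABLE_COLUMN_STATISTICS('{s}', '{t}', '{c}')",  # step 5
        f"ANALYZE TABLE {qualified}",                                       # step 5, re-collect
        # ASSUMPTION: procedure name not confirmed by this notebook.
        f"CALL SYSCS_UTIL.DROP_SCHEMA_STATISTICS('{s}')",                   # step 6
        f"ANALYZE SCHEMA {schema}",                                         # step 6, recalculate
    ]


statements = exploration_statements(
    schema="test",
    table="index_example",
    query="SELECT * FROM test.index_example WHERE j > 1000",
    skip_column="j",
)
for statement in statements:
    print(statement + ";")
```

Comparing the two `EXPLAIN` outputs (before and after the first `ANALYZE TABLE`) is the point of the exercise; the example skips column `j` rather than `i` because `i` is the table's primary key.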



## Where to Go Next

The next notebook in this class, [*Bulk Data Loading*](./d.%20Bulk%20Data%20Loading.ipynb), introduces and walks you through using our Bulk HFile Import mechanism for highly performant loading of data.
