In [1]:
%%HTML
<link rel="stylesheet" href="https://doc.splicemachine.com/jupyter/css/custom.css">

In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'

# Bulk Data Loading

This notebook introduces the Splice Machine Bulk HFile Import mechanism, which you can use to import data into your database in a highly performant manner. 
We'll walk you through a simple example, after which you will be able to bulk import data into your database. 

<p class="noteIcon">We recommend using Bulk HFile importing; however, if your input might contain data errors that need checking, you must use one of our basic import procedures, <code>IMPORT_DATA</code> or <code>MERGE_DATA_FROM_FILE</code>, instead, because they perform constraint checking during ingestion.</p>

This notebook contains these sections:

* *About Bulk Import* summarizes how bulk import works.
* The *Bulk Data Import Checklist* summarizes important details about the format of the data you're importing.
* The *Automatic Bulk Data Import* method shows you how to quickly bulk import your data.
* The *Manual Bulk Data Import* allows you to control the splitting of data.
* *Using the `BULK_IMPORT_HFILE` Command* shows the syntax for importing data using the `BULK_IMPORT_HFILE` system procedure.

Lastly, we will walk you through step-by-step examples of both the automatic and manual bulk data import processes.

## About Bulk Import

Bulk importing splits a data file into temporary HFiles (store files), and then imports the data into your database by loading the generated HFiles directly into your region servers. Ideally, the process creates HFiles of equal size, which spreads the data evenly across all region servers. There are two methods for splitting the data into HFiles:

* With *Automatic Splitting*, the system samples the data in an attempt to determine the optimal split points. This is the quicker and easier method, but it does not always produce the ideal split of the data.
* With *Manual Splitting*, you provide the split points that tell the procedure where to split your data, instead of having the system sample the data to find them. This can improve the performance of the bulk data import process, but it requires you to know the data well enough to choose good split points.
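
To make the idea of sampling-based splitting more concrete, here is a toy sketch in Python. It only illustrates the general technique of picking split points from the quantiles of a random sample; it is not Splice Machine's actual algorithm, and the function name `sample_split_points` and its parameters are hypothetical.

```python
import random


def sample_split_points(keys, num_regions, sample_size=1000, seed=0):
    """Estimate split points from a random sample of keys (illustration only)."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Use the sample's quantile boundaries as the estimated split points.
    return [sample[len(sample) * i // num_regions] for i in range(1, num_regions)]


# Hypothetical key range: order ids from 1 to 999,999, split into 4 parts.
keys = range(1, 1_000_000)
splits = sample_split_points(keys, num_regions=4)
print(splits)
```

Because the split points come from a sample, they are only approximately evenly spaced, which is why automatic splitting does not always yield equally sized HFiles.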


## Bulk Data Import Checklist

When you bulk import data from flat files into your database, you need to specify a number of details about your data files to get them imported correctly. Before starting this process, please make sure that your data formats will work. The following table calls out the issues you should check:


<table class="splicezepOddEven">
    <col width="30%" />
    <col />
    <thead>
        <tr>
            <th>Data File Detail</th>
            <th>Specific Requirements</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Field delimited?</td>
            <td>The fields in each row <strong>must</strong> have delimiters between them.</td>
        </tr>
        <tr>
            <td>Rows terminated?</td>
            <td>Each row <strong>must</strong> be terminated with a newline character.</td>
        </tr>
        <tr>
            <td>Header row included?</td>
            <td>Header rows are not allowed; if your data contains one, you <strong>must</strong> remove it.</td>
        </tr>
        <tr>
            <td><code>Date</code>, <code>time</code>, <code>timestamp</code> data types</td>
            <td> If you are using <code>date</code>, <code>time</code>, and/or <code>timestamp</code> data types in the target table, you need to know how that data is represented in the flat file; your file <strong>must</strong> use a consistent representation, and you must specify that format when using the import command.</td>
        </tr>
        <tr>
            <td><code>Char</code> and <code>Varchar</code> data</td>
            <td><p>If any of your <code>char</code> or <code>varchar</code> data contains your delimiter character, you <strong>need to use</strong> a special character delimiter.</p>
                <p>If any of your <code>char</code> or <code>varchar</code> data contains newline characters, you <strong>need to set</strong> the <code>oneLineRecords</code> parameter to <code>false</code>.</p>
            </td>
        </tr>
    </tbody>
</table>


<p class="noteIcon">It is a good idea to test your import on a small amount of data before loading all of your data. This allows you to verify that your delimiters, date formatting, and other details are in proper order. That's what we'll do in this notebook.</p>
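
As one way to apply the checklist before importing, the following Python sketch inspects a small sample of a flat file. It is a rough pre-flight check under stated assumptions, not part of Splice Machine: the function `check_sample` is hypothetical, and the header test is only a heuristic that assumes real data rows contain at least one digit.

```python
import csv
import io


def check_sample(text, delimiter=",", expected_fields=None):
    """Return a list of likely problems found in a small sample of a flat file."""
    problems = []
    if text and not text.endswith("\n"):
        problems.append("last row is not terminated with a newline")
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    counts = sorted({len(row) for row in rows})
    if len(counts) > 1:
        problems.append(f"rows have differing field counts: {counts}")
    elif expected_fields is not None and counts and counts[0] != expected_fields:
        problems.append(f"expected {expected_fields} fields per row, found {counts[0]}")
    if len(rows) > 1:
        # Heuristic only: a first row without digits, followed by a row with digits.
        first_has_digit = any(ch.isdigit() for ch in "".join(rows[0]))
        second_has_digit = any(ch.isdigit() for ch in "".join(rows[1]))
        if second_has_digit and not first_has_digit:
            problems.append("first row may be a header (it contains no digits)")
    return problems


# Made-up sample rows, used only to exercise the checks.
print(check_sample("1|Alice|10.50\n2|Bob|20.00\n", delimiter="|", expected_fields=3))
print(check_sample("C_CUSTKEY|C_NAME\n1|Alice\n2|Bob", delimiter="|"))
```

An empty list means the sample passed these particular checks; it does not guarantee that the full file will import cleanly.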


## Automatic Bulk Data Import

If you choose to use the automatic bulk data import process, all you need to do is run the `BULK_IMPORT_HFILE` system procedure. Splice Machine automatically generates the split points by sampling the data that is being imported. Before you execute `BULK_IMPORT_HFILE`, be sure to create the table and any indexes in Splice Machine.

If you discover that the automatic bulk data import process is not as performant as expected, you should consider the manual bulk data import process.

## Manual Bulk Data Import

When using the manual bulk data import process, you need to pre-split the data by providing a CSV file that specifies all of the split points. Follow these steps to manually bulk import your data:

1. *Create the table in your Splice Machine database*

2. *Optionally, create any index(es) on the table*

3. *Pre-split the table*

4. *Call the `SPLIT_TABLE_OR_INDEX` system procedure for the table*

5. *Optionally, pre-split any index(es) on the table*

6. *Optionally, call the `SPLIT_TABLE_OR_INDEX` system procedure for the index(es)*

7. *Call the `BULK_IMPORT_HFILE` system procedure*

### 1. Create the table in your Splice Machine database

The table you're importing into must exist in your Splice Machine database.

### 2. Optionally, create any index(es) on the table

If you’re going to index the table you’re importing, Splice Machine recommends that you create the index prior to using bulk import. This allows the index to also be pre-split into regions, which will prevent downstream bottlenecks.

### 3. Pre-split the table

When you pre-split a table, you are essentially identifying primary key values that can horizontally split the data into roughly equal parts. For example: let's say you have some order data that you want to load. The data contains a column for order id that has values from 1 to 999,999. We can pre-split this table into 4 roughly equal parts by using the order id column and specifying the values 250000, 500000, and 750000.

Create a file that contains the following. Note that we specify three split values to create four splits:

```
250000
500000
750000
```

<div class="noteIcon"><p>Split key values can span multiple columns, in which case the column values are delimited by a pipe character (|). For example:</p>
<p>&nbsp;&nbsp;&nbsp;<code>200000|2019-01-01</code></p></div>
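
If your key is an evenly distributed integer, you can compute the split values rather than working them out by hand. The sketch below is a minimal illustration in Python; the helper `even_split_keys` is hypothetical, and it assumes key values are spread uniformly across the range, which real data may not be.

```python
def even_split_keys(low, high, parts):
    """Return parts-1 keys that divide the integer range [low, high] evenly."""
    span = high - low + 1
    keys = []
    for i in range(1, parts):
        # Ceiling division, so that each part holds at most ceil(span / parts) keys.
        offset = -((-span * i) // parts)
        keys.append(low - 1 + offset)
    return keys


order_splits = even_split_keys(1, 999_999, 4)
print(order_splits)  # [250000, 500000, 750000]

# One split key per line, as in the file shown above.
file_contents = "\n".join(str(key) for key in order_splits) + "\n"

# For a multi-column split key, join the column values with the pipe character.
print("|".join(["200000", "2019-01-01"]))  # 200000|2019-01-01
```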

### 4. Call SPLIT_TABLE_OR_INDEX on the table

Now call the `SPLIT_TABLE_OR_INDEX` system procedure. The syntax for this command looks like this:

```
call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX (
    schemaName,
    tableName,
    indexName,
    columnList | null,
    fileName,
    columnDelimiter | null,
    characterDelimiter | null,
    timestampFormat | null,
    dateFormat | null,
    timeFormat | null,
    maxBadRecords,
    badRecordDirectory | null,
    oneLineRecords | null,
    charset | null
);
```
Notice that many of the parameters allow you to apply the default value by specifying `null`.

<p class="noteNote">You can find full details about these parameters, including the default value for each, in <a href="https://doc.splicemachine.com/sqlref_sysprocs_splittable.html" target="_blank">our <code>SPLIT_TABLE_OR_INDEX</code> documentation.</a></p>

### 5. Optionally, pre-split any index(es) on the table

If you have any indexes on the table, repeat step 3 for each index.

### 6. Optionally, call SPLIT_TABLE_OR_INDEX for the index(es)

If you have any indexes on the table, repeat step 4 for each index.

### 7. Call the BULK_IMPORT_HFILE system procedure

Lastly, call the `BULK_IMPORT_HFILE` command to bulk load your data. We will go over a complete example later in this notebook.

## BULK_IMPORT_HFILE Command

Syntax for the `BULK_IMPORT_HFILE` command looks like this:
```
call SYSCS_UTIL.BULK_IMPORT_HFILE (
    schemaName,
    tableName,
    insertColumnList | null,
    fileOrDirectoryName,
    columnDelimiter | null,
    characterDelimiter | null,
    timestampFormat | null,
    dateFormat | null,
    timeFormat | null,
    maxBadRecords,
    badRecordDirectory | null,
    oneLineRecords | null,
    charset | null,
    bulkImportDirectory,
    skipSampling
);
```
Notice that many of the parameters allow you to apply the default value by specifying `null`.

<p class="noteNote">You can find full details about these parameters, including the default value for each, in <a href="https://doc.splicemachine.com/sqlref_sysprocs_importhfile.html" target="_blank">our <code>BULK_IMPORT_HFILE</code> documentation.</a></p>
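
With fifteen positional parameters, it is easy to put a value in the wrong slot. The following Python sketch shows one way to assemble the call from named values; the helpers `sql_literal` and `build_bulk_import_call` are hypothetical conveniences, not part of Splice Machine. The parameter names come from the syntax above, and the argument values are the ones used in the example later in this notebook.

```python
BULK_IMPORT_PARAMS = [
    "schemaName", "tableName", "insertColumnList", "fileOrDirectoryName",
    "columnDelimiter", "characterDelimiter", "timestampFormat", "dateFormat",
    "timeFormat", "maxBadRecords", "badRecordDirectory", "oneLineRecords",
    "charset", "bulkImportDirectory", "skipSampling",
]


def sql_literal(value):
    """Render a Python value as a SQL literal for a CALL statement."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_bulk_import_call(**params):
    """Build the CALL text; any parameter that is not supplied becomes null."""
    unknown = set(params) - set(BULK_IMPORT_PARAMS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    args = ", ".join(sql_literal(params.get(name)) for name in BULK_IMPORT_PARAMS)
    return f"call SYSCS_UTIL.BULK_IMPORT_HFILE({args});"


statement = build_bulk_import_call(
    schemaName="DEV3",
    tableName="CUSTOMER_BULK_IMPORT_EXAMPLE1",
    fileOrDirectoryName="s3a://splice-benchmark-data/flat/TPCH/100/customer",
    columnDelimiter="|",
    maxBadRecords=0,
    badRecordDirectory="/tmp",
    oneLineRecords=True,
    bulkImportDirectory="/tmp/TMP_HFILE",
    skipSampling=False,
)
print(statement)
```

This sketch only builds the statement text. Parameters without a default, such as `schemaName`, must still be supplied, because the helper does not check for them.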


## Step-by-Step Example of Automatic Bulk Data Import

This example walks you through the automatic bulk data import process one step at a time:

1. *Create a Database Schema and Table*
2. *Run the Automatic Bulk Import Data Procedure*
3. *Review Imported Data*


### 1. Create a Database Schema and Table

You can get started by running the next paragraph in this Notebook, which uses the `%%sql` magic function to create the `customer_bulk_import_example1` table in the `DEV3` schema of your database.

In [None]:
%%sql 

CREATE TABLE DEV3.CUSTOMER_BULK_IMPORT_EXAMPLE1 (
 C_CUSTKEY INTEGER NOT NULL PRIMARY KEY,
 C_NAME VARCHAR(25),
 C_ADDRESS VARCHAR(40),
 C_NATIONKEY INTEGER NOT NULL,
 C_PHONE VARCHAR(15),
 C_ACCTBAL DECIMAL(15,2),
 C_MKTSEGMENT VARCHAR(10),
 C_COMMENT VARCHAR(117)
);

### 2. Run the Automatic Bulk Import Data Procedure

Now we'll bulk import our data, which is the customer table data for TPCH-100. We'll use the automatic method for generating the split keys.

The first two rows of the data look like this:

<pre>1,1,1,0.2548,GC,BARBARBAR,amotastqqsxqk,50000.0,-10.0,10.0,1,0,xzgtmkfcc,ylkolttoyrgtoypvu,vihxvtlufla,UC,808711111,7664160137823585,2016-08-26 15:09:58.574,OE,kmmmpivhrpyzdrcxiznqsujdvhnvbtvktvzncdigrmfrnbvmuqfdrgyzsacziwobfxcwbqrctbvyyhcefmpsitdjlebpphhtihhnserrolqbjeqpggnkvfowexxgtqoglwqrhhbjdnfrmbeieubdmtkfqicfjtwwbdwbbcacqvukmxhmnpekydlesjzatacwuntakarbcbfgrvxcztgzzfcbkppjfpznjxpnanktiswszzrgxvlccrksbojbzywtziijmdwfwdxyzrwwllzcnjdbkfoxzudqfdowuwuopemcfhxhrskhsrgydkklsgzaujzbqagycvqdnkpmtligujssjierqetnwxzipykpbxbkdunzrvekjpsonlqhtvmntabhlfpzvrzzbwbuzsqdvkbebissivntknptkfpxfcgcwmymncyfqifaqecxp

2,1,1,0.2728,GC,BARBARBAR,maqolvlurgl,50000.0,-10.0,10.0,1,0,aommxpkuezwgcfyl,allgnqktgezptiui,rphjjqoevqwdpugh,ZL,115811111,8576454984888259,2016-08-26 15:09:58.575,OE,ozcazgrmfwgpdjcgeestwqmnygzqvvuxsrnnyjbxostljnlplmrdmcdbtxxoadyxeidhbffnevyierxsfgaqjuvpvgpcokbkshiqmyxlnqxwwukdcxswquhivmfkmmapjokweswzvwrpckizbjqjqlmvjjhdnjtdfxwfxfgufdjnfjudkmnbygatwvjbvgahmwwvchbsjeibfoqsfxcsbjkpxyhipsymwolcrokuhxumxkaafrgjzpxyuhkkqbijpgzcmzculivdhwjewwdepktddpswulbzwioeqvhjtorzeyqztitiagbqreoaydsqzixungqygjpiysoexyunbnuzupywuyjxavzvuccooszqbxgioulozbojsumejqoajofbzckjprjjcmmgugntnao</pre>

Import this data by running the next paragraph, which calls our `BULK_IMPORT_HFILE` command.

In [None]:
%%sql 

call SYSCS_UTIL.BULK_IMPORT_HFILE('DEV3', 'CUSTOMER_BULK_IMPORT_EXAMPLE1', null, 's3a://splice-benchmark-data/flat/TPCH/100/customer', '|', null, null, null, null, 0, '/tmp', true, null, '/tmp/TMP_HFILE', false);

<br/>
You'll notice that after you run the paragraph, you see a short report that indicates how many rows were successfully loaded and how many failed to load. In this example, all 15 million rows were successfully loaded.

You have probably also noticed that we specified `null` for most of the parameters that have defaults; here's what those parameters mean for this import:

<table class="splicezepOddEven">
    <col />
    <col />
    <thead>
        <tr>
            <th>Parameter</th>
            <th>NULL Value Details</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><code>insertColumnList</code></td>
            <td>The columns in our data exactly match the columns and ordering of columns in the table, so there's no need to specify a list.</td>
        </tr>
        <tr>
            <td><code>columnDelimiter</code></td>
            <td>Not <code>null</code> in this example: our data is pipe-delimited, so we pass <code>'|'</code> rather than accepting the default comma character (<code>,</code>).</td>
        </tr>
        <tr>
            <td><code>characterDelimiter</code></td>
            <td>None of our data fields contain the column delimiter character, so we don't need a character delimiter.</td>
        </tr>
        <tr>
            <td><code>timestampFormat</code></td>
            <td>Our table doesn't contain any timestamp columns, so there's no need to specify a format. The default timestamp format is <code>yyyy-MM-dd HH:mm:ss</code>.</td>
        </tr>
        <tr>
            <td><code>dateFormat</code></td>
            <td>Our data doesn't contain any date columns, so there's no need to specify a format.</td>
        </tr>
        <tr>
            <td><code>timeFormat</code></td>
            <td>Our data doesn't contain any time columns, so there's no need to specify a format.</td>
        </tr>
    </tbody>
</table>

### 3. Review Imported Data

Let's take a look at the data we just imported by running the next paragraph, which selects all the data from the `CUSTOMER_BULK_IMPORT_EXAMPLE1` table.

In [None]:
%%sql 

SELECT * FROM DEV3.CUSTOMER_BULK_IMPORT_EXAMPLE1;


## Step-by-Step Example of Manual Bulk Data Import

This example will walk you through the manual bulk data import process one step at a time. In these steps you will:

1. *Create a Database Schema and Table*
2. *Manually Split the Table*
3. *Run the Manual Bulk Import Data Procedure*
4. *Review Imported Data*


### 1. Create a Database Schema and Table

You can get started by running the next paragraph in this Notebook, which creates the `customer_bulk_import_example2` table in your database.

In [None]:
%%sql 

CREATE TABLE DEV3.CUSTOMER_BULK_IMPORT_EXAMPLE2 (
 C_CUSTKEY INTEGER NOT NULL PRIMARY KEY,
 C_NAME VARCHAR(25),
 C_ADDRESS VARCHAR(40),
 C_NATIONKEY INTEGER NOT NULL,
 C_PHONE VARCHAR(15),
 C_ACCTBAL DECIMAL(15,2),
 C_MKTSEGMENT VARCHAR(10),
 C_COMMENT VARCHAR(117)
);

### 2. Manually Split the Table

Now we need to create the split keys for our data. The customer data file that we are loading contains a customer key column that has values from 1 to 15,000,000. For this exercise, we want to split the data evenly into 4 HFiles, so the keys look like this:

```
3750000
7500000
11250000
```

For every N lines of split data you specify, you’ll end up with N+1 regions; for example, the above 3 splits will produce these 4 regions:

```
0 -> 3750000
3750001 -> 7500000
7500001 -> 11250000
11250001 -> (last possible key value)
```
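
To check that a set of split keys divides your data evenly, you can count how many keys land in each region. The Python sketch below does this with a binary search; the helper names are hypothetical, and the boundary handling (a key equal to a split value belongs to the lower region) simply follows the ranges as listed above rather than any documented Splice Machine behavior.

```python
from bisect import bisect_left

SPLIT_KEYS = [3750000, 7500000, 11250000]


def region_index(key, split_keys=SPLIT_KEYS):
    """Return the 0-based region a key falls in, per the ranges listed above."""
    return bisect_left(split_keys, key)


def region_counts(keys, split_keys=SPLIT_KEYS):
    """Count how many of the given keys fall into each of the N+1 regions."""
    counts = [0] * (len(split_keys) + 1)
    for key in keys:
        counts[region_index(key, split_keys)] += 1
    return counts


# Check every 1000th customer key instead of all 15 million, to keep this quick.
print(region_counts(range(1, 15_000_001, 1000)))  # [3750, 3750, 3750, 3750]
```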

We have already created this file for you, so all you need to do is run the next paragraph. This calls the `SPLIT_TABLE_OR_INDEX` procedure, passing in the file that contains the split keys as a parameter value.



In [None]:
%%sql 

call SYSCS_UTIL.SYSCS_SPLIT_TABLE_OR_INDEX(
    'DEV3',
    'CUSTOMER_BULK_IMPORT_EXAMPLE2',
    null, 
    'C_CUSTKEY',
    '/opt/data/customer-split-keys.csv',
    '|', 
    null, 
    null, 
    null,
    null, 
    -1, 
    '/tmp', 
    true, 
    null
);

### 3. Run the Manual Bulk Import Data Procedure

Now we're ready to bulk import our data. As in the previous example, we are bulk importing the customer table data for TPCH-100. This time we have pre-split the table so we will not need to sample the data before importing. The main difference in this call to `BULK_IMPORT_HFILE` is that the last parameter, `skipSampling`, is now set to `true` to indicate that we've already pre-split the data.

Import the data in this file by running the next paragraph, which calls our `BULK_IMPORT_HFILE` procedure.

In [None]:
%%sql 

call SYSCS_UTIL.BULK_IMPORT_HFILE('DEV3', 'CUSTOMER_BULK_IMPORT_EXAMPLE2', null, 's3a://splice-benchmark-data/flat/TPCH/100/customer', '|', null, null, null, null, 0, '/tmp', true, null, '/tmp/TMP_HFILE', true);

### 4. Review Imported Data

Let's take a look at the data we just imported by pressing *Shift + Enter* to run the next paragraph, which selects all of the data in the `CUSTOMER_BULK_IMPORT_EXAMPLE2` table.


In [None]:
%%sql 

SELECT * FROM DEV3.CUSTOMER_BULK_IMPORT_EXAMPLE2;

## Where to Go Next
The next notebook in this class, [*Transactions in Splice Machine*](./e.%20Transactions%20in%20Splice%20Machine.ipynb), teaches you how transactions are processed in Splice Machine.
