# Running the TPCH-1 Benchmark Queries
In this notebook, we introduce the TPC-H data and then walk you through using the TPCH-1 data set in these steps:

1. *Creating the Tables*
2. *Importing the TPCH-1 Data From S3*
3. *Compacting and Collecting Statistics*
4. *Running the TPCH-1 Queries*

At the bottom you will also see some graphical output from some queries (or slight variations).

<p class="noteIcon">The code cells in this notebook all use the <em>%%sql</em> magic, which is pre-configured to interact with Splice Machine using ANSI SQL.</p>

## About TPCH Data
TPC-H (aka *TPCH*) is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

### The TPCH Schema
Here's a view of the TPC-H schema:

<img class="fit3qtrwidth" src="https://s3.amazonaws.com/splice-examples/images/tutorials/sample-data-tpch-schema.png">



## 1. Creating the Tables

Run the next cell to create tables in your database:


In [None]:
%%sql 

DROP TABLE IF EXISTS LINEITEM; 
CREATE TABLE LINEITEM (
 L_ORDERKEY BIGINT NOT NULL,
 L_PARTKEY INTEGER NOT NULL,
 L_SUPPKEY INTEGER NOT NULL, 
 L_LINENUMBER INTEGER NOT NULL, 
 L_QUANTITY DECIMAL(15,2),
 L_EXTENDEDPRICE DECIMAL(15,2),
 L_DISCOUNT DECIMAL(15,2),
 L_TAX DECIMAL(15,2),
 L_RETURNFLAG VARCHAR(1), 
 L_LINESTATUS VARCHAR(1),
 L_SHIPDATE DATE,
 L_COMMITDATE DATE,
 L_RECEIPTDATE DATE,
 L_SHIPINSTRUCT VARCHAR(25),
 L_SHIPMODE VARCHAR(10),
 L_COMMENT VARCHAR(44),
 PRIMARY KEY(L_ORDERKEY,L_LINENUMBER)
 );
 
 DROP TABLE IF EXISTS ORDERS; 
 CREATE TABLE ORDERS (
 O_ORDERKEY BIGINT NOT NULL PRIMARY KEY,
 O_CUSTKEY INTEGER,
 O_ORDERSTATUS VARCHAR(1),
 O_TOTALPRICE DECIMAL(15,2),
 O_ORDERDATE DATE,
 O_ORDERPRIORITY VARCHAR(15),
 O_CLERK VARCHAR(15),
 O_SHIPPRIORITY INTEGER ,
 O_COMMENT VARCHAR(79)
 );
 
 DROP TABLE IF EXISTS CUSTOMER; 
 CREATE TABLE CUSTOMER (
 C_CUSTKEY INTEGER NOT NULL PRIMARY KEY,
 C_NAME VARCHAR(25),
 C_ADDRESS VARCHAR(40),
 C_NATIONKEY INTEGER NOT NULL,
 C_PHONE VARCHAR(15),
 C_ACCTBAL DECIMAL(15,2),
 C_MKTSEGMENT VARCHAR(10),
 C_COMMENT VARCHAR(117)
 );
 
 DROP TABLE IF EXISTS PARTSUPP; 
 CREATE TABLE PARTSUPP (
 PS_PARTKEY INTEGER NOT NULL ,
 PS_SUPPKEY INTEGER NOT NULL , 
 PS_AVAILQTY INTEGER,
 PS_SUPPLYCOST DECIMAL(15,2),
 PS_COMMENT VARCHAR(199),
 PRIMARY KEY(PS_PARTKEY,PS_SUPPKEY) 
 );
 
 DROP TABLE IF EXISTS SUPPLIER; 
 CREATE TABLE SUPPLIER (
 S_SUPPKEY INTEGER NOT NULL PRIMARY KEY,
 S_NAME VARCHAR(25) ,
 S_ADDRESS VARCHAR(40) ,
 S_NATIONKEY INTEGER ,
 S_PHONE VARCHAR(15) ,
 S_ACCTBAL DECIMAL(15,2),
 S_COMMENT VARCHAR(101)
 );
 
 DROP TABLE IF EXISTS PART; 
 CREATE TABLE PART (
 P_PARTKEY INTEGER NOT NULL PRIMARY KEY,
 P_NAME VARCHAR(55) ,
 P_MFGR VARCHAR(25) ,
 P_BRAND VARCHAR(10) ,
 P_TYPE VARCHAR(25) ,
 P_SIZE INTEGER ,
 P_CONTAINER VARCHAR(10) ,
 P_RETAILPRICE DECIMAL(15,2),
 P_COMMENT VARCHAR(23)
 );
 
 DROP TABLE IF EXISTS REGION; 
 CREATE TABLE REGION (
 R_REGIONKEY INTEGER NOT NULL PRIMARY KEY,
 R_NAME VARCHAR(25),
 R_COMMENT VARCHAR(152)
 );
 
 DROP TABLE IF EXISTS NATION; 
 CREATE TABLE NATION (
 N_NATIONKEY INTEGER NOT NULL,
 N_NAME VARCHAR(25),
 N_REGIONKEY INTEGER NOT NULL,
 N_COMMENT VARCHAR(152),
 PRIMARY KEY (N_NATIONKEY)
 );
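
To confirm that all eight tables now exist, you can query the system catalog. This is a minimal sketch that assumes your Splice Machine version exposes the Derby-style `SYS.SYSTABLES` and `SYS.SYSSCHEMAS` catalog tables; run it in a `%%sql` cell like the one above.

```sql
-- List the TPC-H tables created above, with the schema each one lives in.
-- Assumption: Derby-style catalog tables SYS.SYSTABLES and SYS.SYSSCHEMAS are available.
SELECT s.SCHEMANAME, t.TABLENAME
FROM SYS.SYSTABLES t
JOIN SYS.SYSSCHEMAS s ON t.SCHEMAID = s.SCHEMAID
WHERE t.TABLETYPE = 'T'
  AND t.TABLENAME IN ('LINEITEM', 'ORDERS', 'CUSTOMER', 'PARTSUPP',
                      'SUPPLIER', 'PART', 'REGION', 'NATION')
ORDER BY t.TABLENAME;
```

If the tables were created in your current schema, you should see one row per table.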

## 2. Importing the TPCH-1 Data From S3

We have pre-loaded flat files containing the TPCH-1 data into an S3 bucket to make it easy to import the data into your database. Run the next cell to import all of the data from those files:

<p class="noteNote">Importing this much data can take a few minutes; you'll see the result of each import displayed below the <code>IMPORT</code> statements as they complete.</p>



In [None]:
%%sql 

call SYSCS_UTIL.IMPORT_DATA (null, 'LINEITEM', null, 's3a://splice-benchmark-data/flat/TPCH/1/lineitem', '|', null, null, null, null, 0, '/tmp', true, null);

call SYSCS_UTIL.IMPORT_DATA (null, 'ORDERS',   null, 's3a://splice-benchmark-data/flat/TPCH/1/orders',   '|', null, null, null, null, 0, '/tmp', true, null);

call SYSCS_UTIL.IMPORT_DATA (null, 'CUSTOMER', null, 's3a://splice-benchmark-data/flat/TPCH/1/customer', '|', null, null, null, null, 0, '/tmp', true, null);

call SYSCS_UTIL.IMPORT_DATA (null, 'PARTSUPP', null, 's3a://splice-benchmark-data/flat/TPCH/1/partsupp', '|', null, null, null, null, 0, '/tmp', true, null);

call SYSCS_UTIL.IMPORT_DATA (null, 'SUPPLIER', null, 's3a://splice-benchmark-data/flat/TPCH/1/supplier', '|', null, null, null, null, 0, '/tmp', true, null);

call SYSCS_UTIL.IMPORT_DATA (null, 'PART',     null, 's3a://splice-benchmark-data/flat/TPCH/1/part',     '|', null, null, null, null, 0, '/tmp', true, null);

call SYSCS_UTIL.IMPORT_DATA (null, 'REGION',   null, 's3a://splice-benchmark-data/flat/TPCH/1/region',   '|', null, null, null, null, 0, '/tmp', true, null);

call SYSCS_UTIL.IMPORT_DATA (null, 'NATION',   null, 's3a://splice-benchmark-data/flat/TPCH/1/nation',   '|', null, null, null, null, 0, '/tmp', true, null);
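
Once the imports finish, a quick row count per table is a useful sanity check. The sketch below assumes the S3 files are standard TPC-H scale factor 1 output, for which the specification gives these table sizes: LINEITEM 6,001,215; ORDERS 1,500,000; PARTSUPP 800,000; PART 200,000; CUSTOMER 150,000; SUPPLIER 10,000; NATION 25; REGION 5.

```sql
-- Row counts for all eight TPC-H tables in one result set.
-- Compare against the scale factor 1 sizes listed above.
SELECT 'LINEITEM' AS table_name, COUNT(*) AS row_count FROM LINEITEM
UNION ALL
SELECT 'ORDERS', COUNT(*) FROM ORDERS
UNION ALL
SELECT 'CUSTOMER', COUNT(*) FROM CUSTOMER
UNION ALL
SELECT 'PARTSUPP', COUNT(*) FROM PARTSUPP
UNION ALL
SELECT 'SUPPLIER', COUNT(*) FROM SUPPLIER
UNION ALL
SELECT 'PART', COUNT(*) FROM PART
UNION ALL
SELECT 'REGION', COUNT(*) FROM REGION
UNION ALL
SELECT 'NATION', COUNT(*) FROM NATION;
```

A count that differs from the expected size usually means that some rows were rejected during that table's import.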

## 3. Compacting and Collecting Statistics

Since you've just imported a large amount of data into your database, it's a good idea to run a major compaction and collect statistics.

To do so, run each of the next two cells:


In [None]:
%%sql 

call SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_SCHEMA('REPLACE_ME_DBSCHEMA');

In [None]:
%%sql 

analyze schema REPLACE_ME_DBSCHEMA;
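
If you later reload only one table, you may not want to repeat both operations for the whole schema. The following is a sketch of the per-table equivalents, assuming that `SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE` and `analyze table` are available in your Splice Machine version; `LINEITEM` is used only as an example.

```sql
-- Compact a single table instead of the whole schema.
-- Assumption: this per-table procedure exists in your Splice Machine version.
call SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE('REPLACE_ME_DBSCHEMA', 'LINEITEM');

-- Collect statistics for that table only.
analyze table REPLACE_ME_DBSCHEMA.LINEITEM;
```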

## 4. Running the TPCH-1 Queries

We'll now run a sampling of the TPCH queries. The full set is available with our Jupyter Notebook deployment for your cluster; you can also contact us and we will provide them. Each query below is prefixed with `explain`: run the cell as-is first to view the query plan (always good practice), then remove the `explain` and rerun the cell to execute the query. It is also informative to open `localhost:4040` and monitor the queries in the Spark Console while they run.


In [None]:
%%sql 
-- QUERY 01
explain select
	l_returnflag,
	l_linestatus,
	sum(l_quantity) as sum_qty,
	sum(l_extendedprice) as sum_base_price,
	sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
	sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
	avg(l_quantity) as avg_qty,
	avg(l_extendedprice) as avg_price,
	avg(l_discount) as avg_disc,
	count(*) as count_order
from
	lineitem
where
	l_shipdate <= date({fn TIMESTAMPADD(SQL_TSI_DAY, -90, cast('1998-12-01 00:00:00' as timestamp))})
group by
	l_returnflag,
	l_linestatus
order by
	l_returnflag,
	l_linestatus
-- END OF QUERY
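
As a worked example of the explain-then-run pattern, here is a reduced form of Query 01. The `TIMESTAMPADD` expression in the original subtracts 90 days from 1998-12-01, which gives 1998-09-02, so this sketch uses that date literal directly. It assumes that `date('1998-09-02')` is accepted in the same way as the `date('1995-03-15')` literals in Query 03; the shortened select list is for illustration only and is not part of the benchmark.

```sql
-- Step 1: show the plan only; the query itself is not executed.
explain select
	l_returnflag,
	l_linestatus,
	count(*) as count_order
from
	lineitem
where
	l_shipdate <= date('1998-09-02')
group by
	l_returnflag,
	l_linestatus;

-- Step 2: the same statement with explain removed, which runs the query.
select
	l_returnflag,
	l_linestatus,
	count(*) as count_order
from
	lineitem
where
	l_shipdate <= date('1998-09-02')
group by
	l_returnflag,
	l_linestatus
order by
	l_returnflag,
	l_linestatus;
```

You can also place each statement in its own cell, which keeps the plan and the results in separate outputs.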

In [None]:
%%sql 
-- QUERY 02
explain select
	s_acctbal,
	s_name,
	n_name,
	p_partkey,
	p_mfgr,
	s_address,
	s_phone,
	s_comment
from
	part,
	supplier,
	partsupp,
	nation,
	region
where
	p_partkey = ps_partkey
	and s_suppkey = ps_suppkey
	and p_size = 15
	and p_type like '%BRASS'
	and s_nationkey = n_nationkey
	and n_regionkey = r_regionkey
	and r_name = 'EUROPE'
	and ps_supplycost = (
		select
			min(ps_supplycost)
		from
			partsupp,
			supplier,
			nation,
			region
		where
			p_partkey = ps_partkey
			and s_suppkey = ps_suppkey
			and s_nationkey = n_nationkey
			and n_regionkey = r_regionkey
			and r_name = 'EUROPE'
	)
order by
	s_acctbal desc,
	n_name,
	s_name,
	p_partkey
{limit 100}
-- END OF QUERY

In [None]:
%%sql 
-- QUERY 03
explain select
	l_orderkey,
	sum(l_extendedprice * (1 - l_discount)) as revenue,
	o_orderdate,
	o_shippriority
from
	customer,
	orders,
	lineitem
where
	c_mktsegment = 'BUILDING' 
	and c_custkey = o_custkey
	and l_orderkey = o_orderkey
	and o_orderdate < date('1995-03-15') 
	and l_shipdate > date('1995-03-15') 
group by
	l_orderkey,
	o_orderdate,
	o_shippriority
order by
	revenue desc,
	o_orderdate 
{limit 10}
-- END OF QUERY

In [None]:
%%sql 
-- QUERY 11
explain select
	ps_partkey,
	sum(ps_supplycost * ps_availqty) as value
from
	partsupp,
	supplier,
	nation
where
	ps_suppkey = s_suppkey
	and s_nationkey = n_nationkey
	and n_name = 'GERMANY'
group by
	ps_partkey having
		sum(ps_supplycost * ps_availqty) > (
			select
				sum(ps_supplycost * ps_availqty) * 0.0000010000
			from
				partsupp,
				supplier,
				nation
			where
				ps_suppkey = s_suppkey
				and s_nationkey = n_nationkey
				and n_name = 'GERMANY'
		)
order by
	value desc
-- END OF QUERY

In [None]:
%%sql 
-- QUERY 13
explain select
	c_count,
	count(*) as custdist
from
	(
		select
			c_custkey,
			count(o_orderkey)
		from
			customer left outer join orders on
				c_custkey = o_custkey
				and o_comment not like '%special%requests%'
		group by
			c_custkey
	) as c_orders (c_custkey, c_count)
group by
	c_count
order by
	custdist desc,
	c_count desc
-- END OF QUERY

In [None]:
%%sql 
-- QUERY 22
explain select
	cntrycode,
	count(*) as numcust,
	sum(c_acctbal) as totacctbal
from
	(
		select
			SUBSTR(c_phone, 1, 2) as cntrycode,
			c_acctbal
		from
			customer
		where
			SUBSTR(c_phone, 1, 2) in
				('13', '31', '23', '29', '30', '18', '17')
			and c_acctbal > (
				select
					avg(c_acctbal)
				from
					customer
				where
					c_acctbal > 0.00
					and SUBSTR(c_phone, 1, 2) in
						('13', '31', '23', '29', '30', '18', '17')
			)
			and not exists (
				select
					*
				from
					orders
				where
					o_custkey = c_custkey
			)
	) as custsale
group by
	cntrycode
order by
	cntrycode
-- END OF QUERY

## Where to Go Next

Our next tutorial, [*Common Utilities*](../2.%20Tutorials/2.9%20Common%20Utilities.ipynb), introduces some common utilities that you may find useful.
