# Query Tuning

You will analyze the effects of Compression, De-Normalization, Distribution, and Sorting on Redshift query performance.

## Result Set Caching and Execution Plan Reuse

Redshift enables a result set cache to speed up retrieval of data when it knows that the data in the underlying table has not changed. It can also re-use compiled query plans when only the predicate of the query has changed.

Execute the following query and note the query execution time. Since this is the first execution of this query, Redshift will need to compile the query as well as cache the result set.

```sql
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice)
FROM customer c
JOIN orders o on c_custkey = o_custkey
GROUP BY c_mktsegment, o_orderpriority;
```

Execute the same query a second time and note the query execution time. In the second execution, Redshift will leverage the result set cache and return immediately.

```sql
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice)
FROM customer c
JOIN orders o on c_custkey = o_custkey
GROUP BY c_mktsegment, o_orderpriority;
```
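One way to confirm that the second execution was served from the result cache is to inspect the `source_query` column of the `SVL_QLOG` system view. The query below is a minimal sketch, assuming a provisioned cluster where `SVL_QLOG` is available and your user can see its own rows. When a query is answered from the cache, `source_query` should hold the ID of the query that originally produced the result; otherwise it should be empty.

```sql
-- List the most recent queries run by the current user.
-- A non-null source_query suggests the result came from the result cache.
SELECT query,
       elapsed,        -- elapsed time in microseconds
       source_query,   -- ID of the query whose cached result was re-used
       substring       -- truncated query text
FROM svl_qlog
WHERE userid = current_user_id
ORDER BY starttime DESC
LIMIT 5;
```

Some SQL clients issue their own housekeeping queries, so the rows of interest may not be the first ones returned.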

Update data in the table and run the query again. When data in an underlying table changes, Redshift detects the change and invalidates the result set cache associated with the query. Note that the execution time is slower than the second execution but faster than the first: Redshift couldn’t re-use the cached result, but it could re-use the compiled plan.

```sql
UPDATE customer
SET c_mktsegment = c_mktsegment
WHERE c_mktsegment = 'MACHINERY';
```

```sql
VACUUM DELETE ONLY customer;
```

```sql
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice)
FROM customer c
JOIN orders o on c_custkey = o_custkey
GROUP BY c_mktsegment, o_orderpriority;
```

Execute a new query with a predicate and note the query execution time. Since this is the first execution of this query, Redshift will need to compile the query as well as cache the result set.

```sql
SELECT c_mktsegment, count(1)
FROM customer c
WHERE c_mktsegment = 'MACHINERY'
GROUP BY c_mktsegment;
```

Execute the query with a slightly different predicate and note that the execution time is faster than the prior execution, even though a very similar amount of data was scanned and aggregated. This is due to the re-use of the compile cache, because only the predicate has changed. This pattern is typical for BI reporting, where the SQL pattern stays consistent while different users retrieve data associated with different predicates.

```sql
SELECT c_mktsegment, count(1)
FROM customer c
WHERE c_mktsegment = 'BUILDING'
GROUP BY c_mktsegment;
```
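To see whether the compiled segments were actually re-used, you can look at the `SVL_COMPILE` system view. This is a sketch under the assumption that you are on a provisioned cluster and run it in the same session, immediately after the query you want to inspect, so that `pg_last_query_id()` returns that query's ID.

```sql
-- compile = 1 means the segment was compiled for this query;
-- compile = 0 means a previously compiled segment was re-used.
SELECT query,
       segment,
       locus,
       compile,
       datediff(ms, starttime, endtime) AS duration_ms
FROM svl_compile
WHERE query = pg_last_query_id()
ORDER BY segment;
```

If your SQL client runs extra statements between queries, replace `pg_last_query_id()` with the explicit query ID.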

For the remainder of this lab, turn off result set caching to ensure the runtimes are representative of an ad hoc user query.
Make sure to replace the `[Your-Redshift_User]` value in the script below.

```sql
ALTER USER [Your-Redshift_User] set enable_result_cache_for_session to false;
```
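A user-level default set with `ALTER USER ... SET` is generally picked up by new sessions rather than the one already open. If you prefer not to reconnect, a session-level alternative can be sketched as follows, assuming your client keeps a single session open across statements:

```sql
-- Disable result caching for the current session only.
SET enable_result_cache_for_session TO off;

-- Confirm the setting that is in effect for this session.
SHOW enable_result_cache_for_session;
```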

## Compression

Redshift operates on large volumes of data. One of the key principles for optimizing Redshift workloads is to reduce the amount of data stored. Instead of working on entire rows of data that contain values of different types and functions, Redshift operates in a columnar fashion. This makes it possible to apply compression algorithms to a single column at a time, so each column can be compressed independently.

The lineitem table was defined without any specified compression encodings. Instead, when the data was loaded, the encodings were automatically applied using the defaults because the COMPUPDATE PRESET clause was used in the COPY statement. Execute the following query to determine the compression used for the lineitem table.

```sql
SELECT tablename, "column", encoding
FROM pg_table_def
WHERE schemaname = 'public' AND tablename = 'lineitem';
```

Create a copy of the lineitem table, setting the ENCODING of each column to RAW, and load that table with the lineitem data.

```sql
DROP TABLE IF EXISTS lineitem_v1;
CREATE TABLE lineitem_v1 (
  L_ORDERKEY bigint NOT NULL ENCODE RAW       ,
  L_PARTKEY bigint ENCODE RAW                 ,
  L_SUPPKEY bigint ENCODE RAW                 ,
  L_LINENUMBER integer NOT NULL ENCODE RAW    ,
  L_QUANTITY decimal(18,4) ENCODE RAW         ,
  L_EXTENDEDPRICE decimal(18,4) ENCODE RAW    ,
  L_DISCOUNT decimal(18,4) ENCODE RAW         ,
  L_TAX decimal(18,4) ENCODE RAW              ,
  L_RETURNFLAG varchar(1) ENCODE RAW          ,
  L_LINESTATUS varchar(1) ENCODE RAW          ,
  L_SHIPDATE date ENCODE RAW                  ,
  L_COMMITDATE date ENCODE RAW                ,
  L_RECEIPTDATE date ENCODE RAW               ,
  L_SHIPINSTRUCT varchar(25) ENCODE RAW       ,
  L_SHIPMODE varchar(10) ENCODE RAW           ,
  L_COMMENT varchar(44) ENCODE RAW
)
distkey (L_ORDERKEY)
sortkey (L_RECEIPTDATE);

INSERT INTO lineitem_v1
SELECT * FROM lineitem;

ANALYZE lineitem_v1;
```

Redshift provides the ANALYZE COMPRESSION command. This command will determine the encoding for each column which will yield the most compression. Execute the ANALYZE COMPRESSION command on the table which was just loaded.

```sql
ANALYZE COMPRESSION lineitem_v1;
```

Note: While most columns have the same encodings, some columns will get better compression if the encoding is changed.

Analyze the storage space used by these tables, with compression (`lineitem`) and without it (`lineitem_v1`). The following query reports the storage used by each column in MB for each table, followed by the total storage for each table (repeated identically on each line). You should see roughly 70% less storage for the compressed `lineitem` table than for the uncompressed `lineitem_v1` table.

```sql
SELECT CAST(d.attname AS CHAR(50)),
    SUM(
        CASE
            WHEN CAST(d.relname AS CHAR(50)) = 'lineitem' THEN b.size_in_mb
            ELSE 0
        END
    ) AS size_in_mb,
    SUM(
        CASE
            WHEN CAST(d.relname AS CHAR(50)) = 'lineitem_v1' THEN b.size_in_mb
            ELSE 0
        END
    ) AS size_in_mb_v1,
    SUM(
        SUM(
            CASE
                WHEN CAST(d.relname AS CHAR(50)) = 'lineitem' THEN b.size_in_mb
                ELSE 0
            END
        )
    ) OVER () AS total_mb,
    SUM(
        SUM(
            CASE
                WHEN CAST(d.relname AS CHAR(50)) = 'lineitem_v1' THEN b.size_in_mb
                ELSE 0
            END
        )
    ) OVER () AS total_mb_v1
FROM (
        SELECT relname,
            attname,
            attnum - 1 as colid
        FROM pg_class t
            INNER JOIN pg_attribute a ON a.attrelid = t.oid
        WHERE t.relname LIKE 'lineitem%'
    ) d
    INNER JOIN (
        SELECT name,
            col,
            MAX(blocknum) AS size_in_mb
        FROM stv_blocklist b
            INNER JOIN stv_tbl_perm p ON b.tbl = p.id
        GROUP BY name,
            col
    ) b ON d.relname = b.name
    AND d.colid = b.col
GROUP BY d.attname
ORDER BY d.attname;
```
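For a quicker table-level comparison, the `SVV_TABLE_INFO` view reports the total size of each table in 1 MB blocks. The sketch below assumes both tables are in a schema on your search path and contain data, since `SVV_TABLE_INFO` does not list empty tables.

```sql
-- size is the table size in 1 MB data blocks; tbl_rows is the total row count.
SELECT "table",
       encoded,
       size AS size_in_mb,
       tbl_rows
FROM svv_table_info
WHERE "table" IN ('lineitem', 'lineitem_v1')
ORDER BY "table";
```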

## Distribution and Sort Keys

We want to know which states' customers purchased the most of our products.

```sql
SELECT ca_state, COUNT(o_orderkey) AS orders_count, AVG(o_totalprice) AS medium_amount, SUM(o_totalprice) AS orders_revenue
FROM orders o
INNER JOIN customer c ON o.o_custkey = c.c_customer_sk
INNER JOIN customer_address d ON c.c_current_addr_sk = d.ca_address_sk
WHERE ca_country = 'United States'
GROUP BY ca_state;
```

This query is a bit slow, so let's examine the query plan. Add `EXPLAIN` before the query and re-run it to get the query plan.
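For reference, the statement would look like the following. It is the same query as above, prefixed with `EXPLAIN`; nothing else changes, and the query itself is not executed.

```sql
-- EXPLAIN returns the plan without running the query.
EXPLAIN
SELECT ca_state, COUNT(o_orderkey) AS orders_count, AVG(o_totalprice) AS medium_amount, SUM(o_totalprice) AS orders_revenue
FROM orders o
INNER JOIN customer c ON o.o_custkey = c.c_customer_sk
INNER JOIN customer_address d ON c.c_current_addr_sk = d.ca_address_sk
WHERE ca_country = 'United States'
GROUP BY ca_state;
```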

```
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 XN HashAggregate  (cost=467747127979.01..467747127979.40 rows=51 width=27)
   ->  XN Hash Join DS_BCAST_INNER  (cost=80000084157.24..467746843568.12 rows=28441089 width=27)
         Hash Cond: ("outer".o_custkey = ("inner".c_customer_sk)::bigint)
         ->  XN Seq Scan on orders o  (cost=0.00..1500000.00 rows=150000000 width=29)
         ->  XN Hash  (cost=80000079310.46..80000079310.46 rows=1938708 width=10)
               ->  XN Hash Join DS_DIST_OUTER  (cost=14923.39..80000079310.46 rows=1938708 width=10)
                     Outer Dist Key: c.c_current_addr_sk
                     Hash Cond: ("outer".c_current_addr_sk = "inner".ca_address_sk)
                     ->  XN Seq Scan on customer c  (cost=0.00..20000.00 rows=2000000 width=8)
                     ->  XN Hash  (cost=12500.00..12500.00 rows=969354 width=10)
                           ->  XN Seq Scan on customer_address d  (cost=0.00..12500.00 rows=969354 width=10)
                                 Filter: ((ca_country)::text = 'United States'::text)
```

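Notice the join strategies `DS_BCAST_INNER` and `DS_DIST_OUTER`. `DS_BCAST_INNER` means a copy of the entire inner table is broadcast to all compute nodes, and `DS_DIST_OUTER` means the outer table is redistributed, so a lot of data is shuffled between nodes. How can we improve this query's performance?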

Execute the query below to find each table's distribution key.

```sql
select * from svv_table_info where "table" like 'customer%' or "table" like 'order%';
```

![](https://static.us-east-1.prod.workshops.aws/public/0c9aa4a3-ef98-4fc8-8579-9f6cd75acdae/static/images/lab3/table_info.png)

Notice that the distribution key is not aligned with the join condition of the query. We can change the distribution key of the `orders` table to colocate its data with the `customer` table and avoid data shuffling.

```sql
alter table orders alter distkey o_custkey;
```

The customer\_address table has no customer key, so we cannot use a distribution key to colocate it with the other tables. However, because customer\_address is relatively small, we can change its distribution style to ALL, which stores a full copy on every Redshift compute node and also reduces data shuffling.

```sql
alter table customer_address alter diststyle ALL;
```
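Before re-checking the plan, it can help to confirm that both changes are reflected in the catalog. A minimal sketch, assuming the three tables are on your search path:

```sql
-- diststyle shows values such as KEY(o_custkey), ALL, or EVEN;
-- skew_rows is the ratio of rows on the fullest slice to rows on the emptiest slice.
SELECT "table",
       diststyle,
       skew_rows
FROM svv_table_info
WHERE "table" IN ('orders', 'customer', 'customer_address')
ORDER BY "table";
```

A `skew_rows` value far above 1 for `orders` would suggest that `o_custkey` distributes rows unevenly, which is worth knowing before relying on it as a distribution key.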

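Let's examine the query plan again. The join strategies are now `DS_DIST_NONE` and `DS_DIST_ALL_NONE`, which means no data is redistributed: the `orders` and `customer` rows being joined are collocated on the same slices, and a full copy of `customer_address` is available on every node.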

```
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 XN HashAggregate  (cost=5527979.02..5527979.40 rows=51 width=27)
   ->  XN Hash Join DS_DIST_NONE  (cost=84157.24..5243568.12 rows=28441089 width=27)
         Hash Cond: ("outer".o_custkey = ("inner".c_customer_sk)::bigint)
         ->  XN Seq Scan on orders o  (cost=0.00..1500000.00 rows=150000000 width=29)
         ->  XN Hash  (cost=79310.46..79310.46 rows=1938708 width=10)
               ->  XN Hash Join DS_DIST_ALL_NONE  (cost=14923.39..79310.46 rows=1938708 width=10)
                     Hash Cond: ("outer".c_current_addr_sk = "inner".ca_address_sk)
                     ->  XN Seq Scan on customer c  (cost=0.00..20000.00 rows=2000000 width=8)
                     ->  XN Hash  (cost=12500.00..12500.00 rows=969354 width=10)
                           ->  XN Seq Scan on customer_address d  (cost=0.00..12500.00 rows=969354 width=10)
                                 Filter: ((ca_country)::text = 'United States'::text)
```

In our sales dashboard, we want to focus on orders from the most recent 12 months. Let's add an order date filter, run the query, and check how it is executed.

```sql
SELECT ca_state, COUNT(o_orderkey) AS orders_count, AVG(o_totalprice) AS medium_amount, SUM(o_totalprice) AS orders_revenue
FROM orders o
INNER JOIN customer c ON o.o_custkey = c.c_customer_sk
INNER JOIN customer_address d ON c.c_current_addr_sk = d.ca_address_sk
WHERE o_orderdate BETWEEN '1995-01-01' AND '1995-12-31' AND ca_country = 'United States'
GROUP BY ca_state;
```

Use the query below to get the execution details. Replace `<query_id>` with the actual query ID, which you can find in the query results, in the query history, or by calling [pg\_last\_query\_id](https://docs.aws.amazon.com/redshift/latest/dg/PG_LAST_QUERY_ID.html).

![](https://static.us-east-1.prod.workshops.aws/public/0c9aa4a3-ef98-4fc8-8579-9f6cd75acdae/static/images/lab3/query_id.png)
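If you take the `pg_last_query_id` route, the lookup can be sketched as below. It assumes you run it in the same session, right after the dashboard query completes, and that your client has not issued other statements in between.

```sql
-- Returns the ID of the most recently completed query in the current session.
SELECT pg_last_query_id();
```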

```sql
SELECT query,
       segment,
       step,
       label,
       is_rrscan AS rrS,
       is_diskbased AS disk,
       is_delayed_scan AS DelayS,
       MIN(start_time) AS starttime,
       MAX(end_time) AS endtime,
       datediff(ms,MIN(start_time),MAX(end_time)) AS "elapsed_msecs",
       SUM(ROWS) AS row_s,
       SUM(rows_pre_filter) AS rows_pf,
       CASE
         WHEN SUM(rows_pre_filter) = 0 THEN 100
         ELSE SUM(ROWS)::FLOAT/ SUM(rows_pre_filter)::FLOAT*100
       END AS pct_filter,
       SUM(workmem) / 1024 / 1024 AS "Memory(MB)",
       SUM(bytes) / 1024 / 1024 AS "MB_produced"
FROM svl_query_report
WHERE query = <query_id>
GROUP BY query,
         segment,
         step,
         label,
         is_rrscan,
         is_diskbased,
         is_delayed_scan
ORDER BY query,
         segment,
         step,
         label;
```

We can see that Redshift scans the whole table, which holds multiple years of data, even though the query only needs 12 months.

![](https://static.us-east-1.prod.workshops.aws/public/0c9aa4a3-ef98-4fc8-8579-9f6cd75acdae/static/images/lab3/query_report_unsorted.png)

To improve performance, we can sort the data by `o_orderdate`:

```sql
alter table orders alter COMPOUND sortkey (o_orderdate);
```
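After changing the sort key, the existing rows may not be fully re-sorted right away. One way to check progress is sketched below, assuming `orders` is on your search path:

```sql
-- sortkey1 is the first column of the sort key;
-- unsorted is the percentage of rows that are not yet sorted.
SELECT "table",
       sortkey1,
       sortkey_num,
       unsorted
FROM svv_table_info
WHERE "table" = 'orders';
```

If `unsorted` is still high, the scan reduction in the next step may be smaller than expected until the sort completes.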

Re-run the query, and Redshift now scans much less data.

![](https://static.us-east-1.prod.workshops.aws/public/0c9aa4a3-ef98-4fc8-8579-9f6cd75acdae/static/images/lab3/query_report_sorted.png)

## Takeaways

A few themes emerge when looking at query performance on Redshift:

- Think about table Distribution and Sort keys and how they affect queries.

How a table is designed affects the query plans Redshift can use, so keeping the design in mind helps you define tables and query them effectively. Knowing the distribution keys and sort keys of the tables you join helps you write better queries.

The broader point is to keep access patterns in mind when designing tables. Before making optimization decisions, think about the data volume (result set size), the query frequency, and the downstream impact of optimizing for those operations. If you don’t have a clear picture of all access patterns, start with `DISTSTYLE EVEN` and build a baseline of access patterns on your cluster to optimize against.

- Result set size does not equal table size.

Just because your table is giant doesn’t necessarily mean that the results you extract (or the data being scanned) are giant. Often in EDWs we are only pulling out the latest week's or month’s data. In terms of query optimization, think about how much data you are reading, not the total size of the table. Keep in mind how sort keys and the columnar architecture can make a query read data from tables more efficiently.

- Make big things small early.

Filter data as early as possible in your query. If you only need a month of data, don’t pass the entire table around until the very end. While this takes some effort up front when writing the query, habits like this help ensure you use the cluster more efficiently.