## Environment Setup

!mkdir -p ~/.aws && cp /content/drive/MyDrive/AWS/684947_admin ~/.aws/credentials
!chmod 600 ~/.aws/credentials
!pip install -qq awscli boto3
!aws sts get-caller-identity

### Create CloudFormation Stack

!aws cloudformation create-stack \
--stack-name RedshiftWorkshop \
--template-body file://redshift_workshop.yml \
--capabilities CAPABILITY_NAMED_IAM \
--parameters \
ParameterKey=EETeamRoleArn,ParameterValue=arn:aws:iam::684199068947:user/sparsh

{
    "StackId": "arn:aws:cloudformation:us-east-1:684199068947:stack/RedshiftWorkshop/2940aec0-6c03-11ed-bb3f-0e972e325d17"
}
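The `StackId` returned above is an ARN, so the region, account, and stack name can be recovered from it by splitting on the ARN delimiters. The following is a minimal sketch; the helper name `parse_stack_id` and the `StackRef` type are illustrative and not part of any AWS SDK.

```python
from typing import NamedTuple


class StackRef(NamedTuple):
    region: str
    account_id: str
    stack_name: str
    stack_uuid: str


def parse_stack_id(stack_id: str) -> StackRef:
    """Split a CloudFormation stack ARN into its parts (hypothetical helper)."""
    # General ARN layout: arn:partition:service:region:account-id:resource
    parts = stack_id.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "cloudformation":
        raise ValueError(f"not a CloudFormation ARN: {stack_id!r}")
    # For stacks, the resource part is stack/<name>/<uuid>
    resource = parts[5].split("/")
    if len(resource) != 3 or resource[0] != "stack":
        raise ValueError(f"unexpected resource part: {parts[5]!r}")
    return StackRef(parts[3], parts[4], resource[1], resource[2])


ref = parse_stack_id(
    "arn:aws:cloudformation:us-east-1:684199068947:"
    "stack/RedshiftWorkshop/2940aec0-6c03-11ed-bb3f-0e972e325d17"
)
print(ref.stack_name, ref.region)
```

The recovered stack name can then be passed to `aws cloudformation wait stack-create-complete --stack-name <name>` to block until the stack is ready.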


### Configure Client Tool - Query Editor V2

1. On the left-hand side, click the Redshift cluster you want to connect to.
1. A pop-up window opens.
1. If your IAM user or role has the `redshift:GetClusterCredentials` privilege, you can use the Temporary credentials option.
1. Enter the database name `dev` and the user name `awsuser`, then click Connect.

### Run Sample Query

Run the following query to list the users within the Redshift cluster.

```sql
select * from pg_user
```

## Data Loading

In this lab, you will use a set of eight tables based on the TPC Benchmark data model. You create these tables within your Redshift cluster and then load them with sample data stored in S3.

### Create Tables

Run the following CREATE TABLE statements to create tables in the database that mimic the TPC Benchmark data model.

```sql
DROP TABLE IF EXISTS partsupp;
DROP TABLE IF EXISTS lineitem;
DROP TABLE IF EXISTS supplier;
DROP TABLE IF EXISTS part;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customer;
DROP TABLE IF EXISTS nation;
DROP TABLE IF EXISTS region;

CREATE TABLE region (
  R_REGIONKEY bigint NOT NULL,
  R_NAME varchar(25),
  R_COMMENT varchar(152))
diststyle all;

CREATE TABLE nation (
  N_NATIONKEY bigint NOT NULL,
  N_NAME varchar(25),
  N_REGIONKEY bigint,
  N_COMMENT varchar(152))
diststyle all;

create table customer (
  C_CUSTKEY bigint NOT NULL,
  C_NAME varchar(25),
  C_ADDRESS varchar(40),
  C_NATIONKEY bigint,
  C_PHONE varchar(15),
  C_ACCTBAL decimal(18,4),
  C_MKTSEGMENT varchar(10),
  C_COMMENT varchar(117))
diststyle all;

create table orders (
  O_ORDERKEY bigint NOT NULL,
  O_CUSTKEY bigint,
  O_ORDERSTATUS varchar(1),
  O_TOTALPRICE decimal(18,4),
  O_ORDERDATE Date,
  O_ORDERPRIORITY varchar(15),
  O_CLERK varchar(15),
  O_SHIPPRIORITY Integer,
  O_COMMENT varchar(79))
distkey (O_ORDERKEY)
sortkey (O_ORDERDATE);

create table part (
  P_PARTKEY bigint NOT NULL,
  P_NAME varchar(55),
  P_MFGR  varchar(25),
  P_BRAND varchar(10),
  P_TYPE varchar(25),
  P_SIZE integer,
  P_CONTAINER varchar(10),
  P_RETAILPRICE decimal(18,4),
  P_COMMENT varchar(23))
diststyle all;

create table supplier (
  S_SUPPKEY bigint NOT NULL,
  S_NAME varchar(25),
  S_ADDRESS varchar(40),
  S_NATIONKEY bigint,
  S_PHONE varchar(15),
  S_ACCTBAL decimal(18,4),
  S_COMMENT varchar(101))
diststyle all;

create table lineitem (
  L_ORDERKEY bigint NOT NULL,
  L_PARTKEY bigint,
  L_SUPPKEY bigint,
  L_LINENUMBER integer NOT NULL,
  L_QUANTITY decimal(18,4),
  L_EXTENDEDPRICE decimal(18,4),
  L_DISCOUNT decimal(18,4),
  L_TAX decimal(18,4),
  L_RETURNFLAG varchar(1),
  L_LINESTATUS varchar(1),
  L_SHIPDATE date,
  L_COMMITDATE date,
  L_RECEIPTDATE date,
  L_SHIPINSTRUCT varchar(25),
  L_SHIPMODE varchar(10),
  L_COMMENT varchar(44))
distkey (L_ORDERKEY)
sortkey (L_RECEIPTDATE);

create table partsupp (
  PS_PARTKEY bigint NOT NULL,
  PS_SUPPKEY bigint NOT NULL,
  PS_AVAILQTY integer,
  PS_SUPPLYCOST decimal(18,4),
  PS_COMMENT varchar(199))
diststyle even;
```

### Loading Data

A COPY command loads large amounts of data much more efficiently than INSERT statements, and stores the data more effectively as well. Use a single COPY command to load data for one table from multiple files; Amazon Redshift then automatically loads the data in parallel. For your convenience, the sample data you will use is available in a public Amazon S3 bucket. Setting the COMPUPDATE parameter to ON makes Redshift perform a compression analysis during the load. The commands below use COMPUPDATE PRESET instead, which assigns encodings based on each column's data type without analyzing the data.

```sql
COPY region FROM 's3://redshift-immersionday-labs/data/region/region.tbl.lzo'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

COPY nation FROM 's3://redshift-immersionday-labs/data/nation/nation.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy customer from 's3://redshift-immersionday-labs/data/customer/customer.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy orders from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy part from 's3://redshift-immersionday-labs/data/part/part.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy supplier from 's3://redshift-immersionday-labs/data/supplier/supplier.json' manifest
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy lineitem from 's3://redshift-immersionday-labs/data/lineitem-part/'
iam_role default
region 'us-west-2' gzip delimiter '|' COMPUPDATE PRESET;

copy partsupp from 's3://redshift-immersionday-labs/data/partsupp/partsupp.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;
```

The estimated load times are listed below. You can check timing information for each action in the performance and query tabs of the Redshift console:

```
REGION (5 rows) - 20s
NATION (25 rows) - 2s
CUSTOMER (15M rows) - 31s
ORDERS (76M rows) - 13s
PART (20M rows) - 34s
SUPPLIER (1M rows) - 7s
LINEITEM (303M rows) - 48s
PARTSUPP (80M rows) - 12s
```

A few key takeaways from the above COPY statements:

- COMPUPDATE PRESET assigns compression using the Amazon Redshift best practices for each column's data type, without analyzing the data in the table.
- COPY for the REGION table points to a specific file (region.tbl.lzo), while COPY for most other tables points to a prefix shared by multiple files (for example, orders.tbl.).
- COPY for the SUPPLIER table points to a manifest file (supplier.json).
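Because the COPY statements above differ only in table name, S3 path, compression format, and the optional `manifest` keyword, they can be generated from a small template. The sketch below assumes trusted, hard-coded inputs (it does no quoting or validation); `build_copy` is a hypothetical helper, not a Redshift or boto3 API.

```python
def build_copy(table, s3_path, *, region="us-west-2", compression="lzop",
               manifest=False, delimiter="|", compupdate="PRESET"):
    """Render a COPY statement in the same shape as the lab's commands."""
    first_line = f"COPY {table} FROM '{s3_path}'"
    if manifest:
        # The S3 path then names a manifest file listing the data files.
        first_line += " manifest"
    return "\n".join([
        first_line,
        "iam_role default",
        f"region '{region}' {compression} delimiter '{delimiter}' "
        f"COMPUPDATE {compupdate};",
    ])


# Single file, key prefix, and manifest variants from the section above.
print(build_copy("region",
                 "s3://redshift-immersionday-labs/data/region/region.tbl.lzo"))
print(build_copy("lineitem",
                 "s3://redshift-immersionday-labs/data/lineitem-part/",
                 compression="gzip"))
print(build_copy("supplier",
                 "s3://redshift-immersionday-labs/data/supplier/supplier.json",
                 manifest=True))
```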

## Table Design and Query Tuning

You will analyze the effects of compression, de-normalization, distribution, and sorting on Redshift query performance.

### Result Set Caching and Execution Plan Reuse

Redshift enables a result set cache to speed up retrieval of data when it knows that the data in the underlying table has not changed. It can also re-use compiled query plans when only the predicate of the query has changed.

Execute the following query and note the query execution time. Since this is the first execution of this query, Redshift will need to compile the query as well as cache the result set.

```sql
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice)
FROM customer c
JOIN orders o on c_custkey = o_custkey
GROUP BY c_mktsegment, o_orderpriority;
```

Execute the same query a second time and note the query execution time. In the second execution, Redshift will leverage the result set cache and return immediately.

```sql
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice)
FROM customer c
JOIN orders o on c_custkey = o_custkey
GROUP BY c_mktsegment, o_orderpriority;
```

Update data in the table and run the query again. When data in an underlying table has changed, Redshift is aware of the change and invalidates the result set cache associated with the query. Note that the execution time is not as fast as the second execution but is faster than the first: Redshift couldn't reuse the result cache, but it could reuse the compiled plan.

```sql
UPDATE customer
SET c_mktsegment = c_mktsegment
WHERE c_mktsegment = 'MACHINERY';
```

```sql
VACUUM DELETE ONLY customer;
```

```sql
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice)
FROM customer c
JOIN orders o on c_custkey = o_custkey
GROUP BY c_mktsegment, o_orderpriority;
```

Execute a new query with a predicate and note the query execution time. Since this is the first execution of this query, Redshift will need to compile the query as well as cache the result set.

```sql
SELECT c_mktsegment, count(1)
FROM customer c
WHERE c_mktsegment = 'MACHINERY'
GROUP BY c_mktsegment;
```

Execute the query with a slightly different predicate and note that the execution time is faster than the prior execution even though a very similar amount of data was scanned and aggregated. This behavior is due to the reuse of the compile cache because only the predicate has changed. This pattern is typical for BI reporting, where the SQL pattern remains consistent while different users retrieve data associated with different predicates.

```sql
SELECT c_mktsegment, count(1)
FROM customer c
WHERE c_mktsegment = 'BUILDING'
GROUP BY c_mktsegment;
```
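The three outcomes observed in this section (full compile, result cache hit, compiled plan reuse) can be mimicked with a toy model. This is only an illustration of the behavior described above, not how Redshift implements its caches: the class, the per-table version counter, and the literal-masking rule are all assumptions made for the sketch.

```python
import re


class ToyCaches:
    """Toy model of a result cache plus a compiled-plan cache."""

    def __init__(self):
        self.table_versions = {}      # table name -> number of modifications
        self.result_cache = set()     # (query text, table versions) already answered
        self.compiled_shapes = set()  # query text with string literals masked

    @staticmethod
    def shape(sql):
        # Mask quoted literals so queries that differ only in a predicate
        # value map to the same "shape".
        return re.sub(r"'[^']*'", "?", sql.strip().lower())

    def modify(self, table):
        # Any write to a table invalidates cached results that depend on it.
        self.table_versions[table] = self.table_versions.get(table, 0) + 1

    def run(self, sql, tables):
        versions = tuple((t, self.table_versions.get(t, 0)) for t in sorted(tables))
        result_key = (sql.strip().lower(), versions)
        if result_key in self.result_cache:
            return "result cache hit"
        shape = self.shape(sql)
        outcome = "plan reused" if shape in self.compiled_shapes else "compiled"
        self.compiled_shapes.add(shape)
        self.result_cache.add(result_key)
        return outcome


caches = ToyCaches()
join_q = ("SELECT c_mktsegment, o_orderpriority, sum(o_totalprice) "
          "FROM customer c JOIN orders o ON c_custkey = o_custkey "
          "GROUP BY c_mktsegment, o_orderpriority")
print(caches.run(join_q, ["customer", "orders"]))   # first run
print(caches.run(join_q, ["customer", "orders"]))   # identical rerun
caches.modify("customer")                           # the UPDATE
print(caches.run(join_q, ["customer", "orders"]))   # rerun after the UPDATE

filter_q = ("SELECT c_mktsegment, count(1) FROM customer c "
            "WHERE c_mktsegment = '{}' GROUP BY c_mktsegment")
print(caches.run(filter_q.format("MACHINERY"), ["customer"]))
print(caches.run(filter_q.format("BUILDING"), ["customer"]))
```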

For the remainder of this lab, turn off result set caching to ensure the runtimes are representative of an ad hoc user query.
Make sure to replace the [Your-Redshift_User] value in the script below.

```sql
ALTER USER [Your-Redshift_User] set enable_result_cache_for_session to false;
```

### Compression

Redshift operates on large amounts of data. One of the key principles for optimizing Redshift workloads is to reduce the amount of data stored. Instead of working on entire rows of data, which contain values of different types and purposes, Redshift operates in a columnar fashion. This makes it possible to apply algorithms that work on a single column of data, so each column can be compressed independently.

The lineitem table was defined without any specified compression encodings. Instead, when the data was loaded, the encodings were automatically applied using the defaults because the COMPUPDATE PRESET clause was used in the COPY statement. Execute the following query to determine the compression used for the lineitem table.

```sql
SELECT tablename, "column", encoding
FROM pg_table_def
WHERE schemaname = 'public' AND tablename = 'lineitem'
```

Create a copy of lineitem table setting the ENCODING of each column to RAW and load that table with the lineitem data.

```sql
DROP TABLE IF EXISTS lineitem_v1;
CREATE TABLE lineitem_v1 (
  L_ORDERKEY bigint NOT NULL ENCODE RAW       ,
  L_PARTKEY bigint ENCODE RAW                 ,
  L_SUPPKEY bigint ENCODE RAW                 ,
  L_LINENUMBER integer NOT NULL ENCODE RAW    ,
  L_QUANTITY decimal(18,4) ENCODE RAW         ,
  L_EXTENDEDPRICE decimal(18,4) ENCODE RAW    ,
  L_DISCOUNT decimal(18,4) ENCODE RAW         ,
  L_TAX decimal(18,4) ENCODE RAW              ,
  L_RETURNFLAG varchar(1) ENCODE RAW          ,
  L_LINESTATUS varchar(1) ENCODE RAW          ,
  L_SHIPDATE date ENCODE RAW                  ,
  L_COMMITDATE date ENCODE RAW                ,
  L_RECEIPTDATE date ENCODE RAW               ,
  L_SHIPINSTRUCT varchar(25) ENCODE RAW       ,
  L_SHIPMODE varchar(10) ENCODE RAW           ,
  L_COMMENT varchar(44) ENCODE RAW
)
distkey (L_ORDERKEY)
sortkey (L_RECEIPTDATE);

INSERT INTO lineitem_v1
SELECT * FROM lineitem;

ANALYZE lineitem_v1;
```

Redshift provides the ANALYZE COMPRESSION command. This command will determine the encoding for each column which will yield the most compression. Execute the ANALYZE COMPRESSION command on the table which was just loaded.

```sql
ANALYZE COMPRESSION lineitem_v1;
```

Note: While most columns have the same encodings, some columns will get better compression if the encoding is changed.

Analyze the storage space for these tables, with and without compression. The query below returns the storage used by each column in MB for both tables, followed by the total storage for each table (repeated identically on every row). The compressed lineitem table should use about 70% less storage than the RAW-encoded lineitem_v1 table.

```sql
SELECT
  CAST(d.attname AS CHAR(50)),
  SUM(CASE WHEN CAST(d.relname AS CHAR(50)) = 'lineitem'
    THEN b.size_in_mb ELSE 0 END) AS size_in_mb,
  SUM(CASE WHEN CAST(d.relname AS CHAR(50)) = 'lineitem_v1'
    THEN b.size_in_mb ELSE 0 END) AS size_in_mb_v1,
  SUM(SUM(CASE WHEN CAST(d.relname AS CHAR(50)) = 'lineitem'
    THEN b.size_in_mb ELSE 0 END)) OVER () AS total_mb,
  SUM(SUM(CASE WHEN CAST(d.relname AS CHAR(50)) = 'lineitem_v1'
    THEN b.size_in_mb ELSE 0 END)) OVER () AS total_mb_v1
FROM (
  SELECT relname, attname, attnum - 1 as colid
  FROM pg_class t
  INNER JOIN pg_attribute a ON a.attrelid = t.oid
  WHERE t.relname LIKE 'lineitem%') d
INNER JOIN (
  SELECT name, col, MAX(blocknum) AS size_in_mb
  FROM stv_blocklist b
  INNER JOIN stv_tbl_perm p ON b.tbl=p.id
  GROUP BY name, col) b
ON d.relname = b.name AND d.colid = b.col
GROUP BY d.attname
ORDER BY d.attname;
```
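To turn the two totals from this query into a savings percentage, compare `total_mb` (lineitem, compressed) against `total_mb_v1` (lineitem_v1, RAW). The sketch below shows the arithmetic; the sizes passed in are placeholder values, not measurements from the lab.

```python
def percent_saved(compressed_mb, raw_mb):
    """Storage saved by compression, as a percentage of the RAW size."""
    if raw_mb <= 0:
        raise ValueError("raw_mb must be positive")
    return (raw_mb - compressed_mb) * 100 / raw_mb


# Placeholder totals; substitute total_mb and total_mb_v1 from your own run.
print(percent_saved(compressed_mb=12_000, raw_mb=40_000))  # 70.0
```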

## Modernize w/ Spectrum

In this lab, we show you how to query petabytes of data with Amazon Redshift and exabytes of data in your Amazon S3 data lake, without loading or moving objects. We will also demonstrate how you can leverage views that union data in direct-attached storage with data in your S3 data lake to create a single source of truth. Finally, we will demonstrate strategies for aging off old data into S3 and maintaining only the most recent data in Amazon Redshift direct-attached storage.

### Build your DDL

Create a schema workshop_das and table workshop_das.green_201601_csv for tables that will reside on the Redshift compute nodes, AKA the Redshift direct-attached storage (DAS) tables.

```sql
CREATE SCHEMA workshop_das;

CREATE TABLE workshop_das.green_201601_csv
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
DISTSTYLE EVEN
SORTKEY (passenger_count,pickup_datetime);
```

### Build your Copy Command

Build your COPY command to copy the data from Amazon S3. This dataset contains the taxi rides for the month of January 2016.

```sql
COPY workshop_das.green_201601_csv
FROM 's3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01.csv'
IAM_ROLE default
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-west-2';
```

Determine how many rows you just loaded.

```sql
select count(1) from workshop_das.green_201601_csv;
--1445285
```

### Pinpoint the Blizzard

In this month, there is a date which had the lowest number of taxi rides due to a blizzard. Can you find that date?

```sql
SELECT TO_CHAR(pickup_datetime, 'YYYY-MM-DD'),
COUNT(*)
FROM workshop_das.green_201601_csv
GROUP BY 1
ORDER BY 2;
```
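The query above groups pickups by calendar day and sorts by ride count, so the quietest day comes first. The same logic is sketched below in plain Python over a handful of made-up timestamps (they are not taken from the dataset); `quietest_day` is an illustrative helper.

```python
from collections import Counter
from datetime import datetime


def quietest_day(pickups):
    """Return (date, ride_count) for the day with the fewest pickups."""
    per_day = Counter(ts.date() for ts in pickups)
    # Break ties on the earlier date so the result is deterministic.
    return min(per_day.items(), key=lambda item: (item[1], item[0]))


# Made-up sample timestamps for illustration only.
sample = [
    datetime(2016, 1, 1, 8, 15),
    datetime(2016, 1, 1, 9, 40),
    datetime(2016, 1, 1, 23, 5),
    datetime(2016, 1, 2, 12, 0),
    datetime(2016, 1, 3, 7, 30),
    datetime(2016, 1, 3, 18, 45),
]
day, rides = quietest_day(sample)
print(day.isoformat(), rides)
```

As with the SQL version, a day with no rides at all never appears in the grouped output, so it cannot be reported as the minimum.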

### Create external schema (and DB) for Redshift Spectrum

Because external tables are stored in a shared Glue Catalog for use within the AWS ecosystem, they can be built and maintained using a few different tools, e.g. Athena, Redshift, and Glue.

Use the AWS Glue Crawler to create your external table adb305.month_12, stored in Parquet format under the location s3://serverless-analytics/canonical/NY-Pub/year=2016/month=12/.

- Navigate to the Glue Crawler page.
- Click Add Crawler, enter the crawler name NYTaxiCrawler, and click Next.
- Select Data stores as the source type and click Next.
- Choose S3 as the data store and set the include path to `s3://serverless-analytics/canonical/NY-Pub/year=2016/month=12/`.
- Create an IAM role and select it.
- Select Run on demand for the frequency.
- Click Add database and enter the database name spectrumdb.
- Accept all remaining defaults. Once the crawler has been created, click Run Crawler.
- Once the crawler has completed its run, you will see a new table in the Glue Catalog.
- Note: For the next step, make sure the Redshift role has a Glue access policy attached, and add the Glue service to the role's trust policy.
- Now that the table has been cataloged, switch back to your Redshift query editor and create an external schema adb305 pointing to your Glue Catalog database spectrumdb.
    ```sql
    CREATE external SCHEMA adb305
    FROM data catalog DATABASE 'spectrumdb'
    IAM_ROLE default
    CREATE external DATABASE if not exists;
    ```
- Run the query from the previous step using the external table instead of the direct-attached storage (DAS).
    ```sql
    SELECT TO_CHAR(pickup_datetime, 'YYYY-MM-DD'),
    COUNT(*)
    FROM adb305.month_12
    GROUP BY 1
    ORDER BY 1;
    ```

### Plan for the Future

In the final part of this lab, we will compare different strategies for maintaining more recent or HOT data within Redshift direct-attached storage and keeping older COLD data in S3, by performing the following steps:

- Allow for trailing 5 quarters reporting by adding the Q4 2015 data to Redshift DAS:
    - Anticipating that we’ll want to age off the oldest quarter every 3 months, architect your DAS table to make this easy to maintain and query.
    - Adjust your Redshift Spectrum table to exclude the Q4 2015 data.
- Develop and execute a plan to move the Q4 2015 data to S3.
    - What are the discrete steps to be performed?
    - What extra-Redshift functionality must be leveraged?
    - Simulating the extra-Redshift steps with the existing Parquet data, age off the Q4 2015 data from Redshift DAS and perform any needed steps to maintain a single version of the truth.
- There are several options to accomplish this goal.
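One possible layout, sketched here under assumptions, is to keep one DAS table per quarter and expose the trailing five quarters through a `UNION ALL` view. Aging off a quarter then means dropping one table and regenerating the view once its rows are available in S3. The table and view names below (`workshop_das.taxi_<year>q<n>` and `workshop_das.taxi_current`) are hypothetical and do not come from the lab.

```python
def trailing_quarters(year, quarter, count):
    """Return `count` (year, quarter) pairs ending at the given quarter, oldest first."""
    last = year * 4 + (quarter - 1)  # quarters counted on a single axis
    pairs = (divmod(i, 4) for i in range(last - count + 1, last + 1))
    return [(y, q + 1) for y, q in pairs]


def table_name(year, quarter):
    # Hypothetical naming scheme: one DAS table per quarter.
    return f"workshop_das.taxi_{year}q{quarter}"


def current_view_sql(year, quarter, count=5):
    """Render a view that unions the trailing `count` quarterly tables."""
    selects = [f"SELECT * FROM {table_name(y, q)}"
               for y, q in trailing_quarters(year, quarter, count)]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW workshop_das.taxi_current AS\n{body};"


print(current_view_sql(2016, 4))
```

Rolling the window forward by one quarter changes only the arguments to `current_view_sql`; queries written against the view do not need to change.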

## Spectrum Query Tuning

In this lab, we show you how to diagnose your Redshift Spectrum query performance and optimize performance by leveraging partitions, optimizing storage, and predicate pushdown.

### Querying with Amazon Redshift Spectrum

Create a star schema data model by creating dimension tables in your Redshift cluster, and fact tables in S3.

1. Create the dimension tables by running this script from your client tool.
    ```sql
    DROP TABLE IF EXISTS customer;
    CREATE TABLE customer (
    c_custkey       integer        not null sortkey,
    c_name          varchar(25)    not null,
    c_address       varchar(25)    not null,
    c_city          varchar(10)    not null,
    c_nation        varchar(15)    not null,
    c_region        varchar(12)    not null,
    c_phone         varchar(15)    not null,
    c_mktsegment    varchar(10)    not null)
    diststyle all;

    DROP TABLE IF EXISTS dwdate;
    CREATE TABLE dwdate (
    d_datekey            integer        not null sortkey,
    d_date               varchar(19)    not null,
    d_dayofweek          varchar(10)    not null,
    d_month              varchar(10)    not null,
    d_year               integer        not null,
    d_yearmonthnum       integer        not null,
    d_yearmonth          varchar(8)     not null,
    d_daynuminweek       integer        not null,
    d_daynuminmonth      integer        not null,
    d_daynuminyear       integer        not null,
    d_monthnuminyear     integer        not null,
    d_weeknuminyear      integer        not null,
    d_sellingseason      varchar(13)    not null,
    d_lastdayinweekfl    varchar(1)     not null,
    d_lastdayinmonthfl   varchar(1)     not null,
    d_holidayfl          varchar(1)     not null,
    d_weekdayfl          varchar(1)     not null)
    diststyle all;
    ```
1. Load data into your dimension tables by running the following script. You will need to provide an IAM role with the permissions to run the COPY command on your cluster; you can use the IAM role that you identified earlier. This will load the data set from S3 into your Redshift cluster. Expect the script to take a few minutes to complete. The customer and time dimensions consist of 3M records and 2,556 records, respectively.
    ```sql
    copy customer from 's3://awssampledbuswest2/ssbgz/customer'
    iam_role default
    gzip region 'us-west-2';

    copy dwdate from 's3://awssampledbuswest2/ssbgz/dwdate'
    iam_role default
    gzip region 'us-west-2';
    ```
1. Next, create an external schema that references datasets that reside outside of your Redshift cluster. Define this schema by running the following command. You will need to provide an IAM role with the permissions to read S3 data from your cluster. This should be the same role used above in the COPY command. Redshift stores the metadata that describes your external databases and schemas in the AWS Glue Data Catalog by default. Once created, you can view the schema from Glue or Athena.
    ```sql
    CREATE EXTERNAL SCHEMA clickstream
    from data catalog database 'clickstream'
    iam_role default
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    ```
1. Use the AWS Glue Crawler to create your external tables clickstream.clickstream-csv10 and clickstream.clickstream-parquet1 under the location s3://wysde-datasets/clickstream.
    - Navigate to the Glue Crawler page.
    - Click Add Crawler, enter the crawler name clickstream, and click Next.
    - Choose S3 as the data store and set the include path to s3://wysde-datasets/clickstream.
    - Click Add database, name it clickstream, and click Create. Select the database clickstream from the list.
    - Accept all remaining defaults. Once the crawler has been created, click Run Crawler.
    - Once the crawler has completed its run, you will see two new tables in the Glue Catalog.
    - For the uservisits_csv10 table, we need to correct the schema. Click Edit Schema and adjust the column names and data types as per the table below. Click Save.
    ```
    | #  | Column name    | Data type | Partition key | Comment |
    | -- | -------------- | --------- | ------------- | ------- |
    | 1  | adrevenue      | double    | -             | -       |
    | 2  | countrycode    | string    | -             | -       |
    | 3  | custkey        | bigint    | -             | -       |
    | 4  | desturl        | string    | -             | -       |
    | 5  | duration       | bigint    | -             | -       |
    | 6  | languagecode   | string    | -             | -       |
    | 7  | searchword     | string    | -             | -       |
    | 8  | sourceip       | string    | -             | -       |
    | 9  | useragent      | string    | -             | -       |
    | 10 | visitdate      | bigint    | -             | -       |
    | 11 | yearmonthkey   | bigint    | -             | -       |
    | 12 | customer       | string    | Partition (0) | -       |
    | 13 | visityearmonth | string    | Partition (1) | -       |
    ```
1. Navigate back to your SQL client tool and run the query below. This query performs a join between dimension tables in Redshift and the clickstream fact table in S3, effectively blending data from the data lake and the data warehouse. The ad revenue data originates from S3, while the customer and time attributes, such as market segment, originate from the dimension tables in Redshift.
    ```sql
    SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
    FROM clickstream.uservisits_csv10 as uv
    RIGHT OUTER JOIN customer as c ON c.c_custkey = uv.custKey
    INNER JOIN (
      SELECT DISTINCT d_yearmonthnum, (d_month||','||d_year) as prettyMonthYear
      FROM dwdate
      WHERE d_yearmonthnum >= 199410) as t ON uv.yearMonthKey = t.d_yearmonthnum
    WHERE c.c_custkey <= 5
    GROUP BY c.c_name, c.c_mktsegment, t.prettyMonthYear, uv.yearMonthKey
    ORDER BY c.c_name, c.c_mktsegment, uv.yearMonthKey ASC;
    ```


### Performance Diagnostics

There are a few utilities that provide visibility into Redshift Spectrum:

- EXPLAIN - Provides the query execution plan, which includes information about what processing is pushed down to Spectrum. Steps in the plan that include the prefix S3 are executed on Spectrum; for instance, the plan for the query above has a step “S3 Seq Scan clickstream.uservisits_csv10”, indicating that Spectrum performs a scan on S3 as part of the query execution.
- SVL_S3QUERY_SUMMARY - Stores statistics for Redshift Spectrum queries. While the execution plan presents cost estimates, this table stores actual statistics of past query runs.
- SVL_S3PARTITION - Provides details about Amazon Redshift Spectrum partition pruning at the segment and node slice level.
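When reading a long EXPLAIN output, it can help to pull out only the steps that carry the S3 prefix. The sketch below does this over plan text pasted into a string. The sample plan is hand-written and abbreviated for illustration (costs and most steps are omitted), and `s3_steps` is a hypothetical helper.

```python
def s3_steps(plan_text):
    """Return the plan steps whose name starts with the S3 prefix."""
    steps = []
    for line in plan_text.splitlines():
        step = line.strip()
        if step.startswith("->"):
            step = step[2:].strip()      # drop the tree arrow
        if step.startswith("XN "):
            step = step[3:]              # drop the XN marker
        if step.startswith("S3 "):
            # Keep the step name and target; drop any trailing cost details.
            steps.append(step.split("(")[0].strip())
    return steps


# Hand-written, abbreviated plan used only to exercise the helper.
sample_plan = """
XN HashAggregate
  ->  XN Hash Join DS_BCAST_INNER
        ->  XN S3 Query Scan uv
              ->  S3 Seq Scan clickstream.uservisits_csv10
        ->  XN Seq Scan on customer c
"""
for step in s3_steps(sample_plan):
    print(step)
```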

## ETL/ELT Strategies

This lab demonstrates how you can modernize your ETL/ELT processes using Materialized Views, Stored Procedures, and Query Scheduling to transform data within Redshift.

### Materialized Views

In a data warehouse environment, applications often need to perform complex queries on large tables, for example, SELECT statements that perform multi-table joins and aggregations on tables that contain billions of rows. Processing these queries can be expensive in terms of system resources and the time it takes to compute the results. Materialized views in Amazon Redshift provide a way to address these issues. A materialized view contains a precomputed result set, based on an SQL query over one or more base tables. Here you will learn how to create, query, and refresh a materialized view.

Let’s take an example where you want to generate a report of the top suppliers by shipped quantity. This will join large tables like lineitem and supplier and scan a large quantity of data. You might write a query like the following:

```sql
select n_name, s_name, l_shipmode,
  SUM(L_QUANTITY) Total_Qty
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where datepart(year, L_SHIPDATE) > 1997
group by 1,2,3
order by 3 desc
limit 1000;
```

This query takes time to execute and, because it scans a large amount of data, uses a lot of I/O and CPU resources. Think of a situation where multiple users in the organization need to get supplier-level metrics like the above. Each may write similarly heavy queries, which can be time-consuming and expensive operations. Instead, you can use a materialized view to store precomputed results and speed up queries that are predictable and repeated.

Amazon Redshift provides a few methods to keep materialized views up to date. You can configure the automatic refresh option to refresh materialized views when their base tables are updated. The auto refresh operation runs at a time when cluster resources are available, to minimize disruptions to other workloads.

Execute the query below to create a materialized view that aggregates the lineitem data to the supplier level. Note that the AUTO REFRESH option is set to YES and that we've included additional columns in the MV in case other users can take advantage of this aggregated data.

```sql
CREATE MATERIALIZED VIEW supplier_shipmode_agg
AUTO REFRESH YES AS
select l_suppkey, l_shipmode, datepart(year, L_SHIPDATE) l_shipyear,
  SUM(L_QUANTITY) TOTAL_QTY,
  SUM(L_DISCOUNT) TOTAL_DISCOUNT,
  SUM(L_TAX) TOTAL_TAX,
  SUM(L_EXTENDEDPRICE) TOTAL_EXTENDEDPRICE
from LINEITEM
group by 1,2,3;
```

Now execute the query below, which has been rewritten to use the materialized view. Note the difference in query execution time. You get the same results in a few seconds.

```sql
select n_name, s_name, l_shipmode,
  SUM(TOTAL_QTY) Total_Qty
from supplier_shipmode_agg
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where l_shipyear > 1997
group by 1,2,3
order by 3 desc
limit 1000;
```
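The rewritten query returns the same totals because a sum of per-group sums equals the sum over the underlying rows, as long as the filter (ship year) is one of the view's grouping columns. The sketch below checks this on a few made-up rows shaped as `(suppkey, shipmode, shipyear, quantity)`; the data and the `sum_by` helper are illustrative only.

```python
from collections import defaultdict


def sum_by(rows, key_fn, value_fn):
    """Group rows by key_fn and sum value_fn within each group."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += value_fn(row)
    return dict(totals)


# Made-up rows: (l_suppkey, l_shipmode, shipyear, l_quantity)
lineitem = [
    (1, "AIR", 1998, 10),
    (1, "AIR", 1998, 5),
    (1, "AIR", 1997, 7),
    (2, "SHIP", 1998, 3),
    (2, "AIR", 1998, 4),
]

# Stand-in for the materialized view: pre-aggregate by supplier, mode, year.
mv = [key + (qty,) for key, qty in
      sum_by(lineitem, lambda r: r[:3], lambda r: r[3]).items()]

# Original query shape: filter base rows, then aggregate.
direct = sum_by([r for r in lineitem if r[2] > 1997],
                lambda r: r[:2], lambda r: r[3])
# Rewritten query shape: filter the pre-aggregated rows, then sum the sums.
via_mv = sum_by([r for r in mv if r[2] > 1997],
                lambda r: r[:2], lambda r: r[3])

print(direct == via_mv)
```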

Another powerful feature of materialized views is automatic query rewrite. Amazon Redshift can automatically rewrite queries to use materialized views, even when the query doesn't explicitly reference a materialized view.

Now re-run your original query, which references the lineitem table, and note that it executes faster because Redshift has rewritten it to leverage the materialized view instead of the base table.

```sql
select n_name, s_name, l_shipmode, SUM(L_QUANTITY) Total_Qty
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where datepart(year, L_SHIPDATE) > 1997
group by 1,2,3
order by 3 desc
limit 1000;
```

### Stored procedures

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database. A stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn’t have to return a value. You can use the PL/pgSQL procedural language, including looping and conditional expressions, to control logical flow.

Let’s see how you can create and invoke a stored procedure in Redshift. Here our goal is to incrementally refresh the lineitem data. Execute the following query to create the lineitem staging table:

```sql
create table stage_lineitem (
  L_ORDERKEY bigint NOT NULL,
  L_PARTKEY bigint,
  L_SUPPKEY bigint,
  L_LINENUMBER integer NOT NULL,
  L_QUANTITY decimal(18,4),
  L_EXTENDEDPRICE decimal(18,4),
  L_DISCOUNT decimal(18,4),
  L_TAX decimal(18,4),
  L_RETURNFLAG varchar(1),
  L_LINESTATUS varchar(1),
  L_SHIPDATE date,
  L_COMMITDATE date,
  L_RECEIPTDATE date,
  L_SHIPINSTRUCT varchar(25),
  L_SHIPMODE varchar(10),
  L_COMMENT varchar(44));
```

Execute the script below to create a stored procedure. This stored procedure performs the following tasks:
- Truncates the staging table to clean up old data.
- Loads data into the stage_lineitem table using the COPY command.
- Merges updated records into the existing lineitem table.

```sql
CREATE OR REPLACE PROCEDURE lineitem_incremental()
AS $$
BEGIN

truncate stage_lineitem;

copy stage_lineitem from 's3://redshift-immersionday-labs/data/lineitem-part/l_orderyear=1998/l_ordermonth=8/'
iam_role default
region 'us-west-2' gzip delimiter '|' COMPUPDATE PRESET;

delete from lineitem using stage_lineitem
where stage_lineitem.l_orderkey=lineitem.l_orderkey and stage_lineitem.l_linenumber = lineitem.l_linenumber;

insert into lineitem
select * from stage_lineitem;

END;
$$ LANGUAGE plpgsql;
```
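The merge step in the procedure is a delete-then-insert: rows in lineitem whose `(l_orderkey, l_linenumber)` pair appears in the staging table are removed, and then every staged row is inserted. The sketch below mirrors that logic on in-memory rows to show the effect; the sample rows and the `incremental_load` helper are made up for illustration.

```python
def incremental_load(target, stage):
    """Mimic: DELETE FROM target USING stage ...; INSERT INTO target SELECT * FROM stage."""
    staged_keys = {(r["l_orderkey"], r["l_linenumber"]) for r in stage}
    # Delete step: keep only target rows that have no staged replacement.
    kept = [r for r in target
            if (r["l_orderkey"], r["l_linenumber"]) not in staged_keys]
    # Insert step: append every staged row (updates and brand-new rows alike).
    return kept + list(stage)


# Made-up rows for illustration.
lineitem = [
    {"l_orderkey": 1, "l_linenumber": 1, "l_quantity": 5},
    {"l_orderkey": 1, "l_linenumber": 2, "l_quantity": 6},
]
stage_lineitem = [
    {"l_orderkey": 1, "l_linenumber": 2, "l_quantity": 9},  # updated row
    {"l_orderkey": 2, "l_linenumber": 1, "l_quantity": 1},  # new row
]
for row in incremental_load(lineitem, stage_lineitem):
    print(row)
```

Because matching rows are deleted before the insert, re-running the load with the same staged data leaves the target unchanged rather than duplicating rows.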

Before you call this stored procedure, capture a metric using the materialized view. We'll compare this value after the stored procedure loads new data to demonstrate the Materialized View auto refresh capability.

```sql
select SUM(TOTAL_QTY) Total_Qty from supplier_shipmode_agg;
```

Call this stored procedure using the CALL statement. When executed, it performs an incremental load:

```sql
call lineitem_incremental();
```

### Query scheduling

Amazon Redshift allows you to run SQL queries on a recurring schedule. You can schedule time-sensitive or long-running queries, data loads and unloads, stored procedure calls, or materialized view refreshes. You can use the Amazon Redshift console or the Amazon Redshift Data API to schedule your SQL queries.

1. Navigate back to the Redshift query editor, ensure the query that calls your stored procedure is in the editor, and click the Schedule button.
1. Select an IAM role, select the cluster, and provide a database name and database user.
1. Enter a query name as well as the query text.
1. Provide values for Repeat By, Repeat every, and Repeat time. When you select “Repeat at time (UTC)”, enter a time that is a little later than the current time so you can observe the execution. Optionally, you can enable monitoring via Amazon SNS notifications. For this example, leave this Disabled.
1. Navigate to the scheduled queries tab to confirm that your scheduled query has been created.
1. Click the schedule. After a successful execution at the scheduled time, the status shows “success”.



## References

- https://catalog.us-east-1.prod.workshops.aws/workshops/9f29cdba-66c0-445e-8cbb-28a092cb5ba7/en-US