# Ongoing Load - ELT

This lab demonstrates how you can modernize your ongoing data loads by using `Stored Procedures`, `Materialized Views`, and `User-Defined Functions` to transform data within Redshift.

## Stored Procedures - Ongoing loads

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database. A stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn’t have to return a value. You can use the PL/pgSQL procedural language, including looping and conditional expressions, to control logical flow.
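
As a minimal sketch of the control-flow features mentioned above, the following procedure combines a conditional check with a loop. The procedure name `sp_list_load_prefixes` and its parameters are hypothetical and not part of this lab. It only prints messages describing partition prefixes and does not load any data.

```sql
-- Hypothetical helper for illustration only; not part of the lab.
CREATE OR REPLACE PROCEDURE sp_list_load_prefixes(order_year integer, last_month integer)
AS $$
BEGIN
  -- Conditional: reject month values outside the calendar range.
  IF last_month < 1 OR last_month > 12 THEN
    RAISE EXCEPTION 'last_month must be between 1 and 12, got %', last_month;
  END IF;

  -- Loop: m is declared implicitly as an integer by the FOR statement.
  FOR m IN 1..last_month LOOP
    RAISE INFO 'partition prefix: l_orderyear=%/l_ordermonth=%/', order_year, m;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

CALL sp_list_load_prefixes(1998, 3);
```

The call prints one informational message per month and raises an error if the month argument is out of range.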

Let’s see how you can create and invoke a stored procedure in Redshift. The goal here is to incrementally refresh the _lineitem_ data. Execute the following query to create the _lineitem_ staging table:

```sql
create table stage_lineitem (
  L_ORDERKEY bigint NOT NULL,
  L_PARTKEY bigint,
  L_SUPPKEY bigint,
  L_LINENUMBER integer NOT NULL,
  L_QUANTITY decimal(18,4),
  L_EXTENDEDPRICE decimal(18,4),
  L_DISCOUNT decimal(18,4),
  L_TAX decimal(18,4),
  L_RETURNFLAG varchar(1),
  L_LINESTATUS varchar(1),
  L_SHIPDATE date,
  L_COMMITDATE date,
  L_RECEIPTDATE date,
  L_SHIPINSTRUCT varchar(25),
  L_SHIPMODE varchar(10),
  L_COMMENT varchar(44));
```

Execute the script below to create a stored procedure. The stored procedure performs the following tasks:

1. Truncates the staging table to clean up old data.
2. Loads data into the _stage\_lineitem_ table using the COPY command.
3. Merges the staged records into the existing _lineitem_ table by deleting matching rows and then inserting the staged rows.

```sql
CREATE OR REPLACE PROCEDURE lineitem_incremental()
AS $$
BEGIN

-- 1. Clean up rows left over from a previous run.
truncate stage_lineitem;

-- 2. Stage all incoming files before touching the target table,
--    so that every staged row takes part in the merge below.
copy stage_lineitem from 's3://redshift-immersionday-labs/data/lineitem-part/l_orderyear=1998/l_ordermonth=8/'
iam_role default
region 'us-west-2' gzip delimiter '|' COMPUPDATE PRESET;

copy stage_lineitem from 's3://redshift-immersionday-labs/data/lineitem-part/l_orderyear=1998/l_ordermonth=3/'
iam_role default
region 'us-west-2' gzip delimiter '|' COMPUPDATE PRESET;

-- 3. Merge: delete target rows that match a staged row on
--    (l_orderkey, l_linenumber), then insert every staged row.
delete from lineitem using stage_lineitem
where stage_lineitem.l_orderkey = lineitem.l_orderkey
  and stage_lineitem.l_linenumber = lineitem.l_linenumber;

insert into lineitem
select * from stage_lineitem;

END;
$$ LANGUAGE plpgsql;
```

Before you call this stored procedure, capture the total number of rows in the _lineitem_ table:

```sql
SELECT count(*) FROM "dev"."public"."lineitem"; --303008217
```

Call the stored procedure using the CALL statement. When executed, it performs an incremental load:

```sql
call lineitem_incremental();
```

After you call the stored procedure, verify that the total number of rows in the _lineitem_ table has changed:

```sql
SELECT count(*) FROM "dev"."public"."lineitem"; --306911266
```
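
One way to sanity-check the merge is to confirm that no staged key appears more than once in _lineitem_ after the load. The query below is a sketch that assumes (`l_orderkey`, `l_linenumber`) uniquely identifies a row in the staged files. Under that assumption, an empty result means the merge did not introduce duplicates.

```sql
-- Assumes (l_orderkey, l_linenumber) is unique within stage_lineitem.
-- Any row returned is a staged key that exists more than once in lineitem.
select s.l_orderkey, s.l_linenumber, count(*) as copies
from stage_lineitem s
join lineitem l
  on l.l_orderkey = s.l_orderkey
 and l.l_linenumber = s.l_linenumber
group by 1, 2
having count(*) > 1
limit 10;
```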

## Materialized Views

In a data warehouse environment, applications often need to perform complex queries on large tables—for example, SELECT statements that perform multi-table joins and aggregations on the tables that contain billions of rows. Processing these queries can be expensive in terms of system resources and the time it takes to compute the results.

Materialized views in Amazon Redshift provide a way to address these issues. A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. Here you will learn how to create, query, and refresh a materialized view.

Let’s take an example where you want to generate a report of shipped quantity by nation, supplier, and ship mode. This joins large tables such as _lineitem_ and _supplier_ and scans a large amount of data. You might write a query like the following:

```sql
select n_name, s_name, l_shipmode,
  SUM(L_QUANTITY) Total_Qty
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where datepart(year, L_SHIPDATE) > 1997
group by 1,2,3
order by 3 desc
limit 1000;
```

This query takes time to execute and, because it scans a large amount of data, uses a lot of I/O and CPU resources. Consider a situation where multiple users in the organization need to get supplier-level metrics like the above. Each may write similarly heavy queries, which can be time-consuming and expensive. Instead, you can use a materialized view to store precomputed results and speed up queries that are predictable and repeated.

Amazon Redshift provides a few methods to keep materialized views up to date. You can configure the automatic refresh option to refresh a materialized view when its base tables are updated. The auto refresh operation runs at a time when cluster resources are available, to minimize disruption to other workloads.

Execute the query below to create a materialized view that aggregates the _lineitem_ data to the _supplier_ level. Note that the AUTO REFRESH option is set to YES, and that the MV includes additional columns so that other users can take advantage of this aggregated data.

```sql
CREATE MATERIALIZED VIEW supplier_shipmode_agg
AUTO REFRESH YES AS
select l_suppkey, l_shipmode, datepart(year, L_SHIPDATE) l_shipyear,
  SUM(L_QUANTITY) TOTAL_QTY,
  SUM(L_DISCOUNT) TOTAL_DISCOUNT,
  SUM(L_TAX) TOTAL_TAX,
  SUM(L_EXTENDEDPRICE) TOTAL_EXTENDEDPRICE
from LINEITEM
group by 1,2,3;
```

![](https://static.us-east-1.prod.workshops.aws/public/731bd8a6-b5d4-4391-9b55-7cfb7a6d8abc/static/images/lab13/mvnew.png)

Now execute the query below, which has been rewritten to use the materialized view. Note the difference in query execution time: you get the same results in a few seconds.

```sql
select n_name, s_name, l_shipmode,
  SUM(TOTAL_QTY) Total_Qty
from supplier_shipmode_agg
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where l_shipyear > 1997
group by 1,2,3
order by 3 desc
limit 1000;
```

Another powerful feature of materialized views is automatic query rewrite. Amazon Redshift can automatically rewrite queries to use materialized views, even when the query doesn't explicitly reference a materialized view.

Now re-run your original query, which references the _lineitem_ table. It executes faster because Redshift has rewritten it to use the materialized view instead of the base table.

```sql
select n_name, s_name, l_shipmode, SUM(L_QUANTITY) Total_Qty
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where datepart(year, L_SHIPDATE) > 1997
group by 1,2,3
order by 3 desc
limit 1000;
```

You can verify that the query rewriter is using the MV by running the query with EXPLAIN:

```sql
explain
select n_name, s_name, l_shipmode, SUM(L_QUANTITY) Total_Qty
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where datepart(year, L_SHIPDATE) > 1997
group by 1,2,3
order by 3 desc
limit 1000;
```

The output below shows that the query was automatically rewritten to run against the materialized view.

![](https://static.us-east-1.prod.workshops.aws/public/731bd8a6-b5d4-4391-9b55-7cfb7a6d8abc/static/images/lab13/Explain.png)

Try writing additional queries that can leverage your materialized view without referencing it directly, for example, _Total Extendedprice_ by _Region_.
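
As one possible starting point, the sketch below totals the extended price by region. It assumes a TPC-H style `region` table with `r_regionkey` and `r_name` columns and an `n_regionkey` column on `nation`. None of these appear elsewhere in this lab, so adjust the names to match your schema. Because the materialized view already stores `SUM(L_EXTENDEDPRICE)` per supplier, Redshift may be able to rewrite this query to use it. Run it with EXPLAIN to check.

```sql
-- Assumes a TPC-H style region table (r_regionkey, r_name)
-- and nation.n_regionkey; these names are not defined in this lab.
select r_name, sum(l_extendedprice) total_extendedprice
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
join region on n_regionkey = r_regionkey
group by 1
order by 2 desc;
```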

## Bringing it together

Let’s see whether Redshift automatically refreshes the materialized view after the data in the _lineitem_ table changes.

First, capture a metric using the materialized view. You will compare this value after the base table data changes.

```sql
select SUM(TOTAL_QTY) Total_Qty from supplier_shipmode_agg;
```

![](https://static.us-east-1.prod.workshops.aws/public/731bd8a6-b5d4-4391-9b55-7cfb7a6d8abc/static/images/lab13/TotalQtyBefore.png)

Now delete some previously loaded data from the _lineitem_ table.

```sql
delete from lineitem
using orders
where l_orderkey = o_orderkey
and datepart(year, o_orderdate) = 1998 and datepart(month, o_orderdate) = 8;
```

Run the query below against the MV and compare the result with the value you noted previously. The SUM has changed, which indicates that Redshift identified the changes in the base table and applied them to the materialized view.

Note that the materialized view refresh is _asynchronous_. For this lab, expect about 5 minutes for the data to refresh after you change the _lineitem_ table:

```sql
select SUM(TOTAL_QTY) Total_Qty from supplier_shipmode_agg;
```

![](https://static.us-east-1.prod.workshops.aws/public/731bd8a6-b5d4-4391-9b55-7cfb7a6d8abc/static/images/lab13/TotalQtyAfter.png)
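
If you prefer not to wait for the automatic refresh, you can refresh the view manually and inspect its state. The sketch below assumes a provisioned cluster where the `STV_MV_INFO` and `SVL_MV_REFRESH_STATUS` system views are available. On other deployment types the monitoring views may differ.

```sql
-- Refresh the materialized view on demand.
refresh materialized view supplier_shipmode_agg;

-- is_stale = 'f' suggests the view reflects the latest base table changes.
select name, is_stale, state
from stv_mv_info
where name = 'supplier_shipmode_agg';

-- Recent refresh attempts and their outcome.
select mv_name, status, starttime, endtime
from svl_mv_refresh_status
where mv_name = 'supplier_shipmode_agg'
order by starttime desc
limit 5;
```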

## User-Defined Functions

Redshift supports scalar user-defined functions (UDFs) written as either a SQL SELECT clause or a Python program. The following example creates a Python function that compares two numbers and returns the larger value:

```sql
create function f_py_greater (a float, b float)
  returns float
stable
as $$
  if a > b:
    return a
  return b
$$ language plpythonu;

select f_py_greater (l_extendedprice, l_discount) from lineitem limit 10;
```

![](https://static.us-east-1.prod.workshops.aws/public/731bd8a6-b5d4-4391-9b55-7cfb7a6d8abc/static/images/lab13/pythonUdf.png)

The following example creates a SQL function that compares two numbers and returns the larger value:

```sql
create function f_sql_greater (float, float)
  returns float
stable
as $$
  select case when $1 > $2 then $1
    else $2
  end
$$ language sql;  

select f_sql_greater (l_extendedprice, l_discount) from lineitem limit 10;
```

![](https://static.us-east-1.prod.workshops.aws/public/731bd8a6-b5d4-4391-9b55-7cfb7a6d8abc/static/images/lab13/sqlUdf.png)
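
As a quick check, you can compare the UDF result against the built-in `GREATEST` function side by side. This is only an illustrative sketch that assumes `f_sql_greater` was created as shown above. The two result columns may differ in data type (float versus decimal) but should agree in value.

```sql
-- Compare the SQL UDF with the built-in GREATEST function.
select l_extendedprice, l_discount,
  f_sql_greater (l_extendedprice, l_discount) as udf_result,
  greatest (l_extendedprice, l_discount) as builtin_result
from lineitem
limit 10;
```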