# Implement a slowly changing dimension

- Learn how to create a type 2 dimension table by adding slowly changing tracking columns
- Dataset: customer table from the TPC-H benchmark dataset
- Cover best practices and anti-patterns
- Walk through the extract, transform, and load (ETL) merge technique that implements the slowly changing dimension (SCD) process

## Business Requirement

Our data warehouse system must meet the following criteria:
1. Track changes to the Address and Phone fields only (type 2, with start and end timestamps)
2. Keep all other attributes up to date without creating history records (type 1)
3. The source system provides incremental delta change records

Note: If your source systems cannot provide delta change records and instead provide a full load every time, the data warehouse needs logic to identify the changed records. For such a workload, we build a second uniquely identifiable value by applying a hash function to all the dimension columns, and compare that value to identify the changed rows.
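
The full-load case described in the note can be sketched in a few lines of Python. This is only an illustration of the idea, not the warehouse implementation: the helper names `row_hash` and `detect_changes`, the sample rows, the `\x1f` separator, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib

def row_hash(row, columns):
    """Hash the chosen columns of a row.

    A separator keeps column boundaries unambiguous; None is treated as an
    empty string here, which is a simplification.
    """
    joined = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def detect_changes(previous, current, key, columns):
    """Compare two full loads and return (new_keys, changed_keys)."""
    prev_hashes = {r[key]: row_hash(r, columns) for r in previous}
    new_keys, changed_keys = [], []
    for r in current:
        if r[key] not in prev_hashes:
            new_keys.append(r[key])
        elif prev_hashes[r[key]] != row_hash(r, columns):
            changed_keys.append(r[key])
    return new_keys, changed_keys

# Hypothetical sample data standing in for two consecutive full loads.
previous_load = [
    {"custkey": 1, "address": "1 Main St", "phone": "11-111"},
    {"custkey": 2, "address": "2 First St", "phone": "22-222"},
]
current_load = [
    {"custkey": 1, "address": "1 Main St", "phone": "11-111"},    # unchanged
    {"custkey": 2, "address": "9 Oak Ave", "phone": "22-222"},    # address changed
    {"custkey": 3, "address": "3 Second St", "phone": "33-333"},  # new customer
]

new_keys, changed_keys = detect_changes(
    previous_load, current_load, "custkey", ["address", "phone"])
print(new_keys, changed_keys)
```

Comparing one hash per row keeps the check independent of how many columns the dimension has.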

## Process flow

The following figure is the process flow diagram:

![](https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/09/27/BDB-1563-image001.png)

## SCD Type 2 Fields

The `customer_dim` table is enriched with the SCD type 2 fields listed below:

- CUSTOMER_DIM_ID: The surrogate key. A new surrogate key is generated for every record. Adding this field to fact tables, along with the CUSTOMER_ID field (`c_custkey` in this lab), enables reporting on historical dimension records.
- RECORD_START_TS: Sets the “valid from” timestamp for the dimension record.
- RECORD_END_TS: Sets the “valid to” timestamp for the dimension record.
- RECORD_OPEN_FLAG: Marks the record as active (1) or closed (0), that is, a historic record. The code in this lab names this column `record_active_flag`.
- RECORD_HASH_VALUE: Stores the hash value of the SCD tracking fields. The code in this lab names this column `track_hash`.

The following diagram shows how a regular dimensional table is converted to a type 2 dimension table:

![](https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/09/27/BDB-1563-image003.png)
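
To make the role of the start and end timestamps concrete, here is a small Python sketch of a point-in-time lookup over two versions of one customer. The sample rows and the helper `version_as_of` are hypothetical. The column names follow the `customer_dim` table created later in this lab.

```python
from datetime import datetime

HIGH_DATE = datetime(2999, 12, 31)

# Two versions of one customer: the first was closed when the address changed.
versions = [
    {"customer_dim_id": 1, "c_custkey": 42, "c_address": "1 Main St",
     "record_start_ts": datetime(1970, 1, 1),
     "record_end_ts": datetime(2023, 3, 18, 13, 59, 59),
     "record_active_flag": 0},
    {"customer_dim_id": 2, "c_custkey": 42, "c_address": "9 Oak Ave",
     "record_start_ts": datetime(2023, 3, 18, 14, 0, 0),
     "record_end_ts": HIGH_DATE,
     "record_active_flag": 1},
]

def version_as_of(rows, custkey, ts):
    """Return the dimension row that was valid for custkey at timestamp ts, or None."""
    for row in rows:
        if row["c_custkey"] == custkey and row["record_start_ts"] <= ts <= row["record_end_ts"]:
            return row
    return None

old = version_as_of(versions, 42, datetime(2022, 6, 1))
now = version_as_of(versions, 42, datetime(2024, 1, 1))
print(old["customer_dim_id"], old["c_address"])
print(now["customer_dim_id"], now["c_address"])
```

A fact row that stores the surrogate key of the version valid at transaction time can therefore report against the address the customer had at that time.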

## Connect

In [3]:
import json
from urllib.parse import quote_plus

import boto3


def get_secret(secret_name, region_name="us-east-1"):
    """Fetch a secret from AWS Secrets Manager and parse its JSON payload."""
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response['SecretString'])


# The secret is expected to hold the Redshift user name, password, and host.
creds = get_secret("wysde")
USERNAME = creds["REDSHIFT_USERNAME"]
PASSWORD = creds["REDSHIFT_PASSWORD"]
HOST = creds["REDSHIFT_HOST"]
PORT = 5439
DATABASE = 'dev'

# URL-encode the credentials so special characters (such as @ or /) don't break the URL.
conn_str = 'postgresql://{0}:{1}@{2}:{3}/{4}'.format(
    quote_plus(USERNAME), quote_plus(PASSWORD), HOST, PORT, DATABASE)

# Load the SQL magic extension first so that the SqlMagic options below are recognized.
%reload_ext sql
%config SqlMagic.autopandas=True
%config SqlMagic.displaycon=False
%config SqlMagic.feedback=False
%config SqlMagic.displaylimit=5
%sql {conn_str}

## Load base data

In [11]:
%%sql

create table customer (
    C_CUSTKEY bigint NOT NULL,
    C_NAME varchar(25),
    C_ADDRESS varchar(40),
    C_NATIONKEY bigint,
    C_PHONE varchar(15),
    C_ACCTBAL decimal(18, 4),
    C_MKTSEGMENT varchar(10),
    C_COMMENT varchar(117)
)

In [12]:
%%sql

copy customer from 's3://redshift-immersionday-labs/data/customer/customer.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET

In [13]:
%%sql

SELECT * FROM customer LIMIT 10

| | c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6311801 | Customer#006311801 | VrmU5RxRNn8 | 21 | 31-104-944-8699 | 2413.53 | FURNITURE | es boost blithely around the furiously final d... |
| 1 | 6311802 | Customer#006311802 | aXvjh03GDnUut0V9WoZeVjh3yVB1S | 17 | 27-157-138-1415 | 5090.79 | BUILDING | eodolites sleep blithely among the carefully f... |
| 2 | 6311803 | Customer#006311803 | HTay0kADod3Xq90QPc1Fdt4fxP1grX4P9 | 15 | 25-227-195-3637 | 872.87 | HOUSEHOLD | sly close pinto beans. quickly fi |
| 3 | 6311804 | Customer#006311804 | PblW0,,cTsmMQtG,qJjEy I3WBikPbOarXnV | 24 | 34-374-858-9310 | 5234.92 | AUTOMOBILE | ronic requests use. dugouts wake boldly accord... |
| 4 | 6311805 | Customer#006311805 | T oDKg9UdmKKDpqdfcr2BwFFNP,k1yUKZL | 9 | 19-651-246-7989 | 3316.61 | MACHINERY | st slyly according to the fluffily |
| 5 | 6311806 | Customer#006311806 | nz02UJu5vZXZ5olPzfbFW,4 | 13 | 23-714-563-9575 | 3205.69 | FURNITURE | slyly regular theodolites haggle furiously qui... |
| 6 | 6311807 | Customer#006311807 | 7K40OWDMr | 2 | 12-194-718-6815 | 8962.84 | BUILDING | o beans are boldly even accounts. blithely fin... |
| 7 | 6311808 | Customer#006311808 | e6nn7miwN76sKXvCBSd81mT4N34tDvrIgDTmZ7 | 6 | 16-955-350-5118 | 2546.52 | HOUSEHOLD | lithely express deposits. theodolites haggle. ... |
| 8 | 6311809 | Customer#006311809 | NzKCHqPqVTk9ToQ8aj0oxA4CRSE | 2 | 12-615-612-1349 | 7344.52 | AUTOMOBILE | o beans. accounts sleep among the |
| 9 | 6311810 | Customer#006311810 | sJvoZtPpOiEbq1UGIB2xHSy4OjEdxJrll | 0 | 10-474-998-2718 | 6310.57 | BUILDING | foxes. blithely bold packages serve about the... |


## Implementing SCD

The first step to implement SCD for a given dimension table is to create the dimension table with SCD tracking attributes. For example, record effective date, record end date, and active record indicator are typically added to track whether a record is active. These fields are collectively referred to as the SCD fields (as shown in the following code).

These SCD fields are added so that when a field is changed, for example, a customer’s address, the existing record in the dimension table is updated to indicate that the record isn’t active and a new record is inserted with an active flag. This way, every change to an SCD field is stored in the table and business users can run queries to see historical performance of a dimension for a given change that is being tracked.

We also introduce the following:

- Record hash value to easily track whether the tracked customer fields have changed. In this lab, the hash column is computed over the type 2 tracking fields (address and phone). Comparing this single hash column replaces comparing multiple individual columns to determine whether the data has changed.
- Record insert and update timestamps to capture when the actual dimension row was added to the table and updated.
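
The single-hash comparison can be illustrated outside the warehouse with a plain Python implementation of 64-bit FNV-1a. This is a sketch under assumptions: Amazon Redshift's `FNV_HASH` may differ in variant, seed, and signedness, so the values below are not expected to match it, and `track_hash` with its `sep` parameter is a hypothetical helper. The example also shows one caveat of hashing concatenated columns without a separator.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # 64-bit FNV prime

def fnv1a_64(text):
    """64-bit FNV-1a hash of a UTF-8 string."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def track_hash(address, phone, sep=""):
    """Single hash over the tracked fields, optionally joined with a separator."""
    return fnv1a_64(address + sep + phone)

# Plain concatenation cannot tell these two rows apart: both hash the string "abc".
same = track_hash("ab", "c") == track_hash("a", "bc")

# A separator keeps the column boundary inside the hashed string.
distinct = track_hash("ab", "c", sep="|") != track_hash("a", "bc", sep="|")

print(same, distinct)
```

With real addresses and phone numbers such boundary collisions are unlikely, but a separator removes the ambiguity at almost no cost.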

The following code shows the SCD fields added to the dimension table:

In [4]:
%%sql
-- create the customer dimension table with added hash column and timestamp tracking
DROP TABLE IF EXISTS customer_dim CASCADE;
CREATE TABLE customer_dim ( 
customer_dim_id     bigint GENERATED BY DEFAULT AS IDENTITY(1, 1), 
c_custkey           bigint distkey, 
c_name              CHARACTER VARYING(30), 
c_address            CHARACTER VARYING(50), 
c_nationkey         integer, 
c_phone             CHARACTER VARYING(20), 
c_acctbal           NUMERIC(12, 2), 
c_mktsegment        CHARACTER VARYING(10), 
c_comment           CHARACTER VARYING(120), 
track_hash          bigint, 
record_start_ts     timestamp WITHOUT time ZONE 
                    DEFAULT '1970-01-01 00:00:00'::timestamp WITHOUT time ZONE, 
record_end_ts       timestamp WITHOUT time ZONE 
                    DEFAULT '2999-12-31 00:00:00'::timestamp WITHOUT time ZONE, 
record_active_flag  SMALLINT DEFAULT 1, 
record_upd_ts       timestamp WITHOUT time ZONE DEFAULT NULL, 
record_insert_ts    timestamp WITHOUT time ZONE DEFAULT current_timestamp 
);

Next, we perform the initial load to the dimension table. Because this is the first time that the dimension records are loaded, the SCD tracking attributes are set to active. The record start date is set to a low date, like 1970-01-01 (the column default in this lab), or to a business date value that reflects when a particular change became effective. The record end date is set to a high date, like 2999-12-31, and the active record indicator is set to 1, indicating that these rows are active.

After the initial load is complete, we create a staging table to load the incremental changes that come from the source system. This table acts as a temporary holding place for incoming records. To identify whether a change has occurred for a given record, we left outer join the customer staging table to the customer dimension table on the customer primary key (`c_custkey`). We use a left outer join because we want to flag matching records for the update process and unmatched records for the insert process. Left outer joining the staging table to the customer dimension table projects both matched and unmatched rows. Matched rows are treated as updates and unmatched rows are treated as inserts.
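
The matched and unmatched outcomes of the left outer join can be sketched in Python with a dictionary lookup. The function `classify` and the sample hashes are hypothetical. The flag names `new_ind` and `track_ind` mirror the ones used by the staging query later in this lab.

```python
def classify(staged_rows, active_dim_hash):
    """Mimic the left outer join: flag each staged row as new, tracked change, or neither.

    active_dim_hash maps a customer key to the track hash of its active dimension row.
    """
    result = []
    for row in staged_rows:
        dim_hash = active_dim_hash.get(row["custkey"])  # None means no match in the dimension
        new_ind = 1 if dim_hash is None else 0
        track_ind = 1 if dim_hash is not None and dim_hash != row["track_hash"] else 0
        result.append({**row, "new_ind": new_ind, "track_ind": track_ind})
    return result

active_dim_hash = {1: "h-a", 2: "h-b"}
staged = [
    {"custkey": 1, "track_hash": "h-a"},   # matched, hash unchanged: type 1 candidate
    {"custkey": 2, "track_hash": "h-b2"},  # matched, hash changed: type 2 change
    {"custkey": 3, "track_hash": "h-c"},   # unmatched: insert
]

flags = [(r["custkey"], r["new_ind"], r["track_ind"])
         for r in classify(staged, active_dim_hash)]
print(flags)
```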

In [14]:
%%sql

-- populate dimension and hash based off of (address and phone) as drivers
insert into customer_dim (
        c_custkey,
        c_name,
        c_address,
        c_nationkey,
        c_phone,
        c_acctbal,
        c_mktsegment,
        c_comment,
        track_hash
    )
select c.c_custkey,
    c.c_name,
    cast(c.c_address as varchar(50)) as c_address,
    c.c_nationkey,
    cast(c.c_phone as varchar(20)) as c_phone,
    c.c_acctbal,
    c.c_mktsegment,
    c.c_comment,
    FNV_HASH(c_address + c_phone) as track_hash
from customer c;

In [15]:
%%sql

SELECT * FROM customer_dim LIMIT 10

| | customer_dim_id | c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment | track_hash | record_start_ts | record_end_ts | record_active_flag | record_upd_ts | record_insert_ts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 27694 | 2203256 | Customer#002203256 | ZT2ba3LVzq2nupeaAJO3CEtd5a8oAiWs4zlSN | 1 | 11-763-592-2993 | 4478.86 | FURNITURE | ainst the silent, even dependencies sleep car | 3433493595239974706 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 1 | 31150 | 2203283 | Customer#002203283 | 2biXlEwb8aaY 6FbwSiAWMwIYYPjWVMPRjIx9q | 24 | 34-203-538-3012 | 4222.54 | AUTOMOBILE | across the regularly ironic requests cajole furi | -3934258293891854987 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 2 | 36398 | 503239 | Customer#000503239 | Dl8T9mcAlO88aqDvXqpIr7,NJbw | 21 | 31-257-940-7280 | 196.26 | FURNITURE | accounts are furiously along the dolphins | -3674726255705408143 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 3 | 36654 | 503241 | Customer#000503241 | HEpKWNlgUs9N4N, | 19 | 29-897-671-4269 | 4812.85 | HOUSEHOLD | ckages wake. quickly express pinto beans haggl... | 9046707033814721907 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 4 | 62126 | 1103275 | Customer#001103275 | ywam H4yxTbHVVIDnAabHprr8wXN | 10 | 20-529-270-4443 | 5269.32 | MACHINERY | lar packages sleep slyly alongside | -5067628428269017701 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 5 | 63790 | 7103205 | Customer#007103205 | svStAlTtn6 | 5 | 15-893-593-2694 | -323.79 | HOUSEHOLD | kages affix above the busily even theodolites-... | -4717140670429443461 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 6 | 66734 | 7103228 | Customer#007103228 | QEazsAkI2neEOb82EbCk | 9 | 19-344-910-8177 | 3893.96 | AUTOMOBILE | special accounts can wake fluffily blithel | -4286438037836278535 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 7 | 91310 | 9103252 | Customer#009103252 | TKqiLBSfJPJ1e9tyf 2wS CW | 2 | 12-567-218-3140 | 7251.79 | FURNITURE | ing deposits. slyly regular foxes wake. ironic... | 1168700261402724394 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 8 | 92078 | 9103258 | Customer#009103258 | EI9pGJ,K5qYS2,rjFNb9PKY Sr | 14 | 24-992-870-5299 | 7205.84 | AUTOMOBILE | ? accounts wake quickly bold platelets. pendin... | 8441044717337203623 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |
| 9 | 98734 | 12703228 | Customer#012703228 | A,eWnmWou4,g4 8Fu6s | 2 | 12-807-511-2489 | 8048.9 | HOUSEHOLD | ways among the fluffily regular requests. quic... | 4939834901605315270 | 1970-01-01 | 2999-12-31 00:00:00 | 1 | | 2023-03-18 13:53:11.750472 |


Setting the end timestamp to a high date (31 December 2999 here) and the record active flag to 1 marks all current records as active rows in the customer dimension.
As data changes arrive in the data warehouse, we “close out” the open customer record by setting its end timestamp and changing its record active flag to 0. We then create a new version of the customer record that uses the effective timestamp of the change as the start timestamp, the high date as the end timestamp, and 1 as the record active flag.

Let’s create a dummy source table that stands in for data from the OLTP system. It simulates the following:
- Change to Phone for 200 customers – type 2
- Change to Address for 500 customers – type 2
- Change to Name for 150 customers – type 1
- Three completely new customers added to the OLTP system

In [16]:
%%sql

-- create a source table from OLTP system dummy data
-- where phone has changed for 200 customers
drop table if exists src_customer;
create table src_customer distkey(custkey) as
select c_custkey as custkey,
    c_name as name,
    c_address as address,
    c_nationkey as nationkey,
    c_phone + '0' as phone,
    c_acctbal as acctbal,
    c_mktsegment as mktsegment,
    cast('Update 1' as varchar(120)) as comment,
    getdate() as effective_dt
from customer_dim
limit 200;

In [17]:
%%sql

-- address has changed for 500 customers
insert into src_customer
select c_custkey as custkey,
    c_name as name,
    c_address + 'Z' as address,
    c_nationkey as nationkey,
    c_phone as phone,
    c_acctbal as acctbal,
    c_mktsegment as mktsegment,
    'Update 2' as comment,
    getdate() as effective_dt
from customer_dim
where c_custkey not in (
        select custkey
        from src_customer
    )
limit 500;

In [18]:
%%sql

-- name has changed for 150 customers
insert into src_customer
select c_custkey as custkey,
    c_name + '.' as name,
    c_address as address,
    c_nationkey as nationkey,
    c_phone as phone,
    c_acctbal as acctbal,
    c_mktsegment as mktsegment,
    'Update 3' as comment,
    getdate() as effective_dt
from customer_dim
where c_custkey not in (
        select custkey
        from src_customer
    )
limit 150;

In [19]:
%%sql

-- also let's add three completely new customers
insert into src_customer
values (
        15000001,
        'Customer#15000001',
        '1 Main Street, Somewhere, Zip 10001',
        7,
        '56-451-856-9532',
        8612.30,
        'BUILDING',
        'NC#1',
        getdate()
    ),
    (
        15000002,
        'Customer#15000002',
        '2 First Street, Herethere, Zip 50006',
        15,
        '65-125-586-3249',
        65112.30,
        'FURNITURE',
        'NC#2',
        getdate()
    ),
    (
        15000003,
        'Customer#15000003',
        '3 Second Street, Righthere, Zip 90003',
        22,
        '65-125-586-3249',
        1315827.30,
        'MACHINERY',
        'NC#3',
        getdate()
    );

In [20]:
%%sql

-- check source count
select count(1) from src_customer;

| | count |
| --- | --- |
| 0 | 853 |


Let’s prepare the staging table for the ETL load. This staging table becomes the basis of the subsequent merge operation. Here we identify changed customer records based on the hash column, as well as newly added customer records.

In [21]:
%%sql

-- create the staging table for ETL load
drop table if exists stg_customer;
create table stg_customer distkey(stg_custkey) as with stg as (
    select custkey as stg_custkey,
        name as stg_name,
        address as stg_address,
        nationkey as stg_nationkey,
        phone as stg_phone,
        acctbal as stg_acctbal,
        mktsegment as stg_mktsegment,
        comment as stg_comment,
        effective_dt as stg_effective_dt,
        FNV_HASH(address + phone) as stg_track_hash
    from src_customer
)
select s.*,
    case
        when c.c_custkey is null then 1
        else 0
    end new_ind,
    case
        when c.c_custkey is not null
        and s.stg_track_hash <> track_hash then 1
        else 0
    end track_ind
from stg s
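    -- join only the active version so that later runs ignore closed history rows
    left join customer_dim c on s.stg_custkey = c.c_custkey
        and c.record_active_flag = 1;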

In [22]:
%%sql

-- check staging count
select new_ind,
    track_ind,
    count(1)
from stg_customer
group by 1,
    2;

| | new_ind | track_ind | count |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 150 |
| 1 | 1 | 0 | 3 |
| 2 | 0 | 1 | 700 |


The result shows 3 net new records, 700 records that changed in the tracking fields (address and phone), and 150 records that changed only in non-tracking fields. The next step shows how to apply these changes to the 15 million row customer dimension table.

For rows that aren’t matched (that is, completely new records flagged with `new_ind = 1`), the rows are inserted into the dimension table with SCD tracking attributes set as new and the active record flag set to 1.

For matched records, two possibilities could happen:

- **SCD type 2 field has changed** – For this category, we use a two-step process to retain the previous version of the customer record and also record the latest version of the customer record for type 2 fields in our data warehouse. This satisfies our first business requirement. The steps are as follows:
    - **Step 1** – Mark the existing record in the target customer dimension table as inactive by setting the record end date to one second before the effective timestamp of the change and the active record indicator to 0.
    - **Step 2** – Insert the new rows from the customer staging table into the customer target table with the record start date set to the effective timestamp of the change, the record end date set to a high date, and the record active flag set to 1.
- **SCD type 1 field has changed** – For this category, the row in the customer target table is updated directly with the latest values from the staging table. While doing so, we don’t change the record start date, record end date, or active flag. Only the record update timestamp is refreshed. With this step, we retain only the latest version of the record for type 1 fields in our data warehouse. This satisfies our second business requirement.

Apply changes to the dimension table with the following code:

In [23]:
%%sql

-- merge changes to dim customer
begin transaction;

-- close the current type 2 active record based on staging data where the change indicator is 1
update customer_dim
set record_end_ts = stg_effective_dt - interval '1 second',
    record_active_flag = 0,
    record_upd_ts = current_timestamp 
from stg_customer
where c_custkey = stg_custkey
and record_end_ts = '2999-12-31'
and track_ind = 1;

-- create latest version type 2 active record from staging data
-- this includes Changed + New records
insert into customer_dim
   (c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,
    c_mktsegment,c_comment,track_hash,record_start_ts,record_end_ts, 
    record_active_flag, record_insert_ts, record_upd_ts) 
select
    stg_custkey, stg_name, stg_address, stg_nationkey, stg_phone,
    stg_acctbal, stg_mktsegment, stg_comment, stg_track_hash, 
    stg_effective_dt as record_start_ts, '2999-12-31' as record_end_ts,
    1 as record_active_flag, current_timestamp as record_insert_ts, 
    current_timestamp as record_upd_ts
from
    stg_customer
where
    track_ind = 1 or new_ind = 1;

-- update type 1 current active records for non-tracking attributes
update customer_dim
set c_name = stg_name,
    c_nationkey = stg_nationkey,
    c_acctbal = stg_acctbal,
    c_mktsegment = stg_mktsegment,
    c_comment = stg_comment,
    record_upd_ts = current_timestamp
from
    stg_customer
where
    c_custkey = stg_custkey
and record_end_ts = '2999-12-31'
and track_ind = 0 and new_ind = 0;

-- end merge operation
commit transaction;
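
The same three statements can be tried end to end on a tiny data set without a cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Amazon Redshift, so the SQL is adapted: correlated subqueries replace `UPDATE ... FROM`, timestamps are stored as text, and the tables are reduced to a few columns. The three sample customers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer_dim (
    customer_dim_id integer primary key autoincrement,
    c_custkey integer, c_name text, c_address text,
    record_start_ts text default '1970-01-01 00:00:00',
    record_end_ts text default '2999-12-31 00:00:00',
    record_active_flag integer default 1
);
create table stg_customer (
    stg_custkey integer, stg_name text, stg_address text,
    stg_effective_dt text, new_ind integer, track_ind integer
);
""")
conn.executemany(
    "insert into customer_dim (c_custkey, c_name, c_address) values (?, ?, ?)",
    [(1, "Ann", "1 Main St"), (2, "Bob", "2 First St")])
conn.executemany(
    "insert into stg_customer values (?, ?, ?, ?, ?, ?)",
    [(1, "Ann", "9 Oak Ave", "2023-03-18 14:00:00", 0, 1),     # type 2: address changed
     (2, "Bob.", "2 First St", "2023-03-18 14:00:00", 0, 0),   # type 1: name changed
     (3, "Cy", "3 Second St", "2023-03-18 14:00:00", 1, 0)])   # new customer
conn.commit()

with conn:  # one transaction, like the begin/commit block above
    # Step 1: close the active version of customers whose tracked fields changed.
    conn.execute("""
        update customer_dim
        set record_end_ts = (select datetime(s.stg_effective_dt, '-1 second')
                             from stg_customer s
                             where s.stg_custkey = customer_dim.c_custkey),
            record_active_flag = 0
        where record_active_flag = 1
          and c_custkey in (select stg_custkey from stg_customer where track_ind = 1)
    """)
    # Step 2: insert new versions for changed and new customers.
    conn.execute("""
        insert into customer_dim (c_custkey, c_name, c_address, record_start_ts)
        select stg_custkey, stg_name, stg_address, stg_effective_dt
        from stg_customer
        where track_ind = 1 or new_ind = 1
    """)
    # Type 1: overwrite the non-tracked attribute on the active row, no new version.
    conn.execute("""
        update customer_dim
        set c_name = (select s.stg_name from stg_customer s
                      where s.stg_custkey = customer_dim.c_custkey)
        where record_active_flag = 1
          and c_custkey in (select stg_custkey from stg_customer
                            where track_ind = 0 and new_ind = 0)
    """)

rows = conn.execute("""
    select c_custkey, c_name, c_address, record_end_ts, record_active_flag
    from customer_dim
    order by c_custkey, customer_dim_id
""").fetchall()
for row in rows:
    print(row)
```

Customer 1 ends up with a closed and an active version, customer 2 keeps a single row with the corrected name, and customer 3 appears as a new active row.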

## Best practices

The Amazon Redshift cloud data warehouse can process a large number of updates efficiently. To achieve this, have a staging table that shares the same table definition as your target dimension table. Then, as shown in the earlier code snippet, you can join the staging and the target dimension tables and perform the update and insert in a transaction block. This operation performs bulk updates and inserts on the target table, yielding good performance.

The Amazon Redshift shared-nothing architecture typically performs at its peak when operations can be run by each node independently with minimal data movement between nodes. Creating the target customer dimension table and the intermediate staging table with matching distribution keys provides the best performance because rows with the same key are stored on the same node, so the join and merge operations can be completed within each node.
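
The effect of matching distribution keys can be pictured with a toy model in Python. This is an illustration only: the way Amazon Redshift hashes distribution keys is internal, and `NUM_NODES`, `node_for`, and the CRC32 hash are assumptions chosen to make the example deterministic.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(key, num_nodes=NUM_NODES):
    """Assign a distribution-key value to a node with a deterministic hash (illustrative only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

dim_keys = [101, 102, 103, 104, 105]  # keys stored in the dimension table
stg_keys = [102, 104, 999]            # keys arriving in the staging table

dim_placement = {k: node_for(k) for k in dim_keys}
stg_placement = {k: node_for(k) for k in stg_keys}

# Every staged key that also exists in the dimension is placed on the same node,
# so the join for that key needs no data movement between nodes.
co_located = all(stg_placement[k] == dim_placement[k]
                 for k in stg_keys if k in dim_placement)
print(co_located)
```

Because both tables apply the same function to the same key, matching rows always meet on one node, whatever the actual hash is.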

## Anti-patterns

You could also implement SCD by comparing dimension records row by row using cursors and then updating or inserting each row in the target table. Although this method works on smaller tables, for larger tables we advise using the bulk operations method explained in this lab.