# How to join a fact table and a type 2 dimension (SCD2) table

SCD2 stands for slowly changing dimension type 2. In this type, we create a new row for each change to an existing record in the corresponding transaction table. Each row in the SCD2 dimension table will have row effective and row expiration datetime columns to denote the range within which that row represents the state of the data.

Assume that our e-commerce application's database has a user table to store user attributes.

| user | column definition |
| ---- | ----------------- |
| user_id | unique identifier for a user (primary key) |
| first_name | first name of the user |
| last_name | last name of the user |
| address | address of the user |
| zipcode | zipcode of the user |
| created_datetime | Date and time when the user was created |
| updated_datetime | Date and time when this user data was last modified |

Let’s say that a user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb has a zip code of 10027 until 2020-12-31, after which the user changes address and the new zip code is 10012. In our application table, this means that the record with user_id=b0cc9fde-a29a-498e-824f-e52399991beb now has a zip code of 10012, and the old value has been overwritten.
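As a minimal sketch of that in-place change, the statements below use a hypothetical scratch table, `app_user_sketch`, that stands in for the application's user table with only three of its columns. The change timestamp is borrowed from the sample data later in this article, and the syntax assumes a PostgreSQL-style dialect.

```sql
-- Hypothetical scratch table standing in for the application's user table.
CREATE TEMPORARY TABLE app_user_sketch (
    user_id VARCHAR(40) PRIMARY KEY,
    zipcode VARCHAR(10),
    updated_datetime TIMESTAMP
);
INSERT INTO app_user_sketch (user_id, zipcode, updated_datetime)
VALUES (
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        '10027',
        '2020-01-01 10:00:00'
    );
-- The address change overwrites the existing row in place.
UPDATE app_user_sketch
SET zipcode = '10012',
    updated_datetime = '2021-01-01 18:00:00'
WHERE user_id = 'b0cc9fde-a29a-498e-824f-e52399991beb';
-- Still one row for this user; the previous zip code is no longer stored.
SELECT user_id,
    zipcode,
    updated_datetime
FROM app_user_sketch;
```

After the update, nothing in this table records that the user ever had the zip code 10027, which is why the warehouse has to keep its own history.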

In the data warehouse, we need to ensure that we have access to historical data to run historical analyses. For example, if an end user wants to check how many of our users lived in the zip code 10027 during December 2020, we need to make sure that the user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb is counted towards 10027 and not 10012, because that is where that user lived as of December 2020.

To store this historically changing data in our data warehouse, we create a new row for each change. Let’s consider the user_dim (user dimension) table in our data warehouse. It corresponds to the user table in our application database.

| user\_dim                 | column definition                                                                      |
| ------------------------- | -------------------------------------------------------------------------------------- |
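| user\_key                 | surrogate key identifying a single version row of a user                               |
| user\_id                  | identifier of the user (natural key); shared by all version rows of that user          |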
| first\_name               | first name of the user                                                                 |
| last\_name                | last name of the user                                                                  |
| address                   | address of the user                                                                    |
| zipcode                   | zipcode of the user                                                                    |
| created\_datetime         | Date and time when the user was created                                                |
| updated\_datetime         | Date and time when this user data was last modified                                    |
| row\_effective\_datetime  | The date and time from which this row is the effective data for this user\_id          |
| row\_expiration\_datetime | The date and time until which this row is the effective data for this user\_id         |
| current\_row\_indicator   | Indicator denoting if this row is the most current state of the data for this user\_id |

We do not overwrite the values in an existing row; instead, we add a new record to indicate the new state. We also set the old record’s row_expiration_datetime to the moment just before the change and set its current_row_indicator to expired.

For the new record, we set row_effective_datetime to the updated_datetime, set row_expiration_datetime to an end-of-time value, and set current_row_indicator to current to reflect the current state.
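The two steps above can be sketched as an UPDATE followed by an INSERT inside one transaction. This is only an illustration under stated assumptions: `user_dim_sketch` is a hypothetical scratch table with a trimmed-down set of columns, the syntax is PostgreSQL-style, and the one-second offset for the old row's expiration mirrors the sample data used later in this article rather than any fixed rule.

```sql
-- Hypothetical scratch table; a trimmed-down version of user_dim.
CREATE TEMPORARY TABLE user_dim_sketch (
    user_key BIGINT,
    user_id VARCHAR(40),
    zipcode VARCHAR(10),
    updated_datetime TIMESTAMP,
    row_effective_datetime TIMESTAMP,
    row_expiration_datetime TIMESTAMP,
    current_row_indicator VARCHAR(10)
);
-- Initial state: a single current row that expires at "end of time".
INSERT INTO user_dim_sketch (
        user_key, user_id, zipcode, updated_datetime,
        row_effective_datetime, row_expiration_datetime,
        current_row_indicator
    )
VALUES (
        1000, 'b0cc9fde-a29a-498e-824f-e52399991beb', '10027',
        '2020-01-01 10:00:00', '2020-01-01 10:00:00',
        '9999-12-31 00:00:00', 'current'
    );

BEGIN;
-- Step 1: expire the row that is current for this user.
UPDATE user_dim_sketch
SET row_expiration_datetime = TIMESTAMP '2021-01-01 18:00:00' - INTERVAL '1 second',
    current_row_indicator = 'expired'
WHERE user_id = 'b0cc9fde-a29a-498e-824f-e52399991beb'
    AND current_row_indicator = 'current';
-- Step 2: add a new row that carries the new state.
INSERT INTO user_dim_sketch (
        user_key, user_id, zipcode, updated_datetime,
        row_effective_datetime, row_expiration_datetime,
        current_row_indicator
    )
VALUES (
        1200, 'b0cc9fde-a29a-498e-824f-e52399991beb', '10012',
        '2021-01-01 18:00:00', '2021-01-01 18:00:00',
        '9999-12-31 00:00:00', 'current'
    );
COMMIT;
```

Running both statements in one transaction means a reader of the table should not see a moment where the user has no current row or two current rows.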

Let’s create a simple user_dim table and an items_purchased fact table in our data warehouse.

```sql
DROP TABLE IF EXISTS user_dim;
CREATE TABLE user_dim (
    user_key BIGINT,
    user_id VARCHAR(40),
    first_name VARCHAR(10),
    last_name VARCHAR(10),
    address VARCHAR(100),
    zipcode VARCHAR(10),
    created_datetime TIMESTAMP,
    updated_datetime TIMESTAMP,
    row_effective_datetime TIMESTAMP,
    row_expiration_datetime TIMESTAMP,
    current_row_indicator VARCHAR(10)
);
INSERT INTO user_dim (
        user_key,
        user_id,
        first_name,
        last_name,
        address,
        zipcode,
        created_datetime,
        updated_datetime,
        row_effective_datetime,
        row_expiration_datetime,
        current_row_indicator
    )
VALUES (
        1000,
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'john',
        'doe',
        'world',
        '10027',
        '2020-01-01 10:00:00',
        '2020-01-01 10:00:00',
        '2020-01-01 10:00:00',
        '2021-01-01 17:59:59',
        'expired'
    ),
    (
        1200,
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'john',
        'doe',
        'world',
        '10012',
        '2020-01-01 10:00:00',
        '2021-01-01 18:00:00',
        '2021-01-01 18:00:00',
        '9999-12-31 00:00:00',
        'current'
    );
DROP TABLE IF EXISTS items_purchased;
CREATE TABLE items_purchased (
    item_purchased_id VARCHAR(40),
    order_id VARCHAR(40),
    user_id VARCHAR(40),
    item_id VARCHAR(40),
    item_cost DECIMAL(10, 2),
    purchased_datetime TIMESTAMP -- and other fact information
);
INSERT INTO items_purchased (
        item_purchased_id,
        order_id,
        user_id,
        item_id,
        item_cost,
        purchased_datetime
    )
VALUES (
        'nljbac724bbskd',
        'order_id_1',
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'item_id_1',
        1500.00,
        '2020-12-28 12:30:00'
    ),
    (
        'ljbkcfvbj6758njh',
        'order_id_23',
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'item_id_45',
        20.00,
        '2021-01-28 09:30:00'
    ),
    (
        'sjbv09uy7njhbvvj',
        'order_id_100',
        '11111111-2222-3333-44444444444',
        'item_id_12',
        32.00,
        '2021-02-01 11:00:00'
    );
```

We now have an items_purchased fact table and a user_dim dimension table, with some sample data.
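With this sample data in place, the earlier question about how many users lived in the zip code 10027 during December 2020 can be sketched as a range-overlap filter on user_dim. The query below is one possible reading of that question (a user counts if any part of a row's validity range falls within the month) and assumes a PostgreSQL-style dialect.

```sql
-- Count users whose 10027 row was in effect at some point in December 2020.
SELECT COUNT(DISTINCT user_id) AS users_in_10027
FROM user_dim
WHERE zipcode = '10027'
    -- the row started before the month ended ...
    AND row_effective_datetime < TIMESTAMP '2021-01-01 00:00:00'
    -- ... and had not yet expired when the month began
    AND row_expiration_datetime >= TIMESTAMP '2020-12-01 00:00:00';
```

For the sample rows, only the expired 10027 row satisfies both conditions, so the user is counted towards 10027 and not 10012.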

Let’s say our data analysts want to answer questions such as:

1. I want to see the distribution of our high-spending users by month, year, and zip code (at the time of purchase).
2. I want to see each high-spending user's first name, last name, and zip code, along with how long they lived at each zip code (duration_of_stay). If they currently live at a zip code, duration_of_stay should be NULL.

A high-spending user is defined as any user who has spent at least $1,000 on our website.

We can write a query to answer question 1, as shown below:

```sql
WITH high_spenders AS (
    SELECT user_id
    FROM items_purchased
    GROUP BY user_id
    HAVING SUM(item_cost) >= 1000
),
user_items AS (
    SELECT ip.item_purchased_id,
        ip.user_id,
        ip.item_cost,
        ud.zipcode,
        ip.purchased_datetime
    FROM items_purchased ip
        JOIN user_dim ud ON ip.user_id = ud.user_id
        AND ip.purchased_datetime BETWEEN ud.row_effective_datetime AND ud.row_expiration_datetime
)
SELECT EXTRACT(
        YEAR
        FROM ui.purchased_datetime
    ) yr,
    EXTRACT(
        MONTH
        FROM ui.purchased_datetime
    ) mnth,
    ui.zipcode,
    COUNT(DISTINCT ui.user_id) num_high_spenders
FROM user_items ui
    JOIN high_spenders hs ON ui.user_id = hs.user_id
GROUP BY yr,
    mnth,
    ui.zipcode
ORDER BY yr DESC,
    mnth DESC;
```
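One detail of the join above is worth a closer look. BETWEEN is inclusive at both ends, and the sample rows leave a one-second gap between one row's expiration (17:59:59) and the next row's start (18:00:00), so a purchase timestamp with fractional seconds inside that gap would match no dimension row. As a hedged alternative, the sketch below derives each row's end from the next row's start with the LEAD window function and joins on a half-open range. The names `user_versions`, `valid_from`, and `valid_to` are hypothetical, and the syntax assumes a PostgreSQL-style dialect.

```sql
WITH user_versions AS (
    SELECT user_id,
        zipcode,
        row_effective_datetime AS valid_from,
        -- Start of the next version, or NULL for the latest version.
        LEAD(row_effective_datetime) OVER (
            PARTITION BY user_id
            ORDER BY row_effective_datetime
        ) AS valid_to
    FROM user_dim
)
SELECT ip.item_purchased_id,
    ip.user_id,
    uv.zipcode,
    ip.purchased_datetime
FROM items_purchased ip
    JOIN user_versions uv ON ip.user_id = uv.user_id
    -- Half-open range: valid_from <= purchase < valid_to
    AND ip.purchased_datetime >= uv.valid_from
    AND (
        uv.valid_to IS NULL
        OR ip.purchased_datetime < uv.valid_to
    );
```

Like the original inner join, this drops purchases whose user_id has no row in user_dim, such as the third sample purchase. A LEFT JOIN would keep those rows with a NULL zipcode.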