# Week 4 Lab: Data Visualization with DBT and Superset

During this week's lab, you will learn how to create analytical views and a dashboard.

# Table of Contents

- [ 1 - Introduction and Setup](#1)
- [ 2 - Views with dbt](#2)
  - [ 2.1 - Annual Sales per Office](#2.1)
  - [ 2.2 - Average Sales per Product Line](#2.2)
  - [ 2.3 - Running Analytical Views](#2.3)
  - [ 2.4 - Creating Materialized View](#2.4)
- [ 3 - Dashboard with Apache Superset](#3)

Load the SQL extension.

In [None]:
%load_ext sql

<a name='1'></a>
## 1 - Introduction and Setup

Data visualization is a critical skill for a Data Engineer, enabling you to transform complex data sets into insightful, actionable visuals. Effective visual communication can enhance understanding of the insights you gained while processing data, uncover trends, and drive strategic actions in any data-centric organization. In this lab, you will use a star schema model created on top of the `classicmodels` dataset. You will create analytical views on top of this model and then display the results in a dashboard using Apache Superset.

**dbt** is a SQL-based command line tool for transformation workflows. You have worked with dbt before, and the initial project is similar to the one from the assignment in the first week of this course.

Let's start the `dbt` project called `classicmodels_modeling`.

*Note*:<span style="color:red"> All terminal commands in this lab should be run in the VSCode terminal, not Jupyter, as it may cause some issues. Always check that the virtual environment is active.</span>

1.1.1. Activate the Python virtual environment for the lab. Run the following command in the VSCode terminal:
```bash
source jupyterlab-venv/bin/activate
```

1.1.2. Check that `dbt` Core is installed.

```bash
dbt --version
```

1.1.3. Go to the AWS console, and in the CloudFormation **Outputs** tab find the key `PostgresEndpoint`. Copy the corresponding **Value**.

1.1.4. Open the file located at `./scripts/profiles.yml`. Replace the placeholders `<DATABASE_ENDPOINT>` with the endpoint value. Save changes.
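
Before copying the file, it can help to confirm that no placeholder was left behind. The following is a minimal sketch using only the Python standard library. The `profile_text` excerpt is hypothetical (the real file has more keys, and the port value is an assumption), and `unreplaced_placeholders` is an illustrative helper, not part of the lab's scripts.

```python
# Hypothetical excerpt of a profiles.yml file; the real file has more keys.
profile_text = """\
classicmodels_modeling:
  outputs:
    source:
      host: <DATABASE_ENDPOINT>
      port: 5432
"""


def unreplaced_placeholders(text: str) -> list:
    """Return the 1-based line numbers that still contain the placeholder."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if "<DATABASE_ENDPOINT>" in line
    ]


# A non-empty list means the file still needs editing.
print(unreplaced_placeholders(profile_text))  # [4]
```

To check the real file, you could pass the contents of `./scripts/profiles.yml` to the same function and expect an empty list.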

1.1.5. Run the following command to copy the `profiles.yml` file to the hidden `.dbt` folder in your home directory:

```bash
cp ./scripts/profiles.yml $HOME/.dbt/profiles.yml 
```

1.1.6. Navigate into your project's directory:

```bash
cd classicmodels_modeling
```

1.1.7. Run the following command to test the connection:

```bash
dbt debug
```

The output should end with `Connection test: OK connection ok`.

1.1.8. Run the following command to install the packages specified in the `packages.yml` file.

```bash
dbt deps
```

1.1.9. Load the source configuration into the notebook with the following code:

In [None]:
import yaml

with open("./scripts/profiles.yml", 'r') as stream:
    data_loaded = yaml.safe_load(stream)
    
DBCONFIG = data_loaded["classicmodels_modeling"]["outputs"]["source"]
DBHOST = DBCONFIG["host"]
DBPORT = int(DBCONFIG["port"])
DBNAME = DBCONFIG["dbname"]
DBUSER = DBCONFIG["user"]
DBPASSWORD = DBCONFIG["password"]
db_connection_url = f'postgresql+psycopg2://{DBUSER}:{DBPASSWORD}@{DBHOST}:{DBPORT}/{DBNAME}'

%sql {db_connection_url}
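
If the password stored in `profiles.yml` contains characters such as `@`, `:` or `/`, a connection URL built by plain string formatting may not parse correctly. The following is a minimal sketch of percent-encoding the credentials with `urllib.parse.quote_plus` before building the URL. All values below are hypothetical and only for illustration; in the notebook they would come from `DBCONFIG`.

```python
from urllib.parse import quote_plus

# Hypothetical values for illustration only (not the lab's credentials).
user = "postgres_user"
password = "p@ss:word/1"
host = "example-endpoint.rds.amazonaws.com"
port = 5432
dbname = "postgres"

# Percent-encode the credentials so that '@', ':' and '/' do not break the URL structure.
safe_url = (
    f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
    f"@{host}:{port}/{dbname}"
)
print(safe_url)
```

With the encoded password, the only `@` left in the URL is the one separating the credentials from the host.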

1.1.10. Test the connection from the Notebook to the Postgres database. You should see the schemas of the database in the output.

In [None]:
%%sql

SELECT schema_name
FROM information_schema.schemata;

1.1.11. Run the star schema models in the VSCode terminal (make sure that you are still in the project directory `classicmodels_modeling`):

```bash
dbt run
```

You should see output similar to the following:

```bash
Finished running 7 table models in 0 hours 0 minutes and 1.xx seconds (1.xx s).

Completed successfully

Done. PASS=7 WARN=0 ERROR=0 SKIP=0 TOTAL=7
```

1.1.12. Verify that the star schema was added to the Postgres database. The new schema should be called `classicmodels_star_schema`:

In [None]:
%%sql

SELECT schema_name
FROM information_schema.schemata;

1.1.13. Now, let's verify that the star schema fact and dimension tables are available in the new schema. Run the following cell:

In [None]:
%%sql
SELECT table_catalog, table_schema, table_name, table_type  FROM information_schema.tables 
WHERE table_schema = 'classicmodels_star_schema'

Verify that each table has data, and check its column names and types.

In [None]:
%sql SELECT * FROM classicmodels_star_schema.fact_orders LIMIT 10

In [None]:
%sql SELECT * FROM classicmodels_star_schema.dim_customers LIMIT 10

In [None]:
%sql SELECT * FROM classicmodels_star_schema.dim_employees LIMIT 10

In [None]:
%sql SELECT * FROM classicmodels_star_schema.dim_offices LIMIT 10

In [None]:
%sql SELECT * FROM classicmodels_star_schema.dim_products LIMIT 10

In [None]:
%sql SELECT * FROM classicmodels_star_schema.dim_dates LIMIT 10

<a name='2'></a>
## 2 - Views with dbt

Let's review the star schema that you just created, which corresponds to the same schema you created in the first `dbt` lab.
![star_schema](images/star_schema.png)

Based on this schema, you are going to create some views and materialized views that serve your data by answering some business questions, and then build a dashboard to visualize the results. First, let's review the definition of views and materialized views.

A **view** is a virtual table based on the result of a SQL query. It does not store the data physically; instead, it provides a way to look at the data from one or more tables. When you query a view, the underlying query is executed, and the result is returned. This means that you will always see up-to-date data from a view.

A **materialized view** is a database object that contains the results of a query and stores them physically. Unlike regular views, materialized views store the data, and therefore, do not need to query the base tables every time they are accessed. They need to be refreshed periodically to reflect changes in the underlying data. 

In this lab, you are going to create two views. To compare views and materialized views, one of them will then be recreated as a materialized view. The following diagram shows the two views that you are going to create:

![views](./images/views.png)

<a name='2.1'></a>
### 2.1 - Annual Sales per Office

The first business query that you will create should answer the question about the annual sales per office in terms of quantity of items sold and the total sales income received.

2.1.1. This is the query which should answer that question. Review the query and run it:

In [None]:
%%sql 
SELECT 
    DISTINCT fct.office_key
    , dof.city 
    , dof.state 
    , dof.country
    , dof.territory
    , SUM(fct.quantity_ordered) AS total_quantity 
    , SUM(fct.quantity_ordered*fct.product_price) AS total_price
    , EXTRACT(YEAR FROM fct.order_date) AS year
FROM classicmodels_star_schema.fact_orders AS fct
JOIN classicmodels_star_schema.dim_offices AS dof ON dof.office_key=fct.office_key
GROUP BY fct.office_key
    , dof.city
    , dof.state
    , dof.country
    , dof.territory
    , EXTRACT(YEAR FROM fct.order_date)
ORDER BY fct.office_key ASC, year ASC
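
To see the shape of the result this kind of query produces, here is a minimal sketch on toy data using Python's built-in `sqlite3` module. The table contents are hypothetical, and SQLite's `strftime('%Y', ...)` stands in for PostgreSQL's `EXTRACT(YEAR FROM ...)`. The sketch also suggests that `DISTINCT` does not change the result when every non-aggregated column already appears in the `GROUP BY` clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_orders (office_key TEXT, order_date TEXT, "
    "quantity_ordered INTEGER, product_price REAL)"
)
# Hypothetical rows: office A has orders in two years, office B in one.
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
    [
        ("A", "2004-03-01", 10, 2.0),
        ("A", "2004-07-15", 5, 4.0),
        ("A", "2005-01-10", 1, 10.0),
        ("B", "2004-05-20", 3, 1.0),
    ],
)

grouped = """
SELECT {distinct} office_key
     , SUM(quantity_ordered) AS total_quantity
     , SUM(quantity_ordered * product_price) AS total_price
     , strftime('%Y', order_date) AS year
FROM fact_orders
GROUP BY office_key, strftime('%Y', order_date)
ORDER BY office_key, year
"""

# Run the same aggregation with and without DISTINCT.
with_distinct = conn.execute(grouped.format(distinct="DISTINCT")).fetchall()
without_distinct = conn.execute(grouped.format(distinct="")).fetchall()

for row in without_distinct:
    print(row)
print(with_distinct == without_distinct)  # True
```

Each output row corresponds to one office and one year, which is the grain of the view created in the next steps.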

2.1.2. Create a new folder `analytical_views` located at `./classicmodels_modeling/models/`. Then in that folder create a new file `annual_sells_per_office_view.sql`. Copy the query from the previous step into that file (do not include the `%%sql` line).

2.1.3. At the beginning of this file right before your query, add the following jinja configuration that creates views instead of physical tables:

```
{{
config(
    materialized='view'
    )
}}
```

Don't be confused by the `materialized` key. In this case, it is a `dbt` [concept](https://docs.getdbt.com/docs/build/materializations) describing the strategy used to persist models into the destination database. Here it specifies that the model should be materialized as a view (which is different from a materialized view) in the database.

2.1.4. Replace `classicmodels_star_schema` with the Jinja templating string `{{var("star_schema")}}` (in two places). You will end up with `{{var("star_schema")}}.<TABLE_NAME>` for each table in your query. Jinja templating takes the value stored in the `star_schema` variable and substitutes it into the query, giving `classicmodels_star_schema`, the schema that hosts your new star schema data model.

Save changes to the file.
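
As a rough picture of what the templating step does, the sketch below substitutes `{{var("...")}}` placeholders using a regular expression. This is a toy stand-in for illustration, not how dbt or Jinja is implemented. The `project_vars` dictionary and the `render_vars` helper are hypothetical, and the variable is assumed to be defined in the project configuration.

```python
import re

# Hypothetical project variables, standing in for the values dbt reads from its configuration.
project_vars = {"star_schema": "classicmodels_star_schema"}

# Matches {{var("name")}} or {{ var('name') }} and captures the variable name.
VAR_PATTERN = re.compile(r"""\{\{\s*var\(\s*["'](\w+)["']\s*\)\s*\}\}""")


def render_vars(sql: str, variables: dict) -> str:
    """Replace each var() placeholder with its value (toy stand-in for Jinja rendering)."""
    return VAR_PATTERN.sub(lambda match: variables[match.group(1)], sql)


template = 'SELECT * FROM {{var("star_schema")}}.fact_orders AS fct'
rendered = render_vars(template, project_vars)
print(rendered)  # SELECT * FROM classicmodels_star_schema.fact_orders AS fct
```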

2.1.5. Create a new `schema.yml` file in the `./classicmodels_modeling/models/analytical_views/` folder. To set the schema for the view `annual_sells_per_office_view`, copy the following into the file and save changes:

```yml
version: 2

models:
  - name: annual_sells_per_office_view
    description: "Annual sells per office view"
    columns:
      - name: office_key
      - name: city
      - name: state
      - name: country
      - name: territory
      - name: total_quantity
      - name: total_price
      - name: year
```

2.1.6. Open the `./classicmodels_modeling/dbt_project.yml` file, and at the bottom of the file add the following key:

```yml
analytical_views:
  +materialized: view
  +schema: star_schema
```

This also ensures that the models in this folder are created as views rather than physical tables, and that they are placed in the `star_schema` custom schema, which appears in the database as `classicmodels_star_schema`.
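
The `+schema: star_schema` setting does not produce a schema literally named `star_schema`. By default, dbt prefixes a custom schema with the target schema from the profile. The sketch below mimics that default naming rule in plain Python. The function name is hypothetical, and the target schema `classicmodels` is an assumption chosen to be consistent with the `classicmodels_star_schema` name seen in the database.

```python
from typing import Optional


def default_schema_name(target_schema: str, custom_schema: Optional[str]) -> str:
    """Mimic dbt's default rule: <target schema>_<custom schema> when a custom schema is set."""
    if custom_schema is None:
        return target_schema
    return f"{target_schema}_{custom_schema.strip()}"


# Assumption: the target schema configured in profiles.yml is `classicmodels`.
print(default_schema_name("classicmodels", "star_schema"))  # classicmodels_star_schema
print(default_schema_name("classicmodels", None))  # classicmodels
```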

<a name='2.2'></a>
### 2.2 - Average Sales per Product Line

Now you've got a business question about the average sales (in terms of units and price) of each product line per month and year. 

2.2.1. This is the query which should answer that question. Review the query and run it:

In [None]:
%%sql 
SELECT 
    dp.product_line
    , AVG(fct.quantity_ordered) AS avg_quantity 
    , AVG(fct.quantity_ordered*fct.product_price) AS avg_price
    , EXTRACT(MONTH FROM fct.order_date) AS month
    , EXTRACT(YEAR FROM fct.order_date) AS year 
FROM classicmodels_star_schema.fact_orders AS fct
JOIN classicmodels_star_schema.dim_products AS dp ON dp.product_key = fct.product_key
GROUP BY dp.product_line
    , EXTRACT(MONTH FROM fct.order_date)
    , EXTRACT(YEAR FROM fct.order_date)
ORDER BY
    dp.product_line ASC    
    , month ASC
    , year ASC
LIMIT 10

2.2.2. In the folder `./classicmodels_modeling/models/analytical_views/` create a new file `avg_sells_per_product_line_view.sql`. Copy the query from the previous step into that file (do not include the lines `%%sql` and `LIMIT 10`).

2.2.3. At the beginning of this file right before your query, add the following jinja configuration that creates views instead of physical tables:

```
{{
config(
    materialized='view'
    )
}}
```

2.2.4. Replace `classicmodels_star_schema` with the Jinja templating string `{{var("star_schema")}}` (in two places). Save changes.

2.2.5. In the `./classicmodels_modeling/models/analytical_views/schema.yml` file add the schema for the view `avg_sells_per_product_line_view`:

```yml
  - name: avg_sells_per_product_line_view
    description: "Average sells per product line view"
    columns:
      - name: product_line        
      - name: avg_quantity
      - name: avg_price
      - name: month
      - name: year
```

Save changes.

<a name='2.3'></a>
### 2.3 - Running Analytical Views

Now run the analytical views that you just created.

2.3.1. Go back to your `dbt` project folder `classicmodels_modeling` and run the analytical views:

*Note*: You may need to reactivate the environment first:

```bash
source jupyterlab-venv/bin/activate
```

```bash
cd classicmodels_modeling
dbt run -s analytical_views
```

You should see an output similar to this one:

![dbt_create_views](./images/dbt_create_views.png)

2.3.2. The views will be created in the `classicmodels_star_schema` schema, so you can use the following queries to check that they return data:

In [None]:
%%sql
SELECT COUNT(*) FROM classicmodels_star_schema.annual_sells_per_office_view;

In [None]:
%%sql
SELECT * FROM classicmodels_star_schema.annual_sells_per_office_view order by year desc;

In [None]:
%%sql
SELECT COUNT(*) FROM classicmodels_star_schema.avg_sells_per_product_line_view;

In [None]:
%%sql
SELECT * FROM classicmodels_star_schema.avg_sells_per_product_line_view order by year desc, month asc LIMIT 20;

<a name='2.4'></a>
### 2.4 - Creating Materialized View

Up to this moment, you have created views. Remember that those views are virtual tables that do not store data physically. Let's take the last view and create a materialized view so we can compare how these two materialization strategies work in a database. 

2.4.1. Make a copy of the file `./classicmodels_modeling/models/analytical_views/avg_sells_per_product_line_view.sql` and name it `avg_sells_per_product_line_mv.sql`. Open this new file and change the configuration to:

```
{{
    config(
        materialized='materialized_view',
        on_configuration_change = 'apply',
    )
}}
```

Take a look at the value of the `materialized` key. This value indicates that the materialization strategy will create an actual materialized view, which stores data physically.

Save changes.

2.4.2. Open the `./classicmodels_modeling/models/analytical_views/schema.yml` file and add an entry named `avg_sells_per_product_line_mv` for this materialized view:

```yml
  - name: avg_sells_per_product_line_mv
    description: "Average sells per product line materialized view"
    columns:
      - name: product_line        
      - name: avg_quantity
      - name: avg_price
      - name: month
      - name: year
```

Save changes.

2.4.3. Finally, run your `dbt` models again with:

```
dbt run -s analytical_views
```

*Note:* Remember that in the `dbt_project.yml` file, you set this configuration:

```
analytical_views:
    +materialized: view
    +schema: star_schema
```

Although this configuration specifies that the materialization strategy is a `view`, the configuration that you set in the `avg_sells_per_product_line_mv.sql` file overrides it to create a materialized view.

2.4.4. Run these cells to check that the data has been successfully loaded into the materialized view:
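
One simplified way to picture this precedence is as a dictionary merge in which the more specific, model-level settings win over the folder-level defaults. The sketch below is only an illustration of that idea in plain Python, not dbt's actual implementation, and the variable names are hypothetical.

```python
# Folder-level defaults for analytical_views (as set in dbt_project.yml).
project_config = {"materialized": "view", "schema": "star_schema"}

# Model-level settings from the config() block in avg_sells_per_product_line_mv.sql.
model_config = {"materialized": "materialized_view", "on_configuration_change": "apply"}

# Later (more specific) keys win; untouched keys such as `schema` are inherited.
effective_config = {**project_config, **model_config}
print(effective_config)
```

In this picture, the model is still placed in the `star_schema` custom schema because the model-level block does not set a `schema` of its own.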

In [None]:
%%sql
SELECT COUNT(*) FROM classicmodels_star_schema.avg_sells_per_product_line_mv;

In [None]:
%%sql
SELECT * FROM classicmodels_star_schema.avg_sells_per_product_line_mv order by year desc, month asc LIMIT 20;

Now it is time to see the difference between views and materialized views in practice. Given that materialized views store data physically, they need to be refreshed when the underlying tables are updated. Views, on the other hand, execute the underlying query every time they are accessed, so they always present up-to-date data.

2.4.5. Let's insert some mock data into the `fact_orders` table and see how it affects the views and materialized views. First check the maximum year currently in the table, then execute the cell that inserts the data. Note that all the inserted dates are in the year `2006`, while the maximum year currently in the database is `2005`.

In [None]:
%%sql
SELECT MAX(EXTRACT(YEAR FROM fct.order_date))
FROM classicmodels_star_schema.fact_orders fct;

In [None]:
%%sql
INSERT INTO classicmodels_star_schema.fact_orders (fact_order_key, customer_key, employee_key, office_key, product_key, order_date, order_required_date, order_shipped_date, quantity_ordered, product_price) VALUES ('9eec411e690b55dafeb3ec3393aa6d57', '7d04bbbe5494ae9d2f5a76aa1c00fa2f', '4671aeaf49c792689533b00664a5c3ef', 'eccbc87e4b5ce2fe28308fd9f2a7baf3', '296efc252c7855537b5d9e6015bf42b8', '2006-06-01 00:00:00', '2006-07-01 00:00:00', '2006-06-15 00:00:00', 41, 83.79);
INSERT INTO classicmodels_star_schema.fact_orders (fact_order_key, customer_key, employee_key, office_key, product_key, order_date, order_required_date, order_shipped_date, quantity_ordered, product_price) VALUES ('f485cfdd94901e9e237dcc3f644f7edc', '7d04bbbe5494ae9d2f5a76aa1c00fa2f', '4671aeaf49c792689533b00664a5c3ef', 'eccbc87e4b5ce2fe28308fd9f2a7baf3', '8bff119b349bf271ef0684a15808ea18', '2006-06-01 00:00:00', '2006-07-01 00:00:00', '2006-06-15 00:00:00', 11, 50.32);
INSERT INTO classicmodels_star_schema.fact_orders (fact_order_key, customer_key, employee_key, office_key, product_key, order_date, order_required_date, order_shipped_date, quantity_ordered, product_price) VALUES ('7eecd924b84c4a03fcb69d5ec6df4670', '0f28b5d49b3020afeecd95b4009adf4c', 'd1ee59e20ad01cedc15f5118a7626099', 'a87ff679a2f3e71d9181a67b7542122c', '99733605e1ea651ec564248e05f77741', '2006-06-02 00:00:00', '2006-06-21 00:00:00', '2006-06-07 00:00:00', 18, 94.92 );

2.4.6. Now that you have inserted the data, query the views again. Remember that `annual_sells_per_office_view` had 21 rows while `avg_sells_per_product_line_view` had 182 rows before the update.

In [None]:
%%sql
SELECT COUNT(*) FROM classicmodels_star_schema.annual_sells_per_office_view;

In [None]:
%%sql
SELECT * FROM classicmodels_star_schema.annual_sells_per_office_view order by year desc;

In [None]:
%%sql
SELECT COUNT(*) FROM classicmodels_star_schema.avg_sells_per_product_line_view;

In [None]:
%%sql
SELECT * FROM classicmodels_star_schema.avg_sells_per_product_line_view order by year desc, month asc LIMIT 20;

2.4.7. Finally, let's query the materialized view. What can you see?

In [None]:
%%sql
SELECT COUNT(*) FROM classicmodels_star_schema.avg_sells_per_product_line_mv;

In [None]:
%%sql
SELECT * FROM classicmodels_star_schema.avg_sells_per_product_line_mv order by year desc, month asc LIMIT 20;

When querying the materialized view, you can see that you get the same number of rows as before the update to the `fact_orders` table. To refresh this materialized view, run the `dbt` models again with the command:

```
dbt run -s analytical_views
```

You will see that the materialized view has been refreshed:

![dbt_refresh_views](./images/dbt_refresh_views.png)

Finally, query your materialized view again to see the difference.

In [None]:
%%sql
SELECT COUNT(*) FROM classicmodels_star_schema.avg_sells_per_product_line_mv;

In [None]:
%%sql
SELECT * FROM classicmodels_star_schema.avg_sells_per_product_line_mv order by year desc, month asc LIMIT 20;

With that exercise, you can see the difference between a view and a materialized view. You may be wondering when you should use one or the other. You can use the following guidelines as a hint:

- Views: Use when you need a simple way to encapsulate complex queries, enhance security by exposing only certain data, or provide a simplified interface to the underlying tables. Ideal for frequently changing data where up-to-the-moment accuracy is essential.

- Materialized Views: Use when you need to improve performance for complex, resource-intensive queries. Ideal for reporting and data warehousing scenarios where data does not need to be up-to-the-second accurate, and where the performance gain from precomputed results outweighs the need for real-time data.
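
The behaviour observed in this section can be reproduced in miniature with Python's built-in `sqlite3` module. Note the assumptions: SQLite has no materialized views, so a table built with `CREATE TABLE ... AS SELECT` stands in for one, and the `orders` table and its rows are hypothetical toy data rather than the lab's tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy base table (hypothetical data, not the lab's classicmodels tables).
cur.execute("CREATE TABLE orders (product_line TEXT, quantity INTEGER)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Classic Cars", 10), ("Classic Cars", 30), ("Planes", 20)],
)

query = (
    "SELECT product_line, AVG(quantity) AS avg_quantity "
    "FROM orders GROUP BY product_line"
)

# A view stores only the query definition.
cur.execute(f"CREATE VIEW avg_view AS {query}")
# SQLite has no materialized views, so a table built from the query stands in for one.
cur.execute(f"CREATE TABLE avg_snapshot AS {query}")

# New data arrives in the base table.
cur.execute("INSERT INTO orders VALUES ('Ships', 5)")

view_rows = cur.execute("SELECT COUNT(*) FROM avg_view").fetchone()[0]
stale_rows = cur.execute("SELECT COUNT(*) FROM avg_snapshot").fetchone()[0]

# "Refresh" the snapshot by rebuilding it from the query.
cur.execute("DROP TABLE avg_snapshot")
cur.execute(f"CREATE TABLE avg_snapshot AS {query}")
fresh_rows = cur.execute("SELECT COUNT(*) FROM avg_snapshot").fetchone()[0]

print(view_rows, stale_rows, fresh_rows)  # 3 2 3
```

The view picks up the new product line immediately, while the stored snapshot only does so after it is rebuilt, which mirrors the refresh you triggered with `dbt run`.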

<a name='3'></a>
## 3 - Dashboard with Apache Superset

*Note*: Apache Superset takes around 15 to 20 minutes to load from the start of the lab. If you don't see the interface when entering the URL in a new tab, wait a while and refresh the page.

3.1. The EC2 instance has been set up for you to work with Apache Superset. Access the Superset UI using the URL provided in the CloudFormation Outputs (take the value for the key `SupersetEndpoint`). You should see a login screen like this:

![superset_ui](images/superset_ui.png)

3.2. Log in using the following credentials:

- user: `admin`
- password: `admin`

3.3. Configure the Postgres connection. Click on the **Settings** dropdown menu in the top right, and under the Data section select **Database Connections**. Click on the `+ Database` button in the top right. A new menu should appear to configure the new connection:

![superset_conf](images/superset_conf.png)

Select `PostgreSQL` and click Next. Fill out the details using the same connection parameters as the ones found in `./scripts/profiles.yml`. You can also print out the connection details:

In [None]:
print(DBCONFIG)

After filling out the details, click on the **Connect** button. This should create a new connection in the **Database Connections** section. Click on the **Finish** button.

3.4. Now, select the **Datasets** tab in the top header menu. You will be directed to a page with various example datasets. Click on the `+ DATASET` button on the top right. On the new screen, select the Postgres connection that you just configured, then the `classicmodels_star_schema` schema, and finally one of the views:

![superset_dataset](images/superset_dataset.png)

3.5. Click on the `CREATE DATASET AND CREATE CHART` button (in the bottom right). You will be directed to a new page to create a chart based on the dataset, where you can select the type of chart and which variable is used for each dimension.

![superset_chart](images/superset_chart.png)

Once you are done with the chart, click the **Save** button on the top right. You will be asked to give the chart a name before it is saved. Create a chart for each view, then create a new dashboard in the **Dashboards** section of the top navigation header using the `+ Dashboard` button. Enter a name for your dashboard (in the top left), then drag and drop the charts you created earlier onto the dashboard canvas. Resize and arrange the charts as desired, and finally click **Save** to save your dashboard layout.

![superset_dashboard](images/superset_dashboard.png)

In this lab, you focused on data visualization using various types of views and a dashboard tool. Effective visual communication is crucial for presenting your findings and the results of your data pipelines to potential stakeholders. Although this task aligns more with the role of a data analyst, it is essential to understand their responsibilities and how we can supply them with the necessary data for their work.