# Kellogg Data Cloud

:::{admonition} Data Workflow
:class: note

```{figure} ./images/data-workflow-kdc.png
---
width: 900px
name: data-workflow-kdc
---
```
:::

## Accessing Datasets

:::{admonition} [KDC](https://www.kellogg.northwestern.edu/academics-research/research-support/computing/kellogg-data-cloud.aspx)
:class: note

<!-- ```{figure} ./images/KDC-1.png
---
width: 900px
name: KDC-1
---
``` -->

```{figure} ./images/KDC-2.png
---
width: 900px
name: KDC-2
---
```

Use your NetID to log in to the [Northwestern AWS Access Portal](https://nu-sso.awsapps.com/start/#/?tab=accounts).
:::


:::{admonition} Numerator Dataset
:class: note
For this workshop, we will explore consumer panel data in the [Numerator](https://www.numerator.com/) dataset.

```{figure} ./images/numerator-website.png
---
width: 900px
name: numerator-website
---
```
:::

:::{admonition} Numerator dataset on KDC
```{figure} ./images/numerator-s3.png
---
width: 900px
name: numerator-s3
---
```
:::

## Writing SQL Queries

:::{admonition} Basic structure of a SQL statement
```sql
SELECT <column expr>        -- required
FROM <table expr>           -- required
JOIN <another table>        -- optional
WHERE <boolean condition>   -- optional
GROUP BY <columns>          -- optional
ORDER BY <columns>          -- optional
;
```
[Northwestern Research Computing Guide to SQL](https://sites.northwestern.edu/researchcomputing/resource-guides/resource-guide-sql/)
:::
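
To make the clause order above concrete, here is a minimal sketch that runs one query using every clause against a toy in-memory SQLite database. The `purchases` and `items` tables and their columns are made up for illustration and are not the Numerator schema; SQLite stands in for Athena only so the example runs anywhere.

```python
import sqlite3

# Toy in-memory database; table and column names are hypothetical,
# not the Numerator schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (user_id INTEGER, item_id INTEGER, amount REAL);
    CREATE TABLE items (item_id INTEGER, sector TEXT);
    INSERT INTO purchases VALUES (1, 10, 5.0), (1, 11, 7.5), (2, 10, 5.0), (2, 12, 20.0);
    INSERT INTO items VALUES (10, 'books'), (11, 'books'), (12, 'toys');
""")

query = """
    SELECT p.user_id, SUM(p.amount) AS total_spent   -- columns to return
    FROM purchases AS p                              -- main table
    JOIN items AS i ON p.item_id = i.item_id         -- attach item attributes
    WHERE i.sector = 'books'                         -- keep only matching rows
    GROUP BY p.user_id                               -- one output row per user
    ORDER BY total_spent DESC                        -- sort the result
    ;
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The clauses must appear in this order even though only `SELECT` and `FROM` are required.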

:::{admonition} Athena Query Console
:class: note
```{figure} ./images/numerator-athena-1.png
---
width: 900px
name: numerator-athena-1
---
```
:::

:::{admonition} SQL numerator example: counting rows
```sql
SELECT COUNT(*) 
FROM standard_nmr_feed_fact_table
; -- As of 2024-06-19: 5,444,236,053
```
The response shows that this table has approximately 5.4 billion rows.
:::

:::{important} 
The structure of your SQL query can dramatically affect efficiency. For large datasets, a well-structured query runs much faster, and sometimes it makes the difference between success and failure.

Three important principles for using KDC datasets efficiently:

- Only select the columns you need
- Only select the rows you need
- Use partitioned fields in your `WHERE` clause when possible
:::

:::{admonition} Which query is more efficient?
:class: tip
*Goal: select all user_ids and their postal codes*

```sql
-- Query 1
SELECT * -- selects all columns, including user_id and postal_code
FROM standard_nmr_feed_people_table
;
```

```sql
-- Query 2
SELECT user_id, postal_code
FROM standard_nmr_feed_people_table
;
```
:::

:::{admonition} Answer
:class: toggle
- *Query 1*: Run time: 34 sec, Data scanned: 72.6 MB
- *Query 2*: Run time: 5.4 sec, Data scanned: 44.4 MB

<span style="color:purple"><em>Select just the columns you need!</em></span>
:::
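
One way to see why selecting fewer columns scans less data is a toy model of columnar storage, where each column is stored and read separately. This sketch assumes the KDC tables are stored in a columnar format, which the source does not state; the `age_bucket` and `income_bucket` columns are hypothetical.

```python
# Toy columnar table: each column is stored (and read) separately.
# Columns other than user_id and postal_code are hypothetical.
table = {
    "user_id": ["u1", "u2", "u3"],
    "postal_code": ["60208", "60611", "60201"],
    "age_bucket": ["25-34", "35-44", "45-54"],
    "income_bucket": ["50-75k", "75-100k", "100k+"],
}


def bytes_scanned(table, columns=None):
    """Return the bytes read when selecting `columns` (None means SELECT *)."""
    selected = table.keys() if columns is None else columns
    return sum(len(value.encode("utf-8")) for col in selected for value in table[col])


all_columns = bytes_scanned(table)                              # like SELECT *
two_columns = bytes_scanned(table, ["user_id", "postal_code"])  # like Query 2
print(f"SELECT *: {all_columns} bytes, two columns: {two_columns} bytes")
```

In this model the unselected columns are never touched, so the bytes read grow with the number of columns requested.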

:::{admonition} How much data will these queries scan (the entire table, part of the table)?
:class: tip

*Goal: select all items from the "books" sector*

```sql
-- Query 1
SELECT *
FROM standard_nmr_feed_item_table
WHERE sector_id = 'isc_books' -- this is a legitimate sector designation
;
```

```sql
-- Query 2
SELECT *
FROM standard_nmr_feed_item_table
; -- pull all data, then post-process by filtering rows for sector_id = 'isc_books'
```

```sql
-- Query 3
SELECT *
FROM standard_nmr_feed_item_table
WHERE sector_id = 'books' -- This is a typo, there is no 'books' sector
;
```
:::

:::{admonition} Answer
:class: toggle
- *Query 1*: Run time: 10.4 sec, Data scanned: 132.3 MB
- *Query 2*: Run time: 17 min 40 sec, Data scanned: 18.11 GB
- *Query 3*: Run time: 71 ms, Data scanned: None!

<span style="color:purple"><em>Use partitioned fields in your "WHERE" clause when possible!</em></span>
:::
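
The three results above can be sketched with a toy model of a partitioned table, assuming (as the takeaway suggests) that `sector_id` is a partition key and each partition is stored separately. Only `isc_books` comes from the example; the other sector name is a placeholder.

```python
# Toy partitioned table: one list of rows per sector_id value.
# 'isc_books' comes from the example above; the other sector name is made up.
partitions = {
    "isc_books": [{"item_id": 1}, {"item_id": 2}],
    "isc_hypothetical": [{"item_id": 3}, {"item_id": 4}, {"item_id": 5}],
}


def query(partitions, sector_id=None):
    """Return (rows, rows_scanned) for an optional filter on the partition key."""
    if sector_id is None:
        # No filter on the partition key: every partition is read.
        scanned = [row for rows in partitions.values() for row in rows]
        return scanned, len(scanned)
    # Filter on the partition key: only the matching partition is read,
    # and a key with no partition means nothing is read at all.
    rows = partitions.get(sector_id, [])
    return rows, len(rows)


for label, sector in [("Query 1", "isc_books"), ("Query 2", None), ("Query 3", "books")]:
    rows, scanned = query(partitions, sector)
    print(f"{label}: returned {len(rows)} rows, scanned {scanned} rows")
```

A filter on a value with no matching partition, like the `'books'` typo in Query 3, has nothing to read, which is consistent with it scanning no data.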

:::{admonition} Which query is better?
:class: tip

*Goal: select a sample of purchase fact rows for analysis*

```sql
-- Query 1
SELECT FACT.*
FROM standard_nmr_feed_fact_table AS FACT
; -- download table and sample using code
```

```sql
-- Query 2
SELECT FACT.*
FROM standard_nmr_feed_fact_table AS FACT
LIMIT 5000000 -- take 5 million rows (whichever are returned first, not a random sample)
;
```

```sql
-- Query 3
SELECT FACT.*
FROM standard_nmr_feed_fact_table AS FACT
TABLESAMPLE SYSTEM(0.1) -- random sampling by data segment (about 0.1% of the table)
;
```

```sql
-- Query 4
SELECT FACT.*
FROM standard_nmr_feed_fact_table AS FACT
TABLESAMPLE BERNOULLI(0.1) -- random sampling by row (about 0.1% of rows)
;
```
:::

:::{admonition} Answer
:class: toggle
- *Query 1*: Not going to run this one -- all the data are already on KLC...
- *Query 2*: Run time: 35.6 sec, Data scanned: 1.26 GB
- *Query 3*: Run time: 20.9 sec, Data scanned: 120.22 MB
- *Query 4*: Run time: 1 min 46 sec, Data scanned: 271.26 GB (entire table)
:::
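
The gap between Query 3 and Query 4 comes from the sampling granularity. The simulation below is a rough sketch of the two strategies, not Athena's implementation: row-level (Bernoulli) sampling must read every block to decide on each row, while block-level (system) sampling skips whole blocks. The block count, block size, and 1% fraction are arbitrary; note that `TABLESAMPLE` takes a percentage, while this toy uses a fraction.

```python
import random


def bernoulli_sample(blocks, fraction, rng):
    """Row-level sampling: every block is read, each row kept with probability `fraction`."""
    sample = [row for block in blocks for row in block if rng.random() < fraction]
    return sample, len(blocks)  # all blocks are scanned


def system_sample(blocks, fraction, rng):
    """Block-level sampling: each block is kept or skipped as a whole."""
    chosen = [block for block in blocks if rng.random() < fraction]
    sample = [row for block in chosen for row in block]
    return sample, len(chosen)  # only the chosen blocks are scanned


# Toy table: 1,000 blocks of 100 rows each (sizes are arbitrary).
blocks = [list(range(i * 100, (i + 1) * 100)) for i in range(1000)]
rng = random.Random(0)

b_rows, b_blocks = bernoulli_sample(blocks, 0.01, rng)
s_rows, s_blocks = system_sample(blocks, 0.01, rng)
print(f"BERNOULLI: {len(b_rows)} rows, {b_blocks} blocks read")
print(f"SYSTEM:    {len(s_rows)} rows, {s_blocks} blocks read")
```

The trade-off is that block-level samples are cheaper but less uniform: rows stored together are selected together, which matters if the data are ordered or clustered on disk.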

## Automation with Scripting

:::{admonition} GUI vs. Scripting
:class: important
- Graphical User Interfaces (GUIs) like the AWS Athena console are convenient for exploration and development, but not well suited to reproducibility.

Advantages of scripts for reproducibility:

- Scripts store your workflow in a file as a sequence of commands.
- Scripts lay out your data workflow logic, including logging and testing.
- Scripts can be executed on the command line or with an automated scheduler.
- Scripts are just text files that can be put into version control.
:::

:::{admonition} Connect to KDC via code
:class: note
There are several ways to connect to KDC via code, including the [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/athena/) and [ODBC](https://docs.aws.amazon.com/athena/latest/ug/connect-with-odbc.html). We've created a module for KLC that makes it easy to run SQL queries on KDC databases, in parallel if desired.
:::
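
As a third option alongside the AWS CLI and ODBC, Athena can also be reached from Python with the `boto3` SDK. The sketch below is not the KLC module described here; it is a minimal, generic example that submits a query, polls until it finishes, and reports the bytes scanned. The helper name `run_athena_query`, the database name, the region, and the S3 output location are all placeholders.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit `sql` to Athena, wait for it to finish, return (state, bytes_scanned)."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return state, scanned
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


# Usage (requires boto3 and valid AWS credentials; all names are placeholders):
# import boto3
# client = boto3.client("athena", region_name="us-east-1")
# state, scanned = run_athena_query(
#     client,
#     "SELECT COUNT(*) FROM standard_nmr_feed_fact_table",
#     database="my_database",
#     output_location="s3://my-bucket/athena-results/",
# )
```

Passing the client in as an argument keeps the function easy to test with a stub and leaves credential handling to the caller.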

:::{admonition} KDC utilities module
:class: note

```{figure} ./images/kdcutils-1.png
---
width: 900px
name: kdcutils-1
---
```
:::

:::{admonition} Run SQL queries on the command line, log and test results
:class: note

```{figure} ./images/kdcutils-2.png
---
width: 900px
name: kdcutils-2
---
```
:::

:::{admonition} Instead of "for loops" in code, use templates over partitioned fields
:class: note

```{figure} ./images/kdcutils-3.png
---
width: 900px
name: kdcutils-3
---
```
:::

:::{admonition} Template substitution generates multiple queries
:class: note

```{figure} ./images/kdcutils-4.png
---
width: 900px
name: kdcutils-4
---
```
:::
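
The idea of expanding one template into many queries can be sketched with Python's built-in `string.Template`. This is a stand-in for illustration, not the syntax of the KDC utilities module shown in the figure. Only `isc_books` is a real sector value from the earlier example; the other two are placeholders.

```python
from string import Template

# One query template with a placeholder for the partitioned field.
template = Template(
    "SELECT COUNT(*) AS n_items\n"
    "FROM standard_nmr_feed_item_table\n"
    "WHERE sector_id = '$sector_id'\n"
    ";"
)

# 'isc_books' appears earlier on this page; the other values are placeholders.
sector_ids = ["isc_books", "isc_sector_a", "isc_sector_b"]

# One independent query per partition value.
queries = {sector: template.substitute(sector_id=sector) for sector in sector_ids}

for sector, sql in queries.items():
    print(f"-- {sector}\n{sql}\n")
```

Because each generated query filters on a single partition value, the queries are independent of one another and can be run separately or in parallel.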

:::{admonition} Use bash scripts to create a reproducible and testable workflow
:class: note

```{figure} ./images/kdcutils-5.png
---
width: 900px
name: kdcutils-5
---
```
:::

:::{admonition} Scale to multiple nodes when necessary
:class: note

```{figure} ./images/kdcutils-6.png
---
width: 900px
name: kdcutils-6
---
```
:::