# Week 1: Project

The goal for the first week was to get set up with `dbt`, configure the project, create some models, add basic tests, and explore the dbt workflow for building models.

![greenery-erd.png](greenery-erd.png)
[https://dbdiagram.io/d/6199cada02cf5d186b6052df](https://dbdiagram.io/d/6199cada02cf5d186b6052df)

## Self Review Questions

- ✅ Were you able to create schema.yml files with model names and descriptions? 
- ✅ Were you able to run your dbt models and snapshots against the data warehouse?
- ✅ Could you run the queries to answer key questions from the project instructions?

__What was most challenging/surprising in completing this week’s project?__

I was most challenged and surprised by five things while working on the project.

1. __Boilerplate__: I was surprised by the amount of boilerplate code one has to write to get set up. Importing the source tables and bootstrapping the staging models and their documentation was a largely manual effort, even though dbt provides some handy utilities for code generation. This motivated me to try my hand at automating the boilerplate, and I have a [proof-of-concept](https://github.com/ramnathv/dbt-explore/blob/main/dbt-greenery/_automation.ipynb).

2. __Conventions__: While `dbt` is opinionated in many ways, it is unopinionated in others, which was troublesome. I am a big believer in convention over configuration, so the ability to add configuration and properties in any `yaml` file in the project folder was confusing and led to decision fatigue. However, I landed on this really handy [style guide](https://github.com/dbt-labs/corp/blob/master/dbt_style_guide.md) from dbt Labs, which gave me a solid set of conventions that I believe will scale as we expand the data model layers.

3. __Tests__: `dbt` makes it really easy to translate logical checks on a model into simple configuration in a yaml file. An interesting byproduct of this simplicity is that it made me think about data quality much more deeply, and I ended up writing many more tests than I typically have time for. This makes it possible to track data quality at a much finer level. I would love the ability to dynamically populate a dashboard with the test results and profiling information, along the lines of what [Great Expectations](https://greatexpectations.io/) provides.

4. __Snapshots__: This is a really cool concept and a great way to reproduce analyses going back in time. However, I was surprised by how much additional thought is needed to decide which tables to snapshot and which strategy to use. I landed on the following framework to make decisions: (a) Are the rows mutable? (b) Is there value in capturing history? (c) Is the table small? If the answer to all three questions is yes, the table should be snapshotted. If the answer to (c) is no, then you trade off the benefit of (b) against the cost of (c) and decide accordingly. I would love to learn more about how others think about snapshotting in practice, and any tips and tricks for handling it.

5. __Workflow__: I am a big believer in repeatable workflows that can become routines one can do spontaneously. After quite a bit of iteration, I landed on a reasonable workflow: (1) tweak the staging model SQL, (2) run the model to ensure it runs correctly, (3) add documentation and tests for each column, and (4) rinse and repeat. It was during this that I discovered the awesome `dbt build` command, which runs seeds, snapshots, models, and tests in one go, following the order of the project's DAG, making it easier to iterate. I would love to learn more about other approaches to workflows, and any tips and tricks for making them efficient.
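
The three-question snapshot framework from item 4 can be sketched as a small Python function. This is only an illustration: the names `SnapshotDecision` and `snapshot_decision` are hypothetical, and the final branch (skip when rows are immutable or history has no value) is an assumption, since the framework above only spells out the first two cases.

```python
from enum import Enum


class SnapshotDecision(Enum):
    SNAPSHOT = "snapshot"
    WEIGH_TRADE_OFF = "weigh the value of history against the cost of table size"
    SKIP = "skip"


def snapshot_decision(rows_mutable: bool, history_valuable: bool, table_small: bool) -> SnapshotDecision:
    """Apply questions (a), (b), and (c) to a single table."""
    if rows_mutable and history_valuable and table_small:
        # All three answers are yes: snapshot the table.
        return SnapshotDecision.SNAPSHOT
    if rows_mutable and history_valuable:
        # Only (c) is no: the framework leaves this to judgment.
        return SnapshotDecision.WEIGH_TRADE_OFF
    # Assumption: nothing to capture if rows never change or history has no value.
    return SnapshotDecision.SKIP


# Hypothetical answers for a large, mutable table whose history matters.
print(snapshot_decision(rows_mutable=True, history_valuable=True, table_small=False).value)
```

Returning an explicit "weigh the trade-off" outcome, rather than a plain boolean, keeps the judgment call in (c) visible instead of hiding it behind a yes/no answer.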




__Is there a particular part of the project where you want focused feedback from your reviewers?__

## Analytics Questions

```python
%%capture
!pip install -r ../requirements.txt
```

```python
%load_ext sql
%sql postgresql://corise:corise@localhost:5432/dbt
%config SqlMagic.displaylimit = 5
%config SqlMagic.displaycon = False
%config SqlMagic.feedback = False
```



__1. How many users do we have?__

```sql
%%sql
SELECT COUNT(DISTINCT user_id)
  FROM users
```

| count |
| --- |
| 130 |


__2. On average, how many orders do we receive per hour?__

```sql
%%sql
WITH nb_orders_by_hour AS (
SELECT DATE_TRUNC('hour', created_at) AS created_at_hour,
       COUNT(DISTINCT order_id) AS nb_orders
  FROM orders
 GROUP BY 1
)

SELECT ROUND(AVG(nb_orders), 2)
  FROM nb_orders_by_hour
```

| round |
| --- |
| 8.16 |
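
One subtlety in the query above is that grouping by the truncated hour only produces rows for hours with at least one order, so the average is taken over active hours. The sketch below, using made-up timestamps rather than the Greenery data, contrasts that definition with an average over every hour in the observed span. All variable names are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical order timestamps (one per order), not taken from the warehouse.
order_times = [
    datetime(2021, 2, 10, 9, 5),
    datetime(2021, 2, 10, 9, 40),
    datetime(2021, 2, 10, 9, 55),
    datetime(2021, 2, 10, 12, 15),
]


def truncate_to_hour(ts: datetime) -> datetime:
    """Python counterpart of DATE_TRUNC('hour', ts)."""
    return ts.replace(minute=0, second=0, microsecond=0)


orders_per_hour = Counter(truncate_to_hour(ts) for ts in order_times)

# Same definition as the query: average over hours that appear in the data.
avg_over_active_hours = sum(orders_per_hour.values()) / len(orders_per_hour)

# Alternative definition: average over every hour between the first and last order.
first_hour, last_hour = min(orders_per_hour), max(orders_per_hour)
hours_in_span = int((last_hour - first_hour) / timedelta(hours=1)) + 1
avg_over_all_hours = len(order_times) / hours_in_span

print(avg_over_active_hours, avg_over_all_hours)  # 2.0 1.0
```

Which definition is appropriate depends on the question being asked; the query in this section uses the first one.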


__3. On average, how long does an order take from being placed to being delivered?__

```sql
%%sql
WITH nb_days_by_order AS (
SELECT order_id,
       created_at,
       delivered_at,
       EXTRACT(epoch FROM (delivered_at - created_at))/(24*3600) AS nb_days
  FROM orders
 WHERE status = 'delivered'
)

SELECT ROUND(AVG(nb_days)::NUMERIC, 2)
  FROM nb_days_by_order
```

| round |
| --- |
| 3.93 |
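
The `EXTRACT(epoch FROM ...)/(24*3600)` expression converts the interval between the two timestamps into seconds and then into fractional days. A minimal Python sketch of the same arithmetic, using hypothetical timestamps rather than actual orders, looks like this:

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 3600


def days_between(created_at: datetime, delivered_at: datetime) -> float:
    """Elapsed time in fractional days, mirroring epoch seconds / 86400."""
    return (delivered_at - created_at).total_seconds() / SECONDS_PER_DAY


# Hypothetical (created_at, delivered_at) pairs for delivered orders.
deliveries = [
    (datetime(2021, 2, 10, 8, 0), datetime(2021, 2, 13, 8, 0)),   # 3 days
    (datetime(2021, 2, 11, 12, 0), datetime(2021, 2, 16, 0, 0)),  # 4.5 days
]

avg_days = round(sum(days_between(c, d) for c, d in deliveries) / len(deliveries), 2)
print(avg_days)  # 3.75
```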


__4. How many users have only made one purchase? Two purchases? Three+ purchases?__

```sql
%%sql
WITH nb_purchases_by_user AS (
SELECT user_id,
       COUNT(DISTINCT order_id) AS nb_purchases
  FROM orders
 GROUP BY 1
)

SELECT CASE
         WHEN nb_purchases < 3 THEN nb_purchases::VARCHAR
         ELSE '3+'
       END AS nb_purchases,
       COUNT(user_id) AS nb_users
  FROM nb_purchases_by_user
 GROUP BY 1
 ORDER BY 1
```

| nb_purchases | nb_users |
| --- | --- |
| 1 | 25 |
| 2 | 22 |
| 3+ | 81 |
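
The bucketing logic in the `CASE` expression can be sketched in Python as follows. The `(user_id, order_id)` pairs are made up for illustration, and the helper name `bucket` is hypothetical.

```python
from collections import Counter

# Hypothetical (user_id, order_id) pairs standing in for the orders table.
orders = [
    ("u1", "o1"),
    ("u2", "o2"), ("u2", "o3"),
    ("u3", "o4"), ("u3", "o5"), ("u3", "o6"),
    ("u4", "o7"), ("u4", "o8"), ("u4", "o9"), ("u4", "o10"),
]


def bucket(nb_purchases: int) -> str:
    """Mirror the CASE expression: exact count below 3, otherwise '3+'."""
    return str(nb_purchases) if nb_purchases < 3 else "3+"


# set() plays the role of COUNT(DISTINCT order_id) per user.
purchases_by_user = Counter(user_id for user_id, _ in set(orders))
users_by_bucket = Counter(bucket(n) for n in purchases_by_user.values())

print(dict(sorted(users_by_bucket.items())))  # {'1': 1, '2': 1, '3+': 2}
```

Because the counts start from the orders table, users who never ordered do not appear in any bucket. The three buckets above sum to 128, while the first query counts 130 users, which would be consistent with two users having placed no orders.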


__5. On average, how many unique sessions do we have per hour?__

```sql
%%sql
WITH nb_sessions_by_hour AS (
SELECT DATE_TRUNC('hour', created_at) AS created_at_hour,
       COUNT(DISTINCT session_id) AS nb_sessions
  FROM events
 GROUP BY 1
)

SELECT ROUND(AVG(nb_sessions)::NUMERIC, 2)
  FROM nb_sessions_by_hour
```

| round |
| --- |
| 7.39 |
