<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/5_Views/1_View_Intro.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Views Intro

## Overview

### 🥅 Analysis Goals

- **Daily Revenue:** Return the daily net revenue.
- **Cohort Revenue:** Calculate the total net revenue for each cohort.

### 📘 Concepts Covered

- Create views
- Use views

In [2]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Replace ipython-sql with jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## Views

### 📝 Notes

`CREATE VIEW`

- **Why Use Views in PostgreSQL?**  
  - Simplifies complex queries by storing them as reusable, named objects.  
  - Ensures consistency and readability when multiple queries rely on the same logic.  
  - Enhances security by restricting access to specific rows/columns.  
  - Improves maintainability by centralizing changes to the query logic.

- **Syntax:**  
    ```sql
    CREATE VIEW view_name AS
    SELECT
        column1,
        column2,
        column3
    FROM table_name
    ```
    - `CREATE VIEW view_name AS`: Creates a new view with the specified name.
    - `SELECT`: Defines the query the view runs each time it is referenced. A standard view stores this query definition, not its results.
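
To make concrete what a view actually stores, here is a minimal sketch. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for PostgreSQL (an assumption made so the snippet runs anywhere). The `toy_sales` table and `toy_revenue` view are hypothetical names, not part of the Contoso database.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (orderdate TEXT, quantity INTEGER, netprice REAL)")
conn.execute("INSERT INTO toy_sales VALUES ('2024-01-01', 2, 10.0)")

# The view saves the SELECT statement, not a snapshot of its rows.
conn.execute("""
    CREATE VIEW toy_revenue AS
    SELECT orderdate, SUM(quantity * netprice) AS total_revenue
    FROM toy_sales
    GROUP BY orderdate
""")
before = conn.execute("SELECT * FROM toy_revenue").fetchall()

# Add a row to the base table, then query the same view again.
conn.execute("INSERT INTO toy_sales VALUES ('2024-01-01', 1, 5.0)")
after = conn.execute("SELECT * FROM toy_revenue").fetchall()

print(before)  # [('2024-01-01', 20.0)]
print(after)   # [('2024-01-01', 25.0)]
```

The second result changes without recreating the view, because the query is re-run against the current table contents each time the view is read.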
  
### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Cohort Revenue: Total sales generated by a group of customers who started at the same time
  - Daily Revenue Trends: Patterns in revenue performance over daily periods
- **💡 Why It Matters**: Enables efficient analysis of cohort performance
  - Simplifies complex revenue calculations for repeated use
  - Maintains consistency in cohort analysis across queries
  - Provides a standardized way to track daily revenue patterns
  - Provides insight into overall customer value trends and short-term customer activity for cohorts
- **🎯 Common Use Cases**: 
  - Cohort revenue analysis
  - Daily trend tracking
  - Revenue pattern identification
- **📈 Related KPIs**: 
  - Daily revenue
  - Cohort growth metrics

### 📈 Analysis

- Calculate the daily net revenue.
- Find the total net revenue for each cohort.

#### Daily Net Revenue

**`CREATE VIEW`**

1. Get the daily net revenue across all customers.
   - Select `orderdate`.
   - Use `SUM` to calculate the net revenue (`quantity * netprice * exchangerate`).
   - Use `GROUP BY` to group by `orderdate`.

In [4]:
%%sql

CREATE VIEW daily_revenue AS
SELECT 
    orderdate,
    SUM(quantity * netprice * exchangerate) AS total_revenue
FROM sales
GROUP BY 
    orderdate

- **View the View in DBeaver:**
    1. In the Database Navigator on the left, click the `Views` folder.
    2. Refresh the `Views` folder using `F5`.
    3. Double-click the view you created, named `daily_revenue`.
    4. Go to the `Data` tab (if it doesn't open there by default).

<img src="../Resources/images/5.1_dbeaver_views.gif" alt="DBeaver Query Results" width="50%">

- **View the View in Query:**

In [5]:
%%sql

SELECT *
FROM daily_revenue

|  | orderdate | total_revenue |
|---|---|---|
| 0 | 2017-10-23 | 74893.02 |
| 1 | 2023-04-24 | 43321.33 |
| 2 | 2017-12-28 | 101464.75 |
| 3 | 2015-01-19 | 12002.09 |
| 4 | 2019-02-12 | 156723.97 |
| ... | ... | ... |
| 3289 | 2023-10-19 | 114969.37 |
| 3290 | 2017-10-29 | 649.59 |
| 3291 | 2023-02-08 | 158675.41 |
| 3292 | 2021-08-03 | 47364.43 |
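
The rows above come back in no particular order because neither the view nor the query specifies one. As a hedged sketch of how a view can be queried like a table, including sorting, the snippet below builds a small version of the daily revenue view. It assumes an in-memory SQLite database in place of PostgreSQL, and `toy_sales` and `toy_daily_revenue` are hypothetical names with made-up rows.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE toy_sales (
        orderdate TEXT, quantity INTEGER, netprice REAL, exchangerate REAL
    )
""")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?, ?)",
    [
        ("2024-01-02", 1, 100.0, 1.0),
        ("2024-01-01", 2, 50.0, 0.5),
        ("2024-01-02", 3, 10.0, 2.0),
    ],
)

# Same shape as the daily_revenue view: one row per order date.
conn.execute("""
    CREATE VIEW toy_daily_revenue AS
    SELECT orderdate, SUM(quantity * netprice * exchangerate) AS total_revenue
    FROM toy_sales
    GROUP BY orderdate
""")

# Sorting is applied when querying the view, just as with a table.
rows = conn.execute("""
    SELECT orderdate, total_revenue
    FROM toy_daily_revenue
    ORDER BY orderdate
""").fetchall()

print(rows)  # [('2024-01-01', 50.0), ('2024-01-02', 160.0)]
```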


- **Remove the View:**  
  **⛔️ WARNING: This permanently deletes the view definition! The underlying tables and their data are not affected.**
    - **Delete from DBeaver:**
        1. Click the `Views` folder.
        2. Refresh the `Views` folder using `F5`.
        3. Right-click the view you created, named `daily_revenue`.
        4. Click `Delete` and confirm.
            - Note: Select "Cascade Delete" if you also want to delete all other views that depend on this view.

    - **Delete with SQL Query:**

In [6]:
%%sql

DROP VIEW daily_revenue;
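
As a rough illustration of what dropping a view does and does not remove, the sketch below uses an in-memory SQLite database (an assumption, standing in for PostgreSQL) with the hypothetical names `toy_sales` and `toy_daily_revenue`. The `sqlite_master` catalog used to list views is SQLite-specific. `DROP VIEW IF EXISTS` is also valid PostgreSQL and avoids an error when the view is already gone.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (orderdate TEXT, revenue REAL)")
conn.execute("INSERT INTO toy_sales VALUES ('2024-01-01', 20.0)")
conn.execute("""
    CREATE VIEW toy_daily_revenue AS
    SELECT orderdate, SUM(revenue) AS total_revenue
    FROM toy_sales
    GROUP BY orderdate
""")

def list_views():
    """Return the names of all views (sqlite_master is SQLite's catalog table)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'view' ORDER BY name"
    )
    return [name for (name,) in rows]

views_before = list_views()

# IF EXISTS makes the statement safe to re-run: the second call is a no-op.
conn.execute("DROP VIEW IF EXISTS toy_daily_revenue")
conn.execute("DROP VIEW IF EXISTS toy_daily_revenue")

views_after = list_views()
table_rows = conn.execute("SELECT COUNT(*) FROM toy_sales").fetchone()[0]

print(views_before)  # ['toy_daily_revenue']
print(views_after)   # []
print(table_rows)    # 1
```

Only the view definition disappears; the row in the base table is still there.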

#### Create Project View

**`CREATE VIEW`**

1. Get the daily net revenue and number of orders for each customer.
   - Select `customerkey` and `orderdate`.
   - Use `GROUP BY` to group by `customerkey` and `orderdate`.
   - Use `SUM` to calculate `total_net_revenue`, the total net revenue for each customer per day.
   - Use `COUNT` to calculate `num_orders`, the number of orders for each customer per day.

In [7]:
%%sql

SELECT 
    s.customerkey,
    s.orderdate,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
    COUNT(s.orderkey) AS num_orders
FROM sales s
GROUP BY 
    s.customerkey,
    s.orderdate


|  | customerkey | orderdate | total_net_revenue | num_orders |
|---|---|---|---|---|
| 0 | 1506769 | 2022-03-04 | 996.79 | 2 |
| 1 | 909157 | 2021-11-16 | 1565.80 | 2 |
| 2 | 2047462 | 2020-12-09 | 34.06 | 1 |
| 3 | 1933480 | 2021-11-10 | 45.90 | 2 |
| 4 | 1701958 | 2017-10-05 | 5144.64 | 1 |
| ... | ... | ... | ... | ... |
| 83094 | 1273185 | 2023-09-16 | 3452.05 | 3 |
| 83095 | 420797 | 2022-12-29 | 278.90 | 1 |
| 83096 | 642485 | 2023-03-08 | 574.31 | 3 |
| 83097 | 863441 | 2023-03-23 | 475.84 | 3 |


2. Join with customer table to add demographic data:
   - Use `LEFT JOIN` with `customer` table on `customerkey` to preserve all sales records
   - Add customer columns:
     - `countryfull`: Customer's country of residence
     - `age`: Customer's age
     - `givenname`: Customer's first name
     - `surname`: Customer's last name
   - Keep existing aggregations:
     - `SUM()` for total revenue
     - `COUNT()` for number of orders
   - Group by all non-aggregated columns


In [8]:
%%sql

SELECT
    s.customerkey,
    s.orderdate,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
    COUNT(s.orderkey),
    c.countryfull,
    c.age,
    c.givenname,
    c.surname
FROM sales s 
LEFT JOIN customer c ON c.customerkey = s.customerkey
GROUP BY
    s.customerkey,
    s.orderdate,
    c.countryfull,
    c.age,
    c.givenname,
    c.surname


|  | customerkey | orderdate | total_net_revenue | count | countryfull | age | givenname | surname |
|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 2021-03-08 | 2217.41 | 1 | Australia | 55 | Julian | McGuigan |
| 1 | 180 | 2018-07-28 | 525.31 | 1 | Australia | 65 | Gabriel | Bosanquet |
| 2 | 180 | 2023-08-28 | 1984.90 | 2 | Australia | 65 | Gabriel | Bosanquet |
| 3 | 185 | 2019-06-01 | 1395.52 | 1 | Australia | 40 | Gabrielle | Castella |
| 4 | 243 | 2016-05-19 | 287.67 | 1 | Australia | 66 | Maya | Atherton |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83094 | 2099697 | 2022-09-13 | 38.20 | 3 | United States | 54 | Phillipp | Maier |
| 83095 | 2099711 | 2016-08-13 | 2067.75 | 1 | United States | 80 | Katerina | Pavlícková |
| 83096 | 2099711 | 2017-08-14 | 3940.92 | 1 | United States | 80 | Katerina | Pavlícková |
| 83097 | 2099743 | 2022-03-17 | 469.62 | 2 | United States | 21 | Luciana | Almonte |


3. Create a query that:
   - Uses the previous query (including the join to the `customer` table) as a CTE named `customer_revenue`
   - Calculates `first_purchase_date` and `cohort_year` using window functions:
       - `MIN(orderdate) OVER (PARTITION BY customerkey)` gets the earliest purchase date per customer
       - `EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey))` gets the cohort year
   - Returns customer purchase data with demographics and cohort
   
> **Why didn't we include the window functions in the original query instead of using a CTE?**  
> Window functions are evaluated after `GROUP BY`, so they operate on the aggregated rows rather than on the raw `sales` rows.  
> Technically, we could have used the window function in the original query: `orderdate` is a grouping column, so every distinct `orderdate` per customer survives the aggregation and `MIN(orderdate) OVER (PARTITION BY customerkey)` still returns the correct first purchase date.  
> However, in my opinion, mixing window functions with `GROUP BY` is not a great practice because:
> 1. If the order of operations isn't understood, the window functions may yield unexpected results.
> 2. The CTE approach makes the query more readable.
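
The claim that both forms give the same result here can be checked on a toy table. This is a sketch under stated assumptions: it runs on an in-memory SQLite database rather than PostgreSQL (window functions need SQLite 3.25 or newer), and `toy_sales` with its rows is hypothetical data, simplified to a single `revenue` column.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (customerkey INTEGER, orderdate TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", [
    (1, "2023-05-01", 10.0),
    (1, "2023-05-01", 5.0),
    (1, "2024-02-10", 7.0),
    (2, "2024-03-15", 3.0),
])

# Form 1: window function in the same SELECT as GROUP BY.
# The window runs over the grouped rows, one per (customerkey, orderdate).
single_query = conn.execute("""
    SELECT customerkey, orderdate, SUM(revenue) AS total_net_revenue,
           MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date
    FROM toy_sales
    GROUP BY customerkey, orderdate
    ORDER BY customerkey, orderdate
""").fetchall()

# Form 2: aggregate in a CTE first, then apply the window function.
cte_query = conn.execute("""
    WITH customer_revenue AS (
        SELECT customerkey, orderdate, SUM(revenue) AS total_net_revenue
        FROM toy_sales
        GROUP BY customerkey, orderdate
    )
    SELECT cr.*,
           MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date
    FROM customer_revenue cr
    ORDER BY cr.customerkey, cr.orderdate
""").fetchall()

print(single_query == cte_query)  # True
for row in cte_query:
    print(row)
# (1, '2023-05-01', 15.0, '2023-05-01')
# (1, '2024-02-10', 7.0, '2023-05-01')
# (2, '2024-03-15', 3.0, '2024-03-15')
```

The two forms agree only because `orderdate` is one of the grouping columns; the CTE version makes the two stages explicit, which is the readability argument made above.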

In [9]:
%%sql

WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey),
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
)
SELECT
	cr.*,
	MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
	EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
FROM customer_revenue cr 
    

|  | customerkey | orderdate | total_net_revenue | count | countryfull | age | givenname | surname | first_purchase_date | cohort_year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 2021-03-08 | 2217.41 | 1 | Australia | 55 | Julian | McGuigan | 2021-03-08 | 2021 |
| 1 | 180 | 2018-07-28 | 525.31 | 1 | Australia | 65 | Gabriel | Bosanquet | 2018-07-28 | 2018 |
| 2 | 180 | 2023-08-28 | 1984.90 | 2 | Australia | 65 | Gabriel | Bosanquet | 2018-07-28 | 2018 |
| 3 | 185 | 2019-06-01 | 1395.52 | 1 | Australia | 40 | Gabrielle | Castella | 2019-06-01 | 2019 |
| 4 | 243 | 2016-05-19 | 287.67 | 1 | Australia | 66 | Maya | Atherton | 2016-05-19 | 2016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83094 | 2099697 | 2022-09-13 | 38.20 | 3 | United States | 54 | Phillipp | Maier | 2022-09-13 | 2022 |
| 83095 | 2099711 | 2016-08-13 | 2067.75 | 1 | United States | 80 | Katerina | Pavlícková | 2016-08-13 | 2016 |
| 83096 | 2099711 | 2017-08-14 | 3940.92 | 1 | United States | 80 | Katerina | Pavlícková | 2016-08-13 | 2016 |
| 83097 | 2099743 | 2022-03-17 | 469.62 | 2 | United States | 21 | Luciana | Almonte | 2022-03-17 | 2022 |


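4. Create a view in DBeaver by prefixing the query from step 3 with `CREATE OR REPLACE VIEW cohort_analysis AS`.  
    - NOTE: `OR REPLACE` overwrites an existing view of the same name instead of raising an error. In PostgreSQL, the replacement query must keep the existing columns (same names, order, and data types); new columns can only be added at the end.
    - The `COUNT` column is also given the alias `num_orders` here.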

In [11]:
%%sql 

-- DROP VIEW IF EXISTS cohort_analysis;

CREATE OR REPLACE VIEW cohort_analysis AS  --create view as cohort_analysis
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
)
SELECT
	cr.*,
	MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
	EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
FROM customer_revenue cr 

DBeaver output:

![Create View](../Resources/images/5.1_dbeaver_create_view.png)

5. Use the view to calculate the total net revenue for each cohort.  

   - Query the `cohort_analysis` view to retrieve `cohort_year` and `total_net_revenue`.  
   - 🔔 Use `SUM(total_net_revenue)` to calculate the total revenue for all customers within each cohort, across all order dates.  
   - 🔔 Group by `cohort_year` to ensure the total revenue is aggregated at the cohort level.  
   - 🔔 Select `cohort_year` and `total_revenue` for the final output.  

In [14]:
%%sql

SELECT
    cohort_year,
    SUM(total_net_revenue) AS total_revenue
FROM cohort_analysis
GROUP BY 
    cohort_year
ORDER BY cohort_year;


|  | cohort_year | total_revenue |
|---|---|---|
| 0 | 2015 | 14892230.47 |
| 1 | 2016 | 18360521.74 |
| 2 | 2017 | 21979733.96 |
| 3 | 2018 | 36460385.42 |
| 4 | 2019 | 36696243.88 |
| 5 | 2020 | 11921900.97 |
| 6 | 2021 | 18387736.18 |
| 7 | 2022 | 29872808.30 |
| 8 | 2023 | 14979328.33 |
| 9 | 2024 | 2856649.33 |


<img src="../Resources/images/5.1_cohort_rev.png" alt="Cohort Revenue" width="50%">
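
Grouping the view by `cohort_year` alone sums each cohort's revenue over all order dates; adding `orderdate` to the `GROUP BY` gives a daily breakdown per cohort instead. The sketch below contrasts the two on toy data. It assumes an in-memory SQLite database in place of PostgreSQL, so `strftime('%Y', ...)` is used where PostgreSQL would use `EXTRACT(YEAR FROM ...)`, and `toy_sales` and `toy_cohort_analysis` are hypothetical names.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (customerkey INTEGER, orderdate TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", [
    (1, "2023-05-01", 10.0),
    (1, "2024-02-10", 7.0),
    (2, "2024-02-10", 3.0),
    (2, "2024-03-15", 4.0),
])

# Simplified analogue of cohort_analysis: the cohort is the year of the first purchase.
conn.execute("""
    CREATE VIEW toy_cohort_analysis AS
    WITH customer_revenue AS (
        SELECT customerkey, orderdate, SUM(revenue) AS total_net_revenue
        FROM toy_sales
        GROUP BY customerkey, orderdate
    )
    SELECT
        cr.*,
        CAST(strftime('%Y', MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey))
             AS INTEGER) AS cohort_year
    FROM customer_revenue cr
""")

# Total per cohort, across all order dates.
cohort_totals = conn.execute("""
    SELECT cohort_year, SUM(total_net_revenue) AS total_revenue
    FROM toy_cohort_analysis
    GROUP BY cohort_year
    ORDER BY cohort_year
""").fetchall()

# Daily breakdown per cohort: orderdate is added to the grouping.
daily_cohort_totals = conn.execute("""
    SELECT cohort_year, orderdate, SUM(total_net_revenue) AS total_revenue
    FROM toy_cohort_analysis
    GROUP BY cohort_year, orderdate
    ORDER BY cohort_year, orderdate
""").fetchall()

print(cohort_totals)
# [(2023, 17.0), (2024, 7.0)]
print(daily_cohort_totals)
# [(2023, '2023-05-01', 10.0), (2023, '2024-02-10', 7.0),
#  (2024, '2024-02-10', 3.0), (2024, '2024-03-15', 4.0)]
```

Note that customer 1's 2024 purchase still counts toward the 2023 cohort, because cohort membership is fixed by the first purchase date.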