<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/4_Lag_Lead.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Lag / Lead

### 🥅 Analysis Goals

- **Month-over-Month Growth:** Calculate the percent change in net revenue from one month to the next.
- **Year-over-Year LTV Changes:** Analyze changes in average LTV between consecutive cohorts to track year-over-year shifts in customer value.  
- **LTV Changes Between Cohorts:** Evaluate how the current cohort's average LTV compares to the next cohort to detect evolving trends.  

### 📘 Concepts Covered

- `LAG`
- `LEAD`
- `FIRST_VALUE`
- `LAST_VALUE`
- `NTH_VALUE`

---

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## FIRST VALUE, LAG, & LEAD - QUICK DEMO

### 📈 Analysis

- Investigate functions to explore month-over-month growth for 2023

1. First, let's get the monthly net revenue for 2023.

In [22]:
%%sql

SELECT 
    TO_CHAR(orderdate, 'YYYY-MM') as month,
    SUM(quantity * netprice * exchangerate) as net_revenue
FROM sales
WHERE EXTRACT(YEAR FROM orderdate) = 2023
GROUP BY month
ORDER BY month

Unnamed: 0,month,net_revenue
0,2023-01,3664431.34
1,2023-02,4465204.57
2,2023-03,2244316.52
3,2023-04,1162796.16
4,2023-05,2943005.99
5,2023-06,2864500.03
6,2023-07,2337639.34
7,2023-08,2623919.79
8,2023-09,2622774.85
9,2023-10,2551322.61


2. Next, we'll investigate using `FIRST_VALUE`, `LAG`, and `LEAD` for `net_revenue`.

In [32]:
%%sql

WITH monthly_sales AS (
    SELECT 
        TO_CHAR(orderdate, 'YYYY-MM') as month,
        SUM(quantity * netprice * exchangerate) as net_revenue
    FROM sales
    WHERE EXTRACT(YEAR FROM orderdate) = 2023
    GROUP BY month
    ORDER BY month
    
)
SELECT 
    month,
    net_revenue,
    FIRST_VALUE(net_revenue) OVER (ORDER BY month) as first_month_revenue,
    LAST_VALUE(net_revenue) OVER (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as last_month_revenue,
    NTH_VALUE(net_revenue, 3) OVER (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as third_month_revenue,
    LAG(net_revenue) OVER (ORDER BY month) as previous_month_revenue,
    LEAD(net_revenue) OVER (ORDER BY month) as next_month_revenue
FROM monthly_sales;

Unnamed: 0,month,net_revenue,first_month_revenue,last_month_revenue,third_month_revenue,previous_month_revenue,next_month_revenue
0,2023-01,3664431.34,3664431.34,2928550.93,2244316.52,,4465204.57
1,2023-02,4465204.57,3664431.34,2928550.93,2244316.52,3664431.34,2244316.52
2,2023-03,2244316.52,3664431.34,2928550.93,2244316.52,4465204.57,1162796.16
3,2023-04,1162796.16,3664431.34,2928550.93,2244316.52,2244316.52,2943005.99
4,2023-05,2943005.99,3664431.34,2928550.93,2244316.52,1162796.16,2864500.03
5,2023-06,2864500.03,3664431.34,2928550.93,2244316.52,2943005.99,2337639.34
6,2023-07,2337639.34,3664431.34,2928550.93,2244316.52,2864500.03,2623919.79
7,2023-08,2623919.79,3664431.34,2928550.93,2244316.52,2337639.34,2622774.85
8,2023-09,2622774.85,3664431.34,2928550.93,2244316.52,2623919.79,2551322.61
9,2023-10,2551322.61,3664431.34,2928550.93,2244316.52,2622774.85,2700103.38
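The explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` frame on `LAST_VALUE` and `NTH_VALUE` matters: with only an `ORDER BY`, the default frame ends at the current row, so `LAST_VALUE` just returns the current row's value. Here is a minimal sketch of that difference. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer, which applies the same default frame as PostgreSQL) rather than the course database, and the `monthly_sales` rows are made-up illustrative values.

```python
import sqlite3

# In-memory SQLite database with hypothetical monthly figures (not course data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2023-01", 100.0), ("2023-02", 250.0), ("2023-03", 175.0)],
)

rows = conn.execute("""
    SELECT
        month,
        -- Default frame: stops at the current row
        LAST_VALUE(net_revenue) OVER (ORDER BY month) AS default_frame,
        -- Explicit frame: spans the whole partition
        LAST_VALUE(net_revenue) OVER (
            ORDER BY month
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS full_frame
    FROM monthly_sales
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
# ('2023-01', 100.0, 175.0)
# ('2023-02', 250.0, 175.0)
# ('2023-03', 175.0, 175.0)
```

`FIRST_VALUE` does not need the explicit frame because the default frame always starts at the first row of the partition.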


3. Finally, let's take it a step further and apply this in a real-world scenario to analyze month-over-month growth.

In [28]:
%%sql

WITH monthly_sales AS (
    SELECT 
        TO_CHAR(orderdate, 'YYYY-MM') as month,
        SUM(quantity * netprice * exchangerate) as net_revenue
    FROM sales
    WHERE EXTRACT(YEAR FROM orderdate) = 2023
    GROUP BY month
    ORDER BY month
    
)
SELECT 
    month,
    net_revenue,
    LAG(net_revenue) OVER (ORDER BY month) as previous_month_revenue,
    ROUND(
        ((net_revenue - LAG(net_revenue) OVER (ORDER BY month)) / 
        LAG(net_revenue) OVER (ORDER BY month) * 100)::numeric, 
        2
    ) as month_over_month_rev_growth
FROM monthly_sales;

Unnamed: 0,month,net_revenue,previous_month_revenue,month_over_month_rev_growth
0,2023-01,3664431.34,,
1,2023-02,4465204.57,3664431.34,21.85
2,2023-03,2244316.52,4465204.57,-49.74
3,2023-04,1162796.16,2244316.52,-48.19
4,2023-05,2943005.99,1162796.16,153.1
5,2023-06,2864500.03,2943005.99,-2.67
6,2023-07,2337639.34,2864500.03,-18.39
7,2023-08,2623919.79,2337639.34,12.25
8,2023-09,2622774.85,2623919.79,-0.04
9,2023-10,2551322.61,2622774.85,-2.72


<img src="../Resources/images/3.4_mom_rev_growth.png" alt="M-o-M Growth" width="50%">
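
The growth column follows the formula `(current - previous) / previous * 100`. As a quick sanity check, this sketch redoes that arithmetic in plain Python for the first three months, using the `net_revenue` values copied from the output above. The first month has no previous value, which mirrors the `NULL` that `LAG` returns.

```python
# Net revenue for 2023-01 through 2023-03, copied from the query output above
net_revenue = [3664431.34, 4465204.57, 2244316.52]

# The first month has no previous month, so its growth is undefined (NULL in SQL)
growth = [None]
for previous, current in zip(net_revenue, net_revenue[1:]):
    growth.append(round((current - previous) / previous * 100, 2))

print(growth)  # [None, 21.85, -49.74]
```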

---
## FIRST VALUE

### 📝 Notes

`FIRST_VALUE`

- **FIRST_VALUE**: Returns the first value in an ordered partition of data.
- Syntax:
  ```sql
  SELECT
    FIRST_VALUE(column_name) OVER(
        PARTITION BY partition_expression
        ORDER BY order_expression
    ) AS window_column_alias
  FROM table_name;
  ```
- Retrieve the earliest value within a group or window, such as the first purchase date or initial value in a time series.
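
To see the `PARTITION BY` form in action, here is a small sketch that attaches each customer's first purchase amount to every one of their orders. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) instead of the course's PostgreSQL database, and the `purchases` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical purchases table; names and values are for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customerkey INTEGER, orderdate TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "2023-01-05", 40.0),
        (1, "2023-02-10", 65.0),
        (2, "2023-01-20", 120.0),
        (2, "2023-03-02", 90.0),
    ],
)

rows = conn.execute("""
    SELECT
        customerkey,
        orderdate,
        amount,
        -- Earliest order amount within each customer's partition
        FIRST_VALUE(amount) OVER (
            PARTITION BY customerkey
            ORDER BY orderdate
        ) AS first_purchase_amount
    FROM purchases
    ORDER BY customerkey, orderdate
""").fetchall()

for row in rows:
    print(row)
# (1, '2023-01-05', 40.0, 40.0)
# (1, '2023-02-10', 65.0, 40.0)
# (2, '2023-01-20', 120.0, 120.0)
# (2, '2023-03-02', 90.0, 120.0)
```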

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Initial Purchase Value: Amount spent on customer's first transaction
  - Customer Starting Point: Baseline transaction used for comparison
  - Reference Transaction: First occurrence in a series of ordered events
- **💡 Why It Matters**: Establishes baseline metrics for measuring customer behavior changes: 
  - Identifies trends in customer value, such as growth or decline, across different cohorts.  
  - Provides insights into the success of customer acquisition and retention strategies over the years.  
- **🎯 Common Use Cases**: 
  - Comparing current to first purchase
  - Tracking changes from initial transaction
  - Measuring customer value growth
- **📈 Related KPIs**: Customer growth rate, purchase value trends

---
## LAG

### 📝 Notes

`LAG`

- **LAG**: Returns the value of a column from a specified number of rows before the current row in a partition.
- Syntax:
  ```sql
  SELECT
    LAG(column_name, offset, default_value) OVER(
        PARTITION BY partition_expression
        ORDER BY order_expression
    ) AS window_column_alias
  FROM table_name;
  ```
- **Parameters**:
  - `offset` (optional, default: `1`): How many rows back to look.
  - `default_value` (optional, default: `NULL`): Value to return when the row `offset` rows back does not exist.
- Compare current and previous values, such as tracking changes in sales or stock prices.
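
The optional `offset` and `default_value` arguments are easiest to understand side by side. This is a minimal sketch, assuming Python's built-in `sqlite3` module (SQLite 3.25 or newer) and a made-up `monthly_sales` table rather than the course data.

```python
import sqlite3

# Hypothetical monthly figures for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2023-01", 100.0), ("2023-02", 250.0), ("2023-03", 175.0), ("2023-04", 300.0)],
)

rows = conn.execute("""
    SELECT
        month,
        net_revenue,
        -- offset defaults to 1, default_value defaults to NULL
        LAG(net_revenue) OVER (ORDER BY month) AS prev_month,
        -- look two rows back and return 0.0 when that row does not exist
        LAG(net_revenue, 2, 0.0) OVER (ORDER BY month) AS two_months_back
    FROM monthly_sales
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
# ('2023-01', 100.0, None, 0.0)
# ('2023-02', 250.0, 100.0, 0.0)
# ('2023-03', 175.0, 250.0, 100.0)
# ('2023-04', 300.0, 175.0, 250.0)
```

Note that a default of `0.0` would turn a growth calculation into a division by zero, so leaving the default as `NULL` is usually the safer choice when the lagged value is used as a denominator.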



### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Previous Period Comparison: Analysis of current vs. previous time period
  - Sequential Analysis: Study of events in chronological order
  - Period-over-Period Change: Difference between current and previous values
- **💡 Why It Matters**: Enables trend analysis and identification of changes in customer behavior
  - Highlights year-over-year changes in customer value, revealing potential growth or decline in customer behavior.
  - Shows how recent cohorts compare to their immediate predecessors, helping assess the impact of short-term strategies.  
- **🎯 Common Use Cases**: Month-over-month analysis, purchase pattern tracking
- **📈 Related KPIs**: Growth rate, change in purchase value, trend indicators

### 📈 Analysis

- Calculate the change in average lifetime value (LTV) between each cohort and the previous cohort.  


#### Year-over-Year LTV Changes

**`LAG`**

1. Use this query, which calculates `avg_cohort_ltv` by `cohort_year`.
   - Define a CTE `yearly_cohort` to calculate the cohort year and total net revenue per customer.  
       - Use `EXTRACT(YEAR FROM MIN(orderdate))` to determine the cohort year for each customer.  
       - Calculate `total_customer_net_revenue` as the sum of `quantity * netprice * exchangerate`.  
       - Group the results by `customerkey` to calculate these metrics for each customer.  
   - Define a CTE `cohort_summary` to calculate the average LTV for each cohort year.  
       - Calculate `avg_cohort_ltv` as `AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year)` to determine the average lifetime value of each cohort.  
       - Return `cohort_year`, `customerkey`, `customer_ltv`, and `avg_cohort_ltv`.
   - In the main query:
       - Select the distinct values for `cohort_year` and `avg_cohort_ltv`.
       - Order by `cohort_year`.

In [3]:
%%sql

WITH yearly_cohort AS (
    SELECT 
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),

cohort_summary AS (
    SELECT 
        cohort_year,
        customerkey,
        total_customer_net_revenue AS customer_ltv,
        AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv-- Added
    FROM yearly_cohort
)

SELECT DISTINCT
    cohort_year,
    avg_cohort_ltv
FROM cohort_summary
ORDER BY 
    cohort_year

Unnamed: 0,cohort_year,avg_cohort_ltv
0,2015,5271.59
1,2016,5404.92
2,2017,5403.08
3,2018,4896.64
4,2019,4731.95
5,2020,3933.32
6,2021,3943.33
7,2022,3315.52
8,2023,2543.18
9,2024,2037.55


2. Calculate the change in LTV between the current cohort and the previous cohort using the `LAG` window function.  
   - Define a CTE `yearly_cohort` to calculate the cohort year and total net revenue per customer.  
       - Use `EXTRACT(YEAR FROM MIN(orderdate))` to determine the cohort year for each customer.  
       - Calculate `total_customer_net_revenue` as the sum of `quantity * netprice * exchangerate`.  
       - Group the results by `customerkey` to calculate these metrics for each customer.  
   - Define a CTE `cohort_summary` to calculate the average LTV for each cohort year.  
       - Calculate `avg_cohort_ltv` as `AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year)` to determine the average lifetime value of each cohort.  
       - Return `cohort_year`, `customerkey`, `customer_ltv`, and `avg_cohort_ltv`.
   - Define a CTE `cohort_final` to get the average LTV value for each cohort:
       - Select the distinct values for `cohort_year` and `avg_cohort_ltv`.
       - Order by `cohort_year`.
   - In the main query, calculate the previous cohort's LTV and the change in LTV using `LAG`.  
       - Use `LAG(avg_cohort_ltv) OVER (ORDER BY cohort_year)` to fetch the previous cohort's average LTV, naming it `prev_cohort_ltv`.  
       - 🔔 Calculate the percentage change in LTV as `100 * (avg_cohort_ltv - LAG(avg_cohort_ltv) OVER (ORDER BY cohort_year)) / LAG(avg_cohort_ltv) OVER (ORDER BY cohort_year)` and name it `ltv_change`.  

In [6]:
%%sql

WITH yearly_cohort AS (
    SELECT 
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),

cohort_summary AS (
    SELECT 
        cohort_year,
        customerkey,
        total_customer_net_revenue AS customer_ltv,
        AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv-- Added
    FROM yearly_cohort
),

cohort_final AS (
    SELECT DISTINCT
        cohort_year,
        avg_cohort_ltv
    FROM cohort_summary
    ORDER BY 
        cohort_year
)

SELECT 
    *,
    LAG(avg_cohort_ltv) OVER (ORDER BY cohort_year) AS prev_cohort_ltv,
    100 * (avg_cohort_ltv - LAG(avg_cohort_ltv) OVER (ORDER BY cohort_year)) /
        LAG(avg_cohort_ltv) OVER (ORDER BY cohort_year) AS ltv_change
FROM cohort_final

Unnamed: 0,cohort_year,avg_cohort_ltv,prev_cohort_ltv,ltv_change
0,2015,5271.59,,
1,2016,5404.92,5271.59,2.53
2,2017,5403.08,5404.92,-0.03
3,2018,4896.64,5403.08,-9.37
4,2019,4731.95,4896.64,-3.36
5,2020,3933.32,4731.95,-16.88
6,2021,3943.33,3933.32,0.25
7,2022,3315.52,3943.33,-15.92
8,2023,2543.18,3315.52,-23.29
9,2024,2037.55,2543.18,-19.88


<img src="../Resources/images/3.4_cohort_ltv_prev.png" alt="Cohort LTV Change" width="50%">

---
## LEAD

### 📝 Notes

`LEAD`

- **LEAD**: Returns the value of a column from a specified number of rows after the current row in a partition.
- Syntax:
  ```sql
  SELECT
    LEAD(column_name, offset, default_value) OVER(
        PARTITION BY partition_expression
        ORDER BY order_expression
    ) AS window_column_alias
  FROM table_name;
  ```
- **Parameters**:
  - `offset` (optional, default: `1`): How many rows forward to look.
  - `default_value` (optional, default: `NULL`): Value to return when the row `offset` rows forward does not exist.
- Compare current and future values, such as forecasting or tracking upcoming events. 
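
`LEAD` takes the same optional arguments as `LAG` but looks forward. The sketch below is a minimal illustration, assuming Python's built-in `sqlite3` module (SQLite 3.25 or newer) and a made-up `monthly_sales` table rather than the course data.

```python
import sqlite3

# Hypothetical monthly figures for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2023-01", 100.0), ("2023-02", 250.0), ("2023-03", 175.0), ("2023-04", 300.0)],
)

rows = conn.execute("""
    SELECT
        month,
        net_revenue,
        -- offset defaults to 1, default_value defaults to NULL
        LEAD(net_revenue) OVER (ORDER BY month) AS next_month,
        -- look two rows ahead and return 0.0 when that row does not exist
        LEAD(net_revenue, 2, 0.0) OVER (ORDER BY month) AS two_months_ahead
    FROM monthly_sales
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
# ('2023-01', 100.0, 250.0, 175.0)
# ('2023-02', 250.0, 175.0, 300.0)
# ('2023-03', 175.0, 300.0, 0.0)
# ('2023-04', 300.0, None, 0.0)
```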

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Forward-Looking Analysis: Comparison with future period values
  - Next Period Prediction: Using known future values for current analysis
  - Purchase Sequence: Series of customer transactions over time
- **💡 Why It Matters**: Helps identify upcoming trends and predict customer behavior
    - Highlights shifts in customer value as cohorts evolve, identifying trends in decreasing or increasing value.  
    - Provides insights into whether newer cohorts are more or less valuable compared to the preceding ones, helping refine acquisition strategies.  
- **🎯 Common Use Cases**: 
  - Future value comparison
  - Purchase pattern analysis
  - Customer behavior prediction
- **📈 Related KPIs**: Future growth indicators, predictive metrics

### 📈 Analysis

- Calculate the change in average lifetime value (LTV) between each cohort and the next cohort.  

#### LTV Changes Between Cohorts

**`LEAD`**

1. Use the same query from before but remove the `prev_cohort_ltv` and `ltv_change` columns. 
   - Define a CTE `yearly_cohort` to calculate the cohort year and total net revenue per customer.  
       - Use `EXTRACT(YEAR FROM MIN(orderdate))` to determine the cohort year for each customer.  
       - Calculate `total_customer_net_revenue` as the sum of `quantity * netprice * exchangerate`.  
       - Group the results by `customerkey` to calculate these metrics for each customer.  
   - Define a CTE `cohort_summary` to calculate the average LTV for each cohort year.  
       - Calculate `avg_cohort_ltv` as `AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year)` to determine the average lifetime value of each cohort.  
       - Return `cohort_year`, `customerkey`, `customer_ltv`, and `avg_cohort_ltv`.
   - Define a CTE `cohort_final` to get the average LTV value for each cohort:
       - Select the distinct values for `cohort_year` and `avg_cohort_ltv`.
       - Order by `cohort_year`.
   - In the main query, select only the `cohort_year` and `avg_cohort_ltv` columns.  
       - 🔔 The `prev_cohort_ltv` and `ltv_change` columns from the previous query are removed to simplify the output.  

In [8]:
%%sql

WITH yearly_cohort AS (
    SELECT 
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),

cohort_summary AS (
    SELECT 
        cohort_year,
        customerkey,
        total_customer_net_revenue AS customer_ltv,
        AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv-- Added
    FROM yearly_cohort
),

cohort_final AS (
    SELECT DISTINCT
        cohort_year,
        avg_cohort_ltv
    FROM cohort_summary
    ORDER BY 
        cohort_year
)

SELECT 
    *
FROM cohort_final

Unnamed: 0,cohort_year,avg_cohort_ltv
0,2015,5271.59
1,2016,5404.92
2,2017,5403.08
3,2018,4896.64
4,2019,4731.95
5,2020,3933.32
6,2021,3943.33
7,2022,3315.52
8,2023,2543.18
9,2024,2037.55


2. Calculate the change in LTV between the current cohort and the next cohort using the `LEAD` window function. This version uses a shorter pair of CTEs that returns the same cohort averages.  
   - Define a CTE `cohort_analysis` to calculate the cohort year and total net revenue per customer.  
       - Use `EXTRACT(YEAR FROM MIN(orderdate))` to determine the cohort year for each customer.  
       - Calculate `total_net_revenue` as the sum of `quantity * netprice * exchangerate`.  
       - Group the results by `customerkey` to calculate these metrics for each customer.  
   - Define a CTE `cohort_totals` to calculate the average LTV for each cohort year.  
       - Calculate `avg_ltv` as `SUM(total_net_revenue) / COUNT(DISTINCT customerkey)` to determine the average lifetime value of each cohort.  
       - Group the results by `cohort_year`.
   - In the main query, calculate the next cohort's LTV and the change in LTV using `LEAD`.  
       - 🔔 Use `LEAD(avg_ltv) OVER (ORDER BY cohort_year)` to fetch the average LTV of the next cohort, naming it `next_cohort_ltv`.  
       - 🔔 Calculate the change in LTV as `LEAD(avg_ltv) OVER (ORDER BY cohort_year) - avg_ltv` and name it `ltv_change_next`.  

In [7]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),

cohort_totals AS (
    SELECT
        cohort_year,
        SUM(total_net_revenue) / COUNT(DISTINCT customerkey) AS avg_ltv   
    FROM cohort_analysis
    GROUP BY
        cohort_year
)

SELECT
    cohort_year,
    avg_ltv,
    LEAD(avg_ltv) OVER (ORDER BY cohort_year) AS next_cohort_ltv,
    LEAD(avg_ltv) OVER (ORDER BY cohort_year) - avg_ltv AS ltv_change_next -- Added
FROM cohort_totals;

Unnamed: 0,cohort_year,avg_ltv,next_cohort_ltv,ltv_change_next
0,2015,5271.59,5404.92,133.34
1,2016,5404.92,5403.08,-1.84
2,2017,5403.08,4896.64,-506.44
3,2018,4896.64,4731.95,-164.69
4,2019,4731.95,3933.32,-798.62
5,2020,3933.32,3943.33,10.0
6,2021,3943.33,3315.52,-627.81
7,2022,3315.52,2543.18,-772.34
8,2023,2543.18,2037.55,-505.63
9,2024,2037.55,,


<img src="../Resources/images/3.4_cohort_ltv_change_prev.png" alt="Cohort LTV Change" width="50%">

### 💡 Why analyze LTV Changes Between Cohorts?

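- **Reverse Perspective:** "LTV Changes Between Cohorts" is essentially "Year-over-Year LTV Changes" viewed in reverse: each cohort is compared to the next one instead of the previous one. Note that the `LAG` query above reports the change as a percentage, while the `LEAD` query reports the absolute difference.  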
- **Reason for Reverse Comparison:** 
    - This perspective helps identify how newer cohorts are performing compared to their predecessors, providing insights into declining or improving trends as cohorts evolve over time.  
    - By looking forward instead of backward, businesses can focus on emerging patterns and adjust acquisition or retention strategies for future cohorts.  
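
To make the reverse relationship concrete, this sketch computes the absolute change with both `LAG` and `LEAD` and checks that the `LEAD` difference on each row equals the `LAG` difference on the following row. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) rather than the course database, and it loads only the first four rounded cohort averages from the output above into a stand-in `cohort_totals` table.

```python
import sqlite3

# Stand-in table holding rounded cohort averages copied from the output above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cohort_totals (cohort_year INTEGER, avg_ltv REAL)")
conn.executemany(
    "INSERT INTO cohort_totals VALUES (?, ?)",
    [(2015, 5271.59), (2016, 5404.92), (2017, 5403.08), (2018, 4896.64)],
)

rows = conn.execute("""
    SELECT
        cohort_year,
        -- Backward view: change relative to the previous cohort
        avg_ltv - LAG(avg_ltv) OVER (ORDER BY cohort_year) AS change_from_prev,
        -- Forward view: change from this cohort to the next one
        LEAD(avg_ltv) OVER (ORDER BY cohort_year) - avg_ltv AS change_to_next
    FROM cohort_totals
    ORDER BY cohort_year
""").fetchall()

# Each row's forward change is the next row's backward change
shifted_match = all(
    rows[i][2] == rows[i + 1][1] for i in range(len(rows) - 1)
)
print(shifted_match)  # True
```

The two columns carry the same differences, shifted by one row: the first cohort has no backward change and the last cohort has no forward change.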