<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/5_Views/2_Project_Cohort_Revenue.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Project - Customer Revenue by Cohort

## Overview

### 🥅 Analysis Goals

Answer question 2️⃣ from the project: How do different customer groups generate revenue?

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Cohort Revenue: Total sales generated by a group of customers who started at the same time
  - Rolling Average: Moving average of customer value over a specific period
- **💡 Why It Matters**: Enables efficient analysis of cohort performance
  - Simplifies complex revenue calculations for repeated use
  - Provides a standardized way to track daily revenue patterns
  - Uses a 3-month rolling average to show recent, mid-term changes in customer value
- **📈 Related KPIs**: 
  - Daily revenue
  - Cohort growth metrics
  - Cumulative revenue trends
  - Average lifetime value
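
As a quick illustration of the rolling average term above, here is a minimal sketch in plain Python. The monthly figures are made up, and the choice to average over fewer values at the start of the series (instead of leaving gaps) is an assumption, not something this notebook specifies.

```python
from collections import deque

def rolling_average(values, window=3):
    """Trailing moving average; early entries average over the values seen so far."""
    recent = deque(maxlen=window)  # keeps only the last `window` values
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages

# Hypothetical monthly customer-value figures (illustration only)
monthly_value = [100.0, 200.0, 300.0, 400.0]
rolling = rolling_average(monthly_value, window=3)
print(rolling)  # [100.0, 150.0, 200.0, 300.0]
```

Each output value smooths the current month with up to two preceding months, which is why a single unusual month moves the rolling figure less than the raw one.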

## 📖 Background

> **NOTE:** This section recaps the cohort analysis we did previously. It wasn't covered in the video.

### Project Background

You're a data analyst at an e-commerce company, Contoso. Your stakeholders on the marketing and finance teams need insights to improve customer retention and maximize revenue. They have three key questions:

1️⃣ **Customer Segmentation**: Who are our most valuable customers?

2️⃣ **Cohort Analysis**: How do different customer groups generate long-term revenue? ⬅️

3️⃣ **Retention Analysis**: Which customers haven’t purchased recently?

Your job is to create a structured analysis using SQL that answers these questions and provides actionable insights for the business.

### Calculate Revenue by Cohort Year

- Group customers into cohorts based on their first purchase year and track their revenue over time.

In [10]:
%%sql

WITH yearly_cohort AS (
    SELECT DISTINCT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
    FROM sales
)
SELECT 
    y.cohort_year,
    EXTRACT(YEAR FROM s.orderdate) AS purchase_year,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
LEFT JOIN yearly_cohort y ON s.customerkey = y.customerkey
GROUP BY 
    y.cohort_year,
    purchase_year
ORDER BY 
    y.cohort_year, 
    purchase_year
LIMIT 10

| cohort_year | purchase_year | net_revenue |
|---|---|---|
| 2015 | 2015 | 7370979.48 |
| 2015 | 2016 | 392623.48 |
| 2015 | 2017 | 479841.31 |
| 2015 | 2018 | 1069850.87 |
| 2015 | 2019 | 1235991.48 |
| 2015 | 2020 | 386489.60 |
| 2015 | 2021 | 872845.99 |
| 2015 | 2022 | 1569787.72 |
| 2015 | 2023 | 1157633.91 |
| 2015 | 2024 | 356186.62 |


<img src="../Resources/images/3.1_cohort_year_rev.png" alt="Processing & Revenue" width="50%">

> **Graph Note:** As expected, total revenue grows over time, and earlier cohorts keep contributing to it each year. However, the revenue coming from earlier cohorts does not grow in proportion to total revenue.
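
The two steps of the query above (assign each customer a cohort year, then sum revenue per cohort and purchase year) can be sketched in plain Python. The rows below are hypothetical and only stand in for the `sales` table; the `net_revenue` field is assumed to already hold `quantity * netprice * exchangerate`.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (customerkey, orderdate, net_revenue); not real Contoso data
sales = [
    (1, date(2015, 3, 1), 100.0),
    (1, date(2016, 5, 9), 40.0),
    (2, date(2016, 1, 15), 70.0),
    (2, date(2016, 8, 2), 30.0),
]

# Step 1 (the CTE): cohort year = year of each customer's earliest order
first_order = {}
for customerkey, orderdate, _ in sales:
    if customerkey not in first_order or orderdate < first_order[customerkey]:
        first_order[customerkey] = orderdate
cohort_year = {key: first.year for key, first in first_order.items()}

# Step 2 (the outer GROUP BY): sum revenue per (cohort_year, purchase_year)
cohort_revenue = defaultdict(float)
for customerkey, orderdate, revenue in sales:
    cohort_revenue[(cohort_year[customerkey], orderdate.year)] += revenue

for key in sorted(cohort_revenue):
    print(key, cohort_revenue[key])
```

Customer 1 stays in the 2015 cohort even for the 2016 order, so that order lands in the `(2015, 2016)` bucket.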

### Total Count of Customers by Cohort

- Group customers into cohorts based on their first purchase year and count how many customers from each cohort purchased in each year.

In [11]:
%%sql

WITH yearly_cohort AS (
    SELECT DISTINCT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year,
        EXTRACT(YEAR FROM orderdate) AS purchase_year  --moved
    FROM sales
)
SELECT DISTINCT  -- added
    cohort_year,
    purchase_year, --added
    COUNT(*) OVER (PARTITION BY purchase_year, cohort_year) as num_customers  --added
FROM yearly_cohort
ORDER BY 
    purchase_year,
    cohort_year
LIMIT 10

| cohort_year | purchase_year | num_customers |
|---|---|---|
| 2015 | 2015 | 2825 |
| 2015 | 2016 | 126 |
| 2016 | 2016 | 3397 |
| 2015 | 2017 | 149 |
| 2016 | 2017 | 174 |
| 2017 | 2017 | 4068 |
| 2015 | 2018 | 348 |
| 2016 | 2018 | 374 |
| 2017 | 2018 | 473 |
| 2018 | 2018 | 7446 |


<img src="../Resources/images/3.2_cohort_customer_yearly.png" alt="Cohort Customers" width="50%">  

> **Graph Note:** Customers from earlier cohorts keep purchasing in later years, but their numbers are small compared to newly acquired customers.
> 
> This raises the question: why aren't more resources spent on customer retention?
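
The customer-count query relies on `SELECT DISTINCT` inside the CTE to collapse repeat orders before counting. A rough Python sketch of that idea, using hypothetical rows, shows what would go wrong without the deduplication step:

```python
from collections import Counter

# Hypothetical (customerkey, cohort_year, purchase_year) rows, one per order line
orders = [
    (1, 2015, 2015), (1, 2015, 2015),  # customer 1 ordered twice in 2015
    (1, 2015, 2016),
    (2, 2016, 2016), (3, 2016, 2016),
]

# SELECT DISTINCT in the CTE: one row per customer per purchase year
distinct_rows = set(orders)

# COUNT(*) OVER (PARTITION BY purchase_year, cohort_year)
num_customers = Counter((cohort, purchase) for _, cohort, purchase in distinct_rows)

# Counting the raw rows instead would count orders, not customers
num_orders = Counter((cohort, purchase) for _, cohort, purchase in orders)

print(num_customers[(2015, 2015)], num_orders[(2015, 2015)])  # 1 2
```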

## 📈 Analysis

- Calculate customer revenue by cohort to uncover deeper trends in cohorts.

> **NOTE**: This section is where the video begins.

### Customer Revenue by Cohort (NOT adjusted for time in market)

- Find the average revenue per customer after grouping them by cohort.
- Note: LTV is useful, but with yearly cohorts, older cohorts naturally have a higher LTV because they have had more time to purchase, so we need to adjust for this.

In [12]:
%%sql

SELECT
    cohort_year,
    SUM(total_net_revenue) AS total_revenue,
    COUNT(DISTINCT customerkey) AS total_customers,
    SUM(total_net_revenue) / COUNT(DISTINCT customerkey) AS customer_revenue
FROM cohort_analysis
GROUP BY 
    cohort_year

| cohort_year | total_revenue | total_customers | customer_revenue |
|---|---|---|---|
| 2015 | 14892230.47 | 2825 | 5271.59 |
| 2016 | 18360521.74 | 3397 | 5404.92 |
| 2017 | 21979733.96 | 4068 | 5403.08 |
| 2018 | 36460385.42 | 7446 | 4896.64 |
| 2019 | 36696243.88 | 7755 | 4731.95 |
| 2020 | 11921900.97 | 3031 | 3933.32 |
| 2021 | 18387736.18 | 4663 | 3943.33 |
| 2022 | 29872808.30 | 9010 | 3315.52 |
| 2023 | 14979328.33 | 5890 | 2543.18 |
| 2024 | 2856649.33 | 1402 | 2037.55 |


<img src="../Resources/images/5.2_cohort_totals.png" alt="Cohort Total Revenue and Customers" width="50%">

<img src="../Resources/images/5.2_customer_revenue_not_normalized.png" alt="Customer Revenue Not Normalized" width="50%">

> **Graph Note:** Revenue per customer has decreased over time, BUT this could be due to time in market: the 2015 cohort has been around far longer than the 2023 cohort, which contributes to a higher LTV.
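
To make the `SUM(...) / COUNT(DISTINCT customerkey)` calculation concrete, here is a small sketch in plain Python. It assumes, based only on the columns the query reads, that `cohort_analysis` can hold several rows per customer (for example one per order date); the rows themselves are invented.

```python
from collections import defaultdict

# Hypothetical rows shaped like the columns the query reads from cohort_analysis:
# (customerkey, cohort_year, total_net_revenue). Values are made up.
rows = [
    (1, 2015, 100.0),
    (1, 2015, 50.0),   # same customer, a later order date
    (2, 2015, 150.0),
    (3, 2016, 80.0),
]

revenue = defaultdict(float)
customers = defaultdict(set)
for customerkey, cohort_year, total_net_revenue in rows:
    revenue[cohort_year] += total_net_revenue  # SUM(total_net_revenue)
    customers[cohort_year].add(customerkey)    # COUNT(DISTINCT customerkey)

customer_revenue = {
    year: revenue[year] / len(customers[year]) for year in sorted(revenue)
}
print(customer_revenue)  # {2015: 150.0, 2016: 80.0}
```

Dividing by distinct customers rather than by rows is what keeps a repeat buyer from being counted twice in the denominator.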

### Customer Revenue by Cohort (Adjusted for time in market)

Let's look at how each customer's revenue is distributed over time, measured in days since their first purchase.
- This tells us how much time we need to consider when calculating the LTV of a customer in a cohort.

In [13]:
%%sql

WITH purchase_days AS (
    SELECT
        customerkey,
        total_net_revenue,
        orderdate - MIN(orderdate) OVER (PARTITION BY customerkey) AS days_since_first_purchase
    FROM cohort_analysis
)

SELECT
    days_since_first_purchase,
    SUM(total_net_revenue) as total_revenue,
    SUM(total_net_revenue) / (SELECT SUM(total_net_revenue) FROM cohort_analysis) * 100 as percentage_of_total_revenue,
    SUM(SUM(total_net_revenue) / (SELECT SUM(total_net_revenue) FROM cohort_analysis) * 100) OVER (ORDER BY days_since_first_purchase) as cumulative_percentage_of_total_revenue
FROM purchase_days
GROUP BY days_since_first_purchase
ORDER BY days_since_first_purchase
LIMIT 10


| days_since_first_purchase | total_revenue | percentage_of_total_revenue | cumulative_percentage_of_total_revenue |
|---|---|---|---|
| 0 | 127070684.34 | 61.56 | 61.56 |
| 1 | 31952.44 | 0.02 | 61.58 |
| 2 | 51219.17 | 0.02 | 61.60 |
| 3 | 44392.15 | 0.02 | 61.62 |
| 4 | 48196.56 | 0.02 | 61.65 |
| 5 | 72757.85 | 0.04 | 61.68 |
| 6 | 50986.83 | 0.02 | 61.71 |
| 7 | 86608.91 | 0.04 | 61.75 |
| 8 | 73662.89 | 0.04 | 61.79 |
| 9 | 73995.80 | 0.04 | 61.82 |


<img src="../Resources/images/5.2_revenue_distribution_time.png" alt="Customer Revenue Not Normalized" width="50%">

> **Graph Note:** About 60% of total revenue (61.56%) is generated on the day of a customer's first purchase (day 0). After that, each subsequent day contributes less than 0.05%.
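
The days-since-first-purchase query combines three steps: a per-customer offset, a per-day share of total revenue, and a running sum of those shares. A minimal Python sketch of the same steps follows, with hypothetical rows standing in for `cohort_analysis`:

```python
from collections import defaultdict
from datetime import date
from itertools import accumulate

# Hypothetical (customerkey, orderdate, total_net_revenue) rows
rows = [
    (1, date(2020, 1, 1), 60.0),
    (1, date(2020, 1, 3), 10.0),
    (2, date(2020, 2, 1), 20.0),
    (2, date(2020, 2, 2), 10.0),
]

# MIN(orderdate) OVER (PARTITION BY customerkey)
first_purchase = {}
for key, orderdate, _ in rows:
    first_purchase[key] = min(orderdate, first_purchase.get(key, orderdate))

# GROUP BY days_since_first_purchase
revenue_by_day = defaultdict(float)
for key, orderdate, revenue in rows:
    revenue_by_day[(orderdate - first_purchase[key]).days] += revenue

total = sum(revenue_by_day.values())
days = sorted(revenue_by_day)
pct = [revenue_by_day[day] / total * 100 for day in days]
cumulative = list(accumulate(pct))  # SUM(...) OVER (ORDER BY days_since_first_purchase)

for day, share, running in zip(days, pct, cumulative):
    print(day, round(share, 2), round(running, 2))
```

In this toy data, day 0 carries 80% of revenue and the running total reaches 100% by day 2, mirroring the front-loaded shape seen in the real result.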

- Adjust our previous query to count only purchases made on each customer's first purchase date, so every cohort is compared over the same time in market.

In [18]:
%%sql

SELECT
    cohort_year,
    SUM(total_net_revenue) AS total_revenue,
    COUNT(DISTINCT customerkey) AS total_customers,
    SUM(total_net_revenue) / COUNT(DISTINCT customerkey) AS customer_revenue
FROM cohort_analysis
WHERE orderdate = first_purchase_date
GROUP BY 
    cohort_year

| cohort_year | total_revenue | total_customers | customer_revenue |
|---|---|---|---|
| 2015 | 7245612.98 | 2825 | 2564.82 |
| 2016 | 9839134.34 | 3397 | 2896.42 |
| 2017 | 11771496.31 | 4068 | 2893.68 |
| 2018 | 19773770.56 | 7446 | 2655.62 |
| 2019 | 22245058.22 | 7755 | 2868.48 |
| 2020 | 7058614.52 | 3031 | 2328.81 |
| 2021 | 11974082.36 | 4663 | 2567.89 |
| 2022 | 21507554.55 | 9010 | 2387.08 |
| 2023 | 12890580.84 | 5890 | 2188.55 |
| 2024 | 2764779.66 | 1402 | 1972.03 |


<img src="../Resources/images/5.2_customer_revenue_normalized.png" alt="Customer Revenue Normalized" width="50%">

> **Graph Note:** The analysis indicates that revenue per customer really is decreasing over time. 😳 The drop is even bigger for the newer cohorts, which we wouldn't have seen if we hadn't adjusted for time in market!
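
One way to see the time-in-market effect directly is to compare the two result tables: for each cohort, divide first-day revenue by all-time revenue. The sketch below does this for three cohorts using figures copied from the tables above; the choice of cohorts and the variable names are just for illustration.

```python
# cohort_year: (all-time total_revenue, first-day total_revenue),
# copied from the unadjusted and adjusted result tables
cohorts = {
    2015: (14892230.47, 7245612.98),
    2019: (36696243.88, 22245058.22),
    2024: (2856649.33, 2764779.66),
}

# Share of each cohort's all-time revenue that came on the first purchase date
first_day_share = {
    year: first_day / all_time for year, (all_time, first_day) in cohorts.items()
}

for year, share in first_day_share.items():
    print(year, f"{share:.1%}")
```

On these figures, first-day purchases account for roughly 49% of the 2015 cohort's revenue but about 97% of the 2024 cohort's. Older cohorts have simply had more years to add repeat purchases, which is the bias the first-purchase filter removes.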