### gold.report_customers View
#### Overview
This notebook cell uses Spark SQL to create the view `gold.report_customers` (fully qualified in the code as `dwh_project.gold.report_customers`). The view joins `gold.fact_sales` to `gold.dim_customers` and aggregates the result to one row per customer, covering order history, sales totals, and segmentation.

#### Purpose
The `gold.report_customers` view is designed to:
- Summarize customer-level metrics such as total orders, total sales, total quantity, and the number of distinct products purchased.
- Group customers into age bands and segment them by purchase behavior (VIP, Regular, New).
- Calculate derived metrics such as average order value (AOV) and average monthly spend.
- Report recency (months since the last order) and lifespan (months between the first and last order).

#### Schema
The view is created in the `gold` schema (referenced in the code as `dwh_project.gold`) and reads from the following tables:
- `gold.fact_sales`: Contains sales transaction data (order_number, product_key, order_date, sales_amount, quantity, customer_key).
- `gold.dim_customers`: Contains customer details (customer_key, customer_number, first_name, last_name, birthdate).

#### Structure
The view is built using two Common Table Expressions (CTEs) and a final SELECT statement:

1. **base_query CTE**
   - **Purpose**: Retrieves core columns from `gold.fact_sales` and `gold.dim_customers`.
   - **Joins**: Left joins `fact_sales` with `dim_customers` on `customer_key`, so sales rows without a matching customer are kept with NULL customer attributes.
   - **Filters**: Excludes rows where `order_date` is NULL.
   - **Columns**:
     - `order_number`: Unique identifier for each order.
     - `product_key`: Identifier for products.
     - `order_date`: Date of the order.
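     - `sales_amount`: Sales amount recorded on the sales row.
     - `quantity`: Number of items recorded on the sales row.
     - `customer_key`: Unique identifier for the customer (taken from `dim_customers`).
     - `customer_number`: Business-facing customer identifier (for example `AW00027476`).
     - `customer_name`: First and last names concatenated with a space.
     - `age`: Calculated as `YEAR(CURRENT_DATE) - YEAR(birthdate)`. This is a calendar-year difference, so it does not check whether the birthday has already occurred in the current year.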

2. **customer_aggregation CTE**
   - **Purpose**: Aggregates data at the customer level.
   - **Grouping**: Groups by `customer_key`, `customer_number`, `customer_name`, and `age`.
   - **Metrics**:
     - `total_orders`: Count of distinct orders.
     - `total_sales`: Sum of sales amounts.
     - `total_quantity`: Sum of quantities ordered.
     - `total_products`: Count of distinct products purchased.
     - `last_order_date`: Most recent order date.
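     - `lifespan`: Whole months between the first and last order (`MONTHS_BETWEEN` cast to `INT`); 0 when the two dates are less than a month apart, which includes customers with a single order.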

3. **Final SELECT**
   - **Purpose**: Builds the final report with additional derived fields and customer segmentation.
   - **Columns**:
     - `customer_key`, `customer_number`, `customer_name`, `age`: Direct from `customer_aggregation`.
     - `age_group`: Categorizes customers into age bands:
       - Under 20
       - 20-29
       - 30-39
       - 40-49
       - 50 and above (this is the `ELSE` branch, so rows with a NULL `age` also land here)
     - `customer_segment`: Segments customers based on lifespan and total sales:
       - VIP: Lifespan ≥ 12 months and total sales > 5000.
       - Regular: Lifespan ≥ 12 months and total sales ≤ 5000.
       - New: Lifespan < 12 months.
     - `last_order_date`: Most recent order date.
     - `recency`: Whole months between the last order and the current date (`MONTHS_BETWEEN` cast to `INT`).
     - `total_orders`, `total_sales`, `total_quantity`, `total_products`, `lifespan`: Direct from `customer_aggregation`.
     - `avg_order_value`: Total sales divided by total orders (returns 0 if `total_orders` is 0 to avoid division by zero).
     - `avg_monthly_spend`: Total sales divided by lifespan (returns `total_sales` if `lifespan` is 0).
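
The derived fields of the final SELECT can be sketched in plain Python to make the branch order explicit. This is an illustrative re-implementation, not part of the view: the function names are chosen here to mirror the column names, the thresholds are the ones listed above, and the sketch assumes non-NULL inputs (the SQL `CASE` expressions fall through to their `ELSE` branch on NULL instead).

```python
def age_group(age):
    """Map an age in years to the report's age band (mirrors the first CASE)."""
    if age < 20:
        return "Under 20"
    if 20 <= age <= 29:
        return "20-29"
    if 30 <= age <= 39:
        return "30-39"
    if 40 <= age <= 49:
        return "40-49"
    return "50 and above"


def customer_segment(lifespan, total_sales):
    """Segment by lifespan in months and total sales (mirrors the second CASE)."""
    if lifespan >= 12 and total_sales > 5000:
        return "VIP"
    if lifespan >= 12 and total_sales <= 5000:
        return "Regular"
    return "New"


def avg_order_value(total_sales, total_orders):
    """Total sales per order, with 0 returned when there are no orders."""
    return 0 if total_orders == 0 else total_sales / total_orders


def avg_monthly_spend(total_sales, lifespan):
    """Total sales per month of lifespan; a 0-month lifespan returns total sales."""
    return total_sales if lifespan == 0 else total_sales / lifespan


# Values copied from the Logan Garcia row of the sample output.
row = {"age": 48, "lifespan": 14, "total_sales": 2760, "total_orders": 2}
print(
    age_group(row["age"]),
    customer_segment(row["lifespan"], row["total_sales"]),
    avg_order_value(row["total_sales"], row["total_orders"]),
)
```

For that row the sketch gives the `40-49` age group, the `Regular` segment, and an average order value of 1380.0, which agrees with the view's output.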

# Drop the existing view if it exists
spark.sql("DROP VIEW IF EXISTS dwh_project.gold.report_customers")

# Create the new view
spark.sql("""
CREATE VIEW dwh_project.gold.report_customers AS
WITH base_query AS (
    SELECT
        f.order_number,
        f.product_key,
        f.order_date,
        f.sales_amount,
        f.quantity,
        c.customer_key,
        c.customer_number,
        CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
        -- Calendar-year difference; does not check whether the birthday has passed this year
        YEAR(CURRENT_DATE) - YEAR(c.birthdate) AS age
    FROM dwh_project.gold.fact_sales f
    LEFT JOIN dwh_project.gold.dim_customers c
        ON c.customer_key = f.customer_key
    WHERE f.order_date IS NOT NULL
),

customer_aggregation AS (
    SELECT 
        customer_key,
        customer_number,
        customer_name,
        age,
        COUNT(DISTINCT order_number) AS total_orders,
        SUM(sales_amount) AS total_sales,
        SUM(quantity) AS total_quantity,
        COUNT(DISTINCT product_key) AS total_products,
        MAX(order_date) AS last_order_date,
        CAST(
            MONTHS_BETWEEN(MAX(order_date), MIN(order_date))
            AS INT
        ) AS lifespan
    FROM base_query
    GROUP BY 
        customer_key,
        customer_number,
        customer_name,
        age
)
SELECT
    customer_key,
    customer_number,
    customer_name,
    age,
    CASE 
        WHEN age < 20 THEN 'Under 20'
        WHEN age BETWEEN 20 AND 29 THEN '20-29'
        WHEN age BETWEEN 30 AND 39 THEN '30-39'
        WHEN age BETWEEN 40 AND 49 THEN '40-49'
        ELSE '50 and above'
    END AS age_group,
    CASE 
        WHEN lifespan >= 12 AND total_sales > 5000 THEN 'VIP'
        WHEN lifespan >= 12 AND total_sales <= 5000 THEN 'Regular'
        ELSE 'New'
    END AS customer_segment,
    last_order_date,
    CAST(
        MONTHS_BETWEEN(CURRENT_DATE, last_order_date)
        AS INT
    ) AS recency,
    total_orders,
    total_sales,
    total_quantity,
    total_products,
    lifespan,
    CASE 
        WHEN total_orders = 0 THEN 0
        ELSE total_sales / total_orders
    END AS avg_order_value,
    CASE 
        WHEN lifespan = 0 THEN total_sales
        ELSE total_sales / lifespan
    END AS avg_monthly_spend
FROM customer_aggregation
""")

# Display the new view
df = spark.sql("SELECT * FROM dwh_project.gold.report_customers")
display(df)

customer_key,customer_number,customer_name,age,age_group,customer_segment,last_order_date,recency,total_orders,total_sales,total_quantity,total_products,lifespan,avg_order_value,avg_monthly_spend
16478,AW00027476,Alexis Alexander,43.0,40-49,New,2013-11-24,141,1,25,2,2,0,25.0,25.0
3693,AW00014691,Derrick Torres,40.0,40-49,New,2013-09-02,143,1,90,3,3,0,90.0,90.0
10046,AW00021044,Franklin Zheng,68.0,50 and above,New,2013-06-04,146,2,2534,3,3,7,1267.0,362.0
13904,AW00024902,Jill Murphy,41.0,40-49,New,2013-03-26,149,2,2120,2,2,11,1060.0,192.72727272727272
6097,AW00017095,Logan Garcia,48.0,40-49,Regular,2013-03-15,149,2,2760,4,4,14,1380.0,197.14285714285717
17148,AW00028146,Gavin Washington,53.0,50 and above,New,2013-07-07,145,1,60,3,3,0,60.0,60.0
12145,AW00023143,Jada Phillips,48.0,40-49,New,2013-06-12,146,1,563,3,3,0,563.0,563.0
4963,AW00015961,Ashley Walker,47.0,40-49,New,2013-09-14,143,2,3127,4,4,9,1563.5,347.44444444444446
6544,AW00017542,Bridget Rai,47.0,40-49,New,2013-11-14,141,1,2342,2,2,0,2342.0,2342.0
18147,AW00029145,Shelby Gray,57.0,50 and above,New,2011-07-19,169,1,3578,1,1,0,3578.0,3578.0
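
The `lifespan` and `recency` values in the rows above are whole numbers because the view casts the result of `MONTHS_BETWEEN` to `INT`, which drops the fractional part. The sketch below approximates that behaviour in plain Python for date-only inputs. It assumes Spark's documented rule (a whole-month result when both dates fall on the same day of the month or are both month ends, otherwise a fractional part based on a 31-day month) and omits Spark's rounding of the result to 8 digits. The helper names `months_between` and `whole_months` are hypothetical, and the dates in the usage lines are made up for illustration.

```python
import calendar
from datetime import date


def _is_month_end(d):
    """True when d is the last day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]


def months_between(later, earlier):
    """Approximate Spark's MONTHS_BETWEEN for two dates without a time part."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day == earlier.day or (_is_month_end(later) and _is_month_end(earlier)):
        return float(months)
    # Otherwise the day difference is expressed as a fraction of a 31-day month.
    return months + (later.day - earlier.day) / 31


def whole_months(later, earlier):
    """Truncate toward zero, as casting a fractional value to INT does."""
    return int(months_between(later, earlier))


# A customer with one order has identical first and last order dates.
print(whole_months(date(2013, 11, 24), date(2013, 11, 24)))
# Two orders about two weeks apart still truncate to a 0-month lifespan.
print(whole_months(date(2013, 3, 5), date(2013, 2, 20)))
# Orders spread over more than a year.
print(whole_months(date(2013, 3, 15), date(2012, 1, 1)))
```

This is why every single-order customer in the sample has `lifespan` 0 and an `avg_monthly_spend` equal to `total_sales`.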
