# Business Questions — Evidence Notebook (Gold Layer)

This notebook provides **reproducible analytical evidence** for the MVP business questions using the Gold star schema.

**Scope:** Jan/2023 to Dec/2024 (`AnoMes` 202301–202412)  
**Source:** `mvp_pix.gold` (fact + dimensions + helper view)

Notes:
- Each answered question produces a single primary output designed for screenshots; Q1 is the exception, with two outputs.
- Two business questions (income level; essential vs non-essential expenses) are intentionally not answered due to dataset limitations and are discussed in the final conclusion.

In [0]:
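from pyspark.sql import functions as F

# Analysis scope as AnoMes (YYYYMM) strings; the %sql cells below hard-code the same values.
SCOPE_START = "202301"
SCOPE_END   = "202412"

SNAPSHOT_MONTH = "202412"  # used to generate chart-friendly, screenshot-ready outputs

# DataFrame handles for the Gold star schema (fact table plus dimensions).
fact = spark.table("mvp_pix.gold.fato_transacoes_pix")  # fact: PIX transactions
t    = spark.table("mvp_pix.gold.dim_tempo")            # dimension: time
u    = spark.table("mvp_pix.gold.dim_usuario")          # dimension: user
r    = spark.table("mvp_pix.gold.dim_regiao")           # dimension: region
n    = spark.table("mvp_pix.gold.dim_natureza")         # dimension: nature
p    = spark.table("mvp_pix.gold.dim_finalidade")       # dimension: purpose
m    = spark.table("mvp_pix.gold.dim_forma_iniciacao")  # dimension: initiation method

# Helper view used in Q7.
vw_conc = spark.table("mvp_pix.gold.vw_regional_concentration")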


## Q1 — Monthly evolution + top payer age groups
**Question:** Which age groups most frequently make PIX payments, and how does this pattern evolve over time?

Evidence is provided in two outputs:
1) Monthly evolution of total PIX activity (baseline trend).
2) Top payer age groups in the 2023–2024 scope (to identify the most active profiles).

In [0]:
%sql
SELECT
  t.AnoMes,
  SUM(f.quantidade_transacoes) AS total_transactions,
  SUM(f.valor_total)           AS total_value_brl
FROM mvp_pix.gold.fato_transacoes_pix f
JOIN mvp_pix.gold.dim_tempo t
  ON f.id_tempo = t.id_tempo
WHERE t.AnoMes BETWEEN '202301' AND '202412'
GROUP BY t.AnoMes
ORDER BY t.AnoMes;


AnoMes,total_transactions,total_value_brl
202301,4365833932,1910575883899.74
202302,4307580286,1812531230996.96
202303,5092026344,2196684484144.8
202304,5038153402,2058691807765.34
202305,5517812146,2293614535551.3
202306,5716857614,2314269454039.58
202307,6155096154,2414137613483.72
202308,6556460874,2573533834817.28
202309,6702111484,2554410269903.46
202310,7109217440,2761013080434.82
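
The baseline trend can be quantified directly from the rows shown above. The following is a minimal sketch in plain Python, assuming only the ten displayed months (202301 to 202310); the helper name `pct_change` is introduced here for illustration and is not part of the notebook.

```python
# Rows copied from the query output above: (AnoMes, total_transactions, total_value_brl).
monthly = [
    ("202301", 4365833932, 1910575883899.74),
    ("202302", 4307580286, 1812531230996.96),
    ("202303", 5092026344, 2196684484144.8),
    ("202304", 5038153402, 2058691807765.34),
    ("202305", 5517812146, 2293614535551.3),
    ("202306", 5716857614, 2314269454039.58),
    ("202307", 6155096154, 2414137613483.72),
    ("202308", 6556460874, 2573533834817.28),
    ("202309", 6702111484, 2554410269903.46),
    ("202310", 7109217440, 2761013080434.82),
]


def pct_change(old, new):
    """Percentage change from old to new (hypothetical helper)."""
    return (new - old) / old * 100.0


# Month-over-month change in transaction count.
mom = [
    (cur[0], pct_change(prev[1], cur[1]))
    for prev, cur in zip(monthly, monthly[1:])
]

# Change across the displayed window only (202301 to 202310).
window_growth = pct_change(monthly[0][1], monthly[-1][1])

for month, change in mom:
    print(f"{month}: {change:+.1f}%")
print(f"202301 -> 202310: {window_growth:+.1f}%")
```

On the displayed rows, transaction count grows by roughly 63% between 202301 and 202310, with small dips in 202302 and 202304. Months after 202310 are not shown in the output, so they are not covered by this sketch.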


In [0]:
%sql
SELECT
  up.faixa_etaria AS payer_age_group,
  SUM(f.quantidade_transacoes) AS total_transactions,
  SUM(f.valor_total)           AS total_value_brl
FROM mvp_pix.gold.fato_transacoes_pix f
JOIN mvp_pix.gold.dim_tempo   t  ON f.id_tempo           = t.id_tempo
JOIN mvp_pix.gold.dim_usuario up ON f.id_usuario_pagador = up.id_usuario
WHERE t.AnoMes BETWEEN '202301' AND '202412'
GROUP BY up.faixa_etaria
ORDER BY total_transactions DESC;


payer_age_group,total_transactions,total_value_brl
30–39,49132343082,9689187227525.96
20–29,47108188206,6362519295046.56
40–49,35227929048,8156832289395.4
Not informed,28615878296,43970052524056.14
50–59,16582667004,4475783596236.5
<20,8356526578,522172693926.32
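
Transaction count and total value rank the age groups differently, so an average value per transaction helps when reading this table. A minimal sketch in plain Python, using only the rows displayed above; the name `avg_ticket` is an illustrative choice, not a column in the Gold schema.

```python
# Rows copied from the output above: (payer_age_group, total_transactions, total_value_brl).
age_rows = [
    ("30–39", 49132343082, 9689187227525.96),
    ("20–29", 47108188206, 6362519295046.56),
    ("40–49", 35227929048, 8156832289395.4),
    ("Not informed", 28615878296, 43970052524056.14),
    ("50–59", 16582667004, 4475783596236.5),
    ("<20", 8356526578, 522172693926.32),
]

# Average value (BRL) per transaction for each payer age group.
avg_ticket = {group: value / count for group, count, value in age_rows}

# Print from highest to lowest average.
for group, ticket in sorted(avg_ticket.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:<13} {ticket:>10.2f}")
```

In the displayed rows, "Not informed" has an average value per transaction well above any informed age group, while "<20" has the lowest. This suggests reading the "Not informed" bucket separately from the age-based profiles.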


## Q2 — Payer vs receiver age-group interaction
**Question:** Are there relevant differences between payers and receivers in terms of age distribution and transaction volume?

Primary evidence: a ranked table of payer → receiver age-group combinations (top interactions).


In [0]:
%sql
SELECT
  up.faixa_etaria AS payer_age_group,
  ur.faixa_etaria AS receiver_age_group,
  SUM(f.quantidade_transacoes) AS total_transactions,
  SUM(f.valor_total)           AS total_value_brl
FROM mvp_pix.gold.fato_transacoes_pix f
JOIN mvp_pix.gold.dim_usuario up ON f.id_usuario_pagador   = up.id_usuario
JOIN mvp_pix.gold.dim_usuario ur ON f.id_usuario_recebedor = ur.id_usuario
JOIN mvp_pix.gold.dim_tempo   t  ON f.id_tempo            = t.id_tempo
WHERE t.AnoMes BETWEEN '202301' AND '202412'
GROUP BY up.faixa_etaria, ur.faixa_etaria
ORDER BY total_transactions DESC
LIMIT 25;


payer_age_group,receiver_age_group,total_transactions,total_value_brl
30–39,Not informed,22609429178,3002049133292.22
20–29,Not informed,20715262634,1829561498022.5
40–49,Not informed,15581188134,2633932758312.98
20–29,20–29,13759885462,2871884873214.84
30–39,30–39,13599549254,4449806483339.4
Not informed,Not informed,10212968670,34098695576096.72
40–49,40–49,8704279110,3393339411471.26
50–59,Not informed,7025418488,1452572801514.66
Not informed,30–39,5753389450,3159901877945.22
20–29,30–39,5303495374,703157102010.66
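
To read the interaction table, it can help to separate same-age pairs from pairs where one side is "Not informed". The sketch below is plain Python over the ten rows displayed above (the query itself returns up to 25); the three category labels are assumptions made for this illustration.

```python
from collections import defaultdict

# Rows copied from the output above: (payer_age_group, receiver_age_group, total_transactions).
pairs = [
    ("30–39", "Not informed", 22609429178),
    ("20–29", "Not informed", 20715262634),
    ("40–49", "Not informed", 15581188134),
    ("20–29", "20–29", 13759885462),
    ("30–39", "30–39", 13599549254),
    ("Not informed", "Not informed", 10212968670),
    ("40–49", "40–49", 8704279110),
    ("50–59", "Not informed", 7025418488),
    ("Not informed", "30–39", 5753389450),
    ("20–29", "30–39", 5303495374),
]

NOT_INFORMED = "Not informed"


def classify(payer, receiver):
    """Assign a pair to one of three illustrative categories."""
    if NOT_INFORMED in (payer, receiver):
        return "at least one side not informed"
    if payer == receiver:
        return "same age group"
    return "different age groups"


# Sum transaction counts per category over the displayed rows only.
totals = defaultdict(int)
for payer, receiver, count in pairs:
    totals[classify(payer, receiver)] += count
totals = dict(totals)

for category, count in totals.items():
    print(f"{category:<32} {count}")
```

Among the displayed rows, most of the volume involves a "Not informed" side, which limits how much the payer vs receiver age comparison can say.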


## Q3 — Purpose by age group (payer perspective)
**Question:** How does transaction purpose vary across different age groups?

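Primary evidence: purpose distribution by payer age group (by transaction count) for the snapshot month 202412, keeping the six most frequent purposes and grouping the rest as 'Other'.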


In [0]:
%sql
WITH base AS (
  SELECT
    u.faixa_etaria AS payer_age_group,
    p.finalidade   AS purpose,
    SUM(f.quantidade_transacoes) AS total_transactions
  FROM mvp_pix.gold.fato_transacoes_pix f
  JOIN mvp_pix.gold.dim_tempo      t ON f.id_tempo           = t.id_tempo
  JOIN mvp_pix.gold.dim_usuario    u ON f.id_usuario_pagador = u.id_usuario
  JOIN mvp_pix.gold.dim_finalidade p ON f.id_finalidade      = p.id_finalidade
  WHERE t.AnoMes = '202412'
  GROUP BY payer_age_group, purpose
),
top_purposes AS (
  SELECT purpose
  FROM base
  GROUP BY purpose
  ORDER BY SUM(total_transactions) DESC
  LIMIT 6
),
filtered AS (
  SELECT
    payer_age_group,
    CASE WHEN purpose IN (SELECT purpose FROM top_purposes) THEN purpose ELSE 'Other' END AS purpose_group,
    SUM(total_transactions) AS total_transactions
  FROM base
  GROUP BY payer_age_group, purpose_group
)
SELECT *
FROM filtered
ORDER BY payer_age_group, total_transactions DESC;


payer_age_group,purpose_group,total_transactions
20–29,PIX,2775815702
20–29,NOT AVAILABLE,3551348
20–29,PIX SAQUE,758172
20–29,PIX TROCO,1624
30–39,PIX,2945985370
30–39,NOT AVAILABLE,2638408
30–39,PIX SAQUE,925574
30–39,PIX TROCO,2954
40–49,PIX,2191334578
40–49,NOT AVAILABLE,1542628
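
The query above keeps the most frequent purposes and folds the rest into 'Other'. The same bucketing logic can be sketched in plain Python over the displayed rows. `TOP_N = 2` is an assumption chosen so that an 'Other' bucket appears; the SQL uses `LIMIT 6`, and the displayed rows contain only four distinct purposes.

```python
from collections import defaultdict

TOP_N = 2  # illustration only; the SQL above uses LIMIT 6

# Rows copied from the output above: (payer_age_group, purpose, total_transactions).
base = [
    ("20–29", "PIX", 2775815702),
    ("20–29", "NOT AVAILABLE", 3551348),
    ("20–29", "PIX SAQUE", 758172),
    ("20–29", "PIX TROCO", 1624),
    ("30–39", "PIX", 2945985370),
    ("30–39", "NOT AVAILABLE", 2638408),
    ("30–39", "PIX SAQUE", 925574),
    ("30–39", "PIX TROCO", 2954),
    ("40–49", "PIX", 2191334578),
    ("40–49", "NOT AVAILABLE", 1542628),
]

# Step 1 (top_purposes CTE): rank purposes by total count across all age groups.
purpose_totals = defaultdict(int)
for _, purpose, count in base:
    purpose_totals[purpose] += count
top_purposes = set(sorted(purpose_totals, key=purpose_totals.get, reverse=True)[:TOP_N])

# Step 2 (filtered CTE): relabel everything outside the top set as 'Other' and re-aggregate.
grouped = defaultdict(int)
for age_group, purpose, count in base:
    bucket = purpose if purpose in top_purposes else "Other"
    grouped[(age_group, bucket)] += count
grouped = dict(grouped)

for (age_group, bucket), count in sorted(grouped.items()):
    print(f"{age_group:<6} {bucket:<14} {count}")
```

With `TOP_N = 2`, PIX SAQUE and PIX TROCO are merged into 'Other' for the age groups where they appear in the displayed rows.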


## Q4 — Regional differences (purpose and nature)
**Question:** Are there regional differences in PIX usage considering transaction purpose, nature, and volume?

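Primary evidence: payer region × nature (by total value) for the snapshot month 202412, showing the top four natures per region. Purpose is not broken down in this output.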


In [0]:
%sql
WITH base AS (
  SELECT
    rp.regiao  AS payer_region,
    n.natureza AS nature,
    SUM(f.valor_total) AS total_value_brl
  FROM mvp_pix.gold.fato_transacoes_pix f
  JOIN mvp_pix.gold.dim_tempo    t  ON f.id_tempo          = t.id_tempo
  JOIN mvp_pix.gold.dim_regiao   rp ON f.id_regiao_pagador = rp.id_regiao
  JOIN mvp_pix.gold.dim_natureza n  ON f.id_natureza       = n.id_natureza
  WHERE t.AnoMes = '202412'
  GROUP BY payer_region, nature
),
ranked AS (
  SELECT
    *,
    DENSE_RANK() OVER (PARTITION BY payer_region ORDER BY total_value_brl DESC) AS rnk
  FROM base
)
SELECT payer_region, nature, total_value_brl
FROM ranked
WHERE rnk <= 4
ORDER BY payer_region, total_value_brl DESC;


payer_region,nature,total_value_brl
CENTRO-OESTE,P2P,144072017499.82
CENTRO-OESTE,B2B,142150136551.2
CENTRO-OESTE,P2B,60937760617.26
CENTRO-OESTE,B2P,56685545077.86
NORDESTE,P2P,331818119910.18
NORDESTE,B2B,177355670936.52
NORDESTE,P2B,112066416592.2
NORDESTE,B2P,78200068189.96
NORTE,P2P,115816162003.3
NORTE,B2B,72110656253.76
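
The `rnk <= 4` filter relies on `DENSE_RANK`, which gives tied values the same rank without skipping the next one. A minimal plain-Python sketch of that behavior follows; `dense_rank_desc` is a hypothetical helper written for this illustration, and the tied values in the second example are made up.

```python
def dense_rank_desc(values):
    """Dense rank with 1 for the largest value; ties share a rank and no rank is skipped."""
    rank_of = {v: i + 1 for i, v in enumerate(sorted(set(values), reverse=True))}
    return [rank_of[v] for v in values]


# CENTRO-OESTE values copied from the output above (P2P, B2B, P2B, B2P).
centro_oeste = [144072017499.82, 142150136551.2, 60937760617.26, 56685545077.86]
ranks_real = dense_rank_desc(centro_oeste)

# Hypothetical values with a tie at the top, to show the effect on a "rank <= 4" filter.
with_tie = [100.0, 100.0, 90.0, 80.0, 70.0]
ranks_tie = dense_rank_desc(with_tie)
kept = [v for v, rnk in zip(with_tie, ranks_tie) if rnk <= 4]

print(ranks_real)
print(ranks_tie, len(kept))
```

With no ties, as in the displayed rows, the filter returns exactly four natures per region. If two natures had identical totals, the same filter could return more than four rows for that region.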


## Q5 — Most common patterns (age × nature × purpose)
**Question:** Which combinations of age group, transaction nature, and transaction purpose represent the most common PIX usage patterns?

Primary evidence: Top patterns by transaction count (payer perspective).


In [0]:
%sql
SELECT
  u.faixa_etaria AS payer_age_group,
  n.natureza     AS nature,
  p.finalidade   AS purpose,
  SUM(f.quantidade_transacoes) AS total_transactions,
  SUM(f.valor_total)           AS total_value_brl
FROM mvp_pix.gold.fato_transacoes_pix f
JOIN mvp_pix.gold.dim_tempo      t ON f.id_tempo           = t.id_tempo
JOIN mvp_pix.gold.dim_usuario    u ON f.id_usuario_pagador = u.id_usuario
JOIN mvp_pix.gold.dim_natureza   n ON f.id_natureza        = n.id_natureza
JOIN mvp_pix.gold.dim_finalidade p ON f.id_finalidade      = p.id_finalidade
WHERE t.AnoMes BETWEEN '202301' AND '202412'
GROUP BY payer_age_group, nature, purpose
ORDER BY total_transactions DESC
LIMIT 30;


payer_age_group,nature,purpose,total_transactions,total_value_brl
30–39,P2P,PIX,27858049412,7050724904266.06
20–29,P2P,PIX,27397694052,4693123407121.2
30–39,P2B,PIX,21109350086,2590944171661.36
40–49,P2P,PIX,20818443384,5899036704468.52
20–29,P2B,PIX,19566605306,1640991803211.9
Not informed,B2P,PIX,14520676564,9457039464626.86
40–49,P2B,PIX,14277747360,2215400192595.52
50–59,P2P,PIX,10267923344,3283390866035.2
50–59,P2B,PIX,6243860258,1167510342522.72
Not informed,B2B,PIX,5446584164,30395071696483.34
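
The table above is ordered by transaction count. Re-sorting the same rows by total value gives a different picture, sketched below in plain Python. Only the ten displayed rows are used (the query returns up to 30), so this is an illustration rather than a full ranking.

```python
# Rows copied from the output above:
# (payer_age_group, nature, purpose, total_transactions, total_value_brl).
patterns = [
    ("30–39", "P2P", "PIX", 27858049412, 7050724904266.06),
    ("20–29", "P2P", "PIX", 27397694052, 4693123407121.2),
    ("30–39", "P2B", "PIX", 21109350086, 2590944171661.36),
    ("40–49", "P2P", "PIX", 20818443384, 5899036704468.52),
    ("20–29", "P2B", "PIX", 19566605306, 1640991803211.9),
    ("Not informed", "B2P", "PIX", 14520676564, 9457039464626.86),
    ("40–49", "P2B", "PIX", 14277747360, 2215400192595.52),
    ("50–59", "P2P", "PIX", 10267923344, 3283390866035.2),
    ("50–59", "P2B", "PIX", 6243860258, 1167510342522.72),
    ("Not informed", "B2B", "PIX", 5446584164, 30395071696483.34),
]

# Same rows, ordered by total value instead of transaction count.
by_value = sorted(patterns, key=lambda row: row[4], reverse=True)

for age_group, nature, purpose, count, value in by_value[:3]:
    print(f"{age_group:<13} {nature} {purpose} {value:,.2f}")
```

Within the displayed rows, the last pattern by count ("Not informed", B2B) becomes the first by value, followed by "Not informed" B2P.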


## Q6 — Payer vs receiver profiles across regions and age groups
**Question:** How does the distribution of PIX usage differ between payer and receiver profiles across regions and age groups?

Primary evidence: role-playing comparison (payer vs receiver) by region for the snapshot month 202412; the final output is filtered to the 30–39 age group.


In [0]:
%sql
WITH payer AS (
  SELECT
    rp.regiao AS region,
    up.faixa_etaria AS age_group,
    SUM(f.quantidade_transacoes) AS total_transactions
  FROM mvp_pix.gold.fato_transacoes_pix f
  JOIN mvp_pix.gold.dim_tempo   t  ON f.id_tempo           = t.id_tempo
  JOIN mvp_pix.gold.dim_regiao  rp ON f.id_regiao_pagador  = rp.id_regiao
  JOIN mvp_pix.gold.dim_usuario up ON f.id_usuario_pagador = up.id_usuario
  WHERE t.AnoMes = '202412'
  GROUP BY region, age_group
),
receiver AS (
  SELECT
    rr.regiao AS region,
    ur.faixa_etaria AS age_group,
    SUM(f.quantidade_transacoes) AS total_transactions
  FROM mvp_pix.gold.fato_transacoes_pix f
  JOIN mvp_pix.gold.dim_tempo   t  ON f.id_tempo            = t.id_tempo
  JOIN mvp_pix.gold.dim_regiao  rr ON f.id_regiao_recebedor = rr.id_regiao
  JOIN mvp_pix.gold.dim_usuario ur ON f.id_usuario_recebedor= ur.id_usuario
  WHERE t.AnoMes = '202412'
  GROUP BY region, age_group
),
unioned AS (
  SELECT region, age_group, 'Payer' AS role, total_transactions FROM payer
  UNION ALL
  SELECT region, age_group, 'Receiver' AS role, total_transactions FROM receiver
)
SELECT region, role, total_transactions
FROM unioned
WHERE age_group = '30–39'
ORDER BY region, role;


region,role,total_transactions
CENTRO-OESTE,Payer,249356210
CENTRO-OESTE,Receiver,145137164
NORDESTE,Payer,860393762
NORDESTE,Receiver,558101620
NORTE,Payer,314776530
NORTE,Receiver,197539612
NOT INFORMED,Payer,1623634
NOT INFORMED,Receiver,1172252
SUDESTE,Payer,1167426456
SUDESTE,Receiver,692763514
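
A payer-to-receiver ratio per region summarizes the comparison in a single number. The sketch below is plain Python over the displayed rows (30–39 age group, snapshot month 202412); the ratio is an illustrative metric, not a column in the Gold layer.

```python
# Rows copied from the output above: (region, role, total_transactions).
rows = [
    ("CENTRO-OESTE", "Payer", 249356210),
    ("CENTRO-OESTE", "Receiver", 145137164),
    ("NORDESTE", "Payer", 860393762),
    ("NORDESTE", "Receiver", 558101620),
    ("NORTE", "Payer", 314776530),
    ("NORTE", "Receiver", 197539612),
    ("NOT INFORMED", "Payer", 1623634),
    ("NOT INFORMED", "Receiver", 1172252),
    ("SUDESTE", "Payer", 1167426456),
    ("SUDESTE", "Receiver", 692763514),
]

# Pivot to {region: {role: count}}.
by_region = {}
for region, role, count in rows:
    by_region.setdefault(region, {})[role] = count

# Ratio above 1 means the age group appears more often as payer than as receiver.
ratios = {
    region: roles["Payer"] / roles["Receiver"]
    for region, roles in by_region.items()
}

for region, ratio in ratios.items():
    print(f"{region:<13} {ratio:.2f}")
```

In every displayed region the ratio is above 1, so the 30–39 group appears more often on the paying side than on the receiving side in this snapshot.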


## Q7 — Regional concentration (value vs transaction count)
**Question:** How concentrated is PIX usage across regions when comparing total transaction value and transaction count?

Primary evidence uses the Gold helper view:
`mvp_pix.gold.vw_regional_concentration`


In [0]:
%sql
SELECT
  AnoMes,
  papel   AS role_pt,
  regiao  AS region,
  share_valor             AS value_share,
  share_quantidade        AS transaction_share,
  rank_valor              AS value_rank,
  rank_quantidade         AS transaction_rank,
  cumulative_share_valor  AS cumulative_value_share,
  cumulative_share_quantidade AS cumulative_transaction_share
FROM mvp_pix.gold.vw_regional_concentration
WHERE AnoMes BETWEEN '202301' AND '202412'
ORDER BY AnoMes, role_pt, value_rank;


AnoMes,role_pt,region,value_share,transaction_share,value_rank,transaction_rank,cumulative_value_share,cumulative_transaction_share
202301,Pagador,SUDESTE,0.5129982371,0.428592468047179,1,1,0.5129982371,0.428592468047179
202301,Pagador,SUL,0.1716003192,0.1223364892753323,2,3,0.6845985563,0.817631152627177
202301,Pagador,NORDESTE,0.1625622937,0.2667021953046655,3,2,0.84716085,0.6952946633518446
202301,Pagador,CENTRO-OESTE,0.093474286,0.0859424302078561,4,5,0.940635136,0.999561015368461
202301,Pagador,NORTE,0.058522776,0.0959874325334278,5,4,0.999157912,0.9136185851606048
202301,Pagador,NOT INFORMED,0.0008420881,0.0004389846315391183,6,6,1.0000000001,1.0
202301,Recebedor,SUDESTE,0.5108271998,0.4514524662868006,1,1,0.5108271998,0.4514524662868006
202301,Recebedor,SUL,0.1738502973,0.1298401865094121,2,3,0.6846774971,0.8331359613428375
202301,Recebedor,NORDESTE,0.1622103739,0.2518433085466248,3,2,0.846887871,0.7032957748334254
202301,Recebedor,CENTRO-OESTE,0.0942155394,0.0803023247014334,4,5,0.9411034104,0.9996064183780778
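
The output is ordered by `value_rank`, yet the cumulative transaction share does not increase monotonically in that order. A quick check in plain Python, using the six `Pagador` rows for 202301 shown above, suggests that each cumulative column is accumulated in the order of its own rank. The helper `check_cumulative` and the tolerance are assumptions made for this sketch; the view definition itself is not shown in this notebook.

```python
from itertools import accumulate

# Rows copied from the output above (202301, Pagador):
# (region, value_share, transaction_share, value_rank, transaction_rank,
#  cumulative_value_share, cumulative_transaction_share)
rows = [
    ("SUDESTE", 0.5129982371, 0.428592468047179, 1, 1, 0.5129982371, 0.428592468047179),
    ("SUL", 0.1716003192, 0.1223364892753323, 2, 3, 0.6845985563, 0.817631152627177),
    ("NORDESTE", 0.1625622937, 0.2667021953046655, 3, 2, 0.84716085, 0.6952946633518446),
    ("CENTRO-OESTE", 0.093474286, 0.0859424302078561, 4, 5, 0.940635136, 0.999561015368461),
    ("NORTE", 0.058522776, 0.0959874325334278, 5, 4, 0.999157912, 0.9136185851606048),
    ("NOT INFORMED", 0.0008420881, 0.0004389846315391183, 6, 6, 1.0000000001, 1.0),
]


def check_cumulative(rows, share_idx, rank_idx, cum_idx, tol=1e-9):
    """True if the reported cumulative column equals the running sum of shares in rank order."""
    ordered = sorted(rows, key=lambda r: r[rank_idx])
    running = accumulate(r[share_idx] for r in ordered)
    return all(abs(total - r[cum_idx]) <= tol for total, r in zip(running, ordered))


# Value shares accumulated by value_rank; transaction shares by transaction_rank.
value_ok = check_cumulative(rows, share_idx=1, rank_idx=3, cum_idx=5)
count_ok = check_cumulative(rows, share_idx=2, rank_idx=4, cum_idx=6)

# Accumulating transaction shares in value_rank order does not reproduce the column.
count_in_value_order = check_cumulative(rows, share_idx=2, rank_idx=3, cum_idx=6)

print(value_ok, count_ok, count_in_value_order)
```

When reading or charting `cumulative_transaction_share`, sorting by `transaction_rank` rather than `value_rank` gives a monotonically increasing curve for these rows.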


## Notes

This notebook focuses on evidence generation for the business questions that are answerable with the available aggregated PIX dataset (2023–2024).

Two proposed questions are not answered here and are discussed in the final conclusion:
- income level of the most active users
- essential vs non-essential expense classification
