Client: Federal Government of Brazil / Central Bank of Brazil  
Topic: PIX user behavior  
Analyzed period: 2023 and 2024  
Platform: Databricks Free Edition  
Storage: Unity Catalog + Volumes

---
# Project Context

This work is part of the MVP for the Data Engineering module and aims to build a cloud-based data pipeline, from data collection to analysis, using real and public data.

The project was developed in a governmental context, considering the Central Bank as the main stakeholder. The focus is on analyzing the behavior of users of the PIX instant payment system, seeking to understand usage patterns over time, differences across user profiles, and regional variations.

Although the project applies Data Engineering concepts and tools, it prioritizes clarity, organization, and end-to-end process understanding, focusing on building a functional and well-documented pipeline rather than overly complex solutions.

---
# Dataset Used

The dataset used in this MVP comes from the Central Bank of Brazil through the PIX – Open Data initiative, made available via a public API.

The data represents aggregated monthly statistics of PIX transactions, including information on transaction volume, total value, and general characteristics of the users involved, such as age group, region, transaction nature, and purpose.

For this work, data from the years 2023 and 2024 was used, stored in two separate CSV files:

* pix_2023.csv  
* pix_2024.csv  

The files were downloaded locally from the official API and later uploaded to the Databricks environment, where they became part of the raw data layer of the pipeline.

Main columns used in the analysis:

* AnoMes: transaction period (year and month)  
* PAG_IDADE: payer age group  
* REC_IDADE: receiver age group  
* PAG_REGIAO: payer region  
* REC_REGIAO: receiver region  
* NATUREZA: transaction nature  
* FINALIDADE: transaction purpose  
* QUANTIDADE: number of transactions  
* VALOR: total transacted value  

Some additional columns were kept in the raw data for documentation and context only; they were not used directly in the final analyses.
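A quick header check can confirm that a downloaded file contains the columns listed above before it enters the pipeline. The snippet below is a minimal sketch using only the Python standard library. The comma delimiter and the sample header are assumptions for illustration, and the helper name `missing_columns` is hypothetical.

```python
import csv
import io

# Columns the analysis relies on, as listed above.
EXPECTED_COLUMNS = [
    "AnoMes", "PAG_IDADE", "REC_IDADE", "PAG_REGIAO", "REC_REGIAO",
    "NATUREZA", "FINALIDADE", "QUANTIDADE", "VALOR",
]


def missing_columns(csv_file, delimiter=","):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.reader(csv_file, delimiter=delimiter)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [col for col in EXPECTED_COLUMNS if col not in present]


# Made-up header for illustration; the real files may contain more columns
# and may use a different delimiter.
sample = io.StringIO(
    "AnoMes,PAG_IDADE,REC_IDADE,PAG_REGIAO,NATUREZA,QUANTIDADE,VALOR\n"
)
print(missing_columns(sample))  # ['REC_REGIAO', 'FINALIDADE']
```

An empty result means every analytical column is present; extra columns in the file are simply ignored by the check.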

---
# Data Pipeline Architecture

To organize the pipeline, a layered architecture (Bronze, Silver, Gold) was adopted. This pattern is widely used in data projects because it separates responsibilities between stages and makes each transformation traceable.

## Bronze Layer (raw data)

The Bronze layer contains the data exactly as obtained, without any transformation or analytical treatment.

In this layer:

* the original CSV files are stored in Databricks;  
* the goal is to preserve the original data source;  
* potential issues or inconsistencies are not yet addressed.

This layer serves as a reliable reference point for auditing and reprocessing, if needed.
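One way to support the auditing role of this layer is to record a checksum for each raw file, so that later runs can confirm the Bronze data has not changed. This is an illustrative sketch and not a step taken from the project itself. The helper names are hypothetical, and a temporary directory stands in for the Bronze volume.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large CSVs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each CSV file name in the directory to its SHA-256 checksum."""
    return {
        p.name: sha256_of_file(p)
        for p in sorted(Path(directory).glob("*.csv"))
    }


# Temporary directory used as a stand-in for the Bronze volume (assumption).
with tempfile.TemporaryDirectory() as bronze_dir:
    Path(bronze_dir, "pix_2023.csv").write_text(
        "AnoMes,QUANTIDADE\n", encoding="utf-8"
    )
    first_run = build_manifest(bronze_dir)
    second_run = build_manifest(bronze_dir)
    # Unchanged files yield identical checksums across runs.
    print(first_run == second_run)  # True
```

Storing the manifest next to the raw files gives a simple reference for detecting accidental edits before reprocessing.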

## Silver Layer (processed data)

The Silver layer contains cleaned and standardized data, ready for analytical modeling.

At this stage:

* data from 2023 and 2024 is unified;  
* data types are adjusted (e.g., numeric values and dates);  
* relevant columns are organized and standardized;  
* the data gains a consistent structure.

The goal of the Silver layer is to prepare the data for analysis without yet applying specific business rules.
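The Silver steps listed above (unify, cast, standardize) can be sketched in plain Python as follows. In Databricks the same logic would more likely be written with Spark DataFrames; the standard library is used here only so the example runs anywhere. The `AnoMes` format (`YYYYMM`), the comma delimiter, the decimal format of `VALOR`, the lower-cased output column names, and all sample rows are assumptions for illustration.

```python
import csv
import io
from datetime import date
from decimal import Decimal

TEXT_COLUMNS = [
    "PAG_IDADE", "REC_IDADE", "PAG_REGIAO", "REC_REGIAO",
    "NATUREZA", "FINALIDADE",
]
HEADER = "AnoMes," + ",".join(TEXT_COLUMNS) + ",QUANTIDADE,VALOR\n"


def to_silver(raw_files):
    """Unify raw CSV files and cast the analytical columns to proper types."""
    silver = []
    for raw in raw_files:
        for row in csv.DictReader(raw):
            # Standardize text columns: trimmed values, lower-case names.
            record = {name.lower(): row[name].strip() for name in TEXT_COLUMNS}
            year_month = row["AnoMes"].strip()  # assumed to look like "202301"
            record["period"] = date(int(year_month[:4]), int(year_month[4:6]), 1)
            record["quantidade"] = int(row["QUANTIDADE"])
            record["valor"] = Decimal(row["VALOR"])  # exact decimal arithmetic
            silver.append(record)
    return silver


# Made-up rows standing in for pix_2023.csv and pix_2024.csv.
raw_2023 = io.StringIO(HEADER + "202301,A,B,R1,R2,N1,F1,10,1500.50\n")
raw_2024 = io.StringIO(HEADER + "202401,A,A,R1,R1,N1,F1,20,3000.25\n")

rows = to_silver([raw_2023, raw_2024])
print(len(rows), rows[0]["period"], rows[1]["valor"])  # 2 2023-01-01 3000.25
```

`Decimal` is chosen over `float` so that monetary totals are not affected by binary rounding; this is a design choice of the sketch, not a requirement of the dataset.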

## Gold Layer (analytical model)

The Gold layer represents the final analytical view of the data, structured to facilitate queries and analysis.

In this layer:

* data is organized into a star schema;  
* dimension tables are created (time, user, region, nature, purpose);  
* a fact table is created to concentrate the main transaction and value metrics.

This structure enables clear and efficient answers to the business questions defined at the beginning of the project.
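To make the star schema concrete, the sketch below derives dimension lookups and a fact table from a few Silver-style rows. It is a simplified illustration under assumptions: the surrogate keys, the key column names, and the choice to reuse the user and region dimensions for both payer and receiver are not taken from the project, and the sample rows are made up.

```python
def build_dimension(values):
    """Assign a surrogate key (1, 2, ...) to each distinct value, in sorted order."""
    return {value: key for key, value in enumerate(sorted(set(values)), start=1)}


def build_star_schema(silver_rows):
    """Split Silver rows into dimension lookups and a fact table of keys plus metrics."""
    dims = {
        "time": build_dimension(r["period"] for r in silver_rows),
        "user": build_dimension(
            v for r in silver_rows for v in (r["pag_idade"], r["rec_idade"])
        ),
        "region": build_dimension(
            v for r in silver_rows for v in (r["pag_regiao"], r["rec_regiao"])
        ),
        "nature": build_dimension(r["natureza"] for r in silver_rows),
        "purpose": build_dimension(r["finalidade"] for r in silver_rows),
    }
    fact = [
        {
            "time_key": dims["time"][r["period"]],
            "payer_user_key": dims["user"][r["pag_idade"]],
            "receiver_user_key": dims["user"][r["rec_idade"]],
            "payer_region_key": dims["region"][r["pag_regiao"]],
            "receiver_region_key": dims["region"][r["rec_regiao"]],
            "nature_key": dims["nature"][r["natureza"]],
            "purpose_key": dims["purpose"][r["finalidade"]],
            "quantidade": r["quantidade"],
            "valor": r["valor"],
        }
        for r in silver_rows
    ]
    return dims, fact


# Made-up Silver rows for illustration only.
silver_rows = [
    {"period": "2023-01", "pag_idade": "A", "rec_idade": "B",
     "pag_regiao": "R1", "rec_regiao": "R2", "natureza": "N1",
     "finalidade": "F1", "quantidade": 10, "valor": 1500.0},
    {"period": "2024-01", "pag_idade": "B", "rec_idade": "B",
     "pag_regiao": "R1", "rec_regiao": "R1", "natureza": "N1",
     "finalidade": "F1", "quantidade": 20, "valor": 3000.0},
]

dims, fact = build_star_schema(silver_rows)
print(dims["user"])  # {'A': 1, 'B': 2}
print(fact[1]["payer_user_key"], fact[1]["receiver_region_key"])  # 2 1
```

The fact table holds only keys and the two metrics, so a business question such as "transactions per region over time" becomes a join between the fact table and the relevant dimensions followed by an aggregation.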

---
# Databricks Organization

The project uses Unity Catalog for data organization, with the following structure:

* Catalog: mvp_pix  
* Schema: dados  
* Volume (Bronze): /Volumes/mvp_pix/dados/bronze/

This organization facilitates pipeline visualization, project documentation, and the generation of evidence for evaluation.
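The structure above maps onto two naming conventions in Unity Catalog: tables are addressed as `catalog.schema.table`, and files in a volume are addressed as `/Volumes/<catalog>/<schema>/<volume>/<file>`. The small helpers below sketch how notebook code could build these names consistently. The helper names and the table name `silver_pix` are hypothetical.

```python
CATALOG = "mvp_pix"
SCHEMA = "dados"
BRONZE_VOLUME = "bronze"


def table_name(table):
    """Return the three-level Unity Catalog name: catalog.schema.table."""
    return f"{CATALOG}.{SCHEMA}.{table}"


def bronze_path(file_name):
    """Return the path of a raw file inside the Bronze volume."""
    return f"/Volumes/{CATALOG}/{SCHEMA}/{BRONZE_VOLUME}/{file_name}"


print(bronze_path("pix_2023.csv"))  # /Volumes/mvp_pix/dados/bronze/pix_2023.csv
print(table_name("silver_pix"))     # mvp_pix.dados.silver_pix (hypothetical table)
```

Keeping the catalog and schema in one place avoids hard-coded paths scattered across notebooks.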

---
# Final Note

This MVP does not aim to exhaust the analytical possibilities of the dataset. Its purpose is to demonstrate, in a structured and functional way, the construction of a complete data pipeline, from data collection to analysis, with justified technical decisions and clear documentation.