# 1. Data Inspection - BanVic Project

**Objective:** Perform an initial exploratory data analysis (EDA) on the raw CSV files provided for the BanVic challenge. This notebook aims to understand the structure, quality, and relationships within the data before the transformation phase.

In [17]:
import os
import pandas as pd
from dotenv import load_dotenv

# Load environment variables (like GOOGLE_APPLICATION_CREDENTIALS)
load_dotenv()

# --- Configuration ---
# Set the path to the directory containing the raw data
# Assumes the CSV files are in the 'data/' folder one level above the notebook's location
DATA_DIR = os.path.join('..', 'data')

# List to hold all found csv files
csv_files = [f for f in os.listdir(DATA_DIR) if f.endswith('.csv')]

print(f"Found {len(csv_files)} files in '{DATA_DIR}':")
for file in csv_files:
    print(f"- {file}")

# Setting pandas display options for better visualization
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 100)

Found 7 files in '..\data':
- agencies.csv
- bank_accounts.csv
- clients.csv
- credit_proposal.csv
- employee.csv
- employee_agency.csv
- transactions.csv


# 2. Automated Overview of All Files

Loop through each CSV file to get a high-level, consistent overview of its contents.

In [18]:
# Create a dictionary to hold all dataframes for later use
dataframes = {}

for file in csv_files:
    file_path = os.path.join(DATA_DIR, file)
    df_name = file.replace('.csv', '')
    
    print(f"--- INSPECTING: {file} ---")
    
    try:
        df = pd.read_csv(file_path)
        dataframes[df_name] = df
        
        print("\n[Shape]")
        print(df.shape)
        
        print("\n[Head]")
        print(df.head())
        
        print("\n[Info]")
        df.info()
        
        print("\n[Describe]")
        # include='all' provides stats for both numeric and categorical columns
        print(df.describe(include='all'))
        
    except Exception as e:
        print(f"Could not read or process file {file}. Error: {e}")
        
    print("\n" + "="*50 + "\n")

--- INSPECTING: agencies.csv ---

[Shape]
(10, 7)

[Head]
   cod_agencia              nome                                           endereco     cidade  \
0            7   Agência Digital  Av. Paulista, 1436 - Cerqueira César, São Paul...  São Paulo   
1            1    Agência Matriz  Av. Paulista, 1436 - Cerqueira César, São Paul...  São Paulo   
2            2   Agência Tatuapé  Praça Sílvio Romero, 158 - Tatuapé, São Paulo ...  São Paulo   
3            3  Agência Campinas  Av. Francisco Glicério, 895 - Vila Lidia, Camp...   Campinas   
4            4    Agência Osasco  Av. Antônio Carlos Costa, 1000 - Bela Vista, O...     Osasco   

   uf data_abertura tipo_agencia  
0  SP    2015-08-01      Digital  
1  SP    2010-01-01       Física  
2  SP    2010-06-14       Física  
3  SP    2012-03-04       Física  
4  SP    2013-11-06       Física  

[Info]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column         Non-Null Count

# 3. Deep Dive into Key Tables

Now, we'll perform a more detailed analysis on the most relevant tables: `clients`, `transactions`, and `credit_proposal`.

In [10]:
# --- Clients Analysis ---
print("--- DEEP DIVE: clients ---")
df_clients = dataframes.get('clients')
if df_clients is not None:
    print("\n[Missing Value Percentage]")
    print((df_clients.isnull().sum() / len(df_clients) * 100).sort_values(ascending=False))
    
    print(f"\n[Duplicate Rows]: {df_clients.duplicated().sum()}")
    
    print("\n[Value Counts for 'cep']")
    print(df_clients['cep'].value_counts())

print("\n" + "="*50 + "\n")

# --- Transactions Analysis ---
print("--- DEEP DIVE: transactions ---")
df_transactions = dataframes.get('transactions')
if df_transactions is not None:
    # Convert the date column to timezone-aware datetimes for time-based analysis
    # (format='mixed' requires pandas >= 2.0 and parses each value individually)
    df_transactions['data_transacao'] = pd.to_datetime(df_transactions['data_transacao'], format='mixed', utc=True)
    
    print("\n[Missing Value Percentage]")
    print((df_transactions.isnull().sum() / len(df_transactions) * 100).sort_values(ascending=False))
    
    print(f"\n[Duplicate Rows]: {df_transactions.duplicated().sum()}")
    
    print(f"\n[Transaction Date Range]: {df_transactions['data_transacao'].min()} to {df_transactions['data_transacao'].max()}")
    
print("\n" + "="*50 + "\n")

# --- Credit Proposal Analysis ---
print("--- DEEP DIVE: credit_proposal ---")
df_credit_proposal = dataframes.get('credit_proposal')
if df_credit_proposal is not None:
    print("\n[Missing Value Percentage]")
    print((df_credit_proposal.isnull().sum() / len(df_credit_proposal) * 100).sort_values(ascending=False))

    print(f"\n[Duplicate Rows]: {df_credit_proposal.duplicated().sum()}")

    print("\n[Value Counts for 'status_proposta']")
    print(df_credit_proposal['status_proposta'].value_counts())

--- DEEP DIVE: clients ---

[Missing Value Percentage]
cod_cliente        0.0
primeiro_nome      0.0
ultimo_nome        0.0
email              0.0
tipo_cliente       0.0
data_inclusao      0.0
cpfcnpj            0.0
data_nascimento    0.0
endereco           0.0
cep                0.0
dtype: float64

[Duplicate Rows]: 0

[Value Counts for 'cep']
cep
95140-704    1
76516-765    1
51779625     1
19615792     1
01672838     1
            ..
08264521     1
55045-265    1
88159-361    1
36211-005    1
15386938     1
Name: count, Length: 998, dtype: int64


--- DEEP DIVE: transactions ---

[Missing Value Percentage]
cod_transacao      0.0
num_conta          0.0
data_transacao     0.0
nome_transacao     0.0
valor_transacao    0.0
dtype: float64

[Duplicate Rows]: 0

[Transaction Date Range]: 2010-02-27 16:39:46+00:00 to 2023-01-15 15:57:23.974201+00:00


--- DEEP DIVE: credit_proposal ---

[Missing Value Percentage]
cod_proposta             0.0
cod_cliente              0.0
cod_colaborador     

# 4. Key and Relationship Analysis

Here, we check the integrity of primary and foreign keys to understand how the tables connect and to find potential orphan records.

In [None]:
# --- Primary Key Uniqueness Check ---
print("--- PRIMARY KEY UNIQUENESS ---")

# Check if 'cod_cliente' is unique in the clients table
is_client_id_unique = dataframes['clients']['cod_cliente'].is_unique
print(f"Is 'cod_cliente' in 'clients' unique? -> {is_client_id_unique}")

# Check if 'num_conta' is unique in the bank_accounts table
is_account_id_unique = dataframes['bank_accounts']['num_conta'].is_unique
print(f"Is 'num_conta' in 'bank_accounts' unique? -> {is_account_id_unique}")

# Check if 'cod_agencia' is unique in the agencies table
is_agency_id_unique = dataframes['agencies']['cod_agencia'].is_unique
print(f"Is 'cod_agencia' in 'agencies' unique? -> {is_agency_id_unique}")

# Check if 'cod_colaborador' is unique in the employee table
is_employee_id_unique = dataframes['employee']['cod_colaborador'].is_unique
print(f"Is 'cod_colaborador' in 'employee' unique? -> {is_employee_id_unique}")

# --- Referential Integrity Check ---
print("\n--- REFERENTIAL INTEGRITY ---")

# Check for orphan records in 'bank_accounts' (accounts with no client)
accounts_ids = set(dataframes['bank_accounts']['cod_cliente'])
clients_ids = set(dataframes['clients']['cod_cliente'])
orphan_accounts = accounts_ids - clients_ids
print(f"Found {len(orphan_accounts)} orphan accounts (cod_cliente in 'bank_accounts' but not in 'clients').")

# Check for orphan records in 'transactions' (transactions with no account)
transactions_account_ids = set(dataframes['transactions']['num_conta'])
accounts_ids_in_accounts_table = set(dataframes['bank_accounts']['num_conta'])
orphan_transactions = transactions_account_ids - accounts_ids_in_accounts_table
print(f"Found {len(orphan_transactions)} orphan transactions (num_conta in 'transactions' but not in 'bank_accounts').")

--- PRIMARY KEY UNIQUENESS ---
Is 'cod_cliente' in 'clients' unique? -> True
Is 'num_conta' in 'bank_accounts' unique? -> True
Is 'cod_agencia' in 'agencies' unique? -> True
Is 'cod_colaborador' in 'employee' unique? -> True

--- REFERENTIAL INTEGRITY ---
Found 1 orphan accounts (cod_cliente in 'bank_accounts' but not in 'clients').
Found 0 orphan transactions (num_conta in 'transactions' but not in 'bank_accounts').
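
The set differences above count distinct key values, not rows. A row-level variant can be sketched with a left merge and pandas' `indicator` flag. The helper name `find_orphans` and the toy frames are assumptions for illustration, not part of the BanVic data.

```python
import pandas as pd

def find_orphans(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return the rows of `child` whose `key` value has no match in `parent`."""
    # De-duplicate the parent keys so the left merge cannot multiply child rows.
    parent_keys = parent[[key]].drop_duplicates()
    merged = child.merge(parent_keys, on=key, how="left", indicator=True)
    # "left_only" marks child rows without a matching parent key.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Toy frames (hypothetical values) that mimic the clients / bank_accounts relationship.
clients = pd.DataFrame({"cod_cliente": [1, 2, 3]})
bank_accounts = pd.DataFrame(
    {"num_conta": [10, 11, 12, 13], "cod_cliente": [1, 2, 3, 99]}
)

orphans = find_orphans(bank_accounts, clients, "cod_cliente")
print(orphans)
```

With the notebook's data, the same helper could be called as `find_orphans(dataframes['bank_accounts'], dataframes['clients'], 'cod_cliente')` to see the full orphan row rather than only a count.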


# 5. Summary of Findings & Next Steps

The initial data inspection is complete. The raw data is highly complete, but it needs specific cleaning and standardization steps before it can support reliable analysis.

## Key Findings

* **Excellent Completeness:**
    * There are **zero missing (null) values** across all 7 tables.
    * There are **zero fully duplicated rows** in the key business tables (`clients`, `transactions`, `credit_proposal`).

* **Valid Primary Keys:**
    * All identified primary keys (`cod_cliente`, `num_conta`, `cod_agencia`, `cod_colaborador`) were confirmed to be **100% unique**, making them reliable identifiers for their respective tables.

* **Full Traceability of Transactions:**
    * The integrity check confirmed there are **zero orphan transactions**. This is a crucial finding, as it means every one of the 71,999 transaction records is correctly linked to an existing bank account in the `bank_accounts` table. For a financial institution, this indicates a healthy data trail where every transaction can be traced to an account.

* **Identified Data Integrity Issue:**
    * One **"orphan" account record** was found: one account in `bank_accounts` has a `cod_cliente` that does not exist in the `clients` table. This is consistent with the initial discrepancy of 999 accounts against only 998 clients.

### Handling the Orphan Account

For the purposes of this academic exercise, this single orphan account will be filtered out during the transformation phase to ensure the integrity of our final data model.

However, it is critical to state that **in a real-world business scenario, simply deleting this record would be incorrect and dangerous**. An untraceable account, even just one, is a significant red flag for data governance, compliance, and financial auditing. The correct procedure would be to:
1.  **Flag the record** instead of deleting it.
2.  **Generate a report** for the data governance or source system team.
3.  **Initiate an investigation** to understand why the client record is missing and correct the data at its source.
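
A minimal sketch of the "flag instead of delete" approach, assuming a hypothetical boolean column `is_orphan` and toy data in place of the real tables:

```python
import pandas as pd

# Toy frames (hypothetical values); in the notebook these would come from `dataframes`.
clients = pd.DataFrame({"cod_cliente": [1, 2]})
bank_accounts = pd.DataFrame(
    {"num_conta": [10, 11, 12], "cod_cliente": [1, 2, 99]}
)

# Step 1: flag the record instead of deleting it.
flagged = bank_accounts.copy()
flagged["is_orphan"] = ~flagged["cod_cliente"].isin(clients["cod_cliente"])

# Step 2: build a report for the data governance or source system team.
# to_csv() without a path returns the CSV text; a real run would pass a file path.
orphan_report = flagged.loc[flagged["is_orphan"]]
report_csv = orphan_report.to_csv(index=False)
print(report_csv)
```

Every row is kept, so downstream steps can choose to exclude flagged accounts while the original record stays available for the investigation.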

## Action Plan for Transformation Phase

Based on the findings, the following tasks must be performed to create a clean and reliable Data Warehouse:

1.  **Data Type Conversion:** Convert all date and timestamp columns (currently `object` type) to the proper `datetime` type.
2.  **Text Standardization:** Standardize the `cep` column in the `clients` table to a single format (e.g., only numbers, without hyphens).
3.  **Text Parsing:** Parse the `endereco` column in the `clients` table into multiple structured columns (e.g., street, city, state).
4.  **Data Filtering:** Filter out the single orphan account record from the `bank_accounts` table to ensure referential integrity in the final model.
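
Steps 1, 2, and 4 can be sketched as follows. This is an illustration under stated assumptions, not the final transformation code: the helper `normalize_cep` is hypothetical, the date strings are invented sample values, and the `zfill(8)` padding assumes a CEP always has 8 digits. Address parsing (step 3) is left out because the full `endereco` format is not visible in the inspection output.

```python
import pandas as pd

def normalize_cep(cep: pd.Series) -> pd.Series:
    """Keep digits only and left-pad to 8 characters (assumed CEP length)."""
    return cep.astype("string").str.replace(r"\D", "", regex=True).str.zfill(8)

# Toy frames: the two CEP formats match those seen in the clients table;
# the dates and IDs are hypothetical sample values.
clients = pd.DataFrame({
    "cod_cliente": [1, 2],
    "cep": ["95140-704", "01672838"],
    "data_inclusao": ["2017-04-03T16:11:00+00:00", "2021-02-10T13:27:00+00:00"],
})
bank_accounts = pd.DataFrame(
    {"num_conta": [10, 11, 12], "cod_cliente": [1, 2, 99]}
)

# Step 1: convert object-typed date columns to timezone-aware datetimes.
clients["data_inclusao"] = pd.to_datetime(clients["data_inclusao"], utc=True)

# Step 2: standardize the CEP to a digits-only format.
clients["cep"] = normalize_cep(clients["cep"])

# Step 4: keep only accounts whose client exists in the clients table.
clean_accounts = bank_accounts[bank_accounts["cod_cliente"].isin(clients["cod_cliente"])]

print(clients)
print(clean_accounts)
```

Reading `cep` as text matters here: if the column were parsed as an integer, leading zeros such as the one in `01672838` would be lost before normalization.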