### Provision Layer for BBVA bank, debit account

**Author**: Ricardo Pérez Castillo

**Latest update**: 2024-12-30

**Version**: 6.0

**Purpose**: Consolidate expense data into a single, unified source of truth.

### Table of Contents
1. [Introduction](#introduction)
2. [Raw File Importing](#raw-file-importing)
3. [Data Description](#data-description)
4. [Basic Data Cleansing](#basic-data-cleansing)
5. [Entity Harmonization](#entity-harmonization)
6. [Transaction Type Harmonization](#transaction-type-harmonization)
7. [Data Cleansing and Transformation](#data-cleansing-and-transformation)
8. [Exporting](#exporting)

### Introduction

BBVA is one of the largest banks in Mexico and offers both debit and credit accounts. Extracting the data is a challenge because the bank does not offer an API, so every extract has to be downloaded manually. I use the custom dates option to export an entire month into one file. For debit accounts, however, only the latest two months can be extracted. If you forget to export a month in time, the only remaining source is the bank statement, which is in PDF format.

### Raw File Importing

In [1]:
# Import necessary libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical computations
from datetime import datetime  # To handle date and time operations
import pathlib  # For handling and manipulating file paths in an object-oriented way

# Import custom configurations and mappings
from config import current_month, current_month_text, current_year  # Custom configurations for date handling
from entity_mapping import entity_mapping  # Predefined mapping for harmonizing entities
from transaction_subtype_mapping import transaction_subtype_mapping  # Predefined mapping for harmonizing transaction subtypes

# Import utilities for fuzzy matching
from difflib import get_close_matches  # To find close matches between strings

# Import utilities for generating hash values
import hashlib  # For generating hash values


In [2]:
# Define the base directory for file storage
base_dir = pathlib.Path.home() / "Documents" / "Finanzas"
# The pathlib.Path.home() dynamically retrieves the user's home directory.

# Define the input directory for files
input_dir = base_dir / "BBVA" / "BBVA TDB" / "Movimientos" / "2024"
# This constructs the full path to the directory where input files are stored.
# Example: "C:/Users/YourUserName/Documents/Finanzas/BBVA/BBVA TDB/Movimientos/2024"

# Define the output directory for processed files
output_dir = base_dir / "Personal Spend" / "02 Individual Datasets" / "2024"
# This constructs the full path to the directory where processed output files will be saved.
# Example: "C:/Users/YourUserName/Documents/Finanzas/Personal Spend/02 Individual Datasets/2024"

In [3]:
# Function to construct the input file name for a specific month
def get_filename_input(month, year, month_text, prefix="", suffix="", extension=".xlsx"):
    """
    Constructs the file name of the input file based on the provided parameters.

    Parameters:
    - month (int): The numeric month (e.g., 1 for January, 12 for December).
    - year (int): The year (e.g., 2024). Accepted for a uniform signature; not used in the name.
    - month_text (str): The textual representation of the month (e.g., "January"). Not used in the name.
    - prefix (str): An optional prefix for the file name (default is an empty string). Not used in the name.
    - suffix (str): An optional suffix for the file name (default is an empty string).
    - extension (str): The file extension (default is ".xlsx").

    Returns:
    - str: The constructed file name with the format "MM{suffix}{extension}"
           (e.g., "01-movimientos.xlsx" for month=1 and suffix="-movimientos").
    """
    return str(month).zfill(2) + suffix + extension


# Function to construct the output file name (including its month subfolder) for a specific month
def get_filename_output(month, year, month_text, prefix="", suffix="", extension=".csv"):
    """
    Constructs the relative path of the output file based on the provided parameters.

    Parameters:
    - month (int): The numeric month (e.g., 1 for January, 12 for December).
    - year (int): The year (e.g., 2024). Accepted for a uniform signature; not used in the name.
    - month_text (str): The textual representation of the month (e.g., "January").
    - prefix (str): An optional prefix for the file name (default is an empty string).
    - suffix (str): An optional suffix for the file name (default is an empty string). Not used in the name.
    - extension (str): The file extension (default is ".csv").

    Returns:
    - str: The constructed relative path with the format "{month_text}/{prefix}{month_text}MM{extension}"
           (e.g., "January/df-bbva-tdb-January01.csv" for prefix="df-bbva-tdb-").
    """
    return f"{month_text}/{prefix}{month_text}{str(month).zfill(2)}{extension}"


In [None]:
# Create file paths for the current month's input and output files

# Construct the full input file path
input_file = input_dir / get_filename_input(
    current_month,
    current_year,
    current_month_text,
    prefix="",
    suffix="-movimientos"
)

# Construct the full output file path
output_file = output_dir / get_filename_output(
    current_month,
    current_year,
    current_month_text,
    prefix="df-bbva-tdb-",
    suffix=""
)

# Print the constructed file paths
print("Input file path: ", input_file)
print("Output file path: ", output_file)
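
To make the naming rules above concrete, here is a minimal sketch that rebuilds both paths for one month. The month values are hypothetical stand-ins for what `config.py` provides, and `PurePosixPath` is used only so the printed result looks the same on every operating system.

```python
import pathlib

# Hypothetical stand-ins for the values normally imported from config.py
month, month_text = 3, "March"

# Same naming rules as get_filename_input / get_filename_output
input_name = f"{month:02d}-movimientos.xlsx"
output_name = f"{month_text}/df-bbva-tdb-{month_text}{month:02d}.csv"

# Relative base folder used here instead of the real home directory
base_dir = pathlib.PurePosixPath("Documents") / "Finanzas"
input_file = base_dir / "BBVA" / "BBVA TDB" / "Movimientos" / "2024" / input_name
output_file = base_dir / "Personal Spend" / "02 Individual Datasets" / "2024" / output_name

print(input_file)
print(output_file)
```

Note that the output name contains a month subfolder, so that folder has to exist before the export step writes the file.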


In [None]:
# Import monthly data from the specified Excel file

try:
    # Attempt to read the Excel file, skipping the first 3 rows
    df_bbva = pd.read_excel(input_file, skiprows=3)
    print("Data imported successfully!")
except FileNotFoundError:
    # Handle the case where the input file is not found
    print(f"File not found: {input_file}")
except Exception as e:
    # Handle any other exceptions that may occur during file import
    print(f"An error occurred while reading the file: {e}")
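
The effect of `skiprows=3` can be illustrated without the real bank file. The sketch below uses `pd.read_csv` on an in-memory text as a stand-in for `pd.read_excel` (the parameter has the same meaning in both); the banner lines and the transactions are invented for illustration.

```python
import io
import pandas as pd

# Invented stand-in for the bank file: three banner lines, then the real header
raw_text = (
    "BBVA\n"
    "Movimientos\n"
    "Periodo\n"
    "FECHA,DESCRIPCIÓN,CARGO,ABONO,SALDO\n"
    "05/03/2024,OXXO CENTRO,50.00,,950.00\n"
    "06/03/2024,SPEI RECIBIDO / 0012345,,300.00,1250.00\n"
)

# Skipping the first three lines makes the fourth line the header row
preview = pd.read_csv(io.StringIO(raw_text), skiprows=3)
print(preview.columns.tolist())
print(len(preview))
```

If the bank ever changes the number of banner rows, the column names will no longer match the ones listed in the Data Description section, which is an easy symptom to check for.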


### Data Description

The extract contains the following columns:
- **FECHA**: Date in DD/MM/YYYY format
- **DESCRIPCIÓN**: Description that contains the entity and the transaction reference number
- **CARGO**: Charged amount, in MXN
- **ABONO**: Deposited amount, in MXN
- **SALDO**: Balance, in MXN
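
A quick way to confirm that an extract has this layout is to compare its columns against the list above before any cleansing starts. This is a minimal sketch; `missing_columns` is a hypothetical helper and the sample frame is invented.

```python
import pandas as pd

EXPECTED_COLUMNS = ["FECHA", "DESCRIPCIÓN", "CARGO", "ABONO", "SALDO"]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Toy frame standing in for df_bbva; SALDO is deliberately left out
sample = pd.DataFrame({
    "FECHA": ["01/03/2024"],
    "DESCRIPCIÓN": ["OXXO / 123"],
    "CARGO": ["50.00"],
    "ABONO": [None],
})

print(missing_columns(sample))
```

An empty list means the extract can be processed as is; anything else usually points to a changed export layout or a wrong `skiprows` value.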

In [None]:
# Visualize the first few rows of the imported data
try:
    # Display the first five rows of the DataFrame
    print("Preview of the imported data:")
    display(df_bbva.head())
except NameError:
    # Handle the case where the DataFrame does not exist
    print("The data has not been successfully loaded into a DataFrame. Please check the file import process.")
except Exception as e:
    # Handle any other unexpected errors
    print(f"An error occurred during data visualization: {e}")


### Basic Data Cleansing

This section focuses on preparing the financial dataset for analysis by removing irrelevant rows, handling missing values, transforming data types, and splitting columns for better structure. 
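
The same steps can be tried on a handful of invented rows first. The sketch below is a condensed variant of the cleansing logic, not the notebook cell itself: it fills missing amounts with the string `"0"` before parsing, and the sample values are assumptions.

```python
import pandas as pd

# Toy rows mimicking the raw extract (values are invented for illustration)
raw = pd.DataFrame({
    "FECHA": ["05/03/2024", "06/03/2024", None],
    "DESCRIPCIÓN": ["SPEI ENVIADO / 0012345", "OXXO CENTRO", "footer"],
    "CARGO": ["1,250.50", None, None],
    "ABONO": [None, "300.00", None],
})

# Drop rows without a date, then parse the date column
clean = raw.dropna(subset=["FECHA"]).copy()
clean["FECHA"] = pd.to_datetime(clean["FECHA"], format="%d/%m/%Y")

# Fill missing amounts, remove thousands separators, convert to float
for col in ["CARGO", "ABONO"]:
    clean[col] = clean[col].fillna("0").str.replace(",", "", regex=False).astype(float)

# Split the description at the first '/' into entity and detail
parts = clean["DESCRIPCIÓN"].str.split("/", n=1, expand=True)
clean["DESCRIPCION_ENTIDAD"] = parts[0].str.strip()
clean["DESCRIPCION_DETALLE"] = parts[1].str.strip()

print(clean)
```

The second row has no `/` in its description, so its detail column stays empty; later steps need to tolerate such missing values.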


In [7]:

# Step 1: Remove irrelevant rows
df_bbva = df_bbva[df_bbva['FECHA'] != 'BBVA México, S.A. Institución de Banca Múltiple, Grupo Financiero BBVA México.']
# Drop rows where 'FECHA' column is missing
df_bbva = df_bbva.dropna(subset=['FECHA'])

# Step 2: Convert 'FECHA' column to datetime format
df_bbva['FECHA'] = pd.to_datetime(df_bbva['FECHA'], format="%d/%m/%Y")

# Step 3: Replace null values in numeric columns with zero
df_bbva['CARGO'] = df_bbva['CARGO'].fillna(0)
df_bbva['ABONO'] = df_bbva['ABONO'].fillna(0)

# Step 4: Convert numeric columns from strings to floats
# First, ensure all values are strings to handle potential errors
df_bbva['CARGO'] = df_bbva['CARGO'].astype(str)
df_bbva['ABONO'] = df_bbva['ABONO'].astype(str)
df_bbva['SALDO'] = df_bbva['SALDO'].astype(str)

# Remove commas and convert to float
df_bbva['CARGO'] = df_bbva['CARGO'].str.replace(',', '').astype(float)
df_bbva['ABONO'] = df_bbva['ABONO'].str.replace(',', '').astype(float)
df_bbva['SALDO'] = df_bbva['SALDO'].str.replace(',', '').astype(float)

# Step 5: Split 'DESCRIPCIÓN' column into two separate columns
df_bbva[['DESCRIPCION_ENTIDAD', 'DESCRIPCION_DETALLE']] = df_bbva['DESCRIPCIÓN'].str.split('/', n=1, expand=True)

# Step 6: Strip whitespace from the new columns
df_bbva['DESCRIPCION_ENTIDAD'] = df_bbva['DESCRIPCION_ENTIDAD'].str.strip()
df_bbva['DESCRIPCION_DETALLE'] = df_bbva['DESCRIPCION_DETALLE'].str.strip()


### Entity Harmonization

This section focuses on standardizing supplier names within the dataset by applying fuzzy matching techniques against a predefined mapping. The goal is to ensure consistency in supplier names for easier analysis and reporting.
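
To get a feel for how strict a cutoff of 0.8 is, the following sketch prints the best similarity ratio for two raw names. The mapping is a hypothetical two-entry example (the real one lives in `entity_mapping.py`), and `harmonize` is a compact illustrative equivalent of the function defined in this section.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical mapping; the real one is imported from entity_mapping.py
sample_mapping = {"UBER EATS": "Uber Eats", "AMAZON": "Amazon"}

def harmonize(name, mapping, cutoff=0.8):
    """Return the mapped name for the closest key, or the input if none is close enough."""
    matches = get_close_matches(name, mapping.keys(), n=1, cutoff=cutoff)
    return mapping[matches[0]] if matches else name

for raw_name in ["UBER EATS MX", "FARMACIA XYZ"]:
    # Best similarity ratio against any key, for inspection only
    best = max(SequenceMatcher(None, key, raw_name).ratio() for key in sample_mapping)
    print(f"{raw_name!r}: best ratio {best:.2f} -> {harmonize(raw_name, sample_mapping)!r}")
```

Names whose best ratio falls below the cutoff are returned unchanged, which is why reviewing the comparison table and extending the mapping is part of the workflow.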

In [8]:
def harmonize_supplier(supplier_name, entity_mapping):
    """
    Harmonizes a supplier name using fuzzy matching against a predefined mapping.

    Parameters:
    - supplier_name (str): The name of the supplier to be harmonized.
    - entity_mapping (dict): A dictionary where keys are potential supplier names 
                             and values are their harmonized names.

    Returns:
    - str: The harmonized supplier name if a close match is found; 
           otherwise, returns the original supplier name.
    """
    # Use fuzzy matching to find potential matches for the supplier name
    matches = get_close_matches(supplier_name, entity_mapping.keys(), n=1, cutoff=0.8)
    
    # Return the harmonized name if a close match is found
    if matches:
        return entity_mapping[matches[0]]
    
    # Return the original name if no close match is found
    return supplier_name


In [None]:
# Extract the list of suppliers from the 'DESCRIPCION_ENTIDAD' column
suppliers_to_harmonize = df_bbva['DESCRIPCION_ENTIDAD'].tolist()

# Step 1: Harmonize the supplier names using the mapping
harmonized_suppliers = [harmonize_supplier(supplier, entity_mapping) for supplier in suppliers_to_harmonize]

# Step 2: Add the harmonized supplier names back to the DataFrame as a new column
df_bbva['Harmonized_Supplier'] = harmonized_suppliers


# Step 3: Display both 'DESCRIPCION_ENTIDAD' and 'Harmonized_Supplier' for comparison
print("Comparison of original and harmonized supplier names:")
display(df_bbva[['DESCRIPCIÓN','DESCRIPCION_DETALLE', 'DESCRIPCION_ENTIDAD', 'Harmonized_Supplier']])

# Step 4: Review the output and if necessary, update the entity_mapping.py dictionary and re-run the script


### Transaction Type Harmonization

In [10]:
def harmonize_transaction_subtype(transaction_desc, transaction_subtype_mapping):
    """
    Harmonizes the transaction subtype description using fuzzy matching against a predefined mapping.

    Parameters:
    - transaction_desc (str): The transaction description to be harmonized.
    - transaction_subtype_mapping (dict): A dictionary where keys are potential transaction descriptions 
                                          and values are their harmonized subtypes.

    Returns:
    - str: The harmonized transaction subtype if a close match is found; 
           otherwise, returns the original transaction description.
    """
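    # Leave missing values untouched (e.g., descriptions without a '/' have no detail part),
    # because get_close_matches only accepts strings
    if not isinstance(transaction_desc, str):
        return transaction_desc

    # Use fuzzy matching to find potential matches for the transaction subtype
    matches = get_close_matches(transaction_desc, transaction_subtype_mapping.keys(), n=1, cutoff=0.8)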
    
    # Return the harmonized name if a close match is found
    if matches:
        return transaction_subtype_mapping[matches[0]]
    
    # Return the original name if no close match is found
    return transaction_desc

In [None]:
# Extract the list of transaction descriptions from the 'DESCRIPCION_DETALLE' column
transactions_to_harmonize = df_bbva['DESCRIPCION_DETALLE'].tolist()

# Step 1: Harmonize the transaction descriptions using the mapping
harmonized_transactions = [harmonize_transaction_subtype(transaction, transaction_subtype_mapping) for transaction in transactions_to_harmonize]

# Step 2: Add the harmonized transaction subtypes back to the DataFrame as a new column
df_bbva['Harmonized_Transaction_Subtype'] = harmonized_transactions

# Step 3: Display both 'DESCRIPCION_DETALLE' and 'Harmonized_Transaction_Subtype' for comparison
print("Comparison of original and harmonized transaction:")
display(df_bbva[['DESCRIPCIÓN','DESCRIPCION_DETALLE', 'Harmonized_Transaction_Subtype']])

# Step 4: Review the output and if necessary, update the transaction_subtype_mapping.py dictionary and re-run the script

In [None]:
# Second pass: extract the descriptions from the 'DESCRIPCION_ENTIDAD' column instead of 'DESCRIPCION_DETALLE'
# Note: this overwrites the 'Harmonized_Transaction_Subtype' column created in the previous cell
transactions_to_harmonize = df_bbva['DESCRIPCION_ENTIDAD'].tolist()

# Step 1: Harmonize the transaction descriptions using the mapping
harmonized_transactions = [harmonize_transaction_subtype(transaction, transaction_subtype_mapping) for transaction in transactions_to_harmonize]

# Step 2: Add the harmonized transaction subtypes back to the DataFrame as a new column
df_bbva['Harmonized_Transaction_Subtype'] = harmonized_transactions

# Step 3: Display both 'DESCRIPCION_ENTIDAD' and 'Harmonized_Transaction_Subtype' for comparison
print("Comparison of original and harmonized transaction:")
display(df_bbva[['DESCRIPCIÓN','DESCRIPCION_ENTIDAD', 'Harmonized_Transaction_Subtype']])

# Step 4: Review the output and if necessary, update the transaction_subtype_mapping.py dictionary and re-run the script

In [13]:
# Create a new column 'TXT_TRANSACTION_TYPE': 'Expense' if CARGO is non-zero, 'Deposit' if ABONO is non-zero, '-1' otherwise
df_bbva['TXT_TRANSACTION_TYPE'] = np.where(df_bbva['CARGO'] != 0, 'Expense', 
                                       np.where(df_bbva['ABONO'] != 0, 'Deposit', '-1'))
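
For readers who find nested `np.where` calls hard to scan, the same rule can be written with `np.select`, which evaluates the conditions in order and falls back to a default. This is an alternative sketch on invented amounts, not a change to the cell above.

```python
import numpy as np
import pandas as pd

# Invented amounts: one charge, one deposit, one row with neither
sample = pd.DataFrame({"CARGO": [120.0, 0.0, 0.0], "ABONO": [0.0, 500.0, 0.0]})

# Conditions are checked in order; the first one that holds wins
conditions = [sample["CARGO"] != 0, sample["ABONO"] != 0]
choices = ["Expense", "Deposit"]
sample["TXT_TRANSACTION_TYPE"] = np.select(conditions, choices, default="-1")

print(sample)
```

Rows that end up with `'-1'` have neither a charge nor a deposit and are worth inspecting before export.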


In [None]:
df_bbva.head()

### Data Cleansing and Transformation

In [15]:
# Create a new column that identifies the source system
df_bbva['KEY_SYSTEM'] = 'BBVA'

# Create a new column that identifies the source account
df_bbva['KEY_ACCOUNT'] = 'BBVATDB'
df_bbva['TXT_ACCOUNT'] = 'BBVA TDB'

# Create a new column that combines CARGO and ABONO columns
df_bbva['NUM_AMT_NET_REPORTING'] = np.where(df_bbva['CARGO'] != 0, df_bbva['CARGO'], df_bbva['ABONO']).astype(float)

# Create columns that contain the currency-related information
df_bbva['NUM_AMT_DOCUMENT'] = df_bbva['NUM_AMT_NET_REPORTING']
df_bbva['KEY_CURRENCY_DOCUMENT'] = 'MXN'
df_bbva['KEY_RATE'] = 1.0

# Create new columns with time information
df_bbva['KEY_MONTH'] = df_bbva['FECHA'].dt.month
df_bbva['KEY_YEAR'] = df_bbva['FECHA'].dt.year

# Create a new column that contains the flag indicating whether the transaction is debit or credit
df_bbva['FLG_DEBIT_CREDIT'] = np.where(df_bbva['CARGO'] != 0, 'C', 'D')

# Create new columns that identify the grouping operation (project, vacation, etc.). These will be blank and filled in later.
df_bbva['KEY_OPERATION'] = ''
df_bbva['TXT_OPERATION'] = ''

# Create new columns relevant for credit card transactions and payments in installments. Not relevant for this account.
df_bbva['DUE_DATE'] = ''
df_bbva['KEY_PAYMENT_TERM'] = ''
df_bbva['TXT_PAYMENT_TERM'] = ''
df_bbva['NUM_AMT_DUE'] = ''
df_bbva['KEY_ID_DUE'] = ''
df_bbva['TXT_ENTITY_DUE'] = ''
df_bbva['TXT_DESC_DUE'] = ''

# Create new columns with the country information. The extract does not include the country, so Mexico is set as the default.
df_bbva['KEY_COUNTRY'] = 'MX'
df_bbva['TXT_COUNTRY'] = 'Mexico'

# Create a new column with the purchase document number, blank initially.
df_bbva['KEY_PURCH_DOC_NO'] = ''

# Create a new column with internal flag
df_bbva['FLG_INTERNAL'] = np.where(df_bbva['Harmonized_Supplier'] == 'BBVA Your Name', 'Y', 'N')

# Create a new column with flag indicating whether the transaction was canceled
df_bbva['FLG_CANCEL'] = ''

# Create a new column with flag indicating whether the transaction was refunded
df_bbva['FLG_REFUND'] = ''

# Create new columns that will be filled later with the master tables
df_bbva['KEY_ENTITY'] = ''
df_bbva['KEY_TRANSACTION_TYPE'] = ''
df_bbva['KEY_TRANSACTION_SUBTYPE'] = ''

# Rename the columns to match the standard naming convention
df_bbva.rename(columns={
    'FECHA': 'KEY_DATE',
    'DESCRIPCIÓN': 'TXT_DESC',
    'Harmonized_Supplier': 'TXT_ENTITY',
    'Harmonized_Transaction_Subtype': 'TXT_TRANSACTION_SUBTYPE'
}, inplace=True)


In [16]:
# Function to generate a hash-based identifier from the date, entity, transaction subtype, and amount
def generate_shorter_hash(row):
    concat_str = f"{row['KEY_DATE']}_{row['TXT_ENTITY']}_{row['TXT_TRANSACTION_SUBTYPE']}_{row['NUM_AMT_NET_REPORTING']}"
    return hashlib.md5(concat_str.encode('utf-8')).hexdigest()  # MD5 generates a 32-character hash

# Apply the function to create the hash-based KEY_ID
df_bbva['KEY_ID'] = df_bbva.apply(generate_shorter_hash, axis=1)
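
One property of this approach is worth knowing: two transactions with the same date, entity, subtype, and amount produce the same `KEY_ID`. The sketch below shows one possible way around that, assuming it is acceptable to add an occurrence counter to the hashed string. The `SEQ` column and `hash_with_seq` are hypothetical names, and the resulting IDs differ from the ones the cell above generates.

```python
import hashlib
import pandas as pd

# Invented rows; the first two are identical on all hashed fields
sample = pd.DataFrame({
    "KEY_DATE": pd.to_datetime(["2024-03-05", "2024-03-05", "2024-03-06"]),
    "TXT_ENTITY": ["Oxxo", "Oxxo", "Uber"],
    "TXT_TRANSACTION_SUBTYPE": ["Groceries", "Groceries", "Transport"],
    "NUM_AMT_NET_REPORTING": [50.0, 50.0, 120.0],
})
key_cols = ["KEY_DATE", "TXT_ENTITY", "TXT_TRANSACTION_SUBTYPE", "NUM_AMT_NET_REPORTING"]

# Occurrence counter: 0 for the first identical row, 1 for the second, and so on
sample["SEQ"] = sample.groupby(key_cols).cumcount()

def hash_with_seq(row):
    """MD5 over the key fields plus the occurrence counter."""
    concat_str = "_".join(str(row[col]) for col in key_cols + ["SEQ"])
    return hashlib.md5(concat_str.encode("utf-8")).hexdigest()

sample["KEY_ID"] = sample.apply(hash_with_seq, axis=1)
print(sample["KEY_ID"].is_unique)
```

The counter depends on row order, so the IDs stay stable only as long as the extract lists identical transactions in the same order on every run.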

In [17]:
# Order the columns based on the standard order
df_bbva = df_bbva[[
    'KEY_ID', 'KEY_SYSTEM', 'KEY_ACCOUNT', 'TXT_ACCOUNT', 'KEY_DATE', 'KEY_MONTH', 'KEY_YEAR',
    'KEY_ENTITY', 'TXT_ENTITY', 'KEY_TRANSACTION_TYPE', 'TXT_TRANSACTION_TYPE', 'KEY_TRANSACTION_SUBTYPE',
    'TXT_TRANSACTION_SUBTYPE', 'TXT_DESC', 'NUM_AMT_NET_REPORTING', 'NUM_AMT_DOCUMENT', 'KEY_CURRENCY_DOCUMENT',
    'KEY_RATE', 'FLG_DEBIT_CREDIT', 'KEY_OPERATION', 'TXT_OPERATION', 'DUE_DATE', 'KEY_PAYMENT_TERM',
    'TXT_PAYMENT_TERM', 'NUM_AMT_DUE', 'KEY_ID_DUE', 'TXT_ENTITY_DUE', 'TXT_DESC_DUE', 'KEY_COUNTRY',
    'TXT_COUNTRY', 'KEY_PURCH_DOC_NO', 'FLG_INTERNAL', 'FLG_CANCEL', 'FLG_REFUND'
]]

In [None]:
# Final visualization of the processed data
df_bbva.head()

### Exporting

In [None]:
# Export the cleaned and processed DataFrame to a CSV file

try:
    # Export DataFrame to the specified output file
    df_bbva.to_csv(output_file, index=False)
    print(f"File successfully exported to: {output_file}")
except FileNotFoundError:
    # Handle the case where the output directory does not exist
    print(f"Output path not found: {output_file}")
except PermissionError:
    # Handle permission issues when writing to the file
    print(f"Permission denied while trying to write to: {output_file}")
except Exception as e:
    # Handle any other unforeseen errors during the export
    print(f"An unexpected error occurred while writing the file: {e}")
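
Because the output name includes a month subfolder, and `to_csv` does not create missing folders, it can help to create the parent folder first and to read the file back as a sanity check. The sketch below does this in a temporary directory with an invented two-row frame; the path is hypothetical.

```python
import pathlib
import tempfile
import pandas as pd

# Invented two-row frame standing in for the processed df_bbva
sample = pd.DataFrame({"KEY_ID": ["a1", "b2"], "NUM_AMT_NET_REPORTING": [50.0, 120.5]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical nested output path whose month folder does not exist yet
    out_path = pathlib.Path(tmp) / "March" / "df-bbva-tdb-March03.csv"

    # Create the month folder (and any missing parents) before writing
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(out_path, index=False)

    # Read the file back to confirm that rows and columns survived the export
    round_trip = pd.read_csv(out_path)

print(round_trip.shape)
```

In the notebook itself, the equivalent call would be `output_file.parent.mkdir(parents=True, exist_ok=True)` right before the export.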
