# Combining the Public Ministries Raw Data

The Ontario government publishes statements and schedules for each ministry, which contain detailed schedules of debt and other items.

The 2024-25 files are available at the link below, under "Ministry Statements and Schedules".

https://www.ontario.ca/page/public-accounts-ontario-2024-25

I downloaded the schedules for 2023-24 and 2024-25, renamed each file after its fiscal year, and saved them in the raw_data folder for ease of reference.



## Purpose of this notebook

This notebook combines multiple years of ministry data into a single file, with one amount column per fiscal year.

So far, it covers only two years (2023-24 and 2024-25).
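
As a rough illustration of the target shape, here is a minimal sketch on made-up data. The column names "Ministry" and "Item" and all figures are hypothetical stand-ins, not the real schedule columns; only "Amount $" and the renamed amount columns match what the notebook uses.

```python
import pandas as pd

# Hypothetical toy data: "Ministry" and "Item" stand in for the real key columns.
y1 = pd.DataFrame({
    "Ministry": ["Health", "Education"],
    "Item": ["Salaries", "Grants"],
    "Amount $": [100.0, 50.0],
})
y2 = pd.DataFrame({
    "Ministry": ["Health", "Transportation"],
    "Item": ["Salaries", "Capital"],
    "Amount $": [110.0, 70.0],
})

# Rename the amount column per year, then join on the shared key columns.
# how="outer" keeps rows that appear in only one of the two years.
toy = (
    y1.rename(columns={"Amount $": "Amount_2023_24"})
    .merge(
        y2.rename(columns={"Amount $": "Amount_2024_25"}),
        on=["Ministry", "Item"],
        how="outer",
    )
)
print(toy)
```

The result has one row per distinct key. A row that exists in only one year gets NaN in the other year's amount column.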

In [1]:
import pandas as pd

# --- 1. File paths (adjust if needed) ---
file_2023_24 = "./raw_data/2023-24.csv"
file_2024_25 = "./raw_data/2024-25.csv"
output_file = "./processed_data/combined_tbs_statements.xlsx"

# --- Helpers: normalize text columns to avoid spurious differences ---

def normalize_text_series(s: pd.Series) -> pd.Series:
    """
    Normalize text to reduce spurious differences such as:
    - hyphen vs en dash, em dash, or minus sign
    - leading/trailing whitespace
    - runs of multiple whitespace characters
    """
    # Only touch non-null values
    mask = s.notna()
    s_norm = s.copy()

    s_norm.loc[mask] = (
        s_norm.loc[mask]
        .astype(str)
        .str.strip()
        .str.replace("\u2013", "-", regex=False)  # en dash
        .str.replace("\u2014", "-", regex=False)  # em dash
        .str.replace("\u2212", "-", regex=False) # minus sign
        .str.replace(r"\s+", " ", regex=True)    # collapse multiple spaces
    )

    return s_norm

def normalize_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize every object-dtype (text) column of df in place and return df."""
    text_cols = df.select_dtypes(include=["object"]).columns
    for col in text_cols:
        df[col] = normalize_text_series(df[col])
    return df

# --- 2. Load the CSVs ---
df_2023 = pd.read_csv(file_2023_24)
df_2024 = pd.read_csv(file_2024_25)

# --- 3. Capture original totals for testing ---
orig_total_2023 = df_2023["Amount $"].sum()
orig_total_2024 = df_2024["Amount $"].sum()

# --- 4. Drop unwanted columns: _id and Year ---
cols_to_drop = ["_id", "Year"]

df_2023_clean = df_2023.drop(columns=cols_to_drop, errors="ignore")
df_2024_clean = df_2024.drop(columns=cols_to_drop, errors="ignore")

# --- 5. Normalize text columns BEFORE renaming / merging ---
df_2023_clean = normalize_text_columns(df_2023_clean)
df_2024_clean = normalize_text_columns(df_2024_clean)

# --- 6. Rename amount columns ---
df_2023_clean = df_2023_clean.rename(columns={"Amount $": "Amount_2023_24"})
df_2024_clean = df_2024_clean.rename(columns={"Amount $": "Amount_2024_25"})

# --- 7. Build the join keys (all non-amount columns) ---
# Assumes both files share the same non-amount columns; otherwise the merge raises a KeyError.
key_cols = [c for c in df_2023_clean.columns if c != "Amount_2023_24"]

# --- 8. Merge the two years side-by-side ---
combined = df_2023_clean.merge(
    df_2024_clean,
    on=key_cols,
    how="outer"  # keeps rows that exist in either year
)

# Put the key columns first and the two amount columns last
amount_cols = ["Amount_2023_24", "Amount_2024_25"]
other_cols = [c for c in combined.columns if c not in amount_cols]

combined = combined[other_cols + amount_cols]

# --- 9. Save to Excel ---
# Needs an Excel writer engine such as openpyxl; the processed_data folder must already exist.
combined.to_excel(output_file, index=False)

print("Done! Saved combined file as:", output_file)


Done! Saved combined file as: ./processed_data/combined_tbs_statements.xlsx
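
To see why the normalization step matters for the join, here is a small sketch on invented labels. The `normalize` function is a simplified stand-in for `normalize_text_series` (it assumes no missing values), and the "Item" column and its values are hypothetical. The `indicator=True` option of `merge` adds a `_merge` column that shows whether each row matched in both years or only one.

```python
import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    # Simplified stand-in for normalize_text_series; assumes no missing values.
    return (
        s.astype(str)
        .str.strip()
        .str.replace("\u2013", "-", regex=False)  # en dash -> hyphen
        .str.replace(r"\s+", " ", regex=True)     # collapse whitespace runs
    )

# Hypothetical labels: the same item, typed with an en dash in one year
# and with a hyphen, a double space, and a trailing space in the other.
a = pd.DataFrame({"Item": ["Grants \u2013 Operating"], "Amount_2023_24": [10.0]})
b = pd.DataFrame({"Item": ["Grants -  Operating "], "Amount_2024_25": [12.0]})

# Without normalization the keys differ, so the outer merge yields two rows.
raw = a.merge(b, on="Item", how="outer", indicator=True)
print(raw["_merge"].tolist())

# After normalization the keys are identical and the rows line up.
a["Item"] = normalize(a["Item"])
b["Item"] = normalize(b["Item"])
clean = a.merge(b, on="Item", how="outer", indicator=True)
print(clean["_merge"].tolist())
```

Filtering the real `combined` frame on a `_merge` column built the same way would be one way to list items that appear in only one fiscal year.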


## Checker

The script below is a simple data-integrity guardrail.

It sums the dollar amounts in the raw source CSVs and compares them with the column totals in the merged dataset.

If each difference is within a small tolerance (0.01) of zero, the check passes: the merge neither dropped nor duplicated any amounts.

In [2]:
orig_total_2023 = df_2023["Amount $"].sum()
orig_total_2024 = df_2024["Amount $"].sum()

combined_total_2023 = combined["Amount_2023_24"].sum(skipna=True)
combined_total_2024 = combined["Amount_2024_25"].sum(skipna=True)

print("Combined totals (after merge):")
print(f"  2023-24 total in combined: {combined_total_2023:,.2f}")
print(f"  2024-25 total in combined: {combined_total_2024:,.2f}")
print()

diff_2023 = combined_total_2023 - orig_total_2023
diff_2024 = combined_total_2024 - orig_total_2024

print("Differences (combined - original):")
print(f"  2023-24 difference: {diff_2023:,.2f}")
print(f"  2024-25 difference: {diff_2024:,.2f}")
print()

# Allow a tiny floating point tolerance, but expect zero difference in practice
tolerance = 0.01

if abs(diff_2023) > tolerance or abs(diff_2024) > tolerance:
    raise ValueError(
        "❗ Sum mismatch detected between original CSVs and combined dataset. "
        "Check for duplicate rows, mismatched keys, or merge logic."
    )
else:
    print("✅ Tests passed: sums match for both 2023-24 and 2024-25.")
    print()

Combined totals (after merge):
  2023-24 total in combined: 200,634,829,913.00
  2024-25 total in combined: 221,264,301,149.05

Differences (combined - original):
  2023-24 difference: 0.00
  2024-25 difference: 0.00

✅ Tests passed: sums match for both 2023-24 and 2024-25.
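
If the check ever fails, duplicate key rows are a likely cause: when the same key combination appears more than once, the merge pairs every copy on one side with every copy on the other, which inflates the sums. The sketch below reproduces that on made-up data (the "Ministry" and "Item" columns and all figures are hypothetical) and shows two possible ways to catch it: `DataFrame.duplicated` on the key columns, and the `validate` argument of `merge`.

```python
import pandas as pd

# Hypothetical toy data: the same key appears twice in each year.
y1 = pd.DataFrame({
    "Ministry": ["Health", "Health"],
    "Item": ["Salaries", "Salaries"],
    "Amount_2023_24": [100.0, 20.0],
})
y2 = pd.DataFrame({
    "Ministry": ["Health", "Health"],
    "Item": ["Salaries", "Salaries"],
    "Amount_2024_25": [110.0, 30.0],
})
keys = ["Ministry", "Item"]

# Each of the 2 left rows pairs with each of the 2 right rows: 4 rows in total.
merged = y1.merge(y2, on=keys, how="outer")
print(len(merged))
print(merged["Amount_2023_24"].sum(), "vs original", y1["Amount_2023_24"].sum())

# List every row whose key combination occurs more than once.
dupes = y1[y1.duplicated(subset=keys, keep=False)]
print(dupes)

# Alternatively, let merge itself refuse non-unique keys.
try:
    y1.merge(y2, on=keys, how="outer", validate="one_to_one")
except pd.errors.MergeError as exc:
    print("Merge rejected:", exc)
```

In this toy case the 2023-24 column sums to 240.0 after the merge against 120.0 in the source, which is exactly the kind of gap the checker is designed to flag.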

