# Morpho Blue Attribution Features Demo

This notebook demonstrates the attribution layer that decomposes utilization dynamics into flow contributions.

## What You'll Learn

- **Flow Decomposition**: How borrows, repays, liquidations, supplies, and withdrawals contribute to market dynamics
- **Utilization Attribution**: Quantify the impact of each flow type on utilization changes
- **IRM Diagnostics**: Analyze Interest Rate Model responsiveness
- **Rolling Windows**: Multi-timeframe analysis from 5min to 365 days
- **Data Quality**: Validate attribution accuracy with integrity checks

## Requirements

Make sure you have:
- Morpho Blue market data in `data_examples/` directory
- All dependencies installed via `rye sync`

In [1]:
import sys
from pathlib import Path

# Add src to the import path so the morpho_blue package resolves
src_path = Path.cwd().parent / "src"
sys.path.insert(0, str(src_path))

import duckdb
import pandas as pd
import numpy as np

from morpho_blue.indicator_service import IndicatorService
from morpho_blue.data.parquet_repository import ParquetRepository
from morpho_blue.transformations import AttributionWindowSpec

print("✓ Imports successful")

✓ Imports successful


## Setup: Load Data and Initialize Service

In [2]:
# Market ID for the cbBTC market
CBBTC_MARKET_ID = "0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e314a26f04d312434191836"

# Data directory
DATA_ROOT = Path.cwd().parent / "data_examples"

print(f"Data root: {DATA_ROOT}")
print(f"Market ID: {CBBTC_MARKET_ID}")

Data root: /home/youssef/morpho_blue/data_examples
Market ID: 0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e314a26f04d312434191836


In [3]:
# Initialize DuckDB connection
con = duckdb.connect()

# Initialize repository
repo = ParquetRepository(con=con, root=str(DATA_ROOT))

# Initialize service
service = IndicatorService(repo=repo, con=con)

print("✓ Service initialized")

✓ Service initialized


In [4]:
ledger = service.build_market_dataset(market_id=CBBTC_MARKET_ID)
print("✓ Market dataset built")


✓ Market dataset built


In [5]:
ledger_df = ledger.to_pandas()

In [6]:
ledger_df.columns

Index(['market_id', 'block_number', 'log_index', 'block_timestamp', 'tx_hash',
       'event_type', 'delta_supply_assets', 'delta_borrow_assets',
       'delta_collateral_assets', 'total_supplied_assets',
       'outstanding_borrow_assets', 'total_collateral_assets',
       'borrow_rate_per_sec', 'utilization_rate', 'borrow_apy',
       'supply_rate_per_sec', 'supply_apy', 'delta_utilization',
       'delta_borrow_apy', 'delta_supply_apy', 'borrow_apr',
       'borrow_in_assets', 'borrow_out_assets', 'supply_in_assets',
       'supply_out_assets', 'interest_assets', 'util_mean_5min',
       'util_std_5min', 'util_mean_1h', 'util_std_1h', 'util_mean_6h',
       'util_std_6h', 'borrow_apr_mean_5min', 'borrow_apr_std_5min',
       'borrow_apr_mean_1h', 'borrow_apr_std_1h', 'borrow_apr_mean_6h',
       'borrow_apr_std_6h', 'borrow_intensity_5min', 'repay_intensity_5min',
       'supply_intensity_5min', 'withdraw_intensity_5min',
       'interest_intensity_5min', 'borrow_intensity_1h', 'rep

## NEW: Build Tick-Aggregated Ledger

Instead of working with event-level data, we can aggregate the ledger to tick level (e.g., 5-minute windows) or block level. This is a major performance optimization.

**Benefits:**
- **90-99% data reduction**: 10,000 events → 100 ticks
- **Much faster rolling windows**: Operating on 100 rows vs 10,000
- **Lower memory usage**: Smaller dataset = less RAM
- **Preserves attribution accuracy**: Flows are aggregated correctly

**How it works:**
1. Compute cumulative states at event-level (precise)
2. Group events into ticks/blocks
3. Sum deltas, take final states
4. Aggregate flows by type (supply_in, borrow_in, etc.)
5. Fill empty ticks with zeros
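
The steps above can be sketched in plain pandas. This is a minimal illustration on a toy ledger, not the implementation behind `build_tick_ledger`: the sample values are made up, and the choice to zero-fill deltas while forward-filling state columns for empty ticks is an assumption.

```python
import pandas as pd

# Toy event-level ledger (hypothetical values); timestamps are Unix seconds.
events = pd.DataFrame({
    "block_timestamp": [0, 120, 290, 910, 950],
    "delta_supply_assets": [1000, 0, -200, 500, 0],
    "delta_borrow_assets": [0, 400, 0, 0, 100],
    "total_supplied_assets": [1000, 1000, 800, 1300, 1300],
    "outstanding_borrow_assets": [0, 400, 400, 400, 500],
})

TICK_SECONDS = 300  # 5-minute ticks

# Step 2: assign each event to a tick.
events["tick"] = events["block_timestamp"] // TICK_SECONDS

# Steps 3-4: sum deltas, keep the final state of each tick.
ticks = events.groupby("tick").agg(
    delta_supply_assets=("delta_supply_assets", "sum"),
    delta_borrow_assets=("delta_borrow_assets", "sum"),
    total_supplied_assets=("total_supplied_assets", "last"),
    outstanding_borrow_assets=("outstanding_borrow_assets", "last"),
    event_count=("block_timestamp", "size"),
)

# Step 5: add rows for empty ticks.
# Assumption: deltas and counts become 0, state columns carry forward.
full_index = pd.RangeIndex(ticks.index.min(), ticks.index.max() + 1, name="tick")
ticks = ticks.reindex(full_index)
delta_cols = ["delta_supply_assets", "delta_borrow_assets", "event_count"]
state_cols = ["total_supplied_assets", "outstanding_borrow_assets"]
ticks[delta_cols] = ticks[delta_cols].fillna(0)
ticks[state_cols] = ticks[state_cols].ffill()
ticks = ticks.astype("int64")

print(ticks)
```

Summing deltas per tick keeps the totals identical to the event-level ledger, which is why the accounting identity can still be checked after aggregation.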

In [6]:
# Build tick-aggregated ledger (5-minute ticks)
print("Building tick-aggregated ledger...")
tick_ledger = service.build_tick_ledger(
    market_id=CBBTC_MARKET_ID,
    tick_seconds=300,  # 5 minutes
    aggregation_level="tick",
)

tick_df = tick_ledger.table.to_pandas()


Building tick-aggregated ledger...


In [7]:
tick_df 

Unnamed: 0,market_id,block_number,log_index,block_timestamp,tx_hash,event_type,delta_supply_assets,delta_borrow_assets,delta_collateral_assets,total_supplied_assets,...,borrow_in_assets,repay_out_assets,liquidate_repay_assets,interest_assets,collateral_in_assets,collateral_out_assets,event_count,delta_utilization,delta_borrow_apy,delta_supply_apy
0,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,19403244,265,1725595835,0x808559ab9f6195db59cac4cb8205cbfd617ca6eb7013...,Aggregated,0,0,0,0,...,0,0,0,0,0,0,1,0.000000,0.000000e+00,0.000000
1,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,19403244,265,1725596400,0x808559ab9f6195db59cac4cb8205cbfd617ca6eb7013...,Aggregated,0,0,0,0,...,0,0,0,0,0,0,0,0.000000,0.000000e+00,0.000000
2,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,19403244,265,1725596700,0x808559ab9f6195db59cac4cb8205cbfd617ca6eb7013...,Aggregated,0,0,0,0,...,0,0,0,0,0,0,0,0.000000,0.000000e+00,0.000000
3,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,19403244,265,1725597000,0x808559ab9f6195db59cac4cb8205cbfd617ca6eb7013...,Aggregated,0,0,0,0,...,0,0,0,0,0,0,0,0.000000,0.000000e+00,0.000000
4,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,19403244,265,1725597300,0x808559ab9f6195db59cac4cb8205cbfd617ca6eb7013...,Aggregated,0,0,0,0,...,0,0,0,0,0,0,0,0.000000,0.000000e+00,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136356,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,39856769,406,1766502885,0x49a0cf9ef5f6b788d6a926cfafbf00513dad4a8fd3aa...,Aggregated,-18919184428,7854835952,5960437,1140933652721324,...,27342470000,20031834989,0,544200941,80437681,74477244,131,0.000022,6.075401e-07,0.000002
136357,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,39856920,276,1766503187,0x50aa4931a5157dbbecb4ff48d12d724dc8c4dc6f1175...,Aggregated,53473063445,8339538697,12394972,1140987125784769,...,8265000000,500000000,0,574538697,12394972,0,170,-0.000034,-2.458677e-06,-0.000004
136358,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,39857068,471,1766503483,0x65b17c07903665733608c2edc02d505c68d992ae143b...,Aggregated,553786621465,28524842387,-436583,1141540912406234,...,32462260000,4500624684,0,563207071,11118200,11554783,158,-0.000403,-2.216101e-05,-0.000045
136359,0x9103c3b4e834476c9a62ea009ba2c884ee42e94e6e31...,39857225,1459,1766503797,0x47f7e45ac2093f3383751f2148705981cae502d3ab92...,Aggregated,-500011791263,30447460877,1664802,1141040900614971,...,31500000000,1650000000,0,597460877,1664802,0,113,0.000413,2.190270e-05,0.000045


In [8]:
# Verify accounting integrity: sum of flow deltas should equal total delta
sample_tick = tick_df[tick_df['event_count'] > 1].iloc[0] if (tick_df['event_count'] > 1).any() else tick_df.iloc[10]

print(f"\n{'='*60}")
print(f"ACCOUNTING INTEGRITY CHECK")
print(f"{'='*60}")

# Borrow side
borrow_flows_sum = (sample_tick['borrow_in_assets'] - 
                    sample_tick['repay_out_assets'] - 
                    sample_tick['liquidate_repay_assets'] + 
                    sample_tick['interest_assets'])
borrow_delta = sample_tick['delta_borrow_assets']
print(f"Borrow side:")
print(f"  Flow sum: {borrow_flows_sum:,.0f}")
print(f"  Delta:    {borrow_delta:,.0f}")
print(f"  Match: {'✓' if abs(borrow_flows_sum - borrow_delta) < 1 else '✗'}")

# Supply side
supply_flows_sum = (sample_tick['supply_in_assets'] - 
                    sample_tick['withdraw_out_assets'] +
                    sample_tick['interest_assets'])
supply_delta = sample_tick['delta_supply_assets']
print(f"\nSupply side:")
print(f"  Flow sum: {supply_flows_sum:,.0f}")
print(f"  Delta:    {supply_delta:,.0f}")
print(f"  Match: {'✓' if abs(supply_flows_sum - supply_delta) < 1 else '✗'}")


ACCOUNTING INTEGRITY CHECK
Borrow side:
  Flow sum: 0
  Delta:    0
  Match: ✓

Supply side:
  Flow sum: 1,000,000
  Delta:    1,000,000
  Match: ✓


## Build Attribution Feature Dataset with Optimizations

The attribution pipeline follows a two-step process:
1. Build the enriched market ledger (raw events → standardized → enriched with state)
2. Compute attribution features from the ledger

**NEW: Data Preprocessing & Batch Processing**

We've added three optimizations to reduce memory usage and processing time:

1. **Tick-based preprocessing** (`preprocess_ticks=True`): 
   - For 5-minute window analysis, keeps only first and last rows within each tick
   - Can reduce dataset size by 90%+ for high-frequency event data
   - Preserves accuracy for rolling window calculations

2. **Batch processing** (`batch_size`):
   - Processes rolling windows in batches to reduce memory footprint
   - Useful when computing many windows (e.g., all 8 windows)

3. **Window selection** (`compute_windows`):
   - Specify only the windows you need (e.g., ['5min', '1h', '24h'])
   - Each window adds significant memory overhead
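
To make the tick-based preprocessing concrete, here is a minimal sketch of the "keep only the first and last rows within each tick" selection on a toy frame. The data is hypothetical, and the sketch covers only the row selection; how the real pipeline treats the flows of the dropped rows is not shown in this notebook.

```python
import pandas as pd

# Toy event-level data (hypothetical values); timestamps are Unix seconds.
events = pd.DataFrame({
    "block_timestamp": [0, 60, 120, 290, 310, 620, 640, 650],
    "utilization_rate": [0.50, 0.52, 0.51, 0.55, 0.56, 0.60, 0.58, 0.59],
})

TICK_SECONDS = 300  # 5-minute ticks

# Tick id of every event (rows are assumed to be sorted by time).
tick = events["block_timestamp"] // TICK_SECONDS

# A row is kept if it is the first or the last event of its tick.
is_first = ~tick.duplicated(keep="first")
is_last = ~tick.duplicated(keep="last")
reduced = events[is_first | is_last].reset_index(drop=True)

print(f"Reduced from {len(events)} to {len(reduced)} rows")
print(reduced)
```

A tick with a single event keeps that one row, so the reduction is largest for markets with many events per tick.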

In [20]:
attribution = service.build_attribution_dataset(
    ledger=ledger,
    compute_windows=['5min', '1h', '24h'],  # Only compute these 3 windows
    validate=True,
    preprocess_ticks=True,  # NEW: Reduce data by keeping only first/last rows per tick
    tick_seconds=300,       # 5-minute ticks
    batch_size=None,        # Set to 2 or 3 to batch process windows
)

df = attribution.table.to_pandas()
print(f"\n✓ Attribution dataset built")


Preprocessing: Reduced from 4,084,573 to 222,839 rows (94.5% reduction)



✓ Attribution dataset built


In [16]:
print(df.columns[:50])

Index(['market_id', 'block_number', 'log_index', 'block_timestamp', 'tx_hash',
       'event_type', 'delta_supply_assets', 'delta_borrow_assets',
       'delta_collateral_assets', 'total_supplied_assets',
       'outstanding_borrow_assets', 'total_collateral_assets',
       'borrow_rate_per_sec', 'utilization_rate', 'borrow_apy',
       'supply_rate_per_sec', 'supply_apy', 'delta_utilization',
       'delta_borrow_apy', 'delta_supply_apy', 'borrow_apr',
       'borrow_in_assets', 'borrow_out_assets', 'supply_in_assets',
       'supply_out_assets', 'interest_assets', 'util_mean_5min',
       'util_std_5min', 'util_mean_1h', 'util_std_1h', 'util_mean_6h',
       'util_std_6h', 'borrow_apr_mean_5min', 'borrow_apr_std_5min',
       'borrow_apr_mean_1h', 'borrow_apr_std_1h', 'borrow_apr_mean_6h',
       'borrow_apr_std_6h', 'borrow_intensity_5min', 'repay_intensity_5min',
       'supply_intensity_5min', 'withdraw_intensity_5min',
       'interest_intensity_5min', 'borrow_intensity_1h', 'rep

## Inspect Core Features

In [21]:
# Show sample of state columns
state_cols = [
    'block_number', 'event_type', 'utilization_rate', 'buffer',
    'borrow_apr', 'headroom_u90_assets', 'headroom_u95_assets'
]

print("State Columns Sample:")
print(df[state_cols].head(10))
print(f"\nUtilization range: [{df['utilization_rate'].min():.4f}, {df['utilization_rate'].max():.4f}]")
print(f"Borrow APR range: [{df['borrow_apr'].min():.4f}, {df['borrow_apr'].max():.4f}]")

State Columns Sample:
   block_number        event_type  utilization_rate    buffer  borrow_apr  \
0      19403244    AccrueInterest          0.000000  1.000000    0.008901   
1      19713959    AccrueInterest          0.000000  1.000000    0.005110   
2      19713959            Supply          0.000000  1.000000    0.005110   
3      19714179  SupplyCollateral          0.000000  1.000000    0.005110   
4      19714179            Borrow          0.983789  0.016211    0.002954   
5      19726214  SupplyCollateral          0.983789  0.016211    0.002954   
6      19726214            Borrow          0.980745  0.019255    0.042173   
7      19727323  SupplyCollateral          0.980745  0.019255    0.042173   
8      19727323            Borrow          0.980566  0.019434    0.041796   
9      19729336  SupplyCollateral          0.980566  0.019434    0.041796   

   headroom_u90_assets  headroom_u95_assets  
0                    0                    0  
1                    0                

In [22]:
# Show flow decomposition
flow_cols = [
    'event_type', 'borrow_in_assets', 'repay_out_assets', 'liquidate_repay_assets',
    'supply_in_assets', 'withdraw_out_assets', 'interest_assets'
]

print("\nFlow Decomposition (non-zero rows):")
flow_df = df[flow_cols]
non_zero = flow_df[(flow_df.iloc[:, 1:] != 0).any(axis=1)]
print(non_zero.head(20))

print("\nFlow totals:")
print(f"  Total borrowed: {df['borrow_in_assets'].sum():,.0f}")
print(f"  Total repaid: {df['repay_out_assets'].sum():,.0f}")
print(f"  Total liquidated: {df['liquidate_repay_assets'].sum():,.0f}")
print(f"  Total supplied: {df['supply_in_assets'].sum():,.0f}")
print(f"  Total withdrawn: {df['withdraw_out_assets'].sum():,.0f}")
print(f"  Total interest: {df['interest_assets'].sum():,.0f}")


Flow Decomposition (non-zero rows):
        event_type  borrow_in_assets  repay_out_assets  \
2           Supply      0.000000e+00               0.0   
4           Borrow      1.000000e+06               0.0   
6           Borrow      1.500000e+11               0.0   
8           Borrow      1.500000e+08               0.0   
10          Borrow      2.500000e+08               0.0   
11  AccrueInterest      0.000000e+00               0.0   
12          Supply      0.000000e+00               0.0   
13  AccrueInterest      0.000000e+00               0.0   
14        Withdraw      0.000000e+00               0.0   
15  AccrueInterest      0.000000e+00               0.0   
16          Supply      0.000000e+00               0.0   
17  AccrueInterest      0.000000e+00               0.0   
18        Withdraw      0.000000e+00               0.0   
20          Borrow      1.765100e+10               0.0   
21  AccrueInterest      0.000000e+00               0.0   
22          Supply      0.000000e+0

In [23]:
# Show contribution terms
contrib_cols = [
    'event_type', 'delta_u_real', 'delta_u_pred',
    'contrib_u_from_borrow', 'contrib_u_from_repay', 'contrib_u_from_liquidate',
    'contrib_u_from_withdraw', 'contrib_u_from_supply', 'contrib_u_from_interest',
    'delta_u_residual'
]

print("\nUtilization Contribution Terms (rows with significant change):")
contrib_df = df[contrib_cols]
significant = contrib_df[contrib_df['delta_u_real'].abs() > 0.001]
print(significant.head(20))

print("\nResidual statistics:")
print(df['delta_u_residual'].describe())
print(f"\nRows with |residual| > 0.01: {(df['delta_u_residual'].abs() > 0.01).sum()} ({100*(df['delta_u_residual'].abs() > 0.01).sum()/len(df):.2f}%)")


Utilization Contribution Terms (rows with significant change):
   event_type  delta_u_real   delta_u_pred  contrib_u_from_borrow  \
4      Borrow      0.983789       1.000000               1.000000   
6      Borrow     -0.003044  147568.368425          147568.368425   
10     Borrow      0.001450       0.001633               0.001633   
12     Supply     -0.049040      -0.051618               0.000000   
14   Withdraw      0.041029       0.039301               0.000000   
16     Supply     -0.041671      -0.043533               0.000000   
18   Withdraw      0.039370       0.037775               0.000000   
20     Borrow      0.001030       0.114036               0.114036   
22     Supply     -0.035612      -0.036966               0.000000   
24   Withdraw      0.034596       0.033364               0.000000   
26     Supply     -0.033341      -0.034526               0.000000   
28     Borrow      0.033205       0.000223               0.000223   
31     Borrow     -0.001759       0.000
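
The contribution columns above are consistent with a first-order (linearized) decomposition of utilization `u = B / S`. The sketch below assumes that form; the function name `attribute_utilization` and the flow keys are hypothetical and are not the library's API. Interest is treated as accruing to both the borrow and the supply side, matching the accounting identities checked earlier in this notebook.

```python
def attribute_utilization(supply_prev, borrow_prev, flows):
    """First-order attribution of a utilization change to flow types.

    Assumes supply_prev > 0. `flows` maps flow names to asset amounts.
    """
    u_prev = borrow_prev / supply_prev
    f = {k: flows.get(k, 0) for k in (
        "borrow_in", "repay_out", "liquidate_repay",
        "supply_in", "withdraw_out", "interest",
    )}

    # d(B/S) ~= dB / S - u * dS / S, split by flow type.
    contrib = {
        "borrow": f["borrow_in"] / supply_prev,
        "repay": -f["repay_out"] / supply_prev,
        "liquidate": -f["liquidate_repay"] / supply_prev,
        "supply": -u_prev * f["supply_in"] / supply_prev,
        "withdraw": u_prev * f["withdraw_out"] / supply_prev,
        "interest": (1 - u_prev) * f["interest"] / supply_prev,
    }

    # Exact post-event utilization for comparison.
    delta_b = f["borrow_in"] - f["repay_out"] - f["liquidate_repay"] + f["interest"]
    delta_s = f["supply_in"] - f["withdraw_out"] + f["interest"]
    u_new = (borrow_prev + delta_b) / (supply_prev + delta_s)

    delta_u_real = u_new - u_prev
    delta_u_pred = sum(contrib.values())
    return contrib, delta_u_real, delta_u_pred, delta_u_real - delta_u_pred


# Hypothetical example: a supply of 50 into a market at 80% utilization.
contrib, real, pred, residual = attribute_utilization(
    supply_prev=1000, borrow_prev=800, flows={"supply_in": 50}
)
print(f"pred={pred:.6f} real={real:.6f} residual={residual:.6f}")
```

Because every term divides by the pre-event supply, the linearization is only accurate when flows are small relative to supply. A flow much larger than the pre-event supply produces a very large predicted change, which may explain outliers such as the `delta_u_pred` of 147568 shown above.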

## IRM Slope Analysis

In [24]:
# Analyze IRM slope
irm_cols = [
    'event_type', 'delta_u_real', 'delta_borrow_apr', 'irm_slope',
    'irm_response_to_withdraw', 'irm_response_to_repay', 'irm_response_to_liquidate'
]

print("IRM Slope Analysis:")
irm_df = df[irm_cols]
valid_slope = irm_df[irm_df['irm_slope'].notna()]

if len(valid_slope) > 0:
    print(f"\nValid slope measurements: {len(valid_slope)} rows")
    print("\nIRM slope statistics:")
    print(valid_slope['irm_slope'].describe())
    
    print("\nSample rows with significant IRM slope:")
    print(valid_slope[valid_slope['irm_slope'].abs() > 1.0].head(10))
else:
    print("No valid IRM slope measurements (no significant utilization changes)")

IRM Slope Analysis:

Valid slope measurements: 203292 rows

IRM slope statistics:
count    2.032920e+05
mean     1.122489e+03
std      2.444427e+05
min     -1.131314e+07
25%     -2.401353e+00
50%      1.314870e-01
75%      2.167668e+00
max      6.900629e+07
Name: irm_slope, dtype: float64

Sample rows with significant IRM slope:
        event_type  delta_u_real  delta_borrow_apr     irm_slope  \
6           Borrow -3.043641e-03          0.039219 -1.288568e+01   
8           Borrow -1.792826e-04         -0.000377  2.105185e+00   
11  AccrueInterest  3.164238e-07          0.001011  3.196422e+03   
13  AccrueInterest  1.213619e-08         -0.018030 -1.485617e+06   
15  AccrueInterest  5.010309e-08          0.015434  3.080497e+05   
17  AccrueInterest  1.244524e-08         -0.015627 -1.255660e+06   
20          Borrow  1.029665e-03          0.014841  1.441321e+01   
21  AccrueInterest  5.737837e-08          0.000464  8.087400e+03   
23  AccrueInterest  1.262074e-08         -0.013399 -1.061
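
The extreme slope values above come from rows where `delta_u_real` is on the order of 1e-8, so the ratio explodes. The displayed rows are consistent with `irm_slope = delta_borrow_apr / delta_u_real`; assuming that definition, a minimal sketch of a more robust slope masks rows whose utilization change is below a threshold. The threshold `MIN_ABS_DELTA_U` and the column name `irm_slope_filtered` are hypothetical choices. The first three sample rows are copied from the output above; the last row is made up to show the zero-change case.

```python
import pandas as pd

sample = pd.DataFrame({
    "delta_u_real": [-3.043641e-03, 1.213619e-08, 1.029665e-03, 0.0],
    "delta_borrow_apr": [0.039219, -0.018030, 0.014841, 0.001],
})

MIN_ABS_DELTA_U = 1e-4  # hypothetical threshold; tune per market

# Replace too-small utilization changes with NaN before dividing,
# so the slope is undefined (NaN) instead of huge or infinite.
mask = sample["delta_u_real"].abs() >= MIN_ABS_DELTA_U
safe_delta_u = sample["delta_u_real"].where(mask)
sample["irm_slope_filtered"] = sample["delta_borrow_apr"] / safe_delta_u

print(sample)
print(sample["irm_slope_filtered"].describe())
```

With the mask applied, summary statistics describe only rows where utilization actually moved, rather than being dominated by near-zero denominators.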

## Rolling Window Statistics

In [25]:
# Show rolling window statistics for 6h and 24h.
# Not every aggregate exists for every window (util_max_6h and
# borrow_apr_max_6h are missing from this dataset), so select only
# the columns that are present instead of raising a KeyError.
rolling_cols_6h = [
    'block_number', 'utilization_rate', 'util_mean_6h', 'util_max_6h', 'util_std_6h',
    'borrow_apr', 'borrow_apr_mean_6h', 'borrow_apr_max_6h'
]
rolling_cols_6h = [c for c in rolling_cols_6h if c in df.columns]

print("\n6-Hour Rolling Window Statistics:")
print(df[rolling_cols_6h].tail(20))

# Check if 24h columns exist
if 'util_mean_24h' in df.columns:
    rolling_cols_24h = [
        'block_number', 'utilization_rate', 'util_mean_24h', 'util_max_24h',
        'borrow_apr', 'borrow_apr_mean_24h', 'borrow_apr_max_24h'
    ]
    rolling_cols_24h = [c for c in rolling_cols_24h if c in df.columns]
    print("\n24-Hour Rolling Window Statistics:")
    print(df[rolling_cols_24h].tail(20))

In [26]:
# Flow intensities
intensity_cols = [
    'borrow_intensity_6h', 'repay_intensity_6h', 'liquidate_intensity_6h',
    'withdraw_intensity_6h', 'supply_intensity_6h', 'interest_intensity_6h'
]
# liquidate_intensity_6h is missing from this dataset; keep only existing columns
intensity_cols = [c for c in intensity_cols if c in df.columns]

print("\n6-Hour Flow Intensities (assets per second):")
print(df[intensity_cols].describe())

# Normalized intensities
norm_cols = [
    'borrow_intensity_norm_supply_6h',
    'withdraw_intensity_norm_supply_6h',
    'repay_intensity_norm_supply_6h'
]
norm_cols = [c for c in norm_cols if c in df.columns]

print("\n6-Hour Normalized Intensities (fraction of supply per second):")
if norm_cols:
    print(df[norm_cols].describe())
else:
    print("No normalized 6h intensity columns found in dataset.")
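
For reference, a flow intensity in "assets per second" can be sketched as a time-based rolling sum divided by the window length in seconds. The example below assumes that definition and uses made-up flow values on a datetime index; it is an illustration, not the pipeline's code.

```python
import pandas as pd

# Hypothetical borrow flows, indexed by event time.
flows = pd.DataFrame(
    {"borrow_in_assets": [600.0, 0.0, 1200.0, 300.0]},
    index=pd.to_datetime([0, 1800, 3600, 9000], unit="s"),
)

WINDOW = "1h"
seconds_w = pd.Timedelta(WINDOW).total_seconds()

# Time-based rolling window: each row sums the flows in the
# preceding hour (right-closed), then divides by 3600 seconds.
flows["borrow_intensity_1h"] = (
    flows["borrow_in_assets"].rolling(WINDOW, min_periods=1).sum() / seconds_w
)

print(flows)
```

Dividing by the nominal window length makes intensities comparable across window sizes, at the cost of understating intensity early in the series, when less than one full window of history exists.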

## Attribution Aggregates

In [None]:
# Attribution contribution sums
attribution_sum_cols = [
    'contrib_withdraw_sum_6h', 'contrib_repay_sum_6h', 'contrib_liquidate_sum_6h',
    'contrib_borrow_sum_6h', 'contrib_supply_sum_6h'
]

print("\n6-Hour Attribution Contribution Sums:")
print(df[attribution_sum_cols].describe())

# Check latest values
print("\nLatest attribution sums (6h window):")
print(df[attribution_sum_cols].iloc[-1])

# Absolute contribution sums (activity magnitude)
abs_sum_cols = [
    'contrib_withdraw_abs_sum_6h',
    'contrib_repay_abs_sum_6h',
    'contrib_liquidate_abs_sum_6h'
]

print("\n6-Hour Absolute Contribution Sums (activity magnitude):")
print(df[abs_sum_cols].describe())

## Volatility Attribution Shares

In [None]:
# Volatility attribution shares (6h window)
vol_share_cols = [
    'vol_share_withdraw_6h', 'vol_share_repay_6h', 'vol_share_liquidate_6h',
    'vol_share_borrow_6h', 'vol_share_supply_6h'
]

if all(col in df.columns for col in vol_share_cols):
    print("\nVolatility Attribution Shares (6h window):")
    print("These show what fraction of utilization variance is explained by each flow type.")
    print(df[vol_share_cols].describe())
    
    # Show rows with defined shares
    valid_shares = df[vol_share_cols].dropna(how='all')
    if len(valid_shares) > 0:
        print("\nLatest volatility shares:")
        print(valid_shares.tail(10))
    else:
        print("\nNo valid volatility shares (insufficient variance in window)")
else:
    print("\nVolatility share columns not found in dataset.")
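
One common way to define "the fraction of utilization variance explained by each flow type" is a covariance share: the rolling covariance of a contribution with the total change, divided by the rolling variance of the total change. The sketch below assumes that definition, which may differ from how the `vol_share_*` columns are actually computed, and uses synthetic contributions and a hypothetical row-count window.

```python
import numpy as np
import pandas as pd

# Synthetic per-event contributions (hypothetical magnitudes).
rng = np.random.default_rng(0)
contribs = pd.DataFrame({
    "borrow": rng.normal(0, 1e-3, 200),
    "withdraw": rng.normal(0, 2e-3, 200),
    "supply": rng.normal(0, 5e-4, 200),
})
delta_u = contribs.sum(axis=1)  # total change is the sum of contributions

WINDOW = 50  # rows per window (hypothetical)

# share_i = Cov(contrib_i, delta_u) / Var(delta_u), per rolling window.
var_u = delta_u.rolling(WINDOW).var()
cov_with_total = contribs.apply(lambda col: col.rolling(WINDOW).cov(delta_u))
shares = cov_with_total.div(var_u, axis=0)

print(shares.dropna().tail())
```

Under this definition the shares sum to 1 whenever the contributions add up exactly to the total change, and an individual share can be negative when a flow type tends to offset the others.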

## Expanding Statistics (All-Time)

In [None]:
# Expanding window statistics
expanding_cols = [
    'util_mean_expanding', 'util_std_expanding', 'util_max_expanding',
    'borrow_apr_mean_expanding', 'borrow_apr_std_expanding', 'borrow_apr_max_expanding',
    'delta_u_std_expanding'
]

print("\nExpanding (All-Time) Statistics:")
print(df[expanding_cols].tail(20))

print("\nFinal expanding statistics:")
print(df[expanding_cols].iloc[-1])

## Data Quality Summary

In [28]:
# Summary statistics for validation
print("DATA QUALITY SUMMARY")


# 1. Check integer columns
int_cols = [
    'delta_supply_assets', 'delta_borrow_assets',
    'total_supplied_assets', 'outstanding_borrow_assets'
]
print("\n1. Integer Column Integrity:")
for col in int_cols:
    nulls = df[col].isna().sum()
    print(f"   {col}: {nulls} nulls")

# 2. Utilization bounds
print("\n2. Utilization Bounds [0, 1]:")
util = df['utilization_rate']
print(f"   Min: {util.min():.6f}")
print(f"   Max: {util.max():.6f}")
print(f"   Out of bounds: {((util < 0) | (util > 1)).sum()} rows")

# 3. Residual distribution
print("\n3. Attribution Residuals:")
residuals = df['delta_u_residual'].abs()
print(f"   Mean: {residuals.mean():.6f}")
print(f"   Median: {residuals.median():.6f}")
print(f"   95th percentile: {residuals.quantile(0.95):.6f}")
print(f"   Max: {residuals.max():.6f}")
print(f"   Rows with |residual| > 0.01: {(residuals > 0.01).sum()} ({100*(residuals > 0.01).sum()/len(df):.2f}%)")

# 4. Flow decomposition check
print("\n4. Flow Decomposition Identity Check:")
reconstructed = (
    df['borrow_in_assets'] - df['repay_out_assets'] -
    df['liquidate_repay_assets'] + df['interest_assets']
)
delta_actual = df['delta_borrow_assets'].astype('float64')
flow_diff = (reconstructed - delta_actual).abs()
print(f"   Max difference: {flow_diff.max():.2f}")
print(f"   Mean difference: {flow_diff.mean():.2e}")
print(f"   Rows with diff > 1.0: {(flow_diff > 1.0).sum()}")

print("\n" + "=" * 60)
print("✓ Attribution feature dataset ready for analysis")
print("=" * 60)

DATA QUALITY SUMMARY

1. Integer Column Integrity:
   delta_supply_assets: 0 nulls
   delta_borrow_assets: 0 nulls
   total_supplied_assets: 0 nulls
   outstanding_borrow_assets: 0 nulls

2. Utilization Bounds [0, 1]:
   Min: 0.000000
   Max: 1.000000
   Out of bounds: 0 rows

3. Attribution Residuals:
   Mean: 0.662608
   Median: 0.000000
   95th percentile: 0.000906
   Max: 147568.371469
   Rows with |residual| > 0.01: 1543 (0.69%)

4. Flow Decomposition Identity Check:
   Max difference: 0.00
   Mean difference: 0.00e+00
   Rows with diff > 1.0: 0

✓ Attribution feature dataset ready for analysis


## Export Dataset

Save the attribution feature table for further analysis.

In [None]:
# Save to parquet
output_path = Path.cwd().parent / "data_examples" / "attribution_features.parquet"
df.to_parquet(output_path, index=False, engine='pyarrow', compression='snappy')

print(f"✓ Saved attribution features to: {output_path}")
print(f"  Size: {output_path.stat().st_size / 1024 / 1024:.2f} MB")
print(f"  Rows: {len(df):,}")
print(f"  Columns: {len(df.columns)}")