### RBSA Notebook - Private Equity Analyzer - AI Prototype

#### Overview

This notebook is a prototype that shows how AI can facilitate and strengthen the analysis of PE investments.

Often a user just wants "the answer," and in many cases this is reasonable. 

In private equity, however, limited partners and potential investors rarely receive full details, so the analytics for PE investments are not "cookie cutter."

Engaging with a user to develop insights and make informed decisions is essential. 

In this prototype, AI is used to incorporate incomplete information and judgements, and to identify questions that deserve further investigation.

In [1]:
import os, sys, yaml, pandas as pd, numpy as np
from IPython.display import display

# Add parent directory to path so we can import rbsa module
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from rbsa.main_pipeline import load_config, load_raw_data, prepare_data, run_all_methods, finalize_results
from rbsa.rbsa_initialize import summarizer, checkpoint_runner
from rbsa.reporting import format_weights, format_final_results

cfg = load_config(os.path.join(project_root, "config.yaml"))
raw_data = load_raw_data(cfg, project_root)


✓ Batch mode
Loading portfolio from data/portfolio.csv
Portfolio: 2 holdings
  ticker   wt
0  EVSYX  0.5
1  VWEHX  0.5
Downloading 14 selection tickers + 5 substitution-only tickers


In [2]:
# checkpoint_runner = None
# if cfg.get('interactive', {}).get('enabled', False):
#     from rbsa import CheckpointRunner
#     checkpoint_runner = CheckpointRunner(cfg, summarizer)
#     print('✓ Interactive mode enabled')
# else:
#     print('✓ Batch mode')
#checkpoint_runner

#### Preliminary Diagnostics — Checking for Smoothed Returns

Estimated or appraisal-based valuations typically smooth reported returns.

This can lead to an underestimation of risk and reduce the accuracy of returns-based analyses.
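
To illustrate why smoothing matters, the following minimal sketch simulates a smoothed return series, estimates its lag-1 autocorrelation, and applies first-order (Geltner-style) unsmoothing, r*_t = (r_t − ρ·r_{t−1}) / (1 − ρ). The function names, the smoothing weight, and the simulated data are hypothetical, and this is not necessarily how `prepare_data` runs its test or de-smoothing.

```python
import numpy as np

def lag1_autocorr(r):
    """Sample lag-1 autocorrelation of a return series."""
    r = np.asarray(r, dtype=float)
    d = r - r.mean()
    return float(np.dot(d[1:], d[:-1]) / np.dot(d, d))

def desmooth_ar1(r, rho):
    """First-order unsmoothing: r*_t = (r_t - rho * r_{t-1}) / (1 - rho)."""
    r = np.asarray(r, dtype=float)
    return (r[1:] - rho * r[:-1]) / (1.0 - rho)

# Hypothetical data: true returns are i.i.d.; reported returns are smoothed.
rng = np.random.default_rng(0)
true_r = rng.normal(0.008, 0.04, size=2000)
smoothing = 0.6  # assumed weight on the previous reported return
reported = np.empty_like(true_r)
reported[0] = true_r[0]
for t in range(1, len(true_r)):
    reported[t] = smoothing * reported[t - 1] + (1 - smoothing) * true_r[t]

rho_hat = lag1_autocorr(reported)
recovered = desmooth_ar1(reported, rho_hat)

print(f"estimated rho: {rho_hat:.3f}")
print(f"reported vol:  {reported.std():.4f}")
print(f"recovered vol: {recovered.std():.4f}")
print(f"true vol:      {true_r.std():.4f}")
```

In this simulation the reported series is visibly less volatile than the true one, which is the risk understatement described above. Unsmoothing brings the volatility back toward the true level.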


In [3]:
data = prepare_data(cfg, project_root, raw_data, checkpoint_runner=checkpoint_runner)


AR(1) AUTOCORRELATION TEST - Preliminary Diagnostics
Sample size: 237 observations
AR(1) coefficient (ρ): 0.038522
AR(1) p-value: 0.553799
Ljung-Box p-value (lag 1): 0.550844
Significance level: 0.05

○ Positive autocorrelation (ρ=0.0385) but NOT significant (p=0.5538)
  → No de-smoothing needed
Selection universe: 14 assets
Full universe (including substitutions): 19 assets
After cleaning: 221 observations, 12 selection assets, 17 total assets


#### Approach A — Constrained RBSA
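
Returns-based style analysis in the Sharpe tradition fits portfolio returns to a set of index returns using weights that are non-negative and sum to one. The sketch below solves that constrained least-squares problem with projected gradient descent on synthetic data. It is only an illustration: `project_to_simplex` and `fit_simplex_weights` are hypothetical names, and `approach_A_pipeline` may use a different solver and additional selection steps.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index that stays positive
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def fit_simplex_weights(X, y, n_iter=2000):
    """Minimize 0.5 * ||y - X w||^2 subject to w >= 0 and sum(w) = 1."""
    n_assets = X.shape[1]
    w = np.full(n_assets, 1.0 / n_assets)     # start from equal weights
    step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = project_to_simplex(w - step * grad)
    return w

# Synthetic example (assumed data, not the notebook's indexes)
rng = np.random.default_rng(4)
X = rng.normal(0.0, 0.04, size=(240, 4))
true_w = np.array([0.5, 0.3, 0.2, 0.0])
y = X @ true_w + rng.normal(0.0, 0.001, size=240)

w_hat = fit_simplex_weights(X, y)
print(np.round(w_hat, 3), "sum =", round(float(w_hat.sum()), 6))
```

The constraints are what make the weights readable as a long-only replicating portfolio rather than as unconstrained regression coefficients.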

In [4]:
from rbsa.models.approach_a import approach_A_pipeline
resA = approach_A_pipeline(data["X"], data["y"], cfg)
resA["summary"] = summarizer.summarize(f"Selected: {', '.join(resA['selected'])}\nRMSE={resA['diagnostics']['rmse']:.6f}")
display(format_weights(resA["weights"]))
print(resA["summary"])


asset,weight
HYG,0.292147
IWF,0.28663
IWD,0.235783
BIL,0.135705
TIP,0.049735


Here’s a concise style snapshot and what to do with it.

Fit quality
- RMSE 0.00448 (~45 bps/month) suggests a strong fit; factor mix explains most of the return variance.

Implied style exposures
- Equity: Blend of US large-cap growth (IWF) and value (IWD). Net tilt depends on their relative weights; together they behave close to a core large-cap sleeve with style rotation risk.
- Credit: High yield (HYG) adds spread and equity-like cyclicality; biggest left-tail risk in stress episodes.
- Rates/Inflation: TIP adds real-rate duration and positive inflation beta; BIL adds cash/defensive ballast and liquidity.
- Net takeaway: Pro-cyclical (IWF/IWD/HYG) tempered by an inflation hedge (TIP) and cash buffer (BIL).

Key risk drivers and regime sensitivity
- Risk-on/soft landing: IWF/IWD/HYG drive upside; BIL drags; TIP mixed (depends on real yields).
- Risk-off/recession: HYG and equities draw down together; BIL cushions; TIP helps if real yields fall (flight-to-quality), hurts if real yiel

#### Approach B — Elastic Net 

<!-- (auto α and λ) → constrained refit -->
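
The cell output in this section suggests a two-stage procedure: an elastic net picks a subset of indexes for several `l1_ratio` values, and the selected subset is then refit under RBSA constraints. The selection stage can be sketched with a small coordinate-descent solver for the common objective (1/2n)·‖y − Xw‖² + α·l1_ratio·‖w‖₁ + ½·α·(1 − l1_ratio)·‖w‖². This is a minimal illustration on synthetic data; `elastic_net_cd` is a hypothetical helper, and `approach_B_pipeline` may tune α and select assets differently.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become zero."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def elastic_net_cd(X, y, alpha, l1_ratio, n_iter=200):
    """Coordinate descent for the elastic-net objective stated above."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ w
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the residual that excludes j's own term
            rho = X[:, j] @ resid / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio)
            )
            resid += X[:, j] * (w[j] - w_new)  # keep resid equal to y - X w
            w[j] = w_new
    return w

# Synthetic example: only the first two columns drive y (assumed data)
rng = np.random.default_rng(3)
n, p = 500, 6
X = rng.normal(size=(n, p))
y = 0.6 * X[:, 0] + 0.4 * X[:, 1] + 0.01 * rng.normal(size=n)

selected = {}
for l1_ratio in (0.25, 0.50, 0.75):
    w = elastic_net_cd(X, y, alpha=0.05, l1_ratio=l1_ratio)
    selected[l1_ratio] = np.flatnonzero(w).tolist()
    print(f"l1_ratio={l1_ratio:.2f}: selected columns {selected[l1_ratio]}")
```

Because the penalty shrinks coefficients, they are not interpretable as portfolio weights. That is one reason to refit the selected columns under the non-negativity and sum-to-one constraints.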

In [5]:
import importlib
import rbsa.models.approach_b
importlib.reload(rbsa.models.approach_b)
from rbsa.models.approach_b import approach_B_pipeline

print("Starting Approach B with verbose=True")
resB = approach_B_pipeline(data["X"], data["y"], cfg, verbose=True)
print("Approach B completed")
resB["summary"] = summarizer.summarize(f"Selected: {', '.join(resB['selected'])}\nRMSE={resB['diagnostics'].get('rmse', float('nan'))}")
display(format_weights(resB["weights"]))
print(resB["summary"])

Starting Approach B with verbose=True

ElasticNet Selection (testing 4 l1_ratio values):

  l1_ratio=0.25: Selected 7 assets: DBC, HYG, IEF, IWD, IWF, LQD, TIP
              Top coefficients: IWF=0.013, IWD=0.010, HYG=0.007
              RBSA refit: R²=0.9743, Adj-R²=0.9735, MSE=0.000025, MAE=0.003487
              Weights: HYG=0.302, IWF=0.275, IWD=0.240, TIP=0.165, LQD=0.014, DBC=0.004, IEF=0.000

  l1_ratio=0.50: Selected 7 assets: DBC, HYG, IEF, IWD, IWF, LQD, TIP
              Top coefficients: IWF=0.013, IWD=0.010, HYG=0.007
              RBSA refit: R²=0.9743, Adj-R²=0.9735, MSE=0.000025, MAE=0.003487
              Weights: HYG=0.302, IWF=0.275, IWD=0.240, TIP=0.165, LQD=0.014, DBC=0.004, IEF=0.000

  l1_ratio=0.75: Selected 7 assets: DBC, HYG, IEF, IWD, IWF, LQD, TIP
              Top coefficients: IWF=0.013, IWD=0.010, HYG=0.007
              RBSA refit: R²=0.9743, Adj-R²=0.9735, MSE=0.000025, MAE=0.003487
              Weights: HYG=0.302, IWF=0.275, IWD=0.240, TIP=0.165, LQD=

asset,weight
HYG,0.3023085
IWF,0.274793
IWD,0.2399229
TIP,0.164732
LQD,0.01389408
DBC,0.004349481
IEF,6.824091999999999e-19


Here’s what your RBSA selection implies and how to act on it.

Big picture
- The mix maps to four core betas: equity style (IWF/IWD), credit (HYG/LQD), duration (IEF plus the duration inside LQD/TIP), and inflation (TIP/DBC).
- RMSE ≈ 0.50% monthly (~1.7% annualized TE) suggests the portfolio is well explained by these factors; residual/alpha is small.

Key exposures and overlaps
- Duration: IEF, TIP, and the Treasury component inside LQD all add rate sensitivity. You may be double-counting rate risk.
- Credit: LQD (IG) and HYG (HY) load on spread risk; they are correlated and both weaken crisis ballast.
- Equity beta/style: IWF vs IWD sets growth vs value tilt; net tilt determines sensitivity to rates and cyclicality.
- Inflation: TIP and DBC both load on inflation; TIP responds to breakevens, DBC to spot commodities (energy-heavy).

Actionable adjustments
- Reduce rate sensitivity: cut IEF and/or shift LQD to shorter IG (e.g., IGSB) or to equity/DBC; keep TIP only if you want inflati

#### Approach C — Bayesian RBSA

<!-- with Dirichlet-Spike Prior -->

In [6]:
import importlib
import rbsa.models.approach_c
importlib.reload(rbsa.models.approach_c)
from rbsa.models.approach_c import approach_C_pipeline

print("Starting Approach C - Bayesian RBSA")
resC = approach_C_pipeline(data["X"], data["y"], cfg, verbose=True)
print("\nApproach C completed")
resC["summary"] = summarizer.summarize(f"Selected: {', '.join(resC['selected'])}\nR²={resC['diagnostics'].get('r2', float('nan')):.4f}")
display(format_weights(resC["weights"]))
print(resC["summary"])

# Show posterior inclusion probabilities
print("\nPosterior Inclusion Probabilities (top 10):")
display(resC["pip"].sort_values(ascending=False).head(10))

Starting Approach C - Bayesian RBSA

Bayesian RBSA with Dirichlet-Spike Prior
Running MCMC: 5000 samples, 1000 burn-in
  Iteration 1000/5000, active assets: 7, sigma²=0.000118
  Iteration 2000/5000, active assets: 6, sigma²=0.000136
  Iteration 3000/5000, active assets: 6, sigma²=0.000127
  Iteration 4000/5000, active assets: 7, sigma²=0.000106
  Iteration 5000/5000, active assets: 5, sigma²=0.000117

Posterior Inclusion Probabilities (PIP):
asset     PIP  mean_weight  std_weight
  HYG 1.00000     0.307504    0.039867
  IWD 1.00000     0.232566    0.031975
  IWF 1.00000     0.287731    0.028413
  BIL 0.93775     0.152055    0.053808
  TIP 0.38250     0.017461    0.050101
  AGG 0.31400     0.000336    0.001608
  LQD 0.30225     0.000846    0.004090
  IWM 0.29450     0.000255    0.001361
  DBC 0.29375     0.000191    0.001054
  EEM 0.29125     0.000342    0.001935
  EFA 0.28725     0.000459    0.002376
  IEF 0.28100     0.000255    0.001208

Selected assets (PIP >= 0.5): BIL, HYG, IWD, I

asset,weight
HYG,0.306996
IWF,0.290705
IWD,0.230337
BIL,0.171963


Here’s what your RBSA setup implies and how to act on it:

Fit and residual risk
- R²=0.9788: very high fit; only 2.12% of variance unexplained. Residual (idiosyncratic) vol ≈ sqrt(1−R²) ≈ 14.6% of total volatility.
- Interpretation: your portfolio can be closely replicated by a mix of cash (BIL), high-yield credit (HYG), and large-cap value/growth (IWD/IWF).

Factor exposures and implications
- Equity beta: Driven by IWD + IWF. If their weights are similar, you’re large-cap core; imbalance implies a clear value or growth tilt.
- Credit beta: HYG adds pro‑cyclical spread risk that is positively correlated with equities; boosts carry but increases drawdown risk in stress.
- Cash/defensiveness: BIL dampens volatility and drawdowns but creates cash drag; minimal duration exposure overall (no Treasury ballast).
- Collinearity: IWD and IWF are highly correlated; weights can be unstable over time even if the net equity beta is stable.

Risk and scenario takeaways
- Risk stack: Equity and cre

HYG    1.00000
IWD    1.00000
IWF    1.00000
BIL    0.93775
TIP    0.38250
AGG    0.31400
LQD    0.30225
IWM    0.29450
DBC    0.29375
EEM    0.29125
dtype: float64
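
A posterior inclusion probability is the share of retained MCMC draws in which an asset is active. The reported values are consistent with 4,000 retained draws (5,000 samples minus 1,000 burn-in); for example, 0.93775 = 3,751 / 4,000. The sketch below computes PIPs from a matrix of inclusion indicators and applies the PIP >= 0.5 rule. The indicator draws are simulated and the names are hypothetical; the sampler inside `approach_C_pipeline` is not shown in this notebook.

```python
import numpy as np

def inclusion_probabilities(indicators, burn_in):
    """Fraction of post-burn-in draws in which each asset is active."""
    kept = np.asarray(indicators, dtype=bool)[burn_in:]
    return kept.mean(axis=0)

assets = ["HYG", "IWD", "IWF", "BIL", "TIP"]
n_draws, burn_in = 5000, 1000

# Simulated indicators with assumed activity rates (illustrative only,
# not the notebook's posterior).
rng = np.random.default_rng(1)
assumed_rates = np.array([1.0, 1.0, 1.0, 0.9, 0.4])
indicators = rng.random((n_draws, len(assets))) < assumed_rates

pip = inclusion_probabilities(indicators, burn_in)
selected = [a for a, p in zip(assets, pip) if p >= 0.5]

for a, p in zip(assets, pip):
    print(f"{a}: PIP = {p:.5f}")
print("Selected (PIP >= 0.5):", ", ".join(selected))
```

A threshold of 0.5 keeps the assets that appear in at least half of the retained draws. Assets with PIPs near the threshold are natural candidates for the follow-up questions this prototype is meant to surface.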

#### Approach D — Cluster-then-Span

In [7]:
from rbsa.models.approach_d import approach_D_pipeline
resD = approach_D_pipeline(data["X"], data["y"], cfg)
resD["summary"] = summarizer.summarize(f"Selected: {', '.join(resD['selected'])}\nRMSE={resD['diagnostics']['rmse']:.6f}")
display(format_weights(resD["weights"]))
print(resD["summary"])


asset,weight
HYG,0.292147
IWF,0.28663
IWD,0.235783
BIL,0.135705
TIP,0.049735


Concise takeaways from the selected basis set (HYG, IWF, IWD, BIL, TIP):

What the exposures imply
- Core US large-cap equity with a style barbell: IWF (growth) + IWD (value) suggests a mostly style-neutral large-cap core; relative weights indicate the growth/value tilt.
- Credit beta via HYG: meaningful exposure to credit spreads and equity-like drawdowns in risk-off regimes.
- Rates/inflation via TIP: real-rate duration plus inflation surprise protection; TIP helps if inflation rises relative to expectations but falls when real yields rise.
- Cash via BIL: liquidity and drawdown dampener; indicates some market-timing or dry powder.

Risk/return drivers to watch
- Equity vs style rotation: Performance sensitive to growth–value cycles; check the net tilt (IWF minus IWD).
- Spread risk: HYG will likely sell off alongside equities in stress; concentrates cyclical risk.
- Real-rate shock: TIP has material real-duration; +100 bps real yield hurts TIP, even if inflation is stable.
- Liquidi

#### Final Consolidation — Top Candidates

In [8]:
from rbsa.consolidate import create_diagnostic_questions

# Add approach labels to candidates
all_cands = [
    {"approach": "A", **resA},
    {"approach": "B", **resB},
    {"approach": "C", **resC},
    {"approach": "D", **resD},
]

# Rank by performance
final_ranked = finalize_results(all_cands, cfg)

# Display formatted results
print(format_final_results(final_ranked))

mode = cfg.get("analysis", {}).get("mode", "in_sample")
print(f"\nAnalysis mode: {mode}")

# Diagnostic questions
print("\nDiagnostic Questions:")
display(create_diagnostic_questions(final_ranked))

Candidate 1 (Method A):
  Statistics: n_assets=5, R²=0.979321, Adj-R²=0.978840, RMSE=0.004482, MAE=0.003017, AIC=-1753.05, AICc=-1752.77, BIC=-1736.05
  Assets:     HYG(0.292), IWF(0.287), IWD(0.236), BIL(0.136), TIP(0.050)

Candidate 2 (Method D):
  Statistics: n_assets=5, R²=0.979321, Adj-R²=0.978840, RMSE=0.004482, MAE=0.003017, AIC=-1753.05, AICc=-1752.77, BIC=-1736.05
  Assets:     HYG(0.292), IWF(0.287), IWD(0.236), BIL(0.136), TIP(0.050)

Candidate 3 (Method C):
  Statistics: n_assets=4, R²=0.978787, Adj-R²=0.978394, RMSE=0.004539, MAE=0.003032, AIC=-1749.41, AICc=-1749.23, BIC=-1735.82
  Assets:     HYG(0.307), IWF(0.291), IWD(0.230), BIL(0.172)

Candidate 4 (Method B):
  Statistics: n_assets=7, R²=0.974304, Adj-R²=0.973459, RMSE=0.004996, MAE=0.003487, AIC=-1701.04, AICc=-1700.51, BIC=-1677.25
  Assets:     HYG(0.302), IWF(0.275), IWD(0.240), TIP(0.165), LQD(0.014), DBC(0.004), IEF(0.000)


Analysis mode: in_sample

Diagnostic Questions:


rank,q1,q2
1,"If the fund's mandate restricts leverage/cash,...","Would a different commodity spec (e.g., PDBC v..."
2,"If the fund's mandate restricts leverage/cash,...","Would a different commodity spec (e.g., PDBC v..."
3,"If the fund's mandate restricts leverage/cash,...","Would a different commodity spec (e.g., PDBC v..."
4,"If the fund's mandate restricts leverage/cash,...","Would a different commodity spec (e.g., PDBC v..."
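
The statistics printed for each candidate can be reproduced approximately from the sample size n, the number of assets k, and the RMSE. This assumes a Gaussian log-likelihood with k counted as the number of parameters, and n = 221 observations (the post-cleaning sample size reported earlier). These formulas are inferred from the printed values rather than taken from the `rbsa` source, so treat them as an assumption.

```python
import math

def information_criteria(n, k, rmse):
    """AIC, AICc and BIC from a Gaussian log-likelihood (assumed form)."""
    mse = rmse ** 2
    neg2_loglik = n * (math.log(2 * math.pi) + math.log(mse) + 1)
    aic = neg2_loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    bic = neg2_loglik + k * math.log(n)
    return aic, aicc, bic

def adjusted_r2(r2, n, k):
    """Adjusted R-squared with k regressors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Inputs taken from the Method A line of the printed results
aic, aicc, bic = information_criteria(n=221, k=5, rmse=0.004482)
adj = adjusted_r2(0.979321, n=221, k=5)

print(f"AIC={aic:.2f}, AICc={aicc:.2f}, BIC={bic:.2f}, Adj-R2={adj:.6f}")
```

Small differences from the printed figures are expected because the RMSE is shown to only six decimals. Note that all candidates here are scored in-sample, so the penalty terms are the only guard against rewarding extra assets.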


In [9]:
if checkpoint_runner:
    checkpoint_runner.run_checkpoint('checkpoint-candidate-review', {'candidates': final_ranked})

#### Index Substitution Analysis

Explore whether a composite index can replace its sub-indexes (for example, IWB in place of IWF + IWD), reducing the number of indexes in the model without materially degrading the fit.
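
One way to frame the substitution test is to hold the rest of the model fixed, move the combined weight of the sub-indexes onto the composite index, and compare the fit. The following sketch does this on synthetic data. The return series, the fixed-weight comparison, and the helper `r_squared` are assumptions for illustration; `analyze_substitutions` may refit weights or apply different acceptance thresholds.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination for a fitted series."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins for index returns (not real market data)
rng = np.random.default_rng(2)
n = 240
market = rng.normal(0.007, 0.04, n)            # shared equity component
growth = market + rng.normal(0, 0.01, n)       # stand-in for a growth index
value = market + rng.normal(0, 0.01, n)        # stand-in for a value index
composite = 0.5 * (growth + value)             # stand-in for the broad index
credit = 0.5 * market + rng.normal(0, 0.015, n)

# Assumed weights, loosely in the range of the fitted candidates
w_growth, w_value, w_credit = 0.29, 0.24, 0.47
y = (w_growth * growth + w_value * value + w_credit * credit
     + rng.normal(0, 0.003, n))

fit_original = w_growth * growth + w_value * value + w_credit * credit
fit_substituted = (w_growth + w_value) * composite + w_credit * credit

r2_orig = r_squared(y, fit_original)
r2_sub = r_squared(y, fit_substituted)
print(f"original R2={r2_orig:.6f}, substituted R2={r2_sub:.6f}, "
      f"delta={r2_sub - r2_orig:+.6f}")
```

When the two fits differ only marginally, the composite saves one asset at little cost in explanatory power.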

In [10]:
from rbsa.substitution import analyze_substitutions, apply_recommended_substitutions

substitution_rules = cfg.get("substitutions", [])

if len(substitution_rules) > 0:
    print(f"Running substitution analysis with {len(substitution_rules)} rule(s)...\n")
    # Use X_full which includes substitution-only assets
    sub_results = analyze_substitutions(final_ranked, data["X_full"], data["y"], substitution_rules, verbose=True)
    
    # Apply recommended substitutions and re-rank
    final_candidates = apply_recommended_substitutions(final_ranked, sub_results, data["X_full"], data["y"], cfg, verbose=True)
    
    # Display final re-ranked results
    print(f"\n{'='*80}")
    print("FINAL RESULTS AFTER SUBSTITUTIONS")
    print(f"{'='*80}\n")
    print(format_final_results(final_candidates))
else:
    print("No substitution rules defined in config.yaml")
    final_candidates = final_ranked

Running substitution analysis with 3 rule(s)...


Candidate 1: HYG, IWF, IWD, BIL, TIP

✓ Found IWF + IWD → Testing bottom-up consolidation to IWB

  Weight Swap Test (IWF ↔ IWD):
    Original weights: IWF=0.287, IWD=0.236
    Swapped weights:  IWF=0.236, IWD=0.287
    R² difference: -0.001872
    RMSE difference: +0.000198
    Materially different: True

  Substitution Test (IWF + IWD → IWB):
    Combined weight: 0.522
    Original:    R²=0.979321, Adj-R²=0.978840, RMSE=0.004482
    Substituted: R²=0.979159, Adj-R²=0.978773, RMSE=0.004499
    Differences: ΔR²=-0.000162, ΔAdj-R²=-0.000067, ΔRMSE=+0.000018
    Assets saved: 1
    ✓ RECOMMEND SUBSTITUTION

✓ Found IWF + IWD → Testing bottom-up consolidation to SPY

  Weight Swap Test (IWF ↔ IWD):
    Original weights: IWF=0.287, IWD=0.236
    Swapped weights:  IWF=0.236, IWD=0.287
    R² difference: -0.001872
    RMSE difference: +0.000198
    Materially different: True

  Substitution Test (IWF + IWD → SPY):
    Combined weight: 0.522
 

In [11]:
if checkpoint_runner:
    checkpoint_runner.run_checkpoint('checkpoint-final-selection', {'candidates': final_candidates[:3]})

#### AI-Powered Comprehensive Summary

Synthesize and summarize results with actionable insights.

In [12]:
from rbsa.final_summary import create_summary_report

# Generate comprehensive AI summary of final results
print("\nGenerating AI-powered analysis...\n")

summary_report = create_summary_report(
    candidates=final_candidates,
    summarizer=summarizer,
    cfg=cfg,
    desmooth_diagnostics=data.get("desmooth_diagnostics")
)

print(summary_report["full"])


Generating AI-powered analysis...

ANALYSIS CONFIGURATION:
  Mode: in_sample
  Number of candidates: 4
  De-smoothing: Not needed

AI ANALYSIS

Here’s what the RBSA says about the portfolio, distilled to what matters:

1) Factor exposure and economic interpretation
- Core risk drivers:
  - US large-cap equity (IWB) ~52%: dominant driver; style appears core/blend, not distinctly growth or value.
  - High yield credit (HYG) ~29–31%: sizable credit-spread beta; equity-like behavior in selloffs.
- Defensive/liquidity:
  - Cash/T-bills (BIL) ~14–17%: volatility dampener; return drag in risk-on markets.
- Rates/inflation:
  - TIP 0–5% in top fits (up to 16.5% in a less parsimonious candidate): modest to low inflation hedging; low interest rate duration overall.
- Implied risk posture:
  - Effective equity beta ≈ 0.7 (52% equity + equity-like component from HY). Expect material drawdowns in equity/credit stress, with limited duration ballast.

2) Model quality and parsimony
- Fit is strong: 

In [13]:
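if checkpoint_runner and checkpoint_runner.history:
    import json
    from datetime import datetime
    # Write the checkpoint history to a timestamped JSON file; the context manager closes the file
    out_path = f'checkpoint_history_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json'
    with open(out_path, 'w') as f:
        json.dump(checkpoint_runner.history, f, indent=2, default=str)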