## Hierarchical Risk Parity (HRP) Kimi-K2

Think of HRP as having **two separate steps**:

1. **“Where do we cut the tree?”** (build the dendrogram)  
   → depends **only** on the correlation matrix you feed it.

2. **“How risky is each branch?”** (recursive bisection)  
   → uses the **covariance matrix** of the period you will actually trade.

`hrp_weights` and `hrp_quick` differ only in **which period supplies the correlation matrix for step 1**.

---

### Two practical recipes

| Method | Step-1 tree built on | Step-2 cov matrix built on | When to use |
|---|---|---|---|
| **hrp_weights(df_train)** | same as step-2 (`df_train`) | `df_train` | *Pure in-sample* – tree and risk both estimated on the same data. Simple, but may over-fit. |
| **hrp_quick(df_train)** | **long history (`df`)** | `df_train` | *Hybrid* – stable tree from long history, fresh risk estimates from recent data. Lower turnover, more robust to regime changes. |

---

### Pros & cons in plain language

| Aspect | hrp_weights(df_train) | hrp_quick(df_train) |
|---|---|---|
| **Stability** | Tree changes every time you roll the window → higher turnover, more re-balancing costs. | Tree is “anchored” by long history → weights evolve slowly, cheaper to trade. |
| **Over-fitting** | Correlations estimated on short window can be noisy; clusters flip on a whim. | Long window smooths out noise; structure is less jumpy. |
| **Regime sensitivity** | Reacts quickly if correlation breaks down. | May miss a *genuine* regime shift because the tree is “locked”. |
| **Implementation effort** | One line; nothing to cache. | You must decide *how long* the “long” window should be and store the tree. |

---

### What do practitioners do?

Most **quant desks / multi-asset funds** use the **hybrid approach** (i.e., the logic behind `hrp_quick`):

- **Tree** = 3-5 years of daily returns (or even longer for low-turnover mandates).  
- **Covariance** = rolling 6-12 months (Ledoit-Wolf shrinkage).  
- Re-estimate the tree only once a year or when major structural breaks are detected.

They do *not* rebuild the tree every month because:

- Execution costs outweigh the marginal improvement.  
- Correlation structure is fairly persistent at the **cluster** level (sectors, styles).  
- Risk (volatility) is what really moves month-to-month, not the *order* of assets in the tree.

---

### Rule of thumb for you

- If you are **learning / prototyping**, `hrp_weights(df_train)` is fine—easy to code, easier to debug.  
- If you are **running live capital**, adopt the hybrid recipe:  
  `linkage_cache = linkage(long_history)`  
  `hrp_quick(df_train)`  
  and refresh the cache only **quarterly or annually**.

In [1]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import pprint
import inspect  # <--- ADD THIS LINE
from IPython.display import display, Markdown

# --- 1. PANDAS & IPYTHON OPTIONS ---
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 3000)
pd.set_option('display.float_format', '{:.6f}'.format)
%load_ext autoreload
%autoreload 2

# --- 2. PROJECT PATH CONFIGURATION ---
NOTEBOOK_DIR = Path.cwd()
PARENT_DIR = NOTEBOOK_DIR.parent
ROOT_DIR = NOTEBOOK_DIR.parent.parent  # Adjust if your notebook is in a 'notebooks' subdirectory
DATA_DIR = ROOT_DIR / 'data'
SRC_DIR = ROOT_DIR / 'src'

# Add 'src' to the Python path to import custom modules
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

# --- 3. IMPORT CUSTOM MODULES ---
import utils

# --- 4. CONSTANTS ---
INITIAL_CAPITAL = 100_000  # 100,000
RISK_FREE_ANNUAL_RATE = 0.04
BENCHMARK_TICKER = "VGT"

# --- 5. VERIFICATION ---
print("--- Path Configuration ---")
print(f"✅ Project Root: {ROOT_DIR}")
print(f"✅ Parent Dir:   {PARENT_DIR}")
print(f"✅ Notebook Dir: {NOTEBOOK_DIR}")
print(f"✅ Data Dir:     {DATA_DIR}")
print(f"✅ Source Dir:   {SRC_DIR}")
assert all([ROOT_DIR.exists(), DATA_DIR.exists(), SRC_DIR.exists()]), "A key directory was not found!"

print("\n--- Module Verification ---")
print(f"✅ Successfully imported 'utils' and 'plotting_utils'.")

--- Path Configuration ---
✅ Project Root: c:\Users\ping\Files_win10\python\py311\stocks
✅ Parent Dir:   c:\Users\ping\Files_win10\python\py311\stocks\notebooks_PyPortfOpt
✅ Notebook Dir: c:\Users\ping\Files_win10\python\py311\stocks\notebooks_PyPortfOpt\_working
✅ Data Dir:     c:\Users\ping\Files_win10\python\py311\stocks\data
✅ Source Dir:   c:\Users\ping\Files_win10\python\py311\stocks\src

--- Module Verification ---
✅ Successfully imported 'utils' and 'plotting_utils'.


In [2]:
df = pd.read_parquet(DATA_DIR / 'df_adj_close.parquet')
# print(f'df:\n{df}')

In [3]:
import pandas as pd
import numpy as np

# 2. --- This is the code to split your DataFrame ---

# Create a boolean mask for the years 2023 and 2024
mask = df.index.year.isin([2023, 2024])

# Apply the mask to get the first DataFrame
df_train = df[mask]

# Apply the inverse of the mask (using ~) to get the second DataFrame
df_remain = df[~mask]

test_mask = (df_remain.index.year == 2025) & df_remain.index.month.isin([1])
df_test = df_remain.loc[test_mask]

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 644 entries, 2023-01-03 to 2025-07-29
Columns: 1524 entries, A to ZWS
dtypes: float64(1524)
memory usage: 7.5 MB


In [5]:
train_start_date = df_train.index.min().strftime('%Y-%m-%d')
train_end_date = df_train.index.max().strftime('%Y-%m-%d')
print(f'train_start_date: {train_start_date}')
print(f'train_end_date: {train_end_date}')

train_start_date: 2023-01-03
train_end_date: 2024-12-31


In [6]:
test_start_date = df_test.index.min().strftime('%Y-%m-%d')
test_end_date = df_test.index.max().strftime('%Y-%m-%d')
print(f'test_start_date: {test_start_date}')
print(f'test_end_date: {test_end_date}')

test_start_date: 2025-01-02
test_end_date: 2025-01-31


In [7]:
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
from sklearn.covariance import ledoit_wolf

# ------------------------------------------------------------------
# Helpers
# ------------------------------------------------------------------
def _log_returns(price_df: pd.DataFrame) -> pd.DataFrame:
    """Log returns, drop first NA row."""
    return np.log(price_df / price_df.shift(1)).dropna()

def _cov2corr(cov: np.ndarray) -> np.ndarray:
    """Covariance matrix → correlation matrix."""
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    corr[corr < -1], corr[corr > 1] = -1, 1
    return corr

# ------------------------------------------------------------------
# Core HRP
# ------------------------------------------------------------------
def _quasi_diagonal(link) -> np.ndarray:
    """Return reordering indices from linkage."""
    return np.array(dendrogram(link, no_plot=True)["leaves"])

def _recursive_bisection(cov: np.ndarray, perm: np.ndarray) -> np.ndarray:
    """Allocate top-down via inverse-variance weighting."""
    w = np.ones(len(perm))
    clusters = [perm]
    while clusters:
        new_clusters = []
        for idx in clusters:
            if len(idx) == 1:
                continue
            half = len(idx) // 2
            left, right = idx[:half], idx[half:]
            # variance of each branch
            var_l = np.linalg.inv(cov[np.ix_(left, left)]).sum()
            var_r = np.linalg.inv(cov[np.ix_(right, right)]).sum()
            alpha = 1 - var_l / (var_l + var_r)
            w[left] *= alpha
            w[right] *= (1 - alpha)
            new_clusters.extend([left, right])
        clusters = new_clusters
    return w

# ------------------------------------------------------------------
# Public entry point
# ------------------------------------------------------------------
def hrp_weights(price_df: pd.DataFrame,
                linkage_method: str = "single") -> pd.Series:
    """
    Build HRP portfolio weights from a DataFrame of daily adjusted-close prices.

    Parameters
    ----------
    price_df : pd.DataFrame
        Columns = tickers, index = DatetimeIndex of daily prices.
    linkage_method : str
        Any valid scipy linkage method ("single", "ward", etc.).

    Returns
    -------
    pd.Series
        Index = tickers, values = portfolio weights (sum = 1).
    """
    returns = _log_returns(price_df)
    cov, _ = ledoit_wolf(returns)
    corr = _cov2corr(cov)

    # distance matrix → condensed 1-D vector
    dist = np.sqrt(np.clip((1 - corr) / 2, 0, 1))
    dist_vec = squareform(dist, checks=False)

    link = linkage(dist_vec, method=linkage_method)
    perm = _quasi_diagonal(link)

    raw = _recursive_bisection(cov, perm)
    w = pd.Series(raw, index=returns.columns).sort_index()
    return w / w.sum()

In [8]:
weights = hrp_weights(df_train)          # df is your 643×1518 price DataFrame
# print(weights[weights > 0.001].round(4))
print(weights)

Ticker
A      0.000488
AA     0.000251
AAL    0.001555
AAON   0.000715
AAPL   0.000574
         ...   
ZM     0.002229
ZS     0.003353
ZTO    0.000943
ZTS    0.000045
ZWS    0.000533
Length: 1524, dtype: float64


In [9]:
print(weights.sort_values(ascending=False).head(10))
print("Total tickers in portfolio :", len(weights))
print("Sum of weights             :", weights.sum())

Ticker
U        0.008131
MARA     0.008011
TPG      0.006788
EQT      0.006164
MNDY     0.005975
CWEN-A   0.005819
YUMC     0.005505
FND      0.005062
NEE      0.004829
CF       0.004727
dtype: float64
Total tickers in portfolio : 1524
Sum of weights             : 1.0


In [10]:
# 1. Run HRP once
w = hrp_weights(df_train)          # or hrp_quick(df_train)

# 2. Drop anything < 5 %
target_ticker_count = 10           # set your desired count
cutoff = 0.001                     # starting threshold

while len(w[w >= cutoff]) > target_ticker_count:
    # coarse bisection: shrink or grow cutoff
    if len(w[w >= cutoff]) < target_ticker_count:
        cutoff *= 0.9                 # too many left → raise threshold
    else:
        cutoff *= 1.1                 # too few left → lower threshold
    print(cutoff, len(w[w >= cutoff]))

w_kept = w[w >= cutoff]
print(f"\nCutoff that yields {target_ticker_count} tickers: {cutoff:.6f}")
print("Tickers kept:", len(w_kept))

# 3. Re-balance so the remaining weights sum to 1
w_balanced = w_kept / w_kept.sum()

# 4. How many tickers survived?
print("Tickers kept:", len(w_balanced))
print("Sum of re-balanced weights:", w_balanced.sum())

# 5. View the new weights
w_balanced.sort_values(ascending=False)

0.0011 291
0.0012100000000000001 261
0.0013310000000000002 235
0.0014641000000000003 211
0.0016105100000000005 179
0.0017715610000000007 150
0.0019487171000000009 141
0.002143588810000001 122
0.0023579476910000016 103
0.002593742460100002 90
0.0028531167061100022 71
0.003138428376721003 59
0.0034522712143931033 46
0.003797498335832414 30
0.004177248169415656 22
0.004594972986357222 10

Cutoff that yields 10 tickers: 0.004595
Tickers kept: 10
Tickers kept: 10
Sum of re-balanced weights: 1.0


Ticker
U        0.133264
MARA     0.131311
TPG      0.111259
EQT      0.101024
MNDY     0.097938
CWEN-A   0.095382
YUMC     0.090225
FND      0.082968
NEE      0.079153
CF       0.077476
dtype: float64

In [11]:
# --- after you have w_balanced (the re-balanced Series) ---
last_prices = df_train.loc[df_train.index[-1], w_balanced.index]

In [12]:

initial_shares = INITIAL_CAPITAL * w_balanced / last_prices
initial_shares.info() 

<class 'pandas.core.series.Series'>
Index: 10 entries, CF to YUMC
Series name: None
Non-Null Count  Dtype  
--------------  -----  
10 non-null     float64
dtypes: float64(1)
memory usage: 160.0+ bytes


In [13]:
# daily portfolio value (scalar for each day)
portfolio_value = (
    df_test[initial_shares.index]          # prices of the 10 tickers
    .mul(initial_shares)                  # shares × price
    .sum(axis=1)                          # sum across tickers
)


In [14]:
import pandas as pd
import numpy as np

# ------------------------------------------------------------------
# 1) daily portfolio value series (already computed)
# ------------------------------------------------------------------
# portfolio_value = df_test[initial_shares.index].mul(initial_shares).sum(axis=1)

# ------------------------------------------------------------------
# 2) daily arithmetic returns of the portfolio
# ------------------------------------------------------------------
daily_ret = portfolio_value.pct_change().dropna()

# ------------------------------------------------------------------
# 3) annualisation factor
# ------------------------------------------------------------------
ANN_FACTOR = 252          # trading days per year

# ------------------------------------------------------------------
# 4) performance metrics
# ------------------------------------------------------------------
total_return   = (portfolio_value.iloc[-1] / portfolio_value.iloc[0] - 1)
ann_return     = (1 + total_return) ** (ANN_FACTOR / len(daily_ret)) - 1
ann_vol        = daily_ret.std() * np.sqrt(ANN_FACTOR)
sharpe_ratio   = ann_return / ann_vol
max_dd         = (portfolio_value / portfolio_value.cummax() - 1).min()
sortino        = ann_return / (daily_ret[daily_ret < 0].std() * np.sqrt(ANN_FACTOR))

# ------------------------------------------------------------------
# 5) summary
# ------------------------------------------------------------------
summary = {
    "Total Return (%)":         round(total_return * 100, 2),
    "Annualized Return (%)":    round(ann_return * 100, 2),
    "Annualized Volatility (%)":round(ann_vol * 100, 2),
    "Sharpe Ratio":             round(sharpe_ratio, 2),
    "Max Drawdown (%)":         round(max_dd * 100, 2),
    "Sortino Ratio":            round(sortino, 2)
}

pd.Series(summary)

Total Return (%)             2.560000
Annualized Return (%)       39.810000
Annualized Volatility (%)   27.610000
Sharpe Ratio                 1.440000
Max Drawdown (%)            -6.960000
Sortino Ratio                1.820000
dtype: float64

In [15]:
import pandas as pd
import numpy as np

# ------------------------------------------------------------------
# 1) daily portfolio value series
# ------------------------------------------------------------------
portfolio_value = df_test[initial_shares.index].mul(initial_shares).sum(axis=1)

# ------------------------------------------------------------------
# 2) daily returns – portfolio vs benchmark (VGT)
# ------------------------------------------------------------------
daily_ret = portfolio_value.pct_change().dropna()
bench_ret = df_test['VGT'].pct_change().dropna()

# align dates
common_dates = daily_ret.index.intersection(bench_ret.index)
daily_ret = daily_ret.loc[common_dates]
bench_ret = bench_ret.loc[common_dates]

# ------------------------------------------------------------------
# 3) annualisation
# ------------------------------------------------------------------
ANN_FACTOR = 252

# ------------------------------------------------------------------
# 4) helper
# ------------------------------------------------------------------
def _annualised_metrics(price_series):
    ret = price_series.pct_change().dropna()
    total = (price_series.iloc[-1] / price_series.iloc[0] - 1)
    ann_ret = (1 + total) ** (ANN_FACTOR / len(ret)) - 1
    ann_vol = ret.std() * np.sqrt(ANN_FACTOR)
    sharpe  = ann_ret / ann_vol
    max_dd  = (price_series / price_series.cummax() - 1).min()
    sortino = ann_ret / (ret[ret < 0].std() * np.sqrt(ANN_FACTOR))
    return {
        "Total Return (%)": round(total * 100, 2),
        "Annualized Return (%)": round(ann_ret * 100, 2),
        "Annualized Volatility (%)": round(ann_vol * 100, 2),
        "Sharpe Ratio": round(sharpe, 2),
        "Max Drawdown (%)": round(max_dd * 100, 2),
        "Sortino Ratio": round(sortino, 2)
    }

# ------------------------------------------------------------------
# 5) metrics
# ------------------------------------------------------------------
port_metrics = _annualised_metrics(portfolio_value)
bench_metrics = _annualised_metrics(df_test['VGT'])

# ------------------------------------------------------------------
# 6) tracking error & information ratio
# ------------------------------------------------------------------
tracking_error = (daily_ret - bench_ret).std() * np.sqrt(ANN_FACTOR)
info_ratio     = (daily_ret.mean() - bench_ret.mean()) * ANN_FACTOR / tracking_error

port_metrics["Tracking Error (%)"] = round(tracking_error * 100, 2)
port_metrics["Information Ratio"]  = round(info_ratio, 2)
port_metrics["Train Start Date"] = train_start_date
port_metrics["Train End Date"] = train_end_date
port_metrics["Test Start Date"] = test_start_date
port_metrics["Test End Date"] = test_end_date

bench_metrics["Train Start Date"] = train_start_date
bench_metrics["Train End Date"] = train_end_date
bench_metrics["Test Start Date"] = test_start_date
bench_metrics["Test End Date"] = test_end_date


# ------------------------------------------------------------------
# 7) summary table
# ------------------------------------------------------------------
summary_df = pd.DataFrame({
    "Portfolio": port_metrics,
    "VGT": bench_metrics
}).T

summary_df

Unnamed: 0,Total Return (%),Annualized Return (%),Annualized Volatility (%),Sharpe Ratio,Max Drawdown (%),Sortino Ratio,Tracking Error (%),Information Ratio,Train Start Date,Train End Date,Test Start Date,Test End Date
Portfolio,2.56,39.81,27.61,1.44,-6.96,1.82,18.05,2.37,2023-01-03,2024-12-31,2025-01-02,2025-01-31
VGT,-0.76,-9.64,30.63,-0.31,-6.12,-0.39,,,2023-01-03,2024-12-31,2025-01-02,2025-01-31


In [16]:
# summary_df.to_csv('prices.csv', index=True)
# append without writing the header again
summary_df.to_csv('prices.csv', mode='a', header=False, index=True)