# HRD Feature Engineering | Label Consolidation

**Objective**: Integrate multiple sources of HRD information into a unified binary label and combine it with the curated cell line metadata, in preparation for downstream analyses of PARP inhibitor response.

---
---
---

## 1. Setup 📦

---
---

In [1]:
# Data Management
import numpy as np
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utils
from IPython.display import display, HTML, IFrame, Image
from pathlib import Path

# Settings
pd.set_option('display.max_columns', None)
pd.set_option('mode.chained_assignment', None)
sns.set_style('darkgrid', {'grid.color':'0.9','xtick.bottom':True,'ytick.left':True})

## 2. Data Loading ⚙️

**Objective**: Load and organize the curated cell line cohort and HRD-related feature sets.

---
---

In [2]:
# Define project root and file path relative to user OS
BASE_DIR = Path.cwd()
CCL_PARPI_PATH              = BASE_DIR / 'utils' / 'ccl_parpi_df.pkl'
BRCA_HRD_PATH               = BASE_DIR / 'utils' / 'brca_df.pkl'
PROXY_HRD_PATH              = BASE_DIR / 'utils' / 'proxy_hrd.pkl'
CCL_HRD_SUMS_PATH           = BASE_DIR / 'utils' / 'ccl_hrd_sums_df.pkl'
CCL_SBS3_PATH               = BASE_DIR / 'utils' / 'ccl_sbs3_df.pkl'

In [3]:
# Load data into local dataframes 
ccl_parpi_df                = pd.read_pickle(CCL_PARPI_PATH)     # DepMap CCL Metadata
brca_df                     = pd.read_pickle(BRCA_HRD_PATH)      # BRCA-based HRD features
proxy_hrd                   = pd.read_pickle(PROXY_HRD_PATH)     # Proxy HRD score-based labels
ccl_hrd_sums_df             = pd.read_pickle(CCL_HRD_SUMS_PATH)  # HRD score-based features 
ccl_sbs3_df                 = pd.read_pickle(CCL_SBS3_PATH)      # SBS3-based HRD features

This project aims to systematically examine how multiple definitions of HRD relate to drug response across all PARP inhibitors available in the dataset. Each of the prior HRD feature engineering notebooks derived HRD status using one of three complementary strategies: (1) BRCA1/2 double-hit alteration status, (2) computed genomic scar-based HRD scores, and (3) COSMIC mutational signatures.

This notebook consolidates these engineered features and harmonizes them with PARP inhibitor AUC measurements to construct a modeling-ready dataset for downstream statistical analyses of HRD-associated drug response.

## 3. Data Integration 🔗

**Objective**: Compile all HRD feature sets into a single dataframe with CCL metadata.

---
---

In [4]:
# Merge CCL data with BRCA HRD features
ccl_hrd_parpi_df = ccl_parpi_df.merge(
    brca_df,
    on='ModelID',
    how='left'
)

# Merge CCL data with proxy HRD labels
ccl_hrd_parpi_df = ccl_hrd_parpi_df.merge(
    proxy_hrd[['ModelID', 'proxy_HRD_genomic_score', 'proxy_HRD_status']],
    on='ModelID',
    how='left'
)

# Merge CCL data with canonical HRD scores
ccl_hrd_parpi_df = ccl_hrd_parpi_df.merge(
    ccl_hrd_sums_df[['ModelID', 'hrdsum_summary', 'hrd_score_pct_flag', 'hrd_score_std_flag']],
    on='ModelID',
    how='left'
)

# Merge CCL data with SBS3-HRD feature
ccl_hrd_parpi_df = ccl_hrd_parpi_df.merge(
    ccl_sbs3_df[['ModelID', 'SBS3', 'hrd_sbs3']],
    on='ModelID',
    how='left'
)

# Drop the keys (and other features) unnecessary for analysis
ccl_hrd_parpi_df = ccl_hrd_parpi_df.drop(
    columns=['COSMICID', 'SangerModelID'],
).reset_index(drop=True)

# Reorganize and standardize the primary feature set
ccl_hrd_parpi_df = ccl_hrd_parpi_df.rename(columns={
    'name'               : 'PARP_inhibitor',
    'auc'                : 'AUC',
    'BRCA_HRD'           : 'hrd_BRCA',
    'proxy_HRD_status'   : 'hrd_proxy',
    'hrd_score_pct_flag' : 'hrd_score_pct',
    'hrd_score_std_flag' : 'hrd_score_std',
    'hrd_sbs3'           : 'hrd_SBS3',
})
primary_cols = ['ModelID', 'PARP_inhibitor', 'AUC', 'hrd_BRCA', 'hrd_proxy', 'hrd_score_pct', 'hrd_score_std', 'hrd_SBS3']
context_cols = [col for col in ccl_hrd_parpi_df.columns if col not in primary_cols]
ccl_hrd_parpi_df = ccl_hrd_parpi_df[primary_cols + context_cols]

ccl_hrd_parpi_df

Unnamed: 0,ModelID,PARP_inhibitor,AUC,hrd_BRCA,hrd_proxy,hrd_score_pct,hrd_score_std,hrd_SBS3,OncotreeLineage,OncotreePrimaryDisease,OncotreeSubtype,target,Age,AgeCategory,Sex,PatientRace,PrimaryOrMetastasis,GrowthPattern,indication,r2,BRCA2_damaging,BRCA1_damaging,BRCA1_cnloss_log2,BRCA2_cnloss_log2,BRCA1_loh_flag,BRCA2_loh_flag,BRCA1_homdel_flag,BRCA2_homdel_flag,BRCA1_double_hit,BRCA2_double_hit,BRCA_double_hit,BRCA_homdel,proxy_HRD_genomic_score,hrdsum_summary,SBS3
0,ACH-000001,niraparib,0.741783,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,PARP1,60.0,Adult,Female,caucasian,Metastatic,Adherent,primary peritoneal cancer (PPC),0.876853,0.0,0.0,0.752341,0.722099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.395118,32.0,
1,ACH-000001,olaparib,0.916763,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,"PARP1, PARP2",60.0,Adult,Female,caucasian,Metastatic,Adherent,ovarian cancer,0.822464,0.0,0.0,0.752341,0.722099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.395118,32.0,
2,ACH-000001,rucaparib,0.828451,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,"PARP1, PARP2",60.0,Adult,Female,caucasian,Metastatic,Adherent,,0.797532,0.0,0.0,0.752341,0.722099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.395118,32.0,
3,ACH-000001,talazoparib,0.537748,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,PARP2,60.0,Adult,Female,caucasian,Metastatic,Adherent,,0.966008,0.0,0.0,0.752341,0.722099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.395118,32.0,
4,ACH-000013,niraparib,0.596983,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,PARP1,60.0,Adult,Female,caucasian,Metastatic,Adherent,primary peritoneal cancer (PPC),0.768993,0.0,0.0,0.836582,0.788146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.215191,30.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371,ACH-000979,talazoparib,0.736700,0.0,1,0,0,0,Prostate,Prostate Adenocarcinoma,Prostate Adenocarcinoma,PARP2,69.0,Adult,Male,caucasian,Metastatic,Adherent,,0.832637,0.0,0.0,1.074130,0.852980,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.401734,24.0,
372,ACH-001145,niraparib,0.821086,,0,0,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,Serous Ovarian Cancer,PARP1,60.0,Adult,Female,,Metastatic,Adherent,primary peritoneal cancer (PPC),0.803825,,,,,,,,,,,,,-0.446518,6.0,
373,ACH-001145,olaparib,0.938036,,0,0,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,Serous Ovarian Cancer,"PARP1, PARP2",60.0,Adult,Female,,Metastatic,Adherent,ovarian cancer,0.679592,,,,,,,,,,,,,-0.446518,6.0,
374,ACH-001145,rucaparib,0.915185,,0,0,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,Serous Ovarian Cancer,"PARP1, PARP2",60.0,Adult,Female,,Metastatic,Adherent,,0.677656,,,,,,,,,,,,,-0.446518,6.0,
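
The left merges above keep one output row per input row only if each feature table has at most one row per `ModelID`; that is an assumption here, not something the output confirms. A minimal sketch of how it could be checked, using invented toy frames and a hypothetical helper named `checked_left_merge`:

```python
import pandas as pd

# Toy stand-ins for the real frames (all values invented for illustration)
pairs = pd.DataFrame({
    'ModelID': ['ACH-A', 'ACH-A', 'ACH-B'],
    'name': ['olaparib', 'niraparib', 'olaparib'],
    'auc': [0.91, 0.74, 0.82],
})
features = pd.DataFrame({
    'ModelID': ['ACH-A'],
    'BRCA_HRD': [0.0],
})

def checked_left_merge(left, right, key='ModelID'):
    """Left-merge that raises if `right` has duplicate keys (hypothetical helper)."""
    merged = left.merge(right, on=key, how='left',
                        validate='many_to_one', indicator=True)
    # '_merge' == 'left_only' marks rows that found no match in `right`
    unmatched = int((merged['_merge'] == 'left_only').sum())
    return merged.drop(columns='_merge'), unmatched

merged, unmatched = checked_left_merge(pairs, features)
print(len(merged), unmatched)
```

With `validate='many_to_one'`, pandas raises a `MergeError` when the right-hand table repeats a key, so silent row duplication cannot occur. The unmatched count shows how many rows will carry missing values for the merged features.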


## 4. Cursory Explorations 🔍

**Objective**: Examine cell lines, HRD label coverage, and other key features to establish a baseline understanding of the engineered dataset and execute any remaining preprocessing steps.

---
---

In [5]:
# Examine primary features and assess missing values
ccl_hrd_parpi_df[primary_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 376 entries, 0 to 375
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   ModelID         376 non-null    object 
 1   PARP_inhibitor  376 non-null    object 
 2   AUC             376 non-null    float64
 3   hrd_BRCA        330 non-null    float64
 4   hrd_proxy       376 non-null    int64  
 5   hrd_score_pct   376 non-null    int64  
 6   hrd_score_std   376 non-null    int64  
 7   hrd_SBS3        376 non-null    int64  
dtypes: float64(2), int64(4), object(2)
memory usage: 23.6+ KB


In [6]:
# Inspect counts for binary flags
print(ccl_hrd_parpi_df[['hrd_BRCA', 'hrd_proxy', 'hrd_score_pct', 'hrd_score_std', 'hrd_SBS3']].apply(
    pd.Series.value_counts).fillna(0).astype(int))

     hrd_BRCA  hrd_proxy  hrd_score_pct  hrd_score_std  hrd_SBS3
0.0       323         62            198            323       353
1.0         7        314            178             53        23


The integrated dataset contains 376 (cell line, PARP inhibitor) pairs with complete drug response measurements (`AUC`) and harmonized HRD features. All engineered HRD labels are present for every row except `hrd_BRCA`, which has 46 missing values (330 of 376 non-null), reflecting incomplete genomic annotation for a subset of models. Because BRCA-defined HRD is rare and the complementary genomic score and SBS3-based definitions cover the full cohort, a unified HRD status can be constructed without further preprocessing.
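
Because each cell line appears once per inhibitor, row-level counts weight a model by the number of drugs it was screened against. The sketch below contrasts row-level and model-level counts on an invented toy frame, assuming (as the table above suggests) that HRD labels are constant within a `ModelID`:

```python
import pandas as pd

# Toy frame: models A and B were screened on two drugs, C on one (values invented)
toy = pd.DataFrame({
    'ModelID':  ['ACH-A', 'ACH-A', 'ACH-B', 'ACH-B', 'ACH-C'],
    'hrd_SBS3': [1, 1, 0, 0, 1],
})

# Counts per row (one row per cell line and inhibitor pair)
row_counts = toy['hrd_SBS3'].value_counts()

# Counts per unique model: keep the first row of each ModelID
model_counts = toy.drop_duplicates('ModelID')['hrd_SBS3'].value_counts()

print(row_counts.to_dict(), model_counts.to_dict())
```

Applying the same `drop_duplicates('ModelID')` step to `ccl_hrd_parpi_df` would give prevalence per cell line rather than per pair.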

The HRD definitions show markedly different prevalence across the cohort (counts are rows, i.e. cell line and inhibitor pairs, not unique models):

- **BRCA-based HRD** is rare (7 positive rows), consistent with the expectation that confirmed BRCA double-hit events or homozygous deletions are uncommon.
- **Proxy genomic score-based HRD** labels the majority of rows as HRD-positive (314/376), indicating an overly broad labeling strategy.
- **HRD score-based flags** show intermediate prevalence (178 positives for the percentile-based flag and 53 for the standard deviation-based flag), reflecting threshold-driven stratification. Because every curated cell line has complete data for these flags, the proxy labels are no longer needed.
- **SBS3-based HRD** is present in 23 rows; as with BRCA-based HRD, exposure is sparse but well-defined when detected.

These discrepancies underscore that each HRD definition captures a distinct biological or statistical construct. Before modeling, it will be important to examine overlap between labels and assess whether certain definitions are overly restrictive or overly inclusive relative to drug response.
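
One way to examine the overlap between labels is a co-occurrence matrix that counts rows where two labels are both positive. The following is a sketch on invented toy flags; the helper name `pairwise_overlap` is hypothetical, and treating missing values as negative is an assumption made only for this illustration:

```python
import pandas as pd

# Toy flags (values invented); the real columns live in ccl_hrd_parpi_df
flags = pd.DataFrame({
    'hrd_BRCA':      [1, 0, 0, 0, None],
    'hrd_score_pct': [1, 1, 1, 0, 0],
    'hrd_SBS3':      [0, 1, 0, 0, 0],
})

def pairwise_overlap(df):
    """Count rows where both labels are positive, for every pair of columns."""
    pos = df.fillna(0).astype(int)  # assumption: missing counts as negative
    return pos.T.dot(pos)           # diagonal holds positives per label

overlap = pairwise_overlap(flags)
print(overlap)
```

Off-diagonal zeros indicate labels that never co-occur, while an off-diagonal value equal to a diagonal entry indicates that one label is fully contained in the other.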

In [7]:
# Examine std-based vs pct-based HRD score threshold
pd.crosstab(
    ccl_hrd_parpi_df['hrd_score_pct'],
    ccl_hrd_parpi_df['hrd_score_std']
)

hrd_score_std    0   1
hrd_score_pct
0              198   0
1              125  53


The cross-tabulation shows that every row classified as HRD-positive under the std-based threshold is also captured by the pct-based threshold (no row has std = 1 and pct = 0). In other words, the standard deviation-based definition is a strict subset of the percentile-based group. The additional 125 rows captured only by the pct-based threshold likely represent intermediate HRD phenotypes rather than extreme outliers.

Based on this, the unified HRD status feature uses the pct-based flag for the primary analysis, while a stricter, high-confidence label built on the std-based flag serves as a robustness check.
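
The subset relationship can also be asserted programmatically so that it is re-checked whenever the thresholds change. A minimal sketch on invented toy flags, with a hypothetical helper `is_subset`:

```python
import pandas as pd

# Toy flags (values invented for illustration)
toy = pd.DataFrame({
    'hrd_score_pct': [0, 1, 1, 1],
    'hrd_score_std': [0, 0, 1, 1],
})

def is_subset(df, strict_col, broad_col):
    """True if every row positive under strict_col is also positive under broad_col."""
    strict_only = (df[strict_col] == 1) & (df[broad_col] != 1)
    return not strict_only.any()

print(is_subset(toy, 'hrd_score_std', 'hrd_score_pct'))
```

On the real data the equivalent call would be `is_subset(ccl_hrd_parpi_df, 'hrd_score_std', 'hrd_score_pct')`.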

## 5. HRD Status Consolidation 🧮

**Objective**: Combine the distinct HRD labels into a single feature that determines HRD status across the cell lines for drug sensitivity analysis.

---
---

In [8]:
# Create unified HRD status (positive if any HRD feature is present).
# Note: NaN == 1 evaluates to False, so rows with a missing hrd_BRCA
# are classified from the HRD score and SBS3 labels alone.
ccl_hrd_parpi_df['HRD_status'] = (
    (ccl_hrd_parpi_df['hrd_BRCA'] == 1) |
    (ccl_hrd_parpi_df['hrd_score_pct'] == 1) |
    (ccl_hrd_parpi_df['hrd_SBS3'] == 1)
).astype(int)

# Construct high-confidence composite HRD label
# (same rule, with the stricter std-based HRD score flag)
ccl_hrd_parpi_df['HRD_status_strict'] = (
    (ccl_hrd_parpi_df['hrd_BRCA'] == 1) |
    (ccl_hrd_parpi_df['hrd_score_std'] == 1) |
    (ccl_hrd_parpi_df['hrd_SBS3'] == 1)
).astype(int)

In [9]:
cols_set_1 = ['ModelID', 'PARP_inhibitor', 'AUC', 'HRD_status', 'HRD_status_strict']
cols_set_2 = [col for col in ccl_hrd_parpi_df.columns if col not in cols_set_1]
ccl_hrd_parpi_df = ccl_hrd_parpi_df[cols_set_1 + cols_set_2]
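
The any-positive rule treats a missing `hrd_BRCA` as negative, because a comparison against NaN is False. The sketch below reproduces that behavior on invented toy rows and, purely as a hypothetical alternative, shows a variant (`HRD_status_na`, a name not used in this notebook) that keeps the status missing when no label is positive and at least one label is unknown:

```python
import numpy as np
import pandas as pd

# Toy rows (values invented): BRCA status is missing for rows 1 and 2
toy = pd.DataFrame({
    'hrd_BRCA':      [1.0, np.nan, np.nan, 0.0],
    'hrd_score_pct': [0, 1, 0, 0],
    'hrd_SBS3':      [0, 0, 0, 0],
})
label_cols = ['hrd_BRCA', 'hrd_score_pct', 'hrd_SBS3']

# Same rule as the notebook cell: positive if any label equals 1
any_pos = toy[label_cols].eq(1).any(axis=1)
toy['HRD_status'] = any_pos.astype(int)

# Hypothetical variant: leave the status as NaN when nothing is positive
# and at least one label is missing
has_missing = toy[label_cols].isna().any(axis=1)
toy['HRD_status_na'] = any_pos.astype(float).mask(~any_pos & has_missing)

print(toy[['HRD_status', 'HRD_status_na']])
```

Row 2 is the only one where the two versions differ: it is 0 under the notebook's rule and missing under the variant. Whether that distinction matters depends on how the 46 rows without BRCA annotation are meant to be handled downstream.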

## 6. Summary & Export 💾

---
---

In [10]:
# Export merged data as reusable pickle file
ccl_hrd_parpi_df.to_pickle(BASE_DIR / 'utils' / 'ccl_hrd_parpi_df.pkl')

ccl_hrd_parpi_df

Unnamed: 0,ModelID,PARP_inhibitor,AUC,HRD_status,HRD_status_strict,hrd_BRCA,hrd_proxy,hrd_score_pct,hrd_score_std,hrd_SBS3,OncotreeLineage,OncotreePrimaryDisease,OncotreeSubtype,target,Age,AgeCategory,Sex,PatientRace,PrimaryOrMetastasis,GrowthPattern,indication,r2,BRCA2_damaging,BRCA1_damaging,BRCA1_cnloss_log2,BRCA2_cnloss_log2,BRCA1_loh_flag,BRCA2_loh_flag,BRCA1_homdel_flag,BRCA2_homdel_flag,BRCA1_double_hit,BRCA2_double_hit,BRCA_double_hit,BRCA_homdel,proxy_HRD_genomic_score,hrdsum_summary,SBS3
0,ACH-000001,niraparib,0.741783,1,0,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,PARP1,60.0,Adult,Female,caucasian,Metastatic,Adherent,primary peritoneal cancer (PPC),0.876853,0.0,0.0,0.752341,0.722099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.395118,32.0,
1,ACH-000001,olaparib,0.916763,1,0,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,"PARP1, PARP2",60.0,Adult,Female,caucasian,Metastatic,Adherent,ovarian cancer,0.822464,0.0,0.0,0.752341,0.722099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.395118,32.0,
2,ACH-000001,rucaparib,0.828451,1,0,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,"PARP1, PARP2",60.0,Adult,Female,caucasian,Metastatic,Adherent,,0.797532,0.0,0.0,0.752341,0.722099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.395118,32.0,
3,ACH-000001,talazoparib,0.537748,1,0,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,PARP2,60.0,Adult,Female,caucasian,Metastatic,Adherent,,0.966008,0.0,0.0,0.752341,0.722099,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.395118,32.0,
4,ACH-000013,niraparib,0.596983,1,0,0.0,1,1,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,PARP1,60.0,Adult,Female,caucasian,Metastatic,Adherent,primary peritoneal cancer (PPC),0.768993,0.0,0.0,0.836582,0.788146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.215191,30.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371,ACH-000979,talazoparib,0.736700,0,0,0.0,1,0,0,0,Prostate,Prostate Adenocarcinoma,Prostate Adenocarcinoma,PARP2,69.0,Adult,Male,caucasian,Metastatic,Adherent,,0.832637,0.0,0.0,1.074130,0.852980,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.401734,24.0,
372,ACH-001145,niraparib,0.821086,0,0,,0,0,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,Serous Ovarian Cancer,PARP1,60.0,Adult,Female,,Metastatic,Adherent,primary peritoneal cancer (PPC),0.803825,,,,,,,,,,,,,-0.446518,6.0,
373,ACH-001145,olaparib,0.938036,0,0,,0,0,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,Serous Ovarian Cancer,"PARP1, PARP2",60.0,Adult,Female,,Metastatic,Adherent,ovarian cancer,0.679592,,,,,,,,,,,,,-0.446518,6.0,
374,ACH-001145,rucaparib,0.915185,0,0,,0,0,0,0,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,Serous Ovarian Cancer,"PARP1, PARP2",60.0,Adult,Female,,Metastatic,Adherent,,0.677656,,,,,,,,,,,,,-0.446518,6.0,


This notebook consolidates independently engineered HRD features into two unified composite HRD labels (`HRD_status` and `HRD_status_strict`), integrating BRCA double-hit or homozygous deletion status, genomic scar-based HRD scores, and SBS3 mutational signature exposure. The resulting dataset establishes a standardized, analysis-ready framework for systematically evaluating the association between HRD status and PARP inhibitor response.
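
As an illustration of how the exported table might be used downstream, the sketch below summarizes median AUC per inhibitor and HRD status on invented toy values. It is not a result from this dataset, and it assumes the common convention that a lower AUC indicates greater sensitivity:

```python
import pandas as pd

# Toy response data (values invented for illustration)
toy = pd.DataFrame({
    'PARP_inhibitor': ['olaparib'] * 4 + ['niraparib'] * 4,
    'HRD_status':     [1, 1, 0, 0, 1, 1, 0, 0],
    'AUC':            [0.70, 0.80, 0.90, 1.00, 0.60, 0.70, 0.80, 0.90],
})

# Median AUC per inhibitor, one column per HRD status (0 and 1)
summary = (
    toy.groupby(['PARP_inhibitor', 'HRD_status'])['AUC']
       .median()
       .unstack('HRD_status')
)

# Negative delta: HRD-positive rows have lower median AUC than HRD-negative rows
summary['delta'] = summary[1] - summary[0]
print(summary)
```

Repeating the summary with `HRD_status_strict` in place of `HRD_status` gives the robustness comparison described earlier.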