# TCGA-BRCA Demo

## Dataset Source

- **Omics Data**: [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)
- **Clinical and PAM50 Data**: [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)

## Dataset Overview

**Original Data**:

- **Methylation**: 20,107 × 885
- **mRNA**: 18,321 × 1,212
- **miRNA**: 503 × 1,189
- **PAM50**: 1,087 × 1
- **Clinical**: 1,098 × 101

- **Note: Omics matrices are features × samples; clinical matrices are samples × fields.**

### PAM50 Subtype Counts (Patient-Level Intersection)

- **LumA**: 419
- **LumB**: 140
- **Basal**: 130
- **Her2**: 46
- **Normal**: 34

## Patients in Every Dataset

- Total patients present in methylation, mRNA, miRNA, PAM50, and clinical: **769**

## Final Shapes (Per-Patient)

After aggregating multiple aliquots by mean, all modalities align on 769 patients:

- **Methylation**: 769 × 20,107
- **mRNA**: 769 × 18,321
- **miRNA**: 769 × 503
- **PAM50**: 769 × 1
- **Clinical**: 769 × 119

## Data Summary Table

| Stage                          | Clinical    | Methylation  | miRNA       | mRNA           | PAM50 (Subtype Counts)                                         | Notes                                   |
| ------------------------------ | ----------- | ------------ | ----------- | -------------- | -------------------------------------------------------------- | --------------------------------------- |
| **Original Raw Data**          | 1,098 × 101 | 20,107 × 885 | 503 × 1,189 | 18,321 × 1,212 | LumA: 509<br>LumB: 209<br>Basal: 192<br>Her2: 82<br>Normal: 40 | Raw FireHose & TCGAbiolinks files       |
| **Patient-Level Intersection** | 769 × 119   | 769 × 20,107 | 769 × 503   | 769 × 18,321   | LumA: 419<br>LumB: 140<br>Basal: 130<br>Her2: 46<br>Normal: 34 | Patients with complete data in all sets |

## Reference Links

- [FireHose BRCA](http://firebrowse.org/?cohort=BRCA)
- [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)
- [Direct Download BRCA](http://firebrowse.org/?cohort=BRCA&download_dialog=true)


### Let's take a look at the data from FireHose directly after download

In [1]:
import pandas as pd
from pathlib import Path
root = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA")

mirna_raw = pd.read_csv(root / "BRCA.miRseq_RPKM_log2.txt", sep="\t", index_col=0, low_memory=False)
rna_raw = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t", index_col=0, low_memory=False)
meth_raw = pd.read_csv(root / "BRCA.meth.by_mean.data.txt", sep="\t", index_col=0, low_memory=False)
clinical_raw = pd.read_csv(root / "BRCA.clin.merged.picked.txt", sep="\t", index_col=0, low_memory=False)

print(f"mirna shape: {mirna_raw.shape}, rna shape: {rna_raw.shape}, meth shape: {meth_raw.shape}, clinical shape: {clinical_raw.shape}")
display(mirna_raw.head())
display(rna_raw.head())
display(meth_raw.head())
display(clinical_raw.head())

mirna shape: (503, 1189), rna shape: (18321, 1212), meth shape: (20107, 885), clinical shape: (18, 1097)


gene,TCGA-3C-AAAU-01,TCGA-3C-AALI-01,TCGA-3C-AALJ-01,TCGA-3C-AALK-01,TCGA-4H-AAAK-01,TCGA-5L-AAT0-01,TCGA-5L-AAT1-01,TCGA-5T-A9QA-01,TCGA-A1-A0SB-01,TCGA-A1-A0SD-01,...,TCGA-BH-A0WA-01,TCGA-E2-A105-01,TCGA-E2-A106-01,TCGA-E2-A107-01,TCGA-E2-A108-01,TCGA-E2-A109-01,TCGA-E2-A10B-01,TCGA-E2-A10C-01,TCGA-E2-A10E-01,TCGA-E2-A10F-01
hsa-let-7a-1,13.129765,12.918069,13.012033,13.144697,13.411684,13.316301,13.44523,13.72785,13.601504,13.598739,...,12.225132,13.938134,13.609853,13.50829,13.406359,13.730647,13.198426,12.79335,14.060268,12.990403
hsa-let-7a-2,14.117933,13.9223,14.010002,14.141721,14.413518,14.310917,14.448556,14.714551,14.608693,14.606942,...,13.235065,14.930021,14.603389,14.525026,14.402735,14.719166,14.200523,13.796623,15.047592,14.006035
hsa-let-7a-3,13.147714,12.913194,13.028483,13.151281,13.420481,13.327144,13.446806,13.736891,13.613105,13.606224,...,12.261971,13.972011,13.643274,13.549981,13.438737,13.73207,13.212367,12.79335,14.074978,13.018659
hsa-let-7b,14.595135,14.512657,13.419612,14.667196,14.438548,14.576493,14.611137,15.098805,16.505758,15.638855,...,14.684912,15.230457,15.357655,15.112011,15.040315,15.806771,15.64591,14.724106,16.370741,15.439239
hsa-let-7c,8.41489,9.646536,9.312455,11.511431,11.693927,11.138419,11.284446,9.197514,13.392164,11.419823,...,10.565698,10.483745,11.159056,12.47334,12.405828,10.613712,11.395452,9.087202,10.88552,11.385638


gene,TCGA-3C-AAAU-01,TCGA-3C-AALI-01,TCGA-3C-AALJ-01,TCGA-3C-AALK-01,TCGA-4H-AAAK-01,TCGA-5L-AAT0-01,TCGA-5L-AAT1-01,TCGA-5T-A9QA-01,TCGA-A1-A0SB-01,TCGA-A1-A0SD-01,...,TCGA-UL-AAZ6-01,TCGA-UU-A93S-01,TCGA-V7-A7HQ-01,TCGA-W8-A86G-01,TCGA-WT-AB41-01,TCGA-WT-AB44-01,TCGA-XX-A899-01,TCGA-XX-A89A-01,TCGA-Z7-A8R5-01,TCGA-Z7-A8R6-01
?|100133144,4.032489,3.211931,3.538886,3.595671,2.77543,1.995991,,0.55031,3.939189,3.250628,...,-1.324816,2.108558,,2.475707,,,3.846574,4.480524,1.178747,2.783771
?|100134869,3.692829,4.119273,3.206237,3.469873,3.850979,3.766489,3.405298,3.169252,3.847346,3.501324,...,3.845189,3.443978,1.622556,3.845099,2.657434,1.703987,4.422294,4.769476,2.866572,4.631075
?|10357,5.704604,6.124231,7.26957,7.168565,6.395968,6.836141,6.857961,6.749035,6.862786,5.913201,...,7.08347,7.088829,4.906766,7.003547,5.744909,5.401368,7.106177,6.003213,6.410173,7.388457
?|10431,8.672694,9.139279,10.410275,9.75745,9.581922,9.657753,10.114256,10.472185,9.360367,9.933569,...,10.616682,11.495054,10.74977,9.44641,10.282241,10.874534,9.3504,9.497295,10.155173,9.970921
?|155060,10.21311,9.011343,9.209506,9.110487,8.027083,8.110023,7.704865,6.254741,8.128052,6.387132,...,8.052478,7.516236,9.280761,9.631306,8.137225,9.460539,8.738651,8.556414,7.97767,7.894918


Hybridization REF,TCGA-3C-AAAU-01,TCGA-3C-AALI-01,TCGA-3C-AALJ-01,TCGA-3C-AALK-01,TCGA-4H-AAAK-01,TCGA-5L-AAT0-01,TCGA-5L-AAT1-01,TCGA-5T-A9QA-01,TCGA-A1-A0SB-01,TCGA-A1-A0SE-01,...,TCGA-UL-AAZ6-01,TCGA-UU-A93S-01,TCGA-V7-A7HQ-01,TCGA-W8-A86G-01,TCGA-WT-AB41-01,TCGA-WT-AB44-01,TCGA-XX-A899-01,TCGA-XX-A89A-01,TCGA-Z7-A8R5-01,TCGA-Z7-A8R6-01
Composite Element REF,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,...,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value,Beta_Value
A1BG,0.483716119676,0.637191226131,0.656092398242,0.615194471357,0.612080370511,0.469600740678,0.582188239422,0.66617073097,0.659965611959,0.641701155202,...,0.631413241724,0.64952294395,0.596585169597,0.615558357651,0.580837880262,0.615814023324,0.589897794957,0.572606636128,0.617859586161,0.568150149265
A1CF,0.295827203492,0.458972998571,0.489725289638,0.625765223243,0.507736509665,0.514770866326,0.549850958729,0.381038654448,0.826312156393,0.606699429409,...,0.383469192855,0.183354853938,0.403909161312,0.716980255014,0.613131295074,0.665043713213,0.705153725375,0.494848686021,0.691835387189,0.224696596211
A2BP1,0.187699869591,0.240515847704,0.279087851226,0.488888510474,0.463845494635,0.504450855353,0.480885816745,0.622832399216,0.474678831563,0.339829506578,...,0.130529915536,0.319855310743,0.335517456053,0.512185396638,0.563519806811,0.507364324635,0.520542747167,0.412562068574,0.522169978143,0.33955834608
A2LD1,0.62958551322,0.666272288675,0.755630499986,0.74575121287,0.698515739124,0.706812706661,0.759017355996,0.694010939885,0.847837522256,0.786662091353,...,0.587475995313,0.667969642321,0.689140211036,0.791381283524,0.680499323148,0.660476360054,0.745725420412,0.74390049875,0.791229999577,0.637764188841


Hybridization REF,tcga-5l-aat0,tcga-5l-aat1,tcga-a1-a0sp,tcga-a2-a04v,tcga-a2-a04y,tcga-a2-a0cq,tcga-a2-a1g4,tcga-a2-a25a,tcga-a7-a0cd,tcga-a7-a13g,...,tcga-s3-aa11,tcga-s3-aa14,tcga-s3-aa15,tcga-ul-aaz6,tcga-uu-a93s,tcga-v7-a7hq,tcga-wt-ab44,tcga-xx-a899,tcga-xx-a89a,tcga-z7-a8r6
Composite Element REF,value,value,value,value,value,value,value,value,value,value,...,value,value,value,value,value,value,value,value,value,value
years_to_birth,42,63,40,39,53,62,71,44,66,79,...,67,47,51,73,63,75,,46,68,46
vital_status,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
days_to_death,,,,1920,,,,,,,...,,,,,116,,,,,
days_to_last_followup,1477,1471,584,,1099,2695,595,3276,1165,718,...,421,529,525,518,,2033,883,467,488,3256
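
The omics column labels above are TCGA barcodes: the first three dash-separated fields identify the patient, and the fourth field starts with a two-digit sample-type code (`01` denotes a primary solid tumor). The clinical table uses lowercase patient-level barcodes instead. A minimal sketch of splitting such a barcode follows; `split_barcode` is a hypothetical helper for illustration and is not part of the notebook, which uses its own `trim` function later on.

```python
import re

# Hypothetical helper for illustration only.
# Matches the patient part and, if present, the two-digit sample-type code.
BARCODE = re.compile(r"^(TCGA-\w{2}-\w{4})(?:-(\d{2}))?", re.IGNORECASE)

def split_barcode(barcode):
    """Return (patient_id, sample_type_code or None) for a TCGA barcode."""
    match = BARCODE.match(barcode)
    if match is None:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    patient, sample_type = match.groups()
    return patient.upper(), sample_type

print(split_barcode("TCGA-3C-AAAU-01"))  # omics-style barcode
print(split_barcode("tcga-5l-aat0"))     # clinical-style barcode, no sample code
```

Upper-casing the patient part is what lets the lowercase clinical identifiers line up with the omics barcodes.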


## TCGAbiolinks

This section uses the `TCGAbiolinks` R package to download the clinical and molecular subtype data. The script installs `TCGAbiolinks` if it is missing and loads it, retrieves PAM50 subtype labels with `TCGAquery_subtype()`, and writes them to a CSV file. It then downloads clinical data with `GDCquery_clinic()`, which already returns a data frame, and saves it as a second CSV file.

```R
  # Install TCGAbiolinks
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install("TCGAbiolinks")
  }

  # Load the library
  library(TCGAbiolinks)

  # Download PAM50 subtype labels
  pam50_df <- TCGAquery_subtype(tumor = "BRCA")[ , c("patient", "BRCA_Subtype_PAM50")]
  write.csv(pam50_df, file = "BRCA_PAM50_labels.csv", row.names = FALSE, quote = FALSE)

  # Download clinical data; GDCquery_clinic() already returns a data frame
  clin_df <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
  # Keep default quoting: fields such as "Breast, NOS" contain commas
  write.csv(clin_df, file = "BRCA_clinical_data.csv", row.names = FALSE)
```

In [2]:
import pandas as pd

# from FireHose
mirna = pd.read_csv(root / "BRCA.miRseq_RPKM_log2.txt", sep="\t", index_col=0, low_memory=False)
meth = pd.read_csv(root / "BRCA.meth.by_mean.data.txt", sep="\t", index_col=0, low_memory=False)
rna = pd.read_csv(root / "BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt", sep="\t", index_col=0, low_memory=False)
clinical_firehose = pd.read_csv(root / "BRCA.clin.merged.picked.txt", sep="\t", index_col=0, low_memory=False).T

# from TCGAbiolinks
pam50 = pd.read_csv(root / "BRCA_PAM50_labels.csv", index_col=0)
clinical_biolinks = pd.read_csv(root / "BRCA_clinical_data.csv", index_col=1)

print("Initial shapes")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical TCGABioLinks: {clinical_biolinks.shape}")
print(f"clinical FireHose: {clinical_firehose.shape}")

meth = meth.T
rna = rna.T
mirna = mirna.T

print("\nAfter transpose")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")

def trim(idx):
    return idx.to_series().str.extract(r'(^TCGA-\w\w-\w\w\w\w)')[0]

meth.index = trim(meth.index)
rna.index = trim(rna.index)
mirna.index = trim(mirna.index)
pam50.index = pam50.index.str.upper()
clinical_biolinks.index = clinical_biolinks.index.str.upper()
clinical_firehose.index = clinical_firehose.index.str.upper()

idx1 = clinical_biolinks.index
idx2 = clinical_firehose.index

# patients shared by the two clinical tables
common = idx1.intersection(idx2)
print(f"Patients in both clinical datasets: {len(common)}")

clinical_biolinks = clinical_biolinks.loc[common]
clinical_firehose = clinical_firehose.loc[common]

clinical = pd.concat([clinical_biolinks, clinical_firehose], axis=1)

print(f"Combined Clinical shape {clinical.shape}")

common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(pam50.index) & set(clinical.index))
print(f"Patients in every dataset: {len(common)}")

meth = meth.loc[common]
rna = rna.loc[common]
mirna = mirna.loc[common]
pam50 = pam50.loc[common]
clinical = clinical.loc[common]

print("\nFinal shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical: {clinical.shape}\n")

Initial shapes
meth: (20107, 885)
rna: (18321, 1212)
mirna: (503, 1189)
pam50: (1087, 1)
clinical TCGABioLinks: (1098, 101)
clinical FireHose: (1097, 18)

After transpose
meth: (885, 20107)
rna: (1212, 18321)
mirna: (1189, 503)
Patients in both clinical datasets: 1097
Combined Clinical shape (1097, 119)
Patients in every dataset: 769

Final shapes:
meth: (863, 20107)
rna: (865, 18321)
mirna: (855, 503)
pam50: (769, 1)
clinical: (769, 119)
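
The omics tables in the final shapes above still have more than 769 rows. This is because `.loc` with a list of labels returns every row whose label matches, and patients with several aliquots appear more than once after the barcodes are trimmed. A toy sketch of that behaviour, with made-up values:

```python
import pandas as pd

# Toy table: patient "P1" has two aliquots, so its label appears twice.
toy = pd.DataFrame({"feature": [1.0, 3.0, 5.0]}, index=["P1", "P1", "P2"])

common = ["P1", "P2"]      # two unique patients
subset = toy.loc[common]   # .loc keeps every row that matches each label

print(f"unique patients: {len(common)}, rows selected: {subset.shape[0]}")
```

The row count only matches the patient count once duplicate labels have been collapsed, which is what the next section does.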



### Handling Multiple Aliquots per Sample

Some patients have more than one aliquot per sample in the `meth`, `rna`, and `mirna` tables. The next cell first counts the patients with duplicate entries, then coerces all values to numeric types and averages the aliquots of each patient so that every patient has exactly one row. After aggregation, the tables are restricted to the patients common to all five datasets (`meth`, `rna`, `mirna`, `pam50`, and `clinical`). The result is a set of matched samples ready for integrated analysis.
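
The coerce-then-average step can be sketched on a toy table. The barcodes and values below are made up for illustration; only the two pandas calls mirror the notebook.

```python
import pandas as pd

# Toy data: two aliquots for the first patient, one for the second.
toy = pd.DataFrame(
    {"geneA": ["1.0", "3.0", "5.0"], "geneB": ["2.0", "n/a", "6.0"]},
    index=["TCGA-AA-0001", "TCGA-AA-0001", "TCGA-AA-0002"],
)

numeric = toy.apply(pd.to_numeric, errors="coerce")  # "n/a" becomes NaN
per_patient = numeric.groupby(level=0).mean()        # mean() skips NaN values

print(per_patient)
```

Because the mean skips missing values, a patient with one valid aliquot and one missing measurement keeps the valid value instead of becoming NaN.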

In [3]:
for name, df in [("meth", meth), ("rna", rna), ("mirna", mirna)]:
    counts = df.index.value_counts()
    n_multiple = (counts > 1).sum()
    total_duplicates = counts[counts > 1].sum() - n_multiple
    
    print(f"{name}:")
    print(f"patients with >1 aliquot: {n_multiple}")
    print(f"total duplicate rows: {total_duplicates}\n")

meth = meth.apply(pd.to_numeric, errors="coerce")
rna = rna.apply(pd.to_numeric, errors="coerce")
mirna = mirna.apply(pd.to_numeric, errors="coerce")

meth = meth.groupby(level=0).mean()
rna = rna.groupby(level=0).mean()
mirna = mirna.groupby(level=0).mean()

# Now each has one row per patient
print("Post-aggregation shapes:")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")

common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(pam50.index) & set(clinical.index))
print(f"Patients in every dataset: {len(common)}")

meth = meth.loc[common]
rna = rna.loc[common]
mirna = mirna.loc[common]
pam50 = pam50.loc[common]
clinical = clinical.loc[common]

print("\nFinal shapes")
print(f"meth: {meth.shape}")
print(f"rna: {rna.shape}")
print(f"mirna: {mirna.shape}")
print(f"pam50: {pam50.shape}")
print(f"clinical:{clinical.shape}")

meth:
patients with >1 aliquot: 91
total duplicate rows: 94

rna:
patients with >1 aliquot: 93
total duplicate rows: 96

mirna:
patients with >1 aliquot: 84
total duplicate rows: 86

Post-aggregation shapes:
meth: (769, 20107)
rna: (769, 18321)
mirna: (769, 503)
Patients in every dataset: 769

Final shapes
meth: (769, 20107)
rna: (769, 18321)
mirna: (769, 503)
pam50: (769, 1)
clinical:(769, 119)


### Review the first few rows of each file

In [4]:
display(meth.head())
display(rna.head())
display(mirna.head())
display(clinical.head())
display(pam50.value_counts())

Hybridization REF,Composite Element REF,A1BG,A1CF,A2BP1,A2LD1,A2M,A2ML1,A4GALT,A4GNT,AAA1,...,ZWILCH,ZWINT,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,psiTPTE22,tAKR
TCGA-3C-AAAU,,0.483716,0.295827,0.1877,0.629586,0.559654,0.835412,0.4848,0.690217,0.807805,...,0.112978,0.053939,0.287665,0.328087,0.502935,0.220683,0.482044,0.107396,0.247304,0.506404
TCGA-3C-AALI,,0.637191,0.458973,0.240516,0.666272,0.607505,0.842391,0.550047,0.74989,0.39529,...,0.111834,0.04616,0.265322,0.405851,0.434024,0.236362,0.458847,0.119652,0.163022,0.623865
TCGA-3C-AALJ,,0.656092,0.489725,0.279088,0.75563,0.66236,0.82902,0.476107,0.653756,0.795102,...,0.113218,0.042657,0.272103,0.391326,0.449525,0.210976,0.482641,0.102385,0.252328,0.504451
TCGA-3C-AALK,,0.615194,0.625765,0.488889,0.745751,0.727982,0.835365,0.556016,0.652005,0.816423,...,0.145133,0.047022,0.301284,0.410348,0.446571,0.220185,0.485944,0.112941,0.471956,0.682468
TCGA-4H-AAAK,,0.61208,0.507737,0.463845,0.698516,0.692364,0.802388,0.50487,0.531183,0.851114,...,0.118928,0.045057,0.300647,0.379998,0.487929,0.233324,0.490736,0.115646,0.314877,0.744877


gene,?|100133144,?|100134869,?|10357,?|10431,?|155060,?|26823,?|340602,?|388795,?|390284,?|391343,...,ZWINT|11130,ZXDA|7789,ZXDB|158586,ZXDC|79364,ZYG11A|440590,ZYG11B|79699,ZYX|7791,ZZEF1|23140,ZZZ3|26009,psiTPTE22|387590
TCGA-3C-AAAU,4.032489,3.692829,5.704604,8.672694,10.21311,,0.785174,-1.536587,2.048201,,...,9.86412,7.01783,9.976968,10.695662,8.013988,10.238851,11.776124,10.887932,10.205129,0.785174
TCGA-3C-AALI,3.211931,4.119273,6.124231,9.139279,9.011343,0.121015,7.170928,2.291014,0.706022,3.027968,...,9.914682,5.902438,8.809329,10.391374,7.632831,9.237422,12.426428,10.364848,8.667973,9.855788
TCGA-3C-AALJ,3.538886,3.206237,7.26957,10.410275,9.209506,,,1.443554,1.443554,,...,11.30565,5.143969,9.060691,9.586488,8.374267,9.055784,12.414355,9.880935,8.992994,5.143969
TCGA-3C-AALK,3.595671,3.469873,7.168565,9.75745,9.110487,-1.273343,,1.048724,2.186215,,...,9.384994,5.782065,8.773906,9.754688,7.454703,9.246419,12.474556,9.609426,9.453001,6.057699
TCGA-4H-AAAK,2.77543,3.850979,6.395968,9.581922,8.027083,-1.232769,-1.232769,1.574683,1.574683,,...,9.397606,5.61283,8.728789,10.035881,3.811738,9.599438,11.980747,9.700292,9.784147,7.548699


gene,hsa-let-7a-1,hsa-let-7a-2,hsa-let-7a-3,hsa-let-7b,hsa-let-7c,hsa-let-7d,hsa-let-7e,hsa-let-7f-1,hsa-let-7f-2,hsa-let-7g,...,hsa-mir-937,hsa-mir-939,hsa-mir-940,hsa-mir-942,hsa-mir-944,hsa-mir-95,hsa-mir-96,hsa-mir-98,hsa-mir-99a,hsa-mir-99b
TCGA-3C-AAAU,13.129765,14.117933,13.147714,14.595135,8.41489,8.665921,10.521777,3.879392,11.824817,8.597744,...,0.906699,-0.093302,2.672234,2.467414,1.044202,2.044202,6.906699,5.754696,7.024602,15.506461
TCGA-3C-AALI,12.918069,13.9223,12.913194,14.512657,9.646536,9.003653,9.13176,4.386952,12.678841,8.455144,...,1.579597,-0.083367,0.139024,3.032109,-0.668331,0.33167,5.91287,6.427066,7.885299,13.626182
TCGA-3C-AALJ,13.012033,14.010002,13.028483,13.419612,9.312455,9.276943,11.395711,5.314692,13.530255,9.230563,...,3.270298,-2.189134,0.395828,1.855261,-0.381778,0.717757,6.603657,6.878301,7.580704,15.013822
TCGA-3C-AALK,13.144697,14.141721,13.151281,14.667196,11.511431,8.384763,10.368981,4.159182,12.652559,8.471503,...,0.923965,-0.660997,-0.076034,1.798435,1.798435,0.798435,6.181354,5.377922,10.031619,14.554783
TCGA-4H-AAAK,13.411684,14.413518,13.420481,14.438548,11.693927,8.453747,10.741371,4.494537,13.009499,8.38122,...,0.18295,-0.624403,-1.624403,1.076036,0.18295,-0.302475,4.31811,5.103516,10.078201,14.650338


Unnamed: 0,project,synchronous_malignancy,ajcc_pathologic_stage,days_to_diagnosis,laterality,created_datetime,last_known_disease_status,tissue_or_organ_of_origin,days_to_last_follow_up,age_at_diagnosis,...,pathology_N_stage,pathology_M_stage,gender,date_of_initial_pathologic_diagnosis,days_to_last_known_alive,radiation_therapy,histological_type,number_of_lymph_nodes,race,ethnicity
TCGA-3C-AAAU,TCGA-BRCA,No,Stage X,0.0,Left,,,"Breast, NOS",,20211.0,...,nx,mx,female,2004,,no,infiltrating lobular carcinoma,4,white,not hispanic or latino
TCGA-3C-AALI,TCGA-BRCA,No,Stage IIB,0.0,Right,,,"Breast, NOS",,18538.0,...,n1a,m0,female,2003,,yes,infiltrating ductal carcinoma,1,black or african american,not hispanic or latino
TCGA-3C-AALJ,TCGA-BRCA,No,Stage IIB,0.0,Right,,,"Breast, NOS",,22848.0,...,n1a,m0,female,2011,,no,infiltrating ductal carcinoma,1,black or african american,not hispanic or latino
TCGA-3C-AALK,TCGA-BRCA,No,Stage IA,0.0,Right,,,"Breast, NOS",,19074.0,...,n0 (i+),m0,female,2011,,no,infiltrating ductal carcinoma,0,black or african american,not hispanic or latino
TCGA-4H-AAAK,TCGA-BRCA,No,Stage IIIA,0.0,Left,,,"Breast, NOS",,18371.0,...,n2a,m0,female,2013,,no,infiltrating lobular carcinoma,4,white,not hispanic or latino


BRCA_Subtype_PAM50
LumA                  419
LumB                  140
Basal                 130
Her2                   46
Normal                 34
Name: count, dtype: int64

# Preprocessing

After reviewing the data above, we applied the following steps before further analysis.

1. Methylation (β -> M-value)
   - Clip β-values to \[ε, 1-ε] (ε = 1e-6) and apply the logit transform: M = log_2(β / (1-β)).
   - Drop the original `Composite Element REF` column.

2. mRNA & miRNA:
   - Already in log_2 scale (RSEM normalized and RPKM).

3. Quality Control:
   - Count samples whose rows are all zero, or all NaN, in each modality.
   - Replace the remaining NaNs in the omics tables with 0; clinical NaNs are left in place.

4. Column Name Cleaning:
   - Replace all `-` and `|` characters with `_`.
   - Replace `?` with `unknown`.

5. Label Encoding:
   - Map PAM50 subtypes to integers: Normal=0, Basal=1, Her2=2, LumA=3, LumB=4

6. Alignment & Aggregation:
   - Trim barcodes to patient level.
   - Aggregate duplicate aliquots by mean per patient.
   - Drop the `project` column from clinical.
   - Subset all tables to the common patient set (no missing or all-zero samples).

7. Final Output Shapes:
   - Methylation M-value: 769 × 20,106
   - mRNA (log_2): 769 × 18,321
   - miRNA (log_2): 769 × 503
   - PAM50 labels: 769 × 1
   - Clinical covariates: 769 × 118
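
As a worked example of the methylation transform in step 1, a beta value of 0.5 maps to an M-value of 0, 0.8 maps to log_2(4) = 2, and 0.2 maps to -2. A beta value of exactly 0 would give minus infinity, so it is first clipped to the small constant, which yields roughly -19.93. The sketch below applies the same clip-and-logit formula to a plain array; `beta_to_m_values` is an illustrative name, and the default `eps` is taken from the notebook cell that follows.

```python
import numpy as np

def beta_to_m_values(beta, eps=1e-6):
    """Clip to [eps, 1 - eps], then apply M = log2(beta / (1 - beta))."""
    clipped = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(clipped / (1.0 - clipped))

m = beta_to_m_values([0.2, 0.5, 0.8, 0.0])
print(np.round(m, 2))
```

The transform is symmetric around a beta value of 0.5, so hypo- and hypermethylated sites end up on comparable negative and positive scales.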

In [None]:
import numpy as np
import pandas as pd

def beta_to_m(df, eps=1e-6):
    B = np.clip(df.values, eps, 1.0 - eps)
    M = np.log2(B / (1 - B))
    return pd.DataFrame(M, index=df.index, columns=df.columns)

# find rows that are all 0s
zeros_meth = (meth == 0).all(axis=1).sum()
zeros_rna = (rna == 0).all(axis=1).sum()
zeros_mirna = (mirna == 0).all(axis=1).sum()
print(f"All zeros: meth: {zeros_meth}, rna: {zeros_rna}, mirna: {zeros_mirna}")

# find rows with all nans
nan_meth = meth.isna().all(axis=1).sum()
nan_rna = rna.isna().all(axis=1).sum()
nan_mirna = mirna.isna().all(axis=1).sum()
nan_clinical = clinical.isna().all(axis=1).sum()
nan_pam50 = pam50.isna().all(axis=1).sum()
print(f"nan_meth: {nan_meth}, nan_rna: {nan_rna}, nan_mirna: {nan_mirna}, nan_clinical: {nan_clinical}, nan_pam50: {nan_pam50}")

# map PAM50 subtypes to integers
mapping = {"Normal":0, "Basal":1, "Her2":2, "LumA":3, "LumB":4}
pam50 = pam50["BRCA_Subtype_PAM50"].map(mapping).to_frame(name="pam50")

# drop and transform methylation
meth_clean = meth.drop(columns=["Composite Element REF"], errors="ignore")
meth_m = beta_to_m(meth_clean)
clinical = clinical.drop(columns=["project"], errors="ignore")

# clean column names and fill nans
for df in [meth_m, rna, mirna]:
    # `?` marks an unnamed gene (e.g. `?|100133144`); `-` and `|` are separators
    df.columns = df.columns.str.replace("?", "unknown", regex=False)
    df.columns = df.columns.str.replace(r"[-|]", "_", regex=True)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.fillna(0, inplace=True)

# check for nans after filling
print("NaN counts after filling:")
print(meth_m.isna().sum().sum(),rna.isna().sum().sum(),mirna.isna().sum().sum(),clinical.isna().sum().sum(),pam50.isna().sum().sum())

# align index to PAM50
X_meth = meth_m.loc[pam50.index]
X_rna = rna.loc[pam50.index]
X_mirna = mirna.loc[pam50.index]
clinical = clinical.loc[pam50.index]

print(f"new shapes: meth: {X_meth.shape}, rna: {X_rna.shape}, mirna: {X_mirna.shape}, pam50: {pam50.shape}, clinical: {clinical.shape}")
display(X_meth.head())
display(X_rna.head())
display(X_mirna.head())
display(clinical.head())
display(pam50.value_counts())

All zeros: meth: 0, rna: 0, mirna: 0
nan_meth: 0, nan_rna: 0, nan_mirna: 0, nan_clinical: 0, nan_pam50: 0
NaN counts after filling:
0 0 0 46476 0
new shapes: meth: (769, 20106), rna: (769, 18321), mirna: (769, 503), pam50: (769, 1), clinical: (769, 118)


patient,A1BG,A1CF,A2BP1,A2LD1,A2M,A2ML1,A4GALT,A4GNT,AAA1,AAAS,...,ZWILCH,ZWINT,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3,psiTPTE22,tAKR
TCGA-3C-AAAU,-0.094004,-1.251175,-2.113585,0.765262,0.345896,2.343631,-0.087741,1.155791,2.071436,-2.650851,...,-2.972923,-4.132523,-1.308165,-1.034199,0.016935,-1.820233,-0.103662,-3.055084,-1.605783,0.036955
TCGA-3C-AALI,0.812517,-0.237291,-1.658888,0.99744,0.630221,2.418135,0.28978,1.584114,-0.613329,-4.072465,...,-2.989465,-4.369032,-1.469365,-0.549876,-0.382967,-1.691887,-0.238022,-2.879231,-2.360128,0.729981
TCGA-3C-AALJ,0.931878,-0.059301,-1.369104,1.628617,0.97213,2.277584,-0.137988,0.916964,1.95623,-3.781647,...,-2.969472,-4.48819,-1.419578,-0.637297,-0.292273,-1.902991,-0.100215,-3.132087,-1.567104,0.025686
TCGA-3C-AALK,0.676913,0.741678,-0.064133,1.552454,1.4202,2.343133,0.324621,0.905816,2.152928,-3.894574,...,-2.558319,-4.341028,-1.213585,-0.523013,-0.309506,-1.824419,-0.081137,-2.973455,-0.162004,1.10386
TCGA-4H-AAAK,0.657963,0.044649,-0.209004,1.21221,1.170304,2.021628,0.028103,0.180184,2.515149,-3.885526,...,-2.889175,-4.40558,-1.21795,-0.706284,-0.06967,-1.716283,-0.053464,-2.934908,-1.121575,1.545812


patient,unknown_100133144,unknown_100134869,unknown_10357,unknown_10431,unknown_155060,unknown_26823,unknown_340602,unknown_388795,unknown_390284,unknown_391343,...,ZWINT_11130,ZXDA_7789,ZXDB_158586,ZXDC_79364,ZYG11A_440590,ZYG11B_79699,ZYX_7791,ZZEF1_23140,ZZZ3_26009,psiTPTE22_387590
TCGA-3C-AAAU,4.032489,3.692829,5.704604,8.672694,10.21311,0.0,0.785174,-1.536587,2.048201,0.0,...,9.86412,7.01783,9.976968,10.695662,8.013988,10.238851,11.776124,10.887932,10.205129,0.785174
TCGA-3C-AALI,3.211931,4.119273,6.124231,9.139279,9.011343,0.121015,7.170928,2.291014,0.706022,3.027968,...,9.914682,5.902438,8.809329,10.391374,7.632831,9.237422,12.426428,10.364848,8.667973,9.855788
TCGA-3C-AALJ,3.538886,3.206237,7.26957,10.410275,9.209506,0.0,0.0,1.443554,1.443554,0.0,...,11.30565,5.143969,9.060691,9.586488,8.374267,9.055784,12.414355,9.880935,8.992994,5.143969
TCGA-3C-AALK,3.595671,3.469873,7.168565,9.75745,9.110487,-1.273343,0.0,1.048724,2.186215,0.0,...,9.384994,5.782065,8.773906,9.754688,7.454703,9.246419,12.474556,9.609426,9.453001,6.057699
TCGA-4H-AAAK,2.77543,3.850979,6.395968,9.581922,8.027083,-1.232769,-1.232769,1.574683,1.574683,0.0,...,9.397606,5.61283,8.728789,10.035881,3.811738,9.599438,11.980747,9.700292,9.784147,7.548699


patient,hsa_let_7a_1,hsa_let_7a_2,hsa_let_7a_3,hsa_let_7b,hsa_let_7c,hsa_let_7d,hsa_let_7e,hsa_let_7f_1,hsa_let_7f_2,hsa_let_7g,...,hsa_mir_937,hsa_mir_939,hsa_mir_940,hsa_mir_942,hsa_mir_944,hsa_mir_95,hsa_mir_96,hsa_mir_98,hsa_mir_99a,hsa_mir_99b
TCGA-3C-AAAU,13.129765,14.117933,13.147714,14.595135,8.41489,8.665921,10.521777,3.879392,11.824817,8.597744,...,0.906699,-0.093302,2.672234,2.467414,1.044202,2.044202,6.906699,5.754696,7.024602,15.506461
TCGA-3C-AALI,12.918069,13.9223,12.913194,14.512657,9.646536,9.003653,9.13176,4.386952,12.678841,8.455144,...,1.579597,-0.083367,0.139024,3.032109,-0.668331,0.33167,5.91287,6.427066,7.885299,13.626182
TCGA-3C-AALJ,13.012033,14.010002,13.028483,13.419612,9.312455,9.276943,11.395711,5.314692,13.530255,9.230563,...,3.270298,-2.189134,0.395828,1.855261,-0.381778,0.717757,6.603657,6.878301,7.580704,15.013822
TCGA-3C-AALK,13.144697,14.141721,13.151281,14.667196,11.511431,8.384763,10.368981,4.159182,12.652559,8.471503,...,0.923965,-0.660997,-0.076034,1.798435,1.798435,0.798435,6.181354,5.377922,10.031619,14.554783
TCGA-4H-AAAK,13.411684,14.413518,13.420481,14.438548,11.693927,8.453747,10.741371,4.494537,13.009499,8.38122,...,0.18295,-0.624403,-1.624403,1.076036,0.18295,-0.302475,4.31811,5.103516,10.078201,14.650338


patient,synchronous_malignancy,ajcc_pathologic_stage,days_to_diagnosis,laterality,created_datetime,last_known_disease_status,tissue_or_organ_of_origin,days_to_last_follow_up,age_at_diagnosis,primary_diagnosis,...,pathology_N_stage,pathology_M_stage,gender,date_of_initial_pathologic_diagnosis,days_to_last_known_alive,radiation_therapy,histological_type,number_of_lymph_nodes,race,ethnicity
TCGA-3C-AAAU,No,Stage X,0.0,Left,,,"Breast, NOS",,20211.0,"Lobular carcinoma, NOS",...,nx,mx,female,2004,,no,infiltrating lobular carcinoma,4,white,not hispanic or latino
TCGA-3C-AALI,No,Stage IIB,0.0,Right,,,"Breast, NOS",,18538.0,"Infiltrating duct carcinoma, NOS",...,n1a,m0,female,2003,,yes,infiltrating ductal carcinoma,1,black or african american,not hispanic or latino
TCGA-3C-AALJ,No,Stage IIB,0.0,Right,,,"Breast, NOS",,22848.0,"Infiltrating duct carcinoma, NOS",...,n1a,m0,female,2011,,no,infiltrating ductal carcinoma,1,black or african american,not hispanic or latino
TCGA-3C-AALK,No,Stage IA,0.0,Right,,,"Breast, NOS",,19074.0,"Infiltrating duct carcinoma, NOS",...,n0 (i+),m0,female,2011,,no,infiltrating ductal carcinoma,0,black or african american,not hispanic or latino
TCGA-4H-AAAK,No,Stage IIIA,0.0,Left,,,"Breast, NOS",,18371.0,"Lobular carcinoma, NOS",...,n2a,m0,female,2013,,no,infiltrating lobular carcinoma,4,white,not hispanic or latino


pam50
3        419
4        140
1        130
2         46
0         34
Name: count, dtype: int64
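
The integer codes above can be read back as subtype names by matching their counts against the PAM50 subtype counts listed in the dataset overview (LumA 419, LumB 140, Basal 130, Her2 46, Normal 34). The mapping below is inferred from those matching counts, not taken from the encoding step itself, so treat it as an assumption. `PAM50_LABELS` and `decode_pam50` are hypothetical names used only for this sketch.

```python
import pandas as pd

# Assumed mapping, inferred by matching the code counts to the subtype counts
# in the dataset overview; verify it against the encoding step before relying on it.
PAM50_LABELS = {0: "Normal", 1: "Basal", 2: "Her2", 3: "LumA", 4: "LumB"}


def decode_pam50(codes: pd.Series) -> pd.Series:
    """Map integer PAM50 codes to subtype names, failing loudly on unknown codes."""
    unknown = set(codes.dropna().unique()) - set(PAM50_LABELS)
    if unknown:
        raise ValueError(f"Unexpected PAM50 codes: {sorted(unknown)}")
    return codes.map(PAM50_LABELS)


# Toy example with made-up patient IDs.
toy = pd.Series([3, 3, 4, 1, 0], index=["P1", "P2", "P3", "P4", "P5"], name="pam50")
decoded = decode_pam50(toy)
print(decoded.value_counts())
```

In the notebook this would be applied to the label column, for example `decode_pam50(pam50["pam50"])`.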

In [6]:
# Set a common index name for all tables, then save them to CSV
X_meth.index.name = "patient"
X_rna.index.name = "patient"
X_mirna.index.name = "patient"
pam50.index.name = "patient"
clinical.index.name = "patient"

X_meth.to_csv(root / "meth.csv", index=True)
X_rna.to_csv(root / "rna.csv", index=True)
X_mirna.to_csv(root / "mirna.csv", index=True)
pam50.to_csv(root / "pam50.csv", index=True)
clinical.to_csv(root / "clinical.csv", index=True)

# Optional: Load the data we just saved to make sure it looks okay.

In [None]:
meth = pd.read_csv(root / "meth.csv", index_col=0)
rna = pd.read_csv(root / "rna.csv", index_col=0)
mirna = pd.read_csv(root / "mirna.csv", index_col=0)
pam50 = pd.read_csv(root / "pam50.csv", index_col=0)
clinical = pd.read_csv(root / "clinical.csv", index_col=0)

display(meth.head())
display(rna.head())
display(mirna.head())
display(clinical.head())
display(pam50.head())
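
Beyond eyeballing the first rows, a quick programmatic check can confirm that every reloaded table covers the same patients. The following is a minimal sketch under that assumption. `check_alignment` is a hypothetical helper, not part of BioNeuralNet, and the toy tables use made-up patient IDs.

```python
import pandas as pd


def check_alignment(frames: dict[str, pd.DataFrame]) -> None:
    """Raise if any table has duplicate patients or a different patient set than the first."""
    names = list(frames)
    reference = frames[names[0]].index
    for name in names:
        index = frames[name].index
        if index.has_duplicates:
            raise ValueError(f"{name}: duplicate patient IDs")
        if not index.sort_values().equals(reference.sort_values()):
            raise ValueError(f"{name}: patient set differs from {names[0]}")


# Toy tables; in the notebook one would pass
# {"meth": meth, "rna": rna, "mirna": mirna, "clinical": clinical, "pam50": pam50}.
ids = pd.Index(["P1", "P2", "P3"], name="patient")
toy = {
    "rna": pd.DataFrame({"g1": [1.0, 2.0, 3.0]}, index=ids),
    "pam50": pd.DataFrame({"pam50": [3, 4, 1]}, index=ids[::-1]),
}
check_alignment(toy)  # passes: same patients, row order may differ
```

The check compares sorted indices, so it tolerates a different row order but not missing, extra, or duplicated patients.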

# Easy Access via DatasetLoader

To make this data easier to work with, it is available through the `DatasetLoader` component. If you have additional pre-processed or raw datasets you would like included, feel free to reach out; we are happy to support adding new datasets to the platform.

In [8]:
from bioneuralnet.datasets import DatasetLoader

tcga_brca = DatasetLoader("brca")

print(f"TCGA BRCA dataset shape: {tcga_brca.shape}")
brca_meth = tcga_brca.data["meth"]
brca_rna = tcga_brca.data["rna"]
brca_mirna = tcga_brca.data["mirna"]
brca_clinical = tcga_brca.data["clinical"]
brca_pam50 = tcga_brca.data["pam50"]


TCGA BRCA dataset shape: {'mirna': (769, 503), 'pam50': (769, 1), 'clinical': (769, 118), 'meth': (769, 20106), 'rna': (769, 18321)}


In [None]:
from bioneuralnet.utils.preprocess import preprocess_clinical

# Print the shape of each table and the class distribution
print(f"RNA shape: {brca_rna.shape}")
print(f"METH shape: {brca_meth.shape}")
print(f"miRNA shape: {brca_mirna.shape}")
print(f"Clinical shape: {brca_clinical.shape}")
print(f"Phenotype shape: {brca_pam50.shape}")
print(f"Phenotype counts:\n{brca_pam50.value_counts()}")

# Report the global min and max of each omics table
for name, df in {"rna": brca_rna, "meth": brca_meth, "mirna": brca_mirna}.items():
    min_val = df.min().min()
    max_val = df.max().max()
    print(f"\n{name.upper()}:")
    print(f"Min: {min_val:.4f}")
    print(f"Max: {max_val:.4f}")

# Count NaNs in pam50
print(f"Nan values in pam50 {brca_pam50.isna().sum().sum()}")

# Drop unlabeled patients and align every table to the remaining ones
brca_pam50 = brca_pam50.dropna()
X_rna = brca_rna.loc[brca_pam50.index]
X_meth = brca_meth.loc[brca_pam50.index]
X_mirna = brca_mirna.loc[brca_pam50.index]
clinical = brca_clinical.loc[brca_pam50.index]

# For more details on the preprocessing function, see bioneuralnet.utils.preprocess
clinical = preprocess_clinical(
    clinical,
    brca_pam50,
    top_k=15,
    scale=True,
    ignore_columns=[
        "days_to_birth",
        "age_at_diagnosis",
        "days_to_last_followup",
        "age_at_index",
        "years_to_birth",
    ],
)
display(clinical.head())

2025-05-16 10:31:09,364 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-16 10:31:09,365 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 31384 NaNs after median imputation
2025-05-16 10:31:09,365 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 39 columns dropped due to zero variance


RNA shape: (769, 18321)
METH shape: (769, 20106)
miRNA shape: (769, 503)
Clinical shape: (769, 118)
Phenotype shape: (769, 1)
Phenotype counts:
pam50
3        419
4        140
1        130
2         46
0         34
Name: count, dtype: int64

RNA:
Min: -8.5873
Max: 20.9784

METH:
Min: -7.1642
Max: 6.9710

MIRNA:
Min: -4.4631
Max: 19.3838
Nan values in pam50 0


2025-05-16 10:31:09,752 - bioneuralnet.utils.preprocess - INFO - Selected top 15 features by RandomForest importance


patient,days_to_birth,age_at_diagnosis,days_to_last_followup,age_at_index,years_to_birth,year_of_diagnosis,number_of_lymph_nodes,date_of_initial_pathologic_diagnosis,histological_type_infiltrating lobular carcinoma,"primary_diagnosis_Lobular carcinoma, NOS",morphology_8520/3,race.1_white,days_to_death.1,laterality_Right,"primary_diagnosis_Infiltrating duct carcinoma, NOS",country_of_residence_at_enrollment_United States,"sites_of_involvement_Breast, NOS",days_to_death,race_white,ajcc_staging_system_edition_6th
TCGA-3C-AAAU,-20211.0,20211.0,4047.0,55.0,55.0,-1.5,1.5,-1.5,True,True,True,True,0.0,False,False,True,False,0.0,True,True
TCGA-3C-AALI,-18538.0,18538.0,4005.0,50.0,50.0,-1.75,0.0,-1.75,False,False,False,False,0.0,True,True,True,False,0.0,False,True
TCGA-3C-AALJ,-22848.0,22848.0,1474.0,62.0,62.0,0.25,0.0,0.25,False,False,False,False,0.0,True,True,True,True,0.0,False,False
TCGA-3C-AALK,-19074.0,19074.0,1448.0,52.0,52.0,0.25,-0.5,0.25,False,False,False,False,0.0,True,True,True,True,0.0,False,False
TCGA-4H-AAAK,-18371.0,18371.0,348.0,50.0,50.0,0.75,1.5,0.75,True,True,True,True,0.0,False,False,False,False,0.0,True,False


# Preparing Multi-Omics Data for downstream tasks

1. Check sample overlap.

2. Select top features.

    - Uses ANOVA F-test to select the most relevant features for classification from each omics dataset.

3. Combine datasets.

    - Selected features from RNA, methylation, and miRNA are combined into a single dataset.

4. Clean missing values.

    - Counts and removes any missing (nan) values from the combined dataset.

5. Build similarity graph.

    - Creates a k-nearest neighbors graph from the transposed feature matrix.

Note: For more details on preprocessing functions and graph generation algorithms, see the [Utils documentation](https://bioneuralnet.readthedocs.io/en/latest/utils.html)
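
Steps 2 and 5 above can be sketched with scikit-learn on synthetic data. This is an illustration of the general technique, not the implementation behind `top_anova_f_features` or `gen_similarity_graph`. The distance metric, the symmetrisation rule, and all sizes below are assumptions chosen for the example.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)

# Synthetic stand-in: 60 samples x 40 features, 3 balanced classes.
n_samples, n_features, top_k, k = 60, 40, 10, 5
y = np.repeat([0, 1, 2], n_samples // 3)
X = rng.normal(size=(n_samples, n_features))
X[:, :5] += y[:, None] * 2.0  # make the first 5 features class-dependent

# Step 2: rank features by ANOVA F-statistic and keep the top_k.
f_scores, _ = f_classif(X, y)
selected = np.argsort(f_scores)[::-1][:top_k]
X_sel = X[:, selected]

# Step 5: transpose so each feature becomes a node, then connect each node
# to its k nearest neighbours (Euclidean distance across samples).
knn = kneighbors_graph(X_sel.T, n_neighbors=k, mode="connectivity", include_self=False)
A = knn.toarray()
A = np.maximum(A, A.T)  # symmetrise: keep an edge if either endpoint chose it

print(X_sel.shape)  # (60, 10)
print(A.shape)      # (10, 10)
```

Transposing before building the graph is what makes the adjacency matrix feature-by-feature rather than patient-by-patient, which matches the square network shape reported later in this notebook.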

In [None]:
from sklearn.metrics import accuracy_score, f1_score
from bioneuralnet.utils.preprocess import top_anova_f_features
from bioneuralnet.utils.graph import gen_similarity_graph

# Count the samples each table shares with the PAM50 labels
print("Intersection of samples:")
print(f"RNA: {len(set(X_rna.index) & set(pam50.index))}")
print(f"METH: {len(set(X_meth.index) & set(pam50.index))}")
print(f"miRNA: {len(set(X_mirna.index) & set(pam50.index))}")
print(f"Clinical: {len(set(clinical.index) & set(pam50.index))}")

meth_sel = top_anova_f_features(X_meth, brca_pam50, max_features=1000, task="classification")
rna_sel = top_anova_f_features(X_rna, brca_pam50, max_features=1000, task="classification")
mirna_sel = top_anova_f_features(X_mirna, brca_pam50, max_features=503, task="classification")
X_train_full = pd.concat([meth_sel, rna_sel, mirna_sel], axis=1)

# Count NaNs in the combined table
print(f"Nan values in X_train_full: {X_train_full.isna().sum().sum()}")

# Drop any rows (patients) that still contain NaNs
X_train_full = X_train_full.dropna()

# Confirm that no NaNs remain
print(f"Nan value in X_train_full after dropping: {X_train_full.isna().sum().sum()}")

print(f"X_train_full shape: {X_train_full.shape}")
A_train = gen_similarity_graph(X_train_full.T, k=15)

print(f"Network shape: {A_train.shape}")

Intersection of samples:
RNA: 769
METH: 769
miRNA: 769
Clinical: 769


2025-05-16 10:31:12,677 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-16 10:31:12,678 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-16 10:31:12,678 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-16 10:31:12,835 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 17514 significant, 0 padded
2025-05-16 10:31:15,470 - bioneuralnet.utils.preprocess - INFO - [Inf]: Replaced 0 infinite values
2025-05-16 10:31:15,471 - bioneuralnet.utils.preprocess - INFO - [NaN]: Replaced 0 NaNs after median imputation
2025-05-16 10:31:15,471 - bioneuralnet.utils.preprocess - INFO - [Zero-Var]: 0 columns dropped due to zero variance
2025-05-16 10:31:15,635 - bioneuralnet.utils.preprocess - INFO - Selected 1000 features by ANOVA (task=classification), 16864 significant, 0 padded
2025-05-16 10:31:15,714 - bioneuralnet.utils.prepr

Nan values in X_train_full: 0
Nan value in X_train_full after dropping: 0
X_train_full shape: (769, 2503)
Network shape: (2503, 2503)


In [None]:
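from pathlib import Path

from bioneuralnet.downstream_task import DPMON

# Machine-specific output location; adjust to your environment.
save = Path("/home/vicente/Github/BioNeuralNet/TCGA_BRCA/results")
# Rename the label column to "phenotype" before passing it to DPMON.
brca_pam50 = brca_pam50.rename(columns={"pam50": "phenotype"})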

dpmon = DPMON(
    adjacency_matrix=A_train,
    omics_list=[meth_sel, rna_sel, mirna_sel],
    phenotype_data=brca_pam50,
    clinical_data=clinical,
    repeat_num=3,
    tune=True, gpu=True, cuda=0,
    output_dir=save / "results1",
)

predictions_df, avg_accuracy = dpmon.run()
actual = predictions_df["Actual"]
pred = predictions_df["Predicted"]
dp_acc = (accuracy_score(actual, pred), 0)
dp_f1w = (f1_score(actual, pred, average='weighted'), 0)
dp_f1m = (f1_score(actual, pred, average='macro'), 0)

print("DPMON results:")
print(f"Accuracy: {dp_acc[0]}")
print(f"F1 weighted: {dp_f1w[0]}")
print(f"F1 macro: {dp_f1m[0]}")

2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Output directory set to: /home/vicente/Github/BioNeuralNet/TCGA_BRCA/results/results1
2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Initialized DPMON with the provided parameters.
2025-05-16 15:22:49,397 - bioneuralnet.downstream_task.dpmon - INFO - Starting DPMON run.
2025-05-16 15:22:49,411 - bioneuralnet.downstream_task.dpmon - INFO - Running hyperparameter tuning for DPMON.
2025-05-16 15:22:49,412 - bioneuralnet.downstream_task.dpmon - INFO - Using GPU 0
2025-05-16 15:22:49,412 - bioneuralnet.downstream_task.dpmon - INFO - Slicing omics dataset based on network nodes.
2025-05-16 15:22:49,415 - bioneuralnet.downstream_task.dpmon - INFO - Building PyTorch Geometric Data object from adjacency matrix.
2025-05-16 15:22:49,487 - bioneuralnet.downstream_task.dpmon - INFO - Number of nodes in network: 2503
2025-05-16 15:22:49,487 - bioneuralnet.downstream_task.dpmon - INFO - Using clinical vars fo

DPMON results:
Accuracy: 0.9557867360208062
F1 weighted: 0.9360974742812752
F1 macro: 0.7772360237077294
