# 01 - Download and Organize Data

This notebook collates the RNA-seq gene count tables previously downloaded from the GDC portal, cleans a combined expression matrix, and saves the organized outputs under `data/processed/` for downstream analysis.


**Workflow overview**

1. Discover every `.tsv` expression file that lives inside the raw GDC download bundles (ignoring manifest `.txt` files).
2. Build a tidy manifest so we can trace each file back to its case identifier.
3. Load one expression column per sample (the helper defaults to raw `unstranded` counts; `tpm_unstranded` holds TPM), align the genes across all samples, and write a single expression matrix that other notebooks can use.


In [37]:
from __future__ import annotations

import json
import sys
from pathlib import Path

import pandas as pd

def find_project_root(start: Path) -> Path:
    for candidate in [start, *start.parents]:
        if (candidate / 'README.md').exists():
            return candidate
    raise FileNotFoundError('Unable to locate repository root (README.md not found)')



PROJECT_ROOT = find_project_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.config import ProjectConfig
from src.notebook_utils import (
    collect_expression_files,
    drop_missing_columns,
    flatten_metadata_column,
    summarize_missing,
)

config = ProjectConfig()
RAW_DATA_DIR = config.raw_data_dir / 'star gene counts'
CLINICAL_DATA_DIR = config.clinical_data_dir
PROCESSED_DATA_DIR = config.processed_data_dir
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

RAW_DATA_DIR, CLINICAL_DATA_DIR, PROCESSED_DATA_DIR


(PosixPath('/Users/lennonmccartney/Desktop/tcga-brca-multiomics-subtyping/data/raw/star gene counts'),
 PosixPath('/Users/lennonmccartney/Desktop/tcga-brca-multiomics-subtyping/data/raw/clinical data'),
 PosixPath('/Users/lennonmccartney/Desktop/tcga-brca-multiomics-subtyping/data/processed'))

## Discover downloaded expression bundles

Each download bundle from GDC sits in its own UUID-named directory. We walk the tree, capture every `.tsv` file (ignoring plain `.txt` documents such as manifests), and track minimal metadata for reproducibility.
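
`collect_expression_files` lives in `src.notebook_utils` and is not shown in this notebook. As a rough sketch of the traversal described above, assuming one bundle per UUID-named directory, a hypothetical stand-in (`discover_tsv_files`, not the project's actual helper) could look like this. The column names mirror the ones visible in the output of the next cell.

```python
from pathlib import Path

import pandas as pd


def discover_tsv_files(raw_dir: Path) -> pd.DataFrame:
    """Hypothetical stand-in: index every .tsv file below raw_dir."""
    rows = []
    # rglob matches only the .tsv suffix, so .txt manifests are skipped.
    for path in sorted(raw_dir.rglob("*.tsv")):
        rows.append(
            {
                "file_name": path.name,
                "bundle_id": path.parent.name,  # name of the enclosing bundle directory
                "path": path,
                "relative_path": str(path.relative_to(raw_dir)),
            }
        )
    return pd.DataFrame(rows, columns=["file_name", "bundle_id", "path", "relative_path"])
```

Sorting the paths keeps the index order stable between runs, which makes the manifest easier to diff.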


In [38]:
expression_index = collect_expression_files(RAW_DATA_DIR)
print(f'Discovered {len(expression_index)} expression TSV files.')
expression_index.head()


Discovered 1197 expression TSV files.


Unnamed: 0,file_name,bundle_id,path,relative_path
0,ba295155-272e-43eb-9d6a-e4c9c392e68b.rna_seq.a...,0019c951-16c5-48d0-85c8-58d96b12d330,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0019c951-16c5-48d0-85c8-58d96b12d330/ba295155-...
1,8d1641ea-7552-4d23-9298-094e0056386a.rna_seq.a...,0022cd20-f64f-4773-b9ff-a3de0b71b259,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0022cd20-f64f-4773-b9ff-a3de0b71b259/8d1641ea-...
2,2f51534b-248b-4999-bc3f-e42a2e98332e.rna_seq.a...,00469928-b243-4cae-acd7-134508e99ceb,/Users/lennonmccartney/Desktop/tcga-brca-multi...,00469928-b243-4cae-acd7-134508e99ceb/2f51534b-...
3,b321a3f5-043d-42c6-8c9d-5784d45cb85c.rna_seq.a...,0081f507-b104-4214-9ea1-31dd69130991,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0081f507-b104-4214-9ea1-31dd69130991/b321a3f5-...
4,cafc9e36-c5f0-45df-ad03-16210ff0d870.rna_seq.a...,0094f9d0-45ec-4aad-bca0-71c60bdd7113,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0094f9d0-45ec-4aad-bca0-71c60bdd7113/cafc9e36-...


In [39]:
# Attach GDC metadata (JSON) to each expression file
metadata_path = PROJECT_ROOT / 'data' / 'raw' / 'clinical data' / 'metadata.cart.2025-12-07.json'
with metadata_path.open() as fh:
    metadata_records = json.load(fh)
metadata_df = pd.json_normalize(metadata_records)

assoc_cols = ['entity_submitter_id', 'entity_type', 'case_id', 'entity_id']
assoc_df = flatten_metadata_column(metadata_df, 'associated_entities', assoc_cols)
input_cols = ['access', 'submitter_id', 'data_category', 'file_id', 'data_type', 'experimental_strategy']
input_df = flatten_metadata_column(metadata_df, 'analysis.input_files', input_cols)
metadata_filtered = assoc_df.merge(input_df, on='file_name', how='left')
expression_index = expression_index.merge(metadata_filtered, how='left', on='file_name')
print(f'Metadata-loaded expression manifest: {expression_index.shape[0]} rows, {expression_index.shape[1]} columns')
# Keep a working copy of the manifest; the biospecimen fields are merged in further down
expression_index_with_biospecimen = expression_index.copy()
expression_index_with_biospecimen.head()


Metadata-loaded expression manifest: 1197 rows, 14 columns


Unnamed: 0,file_name,bundle_id,path,relative_path,entity_submitter_id,entity_type,case_id,entity_id,access,submitter_id,data_category,file_id,data_type,experimental_strategy
0,ba295155-272e-43eb-9d6a-e4c9c392e68b.rna_seq.a...,0019c951-16c5-48d0-85c8-58d96b12d330,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0019c951-16c5-48d0-85c8-58d96b12d330/ba295155-...,TCGA-D8-A1XO-01A-11R-A14M-07,aliquot,6a186809-3422-41d0-83d2-867145830936,eddd285e-a946-4e4e-a21d-6e84ed397594,controlled,0e68b49d-a54d-41c3-a5f7-13fc141ab0e8,Sequencing Reads,cb7714fd-20e5-4ecf-9498-591e0c1fdcde,Aligned Reads,RNA-Seq
1,8d1641ea-7552-4d23-9298-094e0056386a.rna_seq.a...,0022cd20-f64f-4773-b9ff-a3de0b71b259,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0022cd20-f64f-4773-b9ff-a3de0b71b259/8d1641ea-...,TCGA-AN-A0FN-01A-11R-A034-07,aliquot,c2a742fe-3e8b-4210-85a6-7191a1123609,4aac7c40-db6b-4648-a82a-7bf397de913a,controlled,fd4c1bc5-6eab-4830-9178-afbab6a7a954,Sequencing Reads,786b6363-889b-4437-8830-eac79e5a8d5b,Aligned Reads,RNA-Seq
2,2f51534b-248b-4999-bc3f-e42a2e98332e.rna_seq.a...,00469928-b243-4cae-acd7-134508e99ceb,/Users/lennonmccartney/Desktop/tcga-brca-multi...,00469928-b243-4cae-acd7-134508e99ceb/2f51534b-...,TCGA-AC-A62X-01A-11R-A29R-07,aliquot,5b2a4f11-ca46-4974-9420-59b4820920bf,cdc523d2-da82-4a3d-a97e-9745c8a802d1,controlled,6920521e-f04a-4ae1-b161-df65804500b6,Sequencing Reads,2cd40ce3-5054-452e-bd21-854fdcd3646b,Aligned Reads,RNA-Seq
3,b321a3f5-043d-42c6-8c9d-5784d45cb85c.rna_seq.a...,0081f507-b104-4214-9ea1-31dd69130991,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0081f507-b104-4214-9ea1-31dd69130991/b321a3f5-...,TCGA-E2-A14U-01A-11R-A22K-07,aliquot,23b7aaea-1119-4b10-aa1a-0ae255d2f2a6,20d25e27-b509-45eb-a734-9b8d6a40c632,controlled,f8ed6c74-0202-4ac0-a581-b00c504e9e4b,Sequencing Reads,5d613c10-ab5c-4ab4-a42a-5f68de579ca3,Aligned Reads,RNA-Seq
4,cafc9e36-c5f0-45df-ad03-16210ff0d870.rna_seq.a...,0094f9d0-45ec-4aad-bca0-71c60bdd7113,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0094f9d0-45ec-4aad-bca0-71c60bdd7113/cafc9e36-...,TCGA-E9-A1R3-01A-31R-A14M-07,aliquot,4922cddc-575c-4b8a-8245-ce5f6876760c,678ba111-3b01-49b7-83d9-165ef0288a34,controlled,71af5a94-601d-4710-834a-ee8ac0265d20,Sequencing Reads,bcecf315-bca9-4518-ace3-dad827722905,Aligned Reads,RNA-Seq
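
`flatten_metadata_column` is also defined in `src.notebook_utils` and not shown here. A minimal sketch of what such a flattening step might do, assuming the column holds a list of dicts per file and that `file_name` is the join key, is given below. The function name `flatten_list_column` and the toy records are hypothetical.

```python
import pandas as pd


def flatten_list_column(df: pd.DataFrame, column: str, keep: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in: one output row per dict stored in df[column]."""
    # explode() turns each list element into its own row, repeating file_name.
    exploded = df[["file_name", column]].explode(column).dropna(subset=[column])
    nested = pd.json_normalize(exploded[column].tolist())
    nested.insert(0, "file_name", exploded["file_name"].to_numpy())
    # reindex keeps only the requested fields (missing ones become NaN columns).
    return nested.reindex(columns=["file_name", *keep])


# Toy input with made-up identifiers.
toy = pd.DataFrame(
    {
        "file_name": ["a.tsv", "b.tsv"],
        "associated_entities": [
            [{"entity_id": "aliq-1", "entity_type": "aliquot", "case_id": "case-1"}],
            [{"entity_id": "aliq-2", "entity_type": "aliquot", "case_id": "case-2"}],
        ],
    }
)
flat = flatten_list_column(toy, "associated_entities", ["entity_id", "case_id"])
print(flat)
```

A file with several entries in the list would yield several rows, and a left merge on `file_name` would then duplicate that file. Comparing the row count before and after the merge (1197 in both cases here) is a cheap guard against that.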


In [40]:
# expression_index[expression_index["entity_submitter_id"].str.contains("TCGA-E2-A1IU")]

## Load biospecimen cart metadata (JSON)


In [41]:
biospecimen_metadata_path = PROJECT_ROOT / 'data' / 'raw' / 'clinical data' / 'biospecimen.cart.2025-12-07.json'
with biospecimen_metadata_path.open() as fh:
    biospecimen_metadata_records = json.load(fh)
biospecimen_metadata_df = pd.json_normalize(biospecimen_metadata_records)
print(f'Biospecimen metadata rows: {len(biospecimen_metadata_df)}; columns: {biospecimen_metadata_df.shape[1]}')
biospecimen_metadata_df.head()

Biospecimen metadata rows: 1095; columns: 4


Unnamed: 0,case_id,submitter_id,samples,project.project_id
0,001cef41-ff86-4d3f-a140-a647ac4b10a1,TCGA-E2-A1IU,"[{'tumor_descriptor': 'Primary', 'specimen_typ...",TCGA-BRCA
1,0045349c-69d9-4306-a403-c9c1fa836644,TCGA-A1-A0SB,"[{'tumor_descriptor': 'Not Applicable', 'speci...",TCGA-BRCA
2,00807dae-9f4a-4fd1-aac2-82eb11bf2afb,TCGA-A2-A04W,"[{'tumor_descriptor': 'Not Applicable', 'speci...",TCGA-BRCA
3,00a2d166-78c9-4687-a195-3d6315c27574,TCGA-AN-A0AM,"[{'tumor_descriptor': 'Primary', 'sample_id': ...",TCGA-BRCA
4,00b11ca8-8540-4a3d-b602-ec754b00230b,TCGA-LL-A440,"[{'tumor_descriptor': 'Primary', 'specimen_typ...",TCGA-BRCA


## Inspect the nested samples, portions, analytes, aliquots, and slides for one biospecimen record


In [42]:
sample_record = biospecimen_metadata_df.iloc[0]
samples_expanded = pd.json_normalize(sample_record['samples'])
print('Samples columns:', samples_expanded.columns.tolist())
sample_with_portions = samples_expanded.iloc[2]
portions_expanded = pd.json_normalize(sample_with_portions['portions'])
print('Portions columns:', portions_expanded.columns.tolist())
portion_with_analytes = portions_expanded.iloc[0]
analytes_expanded = pd.json_normalize(portion_with_analytes.get('analytes', []))
print('Analyte columns:', analytes_expanded.columns.tolist())
slides_expanded = pd.json_normalize(portion_with_analytes.get('slides', []))
print('Slides columns:', slides_expanded.columns.tolist())
aliquots_with_analytes = analytes_expanded.iloc[0]
aliquots_expanded = pd.json_normalize(aliquots_with_analytes.get('aliquots', []))
print('Aliquots columns:', aliquots_expanded.columns.tolist())

display(samples_expanded.head())
display(portions_expanded.head())
display(analytes_expanded.head())
display(slides_expanded.head())
display(aliquots_expanded.head())



Samples columns: ['tumor_descriptor', 'specimen_type', 'days_to_sample_procurement', 'updated_datetime', 'sample_id', 'submitter_id', 'state', 'preservation_method', 'sample_type', 'tissue_type', 'created_datetime', 'portions', 'pathology_report_uuid', 'days_to_collection', 'initial_weight']
Portions columns: ['analytes', 'portion_id', 'slides', 'updated_datetime', 'creation_datetime', 'submitter_id', 'weight', 'portion_number', 'state', 'is_ffpe']
Analyte columns: ['analyte_id', 'experimental_protocol_type', 'aliquots', 'analyte_type', 'updated_datetime', 'normal_tumor_genotype_snp_match', 'spectrophotometer_method', 'submitter_id', 'concentration', 'state', 'a260_a280_ratio']
Slides columns: ['percent_stromal_cells', 'submitter_id', 'section_location', 'percent_tumor_cells', 'slide_id', 'percent_neutrophil_infiltration', 'percent_lymphocyte_infiltration', 'updated_datetime', 'percent_necrosis', 'percent_monocyte_infiltration', 'percent_normal_cells', 'state', 'percent_tumor_nuclei']


Unnamed: 0,tumor_descriptor,specimen_type,days_to_sample_procurement,updated_datetime,sample_id,submitter_id,state,preservation_method,sample_type,tissue_type,created_datetime,portions,pathology_report_uuid,days_to_collection,initial_weight
0,Primary,Solid Tissue,0.0,2023-11-08T11:17:33.420637-06:00,81ef5a2d-dcb8-49fa-a9ef-c5520d703e18,TCGA-E2-A1IU-01Z,released,FFPE,Primary Tumor,Tumor,2018-05-17T12:10:31.746556-05:00,"[{'analytes': [], 'portion_id': '5458dfd2-8926...",,,
1,Primary,Solid Tissue,,2023-11-08T11:10:15.917339-06:00,92058c44-a484-4e08-b1fe-dfe2f03a0aa1,TCGA-E2-A1IU-01A,released,Unknown,Primary Tumor,Tumor,,"[{'analytes': [], 'portion_id': '19429156-19ad...",ACE20EC5-2822-4F50-949E-E194C59F0FCD,136.0,270.0
2,Not Applicable,Solid Tissue,,2023-05-05T14:01:40.186650-05:00,c8e39abb-9027-4636-84b9-7bc03a8d4e49,TCGA-E2-A1IU-11A,released,Unknown,Solid Tissue Normal,Normal,,[{'analytes': [{'analyte_id': 'b60646e4-b524-4...,,136.0,340.0


Unnamed: 0,analytes,portion_id,slides,updated_datetime,creation_datetime,submitter_id,weight,portion_number,state,is_ffpe
0,[{'analyte_id': 'b60646e4-b524-4ad7-8cc5-359d5...,533c254b-0d13-45fb-b9b8-07b4324a45c6,"[{'percent_stromal_cells': 10.0, 'submitter_id...",2018-09-06T13:49:20.245333-05:00,1300666000.0,TCGA-E2-A1IU-11A-61,210.0,61,released,False


Unnamed: 0,analyte_id,experimental_protocol_type,aliquots,analyte_type,updated_datetime,normal_tumor_genotype_snp_match,spectrophotometer_method,submitter_id,concentration,state,a260_a280_ratio
0,b60646e4-b524-4ad7-8cc5-359d598e40db,aDNA Preparation Type,"[{'aliquot_quantity': 1.95, 'aliquot_id': '296...",DNA,2022-04-29T01:58:33.546762-05:00,Yes,UV Spec,TCGA-E2-A1IU-11A-61D,0.15,released,2.11
1,eec40dee-2d51-4990-809e-12f9edc472c6,Repli-G,"[{'aliquot_quantity': 40.0, 'aliquot_id': '3cf...",Repli-G (Qiagen) DNA,2022-04-29T01:58:40.400110-05:00,Yes,,TCGA-E2-A1IU-11A-61W,,released,


Unnamed: 0,percent_stromal_cells,submitter_id,section_location,percent_tumor_cells,slide_id,percent_neutrophil_infiltration,percent_lymphocyte_infiltration,updated_datetime,percent_necrosis,percent_monocyte_infiltration,percent_normal_cells,state,percent_tumor_nuclei
0,10.0,TCGA-E2-A1IU-11A-01-TSA,TOP,0.0,1c0aabbc-d315-4571-ada6-4d23b691441e,40.0,40.0,2018-09-06T13:49:20.245333-05:00,0.0,10.0,90.0,released,0.0
1,10.0,TCGA-E2-A1IU-11A-03-TSC,TOP,0.0,1e0a5cc1-d90d-4b19-9f96-93f67139fdb7,20.0,20.0,2018-09-06T13:49:20.245333-05:00,10.0,50.0,80.0,released,0.0
2,,TCGA-E2-A1IU-11A-05-TSE,TOP,0.0,3bc5b9f2-5fd3-4a00-a8f6-9cf98ab35113,0.0,0.0,2018-09-06T13:49:20.245333-05:00,,0.0,100.0,released,
3,,TCGA-E2-A1IU-11A-06-TSF,TOP,0.0,96609ca3-27d5-47b7-a3b2-9f658b24a26b,0.0,0.0,2018-09-06T13:49:20.245333-05:00,,0.0,100.0,released,
4,5.0,TCGA-E2-A1IU-11A-02-TSB,TOP,0.0,b1e1d4f0-7a97-4a67-a15f-667ba0454e31,20.0,20.0,2018-09-06T13:49:20.245333-05:00,15.0,50.0,80.0,released,0.0


Unnamed: 0,aliquot_quantity,aliquot_id,source_center,updated_datetime,submitter_id,concentration,state,aliquot_volume,center.code,center.center_id,center.name,center.namespace,center.short_name,center.center_type
0,1.95,29649014-93dc-43a1-921b-7fef21993d35,23,2022-04-29T01:58:34.961222-05:00,TCGA-E2-A1IU-11A-61D-A14E-02,0.15,released,13.0,2,c8611490-4cbd-5651-8de2-64484a515eec,Harvard Medical School,hms.harvard.edu,HMS,CGCC
1,4.27,48e5f4e0-7f51-47f6-a508-347ed3638a6d,23,2022-04-29T01:58:38.593782-05:00,TCGA-E2-A1IU-11A-61D-A14G-09,0.08,released,53.4,9,956ca84c-1124-53ff-824f-fa0c84425425,Washington University School of Medicine,genome.wustl.edu,WUSM,GSC
2,1.0,4b2073cb-62fe-41f3-baa1-200fc865871b,23,2022-04-29T01:58:36.696508-05:00,TCGA-E2-A1IU-11A-61D-A14F-01,0.15,released,6.67,1,5069ce55-a23f-57c4-a28c-70a3c3cb0e4c,Broad Institute of MIT and Harvard,broad.mit.edu,BI,CGCC


## Build aliquot-level manifest by traversing biospecimen metadata

In [43]:
# Build aliquot-level manifest by traversing biospecimen metadata
def build_aliquot_manifest(records: list[dict[str, object]]) -> pd.DataFrame:
    '''Return one row per aliquot, carrying its case, sample, portion, and analyte fields.'''
    sample_keys = [
        'submitter_id', 'sample_id', 'sample_type', 'specimen_type', 'tumor_descriptor',
        'preservation_method', 'tissue_type', 'days_to_sample_procurement', 'days_to_collection',
    ]
    analyte_keys = [
        'analyte_id', 'submitter_id', 'analyte_type', 'experimental_protocol_type',
        'normal_tumor_genotype_snp_match',
    ]
    aliquot_keys = [
        'submitter_id', 'aliquot_id', 'aliquot_quantity', 'aliquot_volume', 'concentration',
        'source_center', 'center', 'state',
    ]
    rows: list[dict[str, object]] = []
    for record in records:
        case_base = {
            'case_id': record.get('case_id'),
            'case_submitter_id': record.get('submitter_id'),
            'project_id': (record.get('project') or {}).get('project_id'),
            'disease_type': record.get('disease_type'),
            'primary_site': record.get('primary_site'),
        }
        for sample in record.get('samples', []):
            sample_info = {k: sample.get(k) for k in sample_keys}
            for portion in sample.get('portions', []):
                portion_info = {'portion_id': portion.get('portion_id')}
                for analyte in portion.get('analytes', []):
                    analyte_info = {k: analyte.get(k) for k in analyte_keys}
                    for aliquot in analyte.get('aliquots', []):
                        aliquot_info = {k: aliquot.get(k) for k in aliquot_keys}
                        # Later dicts win on duplicate keys, so `submitter_id` ends up holding the
                        # aliquot barcode; the sample and analyte submitter IDs are overwritten.
                        rows.append({**case_base, **sample_info, **portion_info, **analyte_info, **aliquot_info})
    return pd.DataFrame(rows)

aliquot_manifest = build_aliquot_manifest(biospecimen_metadata_records)
print(f'Aliquot-level manifest rows: {len(aliquot_manifest)}')
aliquot_manifest.head()


Aliquot-level manifest rows: 16441


Unnamed: 0,case_id,case_submitter_id,project_id,disease_type,primary_site,submitter_id,sample_id,sample_type,specimen_type,tumor_descriptor,...,analyte_type,experimental_protocol_type,normal_tumor_genotype_snp_match,aliquot_id,aliquot_quantity,aliquot_volume,concentration,source_center,center,state
0,001cef41-ff86-4d3f-a140-a647ac4b10a1,TCGA-E2-A1IU,TCGA-BRCA,,,TCGA-E2-A1IU-01A-11W-A16H-09,92058c44-a484-4e08-b1fe-dfe2f03a0aa1,Primary Tumor,Solid Tissue,Primary,...,Repli-G (Qiagen) DNA,Repli-G,Yes,a030f906-bc29-4230-af43-9aef58a3f539,40.0,80.0,0.5,23,"{'code': '09', 'center_id': '956ca84c-1124-53f...",released
1,001cef41-ff86-4d3f-a140-a647ac4b10a1,TCGA-E2-A1IU,TCGA-BRCA,,,TCGA-E2-A1IU-01A-11D-A14H-05,92058c44-a484-4e08-b1fe-dfe2f03a0aa1,Primary Tumor,Solid Tissue,Primary,...,DNA,aDNA Preparation Type,Yes,180ccad3-0279-4dea-a595-f154f85abea2,4.27,26.7,0.16,23,"{'code': '05', 'center_id': '7ef3885b-37ce-5e1...",released
2,001cef41-ff86-4d3f-a140-a647ac4b10a1,TCGA-E2-A1IU,TCGA-BRCA,,,TCGA-E2-A1IU-01A-11D-A14F-01,92058c44-a484-4e08-b1fe-dfe2f03a0aa1,Primary Tumor,Solid Tissue,Primary,...,DNA,aDNA Preparation Type,Yes,34a6f1be-b1db-4659-b620-6e21e0f73b2b,1.07,6.67,0.16,23,"{'code': '01', 'center_id': '5069ce55-a23f-57c...",released
3,001cef41-ff86-4d3f-a140-a647ac4b10a1,TCGA-E2-A1IU,TCGA-BRCA,,,TCGA-E2-A1IU-01A-11D-A14G-09,92058c44-a484-4e08-b1fe-dfe2f03a0aa1,Primary Tumor,Solid Tissue,Primary,...,DNA,aDNA Preparation Type,Yes,7fcd5fda-8155-4b48-afb9-9e7958627113,4.27,53.4,0.08,23,"{'code': '09', 'center_id': '956ca84c-1124-53f...",released
4,001cef41-ff86-4d3f-a140-a647ac4b10a1,TCGA-E2-A1IU,TCGA-BRCA,,,TCGA-E2-A1IU-01A-11D-A14E-02,92058c44-a484-4e08-b1fe-dfe2f03a0aa1,Primary Tumor,Solid Tissue,Primary,...,DNA,aDNA Preparation Type,Yes,aa6390ba-ac10-480e-888e-da7005c9af1b,2.08,13.0,0.16,23,"{'code': '02', 'center_id': 'c8611490-4cbd-565...",released
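
As a cross-check on the explicit loop, the same sample → portion → analyte → aliquot walk can be sketched with `pd.json_normalize` and a multi-level `record_path`. The toy record below is made up and far smaller than a real GDC biospecimen entry. Note that `json_normalize` expects every record to contain the full path, so the loop with `.get(..., [])` is the more forgiving option when some levels are absent.

```python
import pandas as pd

# Made-up record that mimics the nesting of the biospecimen JSON.
records = [
    {
        "case_id": "case-1",
        "samples": [
            {
                "sample_type": "Primary Tumor",
                "portions": [
                    {
                        "analytes": [
                            {
                                "analyte_type": "RNA",
                                "aliquots": [
                                    {"aliquot_id": "aliq-1"},
                                    {"aliquot_id": "aliq-2"},
                                ],
                            }
                        ]
                    }
                ],
            }
        ],
    }
]

flat = pd.json_normalize(
    records,
    record_path=["samples", "portions", "analytes", "aliquots"],
    meta=[
        "case_id",
        ["samples", "sample_type"],
        ["samples", "portions", "analytes", "analyte_type"],
    ],
)
print(flat)
```

Each aliquot becomes one row, and the `meta` fields are repeated from the enclosing levels with dotted column names such as `samples.sample_type`.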


In [44]:
aliquot_manifest.shape

(16441, 26)

In [45]:
# # Load and inspect clinical cart metadata
# clinical_metadata_path = PROJECT_ROOT / 'data' / 'raw' / 'clinical data' / 'clinical.cart.2025-12-07.json'
# with clinical_metadata_path.open() as fh:
#     clinical_metadata_records = json.load(fh)
# clinical_metadata_df = pd.json_normalize(clinical_metadata_records)
# print(f'Clinical metadata rows: {len(clinical_metadata_df)}; columns: {clinical_metadata_df.shape[1]}')
# clinical_metadata_df.head()

In [46]:
# ## Look at a case with multiple diagnoses
# clinical_entry = next((rec for rec in clinical_metadata_records if rec.get('submitter_id') == 'TCGA-A2-A04P'), clinical_metadata_records[0])
# diagnoses_expanded = pd.json_normalize(clinical_entry.get('diagnoses', []))
# followups_expanded = pd.json_normalize(clinical_entry.get('follow_ups', []))
# demographic_expanded = pd.json_normalize([clinical_entry.get('demographic', {})])
# treatments_expanded = pd.json_normalize(diagnoses_expanded.iloc[0].get('treatments', [])) if not diagnoses_expanded.empty else pd.DataFrame()
# if not followups_expanded.empty:
#     first_followup = followups_expanded.iloc[0].get('molecular_tests', [])
#     molecular_tests_expanded = pd.json_normalize(first_followup) if isinstance(first_followup, list) else pd.DataFrame()
# else:
#     molecular_tests_expanded = pd.DataFrame()
# print('Top-level keys:', list(clinical_entry.keys()))
# print('Diagnoses columns:', diagnoses_expanded.columns.tolist())
# print('Follow-up columns:', followups_expanded.columns.tolist())
# print('Treatments columns:', treatments_expanded.columns.tolist())
# print('Molecular test columns:', molecular_tests_expanded.columns.tolist())
# display(pd.json_normalize([clinical_entry]))
# display(diagnoses_expanded.head())
# display(treatments_expanded.head())
# display(followups_expanded.head())
# display(molecular_tests_expanded.head())
# display(demographic_expanded.head())


In [47]:
# # Build diagnosis-level manifest from clinical cart metadata
# def build_diagnosis_manifest(records: list[dict[str, object]]) -> pd.DataFrame:
#     rows: list[dict[str, object]] = []
#     for record in records:
#         case_base = {
#             'case_id': record.get('case_id'),
#             'case_submitter_id': record.get('submitter_id'),
#             'project_id': (record.get('project') or {}).get('project_id'),
#             'disease_type': record.get('disease_type'),
#             'primary_site': record.get('primary_site'),
#         }
#         diagnoses = record.get('diagnoses', [])
#         if not diagnoses:
#             continue
#         latest = max(
#             diagnoses,
#             key=lambda d: (
#                 d.get('year_of_diagnosis') or -1,
#                 d.get('age_at_diagnosis') or -1,
#                 d.get('days_to_last_follow_up') or -1,
#             ),
#         )
#         diag_row = {
#             'diagnosis_id': latest.get('diagnosis_id'),
#             'diagnosis_submitter_id': latest.get('submitter_id'),
#             'morphology': latest.get('morphology'),
#             'primary_diagnosis': latest.get('primary_diagnosis'),
#             'ajcc_pathologic_stage': latest.get('ajcc_pathologic_stage'),
#             'tumor_stage': latest.get('tumor_stage'),
#             'year_of_diagnosis': latest.get('year_of_diagnosis'),
#             'age_at_diagnosis': latest.get('age_at_diagnosis'),
#         }
#         diag_row.update(case_base)
#         rows.append(diag_row)
#     return pd.DataFrame(rows)

# diagnosis_manifest = build_diagnosis_manifest(clinical_metadata_records)
# print(f'Diagnosis-level manifest rows: {len(diagnosis_manifest)}')
# diagnosis_manifest.head()

In [48]:
# diagnosis_manifest.shape

In [49]:
# diagnosis_manifest['case_submitter_id'].value_counts().sort_values(ascending=False)

In [50]:
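expression_index_with_biospecimen = expression_index.merge(
    aliquot_manifest,
    how='inner',  # pandas default, stated explicitly: files without a matching aliquot are dropped
    left_on='entity_id',
    right_on='aliquot_id',
    suffixes=('', '_biospecimen'),
)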
expression_index_with_biospecimen

Unnamed: 0,file_name,bundle_id,path,relative_path,entity_submitter_id,entity_type,case_id,entity_id,access,submitter_id,...,analyte_type,experimental_protocol_type,normal_tumor_genotype_snp_match,aliquot_id,aliquot_quantity,aliquot_volume,concentration,source_center,center,state
0,ba295155-272e-43eb-9d6a-e4c9c392e68b.rna_seq.a...,0019c951-16c5-48d0-85c8-58d96b12d330,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0019c951-16c5-48d0-85c8-58d96b12d330/ba295155-...,TCGA-D8-A1XO-01A-11R-A14M-07,aliquot,6a186809-3422-41d0-83d2-867145830936,eddd285e-a946-4e4e-a21d-6e84ed397594,controlled,0e68b49d-a54d-41c3-a5f7-13fc141ab0e8,...,RNA,mirVana (Allprep DNA) RNA,Yes,eddd285e-a946-4e4e-a21d-6e84ed397594,4.27,26.7,0.16,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released
1,8d1641ea-7552-4d23-9298-094e0056386a.rna_seq.a...,0022cd20-f64f-4773-b9ff-a3de0b71b259,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0022cd20-f64f-4773-b9ff-a3de0b71b259/8d1641ea-...,TCGA-AN-A0FN-01A-11R-A034-07,aliquot,c2a742fe-3e8b-4210-85a6-7191a1123609,4aac7c40-db6b-4648-a82a-7bf397de913a,controlled,fd4c1bc5-6eab-4830-9178-afbab6a7a954,...,RNA,mirVana (Allprep DNA) RNA,Yes,4aac7c40-db6b-4648-a82a-7bf397de913a,4.27,26.7,0.16,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released
2,2f51534b-248b-4999-bc3f-e42a2e98332e.rna_seq.a...,00469928-b243-4cae-acd7-134508e99ceb,/Users/lennonmccartney/Desktop/tcga-brca-multi...,00469928-b243-4cae-acd7-134508e99ceb/2f51534b-...,TCGA-AC-A62X-01A-11R-A29R-07,aliquot,5b2a4f11-ca46-4974-9420-59b4820920bf,cdc523d2-da82-4a3d-a97e-9745c8a802d1,controlled,6920521e-f04a-4ae1-b161-df65804500b6,...,RNA,mirVana (Allprep DNA) RNA,Yes,cdc523d2-da82-4a3d-a97e-9745c8a802d1,1.95,13.0,0.15,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released
3,b321a3f5-043d-42c6-8c9d-5784d45cb85c.rna_seq.a...,0081f507-b104-4214-9ea1-31dd69130991,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0081f507-b104-4214-9ea1-31dd69130991/b321a3f5-...,TCGA-E2-A14U-01A-11R-A22K-07,aliquot,23b7aaea-1119-4b10-aa1a-0ae255d2f2a6,20d25e27-b509-45eb-a734-9b8d6a40c632,controlled,f8ed6c74-0202-4ac0-a581-b00c504e9e4b,...,RNA,mirVana (Allprep DNA) RNA,Yes,20d25e27-b509-45eb-a734-9b8d6a40c632,2.08,13.0,0.16,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released
4,cafc9e36-c5f0-45df-ad03-16210ff0d870.rna_seq.a...,0094f9d0-45ec-4aad-bca0-71c60bdd7113,/Users/lennonmccartney/Desktop/tcga-brca-multi...,0094f9d0-45ec-4aad-bca0-71c60bdd7113/cafc9e36-...,TCGA-E9-A1R3-01A-31R-A14M-07,aliquot,4922cddc-575c-4b8a-8245-ce5f6876760c,678ba111-3b01-49b7-83d9-165ef0288a34,controlled,71af5a94-601d-4710-834a-ee8ac0265d20,...,RNA,mirVana (Allprep DNA) RNA,Yes,678ba111-3b01-49b7-83d9-165ef0288a34,4.27,26.7,0.16,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1192,967f7008-e212-4114-ab43-dc2a6295f80c.rna_seq.a...,fed5e108-cb1a-45b8-88e8-c5ea6a6bd61a,/Users/lennonmccartney/Desktop/tcga-brca-multi...,fed5e108-cb1a-45b8-88e8-c5ea6a6bd61a/967f7008-...,TCGA-AR-A252-01A-11R-A169-07,aliquot,4d0fec97-e024-4608-a0cc-426a3decc7b1,f325c7a8-46b0-4d75-9228-db391c270c71,controlled,503b018d-abc4-4c8a-96f5-040872dec7fc,...,RNA,mirVana (Allprep DNA) RNA,Yes,f325c7a8-46b0-4d75-9228-db391c270c71,4.01,26.7,0.15,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released
1193,deb7967f-9339-4bb1-ae0b-81a72a472bba.rna_seq.a...,ff1f2f31-8607-4627-b487-c5e39e7c30f5,/Users/lennonmccartney/Desktop/tcga-brca-multi...,ff1f2f31-8607-4627-b487-c5e39e7c30f5/deb7967f-...,TCGA-BH-A18G-01A-11R-A12D-07,aliquot,747083ff-0703-431b-aad4-f2adff739516,a8a7a335-8e21-4f24-b48b-8764caa21329,controlled,e7133d35-e25f-4f94-90f2-b4d8d7e8f313,...,RNA,mirVana (Allprep DNA) RNA,Yes,a8a7a335-8e21-4f24-b48b-8764caa21329,4.54,26.7,0.17,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released
1194,574d0a5f-8cb7-4783-8d5e-b07c1b3460dc.rna_seq.a...,ff570f9a-a252-496d-a452-344063851a7b,/Users/lennonmccartney/Desktop/tcga-brca-multi...,ff570f9a-a252-496d-a452-344063851a7b/574d0a5f-...,TCGA-AC-A2QH-01A-11R-A18M-07,aliquot,67c5f371-3fa9-47c5-8b15-c2dd9acc8519,d848ad65-9261-4854-a6bd-e76484c6a337,controlled,196d3065-2d3f-43a0-a762-c6985eaf2c0a,...,RNA,mirVana (Allprep DNA) RNA,Yes,d848ad65-9261-4854-a6bd-e76484c6a337,2.08,13.0,0.16,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released
1195,64b12ba7-a481-4fdb-9c74-38c94c7ef3c9.rna_seq.a...,ff5f8ada-17c5-497e-9182-63a05e3ab4c5,/Users/lennonmccartney/Desktop/tcga-brca-multi...,ff5f8ada-17c5-497e-9182-63a05e3ab4c5/64b12ba7-...,TCGA-OL-A5D6-01A-21R-A27Q-07,aliquot,c3a981c7-f148-4252-bd50-af8a49ec0df8,5657c6b6-ba04-4dd8-8623-2a3834820f7d,controlled,67e82e85-0160-4e64-9f99-b580fe71d60b,...,RNA,mirVana (Allprep DNA) RNA,Yes,5657c6b6-ba04-4dd8-8623-2a3834820f7d,2.21,13.0,0.17,23,"{'code': '07', 'center_id': 'ee7a85b3-8177-5d6...",released


In [51]:
# expression_index_with_biospecimen_and_clinical = expression_index_with_biospecimen.merge(diagnosis_manifest, left_on='')
# summarize_missing(expression_index_with_biospecimen)

In [52]:
expression_index_with_biospecimen = drop_missing_columns(expression_index_with_biospecimen, 50)

## Sanity check: `experimental_protocol_type` after the join

In [53]:
expression_index_with_biospecimen[['experimental_protocol_type','tumor_descriptor']].value_counts()

experimental_protocol_type             tumor_descriptor
mirVana (Allprep DNA) RNA              Primary             1061
                                       Not Applicable       103
HighPure miRNA (Allprep DNA) FFPE RNA  Primary               12
Allprep RNA Extraction                 Primary                6
mirVana (Allprep DNA) RNA              Metastatic             6
Allprep RNA Extraction                 Not Applicable         4
Name: count, dtype: int64
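
The counts above add up to 1,192, fewer than the 1,197 discovered files. `value_counts` skips rows with missing values by default, and an inner join drops unmatched files, so either could account for the gap. One way to see which files fail to match is a left merge with `indicator=True`, sketched here on made-up toy frames. `validate="many_to_one"` additionally raises if an aliquot ID appears more than once on the right-hand side.

```python
import pandas as pd

# Toy frames with made-up identifiers; "aliq-9" has no biospecimen entry.
files = pd.DataFrame(
    {"file_name": ["a.tsv", "b.tsv", "c.tsv"], "entity_id": ["aliq-1", "aliq-2", "aliq-9"]}
)
aliquots = pd.DataFrame(
    {"aliquot_id": ["aliq-1", "aliq-2", "aliq-3"], "analyte_type": ["RNA", "RNA", "DNA"]}
)

checked = files.merge(
    aliquots,
    left_on="entity_id",
    right_on="aliquot_id",
    how="left",
    validate="many_to_one",  # each file should map to at most one aliquot row
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)
unmatched = checked.loc[checked["_merge"] == "left_only", "file_name"].tolist()
print(unmatched)
```

Passing `dropna=False` to `value_counts` is the complementary check for rows that matched but carry missing fields.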

## Helper to read a single expression table

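We only need the `gene_id` column plus one quantitative column. The helper below defaults to the raw `unstranded` counts; pass `value_column="tpm_unstranded"` to load TPM values instead. It returns one `pd.Series` per sample, named after the aliquot `entity_id` when the file is found in the manifest, and removes the technical summary rows (the `N_*` counters at the top of each file).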


In [54]:
from pathlib import Path


def read_expression_table(path: Path, value_column: str = "unstranded") -> pd.Series:
    '''Load one RNA-seq table and return the chosen value column indexed by gene ID.'''

    df = pd.read_csv(
        path,
        sep="\t",
        comment="#",
        usecols=["gene_id", value_column],
        dtype={value_column: "float32"},
    ).dropna(subset=["gene_id"])
    series = df.set_index("gene_id")[value_column]
    # Drop the technical summary rows (gene_id values starting with "N_").
    series = series[~series.index.str.startswith("N_")]

    # Name the series after the aliquot entity_id when the manifest knows this file;
    # otherwise fall back to the leading UUID of the file name.
    file_name = Path(path).name
    series.name = file_name.split('.')[0]
    if 'expression_index_with_biospecimen' in globals():
        match = expression_index_with_biospecimen.loc[
            expression_index_with_biospecimen['file_name'] == file_name, 'entity_id'
        ]
        if not match.empty:
            series.name = match.iloc[0]
    return series


# Quick sanity check on the first file
first_sample = read_expression_table(Path(expression_index.loc[0, "path"]))
first_sample.head()


gene_id
ENSG00000000003.15    4370.0
ENSG00000000005.6        7.0
ENSG00000000419.13    2625.0
ENSG00000000457.14    3005.0
ENSG00000000460.17    1578.0
Name: eddd285e-a946-4e4e-a21d-6e84ed397594, dtype: float32
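
The values above are integers because the helper's default column is `unstranded`. To illustrate the same parsing steps for the TPM column without touching the real files, here is a self-contained sketch on a made-up miniature table. The layout is an assumption modelled on the description above (a `#` comment line, `N_*` summary rows, then genes), and all numbers are illustrative only.

```python
import io

import pandas as pd

# Made-up miniature of a gene-counts table; values are not real measurements.
toy_tsv = (
    "# gene-model: GENCODE v36\n"
    "gene_id\tgene_name\tunstranded\ttpm_unstranded\n"
    "N_unmapped\t\t100\t\n"
    "N_multimapping\t\t50\t\n"
    "ENSG00000000003.15\tTSPAN6\t4370\t35.5\n"
    "ENSG00000000005.6\tTNMD\t7\t0.25\n"
)

df = pd.read_csv(
    io.StringIO(toy_tsv),
    sep="\t",
    comment="#",  # skips the leading comment line
    usecols=["gene_id", "tpm_unstranded"],
    dtype={"tpm_unstranded": "float32"},
)
tpm = df.set_index("gene_id")["tpm_unstranded"]
# The summary rows have no TPM value; drop them by their "N_" prefix.
tpm = tpm[~tpm.index.str.startswith("N_")]
print(tpm)
```

With the notebook's helper, the equivalent call would be `read_expression_table(path, value_column="tpm_unstranded")`.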

## Build the combined expression matrix

We iterate through every discovered file, ensure the genes line up across samples, and concatenate the resulting vectors column-wise. Progress messages every 100 samples make it easier to monitor long runs.
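
The alignment step can be illustrated on two toy samples. `reindex` reorders a sample to the reference gene order and silently fills any gene the file lacks with `NaN`, so counting missing values per sample after concatenation is a useful follow-up check. The gene and sample names here are made up.

```python
import pandas as pd

# Reference sample and a second sample with a different order and one gene missing.
ref = pd.Series([1.0, 2.0, 3.0], index=["g1", "g2", "g3"], name="s1")
shuffled = pd.Series([30.0, 10.0], index=["g3", "g1"], name="s2")

aligned = shuffled.reindex(ref.index)       # same order as ref; "g2" becomes NaN
matrix = pd.concat([ref, aligned], axis=1)  # genes as rows, samples as columns
missing_per_sample = matrix.isna().sum()
print(matrix)
print(missing_per_sample)
```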


In [55]:
from typing import Optional

expression_series: list[pd.Series] = []
gene_index: Optional[pd.Index] = None

total_files = len(expression_index)
for idx, path in enumerate(expression_index["path"], start=1):
    series = read_expression_table(Path(path))
    if gene_index is None:
        gene_index = series.index
    elif not series.index.equals(gene_index):
        # Align to the reference order if a file arrives with an unexpected layout.
        series = series.reindex(gene_index)
    expression_series.append(series)
    if idx % 100 == 0 or idx == total_files:
        print(f"Loaded {idx}/{total_files} samples")

expression_matrix = pd.concat(expression_series, axis=1)
expression_matrix.head()

Loaded 100/1197 samples
Loaded 200/1197 samples
Loaded 300/1197 samples
Loaded 400/1197 samples
Loaded 500/1197 samples
Loaded 600/1197 samples
Loaded 700/1197 samples
Loaded 800/1197 samples
Loaded 900/1197 samples
Loaded 1000/1197 samples
Loaded 1100/1197 samples
Loaded 1197/1197 samples


gene_id,eddd285e-a946-4e4e-a21d-6e84ed397594,4aac7c40-db6b-4648-a82a-7bf397de913a,cdc523d2-da82-4a3d-a97e-9745c8a802d1,20d25e27-b509-45eb-a734-9b8d6a40c632,678ba111-3b01-49b7-83d9-165ef0288a34,4dfbdec6-1911-412c-9873-f6782ab8ff66,5db1c37a-2329-4bd3-baf2-89d26a8959fa,b908cd1b-b274-4b23-9cd5-b59d93a7a6c2,bbb7f769-ac55-4e7f-b0bf-acc36294bb05,ec6f69db-ff0b-491c-8b70-d53d0e7ad65f,...,fce0d59e-78a5-4d7b-8136-b8421f5116b0,a0059fcb-6adc-47f4-8601-b7eed9e12576,1e59ea3f-36fa-4860-8c28-87b4b09075ec,8c435d42-a76a-4078-a593-6a41a2bb6f5e,26f5aec2-a2d8-4151-b975-33d7518e36d0,f325c7a8-46b0-4d75-9228-db391c270c71,a8a7a335-8e21-4f24-b48b-8764caa21329,d848ad65-9261-4854-a6bd-e76484c6a337,5657c6b6-ba04-4dd8-8623-2a3834820f7d,7347a336-6248-47be-817b-88543164d564
ENSG00000000003.15,4370.0,2443.0,3508.0,6928.0,2890.0,1635.0,3456.0,1410.0,1899.0,3362.0,...,4518.0,3958.0,5469.0,2212.0,3321.0,2586.0,2127.0,8020.0,1417.0,4263.0
ENSG00000000005.6,7.0,144.0,7.0,17.0,4.0,101.0,22.0,14.0,4.0,35.0,...,10.0,18.0,9.0,0.0,0.0,286.0,0.0,22722.0,2.0,9.0
ENSG00000000419.13,2625.0,2322.0,2421.0,1812.0,4025.0,1565.0,1779.0,1431.0,2167.0,3346.0,...,1657.0,2113.0,2153.0,1184.0,3880.0,1655.0,742.0,1676.0,1138.0,2071.0
ENSG00000000457.14,3005.0,1466.0,839.0,1651.0,2769.0,1183.0,2176.0,1556.0,1516.0,2135.0,...,2009.0,1830.0,985.0,1426.0,1559.0,1723.0,911.0,1065.0,904.0,1101.0
ENSG00000000460.17,1578.0,409.0,744.0,366.0,663.0,419.0,864.0,318.0,417.0,973.0,...,739.0,931.0,1452.0,343.0,880.0,600.0,328.0,918.0,233.0,717.0


In [56]:
print(f"Expression matrix shape: {expression_matrix.shape}")
expression_matrix.iloc[:5, :5]

Expression matrix shape: (60660, 1197)


gene_id,eddd285e-a946-4e4e-a21d-6e84ed397594,4aac7c40-db6b-4648-a82a-7bf397de913a,cdc523d2-da82-4a3d-a97e-9745c8a802d1,20d25e27-b509-45eb-a734-9b8d6a40c632,678ba111-3b01-49b7-83d9-165ef0288a34
ENSG00000000003.15,4370.0,2443.0,3508.0,6928.0,2890.0
ENSG00000000005.6,7.0,144.0,7.0,17.0,4.0
ENSG00000000419.13,2625.0,2322.0,2421.0,1812.0,4025.0
ENSG00000000457.14,3005.0,1466.0,839.0,1651.0,2769.0
ENSG00000000460.17,1578.0,409.0,744.0,366.0,663.0


## Persist processed outputs

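The expression matrix is saved as a gzipped TSV to keep downstream tooling simple. The output file name says `tpm`, but the matrix holds whichever `value_column` was loaded above (raw `unstranded` counts with the helper's default). A companion manifest records where each column originated and the original file sizes.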
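
A quick round trip on a toy matrix shows how a downstream notebook could read the gzipped TSV back. The file name and values are made up; pandas infers gzip compression from the `.gz` suffix when reading.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy matrix: genes as rows, samples as columns.
matrix = pd.DataFrame(
    {"s1": [1.0, 2.0], "s2": [3.0, 4.0]},
    index=pd.Index(["g1", "g2"], name="gene_id"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_expression.tsv.gz"
    matrix.to_csv(out_path, sep="\t", compression="gzip")
    # index_col restores gene_id as the row index.
    restored = pd.read_csv(out_path, sep="\t", index_col="gene_id")

print(restored)
```

Values written from a `float32` matrix come back as `float64` by default, so pass `dtype` to `read_csv` if the smaller type matters downstream.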


In [57]:
expression_output_path = PROCESSED_DATA_DIR / "tcga_brca_expression_tpm.tsv.gz"
manifest_output_path = PROCESSED_DATA_DIR / "expression_file_index.tsv"

expression_matrix.to_csv(expression_output_path, sep="\t", compression="gzip")

manifest_df = expression_index_with_biospecimen.copy()
manifest_df["file_size_mb"] = manifest_df["path"].map(lambda p: round(p.stat().st_size / 1024 ** 2, 3))
manifest_df = manifest_df.drop(columns=["path"])
manifest_df.to_csv(manifest_output_path, sep="\t", index=False)

expression_output_path, manifest_output_path

(PosixPath('/Users/lennonmccartney/Desktop/tcga-brca-multiomics-subtyping/data/processed/tcga_brca_expression_tpm.tsv.gz'),
 PosixPath('/Users/lennonmccartney/Desktop/tcga-brca-multiomics-subtyping/data/processed/expression_file_index.tsv'))

## Quick summary statistics

Verify the number of samples/genes and inspect the distribution of file sizes to catch obvious anomalies.
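
Beyond the minimum and maximum, a simple way to flag anomalous files is to compare each size with the median. The sketch below uses made-up sizes and an assumed 10% tolerance; both are illustrative rather than values taken from this project.

```python
import pandas as pd

# Made-up sizes in MB; "f3" stands in for a truncated download.
sizes_mb = pd.Series({"f1": 4.01, "f2": 4.05, "f3": 0.40, "f4": 4.07})

median = sizes_mb.median()
tolerance = 0.10  # assumed threshold: flag files more than 10% away from the median
suspicious = sizes_mb[(sizes_mb - median).abs() > tolerance * median]
print(suspicious)
```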


In [58]:
# Compute the file sizes once and reuse them for both statistics.
file_sizes_mb = expression_index["path"].map(lambda p: p.stat().st_size / 1024 ** 2)
summary = pd.Series(
    {
        "n_samples": expression_matrix.shape[1],
        "n_genes": expression_matrix.shape[0],
        "min_file_size_mb": file_sizes_mb.min(),
        "max_file_size_mb": file_sizes_mb.max(),
    }
)
summary

n_samples            1197.000000
n_genes             60660.000000
min_file_size_mb        4.007236
max_file_size_mb        4.075985
dtype: float64