In [1]:
from helpers import *

## Directory Structure

The first step is to set up `rawdata`, `procdata`, and `results` directories to keep track of the downloaded data, the processing steps, and the final results. We'll also make a `workflow` directory for code, logs, configuration files, and other relevant files.

The organization will look like this:


In [2]:
!mkdir -p ../../rawdata/ ../../procdata/ ../../results/
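The same layout can also be created from Python instead of the shell. This is a minimal sketch using `pathlib`; the function name `make_project_dirs` is an illustrative assumption and is not part of `helpers`.

```python
from pathlib import Path


def make_project_dirs(root, names=("rawdata", "procdata", "results")):
    """Create each named directory under root, like `mkdir -p` (hypothetical helper)."""
    created = []
    for name in names:
        path = Path(root) / name
        # parents=True creates missing parent directories;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project_dirs("../..")` from the notebook's directory would mirror the `mkdir -p` cell above.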

# Download clinical data from TCIA for the HNSCC dataset

HNSCC is a publicly available dataset on TCIA. The imaging data is under a TCIA Restricted License, but the clinical data is available for download. 

Two clinical spreadsheets are provided because the dataset includes data from two institutions. The first is _Head-Neck-CT-Atlas_ and the second is _Radiomics outcome prediction in Oropharyngeal cancer_. We'll refer to these as ATLAS and OPC, respectively.

In [6]:
# Download ATLAS Clinical Data to the appropriate directory
# Won't download if the file already exists
!wget -nc -P ../../rawdata/HNSCC/clinical/atlas/ https://www.cancerimagingarchive.net/wp-content/uploads/HNSCC-MDA-Data_update_20240514.xlsx

--2024-09-30 12:38:24--  https://www.cancerimagingarchive.net/wp-content/uploads/HNSCC-MDA-Data_update_20240514.xlsx
Resolving www.cancerimagingarchive.net (www.cancerimagingarchive.net)... 144.30.169.13
Connecting to www.cancerimagingarchive.net (www.cancerimagingarchive.net)|144.30.169.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 137751 (135K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘../../rawdata/HNSCC/clinical/atlas/HNSCC-MDA-Data_update_20240514.xlsx’


2024-09-30 12:38:24 (900 KB/s) - ‘../../rawdata/HNSCC/clinical/atlas/HNSCC-MDA-Data_update_20240514.xlsx’ saved [137751/137751]



In [7]:
# Download OPC Clinical Data to the appropriate directory
# Won't download if the file already exists
!wget -nc -P ../../rawdata/HNSCC/clinical/opc/ https://www.cancerimagingarchive.net/wp-content/uploads/Radiomics_Outcome_Prediction_in_OPC_ASRM_corrected.csv

--2024-09-30 12:38:25--  https://www.cancerimagingarchive.net/wp-content/uploads/Radiomics_Outcome_Prediction_in_OPC_ASRM_corrected.csv
Resolving www.cancerimagingarchive.net (www.cancerimagingarchive.net)... 144.30.169.13
Connecting to www.cancerimagingarchive.net (www.cancerimagingarchive.net)|144.30.169.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 88002 (86K) [text/csv]
Saving to: ‘../../rawdata/HNSCC/clinical/opc/Radiomics_Outcome_Prediction_in_OPC_ASRM_corrected.csv’


2024-09-30 12:38:26 (826 KB/s) - ‘../../rawdata/HNSCC/clinical/opc/Radiomics_Outcome_Prediction_in_OPC_ASRM_corrected.csv’ saved [88002/88002]
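The `wget -nc -P` pattern used in the two download cells (skip the download when the file is already present, save into a given directory) can be sketched in pure Python with the standard library. The function name `download_if_missing` is a hypothetical stand-in, not something provided by `helpers`.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_if_missing(url, dest_dir):
    """Download url into dest_dir unless a file with that name already exists."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last component of the URL path as the local file name
    target = dest_dir / Path(urlparse(url).path).name
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

For example, `download_if_missing(url, "../../rawdata/HNSCC/clinical/opc/")` with the OPC URL above would return the local path whether or not a download was needed.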



# Data Loading

In [8]:
import pandas as pd
import os

In [9]:
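# Use the first entry in the ATLAS directory; this assumes the downloaded spreadsheet is the only file there
atlas_clinical_data_path = os.path.join("../../rawdata/HNSCC/clinical/atlas/", os.listdir("../../rawdata/HNSCC/clinical/atlas/")[0])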

atlas_clinical_dataframe = load_data_to_df(atlas_clinical_data_path)

In [10]:
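# Use the first entry in the OPC directory; this assumes the downloaded CSV is the only file there
opc_clinical_data_path = os.path.join("../../rawdata/HNSCC/clinical/opc/", os.listdir("../../rawdata/HNSCC/clinical/opc/")[0])

opc_clinical_dataframe = load_data_to_df(opc_clinical_data_path)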
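`load_data_to_df` comes from the `helpers` module, whose source is not shown here. Because one file is an `.xlsx` spreadsheet and the other a `.csv`, a loader of this kind plausibly dispatches on the file extension. The sketch below is a guess at that behavior under that assumption; `load_table` is a hypothetical name and may differ from what `helpers` actually does.

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Hypothetical stand-in for load_data_to_df: pick a pandas reader by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        # Reading .xlsx requires an Excel engine such as openpyxl to be installed
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")
```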

# Rename Clinical Columns to match

These two clinical datasets have different numbers of columns and different column names. We're going to rename the columns so they match, following the column names used in the RADCURE clinical dataset (also on TCIA).

In [11]:
updated_col_names_atlas = {"Alive or Dead": "Status"}

updated_atlas_clinical_dataframe = atlas_clinical_dataframe.rename(columns=updated_col_names_atlas)

In [12]:
updated_col_names_opc = {"Gender": "Sex",
                         "Age at Diag": "Age",
                         "T-category": "T",
                         "N-category": "N",
                         "AJCC Stage (7th edition)": "Stage",
                         "Therapeutic Combination": "Oncologic Treatment Summary",
                         "Total prescribed Radiation treatment dose": "RT Total Dose (Gy)",
                         "Radiation treatment_number of fractions": "Number of Fractions",
                         "Radiation treatment_dose per fraction": "Dose/Fraction (Gy/fx)",
                         "Vital status": "Status",
                         "Overall survival_duration of Merged updated ASRM V2": "Survival (days)",
                         }

updated_opc_clinical_dataframe = opc_clinical_dataframe.rename(columns=updated_col_names_opc)
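After renaming, it can be useful to check which columns the two dataframes now share. By default `DataFrame.rename` ignores keys that are not present, so a typo in a mapping leaves the column unchanged without raising; passing `errors="raise"` makes that fail loudly. The sketch below uses toy frames; the `Site` column and the `compare_columns` helper are illustrative assumptions, not part of the real data or of `helpers`.

```python
import pandas as pd


def compare_columns(left, right):
    """Return the column names shared by both frames and those unique to each."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "shared": sorted(left_cols & right_cols),
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }


# Toy stand-ins for the ATLAS and OPC tables (values are made up)
atlas_toy = pd.DataFrame({"Sex": ["M"], "Age": [60], "Status": ["Alive"], "Site": ["Tonsil"]})
opc_toy = pd.DataFrame({"Gender": ["F"], "Age at Diag": [55], "Vital status": ["Dead"]})

# errors="raise" turns a misspelled source column into a KeyError
opc_toy = opc_toy.rename(
    columns={"Gender": "Sex", "Age at Diag": "Age", "Vital status": "Status"},
    errors="raise",
)

report = compare_columns(atlas_toy, opc_toy)
print(report["shared"])
print(report["only_left"])
```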

# Merge Dataframes

Now that the datasets have more matching columns, we can merge them into a single dataframe. The ATLAS dataset has many columns that are not present in OPC, so those columns will be `NaN` for the OPC patients in the merged dataframe.

First we'll set the patient ID column as the index of the dataframes, and then we'll merge them into a single dataframe.

In [13]:
atlas_patient_identifier = getPatientIdentifierLabel(updated_atlas_clinical_dataframe)
updated_atlas_clinical_dataframe.set_index(atlas_patient_identifier, inplace=True)

In [14]:
opc_patient_identifier = getPatientIdentifierLabel(updated_opc_clinical_dataframe)
updated_opc_clinical_dataframe.set_index(opc_patient_identifier, inplace=True)

Multiple patient identifier labels found. Using the first one.
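To illustrate how columns missing from one dataset end up as `NaN`, here is a minimal sketch that stacks two toy frames indexed by patient ID with `pd.concat`. This is one possible way to combine the tables, not necessarily the one used in this workflow; the frames, IDs, and the `Smoking` column are made up for illustration.

```python
import pandas as pd

# Toy frames indexed by a patient identifier (values are made up)
atlas_toy = pd.DataFrame(
    {"Status": ["Alive"], "Smoking": ["Never"]},
    index=pd.Index(["A-001"], name="patient_id"),
)
opc_toy = pd.DataFrame(
    {"Status": ["Dead"]},
    index=pd.Index(["O-001"], name="patient_id"),
)

# join="outer" keeps the union of columns; cells with no source value become NaN
merged_toy = pd.concat([atlas_toy, opc_toy], axis=0, join="outer")
print(merged_toy)
```

The OPC row has no `Smoking` value, so that cell is `NaN` in `merged_toy`.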
