# MIDRC BIH LOINC StudyDescription Harmonization Workflow - 2025
---

# Harmonization of StudyDescriptions using the LOINC Playbook
---
by Chris Meyer, PhD

Director of Data and User Services at the Center for Translational Data Science at University of Chicago

Template Last Updated: February 2025

---
This Jupyter notebook demonstrates how to map the `imaging_study` node's `StudyDescription` and `study_modality` properties to a `loinc_code` using the MIDRC DQH committee's [LOINC mapping table in GitHub](https://github.com/MIDRC/midrc_dicom_harmonization/tree/main/out). Once an imaging study has been mapped to a LOINC code, the code can be used to derive the other LOINC properties. The following [LOINC properties](https://github.com/uc-cdis/midrc_dictionary/blob/deae581f0fb8b9ae5add1458d7882e189ba97af6/gdcdictionary/schemas/imaging_study.yaml#L77) are on the `imaging_study` node of the [MIDRC data dictionary](https://data.midrc.org/dd):
* loinc_code
* loinc_long_common_name
* loinc_method
* loinc_system
* loinc_contrast

In [12]:
# Import Python Packages and scripts
import pandas as pd
import numpy as np
from pathlib import Path
import sys, os, copy, datetime, shutil


import gen3
from gen3.submission import Gen3Submission
from gen3.auth import Gen3Auth
from gen3.index import Gen3Index
from gen3.query import Gen3Query


In [13]:
# download and import some custom Python scripts from https://github.com/cgmeyer/gen3sdk-python
loinc_dir = "/Users/christopher/Documents/Notes/BIH/LOINC" #set your working dir
os.chdir(loinc_dir)
os.system("wget https://raw.githubusercontent.com/cgmeyer/gen3sdk-python/master/expansion/expansion.py -O expansion.py")
from expansion import Gen3Expansion


--2025-02-12 12:19:49--  https://raw.githubusercontent.com/cgmeyer/gen3sdk-python/master/expansion/expansion.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 235369 (230K) [text/plain]
Saving to: ‘expansion.py’

2025-02-12 12:19:49 (4.33 MB/s) - ‘expansion.py’ saved [235369/235369]



In [14]:
## Create the output/working directory (datetime and Path are already imported above)
now = datetime.datetime.now()
date = "{}-{}-{}".format(now.year, now.month, now.day)

results_dir = "{}/results_{}".format(loinc_dir,date)
Path(results_dir).mkdir(parents=True, exist_ok=True)

print("Save copy of this template notebook to the results_dir and run from there: \n\t{}".format(results_dir))

Save copy of this template notebook to the results_dir and run from there: 
	/Users/christopher/Documents/Notes/BIH/LOINC/results_2025-2-12


In [15]:
# Initiate instances of the Gen3 SDK classes using a credentials file downloaded from the identity page of the data commons set in `sapi` (e.g. https://bihstaging.data-commons.org/identity)
# You can view the SDK code/functions in GitHub: https://github.com/uc-cdis/gen3sdk-python
sapi = 'https://bihstaging.data-commons.org'
scred = '/Users/christopher/Downloads/bih-staging-credentials.json'
sauth = Gen3Auth(sapi, refresh_file=scred) # authentication class
ssub = Gen3Submission(sapi, sauth) # submission class
squery = Gen3Query(sauth) # query class
sexp = Gen3Expansion(sapi,sauth,ssub) # class with some custom scripts
spids = sexp.get_project_ids()


Getting all project_ids you have access to in https://bihstaging.data-commons.org
['ACRdart-EA1141Restricted', 'AIMI-CheXpertPlus', 'AIMI-COCA', 'IDC-acrin_nsclc_fdg_pet', 'IDC-anti_pd_1_lung', 'IDC-htan_hms', 'IDC-htan_ohsu', 'IDC-htan_vanderbilt', 'IDC-htan_wustl', 'IDC-lung_pet_ct_dx', 'IDC-nlm_visible_human_project', 'IDC-nlst', 'IDC-nsclc_radiogenomics', 'IDC-nsclc_radiomics', 'IDC-prostate_mri_us_biopsy', 'IDC-prostatex', 'IDC-qin_breast', 'IDC-rider_lung_pet_ct', 'IDC-rms_mutation_prediction', 'IDC-tcga_acc', 'IDC-tcga_chol', 'IDC-tcga_dlbc', 'IDC-tcga_kirc', 'IDC-tcga_lihc', 'IDC-tcga_luad', 'IDC-tcga_lusc', 'IDC-tcga_meso', 'IDC-tcga_paad', 'IDC-tcga_pcpg', 'IDC-tcga_skcm', 'IDC-tcga_tgct', 'IDC-tcga_thym', 'IDC-tcga_ucs', 'IDC-tcga_uvm', 'MIDRC-Open-A1', 'MIDRC-Open-A1_PETAL_BLUECORAL', 'MIDRC-Open-A1_PETAL_REDCORAL', 'MIDRC-Open-A1_SCCM_VIRUS', 'MIDRC-Open-R1', 'MIDRC-TCIA-COVID-19-AR', 'MIDRC-TCIA-COVID-19-NY-SBU', 'MIDRC-TCIA-COVID-19_CT_Images', 'MIDRC-TCIA-RICORD_1a', 'M

## Prepare the mapping and filter tables
---
Get the mapping table and the filtering attributes tables from the [MIDRC GitHub repository](https://github.com/MIDRC/midrc_dicom_harmonization/tree/main/out) and reformat them to prepare for the LOINC mapping.

Prior to this, you will need to clone the repo using `git clone git@github.com:MIDRC/midrc_dicom_harmonization.git`


In [16]:
## Set the directory to your local clone of the GitHub repo and pull the latest updates from the main branch
git_dir = "/Users/christopher/Documents/GitHub/MIDRC/midrc_dicom_harmonization/"
os.chdir(git_dir)
os.system("git checkout main")
os.system("git pull origin main")
os.chdir(loinc_dir)


Already on 'main'


Your branch is up to date with 'origin/main'.
Already up to date.


From github.com:MIDRC/midrc_dicom_harmonization
 * branch            main       -> FETCH_HEAD


In [42]:
mapping_file = "{}/out/StudyDescription_mapping_table.csv".format(git_dir)
mapping = pd.read_csv(mapping_file,dtype=str) #['Modality', 'StudyDescription', 'LOINC code', 'L-Long Common Name']

shutil.copy2(mapping_file, results_dir) # copy the mapping table version used for this mapping to the results_dir

mapping.rename(columns={"Modality":"study_modality","LOINC code":"loinc_code","L-Long Common Name":"loinc_long_common_name"},inplace=True)
mapping.drop_duplicates(inplace=True)
## Make the lookup case-insensitive (vectorized; NaN values pass through instead of raising an error)
mapping['StudyDescription'] = mapping['StudyDescription'].str.casefold()
mapping['study_modality'] = mapping['study_modality'].str.casefold()

## remove any leading/trailing whitespaces in the codes/descriptions
mapping['loinc_code'] = mapping['loinc_code'].str.strip()
mapping['loinc_long_common_name'] = mapping['loinc_long_common_name'].str.strip()
mapping['StudyDescription'] = mapping['StudyDescription'].str.strip()
mapping['study_modality'] = mapping['study_modality'].str.strip()

## Change any "(blank)" values for StudyDescription to "[blank]", the placeholder used for missing values in the study data
mapping.replace({"StudyDescription":{
    '(blank)':'[blank]'}},
    inplace=True)

## Summarize the mapping table
modalities = list(set(mapping['study_modality']))
descriptions = list(set(mapping['StudyDescription']))
mapping_codes = list(set(mapping['loinc_code']))
mapping_names = list(set(mapping['loinc_long_common_name']))

display(mapping)
print("study_modality values in mapping table: {}".format(modalities))
print("Number of unique LOINC codes in mapping table: {}".format(len(mapping_codes)))
print("Number of unique LOINC Long Common Names in mapping table: {}".format(len(mapping_names)))


Unnamed: 0,study_modality,StudyDescription,loinc_code,loinc_long_common_name
0,ct,[blank],25045-6,CT Unspecified body region
1,ct,pet ct fdg imag skull to thigh,81555-5,PET+CT Guidance for localization of tumor of W...
2,ct,chest pe(adult),79077-4,CTA Pulmonary arteries for pulmonary embolus W...
3,ct,ct chest pulmonary angio with iv con,79077-4,CTA Pulmonary arteries for pulmonary embolus W...
4,ct,ct chest pulmonary embolism (ctpe),79077-4,CTA Pulmonary arteries for pulmonary embolus W...
...,...,...,...,...
301,mg,[blank],36625-2,MG Breast Views
302,nm,[blank],49118-3,NM Unspecified body region Views
303,st,[blank],43526-3,SPECT Unspecified body region
304,[blank],bronchoscopy,18744-3,Bronchoscopy study


study_modality values in mapping table: ['nm', 'st', 'pt, ctpt', 'mr', 'rf', '[blank]', 'mg', 'ct', 'us', 'cr, dx']
Number of unique LOINC codes in mapping table: 83
Number of unique LOINC Long Common Names in mapping table: 83
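
Because the lookup key is the pair (`study_modality`, `StudyDescription`), it can help to confirm that no pair in the mapping table points to more than one LOINC code before merging. The following is a minimal sketch on toy data, not part of the original workflow; the helper name `find_ambiguous_keys` and the codes are made up for illustration, and the column names assume the renaming done in the cell above.

```python
import pandas as pd

def find_ambiguous_keys(mapping: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose (study_modality, StudyDescription) key has more than one loinc_code."""
    keys = ["study_modality", "StudyDescription"]
    # Count the distinct codes per key; a clean mapping table has exactly one.
    n_codes = mapping.groupby(keys)["loinc_code"].transform("nunique")
    return mapping.loc[n_codes > 1].sort_values(keys)

# Toy mapping table; the codes are placeholders, not real LOINC codes.
toy_mapping = pd.DataFrame({
    "study_modality": ["ct", "ct", "mr", "ct"],
    "StudyDescription": ["chest", "chest", "brain", "[blank]"],
    "loinc_code": ["TOY-1", "TOY-2", "TOY-3", "TOY-4"],
})

ambiguous = find_ambiguous_keys(toy_mapping)
print(ambiguous)
```

An empty result means every key resolves to a single code; any rows returned would produce extra rows in a left merge and are worth reporting back to the mapping table maintainers.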


In [52]:
filters_file = "{}/out/StudyDescription_filtering_attributes.csv".format(git_dir)
filters = pd.read_csv(filters_file,dtype=str) #['LOINC code', 'L-Long Common Name', 'L-Method', 'L-System', 'Rad.Timing']

shutil.copy2(filters_file, results_dir) # copy the filters table version used for this mapping to the results_dir

filters.rename(columns= {
    'LOINC code':'loinc_code',
    'L-Long Common Name':'loinc_long_common_name',
    'L-Method':'loinc_method',
    'Rad.Timing':'loinc_contrast',
    'MIDRC-System':'loinc_system'
    },
    inplace=True,
    errors='ignore'
)
filters.drop(columns='L-System',inplace=True,errors='ignore')

## remove any leading/trailing whitespaces
filters['loinc_code'] = filters['loinc_code'].str.strip()
filters['loinc_long_common_name'] = filters['loinc_long_common_name'].str.strip()

filter_codes = list(set(filters['loinc_code']))
filter_names = list(set(filters['loinc_long_common_name']))
print("Number of unique LOINC codes in filters table: {}".format(len(filter_codes)))
print("Number of unique LOINC Long Common Names in filters table: {}".format(len(filter_names)))


missing_codes = set(mapping_codes).difference(set(filter_codes))
missing_names = set(mapping_names).difference(set(filter_names))

print("Number of LOINC codes in mapping table missing from filters table: {}\n\t{}".format(len(missing_codes),missing_codes))
print("Number of LOINC names in mapping table missing from filters table: {}\n\t{}".format(len(missing_names),missing_names))

display(filters)


Number of unique LOINC codes in filters table: 82
Number of unique LOINC Long Common Names in filters table: 82
Number of LOINC codes in mapping table missing from filters table: 1
	{'18744-3'}
Number of LOINC names in mapping table missing from filters table: 1
	{'Bronchoscopy study'}


Unnamed: 0,loinc_code,loinc_long_common_name,loinc_method,loinc_contrast,loinc_system
0,35889-5,RF Guidance for bronchoscopy of Chest,RF,,Chest
1,36813-4,CT Abdomen and Pelvis W contrast IV,CT,W,Abdomen
2,42274-1,CT Abdomen and Pelvis WO and W contrast IV,CT,WO & W,Abdomen+Pelvis
3,36952-0,CT Abdomen and Pelvis WO contrast,CT,WO,Abdomen+Pelvis
4,79103-8,CT Abdomen W contrast IV,CT,W,Abdomen
...,...,...,...,...,...
77,83017-4,XR Chest View and Abdomen Supine and Upright,XR,,Chest && Abdomen
78,30745-4,XR Chest Views,XR,,Chest
79,24899-7,XR Ribs Views,XR,,Chest>Ribs
80,43468-8,XR Unspecified body region Views,XR,,Unspecified



## Get the imaging_study information from the BIH Staging Environment
---
* Use the [`Gen3Expansion.get_node_tsvs()` SDK function](https://github.com/cgmeyer/gen3sdk-python/blob/389e3945482439ace6e4536e6d0e35c6e48de9c9/expansion/expansion.py#L219) to get all the imaging studies in BIH Staging (bihstaging.data-commons.org).

* The function `sexp.get_node_tsvs()` returns a master dataframe containing all the imaging_study data from the projects specified.
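
If the per-project TSVs are already on disk, the master dataframe can also be rebuilt with plain pandas. The sketch below is an illustration under assumptions, not the `Gen3Expansion` implementation: the helper `load_node_tsvs` is hypothetical, and the `<project>_imaging_study.tsv` naming is inferred from the download log printed by the notebook. Reading with `dtype=str` is a choice made here to keep UIDs intact and may differ from what the SDK function returns.

```python
import glob
import os
import tempfile

import pandas as pd

def load_node_tsvs(tsv_dir: str, node: str = "imaging_study") -> pd.DataFrame:
    """Concatenate every '<project>_<node>.tsv' file in tsv_dir into one dataframe."""
    files = sorted(glob.glob(os.path.join(tsv_dir, f"*_{node}.tsv")))
    if not files:
        raise FileNotFoundError(f"No {node} TSVs found in {tsv_dir}")
    # Read everything as strings so long UIDs are never coerced to numbers.
    frames = [pd.read_csv(path, sep="\t", dtype=str) for path in files]
    return pd.concat(frames, ignore_index=True, sort=False)

# Demonstration with two throwaway TSVs in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"submitter_id": ["s1", "s2"], "study_modality": ["CT", "MR"]}).to_csv(
        os.path.join(tmp, "PROJ-a_imaging_study.tsv"), sep="\t", index=False)
    pd.DataFrame({"submitter_id": ["s3"], "study_modality": ["CR"]}).to_csv(
        os.path.join(tmp, "PROJ-b_imaging_study.tsv"), sep="\t", index=False)
    demo = load_node_tsvs(tmp)

print(len(demo))
```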

In [20]:
os.chdir(loinc_dir)
## get the imaging_study records from BIH staging
#sdf = sexp.get_node_tsvs(node='imaging_study',overwrite=True) # force a fresh download
sdf = sexp.get_node_tsvs(node='imaging_study',overwrite=False) # if already downloaded
all_studies = copy.deepcopy(sdf)


File previously downloaded.
node_tsvs/imaging_study_tsvs/TCIA-gbm_dsc_mri_dro_imaging_study.tsv has 31 records.
File previously downloaded.
node_tsvs/imaging_study_tsvs/TCIA-cptac_ucec_imaging_study.tsv has 390 records.
File previously downloaded.
node_tsvs/imaging_study_tsvs/TCIA-prostate_diagnosis_imaging_study.tsv has 92 records.
File previously downloaded.
node_tsvs/imaging_study_tsvs/IDC-tcga_thym_imaging_study.tsv has 124 records.
File previously downloaded.
node_tsvs/imaging_study_tsvs/TCIA-cptac_gbm_imaging_study.tsv has 178 records.
File previously downloaded.
node_tsvs/imaging_study_tsvs/TCIA-tcga_prad_imaging_study.tsv has 520 records.
File previously downloaded.
node_tsvs/imaging_study_tsvs/TCIA-midrc_ricord_1b_imaging_study.tsv has 120 records.
File previously downloaded.
node_tsvs/imaging_study_tsvs/TCIA-qin_prostate_repeatability_imaging_study.tsv has 30 records.
File previously downloaded.
node_tsvs/imaging_study_tsvs/TCIA-cmb_mel_imaging_study.tsv has 50 records.
File 

In [21]:
# Save all imaging studies from BIH staging to TSV
all_studies_file = f"{results_dir}/all_imaging_studies_bihstaging_{date}.tsv"
all_studies.to_csv(all_studies_file,sep='\t',index=False)
display(len(all_studies))

590885

In [None]:
# # Read in saved file
# all_studies = pd.read_csv(all_studies_file, header=0, sep='\t')

## Save File "BIH_StudyDescriptions_Gen3.tsv" for GitHub repo
---

Each month, after running this LOINC mapping notebook, the Gen3 User Services team should open a PR to replace the file below, which contains all the BIH imaging StudyDescription/modality combinations, in the GitHub repository where Gen3 collaborates with AAPM/DQH/MIDRC TDP3d to harmonize imaging study descriptions.

GitHub Repo: https://github.com/MIDRC/midrc_dicom_harmonization

Specific File to replace (in the ["in"](https://github.com/MIDRC/midrc_dicom_harmonization/blob/main/in/) folder): https://github.com/MIDRC/midrc_dicom_harmonization/blob/main/in/BIH_StudyDescriptions_Gen3.tsv
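
One detail to be aware of when building this frequency table: pandas `groupby` drops rows whose grouping key is NaN by default, so studies with an empty `StudyDescription` or `study_modality` would not be counted. Whether the input file should list those studies is a decision for the workflow owners; the toy sketch below only shows the difference between the default and `dropna=False`. The helper name `description_frequencies` and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

toy_studies = pd.DataFrame({
    "study_modality": ["CT", "CT", "CT", "MR", "CR"],
    "StudyDescription": ["CHEST", "CHEST", np.nan, "BRAIN", np.nan],
})

def description_frequencies(df: pd.DataFrame, dropna: bool = True) -> pd.DataFrame:
    """Count studies per (modality, description) pair, most frequent first."""
    return (
        df.groupby(["study_modality", "StudyDescription"], dropna=dropna)
        .size()
        .reset_index(name="frequency")
        .sort_values(by="frequency", ascending=False)
        .rename(columns={"study_modality": "Modality"})
        .reset_index(drop=True)
    )

default_freq = description_frequencies(toy_studies)               # NaN descriptions are skipped
with_blanks = description_frequencies(toy_studies, dropna=False)  # NaN descriptions are counted

print(default_freq)
print(with_blanks)
```

Comparing the sum of the `frequency` column against the number of studies is a quick way to see how many records were left out of the table.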


In [23]:
# create this "in" DataFrame based on all_studies
idf = copy.deepcopy(all_studies)
idf = idf.groupby(["study_modality", "StudyDescription"]).size().reset_index(name="frequency").sort_values(by='frequency',ascending=False) # frequencies of all studies sorted by freq
idf.rename(columns={'study_modality':'Modality'},inplace=True)
idf.reset_index(drop=True)

Unnamed: 0,Modality,StudyDescription,frequency
0,XR,XR CHEST AP,156488
1,CT,NLST-LSS,48544
2,SEG,NLST-LSS,41134
3,CT,NLST-ACRIN,24569
4,SEG,NLST-ACRIN,19500
...,...,...,...
5860,MG,Mammo-Digital Screening Employee w/ Tomo,1
5861,MG,Mammogram Diagnostic Bilat,1
5862,MG,Needle Loc Mammo Guidance 1st Right PRIM,1
5863,MG,RIGHT MAMMO DIGITAL DIAGNOSTIC,1


In [25]:
# Save the "in" file to a TSV file locally
ifile = '{}/BIH_StudyDescriptions_Gen3.tsv'.format(results_dir)
idf.to_csv(ifile,sep='\t',index=False)
print("Saved input file to: {}".format(ifile))

Saved input file to: /Users/christopher/Documents/Notes/BIH/LOINC/results_2025-2-12/BIH_StudyDescriptions_Gen3.tsv


In [26]:
## Pull latest version of GitHub Repo
git_dir = "/Users/christopher/Documents/GitHub/MIDRC/midrc_dicom_harmonization/"
os.chdir(git_dir)
os.system("git checkout main")
os.system("git pull origin main")


Already on 'main'


Your branch is up to date with 'origin/main'.


From github.com:MIDRC/midrc_dicom_harmonization
 * branch            main       -> FETCH_HEAD


Already up to date.


0

In [27]:
## Create a new branch to push the new file
git_branch = "Gen3_input_BIH"
os.system("git branch {}".format(git_branch))
os.system("git checkout {}".format(git_branch))


Switched to branch 'Gen3_input_BIH'


0

In [29]:
# Copy, git add the file and git commit
os.system("cp {} {}/in".format(ifile, git_dir))
os.system("git add {}/in/BIH_StudyDescriptions_Gen3.tsv".format(git_dir))
os.system("git commit -a -m 'Adding input study descriptions as of {}'".format(date))


[Gen3_input_BIH 45e49d8] Adding input study descriptions as of 2025-2-12
 1 file changed, 5866 insertions(+)
 create mode 100644 in/BIH_StudyDescriptions_Gen3.tsv


0

In [30]:
os.system("git push origin {}".format(git_branch))



remote: 
remote: Create a pull request for 'Gen3_input_BIH' on GitHub by visiting:        
remote:      https://github.com/MIDRC/midrc_dicom_harmonization/pull/new/Gen3_input_BIH        
remote: 
To github.com:MIDRC/midrc_dicom_harmonization.git
 * [new branch]      Gen3_input_BIH -> Gen3_input_BIH


0

## Reformat the imaging_study data to prepare for mapping
---
* **NaN values**: In the mapping table, a `StudyDescription` or `study_modality` with a value of `null` or `NaN` is listed as `[blank]`.
* **Case-insensitivity**: we should ignore capitalization of `study_modality` and `StudyDescription` strings when we do the look-up. We'll use `str.casefold()` to ignore case for matching.
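* **CR/DX and PT/CT Modality**: in the mapping table, all versions of "CR" and "DX" modalities are denoted "CR, DX", and the PT and PT/CT combinations share a single label (`pt, ctpt` after case-folding); so, we need to change all occurrences of other spellings to match those labels exactly.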
* **Duplicated study UIDs**: Identify any duplicated study UIDs and remove them from studies.
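
The rules above can be sketched as a single normalization function. This is a compact illustration on toy data rather than the notebook's exact cell: the helper name `normalize_for_lookup` is hypothetical, and the alias dictionary lists only a few of the spellings handled in the workflow.

```python
import numpy as np
import pandas as pd

# A subset of variant spellings, collapsed to the labels used in the mapping table.
MODALITY_ALIASES = {
    "cr": "cr, dx", "dx": "cr, dx", "cr,dx": "cr, dx", "dx,cr": "cr, dx",
    "pt": "pt, ctpt", "pt,ct": "pt, ctpt", "ct,pt": "pt, ctpt",
}

def normalize_for_lookup(df: pd.DataFrame) -> pd.DataFrame:
    """Return the lookup columns with blanks filled, case folded, and modalities unified."""
    out = df[["submitter_id", "StudyDescription", "study_modality"]].copy()
    for col in ("StudyDescription", "study_modality"):
        # NaN -> "[blank]", then ignore case and surrounding whitespace.
        out[col] = out[col].fillna("[blank]").str.casefold().str.strip()
    out["study_modality"] = out["study_modality"].replace(MODALITY_ALIASES)
    out["StudyDescription"] = out["StudyDescription"].replace({"<none>": "[blank]"})
    # Drop rows that are identical on all three columns.
    return out.drop_duplicates()

toy = pd.DataFrame({
    "submitter_id": ["1.2.3", "1.2.4", "1.2.5", "1.2.5"],
    "StudyDescription": [" XR Chest AP ", np.nan, "<NONE>", "<NONE>"],
    "study_modality": ["DX", "CR,DX", "PT,CT", "PT,CT"],
})

normalized = normalize_for_lookup(toy)
print(normalized)
```

Note that `drop_duplicates()` only removes rows that match on every column, so a study UID that appears with two different modalities still yields two rows.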


In [32]:
# Use all imaging studies for the mapping. To map only studies that don't already have a LOINC code assigned, use the commented-out line instead.
os.chdir(loinc_dir) # switch back to working loinc_dir
#ast = copy.deepcopy(all_studies.loc[all_studies['loinc_code'].isna()])
ast = copy.deepcopy(all_studies)
display(len(ast))

590885

In [33]:
display(sorted(list(set(ast['study_modality']))))

['CR',
 'CR,DX',
 'CT',
 'DX',
 'FUSION',
 'MG',
 'MR',
 'NM',
 'NM,CT',
 'NM,PT',
 'OT',
 'OT,PT',
 'PR',
 'PT',
 'PT,CT',
 'REG',
 'RF',
 'RTDOSE',
 'RTPLAN',
 'RTSTRUCT',
 'RWV',
 'SC',
 'SEG',
 'SM',
 'SR',
 'US',
 'XA',
 'XC',
 'XR']

In [34]:
dupes = ast.loc[ast.duplicated(subset='submitter_id',keep=False)].sort_values(by='submitter_id')
print("There are {} duplicate submitter_ids in the list of imaging studies.".format(len(dupes)))
if len(dupes) > 0:
    display(dupes)
    dupes.to_csv('duplicated_imaging_studies_staging_validatestaging_{}.tsv'.format(date),sep='\t',index=False)


There are 167078 duplicate submitter_ids in the list of imaging studies.


Unnamed: 0,type,id,project_id,submitter_id,BodyPartExamined,EthnicGroup,PatientAge,PatientID,PatientSex,StudyDescription,StudyInstanceUID,disease_type,primary_site,study_modality,datasets.id,datasets.submitter_id,subjects.id,subjects.submitter_id
258055,imaging_study,148ca76f-a730-426f-8627-5e5cd4ae775f,MIDRC-TCIA-RICORD_1c,1.2.826.0.1.3680043.10.474.1018345410527870119...,CHEST,,62.0,,Female,,1.2.826.0.1.3680043.10.474.1018345410527870119...,,CHEST,CR,2eb3763e-34f6-4e4a-8d15-1f677ab3a3fc,TCIA-RICORD_1c,219216a5-90fe-4c4b-b7ae-69ab99fe32ac,SITE2-000184
264741,imaging_study,2d2ea41d-2e95-4203-beaa-d9e6bff19fbd,TCIA-midrc_ricord_1c,1.2.826.0.1.3680043.10.474.1018345410527870119...,CHEST,,,,Female,,1.2.826.0.1.3680043.10.474.1018345410527870119...,COVID-19 (non-cancer),Lung,CR,1224e193-60e4-42c1-a244-ebdab28c8ca9,midrc_ricord_1c,ec8a6904-056f-476d-b5a5-555308ac312f,MIDRC-RICORD-1C-SITE2-000184
264787,imaging_study,3924184b-44da-4404-ba93-ac02070f4c7e,TCIA-midrc_ricord_1c,1.2.826.0.1.3680043.10.474.1022500940913947735...,CHEST,,,,Female,,1.2.826.0.1.3680043.10.474.1022500940913947735...,COVID-19 (non-cancer),Lung,CR,1224e193-60e4-42c1-a244-ebdab28c8ca9,midrc_ricord_1c,a0c91dd8-4bdf-41a6-8a5c-5f4900d24775,MIDRC-RICORD-1C-SITE2-000186
258700,imaging_study,c4d221db-0bfb-4f1d-a204-d7b19e2918f4,MIDRC-TCIA-RICORD_1c,1.2.826.0.1.3680043.10.474.1022500940913947735...,CHEST,,43.0,,Female,,1.2.826.0.1.3680043.10.474.1022500940913947735...,,CHEST,CR,2eb3763e-34f6-4e4a-8d15-1f677ab3a3fc,TCIA-RICORD_1c,1d0d3fb0-5bd4-4bbe-8abd-09a9ceab1031,SITE2-000186
265492,imaging_study,ebf5cc0e-24d9-4ec6-b43a-652385ce7eb8,TCIA-midrc_ricord_1c,1.2.826.0.1.3680043.10.474.1024095929771784514...,CHEST,,,,Male,,1.2.826.0.1.3680043.10.474.1024095929771784514...,COVID-19 (non-cancer),Lung,CR,1224e193-60e4-42c1-a244-ebdab28c8ca9,midrc_ricord_1c,5feedc96-a45e-49d2-b78d-32938565bdf7,MIDRC-RICORD-1C-SITE2-000203
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1618,imaging_study,af6860e5-b653-4f2c-b824-621a4d29197f,IDC-rider_lung_pet_ct,1.3.6.1.4.1.9328.50.17.82783880319638152814177...,CHEST,,31.0,,Female,FDG 5AFOV TORSO,1.3.6.1.4.1.9328.50.17.82783880319638152814177...,Lung Cancer,Lung,SEG,a410193b-d081-4bd3-8dde-222d3ad4b4e6,rider_lung_pet_ct,1a7c84e9-fe80-4e24-b563-5ee2f3aee3a6,RIDER-2414443006
567651,imaging_study,05f89ae8-33e1-408a-a452-cfc97d965fac,TCIA-TCIA_rider_lung_pet_ct,1.3.6.1.4.1.9328.50.17.83984297885348723829825...,CHEST,,56.0,,Female,PET CT MID BODY,1.3.6.1.4.1.9328.50.17.83984297885348723829825...,Lung Cancer,Lung,CT,9c6122bc-3dc6-4049-95dd-44eee118f31f,TCIA_rider_lung_pet_ct,76ffde88-4fca-454b-8ca1-4007b68e83e2,RIDER-2112049538
1583,imaging_study,02819c8e-9bd6-4212-9446-0dea3787aec5,IDC-rider_lung_pet_ct,1.3.6.1.4.1.9328.50.17.83984297885348723829825...,CHEST,,56.0,,Female,PET CT MID BODY,1.3.6.1.4.1.9328.50.17.83984297885348723829825...,Lung Cancer,Lung,SEG,a410193b-d081-4bd3-8dde-222d3ad4b4e6,rider_lung_pet_ct,a0c4319f-523d-4147-839b-cc0a3ef565d3,RIDER-2112049538
1597,imaging_study,3a32d063-dd97-4eaf-b2b6-e3cac56fc94c,IDC-rider_lung_pet_ct,1.3.6.1.4.1.9328.50.17.96109484884537053882771...,CHEST,,49.0,,Male,FDG 5AFOV TORSO,1.3.6.1.4.1.9328.50.17.96109484884537053882771...,Lung Cancer,Lung,SEG,a410193b-d081-4bd3-8dde-222d3ad4b4e6,rider_lung_pet_ct,33e934e2-fe49-4b17-b000-4a77dc754c25,RIDER-2624615528


In [36]:
## Make a copy of the master imaging_study table with only data relevant to LOINC mapping, and change NaN to "[blank]"
studies = copy.deepcopy(ast[['submitter_id','StudyDescription','study_modality']])
#studies.index = studies['submitter_id']
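# assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
studies['StudyDescription'] = studies['StudyDescription'].fillna("[blank]")
studies['study_modality'] = studies['study_modality'].fillna("[blank]")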

## Make the mapping case-insensitive
studies['StudyDescription'] = studies['StudyDescription'].str.casefold()
studies['study_modality'] = studies['study_modality'].str.casefold()

## remove any leading/trailing whitespaces
studies['StudyDescription'] = studies['StudyDescription'].str.strip()
studies['study_modality'] = studies['study_modality'].str.strip()


## Fix any CR/DX study_modality to all be "cr, dx" to match mapping table
studies.replace({"study_modality":{
    'cr':'cr, dx',
    'dx':'cr, dx',
    'cr,dx':'cr, dx',
    'dx,cr':'cr, dx',
    'dx, cr':'cr, dx',
    'crdx':'cr, dx',
    'dxcr':'cr, dx'}},
    inplace=True)

## Fix any CT/PT study_modality to all be "pt, ctpt" to match mapping table
studies.replace({"study_modality":{
    'pt':'pt, ctpt',
    'ctpt':'pt, ctpt',
    'ptct':'pt, ctpt',
    'ct, pt':'pt, ctpt',
    'pt, ct':'pt, ctpt',
    'pt,ct':'pt, ctpt',
    'ct,pt':'pt, ctpt'}},
    inplace=True)

## Change any "<NONE>" values for StudyDescription to "[blank]" to match the mapping table
studies.replace({"StudyDescription":{
    '<none>':'[blank]'}},
    inplace=True)

## Identify and drop duplicates
dupes = studies.loc[studies.duplicated(keep=False)]
studies = studies.drop_duplicates()
display(studies)



Unnamed: 0,submitter_id,StudyDescription,study_modality
0,1.3.6.1.4.1.9590.100.1.2.329737130703596962422...,gbm dsc dro b0 3t ca pre full bolus full - tr...,mr
1,1.3.6.1.4.1.9590.100.1.2.104887745833729382822...,gbm dsc dro b0 3t ca pre full bolus full - tr...,mr
2,1.3.6.1.4.1.9590.100.1.2.137620470032206897528...,gbm dsc dro b0 1.5t ca pre full bolus full - ...,mr
3,1.3.6.1.4.1.9590.100.1.2.360758109641141985523...,gbm dsc dro b0 3t ca pre quarter bolus three-q...,mr
4,1.3.6.1.4.1.9590.100.1.2.370659582215339645325...,gbm dsc dro b0 1.5t ca pre full bolus full - ...,mr
...,...,...,...
590880,2.25.84879617548504063537570954242798838793,histopathology,sm
590881,2.25.172535366789155891798994576712979330476,histopathology,sm
590882,2.25.42786662352207131091028545195519986096,histopathology,sm
590883,2.25.197297254260545288180471484544781359458,histopathology,sm


In [37]:
## Check modalities for presence in the mapping table and frequency in MIDRC data:
study_modalities = list(set(studies.study_modality))
print("There are the following values of study_modality in the MIDRC imaging_study data: \n{}".format(study_modalities))

missing_modalities = set(study_modalities).difference(modalities)

mfreq = {}
for mod in missing_modalities:
    freq = len(studies.loc[studies['study_modality']==mod])
    mfreq[mod] = freq

print("\nThese modalities are in the study data but not in the mapping table: \n{}".format(mfreq))

There are the following values of study_modality in the MIDRC imaging_study data: 
['mr', 'rwv', 'rtdose', 'xa', 'mg', 'sm', 'cr, dx', 'pr', 'nm,ct', 'sr', 'fusion', 'xr', 'rtstruct', 'xc', 'ot,pt', 'ct', 'seg', 'us', 'ot', 'nm,pt', 'nm', 'rtplan', 'sc', 'pt, ctpt', 'rf', 'reg']

These modalities are in the study data but not in the mapping table: 
{'nm,ct': 6, 'sr': 12121, 'fusion': 18, 'xr': 187711, 'rwv': 4, 'rtdose': 59, 'rtstruct': 830, 'xa': 9, 'xc': 2, 'ot,pt': 2, 'seg': 66185, 'ot': 3, 'nm,pt': 1, 'rtplan': 68, 'sc': 2, 'sm': 13992, 'reg': 11, 'pr': 52}


## Do mapping using pandas pd.merge(): 
---

1) Merge `loinc_code` and `loinc_long_common_name` from the `mapping` dataframe into the `studies` dataframe on the combination of `StudyDescription` and `study_modality`.

2) Merge `loinc_method`, `loinc_contrast`, and `loinc_system` from the `filters` dataframe into the result on the combination of `loinc_code` and `loinc_long_common_name`.
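
The two merge steps can be tried on a toy example first. The sketch below uses made-up codes and names (they are placeholders, not real LOINC entries) and joins the second table on the same key pair as the cell below. The `validate="many_to_one"` argument is an addition for illustration: it makes pandas raise an error if a key is duplicated in the lookup table instead of silently multiplying rows.

```python
import pandas as pd

studies = pd.DataFrame({
    "submitter_id": ["s1", "s2", "s3"],
    "study_modality": ["ct", "mg", "mr"],
    "StudyDescription": ["[blank]", "[blank]", "unknown protocol"],
})
# Placeholder lookup tables; codes, names, and attributes are invented.
mapping = pd.DataFrame({
    "study_modality": ["ct", "mg"],
    "StudyDescription": ["[blank]", "[blank]"],
    "loinc_code": ["TOY-1", "TOY-2"],
    "loinc_long_common_name": ["Toy CT name", "Toy MG name"],
})
filters = pd.DataFrame({
    "loinc_code": ["TOY-1", "TOY-2"],
    "loinc_long_common_name": ["Toy CT name", "Toy MG name"],
    "loinc_method": ["CT", "MG"],
})

# Step 1: description + modality -> LOINC code and name.
step1 = studies.merge(mapping, on=["study_modality", "StudyDescription"],
                      how="left", validate="many_to_one")
# Step 2: LOINC code and name -> remaining LOINC attributes.
step2 = step1.merge(filters, on=["loinc_code", "loinc_long_common_name"],
                    how="left", validate="many_to_one")

mapped = step2.loc[step2["loinc_code"].notna()]
unmapped = step2.loc[step2["loinc_code"].isna()]
print(step2)
```

Because both merges are left joins against tables with unique keys, the number of rows in `step2` equals the number of rows in `studies`; studies without a match simply carry NaN in the LOINC columns.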



In [43]:
df = studies.merge(mapping,on=['study_modality','StudyDescription'],how='left').drop_duplicates()
df = df.merge(filters,on=['loinc_code','loinc_long_common_name'],how='left')
df.drop(columns=['StudyDescription','study_modality'],inplace=True,errors='ignore')
df

Unnamed: 0,submitter_id,loinc_code,loinc_long_common_name,loinc_method,loinc_contrast,loinc_system
0,1.3.6.1.4.1.9590.100.1.2.329737130703596962422...,,,,,
1,1.3.6.1.4.1.9590.100.1.2.104887745833729382822...,,,,,
2,1.3.6.1.4.1.9590.100.1.2.137620470032206897528...,,,,,
3,1.3.6.1.4.1.9590.100.1.2.360758109641141985523...,,,,,
4,1.3.6.1.4.1.9590.100.1.2.370659582215339645325...,,,,,
...,...,...,...,...,...,...
581487,2.25.84879617548504063537570954242798838793,,,,,
581488,2.25.172535366789155891798994576712979330476,,,,,
581489,2.25.42786662352207131091028545195519986096,,,,,
581490,2.25.197297254260545288180471484544781359458,,,,,


In [46]:
## Pull in the original study_modality and StudyDescription (non-lowercase) and rearrange the columns
#cols = ['type','project_id','submitter_id','cases.submitter_id','study_modality','StudyDescription','loinc_code','loinc_long_common_name','loinc_method','loinc_contrast','loinc_system']
#df = df.merge(ast[['type','project_id','submitter_id','subjects.submitter_id','StudyDescription','study_modality']], on='submitter_id',how='left')[cols]

cols = ['type','project_id','submitter_id','subjects.submitter_id','study_modality','StudyDescription','loinc_code','loinc_long_common_name','loinc_method','loinc_contrast','loinc_system','datasets.submitter_id'] 
df = df.merge(ast[['type','project_id','submitter_id','subjects.submitter_id','StudyDescription','study_modality','datasets.submitter_id']], on='submitter_id',how='left')[cols]
df


Unnamed: 0,type,project_id,submitter_id,subjects.submitter_id,study_modality,StudyDescription,loinc_code,loinc_long_common_name,loinc_method,loinc_contrast,loinc_system,datasets.submitter_id
0,imaging_study,TCIA-gbm_dsc_mri_dro,1.3.6.1.4.1.9590.100.1.2.329737130703596962422...,GBM_DSC_MRI_DRO_Bo=3T,MR,GBM DSC DRO b0 3T CA pre Full bolus Full - TR...,,,,,,gbm_dsc_mri_dro
1,imaging_study,TCIA-gbm_dsc_mri_dro,1.3.6.1.4.1.9590.100.1.2.104887745833729382822...,GBM_DSC_MRI_DRO_Bo=3T,MR,GBM DSC DRO b0 3T CA pre Full bolus Full - TR...,,,,,,gbm_dsc_mri_dro
2,imaging_study,TCIA-gbm_dsc_mri_dro,1.3.6.1.4.1.9590.100.1.2.137620470032206897528...,GBM_DSC_MRI_DRO_Bo=15T,MR,GBM DSC DRO b0 1.5T CA pre Full bolus Full - ...,,,,,,gbm_dsc_mri_dro
3,imaging_study,TCIA-gbm_dsc_mri_dro,1.3.6.1.4.1.9590.100.1.2.360758109641141985523...,GBM_DSC_MRI_DRO_Bo=3T,MR,GBM DSC DRO b0 3T CA pre quarter bolus three-q...,,,,,,gbm_dsc_mri_dro
4,imaging_study,TCIA-gbm_dsc_mri_dro,1.3.6.1.4.1.9590.100.1.2.370659582215339645325...,GBM_DSC_MRI_DRO_Bo=15T,MR,GBM DSC DRO b0 1.5T CA pre Full bolus Full - ...,,,,,,gbm_dsc_mri_dro
...,...,...,...,...,...,...,...,...,...,...,...,...
739172,imaging_study,TCIA-cptac_brca,2.25.84879617548504063537570954242798838793,01BR024,SM,Histopathology,,,,,,cptac_brca
739173,imaging_study,TCIA-cptac_brca,2.25.172535366789155891798994576712979330476,01BR002,SM,Histopathology,,,,,,cptac_brca
739174,imaging_study,TCIA-cptac_brca,2.25.42786662352207131091028545195519986096,05BR019,SM,Histopathology,,,,,,cptac_brca
739175,imaging_study,TCIA-cptac_brca,2.25.197297254260545288180471484544781359458,05BR043,SM,Histopathology,,,,,,cptac_brca
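
The row count grows across this merge (the last index goes from 581,490 to 739,175, and CT / NLST-LSS rises from 48,544 in the input frequency table to 96,327 in the frequencies computed afterwards). That pattern is consistent with a many-to-many join on `submitter_id`, since the same study UID exists in more than one project. The toy sketch below shows how such a fan-out arises, how `validate=` can detect it, and one possible alternative key. It assumes the pair (`project_id`, `submitter_id`) is unique, which should be checked against the real data; all names and values are illustrative.

```python
import pandas as pd

# The same study UID ("uid-1") is present in two projects.
source = pd.DataFrame({
    "project_id": ["PROJ-A", "PROJ-B", "PROJ-A"],
    "submitter_id": ["uid-1", "uid-1", "uid-2"],
    "study_modality": ["CT", "SEG", "MR"],
})
codes = pd.DataFrame({
    "project_id": ["PROJ-A", "PROJ-B", "PROJ-A"],
    "submitter_id": ["uid-1", "uid-1", "uid-2"],
    "loinc_code": ["TOY-1", None, None],  # placeholder code
})

# Joining on submitter_id alone pairs every uid-1 row with every other uid-1 row.
naive = codes.merge(source[["submitter_id", "study_modality"]],
                    on="submitter_id", how="left")

# validate= turns the silent fan-out into an explicit error.
try:
    codes.merge(source, on="submitter_id", how="left", validate="many_to_one")
    fan_out_detected = False
except pd.errors.MergeError:
    fan_out_detected = True

# Joining on a key that is unique per record keeps the row count stable.
keyed = codes.merge(source, on=["project_id", "submitter_id"],
                    how="left", validate="one_to_one")

print(len(codes), len(naive), len(keyed), fan_out_detected)
```

If the real data behaves the same way, the totals reported for mapped and unmapped studies count some records more than once, so they are best read as upper bounds until the join key is made unique.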


In [47]:
dm = df.loc[df['datasets.submitter_id'].isna()]
if len(dm) > 0:
    print("{} imaging studies are missing a link to dataset.".format(len(dm)))
    dm.to_csv('{}/imaging_studies_missing_datasets_link_{}.tsv'.format(results_dir,len(dm)),sep='\t',index=False)
else:
    print("All imaging studies have a link to dataset node.")

All imaging studies have a link to dataset node.


In [48]:
afdf = df.groupby(["study_modality", "StudyDescription"]).size().reset_index(name="Freq").sort_values(by='Freq',ascending=False) # frequencies of all studies sorted by freq
mdf = df.loc[~df['loinc_code'].isna()] # mapped studies
udf = df.loc[df['loinc_code'].isna()] # unmapped studies
fdf = udf.groupby(["study_modality", "StudyDescription"]).size().reset_index(name="Freq").sort_values(by='Freq',ascending=False) # frequencies of unmapped studies sorted by freq

print("Total Studies: {}, Mapped: {}, Unmapped: {}".format(len(df),len(mdf),len(udf)))


Total Studies: 739177, Mapped: 171261, Unmapped: 567916


In [49]:
afdf

Unnamed: 0,study_modality,StudyDescription,Freq
5861,XR,XR CHEST AP,156488
2064,CT,NLST-LSS,96327
5098,SEG,NLST-LSS,82268
2063,CT,NLST-ACRIN,48447
5097,SEG,NLST-ACRIN,39000
...,...,...,...
3385,MR,BI-LATERAL BREASTS,1
3386,MR,BILAT BREAT,1
3387,MR,BILAT BRST,1
849,CT,CHEST CT W/CONTRAST,1


In [50]:
## Save results to files
filename="{}/ALL_LOINC_mapping_results_{}_{}.tsv".format(results_dir,len(df),date)
df.to_csv(filename, sep='\t', index=False)
print("All LOINC mapping results saved to file: \n{}\n".format(filename))

filename="{}/ALL_LOINC_study_frequencies_{}_{}.tsv".format(results_dir,len(afdf),date)
afdf.to_csv(filename, sep='\t', index=False)
print("Frequencies of all (mapped and unmapped) imaging study description/modality combinations saved to file: \n{}\n".format(filename))

filename="{}/ALL_LOINC_umapped_frequencies_{}_{}.tsv".format(results_dir,len(fdf),date)
fdf.to_csv(filename, sep='\t', index=False)
print("Frequencies of unmapped imaging study description/modality combinations saved to file: \n{}\n".format(filename))

filename="{}/ALL_LOINC_mapped_{}_{}.tsv".format(results_dir,len(mdf),date)
mdf.to_csv(filename, sep='\t', index=False)
print("Successfully mapped imaging studies saved to file: \n{}\n".format(filename))

filename="{}/ALL_LOINC_umapped_{}_{}.tsv".format(results_dir,len(udf),date)
udf.to_csv(filename, sep='\t', index=False)
print("Unmapped imaging studies saved to file: \n{}\n".format(filename))


All LOINC mapping results saved to file: 
/Users/christopher/Documents/Notes/BIH/LOINC/results_2025-2-12/ALL_LOINC_mapping_results_739177_2025-2-12.tsv

Frequencies of all (mapped and unmapped) imaging study description/modality combinations saved to file: 
/Users/christopher/Documents/Notes/BIH/LOINC/results_2025-2-12/ALL_LOINC_study_frequencies_5865_2025-2-12.tsv

Frequencies of unmapped imaging study description/modality combinations saved to file: 
/Users/christopher/Documents/Notes/BIH/LOINC/results_2025-2-12/ALL_LOINC_umapped_frequencies_5508_2025-2-12.tsv

Successfully mapped imaging studies saved to file: 
/Users/christopher/Documents/Notes/BIH/LOINC/results_2025-2-12/ALL_LOINC_mapped_171261_2025-2-12.tsv

Unmapped imaging studies saved to file: 
/Users/christopher/Documents/Notes/BIH/LOINC/results_2025-2-12/ALL_LOINC_umapped_567916_2025-2-12.tsv



In [51]:
## Take a glance at frequency of the unmapped imaging study modality/description combinations
display(fdf)


Unnamed: 0,study_modality,StudyDescription,Freq
5504,XR,XR CHEST AP,156488
1849,CT,NLST-LSS,96327
4741,SEG,NLST-LSS,82268
1848,CT,NLST-ACRIN,48447
4740,SEG,NLST-ACRIN,39000
...,...,...,...
3044,MR,ANGIOS^GBO,1
805,CT,CT A/P LIVER PROT,1
3047,MR,B MRI THIGH C- LEFT,1
3048,MR,B KNEE RT,1


## Submit the data to MIDRC BIH Staging
---
Use the `Gen3Expansion.submit_df()` function (from the custom `expansion.py` script) to update the LOINC properties for each of the newly mapped studies in `mdf`.
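
Before submitting to a real commons, the per-project logic can be rehearsed offline. The sketch below is a dry-run illustration under assumptions: `submit_by_project` and `keep_most_complete` are hypothetical helpers, and `submit` is a stand-in callable rather than a Gen3 SDK function. It shows two ideas: keeping, for each `submitter_id`, the row with the fewest missing values, and recording failures per project instead of discarding them.

```python
import pandas as pd

def keep_most_complete(df: pd.DataFrame, key: str = "submitter_id") -> pd.DataFrame:
    """For each key value, keep the row with the fewest missing values."""
    missing = df.isna().sum(axis=1)
    best_idx = missing.groupby(df[key]).idxmin()
    return df.loc[best_idx]

def submit_by_project(df: pd.DataFrame, submit, chunk_size: int = 1000):
    """Call submit(project_id, chunk_df) per project and chunk; collect results and errors."""
    results, errors = {}, {}
    for pid, sub_df in df.groupby("project_id"):
        sub_df = keep_most_complete(sub_df)
        try:
            results[pid] = [
                submit(pid, sub_df.iloc[start:start + chunk_size])
                for start in range(0, len(sub_df), chunk_size)
            ]
        except Exception as exc:  # keep going, but remember what failed
            errors[pid] = repr(exc)
    return results, errors

toy = pd.DataFrame({
    "project_id": ["P1", "P1", "P1", "P2"],
    "submitter_id": ["a", "a", "b", "c"],
    "loinc_code": ["TOY-1", "TOY-1", "TOY-2", "TOY-3"],  # placeholder codes
    "loinc_contrast": [None, "W", None, None],
})

def fake_submit(pid, chunk):
    """Stand-in for a real submission call; fails for project P2 on purpose."""
    if pid == "P2":
        raise RuntimeError("simulated failure")
    return len(chunk)

results, errors = submit_by_project(toy, fake_submit, chunk_size=1)
print(results, errors)
```

Inspecting `errors` after the loop makes it clear which projects need to be retried, which is harder to tell when exceptions are swallowed.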

In [None]:
# Submit the derived data to staging
projects = list(set(mdf['project_id']))
print("Updating {} newly mapped imaging studies across {} projects: {}".format(len(mdf),len(projects),projects))
data = {}
for pid in projects:
    sub_df = copy.deepcopy(mdf.loc[mdf["project_id"]==pid])
    sids = list(set(sub_df['submitter_id']))
    if len(sids) != len(sub_df):
        print("\tDuplicates found in df! Dropping {} duplicates".format((len(sub_df)-len(sids))))
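        # for each submitter_id, keep the row with the fewest missing values
        sub_df = sub_df.loc[sub_df.isna().sum(axis=1).groupby(sub_df.submitter_id).idxmin(),:]
        #sub_df = sub_df.drop_duplicates(keep='first')
    if pid not in spids:
        print("\tSkipping project '{}': not in the list of accessible project_ids.".format(pid))
        continue
    try:
        print("Submitting {} updated studies to project '{}' in bihstaging.".format(len(sub_df),pid))
        data[pid] = sexp.submit_df(df=sub_df,project_id=pid, chunk_size=1000)
    except KeyboardInterrupt:
        break
    except Exception as e:
        # report the failure instead of silently skipping the project
        print("\tSubmission to project '{}' failed: {}".format(pid,e))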


In [None]:
display(data)