### Get un-anonymized DICOM tags from the train set

In this notebook, we'll:

- Extract the relevant DICOM tags that haven't been anonymized from the images in the train set.
- Export them to a CSV file with their StudyID and image ID (the `FileID` column).

I used the train CSV file to get a single image from each study, rather than iterating through the directories and picking up a lot of duplicates.
- It took about 17 minutes on GPU to go through the train set.
- I created a dataset if you don't want to export the tags yourself -> https://www.kaggle.com/davidbroberts/siimfisabiorsna-covid19-dicom-tags
- This could be tweaked to run on the test set by iterating over the directory, since there isn't a dataframe for test.

Here are the tags we extract:
- TransferSyntaxUID
- PhotometricInterpretation
- PatientSex
- Rows
- Columns
- BitsAllocated
- BitsStored
- HighBit
- PixelRepresentation
- ImagerPixelSpacing
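
Several of these tags describe how pixel values are stored, and a small worked example may make them easier to read. In DICOM, `PixelRepresentation` 0 means unsigned integers and 1 means two's complement, while `BitsStored` is the number of bits actually used inside each `BitsAllocated`-bit cell (`HighBit` is commonly `BitsStored - 1`). The helper below is a hypothetical illustration, not part of the notebook, that derives the representable raw value range from two of those tags.

```python
def stored_value_range(bits_stored, pixel_representation):
    """Return (min, max) of the raw pixel values representable with these tags."""
    if pixel_representation == 0:      # unsigned integers
        return 0, (1 << bits_stored) - 1
    if pixel_representation == 1:      # two's complement (signed)
        half = 1 << (bits_stored - 1)
        return -half, half - 1
    raise ValueError("PixelRepresentation must be 0 or 1")

print(stored_value_range(12, 0))  # (0, 4095)
print(stored_value_range(16, 1))  # (-32768, 32767)
```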

In [None]:
import os
import numpy as np
import pandas as pd
import pydicom

In [None]:
# Load the data
base_path = "/kaggle/input/siim-covid19-detection/"
images_df = pd.read_csv(os.path.join(base_path,"train_image_level.csv"))

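# Strip the '_image' suffix from the image IDs.
# str.rstrip('_image') would strip any trailing run of those characters,
# so it could eat into IDs that end in 'a' or 'e'; anchor the match instead.
images_df['id'] = images_df['id'].str.replace(r'_image$', '', regex=True)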
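
A note on stripping the suffix: `str.rstrip` treats its argument as a set of characters rather than a literal suffix, so it keeps removing trailing characters for as long as they appear in `'_image'`. The sketch below uses a made-up ID (not a real one from the dataset) to show the difference; `strip_suffix` is a hypothetical helper.

```python
def strip_suffix(text, suffix="_image"):
    """Remove suffix from the end of text only if it is literally there."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

example_id = "0a1b2c3e_image"  # made-up ID whose last character is 'e'

print(example_id.rstrip("_image"))  # 0a1b2c3
print(strip_suffix(example_id))     # 0a1b2c3e
```

Because the IDs are hexadecimal strings, any ID ending in `a` or `e` would be at risk of losing characters with `rstrip`.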

In [None]:
# Find the first file in a StudyInstanceUID directory and return its path.
# Returns the string "none" if the study directory contains no files.
def get_image_by_study_id(study_id):
    study_path = os.path.join(base_path, "train", study_id)
    for subdir, dirs, files in os.walk(study_path):
        for file in files:
            image = os.path.join(subdir, file)
            if os.path.isfile(image):
                return image
    return "none"

In [None]:
# Specify the tags we want and make a dataframe to hold them
columns = ['StudyID',
           'FileID',
           'TransferSyntaxUID',
           'PhotometricInterpretation',
           'PatientSex',
           'Rows',
           'Columns',
           'BitsAllocated',
           'BitsStored',
           'HighBit',
           'PixelRepresentation',
           'ImagerPixelSpacing']

tags_df = pd.DataFrame(columns=columns)

In [None]:
# Iterate through the images and grab some tags

for index, row in images_df.iterrows():
    pt_sex = "O"
    rows = 0
    cols = 0
    bits_allocated = 0
    bits_stored = 0
    high_bit = 0
    pixel_representation = -1
    imager_pixel_spacing = "0"
    
    # Open a file
    img_file = get_image_by_study_id(row['StudyInstanceUID'])
    img = pydicom.dcmread(img_file)
    
    # Get the file ID from the filename
    file = img_file.split("/")
    filename = file[-1].split(".")

    # TransferSyntaxUID (stored in the file meta header, not the main dataset)
    ts_uid = img.file_meta.TransferSyntaxUID
    
    # PhotometricInterpretation
    pm = img.PhotometricInterpretation
    
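    # Get the following tags using their group/element hex values. We _could_ use pydicom keyword (dot) notation here instead.
    # PatientSex
    # Indexing the element with [0] returns the first character of its value (e.g. 'M' or 'F')
    if (0x0010,0x0040) in img:
        pt_sex = img[0x0010,0x0040][0]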
        
    # Rows
    if (0x0028,0x0010) in img:
        rows = img[0x0028,0x0010].value
        
    # Columns
    if (0x0028,0x0011) in img:
        cols = img[0x0028,0x0011].value
    
    # BitsAllocated
    if (0x0028,0x0100) in img:
        bits_allocated = img[0x0028,0x0100].value
        
    # BitsStored
    if (0x0028,0x0101) in img:
        bits_stored = img[0x0028,0x0101].value
        
    # HighBit
    if (0x0028,0x0102) in img:
        high_bit = img[0x0028,0x0102].value
        
    # PixelRepresentation
    if (0x0028,0x0103) in img:
        pixel_representation = img[0x0028,0x0103].value
        
    # ImagerPixelSpacing
    if (0x0018,0x1164) in img:
        imager_pixel_spacing = str(img[0x0018,0x1164].value[0]) + "/" + str(img[0x0018,0x1164].value[1])
        
    new_row = [[row['StudyInstanceUID'], 
                filename[0], 
                ts_uid, 
                pm, 
                pt_sex, 
                rows, 
                cols, 
                bits_allocated, 
                bits_stored, 
                high_bit, 
                pixel_representation,
                imager_pixel_spacing]]
    
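    new_df = pd.DataFrame(new_row, columns=columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    tags_df = pd.concat([tags_df, new_df], ignore_index=True)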
    
tags_df.head()

In [None]:
# Check the original images dataframe shape
print("Images DF shape: " + str(images_df.shape))

# Check the tags dataframe shape
print("Tags DF shape: " + str(tags_df.shape))

In [None]:
# Export to CSV
tags_df.to_csv('dicom_tags.csv',index=False)