# Data Preparation Example: Open-i

## What is Open-i?

The Open-i service of the National Library of Medicine enables search and retrieval of abstracts and images (including charts, graphs, and clinical images) from the open-source literature and from biomedical image collections. Searches can use text queries as well as query images. Open-i provides access to over 3.7 million images from about 1.2 million PubMed Central® articles; 7,470 chest x-rays with 3,955 radiology reports; 67,517 images from the NLM History of Medicine collection; and 2,064 orthopedic illustrations.

The chest x-ray images from the Indiana University hospital network are available here:

- PNG images: [Link](https://openi.nlm.nih.gov/imgs/collections/NLMCXR_png.tgz)
- DICOM images: [Link](https://openi.nlm.nih.gov/imgs/collections/NLMCXR_dcm.tgz)
- Reports: [Link](https://openi.nlm.nih.gov/imgs/collections/NLMCXR_reports.tgz). To identify the images associated with a report, use the XML tag `parentImage` (its `id` attribute is what the extraction code below reads). More than one image can be associated with a report. Alternatively, use the API as described at https://openi.nlm.nih.gov/services.php?it=xg with the Indiana U. Chest X-rays (iu) filter.
- Term mapping to terminologies: [radiology_vocabulary_final.xlsx](https://openi.nlm.nih.gov/imgs/collections/radiology_vocabulary_final.xlsx)



## Download data

Download the PNG images and put them into the folder `./data/NLMCXR_png`. Download the XML reports into the folder `./data/ecgen-radiology`.
The image descriptions then need to be extracted from the XML files.
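
The download step can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper names `download`, `extract`, and `fetch_openi` are hypothetical, and the sketch assumes that the reports archive unpacks into an `ecgen-radiology/` subfolder and that the PNG archive contains the image files directly. Check the extracted layout before relying on it.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL taken from the download links listed above.
BASE_URL = "https://openi.nlm.nih.gov/imgs/collections/"


def download(url, dest):
    """Download url to dest unless the file is already present."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive, target_dir):
    """Unpack a .tgz archive into target_dir and return the member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(str(target_dir))
    return names


def fetch_openi(data_dir="data"):
    """Fetch the PNG images and XML reports into data_dir (large download)."""
    data_dir = Path(data_dir)
    png = download(BASE_URL + "NLMCXR_png.tgz", data_dir / "NLMCXR_png.tgz")
    extract(png, data_dir / "NLMCXR_png")
    reports = download(BASE_URL + "NLMCXR_reports.tgz", data_dir / "NLMCXR_reports.tgz")
    # Assumption: this archive unpacks into an ecgen-radiology/ subfolder.
    extract(reports, data_dir)
```

Calling `fetch_openi()` from the project folder would then populate `./data`. Already downloaded archives are skipped, so the call can be repeated after an interrupted run.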

## Set up Google Colab environment

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
cur_path = "/content/drive/MyDrive/open-i/"
os.chdir(cur_path)
!pwd

/content/drive/MyDrive/CL-medical-image-text-pair


## Extract the data

First, import the necessary modules.

In [3]:
import os
import sys
import xml.etree.ElementTree as ET
import glob
import random
import numpy as np
import pandas as pd

Set the XML input folder and create the TXT output folder.

In [4]:
xml_fold = 'data/ecgen-radiology'
txt_fold = xml_fold+'-txt'
if not os.path.exists(txt_fold):
    os.makedirs(txt_fold)
xml_texts = os.listdir(xml_fold+'/.')

Extract the findings and impression text from each report and save it as one TXT file per associated image.

In [60]:
for index, file in enumerate(xml_texts):
    # Progress output: index and file name are printed without a separator.
    print(str(index)+file)
    # ET.parse accepts a path and closes the file itself.
    tree = ET.parse(xml_fold+'/'+file)
    root = tree.getroot()

    # Reset per report so text from a previous file is never reused.
    findings = None
    impression = None
    for child in root.find('MedlineCitation').find('Article').find('Abstract'):
        # .get() avoids a KeyError when an element has no Label attribute.
        label = child.attrib.get('Label')
        if label == 'FINDINGS':
            findings = child.text
        elif label == 'IMPRESSION':
            impression = child.text

    # Collect the ids of all images attached to this report.
    image_ids = []
    for child in root:
        if child.tag == 'parentImage':
            image_ids.append(child.attrib['id'])

    # Skip reports that have no images or no text at all.
    if len(image_ids) == 0:
        continue
    elif findings or impression:
        for i in image_ids:
            txt_file = txt_fold+"/"+i+'.txt'
            # Write findings and impression into one file per image.
            with open(txt_file, 'w') as f_w:
                if findings:
                    f_w.write(findings+' ')
                if impression:
                    f_w.write(impression)

Streaming output truncated to the last 5000 lines.

1455507.xml
1456504.xml
1457496.xml
...
15242352.xml
15252344.xml
15262366.xml


Compile the table of image-text pairs.

In [61]:
file_list = glob.glob('./data/ecgen-radiology-txt/*txt')
name_list = [i.split('/')[-1].split('.')[0] for i in file_list]

In [62]:
# DataFrame.append was removed in pandas 2.0, so build the table in one step.
image_txt = pd.DataFrame(
    [{'image': i+'.png', 'text': i+'.txt'} for i in name_list],
    columns=['image', 'text'],
)

In [63]:
len(image_txt)

7430
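
The count above only reflects the text files that were written. A quick sanity check, sketched below with a hypothetical helper `match_pairs`, is to confirm that each text file has a PNG of the same name. The folder names in the usage note are the ones used earlier in this example; that every image id maps to `<id>.png` is an assumption taken from how the table is built.

```python
from pathlib import Path


def match_pairs(txt_dir, png_dir):
    """Split text files into those with and without a same-named PNG image."""
    txt_dir, png_dir = Path(txt_dir), Path(png_dir)
    pairs, missing = [], []
    for txt_path in sorted(txt_dir.glob("*.txt")):
        png_name = txt_path.stem + ".png"
        if (png_dir / png_name).is_file():
            pairs.append((png_name, txt_path.name))
        else:
            missing.append(txt_path.name)
    return pairs, missing
```

For example, `pairs, missing = match_pairs('data/ecgen-radiology-txt', 'data/NLMCXR_png')` returns the usable pairs and the text files whose image was not found. A non-empty `missing` list would point to an incomplete download or a naming mismatch.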

Save the table as a CSV file.

In [64]:
image_txt.to_csv('data/image_txt.csv',
                 index=False,
                 header=False  # write no header row
                 )

In [65]:
read_image_txt = pd.read_csv('data/image_txt.csv')

In [78]:
read_image_txt

|  | CXR54_IM-2145-1001.png | CXR54_IM-2145-1001.txt |
| --- | --- | --- |
| 0 | CXR54_IM-2145-1002.png | CXR54_IM-2145-1002.txt |
| 1 | CXR546_IM-2150-1001.png | CXR546_IM-2150-1001.txt |
| 2 | CXR546_IM-2150-2001.png | CXR546_IM-2150-2001.txt |
| 3 | CXR570_IM-2170-1001.png | CXR570_IM-2170-1001.txt |
| 4 | CXR570_IM-2170-1002.png | CXR570_IM-2170-1002.txt |
| ... | ... | ... |
| 7424 | CXR1452_IM-0291-1001.png | CXR1452_IM-0291-1001.txt |
| 7425 | CXR1452_IM-0291-2001.png | CXR1452_IM-0291-2001.txt |
| 7426 | CXR1430_IM-0277-2001.png | CXR1430_IM-0277-2001.txt |
| 7427 | CXR142_IM-0267-1001.png | CXR142_IM-0267-1001.txt |
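
In the table above, the first image-text pair appears as the column header. This is because the CSV was written without a header row, while `pd.read_csv` by default treats the first line as column names. The following minimal sketch reproduces the effect on a two-row toy table (the file names are placeholders) and shows that passing `header=None` together with `names` keeps every row as data.

```python
import io

import pandas as pd

# Toy table standing in for image_txt; the file names are placeholders.
pairs = pd.DataFrame({"image": ["a.png", "b.png"], "text": ["a.txt", "b.txt"]})

# Write without a header row, as in the cell above, but into memory.
buffer = io.StringIO()
pairs.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

# Default read: the first pair is consumed as column names.
default_read = pd.read_csv(io.StringIO(csv_text))

# Explicit read: no header in the file, column names supplied by hand.
explicit_read = pd.read_csv(
    io.StringIO(csv_text), header=None, names=["image", "text"]
)

print(len(default_read))   # 1
print(len(explicit_read))  # 2
```

Applied to this example, `pd.read_csv('data/image_txt.csv', header=None, names=['image', 'text'])` should return all pairs as rows.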
