# Covid-19 X-Ray Data Preprocessing

In [1]:
import pandas as pd
import os
import shutil
import random
import datetime as dt

Process images from the positive Covid-19 sample set (GitHub).
Link: https://github.com/ieee8023/covid-chestxray-dataset

We have taken the Covid images as of July 31st, 2020. This image set is updated regularly.

## Sampling GitHub X-Ray Images of Covid-19 Patients

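We first download the metadata file, which lists each image's filename, diagnosis (the `finding` column; not all images are Covid), and view. We are interested in the PA (posteroanterior) view only. The metadata URL is pinned to a specific commit so that the row counts below stay reproducible.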

In [17]:
metadata_url_latest = "https://raw.githubusercontent.com/ieee8023/covid-chestxray-dataset/master/metadata.csv"
metadata_url = "https://raw.githubusercontent.com/ieee8023/covid-chestxray-dataset/59d85dfc206cf5159fd3f1a1cb5e2727ed95eac3/metadata.csv"
df_metadata = pd.read_csv(metadata_url)
print(df_metadata.shape)

(877, 29)


In [18]:
df_metadata.head()

Unnamed: 0,patientid,offset,sex,age,finding,RT_PCR_positive,survival,intubated,intubation_present,went_icu,...,modality,date,location,folder,filename,doi,url,license,clinical_notes,other_notes
0,2,0.0,M,65.0,COVID-19,Y,Y,N,N,N,...,X-ray,"January 22, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-a-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
1,2,3.0,M,65.0,COVID-19,Y,Y,N,N,N,...,X-ray,"January 25, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-b-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
2,2,5.0,M,65.0,COVID-19,Y,Y,N,N,N,...,X-ray,"January 27, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-c-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
3,2,6.0,M,65.0,COVID-19,Y,Y,N,N,N,...,X-ray,"January 28, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-d-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
4,4,0.0,F,52.0,COVID-19,Y,,N,N,N,...,X-ray,"January 25, 2020","Changhua Christian Hospital, Changhua City, Ta...",images,nejmc2001573_f1a.jpeg,10.1056/NEJMc2001573,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,diffuse infiltrates in the bilateral lower lungs,
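
Because not every row is a Covid case and several views are present, a cross-tabulation of `finding` against `view` is a quick way to see how many images a given filter would keep. The following is a minimal sketch on a small made-up frame; `toy` and its rows are assumptions for illustration only, and on the real data one would pass the `df_metadata` columns instead.

```python
import pandas as pd

# Hypothetical toy rows that mimic the two metadata columns we filter on
toy = pd.DataFrame({
    "finding": ["COVID-19", "COVID-19", "SARS", "COVID-19", "No Finding"],
    "view": ["PA", "AP", "PA", "PA", "PA"],
})

# Rows are diagnoses, columns are views, cells are image counts
counts = pd.crosstab(toy["finding"], toy["view"])
print(counts)

# Number of images that are both COVID-19 and PA in the toy frame
n_covid_pa = int(counts.loc["COVID-19", "PA"])
print(n_covid_pa)
```

Reading the `COVID-19` row and `PA` column of such a table gives the size of the positive class before any files are copied.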


### Create the Training Dataset for Covid Images

In [20]:
covid_raw = r"C:\Users\vijay\OneDrive\Documents\Projects\Covid19-Xray-Detection\raw\covid-chestxray-dataset-master\images"
dest_dir = r"C:\Users\vijay\OneDrive\Documents\Projects\Covid19-Xray-Detection\Covid_Dataset\Training\Covid"

# Create the destination folder (and any missing parent folders) if needed
if not os.path.exists(dest_dir):
    os.makedirs(dest_dir)
    print("Covid X-ray folder created")
else:
    print("Folder already exists!")

Covid X-ray folder created


In the next step, we loop over the metadata and select the images that are classified as Covid-19 and were taken in the posteroanterior view (the beam passes from back to front). The matching images are copied into the new dataset folder.

The filter below returns the Covid-19 images with the PA view; the copy loop that follows counts 180 of them.

In [21]:
df_metadata[(df_metadata["finding"]=="COVID-19") & (df_metadata["view"]=="PA")]

Unnamed: 0,patientid,offset,sex,age,finding,RT_PCR_positive,survival,intubated,intubation_present,went_icu,...,modality,date,location,folder,filename,doi,url,license,clinical_notes,other_notes
0,2,0.0,M,65.0,COVID-19,Y,Y,N,N,N,...,X-ray,"January 22, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-a-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
1,2,3.0,M,65.0,COVID-19,Y,Y,N,N,N,...,X-ray,"January 25, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-b-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
2,2,5.0,M,65.0,COVID-19,Y,Y,N,N,N,...,X-ray,"January 27, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-c-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
3,2,6.0,M,65.0,COVID-19,Y,Y,N,N,N,...,X-ray,"January 28, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-d-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
4,4,0.0,F,52.0,COVID-19,Y,,N,N,N,...,X-ray,"January 25, 2020","Changhua Christian Hospital, Changhua City, Ta...",images,nejmc2001573_f1a.jpeg,10.1056/NEJMc2001573,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,diffuse infiltrates in the bilateral lower lungs,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
651,347,3.0,F,39.0,COVID-19,,Y,N,N,N,...,X-ray,2020,Italy,images,covid-19-caso-95-1-15.png,,https://www.sirm.org/2020/05/20/covid-19-caso-95/,,,
653,347,13.0,F,39.0,COVID-19,,Y,N,N,N,...,X-ray,2020,Italy,images,covid-19-caso-95-3-14.png,,https://www.sirm.org/2020/05/20/covid-19-caso-95/,,Plenty of nuanced shaded glass areola are appr...,"Credit to Davide Stoppa, Federico Paltenghi, L..."
659,350b,4.0,M,30.0,COVID-19,,,,,,...,X-ray,2020,"Doha, Qatar",images,1141cc2b8b9cc394becce5d978b5a7_jumbo.jpeg,,https://radiopaedia.org/cases/covid-19-18?lang=us,CC BY-NC-SA,Presentation: Four days history of fever. Imag...,"Case courtesy of Dr Salah Aljilly, Radiopaedia..."
666,355,,M,40.0,COVID-19,,,,,Y,...,X-ray,2020,"Doha, Qatar",images,14d81f378173b86cc53f21d2d67040_jumbo.jpeg,,https://radiopaedia.org/cases/covid-19-pneumon...,CC BY-NC-SA,Presentation: Three days of high-grade fever w...,"Case courtesy of Dr Salah Aljilly, Radiopaedia..."


In [22]:
count = 0

# Copy every COVID-19 image with a PA view into the training folder
for _, row in df_metadata.iterrows():
    if row["finding"] == "COVID-19" and row["view"] == "PA":
        filename = row["filename"]
        img_orig_filepath = os.path.join(covid_raw, filename)
        img_new_filepath = os.path.join(dest_dir, filename)
        shutil.copy2(img_orig_filepath, img_new_filepath)
        count += 1
        print("Copying image to new path", count)
print(count)

Copying image to new path 1
Copying image to new path 2
Copying image to new path 3
Copying image to new path 4
Copying image to new path 5
Copying image to new path 6
Copying image to new path 7
Copying image to new path 8
Copying image to new path 9
Copying image to new path 10
Copying image to new path 11
Copying image to new path 12
Copying image to new path 13
Copying image to new path 14
Copying image to new path 15
Copying image to new path 16
Copying image to new path 17
Copying image to new path 18
Copying image to new path 19
Copying image to new path 20
Copying image to new path 21
Copying image to new path 22
Copying image to new path 23
Copying image to new path 24
Copying image to new path 25
Copying image to new path 26
Copying image to new path 27
Copying image to new path 28
Copying image to new path 29
Copying image to new path 30
Copying image to new path 31
Copying image to new path 32
Copying image to new path 33
Copying image to new path 34
...

In [27]:
print(count)

180
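
The copy step above stops with an exception if any listed file is missing from the local image folder. As a hedged alternative, the sketch below wraps the copy in a helper that skips absent files and reports them. The name `copy_images`, the demo file names, and the temporary folders are all assumptions made for illustration, not part of the original notebook.

```python
import os
import shutil
import tempfile

def copy_images(filenames, src_dir, dst_dir):
    """Copy each named file from src_dir to dst_dir; return (copied, missing)."""
    os.makedirs(dst_dir, exist_ok=True)
    copied, missing = [], []
    for name in filenames:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(dst_dir, name))
            copied.append(name)
        else:
            # Assumption: a listed file may be absent locally (e.g. an incomplete download)
            missing.append(name)
    return copied, missing

# Demo in a throwaway directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "raw")
    dst_dir = os.path.join(tmp, "Covid")
    os.makedirs(src_dir)
    for name in ["a.png", "b.png"]:
        with open(os.path.join(src_dir, name), "wb") as fh:
            fh.write(b"placeholder")
    copied, missing = copy_images(["a.png", "b.png", "c.png"], src_dir, dst_dir)
    copied_on_disk = sorted(os.listdir(dst_dir))

print(len(copied), missing)
```

With the notebook's variables, the call would look like `copy_images(df_metadata.loc[mask, "filename"], covid_raw, dest_dir)`, where `mask` stands for the finding and view filter shown earlier.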


## Sampling Kaggle X-Ray Images of Normal Lungs

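These X-ray images are taken from the Kaggle chest X-ray pneumonia dataset: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

We sample as many Normal images as there are Covid images (180) so that the two classes are balanced.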

In [23]:
kaggle_raw = r"C:\Users\vijay\Downloads\17810_23812_bundle_archive\chest_xray\train\NORMAL"
kaggle_dest_dir = r"C:\Users\vijay\OneDrive\Documents\Projects\Covid19-Xray-Detection\Covid_Dataset\Training\Normal"

In [25]:
# randomly shuffle the image names so that we can sample from them
img_names = os.listdir(kaggle_raw)
random.shuffle(img_names)
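
`random.shuffle` produces a different order on every run, so the sampled Normal images are not reproducible. One possible alternative, sketched here under stated assumptions, is to sort the names first (the order returned by `os.listdir` is arbitrary) and draw with a seeded generator. The helper `sample_names`, the seed value `42`, and the placeholder file names are illustrative choices, not part of the original notebook.

```python
import random

def sample_names(names, k, seed=42):
    """Return k names drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; leaves the global random state alone
    return rng.sample(sorted(names), k)  # sorting removes dependence on listing order

names = [f"IM-{i:04d}-0001.jpeg" for i in range(10)]  # placeholder names
first = sample_names(names, 4)
second = sample_names(list(reversed(names)), 4)
print(first == second)
```

Calling `sample_names(os.listdir(kaggle_raw), count)` would then select the same files on every run for a fixed seed.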

In [26]:
# check if destination folder exists; create it (and any missing parents) if not
if not os.path.exists(kaggle_dest_dir):
    os.makedirs(kaggle_dest_dir)
    print("Normal X-ray folder created")
else:
    print("Folder already exists!")

Normal X-ray folder created


In [28]:
# Copy as many Normal images as there are Covid images (count)
for i in range(count):
    img_name = img_names[i]
    img_src = os.path.join(kaggle_raw, img_name)
    img_dest = os.path.join(kaggle_dest_dir, img_name)
    shutil.copy2(img_src, img_dest)
    print("Copying normal xray image", i+1)

Copying normal xray image 1
Copying normal xray image 2
Copying normal xray image 3
Copying normal xray image 4
Copying normal xray image 5
Copying normal xray image 6
Copying normal xray image 7
Copying normal xray image 8
Copying normal xray image 9
Copying normal xray image 10
Copying normal xray image 11
Copying normal xray image 12
Copying normal xray image 13
Copying normal xray image 14
Copying normal xray image 15
Copying normal xray image 16
Copying normal xray image 17
Copying normal xray image 18
Copying normal xray image 19
Copying normal xray image 20
Copying normal xray image 21
Copying normal xray image 22
Copying normal xray image 23
Copying normal xray image 24
Copying normal xray image 25
Copying normal xray image 26
Copying normal xray image 27
Copying normal xray image 28
Copying normal xray image 29
Copying normal xray image 30
Copying normal xray image 31
Copying normal xray image 32
Copying normal xray image 33
Copying normal xray image 34
...

To create the testing set, we split the Normal and Covid images 80/20 and place the held-out 20% of each class in a separate folder.
With 180 images per class, the final breakdown is as follows:

*  Training: 144 images for each category (Covid & Normal)
*  Testing: 36 images for each category
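
The notebook does not show how the split was performed. A minimal sketch of one way to obtain these numbers is to shuffle the file names with a fixed seed and slice off the test fraction. The helper `split_names`, the seed, and the placeholder names are assumptions for illustration.

```python
import random

def split_names(names, test_fraction=0.2, seed=0):
    """Shuffle a sorted copy of names and split it into (train, test) lists."""
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

names = [f"img_{i:03d}.png" for i in range(180)]  # 180 placeholder names per class
train, test = split_names(names)
print(len(train), len(test))
```

For 180 names this yields 144 training and 36 testing names. The files named in `test` could then be moved with `shutil.move` into a testing folder for each class (the folder name is left to the reader, since the original does not state it).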

We will now load the images into Google Colab and train a CNN!