<a href="https://colab.research.google.com/github/vignagajan/covid-xrays-predictor/blob/master/Compile_COVID_Xrays.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Get 'kaggle-chestxray-dataset' from repo
!git clone https://github.com/vignagajan/covid-xrays-predictor
%cd covid-xrays-predictor

Cloning into 'covid-xrays-predictor'...
remote: Enumerating objects: 5840, done.
remote: Total 5840 (delta 0), reused 0 (delta 0), pack-reused 5840
Receiving objects: 100% (5840/5840), 1.13 GiB | 15.71 MiB/s, done.
Checking out files: 100% (5858/5858), done.
/content/covid-xrays-predictor


## 1. Have a look at the data obtained

In [2]:
# Fetch relevant repository
!git clone 'https://github.com/ieee8023/covid-chestxray-dataset'

Cloning into 'covid-chestxray-dataset'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 3204 (delta 30), reused 49 (delta 17), pack-reused 3133
Receiving objects: 100% (3204/3204), 582.58 MiB | 15.38 MiB/s, done.
Resolving deltas: 100% (1266/1266), done.
Checking out files: 100% (991/991), done.


In [3]:
# Load metadata as data frame 
import pandas as pd

df = pd.read_csv('covid-chestxray-dataset/metadata.csv')

df.head()

Unnamed: 0,patientid,offset,sex,age,finding,survival,intubated,intubation_present,went_icu,in_icu,needed_supplemental_O2,extubated,temperature,pO2_saturation,leukocyte_count,neutrophil_count,lymphocyte_count,view,modality,date,location,folder,filename,doi,url,license,clinical_notes,other_notes,Unnamed: 28
0,2,0,M,65.0,COVID-19,Y,N,N,N,N,Y,,,,,,,PA,X-ray,"January 22, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-a-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
1,2,3,M,65.0,COVID-19,Y,N,N,N,N,Y,,,,,,,PA,X-ray,"January 25, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-b-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
2,2,5,M,65.0,COVID-19,Y,N,N,N,N,Y,,,,,,,PA,X-ray,"January 27, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-c-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
3,2,6,M,65.0,COVID-19,Y,N,N,N,N,Y,,,,,,,PA,X-ray,"January 28, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-d-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",,
4,4,0,F,52.0,COVID-19,,N,N,N,N,N,,,,,,,PA,X-ray,"January 25, 2020","Changhua Christian Hospital, Changhua City, Ta...",images,nejmc2001573_f1a.jpeg,10.1056/NEJMc2001573,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,diffuse infiltrates in the bilateral lower lungs,,


****

As you can see, finding, filename and view are the important columns for picking the COVID xray images out of all the images.

*   finding - The disease diagnosed in the patient
*   filename - Filename of the image
*   view - The view in which the xray was taken (radiology terminology)

From the dataframe, we have to extract the filenames of the rows where

finding == 'COVID-19' and view == 'PA'.

The view is restricted to PA because the kaggle dataset xrays are taken in the same view.
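
The selection rule above can be sketched as a pandas boolean mask. This is only an illustration on a tiny hand-made frame: the rows and file names below are invented for the example, and only the column names `finding`, `view` and `filename` come from the metadata shown earlier.

```python
import pandas as pd

# Toy stand-in for metadata.csv; these rows are invented for illustration.
toy = pd.DataFrame({
    "finding": ["COVID-19", "COVID-19", "SARS", "COVID-19"],
    "view": ["PA", "AP", "PA", "PA"],
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
})

# Keep only rows that are both COVID-19 and taken in the PA view.
mask = (toy["finding"] == "COVID-19") & (toy["view"] == "PA")
covid_pa_files = toy.loc[mask, "filename"].tolist()
print(covid_pa_files)
```

Both conditions must be wrapped in parentheses, because `&` binds more tightly than `==` in Python.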

***





## 2. Dataset Creation

In [4]:
import os
import shutil
import datetime

def covid_dataset(IMG_DIR, META_DATA, COVID_DIR):
  """Copy the COVID-19 PA-view images listed in the metadata into COVID_DIR."""

  # Create the output directory if it does not exist yet
  os.makedirs(COVID_DIR, exist_ok=True)
  # Load metadata
  df = pd.read_csv(META_DATA)
  # Select the COVID-19 rows taken in the PA view
  covid_pa = df[(df["finding"] == "COVID-19") & (df["view"] == "PA")]
  # Copy each selected image and count the copies
  total = 0
  for file_name in covid_pa["filename"]:
    img_path = os.path.join(IMG_DIR, file_name)
    covid_path = os.path.join(COVID_DIR, file_name)
    shutil.copy2(img_path, covid_path)
    total += 1
  # Total number of COVID xrays copied
  return total

In [5]:
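# Extract the same number of normal and pneumonia xrays as COVID xrays
def image_sampling(IN_DIR, OUT_DIR, num):
  """Copy the first `num` files (in sorted order) from IN_DIR to OUT_DIR."""

  os.makedirs(OUT_DIR, exist_ok=True)

  # Sort so the selection is the same on every run (os.listdir order is arbitrary).
  # Slicing also avoids an IndexError when fewer than `num` files exist.
  img_list = sorted(os.listdir(IN_DIR))[:num]

  total = 0

  for img_name in img_list:
    in_path = os.path.join(IN_DIR, img_name)
    out_path = os.path.join(OUT_DIR, img_name)
    shutil.copy2(in_path, out_path)
    total += 1

  # Number of images actually copied
  return total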
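
The function above takes the first `num` files from the directory listing. If a random subset is preferred, a seeded sampler is one possible alternative. The sketch below is not part of the original notebook: the name `sample_images_random` and its `seed` parameter are hypothetical, and it assumes the input directory contains only files. The demo runs on throwaway directories with placeholder files rather than real xrays.

```python
import os
import random
import shutil
import tempfile

def sample_images_random(in_dir, out_dir, num, seed=0):
    """Copy up to `num` randomly chosen files from in_dir to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    # Sort first so the same seed picks the same files on any filesystem.
    names = sorted(os.listdir(in_dir))
    picked = random.Random(seed).sample(names, min(num, len(names)))
    for name in picked:
        shutil.copy2(os.path.join(in_dir, name), os.path.join(out_dir, name))
    return len(picked)

# Demo on temporary directories filled with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(src)
    for i in range(5):
        with open(os.path.join(src, f"img_{i}.jpeg"), "w") as fh:
            fh.write("placeholder")
    copied = sample_images_random(src, os.path.join(tmp, "dst"), 3)
    copied_names = sorted(os.listdir(os.path.join(tmp, "dst")))
    # Asking for more files than exist copies everything that is available.
    copied_all = sample_images_random(src, os.path.join(tmp, "dst_all"), 10)
    print(copied, copied_names, copied_all)
```

Fixing the seed keeps the sampled dataset reproducible between runs, which matters when comparing models trained on it.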


In [6]:
def data_gen(IMG_DIR,META_DATA):

  # Use the date as the data directory name, since the source images are updated over time
  date = str(datetime.datetime.now())[:10]

  DATA_DIR = "data/covid-chestxray-images/"+date
  
  # Define directory structures
  COVID_DIR = DATA_DIR+"/COVID"
  NORMAL_DIR = DATA_DIR+'/NORMAL'
  PNEUMONIA_DIR = DATA_DIR+'/PNEUMONIA'

  # Create dataset and get amount of images
  covid_total = covid_dataset(IMG_DIR,META_DATA,COVID_DIR)
  normal_total = image_sampling('data/kaggle-chestxray-dataset/NORMAL',NORMAL_DIR,covid_total)
  pneumonia_total = image_sampling('data/kaggle-chestxray-dataset/PNEUMONIA',PNEUMONIA_DIR,covid_total)

  # Check whether the image counts are the same or different
  if (covid_total == normal_total) and (covid_total == pneumonia_total):
    print("Dataset is created with xray images of each COVID, NORMAL and PNEUMONIA :",covid_total)
  else:
    print("Dataset is created with")
    print(f"Covid images = {covid_total}, Normal images = {normal_total}, Pneumonia images = {pneumonia_total}")

  # Remove original dataset
  shutil.rmtree('covid-chestxray-dataset')

In [7]:
# Parent folder
DIR = 'covid-chestxray-dataset'
# Images folder
IMG_DIR = DIR+'/images'
# Metadata 
META_DATA = DIR+'/metadata.csv'

In [8]:
# Generate dataset and get status
data_gen(IMG_DIR,META_DATA)

Dataset is created with xray images of each COVID, NORMAL and PNEUMONIA : 201


**The COVID image count changes over time as the source repo is updated, so the count above will also vary.**
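
Because the count depends on the day the notebook is run, it can be useful to re-check the class balance on disk afterwards. The helper below is a minimal sketch, not part of the original notebook: `count_per_class` is a hypothetical name, and it assumes one sub-directory per class, as in the COVID/NORMAL/PNEUMONIA layout created above. The demo uses a temporary directory with placeholder files.

```python
import os
import tempfile

def count_per_class(data_dir):
    """Return {class_name: number_of_files} for each sub-directory of data_dir."""
    counts = {}
    for name in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, name)
        if os.path.isdir(class_dir):
            # Count regular files only, ignoring any nested directories.
            counts[name] = sum(
                os.path.isfile(os.path.join(class_dir, f))
                for f in os.listdir(class_dir)
            )
    return counts

# Demo on a temporary layout that mimics the three class folders.
with tempfile.TemporaryDirectory() as tmp:
    for cls in ("COVID", "NORMAL", "PNEUMONIA"):
        os.makedirs(os.path.join(tmp, cls))
        for i in range(2):
            with open(os.path.join(tmp, cls, f"{i}.jpeg"), "w") as fh:
                fh.write("placeholder")
    counts = count_per_class(tmp)
    balanced = len(set(counts.values())) == 1
    print(counts, balanced)
```

In the notebook, the same helper could be pointed at the dated folder under `data/covid-chestxray-images/` to confirm that all three classes hold the same number of images.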