`I tried to solve this on Google Colab, but the dataset was larger than the available disk space.`

# My biggest mistake was not deciding up front which object detection model I would use, since the dataset and inputs have to be prepared to suit it.

Here I've decided to go ahead with SSD (Single Shot MultiBox Detector).

## 1. Problem Statement:
Identify and localize COVID-19 abnormalities on chest radiographs.


* Brief:
Five times more deadly than the flu, COVID-19 causes significant morbidity and mortality. Like other pneumonias, pulmonary infection with COVID-19 results in inflammation and fluid in the lungs. COVID-19 looks very similar to other viral and bacterial pneumonias on chest radiographs, which makes it difficult to diagnose. Your computer vision model to detect and localize COVID-19 would help doctors provide a quick and confident diagnosis. As a result, patients could get the right treatment before the most severe effects of the virus take hold.

## 2. Evaluation:
The challenge uses the standard PASCAL VOC 2010 mean Average Precision (mAP) at IoU > 0.5. Note that the linked document describes VOC 2012, which differs in some minor ways (e.g. there is no concept of "difficult" classes in VOC 2010). The P/R curve and AP calculations remain the same.
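
The IoU threshold in the metric can be made concrete with a small sketch. It assumes boxes are given as `(x, y, width, height)` tuples, mirroring the keys of the `boxes` column; the function names `iou` and `is_true_positive` are illustrative and not part of any competition toolkit.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, width, height)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh

    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a match when its IoU with the ground truth exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold
```

The full mAP also depends on ranking predictions by confidence and integrating the precision/recall curve, which this sketch does not cover.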

In this competition, we are making predictions at both a study (multi-image) and image level.

## 3. Data Description

### Dataset information
The train dataset comprises 6,334 chest scans in DICOM format, which were de-identified to protect patient privacy. All images were labeled by a panel of experienced radiologists for the presence of opacities as well as overall appearance.

Note that all images are stored in paths with the form study/series/image. The study ID here relates directly to the study-level predictions, and the image ID is the ID used for image-level predictions.

The hidden test dataset is of roughly the same scale as the training dataset.

### Columns

`train_study_level.csv`

* id - unique study identifier
* Negative for Pneumonia - 1 if the study is negative for pneumonia, 0 otherwise
* Typical Appearance - 1 if the study has this appearance, 0 otherwise
* Indeterminate Appearance  - 1 if the study has this appearance, 0 otherwise
* Atypical Appearance  - 1 if the study has this appearance, 0 otherwise


`train_image_level.csv`

* id - unique image identifier
* boxes - bounding boxes in easily-readable dictionary format
* label - the correct prediction label for the provided bounding boxes
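
As a rough illustration of how the `label` column could be unpacked, the sketch below assumes each label is a space-separated string made of groups of six tokens: class name, confidence, xmin, ymin, xmax, ymax. That layout is an assumption about this dataset (the notebook itself only uses the first token), and `parse_label` is a hypothetical helper name.

```python
def parse_label(label):
    """Split a label string into a list of detections.

    Assumed layout (not confirmed by this notebook): repeated groups of
    'class confidence xmin ymin xmax ymax' separated by spaces.
    """
    tokens = label.split()
    if len(tokens) % 6 != 0:
        raise ValueError("expected groups of 6 tokens, got %d tokens" % len(tokens))

    detections = []
    for start in range(0, len(tokens), 6):
        name, conf, xmin, ymin, xmax, ymax = tokens[start:start + 6]
        detections.append({
            "class": name,
            "confidence": float(conf),
            "box": (float(xmin), float(ymin), float(xmax), float(ymax)),
        })
    return detections


# Made-up example values, for illustration only
example = parse_label("opacity 1 10.5 20 110.5 220 opacity 1 300 40 400 240")
```

If the assumption holds, corner coordinates like these are what an SSD-style pipeline would typically consume after normalising them by image width and height.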


## Let's start by importing the basic libraries and the dataset



In [None]:
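# Install python-gdcm so that pydicom can decode compressed DICOM pixel data
!pip install python-gdcm
print("python-gdcm install step finished")
# Install tensorflow-io, which adds DICOM decoding ops to TensorFlow
!pip install tensorflow-io
print("tensorflow-io install step finished")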

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import ast #helps to process trees of the Python abstract syntax grammar.
import pydicom # for working with DICOM files such as medical images, reports, and radiotherapy objects.
import matplotlib.pyplot as plt
%matplotlib inline
import PIL # Python Imaging Library
from PIL import Image, ImageDraw, ImageFont #Python Imaging Library
import tensorflow as tf

import tensorflow_hub as hub
import wandb # experiment tracking, dataset versioning, and model management
import seaborn as sns
import tqdm # visualise progress
import cv2 #convert dicom to png




In [None]:
# Collect the paths of all training DICOM files
import os

t_image_fnames = []
path = "/kaggle/input/siim-covid19-detection/train/"
for root, dirs, filenames in os.walk(path):
    for fname in filenames:
        t_image_fnames.append(os.path.join(root, fname))

train_image_level = pd.read_csv("/kaggle/input/siim-covid19-detection/train_image_level.csv")
train_study_level = pd.read_csv("/kaggle/input/siim-covid19-detection/train_study_level.csv")
len(t_image_fnames)

In [None]:
# Collect the paths of all test DICOM files
import os

test_image_fnames = []
path1 = "/kaggle/input/siim-covid19-detection/test/"
print(len(os.listdir(path1)))  # number of top-level (study) folders
for root, dirs, filenames in os.walk(path1):
    for fname in filenames:
        test_image_fnames.append(os.path.join(root, fname))
len(test_image_fnames)

In [None]:
# Cross-check that the number of image file paths matches the number of image IDs
if len(train_image_level.id) == len(t_image_fnames):
    print("Counts match")
else:
    print("Counts differ:", len(train_image_level.id), "IDs vs", len(t_image_fnames), "files")

train_image_level.head()

# Check for any missing data

In [None]:
train_image_level.isnull().sum()


In [None]:
train_study_level.isnull().sum()

## Bounding boxes are missing for some images, which means no abnormality was annotated in those cases

In [None]:
n_dup = train_image_level.StudyInstanceUID.duplicated().sum()
print(f"There are {n_dup} images that refer to duplicated study IDs")

In [None]:
X = t_image_fnames
y = train_image_level["label"]
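
One thing to watch here: `os.walk` does not promise to return files in the same order as the rows of `train_image_level`, so indexing `X` and the DataFrame with the same row number may pair an image with another image's label. A minimal sketch for aligning the two, assuming each image `id` is the DICOM file name without its extension plus an `_image` suffix (an assumption about this dataset, worth checking against `train_image_level.head()`); `build_id_to_path` and `align_paths` are hypothetical helper names.

```python
import os


def build_id_to_path(paths, suffix="_image"):
    """Map '<file stem><suffix>' to the full file path."""
    mapping = {}
    for p in paths:
        stem = os.path.splitext(os.path.basename(p))[0]
        mapping[stem + suffix] = p
    return mapping


def align_paths(image_ids, paths, suffix="_image"):
    """Return the paths reordered to follow image_ids (KeyError if an id has no file)."""
    id_to_path = build_id_to_path(paths, suffix)
    return [id_to_path[image_id] for image_id in image_ids]


# Made-up paths and ids, for illustration only
demo_paths = ["train/s2/r2/bbb.dcm", "train/s1/r1/aaa.dcm"]
demo_ids = ["aaa_image", "bbb_image"]
demo_aligned = align_paths(demo_ids, demo_paths)
```

In the notebook this would be used as `X = align_paths(train_image_level["id"], t_image_fnames)`, after which `X[row]` and `train_image_level.loc[row]` describe the same image.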


In [None]:
# Display a single training image (pydicom and matplotlib are imported above)
plt.figure(figsize=(10, 8))
image = pydicom.dcmread(X[21])
plt.imshow(image.pixel_array, cmap=plt.cm.bone)

In [None]:
## Function to Display 25 Images
def show_25_images(images):
    """
    Displays a 5x5 grid of the first 25 DICOM images in `images` (no labels are drawn)
    """
    
    # setup the figure
    plt.figure(figsize = (10,10))
    
    # loop through 25 files to display 25 images
    for i in range(25):
        # Create subplots ( 5 rows , 5 columns)
        ax = plt.subplot(5,5,i+1)
        # display an image
        image = pydicom.dcmread(images[i])
        plt.imshow(image.pixel_array,cmap = plt.cm.bone)
        plt.axis("off")

In [None]:
show_25_images(X[20:])

In [None]:
import ast
boxes = ast.literal_eval(train_image_level.loc[3,'boxes'])
boxes

# Function to display 9 images with bounding boxes, if present


In [None]:
def display_image_and_box(image):
    """
    Takes the index of the first image as input and displays it together with
    the next 8 images, drawing the bounding boxes where they are present.
    """
    from matplotlib.patches import Rectangle
    fig, axs = plt.subplots(3, 3, figsize=(20, 16))
    fig.subplots_adjust(hspace=.1, wspace=.1)
    axs = axs.ravel()
    for i, row in enumerate(range(image, image + 9)):
        # X[row] is paired with row `row` of train_image_level; this assumes
        # the file list is in the same order as the CSV rows
        dt_file = pydicom.dcmread(X[row])
        img = dt_file.pixel_array
        boxes_str = train_image_level.loc[row, 'boxes']
        if pd.notna(boxes_str):  # boxes is NaN when no bounding box was annotated
            for box in ast.literal_eval(boxes_str):
                p = Rectangle((box['x'], box['y']), box['width'], box['height'],
                              ec='r', fc='none', lw=2.)
                axs[i].add_patch(p)
        axs[i].imshow(img, cmap=plt.cm.bone)
        axs[i].set_title(train_image_level.loc[row, 'label'].split(' ')[0])
        axs[i].set_xticklabels([])
        axs[i].set_yticklabels([])

In [None]:
display_image_and_box(19)

# **Extract metadata for all the images into a separate DataFrame**

In [None]:
from pydicom import dcmread
from pydicom.tag import Tag
from tqdm import tqdm
pvt_creator1 = Tag(0x2001, 0x10) #Private Creator 1
pvt_creator2 = Tag(0x0903, 0x10) #Private Creator 2

columns = [ 'StudyID',
 'StudyInstanceUID',
 'PatientSex', 
 'BitsAllocated',
 'BitsStored',
 'Columns',
 'Rows',
 'BodyPartExamined', 
 'HighBit', 
 'ImageType',
 'ImagerPixelSpacing',
 'InstanceNumber',
 'Modality',
 'PatientID',
 'PatientName',
 'AccessionNumber',
 'DeidentificationMethod',
 #'DeidentificationMethodCodeSequence',
 'PhotometricInterpretation',
 'PixelRepresentation',
 'SOPClassUID',
 'SOPInstanceUID',
 'SamplesPerPixel',
 'SeriesInstanceUID',
 'SeriesNumber',
 'SpecificCharacterSet',
 'StudyDate',
 'StudyTime',
 'PrivateCreator1',
 'PrivateCreator2']

def extract_metadata(columns, files):
    """Read the header of each DICOM file and return one DataFrame row per file."""
    rows = []
    for file in tqdm(files):
        row = {}
        # stop_before_pixels skips the pixel data, so only the header is read
        dicom_img = dcmread(file, stop_before_pixels=True)
        for col in columns:
            if col not in ['PrivateCreator1', 'PrivateCreator2']:
                row[col] = dicom_img.get(col)  # None if the element is absent
        for name, tag in [('PrivateCreator1', pvt_creator1),
                          ('PrivateCreator2', pvt_creator2)]:
            try:
                row[name] = dicom_img.get_item(tag).value
            except (AttributeError, KeyError):
                pass  # private tag not present in this file
        rows.append(row)
    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    return pd.DataFrame(rows, columns=columns)

In [None]:
#train_df = extract_metadata(columns,X)

In [None]:
#train_df.head().T

In [None]:
#test_df = extract_metadata(columns,test_image_fnames)

In [None]:
"""
train_df['Rows'] = train_df['Rows'].astype(int)
train_df['Columns'] = train_df['Columns'].astype(int)
test_df['Rows'] = test_df['Rows'].astype(int)
test_df['Columns'] = test_df['Columns'].astype(int)
train_df.to_csv('train_imgs_meta.csv', index=None)
test_df.to_csv('test_imgs_meta.csv', index=None)

"""

# Convert the DICOM files to PNG of custom size for model input



In [None]:

# Collect the training file paths again and strip the input prefix,
# leaving paths relative to the train folder (study/series/image.dcm)
import os

s = []
path = "/kaggle/input/siim-covid19-detection/train/"
for root, dirs, filenames in os.walk(path):
    for fname in filenames:
        s.append(os.path.join(root, fname))

ss = [x.replace(path, '') for x in s]
ss[:10]

In [None]:

"""outdir = "./custompng/train"
#os.mkdir(outdir)



for f in X[:10]:   # remove "[:10]" to convert all images 
    
    ds = pydicom.read_file(f) # read dicom image
    img = ds.pixel_array # get image array
    print(f)
    cv2.imwrite(outdir + f.replace('.dcm','.png'),img) # write png image
    
    
    """



In [None]:
import cv2
import os
import pydicom

inputdir = '../input/siim-covid19-detection/train/'
outdir = './custompng/train'

i = 0
for i, rel_path in enumerate(tqdm(ss), start=1):   # use ss[:10] to convert only the first 10 images
    ds = pydicom.dcmread(os.path.join(inputdir, rel_path))  # read dicom image
    img = ds.pixel_array  # get image array
    out_path = os.path.join(outdir, rel_path.replace('.dcm', '.png'))
    # cv2.imwrite does not create missing folders, so create them first
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    cv2.imwrite(out_path, img)  # write png image

print(i)  # number of images converted
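
The conversion above writes each image at its native resolution. If the PNGs are later resized to a fixed model input size, the bounding boxes have to be scaled by the same factors or they will no longer line up with the image. A minimal sketch, assuming boxes are dictionaries with `x`, `y`, `width` and `height` keys as in the `boxes` column; `scale_box` is a hypothetical helper and the sizes in the example are made up.

```python
def scale_box(box, orig_size, new_size):
    """Scale a box dict to a resized image.

    orig_size and new_size are (width, height) tuples. Note that the DICOM
    'Rows' tag is the height and 'Columns' is the width.
    """
    sx = new_size[0] / orig_size[0]  # horizontal scale factor
    sy = new_size[1] / orig_size[1]  # vertical scale factor
    return {
        "x": box["x"] * sx,
        "y": box["y"] * sy,
        "width": box["width"] * sx,
        "height": box["height"] * sy,
    }


# Made-up example: a 1000x2000 (width x height) image resized to 500x500
scaled = scale_box({"x": 100, "y": 200, "width": 50, "height": 80},
                   orig_size=(1000, 2000), new_size=(500, 500))
```

Because the two axes are scaled independently, this also covers resizes that change the aspect ratio.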