# **SIIM-COVID19 : DATA EXPLORATION**

In [None]:
from IPython.display import Image
Image("../input/headercovid/header.png")

<a href='https://fr.freepik.com/photos/fond'>Background photo created by kjpargeter - fr.freepik.com</a>

### *1. DATA PROCESSING*

Let's import the libraries we will use in this notebook.

In [None]:
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt
import matplotlib
import pydicom as dicom
import cv2
import ast
import warnings
from collections import Counter
import seaborn as sns
warnings.filterwarnings('ignore')

First of all, we define the path variable and list the files and folders it contains:

In [None]:
path = '/kaggle/input/siim-covid19-detection/'
os.listdir(path)

It's time to load the data into two different dataframes:

In [None]:
idf = pd.read_csv(path + 'train_image_level.csv')
sdf = pd.read_csv(path + 'train_study_level.csv')

It's always a good idea to check the first few lines of each dataframe.
We can already guess that the two dataframes can be joined on "StudyInstanceUID" from the image-level file and "id" from the study-level file. Note that the latter carries a "_study" suffix.

In [None]:
idf.head(3)

In [None]:
sdf.head(3)
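
The suspected join key can be checked before merging. The sketch below uses two small made-up frames (`toy_idf`, `toy_sdf` and their identifiers are hypothetical, not taken from the dataset) and assumes that every study `id` ends with the `_study` suffix:

```python
import pandas as pd

# Hypothetical miniature versions of the two files (made-up identifiers)
toy_idf = pd.DataFrame({'id': ['a1_image', 'a2_image', 'b1_image'],
                        'StudyInstanceUID': ['aaa', 'aaa', 'bbb']})
toy_sdf = pd.DataFrame({'id': ['aaa_study', 'bbb_study']})

# Strip the trailing "_study" suffix to recover the bare study identifier
study_keys = toy_sdf['id'].str.replace(r'_study$', '', regex=True)

# Both sides should then contain exactly the same set of study identifiers
same_keys = set(study_keys) == set(toy_idf['StudyInstanceUID'])
print("Same study identifiers on both sides :", same_keys)
```

If the two sets differ, the difference (`set(a) - set(b)`) shows which identifiers would be lost in the join.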

Let's display some basic statistics:

In [None]:
print("Nb of rows in study level file :", sdf.shape[0])
print("Nb of rows in image level file :", idf.shape[0])

print("Nb of StudyInstanceUID in image level file :", 
        len(idf['StudyInstanceUID'].unique()))

print("Nb of null StudyInstanceUID in image level file :",
        len(idf[idf['StudyInstanceUID'].isna()]))

print("Nb of StudyInstanceUID in study level file :", 
        len(sdf['id'].unique()))

There are fewer study IDs than image IDs, but the number of distinct study IDs is the same in both files. Does that mean that one study can have several images?
Let's check this out!

We first merge the two dataframes:

In [None]:
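# Left join on the study identifier: the "_study" suffix (6 characters) is
# stripped from the study-level id so that it matches StudyInstanceUID.
# how='left' keeps every image row, so unmatched rows show up as nulls.
mdf = pd.merge(idf, sdf, how='left',
               left_on=idf['StudyInstanceUID'], 
               right_on=sdf['id'].map(lambda x: x[:-6]))
mdf.head(3)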

Then, to check the result of the join we just performed, we look for null values in the right-hand part of the merged dataframe:

In [None]:
print("Image rows without a matching study : ", mdf[mdf['id_y'].isna()].shape[0])
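
Note that `pd.merge` performs an inner join by default, which silently drops unmatched rows, so a null check is only informative on a left (or outer) join. One way to make the check explicit is the `indicator=True` option of `merge`. The following is a minimal sketch on made-up data (`toy_idf`, `toy_sdf` and their identifiers are hypothetical):

```python
import pandas as pd

# Hypothetical data: the third image refers to a study missing from toy_sdf
toy_idf = pd.DataFrame({'id': ['a1_image', 'b1_image', 'c1_image'],
                        'StudyInstanceUID': ['aaa', 'bbb', 'ccc']})
toy_sdf = pd.DataFrame({'id': ['aaa_study', 'bbb_study'],
                        'Typical Appearance': [1, 0]})

# Build an explicit join column instead of passing arrays as keys
toy_sdf['StudyInstanceUID'] = toy_sdf['id'].str.replace(r'_study$', '', regex=True)

# indicator=True adds a "_merge" column: 'both', 'left_only' or 'right_only'
checked = toy_idf.merge(toy_sdf, on='StudyInstanceUID', how='left',
                        suffixes=('_image', '_study'), indicator=True)

unmatched = checked[checked['_merge'] == 'left_only']
print("Image rows without a matching study :", len(unmatched))
```

Joining on a named column also avoids the automatically generated `key_0` column.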

Everything seems to be OK, so we can clean up the dataframe resulting from the join.

In [None]:
mdf = mdf.drop(['key_0', 'id_y'], axis = 1)
mdf = mdf.rename(columns={'id_x': 'imageID',
                          'Negative for Pneumonia': 'negative',
                          'Typical Appearance': 'typical',
                          'Indeterminate Appearance': 'indeterminate',
                          'Atypical Appearance': 'atypical'})
mdf.head(3)

Finally, we count how many studies are associated with one image, with two images, and so on.

In [None]:
cnt = Counter(mdf.groupby('StudyInstanceUID')['imageID'].nunique())
utotal = 0
btotal = 0
for key in cnt:
    print('Nb of studies associated to {} images : {}'.format(key, cnt[key]))
    utotal += cnt[key]
    btotal += key * cnt[key]
print("\nTotal of unique studies : ", utotal)
print("Total of images : ", btotal)


Most studies are associated with a single image, but there are a few cases where the same study can be associated with 2 or more images. We will see this in more detail in the exploratory analysis.
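
To look at the multi-image studies themselves rather than only their counts, the same `groupby` can be filtered. This is a small sketch on made-up data (`toy_mdf` and its identifiers are hypothetical); the column names follow the renamed merged dataframe:

```python
import pandas as pd

# Hypothetical merged frame: study "aaa" has two images
toy_mdf = pd.DataFrame({'imageID': ['a1_image', 'a2_image', 'b1_image', 'c1_image'],
                        'StudyInstanceUID': ['aaa', 'aaa', 'bbb', 'ccc']})

# Number of distinct images per study
images_per_study = toy_mdf.groupby('StudyInstanceUID')['imageID'].nunique()

# Studies associated with more than one image
multi_image_studies = images_per_study[images_per_study > 1]
print(multi_image_studies)

# Same distribution as the Counter above: {images per study: number of studies}
distribution = images_per_study.value_counts().sort_index()
print(distribution)
```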

### *2. EXPLORATORY DATA ANALYSIS*

To understand what we are dealing with, let's take one image and compare the values of its label and boxes fields.

In [None]:
print("LABEL : ", mdf[mdf['imageID']=='000a312787f2_image'].label.iloc[0])
print("\nBOXES : ",mdf[mdf['imageID']=='000a312787f2_image'].boxes.iloc[0])

We can see several things:
- more than one box can be associated with an image,
- the boxes' coordinates are also listed in label, except that instead of the (x y w h) format, the coordinates in label are formatted as (x1 y1 x2 y2),
- finally, the diagnosis is not present in label, so the term "opacity" only tells us that we have to look up the corresponding class.
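
The relation between the two formats can be sketched as follows. The values are made up for illustration, and two assumptions are made: `boxes` is read from the CSV as a string holding a list of dicts with keys `x`, `y`, `width` and `height`, and each box in `label` occupies six tokens (class name, a number, then four coordinates). The helper names are hypothetical:

```python
import ast

# Made-up values for illustration (not taken from the dataset)
boxes_str = "[{'x': 10.0, 'y': 20.0, 'width': 30.0, 'height': 40.0}]"
label_str = "opacity 1 10.0 20.0 40.0 60.0"

def boxes_to_corners(boxes_str):
    """Convert (x, y, width, height) boxes to (x1, y1, x2, y2) corners."""
    boxes = ast.literal_eval(boxes_str)  # safely parse the string into a list of dicts
    return [(b['x'], b['y'], b['x'] + b['width'], b['y'] + b['height'])
            for b in boxes]

def label_to_corners(label_str):
    """Extract the (x1, y1, x2, y2) corners from a label string."""
    tokens = label_str.split()
    # Assumed layout: 6 tokens per box, coordinates are the last 4
    return [tuple(float(v) for v in tokens[i + 2:i + 6])
            for i in range(0, len(tokens), 6)]

print(boxes_to_corners(boxes_str))
print(label_to_corners(label_str))
```

In this toy case both functions return the same corners, since x2 = x + width and y2 = y + height.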

The last check is to verify whether an image can have more than one diagnosis:

In [None]:
print('Nb of images with more than one diagnosis : ', 
        len(mdf[mdf['negative']+mdf['typical']+mdf['indeterminate']+mdf['atypical']>1]))

We now have everything we need to start plotting.
<br>Let's add two pieces of information: the diagnosis as a categorical variable and the number of spots (that is, boxes) per image.

In [None]:
mdf['nbSpot'] = mdf['label'].apply(lambda lab: lab.count('opacity'))
mdf['diagnosis'] = mdf[['negative','typical','indeterminate','atypical']].idxmax(axis=1)
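
To see what these two derived columns contain, here is a minimal sketch on a made-up frame (`toy` and its label strings are hypothetical). It assumes that exactly one of the four flags is set to 1 in each row, in which case `idxmax(axis=1)` returns the name of that column:

```python
import pandas as pd

# Hypothetical rows: one image with two boxes, one image without any box
toy = pd.DataFrame({'label': ['opacity 1 0 0 5 5 opacity 1 6 6 9 9',
                              'none 1 0 0 1 1'],
                    'negative':      [0, 1],
                    'typical':       [1, 0],
                    'indeterminate': [0, 0],
                    'atypical':      [0, 0]})

# One "opacity" token per box, so counting the word gives the number of boxes
toy['nbSpot'] = toy['label'].str.count('opacity')

# Name of the column holding the largest value in each row
toy['diagnosis'] = toy[['negative', 'typical',
                        'indeterminate', 'atypical']].idxmax(axis=1)

print(toy[['nbSpot', 'diagnosis']])
```

If a row had several flags set, `idxmax` would return only the first matching column, which is why the uniqueness check comes first.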

**Graph series 1: Diagnosis distribution**

In [None]:
diagnosisColors = ['b','r','g','y']

plt.figure(figsize = (14, 6))

plt.suptitle('Diagnosis distribution')
plt.subplot(1, 2, 1)

sns.set();
ax=sns.countplot(x = mdf['diagnosis'].sort_values(), palette = diagnosisColors)

for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), 
                    (p.get_x()+0.2, p.get_height()+50))

plt.subplot(1, 2, 2)

mdf['diagnosis'].value_counts(normalize=True).sort_index().plot(kind='pie', 
                                                 colors = diagnosisColors, 
                                                 explode = [0.025, 0.025, 0.025, 0.2],
                                                 autopct = lambda x : str(round(x, 2)) + '%')

plt.show()



We can already see that the "typical" diagnosis represents almost half of the dataset. Conversely, the "atypical" diagnosis represents only 7.63%.

In [None]:
def plotDiagnosis(text, color):
    # Count plot of the number of spots per image for one diagnosis,
    # drawn on the currently active subplot
    ax = sns.countplot(x = mdf[mdf[text]==1].nbSpot, color=color)

    # Write the count above each bar
    for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), 
                    (p.get_x()+0.2, p.get_height()+3))
    plt.ylabel('')
    plt.xlabel('nb ' + text + ' spots per image')
    
fig = plt.figure(figsize=(15,4))

plt.subplot(1, 3, 1)
plotDiagnosis('atypical', diagnosisColors[0])

plt.subplot(1, 3, 2)
plotDiagnosis('indeterminate', diagnosisColors[1])

plt.subplot(1, 3, 3)
plotDiagnosis('typical', diagnosisColors[3])

plt.show()

"atypical" and "indeterminate" diagnosis follow more or less the same pattern: a majority of images have 1 single identified spot, then come the images with 2 spots.
For the "typical" diagnosis, on the other hand, the vast majority of images have 2 spots.
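
The same comparison can also be read as a table rather than from three separate plots. The sketch below uses `pd.crosstab` on made-up data (`toy` and its values are hypothetical and do not reproduce the dataset's counts):

```python
import pandas as pd

# Hypothetical diagnosis / number-of-spots pairs
toy = pd.DataFrame({'diagnosis': ['typical', 'typical', 'atypical',
                                  'indeterminate', 'typical'],
                    'nbSpot':    [2, 2, 1, 1, 1]})

# Raw counts: one row per diagnosis, one column per number of spots
table = pd.crosstab(toy['diagnosis'], toy['nbSpot'])
print(table)

# Row-wise shares, to compare patterns independently of class size
shares = pd.crosstab(toy['diagnosis'], toy['nbSpot'], normalize='index')
print(shares.round(2))
```

Normalizing by row makes the diagnoses comparable even though their sizes differ widely.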