Quick summary of the kernel below - 
1. DICOM files are dead slow to load, so we will need to convert them into something faster.
2. Null values are present only in the bounding-box columns.
3. There is a lot of data loss if we remove all the null values from the dataset.
4. Out of 15000 patient scans, we are left with only ~4400 image ids (4394) if we remove the null-value rows.
5. There is a lot of class imbalance: cases with "No finding" as ground truth drastically outnumber the other classes (check below for the bar plot).
6. For the height and width of the images, the maximum values range between 3200 - 3400 and the minimum values between 800 - 927.
7. All the DICOM images are in MONOCHROME1 type (quite obvious... it's an x-ray).


In [None]:
#!pip install pydicom
#import necessary packages 

import os 
import numpy as np 
import pydicom 
import matplotlib.pyplot as plt 



Reading one file at random to see what metadata it contains 

In [None]:
with pydicom.dcmread("../input/vinbigdata-chest-xray-abnormalities-detection/test/004f33259ee4aef671c2b95d54e4be68.dicom") as ds:
  print(ds)

- Photometric Interpretation - MONOCHROME1: this indicates that the greyscale ranges from bright to dark with ascending pixel value, i.e. the lowest pixel values are displayed as white.

- The other attributes are pretty much self-explanatory; for more detailed information please refer to this site - https://dicom.innolitics.com/ciods .
I will focus only on reading and interpreting these images here. 
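
Because MONOCHROME1 maps low values to bright pixels, most plotting and image tooling (which assumes the opposite) will show the scan inverted. Below is a minimal sketch of flipping and rescaling a pixel array to 8-bit. `to_uint8` is a hypothetical helper name, not part of pydicom, and it works on a plain NumPy array with made-up values; with a real file the array would come from `ds.pixel_array` and the string from `ds.PhotometricInterpretation`.

```python
import numpy as np

def to_uint8(pixels, photometric="MONOCHROME1"):
    """Rescale a 2-D pixel array to 0..255 and invert it if it is MONOCHROME1.

    Hypothetical helper for illustration only.
    """
    arr = pixels.astype(np.float32)
    lo, hi = arr.min(), arr.max()
    if hi > lo:
        arr = (arr - lo) / (hi - lo)   # scale to 0..1
    else:
        arr = np.zeros_like(arr)       # constant image: avoid dividing by zero
    if photometric == "MONOCHROME1":
        arr = 1.0 - arr                # flip so that high values are bright
    return (arr * 255).round().astype(np.uint8)

# Toy 2x2 "scan" with made-up 12-bit style values
demo = np.array([[0, 1000], [2000, 4000]], dtype=np.uint16)
converted = to_uint8(demo, "MONOCHROME1")
print(converted)
```

Saving the converted arrays once (for example as PNG or `.npy`) is one way to avoid paying the DICOM loading cost on every pass over the data.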

In [None]:
import pandas as pd

train = pd.read_csv('../input/vinbigdata-chest-xray-abnormalities-detection/train.csv')
train.head()

okay... looks like there are some null values... let's have a look at them 

In [None]:
def missing_values_table(df):
    # Count and percentage of missing values per column
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only the columns that have missing values, sorted by percentage
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns

missing_values = missing_values_table(train)
print(missing_values)

46% of the values in these columns are missing, this is ... not good... will have to look at that. 
Put another way, we do not have bounding boxes for 46% of the rows. Maybe we can treat this as a semi-supervised learning problem ([link to research paper](https://arxiv.org/pdf/2005.07377.pdf))... or... a weakly-supervised learning problem (if that's a thing).


"SOME RANDOM THOUGHTS"

How to tackle it? - 
The ground truth here is the "class_name", which depends heavily on the bounding boxes (and a large share of those are missing here). 

We can divide this training dataset into two sections: one which has the values for the bounding boxes as well as the ground truth (about 37000 rows), and the remaining rows without boxes (about 31000). We would train on the first, validate on the second, and try to correct the error. We cannot use the normal splitting strategies that we follow in other classification problems, as a hard split of the data is being made here.

(just a thought, I may be wrong)
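
As a rough sketch of that idea, the frame can be partitioned by image rather than by row, so that every image lands in exactly one of the two sections. The toy frame below is a made-up stand-in for `train.csv` (only the column names follow the real file), and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are invented
toy = pd.DataFrame({
    "image_id":   ["a", "a", "b", "b", "c"],
    "class_name": ["Nodule/Mass", "Nodule/Mass", "No finding", "No finding", "Cardiomegaly"],
    "x_min":      [10.0, 12.0, None, None, 5.0],
})

# Images that have at least one bounding box, and images that have none
has_box = toy["x_min"].notna()
boxed_ids = set(toy.loc[has_box, "image_id"])
unboxed_ids = set(toy["image_id"]) - boxed_ids

with_boxes = toy[toy["image_id"].isin(boxed_ids)]
without_boxes = toy[toy["image_id"].isin(unboxed_ids)]

print(len(with_boxes), "rows with boxes,", len(without_boxes), "rows without")
```

Splitting on `image_id` keeps all annotations of one scan together, which matters if the same image appears in several rows.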

If it happens that all the missing values are in the same rows, we can create a new dataframe out of the rows which have all the information present. For this purpose there is a python module called missingno ([github repo](https://github.com/ResidentMario/missingno)). Go ahead and have a look, cool stuff.
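
Before plotting, the same question can also be answered directly in pandas. This is a minimal sketch on a made-up frame; `nulls_align` is a hypothetical helper name, and the four column names are the box columns from `train.csv`.

```python
import pandas as pd

def nulls_align(df, cols):
    """Return True if, in every row, the given columns are all null or all non-null."""
    null_mask = df[cols].isnull()
    all_null = null_mask.all(axis=1)
    none_null = (~null_mask).all(axis=1)
    return bool((all_null | none_null).all())

box_cols = ["x_min", "y_min", "x_max", "y_max"]

# Toy data: the second row has no box at all, the others are complete
toy = pd.DataFrame({
    "x_min": [1.0, None, 3.0],
    "y_min": [2.0, None, 4.0],
    "x_max": [5.0, None, 6.0],
    "y_max": [7.0, None, 8.0],
})
aligned = nulls_align(toy, box_cols)
print(aligned)
```

If this returns `True` for the real data, dropping rows with any null in these columns removes exactly the rows that have no box.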

In [None]:
# Install missingno itself; the quilt example-data package is not needed for this analysis
!pip install missingno


#github repo - https://github.com/ResidentMario/missingno

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import missingno as msno
%matplotlib inline

df = train
missingdata_df = df.columns[df.isnull().any()].tolist()
msno.matrix(df[missingdata_df])

Well... the missing data is only in the bounding-box columns, nothing is missing from the ground truth. We can drop these rows and create a new dataset with no null values... but we will keep both datasets and see at modeling time how to use both of them efficiently. 

In [None]:
updated_data = train.dropna(how = "any", inplace = False)
updated_data.head()

In [None]:
# Compare the number of classes before and after dropping the null rows
print(len(train['class_id'].unique()))
print(len(updated_data['class_id'].unique()))

In [None]:
missing = missing_values_table(updated_data)
print(missing)

Cool... no missing values... easy life.. 

A big issue... How many images are we left with...??
I mean we did actually drop the rows with missing data, and most probably we are losing some images (not deleting them though, they are still present in the previous dataframe).. so out of 15000 images... how many are we left with? 

In [None]:
len(updated_data['image_id'].unique())

okay... not good... this is an important thing to take care of: out of the 15000 unique patient scans present, just 4394 scans have bounding boxes in them.

In [None]:
updated_data[['x_min', 'y_min','x_max', 'y_max']].describe(percentiles=[0.25, 0.5, 0.75, .95])


In [None]:
len(train['image_id'].unique())

In [None]:
folder = '../input/vinbigdata-chest-xray-abnormalities-detection/train/'
files = os.listdir(folder)
sex = []
width = []
height = []
filename = []
for name in files:
  # stop_before_pixels=True reads only the header, which is all we need here
  with pydicom.dcmread(os.path.join(folder, name), stop_before_pixels=True) as ds:
    sex.append(ds.PatientSex)
    width.append(ds.Columns)
    height.append(ds.Rows)
    filename.append(name)




In [None]:
unique_values = {'sex':sex , 'height':height , 'width':width}
unique_values_dataframe = pd.DataFrame.from_dict(unique_values)

In [None]:
unique_values_dataframe['sex'].unique()

In [None]:
import seaborn as sns  # needed here; this cell runs before the later import

plt.figure(figsize=(10,5))
sns.countplot(data=unique_values_dataframe ,y='sex')
plt.title('Sex distribution',fontsize=20)

In [None]:
unique_values_dataframe[['height', 'width']].describe(percentiles=[0.25, 0.5, 0.75, .95])


In [None]:
import seaborn as sns
plt.figure(figsize=(10,5))
sns.countplot(data=train ,y='class_name')
plt.title('Counts of the Classes',fontsize=20)
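
One common way to compensate for the kind of imbalance seen in this plot is to weight each class inversely to its frequency. The sketch below uses the `n_samples / (n_classes * count)` heuristic on made-up labels; `inverse_frequency_weights` is a hypothetical helper name, and whether weighting is the right remedy for this dataset is an open assumption.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

# Made-up label list that mimics a dominant "No finding" class
labels = ["No finding"] * 6 + ["Cardiomegaly"] * 3 + ["Nodule/Mass"] * 1
weights = inverse_frequency_weights(labels)

for cls, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls}: {w:.2f}")
```

With the real data the labels would come from `train['class_name']`, and the resulting dictionary could be passed to a loss function or sampler that accepts per-class weights.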