# SIIM-FISABIO-RSNA COVID-19 Detection

### Business Objective:
Currently, COVID-19 can be diagnosed via polymerase chain reaction (PCR), which detects genetic material from the virus,
or via chest radiograph. However, it can take a few hours, and sometimes days, before the molecular test results are back.
In this competition, we identify and localize COVID-19 abnormalities on chest radiographs.
In particular, we categorize the radiographs as negative for pneumonia or as typical,
indeterminate, or atypical for COVID-19. This will help radiologists diagnose the millions
of COVID-19 patients more confidently and quickly.


##### Below is the approach taken in this Case Study analysis.


Step 1: Importing the data.

Step 2: Inspecting the Train_Study_level dataframe.

Step 3: Checking for missing values and duplicates in the Train_Study_level dataset.

Step 4: Adding a new column, Study_Result, to the Train_Study_level dataset and performing EDA.

Step 5: Inspecting the Train_Image_level dataframe.

Step 6: Checking for missing values and duplicates in the Train_Image_level dataset.

Step 7: Feature engineering on the Train_Image_level dataset: creating new columns by splitting the data in the label column.

Step 8: Merging the Train_Study_level and Train_Image_level datasets.

Step 9: Performing EDA (both univariate and bivariate analysis) on the merged dataset.

# DataSet Information
train_study_level.csv - the train study-level metadata, with one row for each study, including correct labels.

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing Pandas and NumPy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing more modules
import seaborn as sns




In [None]:
train_study_level_data = pd.read_csv("../input/siim-covid19-detection/train_study_level.csv")

###### Step2: Inspecting Dataframe

In [None]:


# id: unique study identifier
# Negative for Pneumonia: 1 if the study is negative for pneumonia, 0 otherwise
# Typical Appearance: 1 if the study has this appearance, 0 otherwise
# Indeterminate Appearance: 1 if the study has this appearance, 0 otherwise
# Atypical Appearance: 1 if the study has this appearance, 0 otherwise


In [None]:
train_study_level_data.head()

In [None]:
train_study_level_data.shape

In [None]:
train_study_level_data.info()

In [None]:
#Above info shows that there are no null columns in the dataset.

In [None]:
#Check for duplicates
sum(train_study_level_data.duplicated(subset = 'id')) == 0
# No duplicate values

There are no duplicates in this dataset

##### Check for Missing Values

In [None]:
train_study_level_data.isnull().sum()

There are no missing values in the Train Study dataset.
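
The duplicate and missing-value checks above can be bundled into one small helper. The following is a minimal sketch on a toy frame; `quality_report` and the toy values are hypothetical and only illustrate the idea, they are not part of the original analysis.

```python
import pandas as pd

def quality_report(df, key):
    """Return the number of duplicated keys and the missing values per column."""
    return {
        "duplicate_keys": int(df.duplicated(subset=key).sum()),
        "missing_per_column": df.isnull().sum().to_dict(),
    }

# Toy stand-in for train_study_level.csv (values are made up).
toy = pd.DataFrame({
    "id": ["a_study", "b_study", "b_study"],
    "Typical Appearance": [1, 0, None],
})
report = quality_report(toy, key="id")
print(report)
```

On the real data the same call would be `quality_report(train_study_level_data, key="id")`.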

# Perform EDA

In [None]:
# Add a new feature "Study Result" which shows the result name of that row.

train_study_level_data["Study_Result"]='Negative'
train_study_level_data.loc[train_study_level_data['Typical Appearance']==1, 'Study_Result'] = 'Typical'
train_study_level_data.loc[train_study_level_data['Indeterminate Appearance']==1, 'Study_Result'] = 'Indeterminate'
train_study_level_data.loc[train_study_level_data['Atypical Appearance']==1, 'Study_Result'] = 'Atypical'

train_study_level_data.head(5)
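
As an alternative to the four assignments above, the same column can probably be derived in one step with `idxmax`. This is only a sketch on made-up rows, and it assumes that exactly one of the four label columns is 1 for every study, which the assert checks explicitly.

```python
import pandas as pd

label_cols = ["Negative for Pneumonia", "Typical Appearance",
              "Indeterminate Appearance", "Atypical Appearance"]

# Toy rows (made up); each row has exactly one column set to 1.
toy = pd.DataFrame(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    columns=label_cols,
)

# Sanity check for the assumption: exactly one label per study.
assert (toy[label_cols].sum(axis=1) == 1).all()

# idxmax returns the name of the column holding the 1; keep only its first word.
toy["Study_Result"] = toy[label_cols].idxmax(axis=1).str.split().str[0]
print(toy["Study_Result"].tolist())
```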

In [None]:
train_study_level_data["Study_Result"].value_counts()

##### Plot a bar plot to see the Study results.

In [None]:
train_study_level_data["Study_Result"].value_counts().plot(kind="bar")
plt.xlabel('Study Results')
plt.ylabel('Count')
plt.show()

The bar plot above shows that Typical Appearance is the most frequent study result.

### Analyse Train image dataset

In [None]:
train_image_level_data = pd.read_csv("../input/siim-covid19-detection/train_image_level.csv")

In [None]:
train_image_level_data.head(10)

##### Check for missing values in Train Image dataset

In [None]:
train_image_level_data.isnull().sum()

Only the boxes column has missing values (2040 of them). Let's analyse further to see when the boxes column is empty.
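
One way to investigate this is to compare the rows with a missing boxes value against the rows whose label starts with `none`. The sketch below uses a made-up toy frame; the string format of the `boxes` entries is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_image_level.csv (values are made up).
toy = pd.DataFrame({
    "boxes": ["[{'x': 1, 'y': 2, 'width': 3, 'height': 4}]", np.nan, np.nan],
    "label": ["opacity 1 1 2 4 6", "none 1 0 0 1 1", "none 1 0 0 1 1"],
})

missing_boxes = toy["boxes"].isnull()
is_none_label = toy["label"].str.startswith("none")

# Cross-tabulate the two conditions; non-zero off-diagonal cells would signal a mismatch.
check = pd.crosstab(missing_boxes, is_none_label)
print(check)

# True when the boxes column is missing exactly for the 'none' labels.
same = bool((missing_boxes == is_none_label).all())
print(same)
```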

In [None]:
#Check for duplicates
sum(train_image_level_data.duplicated(subset = 'id')) == 0
# No duplicate values

## Feature Engineering on Train Image DataSet

In the label column, each box is written as: [class ID] [confidence score] [bounding box]

Class ID: either opacity or none.

Confidence score: the confidence from your neural network model. If the class ID is none, the confidence is 1.

Bounding box: the typical xmin ymin xmax ymax format. If the class ID is none, the label is none 1 0 0 1 1, that is, confidence 1 followed by the bounding box 0 0 1 1.
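
The format described above can be parsed with a small function. This is a minimal sketch: `parse_label` is a hypothetical helper, the example string is made up, and it assumes that a label with several boxes simply repeats the six-token group once per box.

```python
def parse_label(label):
    """Split a label string into (class_id, confidence, xmin, ymin, xmax, ymax) tuples."""
    tokens = label.split()
    if len(tokens) % 6 != 0:
        raise ValueError(f"expected groups of 6 tokens, got {len(tokens)}")
    boxes = []
    for i in range(0, len(tokens), 6):
        class_id = tokens[i]
        confidence, xmin, ymin, xmax, ymax = map(float, tokens[i + 1:i + 6])
        boxes.append((class_id, confidence, xmin, ymin, xmax, ymax))
    return boxes

# Made-up example with two opacity boxes.
example = "opacity 1 10.5 20 110.5 220 opacity 1 300 40 500 260"
parsed = parse_label(example)
print(len(parsed), parsed[0])
```

Unlike taking `x.split()[2]` to `x.split()[5]`, this keeps every box in the label, not only the first one.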

### Based on the above interpretation, let's split the label column and start analysing the images

In [None]:
# 1. Class ID
# Class ID is either opacity or none.
# So let's check how many have opacity by splitting the class ID out as a new column


In [None]:
train_image_level_data['Class_ID'] = train_image_level_data.label.apply(lambda x: x.split()[0])
train_image_level_data['Class_ID'].value_counts().plot(kind="bar")
plt.xlabel('Class_ID')
plt.ylabel('Count')
plt.show()

In [None]:
#Column 2 is Confidence score: confidence from your neural network model. If none, the confidence is 1

In [None]:
train_image_level_data['Confidence_Score'] = train_image_level_data.label.apply(lambda x: x.split()[1])
train_image_level_data['Confidence_Score'].value_counts()


In [None]:
#Confidence Score is 1 for all the data.

In [None]:
#Column 3 holds the bounding box parameters. The boxes column gives xmin, ymin, width and height,
#while the label column gives xmin, ymin, xmax, ymax. Both describe the same boxes (xmax = xmin + width, ymax = ymin + height).

In [None]:
train_image_level_data['boxes'][0]

In [None]:
train_image_level_data['label'][0]
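
To see that the two columns describe the same box, the boxes entry can be converted to corner coordinates. The sketch below assumes that a boxes entry is a string holding a list of dicts with the keys `x`, `y`, `width` and `height`; the example values are made up and `boxes_to_corners` is a hypothetical helper.

```python
import ast

# Made-up example in the assumed format of the boxes column.
boxes_str = "[{'x': 10.5, 'y': 20.0, 'width': 100.0, 'height': 200.0}]"

def boxes_to_corners(boxes_str):
    """Convert x/y/width/height dicts to (xmin, ymin, xmax, ymax) tuples."""
    corners = []
    for box in ast.literal_eval(boxes_str):
        xmin, ymin = box["x"], box["y"]
        corners.append((xmin, ymin, xmin + box["width"], ymin + box["height"]))
    return corners

corners = boxes_to_corners(boxes_str)
print(corners)
```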

In [None]:
#So let's add new columns for xmin, ymin, xmax and ymax by retrieving the values from the label column

In [None]:
# Note: only the FIRST box of each label is extracted here;
# labels that contain several boxes keep the remaining ones only in the label string.
train_image_level_data['x_min'] = train_image_level_data.label.apply(lambda x: float(x.split()[2]))
train_image_level_data['y_min'] = train_image_level_data.label.apply(lambda x: float(x.split()[3]))
train_image_level_data['x_max'] = train_image_level_data.label.apply(lambda x: float(x.split()[4]))
train_image_level_data['y_max'] = train_image_level_data.label.apply(lambda x: float(x.split()[5]))




In [None]:
#Check the parameters now for 1 record.

train_image_level_data.head(1).T

In [None]:
#train_image_level_data[train_image_level_data['label']=='none'].value_counts().plot(kind='bar')
# OpacityCount = number of opacity boxes in the label (0 when the label is none)
train_image_level_data['OpacityCount']=train_image_level_data['label'].str.count('opacity')
train_image_level_data['OpacityCount']

# Merge the train study and train image datasets to link study results with image annotations


In [None]:

## The id column in train_study_level_data (without its '_study' suffix) matches the StudyInstanceUID column in train_image_level_data


train_study_level_data['StudyInstanceUID'] = train_study_level_data['id'].apply(lambda x: x.replace('_study', ''))
##del train_study_level_data['id']
train_Merged_df = train_study_level_data.merge(train_image_level_data, on='StudyInstanceUID')
train_Merged_df.head()
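
Before relying on the merged frame, it can help to verify the join itself. The sketch below uses made-up toy tables and the `validate` and `indicator` options of `DataFrame.merge`; note that it uses an outer join purely for checking, whereas the merge above is the default inner join.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up).
study = pd.DataFrame({
    "id": ["s1_study", "s2_study"],
    "Study_Result": ["Typical", "Negative"],
})
image = pd.DataFrame({
    "id": ["i1_image", "i2_image", "i3_image"],
    "StudyInstanceUID": ["s1", "s1", "s2"],
})

study["StudyInstanceUID"] = study["id"].str.replace("_study", "", regex=False)

merged = study.merge(
    image,
    on="StudyInstanceUID",
    how="outer",                    # keep unmatched rows so they can be counted
    validate="one_to_many",         # raises MergeError if a study id repeats
    indicator=True,                 # adds a _merge column: both / left_only / right_only
    suffixes=("_study", "_image"),  # both tables have an id column
)
print(merged["_merge"].value_counts())
```

If every row is marked `both`, no study is missing its images and no image is missing its study.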

## Perform Univariate and Bivariate analysis on the Merged DataSet.

### Univariate analysis of a few columns in the Merged Data

#### Univariate analysis for Negative for Pneumonia

In [None]:

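# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(train_Merged_df["Negative for Pneumonia"], kde=True)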


In [None]:
train_Merged_df["Negative for Pneumonia"].value_counts().plot(kind='bar')
plt.show()



#### Univariate distribution for Typical Appearance

In [None]:
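# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(train_Merged_df["Typical Appearance"], kde=True)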

In [None]:
train_Merged_df["Typical Appearance"].value_counts().plot(kind='bar')
plt.show()

#### UniVariate analysis for Indeterminate Appearance

In [None]:
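# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(train_Merged_df["Indeterminate Appearance"], kde=True)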

In [None]:
train_Merged_df["Indeterminate Appearance"].value_counts().plot(kind='bar')
plt.show()

#### UniVariate analysis for Atypical Appearance

In [None]:
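# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(train_Merged_df["Atypical Appearance"], kde=True)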

In [None]:
train_Merged_df["Atypical Appearance"].value_counts().plot(kind='bar')
plt.show()

#### Check UniVariate analysis for Opacity Count

In [None]:
train_Merged_df["OpacityCount"].value_counts().plot(kind='bar')
plt.show()

The plot above shows that the most common opacity count is 2.

#### Let's check what the study results are when opacity is present.

In [None]:
grouped_df = train_Merged_df[['Study_Result','Class_ID']]
grouped_df[grouped_df['Class_ID']=='opacity'].value_counts()
           

When opacity is present, most of the study results are Typical, but Indeterminate and Atypical results also occur.

In [None]:
grouped_df[grouped_df['Class_ID']=='opacity'].value_counts().plot(kind='bar')

In [None]:
#Check for None Class values

grouped_df[grouped_df['Class_ID']=='none'].value_counts()

In [None]:
grouped_df[grouped_df['Class_ID']=='none'].value_counts().plot(kind='bar')

When no opacity is present in the label column, most of the study results were Negative (1736 records), which indicates that there was no lung infection in most of these cases.
Surprisingly, there were also records with Typical (153 records), Atypical (92) and Indeterminate (59) study results despite a none class ID.
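
The two filtered `value_counts()` calls above can also be read off a single contingency table. This is a minimal sketch with `pd.crosstab` on a made-up toy frame; on the real data the inputs would be `train_Merged_df['Study_Result']` and `train_Merged_df['Class_ID']`.

```python
import pandas as pd

# Toy stand-in for the merged frame (values are made up).
toy = pd.DataFrame({
    "Study_Result": ["Typical", "Typical", "Negative", "Negative", "Atypical"],
    "Class_ID": ["opacity", "none", "none", "none", "opacity"],
})

# Rows: study result, columns: class ID, cells: number of images.
table = pd.crosstab(toy["Study_Result"], toy["Class_ID"])
print(table)
```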

## Let's check whether the data is imbalanced or not

In [None]:
#RATIO OF IMBALANCE 

frequency_values=train_study_level_data['Study_Result'].value_counts()

frequency_values.plot(kind='pie',autopct='%1.2f%%',fontsize=14,figsize=(6,6))

The pie chart above shows that close to 50% of the study results are Typical. The remaining 50% are split among Negative, Atypical and Indeterminate.
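
The imbalance can also be expressed as numbers instead of a chart. The sketch below computes the class shares and a simple majority-to-minority ratio on made-up labels; the name `imbalance_ratio` is introduced here for illustration only.

```python
import pandas as pd

# Toy labels (made up); on the real data use train_study_level_data["Study_Result"].
results = pd.Series(
    ["Typical"] * 5 + ["Negative"] * 3 + ["Indeterminate"] * 1 + ["Atypical"] * 1
)

shares = results.value_counts(normalize=True)  # fraction per class, largest first
imbalance_ratio = shares.max() / shares.min()  # majority share / minority share
print(shares.round(2).to_dict(), imbalance_ratio)
```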

#### Let's analyse further how the study result depends on the opacity count.

In [None]:
train_image_level_data['OpacityCount'].value_counts().plot(kind='bar')

In [None]:
train_Merged_df.head(1)

In [None]:
train_Merged_df['OpacityCount'].unique()

### When OpacityCount is 8, let's analyse the Study Result.

In [None]:
OpacityCount_8_df=train_Merged_df[train_Merged_df['OpacityCount']==8][['Study_Result','OpacityCount']]#.plot(kind='bar')

OpacityCount_8_df

In [None]:

#grouped_df[grouped_df['Class_ID']=='opacity'].value_counts().plot(kind='bar')
OpacityCount_8_df.value_counts().plot(kind='pie',autopct='%1.2f%%',fontsize=14,figsize=(4,4))
plt.show()

There is only one record with opacity count 8 and the study result is Typical.

### When OpacityCount is 5, let's analyse the Study Result.

In [None]:
OpacityCount_5_df=train_Merged_df[train_Merged_df['OpacityCount']==5][['Study_Result','OpacityCount']]
OpacityCount_5_df

In [None]:
OpacityCount_5_df.value_counts().plot(kind='pie',autopct='%1.2f%%',fontsize=14,figsize=(4,4))
plt.show()

There is only one record with opacity count 5 and the study result is Indeterminate.

In [None]:

OpacityCount_4_df=train_Merged_df[train_Merged_df['OpacityCount']==4][['Study_Result','OpacityCount',]]
OpacityCount_4_df

In [None]:
OpacityCount_4_df.value_counts().plot(kind='pie',autopct='%1.2f%%',fontsize=14,figsize=(4,4))
plt.show()

When OpacityCount is 4, 91.3% of the records have a Typical study result, 4.35% Atypical and 4.35% Indeterminate.

In [None]:

OpacityCount_3_df=train_Merged_df[train_Merged_df['OpacityCount']==3][['Study_Result','OpacityCount',]]


In [None]:
OpacityCount_3_df.value_counts().plot(kind='pie',autopct='%1.2f%%',fontsize=14,figsize=(4,4))
plt.show()

When OpacityCount is 3, 83.06% of the records have a Typical study result, 7.65% Atypical and
9.29% Indeterminate.


In [None]:
OpacityCount_2_df=train_Merged_df[train_Merged_df['OpacityCount']==2][['Study_Result','OpacityCount',]]
OpacityCount_2_df.value_counts().plot(kind='pie',autopct='%1.2f%%',fontsize=14,figsize=(4,4))
plt.show()

When OpacityCount is 2, 85.22% of the records have a Typical study result, 3.69% Atypical and
11.08% Indeterminate.

In [None]:
OpacityCount_1_df=train_Merged_df[train_Merged_df['OpacityCount']==1][['Study_Result','OpacityCount',]]
OpacityCount_1_df.value_counts().plot(kind='pie',autopct='%1.2f%%',fontsize=14,figsize=(4,4))
plt.show()

When OpacityCount is 1, 70.40% of the records have an Indeterminate study result, 2.77% Typical and
26.82% Atypical.

In [None]:
OpacityCount_0_df=train_Merged_df[train_Merged_df['OpacityCount']==0][['Study_Result','OpacityCount','label']]
OpacityCount_0_df.value_counts().plot(kind='pie',autopct='%1.2f%%',fontsize=14,figsize=(4,4))
plt.show()

When OpacityCount is 0, most study results are Negative (85.1%), but Typical (7.5%),
Atypical (4.5%) and Indeterminate (2.89%) results also occur.
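
The per-count pie charts above can be summarised in one table of row percentages. This is a sketch using `pd.crosstab` with `normalize="index"` on made-up values; on the real data the inputs would be `train_Merged_df['OpacityCount']` and `train_Merged_df['Study_Result']`.

```python
import pandas as pd

# Toy stand-in for the merged frame (values are made up).
toy = pd.DataFrame({
    "OpacityCount": [0, 0, 0, 0, 2, 2, 2, 2],
    "Study_Result": ["Negative", "Negative", "Negative", "Typical",
                     "Typical", "Typical", "Typical", "Indeterminate"],
})

# normalize="index" turns each row into shares that sum to 1; scale to percent.
pct = pd.crosstab(toy["OpacityCount"], toy["Study_Result"], normalize="index") * 100
print(pct.round(2))
```

Each row then gives the study result distribution for one opacity count, which avoids building a separate frame per count.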

### Analysing OpacityCount 0 data with Typical Study Result

In [None]:
OpacityCount_0_df[OpacityCount_0_df['Study_Result']=='Typical']

## Observations from Basic Data Analysis and EDA:

1. There were no missing values in the train_study_level dataset.


2. When EDA was performed on the train_study_level dataset, we observed that the data is highly imbalanced:
   close to 50% of the study results are Typical. The remaining 50% are split among Negative, Atypical and Indeterminate.
   

3. There were 2040 missing values in the boxes column of the Train_Image_level dataset.


4. There was no duplicate data.


After merging the two datasets, train_study_level and train_Image_level, we observed the following points.

5. When opacity is present in the label column, study results were mostly Typical (2854 records),
   followed by Indeterminate (1049 records) and Atypical (391 records).
   
    
6. When no opacity is present in the label column, most of the study results were Negative (1736 records),
    which indicates that there was no lung infection in most of these cases.
    Surprisingly, there were also records with Typical (153 records), Atypical (92) and Indeterminate (59)
    study results despite a none class ID.
    

7. Each study result is associated with multiple opacity values, which indicates either multiple images per study or
   multiple boxes in a single image.
    The OpacityCount values observed are 0, 1, 2, 3, 4, 5 and 8.
    
    
8. There is only one record with opacity count 8 and the study result is Typical.


9. There is only one record with opacity count 5 and the study result is Indeterminate.


10. When OpacityCount is 4, 91.3% of the records have a Typical study result, 4.35% Atypical and
    4.35% Indeterminate.
    
    
11. When OpacityCount is 3, 83.06% of the records have a Typical study result, 7.65% Atypical and
    9.29% Indeterminate.


12. When OpacityCount is 2, 85.22% of the records have a Typical study result, 3.69% Atypical and
    11.08% Indeterminate.


13. When OpacityCount is 1, 70.40% of the records have an Indeterminate study result, 2.77% Typical and
    26.82% Atypical.


14. When OpacityCount is 0, most study results are Negative (85.1%), but Typical (7.5%),
    Atypical (4.5%) and Indeterminate (2.89%) results also occur.
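
Observation 7 leaves open whether the multiple opacity values come from several images per study or several boxes per image. A per-study summary could separate the two cases. The sketch below runs on a made-up image-level frame; the aggregate names `n_images`, `total_boxes` and `max_boxes_per_image` are hypothetical.

```python
import pandas as pd

# Toy stand-in for train_image_level_data with the OpacityCount column (values are made up).
toy = pd.DataFrame({
    "id": ["i1_image", "i2_image", "i3_image", "i4_image"],
    "StudyInstanceUID": ["s1", "s1", "s2", "s3"],
    "OpacityCount": [2, 0, 1, 3],
})

per_study = toy.groupby("StudyInstanceUID").agg(
    n_images=("id", "nunique"),                   # images belonging to the study
    total_boxes=("OpacityCount", "sum"),          # opacity boxes across those images
    max_boxes_per_image=("OpacityCount", "max"),  # most boxes found on a single image
)
print(per_study)
```

Studies with `n_images` above 1 have several images, while a `max_boxes_per_image` above 1 points to several boxes on one image.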

