In [None]:

# print("\n... IMPORTS STARTING ...\n")
# print("\n\tVERSION INFORMATION")
# # Machine Learning and Data Science Imports
# import tensorflow as tf; print(f"\t\t– TENSORFLOW VERSION: {tf.__version__}");
# import tensorflow_addons as tfa; print(f"\t\t– TENSORFLOW ADDONS VERSION: {tfa.__version__}");


# # Built In Imports
# from kaggle_datasets import KaggleDatasets
# from collections import Counter
# from datetime import datetime
# from glob import glob
# import warnings
# import requests
# import imageio
# import IPython
# import urllib
# import zipfile
# import pickle
# import random
# import shutil
# import string
# import math
# import time
# import gzip
# import ast
# import sys
# import io
# import gc
# import re

# # Visualization Imports
# from matplotlib.colors import ListedColormap
# import matplotlib.patches as patches
# import plotly.graph_objects as go
# import plotly.express as px
# import seaborn as sns
# import matplotlib; print(f"\t\t– MATPLOTLIB VERSION: {matplotlib.__version__}");
# import plotly
# import PIL
    
# print("\n\n... IMPORTS COMPLETE ...\n")



### Background

In this competition, we are identifying and localizing COVID-19 abnormalities on chest radiographs. <br>**This is an object detection and classification problem.**

For each test image, you will be predicting a bounding box and class for all findings. 
* If you predict that there are no findings, you should create a prediction of **`"none 1 0 0 1 1"`** 
    * "none" is the class ID for no finding, and this provides a one-pixel bounding box with a confidence of 1.0


To make a prediction of one of the above labels, create a prediction string similar to the "none" class above: 
* i.e. **`atypical 1 0 0 1 1`**
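
The prediction-string format above can be sketched as a small helper. The names `format_prediction` and `_fmt` are made up for illustration and are not part of any competition API; the sketch only joins the class name, the confidence and the four box corners with spaces.

```python
def _fmt(value):
    """Write a number without trailing zeros, e.g. 1.0 -> '1' (hypothetical helper)."""
    return f"{value:.4f}".rstrip("0").rstrip(".")


def format_prediction(label, confidence, xmin, ymin, xmax, ymax):
    """Build one 'label conf xmin ymin xmax ymax' chunk of a prediction string."""
    values = [confidence, xmin, ymin, xmax, ymax]
    return " ".join([label] + [_fmt(v) for v in values])


# "No finding" and a study-level class both use the one-pixel box 0 0 1 1
no_finding = format_prediction("none", 1.0, 0, 0, 1, 1)
study_pred = format_prediction("atypical", 1.0, 0, 0, 1, 1)
print(no_finding)
print(study_pred)
```

Several findings for the same image would be concatenated with spaces, in the same way the training labels are stored.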

---

**MESSAGE FROM THE COMPETITION HOST ON LABEL AND BBOX DETAILS:**

In this challenge, the chest radiographs (CXRs) were categorized using a specific grading schema, based on a published paper:

[**Litmanovich DE, Chung M, Kirkbride RR, Kicska G, Kanne JP. Review of chest radiograph findings of COVID-19 pneumonia and suggested reporting language. Journal of thoracic imaging. 2020 Nov 14;35(6):354-60.**](https://journals.lww.com/thoracicimaging/Fulltext/2020/11000/Review_of_Chest_Radiograph_Findings_of_COVID_19.4.aspx)

Per the grading schema, chest radiographs are classified into one of four categories, which are mutually exclusive:

1. **Typical Appearance**: Multifocal bilateral, peripheral opacities with rounded morphology, lower lung–predominant distribution
2. **Indeterminate Appearance**: Absence of typical findings AND unilateral, central or upper lung predominant distribution
3. **Atypical Appearance**: Pneumothorax, pleural effusion, pulmonary edema, lobar consolidation, solitary lung nodule or mass, diffuse tiny nodules, cavity
4. **Negative for Pneumonia**: No lung opacities

Bounding boxes were placed on lung opacities, whether typical or indeterminate. Bounding boxes were also placed on some atypical findings including solitary lobar consolidation, nodules/masses, and cavities. Bounding boxes were not placed on pleural effusions, or pneumothoraces. No bounding boxes were placed for the negative for pneumonia category.

In cases of multiple adjacent opacities, we opted for one large bounding box, rather than multiple adjacent smaller boxes, to improve consistency in the labeling.

Annotators did have access to the COVID status for each patient, but were asked to adhere to the grading system above irrespective of the status. As such, some patients who were COVID negative still had chest radiographs with typical appearances. Similarly, some patients who were COVID positive had atypical appearances, or were negative for pneumonia (no lung opacities), because the grading system is based off the chest radiographic findings alone.

The goal in this challenge is to determine the appropriate category for each radiograph, as well as localize the lung opacities with a bounding box prediction.

---

Note that the images are in **DICOM** format, which means they contain additional data that might be useful for visualizing and classifying.

![Example Radiographs](https://i.imgur.com/QWmbhXx.png)

<br>

<b style="text-decoration: underline; font-family: Verdana;">DATASET INFORMATION</b>

The **train dataset** comprises **`6,334`** chest scans in **DICOM** format, which were de-identified to protect patient privacy. 

Note that all images are stored in paths with the form **`study/series/image`**. 
* The **`study`** ID here relates directly to the study-level predictions
* the **`image`** ID is the ID used for image-level predictions
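
As a rough sketch of the `study/series/image` layout, the three IDs can be recovered from a file path. The function name `split_dicom_path` is an assumption for illustration; the example path is one that is listed later in this notebook.

```python
from pathlib import PurePosixPath


def split_dicom_path(path):
    """Return (study_id, series_id, image_id) from a .../study/series/image.dcm path."""
    p = PurePosixPath(path)
    image_id = p.stem                 # file name without the '.dcm' suffix
    series_id = p.parent.name         # middle folder
    study_id = p.parent.parent.name   # top-level study folder
    return study_id, series_id, image_id


example = "/kaggle/input/siim-covid19-detection/train/00086460a852/9e8302230c91/65761e66de9f.dcm"
print(split_dicom_path(example))
```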

The **test dataset** is of roughly the same scale as the training dataset. 
* As this is a kernels-only competition, we should plan accordingly
* i.e. we should be able to infer on the entirety of the training dataset within the submission kernel

<br>

<b style="text-decoration: underline; font-family: Verdana;">DATA FILES</b>
> **`train_study_level.csv`**
> * **`id`** - unique study identifier
> * **Negative for Pneumonia** - **`1`** if the study is negative for pneumonia, **`0`** otherwise
> * **Typical Appearance** - **`1`** if the study has this appearance, **`0`** otherwise
> * **Indeterminate Appearance**  - **`1`** if the study has this appearance, **`0`** otherwise
> * **Atypical Appearance**  - **`1`** if the study has this appearance, **`0`** otherwise

> **`train_image_level.csv`**
> * **`id`** - unique image identifier
> * **`boxes`** - bounding boxes in easily-readable dictionary format
> * **`label`** - the correct prediction label for the provided bounding boxes
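
A minimal sketch of reading the `boxes` column, assuming each entry is a string holding a list of dictionaries with the keys `x`, `y`, `width` and `height`. The key names and the sample string below are assumptions for illustration, not a real row; `boxes_to_corners` is a hypothetical helper.

```python
import ast


def boxes_to_corners(boxes_str):
    """Parse a `boxes` cell and return [(xmin, ymin, xmax, ymax), ...]."""
    # Missing values (images without findings) are treated as "no boxes"
    if not isinstance(boxes_str, str) or not boxes_str.strip():
        return []
    boxes = ast.literal_eval(boxes_str)  # safer than eval() for text read from a CSV
    return [
        (b["x"], b["y"], b["x"] + b["width"], b["y"] + b["height"])
        for b in boxes
    ]


# Made-up sample in the assumed format
sample = "[{'x': 100.0, 'y': 200.0, 'width': 50.0, 'height': 25.0}]"
print(boxes_to_corners(sample))
```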

<br>

<h2 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: darkred; background-color: #ffffff;">1.2  THE GOAL</h2>

---

In this competition, you’ll identify and localize COVID-19 abnormalities on chest radiographs. In particular, you'll categorize each radiograph as one of **`4`** possible categories:
* **`negative for pneumonia`**, **`typical`**, **`indeterminate`**, or **`atypical`**

Predictions are made at both the study (multi-image) level and the image level.
     
You'll work with a dataset consisting of **`8,781`** scans that have been annotated by experienced radiologists. You can train your model with **`6,334`** independently-labeled images and you will be evaluated on a test set of **`2,447`** images. 

The challenge uses the standard PASCAL VOC 2010 mean Average Precision (mAP) at IoU > 0.5.
* Note that the linked document describes VOC 2012, which differs in some minor ways (e.g. there is no concept of "difficult" classes in VOC 2010). The P/R curve and AP calculations remain the same.
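
The IoU > 0.5 criterion can be illustrated with a short sketch. This is only the overlap test between one predicted and one ground-truth box, not the full mAP computation; the function names are chosen for illustration and boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width/height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a hit only if IoU is strictly above the threshold."""
    return iou(pred_box, gt_box) > threshold


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```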



 ### ADDITIONAL INFORMATION ON ABNORMALITIES

Negative for Pneumonia =  No lung opacities

Typical Appearance = Multifocal bilateral, peripheral opacities with rounded morphology, lower lung–predominant distribution

Indeterminate Appearance = Absence of typical findings AND unilateral, central or upper lung predominant distribution

Atypical Appearance = Pneumothorax, pleural effusion, pulmonary edema, lobar consolidation, solitary lung nodule or mass, diffuse tiny nodules, cavity

In [None]:
# TRAIN_CSV_PATH = "/kaggle/input/siim-covid19-updated-train-labels/updated_train_labels.csv"
# SS_CSV_PATH = "/kaggle/input/siim-covid19-updated-train-labels/updated_sample_submission.csv"

# print("\n\nCOMBINED AND EXPLODED TRAIN DATAFRAME\n\n")
# train_df = pd.read_csv(TRAIN_CSV_PATH)
# display(train_df)

# print("\n\nSAMPLE SUBMISSION DATAFRAME\n\n")
# ss_df = pd.read_csv(SS_CSV_PATH)
# display(ss_df)



### This work is adapted from the following existing notebooks and discussions:

https://www.kaggle.com/dschettler8845/covid-detection-studies-with-multiple-images-viz

https://www.kaggle.com/c/siim-covid19-detection/discussion/240878

https://www.kaggle.com/c/siim-covid19-detection/discussion/246597

for getting started and also see multi-image cases.

-----------------------

https://www.kaggle.com/jaideepvalani/basic-exploration-eda-duplicate-nonduplicate

https://www.kaggle.com/devanshchowdhury/eda-understand-data

for further EDA on study/image level rows


-----------------------

https://www.kaggle.com/aleksandramowio/getting-started-simple-eda

to read image data/voxels



-----------------------



In [None]:
import pandas as pd


In [None]:
study_df = pd.read_csv("../input/siim-covid19-detection/train_study_level.csv")


In [None]:
study_df.head()

In [None]:
study_df.sum()

In [None]:
study_df.describe()

In [None]:
study_df[~study_df.id.str.contains("study")]

In [None]:
study_df["id"] = study_df["id"].str.replace("_study", "")


In [None]:
study_df.head()

In [None]:
import os


In [None]:
def get_absolute_file_paths(directory):
    all_abs_file_paths = []
    for dirpath,_,filenames in os.walk(directory):
        for f in filenames:
            all_abs_file_paths.append(os.path.abspath(os.path.join(dirpath, f)))
    return all_abs_file_paths

In [None]:
from tqdm.notebook import tqdm; tqdm.pandas();


In [None]:
study_df["study_dir"] = "/kaggle/input/siim-covid19-detection/train/"+study_df["id"]
study_df["images_per_study"] = study_df.study_dir.progress_apply(lambda x: len(get_absolute_file_paths(x)))
# study_df["images_per_study"] = study_df.study_dir.apply(lambda x: len(get_absolute_file_paths(x)))


In [None]:
study_df.head()

In [None]:
# study_df.images_per_study.describe()
study_df.images_per_study.value_counts()


### where is train data?

In [None]:
row = study_df.iloc[0]

In [None]:
row

In [None]:
# ! ls /kaggle/input/siim-covid19-detection/train/ | grep 00086460a852
! ls /kaggle/input/siim-covid19-detection/train/00086460a852/9e8302230c91/65761e66de9f.dcm


### visualise a dicom file ?

### study can have multi images.

In [None]:
multiple_images_per_study_df = study_df[study_df.images_per_study>1].reset_index(drop=True)


In [None]:
multiple_images_per_study_df.head()

In [None]:
multiple_images_per_study_df.shape

In [None]:
import matplotlib.pyplot as plt


In [None]:
image_df = pd.read_csv("/kaggle/input/siim-covid19-detection/train_image_level.csv")


In [None]:
image_df.head()

In [None]:
all_image_ids = image_df.id.str.replace("_image", "")
bbox_image_ids = image_df.dropna().id.str.replace("_image", "")

In [None]:
# all_image_ids
len(all_image_ids)

In [None]:
all_image_ids[:10]

In [None]:
# bbox_image_ids
len(bbox_image_ids)

In [None]:
bbox_image_ids[:10]

### for this need to read dicom files 

In [None]:
import numpy as np; print(f"\t\t– NUMPY VERSION: {np.__version__}");


In [None]:
import pydicom
from pydicom import dcmread
from pydicom.pixel_data_handlers.util import apply_voi_lut


In [None]:
# # Installs
# !cp /kaggle/input/gdcm-conda-install/gdcm.tar .
# !tar -xvzf gdcm.tar
# !conda install --offline ./gdcm/gdcm-2.8.9-py37h71b2a6d_0.tar.bz2
# !rm -rf ./gdcm.tar


In [None]:
'''
def dicom2array(path, voi_lut=True, fix_monochrome=True):
    """ Convert dicom file to numpy array 
    
    Args:
        path (str): Path to the dicom file to be converted
        voi_lut (bool): Whether or not VOI LUT is available
        fix_monochrome (bool): Whether or not to apply monochrome fix
        
    Returns:
        Numpy array of the respective dicom file 
        
    """
    # Use the pydicom library to read the dicom file
    dicom = pydicom.read_file(path)
    
    # VOI LUT (if available by DICOM device) is used to 
    # transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
        
    # The XRAY may look inverted
    #   - If we want to fix this we can
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    
    # Normalize the image array and return
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data
'''


def dicom2array_2(fname, target_size=512, use_clahe=True, clip_limit=2., grid_size=(8,8)):
    dicom = pydicom.dcmread(fname)
    
    try:
        data = dicom.pixel_array
    except Exception as err:
        print('exception seen=', err)
        data = None
    
#     data = apply_voi_lut(dicom.pixel_array, dicom)
#     im = data - np.min(data)
#     im = 255. * im / np.max(im)
    
#     if dicom.PhotometricInterpretation == "MONOCHROME1": # check for inverted image
#         im = 255. - im
    
#     if use_clahe:
#         clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=grid_size)
#         climg = clahe.apply(im.astype('uint8'))
#         img = Image.fromarray(climg.astype('uint8'), 'L')
#     else:
#         img = Image.fromarray(im.astype('uint8'), 'L')
#     org_size = img.size
    
#     if max(img.size) > target_size:
#         img.thumbnail((target_size, target_size), Image.ANTIALIAS)
    
#     return np.asarray(img)
    return data


In [None]:
import cv2


In [None]:
from PIL import Image


In [None]:
for ix, case in multiple_images_per_study_df.head(20).iterrows():
    print('new case= \n ', case)
    dir_path = case.study_dir
    image_paths = get_absolute_file_paths(dir_path)
    
    selected = []
    rejected = []
    
    if len(image_paths) <= 4:
        plt.figure(figsize=(18,4))
        print('\n dir_path=', dir_path)
        
        study_id = dir_path.rsplit('/', 1)[1]
        print('\n study_id=', study_id)
        
        
        plt.suptitle(f"\n\nSTUDY: {study_id}", fontsize=16, fontweight="bold")
        print('\n image_paths=', image_paths)
        
        for j, x in enumerate(image_paths):
            print(j, x)
            if any(True for xx in all_image_ids if xx in x):
                title = "\nINCLUDED IN IMAGE LEVEL"
                if any(True for xx in bbox_image_ids if xx in x):
                    title += " - W/ BBOX!!!"
                    selected.append(x)
                    
            else:
                title = "xxxxxxxxxxxxxxxxxxxxxxx"
                
            data = dicom2array_2(x)
            
            if data is not None:
                plt.subplot(1,4,j+1)
                plt.imshow(data)
                plt.axis(False)
                plt.title(title, fontweight="bold")
                
        print('\n selected=', selected)
        # Not every study has an image with a bounding box, so guard the lookup
        if selected:
            selected_img = selected[0]
            selected_img_id = selected_img.split('/')[-1]
            selected_img_id = selected_img_id.replace('.dcm', '_image')
            print('selected_img_id=', selected_img_id)

            docs_img_selected = image_df[(image_df.StudyInstanceUID == study_id) & (image_df.id == selected_img_id)]
            print('\n docs_img_selected =\n', docs_img_selected.iloc[0])
        else:
            print('no image with a bounding box in this study')
        
            
    else:
        print('not processing this right now')
    
    '''
    elif len(image_paths)<=8:
        plt.figure(figsize=(18,8))
        plt.suptitle(f"\n\nSTUDY: {dir_path.rsplit('/', 1)[1]}", fontsize=16, fontweight="bold")
        for j, x in enumerate(image_paths):
            if any(True for xx in all_image_ids if xx in x):
                title = "\nINCLUDED IN IMAGE LEVEL"
                if any(True for xx in bbox_image_ids if xx in x):
                    title += " - W/ BBOX!!!"
            else:
                title = "xxxxxxxxxxxxxxxxxxxxxxx"
            plt.subplot(2,4,j+1)
            plt.imshow(dicom2array_2(x))
            plt.axis(False)
            plt.title(title, fontweight="bold")
    else:
        plt.figure(figsize=(18,12))
        plt.suptitle(f"\n\nSTUDY: {dir_path.rsplit('/', 1)[1]}", fontsize=16, fontweight="bold")
        for j, x in enumerate(image_paths):
            if any(True for xx in all_image_ids if xx in x):
                title = "\nINCLUDED IN IMAGE LEVEL"
                if any(True for xx in bbox_image_ids if xx in x):
                    title += " - W/ BBOX!!!"
            else:
                title = "xxxxxxxxxxxxxxxxxxxxxxx"
            plt.subplot(3,4,j+1)
            plt.imshow(dicom2array_2(x))
            plt.title(title, fontweight="bold")
            plt.axis(False)
    '''

    plt.tight_layout()
    plt.show()
#     break

    print('===============================\n')

### viz  another image

In [None]:
study_df.head()

In [None]:

for dir_path in study_df.study_dir.values:
    print('dir_path=', dir_path)
    image_paths = get_absolute_file_paths(dir_path)
    if len(image_paths)<=4:
        plt.figure(figsize=(18,4))
        plt.suptitle(f"\n\nSTUDY: {dir_path.rsplit('/', 1)[1]}", fontsize=16, fontweight="bold")
        for j, x in enumerate(image_paths):
            print(j,x)
            if any(True for xx in all_image_ids if xx in x):
                title = "\nINCLUDED IN IMAGE LEVEL"
                if any(True for xx in bbox_image_ids if xx in x):
                    title += " - W/ BBOX!!!"
            else:
                title = "xxxxxxxxxxxxxxxxxxxxxxx"
            plt.subplot(1,4,j+1)
            plt.imshow(dicom2array_2(x))
            plt.axis(False)
            plt.title(title, fontweight="bold")
    elif len(image_paths)<=8:
        plt.figure(figsize=(18,8))
        plt.suptitle(f"\n\nSTUDY: {dir_path.rsplit('/', 1)[1]}", fontsize=16, fontweight="bold")
        for j, x in enumerate(image_paths):
            if any(True for xx in all_image_ids if xx in x):
                title = "\nINCLUDED IN IMAGE LEVEL"
                if any(True for xx in bbox_image_ids if xx in x):
                    title += " - W/ BBOX!!!"
            else:
                title = "xxxxxxxxxxxxxxxxxxxxxxx"
            plt.subplot(2,4,j+1)
            plt.imshow(dicom2array_2(x))
            plt.axis(False)
            plt.title(title, fontweight="bold")
    else:
        plt.figure(figsize=(18,12))
        plt.suptitle(f"\n\nSTUDY: {dir_path.rsplit('/', 1)[1]}", fontsize=16, fontweight="bold")
        for j, x in enumerate(image_paths):
            if any(True for xx in all_image_ids if xx in x):
                title = "\nINCLUDED IN IMAGE LEVEL"
                if any(True for xx in bbox_image_ids if xx in x):
                    title += " - W/ BBOX!!!"
            else:
                title = "xxxxxxxxxxxxxxxxxxxxxxx"
            plt.subplot(3,4,j+1)
            plt.imshow(dicom2array_2(x))
            plt.title(title, fontweight="bold")
            plt.axis(False)
    
    plt.tight_layout()
    plt.show()
    break

### So for each test study, possible labels:

'Negative for Pneumonia' - 1676 have this - 28%


'Typical Appearance' - 2855 have this - 47%


'Indeterminate Appearance' - 1049 have this - 17%

'Atypical Appearance' - 474 have this - 8%


#### Overall Multi-class(4), unbalanced classification problem
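
One common way to account for this imbalance is inverse-frequency class weighting. The sketch below is only an illustration of that idea using the study counts quoted above; it is not something this notebook has trained with, and the variable names are assumptions.

```python
# Study-level label counts quoted above
counts = {
    "Negative for Pneumonia": 1676,
    "Typical Appearance": 2855,
    "Indeterminate Appearance": 1049,
    "Atypical Appearance": 474,
}

total = sum(counts.values())
n_classes = len(counts)

# weight = total / (n_classes * count): rare classes get larger weights,
# and the weighted counts still add up to the total number of studies
class_weights = {name: total / (n_classes * n) for name, n in counts.items()}

for name, weight in class_weights.items():
    print(f"{name}: {weight:.2f}")
```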



### Is it  a Multilabel Multiclass problem?



In [None]:
study_df.head()

In [None]:
study_df.columns

In [None]:
label_cols = ['Negative for Pneumonia', 'Typical Appearance',
              'Indeterminate Appearance', 'Atypical Appearance']
# Row-wise sum of the four one-hot label columns (vectorised, no apply needed)
study_df['all_sum'] = study_df[label_cols].sum(axis=1)

In [None]:
study_df.head()

In [None]:
study_df['all_sum'].value_counts()

### Is it  a Multilabel Multiclass problem?

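No. The host describes the four categories as mutually exclusive, so `all_sum` should be 1 for every study: this is a single-label, multi-class problem.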

### dig bit more into imagelevel df

In [None]:
image_df.shape

In [None]:
image_df.head()

In [None]:
image_df[['image_id','image_type']]= image_df['id'].str.split('_',expand=True)

In [None]:
image_df.head()

In [None]:
image_df.StudyInstanceUID.describe()

In [None]:
image_df.image_id.describe()

In [None]:
image_df.image_type.describe()

### so we have 6.3k images belonging to 6k cases

1 case has 9 images

In [None]:
image_df[image_df.StudyInstanceUID == '0fd2db233deb']

In [None]:
image_df[image_df.StudyInstanceUID == '0fd2db233deb'].iloc[3].label

In [None]:
image_df[image_df.StudyInstanceUID == '0fd2db233deb'].iloc[3].label.split()


In [None]:
image_df[image_df.StudyInstanceUID == '0fd2db233deb'].iloc[3].boxes

In [None]:
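import ast
# literal_eval parses the list-of-dicts string without executing arbitrary code
ast.literal_eval(image_df[image_df.StudyInstanceUID == '0fd2db233deb'].iloc[3].boxes)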

In [None]:
study_df[study_df.id == '0fd2db233deb']

In [None]:
study_df[study_df.id == '0fd2db233deb'].iloc[0].study_dir

### so opacity is one form of indeterminate appearance?

yes - Bounding boxes were placed on lung opacities, whether typical or indeterminate

### how many image labels are possible?

so far seen opacity and none
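
A label string can be split into groups of six tokens (class, confidence, xmin, ymin, xmax, ymax). The helper below is a sketch under that assumption; `parse_label` is a name chosen for illustration and the sample strings are made up.

```python
def parse_label(label):
    """Split a label string into (name, conf, xmin, ymin, xmax, ymax) tuples."""
    tokens = label.split()
    if len(tokens) % 6 != 0:
        raise ValueError(f"expected a multiple of 6 tokens, got {len(tokens)}")

    findings = []
    for i in range(0, len(tokens), 6):
        name = tokens[i]
        conf, xmin, ymin, xmax, ymax = map(float, tokens[i + 1:i + 6])
        findings.append((name, conf, xmin, ymin, xmax, ymax))
    return findings


print(parse_label("none 1 0 0 1 1"))
print(parse_label("opacity 1 10 20 30 40 opacity 1 50 60 70 80"))
```

The number of tuples returned is the number of boxes, so a 12-token label corresponds to two boxes.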


In [None]:
all_labels = list(image_df.label)
all_labels_list = [k.split(' ') for k in all_labels]


In [None]:
# all_labels_list[:3]

In [None]:
possible_lens = [len(k) for k in all_labels_list]

In [None]:
possible_values = set([k[0] for k in all_labels_list])

In [None]:
possible_values

In [None]:
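from collections import Counter
# Count how many label strings have each token length (6 tokens per box);
# a Counter avoids a KeyError if an unexpected length shows up
d = dict(Counter(possible_lens))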


In [None]:
d

### so bbox/label distribution

almost 50% have 1 bbox (can also be none), 

50% have 2 bbox (should not be none)

no other label possible

In [None]:
# Assign back instead of an in-place fillna on a column selection
image_df['boxes'] = image_df['boxes'].fillna(0)

In [None]:
image_df.head()

In [None]:
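import ast
# Only string entries hold boxes; filled (0) or missing entries count as zero boxes
image_df['len_boxes'] = image_df['boxes'].apply(lambda b: len(ast.literal_eval(b)) if isinstance(b, str) else 0)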

In [None]:
image_df.head()

In [None]:
image_df.len_boxes.value_counts()

In [None]:
image_df['label_type'] = image_df.label.str.split(expand=True)[0]

In [None]:
image_df.head()

In [None]:
image_df.label_type.value_counts()

### similar + finer bbox info  now

### look at 2 dfs together?

In [None]:
image_df.head()

In [None]:
study_df.head()

In [None]:
print(image_df.columns,study_df.columns)

In [None]:
image_study_df = pd.merge(image_df, 
                          study_df, 
                          left_on='StudyInstanceUID',
                          right_on='id')

In [None]:
image_study_df.shape

In [None]:
image_study_df.head()

In [None]:
# image_study_df['study_id'] = image_study_df['id_y']

In [None]:
image_study_df.head()

In [None]:
image_study_df[image_study_df['StudyInstanceUID']=='0fd2db233deb']

### so if any one image in a case shows an opacity, the complete case becomes opacity/indeterminate

### how to read an image data

In [None]:
image_study_df[image_study_df['StudyInstanceUID']=='0fd2db233deb'].iloc[3]

In [None]:
image_study_df[image_study_df['StudyInstanceUID']=='0fd2db233deb'].iloc[3].study_dir

In [None]:
image_study_df[image_study_df['StudyInstanceUID']=='0fd2db233deb'].iloc[3].image_id

In [None]:
study_dir = '/kaggle/input/siim-covid19-detection/train/0fd2db233deb'

In [None]:
! ls  /kaggle/input/siim-covid19-detection/train/0fd2db233deb/

In [None]:
! ls -r /kaggle/input/siim-covid19-detection/train/0fd2db233deb/*/

In [None]:
! ls /kaggle/input/siim-covid19-detection/train/0fd2db233deb/9025f953c3d2/26f643772090.dcm

### 3-level folder hierarchy here

study_id/series_id/image_id.dcm

In [None]:
path = '/kaggle/input/siim-covid19-detection/train/0fd2db233deb/9025f953c3d2/26f643772090.dcm'

In [None]:
# Use the pydicom library to read the dicom file
dicom = pydicom.dcmread(path)


In [None]:
data = dicom.pixel_array


In [None]:
data

In [None]:
data.shape

In [None]:
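# Show the file loaded above (this cell previously relied on leftover loop variables)
plt.imshow(dicom2array_2(path))
plt.axis(False)
plt.title("0fd2db233deb / 26f643772090", fontweight="bold")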


In [None]:
import matplotlib.pyplot as plt
plt.imshow(data, cmap='gray')
plt.show()

In [None]:
# this can help with increasing the contrast
from skimage import exposure
equ_img = exposure.equalize_hist(data)
plt.imshow(equ_img, cmap='gray')
plt.show()

### also add bboxes now

In [None]:
image_study_df[image_study_df['StudyInstanceUID']=='0fd2db233deb'].iloc[3]

In [None]:
eval(image_study_df[image_study_df['StudyInstanceUID']=='0fd2db233deb'].iloc[3].boxes)

In [None]:
image_study_df[image_study_df['StudyInstanceUID']=='0fd2db233deb'].iloc[3].label

In [None]:
box1 = image_study_df[image_study_df['StudyInstanceUID']=='0fd2db233deb'].iloc[3].label.split()

In [None]:
box1

In [None]:
img = data.copy()  # copy so cv2.rectangle does not draw on the original pixel array

In [None]:
img = cv2.rectangle(img,(int(float(box1[2])), int(float(box1[3]))), 
                    (int(float(box1[4])), int(float(box1[5]))),
                    color=(200, 0, 0), thickness=15)

In [None]:
plt.imshow(img, cmap='gray')
plt.show()

In [None]:
equ_img = exposure.equalize_hist(data)

equ_img = cv2.rectangle(equ_img,(int(float(box1[2])), int(float(box1[3]))), 
                    (int(float(box1[4])), int(float(box1[5]))),
                    color=(200, 0, 0), thickness=15)

In [None]:
plt.imshow(equ_img, cmap='gray')
plt.show()

### these images might not be of the same size - and may require resizing
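
If images are resized, the bounding boxes need the same scaling. A minimal sketch, assuming boxes are `(xmin, ymin, xmax, ymax)` in pixel coordinates and sizes are `(width, height)`; `scale_box` is a hypothetical helper, not something used elsewhere in this notebook.

```python
def scale_box(box, original_size, target_size):
    """Rescale a (xmin, ymin, xmax, ymax) box from original to target (width, height)."""
    orig_w, orig_h = original_size
    new_w, new_h = target_size
    sx = new_w / orig_w   # horizontal scale factor
    sy = new_h / orig_h   # vertical scale factor
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


# Made-up example: a 1000x2000 image squeezed to 500x500
print(scale_box((100, 200, 300, 400), (1000, 2000), (500, 500)))
```

Note that a non-uniform resize like the one in the example changes the aspect ratio of the box as well as its position.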

### images without boxes

In [None]:
image_study_df.shape

In [None]:
# image_study_df[image_study_df.boxes == 0].head()

In [None]:
# image_study_df[image_study_df.boxes == 0].shape

In [None]:
# image_study_df[image_study_df.boxes != 0].shape

In [None]:
# image_study_df[image_study_df.boxes != 0].head()

### There are 4294 images with boxes and 2040 images without any boxes.

In [None]:
nobox_docs = image_study_df[image_study_df.boxes == 0]

In [None]:
box_docs = image_study_df[image_study_df.boxes != 0]

In [None]:
nobox_docs.shape

In [None]:
box_docs.shape

In [None]:
nobox_docs.head()

In [None]:
box_docs.head()

In [None]:
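# 'study_id' was never created (the assignment above is commented out), so use the merge key
nobox_docs.StudyInstanceUID.describe()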

In [None]:
nobox_docs['Negative for Pneumonia'].value_counts()

In [None]:
nobox_docs['Typical Appearance'].value_counts()

In [None]:
box_docs['Negative for Pneumonia'].value_counts()

In [None]:
box_docs['Typical Appearance'].value_counts()

### if you have a box, you can't be negative for pneumonia

### TODO:

play with few more examples to see opacity vs. none

think of classifier