In [None]:
#%tensorflow_version 2.x
import tensorflow
tensorflow.__version__
## ignore warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from tqdm import tqdm, tqdm_notebook
import seaborn as sns
import pydicom as dcm
from glob import glob
from skimage.transform import resize
from skimage import io, measure
import cv2, random

import tensorflow as tf
from tensorflow import keras

In [None]:
!ls ../input

The key items in this folder are:
* `stage_2_train_labels.csv`: CSV file containing training set patientIds and labels (including bounding boxes)
* `stage_2_detailed_class_info.csv`: CSV file containing detailed labels (explored further below)
* `stage_2_train_images/`: directory containing the raw training set image (DICOM) files

Let's take a look at the labels CSV file first:

In [None]:
train_labels= pd.read_csv('../input/stage_2_train_labels.csv')
print('First five rows of Training set:\n', train_labels.head())
print(train_labels.iloc[0])

As you can see, each row in the CSV file contains a `patientId` (one unique value per patient), a target (either 0 or 1 for absence or presence of pneumonia, respectively) and the corresponding abnormality bounding box, defined by the (x, y) coordinate of its upper-left corner and its width and height. In the first row shown above, the patient does *not* have pneumonia, so the bounding box fields are set to `NaN`. Here is an example case with pneumonia:

In [None]:
print(train_labels.iloc[4])

Some information about the data fields present in `stage_2_train_labels.csv`:
*   **patientId** - A patientId. Each patientId corresponds to a unique image (which we will see a little bit later)
*   **x** - The upper-left x coordinate of the bounding box
*   **y** - The upper-left y coordinate of the bounding box
*   **width** - The width of the bounding box
*   **height** - The height of the bounding box
*   **Target** - The binary Target indicating whether this sample has evidence of pneumonia or not.

One important thing to keep in mind is that a given `patientId` may have **multiple** boxes if more than one area of pneumonia is detected (see below for example images).
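A minimal sketch of how multiple boxes per patient show up in the table, using a made-up toy frame (the IDs `p1` to `p3` and all coordinates are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for stage_2_train_labels.csv (IDs and coordinates are made up)
toy_labels = pd.DataFrame({
    'patientId': ['p1', 'p2', 'p2', 'p3'],
    'x': [np.nan, 264.0, 562.0, 323.0],
    'y': [np.nan, 152.0, 152.0, 577.0],
    'width': [np.nan, 213.0, 256.0, 160.0],
    'height': [np.nan, 379.0, 453.0, 104.0],
    'Target': [0, 1, 1, 1],
})

# Count only the rows that actually carry a box (Target == 1)
boxes_per_patient = (
    toy_labels[toy_labels['Target'] == 1]
    .groupby('patientId')
    .size()
)
print(boxes_per_patient.to_dict())
```

Here `p2` appears on two rows, one per box, while the negative case `p1` has a single row with empty box fields and is excluded from the count.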

In [None]:
# Number of entries in Train label dataframe:
print('The train_label dataframe has {} rows and {} columns.'.format(train_labels.shape[0], train_labels.shape[1]))

In [None]:
# Number of duplicates in patientId:
print('Number of unique patientId are: {}'.format(train_labels['patientId'].nunique()))

Thus, the 30227 rows in the dataset describe 26684 unique patients, which means some patients have multiple entries.

In [None]:
train_labels.drop_duplicates('patientId').shape[0]


In [None]:
one = train_labels[train_labels.Target == 1].drop_duplicates('patientId').shape[0]
zero = train_labels[train_labels.Target == 0].drop_duplicates('patientId').shape[0]
total = train_labels.drop_duplicates('patientId').shape[0]

In [None]:
print(f'No of entries which has Pneumonia: {one} i.e., {round(one/total*100, 0)}%')
# The share must be multiplied by 100 to be reported as a percentage
print(f'No of entries which don\'t have Pneumonia: {zero} i.e., {round(zero/total*100, 0)}%')
_ = train_labels.drop_duplicates('patientId')['Target'].value_counts().plot(kind = 'pie', autopct = '%.0f%%', labels = ['Negative', 'Positive'], figsize = (10, 6))

Thus, the pie chart shows that of the 26684 unique patients in the dataset, 20672 (77%) do not have pneumonia, whereas 6012 (23%) are positive cases.

In [None]:
# Checking nulls in bounding box columns:
print('Number of nulls in bounding box columns: {}'.format(train_labels[['x', 'y', 'width', 'height']].isnull().sum().to_dict()))

Thus, we can see that the number of nulls in each bounding box column equals the number of rows with Target = 0.
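Matching totals alone do not prove that the missing boxes sit on the Target = 0 rows. A row-wise check along the following lines would confirm it; the toy frame and the name `consistent` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the labels CSV (values are made up)
toy = pd.DataFrame({
    'x': [np.nan, 264.0, np.nan, 323.0],
    'width': [np.nan, 213.0, np.nan, 160.0],
    'Target': [0, 1, 0, 1],
})

# True for rows where any of the checked box fields is missing
box_missing = toy[['x', 'width']].isnull().any(axis=1)

# A box should be missing exactly on the rows where Target == 0
consistent = bool((box_missing == (toy['Target'] == 0)).all())
print(consistent)
```

On the real data the same comparison could be run over all four box columns.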

In [None]:
bounding_box = train_labels.groupby('patientId').size().to_frame('number_of_boxes').reset_index()
train_labels = train_labels.merge(bounding_box, on = 'patientId', how = 'left')
print('Number of patientIds per bounding box in the dataset: ')
(bounding_box.groupby('number_of_boxes').size().to_frame('number_of_patientId').reset_index().set_index('number_of_boxes').sort_values(by = 'number_of_boxes'))

Thus, 23286 unique patients have only one entry in the dataset, 3266 have 2 bounding boxes, 119 have 3 and 13 have 4. Note that a single entry is either one bounding box or a negative case with no box at all.

- **stage_2_detailed_class_info.csv**

It provides detailed information about the type of positive or negative class for each image.


In [None]:
class_labels = pd.read_csv('../input/stage_2_detailed_class_info.csv')
print('First five rows of Class label dataset are:\n', class_labels.head())

Some information about the data fields present in `stage_2_detailed_class_info.csv`:
*   **patientId** - A patientId. Each patientId corresponds to a unique image
*   **class** - Takes one of three values depending on the current state of the patient's lungs: 'No Lung Opacity / Not Normal', 'Normal' and 'Lung Opacity'.



In [None]:
# Number of entries in class_label dataframe:
print('The class_label dataframe has {} rows and {} columns.'.format(class_labels.shape[0], class_labels.shape[1]))

In [None]:
# Number of duplicates in patients:
print('Number of unique patientId are: {}'.format(class_labels['patientId'].nunique()))

Thus, the dataset contains information about 26684 patients (the same number as in the train_labels dataframe).

In [None]:
def get_feature_distribution(data, feature):
  # Count for each label
  label_counts = data[feature].value_counts()
  # Count the number of items in each class
  total_samples = len(data)
  print("Feature: {}".format(feature))
  for i in range(len(label_counts)):
    label = label_counts.index[i]
    count = label_counts.values[i]
    percent = int((count / total_samples) * 10000) / 100
    print("{:<30s}: {} which is {}% of the total data in the dataset".format(label, count, percent))

In [None]:
get_feature_distribution(class_labels, 'class')

In [None]:
figsize = (10, 6)
# Pass figsize to the plot call so the setting above actually takes effect
_ = class_labels['class'].value_counts().sort_index(ascending = False).plot(kind = 'pie', autopct = '%.0f%%', figsize = figsize).set_ylabel('')

In [None]:
# Checking nulls in class_labels:
print('Number of nulls in class columns: {}'.format(class_labels['class'].isnull().sum()))

Thus, the `class` column in class_labels has no missing values.

In [None]:
# Checking whether each patientId has only one type of class or not
class_labels.groupby(['patientId'])['class'].nunique().max()

Thus, we can say that each patientId is associated with only 1 class.
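Because the class is constant per `patientId`, the class table can be reduced to one row per patient and joined by key instead of by row position. The sketch below shows this on toy data with made-up IDs; it is one possible approach, not the one this notebook uses:

```python
import pandas as pd

# Toy versions of the two CSV files (IDs are made up)
labels = pd.DataFrame({'patientId': ['p1', 'p2', 'p2'], 'Target': [0, 1, 1]})
classes = pd.DataFrame({
    'patientId': ['p1', 'p2', 'p2'],
    'class': ['Normal', 'Lung Opacity', 'Lung Opacity'],
})

# Joining the raw tables pairs every p2 row with every p2 row (2 x 2), inflating the result
naive = labels.merge(classes, on='patientId')

# One row per patient is enough because the class never varies within a patientId
class_per_patient = classes.drop_duplicates('patientId')

# validate='many_to_one' makes pandas raise if the right side still has duplicate keys
merged = labels.merge(class_per_patient, on='patientId', how='left', validate='many_to_one')
print(len(naive), merged.shape)
```

The keyed join keeps one output row per label row regardless of how the two files are ordered.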

In [None]:
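# Combining the two datasets - 'train_labels' and 'class_labels' - column-wise.
# pd.concat with axis = 1 pairs rows by index, so this relies on both files listing rows in the same order.
training_data = pd.concat([train_labels, class_labels['class']], axis = 1)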
print('After merging, the dataset looks like: \n')
training_data.head()

In [None]:
print('After merge, the dataset has {} rows and {} columns.'.format(training_data.shape[0], training_data.shape[1]))

#### Target and Class

In [None]:
fig, ax = plt.subplots(nrows = 1, figsize = (12, 6))
temp = training_data.groupby('Target')['class'].value_counts()
data_target_class = pd.DataFrame(data = {'Values': temp.values}, index = temp.index).reset_index()
sns.barplot(ax = ax, x = 'Target', y = 'Values', hue = 'class', data = data_target_class, palette = 'Set3')
plt.title('Class and Target for Chest Exams')

Thus, **Target = 1** is associated only with **class = Lung Opacity**, whereas **Target = 0** covers both **class = No Lung Opacity / Not Normal** and **class = Normal**.
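The same relationship can be read off a contingency table. A minimal sketch with `pd.crosstab` on a made-up toy frame (the row counts are illustrative only):

```python
import pandas as pd

# Toy frame using the dataset's column names and class values (rows are made up)
toy = pd.DataFrame({
    'Target': [0, 0, 1, 1, 0],
    'class': ['Normal', 'No Lung Opacity / Not Normal',
              'Lung Opacity', 'Lung Opacity', 'Normal'],
})

# Rows are Target values, columns are classes, cells are row counts
table = pd.crosstab(toy['Target'], toy['class'])
print(table)
```

A zero in a cell means that combination of Target and class never occurs.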

#### Bounding Box Distribution

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (7, 7))
target_1 = training_data[training_data['Target'] == 1]
target_sample = target_1.sample(2000)
target_sample['xc'] = target_sample['x'] + target_sample['width'] / 2
target_sample['yc'] = target_sample['y'] + target_sample['height'] / 2
plt.title('Centers of Lung Opacity Rectangles (brown) over rectangles (yellow)\nSample Size: 2000')
target_sample.plot.scatter(x = 'xc', y = 'yc', xlim = (0, 1024), ylim = (0, 1024), ax = ax, alpha = 0.8, marker = '.', color = 'brown')

for i, crt_sample in target_sample.iterrows():
    ax.add_patch(Rectangle(xy=(crt_sample['x'], crt_sample['y']),
                width=crt_sample['width'],height=crt_sample['height'],alpha=3.5e-3, color="yellow"))

Thus, we can see that the centers of the bounding boxes are spread across the lungs. A large portion of the boxes are centered near the middle of a lung, but some centers also lie at the lung edges.
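For reference, a box center is its upper-left corner shifted by half the width and half the height. A small sketch on made-up coordinates, using `assign` so the original frame is left untouched:

```python
import pandas as pd

# Two made-up boxes given as upper-left corner plus width and height
boxes = pd.DataFrame({
    'x': [264.0, 562.0],
    'y': [152.0, 152.0],
    'width': [213.0, 256.0],
    'height': [379.0, 453.0],
})

# assign returns a new frame, which avoids writing into a slice of another frame
centers = boxes.assign(
    xc=boxes['x'] + boxes['width'] / 2,
    yc=boxes['y'] + boxes['height'] / 2,
)
print(centers[['xc', 'yc']])
```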

# Overview of DICOM files and medical images

Medical images are stored in a special format known as DICOM files (`*.dcm`). They contain a combination of header metadata and the underlying raw image arrays for pixel data. In Python, one popular library for accessing and manipulating DICOM files is `pydicom`. To use it, first find the DICOM file for a given `patientId` by looking for the matching file in the `stage_2_train_images/` folder, then use the `pydicom.dcmread()` function (available as `read_file()` in older releases) to load the data:

In [None]:
sample_patientId = train_labels['patientId'][0]
dcm_file = '../input/stage_2_train_images/'+'{}.dcm'.format(sample_patientId)
dcm_data = dcm.dcmread(dcm_file)

print('Metadata of the image consists of \n', dcm_data)


Most of the standard headers containing patient identifiable information have been anonymized (removed), so we are left with a relatively sparse set of metadata. The primary field we will be accessing is the underlying pixel data.

From the above sample we can see that the DICOM file still contains information that can be used for further analysis, such as sex, age, body part examined (which should be mostly chest), view position and modality. The size of this image is 1024 x 1024 (rows x columns).
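The pixel data is exposed by `pydicom` as the `pixel_array` attribute of a loaded dataset. The sketch below is illustrative: `load_pixels` and `summarize_pixels` are hypothetical helper names, and the demo runs on a synthetic array (its `uint8` dtype is an assumption) rather than on a real file:

```python
import numpy as np

def summarize_pixels(pixels):
    """Return the shape, dtype and intensity range of a 2-D image array."""
    pixels = np.asarray(pixels)
    return {
        'shape': pixels.shape,
        'dtype': str(pixels.dtype),
        'min': int(pixels.min()),
        'max': int(pixels.max()),
    }

def load_pixels(dcm_path):
    """Read a DICOM file and return its image as a NumPy array."""
    import pydicom  # imported lazily so summarize_pixels works without it
    return pydicom.dcmread(dcm_path).pixel_array

# Synthetic stand-in with the 1024 x 1024 size reported above (dtype is assumed)
demo = np.zeros((1024, 1024), dtype=np.uint8)
demo[100:200, 300:400] = 255
print(summarize_pixels(demo))
```

With the competition data available, `summarize_pixels(load_pixels(dcm_file))` would report the same fields for a real image.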

In [None]:
import os

In [None]:
print('Number of images in training images folders are: {}.'.format(len(os.listdir('../input/stage_2_train_images'))))

Thus, the training images folder contains 26684 images, which matches the number of unique patientIds in both CSV files. This suggests that **each unique patientId in the CSV files corresponds to an image in the folder**.
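Equal counts are a strong hint but not a proof of a one-to-one match. Comparing the two sets of IDs directly would settle it; the sketch below uses made-up IDs and file names in place of the real CSV column and folder listing:

```python
import os

# Made-up stand-ins for the unique patientIds and for os.listdir() of the image folder
csv_ids = {'p1', 'p2', 'p3'}
file_names = ['p1.dcm', 'p2.dcm', 'p3.dcm']

# Strip the .dcm extension to recover the patientId from each file name
file_ids = {os.path.splitext(name)[0] for name in file_names}

missing_images = csv_ids - file_ids   # IDs in the CSV with no image file
orphan_images = file_ids - csv_ids    # image files with no CSV entry
print(missing_images, orphan_images)
```

Two empty sets mean every ID has exactly one matching file and vice versa.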

In [None]:
training_image_path = '../input/stage_2_train_images/'

images = pd.DataFrame({'path': glob(os.path.join(training_image_path, '*.dcm'))})
images['patientId'] = images['path'].map(lambda x:os.path.splitext(os.path.basename(x))[0])
print('Columns in the training images dataframe: {}'.format(list(images.columns)))

In [None]:
# Merging the images dataframe with training_data dataframe
training_data = training_data.merge(images, on = 'patientId', how = 'left')
print('After merging the two dataframe, the training_data has {} rows and {} columns.'.format(training_data.shape[0], training_data.shape[1]))

In [None]:
print('The training_data dataframe as of now stands like\n')
training_data.head()

In [None]:
columns_to_add = ['Modality', 'PatientAge', 'PatientSex', 'BodyPartExamined', 'ViewPosition', 'ConversionType', 'Rows', 'Columns', 'PixelSpacing']

def parse_dicom_data(data_df, data_path):
  for col in columns_to_add:
    data_df[col] = None
  # Use the data_path argument instead of a hard-coded folder
  image_names = os.listdir(data_path)
  
  for i, img_name in tqdm_notebook(enumerate(image_names)):
    imagepath = os.path.join(data_path, img_name)
    data_img = dcm.dcmread(imagepath)
    idx = (data_df['patientId'] == data_img.PatientID)
    data_df.loc[idx, 'Modality'] = data_img.Modality
    data_df.loc[idx, 'PatientAge'] = pd.to_numeric(data_img.PatientAge)
    data_df.loc[idx, 'PatientSex'] = data_img.PatientSex
    data_df.loc[idx, 'BodyPartExamined'] = data_img.BodyPartExamined
    data_df.loc[idx, 'ViewPosition'] = data_img.ViewPosition
    data_df.loc[idx, 'ConversionType'] = data_img.ConversionType
    data_df.loc[idx, 'Rows'] = data_img.Rows
    data_df.loc[idx, 'Columns'] = data_img.Columns
    data_df.loc[idx, 'PixelSpacing'] = '{:4.3f}'.format(data_img.PixelSpacing[0])

In [None]:
parse_dicom_data(training_data, '../input/stage_2_train_images/')

In [None]:
print('So after parsing the information from the dicom images, our training_data dataframe has {} rows and {} columns and it looks like:\n'.format(training_data.shape[0], training_data.shape[1]))
training_data.head()

In [None]:
# Saving the training_data for further use:
training_data.to_pickle('training_data.pkl')

### EDA on this saved training data:

#### Modality

In [None]:
print('Modality for the images obtained is: {} \n'.format(training_data['Modality'].unique()[0]))

#### Body Part Examined

In [None]:
print('The images obtained are of {} areas.'.format(training_data['BodyPartExamined'].unique()[0]))

#### Understanding Different Positions

In [None]:
get_feature_distribution(training_data.drop_duplicates('patientId'), 'ViewPosition')

As seen above, the two View Positions in the training dataset are AP (Anterior/Posterior) and PA (Posterior/Anterior). These types of X-rays are mostly used to obtain the front view. Apart from the front view, a lateral image is usually taken to complement it.
- **Posterior/Anterior (PA)**: Here the chest radiograph is acquired by passing the X-ray beam from the posterior (back) of the patient's chest to the anterior (front). While the image is obtained, the patient is asked to stand with their chest against the film. In this image, the heart is on the right side of the image as one looks at it. These images are of higher quality and assess the heart size more accurately.
- **Anterior/Posterior (AP)**: At times it is not possible for radiographers to acquire a PA chest X-ray, usually because the patient is too unwell to stand. In these images the size of the heart is exaggerated.

In [None]:
print('The distribution of View Position when there is an evidence of Pneumonia:\n')
# Filter on Target first, then de-duplicate, so the mask and the frame share the same index
pneumonia_patients = training_data[training_data['Target'] == 1].drop_duplicates('patientId')
_ = pneumonia_patients['ViewPosition'].value_counts().sort_index(ascending = False).plot(kind = 'pie', autopct = '%.0f%%').set_ylabel('')

In [None]:
print('Plot x and y centers of bounding box')
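# Work on a copy so that adding columns does not write into a slice of training_data
bboxes = training_data[training_data['Target'] == 1].copy()
bboxes['xw'] = bboxes['x'] + bboxes['width']/2
bboxes['yh'] = bboxes['y'] + bboxes['height']/2

# `height` replaces the older `size` argument, which was renamed in seaborn 0.9
g = sns.jointplot(x = 'xw', y = 'yh', data = bboxes,
                  kind = 'hex', alpha = 0.5, height = 8)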
plt.suptitle('Bounding Box location when there is an evidence of Pneumonia')
plt.tight_layout()
plt.subplots_adjust(top = 0.95)

In [None]:
def bboxes_scatter(data, color_point, color_window, text):
  fig, ax = plt.subplots(1, 1, figsize = (7, 7))
  plt.title('Plotting centers of Lung Opacity\n{}'.format(text))
  data.plot.scatter(x = 'xw', y = 'yh', xlim = (0, 1024), ylim = (0, 1024), ax = ax, alpha = 0.8, marker = ".", color = color_point)
  for i, crt_sample in data.iterrows():
    ax.add_patch(Rectangle(xy = (crt_sample['x'], crt_sample['y']), width = crt_sample['width'], height = crt_sample['height'], alpha = 3.5e-3, color = color_window))

In [None]:
data_PA = bboxes[bboxes['ViewPosition'] == 'PA'].sample(1000)
data_AP = bboxes[bboxes['ViewPosition'] == 'AP'].sample(1000)

bboxes_scatter(data_PA, 'green', 'yellow', 'ViewPosition = PA')

In [None]:
bboxes_scatter(data_AP, 'blue', 'red', 'ViewPosition = AP')

We can see that the centers of the boxes are spread across the entire lung region in both views, and both PA and AP show a few outliers.

#### Conversion Type

In [None]:
print('Conversion Type for the data in Training Data: ', training_data['ConversionType'].unique()[0])

#### Rows and Columns

In [None]:
print(f'The training images has {training_data.Rows.unique()[0]} rows and {training_data.Columns.unique()[0]} columns.')

#### Patient Sex

In [None]:
def drawgraphs(data_file, columns, hue = False, width = 15, showdistribution = True):
  if (hue):
    print('Creating graph for: {} and {}'.format(columns, hue))
  else:  
    print('Creating graph for : {}'.format(columns))
  length = len(columns) * 6
  total = float(len(data_file))

  fig, axes = plt.subplots(nrows = len(columns) if len(columns) > 1 else 1, ncols = 1, figsize = (width, length))
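  for index, content in enumerate(columns):
    # plt.subplots returns a single Axes (not an array) when only one row is requested
    if (len(columns) > 1):
      currentaxes = axes[index]
    else:
      currentaxes = axes
    # Set the title on the axes being drawn rather than on whichever axes is current
    currentaxes.set_title(content)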

    if (hue):
      sns.countplot(x = columns[index], data = data_file, ax = currentaxes, hue = hue)
    else:
      sns.countplot(x = columns[index], data = data_file, ax = currentaxes)

    if(showdistribution):
      for p in (currentaxes.patches):
        height = p.get_height()
        if (height > 0 and total > 0):
          currentaxes.text(p.get_x() + p.get_width()/2., height + 3, '{:1.2f}%'.format(100*height/total), ha = "center")

In [None]:
get_feature_distribution(training_data.drop_duplicates('patientId'), 'PatientSex')

In [None]:
drawgraphs(data_file = training_data.drop_duplicates('patientId'), columns = ['PatientSex'], hue = 'class', width = 20, showdistribution = True)

In [None]:
drawgraphs(data_file = training_data.drop_duplicates('patientId'), columns = ['PatientSex'], hue = 'Target', width = 20, showdistribution = True)

Thus, we can see that more male than female patients in the dataset have pneumonia.

In [None]:
data_male = bboxes[bboxes['PatientSex'] == 'M'].sample(1000)
data_female = bboxes[bboxes['PatientSex'] == 'F'].sample(1000)

bboxes_scatter(data_male, "darkblue", "blue", "Patients Sex: Male")

In [None]:
bboxes_scatter(data_female, "red", "magenta", "Patients Sex: Female")

The centers of the bounding boxes in both cases are spread across the entire lung, with a small number of outliers.

#### Patient Age

In [None]:
print('The minimum and maximum recorded age of the patients are {} and {} respectively.'.format(training_data['PatientAge'].min(), training_data['PatientAge'].max()))

In [None]:
age_25 = np.percentile(training_data['PatientAge'], 25)
age_75 = np.percentile(training_data['PatientAge'], 75)
iqr_age = age_75 - age_25
cutoff_age = 1.5 * iqr_age

low_lim_age = age_25 - cutoff_age
upp_lim_age = age_75 + cutoff_age

outlier_age = [x for x in training_data['PatientAge'] if x < low_lim_age or x > upp_lim_age]
print('The number of outliers in `PatientAge` out of {} records are: '.format(len(training_data)), len(outlier_age))
print('\nThe ages which are in the outlier categories are:', outlier_age)

fig = plt.figure(figsize = (10, 6))
sns.boxplot(training_data['PatientAge'], orient = 'h').set_title('Outliers in PatientAge')

Thus, we can say that ages such as 148, 150, 151, 153 and 155 are data-entry mistakes. We can clip these outlier values to a lower value, say 100, so that the maximum patient age will be 100.
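The outlier rule used above flags values more than 1.5 times the interquartile range beyond the quartiles. A compact sketch of that rule followed by clipping, on a made-up list of ages (the cap of 100 follows the text; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up ages with one implausible value
ages = pd.Series([20, 25, 40, 45, 50, 55, 58, 60, 70, 150])

# Quartiles and the 1.5 * IQR fences
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are treated as outliers
outliers = ages[(ages < low) | (ages > high)].tolist()

# Clipping caps the values instead of dropping the rows
clipped = ages.clip(upper=100)
print(outliers, clipped.max())
```

Clipping keeps every record, which matters here because each row also carries label and bounding box information.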

In [None]:
drawgraphs(data_file = training_data.drop_duplicates('patientId'), columns = ['PatientAge'], width = 20, showdistribution = True)

Here, we can see that the most common patient age is 58 years. To get a clearer picture, we will introduce a new column that places each patient in an age group such as (0, 10], (10, 20] etc.

In [None]:
print('Clipping the outliers in `PatientAge` to a maximum of 100')
training_data['PatientAge'] = training_data['PatientAge'].clip(training_data['PatientAge'].min(), 100)
training_data['PatientAge'].describe().astype(int)

In [None]:
print('Distribution of `PatientAge`: Overall and Target = 1')
fig = plt.figure(figsize = (10, 6))

ax = fig.add_subplot(121)
g = (sns.distplot(training_data['PatientAge']).set_title('Distribution of PatientAge'))

ax = fig.add_subplot(122)
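# Filter on Target first, then de-duplicate, so the mask and the frame share the same index
pneumonia_patients = training_data[training_data['Target'] == 1].drop_duplicates('patientId')
g = (sns.distplot(pneumonia_patients['PatientAge']).set_title('Distribution of PatientAge vs PneumoniaEvidence'))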

In [None]:
custom_array = np.linspace(0, 100, 11)
training_data['PatientAgeBins'] = pd.cut(training_data['PatientAge'], custom_array)
training_data.drop_duplicates('patientId')['PatientAgeBins'].value_counts()

Thus, we can see that the largest number of patients fall in the (50, 60] age group, whereas the fewest fall in (90, 100].
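The bins produced by `pd.cut` are open on the left and closed on the right by default, so an age of exactly 60 lands in (50, 60] and an age of 0 would fall outside every bin. A short sketch on made-up ages:

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around bin edges
ages = pd.Series([0, 1, 50, 58, 60, 100])

# Eleven edges from 0 to 100 give ten bins of width 10
bins = np.linspace(0, 100, 11)
age_bins = pd.cut(ages, bins)

# Category codes: position of the bin, or -1 when the value is outside all bins
codes = age_bins.cat.codes.tolist()
print(codes)
```

Passing `include_lowest=True` to `pd.cut` would place an age of 0 in the first bin instead of leaving it unassigned.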

In [None]:
print('After adding the bin column, the dataset turns out to be:\n')
training_data.head()

In [None]:
drawgraphs(data_file = training_data.drop_duplicates('patientId'), columns = ['PatientAgeBins'], width = 20, showdistribution = True)

In [None]:
drawgraphs(data_file = training_data.drop_duplicates('patientId'), columns = ['PatientAgeBins'], hue = 'PatientSex', width = 20, showdistribution = True)

In [None]:
drawgraphs(data_file = training_data.drop_duplicates('patientId'), columns = ['PatientAgeBins'], hue = 'class', width = 20, showdistribution = True)

In [None]:
drawgraphs(data_file = training_data.drop_duplicates('patientId'), columns = ['PatientAgeBins'], hue = 'Target', width = 20, showdistribution = True)

From the three plots above that are split by sex, class and Target, we can infer that the largest share of male and female patients, of patients in the “No Lung Opacity / Not Normal” and “Lung Opacity” classes, and of patients with pneumonia all lie in the (50, 60] age group.

In [None]:
data_age_19 = bboxes[bboxes['PatientAge'] < 20]
data_age_20_34 = bboxes[(bboxes['PatientAge'] >= 20) & (bboxes['PatientAge'] < 35)]
data_age_35_49 = bboxes[(bboxes['PatientAge'] >= 35) & (bboxes['PatientAge'] < 50)]
data_age_50_64 = bboxes[(bboxes['PatientAge'] >= 50) & (bboxes['PatientAge'] < 65)]
data_age_65 = bboxes[bboxes['PatientAge'] >= 65]

bboxes_scatter(data_age_19,'blue', 'red', 'Patient Age: 1-19 years')

In [None]:
bboxes_scatter(data_age_20_34, 'blue', 'red', 'Patient Age: 20-34 years')

In [None]:
bboxes_scatter(data_age_35_49, 'blue', 'red', 'Patient Age: 35-49 years')

In [None]:
bboxes_scatter(data_age_50_64, 'blue', 'red', 'Patient Age: 50-64 years')

In [None]:
bboxes_scatter(data_age_65, 'blue', 'red', 'Patient Age: 65+ years')

#### Plotting DICOM Images

In [None]:
def show_dicom_images(data, df, img_path):
  img_data = list(data.T.to_dict().values())
  f, ax = plt.subplots(3, 3, figsize = (16, 18))
  
  for i, row in enumerate(img_data):
    image = row['patientId'] + '.dcm'
    path = os.path.join(img_path, image)
    # Read the DICOM file once; it holds both the header fields and the pixel array
    data_img = dcm.dcmread(path)
    rows = df[df['patientId'] == row['patientId']]
    age = rows.PatientAge.unique().tolist()[0]
    sex = data_img.PatientSex
    part = data_img.BodyPartExamined
    vp = data_img.ViewPosition
    modality = data_img.Modality
    ax[i//3, i%3].imshow(data_img.pixel_array, cmap = plt.cm.bone)
    ax[i//3, i%3].axis('off')
    ax[i//3, i%3].set_title('ID: {}\nAge: {}, Sex: {}, Part: {}, VP: {}, Modality: {}\nTarget: {}, Class: {}\nWindow: {}:{}:{}:{}'\
                            .format(row['patientId'], age, sex, part,
                                    vp, modality, row['Target'],
                                    row['class'], row['x'],
                                    row['y'], row['width'],
                                    row['height']))
    box_data = list(rows.T.to_dict().values())
    
    for j, box in enumerate(box_data):
      ax[i//3, i%3].add_patch(Rectangle(xy = (box['x'], box['y']),
                                        width = box['width'], height = box['height'],
                                        color = 'blue', alpha = 0.15))
  plt.show()

- Target = 0

In [None]:
show_dicom_images(data = training_data.loc[(training_data['Target'] == 0)].sample(9),
                  df = training_data, img_path = '../input/stage_2_train_images/')

Since the images in the above subplots belong to either "Normal" or "No Lung Opacity / Not Normal", no bounding box is drawn.

- Target = 1

In [None]:
show_dicom_images(data = training_data.loc[(training_data['Target'] == 1)].sample(9),
                  df = training_data, img_path = '../input/stage_2_train_images/')

In the above subplots, the boxes (in blue) mark the area of interest, i.e., the area of the lungs in which the opacity is observed.

### Conclusion
- The training dataset (both CSV files and the training image folder) contains information on 26684 unique patients
- Some of these 26684 unique patients have multiple entries in both CSV files
- Most of the recorded patients belong to Target = 0 (i.e., they don't have pneumonia)
- Some patients have more than one bounding box, the maximum being 4
- The classes "No Lung Opacity / Not Normal" and "Normal" are associated with Target = 0, whereas "Lung Opacity" belongs to Target = 1
- The images are stored in DICOM format, from which information such as PatientAge, PatientSex, ViewPosition etc. is obtained
- The images were acquired in two view positions: AP and PA
- The recorded age ranges from 1 to 155 (values were then clipped to 100)
- The centers of the bounding boxes are spread over the entire region of the lungs, but some centers are outliers