## Introduction

COVIDx CT is a large-scale chest CT image dataset for detecting COVID-19 cases. It has three classes: Normal (0), non-COVID-19 pneumonia (1), and COVID-19 (2). All images are stored in a single directory, and their labels are split across three label files. Each line of a label file describes one image by its filename, its class, and the corner coordinates of its bounding box:
```
filename class xmin ymin xmax ymax
```

## Loading the Data and Labels
The image data is stored in a single directory, namely `2A_images`. Labels are provided as a set of text files named `{train,val,test}_COVIDx_CT-2A.txt`, and the data from a particular label file can be loaded using the function below.

In [None]:
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt

In [None]:
def load_labels(label_file):
    """Loads image filenames, classes, and bounding boxes"""
    fnames, classes, bboxes = [], [], []
    with open(label_file, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            # Each line: filename class xmin ymin xmax ymax
            fname, cls, xmin, ymin, xmax, ymax = line.split()
            fnames.append(fname)
            classes.append(int(cls))
            bboxes.append((int(xmin), int(ymin), int(xmax), int(ymax)))
    return fnames, classes, bboxes
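
Once a label file can be parsed, a quick sanity check is to count how many images fall into each class. The sketch below is a minimal, self-contained illustration rather than part of the dataset tooling: it writes three made-up label lines (hypothetical filenames and boxes, not real dataset entries) to a temporary file and tallies the classes with `collections.Counter`. The helper name `count_classes` is an assumption introduced here.

```python
import os
import tempfile
from collections import Counter

CLASS_NAMES = ('Normal', 'Pneumonia', 'COVID-19')

def count_classes(label_file):
    """Counts images per class in a label file (hypothetical helper)."""
    counts = Counter()
    with open(label_file, 'r') as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Format: filename class xmin ymin xmax ymax
            counts[CLASS_NAMES[int(fields[1])]] += 1
    return counts

# Made-up label lines for illustration only (not real dataset entries)
example_lines = [
    'example_a.png 0 10 20 300 400',
    'example_b.png 2 15 25 310 410',
    'example_c.png 2 5 8 290 380',
]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(example_lines) + '\n')
    tmp_path = tmp.name

example_counts = count_classes(tmp_path)
os.remove(tmp_path)
print(dict(example_counts))
```

On the real data, the same helper could be pointed at each of the `{train,val,test}_COVIDx_CT-2A.txt` files to compare class balance across splits.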

Let's try loading some images from the dataset and displaying their class labels and bounding boxes:

In [None]:
# Set paths
image_dir = '/kaggle/input/covidxct/2A_images'
label_file = '/kaggle/input/covidxct/val_COVIDx_CT-2A.txt'

# Load labels
fnames, classes, bboxes = load_labels(label_file)

# Select cases to view
np.random.seed(27)
# Sample without replacement so that no image appears twice in the grid
indices = np.random.choice(len(fnames), 9, replace=False)

# Show a grid of 9 images
fig, axes = plt.subplots(3, 3, figsize=(16, 16))
class_names = ('Normal', 'Pneumonia', 'COVID-19')
for index, ax in zip(indices, axes.ravel()):
    # Load the CT image
    image_file = os.path.join(image_dir, fnames[index])
    image = cv2.imread(image_file, cv2.IMREAD_UNCHANGED)

    # Overlay the bounding box
    image = np.stack([image]*3, axis=-1)  # make image 3-channel
    bbox = bboxes[index]
    cv2.rectangle(image, bbox[:2], bbox[2:], color=(255, 0, 0), thickness=3)

    # Display
    cls = classes[index]
    ax.imshow(image)
    ax.set_title('Class: {} ({})'.format(class_names[cls], cls))
plt.show()
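
Besides drawing the bounding box, the same coordinates can be used to crop an image to the annotated region. The following is a minimal sketch on a synthetic array, assuming the `(xmin, ymin, xmax, ymax)` order from the label format and treating `xmax` and `ymax` as exclusive bounds. `crop_to_bbox` is a hypothetical helper, and whether cropping is appropriate preprocessing depends on the downstream task.

```python
import numpy as np

def crop_to_bbox(image, bbox):
    """Crops an image array to (xmin, ymin, xmax, ymax) (hypothetical helper)."""
    xmin, ymin, xmax, ymax = bbox
    # NumPy images are indexed as [row, column], i.e. [y, x]
    return image[ymin:ymax, xmin:xmax]

# Synthetic 512x512 single-channel image and a made-up bounding box
dummy_image = np.zeros((512, 512), dtype=np.uint8)
dummy_bbox = (50, 100, 450, 400)
cropped = crop_to_bbox(dummy_image, dummy_bbox)
print(cropped.shape)  # (300, 400): height = ymax - ymin, width = xmax - xmin
```

Note the axis order: the box is given as x then y, while the array is sliced as rows (y) then columns (x).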

## Examining the Metadata
Metadata for all patient cases is available in `metadata.csv`. This file includes:
* Patient ID (key: "patient id")
* Data source (key: "source")
* Country (key: "country")
* Sex and age, if available (keys: "sex" and "age")
* Finding: Normal, Pneumonia, or COVID-19 (key: "finding")
* Verified finding, which indicates whether the finding is confirmed: Yes or No (key: "verified finding")
* Slice selection, which indicates how slice selection was performed: Expert, Non-expert, or Automatic (key: "slice selection")
* View and modality, which are axial CT for all cases (keys: "view" and "modality")

Let's first define some functions to display the metadata:

In [None]:
def pie_chart(ax, data, labels, title=None):
    """Helper to plot a pie chart"""
    ax.pie(data, labels=labels, autopct='%1.1f%%', shadow=False, startangle=90)
    if title is not None:
        ax.set_title(title)

def age_histogram(ax, data, title=None):
    """Helper to plot a histogram of ages"""
    bins = np.arange(0, 101, 5)
    labels = ['{}-{}'.format(bins[i], bins[i+1]) for i in range(len(bins)-1)]
    # The bin labels are used as tick labels below, not as a legend label
    ax.hist(data, bins, rwidth=0.9)
    ax.set_xticks(bins[:-1] + 2.5)
    ax.set_xticklabels(labels)
    if title is not None:
        ax.set_title(title)

In [None]:
import pandas as pd

# Load metadata and replace missing entries
metadata = pd.read_csv('/kaggle/input/covidxct/metadata.csv')
metadata = metadata.replace(np.nan, 'Unknown')  # plain value replacement, no regex needed

### Data Sources and Countries

In [None]:
# Get source and country info
src_labels, src_counts = np.unique(metadata['source'], return_counts=True)
cnt_labels, cnt_counts = np.unique(metadata['country'], return_counts=True)
main_countries = {'China', 'Iran', 'Russia', 'Unknown'}
main_cnt_labels, main_cnt_counts = [], []
other_cnt_labels, other_cnt_counts = [], []
for cnt, count in zip(cnt_labels, cnt_counts):
    if cnt in main_countries:
        main_cnt_labels.append(cnt)
        main_cnt_counts.append(count)
    else:
        other_cnt_labels.append(cnt)
        other_cnt_counts.append(count)
main_cnt_labels.append('Other')
main_cnt_counts.append(sum(other_cnt_counts))

# Display source and country info
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
pie_chart(axes[0], src_counts, src_labels, title='Data Sources')
pie_chart(axes[1], main_cnt_counts, main_cnt_labels, title='Patient Countries - All')
pie_chart(axes[2], other_cnt_counts, other_cnt_labels, title='Patient Countries - Other ({} cases)'.format(main_cnt_counts[-1]))
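
The main/other split above can also be expressed with pandas. The sketch below is one possible alternative, not the approach used in this notebook: `group_minor_categories` is a hypothetical helper, and the country list is made up for illustration.

```python
import pandas as pd

def group_minor_categories(values, main_categories, other_label='Other'):
    """Counts values, merging everything outside main_categories (hypothetical helper)."""
    counts = pd.Series(values).value_counts()
    is_main = counts.index.isin(list(main_categories))
    grouped = counts[is_main].copy()
    # Sum all remaining categories into a single entry
    grouped[other_label] = counts[~is_main].sum()
    return grouped

# Made-up country column for illustration only
example_countries = ['China', 'China', 'Iran', 'Unknown', 'Italy', 'France', 'Italy']
grouped_counts = group_minor_categories(
    example_countries, {'China', 'Iran', 'Russia', 'Unknown'})
print(grouped_counts.to_dict())
```

The resulting series could be passed to `pie_chart` as `grouped_counts.values` and `grouped_counts.index`.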

### Patient Sexes and Ages

In [None]:
from matplotlib.gridspec import GridSpec

# Get sex and age info
sex_labels, sex_counts = np.unique(metadata['sex'], return_counts=True)
# Convert ages to numbers; unknown ages become NaN and are dropped
ages = pd.to_numeric(metadata['age'], errors='coerce').dropna()

# Display sex and age info
fig = plt.figure(figsize=(18, 5))
gs = GridSpec(1, 3, figure=fig)
ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1:])
pie_chart(ax1, sex_counts, sex_labels, title='Patient Sexes')
age_histogram(ax2, ages, title='Patient Ages')

### Findings and Slice Selection

In [None]:
# Get finding and labelling info
finding_labels, finding_counts = np.unique(metadata['finding'], return_counts=True)
verif_labels, verif_counts = np.unique(metadata['verified finding'], return_counts=True)
slice_labels, slice_counts = np.unique(metadata['slice selection'].replace('Unknown', 'N/A (normal cases)'), return_counts=True)

# Display finding and labelling info
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
pie_chart(axes[0], finding_counts, finding_labels, title='Findings')
pie_chart(axes[1], verif_counts, verif_labels, title='Verified Findings')
pie_chart(axes[2], slice_counts, slice_labels, title='Slice Selection')
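
Pie charts show each field separately. To see how two fields relate, a cross-tabulation can help. The sketch below uses a small made-up table with the "finding" and "verified finding" keys listed earlier (the rows are not real metadata) and also shows how one might select only the verified COVID-19 cases.

```python
import pandas as pd

# Made-up rows using the metadata keys described earlier (not real metadata)
example_metadata = pd.DataFrame({
    'finding': ['COVID-19', 'COVID-19', 'Normal', 'Pneumonia', 'COVID-19'],
    'verified finding': ['Yes', 'No', 'Yes', 'Yes', 'Yes'],
})

# Rows: finding, columns: verified finding, cells: number of rows
finding_by_verification = pd.crosstab(
    example_metadata['finding'], example_metadata['verified finding'])
print(finding_by_verification)

# Keep only rows whose COVID-19 finding is verified
verified_covid = example_metadata[
    (example_metadata['finding'] == 'COVID-19')
    & (example_metadata['verified finding'] == 'Yes')
]
print(len(verified_covid))
```

With the real `metadata` DataFrame, the same two calls would work unchanged, provided the column names match the keys above.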