# This notebook uses the new Fastai 2 medical library to train a classification model

Some remarks:

* This notebook is for training (and learning) only and it is not ready for submission. 
* I tried to give a quick overview of the DataBlock API. For more information, you can take a look at this post https://towardsdatascience.com/how-to-create-a-datablock-for-multispectral-satellite-image-segmentation-with-the-fastai-v2-bc5e82f4eb5 or the Fastai documentation. 
* This version loads a pre-trained model and trains it on 500k images.
* The main problem with this dataset is that it is unbalanced.
* If something is not clear enough or wrong, or you just have a suggestion on how to improve the notebook, please let us all know in the comments. 

Thanks,

In [None]:
# Importing our dependencies
import torch
import fastai
from fastai.medical import *
from fastai.medical.imaging import *
from fastai.torch_core import *
import matplotlib.pyplot as plt
import pandas as pd
from fastai.vision.all import *
from pathlib import Path
import pydicom

!conda install -c conda-forge gdcm -y
import gdcm

## Taking a look at data using the PILdicom class
This section is just to check if we can open the images correctly

In [None]:
# Just for visualization purposes, we will grab the files contained in just 1 study (each study has many images/slices)
# For that, we will use Fastai 2 get_dicom_files, which recursively collects all dcm files in the subdirectories
sample_files = get_dicom_files('../input/rsna-str-pulmonary-embolism-detection/train/0003b3d648eb')

In [None]:
# Let's grab the first file of this study and display its metadata
dicom = dcmread(sample_files[0])
dicom

Just for information: another way to open the dcm file is:<br> `dicom = sample_files[0].dcmread()`<br>
This works because importing `fastai.medical.imaging` patches a `dcmread()` method onto `Path` objects, and `get_dicom_files` returns the files as paths.

In [None]:
# using a snippet from fastai medical tutorial, we will display the images with different scales
scales = False, True, dicom_windows.brain, dicom_windows.subdural
titles = 'raw','normalized','brain windowed','subdural windowed'
for s,a,t in zip(scales, subplots(2,2,imsize=4)[1].flat, titles):
    dicom.show(scale=s, ax=a, title=t)
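
A window is usually described by a width and a level: pixel values outside `[level - width/2, level + width/2]` are clipped, and the rest is rescaled to the 0..1 range. The sketch below only illustrates that general idea in plain Python. It is not fastai's implementation, `apply_window` is a hypothetical helper, and the pixel values and the width/level pair are made up for the example.

```python
def apply_window(pixels, width, level):
    """Clamp values to [level - width/2, level + width/2] and rescale to 0..1."""
    lo = level - width / 2
    hi = level + width / 2
    return [(min(max(p, lo), hi) - lo) / (hi - lo) for p in pixels]

# Hypothetical Hounsfield-unit values, chosen only for illustration
hu_values = [-1000, 0, 40, 80, 400]

# Assumed example window: width 80, level 40
windowed = apply_window(hu_values, width=80, level=40)
print(windowed)
```

Everything below the window collapses to 0 and everything above it to 1, which is why different windows make different tissues visible.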

In [None]:
dicom.show(cmap=plt.cm.gist_ncar, figsize=(6,6))

## Subsampling the training set
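The training set is unbalanced, with many more negative exams than positive ones, and that interferes with our training: the model can score well just by guessing negative for everything instead of learning to predict anything. To reduce this effect, we will train on a subset with the same number of rows from negative and positive exams.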
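
To see why guessing all negative is so tempting for a model, here is a tiny worked example in plain Python. The 90/10 split is a made-up distribution for illustration, not the real ratio of this dataset, and `accuracy` is a hypothetical helper.

```python
def accuracy(preds, targets):
    """Fraction of predictions that match the targets."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

# Hypothetical label distribution: 90 negative images, 10 positive
targets = [0] * 90 + [1] * 10

# A "model" that never predicts a positive
always_negative = [0] * len(targets)

baseline_acc = accuracy(always_negative, targets)
positives_found = sum(p == 1 and t == 1 for p, t in zip(always_negative, targets))
print(baseline_acc, positives_found)
```

The accuracy looks high even though not a single positive image is found, so accuracy alone says little on unbalanced data.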


In [None]:
# Initially, we will create our Pandas Dataframes from the CSV files.
# The train dataframe contains all the information to get to the images, so it will be passed 
# as the source of our dataset and the datablock will be in charge of transforming it into 
# inputs and targets (x, y)

train = pd.read_csv('../input/rsna-str-pulmonary-embolism-detection/train.csv', low_memory=False)
test = pd.read_csv('../input/rsna-str-pulmonary-embolism-detection/test.csv')

In [None]:
# We will split the data into train_neg and train_pos. Afterwards, we will grab the first 250k rows from each dataframe and join them to do our training
negatives = train['negative_exam_for_pe'] == 1
train_neg = train[negatives]
train_pos = train[~negatives]

In [None]:
balanced = pd.concat([train_neg[:250000], train_pos[:250000]], axis=0)
balanced
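
Note that slicing with `[:250000]` takes the first rows in file order. One possible alternative, shown here only as a sketch, is to draw the rows at random with a fixed seed. The example uses plain Python lists of dicts instead of dataframes so it runs on its own; `balanced_sample` and the toy rows are hypothetical. With pandas, `DataFrame.sample(n=..., random_state=...)` would be the natural way to try the same thing.

```python
import random

def balanced_sample(rows, label_key, n_per_class, seed=42):
    """Draw up to n_per_class random rows from each of the two groups."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    neg = [r for r in rows if r[label_key] == 1]
    pos = [r for r in rows if r[label_key] != 1]
    neg_pick = rng.sample(neg, min(n_per_class, len(neg)))
    pos_pick = rng.sample(pos, min(n_per_class, len(pos)))
    return neg_pick + pos_pick

# Toy data: 80 rows from negative exams, 20 from positive exams
rows = [{"id": i, "negative_exam_for_pe": 1 if i < 80 else 0} for i in range(100)]

subset = balanced_sample(rows, "negative_exam_for_pe", n_per_class=10)
n_neg = sum(r["negative_exam_for_pe"] == 1 for r in subset)
print(len(subset), n_neg)
```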

## Preparing the DataBlock
The first step in fastai is to create the DataBlock, which holds all the transformations needed to get each item, put it in the correct format, and so on.

After the DataBlock is prepared, we will easily create datasets and dataloaders

Each row of the source dataframe will be passed to each block of the DataBlock, so we need to create functions that manipulate each row and transform it into inputs (X) or targets (Y).<br>
For that, we will define `get_x` and `get_y` functions.

Keep in mind that `get_x` does not apply the image transformations itself. It is only responsible for turning each row into the path to its file. 

`get_y` returns the target columns. As this is a multi-label classification, where several labels can be true for the same image, our targets cannot be just a single number. We could return a list of the labels that occur and let the DataBlock take care of one-hot encoding them. The problem with this approach is that the categorization step (`MultiCategorize`) takes too long, because it has to look at all the targets to build its internal vocab. <br>
To overcome this, I created `get_encoded_y`, which returns the targets already one-hot encoded, with the vocab being the columns that we want to extract. This way, we can bypass the categorization step by passing `encoded=True` to the MultiCategoryBlock.
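
The two target formats described above can be compared on a toy example. This is a plain-Python sketch with a shortened, hypothetical vocab and a made-up row; `labels_to_one_hot` and `row_to_encoded` are illustrative helpers, not fastai functions.

```python
# Shortened vocab for illustration, sorted like the real one
vocab = sorted(["pe_present_on_image", "negative_exam_for_pe", "chronic_pe"])

def labels_to_one_hot(labels, vocab):
    """Format 1: a list of label names, encoded against the vocab afterwards."""
    present = set(labels)
    return [1 if name in present else 0 for name in vocab]

def row_to_encoded(row, vocab):
    """Format 2: read the 0/1 columns directly, in vocab order."""
    return [int(row[name]) for name in vocab]

# Made-up row with two labels set
row = {"negative_exam_for_pe": 0, "pe_present_on_image": 1, "chronic_pe": 1}

labels = [name for name in vocab if row[name] == 1]
from_labels = labels_to_one_hot(labels, vocab)
from_columns = row_to_encoded(row, vocab)
print(labels, from_labels, from_columns)
```

Both routes give the same vector, but the second one never has to scan the dataset to discover which labels exist.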


In [None]:
vocab = ['negative_exam_for_pe', 'pe_present_on_image', 'rv_lv_ratio_gte_1', 'rv_lv_ratio_lt_1',
         'leftsided_pe', 'chronic_pe', 'rightsided_pe', 'acute_and_chronic_pe',
         'central_pe', 'indeterminate']
vocab.sort()

def get_x(row):
    base_path = Path('../input/rsna-str-pulmonary-embolism-detection/train')
    file_path = f"{row['StudyInstanceUID']}/{row['SeriesInstanceUID']}/{row['SOPInstanceUID']}.dcm"
    return base_path/file_path

# def get_y(row):
#     labels = row[vocab]
    
#     return list(labels.index[labels==1])

def get_encoded_y(row):
    return row[vocab].values.squeeze().astype('long')

In [None]:
# we will test our functions by passing an arbitrary row
r = train.iloc[3]
get_x(r), get_encoded_y(r)

The DataBlock will then be created using the `get_x` and `get_encoded_y` functions we just defined. <br>
The inputs (Xs) go through a TransformBlock with `PILDicom.create`, which takes the file path and creates a PILDicom instance, followed by `ToTensor`. The commented-out `ImageBlock(cls=PILDicom)` is an alternative.<br>
For the targets (Ys), we pass our one-hot encoded labels to the MultiCategoryBlock, with encoded=True and the vocab.

Just after the dblock creation, we will create a dataset to test it.


In [None]:
dblock = DataBlock(#blocks=(ImageBlock(cls=PILDicom), MultiCategoryBlock(encoded=True, vocab=vocab)),
                   blocks=(TransformBlock([PILDicom.create, ToTensor]), MultiCategoryBlock(encoded=True, vocab=vocab)),
                   get_x=get_x,
                   get_y=get_encoded_y,
                  )
dsets = dblock.datasets(balanced, verbose=False)

In [None]:
# If we index the dataset, we get x and y as return
dsets[150000]

In [None]:
# to check the sanity of our dblock, we could also call the `.summary()` function
# dblock.summary(train.iloc[:100])

## Creating a dataloader
As noted before, we will use just a subset of the training set for learning purposes. <br>
Our dataloaders will be created from the balanced dataset built above, with a batch size of 64.

In [None]:
# dls = dblock.dataloaders(train.iloc[:20000], bs=16, num_workers=0)
# To check our dataloader we can either create an item or a full batch
dls = dsets.dataloaders(bs=64, num_workers=0)
dls.create_item(1)

In [None]:
dls.show_batch()

Great!!! It seems to be working. Let's move on to the training phase.

## Training a simple model
Fastai provides a wide range of CNN models to be used (https://docs.fast.ai/vision.models.xresnet)

We will use a simple resnet18 pretrained on ImageNet (I don't know if that makes sense for DICOM images), which is fast to train, just to see if everything is working.

In [None]:
# Define an accuracy metric for multi-label targets (fastai also ships an `accuracy_multi` metric)
def accuracy_multi(inp, targ, thresh=0.5, sigmoid=True):
    "Compute accuracy when `inp` and `targ` are the same size."
    if sigmoid: inp = inp.sigmoid()
    return ((inp>thresh)==targ.bool()).float().mean()

# accuracy_multi(y, activs, thresh=0.5)
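
To make the metric concrete, here is the same computation spelled out in plain Python, without tensors: apply a sigmoid to every logit, threshold it, compare with the target and average over all labels. The logits and targets are made-up numbers, and `multi_label_accuracy` is a hypothetical helper written only for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multi_label_accuracy(logits, targets, thresh=0.5):
    """Share of individual labels predicted correctly across all rows."""
    correct = 0
    total = 0
    for logit_row, target_row in zip(logits, targets):
        for logit, target in zip(logit_row, target_row):
            pred = sigmoid(logit) > thresh
            correct += int(pred == bool(target))
            total += 1
    return correct / total

# Two made-up samples with three labels each
logits = [[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2]]
targets = [[1, 0, 0], [0, 1, 1]]

acc = multi_label_accuracy(logits, targets)
print(acc)
```

Each label counts separately, so a sample with one wrong label out of three still contributes two correct predictions.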

In [None]:
learn = cnn_learner(dls, resnet18, n_in=1, metrics=accuracy_multi)
learn.model_dir = '.'

try:
    learn.load('../input/fastai2-medical-simple-training/resnet18-v3')
    print('Model Loaded Successfully')
except Exception as e:
    print(f'Could not load model ({e}). Content of added data:')

    !ls ../input/fastai2-medical-simple-training/
    print('Content of working directory')
    !ls ../working

In [None]:
#testing one pass through the model
# x,y = to_cpu(dls.train.one_batch())
# activs = learn.model(x)
# activs.shape

In [None]:
# apply the new metrics and look for the best learning rate
# learn.metrics=accuracy_multi
# learn.lr_find()

# I noticed a problem when training with more data. 
# I also noticed that some dcm cannot be opened correctly, have to check further.
# It seems to be a good idea to iterate through all dcms and check if they are 
# opening correctly and that they can be cast to PILDicom
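
The check mentioned in the comments above could look roughly like this. It is only a sketch: `find_unreadable` is a hypothetical helper, and the example uses a fake opener so it runs without any DICOM files. In this notebook the opener would presumably be something like `PILDicom.create`, but that is an assumption I have not tested on the full dataset.

```python
from pathlib import Path

def find_unreadable(paths, opener):
    """Try to open every path and collect the ones that raise an error."""
    failures = []
    for path in paths:
        try:
            opener(path)
        except Exception as err:
            failures.append((Path(path), repr(err)))
    return failures

# Fake opener for illustration: fails on files with "bad" in the name
def fake_opener(path):
    if "bad" in Path(path).name:
        raise ValueError("cannot decode pixel data")
    return b"pixels"

failures = find_unreadable(["a.dcm", "bad_1.dcm", "c.dcm"], fake_opener)
for path, err in failures:
    print(path, err)
```

The returned list can then be used to drop the problematic rows from the dataframe before building the datasets.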

In [None]:
learn.fine_tune(1, base_lr=2e-2, freeze_epochs=1)

In [None]:
learn.model_dir = '.'
learn.save('./resnet18-v4')

In [None]:
item = dsets.valid[1500]

In [None]:
learn.predict(item[0])

In [None]:
item[1]

In [None]:
# interp = ClassificationInterpretation.from_learner(learn)
# interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

In [None]:
# interp.plot_top_losses(6)

That's all for now!