### Advanced nodules sampling

In the previous [tutorial](/1_Running_preprocessing.ipynb) you explored load/dump, preprocessing, and sampling crops with nodules via the ```sample_nodules``` method. ```sample_nodules``` has a ```share``` argument that sets the fraction of crops in a batch that contain nodules, which lets you balance positive and negative crops for better network training. However, crops without nodules are taken at random locations, which may not be what you want, since some locations are more likely to contain nodules than others. For this purpose, you can use any histogram (e.g. a histogram of nodule locations) for sampling.

The examples in this notebook use the [LUNA16 competition dataset](https://luna16.grand-challenge.org/) in MetaImage (mhd/raw) format.

In [None]:
import os
import sys
import glob
import shutil
import pandas as pd
import numpy as np
from ipywidgets import interact
from copy import deepcopy
import matplotlib.pyplot as plt

In [None]:
sys.path.append('..')

In [None]:
from radio.batchflow import FilesIndex, Dataset, Pipeline
from radio import CTImagesMaskedBatch as CTIMB

### Build histogram of nodules' positions

Let's load a dataset of CT scans (LUNA16). See the previous [tutorial](/1_Running_preprocessing.ipynb) for details.

In [None]:
DIR_LUNA = '/notebooks/data/MRT/luna/s*/*.mhd'
lunaix = FilesIndex(path=DIR_LUNA, no_ext=True)
lunaset = Dataset(index=lunaix, batch_class=CTIMB)

The dataset has 888 CT scans:

In [None]:
len(lunaset.indices)

Let's load the annotation file provided by LUNA16:


In [None]:
nodules = pd.read_csv('/notebooks/data/MRT/luna/CSVFILES/annotations.csv')
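
For orientation, the LUNA16 annotation file has one row per annotated nodule, with the columns `seriesuid`, `coordX`, `coordY`, `coordZ` (world coordinates in mm) and `diameter_mm`. The sketch below uses a small synthetic frame with made-up values in that layout, not the real file, to show how nodules can be counted per scan; the name `toy_annotations` is hypothetical.

```python
import pandas as pd

# Synthetic rows mimicking the annotations.csv layout (all values are made up).
toy_annotations = pd.DataFrame({
    'seriesuid': ['scan_a', 'scan_a', 'scan_b'],
    'coordX': [-100.5, 46.1, 12.0],
    'coordY': [67.2, -20.3, 5.5],
    'coordZ': [-231.8, -150.0, -98.4],
    'diameter_mm': [6.4, 4.2, 18.5],
})

# Number of annotated nodules per scan.
nodules_per_scan = toy_annotations.groupby('seriesuid').size()
print(nodules_per_scan)
```

The same `groupby` call should work on the `nodules` frame loaded above, assuming the file follows the standard LUNA16 layout.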

Let's create a toy histogram from uniformly sampled random points:

In [None]:
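# Assumption: SHAPE is the extent covered by the histogram, matching the upper bounds of `ranges`
SHAPE = (32, 64, 64)
ranges = list(zip([0] * 3, SHAPE))
# np.histogramdd (not np.histogram) is required to bin 3D points
points = np.random.uniform(low=0, high=1, size=(100, 3)) * np.array(SHAPE).reshape(1, -1)
histo = list(np.histogramdd(points, range=ranges, bins=4))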

In [None]:
histo[0].sum()
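
A histogram of actual nodule positions can be built the same way from an `(n, 3)` array of nodule centers. The sketch below is only an illustration: it assumes the centers are already expressed in voxel coordinates of a scan of shape `(384, 448, 448)` (the shape passed to `unify_spacing` in this notebook), and it uses synthetic `centers` instead of real annotations. The names `scan_shape`, `centers` and `nodule_histo` are hypothetical.

```python
import numpy as np

scan_shape = (384, 448, 448)  # assumed to match the unified scan shape
rng = np.random.default_rng(0)

# Synthetic stand-in for nodule centers in voxel coordinates, shape (n_nodules, 3).
centers = rng.uniform(low=0, high=1, size=(200, 3)) * np.array(scan_shape)

# One (low, high) pair per axis, then a 4 x 4 x 4 grid of bins over the scan.
ranges = list(zip([0] * 3, scan_shape))
counts, edges = np.histogramdd(centers, range=ranges, bins=4)
nodule_histo = [counts, edges]

print(counts.shape)  # (4, 4, 4)
print(counts.sum())  # 200.0, every center falls inside the ranges
```

As with the toy histogram, the sum of the counts equals the number of points that fall inside `ranges`, which is a quick sanity check before passing the histogram on.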

### Use histogram to sample nodules

In [None]:
pipe = (Pipeline()
        .load(fmt='raw', components='images')
        .fetch_nodules_info(nodules)
        .unify_spacing(shape=(384, 448, 448), spacing=(1.7, 1.0, 1.0))
        .create_mask()
        .sample_nodules(batch_size=10, nodule_size=(32, 64, 64), share=0.5,
                        histo=histo, variance=(20, 70, 70))
       )
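
To build intuition for what histogram-driven sampling means, here is a minimal NumPy sketch. It is not RadIO's internal implementation; the function `sample_from_histo` is a hypothetical helper that treats the histogram as a piecewise-constant density, picks bins in proportion to their counts, and then draws a uniform point inside each chosen bin.

```python
import numpy as np

def sample_from_histo(histo, size, rng):
    """Draw `size` points from the piecewise-constant density given by (counts, edges)."""
    counts, edges = histo
    probs = counts.ravel() / counts.sum()
    # Pick flat bin indices in proportion to the bin counts.
    flat_bins = rng.choice(probs.size, size=size, p=probs)
    bin_idx = np.stack(np.unravel_index(flat_bins, counts.shape), axis=1)
    # Lower and upper edges of each chosen bin along every axis.
    lows = np.stack([edges[d][bin_idx[:, d]] for d in range(counts.ndim)], axis=1)
    highs = np.stack([edges[d][bin_idx[:, d] + 1] for d in range(counts.ndim)], axis=1)
    # Uniform point inside each chosen bin.
    return rng.uniform(lows, highs)

rng = np.random.default_rng(0)
points = rng.uniform(size=(100, 3)) * np.array([32, 64, 64])
counts, edges = np.histogramdd(points, range=[(0, 32), (0, 64), (0, 64)], bins=4)

samples = sample_from_histo((counts, edges), size=5, rng=rng)
print(samples.shape)  # (5, 3)
```

Bins with zero counts are never selected, so locations where no points were observed produce no samples.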

The pipeline can be used as a generator. Let's iterate over the dataset with ```batch_size=5```:

In [None]:
gen = (lunaset >> pipe).gen_batch(batch_size=5, n_epochs=None)

In [None]:
nods = next(gen)

### Get only (and all) cancerous nodules

While sampling nodules, it is possible to get only cancerous or only non-cancerous crops in a batch. Let's create a subset of 5 CT scans, load it, and run preprocessing:

In [None]:
bch = CTIMB(lunaix.create_subset(lunaix.indices[[100, 110, 120, 130, 140]]))

bch = bch.load(fmt='raw', components='images')

bch = bch.fetch_nodules_info(nodules)
bch = bch.create_mask()
bch = bch.unify_spacing(shape=(384, 448, 448), spacing=[1.7, 0.9, 0.9])

Next, you can use ```sample_nodules``` with ```share=1``` and ```batch_size=None```. The resulting batch then consists of crops of all nodules marked in the annotations for these patients' scans.

In [None]:
crop_bch = bch.sample_nodules(nodule_size=(32, 64, 64), batch_size=None, share=1,
                                 variance=(49, 196, 196))

In [None]:
crop_bch.num_nodules == len(crop_bch)

You can also set any other ```share```, say 0.6, together with ```batch_size=None```. The resulting batch then consists of crops of all nodules marked in the annotations for these patients' scans AND some additional random crops without nodules.

In [None]:
crop_bch = bch.sample_nodules(nodule_size=(32, 64, 64), batch_size=None, share=0.6,
                                 variance=(49, 196, 196))

In [None]:
print('number of nodules:', crop_bch.num_nodules, ', total number of crops in batch:', len(crop_bch))
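
As a rough check of the numbers printed above, the arithmetic behind ```share``` can be sketched as follows. This assumes that ```share``` is honored exactly as the fraction of crops with nodules; the rounding RadIO actually applies may differ, and `expected_batch_composition` is a hypothetical helper, not part of the library.

```python
def expected_batch_composition(num_nodules, share):
    """Crops with and without nodules if `share` is honored exactly (an assumption)."""
    total = int(round(num_nodules / share))
    return num_nodules, total - num_nodules

with_nodules, without_nodules = expected_batch_composition(num_nodules=6, share=0.6)
print(with_nodules, without_nodules)  # 6 4
```

With ```share=1``` the same formula gives zero crops without nodules, which matches the previous example where every crop contains a nodule.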

You can find crops with nodules through the batch's index. Let's find the third nodule crop:

In [None]:
nodnum = 2

nodix = crop_bch.indices[nodnum]

In [None]:
# plot_arr_slices is a plotting helper assumed to be defined or imported elsewhere
interact(lambda height: plot_arr_slices(height,
                                        crop_bch.get(nodix, 'images'),
                                        crop_bch.get(nodix, 'masks'),
                                        crop_bch.get(nodix, 'masks')),
         height=(0.01, 0.99, 0.01))