# Phase 1 Answers
```python
from munge.Dataset import Dataset

dataset = Dataset('config.json')
all_data = dataset.get_all()  # all images with their corresponding contours
study_data = dataset.get_by_study('SCD0000101')  # images and corresponding contours for the given study
dataset.plot_verification_for_study('SCD0000101')  # plots the verification for the given study
```
- **How did you verify that you are parsing the contours correctly?**
**Verification by unit tests:** I have written unit tests, which can be found in `tests/test_dataset.py`. In the function `test_dcm_contour_mapping()`, I assert that, for randomly chosen contours, the correct DICOM images are picked.
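The mapping check above can be sketched as follows. This is an illustrative stand-in, not the production code: the helper name and the assumption that a contour's filename encodes the instance number of its DICOM image are mine, based on the naming convention commonly seen in this kind of dataset.

```python
# Hypothetical sketch of the check in test_dcm_contour_mapping().
# Assumes contour files are named like 'IM-0001-0048-icontour-manual.txt',
# where the third field (0048) identifies the DICOM image '48.dcm'.

def build_mapping(contour_files):
    """Map each contour filename to the DICOM image it annotates."""
    mapping = {}
    for name in contour_files:
        instance = int(name.split("-")[2])  # e.g. '0048' -> 48
        mapping[name] = f"{instance}.dcm"
    return mapping

contours = ["IM-0001-0048-icontour-manual.txt",
            "IM-0001-0059-icontour-manual.txt"]
mapping = build_mapping(contours)
assert mapping["IM-0001-0048-icontour-manual.txt"] == "48.dcm"
```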
**Manual verification by visualization:** Even though the unit tests make sure the function does what it is supposed to, qualitative verification is important for image annotations. For this purpose I have written a function in the `Dataset` class that plots the images with the contours overlaid on them as patches. For trained sonographers, this gives a quick overview of whether the labelling is correct.
Apart from the images, I have also included two plots:
- Relative average intensity: a rough indicator of whether the marked region is correct, assuming that the average intensity of the myocardium area should lie in some range. If the average intensity falls outside this assumed range, we can suspect that the annotator marked it incorrectly.
- Area of contour in sq. mm: as each study is a time series, the contraction/dilation of the valve should trace a sinusoidal/semi-sinusoidal wave. If the marked area is wrong, this plot reveals it.

(The above plots are included just as a sample to make the point that medical information like this can be incorporated to verify the annotation and to quickly identify mistakes, if any.)
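The two metrics behind those plots can be sketched in a few lines. This is a minimal, pure-Python illustration under my own assumptions (the function name and the list-based image representation are not the production interface; the real code presumably works on the parsed DICOM pixel data and the contour mask):

```python
def region_stats(image, mask, pixel_spacing=(1.0, 1.0)):
    """Sanity-check metrics for one annotated frame.

    image: 2-D list of pixel intensities.
    mask: 2-D list of booleans, True inside the contour.
    pixel_spacing: (row_mm, col_mm), e.g. from the DICOM PixelSpacing tag.
    """
    all_px = [v for row in image for v in row]
    in_px = [v for row, mrow in zip(image, mask)
             for v, m in zip(row, mrow) if m]
    # relative average intensity: region mean over whole-image mean
    rel_intensity = (sum(in_px) / len(in_px)) / (sum(all_px) / len(all_px))
    # contour area in sq. mm: pixel count scaled by physical pixel size
    area_mm2 = len(in_px) * pixel_spacing[0] * pixel_spacing[1]
    return rel_intensity, area_mm2

# toy example: a bright 4x4 region inside a dark 10x10 image
img = [[1.0] * 10 for _ in range(10)]
msk = [[False] * 10 for _ in range(10)]
for r in range(3, 7):
    for c in range(3, 7):
        img[r][c] = 5.0
        msk[r][c] = True

ri, area = region_stats(img, msk, pixel_spacing=(1.25, 1.25))
# area = 16 pixels * 1.25mm * 1.25mm = 25.0 sq. mm
```

Plotting `rel_intensity` and `area_mm2` across the frames of a study gives exactly the two curves described above.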
- **What changes did you make to the code, if any, in order to integrate it into our production code base?**
- Modified the given `parse_dicom_file` function to include `width`, `height` and `resolution` in the return value
- Moved `parse_dicom_file` to `utils/image.py` for better organization
- Moved the given `parse_contour_file` and `poly_to_mask` functions to `utils/contour.py` for better organization
- Wrote a `DataElement` class to abstract each data point in the dataset (for more details, refer to the documentation)
- Wrote a `Dataset` class to load the dataset with the DICOM-to-contour mapping (for more details, refer to the documentation)
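To make the `DataElement` abstraction concrete, here is a minimal sketch of what such a class might look like. The field names are my assumptions based on this README (the image, its contour mask, the `width`/`height`/`resolution` added to `parse_dicom_file`, and the UUID mentioned in the logging section below); refer to the `munge` documentation for the real interface:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class DataElementSketch:
    """Illustrative stand-in for the DataElement class; not the production code."""
    image: object            # pixel array returned by parse_dicom_file
    contour: object          # boolean mask produced by poly_to_mask
    width: int               # image width in pixels
    height: int              # image height in pixels
    resolution: tuple        # physical pixel spacing, e.g. (row_mm, col_mm)
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))  # unique id for logging

elem = DataElementSketch(image=None, contour=None,
                         width=256, height=256, resolution=(1.25, 1.25))
```

Giving each element its own UUID is what later lets the pipeline log which data points landed in which batch.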
# Phase 2 Answers
```python
# continuation of the above snippet
from munge.DataLoader import DataLoader

data_loader = DataLoader(dataset)
train_data = data_loader.load_train_data(epochs=10, batch_size=8)
# train_data contains the dataset split into batches of batch_size for each epoch
DataLoader.plot_random_epoch(train_data)
```
- **Did you change anything from the pipeline built in Part 1 to better streamline the pipeline built in Part 2? If so, what? If not, is there anything that you can imagine changing in the future?**
- Added a `DataLoader` class to load the dataset according to epochs and batch size (for more details, refer to the documentation)
- Refer to Future Work
- **How do you/did you verify that the pipeline was working correctly?**
**Verification by unit tests:** I have written unit tests, which can be found in `tests/test_dataloader.py`. In the function `test_load_train_data()`, I assert that, for the given epoch count and batch size, the data is split correctly.
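The shape of that split can be sketched as follows. The helper name is illustrative, not the production `load_train_data` API; it simply mirrors what the loader is described as doing (shuffle once per epoch, then cut into fixed-size batches):

```python
import random

def split_into_batches(items, epochs, batch_size, seed=None):
    """Shuffle the dataset once per epoch and cut it into batches of batch_size."""
    rng = random.Random(seed)
    result = []
    for _ in range(epochs):
        shuffled = items[:]          # copy so each epoch reshuffles independently
        rng.shuffle(shuffled)
        batches = [shuffled[i:i + batch_size]
                   for i in range(0, len(shuffled), batch_size)]
        result.append(batches)
    return result

data = list(range(20))
train = split_into_batches(data, epochs=10, batch_size=8, seed=0)
# 10 epochs; each epoch covers every element exactly once in batches of <= 8
assert len(train) == 10
assert [len(b) for b in train[0]] == [8, 8, 4]
assert sorted(x for batch in train[0] for x in batch) == data
```

The unit test asserts exactly these properties: the epoch count, the batch sizes, and that no element is dropped or duplicated within an epoch.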
**Manual verification of randomness by visualization:** For this purpose I have written a function in the `DataLoader` class that randomly selects an epoch and plots its images batch by batch. This gives a visual indication that the data split is indeed random.
**Manual verification of randomness by log file:** The call to `load_train_data` writes to the log file the UUID of each `DataElement` instance in every epoch and iteration. By checking that the UUIDs differ, we can ensure the training data is shuffled well enough to be fed into a network.
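A minimal sketch of that logging, assuming nothing about the production code beyond what is described above (the function name and log format here are mine):

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("munge.DataLoader")

def log_batch(epoch, iteration, element_uids):
    """Log each element's UUID so the shuffle can be audited after the fact."""
    for elem_uid in element_uids:
        log.info("epoch=%d iter=%d uuid=%s", epoch, iteration, elem_uid)

batch = [str(uuid.uuid4()) for _ in range(3)]
log_batch(0, 0, batch)
assert len(set(batch)) == 3  # distinct UUIDs: the batch holds different elements
```

Grepping the resulting log for a given epoch and comparing the UUID sequences across epochs shows whether the order actually changes.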
- **Given the pipeline you have built, can you see any deficiencies that you would change if you had more time? If not, can you think of any improvements/enhancements to the pipeline that you could build in?**
- Could have used open-source code for generating the splits. I tried `keras.preprocessing.image.ImageDataGenerator`, but in our case the data points are instances of `DataElement`, which that class could not handle
- Even though I have used `yield` wherever possible, for huge datasets the code needs to be refactored to work in parallel
- Refer to Future Work
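The `yield`-based approach mentioned above can be sketched like this. A minimal generator under my own naming (the real loader works on `DataElement` instances rather than plain values):

```python
def batch_generator(elements, batch_size):
    """Yield batches lazily so a huge dataset never sits in memory all at once."""
    batch = []
    for elem in elements:
        batch.append(elem)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # final partial batch, if the dataset size isn't a multiple
        yield batch

batches = list(batch_generator(range(10), batch_size=4))
assert [len(b) for b in batches] == [4, 4, 2]
```

Because the generator only ever holds one batch, memory use stays constant in the dataset size; parallelizing it (e.g. prefetching the next batch while the current one trains) is the refactor referred to above.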
- The main source code is in the `munge` folder, which contains the classes. For more information about these modules, refer to the documentation:
  - `DataElement`
  - `DataLoader`
  - `Dataset`
  - `utils`
    - `contour`
    - `image`
    - `misc`
- The tests are in the `tests` folder. To run the tests and generate a coverage report, run the following command:

  ```shell
  $ pytest --cov=munge --cov-report=html tests/
  ```

  The HTML coverage report can be found in `htmlcov/index.html`. Currently the code is 93% covered.
# Future Work
- More exception handling
- Performance testing
- Data cleaning: adaptive histogram equalization / mean normalization
- Integration with LogDNA for better log monitoring
- Unit tests are not exhaustive; only the critical functionality is tested. More coverage can be added in the future
- Dockerize the app