## Approach

The original VinDr-Mammo dataset has more than 20000 images. This smaller dataset has only 1426 images. It's made up of all images that contain masses plus 200 images that don't contain masses.

1. Group by study_id. The dataframe contains duplicate image_ids: when there is more than one bbox on an image, each bbox is on a separate row, and all of those rows have the same image_id. We group by study_id so that every row for an image (and every image from the same study) stays in the same fold, i.e. one bbox does not end up in the train set and another in the val set.

2. Stratify by density. This dataset contains 1148 Density C images. In my experiments the model struggled to correctly detect masses on breasts with a high density. Stratifying keeps the density mix similar across folds, so the val fold does not contain only hard examples.

3. To create a dataset for training and validation, I chose all images that contain masses and 200 images that don't have masses. This was a total of 1426 dicom files, far fewer than the more than 20000 dicom files in the VinDr-Mammo dataset. YOLO only needs about 10% normal images. In my experiments there were many false positives, so I increased the number of normal images to more than 10% of the number of images with masses.

4. I converted the 1426 dicom files to png images (Exp06) before uploading them to a private Kaggle dataset. Converting to png reduced the size of the data that needed to be uploaded.

5. I'm using 10 folds, so the val set is approx. 11% of the size of the train set. The task is hard and the dataset is small. Therefore, I'm trying to put as many images as possible in the train set.

6. I will only be using fold 0 to train and validate the model.
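The proportions mentioned in the steps above can be checked with a few lines of arithmetic. This is only a sketch: the counts are copied from the row counts printed later in this notebook (1226 mass rows, 200 normal rows), and the variable names are made up for illustration.

```python
# Row counts taken from the notebook outputs (target == 1 and target == 0).
n_mass = 1226
n_normal = 200
n_folds = 10

# Normal rows relative to mass rows.
normal_ratio = n_normal / n_mass

# With k equal folds, one fold holds 1/k of the data,
# so val is 1/(k - 1) of the size of the train set.
val_fraction = 1 / n_folds
val_to_train = 1 / (n_folds - 1)

print(f"normal / mass: {normal_ratio:.1%}")
print(f"val share of all rows: {val_fraction:.1%}")
print(f"val / train: {val_to_train:.1%}")
```

The folds are not perfectly equal in size because whole studies are kept together, so the measured ratio can differ slightly from 1/(k - 1).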


In [1]:
import pandas as pd
import numpy as np
import os

import shutil
# tqdm_notebook is deprecated; tqdm.notebook provides the same progress bar
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt

# Don't Show Warning Messages
import warnings
warnings.filterwarnings('ignore')

In [2]:
base_path = '../input/smart-mammo-data/'

In [3]:
os.listdir('../input/')

['smart-mammo-data']

In [4]:
NUM_FOLDS = 10

CHOSEN_FOLD = 0

NUM_CORES = os.cpu_count()
NUM_CORES

4

## Load the data

In [5]:
# df_data.csv was prepared locally in Exp06.

path = base_path + 'df_data.csv'

df_data = pd.read_csv(path)

print(df_data.shape)

df_data.head()

(1426, 19)


Unnamed: 0,study_id,series_id,image_id,laterality,view_position,height,width,breast_birads,breast_density,finding_categories,finding_birads,xmin,ymin,xmax,ymax,split,path,num_findings,target
0,48575a27b7c992427041a82fa750d3fa,26de4993fa6b8ae50a91c8baf49b92b0,4e3a578fe535ea4f5258d3f7f4419db8,R,CC,3518,2800,BI-RADS 4,DENSITY C,['Mass'],BI-RADS 4,2355.139893,1731.640015,2482.97998,1852.75,training,/Volumes/WDExtDrive/Woza-Mammogram-Analyzer/Ra...,1,1
1,48575a27b7c992427041a82fa750d3fa,26de4993fa6b8ae50a91c8baf49b92b0,dac39351b0f3a8c670b7f8dc88029364,R,MLO,3518,2800,BI-RADS 4,DENSITY C,['Mass'],BI-RADS 4,2386.679932,1240.609985,2501.800049,1354.040039,training,/Volumes/WDExtDrive/Woza-Mammogram-Analyzer/Ra...,1,1
2,5683854eafabc34f6d854000d2ac6c2d,4ac33111294b83d43537cb8604b0808c,2f944efb1cb9579442df2d7fe6a579b7,L,CC,3518,2800,BI-RADS 3,DENSITY C,['Mass'],BI-RADS 3,142.899002,2171.810059,439.584991,2403.370117,test,/Volumes/WDExtDrive/Woza-Mammogram-Analyzer/Ra...,1,1
3,5683854eafabc34f6d854000d2ac6c2d,4ac33111294b83d43537cb8604b0808c,7385e8cf7b29764525c81de4aa1aebe4,L,MLO,3518,2800,BI-RADS 3,DENSITY C,['Mass'],BI-RADS 3,142.899002,2045.170044,417.876007,2265.879883,test,/Volumes/WDExtDrive/Woza-Mammogram-Analyzer/Ra...,1,1
4,7c51789da6c462e55bcb95c2a7d437ee,ac4d0771f6d7a7400ab463458f789dbe,f581ef53bb7e61f4575db33eceac8ff8,L,CC,3518,2800,BI-RADS 4,DENSITY C,"['Nipple Retraction', 'Mass']",BI-RADS 4,588.874023,1397.709961,812.362,1734.719971,training,/Volumes/WDExtDrive/Woza-Mammogram-Analyzer/Ra...,2,1


In [6]:
df_data['target'].value_counts()

1    1226
0     200
Name: target, dtype: int64

## Process the train data

In [7]:
# Build the Kaggle path to each png image.
# This overwrites the local paths stored in df_data.csv.

def get_path(row):

    image_id = row['image_id']
    path = base_path + 'images_dir/images_dir/' + image_id + '.png'

    return path


df_data['path'] = df_data.apply(get_path, axis=1)

print(df_data.shape)

#df_data.head()

(1426, 19)


In [8]:
# Check the target distribution

df_data['target'].value_counts()

1    1226
0     200
Name: target, dtype: int64

In [9]:
# Check the density distribution

df_data['breast_density'].value_counts()

DENSITY C    1148
DENSITY B     174
DENSITY D      98
DENSITY A       6
Name: breast_density, dtype: int64

## Create df_normal and df_mass

In [10]:
# Select all normal images (target == 0)
df_normal = df_data[df_data['target'] == 0]

# Select all mass images (target == 1)
df_mass = df_data[df_data['target'] == 1]

df_normal = df_normal.reset_index(drop=True)
df_mass = df_mass.reset_index(drop=True)

print(df_normal.shape)
print(df_mass.shape)

(200, 19)
(1226, 19)


## Create the normal folds

In [11]:
from sklearn.model_selection import StratifiedGroupKFold

skf = StratifiedGroupKFold(n_splits=NUM_FOLDS, shuffle=True, random_state=101)

# split() returns positional indices. They match the .loc labels here
# because df_normal was reset with reset_index(drop=True).
for fold, (_, val_idx) in enumerate(skf.split(X=df_normal, y=df_normal.breast_density, groups=df_normal.study_id)):
    df_normal.loc[val_idx, "fold"] = fold
        
df_normal['fold'].value_counts()

2.0    21
8.0    21
5.0    20
3.0    20
7.0    20
4.0    20
1.0    20
0.0    20
6.0    19
9.0    19
Name: fold, dtype: int64

## Create the mass folds

In [12]:
from sklearn.model_selection import StratifiedGroupKFold

skf = StratifiedGroupKFold(n_splits=NUM_FOLDS, shuffle=True, random_state=101)

# split() returns positional indices. They match the .loc labels here
# because df_mass was reset with reset_index(drop=True).
for fold, (_, val_idx) in enumerate(skf.split(X=df_mass, y=df_mass.breast_density, groups=df_mass.study_id)):
    df_mass.loc[val_idx, "fold"] = fold
        
df_mass['fold'].value_counts()

3.0    126
1.0    126
0.0    126
6.0    126
7.0    123
9.0    122
4.0    121
8.0    121
5.0    118
2.0    117
Name: fold, dtype: int64
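The fold labels above print as 0.0, 1.0, and so on. A likely reason is that assigning through `.loc` to a column that does not exist yet first creates it filled with NaN, which makes the column float. The sketch below reproduces this on a toy dataframe (the `toy` frame and its values are invented for illustration) and shows one way to cast the labels to integers once every row has a fold.

```python
import pandas as pd

# Toy stand-in for df_normal / df_mass.
toy = pd.DataFrame({"image_id": ["a", "b", "c", "d"]})

# The "fold" column does not exist yet, so pandas creates it with NaN
# for the rows that are not assigned, which forces a float dtype.
toy.loc[[0, 1], "fold"] = 0
toy.loc[[2, 3], "fold"] = 1
dtype_before = toy["fold"].dtype

# Once every row has a fold, the column can be cast to integers.
toy["fold"] = toy["fold"].astype(int)
dtype_after = toy["fold"].dtype

print(dtype_before, dtype_after)
```

The float labels are harmless for the comparisons used later (`df_data['fold'] != fold_index`), so the cast is optional.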

## Concat the dataframes

In [13]:
df_data = pd.concat([df_normal, df_mass], axis=0)

df_data = df_data.reset_index(drop=True)

df_data.shape

(1426, 20)

In [14]:
# Select the fold to validate on. All other folds are used for training.

fold_index = CHOSEN_FOLD

df_train = df_data[df_data['fold'] != fold_index]
df_val = df_data[df_data['fold'] == fold_index]

print('Train')
print(len(df_train))
print(df_train['target'].value_counts())
print('')
print('Val')
print(len(df_val))
print(df_val['target'].value_counts())

Train
1280
1    1100
0     180
Name: target, dtype: int64

Val
146
1    126
0     20
Name: target, dtype: int64


In [15]:
# Calculate the size of the val set relative to the train set

len(df_val)/len(df_train)

0.1140625

In [16]:
# Train set breast density distribution

df_train['breast_density'].value_counts()

DENSITY C    1035
DENSITY B     158
DENSITY D      81
DENSITY A       6
Name: breast_density, dtype: int64

In [17]:
# Val set breast density distribution

df_val['breast_density'].value_counts()

DENSITY C    113
DENSITY D     17
DENSITY B     16
Name: breast_density, dtype: int64
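Raw counts are hard to compare between sets of different sizes, and the val set above has no DENSITY A rows at all (all 6 are in the train set). A minimal sketch that puts the two distributions side by side as proportions, using the counts printed above (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
train_counts = pd.Series(
    {"DENSITY C": 1035, "DENSITY B": 158, "DENSITY D": 81, "DENSITY A": 6}
)
val_counts = pd.Series({"DENSITY C": 113, "DENSITY D": 17, "DENSITY B": 16})

# Building the frame aligns both Series on the density label;
# a label missing from one set becomes NaN, which is filled with 0.
proportions = pd.DataFrame({
    "train": train_counts / train_counts.sum(),
    "val": val_counts / val_counts.sum(),
}).fillna(0.0)

print(proportions.round(3))
```

On the real dataframes the same table could be built from `df_train['breast_density'].value_counts(normalize=True)` and the equivalent call on `df_val`.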

## Save the dataframe containing the folds

In [18]:
df_data.head(2)

Unnamed: 0,study_id,series_id,image_id,laterality,view_position,height,width,breast_birads,breast_density,finding_categories,finding_birads,xmin,ymin,xmax,ymax,split,path,num_findings,target,fold
0,517563a43eab4f90efc650fcb0d215d5,37d2595b7ec81bd31a18d26afdb0aabf,c54fbe3a374b9e157280479c4649b11a,L,MLO,2812,2012,BI-RADS 1,DENSITY C,['No Finding'],,,,,,test,../input/smart-mammo-data/images_dir/images_di...,1,0,2.0
1,f8894c9173aaa865bd43751eeeacd3b0,1225d3a302306aca94552c0f8c30fe2c,78dfa5f3dd380341cc17955f52279728,R,MLO,3518,2800,BI-RADS 1,DENSITY C,['No Finding'],,,,,,test,../input/smart-mammo-data/images_dir/images_di...,1,0,2.0


In [19]:
# Save df_data

path = 'df_data_w_folds.csv'
df_data.to_csv(path, index=False)

In [20]:
!ls

__notebook__.ipynb  df_data_w_folds.csv


## Save the train and val images in folders

In [21]:
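# Create the train images directory.
# exist_ok=True lets the cell be re-run without raising an error.
train_images_dir = 'train_images_dir'
os.makedirs(train_images_dir, exist_ok=True)

# Create the val images directory
val_images_dir = 'val_images_dir'
os.makedirs(val_images_dir, exist_ok=True)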

In [22]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

In [23]:
df_train.loc[0, 'path']

'../input/smart-mammo-data/images_dir/images_dir/c54fbe3a374b9e157280479c4649b11a.png'

In [24]:
# Copy the train images

df = df_train

image_id_list = list(df['image_id'])

for i, image_id in tqdm(enumerate(image_id_list), total=len(image_id_list)):
    
    fname = image_id + '.png'
    
    src = df.loc[i, 'path']
    dst = os.path.join(train_images_dir, fname)
    
    shutil.copyfile(src, dst)


In [25]:
# Copy the val images

df = df_val

image_id_list = list(df['image_id'])

for i, image_id in tqdm(enumerate(image_id_list), total=len(image_id_list)):
    
    fname = image_id + '.png'
    
    src = df.loc[i, 'path']
    dst = os.path.join(val_images_dir, fname)
    
    shutil.copyfile(src, dst)


In [26]:
!ls

__notebook__.ipynb  df_data_w_folds.csv  train_images_dir  val_images_dir
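`!ls` only shows that the folders exist. A small helper can confirm that every image_id actually has a png in its folder. This is a sketch: `find_missing` is a hypothetical helper, not part of the original notebook, and it is demonstrated here on a temporary folder rather than on the real directories.

```python
import os
import tempfile


def find_missing(image_ids, images_dir):
    """Return the image_ids that have no matching .png file in images_dir."""
    present = set(os.listdir(images_dir))
    # set() removes duplicate image_ids (one row per bbox in the dataframe).
    return sorted(i for i in set(image_ids) if i + ".png" not in present)


# Toy demonstration: two files exist, one image_id has no file.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["a.png", "b.png"]:
        open(os.path.join(tmp_dir, name), "w").close()
    missing = find_missing(["a", "b", "b", "c"], tmp_dir)

print(missing)
```

In this notebook the call would look like `find_missing(df_train['image_id'], 'train_images_dir')`, and an empty list would mean that all train images were copied.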


In [27]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.
