## About the data
*This dataset contains brain MR images together with manual FLAIR abnormality segmentation masks.
The images were obtained from The Cancer Imaging Archive (TCIA).
They correspond to 110 patients included in The Cancer Genome Atlas (TCGA) lower-grade glioma collection with at least fluid-attenuated inversion recovery (FLAIR) sequence and genomic cluster data available.
Tumor genomic clusters and patient data are provided in the `data.csv` file.*

Download link: https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation



## LGG Segmentation Dataset


All images are provided in `.tif` format with 3 channels per image.
For 101 cases, all 3 sequences are available: pre-contrast, FLAIR, and post-contrast (in this channel order).
For 9 cases, the post-contrast sequence is missing, and for 6 cases, the pre-contrast sequence is missing.
Missing sequences are replaced with the FLAIR sequence so that every image has 3 channels.
Masks are binary, 1-channel images.
They segment the FLAIR abnormality present in the FLAIR sequence (available for all cases).
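
As a quick illustration of the channel layout described above, the sketch below splits a 3-channel slice into its sequences and flags non-FLAIR channels that are pixel-identical to FLAIR, which hints that the sequence was missing and filled in. The helper names `split_sequences` and `flair_filled_channels` are hypothetical, and a small synthetic array stands in for a real `.tif` slice (which could be read with `skimage.io.imread`).

```python
import numpy as np

# Channel order as stated in the dataset description.
SEQUENCE_NAMES = ("pre_contrast", "flair", "post_contrast")


def split_sequences(image):
    """Split an (H, W, 3) slice into a dict of 2-D arrays keyed by sequence name."""
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an array of shape (H, W, 3)")
    return {name: image[:, :, i] for i, name in enumerate(SEQUENCE_NAMES)}


def flair_filled_channels(image):
    """Return names of non-FLAIR channels that are pixel-identical to FLAIR."""
    seqs = split_sequences(image)
    return [
        name
        for name in ("pre_contrast", "post_contrast")
        if np.array_equal(seqs[name], seqs["flair"])
    ]


# Synthetic stand-in for a real slice: post-contrast is a copy of FLAIR.
flair = np.arange(16, dtype=np.uint8).reshape(4, 4)
pre = np.zeros((4, 4), dtype=np.uint8)
example = np.stack([pre, flair, flair], axis=-1)

print(flair_filled_channels(example))
```

A match on a single slice is only a hint: a completely blank slice would have identical channels regardless, so a per-case check would need to look at several slices.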


The dataset is organized into 110 folders, each named after a case ID that encodes the source institution.
Each folder contains MR images with the following naming convention:

`TCGA_<institution-code>_<patient-id>_<slice-number>.tif`

Corresponding masks have a `_mask` suffix.
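
The naming convention can be parsed mechanically. The following is a minimal sketch, not part of the dataset tooling: `parse_slice_name` and `mask_name_for` are hypothetical helpers, and the regular expression is an assumption that also tolerates one extra numeric field before the slice number, as in file names such as `TCGA_CS_4941_19960909_1.tif`.

```python
import re
from pathlib import PureWindowsPath

# Assumed pattern: TCGA_<institution>_<patient>[_<extra digits>]_<slice>[_mask].tif
SLICE_NAME = re.compile(
    r"^TCGA_(?P<institution>[A-Z0-9]+)_(?P<patient>[A-Z0-9]+)"
    r"(?:_(?P<extra>\d+))?_(?P<slice>\d+)(?P<mask>_mask)?\.tif$"
)


def parse_slice_name(path):
    """Return the fields encoded in an image or mask file name."""
    name = PureWindowsPath(path).name  # accepts both "\\" and "/" separators
    match = SLICE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return {
        "institution": match["institution"],
        "patient": match["patient"],
        "slice": int(match["slice"]),
        "is_mask": match["mask"] is not None,
    }


def mask_name_for(image_name):
    """Derive the mask file name by inserting `_mask` before the extension."""
    stem, ext = image_name.rsplit(".", 1)
    return f"{stem}_mask.{ext}"


print(parse_slice_name("TCGA_CS_4941_19960909_10_mask.tif"))
print(mask_name_for("TCGA_CS_4941_19960909_10.tif"))
```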

In [2]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import zipfile
import cv2
import plotly.express as px
plt.style.use("ggplot")

In [3]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras import layers, optimizers
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import glorot_uniform
from tensorflow.keras.utils import plot_model
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint, LearningRateScheduler
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.image import load_img, img_to_array 
from tensorflow.keras.models import load_model
from skimage import io

In [5]:
path = "C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m"

data = pd.read_csv(os.path.join(path, "data.csv"))

data.head()

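| | Patient | RNASeqCluster | MethylationCluster | miRNACluster | CNCluster | RPPACluster | OncosignCluster | COCCluster | histological_type | neoplasm_histologic_grade | tumor_tissue_site | laterality | tumor_location | gender | age_at_initial_pathologic | race | ethnicity | death01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCGA_CS_4941 | 2.0 | 4.0 | 2 | 2.0 | NaN | 3.0 | 2 | 1.0 | 2.0 | 1.0 | 3.0 | 2.0 | 2.0 | 67.0 | 3.0 | 2.0 | 1.0 |
| 1 | TCGA_CS_4942 | 1.0 | 5.0 | 2 | 1.0 | 1.0 | 2.0 | 1 | 1.0 | 2.0 | 1.0 | 3.0 | 2.0 | 1.0 | 44.0 | 2.0 | NaN | 1.0 |
| 2 | TCGA_CS_4943 | 1.0 | 5.0 | 2 | 1.0 | 2.0 | 2.0 | 1 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | 37.0 | 3.0 | NaN | 0.0 |
| 3 | TCGA_CS_4944 | NaN | 5.0 | 2 | 1.0 | 2.0 | 1.0 | 1 | 1.0 | 1.0 | 1.0 | 3.0 | 6.0 | 2.0 | 50.0 | 3.0 | NaN | 0.0 |
| 4 | TCGA_CS_5393 | 4.0 | 5.0 | 2 | 1.0 | 2.0 | 3.0 | 1 | 1.0 | 2.0 | 1.0 | 1.0 | 6.0 | 2.0 | 39.0 | 3.0 | NaN | 0.0 |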


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Patient                    110 non-null    object 
 1   RNASeqCluster              92 non-null     float64
 2   MethylationCluster         109 non-null    float64
 3   miRNACluster               110 non-null    int64  
 4   CNCluster                  108 non-null    float64
 5   RPPACluster                98 non-null     float64
 6   OncosignCluster            105 non-null    float64
 7   COCCluster                 110 non-null    int64  
 8   histological_type          109 non-null    float64
 9   neoplasm_histologic_grade  109 non-null    float64
 10  tumor_tissue_site          109 non-null    float64
 11  laterality                 109 non-null    float64
 12  tumor_location             109 non-null    float64
 13  gender                     109 non-null    float64
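
The non-null counts reported by `data.info()` can also be summarized as missing-value counts per column. A minimal sketch, using a small stand-in frame with made-up values (only the column names come from `data.csv`); in the notebook the same calls would be applied to `data`:

```python
import numpy as np
import pandas as pd

# Stand-in frame; the values are illustrative only, not taken from data.csv.
demo = pd.DataFrame({
    "Patient": ["P1", "P2", "P3"],
    "RNASeqCluster": [2.0, np.nan, 1.0],
    "RPPACluster": [np.nan, np.nan, 2.0],
})

# Count missing entries per column, largest first, and keep only affected columns.
missing = demo.isna().sum().sort_values(ascending=False)
print(missing[missing > 0])
```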

In [17]:
import glob

data_map = []

# Walk the case folders only; kaggle_3m also holds data.csv and README.md,
# which would make os.listdir fail if they were treated as directories.
for sub_dir in glob.glob(os.path.join(path, "*")):
    if not os.path.isdir(sub_dir):
        continue
    dir_name = os.path.basename(sub_dir)
    for file_name in os.listdir(sub_dir):
        image_path = os.path.join(sub_dir, file_name)
        # Flat list of alternating [folder name, file path] entries
        data_map.extend([dir_name, image_path])


In [31]:
data_map[1::2]

['C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_1.tif',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_10.tif',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_10_mask.tif',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_11.tif',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_11_mask.tif',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_12.tif',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\

In [32]:
img_path_df = pd.DataFrame({'patient_ID': data_map[::2],
                            'path': data_map[1::2]})
img_path_df.head()

| | patient_ID | path |
|---|---|---|
| 0 | TCGA_CS_4941_19960909 | C:\Users\temp\ML\Projects\Mask_RCNN_Brain_MRIs... |
| 1 | TCGA_CS_4941_19960909 | C:\Users\temp\ML\Projects\Mask_RCNN_Brain_MRIs... |
| 2 | TCGA_CS_4941_19960909 | C:\Users\temp\ML\Projects\Mask_RCNN_Brain_MRIs... |
| 3 | TCGA_CS_4941_19960909 | C:\Users\temp\ML\Projects\Mask_RCNN_Brain_MRIs... |
| 4 | TCGA_CS_4941_19960909 | C:\Users\temp\ML\Projects\Mask_RCNN_Brain_MRIs... |
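
Because images and masks are listed together and in lexicographic order (`_1`, `_10`, `_10_mask`, `_11`, ...), a natural next step is to pair each image with its mask and sort by slice number. The sketch below is one possible way to do that, not necessarily what the rest of the notebook does: `slice_number` and `pair_images_and_masks` are hypothetical helpers, and the shortened paths in `demo` are stand-ins for the real entries in `img_path_df`.

```python
from pathlib import PureWindowsPath

import pandas as pd


def slice_number(path):
    """Extract the trailing slice number from an image or mask file name."""
    stem = PureWindowsPath(path).stem
    if stem.endswith("_mask"):
        stem = stem[: -len("_mask")]
    return int(stem.rsplit("_", 1)[1])


def pair_images_and_masks(paths_df):
    """Return one row per slice with `image_path` and `mask_path` columns."""
    df = paths_df.copy()
    df["is_mask"] = df["path"].str.endswith("_mask.tif")
    df["slice"] = df["path"].map(slice_number)
    keys = ["patient_ID", "slice"]
    images = df.loc[~df["is_mask"], keys + ["path"]].rename(columns={"path": "image_path"})
    masks = df.loc[df["is_mask"], keys + ["path"]].rename(columns={"path": "mask_path"})
    # Inner join keeps only slices that have both an image and a mask.
    paired = images.merge(masks, on=keys, how="inner")
    return paired.sort_values(keys).reset_index(drop=True)


# Shortened, hypothetical paths in the same lexicographic order as os.listdir output.
demo = pd.DataFrame({
    "patient_ID": ["TCGA_CS_4941_19960909"] * 4,
    "path": [
        r"kaggle_3m\TCGA_CS_4941_19960909\TCGA_CS_4941_19960909_10.tif",
        r"kaggle_3m\TCGA_CS_4941_19960909\TCGA_CS_4941_19960909_10_mask.tif",
        r"kaggle_3m\TCGA_CS_4941_19960909\TCGA_CS_4941_19960909_2.tif",
        r"kaggle_3m\TCGA_CS_4941_19960909\TCGA_CS_4941_19960909_2_mask.tif",
    ],
})
paired = pair_images_and_masks(demo)
print(paired[["patient_ID", "slice"]])
```

In the notebook, the same function could be called as `pair_images_and_masks(img_path_df)`, assuming the frame keeps the `patient_ID` and `path` column names used above.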


['TCGA_CS_4941_19960909',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_1.tif',
 'TCGA_CS_4941_19960909',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_10.tif',
 'TCGA_CS_4941_19960909',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_10_mask.tif',
 'TCGA_CS_4941_19960909',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_11.tif',
 'TCGA_CS_4941_19960909',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmentation\\kaggle_3m\\TCGA_CS_4941_19960909\\TCGA_CS_4941_19960909_11_mask.tif',
 'TCGA_CS_4941_19960909',
 'C:\\Users\\temp\\ML\\Projects\\Mask_RCNN_Brain_MRIs\\archive\\lgg-mri-segmenta