# Brain MRI segmentation
In this notebook, we will train and deploy a segmentation model using AWS SageMaker.

## Setup
### Import packages

In [1]:
import os

from sklearn.model_selection import train_test_split
import yaml

from src.data import create_kaggle_token_file, setup_brain_mri_dataset

### Import configuration file

The `config.yaml` file contains some important parameters that we will use throughout the notebook. For instance, it includes `DATA_ROOT_PATH`, which specifies where we will store the data locally. Currently this is set to `./data` (i.e. a folder named `data` will be created inside the folder where you are running this notebook), but you can choose a different location if you wish.

In [2]:
with open("config.yaml", "r") as f:
    # safe_load only builds plain Python objects, which is all this config needs
    config = yaml.safe_load(f)

In [3]:
config

{'DATA_ROOT_PATH': './data',
 'DATA_ZIP_FILENAME': 'lgg-mri-segmentation.zip',
 'DUMMY_IAM_ROLE': 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001',
 'KAGGLE_DATASET': 'mateuszbuda/lgg-mri-segmentation'}
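
Since the rest of the notebook relies on these keys, it can help to check them up front. The following is a minimal sketch, not part of the project code: `zip_path_from_config` and `REQUIRED_KEYS` are hypothetical names, and I am assuming the downloaded zip file lives directly under `DATA_ROOT_PATH`.

```python
import os

# Keys the data setup steps rely on (hypothetical list, based on the config shown above)
REQUIRED_KEYS = ("DATA_ROOT_PATH", "DATA_ZIP_FILENAME", "KAGGLE_DATASET")


def zip_path_from_config(config):
    """Check that the required keys exist and return the expected zip file path."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    # Assumption: the zip file is stored directly inside DATA_ROOT_PATH
    return os.path.join(config["DATA_ROOT_PATH"], config["DATA_ZIP_FILENAME"])
```

Calling `zip_path_from_config(config)` with the configuration above would give the path of the zip file inside the `./data` folder.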

# Dataset
Here we will download the data from Kaggle. First we need to get your Kaggle API token.

## Retrieve Kaggle user token

To download the dataset using the [Kaggle CLI](https://github.com/Kaggle/kaggle-api) we need to create an API token. So that we don't need to recreate this every time we want to interact with Kaggle, we can place it in a file located at `~/.kaggle/kaggle.json` (where `~` represents the user's home directory). Here we can use the `create_kaggle_token_file` function to create this file and make sure it has the correct permissions.

Before running this you will need to log in to your Kaggle account and create a new API token (if you don't already have one).

1. Go to your Kaggle account

Sign into [Kaggle](https://www.kaggle.com/) and click on your profile picture (top right corner). Select `Account`.

<img src="./imgs/create-kaggle-api1.png">

2. Create new API token

Now scroll down to the API section and click on `Create New API Token`.

<img src="./imgs/create-kaggle-api2.png">

Now run the `create_kaggle_token_file` function below and, when prompted, enter your Kaggle username and the token you just downloaded.

In [4]:
create_kaggle_token_file()

2022-08-06 12:21:24,021 Kaggle token file already created in /home/robsmith155/.kaggle/kaggle.json.
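
The implementation of `create_kaggle_token_file` is not shown in this notebook, so the following is only a rough sketch of what such a helper might do. `write_kaggle_token` and its `kaggle_dir` argument are hypothetical names. The sketch writes the username and key as JSON to `kaggle.json` and restricts the file to owner read/write, since the Kaggle CLI warns when the token file is readable by other users.

```python
import json
import os
import stat
from pathlib import Path


def write_kaggle_token(username, key, kaggle_dir=None):
    """Write a Kaggle API token file and restrict its permissions (illustrative sketch)."""
    # Default to ~/.kaggle, the location the Kaggle CLI looks in
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)

    token_path = kaggle_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))

    # Owner read/write only (equivalent to chmod 600)
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Unlike the project function, this sketch takes the credentials as arguments rather than prompting for them, which makes it easier to try against a temporary directory first.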


## Download data

Now we can download and extract the dataset from [Kaggle](https://www.kaggle.com/datasets/mateuszbuda/lgg-mri-segmentation). In the `src.data` module I've put a function named `setup_brain_mri_dataset`. This will perform the following tasks:
- Download the dataset from Kaggle
- Extract the contents of the downloaded zip file
- Sort the data into train, val and test dataset folders
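
As an illustration of the extraction step in the list above, here is a minimal sketch using Python's standard `zipfile` module. `extract_zip` is a hypothetical helper, and the real `setup_brain_mri_dataset` may do this differently.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of a zip archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())
```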

Note that the data downloaded from Kaggle has not been split into train, val and test datasets for us, so the setup function does this as well. I ran some tests beforehand to make sure that the proportion of positive and negative slices is approximately consistent between the datasets. We will revisit this in the EDA section of the notebook.
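
To make the splitting idea concrete, here is a small sketch of a patient-level split together with a check of the positive-slice fraction in each split. It is not the code used in `src.data`: `split_patients`, `positive_fraction`, the split fractions and the seed are all hypothetical, and I use the standard `random` module instead of `train_test_split` to keep the example dependency-free.

```python
import random


def split_patients(patient_dirs, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle patient folders reproducibly and split them into train/val/test."""
    patients = sorted(patient_dirs)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(patients)
    n_val = round(len(patients) * val_frac)
    n_test = round(len(patients) * test_frac)
    return {
        "val": patients[:n_val],
        "test": patients[n_val:n_val + n_test],
        "train": patients[n_val + n_test:],
    }


def positive_fraction(patients, slice_counts):
    """Fraction of slices with a non-empty mask across the given patients.

    slice_counts maps patient -> (positive_slices, total_slices).
    """
    positive = sum(slice_counts[p][0] for p in patients)
    total = sum(slice_counts[p][1] for p in patients)
    return positive / total if total else 0.0
```

Splitting by patient folder rather than by individual slice keeps all slices from one patient in the same dataset. Comparing `positive_fraction` across the three splits is one simple way to check that they are roughly balanced.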

In [5]:
setup_brain_mri_dataset(data_root_path=config['DATA_ROOT_PATH'])

2022-08-06 12:21:25,956 Starting download of mateuszbuda/lgg-mri-segmentation dataset from Kaggle.
Downloading lgg-mri-segmentation.zip to ./data
100%|██████████| 714M/714M [00:58<00:00, 12.9MB/s] 
2022-08-06 12:22:25,896 Finished downloading data. Data downloaded to ./data
2022-08-06 12:22:26,034 Starting extraction of data stored in ./data/lgg-mri-segmentation.zip.
2022-08-06 12:22:36,746 Dataset extracted in ./data
2022-08-06 12:22:36,748 Starting splitting of data into train, val and test datasets.
2022-08-06 12:22:36,750 Created ./data/train directory.
2022-08-06 12:22:36,751 Moved ./data/lgg-mri-segmentation/kaggle_3m/TCGA_DU_A5TY_19970709 to ./data/train.
2022-08-06 12:22:36,752 Moved ./data/lgg-mri-segmentation/kaggle_3m/TCGA_FG_6692_20020606 to ./data/train.
2022-08-06 12:22:36,753 Moved ./data/lgg-mri-segmentation/kaggle_3m/TCGA_CS_6186_20000601 to ./data/train.
2022-08-06 12:22:36,753 Moved ./data/lgg-mri-segmentation/kaggle_3m/TCGA_HT_A616_19991226 to ./data/train.
2022-08-06 12:22:36,754 Moved ./data/lgg-mri-segmentation/kaggle_3m/TCGA_DU_7300_19910814 to ./data/train.
2022-08-06 12:22:36,755 Moved ./data/lgg-mri-segmentation/kaggle_3m/TCGA_CS_5395_19981004 to ./data/train.
2022-08-06 12:22:36,755 Moved ./data/lgg-mri-segmentation/kaggle_3m/TCGA_DU_A5TW_19980228 to ./data/train.
2022-08-06 12:22:36,756 Moved ./data/lgg-mri-segmentation