<a href="https://colab.research.google.com/github/yecatstevir/teambrainiac/blob/main/source/DL/preprocess_to_aws.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing Pipeline
## For 3D Convolutional Neural Network on Group Brain fMRI

This notebook converts fMRI brain images stored as flattened MATLAB files into 4D tensors for CNN training.

To start:
- Mount Google Drive in Colab, clone the fMRI repository, and set up the AWS path used for saving and loading
- Select the desired brain images by subject ID, split into train, validation, and test sets

Pipeline flow for each batch of images:
- Import the selected brain images from the AWS paths listed in `data_path_dict`
- Drop brain images that are unlabeled
- Apply the brain mask, normalize the voxel values, and reshape each run into 4D
- Aggregate the images into tensor-compatible objects for model use
- Upload the resulting dictionary of labels and image tensors to AWS S3
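
As a rough illustration of the mask, normalize, and reshape step in the list above, here is a minimal NumPy sketch. It is not the repository's `mask_normalize_runs_reshape_4d`; the function name `mask_normalize_reshape`, the `(n_timepoints, n_voxels)` layout, the flattened boolean mask, and the toy volume shape are all assumptions made for the example. The per-voxel z-scoring mirrors what `StandardScaler` does column-wise.

```python
import numpy as np

def mask_normalize_reshape(flat_run, flat_mask, volume_shape):
    """Hypothetical helper: z-score in-mask voxels over time, then unflatten.

    flat_run:     (n_timepoints, n_voxels) array, one flattened volume per row (assumed layout)
    flat_mask:    (n_voxels,) boolean array, True for voxels inside the brain (assumed layout)
    volume_shape: (x, y, z) with x * y * z == n_voxels
    """
    flat_run = np.asarray(flat_run, dtype=np.float64)
    flat_mask = np.asarray(flat_mask, dtype=bool)

    # Keep only voxels inside the brain mask.
    in_brain = flat_run[:, flat_mask]

    # Standardize each voxel's time series (population std, as StandardScaler does).
    mean = in_brain.mean(axis=0)
    std = in_brain.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant voxels
    normalized = (in_brain - mean) / std

    # Put the normalized voxels back; everything outside the mask stays zero.
    out = np.zeros_like(flat_run)
    out[:, flat_mask] = normalized

    # Unflatten to (time, x, y, z). Data flattened by MATLAB is column-major,
    # so real data may need order='F' here; C order is used for this toy example.
    return out.reshape((flat_run.shape[0],) + tuple(volume_shape))

# Toy data: 10 timepoints of a 4 x 5 x 3 volume, with 30 of 60 voxels in the mask.
rng = np.random.default_rng(0)
volume_shape = (4, 5, 3)
n_voxels = 4 * 5 * 3
flat_run = rng.normal(loc=100.0, scale=15.0, size=(10, n_voxels))
flat_mask = np.zeros(n_voxels, dtype=bool)
flat_mask[10:40] = True

run_4d = mask_normalize_reshape(flat_run, flat_mask, volume_shape)
print(run_4d.shape)
```

Zeroing the out-of-mask voxels keeps every volume the same shape, which is what a 3D CNN expects as input.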
        
  

## Mount Google Drive in Colab and Import Images

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')  

In [None]:
# Clone the entire repo.
!git clone https://github.com/yecatstevir/teambrainiac.git

# Change directory into cloned repo DL folder
%cd teambrainiac/source/DL

# !ls

### Load path_config.py to access AWS credentials

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

## Import Packages

In [None]:
# Possible Missing Packages
!pip install boto3
# !pip install scikit-learn
!pip install nilearn

# General Library Imports
import re
import scipy.io
import scipy
import sklearn
import os
import pickle
import numpy as np
import nibabel as nib
from nilearn.signal import clean
import pandas as pd
import boto3
import tempfile
import tqdm
import random
from path_config import mat_path
from botocore.exceptions import ClientError
from collections import defaultdict
from sklearn.preprocessing import StandardScaler

# From Local Directory
from access_data_dl import *
from process_dl import *

# PyTorch Libraries
import torch


## Import Dictionary of Paths to Flat Matlab Images

In [None]:
# Open path dictionary file to get subject ids
path = "../data/data_path_dictionary.pkl"
data_path_dict = open_pickle(path)

## Select Images of Choice to Run Through Pipeline

In [None]:

'''
This keeps a hardcoded dictionary of shuffled ids so that the train, validation, and test sets stay separate.

If the size you are looking for is not here, use the function at the bottom of the cell.

five_ids = {'test': ['10017_08894'],
 'train': ['10008_09924', '10016_09694', '10004_08693'],
 'val': ['10009_08848']}


all_ids = {'test': ['10056_09615','10035_08847','10038_09063','30044_10095','10080_09931',
  '30017_09567','10084_10188','10069_09785','10061_09308','10039_08941','10004_08693'],
 'train': ['30012_09102','30027_09638','30033_09776','10033_08871',
  '10057_10124','30036_09758','10036_09800','10047_09030','30025_09402','10043_09222','30024_09398','10066_09687',
  '10023_09126','10050_09079','10022_08854','30053_10112','10016_09694','30011_09170','30035_09836','10065_09587',
  '10045_08968','10008_09924','10060_09359','30014_09352','10018_08907','30020_09236','10027_09455','10046_09216',
  '30045_10182','30038_09967','10037_09903','30009_09227','10021_08839','30004_08965','30008_08981','10053_09018'],
 'val': ['10034_08879','10042_08990','10009_08848','10017_08894','30026_09430']}

'''
print()
# generate_train_val_test_dict(subject_id_partition, train_val_test_proportion=[0.7,0.8,1])
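
The commented-out call above passes `[0.7, 0.8, 1]`, which reads like cumulative cut points (train up to 70%, validation up to 80%, test for the remainder). The sketch below shows one way such a split could work; it is an assumption about the behavior, not the repository's `generate_train_val_test_dict`, and `split_ids`, its `seed` argument, and the placeholder ids are hypothetical.

```python
import random

def split_ids(subject_ids, cumulative_proportions=(0.7, 0.8, 1.0), seed=0):
    """Hypothetical helper: shuffle ids and cut them at cumulative proportions."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible

    n = len(ids)
    train_end = int(cumulative_proportions[0] * n)
    val_end = int(cumulative_proportions[1] * n)

    return {
        'train': ids[:train_end],
        'val': ids[train_end:val_end],
        'test': ids[val_end:],
    }

# Placeholder ids; the real ones come from data_path_dict['subject_ID'].
example_ids = ['subject_%02d' % i for i in range(52)]
partition = split_ids(example_ids)
print({name: len(ids) for name, ids in partition.items()})
# {'train': 36, 'val': 5, 'test': 11}
```

With 52 ids these cut points give 36 / 5 / 11, the same sizes as the hardcoded `all_ids` dictionary. Hardcoding the result, as this notebook does, avoids any risk of a subject moving between sets across sessions.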

In [None]:
all_ids = {'test': ['10056_09615','10035_08847','10038_09063','30044_10095','10080_09931',
  '30017_09567','10084_10188','10069_09785','10061_09308','10039_08941','10004_08693'],
 'train': ['30012_09102','30027_09638','30033_09776','10033_08871',
  '10057_10124','30036_09758','10036_09800','10047_09030','30025_09402','10043_09222','30024_09398','10066_09687',
  '10023_09126','10050_09079','10022_08854','30053_10112','10016_09694','30011_09170','30035_09836','10065_09587',
  '10045_08968','10008_09924','10060_09359','30014_09352','10018_08907','30020_09236','10027_09455','10046_09216',
  '30045_10182','30038_09967','10037_09903','30009_09227','10021_08839','30004_08965','30008_08981','10053_09018'],
 'val': ['10034_08879','10042_08990','10009_08848','10017_08894','30026_09430']}

### Create Reasonably Sized Files 
~10 GB maximum

In [None]:
# To keep RAM usage manageable, split the all_ids dictionary into smaller chunks.
# Val is reasonably sized, but train and test are not, so train is split into 4 pieces and test into 2.
train_len = len(all_ids['train'])
all_ids['train_1'] = all_ids['train'][:int(train_len/4)]
all_ids['train_2'] = all_ids['train'][int(train_len/4):int(train_len/2)]
all_ids['train_3'] = all_ids['train'][int(train_len/2):int(3*train_len/4)]
all_ids['train_4'] = all_ids['train'][int(3*train_len/4):]

del all_ids['train']

test_len = len(all_ids['test'])
all_ids['test_1'] = all_ids['test'][:int(test_len/2)]
all_ids['test_2'] = all_ids['test'][int(test_len/2):]

# Remove the unsplit list, as done for train, so only the chunks remain
del all_ids['test']
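
The slicing above can be generalized to any number of chunks. The sketch below uses the same `int(i * n / k)` boundaries, so it reproduces the 9/9/9/9 and 5/6 splits; `split_into_chunks` and `chunk_partition` are hypothetical helpers, not part of the repository, and the integer lists stand in for subject ids.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces using int(i * n / n_chunks) boundaries."""
    n = len(items)
    bounds = [int(i * n / n_chunks) for i in range(n_chunks + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]

def chunk_partition(ids_by_partition, name, n_chunks):
    """Return a new dict where `name` is replaced by name_1 ... name_k."""
    chunked = {key: ids for key, ids in ids_by_partition.items() if key != name}
    chunks = split_into_chunks(ids_by_partition[name], n_chunks)
    for i, chunk in enumerate(chunks, start=1):
        chunked['%s_%d' % (name, i)] = chunk
    return chunked

# Stand-in partition with the same sizes as all_ids (36 train, 5 val, 11 test).
example = {'train': list(range(36)), 'val': list(range(5)), 'test': list(range(11))}
example = chunk_partition(example, 'train', 4)
example = chunk_partition(example, 'test', 2)
print({name: len(ids) for name, ids in example.items()})
```

Returning a new dictionary instead of mutating in place makes it harmless to re-run the cell, which is convenient in a notebook.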

## Select Hyperparameters and Run Pipeline
Note: The functions used here are imported from the repository's `source/DL` directory; see [access_data_dl.py](https://github.com/yecatstevir/teambrainiac/blob/main/source/DL/access_data_dl.py) and `process_dl.py` in the same directory.

In [None]:
# labels_mask_binary hyperparameters
label_type='rt_labels'

# load_subjects_by_id parameters
n_subjects = len(data_path_dict['subject_ID'])
runs = [2,3]

# get_mask parameters and pull mask
image_mask_type = 'mask'
mask_ind = 0
brain_mask = get_mask(image_mask_type, data_path_dict, mask_ind)

# mask_normalize_runs_reshape_4d parameters
scaler = 'standard'


for subject_partition in ['test_1', 'test_2']:  # or all_ids.keys() to process every partition
  subject_ids = all_ids[subject_partition] 

  image_label_mask, image_labels = labels_mask_binary(data_path_dict, label_type=label_type)

  initial_subject_data = load_subjects_by_id(data_path_dict, subject_ids, image_label_mask, image_labels, label_type, runs)

  subjects_reshaped = mask_normalize_runs_reshape_4d(initial_subject_data, brain_mask, scaler)

  partition_dictionary = train_test_aggregation_group(subjects_reshaped, runs, subject_ids)

  s3_upload(partition_dictionary, 'dl/partition_%s.pkl'%subject_partition, 'pickle')
  print('Partition', subject_partition, 'complete.')
  print()
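
For readers without access to the repository, the upload step at the end of the loop can be pictured roughly as follows. This is a sketch under stated assumptions, not the repository's `s3_upload`: the function `upload_pickle`, the bucket name, and the stand-in client are hypothetical. The only real AWS call involved is boto3's S3 client method `upload_fileobj(Fileobj, Bucket, Key)`.

```python
import pickle
import tempfile

def upload_pickle(obj, bucket, key, client=None):
    """Hypothetical helper: pickle obj to a temporary file and upload it to S3."""
    if client is None:
        import boto3  # imported lazily so the sketch can run without AWS installed
        client = boto3.client('s3')

    # Spool through a temporary file so a multi-GB pickle is not held in RAM twice.
    with tempfile.TemporaryFile() as tmp:
        pickle.dump(obj, tmp, protocol=pickle.HIGHEST_PROTOCOL)
        tmp.seek(0)
        client.upload_fileobj(tmp, bucket, key)
    print('upload complete for', key)

class FakeS3Client:
    """Stand-in for an S3 client so the example runs without credentials."""
    def __init__(self):
        self.stored = {}

    def upload_fileobj(self, fileobj, bucket, key):
        self.stored[(bucket, key)] = fileobj.read()

fake_client = FakeS3Client()
toy_partition = {'images': [[0.0, 1.0], [2.0, 3.0]], 'labels': [0, 1]}
upload_pickle(toy_partition, 'example-bucket', 'dl/partition_demo.pkl', client=fake_client)

# Round trip: what was "uploaded" unpickles to the original dictionary.
restored = pickle.loads(fake_client.stored[('example-bucket', 'dl/partition_demo.pkl')])
print(restored == toy_partition)  # True
```

Passing the client in as an argument keeps credentials out of the helper and makes it testable offline.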


## Healthy Pipeline Example Output of Uploading 11 Subjects

Subject ids loaded.

Adding subjects to dictionary.

11it [02:13, 12.12s/it]

Completed Subject 1

Completed Subject 2

Completed Subject 3

Completed Subject 4

Completed Subject 5

Completed Subject 6

Completed Subject 7

Completed Subject 8

Completed Subject 9

Completed Subject 10

Completed Subject 11

upload complete for dl/partition_test.pkl

Partition test complete.