<a href="https://colab.research.google.com/github/surajghuwalewala/CE888_Data_Science_and_Decision_Making/blob/master/Assignment2/CE888_DecMeg2014.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DecMeg2014 - Decoding the Human Brain

This notebook contains code for studying the effect of covariate shift adaptation (via batch normalization) on the classification of MEG signals, using the [DecMeg2014 dataset](https://www.kaggle.com/c/decoding-the-human-brain/data) from Kaggle.


## Kaggle setup and data download


In [0]:
## Kaggle details
import os
os.environ['KAGGLE_USERNAME'] = "surajghuwalewala" # username from the json file
os.environ['KAGGLE_KEY'] = "<your-kaggle-key>" # key from the json file (placeholder; never commit a real key)
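
As an alternative to typing the credentials into the notebook, they can be read from the `kaggle.json` file itself. The following is a minimal sketch: `load_kaggle_credentials` is a hypothetical helper (not part of the Kaggle API), and it assumes the file has the usual `username` and `key` fields.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path):
    """Read username/key from a kaggle.json file and export them as env vars."""
    creds = json.loads(Path(path).read_text())
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    # Return only the username so the key is never echoed in cell output
    return creds["username"]


# Example call (the path is an assumption; adjust to where the file was uploaded):
# load_kaggle_credentials("/content/kaggle.json")
```

This keeps the key out of the notebook source, so it does not end up in version control.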

In [0]:
DOWNLOAD_DATA = True

In [3]:
if DOWNLOAD_DATA:
    !kaggle competitions download -c decoding-the-human-brain # api copied from kaggle
    !unzip -q -n '/content/*.zip'  ## unzips all archives; -q: quiet, -n: never overwrite existing files

Downloading test_17_23.zip to /content
100% 1.63G/1.64G [00:15<00:00, 127MB/s]
100% 1.64G/1.64G [00:15<00:00, 111MB/s]
Downloading train_01_06.zip to /content
 99% 1.41G/1.42G [00:30<00:00, 62.5MB/s]
100% 1.42G/1.42G [00:31<00:00, 49.3MB/s]
Downloading train_07_12.zip to /content
100% 1.43G/1.43G [00:41<00:00, 115MB/s] 
100% 1.43G/1.43G [00:41<00:00, 37.0MB/s]
Downloading random_submission.csv to /content
  0% 0.00/31.7k [00:00<?, ?B/s]
100% 31.7k/31.7k [00:00<00:00, 32.7MB/s]
Downloading train_13_16.zip to /content
100% 967M/970M [00:26<00:00, 77.0MB/s]
100% 970M/970M [00:26<00:00, 39.0MB/s]

4 archives were successfully processed.


## Initial python setup

In [0]:
import numpy as np
import scipy.io as sio

# Libraries for designing the CNN
from tensorflow.keras.utils  import normalize, to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, SeparableConv2D
from tensorflow.keras.layers import BatchNormalization, Flatten
from tensorflow.keras.layers import Dense, Activation, Dropout,  MaxPooling2D
from tensorflow.keras.optimizers import SGD as SGD_Loss

In [0]:
## Saving the path to the train and test files
import glob

train_files = np.sort( glob.glob('/content/data/train*') )
test_files = np.sort( glob.glob('/content/data/test*') )

In [6]:
train_files

array(['/content/data/train_subject01.mat',
       '/content/data/train_subject02.mat',
       '/content/data/train_subject03.mat',
       '/content/data/train_subject04.mat',
       '/content/data/train_subject05.mat',
       '/content/data/train_subject06.mat',
       '/content/data/train_subject07.mat',
       '/content/data/train_subject08.mat',
       '/content/data/train_subject09.mat',
       '/content/data/train_subject10.mat',
       '/content/data/train_subject11.mat',
       '/content/data/train_subject12.mat',
       '/content/data/train_subject13.mat',
       '/content/data/train_subject14.mat',
       '/content/data/train_subject15.mat',
       '/content/data/train_subject16.mat'], dtype='<U33')

## CNN model creation

In [0]:
def create_model(batch_norm = True):
    ## Defining the CNN model

    kernel_size = (5,5)

    ## Initialize the model
    model = Sequential()

    ##reading a file to get input shape
    train_X = sio.loadmat(train_files[0])['X']
    train_X = train_X.reshape(-1, train_X.shape[1], train_X.shape[2], 1)

    ## Layer 1
    model.add(Conv2D(4, kernel_size, padding = 'same', input_shape=train_X.shape[1:]))
    if batch_norm: model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    if not batch_norm:model.add(Dropout(0.25))

    ## Layer 2
    model.add(SeparableConv2D(4, kernel_size, padding = 'same'))
    if batch_norm: model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    if not batch_norm:model.add(Dropout(0.25))


    ## Layer 3
    model.add(SeparableConv2D(16, kernel_size, padding = 'same'))
    if batch_norm: model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    if not batch_norm:model.add(Dropout(0.25))

    ## Layer 4
    model.add(SeparableConv2D(8, kernel_size, padding = 'same'))
    if batch_norm: model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    if not batch_norm:model.add(Dropout(0.25))

    ## Layer 5
    model.add(SeparableConv2D(4, kernel_size, padding = 'same'))
    if batch_norm: model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    if not batch_norm:model.add(Dropout(0.25))

    model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors

    ## ------------ CLASSIFICATION ------------ ##

    ## Layer 6
    model.add(Dense(256))

    ## Layer 7
    model.add(Dense(128))

    ## Layer 8
    model.add(Dense(32))

    ## Layer 9
    model.add(Dense(1))
    model.add(Activation('sigmoid'))



    # Note: with momentum=0.0 the nesterov flag has no effect, so this is plain SGD
    model.compile(loss= 'binary_crossentropy',
                optimizer = SGD_Loss(learning_rate=0.01, momentum=0.0, nesterov=True),
                metrics= ['accuracy'])
    
    return model

In [0]:
## Creating both models

## Model without batch norm
model = create_model(batch_norm=False)

## Model with batch norm
norm_model = create_model(batch_norm=True)

In [9]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 306, 375, 4)       104       
_________________________________________________________________
activation (Activation)      (None, 306, 375, 4)       0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 153, 187, 4)       0         
_________________________________________________________________
dropout (Dropout)            (None, 153, 187, 4)       0         
_________________________________________________________________
separable_conv2d (SeparableC (None, 153, 187, 4)       120       
_________________________________________________________________
activation_1 (Activation)    (None, 153, 187, 4)       0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 76, 93, 4)         0

In [10]:
norm_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 306, 375, 4)       104       
_________________________________________________________________
batch_normalization (BatchNo (None, 306, 375, 4)       16        
_________________________________________________________________
activation_6 (Activation)    (None, 306, 375, 4)       0         
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 153, 187, 4)       0         
_________________________________________________________________
separable_conv2d_4 (Separabl (None, 153, 187, 4)       120       
_________________________________________________________________
batch_normalization_1 (Batch (None, 153, 187, 4)       16        
_________________________________________________________________
activation_7 (Activation)    (None, 153, 187, 4)      
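
The parameter counts and output shapes in the summaries above can be checked by hand. The sketch below is plain Python with hypothetical helper names. It assumes the Keras defaults used in `create_model`: a bias term per filter, `depth_multiplier=1` for `SeparableConv2D`, and `'valid'` padding for `MaxPooling2D`.

```python
def conv2d_params(kh, kw, c_in, c_out):
    """Weights plus one bias per output filter."""
    return kh * kw * c_in * c_out + c_out


def separable_conv2d_params(kh, kw, c_in, c_out):
    """Depthwise kernel + 1x1 pointwise kernel + one bias per output filter."""
    depthwise = kh * kw * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise + c_out


def batch_norm_params(channels):
    """gamma, beta, moving mean and moving variance per channel."""
    return 4 * channels


def pooled_shape(height, width, n_pools, pool=2):
    """Spatial size after n_pools non-overlapping max-pooling layers."""
    for _ in range(n_pools):
        height, width = height // pool, width // pool
    return height, width


print(conv2d_params(5, 5, 1, 4))            # 104, matches conv2d
print(separable_conv2d_params(5, 5, 4, 4))  # 120, matches separable_conv2d
print(batch_norm_params(4))                 # 16, matches batch_normalization
h, w = pooled_shape(306, 375, n_pools=5)
print(h, w, h * w * 4)                      # size fed into Flatten (4 filters in layer 5)
```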

## Training

In [11]:
from tensorflow.keras.utils import normalize


n_epochs = 20

for i,file in enumerate(train_files):
    
    print("\n-------------------------------------------------\n")
    print("Working on Subject {}".format(i))
    data = sio.loadmat(file)
    X = normalize(data['X'], axis=1)
    y  = data['y']
    
    # reshape for CNN
    X = X.reshape(-1, X.shape[1], X.shape[2], 1)

    print("\nWithout Batch Norm")
    model.fit(X,y, epochs=n_epochs, batch_size=25, validation_split=0.2)
    print("\nWith Batch Norm")
    norm_model.fit(X,y, epochs=n_epochs, batch_size=25, validation_split=0.2)

    


-------------------------------------------------

Working on Subject 0

Without Batch Norm
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

With Batch Norm
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

-------------------------------------------------

Working on Subject 1

Without Batch Norm
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

With Batch Norm
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
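
The `normalize(data['X'], axis=1)` call in the training loop rescales the data to unit L2 norm along axis 1. With `X` shaped `(trials, channels, time)`, this means that for each trial and time point the vector across channels has length 1. A NumPy sketch that should behave the same way (`l2_normalize` is a hypothetical stand-in, not the Keras implementation itself):

```python
import numpy as np


def l2_normalize(x, axis=1):
    """L2-normalize x along the given axis; all-zero slices are left unchanged."""
    norms = np.linalg.norm(x, ord=2, axis=axis, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return x / norms


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4, 5))  # small stand-in for (trials, channels, time)
Xn = l2_normalize(X, axis=1)

# Every (trial, time) slice now has unit norm across the channel axis
print(np.linalg.norm(Xn, axis=1))
```

Applying the function a second time along the same axis leaves the data unchanged, since the norms are already 1.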

## Evaluation


In [0]:
pred = np.array([])
norm_pred = np.array([])

Because the submission expects the predictions in subject order, the test files are loaded by their explicit names (subjects 17 to 23) rather than by relying on the order of the file listing.

In [6]:
for i in range(len(test_files)):
    subject = 17 + i  # test subjects are numbered 17 to 23
    print("Testing on Case {}".format(subject))
    data = sio.loadmat("/content/data/test_subject{}.mat".format(subject))  ## Very specific numbering for this case. (Works as of 12/04/2020)
    # Same preprocessing as in training: normalize once, then reshape for the CNN
    X = normalize(data['X'], axis=1)
    X = X.reshape(-1, X.shape[1], X.shape[2], 1)
    # Append this subject's predictions below those of the previous subjects
    pred = np.vstack([pred, model.predict(X)]) if pred.size else model.predict(X)
    norm_pred = np.vstack([norm_pred, norm_model.predict(X)]) if norm_pred.size else norm_model.predict(X)



Testing on Case 17
Testing on Case 18
Testing on Case 19
Testing on Case 20
Testing on Case 21
Testing on Case 22
Testing on Case 23
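
The per-subject predictions can also be collected in a list and joined once at the end, which avoids the special case for the first, empty array. This is only a sketch: `predict_all` and `fake_predict` are hypothetical names, and `fake_predict` stands in for `model.predict`.

```python
import numpy as np


def predict_all(predict_fn, batches):
    """Run predict_fn on each batch and join the results row-wise."""
    outputs = [predict_fn(X) for X in batches]
    return np.concatenate(outputs, axis=0)


def fake_predict(X):
    """Stand-in for model.predict: one probability per trial."""
    return np.full((X.shape[0], 1), 0.5)


# Three small dummy "subjects" with 3, 2 and 4 trials each
batches = [np.zeros((n, 6, 5, 1)) for n in (3, 2, 4)]
all_pred = predict_all(fake_predict, batches)
print(all_pred.shape)  # (9, 1): one row per trial, in subject order
```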


In [0]:
## Rounding the predictions to 0 or 1
predictions = np.floor(pred+0.5)
norm_predictions = np.floor(norm_pred+0.5)

In [0]:
## Converting all numbers to int type
predictions = np.array(predictions, dtype='int')
norm_predictions = np.array(norm_predictions, dtype='int')
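
The two cells above amount to thresholding the sigmoid output at 0.5. A small sketch with made-up probabilities shows that `np.floor(p + 0.5)` and an explicit comparison agree for these values (floating-point edge cases right at the threshold aside):

```python
import numpy as np

# Made-up sigmoid outputs, shaped (n, 1) like model.predict
probs = np.array([[0.1], [0.49], [0.5], [0.51], [0.9]])

rounded = np.floor(probs + 0.5).astype(int)   # approach used in this notebook
thresholded = (probs >= 0.5).astype(int)      # explicit threshold

print(rounded.ravel())      # [0 0 1 1 1]
print(thresholded.ravel())  # [0 0 1 1 1]
```

Note that `np.round` would behave differently exactly at 0.5, because it rounds halves to the nearest even integer.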

In [23]:
predictions.shape

(4058, 1)

In [24]:
norm_predictions.shape

(4058, 1)

## Kaggle Submission

In [0]:
import pandas as pd

### Submitting the estimation from CNN without batch normalization

In [0]:
sub_df = pd.read_csv('/content/random_submission.csv')

In [0]:
sub_df['Prediction'] = predictions.ravel()  # flatten the (n, 1) predictions into a 1-D column

In [29]:
sub_df

Unnamed: 0,Id,Prediction
0,17000,0
1,17001,0
2,17002,0
3,17003,0
4,17004,0
...,...,...
4053,23585,0
4054,23586,0
4055,23587,0
4056,23588,0


In [0]:
sub_df.to_csv('/content/wo_norm_submission.csv', index=False)

In [38]:
!kaggle competitions submit -c decoding-the-human-brain -f /content/wo_norm_submission.csv -m "w/o norm submission"

100% 31.7k/31.7k [00:06<00:00, 5.30kB/s]
Successfully submitted to DecMeg2014 - Decoding the Human Brain

In [41]:
!kaggle competitions submissions -c decoding-the-human-brain 

fileName                date                 description          status    publicScore  privateScore  
----------------------  -------------------  -------------------  --------  -----------  ------------  
wo_norm_submission.csv  2020-04-15 19:19:13  w/o norm submission  complete  0.50000      0.50000       
norm_submission.csv     2020-04-12 23:52:57  norm submission      complete  0.53798      0.52920       
wo_norm_submission.csv  2020-04-12 23:48:31  w/o norm submission  complete  0.50000      0.50000       
wo_norm_submission.csv  2020-04-12 23:27:43  w/o norm submission  complete  0.50000      0.50000       
norm_submission.csv     2020-04-12 23:05:36  norm submission      complete  0.50623      0.53661       
wo_norm_submission.csv  2020-04-12 22:49:45  w/o norm submission  complete  0.49943      0.53225       
norm_submission.csv     2020-04-12 22:20:16  norm submission      complete  0.51360      0.50305       
wo_norm_submission.csv  2020-04-12 22:19:46  w/o norm submission

### Submitting the estimation from CNN with batch normalization

In [0]:
norm_sub_df = sub_df.copy()

In [0]:
norm_sub_df['Prediction'] = norm_predictions.ravel()  # flatten the (n, 1) predictions into a 1-D column

In [0]:
norm_sub_df.to_csv('/content/norm_submission.csv', index=False)

In [42]:
!kaggle competitions submit -c decoding-the-human-brain -f /content/norm_submission.csv -m "norm submission"

100% 31.7k/31.7k [00:05<00:00, 6.27kB/s]
Successfully submitted to DecMeg2014 - Decoding the Human Brain

In [43]:
!kaggle competitions submissions -c decoding-the-human-brain 


fileName                date                 description          status    publicScore  privateScore  
----------------------  -------------------  -------------------  --------  -----------  ------------  
norm_submission.csv     2020-04-15 19:28:05  norm submission      complete  0.55555      0.52223       
wo_norm_submission.csv  2020-04-15 19:19:13  w/o norm submission  complete  0.50000      0.50000       
norm_submission.csv     2020-04-12 23:52:57  norm submission      complete  0.53798      0.52920       
wo_norm_submission.csv  2020-04-12 23:48:31  w/o norm submission  complete  0.50000      0.50000       
wo_norm_submission.csv  2020-04-12 23:27:43  w/o norm submission  complete  0.50000      0.50000       
norm_submission.csv     2020-04-12 23:05:36  norm submission      complete  0.50623      0.53661       
wo_norm_submission.csv  2020-04-12 22:49:45  w/o norm submission  complete  0.49943      0.53225       
norm_submission.csv     2020-04-12 22:20:16  norm submission    

## Final Results

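Here we compare the accuracy scores of the most recent submission for each model:
- Without Batch Normalization - 50% (both public and private)
- With Batch Normalization - 55.5% (Public) & 52.2% (Private)

The model without batch normalization scores at chance level (the displayed predictions are all 0), while the batch-normalized model scores a few points above chance. This suggests that covariate shift adaptation using batch normalization improves the performance of the CNN model, although the margin is small and varied between submissions (public scores from 0.506 to 0.556).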