# Dream Challenge Solution Overview

This notebook handles the 60-, 40- and 20-gene sub-challenges. It combines two models.

## First Model - Max(MCC)

- The first model computes the Matthews correlation coefficient (MCC) between each cell and every location, using only the selected 60 (or 40, or 20) genes instead of all 84, and looks for the locations with the maximum MCC.
- The MCC is calculated using matrix multiplication.
- A list of candidate locations is assembled from the MAX(MCC) calculation.
- This list is then refined using the second model.
- With 60 genes, MAX(MCC) alone gives very good location predictions, and the second model is hardly needed.
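
As a sketch of how matrix multiplication yields the MCC, the following uses notation that is ours, not the notebook's: $D$ is the cells-by-genes binarized DGE matrix, $B$ is the locations-by-genes binarized BDTNP matrix (both restricted to the selected genes), $J$ is an all-ones matrix of matching shape, and $\odot$ and the fraction denote element-wise operations. Entry $(i, j)$ of each product counts the genes on which cell $i$ and location $j$ agree or disagree:

$$
TP = D B^{\top}, \qquad TN = (J - D)(J - B)^{\top}, \qquad FP = D (J - B)^{\top}, \qquad FN = (J - D) B^{\top}
$$

$$
\mathrm{MCC} = \frac{TP \odot TN - FP \odot FN}{\sqrt{(TP + FP) \odot (TP + FN) \odot (TN + FP) \odot (TN + FN)}}
$$

The implementation further down also adds a small epsilon under the square root to avoid division by zero.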

## Second Model - ANN

- The second model is a simple ANN that predicts a BDTNP sequence from a DGE sequence.
- Input: a row from the binarized DGE.
- Output: a prediction for a row from the binarized BDTNP (the correct location).
- It is used to 'correct' the MAX(MCC) results.
- The advantage of this model is that it predicts the correct gene patterns (as opposed to only maximizing MCC, i.e. location).
- The model relies on a correct selection of the 60/40/20-gene subsets.
- With 20 genes, it is the only model used, since MAX(MCC) is totally off.
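
The 'correction' step can be pictured as a nearest-neighbour lookup: the ANN predicts a gene pattern, and the candidate locations whose BDTNP patterns lie closest to it are selected. Below is a minimal pure-Python sketch of that idea, assuming Euclidean distance. The function name `closest_candidates` and the toy data are hypothetical and only for illustration.

```python
import math

def closest_candidates(pred, candidates, k):
    """Return the ids of the k candidate patterns nearest to pred (Euclidean distance)."""
    ranked = sorted(candidates, key=lambda loc: math.dist(pred, candidates[loc]))
    return ranked[:k]

# Hypothetical toy data: an ANN output for one cell over 4 genes,
# and the binarized patterns of three candidate locations.
pred = [0.9, 0.1, 0.8, 0.2]
candidates = {
    12: [1, 0, 1, 0],
    57: [0, 1, 0, 1],
    90: [1, 0, 0, 0],
}

best_two = closest_candidates(pred, candidates, k=2)
print(best_two)  # location 12 is nearest, followed by location 90
```

The notebook itself selects locations one at a time and masks each pick before the next iteration, which gives the same ordering as sorting by distance once.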

## Combining The Models

- How do we combine the two models?
- We have 10 possibilities (locations) for each cell. We let Max(MCC) propose candidates, then 'correct' the result and select the best candidates using the ANN model.
- If Max(MCC) proposes 10 or fewer results, these are very strong results and we keep them; the ANN model fills any remaining slots. Otherwise we ignore the Max(MCC) results and use only the ANN model.
- The number of candidates taken from the Max(MCC) model was calibrated manually. This means selecting an MCC 'cutoff' value such that every location scoring at or above it becomes a candidate.
    - With 60 genes, trial and error shows that using the second-highest MCC score as the cutoff is optimal.
    - With 40 genes, the optimal cutoff is likewise the second-highest Max(MCC) score.
    - With 20 genes, all 10 locations are decided by the ANN (the Max(MCC) model is not used at all).
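
The cutoff rule above can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the notebook's implementation: `mcc_candidates` is a hypothetical helper, the scores are made up, and the cutoff is taken as the second-largest element of the score list, so ties with the top score count as separate elements.

```python
def mcc_candidates(scores, max_keep=10):
    """Locations scoring at or above the second-largest MCC value, or [] if too many."""
    cutoff = sorted(scores)[-2]
    picked = [loc for loc, score in enumerate(scores) if score >= cutoff]
    # More than max_keep ties means Max(MCC) is not decisive; discard its proposal.
    return picked if len(picked) <= max_keep else []

# Hypothetical MCC scores of one cell against six locations.
scores = [0.10, 0.72, 0.35, 0.60, 0.68, 0.05]
strong = mcc_candidates(scores)
slots_left_for_ann = 10 - len(strong)

# A flat score profile: every location ties, so nothing is kept.
flat = mcc_candidates([0.5] * 12)

print(strong, slots_left_for_ann, flat)
```

In the 60- and 40-gene runs the ANN then fills `slots_left_for_ann` positions; in the 20-gene run the Max(MCC) proposal is skipped entirely.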

## How to Run

- Make sure you have installed Python 3 with scikit-learn (we used Anaconda), TensorFlow and Keras.
- Run the following cells one by one.
- The notebook has to be run three times, once for each of the 60-, 40- and 20-gene sub-challenges.
- Manual configuration:
    - In the following cell, set num_situ to the number of in-situ genes to use (the sub-challenge): 60, 40 or 20.

# Build an ANN Model for DGE->BDTNP Prediction

In [3]:
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances_argmin
# Only the Keras names actually used below are imported.
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Activation, BatchNormalization
from keras.callbacks import ModelCheckpoint
import time
import sys

######################################################
# This is the only parameter you need to configure.  #
# It has to be run three times (60, 40 and 20 genes.)#
######################################################
num_situ = 20

if(num_situ == 60):
    glist = ['danr','CG14427','dan','CG43394','ImpL2','Nek2','CG8147','Ama','Btk29A','trn','numb','prd','brk','tsh','pxb','dpn','ftz','Kr','h','eve','Traf4','run','Blimp-1','lok','kni','tkv','MESR3','odd','noc','nub','Ilp4','aay','twi','bmm','hb','toc','rho','CG10479','gt','gk','apt','D','sna','NetA','Mdr49','fj','Mes2','CG11208','Doc2','bun','tll','Cyp310a1','Doc3','htl','Esp','bowl','oc','ImpE2','CG17724','fkh']
elif(num_situ == 40):
    glist = ['danr','CG14427','dan','CG43394','ImpL2','Nek2','CG8147','Ama','Btk29A','trn','numb','prd','brk','tsh','pxb','dpn','ftz','Kr','h','eve','Traf4','run','Blimp-1','lok','kni','tkv','MESR3','odd','noc','nub','Ilp4','aay','twi','bmm','hb','toc','rho','CG10479','gt','gk']
elif(num_situ == 20):
    glist = ['danr', 'CG14427', 'dan', 'CG43394', 'ImpL2', 'Nek2', 'CG8147', 'Ama', 'Btk29A', 'trn', 'numb', 'prd', 'brk', 'tsh', 'pxb', 'dpn', 'h', 'Traf4', 'run', 'toc']
else:
    raise ValueError('Undefined num_situ')

def diff(first, second):
    """Return the items of first that are not in second, preserving order."""
    second = set(second)
    return [item for item in first if item not in second]

bdtnp_bin = pd.read_csv('binarized_bdtnp.csv')[glist]
dge_bin = pd.read_csv('dge_binarized_distMap_T.csv')
labels = pd.read_csv('labels.csv') #This file contains the true locations for each cell (maximum 6 locations). E.g.:
#loc1,loc2,loc3,loc4,loc5,loc6
#133,,,,,
#781,,,,,
#...
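
To make the layout of labels.csv concrete, here is a small standard-library sketch that parses the format described in the comment above into one list of true locations per cell. The helper name `parse_labels` is hypothetical, and the second data row (with two locations) is invented for illustration.

```python
import csv
import io

# First data row as in the notebook's comment; the second row is a made-up
# example of a cell with more than one true location.
sample = "loc1,loc2,loc3,loc4,loc5,loc6\n133,,,,,\n781,782,,,,\n"

def parse_labels(text):
    """Return one list of integer locations per cell, skipping empty fields."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [[int(value) for value in row if value != ""] for row in reader]

true_locations = parse_labels(sample)
print(true_locations)
```

With pandas, the empty fields are read as NaN instead, which is why the sanity check at the end counts the non-NaN entries.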


In [107]:
print(time.ctime(), 'Create train input array for dge to bdtnp model')

len_ = len(labels)
X_ = np.empty((len_, 84))
y_ = np.empty((len_, num_situ))

for index, row in labels.iterrows():
    if index % 100 == 0:
        print(index, ' ', end="")
    # Input: the cell's binarized DGE row. Target: the BDTNP row of its first true location.
    X_[index] = dge_bin.iloc[index]
    y_[index] = bdtnp_bin.iloc[int(row.iloc[0])]

Mon Nov 19 11:37:26 2018 Create train input array for dge to bdtnp model
0  100  200  300  400  500  600  700  800  900  1000  1100  1200  

In [None]:
#Model build for dge to bdtnp.
print(time.ctime(), 'Model build')

a1 = Input(shape=(84,))
e = Dense(84)(a1)
e = BatchNormalization()(e)
e = Dropout(0.3)(e)
e = Dense(40)(e)
e = BatchNormalization()(e)
e = Activation('softplus')(e)
e = Dropout(0.2)(e)

output = Dense(num_situ, activation='sigmoid')(e)
model = Model(inputs=[a1], outputs=[output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy'])
model.summary()  # summary() prints the table itself and returns None
print(time.strftime("%H:%M:%S"), ' Fit')

# checkpoint
filepath="models/best_model.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_binary_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

model.fit(  x=[X_], y=y_,
            batch_size=10,
            epochs=100,
            verbose=2,
            validation_split=0.2,
            callbacks=callbacks_list)

# Using The Models - Max(MCC) and ANN
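
As a reference point for the vectorized MCC function in the next cell, the following pure-Python sketch computes the MCC for a single cell and a single location by counting matches element by element. The name `mcc_pair` and both toy rows are hypothetical. Unlike the notebook, which adds an epsilon under the square root, this sketch returns 0.0 when the denominator is zero.

```python
import math

def mcc_pair(cell, location):
    """MCC between two equal-length 0/1 sequences, counted element by element."""
    pairs = list(zip(cell, location))
    tp = sum(c == 1 and b == 1 for c, b in pairs)
    tn = sum(c == 0 and b == 0 for c, b in pairs)
    fp = sum(c == 1 and b == 0 for c, b in pairs)
    fn = sum(c == 0 and b == 1 for c, b in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

cell = [1, 1, 0, 0, 1, 0]      # hypothetical binarized DGE row
location = [1, 0, 0, 0, 1, 1]  # hypothetical binarized BDTNP row

score = mcc_pair(cell, location)
print(round(score, 4))
```

Here TP = 2, TN = 2, FP = 1 and FN = 1, so the score is (4 - 1) / 9. A spot check like this against a few entries of the matrix result is a cheap way to confirm the vectorized version.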

In [5]:
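#Optimized calculation of MCC using matrix products (every row of dg against every column of bd).
def MCC(bd, dg):
    #Convert to plain arrays so the result is an ndarray that can be iterated row by row.
    bd = np.asarray(bd, dtype=float)
    dg = np.asarray(dg, dtype=float)
    #Confusion counts for every (cell, location) pair.
    TP = np.matmul(dg, bd)
    TN = np.matmul(1 - dg, 1 - bd)
    FP = np.matmul(dg, 1 - bd)
    FN = np.matmul(1 - dg, bd)
    numerator = TN*TP - FP*FN
    #Epsilon guards against division by zero when one of the marginal counts is zero.
    denominator = np.sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN) + sys.float_info.epsilon)
    return numerator/denominator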


ind = {}
mcc = MCC(bdtnp_bin.T, dge_bin[glist])
inx = 0
results = pd.DataFrame()

for row in mcc:
    if (inx % 100 == 0):
        print(f'({inx})', end="")
    row_sorted = row.copy()
    row_sorted.sort()
    
    #In some cases Max(MCC) yields no more than 10 locations at or above the cutoff. We want to include these 'for sure',
    #and hence exclude them from the ANN's consideration.
    #Trial and error shows that (for the 40 and 60 sub-challenges) we should trust the Max(MCC) model down to the 2nd place only.
    closest = []
    candidates1 = []
    lis_len = len(np.argwhere(row >= row_sorted[-2]))
    if(lis_len <= 10 and num_situ > 20):
        candidates1 = np.ndarray.flatten(np.argwhere(row >= row_sorted[-2])).tolist()
    elif(lis_len > 10):
        #Sometimes even taking the top 2 elements gives more than 10 candidates. In that case ignore them.
        lis_len = 0
        candidates1 = []
        
    if(lis_len < 10 or num_situ == 20):
        #Using ANN model to select 10 locations out of a list of candidates.
        if(num_situ > 20):
            n = 10 # Consider only tops Max(MCC) in case of 60 and 40 sub-challenge
        else:
            n = 3037 # Ignore Max(MCC) altogether in case of 20 sub-challenge.
        candidates2 = np.ndarray.flatten(np.argwhere(row >= row_sorted[-n]))
        candidates2 = diff(candidates2, candidates1)
        #Pass the cell's DGE row as a (1, n_genes) array.
        pred = model.predict(dge_bin.iloc[[inx]].to_numpy(), batch_size=1, verbose=0)[0]

        #Loop 10 times and select the locations in BDTNP that are closest to ANN predictions (on the candidates).
        bdt = bdtnp_bin.copy().iloc[candidates2]
        closest = []
        for i in range(0, 10 - len(candidates1)):
            temp_closest = pairwise_distances_argmin(pred.reshape(1, -1), bdt)
            #Mask out the selected location so it won't be picked in the next iteration.
            bdt.iloc[temp_closest[0], :] = -100
            closest.append(candidates2[temp_closest[0]])

    ind[inx] = candidates1 + closest
    results = pd.concat([results, pd.DataFrame(ind[inx]).T.reset_index(drop=True)])
    inx += 1


#Save for submission. Submission file is not zero-based.
results = results + 1
results.to_csv(f'maxmcc_{num_situ}_plus_one.csv')

(0)(100)(200)(300)(400)(500)(600)(700)(800)(900)(1000)(1100)(1200)

# Sanity Check The Results

In [6]:
count = 0
real_count = 0
k = 0

#Count how many predicted locations appear among each cell's true locations.
for index, row in labels.iterrows():
    real_count = real_count + np.count_nonzero(~np.isnan(row))
    for j in ind[k]:
        if(j in row.values):
            count = count + 1
    k = k + 1

print(f'Count of matched labels: {count}, real labels count: {real_count}')

Count of matched labels: 874, real labels count: 1691
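
The two counts above amount to a recall-style measure: 874 of the 1691 true labels, about 51.7%, appear among the predicted locations. A minimal sketch of the same computation as a reusable function follows. The name `label_recall` and the toy inputs are hypothetical, and it assumes each cell's predictions contain no duplicates.

```python
def label_recall(predicted, truth):
    """Return (matched, total, fraction) of true labels found in the predictions."""
    matched = sum(len(set(p) & set(t)) for p, t in zip(predicted, truth))
    total = sum(len(t) for t in truth)
    return matched, total, matched / total

# Hypothetical toy data: predictions and true locations for two cells.
predicted = [[133, 5, 7], [1, 2, 3]]
truth = [[133], [781, 3]]

matched, total, fraction = label_recall(predicted, truth)
print(f'Count of matched labels: {matched}, real labels count: {total}, fraction: {fraction:.3f}')
```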
