# Introduction

The purpose of this notebook is to assess whether a single multi-target NN is the best option to predict the 206 targets.

### Preliminary results

It seems that single-target NNs outperform the classic multi-target NN for several targets (see below).
Unfortunately, blending the two models tends to degrade my LB score a lot. If anybody has a clue about where this comes from, please let me know!

#### Single Target NN: 
* KFold Logloss: **0.01687**   
* KFold ROC: **0.779** 
* LB: **0.01929**

#### Multi Target NN:
 
* KFold Logloss: **0.01714**  
* KFold ROC: **0.634** 
* LB: **0.01855**


### Updates

#### Version 7
Change average probability of OOF graph from bar to scatter chart

#### Version 8
Add predictions from public test set

#### Version 9
Add LB Score for the two models

#### Version 11
- Add analysis for the main cluster of targets and for targets without co-occurrence with others
- Add recall/precision analysis

Plot the difference of predictions between train and test set in relative rather than absolute terms

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import networkx as nx

import matplotlib.pyplot as plt

import random

# Part 1: Co-occurrences between targets

Training a multi-target Neural Network might be useful if targets are correlated with each other (several samples have a 1 for the same targets).
In this part, I explore the graph of co-occurrences and the isolated groups of targets having co-occurrences in the train set. I use both scored and non-scored targets.

In [None]:


# Load data
train_targets_scored = pd.read_csv("../input/lish-moa/train_targets_scored.csv")
train_targets_nonscored = pd.read_csv("../input/lish-moa/train_targets_nonscored.csv")
train_features = pd.read_csv("../input/lish-moa/train_features.csv")
# Group targets
target_scored = train_targets_scored.drop("sig_id",axis=1)
target_notscored= train_targets_nonscored.drop("sig_id",axis=1)

target = pd.concat([target_scored,target_notscored],axis=1)
target = target.loc[:,target.sum()>0]

In [None]:
target.head()

## Co-occurrences

To calculate the co-occurrences quickly, we multiply the transposed target matrix by the target matrix.


In [None]:
# Count the co-occurrences of ones between targets
commun = target.T@target
# Remove the diagonal elements (each target paired with itself)
commun = commun.mask(np.eye(len(commun), dtype=bool), 0)

commun.head()

The diagonal of the product holds the total number of occurrences of each target in the dataset. We are not interested in it, so it is set to 0.
Every remaining non-zero co-occurrence count is treated as a connection between two targets in the graph below.
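
As a minimal sketch of the matrix-product trick, here is a hypothetical toy target matrix (the names `toy_target`, `toy_commun` and `off_diag` are made up for the illustration and are not part of the competition data). Entry (i, j) of the product counts the samples where both targets equal 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy target matrix: 4 samples, 3 binary targets.
toy_target = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 0],
     [0, 1, 0],
     [0, 0, 1]],
    columns=["a", "b", "c"],
)

# Entry (i, j) counts the samples where targets i and j are both 1.
toy_commun = toy_target.T @ toy_target

# The diagonal holds each target's own number of positives; zero it out.
off_diag = toy_commun.mask(np.eye(len(toy_commun), dtype=bool), 0)

print(toy_commun)
print(off_diag)
```

Targets `a` and `b` share two samples, while `c` never appears with another target, so its row and column are all zeros once the diagonal is removed.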

## Visualisation of the network with NetworkX and Plotly

In [None]:
G = nx.from_pandas_adjacency(commun,create_using=nx.DiGraph)
pos = nx.spring_layout(G)

In [None]:
edge_x = []
edge_y = []

for edge in G.edges():
        
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None)
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)

edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    opacity = 0.5,
    line=dict(width=2, color='black'),
    hoverinfo='none',
    mode='lines')

node_x = []
node_y = []
for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)
    
node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='Portland',
        opacity = 0.3,
        reversescale=True,
        color=[],
        size=20,
        line_width=2))

node_adjacencies = []
node_text = []
node_sizes = []
for node, adjacencies in enumerate(G.adjacency()):
    node_adjacencies.append(len(adjacencies[1]))
    node_text.append(adjacencies[0])
    node_sizes.append(np.log(target[adjacencies[0]].sum())*5)

node_trace.marker.color = node_adjacencies
node_trace.text = node_text
node_trace.marker.size = node_sizes

fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='<br>Targets Co-occurences Network',
                title_font_size=16,
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                )

fig.update_layout(template = 'presentation', title_font_size=30)
fig.show()

The network above shows that most of the targets are not related to each other.
On the other hand, we observe a big cluster of connected targets in the center.

## Identify the clusters

I developed this snippet to create groups of targets based on the co-occurrence matrix. There might be ways of doing this with NetworkX directly.

In [None]:
def find_recursive(commun, todo, done, effect, current_group, threshold = 10):

    """ Recursively group targets that share samples (co-occurrences)
        commun: co-occurrence matrix

        todo: Series holding the sum of co-occurrences per target, for targets not yet grouped
        done: targets already mapped to a group
        effect: target currently being checked
        current_group: the group to which targets are being added
        threshold: a target joins the group when its co-occurrence count is strictly above this value
    """

    #List of target remaining to group
    efflist = todo.index

    #Reduce co-occurence matrix to remaining targets to class
    sub = commun.loc[efflist,efflist]
    # Keeping only the co-occurence of the target that we are currently looking at
    sub = sub[effect]
    #adding the target to the list of target processed
    done += [effect]
    #add the targets with co-occurence above threshold
    to_add = list(sub[sub>threshold].index)

    #If targets in to_add, recursively check the co-occurence for those new targets to add in the current group
    if len(to_add):
        current_group += to_add
        todo = todo[[elmt for elmt in todo.index if elmt not in done]]
        for effect in to_add:
            if effect not in done:
                todo, done, current_group = find_recursive(commun, todo, done, effect, current_group, threshold)
        return todo, done, current_group

    #Otherwise, update todo list and return final outputs
    else:
        todo = todo[[elmt for elmt in todo.index if elmt not in done]]
        return todo, done, current_group
    
# Initialize the todo Series
todo = commun.sum().sort_values(ascending = False)
groups = {}
threshold = 1
i = 0
while True:
    todo, done, current_group = find_recursive(commun, todo, [todo.index[0]], todo.index[0], [todo.index[0]], threshold = threshold)
    groups[i] = list(set(current_group))
    i+=1
    if len(todo) ==0:
        break
        
#Keep only scored target
group_score = {}
for g, t in groups.items():
    elmts = [elmt for elmt in t if elmt in target_scored.columns]
    if len(elmts):
        group_score[g] = elmts 
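
As a hedged alternative to the recursive grouping above, connected components in NetworkX should give comparable groups. The sketch below runs on a hypothetical toy co-occurrence matrix (`toy_commun`, `toy_groups` and the target names are assumptions for the illustration); it has not been checked against the output of `find_recursive`, and the group numbering will differ.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy co-occurrence matrix (symmetric, diagonal already zeroed).
names = ["a", "b", "c", "d", "e"]
toy_commun = pd.DataFrame(
    [[0, 5, 0, 0, 0],
     [5, 0, 3, 0, 0],
     [0, 3, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]],
    index=names, columns=names,
)

threshold = 1  # same strict ">" rule as find_recursive

# Keep only links whose count is strictly above the threshold.
adjacency = (toy_commun > threshold).astype(int)
graph = nx.from_pandas_adjacency(adjacency)  # undirected graph by default

# Each connected component is one group; largest group first.
toy_groups = sorted(nx.connected_components(graph), key=len, reverse=True)
print(toy_groups)
```

Here `d` and `e` co-occur only once, which is not above the threshold, so they end up as two isolated groups.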

## Cluster of targets

In [None]:
count = {}
for g, t in group_score.items():
    count['g_'+str(g)] = len(t)

count = pd.Series(count).sort_values(ascending = False)

fig = go.Figure(
    go.Bar(
        x = count.index,
        y = count.values
    )
)

fig.update_layout(template = 'presentation', title = 'Count of Targets per Cluster ( Scored )')

Approximately half of the targets have no co-occurrences with other targets.

# Part 2: OOFs study

In this part, I study the OOFs obtained using 3 seeds and 5-fold splits based on [Chris Deotte's topic](https://www.kaggle.com/c/lish-moa/discussion/195195).
The big advantage of those splits is that they guarantee that the same drugs are not in both train and test folds, avoiding some leakage.
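
The splits from the linked topic are not reproduced here. As a minimal sketch of the underlying idea, scikit-learn's `GroupKFold` is used below as a stand-in, with hypothetical drug identifiers (`drug_id`) and dummy features, to show that no drug ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 samples coming from 6 drugs (2 samples per drug).
X = np.arange(24).reshape(12, 2)
drug_id = np.repeat(np.arange(6), 2)

splitter = GroupKFold(n_splits=3)
overlaps = []
for train_idx, valid_idx in splitter.split(X, groups=drug_id):
    # Drugs present on both sides of the split would be a source of leakage.
    shared = set(drug_id[train_idx]) & set(drug_id[valid_idx])
    overlaps.append(len(shared))

print(overlaps)
```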

I built two types of OOFs:
- One OOF comes from a NN making multi-target predictions
- The other OOFs come from NNs making single-target predictions

I use both ROC_AUC_Score and Log_Loss to evaluate the predictions.

I calculate scores based on trt_cp samples only.

In [None]:
from sklearn.metrics import roc_auc_score

def score(y, yp):
    return - np.mean(y*np.log(yp+10**(-15)) + (1-y)*np.log(1-yp+10**(-15))) 

 
target = target_scored[train_features.cp_type == 'trt_cp'].copy()
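
A note on how the ROC figure is aggregated: when `roc_auc_score` receives a 2-D label matrix, its default `average='macro'` computes one AUC per target column and takes their unweighted mean. The sketch below checks this on hypothetical random labels and predictions (`y_true` and `y_pred` are made up for the illustration).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical labels and predictions: 200 samples, 3 targets.
y_true = rng.integers(0, 2, size=(200, 3))
y_pred = rng.random((200, 3))

# 2-D input: scikit-learn averages the per-column AUCs (average='macro').
auc_all = roc_auc_score(y_true, y_pred)
auc_cols = [roc_auc_score(y_true[:, j], y_pred[:, j]) for j in range(3)]

print(np.isclose(auc_all, np.mean(auc_cols)))
```

Because every target has the same weight in this average, rare targets influence the ROC figure as much as frequent ones, which is not the case for the mean log loss.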

## Multi-Target Model

In [None]:
oofs_main = []
GROUP = 'MAINMODEL'
for SEED in [0,1]:
    oof = pd.read_csv(f'../input/moa-oofs/oofs/group{GROUP}_SEED{SEED}.csv', index_col = 0).values
    oofs_main.append(oof)
    roc = np.round(roc_auc_score(target.values, oof),3)
    log_loss = np.round(score(target.values, oof), 5)
    print(f'Seed {SEED} : roc_auc_score {roc} - log_loss {log_loss}')
    
oof_main_avg = np.mean(oofs_main, axis= 0)
roc = np.round(roc_auc_score(target.values, oof_main_avg),3)
log_loss = np.round(score(target.values, oof_main_avg), 5)
print(f'Average OOFs : roc_auc_score {roc} - log_loss {log_loss}')

## Single Target Models

In [None]:
oofs_single_target = []
for SEED in [0,1]:
    oof_tot = []
    for GROUP in range(206):
        oof = pd.read_csv(f'../input/moa-oofs/oofs/group{GROUP}_SEED{SEED}.csv', index_col = 0).values
        oof_tot.append(oof)
        
    oof_tot = np.hstack(oof_tot)
    oofs_single_target.append(oof_tot)
    roc = np.round(roc_auc_score(target.values, oof_tot),3)
    log_loss = np.round(score(target.values, oof_tot), 5)
    print(f'Seed {SEED} : roc_auc_score {roc} - log_loss {log_loss}')
    
oof_single_target_avg = np.mean(oofs_single_target, axis= 0)
roc = np.round(roc_auc_score(target.values, oof_single_target_avg),3)
log_loss = np.round(score(target.values, oof_single_target_avg), 5)
print(f'Average OOFs : roc_auc_score {roc} - log_loss {log_loss}')

## Predictions Statistics

The graph below shows two interesting things.
First, the single-target models tend to predict higher probabilities than their multi-target counterpart.

Second, the more positive labels a target has, the higher the predicted probabilities.

In [None]:
main_mean = oof_main_avg.mean(axis=0)
single_target_mean = oof_single_target_avg.mean(axis=0)

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = main_mean,
        y = single_target_mean,
        hovertext = target.columns,
        mode = 'markers',
        marker = {'size':target.sum()/10},
        name = 'oofs'
    )
)

fig.add_trace(
    go.Scatter(
        x = [0,0.05],
        y = [0,0.05],
        marker = {'color':'black', 'line' : {'width':0.1}},
        mode = 'lines',
        showlegend = False
    )
)

fig.update_layout(template = 'presentation', title = 'Average Probability by Type of Model')
# x holds the multi-target means and y the single-target means
fig.update_yaxes(title = 'Single Targets')
fig.update_xaxes(title = 'Multi Target')
fig.show()

In [None]:
main_logloss = [score(target.values[:,i], oof_main_avg[:,i]) for i in range(206)]
single_target_logloss = [score(target.values[:,i], oof_single_target_avg[:,i]) for i in range(206)]

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x = target.columns,
        y = main_logloss,
        name = 'Main Model'
    )
)

fig.add_trace(
    go.Bar(
        x = target.columns,
        y = single_target_logloss,
        name = 'Single Targets Models'
    )
)
fig.update_layout(template = 'presentation', title = 'Logloss by Type of Model per Target')
fig.show()

## Correlations between the predictions of the two models

I give targets from the same cluster the same random color. This lets us see whether a model type performs better on targets from big or small clusters.

In [None]:
colors_target = {}

for n, g in group_score.items():
    if len(g)>1:
        color = '#' + "%06x" % random.randint(0, 0xFFFFFF)
        for elmt in g:
            colors_target[elmt] = color

In [None]:
plt.figure(figsize = (30,50))
for i in range(206):
    fig = plt.subplot(18,12,i+1)
    plt.title(target.columns[i])
    x2 = oof_main_avg[np.where(target.iloc[:,i]==0)[0],i]
    y2 = oof_single_target_avg[np.where(target.iloc[:,i]==0)[0],i]
    plt.scatter(x2, y2, color = 'black', alpha = 0.1)
    x1 = oof_main_avg[np.where(target.iloc[:,i]==1)[0],i]
    y1 = oof_single_target_avg[np.where(target.iloc[:,i]==1)[0],i]
    plt.scatter(x1, y1, color = 'red')
    if target.columns[i] in colors_target.keys():
        fig.patch.set_facecolor(f'{colors_target[target.columns[i]]}')
    plt.xlim(0,1)
    plt.ylim(0,1)
plt.show()

## Some targets where the single-target NN seems to perform better

As expected, the single-target NNs perform better on targets that are not part of a cluster of targets.

In [None]:
toplot = [11,12*3-1,12*7+6,12*14+5]
plt.figure(figsize = (15,15))
for k, i in enumerate(toplot):
    # Keep the Axes handle so the background color is applied to this subplot
    ax = plt.subplot(2,2,k+1)
    plt.title(target.columns[i], size = 20)
    x2 = oof_main_avg[np.where(target.iloc[:,i]==0)[0],i]
    y2 = oof_single_target_avg[np.where(target.iloc[:,i]==0)[0],i]
    plt.scatter(x2, y2, color = 'black', alpha = 0.1, label = 'negative')
    x1 = oof_main_avg[np.where(target.iloc[:,i]==1)[0],i]
    y1 = oof_single_target_avg[np.where(target.iloc[:,i]==1)[0],i]
    plt.scatter(x1, y1, color = 'red', label = 'positive')
    plt.xlim(-0.1,1.1)
    plt.ylim(-0.1,1.1)
    plt.xlabel('proba multi target', size = 15)
    plt.ylabel('proba single target', size = 15)
    if target.columns[i] in colors_target.keys():
        ax.patch.set_facecolor(f'{colors_target[target.columns[i]]}')
    plt.legend()
plt.show()

# Part 3: Models behavior on "isolated" targets

By isolated, I mean targets with no connections to the others according to the co-occurrence pattern shown in Part 1.

In [None]:
is_target = []
for k,v in group_score.items():
    if len(v)==1:
        is_target+=v
print(f'total of isolated target: {len(is_target)}')

### ROC and LogLoss

In [None]:
st = np.where(target.columns.isin(is_target))[0]

oofs_main = []
GROUP = 'MAINMODEL'
print('MAINMODEL')
for SEED in [0,1,2]:
    oof = pd.read_csv(f'../input/moa-oofs/oofs/group{GROUP}_SEED{SEED}.csv', index_col = 0).values
    oofs_main.append(oof)
    roc = np.round(roc_auc_score(target.values[:,st], oof[:,st]),3)
    log_loss = np.round(score(target.values[:,st], oof[:,st]), 5)
    print(f'Seed {SEED} : roc_auc_score {roc} - log_loss {log_loss}')
    
oof_main_avg = np.mean(oofs_main, axis= 0)
roc = np.round(roc_auc_score(target.values[:,st], oof_main_avg[:,st]),3)
log_loss = np.round(score(target.values[:,st], oof_main_avg[:,st]), 5)
print(f'Average OOFs : roc_auc_score {roc} - log_loss {log_loss}')

oofs_single_target = []
print('\nSINGLE TARGETS')
for SEED in [0,1]:
    oof_tot = []
    for GROUP in range(206):
        oof = pd.read_csv(f'../input/moa-oofs/oofs/group{GROUP}_SEED{SEED}.csv', index_col = 0).values
        oof_tot.append(oof)
        
    oof_tot = np.hstack(oof_tot)
    oofs_single_target.append(oof_tot)
    roc = np.round(roc_auc_score(target.values[:,st], oof_tot[:,st]),3)
    log_loss = np.round(score(target.values[:,st], oof_tot[:,st]), 5)
    print(f'Seed {SEED} : roc_auc_score {roc} - log_loss {log_loss}')
    
oof_single_target_avg = np.mean(oofs_single_target, axis= 0)
roc = np.round(roc_auc_score(target.values[:,st], oof_single_target_avg[:,st]),3)
log_loss = np.round(score(target.values[:,st], oof_single_target_avg[:,st]), 5)
print(f'Average OOFs : roc_auc_score {roc} - log_loss {log_loss}')

### Recall & Precision

In [None]:
from sklearn.metrics import recall_score, precision_score
recalls_main = {}
precisions_main = {}
recalls_single = {}
precisions_single = {}
for GROUP in st:
    yy = target.iloc[:,GROUP:GROUP+1]
    name = yy.columns[0]
    oofmain = oof_main_avg[:,GROUP:GROUP+1]
    oofsingle = oof_single_target_avg[:,GROUP:GROUP+1]
    
    recalls_main[name] = recall_score(yy, np.round(oofmain))
    precisions_main[name] = precision_score(yy, np.round(oofmain))
    recalls_single[name] = recall_score(yy, np.round(oofsingle))
    precisions_single[name] = precision_score(yy, np.round(oofsingle))
    
recalls_main = pd.Series(recalls_main)
precisions_main = pd.Series(precisions_main)
recalls_single = pd.Series(recalls_single)
precisions_single = pd.Series(precisions_single)
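
One caveat with rounding at 0.5 on rare targets: if no probability reaches 0.5, nothing is predicted positive and precision is undefined. The sketch below uses hypothetical labels and probabilities to show the `zero_division` parameter of scikit-learn's metrics, and how a lower cut-off changes recall (0.25 is an arbitrary value chosen for the illustration, not a recommendation).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical rare target: 2 positives out of 10 samples.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Probabilities that never reach 0.5, so rounding predicts no positive.
y_prob = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.01, 0.02, 0.01, 0.30, 0.45])

y_hat = np.round(y_prob).astype(int)

# zero_division=0 returns 0 without a warning when nothing is predicted positive.
precision = precision_score(y_true, y_hat, zero_division=0)
recall = recall_score(y_true, y_hat, zero_division=0)

# Lowering the cut-off (0.25 is an arbitrary illustration) recovers both positives.
y_hat_low = (y_prob >= 0.25).astype(int)
recall_low = recall_score(y_true, y_hat_low)

print(precision, recall, recall_low)
```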

In [None]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = recalls_main,
        y = precisions_main,
        mode= 'markers',
        name = 'MultiTarget NN',
        text = recalls_main.index,
        marker = {'size':target[recalls_single.index].sum()/4}
    )
)
fig.add_trace(
    go.Scatter(
        x = recalls_single,
        y = precisions_single,
        mode= 'markers',
        name = 'SingleTarget NN',
        text = recalls_single.index,
        marker = {'size':target[recalls_single.index].sum()/4}
    )
)

fig.update_layout(template = 'presentation', title = 'Precision and Recall for "single" targets')
fig.update_yaxes(title = 'Precision')
fig.update_xaxes(title = 'Recall')
fig.show()

It is interesting to note that the single-target NNs outperform the multi-target NN for targets not related to the main cluster of targets.

# Part 4: Models behavior on main cluster of targets

### ROC and LogLoss

In [None]:
mc = np.where(target.columns.isin(group_score[0]))[0]
oofs_main = []
GROUP = 'MAINMODEL'
print('MAINMODEL')
for SEED in [0,1,2]:
    oof = pd.read_csv(f'../input/moa-oofs/oofs/group{GROUP}_SEED{SEED}.csv', index_col = 0).values
    oofs_main.append(oof)
    roc = np.round(roc_auc_score(target.values[:,mc], oof[:,mc]),3)
    log_loss = np.round(score(target.values[:,mc], oof[:,mc]), 5)
    print(f'Seed {SEED} : roc_auc_score {roc} - log_loss {log_loss}')
    
oof_main_avg = np.mean(oofs_main, axis= 0)
roc = np.round(roc_auc_score(target.values[:,mc], oof_main_avg[:,mc]),3)
log_loss = np.round(score(target.values[:,mc], oof_main_avg[:,mc]), 5)
print(f'Average OOFs : roc_auc_score {roc} - log_loss {log_loss}')

oofs_single_target = []
print('\nSINGLE TARGETS')
for SEED in [0,1]:
    oof_tot = []
    for GROUP in range(206):
        oof = pd.read_csv(f'../input/moa-oofs/oofs/group{GROUP}_SEED{SEED}.csv', index_col = 0).values
        oof_tot.append(oof)
        
    oof_tot = np.hstack(oof_tot)
    oofs_single_target.append(oof_tot)
    roc = np.round(roc_auc_score(target.values[:,mc], oof_tot[:,mc]),3)
    log_loss = np.round(score(target.values[:,mc], oof_tot[:,mc]), 5)
    print(f'Seed {SEED} : roc_auc_score {roc} - log_loss {log_loss}')
    
oof_single_target_avg = np.mean(oofs_single_target, axis= 0)
roc = np.round(roc_auc_score(target.values[:,mc], oof_single_target_avg[:,mc]),3)
log_loss = np.round(score(target.values[:,mc], oof_single_target_avg[:,mc]), 5)
print(f'Average OOFs : roc_auc_score {roc} - log_loss {log_loss}')

### Recall and Precision

In [None]:
from sklearn.metrics import recall_score, precision_score
recalls_main = {}
precisions_main = {}
recalls_single = {}
precisions_single = {}
for GROUP in mc:
    yy = target.iloc[:,GROUP:GROUP+1]
    name = yy.columns[0]
    oofmain = oof_main_avg[:,GROUP:GROUP+1]
    oofsingle = oof_single_target_avg[:,GROUP:GROUP+1]
    
    recalls_main[name] = recall_score(yy, np.round(oofmain))
    precisions_main[name] = precision_score(yy, np.round(oofmain))
    recalls_single[name] = recall_score(yy, np.round(oofsingle))
    precisions_single[name] = precision_score(yy, np.round(oofsingle))
    
recalls_main = pd.Series(recalls_main)
precisions_main = pd.Series(precisions_main)
recalls_single = pd.Series(recalls_single)
precisions_single = pd.Series(precisions_single)

In [None]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = recalls_main,
        y = precisions_main,
        mode= 'markers',
        name = 'MultiTarget NN',
        text = recalls_main.index,
        marker = {'size':target[recalls_single.index].sum()/10}
    )
)
fig.add_trace(
    go.Scatter(
        x = recalls_single,
        y = precisions_single,
        mode= 'markers',
        name = 'SingleTarget NN',
        text = recalls_single.index,
        marker = {'size':target[recalls_single.index].sum()/10}
    )
)

fig.update_layout(template = 'presentation', title = 'Precision and Recall for "main cluster" targets')
fig.update_yaxes(title = 'Precision')
fig.update_xaxes(title = 'Recall')
fig.show()

# Part 5: Predictions on public test set

I noticed that blending my two models gives me poor results, so I want to understand what is happening between the train and test sets. I only have two seeds available here.
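
For context, one simple way to check a blend before submitting is to scan the blend weight on the OOFs. The sketch below does this on hypothetical simulated labels and predictions (`oof_a`, `oof_b` and the simulation itself are assumptions, not the models of this notebook), with the same mean log loss as the `score` function used earlier.

```python
import numpy as np

def score(y, yp):
    # Same mean log loss as the `score` function used in this notebook.
    return -np.mean(y * np.log(yp + 1e-15) + (1 - y) * np.log(1 - yp + 1e-15))

rng = np.random.default_rng(0)

# Hypothetical labels and two sets of OOF probabilities.
y = (rng.random((500, 4)) < 0.05).astype(float)
oof_a = np.clip(0.05 + 0.3 * y + rng.normal(0, 0.02, y.shape), 1e-3, 1 - 1e-3)
oof_b = np.clip(0.05 + 0.2 * y + rng.normal(0, 0.02, y.shape), 1e-3, 1 - 1e-3)

# Scan the weight given to model A in a simple linear blend.
weights = np.linspace(0, 1, 11)
losses = [score(y, w * oof_a + (1 - w) * oof_b) for w in weights]
best_w = weights[int(np.argmin(losses))]

print(best_w, min(losses))
```

A weight chosen this way only reflects the OOFs. If the prediction distribution shifts between train and test, the same weight may not help on the LB.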

In [None]:
preds = {}
for GROUP in ['MAINMODEL','SINGLETARGET']:
    pred = []
    for SEED in [0,1]:
        p = pd.read_csv(f'../input/outputs-moa-models/preds_public_GROUP{GROUP}_SEED{SEED}.csv', index_col = 0).values
        pred.append(p)
    pred = np.mean(pred, axis=0)
    preds[GROUP] = pred

### Gathering my predictions from another notebook

In [None]:
avg_train_pred_main = oof_main_avg.mean(axis=0)
avg_train_pred_single_target = oof_single_target_avg.mean(axis=0)
avg_test_pred_main = preds['MAINMODEL'].mean(axis=0)
avg_test_pred_single_target = preds['SINGLETARGET'].mean(axis=0)

diff_main = pd.Series((avg_train_pred_main-avg_test_pred_main), 
                      index = target.columns).sort_values(ascending = False)

diff_main = diff_main[(diff_main>0.0005) | (diff_main<-0.0005)]
diff_single_target = pd.Series((avg_train_pred_single_target-avg_test_pred_single_target), 
                      index = target.columns).sort_values(ascending = False)
diff_single_target = diff_single_target[(diff_single_target>0.0005) | (diff_single_target<-0.0005)]
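
The differences above are absolute. As a minimal sketch of the relative version mentioned in the update notes, the difference can be divided by the train mean; the values and names below (`avg_train`, `avg_test`, `t1` to `t3`) are hypothetical.

```python
import pandas as pd

# Hypothetical mean predictions per target on train (OOF) and test.
names = ["t1", "t2", "t3"]
avg_train = pd.Series([0.020, 0.0010, 0.0050], index=names)
avg_test = pd.Series([0.018, 0.0020, 0.0050], index=names)

absolute_diff = avg_train - avg_test
# Relative to the train mean, so rare targets are not hidden by frequent ones.
relative_diff = absolute_diff / avg_train

print(relative_diff.sort_values(ascending=False))
```

In this toy case `t2` has the smaller absolute difference but its mean prediction doubles on the test set, which only the relative view makes visible.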

### Scatter Plot

The scatter plot below shows the mean prediction of each target for the train set and the test set. It is interesting to note that the distribution of predictions for a target can change a lot from train to test set, which might explain why my single-target models underperform on the LB.

In [None]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x = avg_train_pred_main,
        y = avg_test_pred_main,
        name = 'Multi Target NN',
        mode = 'markers',
        text = target.columns,
        marker = {'size':np.sqrt(target.sum(axis=0))},
        opacity = 0.8
    )
)
fig.add_trace(
    go.Scatter(
        x = avg_train_pred_single_target,
        y = avg_test_pred_single_target,
        name = 'Single Target NN',
        mode = 'markers',
        text = target.columns,
        marker = {'size':np.sqrt(target.sum(axis=0))},
        opacity = 0.8
    )
)

fig.add_trace(
    go.Scatter(
        x = [0,0.05],
        y = [0,0.05],
        marker = {'color':'black', 'line' : {'width':0.1}},
        mode = 'lines',
        showlegend = False
    )
)

fig.update_layout(template = 'presentation', title = 'Mean Prediction by target, differences between train and test set')
fig.update_xaxes(title = 'Train set')
fig.update_yaxes(title = 'Test set')
fig.show()

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x = diff_single_target.index,
        y = diff_single_target.values,
        name = 'Single Target NN'
    )
)

fig.add_trace(
    go.Bar(
        x = diff_main.index,
        y = diff_main.values,
        name = 'Multi Targets NN'
    )
)

fig.update_layout(template = 'presentation', title = 'Difference of predictions mean between Train and Test Sets')
fig.update_xaxes(tickfont = {'size':5}, tickangle = 45)
fig.show()