The objective of this notebook is to look for patterns across the 206 scored targets. It might be possible to find patterns that trace back to features and give some incremental reduction in logloss. 

Update: There is a pattern that shows improvement in logloss for one of the scored targets. I haven't tried it on the full target set. It might be useful if you are ensembling high-scoring deep learning models with a tabular-type model.

In [None]:
import pandas as pd

# Part 1. Scored targets


## The control group pattern

In this [notebook](https://www.kaggle.com/artgor/lish-moa-baseline-approach), 2x Grandmaster @artgor mentions that observations in the control group show no response in the train set targets. Let's confirm that claim.

In [None]:
targets =  pd.read_csv("../input/lish-moa/train_targets_scored.csv",
                 index_col=['sig_id'])
targets_0_idx = targets[targets.sum(axis=1)==0].index

train_features =  pd.read_csv("../input/lish-moa/train_features.csv",
                 index_col=['sig_id'])
control_idx = train_features.query('cp_type=="ctl_vehicle"').index

diffs = len(set(control_idx) - set(targets_0_idx))


In [None]:
test_features =  pd.read_csv("../input/lish-moa/test_features.csv",
                 index_col=['sig_id'])
test_control_idx = test_features.query('cp_type=="ctl_vehicle"').index

ctrl_pct = len(set(test_control_idx))/len(test_features)

print(diffs, ctrl_pct)

Yes, it is the case that all ids in the train set's control group show no response. So there is our first observable pattern in the targets that translates to the feature set: All ids in the control group - about 9% of the test set - will likely have all 0 targets.
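
If the pattern holds on the test set, one way to use it is as a post-processing step on the predictions. The sketch below is only an illustration on made-up data: the `preds` frame, its two target columns, and the three `sig_id` values are hypothetical stand-ins for a real model's output, and it assumes test-set controls behave like train-set controls.

```python
import pandas as pd

# Hypothetical toy data: three test rows, one of them a control.
test_features = pd.DataFrame(
    {"cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"]},
    index=pd.Index(["id_a", "id_b", "id_c"], name="sig_id"),
)

# Hypothetical model output: one probability column per target.
preds = pd.DataFrame(
    {"target_1": [0.20, 0.10, 0.05], "target_2": [0.01, 0.30, 0.02]},
    index=test_features.index,
)

# Overwrite every prediction for a control row with zero;
# treated rows are left untouched.
is_control = test_features["cp_type"] == "ctl_vehicle"
preds.loc[is_control, :] = 0.0
```

Only the `ctl_vehicle` row changes, so the step is cheap to apply after any model.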

## Correlation in the treated cases

One way to find aggregated patterns in targets is of course to look at correlations. Here is a correlation matrix heatmap that shows points for any two targets with above 70% correlation (positive or negative). There are 4 such pairs with two of those pairs above 90%. (The heatmap shows two permutations per pair, one above and one below the diagonal.) 

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


targets_treated = targets.drop(targets_0_idx)
corr_mtx = abs(targets_treated.corr())

corr_map = corr_mtx[corr_mtx>=.7]
plt.figure(figsize=(12,8))
sns.heatmap(corr_map, cmap="viridis")

In [None]:
pairs = (corr_mtx.where(np.triu(np.ones(corr_mtx.shape), k=1).astype(bool))  # np.bool was removed from NumPy
                     .stack()
                     .sort_values(ascending=False))
pairs[:4]
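
The upper-triangle trick in the cell above is easier to follow on a tiny matrix. This is a toy sketch with a hypothetical 3x3 correlation matrix (targets `a`, `b`, `c`), not the competition data. The explicit `dropna()` is an addition of mine so the result does not depend on whether the installed pandas drops blanked cells inside `stack()`.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 absolute-correlation matrix for targets a, b, c.
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.4],
     [0.1, 0.4, 1.0]],
    index=list("abc"), columns=list("abc"),
)

# True strictly above the diagonal, False on and below it.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() blanks the masked-out cells to NaN, stack() flattens the frame
# into a Series indexed by (row, column), and dropna() removes any
# blanked cells that survive the stack.
toy_pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
```

Each unordered pair appears once, and the diagonal of self-correlations is excluded.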

Correlation drops right off and is very weak for the remaining 185 pairs. The histogram below shows that most samples have 0 or 1 positive responses across all targets. 

In [None]:
targets.sum(axis=1).hist()

Looking again at the 4 pairs above, we see the 3 pairs below the top pair are all related. Here's a distance graph of those along with 16 more pairs. The networkx package is magic in this application!

In [None]:
import networkx as nx

pairs_df = (1-pairs).reset_index()
G = nx.from_pandas_edgelist(pairs_df[:20], source='level_0', target='level_1', edge_attr=0)

graph_opts = dict(arrows=False,
                  node_size=5,
                  width=2,
                  alpha=0.8,
                  font_size=12,
                  font_color='darkblue',
                  edge_color='darkgray'
                 )

fig= plt.figure(figsize=(12,10))
nx.draw_spring(G, with_labels=True, **graph_opts)

## Another pattern

The next step is to look at individual samples and see which sets look similar. We could go group by group through the above network, or we can look for similarities among cases from a different angle. Remember, we want to trace these patterns in the target set back to the feature set. 

Here I use the [missingno](https://github.com/ResidentMario/missingno) package by Notebooks Master @residentmario to help find patterns. There are a few interesting things at which to look (check that proper grammar!). Notice the dense bands at the top leftish part of the chart.

In [None]:
import missingno as msno

cols_sorted = targets_treated.sum().sort_values(ascending=False).index
targets_visible = targets_treated[cols_sorted].replace(0, np.nan)
                                                # you can also use pd.NA with pandas v1+

msno.matrix(targets_visible.iloc[:, :50].sample(n=1000), sort='descending', color=(1,0,0))

We see this picture upon drilling down.

In [None]:
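top_cols, idx = np.unique(pairs_df.iloc[:4, :2].values.flatten(), return_index=True)
                                                    # use df.to_numpy() for 1.0+
# idx holds first-occurrence positions in the flattened array, so argsort(idx)
# restores order of first appearance without indexing past the end of top_cols
msno.matrix(targets_visible.loc[targets_treated[top_cols].any(axis=1), 
                top_cols[np.argsort(idx)]].sort_values(['flt3_inhibitor', 'pdgfr_inhibitor', 'kit_inhibitor']), 
                sort=None, color=(1,0,0))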


In [None]:
targets.proteasome_inhibitor.sum()/len(targets)

Well this is rather interesting. One way to characterize it is that if a case shows a positive response for the proteasome inhibitor, it will almost always show a positive response for the nfkb_inhibitor as well. Furthermore, the case is very likely to show no response to the other 3 inhibitors in the chart. It's 3% of the population, which is potentially nice.
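
The "almost always" above can be put into numbers as conditional rates. Here is a minimal sketch on a hypothetical six-row target frame. The column names match the competition targets, but the values are made up.

```python
import pandas as pd

# Hypothetical toy targets; the real frame has 206 columns.
toy = pd.DataFrame({
    "proteasome_inhibitor": [1, 1, 1, 0, 0, 0],
    "nfkb_inhibitor":       [1, 1, 0, 1, 0, 0],
    "kit_inhibitor":        [0, 0, 0, 0, 1, 0],
})

# Keep only rows where the conditioning target is positive, then average:
# each column mean is the share of those rows that also carry that label.
given = toy[toy["proteasome_inhibitor"] == 1]
cond_rates = given.mean()
```

On the real data the same two lines, applied to `targets` and restricted to `top_cols`, would give the share of proteasome-positive cases that are also positive for each of the other four inhibitors.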

## More patterns?

The network graph above showed several relationships in the targets. Missingno has a dendrogram tool that does something similar and unlike the matrix, doesn't depend on sort order. Here is the dendrogram for the cases receiving real treatment.

Shorter bars indicate tighter relationships.

In [None]:
msno.dendrogram(targets_visible)

# Part 2. Features

## Feature importance

Detecting patterns is useful in the competition only if we can trace them back to patterns in the features. And then, it depends on whether the model we use found those patterns already. In my experience machine learning models usually find the patterns before humans do - that's why we're all here. Sometimes though we can find incremental improvements that clue us in to new feature combinations.

The relationship of the 5 inhibitors seems a good place to start. I'll use a RandomForestClassifier on the proteasome inhibitor target as a simple baseline for the model and for finding important features. The imbalanced data makes results very sensitive to data splits and random seeds, so we want something that can generalize. I'm using the ELI5 library and permutation importance based on several iterations of a cross-validated model. 

In [None]:
from multiprocessing import cpu_count
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, make_scorer

nproc = cpu_count()

In [None]:
X = train_features
y = targets.proteasome_inhibitor

X[['cp_type', 'cp_dose']] = X[['cp_type', 'cp_dose']].astype('category').apply(lambda x: x.cat.codes)

rf_model = RandomForestClassifier(n_estimators=100, random_state=10, verbose=True, n_jobs=nproc)
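# log loss needs predicted probabilities and is lower-is-better,
# so use the built-in negated scorer rather than a bare make_scorer(log_loss)
scorer = 'neg_log_loss'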
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=24)  # random_state has an effect only with shuffle=True


In [None]:
%%time

perm = PermutationImportance(rf_model, scoring=scorer, cv=skf)
perm.fit(X, y)

In [None]:
weights = eli5.explain_weights_dfs(perm, feature_names=X.columns.tolist())
weights_df = weights['feature_importances']
weights_df[:15]
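
For readers who have not met permutation importance before, here is the idea stripped down to NumPy. This is a hand-rolled sketch, not ELI5's implementation: the data is synthetic, and the fixed logistic function `predict_proba` with its hypothetical `weights` stands in for a trained classifier. A feature's importance is the increase in log loss after its column is shuffled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 features, only the first two drive the label.
X = rng.normal(size=(500, 3))
weights = np.array([2.0, 0.5, 0.0])  # stand-in for a fitted model
y = (X @ weights + rng.normal(scale=0.5, size=500) > 0).astype(int)

def predict_proba(features):
    """Fixed logistic scorer standing in for a trained classifier."""
    return 1.0 / (1.0 + np.exp(-(features @ weights)))

def binary_log_loss(y_true, p, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

base = binary_log_loss(y, predict_proba(X))

importance = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between column j and y
    importance[j] = binary_log_loss(y, predict_proba(X_perm)) - base
```

The third feature has zero weight, so shuffling it leaves the predictions unchanged and its importance is zero. Shuffling the first feature hurts the most.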

## Feature engineering

Now we can add some of the important features in as combinations of each other. I'll use pairwise combinations here. This technique usually has more of a benefit for linear models than for tree-based models. I didn't notice any competitive linear models in the notebooks though and am checking it out this way.

In [None]:
from itertools import combinations

train_features_with = train_features.copy()
important = weights_df.loc[:5, 'feature'].tolist()
for pair in combinations(important, 2):
    col = "_".join(pair)
    train_features_with[col] = train_features_with[pair[0]] * train_features_with[pair[1]]

train_features_with[:5]
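
To see what the loop above produces, here is the same pairwise-product construction on a toy frame. The three column names mimic the competition's `g-` and `c-` features, but the frame and its values are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy features with two rows.
toy = pd.DataFrame({"g-0": [1.0, 2.0], "g-1": [3.0, -1.0], "c-0": [0.5, 4.0]})

base_cols = list(toy.columns)
for left, right in combinations(base_cols, 2):
    # One new column per unordered pair: the elementwise product.
    toy[f"{left}_{right}"] = toy[left] * toy[right]

new_cols = [c for c in toy.columns if c not in base_cols]
```

With k base features this adds k(k-1)/2 columns. Assuming `weights_df` has the default integer index, `weights_df.loc[:5, 'feature']` is label-inclusive and selects 6 features, which gives 15 new columns.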


In [None]:
X_with = train_features_with
rf_model_with = RandomForestClassifier(n_estimators=100, random_state=10, verbose=True, n_jobs=nproc)


# Part 3. Models and scores

## Comparison
Here are loglosses for the baseline and augmented set. Again, there are several iterations with the same 5-fold CV splits within an iteration.

In [None]:
%%time

losses = np.zeros((2, 3, 5))
for i in range(3):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=i*8)  # shuffle so each seed gives different splits
    scorer = "neg_log_loss"  # built-in scorer: uses predict_proba and returns the negated log loss
    losses[0,i] = cross_val_score(rf_model, X, y, scoring=scorer, cv=skf, n_jobs=nproc)
    losses[1,i] = cross_val_score(rf_model_with, X_with, y, scoring=scorer, cv=skf, n_jobs=nproc)

In [None]:
print(f"baseline: {np.mean(losses[0])} mean, {np.std(losses[0])} std dev\n"
      f"with features: {np.mean(losses[1])} mean, {np.std(losses[1])} std dev"
      )
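
One way to read numbers like these is to set the relative gain against the fold-to-fold spread, and to compare the two models fold by fold, since both are scored on the same splits. The sketch below does that on a hypothetical array shaped like `losses`. The values are placeholders, not results from this notebook, and the paired comparison is my suggestion rather than something the cells above compute.

```python
import numpy as np

# Hypothetical scores shaped like `losses`: (2 feature sets, 3 repeats, 5 folds).
# Values are negated log losses, so numbers closer to zero are better.
# These are NOT real results.
toy_losses = -np.array([
    [[0.020, 0.030, 0.025, 0.022, 0.028]] * 3,  # baseline
    [[0.018, 0.028, 0.024, 0.020, 0.026]] * 3,  # with pair features
])

# Flip the sign back to positive log loss before taking ratios.
mean_loss = -toy_losses.mean(axis=(1, 2))
rel_gain = (mean_loss[0] - mean_loss[1]) / mean_loss[0]

# Spread of the baseline across folds and repeats, relative to its mean.
rel_spread = toy_losses[0].std() / mean_loss[0]

# Paired view: positive differences mean the augmented set scored better on that fold.
paired_diff = toy_losses[1] - toy_losses[0]
wins = int((paired_diff > 0).sum())
```

When `rel_spread` exceeds `rel_gain`, as in this toy case, the count of per-fold wins is a more reassuring signal than the difference in means alone.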

## Closing thoughts

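There is about an 8% improvement in the mean logloss when using the added features, which is promising. However, the standard deviation across folds and repeats runs at ~10%, which is larger than the gain itself, so the improvement could be noise. Even a bump on the public leaderboard may not translate to gains on the private LB.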

Also there's the matter of the other 205 targets! I hope at least the notebook gives you some ideas for feature engineering based on patterns in the targets. 

Good luck for the duration of the contest!