In [None]:
# TODO: Mix R and Python using reticulate?
# Maybe later...
# Can reticulate be used?
# Using py2r? Does this exist?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

**Warning**: before we start, note that, since this is a code competition, the test dataset 
we have here is only a subset of the full test set. When the notebook is re-run, the "real" test dataset
will be about 4 times larger. Some insights about the test set thus only partially reflect reality. Use this fact to your 
advantage.

With that out of the way, let's go!

# Load the datasets

There are two types of datasets:
    
- features
- targets (scored)

There is also one extra type of file with `non scored` targets but we will get back to this later.
Let's load the different DataFrames.

In [None]:
train_features_df = pd.read_csv("../input/lish-moa/train_features.csv")
test_features_df = pd.read_csv("../input/lish-moa/test_features.csv")
train_targets_df = pd.read_csv("../input/lish-moa/train_targets_scored.csv")

# Features

To get started, let's explore the features (both train and public test). For that, we will check 
a few samples, the data types, the column names and the shapes.

In [None]:
train_features_df.sample(2).T

In [None]:
test_features_df.sample(2).T

In [None]:
train_features_df.dtypes

In [None]:
train_features_df.columns.tolist()

In [None]:
train_features_df.shape

In [None]:
test_features_df.shape

In [None]:
len(train_features_df.columns[train_features_df.columns.str.startswith("c-")])

In [None]:
len(train_features_df.columns[train_features_df.columns.str.startswith("g-")])

In [None]:
for col in train_features_df.select_dtypes(["object", "int"]):
    print(f"Unique value counts for {col}")
    print(train_features_df.loc[:, col].value_counts())

A few first insights (for both train and test):

- We have **876** columns
- Among these, lots of columns start with `c-` or `g-`:
    * **100** columns of c type from 0 to 99: c-0, c-1, and so on. These are the cell viability features.
    * **772** columns of g type from 0 to 771: g-0, g-1, and so on. These are the gene expression features.
- `cp_type`: either `trt_cp` or `ctl_vehicle`.
- `cp_dose`: either `D1` or `D2`. These could be low and high doses (one of which could be lethal to the cells...).
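
The column breakdown above can also be double-checked in one pass by grouping the column names on their prefix. This is only a minimal sketch on a hypothetical miniature frame (`toy_df`, its values and the helper `column_group` are made up for illustration); the same idea should carry over to `train_features_df`.

```python
import pandas as pd

# Hypothetical miniature frame using the same naming scheme as the features file.
toy_df = pd.DataFrame(
    {
        "sig_id": ["id_a", "id_b"],
        "cp_type": ["trt_cp", "ctl_vehicle"],
        "cp_time": [24, 48],
        "cp_dose": ["D1", "D2"],
        "g-0": [0.1, -0.2],
        "g-1": [1.3, 0.4],
        "g-2": [-0.7, 0.0],
        "c-0": [0.5, 0.6],
        "c-1": [-1.1, 0.9],
    }
)


def column_group(name):
    """Return 'g' or 'c' for prefixed columns, 'other' for everything else."""
    prefix = name.split("-")[0]
    return prefix if prefix in ("g", "c") else "other"


# Count how many columns fall into each group.
group_counts = pd.Series(toy_df.columns).map(column_group).value_counts()
print(group_counts)
```

On the real features, the three counts should add up to the total number of columns.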



In [None]:
train_features_df["sig_id"].unique().tolist()

In [None]:
train_features_df["sig_id"].nunique()

In [None]:
train_features_df["cp_dose"].nunique()

In [None]:
train_features_df["cp_dose"].unique()

# Targets

Now that we understand the features better, let's explore the targets a little bit more.
To start, let's have a look at the count of positives (i.e. ones) for each target.
To make this more readable, we will split the targets across several bar plots.

In [None]:
# Number of positives per target, most frequent first.
targets_count_s = train_targets_df.drop("sig_id", axis=1).sum().sort_values(ascending=False)

# Split the targets into 10 chunks, one bar plot per chunk.
n_chunks = 10
chunk_size = len(targets_count_s) // n_chunks + 1

fig, axes = plt.subplots(n_chunks, 1, figsize=(8, 100))

for ax_id, chunk_id in enumerate(range(0, len(targets_count_s), chunk_size)):
    targets_count_s.iloc[chunk_id:chunk_id+chunk_size].sort_values().plot(kind="barh", ax=axes[ax_id])

`nfkb_inhibitor` is the most common target and `atp-sensitive_potassium_channel_antagonist` is the least common one
(with a mere 1 occurrence).

To prepare the next section, let's add one new column to the targets: for each row, 
the sum across all the targets.

In [None]:
train_targets_df["all"] = train_targets_df.drop("sig_id", axis=1).sum(axis=1)

# Control groups

As stated in the data description, some rows come from control groups. How to spot these? 
That's easy: we just need to look at the `cp_type` column, where control rows are marked `ctl_vehicle`.

In [None]:
df = train_features_df.merge(train_targets_df, on="sig_id")

In [None]:
(df.groupby(["all", "cp_type"])
    .size()
    .unstack(fill_value=0))

A few things to say about this occurrence matrix:
    
    
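1. All `ctl_vehicle` rows have 0 targets (first row, `ctl_vehicle` column), so there are no positives in the control group.
This allows us to do two things: first, we can train without these control rows and second, we can predict 0 for every target of the control rows in the test set (assuming they behave the same way).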
2. Very few rows have many targets. For instance, we have 6 rows with 7 targets at 1.
We will explore some of these rows and see if they share something (spoiler: they do).
3. In contrast, most of the rows have 0 or 1 target at a time (TODO: add percentage). 
This means that we can simplify the problem by treating it as a multiclass problem (dropping the rows with more than 1 target).
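
The missing percentage in point 3 and the multiclass simplification could be computed along these lines. This is only a sketch on a hypothetical toy targets frame: `toy_targets`, the target names `t1` to `t3` and the `"no_target"` placeholder class are assumptions made for illustration, not part of the competition data.

```python
import pandas as pd

# Hypothetical toy targets: one row per sig_id, one binary column per target.
toy_targets = pd.DataFrame(
    {
        "sig_id": ["a", "b", "c", "d", "e"],
        "t1": [0, 1, 0, 1, 0],
        "t2": [0, 0, 1, 1, 0],
        "t3": [0, 0, 0, 1, 0],
    }
)
target_cols = ["t1", "t2", "t3"]

# Number of positive targets per row (same idea as the "all" column).
n_targets = toy_targets[target_cols].sum(axis=1)

# Share of rows for each number of targets, in percent.
share_pct = n_targets.value_counts(normalize=True).sort_index() * 100
print(share_pct)

# Keep only the rows with at most one target, then derive a single class label.
single = toy_targets.loc[n_targets <= 1].copy()
single["label"] = single[target_cols].idxmax(axis=1)
# idxmax returns the first column for all-zero rows, so overwrite those
# with an explicit placeholder class.
single.loc[single[target_cols].sum(axis=1) == 0, "label"] = "no_target"
print(single[["sig_id", "label"]])
```

Note that rows without any positive target need a class of their own, otherwise they would be silently assigned to the first target.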

# Specific MoA?

One interesting observation is that a few `sig_id`s always have the same group of targets. Let's find out about these.
We will start with the rarest, i.e. those with a sum of targets equal to 7 (this number can be seen in the occurrence matrix above): 



In [None]:
train_targets_df.loc[lambda df: (df["all"] == 7), :].loc[:, lambda df: (df != 0).any(axis=0)]

Let's also check the next number in the list: those having 5 targets.

In [None]:
train_targets_df.loc[lambda df: (df["all"] == 5), :].loc[:, lambda df: (df != 0).any(axis=0)]
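
To check whether such rows really share the same combination of targets, one option is to turn the active targets of each row into a single string and count the distinct combinations. The sketch below uses hypothetical toy data (`toy_targets` and the target names `t1` to `t4` are made up), with 3 active targets standing in for the 7 or 5 of the real data.

```python
import pandas as pd

# Hypothetical toy targets indexed by sig_id.
toy_targets = pd.DataFrame(
    {
        "sig_id": ["a", "b", "c", "d"],
        "t1": [1, 1, 0, 1],
        "t2": [1, 1, 1, 0],
        "t3": [0, 0, 1, 0],
        "t4": [1, 1, 0, 0],
    }
).set_index("sig_id")

# One string per row listing the names of its active targets.
active = pd.Series(
    [" + ".join(toy_targets.columns[row == 1]) for row in toy_targets.to_numpy()],
    index=toy_targets.index,
)

# Restrict to rows with exactly 3 active targets and count the combinations.
n_active = toy_targets.sum(axis=1)
combo_counts = active[n_active == 3].value_counts()
print(combo_counts)
```

A single entry in `combo_counts` means that all the selected rows share exactly the same group of targets.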

# Mean model

One interesting model to start with is the mean model, i.e. predicting the mean of the train 
targets.
A refined version of this is to remove the control group from the train targets, then compute the mean.
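
Both variants could look roughly like this. The sketch assumes the mean is taken per target column (one reading of "mean over all targets") and uses hypothetical toy frames: `toy_features`, `toy_targets`, `toy_test_ids` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two treated rows and two control rows.
toy_features = pd.DataFrame(
    {
        "sig_id": ["a", "b", "c", "d"],
        "cp_type": ["trt_cp", "trt_cp", "ctl_vehicle", "ctl_vehicle"],
    }
)
toy_targets = pd.DataFrame(
    {
        "sig_id": ["a", "b", "c", "d"],
        "t1": [1, 0, 0, 0],
        "t2": [1, 1, 0, 0],
    }
)
target_cols = ["t1", "t2"]

# Plain mean model: per-target mean over all the training rows.
mean_all = toy_targets[target_cols].mean()

# Refined version: drop the control rows before averaging.
merged = toy_features.merge(toy_targets, on="sig_id")
mean_treated = merged.loc[merged["cp_type"] != "ctl_vehicle", target_cols].mean()

# Every (hypothetical) test row gets the same vector of means as prediction.
toy_test_ids = ["x", "y"]
preds = pd.DataFrame(
    {col: mean_treated[col] for col in target_cols}, index=toy_test_ids
)
print(mean_all, mean_treated, preds, sep="\n\n")
```

On the real data, the helper column `all` added earlier would have to be left out of `target_cols`.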