# The new data
As was [just announced](https://www.kaggle.com/c/lish-moa/discussion/195170), a file with drug IDs has now been released, but only for the training data (see also [this discussion](https://www.kaggle.com/c/lish-moa/discussion/195179)). Typically, a drug should have 6 records (or a multiple thereof, if it was tried in multiple experiments), one for each combination of dose (2 levels) and time (3 levels). For various reasons some of these records may not exist, which leads to record counts that are not a multiple of 6.

Knowing which records are from the same drug enables different forms of cross-validation (e.g. always keeping all records of a drug within the same CV fold) that might more closely match the actual prediction task. 
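
To make the idea of drug-aware folds concrete, here is a minimal sketch on made-up toy data. The helper `assign_drug_folds`, the toy IDs and the choice of 3 folds are assumptions for illustration, not part of the competition data or of any library.

```python
import numpy as np
import pandas as pd

def assign_drug_folds(drug_ids, n_folds=5, seed=0):
    """Return one fold number per record; records of the same drug share a fold."""
    drug_ids = pd.Series(drug_ids).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(drug_ids.unique())
    # Deal the shuffled drugs out to the folds round-robin.
    fold_of_drug = {drug: i % n_folds for i, drug in enumerate(shuffled)}
    return drug_ids.map(fold_of_drug).to_numpy()

# Toy data (made up): drug IDs for eight records.
toy = pd.DataFrame({
    'sig_id': [f'id_{i}' for i in range(8)],
    'drug_id': ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'e'],
})
toy['fold'] = assign_drug_folds(toy['drug_id'], n_folds=3)
print(toy)
```

This deals out drugs, not records, so folds can end up with quite different numbers of records if a few drugs have very many of them. scikit-learn's `GroupKFold` is a ready-made alternative that takes the drug IDs as `groups`.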

So, let's load this data and have a look at it.

In [None]:
mydir = '../input/lish-moa/'

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

drugs = pd.read_csv(mydir + 'train_drug.csv')
scored = pd.read_csv(mydir + 'train_targets_scored.csv')
not_scored = pd.read_csv(mydir + 'train_targets_nonscored.csv')

y_names_scored = scored.set_index('sig_id').columns.to_list()
y_names_not_scored = not_scored.set_index('sig_id').columns.to_list()

alltargets = pd.merge(
    pd.merge(drugs, 
             scored, 
             on='sig_id', how='inner'),
    not_scored, 
    on='sig_id', how='inner')


So, what does the new file look like?

In [None]:
drugs

As we can see, the file links each unique record ID (`sig_id`) in the training data to a drug ID (`drug_id`). Let us create some summaries and then plot these to get a better idea of what is going on.

In [None]:
drug_records = alltargets[['drug_id', 'sig_id']].groupby('drug_id').count()
drug_moas_all = alltargets.groupby('drug_id')[y_names_scored+y_names_not_scored].mean()
drug_moas_scored = alltargets.groupby('drug_id')[y_names_scored].mean()

moas_per_drug_all = pd.DataFrame(drug_moas_all.sum(axis=1)).reset_index().groupby(0).count().reset_index().rename(columns={0: 'Number of all (scored and not scored) MoAs', 'drug_id':'Number of drugs' })
moas_per_drug_scored = pd.DataFrame(drug_moas_scored.sum(axis=1)).reset_index().groupby(0).count().reset_index().rename(columns={0: 'Number of scored MoAs', 'drug_id':'Number of drugs' })

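# Count how many drugs have each number of records. Naming the columns explicitly
# keeps this independent of how the pandas version labels the value_counts() output.
record_counts = drug_records['sig_id'].value_counts().rename_axis('Number of records').reset_index(name='Drugs').sort_values('Number of records').reset_index(drop=True)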

moa_counts_all = pd.DataFrame( drug_moas_all.sum(axis=0) ).reset_index().rename(columns={0: 'Drugs', 'index':'Mode of action' }).sort_values('Drugs').reset_index(drop=True)
moa_counts_scored = pd.DataFrame( drug_moas_scored.sum(axis=0) ).reset_index().rename(columns={0: 'Drugs', 'index':'Mode of action' }).sort_values('Drugs').reset_index(drop=True)
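
The per-drug means above only turn into clean MoA counts (0 or 1 per drug and target) if all records of a drug carry identical labels, which is assumed rather than checked here. A minimal sketch of such a check on made-up toy data follows; the column names `moa_1` and `moa_2` are hypothetical, and on the real data one would use `alltargets` with `y_names_scored` instead.

```python
import pandas as pd

# Toy data (made up): drug 'b' has inconsistent labels for 'moa_1'.
toy = pd.DataFrame({
    'drug_id': ['a', 'a', 'b', 'b', 'c'],
    'moa_1':   [1, 1, 0, 1, 0],
    'moa_2':   [0, 0, 1, 1, 0],
})
target_cols = ['moa_1', 'moa_2']

# Number of distinct label values per drug and target; 1 means consistent.
distinct = toy.groupby('drug_id')[target_cols].nunique()
inconsistent_drugs = distinct.index[(distinct > 1).any(axis=1)].to_list()
print(inconsistent_drugs)
```

An empty list would mean that every drug has one fixed set of labels across all its records.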

# How many records are there per drug?

As we can see, quite a few (>100) drugs have just 1 record. 6 is the most common number, but 1, 5, 7, 12 and 13 are also pretty common. Obviously, the 1866 records for one drug are all vehicle records. I'd speculate that the really commonly occurring drugs (with 178 to 718 records) might be commonly used reference drugs with known properties, but that's just speculation.

In [None]:
sns.set(rc={'figure.figsize':(16,9)});
g = sns.barplot(x='Number of records', y='Drugs', data=record_counts);
g.set_yscale("log")
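
To relate these counts back to the expected 6 records per drug (2 doses x 3 times), one could flag the drugs whose record count is not a multiple of 6. This is a minimal sketch on made-up toy data; on the real data the same two lines would run on `drugs['drug_id']`.

```python
import pandas as pd

# Toy data (made up): one row per record, with its drug ID.
toy = pd.DataFrame({'drug_id': ['a'] * 6 + ['b'] * 5 + ['c'] * 12 + ['d']})

records_per_drug = toy['drug_id'].value_counts().sort_index()
# True where the count cannot be a whole number of dose x time grids.
incomplete = records_per_drug % 6 != 0
print(records_per_drug[incomplete])
```

Note that a count that is a multiple of 6 does not prove that every dose and time combination is present; checking that would need the dose and time columns from the features file.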

# How many drugs are there for each (scored) mode of action?
There are quite a few MoAs with just 1 to 7 drugs (that's almost half the MoAs), which obviously makes cross-validation respecting drugs difficult for these MoAs. I did not plot the non-scored MoAs, but for those, there are even some MoAs without any drugs.

In [None]:
sns.set(rc={'figure.figsize':(28,9)});
g = sns.barplot(x='Mode of action', y='Drugs', data=moa_counts_scored);
g.set_xticklabels(labels=moa_counts_scored['Mode of action'].values, rotation=90, fontdict={'fontsize':9});

The problematic MoAs (i.e. those with fewer than 7 drugs) are:

In [None]:
moa_counts_scored.loc[ moa_counts_scored['Drugs']<7, 'Mode of action'].values
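
Why few drugs per MoA hurt drug-aware cross-validation can be put as simple arithmetic: if all records of a drug sit in one fold, a MoA with n drugs can have positives in at most min(n, k) of k folds. The sketch below just evaluates that bound; the function name and the choice of 5 folds are assumptions for illustration.

```python
def min_folds_without_positives(n_drugs, n_folds):
    """Lower bound on validation folds that contain no drug with this MoA."""
    return n_folds - min(n_drugs, n_folds)

n_folds = 5  # assumed number of CV folds, not taken from the notebook
for n_drugs in (1, 3, 6):
    print(n_drugs, min_folds_without_positives(n_drugs, n_folds))
```

For a MoA with a single drug the situation is worse still: whenever that drug lands in the validation fold, the training folds contain no positive example for this MoA at all.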

# How many MoAs are there per drug?

As we can see, most drugs have 1 or no MoA, but there's even one drug with 7 MoAs.

In [None]:
sns.set(rc={'figure.figsize':(16,9)});
g = sns.barplot(x='Number of scored MoAs', y='Number of drugs', data=moas_per_drug_scored);
g.set_yscale("log")

If we also look at the non-scored MoAs, there are obviously slightly more MoAs per drug, with the number of MoAs going all the way up to 11.

In [None]:
sns.set(rc={'figure.figsize':(16,9)});
g = sns.barplot(x='Number of all (scored and not scored) MoAs', y='Number of drugs', data=moas_per_drug_all);
g.set_yscale("log")