## About this Competition

Scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action, or MoA for short.

Hence, our task is to use the training dataset to develop an algorithm that automatically labels each case in the test set as one or more MoA classes. Note that since drugs can have multiple MoA annotations, the task is formally a multi-label classification problem.
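
To make the multi-label setting concrete, here is a minimal sketch of a binary target matrix in which one sample carries two labels at once. The MoA names and values are hypothetical and chosen only for illustration; they are not taken from the competition data.

```python
import numpy as np

# Hypothetical MoA names, for illustration only.
moa_names = ["moa_a", "moa_b", "moa_c"]

# One row per sample, one binary column per MoA.
targets = np.array([
    [1, 0, 0],   # a single MoA
    [1, 0, 1],   # two MoAs at once
    [0, 0, 0],   # no MoA (e.g. a control sample)
])

# Number of active labels per sample.
labels_per_sample = targets.sum(axis=1)
print(labels_per_sample)  # [1 2 0]
```

A multi-class model would force exactly one label per row; a multi-label model predicts each column independently.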

Solutions are evaluated with the logarithmic loss, averaged over every drug-MoA annotation pair.
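
As a rough sketch of this metric, assuming the usual binary log loss averaged over all (sample, MoA) pairs, the score could be computed with NumPy as below. The helper name `mean_columnwise_log_loss` and the clipping constant `eps` are assumptions made for this example, not part of the competition code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over every (sample, MoA) pair."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0); eps is an assumed value for this sketch.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return losses.mean()

# Toy example: 2 samples, 3 hypothetical MoA labels.
y_true = np.array([[1, 0, 0], [0, 1, 1]])
y_pred = np.full((2, 3), 0.5)
score = mean_columnwise_log_loss(y_true, y_pred)
print(score)  # log(2), about 0.693
```

Because every target column has the same number of rows, this average equals the log loss of the flattened arrays, which is what the cross-validation loop later in this notebook computes with `log_loss(np.ravel(...), np.ravel(...))`.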

***train_features.csv*** / ***test_features.csv*** - Features for the training and test sets.
<br>Features prefixed with g- are gene expression data, and
features prefixed with c- are cell viability data.
cp_type indicates samples treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle); control perturbations have no MoAs.
cp_time and cp_dose indicate treatment duration (24, 48, 72 hours) and dose (high or low).
<br>***train_targets_scored.csv*** - The binary MoA targets that are scored.
<br>***sample_submission.csv*** - A submission file in the correct format.

## References
*  https://www.kaggle.com/fchmiel/xgboost-baseline-multilabel-classification
*  https://www.kaggle.com/kushal1506/moa-pytorch-feature-engineering-0-01846


I would be grateful for any corrections, suggestions or discussion :)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import math

from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.metrics import log_loss

import gc
import warnings
warnings.simplefilter('ignore')

In [None]:
trainF = pd.read_csv('../input/lish-moa/train_features.csv')
test  = pd.read_csv('../input/lish-moa/test_features.csv')
trainTs = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
trainTn = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')
sub = pd.read_csv('../input/lish-moa/sample_submission.csv')

In [None]:
display(trainF.describe())
display(trainTs.describe())

In [None]:
display(trainF.head())
display(trainTs.sample(6))
display(test.head())
sub.head()

In [None]:
print('total missing values in dataset = ', trainF.isna().sum().sum())
#categorical features
cat_feat = trainF.columns[trainF.dtypes == 'object'].tolist()
print(cat_feat)
train = trainF.merge(trainTs, on= 'sig_id')
train.shape, trainTs.shape

## Analysing cp- features

In [None]:
target_cols = [col for col in trainTs.columns if col != 'sig_id']
c_feats = ['cp_type', 'cp_time', 'cp_dose']
for feat in c_feats:
    col = target_cols + [feat]
    # total number of positive MoA labels for each value of the feature
    c_sumTs = train[col].groupby(feat).sum().sum(axis=1)
    sns.barplot(x=c_sumTs.index, y=c_sumTs.values)
    plt.show()

In [None]:
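# total number of positive MoA labels per cp_type (control samples should have none)
train[target_cols + ['cp_type']].groupby('cp_type').sum().sum(axis=1)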

In [None]:
def cat2num(df):
    # Assign whole columns so that the mapped values get a numeric dtype
    df['cp_time'] = df['cp_time'].map({24: 0, 48: 1, 72: 2})
    df['cp_type'] = df['cp_type'].map({'trt_cp': 0, 'ctl_vehicle': 1})
    df['cp_dose'] = df['cp_dose'].map({'D1': 0, 'D2': 1})
    return df
train = cat2num(train)
test = cat2num(test)

print('Number of different labels:', len(target_cols))
num_feat = [x for x in train.columns if x not in trainTs]

## Feature engineering

In [None]:
df = pd.concat([train[num_feat], test[num_feat]], axis= 0)

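# select gene expression (g-) and cell viability (c-) columns by name prefix
features_g = [c for c in train.columns if c.startswith('g-')]
features_c = [c for c in train.columns if c.startswith('c-')]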
gc_fe = ['g_sum', 'g_mean', 'g_std', 'g_kurt', 'g_skew', 'c_sum', 'c_mean', 'c_std', 
         'c_kurt', 'c_skew', 'gc_sum', 'gc_mean', 'gc_std', 'gc_kurt', 'gc_skew']

df['g_sum'] = df[features_g].sum(axis = 1)
df['g_mean'] = df[features_g].mean(axis = 1)
df['g_std'] = df[features_g].std(axis = 1)
df['g_kurt'] = df[features_g].kurtosis(axis = 1)
df['g_skew'] = df[features_g].skew(axis = 1)
df['c_sum'] = df[features_c].sum(axis = 1)
df['c_mean'] = df[features_c].mean(axis = 1)
df['c_std'] = df[features_c].std(axis = 1)
df['c_kurt'] = df[features_c].kurtosis(axis = 1)
df['c_skew'] = df[features_c].skew(axis = 1)
df['gc_sum'] = df[features_g + features_c].sum(axis = 1)
df['gc_mean'] = df[features_g + features_c].mean(axis = 1)
df['gc_std'] = df[features_g + features_c].std(axis = 1)
df['gc_kurt'] = df[features_g + features_c].kurtosis(axis = 1)
df['gc_skew'] = df[features_g + features_c].skew(axis = 1)

train[gc_fe] = df[gc_fe].iloc[:train.shape[0],:]
test[gc_fe] = df[gc_fe].iloc[train.shape[0]:, :]
num_feat = num_feat + gc_fe

## XGBClassifier

In [None]:
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

In [None]:
params = {'colsample_bytree': 0.6522,
          'gamma': 3.6975,
          'learning_rate': 0.0503,
          'max_delta_step': 2.0706,
          'max_depth': 10,
          'min_child_weight': 31.5800,
          'n_estimators': 166,
          'subsample': 0.8639,
          'verbosity':0
         }

clf = MultiOutputClassifier(XGBClassifier(**params, tree_method='gpu_hist'))

NFOLDS = 5
X = train[num_feat].values 
y = train[target_cols].values
X_test = test[num_feat].values

In [None]:
oof_preds = np.zeros(y.shape)
test_preds = np.zeros((test.shape[0], y.shape[1]))
oof_losses = []
kf = KFold(n_splits=NFOLDS)
for fn, (trn_idx, val_idx) in enumerate(kf.split(X, y)):
    print('Starting fold: ', fn)
    X_train, X_val = X[trn_idx], X[val_idx]
    y_train, y_val = y[trn_idx], y[val_idx]
    
    # drop where cp_type==ctl_vehicle (baseline)
    ctl_mask = X_train[:,0]== 1 #'ctl_vehicle'
    X_train = X_train[~ctl_mask,:]
    y_train = y_train[~ctl_mask,:]
    
    clf.fit(X_train, y_train)
    val_preds = clf.predict_proba(X_val) # list of preds per class
    val_preds = np.array(val_preds)[:,:,1].T # take the positive class
    oof_preds[val_idx] = val_preds
    
    loss = log_loss(np.ravel(y_val), np.ravel(val_preds))
    print(f'fold {fn} loss {loss}')
    oof_losses.append(loss)
    preds = clf.predict_proba(X_test)
    preds = np.array(preds)[:,:,1].T # take the positive class
    test_preds += preds / NFOLDS
    
print('Mean OOF loss across folds', np.mean(oof_losses))
print('STD OOF loss across folds', np.std(oof_losses))
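
`MultiOutputClassifier.predict_proba` returns a list with one `(n_samples, 2)` array per target, which is why the loop above reshapes it with `np.array(...)[:, :, 1].T`. The following sketch illustrates that reshaping step on made-up probabilities; no model is trained here, and the numbers are invented for illustration.

```python
import numpy as np

# Stand-in for clf.predict_proba(X_val): 3 targets, 2 samples each.
# Column 0 is P(label = 0), column 1 is P(label = 1).
proba_list = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),   # target 0
    np.array([[0.7, 0.3], [0.6, 0.4]]),   # target 1
    np.array([[0.5, 0.5], [0.1, 0.9]]),   # target 2
]

stacked = np.array(proba_list)      # shape (n_targets, n_samples, 2)
positive = stacked[:, :, 1].T       # shape (n_samples, n_targets)
print(positive.shape)  # (2, 3)
print(positive)
```

The reshaping assumes that every per-target array has exactly two columns, so that the list stacks into a regular 3-D array.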

In [None]:
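# iloc needs a plain boolean array, not a boolean Series
control_mask = (test['cp_type'] == 1).values
sub.iloc[:, 1:] = test_preds
# control perturbations have no MoAs, so set their predictions to 0
sub.iloc[control_mask, 1:] = 0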
sub.to_csv('submission.csv', index = False)

In [None]:
sub.head()