<div class = "alert alert-block alert-info">
    <h1><font color = "red">DISCLAIMER</font></h1>
    <p>This notebook is heavily based on the works <a href = "https://www.kaggle.com/optimo/tabnetregressor-2-0-train-infer">TabNetRegressor 2.0 [TRAIN + INFER]</a>, <a href = "https://www.kaggle.com/liuhdme/moa-competition/data">MOA competition</a> and <a href = "https://www.kaggle.com/kushal1506/moa-pytorch-0-01859-rankgauss-pca-nn/data?select=train_targets_scored.csv">
MoA | Pytorch | 0.01859 | RankGauss | PCA | NN</a>; please check them out. I should add that I am not sharing this notebook for "upvotes" but for feedback.</p>
</div>

# <font color = "seagreen">Preamble</font>

I made this notebook to share some experiments (see the "Experiments" section) that could help anyone who doesn't want to waste their daily submissions and, more importantly, to get feedback on what I could change to achieve a better CV. Moreover, how easily TabNet overfits the data is troubling. In the "Conclusion" section I share my opinion about the fine-tuning process of TabNet.

## <font color = "green">Installing Libraries</font>

In [1]:
# # TabNet
# !pip install --no-index --find-links /kaggle/input/pytorchtabnet/pytorch_tabnet-2.0.0-py3-none-any.whl pytorch-tabnet
# # Iterative Stratification
# !pip install /kaggle/input/iterative-stratification/iterative-stratification-master/

## <font color = "green">Loading Libraries</font>

In [2]:
### General ###
import os
import sys
import copy
import tqdm
import pickle
import random
import warnings
warnings.filterwarnings("ignore")
sys.path.append("../input/rank-gauss")
os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

### Data Wrangling ###
import numpy as np
import pandas as pd
from scipy import stats
from gauss_rank_scaler import GaussRankScaler

### Data Visualization ###
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

### Machine Learning ###
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_selection import VarianceThreshold
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

### Deep Learning ###
import torch
from torch import nn
import torch.optim as optim
from torch.nn import functional as F
from torch.nn.modules.loss import _WeightedLoss
from torch.utils.data import DataLoader, Dataset
from torch.optim.lr_scheduler import ReduceLROnPlateau
# Tabnet 
from pytorch_tabnet.metrics import Metric
from pytorch_tabnet.tab_model import TabNetRegressor

### Prettier prints ###
from colorama import Fore
c_ = Fore.CYAN
m_ = Fore.MAGENTA
r_ = Fore.RED
b_ = Fore.BLUE
y_ = Fore.YELLOW
g_ = Fore.GREEN

## <font color = "green">Reproducibility</font>

In [3]:
seed = 42

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
set_seed(seed)

## <font color = "green">Configuration</font>

In [4]:
# Parameters
data_path = "../input/lish-moa/"
no_ctl = True
scale = "rankgauss"
variance_threshould = 0.7
decompo = "PCA"
ncompo_genes = 80
ncompo_cells = 10
encoding = "dummy"

## <font color = "green">Loading the Data</font>

In [5]:
train = pd.read_csv(data_path + "train_features.csv")
#train.drop(columns = ["sig_id"], inplace = True)

targets = pd.read_csv(data_path + "train_targets_scored.csv")
#train_targets_scored.drop(columns = ["sig_id"], inplace = True)

#train_targets_nonscored = pd.read_csv(data_path + "train_targets_nonscored.csv")

test = pd.read_csv(data_path + "test_features.csv")
#test.drop(columns = ["sig_id"], inplace = True)

submission = pd.read_csv(data_path + "sample_submission.csv")

# <font color = "seagreen">Preprocessing and Feature Engineering</font>

In [6]:
if no_ctl:
    # cp_type == ctl_vehicle
    print(b_, "not_ctl")
    train = train[train["cp_type"] != "ctl_vehicle"]
    test = test[test["cp_type"] != "ctl_vehicle"]
    targets = targets.iloc[train.index]
    train.reset_index(drop = True, inplace = True)
    test.reset_index(drop = True, inplace = True)
    targets.reset_index(drop = True, inplace = True)

not_ctl


## <font color = "green">Distributions Before Rank Gauss and PCA</font>

In [7]:
def distributions(num, graphs, items, features, gorc):
    """
    Plot the distributions of gene expression or cell viability data
    """
    for i in range(0, num - 1, 7):
        if i >= 3:
            break
        idxs = list(np.array([0, 1, 2, 3, 4, 5, 6]) + i)
    
        fig, axs = plt.subplots(1, 7, sharey = True)
        for k, item in enumerate(idxs):
            if item >= items:
                break
            graph = sns.distplot(train[features].values[:, item], ax = axs[k])
            graph.set_title(f"{gorc}-{item}")
            graphs.append(graph)

In [8]:
GENES = [col for col in train.columns if col.startswith("g-")]
CELLS = [col for col in train.columns if col.startswith("c-")]

### <font color = "green">Distributions of the Train Set</font>

## <font color = "green">Rank Gauss Process</font>
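
The cells in this section filter out low-variance features and then apply the chosen scaling. For the default `scale = "rankgauss"`, the idea behind the transform can be sketched as follows. This is only a minimal illustration, not the `GaussRankScaler` implementation imported above: the function name `rank_gauss_sketch`, the `(rank + 0.5) / n` mapping and the toy values are my assumptions, and the real scaler may handle ties, interpolation and clipping differently.

```python
from statistics import NormalDist

import numpy as np


def rank_gauss_sketch(values):
    """Map a 1-D array to approximately normal scores via its ranks.

    Hypothetical helper for illustration only; GaussRankScaler may differ.
    """
    values = np.asarray(values, dtype=float)
    # Rank of each element (0 = smallest); ties are broken by position.
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    # Move the ranks into the open interval (0, 1) so the inverse CDF stays finite.
    quantiles = (ranks + 0.5) / len(values)
    # The inverse CDF of the standard normal turns uniform quantiles into Gaussian scores.
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])


skewed = np.array([10.0, 0.1, 3.0, 1000.0, 7.0])  # made-up, heavily skewed feature
scores = rank_gauss_sketch(skewed)
print(scores)
```

Only the ordering of the values matters, so the outlier `1000.0` ends up at a moderate positive score instead of dominating the feature's scale.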

In [9]:
import pickle

data_all = pd.concat([train, test], ignore_index = True)
cols_numeric = [feat for feat in list(data_all.columns) if feat not in ["sig_id", "cp_type", "cp_time", "cp_dose"]]
mask = (data_all[cols_numeric].var() >= variance_threshould).values

with open('mask.pickle', 'wb') as file:
    pickle.dump(mask, file)



tmp = data_all[cols_numeric].loc[:, mask]
data_all = pd.concat([data_all[["sig_id", "cp_type", "cp_time", "cp_dose"]], tmp], axis = 1)
cols_numeric = [feat for feat in list(data_all.columns) if feat not in ["sig_id", "cp_type", "cp_time", "cp_dose"]]

In [10]:
def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())

def scale_norm(col):
    return (col - col.mean()) / col.std()

if scale == "boxcox":
    print(b_, "boxcox")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax, axis = 0)
    trans = []
    for feat in cols_numeric:
        trans_var, lambda_var = stats.boxcox(data_all[feat].dropna() + 1)
        trans.append(scale_minmax(trans_var))
    data_all[cols_numeric] = np.asarray(trans).T
    
elif scale == "norm":
    print(b_, "norm")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_norm, axis = 0)
    
elif scale == "minmax":
    print(b_, "minmax")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax, axis = 0)
    
elif scale == "rankgauss":
    ### Rank Gauss ###
    print(b_, "Rank Gauss")
    guassscaler = GaussRankScaler().fit(data_all[cols_numeric])
    
    with open('guassscaler.pickle', 'wb') as file:
        pickle.dump(guassscaler, file)

    
    data_all[cols_numeric] = guassscaler.transform(data_all[cols_numeric])

#     data_all[cols_numeric] = scaler.fit_transform(data_all[cols_numeric])
    
else:
    pass

Rank Gauss


## <font color = "green">Principal Component Analysis</font>
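
The cell below fits scikit-learn's `PCA` separately on the gene and cell columns and appends the resulting components to the original features. To make explicit what that projection does, here is a minimal NumPy sketch based on an SVD. The helper name `pca_sketch` and the random toy matrix are assumptions for illustration; scikit-learn's `PCA` can differ in component signs and in the solver it picks.

```python
import numpy as np


def pca_sketch(X, n_components):
    """Project X onto its top principal components using an SVD.

    Hypothetical helper for illustration; not sklearn's implementation.
    """
    X = np.asarray(X, dtype=float)
    # PCA works on mean-centred columns.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured by each kept component.
    explained_variance = singular_values[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, explained_variance


rng = np.random.default_rng(42)
toy = rng.normal(size=(100, 6))  # made-up stand-in for a block of gene features
projected, variance = pca_sketch(toy, n_components=2)
# As in the notebook, the components are appended to the original columns.
augmented = np.hstack([toy, projected])
print(augmented.shape)
```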

In [11]:
# PCA
if decompo == "PCA":
    print(b_, "PCA")
    GENES = [col for col in data_all.columns if col.startswith("g-")]
    CELLS = [col for col in data_all.columns if col.startswith("c-")]
    
#     pca_genes = PCA(n_components = ncompo_genes,
#                     random_state = seed).fit_transform(data_all[GENES])
    gpca = PCA(n_components = ncompo_genes,
                    random_state = seed).fit(data_all[GENES])
    pca_genes = gpca.transform(data_all[GENES])
                
    
    with open('gpca.pickle', 'wb') as file:
        pickle.dump(gpca, file)


#     pca_cells = PCA(n_components = ncompo_cells,
#                     random_state = seed).fit_transform(data_all[CELLS])
    cpca = PCA(n_components = ncompo_cells,
                    random_state = seed).fit(data_all[CELLS])
    pca_cells = cpca.transform(data_all[CELLS])
        
    
    with open('cpca.pickle', 'wb') as file:
        pickle.dump(cpca, file)


    
    pca_genes = pd.DataFrame(pca_genes, columns = [f"pca_g-{i}" for i in range(ncompo_genes)])
    pca_cells = pd.DataFrame(pca_cells, columns = [f"pca_c-{i}" for i in range(ncompo_cells)])
    data_all = pd.concat([data_all, pca_genes, pca_cells], axis = 1)
else:
    pass

PCA


## <font color = "green">One Hot</font>

In [12]:
# Encoding
if encoding == "lb":
    print(b_, "Label Encoding")
    for feat in ["cp_time", "cp_dose"]:
        data_all[feat] = LabelEncoder().fit_transform(data_all[feat])
elif encoding == "dummy":
    print(b_, "One-Hot")
    data_all = pd.get_dummies(data_all, columns = ["cp_time", "cp_dose"])

One-Hot


In [13]:
GENES = [col for col in data_all.columns if col.startswith("g-")]
CELLS = [col for col in data_all.columns if col.startswith("c-")]

# Loop variable is named `stat` so it does not shadow the imported scipy `stats` module
for stat in tqdm.tqdm(["sum", "mean", "std", "kurt", "skew"]):
    data_all["g_" + stat] = getattr(data_all[GENES], stat)(axis = 1)
    data_all["c_" + stat] = getattr(data_all[CELLS], stat)(axis = 1)
    data_all["gc_" + stat] = getattr(data_all[GENES + CELLS], stat)(axis = 1)

100%|██████████| 5/5 [00:05<00:00,  1.13s/it]


## <font color = "green">Distributions After Rank Gauss and PCA</font>

In [14]:
def distributions(num, graphs, items, features, gorc):
    """
    Plot the distributions of gene expression or cell viability data
    """
    for i in range(0, num - 1, 7):
        if i >= 3:
            break
        idxs = list(np.array([0, 1, 2, 3, 4, 5, 6]) + i)
    
        fig, axs = plt.subplots(1, 7, sharey = True)
        for k, item in enumerate(idxs):
            if item >= items:
                break
            graph = sns.distplot(data_all[features].values[:, item], ax = axs[k])
            graph.set_title(f"{gorc}-{item}")
            graphs.append(graph)

In [15]:
train_df_ID = data_all.sig_id[: train.shape[0]]
train_df_ID

0        id_000644bb2
1        id_000779bfc
2        id_000a6266a
3        id_0015fd391
4        id_001626bd3
             ...     
21943    id_fff8c2444
21944    id_fffb1ceed
21945    id_fffb70c0c
21946    id_fffcb9e7c
21947    id_ffffdd77b
Name: sig_id, Length: 21948, dtype: object

In [16]:
# train_df and test_df
train_df_ID = data_all.sig_id[: train.shape[0]]
features_to_drop = ["sig_id", "cp_type"]
data_all.drop(features_to_drop, axis = 1, inplace = True)
# Drop sig_id from the targets if it is still present
targets.drop(columns = ["sig_id"], errors = "ignore", inplace = True)
train_df = data_all[: train.shape[0]]
train_df.reset_index(drop = True, inplace = True)
# Concatenating the targets onto the training features is bad practice in my opinion
#train_df = pd.concat([train_df, targets], axis = 1)
test_df = data_all[train_df.shape[0]: ]
test_df.reset_index(drop = True, inplace = True)

In [17]:
print(f"{b_}train_df.shape: {r_}{train_df.shape}")
print(f"{b_}test_df.shape: {r_}{test_df.shape}")

train_df.shape: (21948, 947)
test_df.shape: (3624, 947)


In [18]:
X_test = test_df.values
print(f"{b_}X_test.shape: {r_}{X_test.shape}")

X_test.shape: (3624, 947)


# <font color = "seagreen">Experiments</font>

I just want to point out that the [original work](https://www.kaggle.com/optimo/tabnetregressor-2-0-train-infer) achieves a CV of 0.015532370835690834 and an LB score of 0.01864. Some of the experiments I ran, with their changes:


- CV: 0.01543560538566987, LB: 0.01858, best LB that I could achieve, changes:
    - `n_a` = 32 instead of 24
    - `n_d` = 32 instead of 24
- CV: 0.015282077428722094, LB: 0.01862, best CV that I could achieve, changes (Version 5):
    - `n_a` = 32 instead of 24
    - `n_d` = 32 instead of 24
    - `virtual_batch_size` = 32, instead of 128
    - `seed` = 42 instead of 0
- CV: 0.015330138325308062, LB: 0.01864, the same LB as the original but a better CV, changes:
    - `n_a` = 32 instead of 24
    - `n_d` = 32 instead of 24
    - `virtual_batch_size` = 64, instead of 128
    - `batch_size` = 512, instead of 1024
- CV: 0.015361751699863063, LB: 0.01863, better LB and CV than the original, changes:
    - `n_a` = 32 instead of 24
    - `n_d` = 32 instead of 24
    - `virtual_batch_size` = 64, instead of 128
- CV: 0.015529925324634975, LB: 0.01865, changes:
    - `n_a` = 48 instead of 24
    - `n_d` = 48 instead of 24
- CV: 0.015528553520924939, LB: 0.01868, changes:
    - `n_a` = 12 instead of 24
    - `n_d` = 12 instead of 24
- CV: 0.015870202970324317, LB: 0.01876, worst CV and LB score, changes:
    - `n_a` = 12 instead of 24
    - `n_d` = 12 instead of 24
    - `batch_size` = 2048, instead of 1024
    
    
As you can see, moving `batch_size` away from 1024 in either direction (512 or 2048) gave worse LB scores. Something similar happens with `n_a` and `n_d`: values lower (12) or higher (48) than 32 give worse results.
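
To compare the runs above side by side, they can be collected into a small table and sorted by either score. The sketch below is only a convenience and not part of the original pipeline: the shortened labels (`vbs` for `virtual_batch_size`, `bs` for `batch_size`) are my own abbreviations, and the scores are copied from the list above.

```python
# (label, CV, LB) copied from the experiment list; labels are abbreviations.
# The LB of the bs=512 run is taken as 0.01864, matching the original work.
experiments = [
    ("original work", 0.015532370835690834, 0.01864),
    ("n_a=n_d=32", 0.01543560538566987, 0.01858),
    ("n_a=n_d=32, vbs=32, seed=42", 0.015282077428722094, 0.01862),
    ("n_a=n_d=32, vbs=64, bs=512", 0.015330138325308062, 0.01864),
    ("n_a=n_d=32, vbs=64", 0.015361751699863063, 0.01863),
    ("n_a=n_d=48", 0.015529925324634975, 0.01865),
    ("n_a=n_d=12", 0.015528553520924939, 0.01868),
    ("n_a=n_d=12, bs=2048", 0.015870202970324317, 0.01876),
]

# Lower is better for both scores.
by_lb = sorted(experiments, key=lambda row: row[2])
by_cv = sorted(experiments, key=lambda row: row[1])

for label, cv, lb in by_lb:
    print(f"{label:<30} CV={cv:.6f}  LB={lb:.5f}")

# Labels at positions where the CV ranking and the LB ranking disagree.
mismatches = [a[0] for a, b in zip(by_cv, by_lb) if a[0] != b[0]]
print("rank mismatches:", len(mismatches))
```

Sorting both ways makes it easy to see that the run with the best CV is not the run with the best LB, which is worth keeping in mind before spending a submission.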


## <font color = "green">Versions</font>

- **Version 5**: I added the `seed` parameter to the TabNet model.
- **Version 6**: I changed the `virtual_batch_size` to 24
    - CV: 0.01532900616425282, LB: 0.01862, changes:
        - `n_a` = 32 instead of 24
        - `n_d` = 32 instead of 24
        - `virtual_batch_size` = 24, instead of 128
        - `seed` = 42 instead of 0
- **Version 7**: PCA, Rank Gauss

# <font color = "seagreen">Modeling</font>

## <font color = "green">Model Parameters</font>

In [19]:
MAX_EPOCH = 200
# n_d and n_a are different from the original work, 32 instead of 24
# This is the first change in the code from the original
tabnet_params = dict(
    n_d = 32,
    n_a = 32,
    n_steps = 1,
    gamma = 1.3,
    lambda_sparse = 0,
    optimizer_fn = optim.Adam,
    optimizer_params = dict(lr = 2e-2, weight_decay = 1e-5),
    mask_type = "entmax",
    scheduler_params = dict(
        mode = "min", patience = 5, min_lr = 1e-5, factor = 0.9),
    scheduler_fn = ReduceLROnPlateau,
    seed = seed,
    verbose = 10
)

## <font color = "green">Custom Metric</font>

In [20]:
class LogitsLogLoss(Metric):
    """
    LogLoss with sigmoid applied
    """

    def __init__(self):
        self._name = "logits_ll"
        self._maximize = False

    def __call__(self, y_true, y_pred):
        """
        Compute LogLoss of predictions.

        Parameters
        ----------
        y_true: np.ndarray
            Target matrix or vector
        y_pred: np.ndarray
            Raw model outputs (logits), matrix or vector

        Returns
        -------
            float
            LogLoss of predictions vs targets.
        """
        # The sigmoid turns the raw outputs into probabilities
        probs = 1 / (1 + np.exp(-y_pred))
        aux = (1 - y_true) * np.log(1 - probs + 1e-15) + y_true * np.log(probs + 1e-15)
        return np.mean(-aux)
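
As a sanity check on the metric above, the same formula can be written as a plain function and compared with an algebraically equivalent form that avoids taking the log of a sigmoid output. This is a minimal sketch: the function names and the random toy data are assumptions for illustration, and the stable variant is an alternative I am suggesting, not what the notebook uses.

```python
import numpy as np


def logits_log_loss(y_true, y_pred):
    """Same formula as LogitsLogLoss.__call__, written as a plain function."""
    probs = 1 / (1 + np.exp(-y_pred))
    aux = (1 - y_true) * np.log(1 - probs + 1e-15) + y_true * np.log(probs + 1e-15)
    return np.mean(-aux)


def stable_logits_log_loss(y_true, y_pred):
    """Equivalent form that never takes the log of a sigmoid output.

    Uses -log(sigmoid(z)) = logaddexp(0, -z) and
    -log(1 - sigmoid(z)) = logaddexp(0, z).
    """
    loss = y_true * np.logaddexp(0, -y_pred) + (1 - y_true) * np.logaddexp(0, y_pred)
    return np.mean(loss)


rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(8, 3))  # made-up raw model outputs
toy_targets = (rng.random(size=(8, 3)) < 0.2).astype(float)  # made-up binary labels

naive_value = logits_log_loss(toy_targets, toy_logits)
stable_value = stable_logits_log_loss(toy_targets, toy_logits)
print(naive_value, stable_value)
```

For moderate logits the two values agree up to the `1e-15` epsilon; the second form simply stays finite without needing that epsilon when the logits become very large.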

In [21]:
train_df.head()

Unnamed: 0,g-0,g-2,g-3,g-4,g-5,g-6,g-7,g-8,g-9,g-10,...,gc_mean,g_std,c_std,gc_std,g_kurt,c_kurt,gc_kurt,g_skew,c_skew,gc_skew
0,0.731433,-0.254104,-0.615315,-0.194236,-0.772257,-1.061067,0.005623,0.513989,-0.130963,1.094414,...,0.045182,0.600465,0.506976,0.602876,-0.006781,-0.25023,-0.025857,-0.010803,0.118667,-0.039037
1,0.01702,0.21382,0.027107,0.849834,0.482972,0.217779,0.363904,-0.29842,0.730029,-0.808914,...,0.045437,0.588216,0.421628,0.584621,-0.048549,0.111352,-0.096328,0.096908,-0.043717,0.003906
2,0.462792,0.981712,-0.091777,-0.039121,1.055526,0.161005,0.242395,0.05007,0.975415,-0.385579,...,-0.022895,0.643043,0.458371,0.623742,-0.36636,-0.051865,-0.285094,-0.01558,0.379548,0.011925
3,-0.531916,-0.269127,0.455184,1.595325,-0.621171,-1.50923,0.206249,-0.066407,-0.88959,-0.656089,...,-0.096822,0.722179,0.392981,0.724922,-0.877465,4.444686,-0.907331,0.108122,2.010409,0.272977
4,-0.341668,0.808567,0.604615,0.951946,-0.63095,-0.243933,-0.103294,-0.684346,0.741863,-0.175749,...,0.024536,0.765814,0.471472,0.743453,-0.353646,-0.47755,-0.243094,-0.127465,0.191181,-0.198329


# <font color = "seagreen">Training</font>

In [27]:
scores_auc_all = []
test_cv_preds = []

NB_SPLITS = 10 # 7
mskf = MultilabelStratifiedKFold(n_splits = NB_SPLITS, random_state = 0, shuffle = True)

oofID_list = []
oof_preds = []
oof_targets = []
scores = []
scores_auc = []

for fold_nb, (train_idx, val_idx) in enumerate(mskf.split(train_df, targets)):
    print(b_,"FOLDS: ", r_, fold_nb + 1)
    print(g_, '*' * 60, c_)
    
    ID_train, ID_valid = train_df_ID.values[train_idx], train_df_ID.values[val_idx]
    print(ID_valid)
    X_train, y_train = train_df.values[train_idx, :], targets.values[train_idx, :]
    X_val, y_val = train_df.values[val_idx, :], targets.values[val_idx, :]
    ### Model ###
    model = TabNetRegressor(**tabnet_params)
        
    ### Fit ###
    # Another change to the original code
    # virtual_batch_size of 32 instead of 128
    model.fit(
        X_train = X_train,
        y_train = y_train,
        eval_set = [(X_val, y_val)],
        eval_name = ["val"],
        eval_metric = ["logits_ll"],
        max_epochs = MAX_EPOCH,
        patience = 20,
        batch_size = 1024, 
        virtual_batch_size = 32,
        num_workers = 1,
        drop_last = False,
        # To use binary cross entropy because this is not a regression problem
        loss_fn = F.binary_cross_entropy_with_logits
    )
    print(y_, '-' * 60)
    
    with open(f'MODEL{fold_nb}.pkl', 'wb') as file:
        pickle.dump(model, file)
    
    ### Predict on validation ###
    preds_val = model.predict(X_val)
    # Apply sigmoid to the predictions
    preds = 1 / (1 + np.exp(-preds_val))
    score = np.min(model.history["val_logits_ll"])
    
    ### Save OOF for CV ###
    oof_preds.append(preds_val)
    oof_targets.append(y_val)
    
    oofID_list.append(ID_valid)
    scores.append(score)
    
    ### Predict on test ###
    preds_test = model.predict(X_test)
    test_cv_preds.append(1 / (1 + np.exp(-preds_test)))

oof_preds_all = np.concatenate(oof_preds)
oof_targets_all = np.concatenate(oof_targets)
test_preds_all = np.stack(test_cv_preds)

FOLDS: 1
************************************************************
['id_002452c7e' 'id_003603254' 'id_003fdd734' ... 'id_ffa08c24c'
 'id_ffab8a71d' 'id_fff183968']
Device used : cuda
epoch 0  | loss: 0.29866 | val_logits_ll: 0.02837 |  0:00:02s
epoch 10 | loss: 0.01854 | val_logits_ll: 0.02076 |  0:00:25s
epoch 20 | loss: 0.01731 | val_logits_ll: 0.01741 |  0:00:48s
epoch 30 | loss: 0.01687 | val_logits_ll: 0.01813 |  0:01:11s
epoch 40 | loss: 0.01674 | val_logits_ll: 0.01709 |  0:01:41s
epoch 50 | loss: 0.0164  | val_logits_ll: 0.01676 |  0:02:11s
epoch 60 | loss: 0.01618 | val_logits_ll: 0.01674 |  0:02:41s
epoch 70 | loss: 0.0157  | val_logits_ll: 0.01683 |  0:03:05s
epoch 80 | loss: 0.01558 | val_logits_ll: 0.01672 |  0:03:35s
epoch 90 | loss: 0.01517 | val_logits_ll: 0.01666 |  0:03:57s

Early stopping occured at epoch 97 with best_epoch = 77 and best_val_logits_ll = 0.01653
Best weights from best epoch are automatically used!
--------------------

Device used : cuda
epoch 0  | loss: 0.29638 | val_logits_ll: 0.02877 |  0:00:08s
epoch 10 | loss: 0.01875 | val_logits_ll: 0.01953 |  0:01:21s
epoch 20 | loss: 0.01739 | val_logits_ll: 0.01992 |  0:02:25s
epoch 30 | loss: 0.01687 | val_logits_ll: 0.01753 |  0:03:35s
epoch 40 | loss: 0.01663 | val_logits_ll: 0.01704 |  0:04:45s
epoch 50 | loss: 0.01632 | val_logits_ll: 0.017   |  0:05:54s
epoch 60 | loss: 0.01615 | val_logits_ll: 0.01694 |  0:06:57s
epoch 70 | loss: 0.01588 | val_logits_ll: 0.01669 |  0:07:52s

Early stopping occured at epoch 77 with best_epoch = 57 and best_val_logits_ll = 0.01659
Best weights from best epoch are automatically used!
------------------------------------------------------------
FOLDS: 9
************************************************************
['id_00505b3c8' 'id_0062bfc63' 'id_0092c905e' ... 'id_ffd241f1c'
 'id_ffdd24c81' 'id_fff26b3c2']
Device used : cuda
epoch 0  | loss: 0.29643 | val_logits_ll: 0.02784 |  0:00:02s
ep

In [29]:
oofID_all = np.concatenate(oofID_list)

In [67]:
train_df.head()

Unnamed: 0,g-0,g-2,g-3,g-4,g-5,g-6,g-7,g-8,g-9,g-10,...,gc_mean,g_std,c_std,gc_std,g_kurt,c_kurt,gc_kurt,g_skew,c_skew,gc_skew
0,0.731433,-0.254104,-0.615315,-0.194236,-0.772257,-1.061067,0.005623,0.513989,-0.130963,1.094414,...,0.045182,0.600465,0.506976,0.602876,-0.006781,-0.25023,-0.025857,-0.010803,0.118667,-0.039037
1,0.01702,0.21382,0.027107,0.849834,0.482972,0.217779,0.363904,-0.29842,0.730029,-0.808914,...,0.045437,0.588216,0.421628,0.584621,-0.048549,0.111352,-0.096328,0.096908,-0.043717,0.003906
2,0.462792,0.981712,-0.091777,-0.039121,1.055526,0.161005,0.242395,0.05007,0.975415,-0.385579,...,-0.022895,0.643043,0.458371,0.623742,-0.36636,-0.051865,-0.285094,-0.01558,0.379548,0.011925
3,-0.531916,-0.269127,0.455184,1.595325,-0.621171,-1.50923,0.206249,-0.066407,-0.88959,-0.656089,...,-0.096822,0.722179,0.392981,0.724922,-0.877465,4.444686,-0.907331,0.108122,2.010409,0.272977
4,-0.341668,0.808567,0.604615,0.951946,-0.63095,-0.243933,-0.103294,-0.684346,0.741863,-0.175749,...,0.024536,0.765814,0.471472,0.743453,-0.353646,-0.47755,-0.243094,-0.127465,0.191181,-0.198329


In [68]:
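score = 0
# Number of target columns, taken from the OOF arrays so the cell
# does not rely on variables defined in later cells
n_targets = oof_targets_all.shape[1]

for i in range(n_targets):
    # oof_preds_all holds raw logits, so apply the sigmoid first
    score += log_loss(oof_targets_all[:, i], 1 / (1 + np.exp(-oof_preds_all[:, i])))

print("CV log_loss: ", score / n_targets)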



CV log_loss:  0.016504302204819635


In [69]:
oof_targets_all.shape,oof_preds_all.shape

((21948, 206), (21948, 206))

In [70]:
valid_results = targets.copy()
target_cols = targets.columns.tolist()
for i in range(len(target_cols)):
    valid_results[target_cols[i]] = 1 / (1 + np.exp(-oof_preds_all[:, i]))  
    
valid_results['sig_id'] = oofID_all

print('oof shape:',valid_results.shape)



oof shape: (21948, 207)


In [71]:
valid_results.head()

Unnamed: 0,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,adrenergic_receptor_agonist,...,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor,sig_id
0,0.000977,0.00204,0.001323,0.002686,0.003138,0.002175,0.00315,0.004832,0.001245,0.004631,...,0.001717,0.003366,0.002477,0.009945,0.000854,0.017642,0.001399,0.000278,0.00192,id_002452c7e
1,0.000175,0.000614,0.00041,0.000277,0.000412,0.000781,0.000549,0.001725,0.000124,0.000557,...,0.001446,0.000883,0.003982,0.015393,0.0002,0.011924,0.00031,7.8e-05,0.000533,id_003603254
2,0.001796,0.000994,0.002702,0.005273,0.003265,0.001899,0.002756,0.003805,0.000684,0.003586,...,0.002935,0.004415,0.000198,0.001102,0.001239,0.000659,0.001712,0.000126,0.00125,id_003fdd734
3,0.000925,0.000624,0.001675,0.021982,0.037327,0.00591,0.001416,0.014271,9.4e-05,0.003821,...,0.0008,0.002216,0.000214,0.001166,0.000279,0.000913,0.002574,0.003327,0.001333,id_00548fd5c
4,0.001475,0.002224,0.00171,0.008786,0.030014,0.006707,0.002217,0.004941,0.000149,0.004756,...,0.001414,0.003883,0.004979,0.001622,0.000871,0.00152,0.002003,0.000854,0.001288,id_006e27d96


In [72]:
valid_results.to_csv('oof_model2.csv', index=False)

**The worst CV value that I have achieved**

# <font color = "seagreen">Conclusion (NOT AVAILABLE UNTIL I SEE THE LB Score)</font> 

# <font color = "seagreen">Submission</font>

In [73]:
all_feat = [col for col in submission.columns if col not in ["sig_id"]]
# To obtain the same length for test_preds_all and submission
test = pd.read_csv(data_path + "test_features.csv")
sig_id = test[test["cp_type"] != "ctl_vehicle"].sig_id.reset_index(drop = True)
tmp = pd.DataFrame(test_preds_all.mean(axis = 0), columns = all_feat)
tmp["sig_id"] = sig_id

submission = pd.merge(test[["sig_id"]], tmp, on = "sig_id", how = "left")
submission.fillna(0, inplace = True)

#submission[all_feat] = tmp.mean(axis = 0)

# Set control to 0
#submission.loc[test["cp_type"] == 0, submission.columns[1:]] = 0
submission.to_csv("submission_model2.csv", index = None)
submission.head()

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
0,id_0004d9e33,0.000971,0.000999,0.002072,0.02019,0.020764,0.004569,0.003068,0.007401,0.000221,...,0.000766,0.001469,0.003622,0.000977,0.000686,0.000567,0.000458,0.002147,0.002455,0.001444
1,id_001897cda,0.000595,0.000943,0.002507,0.002778,0.001552,0.001995,0.000989,0.010378,0.000821,...,0.000821,0.001199,0.003134,0.000391,0.005183,0.000606,0.007797,0.001326,0.007307,0.002805
2,id_002429b5b,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,id_00276f245,0.000987,0.000928,0.001718,0.008641,0.016314,0.003524,0.002505,0.004519,0.000244,...,0.000625,0.001723,0.002399,0.019245,0.004015,0.00057,0.001547,0.001905,0.00035,0.001453
4,id_0027f1083,0.001567,0.001331,0.001393,0.013533,0.020962,0.005249,0.004211,0.002697,0.000401,...,0.000688,0.000779,0.002572,0.002471,0.001482,0.000602,0.00072,0.001976,0.000231,0.001751
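
The all-zero row in the output above (`id_002429b5b`) comes from the left merge followed by `fillna(0)`: control rows were dropped before prediction, so they have no match and end up with zeros. A toy sketch of that mechanism, where the ids, the column names `target_1`/`target_2` and the values are all made up for illustration:

```python
import pandas as pd

# Toy stand-ins for test_features.csv and the averaged fold predictions.
toy_test = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"],
})
toy_preds = pd.DataFrame({
    "sig_id": ["id_a", "id_c"],  # the control row id_b was never predicted
    "target_1": [0.02, 0.10],
    "target_2": [0.30, 0.01],
})

# The left merge keeps every test row; rows without predictions get NaN ...
toy_submission = pd.merge(toy_test[["sig_id"]], toy_preds, on="sig_id", how="left")
# ... which fillna(0) turns into an all-zero prediction for the control row.
toy_submission = toy_submission.fillna(0)
print(toy_submission)
```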


In [74]:
print(f"{b_}submission.shape: {r_}{submission.shape}")

submission.shape: (3982, 207)


<div class = "alert alert-block alert-info">
    <h3><font color = "red">NOTE: </font></h3>
    <p>If you want to comment, please tag me with '@' so I can answer more quickly.</p>
</div>