# Mechanisms of Action (MoA) Prediction
* [1 Introduction](#s1)

* [2 Preparations](#s2)
    - [2.1 Load libraries](#s2-1)
    - [2.2 Load data](#s2-2)
* [3 Data Overview](#s3)
    - [3.1 Train features data](#s3-1)
    - [3.2 Train targets scored data](#s3-2)
    - [3.3 Train targets nonscored data](#s3-3)
    - [3.4 Test features data](#s3-4)
    - [3.5 Sample_submission data](#s3-5)
* [4 Feature analysis](#s4)
    - [4.1 cp_type, cp_time, cp_dose features](#s4-1)
    - [4.2 Gene feature](#s4-2)
    - [4.3 Cell feature](#s4-3)
    - [4.4 Test features data](#s4-4)
* [5 Feature engineering](#s5)
    - [5.1 Correlation](#s5-1)
    - [5.2 PCA](#s5-2)
    - [5.3 Data preprocess](#s5-3)
* [6 Model](#s6)
* [Conclusion](#section-three)

<a id="s1"></a>
# 1 Introduction

This notebook is an entry for the Mechanisms of Action (MoA) Prediction competition on Kaggle. It has six sections. After this introduction, we load the libraries and the data, then give an overview of the data and analyze the features. We then move on to feature engineering and data preprocessing. Once the data is ready, we use TabNet to predict the targets.

<a id="s2"></a>
# 2 Preparations

<a id="s2-1"></a>
2.1 Load libraries

In [None]:
# TabNet
!pip install --no-index --find-links /kaggle/input/pytorchtabnet/pytorch_tabnet-2.0.0-py3-none-any.whl pytorch-tabnet
# Iterative Stratification
!pip install /kaggle/input/iterative-stratification/iterative-stratification-master/

In [None]:
### General ###
import os
import sys
import copy
import tqdm
import pickle
import random
import warnings
warnings.filterwarnings("ignore")
sys.path.append("../input/rank-gauss")
os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import matplotlib as mpl
from scipy import stats
import seaborn as sns
from scipy.stats import mode
import plotly.express as px
from collections import Counter
from gauss_rank_scaler import GaussRankScaler

### Machine Learning ###
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_selection import VarianceThreshold
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.preprocessing import LabelEncoder

### Deep Learning ###
import torch
from torch import nn
import torch.optim as optim
from torch.nn import functional as F
from torch.nn.modules.loss import _WeightedLoss
from torch.utils.data import DataLoader, Dataset
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Tabnet 
from pytorch_tabnet.metrics import Metric
from pytorch_tabnet.tab_model import TabNetRegressor


### Colors for prettier prints ###
from colorama import Fore
c_ = Fore.CYAN
m_ = Fore.MAGENTA
r_ = Fore.RED
b_ = Fore.BLUE
y_ = Fore.YELLOW
g_ = Fore.GREEN

Seed and parameters

In [None]:
seed = 42

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
set_seed(seed)

scale = "rankgauss"
#variance_threshould = 0.7
decompo = "PCA"
ncompo_genes = 150
ncompo_cells = 10
encoding = "dummy"

<a id="s2-2"></a>
2.2 Load data

In [None]:
train_features = pd.read_csv('../input/lish-moa/train_features.csv', index_col=0)
train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv', index_col=0)
test_features = pd.read_csv('../input/lish-moa/test_features.csv', index_col=0)
sample_submission = pd.read_csv('../input/lish-moa/sample_submission.csv')
train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv', index_col=0)

2.3 Functions

In [None]:
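def featuresVis(data, idx, title):
    """Plot a KDE for each column position in idx of a 2-D array."""
    plt.title(title)
    colors = "bgrcm"
    for colors_index, x in enumerate(idx):
        # np.array(DataFrame) with mixed column types is an object array,
        # so cast each selected column to float before plotting
        values = data[:, x].astype(np.float64)
        sns.kdeplot(values, color=colors[colors_index % len(colors)], label=x)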

<a id="s3"></a>
# 3 Data Overview

<a id="s3-1"></a>
3.1 Train features data

In [None]:
#A few records of data
train_features.head(5)

In [None]:
train_features.shape

In [None]:
#Statistic information
train_features.describe()

In [None]:
train_features.dtypes #data type

In [None]:
train_features.index #data index

In [None]:
# Checking the missing data
train_features=train_features.replace('null',np.nan)
train_features.isnull().any().sum()

* No missing data were found in the train_features dataset
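
A column-by-column count shows where missing values sit if the check above ever returns a non-zero number. This is only a sketch: `missing_summary` and the toy frame are assumptions made for illustration, not part of the competition data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, for columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame standing in for train_features (hypothetical values)
toy = pd.DataFrame({"g-0": [0.1, np.nan, 0.3], "c-0": [1.0, 2.0, 3.0]})
summary = missing_summary(toy)
print(summary)
```

An empty result means the same thing as the `0` returned by `isnull().any().sum()`.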

<a id="s3-2"></a>
3.2 Train targets scored data

In [None]:
#A few records of data
train_targets_scored.head(5)

In [None]:
train_targets_scored.shape

In [None]:
#Statistic information
train_targets_scored.describe()

In [None]:
train_targets_scored.dtypes #data type

In [None]:
# Checking the missing data
train_targets_scored.isnull().any().sum()

* No missing data were found in the train_targets_scored dataset

<a id="s3-3"></a>
3.3 Train targets nonscored data

In [None]:
#A few records of data
train_targets_nonscored.head(5)

In [None]:
train_targets_nonscored.shape

In [None]:
#Statistic information
train_targets_nonscored.describe()

In [None]:
train_targets_nonscored.dtypes #data type

In [None]:
train_targets_nonscored.isnull().any().sum()

* No missing data were found in the train_targets_nonscored dataset

<a id="s3-4"></a>
3.4 Test features data

In [None]:
#A few records of data
test_features.head(5)

In [None]:
test_features.shape

<a id="s3-5"></a>
3.5 Sample_submission data

In [None]:
#A few records of data
sample_submission.head(5)

In [None]:
sample_submission.shape

<a id="s4"></a>
# 4 Feature analysis

<a id="s4-1"></a>
4.1 cp_type, cp_time, cp_dose features

In [None]:
train_features['cp_type'].value_counts().plot.bar()

In [None]:
train_features['cp_time'].value_counts().plot.bar()

In [None]:
train_features['cp_dose'].value_counts().plot.bar()

In [None]:
# sunburst_df = train_features.groupby(['cp_type', 'cp_time', 'cp_dose'])['sig_id'].count().reset_index()
# sunburst_df.columns = ['cp_type', 'cp_time', 'cp_dose', 'count']

# fig = px.sunburst(
#     sunburst_df, 
#     path=[
#         'cp_type',
#         'cp_time',
#         'cp_dose' 
#     ], 
#     values='count', 
#     title='Sunburst chart for all cp_type/cp_time/cp_dose',
#     width=500,
#     height=500
# )

# fig.show()

<a id="s4-2"></a>
4.2 Gene feature

4.2.1 The features labelled g-0 to g-771 are gene features, and their values are numeric. We first plot the first 5 gene features, then 5 randomly chosen gene features.

* First 5 gene features alongside 5 random gene features

In [None]:
#first 5 gene
plt.subplot(1,2,1)
plt.grid(linestyle='-.')
plt.title("Distributions of gene")
sns.kdeplot(train_features['g-0'], color='green', label='g-0')
sns.kdeplot(train_features['g-1'], color='red', label='g-1')
sns.kdeplot(train_features['g-2'], color='skyblue', label='g-2')
sns.kdeplot(train_features['g-3'], color='blue', label='g-3')
sns.kdeplot(train_features['g-4'], color='yellow', label='g-4')
# 5 random gene
plt.subplot(1,2,2)
plt.grid(linestyle='-.')
idx = np.random.randint(4,774,5)
data=np.array(train_features)
featuresVis(data, idx, "Distributions of gene")
plt.show()
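
The random columns above are chosen by integer position, which depends on the exact column layout of the frame. As an alternative sketch, columns can be sampled by name prefix instead. `sample_columns` is a hypothetical helper, and the toy frame only mimics the `g-`/`c-` naming used in this dataset.

```python
import numpy as np
import pandas as pd


def sample_columns(df: pd.DataFrame, prefix: str, n: int, seed: int = 42) -> list:
    """Pick n distinct column names that start with `prefix`, reproducibly."""
    candidates = [str(c) for c in df.columns if str(c).startswith(prefix)]
    rng = np.random.default_rng(seed)
    chosen = rng.choice(candidates, size=n, replace=False)
    return [str(c) for c in chosen]


# Toy frame with a few gene and cell columns (hypothetical layout)
toy = pd.DataFrame(
    np.zeros((2, 7)),
    columns=["cp_time", "g-0", "g-1", "g-2", "g-3", "c-0", "c-1"],
)
picked = sample_columns(toy, "g-", 3)
print(picked)
```

The returned names can be passed straight to `train_features[name]`, so the selection keeps working if a column is dropped or reordered.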

4.2.2 Let's take a look at Gene Max, Min, Mean, Standard Deviation plots

In [None]:
plt.subplot(2,2,1)
sns.kdeplot(train_features.loc[:,'g-0':'g-771'].max(), shade=True, color='g', label='Gene Max')
plt.subplot(2,2,2)
sns.kdeplot(train_features.loc[:,'g-0':'g-771'].min(), shade=True, color='grey', label='Gene Min')
plt.subplot(2,2,3)
sns.kdeplot(train_features.loc[:,'g-0':'g-771'].mean(), shade=True, color='greenyellow', label='Mean')
plt.subplot(2,2,4)
sns.kdeplot(train_features.loc[:,'g-0':'g-771'].std(), shade=True, color='skyblue', label='Gene StD')

We find that the Gene Max and Gene Min plots are symmetrical. Comparing these plots with the gene feature distribution plots, we find that the gene features might contain abnormal data. Let's explore that.

<a id="s4-3"></a>
4.3 Cell feature

4.3.1 The features labelled c-0 to c-99 are cell features, and their values are numeric. We first plot the first 5 cell features, then 5 randomly chosen cell features.

In [None]:
plt.subplot(1,2,1)
plt.grid(linestyle='-.')
plt.title("Distributions of cell")
sns.kdeplot(train_features['c-0'], color='green', label='c-0')
sns.kdeplot(train_features['c-1'], color='red', label='c-1')
sns.kdeplot(train_features['c-2'], color='skyblue', label='c-2')
sns.kdeplot(train_features['c-3'], color='blue', label='c-3')
sns.kdeplot(train_features['c-4'], color='yellow', label='c-4')
plt.subplot(1,2,2)
idx = np.random.randint(775,875,5)
plt.grid(linestyle='-.')
featuresVis(data, idx, "Distributions of cell")

4.3.2 Let's take a look at Cell Max, Min, Mean, Standard Deviation plots

In [None]:
#Distribution of Max, Min, Mean, StD
plt.subplot(2,2,1)
sns.kdeplot(train_features.loc[:,'c-0':].max(), shade=True, color='g', label='Cell Max')
plt.subplot(2,2,2)
sns.kdeplot(train_features.loc[:,'c-0':].min(), shade=True, color='grey', label='Cell Min',bw=1)
plt.subplot(2,2,3)
sns.kdeplot(train_features.loc[:,'c-0':].mean(), shade=True, color='greenyellow', label='Mean')
plt.subplot(2,2,4)
sns.kdeplot(train_features.loc[:,'c-0':].std(), shade=True, color='skyblue', label='Cell StD')

<a id="s4-4"></a>
4.4 Test features data

4.4.1 First 5 gene features with 5 random features of gene

In [None]:
plt.subplot(1,2,1)
plt.grid(linestyle='-.')
plt.title("Distributions of gene")
sns.kdeplot(test_features['g-0'], color='green', label='g-0')
sns.kdeplot(test_features['g-1'], color='red', label='g-1')
sns.kdeplot(test_features['g-2'], color='skyblue', label='g-2')
sns.kdeplot(test_features['g-3'], color='blue', label='g-3')
sns.kdeplot(test_features['g-4'], color='yellow', label='g-4')
#random gene
idx = np.random.randint(4,774,5)
plt.subplot(1,2,2)
plt.grid(linestyle='-.')
data=np.array(test_features)
featuresVis(data, idx, "Distributions of gene")


Let's take a look at Gene Max, Min, Mean, Standard Deviation plots

In [None]:
plt.subplot(2,2,1)
sns.kdeplot(test_features.loc[:,'g-0':'g-771'].max(), shade=True, color='g', label='Gene Max')
plt.subplot(2,2,2)
sns.kdeplot(test_features.loc[:,'g-0':'g-771'].min(), shade=True, color='grey', label='Gene Min')
plt.subplot(2,2,3)
sns.kdeplot(test_features.loc[:,'g-0':'g-771'].mean(), shade=True, color='greenyellow', label='Mean')
plt.subplot(2,2,4)
sns.kdeplot(test_features.loc[:,'g-0':'g-771'].std(), shade=True, color='skyblue', label='Gene StD')

4.4.2 First 5 cell features with 5 random features of cell

In [None]:
plt.subplot(1,2,1)
plt.grid(linestyle='-.')
plt.title("Distributions of cell")
sns.kdeplot(test_features['c-0'], color='green', label='c-0')
sns.kdeplot(test_features['c-1'], color='red', label='c-1')
sns.kdeplot(test_features['c-2'], color='skyblue', label='c-2')
sns.kdeplot(test_features['c-3'], color='blue', label='c-3')
sns.kdeplot(test_features['c-4'], color='yellow', label='c-4')

plt.subplot(1,2,2)
idx = np.random.randint(775,875,5)
plt.grid(linestyle='-.')
featuresVis(data, idx, "Distributions of cell")

Let's take a look at Cell Max, Min, Mean, Standard Deviation plots

In [None]:
#Distribution of Max, Min, Mean, StD
plt.subplot(2,2,1)
sns.kdeplot(test_features.loc[:,'c-0':].max(), shade=True, color='g', label='Cell Max')
plt.subplot(2,2,2)
sns.kdeplot(test_features.loc[:,'c-0':].min(), shade=True, color='grey', label='Cell Min',bw=1)
plt.subplot(2,2,3)
sns.kdeplot(test_features.loc[:,'c-0':].mean(), shade=True, color='greenyellow', label='Mean')
plt.subplot(2,2,4)
sns.kdeplot(test_features.loc[:,'c-0':].std(), shade=True, color='skyblue', label='Cell StD')

4.4.3 cp_type, cp_time, cp_dose of test_features

In [None]:
test_features['cp_type'].value_counts().plot.bar(title="cp_type")

In [None]:
test_features['cp_time'].value_counts().plot.bar(title="cp_time")

In [None]:
test_features['cp_dose'].value_counts().plot.bar(title="cp_dose")
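
To compare the train and test bar charts numerically, the category shares can be placed side by side. The following is a sketch on toy data; `compare_proportions` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd


def compare_proportions(train: pd.Series, test: pd.Series) -> pd.DataFrame:
    """Return the share of each category in train and in test, side by side."""
    out = pd.DataFrame({
        "train": train.value_counts(normalize=True),
        "test": test.value_counts(normalize=True),
    }).fillna(0.0)  # a category absent from one split gets a share of 0
    return out.sort_index()


# Toy series standing in for train_features['cp_dose'] and test_features['cp_dose']
toy_train = pd.Series(["D1", "D1", "D2", "D2"], name="cp_dose")
toy_test = pd.Series(["D1", "D2", "D2", "D2"], name="cp_dose")
props = compare_proportions(toy_train, toy_test)
print(props)
```

Large gaps between the two columns would hint at a distribution shift between the train and test sets.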

In [None]:
# sunburst_df = test_features.groupby(['cp_type', 'cp_time', 'cp_dose'])['sig_id'].count().reset_index()
# sunburst_df.columns = ['cp_type', 'cp_time', 'cp_dose', 'count']

# fig = px.sunburst(
#     sunburst_df, 
#     path=[
#         'cp_type',
#         'cp_time',
#         'cp_dose' 
#     ], 
#     values='count', 
#     title='Sunburst chart for all cp_type/cp_time/cp_dose',
#     width=500,
#     height=500
# )

# fig.show()

# The relationship between cp_type/cp_time/cp_dose and targets

* cp_type is vehicle:

In [None]:
train_vehicle_index = train_features[train_features.loc[:,'cp_type']=='ctl_vehicle'].index
train_features[train_features.index.isin(train_vehicle_index)].head()

In [None]:
targets_vehicle_train = train_targets_scored[train_targets_scored.index.isin(train_vehicle_index)]
targets_vehicle_train.shape

In [None]:
targets_vehicle_train.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist()

We find that in train_features, whenever cp_type is ctl_vehicle, all the corresponding targets in train_targets_scored are 0.
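
The same observation can be checked directly rather than read off a histogram. This is a minimal sketch on toy frames; the ids, the `moa_1`/`moa_2` columns and the helper `control_rows_all_zero` are made up for illustration.

```python
import pandas as pd


def control_rows_all_zero(features: pd.DataFrame, targets: pd.DataFrame) -> bool:
    """True if every ctl_vehicle row has no active target."""
    control_ids = features.index[features["cp_type"] == "ctl_vehicle"]
    return bool((targets.loc[control_ids].sum(axis=1) == 0).all())


# Toy data: two treated rows and two control rows (hypothetical ids)
features = pd.DataFrame(
    {"cp_type": ["trt_cp", "ctl_vehicle", "trt_cp", "ctl_vehicle"]},
    index=["id_a", "id_b", "id_c", "id_d"],
)
targets = pd.DataFrame(
    {"moa_1": [1, 0, 0, 0], "moa_2": [0, 0, 1, 0]},
    index=features.index,
)
print(control_rows_all_zero(features, targets))
```

If the check holds, control rows carry no label information, which is the motivation for dropping them later.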

* cp_type is cp:

In [None]:
targets_cp_train = train_targets_scored[~train_targets_scored.index.isin(train_vehicle_index)]
targets_cp_train.shape

In [None]:
targets_cp_train.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist()
plt.show()

* cp-dose is D1

In [None]:
train_D1_index = train_features[train_features.loc[:,'cp_dose']=='D1'].index
train_features[train_features.index.isin(train_D1_index)].head()

In [None]:
targets_D1_train = train_targets_scored[train_targets_scored.index.isin(train_D1_index)]
targets_D1_train.shape

In [None]:
targets_D1_train.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist()

* cp-dose is D2

In [None]:
targets_D2_train = train_targets_scored[~train_targets_scored.index.isin(train_D1_index)]
targets_D2_train.shape

In [None]:
targets_D2_train.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist()

* cp-time is 24

In [None]:
train_t24_index = train_features[train_features.loc[:,'cp_time']==24].index
#train_features[train_features.index.isin(train_t24_index)].head()
targets_t24_train = train_targets_scored[train_targets_scored.index.isin(train_t24_index)]
targets_t24_train.shape

In [None]:
targets_t24_train.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist()

* cp-time is 48

In [None]:
train_t48_index = train_features[train_features.loc[:,'cp_time']==48].index
#train_features[train_features.index.isin(train_t48_index)].head()
targets_t48_train = train_targets_scored[train_targets_scored.index.isin(train_t48_index)]
targets_t48_train.shape

In [None]:
targets_t48_train.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist()

* cp-time is 72

In [None]:
train_t72_index = train_features[train_features.loc[:,'cp_time']==72].index
#train_features[train_features.index.isin(train_t72_index)].head()
targets_t72_train = train_targets_scored[train_targets_scored.index.isin(train_t72_index)]
targets_t72_train.shape

In [None]:
targets_t72_train.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist()
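
The per-group histograms above can be condensed into one table by grouping the number of active targets per sample by a treatment column. The sketch below uses toy data; `labels_per_sample_by` and the `moa_*` columns are hypothetical names.

```python
import pandas as pd


def labels_per_sample_by(features: pd.DataFrame, targets: pd.DataFrame, column: str) -> pd.Series:
    """Mean number of active targets per sample for each value of `column`."""
    n_active = targets.sum(axis=1)
    groups = features.loc[targets.index, column]
    return n_active.groupby(groups).mean()


# Toy data standing in for train_features and train_targets_scored
features = pd.DataFrame({"cp_time": [24, 24, 48, 72]}, index=list("abcd"))
targets = pd.DataFrame(
    {"moa_1": [1, 0, 1, 0], "moa_2": [1, 0, 0, 0]},
    index=list("abcd"),
)
result = labels_per_sample_by(features, targets, "cp_time")
print(result)
```

The same call with `"cp_dose"` would give the D1/D2 comparison in one line.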

* delete Vehicle:

From train_features

In [None]:
rows = DataFrame(train_features[train_features.loc[:,'cp_type']=='ctl_vehicle']).index
drop_id = train_features[train_features.index.isin(rows)].index
drop_id.shape

In [None]:
train_features.drop(rows, axis=0, inplace = True)
print(train_features.shape)
train_features.head()

In [None]:
train_vehicle_index = train_features[train_features.loc[:,'cp_type']=='ctl_vehicle'].index
train_features[train_features.index.isin(train_vehicle_index)].head()

* Delete cp_type

For train_features

In [None]:
train_features.drop(['cp_type'], axis=1, inplace = True)
print(train_features.shape)
train_features.head()

* Delete drop_id in the train_targets dataset

In [None]:
train_targets_scored[train_targets_scored.index.isin(drop_id)].shape

In [None]:
train_targets_scored.drop(train_targets_scored[train_targets_scored.index.isin(drop_id)].index, axis=0, inplace=True)
train_targets_scored.shape

For train_targets_nonscored

In [None]:
train_targets_nonscored[train_targets_nonscored.index.isin(drop_id)].shape

In [None]:
train_targets_nonscored.drop(train_targets_nonscored[train_targets_nonscored.index.isin(drop_id)].index, axis=0, inplace=True)
train_targets_nonscored.shape
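
The drops above only stay consistent if the same ids are removed from the features and from both target tables. One way to sketch this is to derive the ids to keep once and reuse them. `drop_controls` is a hypothetical helper, and it returns new frames instead of modifying them in place.

```python
import pandas as pd


def drop_controls(features: pd.DataFrame, *target_frames: pd.DataFrame):
    """Remove ctl_vehicle rows from features and the same ids from each target frame."""
    keep_ids = features.index[features["cp_type"] != "ctl_vehicle"]
    kept_features = features.loc[keep_ids].drop(columns="cp_type")
    kept_targets = [t.loc[t.index.isin(keep_ids)] for t in target_frames]
    return (kept_features, *kept_targets)


# Toy data (hypothetical ids and columns)
toy_features = pd.DataFrame(
    {"cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"], "g-0": [0.1, 0.2, 0.3]},
    index=["a", "b", "c"],
)
toy_targets = pd.DataFrame({"moa_1": [1, 0, 0]}, index=["a", "b", "c"])
f, t = drop_controls(toy_features, toy_targets)
print(f.shape, t.shape)
```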

4.5 Analyze outliers

* 4.5.1 train_features outliers analysis

Gene-Cell outliers

In [None]:
plt.subplot(1,2,1)
plt.title("tr-features Gene-Cell Outliers")
sns.kdeplot(train_features.loc[:,'g-0':].abs().sum(axis = 1)/872)  # 772 gene + 100 cell columns

plt.subplot(1,2,2)
df = DataFrame()
plt.title("tr-features Gene-Cell Outliers")
df['a'] = train_features.loc[:,'g-0':].abs().sum(axis = 1)/872
df.boxplot(column=['a'])
plt.show()

In [None]:
train_outlier = train_features[train_features.loc[:,'g-0':].abs().sum(axis = 1)/872>1]  # 772 gene + 100 cell columns
train_outlier.shape

Gene outliers

In [None]:
plt.subplot(1,2,1)
plt.title("tr_features Gene Outliers")
sns.kdeplot(train_features.loc[:,'g-0':'g-771'].abs().sum(axis = 1)/772)

plt.subplot(1,2,2)
plt.title("tr_features Gene Outliers")
df = DataFrame()
df['a'] = train_features.loc[:,'g-0':'g-771'].abs().sum(axis = 1)/772
df.boxplot(column=['a'])

In [None]:
trainG_outlier = train_features[train_features.loc[:,'g-0':'g-771'].abs().sum(axis = 1)/772>1]
trainG_outlier.shape

Cell outlier

In [None]:
plt.subplot(1,2,1)
plt.title("tr_features Cell Outliers")
sns.kdeplot(train_features.loc[:,'c-0':].abs().sum(axis = 1)/100)  # 100 cell columns, c-0 to c-99

plt.subplot(1,2,2)
plt.title("tr_features Cell Outliers")
df = DataFrame()
df['a'] = train_features.loc[:,'c-0':].abs().sum(axis = 1)/100
df.boxplot(column=['a'])

In [None]:
trainC_outlier = train_features[train_features.loc[:,'c-0':].abs().sum(axis = 1)/100>1]  # 100 cell columns, c-0 to c-99
trainC_outlier.shape

In [None]:
train_outlier_index = train_outlier.index.intersection(trainG_outlier.index).intersection(trainC_outlier.index)
train_outlier_index

In [None]:
trainFeatures_outlier = train_features[train_features.index.isin(train_outlier_index)]
trainFeatures_outlier.shape

In [None]:
trainFeatures_norm = train_features[~train_features.index.isin(train_outlier_index)]
trainFeatures_norm.shape
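
The rule used above flags a row as an outlier when its mean absolute value exceeds 1 over all gene and cell columns, over the gene columns alone, and over the cell columns alone. Here is a sketch of that rule on toy data, selecting columns by prefix so that no column count is hard-coded. `flag_outliers` and its `threshold` argument are illustrative names.

```python
import pandas as pd


def flag_outliers(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Flag rows whose mean |value| exceeds `threshold` for all three column groups."""
    gene_cols = [c for c in df.columns if str(c).startswith("g-")]
    cell_cols = [c for c in df.columns if str(c).startswith("c-")]
    all_mean = df[gene_cols + cell_cols].abs().mean(axis=1)
    gene_mean = df[gene_cols].abs().mean(axis=1)
    cell_mean = df[cell_cols].abs().mean(axis=1)
    return (all_mean > threshold) & (gene_mean > threshold) & (cell_mean > threshold)


# Toy frame: row "b" is extreme everywhere, row "c" only in one gene
toy = pd.DataFrame(
    {"g-0": [0.1, 3.0, 2.0], "g-1": [-0.2, -2.5, 0.1], "c-0": [0.3, -4.0, 0.2]},
    index=["a", "b", "c"],
)
mask = flag_outliers(toy)
print(toy[mask].index.tolist())
```

Splitting the frame with `toy[mask]` and `toy[~mask]` mirrors the outlier/norm split used in this section.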

* 4.5.2 test_features outliers analysis

Gene-Cell/ Gene/ Cell

In [None]:
#Test_features Gene-Cell Outliers
plt.subplot(1,3,1)
plt.title("test_f Gene-Cell outliers")
sns.kdeplot(test_features.loc[:,'g-0':].abs().sum(axis = 1)/872)  # 772 gene + 100 cell columns
#Test_features Gene Outliers
plt.subplot(1,3,2)
plt.title("test_f Gene outliers")
sns.kdeplot(test_features.loc[:,'g-0':'g-771'].abs().sum(axis = 1)/772)
#Test_features Cell Outliers
plt.subplot(1,3,3)
plt.title("test_f Cell outliers")
sns.kdeplot(test_features.loc[:,'c-0':].abs().sum(axis = 1)/100)  # 100 cell columns, c-0 to c-99

In [None]:
test_outlier = test_features[test_features.loc[:,'g-0':].abs().sum(axis = 1)/872>1]  # 772 gene + 100 cell columns
test_outlier.shape

In [None]:
testG_outlier = test_features[test_features.loc[:,'g-0':'g-771'].abs().sum(axis = 1)/772>1]
testG_outlier.shape

In [None]:
testC_outlier = test_features[test_features.loc[:,'c-0':].abs().sum(axis = 1)/100>1]  # 100 cell columns, c-0 to c-99
testC_outlier.shape

In [None]:
test_outlier_index = test_outlier.index.intersection(testG_outlier.index).intersection(testC_outlier.index)
test_outlier_index

In [None]:
testFeatures_outlier = test_features[test_features.index.isin(test_outlier_index)]
testFeatures_outlier.shape

In [None]:
testFeatures_norm = test_features[~test_features.index.isin(test_outlier_index)]
testFeatures_norm.shape

Feature and targets

In [None]:
train_targets_scored.shape

Value distribution of train_targets_scored:

In [None]:
train_targets_scored.loc[:,'5-alpha_reductase_inhibitor':].apply(lambda col: col.value_counts())

We find that train_targets_scored contains only the values 0 and 1.
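
Reading a value-count table with hundreds of columns is error-prone, so the same fact can be asserted in one line. A small sketch, with `is_binary` as a hypothetical helper and toy frames in place of the real targets:

```python
import numpy as np
import pandas as pd


def is_binary(df: pd.DataFrame) -> bool:
    """True if every cell of the frame is either 0 or 1."""
    return bool(np.isin(df.to_numpy(), [0, 1]).all())


# Toy target tables (hypothetical columns)
toy_ok = pd.DataFrame({"moa_1": [0, 1, 0], "moa_2": [1, 1, 0]})
toy_bad = pd.DataFrame({"moa_1": [0, 2, 0], "moa_2": [1, 1, 0]})
print(is_binary(toy_ok), is_binary(toy_bad))
```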

In [None]:
train_targets_scored.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist() 

In [None]:
train_targets_scored_norm = train_targets_scored[~train_targets_scored.index.isin(train_outlier_index)]
train_targets_scored_norm.shape

In [None]:
train_targets_scored_norm.loc[:, '5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist()
plt.suptitle("train_targets_scored_norm")

In [None]:
train_targets_scored_norm.index

In [None]:
train_targets_scored_outlier = train_targets_scored[train_targets_scored.index.isin(train_outlier_index)]
train_targets_scored_outlier.shape

In [None]:
plt.subplot(1,2,2)
train_targets_scored_outlier.loc[:,'5-alpha_reductase_inhibitor':].abs().sum(axis = 1).hist() 
plt.suptitle("train_targets_scored_outlier")
plt.show()

In [None]:
train_targets_nonscored_norm = train_targets_nonscored[~train_targets_nonscored.index.isin(train_outlier_index)]
train_targets_nonscored_norm.shape

In [None]:
train_targets_nonscored_norm.loc[:,'abc_transporter_expression_enhancer':].abs().sum(axis = 1).hist() 
plt.show()

In [None]:
train_targets_nonscored_outlier = train_targets_nonscored[train_targets_nonscored.index.isin(train_outlier_index)]
train_targets_nonscored_outlier.shape

In [None]:
train_targets_nonscored_outlier.loc[:,'abc_transporter_expression_enhancer':].abs().sum(axis = 1).hist() 
plt.show()

In [None]:
train_targets_nonscored_norm.head()

4.6 Discrete feature and continuous feature interaction

* 4.6.1 cp-type, cp-time, cp-dose of train_features

cp-type with gene

In [None]:
# #cp-type-norm is trt_cp
# idx = np.random.randint(4,775,5)
# plt.subplot(2,2,1)
# data_c_norm=np.array(trainFeatures_norm[trainFeatures_norm['cp_type']=='trt_cp'])
# featuresVis(data_c_norm, idx, "gene/trt_cp_norm")
# #cp-type-norm is ctl_vehicle
# plt.subplot(2,2,2)
# data_v_norm=np.array(trainFeatures_norm[trainFeatures_norm['cp_type']=='ctl_vehicle'])
# featuresVis(data_v_norm, idx, "gene/ctl_vehicle_norm")
# #cp-type-outlier is trt_cp
# plt.subplot(2,2,3)
# data_c_out=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_type']=='trt_cp'])
# featuresVis(data_c_out, idx, "gene/trt_cp_outlier")
# #cp-type-outlier is ctl_vehicle
# plt.subplot(2,2,4)
# data_v_out=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_type']=='ctl_vehicle'])
# featuresVis(data_v_out, idx, "gene/ctl_vehicle_outlier")

cp-type with cell

In [None]:
# #cp-type-norm is trt_cp
# idx = np.random.randint(776,876,5)
# plt.subplot(2,2,1)
# data_c_norm=np.array(trainFeatures_norm[trainFeatures_norm['cp_type']=='trt_cp'])
# featuresVis(data_c_norm, idx, "cell/trt_cp_norm")
# #cp-type-norm is ctl_vehicle
# plt.subplot(2,2,2)
# data_v_norm=np.array(trainFeatures_norm[trainFeatures_norm['cp_type']=='ctl_vehicle'])
# featuresVis(data_v_norm, idx, "cell/ctl_vehicle_norm")
# #cp-type-outlier is trt_cp
# plt.subplot(2,2,3)
# data_c_out=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_type']=='trt_cp'])
# featuresVis(data_c_out, idx, "cell/trt_cp_outlier")
# #cp-type-outlier is ctl_vehicle
# plt.subplot(2,2,4)
# data_v_out=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_type']=='ctl_vehicle'])
# featuresVis(data_v_out, idx, "cell/ctl_vehicle_outlier")

cp-time with gene

In [None]:
#cp-time-norm is 24
idx = np.random.randint(3,774,5)
plt.subplot(2,3,1)
data_t24N=np.array(trainFeatures_norm[trainFeatures_norm['cp_time']==24])
featuresVis(data_t24N, idx, "gene/ cp_t24_norm")
#cp-time-norm is 48
plt.subplot(2,3,2)
data_t48N=np.array(trainFeatures_norm[trainFeatures_norm['cp_time']==48])
featuresVis(data_t48N, idx, "gene/ cp_t48_norm")
#cp-time-norm is 72
plt.subplot(2,3,3)  
data_t72N=np.array(trainFeatures_norm[trainFeatures_norm['cp_time']==72])
featuresVis(data_t72N, idx, "gene/ cp_t72_norm")
#cp-time-outlier is 24
plt.subplot(2,3,4)
data_t24O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_time']==24])
featuresVis(data_t24O, idx, "gene/ cp_t24_outlier")
#cp-time-outlier is 48
plt.subplot(2,3,5)
data_t48O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_time']==48])
featuresVis(data_t48O, idx, "gene/ cp_t48_outlier")
#cp-time-outlier is 72
plt.subplot(2,3,6) 
data_t72O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_time']==72])
featuresVis(data_t72O, idx, "gene/ cp_t72_outlier")

cp-time with cell

In [None]:
#cp-time-norm is 24
idx = np.random.randint(774,874,5)  # cell columns sit at positions 774-873 once cp_type has been dropped
plt.subplot(2,3,1)
data_t24N=np.array(trainFeatures_norm[trainFeatures_norm['cp_time']==24])
featuresVis(data_t24N, idx, "cell/ cp_t24_norm")
#cp-time-norm is 48
plt.subplot(2,3,2)
data_t48N=np.array(trainFeatures_norm[trainFeatures_norm['cp_time']==48])
featuresVis(data_t48N, idx, "cell/ cp_t48_norm")
#cp-time-norm is 72
plt.subplot(2,3,3)  
data_t72N=np.array(trainFeatures_norm[trainFeatures_norm['cp_time']==72])
featuresVis(data_t72N, idx, "cell/ cp_t72_norm")
#cp-time-outlier is 24
plt.subplot(2,3,4)
data_t24O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_time']==24])
featuresVis(data_t24O, idx, "cell/ cp_t24_outlier")
#cp-time-outlier is 48
plt.subplot(2,3,5)
data_t48O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_time']==48])
featuresVis(data_t48O, idx, "cell/ cp_t48_outlier")
#cp-time-outlier is 72
plt.subplot(2,3,6) 
data_t72O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_time']==72])
featuresVis(data_t72O, idx, "cell/ cp_t72_outlier")

cp-dose with gene

In [None]:
idx = np.random.randint(4,774,5)  # stay within the gene columns; position 774 is c-0 once cp_type has been dropped
#D1 norm
plt.subplot(2,2,1)
data_d1N=np.array(trainFeatures_norm[trainFeatures_norm['cp_dose']=='D1'])
featuresVis(data_d1N, idx, "gene/ D1-Norm")
#D2 norm
plt.subplot(2,2,2)
data_d2N=np.array(trainFeatures_norm[trainFeatures_norm['cp_dose']=='D2'])
featuresVis(data_d2N, idx, "gene/ D2-Norm")
#D1 outlier
plt.subplot(2,2,3)
data_d1O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_dose']=='D1'])
featuresVis(data_d1O, idx, "gene/ D1-outlier")
#D2 outlier
plt.subplot(2,2,4)
data_d2O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_dose']=='D2'])
featuresVis(data_d2O, idx, "gene/ D2-outlier")

cp-dose with cell

In [None]:
idx = np.random.randint(776,872,5)
#D1 norm
plt.subplot(2,2,1)
data_d1N=np.array(trainFeatures_norm[trainFeatures_norm['cp_dose']=='D1'])
featuresVis(data_d1N, idx, "cell/ D1-Norm")
#D2 norm
plt.subplot(2,2,2)
data_d2N=np.array(trainFeatures_norm[trainFeatures_norm['cp_dose']=='D2'])
featuresVis(data_d2N, idx, "cell/ D2-Norm")
#D1 outlier
plt.subplot(2,2,3)
data_d1O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_dose']=='D1'])
featuresVis(data_d1O, idx, "cell/ D1-outlier")
#D2 outlier
plt.subplot(2,2,4)
data_d2O=np.array(trainFeatures_outlier[trainFeatures_outlier['cp_dose']=='D2'])
featuresVis(data_d2O, idx, "cell/ D2-outlier")

* 4.6.2 cp-type, cp-time, cp-dose of test_features

cp-type with gene

In [None]:
#cp-type-norm is trt_cp
idx = np.random.randint(4,774,5)
plt.subplot(2,2,1)
data_c_norm=np.array(testFeatures_norm[testFeatures_norm['cp_type']=='trt_cp'])
featuresVis(data_c_norm, idx, "trt_cp_norm")
#cp-type-norm is ctl_vehicle
plt.subplot(2,2,2)
data_v_norm=np.array(testFeatures_norm[testFeatures_norm['cp_type']=='ctl_vehicle'])
featuresVis(data_v_norm, idx, "ctl_vehicle_norm")
#cp-type-outlier is trt_cp
plt.subplot(2,2,3)
data_c_out=np.array(testFeatures_outlier[testFeatures_outlier['cp_type']=='trt_cp'])
featuresVis(data_c_out, idx, "trt_cp_outlier")
#cp-type-outlier is ctl_vehicle
plt.subplot(2,2,4)
data_v_out=np.array(testFeatures_outlier[testFeatures_outlier['cp_type']=='ctl_vehicle'])
featuresVis(data_v_out, idx, "ctl_vehicle_outlier")

cp_type with cell

In [None]:
#cp-type-norm is trt_cp
idx = np.random.randint(776,872,5)
plt.subplot(2,2,1)
data_c_norm=np.array(testFeatures_norm[testFeatures_norm['cp_type']=='trt_cp'])
featuresVis(data_c_norm, idx, "trt_cp_norm")
#cp-type-norm is ctl_vehicle
plt.subplot(2,2,2)
data_v_norm=np.array(testFeatures_norm[testFeatures_norm['cp_type']=='ctl_vehicle'])
featuresVis(data_v_norm, idx, "ctl_vehicle_norm")
#cp-type-outlier is trt_cp
plt.subplot(2,2,3)
data_c_out=np.array(testFeatures_outlier[testFeatures_outlier['cp_type']=='trt_cp'])
featuresVis(data_c_out, idx, "trt_cp_outlier")
#cp-type-outlier is ctl_vehicle
plt.subplot(2,2,4)
data_v_out=np.array(testFeatures_outlier[testFeatures_outlier['cp_type']=='ctl_vehicle'])
featuresVis(data_v_out, idx, "ctl_vehicle_outlier")

cp-time with gene

In [None]:
#cp-time-norm is 24
idx = np.random.randint(4,774,5)
plt.subplot(2,3,1)
data_t24N=np.array(testFeatures_norm[testFeatures_norm['cp_time']==24])
featuresVis(data_t24N, idx, "gene/ cp_t24_norm")
#cp-time-norm is 48
plt.subplot(2,3,2)
data_t48N=np.array(testFeatures_norm[testFeatures_norm['cp_time']==48])
featuresVis(data_t48N, idx, "gene/ cp_t48_norm")
#cp-time-norm is 72
plt.subplot(2,3,3)  
data_t72N=np.array(testFeatures_norm[testFeatures_norm['cp_time']==72])
featuresVis(data_t72N, idx, "gene/ cp_t72_norm")
#cp-time-outlier is 24
plt.subplot(2,3,4)
data_t24O=np.array(testFeatures_outlier[testFeatures_outlier['cp_time']==24])
featuresVis(data_t24O, idx, "gene/ cp_t24_outlier")
#cp-time-outlier is 48
plt.subplot(2,3,5)
data_t48O=np.array(testFeatures_outlier[testFeatures_outlier['cp_time']==48])
featuresVis(data_t48O, idx, "gene/ cp_t48_outlier")
#cp-time-outlier is 72
plt.subplot(2,3,6) 
data_t72O=np.array(testFeatures_outlier[testFeatures_outlier['cp_time']==72])
featuresVis(data_t72O, idx, "gene/ cp_t72_outlier")

cp-time with cell

In [None]:
#cp-time-norm is 24
idx = np.random.randint(775,872,5)
plt.subplot(2,3,1)
data_t24N=np.array(testFeatures_norm[testFeatures_norm['cp_time']==24])
featuresVis(data_t24N, idx, "cell/ cp_t24_norm")
#cp-time-norm is 48
plt.subplot(2,3,2)
data_t48N=np.array(testFeatures_norm[testFeatures_norm['cp_time']==48])
featuresVis(data_t48N, idx, "cell/ cp_t48_norm")
#cp-time-norm is 72
plt.subplot(2,3,3)  
data_t72N=np.array(testFeatures_norm[testFeatures_norm['cp_time']==72])
featuresVis(data_t72N, idx, "cell/ cp_t72_norm")
#cp-time-outlier is 24
plt.subplot(2,3,4)
data_t24O=np.array(testFeatures_outlier[testFeatures_outlier['cp_time']==24])
featuresVis(data_t24O, idx, "cell/ cp_t24_outlier")
#cp-time-outlier is 48
plt.subplot(2,3,5)
data_t48O=np.array(testFeatures_outlier[testFeatures_outlier['cp_time']==48])
featuresVis(data_t48O, idx, "cell/ cp_t48_outlier")
#cp-time-outlier is 72
plt.subplot(2,3,6) 
data_t72O=np.array(testFeatures_outlier[testFeatures_outlier['cp_time']==72])
featuresVis(data_t72O, idx, "cell/ cp_t72_outlier")

cp-dose with gene

In [None]:
idx = np.random.randint(4,774,5)
#D1 norm
plt.subplot(2,2,1)
data_d1N=np.array(testFeatures_norm[testFeatures_norm['cp_dose']=='D1'])
featuresVis(data_d1N, idx, "gene/ D1-Norm")
#D2 norm
plt.subplot(2,2,2)
data_d2N=np.array(testFeatures_norm[testFeatures_norm['cp_dose']=='D2'])
featuresVis(data_d2N, idx, "gene/ D2-Norm")
#D1 outlier
plt.subplot(2,2,3)
data_d1O=np.array(testFeatures_outlier[testFeatures_outlier['cp_dose']=='D1'])
featuresVis(data_d1O, idx, "gene/ D1-outlier")
#D2 outlier
plt.subplot(2,2,4)
data_d2O=np.array(testFeatures_outlier[testFeatures_outlier['cp_dose']=='D2'])
featuresVis(data_d2O, idx, "gene/ D2-outlier")

cp-dose with cell

In [None]:
idx = np.random.randint(776,872,5)
#D1 norm
plt.subplot(2,2,1)
data_d1N=np.array(testFeatures_norm[testFeatures_norm['cp_dose']=='D1'])
featuresVis(data_d1N, idx, "cell/ D1-Norm")
#D2 norm
plt.subplot(2,2,2)
data_d2N=np.array(testFeatures_norm[testFeatures_norm['cp_dose']=='D2'])
featuresVis(data_d2N, idx, "cell/ D2-Norm")
#D1 outlier
plt.subplot(2,2,3)
data_d1O=np.array(testFeatures_outlier[testFeatures_outlier['cp_dose']=='D1'])
featuresVis(data_d1O, idx, "cell/ D1-outlier")
#D2 outlier
plt.subplot(2,2,4)
data_d2O=np.array(testFeatures_outlier[testFeatures_outlier['cp_dose']=='D2'])
featuresVis(data_d2O, idx, "cell/ D2-outlier")

<a id="s5"></a>
# 5 Feature engineering

<a id="s5-1"></a>
5.1 Correlation

In [None]:
figure = plt.figure(figsize = (20,20))
axes = figure.add_subplot(111)

#using the matshow() function
caxes = axes.matshow(train_features.loc[:,'g-0':'g-771'].corr(), interpolation = 'nearest')
figure.colorbar(caxes)

plt.show()

We find that the gene features are not strongly correlated with each other.

In [None]:
figure = plt.figure(figsize = (20,20))
axes = figure.add_subplot(111)

#using the matshow() function
caxes = axes.matshow(train_features.loc[:,'c-0':].corr(), interpolation = 'nearest')
figure.colorbar(caxes)

plt.show()

The cell features, by contrast, are strongly correlated with one another.
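The visual impression from the two heatmaps can be backed by a single number per feature group. The sketch below is one possible summary, not part of the original analysis: the helper name `mean_abs_offdiag_corr` and the synthetic arrays are assumptions made for illustration.

```python
import numpy as np

def mean_abs_offdiag_corr(X):
    """Mean absolute Pearson correlation over all distinct column pairs of X."""
    corr = np.corrcoef(X, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)  # each pair once, diagonal excluded
    return float(np.mean(np.abs(corr[upper])))

# Synthetic stand-ins (assumption): independent columns vs. columns sharing one factor
rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 6))
shared = rng.normal(size=(500, 1))
correlated = shared + 0.3 * rng.normal(size=(500, 6))

weak = mean_abs_offdiag_corr(independent)
strong = mean_abs_offdiag_corr(correlated)
print(f"independent: {weak:.2f}, shared factor: {strong:.2f}")
```

On the notebook's data the same helper could be called with `train_features.loc[:, 'g-0':'g-771'].values` and `train_features.loc[:, 'c-0':].values` to compare the two groups.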

<a id="s5-2"></a>
5.2 PCA

5.2.1 Gene features

In [None]:
gFeature = train_features.loc[:, 'g-0':'g-771']
# Standardize each gene column to zero mean and unit variance
gFeature_scaled = preprocessing.scale(gFeature)
pca = PCA()
# Fit and project the same standardized data so the components match the projection
gFeature_pca_scaled = pca.fit_transform(gFeature_scaled)
gFeature_per_var = np.round(pca.explained_variance_ratio_*100,decimals=1)
# The bars are principal components, so label them PC1, PC2, ...
gFeature_labels = ['PC' + str(x) for x in range(1,len(gFeature_per_var)+1)]
plt.rcParams['figure.figsize'] = (15,6)
plt.bar(x=range(1,len(gFeature_per_var)+1), height = gFeature_per_var, tick_label=gFeature_labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.show()
# Cumulative percentage of variance explained by the first k components
print(np.cumsum(gFeature_per_var))

The first 305 of the 772 principal components explain about 90.1% of the variance in the gene features.

5.2.2 Cell features

In [None]:
cFeature = train_features.loc[:, 'c-0':]
# Standardize each cell column to zero mean and unit variance
cFeature_scaled = preprocessing.scale(cFeature)
pca = PCA(n_components = 50)
# Fit and project the same standardized data so the components match the projection
cFeature_pca_scaled = pca.fit_transform(cFeature_scaled)
cFeature_per_var = np.round(pca.explained_variance_ratio_*100,decimals=1)
# The bars are principal components, so label them PC1, PC2, ...
cFeature_labels = ['PC' + str(x) for x in range(1,len(cFeature_per_var)+1)]
plt.rcParams['figure.figsize'] = (15,6)
plt.bar(x=range(1,len(cFeature_per_var)+1), height = cFeature_per_var, tick_label=cFeature_labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.show()
# Cumulative percentage of variance explained by the first k components
print(np.cumsum(cFeature_per_var))

The first 37 principal components explain about 94.0% of the variance in the cell features.
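Reading the component count off a printed cumulative sum is easy to get wrong by one. A small helper can return it directly. This is a minimal sketch: `components_for_variance` is a hypothetical name, and the ratios below are made-up values rather than the notebook's output.

```python
import numpy as np

def components_for_variance(explained_ratio, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_ratio)
    # searchsorted gives the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # If the threshold is never reached, fall back to using every component
    return min(k, len(cumulative))

ratios = np.array([0.5, 0.3, 0.15, 0.05])  # illustrative values only
k90 = components_for_variance(ratios, 0.90)
print(k90)  # 3, because 0.5 + 0.3 + 0.15 = 0.95 is the first sum above 0.90
```

With a fitted scikit-learn PCA, passing the unrounded `pca.explained_variance_ratio_` avoids accumulating the rounding error that comes from summing values rounded to one decimal.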

<a id="s5-3"></a>
5.3 Data preprocess

In [None]:
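train = train_features.reset_index(drop = True)
test_all = test_features.reset_index(drop = True)
# Give the test rows indices that continue after the train rows
test_all.index = range(train.shape[0], test_all.shape[0]+ train.shape[0])
# Control (vehicle) rows are set aside; the submission leaves their predictions at 0
test_vehicle_index = test_all[test_all['cp_type'] == 'ctl_vehicle'].index
# drop() returns a new frame, which avoids modifying a slice of test_all in place
test_noVehicle = test_all[test_all['cp_type'] != 'ctl_vehicle'].drop(columns = ['cp_type'])
data_all = pd.concat([train, test_noVehicle])
# Keep the combined index so test rows can be mapped back to their positions later
data_index = data_all.index
data_all = data_all.reset_index(drop = True)
# Every column except the identifier and the treatment descriptors is numeric
cols_numeric = [feat for feat in list(data_all.columns) if feat not in ["sig_id", "cp_type", "cp_time", "cp_dose"]]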



In [None]:
def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())

def scale_norm(col):
    return (col - col.mean()) / col.std()

if scale == "boxcox":
    print(b_, "boxcox")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax, axis = 0)
    trans = []
    for feat in cols_numeric:
        trans_var, lambda_var = stats.boxcox(data_all[feat].dropna() + 1)
        trans.append(scale_minmax(trans_var))
    data_all[cols_numeric] = np.asarray(trans).T
    
elif scale == "norm":
    print(b_, "norm")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_norm, axis = 0)
    
elif scale == "minmax":
    print(b_, "minmax")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax, axis = 0)
    
elif scale == "rankgauss":
    ### Rank Gauss ###
    print(b_, "Rank Gauss")
    scaler = GaussRankScaler()
    data_all[cols_numeric] = scaler.fit_transform(data_all[cols_numeric])
    
else:
    pass
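
The `GaussRankScaler` used in the `rankgauss` branch is loaded from the external `rank-gauss` input, and its internals are not shown in this notebook. The sketch below illustrates the general rank-Gauss idea only (replace each value by its rank, then map the rank through the inverse normal CDF); it is an assumption about the technique, not a copy of that class, and `rank_gauss` is a hypothetical name.

```python
import numpy as np
from statistics import NormalDist

def rank_gauss(values):
    """Map values to normal quantiles according to their rank."""
    values = np.asarray(values, dtype=float)
    # Double argsort gives each value its rank, 0 for the smallest
    order = np.argsort(values, kind="stable")
    ranks = np.argsort(order, kind="stable")
    # Centered ranks lie strictly inside (0, 1), so the inverse CDF stays finite
    quantiles = (ranks + 0.5) / len(values)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])

skewed = [10.0, 1000.0, 20.0, 15.0, 12.0]
transformed = rank_gauss(skewed)
print(np.round(transformed, 3))
```

The outlier 1000.0 ends up only as far from the center as the smallest value, because only the ordering of the inputs matters, not their magnitudes.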

In [None]:
# PCA
if decompo == "PCA":
    print(b_, "PCA")
    GENES = [col for col in data_all.columns if col.startswith("g-")]
    CELLS = [col for col in data_all.columns if col.startswith("c-")]
    
    pca_genes = PCA(n_components = ncompo_genes,
                    random_state = seed).fit_transform(data_all[GENES])
    pca_cells = PCA(n_components = ncompo_cells,
                    random_state = seed).fit_transform(data_all[CELLS])
    
    pca_genes = pd.DataFrame(pca_genes, columns = [f"pca_g-{i}" for i in range(ncompo_genes)])
    pca_cells = pd.DataFrame(pca_cells, columns = [f"pca_c-{i}" for i in range(ncompo_cells)])
    data_all = pd.concat([data_all, pca_genes, pca_cells], axis = 1)
else:
    pass

In [None]:
# Encoding
if encoding == "lb":
    print(b_, "Label Encoding")
    for feat in ["cp_time", "cp_dose"]:
        data_all[feat] = LabelEncoder().fit_transform(data_all[feat])
elif encoding == "dummy":
    print(b_, "One-Hot")
    data_all = pd.get_dummies(data_all, columns = ["cp_time", "cp_dose"])

In [None]:
GENES = [col for col in data_all.columns if col.startswith("g-")]
CELLS = [col for col in data_all.columns if col.startswith("c-")]

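# The loop variable is named `stat` so it does not shadow the scipy `stats` module
for stat in tqdm.tqdm(["sum", "mean", "std", "kurt", "skew"]):
    # Row-wise aggregate over the gene columns, the cell columns, and both together
    data_all["g_" + stat] = getattr(data_all[GENES], stat)(axis = 1)
    data_all["c_" + stat] = getattr(data_all[CELLS], stat)(axis = 1)
    data_all["gc_" + stat] = getattr(data_all[GENES + CELLS], stat)(axis = 1)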
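
To make the `getattr` pattern concrete, here is a small worked example on a toy frame. The values and the `gene_cols` name are made up for illustration, and only three of the five statistics are shown.

```python
import pandas as pd

frame = pd.DataFrame({
    "g-0": [1.0, 0.0],
    "g-1": [2.0, 0.0],
    "g-2": [3.0, 6.0],
    "c-0": [4.0, 2.0],
})
gene_cols = [col for col in frame.columns if col.startswith("g-")]

for stat in ["sum", "mean", "std"]:
    # getattr(df, "sum") is df.sum, so each pass calls a different DataFrame method
    frame["g_" + stat] = getattr(frame[gene_cols], stat)(axis=1)

print(frame[["g_sum", "g_mean", "g_std"]])
```

Both rows have the same sum and mean, but their standard deviations differ, which is the kind of per-sample spread these features expose. The new columns use an underscore (`g_sum`), so a later `startswith("g-")` filter does not pick them up as gene columns.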

In [None]:
with open("data_all.pickle", "wb") as f:
    pickle.dump(data_all, f)

In [None]:
with open("data_all.pickle", "rb") as f:
    data_all = pickle.load(f)

In [None]:
# Rows are ordered train first, then the non-vehicle test rows
train_df = data_all[: train.shape[0]]
#train_df.reset_index(drop = True, inplace = True)
# Concatenating the targets onto the training features is bad practice in my opinion,
# so the following line stays disabled
#train_df = pd.concat([train_df, targets], axis = 1)
test_df = data_all[train_df.shape[0]: ].reset_index(drop = True)
# Shift the saved indices back so they are positions within the original test set
test_noVehicle_index = [i - train_df.shape[0] for i in data_index[train_df.shape[0]:]]
test_vehicle_index = [i - train_df.shape[0] for i in test_vehicle_index]
targets = targets_cp_train.reset_index(drop = True)

In [None]:
print(f"{b_}train_df.shape: {r_}{train_df.shape}")
print(f"{b_}test_df.shape: {r_}{test_df.shape}")

In [None]:
X_test = test_df.values
print(f"{b_}X_test.shape: {r_}{X_test.shape}")

<a id="s6"></a>
# 6 Model: TabNet

In [None]:
# Parameters in use

MAX_EPOCH = 200
N_SPLITS = 10
Lr = 2e-2
Weight_decay = 1e-5
Nd = 32
Na = 32
Steps = 1
Batch = 1024

In [None]:
tabnet_params = dict(
    n_d = Nd,
    n_a = Na,
    n_steps = Steps,
    gamma = 1.3,
    lambda_sparse = 0,
    optimizer_fn = optim.Adam,
    optimizer_params = dict(lr = Lr, weight_decay = Weight_decay),
    mask_type = "entmax",
    scheduler_params = dict(
        mode = "min", patience = 5, min_lr = 1e-5, factor = 0.9),
    scheduler_fn = ReduceLROnPlateau,
    seed = seed,
    verbose = 10
)

Training

In [None]:
class LogitsLogLoss(Metric):
    """
    Blend of a plain and a positive-weighted log loss, computed on raw logits
    (the sigmoid is applied inside the metric).
    """

    def __init__(self):
        self._name = "logits_ll"
        self._maximize = False

    def __call__(self, y_true, y_pred):
        """
        Compute the blended log loss of predictions.

        Parameters
        ----------
        y_true: np.ndarray
            Target matrix or vector
        y_pred: np.ndarray
            Raw model outputs (logits), matrix or vector

        Returns
        -------
            float
            Blended log loss of predictions vs targets.
        """
        # The sigmoid turns logits into probabilities
        probs = 1 / (1 + np.exp(-y_pred))
        # Plain log loss with both classes weighted 0.5
        aux = 0.5*(1 - y_true) * np.log(1 - probs + 1e-15) + 0.5*y_true * np.log(probs + 1e-15)
        # Log loss that puts almost all of the weight (0.995) on the positive labels
        wAux = 0.005 * (1 - y_true) * np.log(1 - probs + 1e-15) + 0.995 * y_true * np.log(probs + 1e-15)
        return 0.7 * np.mean(-aux) + 0.3 * np.mean(-wAux)
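
A quick numeric check shows what the blend does. The function below repeats the arithmetic of the metric as a standalone helper so that it runs without `pytorch_tabnet`; the name `blended_log_loss` and the toy inputs are assumptions made for this sketch.

```python
import numpy as np

def blended_log_loss(y_true, logits, eps=1e-15):
    """0.7 * plain log loss (0.5/0.5 weights) + 0.3 * positive-weighted log loss."""
    probs = 1 / (1 + np.exp(-logits))
    plain = 0.5 * (1 - y_true) * np.log(1 - probs + eps) + 0.5 * y_true * np.log(probs + eps)
    weighted = 0.005 * (1 - y_true) * np.log(1 - probs + eps) + 0.995 * y_true * np.log(probs + eps)
    return 0.7 * np.mean(-plain) + 0.3 * np.mean(-weighted)

y_true = np.array([[1.0, 0.0]])  # one positive and one negative label

uninformed = blended_log_loss(y_true, np.zeros((1, 2)))              # p = 0.5 everywhere
confident = blended_log_loss(y_true, np.array([[4.0, -4.0]]))        # both labels right
missed_positive = blended_log_loss(y_true, np.array([[-4.0, -4.0]])) # positive predicted low
missed_negative = blended_log_loss(y_true, np.array([[4.0, 4.0]]))   # negative predicted high

print(uninformed, confident, missed_positive, missed_negative)
```

A missed positive costs more than a missed negative of the same confidence, which is the effect of the 0.995 weight in the second term. Note also that the 0.5 factors make the value roughly half of a standard log loss, so scores from this metric are not directly comparable to the competition's log loss.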

In [None]:
# class LogitsLogLoss(Metric):
#     """
#     LogLoss with sigmoid applied
#     """

#     def __init__(self):
#         self._name = "logits_ll"
#         self._maximize = False

#     def __call__(self, y_true, y_pred):
#         """
#         Compute LogLoss of predictions.

#         Parameters
#         ----------
#         y_true: np.ndarray
#             Target matrix or vector
#         y_pred: np.ndarray
#             pred matrix or vector

#         Returns
#         -------
#             float
#             LogLoss of predictions vs targets.
#         """
#         logits = 1 / (1 + np.exp(-y_pred))
#         CELoss = 0.5*(1 - y_true) * np.log(1 - logits + 1e-15) + 0.5*y_true * np.log(logits + 1e-15)
#         weightedCELoss = 0.005 * (1 - y_true) * np.log(1 - logits + 1e-15) + 0.995 * y_true * np.log(logits + 1e-15)
#         return 0.85 * np.mean(-CELoss) + 0.15 * np.mean(-weightedCELoss)

In [None]:
scores_auc_all = []
test_cv_preds = []

mskf = MultilabelStratifiedKFold(n_splits = N_SPLITS, random_state = 0, shuffle = True)

oof_preds = []
oof_targets = []
scores = []
scores_auc = []
for fold_nb, (train_idx, val_idx) in enumerate(mskf.split(train_df, targets)):
    print(b_,"FOLDS: ", r_, fold_nb + 1)
    print(g_, '*' * 60, c_)
    
    X_train, y_train = train_df.values[train_idx, :], targets.values[train_idx, :]
    X_val, y_val = train_df.values[val_idx, :], targets.values[val_idx, :]
    ### Model ###
    model = TabNetRegressor(**tabnet_params)
        
    ### Fit ###
    model.fit(
        X_train = X_train,
        y_train = y_train,
        eval_set = [(X_val, y_val)],
        eval_name = ["val"],
        eval_metric = ["logits_ll"],
        max_epochs = MAX_EPOCH,
        patience = 20,
        batch_size = Batch, 
        virtual_batch_size = 32,
        num_workers = 1,
        drop_last = False,
        # Use binary cross-entropy on logits: this is multi-label classification, not regression
        loss_fn = F.binary_cross_entropy_with_logits
    )
    print(y_, '-' * 60)
    
    ### Predict on validation ###
    preds_val = model.predict(X_val)
    # Apply sigmoid to the predictions
    preds = 1 / (1 + np.exp(-preds_val))
    # Best (lowest) validation metric reached during training
    score = np.min(model.history["val_logits_ll"])
    
    ### Save OOF for CV ###
    # Store probabilities; the AUC is unchanged because the sigmoid is monotonic
    oof_preds.append(preds)
    oof_targets.append(y_val)
    scores.append(score)
    
    ### Predict on test ###
    preds_test = model.predict(X_test)
    test_cv_preds.append(1 / (1 + np.exp(-preds_test)))

oof_preds_all = np.concatenate(oof_preds)
oof_targets_all = np.concatenate(oof_targets)
test_preds_all = np.stack(test_cv_preds)

In [None]:
aucs = []
for task_id in range(oof_preds_all.shape[1]):
    aucs.append(roc_auc_score(y_true = oof_targets_all[:, task_id],
                              y_score = oof_preds_all[:, task_id]
                             ))
print(f"{b_}Overall AUC: {r_}{np.mean(aucs)}")
print(f"{b_}Average CV: {r_}{np.mean(scores)}")

In [None]:
tgt_col = [col for col in sample_submission.columns if col not in ["sig_id"]]
test_predicts = sample_submission.copy()
test_predicts[tgt_col] = 0
test_predicts.loc[test_noVehicle_index,tgt_col] = test_preds_all.mean(axis = 0)
test_predicts.to_csv('submission.csv', index=False)
test_predicts.head()
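
The submission step combines two ideas: averaging the per-fold test predictions and writing them only into the non-vehicle rows. The toy example below isolates that logic with NumPy arrays; the shapes, row positions and prediction values are invented for illustration.

```python
import numpy as np

n_test, n_targets = 5, 3
kept_rows = [0, 2, 3]  # hypothetical positions of the non-vehicle test rows

# Two folds of predictions for the kept rows: shape (n_folds, n_kept, n_targets)
fold_preds = np.stack([
    np.full((len(kept_rows), n_targets), 0.2),
    np.full((len(kept_rows), n_targets), 0.4),
])

# Start from zeros so the vehicle rows keep a prediction of 0
submission = np.zeros((n_test, n_targets))
# Average over the fold axis, then scatter into the kept rows
submission[kept_rows] = fold_preds.mean(axis=0)

print(submission)
```

Rows 1 and 4 stay at zero, mirroring how `test_predicts[tgt_col] = 0` is set before the `.loc` assignment in the cell above.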


<div class = "alert alert-block alert-info">
    <h1><font color = "green">Reference</font></h1>
    <p>The TabNet setup is largely based on <a href = "https://www.kaggle.com/optimo/tabnetregressor-2-0-train-infer">TabNetRegressor 2.0 [TRAIN + INFER]</a>. Parameters were tuned by the team members.</p>
</div>