# Imports 
(ﾉ◕ヮ◕)ﾉ*:･ﾟ✧ Let's do this ..  

## Content
1. [Exploring data](#Exploring-data)
2. [Feature Selection](#Feature-Selection)
3. [Preprocessing](#Preprocessing)
    1. [Train valid split](#Train-valid-split)
    2. [Add Features](#Add-Features)
    3. [Missing values](#Impute)
    4. [Scaler](#Scaler)
    5. [Fix Skewness](#Fix-Skewness)
    6. [Data Smoothing](#Data-Smoothing)
4. [TabNet](#TabNet)
    1. [Semi Supervised](#Semi-Supervised)
        * [Unsupervised](#Unsupervised)
        * [Supervised](#Supervised)
5. [Submit](#Submit)

# Imports

In [None]:
! pip install pytorch-tabnet # if not installed

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

import pytorch_tabnet
from pytorch_tabnet.tab_model import TabNetClassifier
import torch

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, accuracy_score

random = 200
torch.manual_seed(random)

train = '../input/tabular-playground-series-sep-2021/train.csv'
test = '../input/tabular-playground-series-sep-2021/test.csv'

# Exploring data 

In [None]:
train_df = pd.read_csv(train)
test_df = pd.read_csv(test)

It looks like we have NaN values **(´。＿。｀)**. There are several ways to deal with missing values:
* Drop Columns with Missing Values
* Imputation 
* Interpolate
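
To make the three options above concrete, here is a minimal sketch on a made-up toy frame (the column names `a` and `b` are assumptions for illustration, not columns from the competition data). Interpolation in particular only makes sense when the row order carries meaning, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "a" has one missing value, "b" has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
})

# 1) Drop every column that contains at least one missing value.
dropped = toy.dropna(axis=1)

# 2) Impute: replace each NaN with the median of its column.
imputed = toy.fillna(toy.median())

# 3) Interpolate: fill each NaN linearly from its neighbours in row order.
interpolated = toy.interpolate(method="linear")

print(dropped.columns.tolist())
print(imputed["a"].tolist())
print(interpolated["a"].tolist())
```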

We will try both Imputation and Interpolation.  
Another thing to note here: let's get rid of the `id` column, because it is only a row identifier and can introduce [**data leakage**](https://www.kaggle.com/alexisbcook/data-leakage). 

> *“ if any other feature whose value would not actually be available in practice at the time you’d want to use the model to make a prediction, is a feature that can introduce leakage to your model ”* ~ **Data Skeptic**

In [None]:
display(train_df.head())

# Let's count the null values
sum_na = train_df.isna().sum().sum()
print(f'Total NaN values: {sum_na}')

In [None]:
train_df.pop('id')
test_id = test_df.pop('id')

Looking at the feature distributions, you can see that most of them are **skewed**. For skewed features the **mean** is a poor fill value, because it gets pulled toward the long tail, so we replace missing values with the **median** or the **mode** instead. [This article may be useful to you](https://vitalflux.com/pandas-impute-missing-values-mean-median-mode/); it was useful to me.
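
A quick way to see why the mean is a weak fill value for skewed data is to compare it with the median on a synthetic right-skewed sample. The log-normal sample below is an assumption made purely for illustration; it is not one of the competition features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature drawn from a log-normal distribution.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

mean_fill = values.mean()
median_fill = values.median()

# Share of observations that sit below each candidate fill value.
share_below_mean = (values < mean_fill).mean()
share_below_median = (values < median_fill).mean()

print(f"skewness:           {values.skew():.2f}")
print(f"mean fill value:    {mean_fill:.2f} ({share_below_mean:.0%} of rows are below it)")
print(f"median fill value:  {median_fill:.2f} ({share_below_median:.0%} of rows are below it)")
```

The median sits in the middle of the data by construction, while the mean lands above most of the observations, so filling with the mean pushes the imputed rows toward the tail.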

In [None]:
train_df.hist(figsize=(30,30), bins=25, xlabelsize=0, ylabelsize=0, color='#cf1f1f')
plt.show()

# Feature Selection
There are multiple ways to select features:  
* Filter Method
* Wrapper Method
* Embedded Method 

In this notebook we will use the filter method.
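
As a rough sketch of what a correlation-based filter looks like, the example below scores each feature by its absolute correlation with the target and keeps those above a cutoff. The toy columns (`f_signal`, `f_noise`, `target`) and the `threshold` of 0.2 are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical data: one feature drives the target, the other is pure noise.
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "f_signal": signal,
    "f_noise": rng.normal(size=n),
    "target": (signal + 0.5 * rng.normal(size=n) > 0).astype(int),
})

threshold = 0.2  # hypothetical cutoff on |correlation with the target|

# Filter method: score each feature on its own, without training a model.
corr_with_target = toy.corr()["target"].drop("target").abs()
selected = corr_with_target[corr_with_target > threshold].index.tolist()

print(corr_with_target.round(3))
print(selected)
```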

In [None]:
import seaborn as sns

train_corr = train_df.corr()
# Mask the upper triangle so each pair of features is drawn only once
train_mask = np.triu(np.ones_like(train_corr, dtype=bool))

fig = plt.figure(figsize=(16, 16))

sns.heatmap(train_corr, 
            square=True, 
            mask=train_mask,
            annot=False,
            cmap=plt.cm.Reds
           )
plt.show()

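We will pick the features whose absolute correlation with our target feature is greater than 0 **o(\*￣▽￣\*)ブ**. With a cutoff of 0 this keeps nearly every column, so raise the cutoff if you want a stricter filter.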

In [None]:
cor_target = abs(train_corr["claim"])

relevant_features = cor_target[cor_target>0]
relevant_features

In [None]:
relevent_train = train_df.loc[:, relevant_features.index]
relevent_train.head()

# Preprocessing 

One thing to note here: if we are planning to impute or add features, we first need to split the data into train and validation sets to avoid [**data leakage**](https://www.kaggle.com/alexisbcook/data-leakage). Then we deal with them separately, fitting any statistics on the training set only.  
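
One way to keep the split clean is to wrap the preprocessing steps in a single pipeline, fit it on the training part only, and reuse the fitted object for everything else. The sketch below uses a made-up toy frame (`toy_X` with columns `f1` to `f3`) and only two steps; it is an illustration of the idea, not the exact pipeline used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Hypothetical skewed data with a few missing values in the first column.
toy_X = pd.DataFrame(rng.lognormal(size=(200, 3)), columns=["f1", "f2", "f3"])
toy_X.iloc[::10, 0] = np.nan

toy_train, toy_valid = train_test_split(toy_X, test_size=0.2, random_state=0)

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
])

# Statistics (medians, quartiles) are learned from the training rows only...
train_ready = prep.fit_transform(toy_train)
# ...and then applied unchanged to the validation rows.
valid_ready = prep.transform(toy_valid)

learned_median = prep.named_steps["impute"].statistics_[0]
print(learned_median, toy_train["f1"].median())
```

The same fitted `prep` object could later be applied to the test set with `prep.transform(...)`, which keeps train, validation and test on an identical scale.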

### Train valid split

In [None]:
y = relevent_train['claim']
X = relevent_train.drop('claim', axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, 
    test_size=0.2,
    train_size=0.8, 
    shuffle=True,
    random_state=random)

### Add Features

In [None]:
pd.options.mode.chained_assignment = None

f = [x for x in X_train.columns.values if x[0]=="f"]

X_train['missing'] = X_train.loc[:,f].isna().sum(axis=1)
X_train['abs_sum'] = X_train.loc[:,f].abs().sum(axis=1)
X_train['median'] = X_train.loc[:,f].median(axis=1)
X_train['var'] = X_train.loc[:,f].var(axis=1)
X_train['std'] = X_train.loc[:,f].std(axis=1)
X_train['mean'] = X_train.loc[:,f].mean(axis=1)
X_train['max'] = X_train.loc[:,f].max(axis=1)
X_train['min'] = X_train.loc[:,f].min(axis=1)

X_train.head()

In [None]:
X_valid['missing'] = X_valid.loc[:,f].isna().sum(axis=1)
X_valid['abs_sum'] = X_valid.loc[:,f].abs().sum(axis=1)
X_valid['median'] = X_valid.loc[:,f].median(axis=1)
X_valid['var'] = X_valid.loc[:,f].var(axis=1)
X_valid['std'] = X_valid.loc[:,f].std(axis=1)
X_valid['mean'] = X_valid.loc[:,f].mean(axis=1)
X_valid['max'] = X_valid.loc[:,f].max(axis=1)
X_valid['min'] = X_valid.loc[:,f].min(axis=1)

pd.options.mode.chained_assignment = 'warn'
X_valid.head()

### Impute 
I'll use **median** because of the reasons stated previously.

In [None]:
my_imputer = SimpleImputer(strategy="median")
def impute(X_t, X_v):
    return pd.DataFrame(my_imputer.fit_transform(X_t)), pd.DataFrame(my_imputer.transform(X_v))

X_train_imp, X_val_imp = impute(X_train, X_valid)

### Scaler
[This](https://stackoverflow.com/questions/51841506/data-standardization-vs-normalization-vs-robust-scaler) helped me decide which scaling approach to use.
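
For intuition, `RobustScaler` with its default settings centers each column on the median and divides by the interquartile range, so a single extreme value barely affects how the rest of the data is scaled. The tiny array below is a made-up example that compares it with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical column: four ordinary values and one large outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

# Manual version of the robust scaling: (x - median) / (Q3 - Q1).
q1, med, q3 = np.percentile(x, [25, 50, 75])
manual = (x - med) / (q3 - q1)

# Spread of the four ordinary values after each scaling.
robust_spread = robust[:4].max() - robust[:4].min()
standard_spread = standard[:4].max() - standard[:4].min()

print(robust.ravel())
print(standard.ravel().round(3))
print(robust_spread, round(standard_spread, 3))
```

With the standard scaler the outlier inflates the standard deviation and squeezes the ordinary values together, while the robust scaler keeps them well separated.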

In [None]:
from sklearn.preprocessing import RobustScaler

def robust_scale(X_t, X_v):
    scaler = RobustScaler()
    
    return pd.DataFrame(scaler.fit_transform(X_t)), pd.DataFrame(scaler.transform(X_v))

X_train_imp_st, X_val_imp_st = robust_scale(X_train_imp, X_val_imp)

In [None]:
display(X_train_imp_st.head())

# Let's check that no null values remain
na_sum = X_train_imp_st.isna().sum().sum()
print(f'NaN: {na_sum}')

### Fix Skewness
We will use a quantile transform to automatically map each numeric input to a normal distribution. [This](https://machinelearningmastery.com/quantile-transforms-for-machine-learning/) post was really helpful for understanding quantile transforms and why they are useful.
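
To check that the transform really removes skew, the sketch below applies `QuantileTransformer` to a synthetic log-normal column and compares the sample skewness before and after. The data and the `random_state` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Hypothetical heavily right-skewed feature.
skewed = rng.lognormal(size=(5_000, 1))

qt = QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0)
gaussianized = qt.fit_transform(skewed)

skew_before = pd.Series(skewed[:, 0]).skew()
skew_after = pd.Series(gaussianized[:, 0]).skew()

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```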

In [None]:
from sklearn.preprocessing import QuantileTransformer

trans = QuantileTransformer(n_quantiles=100, output_distribution='normal')

### Data Smoothing 
There are 3 techniques to smooth data: 
* Binning
* Regression 
* Outlier analysis   

[This](https://machinelearningmastery.com/discretization-transforms-for-machine-learning/) was helpful. I'll be using the binning technique. (o′┏▽┓｀o) 
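
Here is a small sketch of what uniform binning does, using a made-up column and only 5 bins so the result is easy to read. With `strategy="uniform"` every bin has the same width, and `encode="ordinal"` returns the bin index of each value.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical column spanning the range 0 to 10.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])

binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(x)

# Equal-width edges between the column minimum and maximum.
print(binner.bin_edges_[0])
# One integer bin index per input row.
print(binned.ravel())
```

Note that the edges depend on the minimum and maximum seen during `fit`, so the same fitted binner should be reused on any data the model will later score.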

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

kbin = KBinsDiscretizer(n_bins=100, encode='ordinal',strategy='uniform')

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("scaler", trans),
    ("binning", kbin)
])

train_final = pd.DataFrame(pipe.fit_transform(X_train_imp_st))
val_final = pd.DataFrame(pipe.transform(X_val_imp_st))

In [None]:
train_final.hist(figsize=(15,10), bins=64, color='#cf1f1f')
plt.show()

In [None]:
val_final.hist(figsize=(15,10), bins=64, color='#cf1f1f')
plt.show()

# TabNet 

In [None]:
Xtrain = train_final.to_numpy()
Xvalid = val_final.to_numpy()

In [None]:
del X_train_imp_st, X_val_imp_st, trans, kbin, pipe, my_imputer

## Semi-Supervised

#### Unsupervised

In [None]:
from pytorch_tabnet.pretraining import TabNetPretrainer

unsupervised_model = TabNetPretrainer(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax',
    )


unsupervised_model.fit(
    Xtrain,
    eval_set=[Xvalid],
    max_epochs=15, patience=10,
    batch_size=512, virtual_batch_size=256,
    num_workers=0,
    drop_last=False,
    pretraining_ratio=0.8,

)

reconstructed_X, embedded_X = unsupervised_model.predict(Xtrain)
assert reconstructed_X.shape == embedded_X.shape

#### Save model (Optional)

In [None]:
# unsupervised_model.save_model('./pretrain')

#### Supervised

In [None]:
model1 = TabNetClassifier(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_params={"step_size":5,"gamma":0.9},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    mask_type='entmax'
)

In [None]:
# Pass plain NumPy arrays for the targets, matching the NumPy feature matrices
model1.fit(
    Xtrain, y_train.to_numpy(),
    eval_set=[(Xtrain, y_train.to_numpy()), (Xvalid, y_valid.to_numpy())],
    eval_name=['train', 'valid'],
    eval_metric=['auc'],
    max_epochs=15, patience=10,
    batch_size=512, virtual_batch_size=256,
    num_workers=0,
    weights=1,
    from_unsupervised=unsupervised_model,
    drop_last=False
)

#### Save model (optional)

In [None]:
model1.save_model('./model_80')

# Submit

In [None]:
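# Keep the selected feature columns (the target 'claim' is not in the test set)
feature = [col for col in relevant_features.index if 'f' in col]

# Copy the slice so that adding columns below does not trigger SettingWithCopyWarning
relevent_test = test_df.loc[:, feature].copy()
relevent_test.head()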

In [None]:
relevent_test['missing'] = relevent_test.loc[:,f].isna().sum(axis=1)
relevent_test['abs_sum'] = relevent_test.loc[:,f].abs().sum(axis=1)
relevent_test['median'] = relevent_test.loc[:,f].median(axis=1)
relevent_test['var'] = relevent_test.loc[:,f].var(axis=1)
relevent_test['std'] = relevent_test.loc[:,f].std(axis=1)
relevent_test['mean'] = relevent_test.loc[:,f].mean(axis=1)
relevent_test['max'] = relevent_test.loc[:,f].max(axis=1)
relevent_test['min'] = relevent_test.loc[:,f].min(axis=1)

In [None]:
my_imputer = SimpleImputer(strategy="median")
test_imp = pd.DataFrame(my_imputer.fit_transform(relevent_test))

In [None]:
scaler = RobustScaler()
X_test = pd.DataFrame(scaler.fit_transform(test_imp))

In [None]:
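# Use the same settings as in training (100 quantiles, 100 bins),
# so the ordinal bin indices cover the range the model was trained on
trans = QuantileTransformer(n_quantiles=100, output_distribution='normal')
kbin = KBinsDiscretizer(n_bins=100, encode='ordinal',strategy='uniform')

pipe = Pipeline([
    ("scaler", trans),
    ("binning", kbin)
])

test_final = pd.DataFrame(pipe.fit_transform(X_test))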

In [None]:
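# The model is tracked with AUC, so keep the predicted probability of the positive class
# instead of hard 0/1 labels
preds = model1.predict_proba(test_final.to_numpy())[:, 1]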

In [None]:
df = pd.DataFrame({
    'id': test_id,
    'claim': preds
})

df = df.set_index('id')
df.to_csv('final.csv')