# When stuff is not missing at random
In the discussion at https://www.kaggle.com/c/tabular-playground-series-sep-2021/discussion/270206, Ryan Holbrook of the Kaggle staff remarked that missingness is the magic feature, because the organizers intentionally correlated missingness with the claim target. That's why the number of missing features per case is such a predictive feature :-)
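
As a minimal sketch of that count feature, here is how it could be computed with pandas. The toy frame and the name `n_missing` are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition features (hypothetical values).
toy = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3, np.nan],
    "f2": [1.0, 2.0, np.nan, np.nan],
    "f3": [5.0, 6.0, 7.0, np.nan],
})

# Count the missing values in each row (axis=1 sums across columns).
n_missing = toy.isna().sum(axis=1)
print(n_missing.tolist())  # [0, 1, 1, 3]
```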

Still, is it only the number of missing features per case that matters, or can we extract further information from the missing cases, such as specific patterns?
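
One simple way to look at specific patterns, sketched here under assumptions, is to encode each row's missingness indicators as a string and summarize the target per pattern. The toy data, the `pattern` encoding and the `summary` table are all hypothetical; with the many features of the real data the number of distinct patterns could be very large.

```python
import numpy as np
import pandas as pd

# Toy features and target (hypothetical values, not competition data).
toy = pd.DataFrame({
    "f1": [np.nan, np.nan, 0.3, 0.4, np.nan, 0.6],
    "f2": [1.0, 2.0, np.nan, np.nan, 5.0, 6.0],
})
target = pd.Series([1, 1, 0, 1, 1, 0], name="claim")

# Encode each row's missingness pattern as a string such as "10"
# (1 = missing, 0 = present, one character per feature).
indicators = toy.isna().astype(int)
pattern = indicators.astype(str).agg("".join, axis=1)

# Row count and mean target for every distinct pattern.
summary = target.groupby(pattern).agg(["count", "mean"])
print(summary)
```

In this toy example the patterns "10" and "01" both have exactly one missing value, yet their mean target differs, which is the kind of information a plain count cannot capture.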

In this notebook, t-SNE and UMAP are used to try to extract further features from the missingness patterns. In particular, t-SNE seems able to achieve an almost complete separation between the claim classes, so its coordinates could be useful as predictive features.

t-SNE (https://lvdmaaten.github.io/tsne/) and UMAP (https://github.com/lmcinnes/umap) are two techniques, often used by data scientists, that project multivariate data into lower dimensions. They are often used to find clusters in data. I used the fast t-SNE and UMAP implementations offered by RAPIDS (they require GPU access).

In [None]:
%%time
import sys
!cp -f ../input/rapids/rapids.21.06 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz
sys.path = ["/opt/conda/envs/rapids/lib"] + ["/opt/conda/envs/rapids/lib/python3.7"] + ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
!cp -f /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
# Importing core libraries
import numpy as np
import pandas as pd
from time import time
import os
import pprint
import joblib
from functools import partial

# Suppressing warnings to keep the output readable
import warnings
warnings.filterwarnings("ignore")

# Regressors
import lightgbm as lgb

# Model selection
from sklearn.model_selection import KFold, StratifiedKFold

# Metrics
from sklearn.metrics import mean_squared_error

# Data processing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

# GPU-accelerated dimensionality reduction (RAPIDS) and plotting
import cudf, cuml
import cupy as cp
from cuml.manifold import TSNE, UMAP
import matplotlib.pyplot as plt
from matplotlib.pyplot import ylim, xlim
%matplotlib inline

In [None]:
# Loading data 
X_train = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
X_test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")

In [None]:
# Preparing data as a tabular matrix
y = X_train.claim
X_train = X_train.set_index('id').drop('claim', axis='columns')
X_test = X_test.set_index('id')

In [None]:
# Building missing-value indicators (1 = missing), train rows first, then test rows
X = pd.concat([X_train.isna().astype(int), X_test.isna().astype(int)])

In [None]:
tsne = TSNE(n_components=2, perplexity=10, n_neighbors=100)
projection_2D = tsne.fit_transform(X)

In [None]:
projection_2D_train = projection_2D[:len(X_train), :]
projection_2D_test = projection_2D[len(X_train):, :]

In [None]:
valid_0 = (projection_2D_train[:,0] < 300) & (projection_2D_train[:,0] >-300)
valid_1 = (projection_2D_train[:,1] < 300) & (projection_2D_train[:,1] >-300)
valid = valid_0 & valid_1

In [None]:
plt.figure(figsize=(15, 15))
plt.scatter(projection_2D_train[valid, 0], projection_2D_train[valid, 1],
            c=y.values[valid],
            edgecolor='none', 
            alpha=0.80, 
            s=10)
plt.axis('off')
plt.show();

Bingo! The positive claims are segregated into a specific area, meaning they are largely predictable from the missingness patterns.

In [None]:
X_train['t_sne_0'] = projection_2D_train[:, 0]
X_test['t_sne_0'] = projection_2D_test[:, 0]

X_train['t_sne_1'] = projection_2D_train[:, 1]
X_test['t_sne_1'] = projection_2D_test[:, 1]

In [None]:
# UMAP
umap = UMAP(n_components=2, n_neighbors=70)
projection_2D = umap.fit_transform(X)

In [None]:
projection_2D_train = projection_2D[:len(X_train), :]
projection_2D_test = projection_2D[len(X_train):, :]

In [None]:
valid_0 = (projection_2D_train[:,0] < 2000) & (projection_2D_train[:,0] >-2000)
valid_1 = (projection_2D_train[:,1] < 2000) & (projection_2D_train[:,1] >-2000)
valid = valid_0 & valid_1

In [None]:
plt.figure(figsize=(15, 15))
plt.scatter(projection_2D_train[valid, 0], projection_2D_train[valid, 1],
            c=y.values[valid],
            edgecolor='none', 
            alpha=0.80, 
            s=10)
plt.axis('off')
plt.show();

Uhmmm... this star-shaped result is quite uncommon. I wonder if it can be useful... anyway, this is the first time I have seen this shape!

In [None]:
X_train['t_umap_0'] = projection_2D_train[:, 0]
X_test['t_umap_0'] = projection_2D_test[:, 0]

X_train['t_umap_1'] = projection_2D_train[:, 1]
X_test['t_umap_1'] = projection_2D_test[:, 1]

You can use the coordinates computed by t-SNE and UMAP as features. By adding interaction terms between them, you could even use them in linear models and neural networks, which may make those models more effective.
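
As a rough sketch of what such interactions might look like, the snippet below derives a product term and polar coordinates from a pair of embedding columns. The toy values and the new column names (`t_sne_prod`, `t_sne_radius`, `t_sne_angle`) are assumptions for illustration; whether any of these actually helps would need to be checked with cross-validation.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-D embedding coordinates standing in for t_sne_0 / t_sne_1.
emb = pd.DataFrame({"t_sne_0": [3.0, 0.0, -1.0],
                    "t_sne_1": [4.0, 2.0, 0.0]})

# Simple multiplicative interaction between the two coordinates.
emb["t_sne_prod"] = emb["t_sne_0"] * emb["t_sne_1"]

# Polar coordinates: distance from the origin and angle of each point.
emb["t_sne_radius"] = np.hypot(emb["t_sne_0"], emb["t_sne_1"])
emb["t_sne_angle"] = np.arctan2(emb["t_sne_1"], emb["t_sne_0"])
print(emb)
```

The same transformations could be applied to the `t_umap_0` and `t_umap_1` columns.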

In [None]:
# Saving the t-SNE and UMAP coordinates as features, together with the original columns
X_train['claim'] = y
X_train.reset_index().to_csv("train.csv", index=False)
X_test.reset_index().to_csv("test.csv", index=False)