In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**What I've done in this notebook**

* Apply various dimensionality reduction techniques and visualize the reduced data
    - PCA
    - kernel PCA
    - t-SNE
    - Isomap
    - LLE
* Train a random forest to evaluate performance after PCA

# Import and preprocess the data

In [None]:
train_path = "../input/tabular-playground-series-jun-2021/train.csv"
train = pd.read_csv(train_path)
train.head()

In [None]:
train.drop("id", axis=1, inplace=True)

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(train["target"])
X = train.drop("target", axis=1).values

Split the data set into training and validation sets to evaluate performance:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                      test_size=0.2, 
                                                      stratify=y,
                                                      random_state=42)

In [None]:
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape 
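The `stratify=y` argument in the split above keeps the class proportions the same in both parts. Here is a minimal sketch of that effect on made-up, imbalanced labels; `X_demo`, `y_demo` and the other `*_demo` names are hypothetical and not part of the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels made up for illustration: 900 rows of class 0, 100 rows of class 1
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr_demo, y_va_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Share of each class in the two splits; with stratify they match the full data
train_share = np.bincount(y_tr_demo) / len(y_tr_demo)
valid_share = np.bincount(y_va_demo) / len(y_va_demo)
print(train_share, valid_share)
```

Without `stratify`, the shares would only match approximately, which matters most for rare classes.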

In [None]:
train.iloc[:, :-1] = train.iloc[:, :-1].astype("int16")


# Visualizing the data with dimensionality reduction techniques

In this section, I will use several dimensionality reduction techniques to transform our data set into 2D space and visualize the data points to see if these techniques can give us some insights about the data.

Here's a helper for visualizing the data:

In [None]:
def draw_plot_2d(decompose=None, 
                 subset=None,
                 X_train=X_train, 
                 y_train=y_train):
    
    if subset is not None:
        X_train = X_train[subset, :]
        y_train = y_train[subset]
    
    if decompose is None:
        decompose_2d = X_train
    else:
        decompose_2d = decompose.fit_transform(X_train)

    plt.figure(figsize=(15, 8), dpi=100)
    sns.scatterplot(x=decompose_2d[:, 0], 
                    y=decompose_2d[:, 1],
                    hue=[le.classes_[i] for i in y_train]);

## Principal component analysis

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
draw_plot_2d(pca)

In [None]:
# Sample row indices without replacement so that no row is drawn twice
sample_ids = np.random.choice(X_train.shape[0], 10000, replace=False)

I will use a subset of data (sample of rows) in the following four techniques because they either use tons of memory (kernel PCA) or are time-consuming (the others).
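To get a feel for the memory cost of kernel PCA: it works on an n x n kernel matrix, so memory grows with the square of the number of rows. The sketch below is back-of-the-envelope arithmetic only, assuming a dense float64 matrix; `kernel_matrix_gib` is a hypothetical helper and the 100,000-row figure is just an illustrative size.

```python
def kernel_matrix_gib(n_samples, bytes_per_value=8):
    """Size of a dense n_samples x n_samples matrix in GiB (float64 by default)."""
    return n_samples ** 2 * bytes_per_value / 1024 ** 3

# The 10,000-row sample used here versus a hypothetical 100,000-row data set
print(f"{kernel_matrix_gib(10_000):.2f} GiB")   # 0.75 GiB
print(f"{kernel_matrix_gib(100_000):.2f} GiB")  # 74.51 GiB
```

Ten times more rows means a hundred times more memory, which is why sampling is the practical choice here.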

## Kernel PCA

In [None]:
from sklearn.decomposition import KernelPCA

kernel = KernelPCA(n_components=2, kernel="rbf", n_jobs=-1, copy_X=False)
draw_plot_2d(decompose=kernel, subset=sample_ids)

## t-SNE

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE()
draw_plot_2d(decompose=tsne, subset=sample_ids)

## Isomap

In [None]:
from sklearn.manifold import Isomap

iso = Isomap()
draw_plot_2d(decompose=iso, subset=sample_ids)

## Locally linear embedding (LLE)

In [None]:
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
draw_plot_2d(decompose=lle, subset=sample_ids)

Unfortunately, since this is a synthetic dataset, these plots don't reveal any clear structure.

# Can dimension reduction improve or hurt our prediction performance?

In [None]:
from sklearn.metrics import log_loss
from sklearn.ensemble import RandomForestClassifier

def train_rf_with_decompose(decompose=None, 
                            subset=None,
                            X_train=X_train,
                            y_train=y_train):
    
    if subset is not None:
        X_train = X_train[subset, :]
        y_train = y_train[subset]
    
    if decompose is None:
        # if no decomposition, we use the original one
        X_train_transformed = X_train
        X_valid_transformed = X_valid
    else:
        # transform training set and valid set
        X_train_transformed = decompose.transform(X_train)
        X_valid_transformed = decompose.transform(X_valid)
    
    # train a random forest
    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, 
                                random_state=42)
    rf.fit(X_train_transformed, y_train)
    pred = rf.predict_proba(X_valid_transformed)
    
    return log_loss(y_valid, pred)

Train a random forest with the original dataset:

In [None]:
train_rf_with_decompose()
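The helper above returns the multiclass log loss, so lower is better. As a rough illustration of what that number means, the sketch below computes it by hand on made-up probabilities and compares it with `sklearn.metrics.log_loss`; `proba_demo` and `y_true_demo` are hypothetical values, not model output.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up predicted probabilities for 3 samples and 3 classes (rows sum to 1)
proba_demo = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])
y_true_demo = np.array([0, 1, 2])

# Log loss = mean negative log of the probability given to the true class
true_class_proba = proba_demo[np.arange(len(y_true_demo)), y_true_demo]
manual_loss = -np.mean(np.log(true_class_proba))

print(manual_loss, log_loss(y_true_demo, proba_demo))
```

A confident wrong prediction is penalized heavily, because the log of a probability near zero is a large negative number.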

## Use PCA as preliminary reduction

We usually decide the number of components by finding the "elbow" of the explained variance curve.

In [None]:
pca_full = PCA(n_components=75).fit(X_train)

plt.plot(pca_full.explained_variance_ratio_.cumsum())
plt.hlines(0.95, 0.1, 51, "black", "--")
plt.vlines(50, 0.05, 0.95, "black", "--")
plt.xlabel("number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.xlim(0.5, 80)
plt.ylim(0.1, 1);
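Instead of reading the component count off the plot, it can also be computed from the cumulative explained variance. The sketch below does this on synthetic stand-in data (not the competition data), assuming a 95% threshold; `X_demo`, `pca_demo` and `n_keep` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 500 rows, 20 features with shrinking scales,
# so the first few directions carry most of the variance
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)

pca_demo = PCA().fit(X_demo)
cumulative = pca_demo.explained_variance_ratio_.cumsum()

# Index of the first cumulative ratio that reaches 0.95, turned into a count
n_keep = int(np.searchsorted(cumulative, 0.95)) + 1
print(n_keep, cumulative[n_keep - 1])

# scikit-learn can make the same choice itself when n_components is a float in (0, 1)
pca_auto = PCA(n_components=0.95).fit(X_demo)
print(pca_auto.n_components_)
```

A fixed variance threshold is a simpler rule than a visual elbow, and the two do not always agree, so it is worth checking the plot as well.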

In [None]:
# The plot marks about 95% of the variance at 50 components,
# so fit a separate PCA that keeps only those 50
pca_50 = PCA(n_components=50, random_state=42).fit(X_train)
train_rf_with_decompose(decompose=pca_50)

Wow! Keeping 50 of the 75 components gives us a 33% smaller dataset and slightly better performance. Let's apply PCA to our training and validation sets.

In [None]:
X_train = pca_50.transform(X_train)
X_valid = pca_50.transform(X_valid)

Now that we have a smaller data set, we can train a more complex model.

In [None]:
rf_final = RandomForestClassifier(n_estimators=500, max_depth=15,
                                  n_jobs=-1, random_state=42).fit(X_train, y_train)

In [None]:
pred = rf_final.predict_proba(X_valid)
log_loss(y_valid, pred)

Make predictions on the test set. The test features need the same PCA projection as the training data.

In [None]:
test = pd.read_csv("../input/tabular-playground-series-jun-2021/test.csv")
test = test.iloc[:, 1:].values
# Project the test features with the PCA fitted on the training set
test_preds = rf_final.predict_proba(pca_50.transform(test))

In [None]:
sub = pd.read_csv("../input/tabular-playground-series-jun-2021/sample_submission.csv")
sub.iloc[:, 1:] = test_preds
sub.to_csv("submission.csv", index=False)