# Prologue

The challenge of using an SVM on a dataset this big is twofold:

- long runtime
- large memory consumption, especially with kernels other than 'linear'.

This notebook is my shot at using an SVM. I work around the above challenges by using the Nystroem transformation and LinearSVC in scikit-learn.

**Why/what Nystroem transformation?**

Kernel methods usually require building a kernel matrix of size $n \times n$, where $n$ is the number of training samples. When $n$ is very large, as in this competition, that matrix easily exceeds the available memory.
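
To get a feel for the scale, here is a rough back-of-the-envelope estimate. The row count `n_samples = 600_000` is an assumed, illustrative figure rather than something stated above, and the estimate ignores any overhead beyond the raw float64 values.

```python
# Back-of-the-envelope memory estimate for a dense kernel matrix.
# ASSUMPTION: n_samples = 600_000 is an illustrative figure, not taken from the text.
n_samples = 600_000
n_components = 1_000      # number of Nystroem components used later in this notebook
bytes_per_float = 8       # float64

full_kernel_bytes = n_samples * n_samples * bytes_per_float
nystroem_bytes = n_samples * n_components * bytes_per_float

print(f"full n x n kernel : {full_kernel_bytes / 1e12:.2f} TB")
print(f"n x m feature map : {nystroem_bytes / 1e9:.2f} GB")
```

Under these assumptions the full kernel matrix needs terabytes, while an $n \times 1000$ feature map needs only a few gigabytes.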

Rather than using all the data, the Nystroem transformation approximates the kernel using a sample of the data. This makes it possible to use kernels without running out of memory when $n$ is very large.
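
The idea can be sketched in plain NumPy. This is a simplified illustration of the Nystroem method on hypothetical toy data (`X_toy`, `gamma = 0.1` and `m = 100` are made-up values), not scikit-learn's exact implementation: sample `m` rows, compute the kernel only against those rows, and build a feature map `Z` whose inner products approximate the full kernel.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Exact RBF kernel matrix: exp(-gamma * ||a - b||^2)."""
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 5))   # hypothetical toy data
gamma = 0.1
m = 100                             # number of sampled rows ("landmarks")

# 1. Sample m rows to act as landmarks.
landmarks = X_toy[rng.choice(len(X_toy), size=m, replace=False)]

# 2. Kernel among landmarks (m x m) and between all rows and landmarks (n x m).
K_mm = rbf_kernel(landmarks, landmarks, gamma)
K_nm = rbf_kernel(X_toy, landmarks, gamma)

# 3. Feature map Z = K_nm @ K_mm^(-1/2), so that Z @ Z.T approximates K.
eigvals, eigvecs = np.linalg.eigh(K_mm)
eigvals = np.maximum(eigvals, 1e-12)          # guard against tiny eigenvalues
Z = K_nm @ (eigvecs / np.sqrt(eigvals)) @ eigvecs.T

# Compare against the exact kernel (only feasible because the toy data is small).
K_exact = rbf_kernel(X_toy, X_toy, gamma)
rel_error = np.linalg.norm(K_exact - Z @ Z.T) / np.linalg.norm(K_exact)
print(f"feature map shape: {Z.shape}, relative error: {rel_error:.3f}")
```

Only the $n \times m$ and $m \times m$ matrices are needed to build `Z`; the full $n \times n$ matrix is computed here purely to check the approximation.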

**Why/what LinearSVC?**

LinearSVC is a variant of SVC available in sklearn that scales well to large data. However, it only supports a linear kernel, i.e. no kernel trick. Applying the Nystroem transformation before feeding the data into LinearSVC is approximately the same as using SVC with a kernel, while working around the challenges mentioned above.
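
As a small sanity check of that claim, the sketch below compares three models on a hypothetical toy dataset (two concentric circles from `make_circles`). The dataset, `gamma = 1.0` and `n_components = 100` are illustrative assumptions and have nothing to do with the competition data.

```python
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

# Hypothetical toy data: two concentric circles, not linearly separable.
X_toy, y_toy = make_circles(n_samples=1000, noise=0.05, factor=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

models = {
    "LinearSVC only": LinearSVC(),
    "Nystroem + LinearSVC": make_pipeline(
        Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
        LinearSVC(),
    ),
    "SVC (exact rbf)": SVC(kernel="rbf", gamma=1.0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in models.items()}
for name, score in scores.items():
    print(f"{name:22s} accuracy = {score:.3f}")
```

On data like this, the plain linear model should do little better than chance, while the Nystroem pipeline should land close to the exact RBF SVC.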

In [None]:
!pip install --upgrade scikit-learn scikit-learn-intelex --progress-bar off >> pip.log

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearnex import patch_sklearn
patch_sklearn()

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
ss = pd.read_csv('../input/tabular-playground-series-nov-2021/sample_submission.csv')

X = train_df.drop(['target', 'id'], axis = 1).values
y = train_df['target'].values
X_test = test_df.drop('id', axis = 1).values

del train_df, test_df

# Tune hyperparams and fit model

I use Bayesian optimization as provided by the skopt package rather than grid search because:
1. Bayesian optimization can often find better parameters in the same number of iterations, so it **saves time**.
2. I only need to define sane ranges for the hyperparameters, which is **much easier** than pre-specifying candidate values for a grid search.
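
When a range spans several orders of magnitude, as the ranges for `C` and `gamma` below do, it helps to sample on a log scale. The NumPy sketch below only illustrates what a log-uniform prior means; it is not skopt's internal code, and the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-3, 1e3   # same bounds as the C and gamma ranges in this notebook

# Log-uniform: draw uniformly in log10 space, then exponentiate.
log_uniform = 10 ** rng.uniform(np.log10(low), np.log10(high), size=10_000)
# Plain uniform over the same interval, for comparison.
uniform = rng.uniform(low, high, size=10_000)

print("share of draws below 1.0")
print(f"  log-uniform: {np.mean(log_uniform < 1.0):.2f}")
print(f"  uniform    : {np.mean(uniform < 1.0):.4f}")
```

With a log-uniform prior, about half of the draws fall below 1.0, whereas a plain uniform prior would almost never explore small values.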

In [None]:
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.kernel_approximation import Nystroem
from skopt import BayesSearchCV, space, plots

svm_pipeline = make_pipeline(StandardScaler(),
                             Nystroem(kernel = 'rbf', n_components = 1_000),
                             LinearSVC(max_iter = 100))

params = {
    "linearsvc__C": space.Real(1e-3, 1e3, prior = 'log-uniform'),
    "nystroem__gamma": space.Real(1e-3, 1e3, prior = 'log-uniform')
}

bs = BayesSearchCV(svm_pipeline, params, n_iter = 50, cv = 3, scoring = 'roc_auc',
                   verbose = 3, refit = False)
bs.fit(X, y)

svm_pipeline.set_params(**bs.best_params_)
svm_pipeline.fit(X, y)

In [None]:
# Variations in loss w.r.t hyperparameters
plots.plot_objective(bs.optimizer_results_[0],
                     dimensions=["linearsvc__C", 'nystroem__gamma'],
                     n_minimum_search=int(1e8))
plt.show()

# Make SVM able to output probabilities

sklearn's SVMs do not give probabilities by default. With `SVC`, the easiest way around this is to set the argument `probability = True`, but `LinearSVC` has no such argument. So I use `CalibratedClassifierCV` instead, as follows.

In [None]:
from sklearn.calibration import CalibratedClassifierCV

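# cv = 'prefit' because my model is already fitted and I don't want to
# refit it using cross-validation. Note that the calibrator is then fitted
# on the same data as the model, which can make probabilities overconfident.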
calib = CalibratedClassifierCV(svm_pipeline, method = 'isotonic', cv = 'prefit')

simple_submit = calib.fit(X, y).predict_proba(X_test)[:, 1]

ss['target'] = simple_submit
ss.to_csv('submission.csv', index = False)
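
As a possible alternative, `CalibratedClassifierCV` can run the cross-validation itself, so that the calibrator is always fitted on data the underlying model has not seen. The sketch below is a minimal illustration on hypothetical toy data from `make_classification`, with `method = 'sigmoid'` and `cv = 3` chosen arbitrarily. It is not what this notebook does above.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the competition data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# With an integer cv, each fold fits a fresh LinearSVC on one part of the data
# and fits the calibrator on the held-out part.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
calibrated.fit(X_toy, y_toy)

proba = calibrated.predict_proba(X_toy[:5])
print(proba.round(3))   # each row: [P(class 0), P(class 1)]
```

The trade-off is runtime: the model is fitted once per fold, which may be expensive on a dataset as large as this competition's.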

# Epilogue

There you go! I hope you learned something. Now you know a workaround for using kernels other than linear in an SVM with this competition's dataset. Enjoy!

Feel free to fork and upvote. Keep learning and happy data-sciencing!