# Introduction

This notebook pushes scikit-learn to the max.<sup><font color='blue'>[citation needed]</font></sup>

- stacking/blending: the popular thing to do nowadays, apparently. A meta-estimator is fitted on top of the base estimators, using their predictions on held-out folds.
- pipelining: each base model is wrapped in its own pipeline, so different base models can use the preprocessing scheme best suited to them individually.
- simplicity: all the code comes from scikit-learn, our beloved, beginner-friendly Python machine learning library. The code is concise but powerful, with the complex work happening under the hood.
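
To make the stacking idea concrete, here is a minimal sketch of what happens conceptually. It is not scikit-learn's actual implementation. The data is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, `base_models`, `oof` and `meta` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; this is not the competition data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Two illustrative base models
base_models = [LogisticRegression(max_iter=1_000), GaussianNB()]

# Out-of-fold probabilities: every row is predicted by a model
# that did not see that row during fitting
oof = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The meta-estimator is fitted on the out-of-fold predictions
meta = LogisticRegression().fit(oof, y_demo)

print(oof.shape)         # one column per base model
print(meta.coef_.shape)  # one weight per base model
```

`StackingClassifier` wraps these steps in a single estimator and also refits the base models on the full training data, so they can be used at prediction time.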

To speed up scikit-learn, we will use the [scikit-learn-intelex](https://github.com/intel/scikit-learn-intelex) package. It accelerates only a subset of estimators, but it is still worth including here.

In [None]:
!pip install --upgrade scikit-learn scikit-learn-intelex --progress-bar off >> pip.log

In [None]:
import pandas as pd
import numpy as np

from sklearnex import patch_sklearn
patch_sklearn()

# Read data

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
ss = pd.read_csv('../input/tabular-playground-series-nov-2021/sample_submission.csv')

X = train_df.drop(['target', 'id'], axis = 1).values
y = train_df['target'].values
X_test = test_df.drop('id', axis = 1).values

del train_df, test_df

# Build model

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.kernel_approximation import Nystroem
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import StackingClassifier

logistic_pipeline = make_pipeline(FastICA(),
                                  LogisticRegression(C = 10, max_iter = 1_000))

naivebayes_pipeline = make_pipeline(QuantileTransformer(output_distribution = 'normal'),
                                    GaussianNB())

svm_pipeline = make_pipeline(StandardScaler(),
                             LinearSVC(C = 0.001775, max_iter = 100))

rbf_svm_pipeline = make_pipeline(StandardScaler(),
                                 Nystroem(gamma=0.001, n_components=500),
                                 LinearSVC(C=0.005655653341918836, max_iter=100))

estimators = [
    ('logistic', logistic_pipeline),
    ('naivebayes', naivebayes_pipeline),
    ('linear_svm', svm_pipeline),
    ('rbf_svm', rbf_svm_pipeline)
]

meta_clf = StackingClassifier(
    estimators = estimators,
    final_estimator = LogisticRegressionCV(Cs = 20),
    cv = 15
)

In [None]:
# Visualize our pipeline
# You can click the resulting diagrams to see more details
from sklearn import set_config
set_config(display = 'diagram')

meta_clf

# Train model and predict to submit

In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Suppress convergence warnings so they do not pollute the output
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    simple_submit = meta_clf.fit(X, y).predict_proba(X_test)[:, 1]

# Save submissions
ss['target'] = simple_submit
ss.to_csv('submission.csv', index = False)

In [None]:
# Out of curiosity, let's display the weights assigned to each base prediction
meta_clf.final_estimator_.coef_
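
The raw coefficient array carries no labels, so it can help to pair each weight with the name of its base estimator. The sketch below does this on a small synthetic problem. The data and the names `demo_stack`, `names` and `weights` are made up for this illustration. It assumes a binary target and no dropped estimators, in which case each base estimator contributes exactly one column to the meta-features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; this is not the competition data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# A small stack, loosely mirroring the structure used in this notebook
demo_stack = StackingClassifier(
    estimators=[
        ("logistic", LogisticRegression(max_iter=1_000)),
        ("naivebayes", GaussianNB()),
        ("linear_svm", LinearSVC()),
    ],
    final_estimator=LogisticRegression(),
    cv=3,
).fit(X_demo, y_demo)

# Binary target: one meta-feature column per base estimator, in the order given
names = [name for name, _ in demo_stack.estimators]
weights = dict(zip(names, demo_stack.final_estimator_.coef_.ravel()))

for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Keep in mind that the base models feed the meta-estimator on different scales (probabilities from `predict_proba`, margins from `decision_function`), so the weights are not directly comparable as importances.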

# Closing

Well, that was short and quick. I hope you can take away something from it!

Feel free to upvote or fork if you think this notebook is useful or you're interested in modifying it.

Keep learning and happy data sciencing!