## This is my MEGA Auto notebook - AutoEDA + new features (read from MLJAR logs) + AutoML (LightAutoML)

I am really interested in the magic of Auto solutions. They get better every day. Recently (TPS05) I spent a lot of time experimenting with AutoML and trying to find the best solution. These tools are ideal for rapid prototyping and for building strong models quickly. In this notebook I am going to show you a full Auto pipeline:

<div class="alert alert-success">
  <strong>Notebook scope:</strong>
    <ul>
        <li>AutoEDA using:</li>
        <ul>
            <li>sweetviz</li>
            <li>dataprep</li>
        </ul>
        <li>New features inspired by MLJAR - K-Means Features</li>
        <li>AutoML using LightAutoML</li>
        <li>Data experiments - PowerTransformer (Yeo-Johnson Transform)</li>
    </ul>
</div>

#### I appreciate any feedback and support. Thank you, Kagglers!

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
from matplotlib import pyplot

import warnings
warnings.filterwarnings("ignore")

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv', index_col = 'id')
test = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv', index_col = 'id')

TARGET_NAME = 'target'

train[TARGET_NAME] = train[TARGET_NAME].str.slice(start=6).astype(int) - 1

all_df = pd.concat([train, test]).drop('target', axis = 1).reset_index(drop=True)

all_features = all_df.columns

## 1. AutoEDA

Let's do a very fast EDA. I think it is really powerful in this competition. In about 30 seconds (just 4 lines of code) you get a full overview of the data in the dataset. No more code is needed at this point.

**Click on the report to find more details about each feature.**

### 1A. SWEETVIZ

In [None]:
!pip install sweetviz -q
import sweetviz as sv

We run a comparison analysis. Sweetviz provides three report methods:

Use this version when you have 2 data sets to compare together (e.g. Train versus Test). This is a very useful report!
> **compare(source: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
>             compare: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
>             target_feat: str = None,
>             feat_cfg: FeatureConfig = None,
>             pairwise_analysis: str = 'auto')**


Use this when you want to compare two subpopulations within the same dataset. This is also a very useful report, especially when coupled with target feature analysis!
>**compare_intra(source_df: pd.DataFrame,
                  condition_series: pd.Series,
                  names: Tuple[str, str],
                  target_feat: str = None,
                  feat_cfg: FeatureConfig = None,
                  pairwise_analysis: str = 'auto')**

Use this version when there is only a single dataset to analyze and you do not wish to compare subpopulations (e.g. male vs female).
>**analyze(source: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
        target_feat: str = None,
        feat_cfg: FeatureConfig = None,
        pairwise_analysis: str = 'auto')**

The best example I found: https://colab.research.google.com/drive/1-md6YEwcVGWVnQWTBirQSYQYgdNoeSWg?usp=sharing#scrollTo=oMV8HHX4t1aA

In [None]:
feature_config = sv.FeatureConfig(force_num=["target"])

tps_comparison_report = sv.compare([train,'Train'], [test,'Test'], target_feat='target', feat_cfg = feature_config)
tps_comparison_report.show_notebook(w=840, h=8000, scale=0.8)

### 1B. DATAPREP

In [None]:
!pip install -U dataprep -q

In [None]:
from dataprep.eda import create_report
from dataprep.eda import plot_diff

In [None]:
create_report(all_df).show()

In [None]:
plot_diff([train.drop('target', axis = 1),test])

## 2. EXPERIMENTAL STEP - K-Means Features (inspired by MLJAR)


In [None]:
# This part of code is from MLJAR AutoML - https://mljar.com/automated-machine-learning/k-means-features/

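from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

# Heuristic cluster count: about 8 per decade of rows, at least 8, capped at the number of columns
n_clusters = min(max(8, int(np.log10(all_df.shape[0]) * 8)), all_df.shape[1])

# K-Means is distance-based, so standardize the features first
scale = StandardScaler(copy=True, with_mean=True, with_std=True)
df_all_scaled = scale.fit_transform(all_df)

# Fit on train and test together (the target is not involved)
kmeans = MiniBatchKMeans(n_clusters=n_clusters, init="k-means++")
kmeans.fit(df_all_scaled)

# One distance column per cluster center, plus the assigned cluster label
n_clusters = kmeans.cluster_centers_.shape[0]
new_features = [f"Dist_Cluster_{i}" for i in range(n_clusters)]
new_features += ["Cluster"]

distances = kmeans.transform(df_all_scaled)  # distance from each row to every center
clusters = kmeans.predict(df_all_scaled)     # index of the nearest center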

In [None]:
all_df = pd.concat([pd.DataFrame(all_df, columns = all_features), pd.DataFrame(distances, columns = new_features[:-1])], axis = 1)
all_df[new_features[-1]] = pd.Series(clusters) 

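train_df = all_df[:len(train)].copy()
# all_df was re-indexed from 0 while train is indexed by id, so assign the target by position, not by index
train_df[TARGET_NAME] = train[TARGET_NAME].values
test_df = all_df[len(train):].copy()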
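The K-Means feature step above can be condensed into a small reusable helper. The sketch below is only an illustration on synthetic count data: the names `n_clusters_heuristic` and `add_kmeans_features`, the toy frame, and the 300,000 x 75 shape passed to the heuristic are assumptions made for this example, and the MLJAR implementation may differ in details. Unlike the cell above, it sets `random_state` so repeated runs give the same clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler


def n_clusters_heuristic(n_rows, n_cols):
    """Same rule as in the notebook: about 8 clusters per decade of rows,
    at least 8, capped at the number of columns."""
    return min(max(8, int(np.log10(n_rows) * 8)), n_cols)


def add_kmeans_features(fit_df, apply_df, random_state=42):
    """Fit scaler and k-means on fit_df, then append distance-to-center
    columns and the cluster label to a copy of apply_df (hypothetical helper)."""
    k = n_clusters_heuristic(*fit_df.shape)
    scaler = StandardScaler().fit(fit_df)
    kmeans = MiniBatchKMeans(n_clusters=k, init="k-means++", n_init=3,
                             random_state=random_state)
    kmeans.fit(scaler.transform(fit_df))

    scaled = scaler.transform(apply_df)
    dist = pd.DataFrame(
        kmeans.transform(scaled),  # distance from each row to every center
        columns=[f"Dist_Cluster_{i}" for i in range(k)],
        index=apply_df.index,
    )
    out = pd.concat([apply_df, dist], axis=1)
    out["Cluster"] = kmeans.predict(scaled)  # index of the nearest center
    return out


# Toy data standing in for the competition features (assumption: small count-like values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.poisson(2.0, size=(500, 12)),
                   columns=[f"feature_{i}" for i in range(12)])
toy_out = add_kmeans_features(toy, toy)

print(n_clusters_heuristic(300_000, 75))  # cluster count for a hypothetical 300k x 75 frame
print(toy_out.shape)
```

Passing the same frame as `fit_df` and `apply_df` mirrors the notebook, which fits on train and test together. Passing only the training rows as `fit_df` would be the stricter alternative.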

In [None]:
train_df.head(5)

In [None]:
test_df.head(5)
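The scope list at the top mentions a PowerTransformer (Yeo-Johnson) experiment that is not shown in the cells of this notebook. A minimal sketch of what such an experiment might look like is given below. It is an assumption about the approach, not the author's actual code, and it runs on synthetic skewed count data (`toy_train`, `toy_test`) rather than on the competition files.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed, non-negative counts (stand-in for the real features)
rng = np.random.default_rng(0)
cols = ["f0", "f1", "f2"]
toy_train = pd.DataFrame(rng.negative_binomial(1, 0.2, size=(1000, 3)), columns=cols).astype(float)
toy_test = pd.DataFrame(rng.negative_binomial(1, 0.2, size=(400, 3)), columns=cols).astype(float)

# Yeo-Johnson accepts zeros and negative values (Box-Cox does not);
# standardize=True rescales each transformed column to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson", standardize=True)

# Learn one lambda per column on the training rows only, then reuse it on the test rows
train_yj = pd.DataFrame(pt.fit_transform(toy_train), columns=cols, index=toy_train.index)
test_yj = pd.DataFrame(pt.transform(toy_test), columns=cols, index=toy_test.index)

print("lambdas:", pt.lambdas_)
print("skew before:", toy_train.skew().round(2).to_dict())
print("skew after: ", train_yj.skew().round(2).to_dict())
```

Tree-based models such as LightGBM and CatBoost are largely insensitive to monotonic transforms, so any gain from this step would be expected mainly in the linear or distance-based parts of a pipeline, such as the K-Means features above.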

## 3. AUTOML - LightAutoML

In [None]:
N_THREADS = 4 # threads cnt for lgbm and linear models
N_FOLDS = 5 # folds cnt for AutoML
RANDOM_STATE = 42 # fixed random state for various reasons
TEST_SIZE = 0.2 # Test size for metric check
TIMEOUT = 8 * 3600 # Time in seconds for automl run

In [None]:
!pip install -U lightautoml -q

In [None]:
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task

import pandas as pd

In [None]:
task = Task('multiclass',)

roles = {
    'target': TARGET_NAME,
    'drop': ['id'],
}

Let's let LightAutoML work its magic :)

In [None]:
automl = TabularUtilizedAutoML(task = task, 
                               timeout = TIMEOUT,
                               cpu_limit = N_THREADS,
                               general_params = {'use_algos': [['lgb_tuned', 'cb_tuned'], ['lgb_tuned', 'cb_tuned']]},
                               tuning_params = {'max_tuning_time': 1200},
                               reader_params = {'n_jobs': N_THREADS},
                               max_runs_per_config=10
                               )
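# Note: this fits on the original `train` frame; pass `train_df` here (and `test_df` to predict) to use the K-Means features from step 2
oof_pred = automl.fit_predict(train, roles = roles)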
print('oof_pred:\n{}\nShape = {}'.format(oof_pred[:10], oof_pred.shape))

In [None]:
test_pred = automl.predict(test)
print('Prediction for test data:\n{}\nShape = {}'.format(test_pred[:10], test_pred.shape))

print('Check scores...')
# Score against the same frame the model was fitted on, so labels and OOF rows line up by position
print('OOF score: {}'.format(log_loss(train[TARGET_NAME].values, oof_pred.data)))
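To make the OOF score easier to interpret, here is a small sketch of how multiclass log loss is computed and what a trivial baseline scores. The helper `multiclass_log_loss` and the random toy predictions are illustrative assumptions. The only library call used is `sklearn.metrics.log_loss`, the same one used in the cell above.

```python
import numpy as np
from sklearn.metrics import log_loss


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    proba = np.clip(proba, eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalize rows after clipping
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))


# Toy labels and probabilities for 9 classes (random, not real model output)
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 9, size=200)
proba_toy = rng.dirichlet(np.ones(9), size=200)

manual = multiclass_log_loss(y_toy, proba_toy)
reference = log_loss(y_toy, proba_toy, labels=np.arange(9))
print(manual, reference)

# Baseline: predicting 1/9 for every class scores ln(9) regardless of the labels
uniform = np.full((200, 9), 1 / 9)
baseline = multiclass_log_loss(y_toy, uniform)
print(baseline, np.log(9))
```

Any OOF score well below ln(9) ≈ 2.197 therefore means the model is doing better than a uniform guess over the nine classes.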

## 4. SUBMISSION

In [None]:
submission = pd.read_csv('../input/tabular-playground-series-jun-2021/sample_submission.csv')

submission.iloc[:, 1:] = test_pred.data
submission.to_csv("lightautoml_submission.csv", index = False)
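Before uploading, it can be worth running a quick sanity check on the submission frame. The sketch below uses a hypothetical helper `check_submission` written for this example. It assumes the layout used in this notebook, an `id` column followed by `Class_1` to `Class_9`, and it is demonstrated on a toy frame rather than on the real file.

```python
import numpy as np
import pandas as pd


def check_submission(sub, n_classes=9, atol=1e-4):
    """Return a dict of basic checks for a probability submission (hypothetical helper)."""
    expected_cols = ["id"] + [f"Class_{i + 1}" for i in range(n_classes)]
    columns_ok = list(sub.columns) == expected_cols
    proba = sub.iloc[:, 1:].to_numpy(dtype=float)
    return {
        "columns_ok": columns_ok,
        "finite": bool(np.isfinite(proba).all()),
        "in_range": bool(((proba >= 0) & (proba <= 1)).all()),
        "rows_sum_to_one": bool(np.allclose(proba.sum(axis=1), 1.0, atol=atol)),
    }


# Toy submission with valid probabilities (random, not real predictions)
rng = np.random.default_rng(0)
toy_sub = pd.DataFrame(rng.dirichlet(np.ones(9), size=5),
                       columns=[f"Class_{i + 1}" for i in range(9)])
toy_sub.insert(0, "id", range(5))
good_report = check_submission(toy_sub)

# Break one row on purpose to show a failing check
bad_sub = toy_sub.copy()
bad_sub.loc[0, "Class_1"] += 0.5
bad_report = check_submission(bad_sub)

print(good_report)
print(bad_report)
```

In the notebook, the equivalent call would be `check_submission(submission)` right after the predictions are written into the frame.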

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import itertools

palette = itertools.cycle(sns.color_palette())

plt.figure(figsize=(16, 8))
for i in range(9):
    plt.subplot(3, 3, i+1)
    c = next(palette)
    sns.histplot(submission, x = f'Class_{i+1}', color=c)
plt.suptitle("Class prediction distribution")

In [None]:
submission.drop("id", axis=1).describe().T.style.bar(subset=['mean'], color='#205ff2')\
                            .background_gradient(subset=['std'], cmap='Reds')\
                            .background_gradient(subset=['50%'], cmap='coolwarm')