# <center>Tabular Playground Series - June/2021</center>
## <center>LightAutoML with KNN Features</center>
---

- This notebook builds on [@alexryzhkov's](https://www.kaggle.com/alexryzhkov) notebook [LightAutoML baseline TPS June 2021](https://www.kaggle.com/alexryzhkov/lightautoml-baseline-tps-june-2021) and the LightAutoML documentation.
- It uses the KNN features provided by [@melanie7744's](https://www.kaggle.com/melanie7744) notebook [TPS6-Boost your score with KNN features](https://www.kaggle.com/melanie7744/tps6-boost-your-score-with-knn-features).


My other notebooks in this competition:
- [Tabular Playground Series - June/2021: Starter - EDA + Base LightGBM](https://www.kaggle.com/jonaspalucibarbosa/tps06-21-starter-eda-base-lgbm)
- [Tabular Playground Series - June/2021: Simple Neural Network with Keras](https://www.kaggle.com/jonaspalucibarbosa/tps06-21-simple-nn-with-keras)
- [Tabular Playground Series - June/2021: Keras Neural Network with Embedding Layer](https://www.kaggle.com/jonaspalucibarbosa/tps06-21-keras-nn-with-embedding)
- [Tabular Playground Series - June/2021: Wide and Deep Neural Network with Keras](https://www.kaggle.com/jonaspalucibarbosa/tps06-21-wide-and-deep-nn-w-keras)
- [Tabular Playground Series - June/2021: Keras Neural Network with Skip Connections](https://www.kaggle.com/jonaspalucibarbosa/tps06-21-keras-nn-with-skip-connections)

In [None]:
%pip install -U lightautoml

## Importing Libraries and Datasets

In [None]:
import pandas as pd       
import matplotlib as mat
import matplotlib.pyplot as plt    
import numpy as np
import seaborn as sns
%matplotlib inline

import random
import os
from numpy.random import seed

from sklearn import metrics

from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv', index_col = 'id')
#Y_train = df_train['target'].copy()
#X_train = df_train.copy().drop('target', axis = 1)

X_test = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv', index_col = 'id')

In [None]:
# Load the precomputed KNN features (NumPy arrays, one row per sample)
train_knn = np.load("../input/tps6-boost-your-score-with-knn-features/add_feat_train.npy")
test_knn = np.load("../input/tps6-boost-your-score-with-knn-features/add_feat_test.npy")

# Reuse the id indexes so that pd.concat(axis=1) aligns rows by id
knn_columns = ['knn_{0:d}'.format(i) for i in range(1, 10)]
train_knn = pd.DataFrame(train_knn, index = df_train.index, columns = knn_columns)
test_knn = pd.DataFrame(test_knn, index = X_test.index, columns = knn_columns)

df_train = pd.concat([df_train, train_knn], axis=1)
X_test = pd.concat([X_test, test_knn], axis=1)
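
`pd.concat(..., axis=1)` aligns rows on index labels rather than on position, so the KNN frames must carry the same `id` index as the data they are joined to. A minimal sketch on made-up toy frames (the names `left`, `right_bad`, and `right_ok` are illustrative and not part of this notebook) shows what happens when the labels do not match:

```python
import pandas as pd

# Toy frame indexed by id, standing in for a data set whose ids start at 5.
left = pd.DataFrame({"feature_0": [10, 20, 30]},
                    index=pd.Index([5, 6, 7], name="id"))

# A default RangeIndex (0, 1, 2) shares no labels with `left`.
right_bad = pd.DataFrame({"knn_1": [0.1, 0.2, 0.3]})

# Reusing the index of `left` keeps the rows paired up.
right_ok = pd.DataFrame({"knn_1": [0.1, 0.2, 0.3]}, index=left.index)

misaligned = pd.concat([left, right_bad], axis=1)
aligned = pd.concat([left, right_ok], axis=1)

print(misaligned.shape)  # rows from both indexes are kept and padded with NaN
print(aligned.shape)
```

Checking for unexpected NaN values or a changed row count right after the concatenation is a cheap way to catch this kind of mismatch.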

In [None]:
# Convert the labels 'Class_1'..'Class_9' to integers 0..8
df_train['target'] = df_train['target'].str.slice(start=6).astype(int) - 1
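
The label conversion can be illustrated on a small made-up series. This is only a sketch; the names `labels`, `encoded`, and `decoded` are illustrative:

```python
import pandas as pd

labels = pd.Series(["Class_1", "Class_9", "Class_3"])

# "Class_" is 6 characters long, so slicing from position 6 leaves the digit.
encoded = labels.str.slice(start=6).astype(int) - 1
print(encoded.tolist())  # [0, 8, 2]

# Reverse mapping, handy when naming prediction columns later on.
decoded = "Class_" + (encoded + 1).astype(str)
print(decoded.tolist())
```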

In [None]:
df_train

In [None]:
X_test

## LightAutoML

In [None]:
N_THREADS = 4 # Number of threads for LightGBM and linear models
N_FOLDS = 5 # Number of folds for AutoML
RANDOM_STATE = 42 # Fixed random state for reproducibility
TEST_SIZE = 0.2 # Test size for metric check
TIMEOUT = 8 * 3600 # Time limit in seconds for the AutoML run
TARGET_NAME = 'target' # Target column name

In [None]:
# Seed the NumPy and Python random generators for reproducible results
seed(RANDOM_STATE)
random.seed(RANDOM_STATE)
os.environ['PYTHONHASHSEED'] = str(RANDOM_STATE)
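
As a quick illustration of what the seeding buys, the sketch below reseeds both generators and compares the draws (the helper `draw` is hypothetical). Note that `PYTHONHASHSEED` is read when the interpreter starts, so setting it inside a running session only affects processes launched afterwards.

```python
import random
import numpy as np

def draw(seed_value):
    """Seed both generators, then return one draw from each (hypothetical helper)."""
    np.random.seed(seed_value)
    random.seed(seed_value)
    return np.random.rand(3).tolist(), random.random()

first = draw(42)
second = draw(42)
print(first == second)  # True: the same seed reproduces the same draws
```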

In [None]:
task = Task('multiclass')

In [None]:
roles = {'target': TARGET_NAME}

In [None]:
%%time

automl = TabularUtilizedAutoML(task = task, 
                               timeout = TIMEOUT,
                               cpu_limit = N_THREADS,
                               general_params = {
                                   'use_algos': [['linear_l2', 'lgb_tuned', 'cb_tuned'], ['lgb_tuned', 'cb_tuned']],
                                   'return_all_predictions': True,
                                   'weighted_blender_max_nonzero_coef': 0.0
                               },
                               tuning_params = {'max_tuning_time': 3600},
                               reader_params = {'n_jobs': N_THREADS}
                               )
oof_pred = automl.fit_predict(df_train, roles = roles)
print('oof_pred:\n{}\nShape = {}'.format(oof_pred[:10], oof_pred.shape))

In [None]:
print(oof_pred.shape)
oof_pred

In [None]:
test_pred = automl.predict(X_test)
print('Prediction for test data:\n{}\nShape = {}'.format(test_pred[:10], test_pred.shape))

In [None]:
print('Check scores...')
print('OOF score: {}'.format(metrics.log_loss(df_train[TARGET_NAME].values, oof_pred.data)))
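
The log loss computed above is the mean negative log of the probability assigned to the true class. A minimal NumPy sketch on made-up probabilities follows; the helper `multiclass_log_loss` is hypothetical, and the clipping constant `eps` is an assumption used to avoid taking the log of zero:

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability of the true class (hypothetical helper)."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1.0)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalise each row
    picked = proba[np.arange(len(y_true)), np.asarray(y_true)]
    return float(-np.log(picked).mean())

# Three toy samples with three classes each.
y_true = [0, 2, 1]
proba = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3]]
print(multiclass_log_loss(y_true, proba))
```

A model that always predicts uniform probabilities over 9 classes would score `log(9)`, roughly 2.197, which gives a rough reference point for reading the out-of-fold score.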

## Submission

In [None]:
train_oof = pd.DataFrame(oof_pred.data, columns = ['Class_{}'.format(i) for i in range(1, 10)])
train_oof

In [None]:
pred_test = pd.DataFrame(test_pred.data, columns = ['Class_{}'.format(i) for i in range(1, 10)])
pred_test

In [None]:
train_oof.to_csv('lightautoml_train_oof.csv', index=False)
train_oof

In [None]:
# Copy so that pred_test itself is left unchanged, and put the id column first
output = pred_test.copy()
output.insert(0, 'id', X_test.index)
output.to_csv('submission.csv', index=False)

output
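
Before uploading, a few sanity checks on the submission frame can catch simple mistakes. The sketch below runs on a toy frame. The helper `check_submission` is hypothetical, and the requirement that each row of probabilities sums to 1 is an assumption, not a rule taken from the competition:

```python
import numpy as np
import pandas as pd

CLASS_COLUMNS = [f"Class_{i}" for i in range(1, 10)]

def check_submission(sub, expected_ids):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    problems = []
    if set(sub.columns) != set(["id"] + CLASS_COLUMNS):
        problems.append("unexpected columns")
    elif not np.array_equal(sub["id"].to_numpy(), np.asarray(expected_ids)):
        problems.append("ids do not match the test set")
    elif not np.allclose(sub[CLASS_COLUMNS].sum(axis=1), 1.0, atol=1e-3):
        problems.append("class probabilities do not sum to 1")
    return problems

# Toy submission: two rows of uniform probabilities.
toy = pd.DataFrame(np.full((2, 9), 1 / 9), columns=CLASS_COLUMNS)
toy.insert(0, "id", [200000, 200001])
print(check_submission(toy, [200000, 200001]))
```

In this notebook the equivalent call would be `check_submission(output, X_test.index)`.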