# June 2021 TPS

# Introduction

Starting in January this year, the Kaggle competitions team has offered month-long Tabular Playground Series (TPS) competitions. The series aims to bridge the gap between InClass competitions and Featured competitions with friendly, approachable datasets. For June, Kaggle offers a synthetic dataset that is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes of the listing. Although the features are anonymized, they have properties relating to real-world features.

Goal of the competition: Predict 9 classes given 75 features.

Submissions are evaluated using multi-class logarithmic loss. Each row in the dataset has been labeled with one true Class. For each row, you must submit the predicted probabilities that the product belongs to each class label. The formula is:

$ \text{log loss} = - \frac{1}{N} \sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}) $

where **$N$** is the number of rows in the test set, **$M$** is the number of class labels, **$\log$** is the natural logarithm, **$y_{ij}$** is 1 if observation **$i$** is in class **$j$** and 0 otherwise, and **$p_{ij}$** is the predicted probability that observation **$i$** belongs to class **$j$**.

Note: The submitted probabilities for a given product are not required to sum to one; they are rescaled prior to being scored (each row is divided by the row sum). To avoid the extremes of the $\log$ function, predicted probabilities are replaced with $\max(\min(p, 1 - 10^{-15}), 10^{-15})$.
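The metric can be reproduced locally with a few lines of NumPy. The sketch below is an illustration rather than Kaggle's scoring code: the function name `multiclass_log_loss`, the toy arrays, and the order of rescaling before clipping are assumptions made for this example.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Multi-class log loss with row rescaling and clipping.

    y_true: integer class indices, shape (N,)
    probs:  predicted scores or probabilities, shape (N, M)
    """
    probs = np.asarray(probs, dtype=float)
    # Rescale each row so that it sums to one
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Clip away exact 0 and 1 so that log() stays finite
    probs = np.clip(probs, eps, 1 - eps)
    n = probs.shape[0]
    # y_ij is 1 only for the true class, so the inner sum keeps one term per row
    return -np.log(probs[np.arange(n), y_true]).mean()

# Toy data (made up for illustration): 3 rows, 3 classes
y_true = np.array([0, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.2, 0.6, 0.2]])
loss = multiclass_log_loss(y_true, probs)
# Multiplying every row by a constant should not change the score
loss_unnormalized = multiclass_log_loss(y_true, probs * 2)
print(loss, loss_unnormalized)
```

Because of the row rescaling, the second call returns the same value as the first up to floating-point error.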

# 1. EDA

### Imports

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
pio.templates.default = "none"

from umap import UMAP
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


### Load Data

In [None]:
subm = pd.read_csv(r'/kaggle/input/tabular-playground-series-jun-2021/sample_submission.csv', index_col='id')
train = pd.read_csv(r'/kaggle/input/tabular-playground-series-jun-2021/train.csv',  index_col='id')
test = pd.read_csv(r'/kaggle/input/tabular-playground-series-jun-2021/test.csv',  index_col='id')

### Dataset size
* Train: 200000 rows, 75 feature columns and a target column
* Test: 100000 rows and 75 feature columns

In [None]:
print('Train data of shape {}'.format(train.shape))
display(train.head())
print('Test data of shape {}'.format(test.shape))
display(test.head())

In [None]:
target = train.pop('target')

### No missing values in the data 

In [None]:
# info() prints its summary directly and returns None, so display() is not needed
train.info()

In [None]:
display(train.describe())

### Unique values in Features
* Most of the features have the same number of unique values in the train and test data. Only features **15, 28, 46, 59, 60, and 73** have a different number of unique values in the two datasets.

In [None]:
# Number of unique values per feature in the train data
unique_values_train = []
for col in train.columns:
    c = train[col].nunique()
    pc = np.round((100 * (c)/len(train)), 2)            
    dict1 ={
        'Features' : col,
        'unique_train (count)': c,
        #'unique_trian (%)': '{}%'.format(pc)
    }
    unique_values_train.append(dict1)
DF1 = pd.DataFrame(unique_values_train, index=None).sort_values(by='unique_train (count)',ascending=False)
#print(DF1)


# Number of unique values per feature in the test data
unique_values_test = []
for col in test.columns:
    c = test[col].nunique()
    pc = np.round((100 * (c)/len(test)), 2)            
    dict2 ={
        'Features' : col,
        'unique_test (count)': c,
        #'unique_test (%)': '{}%'.format(pc)
    }
    unique_values_test.append(dict2)
DF2 = pd.DataFrame(unique_values_test, index=None).sort_values(by='unique_test (count)',ascending=False)
#print(DF2)

# Join on the feature name so that train and test counts line up row by row
df = DF1.merge(DF2, on='Features')
df

In [None]:
fig = go.Figure(data=[go.Scatter(x=DF1['Features'],
                             y=DF1["unique_train (count)"], mode= 'markers',                             
                             name='Train', marker_color='lightseagreen'),        

                go.Scatter(x=DF2['Features'],
                             y=DF2["unique_test (count)"], mode= 'markers',
                             name='Test', marker_color='lightsalmon')])
fig.update_traces(marker_line_color='black', marker_line_width=1.5, opacity=1)
fig.update_layout(title_text='Unique Values In Each Feature ', 
                  #template='plotly_dark',
                  paper_bgcolor='#f8f0ec',
                  plot_bgcolor='#f8f0ec',
                  width=750, height=500,
                  xaxis_title='Features', yaxis_title='Count',
                  title_font={'color':'black', 'size': 24, 'family': 'sans-serif'})
fig.show()
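The features whose unique-value counts differ between train and test can also be listed directly instead of being read off the plot. This is a minimal sketch on made-up toy frames; the helper name `compare_unique_counts` and the toy columns are hypothetical and not part of the notebook.

```python
import pandas as pd

def compare_unique_counts(train_df, test_df):
    """Return unique-value counts per column and flag columns that differ."""
    counts = pd.DataFrame({
        'unique_train': train_df.nunique(),
        'unique_test': test_df.nunique(),
    })
    counts['differs'] = counts['unique_train'] != counts['unique_test']
    return counts

# Toy frames standing in for the real train/test data
toy_train = pd.DataFrame({'feature_0': [0, 1, 2, 2], 'feature_1': [0, 0, 1, 1]})
toy_test = pd.DataFrame({'feature_0': [0, 1, 1, 1], 'feature_1': [1, 0, 1, 0]})

counts = compare_unique_counts(toy_train, toy_test)
differing = counts[counts['differs']].index.tolist()
print(counts)
print(differing)
```

On the competition data, the same call would be `compare_unique_counts(train, test)`, assuming `target` has already been popped from `train`.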

### Target Distribution

* Target class_6 and class_8 are the dominant target classes
* Target class_5 and class_4 are the fewest target classes

In [None]:
fig = px.histogram(target.to_frame(name='target'), x="target",
                   width=700,
                   height=500,
                   histnorm='percent',
                   template="simple_white"
                   )
# One color per class bar; every entry must be a plain color string
colors = ['#00a08f'] * 9
colors[0] = '#A52A2A'
colors[1] = '#A52A2A'
colors[8] = 'salmon'


fig.update_traces(marker_color=colors, marker_line_color='red',
                  marker_line_width=2.5, opacity=0.5)

fig.update_layout(title="<b>Target Class Distribution</b>",
                  font_family="sans-serif",
                  title_font={'size': 24},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )
                 ).update_xaxes(categoryorder='total descending') # ordering the x-axis values

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()
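The class shares shown in the histogram can be checked numerically with `value_counts`. The sketch below uses a made-up toy series, not the competition target, so the printed percentages are illustrative only.

```python
import pandas as pd

# Toy target column; the real one is the `target` series popped from train
toy_target = pd.Series(
    ['Class_6', 'Class_8', 'Class_6', 'Class_2', 'Class_6', 'Class_8'],
    name='target',
)

# Share of each class in percent, sorted from most to least frequent
share = toy_target.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Running `target.value_counts(normalize=True)` on the real data gives the exact figures behind the bars.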

### Features Distribution (Histograms)
* Category/value zero dominates the features

In [None]:
import holoviews as hv
from holoviews import opts, dim
hv.extension('bokeh')

In [None]:
df_train = hv.Dataset(train)
df_test = hv.Dataset(test)

feat = test.columns[:]
hist1 = df_train.hist(dimension=list(feat), bins=10, adjoin=False)
hist1.opts(opts.Histogram(alpha=0.9, width=300, height=200))
hist1.opts(title='Train data Histograms', fontscale=1.5)

In [None]:
hist2 = df_test.hist(dimension=list(feat), bins=10, adjoin=False)
hist2.opts(opts.Histogram(alpha=0.9, width=300, height=200))
hist2.opts(title='Test data Histograms', fontscale=1.5)
hist2
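How strongly zero dominates each feature can be quantified as the percentage of zero entries per column. This is a small sketch on toy data; the helper `zero_share` and the column names are hypothetical.

```python
import pandas as pd

def zero_share(df):
    """Percentage of zero entries per column, sorted ascending."""
    return (df == 0).mean().mul(100).sort_values()

# Toy frame standing in for the real feature matrix
toy = pd.DataFrame({
    'feature_a': [0, 0, 0, 1],
    'feature_b': [0, 2, 3, 1],
    'feature_c': [0, 0, 5, 1],
})

zeros = zero_share(toy)
# Columns with fewer than 50% zeros
sparse_free = zeros[zeros < 50].index.tolist()
print(zeros)
print(sparse_free)
```

Applied to `train`, the head of this series points at the features with the fewest zeros.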

# 2. Models

* **version_6**: Boosted trees (lgbm, xgboost, catboost)
* **version_7**: LightAutoML with Alexander Ryzhkov's starter code. Thanks Alex!
> The aim is to get the hang of the LightAutoML library.

In [None]:
## Install LightAutoML
!pip install -U lightautoml

### Imports

In [None]:
# Standard python libraries
import os
import time
import re

# Installed libraries
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# LightAutoML imports
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.dataset.roles import NumericRole

### Set params

In [None]:
N_THREADS = 4 # threads cnt for lgbm and linear models
N_FOLDS = 5 # folds cnt for AutoML
RANDOM_STATE = 2021 # fixed random state for various reasons
TEST_SIZE = 0.2 # Test size for metric check
TIMEOUT = 5 * 3600 # Time in seconds for automl run
TARGET_NAME = 'target'

In [None]:
train_data = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv')
# Convert the string labels 'Class_1'..'Class_9' to integers 0..8
train_data[TARGET_NAME] = train_data[TARGET_NAME].str.slice(start=6).astype(int) - 1
train_data.head()
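The label conversion relies on the prefix `Class_` being six characters long, so slicing from position 6 leaves only the class number. A short sketch with toy labels shows the encoding and one possible way to invert it; the variable names are illustrative.

```python
import pandas as pd

# Toy labels in the same format as the competition target
labels = pd.Series(['Class_1', 'Class_9', 'Class_3'])

# 'Class_k' -> k - 1, giving zero-based integer classes
encoded = labels.str.slice(start=6).astype(int) - 1

# Inverse mapping back to the original string labels
decoded = 'Class_' + (encoded + 1).astype(str)
print(encoded.tolist(), decoded.tolist())
```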

In [None]:
test_data = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv')
test_data.head()

In [None]:
submission = pd.read_csv('../input/tabular-playground-series-jun-2021/sample_submission.csv')
submission.head()

### Additional Features (FE)
* Note: This part is not in Alexander Ryzhkov's starter code.
* Trying additional features built by summing pairs of features that have a relatively low percentage of zeros

In [None]:
# Select some features with fewer than 50% zeros and build new interaction features

#train
train_data['50p43'] = train_data['feature_50'] + train_data['feature_43']
train_data['50p54'] = train_data['feature_50'] + train_data['feature_54']
train_data['50p19'] = train_data['feature_50'] + train_data['feature_19']
train_data['50p12'] = train_data['feature_50'] + train_data['feature_12']

# test 
test_data['50p43'] = test_data['feature_50'] + test_data['feature_43']
test_data['50p54'] = test_data['feature_50'] + test_data['feature_54']
test_data['50p19'] = test_data['feature_50'] + test_data['feature_19']
test_data['50p12'] = test_data['feature_50'] + test_data['feature_12']
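Since the same sums are built for train and test, a small helper keeps the two in sync. This is a sketch under assumptions: `add_sum_features` is a hypothetical helper, and the toy frames only mimic the column naming of the real data.

```python
import pandas as pd

def add_sum_features(df, base, others):
    """Return a copy of df with base + other columns, named like '50p43'."""
    out = df.copy()
    base_id = base.split('_')[-1]
    for other in others:
        other_id = other.split('_')[-1]
        out[f'{base_id}p{other_id}'] = out[base] + out[other]
    return out

# Toy frames standing in for train_data and test_data
toy_train = pd.DataFrame({'feature_50': [1, 0, 3],
                          'feature_43': [2, 2, 0],
                          'feature_54': [0, 1, 1]})
toy_test = pd.DataFrame({'feature_50': [4, 1],
                         'feature_43': [0, 5],
                         'feature_54': [2, 2]})

partners = ['feature_43', 'feature_54']
toy_train = add_sum_features(toy_train, 'feature_50', partners)
toy_test = add_sum_features(toy_test, 'feature_50', partners)
print(toy_train.columns.tolist())
```

Using one function for both frames avoids a mismatch between train and test columns when the list of partner features changes.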

## AutoML preset usage
### Setup task and column roles

In [None]:
%%time
task = Task('multiclass')

roles = {
    'target': TARGET_NAME,
    'drop': ['id'],
}

### Train on full data

In [None]:
%%time

automl = TabularAutoML(task = task,
                       timeout = TIMEOUT,
                       cpu_limit = N_THREADS,
                       
                       general_params ={
                           'use_algos': [['lgb_tuned', 'cb_tuned'], ['lgb_tuned', 'cb_tuned']],
#                            'return_all_predictions': True,
#                            'weighted_blender_max_nonezero_coef': 0.0                                                   
                       },                       
#                        tuning_params = {'max_tuning_time': 1800},
                        reader_params = {'n_jobs': N_THREADS},
                       
)
oof_pred = automl.fit_predict(train_data, roles = roles)
print('oof_pred:\n{}\nShape = {}'.format(oof_pred[:10], oof_pred.shape))

### Prediction, check OOF score

In [None]:
%%time
test_pred = automl.predict(test_data)
print('Prediction for test data:\n{}\nShape = {}'.format(test_pred[:10], test_pred.shape))

print('Check scores...')
print('OOF score: {}'.format(log_loss(train_data[TARGET_NAME].values, oof_pred.data)))

### Looking at the feature importance

* Interestingly, the newly created features rank near the top, especially `50p43`

In [None]:
# Fast feature importances calculation
fast_fi = automl.get_feature_scores('fast')
fast_fi.set_index('Feature')['Importance'].plot.barh(figsize = (10, 16), grid = True, title='Feature Importance').invert_yaxis()

### Submission

In [None]:
submission.iloc[:, 1:] = test_pred.data
submission.to_csv('lightautoml_2lvl_5hrs_newFeats_1.csv', index = False)
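Before uploading, it can help to sanity-check the submission frame. The sketch below assumes the sample submission has an `id` column followed by `Class_1` to `Class_9`; the helper `check_submission` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def check_submission(sub, n_classes=9):
    """Validate the column layout and values; return the row sums."""
    class_cols = [f'Class_{k}' for k in range(1, n_classes + 1)]
    # Assumed layout: id first, then one probability column per class
    assert list(sub.columns) == ['id'] + class_cols, 'unexpected columns'
    probs = sub[class_cols].to_numpy(dtype=float)
    assert np.isfinite(probs).all(), 'NaN or inf in predictions'
    assert (probs >= 0).all(), 'negative probabilities'
    return probs.sum(axis=1)

# Toy submission with uniform probabilities over 9 classes
toy_cols = [f'Class_{k}' for k in range(1, 10)]
toy_sub = pd.DataFrame(np.full((3, 9), 1 / 9), columns=toy_cols)
toy_sub.insert(0, 'id', [200000, 200001, 200002])

row_sums = check_submission(toy_sub)
print(row_sums)
```

Row sums need not be exactly one because the metric rescales each row, but values far from one usually indicate a column mix-up.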

### Visualizing the predicted classes
* class_8, class_6 and class_2 are the most predicted classes. Where are the other classes?

In [None]:
# Map the zero-based argmax index back to the nine class labels
prediction_plot = pd.Series(test_pred.data.argmax(axis=1)).replace({0: "Class_1", 1: "Class_2", 2: "Class_3", 
                                                                   3: "Class_4", 4: "Class_5", 5: "Class_6",
                                                                   6: "Class_7", 7: "Class_8", 8: "Class_9"})

print(prediction_plot.value_counts())

fig = plt.figure(figsize=(8,4))
sns.countplot(x=prediction_plot)
plt.show()
plt.show()
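One possible reason why some classes never show up is that `argmax` keeps only the single most likely class per row, so a class that is rarely the top choice disappears from the plot even when it receives a sizable probability. The sketch below compares the argmax share with the mean predicted probability per class on made-up numbers; it does not use the model's real output.

```python
import numpy as np
import pandas as pd

# Toy predicted probabilities: 4 rows, 3 classes (made up for illustration)
probs = np.array([[0.40, 0.35, 0.25],
                  [0.45, 0.30, 0.25],
                  [0.30, 0.45, 0.25],
                  [0.50, 0.30, 0.20]])
labels = [f'Class_{k}' for k in range(1, probs.shape[1] + 1)]

# Share of rows in which each class is the most likely one
argmax_share = (pd.Series(probs.argmax(axis=1))
                .map(dict(enumerate(labels)))
                .value_counts(normalize=True)
                .reindex(labels, fill_value=0.0))

# Average probability mass assigned to each class
mean_prob = pd.Series(probs.mean(axis=0), index=labels)

summary = pd.DataFrame({'argmax_share': argmax_share, 'mean_prob': mean_prob})
print(summary)
```

In this toy case the third class is never the argmax but still gets a noticeable share of the probability mass, which is what the log loss metric actually scores.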


## Thank you for reading this notebook!