# About
This notebook is my EDA for the Tabular Playground Series June 2021 (TPS6). The June competition has quite a few similarities to the May competition (TPS5). For the May EDA, please see my notebook [TPS5 - EDA raising more questions than answers](https://www.kaggle.com/melanie7744/tps5-eda-raising-more-questions-than-answers).

Here is a summary of my findings; the details follow below.

**The most obvious similarities are:**
- multi-class classification problem
- multi-class log loss as evaluation metric
- anonymized (obfuscated) features
- most feature values are 0
- there are no binary features

**The differences lie in the details:**
- 9 classes vs 4 in TPS5
- 75 features vs 50 in TPS5
- training and testing data are twice as big as in TPS5
- the features do not have any negative values
- there are many more "feature duplicates" than in TPS5


If you like my analysis, please upvote!

Let's load the environment and look at training and testing data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib
import matplotlib.pyplot as plt # plotting
%matplotlib inline 
print("matplotlib version: {}". format(matplotlib.__version__))
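# newer matplotlib versions renamed the 'seaborn' style to 'seaborn-v0_8'
style_name = 'seaborn-v0_8' if 'seaborn-v0_8' in matplotlib.style.available else 'seaborn'
matplotlib.style.use(style_name)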

import seaborn as sns
print("seaborn version: {}". format(sns.__version__))

import sklearn # machine learning algorithms
print("scikit-learn version: {}". format(sklearn.__version__))
from sklearn.preprocessing import StandardScaler

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# read competition data files
df_train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-jun-2021/sample_submission.csv')
df_all = pd.concat([df_train, df_test], ignore_index=True) # DataFrame.append was removed in pandas 2.0

# Data Overview

In [None]:
print("Size of training data: ",df_train.shape)
df_train.head()

In [None]:
print("Size of testing data: ",df_test.shape)
df_test.head()

In [None]:
feature_cols = [col for col in df_train.columns if col.startswith("feat")]

In [None]:
df_train.describe().transpose()\
        .drop("id")\
        .style.bar(subset=['mean','std'])\
        .background_gradient(subset=['max'])

In [None]:
df_test.describe().transpose()\
        .drop("id")\
        .style.bar(subset=['mean','std'])\
        .background_gradient(subset=['max'])

In [None]:
# number of rows with any values below zero
display(df_train[(df_train.drop(["target"],axis=1) < 0).any(axis=1)].shape)
df_test[(df_test < 0).any(axis=1)].shape

We can see here that there are no missing values. The variable statistics are comparable between training and testing data. Most features are in a range from 0 to 100, with a few exceptions. The highest value of any feature is 352, which is much higher than the highest value in TPS5. As in TPS5, 0 is by far the most common value for any feature.
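The checks behind these statements can also be made explicit instead of reading them off the `describe()` table. This is only a minimal sketch on a made-up toy frame (the `toy` data and the variable names are my assumptions, not the competition data); on the real data you would pass `df_train[feature_cols]` instead.

```python
import pandas as pd

# Toy stand-in for df_train[feature_cols]; values are made up for illustration.
toy = pd.DataFrame({
    "feature_0": [0, 0, 3, 0],
    "feature_1": [0, 12, 0, 0],
    "feature_2": [1, 0, 0, 7],
})

n_missing = int(toy.isna().sum().sum())             # total number of NaN cells
n_negative_rows = int((toy < 0).any(axis=1).sum())  # rows with at least one negative value
global_max = int(toy.max().max())                   # highest value of any feature

print("missing cells:", n_missing)
print("rows with negative values:", n_negative_rows)
print("highest feature value:", global_max)
```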

# Target Variable Analysis

In [None]:
# check the target variable
target_absolute = df_train.target.value_counts()
target_percent = df_train.target.value_counts(normalize=True)
target_distribution = pd.DataFrame(data={'absolute':target_absolute, 'percent': target_percent})
# format the share as a percentage string for display only (Series.map works across pandas versions)
target_distribution['percent'] = target_distribution['percent'].map(lambda x: "{0:.2f}%".format(x*100))
target_distribution

In [None]:
plt.figure( figsize=(12,6))
ax= target_distribution['absolute'].sort_values(ascending=True).plot(kind='barh')
ax.set_title("Distribution of Target Variable")
ax.set_xlabel("Count")

rects = ax.patches
labels = target_distribution['absolute'].sort_values(ascending=True)
for rect, label in zip(rects, labels):
    width = rect.get_width()
    ax.text(width +1500 ,rect.get_y() + rect.get_height() / 2, label,
            ha='center', va='center')
plt.show()

The target variable is imbalanced, with Class_6 and Class_8 together making up about half of the samples.

# Duplicates

In [None]:
# check for true duplicates, i.e. where features and target match
df_dupli_f = df_train[df_train.drop(columns=["id"]).duplicated(keep="first")]
print("Number of duplicates: ", df_dupli_f.shape[0]) 
print("Number of duplicates per class: \n", df_dupli_f.target.value_counts())
# drop duplicates
df_train = df_train.drop(columns=["id"]).drop_duplicates()
df_train.shape

In [None]:
# check for duplicates in the feature columns, only possible for training data
df_dupli_f = df_train[df_train.drop(columns=["target"]).duplicated(keep="first")].copy() 
df_dupli_f["f_sum"] = df_dupli_f.sum(axis=1, numeric_only=True)
#display(df_dupli_f.sort_values(by="f_sum")) # sort the df to find matching duplicates easier
print("Number of duplicates per class if first duplicate is kept: \n", df_dupli_f.target.value_counts())

df_dupli_l = df_train[df_train.drop(columns=["target"]).duplicated(keep="last")].copy()
df_dupli_l["f_sum"] = df_dupli_l.sum(axis=1, numeric_only=True)
#display(df_dupli_l.sort_values(by="f_sum"))
print("Number of duplicates per class if last duplicate is kept: \n",df_dupli_l.target.value_counts())

In [None]:
#df_train.drop(columns=["target"]).loc[131686] == df_train.drop(columns=["target"]).loc[66469] # this is a feature duplicate pair, row 131686 -> Class_2 vs. row 66469 -> Class_3

In [None]:
# look at test set only
df_test[df_test.drop(columns=["id"]).duplicated(keep="first")]

In [None]:
# check for samples with identical features in train and test data
df_train_temp = df_train.drop(columns="target").drop_duplicates()
df_test_temp = df_test.drop(columns="id").drop_duplicates()
df_all_temp = pd.concat([df_train_temp, df_test_temp], ignore_index=True) # DataFrame.append was removed in pandas 2.0

In [None]:
df_all_temp[df_all_temp.duplicated(keep="last")] # keep="last" flags the earlier occurrence, so this shows the rows from the training data; keep="first" would show the rows from the testing data

In the **training data** we have 106 true duplicates, that is, rows where both features and target match.

But we also have 118 "feature duplicates" here. That is, rows where the features are the same but the target variable is different! An example would be: 
- row 131686-> Class_2 vs. 
- row  66469-> Class_3

Just by counting the target values of the feature duplicates, it can be seen that the class counts differ depending on whether the first or the last duplicate is kept, so the targets of these rows do not match.
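One way to separate the two kinds of duplicates is to group rows by their feature values and count the distinct targets per group. This is a minimal sketch on a made-up toy frame (`toy`, its values and the variable names are assumptions for illustration, not the competition data).

```python
import pandas as pd

# Toy stand-in for the training data; values are made up for illustration.
toy = pd.DataFrame({
    "feature_0": [0, 0, 1, 1, 2, 2],
    "feature_1": [5, 5, 0, 0, 3, 4],
    "target": ["Class_2", "Class_3", "Class_6", "Class_6", "Class_8", "Class_8"],
})
toy_feature_cols = ["feature_0", "feature_1"]

grouped = toy.groupby(toy_feature_cols)["target"]
n_targets = grouped.transform("nunique")  # distinct targets among rows with identical features
group_size = grouped.transform("size")    # number of rows sharing the same features

# identical features AND identical target
true_duplicates = toy[(group_size > 1) & (n_targets == 1)]
# identical features but at least two different targets
conflicting = toy[n_targets > 1]

print(true_duplicates)
print(conflicting)
```

In the toy frame, rows 2 and 3 are true duplicates, while rows 0 and 1 are "feature duplicates" with conflicting targets.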

In the **test set** we have 79 rows flagged as duplicates. Compared to the number of duplicates in the train set (106 + 118 = 224), this is a bit lower than expected.

If we **combine the training and testing set** we have 101 rows with identical features. So the model gets the exact same data in the test set that it learned from in the training phase. It would be interesting to check exactly those predictions.
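To actually look at those overlapping rows together with their training target, an inner merge on the feature columns is one option. This is a minimal sketch on made-up toy frames (`toy_train`, `toy_test` and the column subset are assumptions for illustration).

```python
import pandas as pd

toy_feature_cols = ["feature_0", "feature_1"]

# Toy stand-ins for train and test; values are made up for illustration.
toy_train = pd.DataFrame({
    "feature_0": [0, 1, 2],
    "feature_1": [5, 0, 3],
    "target": ["Class_2", "Class_6", "Class_8"],
})
toy_test = pd.DataFrame({
    "id": [10, 11, 12],
    "feature_0": [1, 9, 2],
    "feature_1": [0, 9, 3],
})

# Keep one training row per feature vector. For conflicting feature duplicates
# this keeps only the first target, which is a simplification.
train_unique = toy_train.drop_duplicates(subset=toy_feature_cols)

# The inner merge keeps only test rows whose feature vector also occurs in train.
overlap = toy_test.merge(train_unique, on=toy_feature_cols, how="inner")
print(overlap[["id", "target"]])
```

The resulting `id` column could then be used to pick out exactly those rows from a submission and compare the predictions with the training targets.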

While the low number of feature duplicates in TPS5 led me to just drop them, I will think twice about whether this is a good approach here, for TPS6.

Some further investigation made me believe that this is an artifact from making a synthetic dataset. So no use trying to make sense out of those rows, "just" decide how to deal with them.

# Feature Value Distributions

In [None]:
# thanks to Maxim Kazantsev (@maximkazantsev) for this function! I adapted it slightly
def make_data_plots(df, i=0):
    """
    Makes value distribution histogram plots for a given dataframe features
    df should contain only the features to be plotted
    """
    columns = df.columns.values

    cols = 4
    rows = (len(columns) - i) // cols + 1

    fig, axs = plt.subplots(ncols=cols, nrows=rows, figsize=(16,rows*4), sharey=True)
    
    plt.subplots_adjust(hspace = 0.2)
    for r in np.arange(0, rows, 1):
        for c in np.arange(0, cols, 1):
            if i >= len(columns):
                axs[r, c].set_visible(False)
            else:
                axs[r, c].hist(df[columns[i]].values, bins = 30)
                axs[r, c].set_title(columns[i], fontsize=12, pad=5)
            i+=1
            
            
make_data_plots(df_train[feature_cols])

In [None]:
make_data_plots(df_test[feature_cols])

We can see here that all distributions are right-skewed, some very heavily.
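The visual impression can be backed by a number: pandas computes the sample skewness per column, and positive values indicate a right-skewed distribution. This is a minimal sketch on a made-up toy frame (the `toy` data is an assumption); on the real data you would call it on `df_train[feature_cols]`.

```python
import pandas as pd

# Toy stand-in with many zeros and a few large values, made up for illustration.
toy = pd.DataFrame({
    "feature_0": [0, 0, 0, 0, 1, 2, 10],
    "feature_1": [0, 0, 0, 1, 1, 1, 40],
})

# Sample skewness per feature; values > 0 mean the long tail is on the right.
skewness = toy.skew().sort_values(ascending=False)
print(skewness)
```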

# Feature Value Analysis

In [None]:
# let's check if there are as many unique feature values as the range of values
pd.options.display.max_rows = 75
df_features = df_all[feature_cols] # use df_all, df_test here depending on what you want to see

feature_range = df_features.max() - df_features.min()
no_unique_values = df_features.nunique()

unique_values = pd.DataFrame(data={"feature_range": feature_range, "no_unique_values": no_unique_values})

In [None]:
unique_values.plot(kind="barh", figsize=(12,24), color=['tab:blue', 'tab:orange'])
plt.show()

Here we see quite a different picture than in TPS5. Now, in TPS6, there are no low-cardinality features. There are many features with a sizeable difference between their value range and their number of unique values, which means that some integers within the range never occur. I am still puzzled what to make of this observation. Any ideas?
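To see where the gaps are for a single feature, one can list the integers inside its range that never occur. This is a minimal sketch on a made-up series (`values` is an assumption, and it presumes integer-valued features); on the real data it could be something like `df_all["feature_17"]`.

```python
import pandas as pd

# Toy stand-in for one feature column; values are made up for illustration.
values = pd.Series([0, 0, 1, 2, 2, 5, 9], name="feature_x")

observed = {int(v) for v in values.unique()}
full_range = set(range(int(values.min()), int(values.max()) + 1))

# Integers between min and max that never appear in the feature
missing_values = sorted(full_range - observed)
print("range size:", len(full_range), "unique values:", len(observed))
print("missing values:", missing_values)
```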

In [None]:
# check what percentage of each feature's values are 0
zerolist = []

for col in feature_cols:
    zeroperc = (df_train[col] == 0).mean() # share of zeros; also works if a feature has no zeros at all
    zerolist.append(zeroperc)
    
zeros = round(pd.Series(data=zerolist)*100,2)

In [None]:
zeros.sort_values(ascending=False).plot(kind='bar', figsize=(20,6))
plt.axhline(y=50)
plt.title("Percentage of Zeros in each feature")
plt.xlabel("Feature Number")
plt.ylabel("%")
plt.annotate(zeros.max(),xy=(0,zeros.max()+2))
plt.annotate(zeros.min(),xy=(73,zeros.min()+2))
plt.show()

The feature with the highest share of zeros has 86.09% zeros in it. The feature with the lowest share of zeros has 28.97% zeros in it. Only 10 features have less than 50% zeros.
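The same numbers can be computed without a loop, and with the feature names kept as index. This is a minimal sketch on a made-up toy frame (the `toy` data and variable names are assumptions); on the real data you would use `df_train[feature_cols]`.

```python
import pandas as pd

# Toy stand-in for df_train[feature_cols]; values are made up for illustration.
toy = pd.DataFrame({
    "feature_0": [0, 0, 0, 4],
    "feature_1": [0, 2, 3, 1],
})

# The mean of a boolean mask is the share of True values, here the share of zeros.
zero_pct = (toy == 0).mean().mul(100).round(2)

# Number of features with less than 50% zeros
n_below_50 = int((zero_pct < 50).sum())

print(zero_pct.sort_values(ascending=False))
print("features with less than 50% zeros:", n_below_50)
```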

# Viz per Feature

In [None]:
# choose feature for a closer look
current_feature = "feature_17"
current_df = df_train
print(current_feature)

fsize = (10,6) # define the figure size before it is used
fig = plt.figure(figsize=fsize) # create figure
ax0 = fig.add_subplot(2, 1, 1) # add subplot 1 (2 rows, 1columns, first plot)
ax1 = fig.add_subplot(2, 1, 2)
#current_df[current_feature].hist(figsize=fsize, ax=ax0)
sns.histplot(x=current_feature, data=current_df, ax=ax0) # just an alternative with sns instead of plt
sns.boxplot(x=current_feature, data=current_df, ax=ax1)
plt.show()

print(current_df[current_feature].value_counts())

# Sample Analysis

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv') # read again, because of dropped duplicates
target_col = df_train.target # store the target column, it interferes with the calculations below
df_train.drop(columns="target", inplace=True)
if "id" in df_train.columns.to_list(): # if the id is still present, remove it because it distorts the computations
    df_train.drop(columns="id", inplace=True)

In [None]:
rownumber = 3 # enter the row you want to analyse
pd.Series(data=df_train[feature_cols].loc[rownumber].values).plot(kind='bar', figsize=(16,6))
plt.title("Sample {} Values".format(rownumber))
plt.show()

In [None]:
# get the number of entries that are not 0 in each row
number_nz = np.count_nonzero(df_train, axis=1) # don't assign directly to df_train, this would include number_nz in sum_nz! (learning by mistakes...)
# get the sum of the entered values in each row
sum_nz = df_train.sum(axis=1, numeric_only=True) 
df_train["number_nz"] = number_nz 
df_train["sum_nz"] = sum_nz
df_train.tail()

In [None]:
df_train.number_nz.value_counts()#.plot(kind='barh', figsize=(16,12))
# ... there are 5422 rows where 27 features have entries (to be precise: non-zero entries)
# ... there is 1 row where 65 features have entries
# ... and there are 9 rows where no features have entries

In [None]:
# add the target column back to check for target when all features are 0
df_train['target'] = target_col
df_train[df_train.number_nz == 0]

In [None]:
# repeat for test data
df_test.drop(columns="id", inplace=True)
# get the number of entries that are not 0 in each row
number_nz = np.count_nonzero(df_test, axis=1) # don't assign directly to df_test, this would include number_nz in sum_nz! (learning by mistakes...)
# get the sum of the entered values in each row
sum_nz = df_test.sum(axis=1, numeric_only=True) 
df_test["number_nz"] = number_nz 
df_test["sum_nz"] = sum_nz
# check for all zero samples
df_test[df_test.number_nz == 0]

In the training data there are 9 rows where all features are 0. These 9 rows have different target classes: 5x Class_2, 2x Class_6, 1x Class_5, 1x Class_3.

There are also 6 rows with 0 for all features in the test set. Ids: 221601, 232129, 245875, 250216, 284469, 295642. So at least train and test have the same properties... Although the number of all-zero rows is a bit higher than expected in the test set.

Some further investigation made me believe that this is an artifact from making a synthetic dataset. So no use trying to make sense out of those rows, "just" decide how to deal with them.

# Set Baseline

In [None]:
# predict the class frequencies of the train set for every row
for col in sample_submission.drop(columns="id"):
    sample_submission.loc[:,col] = target_percent.loc[col] # use the numeric shares; target_distribution.percent holds formatted strings

sample_submission.to_csv('submission.csv', index=False)
sample_submission
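As a rough reference for what this baseline should score: if every row is predicted with the class frequencies, the multi-class log loss on data with those same frequencies equals the entropy of the class distribution. This is a minimal sketch with made-up priors and labels (`priors` and `y_true` are assumptions, not the competition values); the score on the test set will only be close to it if the test set has a similar class distribution.

```python
import numpy as np

# Made-up class frequencies and labels that match them exactly (illustration only).
priors = np.array([0.5, 0.25, 0.25])
y_true = np.array([0, 0, 1, 2])

# Predict the same prior probabilities for every row.
pred = np.tile(priors, (len(y_true), 1))

# Multi-class log loss: mean negative log probability of the true class.
log_loss_manual = -np.mean(np.log(pred[np.arange(len(y_true)), y_true]))

# Entropy of the class distribution (natural log).
entropy = -np.sum(priors * np.log(priors))

print(log_loss_manual, entropy)
```

Any model that scores worse than this value has learned less than the class frequencies alone provide.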