# Synopsis

This is an examination of feature and class characteristics.  The study focuses on zero vs. positive values and comparisons with the May TPS data.

# Setup

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas.io.formats import style

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

import sklearn.preprocessing as sk_prep


In [1]:
DATA_DIR = '/kaggle/input/tabular-playground-series-jun-2021'
RANDOM_STATE = 9003

# Load Data and Top-Level Checks

In [1]:
train_set = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'))
train_data = train_set.iloc[:, 1:-1] # Feature columns
train_ar = train_data.to_numpy()

le_targ = sk_prep.LabelEncoder()
train_y = train_set['target']
train_y_num = le_targ.fit_transform(train_y)
classes = le_targ.classes_

# Train data with label encoded target
train_w_targ = train_data.copy()
train_w_targ['target'] = train_y_num

print(train_set.shape)
train_set.head()

## Data types and missing values

In [1]:
print("Data Types of Predictors")

trvc = train_data.dtypes.value_counts()
for type1, cnt in trvc.items():  # Series.iteritems() was removed in pandas 2.0
    print(f'  {type1}: {cnt}')

null_count = np.sum(np.isnan(train_ar))

print(f'\nNumber of missing values in features: {null_count}.')

neg_count = np.sum(np.less(train_ar, 0))

print(f'\nNumber of negative values in features: {neg_count}')

The June data has no negative feature values, unlike the May data.

# Distribution of Target Values

In [1]:
fig, ax = plt.subplots()

ax.set_title("Count of Samples by Target Class")
sns.countplot(x='target', data=train_set, order=classes, ax=ax)

# Characteristics of Rows by Target Classes

This table shows the minimum, maximum and mean number of positive values per row within each target class.

In [1]:
pos_arr = (train_ar > 0).astype('int32')
row_cnts = np.sum(pos_arr, axis=1)

rc_gb = pd.DataFrame({'PosCnt': row_cnts, 'target': train_y}).groupby(by='target')
cl_cnt = rc_gb.aggregate(['min', 'max', 'mean'])

print("Positive Values in Rows by Classes")
cl_cnt

We can see that the average number of positive values per row by class varies from 18 for class 2 to 31 for class 8. 

Note that classes 2, 3, 5 and 6 each have at least one row with no positive values.
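
As a minimal sketch of how all-zero rows could be counted per class, the toy example below flags rows whose positive count is zero and sums the flags within each class. The arrays `toy_X` and `toy_y` and their values are hypothetical stand-ins, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature matrix and targets (hypothetical values).
toy_X = np.array([
    [0, 0, 0],
    [1, 0, 2],
    [0, 0, 0],
    [0, 3, 0],
])
toy_y = pd.Series(['Class_2', 'Class_2', 'Class_3', 'Class_8'], name='target')

# A row is "all zero" when it contains no positive values.
all_zero = pd.Series((toy_X > 0).sum(axis=1) == 0, name='AllZero')

# Summing the boolean flags within each class counts the all-zero rows.
zero_rows_by_class = all_zero.groupby(toy_y).sum()
print(zero_rows_by_class)
```

On the real data, the same idea would apply to `train_ar` and `train_y` from the cells above.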

# Basic Characteristics by Feature

This table provides the minimum, maximum and mean values for each feature.  The other columns are the counts and proportions of positive values and the count of unique values.

In [1]:
row_count = train_data.shape[0]

# Get min, max and mean by feature

agg_df = train_data.agg(['min', 'max', 'mean']).transpose()

# Get positive count and number of unique values by feature

agg2_ls = [
    (
        (f1 > 0).sum(),
        len(f1.unique())
    )
    for f1 in [train_data[col] for col in train_data]
]

# Combine data

agg2_df = pd.DataFrame(agg2_ls, columns=['PosCnt', 'UniqueCnt'], index=agg_df.index)
agg2_df['PosProp'] = agg2_df['PosCnt'].div(row_count)

feat_ch_df = pd.concat([agg_df, agg2_df.iloc[:, [0, 2, 1]]], axis=1)

# Display

style.Styler(feat_ch_df, precision=2).background_gradient(cmap='viridis')

In [1]:
print(f'Maximum positive proportion: {feat_ch_df["PosProp"].max():0.2f}')
print(f'Minimum positive proportion: {feat_ch_df["PosProp"].min():0.2f}')
print(f'Mean positive proportion:    {feat_ch_df["PosProp"].mean():0.2f}')

Zeros are much more common than positive values: on average, only about 36% of the values in a feature are positive.  For comparison, in May only about 20% of all values were positive.
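
Because every feature has the same number of rows, the mean of the per-feature positive proportions equals the share of positive values in the whole feature matrix. The sketch below checks this on randomly generated toy data; the array `toy` is hypothetical and not drawn from the competition files.

```python
import numpy as np

# Hypothetical toy matrix: 50 rows, 4 features, values in {0, 1, 2}.
rng = np.random.default_rng(0)
toy = rng.integers(0, 3, size=(50, 4))

# Proportion of positive values within each feature (column).
per_feature = (toy > 0).mean(axis=0)

# Proportion of positive values over the whole matrix.
overall = (toy > 0).mean()

# With equal column lengths the two summaries agree.
print(per_feature.mean(), overall)
```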

# Influence of Positive Features

The heatmap below shows, for each feature and class, the relative change in the likelihood of that class when a row has a positive value in the feature compared with when it has a zero.

For example, rows with a positive value in feature 0 have a 32% chance of being in Class 8, while those with a zero in the feature have a 24% chance of being in this class.  With these rounded figures the relative improvement is 0.32 / 0.24 - 1 ≈ 0.33; the value in the heatmap is computed from the unrounded proportions and can differ slightly.
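
The calculation can be illustrated on a small made-up frame. This is only a sketch: the column `feature_0`, the class labels `A` and `B`, and the values are hypothetical, and `value_counts(normalize=True)` is used here as a compact way to get class proportions.

```python
import pandas as pd

# Hypothetical toy data: one feature and a two-class target.
toy = pd.DataFrame({
    'feature_0': [0, 0, 0, 0, 1, 2, 3, 1],
    'target':    ['A', 'A', 'A', 'B', 'A', 'B', 'B', 'B'],
})

# Class proportions among rows where the feature is zero.
zero_prop = toy.loc[toy['feature_0'] == 0, 'target'].value_counts(normalize=True)

# Class proportions among rows where the feature is positive.
pos_prop = toy.loc[toy['feature_0'] > 0, 'target'].value_counts(normalize=True)

# Relative gain: ratio of the two proportions minus 1 (aligned on class label).
gain = pos_prop.div(zero_prop).sub(1)
print(gain)
```

Here class `B` makes up 75% of the positive rows but only 25% of the zero rows, so its gain is 0.75 / 0.25 - 1 = 2.0, while class `A` has a negative gain.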

In [1]:
pos_gain_by_target = pd.DataFrame(index=classes)

for col in train_data:
    col_ser = train_data[col]
    zero_counts_by_target = train_set.loc[(col_ser == 0), [col, 'target']].groupby(by='target').count()
    zero_prop_by_target = zero_counts_by_target.divide(zero_counts_by_target.sum())
    pos_counts_by_target = train_set.loc[(col_ser > 0), [col, 'target']].groupby(by='target').count()
    pos_prop_by_target = pos_counts_by_target.divide(pos_counts_by_target.sum())
    pos_gain_by_target[col] = pos_prop_by_target.divide(zero_prop_by_target).sub(1)

# Transpose so that the plot will be tall
pos_gain_by_target = pos_gain_by_target.T

# pd.io.formats.style.Styler(pos_gain_by_target, precision=2).background_gradient(cmap='viridis')

In [1]:
fig, ax = plt.subplots()
fig.set_figheight(20)

sns.heatmap(pos_gain_by_target, annot=True, fmt='0.2f', ax=ax, cmap='viridis')

# Correlations

As we look for correlations between features, we find that they are very small.  If we did a heatmap without significant color enhancement, everything except the diagonal would look the same, essentially a correlation of 0.  The color enhancement is done by setting `vmax`, which controls the upper limit of the color range.

In [1]:
corr_df = train_data.corr()

print("Sample of feature correlations")

corr_df.iloc[:10, :10]

In [1]:
corr_arr = corr_df.to_numpy()

# Exclude the diagonal (self-correlations of 1) by position rather than by value
off_diag = ~np.eye(corr_arr.shape[0], dtype=bool)
c_max = np.max(corr_arr[off_diag])
c_min = np.min(corr_arr[off_diag])
c_mean = np.mean(corr_arr[off_diag])

print(f"Strongest positive correlation between features: {c_max: 0.3f}")
print(f"Minimum correlation between features: {c_min: 0.3f}")
print(f"Mean correlation: {c_mean: 0.3f}")

In the May data, the strongest correlation between features was 0.017, a little more than a tenth of the strongest correlation we see here, and not a lot more than the weakest correlation that we see here.  Also May had some negative correlations, which we do not have in these data.

This stronger correlation between features, though it is still weak on the whole, could mean that methods that combine features, e.g. PCA or DAE, are more effective this month.
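
To illustrate why correlated features favor such methods, the sketch below runs PCA by way of a NumPy singular value decomposition on synthetic data. Everything in it is an assumption made for illustration: the data are randomly generated, two of the three columns share a common signal, and this is not the notebook's own modelling step.

```python
import numpy as np

# Synthetic data (hypothetical): two correlated columns plus one independent column.
rng = np.random.default_rng(0)
n = 500
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.3 * rng.normal(size=n),
    shared + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

# PCA via SVD: center the columns, then take the singular values.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to the variance of each component.
explained = s**2 / np.sum(s**2)
print(explained)
```

The first component captures most of the variance of the two correlated columns. With nearly uncorrelated features, as in May, the explained variance would be spread more evenly and PCA would compress little.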

In [1]:
# Correlation Map

fig, ax = plt.subplots(figsize=(20, 12))
ax.set_title("Correlation Heatmap with Enhanced Color", fontsize=14)

sns.heatmap(corr_df, vmax=c_max*1.1, center=0.0, annot=False)