# Synopsis

This notebook examines feature and class characteristics in the Tabular Playground Series May 2021 data.  Because most feature values are 0, the analysis focuses on where positive values occur.

# Setup

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

# Load Data and Top-Level Checks

In [None]:
data_dir = '/kaggle/input/tabular-playground-series-may-2021'
random_seed = 80808

In [None]:
train_data = pd.read_csv(os.path.join(data_dir, 'train.csv'))
tr_X = train_data.iloc[:, 1:51] # Feature columns
tr_y = train_data['target']

print(train_data.shape)
train_data.head()

## Data types and missing values

In [None]:
print("Data Types by Columns")
print(train_data.dtypes)

null_count = np.sum(np.isnan(tr_X.to_numpy()))

print(f'\nNumber of missing values in features: {null_count}.')

We see that all our features are integers and no values are missing.
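
The same two checks can also be written with pandas alone.  The following is a minimal sketch on a small made-up frame; the name `toy` and its values are illustrative assumptions, not the competition data.

```python
import pandas as pd

# Toy stand-in for the feature columns (values are made up for illustration)
toy = pd.DataFrame({
    'feature_0': [0, 1, 0, 3],
    'feature_1': [0, 0, -1, 2],
})

# True only if every column has an integer dtype
all_int = all(pd.api.types.is_integer_dtype(dt) for dt in toy.dtypes)

# Total number of missing cells across the whole frame
missing = int(toy.isna().sum().sum())

print(all_int, missing)
```

Unlike `np.isnan`, `isna()` also works on non-numeric columns, which may matter if the same check is reused on a frame that still includes the target.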

# Distribution of Target Values

In [None]:
fig, ax = plt.subplots()

ax.set_title("Count of Samples by Target Class")
sns.countplot(x='target', data=train_data, order=['Class_1', 'Class_2', 'Class_3', 'Class_4'], ax=ax)

# Basic Characteristics by Feature

This table provides the minimum, maximum and mean values for each feature, one feature per row.  The other columns are the counts and proportions of positive and negative values.

In [None]:
row_count = train_data.shape[0]

agg_df = tr_X.agg(['min', 'max', 'mean']).transpose()

# agg_df

# Count of positive and negative values for each feature column
agg2_ls = [
    ((tr_X[col] > 0).sum(), (tr_X[col] < 0).sum())
    for col in tr_X
]

agg2_df = pd.DataFrame(agg2_ls, columns=['PosCnt', 'NegCnt'], index=agg_df.index)
agg2_df['PosProp'] = agg2_df['PosCnt'].div(row_count)
agg2_df['NegProp'] = agg2_df['NegCnt'].div(row_count)

feat_ch_df = pd.concat([agg_df, agg2_df.iloc[:, [0, 2, 1, 3]]], axis=1)

feat_ch_df

In [None]:
print(f'Maximum positive proportion: {feat_ch_df["PosProp"].max()}')
print(f'Mean positive proportion: {feat_ch_df["PosProp"].mean()}')
print(f'Maximum negative proportion: {feat_ch_df["NegProp"].max()}')
print(f'Mean negative proportion: {feat_ch_df["NegProp"].mean()}')


Only one feature, feature_31, has more than 10 negative values.  Its 199 negative values are negligible, representing about 0.2% of all rows.

Positive values are much less common than 0's.  Only about 20% of all feature values are positive.
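
The overall figure follows from the per-feature proportions.  Every feature has the same number of rows, so the mean of the per-feature positive proportions equals the share of positive cells in the whole feature matrix.  The sketch below checks this on a tiny made-up matrix; `toy` and its values are assumptions for illustration only.

```python
import numpy as np

# Toy feature matrix (made-up values): 3 rows x 4 features
toy = np.array([
    [0, 2, 0, 0],
    [0, 0, 1, 0],
    [-1, 0, 0, 0],
])

total = toy.size

# Share of all cells that are positive, zero, or negative
pos_share = (toy > 0).sum() / total
zero_share = (toy == 0).sum() / total
neg_share = (toy < 0).sum() / total

# Mean of the per-feature positive proportions (column means, then their mean)
mean_of_feature_props = (toy > 0).mean(axis=0).mean()

print(pos_share, zero_share, neg_share, mean_of_feature_props)
```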

# Characteristics of Rows by Target Classes

This table shows the minimum, maximum and mean number of positive values per row within each target class.

In [None]:
feat_arr = tr_X.to_numpy()

pos_arr = (feat_arr > 0).astype('int32')
row_cnts = np.sum(pos_arr, axis=1)

rc_gb = pd.DataFrame({'PosCnt': row_cnts, 'target': tr_y}).groupby(by='target')
cl_cnt = rc_gb.aggregate(['min', 'max', 'mean'])

print("Positive Values in Rows by Classes")
cl_cnt

We can see that rows in Class 4 have about 10% fewer positive values than the rows in the other classes.  Note that both Class 2 and Class 4 have at least one sample with no positive values.
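
A relative difference like this can be read off the table by hand, or computed directly.  The sketch below shows one way to do it on made-up counts; the frame `toy` and its numbers are illustrative assumptions, so the resulting shortfall is not the one observed in the real data.

```python
import pandas as pd

# Made-up per-row positive counts and class labels
toy = pd.DataFrame({
    'PosCnt': [10, 12, 11, 9, 18, 0],
    'target': ['Class_1', 'Class_1', 'Class_2', 'Class_2', 'Class_4', 'Class_4'],
})

# Mean number of positive values per row, by class
class_mean = toy.groupby('target')['PosCnt'].mean()

# Unweighted mean over the classes other than Class_4
others_mean = class_mean.drop('Class_4').mean()

# Relative shortfall of Class_4 (0.10 would mean 10% fewer)
shortfall = 1 - class_mean['Class_4'] / others_mean

# Classes that contain at least one row with no positive values
has_empty_row = toy['PosCnt'].eq(0).groupby(toy['target']).any()

print(round(shortfall, 3))
print(has_empty_row)
```

Whether the other classes are averaged with or without weighting by class size is a design choice; the sketch uses the unweighted mean.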

# Heatmap of Feature Positive Proportions by Class

To highlight the comparison between classes for each feature (one row of the heatmap), we scale the proportion in each cell.  If we used raw counts, the large number of samples from Class 2 would overwhelm other variation.  If we used the proportions without scaling, the differences between features (rows) might hide the patterns: the whole row would be lighter for features with many positive values.  We also scale for the differences in positive value rates between classes, which keeps Class 4 from consistently showing a lower value.
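
As a worked example of this scaling, the sketch below applies it to one hypothetical feature.  All numbers and the names `cl_pos_total` and `feat_pos` are made up for illustration; only the two division steps mirror the approach described above.

```python
import pandas as pd

# Made-up totals of positive values per class (all features combined)
cl_pos_total = pd.Series({'Class_1': 100, 'Class_2': 600, 'Class_3': 200, 'Class_4': 100})

# Made-up positive counts per class for one feature
feat_pos = pd.Series({'Class_1': 12, 'Class_2': 60, 'Class_3': 20, 'Class_4': 8})

# Share of each class's positive values that fall in this feature:
# 0.12, 0.10, 0.10, 0.08
share = feat_pos / cl_pos_total

# Divide by the unweighted mean across classes so that 1.0 means "typical":
# approximately 1.2, 1.0, 1.0, 0.8
factor = share / share.mean()

print(factor.round(2))
```

In raw counts Class_2 dominates (60 of 100 positive values), yet its factor is 1.0, while Class_1 stands out at about 1.2.  This is the effect the scaling is meant to produce.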

In [None]:
# Create data for factors

# Use previous structure to get positive counts

cl_pos_cnt = rc_gb.sum().iloc[:, 0]
pos_cnt = int(cl_pos_cnt.sum())

row_list = []

for col in tr_X:
    pos_cl_counts = train_data.loc[train_data[col] > 0, 'target'].value_counts().sort_index()
    cl_act_prop = pos_cl_counts.div(cl_pos_cnt)                 # Actual proportion of positive values by class for this feature
    f_mean_prop = cl_act_prop.mean()                             # Mean proportion across classes
    cl_factor = cl_act_prop.div(f_mean_prop)                    # Scaled factor for each class

    row_list.append([col] + list(cl_factor))

In [None]:
exp_df = pd.DataFrame.from_records(row_list, columns=['Feature'] + list(cl_pos_cnt.index), index=['Feature'])

fig, ax = plt.subplots()
fig.set_figheight(15)

sns.heatmap(exp_df, center=1.0, annot=True, fmt='0.2f', ax=ax)

Note that the range of these values is narrow compared to other data sets.  For any feature, no class has more than 35% more or 20% fewer positive values than we would expect.  So the presence of a positive value in a sample for any one feature will have limited effect on the probabilities that we would calculate.
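
To make the link to probabilities concrete, the sketch below compares the class distribution over all rows with the class distribution among rows where one feature is positive.  The counts are made up, and the ratio computed here (called `lift`, a name introduced for this example) is normalized by row counts, so it is related to but not the same as the heatmap factor.

```python
import pandas as pd

# Made-up counts per class: total rows, and rows where one feature is positive
counts = pd.DataFrame(
    {'total': [1000, 6000, 2000, 1000], 'pos': [220, 1200, 400, 180]},
    index=['Class_1', 'Class_2', 'Class_3', 'Class_4'],
)

# Class probabilities before looking at the feature
prior = counts['total'] / counts['total'].sum()

# Class probabilities among rows where the feature is positive
cond = counts['pos'] / counts['pos'].sum()

# Ratio of the two; values near 1.0 mean the feature shifts the probability little
lift = cond / prior

print(pd.DataFrame({'prior': prior, 'cond': cond, 'lift': lift}).round(3))
```

With these toy numbers, a positive value moves the probability of Class_1 from 0.10 to 0.11 and leaves Class_2 unchanged.  A narrow range of ratios corresponds to shifts of this small size.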

# Correlations

As we look for correlations between features, we find that they are very small.  If we drew a heatmap without significant color enhancement, every cell except the diagonal would look the same, essentially a correlation of 0.  The color enhancement is done by setting vmax, which controls the upper limit of the color range.
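
The arithmetic behind this can be sketched as follows, assuming a plain linear color scale that runs symmetrically from -vmax to +vmax.  That is a simplification: seaborn's exact normalization may differ in detail.  The helper `colour_position` and the correlation value 0.02 are made up for illustration.

```python
def colour_position(value, vmax):
    """Position of `value` on a linear scale from -vmax to +vmax, clipped to [0, 1]."""
    return max(0.0, min(1.0, (value + vmax) / (2 * vmax)))

typical = 0.02  # made-up off-diagonal correlation

# If the diagonal (1.0) sets the limit, the value sits almost exactly mid-scale
wide = colour_position(typical, vmax=1.0)

# With vmax just above the largest off-diagonal value, it uses most of the scale
narrow = colour_position(typical, vmax=0.022)

# The diagonal is simply clipped to the top of the scale
diagonal = colour_position(1.0, vmax=0.022)

print(round(wide, 3), round(narrow, 3), diagonal)
```

Under this assumption a correlation of 0.02 lands at about 0.51 of the range when vmax is 1.0, which is visually indistinguishable from 0, but at about 0.95 when vmax is 0.022.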

In [None]:
corr_df = train_data.iloc[:, 1:51].corr()

print("Sample of feature correlations")

corr_df.iloc[:10, :10]

In [None]:
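corr_arr = corr_df.to_numpy()
# Exclude the diagonal explicitly; comparing against 1 can miss self-correlations that differ from 1.0 by rounding
off_diag = corr_arr[~np.eye(corr_arr.shape[0], dtype=bool)]
c_max = np.max(off_diag)
c_min = np.min(off_diag)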

print("Strongest positive correlation between features: ", c_max)
print("Strongest negative correlation between features: ", c_min)

In [None]:
# Correlation Map

fig, ax = plt.subplots(figsize=(20, 12))
ax.set_title("Correlation Heatmap with Enhanced Color", fontsize=14)

sns.heatmap(corr_df, vmax=c_max*1.1, center=0.0, annot=False)