This notebook explores the datasets provided for the Tabular Playground Series - Jul 2021 competition, specifically the training set, primarily through visual analysis.

In [None]:
from numpy.random import seed
seed(9)
import tensorflow
tensorflow.random.set_seed(9)
import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA

In [None]:
# Read the datasets
train = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/test.csv')
sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/sample_submission.csv')

In [None]:
# Explore the dataset
display(train.head())

In [None]:
# Explore the dataset
display(sample_submission.head())

In [None]:
# Inspect column dtypes and non-null counts (info() prints its report and returns None)
train.info()

The training data is primarily numeric and does not contain missing values.

In [None]:
test.info()

The test data is also primarily numeric and does not contain missing values.
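
The "no missing values" observation can also be checked numerically rather than read off the `info()` report. The following is a minimal sketch on a toy frame; the column names and values are illustrative assumptions, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train/test (names and values are illustrative only)
toy = pd.DataFrame({
    "deg_C": [13.1, 13.2, np.nan],
    "sensor_1": [1387.2, 1279.1, 1331.9],
})

# Count missing values per column; all zeros would mean the frame is complete
missing_per_column = toy.isna().sum()
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("any missing:", has_missing)
```

Applied to `train` and `test`, the same two calls should report zero for every column if the observation above holds.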

In [None]:
# Convert the date_time column into datetime format
train.date_time = pd.to_datetime(train.date_time)
test.date_time = pd.to_datetime(test.date_time)

In [None]:
dt = train.date_time.dt

In [None]:
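for df in [train, test]:
    # Take the hour from each frame's own timestamps; reusing train's `dt`
    # accessor here would copy training hours into the test set.
    df['hour'] = df.date_time.dt.hour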
    df['mrng'] = df['hour'].isin(np.arange(4,9,1)).astype('int')
    df['day'] = df['hour'].isin(np.arange(8,13,1)).astype('int')
    df['noon'] = df['hour'].isin(np.arange(12,19,1)).astype('int')
    df["weekday"] = (df.date_time.dt.dayofweek < 5).astype("int")

train = train.drop('hour', axis=1)
test = test.drop('hour', axis=1)    

In [None]:
# Set the date_time column as the index
train_set = train.set_index('date_time')
test_set = test.set_index('date_time')
train_set.head(2)

In [None]:
trgts = train_set.loc[:,['target_carbon_monoxide','target_benzene','target_nitrogen_oxides']]
train_set = train_set.drop(['target_carbon_monoxide','target_benzene','target_nitrogen_oxides'], axis=1)
train_set = pd.concat([train_set, trgts], axis=1)
train_set.head(5)

In [None]:
train_set.shape

In [None]:
# Check the summary statistics
train_set.describe()

The features differ in scale and spread.

In [None]:
sns.pairplot(train_set.iloc[:,0:8])
plt.show()

The pairplot shows that most of the features are strongly correlated with each other.

In [None]:
train_set.iloc[:,0:12].corr()

The above correlation matrix and the heatmap below agree with our observation that the features are strongly correlated. Such multicollinearity might lead to overfitting.
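
To make "strongly correlated" concrete, one option is to list the feature pairs whose absolute correlation exceeds a cut-off. This is a minimal sketch on synthetic data; the toy columns and the `threshold` of 0.9 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2.0 + rng.normal(scale=0.1, size=200),  # nearly collinear with "a"
    "c": rng.normal(size=200),                          # independent noise
})

corr = toy.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

threshold = 0.9  # hypothetical cut-off
high_pairs = pairs[pairs > threshold].sort_values(ascending=False)
print(high_pairs)
```

The same steps could be run on `train_set.iloc[:, 0:12]` to get a ranked list of the pairs that stand out in the heatmap.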

In [None]:
sns.heatmap(train_set.corr())
plt.show()

In [None]:
train_set.iloc[:,12:15].corr()

In [None]:
for i in range(12,15):
    print(train_set.iloc[:,i].mean())

In [None]:
sns.pairplot(train_set.iloc[:,12:15])
plt.show()

The targets also seem to be strongly positively correlated with each other. Their distributions are right-skewed (roughly log-normal in appearance), which suggests high variance and several outliers.
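
Right skew can be quantified with the sample skewness, and a log transform is a common way to reduce it. The sketch below uses synthetic log-normal values as a stand-in for a target column; it is an illustration under that assumption, not a result from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a target column
toy_target = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_raw = toy_target.skew()            # skewness of the raw values
skew_log = np.log1p(toy_target).skew()  # skewness after log(1 + x)

print(f"skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}")
```

This is also why the time-series and facet plots in this notebook put the target axis on a log scale.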

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(train_set[train_set.columns[range(0,8)]], alpha=.7)
ax.legend(train_set.columns[range(0,8)])
ax.set(yscale='log')
plt.show()

The graph shows a clear separation between the sensor readings and the environmental features. The sensor data seem steady over the time period, while the environmental data show clear swings.

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(train_set[train_set.columns[range(3,8)]], alpha=.7)
ax.legend(train_set.columns[range(3,8)])
plt.show()

Sensor_5 seems to be more widely spread over the period, while sensor_4 has higher values and sensor_3 has lower values. 

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(train_set[train_set.columns[range(12,15)]], alpha=.7)
ax.set(yscale='log')
ax.legend(train_set.columns[range(12,15)])
plt.show()

Nitrogen oxides seem to dominate the other targets and appear to rise slightly over time relative to them.

In [None]:
env_factors = train_set.columns[0:3]
sensors = train_set.columns[3:8]
time_segs = train_set.columns[8:12]
targets = train_set.columns[12:15]
cols = [env_factors,sensors,targets]
cols

In [None]:
for ss in cols:
    sns.boxplot(data=train_set[ss])
    plt.show()

The above boxplots confirm that relative humidity, sensor_5 and nitrogen oxides are more widely spread than others in their respective categories. 
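
The spread that a boxplot shows as box height is the interquartile range (IQR), so the comparison can be backed by numbers. A minimal sketch on toy columns (names and values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "narrow": [10, 11, 12, 13, 14],
    "wide": [0, 25, 50, 75, 100],
})

# IQR = 75th percentile minus 25th percentile, computed per column
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = (q3 - q1).sort_values(ascending=False)

print(iqr)
```

Running this per group, for example on `train_set[env_factors]`, would rank the columns within each category by spread.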

Let us create an initial tidy dataset.
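
Here "tidy" means long format: one row per (timestamp, target) pair instead of one column per target. The sketch below shows what `pd.melt` with `ignore_index=False` does on a tiny frame; the column names `target_a` and `target_b` are hypothetical.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "sensor_1": [1.0, 2.0],
        "target_a": [10.0, 20.0],
        "target_b": [30.0, 40.0],
    },
    index=pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00"]),
)

# Columns not listed in id_vars are stacked into (target, target_value) pairs;
# ignore_index=False keeps the datetime index, so each timestamp repeats per target
tidy = pd.melt(
    wide,
    id_vars=["sensor_1"],
    var_name="target",
    value_name="target_value",
    ignore_index=False,
)
print(tidy)
```

The two-row frame becomes four rows, one for each timestamp and target combination.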

In [None]:
train_set_melt_tgt = pd.melt(train_set,id_vars=train_set.columns[list(range(0,12))],var_name='target', value_name='target_value', ignore_index=False)
train_set_melt_tgt['time'] = train_set_melt_tgt[['mrng','day','noon']].idxmax(axis=1)
train_set_melt_tgt = train_set_melt_tgt.drop(['mrng','day','noon'], axis=1)
display(train_set_melt_tgt.head(2))
train_set_melt_tgt.shape

In [None]:
g = sns.FacetGrid(train_set_melt_tgt, col="target", row='time')
g.set(yscale='log')
g.map(sns.boxplot,'weekday', "target_value")

While all three targets have right-skewed distributions, as seen in the pairplot earlier, the spread of carbon monoxide and benzene appears fairly uniform compared with that of nitrogen oxides.

In [None]:
for column in train_set_melt_tgt.columns[range(0,8)]:
    g = sns.FacetGrid(train_set_melt_tgt, col="weekday", row='time', sharex=True, sharey=True, aspect = 3/2)
    # map_dataframe hands each facet its own subset of the data;
    # g.map does not subset a frame passed through data=...
    g.map_dataframe(sns.scatterplot, x=column, y='target_value', hue='target', alpha=.7)
    g.set(yscale='log')
    g.set_axis_labels(column, 'target_value')
    plt.show()

The above graphs suggest that the features are clearly correlated with each of the three targets separately.

Let us convert our dataset into an even tidier format.

In [None]:
train_set_melt_tgt.columns

In [None]:
train_set_melt_tgt_sens = pd.melt(train_set_melt_tgt,id_vars=train_set_melt_tgt.columns[[0,1,2,8,9,10,11]],var_name='sensor', value_name='sensor_value', ignore_index=False)
display(train_set_melt_tgt_sens.head(2))
train_set_melt_tgt_sens.shape

In [None]:
train_set_melt_tgt_sens.columns

In [None]:
train_set_melt_tgt_sens_env = pd.melt(train_set_melt_tgt_sens,id_vars=train_set_melt_tgt_sens.columns[[3,4,5,6,7,8]],var_name='env', value_name='env_value', ignore_index=False)
display(train_set_melt_tgt_sens_env.head(2))
train_set_melt_tgt_sens_env.shape

In [None]:
g = sns.FacetGrid(train_set_melt_tgt_sens_env, col="target", row="sensor",hue='time', margin_titles=True)
g.map_dataframe(sns.scatterplot, x="sensor_value", y="target_value", style='weekday', alpha=.3)
g.set_axis_labels("sensor_value","target_value")
g.set(yscale='log')
plt.legend(loc='best')
g.tight_layout()


While most of the sensors seem to have a clear positive correlation with all 3 targets, the remaining sensors seem to have no clear relationship with them.

In [None]:
g = sns.FacetGrid(train_set_melt_tgt_sens_env, col="target", row="env",hue='time', margin_titles=True)
g.map_dataframe(sns.scatterplot, x="env_value", y="target_value", style='weekday', alpha=.3)
g.set_axis_labels("env_value","target_value")
g.set(yscale='log')
g.set(xscale='log')
plt.legend(loc='best')
g.tight_layout()

While temperature is slightly negatively correlated with nitrogen oxides, relative humidity is relatively more strongly positively correlated. Other environmental factors do not seem to have much correlation with any of the targets.

In [None]:
g = sns.FacetGrid(train_set_melt_tgt_sens_env, col="weekday", row="env", margin_titles=True, aspect=3/2)
g.map_dataframe(sns.histplot, x="env_value", hue='time', alpha=.3, bins=100)
g.set_axis_labels("env_value")
plt.legend(loc='best')
g.tight_layout()

All the above graphs suggest that weekday status and the phase of the day are also clearly related to the targets.
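
One caveat when reading the phase-of-day facets: the `time` label comes from `idxmax` over the `mrng`, `day` and `noon` flags, and `idxmax` returns the first column when all flags are zero or when two flags tie. The sketch below reproduces the flag definitions on a few sample hours; the `"night"` label is a hypothetical refinement, not something the notebook uses.

```python
import numpy as np
import pandas as pd

hours = pd.Series([2, 6, 8, 10, 15, 22])

# Same hour ranges as the mrng/day/noon flags built earlier in the notebook
flags = pd.DataFrame({
    "mrng": hours.isin(np.arange(4, 9)).astype(int),
    "day": hours.isin(np.arange(8, 13)).astype(int),
    "noon": hours.isin(np.arange(12, 19)).astype(int),
})

# idxmax picks the first column holding the row maximum, so all-zero rows
# (hours 2 and 22) and the overlapping hour 8 all come out as "mrng"
label = flags.idxmax(axis=1)

# Hypothetical refinement: give rows with no flag set their own label
label_with_night = label.where(flags.any(axis=1), "night")

print(pd.DataFrame({"hour": hours, "idxmax": label, "refined": label_with_night}))
```

With the labels as built in this notebook, the `mrng` facets therefore also contain the hours outside 4 to 18.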

Finally, we prepare our training data, scale it, and run PCA to identify the top principal components.
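
A note on the scaling step: scikit-learn's `Normalizer` rescales each row (sample) to unit norm, whereas `StandardScaler` standardizes each column (feature) to zero mean and unit variance. The sketch below contrasts the two on a tiny array with made-up values. Which one suits PCA better depends on the goal, so this only illustrates the difference.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Tiny illustrative array: 3 samples (rows) x 2 features (columns)
X = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])

row_normed = Normalizer().fit_transform(X)      # each row scaled to L2 norm 1
col_scaled = StandardScaler().fit_transform(X)  # each column to mean 0, std 1

row_norms = np.linalg.norm(row_normed, axis=1)
col_means = col_scaled.mean(axis=0)
col_stds = col_scaled.std(axis=0)

print("row norms after Normalizer:", row_norms)
print("column means after StandardScaler:", col_means.round(6))
print("column stds after StandardScaler:", col_stds.round(6))
```

After `Normalizer`, the first two rows become identical because they point in the same direction; column-wise differences in scale are not removed.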

In [None]:
env_factors = train_set.columns.values.tolist()[0:3]
sensors = train_set.columns.values.tolist()[3:8]
time_segs = train_set.columns.values.tolist()[8:12]
targets = train_set.columns.values.tolist()[12:15]
cols= env_factors+sensors+time_segs
train_data=train_set[cols]
train_data.head(2)

In [None]:
# Note: Normalizer rescales each row (sample) to unit L2 norm; it does not standardize columns
normalizer = Normalizer()
norm_train = pd.DataFrame(normalizer.fit_transform(train_data), columns=train_data.columns)
norm_train.head(2)

In [None]:
pca = PCA()
pca_train = pca.fit_transform(norm_train)
pca_train.shape

In [None]:
print(abs(pca.components_))
print(pca.explained_variance_ratio_)

In [None]:
feature_imp = pd.DataFrame(pca.explained_variance_ratio_, columns=['pca_value'])
feature_imp['pca'] = [f'PC{i}' for i in range(1, len(feature_imp) + 1)]
feature_imp['cumsum'] = feature_imp.pca_value.cumsum()
feature_imp

In [None]:
fig, ax = plt.subplots(figsize=(15,7))
sns.barplot(x='pca',y='pca_value', data = feature_imp)
sns.lineplot(x='pca',y='cumsum', data = feature_imp)
sns.scatterplot(x='pca',y='cumsum', data = feature_imp, marker='o')
plt.show()

The first 3 principal components explain about 92% of the variance in the data.
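
The number of components needed to reach a given share of variance can be read directly from the cumulative explained variance ratio. The sketch below uses synthetic data with two dominant directions; the data and the `threshold` of 0.90 are assumptions for illustration, not the notebook's values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 2 latent directions mixed into 5 features, plus small noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.05, size=(500, 5))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.90  # hypothetical target share of variance
# argmax returns the first index where the condition is True
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative.round(3), n_components)
```

scikit-learn also accepts a float between 0 and 1 as `n_components`, for example `PCA(n_components=0.90)`, which keeps just enough components to reach that share.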