In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Tabular Playground Series - Mar 2022

The Tabular Playground Series on Kaggle was developed to help newcomers to data science get acquainted with Kaggle competitions.

For the March edition of the 2022 Tabular Playground Series we're challenged to forecast twelve hours of traffic flow in a U.S. metropolis. The time series in this dataset are labelled with both location coordinates and a direction of travel -- a combination of features that will test your skill at spatio-temporal forecasting within a highly dynamic traffic network.
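
Because each series is identified by its coordinates plus a direction, it can help to combine the three into a single key. The sketch below uses a few made-up rows (not the competition data) with the same column layout; the `roadway` column name is a hypothetical choice for illustration.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of the dataset
toy = pd.DataFrame({
    "time": ["t0", "t0", "t0", "t1", "t1", "t1"],
    "x": [0, 0, 1, 0, 0, 1],
    "y": [0, 0, 2, 0, 0, 2],
    "direction": ["EB", "WB", "NB", "EB", "WB", "NB"],
    "congestion": [70, 49, 24, 68, 51, 30],
})

# One series per (x, y, direction) combination; "roadway" is a hypothetical name
toy["roadway"] = (
    toy["x"].astype(str) + "_" + toy["y"].astype(str) + "_" + toy["direction"]
)

# Reshape to one column per roadway, indexed by time
wide = toy.pivot(index="time", columns="roadway", values="congestion")
print(wide.shape)
```

In this wide layout each column is one time series, which is often a convenient starting point for per-roadway forecasting.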

Submissions are evaluated on the mean absolute error (MAE) between predicted and actual congestion values for each time period in the test set.
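
As a reminder of what this metric measures, here is a minimal sketch of the mean absolute error on made-up values. The helper name `mean_absolute_error` is our own; scikit-learn offers an equivalent function under the same name.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up congestion values, for illustration only
actual = [40, 55, 70, 35]
predicted = [42, 50, 65, 35]

mae = mean_absolute_error(actual, predicted)
print(mae)
```

Every error counts in proportion to its size, so a single large miss weighs less here than it would under a squared-error metric.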

<h2>Exploratory Data Analysis</h2>


<h3>Train dataset</h3>

Let's first explore the train dataset and then the test dataset.


In [None]:
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2022/train.csv')

In [None]:
train_df.head()

In [None]:
train_df.tail()

In [None]:
train_df.drop('row_id', axis=1, inplace=True)

<h3>1. Structure Investigation</h3>

Before checking the content of the dataframe just loaded, let's first verify the general structure of the dataset.


In [None]:
# Dataset shape 
train_df.shape

In [None]:
# Count columns per dtype (the top-level pd.value_counts is deprecated in recent pandas)
train_df.dtypes.value_counts()

<h4>1.1. Structure of non-numerical features</h4>

In [None]:
train_df.describe(exclude='number')

<h4>1.2. Structure of numerical features</h4>

In [None]:
unique_values = train_df.select_dtypes(include='number').nunique().sort_values()

unique_values.plot.bar(logy=True, figsize=(12, 7), title='Unique Values per feature')

In [None]:
train_df.info()

In [None]:
train_df.describe().T

<h4>1.3. Conclusion of structure investigation</h4>

The dataset has only 5 features and roughly 849k samples, with no missing values. Three features are of type int64 and two are of type object. One of the object features (<b>time</b>) is a timestamp with 13059 unique values, while the other (direction) has only 8 unique values.

The numerical features x and y each have fewer than 10 unique values, whereas the congestion feature (our target) has on the order of 10^2 unique values.
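
Since the time feature is stored as an object column, it usually needs to be parsed before it can be used. The sketch below shows one way to do that on a tiny made-up frame; the timestamps and the derived column names (`hour`, `minute`, `weekday`) are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Tiny made-up frame that mimics the columns described above
toy_df = pd.DataFrame({
    "time": ["2022-03-01 00:00:00", "2022-03-01 00:20:00", "2022-03-01 00:40:00"],
    "x": [0, 0, 1],
    "y": [0, 1, 2],
    "direction": ["EB", "NB", "SB"],
    "congestion": [70, 49, 24],
})

# Convert the object column to a proper datetime dtype
toy_df["time"] = pd.to_datetime(toy_df["time"])

# Derive simple calendar features from the parsed timestamps
toy_df["hour"] = toy_df["time"].dt.hour
toy_df["minute"] = toy_df["time"].dt.minute
toy_df["weekday"] = toy_df["time"].dt.dayofweek  # Monday = 0

print(toy_df.dtypes)
```

The same calls should apply to `train_df["time"]`, assuming its strings are in a format `pd.to_datetime` can parse.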

<h3>2. Quality Investigation</h3>

Before checking the content of these features, we first take a look at the general quality of the dataset.

<h4>2.1. Duplicates</h4>

In [None]:
n_duplicates = train_df.duplicated().sum()
print(f"There are {n_duplicates} duplicate samples.")

<h4>2.2. Missing values</h4>

<h5>2.2.1. Per sample</h5>

In [None]:
plt.figure(figsize=(10, 8))
plt.imshow(train_df.isna(), aspect="auto", interpolation="nearest", cmap="gray")
plt.xlabel("Column Number")
plt.ylabel("Sample Number");

In [None]:
msno.matrix(train_df, labels=True, sort="descending");

<h5>2.2.2. Per Feature</h5>

In [None]:
train_df.isna().mean().sort_values().plot(
    kind='bar', figsize=(12, 5),
    title="Share of missing values per feature",
    ylabel="Ratio of missing values per feature")

<h4>2.3. Unwanted entries and recording errors</h4>

<h5>2.3.1. Numerical features</h5>

In [None]:
train_df.plot(marker='.', subplots=True, layout=(-1, 4), figsize=(15, 5), markersize=1)

<h5>2.3.2. Non-numerical features</h5>

In [None]:
train_df.describe(exclude=['number']).T

In [None]:
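df_non_numerical = train_df.select_dtypes(exclude=['number'])

# One subplot row per non-numerical column, so no axis is left empty
fig, axes = plt.subplots(ncols=1, nrows=df_non_numerical.shape[1], figsize=(12, 8), squeeze=False)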

for col, ax in zip(df_non_numerical.columns, axes.ravel()):
    df_non_numerical[col].value_counts().plot(logy=True, title=col, lw=0, marker='.', ax=ax)
plt.tight_layout()

<h4>2.4. Conclusion of quality investigation</h4>

There are no missing values, and the observation counts per value of the non-numerical features are nearly equal.
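
One way to put a number on "nearly equal" is the ratio between the most and least frequent category. The sketch below does this on a made-up direction column; the variable names and values are illustrative only.

```python
import pandas as pd

# Made-up direction labels standing in for train_df["direction"]
directions = pd.Series(["EB"] * 5 + ["WB"] * 5 + ["NB"] * 4 + ["SB"] * 4, name="direction")

# Number of observations per category
counts = directions.value_counts()

# Ratio of the largest to the smallest count; 1.0 means perfectly balanced
imbalance = counts.max() / counts.min()
print(counts)
print(imbalance)
```

A ratio close to 1 supports the reading of the plots above, while a large ratio would point to under-represented categories.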

<h3>3. Content Investigation</h3>

Now we are going to take a look at the content of the data.

<h4>3.1. Feature distribution</h4>

In [None]:
train_df.hist(bins=25, figsize=(30, 5), layout=(-1, 5), edgecolor='black')
plt.tight_layout()

Interestingly, our response variable (congestion) looks approximately normal, with a mean of about 50.
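
A histogram can suggest normality without confirming it, so a quick numeric check is a useful complement. The sketch below compares mean, median, and skewness on a synthetic sample that stands in for `train_df["congestion"]`; the sample and its parameters are assumptions, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the target column (seeded for reproducibility)
rng = np.random.default_rng(0)
synthetic_target = pd.Series(rng.normal(loc=50, scale=15, size=10_000))

# For a roughly normal variable, mean and median are close and skewness is near 0
summary = {
    "mean": synthetic_target.mean(),
    "median": synthetic_target.median(),
    "skew": synthetic_target.skew(),
}
print(summary)
```

Running the same three statistics on the real target would show whether the visual impression holds.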

<h4>3.2. Feature patterns</h4>

In [None]:
train_df.plot(lw=0, marker=".", subplots=True, layout=(-1, 2), markersize=0.1, figsize=(15, 6))

<h4>3.3. Feature Relationship</h4>

In [None]:
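# Restrict to numerical columns; time and direction are object columns
df_corr = train_df.select_dtypes(include="number").corr(method="pearson")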

# Create labels for the correlation matrix
labels = np.where(np.abs(df_corr)>0.75, "S",
                  np.where(np.abs(df_corr)>0.5, "M",
                           np.where(np.abs(df_corr)>0.25, "W", "")))

# Plot correlation matrix
plt.figure(figsize=(15, 15))

sns.heatmap(df_corr, mask=np.eye(len(df_corr)),
            square=True,
            center=0,
            annot=labels,
            fmt='',
            linewidths=0.5,
            cmap="vlag",
            cbar_kws={"shrink": 0.8}
           )

<h2>Exploratory Data Analysis</h2>


<h3>Test dataset</h3>


In [None]:
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2022/test.csv')

In [None]:
test_df.head()

In [None]:
test_df.tail()

In [None]:
test_df.drop('row_id', axis=1, inplace=True)

<h3>1. Structure Investigation</h3>

Before checking the content of the dataframe just loaded, let's first verify the general structure of the dataset.


In [None]:
# Dataset shape 
test_df.shape

In [None]:
# Count columns per dtype (the top-level pd.value_counts is deprecated in recent pandas)
test_df.dtypes.value_counts()

<h4>1.1. Structure of non-numerical features</h4>

In [None]:
test_df.describe(exclude='number')

<h4>1.2. Structure of numerical features</h4>

In [None]:
unique_values = test_df.select_dtypes(include='number').nunique().sort_values()

unique_values.plot.bar(logy=True, figsize=(12, 7), title='Unique Values per feature')

In [None]:
test_df.info()

In [None]:
test_df.describe().T

<h4>1.3. Conclusion of structure investigation</h4>

The test dataset has only 4 features and just 2340 samples, with no missing values. Two features are of type int64 and two are of type object. One of the object features (<b>time</b>) is a timestamp; its number of unique values is bounded by the 2340 rows, so it is far smaller than the 13059 seen in the train set. The other object feature (direction) has only 8 unique values.

The numerical features x and y each have fewer than 10 unique values. The congestion feature (our target) is not part of the test set, since it is the quantity to be predicted.
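
A quick set comparison makes the difference between the two files explicit. The sketch below uses empty made-up frames that carry only the column names discussed in this notebook; with the real data, `train_df` and `test_df` would take their place.

```python
import pandas as pd

# Empty stand-ins that carry only the column names discussed above
train_toy = pd.DataFrame(columns=["time", "x", "y", "direction", "congestion"])
test_toy = pd.DataFrame(columns=["time", "x", "y", "direction"])

# Columns present in one frame but not in the other
only_in_train = sorted(set(train_toy.columns) - set(test_toy.columns))
only_in_test = sorted(set(test_toy.columns) - set(train_toy.columns))

print(only_in_train)
print(only_in_test)
```

If anything other than the target shows up in either list, the feature pipelines for the two sets would need to be reconciled before modelling.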

<h3>2. Quality Investigation</h3>

Before checking the content of these features, we first take a look at the general quality of the dataset.

<h4>2.1. Duplicates</h4>

In [None]:
n_duplicates = test_df.duplicated().sum()
print(f"There are {n_duplicates} duplicate samples.")

<h4>2.2. Missing values</h4>

<h5>2.2.1. Per sample</h5>

In [None]:
plt.figure(figsize=(10, 8))
plt.imshow(test_df.isna(), aspect="auto", interpolation="nearest", cmap="gray")
plt.xlabel("Column Number")
plt.ylabel("Sample Number");

In [None]:
msno.matrix(test_df, labels=True, sort="descending");

<h5>2.2.2. Per Feature</h5>

In [None]:
test_df.isna().mean().sort_values().plot(
    kind='bar', figsize=(12, 5),
    title="Share of missing values per feature",
    ylabel="Ratio of missing values per feature")

<h4>2.3. Unwanted entries and recording errors</h4>

<h5>2.3.1. Numerical features</h5>

In [None]:
test_df.plot(marker='.', subplots=True, layout=(-1, 4), figsize=(15, 5), markersize=1)

<h5>2.3.2. Non-numerical features</h5>

In [None]:
test_df.describe(exclude=['number']).T

In [None]:
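df_non_numerical = test_df.select_dtypes(exclude=['number'])

# One subplot row per non-numerical column, so no axis is left empty
fig, axes = plt.subplots(ncols=1, nrows=df_non_numerical.shape[1], figsize=(12, 8), squeeze=False)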

for col, ax in zip(df_non_numerical.columns, axes.ravel()):
    df_non_numerical[col].value_counts().plot(logy=True, title=col, lw=0, marker='.', ax=ax)
plt.tight_layout()

<h4>2.4. Conclusion of quality investigation</h4>

There are no missing values, and the observation counts per value of the non-numerical features are nearly equal.

<h3>3. Content Investigation</h3>

Now we are going to take a look at the content of the data.

<h4>3.1. Feature distribution</h4>

In [None]:
test_df.hist(bins=25, figsize=(30, 5), layout=(-1, 5), edgecolor='black')
plt.tight_layout()

The distributions are very similar to those of the train dataset.
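
The similarity between the two sets can also be checked numerically by comparing the share of each value in train and test. The sketch below does this for a made-up coordinate column; the series and the `max_gap` name are illustrative assumptions.

```python
import pandas as pd

# Made-up coordinate columns standing in for train_df["x"] and test_df["x"]
train_x = pd.Series([0, 0, 1, 1, 2, 2, 2, 2], name="x")
test_x = pd.Series([0, 1, 1, 2, 2], name="x")

# Share of each value in both sets, aligned on the value itself
shares = pd.DataFrame({
    "train": train_x.value_counts(normalize=True),
    "test": test_x.value_counts(normalize=True),
}).fillna(0.0).sort_index()

# Largest absolute difference in share across all values
max_gap = (shares["train"] - shares["test"]).abs().max()
print(shares)
print(round(max_gap, 3))
```

A small gap indicates that the two sets cover the values in similar proportions; a large one would hint at a shift between train and test.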

<h4>3.2. Feature patterns</h4>

In [None]:
test_df.plot(lw=0, marker=".", subplots=True, layout=(-1, 2), markersize=0.1, figsize=(15, 6))

<h4>3.3. Feature Relationship</h4>

In [None]:
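# Restrict to numerical columns; time and direction are object columns
df_corr = test_df.select_dtypes(include="number").corr(method="pearson")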

# Create labels for the correlation matrix
labels = np.where(np.abs(df_corr)>0.75, "S",
                  np.where(np.abs(df_corr)>0.5, "M",
                           np.where(np.abs(df_corr)>0.25, "W", "")))

# Plot correlation matrix
plt.figure(figsize=(15, 15))

sns.heatmap(df_corr, mask=np.eye(len(df_corr)),
            square=True,
            center=0,
            annot=labels,
            fmt='',
            linewidths=0.5,
            cmap="vlag",
            cbar_kws={"shrink": 0.8}
           )