In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import datetime


The goal of this notebook is to do some simple preliminary EDA and data visualization. Some of the features we produce here may be useful in the prediction stage.

In [None]:
df=pd.read_csv("../input/tabular-playground-series-mar-2022/train.csv")

In [None]:
df.head()

Let's examine the unique values of the x, y and direction columns.

In [None]:
print(df.x.unique(),df.y.unique(),df.direction.unique())

We note that x takes 3 unique values, while y takes 4 unique values. Additionally, there are 8 possible directions. Since there are only 65 roadways, not all 3$\cdot$4$\cdot$8=96 combinations of these values can occur. Let's combine some of the variables to identify the unique combinations.

In [None]:
df["xy"]=list(zip(df["x"],df["y"]))
df["xydir"]=list(zip(df["x"],df["y"],df["direction"]))
df["xy"].unique()

It seems that all possible combinations of the x and y values occur in the dataset, giving a total of 3$\cdot$4=12 unique values. This is not the case once direction is included: as shown below, not every location has all 8 directions, so some combinations of x, y and direction may be invalid.

In [None]:
for i in df["xy"].unique():
    print(i,":",df.loc[df["xy"]==i,"direction"].unique())
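To check the roadway count directly, one can count the distinct (x, y, direction) triples and compare them with the number of theoretically possible combinations. The sketch below runs on a small made-up frame (`toy`) rather than the competition data, and the helper name `count_roadways` is hypothetical.

```python
import pandas as pd

def count_roadways(frame: pd.DataFrame) -> tuple:
    """Return (observed, possible) counts of x/y/direction combinations."""
    # Distinct triples that actually occur in the data.
    observed = len(frame[["x", "y", "direction"]].drop_duplicates())
    # Upper bound if every x, y and direction value could be combined freely.
    possible = (frame["x"].nunique()
                * frame["y"].nunique()
                * frame["direction"].nunique())
    return observed, possible

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "x": [0, 0, 1, 1, 0],
    "y": [0, 0, 1, 1, 1],
    "direction": ["EB", "WB", "EB", "EB", "NB"],
})
observed, possible = count_roadways(toy)
print(observed, "of", possible, "combinations occur")
```

Calling `count_roadways(df)` on the training frame should report 65 observed combinations if the roadway count stated above holds.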

For convenience, let us convert the time variable from the string type to the datetime type. We then break down the time into convenient components, including the day of the week, the time of day component (timeonly), the date of the year, and the month.

In [None]:
df["time"]=df["time"].map(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d %H:%M:%S'))  # explicit format; %X is locale-dependent
df["day"]=df["time"].map(lambda x: x.strftime('%A'))
df["timeonly"]=df["time"].map(lambda x: x.time())
df["date"]=df["time"].map(lambda x: x.date())
df["month"]=df["time"].map(lambda x: x.month)
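As an alternative to the row-wise `map` calls, the same components can be extracted with pandas' vectorized `.dt` accessor, which is usually faster on large frames. This is a minimal sketch on made-up timestamps (`toy`), assuming the `time` column uses the `YYYY-MM-DD HH:MM:SS` string format.

```python
import pandas as pd

# Toy timestamps in the assumed string format (made up for illustration).
toy = pd.DataFrame({"time": ["2022-03-07 00:00:00", "2022-03-12 07:40:00"]})

toy["time"] = pd.to_datetime(toy["time"])   # parse strings to datetime64
toy["day"] = toy["time"].dt.day_name()      # weekday name, e.g. "Monday"
toy["timeonly"] = toy["time"].dt.time       # datetime.time objects
toy["date"] = toy["time"].dt.date           # datetime.date objects
toy["month"] = toy["time"].dt.month         # integer month, 1 to 12
print(toy)
```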

In [None]:
df.head()

We now produce boxplots for congestion over values of the spatial features.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(12, 10)
sns.boxplot(data=df,y="congestion",x="y",ax=axes[0][0])
sns.boxplot(data=df,y="congestion",x="x",orient="v",ax=axes[0][1])
sns.boxplot(data=df,y="congestion",x="direction",orient="v",ax=axes[1][0])
sns.boxplot(data=df,y="congestion",x="xy",orient="v",ax=axes[1][1])
plt.suptitle("Boxplots of congestion over spatial features",fontsize=16)
plt.show()

While the x and y coordinates individually do not show large differences in median congestion, some values have more outliers than others. The combined xy coordinate is more informative in this regard: (0,3) and (2,1) are the locations with the lowest median congestion, while (1,2) and (2,0) report higher medians. The outliers at some of these locations may indicate unusual congestion values caused by events or holidays.

In [None]:
daysofweek=["Monday","Tuesday", "Wednesday", "Thursday","Friday","Saturday","Sunday"]

Plotting congestion over the weekdays shows an expected trend: congestion falls on the weekends, especially Sundays when most schools and offices are closed.

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(15, 5)
sns.boxplot(data=df,y="congestion",x="day",order=daysofweek,ax=axes[0])  # Monday-to-Sunday order, matching the line plot
axes[1].plot(df.groupby(["day"]).congestion.mean().reindex(daysofweek))
plt.show()
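The `reindex(daysofweek)` call is needed because `groupby` sorts day names alphabetically. One way to avoid repeating it is to store `day` as an ordered categorical, so every later grouping follows calendar order. A minimal sketch on made-up data (`toy`):

```python
import pandas as pd

daysofweek = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "day": ["Sunday", "Monday", "Friday", "Monday"],
    "congestion": [20, 50, 60, 40],
})

# An ordered categorical makes groupby sort by calendar order, not alphabetically.
toy["day"] = pd.Categorical(toy["day"], categories=daysofweek, ordered=True)

# observed=True keeps only the days that actually appear in the data.
mean_by_day = toy.groupby("day", observed=True)["congestion"].mean()
print(mean_by_day)
```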

Plotting the average congestion over the time of day, we again see some expected trends. The least congested time of day is around 4am (red line). Congestion rises over the next few hours, spiking at around 7:40am (blue line), when children are likely to be dropped off at school and adults commute to work. Further, the worst congestion occurs during the infamous rush hour at about 5pm (green line).

In [None]:
dd=df.groupby("timeonly").congestion.mean().reset_index()
plt.figure(figsize=(18,5))
plt.plot(dd["timeonly"].map(lambda x: str(x)), dd["congestion"])
plt.xticks(rotation = 45)
morning=dd[dd["timeonly"]<datetime.time(10,20,0)]  # rows before 10:20, used to find the morning peak
plt.axvline(x=str(dd.loc[dd["congestion"].idxmin(),"timeonly"]), c="red")  # least congested time
plt.axvline(x=str(dd.loc[dd["congestion"].idxmax(),"timeonly"]), c="green")  # overall peak
plt.axvline(x=str(dd.loc[morning["congestion"].idxmax(),"timeonly"]), c="blue")  # morning peak
plt.title('Average congestion vs time of day', fontsize=20)
plt.xlabel('Time of day', fontsize=16)
plt.ylabel('Mean congestion', fontsize=16)
plt.show()

Breaking this plot down over the week, we note that most weekdays are similar. Mondays show lower averages during rush hours; this may be because long weekends pull the Monday averages down. Weekends show less congestion on the roads, with Sundays being less busy than Saturdays.

In [None]:
plt.figure(figsize=(20,10))
for x in daysofweek:
    dd=df[df["day"]==x].groupby("timeonly").congestion.mean().reset_index()
    plt.plot(dd["timeonly"].map(lambda x: str(x)), dd["congestion"], label=x)
plt.xticks(rotation = 45)
plt.title('Average congestion over the day for different days of the week', fontsize=20)
plt.xlabel('Time of day', fontsize=16)
plt.ylabel('Mean congestion', fontsize=16)
plt.legend()
plt.show()
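The per-day loop above can also be expressed as a single pivot table with one row per time of day and one column per weekday. The sketch below uses made-up data (`toy`) with times stored as strings; the name `profile` is only for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "day": ["Monday", "Monday", "Sunday", "Sunday", "Monday"],
    "timeonly": ["07:40:00", "17:00:00", "07:40:00", "17:00:00", "07:40:00"],
    "congestion": [60, 70, 30, 40, 50],
})

# Rows: time of day, columns: weekday, cells: mean congestion.
profile = toy.pivot_table(index="timeonly", columns="day",
                          values="congestion", aggfunc="mean")
print(profile)
```

Calling `profile.plot()` on such a table draws one line per column, which would reproduce the figure above without an explicit loop.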

Breaking this down by coordinates, we note that the (2,0) coordinate is by far the busiest and (0,3) is the least busy. Certain coordinates, such as (2,2), show lower average congestion than other highways during the day but spike during rush hours.

In [None]:
from itertools import cycle
lines = ["-","--","-.",":"]
linecycler = cycle(lines)
plt.figure(figsize=(20,10))

for x in df["xy"].unique():
    dd=df[df["xy"]==x].groupby("timeonly").congestion.mean().reset_index()
    plt.plot(dd["timeonly"].map(lambda x: str(x)), dd["congestion"], label=x, linestyle=next(linecycler))

plt.xticks(rotation = 45)
plt.title('Average congestion vs time of day for different coordinates', fontsize=20)
plt.xlabel('Time of day', fontsize=16)
plt.ylabel('Mean congestion', fontsize=16)
plt.legend()
plt.show()

Examining the average congestion over days of the year, we note that there is no visible upward or downward trend. 

In [None]:
plt.figure(figsize=(15,5))
plt.plot(df.groupby("date").congestion.mean())
plt.title('Average congestion trend over the year', fontsize=20)
plt.xlabel('Days of the year', fontsize=16)
plt.ylabel('Average congestion', fontsize=16)
plt.show()

Let us examine the relations between spatial dimensions. 

Plotting a heatmap of the average congestion over the x and y coordinates highlights more clearly what we noted before, namely that (2,0) is by far the busiest highway while (0,3) is the least congested.

In [None]:
plt.figure(figsize=(10,5))
sns.heatmap(pd.pivot_table(df.groupby(["x","y"]).congestion.mean().reset_index(), values="congestion", index="x", columns="y"),cmap="Reds", annot=True)
plt.title('Average congestion over x and y coordinates', fontsize=16)
plt.xlabel('y', fontsize=16)
plt.ylabel('x', fontsize=16)
plt.show()

Breaking this down by direction, we obtain even more information. For example, the heatmap below shows that, surprisingly, coordinate (2,3) has the least congested highway of all (SW) but also boasts the busiest one, along the SB direction. (2,0) WB is equally busy.

<a id='average_xy_direction'></a>

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(pd.pivot_table(df.groupby(["xy","direction"]).congestion.mean().reset_index(),values="congestion", index="xy", columns="direction"),cmap="Reds", annot=True)
plt.title('Average congestion over xy coordinates and direction', fontsize=20)
plt.xlabel('direction', fontsize=16)
plt.ylabel('xy', fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(pd.pivot_table(df.groupby(["xy","day"]).congestion.mean().reset_index(),values="congestion", index="xy", columns="day").reindex(daysofweek,axis=1),cmap="Reds", annot=True)
plt.title('Average congestion over xy coordinates and day of the week', fontsize=16)
plt.show()

Let us examine the (top 10) times of day that occur the most often with the maximum and minimum congestion. 

Unsurprisingly, the minimum congestion times occur during early hours of the morning. 

In [None]:
df.loc[df["congestion"]==np.min(df["congestion"]),:].timeonly.value_counts().iloc[:10]

However, strangely enough, the busiest times of day also seem to be between 2 and 4am.

In [None]:
df.loc[df["congestion"]==np.max(df["congestion"]),:].timeonly.value_counts().iloc[:10]

Let us plot the standard deviation of congestion over times of day. Interestingly, the 2 to 4am window noted above has the highest variability in congestion. Predicting congestion in this window may be more challenging than at other times of the day.

In [None]:
dd=df.groupby("timeonly").congestion.std().reset_index()
plt.figure(figsize=(18,5))
plt.plot(dd["timeonly"].map(lambda x: str(x)), dd["congestion"])
plt.xticks(rotation = 45)
plt.axvline(x=str(dd.loc[np.argmin(dd["congestion"]),"timeonly"]), c="red")
plt.axvline(x=str(dd.loc[np.argmax(dd["congestion"]),"timeonly"]), c="green")
plt.title('Standard deviation of congestion vs time of day', fontsize=20)
plt.xlabel('Time of day', fontsize=16)
plt.ylabel('Std of congestion', fontsize=16)
plt.show()
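To see why the spread matters alongside the mean, it can help to compute both per time of day and then pick out the noisiest slot. In the made-up example below (`toy`), two time slots have the same mean congestion but very different standard deviations.

```python
import pandas as pd

# Toy data, made up for illustration: same mean, different spread.
toy = pd.DataFrame({
    "timeonly": ["03:00:00"] * 3 + ["12:00:00"] * 3,
    "congestion": [0, 100, 50, 48, 50, 52],
})

# Mean and (sample) standard deviation of congestion per time of day.
spread = toy.groupby("timeonly")["congestion"].agg(["mean", "std"])

# Label of the time slot with the largest standard deviation.
noisiest = spread["std"].idxmax()
print(spread)
print("Most variable time of day:", noisiest)
```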

In [None]:
df.groupby("day").congestion.std().rename("std of congestion").reindex(daysofweek).reset_index()


For completeness, let's also visualize the standard deviation over the other variables. 

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(15, 10)
sns.barplot(x="day", y="std of congestion", data=df.groupby("day").congestion.std().rename("std of congestion").reindex(daysofweek).reset_index(), ax=axes[0][0])
sns.barplot(x="xy", y="std of congestion", data=df.groupby("xy").congestion.std().rename("std of congestion").reset_index(), ax=axes[0][1])
sns.barplot(x="direction", y="std of congestion", data=df.groupby("direction").congestion.std().rename("std of congestion").reset_index(), ax=axes[1][0])
sns.heatmap(pd.pivot_table(df.groupby(["xy","direction"]).congestion.std().reset_index(),values="congestion", index="xy", columns="direction"),cmap="Reds", annot=True, ax=axes[1][1])
plt.suptitle("Standard deviation of congestion over features",fontsize=20)
plt.show()

While there isn't much difference in variability over the days of the week, the spatial variables show clear patterns. As expected, (2,3) has a high degree of variability (as noted, it had both the highest and lowest average congestion values). Similarly, (0,1) and (2,2) show a high degree of variability, consistent with our [earlier plot](#average_xy_direction). This can be further broken down by direction, as highlighted by the heat map.
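One possible way to carry these observations into the prediction stage is to attach per-roadway statistics to each row as features. The sketch below is only an illustration on made-up data (`toy`); the column names `roadway_mean` and `roadway_std` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "x": [0, 0, 0, 1, 1],
    "y": [0, 0, 0, 1, 1],
    "direction": ["EB", "EB", "EB", "NB", "NB"],
    "congestion": [10, 20, 30, 40, 60],
})

keys = ["x", "y", "direction"]  # one group per roadway

# transform() broadcasts each group's statistic back onto its rows.
toy["roadway_mean"] = toy.groupby(keys)["congestion"].transform("mean")
toy["roadway_std"] = toy.groupby(keys)["congestion"].transform("std")
print(toy)
```

In a real pipeline, such statistics would need to be computed from training rows only and then merged onto the test rows, so that no target information leaks into the features.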

Hope this was useful to someone. Please leave feedback :)