The goal of this challenge is to predict traffic congestion at 20-minute intervals, for each location and direction, on 09/30/1991 from noon to midnight, based on historical congestion patterns.

In [None]:
import os
import datetime
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

In [None]:
root = '/kaggle/input/tabular-playground-series-mar-2022'

train_df = pd.read_csv(os.path.join(root, 'train.csv'))
train_df['datetime'] = pd.to_datetime(train_df.time)
train_df['date'] = train_df.datetime.dt.date
train_df['time'] = train_df.datetime.dt.time

test_df = pd.read_csv(os.path.join(root, 'test.csv'))
test_df['datetime'] = pd.to_datetime(test_df.time)
test_df['date'] = test_df.datetime.dt.date
test_df['time'] = test_df.datetime.dt.time

In [None]:
train_df

## Visualization

Since 09/30/1991 is a Monday, we'll look at Monday traffic and see how 09/30 differs from the rest of the dataset.

### Congestion trends over time

In [None]:
sep_30 = datetime.date(1991, 9, 30)

# Copy the slice so adding a column does not trigger SettingWithCopyWarning.
mondays = train_df[train_df.datetime.dt.dayofweek == 0].copy()
mondays['is_morning'] = mondays.datetime.dt.hour < 12

mondays[mondays.datetime.dt.date < sep_30].groupby('date').congestion.mean().plot()
plt.title('Congestion by date')
plt.ylabel('avg daily congestion')
plt.show()

The two biggest outliers are Labor Day and Memorial Day. Because many people have those days off, they behave more like weekend days than standard Mondays, so we exclude them.
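
One way to flag such dates programmatically, rather than reading them off the plot, is to rank each day by how far its mean sits from the median. The sketch below uses made-up daily averages (the numbers and the `daily` series are illustrative assumptions, not values from the dataset).

```python
import pandas as pd

# Hypothetical daily mean congestion for a handful of Mondays (made-up values).
daily = pd.Series(
    [48.0, 47.5, 39.0, 48.5, 38.0, 47.0],
    index=['1991-05-13', '1991-05-20', '1991-05-27',
           '1991-08-26', '1991-09-02', '1991-09-09'],
)

# Rank dates by absolute deviation from the median daily mean.
deviation = (daily - daily.median()).abs()
outliers = deviation.nlargest(2)
print(outliers)
```

On the real data, `daily` would be the `groupby('date').congestion.mean()` series plotted above.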

In [None]:
labor_day = datetime.date(1991, 9, 2)
memorial_day = datetime.date(1991, 5, 27)

mondays = mondays[
    (mondays.date != labor_day) & (mondays.date != memorial_day)]

mondays[mondays.datetime.dt.date < sep_30].groupby('date').congestion.mean().plot()
plt.title('Congestion by date')
plt.ylabel('avg daily congestion')
plt.tight_layout()
plt.show()

## Morning vs. afternoon traffic

In [None]:
mondays[mondays.is_morning].groupby('date').congestion.mean().plot(label='Morning')
mondays[~mondays.is_morning].groupby('date').congestion.mean().plot(label='Afternoon')
plt.title('Congestion by date')
plt.ylabel('avg daily congestion')
plt.legend()
plt.tight_layout()
plt.show()

On average, congestion is higher in the afternoon than in the morning. We can take this into account in our model.
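
To put a number on that gap instead of judging it from the plot, the two halves of the day can be averaged directly. This is a minimal sketch on a toy frame with made-up congestion values; the `toy` frame and the `half` column are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the `mondays` frame (congestion values are made up).
toy = pd.DataFrame({
    'datetime': pd.to_datetime([
        '1991-09-16 08:00', '1991-09-16 17:00',
        '1991-09-23 08:00', '1991-09-23 17:00',
    ]),
    'congestion': [40, 55, 42, 60],
})

# Label each reading as morning or afternoon, mirroring `is_morning`.
toy['half'] = (toy.datetime.dt.hour < 12).map({True: 'morning', False: 'afternoon'})

# Mean congestion per half of the day, and the difference between them.
means = toy.groupby('half').congestion.mean()
gap = means['afternoon'] - means['morning']
print(means)
print('afternoon minus morning:', gap)
```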

## Average daily traffic

### By location

In [None]:
for (x, y), G in mondays.groupby(['x', 'y']):
    G.boxplot(
        by='time',
        column='congestion',
        rot=90,
        figsize=(12, 5))
    plt.title('{}, {}'.format(x, y))
    plt.tight_layout()
    plt.show()


### By direction

In [None]:
for direction, G in mondays.groupby('direction'):
    G.boxplot(
        by='time',
        column='congestion',
        rot=90,
        figsize=(12, 5))
    plt.title(direction)
    plt.tight_layout()
    plt.show()

Different locations and directions have quite different traffic patterns. We will have to take these into account to make accurate predictions.

Congestion also varies widely even when we restrict the data to Mondays. Taking the morning traffic level into account should make the afternoon predictions more accurate.

# Model

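We fit one k-nearest-neighbors (KNN) regressor per location and direction. Each model takes a day's morning congestion readings as input and predicts that day's afternoon readings.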
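
To make the idea concrete, here is a minimal sketch of multi-output KNN regression on made-up numbers: each row is one hypothetical day, the inputs are three morning readings, and the targets are two afternoon readings. The arrays and `n_neighbors=2` are assumptions chosen to keep the toy small; the notebook itself uses the scikit-learn defaults (5 neighbors, uniform weights).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Three hypothetical days: morning readings (inputs) and afternoon readings (targets).
X_toy = np.array([[10, 12, 14], [30, 32, 34], [50, 52, 54]], dtype=float)
Y_toy = np.array([[20, 22], [40, 42], [60, 62]], dtype=float)

# With a 2-D target, the regressor predicts every afternoon slot at once.
knn = KNeighborsRegressor(n_neighbors=2).fit(X_toy, Y_toy)

# A new morning whose two nearest neighbors are the first two days.
pred = knn.predict(np.array([[18.0, 20.0, 22.0]]))
print(pred)  # average of the two nearest days' afternoons
```

The prediction is simply the average afternoon profile of the most similar mornings, which is why mornings with unusually heavy traffic lead to heavier afternoon predictions.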

In [None]:
# Train only on Mondays before the target date.
train = mondays[mondays.datetime.dt.date < sep_30]
models = {}

# Fit one model per (location, direction): morning readings -> afternoon readings.
for (x, y, direction), G in train.groupby(['x', 'y', 'direction']):
    morning_data = G[G.is_morning]
    afternoon_data = G[~G.is_morning]
    # One row per date, one column per 20-minute time slot.
    X = morning_data.pivot(index='date', columns='time', values='congestion').reset_index().drop(columns=['date'])
    Y = afternoon_data.pivot(index='date', columns='time', values='congestion').reset_index().drop(columns=['date'])
    model = KNeighborsRegressor()
    models[(x, y, direction)] = model.fit(X, Y)

In [None]:
sep_30_data = mondays[mondays.datetime.dt.date == sep_30]
# Afternoon time slots, sorted to match the column order of the pivoted targets.
t = sorted(mondays[mondays.datetime.dt.hour >= 12].time.unique())

frames = []
for (x, y, direction), G in sep_30_data.groupby(['x', 'y', 'direction']):
    morning_data = G[G.is_morning]
    X = morning_data.pivot(index='date', columns='time', values='congestion').reset_index().drop(columns=['date'])
    # predict returns one row per input day; there is only one day here.
    pred = models[(x, y, direction)].predict(X)[0]
    frames.append(pd.DataFrame({
        'date': '09/30/1991',
        'time': t,
        'x': x,
        'y': y,
        'direction': direction,
        'congestion': pred
    }))

# Concatenate once instead of growing a DataFrame inside the loop.
inference = pd.concat(frames, ignore_index=True)
inference

In [None]:
submissions = test_df.merge(inference, on=['x', 'y', 'direction', 'time'])[['row_id', 'congestion']]

In [None]:
submissions.to_csv('submissions.csv', index=False)
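
Because `merge` defaults to an inner join, any test row without a matching prediction would be dropped silently. A quick sanity check before uploading might look like the sketch below; `check_submission` is a hypothetical helper, and the toy frames stand in for `submissions` and `test_df`.

```python
import pandas as pd

def check_submission(submission, test):
    """Return True if every test row_id appears exactly once with a non-null prediction."""
    same_ids = sorted(submission.row_id) == sorted(test.row_id)
    no_gaps = bool(submission.congestion.notna().all())
    return same_ids and no_gaps

# Toy frames standing in for `test_df` and `submissions` (values are made up).
toy_test = pd.DataFrame({'row_id': [100, 101, 102]})
complete = pd.DataFrame({'row_id': [100, 101, 102], 'congestion': [41.0, 52.5, 38.0]})
missing_row = complete.iloc[:2]

print(check_submission(complete, toy_test))
print(check_submission(missing_row, toy_test))
```

On the real data this would be called as `check_submission(submissions, test_df)`.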