### Problem:

Submissions carry only a date, not a time, so it is not possible to work out exactly which submissions were made after the deadline. That is presumably why the Submissions table includes a handy `IsAfterDeadline` flag. The flag looks correct for submissions dated *after* the deadline date given in the Competitions table, but it is also set to true, apparently incorrectly, for some early submissions.

Another possible explanation is that `IsAfterDeadline` is correct and the submission dates are wrong.

Either way, hope this helps find the issue...

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import HTML, display

IN_DIR = os.path.join('..', 'input', 'meta-kaggle')

comps = pd.read_csv(os.path.join(IN_DIR, 'Competitions.csv'))
comps['DeadlineDate'] = pd.to_datetime(comps.DeadlineDate)
comps['EnabledDate'] = pd.to_datetime(comps.EnabledDate)
comps['Days'] = (comps.DeadlineDate - comps.EnabledDate) / pd.Timedelta(1, 'd')
comps['FinalWeek'] = (comps.DeadlineDate - pd.Timedelta(1, 'w'))
comps.shape

In [2]:
pd.read_csv(os.path.join(IN_DIR, 'Teams.csv'), nrows=4).columns

In [3]:
teams = pd.read_csv(os.path.join(IN_DIR, 'Teams.csv'), usecols=['Id', 'CompetitionId'])
teams.shape

In [4]:
pd.read_csv(os.path.join(IN_DIR, 'Submissions.csv'), nrows=4).columns

In [5]:
subs = pd.read_csv(os.path.join(IN_DIR, 'Submissions.csv'), usecols=['Id', 'TeamId', 'SubmissionDate', 'ScoreDate', 'IsAfterDeadline'])
subs.shape

In [6]:
subs['SubmissionDate'] = pd.to_datetime(subs.SubmissionDate)

In [7]:
subs['CompetitionId'] = subs.TeamId.map(teams.set_index('Id').CompetitionId)

In [8]:
subs['DeadlineDate'] = subs.CompetitionId.map(comps.set_index('Id').DeadlineDate)

`DeadlineDiff` is the submission date minus the deadline, in days. Positive values mean the submission is dated after the deadline.

In [9]:
subs['DeadlineDiff'] = (subs['SubmissionDate'] - subs['DeadlineDate']) / pd.Timedelta(1, 'd')
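
To make the sign convention concrete, here is a minimal sketch on made-up values. The deadline time of 23:59 and the three dates are assumptions chosen for illustration, not values read from Meta Kaggle. Because a date-only submission is parsed as midnight, a submission dated on the deadline day comes out slightly negative and one dated the next day comes out slightly positive.

```python
import pandas as pd

# Toy values invented for illustration; not taken from Meta Kaggle.
deadline = pd.Timestamp('2017-07-10 23:59:00')  # assumed deadline time
toy = pd.DataFrame({
    'SubmissionDate': pd.to_datetime(['2017-07-03', '2017-07-10', '2017-07-11']),
})

# Same formula as the notebook: submission date minus deadline, in days.
toy['DeadlineDiff'] = (toy['SubmissionDate'] - deadline) / pd.Timedelta(days=1)

# What the dates alone say about lateness.
toy['DateSaysLate'] = toy['DeadlineDiff'] > 0
print(toy)
```

Only the last row is late according to its date. Any row where `IsAfterDeadline` is true but `DeadlineDiff` is clearly negative is the kind of disagreement this notebook is about.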

In [10]:
subs.count()

# Mercedes

In [11]:
comp = comps[comps.Slug.str.startswith('mercedes')].squeeze()
comp_id = comp.Id

In [12]:
merc = subs.query(f'CompetitionId=={comp_id}')
merc.shape

In [13]:
merc.groupby('IsAfterDeadline').size()

In [14]:
merc.groupby('IsAfterDeadline').SubmissionDate.agg(['min','max'])

In [15]:
merc.groupby(['SubmissionDate', 'IsAfterDeadline']).size().unstack().head(55)
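
The per-date table can also be condensed into a direct count of contradictions. The sketch below is one way to do that, assuming a frame with the same `IsAfterDeadline` and `DeadlineDiff` columns as `merc`. The helper name `flag_vs_date` and the toy rows are hypothetical.

```python
import pandas as pd

def flag_vs_date(df):
    """Cross-tabulate the stored flag against the date-based comparison."""
    date_says_late = (df['DeadlineDiff'] > 0).rename('DateSaysLate')
    return pd.crosstab(df['IsAfterDeadline'], date_says_late)

# Toy rows invented for illustration; in the notebook this would be `merc`.
toy = pd.DataFrame({
    'IsAfterDeadline': [False, False, True, True, True],
    'DeadlineDiff': [-30.0, -2.0, -25.0, 0.5, 10.0],
})

table = flag_vs_date(toy)
print(table)

# Rows flagged as post-deadline although they are dated before the deadline.
suspicious = toy[toy['IsAfterDeadline'] & (toy['DeadlineDiff'] < 0)]
print(suspicious)
```

Rows with a missing `DeadlineDiff` are left out of `suspicious`, because a comparison with NaN evaluates to false.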

# Plot

Plot each team's submissions over time: normal submissions in blue, post-deadline submissions (according to `IsAfterDeadline`) in red.

In [16]:
plt.rc("figure", figsize=(12, 10))
plt.rc("font", size=12)

In [17]:
def colors(df):
    return np.where(df['IsAfterDeadline'], 'red', 'blue')

In [18]:
merc.plot.scatter('SubmissionDate', 'TeamId', c=colors(merc), alpha=.1, title=comp.Title);

# Zoom In

In [19]:
# Keep submissions dated no later than 21 days after the deadline
m1 = merc.query('DeadlineDiff<=21')
m1.plot.scatter('SubmissionDate', 'TeamId', c=colors(m1), alpha=.1, title=comp.Title);

Submission `Id` behaves much like a timestamp, so it can stand in for the missing time of day. Plotting against `Id` shows the same effect.

In [20]:
m1 = merc.query('DeadlineDiff<=21')
m1.plot.scatter('Id', 'TeamId', c=colors(m1), alpha=.1, title=comp.Title);
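
Whether `Id` really tracks time can be checked rather than assumed. Here is a minimal sketch, assuming `Id` and `SubmissionDate` columns as in `subs`. The helper `ids_follow_dates` and the toy data are hypothetical.

```python
import pandas as pd

def ids_follow_dates(df):
    """True if sorting by Id never makes SubmissionDate go backwards."""
    return bool(df.sort_values('Id')['SubmissionDate'].is_monotonic_increasing)

# Toy data invented for illustration.
toy = pd.DataFrame({
    'Id': [101, 102, 103, 104],
    'SubmissionDate': pd.to_datetime(
        ['2017-06-01', '2017-06-01', '2017-06-02', '2017-06-05']),
})

# Same Ids with the dates reversed, so the ordering is deliberately broken.
reversed_dates = toy.assign(
    SubmissionDate=toy['SubmissionDate'].iloc[::-1].to_numpy())

in_order = ids_follow_dates(toy)
out_of_order = ids_follow_dates(reversed_dates)
print(in_order, out_of_order)
```

If the check fails on the real data for a competition, that would point towards wrong submission dates rather than a wrong `IsAfterDeadline` flag.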

# Large Competitions

In [21]:
comps = comps.set_index('Id')

In [22]:
comps.nlargest(20, 'TotalTeams')[['Slug', 'Title', 'DeadlineDate', 'TotalTeams']]

In [23]:
THRES = 3000

In [24]:
for comp_id, subset in subs.groupby('CompetitionId'):
    if comp_id not in comps.index: # KeyError: 23099 for "SIIM-ISIC Melanoma Classification" ?
        continue
    comp = comps.loc[comp_id]
    if comp.TotalTeams < THRES:
        continue
    window = subset.query('DeadlineDiff<=21')
    
    markup = (
        '<h1 id="{Slug}">{Title}</h1>'
        '<p>'
        'Type: {HostSegmentTitle} &mdash; <i>{Subtitle}</i>'
        '<br/>'
        '<a href="https://www.kaggle.com/c/{Slug}/leaderboard">Leaderboard</a>'
        '<br/>'
        'Dates: <b>{EnabledDate}</b> &mdash; <b>{DeadlineDate}</b>'
        '<br/>'
        '<b>{TotalTeams}</b> teams; <b>{TotalCompetitors}</b> competitors; '
        '<b>{TotalSubmissions}</b> submissions'
        '<br/>').format(**comp)

    display(HTML(markup))
    window.plot.scatter('SubmissionDate', 'TeamId', c=colors(window), alpha=.1, title=comp.Title)
    plt.show()
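
The loop silently skips any `CompetitionId` that has no row in `comps`. A small helper can list those ids so they can be inspected separately. This is a sketch assuming `comps` is indexed by `Id` as above. The function name `missing_competition_ids` and the toy frames are hypothetical.

```python
import pandas as pd

def missing_competition_ids(subs, comps):
    """CompetitionIds used in subs that have no row in comps (indexed by Id)."""
    ids = subs['CompetitionId'].dropna().drop_duplicates()
    return sorted(ids[~ids.isin(comps.index)].tolist())

# Toy frames invented for illustration.
toy_comps = pd.DataFrame({'Title': ['A', 'B']},
                         index=pd.Index([1, 2], name='Id'))
# None mimics a submission whose team could not be mapped to a competition.
toy_subs = pd.DataFrame({'CompetitionId': [1, 1, 2, 99, None]})

missing = missing_competition_ids(toy_subs, toy_comps)
print(missing)
```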