# The Data

We'll start this analysis by providing a basic overview of the data we have at hand.

## Credit

All data comes from Peter Larsson's website [Alltime Athletics](https://www.alltime-athletics.com), and he deserves full credit for the collection he curates.
You should go and check out his website; it's amazing!

The data is scraped and processed using [`alltime_athletics_python`](https://github.com/thomascamminady/alltime_athletics_python).

## A first glance at the data

To get an idea of the data that we will be dealing with, let's have a look at the first ten rows of the data frame.

In total, we have over 170,000 rows with 19 columns of data.

In [None]:
import altair as alt
import polars as pl
from alltime_athletics_python.io import import_running_only_events
from camminapy.plot.altair_config import altair_theme

from alltime_athletics_viz.show import show_df

# from alltime_athletics_python.io import download_data


alt.data_transformers.disable_max_rows()
altair_theme()

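# To download the data first, uncomment the lines below
# (this also needs `import os` and the commented-out `download_data` import above).
# if not os.path.exists("data"):
#     download_data()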
df = import_running_only_events("../data")

In [None]:
show_df(df.head(10))

We have data from 36 events, with varying numbers of entries in the database.

In [None]:
pl.Config.set_tbl_rows(100)
show_df(
    df.sort("distance")
    .groupby("event", "sex", maintain_order=True)
    .count()
    .pivot(index="event", columns="sex", values="count", aggregate_function="first")
    .fill_null(0)
)

Here's how the data splits up among the sexes.

In [None]:
show_df(df.groupby("sex").count())

There is data from standard and special events. 

Here are the standard events.

In [None]:
show_df(
    pl.DataFrame(
        df.filter(pl.col("event type") == "standard")["event"].unique(
            maintain_order=True
        )
    )
)

And here are the counts for the standard and special events.

In [None]:
show_df(df.groupby("event type").count())

Let's finish off this basic inspection by checking whether the world records look correct.


In [None]:
show_df(
    df.filter(pl.col("rank") == 1)
    .filter(pl.col("event type") == "standard")
    .select("event", "name", "result", "sex")
    .pivot(
        index="event",
        values=["name", "result"],
        columns="sex",
        aggregate_function="first",
    )
    .select(
        "event",
        "name_sex_female",
        "result_sex_female",
        "name_sex_male",
        "result_sex_male",
    )
)

This does indeed look right and even includes the most recent world record over the 1500m by Faith Kipyegon.