---
jupyter: python3
title: Getting to know the data
---

In [14]:
import polars as pl
import numpy as np
from IPython.display import Markdown
from tabulate import tabulate

np.random.seed(1)

Let's start by reading in the data and printing five random rows to get a sense of what we're dealing with.

In [31]:
# | column: page

df = pl.read_csv("../data/data.csv", separator=";", try_parse_dates=True)
# NumPy's seed does not control Polars sampling, so pass a seed explicitly.
df.sample(n=5, seed=1)

Rank,Mark,Competitor,DOB,Nat,Pos,Venue,Date,Results Score,Mark [meters or seconds],Event,Wind,Sex
i64,str,str,date,str,str,str,date,i64,f64,str,str,str
1732,"""2.3""","""Mathew SAWE""",1988-07-02,"""KEN""","""1""","""Stephen Keshi …",2018-08-03,1179,2.0,"""High Jump""",,"""male"""
3970,"""20.38""","""Alonso EDWARD""",1989-12-08,"""PAN""","""2""","""Icahn Stadium,…",2013-05-25,1161,20.3,"""200 Metres""","""0.9""","""male"""
6659,"""57.98""","""Moonika AAVA""",1979-06-19,"""EST""","""5""","""Tartu (EST)""",2001-06-19,1041,57.9,"""Javelin Throw""",,"""female"""
5014,"""3:36.38""","""Vincent ROUSSE…",1962-07-29,"""BEL""","""4""","""Köln (GER)""",1985-08-25,1156,216.3,"""1500 Metres""",,"""male"""
1421,"""1:21:00""","""Luke ADAMS""",1976-10-22,"""AUS""","""5.0""","""Kraków (POL)""",2007-06-23,1169,4860.0,"""20 Kilometres …",,"""male"""


The columns are mostly self-explanatory:

- `Rank` is the rank of that performance for the given sex and event.
- `Mark` is the raw, unparsed result for that performance. Depending on the event, it is a time (seconds, minutes, or hours), a distance or height in meters (e.g. long jump), or a points total (e.g. decathlon).
- `Competitor`, `DOB`, and `Nat` are the competitor's name, date of birth, and nationality.
- `Pos` is the position the competitor finished in at the competition where the performance was achieved.
- `Venue` and `Date` specify where and when the performance was achieved.
- `Results Score` is the score that World Athletics assigns to the performance.
- `Mark [meters or seconds]` is my attempt to parse `Mark` into a `float`, in seconds or meters.
- `Event` is the event name.
- `Wind`, if available, is the wind reading for that performance. It matters mostly for sprints and jumps.
- `Sex` is either female or male.
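
To make the relationship between `Mark` and `Mark [meters or seconds]` concrete, here is a minimal sketch of how such a conversion could look. It is not the parser that was actually used to build the column; the function name `parse_mark` is hypothetical, and it assumes that colon-separated marks are `[hours:]minutes:seconds` and that everything else is already a plain number. Suffixes or annotations that may occur in the real data are not handled.

```python
def parse_mark(mark: str) -> float:
    """Convert a raw mark such as "3:36.38" or "57.98" into a float.

    Colon-separated marks are read as [hours:]minutes:seconds and returned
    in seconds. Plain numbers (meters, points, or seconds) pass through.
    """
    total = 0.0
    # Each additional colon-separated field shifts the running total by a
    # factor of 60 (hours -> minutes -> seconds).
    for part in mark.strip().split(":"):
        total = total * 60 + float(part)
    return total


# Marks taken from the sample rows above.
for raw in ["57.98", "3:36.38", "1:21:00"]:
    print(raw, "->", parse_mark(raw))
```

For the 20 km race walk row, `"1:21:00"` becomes `4860.0` seconds, which matches the value shown in the parsed column.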

## Some basic counts

Below are some basic counts of the data.

In [38]:
print("Shape of the dataframe:")
df.shape

Shape of the dataframe:


(463847, 13)

In [39]:
print("Counts of male and female performances:")
df.groupby("Sex").count()

Counts of male and female performances:


Sex,count
str,u32
"""female""",227660
"""male""",236187

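As a quick sanity check, the two counts above should add up to the number of rows reported by `df.shape`. A minimal sketch in plain Python, with the numbers copied by hand from the outputs above (the variable names are only illustrative):

```python
# Counts copied from the output above; in the notebook they would come from df.
counts = {"female": 227_660, "male": 236_187}
n_rows = 463_847  # first entry of df.shape

# Every row should belong to exactly one of the two groups.
assert sum(counts.values()) == n_rows

# Share of each sex in percent.
shares = {sex: 100 * n / n_rows for sex, n in counts.items()}
for sex, share in shares.items():
    print(f"{sex}: {share:.1f}%")
```

The split is close to even, with slightly more male than female performances.
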

In [119]:
print("Performance count by sex and event, colored:")
(
    df.groupby("Sex", "Event")
    .count()
    .pivot(index="Event", columns="Sex", values="count", aggregate_function=None)
    .sort("female", descending=True)
    .to_pandas()
    .style.format(precision=0)
    .background_gradient(vmax=35_000)
    # .to_markdown()
)

Performance count by sex and event, colored:


Unnamed: 0,Event,male,female
0,Hammer Throw,12984.0,33647.0
1,100 Metres,24875.0,26983.0
2,200 Metres,15005.0,22067.0
3,Pole Vault,16388.0,15670.0
4,3000 Metres Steeplechase,9665.0,15125.0
5,Javelin Throw,7564.0,13449.0
6,800 Metres,7283.0,11597.0
7,400 Metres,8189.0,10867.0
8,1500 Metres,9107.0,9456.0
9,20 Kilometres Race Walk,3773.0,9275.0


In [106]:
import altair as alt
from camminapy.plot import altair_theme

print("Count of performances grouped by year (starting 1960):")

altair_theme()
alt.Chart(
    df.with_columns(pl.col("Date").dt.year())
    .groupby("Date", "Sex")
    .count()
    .filter(pl.col("Date") >= 1960)  # include 1960 itself, as the caption says
    .sort("Date")
    .to_pandas()
).mark_bar(clip=True).encode(
    x=alt.X("Date:N").axis(labelAngle=-90, values=list(range(1960, 2024, 2))),
    y="count:Q",
    color=alt.Color("Sex:N").scale(
        domain=["female", "male"], range=["purple", "green"]
    ),
).properties(
    height=300, width=550
)

Count of performances grouped by year (starting 1960):


It's interesting to see the COVID-19 pandemic show up in this data as well.

In [117]:
print("The top 10 events with the most performances:")
(
    df.groupby("Event")
    .count()
    .sort("count", descending=True)
    .head(10)
    .to_pandas()
    .style.background_gradient(subset="count")
)

The top 10 events with the most performances:


Unnamed: 0,Event,count
0,100 Metres,51858
1,Hammer Throw,46631
2,200 Metres,37072
3,Pole Vault,32058
4,3000 Metres Steeplechase,24790
5,Shot Put,23407
6,400 Metres Hurdles,22480
7,Javelin Throw,21013
8,400 Metres,19056
9,800 Metres,18880
