This project collects publicly available data for the Western States Endurance Run and formats it into {json:api} for ease of consumption.
The goal of this project is to provide:
- a raw, minimally processed dataset
- a normalized relational dataset
- CDN-based access to both datasets
- git/npm-based access to both datasets
- a SQLite seed of the normalized dataset
- TypeScript types for the data in both datasets
- request builders and schemas for both datasets for use with warp-drive.io
The raw dataset is the result of ingesting various public sources and transforming them into well-structured {json:api}. This dataset stores each source in isolation: type+id information is unique only within a given race year and data source.
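To illustrate what per-source isolation implies for consumers, here is a minimal sketch. The `Resource` interface follows the general {json:api} resource shape; the `Origin` interface, the `globalKey` helper, and the `"finisher"` type name are hypothetical and not part of this project.

```typescript
// Minimal {json:api} resource shape: a type, an id, and optional attributes.
interface Resource {
  type: string;
  id: string;
  attributes?: Record<string, unknown>;
}

// Hypothetical description of where a raw resource came from.
interface Origin {
  year: number;
  source: string; // assumed label for the data source, e.g. "finisher"
}

// type+id is only unique within one (year, source) pair, so a key that is
// unique across the whole raw dataset has to include both.
function globalKey(origin: Origin, resource: Resource): string {
  return [origin.year, origin.source, resource.type, resource.id].join("/");
}

const a = globalKey({ year: 2013, source: "finisher" }, { type: "finisher", id: "1" });
const b = globalKey({ year: 2014, source: "finisher" }, { type: "finisher", id: "1" });
console.log(a !== b); // true: same type+id, different race year
```

Merging raw files from several years or sources without such a composite key could silently collide records that share a type+id.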
The following data sources are currently available:
Important: In the URL and file path schemes below, replace {YYYY} with the desired year, e.g. 2013.
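The {YYYY} substitution can be sketched as a small helper. The function name `fillYear` and the four-digit range check are assumptions made for illustration; the two templates are the finishers source and output schemes listed in this document.

```typescript
// Hypothetical helper: replaces every {YYYY} placeholder with the given year.
function fillYear(template: string, year: number): string {
  // Assumption: years are four-digit integers.
  if (!Number.isInteger(year) || year < 1000 || year > 9999) {
    throw new RangeError(`expected a four-digit year, got ${year}`);
  }
  // split/join replaces all occurrences without needing a regular expression.
  return template.split("{YYYY}").join(String(year));
}

const sourceUrl = fillYear("https://www.wser.org/results/{YYYY}-results/", 2013);
const outputPath = fillYear("./data/raw/{YYYY}/finisher.json", 2013);

console.log(sourceUrl);  // https://www.wser.org/results/2013-results/
console.log(outputPath); // ./data/raw/2013/finisher.json
```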
finishers
- source: https://www.wser.org/results/{YYYY}-results/
- output: ./data/raw/{YYYY}/finisher.json

Tip: Some early years had starters but no finishers, and in some years the results include runners who finished slightly after the 30-hour mark, listed without a place. There are also a few finishers without a listed age in this data.
applicants
- source: https://www.wser.org/lottery{YYYY}.html
- output: ./data/raw/{YYYY}/applicant.json

Tip: Starting in 2020, the race began assigning each applicant an ID. We are not yet sure whether this ID is stable across years.
entrants
- source: https://www.wser.org/{YYYY}-entrants-list/
- output: ./data/raw/{YYYY}/entrant.json

Tip: The entrants list contains non-lottery entrant data as well as individuals who were selected from the waitlist. It does not fully represent the lottery outcome.
splits
- XLSX files from https://www.wser.org/splits/
- source: https://www.wser.org/wp-content/uploads/stats/wser{YYYY}.xlsx
- output: ./data/raw/{YYYY}/split.json
wait-list
- source: https://www.wser.org/{YYYY}-wait-list/
- output: ./data/raw/{YYYY}/waitlist.json

Tip: The 2020 waitlist became the 2021 waitlist, but it can still be useful for tracking who withdrew and did not roll over.
live (lottery outcome)
- source: https://lottery.wser.org/
- output: ./data/raw/{YYYY}/live-lottery-results.json

Tip: The live dataset can only be collected in the year of the given lottery. It can be useful for tracking the delta of who withdrew from the entrants list.
- Install bun from https://oven.sh
- Install dependencies:
  bun install

To run the script which ingests and processes the data as necessary:
bun run ./index.ts
This will scrape publicly available data from https://wser.org and store it in data/raw/. We keep this directory under git versioning and only scrape data when we don't already have an entry for it in the cache. Unless you are adding data for a new year, or adding ingestion for new sources or earlier years, this will likely do nothing. Setting the ENV var FORCE_GENERATE=true will cause the files in data/raw to rebuild. Note: they will rebuild from the responses stored in .fetch-cache when possible, see below.
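The regeneration rule described above can be sketched roughly as follows. This is a simplified illustration, not the project's actual code: the function name `planOutput` and the `Action` type are hypothetical, and the environment is passed in as a plain object so the logic stays easy to test.

```typescript
// Hypothetical result type: what to do for one output file in data/raw.
type Action = "skip" | "generate";

// Decides whether a raw output file should be (re)built.
// `outputExists` says whether the file is already present in data/raw;
// `env` is an environment-like object such as process.env.
function planOutput(
  outputExists: boolean,
  env: Record<string, string | undefined>
): Action {
  // FORCE_GENERATE=true rebuilds files even when they already exist.
  const force = env["FORCE_GENERATE"] === "true";
  return !outputExists || force ? "generate" : "skip";
}

console.log(planOutput(true, {}));                           // skip
console.log(planOutput(false, {}));                          // generate
console.log(planOutput(true, { FORCE_GENERATE: "true" }));   // generate
```

Under this sketch, a run with every output already present and no override plans nothing, which matches the "will likely do nothing" behavior described above.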
Additionally, we cache every successful raw fetch response we scrape in .fetch-cache. This allows us to write tests and work offline, ensures access to the data in the future should the wser site change, and further reduces server load, so we don't accidentally put a site we love under undue strain.
When fetching a page to scrape data from, use the GET method to participate in the .fetch-cache.
To bypass the fetch cache, set the ENV var FORCE_FETCH=true.
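The fetch-cache behavior described above can be sketched as a small decision function. Everything named here (`cacheDecision`, the `Decision` values) is hypothetical. The sketch also assumes that non-GET requests skip the cache entirely and that a forced fetch stores its fresh response, neither of which this document confirms.

```typescript
// Hypothetical outcomes for a single request.
type Decision = "read-cache" | "fetch-and-store" | "fetch-only";

// Decides how a request interacts with the .fetch-cache.
// `hasEntry` says whether a cached response already exists for the URL;
// `env` is an environment-like object such as process.env.
function cacheDecision(
  method: string,
  hasEntry: boolean,
  env: Record<string, string | undefined>
): Decision {
  // Assumption: only GET requests participate in the cache.
  if (method.toUpperCase() !== "GET") {
    return "fetch-only";
  }
  // FORCE_FETCH=true bypasses any cached entry.
  const bypass = env["FORCE_FETCH"] === "true";
  if (hasEntry && !bypass) {
    return "read-cache";
  }
  // Assumption: a fresh successful response is written back to the cache.
  return "fetch-and-store";
}

console.log(cacheDecision("GET", true, {}));                       // read-cache
console.log(cacheDecision("GET", true, { FORCE_FETCH: "true" }));  // fetch-and-store
console.log(cacheDecision("POST", true, {}));                      // fetch-only
```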
For the occasional data with no other available source, we keep manually created JSON files in data/manual.