Datasets for Unit Tests

Each dataset is a CSV file with an associated metadata file (.yaml).

  • PUMS: A 1,000-row sample from PUMS (US Census Public Use Microdata Sample). The metadata has row_privacy set.
  • PUMS_pid: A 1,000-row sample from PUMS. Has an extra column, pid, a primary key that can be used to bound user contribution.
  • PUMS_large: A sample of 1.2 million records from PUMS. Includes a primary key (PersonId) and a slightly different schema.
  • PUMS_null: Same as PUMS_pid, with values randomly missing. Useful for testing nullable support.
  • iris: The standard iris dataset.
  • reddit: A collection of n-grams from Reddit posts.
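
To make "bound user contribution" concrete, here is a minimal sketch of the idea in plain Python: keep at most a fixed number of rows per key value. The function name `bound_contribution` and the sample rows are hypothetical, and this is only an illustration of the concept, not the mechanism the library itself uses.

```python
from collections import defaultdict


def bound_contribution(rows, key, max_rows):
    """Keep at most max_rows rows per distinct value of key, preserving order.

    Hypothetical helper for illustration only; `rows` is a list of dicts such
    as those produced by csv.DictReader.
    """
    kept = []
    seen = defaultdict(int)  # number of rows kept so far for each key value
    for row in rows:
        if seen[row[key]] < max_rows:
            seen[row[key]] += 1
            kept.append(row)
    return kept


# Made-up rows standing in for a file with a pid column.
sample = [
    {"pid": "1", "age": "30"},
    {"pid": "1", "age": "31"},
    {"pid": "2", "age": "40"},
]
bounded = bound_contribution(sample, key="pid", max_rows=1)
```

With `max_rows=1`, each pid contributes a single row, so the second row for pid 1 is dropped.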

Downloading Datasets

The datasets are downloaded automatically the first time you run the pytest tests under sql/. To download the test datasets without running the unit tests, do the following:

cd sql
pip install -r tests/requirements.txt
python tests/check_databases.py

You are encouraged to use these datasets in unit tests where the data can be read from a CSV. Some of these datasets are also loaded automatically into the SQL database engines installed in the engine-specific GitHub Actions images.
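
As a rough sketch of what such a unit test might look like, the example below reads a CSV with the standard library and checks the row count. The path `datasets/PUMS.csv` and the names `load_rows` and `test_pums_sample_size` are assumptions for illustration; adjust them to wherever the files were downloaded.

```python
import csv
from pathlib import Path

# Assumed location of the downloaded file; not confirmed by this README.
PUMS_CSV = Path("datasets") / "PUMS.csv"


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))


def test_pums_sample_size():
    # Imported here so the helper above stays usable without pytest installed.
    import pytest

    # Skip rather than fail when the datasets have not been downloaded yet.
    if not PUMS_CSV.exists():
        pytest.skip("PUMS.csv not found; download the datasets first")
    rows = load_rows(PUMS_CSV)
    # The README describes PUMS as a 1,000-row sample.
    assert len(rows) == 1000
```

Skipping when the file is missing keeps the test suite usable on a fresh checkout where the download step has not run yet.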