datasets

Latest commit: Add the sample datasets to Git LFS so they can be cloned (#562) · 32384d3 · May 12, 2023

Name	Last commit message	Last commit date
dataverse	Removed service from repo and tests, datasets are up one level (#381)	Jun 18, 2021
evaluation	Removed service from repo and tests, datasets are up one level (#381)	Jun 18, 2021
.gitattributes	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
PUMS.csv	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
PUMS.yaml	Allow hyphen in names, allow three-part names (#458)	Mar 30, 2022
PUMS_dup.csv	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
PUMS_dup.yaml	Allow hyphen in names, allow three-part names (#458)	Mar 30, 2022
PUMS_dup_twotable.yaml	Fix issue with conflicting max_ids (#549)	Apr 24, 2023
PUMS_dup_twotable_reverse.yaml	Fix issue with conflicting max_ids (#549)	Apr 24, 2023
PUMS_large.csv	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
PUMS_large.yaml	Allow hyphen in names, allow three-part names (#458)	Mar 30, 2022
PUMS_null.csv	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
PUMS_pid.csv	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
PUMS_pid.yaml	Allow hyphen in names, allow three-part names (#458)	Mar 30, 2022
PUMS_two_table.yaml	SQLAlchemy 2.0 and Pandas 2.0 (#551)	Apr 26, 2023
README.md	Update README.md	Nov 8, 2022
askreddit.csv.zip	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
clean_askreddit.csv	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
create_example_dataset.py	Removed service from repo and tests, datasets are up one level (#381)	Jun 18, 2021
d1.csv	Removed service from repo and tests, datasets are up one level (#381)	Jun 18, 2021
d2.csv	Removed service from repo and tests, datasets are up one level (#381)	Jun 18, 2021
example.csv	Removed service from repo and tests, datasets are up one level (#381)	Jun 18, 2021
example.yaml	Removed service from repo and tests, datasets are up one level (#381)	Jun 18, 2021
iris.csv	Test with Python 3.10 and 3.11 (#407)	Oct 12, 2021
iris.yaml	More test dbs (#409)	Oct 18, 2021
reddit.csv	Add the sample datasets to Git LFS so they can be cloned (#562)	May 12, 2023
reddit.yaml	More test dbs (#409)	Oct 18, 2021
simulation.csv	Removed service from repo and tests, datasets are up one level (#381)	Jun 18, 2021

README.md

Datasets for Unit Tests

CSV files with associated metadata (.yaml).

PUMS: A 1,000-row sample from PUMS (US Census Public Use Microdata Sample). Metadata has row_privacy set.
PUMS_pid: A 1,000-row sample from PUMS with an extra column, pid, a primary key that can be used to bound user contribution.
PUMS_large: A sample of 1.2 million records from PUMS, with a primary key (PersonId) and a slightly different schema.
PUMS_null: Same as PUMS_pid, with values randomly missing. Useful for testing nullable support.
iris: The standard iris dataset.
reddit: A collection of n-grams from Reddit posts.
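
Since each dataset is a CSV file with a .yaml metadata file of the same name next to it, a small helper can list which CSV files in a directory have metadata and which do not (in the listing above, for instance, PUMS_null.csv appears without a matching .yaml). The following is a minimal sketch using only the standard library; the function names find_dataset_pairs and missing_metadata are hypothetical and not part of the repository.

```python
from pathlib import Path


def find_dataset_pairs(directory):
    """Map each CSV stem to a (csv_path, yaml_path_or_None) tuple.

    Assumes the metadata file shares the CSV's base name and uses
    the .yaml extension, as in PUMS.csv / PUMS.yaml.
    """
    directory = Path(directory)
    pairs = {}
    for csv_path in sorted(directory.glob("*.csv")):
        yaml_path = csv_path.with_suffix(".yaml")
        pairs[csv_path.stem] = (
            csv_path,
            yaml_path if yaml_path.exists() else None,
        )
    return pairs


def missing_metadata(directory):
    """Return the stems of CSV files that have no .yaml file beside them."""
    return [
        name
        for name, (_, yaml_path) in find_dataset_pairs(directory).items()
        if yaml_path is None
    ]
```

Calling missing_metadata on the directory that holds these files would return the names of datasets that a metadata-driven test cannot use directly.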

Downloading Datasets

The datasets are downloaded automatically the first time you run the pytest tests under sql/. To download the test datasets without running the unit tests, run the following:

cd sql
pip install -r tests/requirements.txt
python tests/check_databases.py

You are encouraged to use these datasets in unit tests where the data can be accessed from a CSV. Some of these datasets are also loaded automatically into the SQL database engines installed into engine-specific GitHub Actions images.
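
As an illustration of reading one of these CSV files in a unit test, here is a minimal sketch that uses only the standard library. The helper name load_csv_rows is hypothetical, and the sketch assumes the file has a header row and has already been downloaded (a file that is still a Git LFS pointer would not parse as the expected data).

```python
import csv
from pathlib import Path


def load_csv_rows(path):
    """Read a CSV file with a header row into a list of dicts.

    All values are returned as strings; convert them in the test
    if numeric comparisons are needed.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

In a test, a check such as `assert len(load_csv_rows(path_to_pums_csv)) == 1000` would confirm the row count stated for PUMS above, where path_to_pums_csv is whatever path your test setup resolves to PUMS.csv.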
