reductstore/datasets


Collection of free datasets hosted with ReductStore.

The goal of this repository is to provide a collection of free datasets that can be used for testing and benchmarking machine learning algorithms.

All datasets are hosted on ReductStore and can be downloaded using Reduct CLI or one of the client libraries.

Why ReductStore?

Although ReductStore is a time series database, we use it to store each dataset as a collection of records, with the timestamp serving as a unique identifier. This approach has the following advantages:

  • The database is fast and free, so you can mirror datasets on your own instance and use them locally.
  • You can download partial datasets.
  • You can access the datasets directly from Python, Rust, C++, or Node.js.
  • Annotations are available as a dictionary of labels, so there is no need to parse them manually.

Examples

Credentials to obtain the datasets:

  • Host: https://play.reduct.store
  • Bucket: datasets
  • API Token: dataset-read-eab13e4f5f2df1e64363806443eea7ba83406ce701d49378d2f54cfbf02850f5

Export data with Reduct CLI

You can export datasets to your local machine using Reduct CLI:

# Install the tool
pip install -U reduct-cli
# Add the ReductStore instance to aliases
rcli alias add play -L https://play.reduct.store -t dataset-read-eab13e4f5f2df1e64363806443eea7ba83406ce701d49378d2f54cfbf02850f5
# Download the dataset(s) specified in --entries. Each sample will have a JSON document with metadata and annotations.
rcli export folder play/datasets . --entries=<Dataset Name> --with-metadata

Export data with Python Client SDK

You can integrate ReductStore into your Python code and use the datasets directly:

import asyncio
from reduct import Client

HOST = "https://play.reduct.store"
API_TOKEN = "dataset-read-eab13e4f5f2df1e64363806443eea7ba83406ce701d49378d2f54cfbf02850f5"
DATASET = "cats"


async def main():
    client = Client(HOST, API_TOKEN)
    bucket = await client.get_bucket("datasets")
    async for record in bucket.query(DATASET):
        print(record.labels)
        jpeg = await record.read_all()
        # Do something with the JPEG image


if __name__ == "__main__":
    asyncio.run(main())
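The labels printed above arrive as a flat dictionary. As a minimal sketch, the `-x` / `-y` label pairs of the `cats` entry could be grouped into coordinate tuples as shown below. The helper name `parse_landmarks` is hypothetical, and the sketch assumes that label values are numbers encoded as strings; check a real record before relying on that.

```python
from typing import Dict, Mapping, Tuple


def parse_landmarks(labels: Mapping[str, str]) -> Dict[str, Tuple[float, float]]:
    """Group '<name>-x' / '<name>-y' labels into {name: (x, y)} pairs.

    Assumption: label values are numeric strings such as "175".
    Labels without a matching '-y' partner are skipped.
    """
    points: Dict[str, Tuple[float, float]] = {}
    for key, value in labels.items():
        if not key.endswith("-x"):
            continue
        name = key[:-2]  # strip the trailing "-x"
        y_key = f"{name}-y"
        if y_key in labels:
            points[name] = (float(value), float(labels[y_key]))
    return points


# Made-up label values, for illustration only
example_labels = {
    "left-eye-x": "175",
    "left-eye-y": "160",
    "mouth-x": "190",
    "mouth-y": "210",
}
print(parse_landmarks(example_labels))
```

Inside the query loop, the same helper could be called as `parse_landmarks(record.labels)`.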

Datasets

| Entry Name | Description | Data Type | Labels | Original Source | Export Script |
|---|---|---|---|---|---|
| cats | Over 9,000 images of cats with annotated facial features | jpeg | left-eye-x, left-eye-y, right-eye-x, right-eye-y, mouth-x, mouth-y, left-ear-1-x, left-ear-1-y, left-ear-2-x, left-ear-2-y, left-ear-3-x, left-ear-3-y, right-ear-1-x, right-ear-1-y, right-ear-2-x, right-ear-2-y, right-ear-3-x, right-ear-3-y | kaggle | export.py |
| mnist_training, mnist_test | MNIST handwritten digits | png | digit | MNIST | export.py |
| imdb | ~50,000 photos from IMDB with face location, age and gender | jpeg | dob, photo_taken, gender, name, face_location_{x,y,w,h}, face_score, second_face_score, celeb_names, celeb_id | IMDB-WIKI | export.py |
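The `face_location_{x,y,w,h}` notation for the `imdb` entry stands for four separate labels. A minimal sketch of turning them into a crop box follows. The helper name `face_box` is hypothetical, and the sketch assumes that `x` and `y` give the top-left corner, `w` and `h` give the width and height, and all values are numeric strings; none of this is confirmed here, so verify it against a real record.

```python
from typing import Mapping, Tuple


def face_box(labels: Mapping[str, str]) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) from the face_location_* labels.

    Assumptions: x/y are the top-left corner, w/h are width and height,
    and every value is a numeric string.
    """
    x, y, w, h = (float(labels[f"face_location_{k}"]) for k in ("x", "y", "w", "h"))
    return (x, y, x + w, y + h)


# Made-up label values, for illustration only
sample = {
    "face_location_x": "10",
    "face_location_y": "20",
    "face_location_w": "30",
    "face_location_h": "40",
}
print(face_box(sample))
```

The resulting tuple follows the (left, top, right, bottom) convention that many image libraries use for cropping.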
