Releases: SomeB1oody/dataset-core
Release v0.3.0 - Add More Methods and More Datasets Support
Release v0.3.0
This is a workspace release covering both crates. It graduates dataset-core to a
loader-on-construction design, slims the public utils surface, and grows dataset-ml
from six to ten built-in datasets.
| Crate | Previous | This release | crates.io |
|---|---|---|---|
| dataset-core | 0.2.0 | 0.3.0 | dataset-core = "0.3" |
| dataset-ml | 0.1.0 | 0.2.0 | dataset-ml = "0.2" |
Both crates are published independently.
dataset-ml 0.2.0 depends on dataset-core 0.3.0.
Highlights
- Dataset<T> → Dataset<T, E>: the loader is now supplied once at construction and stored on the container, so load() takes no arguments and the loader's error type is part of the type. (breaking)
- New cache-control and ownership methods on Dataset: set_loader, invalidate, get, get_mut, into_inner, take.
- Leaner utils API: create_temp_dir and file_sha256_matches (and the internal evaluate_storage) are no longer public; acquire_dataset is the single acquisition entry point. (breaking)
- Four new datasets in dataset-ml: Breast Cancer Wisconsin, Wine Recognition, Palmer Penguins, and California Housing.
- Serde-based CSV parsing for every dataset-ml loader, plus owned/borrowed cached-data accessors (into_data / take_data / get_data / get_data_mut).
dataset-core 0.3.0
⚠️ Breaking changes
1. The loader moves to construction time; Dataset<T> becomes Dataset<T, E>.
The loader closure is now stored on the struct by new(dir, loader). load() takes no arguments and runs the stored loader once, caching the result, and the loader's error type E is a second type parameter. The stored loader is Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>, so it must be Send + Sync + 'static (capture by value/clone, not by borrow). Dataset<T, E> stays Send + Sync whenever T is.

    // Before (0.2.x): one type parameter, loader passed at each load() call
    // (read_my_files is a placeholder for your own loading function)
    use dataset_core::Dataset;
    let ds: Dataset<Vec<String>> = Dataset::new("./data");
    let data = ds.load(|dir| read_my_files(dir))?;

    // After (0.3.0): loader stored at construction, E is part of the type, load() takes no args
    use dataset_core::Dataset;
    let ds: Dataset<Vec<String>, std::io::Error> =
        Dataset::new("./data", |dir| read_my_files(dir));
    let data = ds.load()?;

2. create_temp_dir and file_sha256_matches are no longer public.
In 0.2.x these were re-exported at the crate root (dataset_core::create_temp_dir,
dataset_core::file_sha256_matches) and reachable through dataset_core::utils::. They are now
private implementation details, and the internal evaluate_storage helper was folded away. Use
acquire_dataset, which performs temp-dir creation, SHA-256 verification, and the atomic rename
for you:

    // After (0.3.0): one cache-aware entry point instead of hand-composed helpers
    use dataset_core::{acquire_dataset, download_to};
    let file = acquire_dataset(dir, "data.csv", "MyDataset", Some(EXPECTED_SHA256), |tmp| {
        download_to(URL, tmp, None)?;
        Ok(tmp.join("data.csv"))
    })?;

Added
- Dataset::set_loader(&mut self, loader): replaces the stored loader and invalidates the cache (the next load lazily re-parses; no immediate I/O).
- Dataset::invalidate(&mut self): drops the cached value but keeps the loader (the next load re-runs it, e.g. after the underlying files change on disk).
- Dataset::into_inner(self) -> Option<T> and Dataset::take(&mut self) -> Option<T>: move the cached value out without cloning. into_inner consumes the container; take leaves it reusable (reset to unloaded). Both return None if never loaded; neither triggers loading.
- Dataset::get(&self) -> Option<&T> and Dataset::get_mut(&mut self) -> Option<&mut T>: access the cached value without triggering loading. get_mut allows in-place editing that persists in the cache. Both return None if never loaded.
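To make the semantics above concrete, here is a minimal, std-only sketch of a container with the same method names. It is a hypothetical stand-in written for illustration, not the dataset-core source: the field names, the Loader alias, and the use of OnceLock for the cache are all assumptions.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in that mirrors the method names in these notes.
// It is NOT the dataset-core implementation; real internals may differ.
type Loader<T, E> = Box<dyn Fn(&str) -> Result<T, E> + Send + Sync>;

pub struct Dataset<T, E> {
    dir: String,
    loader: Loader<T, E>,
    cache: OnceLock<T>,
}

impl<T, E> Dataset<T, E> {
    pub fn new(
        dir: &str,
        loader: impl Fn(&str) -> Result<T, E> + Send + Sync + 'static,
    ) -> Self {
        Self { dir: dir.to_string(), loader: Box::new(loader), cache: OnceLock::new() }
    }

    /// Runs the stored loader on first use, then serves the cached value.
    pub fn load(&self) -> Result<&T, E> {
        if let Some(value) = self.cache.get() {
            return Ok(value);
        }
        let value = (self.loader)(&self.dir)?;
        // In this sketch, if another thread filled the cache first, its value wins.
        Ok(self.cache.get_or_init(|| value))
    }

    /// Swap the loader and forget the cached value; no I/O happens here.
    pub fn set_loader(
        &mut self,
        loader: impl Fn(&str) -> Result<T, E> + Send + Sync + 'static,
    ) {
        self.loader = Box::new(loader);
        self.cache = OnceLock::new();
    }

    /// Forget the cached value but keep the loader.
    pub fn invalidate(&mut self) {
        self.cache = OnceLock::new();
    }

    pub fn get(&self) -> Option<&T> {
        self.cache.get()
    }

    pub fn get_mut(&mut self) -> Option<&mut T> {
        self.cache.get_mut()
    }

    /// Move the value out and leave the container unloaded but reusable.
    pub fn take(&mut self) -> Option<T> {
        self.cache.take()
    }

    /// Move the value out and consume the container.
    pub fn into_inner(self) -> Option<T> {
        self.cache.into_inner()
    }
}

fn main() {
    let mut ds: Dataset<Vec<String>, std::io::Error> =
        Dataset::new("./data", |dir| Ok(vec![format!("{dir}/a.csv")]));

    assert!(ds.get().is_none()); // nothing is cached before the first load()
    println!("loaded: {:?}", ds.load().unwrap());

    ds.get_mut().unwrap().push("extra".to_string()); // in-place edit persists
    println!("edited: {:?}", ds.get());

    ds.invalidate(); // the next load() re-runs the stored loader
    println!("reloaded: {:?}", ds.load().unwrap());

    let owned = ds.take(); // moves the value out; the container stays usable
    println!("taken: {owned:?}, cached now: {:?}", ds.get());
}
```

Resetting the cache only in methods that take &mut self keeps load(&self) free of interior mutation beyond the one-time initialisation, which is one plausible reason set_loader, invalidate, and take all require exclusive access.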
Changed
- download_to now validates the URL and strips any query string and fragment before deriving the output filename from the URL. An explicit filename argument is still used verbatim, and the public signature is unchanged.
- Raised the minimum ureq to 3.3.0 and thiserror to 2.0.18 (both within their existing major versions; utils feature only).
See crates/dataset-core/CHANGELOG.md for the full list.
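The download_to filename change can be illustrated with a small std-only helper. This is a sketch of the general idea only: filename_from_url is a hypothetical name, and the validation shown (requiring an http or https scheme and a non-empty last path segment) is an assumption, since these notes do not spell out the exact rules download_to applies.

```rust
/// Hypothetical helper: derive a filename from a URL while ignoring any
/// `?query` and `#fragment`. Not the dataset-core implementation.
fn filename_from_url(url: &str) -> Option<&str> {
    // Assumed validation: only accept http(s) URLs.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;

    // Cut at the first '?' or '#', whichever comes first.
    let end = rest.find(|c: char| c == '?' || c == '#').unwrap_or(rest.len());
    let without_suffix = &rest[..end];

    // Require a path after the host, then keep the last path segment.
    let (_host, path) = without_suffix.split_once('/')?;
    let name = path.rsplit('/').next()?;
    if name.is_empty() { None } else { Some(name) }
}

fn main() {
    let url = "https://example.com/files/data.csv?token=abc#top";
    println!("{:?}", filename_from_url(url));
}
```

Without the stripping step, the same URL would yield a name that still carries the query string, which is presumably the problem the change addresses.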
dataset-ml 0.2.0
Added — four new datasets
| Struct | Module path | Samples | Features | Task | Source |
|---|---|---|---|---|---|
| BreastCancer | dataset_ml::breast_cancer | 569 | 30 | Classification | UCI ML Repository |
| WineRecognition | dataset_ml::wine_recognition | 178 | 13 | Classification | UCI ML Repository |
| PalmerPenguins | dataset_ml::palmer_penguins | 344 | 7 | Classification | palmerpenguins R package |
| CaliforniaHousing | dataset_ml::california_housing | 20,640 | 8 | Regression | StatLib (1990 census) |
- BreastCancer: Breast Cancer Wisconsin (Diagnostic). 30 numeric features (mean/se/worst for 10 cell-nucleus measurements), &'static str diagnosis label ("malignant" / "benign").
- WineRecognition: scikit-learn's load_wine. 13 chemical-constituent features, &'static str cultivar label ("class_1" / "class_2" / "class_3"). Distinct from the wine_quality regression datasets.
- PalmerPenguins: mixed-type, like Titanic. features() returns (&Array2<String>, &Array2<f64>) and data() is a triple. Missing values (the literal token NA in the source) become NaN (numeric) or "" (string).
- CaliforniaHousing: the one loader that does feature engineering. It reproduces scikit-learn's fetch_california_housing features (AveRooms = total_rooms / households, etc.) from Géron's housing.csv and scales the target by 1/100000. The source's 207 missing total_bedrooms values surface as NaN in AveBedrms. A modern replacement for Boston Housing.
All four are sourced with pinned SHA-256 verification and re-exported at the crate root
(dataset_ml::BreastCancer, etc.).
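The CaliforniaHousing feature engineering described above can be sketched per row as follows. This is an illustration under assumptions, not the loader's code: the struct and field names are hypothetical, the raw column names follow housing.csv as commonly distributed, and AveOccup = population / households is taken from scikit-learn's definition rather than from these notes.

```rust
/// Raw per-row inputs (hypothetical struct; names assumed from housing.csv).
struct RawRow {
    total_rooms: f64,
    total_bedrooms: Option<f64>, // missing for some rows in the source
    population: f64,
    households: f64,
    median_house_value: f64,
}

/// Derived per-household features plus the scaled target.
struct Derived {
    ave_rooms: f64,
    ave_bedrms: f64,
    ave_occup: f64,
    target: f64,
}

fn derive(row: &RawRow) -> Derived {
    Derived {
        ave_rooms: row.total_rooms / row.households,
        // A missing bedroom count surfaces as NaN instead of dropping the row.
        ave_bedrms: row
            .total_bedrooms
            .map_or(f64::NAN, |beds| beds / row.households),
        ave_occup: row.population / row.households,
        // Target expressed in units of 100,000.
        target: row.median_house_value / 100_000.0,
    }
}

fn main() {
    let row = RawRow {
        total_rooms: 880.0,
        total_bedrooms: None,
        population: 322.0,
        households: 126.0,
        median_house_value: 452_600.0,
    };
    let d = derive(&row);
    println!(
        "AveRooms={:.3} AveBedrms={} AveOccup={:.3} target={:.3}",
        d.ave_rooms, d.ave_bedrms, d.ave_occup, d.target
    );
}
```

Because NaN propagates through arithmetic, downstream code that consumes AveBedrms needs to filter or impute those rows explicitly.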
Added — cached-data accessors
On every loader (Iris, BostonHousing, Diabetes, Titanic, RedWineQuality,
WhiteWineQuality, and the new loaders):
- into_data(self) / take_data(&mut self): return owned arrays without a to_owned() clone. into_data consumes the loader; take_data leaves it reusable (a later accessor reloads).
- get_data(&self) -> Option<&XData> / get_data_mut(&mut self) -> Option<&mut XData>: borrow or edit the cached tuple without triggering loading (None if not yet loaded).
These build on the new Dataset::into_inner / take / get / get_mut in dataset-core.
Changed
- Adapted to the loader-on-construction API: each loader's field is now Dataset<XData, DatasetError>, new passes Self::load_data to Dataset::new, and accessors call self.dataset.load(). The public API of each loader (Iris::new(dir), features(), labels(), data(), …) is unchanged.
- Serde-based CSV parsing: every loader defines a #[derive(Deserialize)] record struct and parses with csv::Reader::deserialize(), replacing manual per-field parsing and column-count checks. Records deserialize positionally, so parsing no longer depends on header spelling or a byte-order mark. Behavior (including Titanic's NaN for missing numerics) is unchanged.
- data() now returns a reference to the cached tuple (&IrisData, &TitanicData, …) instead of a tuple of references. Call-site destructuring (let (features, labels) = ds.data()?) is unchanged thanks to match ergonomics.
- Each loader's content type now has a named alias (IrisData, BostonHousingData, …, shared WineData).
- Added serde (with derive) as a direct dependency.
See crates/dataset-ml/CHANGELOG.md for the full list.
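The note about data() and match ergonomics can be checked with a toy example. ToyLoader and ToyData below are hypothetical and use Vec in place of the ndarray types the real loaders return; the point is only that destructuring a reference to a tuple binds references to its fields, so the call site looks the same for both return shapes.

```rust
// Hypothetical stand-ins; the real loaders cache ndarray arrays, not Vecs.
type ToyData = (Vec<f64>, Vec<&'static str>);

struct ToyLoader {
    cached: ToyData,
}

impl ToyLoader {
    /// New shape: a reference to the whole cached tuple.
    fn data(&self) -> Result<&ToyData, String> {
        Ok(&self.cached)
    }

    /// Old shape: a tuple of references.
    fn data_old(&self) -> Result<(&Vec<f64>, &Vec<&'static str>), String> {
        Ok((&self.cached.0, &self.cached.1))
    }
}

fn main() {
    let ds = ToyLoader { cached: (vec![5.1, 3.5], vec!["setosa"]) };

    // Destructuring `&(A, B)` with a tuple pattern binds `&A` and `&B`.
    let (features, labels) = ds.data().unwrap();
    let (features_old, labels_old) = ds.data_old().unwrap();

    // Both call sites end up with references to the same cached fields.
    assert!(std::ptr::eq(features, features_old));
    assert!(std::ptr::eq(labels, labels_old));
    println!("{} features, first label {}", features.len(), labels[0]);
}
```

Code that names the return type explicitly, rather than destructuring it, would still need updating to the reference-to-tuple form.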
Full dataset lineup (10)
| Struct | Samples | Features | Task |
|---|---|---|---|
| Iris | 150 | 4 | Classification |
| BreastCancer 🆕 | 569 | 30 | Classification |
| BostonHousing | 506 | 13 | Regression |
| CaliforniaHousing 🆕 | 20,640 | 8 | Regression |
| Diabetes | 768 | 8 | Classification |
| Titanic | 891 | 11 | Classification |
| PalmerPenguins 🆕 | 344 | 7 | Classification |
| WineRecognition 🆕 | 178 | 13 | Classification |
| RedWineQuality | 1,599 | 11 | Regression |
| WhiteWineQuality | 4,898 | 11 | Regression |
Upgrading

    # dataset-core only
    [dependencies]
    dataset-core = "0.3"

    # dataset-core with the download / unzip / SHA-256 helpers
    [dependencies]
    dataset-core = { version = "0.3", features = ["utils"] }

    # Built-in ML dataset loaders (pulls in dataset-core automatically)
    [dependencies]
    dataset-ml = "0.2"

If you use dataset-ml loaders only: bump the version — ...
v0.2.0 - Separating Architecture and Implementation
Release Notes — v0.2.0 (2026-05-27)
This release is a major restructuring of the project since v0.1.0: the repository has been split into a Cargo workspace. dataset-core now contains only the architecture layer, while a new companion crate dataset-ml houses all built-in dataset loaders. The two crates are published to crates.io independently.
| Crate | Version |
|---|---|
| dataset-core | 0.1.0 → 0.2.0 |
| dataset-ml | 0.1.0 (initial release) |
⚠️ Breaking Changes
- Workspace split: dataset-core now only ships Dataset<T>, the utils module, and the error module. All built-in dataset loaders have moved to the new dataset-ml crate.
- datasets feature removed: the former datasets feature on dataset-core is gone. Use dataset-ml instead.
- Import path changes (loaders moved to dataset-ml):

  | Old path (dataset-core 0.1.x) | New path (dataset-ml 0.1.0) |
  |---|---|
  | dataset_core::datasets::iris::Iris | dataset_ml::iris::Iris |
  | dataset_core::datasets::boston_housing::BostonHousing | dataset_ml::boston_housing::BostonHousing |
  | dataset_core::datasets::diabetes::Diabetes | dataset_ml::diabetes::Diabetes |
  | dataset_core::datasets::titanic::Titanic | dataset_ml::titanic::Titanic |
  | dataset_core::datasets::wine_quality::red_wine_quality::RedWineQuality | dataset_ml::wine_quality::red_wine_quality::RedWineQuality |
  | dataset_core::datasets::wine_quality::white_wine_quality::WhiteWineQuality | dataset_ml::wine_quality::white_wine_quality::WhiteWineQuality |

  There is no longer a datasets:: namespace: modules sit directly at the dataset_ml crate root, and every dataset struct is also re-exported at the crate root for convenience.
- utils function renames:
  - prepare_download_dir → evaluate_storage
  - download_dataset_with → acquire_dataset
- Download backend swap: replaced downloader with ureq. The download_to API was refactored and now supports an optional custom filename.
- Slimmer error payloads: DataFormatError no longer formats the offending record into the error message. Error output is more compact and avoids echoing raw data.
✨ Added
- Structured error handling: thiserror is now used to derive DatasetError / DataFormatErrorKind. Detailed variants, a consistent [dataset_name] ... prefix, and From impls for UreqError, ZipError, and std::io::Error mean ? just works inside loader closures.
- dataset-ml initial release: ships loaders for Iris, Boston Housing, Diabetes, Titanic, and Red / White Wine Quality. Wine Quality is split into red and white submodules that share parse_wine_data_to_array.
- Semantic tests across the board: dataset integration tests now assert value constraints, consistency checks, and finiteness, not just shapes.
- Documentation upgrades: each dataset module gained detailed module-level docs covering features, target variable, sample count, applications, and source.
- Chinese localization: README.zh-CN.md added for dataset-core, dataset-ml, and the workspace root.
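To show why From impls let ? work inside a loader, here is a hand-written, std-only equivalent of the kind of code thiserror generates. ToyError and its variants are hypothetical and much smaller than the real DatasetError; only the mechanism (a From<std::io::Error> impl plus a Display that uses a [dataset_name] prefix) is meant to carry over.

```rust
use std::fmt;
use std::fs;
use std::io;

/// Hypothetical error type; the real DatasetError has different variants.
#[derive(Debug)]
enum ToyError {
    Io(io::Error),
    Format { dataset: &'static str, reason: String },
}

impl fmt::Display for ToyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToyError::Io(e) => write!(f, "I/O error: {e}"),
            // Mirrors the "[dataset_name] ..." prefix convention.
            ToyError::Format { dataset, reason } => write!(f, "[{dataset}] {reason}"),
        }
    }
}

impl std::error::Error for ToyError {}

// This impl is what allows `?` to convert io::Error automatically.
impl From<io::Error> for ToyError {
    fn from(e: io::Error) -> Self {
        ToyError::Io(e)
    }
}

fn read_rows(path: &str) -> Result<Vec<String>, ToyError> {
    let text = fs::read_to_string(path)?; // io::Error becomes ToyError::Io
    if text.trim().is_empty() {
        return Err(ToyError::Format {
            dataset: "toy",
            reason: "file is empty".to_string(),
        });
    }
    Ok(text.lines().map(str::to_owned).collect())
}

fn main() {
    match read_rows("./does-not-exist.csv") {
        Ok(rows) => println!("{} rows", rows.len()),
        Err(e) => println!("load failed: {e}"),
    }
}
```

With thiserror, the Display and From impls above would come from attributes on the enum instead of being written by hand.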
🔧 Changed
- Dependency bumps: ureq → 3.3.0, thiserror → 2.0.18, zip → 8.5.1.
- Shared metadata (edition, rust-version, authors, license, repository) lifted into [workspace.package]; shared dependency versions live in [workspace.dependencies].
- Doctests that create files on disk are now marked no_run, so cargo test --doc no longer leaves stray artifacts behind.
- Removed redundant module-level docs from error.rs and stale markdown links in utils docs.
📦 Installation

    [dependencies]
    dataset-core = "0.2.0" # architecture layer: Dataset<T> + utils + error
    dataset-ml = "0.1.0" # add this only if you want the built-in loaders

If you only need the Dataset<T> container (zero external dependencies), no features are required. Enable features = ["utils"] to pull in acquire_dataset / download_to / unzip / SHA-256 helpers. dataset-ml transitively enables dataset-core/utils, so you don't need to configure it manually.
v0.1.0 - Initial Release
dataset-core v0.1.0
A generic, thread-safe dataset container with lazy loading and caching for Rust.
Note: This is an initial release. The API is not yet stable and may change in future versions.
Highlights
- Zero-dependency core: Dataset<T> pairs a storage directory with lazily-initialized data of any type. The first call to load() runs your closure and caches the result via OnceLock; every subsequent call returns the cached &T without re-running the closure, even across threads.
- Feature-gated modules: opt in to only what you need:

  | Feature | What it adds | Extra deps |
  |---|---|---|
  | (none) | Dataset<T> | none |
  | utils | download_to, unzip, create_temp_dir, file_sha256_matches, acquire_dataset, and the error module | ureq, zip, tempfile, sha2 |
  | datasets | 6 built-in ML dataset loaders (implies utils) | ndarray, csv |
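The lazy, thread-safe caching described in the first highlight can be approximated with the standard library as below. LazyDir is a hypothetical type, not the 0.1.0 Dataset<T>: the extra Mutex used to serialise the first initialisation is an assumption of this sketch, chosen because the fallible OnceLock initialiser is not available on stable Rust.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};
use std::thread;

/// Hypothetical lazy container: a directory plus a value loaded on demand.
struct LazyDir<T> {
    dir: String,
    cache: OnceLock<T>,
    init: Mutex<()>, // serialises the first (possibly slow) load
}

impl<T> LazyDir<T> {
    fn new(dir: &str) -> Self {
        Self { dir: dir.to_string(), cache: OnceLock::new(), init: Mutex::new(()) }
    }

    fn load<E>(&self, loader: impl FnOnce(&str) -> Result<T, E>) -> Result<&T, E> {
        // Fast path: already cached, no locking needed.
        if let Some(value) = self.cache.get() {
            return Ok(value);
        }
        let _guard = self.init.lock().unwrap();
        // Re-check: another thread may have finished while we waited.
        if let Some(value) = self.cache.get() {
            return Ok(value);
        }
        let value = loader(&self.dir)?;
        Ok(self.cache.get_or_init(|| value))
    }
}

fn main() {
    let ds = LazyDir::<String>::new("./cache");
    let runs = AtomicUsize::new(0);

    // Several threads race to load; the closure body executes only once.
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                let value = ds
                    .load(|dir| -> Result<String, std::io::Error> {
                        runs.fetch_add(1, Ordering::SeqCst);
                        Ok(format!("loaded from {dir}"))
                    })
                    .unwrap();
                assert_eq!(value.as_str(), "loaded from ./cache");
            });
        }
    });

    assert_eq!(runs.load(Ordering::SeqCst), 1);
    println!("loader ran {} time(s)", runs.load(Ordering::SeqCst));
}
```

A failed load leaves the cache empty in this sketch, so a later call can retry with the same or a different closure.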
Built-in Datasets
Six classic machine learning datasets, ready to use with a consistent API (new → features() / labels() / targets() / data()):
| Dataset | Samples | Features | Task |
|---|---|---|---|
| Iris | 150 | 4 | Classification |
| Boston Housing | 506 | 13 | Regression |
| Diabetes (Pima) | 768 | 8 | Classification |
| Titanic | 891 | 11 (mixed) | Classification |
| Wine Quality (Red) | 1,599 | 11 | Regression |
| Wine Quality (White) | 4,898 | 11 | Regression |
All datasets are automatically downloaded, cached locally, and validated with SHA-256 checksums.
Utility Functions (utils feature)
- download_to: download a remote file into a directory
- unzip: extract a ZIP archive
- create_temp_dir: create a self-cleaning temporary directory
- file_sha256_matches: verify a file's SHA-256 hash
- acquire_dataset: cache-aware dataset acquisition workflow (temp dir → prepare → optional hash check → move to final location)
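The acquisition workflow in the last item (temp dir → prepare → optional hash check → move to final location) can be outlined with the standard library alone. The function below is a hypothetical sketch, not acquire_dataset itself: its signature differs from the crate's, the verify callback stands in for the SHA-256 check (std has no SHA-256), and placing the scratch directory inside the target directory is an assumption made so the final rename stays on one filesystem.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical std-only outline of a cache-aware acquisition step.
fn acquire(
    final_dir: &Path,
    filename: &str,
    verify: Option<&dyn Fn(&Path) -> io::Result<bool>>,
    prepare: impl FnOnce(&Path) -> io::Result<PathBuf>,
) -> io::Result<PathBuf> {
    let target = final_dir.join(filename);
    if target.is_file() {
        return Ok(target); // cache hit: skip the prepare step entirely
    }

    fs::create_dir_all(final_dir)?;
    // Scratch directory; assumed to live next to the target.
    let tmp = final_dir.join(format!(".tmp-{}", std::process::id()));
    fs::create_dir_all(&tmp)?;

    let result = (|| -> io::Result<PathBuf> {
        let produced = prepare(&tmp)?;
        if let Some(check) = verify {
            if !check(&produced)? {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "verification failed",
                ));
            }
        }
        // Publish the file only after it has been fully written and checked.
        fs::rename(&produced, &target)?;
        Ok(target)
    })();

    let _ = fs::remove_dir_all(&tmp); // best-effort cleanup on success or failure
    result
}

fn main() -> io::Result<()> {
    let cache = std::env::temp_dir().join("acquire-sketch-demo");
    let path = acquire(&cache, "data.csv", None, |tmp| {
        let file = tmp.join("data.csv");
        fs::write(&file, "a,b\n1,2\n")?; // stands in for a download
        Ok(file)
    })?;
    println!("dataset available at {}", path.display());
    Ok(())
}
```

Writing into a scratch directory and renaming at the end means a reader never observes a half-written file at the final path, and a failed verification leaves the cache untouched.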
Requirements
- Rust edition 2024, MSRV 1.88.0
- License: MIT
Quick Start

    use dataset_core::Dataset;

    let ds = Dataset::<String>::new("./cache");
    let data = ds.load(|dir| Ok(std::fs::read_to_string(format!("{dir}/my_file.txt"))?))?;
    println!("{data}");

With built-in datasets:

    use dataset_core::datasets::Iris;

    let iris = Iris::new("./data");
    let (features, labels) = iris.data()?;
    println!("shape: {:?}, first label: {}", features.shape(), labels[0]);