Submitting Author: (@JacksonBurns)
All current maintainers: (@kspieks, @himaghna)
Package Name: astartes
One-Line Description of Package: Better Data Splits for Machine Learning
Repository Link: https://github.com/JacksonBurns/astartes
Version submitted: v1.1.2
Editor: @cmarmo
Reviewer 1: @BerylKanali
Reviewer 2: @du-phan
Archive: 
Version accepted: v1.1.3
JOSS DOI: 
Date accepted (month/day/year): 10/15/2023
Code of Conduct & Commitment to Maintain Package
Description
- Include a brief paragraph describing what your package does:
note: this is a selection from the abstract of the JOSS paper
Machine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets that are used to develop and evaluate models. Common practice in the literature is to assign these subsets randomly. Although this approach is fast and efficient, it only measures a model's capacity to interpolate. Testing errors from random splits may be overly optimistic if given new data that is dissimilar to the scope of the training set; thus, there is a growing need to easily measure performance for extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. Separate from astartes, users can then use these splits to better assess out-of-sample performance with any ML model of choice.
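To make the interpolation/extrapolation distinction above concrete, here is a minimal sketch in plain Python. It is not one of the samplers implemented in astartes: the function names (`distance_based_split`, `random_split`) and the "farthest from the centroid" rule are assumptions chosen only to illustrate how a distance-based partition can push the test set outside the region covered by the training set.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def distance_based_split(points, test_fraction=0.25):
    """Hold out the points farthest from the centroid as the test set.

    Hypothetical illustration only; not an astartes sampler.
    Returns (train_indices, test_indices).
    """
    center = centroid(points)
    # Sort indices from closest to farthest from the centroid.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    split = len(points) - int(round(test_fraction * len(points)))
    return order[:split], order[split:]


def random_split(n_points, test_fraction=0.25, seed=0):
    """Conventional random assignment, shown for comparison."""
    order = list(range(n_points))
    random.Random(seed).shuffle(order)
    split = n_points - int(round(test_fraction * n_points))
    return order[:split], order[split:]


# Synthetic 2-D data (assumed for the example; any feature vectors would do).
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

train_d, test_d = distance_based_split(points)
train_r, test_r = random_split(len(points))
```

With the distance-based split, every test point lies at least as far from the centroid as every training point, so a model evaluated on `test_d` has to predict outside the region it was trained on. The random split mixes both regions, so it mostly measures interpolation.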
Scope
Domain Specific & Community Partnerships
- [ ] Geospatial
- [ ] Education
- [ ] Pangeo
Community Partnerships
If your package is associated with an existing community please check below:
For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
- Who is the target audience and what are scientific applications of this package?
The target audience is data scientists, machine learning scientists, and domain scientists using machine learning. The applications of astartes include rigorous ML model validation, automated featurization of chemical data (with flexibility to add others, and instructions for doing so), and reproducibility.
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
We position astartes as a replacement for scikit-learn's train_test_split function, with greater flexibility in the choice of sampling algorithm and with an additional train_val_test_split function for more rigorous validation.
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:
N/A
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication Options
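JOSS Checks
- paper.md matching JOSS's requirements with a high-level description in the package root or in inst/ (on a separate joss-paper branch).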
Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.
Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.
Confirm each of the following by checking the box.
Please fill out our survey
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
The editor template can be found here.
The review template can be found here.