ArchDataPy is a lightweight Python package for accessing archaeological datasets from R package archives in Python. It can download registered CRAN source packages, extract their .rda data files, and load those files with pyreadr. It also includes a small dataset registry for direct access to selected datasets as pandas DataFrames.
- Download registered R package archives without needing R installed.
- List available package sources, including archdata and folio.
- Load .rda files into Python using pyreadr.
- Load selected datasets directly from the dataset registry as pandas DataFrames.
- Use custom sources by passing a CRAN archive URL or local .tar.gz package archive.
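To make the "extract the .rda data files" step concrete, here is a minimal sketch using only the standard library. It is not ArchDataPy's actual implementation: the helper names (build_demo_archive, extract_rda_manifest), the demopkg archive, and its placeholder file contents are all assumptions made for illustration. The sketch builds a tiny stand-in archive so it can run offline, then maps each dataset name to the path of its extracted .rda file.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def build_demo_archive(archive_path: Path) -> None:
    """Write a tiny stand-in for a CRAN source archive (placeholder bytes, not real .rda data)."""
    members = {
        "demopkg/DESCRIPTION": b"Package: demopkg\n",
        "demopkg/data/Acheulean.rda": b"placeholder",
        "demopkg/data/MaskSite.rda": b"placeholder",
        "demopkg/R/helpers.R": b"# not a dataset\n",
    }
    with tarfile.open(archive_path, "w:gz") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def extract_rda_manifest(archive_path: Path, dest_dir: Path) -> dict[str, Path]:
    """Copy out only the .rda members and map dataset name -> extracted file path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.lower().endswith(".rda"):
                # Write under the base name only, so archive paths cannot escape dest_dir
                target = dest_dir / Path(member.name).name
                target.write_bytes(tar.extractfile(member).read())
                manifest[target.stem] = target
    return manifest


workdir = Path(tempfile.mkdtemp())
archive = workdir / "demopkg_1.0.0.tar.gz"
build_demo_archive(archive)
manifest = extract_rda_manifest(archive, workdir / "extracted")
print(sorted(manifest))  # ['Acheulean', 'MaskSite']
```

Non-data files such as DESCRIPTION and the R sources are skipped, which is why only the two dataset names appear in the result.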
You can install ArchDataPy from PyPI:
pip install archdatapy
For local development, clone the repository and install it in editable mode:
git clone https://github.com/wccarleton/archdatapy.git
cd archdatapy
pip install -e .
This package requires the following Python libraries:
requests
pyreadr
pandas
These dependencies are automatically installed when you install the package.
The package ships with a registry in package_registry.json. Registered package keys currently include archdata and folio.
from archdatapy import list_available_packages
print(list_available_packages())
The get_archdata function accepts either:
- a registry key for a known CRAN package, or
- a direct package archive URL or local archive path.
It returns a manifest mapping dataset names to .rda file paths, along with package metadata.
from archdatapy import get_archdata
# Download the default registered package, archdata
manifest = get_archdata()
print(manifest.package_name)
print(manifest.source_url)
print(manifest.keys())
To download another registered package, pass its registry key:
from archdatapy import get_archdata
manifest = get_archdata(data_url="folio")
print(manifest.package_name)
print(manifest.keys())
Use load_archdata with a path from the returned manifest.
from archdatapy import load_archdata
dataset_name = 'Acheulean' # Example key from the manifest
data = load_archdata(manifest[dataset_name])
print(data)
pyreadr.read_r() returns a dictionary-like object because a single .rda file can contain one or more R objects.
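Because the result is keyed by R object name, calling code usually has to decide what to do when a file holds one object versus several. The sketch below shows one possible convention. The helper unpack_r_objects is hypothetical and not part of ArchDataPy or pyreadr, the object names sites and finds are invented, and plain strings stand in for the pandas DataFrames a real load would produce.

```python
from collections import OrderedDict


def unpack_r_objects(result):
    """Return the lone object if there is exactly one, otherwise a plain dict of all objects."""
    if len(result) == 1:
        return next(iter(result.values()))
    return dict(result)


# Stand-ins for a loaded .rda result; real values would be pandas DataFrames
single = OrderedDict([("Acheulean", "<DataFrame Acheulean>")])
multiple = OrderedDict([("sites", "<DataFrame sites>"), ("finds", "<DataFrame finds>")])

print(unpack_r_objects(single))          # the one object, unwrapped
print(list(unpack_r_objects(multiple)))  # names of every object in the file
```

Returning the dictionary unchanged for multi-object files keeps the R object names available, so nothing is silently dropped.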
The package also ships with a smaller dataset registry in datasets.json. These entries point directly to individual dataset files and can be loaded with get_dataset.
from archdatapy import get_dataset, list_available_datasets
print(list_available_datasets())
mask_site = get_dataset("MaskSite")
print(mask_site.head())
If you want to use a different CRAN package archive, pass the archive URL or local .tar.gz path directly:
manifest = get_archdata(data_url='https://cran.r-project.org/src/contrib/yourpackage_1.0.0.tar.gz')
Full documentation is available on the GitHub Pages site: https://wccarleton.github.io/archdatapy
Contributions are welcome. Please feel free to submit issues or pull requests to improve the package.
This project is licensed under the MIT License. See the LICENSE file for details.
Future enhancements planned for ArchDataPy:
- Registry-based package sourcing system
- Modern packaging with pyproject.toml (PEP 517/518/621)
- Type hints for better IDE support
- Automated CI/CD with GitHub Actions
- .gitignore and MANIFEST.in for clean distribution
- Expand package registry with curated archaeology datasets
- Add structured logging instead of print statements
- Improve error messages with helpful recovery suggestions
- Add CONTRIBUTING.md guide for registry contributions
- Include metadata (DOI, citations) in registry entries
- Optional caching layer for load_archdata()
- Docstring examples and doctests
- Dependency version compatibility checking
- GitHub issue/PR templates
- Support for additional data formats beyond .rda
To add new package sources to the package registry:
- Fork the repository
- Edit archdatapy/package_registry.json to add your source
- Submit a pull request with a description of the package and datasets
Registry entries should follow this structure:
{
  "package_name": {
    "url": "https://cran.r-project.org/src/contrib/package_1.0.0.tar.gz",
    "description": "Description of the package and datasets",
    "homepage": "https://CRAN.R-project.org/package=package",
    "license": "Package license"
  }
}
The default registry includes datasets from the R archdata package, a collection of archaeological datasets maintained on CRAN. It provides the datasets used in Quantitative Methods in Archaeology Using R by David L. Carlson.
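Before submitting a registry pull request, it can help to check a new entry against the structure shown earlier. The following is a rough sketch rather than a tool shipped with ArchDataPy: validate_registry is a hypothetical helper, and treating all four fields as required and the URL as a .tar.gz archive are assumptions drawn from the example entry, not documented rules.

```python
import json
from urllib.parse import urlparse

# Assumed required fields, taken from the example registry entry
REQUIRED_FIELDS = ("url", "description", "homepage", "license")


def validate_registry(registry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues were found."""
    problems = []
    for key, entry in registry.items():
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{key}: missing or empty '{field}'")
        url = entry.get("url")
        if isinstance(url, str) and url and not urlparse(url).path.endswith(".tar.gz"):
            problems.append(f"{key}: url does not point to a .tar.gz archive")
    return problems


good = json.loads("""
{
  "package_name": {
    "url": "https://cran.r-project.org/src/contrib/package_1.0.0.tar.gz",
    "description": "Description of the package and datasets",
    "homepage": "https://CRAN.R-project.org/package=package",
    "license": "Package license"
  }
}
""")
# Deliberately incomplete entry to show what the checks report
bad = {"bad": {"url": "https://example.org/pkg.zip", "description": "x"}}

print(validate_registry(good))
for problem in validate_registry(bad):
    print(problem)
```

Collecting every problem instead of stopping at the first one lets a contributor fix an entry in a single pass.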