# Spatialdata-DB

The goal of `Spatialdata-DB` is to provide structured, easily searchable [Spatialdata](https://spatialdata.scverse.org/en/stable/) datasets with integrated metadata in a uniform data format, making them easier to share, access, and compare.

`Spatialdata-DB` is based on [Lamin](https://docs.lamin.ai/introduction), a framework for organizing and structuring biological datasets and experiments. Each dataset is stored as a `Lamin artifact`, which provides not only the `Spatialdata` object but also additional context for that dataset.

## Brief intro to Lamin Artifacts

TL;DR:
- Lamin Artifact = `Spatialdata` dataset + metadata + dataset version history

Lamin Artifacts are versioned, structured data objects that track and manage files in computational biology workflows. They function like "Git for data," enabling reproducible storage, retrieval, and linking of datasets to each other and the code used to process the data.

Each dataset tracked with Lamin is enriched with metadata, including its creator, creation time, version history, and links to the processing code. These enriched data objects are then stored as Artifacts, ensuring traceability and reproducibility.

Further reading:
- https://docs.lamin.ai/tutorial
- https://docs.lamin.ai/Lamindb.artifact

In addition to this Lamin metadata that provides information about the Artifact, we store sample-specific metadata that can be used to query and compare datasets. This sample-specific metadata is embedded in the downloaded `Spatialdata` object under `spatialdata.attrs['sample']`. In future versions, we plan to extend the metadata with technology-specific information and metadata from the `obs` slots.
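
A minimal sketch of reading this metadata, assuming the downloaded object exposes a dict-like `attrs` attribute as described above. The stand-in object, the helper `get_sample_metadata`, and the keys and values (`assay`, `organism`) are hypothetical and only illustrate the access pattern; with a real dataset, the same helper would be called on the loaded `Spatialdata` object.

```python
from types import SimpleNamespace


def get_sample_metadata(sdata):
    """Return a copy of the metadata stored under attrs['sample'].

    Returns an empty dict if the object carries no sample metadata.
    """
    attrs = getattr(sdata, "attrs", None) or {}
    return dict(attrs.get("sample", {}))


# Stand-in for a downloaded dataset (hypothetical keys and values).
example = SimpleNamespace(
    attrs={"sample": {"assay": "Xenium", "organism": "Human"}}
)

sample = get_sample_metadata(example)
for key, value in sample.items():
    print(f"{key}: {value}")
```

Returning a copy keeps accidental edits to the result from changing the metadata attached to the dataset itself.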

## Sample-specific Metadata

### General Metadata
- **Product** – The name of the product (e.g. `In Situ Gene Expression` by `10x`).
- **Assay** – The experimental technique or method used (currently the database contains `Visium` and `Xenium`).
- **Biomaterial Type** – The type of biological sample (e.g., `Specimen from Organism`).
- **Organism** – The species from which the sample was derived (currently the database contains `Human` and `Mouse`, linked to the [NCBITaxon](https://www.ebi.ac.uk/ols4/ontologies/ncbitaxon) ontology).
- **Tissue** – The specific tissue from which the sample was taken (linked to [UBERON](https://www.ebi.ac.uk/ols4/ontologies/uberon)).
- **Modality** – e.g., RNA, protein.
- **Publish Date** – The release date of the dataset.
- **License** – Licensing terms for dataset usage.
- **Dataset URL** – A link to the dataset source.

### Patient Metadata
- **Development Stage** – The life stage of the sample donor (e.g., fetal, adult, linked to [Human Developmental Stages](https://www.ebi.ac.uk/ols4/ontologies/hsapdv) and [Mouse Developmental Stages](https://www.ebi.ac.uk/ols4/ontologies/mmusdv)).
- **Disease** – The primary disease associated with the sample (linked to the [Mondo](https://mondo.monarchinitiative.org/) disease ontology).
- **Disease Details** – Additional details about the disease condition, if provided.

### Technical Metadata
- **Replicate** – Identifier, if the dataset is part of a collection.
- **Instrument(s)** – The equipment used for data acquisition.
- **Software** – The software tools used in the analysis.
- **Analysis Steps** – Which analysis steps were performed, if provided.
- **Chemistry Version** – The specific version of the assay.
- **Preservation Method** – How the sample was preserved before processing.
- **Staining Method** – The technique used to stain the sample.
- **Cells or Nuclei** – Specifies if the dataset contains single-cell or single-nucleus data.
- **Panel** – The gene panel or probes used for targeted assays.

### Genes
Additionally, each dataset is linked to the genes contained in the data. These are identified by [Ensembl IDs](https://www.ebi.ac.uk/training/online/courses/ensembl-browsing-genomes/what-is-ensembl/) and are also searchable.
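
Ensembl stable gene IDs carry a species prefix: human gene IDs start with `ENSG` and mouse gene IDs with `ENSMUSG`, each followed by 11 digits. The sketch below uses that convention to sanity-check an ID before searching for it. The helper name `organism_from_ensembl_id` is hypothetical and not part of the database API.

```python
import re

# Prefixes of Ensembl stable gene IDs for the two organisms
# currently in the database.
_PATTERNS = {
    "Human": re.compile(r"^ENSG\d{11}$"),
    "Mouse": re.compile(r"^ENSMUSG\d{11}$"),
}


def organism_from_ensembl_id(gene_id):
    """Return 'Human' or 'Mouse' for a well-formed gene ID, else None."""
    # Drop an optional version suffix such as ".12" before matching.
    base = gene_id.strip().split(".")[0]
    for organism, pattern in _PATTERNS.items():
        if pattern.match(base):
            return organism
    return None


print(organism_from_ensembl_id("ENSG00000141510"))     # human TP53
print(organism_from_ensembl_id("ENSMUSG00000059552"))  # a mouse gene ID
print(organism_from_ensembl_id("TP53"))                # gene symbol, not an ID
```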


## How to access the database

There are two ways to access the data:
1. Via the GUI at [scverse/spatialdata-db](https://Lamin.ai/scverse/Spatialdata-DB/artifacts?filter[and][0][or][0][_branch_code][eq]=1&filter[and][1][or][0][is_latest][eq]=true):
   - The `Artifacts` tab provides an overview of the existing datasets.
   - The `Collections` tab lists all available collections.
2. Via the Python API. Finding and downloading data through the API is explained in more detail in a separate notebook.

## How to search the database via API
There are three different types of metadata features:
- most features are stored as **custom labels** (see notebook `query_labels.ipynb`)
- features linked to **public ontologies** such as `Organism` and `Disease` provide additional search support such as autocompletion and hierarchical search (see notebook `query_ontologies.ipynb`)
- **genes** are also mapped against ontologies, but queried slightly differently (see notebook `query_genes.ipynb`)

In addition, Lamin Artifacts can be bundled in **collections** (see notebook `collections.ipynb`).
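
To illustrate what hierarchical search over an ontology means, here is a toy sketch in plain Python. It does not use the Lamin API: the parent/child table, the dataset names, and the helper functions are made up for illustration. The idea is that querying a parent term also returns datasets annotated with any of its descendant terms.

```python
# Toy child -> parent relations (illustrative, not real ontology records).
PARENT = {
    "breast carcinoma": "carcinoma",
    "lung carcinoma": "carcinoma",
    "carcinoma": "cancer",
}

# Hypothetical dataset annotations.
DATASETS = {
    "dataset_a": "breast carcinoma",
    "dataset_b": "lung carcinoma",
    "dataset_c": "cancer",
}


def is_same_or_descendant(term, ancestor):
    """Walk up the hierarchy from `term` and report whether `ancestor` is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def search(disease):
    """Return names of datasets labelled with `disease` or one of its descendants."""
    return sorted(
        name
        for name, label in DATASETS.items()
        if is_same_or_descendant(label, disease)
    )


print(search("breast carcinoma"))
print(search("carcinoma"))
print(search("cancer"))
```

An exact-match search for `carcinoma` would return nothing here, since no dataset carries that exact label; the hierarchical search finds the two datasets annotated with more specific terms.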