# Exploring the Census Datasets table

This tutorial demonstrates basic use of the `census_datasets` dataframe, which contains metadata for the Census source datasets. This metadata can be joined to the cell metadata dataframe (`obs`) via the `dataset_id` column.

**Contents**

1. Fetching the datasets table.
2. Fetching the expression data from a single dataset.
3. Downloading the original source H5AD file of a dataset.

⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).
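
As a minimal sketch of that filter, assuming the cell metadata has already been read into a pandas DataFrame, the toy `obs` table below keeps only the rows flagged as primary data. The values are made up for illustration and are not real Census data.

```python
import pandas as pd

# Toy stand-in for the Census cell metadata (`obs`); all values are hypothetical.
obs = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2, 3],
        "dataset_id": ["dataset-a", "dataset-a", "dataset-b", "dataset-b"],
        "is_primary_data": [True, True, False, True],
    }
)

# Keep only the primary copy of each cell (boolean mask on the flag column).
primary_obs = obs[obs["is_primary_data"]]
print(len(obs), "cells before filtering,", len(primary_obs), "after")
```

When querying the Census directly, the same condition can usually be pushed into the query itself, for example by passing `obs_value_filter="is_primary_data == True"` to `cellxgene_census.get_anndata`.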

## Fetching the datasets table


Each Census contains a top-level dataframe itemizing the datasets contained therein. You can read this into a `pandas.DataFrame`.

In [4]:
import cellxgene_census

census = cellxgene_census.open_soma()
# Read the full datasets table; the cells below rely on it being unfiltered.
census_datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# for convenience, indexing on the soma_joinid which links this to other census data.
census_datasets = census_datasets.set_index("soma_joinid")

# Display the datasets of a single collection to keep the output short.
census_datasets[census_datasets.collection_id == "bcb61471-2a44-4d00-a0af-ff085512674c"]

The "stable" release is currently 2023-12-15. Specify 'census_version="2023-12-15"' in future calls to open_soma() to ensure data consistency.


| soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_version_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 486 | bcb61471-2a44-4d00-a0af-ff085512674c | An atlas of healthy and injured cell states an... | 10.1038/s41586-023-05769-3 | 32b9bdce-2481-4c85-ba1b-6ad5fcea844c | c65ea195-d1c7-4086-ad98-83b2aa6d31a7 | Single-cell RNA-seq of the Adult Human Kidney ... | 32b9bdce-2481-4c85-ba1b-6ad5fcea844c.h5ad | 107344 |
| 487 | bcb61471-2a44-4d00-a0af-ff085512674c | An atlas of healthy and injured cell states an... | 10.1038/s41586-023-05769-3 | 0b75c598-0893-4216-afe8-5414cab7739d | 7cf02c5a-313c-47ec-9958-9337fc948c7f | Integrated Single-nucleus and Single-cell RNA-... | 0b75c598-0893-4216-afe8-5414cab7739d.h5ad | 304652 |
| 488 | bcb61471-2a44-4d00-a0af-ff085512674c | An atlas of healthy and injured cell states an... | 10.1038/s41586-023-05769-3 | 07854d9c-5375-4a9b-ac34-fa919d3c3686 | 54a8714e-c818-460d-8c34-14bd4e39e1ff | Single-nucleus RNA-seq of the Adult Human Kidn... | 07854d9c-5375-4a9b-ac34-fa919d3c3686.h5ad | 172847 |
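
The join to the cell metadata mentioned in the introduction can be sketched with `pandas.DataFrame.merge`. The dataset rows below are copied from the table above (a few columns only), while the `obs` rows, including the `cell_type` labels, are hypothetical placeholders rather than real Census cells.

```python
import pandas as pd

# Two rows copied from the datasets table above (only a few columns kept).
datasets = pd.DataFrame(
    {
        "dataset_id": [
            "32b9bdce-2481-4c85-ba1b-6ad5fcea844c",
            "0b75c598-0893-4216-afe8-5414cab7739d",
        ],
        "collection_doi": ["10.1038/s41586-023-05769-3"] * 2,
        "dataset_total_cell_count": [107344, 304652],
    }
)

# Hypothetical cell metadata; in practice this would come from an `obs` read.
obs = pd.DataFrame(
    {
        "soma_joinid": [10, 11, 12],
        "dataset_id": [
            "32b9bdce-2481-4c85-ba1b-6ad5fcea844c",
            "0b75c598-0893-4216-afe8-5414cab7739d",
            "0b75c598-0893-4216-afe8-5414cab7739d",
        ],
        "cell_type": ["cell type A", "cell type B", "cell type C"],
    }
)

# Left join: every cell keeps its row and gains the dataset-level columns.
# validate="many_to_one" raises if a dataset_id appears twice in `datasets`.
obs_with_datasets = obs.merge(datasets, on="dataset_id", how="left", validate="many_to_one")
print(obs_with_datasets[["soma_joinid", "cell_type", "collection_doi"]])
```

Both real tables carry their own `soma_joinid`, with different meanings, so it is probably safest to drop or rename one of them before merging.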


The sum of cells across all datasets should match the number of cells across all SOMA experiments (human, mouse).

In [2]:
# Count cells across all experiments
all_experiments = census["census_data"].items()
experiments_total_cells = 0
print("Count by experiment:")
for organism_name, organism_experiment in all_experiments:
    num_cells = len(organism_experiment.obs.read(column_names=["soma_joinid"]).concat().to_pandas())
    print(f"\t{num_cells} cells in {organism_name}")
    experiments_total_cells += num_cells

print(f"\nFound {experiments_total_cells} cells in all experiments.")

# Count cells across all datasets
print(f"Found {census_datasets.dataset_total_cell_count.sum()} cells in all datasets.")

Count by experiment:
	5255245 cells in mus_musculus
	56400873 cells in homo_sapiens

Found 61656118 cells in all experiments.
Found 61656118 cells in all datasets.
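
The same `dataset_total_cell_count` column can also be aggregated at a finer grain. The sketch below, which assumes nothing beyond pandas, sums cells per collection for four rows copied from the tables in this tutorial. It is an illustration on a tiny subset, not a statement about the full Census.

```python
import pandas as pd

# Rows copied from the tables in this tutorial (IDs and cell counts only).
rows = pd.DataFrame(
    {
        "collection_id": [
            "bcb61471-2a44-4d00-a0af-ff085512674c",
            "bcb61471-2a44-4d00-a0af-ff085512674c",
            "bcb61471-2a44-4d00-a0af-ff085512674c",
            "0b9d8a04-bb9d-44da-aa27-705bb65b54eb",
        ],
        "dataset_total_cell_count": [107344, 304652, 172847, 40220],
    }
)

# Total cells per collection: group the datasets, then sum their counts.
cells_per_collection = rows.groupby("collection_id")["dataset_total_cell_count"].sum()
print(cells_per_collection)
```

Applied to the full `census_datasets` dataframe, the same `groupby` gives a quick view of which collections contribute the most cells.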


## Fetching the expression data from a single dataset

Let's pick one dataset to slice out of the Census and turn it into an in-memory [AnnData](https://anndata.readthedocs.io/en/latest/) object. This can be used with the [Scanpy](https://scanpy.readthedocs.io/en/stable/) toolchain. You can also save this AnnData locally using the AnnData [write](https://anndata.readthedocs.io/en/latest/api.html#writing) API.

In [3]:
census_datasets[census_datasets.dataset_id == "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149"]

| soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 522 | 0b9d8a04-bb9d-44da-aa27-705bb65b54eb | Tabula Muris Senis | 10.1038/s41586-020-2496-1 | 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149 | Bone marrow - A single-cell transcriptomic atl... | 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad | 40220 |


Create a query on the mouse experiment's "RNA" measurement, filtered to this `dataset_id`.

In [5]:
adata = cellxgene_census.get_anndata(
    census, organism="Mus musculus", obs_value_filter="dataset_id == '0bd1a1de-3aee-40e0-b2ec-86c7a30c7149'"
)

adata

AnnData object with n_obs × n_vars = 40220 × 52417
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'

## Downloading the original source H5AD file of a dataset

You can download the original H5AD file for any given dataset. This is the same H5AD you can download from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/), and it may contain additional submitter-provided information that was not included in the Census.

To do this, you can either fetch the file's location in the cloud or download it directly to your system using the `cellxgene_census` package.

In [5]:
# Option 1: Direct download
cellxgene_census.download_source_h5ad(
    "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149", to_path="Tabula_Muris_Senis-bone_marrow.h5ad"
)

In [6]:
# Option 2: Get location and download via preferred method
uri = cellxgene_census.get_source_h5ad_uri("0bd1a1de-3aee-40e0-b2ec-86c7a30c7149")
uri

# you can now download the H5AD in shell via AWS CLI e.g. `aws s3 cp uri ./`

{'uri': 's3://cellxgene-data-public/cell-census/2023-07-25/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad',
 's3_region': 'us-west-2'}
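
As a rough sketch of what to do with that dictionary, the snippet below splits the URI into bucket and key with the standard library and assembles an AWS CLI command from it. The `--no-sign-request` flag is an assumption that the bucket permits anonymous reads, so drop it if you use your own credentials.

```python
import shlex
from urllib.parse import urlparse

# The dictionary returned above, copied verbatim.
location = {
    "uri": "s3://cellxgene-data-public/cell-census/2023-07-25/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad",
    "s3_region": "us-west-2",
}

# An s3:// URI parses like any other URL: netloc is the bucket, path is the key.
parsed = urlparse(location["uri"])
bucket = parsed.netloc
key = parsed.path.lstrip("/")
print(f"bucket={bucket}")
print(f"key={key}")

# Assumption: the bucket permits anonymous reads, hence --no-sign-request.
command = [
    "aws", "s3", "cp", location["uri"], "./",
    "--region", location["s3_region"],
    "--no-sign-request",
]
# Print a shell-safe command line to copy and paste into a terminal.
print(shlex.join(command))
```

The bucket and key are also what most S3 client libraries expect as separate arguments, should you prefer to download from Python rather than the shell.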

Finally, close the Census.

In [7]:
census.close()