# Summary

* This is a tutorial on using Python for accessing the scBaseCamp dataset hosted by the Arc Institute.
* The data can be streamed or downloaded locally.
  * For small jobs (e.g., summarizing some of the metadata), streaming is recommended.
  * For large jobs (e.g., training a model), downloading is recommended.
* See the [README](README.md#obs-cell-metadata) for a description of the obs metadata.

# Setup

### Installation

If needed, install the necessary dependencies.

You can use the conda environment provided in this git repository. To do so:

In [None]:
!which conda && conda env create -q -f ../py_conda_env.yml

# Load dependencies

In [3]:
import os
import pandas as pd
import scanpy as sc
import pyarrow.dataset as ds
import gcsfs

In [4]:
# initialize GCS file system for reading data from GCS
fs = gcsfs.GCSFileSystem()

# Data location

In [5]:
# GCS bucket path
gcp_base_path = "gs://arc-ctc-scbasecamp/2025-02-25/"
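
The cells below build object paths with `os.path.join`, which uses the separator of the local operating system. On POSIX systems that is a forward slash, but on Windows it would insert backslashes, which would not match the object names in the bucket. A minimal sketch of a portable alternative using the standard-library `posixpath` module (the `gcs_join` helper name is hypothetical and not part of this tutorial):

```python
import posixpath

gcp_base_path = "gs://arc-ctc-scbasecamp/2025-02-25/"


def gcs_join(base: str, *parts: str) -> str:
    """Join GCS path components with '/' regardless of the local OS."""
    return posixpath.join(base, *parts)


# Same components as the per-obs metadata path used later in this tutorial
obs_path = gcs_join(gcp_base_path, "metadata", "Homo_sapiens", "obs_metadata.parquet.gz")
print(obs_path)
```

On Linux and macOS the result is identical to `os.path.join`, so this only matters if the notebook is run on Windows.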

# List files

## Parquet files

* Contain the obs metadata

### Per-sample metadata

In [8]:
# List all files in the bucket
all_files = fs.glob(os.path.join(gcp_base_path, "**"))

# Keep only the per-sample metadata files (matched by file name)
sample_pq_files = [file for file in all_files if os.path.basename(file) == "sample_metadata.parquet.gz"]

# Convert to dataframe: organism & file_path
sample_pq_files = pd.DataFrame(
    [[os.path.basename(os.path.dirname(file)),file] for file in sample_pq_files], 
    columns=["organism", "file_path"]
)
sample_pq_files

Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/metadata/Arabido...
1,Bos_taurus,arc-ctc-scbasecamp/2025-02-25/metadata/Bos_tau...
2,Caenorhabditis_elegans,arc-ctc-scbasecamp/2025-02-25/metadata/Caenorh...
3,Callithrix_jacchus,arc-ctc-scbasecamp/2025-02-25/metadata/Callith...
4,Danio_rerio,arc-ctc-scbasecamp/2025-02-25/metadata/Danio_r...
5,Drosophila_melanogaster,arc-ctc-scbasecamp/2025-02-25/metadata/Drosoph...
6,Equus_caballus,arc-ctc-scbasecamp/2025-02-25/metadata/Equus_c...
7,Gallus_gallus,arc-ctc-scbasecamp/2025-02-25/metadata/Gallus_...
8,Gorilla_gorilla,arc-ctc-scbasecamp/2025-02-25/metadata/Gorilla...
9,Heterocephalus_glaber,arc-ctc-scbasecamp/2025-02-25/metadata/Heteroc...


### Per-obs metadata

In [9]:
# List all files in the bucket
all_files = fs.glob(os.path.join(gcp_base_path, "**"))

# Keep only the per-obs metadata files (matched by file name)
obs_pq_files = [file for file in all_files if os.path.basename(file) == "obs_metadata.parquet.gz"]

# Convert to dataframe: organism & file_path
obs_pq_files = pd.DataFrame(
    [[os.path.basename(os.path.dirname(file)),file] for file in obs_pq_files], 
    columns=["organism", "file_path"]
)
obs_pq_files

Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/metadata/Arabido...
1,Bos_taurus,arc-ctc-scbasecamp/2025-02-25/metadata/Bos_tau...
2,Caenorhabditis_elegans,arc-ctc-scbasecamp/2025-02-25/metadata/Caenorh...
3,Callithrix_jacchus,arc-ctc-scbasecamp/2025-02-25/metadata/Callith...
4,Danio_rerio,arc-ctc-scbasecamp/2025-02-25/metadata/Danio_r...
5,Drosophila_melanogaster,arc-ctc-scbasecamp/2025-02-25/metadata/Drosoph...
6,Equus_caballus,arc-ctc-scbasecamp/2025-02-25/metadata/Equus_c...
7,Gallus_gallus,arc-ctc-scbasecamp/2025-02-25/metadata/Gallus_...
8,Gorilla_gorilla,arc-ctc-scbasecamp/2025-02-25/metadata/Gorilla...
9,Heterocephalus_glaber,arc-ctc-scbasecamp/2025-02-25/metadata/Heteroc...


## h5ad files 

* Contain count matrices and per-obs metadata

In [10]:
# List all files in the bucket
all_files = fs.glob(os.path.join(gcp_base_path, "**"))

# Filter files with the specified extension
h5ad_files = [file for file in all_files if file.endswith(".h5ad.gz")]

# Convert to dataframe: organism & file_path
h5ad_files = pd.DataFrame(
    [[os.path.basename(os.path.dirname(file)),file] for file in h5ad_files], 
    columns=["organism", "file_path"]
)
h5ad_files

Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
1,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
2,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
3,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
4,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/Arabidopsis...
...,...,...
30382,Zea_mays,arc-ctc-scbasecamp/2025-02-25/h5ad/Zea_mays/SR...
30383,Zea_mays,arc-ctc-scbasecamp/2025-02-25/h5ad/Zea_mays/SR...
30384,Zea_mays,arc-ctc-scbasecamp/2025-02-25/h5ad/Zea_mays/SR...
30385,Zea_mays,arc-ctc-scbasecamp/2025-02-25/h5ad/Zea_mays/SR...
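
Each of the three listing cells above calls `fs.glob` on the whole bucket prefix. One listing can be reused and split by file type instead. The sketch below assumes the listing is a plain list of path strings; the `classify_files` helper is hypothetical, and the example paths are stand-ins (the `sample_metadata.parquet.gz` location and the `README.md` entry are assumed for illustration):

```python
import posixpath
from collections import defaultdict


def classify_files(paths):
    """Group object paths into per-sample metadata, per-obs metadata, and h5ad files."""
    groups = defaultdict(list)
    for path in paths:
        name = posixpath.basename(path)
        # The parent directory name is the organism, as in the cells above
        organism = posixpath.basename(posixpath.dirname(path))
        if name == "sample_metadata.parquet.gz":
            groups["sample"].append((organism, path))
        elif name == "obs_metadata.parquet.gz":
            groups["obs"].append((organism, path))
        elif name.endswith(".h5ad.gz"):
            groups["h5ad"].append((organism, path))
    return dict(groups)


# Stand-in for the result of fs.glob(os.path.join(gcp_base_path, "**"))
listing = [
    "arc-ctc-scbasecamp/2025-02-25/metadata/Homo_sapiens/sample_metadata.parquet.gz",
    "arc-ctc-scbasecamp/2025-02-25/metadata/Homo_sapiens/obs_metadata.parquet.gz",
    "arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_sapiens/ERX4319106.h5ad.gz",
    "arc-ctc-scbasecamp/2025-02-25/README.md",  # hypothetical unrelated file, ignored
]
groups = classify_files(listing)
for kind, entries in groups.items():
    print(kind, len(entries))
```

Each group is a list of `(organism, file_path)` pairs, so it can be passed to `pd.DataFrame(..., columns=["organism", "file_path"])` as in the cells above.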


# Obs metadata

* `obs` ≃ cell

### Per-sample

* Useful for quickly summarizing the dataset: the per-sample metadata files are small compared with the per-obs metadata files (see below).

In [11]:
row_count = 3
sample_metadata = []
for i,row in sample_pq_files.iterrows():
    sample_metadata.append(
        ds.dataset(row["file_path"], filesystem=fs, format="parquet")
        .head(row_count)
        .to_pandas()
    )
sample_metadata = pd.concat(sample_metadata)
print(sample_metadata.shape)
sample_metadata.head()

(62, 14)


Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name
0,24123125,SRX17302366,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Arabid...,9036,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,not specified,"BL (Brassinolide), 100nM, 0.5 hours post-treat...",WT Col-0,,
1,24123140,SRX17302381,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Arabid...,14317,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,not specified,"control treatment, age: 7 days",WT Col-0,,
2,24123142,SRX17302383,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Arabid...,20075,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,unsure,control,unsure,,
0,32702158,SRX24387177,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Bos_ta...,11084,10x_Genomics,3_prime_gex,single_cell,Bos taurus,lung,"Arthritis, Rheumatoid",unsure,unsure,,
1,32702159,SRX24387178,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Bos_ta...,21404,10x_Genomics,3_prime_gex,single_cell,Bos taurus,lung,unsure,unsure,unsure,,


In [13]:
# number of cells in this slice of the dataset
print(f"Cell count: {sample_metadata['obs_count'].sum()}")

Cell count: 532944


In [18]:
# cell count per organism
obs_count_org = (
    sample_metadata.groupby("organism")["obs_count"]
    .sum().to_frame().reset_index()
    .sort_values("obs_count", ascending=False)
    .reset_index(drop=True)
)
obs_count_org

Unnamed: 0,organism,obs_count
0,Solanum lycopersicum,55212
1,Arabidopsis thaliana,43428
2,Bos taurus,40851
3,Macaca mulatta,37939
4,Callithrix jacchus,35467
5,Equus caballus,30142
6,Ovis aries,29690
7,Sus scrofa,27760
8,Gallus gallus,27099
9,Schistosoma mansoni,25526


### Per-observation

* The per-obs metadata contains metadata specific to each obs (e.g., gene count)

In [19]:
# filter to just human samples
human_samples = sample_metadata[sample_metadata["organism"] == "Homo sapiens"]
human_samples

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name
0,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
1,29110027,ERX11148744,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,2379,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,keratinocyte CD49f-,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
2,29110026,ERX11148743,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,2316,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,epidermal myeloid cells,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...


In [45]:
# read in the per-obs metadata
infile = os.path.join(gcp_base_path, "metadata", "Homo_sapiens", "obs_metadata.parquet.gz")
dataset = ds.dataset(infile, filesystem=fs, format="parquet")
obs_metadata = dataset.to_table().to_pandas()
obs_metadata 

Unnamed: 0,gene_count,umi_count,SRX_accession,cell_barcode
0,5939,14141.0,ERX10019090,AAACCCAAGAAGCCTG
1,6331,18138.0,ERX10019090,AAACCCAAGAATCTAG
2,5447,16033.0,ERX10019090,AAACCCAAGACTTAAG
3,2307,4154.0,ERX10019090,AAACCCAAGACTTCAC
4,965,1183.0,ERX10019090,AAACCCAAGCCGTTGC
...,...,...,...,...
17398,4283,10319.0,ERX10019090,TTTGTTGTCACTTATC
17399,1521,2393.0,ERX10019090,TTTGTTGTCAGCGCGT
17400,5556,18073.0,ERX10019090,TTTGTTGTCAGGAAAT
17401,652,843.0,ERX10019090,TTTGTTGTCCCGAGGT


In [None]:
# which samples?
target_samples = ", ".join(human_samples['srx_accession'].tolist())
print(f"target samples: {target_samples}")

# read in data
infile = os.path.join(gcp_base_path, "metadata", "Homo_sapiens", "obs_metadata.parquet.gz")
dataset = ds.dataset(infile, filesystem=fs, format="parquet")
obs_metadata_target = dataset.to_table(
    filter=(
        ds.field('SRX_accession').isin(human_samples["srx_accession"].tolist())
    )
).to_pandas()
obs_metadata_target

target samples: ERX11148735,ERX11148744,ERX11148743


Unnamed: 0,gene_count,umi_count,SRX_accession,cell_barcode


# Read h5ad files

### Example: select human samples

In [28]:
# we have a set of samples
target_samples = sample_metadata[sample_metadata["organism"] == "Homo sapiens"]
target_samples

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name
0,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
1,29110027,ERX11148744,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,2379,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,keratinocyte CD49f-,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
2,29110026,ERX11148743,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,2316,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,epidermal myeloid cells,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...


In [36]:
# read in the anndata for those samples
adata = []
for infile in target_samples["file_path"].tolist():
    with fs.open(infile, 'rb') as f:
        adata.append(sc.read_h5ad(f))

# combine anndata objects
adata = sc.concat(adata)
adata

  utils.warn_names_duplicates("obs")


AnnData object with n_obs × n_vars = 5442 × 36601
    obs: 'gene_count', 'umi_count', 'SRX_accession'
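
The `warn_names_duplicates("obs")` message above is emitted by anndata when obs names (here, cell barcodes) are not unique, which can happen when the same barcode appears in more than one sample. One possible way to make the names unique is to prefix each barcode with its SRX accession. A sketch on a toy `obs` table (the barcodes and accessions are made up):

```python
import pandas as pd

# Toy obs table: the same barcode occurs in two different samples
obs = pd.DataFrame(
    {"SRX_accession": ["SRX_A", "SRX_A", "SRX_B"], "gene_count": [1966, 931, 3234]},
    index=["AAACCTGA", "AAACCTGT", "AAACCTGA"],
)
was_unique = obs.index.is_unique

# Prefix each barcode with its sample accession
obs.index = [f"{srx}_{barcode}" for srx, barcode in zip(obs["SRX_accession"], obs.index)]

print(was_unique, obs.index.is_unique)
print(obs.index.tolist())
```

For an `AnnData` object, the same list could be assigned to `adata.obs_names`; this assumes that each barcode is unique within a sample.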

In [37]:
# number of obs per SRX accession
adata.obs["SRX_accession"].value_counts()

SRX_accession
ERX11148744    2379
ERX11148743    2316
ERX11148735     747
Name: count, dtype: int64

In [39]:
# add per-sample metadata to the anndata object
adata.obs = adata.obs.reset_index().merge(
    target_samples, left_on="SRX_accession", right_on="srx_accession", how="inner"
)
adata.obs.head()

Unnamed: 0,index,gene_count,umi_count,SRX_accession,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name
0,AAACCTGAGTCGCCGT,1966,9930.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
1,AAACCTGTCTTGAGGT,931,1479.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
2,AAACGGGCATACGCTA,3234,19343.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
3,AAAGATGAGAAACCTA,2882,22176.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
4,AAAGATGCAGATCTGT,484,1035.0,ERX11148735,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_s...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
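
The merge above leaves the cell barcodes in an `index` column and duplicates the accession in both `SRX_accession` and `srx_accession`. If it is preferable to keep the barcodes as the index, a variant like the following may help. It is a sketch on toy data (the barcodes, accessions, and `tissue` values are made up), and it uses `validate="many_to_one"` so that pandas raises an error if an accession appears more than once in the per-sample table:

```python
import pandas as pd

# Toy per-cell table indexed by barcode
obs = pd.DataFrame(
    {"gene_count": [1966, 931, 3234], "SRX_accession": ["SRX_A", "SRX_A", "SRX_B"]},
    index=pd.Index(["AAAC", "AAAG", "AACC"], name="cell_barcode"),
)
# Toy per-sample table with one row per accession
samples = pd.DataFrame({"srx_accession": ["SRX_A", "SRX_B"], "tissue": ["skin", "lung"]})

merged = (
    obs.reset_index()
    .merge(
        samples,
        left_on="SRX_accession",
        right_on="srx_accession",
        how="left",               # keep every cell, in the original order
        validate="many_to_one",   # many cells per sample, one row per sample
    )
    .drop(columns="srx_accession")  # redundant copy of SRX_accession
    .set_index("cell_barcode")      # restore the barcodes as the index
)
print(merged)
```

A left merge keeps the number and order of cells unchanged, which is what `adata.obs` requires in order to stay aligned with the count matrix.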


### Example: human samples with gene count >= 1000

In [49]:
# get target samples
sample_metadata_target = sample_metadata.merge(
    obs_metadata[obs_metadata["gene_count"] >= 1000], 
    left_on="srx_accession", 
    right_on="SRX_accession", 
    how="inner"
)
print(f"SRX count: {sample_metadata_target['srx_accession'].nunique()}")
sample_metadata_target

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name,gene_count,umi_count,SRX_accession,cell_barcode
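
Merging the per-sample table with the per-obs table yields one row per cell, so each sample's `file_path` is repeated once per matching cell. If one row per sample is wanted instead, the per-obs table can be aggregated first. A sketch on toy data (the accessions, file paths, and the `n_cells_passing` column name are made up for illustration):

```python
import pandas as pd

sample_metadata = pd.DataFrame({
    "srx_accession": ["SRX_A", "SRX_B", "SRX_C"],
    "file_path": ["a.h5ad.gz", "b.h5ad.gz", "c.h5ad.gz"],  # hypothetical paths
})
obs_metadata = pd.DataFrame({
    "SRX_accession": ["SRX_A", "SRX_A", "SRX_B", "SRX_C"],
    "gene_count": [5939, 965, 1966, 484],
})

# Count the cells per sample that pass the gene-count threshold
passing = (
    obs_metadata[obs_metadata["gene_count"] >= 1000]
    .groupby("SRX_accession")
    .size()
    .rename("n_cells_passing")
    .reset_index()
)

# Keep only samples with at least one passing cell: one row per sample
targets = sample_metadata.merge(
    passing, left_on="srx_accession", right_on="SRX_accession", how="inner"
).drop(columns="SRX_accession")

print(f"SRX count: {targets['srx_accession'].nunique()}")
print(targets)
```

The resulting `file_path` column then lists each h5ad file once, so no file is read more than once in a later loading loop.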


In [50]:
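# for the sake of this tutorial, just use the first 3 samples
# (one row per SRX accession; `.iloc[:3]` selects exactly 3 rows, whereas `.loc[:3]` would include label 3)
sample_metadata_target = sample_metadata_target.drop_duplicates(subset="srx_accession").iloc[:3]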
sample_metadata_target

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,purturbation,cell_line,czi_collection_id,czi_collection_name,gene_count,umi_count,SRX_accession,cell_barcode


In [None]:
# read in h5ad files
adata = []
for infile in sample_metadata_target["file_path"].tolist():
    with fs.open(infile, 'rb') as f:
        adata.append(sc.read_h5ad(f))

# combine anndata objects
adata = sc.concat(adata)
adata

# Downloading files

You can use [gsutil](https://cloud.google.com/storage/docs/gsutil) to download any of the files in the bucket
and work with them locally. 

Please be mindful of the [cost of egress](https://cloud.google.com/storage/pricing) when downloading data from Google Cloud Storage.

For example:

```bash
gsutil cp gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_sapiens/ERX4319106.h5ad.gz .
```
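
To download several files at once, the command can be assembled from a list of `gs://` URIs, such as the `file_path` column of the per-sample metadata. The sketch below only builds and prints the command; the `build_gsutil_cp` helper and the `data/` destination are hypothetical, and `-m` is the `gsutil` option for parallel transfers:

```python
import shlex


def build_gsutil_cp(uris, dest="."):
    """Return a gsutil argument list that copies all `uris` into the directory `dest`."""
    return ["gsutil", "-m", "cp", *uris, dest]


uris = ["gs://arc-ctc-scbasecamp/2025-02-25/h5ad/Homo_sapiens/ERX4319106.h5ad.gz"]
cmd = build_gsutil_cp(uris, dest="data/")

# Print the shell-quoted command for inspection before running it
print(shlex.join(cmd))

# To actually run it (requires gsutil to be installed and configured):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one string avoids shell-quoting problems if a path ever contains unusual characters.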

***

# sessionInfo

In [1]:
!pip list

Package                   Version
------------------------- --------------
aiohappyeyeballs          2.4.6
aiohttp                   3.11.12
aiosignal                 1.3.2
anndata                   0.11.3
anyio                     4.8.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
array_api_compat          1.10.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.4
attrs                     25.1.0
babel                     2.17.0
beautifulsoup4            4.13.3
bleach                    6.2.0
blinker                   1.9.0
Brotli                    1.1.0
cached-property           1.5.2
cachetools                5.5.2
certifi                   2025.1.31
cffi                      1.17.1
charset-normalizer        3.4.1
click                     8.1.8
colorama                  0.4.6
comm                      0.2.2
contourpy                 1.3.1
cryptography              44.0.1
cycler                    0.12.1
debugpy      