# Summary

* This is a tutorial on using Python for accessing the scBaseCount dataset hosted by the Arc Institute.
* The data can be streamed or downloaded locally.
  * For small jobs (e.g., summarizing some of the metadata), streaming is recommended.
  * For large jobs (e.g., training a model), downloading is recommended.
* See the [README](README.md#metadata) for a description of the obs metadata.


# Setup

### Installation

If needed, install the necessary dependencies.

You can use the [conda environment](../conda_envs/python.yml) provided in this git repository.

# Load packages

In [1]:
import os
import pandas as pd
import scanpy as sc
import pyarrow.dataset as ds
import gcsfs

In [41]:
# initialize GCS file system for reading data from GCS
fs = gcsfs.GCSFileSystem()

# Data location

In [42]:
# GCS bucket path
gcs_base_path = "gs://arc-scbasecount/2025-02-25/"

In [43]:
# STARsolo feature type
feature_type = "GeneFull_Ex50pAS"

# List available files

Let's see what we have to work with!

First, load some helper code.

In [44]:
# helper function to list files under a GCS path, labeled by organism
def get_file_table(gcs_base_path: str, target: str=None, endswith: str=None):
    # recursively list everything below the base path
    files = fs.glob("/".join([gcs_base_path.rstrip("/"), "**"]))
    if target:
        # keep files whose name matches `target` exactly
        files = [f for f in files if os.path.basename(f) == target]
    elif endswith:
        # keep files whose path ends with `endswith`
        files = [f for f in files if f.endswith(endswith)]
    # the parent directory name is the organism
    file_list = [f.split("/")[-2:-1] + [f] for f in files]
    return pd.DataFrame(file_list, columns=["organism", "file_path"])
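
The filtering step in the helper above can be tried without any network access. The sketch below is a minimal illustration, not part of the original notebook: `filter_paths` is a hypothetical function, and `example_paths` is a made-up listing shaped like the paths that `fs.glob` returns for this bucket.

```python
import os

def filter_paths(paths, target=None, endswith=None):
    """Mimic the filtering step of get_file_table on an in-memory list of paths."""
    if target:
        # exact match on the file name
        paths = [p for p in paths if os.path.basename(p) == target]
    elif endswith:
        # suffix match on the full path
        paths = [p for p in paths if p.endswith(endswith)]
    # the organism is the name of each file's parent directory
    return [(p.split("/")[-2], p) for p in paths]

# Hypothetical listing (illustrative only)
example_paths = [
    "arc-scbasecount/2025-02-25/metadata/GeneFull_Ex50pAS/Bos_taurus/sample_metadata.parquet",
    "arc-scbasecount/2025-02-25/metadata/GeneFull_Ex50pAS/Bos_taurus/obs_metadata.parquet",
    "arc-scbasecount/2025-02-25/metadata/GeneFull_Ex50pAS/Danio_rerio/sample_metadata.parquet",
]

rows = filter_paths(example_paths, target="sample_metadata.parquet")
for organism, path in rows:
    print(organism, path)
```

Passing `target` selects one file per organism directory, while `endswith` selects every file with a given extension.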

## Parquet files

* Contain the obs metadata
* These can be read efficiently with [pyarrow](https://arrow.apache.org/docs/python/index.html)
  * We will read in via pyarrow and convert to pandas

In [45]:
# set the path to the metadata files
gcs_path = "/".join([gcs_base_path.rstrip("/"), "metadata", feature_type])
gcs_path

'gs://arc-scbasecount/2025-02-25/metadata/GeneFull_Ex50pAS'

### List per-sample metadata files

Per-sample (SRX accession) metadata (e.g., tissue)

In [46]:
# list files
sample_pq_files = get_file_table(gcs_path, "sample_metadata.parquet")
print(sample_pq_files.shape)
sample_pq_files.head()

(21, 2)


Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
1,Bos_taurus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
2,Caenorhabditis_elegans,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
3,Callithrix_jacchus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
4,Danio_rerio,arc-scbasecount/2025-02-25/metadata/GeneFull_E...


**Notes:**

* As you can see, the files are organized by `feature_type` (STARsolo output type) and `organism`

### List per-obs metadata files

Per-observation (cell) metadata

In [47]:
# list files
obs_pq_files = get_file_table(gcs_path, "obs_metadata.parquet")
print(obs_pq_files.shape)
obs_pq_files.head()

(21, 2)


Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
1,Bos_taurus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
2,Caenorhabditis_elegans,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
3,Callithrix_jacchus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
4,Danio_rerio,arc-scbasecount/2025-02-25/metadata/GeneFull_E...


## h5ad files 

* Contain count matrices and per-obs metadata

In [48]:
# set the path
gcs_path = "/".join([gcs_base_path.rstrip("/"), "h5ad", feature_type])
gcs_path

'gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS'

In [49]:
# list files
h5ad_files = get_file_table(gcs_path, endswith=".h5ad")
print(h5ad_files.shape)
h5ad_files.head()

(30387, 2)


Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50p...
1,Arabidopsis_thaliana,arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50p...
2,Arabidopsis_thaliana,arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50p...
3,Arabidopsis_thaliana,arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50p...
4,Arabidopsis_thaliana,arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50p...


# Explore the per-sample metadata

### Just human samples

In [50]:
# get the per-sample metadata file path
infile = sample_pq_files[sample_pq_files["organism"] == "Homo_sapiens"]["file_path"].values[0]
infile

'arc-scbasecount/2025-02-25/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet'

In [51]:
# load the metadata
sample_metadata = ds.dataset(infile, filesystem=fs, format="parquet").to_table().to_pandas()
print(sample_metadata.shape)
sample_metadata.head()

(16077, 14)


Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name
0,29110018,ERX11148735,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
1,29110027,ERX11148744,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,2379,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,keratinocyte CD49f-,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
2,29110026,ERX11148743,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,2316,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,epidermal myeloid cells,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
3,29110023,ERX11148740,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,2907,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,skin collected from breast reconstruction surgery,not specified,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
4,29110015,ERX11148732,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,4082,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase,not_applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...


In [52]:
# All human?
sample_metadata["organism"].value_counts()

organism
Homo sapiens    16077
Name: count, dtype: int64

In [53]:
# 10X library prep methods
sample_metadata["tech_10x"].value_counts()

tech_10x
3_prime_gex          10851
5_prime_gex           3746
vdj                    437
multiome               366
not_applicable         250
feature_barcoding      230
other                  168
cellplex                19
flex                     6
atac                     4
Name: count, dtype: int64

In [54]:
# cell prep method
sample_metadata["cell_prep"].value_counts()

cell_prep
single_cell       14661
single_nucleus     1393
unsure               22
not_applicable        1
Name: count, dtype: int64
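
The counts above can also be read as fractions: for example, 14661 of the 16077 human samples (about 91%) are `single_cell`. A minimal sketch on a toy Series (the values are made up), using the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Toy column standing in for sample_metadata["cell_prep"] (values are made up)
cell_prep = pd.Series(
    ["single_cell", "single_cell", "single_cell", "single_nucleus"],
    name="cell_prep",
)

# normalize=True returns the fraction of rows per category instead of the count
fractions = cell_prep.value_counts(normalize=True)
print(fractions)
```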

### All organisms

Let's scale up to everything!

In [55]:
# Read in the metadata for all organisms
sample_metadata = []
for i,row in sample_pq_files.iterrows():
    sample_metadata.append(
        ds.dataset(row["file_path"], filesystem=fs, format="parquet").to_table().to_pandas()
    )
sample_metadata = pd.concat(sample_metadata)

print(f"Number of samples: {sample_metadata.shape[0]}")
sample_metadata.head()

Number of samples: 30387


Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name
0,24123125,SRX17302366,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,9036,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,not specified,"BL (Brassinolide), 100nM, 0.5 hours post-treat...",WT Col-0,,
1,24123140,SRX17302381,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,14317,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,not specified,"control treatment, age: 7 days",WT Col-0,,
2,24123142,SRX17302383,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,20075,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,unsure,control,unsure,,
3,26626960,SRX19366049,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,7539,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,unsure,mock treatment (control group),not applicable,,
4,26626958,SRX19366047,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,7703,10x_Genomics,3_prime_gex,single_cell,Arabidopsis thaliana,other,none,mock treatment; 2 µM RALF1 peptide for 2 hours,none,,


In [56]:
# cells
print(f"Obs count: {sample_metadata['obs_count'].sum()}")

Obs count: 233686476


In [57]:
# samples per organism
sample_metadata["organism"].value_counts()

organism
Homo sapiens               16077
Mus musculus               12212
Macaca mulatta               587
Danio rerio                  458
Sus scrofa                   195
Drosophila melanogaster      181
Arabidopsis thaliana         175
Gallus gallus                102
Heterocephalus glaber         79
Caenorhabditis elegans        52
Pan troglodytes               49
Bos taurus                    48
Oryctolagus cuniculus         34
Zea mays                      33
Oryza sativa                  31
Callithrix jacchus            24
Ovis aries                    20
Equus caballus                11
Solanum lycopersicum          10
Schistosoma mansoni            7
Gorilla gorilla                2
Name: count, dtype: int64

In [58]:
# tech_10x
sample_metadata["tech_10x"].value_counts()

tech_10x
3_prime_gex          22433
5_prime_gex           5625
multiome               774
vdj                    577
not_applicable         340
feature_barcoding      311
other                  266
cellplex                46
atac                     8
flex                     6
fixed_rna                1
Name: count, dtype: int64

In [59]:
# samples associated with czi collections
czi_sample_count = sample_metadata[~sample_metadata["czi_collection_id"].isna()].shape[0]
print(f"Samples associated with CZI collections: {czi_sample_count}")

Samples associated with CZI collections: 2748


In [60]:
# check that the file paths point to existing h5ad files (assumes you have gsutil installed)
!which gsutil && gsutil ls {sample_metadata["file_path"].values[0]}

/home/nickyoungblut/bin/google-cloud-sdk/bin/gsutil
gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS/Arabidopsis_thaliana/SRX17302366.h5ad
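
If `gsutil` is not installed, a similar check can be done from Python. fsspec-style file systems such as `gcsfs.GCSFileSystem` provide an `exists` method; the sketch below wraps it in a hypothetical helper, `find_missing`, and uses a small stand-in class (`FakeFS`, also hypothetical) so that the example runs offline.

```python
def find_missing(fs, paths):
    """Return the paths for which fs.exists(path) is False."""
    return [p for p in paths if not fs.exists(p)]

class FakeFS:
    """Offline stand-in exposing only the `exists` method used above."""
    def __init__(self, known_paths):
        self.known_paths = set(known_paths)

    def exists(self, path):
        return path in self.known_paths

# Made-up paths for illustration
fake_fs = FakeFS(["bucket/a.h5ad", "bucket/b.h5ad"])
missing = find_missing(fake_fs, ["bucket/a.h5ad", "bucket/c.h5ad"])
print(missing)
```

With the objects from this tutorial, the call would look like `find_missing(fs, sample_metadata["file_path"].head(20).tolist())`. The check is done one path at a time, so a small subset is a sensible starting point.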


# Explore the per-obs metadata

* `obs` ≃ cell

In [61]:
# The list of metadata files per organism
obs_pq_files

Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
1,Bos_taurus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
2,Caenorhabditis_elegans,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
3,Callithrix_jacchus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
4,Danio_rerio,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
5,Drosophila_melanogaster,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
6,Equus_caballus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
7,Gallus_gallus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
8,Gorilla_gorilla,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
9,Heterocephalus_glaber,arc-scbasecount/2025-02-25/metadata/GeneFull_E...


In [62]:
# let's read in the metadata for a single organism
target_organism = "Bos_taurus"

In [63]:
# extract the file path
infile = obs_pq_files[obs_pq_files["organism"] == target_organism]["file_path"].values[0]

In [64]:
# read in the first 100000 rows
obs_metadata = ds.dataset(infile, filesystem=fs, format="parquet").head(100000).to_pandas()
print(obs_metadata.shape)
obs_metadata.head()

(100000, 4)


Unnamed: 0,gene_count,umi_count,SRX_accession,cell_barcode
0,5580,19602.0,ERX13041271,AAACCCACACCTATCC
1,6478,27106.0,ERX13041271,AAACCCACAGACTGCC
2,3731,9476.0,ERX13041271,AAACCCACATCGTGCG
3,3879,10705.0,ERX13041271,AAACCCAGTGTGAATA
4,4100,10589.0,ERX13041271,AAACCCATCACAATGC


In [65]:
# distribution of gene counts
obs_metadata["gene_count"].describe()

count    100000.000000
mean       2628.960660
std        1651.533013
min          33.000000
25%        1347.000000
50%        2247.000000
75%        3721.000000
max        9896.000000
Name: gene_count, dtype: float64

In [66]:
# distribution of umi counts
obs_metadata["umi_count"].describe()

count    100000.000000
mean       8612.842773
std        9093.947266
min         500.000000
25%        2784.000000
50%        5618.000000
75%       11446.000000
max      139809.000000
Name: umi_count, dtype: float64

## Get per-obs metadata for specific samples

Method:

1. Query the sample metadata
2. Use the filtered sample metadata to query the cell metadata

#### Filter sample metadata

Let's get all sheep and horse samples with `obs_count > 10000`

In [67]:
target_organisms = ["Ovis aries", "Equus caballus"]
obs_count_cutoff = 10000

In [68]:
# get the target samples
target_samples = sample_metadata[(sample_metadata["organism"].isin(target_organisms)) & (sample_metadata["obs_count"] > obs_count_cutoff)]
print(target_samples.shape)
target_samples.head()

(12, 14)


Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name
1,35575330,SRX26348968,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,10322,10x_Genomics,3_prime_gex,single_cell,Equus caballus,uterus,unsure,unsure,unsure,,
2,31746999,SRX23498639,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,10395,10x_Genomics,3_prime_gex,single_cell,Equus caballus,skeletal system,osteoarthritis,none,not applicable,,
8,31747002,SRX23498642,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,13357,10x_Genomics,3_prime_gex,single_cell,Equus caballus,skeletal system,osteoarthritis,none,not applicable,,
10,35575334,SRX26348972,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,16167,10x_Genomics,3_prime_gex,single_cell,Equus caballus,uterus,not specified,not specified,not specified,,
2,23639074,SRX16872041,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,12527,10x_Genomics,3_prime_gex,single_cell,Ovis aries,testis,unsure,unsure,unsure,,
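
Index label 2 appears twice in the table above, once for a horse sample and once for a sheep sample, because `pd.concat` keeps each per-organism index by default. Label-based lookups such as `.loc[2]` then return several rows. A minimal sketch on toy frames (the column subset is illustrative) showing the `ignore_index=True` option:

```python
import pandas as pd

# Two toy per-organism tables, each with its own 0-based index
horse = pd.DataFrame({"organism": ["Equus caballus"] * 2, "obs_count": [10322, 10395]})
sheep = pd.DataFrame({"organism": ["Ovis aries"] * 2, "obs_count": [12527, 12515]})

kept = pd.concat([horse, sheep])                            # index: 0, 1, 0, 1
renumbered = pd.concat([horse, sheep], ignore_index=True)   # index: 0, 1, 2, 3

print(kept.index.is_unique, renumbered.index.is_unique)
```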


In [69]:
# filter the obs metadata
target_orgs = [x.replace(" ", "_") for x in target_samples["organism"].unique().tolist()]
target_obs_files = obs_pq_files[obs_pq_files["organism"].isin(target_orgs)]
target_obs_files

Unnamed: 0,organism,file_path
6,Equus_caballus,arc-scbasecount/2025-02-25/metadata/GeneFull_E...
15,Ovis_aries,arc-scbasecount/2025-02-25/metadata/GeneFull_E...


In [70]:
# read in the obs metadata
obs_metadata = []
for i,row in target_obs_files.iterrows():
    obs_metadata.append(
        ds.dataset(row["file_path"], filesystem=fs, format="parquet").to_table().to_pandas()
    )
obs_metadata = pd.concat(obs_metadata)

# merge with the target samples
obs_metadata = target_samples.merge(obs_metadata, left_on="srx_accession", right_on="SRX_accession")

print(obs_metadata.shape)
obs_metadata.head()

(151813, 18)


Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name,gene_count,umi_count,SRX_accession,cell_barcode
0,35575330,SRX26348968,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,10322,10x_Genomics,3_prime_gex,single_cell,Equus caballus,uterus,unsure,unsure,unsure,,,1803,4539.0,SRX26348968,AAACCCAAGATGTTCC
1,35575330,SRX26348968,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,10322,10x_Genomics,3_prime_gex,single_cell,Equus caballus,uterus,unsure,unsure,unsure,,,1228,2250.0,SRX26348968,AAACCCAAGGAGTATT
2,35575330,SRX26348968,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,10322,10x_Genomics,3_prime_gex,single_cell,Equus caballus,uterus,unsure,unsure,unsure,,,4238,11970.0,SRX26348968,AAACCCAAGGCTAGCA
3,35575330,SRX26348968,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,10322,10x_Genomics,3_prime_gex,single_cell,Equus caballus,uterus,unsure,unsure,unsure,,,369,862.0,SRX26348968,AAACCCACAGCGTTGC
4,35575330,SRX26348968,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,10322,10x_Genomics,3_prime_gex,single_cell,Equus caballus,uterus,unsure,unsure,unsure,,,1012,2974.0,SRX26348968,AAACCCACATATGGCT


In [71]:
# gene_count distribution per sample
obs_metadata.groupby(["organism", "srx_accession"])["gene_count"].describe()

organism,srx_accession,count,mean,std,min,25%,50%,75%,max
Equus caballus,SRX23498639,10395.0,1877.877345,1203.399156,269.0,869.5,1867.0,2558.0,7680.0
Equus caballus,SRX23498642,13357.0,2139.489856,934.999318,166.0,1487.0,2180.0,2646.0,7010.0
Equus caballus,SRX26348968,10322.0,1378.516082,853.656327,99.0,802.0,1229.0,1747.75,8712.0
Equus caballus,SRX26348972,16167.0,1227.339024,694.72539,370.0,736.0,1016.0,1511.0,8515.0
Ovis aries,SRX16872034,12515.0,2081.487655,1099.099994,88.0,1353.0,1922.0,2579.0,12328.0
Ovis aries,SRX16872035,12658.0,2340.459946,1199.428937,104.0,1541.25,2173.5,2908.0,13030.0
Ovis aries,SRX16872037,12483.0,1977.945526,1056.576855,74.0,1275.0,1817.0,2452.0,12143.0
Ovis aries,SRX16872039,12749.0,1848.423798,1336.094627,167.0,1041.0,1414.0,2058.0,9690.0
Ovis aries,SRX16872040,12991.0,2008.823724,1426.545301,140.0,1143.0,1563.0,2252.0,10193.0
Ovis aries,SRX16872041,12527.0,1698.558314,1250.11806,234.0,946.5,1278.0,1874.0,9073.0


In [72]:
# umi_count distribution per sample
obs_metadata.groupby(["organism", "srx_accession"])["umi_count"].describe()

organism,srx_accession,count,mean,std,min,25%,50%,75%,max
Equus caballus,SRX23498639,10395.0,5371.010254,5839.104492,500.0,1450.5,4206.0,7038.0,61960.0
Equus caballus,SRX23498642,13357.0,7223.476074,5019.631348,503.0,3461.0,7055.0,9203.0,82629.0
Equus caballus,SRX26348968,10322.0,3605.947266,3930.548828,500.0,1561.5,2845.0,4462.0,112096.0
Equus caballus,SRX26348972,16167.0,2750.778809,2259.334717,923.0,1377.5,2141.0,3360.0,76825.0
Ovis aries,SRX16872034,12515.0,4799.313965,4025.047607,500.0,2355.5,3916.0,6199.0,95753.0
Ovis aries,SRX16872035,12658.0,5756.79248,4855.183105,501.0,2821.5,4696.5,7468.0,115398.0
Ovis aries,SRX16872037,12483.0,4443.443848,3717.892334,500.0,2185.0,3626.0,5750.5,88728.0
Ovis aries,SRX16872039,12749.0,3900.116943,4643.742188,500.0,1546.0,2346.0,3863.0,90552.0
Ovis aries,SRX16872040,12991.0,4448.382812,5350.766113,500.0,1742.0,2682.0,4405.0,104565.0
Ovis aries,SRX16872041,12527.0,3422.090576,4035.824463,500.0,1380.0,2053.0,3391.5,78501.0


# Read h5ad files

### Example: select marmoset samples

In [73]:
# get the target samples
query = (sample_metadata["organism"] == "Callithrix jacchus") & (sample_metadata["obs_count"] < 3000)
target_samples = sample_metadata[query].head(n=3)
target_samples

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name
3,32301720,SRX23995668,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,1097,10x_Genomics,3_prime_gex,single_cell,Callithrix jacchus,eye,unsure,"dissection, dissociation, and enrichment of re...",unsure,,
4,32301722,SRX23995670,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,2359,10x_Genomics,atac,unsure,Callithrix jacchus,eye,unsure,unsure,retinal cell types,,
11,25294805,SRX18286093,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,572,10x_Genomics,3_prime_gex,single_cell,Callithrix jacchus,other,unsure,iPSCs cultured on feeder layer with a WNT sign...,iPSC (male),,


In [74]:
# read in the anndata for those samples
adata = []
for infile in target_samples["file_path"].tolist():
    print(infile)
    with fs.open(infile, 'rb') as f:
        adata.append(sc.read_h5ad(f))

# combine anndata objects
adata = sc.concat(adata)
adata

gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/SRX23995668.h5ad
gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/SRX23995670.h5ad
gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/SRX18286093.h5ad


  utils.warn_names_duplicates("obs")


AnnData object with n_obs × n_vars = 4028 × 28346
    obs: 'gene_count', 'umi_count', 'SRX_accession'

In [75]:
# number of obs per SRX accession
adata.obs["SRX_accession"].value_counts()

SRX_accession
SRX23995670    2359
SRX23995668    1097
SRX18286093     572
Name: count, dtype: int64

In [76]:
# add per-sample metadata to the anndata object
adata.obs = adata.obs.reset_index().merge(
    target_samples, left_on="SRX_accession", right_on="srx_accession", how="inner"
)
adata.obs.head()

Unnamed: 0,index,gene_count,umi_count,SRX_accession,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name
0,AAACCTGAGAGTGACC,2138,3875.0,SRX23995668,32301720,SRX23995668,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,1097,10x_Genomics,3_prime_gex,single_cell,Callithrix jacchus,eye,unsure,"dissection, dissociation, and enrichment of re...",unsure,,
1,AAACCTGCAAAGTCAA,5516,19830.0,SRX23995668,32301720,SRX23995668,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,1097,10x_Genomics,3_prime_gex,single_cell,Callithrix jacchus,eye,unsure,"dissection, dissociation, and enrichment of re...",unsure,,
2,AAACCTGCAAGCCTAT,3259,8397.0,SRX23995668,32301720,SRX23995668,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,1097,10x_Genomics,3_prime_gex,single_cell,Callithrix jacchus,eye,unsure,"dissection, dissociation, and enrichment of re...",unsure,,
3,AAACCTGTCATCATTC,2643,5557.0,SRX23995668,32301720,SRX23995668,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,1097,10x_Genomics,3_prime_gex,single_cell,Callithrix jacchus,eye,unsure,"dissection, dissociation, and enrichment of re...",unsure,,
4,AAACCTGTCCTATGTT,1889,3326.0,SRX23995668,32301720,SRX23995668,gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_...,1097,10x_Genomics,3_prime_gex,single_cell,Callithrix jacchus,eye,unsure,"dissection, dissociation, and enrichment of re...",unsure,,
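
In the merge above, `reset_index()` moves the cell barcodes into a column named `index`, and the resulting `obs` gets a plain integer index. The `warn_names_duplicates` line in the earlier output also suggests that barcodes repeat across samples. The sketch below shows one possible way to keep a unique, informative index; it uses toy pandas frames rather than an AnnData object, and the values and the `cell_barcode` column name are illustrative.

```python
import pandas as pd

# Toy stand-ins for adata.obs and target_samples (values are made up)
obs = pd.DataFrame(
    {"gene_count": [2138, 5516, 1889], "SRX_accession": ["SRX_A", "SRX_A", "SRX_B"]},
    index=["AAACCTGAGAGTGACC", "AAACCTGCAAAGTCAA", "AAACCTGAGAGTGACC"],
)
samples = pd.DataFrame({"srx_accession": ["SRX_A", "SRX_B"], "tissue": ["eye", "other"]})

merged = obs.rename_axis("cell_barcode").reset_index().merge(
    samples,
    left_on="SRX_accession",
    right_on="srx_accession",
    how="left",              # keep every cell, in the original row order
    validate="many_to_one",  # raise if a sample appears more than once in `samples`
)

# Rebuild a unique index from sample accession + barcode
merged.index = merged["SRX_accession"] + "_" + merged["cell_barcode"]
print(merged.index.tolist())
```

A left merge keeps the number and order of rows unchanged, which matters because `obs` must stay aligned with the rows of the count matrix.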


# Downloading files

You can use [gsutil](https://cloud.google.com/storage/docs/gsutil) to download any of the files in the bucket
and work with them locally. 

Please be mindful of the [cost of egress](https://cloud.google.com/storage/pricing) when downloading data from Google Cloud Storage.

For example:

```bash
gsutil cp gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/ERX4319106.h5ad .
```

For large data transfers, it is better to use `gsutil rsync`:

```bash
gsutil rsync gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/ .
```
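
Downloads can also be driven from Python, for example to fetch only the samples selected from the metadata. The sketch below only plans the transfer: `plan_downloads` is a hypothetical helper that maps each GCS path to a local path and skips files that already exist. The destination directory name is an assumption, and the copy itself is left as a comment because it needs network access.

```python
import os

def plan_downloads(gcs_paths, dest_dir):
    """Map each GCS path to <dest_dir>/<organism>/<file name>, skipping existing files."""
    plan = []
    for src in gcs_paths:
        # the last two path components are the organism directory and the file name
        organism, filename = src.rstrip("/").split("/")[-2:]
        dst = os.path.join(dest_dir, organism, filename)
        if not os.path.exists(dst):
            plan.append((src, dst))
    return plan

# Paths as printed earlier in this tutorial
paths = [
    "gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/SRX23995668.h5ad",
    "gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/SRX23995670.h5ad",
]

plan = plan_downloads(paths, dest_dir="scbasecount_h5ad")
for src, dst in plan:
    print(src, "->", dst)
    # With the `fs` object from this tutorial, the copy could then be:
    # os.makedirs(os.path.dirname(dst), exist_ok=True)
    # fs.get(src, dst)
```

Skipping files that are already present avoids paying egress twice for the same data.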

***

# Session Info

In [77]:
!conda list

# packages in environment at /home/nickyoungblut/miniforge3/envs/scbasecount-py:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
aiohappyeyeballs          2.6.1              pyhd8ed1ab_0    conda-forge
aiohttp                   3.11.18         py313h8060acc_0    conda-forge
aiosignal                 1.3.2              pyhd8ed1ab_0    conda-forge
anndata                   0.11.4             pyhd8ed1ab_0    conda-forge
anyio                     4.9.0              pyh29332c3_0    conda-forge
argon2-cffi               23.1.0             pyhd8ed1ab_1    conda-forge
argon2-cffi-bindings      21.2.0          py313h536fd9c_5    conda-forge
array-api-compat          1.11.2             pyh29332c3_0    conda-forge
arrow                     1.3.0              pyhd8ed1ab_1    conda-forge
asttokens                 3.0.0              py