## Use Scanpy to read the AnnData file

AnnData documentation: https://anndata.readthedocs.io/en/latest/index.html

![AnnData Schema](https://raw.githubusercontent.com/scverse/anndata/main/docs/_static/img/anndata_schema.svg)

In [29]:
import pandas as pd
import scanpy

In [2]:
bladder = scanpy.read_h5ad(
    "/Users/olga/Downloads/4229bba1-974c-4595-ac5c-15a9035c5688.h5ad"
)

In [3]:
bladder

AnnData object with n_obs × n_vars = 2432 × 21069
    obs: 'FACS.selection', 'age', 'cell', 'free_annotation', 'method', 'donor_id', 'n_genes', 'n_counts', 'louvain', 'leiden', 'assay_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'n_cells', 'means', 'dispersions', 'dispersions_norm', 'highly_variable', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'age_colors', 'citation', 'leiden', 'louvain', 'neighbors', 'pca', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivit

### Expression data is a SciPy compressed sparse row (CSR) matrix

In [31]:
bladder.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 10186830 stored elements and shape (2432, 21069)>
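A CSR matrix stores only the nonzero values plus two index arrays. The following is a minimal sketch on a toy matrix; the values and the names `dense`, `toy`, and `density` are invented for illustration and are not taken from the bladder dataset.

```python
import numpy as np
from scipy import sparse

# Toy 3 samples x 4 features matrix; values are invented for illustration
dense = np.array(
    [
        [0.0, 1.5, 0.0, 0.0],
        [0.0, 0.0, 0.0, 2.0],
        [3.0, 0.0, 0.0, 4.5],
    ],
    dtype=np.float32,
)
toy = sparse.csr_matrix(dense)

# CSR keeps three arrays: the nonzero values, their column indices,
# and row pointers marking where each row starts in `data`
print(toy.data)     # nonzero values, row by row
print(toy.indices)  # column index of each stored value
print(toy.indptr)   # row i spans data[indptr[i]:indptr[i + 1]]

# Fraction of entries that are actually stored
density = toy.nnz / (toy.shape[0] * toy.shape[1])
print(f"density: {density:.2f}")
```

By the same arithmetic, `bladder.X` stores 10,186,830 of 2432 × 21069 = 51,239,808 entries, or about 20%.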

## Export the expression data (the `X` samples × features matrix) to a pandas DataFrame

`to_df()` converts the sparse matrix into a dense DataFrame. For some large datasets, the dense copy won't fit in memory on your machine and you'll need to write to parquet directly from the sparse `.X` matrix.

In [33]:
expression = bladder.to_df()
print(expression.shape)
expression.head()

(2432, 21069)


index,ENSMUSG00000029422,ENSMUSG00000076126,ENSMUSG00000049036,ENSMUSG00000029577,ENSMUSG00000040746,ENSMUSG00000020590,ENSMUSG00000030263,ENSMUSG00000038914,ENSMUSG00000026878,ENSMUSG00000093241,...,ENSMUSG00000095543,ENSMUSG00000065511,ENSMUSG00000035875,ENSMUSG00000020070,ENSMUSG00000030178,ENSMUSG00000102368,ENSMUSG00000021033,ENSMUSG00000073940,ENSMUSG00000037924,ENSMUSG00000040693
A11_B002346_B009249_S179.mm10-plus-1-0,1.546881,0.0,0.0,0.8184,2.584122,0.0,0.0,0.0,3.10186,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.372477,0.0,0.0,0.0
A12_B002723_B009459_S264.mm10-plus-1-0,1.760437,0.0,0.0,0.0,0.045921,0.0,0.0,2.885492,1.350023,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A12_D045324_B009460_S48.mm10-plus-1-0,2.546988,0.0,0.0,0.0,2.454049,0.0,0.0,0.0,2.28917,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A13_B002346_B009249_S181.mm10-plus-1-0,2.084349,0.0,0.0,0.772625,1.284956,4.948908,0.0,1.615562,2.909233,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A13_B002723_B009459_S265.mm10-plus-1-0,2.09032,0.0,0.0,0.0,0.201467,0.0,0.0,0.0,2.844065,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2.879196,0.0,0.0,0.0


In [6]:
expression.to_parquet("../data/bladder_smartseq2_expression.parquet")

## Metadata

### Per-cell metadata

In machine-learning terms, this is where you would find a `y` target vector: any of these columns could serve as the target for classification or regression.

In [36]:
print(bladder.obs.shape)
bladder.obs.head()

(2432, 30)


index,FACS.selection,age,cell,free_annotation,method,donor_id,n_genes,n_counts,louvain,leiden,...,tissue_type,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid
A11_B002346_B009249_S179.mm10-plus-1-0,Multiple,18m,A11_B002346,,facs,18_47_F,4582,159863.0,3,5,...,tissue,bladder urothelial cell,Smart-seq2,normal,Mus musculus,female,bladder lumen,na,18-month-old stage,=3O1^YgPZ7
A12_B002723_B009459_S264.mm10-plus-1-0,Multiple,18m,A12_B002723,,facs,18_53_M,3537,875472.0,3,4,...,tissue,bladder urothelial cell,Smart-seq2,normal,Mus musculus,male,bladder lumen,na,18-month-old stage,Y&&COYqNyI
A12_D045324_B009460_S48.mm10-plus-1-0,Multiple,18m,A12_D045324,,facs,18_45_M,4179,930126.0,1,0,...,tissue,bladder urothelial cell,Smart-seq2,normal,Mus musculus,male,bladder lumen,na,18-month-old stage,2`MY>a57{z
A13_B002346_B009249_S181.mm10-plus-1-0,Multiple,18m,A13_B002346,,facs,18_47_F,4139,113449.0,0,10,...,tissue,bladder urothelial cell,Smart-seq2,normal,Mus musculus,female,bladder lumen,na,18-month-old stage,<1w2Fh}t@p
A13_B002723_B009459_S265.mm10-plus-1-0,Multiple,18m,A13_B002723,,facs,18_53_M,3451,95993.0,3,7,...,tissue,bladder urothelial cell,Smart-seq2,normal,Mus musculus,male,bladder lumen,na,18-month-old stage,VhDDdwlrQ`


#### Write per-cell (samples) metadata to parquet

In AnnData parlance, `.obs` holds the per-observation metadata, i.e. the metadata on the samples (here, cells).

In [10]:
bladder.obs.to_parquet("../data/bladder_smartseq2_sample_metadata.parquet")
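To use a metadata column as a `y` target, its rows need to line up with the rows of the expression matrix. The following is a minimal sketch with toy tables: `toy_expression`, `toy_obs`, and the `gene_a`/`gene_b`/`cell_*` labels are invented stand-ins, and only the `sex` column name and its `male`/`female` values mirror the real `.obs` shown above.

```python
import pandas as pd

# Toy stand-ins for the exported expression and per-cell metadata tables
toy_expression = pd.DataFrame(
    {"gene_a": [1.5, 0.0, 2.5], "gene_b": [0.0, 2.9, 0.0]},
    index=["cell_1", "cell_2", "cell_3"],
)
toy_obs = pd.DataFrame(
    {"sex": pd.Categorical(["male", "female", "male"])},
    index=["cell_3", "cell_1", "cell_2"],  # deliberately in a different order
)

# Align the metadata rows to the expression rows by sample ID
y = toy_obs.loc[toy_expression.index, "sex"]

# Feature matrix and integer-encoded target, in matching row order
X = toy_expression.to_numpy()
y_codes = y.cat.codes.to_numpy()

print(X.shape, y.tolist(), y_codes.tolist())
```

Selecting with `.loc` on the shared index keeps `X` and `y` in the same row order even if the two tables were saved or loaded in different orders.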

### Per-gene metadata

In AnnData parlance, `.var` holds the per-variable metadata, i.e. the metadata on the features (here, genes).

In [37]:
print(bladder.var.shape)
bladder.var.head()

(21069, 11)


,n_cells,means,dispersions,dispersions_norm,highly_variable,feature_is_filtered,feature_name,feature_reference,feature_biotype,feature_length,feature_type
ENSMUSG00000029422,65053,0.7885286,0.236148,-0.827445,False,False,Rsrc2_ENSMUSG00000029422,NCBITaxon:10090,gene,1377,protein_coding
ENSMUSG00000076126,20,1e-12,,,False,False,Mir669b,NCBITaxon:10090,gene,97,miRNA
ENSMUSG00000049036,3000,0.004733051,0.282755,0.497139,False,False,Tmem121_ENSMUSG00000049036,NCBITaxon:10090,gene,1574,protein_coding
ENSMUSG00000029577,22639,0.1112182,0.216849,0.453594,False,False,Ube3b_ENSMUSG00000029577,NCBITaxon:10090,gene,3879,protein_coding
ENSMUSG00000040746,52908,0.510273,0.668143,0.372446,False,False,Rnf167_ENSMUSG00000040746,NCBITaxon:10090,gene,867,protein_coding


#### Write per-gene (features) metadata to parquet

In [11]:
bladder.var.to_parquet("../data/bladder_smartseq2_feature_metadata.parquet")

## Show unstructured data

In [13]:
bladder.uns

{'age_colors': array(['#e1f3b2', '#97d6b9', '#1f80b8'], dtype=object),
 'citation': 'Publication: https://doi.org/10.1038/s41586-020-2496-1 Dataset Version: https://datasets.cellxgene.cziscience.com/4229bba1-974c-4595-ac5c-15a9035c5688.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/0b9d8a04-bb9d-44da-aa27-705bb65b54eb',
 'leiden': {'params': {'n_iterations': array([-1]),
   'random_state': array([0]),
   'resolution': array([1])}},
 'louvain': {'params': {'random_state': array([0])}},
 'neighbors': {'params': {'method': array(['umap'], dtype=object),
   'metric': array(['euclidean'], dtype=object),
   'n_neighbors': array([15]),
   'n_pcs': array([5])}},
 'pca': {'variance': array([454.20193  ,  78.98866  ,  64.31594  ,  34.90398  ,  29.65793  ,
          23.4138   ,  18.985216 ,  16.46509  ,  15.075678 ,  13.814811 ,
          12.145706 ,  11.504896 ,  10.541939 ,   9.79446  ,   9.070989 ,
           8.92252  ,   8.503

In [18]:
type(bladder.uns)

dict

#### Make object JSON serializable

In [20]:
import json

import numpy as np


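def make_json_serializable(obj):
    """Recursively convert NumPy arrays and scalars to plain Python types."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, dict):
        return {k: make_json_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, (list, tuple)):
        return [make_json_serializable(i) for i in obj]
    elif isinstance(obj, np.generic):
        # np.generic covers every NumPy scalar type, including np.bool_,
        # which the json module cannot serialize on its own
        return obj.item()
    else:
        return obj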


# Convert to JSON-serializable object
bladder_uns_serializable = make_json_serializable(bladder.uns)

# Save to file
with open("../data/bladder_smartseq2_unstructured_metadata.json", "w") as f:
    json.dump(bladder_uns_serializable, f, indent=2)

## Save the `X_*` matrices from `obsm`

These matrices are transformations of the expression data, from (samples × features) to (samples × latent dimensions). Here they are the PCA, t-SNE, and UMAP projections.

In [22]:
bladder.obsm_keys()

['X_pca', 'X_tsne', 'X_umap']

In [23]:
bladder.obsm

AxisArrays with keys: X_pca, X_tsne, X_umap

In [41]:
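for key in bladder.obsm_keys():
    X = bladder.obsm[key]
    # Index rows by sample ID so each projection can be joined back to `.obs`
    df = pd.DataFrame(X, index=bladder.obs_names)
    # Parquet column names must be strings; some pandas/engine versions reject integer labels
    df.columns = df.columns.astype(str)
    print(f"key: {key}, shape: {df.shape}")
    df.to_parquet(f"../data/bladder_smartseq2_projection_{key}.parquet")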

key: X_pca, shape: (2432, 50)
key: X_tsne, shape: (2432, 2)
key: X_umap, shape: (2432, 2)
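To make the (samples × features) to (samples × latent dimensions) change of shape concrete, the following is a minimal sketch of PCA via SVD on random toy data. It only illustrates the idea; it is not necessarily how the stored `X_pca` was computed, and `toy_X`, `toy_pca`, and `n_components` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(6, 4))  # 6 samples x 4 features, random toy data

# PCA via SVD: center each feature, then project onto the top components
centered = toy_X - toy_X.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
n_components = 2
toy_pca = centered @ Vt[:n_components].T  # 6 samples x 2 latent dimensions

print(toy_X.shape, "->", toy_pca.shape)
```

The number of rows (samples) is unchanged, which is why each projection can share `bladder.obs_names` as its index.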
