### Size issue 
A file size of 28.45 GB is very large for a single-cell RNA-seq .h5ad dataset and will likely pose significant computational challenges for a standard analysis,
especially when using a Transformer-based model like GenFormer.

Memory (RAM): A 28 GB .h5ad file, which typically stores the expression matrix in sparse form, requires a substantial amount of RAM when loaded and processed. If the data is converted to a dense matrix, or if intermediate steps such as PCA or nearest-neighbor computation create large temporary arrays, the memory footprint can easily exceed 100-200 GB. For this dataset (2,096,155 cells × 17,267 genes, stored as float32), a dense copy of the matrix alone would take about 145 GB. Without access to a high-memory computing environment (e.g., a large server or cloud instance), the analysis is likely to run out of memory and crash.
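
As a rough check on these figures, the arithmetic can be sketched in plain Python. The shape, dtype, and stored-element count are copied from the `adata` and `adata.X` outputs shown later in this notebook. The 8-byte index width is an assumption (the number of stored elements exceeds the int32 range, so 64-bit indices are likely), and the helper names are illustrative only.

```python
# Back-of-the-envelope memory estimate for the expression matrix.
# Values copied from the notebook outputs (adata and adata.X).
N_CELLS = 2_096_155
N_GENES = 17_267
NNZ = 6_609_855_574        # stored elements in adata.X
VALUE_BYTES = 4            # float32
INDEX_BYTES = 8            # assumption: int64 indices, since NNZ > 2**31 - 1


def dense_bytes(n_rows, n_cols, value_bytes):
    """Bytes needed to hold the full matrix as a dense array."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, nnz, value_bytes, index_bytes):
    """Bytes for the three CSR arrays: data, indices, and indptr."""
    data = nnz * value_bytes
    indices = nnz * index_bytes
    indptr = (n_rows + 1) * index_bytes
    return data + indices + indptr


dense_gb = dense_bytes(N_CELLS, N_GENES, VALUE_BYTES) / 1e9
sparse_gb = csr_bytes(N_CELLS, NNZ, VALUE_BYTES, INDEX_BYTES) / 1e9

print(f"dense float32: {dense_gb:.1f} GB")
print(f"CSR (int64 indices): {sparse_gb:.1f} GB")
```

Under these assumptions the dense copy comes to roughly 145 GB and the in-memory CSR matrix to roughly 79 GB, both well above the 28.45 GB on disk.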

Processing Time: Even with sufficient RAM, a dataset of this size (about 2.1 million cells) leads to long processing times for any single-cell workflow, particularly for computationally intensive steps such as training a large neural network model like GenFormer.

In [2]:
import os
import scanpy as sc
import anndata as ad
import pandas as pd
import cellxgene_census
from sklearn.metrics import normalized_mutual_info_score
import matplotlib.pyplot as plt
import seaborn as sns
import umap.umap_ as umap

In [3]:
adata = sc.read_h5ad('/projects/bioinformatics/DB/scRNAseq_parkinson/dataset.h5ad')

In [4]:
adata

AnnData object with n_obs × n_vars = 2096155 × 17267
    obs: 'n_genes', 'n_counts', 'Brain_bank', 'RIN', 'path_braak_lb', 'derived_class2', 'PMI', 'tissue_ontology_term_id', 'tissue_type', 'assay_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'donor_id', 'suspension_type', 'is_primary_data', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'gene_name', 'n_cells', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'batch_condition', 'citation', 'genome', 'organism', 'organism_ontology_term_id', 'schema_reference', 'schema_version', 'title', 'uid'
    obsm: 'X_umap'

In [5]:
adata.obs.head()

barcodekey,n_genes,n_counts,Brain_bank,RIN,path_braak_lb,derived_class2,PMI,tissue_ontology_term_id,tissue_type,assay_ontology_term_id,...,suspension_type,is_primary_data,cell_type,assay,disease,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid
Set10_C1-AAACCCACATCACGGC,2929,5645.0,UM,4.8,3,Oligo,20.91,UBERON:0002477,tissue,EFO:0009922,...,nucleus,True,oligodendrocyte,10x 3' v3,Parkinson disease,male,medial globus pallidus,European,82-year-old stage,MpT)aB)#*U
Set10_C1-AAACCCAGTAGCACAG,4395,14741.0,UD,4.4,3,Oligo,20.91,UBERON:0001384,tissue,EFO:0009922,...,nucleus,True,oligodendrocyte,10x 3' v3,Parkinson disease,female,primary motor cortex,European,73-year-old stage,qFT~wG^BCg
Set10_C1-AAACCCAGTATGTCCA,3856,11005.0,UM,4.8,3,Oligo,20.91,UBERON:0002477,tissue,EFO:0009922,...,nucleus,True,oligodendrocyte,10x 3' v3,Parkinson disease,male,medial globus pallidus,European,82-year-old stage,fE`VpY%8fb
Set10_C1-AAACCCAGTCCAGAAG,3387,8134.0,UD,4.4,3,Oligo,20.91,UBERON:0001384,tissue,EFO:0009922,...,nucleus,True,oligodendrocyte,10x 3' v3,Parkinson disease,female,primary motor cortex,European,73-year-old stage,g1Lr8^y)O(
Set10_C1-AAACCCATCCACGTAA,2366,4583.0,UD,4.4,3,Astro,20.91,UBERON:0001384,tissue,EFO:0009922,...,nucleus,True,astrocyte,10x 3' v3,Parkinson disease,female,primary motor cortex,European,73-year-old stage,MR9tUCifao


In [7]:
adata.obs.columns

Index(['n_genes', 'n_counts', 'Brain_bank', 'RIN', 'path_braak_lb',
       'derived_class2', 'PMI', 'tissue_ontology_term_id', 'tissue_type',
       'assay_ontology_term_id', 'disease_ontology_term_id',
       'cell_type_ontology_term_id',
       'self_reported_ethnicity_ontology_term_id',
       'development_stage_ontology_term_id', 'sex_ontology_term_id',
       'donor_id', 'suspension_type', 'is_primary_data', 'cell_type', 'assay',
       'disease', 'sex', 'tissue', 'self_reported_ethnicity',
       'development_stage', 'observation_joinid'],
      dtype='object')

In [17]:
adata.X[:10,:10].toarray()

array([[0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
        2.9293141, 0.       , 0.       , 0.       ],
       [0.       , 0.       , 2.6787999, 0.       , 0.       , 3.5530312,
        0.       , 0.       , 0.       , 0.       ],
       [0.       , 0.       , 2.3112254, 0.       , 0.       , 0.       ,
        2.3112254, 0.       , 0.       , 0.       ],
       [0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.       , 0.       , 0.       ],
       [0.       , 0.       , 0.       , 3.1276271, 4.882185 , 0.       ,
        0.       , 0.       , 0.       , 0.       ],
       [0.       , 0.       , 0.       , 0.       , 2.4252477, 2.4252477,
        2.4252477, 0.       , 0.       , 0.       ],
       [0.       , 0.       , 0.       , 0.       , 2.918646 , 2.918646 ,
        0.       , 0.       , 0.       , 0.       ],
       [0.       , 0.       , 0.       , 0.       , 0.       , 3.3781989,
        0.       , 0.       , 0.     
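
The values in this slice are not integers, which suggests that `adata.X` holds normalized (possibly log-transformed) expression rather than raw UMI counts. A minimal sketch of that check in plain Python, using a few values copied from the output above; `looks_like_counts` is a hypothetical helper, not part of scanpy or anndata.

```python
# A few non-zero values copied from the adata.X[:10, :10] output above.
sample_values = [2.9293141, 2.6787999, 3.5530312, 2.3112254, 3.1276271, 4.882185]


def looks_like_counts(values):
    """Return True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


print(looks_like_counts(sample_values))    # False: the values are fractional
print(looks_like_counts([0.0, 1.0, 5.0]))  # True: consistent with raw counts
```

If raw counts are needed, `adata.raw` (inspected at the end of this notebook) is the first place to look; whether it actually holds counts for this dataset is not shown here.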

In [18]:
adata.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 6609855574 stored elements and shape (2096155, 17267)>
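
The two numbers in this output are enough to estimate how sparse the matrix is. A minimal sketch in plain Python, with the values copied from the output above (the variable names are illustrative):

```python
# Density of adata.X, from the shape and stored-element count printed above.
n_cells, n_genes = 2_096_155, 17_267
nnz = 6_609_855_574

density = nnz / (n_cells * n_genes)      # fraction of entries that are stored
mean_stored_per_cell = nnz / n_cells     # average stored entries per row

print(f"density: {density:.1%}")
print(f"mean stored entries per cell: {mean_stored_per_cell:.0f}")
```

Roughly 18% of the entries are stored, about 3,150 per cell on average, which is in line with the `n_genes` values shown in `adata.obs`.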

In [6]:
adata.var.head()

gene_id,gene_name,n_cells,feature_is_filtered,feature_name,feature_reference,feature_biotype,feature_length,feature_type
ENSG00000186827,TNFRSF4,7846,False,TNFRSF4,NCBITaxon:9606,gene,1039,protein_coding
ENSG00000186891,TNFRSF18,9000,False,TNFRSF18,NCBITaxon:9606,gene,789,protein_coding
ENSG00000160072,ATAD3B,340777,False,ATAD3B,NCBITaxon:9606,gene,3300,protein_coding
ENSG00000041988,THAP3,209923,False,THAP3,NCBITaxon:9606,gene,931,protein_coding
ENSG00000142611,PRDM16,283396,False,PRDM16,NCBITaxon:9606,gene,3730,protein_coding


In [8]:
adata.var.columns

Index(['gene_name', 'n_cells', 'feature_is_filtered', 'feature_name',
       'feature_reference', 'feature_biotype', 'feature_length',
       'feature_type'],
      dtype='object')

In [10]:
adata.layers.keys()

KeysView(Layers with keys: )

In [11]:
adata.obsm.keys()

KeysView(AxisArrays with keys: X_umap)

In [12]:
adata.uns.keys()

dict_keys(['batch_condition', 'citation', 'genome', 'organism', 'organism_ontology_term_id', 'schema_reference', 'schema_version', 'title', 'uid'])

In [22]:
adata.uns

{'batch_condition': array(['Brain_bank'], dtype=object),
 'citation': 'Publication: https://doi.org/10.1038/s41597-024-04117-y Dataset Version: https://datasets.cellxgene.cziscience.com/1f51228e-898e-474f-b8f2-5e7ad392b51c.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/d5d0df8f-4eee-49d8-a221-a288f50a1590',
 'genome': 'GRCh38',
 'organism': 'Homo sapiens',
 'organism_ontology_term_id': 'NCBITaxon:9606',
 'schema_reference': 'https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/6.0.0/schema.md',
 'schema_version': '6.0.0',
 'title': "Parkinson's disease",
 'uid': 'GRCh38-rna'}

In [19]:
metadata = pd.read_csv('/projects/bioinformatics/DB/scRNAseq_parkinson/metadata.csv')

In [20]:
metadata

,barcodekey,n_genes,n_counts,Brain_bank,RIN,path_braak_lb,derived_class2,PMI,tissue_ontology_term_id,tissue_type,...,suspension_type,is_primary_data,cell_type,assay,disease,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid
0,Set10_C1-AAACCCACATCACGGC,2929,5645.0,UM,4.8,3,Oligo,20.910000,UBERON:0002477,tissue,...,nucleus,True,oligodendrocyte,10x 3' v3,Parkinson disease,male,medial globus pallidus,European,82-year-old stage,MpT)aB)#*U
1,Set10_C1-AAACCCAGTAGCACAG,4395,14741.0,UD,4.4,3,Oligo,20.910000,UBERON:0001384,tissue,...,nucleus,True,oligodendrocyte,10x 3' v3,Parkinson disease,female,primary motor cortex,European,73-year-old stage,qFT~wG^BCg
2,Set10_C1-AAACCCAGTATGTCCA,3856,11005.0,UM,4.8,3,Oligo,20.910000,UBERON:0002477,tissue,...,nucleus,True,oligodendrocyte,10x 3' v3,Parkinson disease,male,medial globus pallidus,European,82-year-old stage,fE`VpY%8fb
3,Set10_C1-AAACCCAGTCCAGAAG,3387,8134.0,UD,4.4,3,Oligo,20.910000,UBERON:0001384,tissue,...,nucleus,True,oligodendrocyte,10x 3' v3,Parkinson disease,female,primary motor cortex,European,73-year-old stage,g1Lr8^y)O(
4,Set10_C1-AAACCCATCCACGTAA,2366,4583.0,UD,4.4,3,Astro,20.910000,UBERON:0001384,tissue,...,nucleus,True,astrocyte,10x 3' v3,Parkinson disease,female,primary motor cortex,European,73-year-old stage,MR9tUCifao
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2096150,set7A_2-TTTGTTGTCAAGAGTA,2976,6626.0,MSSM,4.7,1,Oligo,10.166667,UBERON:0001384,tissue,...,nucleus,True,oligodendrocyte,10x 3' v3,normal,female,primary motor cortex,European,78-year-old stage,25FK4DT!Rl
2096151,set7A_2-TTTGTTGTCACTCTTA,3947,11414.0,MSSM,4.3,2,EN,10.166667,UBERON:0002436,tissue,...,nucleus,True,glutamatergic neuron,10x 3' v3,Parkinson disease,female,primary visual cortex,European,81-year-old stage,PdMVWS9R^v
2096152,set7A_2-TTTGTTGTCAGTGCGC,3212,8142.0,MSSM,3.0,4,Micro_PVM,10.166667,UBERON:0001384,tissue,...,nucleus,True,central nervous system macrophage,10x 3' v3,Parkinson disease,male,primary motor cortex,European,84-year-old stage,Q&?JBHEFAz
2096153,set7A_2-TTTGTTGTCGCCTATC,4442,12156.0,MSSM,4.7,1,OPC,10.166667,UBERON:0001384,tissue,...,nucleus,True,oligodendrocyte precursor cell,10x 3' v3,normal,female,primary motor cortex,European,78-year-old stage,@vUY6<XF#a


In [21]:
adata.raw


<anndata._core.raw.Raw at 0x7ffea819a390>