### 1. General info of dataset GSE140819

This is the Jupyter Notebook for dataset GSE140819. The dataset includes a single h5 file for each sample.

In total, there are 40 samples from various origins, processed with different technologies. The pediatric tumours among them are:

Neuroblastoma (NB): HTAPP-312-SMP-901, HTAPP-312-SMP-902, HTAPP-656-SMP-3481, HTAPP-244-SMP-451 (nuclei), HTAPP-656-SMP-3481 (nuclei)

Glioblastoma (GB): HTAPP-443-SMP-5491

Sarcoma: HTAPP-951-SMP-4652 (nuclei), HTAPP-975-SMP-4771 (nuclei)


In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as anndata
import scipy

from pathlib import Path



In [14]:
# Inspect one of the provided raw h5 files
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE140819')

adata_path = data_directory / 'GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1_raw_feature_bc_matrix.h5'
adata = sc.read_10x_h5(adata_path)

adata

  utils.warn_names_duplicates("var")
  utils.warn_names_duplicates("var")


AnnData object with n_obs × n_vars = 6794880 × 33538
    var: 'gene_ids', 'feature_types', 'genome'

In [15]:
adata.var

,gene_ids,feature_types,genome
MIR1302-2HG,ENSG00000243485,Gene Expression,GRCh38-3.0.0_premrna
FAM138A,ENSG00000237613,Gene Expression,GRCh38-3.0.0_premrna
OR4F5,ENSG00000186092,Gene Expression,GRCh38-3.0.0_premrna
AL627309.1,ENSG00000238009,Gene Expression,GRCh38-3.0.0_premrna
AL627309.3,ENSG00000239945,Gene Expression,GRCh38-3.0.0_premrna
...,...,...,...
AC233755.2,ENSG00000277856,Gene Expression,GRCh38-3.0.0_premrna
AC233755.1,ENSG00000275063,Gene Expression,GRCh38-3.0.0_premrna
AC240274.1,ENSG00000271254,Gene Expression,GRCh38-3.0.0_premrna
AC213203.1,ENSG00000277475,Gene Expression,GRCh38-3.0.0_premrna


In [16]:
adata.obs 

AAACCCAAGAAACACT-1
AAACCCAAGAAACCAT-1
AAACCCAAGAAACCCA-1
AAACCCAAGAAACCCG-1
AAACCCAAGAAACCTG-1
...
TTTGTTGTCTTTGCTA-1
TTTGTTGTCTTTGCTG-1
TTTGTTGTCTTTGGAG-1
TTTGTTGTCTTTGGCT-1
TTTGTTGTCTTTGTCG-1


In [20]:
np.max(adata.X[:100000, :])

1298.0

In [28]:
np.max(adata.X[:500, :])

162.0

The maxima are whole numbers, so although the count matrix is stored as floats, the values still appear to be raw counts.
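
Spot-checking maxima is only suggestive; a stricter test is whether every stored value is a whole number. The sketch below is a minimal, pure-Python illustration on toy values. `is_integer_valued` is a hypothetical helper name, not part of scanpy or anndata.

```python
from typing import Iterable


def is_integer_valued(values: Iterable[float]) -> bool:
    """Return True if every value is a whole number (e.g. 162.0, not 1.5)."""
    return all(float(v).is_integer() for v in values)


# Toy values standing in for the non-zero entries of a count matrix.
raw_like = [1.0, 3.0, 162.0, 1298.0]
normalised_like = [0.0, 0.6931, 2.3979]

print(is_integer_valued(raw_like))         # True
print(is_integer_valued(normalised_like))  # False
```

For the object loaded above, a call such as `is_integer_valued(adata.X.data)` would apply the same check, assuming `adata.X` is a SciPy sparse matrix that exposes its stored entries through `.data`. On a matrix of this size a vectorised check is likely preferable.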

In [33]:
# inspect the metadata
meta_path = data_directory / 'GSM4186962_metadata_HTAPP-312-SMP-902_fresh-C4-T2_channel1.csv'
meta = pd.read_csv(meta_path, sep=',', index_col=0)

meta

,annotate,nReads,nUMI,nGene,percent_mito,emptydrop,doublet
HTAPP-312-SMP-902_fresh-C4-T2_channel1-AAAGATGAGACATAAC,Neuroendocrine,39239,2267,1192,0.059550,False,False
HTAPP-312-SMP-902_fresh-C4-T2_channel1-AAAGATGTCTGGAGCC,Fibroblast,21923,1384,770,0.032514,False,False
HTAPP-312-SMP-902_fresh-C4-T2_channel1-AAAGCAATCTAACCGA,T cell,31269,1780,757,0.024157,False,False
HTAPP-312-SMP-902_fresh-C4-T2_channel1-AAAGTAGAGCTAAACA,Neuroendocrine,59927,3414,1755,0.067955,False,False
HTAPP-312-SMP-902_fresh-C4-T2_channel1-AAAGTAGCAGGAATCG,Neuroendocrine,26150,1508,1008,0.076260,False,False
...,...,...,...,...,...,...,...
HTAPP-312-SMP-902_fresh-C4-T2_channel1-TTTGCGCAGTTCCACA,Neuroendocrine,53458,3144,1679,0.034669,False,False
HTAPP-312-SMP-902_fresh-C4-T2_channel1-TTTGCGCCACAGACAG,Neuroendocrine,67111,3874,1787,0.035880,False,False
HTAPP-312-SMP-902_fresh-C4-T2_channel1-TTTGGTTTCATAAAGG,Fibroblast,28516,1809,802,0.039248,False,False
HTAPP-312-SMP-902_fresh-C4-T2_channel1-TTTGTCAAGTTCCACA,Fibroblast,34314,2079,908,0.035113,False,False


The metadata lists only the filtered cells, so its barcodes can be used to extract those cells from the raw h5 files.
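
The matching step can be sketched as follows. Metadata row names carry the sample prefix followed by the bare barcode, whereas the h5 files use barcodes ending in `-1`. `metadata_name_to_barcode` is a hypothetical helper, the `-1` suffix is an assumption based on the barcodes shown in `adata.obs` above, and `raw_barcodes` is a toy stand-in for the index of a raw matrix.

```python
def metadata_name_to_barcode(cell_name: str, suffix: str = "-1") -> str:
    """Convert '<sample>-<BARCODE>' into '<BARCODE>-1'.

    The sample prefix itself contains hyphens, so split on the last one only.
    """
    barcode = cell_name.rsplit("-", 1)[1]
    return barcode + suffix


metadata_names = [
    "HTAPP-312-SMP-902_fresh-C4-T2_channel1-AAAGATGAGACATAAC",
    "HTAPP-312-SMP-902_fresh-C4-T2_channel1-AAAGATGTCTGGAGCC",
]
# Toy stand-in for adata.obs.index of the raw, unfiltered matrix.
raw_barcodes = ["AAAGATGAGACATAAC-1", "AAAGATGTCTGGAGCC-1", "TTTGTTGTCTTTGTCG-1"]

# Keep only the raw barcodes that also appear in the metadata.
keep = {metadata_name_to_barcode(name) for name in metadata_names}
filtered = [bc for bc in raw_barcodes if bc in keep]
print(filtered)  # ['AAAGATGAGACATAAC-1', 'AAAGATGTCTGGAGCC-1']
```

Splitting on the last hyphen avoids depending on the literal string `channel1` appearing in every sample name.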

### 2. AnnData object of each sample

Neuroblastoma (NB): HTAPP-312-SMP-901, HTAPP-312-SMP-902, HTAPP-656-SMP-3481, HTAPP-244-SMP-451 (nuclei), HTAPP-656-SMP-3481 (nuclei)

Glioblastoma (GB): HTAPP-443-SMP-5491

Sarcoma: HTAPP-951-SMP-4652 (nuclei), HTAPP-975-SMP-4771 (nuclei)

In [6]:
# Load the metadata from Figure 1b
cancer_tissue_dict = {
    'HTAPP-312-SMP-901': ['neuroblastoma', 'neuroendocrine'],
    'HTAPP-312-SMP-902': ['neuroblastoma', 'neuroendocrine'],
    'HTAPP-656-SMP-3481': ['neuroblastoma', 'neuroendocrine'],
    'HTAPP-244-SMP-451': ['neuroblastoma', 'neuroendocrine'],
    'HTAPP-443-SMP-5491': ['glioblastoma', 'neuronal'],
    'HTAPP-951-SMP-4652': ['sarcoma', 'mesenchymal'],
    'HTAPP-975-SMP-4771': ['sarcoma', 'mesenchymal'],
}


In [29]:
from pathlib import Path

# Specify directory paths
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE140819')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE140819')

for sample in data_directory.iterdir():
    if sample.suffix == '.h5': # matches .h5 only, not .h5ad; eg: sample = GSM4186961_HTAPP-312-SMP-901_fresh-T1_channel1_raw_gene_bc_matrices_h5.h5
        # Get the gsm and patient id
        gsm = sample.stem.split("_")[0] # gsm = GSM4186961
        gsm_patient_id = sample.stem.split("_raw_")[0] # gsm_patient_id = GSM4186961_HTAPP-312-SMP-901_fresh-T1_channel1
        print(gsm_patient_id)
        patient_id = gsm_patient_id.split('_')[1] # patient_id = HTAPP-312-SMP-901
        meta_file = gsm + '_metadata_' + gsm_patient_id.split(gsm + '_')[1] + '.csv'
        meta_path = data_directory / meta_file
        
        meta = pd.read_csv(meta_path, sep=',', index_col=0)
        cells = meta.index.tolist() # ['HTAPP-312-SMP-901_fresh-T1_channel1-AAACCTGAGACGCAAC']
        # print(cells)
        cell_barcodes = [cell.split('channel1')[1][1:] + '-1' for cell in cells] # ['AAACCTGAGACGCAAC-1']
        # print(cell_barcodes)
        # Reset the index of the metadata to match the one in the anndata
        meta.reset_index(drop=True, inplace=True)
        meta.index = cell_barcodes

        adata = sc.read_10x_h5(sample)
        adata.var_names_make_unique()
        adata = adata[adata.obs.index.isin(cell_barcodes)].copy()
        print(adata.shape)

        adata.obs['cancer_type'] = cancer_tissue_dict[patient_id][0]
        adata.obs['dataset'] = 'GSE140819'
        adata.obs['tissue'] = cancer_tissue_dict[patient_id][1]
        adata.obs['cell_type_from_paper'] = meta['annotate']
        adata.obs['sample_barcode'] = gsm_patient_id
        adata.obs['uni_barcode'] = adata.obs['dataset'] + '_' + adata.obs.index.astype(str)
        adata.obs.set_index("uni_barcode", drop=False, inplace=True)
        print(adata)

        # save the anndata object
        sample_h5ad = gsm_patient_id + '_uni.h5ad'
        output_path = write_directory / sample_h5ad
        adata.write_h5ad(output_path, compression="gzip")
        

GSM4186961_HTAPP-312-SMP-901_fresh-T1_channel1


  utils.warn_names_duplicates("var")


(4369, 33694)
AnnData object with n_obs × n_vars = 4369 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM4186963_HTAPP-656-SMP-3481_fresh-T1_channel1


  utils.warn_names_duplicates("var")


(3449, 33694)
AnnData object with n_obs × n_vars = 3449 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM4186982_HTAPP-443-SMP-5491_CST_channel1


  utils.warn_names_duplicates("var")
  utils.warn_names_duplicates("var")


(3967, 33538)
AnnData object with n_obs × n_vars = 3967 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types', 'genome'
GSM4186994_HTAPP-951-SMP-4652_CST-V3_channel1


  utils.warn_names_duplicates("var")
  utils.warn_names_duplicates("var")


(7858, 33538)
AnnData object with n_obs × n_vars = 7858 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types', 'genome'
GSM4186993_HTAPP-951-SMP-4652_TST-V3_channel1


  utils.warn_names_duplicates("var")
  utils.warn_names_duplicates("var")


(4458, 33538)
AnnData object with n_obs × n_vars = 4458 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types', 'genome'
GSM4186992_HTAPP-951-SMP-4652_TST-V2_channel1


  utils.warn_names_duplicates("var")
  utils.warn_names_duplicates("var")


(3992, 33694)
AnnData object with n_obs × n_vars = 3992 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types', 'genome'
GSM4186967_HTAPP-244-SMP-451_NST_channel1


  utils.warn_names_duplicates("var")


(7531, 33694)
AnnData object with n_obs × n_vars = 7531 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM4186969_HTAPP-656-SMP-3481_TST_channel1


  utils.warn_names_duplicates("var")


(7810, 33694)
AnnData object with n_obs × n_vars = 7810 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM4186966_HTAPP-244-SMP-451_EZ_channel1


  utils.warn_names_duplicates("var")


(7896, 33694)
AnnData object with n_obs × n_vars = 7896 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM4186965_HTAPP-244-SMP-451_CST_channel1


  utils.warn_names_duplicates("var")


(6157, 33694)
AnnData object with n_obs × n_vars = 6157 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM4186968_HTAPP-244-SMP-451_TST_channel1


  utils.warn_names_duplicates("var")


(7415, 33694)
AnnData object with n_obs × n_vars = 7415 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1


  utils.warn_names_duplicates("var")
  utils.warn_names_duplicates("var")


(4317, 33538)
AnnData object with n_obs × n_vars = 4317 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types', 'genome'
GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1


  utils.warn_names_duplicates("var")


(786, 33694)
AnnData object with n_obs × n_vars = 786 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'


In [31]:
adata.var

,gene_ids
RP11-34P13.3,ENSG00000243485
FAM138A,ENSG00000237613
OR4F5,ENSG00000186092
RP11-34P13.7,ENSG00000238009
RP11-34P13.8,ENSG00000239945
...,...
AC233755.2,ENSG00000277856
AC233755.1,ENSG00000275063
AC240274.1,ENSG00000271254
AC213203.1,ENSG00000277475


In [30]:
adata.obs

uni_barcode,cancer_type,dataset,tissue,cell_type_from_paper,sample_barcode,uni_barcode
GSE140819_AAAGATGAGACATAAC-1,neuroblastoma,GSE140819,neuroendocrine,Neuroendocrine,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_AAAGATGAGACATAAC-1
GSE140819_AAAGATGTCTGGAGCC-1,neuroblastoma,GSE140819,neuroendocrine,Fibroblast,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_AAAGATGTCTGGAGCC-1
GSE140819_AAAGCAATCTAACCGA-1,neuroblastoma,GSE140819,neuroendocrine,T cell,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_AAAGCAATCTAACCGA-1
GSE140819_AAAGTAGAGCTAAACA-1,neuroblastoma,GSE140819,neuroendocrine,Neuroendocrine,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_AAAGTAGAGCTAAACA-1
GSE140819_AAAGTAGCAGGAATCG-1,neuroblastoma,GSE140819,neuroendocrine,Neuroendocrine,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_AAAGTAGCAGGAATCG-1
...,...,...,...,...,...,...
GSE140819_TTTGCGCAGTTCCACA-1,neuroblastoma,GSE140819,neuroendocrine,Neuroendocrine,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_TTTGCGCAGTTCCACA-1
GSE140819_TTTGCGCCACAGACAG-1,neuroblastoma,GSE140819,neuroendocrine,Neuroendocrine,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_TTTGCGCCACAGACAG-1
GSE140819_TTTGGTTTCATAAAGG-1,neuroblastoma,GSE140819,neuroendocrine,Fibroblast,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_TTTGGTTTCATAAAGG-1
GSE140819_TTTGTCAAGTTCCACA-1,neuroblastoma,GSE140819,neuroendocrine,Fibroblast,GSM4186962_HTAPP-312-SMP-902_fresh-C4-T2_channel1,GSE140819_TTTGTCAAGTTCCACA-1
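
The file-name handling inside the loop above can be isolated into a small helper, which makes the naming convention easier to test on its own. This is a sketch under the assumption that every raw file name follows the pattern seen in this notebook (`<GSM>_<sample description>_raw_*.h5`) and that the matching metadata file is named `<GSM>_metadata_<sample description>.csv`. `SampleInfo` and `parse_sample_file` are hypothetical names.

```python
from pathlib import Path
from typing import NamedTuple


class SampleInfo(NamedTuple):
    gsm: str             # e.g. GSM4186961
    patient_id: str      # e.g. HTAPP-312-SMP-901
    gsm_patient_id: str  # GSM accession plus full sample description
    meta_file: str       # expected name of the matching metadata CSV


def parse_sample_file(path) -> SampleInfo:
    """Derive the identifiers used in the loop from a raw h5 file name."""
    stem = Path(path).stem
    gsm_patient_id = stem.split("_raw_")[0]
    gsm, rest = gsm_patient_id.split("_", 1)
    patient_id = rest.split("_")[0]
    meta_file = f"{gsm}_metadata_{rest}.csv"
    return SampleInfo(gsm, patient_id, gsm_patient_id, meta_file)


info = parse_sample_file(
    "GSM4186961_HTAPP-312-SMP-901_fresh-T1_channel1_raw_gene_bc_matrices_h5.h5"
)
print(info.patient_id)  # HTAPP-312-SMP-901
print(info.meta_file)   # GSM4186961_metadata_HTAPP-312-SMP-901_fresh-T1_channel1.csv
```

Keeping the parsing separate from the I/O means a file that does not follow the pattern fails early, before any matrix is read.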


### 3. Confirmation of created AnnData objects

In [32]:
from pathlib import Path

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE140819')

# Loop through all files in the directory
for file in write_directory.iterdir():
    if '_uni.h5ad' in file.name:
        sample = anndata.read_h5ad(file)
        print(sample)

AnnData object with n_obs × n_vars = 7415 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3992 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types', 'genome'
AnnData object with n_obs × n_vars = 7896 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 786 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 7810 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3967 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'cell_type_from_paper', 

In [34]:
sample.var

,gene_ids,feature_types,genome
MIR1302-2HG,ENSG00000243485,Gene Expression,GRCh38-3.0.0_premrna
FAM138A,ENSG00000237613,Gene Expression,GRCh38-3.0.0_premrna
OR4F5,ENSG00000186092,Gene Expression,GRCh38-3.0.0_premrna
AL627309.1,ENSG00000238009,Gene Expression,GRCh38-3.0.0_premrna
AL627309.3,ENSG00000239945,Gene Expression,GRCh38-3.0.0_premrna
...,...,...,...
AC233755.2,ENSG00000277856,Gene Expression,GRCh38-3.0.0_premrna
AC233755.1,ENSG00000275063,Gene Expression,GRCh38-3.0.0_premrna
AC240274.1,ENSG00000271254,Gene Expression,GRCh38-3.0.0_premrna
AC213203.1,ENSG00000277475,Gene Expression,GRCh38-3.0.0_premrna


In [35]:
sample.obs

uni_barcode,cancer_type,dataset,tissue,cell_type_from_paper,sample_barcode,uni_barcode
GSE140819_AAACCCACATTGTGCA-1,sarcoma,GSE140819,mesenchymal,Endothelial cell,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_AAACCCACATTGTGCA-1
GSE140819_AAACCCAGTGTTCATG-1,sarcoma,GSE140819,mesenchymal,Macrophage 1,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_AAACCCAGTGTTCATG-1
GSE140819_AAACCCATCATTCGTT-1,sarcoma,GSE140819,mesenchymal,Endothelial cell,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_AAACCCATCATTCGTT-1
GSE140819_AAACCCATCCTGATAG-1,sarcoma,GSE140819,mesenchymal,Endothelial cell,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_AAACCCATCCTGATAG-1
GSE140819_AAACGAAAGCTTTCTT-1,sarcoma,GSE140819,mesenchymal,Chondrocyte,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_AAACGAAAGCTTTCTT-1
...,...,...,...,...,...,...
GSE140819_TTTGGTTCATGGGATG-1,sarcoma,GSE140819,mesenchymal,Endothelial cell,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_TTTGGTTCATGGGATG-1
GSE140819_TTTGTTGGTACCTTCC-1,sarcoma,GSE140819,mesenchymal,Endothelial cell,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_TTTGTTGGTACCTTCC-1
GSE140819_TTTGTTGGTCCACATA-1,sarcoma,GSE140819,mesenchymal,Chondrocyte,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_TTTGTTGGTCCACATA-1
GSE140819_TTTGTTGGTTTCGTAG-1,sarcoma,GSE140819,mesenchymal,Endothelial cell,GSM4186995_HTAPP-975-SMP-4771_TST-V3_channel1,GSE140819_TTTGTTGGTTTCGTAG-1


### 4. Convert AnnData objects to SingleCellExperiment objects

In [36]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE140819')

# Loop through all files in the directory
for file in write_directory.iterdir():
    sample_name = file.stem
    if "_uni.h5ad" in file.name:
        sample_anndata = anndata.read_h5ad(file)
        sample_sce_file = sample_name + ".rds"

        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)
        
        # Save the sce object in .rds file
        robjects.globalenv["sample_sce"] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))

ModuleNotFoundError: No module named 'anndata2ri'
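
The `ModuleNotFoundError` above means `anndata2ri` could not be imported in the environment that ran the cell. A quick pre-flight check, sketched below with the standard library only, can report missing Python dependencies before the conversion loop starts. `missing_modules` is a hypothetical helper; it only confirms that a module can be located, not that R or the required R packages are available.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for some broken or partially initialised modules.
            found = False
        if not found:
            missing.append(name)
    return missing


# Modules imported by the conversion cell above.
required = ["anndata2ri", "rpy2"]
absent = missing_modules(required)
if absent:
    print("Missing modules:", ", ".join(absent))
else:
    print("All conversion dependencies are importable.")
```

Running this before the loop turns a mid-run failure into an explicit list of what to install.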

In [31]:
print(sample_sce)

class: SingleCellExperiment 
dim: 33694 737280 
metadata(0):
assays(1): X
rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
rowData names(1): gene_ids
colnames(737280): GSE140819_AAACCTGAGAAACCAT-1
  GSE140819_AAACCTGAGAAACCGC-1 ... GSE140819_TTTGTCATCTTTAGTC-1
  GSE140819_TTTGTCATCTTTCCTC-1
colData names(5): cancer_type dataset tissue sample_barcode uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

