### 1. General information about dataset GSE192906

This Jupyter notebook processes GEO dataset GSE192906. The dataset provides one tab-separated txt count matrix file per sample. As seen below, each row of a txt file is a gene and each column is a cell.

AnnData stores cells as rows and genes as columns, so each txt file must be transposed when building the AnnData object for its sample. In total, there are 10 peripheral neuroblastic tumor samples.

NB: neuroblastoma

GNB: ganglioneuroblastoma

GN: ganglioneuroma

In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy.sparse  # import the submodule explicitly; csr_matrix is used later

In [2]:
import os
os.getcwd()

'/scratch/user/s4543064/xiaohan-john-project'

In [4]:
# Inspect the count matrix of the first sample
path = '/scratch/user/s4543064/xiaohan-john-project/data/GSE192906/GSM5768743_NB1_UMI_COUNTS_RAW.txt.gz'
counts = pd.read_csv(path, sep='\t', index_col=0)  # the first column holds gene names and becomes the index

print(counts.head())
print(counts.shape)  # (33514, 960): 33514 genes x 960 cells

             A1_1_0316_AACGAGGT  A1_1_0316_AAGCACAT  A1_1_0316_ACACCGTG  \
MIR1302-2HG                   0                   0                   0   
FAM138A                       0                   0                   0   
OR4F5                         0                   0                   0   
AL627309.1                    0                   0                   0   
AL627309.3                    0                   0                   0   

             A1_1_0316_ACCTCAGC  A1_1_0316_ACTGTTTG  A1_1_0316_AGCTCCTT  \
MIR1302-2HG                   0                   0                   0   
FAM138A                       0                   0                   0   
OR4F5                         0                   0                   0   
AL627309.1                    0                   0                   0   
AL627309.3                    0                   0                   0   

             A1_1_0316_ATTTAGCG  A1_1_0316_CACACTGA  A1_1_0316_CACAGCAT  \
MIR1302-2HG            

### 2. Overall AnnData object of the dataset

<span style="color:red">**IMPORTANT:**</span> transpose `DataFrame.values` so that it matches the cells × genes layout of `AnnData.X`.

1. `DataFrame.columns`: cell barcodes, which go into `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the expression matrix (genes × cells), whose transpose goes into `.X`
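
The orientation rule above can be checked on a toy matrix. The following is a minimal sketch, assuming a made-up table of 3 genes by 2 cells (the gene and cell names are placeholders, not taken from the dataset). It uses only pandas and SciPy and stops short of constructing the AnnData object itself.

```python
import pandas as pd
import scipy.sparse as sp

# Toy count table in the same orientation as the txt files:
# rows are genes, columns are cells (all names here are illustrative).
counts = pd.DataFrame(
    [[0, 3], [1, 0], [0, 0]],
    index=["geneA", "geneB", "geneC"],
    columns=["cell1", "cell2"],
)

# AnnData.X is cells x genes, so the values must be transposed.
X = sp.csr_matrix(counts.values.T)
obs = pd.DataFrame(index=counts.columns)  # one row per cell
var = pd.DataFrame(index=counts.index)    # one row per gene

# The matrix shape must agree with the lengths of obs and var.
assert X.shape == (len(obs), len(var))
print(X.shape)
```

If the transpose were omitted, the shape check would fail because the matrix would have one row per gene while `obs` has one row per cell.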

In [5]:
# Patient metadata from the paper's Table S1: patient ID -> [age, sex, tissue]
age_sex_tissue = {
    'NB1': [0.17, 'female', 'adrenal_ gland'],
    'NB2': [5.08, 'male', 'adrenal_ gland'],
    'NB3': [3.33, 'male', 'retroperitoneum'],
    'NB4': [0.33, 'female', 'posterior_mediastinum'],
    'NB5': [1.58, 'male', 'mediastinum'],
    'GN1': [6.67, 'female', 'adrenal_ gland'],
    'GNB1': [2.33, 'female', 'posterior_mediastinum'],
    'GNB2': [2.42, 'male', 'adrenal_ gland'],
    'GNB3': [2.83, 'male', 'mediastinum'],
    'GNB4': [3, 'male', 'adrenal_ gland'],
}

In [17]:
from pathlib import Path

# Specify directory paths
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE192906')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE192906')

# Loop through all files in the data directory (one count matrix per sample)
for file in data_directory.iterdir():
    # File names look like GSM5768743_NB1_UMI_COUNTS_RAW.txt.gz
    gsm, patient_id = file.stem.split('_')[:2]
    sample_name = gsm + '_' + patient_id
    sample_h5ad = sample_name + '_uni.h5ad'

    counts = pd.read_csv(file, sep='\t', index_col=0)  # genes x cells

    # Transpose so that rows are cells and columns are genes
    matrix = scipy.sparse.csr_matrix(counts.values.T)
    obs_name = pd.DataFrame(index=counts.columns)
    var_name = pd.DataFrame(index=counts.index)
    var_name.rename_axis('gene_symbols', inplace=True)

    sample = anndata.AnnData(X=matrix, obs=obs_name, var=var_name)

    # Build the per-cell metadata table, indexed by the original cell barcodes
    obs_metrics = pd.DataFrame(index=sample.obs_names)

    if 'GNB' in patient_id:
        obs_metrics['cancer_type'] = 'ganglioneuroblastoma'
    elif 'NB' in patient_id:
        obs_metrics['cancer_type'] = 'neuroblastoma'
    else: # the GN patient
        obs_metrics['cancer_type'] = 'ganglioneuroma' 

    obs_metrics['dataset'] = 'GSE192906'
    obs_metrics['age'] = age_sex_tissue[patient_id][0]
    obs_metrics['sex'] = age_sex_tissue[patient_id][1]
    obs_metrics['tissue'] = age_sex_tissue[patient_id][2]
    obs_metrics['sample_barcode'] = sample_name
    obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)

    sample.obs = obs_metrics
    sample.obs.set_index("uni_barcode", drop=False, inplace=True)
    print(sample)

    # save the anndata object
    output_path = write_directory / sample_h5ad
    sample.write_h5ad(output_path, compression="gzip")

AnnData object with n_obs × n_vars = 639 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 1052 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 768 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 445 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 1053 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 360 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 740 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object wit
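
The file-name parsing and diagnosis labelling used in the loop can be isolated into two small helpers. This is a sketch only: `parse_sample` and `cancer_type` are hypothetical names introduced here, and the sketch uses `startswith` rather than the `in` checks of the notebook. With `in`, `'NB'` also matches `'GNB1'`, which is why the notebook tests `'GNB'` first.

```python
from pathlib import Path


def parse_sample(file):
    """Return (GSM accession, patient id) from a count-matrix file name."""
    # Path.stem strips only the last suffix ('.gz'); the leftover '.txt'
    # sits in the final field, so the first two fields are unaffected.
    gsm, patient_id = file.stem.split("_")[:2]
    return gsm, patient_id


def cancer_type(patient_id):
    """Map a patient id prefix to a diagnosis label."""
    # 'GNB' must be tested before 'GN' because every 'GNB' id also
    # starts with 'GN'.
    if patient_id.startswith("GNB"):
        return "ganglioneuroblastoma"
    if patient_id.startswith("NB"):
        return "neuroblastoma"
    if patient_id.startswith("GN"):
        return "ganglioneuroma"
    raise ValueError(f"unrecognised patient id: {patient_id}")


gsm, patient = parse_sample(Path("GSM5768743_NB1_UMI_COUNTS_RAW.txt.gz"))
print(gsm, patient, cancer_type(patient))
```

Raising on an unknown prefix, instead of falling through to a default label, makes an unexpected file in the data directory fail loudly.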

In [18]:
sample.var

MIR1302-2HG
FAM138A
OR4F5
AL627309.1
AL627309.3
...
AC233755.2
AC233755.1
AC240274.1
AC213203.1
FAM231C


In [19]:
sample.obs

uni_barcode (index),cancer_type,dataset,age,sex,tissue,sample_barcode,uni_barcode
GSE192906_A1_1_0904_AACGAGGT,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_A1_1_0904_AACGAGGT
GSE192906_A1_1_0904_AAGCACAT,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_A1_1_0904_AAGCACAT
GSE192906_A1_1_0904_ACACCGTG,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_A1_1_0904_ACACCGTG
GSE192906_A1_1_0904_ACCTCAGC,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_A1_1_0904_ACCTCAGC
GSE192906_A1_1_0904_ACTGTTTG,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_A1_1_0904_ACTGTTTG
...,...,...,...,...,...,...,...
GSE192906_F2_2_0904_TCTCACAC,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_F2_2_0904_TCTCACAC
GSE192906_F2_2_0904_TGGAGCTC,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_F2_2_0904_TGGAGCTC
GSE192906_F2_2_0904_TGTACCAA,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_F2_2_0904_TGTACCAA
GSE192906_F2_2_0904_TTACGGGT,ganglioneuroma,GSE192906,6.67,female,adrenal_ gland,GSM5768752_GN1,GSE192906_F2_2_0904_TTACGGGT


### 3. Confirmation of created AnnData object

In [20]:
from pathlib import Path

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE192906')

# Loop through the saved per-sample .h5ad files
for file in write_directory.iterdir():
    if file.name.endswith('_uni.h5ad'):
        sample = anndata.read_h5ad(file)
        print(sample)

AnnData object with n_obs × n_vars = 551 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 740 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 1053 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 768 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 357 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 445 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with n_obs × n_vars = 960 × 33514
    obs: 'cancer_type', 'dataset', 'age', 'sex', 'tissue', 'sample_barcode', 'uni_barcode'
AnnData object with
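
Beyond printing each object, one further check that may be useful is whether the `uni_barcode` values stay unique once all samples are considered together. The sketch below is an assumption about how such a check could look, not part of the original workflow. `duplicated_barcodes` is a hypothetical helper, and the example barcodes are invented placeholders with a deliberate duplicate.

```python
from collections import Counter


def duplicated_barcodes(barcodes_per_sample):
    """Return the barcodes that occur more than once across all samples."""
    counts = Counter(
        barcode
        for barcodes in barcodes_per_sample.values()
        for barcode in barcodes
    )
    return sorted(b for b, n in counts.items() if n > 1)


# Invented input: sample name -> list of barcodes. In the notebook this
# mapping could be filled from sample.obs_names inside the loop above.
example = {
    "sample_1": ["DATASET_w1_AAAA", "DATASET_w1_CCCC"],
    "sample_2": ["DATASET_w2_AAAA", "DATASET_w1_AAAA"],  # contrived duplicate
}
print(duplicated_barcodes(example))
```

An empty list would indicate that the barcodes can serve as unique identifiers if the samples are later concatenated.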

In [21]:
sample.var

MIR1302-2HG
FAM138A
OR4F5
AL627309.1
AL627309.3
...
AC233755.2
AC233755.1
AC240274.1
AC213203.1
FAM231C


In [22]:
sample.obs

uni_barcode (index),cancer_type,dataset,age,sex,tissue,sample_barcode,uni_barcode
GSE192906_K1_1_0316_AACGAGGT,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_K1_1_0316_AACGAGGT
GSE192906_K1_1_0316_AAGCACAT,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_K1_1_0316_AAGCACAT
GSE192906_K1_1_0316_ACACCGTG,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_K1_1_0316_ACACCGTG
GSE192906_K1_1_0316_ACCTCAGC,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_K1_1_0316_ACCTCAGC
GSE192906_K1_1_0316_ACTGTTTG,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_K1_1_0316_ACTGTTTG
...,...,...,...,...,...,...,...
GSE192906_U2_2_0316_TCTCACAC,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_U2_2_0316_TCTCACAC
GSE192906_U2_2_0316_TGGAGCTC,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_U2_2_0316_TGGAGCTC
GSE192906_U2_2_0316_TGTACCAA,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_U2_2_0316_TGTACCAA
GSE192906_U2_2_0316_TTACGGGT,ganglioneuroblastoma,GSE192906,2.42,male,adrenal_ gland,GSM5768749_GNB2,GSE192906_U2_2_0316_TTACGGGT


### 4. Convert AnnData objects to SingleCellExperiment objects

In [24]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE192906')

# Loop through the saved per-sample .h5ad files
for file in write_directory.iterdir():
    if file.name.endswith('_uni.h5ad'):
        sample_name = file.stem
        sample_anndata = anndata.read_h5ad(file)
        sample_anndata.X = sample_anndata.X.astype('float32')  # cast the counts to float32 before conversion
        sample_sce_file = sample_name + '.rds'

        # Convert the AnnData object to an R SingleCellExperiment object
        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)

        # Save the SCE object to an .rds file
        robjects.globalenv['sample_sce'] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))
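
The output name in the loop above is built from `file.stem` plus `'.rds'`. As a minimal sketch of the same mapping, `pathlib` can swap the suffix directly. `rds_path_for` is a hypothetical helper, and the relative path below is for illustration only.

```python
from pathlib import Path


def rds_path_for(h5ad_file):
    """Return the sibling .rds path for a '*_uni.h5ad' file."""
    return h5ad_file.with_suffix(".rds")


# Illustrative path; the file name follows the gsm + '_' + patient_id +
# '_uni.h5ad' pattern used when the AnnData objects were written.
example = Path("write/GSE192906/GSM5768743_NB1_uni.h5ad")
print(rds_path_for(example).name)
```

Because the resulting `.rds` files end in `_uni.rds`, they are not picked up again by the `_uni.h5ad` check when the loop is rerun.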