### 1. General info of dataset GSE189939

This Jupyter notebook processes dataset GSE189939. For each sample, the dataset provides one count CSV file and one metadata CSV file. As shown below, each row of the count file is a gene and each column is a cell.

AnnData expects the opposite orientation (cells as rows, genes as columns), so we need to transpose each count file when generating the AnnData object for a sample. In total, there are 4 ependymoma samples.

In [22]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy.sparse  # the sparse submodule is used explicitly when building the matrices

In [5]:
# inspect the dataset
count_path = '/scratch/user/s4543064/xiaohan-john-project/data/GSE189939/GSM5710226_GTE001_counts.csv.gz'
count = pd.read_csv(count_path, sep=',', index_col=0) # the first column contains gene names and is the index

print(count.head()) 
print(count.shape) # (18522 rows, 6437 columns)

               AAACCCAAGCATCAGG  AAACCCACAAGAGATT  AAACCCAGTACGCTTA  \
RP11-34P13.7                  0                 0                 0   
AL627309.1                    0                 0                 0   
AP006222.2                    0                 0                 0   
RP4-669L17.10                 0                 0                 0   
RP11-206L10.3                 0                 0                 0   

               AAACCCAGTTTGACAC  AAACGAAAGAGTTGCG  AAACGAAAGCGAGAAA  \
RP11-34P13.7                  0                 0                 0   
AL627309.1                    0                 0                 0   
AP006222.2                    0                 0                 1   
RP4-669L17.10                 0                 0                 0   
RP11-206L10.3                 0                 0                 0   

               AAACGAAAGCTGAAAT  AAACGAAAGTAAGAGG  AAACGAACACTGGACC  \
RP11-34P13.7                  0                 0                 0   
AL62

In [7]:
# check if the count value is integer or float
all_integer = all(count.dtypes == 'int64')
all_integer

True

Although genes and cells are transposed relative to the AnnData layout, every count value is an integer, which suggests that the matrix holds raw counts rather than normalized values.
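
The dtype check above passes only when every column was parsed as `int64`. As a hedged sketch, a value-based check also covers matrices stored as floats with whole-number entries and confirms that no value is negative. `looks_like_raw_counts` is a hypothetical helper name, and the toy frames stand in for `count`; they are not taken from GSE189939.

```python
import numpy as np
import pandas as pd


def looks_like_raw_counts(df: pd.DataFrame) -> bool:
    """Return True if every entry is a non-negative whole number."""
    values = df.to_numpy(dtype=float)
    # Raw counts are non-negative and have no fractional part.
    return bool(np.all(values >= 0) and np.all(values == np.floor(values)))


# Toy genes x cells frames (hypothetical values).
raw_like = pd.DataFrame(
    {"cell_a": [0, 3], "cell_b": [1.0, 0.0]}, index=["gene_1", "gene_2"]
)
normalized_like = pd.DataFrame({"cell_a": [0.0, 1.7]}, index=["gene_1", "gene_2"])

print(looks_like_raw_counts(raw_like))         # True
print(looks_like_raw_counts(normalized_like))  # False
```

A passing check is still only suggestive: integer values are consistent with raw counts but do not prove that no filtering or rounding was applied upstream.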

In [8]:
# inspect the metadata
metadata_path = '/scratch/user/s4543064/xiaohan-john-project/data/GSE189939/GSM5710226_GTE001_metadata.csv.gz'
metadata = pd.read_csv(metadata_path, sep=',', index_col=0) # the first column contains cell barcodes and is the index

print(metadata.head()) 
print(metadata.shape) 

                 orig.ident  nCount_RNA  nFeature_RNA Patient  percent.mt  \
AAACCCAAGCATCAGG     GTE001       12181          3573  GTE001    5.984730   
AAACCCACAAGAGATT     GTE001        8704          2818  GTE001    7.157629   
AAACCCAGTACGCTTA     GTE001        7852          2597  GTE001    7.666836   
AAACCCAGTTTGACAC     GTE001       12981          2829  GTE001    7.094985   
AAACGAAAGAGTTGCG     GTE001        9002          2753  GTE001    8.464786   

                   S.Score  G2M.Score Phase          old.ident  \
AAACCCAAGCATCAGG -0.092171  -0.041005    G1     Malignant_cell   
AAACCCACAAGAGATT  0.248358   0.033507     S     Malignant_cell   
AAACCCAGTACGCTTA  0.258024  -0.031224     S     Malignant_cell   
AAACCCAGTTTGACAC -0.137266  -0.114767    G1  Micro_Environment   
AAACGAAAGAGTTGCG -0.110837  -0.071267    G1     Malignant_cell   

                  seurat_clusters  ... Sig.Enr_Pericyte  Sig.Enr_RGC  \
AAACCCAAGCATCAGG                6  ...         1.012098     1.174491

In [9]:
metadata.columns

Index(['orig.ident', 'nCount_RNA', 'nFeature_RNA', 'Patient', 'percent.mt',
       'S.Score', 'G2M.Score', 'Phase', 'old.ident', 'seurat_clusters',
       'DF_hi.lo', 'RNA_snn_res.0.08', 'CNV_cluster', 'Step1_cell_type',
       'Step1_seurat_cluster', 'Step1_seurat_cluster_less', 'RNA_snn_res.0.8',
       'Sig.Enr_Neuron_H_E1', 'Sig.Enr_RGC_H_E1', 'Sig.Enr_OPC_H_E1',
       'Sig.Enr_EC_H_E1', 'Sig.Enr_Pericyte_H_E1', 'Sig.Enr_Microglia_H_E1',
       'Sig.Enr_Neutrophil_H_I1', 'Sig.Enr_TC_H_I1', 'Sig.Enr_BC_H_I1',
       'Sig.Enr_NKC_H_I1', 'Sig.Enr_DC_H_I1', 'Sig.Enr_Neuron_H_P1',
       'Sig.Enr_Astrocyte_H_P1', 'Sig.Enr_EC_H_P1',
       'Sig.Enr_Oligodendrocyte_H_P1', 'Sig.Enr_OPC_H_P1',
       'Sig.Enr_Microglia_H_P1', 'Sig.Enr_Microglia_M_E1', 'Sig.Enr_RGC_M_E1',
       'Sig.Enr_EC_M_E1', 'Sig.Enr_Ependymocyte_M_E1', 'Sig.Enr_Pericyte_M_E1',
       'Sig.Enr_OPC_M_E1', 'Sig.Enr_Neuron_M_P1', 'Sig.Enr_Microglia_M_P1',
       'Sig.Enr_Oligodendrocyte_M_P1', 'Sig.Enr_Astrocyte_M_P1',
 

In [25]:
# Clinical features of the patients, taken from Supplementary Table 1
age_dict = {'GTE001': 12, 'GTE002': 8, 'GTE009': 1, 'GTE012': 7}
sex_dict = {'GTE001': 'male', 'GTE002': 'male', 'GTE009': 'female', 'GTE012': 'female'}
recurrent_dict = {'GTE001': 'recurrent', 'GTE002': 'recurrent', 'GTE009': 'primary', 'GTE012': 'primary'}

tissue_dict = {'GTE001': 'subtentorial_fourth_ventricle', 
               'GTE002': 'right_frontal_temporal', 
               'GTE009': 'subtentorial_fourth_ventricle', 
               'GTE012': 'left_temporal_lobe'}
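
Each of these dictionaries is later looked up by patient code, so a quick consistency check can catch a missing or misspelled key early. A minimal sketch, repeating the dictionaries above so that it runs on its own; the names `clinical` and `patients` are illustrative and do not appear elsewhere in the notebook.

```python
age_dict = {"GTE001": 12, "GTE002": 8, "GTE009": 1, "GTE012": 7}
sex_dict = {"GTE001": "male", "GTE002": "male", "GTE009": "female", "GTE012": "female"}
recurrent_dict = {"GTE001": "recurrent", "GTE002": "recurrent",
                  "GTE009": "primary", "GTE012": "primary"}
tissue_dict = {"GTE001": "subtentorial_fourth_ventricle",
               "GTE002": "right_frontal_temporal",
               "GTE009": "subtentorial_fourth_ventricle",
               "GTE012": "left_temporal_lobe"}

clinical = {"age": age_dict, "sex": sex_dict,
            "recurrent": recurrent_dict, "tissue": tissue_dict}

# Every dictionary should cover exactly the same patient codes.
expected = set(age_dict)
for name, mapping in clinical.items():
    assert set(mapping) == expected, f"{name} has mismatched patient codes"

# One record per patient, convenient for later lookups.
patients = {
    code: {name: mapping[code] for name, mapping in clinical.items()}
    for code in sorted(expected)
}
print(patients["GTE009"])
```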

### 2. Overall AnnData object of the dataset

<span style="color:red">**IMPORTANT:**</span> transpose the DataFrame.values to match the AnnData.X

1. `DataFrame.columns`: cell barcodes, which go into `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the expression matrix (genes × cells), whose transpose goes into `.X`
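
The mapping can be illustrated on a toy table. This is only a sketch with hypothetical values: it builds the three pieces with pandas and SciPy and checks that their shapes agree, without constructing an AnnData object. For simplicity the toy `var` is indexed by gene name, which is an assumption of this example.

```python
import pandas as pd
from scipy import sparse

# Toy genes x cells table shaped like the count CSV (hypothetical values).
toy_counts = pd.DataFrame(
    [[0, 2, 0], [1, 0, 5]],
    index=["gene_1", "gene_2"],
    columns=["cell_a", "cell_b", "cell_c"],
)

# Transpose so rows are cells and columns are genes, then store sparsely.
X = sparse.csr_matrix(toy_counts.to_numpy().T)
obs = pd.DataFrame(index=toy_counts.columns)  # one row per cell
var = pd.DataFrame(index=toy_counts.index)    # one row per gene

# The matrix shape must equal (number of cells, number of genes).
assert X.shape == (len(obs), len(var))
print(X.shape)  # (3, 2)
print(X[1, 0])  # count of gene_1 in cell_b: 2
```

If the transpose were omitted, the shape check would fail as soon as the number of genes differs from the number of cells, which is the case for every sample in this dataset.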

In [31]:
from pathlib import Path

# Specify directory paths
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE189939')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE189939')

# Loop through all count files in the directory
for count_file in data_directory.iterdir():
    if 'counts.csv.gz' in count_file.name:
        # strip the trailing '_counts.csv' (11 characters) from the stem
        sample_name = count_file.stem[:-11]
        patient_code = sample_name.split('_')[1]
        sample_h5ad = sample_name + '_uni.h5ad'
        
        # load the count matrix (genes x cells)
        count_df = pd.read_csv(count_file, sep=',', index_col=0)

        # transpose to cells x genes and store as a sparse matrix
        matrix = scipy.sparse.csr_matrix(count_df.values.T)
        obs_name = pd.DataFrame(index=count_df.columns)
        # gene names are kept in a 'gene_symbols' column; .var itself gets a default integer index
        var_name = pd.DataFrame(count_df.index)
        var_name.columns = ['gene_symbols']

        sample = anndata.AnnData(X=matrix, obs=obs_name, var=var_name)

        # Build the per-cell annotation table (.obs) with sample-level features
        obs_metrics = pd.DataFrame(index=sample.obs_names)  # indexed by cell barcode
        obs_metrics['cancer_type'] = 'ependymoma'
        obs_metrics['dataset'] = 'GSE189939'
        obs_metrics['tissue'] = tissue_dict[patient_code]
        obs_metrics['age'] = age_dict[patient_code]
        obs_metrics['sex'] = sex_dict[patient_code]
        obs_metrics['recurrent'] = recurrent_dict[patient_code]
        obs_metrics['sample_barcode'] = sample_name
        obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)

        sample.obs = obs_metrics
        sample.obs.set_index("uni_barcode", drop=False, inplace=True)
        print(sample)

        # save the anndata object
        output_path = write_directory / sample_h5ad
        sample.write_h5ad(output_path, compression="gzip")



AnnData object with n_obs × n_vars = 8342 × 19276
    obs: 'cancer_type', 'dataset', 'tissue', 'age', 'sex', 'recurrent', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'




AnnData object with n_obs × n_vars = 9790 × 20373
    obs: 'cancer_type', 'dataset', 'tissue', 'age', 'sex', 'recurrent', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'




AnnData object with n_obs × n_vars = 10533 × 20314
    obs: 'cancer_type', 'dataset', 'tissue', 'age', 'sex', 'recurrent', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'




AnnData object with n_obs × n_vars = 6437 × 18522
    obs: 'cancer_type', 'dataset', 'tissue', 'age', 'sex', 'recurrent', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'
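
The loop above derives `sample_name` by slicing a fixed 11 characters off the file stem, which silently gives a wrong name if a file follows a different pattern. As a hedged alternative, the suffix can be removed explicitly; `parse_count_filename` and `COUNT_SUFFIX` are hypothetical names introduced for this sketch.

```python
from pathlib import Path

COUNT_SUFFIX = "_counts.csv.gz"  # suffix of the count files in this dataset


def parse_count_filename(path: Path) -> tuple[str, str]:
    """Split '<GSM>_<patient>_counts.csv.gz' into (sample_name, patient_code)."""
    if not path.name.endswith(COUNT_SUFFIX):
        raise ValueError(f"not a count file: {path.name}")
    sample_name = path.name[: -len(COUNT_SUFFIX)]
    patient_code = sample_name.split("_")[1]
    return sample_name, patient_code


print(parse_count_filename(Path("GSM5710226_GTE001_counts.csv.gz")))
# ('GSM5710226_GTE001', 'GTE001')
```

Raising on an unexpected name makes a misnamed file fail loudly instead of producing a truncated sample name.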


In [32]:
print(sample.obs)

                           cancer_type    dataset  \
uni_barcode                                         
GSE189939_AAACCCAAGCATCAGG  ependymoma  GSE189939   
GSE189939_AAACCCACAAGAGATT  ependymoma  GSE189939   
GSE189939_AAACCCAGTACGCTTA  ependymoma  GSE189939   
GSE189939_AAACCCAGTTTGACAC  ependymoma  GSE189939   
GSE189939_AAACGAAAGAGTTGCG  ependymoma  GSE189939   
...                                ...        ...   
GSE189939_TTTGTTGGTTTCTTAC  ependymoma  GSE189939   
GSE189939_TTTGTTGTCACGAACT  ependymoma  GSE189939   
GSE189939_TTTGTTGTCACGTCCT  ependymoma  GSE189939   
GSE189939_TTTGTTGTCGCTTGCT  ependymoma  GSE189939   
GSE189939_TTTGTTGTCGTTCATT  ependymoma  GSE189939   

                                                   tissue  age   sex  \
uni_barcode                                                            
GSE189939_AAACCCAAGCATCAGG  subtentorial_fourth_ventricle   12  male   
GSE189939_AAACCCACAAGAGATT  subtentorial_fourth_ventricle   12  male   
GSE189939_AAACCCAGTACG

In [33]:
sample.obs

| uni_barcode (index) | cancer_type | dataset | tissue | age | sex | recurrent | sample_barcode | uni_barcode |
|---|---|---|---|---|---|---|---|---|
| GSE189939_AAACCCAAGCATCAGG | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_AAACCCAAGCATCAGG |
| GSE189939_AAACCCACAAGAGATT | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_AAACCCACAAGAGATT |
| GSE189939_AAACCCAGTACGCTTA | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_AAACCCAGTACGCTTA |
| GSE189939_AAACCCAGTTTGACAC | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_AAACCCAGTTTGACAC |
| GSE189939_AAACGAAAGAGTTGCG | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_AAACGAAAGAGTTGCG |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| GSE189939_TTTGTTGGTTTCTTAC | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_TTTGTTGGTTTCTTAC |
| GSE189939_TTTGTTGTCACGAACT | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_TTTGTTGTCACGAACT |
| GSE189939_TTTGTTGTCACGTCCT | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_TTTGTTGTCACGTCCT |
| GSE189939_TTTGTTGTCGCTTGCT | ependymoma | GSE189939 | subtentorial_fourth_ventricle | 12 | male | recurrent | GSM5710226_GTE001 | GSE189939_TTTGTTGTCGCTTGCT |


### 3. Confirmation of created AnnData object

In [35]:
from pathlib import Path

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE189939')
tissue_types = set()
ages = set()

# Loop through all files in the directory
for file in write_directory.iterdir():
    if "_uni.h5ad" in file.name:
        sample = anndata.read_h5ad(file)
        print(sample)
        tissue_types.update(set(sample.obs['tissue']))
        ages.update(set(sample.obs['age']))
    
print(tissue_types)
print(ages)

AnnData object with n_obs × n_vars = 6437 × 18522
    obs: 'cancer_type', 'dataset', 'tissue', 'age', 'sex', 'recurrent', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 10533 × 20314
    obs: 'cancer_type', 'dataset', 'tissue', 'age', 'sex', 'recurrent', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 8342 × 19276
    obs: 'cancer_type', 'dataset', 'tissue', 'age', 'sex', 'recurrent', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 9790 × 20373
    obs: 'cancer_type', 'dataset', 'tissue', 'age', 'sex', 'recurrent', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'
{'subtentorial_fourth_ventricle', 'left_temporal_lobe', 'right_frontal_temporal'}
{8, 1, 12, 7}
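
One further check may be worth adding here. `uni_barcode` combines only the dataset ID and the cell barcode, so the same barcode sequence occurring in two samples would yield identical identifiers once the samples are concatenated. Whether this actually happens in GSE189939 is not verified in this notebook. The sketch below uses hypothetical barcodes and sample names to show the effect and one possible remedy, prefixing with the sample name instead.

```python
import pandas as pd

# Hypothetical barcodes from two samples; "AAAC-1" occurs in both.
barcodes_by_sample = {
    "GSM_A": ["AAAC-1", "CCCG-1"],
    "GSM_B": ["AAAC-1", "TTTG-1"],
}

# Identifier scheme used above: dataset ID + barcode.
dataset_only = pd.Index(
    [f"GSE189939_{bc}" for bcs in barcodes_by_sample.values() for bc in bcs]
)
# Alternative scheme: sample name + barcode.
with_sample = pd.Index(
    [f"{sample}_{bc}" for sample, bcs in barcodes_by_sample.items() for bc in bcs]
)

print(dataset_only.is_unique)  # False: the shared barcode collides
print(with_sample.is_unique)   # True
```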


### 4. Convert AnnData objects to SingleCellExperiment objects

<span style="color:red">**PROBLEM:**</span> `Unknown dtype dtype('int64') cannot be converted to ?gRMatrix.`

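The error is raised during the conversion with `anndata2ri`, which cannot turn an `int64` sparse matrix into the corresponding R sparse matrix class. Thus, we need to cast the count matrix from int64 to float32 before converting.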
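
Casting counts to float32 is lossless as long as every value is at most 2**24, because float32 represents all integers up to that bound exactly. A minimal sketch on a toy matrix with hypothetical values, which stands in for `sample_anndata.X`:

```python
import numpy as np
from scipy import sparse

# Toy cells x genes count matrix with integer dtype (hypothetical values).
X_int = sparse.csr_matrix(np.array([[0, 3, 0], [12, 0, 1]], dtype=np.int64))

# float32 represents every integer up to 2**24 exactly, so check the maximum first.
assert X_int.max() <= 2**24

X_float = X_int.astype(np.float32)

print(X_float.dtype)                                       # float32
print(np.array_equal(X_float.toarray(), X_int.toarray()))  # True: no value changed
```

The dense comparison is only practical for a toy matrix; on real data, checking the maximum count before the cast is the cheaper safeguard.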


In [39]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE189939')

# Loop through all files in the directory
for file in write_directory.iterdir():
    sample_name = file.stem
    if "_uni.h5ad" in file.name:
        sample_anndata = anndata.read_h5ad(file)
        sample_anndata.X = sample_anndata.X.astype('float32')
        sample_sce_file = sample_name + ".rds"

        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)
        print(sample_sce)
        
        # Save the sce object in .rds file
        robjects.globalenv["sample_sce"] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))

class: SingleCellExperiment 
dim: 18522 6437 
metadata(0):
assays(1): X
rownames(18522): 0 1 ... 18520 18521
rowData names(1): gene_symbols
colnames(6437): GSE189939_AAACCCAAGCATCAGG GSE189939_AAACCCACAAGAGATT
  ... GSE189939_TTTGTTGTCGCTTGCT GSE189939_TTTGTTGTCGTTCATT
colData names(8): cancer_type dataset ... sample_barcode uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

class: SingleCellExperiment 
dim: 20314 10533 
metadata(0):
assays(1): X
rownames(20314): 0 1 ... 20312 20313
rowData names(1): gene_symbols
colnames(10533): GSE189939_AAACCCAAGAAGCGCT GSE189939_AAACCCAAGGATTTAG
  ... GSE189939_TTTGTTGTCGCAAGAG GSE189939_TTTGTTGTCGTGCGAC
colData names(8): cancer_type dataset ... sample_barcode uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

class: SingleCellExperiment 
dim: 19276 8342 
metadata(0):
assays(1): X
rownames(19276): 0 1 ... 19274 19275
rowData names(1): gene_symbols
colnames(8342): GSE189939_AAACCCAAGACTCATC GSE189939_AAACCCAAGCACGTCC
