### 1. General info of dataset GSE132509

This Jupyter notebook processes dataset GSE132509. The dataset consists of one large cell-annotation TSV file covering all cells, plus a set of barcodes/genes/matrix files for each sample.

We therefore only need to read in each sample's barcodes/genes/matrix files and build one AnnData object per sample. There are 11 samples in total, in four groups:

<span style="color:green">**[ETV6-RUNX1]**</span> Pre-B t(12;21) acute lymphoblastic leukemia

<span style="color:green">**[HHD]**</span> Pre-B High hyper diploid acute lymphoblastic leukemia

<span style="color:green">**[PRE-T]**</span> Pre-T acute lymphoblastic leukemia

<span style="color:green">**[PBMMC]**</span> Healthy pediatric bone marrow mononuclear cells
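The four group tags above can be turned into a small lookup table. The following is a minimal sketch rather than the notebook's own code: `CONDITION_LABELS` and `label_for_sample` are hypothetical names, and it assumes each sample directory name contains exactly one of the tags.

```python
# Hypothetical lookup table built from the four group tags listed above.
CONDITION_LABELS = {
    "ETV6-RUNX1": "Pre-B t(12;21) acute lymphoblastic leukemia",
    "HHD": "Pre-B High hyper diploid acute lymphoblastic leukemia",
    "PRE-T": "Pre-T acute lymphoblastic leukemia",
    "PBMMC": "Healthy pediatric bone marrow mononuclear cells",
}


def label_for_sample(sample_name):
    """Return the description whose tag occurs in sample_name."""
    matches = [label for tag, label in CONDITION_LABELS.items() if tag in sample_name]
    if len(matches) != 1:
        # Fail loudly instead of guessing a group for an unexpected name
        raise ValueError(f"Cannot assign a unique group to sample {sample_name!r}")
    return matches[0]


print(label_for_sample("GSM3872440_PRE-T_1"))
```

Raising an error for an unknown name is a design choice: it keeps a mislabelled directory from silently ending up in the wrong group.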

In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy

### 2. AnnData object of each sample

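<span style="color:red">**IMPORTANT:**</span> before running this step, rename the files in each sample directory to remove their prefixes, so that every directory contains exactly these three files:

1. `barcodes.tsv`: cell barcodes, which become `.obs`
2. `genes.tsv`: gene names, which become `.var`
3. `matrix.mtx`: the expression matrix, which becomes `.X`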
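The renaming mentioned in the IMPORTANT note can be scripted. The sketch below is one possible approach and not part of the original workflow: it assumes the downloaded files sit in one folder and are named `<prefix>.barcodes.tsv`, `<prefix>.genes.tsv` and `<prefix>.matrix.mtx` (the exact prefix format is an assumption), and `organise_raw_files` is a hypothetical helper. The demo runs on empty placeholder files with a made-up prefix.

```python
import tempfile
from pathlib import Path

# The three file names listed above.
TENX_NAMES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def organise_raw_files(raw_dir):
    """Move '<prefix>.<10x name>' files to '<raw_dir>/<prefix>/<10x name>'."""
    raw_dir = Path(raw_dir)
    moved = []
    # sorted() materialises the listing before any file is moved
    for path in sorted(raw_dir.iterdir()):
        if not path.is_file():
            continue
        for name in TENX_NAMES:
            if path.name.endswith(name) and path.name != name:
                # Assumption: prefix and 10x name are joined by '.' or '_'
                prefix = path.name[: -len(name)].rstrip("._")
                if not prefix:
                    break
                sample_dir = raw_dir / prefix
                sample_dir.mkdir(exist_ok=True)
                target = sample_dir / name
                path.rename(target)
                moved.append(target)
                break
    return moved


# Demo on throwaway files with a made-up sample prefix
with tempfile.TemporaryDirectory() as tmp:
    for name in TENX_NAMES:
        (Path(tmp) / f"DEMO_SAMPLE_1.{name}").touch()
    moved = organise_raw_files(tmp)
    relative_paths = sorted(p.relative_to(tmp).as_posix() for p in moved)
    print(relative_paths)
```

Each resulting sub-directory then has the layout described in the list above and is named after the sample.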

In [13]:
from pathlib import Path

# Specify directory paths
data_directory = Path('/scratch/user/s4543064/Xiaohan_Summer_Research/data/GSE132509/GSE132509_RAW')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE132509')

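# Loop through all sample directories (skip any stray files)
for sample_directory in data_directory.iterdir():
    if not sample_directory.is_dir():
        continue
    sample_name = sample_directory.stem
    sample_h5ad = sample_name + '_uni.h5ad'

    # Read barcodes.tsv / genes.tsv / matrix.mtx into an AnnData object
    sample = sc.read_10x_mtx(
        sample_directory,
        var_names='gene_symbols',
        cache=False
    )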

    # Build the per-cell metadata table that will become sample.obs
    obs_metrics = pd.DataFrame(index=sample.obs_names)  # index = cell barcodes
    
    if "ETV6-RUNX1" in sample_name:
        cancer_type = "Pre-B t(12;21) acute lymphoblastic leukemia"
    elif "HHD" in sample_name:
        cancer_type = "Pre-B High hyper diploid acute lymphoblastic leukemia"
    elif "PRE-T" in sample_name:
        cancer_type = "Pre-T acute lymphoblastic leukemia"
    elif "PBMMC" in sample_name:
        cancer_type = "Healthy"
    else:
        # Avoid silently reusing the label from a previous iteration
        raise ValueError(f"Unrecognised sample name: {sample_name}")

    obs_metrics['cancer_type'] = cancer_type
    obs_metrics['dataset'] = 'GSE132509'
    obs_metrics['tissue'] = 'bone_marrow'
    obs_metrics['sample_barcode'] = sample_name
    obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)
    
    sample.obs = obs_metrics
    sample.obs.set_index("uni_barcode", drop=False, inplace=True)
    print(sample)

    # save the anndata object
    output_path = write_directory / sample_h5ad
    sample.write_h5ad(output_path, compression="gzip")

AnnData object with n_obs × n_vars = 1612 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3728 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2748 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2229 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3862 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5069 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 6274 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_

In [4]:
sample.obs

uni_barcode (index),cancer_type,dataset,tissue,sample_barcode,uni_barcode
GSE132509_AAACCTGAGAGCTGCA-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_AAACCTGAGAGCTGCA-1
GSE132509_AAACCTGAGTTATCGC-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_AAACCTGAGTTATCGC-1
GSE132509_AAACCTGCAAGTAGTA-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_AAACCTGCAAGTAGTA-1
GSE132509_AAACCTGGTTAGATGA-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_AAACCTGGTTAGATGA-1
GSE132509_AAACCTGTCCGCAAGC-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_AAACCTGTCCGCAAGC-1
...,...,...,...,...,...
GSE132509_TTTGGTTGTGACTACT-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_TTTGGTTGTGACTACT-1
GSE132509_TTTGGTTTCACGACTA-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_TTTGGTTTCACGACTA-1
GSE132509_TTTGTCACATGTTGAC-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_TTTGTCACATGTTGAC-1
GSE132509_TTTGTCAGTCATTAGC-1,Pre-T acute lymphoblastic leukemia,GSE132509,bone_marrow,GSM3872440_PRE-T_1,GSE132509_TTTGTCAGTCATTAGC-1
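
In the table above, `uni_barcode` is the dataset ID joined to the raw cell barcode. It is therefore unique within a sample, but the same value may occur in two samples if they happen to share a barcode sequence. If the per-sample objects are later combined, one option is to include the sample name in the identifier. The sketch below illustrates the idea in plain Python; `make_cell_id` is a hypothetical helper, and the barcode attributed to the second sample is made up to force a clash.

```python
from collections import Counter


def make_cell_id(dataset, sample, barcode):
    """Join dataset, sample and barcode into one identifier (hypothetical scheme)."""
    return f"{dataset}_{sample}_{barcode}"


# Two samples that share one barcode (illustrative values only)
cells = [
    ("GSM3872440_PRE-T_1", "AAACCTGAGAGCTGCA-1"),
    ("GSM3872440_PRE-T_1", "AAACCTGAGTTATCGC-1"),
    ("GSM3872441_PRE-T_2", "AAACCTGAGAGCTGCA-1"),
]

# Scheme used in this notebook: dataset + barcode
current_ids = ["GSE132509_" + barcode for _, barcode in cells]
# Alternative scheme: dataset + sample + barcode
sample_ids = [make_cell_id("GSE132509", sample, barcode) for sample, barcode in cells]

# Collect every identifier that occurs more than once
duplicates_current = [i for i, n in Counter(current_ids).items() if n > 1]
duplicates_sample = [i for i, n in Counter(sample_ids).items() if n > 1]
print(duplicates_current)
print(duplicates_sample)
```

Whether this matters depends on how the objects are used downstream; within a single sample the existing scheme is sufficient.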


### 3. Confirmation of created AnnData objects

In [14]:
from pathlib import Path

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE132509')
cancer_types = set()
sample_barcodes = set()

# Loop through the saved per-sample .h5ad files
for file in write_directory.iterdir():
    if file.name.endswith("_uni.h5ad"):
        sample = anndata.read_h5ad(file)
        cancer_types.update(set(sample.obs['cancer_type']))
        sample_barcodes.update(set(sample.obs['sample_barcode']))
    
print(cancer_types)
print(sample_barcodes)

{'Pre-T acute lymphoblastic leukemia', 'Healthy', 'Pre-B t(12;21) acute lymphoblastic leukemia', 'Pre-B High hyper diploid acute lymphoblastic leukemia'}
{'GSM3872439_HHD_2', 'GSM3872438_HHD_1', 'GSM3872435_ETV6-RUNX1_2', 'GSM3872441_PRE-T_2', 'GSM3872437_ETV6-RUNX1_4', 'GSM3872440_PRE-T_1', 'GSM3872442_PBMMC_1', 'GSM3872434_ETV6-RUNX1_1', 'GSM3872436_ETV6-RUNX1_3', 'GSM3872443_PBMMC_2', 'GSM3872444_PBMMC_3'}
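
As an additional check on the output above, the sample names can be grouped by their tag and compared with the expected total of 11. This is a minimal sketch, assuming the names follow the `GSM<id>_<tag>_<replicate>` pattern seen in the printed set; `group_tag` is a hypothetical helper.

```python
from collections import Counter

# Sample names copied from the printed set above
sample_barcodes = {
    'GSM3872439_HHD_2', 'GSM3872438_HHD_1', 'GSM3872435_ETV6-RUNX1_2',
    'GSM3872441_PRE-T_2', 'GSM3872437_ETV6-RUNX1_4', 'GSM3872440_PRE-T_1',
    'GSM3872442_PBMMC_1', 'GSM3872434_ETV6-RUNX1_1', 'GSM3872436_ETV6-RUNX1_3',
    'GSM3872443_PBMMC_2', 'GSM3872444_PBMMC_3',
}


def group_tag(sample_name):
    """Extract the middle part of 'GSM<id>_<tag>_<replicate>'."""
    return sample_name.split("_")[1]


counts = Counter(group_tag(name) for name in sample_barcodes)

# The notebook expects 11 samples in total
assert sum(counts.values()) == 11, "expected 11 samples"
print(dict(sorted(counts.items())))
# → {'ETV6-RUNX1': 4, 'HHD': 2, 'PBMMC': 3, 'PRE-T': 2}
```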


### 4. Convert AnnData objects to SingleCellExperiment objects

In [15]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE132509')

# Loop through all files in the directory
for file in write_directory.iterdir():
    sample_name = file.stem
    if "_uni.h5ad" in file.name:
        sample_anndata = anndata.read_h5ad(file)
        sample_sce_file = sample_name + ".rds"

        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)
        # print(sample_sce)
        
        # Save the sce object in .rds file
        robjects.globalenv["sample_sce"] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))

In [17]:
print(sample_sce)

class: SingleCellExperiment 
dim: 33694 3862 
metadata(0):
assays(1): X
rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
rowData names(1): gene_ids
colnames(3862): GSE132509_AAACCTGCACAGCCCA-1
  GSE132509_AAACCTGCAGTGAGTG-1 ... GSE132509_TTTGTCATCCGCGGTA-1
  GSE132509_TTTGTCATCCGGGTGT-1
colData names(5): cancer_type dataset tissue sample_barcode uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

