### 1. General information on dataset GSE152048

This Jupyter Notebook processes dataset GSE152048, which provides barcodes/features/matrix files for each sample.

We therefore only need to combine these barcodes/features/matrix files into one AnnData object per sample. Of the 11 samples, 6 are pediatric (BC2, BC3, BC11, BC16, BC20 and BC22).

In [2]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy

### 2. AnnData object of each sample

<span style="color:red">**IMPORTANT:**</span> Rename the downloaded files to remove their prefixes, so that each sample directory contains these three files:

1. `barcodes.tsv.gz`: cell barcodes, which go into `.obs`
2. `features.tsv.gz`: gene names, which go into `.var`
3. `matrix.mtx.gz`: the expression matrix, which goes into `.X`
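
The renaming step can also be scripted rather than done by hand. The following is a minimal sketch, assuming each downloaded file carries some prefix in front of the standard 10x name (for example a hypothetical `GSM0000000_BC2_barcodes.tsv.gz`). The helper `strip_prefixes` is illustrative and not part of scanpy.

```python
from pathlib import Path

# Gzipped 10x file names expected in each sample directory (as listed above).
STANDARD_NAMES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def strip_prefixes(sample_directory):
    """Rename '<prefix>barcodes.tsv.gz' to 'barcodes.tsv.gz', and so on.

    Hypothetical helper: assumes at most one file per standard name in
    the directory. Returns a list of (old_name, new_name) pairs.
    """
    sample_directory = Path(sample_directory)
    renamed = []
    for path in sorted(sample_directory.iterdir()):
        for standard in STANDARD_NAMES:
            if path.name.endswith(standard) and path.name != standard:
                target = path.with_name(standard)
                if target.exists():
                    # Refuse to overwrite an already-renamed file.
                    raise FileExistsError(f"{target} already exists")
                path.rename(target)
                renamed.append((path.name, standard))
                break
    return renamed
```

Calling `strip_prefixes(sample_directory)` once per sample directory before loading should leave only the three standard names. Recent scanpy versions also appear to accept a `prefix` argument in `sc.read_10x_mtx`, which may make renaming unnecessary; check the documentation of the installed version before relying on it.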

In [13]:
from pathlib import Path

# Specify directory paths
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE152048')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE152048')

sex_dict = {'BC2': 'male', 'BC3': 'female', 'BC11': 'male', 'BC16': 'male', 'BC20': 'male', 'BC22': 'male'}
age_dict = {'BC2': 11, 'BC3': 11, 'BC11': 12, 'BC16': 11, 'BC20': 9, 'BC22': 15}
tissue_dict = {'BC2': 'femur', 'BC3': 'tibia', 'BC11': 'femur', 'BC16': 'tibia', 'BC20': 'femur', 'BC22': 'femur'}

# Loop through all sample directories (one per sample, named BC2, BC3, ...)
for sample_directory in data_directory.iterdir():
    sample_name = sample_directory.stem
    sample_h5ad = 'GSE152048_' + sample_name + '_uni.h5ad'

    sample = sc.read_10x_mtx(
        sample_directory,
        var_names='gene_symbols',  
        cache=False
    )
    
    # Build the per-cell metadata table, indexed by the original cell barcodes
    obs_metrics = pd.DataFrame(index=sample.obs_names)
    obs_metrics['cancer_type'] = 'osteosarcoma'
    obs_metrics['dataset'] = 'GSE152048'
    obs_metrics['tissue'] = tissue_dict[sample_name]
    obs_metrics['sex'] = sex_dict[sample_name]
    obs_metrics['age'] = age_dict[sample_name]
    obs_metrics['sample_barcode'] = 'GSE152048_' + sample_name
    obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)
    
    sample.obs = obs_metrics
    sample.obs.set_index("uni_barcode", drop=False, inplace=True)
    print(sample)

    # save the anndata object
    output_path = write_directory / sample_h5ad
    sample.write_h5ad(output_path, compression="gzip")

AnnData object with n_obs × n_vars = 11096 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 5962 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 8812 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 10210 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 13444 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 9236 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
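
The loop above indexes `tissue_dict`, `sex_dict` and `age_dict` by directory name, so a stray entry in `data_directory` (a hidden file, or a folder whose name is not a dictionary key) would raise a `KeyError` partway through. A minimal sketch of a pre-check that could run before the loop; the helper name `check_sample_directories` is hypothetical.

```python
from pathlib import Path


def check_sample_directories(data_directory, *metadata_dicts):
    """Split directory entries into usable samples and problems.

    Hypothetical helper: an entry is usable when it is a directory and
    its name is a key in every metadata dictionary passed in.
    """
    usable, skipped = [], []
    for entry in sorted(Path(data_directory).iterdir()):
        known = all(entry.name in d for d in metadata_dicts)
        if entry.is_dir() and known:
            usable.append(entry.name)
        else:
            skipped.append(entry.name)
    # Samples that have metadata but no usable directory on disk.
    expected = set().union(*metadata_dicts) if metadata_dicts else set()
    missing = sorted(expected - set(usable))
    return usable, skipped, missing
```

For this notebook the call would be `check_sample_directories(data_directory, sex_dict, age_dict, tissue_dict)`; the loop is safe to run when `skipped` and `missing` both come back empty.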

In [14]:
sample.var

| | gene_ids | feature_types |
|---|---|---|
| MIR1302-2HG | ENSG00000243485 | Gene Expression |
| FAM138A | ENSG00000237613 | Gene Expression |
| OR4F5 | ENSG00000186092 | Gene Expression |
| AL627309.1 | ENSG00000238009 | Gene Expression |
| AL627309.3 | ENSG00000239945 | Gene Expression |
| ... | ... | ... |
| MT-ND4L | ENSG00000212907 | Gene Expression |
| MT-ND4 | ENSG00000198886 | Gene Expression |
| MT-ND5 | ENSG00000198786 | Gene Expression |
| MT-ND6 | ENSG00000198695 | Gene Expression |


In [15]:
sample.obs

| uni_barcode (index) | cancer_type | dataset | tissue | sex | age | sample_barcode | uni_barcode |
|---|---|---|---|---|---|---|---|
| GSE152048_AAACCCAAGCCGAATG-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_AAACCCAAGCCGAATG-1 |
| GSE152048_AAACCCACACCATATG-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_AAACCCACACCATATG-1 |
| GSE152048_AAACCCACAGGACTAG-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_AAACCCACAGGACTAG-1 |
| GSE152048_AAACCCACAGGTGACA-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_AAACCCACAGGTGACA-1 |
| GSE152048_AAACCCACATACCAGT-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_AAACCCACATACCAGT-1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| GSE152048_TTTGTTGGTATTTCCT-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_TTTGTTGGTATTTCCT-1 |
| GSE152048_TTTGTTGGTTGATCGT-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_TTTGTTGGTTGATCGT-1 |
| GSE152048_TTTGTTGTCCGAAATC-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_TTTGTTGTCCGAAATC-1 |
| GSE152048_TTTGTTGTCGTGGTAT-1 | osteosarcoma | GSE152048 | femur | male | 15 | GSE152048_BC22 | GSE152048_TTTGTTGTCGTGGTAT-1 |


### 3. Confirmation of created AnnData objects

In [16]:
from pathlib import Path

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE152048')

# Loop through all files in the directory
for file in write_directory.iterdir():
    if file.name.endswith("_uni.h5ad"):
        sample = anndata.read_h5ad(file)
        print(sample)

AnnData object with n_obs × n_vars = 13444 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 10210 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 9236 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 8812 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 5962 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
AnnData object with n_obs × n_vars = 11096 × 32864
    obs: 'cancer_type', 'dataset', 'tissue', 'sex', 'age', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
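
Printing each object confirms that it loads, but not that every expected sample was written. A minimal sketch of a complementary check, assuming the `GSE152048_<sample>_uni.h5ad` naming used when saving; `missing_h5ad_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_h5ad_files(write_directory, sample_names, dataset="GSE152048"):
    """Return expected '<dataset>_<sample>_uni.h5ad' names that are absent."""
    write_directory = Path(write_directory)
    expected = [f"{dataset}_{name}_uni.h5ad" for name in sample_names]
    return [name for name in expected if not (write_directory / name).is_file()]
```

With the six pediatric samples, `missing_h5ad_files(write_directory, ['BC2', 'BC3', 'BC11', 'BC16', 'BC20', 'BC22'])` should return an empty list once all files are in place.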

### 4. Convert AnnData objects to SingleCellExperiment objects

In [17]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE152048')

# Loop through all files in the directory
for file in write_directory.iterdir():
    sample_name = file.stem
    if file.name.endswith("_uni.h5ad"):
        sample_anndata = anndata.read_h5ad(file)
        sample_sce_file = sample_name + ".rds"

        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)
        
        # Save the sce object in .rds file
        robjects.globalenv["sample_sce"] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))

In [18]:
print(sample_sce)

class: SingleCellExperiment 
dim: 32864 11096 
metadata(0):
assays(1): X
rownames(32864): MIR1302-2HG FAM138A ... MT-ND6 MT-CYB
rowData names(2): gene_ids feature_types
colnames(11096): GSE152048_AAACCCAAGCTCACTA-1
  GSE152048_AAACCCAAGGCGCTTC-1 ... GSE152048_TTTGTTGTCTGAGAGG-1
  GSE152048_TTTGTTGTCTTGGTCC-1
colData names(7): cancer_type dataset ... sample_barcode uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
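
The `saveRDS` call above is built by inserting the output path into a single-quoted R string, which works for the paths used here but would break if a path contained a single quote or a backslash. A minimal sketch of a safer way to build the command string; both helper names are hypothetical, and the escaping covers only backslashes and single quotes.

```python
def r_string_literal(text):
    """Quote text as a single-quoted R string literal.

    Backslashes and single quotes are escaped; other characters pass
    through unchanged (an assumption that holds for ordinary file
    paths without control characters).
    """
    escaped = str(text).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def save_rds_command(variable_name, path):
    """Build the R call string that saves an object to an .rds file."""
    return "saveRDS({}, file={})".format(variable_name, r_string_literal(path))
```

The result can be passed to `robjects.r(...)` in place of the formatted string, for example `robjects.r(save_rds_command("sample_sce", sample_sce_path))`.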

