### 1. General info of dataset GSE198896

This is the Jupyter Notebook for dataset GSE198896. The dataset provides one .tar.gz archive per sample; each archive contains a sample directory holding the barcodes and features (.tsv) files and the matrix (.mtx) file.

Thus, we first need to gzip these barcodes/features/matrix files and then incorporate them into an AnnData object for each sample. In total, there are 18 samples.
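
As a minimal sketch of the unpacking step, the helper below extracts every per-sample archive into a data directory using only the standard library. The function name `extract_sample_archives` and the directory arguments are hypothetical, and the sketch assumes each archive unpacks into its own sample directory, as described above.

```python
import tarfile
from pathlib import Path


def extract_sample_archives(archive_dir: Path, out_dir: Path) -> list:
    """Extract every *.tar.gz in archive_dir into out_dir; return the archives processed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = sorted(archive_dir.glob("*.tar.gz"))
    for archive in archives:
        # Each archive is assumed to contain a single sample directory
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extractall(out_dir)
    return archives
```

Calling `extract_sample_archives(Path("downloads"), Path("data/GSE198896"))` (both paths are placeholders) would leave one directory per sample under the output directory, ready for the loading loop in section 2.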

<span style="color:green">**[PBMC]**</span> Peripheral blood mononuclear cell

<span style="color:green">**[TIL]**</span> Tumor-Infiltrating Lymphocyte

<span style="color:green">**[CITE-SEQ]**</span> Cellular indexing of transcriptomes and epitopes by sequencing; quantifies cell-surface proteins and the transcriptome simultaneously in a single-cell readout

In [6]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy

### 2. AnnData object of each sample

<strike><span style="color:red">**IMPORTANT:**</span> gzip all the files</strike> --> see Problem 1 below

1. `barcodes.tsv.gz`: cell barcodes, which go into `.obs`
2. `features.tsv.gz`: gene names, `.var`
3. `matrix.mtx.gz`: the expression matrix, `.X`
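
Before loading, it can help to confirm that the three files agree with each other. The sketch below relies on the 10x convention that `matrix.mtx.gz` stores features as rows and barcodes as columns, so its size line should match the line counts of the other two files. The helper names (`count_lines`, `mtx_shape`, `check_sample`) are hypothetical and only the standard library is used.

```python
import gzip
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count non-empty lines in a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.strip())


def mtx_shape(path: Path) -> tuple:
    """Return (n_rows, n_cols) from the size line of a gzipped Matrix Market file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip the banner, comment lines and blank lines
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols, _nnz = line.split()
            return int(n_rows), int(n_cols)
    raise ValueError(f"No size line found in {path}")


def check_sample(sample_dir: Path) -> tuple:
    """Verify that the three files agree; return (n_cells, n_features)."""
    n_cells = count_lines(sample_dir / "barcodes.tsv.gz")
    n_features = count_lines(sample_dir / "features.tsv.gz")
    n_rows, n_cols = mtx_shape(sample_dir / "matrix.mtx.gz")
    if (n_rows, n_cols) != (n_features, n_cells):
        raise ValueError(
            f"{sample_dir.name}: matrix is {n_rows} x {n_cols}, "
            f"expected {n_features} features x {n_cells} barcodes"
        )
    return n_cells, n_features
```

The returned pair corresponds to the `n_obs × n_vars` shape that the AnnData object should report once the matrix has been transposed to cells × genes.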

In [7]:
# from pathlib import Path

# # Specify directory paths
# data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE198896')
# write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE198896')

# # Loop through all files in the directory
# for sample_directory in data_directory.iterdir():
#     sample_name = sample_directory.stem
#     sample_h5ad = sample_name + '_uni.h5ad'

#     try:
#         sample = sc.read_10x_mtx(
#         sample_directory,
#         var_names='gene_symbols',  
#         cache=False
#         )

#         # Create an observation metric info to store related features
#         obs_metrics = pd.DataFrame(index=sample.obs_names) ## Get the identifiers
#         if "Ewing" in sample_name:
#             obs_metrics['cancer_type'] = 'ewing_sarcoma'
#         else:
#             obs_metrics['cancer_type'] = 'osteosarcoma'
        
#         obs_metrics['dataset'] = 'GSE198896'
        
#         if "PBMC" in sample_name:
#             obs_metrics['tissue'] = 'pbmc'
#         elif "TIL" in sample_name:
#             obs_metrics['tissue'] = 'sarcoma_tumour'
        
#         obs_metrics['sample_barcode'] = sample_name
#         obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)
        
#         sample.obs = obs_metrics
#         sample.obs.set_index("uni_barcode", drop=False, inplace=True)
#         print(sample)

#         # save the anndata object
#         output_path = write_directory / sample_h5ad
#         sample.write_h5ad(output_path, compression="gzip")
#     except:
#         print(sample_name)

AnnData object with n_obs × n_vars = 3854 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
GSM5959134_Osteosarcoma_2_TIL
GSM5959144_Ewings_3_CITEseq
GSM5959136_Osteosarcoma_3_TIL
GSM5959138_Osteosarcoma_4_TIL
GSM5959133_Osteosarcoma_2_PBMC
GSM5959145_Ewings_4_CITEseq
AnnData object with n_obs × n_vars = 3390 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
GSM5959135_Osteosarcoma_3_PBMC
GSM5959146_Healthy_donor_CITEseq
AnnData object with n_obs × n_vars = 10843 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
GSM5959129_Ewing_1_PBMC
GSM5959130_Ewing_1_TIL
AnnData object with n_obs × n_vars = 4296 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
GSM5959132_Osteosarcoma_1_TIL
GSM5959137_Osteosarcoma_4

<span style="color:red">**PROBLEM 1:**</span> The following samples have incomplete features.tsv.gz files <br>
`GSM5959134_Osteosarcoma_2_TIL`<br>
`GSM5959136_Osteosarcoma_3_TIL`<br>
`GSM5959138_Osteosarcoma_4_TIL`<br>
`GSM5959133_Osteosarcoma_2_PBMC`<br>
`GSM5959135_Osteosarcoma_3_PBMC`<br>
`GSM5959129_Ewing_1_PBMC`<br>
`GSM5959130_Ewing_1_TIL`<br>
`GSM5959132_Osteosarcoma_1_TIL`<br>
`GSM5959137_Osteosarcoma_4_PBMC`<br>
`GSM5959131_Osteosarcoma_1_PBMC`<br>
We expect to have three columns in the features.tsv.gz files (1st col: gene identifier, 2nd col: gene symbol, 3rd col: feature type), but these samples lack the 3rd column.

The easiest workaround is to decompress all barcodes/features/matrix files and rename features.tsv to genes.tsv, so that `sc.read_10x_mtx` treats them as outputs from an older version of Cell Ranger (the legacy layout with a two-column genes.tsv).
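
The workaround can be sketched as follows. `n_feature_columns` reports how many tab-separated columns a features file has, and `to_legacy_layout` decompresses the three files and renames features.tsv to genes.tsv. Both helper names are hypothetical, and the sketch assumes the files still carry their `.gz` suffix; like `gunzip`, it removes the compressed originals.

```python
import gzip
import shutil
from pathlib import Path


def n_feature_columns(features_gz: Path) -> int:
    """Number of tab-separated columns in the first line of features.tsv.gz."""
    with gzip.open(features_gz, "rt") as handle:
        return len(handle.readline().rstrip("\n").split("\t"))


def to_legacy_layout(sample_dir: Path) -> None:
    """Decompress the three 10x files in place and rename features.tsv to genes.tsv."""
    for name in ("barcodes.tsv", "features.tsv", "matrix.mtx"):
        compressed = sample_dir / (name + ".gz")
        target = sample_dir / ("genes.tsv" if name == "features.tsv" else name)
        with gzip.open(compressed, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        compressed.unlink()  # mirror gunzip: drop the .gz once it is decompressed
```

A sample whose features file reports two columns instead of three is a candidate for `to_legacy_layout`; afterwards the directory contains barcodes.tsv, genes.tsv and matrix.mtx.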


In [9]:
from pathlib import Path

# Specify directory paths
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE198896')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE198896')

# Loop through all files in the directory
for sample_directory in data_directory.iterdir():
    sample_name = sample_directory.stem
    sample_h5ad = sample_name + '_uni.h5ad'

    try:
        sample = sc.read_10x_mtx(
            sample_directory,
            var_names='gene_symbols',
            cache=False,
        )

        # Build the per-cell metadata table, indexed by the original cell barcodes
        obs_metrics = pd.DataFrame(index=sample.obs_names)
        if "Ewing" in sample_name:
            obs_metrics['cancer_type'] = 'ewing_sarcoma'
        else:
            obs_metrics['cancer_type'] = 'osteosarcoma'
        
        obs_metrics['dataset'] = 'GSE198896'
        
        if "PBMC" in sample_name:
            obs_metrics['tissue'] = 'pbmc'
        elif "TIL" in sample_name:
            obs_metrics['tissue'] = 'bone/surrounding_soft_tissue'
        
        if "Healthy" in sample_name:
            obs_metrics['disease_progression'] = 'healthy'
        else:
            obs_metrics['disease_progression'] = obs_metrics['cancer_type']
        
        obs_metrics['sample_barcode'] = sample_name
        obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)
        
        sample.obs = obs_metrics
        sample.obs.set_index("uni_barcode", drop=False, inplace=True)
        print(sample)

        # save the anndata object
        output_path = write_directory / sample_h5ad
        sample.write_h5ad(output_path, compression="gzip")
    except Exception:
        # Reading or writing failed; report the sample name so it can be inspected
        print(sample_name)

AnnData object with n_obs × n_vars = 3854 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 687 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM5959144_Ewings_3_CITEseq
AnnData object with n_obs × n_vars = 3495 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2008 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1498 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
GSM5959145_Ewings_4_CITEseq
AnnData object with n_obs × n_vars = 3390 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 

In [12]:
sample.obs

| uni_barcode (index) | cancer_type | dataset | tissue | disease_progression | sample_barcode | uni_barcode |
|---|---|---|---|---|---|---|
| GSE198896_AAACCTGAGCCAGAAC-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_AAACCTGAGCCAGAAC-1 |
| GSE198896_AAACCTGAGCGAGAAA-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_AAACCTGAGCGAGAAA-1 |
| GSE198896_AAACCTGAGGTGCAAC-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_AAACCTGAGGTGCAAC-1 |
| GSE198896_AAACCTGAGTACACCT-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_AAACCTGAGTACACCT-1 |
| GSE198896_AAACCTGCACCGAAAG-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_AAACCTGCACCGAAAG-1 |
| ... | ... | ... | ... | ... | ... | ... |
| GSE198896_TTTGTCAAGACTCGGA-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_TTTGTCAAGACTCGGA-1 |
| GSE198896_TTTGTCAAGGTCGGAT-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_TTTGTCAAGGTCGGAT-1 |
| GSE198896_TTTGTCACATGACGGA-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_TTTGTCACATGACGGA-1 |
| GSE198896_TTTGTCATCCAGATCA-1 | osteosarcoma | GSE198896 | pbmc | osteosarcoma | GSM5959131_Osteosarcoma_1_PBMC | GSE198896_TTTGTCATCCAGATCA-1 |
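
The sample-name rules used in the loop above can also be written as a small pure function, which makes them easy to check without loading any data. This is only a sketch: `sample_metadata` is a hypothetical helper, and it deliberately reproduces the notebook's behaviour, including the fall-through cases (any name without "Ewing" gets `cancer_type` 'osteosarcoma', and names with neither "PBMC" nor "TIL" get no tissue).

```python
def sample_metadata(sample_name: str, dataset: str = "GSE198896") -> dict:
    """Derive per-sample annotations from a GEO sample directory name."""
    cancer_type = "ewing_sarcoma" if "Ewing" in sample_name else "osteosarcoma"

    if "PBMC" in sample_name:
        tissue = "pbmc"
    elif "TIL" in sample_name:
        tissue = "bone/surrounding_soft_tissue"
    else:
        tissue = None  # e.g. CITEseq sample names carry no tissue label

    disease_progression = "healthy" if "Healthy" in sample_name else cancer_type

    return {
        "cancer_type": cancer_type,
        "dataset": dataset,
        "tissue": tissue,
        "disease_progression": disease_progression,
        "sample_barcode": sample_name,
    }
```

For example, `sample_metadata("GSM5959131_Osteosarcoma_1_PBMC")` yields the values shown in the table above; the resulting dictionary could be assigned column by column to `obs_metrics`.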


<span style="color:red">**PROBLEM 2:**</span> The genes.tsv files for all 4 CITEseq samples are incomplete. A standard genes.tsv has two columns (1st column: gene ID, such as ENSG00000186092; 2nd column: gene symbol, such as OR4F5). However, the genes.tsv files for these samples contain entries like:  

hto2-TGATGGCCTATTGGG<br>
hto3-TTCCGCCTCTCTTTG<br>
hto4-AGTAAGTTCAGCGTA<br>
hto5-AAGTATCGTTTCGCA<br>
unmapped

These look like cell-hashing tags (hashtag oligos, HTOs) rather than gene annotations, which is why these samples cannot be read as gene expression matrices.
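
One way to flag such files before calling `sc.read_10x_mtx` is to test whether the first rows look like a gene table at all. The sketch below is an assumption-laden heuristic, not part of the original workflow: `looks_like_gene_table` is a hypothetical helper, and the pattern only recognises human Ensembl gene IDs (`ENSG` followed by 11 digits).

```python
import re
from pathlib import Path

# Assumption: human Ensembl gene IDs, e.g. ENSG00000186092
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}$")


def looks_like_gene_table(genes_tsv: Path, n_check: int = 5) -> bool:
    """True if the first n_check rows have >= 2 columns and an Ensembl gene ID in column 1."""
    with open(genes_tsv) as handle:
        rows = [line.rstrip("\n").split("\t") for _, line in zip(range(n_check), handle)]
    return bool(rows) and all(
        len(row) >= 2 and ENSEMBL_GENE_ID.match(row[0]) is not None for row in rows
    )
```

A file containing lines such as `hto2-TGATGGCCTATTGGG` returns `False`, so the sample can be set aside for separate handling instead of failing inside the loading loop.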

### 3. Confirmation of created AnnData objects

In [14]:
from pathlib import Path

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE198896')

# Loop through all files in the directory
for file in write_directory.iterdir():
    if "_uni.h5ad" in file.name:
        sample = anndata.read_h5ad(file)
        print(sample)

AnnData object with n_obs × n_vars = 2655 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1690 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2376 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1631 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 572 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1498 × 33694
    obs: 'cancer_type', 'dataset', 'tissue', 'disease_progression', 'sample_barcode', 'uni_barcode'
 

In [15]:
sample.obs

| uni_barcode (index) | cancer_type | dataset | tissue | disease_progression | sample_barcode | uni_barcode |
|---|---|---|---|---|---|---|
| GSE198896_AAACCTGGTGTAACGG-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_AAACCTGGTGTAACGG-1 |
| GSE198896_AAACCTGTCTATCCCG-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_AAACCTGTCTATCCCG-1 |
| GSE198896_AAACGGGGTAATCACC-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_AAACGGGGTAATCACC-1 |
| GSE198896_AAACGGGGTTTAGCTG-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_AAACGGGGTTTAGCTG-1 |
| GSE198896_AAAGATGAGACTAGGC-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_AAAGATGAGACTAGGC-1 |
| ... | ... | ... | ... | ... | ... | ... |
| GSE198896_TTTGGTTTCCGAATGT-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_TTTGGTTTCCGAATGT-1 |
| GSE198896_TTTGTCAAGACAATAC-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_TTTGTCAAGACAATAC-1 |
| GSE198896_TTTGTCAGTAACGCGA-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_TTTGTCAGTAACGCGA-1 |
| GSE198896_TTTGTCAGTAGCCTAT-1 | osteosarcoma | GSE198896 | bone/surrounding_soft_tissue | osteosarcoma | GSM5959138_Osteosarcoma_4_TIL | GSE198896_TTTGTCAGTAGCCTAT-1 |


### 4. Convert AnnData objects to SingleCellExperiment objects

In [16]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE198896')

# Loop through all files in the directory
for file in write_directory.iterdir():
    sample_name = file.stem
    if "_uni.h5ad" in file.name:
        sample_anndata = anndata.read_h5ad(file)
        sample_sce_file = sample_name + ".rds"

        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)
        print(sample_sce)
        
        # Save the sce object in .rds file
        robjects.globalenv["sample_sce"] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))

class: SingleCellExperiment 
dim: 33694 2655 
metadata(0):
assays(1): X
rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
rowData names(1): gene_ids
colnames(2655): GSE198896_AAACCTGAGCCAGAAC-1
  GSE198896_AAACCTGAGCGAGAAA-1 ... GSE198896_TTTGTCATCCAGATCA-1
  GSE198896_TTTGTCATCTACCAGA-1
colData names(6): cancer_type dataset ... sample_barcode uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

class: SingleCellExperiment 
dim: 33694 1690 
metadata(0):
assays(1): X
rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
rowData names(1): gene_ids
colnames(1690): GSE198896_AAACCTGAGAATTGTG-1
  GSE198896_AAACCTGCACATCCGG-1 ... GSE198896_TTTGGTTTCTGCTGCT-1
  GSE198896_TTTGTCAAGATGCGAC-1
colData names(6): cancer_type dataset ... sample_barcode uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

class: SingleCellExperiment 
dim: 33694 2376 
metadata(0):
assays(1): X
rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
rowData names(1): g