### 1. General info of dataset GSE198896

This Jupyter notebook processes dataset GSE198896. The dataset provides one `.tar.gz` archive per sample; each archive contains a sample directory with the barcodes and features (`.tsv`) files and the matrix (`.mtx`) file.

Thus, we first need to gzip these barcodes/features/matrix files and then combine them into one AnnData object per sample. In total, there are 18 samples.
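The unpack-and-gzip step described above can be sketched with the standard library alone. This is a minimal sketch, not the procedure actually used for this dataset: the function name `extract_and_gzip` is hypothetical, and it assumes each archive unpacks to uncompressed `.tsv` and `.mtx` files.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_and_gzip(archive: Path, out_dir: Path) -> list[Path]:
    """Extract one sample archive, then gzip every uncompressed .tsv/.mtx file in it.

    Hypothetical helper: assumes the archive holds plain-text 10x-style files.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)

    gzipped = []
    # Materialise the listing first so that adding/removing files is safe
    for path in sorted(out_dir.rglob("*")):
        if path.is_file() and path.suffix in {".tsv", ".mtx"}:
            target = path.with_name(path.name + ".gz")
            with open(path, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()  # keep only the gzipped copy
            gzipped.append(target)
    return gzipped
```

Calling it once per archive, with the data directory as `out_dir`, should leave one sample directory per archive containing only `.gz` files.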

<span style="color:green">**[PBMC]**</span> Peripheral blood mononuclear cell

<span style="color:green">**[TIL]**</span> Tumor-Infiltrating Lymphocyte

<span style="color:green">**[CITE-SEQ]**</span> Cellular indexing of transcriptomes and epitopes by sequencing; simultaneously quantifies cell surface protein and transcriptomic data within a single-cell readout

In [6]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy

### 2. AnnData object of each sample

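<span style="color:red">**IMPORTANT:**</span> gzip all the files first, so that each sample directory contains `barcodes.tsv.gz`, `features.tsv.gz`, and `matrix.mtx.gz`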

1. `barcodes.tsv`: cell barcodes, which go into `.obs`
2. `features.tsv`: gene names, which go into `.var`
3. `matrix.mtx`: the expression matrix, which goes into `.X`

In [7]:
from pathlib import Path

# Specify directory paths
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE198896')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE198896')

# Loop through all sample directories in the data directory
for sample_directory in data_directory.iterdir():
    sample_name = sample_directory.stem
    sample_h5ad = sample_name + '_uni.h5ad'

    try:
        sample = sc.read_10x_mtx(
            sample_directory,
            var_names='gene_symbols',
            cache=False
        )

        # Build a per-cell metadata table, indexed by the original cell barcodes
        obs_metrics = pd.DataFrame(index=sample.obs_names)
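        # NOTE: any sample without "Ewing" in its name (including the healthy donor) is labelled osteosarcoma
        if "Ewing" in sample_name:
            obs_metrics['cancer_type'] = 'ewing_sarcoma'
        else:
            obs_metrics['cancer_type'] = 'osteosarcoma'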
        
        obs_metrics['dataset'] = 'GSE198896'
        
        if "PBMC" in sample_name:
            obs_metrics['tissue'] = 'pbmc'
        elif "TIL" in sample_name:
            obs_metrics['tissue'] = 'sarcoma_tumour'
        
        obs_metrics['sample_barcode'] = sample_name
        obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)
        
        sample.obs = obs_metrics
        sample.obs.set_index("uni_barcode", drop=False, inplace=True)
        print(sample)

        # save the anndata object
        output_path = write_directory / sample_h5ad
        sample.write_h5ad(output_path, compression="gzip")
    except Exception:
        # Reading or writing failed; print the sample name so it can be inspected
        print(sample_name)

AnnData object with n_obs × n_vars = 3854 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
GSM5959134_Osteosarcoma_2_TIL
GSM5959144_Ewings_3_CITEseq
GSM5959136_Osteosarcoma_3_TIL
GSM5959138_Osteosarcoma_4_TIL
GSM5959133_Osteosarcoma_2_PBMC
GSM5959145_Ewings_4_CITEseq
AnnData object with n_obs × n_vars = 3390 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
GSM5959135_Osteosarcoma_3_PBMC
GSM5959146_Healthy_donor_CITEseq
AnnData object with n_obs × n_vars = 10843 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
GSM5959129_Ewing_1_PBMC
GSM5959130_Ewing_1_TIL
AnnData object with n_obs × n_vars = 4296 × 33538
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_ids', 'feature_types'
GSM5959132_Osteosarcoma_1_TIL
GSM5959137_Osteosarcoma_4

<span style="color:red">**PROBLEM 1:**</span> The following samples have incomplete `features.tsv.gz` files:<br>
`GSM5959134_Osteosarcoma_2_TIL`<br>
`GSM5959136_Osteosarcoma_3_TIL`<br>
`GSM5959138_Osteosarcoma_4_TIL`<br>
`GSM5959133_Osteosarcoma_2_PBMC`<br>
`GSM5959135_Osteosarcoma_3_PBMC`<br>
`GSM5959129_Ewing_1_PBMC`<br>
`GSM5959130_Ewing_1_TIL`<br>
`GSM5959132_Osteosarcoma_1_TIL`<br>
`GSM5959137_Osteosarcoma_4_PBMC`<br>
`GSM5959131_Osteosarcoma_1_PBMC`<br>
We expect three columns in each `features.tsv.gz` file (1st column: gene identifier, 2nd column: gene symbol, 3rd column: feature type), but these samples lack the 3rd column.
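A quick way to confirm which samples are affected is to count the tab-separated columns in the first row of each features file. The sketch below uses only the standard library; the helper names `feature_column_count` and `samples_missing_feature_type` are hypothetical, and it assumes one `features.tsv.gz` per sample directory.

```python
import csv
import gzip
from pathlib import Path


def feature_column_count(features_path: Path) -> int:
    """Return the number of tab-separated columns in the first row (0 if the file is empty)."""
    opener = gzip.open if features_path.suffix == ".gz" else open
    with opener(features_path, "rt", newline="") as handle:
        first_row = next(csv.reader(handle, delimiter="\t"), [])
    return len(first_row)


def samples_missing_feature_type(data_directory: Path) -> list[str]:
    """List sample directories whose features.tsv.gz has fewer than three columns."""
    return sorted(
        path.parent.name
        for path in data_directory.glob("*/features.tsv.gz")
        if feature_column_count(path) < 3
    )
```

Running `samples_missing_feature_type(data_directory)` would then be expected to return the sample names listed above, provided the files are laid out as assumed.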

The easiest workaround is to unzip all barcodes/features/matrix files and rename `features.tsv` to `genes.tsv`, so that they are treated as output from an older version of Cell Ranger.
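The unzip-and-rename workaround can be sketched as follows. This is an illustration under assumptions rather than the exact commands used here: `to_legacy_layout` is a hypothetical helper, it modifies the directory in place (so it is safer to run on a copy), and it relies on our understanding that `sc.read_10x_mtx` switches to its legacy reader when it finds an uncompressed `genes.tsv`.

```python
import gzip
import shutil
from pathlib import Path


def to_legacy_layout(sample_directory: Path) -> None:
    """Decompress the three 10x files in place and rename features.tsv to genes.tsv."""
    for name in ("barcodes.tsv", "features.tsv", "matrix.mtx"):
        gz_path = sample_directory / f"{name}.gz"
        if not gz_path.exists():
            continue  # already decompressed or missing; leave untouched
        with gzip.open(gz_path, "rb") as src, open(sample_directory / name, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # remove the compressed copy

    features = sample_directory / "features.tsv"
    if features.exists():
        # The older Cell Ranger layout names this file genes.tsv
        features.rename(sample_directory / "genes.tsv")
```

After this step the sample directory should hold `barcodes.tsv`, `genes.tsv`, and `matrix.mtx`, which is the older layout described above.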


<span style="color:red">**PROBLEM 2:**</span> The `genes.tsv` files for all 4 CITE-seq samples are incomplete. A standard `genes.tsv` has two columns (1st column: gene ID such as ENSG00000186092; 2nd column: gene symbol such as OR4F5). However, the `genes.tsv` files for these samples contain entries like:

hto2-TGATGGCCTATTGGG<br>
hto3-TTCCGCCTCTCTTTG<br>
hto4-AGTAAGTTCAGCGTA<br>
hto5-AAGTATCGTTTCGCA<br>
unmapped

These look like hashtag oligo (HTO) labels used for sample multiplexing rather than gene annotations.
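To flag such files automatically, one option is to measure what fraction of the entries match the hashtag-like pattern seen above. This is only a heuristic sketch: the pattern `hto<number>-<DNA sequence>` is inferred from the five example lines and may not cover every naming scheme, and `hashtag_fraction` is a hypothetical helper.

```python
import re

# Pattern inferred from the example entries above (an assumption, not a standard)
HTO_PATTERN = re.compile(r"^hto\d+-[ACGT]+$", re.IGNORECASE)


def hashtag_fraction(feature_names) -> float:
    """Return the fraction of non-empty entries that look like hashtag labels or 'unmapped'."""
    names = [name.strip() for name in feature_names if name.strip()]
    if not names:
        return 0.0
    hits = sum(1 for name in names if HTO_PATTERN.match(name) or name == "unmapped")
    return hits / len(names)
```

A file whose fraction is close to 1 most likely holds hashtag counts rather than gene expression, so it would need separate handling instead of being read as a gene matrix.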

### 3. Confirmation of created AnnData objects

In [22]:
from pathlib import Path

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE198896')

# Loop through all files in the directory
for file in write_directory.iterdir():
    sample = anndata.read_h5ad(file)
    print(sample)

AnnData object with n_obs × n_vars = 572 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1631 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1498 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3390 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2008 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2655 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 10843 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3495 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3854 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1690 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 4296 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 687 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1693 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2376 × 33694
    var: 'gene_ids'
