### 1. General information on dataset GSE132509

This Jupyter Notebook processes dataset GSE132509. The dataset consists of one large cell-annotation TSV file covering all cells, plus a set of barcodes/genes/matrix files for each sample.

We therefore read each sample's barcodes/genes/matrix files and generate one AnnData object per sample. In total, there are 11 samples, drawn from the following four groups:

<span style="color:green">**[ETV6-RUNX1]**</span> Pre-B t(12;21) acute lymphoblastic leukemia

<span style="color:green">**[HHD]**</span> Pre-B high hyperdiploid acute lymphoblastic leukemia

<span style="color:green">**[PRE-T]**</span> Pre-T acute lymphoblastic leukemia

<span style="color:green">**[PBMMC]**</span> Healthy pediatric bone marrow mononuclear cells

In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy

### 2. AnnData object of each sample

<span style="color:red">**IMPORTANT:**</span> Before reading the data, rename the files in each sample folder to remove their prefixes, leaving exactly these three files:

1. `barcodes.tsv`: cell barcodes, which go into `.obs`
2. `genes.tsv`: gene IDs and gene names, which go into `.var`
3. `matrix.mtx`: the expression matrix, which goes into `.X`
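
The renaming step could be scripted rather than done by hand. The following is a minimal sketch, assuming each downloaded file name simply ends with one of the three standard names (for example `SOMEPREFIX.barcodes.tsv`); the helper `strip_prefixes` and the demo folder are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

# The three file names listed above.
STANDARD_NAMES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def strip_prefixes(sample_dir):
    """Rename e.g. 'PREFIX.barcodes.tsv' to 'barcodes.tsv' inside sample_dir.

    Returns the original names of the files that were renamed.
    """
    renamed = []
    # sorted() materialises the listing first, so renaming while looping is safe.
    for path in sorted(Path(sample_dir).iterdir()):
        if not path.is_file():
            continue
        for standard in STANDARD_NAMES:
            if path.name.endswith(standard) and path.name != standard:
                target = path.with_name(standard)
                if target.exists():
                    # Refuse to overwrite a file that already has the plain name.
                    raise FileExistsError(f"{target} already exists")
                path.rename(target)
                renamed.append(path.name)
                break
    return renamed


# Demonstration on a throwaway folder containing made-up, empty files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "demo_sample"
    demo_dir.mkdir()
    for standard in STANDARD_NAMES:
        (demo_dir / f"demo_sample.{standard}").touch()
    renamed_files = strip_prefixes(demo_dir)
    final_names = sorted(p.name for p in demo_dir.iterdir())

print(renamed_files)
print(final_names)
```

In practice the function would be called once per sample folder. If the real file names follow a different pattern, the `endswith` check would need adjusting.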

In [5]:
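# Path prefixes shared by all samples. Each loop below appends the last two
# digits of the GSM accession (34-44), the group label, and the sample number.
general_input_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/data/GSE132509/GSE132509_RAW/GSM38724'
general_output_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE132509/GSM38724'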

for i, j in zip(range(34, 38), range(1, 5)):
    actual_input_path = general_input_path + str(i) + '_ETV6-RUNX1_' + str(j)
    actual_output_path = general_output_path + str(i) + '_ETV6-RUNX1_' + str(j) + '.h5ad'

    sample = sc.read_10x_mtx(
        actual_input_path,
        var_names='gene_symbols',  
        cache=False
    )
    print(sample)
    
    # save the anndata object
    sample.write_h5ad(actual_output_path, compression="gzip")

for i, j in zip(range(38, 40), range(1, 3)):
    actual_input_path = general_input_path + str(i) + '_HHD_' + str(j)
    actual_output_path = general_output_path + str(i) + '_HHD_' + str(j) + '.h5ad'

    sample = sc.read_10x_mtx(
        actual_input_path,
        var_names='gene_symbols',  
        cache=False
    )
    print(sample)
    
    # save the anndata object
    sample.write_h5ad(actual_output_path, compression="gzip")

for i, j in zip(range(40, 42), range(1, 3)):
    actual_input_path = general_input_path + str(i) + '_PRE-T_' + str(j)
    actual_output_path = general_output_path + str(i) + '_PRE-T_' + str(j) + '.h5ad'

    sample = sc.read_10x_mtx(
        actual_input_path,
        var_names='gene_symbols',  
        cache=False
    )
    print(sample)
    
    # save the anndata object
    sample.write_h5ad(actual_output_path, compression="gzip")

for i, j in zip(range(42, 45), range(1, 4)):
    actual_input_path = general_input_path + str(i) + '_PBMMC_' + str(j)
    actual_output_path = general_output_path + str(i) + '_PBMMC_' + str(j) + '.h5ad'

    sample = sc.read_10x_mtx(
        actual_input_path,
        var_names='gene_symbols',  
        cache=False
    )
    print(sample)
    
    # save the anndata object
    sample.write_h5ad(actual_output_path, compression="gzip")

AnnData object with n_obs × n_vars = 2776 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 6274 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3862 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5069 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3728 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5013 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2959 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2748 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1612 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3105 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2229 × 33694
    var: 'gene_ids'
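
The four loops differ only in the accession range and the group label, so the sample names could also be generated from a small table. The sketch below is one possible way to do that; `GROUPS` and `sample_ids` are hypothetical names, and the values are taken from the loops in this section.

```python
# (first two-digit GSM suffix, group label, number of samples),
# matching the ranges used in the four loops of this section.
GROUPS = [
    (34, "ETV6-RUNX1", 4),
    (38, "HHD", 2),
    (40, "PRE-T", 2),
    (42, "PBMMC", 3),
]


def sample_ids(groups=GROUPS, accession_prefix="GSM38724"):
    """Build names such as 'GSM3872434_ETV6-RUNX1_1' for every sample."""
    ids = []
    for first_suffix, label, n_samples in groups:
        for replicate in range(1, n_samples + 1):
            # Accession suffixes increase by one per sample within a group.
            suffix = first_suffix + replicate - 1
            ids.append(f"{accession_prefix}{suffix}_{label}_{replicate}")
    return ids


all_ids = sample_ids()
print(len(all_ids))
for sample_id in all_ids:
    print(sample_id)
```

Each generated name could then be joined to the input and output directories inside a single loop that calls `sc.read_10x_mtx` and `write_h5ad`, in the same way as the cell above.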


### 3. Confirmation of created AnnData objects

In [7]:
general_output_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE132509/GSM38724'

for i, j in zip(range(34, 38), range(1, 5)):
    actual_output_path = general_output_path + str(i) + '_ETV6-RUNX1_' + str(j) + '.h5ad'
    sample = anndata.read_h5ad(actual_output_path)
    print(sample)

for i, j in zip(range(38, 40), range(1, 3)):
    actual_output_path = general_output_path + str(i) + '_HHD_' + str(j) + '.h5ad'
    sample = anndata.read_h5ad(actual_output_path)
    print(sample)

for i, j in zip(range(40, 42), range(1, 3)):
    actual_output_path = general_output_path + str(i) + '_PRE-T_' + str(j) + '.h5ad'
    sample = anndata.read_h5ad(actual_output_path)
    print(sample)

for i, j in zip(range(42, 45), range(1, 4)):
    actual_output_path = general_output_path + str(i) + '_PBMMC_' + str(j) + '.h5ad'
    sample = anndata.read_h5ad(actual_output_path)
    print(sample)


AnnData object with n_obs × n_vars = 2776 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 6274 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3862 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5069 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3728 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5013 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2959 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2748 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 1612 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3105 × 33694
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2229 × 33694
    var: 'gene_ids'
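
Comparing the two printouts by eye works for 11 samples, but the check could also be made explicit. The sketch below assumes the shapes printed when the files were written are recorded in a dictionary, in loop order; `EXPECTED_SHAPES`, `find_mismatches`, and `cells_per_group` are hypothetical helpers. In the notebook, `observed` would be filled from `anndata.read_h5ad(path).shape` for each file, whereas here it is a plain copy so the example runs without the data.

```python
# Shapes (n_obs, n_vars) printed when the files were written, in loop order.
EXPECTED_SHAPES = {
    "GSM3872434_ETV6-RUNX1_1": (2776, 33694),
    "GSM3872435_ETV6-RUNX1_2": (6274, 33694),
    "GSM3872436_ETV6-RUNX1_3": (3862, 33694),
    "GSM3872437_ETV6-RUNX1_4": (5069, 33694),
    "GSM3872438_HHD_1": (3728, 33694),
    "GSM3872439_HHD_2": (5013, 33694),
    "GSM3872440_PRE-T_1": (2959, 33694),
    "GSM3872441_PRE-T_2": (2748, 33694),
    "GSM3872442_PBMMC_1": (1612, 33694),
    "GSM3872443_PBMMC_2": (3105, 33694),
    "GSM3872444_PBMMC_3": (2229, 33694),
}


def find_mismatches(expected, observed):
    """Return sample IDs missing from `observed` or whose shape differs."""
    return sorted(s for s, shape in expected.items() if observed.get(s) != shape)


def cells_per_group(shapes):
    """Sum the number of cells (n_obs) per group label."""
    totals = {}
    for sample_id, (n_obs, _n_vars) in shapes.items():
        group = sample_id.split("_")[1]  # e.g. 'ETV6-RUNX1'
        totals[group] = totals.get(group, 0) + n_obs
    return totals


# Stand-in for the shapes read back from the .h5ad files.
observed = dict(EXPECTED_SHAPES)
mismatches = find_mismatches(EXPECTED_SHAPES, observed)
group_totals = cells_per_group(observed)

print(mismatches)
print(group_totals)
```

An empty mismatch list means every reloaded object has the same dimensions as the one that was written.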
