### 1. General information on dataset GSE235923

This Jupyter notebook processes dataset GSE235923. The dataset provides barcodes/genes/matrix files for each sample.

We therefore only need to read these barcodes/genes/matrix files and generate an AnnData object for each sample.

In total, there are 31 acute myeloid leukemia (AML) samples.

<span style="color:green">**[Dx]**</span> samples from diagnosis

<span style="color:green">**[EOI]**</span> samples from end of induction

<span style="color:green">**[R]**</span> samples from relapse 
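
The sample IDs used later in this notebook (for example `5D`, `5E`, `5R`, `18Dx`) appear to combine a patient number with one of these timepoints. The following is a minimal sketch of how such an ID could be split into its parts. The helper `parse_sample_id` and the mapping `SUFFIX_TO_TIMEPOINT` are hypothetical, and the suffix-to-timepoint mapping (`D`/`Dx` for diagnosis, `E` for end of induction, `R` for relapse) is an assumption inferred from the labels above, not something the dataset description states.

```python
import re

# Assumed mapping from sample-ID suffix to the timepoint labels above.
SUFFIX_TO_TIMEPOINT = {"D": "Dx", "Dx": "Dx", "E": "EOI", "R": "R"}


def parse_sample_id(sample_id):
    """Split an ID such as '18Dx' into (patient number, timepoint label)."""
    match = re.fullmatch(r"(\d+)(Dx|D|E|R)", sample_id)
    if match is None:
        raise ValueError(f"Unrecognised sample ID: {sample_id!r}")
    patient, suffix = match.groups()
    return int(patient), SUFFIX_TO_TIMEPOINT[suffix]


for sid in ["5D", "5E", "5R", "18Dx"]:
    print(sid, parse_sample_id(sid))
```

Parsed this way, the timepoint could later be stored as a per-sample annotation, if the assumed mapping is confirmed against the GEO metadata.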

In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy

### 2. AnnData object of each sample

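<span style="color:red">**IMPORTANT:**</span> before running the cell below, rename each sample's files to remove their prefixes, and rename `features.tsv` to `genes.tsv`, so that every sample directory contains exactly these three files: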

1. `barcodes.tsv`: cell barcodes, which go into `.obs`
2. `genes.tsv`: gene names, which go into `.var`
3. `matrix.mtx`: the expression matrix, which goes into `.X`
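
The renaming step can be scripted rather than done by hand. The sketch below is one possible approach, not the procedure actually used for this notebook: `strip_prefixes` and `TARGETS` are hypothetical names, the files are assumed to be uncompressed, and the `demo_` prefix is made up because the real file prefixes are not shown here.

```python
from pathlib import Path
import tempfile

# File-name endings to look for, and the bare name each should end up with.
TARGETS = {
    "barcodes.tsv": "barcodes.tsv",
    "features.tsv": "genes.tsv",  # features.tsv is renamed to genes.tsv
    "genes.tsv": "genes.tsv",
    "matrix.mtx": "matrix.mtx",
}


def strip_prefixes(sample_dir):
    """Rename '<prefix>barcodes.tsv' etc. to bare names; return the final names."""
    final_names = []
    for path in sorted(Path(sample_dir).iterdir()):
        for ending, target in TARGETS.items():
            if path.name.endswith(ending):
                if path.name != target:
                    path.rename(path.with_name(target))
                final_names.append(target)
                break
    return sorted(final_names)


# Demo on a throwaway directory with a made-up prefix.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["demo_barcodes.tsv", "demo_features.tsv", "demo_matrix.mtx"]:
        (Path(tmp) / name).touch()
    print(strip_prefixes(tmp))
```

The helper does not guard against a directory that already holds both `features.tsv` and `genes.tsv`; in that case the rename would collide and should be resolved manually.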

In [5]:
general_input_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/data/GSE235923/GSM751'
general_output_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE235923/GSM751'

sample_id = ['1D', '2D', '3D', '3E', '4D', '5D', '5E', '5R', '6D', '6E', '6R', '7D', '8D', '9D', '10D', '11D', '12D',
             '13D', '14D', '14E', '15E', '16D', '16E', '17D', '17E', '18Dx', '18E', '19D', '19E', '20D', '20E']

# Pair each GSM accession suffix (1998-2028) with its sample ID; both sequences have 31 entries
for i, sid in zip(range(1998, 2029), sample_id):
    actual_input_path = general_input_path + str(i) + '_Sample' + sid
    actual_output_path = general_output_path + str(i) + '_Sample' + sid + '.h5ad'

    sample = sc.read_10x_mtx(
        actual_input_path,
        var_names='gene_symbols',  # index .var by gene symbol; Ensembl IDs are kept in .var['gene_ids']
        cache=False
    )
    print(sample)
    
    # save the anndata object
    sample.write_h5ad(actual_output_path, compression="gzip")

AnnData object with n_obs × n_vars = 4265 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3495 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2679 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2413 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 4268 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2964 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 7331 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5440 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 9588 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5453 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 9301 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3457 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3057 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3965 × 33538
    var: 'gene_ids'
AnnData object with 

### 3. Confirmation of created AnnData objects

In [6]:
general_output_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE235923/GSM751'

sample_id = ['1D', '2D', '3D', '3E', '4D', '5D', '5E', '5R', '6D', '6E', '6R', '7D', '8D', '9D', '10D', '11D', '12D',
             '13D', '14D', '14E', '15E', '16D', '16E', '17D', '17E', '18Dx', '18E', '19D', '19E', '20D', '20E']

# Pair each GSM accession suffix (1998-2028) with its sample ID; both sequences have 31 entries
for i, sid in zip(range(1998, 2029), sample_id):
    actual_output_path = general_output_path + str(i) + '_Sample' + sid + '.h5ad'
    sample = anndata.read_h5ad(actual_output_path)
    print(sample)

AnnData object with n_obs × n_vars = 4265 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3495 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2679 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2413 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 4268 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 2964 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 7331 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5440 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 9588 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5453 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 9301 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3457 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3057 × 33538
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3965 × 33538
    var: 'gene_ids'
AnnData object with 