### 1. General info of dataset GSE119926

This notebook processes dataset GSE119926. The dataset provides one tab-delimited txt file per sample. As shown below, each row of a txt file is a gene and each column is a cell.

AnnData stores cells as rows and genes as columns, so each txt file must be transposed and converted into its own AnnData object, one per sample. In total, there are 25 fresh surgical resection samples (23 diagnostic samples and 2 recurrences) and 11 patient-derived xenograft (PDX) samples.

<span style="color:green">**[BCH]**</span> fresh tumour biopsies from Boston Children's Hospital

<span style="color:green">**[MUV]**</span> fresh tumour biopsies from Vienna General Hospital

<span style="color:green">**[SJ]**</span> fresh tumour biopsies from LeBonheur Children's Hospital
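The prefixes above can be recovered from the file names. Below is a minimal sketch, assuming every fresh-sample file follows the `<GSM accession>_<prefix><number>.txt` pattern of `GSM3905406_BCH807.txt`; the helper `parse_sample_name` and the `SITES` mapping are illustrative names, and PDX files may be named differently, in which case the function returns `None` or reports an unknown site.

```python
import re
from pathlib import Path

# Prefix -> collection site, as listed above.
SITES = {
    "BCH": "Boston Children's Hospital",
    "MUV": "Vienna General Hospital",
    "SJ": "LeBonheur Children's Hospital",
}

# Assumed file-name pattern: <GSM accession>_<site prefix><sample number>
NAME_PATTERN = re.compile(r"^(GSM\d+)_([A-Z]+)(\d+)$")


def parse_sample_name(path):
    """Return (accession, sample_id, site), or None if the name does not match."""
    match = NAME_PATTERN.match(Path(path).stem)
    if match is None:
        return None
    accession, prefix, number = match.groups()
    return accession, prefix + number, SITES.get(prefix, "unknown")


print(parse_sample_name("GSM3905406_BCH807.txt"))
# ('GSM3905406', 'BCH807', "Boston Children's Hospital")
```

Storing the parsed sample ID and site alongside each sample would make it easier to group samples by institution later.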

In [14]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy.sparse  # explicit submodule import so scipy.sparse.csr_matrix is available

In [2]:
# Inspect the txt file of one sample
path = '/scratch/user/s4543064/xiaohan-john-project/data/GSE119926/GSM3905406_BCH807.txt'
# The first column contains gene names and becomes the index.
# The variable is named `df` to avoid shadowing the built-in `input`.
df = pd.read_csv(path, sep='\t', index_col=0)

print(df.head())
print(df.shape) # (23686, 307): 23686 genes (rows) x 307 cells (columns)

          BCH807-P02-A01  BCH807-P02-A02  BCH807-P02-A03  BCH807-P02-A04  \
A1BG                 0.0            2.26           0.000             0.0   
A1BG-AS1             0.0            0.00           1.183             0.0   
A1CF                 0.0            0.00           0.000             0.0   
A2M                  0.0            0.00           0.550             0.0   
A2M-AS1              0.0            0.00           0.000             0.0   

          BCH807-P02-A05  BCH807-P02-A06  BCH807-P02-A07  BCH807-P02-A08  \
A1BG                 0.0           0.559             0.0             0.0   
A1BG-AS1             0.0           2.174             0.0             0.0   
A1CF                 0.0           0.000             0.0             0.0   
A2M                  0.0           0.000             0.0             0.0   
A2M-AS1              0.0           0.000             0.0             0.0   

          BCH807-P02-A09  BCH807-P02-A10  ...  BCH807-P05-H01  BCH807-P05-H02  \
A1BG 
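The dimensions of a sample file can also be checked without loading the whole matrix. Below is a minimal sketch using only the standard library; `txt_shape` is a hypothetical helper, and it assumes a tab-delimited file with one header line whose gene-name column may or may not carry a label. Its result should agree with the `.shape` reported by pandas for the same file.

```python
import tempfile
from pathlib import Path


def txt_shape(path, sep="\t"):
    """Return (n_genes, n_cells) for a genes x cells text file without parsing values."""
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().rstrip("\r\n").split(sep)
        first_row = handle.readline().rstrip("\r\n").split(sep)
        # Count the first data row (if any) plus every remaining non-empty line.
        n_genes = int(first_row != [""]) + sum(1 for line in handle if line.strip())
    # The header may or may not carry a label for the gene-name column.
    n_cells = len(header) - 1 if len(header) == len(first_row) else len(header)
    return n_genes, n_cells


# Toy file with two genes and three cells (contents are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "toy.txt"
    toy.write_text("\tc1\tc2\tc3\nA1BG\t0\t2.26\t0\nA2M\t0\t0\t0.55\n", encoding="utf-8")
    shape = txt_shape(toy)

print(shape)  # (2, 3)
```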

### 2. AnnData object of each sample

<span style="color:red">**IMPORTANT:**</span> `DataFrame.values` must be transposed to match `AnnData.X`, which is cells × genes.

1. `DataFrame.columns`: cell identifiers, which go into `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the expression matrix (genes × cells), whose transpose goes into `.X`
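The mapping above can be checked on a toy table before running it on real files. The sketch below reuses the first values from the preview in section 1 and only assumes pandas, NumPy and SciPy; it builds the three pieces that AnnData expects and verifies that their dimensions line up.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy genes x cells table with the same layout as the sample files.
df = pd.DataFrame(
    np.array([[0.0, 2.26, 0.0],
              [0.0, 0.0, 1.183]]),
    index=["A1BG", "A1BG-AS1"],
    columns=["BCH807-P02-A01", "BCH807-P02-A02", "BCH807-P02-A03"],
)

X = sparse.csr_matrix(df.to_numpy().T)  # cells x genes
obs = pd.DataFrame(index=df.columns)    # one row per cell
var = pd.DataFrame(index=df.index)      # one row per gene

# The dimensions must line up before the pieces are handed to AnnData.
assert X.shape == (len(obs), len(var))
print(X.shape)  # (3, 2)

# Entry (cell 2, gene 1) of X equals entry (gene 1, cell 2) of the original table.
assert X[1, 0] == df.loc["A1BG", "BCH807-P02-A02"]
```

If the transpose is forgotten, the shape check fails immediately, which is cheaper than discovering swapped axes after the `.h5ad` files have been written.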

In [25]:
from pathlib import Path

# Specify the data directory path
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE119926')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE119926')

# Make sure the output directory exists
write_directory.mkdir(parents=True, exist_ok=True)

# Loop through all sample txt files in the directory
for file in data_directory.glob('*.txt'):
    # the first column contains gene names and becomes the index
    df = pd.read_csv(file, sep='\t', index_col=0)

    # transpose: genes x cells in the file, cells x genes in AnnData
    matrix = scipy.sparse.csr_matrix(df.values.T)
    obs = pd.DataFrame(index=df.columns)
    # keep gene symbols as a column and also as the index, so that
    # var_names are gene names rather than integer positions
    var = pd.DataFrame({'gene_symbols': df.index.to_numpy()}, index=df.index)

    sample = anndata.AnnData(X=matrix, obs=obs, var=var)
    print(sample)

    # save the AnnData object as <file stem>.h5ad
    output_path = write_directory / (file.stem + '.h5ad')
    sample.write_h5ad(output_path, compression="gzip")



AnnData object with n_obs × n_vars = 190 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 334 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 293 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 314 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 313 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 316 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 338 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 241 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 507 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 304 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 273 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 294 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 301 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 314 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 307 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 75 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 400 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 431 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 52 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 338 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 328 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 333 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 493 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 330 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 326 × 23686
    var: 'gene_symbols'

### 3. Confirmation of created AnnData objects

In [26]:
from pathlib import Path

write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE119926')

for file in write_directory.iterdir():
    sample = anndata.read_h5ad(file)
    print(sample)

AnnData object with n_obs × n_vars = 334 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 330 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 52 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 294 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 293 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 328 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 333 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 301 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 241 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 190 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 75 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 338 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 304 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 314 × 23686
    