### 1. General info of dataset GSE235063

This Jupyter notebook processes dataset GSE235063, which provides barcodes/genes/matrix files for each sample.

Thus, we simply need to combine these barcodes/genes/matrix files and generate an AnnData object for each sample.

In total, there are 75 acute myeloid leukemia (AML) samples. The dataset includes both raw and processed data for each sample (thus, a total of 150 sample directories).

<span style="color:green">**[DX]**</span> samples from diagnosis

<span style="color:green">**[REM]**</span> samples from remission

<span style="color:green">**[REL]**</span> samples from relapse 
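As a small illustration of the naming scheme, the sketch below splits a sample directory name into its parts. It assumes every name follows the pattern seen in the outputs of this notebook (for example `GSM7494260_AML6_DX_raw`), that is `<GSM accession>_<patient>_<stage>_<raw|processed>`. The helper `parse_sample_name` is hypothetical and not part of the original workflow.

```python
from typing import NamedTuple

# Stage tags used in the sample names and their meaning (from the legend above).
STAGE_LABELS = {"DX": "Diagnosis", "REM": "Remission", "REL": "Relapse"}


class SampleInfo(NamedTuple):
    accession: str  # GEO sample accession, e.g. "GSM7494260"
    patient: str    # patient identifier, e.g. "AML6"
    stage: str      # human-readable disease stage
    kind: str       # "raw" or "processed"


def parse_sample_name(name: str) -> SampleInfo:
    """Split a name such as 'GSM7494260_AML6_DX_raw' (assumed pattern)."""
    accession, patient, stage, kind = name.split("_")
    if stage not in STAGE_LABELS:
        raise ValueError(f"unknown stage tag: {stage!r}")
    if kind not in ("raw", "processed"):
        raise ValueError(f"unknown data kind: {kind!r}")
    return SampleInfo(accession, patient, STAGE_LABELS[stage], kind)


info = parse_sample_name("GSM7494260_AML6_DX_raw")
print(info)
```

Names that do not match the assumed four-part pattern raise a `ValueError`, which makes unexpected directories easy to spot.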

In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad  # aliased so that variables named `anndata` do not shadow the module

import warnings

# Ignore all FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)



### 2. AnnData object of each sample

<span style="color:red">**IMPORTANT:**</span> rename the files to remove their prefixes, so that each sample directory contains these three files:

1. `barcodes.tsv`: cell barcodes, which go into `.obs`
2. `genes.tsv`: gene names, which go into `.var`
3. `matrix.mtx`: the expression matrix, which goes into `.X`
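The renaming step can be scripted. The sketch below assumes each downloaded file carries a sample prefix followed by an underscore; the prefix `GSM0000000_SAMPLE_raw_` used in the demo is made up, and the helper `strip_sample_prefix` is hypothetical rather than part of the original workflow.

```python
import tempfile
from pathlib import Path

# File names expected inside each sample directory (from the list above).
TARGET_NAMES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def strip_sample_prefix(sample_directory: Path) -> list:
    """Rename '<prefix>_barcodes.tsv' etc. to the bare target names."""
    renamed = []
    for file in sorted(sample_directory.iterdir()):
        for target in TARGET_NAMES:
            destination = file.with_name(target)
            # Only touch prefixed files, and never overwrite an existing file.
            if file.name.endswith("_" + target) and not destination.exists():
                file.rename(destination)
                renamed.append(destination)
    return renamed


# Demo on a throwaway directory with a made-up prefix (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    demo_directory = Path(tmp)
    for target in TARGET_NAMES:
        (demo_directory / f"GSM0000000_SAMPLE_raw_{target}").touch()
    strip_sample_prefix(demo_directory)
    final_names = sorted(p.name for p in demo_directory.iterdir())

print(final_names)
```

On the real data, the function would be called once per sample directory before reading the files.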

~~<span style="color:red">**Problem:**</span> the genes.tsv files from the processed dataset are MISSING gene identifier information (such as ENSG00000268674) --> in the complete genes.tsv files from the raw dataset, some Ensembl gene IDs point to the same gene symbol, so assigning a gene ID to these gene symbols is ambiguous --> duplicate the columns in the genes.tsv files from the processed dataset~~

<span style="color:red">**Update:**</span> the raw files include all the cell barcodes (~ 6 million), so we only need to keep the barcodes that appear in the corresponding metadata files, while retaining the full set of genes.

The processed dataset includes metadata (patient age, biopsy origin, etc.) for each sample, so we need to add this metadata to each sample's AnnData object.
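The two steps described here (keep only barcodes present in the metadata, then attach metadata columns) can be illustrated on toy data with plain pandas. The barcodes and values below are invented for the example; only the column names `Cell_Barcode`, `Patient_Sample`, and `Age_Months` come from the metadata files used in this notebook.

```python
import pandas as pd

# Toy barcodes from a raw matrix; only two of them have metadata.
raw_barcodes = pd.Index(["AAAC-1", "AAAG-1", "AAAT-1"])

# Toy metadata table indexed by barcode (values are invented).
metadata = pd.DataFrame(
    {"Patient_Sample": ["Diagnosis", "Diagnosis"], "Age_Months": [90, 90]},
    index=pd.Index(["AAAC-1", "AAAT-1"], name="Cell_Barcode"),
)

# Step 1: boolean mask of barcodes that appear in the metadata.
keep = raw_barcodes.isin(metadata.index)
kept_barcodes = raw_barcodes[keep]

# Step 2: build the per-cell table and left-join the metadata on the index.
obs = pd.DataFrame(index=kept_barcodes)
obs["dataset"] = "GSE235063"
obs = obs.merge(metadata, how="left", left_index=True, right_index=True)

print(obs)
```

Because the cells are filtered before the join, every remaining barcode finds a match and the left join introduces no missing values in this toy case.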

In [2]:
from pathlib import Path

# Specify directory paths
data_directory = Path('/scratch/user/s4543064/xiaohan-john-project/data/GSE235063')
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE235063')

for sample_directory in data_directory.iterdir():
    sample_name = sample_directory.stem
    sample_h5ad = sample_name + '_uni.h5ad'

    if 'raw' in sample_name:
        pro_sample_name = sample_name.split("raw")[0] + "processed" # the corresponding processed sample
        metadata_path = data_directory / pro_sample_name / "metadata.tsv" 

        # Load the metadata as a DataFrame
        metadata = pd.read_csv(metadata_path, sep="\t", index_col="Cell_Barcode")
        
        anndata = sc.read_10x_mtx(
            sample_directory,
            var_names='gene_symbols',  
            cache=False
        )

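        # Keep only the barcodes listed in the metadata; copy so we work on a real object, not a view
        anndata = anndata[anndata.obs_names.isin(metadata.index)].copy()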

        # Create an observation metric info to store related features
        obs_metrics = pd.DataFrame(index=anndata.obs_names) ## Get the identifiers
        obs_metrics['cancer_type'] = 'acute_myeloid_leukemia'
        obs_metrics['dataset'] = 'GSE235063'
        obs_metrics['sample_barcode'] = sample_name
        obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)
        
        
        # Merge columns from the metadata to the obs_metrics
        cols_to_merge = ["Patient_Sample", "Classified_Celltype", "Malignant", "Biopsy_Origin", "Age_Months"]
        obs_metrics_merged = obs_metrics.merge(metadata[cols_to_merge], how="left", left_index=True, right_index=True)

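        # Fill missing values with the first unique value from the metadata
        # (this assumes these columns are constant within a sample)
        cols = ["Patient_Sample", "Biopsy_Origin", "Age_Months"]
        for col in cols:
            obs_metrics_merged[col] = obs_metrics_merged[col].fillna(metadata[col].unique()[0])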

        # Rename the columns for consistency
        obs_metrics_merged.rename(columns={"Patient_Sample": "disease_progression", "Classified_Celltype": "celltype_from_paper",
                                    "Malignant": "malignant_from_paper", "Biopsy_Origin": "tissue", "Age_Months": "age_months"},
                                    inplace=True)

        anndata.obs = obs_metrics_merged
        anndata.obs.set_index("uni_barcode", drop=False, inplace=True)
        
        # save the anndata object
        anndata.write_h5ad(write_directory / sample_h5ad, compression="gzip")

In [3]:
anndata

AnnData object with n_obs × n_vars = 6327 × 33538
    obs: 'cancer_type', 'dataset', 'sample_barcode', 'uni_barcode', 'disease_progression', 'celltype_from_paper', 'malignant_from_paper', 'tissue', 'age_months'
    var: 'gene_ids'

In [4]:
anndata.obs

uni_barcode (index),cancer_type,dataset,sample_barcode,uni_barcode,disease_progression,celltype_from_paper,malignant_from_paper,tissue,age_months
GSE235063_AAACCCAAGAATCGCG-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_AAACCCAAGAATCGCG-1,Diagnosis,GMP,Malignant,Blood,90
GSE235063_AAACCCAAGGAATGTT-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_AAACCCAAGGAATGTT-1,Diagnosis,Monocytes,Malignant,Blood,90
GSE235063_AAACCCACAAATGAGT-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_AAACCCACAAATGAGT-1,Diagnosis,GMP,Malignant,Blood,90
GSE235063_AAACCCACAAATGCGG-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_AAACCCACAAATGCGG-1,Diagnosis,CD16.Monocytes,Malignant,Blood,90
GSE235063_AAACCCAGTTGCGAAG-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_AAACCCAGTTGCGAAG-1,Diagnosis,GMP,Malignant,Blood,90
...,...,...,...,...,...,...,...,...,...
GSE235063_TTTGTTGAGAACCGCA-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_TTTGTTGAGAACCGCA-1,Diagnosis,GMP,Malignant,Blood,90
GSE235063_TTTGTTGAGGATTCAA-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_TTTGTTGAGGATTCAA-1,Diagnosis,GMP,Malignant,Blood,90
GSE235063_TTTGTTGAGTGAGGTC-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_TTTGTTGAGTGAGGTC-1,Diagnosis,Monocytes,Malignant,Blood,90
GSE235063_TTTGTTGGTGCAGGAT-1,acute_myeloid_leukemia,GSE235063,GSM7494260_AML6_DX_raw,GSE235063_TTTGTTGGTGCAGGAT-1,Diagnosis,GMP,Malignant,Blood,90


In [5]:
anndata.var

gene_symbol (index),gene_ids
MIR1302-2HG,ENSG00000243485
FAM138A,ENSG00000237613
OR4F5,ENSG00000186092
AL627309.1,ENSG00000238009
AL627309.3,ENSG00000239945
...,...
AC233755.2,ENSG00000277856
AC233755.1,ENSG00000275063
AC240274.1,ENSG00000271254
AC213203.1,ENSG00000277475


### 3. Confirmation of created AnnData objects

In [7]:
from pathlib import Path

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE235063')

# Loop through all files in the directory
for file in write_directory.iterdir():
    if '_uni.h5ad' in file.name:
        sample = sc.read_h5ad(file)

        if 'cell_type_from_paper' not in sample.obs.columns:
            sample.obs.rename(columns={'celltype_from_paper': 'cell_type_from_paper'}, inplace=True)
            print(sample)
            
            # save the anndata object
            sample.write_h5ad(file, compression="gzip")

AnnData object with n_obs × n_vars = 5259 × 33538
    obs: 'cancer_type', 'dataset', 'sample_barcode', 'uni_barcode', 'disease_progression', 'cell_type_from_paper', 'malignant_from_paper', 'tissue', 'age_months'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 6327 × 33538
    obs: 'cancer_type', 'dataset', 'sample_barcode', 'uni_barcode', 'disease_progression', 'cell_type_from_paper', 'malignant_from_paper', 'tissue', 'age_months'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 3936 × 33538
    obs: 'cancer_type', 'dataset', 'sample_barcode', 'uni_barcode', 'disease_progression', 'cell_type_from_paper', 'malignant_from_paper', 'tissue', 'age_months'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 6895 × 33538
    obs: 'cancer_type', 'dataset', 'sample_barcode', 'uni_barcode', 'disease_progression', 'cell_type_from_paper', 'malignant_from_paper', 'tissue', 'age_months'
    var: 'gene_ids'
AnnData object with n_obs × n_vars = 5525 × 33538
    obs: 'cancer_type'

In [8]:
sample.var

gene_symbol (index),gene_ids
MIR1302-2HG,ENSG00000243485
FAM138A,ENSG00000237613
OR4F5,ENSG00000186092
AL627309.1,ENSG00000238009
AL627309.3,ENSG00000239945
...,...
AC233755.2,ENSG00000277856
AC233755.1,ENSG00000275063
AC240274.1,ENSG00000271254
AC213203.1,ENSG00000277475


In [9]:
sample.obs

uni_barcode (index),cancer_type,dataset,sample_barcode,uni_barcode,disease_progression,cell_type_from_paper,malignant_from_paper,tissue,age_months
GSE235063_AAACCCACATAGTCAC-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_AAACCCACATAGTCAC-1,Relapse,Progenitor,Malignant,Marrow,155
GSE235063_AAACGAAAGTTGAAAC-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_AAACGAAAGTTGAAAC-1,Relapse,CLP,Malignant,Marrow,155
GSE235063_AAACGAACAACCAGAG-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_AAACGAACAACCAGAG-1,Relapse,CLP,Malignant,Marrow,155
GSE235063_AAACGAACAATAAGGT-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_AAACGAACAATAAGGT-1,Relapse,Progenitor,Malignant,Marrow,155
GSE235063_AAACGAACAATTGGTC-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_AAACGAACAATTGGTC-1,Relapse,CLP,Malignant,Marrow,155
...,...,...,...,...,...,...,...,...,...
GSE235063_TTTGGTTGTCACTTAG-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_TTTGGTTGTCACTTAG-1,Relapse,Pre.B.Cell,Malignant,Marrow,155
GSE235063_TTTGTTGAGGTACAGC-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_TTTGTTGAGGTACAGC-1,Relapse,HSC,Malignant,Marrow,155
GSE235063_TTTGTTGGTATGGGAC-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_TTTGTTGGTATGGGAC-1,Relapse,Pre.B.Cell,Malignant,Marrow,155
GSE235063_TTTGTTGTCGCCAACG-1,acute_myeloid_leukemia,GSE235063,GSM7494258_AML16_REL_raw,GSE235063_TTTGTTGTCGCCAACG-1,Relapse,HSC,Malignant,Marrow,155


In [10]:
np.max(sample.X[:200, :200].toarray())

52.0
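A maximum of 52.0 in this slice is consistent with raw counts, but the maximum alone does not prove it. A slightly stronger check, sketched below on toy arrays, is whether every stored value is a non-negative whole number. The helper `looks_like_raw_counts` is hypothetical; for a SciPy CSR matrix such as `sample.X`, one could pass its `.data` attribute (an assumption about how the matrix is stored here).

```python
import numpy as np


def looks_like_raw_counts(values) -> bool:
    """Return True if all values are non-negative whole numbers."""
    values = np.asarray(values)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy examples: integer-valued counts versus log-transformed values.
counts = np.array([0.0, 3.0, 52.0])
normalized = np.array([0.0, 1.386, 2.197])

print(looks_like_raw_counts(counts))
print(looks_like_raw_counts(normalized))
```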

### 4. Convert AnnData objects to SingleCellExperiment objects

In [17]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE235063')

# Loop through all files in the directory
for file in write_directory.iterdir():
    sample_name = file.stem
    if "_uni.h5ad" in file.name:
        sample_anndata = sc.read_h5ad(file)
        sample_sce_file = sample_name + ".rds"

        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)
        # print(sample_sce)
        
        # Save the sce object in .rds file
        robjects.globalenv["sample_sce"] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))
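After the conversion loop, it can be useful to confirm that every `_uni.h5ad` file has a matching `.rds` file. The sketch below is one way to do this with the standard library only. The helper `h5ad_without_rds` is hypothetical, and the demo runs on a temporary directory with empty placeholder files rather than on the real output directory.

```python
import tempfile
from pathlib import Path


def h5ad_without_rds(write_directory: Path) -> list:
    """Return names of '_uni.h5ad' files that have no sibling '.rds' file."""
    return sorted(
        file.name
        for file in write_directory.iterdir()
        if file.name.endswith("_uni.h5ad") and not file.with_suffix(".rds").exists()
    )


# Demo with placeholder files: 'b_uni.h5ad' has no converted counterpart.
with tempfile.TemporaryDirectory() as tmp:
    demo_directory = Path(tmp)
    for name in ("a_uni.h5ad", "a_uni.rds", "b_uni.h5ad"):
        (demo_directory / name).touch()
    missing = h5ad_without_rds(demo_directory)

print(missing)
```

An empty list would indicate that every sample was converted.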