### 1. General info of dataset GSE235063

This Jupyter Notebook processes dataset GSE235063, which provides barcodes/genes/matrix files for each sample.

We therefore need to read these barcodes/genes/matrix files and build one AnnData object per sample.

In total, there are 75 acute myeloid leukemia (AML) samples. The dataset includes both raw and processed data for each sample, giving 150 sample directories in total.

<span style="color:green">**[DX]**</span> samples from diagnosis

<span style="color:green">**[REM]**</span> samples from remission

<span style="color:green">**[REL]**</span> samples from relapse 
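
The sample directory names printed later in this notebook (for example `GSM7494323_AML19_REL_processed`) appear to encode the GEO accession, the patient, the time point, and whether the data are raw or processed. Below is a minimal sketch of parsing them, assuming every directory follows that pattern. `SAMPLE_RE` and `parse_sample_name` are illustrative names, not part of the notebook.

```python
import re
from collections import Counter

# Pattern inferred from directory names printed in this notebook,
# e.g. "GSM7494323_AML19_REL_processed" (an assumption, not a GEO guarantee).
SAMPLE_RE = re.compile(
    r"^(?P<gsm>GSM\d+)_(?P<patient>AML\d+)_(?P<timepoint>DX|REM|REL)_(?P<kind>raw|processed)$"
)

def parse_sample_name(name):
    """Split a sample directory name into its parts, or return None."""
    match = SAMPLE_RE.match(name)
    return match.groupdict() if match else None

# Three names taken from the output of the raw-data loop.
names = [
    "GSM7494323_AML19_REL_processed",
    "GSM7494299_AML4_REL_raw",
    "GSM7494298_AML4_DX_raw",
]
parsed = [parse_sample_name(n) for n in names]
timepoint_counts = Counter(p["timepoint"] for p in parsed if p)
print(parsed[0])
print(timepoint_counts)
```

Names that do not match return `None`, which makes unexpected directories easy to spot before any files are read.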

In [2]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy
import os

### 2. AnnData object of each sample

<span style="color:red">**IMPORTANT:**</span> rename the files in each sample directory to remove their prefixes, so that each directory contains exactly these three files:

1. `barcodes.tsv`: cell barcodes, which go into `.obs`
2. `genes.tsv`: gene identifiers and gene symbols, which go into `.var`
3. `matrix.mtx`: the expression matrix, which goes into `.X`
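
The renaming step can be scripted. Below is a minimal sketch, assuming each downloaded file ends with the expected name and differs only by a leading prefix. The prefix used in the demo is made up, and `strip_prefixes` is a hypothetical helper.

```python
import os
import tempfile

EXPECTED = ("barcodes.tsv", "genes.tsv", "matrix.mtx")

def strip_prefixes(sample_dir):
    """Rename '<prefix>barcodes.tsv' etc. to the bare file names."""
    renamed = []
    for fname in sorted(os.listdir(sample_dir)):
        for target in EXPECTED:
            # Only touch files that end with an expected name but carry a prefix.
            if fname != target and fname.endswith(target):
                os.rename(os.path.join(sample_dir, fname),
                          os.path.join(sample_dir, target))
                renamed.append((fname, target))
    return renamed

# Demo on a throwaway directory with a hypothetical prefix.
with tempfile.TemporaryDirectory() as demo_dir:
    for suffix in EXPECTED:
        open(os.path.join(demo_dir, "GSM0000000_demo_" + suffix), "w").close()
    renamed = strip_prefixes(demo_dir)
    final_names = sorted(os.listdir(demo_dir))
print(final_names)
```

Depending on the installed scanpy version, `sc.read_10x_mtx` may also accept a `prefix` argument that makes renaming unnecessary; check its documentation before relying on it.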

<span style="color:red">**Problem:**</span> the genes.tsv files of the processed dataset lack the gene identifier column (Ensembl IDs such as ENSG00000268674) and contain gene symbols only
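
A quick way to see which genes.tsv files are affected is to count the tab-separated columns per row. The sketch below uses only the standard library and two tiny demo files whose rows are copied from the outputs in this notebook; `genes_tsv_columns` is a hypothetical helper.

```python
import csv
import os
import tempfile

def genes_tsv_columns(path):
    """Return the set of column counts found in a genes.tsv file."""
    with open(path, newline="") as handle:
        return {len(row) for row in csv.reader(handle, delimiter="\t") if row}

with tempfile.TemporaryDirectory() as demo_dir:
    raw_path = os.path.join(demo_dir, "raw_genes.tsv")
    processed_path = os.path.join(demo_dir, "processed_genes.tsv")
    # Raw layout: Ensembl ID, then gene symbol.
    with open(raw_path, "w") as handle:
        handle.write("ENSG00000238009\tAL627309.1\nENSG00000268674\tFAM231C\n")
    # Processed layout: gene symbol only.
    with open(processed_path, "w") as handle:
        handle.write("AL627309.1\nFAM87B\n")
    raw_cols = genes_tsv_columns(raw_path)
    processed_cols = genes_tsv_columns(processed_path)
print(raw_cols, processed_cols)
```

A result of `{2}` indicates the complete layout, while `{1}` indicates a file with symbols only.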

In [13]:
general_input_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/data/GSE235063'
general_output_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE235063'
count = 0

for sample in os.listdir(general_input_path):
    print("I am currently processing: ", sample)
    if 'processed' not in sample:
        sample_dir = os.path.join(general_input_path, sample)
        output_path = os.path.join(general_output_path, sample + '.h5ad')

        # read the barcodes/genes/matrix files of one raw sample;
        # use a name other than `anndata` so the imported module is not shadowed
        adata = sc.read_10x_mtx(
            sample_dir,
            var_names='gene_symbols',
            cache=False
        )
        print(adata)

        # save the AnnData object
        adata.write_h5ad(output_path, compression="gzip")

    else:
        # processed samples are handled separately in section 3
        count += 1
        print('Skipped samples:', str(count))

I am currently processing:  GSM7494323_AML19_REL_processed
Skipped samples: 1
I am currently processing:  GSM7494299_AML4_REL_raw
AnnData object with n_obs × n_vars = 6794880 × 33538
    var: 'gene_ids'
I am currently processing:  GSM7494294_AML10_REM_processed
Skipped samples: 2
I am currently processing:  GSM7494278_AML20_REM_processed
Skipped samples: 3
I am currently processing:  GSM7494302_AML22_REM_processed
Skipped samples: 4
I am currently processing:  GSM7494298_AML4_DX_processed
Skipped samples: 5
I am currently processing:  GSM7494281_AML5_REM_processed
Skipped samples: 6
I am currently processing:  GSM7494298_AML4_DX_raw
AnnData object with n_obs × n_vars = 6794880 × 33538
    var: 'gene_ids'
I am currently processing:  GSM7494322_AML19_DX_processed
Skipped samples: 7
I am currently processing:  GSM7494272_AML7_REL_raw
AnnData object with n_obs × n_vars = 6794880 × 33538
    var: 'gene_ids'
I am currently processing:  GSM7494305_AML21_REM_processed
Skipped samples: 8
I am c

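Every AnnData object created from the raw data has the same dimensions:

`AnnData object with n_obs × n_vars = 6794880 × 33538`

The identical number of observations across samples suggests that the raw matrices list every possible barcode rather than only the barcodes called as cells.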
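
One way to probe why every raw sample reports 6794880 observations is to count how many barcodes actually carry counts. The sketch below streams a Matrix Market file with the standard library only, assuming the usual 10x layout in which rows are genes and columns are barcodes. The demo matrix is made up and `count_nonempty_barcodes` is a hypothetical helper.

```python
import io

def count_nonempty_barcodes(handle):
    """Return (n_genes, n_barcodes, n_nonempty) for a matrix.mtx stream."""
    size = None
    seen = set()
    for line in handle:
        if line.startswith("%") or not line.strip():
            continue  # header and comment lines
        fields = line.split()
        if size is None:
            size = (int(fields[0]), int(fields[1]))  # rows, columns
            continue
        if float(fields[2]) != 0:
            seen.add(int(fields[1]))  # column index identifies the barcode
    n_genes, n_barcodes = size
    return n_genes, n_barcodes, len(seen)

# Made-up matrix: 3 genes, 5 barcodes, 4 stored entries.
demo = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 5 4\n"
    "1 2 7\n"
    "3 2 1\n"
    "2 4 3\n"
    "1 4 2\n"
)
result = count_nonempty_barcodes(demo)
print(result)
```

On a real sample, pass `open(os.path.join(sample_dir, 'matrix.mtx'))` instead of the demo stream. A non-empty count far below the total would support the idea that most barcodes in the raw matrices are empty.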

### 3. AnnData object of each sample using the processed dataset

<span style="color:red">**Attempt:**</span> 

1. ~~add the "gene_identifier" info into the processed genes.tsv files~~ --> in the complete genes.tsv files from the raw dataset, some Ensembl gene IDs point to the same gene symbol, so these gene symbols cannot be assigned a single gene ID unambiguously
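
The ambiguity can be illustrated by grouping Ensembl IDs by symbol. The sketch below uses a handful of (ID, symbol) pairs that appear in the raw genes.tsv output further down in this notebook; the variable names are illustrative.

```python
from collections import defaultdict

# (Ensembl ID, gene symbol) pairs copied from the raw genes.tsv output.
pairs = [
    ("ENSG00000143248", "RGS5"),
    ("ENSG00000232995", "RGS5"),
    ("ENSG00000285053", "TBCE"),
    ("ENSG00000284770", "TBCE"),
    ("ENSG00000238009", "AL627309.1"),
]

ids_by_symbol = defaultdict(list)
for gene_id, symbol in pairs:
    ids_by_symbol[symbol].append(gene_id)

# Symbols with more than one ID cannot be mapped back unambiguously.
ambiguous = {s: ids for s, ids in ids_by_symbol.items() if len(ids) > 1}
unambiguous = {s: ids[0] for s, ids in ids_by_symbol.items() if len(ids) == 1}
print(ambiguous)
print(unambiguous)
```

Joining a symbol-only table against such a mapping yields one row per matching ID, so ambiguous symbols appear more than once in the merged result.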

In [17]:
general_input_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/data/GSE235063'

for sample in os.listdir(general_input_path):
    if 'processed' in sample:
        sample_name = sample.split('_processed')[0]
        
        # the path of the incomplete genes.tsv from the processed dataset
        incomplete_path = general_input_path + '/' + sample + '/genes.tsv'
        
        # the path of the complete genes.tsv from the raw dataset
        complete_path = general_input_path + '/' + sample_name + '_raw/genes.tsv'
        
        # Read the complete TSV file (with both gene identifier and symbol)
        complete_df = pd.read_csv(complete_path, sep='\t', header=None, names=['gene_identifier', 'gene_symbols'])
        print(complete_df) # it has 33538 rows
        
        # Read the incomplete TSV file (with only gene symbol)
        incomplete_df = pd.read_csv(incomplete_path, sep='\t', header=None, names=['gene_symbols'])
        print(incomplete_df) # it has 21967 rows
        
        # Merge the DataFrames based on the gene symbol
        merged_df = pd.merge(incomplete_df, complete_df, on='gene_symbols', how='inner')
        print(merged_df) # it has 21968 rows
        
        # # Save the merged DataFrame back to a TSV file
        # merged_df.to_csv('merged.tsv', sep='\t', index=False)
        break

       gene_identifier gene_symbols
0      ENSG00000243485  MIR1302-2HG
1      ENSG00000237613      FAM138A
2      ENSG00000186092        OR4F5
3      ENSG00000238009   AL627309.1
4      ENSG00000239945   AL627309.3
...                ...          ...
33533  ENSG00000277856   AC233755.2
33534  ENSG00000275063   AC233755.1
33535  ENSG00000271254   AC240274.1
33536  ENSG00000277475   AC213203.1
33537  ENSG00000268674      FAM231C

[33538 rows x 2 columns]
      gene_symbols
0       AL627309.1
1       AL669831.2
2       AL669831.5
3           FAM87B
4        LINC00115
...            ...
21962   AL354822.1
21963   AC023491.2
21964   AC004556.1
21965   AC233755.2
21966   AC240274.1

[21967 rows x 1 columns]
      gene_symbols  gene_identifier
0       AL627309.1  ENSG00000238009
1       AL669831.2  ENSG00000229905
2       AL669831.5  ENSG00000237491
3           FAM87B  ENSG00000177757
4        LINC00115  ENSG00000225880
...            ...              ...
21963   AL354822.1  ENSG00000278384


In [15]:
duplicate_rows = incomplete_df[incomplete_df.duplicated(subset=['gene_symbols'], keep=False)]
print(duplicate_rows)


Empty DataFrame
Columns: [gene_symbols]
Index: []


In [16]:
duplicate_rows = complete_df[complete_df.duplicated(subset=['gene_symbols'], keep=False)]
print(duplicate_rows)


       gene_identifier gene_symbols
2230   ENSG00000143248         RGS5
2232   ENSG00000232995         RGS5
2997   ENSG00000285053         TBCE
2999   ENSG00000284770         TBCE
4798   ENSG00000128655       PDE11A
4799   ENSG00000284741       PDE11A
5435   ENSG00000237940    LINC01238
5438   ENSG00000261186    LINC01238
5832   ENSG00000283706       PRSS50
5833   ENSG00000206549       PRSS50
5949   ENSG00000114395     CYB561D2
5953   ENSG00000271858     CYB561D2
6093   ENSG00000285258        ATXN7
6094   ENSG00000163635        ATXN7
6461   ENSG00000283374     TXNRD3NB
6462   ENSG00000206483     TXNRD3NB
6915   ENSG00000284862       CCDC39
6916   ENSG00000145075       CCDC39
9595   ENSG00000280987        MATR3
9597   ENSG00000015479        MATR3
11683  ENSG00000112096         SOD2
11684  ENSG00000285441         SOD2
12718  ENSG00000168255      POLR2J3
12722  ENSG00000285437      POLR2J3
13263  ENSG00000285292        ABCF2
13264  ENSG00000033050        ABCF2
14005  ENSG00000158427      

In [18]:
duplicate_rows = merged_df[merged_df.duplicated(subset=['gene_symbols'], keep=False)]
print(duplicate_rows)

      gene_symbols  gene_identifier
1590          RGS5  ENSG00000143248
1591          RGS5  ENSG00000232995
2097          TBCE  ENSG00000285053
2098          TBCE  ENSG00000284770
3222        PDE11A  ENSG00000128655
3223        PDE11A  ENSG00000284741
3651     LINC01238  ENSG00000237940
3652     LINC01238  ENSG00000261186
4068      CYB561D2  ENSG00000114395
4069      CYB561D2  ENSG00000271858
4170         ATXN7  ENSG00000285258
4171         ATXN7  ENSG00000163635
4701        CCDC39  ENSG00000284862
4702        CCDC39  ENSG00000145075
6256         MATR3  ENSG00000280987
6257         MATR3  ENSG00000015479
7678          SOD2  ENSG00000112096
7679          SOD2  ENSG00000285441
8356       POLR2J3  ENSG00000168255
8357       POLR2J3  ENSG00000285437
8696         ABCF2  ENSG00000285292
8697         ABCF2  ENSG00000033050
9173       TMSB15B  ENSG00000158427
9174       TMSB15B  ENSG00000269226
10626    LINC01505  ENSG00000234323
10627    LINC01505  ENSG00000234229
12282       HSPA14  ENSG0000

<span style="color:red">**Attempt:**</span> 

2. duplicate the gene symbol column in the genes.tsv files of the processed dataset, so that each file has the two columns that `sc.read_10x_mtx` expects
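
Because this attempt edits genes.tsv, a variant that writes to a new file and leaves two-column rows untouched is safer to rerun. This is a minimal sketch with the standard library, not the code used in this notebook; `write_two_column_genes` and the demo file names are hypothetical.

```python
import csv
import os
import tempfile

def write_two_column_genes(src_path, dst_path):
    """Copy genes.tsv, repeating the symbol when a row has a single column."""
    n_padded = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in csv.reader(src, delimiter="\t"):
            if not row:
                continue
            if len(row) == 1:
                row = [row[0], row[0]]  # the symbol stands in for the missing ID
                n_padded += 1
            writer.writerow(row)
    return n_padded

with tempfile.TemporaryDirectory() as demo_dir:
    src = os.path.join(demo_dir, "genes.tsv")
    dst = os.path.join(demo_dir, "genes_two_columns.tsv")
    again = os.path.join(demo_dir, "genes_again.tsv")
    with open(src, "w") as handle:
        handle.write("AL627309.1\nFAM87B\n")
    first_pass = write_two_column_genes(src, dst)
    # Feeding the result back in pads nothing, so reruns are harmless.
    second_pass = write_two_column_genes(dst, again)
    with open(again) as handle:
        lines = handle.read().splitlines()
print(first_pass, second_pass, lines)
```

The original file stays intact, which keeps the option of trying a different mapping later.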

In [39]:
general_input_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/data/GSE235063'
general_output_path = '/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE235063'

for sample in os.listdir(general_input_path):
    if 'processed' in sample:
        print(sample)
        sample_path = os.path.join(general_input_path, sample)
        output_path = os.path.join(general_output_path, sample + '.h5ad')

        # genes.tsv of the processed dataset has a single column of gene symbols;
        # copy it into a second column so that sc.read_10x_mtx finds the two
        # columns it expects (NOTE: this overwrites genes.tsv in place)
        file_path = os.path.join(sample_path, 'genes.tsv')
        file_df = pd.read_csv(file_path, sep='\t', header=None, names=['gene_identifier'])
        file_df['gene_symbols'] = file_df['gene_identifier']
        file_df.to_csv(file_path, sep='\t', index=False, header=False)

        # use a name other than `anndata` so the imported module is not shadowed
        adata = sc.read_10x_mtx(
            sample_path,
            var_names='gene_symbols',
            cache=False
        )
        print(adata)

        adata.write_h5ad(output_path, compression="gzip")

GSM7494323_AML19_REL_processed


AnnData object with n_obs × n_vars = 1736 × 21967
    var: 'gene_ids'
GSM7494294_AML10_REM_processed
AnnData object with n_obs × n_vars = 2695 × 23539
    var: 'gene_ids'
GSM7494278_AML20_REM_processed
AnnData object with n_obs × n_vars = 1990 × 21944
    var: 'gene_ids'
GSM7494302_AML22_REM_processed
AnnData object with n_obs × n_vars = 3029 × 22186
    var: 'gene_ids'
GSM7494298_AML4_DX_processed
AnnData object with n_obs × n_vars = 4731 × 23034
    var: 'gene_ids'
GSM7494281_AML5_REM_processed
AnnData object with n_obs × n_vars = 4008 × 23088
    var: 'gene_ids'
GSM7494322_AML19_DX_processed
AnnData object with n_obs × n_vars = 5179 × 21967
    var: 'gene_ids'
GSM7494305_AML21_REM_processed
AnnData object with n_obs × n_vars = 2532 × 22548
    var: 'gene_ids'
GSM7494319_AML26_DX_processed
AnnData object with n_obs × n_vars = 7915 × 23612
    var: 'gene_ids'
GSM7494282_AML17_DX_processed
AnnData object with n_obs × n_vars = 6895 × 21879
    var: 'gene_ids'
GSM7494266_AML15_DX_process

### 4. Confirmation of created AnnData object

In [37]:
output = '/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE235063/GSM7494323_AML19_REL_processed.h5ad'
# read the saved file back to confirm it was written correctly
sample = sc.read_h5ad(output)
print(sample)

AnnData object with n_obs × n_vars = 1736 × 21967
    var: 'gene_ids'
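
To confirm the dimensions independently of scanpy, the expected shape can be derived from the line counts of barcodes.tsv and genes.tsv. This is a minimal sketch using the standard library; `expected_shape` is a hypothetical helper and the demo barcodes are made up.

```python
import os
import tempfile

def expected_shape(sample_dir):
    """Return (n_obs, n_vars) implied by barcodes.tsv and genes.tsv."""
    def count_lines(path):
        with open(path) as handle:
            return sum(1 for line in handle if line.strip())
    n_obs = count_lines(os.path.join(sample_dir, "barcodes.tsv"))
    n_vars = count_lines(os.path.join(sample_dir, "genes.tsv"))
    return n_obs, n_vars

with tempfile.TemporaryDirectory() as demo_dir:
    # Made-up barcodes; gene rows follow the two-column layout.
    with open(os.path.join(demo_dir, "barcodes.tsv"), "w") as handle:
        handle.write("AAAC-1\nAAAG-1\nAAAT-1\n")
    with open(os.path.join(demo_dir, "genes.tsv"), "w") as handle:
        handle.write("AL627309.1\tAL627309.1\nFAM87B\tFAM87B\n")
    shape = expected_shape(demo_dir)
print(shape)
```

In the notebook, the returned tuple can be compared with `sample.shape` for the corresponding sample directory.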
