### 1. General information about dataset GSE89567

This Jupyter notebook processes dataset GSE89567. The dataset is distributed as a single large tab-separated text file in which, as shown below, each row is a gene and each column is a cell.

We therefore need to transpose this matrix and build a single AnnData object covering all samples.

In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy.sparse  # import the submodule explicitly; it is used below as scipy.sparse.csr_matrix

In [3]:
# inspect the dataset
path = '/scratch/user/s4543064/xiaohan-john-project/data/GSE89567/GSE89567_IDH_A_processed_data.txt'
input = pd.read_csv(path, sep='\t', index_col=0) # the first column contains gene names and is the index

print(input.head()) 
print(input.shape) # (23686, 6341)

            MGH42_P7_A01  MGH42_P7_A02  MGH42_P7_A03  MGH42_P7_A04  \
'A1BG'            1.1928      0.000000       0.00000        0.0000   
'A1BG-AS1'        0.0000      0.000000       0.00000        0.0000   
'A1CF'            0.0000      0.094912       0.00000        0.0000   
'A2M'             7.0439      7.609500       0.77062        7.6146   
'A2M-AS1'         0.0000      0.000000       0.00000        0.0000   

            MGH42_P7_A05  MGH42_P7_A07  MGH42_P7_A09  MGH42_P7_A11  \
'A1BG'            0.0000       0.66903       0.00000        0.0000   
'A1BG-AS1'        0.0000       0.00000       0.00000        0.0000   
'A1CF'            0.0000       0.00000       0.00000        0.0000   
'A2M'             0.0000       0.00000       0.27501        8.1624   
'A2M-AS1'         2.0339       2.39420       0.00000        0.0000   

            MGH42_P7_A12  MGH42_P7_B02  ...  MGH107neg_P2_E06  \
'A1BG'            0.0000        0.0000  ...               0.0   
'A1BG-AS1'        0.0000    

<span style="color:red">**PROBLEM:**</span> the gene names are stored as `'GENE_SYMBOL'`, i.e. each symbol is wrapped in literal single quotation marks.
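To make the issue concrete, here is a minimal sketch on a toy DataFrame. The names `toy`, `cell_1` and `cell_2` and all values are made up for illustration. It shows the quoted index and one defensive way to clean it with `Index.str.strip`, which removes only leading and trailing single quotes and leaves already-clean symbols untouched. The notebook's own fix follows in the next cell.

```python
import pandas as pd

# Toy stand-in for the real matrix: two gene symbols are wrapped in
# single quotes, one is already clean (all values are made up).
toy = pd.DataFrame(
    {"cell_1": [1.0, 0.0, 2.5], "cell_2": [0.0, 0.3, 0.0]},
    index=["'A1BG'", "'A1BG-AS1'", "A2M"],
)

# Strip only leading/trailing single quotes from each symbol.
toy.index = toy.index.str.strip("'")

cleaned_symbols = list(toy.index)
print(cleaned_symbols)
```

Unlike slicing with `[1:-1]`, this variant does not assume that every symbol is quoted, which may matter if the input file mixes quoted and unquoted names.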

In [4]:
# Get rid of the extra quotation marks for gene symbols
input.index = [gene[1:-1] for gene in input.index]
print(input.head()) 

          MGH42_P7_A01  MGH42_P7_A02  MGH42_P7_A03  MGH42_P7_A04  \
A1BG            1.1928      0.000000       0.00000        0.0000   
A1BG-AS1        0.0000      0.000000       0.00000        0.0000   
A1CF            0.0000      0.094912       0.00000        0.0000   
A2M             7.0439      7.609500       0.77062        7.6146   
A2M-AS1         0.0000      0.000000       0.00000        0.0000   

          MGH42_P7_A05  MGH42_P7_A07  MGH42_P7_A09  MGH42_P7_A11  \
A1BG            0.0000       0.66903       0.00000        0.0000   
A1BG-AS1        0.0000       0.00000       0.00000        0.0000   
A1CF            0.0000       0.00000       0.00000        0.0000   
A2M             0.0000       0.00000       0.27501        8.1624   
A2M-AS1         2.0339       2.39420       0.00000        0.0000   

          MGH42_P7_A12  MGH42_P7_B02  ...  MGH107neg_P2_E06  MGH107pos_P2_B03  \
A1BG            0.0000        0.0000  ...               0.0            0.0000   
A1BG-AS1        0.00

As shown above, the dataset contains 6341 cells and 23686 genes.

### 2. Overall AnnData object of the dataset

<span style="color:red">**IMPORTANT:**</span> transpose `DataFrame.values` so that it matches the cells × genes layout of `AnnData.X`.

1. `DataFrame.columns`: cell barcodes, which become the index of `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the expression matrix (genes × cells), whose transpose becomes `.X`

In [5]:
# Transpose genes x cells -> cells x genes and store as a sparse matrix
matrix = scipy.sparse.csr_matrix(input.values.T)
obs_name = pd.DataFrame(index=input.columns)
# Gene symbols are kept as a column here; Section 3 moves them to the index of .var
var_name = pd.DataFrame(input.index, columns=['gene_symbols'])

sample = anndata.AnnData(X=matrix, obs=obs_name, var=var_name)

# Build the per-cell metadata table, indexed by the original cell barcodes
obs_metrics = pd.DataFrame(index=sample.obs_names)

obs_metrics['cancer_type'] = 'IDH-Mutation_glioma'
obs_metrics['dataset'] = 'GSE89567'
obs_metrics['tissue'] = 'brain'
obs_metrics['sample_barcode'] = 'GSE89567'
obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)

sample.obs = obs_metrics
sample.obs.set_index("uni_barcode", drop=False, inplace=True)
print(sample)

# save the anndata object
sample.write_h5ad('/scratch/user/s4543064/xiaohan-john-project/write/GSE89567/GSE89567_IDH_A_processed_data_uni.h5ad', compression="gzip")



AnnData object with n_obs × n_vars = 6341 × 23686
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbols'


### 3. Confirmation of created AnnData object

In [4]:
output = '/scratch/user/s4543064/xiaohan-john-project/write/GSE89567/GSE89567_IDH_A_processed_data_uni.h5ad'
sample = anndata.read_h5ad(output)
sample.var.set_index('gene_symbols', drop=True, inplace=True)
print(sample)

AnnData object with n_obs × n_vars = 6341 × 23686
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'


In [7]:
sample.write_h5ad(output, compression="gzip")

In [5]:
sample.var

A1BG
A1BG-AS1
A1CF
A2M
A2M-AS1
...
ZYG11A
ZYG11B
ZYX
ZZEF1
ZZZ3


In [6]:
sample.obs

| uni_barcode (index) | cancer_type | dataset | tissue | sample_barcode | uni_barcode |
|---|---|---|---|---|---|
| GSE89567_MGH42_P7_A01 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH42_P7_A01 |
| GSE89567_MGH42_P7_A02 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH42_P7_A02 |
| GSE89567_MGH42_P7_A03 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH42_P7_A03 |
| GSE89567_MGH42_P7_A04 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH42_P7_A04 |
| GSE89567_MGH42_P7_A05 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH42_P7_A05 |
| ... | ... | ... | ... | ... | ... |
| GSE89567_MGH107neg_P2_C05 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH107neg_P2_C05 |
| GSE89567_MGH107pos_P2_D07 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH107pos_P2_D07 |
| GSE89567_MGH107neg_P1_E01 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH107neg_P1_E01 |
| GSE89567_MGH107pos_P2_G09 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567 | GSE89567_MGH107pos_P2_G09 |


### 4. Convert AnnData objects to SingleCellExperiment objects

In [8]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects.conversion import localconverter

# Directory holding the .h5ad files written above
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE89567')

# Convert every "*_uni.h5ad" file in the directory to a SingleCellExperiment
for file in write_directory.iterdir():
    if not file.name.endswith("_uni.h5ad"):
        continue

    sample_name = file.stem
    sample_anndata = anndata.read_h5ad(file)
    sample_sce_file = sample_name + ".rds"

    # Convert AnnData -> SingleCellExperiment via anndata2ri
    with localconverter(anndata2ri.converter):
        sample_sce = anndata2ri.py2rpy(sample_anndata)
    print(sample_sce)

    # Save the SCE object as an .rds file next to the .h5ad file
    robjects.globalenv["sample_sce"] = sample_sce
    sample_sce_path = write_directory / sample_sce_file
    robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))

class: SingleCellExperiment 
dim: 23686 6341 
metadata(0):
assays(1): X
rownames(23686): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(6341): GSE89567_MGH42_P7_A01 GSE89567_MGH42_P7_A02 ...
  GSE89567_MGH107pos_P2_G09 GSE89567_MGH107neg_P1_D06
colData names(5): cancer_type dataset tissue sample_barcode uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
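
The file-selection step of the loop above can be exercised without R. The sketch below is illustrative only: `rds_targets` is a hypothetical helper and the file names are made up. It uses `Path.glob` to pick files ending in `_uni.h5ad` and derives the matching `.rds` name, so that `.rds` files left over from an earlier run are not picked up again.

```python
import tempfile
from pathlib import Path


def rds_targets(directory: Path) -> dict:
    """Map each '*_uni.h5ad' file name to the '.rds' name derived from it."""
    return {
        h5ad.name: h5ad.with_suffix(".rds").name
        for h5ad in sorted(directory.glob("*_uni.h5ad"))
    }


# Demonstrate on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in ["demo_uni.h5ad", "demo_uni.rds", "notes.txt"]:
        (tmp_dir / name).touch()
    targets = rds_targets(tmp_dir)

print(targets)
```

Matching on the file-name ending rather than on a substring keeps the selection strict, which seems the safer choice when input and output files share one directory.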

