### 1. General information on dataset GSE102130

This notebook processes dataset GSE102130. The dataset ships as a single large tab-separated text file in which each row is a gene and each column is a cell, as the preview below shows.

AnnData stores cells as rows and genes as columns, so the table has to be transposed before building one AnnData object that covers all samples.



In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy.sparse  # import the submodule explicitly; scipy.sparse.csr_matrix is used later

In [5]:
# Inspect the dataset
path = '/scratch/user/s4543064/Xiaohan_Summer_Research/data/GSE102130/GSE102130_K27Mproject.RSEM.vh20170621.txt'
# The first column holds gene names and becomes the index.
# Note: the name `input` shadows Python's built-in input(); it is kept because later cells use it.
input = pd.read_csv(path, sep='\t', index_col=0)

print(input.head())
print(input.shape)  # (23686, 4058): 23686 genes (rows) x 4058 cells (columns)

          MUV1-P04-B12  MUV1-P04-C08  MUV1-P04-D09  MUV1-P04-D10  \
Gene                                                               
A1BG               0.0           0.0           0.0           0.0   
A1BG-AS1           0.0           0.0           0.0           0.0   
A1CF               0.0           0.0           0.0           0.0   
A2M                0.0           0.0           0.0           0.0   
A2M-AS1            0.0           0.0           0.0           0.0   

          MUV1-P04-E03  MUV1-P04-E07  MUV1-P04-E08  MUV1-P04-E10  \
Gene                                                               
A1BG              0.00          0.00          0.00          0.00   
A1BG-AS1          0.00          0.00          0.00          0.00   
A1CF              0.00          0.00          0.53          0.34   
A2M             348.48        362.08          0.00          0.00   
A2M-AS1           0.00          1.19          0.00          0.00   

          MUV1-P04-E11  MUV1-P04-F05  ...  Oli

As shown above, the dataset contains 4058 cells (columns) and 23686 genes (rows).
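
The orientation can also be checked programmatically before any conversion. The sketch below is only an illustration: it reads a tiny made-up table from memory instead of the real file (gene and cell names are borrowed from the preview, the values are invented), and the helper name `check_genes_by_cells` is hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the real tab-separated file: rows are genes, columns are cells.
toy_tsv = (
    "Gene\tMUV1-P04-B12\tMUV1-P04-C08\tMUV1-P04-D09\n"
    "A1BG\t0.0\t0.0\t0.0\n"
    "A1CF\t0.0\t0.53\t0.34\n"
    "A2M\t348.48\t362.08\t0.0\n"
    "A2M-AS1\t0.0\t1.19\t0.0\n"
)


def check_genes_by_cells(df):
    """Return (n_genes, n_cells) after two basic sanity checks."""
    # Gene names are the row index and should not repeat.
    if not df.index.is_unique:
        raise ValueError("duplicate gene names in the index")
    # Every column should be a numeric expression vector for one cell.
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")
    return df.shape


toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", index_col=0)
n_genes, n_cells = check_genes_by_cells(toy)
print(f"{n_genes} genes x {n_cells} cells")
```

Running the same helper on the real DataFrame would be expected to report the shape printed above, assuming the file loads as shown.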

### 2. Overall AnnData object of the dataset

<span style="color:red">**IMPORTANT:**</span> `DataFrame.values` is genes × cells, whereas `AnnData.X` is cells × genes, so the matrix must be transposed first.

1. `DataFrame.columns`: cell barcodes, which go into `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the expression matrix, whose transpose goes into `.X`
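
The mapping above can be tried on a toy table first. This is a minimal sketch with invented values and placeholder cell names (`cell_1` and so on); it stops at the sparse matrix and the two label lists, which are the pieces that would then be passed on to AnnData.

```python
import pandas as pd
import scipy.sparse

# Toy genes x cells table (values invented for illustration).
df = pd.DataFrame(
    [[0.0, 0.0, 0.0],
     [348.48, 362.08, 0.0]],
    index=pd.Index(["A1BG", "A2M"], name="Gene"),
    columns=["cell_1", "cell_2", "cell_3"],
)

# Transpose so that rows are cells and columns are genes, then store sparsely.
X = scipy.sparse.csr_matrix(df.values.T)

cell_ids = list(df.columns)  # one label per row of X
gene_ids = list(df.index)    # one label per column of X

print(X.shape)  # (n_cells, n_genes)

# Expression of A2M in cell_2 sits at [row of cell_2, column of A2M].
value = X[cell_ids.index("cell_2"), gene_ids.index("A2M")]
print(value)
```

Only the non-zero entries are stored in the CSR matrix, which is why the sparse format suits a table that is mostly zeros.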

In [10]:
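# Transpose to cells x genes and store the result as a sparse matrix
matrix = scipy.sparse.csr_matrix(input.values.T)
obs_name = pd.DataFrame(index=input.columns)  # cell barcodes become the obs index
# Note: this keeps a positional (0..n-1) var index; the gene symbols live in the 'gene_symbols' column
var_name = pd.DataFrame(input.index)
var_name.rename(columns={'Gene': 'gene_symbols'}, inplace=True)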

sample = anndata.AnnData(X=matrix, obs=obs_name, var=var_name)
print(sample)

# Build the per-cell metadata table, one row per cell barcode
obs_metrics = pd.DataFrame(index=sample.obs_names)  # reuse the existing cell identifiers
obs_metrics['cancer_type'] = 'H3K27M-glioma'
obs_metrics['dataset'] = 'GSE102130'
obs_metrics['tissue'] = 'brain'
# Prefix each barcode with the dataset ID
obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)

# Attach the metadata and use the prefixed barcode as the obs index (also kept as a column)
sample.obs = obs_metrics
sample.obs.set_index("uni_barcode", drop=False, inplace=True)
print(sample)

# save the anndata object
sample.write_h5ad('/scratch/user/s4543064/xiaohan-john-project/write/GSE102130/GSE102130_K27Mproject.RSEM.vh20170621_uni.h5ad', compression="gzip")



AnnData object with n_obs × n_vars = 4058 × 23686
    var: 'gene_symbols'
AnnData object with n_obs × n_vars = 4058 × 23686
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode'
    var: 'gene_symbols'


In [11]:
sample.obs

uni_barcode,cancer_type,dataset,tissue,uni_barcode
GSE102130_MUV1-P04-B12,H3K27M-glioma,GSE102130,brain,GSE102130_MUV1-P04-B12
GSE102130_MUV1-P04-C08,H3K27M-glioma,GSE102130,brain,GSE102130_MUV1-P04-C08
GSE102130_MUV1-P04-D09,H3K27M-glioma,GSE102130,brain,GSE102130_MUV1-P04-D09
GSE102130_MUV1-P04-D10,H3K27M-glioma,GSE102130,brain,GSE102130_MUV1-P04-D10
GSE102130_MUV1-P04-E03,H3K27M-glioma,GSE102130,brain,GSE102130_MUV1-P04-E03
...,...,...,...,...
GSE102130_Oligo-P22-H03,H3K27M-glioma,GSE102130,brain,GSE102130_Oligo-P22-H03
GSE102130_Oligo-P22-H05,H3K27M-glioma,GSE102130,brain,GSE102130_Oligo-P22-H05
GSE102130_Oligo-P22-H06,H3K27M-glioma,GSE102130,brain,GSE102130_Oligo-P22-H06
GSE102130_Oligo-P22-H08,H3K27M-glioma,GSE102130,brain,GSE102130_Oligo-P22-H08
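
The prefixing step that produces the table above can be wrapped in a small function with a uniqueness check. This is a sketch under assumptions: `build_obs` is a hypothetical helper, not part of the notebook, and it only reproduces the `dataset` and `uni_barcode` columns.

```python
import pandas as pd


def build_obs(barcodes, dataset):
    """Hypothetical helper: per-cell metadata indexed by '<dataset>_<barcode>'."""
    obs = pd.DataFrame(index=pd.Index(barcodes))
    obs["dataset"] = dataset
    # Same construction as in the notebook: dataset ID, underscore, original barcode.
    obs["uni_barcode"] = obs["dataset"] + "_" + obs.index.astype(str)
    # Use the prefixed barcode as the index and keep it as a column too.
    obs = obs.set_index("uni_barcode", drop=False)
    if not obs.index.is_unique:
        raise ValueError("prefixed barcodes are not unique")
    return obs


obs = build_obs(["MUV1-P04-B12", "MUV1-P04-C08"], "GSE102130")
print(obs.index.tolist())
```

The uniqueness check is cheap and would flag repeated barcodes before they reach the saved file.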


### 3. Confirmation of created AnnData object

In [2]:
output = '/scratch/user/s4543064/xiaohan-john-project/write/GSE102130/GSE102130_K27Mproject.RSEM.vh20170621_uni.h5ad'
sample = anndata.read_h5ad(output)
print(sample)

AnnData object with n_obs × n_vars = 4058 × 23686
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode'
    var: 'gene_symbols'


In [4]:
# Add a sample-level identifier; every cell receives the dataset ID
sample.obs['sample_barcode'] = 'GSE102130'

# Save the AnnData object
sample.write_h5ad('/scratch/user/s4543064/xiaohan-john-project/write/GSE102130/GSE102130_K27Mproject.RSEM.vh20170621_uni.h5ad', compression="gzip")

### 4. Convert AnnData objects to SingleCellExperiment objects

In [5]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE102130')

# Loop through all files in the directory and convert each *_uni.h5ad file
for file in write_directory.iterdir():
    # Match on the suffix; a substring test would also accept names such as 'x_uni.h5ad.bak'
    if file.name.endswith("_uni.h5ad"):
        sample_name = file.stem  # file name without the '.h5ad' extension
        sample_anndata = anndata.read_h5ad(file)
        sample_sce_file = sample_name + ".rds"

        # Convert the AnnData object to an R SingleCellExperiment
        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)

        # Save the sce object in .rds file
        robjects.globalenv["sample_sce"] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))
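
The file selection and output naming used by the loop can be checked without R or any real data. The sketch below is an illustration only: `rds_targets` is a hypothetical helper, and the file names are placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path


def rds_targets(directory):
    """Map each '*_uni.h5ad' file in `directory` to the '.rds' path beside it."""
    directory = Path(directory)
    return {
        h5ad: h5ad.with_suffix(".rds")
        for h5ad in sorted(directory.glob("*_uni.h5ad"))
    }


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Only the first name should be selected for conversion.
    for name in ["a_uni.h5ad", "b.h5ad", "c_uni.h5ad.bak"]:
        (tmp / name).touch()
    targets = {src.name: dst.name for src, dst in rds_targets(tmp).items()}

print(targets)  # {'a_uni.h5ad': 'a_uni.rds'}
```

A glob pattern anchors the match at the end of the file name, so backup copies or other files that merely contain `_uni.h5ad` are skipped.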

In [6]:
print(sample_sce)

class: SingleCellExperiment 
dim: 23686 4058 
metadata(0):
assays(1): X
rownames(23686): 0 1 ... 23684 23685
rowData names(1): gene_symbols
colnames(4058): GSE102130_MUV1-P04-B12 GSE102130_MUV1-P04-C08 ...
  GSE102130_Oligo-P22-H08 GSE102130_Oligo-P22-H09
colData names(5): cancer_type dataset tissue uni_barcode sample_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
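
The `rownames(23686): 0 1 ...` line above indicates that genes are indexed by position rather than by symbol, which follows from building `.var` with `pd.DataFrame(input.index)`. The sketch below contrasts that pattern with one alternative on a toy index; whether symbol-based row names are wanted downstream is an assumption, and the variable names are illustrative.

```python
import pandas as pd

genes = pd.Index(["A1BG", "A1BG-AS1", "A1CF"], name="Gene")

# Pattern used in the notebook: the symbols become a column, the index is positional.
var_positional = pd.DataFrame(genes).rename(columns={"Gene": "gene_symbols"})

# Alternative: keep the column and also use the symbols as the index.
var_by_symbol = pd.DataFrame(
    {"gene_symbols": genes.to_numpy()},
    index=genes.to_numpy(),
)

print(var_positional.index.tolist())
print(var_by_symbol.index.tolist())
```

With the second form, the gene symbols would be expected to carry through as row names after conversion, provided they are unique.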

