### 1. General info of dataset GSE147766

This Jupyter notebook processes dataset GSE147766. The dataset provides one count file in CSV format per sample. As shown below, each row of a CSV file is a gene and each column is a cell.

We therefore need to transpose each file and build an AnnData object (cells × genes) for every sample. The samples are:

| GEO accession | Sample |
| --- | --- |
| GSM4445608 | NB01 (RNA-seq) |
| GSM4445609 | NB02 (RNA-seq) |
| GSM4445610 | NB09 (RNA-seq) |
| GSM4445611 | NB11 (RNA-seq) |
| GSM4445612 | NB12 (RNA-seq) |
| GSM4445613 | NB13 (RNA-seq) |
| GSM4445614 | NB14 (RNA-seq) |
| GSM4445615 | NB16 (RNA-seq) |
| GSM4445616 | NB17 (RNA-seq) |
| GSM4445617 | NB18 (RNA-seq) |
| GSM4445618 | NB19 (RNA-seq) |
| GSM4445619 | NB20 (RNA-seq) |
| GSM4445620 | NB21 (RNA-seq) |
| GSM4445621 | NB22 (RNA-seq) |
| GSM4445622 | NB23 (RNA-seq) |
| GSM4445623 | NB24 (RNA-seq) |
| GSM4445624 | NB26 (RNA-seq) |
| GSM6058205 | NB34 (RNA-seq) |
| GSM6058206 | NB37 (RNA-seq) |

In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy.sparse  # csr_matrix lives in the scipy.sparse submodule

In [4]:
# Inspect one sample file
# Rows are genes and columns are cells; the first column holds the gene names and becomes the index
counts = pd.read_csv("xiaohan-john-project/data/GSE147766_RAW/GSM4445608_NB01.count.csv", index_col=0)

print(counts.head())
print(counts.shape) # (33868 genes, 949 cells)

             NB01_AAACCTGTCGTAGATC-1  NB01_AAACGGGAGTGTACTC-1  \
MIR1302-2HG                        0                        0   
FAM138A                            0                        0   
OR4F5                              0                        0   
AL627309.1                         0                        0   
AL627309.3                         0                        0   

             NB01_AAACGGGAGTGTCCCG-1  NB01_AAACGGGCAGTTCATG-1  \
MIR1302-2HG                        0                        0   
FAM138A                            0                        0   
OR4F5                              0                        0   
AL627309.1                         0                        0   
AL627309.3                         0                        0   

             NB01_AAAGATGAGCTTCGCG-1  NB01_AAAGCAACAAGTCTAC-1  \
MIR1302-2HG                        0                        0   
FAM138A                            0                        0   
OR4F5                  

As noted in the cell above, the NB01 file contains 33868 genes (rows) and 949 cells (columns).
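To make the orientation concrete, here is a minimal sketch that reads a tiny CSV laid out the same way (genes as rows, cells as columns). The gene names, barcodes, and counts in `toy_csv` are invented for illustration; only the `pd.read_csv(..., index_col=0)` call mirrors the cell above.

```python
import io

import pandas as pd

# Hypothetical miniature of a per-sample count file: the first column holds
# gene names, the remaining columns are cells (all values are made up).
toy_csv = io.StringIO(
    ",NB00_CELL1-1,NB00_CELL2-1,NB00_CELL3-1\n"
    "GENE_A,0,2,1\n"
    "GENE_B,5,0,0\n"
)

toy_counts = pd.read_csv(toy_csv, index_col=0)

n_genes, n_cells = toy_counts.shape  # rows are genes, columns are cells
print(f"{n_genes} genes x {n_cells} cells")
print(toy_counts.index.tolist())     # gene names
print(toy_counts.columns.tolist())   # cell barcodes
```

Checking `shape` this way before building any object is a cheap guard against reading a file in the wrong orientation.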

### 2. Overall AnnData object of the dataset

<span style="color:red">**IMPORTANT:**</span> transpose the DataFrame.values to match the AnnData.X

1. `DataFrame.columns`: cell barcodes, which go into `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the genes × cells expression matrix, whose transpose goes into `.X`
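The mapping above can be sketched on a toy table. This is an illustration only: the gene names, barcodes, and counts are made up, and the snippet deliberately stops short of calling `anndata.AnnData`, so it needs only pandas and SciPy. The three pieces it builds are what would be passed as `X`, `obs`, and `var`.

```python
import pandas as pd
import scipy.sparse

# Toy genes x cells table (hypothetical names and counts)
toy = pd.DataFrame(
    [[0, 2, 1],
     [5, 0, 0]],
    index=["GENE_A", "GENE_B"],
    columns=["NB00_CELL1-1", "NB00_CELL2-1", "NB00_CELL3-1"],
)

toy_X = scipy.sparse.csr_matrix(toy.values.T)  # cells x genes
toy_obs = pd.DataFrame(index=toy.columns)      # one row per cell
toy_var = pd.DataFrame(index=toy.index)        # one row per gene

# X needs one row per obs entry and one column per var entry
assert toy_X.shape == (len(toy_obs), len(toy_var))
print(toy_X.shape)
print(toy_X.toarray())
```

If the transpose is forgotten, the shape check fails immediately instead of producing an object with genes and cells swapped.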

In [6]:
from pathlib import Path

In [13]:
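# Assumption: use the NB01 file inspected above; any sample file stem works here
sample_name = "GSM4445608_NB01.count"
counts = pd.read_csv("xiaohan-john-project/data/GSE147766_RAW/" + sample_name + ".csv", index_col=0)

In [17]:
pd.DataFrame(index=counts.index)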

MIR1302-2HG
FAM138A
OR4F5
AL627309.1
AL627309.3
...
AC233755.2
AC233755.1
AC240274.1
AC213203.2
AC213203.1


In [18]:
# Specify directory paths
data_directory = Path('/scratch/user/uqjsaxo1/xiaohan-john-project/data/GSE147766_RAW/')
write_directory = Path('/scratch/user/uqjsaxo1/xiaohan-john-project/write/GSE147766/')

# Loop through all CSV files in the directory (one file per sample)
for sample_file in data_directory.glob('*.csv'):
    sample_name = sample_file.stem  # e.g. 'GSM4445608_NB01.count'
    sample_h5ad = sample_name + '_uni.h5ad'

    # Rows are genes, columns are cells; the first column holds the gene names
    counts = pd.read_csv(sample_file, index_col=0)

    # Transpose to cells x genes and store as a sparse matrix
    matrix = scipy.sparse.csr_matrix(counts.values.T)
    obs_name = pd.DataFrame(index=counts.columns)
    var_name = pd.DataFrame(index=counts.index)

    sample = anndata.AnnData(X=matrix, obs=obs_name, var=var_name)

    # Build the per-cell metadata, indexed by the original cell barcodes
    obs_metrics = pd.DataFrame(index=sample.obs_names)
    obs_metrics['cancer_type'] = 'Neuroblastoma'
    obs_metrics['dataset'] = 'GSE147766'
    obs_metrics['tissue'] = 'Adrenal'
    obs_metrics['sample_barcode'] = sample_name
    obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)

    # Use the dataset-prefixed barcode as the unique cell identifier
    sample.obs = obs_metrics.set_index('uni_barcode')
    print(sample)

    # Save the AnnData object
    output_path = write_directory / sample_h5ad
    sample.write_h5ad(output_path, compression="gzip")

AnnData object with n_obs × n_vars = 2707 × 33868
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode'
AnnData object with n_obs × n_vars = 17287 × 33868
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode'
AnnData object with n_obs × n_vars = 1361 × 33868
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode'
AnnData object with n_obs × n_vars = 4667 × 33868
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode'
AnnData object with n_obs × n_vars = 3217 × 33868
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode'
AnnData object with n_obs × n_vars = 5191 × 33868
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode'
AnnData object with n_obs × n_vars = 2973 × 33868
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode'
AnnData object with n_obs × n_vars = 1608 × 33868
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_barcode'
AnnData object with n_obs × n_vars = 5288 × 33833
    obs: 'cancer_type', 'dataset', 'tissue', 'sample_

In [19]:
sample.obs

| uni_barcode | cancer_type | dataset | tissue | sample_barcode |
| --- | --- | --- | --- | --- |
| GSE147766_NB22_AAACCTGGTCGCGTGT-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
| GSE147766_NB22_AAACGGGAGGACACCA-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
| GSE147766_NB22_AAACGGGCAATGGAGC-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
| GSE147766_NB22_AAACGGGCAGACTCGC-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
| GSE147766_NB22_AAAGATGAGGTAGCCA-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
| ... | ... | ... | ... | ... |
| GSE147766_NB22_TTTGGTTAGCACAGGT-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
| GSE147766_NB22_TTTGGTTGTGCAGGTA-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
| GSE147766_NB22_TTTGGTTTCACAGTAC-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
| GSE147766_NB22_TTTGGTTTCGTCTGAA-1 | Neuroblastoma | GSE147766 | Adrenal | GSM4445621_NB22.count |
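The `uni_barcode` index shown above is the dataset accession joined to the original cell barcode. A minimal sketch of that construction, using two barcodes from the output above, is given here; `obs_demo` is a name introduced for this example only, and the uniqueness check is an added suggestion rather than part of the original loop.

```python
import pandas as pd

# Two original cell barcodes, as they appear in the CSV header of sample NB22
barcodes = pd.Index(["NB22_AAACCTGGTCGCGTGT-1", "NB22_AAACGGGAGGACACCA-1"])

obs_demo = pd.DataFrame(index=barcodes)
obs_demo["dataset"] = "GSE147766"
obs_demo["uni_barcode"] = obs_demo["dataset"] + "_" + obs_demo.index.astype(str)

# Promote the prefixed barcode to the index and confirm it is unique
obs_demo = obs_demo.set_index("uni_barcode")
assert obs_demo.index.is_unique
print(obs_demo.index.tolist())
```

Prefixing with the dataset accession keeps cell identifiers distinct if objects from several datasets are later combined.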


In [20]:
sample.var

MIR1302-2HG
FAM138A
OR4F5
AL627309.1
AL627309.3
...
AC233755.2
AC233755.1
AC240274.1
AC213203.2
AC213203.1


In [27]:
sample.X[10:1000, 10:2000].toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1]])
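Because the slice above is almost entirely zeros, it can be more informative to summarize the sparse matrix directly instead of converting it to a dense array. The sketch below uses a small made-up matrix in place of `sample.X`; the names `demo_X`, `density`, and `counts_per_cell` are introduced here for illustration.

```python
import numpy as np
import scipy.sparse

# Made-up cells x genes count matrix standing in for `sample.X`
dense = np.array([
    [0, 0, 3, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
demo_X = scipy.sparse.csr_matrix(dense)

n_cells, n_genes = demo_X.shape
density = demo_X.nnz / (n_cells * n_genes)  # fraction of non-zero entries
counts_per_cell = np.asarray(demo_X.sum(axis=1)).ravel()  # total counts per cell

print(f"non-zero entries: {demo_X.nnz}")
print(f"density: {density:.3f}")
print(counts_per_cell)
```

Neither `nnz` nor `sum` densifies the matrix, so the same checks stay cheap on the full-size objects.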

### 3. Confirmation of created AnnData object

In [13]:
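# NOTE: this path and the output below belong to dataset GSE102130; to check the objects
# created above, point `output` at a *_uni.h5ad file under write/GSE147766/ instead
output = '/scratch/user/s4543064/xiaohan-john-project/write/GSE102130/GSE102130_K27Mproject.RSEM.vh20170621_uni.h5ad'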
sample = anndata.read_h5ad(output)
print(sample)

AnnData object with n_obs × n_vars = 4058 × 23686
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode'
    var: 'gene_symbols'


### 4. Convert AnnData objects to SingleCellExperiment objects

In [14]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.conversion import localconverter

# Directory holding the *_uni.h5ad files written in section 2
write_directory = Path('/scratch/user/uqjsaxo1/xiaohan-john-project/write/GSE147766/')

# Loop through all per-sample AnnData files in the directory
for file in write_directory.glob('*_uni.h5ad'):
    sample_name = file.stem  # file name without the .h5ad suffix
    sample_anndata = anndata.read_h5ad(file)
    sample_sce_file = sample_name + ".rds"

    # Convert the AnnData object to an R SingleCellExperiment
    with localconverter(anndata2ri.converter):
        sample_sce = anndata2ri.py2rpy(sample_anndata)
    print(sample_sce)

    # Save the SCE object to an .rds file next to the .h5ad file
    robjects.globalenv["sample_sce"] = sample_sce
    sample_sce_path = write_directory / sample_sce_file
    robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))
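The file names used throughout the notebook are all derived with `Path.stem`, which removes only the final suffix. The sketch below traces one sample through the naming steps; the directory `GSE147766_RAW` is written as a relative placeholder, and no files are read or written.

```python
from pathlib import Path

# File name of the first sample; the directory here is only a placeholder
csv_file = Path("GSE147766_RAW") / "GSM4445608_NB01.count.csv"

sample_name = csv_file.stem                  # drops only the final ".csv"
h5ad_file = Path(sample_name + "_uni.h5ad")  # name written in section 2
rds_file = Path(h5ad_file.stem + ".rds")     # name written in section 4

print(sample_name)
print(h5ad_file.name)
print(rds_file.name)
```

This is why the `sample_barcode` column carries the `.count` part of the file name: `stem` keeps everything before the last dot.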