### 1. General info of dataset GSE125969

This Jupyter notebook processes dataset GSE125969. The dataset consists of one large count-matrix TSV file and a cell metadata file. As shown below, each row of the TSV file is a gene and each column is a cell.

We therefore need to transpose this matrix and build a single AnnData object covering all samples.


In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy

In [2]:
# inspect the dataset
path = '/scratch/user/s4543064/xiaohan-john-project/data/GSE125969/GSE125969_count_matrix.tsv.gz'
input = pd.read_csv(path, sep='\t', index_col=0) # the first column contains gene names and is the index

print(input.head()) 
print(input.shape) # (23580 genes, 18500 cells)

             foreman_1239_AAACCTGTCCAGTAGT  foreman_1239_AAAGTAGAGGTTCCTA  \
gene_symbol                                                                 
AL627309.1                               0                              0   
AL669831.5                               0                              0   
FAM87B                                   0                              0   
LINC00115                                1                              0   
FAM41C                                   0                              1   

             foreman_1239_AAATGCCCAGCAGTTT  foreman_1239_AACGTTGCAAGGTTCT  \
gene_symbol                                                                 
AL627309.1                               0                              0   
AL669831.5                               0                              0   
FAM87B                                   0                              0   
LINC00115                                0                              0  

In [54]:
# check if the count value is integer or float
all_integer = all(input.dtypes == 'int64')
all_integer

True
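
The dtype check above passes only when every column is stored as `int64`. As a complementary sketch (not part of the original notebook), the snippet below tests whether values are whole numbers regardless of the storage dtype, which can help when a count matrix has been saved as floats. The helper name `is_integer_valued` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def is_integer_valued(values) -> bool:
    """Return True if every entry is a whole number, whatever the dtype."""
    arr = np.asarray(values)
    if np.issubdtype(arr.dtype, np.integer):
        return True
    # For floats, an entry is a whole number if rounding leaves it unchanged.
    return bool(np.all(arr == np.round(arr)))

toy_counts = np.array([[0.0, 1.0], [3.0, 0.0]])      # floats that hold raw counts
toy_normalized = np.array([[0.0, 0.5], [1.2, 0.0]])  # not raw counts

print(is_integer_valued(toy_counts))
print(is_integer_valued(toy_normalized))
```

On the real data, the same helper could be applied to the `.values` of the count DataFrame.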

In [7]:
# inspect the metadata
metadata_path = '/scratch/user/s4543064/xiaohan-john-project/data/GSE125969/GSE125969_cell_metadata.tsv.gz'
metadata = pd.read_csv(metadata_path, sep='\t', index_col=0) 

metadata

cell_id,cell_type,tumor_subtype,UMAP_1,UMAP_2,neoplastic_UMAP_1,neoplastic_UMAP_2
foreman_1239_AAACCTGTCCAGTAGT,Myeloid,PFA1,-5.957650,-4.154206,,
foreman_1239_AAAGTAGAGGTTCCTA,Myeloid,PFA1,-6.520398,-4.280884,,
foreman_1239_AAATGCCCAGCAGTTT,Myeloid,PFA1,-6.500687,-4.386174,,
foreman_1239_AACGTTGCAAGGTTCT,Myeloid,PFA1,-8.243432,-4.899468,,
foreman_1239_AAGACCTGTCTTTCAT,Myeloid,PFA1,-7.011166,-3.839772,,
...,...,...,...,...,...,...
foreman_1158_2_CAGTTAGGTGTCACAT,RELA-sc1,ST-RELA,0.639991,7.436701,-5.667158,-1.662147
foreman_1158_2_CTGTACCGTGGTTTAC,RELA-sc1,ST-RELA,0.606369,7.486411,-5.489039,-0.446165
foreman_1158_2_GAGTCATTCGTAGTGT,RELA-sc1,ST-RELA,0.568297,7.517032,-5.408996,-0.527036
foreman_1158_2_GGTTGTAGTGCATCTA,RELA-sc1,ST-RELA,0.591726,7.491313,-5.510126,-0.783699


In [25]:
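# count distinct patients: the patient ID is the second "_"-separated field of each cell ID
patients = set()

for cell_id in metadata.index.tolist():
    fields = cell_id.split("_")
    patients.add(fields[1])

len(patients)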

26
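
The cell IDs in the metadata follow the pattern `foreman_<sample>_<barcode>`, and the sample part can itself contain an underscore (for example `1158_2`). A minimal sketch of a parser that handles both forms is given below; `split_cell_id` is a hypothetical helper name, and taking the text before the first underscore of the sample as the patient ID is an assumption that matches the counting loop above.

```python
from typing import Tuple

def split_cell_id(cell_id: str) -> Tuple[str, str]:
    """Split 'prefix_sample_barcode' into (sample, barcode).

    The barcode is taken from the right and the prefix from the left,
    so a sample such as '1158_2' stays in one piece.
    """
    prefix_and_sample, barcode = cell_id.rsplit("_", 1)
    _prefix, sample = prefix_and_sample.split("_", 1)
    return sample, barcode

for cid in ["foreman_1239_AAACCTGTCCAGTAGT", "foreman_1158_2_CAGTTAGGTGTCACAT"]:
    sample, barcode = split_cell_id(cid)
    patient = sample.split("_")[0]  # assumption: patient ID precedes any suffix
    print(sample, barcode, patient)
```

Splitting from both ends avoids branching on the number of fields.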

### 2. Overall AnnData object of the dataset

<span style="color:red">**IMPORTANT:**</span> transpose `DataFrame.values` so that it matches the cells × genes layout of `AnnData.X`.

1. `DataFrame.columns`: cell barcodes, which become the index of `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the genes × cells expression matrix, whose transpose becomes `.X`
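
The mapping in this list can be checked on a toy table before it is applied to the full matrix. The sketch below uses only pandas and SciPy; the gene and cell names are made up for illustration.

```python
import pandas as pd
from scipy import sparse

# Toy genes x cells table with the same layout as the count matrix.
toy = pd.DataFrame(
    [[0, 2, 0], [1, 0, 0]],
    index=pd.Index(["geneA", "geneB"], name="gene_symbol"),
    columns=["cell1", "cell2", "cell3"],
)

X = sparse.csr_matrix(toy.values.T)    # cells x genes, the layout of AnnData.X
obs = pd.DataFrame(index=toy.columns)  # one row per cell
var = pd.DataFrame(index=toy.index)    # one row per gene

# The matrix dimensions must agree with the number of obs and var rows.
print(X.shape, len(obs), len(var))
```

If the transpose is forgotten, the shape of `X` no longer matches `(len(obs), len(var))`, which makes the mistake easy to catch early.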

In [50]:
matrix = scipy.sparse.csr_matrix(input.values.T)
obs_name = pd.DataFrame(index=input.columns)
var_name = pd.DataFrame(input.index)

sample = anndata.AnnData(X=matrix, obs=obs_name, var=var_name)

# Build the per-cell annotation table; .copy() avoids pandas' SettingWithCopyWarning
# NOTE: this relies on the metadata rows being in the same order as the count-matrix columns
obs_metrics = metadata[['cell_type', 'tumor_subtype']].copy()

obs_metrics['cancer_type'] = obs_metrics['tumor_subtype'] + '_' + 'ependymoma'
obs_metrics['dataset'] = 'GSE125969'
obs_metrics['tissue'] = 'brain'

# Cell IDs look like 'foreman_<id>_<barcode>' or 'foreman_<id>_<suffix>_<barcode>';
# replace the 'foreman' prefix with the dataset ID
sample_barcodes = []
uni_barcodes = []
for cell_id in metadata.index.tolist():
    barcodes = cell_id.split('_')
    if len(barcodes) == 3:
        sample_barcode = 'GSE125969_' + barcodes[1] 
        uni_barcode = 'GSE125969_' + barcodes[1] + '_' + barcodes[2]
    else:
        sample_barcode = 'GSE125969_' + barcodes[1] + '_' + barcodes[2]
        uni_barcode = 'GSE125969_' + barcodes[1] + '_' + barcodes[2] + '_' + barcodes[3]
    sample_barcodes.append(sample_barcode)
    uni_barcodes.append(uni_barcode)
obs_metrics['sample_barcode'] = sample_barcodes
obs_metrics['uni_barcode'] = uni_barcodes

# drop the unwanted column
obs_metrics = obs_metrics.drop('tumor_subtype', axis=1)
# rename the column
obs_metrics.rename(columns={'cell_type': 'cell_type_from_paper'}, inplace=True)

sample.obs = obs_metrics
sample.obs.set_index("uni_barcode", drop=False, inplace=True)
print(sample)

# save the anndata object
sample.write_h5ad('/scratch/user/s4543064/xiaohan-john-project/write/GSE125969/GSE125969_uni.h5ad', compression="gzip")


AnnData object with n_obs × n_vars = 18500 × 23580
    obs: 'cell_type_from_paper', 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbol'
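
Replacing `.obs` with a table derived from the metadata attaches labels by row position, so the result is only correct if the metadata rows follow the same order as the columns of the count matrix (as far as we know, the assignment itself does not reorder anything). A hedged sketch of an explicit alignment step is shown below; `align_metadata` is a hypothetical helper and the toy IDs are invented.

```python
import pandas as pd

def align_metadata(cell_ids, metadata: pd.DataFrame) -> pd.DataFrame:
    """Reorder metadata rows to follow cell_ids; fail if any cell is missing."""
    missing = pd.Index(cell_ids).difference(metadata.index)
    if len(missing) > 0:
        raise ValueError(f"{len(missing)} cells have no metadata, e.g. {missing[0]}")
    return metadata.loc[list(cell_ids)]

toy_cells = ["c1", "c2", "c3"]  # order of the count-matrix columns
toy_meta = pd.DataFrame(
    {"cell_type": ["T", "B", "Myeloid"]},
    index=["c3", "c1", "c2"],   # metadata stored in a different order
)

aligned = align_metadata(toy_cells, toy_meta)
print(aligned)
```

In this notebook the equivalent call would pass the columns of the count DataFrame and the metadata table before building the observation annotations.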


In [51]:
sample.obs

uni_barcode,cell_type_from_paper,cancer_type,dataset,tissue,sample_barcode,uni_barcode
GSE125969_1239_AAACCTGTCCAGTAGT,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AAACCTGTCCAGTAGT
GSE125969_1239_AAAGTAGAGGTTCCTA,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AAAGTAGAGGTTCCTA
GSE125969_1239_AAATGCCCAGCAGTTT,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AAATGCCCAGCAGTTT
GSE125969_1239_AACGTTGCAAGGTTCT,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AACGTTGCAAGGTTCT
GSE125969_1239_AAGACCTGTCTTTCAT,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AAGACCTGTCTTTCAT
...,...,...,...,...,...,...
GSE125969_1158_2_CAGTTAGGTGTCACAT,RELA-sc1,ST-RELA_ependymoma,GSE125969,brain,GSE125969_1158_2,GSE125969_1158_2_CAGTTAGGTGTCACAT
GSE125969_1158_2_CTGTACCGTGGTTTAC,RELA-sc1,ST-RELA_ependymoma,GSE125969,brain,GSE125969_1158_2,GSE125969_1158_2_CTGTACCGTGGTTTAC
GSE125969_1158_2_GAGTCATTCGTAGTGT,RELA-sc1,ST-RELA_ependymoma,GSE125969,brain,GSE125969_1158_2,GSE125969_1158_2_GAGTCATTCGTAGTGT
GSE125969_1158_2_GGTTGTAGTGCATCTA,RELA-sc1,ST-RELA_ependymoma,GSE125969,brain,GSE125969_1158_2,GSE125969_1158_2_GGTTGTAGTGCATCTA


### 3. Confirmation of created AnnData object

In [11]:
output = '/scratch/user/s4543064/xiaohan-john-project/write/GSE125969/GSE125969_uni.h5ad'
sample = anndata.read_h5ad(output)
print(sample)

AnnData object with n_obs × n_vars = 18500 × 23580
    obs: 'cell_type_from_paper', 'cancer_type', 'dataset', 'tissue', 'sample_barcode', 'uni_barcode'
    var: 'gene_symbol'


In [12]:
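# use the gene symbols as the index of .var instead of the default integer index
sample.var.set_index('gene_symbol', drop=True, inplace=True)
sample.var.rename_axis('gene_symbols', inplace=True)
# overwrite the saved file with the updated object
sample.write_h5ad(output, compression="gzip")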
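
Before gene symbols are used as an index, it can be worth confirming that they are unique, since repeated symbols make lookups by name ambiguous. The following is a small sketch on an invented table, not a check the original notebook performs.

```python
import pandas as pd

toy_var = pd.DataFrame({"gene_symbol": ["geneA", "geneB", "geneA"]})
symbols = pd.Index(toy_var["gene_symbol"])

# is_unique is False as soon as any symbol appears more than once.
print(symbols.is_unique)

# List each repeated symbol once.
repeated = symbols[symbols.duplicated(keep=False)].unique().tolist()
print(repeated)
```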

In [13]:
sample.var

AL627309.1
AL669831.5
FAM87B
LINC00115
FAM41C
...
AC023491.2
AC004556.1
AC233755.2
AC233755.1
AC240274.1


In [14]:
sample.obs

uni_barcode,cell_type_from_paper,cancer_type,dataset,tissue,sample_barcode,uni_barcode
GSE125969_1239_AAACCTGTCCAGTAGT,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AAACCTGTCCAGTAGT
GSE125969_1239_AAAGTAGAGGTTCCTA,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AAAGTAGAGGTTCCTA
GSE125969_1239_AAATGCCCAGCAGTTT,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AAATGCCCAGCAGTTT
GSE125969_1239_AACGTTGCAAGGTTCT,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AACGTTGCAAGGTTCT
GSE125969_1239_AAGACCTGTCTTTCAT,Myeloid,PFA1_ependymoma,GSE125969,brain,GSE125969_1239,GSE125969_1239_AAGACCTGTCTTTCAT
...,...,...,...,...,...,...
GSE125969_1158_2_CAGTTAGGTGTCACAT,RELA-sc1,ST-RELA_ependymoma,GSE125969,brain,GSE125969_1158_2,GSE125969_1158_2_CAGTTAGGTGTCACAT
GSE125969_1158_2_CTGTACCGTGGTTTAC,RELA-sc1,ST-RELA_ependymoma,GSE125969,brain,GSE125969_1158_2,GSE125969_1158_2_CTGTACCGTGGTTTAC
GSE125969_1158_2_GAGTCATTCGTAGTGT,RELA-sc1,ST-RELA_ependymoma,GSE125969,brain,GSE125969_1158_2,GSE125969_1158_2_GAGTCATTCGTAGTGT
GSE125969_1158_2_GGTTGTAGTGCATCTA,RELA-sc1,ST-RELA_ependymoma,GSE125969,brain,GSE125969_1158_2,GSE125969_1158_2_GGTTGTAGTGCATCTA


In [15]:
sample.obs['sample_barcode'].value_counts()

sample_barcode
GSE125969_1158_2    2005
GSE125969_1067      1829
GSE125969_965       1462
GSE125969_1386      1411
GSE125969_1329      1305
GSE125969_870       1147
GSE125969_1185      1145
GSE125969_987       1016
GSE125969_1347      1010
GSE125969_1048       757
GSE125969_781        737
GSE125969_727        650
GSE125969_848        637
GSE125969_1269       599
GSE125969_1101       472
GSE125969_1239       377
GSE125969_859        360
GSE125969_723        300
GSE125969_897        293
GSE125969_1010       279
GSE125969_911        268
GSE125969_930        117
GSE125969_928         98
GSE125969_871         89
GSE125969_909         72
GSE125969_839         65
Name: count, dtype: int64

### 4. Convert AnnData objects to SingleCellExperiment objects

<span style="color:red">**PROBLEM:**</span> converting the AnnData object with its int64 count matrix fails with `Unknown dtype dtype('int64') cannot be converted to ?gRMatrix.`

Thus, we need to cast the count matrix from int64 to float32 before the conversion.
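
Casting counts to float32 does not change their values as long as they stay below 2**24, because every integer up to that bound is exactly representable in float32. The sketch below illustrates this on a toy sparse matrix; the values are invented.

```python
import numpy as np
from scipy import sparse

toy_counts = sparse.csr_matrix(np.array([[0, 5], [120, 0]], dtype=np.int64))
as_float = toy_counts.astype(np.float32)

print(as_float.dtype)

# The cast is lossless here because every count is far below 2**24.
lossless = np.array_equal(as_float.toarray(), toy_counts.toarray())
print(lossless, toy_counts.max() <= 2**24)
```

The cast keeps the matrix sparse, so memory use stays close to that of the original counts.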

In [16]:
from pathlib import Path

import anndata2ri
import rpy2.robjects as robjects
from rpy2.robjects import r
from rpy2.robjects.conversion import localconverter

# Specify directory paths
write_directory = Path('/scratch/user/s4543064/xiaohan-john-project/write/GSE125969')

# Loop through all files in the directory
for file in write_directory.iterdir():
    sample_name = file.stem
    if "_uni.h5ad" in file.name:
        sample_anndata = anndata.read_h5ad(file)
        sample_anndata.X = sample_anndata.X.astype('float32')
        sample_sce_file = sample_name + ".rds"

        with localconverter(anndata2ri.converter):
            sample_sce = anndata2ri.py2rpy(sample_anndata)
        print(sample_sce)
        
        # Save the sce object in .rds file
        robjects.globalenv["sample_sce"] = sample_sce
        sample_sce_path = write_directory / sample_sce_file
        robjects.r("saveRDS(sample_sce, file='{}')".format(sample_sce_path))

class: SingleCellExperiment 
dim: 23580 18500 
metadata(0):
assays(1): X
rownames(23580): AL627309.1 AL669831.5 ... AC233755.1 AC240274.1
rowData names(0):
colnames(18500): GSE125969_1239_AAACCTGTCCAGTAGT
  GSE125969_1239_AAAGTAGAGGTTCCTA ... GSE125969_1158_2_GGTTGTAGTGCATCTA
  GSE125969_1158_2_TGAGGGATCGGAATTC
colData names(6): cell_type_from_paper cancer_type ... sample_barcode
  uni_barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

