### 1. General info of dataset GSE89567

This Jupyter notebook processes dataset GSE89567. The dataset is distributed as a single large text file in which, as shown below, each row is a gene and each column is a cell.

We therefore need to transpose this table and build a single AnnData object that covers all samples.
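Because the file is large, it can help to read only a few rows first and confirm the orientation before loading everything. The sketch below is a minimal illustration using pandas' `nrows` argument on a small in-memory stand-in; the `toy_text` string, its cell names, and its values are made up for this example and are not taken from GSE89567.

```python
import io
import pandas as pd

# Hypothetical miniature of a tab-separated genes-by-cells text file.
toy_text = (
    "\tcell_1\tcell_2\tcell_3\n"
    "'GENE_A'\t1.5\t0.0\t2.0\n"
    "'GENE_B'\t0.0\t0.0\t3.1\n"
    "'GENE_C'\t4.2\t1.0\t0.0\n"
)

# Read only the first two data rows; index_col=0 makes the gene column the index.
preview = pd.read_csv(io.StringIO(toy_text), sep="\t", index_col=0, nrows=2)

print(preview.shape)          # (number of genes read, number of cells)
print(list(preview.columns))  # cell identifiers
print(list(preview.index))    # gene identifiers, still wrapped in quotes
```

For the real file, the same call with the actual path in place of `io.StringIO(toy_text)` should give a quick preview without reading all rows.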

In [3]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy.sparse  # load the sparse submodule explicitly; csr_matrix is used below

In [20]:
# inspect the dataset
path = '/scratch/user/s4543064/xiaohan-john-project/data/GSE89567/GSE89567_IDH_A_processed_data.txt'
input = pd.read_csv(path, sep='\t', index_col=0) # the first column contains gene names and is the index

print(input.head()) 
print(input.shape) # (23686, 6341)

            MGH42_P7_A01  MGH42_P7_A02  MGH42_P7_A03  MGH42_P7_A04  \
'A1BG'            1.1928      0.000000       0.00000        0.0000   
'A1BG-AS1'        0.0000      0.000000       0.00000        0.0000   
'A1CF'            0.0000      0.094912       0.00000        0.0000   
'A2M'             7.0439      7.609500       0.77062        7.6146   
'A2M-AS1'         0.0000      0.000000       0.00000        0.0000   

            MGH42_P7_A05  MGH42_P7_A07  MGH42_P7_A09  MGH42_P7_A11  \
'A1BG'            0.0000       0.66903       0.00000        0.0000   
'A1BG-AS1'        0.0000       0.00000       0.00000        0.0000   
'A1CF'            0.0000       0.00000       0.00000        0.0000   
'A2M'             0.0000       0.00000       0.27501        8.1624   
'A2M-AS1'         2.0339       2.39420       0.00000        0.0000   

            MGH42_P7_A12  MGH42_P7_B02  ...  MGH107neg_P2_E06  \
'A1BG'            0.0000        0.0000  ...               0.0   
'A1BG-AS1'        0.0000    

<span style="color:red">**PROBLEM:**</span> the gene names are stored as `'GENE_SYMBOL'`, that is, wrapped in single quotation marks.
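The quotes can be removed either by slicing off the first and last character or, slightly more defensively, with pandas' `str.strip`. The sketch below compares the two on a made-up index (`toy_index` is hypothetical; its last label is left unquoted on purpose to show the difference).

```python
import pandas as pd

# Hypothetical gene labels; the last one has no surrounding quotes.
toy_index = pd.Index(["'A1BG'", "'A2M-AS1'", "A1CF"])

# Slicing drops the first and last character regardless of what they are.
sliced = [name[1:-1] for name in toy_index]

# str.strip removes only leading and trailing single quotes.
stripped = toy_index.str.strip("'")

print(sliced)
print(list(stripped))
```

When every label is quoted, as appears to be the case in this file, both approaches give the same result; they differ only for labels that lack the quotes.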

In [21]:
# Get rid of the extra quotation marks for gene symbols
input.index = [gene[1:-1] for gene in input.index]
print(input.head()) 

          MGH42_P7_A01  MGH42_P7_A02  MGH42_P7_A03  MGH42_P7_A04  \
A1BG            1.1928      0.000000       0.00000        0.0000   
A1BG-AS1        0.0000      0.000000       0.00000        0.0000   
A1CF            0.0000      0.094912       0.00000        0.0000   
A2M             7.0439      7.609500       0.77062        7.6146   
A2M-AS1         0.0000      0.000000       0.00000        0.0000   

          MGH42_P7_A05  MGH42_P7_A07  MGH42_P7_A09  MGH42_P7_A11  \
A1BG            0.0000       0.66903       0.00000        0.0000   
A1BG-AS1        0.0000       0.00000       0.00000        0.0000   
A1CF            0.0000       0.00000       0.00000        0.0000   
A2M             0.0000       0.00000       0.27501        8.1624   
A2M-AS1         2.0339       2.39420       0.00000        0.0000   

          MGH42_P7_A12  MGH42_P7_B02  ...  MGH107neg_P2_E06  MGH107pos_P2_B03  \
A1BG            0.0000        0.0000  ...               0.0            0.0000   
A1BG-AS1        0.00

As shown above, the dataset contains 6341 cells and 23686 genes.

### 2. Overall AnnData object of the dataset

<span style="color:red">**IMPORTANT:**</span> transpose `DataFrame.values` so that it matches the cells × genes layout of `AnnData.X`.

1. `DataFrame.columns`: cell barcodes, which go into `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the genes × cells expression matrix, whose transpose goes into `.X`
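The mapping above can be checked on a tiny example before applying it to the full table. The sketch below is only an illustration: the `toy` DataFrame and its gene and cell names are invented, and it uses pandas, NumPy, and SciPy alone, so it verifies the transpose step rather than the AnnData construction itself.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Hypothetical genes-by-cells table (3 genes, 2 cells).
toy = pd.DataFrame(
    [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["cell_1", "cell_2"],
)

# Transpose to cells-by-genes, the layout expected for .X, and store it sparsely.
X = scipy.sparse.csr_matrix(toy.values.T)

print(X.shape)  # (n_cells, n_genes)

# Row 0 of X should hold the expression profile of the first cell ("cell_1").
first_cell = X[0].toarray().ravel()
print(np.array_equal(first_cell, toy["cell_1"].to_numpy()))
```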

In [22]:
matrix = scipy.sparse.csr_matrix(input.values.T)
obs_name = pd.DataFrame(index=input.columns)
# index the gene table by gene symbol so that var_names are symbols, not 0..N-1
var_name = pd.DataFrame({'gene_symbols': input.index.values}, index=input.index)

sample = anndata.AnnData(X=matrix, obs=obs_name, var=var_name)
print(sample)

# save the anndata object
sample.write_h5ad('/scratch/user/s4543064/xiaohan-john-project/write/GSE89567/GSE89567_IDH_A_processed_data.h5ad', compression="gzip")



AnnData object with n_obs × n_vars = 6341 × 23686
    var: 'gene_symbols'


### 3. Confirmation of created AnnData object

In [4]:
output = '/scratch/user/s4543064/xiaohan-john-project/write/GSE89567/GSE89567_IDH_A_processed_data.h5ad'
sample = anndata.read_h5ad(output)
print(sample)

AnnData object with n_obs × n_vars = 6341 × 23686
    var: 'gene_symbols'


### 4. Add feature info to the created AnnData object

1. `cancer_type`: the type of cancer this AnnData object relates to
2. `dataset`: the dataset the sample belongs to
3. `tissue`: the tissue the sample was obtained from
4. `uni_barcode`: a unique barcode for each cell

In [9]:
# Create an observation metadata table to store the features listed above
obs_metrics = pd.DataFrame(index=sample.obs_names)  # reuse the existing cell identifiers as the index

obs_metrics['cancer_type'] = 'IDH-Mutation_glioma'
obs_metrics['dataset'] = 'GSE89567'
obs_metrics['tissue'] = 'brain'
obs_metrics['uni_barcode'] = obs_metrics['dataset'] + '_' + obs_metrics.index.astype(str)

obs_metrics

| | cancer_type | dataset | tissue | uni_barcode |
|---|---|---|---|---|
| MGH42_P7_A01 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH42_P7_A01 |
| MGH42_P7_A02 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH42_P7_A02 |
| MGH42_P7_A03 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH42_P7_A03 |
| MGH42_P7_A04 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH42_P7_A04 |
| MGH42_P7_A05 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH42_P7_A05 |
| ... | ... | ... | ... | ... |
| MGH107neg_P2_C05 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH107neg_P2_C05 |
| MGH107pos_P2_D07 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH107pos_P2_D07 |
| MGH107neg_P1_E01 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH107neg_P1_E01 |
| MGH107pos_P2_G09 | IDH-Mutation_glioma | GSE89567 | brain | GSE89567_MGH107pos_P2_G09 |
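Since `uni_barcode` is meant to identify each cell uniquely, it may be worth confirming that it contains no repeats. The sketch below shows one way to do this with pandas' `is_unique`; `toy_obs` is a hypothetical stand-in for `obs_metrics` that uses three of the cell identifiers shown above.

```python
import pandas as pd

# Hypothetical stand-in for obs_metrics, indexed by a few cell identifiers.
toy_obs = pd.DataFrame(index=["MGH42_P7_A01", "MGH42_P7_A02", "MGH107neg_P2_C05"])
toy_obs["dataset"] = "GSE89567"

# Same construction as in the notebook: dataset prefix + original cell identifier.
toy_obs["uni_barcode"] = toy_obs["dataset"] + "_" + toy_obs.index.astype(str)

# A column intended as a unique identifier should contain no duplicates.
print(toy_obs["uni_barcode"].is_unique)
print(toy_obs["uni_barcode"].tolist())
```

On the real table, the equivalent check would be `obs_metrics['uni_barcode'].is_unique`.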


In [21]:
# Add the obs_metrics to the AnnData object
sample.obs = obs_metrics
sample.obs.set_index("uni_barcode", drop=False, inplace=True)
print(sample.obs)

# save the anndata object
sample.write_h5ad('/scratch/user/s4543064/xiaohan-john-project/write/GSE89567/GSE89567_IDH_A_processed_data_uni.h5ad', compression="gzip")

                                   cancer_type   dataset tissue  \
uni_barcode                                                       
GSE89567_MGH42_P7_A01      IDH-Mutation_glioma  GSE89567  brain   
GSE89567_MGH42_P7_A02      IDH-Mutation_glioma  GSE89567  brain   
GSE89567_MGH42_P7_A03      IDH-Mutation_glioma  GSE89567  brain   
GSE89567_MGH42_P7_A04      IDH-Mutation_glioma  GSE89567  brain   
GSE89567_MGH42_P7_A05      IDH-Mutation_glioma  GSE89567  brain   
...                                        ...       ...    ...   
GSE89567_MGH107neg_P2_C05  IDH-Mutation_glioma  GSE89567  brain   
GSE89567_MGH107pos_P2_D07  IDH-Mutation_glioma  GSE89567  brain   
GSE89567_MGH107neg_P1_E01  IDH-Mutation_glioma  GSE89567  brain   
GSE89567_MGH107pos_P2_G09  IDH-Mutation_glioma  GSE89567  brain   
GSE89567_MGH107neg_P1_D06  IDH-Mutation_glioma  GSE89567  brain   

                                         uni_barcode  
uni_barcode                                           
GSE89567_MGH42_P7_

In [22]:
output = '/scratch/user/s4543064/xiaohan-john-project/write/GSE89567/GSE89567_IDH_A_processed_data_uni.h5ad'
sample = anndata.read_h5ad(output)
print(sample)

AnnData object with n_obs × n_vars = 6341 × 23686
    obs: 'cancer_type', 'dataset', 'tissue', 'uni_barcode'
    var: 'gene_symbols'
