### 1. General info of dataset GSE102130

This is the Jupyter Notebook for dataset GSE102130. The dataset is distributed as a single large txt file in which, as shown below, each row is a gene and each column is a cell.

AnnData expects the opposite orientation (cells as rows, genes as columns), so we need to transpose this matrix and build one AnnData object covering all samples.



In [1]:
# Environment setup
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import scipy.sparse  # import the submodule explicitly; csr_matrix is used below

In [2]:
# inspect the dataset
path = '/scratch/user/s4543064/Xiaohan_Summer_Research/data/GSE102130/GSE102130_K27Mproject.RSEM.vh20170621.txt'
input = pd.read_csv(path, sep='\t', index_col=0) # the first column contains gene names and is the index

print(input.head()) 
print(input.shape) # (23686 rows, 4058 columns)

          MUV1-P04-B12  MUV1-P04-C08  MUV1-P04-D09  MUV1-P04-D10  \
Gene                                                               
A1BG               0.0           0.0           0.0           0.0   
A1BG-AS1           0.0           0.0           0.0           0.0   
A1CF               0.0           0.0           0.0           0.0   
A2M                0.0           0.0           0.0           0.0   
A2M-AS1            0.0           0.0           0.0           0.0   

          MUV1-P04-E03  MUV1-P04-E07  MUV1-P04-E08  MUV1-P04-E10  \
Gene                                                               
A1BG              0.00          0.00          0.00          0.00   
A1BG-AS1          0.00          0.00          0.00          0.00   
A1CF              0.00          0.00          0.53          0.34   
A2M             348.48        362.08          0.00          0.00   
A2M-AS1           0.00          1.19          0.00          0.00   

          MUV1-P04-E11  MUV1-P04-F05  ...  Oli

As shown above, the dataset contains 4058 cells and 23686 genes.
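Because the file is large, it can help to preview only the first few rows before loading everything. The following is a minimal sketch, assuming a tab-separated genes × cells layout like the one above; the in-memory `toy_txt` table and its cell names are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a tiny tab-separated genes x cells table
toy_txt = io.StringIO(
    "Gene\tcell_1\tcell_2\tcell_3\n"
    "A1BG\t0.0\t0.0\t0.0\n"
    "A1CF\t0.0\t0.53\t0.34\n"
    "A2M\t348.48\t362.08\t0.0\n"
)

# nrows limits how many data rows are parsed, so only a small slice is read
preview = pd.read_csv(toy_txt, sep="\t", index_col=0, nrows=2)

print(preview)
print(preview.shape[1])  # number of columns, i.e. number of cells
```

With the real dataset, the same call with `path` in place of `toy_txt` would read only the header and the first two genes.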

### 2. Overall AnnData object of the dataset

<span style="color:red">**IMPORTANT:**</span> transpose `DataFrame.values` so that it matches the layout of `AnnData.X` (cells as rows, genes as columns).

1. `DataFrame.columns`: cell identifiers, which become the index of `.obs`
2. `DataFrame.index`: gene names, which go into `.var`
3. `DataFrame.values`: the expression matrix, which is transposed to become `.X`
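The effect of the transpose can be checked on a toy table. This is a small sketch using only pandas and SciPy; the gene and cell names are made up for illustration.

```python
import pandas as pd
import scipy.sparse

# Toy genes x cells table (hypothetical values and names)
genes_by_cells = pd.DataFrame(
    [[0.0, 0.0, 0.53], [348.48, 362.08, 0.0]],
    index=["A1CF", "A2M"],
    columns=["cell_1", "cell_2", "cell_3"],
)

# Transposing gives cells x genes, the orientation AnnData.X uses
cells_by_genes = scipy.sparse.csr_matrix(genes_by_cells.values.T)

print(genes_by_cells.shape)  # genes x cells
print(cells_by_genes.shape)  # cells x genes

# Row 1 is cell_2 and column 1 is A2M after the transpose
value = cells_by_genes[1, 1]
print(value)
```

Only the non-zero entries are stored in the CSR matrix, which suits expression tables where most values are zero.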

In [8]:
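# Transpose so that rows are cells (obs) and columns are genes (var)
matrix = scipy.sparse.csr_matrix(input.values.T)
obs_name = pd.DataFrame(index=input.columns)
# Use the gene symbols as the var index (var_names) and also keep them
# in a 'gene_symbols' column; otherwise var_names would be row positions
genes = input.index.to_numpy()
var_name = pd.DataFrame({'gene_symbols': genes}, index=genes)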

sample = anndata.AnnData(X=matrix, obs=obs_name, var=var_name)
print(sample)

# save the anndata object
sample.write_h5ad('/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE102130/GSE102130_K27Mproject.RSEM.vh20170621.h5ad', compression="gzip")



AnnData object with n_obs × n_vars = 4058 × 23686
    var: 'gene_symbols'


### 3. Confirmation of created AnnData object

In [9]:
output = '/scratch/user/s4543064/Xiaohan_Summer_Research/write/GSE102130/GSE102130_K27Mproject.RSEM.vh20170621.h5ad'
sample = anndata.read_h5ad(output)
print(sample)

AnnData object with n_obs × n_vars = 4058 × 23686
    var: 'gene_symbols'
