### Sim et al. 2021

#### Convert the Sim directories to scanpy h5ad

Single-nucleus RNA sequencing of 54,140 nuclei from 9 human donors. All samples are from the left ventricle (LV).

Currently, there are 9 directories (3 adult, 3 young, 3 fetal), each containing these files:
1. barcodes.tsv.gz
2. features.tsv.gz
3. matrix.mtx.gz

This is the 10X format.
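Before loading, it can help to confirm that each sample directory actually contains the three files listed above. The following is a minimal sketch using only the standard library; the helper name `missing_10x_files` is hypothetical and not part of the original notebook, and the demonstration runs on a throwaway directory rather than the real data.

```python
import os
import tempfile

# File names expected in each sample directory (taken from the list above).
EXPECTED_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_10x_files(sample_dir):
    """Return the expected 10X files that are absent from sample_dir."""
    present = set(os.listdir(sample_dir))
    return [name for name in EXPECTED_FILES if name not in present]


# Demonstration on a temporary directory that deliberately lacks the matrix file.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("barcodes.tsv.gz", "features.tsv.gz"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_10x_files(demo_dir)

print(demo_missing)  # ['matrix.mtx.gz']
```

An empty list for every sample directory would indicate that all three files are present.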

Per GEO: Sequenced reads were processed, aligned to human genome (GRCh38.96/hg38), counted and filtered using Cell Ranger (v3.0.2).
Genome_build: GRCh38.96/hg38. Supplementary_files_format_and_content: *_processed.tar: Tar files include barcodes.tsv, features.tsv, matrix.mtx. Filtered counts of sequencing reads.

#### Convert these to scanpy h5ad, so that they are compatible with the script that combines all of the datasets together
#### Additionally, we will perform QC filtering steps in scanpy

In [1]:
import os
import pandas as pd
import numpy as np
import scanpy as sc
import scanpy.external as sce
from collections import Counter

### Information based on supplemental files for the donors

1. Adult1 - GSM4742854 - F, 35
2. Adult2 - GSM4742855 - M, 42
3. Adult3 - GSM4742856 - M, 42
4. Young1 - GSM4742860 - M, 4
5. Young2 - GSM4742861 - F, 10
6. Young3 - GSM4742862 - M, 14
7. Fetal1 - GSM4742857 - M, -19 (19 gestational weeks)
8. Fetal2 - GSM4742858 - M, -19 (19 gestational weeks)
9. Fetal3 - GSM4742859 - F, -20 (20 gestational weeks)

In [2]:
directory_path = os.getcwd()

subdirectories = [name for name in os.listdir(directory_path) if os.path.isdir(os.path.join(directory_path, name)) 
                  and not name.startswith('.')]
subdirectories

['GSM4742855_Adult2_processed',
 'GSM4742862_Young3_processed',
 'GSM4742858_Fetal2_processed',
 'GSM4742860_Young1_processed',
 'GSM4742859_Fetal3_processed',
 'GSM4742857_Fetal1_processed',
 'GSM4742856_Adult3_processed',
 'GSM4742854_Adult1_processed',
 'GSM4742861_Young2_processed']

Per the supplemental tables, here are the corresponding ages and sexes, in the same order as the directory listing above.

To keep `age` an integer column, fetal ages are stored as gestational weeks and postnatal ages as years. The `age_status` column distinguishes the two, so check it before comparing ages across donors.

In [3]:
age_list = [42, 14, 19, 4, 20, 19, 42, 35, 10]
sex_list = ["male", "male", "male", "male", "female", "male", "male", "female", "female"]
age_status_list = ['Postnatal', 'Postnatal', 'Fetal', 'Postnatal', 'Fetal', 'Fetal', 'Postnatal', 'Postnatal', 'Postnatal']
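The lists in the cell above are matched by position to the directory listing, and `os.listdir` does not guarantee any particular order, so the pairing could silently break if the listing order changes. One alternative, sketched here, is to key the metadata by the sample label embedded in each directory name. The names `DONOR_METADATA` and `metadata_for` are hypothetical; the values are copied from the donor table above, and the sketch assumes every directory name follows the `<accession>_<label>_processed` pattern seen in the listing.

```python
# Donor metadata from the supplemental table above, keyed by sample label.
# Fetal ages are gestational weeks; postnatal ages are years.
DONOR_METADATA = {
    "Adult1": {"age": 35, "sex": "female", "age_status": "Postnatal"},
    "Adult2": {"age": 42, "sex": "male", "age_status": "Postnatal"},
    "Adult3": {"age": 42, "sex": "male", "age_status": "Postnatal"},
    "Young1": {"age": 4, "sex": "male", "age_status": "Postnatal"},
    "Young2": {"age": 10, "sex": "female", "age_status": "Postnatal"},
    "Young3": {"age": 14, "sex": "male", "age_status": "Postnatal"},
    "Fetal1": {"age": 19, "sex": "male", "age_status": "Fetal"},
    "Fetal2": {"age": 19, "sex": "male", "age_status": "Fetal"},
    "Fetal3": {"age": 20, "sex": "female", "age_status": "Fetal"},
}


def metadata_for(subdir):
    """Look up donor metadata from a name like 'GSM4742855_Adult2_processed'."""
    accession, label, _suffix = subdir.split("_")
    return {"accession": accession, **DONOR_METADATA[label]}


example = metadata_for("GSM4742859_Fetal3_processed")
print(example)
```

With a lookup like this, an unexpected directory name raises an error instead of being paired with the wrong donor.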

### Iteratively load in all of the adata files

In [4]:
%%time
adatas = []

for subdir, age, sex, age_status in zip(subdirectories, age_list, sex_list, age_status_list):
    adata = sc.read_10x_mtx(subdir)
    adata.obs['donor_id'] = subdir  # use the directory name as the donor ID
    adata.obs['age'] = age
    adata.obs['sex'] = sex
    adata.obs['age_status'] = age_status
    
    # add the adata to the growing list
    adatas.append(adata)

CPU times: user 4min 6s, sys: 5.37 s, total: 4min 11s
Wall time: 4min 11s


In [5]:
# since they all use the same features, concatenate all of them together
concatenated_adata = sc.concat(adatas)

  utils.warn_names_duplicates("obs")


In [6]:
concatenated_adata.obs_names = concatenated_adata.obs.donor_id.astype(str) + ":" + concatenated_adata.obs_names

In [7]:
concatenated_adata.obs.head()

| | donor_id | age | sex | age_status |
|---|---|---|---|---|
| GSM4742855_Adult2_processed:AAACCCAAGATCGACG-1 | GSM4742855_Adult2_processed | 42 | male | Postnatal |
| GSM4742855_Adult2_processed:AAACCCAAGCCACTCG-1 | GSM4742855_Adult2_processed | 42 | male | Postnatal |
| GSM4742855_Adult2_processed:AAACCCAGTATCGAGG-1 | GSM4742855_Adult2_processed | 42 | male | Postnatal |
| GSM4742855_Adult2_processed:AAACCCAGTTGGATCT-1 | GSM4742855_Adult2_processed | 42 | male | Postnatal |
| GSM4742855_Adult2_processed:AAACCCAGTTTCGGCG-1 | GSM4742855_Adult2_processed | 42 | male | Postnatal |


In [8]:
Counter(concatenated_adata.obs_names).most_common()[0]

('GSM4742855_Adult2_processed:AAACCCAAGATCGACG-1', 1)
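The check above works because the most common name occurs only once, which means no name is repeated. The toy example below sketches why prefixing each barcode with its donor ID resolves the duplicate-name warning raised at concatenation. The barcodes and donor labels here are made up for illustration and do not come from the dataset.

```python
from collections import Counter

# Hypothetical toy barcodes: the same barcode appears in two donors.
barcodes_by_donor = {
    "donorA": ["AAAC-1", "AAAG-1"],
    "donorB": ["AAAC-1", "AAAT-1"],
}

# Names as they would look after concatenation, before and after prefixing.
raw_names = [bc for bcs in barcodes_by_donor.values() for bc in bcs]
prefixed_names = [
    f"{donor}:{bc}" for donor, bcs in barcodes_by_donor.items() for bc in bcs
]


def max_multiplicity(names):
    """Return the highest number of times any single name occurs."""
    return Counter(names).most_common(1)[0][1]


print(max_multiplicity(raw_names))       # 2
print(max_multiplicity(prefixed_names))  # 1
```

A maximum multiplicity of 1 is equivalent to all names being unique.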

### Add metadata common to all cells

In [12]:
concatenated_adata.obs['region'] = "LV"
concatenated_adata.obs['cell_or_nuclei'] = "Nuclei"
concatenated_adata.obs['technology'] = '3prime-v1'
concatenated_adata.obs['study'] = 'Sim 2021'
concatenated_adata.obs['disease'] = "ND"

In [10]:
adata = concatenated_adata
# store an independent copy of the raw counts so later changes to X do not alter this layer
adata.layers["counts"] = adata.X.copy()

### Save adata

In [13]:
adata.write("01_Sim_adata.h5ad")