# Pre-processing single-cell RNA-seq data for GRN inference

Rome, 30/11/2022 Jonathan Fiorentino

NOTE: In the processed_data folder we provide the pre-processed data and the PseudoTime values for the datasets, i.e. the output of this notebook.

In this notebook we perform the pre-processing and the pseudotime computation for single-cell RNA-seq data from the HepG2 and K562 cell lines, to be used for gene regulatory network (GRN) inference. We use datasets obtained with different sequencing protocols (full-length and droplet-based). See the Methods section of the manuscript for further details.

The SCAN-seq2 data are processed in a different notebook.

Data sources:

HepG2

- [GSE150993](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE150993), [publication](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-07744-6). Protocol: Smart-seq2. File with the TPM matrix: GSE150993_HepG2_Gene_TPM.csv. Note that we corrected some gene names that Excel had converted to dates (see [here](https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates) for details).

- [GSM5677000](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5677000), [publication](https://www.cell.com/iscience/pdf/S2589-0042(22)01395-5.pdf). Protocol: 10x. These are single-cell RNA-seq data from a large study that also includes CITE-seq and scATAC-seq. File with the UMI count matrix: GSM5677000_scCite_HepG2_RNA.txt

K562

- [GSM1599500](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1599500), [publication](https://www.sciencedirect.com/science/article/pii/S0092867415005000?via%3Dihub). Protocol: CEL-Seq. File with the UMI count matrix: GSM1599500_K562_cells.csv

- [GSE181544](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE181544), [preprint](https://www.biorxiv.org/content/10.1101/2022.03.14.484332v3). Protocol: STORM-seq. We use the library with 1M reads. File with the TPM matrix: GSE181544_storm_k562_1M_reads_tpm.txt

- [E-MTAB-11467](https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11467#), [publication1](https://www.nature.com/articles/s41587-020-0497-0), [publication2](https://www.nature.com/articles/s41587-022-01311-4). Protocol: Smart-seq3. File with the UMI count matrix: K562_Smart_seq3_umi_counts.txt

For Smart-seq3 and STORM-seq the authors provided the matrices with Ensembl gene IDs. We provide an R script to map them to gene names using Ensembl 107.
The resulting files are saved as GSE181544_storm_k562_1M_reads_tpm_gnames.txt and K562_Smart_seq3_umi_counts_gnames.txt
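
The Excel issue mentioned for the HepG2 Smart-seq2 matrix turns symbols such as MARCH1 or SEPT2 into dates (`1-Mar`, `2-Sep`). The snippet below is only a minimal sketch of how such names could be restored. The helper `restore_excel_symbol` and the list of affected gene families are illustrative assumptions, not the exact correction applied to GSE150993_HepG2_Gene_TPM.csv.

```python
import re

# Month abbreviations produced by Excel, mapped to the legacy symbol prefix.
# Assumption: only these three gene families are affected.
_PREFIX = {"Mar": "MARCH", "Sep": "SEPT", "Dec": "DEC"}
_DATE_LIKE = re.compile(r"^(\d{1,2})-(Mar|Sep|Dec)$")


def restore_excel_symbol(name):
    """Return the legacy gene symbol for a date-like name, else the name itself."""
    match = _DATE_LIKE.match(name)
    if match is None:
        return name
    number, month = match.groups()
    return f"{_PREFIX[month]}{int(number)}"


# Caveat: the conversion is lossy (e.g. MARC1 and MARCH1 both become '1-Mar'),
# so ambiguous names still need a manual check.
for raw in ["1-Mar", "2-Sep", "1-Dec", "GAPDH"]:
    print(raw, "->", restore_excel_symbol(raw))
```

The restored names are legacy symbols; they may still need to be converted to the current nomenclature, which is what the gene name conversion step of this notebook is for.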

In [None]:
%matplotlib inline

In [None]:
import scanpy as sc
import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

In [None]:
HepG2_folder='../HepG2/'
K562_folder='../K562/'
input_folder='./'

In [None]:
# Create the output folder if it does not exist yet
os.makedirs(input_folder, exist_ok=True)

In [None]:
sc.__version__

# Load and pre-process data

## Utils for gene name conversion

We convert the gene names in all datasets to match Ensembl release 107, the release used for the catRAPID predictions and for the processed eCLIP data.

In [None]:
# Load the HepG2 Smart-seq2 data
HepG2_smart=ad.read_csv(HepG2_folder+'Smart-seq2/GSE150993_HepG2_Gene_TPM.csv')
HepG2_smart=HepG2_smart.transpose()
print(HepG2_smart)

# Load the HepG2 10x data
HepG2_10x=ad.read_csv(HepG2_folder+'10x/GSM5677000_scCite_HepG2_RNA.txt',delimiter='\t')
HepG2_10x=HepG2_10x.transpose()
print(HepG2_10x)

In [None]:
# Load the K562 CEL-seq data
K562_CEL=ad.read_csv(K562_folder+'./CEL-seq/GSM1599500_K562_cells.csv')
K562_CEL=K562_CEL.transpose()
print(K562_CEL)

# Load the K562 STORM-seq data
K562_STORM_1M=ad.read_csv(K562_folder+'./GSE181544/GSE181544_storm_k562_1M_reads_tpm_gnames.txt')
K562_STORM_1M=K562_STORM_1M.transpose()
print(K562_STORM_1M)

# Load the K562 Smart-seq3 data
K562_SMART3=ad.read_csv(K562_folder+'./Smart-seq3/K562_Smart_seq3_umi_counts_gnames.txt')
K562_SMART3=K562_SMART3.transpose()
print(K562_SMART3)

In [None]:
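# Load the fasta file with the canonical isoforms
from Bio import SeqIO
gname=[]
gid=[]
# NOTE: the 'U' flag of open() was removed in Python 3.11, so we open the file with plain 'r'
with open("/Users/jonathan/Desktop/IIT/INTERACTomics/ENCODE_eCLIP_DATA/transcriptomes/hsapiens_gene_ensembl_107_canonical_new.fa", "r") as f_open:
    for rec in SeqIO.parse(f_open, "fasta"):
        # The record ID is '|'-separated: field 0 is the Ensembl gene ID, field 4 the gene name
        myid = rec.id
        gname.append(myid.split('|')[4])
        gid.append(myid.split('|')[0])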
In [None]:
# Only the STORM-seq library with 1M reads is loaded in this notebook
adata_list=[HepG2_smart,HepG2_10x,K562_CEL,K562_STORM_1M,K562_SMART3]
labels=['HepG2_smartseq2','HepG2_10x','K562_CEL_seq','K562_STORM_seq_1M','K562_Smart_seq3']

In [None]:
# For each dataset, find the gene names that are missing from the Ensembl 107 fasta
# out_dir=os.getcwd()+'/missing_genes/'
# if os.path.isdir(out_dir)==False:
#     os.mkdir(out_dir)

gname_set=set(gname)
missing_genes_list=[]
for (lab,adata) in zip(labels,adata_list):
    # Label, number of reference names, number of genes in the dataset, number of shared names
    print(lab,len(gname),len(adata.var_names),
      len(gname_set.intersection(adata.var_names)))
    missing=list(set(adata.var_names)-gname_set)
    missing_genes_list.append(missing)
#     np.savetxt(out_dir+'missing'+lab+'.txt',np.c_[missing],fmt='%s')
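
The missing-gene computation is a plain set difference. A toy sketch, with made-up gene lists and a hypothetical helper `missing_genes`, could look like this:

```python
def missing_genes(var_names, reference_names):
    """Names present in the dataset but absent from the reference annotation."""
    return sorted(set(var_names) - set(reference_names))


# Toy inputs (not taken from the real datasets)
reference_names = ["TP53", "GAPDH", "AARS1"]
var_names = ["TP53", "AARS", "GAPDH", "ERCC-00002"]

missing = missing_genes(var_names, reference_names)
print(missing)
```

Here the outdated symbol and the spike-in are reported as missing, which is the kind of name that the mapping tables loaded in the next cell are meant to resolve.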

In [None]:
out_dir='../missing_genes/'

# Load the mapping gene name/ENSEMBL GENE ID for each cell type
mapping_HepG2_smart=pd.read_csv(out_dir+'mapping_HepG2_smartseq2.csv',index_col=0)
mapping_HepG2_10x=pd.read_csv(out_dir+'mapping_HepG2_10x.csv',index_col=0)


mapping_K562_CEL=pd.read_csv(out_dir+'mapping_K562_CEL_seq.csv',index_col=0)
# mapping_K562_STORM_1M=pd.read_csv(out_dir+'mapping_K562_STORM_seq_1M.csv',index_col=0)
# mapping_K562_SMART3=pd.read_csv(out_dir+'mapping_K562_Smart_seq3.csv',index_col=0)

mapping_HepG2_smart.index=[i.replace('.','-',1) if i.replace('.','-',1) in missing_genes_list[0] else i for i in mapping_HepG2_smart.index]
mapping_HepG2_10x.index=[i.replace('.','-',1) if i.replace('.','-',1) in missing_genes_list[1] else i for i in mapping_HepG2_10x.index]

mapping_K562_CEL.index=[i.replace('.','-',1) if i.replace('.','-',1) in missing_genes_list[2] else i for i in mapping_K562_CEL.index]
# mapping_K562_STORM_1M.index=[i.replace('.','-',1) if i.replace('.','-',1) in missing_genes_list[5] else i for i in mapping_K562_STORM_1M.index]
# mapping_K562_SMART3.index=[i.replace('.','-',1) if i.replace('.','-',1) in missing_genes_list[6] else i for i in mapping_K562_SMART3.index]
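
The list comprehensions above restore a dash in names where it appears as a dot in the mapping tables. We assume here that the dots were introduced when the tables were written (for instance by R name checking, which replaces `-` with `.`); this is a guess, and the helper `restore_dash` and the toy names are hypothetical.

```python
def restore_dash(name, missing):
    """Replace the first '.' with '-' only if the result is a known missing gene."""
    candidate = name.replace('.', '-', 1)
    return candidate if candidate in missing else name


# Toy inputs: names missing from the reference, and the index of a mapping table
missing = {"HLA-A", "NKX2-1"}
index = ["HLA.A", "NKX2.1", "RP11.34P13.7", "TP53"]

restored = [restore_dash(name, missing) for name in index]
print(restored)
```

Since only the first dot is replaced, a name whose original form contains more than one dash would not be restored by this rule.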



In [None]:
def map_gene_names(adata,mapping):
    """Rename the genes of adata to the gene names of the Ensembl 107 fasta.

    mapping is a DataFrame indexed by the old gene name, with the
    retrieved Ensembl gene ID in the column 'V1'.
    """
    print('before')
    print(len(gname),len(adata.var_names),
      len(set(gname).intersection(set(adata.var_names))))
    n_not_found=0      # Ensembl gene IDs that are absent from the fasta
    no_name_genes=[]   # genes without a name in the fasta (they get the Ensembl gene ID)
    mylist=list(adata.var_names)

    # Manually update two symbols that have been renamed
    for i in range(len(mylist)):
        if mylist[i]=='AARS':
            mylist[i]='AARS1'
        if mylist[i]=='TROVE2':
            mylist[i]='RO60'
    for i in range(len(mapping.index)):
        # Retrieved ENSEMBL gene ID
        mygene_id=mapping.loc[mapping.index[i],'V1']

        # Find the corresponding gene name in the fasta from gencodeV41
        if mygene_id in gid:
            new_gene_name=gname[gid.index(mygene_id)]
            if new_gene_name=='':
                new_gene_name=mygene_id
                no_name_genes.append(mygene_id)

            # Find the index of the old gene name in var_names
            idx=mylist.index(mapping.index[i])
            # Rename only if this does not create a duplicated gene name
            if new_gene_name not in mylist:
                mylist[idx]=new_gene_name
        else:
            n_not_found+=1
    adata.var_names=mylist
    print('after')
    print(len(gname),len(adata.var_names),
      len(set(gname).intersection(set(adata.var_names))))
    print('-'*50)
    return adata
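
The renaming rule of `map_gene_names` can be illustrated on plain lists. This is only a sketch: `rename_symbols`, the toy gene names and the toy IDs are hypothetical, and dictionaries replace the `gid`/`gname` lists and the mapping table.

```python
def rename_symbols(var_names, old_to_id, id_to_name):
    """Rename old symbols via their gene ID, without creating duplicated names."""
    names = list(var_names)
    for old_name, gene_id in old_to_id.items():
        if gene_id not in id_to_name:
            continue  # gene ID absent from the reference: keep the old name
        # Fall back to the gene ID when the reference has an empty gene name
        new_name = id_to_name[gene_id] or gene_id
        if new_name not in names:
            names[names.index(old_name)] = new_name
    return names


# Toy inputs (hypothetical names and IDs)
var_names = ["OLD1", "TP53", "OLD2", "OLD3"]
old_to_id = {"OLD1": "ENSG_TOY_0001", "OLD2": "ENSG_TOY_0002", "OLD3": "ENSG_TOY_0003"}
id_to_name = {"ENSG_TOY_0001": "NEW1", "ENSG_TOY_0002": "", "ENSG_TOY_0003": "TP53"}

renamed = rename_symbols(var_names, old_to_id, id_to_name)
print(renamed)
```

`OLD3` keeps its name because renaming it would duplicate `TP53`, mirroring the `if new_gene_name not in mylist` check in the function above.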

## HepG2

### Smart-seq2

In [None]:
# Convert the gene names
HepG2_smart=map_gene_names(HepG2_smart,mapping_HepG2_smart)

# Subset only the live cells (copy to avoid working on a view of the AnnData object)
live_cells=HepG2_smart.obs_names.str.contains('live')
HepG2_smart=HepG2_smart[live_cells].copy()

# Remove spike-ins
# Spike-ins genes
HepG2_smart.var['sp'] = HepG2_smart.var_names.str.startswith('ERCC-')

keep = np.invert(HepG2_smart.var['sp'])
HepG2_smart=HepG2_smart[:,keep].copy()
print(HepG2_smart)

# Keep the genes detected in at least 10 cells
sc.pp.filter_genes(HepG2_smart, min_cells=10)
print(HepG2_smart)

# Assign the data before the log-transformation to the raw attribute of the adata object
HepG2_smart.raw=HepG2_smart

print(HepG2_smart.X.shape,HepG2_smart.raw.X.shape)

# The matrix is already in TPM, so we only log-transform it
sc.pp.log1p(HepG2_smart)

In [None]:
# Load the table with gene biotypes
biotype_df=pd.read_csv(HepG2_folder+'Smart-seq2/HepG2_gene_biotype.csv',index_col=None)

In [None]:
biotype_df.gene_biotype=biotype_df.gene_biotype.astype('category')

In [None]:
# Flag miRNA, rRNA, scRNA, scaRNA, snRNA and snoRNA genes, plus the SNAR-A, SNAR-B and SNORD families
miRNA=HepG2_smart.var_names.isin(biotype_df[biotype_df.gene_biotype=='miRNA']['hgnc_symbol'])
print('miRNA',sum(miRNA))
rRNA=HepG2_smart.var_names.isin(biotype_df[biotype_df.gene_biotype=='rRNA']['hgnc_symbol'])
print('rRNA',sum(rRNA))
scRNA=HepG2_smart.var_names.isin(biotype_df[biotype_df.gene_biotype=='scRNA']['hgnc_symbol'])
print('scRNA',sum(scRNA))
scaRNA=HepG2_smart.var_names.isin(biotype_df[biotype_df.gene_biotype=='scaRNA']['hgnc_symbol'])
print('scaRNA',sum(scaRNA))
snRNA=HepG2_smart.var_names.isin(biotype_df[biotype_df.gene_biotype=='snRNA']['hgnc_symbol'])
print('snRNA',sum(snRNA))
snoRNA=HepG2_smart.var_names.isin(biotype_df[biotype_df.gene_biotype=='snoRNA']['hgnc_symbol'])
print('snoRNA',sum(snoRNA))
snara=HepG2_smart.var_names.str.startswith('SNAR-A')
print('SNAR-A',sum(snara))
snarb=HepG2_smart.var_names.str.startswith('SNAR-B')
print('SNAR-B',sum(snarb))
snord=HepG2_smart.var_names.str.startswith('SNORD')
print('SNORD',sum(snord))

In [None]:
# Union of all the boolean masks: a gene is removed if it belongs to any of the classes
remove_HepG2 = np.logical_or.reduce([miRNA, rRNA, scRNA, scaRNA, snRNA, snoRNA,
                                     snara, snarb, snord])
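
Combining the masks amounts to a logical OR, and `np.add` on boolean arrays behaves in the same way. The sketch below checks this on a toy list of gene names (made up for illustration) and then inverts the mask to keep the remaining genes.

```python
import numpy as np

# Toy gene names (not taken from the real dataset)
var_names = np.array(["MIR21", "SNORD3A", "TP53", "SNAR-A1", "GAPDH"])

is_mirna = np.isin(var_names, ["MIR21"])
is_snord = np.char.startswith(var_names, "SNORD")
is_snara = np.char.startswith(var_names, "SNAR-A")

# Union of the masks with an explicit logical OR
remove = np.logical_or.reduce([is_mirna, is_snord, is_snara])

# Chained np.add on boolean arrays gives the same mask
remove_add = np.add(np.add(is_mirna, is_snord), is_snara)
print(np.array_equal(remove, remove_add))

keep = ~remove
print(var_names[keep])
```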

In [None]:
keep_HepG2 = np.invert(remove_HepG2)
HepG2_smart = HepG2_smart[:,keep_HepG2].copy()
HepG2_smart

In [None]:
print(len(HepG2_smart.var_names),len(set(HepG2_smart.var_names)))

### 10x

In [None]:
HepG2_10x=map_gene_names(HepG2_10x,mapping_HepG2_10x)

# Remove spike-ins if present
HepG2_10x.var['sp'] = HepG2_10x.var_names.str.startswith('ERCC-')

keep = np.invert(HepG2_10x.var['sp'])
HepG2_10x=HepG2_10x[:,keep].copy()
print(HepG2_10x)

sc.pp.filter_genes(HepG2_10x, min_cells=int(0.01*HepG2_10x.n_obs))
print(HepG2_10x)
HepG2_10x_for_ARACNe=HepG2_10x.copy()

HepG2_10x.raw=HepG2_10x
sc.pp.normalize_total(HepG2_10x,inplace=True)

sc.pp.log1p(HepG2_10x)
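
For the UMI-based datasets the counts are normalized per cell and then log-transformed. With its default `target_sum=None`, `sc.pp.normalize_total` rescales each cell to the median of the total counts per cell. The pure-Python sketch below reproduces this arithmetic on a toy matrix; `normalize_and_log` is a hypothetical helper and it assumes that every cell has a non-zero total.

```python
import math
from statistics import median


def normalize_and_log(counts):
    """Scale each cell (row) to the median total count, then apply log(1 + x)."""
    totals = [sum(cell) for cell in counts]
    target = median(totals)
    return [[math.log1p(x * target / total) for x in cell]
            for cell, total in zip(counts, totals)]


# Toy count matrix: 3 cells x 2 genes, with totals 4, 20 and 8 (median 8)
counts = [[1, 3], [10, 10], [0, 8]]
logged = normalize_and_log(counts)

# Undoing the log shows that every cell now sums to the median total
for row in logged:
    print(round(sum(math.expm1(v) for v in row), 6))
```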

In [None]:
HepG2_10x.var_names_make_unique()

## K562

### CEL-seq

In [None]:
cell_names=['cell_'+str(i) for i in range(K562_CEL.n_obs)]
K562_CEL.obs_names=cell_names

K562_CEL=map_gene_names(K562_CEL,mapping_K562_CEL)

# Remove spike-ins
# Spike-ins genes
K562_CEL.var['sp'] = K562_CEL.var_names.str.startswith('ERCC-')

keep = np.invert(K562_CEL.var['sp'])
K562_CEL=K562_CEL[:,keep].copy()
print(K562_CEL)

sc.pp.filter_genes(K562_CEL, min_cells=int(0.1*K562_CEL.n_obs))
print(K562_CEL)

K562_CEL.raw=K562_CEL
sc.pp.normalize_total(K562_CEL,inplace=True)

K562_CEL_for_ARACNe=K562_CEL.copy()

sc.pp.log1p(K562_CEL)

### STORM-seq

In [None]:
# Remove spike-ins
# Spike-ins genes
K562_STORM_1M.var['sp'] = K562_STORM_1M.var_names.str.startswith('ERCC-')

keep = np.invert(K562_STORM_1M.var['sp'])
K562_STORM_1M=K562_STORM_1M[:,keep].copy()
print(K562_STORM_1M)

sc.pp.filter_genes(K562_STORM_1M, min_cells=int(0.1*K562_STORM_1M.n_obs))
print(K562_STORM_1M)

K562_STORM_1M.raw=K562_STORM_1M
# sc.pp.normalize_total(K562_STORM_1M,inplace=True)
sc.pp.log1p(K562_STORM_1M)

### Smart-seq3

In [None]:
# Remove spike-ins
# Spike-ins genes
K562_SMART3.var['sp'] = K562_SMART3.var_names.str.startswith('ERCC-')

keep = np.invert(K562_SMART3.var['sp'])
K562_SMART3=K562_SMART3[:,keep].copy()
print(K562_SMART3)

sc.pp.filter_genes(K562_SMART3, min_cells=int(0.1*K562_SMART3.n_obs))
print(K562_SMART3)

K562_SMART3.raw=K562_SMART3
sc.pp.normalize_total(K562_SMART3,inplace=True)

K562_SMART3_for_ARACNe = K562_SMART3.copy()

sc.pp.log1p(K562_SMART3)

## Remove mitochondrial genes

In [None]:
def FilterMito(adata):
    """Return a copy of adata without the mitochondrial genes (names starting with 'MT-')."""
    mito_genes = adata.var_names.str.startswith('MT-')
    genes_to_keep = np.invert(mito_genes)
    print('before',adata)
    adata = adata[:,genes_to_keep].copy()
    print('after',adata)
    return adata
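
The filter relies on the `MT-` prefix, dash included. A toy sketch on a plain list (the helper `filter_mito` is hypothetical) shows why the dash matters: nuclear genes whose symbol merely starts with `MT` are kept.

```python
def filter_mito(names):
    """Drop the names that start with the mitochondrial prefix 'MT-'."""
    return [name for name in names if not name.startswith('MT-')]


names = ["MT-CO1", "MT-ND1", "MTOR", "MT1X", "GAPDH"]
print(filter_mito(names))
```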

In [None]:
HepG2_smart=FilterMito(HepG2_smart)
HepG2_10x=FilterMito(HepG2_10x)
K562_CEL=FilterMito(K562_CEL)
K562_STORM_1M=FilterMito(K562_STORM_1M)
K562_SMART3=FilterMito(K562_SMART3)

# Diffusion pseudotime

## HepG2

### Smart-seq2

In [None]:
# NOTE: when n_top_genes is set, scanpy ignores the max_mean cutoff
sc.pp.highly_variable_genes(HepG2_smart,max_mean=10,n_top_genes=2000)  #calculate highly variable genes
HepG2_smart_high = HepG2_smart[:,HepG2_smart.var['highly_variable']].copy()  #select only highly variable genes (copy, not a view)
sc.pp.scale(HepG2_smart_high,max_value=10)
sc.tl.pca(HepG2_smart_high,svd_solver='arpack')
sc.pl.pca_overview(HepG2_smart_high)

In [None]:
sc.pp.neighbors(HepG2_smart_high, n_neighbors=10, n_pcs=10)
sc.tl.umap(HepG2_smart_high)
sc.tl.leiden(HepG2_smart_high)
sc.pl.umap(HepG2_smart_high,color='leiden')

In [None]:
HepG2_smart_high.uns['iroot'] = np.argmin(HepG2_smart_high.obsm['X_umap'][:,1])

# Create the diffusion map
sc.tl.diffmap(HepG2_smart_high)

# Run Diffusion Pseudotime without branching events (n_branchings=0, the default)
sc.tl.dpt(HepG2_smart_high)

# Grab the output and store in our metadata DataFrame
HepG2_smart_high.obs['dpt'] = HepG2_smart_high.obs['dpt_pseudotime']
HepG2_smart_high.obs.head()
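
Diffusion pseudotime needs a root cell, stored as an index in `uns['iroot']`. In this notebook the root is the cell at an extreme of one UMAP coordinate (here the minimum of the second one). The sketch below shows the selection on a toy embedding; `root_index` is a hypothetical helper that returns the same index as `np.argmin`/`np.argmax` on the chosen column.

```python
def root_index(embedding, axis, use_max=False):
    """Index of the cell with the smallest (or largest) value along one axis."""
    values = [point[axis] for point in embedding]
    pick = max if use_max else min
    # list.index returns the first occurrence, like np.argmin/np.argmax
    return values.index(pick(values))


# Toy 2D embedding of three cells (made-up coordinates)
embedding = [(0.5, 2.0), (-1.2, 0.3), (3.1, -0.7)]

print(root_index(embedding, axis=1))                # minimum of the second coordinate
print(root_index(embedding, axis=0))                # minimum of the first coordinate
print(root_index(embedding, axis=0, use_max=True))  # maximum of the first coordinate
```

Since the root is defined on the UMAP coordinates, the selected cell may change if the embedding is recomputed with a different random seed or software version.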

### 10x

In [None]:
sc.pp.highly_variable_genes(HepG2_10x,max_mean=10,n_top_genes=2000)  #calculate highly variable genes
HepG2_10x_high = HepG2_10x[:,HepG2_10x.var['highly_variable']].copy()  #select only highly variable genes (copy, not a view)
sc.pp.scale(HepG2_10x_high,max_value=10)
sc.tl.pca(HepG2_10x_high,svd_solver='arpack')
sc.pl.pca_overview(HepG2_10x_high)

In [None]:
sc.pp.neighbors(HepG2_10x_high, n_pcs=10)
sc.tl.umap(HepG2_10x_high)
sc.tl.leiden(HepG2_10x_high)
sc.pl.umap(HepG2_10x_high,color='leiden')

In [None]:
HepG2_10x_high.uns['iroot'] = np.argmin(HepG2_10x_high.obsm['X_umap'][:,0])

# Create the diffusion map
sc.tl.diffmap(HepG2_10x_high)

# Run Diffusion Pseudotime without branching events (n_branchings=0, the default)
sc.tl.dpt(HepG2_10x_high)

# Grab the output and store in our metadata DataFrame
HepG2_10x_high.obs['dpt'] = HepG2_10x_high.obs['dpt_pseudotime']
# HepG2_10x_high.obs['dpt_branch'] = HepG2_10x_high.obs['dpt_groups'].astype(int)
HepG2_10x_high.obs.head()

### Check the overlap between the highly variable genes

In [None]:
def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(set(list1)) + len(set(list2))) - intersection
    return float(intersection) / union
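
As a quick check of the Jaccard index (size of the intersection divided by the size of the union), here is a worked example on toy gene lists. The function is repeated so that the snippet runs on its own.

```python
def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = (len(set(list1)) + len(set(list2))) - intersection
    return float(intersection) / union


hvg_a = ["A", "B", "C"]
hvg_b = ["B", "C", "D"]

# Intersection {B, C} has 2 genes, union {A, B, C, D} has 4 genes
print(jaccard_similarity(hvg_a, hvg_b))
# Identical lists give the maximum value
print(jaccard_similarity(hvg_a, hvg_a))
```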

In [None]:
import seaborn as sns

protocols=['Smart-seq2','10x']
ct='HepG2'
adatas=[HepG2_smart_high,HepG2_10x_high]

jaccard=np.zeros((len(adatas),len(adatas)))

# Pairwise Jaccard index between the sets of highly variable genes
for i,adata1 in enumerate(adatas):
    for j,adata2 in enumerate(adatas):
        jaccard[i,j]=jaccard_similarity(list(adata1.var_names),list(adata2.var_names))

In [None]:
# Upper triangle (diagonal included) of the Jaccard matrix
matrix = np.triu(jaccard)

fig,ax =plt.subplots()
ax.set_title('HepG2')
# Use the upper triangle as mask, so that only the entries below the diagonal are drawn
sns.heatmap(jaccard, annot=True, mask=matrix,ax=ax,
           xticklabels=protocols,yticklabels=protocols)
plt.show()
plt.close()

## K562

### CEL-seq

In [None]:
sc.pp.highly_variable_genes(K562_CEL,max_mean=10,n_top_genes=2000)  #calculate highly variable genes
K562_CEL_high = K562_CEL[:,K562_CEL.var['highly_variable']].copy()  #select only highly variable genes (copy, not a view)
sc.pp.scale(K562_CEL_high,max_value=10)
sc.tl.pca(K562_CEL_high,svd_solver='arpack')
sc.pl.pca_overview(K562_CEL_high)

In [None]:
sc.pp.neighbors(K562_CEL_high, n_neighbors=15, n_pcs=10)
sc.tl.umap(K562_CEL_high)
sc.pl.umap(K562_CEL_high)

In [None]:
sc.tl.leiden(K562_CEL_high)
sc.pl.umap(K562_CEL_high,color='leiden')

In [None]:
K562_CEL_high.uns['iroot'] = np.argmin(K562_CEL_high.obsm['X_umap'][:,0])

# Create the diffusion map
sc.tl.diffmap(K562_CEL_high)

# Run Diffusion Pseudotime without branching events (n_branchings=0, the default)
sc.tl.dpt(K562_CEL_high)

# Grab the output and store in our metadata DataFrame
K562_CEL_high.obs['dpt'] = K562_CEL_high.obs['dpt_pseudotime']
K562_CEL_high.obs.head()

### STORM-seq

In [None]:
sc.pp.highly_variable_genes(K562_STORM_1M,max_mean=10,n_top_genes=2000)  #calculate highly variable genes
K562_STORM_high = K562_STORM_1M[:,K562_STORM_1M.var['highly_variable']].copy()  #select only highly variable genes (copy, not a view)
sc.pp.scale(K562_STORM_high,max_value=10)
sc.tl.pca(K562_STORM_high,svd_solver='arpack')
sc.pl.pca_overview(K562_STORM_high)

In [None]:
# Drop one outlier cell that lies very far from the others in the PCA
# (the cell with the largest value of the first principal component)
outlier = K562_STORM_1M.obs_names[np.argmax(K562_STORM_high.obsm['X_pca'][:,0])]
K562_STORM_1M = K562_STORM_1M[K562_STORM_1M.obs_names!=outlier].copy()

In [None]:
sc.pp.highly_variable_genes(K562_STORM_1M,max_mean=10,n_top_genes=2000)  #calculate highly variable genes
K562_STORM_high = K562_STORM_1M[:,K562_STORM_1M.var['highly_variable']].copy()  #select only highly variable genes (copy, not a view)
sc.pp.scale(K562_STORM_high,max_value=10)
sc.tl.pca(K562_STORM_high,svd_solver='arpack')
sc.pl.pca_overview(K562_STORM_high)

In [None]:
sc.pp.neighbors(K562_STORM_high, n_neighbors=15, n_pcs=10)
sc.tl.umap(K562_STORM_high)
sc.pl.umap(K562_STORM_high)

In [None]:
sc.tl.leiden(K562_STORM_high)
sc.pl.umap(K562_STORM_high,color='leiden')

In [None]:
K562_STORM_high.uns['iroot'] = np.argmax(K562_STORM_high.obsm['X_umap'][:,1])

# Create the diffusion map
sc.tl.diffmap(K562_STORM_high)

# Run Diffusion Pseudotime without branching events (n_branchings=0, the default)
sc.tl.dpt(K562_STORM_high)

# Grab the output and store in our metadata DataFrame
K562_STORM_high.obs['dpt'] = K562_STORM_high.obs['dpt_pseudotime']
K562_STORM_high.obs.head()

### Smart-seq3

In [None]:
# NOTE: we remove the genes differentially expressed between the treatments
DE_genes=np.loadtxt(K562_folder+"K562_Smartseq3_DE_genes_treatment.txt",dtype=str)
print(len(DE_genes))

print(K562_SMART3.n_vars)
DE_genes_set = set(DE_genes)  # set membership is much faster than rebuilding a list for every gene
nonDEgenes = [name for name in K562_SMART3.var_names if name not in DE_genes_set]
print(len(nonDEgenes))
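
Removing the differentially expressed genes is a filter that preserves the original gene order. A minimal sketch on toy lists, with a hypothetical helper `drop_genes`:

```python
def drop_genes(var_names, genes_to_drop):
    """Keep the genes that are not in genes_to_drop, preserving their order."""
    drop = set(genes_to_drop)
    return [name for name in var_names if name not in drop]


# Toy inputs (not the real DE gene list)
var_names = ["GATA1", "TP53", "HBG1", "GAPDH"]
de_genes = ["HBG1", "GATA1", "NOT_IN_DATA"]

kept = drop_genes(var_names, de_genes)
print(kept)
```

Genes of the DE list that are absent from the dataset are simply ignored.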

In [None]:
K562_SMART3=K562_SMART3[:,nonDEgenes].copy()
K562_SMART3_for_ARACNe=K562_SMART3_for_ARACNe[:,nonDEgenes].copy()

In [None]:
sc.pp.highly_variable_genes(K562_SMART3,max_mean=10,n_top_genes=2000)  #calculate highly variable genes
K562_SMART3_high = K562_SMART3[:,K562_SMART3.var['highly_variable']].copy()  #select only highly variable genes (copy, not a view)
sc.pp.scale(K562_SMART3_high,max_value=10)
sc.tl.pca(K562_SMART3_high,svd_solver='arpack')
sc.pl.pca_overview(K562_SMART3_high)

In [None]:
sc.pp.neighbors(K562_SMART3_high, n_neighbors=15, n_pcs=10)
sc.tl.umap(K562_SMART3_high)
sc.pl.umap(K562_SMART3_high)

In [None]:
sc.tl.leiden(K562_SMART3_high)
sc.pl.umap(K562_SMART3_high,color='leiden')

In [None]:
K562_SMART3_high.uns['iroot'] = np.argmax(K562_SMART3_high.obsm['X_umap'][:,0])

# Create the diffusion map
sc.tl.diffmap(K562_SMART3_high)

# Run Diffusion Pseudotime without branching events (n_branchings=0, the default)
sc.tl.dpt(K562_SMART3_high)

# Grab the output and store in our metadata DataFrame
K562_SMART3_high.obs['dpt'] = K562_SMART3_high.obs['dpt_pseudotime']
K562_SMART3_high.obs.head()

### Check the overlap between the highly variable genes

In [None]:
def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(set(list1)) + len(set(list2))) - intersection
    return float(intersection) / union

In [None]:
import seaborn as sns

protocols=['CEL-seq','STORM-seq','Smart-seq3']
ct='K562'
adatas=[K562_CEL_high,K562_STORM_high,K562_SMART3_high]

jaccard=np.zeros((len(adatas),len(adatas)))

# Pairwise Jaccard index between the sets of highly variable genes
for i,adata1 in enumerate(adatas):
    for j,adata2 in enumerate(adatas):
        jaccard[i,j]=jaccard_similarity(list(adata1.var_names),list(adata2.var_names))

In [None]:
# Upper triangle (diagonal included) of the Jaccard matrix
matrix = np.triu(jaccard)

fig,ax =plt.subplots()
ax.set_title('K562')
# Use the upper triangle as mask, so that only the entries below the diagonal are drawn
sns.heatmap(jaccard, annot=True, mask=matrix,ax=ax,
           xticklabels=protocols,yticklabels=protocols)
plt.show()
plt.close()

## Save pseudotime data

In [None]:
pseudo_folder=input_folder+'PseudoTime/'

# Create the folder for the pseudotime files if it does not exist yet
os.makedirs(pseudo_folder, exist_ok=True)

### HepG2

In [None]:
pseudo_df=pd.DataFrame(data=HepG2_smart_high.obs['dpt'], index=HepG2_smart_high.obs_names)
pseudo_df.to_csv(pseudo_folder+'HepG2_Smartseq2_PseudoTime.csv')

pseudo_df=pd.DataFrame(data=HepG2_10x_high.obs['dpt'], index=HepG2_10x_high.obs_names)
pseudo_df.to_csv(pseudo_folder+'HepG2_10x_PseudoTime.csv')

### K562

In [None]:
pseudo_df=pd.DataFrame(data=K562_CEL_high.obs['dpt'], index=K562_CEL_high.obs_names)
pseudo_df.to_csv(pseudo_folder+'K562_CELseq_PseudoTime.csv')

pseudo_df=pd.DataFrame(data=K562_STORM_high.obs['dpt'], index=K562_STORM_high.obs_names)
pseudo_df.to_csv(pseudo_folder+'K562_STORMseq1M_PseudoTime.csv')

pseudo_df=pd.DataFrame(data=K562_SMART3_high.obs['dpt'], index=K562_SMART3_high.obs_names)
pseudo_df.to_csv(pseudo_folder+'K562_Smartseq3seq_PseudoTime.csv')
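
The saved files have one row per cell, with the cell name in the first column and the pseudotime in a column named `dpt`. The sketch below reads such a table with the standard library and orders the cells by pseudotime. The CSV content is a made-up example, and we assume that the header of the index column is empty, as pandas writes it for an unnamed index.

```python
import csv
import io

# Made-up content mimicking the layout of a saved pseudotime file
example_csv = ",dpt\ncell_0,0.42\ncell_1,0.0\ncell_2,1.0\n"

reader = csv.reader(io.StringIO(example_csv))
header = next(reader)  # ['', 'dpt']
pseudotime = {cell: float(value) for cell, value in reader}

# Cells ordered from the root (pseudotime 0) to the end of the trajectory
ordered_cells = sorted(pseudotime, key=pseudotime.get)
print(ordered_cells)
```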

## Save the processed data for gene selection

In [None]:
proc_folder=input_folder+'processed_data/'
# Create the folder for the processed data if it does not exist yet
os.makedirs(proc_folder, exist_ok=True)

In [None]:
HepG2_smart.write_h5ad(proc_folder+'processed_HepG2_Smartseq2.h5ad')
HepG2_10x.write_h5ad(proc_folder+'processed_HepG2_10x.h5ad')

K562_CEL.write_h5ad(proc_folder+'processed_K562_CELseq.h5ad')
K562_CEL_for_ARACNe.write_h5ad(proc_folder+'processed_K562_CELseq_ARACNe.h5ad')

K562_STORM_1M.write_h5ad(proc_folder+'processed_K562_STORMseq1M.h5ad')
K562_SMART3.write_h5ad(proc_folder+'processed_K562_Smartseq3.h5ad')
K562_SMART3_for_ARACNe.write_h5ad(proc_folder+'processed_K562_Smartseq3_ARACNe.h5ad')