This tutorial demonstrates how to pre-process raw single-cell UMI counts into expression matrices that can be used as input to cell-cell communication tools. We assume that appropriate quality control (QC) has already been applied to the dataset (e.g., exclusion of low-quality cells and doublets). For a detailed overview of QC and single-cell RNA-seq analysis pipelines in general, we recommend the tutorial by [Luecken & Theis](https://doi.org/10.15252/msb.20188746) as a starting point.

Here we will focus on:
1. Normalization
2. Interoperability between R and Python

We demonstrate a typical workflow using the popular single-cell analysis software [scanpy](https://scanpy.readthedocs.io/en/stable/) to generate an AnnData object which can be used downstream. We will use a [BALF COVID dataset](https://doi.org/10.1038/s41591-020-0901-9), which contains 12 samples associated with "Healthy Control", "Moderate", or "Severe" COVID contexts.

Details and caveats regarding [batch correction](https://www.nature.com/articles/s41592-018-0254-1), which removes technical variation while preserving biological variation between samples, can be viewed in the additional examples tutorial entitled "S1_Batch_Correction".

In [8]:
import os
import pickle

import scanpy as sc
import pandas as pd
import numpy as np

import sys
sys.path.insert(1, '/home/hratch/Projects/CCC/ccc_protocols/scripts/') # replace with cell2cell
from cell2cell_dev.datasets.load_data import CovidBalf

import warnings
warnings.filterwarnings('ignore')

seed = 888

data_path = '/data3/hratch/ccc_protocols/'

The 12 samples can be downloaded as .h5 files from [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926). The cell metadata can be downloaded from [here](https://raw.githubusercontent.com/zhangzlab/covid_balf/master/all.cell.annotation.meta.txt).

cell2cell has a helper function to download and format the raw UMI counts and metadata:

In [9]:
covid_balf_data = CovidBalf(data_path = os.path.join(data_path, 'raw/covid_balf/'))
# covid_balf_data.download_data()
md, balf_samples = covid_balf_data.format_data()

md.head()

cell_barcode,Sample_ID,sample_new,Context,disease,hasnCoV,cluster,cell_type
AAACCCACAGCTACAT_3,C100,HC3,Healthy_Control,N,N,27,B
AAACCCATCCACGGGT_3,C100,HC3,Healthy_Control,N,N,23,Macrophages
AAACCCATCCCATTCG_3,C100,HC3,Healthy_Control,N,N,6,T
AAACGAACAAACAGGC_3,C100,HC3,Healthy_Control,N,N,10,Macrophages
AAACGAAGTCGCACAC_3,C100,HC3,Healthy_Control,N,N,10,Macrophages


balf_samples is a dictionary whose keys are the sample IDs and whose values are AnnData objects storing the raw UMI counts for each sample.

In [10]:
balf_samples.keys()

dict_keys(['C100', 'C51', 'C52', 'C141', 'C142', 'C144', 'C143', 'C145', 'C146', 'C148', 'C149', 'C152'])

In [11]:
balf_samples['C100']

AnnData object with n_obs × n_vars = 2566 × 16566
    obs: 'Sample_ID', 'Context', 'cell_type'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'

To normalize the raw UMI counts, we recommend log(1+CPM) normalization, as this keeps expression values non-negative and is the expected input for many communication scoring functions.

In [12]:
for adata in balf_samples.values():
    adata.raw = adata # keep the raw UMI counts in .raw (note: .raw is more typically used to store normalized counts)
    sc.pp.normalize_total(adata, target_sum=1e6) # CPM normalize
    sc.pp.log1p(adata) # logarithmize
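
To make the arithmetic behind log(1+CPM) explicit, the following is a minimal pure-Python sketch on a toy count matrix. The toy values and the helper name `log1p_cpm` are illustrative assumptions and are not part of the dataset or of scanpy; the natural logarithm is assumed.

```python
import math

def log1p_cpm(counts, target_sum=1e6):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Guard against empty cells to avoid division by zero
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized

# Toy matrix (hypothetical values): 2 cells (rows) x 3 genes (columns)
toy_counts = [
    [1, 0, 3],
    [0, 2, 2],
]
toy_norm = log1p_cpm(toy_counts)

# Undoing the log recovers the CPM values, which sum to one million per cell
for cell in toy_norm:
    cpm_total = sum(math.expm1(value) for value in cell)
    print(round(cpm_total))
```

Zero counts stay at zero after the transformation, which is why the normalized matrix remains non-negative and sparse.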

In [35]:
balf_samples['C100']

AnnData object with n_obs × n_vars = 2566 × 16566
    obs: 'Sample_ID', 'Context', 'cell_type'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
    uns: 'log1p'

In [13]:
ordered_genes = sorted(balf_samples['C100'].to_df().T.index)
balf_samples['C100'].to_df().T.loc[ordered_genes,:].head()

Unnamed: 0,AAACCCACAGCTACAT-1,AAACCCATCCACGGGT-1,AAACCCATCCCATTCG-1,AAACGAACAAACAGGC-1,AAACGAAGTCGCACAC-1,AAACGAAGTCTATGAC-1,AAACGAAGTGTAGTGG-1,AAACGCTGTCACGTGC-1,AAACGCTGTTGGAGGT-1,AAAGAACTCTAGAACC-1,...,TTTGATCTCCCGAAAT-1,TTTGGAGCAATACAGA-1,TTTGGAGTCACCATAG-1,TTTGGAGTCTCACCCA-1,TTTGGTTAGATGGCGT-1,TTTGGTTGTACCCAGC-1,TTTGGTTGTTACTCAG-1,TTTGTTGAGCTAGAGC-1,TTTGTTGCAATGAAAC-1,TTTGTTGCAGAGGGTT-1
A1BG,0.0,0.0,0.0,3.492637,4.373578,0.0,0.0,0.0,4.111976,0.0,...,0.0,0.0,5.413881,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1BG-AS1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A2M,0.0,0.0,0.0,4.855852,0.0,0.0,0.0,0.0,0.0,6.001009,...,0.0,0.0,5.127682,0.0,0.0,5.041668,0.0,0.0,0.0,5.305098
A2M-AS1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A2ML1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We can save this dictionary for future use in other scripts:

In [14]:
with open(os.path.join(data_path, 'interim/covid_balf_norm.pickle'), 'wb') as handle:
    pickle.dump(balf_samples, handle, protocol=pickle.HIGHEST_PROTOCOL)
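
A downstream script can reload the pickled dictionary by opening the file in binary read mode. The sketch below shows the round trip with a small stand-in dictionary and a temporary directory so that it runs on its own; the placeholder values are assumptions for illustration, and in practice the path would be the one used in the cell above.

```python
import os
import pickle
import tempfile

# Stand-in for balf_samples (placeholder values); any picklable dictionary behaves the same way
example_samples = {'C100': [0.0, 1.5], 'C51': [2.0, 0.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, 'covid_balf_norm.pickle')

    # Write the dictionary, mirroring the cell above
    with open(pickle_path, 'wb') as handle:
        pickle.dump(example_samples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back, as a downstream script would
    with open(pickle_path, 'rb') as handle:
        reloaded_samples = pickle.load(handle)

print(sorted(reloaded_samples.keys()))
```

Note that unpickling AnnData objects requires the anndata package to be importable in the environment that reads the file, and that pickle files should only be loaded from trusted sources.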

# Interoperability

## to R

For use in R, we can save each AnnData object, and then load it in R as a SeuratObject using [SeuratDisk](https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html). See the companion R tutorial for loading these saved files. 

In [58]:
for sample in balf_samples:
    file_name = os.path.join(data_path, 'interim/covid_python_to_R/', sample + '.h5ad')
    adata = balf_samples[sample]
    adata.write_h5ad(file_name)

## from R

Here, we can load the expression matrices that were generated in the companion R script using Seurat and saved as h5ad files:

In [28]:
balf_samples_R = dict()
for sample in balf_samples:
    file_name = os.path.join(data_path, 'interim/covid_R_to_python/', sample + '.h5ad')
    balf_samples_R[sample] = sc.read_h5ad(file_name)

While this AnnData object is not completely identical to the one generated in this script, it stores all the same information. Raw UMI counts are stored in the .raw attribute, relevant metadata is available in .obs, and the log(1+CPM) matrix is stored in .X. We can see that the expression matrix is the same, up to small floating-point differences in the last decimal place:

In [34]:
balf_samples_R['C100']

AnnData object with n_obs × n_vars = 2566 × 16566
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'Sample.ID', 'Context', 'cell.type'
    var: 'features'

In [33]:
ordered_genes = sorted(balf_samples_R['C100'].to_df().T.index)
balf_samples_R['C100'].to_df().T.loc[ordered_genes,:].head()

Unnamed: 0,AAACCCACAGCTACAT-1,AAACCCATCCACGGGT-1,AAACCCATCCCATTCG-1,AAACGAACAAACAGGC-1,AAACGAAGTCGCACAC-1,AAACGAAGTCTATGAC-1,AAACGAAGTGTAGTGG-1,AAACGCTGTCACGTGC-1,AAACGCTGTTGGAGGT-1,AAAGAACTCTAGAACC-1,...,TTTGATCTCCCGAAAT-1,TTTGGAGCAATACAGA-1,TTTGGAGTCACCATAG-1,TTTGGAGTCTCACCCA-1,TTTGGTTAGATGGCGT-1,TTTGGTTGTACCCAGC-1,TTTGGTTGTTACTCAG-1,TTTGTTGAGCTAGAGC-1,TTTGTTGCAATGAAAC-1,TTTGTTGCAGAGGGTT-1
A1BG,0.0,0.0,0.0,3.492637,4.373578,0.0,0.0,0.0,4.111976,0.0,...,0.0,0.0,5.413881,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1BG-AS1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A2M,0.0,0.0,0.0,4.855851,0.0,0.0,0.0,0.0,0.0,6.001009,...,0.0,0.0,5.127682,0.0,0.0,5.041668,0.0,0.0,0.0,5.305098
A2M-AS1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A2ML1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
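
Because individual values can differ in the last printed decimal (for example, A2M in the fourth cell is 4.855852 in the Python table and 4.855851 in the R table), an exact equality test may be too strict. The following is a minimal sketch of a tolerance-based comparison using a few values copied from the two tables; the helper name `matrices_match` and the tolerance of 1e-5 are assumptions chosen for illustration.

```python
import math

# Values for the first five cells, copied from the two tables shown in this tutorial
python_values = {
    'A1BG': [0.0, 0.0, 0.0, 3.492637, 4.373578],
    'A2M': [0.0, 0.0, 0.0, 4.855852, 0.0],
}
r_values = {
    'A1BG': [0.0, 0.0, 0.0, 3.492637, 4.373578],
    'A2M': [0.0, 0.0, 0.0, 4.855851, 0.0],
}

def matrices_match(a, b, abs_tol=1e-5):
    """Return True if both gene -> values mappings agree within `abs_tol`."""
    if a.keys() != b.keys():
        return False
    return all(
        len(a[gene]) == len(b[gene])
        and all(math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a[gene], b[gene]))
        for gene in a
    )

print(matrices_match(python_values, r_values))  # True: differences are within tolerance
print(python_values == r_values)                # False: exact equality is too strict
```

For the full expression matrices, a function such as `numpy.allclose` could play the same role on dense arrays.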
