In [1]:
#pip install tf-nightly
#pip install tfp-nightly

Recommendations for single-cell analysis

The DESeq2 developers and collaborating groups have published recommendations for the best use of DESeq2 for single-cell datasets, which have been described first in Van den Berge et al. (2018). Default values for DESeq2 were designed for bulk data and will not be appropriate for single-cell datasets. These settings and additional improvements have also been tested subsequently and published in Zhu, Ibrahim, and Love (2018) and Ahlmann-Eltze and Huber (2020).

    Use test="LRT" for significance testing when working with single-cell data, over the Wald test. This has been observed across multiple single-cell benchmarks.
    Set the following DESeq arguments to these values: useT=TRUE, minmu=1e-6, and minReplicatesForReplace=Inf. The default setting of minmu was benchmarked on bulk RNA-seq and is not appropriate for single cell data when the expected count is often much less than 1.
    The default size factors are not optimal for single cell count matrices, instead consider setting sizeFactors from scran::computeSumFactors.
    One important concern for single-cell data analysis is the size of the datasets and associated processing time. To address the speed concerns, DESeq2 provides an interface to glmGamPoi, which implements faster dispersion and parameter estimation routines for single-cell data (Ahlmann-Eltze and Huber 2020). To use this feature, set fitType = "glmGamPoi". Alternatively, one can use glmGamPoi as a standalone package. This provides the additional option to process data on-disk if the full dataset does not fit in memory, a quasi-likelihood framework for differential testing, and the ability to form pseudobulk samples (more details how to use glmGamPoi are in its README).

Optionally, one can consider using the zinbwave package to directly model the zero inflation of the counts, and take account of these in the DESeq2 model. This allows for the DESeq2 inference to apply to the part of the data which is not due to zero inflation. Not all single cell datasets exhibit zero inflation, and instead may just reflect low conditional estimated counts (conditional on cell type or cell state).There is example code for combining zinbwave and DESeq2 package functions in the zinbwave vignette. We also have an example of ZINB-WaVE + DESeq2 integration using the splatter package for simulation at the zinbwave-deseq2 GitHub repository.


Can I use DESeq2 to analyze paired samples?

Yes, you should use a multi-factor design which includes the sample information as a term in the design formula. This will account for differences between the samples while estimating the effect due to the condition. The condition of interest should go at the end of the design formula, e.g. ~ subject + condition.
If I have multiple groups, should I run all together or split into pairs of groups?

Typically, we recommend users to run samples from all groups together, and then use the contrast argument of the results function to extract comparisons of interest after fitting the model using DESeq.

The model fit by DESeq estimates a single dispersion parameter for each gene, which defines how far we expect the observed count for a sample will be from the mean value from the model given its size factor and its condition group. See the section above and the DESeq2 paper for full details. Having a single dispersion parameter for each gene is usually sufficient for analyzing multi-group data, as the final dispersion value will incorporate the within-group variability across all groups.

However, for some datasets, exploratory data analysis (EDA) plots could reveal that one or more groups has much higher within-group variability than the others. A simulated example of such a set of samples is shown below. This is case where, by comparing groups A and B separately – subsetting a DESeqDataSet to only samples from those two groups and then running DESeq on this subset – will be more sensitive than a model including all samples together. It should be noted that such an extreme range of within-group variability is not common, although it could arise if certain treatments produce an extreme reaction (e.g. cell death). Again, this can be easily detected from the EDA plots such as PCA described in this vignette.

Here we diagram an extreme range of within-group variability with a simulated dataset. Typically, it is recommended to run DESeq across samples from all groups, for datasets with multiple groups. However, this simulated dataset shows a case where it would be preferable to compare groups A and B by creating a smaller dataset without the C samples. Group C has much higher within-group variability, which would inflate the per-gene dispersion estimate for groups A and B as well:

In [2]:
#%reset

In [15]:
import scanpy as sc
import seaborn as sns
import pandas as pd
import numpy as np
import anndata
import itertools
import gc
from diffexpr.py_deseq import py_DESeq2
from rpy2.robjects import Formula

In [7]:
q = sc.read_h5ad('../../atlas/Atlas_adatas_June2021_Atlas_final_May2021.h5ad')

In [8]:
def build_design(q, qci):
    # build design matrix
    patient_ids = ([x[0:3] for x in qci.obs.samplename])
    full_sample_df = pd.DataFrame({'patient':patient_ids, 'biosample':qci.obs.samplename, 'dx':qci.obs.diagnosis})
    # get the number of cells from each sample
    cell_counts = pd.DataFrame(full_sample_df.biosample.value_counts())
    cell_counts.columns = ['cell_counts']
    cell_counts['biosample'] = cell_counts.index
    # merge in the cell counts
    full_sample_df = full_sample_df.merge(cell_counts)
    # the list of biosamples in this cluster
    biosample_list = list(set(full_sample_df.biosample))
    # and the order of cells as index
    index = np.array(full_sample_df.biosample.tolist())
    # then we make the design matrix
    sample_df = full_sample_df.drop_duplicates()
    sample_df.loc[:,'binned_cell_counts'] = pd.cut(sample_df.cell_counts, bins=[0,5,10,20,40,80,160,100000]) #((sample_df.cell_counts - np.mean(sample_df.cell_counts)) / np.std(sample_df.cell_counts))
    return( (biosample_list, index, sample_df) )

In [9]:
def build_count_matrix(biosample_list, index, qci, sample_df):
    # sum within samples
    res0 = pd.DataFrame()
    for bsl in biosample_list:
        idx = np.argwhere(index == bsl).flatten()
        mat = qci.X[idx,:].sum(axis=0)
        cnt_sum = mat.flatten().tolist()[0]
        if len(res0) == 0:
            res0 = pd.DataFrame(cnt_sum, columns=[bsl])
        else:
            res0 = res0.join(pd.DataFrame(cnt_sum, columns=[bsl]))
    count_matrix = res0.loc[:, sample_df.biosample.tolist()]
    count_matrix['id'] = qci.var.index.tolist()
    count_matrix.index = qci.var.index
    return(count_matrix)

In [10]:
cellclusters = dict(
    epithelial=['0','3','4','6','8','13','19','20','28','31','33','34','35','36','38','39','40'], # 78049 cells, # 32 is heptoid
    fibroblasts=['7'],
    myofibroblasts=['12'],
    endothelial=['12','15','5', '30', '26'],
    stromal=['7','12','15','5', '30', '26'],
    neutrophils=['22'],
    Bcells=['11','23'],
    monocytes=['12'],  # and macs and dcs
    cd4_Tcells=['2'],
    cd8_Tcells=['1','25'],
    NKcells=['14'],
    mastcells=['9'],
    gastric=['4','6']   
)

In [11]:
for leiden_label in cellclusters.keys():
    print(leiden_label)

epithelial
fibroblasts
myofibroblasts
endothelial
stromal
neutrophils
Bcells
monocytes
cd4_Tcells
cd8_Tcells
NKcells
mastcells
gastric


In [17]:
clusterlabs = list(set(q.obs.leiden))
clusterlabs.sort()
res_df = pd.DataFrame()
for leiden_label in cellclusters.keys(): #clusterlabs:
    # subset the anndata to this cluster
    print('leiden cluster: ' + leiden_label)
    ### subset data to this cluster
    clusterlabels = cellclusters[leiden_label]
    qci = q[q.obs.leiden.isin(clusterlabels)]
    qci = qci[qci.obs.diagnosis.isin(['NE', 'NS', 'M', 'D', 'T'])]
    ###
    (biosample_list, index, sample_df) = build_design(q, qci)
    try:
        # building the pseudobulk count matrix
        count_matrix = build_count_matrix(biosample_list, index, qci, sample_df)
        sample_df.binned_cell_counts = [str(x) for x in sample_df.binned_cell_counts]
        #fit a deseq2 model
        #https://bioconductor.riken.jp/packages/3.6/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#group-specific-condition-effects-individuals-nested-within-groups
        print(' .... running deseq2...')
        ###
        dds = py_DESeq2(count_matrix = count_matrix,
                       design_matrix = sample_df,
                       design_formula = '~ patient + binned_cell_counts + dx', #'~ patient + cell_counts + dx',
                       gene_column = 'id') # <- telling DESeq2 this should be the gene ID column
        params = dict(test='LRT', reduced=Formula('~patient + binned_cell_counts'), useT=True, minmu=1e-6, minReplicatesForReplace=np.Inf) # 
        dds.run_deseq(**params) 
        ### then pulling out log2FC
        dxs = list(set(sample_df.dx))
        for (dx1, dx2) in list(itertools.product(dxs, dxs)):
            if dx1 != dx2:
                dds.get_deseq_result(contrast = ['dx',dx1,dx2])
                de_df = dds.deseq_result 
                # add the additional items
                de_df['celltype'] = leiden_label
                #sig_res = res_df[(res_df.padj < 0.05) & (abs(res_df.log2FoldChange) > 1) & (res_df.baseMean > 2)]
                #sig_res[ '_'.join(['dx','D','T']) ] = sig_res.log2FoldChange
                de_df.to_csv('deseq2_out/deseq2_batch_'+str(leiden_label)+'_'+dx1+'_'+dx2+'.csv')
    except:
        print('error: ' + leiden_label)
        print(qci)
        print('')
    del qci
    gc.collect()


leiden cluster: epithelial


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using contrast: ['dx', 'D', 'M']
INFO:DESeq2:Using contrast: ['dx', 'D', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'D', 'T']
INFO:DESeq2:Using contrast: ['dx', 'D'

leiden cluster: fibroblasts
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: myofibroblasts
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: endothelial
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: stromal
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using contrast: ['dx', 'D', 'M']
INFO:DESeq2:Using contrast: ['dx', 'D', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'D', 'T']
INFO:DESeq2:Using contrast: ['dx', 'D'

leiden cluster: neutrophils
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: Bcells
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: monocytes
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: cd4_Tcells
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using contrast: ['dx', 'D', 'M']
INFO:DESeq2:Using contrast: ['dx', 'D', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'D', 'T']
INFO:DESeq2:Using contrast: ['dx', 'D'

leiden cluster: cd8_Tcells
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: NKcells
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: mastcells
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using c

leiden cluster: gastric
 .... running deseq2...



  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters




  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters


  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters



INFO:DESeq2:Using contrast: ['dx', 'M', 'D']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'M', 'T']
INFO:DESeq2:Using contrast: ['dx', 'M', 'NE']
INFO:DESeq2:Using contrast: ['dx', 'D', 'M']
INFO:DESeq2:Using contrast: ['dx', 'D', 'NS']
INFO:DESeq2:Using contrast: ['dx', 'D', 'T']
INFO:DESeq2:Using contrast: ['dx', 'D'