Skip to content

Commit

Permalink
Merge branch 'feat/improve-tutorial' into 'master'
Browse files Browse the repository at this point in the history
Improve tutorial

Closes #44 and #20

See merge request icbi-lab/pipelines/singlecell_tcr!29

Former-commit-id: b3f160194abd86189b6978f468dfc9766834d601
  • Loading branch information
grst committed Mar 30, 2020
2 parents ecb7591 + 705579d commit 975dbc0
Show file tree
Hide file tree
Showing 14 changed files with 308 additions and 138 deletions.
1 change: 1 addition & 0 deletions docs/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@ Tools: `tl`
clonotype_network
chain_pairing
clip_and_count
clonal_expansion
alpha_diversity
group_abundance
spectratype
Expand Down
4 changes: 0 additions & 4 deletions docs/features.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ Welcome to scirpy's documentation!
:caption: Contents:

introduction
features
tcr-biology
tutorials
api
bibliography
Expand Down
14 changes: 14 additions & 0 deletions docs/tcr-biology.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
.. _tcr-model:

Our T-cell receptor model
=========================

.. warning::
This section is under construction.

A clonotype designates a collection of T or B cells that bear the same adaptive immune receptors, and thus can be regarded as descendants of a common antecedent cell, recognizing the same epitope. Applying this definition to single-cell TCR sequencing datasets would mean that cells sharing identical sequences of both alpha and beta TCRs make up a clonotype. Contrary to what would be expected based on the previously described mechanism of allelic exclusion, single-cell sequencing datasets feature a considerable number of cells with more than one adaptive immune receptor sequences.
Since their existence violates current canon (Brady et al. 2010), most TCR analysis tools ignore cells with more than one sequence (Fischer et al. ; Zhang et al. 2018) or take the sequence with the highest expression level as the only valid choice (Afik et al. 2017). While in some cases these cells might indeed represent artifacts (e.g. doublets of a CD8+ and a CD4+ T cell engaged in an immunological synapse), there is an increasing amount of evidence in support of a dual TCR population (Schuldt and Binstadt 2019; Ji et al. 2010). Given their abundance in empirical data, we are convinced that instead of ignorance, an analysis tool should at least offer the choice of including this elusive cell type.
Scirpy attempts to address this problem by introducing a T cell model (similar to the one suggested by clonotype networks in TraCeR (Stubbington et al. 2016)), where T cells are allowed to have a primary and a secondary pair of alpha and beta chains. The primary pair consists of the alpha chain with the highest read count and the beta chain with the highest read count. Likewise, the secondary pair is the pair of chains with the second-highest expression level. If more than two variants of a chain are recovered for the same cell, those are ignored based on the assumption that each cell has only two copies of the chromosome set. These cells are also flagged as 'multichain' cells and can later be discarded from downstream analysis. The user can also choose if secondary TCRs shall be included in the analysis or not.
Another decision when implementing a clonotype definition is whether it should be based on the nucleotide or amino acid sequence. Although nucleotide sequence reflects common origin, epitope reactivity is determined by the amino acid sequence. This has made the amino acid sequence of the CDR3 region a common choice for clonotype definition by existing tools already and Scirpy conforms to this consensus. Given both the nucleotide and amino acid sequences of the CDR3 regions, it also possible to analyse the probability of convergent evolution of TCR sequences toward a given amino acid sequence as a result of selection pressure by epitopes.
Clonotypes, as well as individual cells can be grouped together based on shared CDR3 amino acid sequences (despite differences at the DNA level), shared primary or secondary chain pairs, or (as carried out by TraCeR (Stubbington et al. 2016)), based on shared alpha or shared beta chains. The biological relevance of such networks would be the clustering of cells that recognize the same epitope. Inspired by recent papers stating the similar TCR sequences also share epitope targets (Glanville et al. 2017; Dash et al. 2017; Fischer et al. ), we aimed at implementing a more general approach, where different layers of TCR sequence similarity can be integrated into a global, epitope-focused cell similarity network. The possibility of linking similar clonotypes together also provides an opportunity to correct for information loss due to sequencing depth, possibly resulting in recovering of only an alpha or a beta chain for some cells.
Clonotype clustering in scirpy is largely based on pairwise sequence alignment of CDR3 regions by Parasail (Daily 2016)...
161 changes: 100 additions & 61 deletions docs/tutorials/tutorial_3k_tcr.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,19 +6,17 @@ jupyter:
text_representation:
extension: .md
format_name: markdown
format_version: "1.2"
format_version: '1.2'
jupytext_version: 1.4.1
---

# Analysis of 3k T cells from cancer

<!-- #raw raw_mimetype="text/restructuredtext" -->

In this tutorial, we re-analize single-cell TCR/RNA-seq data from Wu et al (:cite:`Wu2020`)
generated on the 10x Genomics VDJ platform. The original dataset consists of >140k T cells
generated on the 10x Genomics platform. The original dataset consists of >140k T cells
from 14 treatment-naive patients across four different types of cancer.
For this tutorial, to speed up computations, we use a downsampled version of 3k cells.

<!-- #endraw -->

```python
Expand All @@ -34,42 +32,59 @@ import scanpy as sc
from matplotlib import pyplot as plt
```

<!-- #raw raw_mimetype="text/restructuredtext" -->
```python
sc.logging.print_versions()
```

<!-- #raw raw_mimetype="text/restructuredtext" -->
The dataset ships with the `scirpy` package. We can conveniently load it from the `dataset` module:

<!-- #endraw -->

```python
adata = ir.datasets.wu2020_3k()
```

<!-- #raw raw_mimetype="text/restructuredtext" -->

`adata` is a regular :class:`~anndata.AnnData` object:

<!-- #endraw -->

```python
adata.shape
```

<!-- #raw raw_mimetype="text/restructuredtext" -->
.. note:: **T cell receptors**

For more information about our T-cell receptor model, see :ref:`tcr-model`.
<!-- #endraw -->


It just has additional TCR-related columns in `obs`:

* `has_tcr`: `True` for all cells with a T-cell receptor
* `TRA_1_<attr>`/`TRA_2_<attr>`: columns related to the primary and secondary TCR-alpha chain
* `TRB_1_<attr>`/`TRB_2_<attr>`: columns related to the primary and secondary TCR-beta chain

The list of attributes available are:

* `c_gene`, `v_gene`, `d_gene`, `j_gene`: The gene symbols of the respective genes
* `cdr3` and `cdr3_nt`: The amino acoid and nucleotide sequences of the CDR3 regions
* `junction_ins`: The number of nucleotides inserted in the `VD`/`DJ`/`VJ` junctions.

```python
adata.obs
```

<!-- #raw raw_mimetype="text/restructuredtext" -->

.. note:: **Importing data**

`scirpy` supports importing TCR data from Cellranger or `TraCeR <https://github.com/Teichlab/tracer>`\_.
See :ref:`api-io` for more details.
`scirpy` supports importing TCR data from `Cellranger <https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger>`_ (10x)
or `TraCeR <https://github.com/Teichlab/tracer>`_. (SMARTseq2).
See :ref:`api-io` for more details.

This particular dataset has been imported using :func:`scirpy.read_10x_vdj_csv` and merged
with transcriptomics data using :func:`scirpy.pp.merge_with_tcr`. The exact procedure
is described in :func:`scirpy.datasets.wu2020`.
This particular dataset has been imported using :func:`scirpy.read_10x_vdj_csv` and merged
with transcriptomics data using :func:`scirpy.pp.merge_with_tcr`. The exact procedure
is described in :func:`scirpy.datasets.wu2020`.

<!-- #endraw -->

Expand Down Expand Up @@ -132,19 +147,18 @@ sc.pl.umap(adata, color=["sample", "patient", "cluster", "CD8A", "CD4", "FOXP3"]
## TCR Quality Control

<!-- #raw raw_mimetype="text/restructuredtext" -->

While most T cell receptors have exactely one pair of α and β chains, it has been reported that up to one third of
the receptors can be _Dual TCRs_ posessing a second pair of receptors, one originating from each Allele (:cite:`Schuldt2019`).
While most of T cell receptors have exactly one pair of α and β chains, up to one third of
T cells can have *dual TCRs*, i.e. two pairs of receptors originating from different alleles (:cite:`Schuldt2019`).

Using the :func:`scirpy.tl.chain_pairing` function, we can add a summary
about the T cell receptor compositions to `adata.obs`. We can visualize it using :func:`scirpy.pl.group_abundance`.

.. note:: **chain pairing**

- _Orphan chain_ refers to cells that have a single alpha or beta receptor chain, respectively.
- _Extra chain_ refers to cells that have a full alpha/beta receptor pair, and an additional chain.
- _Multichain_ refers to cells with more than two receptor pairs detected. These cells are likely doublets.
<!-- #endraw -->
- *Orphan chain* refers to cells that have either a single alpha or beta receptor chain.
- *Extra chain* refers to cells that have a full alpha/beta receptor pair, and an additional chain.
- *Multichain* refers to cells with more than two receptor pairs detected. These cells are likely doublets.
<!-- #endraw -->

```python
ir.tl.chain_pairing(adata)
Expand All @@ -156,7 +170,7 @@ ir.pl.group_abundance(
)
```

Indeed, in our case, ~20% of cells have more than a one pair of T-cell receptors:
Indeed, in this dataset, ~20% of cells have more than a one pair of T-cell receptors:

```python
print("Fraction of cells with more than one pair of TCRs: {:.2f}".format(
Expand All @@ -180,48 +194,45 @@ adata = adata[adata.obs["multi_chain"] != "True", :].copy()

In this section, we will define and visualize clonotypes.

_Scirpy_ implements a network-based approach for clonotype definition. The steps to
create and visualize the clonotype-network are anologous to the construction of a
neighborhood graph we already know from transcriptomics data:
*Scirpy* implements a network-based approach for clonotype definition. The steps to create and visualize the clonotype-network are analogous to the construction of a neighborhood graph from transcriptomics data with *scanpy*.

.. list-table:: Analysis steps on transcriptomics data
:widths: 40 60
:header-rows: 1

- - scanpy function
- objective
- - :func:`scanpy.pp.neighbors`
- Compute a nearest-neighbor graph based on gene expression.
- - :func:`scanpy.tl.leiden`
- Cluster cells by the similarity of their transcriptional profiles.
- - :func:`scanpy.tl.umap`
- Compute positions of cells in UMAP embedding.
- - :func:`scanpy.pl.umap`
- Plot UMAP colored by different parameters.
:widths: 40 60
:header-rows: 1

* - scanpy function
- objective
* - :func:`scanpy.pp.neighbors`
- Compute a nearest-neighbor graph based on gene expression.
* - :func:`scanpy.tl.leiden`
- Cluster cells by the similarity of their transcriptional profiles.
* - :func:`scanpy.tl.umap`
- Compute positions of cells in UMAP embedding.
* - :func:`scanpy.pl.umap`
- Plot UMAP colored by different parameters.

.. list-table:: Analysis steps on TCR data
:widths: 40 60
:header-rows: 1

- - scirpy function
- objective
- - :func:`scirpy.pp.tcr_neighbors`
- Compute a neighborhood graph of CDR3-sequences.
- - :func:`scirpy.tl.define_clonotypes`
- Cluster cells by the similarity of their CDR3-sequences.
- - :func:`scirpy.tl.clonotype_network`
- Compute positions of cells in clonotype network.
- - :func:`scirpy.pl.clonotype_network`
- Plot clonotype network colored by different parameters.
:widths: 40 60
:header-rows: 1

- - scirpy function
- objective
- - :func:`scirpy.pp.tcr_neighbors`
- Compute a neighborhood graph of CDR3-sequences.
- - :func:`scirpy.tl.define_clonotypes`
- Cluster cells by the similarity of their CDR3-sequences.
- - :func:`scirpy.tl.clonotype_network`
- Compute positions of cells in clonotype network.
- - :func:`scirpy.pl.clonotype_network`
- Plot clonotype network colored by different parameters.

<!-- #endraw -->

### Compute CDR3 neighborhood graph

<!-- #raw raw_mimetype="text/restructuredtext" -->

:func:`scirpy.pp.tcr_neighbors` computes pairwise sequence alignments of all CDR3 sequences and
derives a distance from the alignment score. This approach was originally proposed as _TCRdist_ by Dash et al. (:cite:`TCRdist`).
:func:`scirpy.pp.tcr_neighbors` computes the pairwise sequence alignment of all CDR3 sequences and
derives a distance from the alignment score. This approach was originally proposed as *TCRdist* by Dash et al. (:cite:`TCRdist`).

The function requires to specify a `cutoff` parameter. All cells with a distance between their
CDR3 sequences lower than `cutoff` will be connected in the network. In the first example,
Expand All @@ -230,7 +241,6 @@ we set the cutoff to `0`, to define clontypes as cells with **identical** CDR3 s
Then, the function :func:`scirpy.tl.define_clonotypes` will detect connected modules
in the graph and annotate them as clonotypes. This will add a `clonotype` and
`clonotype_size` column to `adata.obs`.

<!-- #endraw -->

```python
Expand All @@ -251,15 +261,19 @@ ir.tl.clonotype_network(adata, min_size=2)
ir.pl.clonotype_network(adata, color="clonotype", legend_loc="none")
```

Let's re-compute the network with a `cutoff` of `20`.
That's the equivalent of 4 `R`s mutating into `N` (using the BLOSUM62 distance matrix).
Let's re-compute the network with a `cutoff` of `15`.
That's the equivalent of 3 `R`s mutating into `N` (using the BLOSUM62 distance matrix).

Additionally, we set `chains` to `all`. This results in the distances not being only
computed between the most abundant pair of T-cell receptors, but instead, will
take the minimal distance between any pair of T-cell receptors.

```python
ir.pp.tcr_neighbors(adata, cutoff=20, chains="all")
sc.settings.verbosity = 4
```

```python
ir.pp.tcr_neighbors(adata, cutoff=15, chains="all")
ir.tl.define_clonotypes(adata)
```

Expand All @@ -284,18 +298,37 @@ it is shared across tissues and/or patients.
ir.pl.clonotype_network(adata, color="sample")
```

Next, visualize the clonal expansion by cell-type cluster
## Clonotype analysis

Let's visualize the number of expanded clonotypes (i.e. clonotypes consisting
of more than one cell) by cell-type:

```python
ir.tl.clip_and_count(adata, groupby="cluster", target_col="clonotype")
```

```python
sc.pl.umap(adata, color=["clonotype_clipped_count", "cluster"])
```

```python
ir.pl.clip_and_count(adata, groupby="cluster", target_col="clonotype", fraction=True)
```

```python
ir.pl.clonal_expansion(adata, groupby="cluster", clip_at=4, fraction=False)
```

Normalized to the cluster size
Normalized to the cluster size:

```python
ir.pl.clonal_expansion(adata, "cluster")
```

Expectedly, the CD8+ effector T cells have the largest fraction of expanded clonotypes.

Consistent with this observation, they have the lowest alpha diversity:

```python
ir.pl.alpha_diversity(adata, groupby="cluster")
```
Expand All @@ -304,7 +337,13 @@ ir.pl.alpha_diversity(adata, groupby="cluster")

```python
ir.pl.group_abundance(
adata, groupby="clonotype", target_col="cluster", max_cols=10, fraction=False
adata, groupby="clonotype", target_col="cluster", max_cols=10, fraction="patient"
)
```

```python
ir.pl.group_abundance(
adata, groupby="clonotype", target_col="patient", max_cols=10, fraction="patient"
)
```

Expand Down
41 changes: 35 additions & 6 deletions scirpy/_plotting/_clip_and_count.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,35 @@
from .. import tl
from anndata import AnnData
from . import base
from .._util import _normalize_counts, _is_na


def _prepare_df(adata, groupby, target_col, clip_at, fraction):
"""Turn the result of the `clip_and_count` tool into a plottable
dataframe"""
tmp_col = target_col + "clipped_count"
tmp_col_weight = target_col + "weight"

obs = adata.obs.loc[:, [groupby, target_col]]
obs[tmp_col] = tl.clip_and_count(
adata, target_col, groupby=groupby, clip_at=clip_at, inplace=False
)
# filter NA values
obs = obs.loc[~_is_na(obs[target_col]), :]

# add normalization vector
size_vector = _normalize_counts(obs, fraction, groupby)
obs[tmp_col_weight] = size_vector

obs = (
obs.groupby([groupby, tmp_col], observed=True)[tmp_col_weight]
.sum()
.reset_index()
.pivot(index=groupby, columns=tmp_col, values=tmp_col_weight)
.fillna(0)
)

return obs


def clip_and_count(
Expand All @@ -12,8 +41,10 @@ def clip_and_count(
fraction: bool = True,
**kwargs,
):
"""Plots the the number of identical entries in `target_col`
for each group in `group_by`.
"""Plots the the *number of cells* in `target_col` that fall into
a certain count bin for each group in `group_by`.
Removes all entries with `NaN` in `target_col` prior to plotting.
Parameters
----------
Expand All @@ -32,11 +63,9 @@ def clip_and_count(
**kwargs
Additional arguments passed to :meth:`base.bar`
"""
res = tl.clip_and_count(
adata, groupby, target_col, clip_at=clip_at, fraction=fraction
)
plot_df = _prepare_df(adata, groupby, target_col, clip_at, fraction)

return base.bar(res, **kwargs)
return base.bar(plot_df, **kwargs)


def clonal_expansion(
Expand Down
2 changes: 1 addition & 1 deletion scirpy/_plotting/_diversity.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ def alpha_diversity(
**kwargs
Additional parameters passed to :meth:`pl.base.bar`
"""
diversity = tl.alpha_diversity(adata, groupby, target_col=target_col)
diversity = tl.alpha_diversity(adata, groupby, target_col=target_col, inplace=False)
default_style_kws = {
"title": "Alpha diversity of {} by {}".format(target_col, groupby),
"ylab": "Shannon entropy",
Expand Down
Loading

0 comments on commit 975dbc0

Please sign in to comment.