# Measuring chromatin accessibility in single cells

## Content of this chapter

- Introduction to single cell chromatin accessibility (this notebook NB1)
- Data loading and filtering low quality cells (NB2)
- Doublet detection in scATAC data (NB3)
- Normalisation and dimensionality reduction (NB4)
- Batch correction (NB5)
- Deriving interpretable features, gene activity scores and TF motif enrichment (NB6)
- Clustering and cell annotation (NB7)
- In-depth analysis of regulatory elements including integration of scRNA-seq data (NB8)

## Why analyze chromatin accessibility?

Every cell of an organism shares the same DNA with the same set of functional units referred to as genes. With this in mind, what determines the tremendous diversity of cells, ranging from natural killer cells of the immune system to neurons transmitting electrochemical signals throughout the body? In the previous chapters, we saw that cell identity and function can be inferred from the gene expression profile of each cell. Gene expression is controlled by a complex interplay of regulatory mechanisms such as DNA methylation, histone modifications (which largely determine chromatin accessibility), and transcription factor (TF) activity. For a short introduction and visualization, see [this 2 minute video](https://www.youtube.com/watch?v=XelGO582s4U) on epigenetics and the regulation of gene activity (credits to Nicole Ethen from the [SQE, University of Illinois](https://www.feinberg.northwestern.edu/sites/epigenetics/)). **Format of reference to be confirmed**. For a comprehensive and up-to-date review on genome regulation and TF activity, we refer to {cite:t}`isbel2022generating`.

Taken together, an essential component defining cell identity is the state of regulatory elements in each cell. In this chapter, we focus on chromatin accessibility data measured by the **Single-Cell Assay for Transposase-Accessible Chromatin with High-Throughput Sequencing (scATAC-seq)** or as part of the **10x Multiome assay (scATAC combined with scRNA-seq)**.

After walking you through the preprocessing steps, this analysis will allow us to:
1) characterize cell identity with an approach orthogonal to scRNA-seq analysis
2) identify cell-state-specific transcriptional regulators
3) link gene expression to sequence features
4) disentangle epigenetic mechanisms driving cell differentiation and disease states

After showcasing steps of the unimodal analysis, we will discuss the integration with other modalities.

## The experimental view - single-cell ATAC-seq and the 10x Multiome assay

We anticipate that commercially available kits will be the most widely used experimental protocols and therefore showcase our analysis on data generated with the 10x Multiome assay. With minor changes, the workflow also applies to data generated with the unimodal 10x single-cell ATAC-seq assay.

The key principle used to measure chromatin accessibility is the Assay for Transposase-Accessible Chromatin with High-Throughput Sequencing (ATAC-seq).

**Image below from BioRender, modify to represent the 10x Multiome protocol (nuclei extraction and tagmentation in bulk, followed by ATAC-seq and RNA-seq)**

![alt text](ATAC-seq_overview.png "ATAC-seq_overview")


The starting point is a single-cell suspension of the tissue of interest. Nuclei are extracted and the transposition is performed in bulk using a Tn5 transposase, which binds to open regions in the chromatin and generates tagmented DNA fragments. Nuclei are then loaded onto a 10x Chromium Controller, where droplets containing a gel bead and a single nucleus, also referred to as Gel Beads-in-Emulsion (GEMs), are formed. Within each droplet, RNA molecules and DNA fragments are barcoded. After dissolving the GEMs, the barcoded nucleotide sequences are pre-amplified to obtain the final scATAC-seq and scRNA-seq libraries.

## Data characteristics - feature definition and sparsity 

Single-cell ATAC-seq data measures chromatin accessibility across the entire genome. Since this includes coding and non-coding regions, genes cannot be used as pre-defined features, as is the case in scRNA-seq data. Instead, the most common approach to define biologically meaningful features is to detect regions of accessibility, i.e. peaks in the distribution of fragment counts along the genome. Peaks in coding regions indicate that a gene might be transcribed, while in non-coding regions accessibility is seen as a prerequisite for, or a result of, the binding of regulatory proteins such as transcription factors.
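To make the idea of peak-based features concrete, the following minimal sketch counts, for each cell, how many fragments overlap each peak. All fragments, peaks, barcodes and function names are hypothetical toy values chosen for illustration; in practice, peaks come from a dedicated peak caller and the counting is done by specialised tools on much larger data.

```python
from collections import defaultdict

# Hypothetical toy fragments: (chromosome, start, end, cell barcode).
fragments = [
    ("chr1", 100, 250, "cell_A"),
    ("chr1", 180, 320, "cell_A"),
    ("chr1", 900, 1050, "cell_B"),
    ("chr1", 210, 400, "cell_B"),
    ("chr2", 50, 200, "cell_A"),
]

# Hypothetical peaks as half-open intervals (chromosome, start, end).
peaks = [
    ("chr1", 150, 450),
    ("chr1", 850, 1100),
    ("chr2", 0, 300),
]


def overlaps(frag_start, frag_end, peak_start, peak_end):
    """Return True if two half-open intervals share at least one base."""
    return frag_start < peak_end and peak_start < frag_end


def count_fragments_in_peaks(fragments, peaks):
    """Return counts[barcode][peak_index] = number of overlapping fragments."""
    counts = defaultdict(lambda: defaultdict(int))
    for chrom, start, end, barcode in fragments:
        for idx, (p_chrom, p_start, p_end) in enumerate(peaks):
            if chrom == p_chrom and overlaps(start, end, p_start, p_end):
                counts[barcode][idx] += 1
    return counts


counts = count_fragments_in_peaks(fragments, peaks)

# Print one row per cell, one column per peak.
for barcode in sorted(counts):
    row = [counts[barcode][i] for i in range(len(peaks))]
    print(barcode, row)
```

The nested loop is quadratic and only meant to show the logic; real pipelines rely on indexed interval lookups and sparse matrices.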

**ToDo**
- add sentence on peak calling (incl. cluster specific peaks)
- explain bins as alternative approach (**add figure**)
- mention harmonizing features across multiple samples


Another important characteristic of scATAC-seq data is its high sparsity. In a diploid organism, each cell carries only two copies of the DNA, which results in a maximum of 2 counts for a given position. As a consequence, most entries of the cell-by-feature count matrix are zero. Therefore, some of the tools commonly used in scRNA-seq analysis are less suitable or need to be adjusted for scATAC-seq data.
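As a rough illustration of what sparsity means here, the sketch below computes the fraction of zero entries in a small, made-up cell-by-peak count matrix. It also shows binarization (recording only whether a feature is accessible or not), which is mentioned here as one possible adjustment and is an assumption of this example, not a step prescribed by this chapter. The matrix values and function names are hypothetical.

```python
# Hypothetical toy cell-by-peak count matrix (rows: cells, columns: peaks).
count_matrix = [
    [0, 1, 0, 0, 2, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]


def sparsity(matrix):
    """Return the fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def binarize(matrix):
    """Replace every positive count by 1 (accessible) and keep zeros."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]


print(f"Fraction of zero entries: {sparsity(count_matrix):.2f}")
print("Binarized matrix:", binarize(count_matrix))
```

Real scATAC-seq matrices are far larger and typically even sparser than this toy example, which is why they are usually stored in sparse matrix formats.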


## Overview of the data analysis workflow

**ToDo**