# Normalization

## Motivation

Unlike UMI counts from scRNA-seq, which are sparse and commonly modeled with a negative binomial distribution, ADT data is less sparse and typically shows two peaks: a negative peak from non-specific antibody binding and a positive peak reflecting enrichment of specific cell surface proteins {cite}`Zheng2022`.
Capture efficiency varies from cell to cell due to differences in biophysical properties. Because CITE-seq experiments enrich for a priori selected features, compositional biases are more severe than in scRNA-seq.
As with scRNA-seq data, many approaches to normalization exist.
We cover the two most widely used methods, which require different input data and starting points.

ADT data can be normalized using the Centered Log-Ratio (CLR) transformation {cite}`Stoeckius2017`. A more recent low-level normalization method, DSB (denoised and scaled by background), is tailored to the challenges this modality poses. DSB normalization removes two kinds of noise. First, it uses empty droplets to estimate the ambient background and removes it. Second, it uses each cell's background population mean together with isotype controls (antibodies that bind non-specifically to the cells) to define and remove cell-to-cell technical noise {cite}`Mulè2022`.
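The first of the two DSB steps can be sketched in plain NumPy. This is an illustration under stated assumptions, not the `mu.prot.pp.dsb` implementation: the function name `ambient_correction`, the pseudocount of 10, the use of the sample standard deviation, and the simulated Poisson counts are all chosen for the example. The second step (per-cell technical noise estimated from isotype controls) is left out.

```python
import numpy as np


def ambient_correction(cell_counts, empty_counts, pseudocount=10.0):
    """Rescale log counts by the per-protein background seen in empty droplets.

    Both inputs are (droplets x proteins) arrays of raw ADT counts.
    The pseudocount value is an assumption made for this sketch.
    """
    cells_log = np.log(np.asarray(cell_counts, dtype=float) + pseudocount)
    empty_log = np.log(np.asarray(empty_counts, dtype=float) + pseudocount)
    background_mean = empty_log.mean(axis=0)
    background_sd = empty_log.std(axis=0, ddof=1)
    # Result: number of background standard deviations above the ambient level
    return (cells_log - background_mean) / background_sd


# Simulated data (hypothetical): three proteins, only the second one is
# truly expressed on the cells; the other two are at ambient level.
rng = np.random.default_rng(0)
empty = rng.poisson(lam=[2, 3, 1], size=(500, 3))
cells = rng.poisson(lam=[2, 60, 1], size=(50, 3))

corrected = ambient_correction(cells, empty)
print(corrected.mean(axis=0))
```

After this step, a protein that is present only at ambient level has values near zero, while a truly expressed protein sits many background standard deviations above zero.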

## Environment setup

In [1]:
import muon as mu
import pandas as pd
import warnings

warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)


## Loading the data

In [2]:
raw_mu_path = (
    "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_raw.h5mu"
)
filtered_qc_mu_path = "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_filtered-qc.h5mu"
filtered_norm_mu_path = "/lustre/groups/ml01/workspace/ciro.suastegui/bp2.0/data/neurips_cite_pp_filtered-qc-norm.h5mu"

We load the MuData objects saved in the quality control chapter: the raw, unfiltered object and the filtered object.

In [3]:
%%time
raw = mu.read(raw_mu_path)

CPU times: user 14min 3s, sys: 1min 54s, total: 15min 58s
Wall time: 16min 4s


In [4]:
%%time
filtered = mu.read(filtered_qc_mu_path)

CPU times: user 3.76 s, sys: 1.3 s, total: 5.05 s
Wall time: 6.07 s


In [5]:
filtered

## DSB normalization

We are now ready to normalize the data. Here, we use the distribution of the raw (unfiltered) data as background. We also have isotype controls, which let us define and remove cell-to-cell technical variation.

Isotype controls are antibodies that bind non-specifically to the cells in this study, so we do not expect a significant difference in their abundance between cells. We can therefore use the values of the isotype controls to normalize technical differences.

We call the normalization function `mu.prot.pp.dsb` with the filtered and raw MuData objects as well as the names of the isotype controls.

In [6]:
isotype_controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]
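Before normalizing, it can help to confirm that every isotype control name actually appears among the measured features, since a misspelled name would otherwise only surface later. A minimal sketch follows; `missing_features` is a hypothetical helper and the short `panel` list is made up for illustration.

```python
def missing_features(feature_names, required):
    """Return the required names that do not occur in feature_names."""
    available = set(feature_names)
    return [name for name in required if name not in available]


# Made-up antibody panel for illustration only
panel = ["CD3", "CD19", "Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b"]
controls = ["Mouse-IgG1", "Mouse-IgG2a", "Mouse-IgG2b", "Rat-IgG2b"]

print(missing_features(panel, controls))  # ['Rat-IgG2b']
```

On the real data, the first argument would be `filtered["prot"].var_names`, and an empty list means all controls were found.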

In [7]:
# Keep the raw counts in a layer; copy so later changes to .X cannot alter them
filtered["prot"].layers["counts"] = filtered["prot"].X.copy()

In [8]:
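# Make sure .X holds the raw counts before normalizing (useful when re-running the notebook)
filtered["prot"].X = filtered["prot"].layers["counts"].copy()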

In [9]:
%%time
mu.prot.pp.dsb(filtered, raw, isotype_controls=isotype_controls)

CPU times: user 8min 3s, sys: 28.6 s, total: 8min 31s
Wall time: 8min 32s


Let's have a look at counts before denoising and normalization.

In [10]:
pd.Series(
    filtered["prot"].layers["counts"][:100, :100].toarray().flatten()
).value_counts()

1.0      1090
0.0      1045
2.0       918
3.0       691
4.0       581
         ... 
350.0       1
706.0       1
296.0       1
970.0       1
763.0       1
Length: 524, dtype: int64

After denoising and normalization, the range of values has changed: they are no longer integer counts but continuous values that can be negative.

In [11]:
pd.Series(filtered["prot"].X[:100, :100].flatten()).value_counts()

-1.174030    2
-0.996048    1
 1.722345    1
-0.262355    1
 6.112263    1
            ..
 0.153576    1
 0.285257    1
-0.149485    1
 0.287904    1
-0.263154    1
Length: 9999, dtype: int64

## Centered Log-Ratio normalization

If you don't have the unfiltered data available, you can instead normalize the ADT data with `mu.prot.pp.clr`, which implements **C**entered **L**og-**R**atio normalization. This type of normalization does no denoising. Instead, it assumes that the geometric mean is a good reference against which to express all other values (by dividing by it) {cite}`Quinn_Erb_Richardson_Crowley_2018`. In effect, we take the natural log of the ratio between each protein's count in each cell and a geometric mean computed either across proteins or across cells, depending on the implementation. CLR was originally applied across proteins, but was later applied across cells instead, which makes the normalization less dependent on the antibody panel {cite}`Mulè2022`.
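The difference between the two CLR variants can be made concrete with a small NumPy sketch. It implements the textbook centered log-ratio with a pseudocount of 1 (via `log1p`); this is an assumption for illustration, and library implementations such as `mu.prot.pp.clr` may define the transformation slightly differently. The function name `clr_sketch` and the toy count matrix are hypothetical.

```python
import numpy as np


def clr_sketch(counts, axis=0):
    """Textbook centered log-ratio with a pseudocount of 1.

    counts is a (cells x proteins) array.
    axis=0: the reference is each protein's geometric mean across cells.
    axis=1: the reference is each cell's geometric mean across proteins.
    """
    log_counts = np.log1p(np.asarray(counts, dtype=float))
    # Subtracting the mean of the logs equals dividing by the geometric mean
    return log_counts - log_counts.mean(axis=axis, keepdims=True)


# Toy data: 3 cells (rows) x 3 proteins (columns)
counts = np.array(
    [
        [0, 5, 100],
        [2, 7, 300],
        [1, 6, 50],
    ]
)

across_cells = clr_sketch(counts, axis=0)
across_proteins = clr_sketch(counts, axis=1)
print(across_cells)
print(across_proteins)
```

With `axis=0`, each protein is centered on its own average across cells, so adding or removing other antibodies does not change its values. With `axis=1`, every value depends on which other proteins are in the panel.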

In [12]:
filtered

In [13]:
filtered.write(filtered_norm_mu_path)

... storing 'feature_types' as categorical


## Key takeaways

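* DSB normalization uses the unfiltered data to estimate and remove ambient background noise, and isotype controls to remove cell-to-cell technical noise.
* If the unfiltered data is not available, CLR normalization is an alternative, but it does not denoise the data.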

## References

```{bibliography}
:filter: docname in docnames
```

## Contributors

We gratefully acknowledge the contributions of:

### Authors

* Daniel Strobl
* Ciro Ramírez-Suástegui

### Reviewers

* Lukas Heumos