# Augurpy

This is a short tutorial demonstrating augurpy, the Python implementation of the [Augur R package](https://github.com/neurorestore/Augur), based on Skinnider, M.A., Squair, J.W., Kathe, C. et al. [Cell type prioritization in single-cell data](https://doi.org/10.1038/s41587-020-0605-1). Nat Biotechnol 39, 30–34 (2021).


Augurpy aims to rank, or prioritize, cell types according to their response to experimental perturbations, given high-dimensional single-cell sequencing data. The basic idea is that, in the space of molecular measurements, cells of a type that reacts strongly to a perturbation are more easily separated into perturbed and unperturbed groups than cells of a type with little or no response. This separability is quantified by measuring how well the experimental labels (e.g. treatment and control) can be predicted within each cell type. Augurpy trains a machine learning model to predict the experimental labels for each cell type over multiple cross-validation runs, then prioritizes cell types by metric scores that measure the accuracy of the model. For categorical data the default metric is the area under the receiver operating characteristic curve (AUC); for numerical data it is the concordance correlation coefficient (CCC). In both cases the score serves as a proxy for model accuracy, which in turn approximates the perturbation response.
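
The idea can be illustrated outside of augurpy with a minimal sketch. It assumes scikit-learn is installed and uses small simulated data with hypothetical sizes and effect strengths; it is a simplification and not augurpy's own procedure, which handles subsampling and repeated runs internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
N_CELLS, N_GENES = 200, 50  # hypothetical sizes, far smaller than real data


def simulate_cell_type(frac_de, shift=1.0):
    """Simulate one cell type: half control (0), half treatment (1)."""
    labels = np.repeat([0, 1], N_CELLS // 2)
    expression = rng.normal(size=(N_CELLS, N_GENES))
    n_de = int(frac_de * N_GENES)
    # Shift the first n_de genes in treated cells to mimic differential expression.
    expression[labels == 1, :n_de] += shift
    return expression, labels


scores = {}
for cell_type, frac_de in {"weak_responder": 0.05, "strong_responder": 0.5}.items():
    expression, labels = simulate_cell_type(frac_de)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    # Mean cross-validated AUC: how well can the labels be predicted?
    auc = cross_val_score(clf, expression, labels, cv=3, scoring="roc_auc")
    scores[cell_type] = auc.mean()

# Cell types with a higher AUC are prioritized as more responsive.
for cell_type, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cell_type}: mean AUC = {score:.2f}")
```

The cell type with more shifted genes should receive the higher AUC, which is the ordering the prioritization relies on.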

The following tutorial runs through a simple analysis with augurpy using a simulated dataset. This dataset consists of 600 cells, distributed evenly between three populations (cell types A, B, and C). Each cell type has approximately half of its cells in each of two conditions, treatment and control. The cell types also differ in the number of genes that are differentially expressed (DE) in response to the treatment: approximately 5% of genes are DE in cell type A, 25% in cell type B, and 50% in cell type C.

In [1]:
import scanpy as sc

from augurpy.estimator import Params, create_estimator
from augurpy.evaluate import predict
from augurpy.read_load import load

First we import the data that we want to work with. This can be an AnnData object, a dataframe that contains cell type labels and conditions for each cell, or a dataframe of data with corresponding metadata that contains the cell type labels and conditions. Here we use scanpy to read the simulated sample AnnData set included with augurpy. We then load this data into an AnnData object, which adds dummy variables for the labels, annotates highly variable genes, and standardizes the cell type and label observation columns.

In [2]:
# import sample simulation data
adata = sc.read_h5ad("../tests/sc_sim.h5ad")

loaded_data = load(adata)



In [3]:
loaded_data

AnnData object with n_obs × n_vars = 600 × 15697
    obs: 'label', 'cell_type', 'y_treatment'
    var: 'name', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg'
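
To make the dummy variable in the `obs` columns above more concrete, here is a small pandas sketch of how a two-level label can be turned into a 0/1 indicator column. The metadata values are hypothetical, and the use of `pd.get_dummies` with `control` as the reference level is an assumption for illustration; it is not necessarily how `load` builds `y_treatment`.

```python
import pandas as pd

# Hypothetical per-cell metadata; the column names mirror the tutorial output.
obs = pd.DataFrame(
    {
        "cell_type": ["CellTypeA", "CellTypeA", "CellTypeB", "CellTypeB"],
        "label": ["control", "treatment", "treatment", "control"],
    }
)

# One 0/1 indicator per non-reference category;
# "control" sorts first and is dropped as the reference level.
dummies = pd.get_dummies(obs["label"], prefix="y", drop_first=True).astype(int)
obs = pd.concat([obs, dummies], axis=1)
print(obs)
```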

Next we choose the estimator used to measure how predictable the perturbation labels are for each cell type in the dataset. Choose `random_forest_classifier` or `logistic_regression_classifier` for categorical data and `random_forest_regressor` for numerical data.

In [4]:
random_forest = create_estimator("random_forest_classifier", Params(random_state=42))
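
For numerical labels, where a regressor is used, the score is the concordance correlation coefficient. The sketch below implements Lin's standard definition with NumPy to show what the metric rewards; the function name and the example values are hypothetical, and augurpy's internal computation may differ in detail.

```python
import numpy as np


def concordance_correlation(y_true, y_pred):
    """Lin's concordance correlation coefficient between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    # Unlike Pearson correlation, a shift between the two means is penalized.
    return 2 * covariance / (y_true.var() + y_pred.var() + (mean_true - mean_pred) ** 2)


doses = np.array([1.0, 2.0, 3.0, 4.0])           # hypothetical numerical labels
print(concordance_correlation(doses, doses))        # perfect agreement: 1.0
print(concordance_correlation(doses, doses + 1.0))  # constant offset: about 0.714
```

A prediction that is perfectly correlated with the truth but offset by a constant still scores below 1, so the metric measures agreement and not only correlation.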

Then we run augurpy with the function `predict` and look at the results. 

In [5]:
result_adata, results = predict(loaded_data, random_forest, random_state=42)

print(results['summary_metrics'])

                  CellTypeA  CellTypeB  CellTypeC
mean_augur_score   0.433333   0.666667   0.666667
mean_auc           0.433333   0.666667   0.666667


The corresponding `mean_augur_score` is also saved in `result_adata.obs`. The feature importances can be found in `results['feature_importances']` and used for further analysis. 
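
To turn the summary into a ranking, the table can be sorted by score. The sketch below rebuilds the printed output by hand as a pandas DataFrame so that it runs on its own; in a real session, `results['summary_metrics']` would be used in place of the hand-built `summary`, assuming it has the layout shown in the printed output.

```python
import pandas as pd

# Values copied from the printed summary above.
summary = pd.DataFrame(
    {
        "CellTypeA": [0.433333, 0.433333],
        "CellTypeB": [0.666667, 0.666667],
        "CellTypeC": [0.666667, 0.666667],
    },
    index=["mean_augur_score", "mean_auc"],
)

# Highest score first: the most responsive cell types come out on top.
ranking = summary.loc["mean_augur_score"].sort_values(ascending=False)
print(ranking)
```

In this run cell types B and C are tied, so only the position of cell type A (last) is determined.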

### Feature Importances

For a random forest estimator, the feature importances built into scikit-learn are used. For logistic regression, the Agresti method is used (as in the R library Augur): the mean is subtracted from the coefficient values, and the result is divided by the standard deviation (see this [blog post](https://think-lab.github.io/d/205/)).
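
The standardization described above can be written in a few lines of NumPy. This is a sketch of the description only, with hypothetical coefficient values; whether augurpy uses the population or the sample standard deviation is not stated here, so the population form (`ddof=0`) is an assumption.

```python
import numpy as np

# Hypothetical logistic regression coefficients, one per gene.
coefficients = np.array([0.8, -0.2, 0.05, 1.5])

# Subtract the mean, then divide by the standard deviation (population form, ddof=0).
standardized = (coefficients - coefficients.mean()) / coefficients.std()
print(standardized)
```

After this step the importances have mean 0 and standard deviation 1, which makes them easier to compare across cell types.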

### Differential Prioritization

Augurpy can also perform differential prioritization: it runs a permutation test to identify cell types with statistically significant differences in AUC between two different rounds of cell type prioritization (e.g. response to drugs A and B, compared to untreated control).
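
The logic of such a permutation test can be sketched generically. The example below assumes an observed AUC difference for one cell type and a null distribution of differences obtained with permuted labels; all numbers and the helper `permutation_pvalue` are hypothetical, and this is not augurpy's own implementation.

```python
import numpy as np


def permutation_pvalue(observed_delta, null_deltas):
    """Two-sided empirical p-value of an observed difference against a null sample."""
    null_deltas = np.asarray(null_deltas, dtype=float)
    n_extreme = np.sum(np.abs(null_deltas) >= abs(observed_delta))
    # Add-one correction so the p-value is never exactly zero.
    return (n_extreme + 1) / (null_deltas.size + 1)


rng = np.random.default_rng(0)
# Hypothetical AUCs from 1000 runs with permuted labels in each of two comparisons.
null_auc_a = rng.normal(0.5, 0.03, size=1000)
null_auc_b = rng.normal(0.5, 0.03, size=1000)
null_deltas = null_auc_b - null_auc_a

observed_delta = 0.85 - 0.60  # hypothetical AUC for drug B minus AUC for drug A
p_value = permutation_pvalue(observed_delta, null_deltas)
print(f"delta AUC = {observed_delta:.2f}, p = {p_value:.4f}")
```

A small p-value indicates that the difference in AUC between the two comparisons is larger than expected when the labels carry no information.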