<img src='README_files/PersiST_Logo.png' width='300' > 

# PersiST

PersiST is an exploratory method for analysing spatial transcriptomics (and other 'omics) datasets. Given a spatial transcriptomics dataset containing expression data on multiple genes resolved to a shared set of co-ordinates, PersiST computes a single score for each gene, called the *Coefficient of Spatial Structure* (CoSS), that measures the amount of spatial structure in that gene's expression pattern. This score can be used for multiple analytical tasks, as we show below.

# Spatially Variable Gene Identification

For this tutorial, we shall be looking at spatial transcriptomics data on a sample from the Kidney Precision Medicine Project (KPMP) [1].

```python
import pandas as pd

df = pd.read_csv('data/kpmp_30-10125_spatial_expression.csv')
df.head()
```

| | x_position | y_position | TSPAN6 | TNMD | DPM1 | SCYL3 | C1orf112 | FGR | CFH | FUCA2 | ... | ENSG00000288156 | ENSG00000288162 | ENSG00000288172 | ENSG00000288187 | ENSG00000288234 | ENSG00000288253 | ENSG00000288302 | ENSG00000288380 | ENSG00000288398 | SOD2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.54881 | 0.834208 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117.63322 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1058.699 |
| 1 | 0.58961 | 0.809106 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 86.86588 | 173.73177 | 86.86588 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1737.3176 |
| 2 | 0.571644 | 0.166174 | 75.90709 | 0.0 | 75.90709 | 0.0 | 0.0 | 0.0 | 151.81418 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2201.3057 |
| 3 | 0.539074 | 0.714422 | 382.89725 | 0.0 | 127.632416 | 0.0 | 0.0 | 127.632416 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1148.6918 |
| 4 | 0.570493 | 0.468741 | 82.88438 | 0.0 | 0.0 | 0.0 | 82.88438 | 0.0 | 82.88438 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1989.225 |


This is a pandas DataFrame where the first two columns correspond to the well co-ordinates, and the remaining columns contain the expression of each gene. This is the format PersiST expects spatial transcriptomics data to come in.
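As a minimal sketch of this layout, the snippet below builds a toy table in the same shape and runs a few basic checks on it. The helper `check_persist_input`, the gene names, and the values are all hypothetical and for illustration only; none of them are part of PersiST. The checks assume only what is stated above: two leading coordinate columns followed by one numeric column per gene.

```python
import pandas as pd


def check_persist_input(df: pd.DataFrame) -> list:
    """Return the gene columns of a table laid out as described above.

    Hypothetical helper, not part of PersiST: it assumes the first two
    columns are the coordinates and every other column is one gene.
    """
    if df.shape[1] < 3:
        raise ValueError("expected two coordinate columns plus at least one gene")
    coord_cols = list(df.columns[:2])
    gene_cols = list(df.columns[2:])
    # Coordinates and expression values should all be numeric.
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")
    # Each well should appear only once.
    if df.duplicated(subset=coord_cols).any():
        raise ValueError("duplicate well co-ordinates found")
    return gene_cols


# Toy data: three wells and two made-up genes.
toy = pd.DataFrame({
    "x_position": [0.1, 0.2, 0.3],
    "y_position": [0.5, 0.5, 0.6],
    "GENE_A": [0.0, 12.5, 3.0],
    "GENE_B": [7.0, 0.0, 0.0],
})
gene_cols = check_persist_input(toy)
print(gene_cols)
```

A check like this can catch a stray index or label column before starting a run that takes several minutes.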

Let's compute CoSS scores for all the genes in this sample (this will take about 10 to 20 minutes).

```python
from compute_persistence import run_persistence

metrics = run_persistence(df)
```

The CoSS is a measure of the amount of spatial structure in a gene's expression pattern. Let's take a look at those genes with the highest CoSS scores:

```python
metrics = metrics.sort_values('CoSS', ascending=False)
metrics.iloc[:10, :]
```

| | gene | CoSS | ratio | gene_rank | possible_artefact | svg |
|---|---|---|---|---|---|---|
| 16443 | IGLC1 | 0.14162 | 0.651803 | 1.0 | No | Yes |
| 16483 | IGHG1 | 0.114255 | 0.467722 | 2.0 | No | Yes |
| 5372 | MT1G | 0.10585 | 0.335738 | 3.0 | No | Yes |
| 10798 | DEFB1 | 0.103534 | 0.376595 | 4.0 | No | Yes |
| 12467 | CCL19 | 0.101025 | 0.64977 | 5.0 | No | Yes |
| 22516 | C17orf113 | 0.098336 | 0.574433 | 6.0 | No | Yes |
| 6980 | ALDOB | 0.096201 | 0.271491 | 7.0 | No | Yes |
| 5750 | PODXL | 0.095475 | 0.327815 | 8.0 | No | Yes |
| 1102 | SLC12A3 | 0.095306 | 0.352575 | 9.0 | No | Yes |
| 11812 | UMOD | 0.094709 | 0.401716 | 10.0 | No | Yes |


PersiST outputs a number of quantities for each gene:

- **CoSS**: The Coefficient of Spatial Structure, a continuous quantity that can serve as a proxy for the amount of spatial structure in a gene's expression.
- **ratio**: Roughly, this measures how much of a gene's CoSS is down to a single spatial feature. Genes with a high ratio may be technical artefacts; see [2] for details.
- **gene_rank**: The rank of each gene, where genes are ranked from highest to lowest CoSS (so a rank of 1 is given to the gene with the highest CoSS).
- **possible_artefact**: Based on the ratio, PersiST automatically flags genes as possible artefacts [2]. We emphasise that this is only a suggestion; manual inspection should be performed before dismissing any genes.
- **svg**: Based on the CoSS scores, PersiST automatically calls genes as spatially variable or not [2].
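These columns can be combined to build a shortlist. The sketch below uses a small made-up table with the same column names as the PersiST output; the gene names and values are illustrative, not real results. It shows one way to reproduce a highest-first ranking from `CoSS` and then keeps the genes that are called spatially variable and not flagged as possible artefacts. It assumes the flags are the strings `Yes` and `No`, as in the output shown above.

```python
import pandas as pd

# Made-up scores for illustration only; not real PersiST output.
toy_metrics = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "CoSS": [0.02, 0.14, 0.09, 0.11],
    "ratio": [0.30, 0.40, 0.95, 0.35],
    "possible_artefact": ["No", "No", "Yes", "No"],
    "svg": ["No", "Yes", "Yes", "Yes"],
})

# Rank 1 goes to the gene with the highest CoSS.
toy_metrics["gene_rank"] = toy_metrics["CoSS"].rank(ascending=False)

# Keep genes called as spatially variable and not flagged as artefacts.
keep = (toy_metrics["svg"] == "Yes") & (toy_metrics["possible_artefact"] == "No")
shortlist = toy_metrics[keep].sort_values("gene_rank")
print(shortlist[["gene", "CoSS", "gene_rank"]])
```

Flagged genes are set aside here rather than deleted from the table, so they remain available for manual inspection.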

Let's plot the expression of those genes for which the CoSS is highest (here expression is measured in counts per million).

```python
from plotting_utils import plot_many_genes

plot_many_genes(df, list(metrics.gene)[:20])
```

![png](README_files/kpmp_svgs.png)

We can see that PersiST effectively surfaces those genes with notable spatial structure.

From the CoSS scores, PersiST automatically calls genes as spatially variable or not (this is the 'svg' column in the results). This provides a triaged list of genes that can be subjected to further analysis.

For example, one can search for genes with spatially similar expression patterns. Restricting to the comparatively small number of genes that PersiST typically calls as spatially variable makes this task much easier. In our experience, simple clustering methods such as hierarchical clustering are effective at picking out groups of co-expressed spatially variable genes (SVGs).

For instance, here is a group of genes all expressed in the glomeruli of this particular sample [2]. This group was obtained by running simple hierarchical clustering on the list of SVGs identified by PersiST and manually inspecting the results.

```python
plot_many_genes(df, ['PODXL', 'PTGDS', 'IGFBP5', 'TGFBR2', 'IFI27', 'HTRA1'], numcols=3)
```

![png](README_files/podxl_svgs.png)
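The clustering step described above could look roughly like the following sketch. It is not the authors' pipeline: the use of SciPy, correlation distance, average linkage, and a fixed number of clusters are all assumptions made for illustration, and the expression values are synthetic.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Synthetic data: 50 wells, two spatial patterns, two genes following each.
rng = np.random.default_rng(0)
x = rng.uniform(size=50)
y = rng.uniform(size=50)
left = (x < 0.5).astype(float)  # pattern 1: high on the left half
top = (y > 0.5).astype(float)   # pattern 2: high on the top half
toy = pd.DataFrame({
    "x_position": x,
    "y_position": y,
    "GENE_A": 100 * left + rng.uniform(0, 5, 50),
    "GENE_B": 80 * left + rng.uniform(0, 5, 50),
    "GENE_C": 100 * top + rng.uniform(0, 5, 50),
    "GENE_D": 90 * top + rng.uniform(0, 5, 50),
})

svgs = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]  # stand-in for the SVG list

# One row per gene, one column per well.
profiles = toy[svgs].to_numpy().T

# Correlation distance groups genes with similar spatial patterns
# regardless of their overall expression level.
distances = pdist(profiles, metric="correlation")
tree = linkage(distances, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

clusters = dict(zip(svgs, labels))
print(clusters)
```

On real data the number of clusters is not known in advance, so the resulting groups would still need to be inspected by plotting them.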

# Differential Spatial Expression Testing

If you are working with multiple spatial transcriptomics samples, and there are defined subgroups present within these samples, the CoSS scores allow for some rudimentary differential spatial expression testing.

In the KPMP dataset, there are Acute Kidney Injury (AKI) and Chronic Kidney Disease (CKD) samples. For each gene, we computed the average CoSS score within the AKI and CKD samples. The gene with the largest difference between the two was UMOD.

![png](README_files/umod_comparison.png)

In the AKI samples, UMOD displays well-defined regions of higher expression, whereas in the CKD samples the expression of UMOD is much more diffuse. 

UMOD is a marker gene for tubules, a key structural component of the kidney. We hypothesise that this difference in expression between the AKI and CKD samples reflects the structural breakdown that is characteristic of progressed kidney disease.
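The comparison described in this section can be sketched as below. The table layout (one row per sample and gene, with a `group` label) and all values are assumptions made for illustration; they are not the KPMP results, and this is not necessarily how the difference was computed for the figure above.

```python
import pandas as pd

# Made-up CoSS scores: one row per (sample, gene). Not real results.
scores = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"],
    "group": ["AKI", "AKI", "AKI", "AKI", "CKD", "CKD", "CKD", "CKD"],
    "gene": ["GENE_A", "GENE_B"] * 4,
    "CoSS": [0.10, 0.05, 0.12, 0.04, 0.03, 0.05, 0.05, 0.06],
})

# Average CoSS per gene within each group: genes as rows, groups as columns.
mean_coss = scores.pivot_table(
    index="gene", columns="group", values="CoSS", aggfunc="mean"
)

# Absolute difference between the group means, largest first.
mean_coss["abs_diff"] = (mean_coss["AKI"] - mean_coss["CKD"]).abs()
ranked = mean_coss.sort_values("abs_diff", ascending=False)
print(ranked)
```

With only a few samples per group, a ranking like this is descriptive rather than a formal statistical test, which matches the "rudimentary" framing above.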

# References

[1] Blue B Lake et al. “An atlas of healthy and injured cell states and niches in the human kidney”. In: Nature 619.7970 (2023), pp. 585–594.

[2] PersiST paper (not yet published)