This tutorial will walk you through how to use the PyPI version of GESS. 

The assumption is that you have installed GESS using the command:

    pip install GESS

Please make sure that you've downloaded the ExampleData from the GESS GitHub repository (https://github.com/AndrewDGillen/GESS/tree/main). We'll also need to download some single-nucleus data to work with. For this purpose, we'll use the FlyCellAtlas dataset (Li et al., 2022; DOI: 10.1126/science.abk2432).

In [None]:
#Downloads FlyCellAtlas Malpighian tubule dataset.
#If you don't know what that is, don't worry about it, it's just an example!

#This may take some time to download - please be patient!
import urllib.request

tubule_url = 'https://cloud.flycellatlas.org/index.php/s/7gfFYSQpkC4Yo8s/download/r_fca_biohub_malpighian_tubule_10x.h5ad'
urllib.request.urlretrieve(tubule_url, "ExampleData/SingleCellData.h5")
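
If you re-run the notebook, the cell above downloads the file again each time. A minimal sketch of a guard is shown here; `fetch_once` is a hypothetical helper (not part of GESS) that creates the target folder if needed and skips the download when the file is already present.

```python
import os
import urllib.request


def fetch_once(url, dest, downloader=urllib.request.urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the file was already there.
    """
    folder = os.path.dirname(dest)
    if folder:
        # urlretrieve does not create missing folders, so make sure it exists
        os.makedirs(folder, exist_ok=True)
    if os.path.exists(dest):
        return False
    downloader(url, dest)
    return True


# Usage (commented out so that nothing is downloaded by accident):
# fetch_once(tubule_url, "ExampleData/SingleCellData.h5")
```

The `downloader` argument only exists so the helper can be tried out without touching the network.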

With that, we have all the data we'll need. Specifically, we have:

    -An example Gene List, using Drosophila melanogaster genes
    -An example Annotation file, containing information on Drosophila melanogaster bulk tissues (from FlyAtlas2; DOI: 10.1093/nar/gkab971)
    -An example Bulk RNASeq data file, containing enrichment data (FPKM/Whole insect) from female FlyAtlas2 samples
    -Our example single-nucleus RNAseq data file

When using your own files, feel free to rename columns, annotations... as necessary for your data - just be consistent! This tutorial will flag up what arguments you'll need to keep an eye on for these things.

In [None]:
genelist = 'ExampleData/GeneList1.txt'
annotation_file = 'ExampleData/annotation.csv'
bulk_data = 'ExampleData/BulkRNASeqData.csv'
sn_data = "ExampleData/SingleCellData.h5"
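
Before going further, it can save time to confirm that all four files are where the notebook expects them. The sketch below uses a hypothetical helper, `missing_files`, and simply reports any path that does not exist.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not point at an existing file."""
    return [p for p in paths if not Path(p).is_file()]


inputs = [
    'ExampleData/GeneList1.txt',
    'ExampleData/annotation.csv',
    'ExampleData/BulkRNASeqData.csv',
    'ExampleData/SingleCellData.h5',
]

missing = missing_files(inputs)
if missing:
    print('Missing input files:', ', '.join(missing))
else:
    print('All input files found.')
```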

## Finding GESS

First, let's do some single GESS calculations. These are pairwise measurements of gene expression pattern similarity, and GESS are SPECIFIC to the parameters used in the calculation - i.e., they are not fixed measurements associated with a pair of genes across all datasets.

We'll work with two gene pairs - CapaR vs salt; and salt vs alphaTub84B/FBgn0003884. Don't worry about the biology!

In [None]:
from GESS import GESSfinder

#NOTE: always define gene names in a format appropriate for the data set!!
bulk_gene_pairs = [
    ('FBgn0037100', 'FBgn0039872'), # CapaR vs salt
    ('FBgn0039872', 'FBgn0003884') # salt vs alphaTub84B
    ]

sn_gene_pairs = [
    ('CapaR', 'salt'), # CapaR vs salt
    ('salt', 'alphaTub84B') # salt vs alphaTub84B
    ]
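
The two lists above describe the same comparisons in two naming schemes. If you prefer to write the pairs once, a small lookup table can translate between them. This is only a sketch: the three ID-to-symbol entries are taken from the comments in the cell above, and for your own genes you would need to build the table yourself.

```python
# FlyBase gene ID -> gene symbol, as given in the comments of the cell above
fbgn_to_symbol = {
    'FBgn0037100': 'CapaR',
    'FBgn0039872': 'salt',
    'FBgn0003884': 'alphaTub84B',
}

bulk_gene_pairs = [
    ('FBgn0037100', 'FBgn0039872'),  # CapaR vs salt
    ('FBgn0039872', 'FBgn0003884'),  # salt vs alphaTub84B
]

# Translate each pair into the symbols used by the single-nucleus data
sn_gene_pairs = [(fbgn_to_symbol[q], fbgn_to_symbol[t]) for q, t in bulk_gene_pairs]
print(sn_gene_pairs)  # [('CapaR', 'salt'), ('salt', 'alphaTub84B')]
```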

In [None]:
#BULK RNAseq GESS

#Firstly, let's initialise the arguments for GESSfinder Bulk mode

#REQUIRED Arguments

#The data file containing data for the "query gene". Input as query_data=<data file>
query_data = bulk_data  

#The data file containing data for the "target gene". This defaults to the query data file. Input as target_data=<data file>
target_data = bulk_data  

#The Annotation file containing metadata (ie annotation levels, species applicable) corresponding to Bulk RNAseq samples.
#Input as annos=<annotation file>
annos=annotation_file

#The column indicating which annotations to use for the query gene. Input as q_species=<query species>
q_species ='Drosophila melanogaster'

#Annotation levels to be used for the GESS calculation - ie columns in the annotation file
#Supply as a list! Input as targetannos = <annotation list>
targetannos = ['Name', 'Function', 'Type']

In [None]:
#OPTIONAL Arguments

#The column indicating which annotations to use for the target gene. Defaults to the query species. Input as t_species=<target species>
t_species ='Drosophila melanogaster'

#ONLY EVER use "bulk" analysis mode for Bulk RNAseq data. This is the default value
analysis_mode = 'bulk'

#One can choose specific annotations to disallow, regardless if they are present or not.
#These should be listed, line-separated in the format demonstrated in "ExampleData/UnacceptableAnnotations.txt"
unacceptable = ''

#Boolean switch which allows GESSfinder to run in verbose mode, printing all calculation steps.
#Useful for troubleshooting!
verbosity = False

In [None]:
#Run the actual GESS calculation for our two comparisons with the GESSfinder command!

for q_gene, t_gene in bulk_gene_pairs:
    
    pair_gess = GESSfinder.find_gess(
        query_gene=q_gene, 
        query_data=query_data, 
        target_gene=t_gene, 
        target_data=target_data, 
        annos=annos,
        targetannos=targetannos,
        q_species=q_species,
        t_species=t_species
        )
    
    result = f'{q_gene} vs {t_gene} BULK GESS: {pair_gess}'
    print(result)
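
The loop above prints each score as it is computed. If you would rather keep the scores for later comparison, one option is to collect them in a dictionary. The sketch below uses two hypothetical names, `score_pairs` and `placeholder_scorer`; the placeholder stands in for the real call so that the example runs without any data files, and its values are NOT GESS.

```python
def score_pairs(pairs, scorer):
    """Return {(query, target): scorer(query, target)} for every pair."""
    return {(q, t): scorer(q, t) for q, t in pairs}


# Placeholder scorer so this sketch runs on its own; it is NOT a GESS calculation.
def placeholder_scorer(q_gene, t_gene):
    return 0.0


pairs = [('FBgn0037100', 'FBgn0039872'), ('FBgn0039872', 'FBgn0003884')]
scores = score_pairs(pairs, placeholder_scorer)

for (q_gene, t_gene), value in scores.items():
    print(f'{q_gene} vs {t_gene}: {value}')

# With the tutorial's variables defined, the real call could be wrapped like this:
# bulk_scores = score_pairs(
#     bulk_gene_pairs,
#     lambda q, t: GESSfinder.find_gess(
#         query_gene=q, query_data=query_data,
#         target_gene=t, target_data=target_data,
#         annos=annos, targetannos=targetannos,
#         q_species=q_species, t_species=t_species,
#     ),
# )
```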

In the bulk data, CapaR and salt have relatively similar expression patterns, resulting in high GESS, while salt and alphaTub84B have very distinct profiles, resulting in low GESS.

Specifically, CapaR and salt are essentially restricted to the Malpighian tubules, while alphaTub84B is expressed throughout the fly. Thus, at this resolution, CapaR and salt are most similarly expressed.

However, is that consistent at the cell-type level? Let's find out!

In [None]:
#single-cell RNAseq GESS

#Most of the arguments are the same here - but let's define everything again for clarity

#REQUIRED Arguments

#The data file containing data for the "query gene". Input as query_data=<data file>
query_data = sn_data  

#The data file containing data for the "target gene". This defaults to the query data file. Input as target_data=<data file>
target_data = sn_data  

#Single-cell GESS can be used in one of two modes
#- "expression" : Calculates GESS based on average expression across each annotation
#- "prevalence" : Calculates the proportion of each annotation expressing a gene
#Functionally, these tend to give very similar results, but are provided separately to suit the user's needs
analysis_mode = 'expression'

#Annotation levels to be used for the GESS calculation - ie column attributes in the H5AD file
#Supply as a list! Input as targetannos = <annotation list>
targetannos = ['annotation', 'annotation_broad']

In [None]:
#OPTIONAL Arguments

#One can choose specific annotations to disallow, regardless if they are present or not.
#These should be listed, line-separated in the format demonstrated in "ExampleData/UnacceptableAnnotations.txt"
unacceptable = ''

#Boolean switch which allows GESSfinder to run in verbose mode, printing all calculation steps.
#Useful for troubleshooting!
verbosity = False

#Sets the minimum UMI required to define a cell/nucleus as expressing a gene
#Only affects "prevalence" GESS calculations
umithresh = 1
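
To make the difference between the two single-cell modes concrete, here is a toy illustration of the two per-annotation summaries they are described as using. It is not GESS's implementation: the counts and annotation labels are made up, and treating a cell as "expressing" when its count is at or above `umithresh` is an assumption based on the comment above.

```python
from collections import defaultdict

# Made-up UMI counts for one gene: (annotation label, UMI count) per cell
cells = [
    ('cell type A', 0), ('cell type A', 4), ('cell type A', 2), ('cell type A', 0),
    ('cell type B', 9), ('cell type B', 0),
]
umithresh = 1

# Group the counts by annotation
groups = defaultdict(list)
for annotation, umi in cells:
    groups[annotation].append(umi)

# "expression"-style summary: average count per annotation
mean_expression = {a: sum(v) / len(v) for a, v in groups.items()}

# "prevalence"-style summary: fraction of cells at or above the threshold
prevalence = {a: sum(u >= umithresh for u in v) / len(v) for a, v in groups.items()}

print(mean_expression)  # {'cell type A': 1.5, 'cell type B': 4.5}
print(prevalence)       # {'cell type A': 0.5, 'cell type B': 0.5}
```

In this toy case the two annotations have the same prevalence but different average expression, which is the kind of situation where the choice of mode could matter.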

In [None]:
#Run the actual GESS calculation for our two comparisons with the GESSfinder command!

for q_gene, t_gene in sn_gene_pairs:
    
    pair_gess = GESSfinder.find_gess(
        query_gene=q_gene, 
        query_data=query_data, 
        target_gene=t_gene, 
        target_data=target_data, 
        targetannos=targetannos,
        analysis_mode=analysis_mode,
        )
    
    result = f'{q_gene} vs {t_gene} single-nucleus GESS: {pair_gess}'
    print(result)

How about that! 

When you consider only tubule cells, salt and alphaTub84B appear to be more similarly expressed, whilst CapaR is now more distinct from salt.

The reason why is obvious when you visualise the expression data using SCoPE (https://scope.aertslab.org/). CapaR is restricted to a small subset of tubule cells, whilst salt and alphaTub84B are expressed throughout.

This highlights that GESS can effectively find expression patterns within a dataset - but these cannot be interpreted too broadly! 

## Making a Matrix

So far, hopefully so good!

But what if you have a whole list of genes, and want to find how all of their gene expression patterns compare? In this case, we provide GESSmatricise, a tool designed for calculating pairwise GESS and subsequently hierarchically clustering genes based on their GESS scores. As a result, co-regulated genes will be very easily identified!

As an example, let's consider some of the Drosophila melanogaster V-ATPase genes. Without worrying so much about the biology (though if interested, check out DOI: 10.1152/physiolgenomics.00233.2004), let's define two groups of genes:

Epithelial: Vha100-2; Vha68-2; VhaSFD
Non-Epithelial: Vha100-4; Vha68-3; Vha14-2

These genes are very closely related - in particular, Vha100-2 & Vha100-4; and Vha68-2 & Vha68-3 are paralogous and share high sequence homology. But how does that translate to their expression pattern?

Let's use GESSmatricise to find out using our two datasets!

In [None]:
from GESS import GESSmatricise

#REMEMBER: always define gene names in a format appropriate for the data set!!
#You can also provide the PATH to a file containing line-separated genes of interest (e.g. ExampleData/GeneList1.txt)
bulk_genelist = ['FBgn0028670', 'FBgn0038613', 'FBgn0263598', 'FBgn0032464','FBgn0027779','FBgn0037402']
sn_genelist = ['Vha100-2', 'Vha100-4', 'Vha68-2', 'Vha68-3', 'VhaSFD', 'Vha14-2']
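
If you keep your genes in a file and want to inspect them before passing the path on, a reader along these lines may help. It is a sketch that assumes one gene identifier per line (check ExampleData/GeneList1.txt for the actual layout); `read_gene_list` is a hypothetical helper, not a GESS function.

```python
import os
import tempfile


def read_gene_list(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, encoding='utf-8') as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstrate on a throwaway file so the sketch runs without ExampleData
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'genes.txt')
    with open(demo_path, 'w', encoding='utf-8') as handle:
        handle.write('Vha100-2\nVha68-2\n\nVhaSFD\n')
    genes = read_gene_list(demo_path)

print(genes)  # ['Vha100-2', 'Vha68-2', 'VhaSFD']
```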

In [None]:
#BULK RNAseq GESS MATRIX

#Firstly, let's initialise the arguments for GESSmatricise Bulk mode

#REQUIRED Arguments

#query_genes takes the list of genes as defined above
query_genes = bulk_genelist

#The data file containing data for the "query gene". Input as querydata=<data file>
querydata = bulk_data  

#The Annotation file containing metadata (ie annotation levels, species applicable) corresponding to Bulk RNAseq samples.
#Input as annotations=<annotation file>
annotations=annotation_file

#The column indicating which annotations to use for the query gene. Input as queryspecies=<query species>
q_species ='Drosophila melanogaster'

#Annotation levels to be used for the GESS calculation - ie columns in the annotation file
#Supply as a list! Input as targetannos = <annotation list>
targetannos = ['Name', 'Function', 'Type']

In [None]:
#OPTIONAL Arguments

#It is possible to make non-symmetrical matrices by defining differing query and target genelists
#target_genes takes the list of genes as for query_genes
target_genes = bulk_genelist

#The column indicating which annotations to use for the target gene. Defaults to the queryspecies. Input as t_species=<target species>
t_species ='Drosophila melanogaster'

#The data file containing data for the "target gene". This defaults to the querydata file. Input as targetdata=<data file>
target_data = bulk_data  

#ONLY EVER use "bulk" analysis mode for Bulk RNAseq data. This is the default value
analysis_mode = 'bulk'

#One can choose specific annotations to disallow, regardless if they are present or not.
#These should be listed, line-separated in the format demonstrated in "ExampleData/UnacceptableAnnotations.txt"
unacceptable = ''

#Boolean switch which allows GESSfinder to run in verbose mode, printing all calculation steps.
#Useful for troubleshooting!
verbosity = False

#Defines a location to save the created GESS matrix. 
#Usable file extensions are:
#If left blank, the resulting plot will just be shown
savefilename=''

#Defines the hierarchical clustering strategy used to cluster data.
#By default, GESS uses WPGMA clustering
cluster_method='weighted'

#A boolean switch to label the matrix with actual GESS values over each relationship.
#By default, this is off - and I'd recommend not using it for large matrices, it gets messy!
labelling=False
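
To picture what a non-symmetrical matrix means, it may help to count the comparisons involved. The sketch below only enumerates query/target pairs with `itertools.product`; the two lists are illustrative and it makes no claim about how GESSmatricise arranges rows and columns.

```python
from itertools import product

# Illustrative lists: three query genes compared against four target genes
query_genes = ['Vha100-2', 'Vha68-2', 'VhaSFD']
target_genes = ['Vha100-4', 'Vha68-3', 'Vha14-2', 'Vha100-2']

# Every query gene is paired with every target gene
pairs = list(product(query_genes, target_genes))

print(len(query_genes), 'x', len(target_genes), '=', len(pairs), 'comparisons')
# 3 x 4 = 12 comparisons
```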

In [None]:
#Call GESSmatricise using GESSmatricise.gess_matricise!

GESSmatricise.gess_matricise(
    query_genes=query_genes, 
    querydata=querydata, 
    annotations=annotations,
    targetannos=targetannos,
    queryspecies=q_species,
    )

With that, the Epithelial and Non-Epithelial Vha genes should be distinct, forming separate clusters based solely on the expression data!
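
For readers curious about the clustering step, here is a small pure-Python sketch of WPGMA, which is the method SciPy's `linkage` calls 'weighted'. It is an illustration under stated assumptions rather than GESS's own code: the gene names and similarity scores are made up, and converting similarity to distance as 1 - score is an assumption.

```python
def wpgma(labels, dist):
    """Cluster items by WPGMA from a symmetric distance matrix (list of lists).

    Returns the merges in order as (members_a, members_b, distance).
    """
    n = len(labels)
    # Active clusters: id -> tuple of member labels
    clusters = {i: (label,) for i, label in enumerate(labels)}
    # Distances between active clusters, keyed by the pair of cluster ids
    d = {frozenset((i, j)): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of active clusters
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        merges.append((clusters[a], clusters[b], d[pair]))
        # WPGMA update: the distance to the merged cluster is the plain average
        # of the distances to its two children, whatever their sizes
        for k in clusters:
            if k not in (a, b):
                d[frozenset((k, next_id))] = (d[frozenset((k, a))] + d[frozenset((k, b))]) / 2
        # Drop every distance that involves the two merged clusters
        d = {key: v for key, v in d.items() if a not in key and b not in key}
        clusters[next_id] = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        next_id += 1
    return merges


# Made-up similarity scores for four hypothetical genes
labels = ['geneA', 'geneB', 'geneC', 'geneD']
similarity = [
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.8],
    [0.3, 0.2, 0.8, 1.0],
]
# Assumption: turn similarity into distance as 1 - score
distance = [[1.0 - s for s in row] for row in similarity]

merges = wpgma(labels, distance)
for left, right, height in merges:
    print(left, '+', right, 'at distance', round(height, 3))
```

With these toy numbers, geneA and geneB merge first, then geneC and geneD, and the two pairs join last.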

Let's check again with the single-nucleus dataset.

In [None]:
#single-cell RNAseq GESS MATRIX

#Firstly, let's initialise the arguments for GESSmatricise Single Cell mode

#REQUIRED Arguments

#query_genes takes the list of genes as defined above
query_genes = sn_genelist

#The data file containing data for the "query gene". Input as querydata=<data file>
querydata = sn_data  

#Single-cell GESS can be used in one of two modes
#- "expression" : Calculates GESS based on average expression across each annotation
#- "prevalence" : Calculates the proportion of each annotation expressing a gene
#Functionally, these tend to give very similar results, but are provided separately to suit the user's needs
analysis_mode = 'expression'

#Annotation levels to be used for the GESS calculation - ie column attributes in the H5AD file
#Supply as a list! Input as targetannos = <annotation list>
targetannos = ['annotation', 'annotation_broad']

In [None]:
#OPTIONAL Arguments

#It is possible to make non-symmetrical matrices by defining differing query and target genelists
#target_genes takes the list of genes as for query_genes
target_genes = sn_genelist

#The data file containing data for the "target gene". This defaults to the querydata file. Input as targetdata=<data file>
target_data = sn_data

#One can choose specific annotations to disallow, regardless if they are present or not.
#These should be listed, line-separated in the format demonstrated in "ExampleData/UnacceptableAnnotations.txt"
unacceptable = ''

#Boolean switch which allows GESSfinder to run in verbose mode, printing all calculation steps.
#Useful for troubleshooting!
verbosity = False

#Defines a location to save the created GESS matrix. 
#Usable file extensions are:
#If left blank, the resulting plot will just be shown
savefilename=''

#Defines the hierarchical clustering strategy used to cluster data.
#By default, GESS uses WPMA clustering
cluster_method='weighted'

#A boolean switch to label the matrix with actual GESS values over each relationship.
#By default, this is off - and I'd recommend not using it for large matrices, it gets messy!
labelling=False

In [None]:
#Call GESSmatricise using GESSmatricise.gess_matricise!

GESSmatricise.gess_matricise(
    query_genes=query_genes, 
    querydata=querydata,
    analysis_mode=analysis_mode,
    targetannos=targetannos,
    )

And that's it!

We wish you every success with GESS, and if you have any difficulties at all, please do get in touch at either:

-The GESS GitHub repo (https://github.com/AndrewDGillen/GESS/issues)

-The developer's email (andrew.gillen@glasgow.ac.uk)