# Run BLAST

In [None]:
file1 = 'path/to/transcriptome1.fasta'
type1 = 'nucl' #or 'prot' if file1 is a proteome
id1 = 'hu' #2-character ID (e.g. 'hu' for human)

file2 = 'path/to/transcriptome2.fasta'
type2 = 'nucl'
id2 = 'mo' #2-character ID (e.g. 'mo' for mouse)
!bash map_genes.sh --tr1 {file1} --t1 {type1} --n1 {id1} --tr2 {file2} --t2 {type2} --n2 {id2}

# Run SAMap

In [5]:
import numpy as np  # used further down to compare mapping tables
from samap.mapping import SAMAP
from samap.analysis import get_mapping_scores, GenePairFinder, sankey_plot
from samalg import SAM

SAMap accepts file paths to unprocessed, raw `.h5ad` files. Alternatively, if you already have processed `SAM` objects, you can pass them in directly.


Prior to running SAMap, you should have run `map_genes.sh`, which expects a 2-character identifier for each species. For example, human and mouse might get the identifiers `hu` and `mo`, respectively. `map_genes.sh` generates a `maps/` directory containing the BLAST results of the transcriptome mapping. Pass the same species IDs and the path to the `maps/` directory to SAMap.
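A quick sanity check before constructing `SAMAP` can catch mismatched identifiers early. The sketch below uses a hypothetical helper, `check_species_ids`, which is not part of SAMap. It relies only on what is stated above (2-character IDs and a `maps/` directory written by `map_genes.sh`); the requirement that the two IDs differ is an added assumption.

```python
from pathlib import Path

def check_species_ids(id1, id2, f_maps="maps/"):
    """Hypothetical helper (not part of SAMap): list problems with SAMap inputs."""
    problems = []
    for sid in (id1, id2):
        if len(sid) != 2:
            problems.append(f"species ID {sid!r} is not 2 characters long")
    if id1 == id2:
        # Assumption: the two species need distinct IDs to be told apart.
        problems.append("both species share the same ID")
    if not Path(f_maps).is_dir():
        problems.append(f"{f_maps!r} is not a directory; has map_genes.sh been run?")
    return problems

# 'mouse' is deliberately too long, so at least one warning is printed.
for issue in check_species_ids("hu", "mouse", f_maps="maps/"):
    print("WARNING:", issue)
```

An empty list means none of the checks failed; it does not guarantee that the BLAST results themselves are complete.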

In [18]:
id1 = 'mu'
id2 = 'dp'

In [19]:
"""
# passing in file names (SAMap will process the data with SAM and save the resulting objects to two `.h5ad` files.)
# these objects are assumed to be unprocessed
fn1 = 'example_data/planarian.h5ad' #processed data will be automatically saved to `example_data/planarian_pr.h5ad`
fn2 = 'example_data/schistosome.h5ad' #processed data will be automatically saved to `example_data/schistosome_pr.h5ad`
# runs SAMAP 
sm = SAMAP(fn1,fn2,id1,id2,f_maps = 'maps/')
samap = sm.run()
#"""

#"""
# passing in already-processed SAM objects
fn1 = '2020-12-07_mouse_kidney (2).h5ad'
fn2 = '2021-03-28_fly_MT_NC.h5ad'
sam1=SAM()
sam2=SAM()
sam1.load_data(fn1)
sam2.load_data(fn2)

In [25]:
sam1.run(weight_mode='combined')
sam2.run(weight_mode='combined')

RUNNING SAM
Iteration: 0, Convergence: 1.0
Computing means and variances of genes.
Iteration: 1, Convergence: 0.7896062058996485
Iteration: 2, Convergence: 0.011452742128482574
Computing the UMAP embedding...
Elapsed time: 48.64499282836914 seconds
RUNNING SAM
Iteration: 0, Convergence: 1.0
Computing means and variances of genes.
Iteration: 1, Convergence: 0.9514489350870313
Computing the UMAP embedding...
Elapsed time: 23.618821144104004 seconds


In [26]:
sam1.leiden_clustering(res=3)
sam2.leiden_clustering(res=3)
from samap.mapping import prepare_SAMap_loadings
prepare_SAMap_loadings(sam1)
prepare_SAMap_loadings(sam2)
sam1.save_anndata()
sam2.save_anndata()

Not updating the manifold...
Not updating the manifold...


In [35]:
sm1 = SAMAP(sam1,sam2,id1,id2,f_maps = 'maps/')
samap = sm1.run(scale_edges_by_corr=True)
#"""

Preparing data 1 for SAMap.
Preparing data 2 for SAMap.
11775 `mu` genes and 9296 `dp` gene symbols match between the datasets and the BLAST graph.
Stitching SAM 0 and SAM 1
Found 126258 gene pairs
Recomputing PC projections with gene pair subsets...
Running hsnwlib
Using leiden_clusters and leiden_clusters cluster labels.
Out-neighbor smart expansion 1
Out-neighbor smart expansion 2
Indegree coarsening
0/1 (0, 12166) True
Scaling edge weights by expression correlations.
Concatenating SAM objects...
ITERATION: 0 
Average alignment score (A.S.):  0.26381244025168105 
Max A.S. improvement: 0.20763392042613651 
Min A.S. improvement: 0.0
Calculating gene-gene correlations in the homology graph...
Stitching SAM 0 and SAM 1
Found 54805 gene pairs
Recomputing PC projections with gene pair subsets...
Running hsnwlib
Using leiden_clusters and leiden_clusters cluster labels.
Out-neighbor smart expansion 1
Out-neighbor smart expansion 2
Indegree coarsening
0/1 (0, 12166) True
Scaling edge weights

In [36]:
sm2 = SAMAP(sam2,sam1,id2,id1,f_maps = 'maps/')
samap = sm2.run(scale_edges_by_corr=True)
#"""

Preparing data 1 for SAMap.
Preparing data 2 for SAMap.
9296 `dp` genes and 11775 `mu` gene symbols match between the datasets and the BLAST graph.
Stitching SAM 0 and SAM 1
Found 126258 gene pairs
Recomputing PC projections with gene pair subsets...
Running hsnwlib
Using leiden_clusters and leiden_clusters cluster labels.
Out-neighbor smart expansion 1
Out-neighbor smart expansion 2
Indegree coarsening
0/1 (0, 12166) False
Scaling edge weights by expression correlations.
Concatenating SAM objects...
ITERATION: 0 
Average alignment score (A.S.):  0.26425182208934955 
Max A.S. improvement: 0.20785645055399948 
Min A.S. improvement: 0.0
Calculating gene-gene correlations in the homology graph...
Stitching SAM 0 and SAM 1
Found 54665 gene pairs
Recomputing PC projections with gene pair subsets...
Running hsnwlib
Using leiden_clusters and leiden_clusters cluster labels.
Out-neighbor smart expansion 1
Out-neighbor smart expansion 2
Indegree coarsening
0/1 (0, 12166) False
Scaling edge weigh

In [39]:
M1 = get_mapping_scores(sm1,'cell_types','cell_types')[-1]
M2 = get_mapping_scores(sm2,'cell_types','cell_types')[-1]

In [45]:
np.where(np.abs(M1-M2.T).values>0.08)

(array([1, 1]), array([5, 6]))

In [43]:
M1.index[1]

'dp_adult pericardial nephrocytes'

In [46]:
M1.columns[5]

'mu_distal straight tubule of inner stripe of outer medulla (syn: thick ascending limb of LOH)'

In [41]:
np.sort(np.abs(M1-M2.T).values.flatten())

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
      

To calculate alignment scores between cell types, we can use `get_mapping_scores`. This function will use the combined SAM object produced by SAMap to calculate alignment scores between cell types in the provided cell type annotation columns of `sam.adata.obs`. If no cell type annotations exist, the leiden clusters generated by SAM can be used (`k1=k2='leiden_clusters'`).

`D1` and `D2` list the highest alignment scores for each cell type in organism 1 and organism 2, respectively. `MappingTable` holds the pairwise alignment scores between all cell types.

The `n_top` parameter can be used to identify strongly-mapping subpopulations between clusters. If `n_top=0`, the alignment score will be averaged over all cells between two clusters. Otherwise, the alignment score will be averaged over the top `n_top` cells.

For example, suppose that 100 of the 1000 cells in cluster A map to cluster B in the other species. Averaged over all cells, the mapping score between clusters A and B will be small, because only a small fraction of the cells in cluster A map to cluster B. If we set `n_top` to 100, the alignment score between A and B is averaged over the 100 highest-scoring cells only. As a result, the mapping score between A and B will be large.

By default, `n_top=0`.
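The arithmetic behind the cluster A / cluster B example can be reproduced with a few lines of plain Python. This is a toy stand-in rather than SAMap's implementation: `cluster_alignment_score` is a hypothetical function, and the per-cell scores (0.9 for mapped cells, 0.0 otherwise) are made up.

```python
def cluster_alignment_score(cell_scores, n_top=0):
    """Average per-cell alignment scores (toy stand-in, not SAMap's code)."""
    if n_top == 0:
        selected = list(cell_scores)  # average over every cell
    else:
        selected = sorted(cell_scores, reverse=True)[:n_top]  # top cells only
    return sum(selected) / len(selected)

# 1000 cells in cluster A: 100 map strongly to cluster B, 900 do not map at all.
scores_a_to_b = [0.9] * 100 + [0.0] * 900

mean_all = cluster_alignment_score(scores_a_to_b, n_top=0)
mean_top = cluster_alignment_score(scores_a_to_b, n_top=100)
print(f"n_top=0:   {mean_all:.2f}")
print(f"n_top=100: {mean_top:.2f}")
```

With these made-up numbers the score rises from 0.09 to 0.90 once the average is restricted to the 100 best-mapping cells.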

In [None]:
k1 = 'cluster' #cell types annotation key in `sam1.adata.obs`
k2 = 'tissue' #cell types annotation key in `sam2.adata.obs`
D1,D2,MappingTable = get_mapping_scores(sm,k1,k2, n_top = 0)

In [None]:
D1.head()

In [None]:
D2.head()

In [None]:
MappingTable.head()

SAMap provides a class to find gene pairs enriched in different cell type pairs. The method entails finding gene pairs that contribute positively to the cross-species correlation between cell types and are differentially expressed in their respective mapped cell types.

In [None]:
gpf = GenePairFinder(sm,k1=k1,k2=k2)
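To give some intuition for what "contributing positively to the cross-species correlation" means, the sketch below splits a Pearson correlation into one term per gene pair. It is an illustration under stated assumptions, not the code `GenePairFinder` runs: the expression values are made up, `pair_contributions` is a hypothetical helper, and the differential-expression filter is not shown.

```python
from statistics import fmean, pstdev

def pair_contributions(x, y):
    """Per-gene-pair terms whose sum is the Pearson correlation of x and y."""
    mx, my = fmean(x), fmean(y)
    sx, sy = pstdev(x), pstdev(y)
    n = len(x)
    return [((xi - mx) / sx) * ((yi - my) / sy) / n for xi, yi in zip(x, y)]

# Made-up mean expression of four homologous gene pairs in two mapped cell types.
pairs = ["g1/h1", "g2/h2", "g3/h3", "g4/h4"]
expr_species1 = [5.0, 0.5, 3.0, 0.2]  # genes g1..g4 in a cell type of organism 1
expr_species2 = [4.0, 0.1, 0.3, 2.5]  # homologs h1..h4 in a cell type of organism 2

contrib = pair_contributions(expr_species1, expr_species2)
correlation = sum(contrib)
positive = [p for p, c in zip(pairs, contrib) if c > 0]
print("correlation:", round(correlation, 3))
print("pairs with a positive contribution:", positive)
```

A pair contributes positively when both genes sit on the same side of their respective means, that is, both relatively high or both relatively low in the two cell types.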

`gpf.find_genes` can now be used to find gene pairs enriched in a cell type mapping.

In [None]:
n1 = 'Neoblast: 0' #cell type ID from organism 1 (must be present in `sam1.adata.obs[k1]`)
n2 = 'Neoblast' #cell type ID from organism 2 (must be present in `sam2.adata.obs[k2]`)
Gp,G1,G2 = gpf.find_genes(n1,n2)
#Gp are the gene pairs, G1 are the genes from organism 1, G2 are the genes from organism 2

In [None]:
Gp,G1,G2

To get a table of enriched gene pairs from all cell type mappings that have an alignment score above some threshold (`thr`), you can use:

In [None]:
gene_pairs = gpf.find_all(thr=0.1)

In [None]:
gene_pairs.head()

# Saving/Loading SAMap

In [None]:
from samap.utils import save_samap, load_samap
# save to a .pkl file
save_samap(sm,'path/to/file') #including the file ending (.pkl) is optional

In [None]:
# load
sm = load_samap('path/to/file') #including the file ending (.pkl) is optional

# Visualizing SAMap results

To launch an interactive GUI (requires SAMGUI; see the README instructions in the [SAM](https://github.com/atarashansky/self-assembling-manifold) GitHub repo), run:

In [None]:
sm.gui()

To launch a similar interactive GUI with both species overlaid in one tab, run:

In [None]:
sm.samap.gui()

To create a sankey plot, use `samap.analysis.sankey_plot`. This requires `holoviews` to be installed (`pip install holoviews`). It should already be installed if you're using the Docker image.

In [None]:
k1 = 'cluster' #cell types annotation key in `sam1.adata.obs`
k2 = 'tissue' #cell types annotation key in `sam2.adata.obs`
D1,D2,MappingTable = get_mapping_scores(sm,k1,k2, n_top = 0)

In [None]:
sankey_plot(MappingTable)

# ADVANCED:

# Functional Enrichment Analysis

We can use the `samap.analysis.FunctionalEnrichment` class to perform functional enrichment analysis of gene pairs enriched in each cell type. The functional annotations can be GO terms, KOG terms, etc.

We can get functional annotations for our data by mapping the transcriptome to the eggNOG database. Below, we clone the eggnog-mapper repository from GitHub, download the eggNOG database, and run the mapping.

In [None]:
# run the below line if you are not running the SAMap Docker image.
#!pip install git+https://github.com/eggnogdb/eggnog-mapper.git
# clone the repo (this only needs to be done once)
!git clone https://github.com/eggnogdb/eggnog-mapper.git
# download the database (this only needs to be done once)
!eggnog-mapper/download_eggnog_data.py -y

`transcriptome1.fa` should be the path to your first transcriptome fasta file.\
`transcriptome2.fa` should be the path to your second transcriptome fasta file, and so on.\
`output_name` is the name of the output file.

In [None]:
# run EGGNOG
!eggnog-mapper/emapper.py -m diamond -i transcriptome1.fa --cpu 0 --itype CDS -o output_name1
!eggnog-mapper/emapper.py -m diamond -i transcriptome2.fa --cpu 0 --itype CDS -o output_name2

The above commands should eventually output the `output_name1.emapper.annotations` and `output_name2.emapper.annotations` tables.\
These tables have many columns, including a collection of GO terms for each transcript and KOG functional categories. In the example below, we perform functional enrichment analysis of the KOG categories (`best_OG_cat` column in the EGGNOG tables).

`FunctionalEnrichment.calculate_enrichment()` returns three tables. In each table, the rows are cell types and the columns are functional categories. `enrichment_scores` holds the `-log10`-transformed p-values, so high values indicate strong enrichment of gene pairs with that functional category. `num_enriched_genes` is the number of genes with a particular functional annotation enriched in a particular cell type mapping. `enriched_genes` provides a `;`-separated list of those genes.
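Because the scores are on a `-log10` scale, a threshold of 2.0 corresponds to p = 0.01. The conversion is sketched below with made-up p-values for three KOG category letters; `neg_log10` is a hypothetical helper and the dictionary layout is only for illustration, not the shape of the tables SAMap returns.

```python
import math

def neg_log10(p):
    """Convert a p-value to the -log10 scale."""
    return -math.log10(p)

# Made-up p-values for three functional categories in one cell type.
p_values = {"K": 0.5, "T": 0.001, "O": 1e-6}
scores = {cat: neg_log10(p) for cat, p in p_values.items()}

pval_thr = 2.0  # on the -log10 scale, i.e. p = 0.01
kept = [cat for cat, score in scores.items() if score >= pval_thr]
print({cat: round(score, 2) for cat, score in scores.items()})
print("categories passing the threshold:", kept)
```

Only the categories with p-values at or below 0.01 (here `T` and `O`) pass the threshold.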

In [None]:
import pandas as pd
from samap.analysis import FunctionalEnrichment
# load eggnog tables
A = pd.read_csv('output_name1.emapper.annotations',sep='\t',index_col=0)
B = pd.read_csv('output_name2.emapper.annotations',sep='\t',index_col=0)
# load SAMAP
sm = load_samap('samap_object.pkl')
# do functional enrichment
fe = FunctionalEnrichment(sm,[A,B],'best_OG_cat',['leiden_clusters','leiden_clusters'])
enrichment_scores,num_enriched_genes,enriched_genes = fe.calculate_enrichment(verbose=True)

See below for a description of all inputs to `FunctionalEnrichment`:

```
FunctionalEnrichment(sms, DFS, col_key, keys, delimiter = '', align_thr = 0.1, limit_reference = False, n_top = 0)

Parameters
----------
sms - list or tuple of SAMAP objects

DFS - list or tuple of pandas.DataFrame functional annotations (one per species present in the input SAMAP objects)

col_key - str
    The column name with functional annotations in the annotation DataFrames.

keys - list or tuple of column keys from `.adata.obs` DataFrames (one per species present in the input SAMAP objects)
    Cell type mappings will be computed between these annotation vectors.

delimiter - str, optional, default ''
    Some transcripts may have multiple functional annotations (e.g. GO terms or KOG terms) separated by
    a delimiter. For KOG terms, this is typically no delimiter (''). For GO terms, this is usually a comma
    (',').

align_thr - float, optional, default 0.1
    The alignment score below which to filter out cell type mappings

limit_reference - bool, optional, default False
    If True, limits the background set of genes to include only those that are enriched in any cell type mappings
    If False, the background set of genes will include all genes present in the input dataframes.

n_top - int, optional, default 0
    If `n_top` is 0, average the alignment scores for all cells in a pair of clusters.
    Otherwise, average the alignment scores of the top `n_top` cells in a pair of clusters.
    Set this to non-zero if you suspect there to be subpopulations of your cell types mapping
    to distinct cell types in the other species.


```
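The `delimiter` argument is easiest to understand by looking at how one annotation string breaks into individual terms. The sketch below is an interpretation of the parameter description above, not SAMap's own parsing code; `split_annotations` is a hypothetical helper.

```python
def split_annotations(value, delimiter=""):
    """Hypothetical helper: split one annotation string into individual terms."""
    if delimiter == "":
        # No delimiter: every character is its own term (e.g. KOG category letters).
        return list(value)
    return [term.strip() for term in value.split(delimiter) if term.strip()]

kog_terms = split_annotations("KT", delimiter="")
go_terms = split_annotations("GO:0005515,GO:0006355", delimiter=",")
print(kog_terms)
print(go_terms)
```

Using the wrong delimiter changes the result considerably: splitting a GO string with `delimiter=''` would treat every single character as a separate annotation.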

Enrichment scores can be plotted using `fe.plot_enrichment()`:

```
FunctionalEnrichment.plot_enrichment(self,cell_types = [], pval_thr=2.0,msize = 50)

Create a plot summarizing the functional enrichment analysis.

Parameters
----------
cell_types - list, default []
    A list of cell types for which enrichment scores will be plotted. If empty (default),
    all cell types will be plotted.

pval_thr - float, default 2.0
    -log10 p-values < 2.0 will be filtered from the plot.

msize - float, default 50
    The marker size in pixels for the dot plot.

Returns
-------
fig - matplotlib.pyplot.Figure
ax - matplotlib.pyplot.AxesSubplot
```

In [None]:
fig,ax = fe.plot_enrichment(cell_types = [], pval_thr=2.0,msize = 50)

# Identifying Homolog/Paralog Substitutions

One of the key features of SAMap is the ability to automatically identify all instances where paralogs have much more similar expression patterns than their respective orthologs. To do this, all we need are the SAMAP objects and a list of ortholog pairs and paralog pairs. The paralogs can be either cross-species, within-species, or a mix of both. Within-species paralogs are automatically converted to cross-species paralogs by using the orthologs.

***IMPORTANT***: For all genes in `ortholog_pairs` and `paralog_pairs`, this function expects the genes to be prepended with their corresponding species IDs (i.e. `sm.id1` or `sm.id2`). This is crucial because some pairs of transcripts between transcriptomes may have the same names, so we need to make them unique by prepending the species IDs.
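If your gene pairs come from an external source without species prefixes, they can be added in one pass. The sketch below uses a hypothetical helper, `add_species_prefix`, and made-up gene names. The underscore separator is an assumption based on the `mu_...` and `dp_...` labels that appear in the outputs above; check it against the gene names in your own SAMAP object.

```python
def add_species_prefix(pairs, id1, id2, sep="_"):
    """Hypothetical helper: prefix each gene in (species 1, species 2) pairs."""
    # Assumption: the species ID and gene name are joined by `sep` ('_').
    return [(f"{id1}{sep}{g1}", f"{id2}{sep}{g2}") for g1, g2 in pairs]

# Made-up gene names; column 0 is species 1, column 1 is species 2.
raw_pairs = [("Actb", "Act5C"), ("Gapdh", "Gapdh1")]
prefixed_pairs = add_species_prefix(raw_pairs, "mu", "dp")
print(prefixed_pairs)
```

The resulting list of 2-tuples can be converted to the expected n x 2 array with `numpy.asarray(prefixed_pairs)`.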

If `paralog_pairs` is left to its default value (`None`), then all genes in the homology graph that aren't orthologs are treated as if they were paralogs. In this case, the analysis would give you "homolog" substitutions rather than paralog substitutions.

The documentation for `ParalogSubstitutions` is:
```
ParalogSubstitutions(sm, ortholog_pairs, paralog_pairs=None, psub_thr = 0.3):

Identify paralog substitutions. 

For all genes in `ortholog_pairs` and `paralog_pairs`, this function expects the genes to
be prepended with their corresponding species IDs (i.e. `sm.id1` or `sm.id2`).

Parameters
----------
sm - SAMAP object

ortholog_pairs - n x 2 numpy array of ortholog pairs

paralog_pairs - n x 2 numpy array of paralog pairs, optional, default None
    If None, assumes every pair in the homology graph that is not an ortholog is a paralog.
    Note that this would essentially result in the more generic 'homolog substitutions' rather
    than paralog substitutions.

    The paralogs can be either cross-species, within-species, or a mix of both. 

psub_thr - float, optional, default 0.3
    Threshold for correlation difference between paralog pairs and ortholog pairs.
    Paralog pairs whose correlation does not exceed that of their corresponding
    ortholog pairs by more than `psub_thr` are filtered out.

Returns
-------
RES - pandas.DataFrame
    A table of paralog substitutions.
```
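The effect of `psub_thr` can be illustrated with a toy filter. This is a sketch of the thresholding rule described in the docstring, not the function's actual implementation: the correlations are made up, `filter_substitutions` is a hypothetical helper, and the strict inequality is an assumption.

```python
def filter_substitutions(candidates, psub_thr=0.3):
    """Keep paralog pairs whose correlation beats their ortholog's by > psub_thr."""
    kept = []
    for name, ortholog_corr, paralog_corr in candidates:
        difference = paralog_corr - ortholog_corr
        if difference > psub_thr:  # assumption: strict inequality
            kept.append((name, difference))
    return kept

# (gene, ortholog pair correlation, paralog pair correlation), all made up.
candidates = [
    ("geneA", 0.10, 0.75),   # difference 0.65, kept
    ("geneB", 0.60, 0.70),   # difference 0.10, filtered out
    ("geneC", -0.20, 0.25),  # difference 0.45, kept
]
substitutions = filter_substitutions(candidates, psub_thr=0.3)
for name, difference in substitutions:
    print(f"{name}: paralog exceeds ortholog by {difference:.2f}")
```

Raising `psub_thr` makes the analysis more conservative, keeping only cases where the paralog's expression pattern is far more similar than the ortholog's.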

In [None]:
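from samap.analysis import ParalogSubstitutions
# `ortholog_pairs` must already be defined, e.g. via `convert_eggnog_to_homologs` (see the EGGNOG section)
TABLE = ParalogSubstitutions(sm, ortholog_pairs, paralog_pairs=None, psub_thr = 0.3)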

In [None]:
TABLE.head()

## Identifying orthologs using EGGNOG

We can identify orthologs at different ancestral levels using the mapping tables output by EGGNOG.

The `eggNOG_OGs` column of the output mapping table contains a comma-separated list of the orthology groups, at different ancestral levels, that each transcript maps to. The levels are typically Eukaryota, Opisthokonta, Metazoa, Bilateria, Chordata, Vertebrata, and so on. We can treat genes that share an orthology group at the level of the two species' most recent common ancestor (e.g. Vertebrata) as orthologs, and genes that share one only at a more ancestral level (e.g. Metazoa) as paralogs.

For example, below we identify genes as orthologs if they map to the same Vertebrata orthology group (NCBI taxon ID 7742) and as paralogs if they map to the same Metazoa orthology group (NCBI taxon ID 33208).

The documentation for `convert_eggnog_to_homologs` is:
```   
convert_eggnog_to_homologs(sm, A, B, og_key = 'eggNOG_OGs', taxon=2759)

Gets an n x 2 array of homologs at some taxonomic level based on Eggnog results.
    
Parameters
----------
sm: SAMAP object

A: pandas.DataFrame, Eggnog output table

B: pandas.DataFrame, Eggnog output table

og_key: str, optional, default 'eggNOG_OGs'
    The column name of the orthology group mapping results in the Eggnog output table.
    
taxon: int, optional, default 2759
    Taxonomic ID corresponding to the level at which genes with overlapping orthology groups
    will be considered homologs. Defaults to the Eukaryotic level.

Returns
-------
homolog_pairs: n x 2 numpy array of homolog pairs.
```
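To make the idea of sharing an orthology group at a given taxonomic level concrete, here is a toy version of the matching step in plain Python. It is not the implementation of `convert_eggnog_to_homologs`. The `OG@taxid|TaxonName` entry format is an assumption about the `eggNOG_OGs` column, the orthology group identifiers are placeholders, and `og_at_taxon` and `shared_og_pairs` are hypothetical helpers.

```python
from itertools import product

def og_at_taxon(eggnog_ogs, taxon):
    """Return the orthology groups annotated at `taxon` in one eggNOG_OGs entry."""
    groups = set()
    for entry in eggnog_ogs.split(","):
        og, _, level = entry.partition("@")  # assumed format: OG@taxid|TaxonName
        if level.split("|")[0] == str(taxon):
            groups.add(og)
    return groups

def shared_og_pairs(table_a, table_b, taxon):
    """Pair genes from two species that share an orthology group at `taxon`."""
    pairs = []
    for (gene_a, ogs_a), (gene_b, ogs_b) in product(table_a.items(), table_b.items()):
        if og_at_taxon(ogs_a, taxon) & og_at_taxon(ogs_b, taxon):
            pairs.append((gene_a, gene_b))
    return pairs

# Made-up entries; the OG identifiers are placeholders.
species_a = {
    "a1": "OG_E1@2759|Eukaryota,OG_M1@33208|Metazoa,OG_V1@7742|Vertebrata",
    "a2": "OG_E1@2759|Eukaryota,OG_M1@33208|Metazoa,OG_V2@7742|Vertebrata",
}
species_b = {"b1": "OG_E1@2759|Eukaryota,OG_M1@33208|Metazoa,OG_V1@7742|Vertebrata"}

orthologs = shared_og_pairs(species_a, species_b, taxon=7742)
paralogs = shared_og_pairs(species_a, species_b, taxon=33208)
print("Vertebrata level:", orthologs)
print("Metazoa level:", paralogs)
```

In this toy data the Metazoa-level pairs include the Vertebrata-level pair as well, since genes that share a recent orthology group also share the more ancestral one.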

In [None]:
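from samap.analysis import convert_eggnog_to_homologs
# A and B are the EGGNOG annotation tables loaded above; 7742 = Vertebrata, 33208 = Metazoa
ortholog_pairs = convert_eggnog_to_homologs(sm, A, B, og_key = 'eggNOG_OGs', taxon=7742)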
paralog_pairs = convert_eggnog_to_homologs(sm, A, B, og_key = 'eggNOG_OGs', taxon=33208)