### Zoonotic Potential

- [*] remember to turn off EC2 instance


### Plan

- measure host abundance for each zoonotic SOTU
- measure host abundance for each non-zoonotic SOTU
- measure differential between positive and negative sets
- statistical test of differential AGS with FGS of zoonotic potential (Virus set enrichment analysis)
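One possible shape for the set-enrichment step in the plan is a hypergeometric tail test on the overlap between the AGS and the FGS. The sketch below is only an illustration under assumptions: the function name and the toy counts are hypothetical, and the notebook has not yet settled on a specific test.

```python
from math import comb


def hypergeom_enrichment_pvalue(universe_size, fgs_size, ags_size, overlap):
    """P(X >= overlap) when ags_size items are drawn without replacement
    from a universe that contains fgs_size FGS members."""
    max_overlap = min(fgs_size, ags_size)
    total = comb(universe_size, ags_size)
    tail = sum(
        comb(fgs_size, k) * comb(universe_size - fgs_size, ags_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 hosts in total, 4 in the FGS, 3 in the AGS, 2 shared.
toy_pvalue = hypergeom_enrichment_pvalue(universe_size=10, fgs_size=4, ags_size=3, overlap=2)
print(round(toy_pvalue, 4))
```

A small p-value would suggest that the AGS overlaps the FGS more than expected by chance; with real data the universe would be the full set of host taxons considered.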

### Done

- [x] parse virus dataset
- [x] find number of genera with differing zoonotic values
- [x] find rank distribution of has_potential_host
    - total: 438167, species: 414190, genus: 24, family: 49
- [x] manually patch tax_ids for all species
- [x] find number of virus species with associated SOTU
    - match genus taxons via list
- [x] create projection of SOTUs and sequence alignments using zoonotic viruses
- [x] setup graphistry visualization
- [x] clustering analysis on SOTUs
    - [x] setup community detection with seeded taxId
    - [x] WCC
    - [x] Louvain and Leiden
- [x] Include STAT hosts into projection 

### TODO

- [*] understand discrepancy between zoonotic label and human host association
    - how many non-zoonotic viruses have human stat associations
    - goal: may need to re-examine using dataset for FGS?
    - may be able to create our own zoonotic dataset using STAT?
    - create a functional gene set (FGS) using entirety of the network of viruses in humans? if there is strong agreement
    - tests:
        - check edges for any viruses with human stat host > threshold, check IsZoonotic values
        - cluster using human vs other
- [*] update Graphistry with stat hosts (order level only)
- [*] test varying the sequence alignment threshold (> 90) in clustering
    - goal: may not need to cluster SOTUs into viral communities
- [*] examine weight/threshold on HAS_POTENTIAL_TAXON > 80
    - goal: may not want to overly rely on tax label and use similar clusters 
- Decide on including tissues
    - check  
- decide on using community or individual virus order/species
    - find differential expression of hosts (AGS) between positive and negative sets
    - compare with FGS from entire network with specific hosts?
- for each {community, species} find abundance of host taxons
    - remap host abundance by class [mammals, fish, plants, fungi, parasite]
    - find taxids for each
- for each {community, species} measure network topology metric of SOTUs and full trait hierarchy
    - Modularity, etc
    - spectrum
- compare variance within zoonotic and non-zoonotic
    - log intensity-ratio
- compare differential change between zoonotic and non-zoonotic
    - rank-sum test 
    - https://nbisweden.github.io/excelerate-scRNAseq/session-de/session-de-methods.html
- for each community, find tissue abundance by class

- create projection of SOTUs and sequence alignments using non-zoonotic viruses
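The rank-sum test and log intensity-ratio items above could look roughly like the following. This is a pure-Python sketch so it runs without extra dependencies; it uses the normal approximation without tie or continuity correction, and the function names and the pseudocount are assumptions rather than anything fixed in this notebook.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, z, p_value


def log2_ratio(x, y, pseudocount=1.0):
    """log2 of the ratio of mean abundances, with a pseudocount to avoid log(0)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return math.log2((mean_x + pseudocount) / (mean_y + pseudocount))


# Hypothetical host abundances for a zoonotic and a non-zoonotic set.
zoonotic_abundance = [5, 6, 7]
non_zoonotic_abundance = [1, 2, 3]
u_stat, z_score, p_value = rank_sum_test(zoonotic_abundance, non_zoonotic_abundance)
print(u_stat, round(z_score, 3), round(p_value, 4))
```

For real sample sizes with many ties, a library implementation with tie correction would be the safer choice.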

In [1]:
# Notebook config
import sys
if '../' not in sys.path:
    sys.path.append("../")
%load_ext dotenv
%reload_ext dotenv
%dotenv

import collections
import os
import urllib.parse

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import graphistry


from datasources.neo4j import gds
from queries import utils



In [2]:
base_data_path = './zoonosis_data/'
virus_dataset = base_data_path + 'trefle/'
neo4j_data_path = base_data_path + 'neo4j/'
graphistry_data_path = base_data_path + 'graphistry/'


graphistry.register(
    api=3,
    username=os.getenv('GRAPHISTRY_USERNAME'),
    password=os.getenv('GRAPHISTRY_PASSWORD'),
)

print(gds)

<graphdatascience.graph_data_science.GraphDataScience object at 0x11d753cd0>


In [3]:
df = pd.read_csv(virus_dataset + 'viruses.csv')

### Fill in tax_ids (only run once)

In [54]:
unique_names = df['vVirusNameCorrected'].unique()

def parse_scientific_name(name):
    """
    Parse a scientific name into its components.
    """
    name = name.replace('_', ' ')
    return name

with open( virus_dataset + 'species_names.txt', 'w') as f:
    for line in unique_names:
        f.write(f"{parse_scientific_name(line)}\n")

In [55]:
cat ./zoonosis_data/trefle/species_names.txt | taxonkit name2taxid  >> ./zoonosis_data/trefle/species_taxids.tsv

In [56]:
df2 = pd.read_csv(virus_dataset + 'species_taxids.tsv', sep='\t', header=None, names=['name', 'taxid'])
print(df2.head())
print(df2.shape)
print(df2['taxid'].isna().sum())

                                name    taxid
0               Adelaide River virus  31612.0
1           Adeno-associated virus-1      NaN
2           Adeno-associated virus-2      NaN
3           Adeno-associated virus-5      NaN
4  African green monkey polyomavirus  12480.0
(586, 2)
269
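Before patching taxids by hand, it helps to list which names came back unresolved. The sketch below parses the two-column TSV with the standard library only; it assumes that unresolved names have an empty second column and that the file may contain repeated rows (the `>>` redirect appends on every run). The helper name is hypothetical, and the sample rows are the ones printed above.

```python
import csv
import io


def split_resolved(tsv_text):
    """Return ({name: [taxid, ...]}, [names with no taxid]) from name<TAB>taxid rows."""
    resolved = {}
    seen = {}  # dict used as an ordered set of names
    reader = csv.reader(io.StringIO(tsv_text), delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        name = row[0]
        seen[name] = True
        taxid = row[1].strip() if len(row) > 1 else ''
        if taxid:
            ids = resolved.setdefault(name, [])
            if int(taxid) not in ids:
                ids.append(int(taxid))
    unresolved = [name for name in seen if name not in resolved]
    return resolved, unresolved


sample_tsv = (
    "Adelaide River virus\t31612\n"
    "Adeno-associated virus-1\t\n"
    "Adeno-associated virus-2\t\n"
    "Adeno-associated virus-5\t\n"
    "African green monkey polyomavirus\t12480\n"
)
resolved, unresolved = split_resolved(sample_tsv)
print(len(resolved), len(unresolved))
```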


### Load zoonotic dataset

In [4]:
# Manually fill in any unresolved taxids
df2 = pd.read_csv(virus_dataset + 'species_taxids_patched.tsv', sep='\t', header=None, names=['name', 'taxid'])
df2['name'] = df2['name'].str.replace(' ', '_')
missing_taxid = df2[df2['taxid'].isna()]['name']

print(missing_taxid.shape)

merged = pd.merge(df, df2, left_on='vVirusNameCorrected', right_on='name', how='left')
merged = merged.drop(columns=['name'])
merged = merged.rename(columns={'taxid': 'TaxID'})
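# Take an explicit copy so the column assignment below does not hit a slice of `merged`
merged_dropna = merged.dropna(subset=['TaxID']).copy()
# Cast via int first so float-formatted ids such as 31612.0 become '31612'
merged_dropna['TaxID'] = merged_dropna['TaxID'].astype(int).astype(str)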

print(merged.shape)
print(merged_dropna.shape)
print()

print(merged_dropna['IsZoonotic'].value_counts())
print(merged_dropna['IsZoonotic.stringent'].value_counts())
print(merged_dropna['ReverseZoonoses'].value_counts())

(22,)
(586, 24)
(564, 24)

0    378
1    186
Name: IsZoonotic, dtype: int64
0    458
1    106
Name: IsZoonotic.stringent, dtype: int64
0    557
1      7
Name: ReverseZoonoses, dtype: int64
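The label counts above are imbalanced, which matters for any later comparison between positive and negative sets. A quick arithmetic check, using only the counts printed above (the helper name is hypothetical):

```python
# (negative, positive) counts copied from the value_counts output above
label_counts = {
    'IsZoonotic': (378, 186),
    'IsZoonotic.stringent': (458, 106),
    'ReverseZoonoses': (557, 7),
}


def positive_fraction(negatives, positives):
    """Share of rows carrying the positive label."""
    return positives / (negatives + positives)


fractions = {
    label: round(positive_fraction(*counts), 3)
    for label, counts in label_counts.items()
}
print(fractions)
```

Roughly one third of the viruses are zoonotic under the permissive label and under one fifth under the stringent one.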




### Neo4j queries

Dataset Serratus overlap QA:
- 564 viruses in the original dataset, 513 after parsing names to NCBI taxids (mapped to a node in the graph)
- 181 virus taxons have an associated SOTU, 383 do not
- Of those 181, 104 are not zoonotic and 77 are zoonotic
- Including all palmprints instead of only SOTUs produces the same edges, except for virus taxon 92129 (Tupaia paramyxovirus), which has a large number of palmprints along with SOTU u15499. To keep things simple, I only include associations between SOTUs and virus taxons.
- Including children and parents of virus taxons only increases SOTU matches by 1, to 182 SOTUs. To keep things simple, I only include direct matches to species in the dataset.

In [57]:
query_sotu_nodes = """
    MATCH (a:SOTU)-[:HAS_POTENTIAL_TAXON]->(b:Taxon)
    WHERE b.taxId in $tax_ids
    RETURN
        id(a) as nodeId,
        a.palmId as appId,
        a.palmId as palmId,
        labels(a) as labels,
        b.taxId as taxId,
        b.rank as taxRank,
        b.taxOrder as taxOrder,
        CASE WHEN b.taxId in $zoonotic_tax_ids THEN True ELSE False END AS isZoonotic
"""

query_sotu_msa_edges = """
    MATCH (a:SOTU)-[r:SEQUENCE_ALIGNMENT]->(b:SOTU)
    WHERE a.palmId in $sotus
    AND b.palmId in $sotus
    RETURN
        id(a) as sourceNodeId,
        a.palmId as sourceAppId,
        id(b) as targetNodeId,
        b.palmId as targetAppId,
        'SEQUENCE_ALIGNMENT' as relationshipType,
        r.percentIdentity as percentIdentity,
        r.percentIdentity as weight
"""

query_stat_taxon_order_nodes = """
    CALL {
        MATCH (p:SOTU)<-[:HAS_SOTU]-(:Palmprint)<-[r:HAS_PALMPRINT]-(s:SRA)
            -[q:HAS_HOST_STAT]->()-[:HAS_PARENT*0..]->(t:Taxon {rank: 'order'})
        WHERE p.palmId in $sotus
        OPTIONAL MATCH (t)-[:HAS_PARENT*0..]->(u:Taxon)
        WHERE u.taxId in $host_class_tax_ids
        RETURN t, u
        UNION
        MATCH (p:SOTU)<-[r:HAS_PALMPRINT]-(s:SRA)
            -[q:HAS_HOST_STAT]->()-[:HAS_PARENT*0..]->(t:Taxon {rank: 'order'})
        WHERE p.palmId in $sotus
        OPTIONAL MATCH (t)-[:HAS_PARENT*0..]->(u:Taxon)
        WHERE u.taxId in $host_class_tax_ids
        RETURN t, u
    }
    WITH t, u
    RETURN
        id(t) as nodeId,
        t.taxId as appId,
        t.taxId as taxId,
        labels(t) as labels,
        t.rank as taxRank,
        t.taxOrder as taxOrder,
        CASE
            WHEN u.taxId = '40674' THEN 'Mammal'
            WHEN u.taxId IN ['1476529', '7777', '1476750'] THEN 'Fish'
            WHEN u.taxId = '33090' THEN 'Plant'
            WHEN u.taxId = '4751' THEN 'Fungi'
            WHEN u.taxId = '33630' THEN 'Parasite'
            WHEN u.taxId = '2' THEN 'Bacteria'
            WHEN u.taxId = '10239' THEN 'Virus'
            ELSE 'Other'
        END AS hostClass
"""

query_has_host_order_stat = '''
    CALL {
        MATCH (p:SOTU)<-[:HAS_SOTU]-(:Palmprint)<-[r:HAS_PALMPRINT]-(s:SRA)
            -[q:HAS_HOST_STAT]->()-[:HAS_PARENT*0..]->(t:Taxon {rank: 'order'})
        WHERE p.palmId in $sotus
        RETURN p, t, r, s, q
        UNION
        MATCH (p:SOTU)<-[r:HAS_PALMPRINT]-(s:SRA)
            -[q:HAS_HOST_STAT]->()-[:HAS_PARENT*0..]->(t:Taxon {rank: 'order'})
        WHERE p.palmId in $sotus
        RETURN p, t, r, s, q
    }
    WITH p, t, r, s, q, avg(q.percentIdentityFull) as percentIdentityFull
    RETURN
        id(p) as sourceNodeId,
        p.palmId as sourceAppId,
        CASE WHEN percentIdentityFull >= 0.2 THEN id(t) ELSE 8765758 END AS targetNodeId,
        CASE WHEN percentIdentityFull >= 0.2 THEN t.taxId ELSE '12908' END AS targetAppId,
        'HAS_HOST_STAT' as relationshipType,
        count(*) AS directAssociations,
        count(*) AS count,
        avg(r.percentIdentity) as avgPercentIdentityPalmprint,
        avg(q.percentIdentity) as avgPercentIdentityStatKmers,
        avg(q.percentIdentityFull) as avgPercentIdentityStatSpots,
        avg(q.percentIdentity) * avg(r.percentIdentity) as weight
'''

### QA queries


query_stat_taxon_nodes = """
    CALL {
        MATCH (p:SOTU)<-[:HAS_SOTU]-(:Palmprint)<-[r:HAS_PALMPRINT]-(s:SRA)
            -[q:HAS_HOST_STAT]->(t:Taxon)
        WHERE p.palmId in $sotus
        OPTIONAL MATCH (t)-[:HAS_PARENT*0..]->(u:Taxon)
        WHERE u.taxId in $host_class_tax_ids
        RETURN t, u
        UNION
        MATCH (p:SOTU)<-[r:HAS_PALMPRINT]-(s:SRA)
            -[q:HAS_HOST_STAT]->(t:Taxon)
        WHERE p.palmId in $sotus
        OPTIONAL MATCH (t)-[:HAS_PARENT*0..]->(u:Taxon)
        WHERE u.taxId in $host_class_tax_ids
        RETURN t, u
    }
    WITH t, u
    RETURN
        id(t) as nodeId,
        t.taxId as appId,
        t.taxId as taxId,
        labels(t) as labels,
        t.rank as taxRank,
        t.taxOrder as taxOrder,
        CASE
            WHEN u.taxId = '40674' THEN 'Mammal'
            WHEN u.taxId IN ['1476529', '7777', '1476750'] THEN 'Fish'
            WHEN u.taxId = '33090' THEN 'Plant'
            WHEN u.taxId = '4751' THEN 'Fungi'
            WHEN u.taxId = '33630' THEN 'Parasite'
            WHEN u.taxId = '2' THEN 'Bacteria'
            WHEN u.taxId = '10239' THEN 'Virus'
            ELSE 'Other'
        END AS hostClass
"""

query_has_host_stat = '''
    CALL {
        MATCH (p:SOTU)<-[:HAS_SOTU]-(:Palmprint)<-[r:HAS_PALMPRINT]-(s:SRA)
            -[q:HAS_HOST_STAT]->(t:Taxon)
        WHERE p.palmId in $sotus
        RETURN p, t, r, s, q
        UNION
        MATCH (p:SOTU)<-[r:HAS_PALMPRINT]-(s:SRA)
            -[q:HAS_HOST_STAT]->(t:Taxon)
        WHERE p.palmId in $sotus
        RETURN p, t, r, s, q
    }
    WITH p, t, r, s, q, avg(q.percentIdentityFull) as percentIdentityFull
    RETURN
        id(p) as sourceNodeId,
        p.palmId as sourceAppId,
        CASE WHEN percentIdentityFull >= 0.2 THEN id(t) ELSE 8765758 END AS targetNodeId,
        CASE WHEN percentIdentityFull >= 0.2 THEN t.taxId ELSE '12908' END AS targetAppId,
        'HAS_HOST_STAT' as relationshipType,
        count(*) AS directAssociations,
        count(*) AS count,
        avg(r.percentIdentity) as avgPercentIdentityPalmprint,
        avg(q.percentIdentity) as avgPercentIdentityStatKmers,
        avg(q.percentIdentityFull) as avgPercentIdentityStatSpots,
        avg(q.percentIdentity) * avg(r.percentIdentity) as weight
'''

query_tax_ids_exist = """
    MATCH (a:Taxon)
    WHERE a.taxId in $tax_ids
    RETURN COLLECT(DISTINCT a.taxId) as tax_ids
"""

query_palmprint_exists = """
    CALL {
        MATCH (a:SOTU)-[:HAS_POTENTIAL_TAXON]->(b:Taxon)
        WHERE b.taxId in $tax_ids
        RETURN b.taxId as tax_id, COLLECT(DISTINCT a.palmId) as palmprints
        UNION
        MATCH (a:SOTU)<-[:HAS_SOTU*]-(b:Palmprint)-[:HAS_POTENTIAL_TAXON]->(c:Taxon)
        WHERE c.taxId in $tax_ids
        RETURN c.taxId as tax_id, COLLECT(DISTINCT a.palmId) + COLLECT(DISTINCT b.palmId) as palmprints
    }
    WITH tax_id, palmprints
    RETURN
        tax_id,
        palmprints
"""

query_sotus_exist_parent_child = """
    CALL {
        MATCH (a:SOTU)-[:HAS_POTENTIAL_TAXON]->(b:Taxon)
        WHERE b.taxId in $tax_ids
        RETURN b.taxId as tax_id, COLLECT(DISTINCT a.palmId) as sotus
        UNION
        MATCH (a:Taxon)<-[:HAS_PARENT*]-(b:Taxon)<-[:HAS_POTENTIAL_TAXON]-(c:SOTU)
        WHERE a.taxId in $tax_ids
        RETURN b.taxId as tax_id, COLLECT(DISTINCT c.palmId) as sotus
        UNION
        MATCH (a:Taxon)-[:HAS_PARENT*]->(b:Taxon)<-[:HAS_POTENTIAL_TAXON]-(c:SOTU)
        WHERE a.taxId in $tax_ids
        AND b.rank = 'species'
        RETURN b.taxId as tax_id, COLLECT(DISTINCT c.palmId) as sotus
    }
    WITH tax_id, sotus
    RETURN
        tax_id,
        sotus
"""
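The `hostClass` CASE expression in the taxon node queries can be mirrored client-side, which is handy for relabeling cached CSVs without re-running Cypher. A minimal sketch, assuming the same taxid-to-class pairs as in the queries; the names `HOST_CLASS_BY_TAX_ID` and `host_class` are hypothetical.

```python
# Ancestor taxid -> coarse host class, copied from the Cypher CASE expression
HOST_CLASS_BY_TAX_ID = {
    '40674': 'Mammal',
    '1476529': 'Fish',
    '7777': 'Fish',
    '1476750': 'Fish',
    '33090': 'Plant',
    '4751': 'Fungi',
    '33630': 'Parasite',
    '2': 'Bacteria',
    '10239': 'Virus',
}


def host_class(ancestor_tax_id):
    """Map an ancestor taxid (str, int, or None) to a host class, defaulting to 'Other'."""
    if ancestor_tax_id is None:
        return 'Other'
    return HOST_CLASS_BY_TAX_ID.get(str(ancestor_tax_id), 'Other')


print(host_class('7777'), host_class(40674), host_class(None))
```

Accepting both `str` and `int` covers the case where taxids were read back from a cached CSV as integers.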

### Create Neo4j dataframes

In [59]:
tax_ids = list(merged_dropna.TaxID.unique())
zoonotic_tax_ids = list(merged_dropna[merged_dropna['IsZoonotic'] == True].TaxID.unique())
host_class_tax_ids = [
    '40674', # [Mammal] Mammalia
    '1476529', # [Fish], Cyclostomata 
    '7777', # [Fish], Chondrichthyes
    '1476750', # [Fish], fish environmental sample
    '33090', # [Plant], Viridiplantae
    '4751', # [Fungi]
    '33630', # [Parasite], Alveolata
    '2', # [Bacteria]
    '10239', # [Virus], Viruses
]

def _log_df(df):
    namespace = globals()
    var_name = [name for name in namespace if namespace[name] is df]
    print(var_name, df.shape)
    print(df.head())


def fetch_cached_df(query, params, filename, use_cache=True, log=False):
    if os.path.exists(neo4j_data_path + filename) and use_cache:
        df = pd.read_csv(neo4j_data_path + filename)
        df = utils.deserialize_df(df)
    else:
        df = gds.run_cypher(query, params=params)
        df.to_csv(neo4j_data_path + filename, index=False)
    if log:
        _log_df(df)
    return df


def get_neo4j_data():
    sotu_nodes = fetch_cached_df(
        query_sotu_nodes,
        {
            'tax_ids': tax_ids,
            'zoonotic_tax_ids': zoonotic_tax_ids,
        },
        'sotu_nodes.csv'
    )
    sotus = sotu_nodes['palmId'].unique().tolist()
    sotu_msa_edges = fetch_cached_df(
        query_sotu_msa_edges,
        {
            'tax_ids': tax_ids,
            'sotus': sotus,
        },
        'sotu_msa_edges.csv'
    )
    sotu_nodes['taxId'] = sotu_nodes['taxId'].astype(int)

    taxon_order_nodes = fetch_cached_df(
        query_stat_taxon_order_nodes,
        {
            'sotus': sotus,
            'host_class_tax_ids': host_class_tax_ids,
        },
        'taxon_order_nodes.csv',
    )
    taxon_order_nodes['taxId'] = taxon_order_nodes['taxId'].astype(int)
    

    taxon_nodes = fetch_cached_df(
        query_stat_taxon_nodes,
        {
            'sotus': sotus,
            'host_class_tax_ids': host_class_tax_ids,
        },
        'taxon_nodes.csv',
    )
    taxon_nodes['taxId'] = taxon_nodes['taxId'].astype(int)

    has_host_order_stat_edges = fetch_cached_df(
        query_has_host_order_stat,
        {
            'sotus': sotus,
        },
        'has_host_order_stat_edges.csv',
    )

    has_host_stat_edges = fetch_cached_df(
        query_has_host_stat,
        {
            'sotus': sotus,
        },
        'has_host_stat_edges.csv',
    )

    return {
        'sotu_nodes': sotu_nodes,
        'sotu_msa_edges': sotu_msa_edges,
        'taxon_order_nodes': taxon_order_nodes,
        'taxon_nodes': taxon_nodes,
        'has_host_stat_edges': has_host_stat_edges,
        'has_host_order_stat_edges': has_host_order_stat_edges,
    }

neo4j_data = get_neo4j_data()
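Adjacent string literals in a Python list are silently concatenated when a comma is missing, so a hand-written taxid list such as `host_class_tax_ids` is worth a cheap length check before it is passed as a query parameter. A small sketch; `check_tax_id_list` is a hypothetical helper.

```python
with_missing_comma = [
    '1476529',
    '7777'      # no comma: this literal merges with the next one
    '1476750',
]
with_commas = ['1476529', '7777', '1476750']


def check_tax_id_list(tax_ids, expected_count):
    """Raise if the list does not have the number of entries the author intended."""
    if len(tax_ids) != expected_count:
        raise ValueError(
            f'expected {expected_count} taxids, got {len(tax_ids)}: {tax_ids}'
        )
    return tax_ids


print(with_missing_comma)   # two entries, not three
check_tax_id_list(with_commas, expected_count=3)
```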

In [60]:
print(neo4j_data['sotu_nodes'].shape)
print(neo4j_data['sotu_msa_edges'].shape)
print(neo4j_data['taxon_nodes'].shape)
print(neo4j_data['has_host_order_stat_edges'].shape)

(742, 8)
(2714, 7)
(613, 7)
(1852, 11)


### Distribution investigation

In [61]:
print(len(neo4j_data['has_host_order_stat_edges']['sourceAppId'].unique()))
print(len(neo4j_data['sotu_nodes']['appId'].unique()) )

507
742


In [71]:
unique_tax_orders =  neo4j_data['sotu_nodes']['taxId'].value_counts()
print(unique_tax_orders)


145856     215
291484      47
1239567     31
11277       25
28875       20
          ... 
37124        1
909207       1
11080        1
356862       1
1046251      1
Name: taxId, Length: 181, dtype: int64


In [75]:
# Issue: Check on unclassified: 12908 (percentIdentityFull < 0.2)



has_primate_assoc = neo4j_data['has_host_order_stat_edges'].loc[neo4j_data['has_host_order_stat_edges']['targetAppId'] == 9443]['sourceAppId'].unique()
has_primate_assoc_tax = neo4j_data['sotu_nodes'].loc[neo4j_data['sotu_nodes']['appId'].isin(has_primate_assoc)]['taxId'].unique()
has_zoonotic_potential = neo4j_data['sotu_nodes'].loc[neo4j_data['sotu_nodes']['isZoonotic'] == True]['appId'].unique()
has_zoonotic_potential_tax = neo4j_data['sotu_nodes'].loc[neo4j_data['sotu_nodes']['appId'].isin(has_zoonotic_potential)]['taxId'].unique()


print(f'unique SOTU taxIds with primate assoc {len(has_primate_assoc_tax)}')
print(f'unique SOTU taxIds with isZoonotic {len(has_zoonotic_potential_tax)}')

print(f'SOTUs taxIds in both {len(set(has_zoonotic_potential_tax) & set(has_primate_assoc_tax)) }')
print(f'SOTUs taxIds unique to isZoonotic {len(set(has_zoonotic_potential_tax) - set(has_primate_assoc_tax)) }')
print(f'SOTUs taxIds unique to primate assoc {len(set(has_primate_assoc_tax) - set(has_zoonotic_potential_tax)) }')
print()

# majority class, only 181 unique SOTU taxIds, taxId 145856 has 215 of 477 SOTUs

print(f'unique SOTUs with primate assoc {len(has_primate_assoc)}')
print(f'unique SOTUs with isZoonotic {len(has_zoonotic_potential)}')

print(f'SOTUs in both {len(set(has_zoonotic_potential) & set(has_primate_assoc)) }')
print(f'SOTUs unique to isZoonotic {len(set(has_zoonotic_potential) - set(has_primate_assoc)) }')
print(f'SOTUs unique to primate assoc {len(set(has_primate_assoc) - set(has_zoonotic_potential)) }')





unique SOTU taxIds with primate assoc 70
unique SOTU taxIds with isZoonotic 77
SOTUs taxIds in both 38
SOTUs taxIds unique to isZoonotic 39
SOTUs taxIds unique to primate assoc 32

unique SOTUs with primate assoc 128
unique SOTUs with isZoonotic 477
SOTUs in both 72
SOTUs unique to isZoonotic 405
SOTUs unique to primate assoc 56


### Community detection and node centrality

In [79]:
def construct_gds_projection(neo4j_data):
    graph_name = 'zoonotic'
    nodes = pd.concat([
        neo4j_data['sotu_nodes'][['nodeId', 'labels', 'taxId']],
        # neo4j_data['taxon_order_nodes'][['nodeId', 'labels']],
        # neo4j_data['taxon_nodes_missing_order'][['nodeId', 'labels']],
    ])
    relationships = pd.concat([
        neo4j_data['sotu_msa_edges'][['sourceNodeId', 'targetNodeId', 'relationshipType', 'weight']],
    ])

    if gds.graph.exists(graph_name)['exists']:
        gds.graph.drop(gds.graph.get(graph_name))

    G = gds.alpha.graph.construct(
        graph_name=graph_name,
        nodes=nodes,
        relationships=relationships,
        concurrency=4,
        undirected_relationship_types=['HAS_HOST_STAT'],
    )
    return G

In [80]:
def run_community_analysis(G):
  communities = gds.labelPropagation.stream(
    G,
    nodeLabels=['SOTU'],
    relationshipWeightProperty='weight',
    maxIterations=30,
    seedProperty='taxId',
  )
  unique_communities = communities.communityId.unique()
  community_counter = collections.Counter(communities.communityId)
  print('LPA Unique communities:', len(unique_communities))
  print('LPA Most common communities:', community_counter.most_common(10))

  wcc = gds.wcc.stream(
    G,
    nodeLabels=['SOTU'],
    relationshipWeightProperty='weight',
    threshold=None,
    seedProperty='taxId'
  )

  unique_communities = wcc.componentId.unique()
  community_counter = collections.Counter(wcc.componentId)
  print('WCC Unique communities:', len(unique_communities))
  print('WCC Most common communities:', community_counter.most_common(10))

  page_ranks_sotu = gds.pageRank.stream(
    G,
    nodeLabels=['SOTU'],
    relationshipWeightProperty='weight',
    maxIterations=100,
  )
  page_ranks_sotu = page_ranks_sotu.rename(columns={'score': 'pageRankSotu'})
  page_ranks_sotu['pageRankSotu'] = page_ranks_sotu['pageRankSotu'].round(0)

  louvain = gds.louvain.stream(
    G,
    nodeLabels=['SOTU'],
    relationshipWeightProperty='weight',
    seedProperty='taxId',
  )
  unique_communities = louvain.communityId.unique()
  community_counter = collections.Counter(louvain.communityId)
  print('Louvain Unique communities:', len(unique_communities))
  print('Louvain Most common communities:', community_counter.most_common(10))

  return communities, wcc, page_ranks_sotu, louvain

In [82]:
G = construct_gds_projection(neo4j_data)
communities, wcc, page_ranks_sotu, louvain = run_community_analysis(G)
G.drop()


LPA Unique communities: 110
LPA Most common communities: [(145856, 215), (291484, 48), (11277, 47), (1239567, 37), (64300, 27), (28875, 20), (28344, 17), (147711, 17), (40058, 15), (11033, 15)]
WCC Unique communities: 98
WCC Most common communities: [(145856, 215), (291484, 45), (11072, 43), (11272, 41), (28875, 24), (40050, 23), (1239567, 22), (147711, 19), (1239566, 18), (11020, 16)]
Louvain Unique communities: 80
Louvain Most common communities: [(145856, 215), (1239574, 49), (11303, 48), (291484, 48), (12637, 46), (11036, 25), (1679172, 24), (28876, 24), (147712, 22), (2593991, 21)]
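To see how the component count reacts to the sequence alignment threshold without going through GDS, weakly connected components can be computed locally with a union-find. The sketch below keeps only edges whose weight exceeds a cutoff; that mirrors my reading of how a weight threshold would be applied, so treat it as an assumption. The node names and weights are hypothetical toy data.

```python
def connected_components(nodes, edges, min_weight=0.0):
    """Weakly connected components over edges with weight > min_weight."""
    parent = {node: node for node in nodes}

    def find(node):
        # Path halving keeps the trees shallow
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target, weight in edges:
        if weight > min_weight:
            root_source, root_target = find(source), find(target)
            if root_source != root_target:
                parent[root_target] = root_source

    components = {}
    for node in nodes:
        components.setdefault(find(node), set()).add(node)
    return list(components.values())


# Hypothetical SOTU ids and percent-identity weights
toy_nodes = ['u1', 'u2', 'u3', 'u4', 'u5']
toy_edges = [('u1', 'u2', 95.0), ('u2', 'u3', 80.0), ('u4', 'u5', 92.0)]

for cutoff in (0.0, 90.0):
    sizes = sorted(len(c) for c in connected_components(toy_nodes, toy_edges, cutoff))
    print(cutoff, sizes)
```

Raising the cutoff splits components held together by weaker alignments, which is the effect the threshold TODO item is meant to probe.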


### Graphistry visualizations

In [77]:
# Create node and relationship dataframes with full information
def get_graphistry_df(
        neo4j_data, communities, page_ranks_sotu
    ):

    neo4j_data['sotu_nodes']['displayLabel'] = neo4j_data['sotu_nodes']['appId']
    neo4j_data['taxon_order_nodes']['displayLabel'] = neo4j_data['taxon_order_nodes']['appId']

    nodes = pd.concat([
        neo4j_data['sotu_nodes'],
        neo4j_data['taxon_order_nodes'],
    ])
    nodes = nodes.merge(
        communities,
        left_on='nodeId',
        right_on='nodeId',
        how='left',
    )
    nodes = nodes.merge(
        page_ranks_sotu,
        left_on='nodeId',
        right_on='nodeId',
        how='left',
    )

    nodes['type'] = nodes['labels']

    nodes = nodes[[
        'appId', 'labels', 'type',
        'taxId', 'isZoonotic', 'hostClass',
        'taxRank', 'communityId',
        'taxOrder', 'displayLabel',
    ]].astype(str)

    # Encode communityId to use default color palette
    nodes['communityId'] = nodes['communityId'].astype('float64')
    nodes['communityId'] = nodes['communityId'].fillna(-1)
    nodes['communityId'] = nodes['communityId'].astype('int32')
    labels = nodes['communityId'].unique()
    mapping = {label: i for i, label in enumerate(labels)}
    nodes['communityId'] =  nodes['communityId'].replace(mapping)
    nodes['communityColorCodes'] = nodes['communityId'].mod(11)

    ## Aggregate statCoverage for nodes using average of edges
    # NOTE: this expects a 'statCoverage' column on has_host_stat_edges, which
    # query_has_host_stat does not return as written.
    agg_source = neo4j_data['has_host_stat_edges'].groupby('sourceAppId').agg({'statCoverage': 'mean'})
    agg_target = neo4j_data['has_host_stat_edges'].groupby('targetAppId').agg({'statCoverage': 'mean'})
    aggs = pd.concat([agg_source, agg_target])
    aggs = aggs.reset_index()
    aggs['statCoverageInt'] = round(aggs['statCoverage'] * 100)
    aggs['statCoverage'] = aggs['statCoverage'].round(1)
    aggs['statCoverageInt'] = aggs['statCoverageInt'].round(0).astype('int32')
    aggs = aggs.rename({'index': 'appId'}, axis='columns')
    nodes = nodes.merge(
        aggs,
        left_on='appId',
        right_on='appId',
        how='left',
    )

    # Add host counts for palmprints
    stat_hosts_excl_mixed = neo4j_data['has_host_stat_edges'][neo4j_data['has_host_stat_edges']['targetAppId'] != 12908]
    host_counts = stat_hosts_excl_mixed.groupby('sourceAppId').size().groupby(level=0).max()
    # host_counts = neo4j_data['has_host_stat_edges'].groupby('sourceAppId').size().groupby(level=0).max()
    host_counts = host_counts.reset_index()
    host_counts = host_counts.rename({0: 'hostCount'}, axis='columns')
    host_counts['hostCountNormalized'] = round((host_counts['hostCount'] / host_counts['hostCount'].max()) * 100, -1).astype('int32')
    nodes = nodes.merge(
        host_counts,
        left_on='appId',
        right_on='sourceAppId',
        how='left',
    )

    relationships = pd.concat([
        neo4j_data['sotu_msa_edges'],
        neo4j_data['has_host_stat_edges'],
    ])
    
    relationships['targetAppId'] = relationships['targetAppId'].astype(str)
    relationships['sourceAppId'] = relationships['sourceAppId'].astype(str)

    relationships['weight'] = relationships['weight'].astype(float)
    relationships['weightInt'] = round(relationships['weight'] * 100, -1).astype('int32')
    relationships = relationships[[
        'sourceAppId', 'targetAppId', 'relationshipType', 
        'weight', 'weightInt',
    ]].astype(str)
    relationships['weightInt'] = relationships['weightInt'].astype('int32')
    return nodes, relationships



def load_or_create_graphistry_df(node_filename, relationship_filename, use_cache=False):
    if use_cache and os.path.exists(graphistry_data_path + node_filename) \
            and os.path.exists(graphistry_data_path + relationship_filename):
        nodes = pd.read_csv(graphistry_data_path + node_filename)
        edges = pd.read_csv(graphistry_data_path + relationship_filename)
    else:
        neo4j_data = get_neo4j_data()
        G = construct_gds_projection(neo4j_data)
        # run_community_analysis returns four values; louvain is unused here
        communities, wcc, page_ranks_sotu, louvain = run_community_analysis(G)
        # binary_pvals, multi_pvals = run_pval_analysis(neo4j_data)
        nodes, edges = get_graphistry_df(
            neo4j_data, communities, page_ranks_sotu
        )
        nodes.to_csv(graphistry_data_path + node_filename, index=False)
        edges.to_csv(graphistry_data_path + relationship_filename, index=False)
    return nodes, edges


In [78]:
nodes, relationships = load_or_create_graphistry_df('nodes.csv', 'edges.csv')

NameError: name 'construct_gds_projection' is not defined

In [169]:
alt_color_pallete = [
    "rgb(166, 206, 227)",
    "rgb(31, 120, 180)",
    "rgb(178, 223, 138)",
    "rgb(51, 160, 44)",
    "rgb(251, 154, 153)",
    "rgb(227, 26, 28)",
    "rgb(253, 191, 111)",
    "rgb(255, 127, 0)",
    "rgb(202, 178, 214)",
    "rgb(106, 61, 154)",
    "rgb(255, 255, 153)", 
    "#ffffff",
]

categorical_colors = {}

for communityId in nodes.communityColorCodes.unique():
    categorical_colors[str(communityId)] = alt_color_pallete[communityId % len(alt_color_pallete)]

categorical_colors[11] = "#808080"

# for i, row in nodes.iterrows():
#     if row['hostCount'] > 1:
#         nodes.at[i,'communityColorCodes'] = 11
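The color mapping above mixes string keys (from `str(communityId)`) with the integer key `11`. A categorical mapping generally needs keys of the same type as the values in the encoded column, so the sketch below builds the mapping with integer keys throughout. That choice assumes `communityColorCodes` is stored as integers; the function name and the shortened palette are hypothetical.

```python
def build_color_mapping(color_codes, palette, fallback_code=None, fallback_color='#808080'):
    """Map each distinct integer color code to a palette entry, wrapping around if needed."""
    mapping = {
        int(code): palette[int(code) % len(palette)]
        for code in sorted(set(color_codes))
    }
    if fallback_code is not None:
        mapping[int(fallback_code)] = fallback_color
    return mapping


# Shortened stand-in palette; the notebook's list has 12 entries
toy_palette = ['rgb(166, 206, 227)', 'rgb(31, 120, 180)', 'rgb(178, 223, 138)']
toy_codes = [0, 1, 2, 1, 0]

toy_mapping = build_color_mapping(toy_codes, toy_palette, fallback_code=11)
print(toy_mapping)
```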

In [170]:
g = graphistry.bind()

g = g.bind(
    source='sourceAppId',
    destination='targetAppId',
    edge_weight='weight',
).edges(relationships)

g = g.bind(
    node='appId',
    point_label='displayLabel',
    point_size=None,
    # point_size='statCoverageInt',
).nodes(nodes)


params = {
        'play': 2000,
        'menu': True, 
        'info': True,
        'showArrows': True,
        # 'pointSize': 2.0, 
        # 'edgeCurvature': 0.5,
        'edgeOpacity': 0.25, 
        'pointOpacity': 1.0,
        # 'lockedX': False, 'lockedY': False, 'lockedR': False,
        'linLog': True, 
        'compactLayout': True,
        'strongGravity': True,
        'dissuadeHubs': True,
        'edgeInfluence': 5,
        # 'precisionVsSpeed': 0, 'gravity': 1.0, 'scalingRatio': 1.0,
        # 'showLabels': True, 'showLabelOnHover': True,
        # 'showPointsOfInterest': True, 'showPointsOfInterestLabel': True, 
        'showLabelPropertiesOnHover': True,
        'pointsOfInterestMax': 15,
        'backgroundColor': 'black',
      }

g = g.settings(url_params=params)


g = g.addStyle(
    bg={
        'color': 'white',
})

# g = g.encode_point_color(
#     'communityIds',
#     categorical_mapping=categorical_colors,
#     default_mapping='grey', 
# )

g = g.encode_point_color(
    'communityColorCodes',
    categorical_mapping=categorical_colors,
)

g.plot()

In [None]:
print(urllib.parse.urlencode(params))
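One detail to keep in mind when pasting the encoded string into a URL: `urllib.parse.urlencode` converts values with `str()`, so Python booleans come out as `True`/`False` with a capital letter. Whether the receiving side accepts that casing is not something this notebook confirms, so the lowercase variant below is only a precaution. The reduced parameter dict is a subset of `params` for illustration.

```python
import urllib.parse

sample_params = {
    'play': 2000,
    'menu': True,
    'edgeOpacity': 0.25,
    'backgroundColor': 'black',
}

default_query = urllib.parse.urlencode(sample_params)

# Lowercase booleans only; leave numbers and strings untouched
lowercase_query = urllib.parse.urlencode({
    key: str(value).lower() if isinstance(value, bool) else value
    for key, value in sample_params.items()
})

print(default_query)
print(lowercase_query)
```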

Notes:

- General takeaways:
    - Clustering SOTUs by alignment weight mostly tracks taxOrder, with some subcommunities inside the larger communities. Reovirales does not cluster well.

- SOTU only, default community detection
    - 742 nodes, 2714 edges
    - LPA Unique communities: 438
    - LPA Most common communities: [(8170199, 45), (7858962, 39), (7743098, 27), (7935516, 20), (7999668, 15), (7971675, 13), (7714332, 12), (8100725, 12), (7994679, 12), (8017721, 10)]
    - WCC Unique communities: 386
    - WCC Most common communities: [(431, 86), (2, 43), (0, 40), (5, 23), (34, 17), (48, 16), (79, 16), (41, 15), (1, 14), (106, 14)]
    - Louvain Unique communities: 80
    - Louvain Most common communities: [(145856, 215), (1239574, 49), (11303, 48), (291484, 48), (12637, 46), (11036, 25), (1679172, 24), (28876, 24), (147712, 22), (2593991, 21)]
    - 356 standalone nodes, 386 remain after hiding standalone nodes
    - standalone: zoonotic: 245/477, non-zoonotic: 141/265
    - communities are mostly homogeneous but contain some other colors (?)
    - non-zoonotic SOTUs are mixed into medium-to-small communities and have medium-to-small node degree
    - zoonotic SOTUs make up the entire largest connected community

- SOTU only, seeded community detection
    - 739 nodes, 2570 edges
    - LPA Unique communities: 110
    - LPA Most common communities: [(145856, 215), (291484, 48), (11277, 47), (1239567, 37), (64300, 27), (28875, 20), (28344, 17), (147711, 17), (40058, 15), (11033, 15)]
    - WCC Unique communities: 98
    - WCC Most common communities: [(145856, 215), (291484, 45), (11072, 43), (11272, 41), (28875, 24), (40050, 23), (1239567, 22), (147711, 19), (1239566, 18), (11020, 16)]
    - Louvain Unique communities: 80
    - Louvain Most common communities: [(145856, 215), (1239574, 49), (11303, 48), (291484, 48), (12637, 46), (11036, 25), (1679172, 24), (28876, 24), (147712, 22), (2593991, 21)]
    - 356 standalone nodes, 383 remain after hiding standalone nodes
    - standalone: zoonotic: 242/477, non-zoonotic: 141/265
    - communities are more homogeneous than the default
    - taxOrder and taxId correspond nicely with clusters
    - largest community is even larger
    - overall zoonotic/non-zoonotic appears the same
    - non-zoonotic exist in larger community now
    - low-weight edges seem to involve long-range



