# Classic graph analytics, GPU-style

Traditional graph methods give insights into how entities are related:

* The role of each node: Centrality, betweenness, PageRank, ...
* Partitioning: Community, k-core, ...

## Why

* Social networks: Influencers, bridgers, ...
* Security & Fraud: Interaction subgraph mining ("motifs")


In [1]:
#! pip install --user graphistry --upgrade
import cudf, cugraph, graphistry, json, pandas as pd
from helpers import graphistry_to_cugraph, merge_in_enriched_entities, merge_in_vertex_alg

{'graphistry': graphistry.__version__, 'cudf': cudf.__version__, 'cugraph': cugraph.__version__}

ModuleNotFoundError: No module named 'cudf'

In [16]:
# Read credentials with a context manager so the file handle is closed
with open('./graphistry_creds.json') as f:
    graphistry_creds = json.load(f)
graphistry.register(api=3, **graphistry_creds)

print('./graphistry_creds.json: ', graphistry_creds.keys())

./graphistry_creds.json:  dict_keys(['username', 'password'])


## Example: Twitter botnet

* Used for multiple purposes:
  * Popularity: Supernodes bought follows
  * Scammers: Bought retweets
  
* Unnatural structure
  * Rather than clear community structure, most of the nodes are bots with random in-network + out-of-network follows
  * They retweet the middle community: "tweet laundering"

In [17]:
%%html

<iframe src="https://hub.graphistry.com/graph/graph.html?dataset=Twitter" width="100%" height="600"></iframe>

## Data

Download edge table and wrap as a Graphistry object

In [27]:
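# Note: the file has no header row, so pandas takes the first edge ("0 1") as the
# column names; the rename below relies on that, and that one edge is not loaded as data
fb_df = (
    pd.read_csv(
        'https://raw.githubusercontent.com/graphistry/pygraphistry/master/demos/data/facebook_combined.txt',
        sep=' ')
    .rename(columns={'0': 's', '1': 'd'})
)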

# Cleaning for cugraph etc: good practice to do at the beginning
fb_df['s'] = fb_df['s'].astype('int32')  # from int64
fb_df['d'] = fb_df['d'].astype('int32')  # from int64

# Wrap in graphistry for easier graph-y manipulations
g = (
    graphistry
    .edges(fb_df, 's', 'd')  # Bind data, schema
    .materialize_nodes()     # Autogenerate a g._nodes dataframe from g._edges
)
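The load step above renames columns called `'0'` and `'1'`, which suggests the file has no header row and the first edge was read as the header. Below is a minimal stdlib sketch, on a made-up three-edge sample, of how that choice changes the edge count and how a node list can be derived from edge endpoints. `SAMPLE`, `read_edges`, and `nodes_from_edges` are hypothetical names for illustration, not part of pandas or Graphistry.

```python
import csv
import io

# Hypothetical three-edge sample in the same space-separated, headerless layout
SAMPLE = "0 1\n0 2\n1 2\n"

def read_edges(text, first_row_is_header):
    """Parse 'src dst' lines into a list of (int, int) edge tuples."""
    rows = list(csv.reader(io.StringIO(text), delimiter=' '))
    if first_row_is_header:
        rows = rows[1:]  # the first edge is consumed as column names
    return [(int(s), int(d)) for s, d in rows]

def nodes_from_edges(edges):
    """Collect every distinct endpoint, similar in spirit to materialize_nodes()."""
    return sorted({node for edge in edges for node in edge})

with_header = read_edges(SAMPLE, first_row_is_header=True)
without_header = read_edges(SAMPLE, first_row_is_header=False)

print(len(with_header), len(without_header))
print(nodes_from_edges(without_header))
```

With pandas, passing `header=None` and `names=['s', 'd']` to `read_csv` would keep the first row as data.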

In [28]:
print('EDGES:', g._edges.shape)
print('NODES:', g._nodes.shape)
g._edges.sample(5)

EDGES: (88233, 2)
NODES: (4039, 1)


| | s | d |
|---|---|---|
| 76797 | 2929 | 3047 |
| 10164 | 705 | 763 |
| 21970 | 1158 | 1583 |
| 17551 | 1048 | 1753 |
| 47829 | 2043 | 2469 |


In [30]:
g._nodes.dtypes

id    int32
dtype: object

In [31]:
g._edges.dtypes

s    int32
d    int32
dtype: object

In [37]:
g.plot()

# Go GPU: RAPIDS.ai's cuDF dataframes & cuGraph analytics

Prepare data for cuGraph: edge weights + move to GPU

In [38]:
g2 = (
    g.edges(g._edges.assign(ew=1.0))            # create weights column
    .bind(edge_weight='ew')                     # bind it so we can forget the name
    .edges(lambda g: cudf.DataFrame(g._edges))  # copy edges to GPU
    .edges(lambda g: g._edges.assign(ew=g._edges['ew'].astype('float32')))  # cast weights to float32
    .nodes(lambda g: cudf.DataFrame(g._nodes))  # copy nodes to GPU
)

G = graphistry_to_cugraph(g2)

# Community detection

Many algorithms partition a graph into communities.

Variations:

* Hierarchical
* Strong relationships: Triangles, K-Trusses, ...
* Node vs edges; single or multi-label
  * Ex: Someone may be in multiple networks - school, work, ...
  * Communities may have relationships & nesting: Pudding -> Chocolate -> Dessert -> Food
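The "strong relationships" variation above rests on triangles: three nodes that are all connected to each other. A minimal pure-Python sketch of per-node triangle counting on an undirected edge list follows. The function name `count_triangles` and the toy graph are assumptions for illustration; this is not the cuGraph implementation.

```python
from itertools import combinations

def count_triangles(edges):
    """Return {node: number of triangles the node takes part in} for an undirected graph."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    counts = {node: 0 for node in adj}
    for node, nbrs in adj.items():
        # A triangle through `node` exists for every pair of its neighbors that are linked
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                counts[node] += 1
    return counts

# Toy graph: one triangle (0, 1, 2) plus a pendant node 3 attached to node 2
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
tri = count_triangles(toy_edges)
total_triangles = sum(tri.values()) // 3  # each triangle is counted once per corner
print(tri, total_triangles)
```

Nodes with a high triangle count sit in tightly knit groups, which is the intuition that k-truss style methods build on.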

In [39]:
spectral_df = cugraph.spectralBalancedCutClustering(G, 10)

In [40]:
spectral_df.sample(3)

| | cluster | vertex |
|---|---|---|
| 585 | 6 | 489 |
| 1733 | 4 | 3002 |
| 826 | 4 | 3144 |


In [41]:
g3 = merge_in_enriched_entities(g2, spectral_df, enrich_as='spectral')
g3.encode_point_color('spectral').plot(memoize=False)

## Importance: K-core, Betweenness, PageRank
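As background for the `core_number` column computed below: a node's core number is the largest k such that the node survives in a subgraph where every node has at least k neighbors. A minimal pure-Python sketch of the standard peeling approach follows. The name `core_numbers` and the toy graph are assumptions for illustration, and the linear scan for the minimum-degree node is chosen for brevity, not speed.

```python
def core_numbers(edges):
    """Return {node: core number} for an undirected graph by repeatedly peeling the lowest-degree node."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops do not count toward the core number
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=degree.get)  # lowest remaining degree
        k = max(k, degree[node])               # the core level never decreases while peeling
        core[node] = k
        remaining.remove(node)
        for nbr in adj[node]:
            if nbr in remaining:
                degree[nbr] -= 1               # removing `node` lowers its neighbors' degrees
    return core

# Toy graph: triangle (0, 1, 2) plus a pendant node 3 attached to node 2
toy_cores = core_numbers([(0, 1), (1, 2), (0, 2), (2, 3)])
print(toy_cores)
```

The triangle nodes land in the 2-core while the pendant node stays in the 1-core, so high core numbers mark nodes embedded in dense regions.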

In [48]:
%%time

print('betweenness..')
g4 = merge_in_vertex_alg(g3, 'betweenness_centrality', G)

print('core..')
g4 = merge_in_vertex_alg(g4, 'core_number', G)

print('pagerank..')
g4 = merge_in_vertex_alg(g4, 'pagerank', G)

print('hits: authorities & hubs..')
g4 = merge_in_enriched_entities(g4, cugraph.hits(G))

# ECG (ensemble clustering for graphs) is a community detection method, not part of HITS
print('ecg..')
g4 = merge_in_vertex_alg(g4, 'ecg', G)


# Not working at time of use:
#g4 = merge_in_vertex_alg(g4, 'louvain', G)
#g4 = merge_in_vertex_alg(g4, 'leiden', G)

g4._nodes.sample(5)

betweenness..
core..
pagerank..
hits: authorities & hubs..
ecg..
CPU times: user 9.47 s, sys: 207 ms, total: 9.68 s
Wall time: 9.63 s


| | id | spectral | betweenness_centrality | core_number | pagerank | hubs | authorities | ecg |
|---|---|---|---|---|---|---|---|---|
| 2231 | 565 | 2 | 6.262184e-07 | 19 | 0.000157 | 5.381999e-08 | 5.382001e-08 | 8 |
| 45 | 2273 | 2 | 9.994309e-06 | 70 | 0.000287 | 0.0002556941 | 0.0002556941 | 26 |
| 3130 | 2642 | 2 | 0.0005477193 | 70 | 0.000451 | 0.0003348714 | 0.0003348715 | 26 |
| 2146 | 519 | 7 | 1.902928e-06 | 18 | 0.00021 | 3.836348e-08 | 3.836349e-08 | 8 |
| 3987 | 3915 | 7 | 1.970432e-06 | 14 | 0.000268 | 3.591928e-11 | 3.59193e-11 | 20 |
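To help read the `pagerank` column: PageRank scores can be computed by power iteration, where every node repeatedly passes a damped share of its score to its neighbors. A minimal pure-Python sketch for an undirected edge list follows. The function name, the damping value of 0.85, and the fixed iteration count are assumptions for illustration and may differ from cuGraph's defaults.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; each undirected edge is treated as two directed links."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    n = len(adj)
    rank = {node: 1.0 / n for node in adj}  # start from a uniform distribution
    for _ in range(iterations):
        # Every node receives the same baseline "teleport" share
        new_rank = {node: (1.0 - damping) / n for node in adj}
        for node, nbrs in adj.items():
            share = damping * rank[node] / len(nbrs)  # split the damped score evenly
            for nbr in nbrs:
                new_rank[nbr] += share
        rank = new_rank
    return rank

# Toy star graph: node 0 in the center, leaves 1..3
star_rank = pagerank([(0, 1), (0, 2), (0, 3)])
print(star_rank)
```

In this sketch the scores sum to 1, so on a graph with thousands of nodes each individual value is small, which may be why the values in the table above are on the order of 1e-4.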


In [47]:
(g4
 .encode_point_color('core_number', palette=['blue', 'red', 'yellow'], as_continuous=True)
 .encode_point_size('betweenness_centrality')
).plot()
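The plot above sizes nodes by betweenness centrality, which measures how often a node lies on shortest paths between other nodes. A minimal pure-Python sketch of Brandes' algorithm for an unweighted, undirected graph follows. The function name and toy graph are assumptions for illustration, and the scores here are unnormalized, so they are not on the same scale as the `betweenness_centrality` column.

```python
from collections import deque

def betweenness(edges):
    """Unnormalized betweenness centrality for an unweighted, undirected graph (Brandes' algorithm)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops never lie on a shortest path
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    score = {node: 0.0 for node in adj}
    for source in adj:
        # Phase 1: BFS from `source`, counting shortest paths and recording predecessors
        dist = {source: 0}
        paths = {node: 0 for node in adj}
        paths[source] = 1
        preds = {node: [] for node in adj}
        order = []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
                if dist[nbr] == dist[node] + 1:
                    paths[nbr] += paths[node]
                    preds[nbr].append(node)

        # Phase 2: walk back from the farthest nodes, accumulating path dependencies
        dep = {node: 0.0 for node in adj}
        for node in reversed(order):
            for pred in preds[node]:
                dep[pred] += paths[pred] / paths[node] * (1.0 + dep[node])
            if node != source:
                score[node] += dep[node]

    # Each unordered pair was visited from both endpoints, so halve the totals
    return {node: s / 2.0 for node, s in score.items()}

# Toy path graph 0 - 1 - 2: only the middle node sits between another pair
path_scores = betweenness([(0, 1), (1, 2)])
print(path_scores)
```

Bridging nodes get high scores even when their degree is low, which is why betweenness is a useful complement to the core number used for color.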

# Next steps

Many tools!

Some popular options for classic visual graph analytics:

* viz:
  * https://github.com/graphistry/pygraphistry - notebooks, dataframes, JS, GPU, etc.
  * bio: Cytoscape
  * manual: D3.js
  * legacy: Gephi
* classic small graph analytics (CPU): https://networkx.org/
* classic big graph analytics
  * GPU: https://github.com/rapidsai/cugraph
  * graph DB/compute vendors have assorted algorithm libraries: Neo4j, TigerGraph, Neptune, ...
  * legacy: Stanford SNAP