# Graph Intelligence Hands-On Workshop

Most data has a useful graph perspective
* Graph -> graph
* Text -> knowledge graph
* Events, logs, tables -> Hypergraph
* Feature vectors & embeddings -> TDA
* Correlations & matrices -> weighted graph

In [1]:
! pip install --user graphistry

Collecting graphistry
  Downloading graphistry-0.20.2-py3-none-any.whl (88 kB)
Collecting pyarrow>=0.15.0
  Downloading pyarrow-6.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.6 MB)
Installing collected packages: pyarrow, graphistry
Successfully installed graphistry-0.20.2 pyarrow-6.0.1


In [2]:
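# cudf ships with RAPIDS and is not installed by `pip install graphistry`
import cudf, graphistry, json, pandas as pd

# Read Graphistry credentials from a local JSON file, then register
with open('./graphistry_creds.json') as f:
    graphistry_creds = json.load(f)
graphistry.register(api=3, **graphistry_creds)

graphistry_creds.keys()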

ModuleNotFoundError: No module named 'cudf'

# 1. Natural graphs

## Ex: Social network of (person)-[friend]->(person)
* What are the communities?
* Who is central? Who bridges them?
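
A minimal sketch of how these questions can be phrased computationally, using only the Python standard library on a hypothetical toy friendship graph (the names and edges are made up for illustration). It scores centrality by degree and finds bridging people by brute-force removal; libraries such as networkx or cugraph offer richer and faster measures for real data.

```python
from collections import defaultdict, deque

# Hypothetical toy friendship edges: two tight groups joined through "eve"
edges = [
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),   # group 1
    ("dan", "fay"), ("fay", "gus"), ("dan", "gus"),   # group 2
    ("cat", "eve"), ("eve", "dan"),                   # eve links the groups
]

# Undirected adjacency sets
adj = defaultdict(set)
for s, d in edges:
    adj[s].add(d)
    adj[d].add(s)

def reachable(start, removed=None):
    """Return the nodes reachable from `start` by BFS, ignoring `removed`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr != removed and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

# Degree centrality: fraction of the other people each person is tied to
n = len(adj)
degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

# Bridging people (articulation points): removing one disconnects the rest
bridges = set()
for node in adj:
    others = [x for x in adj if x != node]
    if len(reachable(others[0], removed=node)) < len(others):
        bridges.add(node)

print(sorted(bridges))
```

In this toy graph, the people flagged as bridges are exactly those whose removal splits the two friend groups apart.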

### ... Often Property Graph: Attributes on nodes + edges

Also: Retweet graph, hashtag interaction graph, ...

In [3]:
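# The edge list has no header row, so name the columns explicitly;
# otherwise pandas consumes the first edge (0, 1) as column names
fb_df = pd.read_csv(
    'https://raw.githubusercontent.com/graphistry/pygraphistry/master/demos/data/facebook_combined.txt',
    sep=' ', header=None, names=['s', 'd'])

# cudf moves the edges to the GPU; a plain pandas dataframe also works here
graphistry.edges(cudf.from_pandas(fb_df), 's', 'd').plot()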

### Property graphs everywhere:

* Security trails (logs)
* Customer journey
* Fraud: Payments, accounts, ownership
* Supply chain & logistics
* Finance: Derivatives, trades, ...
* Communications

### ... because Events & Logs!



## Ex: Energy & Telco

Energy grid, with nodes colored by usage, alerts, ... 

Fun story: A national telco using Graphistry saw that many of its customers appeared disconnected from the grid. That revealed a data quality bug that had gone unnoticed for over a year!

=> Viz is important for both data scientists + end users!

# 2. Text
## Ex: Knowledge graph of (noun)-[relation]->(noun)
* Nodes: Extract entities
* Edges: Link based on relationship
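
A minimal sketch of the two steps above, assuming an upstream extractor has already produced (subject, relation, object) triples. The triples, and the `src`/`dst`/`relation` field names, are hypothetical and chosen only for illustration.

```python
# Hypothetical triples, as an entity/relation extractor might emit them
triples = [
    ("Ada Lovelace", "worked_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("Ada Lovelace", "wrote_about", "Analytical Engine"),
]

# Nodes: one entry per distinct entity, in first-seen order
nodes = list(dict.fromkeys(e for s, _, o in triples for e in (s, o)))

# Edges: one record per triple, with the relation kept as an edge attribute
edges = [{"src": s, "dst": o, "relation": r} for s, r, o in triples]

print(nodes)
print(edges)
```

Records shaped like `edges` can be loaded into a dataframe and bound as a graph in the same way as the edge lists elsewhere in this notebook.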

### Super popular!
* RDF triple stores initially popular here

![diffbot knowledge graph](diffbot.png)

* Text is increasingly machine-understandable with many off-the-shelf ML algorithms: sentiment, ...
* In real use cases, often need to *ground* entities with internal data: products, scores, ... <= knowledge graph!

* Knowledge graph makes data *accessible* for both *machines* and *people*

# 3. Table: Records, Logs, Events, ...
## Ex: FTP access logs
## ... Via Hypergraph transform: Each event links multiple entities

### Growing in popularity: Security, fraud, and especially recommendations via graph neural nets

* Data traditionally stored in log or relational DB for text search
* Graph questions: Progression, relationships, connectivity, anomaly, ...
* ETL (extract/transform/load): 
  * Training on bulk extracts & inference on localized events
  * Use separate compute engine:
    * small (networkx: < 1MB), medium (cugraph: < 10GB), big (spark, dask-cudf: 1TB+)
    * graph-native (networkx, cugraph, cugraph-on-dask): optimized
    * tabular underneath (graphx-on-spark): availability, scale ease...

In [None]:
df = pd.read_csv('http://www.secrepo.com/maccdc2012/ftp.log.gz', sep='\t',
                 names=['ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 'user', 'password',
                        'command', 'arg', 'mime_type', 'file_size', 'reply_code', 'reply_msg', 'passive',
                        'orig_h', 'resp_h', 'resp_p', 'fuid'])

In [7]:
graphistry.edges(df, 'orig_h', 'user').plot()

In [4]:
g = graphistry.hypergraph(
    df,
    ['uid', 'orig_h', 'resp_h', 'user', 'password', 'arg', 'fuid'],
    opts={
        'CATEGORIES': {
            'ip': ['orig_h', 'resp_h']
        }
    })['graph']
g.encode_point_color('category',
                     categorical_mapping={
                         'event': 'white',
                         'uid': 'gray',
                         'ip': 'blue',
                         'user': 'green',
                         'password': 'orange',
                         'arg': 'red',
                         'fuid': 'blue'
                     }).plot()

# links 40572
# events 5796
# attrib entities 2197
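
The counts printed above fit the idea that each event links multiple entities: 5796 events × 7 selected columns = 40572 links. The following is a simplified sketch of that idea on hypothetical toy rows, not the library's actual implementation; the `column::value` node naming is an illustrative assumption.

```python
# Hypothetical log rows; None marks a missing value
rows = [
    {"orig_h": "10.0.0.1", "user": "alice", "arg": "/a.txt"},
    {"orig_h": "10.0.0.1", "user": "bob", "arg": "/b.txt"},
    {"orig_h": "10.0.0.2", "user": "alice", "arg": None},
]
entity_cols = ["orig_h", "user", "arg"]

# One node per event (row)
event_nodes = [f"event::{i}" for i in range(len(rows))]

# One node per distinct (column, value) pair, and one link per
# non-missing cell, connecting the row's event to that entity
entity_nodes = set()
links = []
for i, row in enumerate(rows):
    for col in entity_cols:
        value = row[col]
        if value is None:
            continue
        entity = f"{col}::{value}"
        entity_nodes.add(entity)
        links.append((f"event::{i}", entity, col))

print(len(links), len(event_nodes), len(entity_nodes))
```

Because repeated values such as the same host or user collapse into a single entity node, events that share them become connected through it.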


# 4. Matrix

## Ex: P(a,b): (a)-[P(a,b)]->(b)
## ... Demo: Survey for P(Programming Language | Reason to use)

* Indicator matrix X has a "1" at X_i,j when there is edge i->j, otherwise 0 (no edge)
* Weighted matrix W has non-zero weight W_i,j for weighted edge i->j
* ... Ex: P(x,y), P(x|y), ...
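
As a rough sketch of the matrix-to-graph step, assuming hypothetical survey counts (the reasons, languages, and numbers below are invented), conditional probabilities can be computed per reason and emitted as weighted edges. Pointing edges from reason to language is one possible convention.

```python
from collections import defaultdict

# Hypothetical survey counts: (reason to use, language) -> respondents
counts = {
    ("speed", "C"): 3,
    ("speed", "Python"): 1,
    ("ease", "Python"): 4,
}

# Total respondents per reason: the denominator of P(language | reason)
reason_totals = defaultdict(int)
for (reason, _), n in counts.items():
    reason_totals[reason] += n

# Weighted edges (reason)-[P(language | reason)]->(language)
edges = [
    (reason, language, n / reason_totals[reason])
    for (reason, language), n in counts.items()
]

# Indicator view of the same graph: keep only which pairs are linked
indicator_edges = {(src, dst) for src, dst, w in edges if w > 0}

print(edges)
```

Pairs with zero weight never appear in `edges`, which is the sparse-graph counterpart of a 0 entry in the matrix.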

In [9]:
%%html
<img src="https://www.researchgate.net/profile/Lagerstroem-Robert/publication/42804459/figure/fig1/AS:394377454211078@1471038320778/A-conditional-probability-matrix.png"/>

In [5]:
%%html
<iframe src="https://hub.graphistry.com/graph/graph.html?&dataset=PyGraphistry/PC7D53HHS5" width="100%" height="600"></iframe>

# 5. TDA for explaining ML vectors/embedding/scores

## K-nearest-neighbors for understanding an embedding
## Ex: Embeddings & scores everywhere
## Demo - COVID misinformation embedding space
* Nodes: Account
* Initial layout: X/Y position based on topics discussed (UMAP)
* Edges: Connects similar Twitter accounts based on embedding

Core to most modern machine learning algorithms is mapping input data to feature vectors and denser / lower-dimensional embeddings. We can connect similar embeddings to understand how the model "sees" the data, which helps with data cleaning, featurization, debugging models, and interpreting the results.
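
A minimal sketch of connecting similar embeddings with k-nearest-neighbors, using only the standard library. The account names, 2-D vectors, and the choice of Euclidean distance with `k = 1` are assumptions for illustration; real embeddings have many more dimensions and usually call for an approximate nearest-neighbor library.

```python
import math

# Hypothetical 2-D embeddings for five accounts, forming two loose clusters
embeddings = {
    "acct_a": (0.0, 0.0),
    "acct_b": (0.1, 0.0),
    "acct_c": (0.0, 0.2),
    "acct_d": (5.0, 5.0),
    "acct_e": (5.1, 5.0),
}
k = 1  # number of neighbors to link from each account

# For every account, link it to its k closest accounts by Euclidean distance
edges = []
for name, vec in embeddings.items():
    dists = sorted(
        (math.dist(vec, other_vec), other)
        for other, other_vec in embeddings.items()
        if other != name
    )
    for dist, other in dists[:k]:
        edges.append((name, other, dist))

for src, dst, dist in edges:
    print(f"{src} -> {dst} ({dist:.2f})")
```

The resulting edges stay within each cluster, which is the kind of structure the graph view makes visible.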

### ... Powerful when combined with coloring embedded samples by scores!

In [8]:
%%html
<iframe src="https://hub.graphistry.com/graph/graph.html?dataset=6fbdc5fb9ca64f37ade8a7a5ccb337f0&strongGravity=true&play=0" width="100%" height="600"></iframe>

## Ex: Topological data analysis
TDA takes point data (metrics, events, ...) and recovers relations. Ex: camera point clouds => shapes 

## Ex: Causal graphs, Bayesian factor trees/graphs, & time series correlation

Pattern mining often comes down to inferring probabilities across items: wide world!
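
For the time series case, one simple sketch is to link any two series whose Pearson correlation is strong. The metric names, values, and the 0.9 threshold are hypothetical, and correlation alone does not establish causation.

```python
import math
from itertools import combinations

# Hypothetical metrics sampled at the same five time steps
series = {
    "cpu":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "mem":   [2.0, 4.0, 6.0, 8.0, 10.0],   # rises with cpu
    "disk":  [5.0, 4.0, 3.0, 2.0, 1.0],    # falls as cpu rises
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],  # unrelated to the others
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

threshold = 0.9  # assumed cutoff for a "strong" relationship

# Weighted edge for each pair whose |correlation| clears the threshold
edges = []
for a, b in combinations(series, 2):
    r = pearson(series[a], series[b])
    if abs(r) >= threshold:
        edges.append((a, b, r))

for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:+.2f}")
```

Keeping the signed correlation as the edge weight lets a visualization distinguish metrics that move together from those that move in opposite directions.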


# Additional domains

Not mentioned above:

### Bio: Genomics, protein networks

### Citation graphs

### Trust networks

# Emerging Graph Stacks: Graph ML, Graph AI, Graph GPU, and combined Graph Intelligence

* Graph tech is gaining adoption, but teams differ widely in how mature their use of it is
* Many choices: DB vs Compute, CPU vs GPU, BI/Viz/AI/...

## Graph ML stack
![Graph ML stack](graph_arch1.png)

## Graph AI stack
![Graph AI stack](graph_arch2.png)

# GPUs

## Why GPUs

Optimized for GB/s thinking: Scale, latency, and cost. Pricing competitiveness for general data tasks is quite recent.

![AWS GPU Pricing Improvements by Year](gpu_price.png)





## GPU stack (RAPIDS)

Largely the same APIs as Python data tools (pandas => cudf, networkx => cugraph, dask => dask_cudf)

If you know PyData, you've already done the hardest part of learning GPU PyData!
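
A small sketch of what the shared API means in practice, assuming pandas is installed and cudf may or may not be: the same dataframe code runs against either module. The alias `xdf` and the toy edge list are illustrative assumptions.

```python
# Use the GPU dataframe library when available, else fall back to pandas
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Hypothetical toy edge list
edges = xdf.DataFrame({"s": ["a", "a", "b", "c"], "d": ["b", "c", "c", "a"]})

# Identical dataframe calls on either backend: out-degree per source node
out_degree = edges.groupby("s")["d"].count()
max_out_degree = int(out_degree.max())

print(max_out_degree)
```

Not every pandas feature has a cudf equivalent, so a fallback like this works best for common operations such as the group-by above.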

![RAPIDS stack](rapids.png)

# GPU Graph Intelligence Stack
![GPU Graph Intelligence Stack](graph_stack.png)
