# Graph Data Science Demo

Now that we've created our *huge* (1.7B relationships! 244M nodes!) graph projections, let's do some data science.

The point of this demo is to show that enterprise graph data science is simple, fast, and easy using GDS. We're going to take our citation network dataset and build up a quick recommendations workflow by (1) paring it down to the relevant data, (2) calculating a graph embedding to encode all the relevant topological data for each node in our graph, and then (3) building up a nearest neighbors graph - based on those embeddings - so we can find out which papers are similar based on the structure of the graph.

In the real world, you might use that similarity graph as an alternative to traditional collaborative filtering methods. It's more scalable and flexible, and can look beyond one-hop relationships. For this demo, we'll build up our graph and then take a peek at the results in Bloom.
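
To make step (3) a bit more concrete, here is a minimal pure-Python sketch of what a nearest neighbors graph over embeddings means: for each node, rank every other node by cosine similarity and keep the top `k`. The toy `embeddings` dict and the `top_k_neighbors` helper are made up for illustration. On a graph of this size you would rely on a scalable kNN implementation rather than a brute-force loop like this one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_neighbors(embeddings, k):
    """Brute-force kNN: map each node to its k most similar other nodes."""
    neighbors = {}
    for node, vec in embeddings.items():
        scored = [
            (other, cosine_similarity(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != node
        ]
        # Highest similarity first
        scored.sort(key=lambda pair: pair[1], reverse=True)
        neighbors[node] = scored[:k]
    return neighbors

# Toy 3-dimensional embeddings (illustrative values, not from the dataset)
embeddings = {
    "paper_a": [1.0, 0.0, 0.1],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
    "paper_d": [0.1, 0.9, 0.1],
}

similar = top_k_neighbors(embeddings, k=1)
for paper, matches in similar.items():
    print(paper, "->", matches[0][0])
```

Each `(node, neighbor, score)` triple would become a similarity relationship in the resulting graph.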

### Setup & Initialization

In [1]:
%%capture
pip install graphdatascience==1.1.0rc1 ipywidgets 

In [2]:
# Client import
from graphdatascience import GraphDataScience

# Replace with the actual URI, username and password
CONNECTION_URI = "neo4j+s://demo2.graphconnect.app:7687"
USERNAME = "neo4j"
with open('pass.txt', mode='r') as f:
    PASSWORD = f.readline().strip()

# Client instantiation
gds = GraphDataScience(
    CONNECTION_URI,
    auth=(USERNAME, PASSWORD)
)

### Bind the graph projection to a graph object 
The GDS Python client works with graph objects in Python. If we were constructing the graph from a Neo4j database (or a pandas DataFrame), the projection call would automatically return a graph object. Since we're using the graph that we just created with custom Arrow import code, we need to bind it to a graph object by name using `gds.graph.get`.

In [3]:
G = gds.graph.get("gcdemo")
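
Once the graph object is bound, a quick sanity check can confirm we grabbed the right projection. The sketch below assumes the graph object exposes `name()`, `node_count()` and `relationship_count()`, as the GDS Python client's graph objects do. The `summarize_graph` helper and the `StandInGraph` class are hypothetical and only exist so the snippet runs without a database connection.

```python
def summarize_graph(G):
    """Collect a few basic facts about a graph object into a dict."""
    return {
        "name": G.name(),
        "nodes": G.node_count(),
        "relationships": G.relationship_count(),
    }

class StandInGraph:
    """Hypothetical stand-in that mimics the three accessors used above."""

    def name(self):
        return "toy"

    def node_count(self):
        return 3

    def relationship_count(self):
        return 2

print(summarize_graph(StandInGraph()))
```

In the notebook itself, the call would simply be `summarize_graph(G)`.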

## Engineer FastRP Features

In [None]:
res = gds.fastRP.mutate(
    G,
    embeddingDimension=256,
    concurrency=224,
    mutateProperty="graphEmbedding"
)

res
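
For some intuition about what an embedding like this encodes, here is a heavily simplified, pure-Python sketch of the idea behind FastRP: give every node a sparse random vector, then repeatedly replace each node's vector with the normalized average of its neighbors' vectors and sum the weighted results. This is an illustration under our own assumptions, not the GDS implementation. The function names, the `iteration_weights` argument and the toy graph are all made up, and details such as degree normalization are left out.

```python
import math
import random

def random_sparse_vector(dim, rng):
    """Sparse random vector: +s or -s with probability 1/6 each, otherwise 0."""
    scale = math.sqrt(3.0)
    return [rng.choice([scale, 0.0, 0.0, 0.0, 0.0, -scale]) for _ in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def simple_fastrp(adjacency, dim, iteration_weights, seed=42):
    """Toy random-projection embedding over an adjacency dict {node: [neighbors]}."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    current = {n: random_sparse_vector(dim, rng) for n in nodes}
    embedding = {n: [0.0] * dim for n in nodes}
    for weight in iteration_weights:
        propagated = {}
        for n in nodes:
            nbrs = adjacency[n]
            if nbrs:
                # Average the neighbors' vectors from the previous iteration
                avg = [sum(current[m][i] for m in nbrs) / len(nbrs) for i in range(dim)]
            else:
                avg = [0.0] * dim
            propagated[n] = l2_normalize(avg)
        for n in nodes:
            # Accumulate this iteration's contribution into the final embedding
            embedding[n] = [e + weight * v for e, v in zip(embedding[n], propagated[n])]
        current = propagated
    return embedding

# Toy undirected graph: "a" and "b" have exactly the same neighbors
adjacency = {
    "a": ["x", "y"],
    "b": ["x", "y"],
    "c": ["z"],
    "x": ["a", "b"],
    "y": ["a", "b"],
    "z": ["c"],
}

emb = simple_fastrp(adjacency, dim=8, iteration_weights=[1.0, 1.0], seed=7)
print(emb["a"] == emb["b"])  # True: identical neighborhoods give identical vectors
```

Nodes that sit in similar neighborhoods end up with similar vectors, which is what makes the embedding useful as a feature for the classifier later on.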

## Export Labeled Papers with FastRP Features

In [22]:
# Start with a subgraph projection: keep only Paper nodes that have a label (flag >= 0)
g_labeled, res = gds.beta.graph.project.subgraph(
    'labledProjection',
    G,
    'n:Paper AND (n.flag >= 0)',
    '*',
    concurrency=224
)

res

fromGraphName                            gcdemo
nodeFilter            n:Paper AND (n.flag >= 0)
relationshipFilter                            *
graphName                      labledProjection
nodeCount                               1251341
relationshipCount                       4035688
projectMillis                             15600
Name: 0, dtype: object

In [4]:
import time
import neo4j_arrow as na

In [5]:
with open('pass.txt', mode='r') as f:
    password = f.readline().strip()

# Arrow client pointed at the subgraph projection created above
client = na.Neo4jArrowClient(
    'demo2.graphconnect.app',
    graph="labledProjection",
    password=password,
    concurrency=224
)

In [6]:
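# Stream node properties from the projection in chunks, converting each chunk to pandas
dfs = []
for chunk in client.read_nodes(["graphEmbedding", "flag", "years"]):
    dfs.append(chunk.to_pandas())
dfs[0]  # inspect the first chunk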

Unnamed: 0,nodeId,graphEmbedding,flag,years
0,255702663,"[0.018323343, 0.012007594, -0.25506234, -0.053...",28,2018
1,255702719,"[-0.06454927, 0.05650944, 0.09837681, 0.013585...",60,2018
2,255702776,"[-0.13224208, -0.010649643, -0.14354382, 0.047...",141,2015
3,255702981,"[-0.062156837, 0.10611194, 0.20513633, -0.1482...",43,2011
4,255702983,"[-0.08450825, -0.05039478, -0.1359004, 0.02768...",141,2016
...,...,...,...,...
1336,255969052,"[0.1873631, -0.16102183, -0.0050299345, -0.030...",42,2018
1337,255969109,"[0.10000002, 0.0005037263, 0.10000002, 0.10000...",65,2017
1338,255969220,"[0.056905545, -0.07888851, 0.005949719, -0.082...",72,2019
1339,255969263,"[-0.16350809, -0.13330199, -0.00085447147, -0....",142,2019


In [7]:
import pandas as pd
df = pd.concat(dfs)
df

Unnamed: 0,nodeId,graphEmbedding,flag,years
0,255702663,"[0.018323343, 0.012007594, -0.25506234, -0.053...",28,2018
1,255702719,"[-0.06454927, 0.05650944, 0.09837681, 0.013585...",60,2018
2,255702776,"[-0.13224208, -0.010649643, -0.14354382, 0.047...",141,2015
3,255702981,"[-0.062156837, 0.10611194, 0.20513633, -0.1482...",43,2011
4,255702983,"[-0.08450825, -0.05039478, -0.1359004, 0.02768...",141,2016
...,...,...,...,...
9995,173135179,"[0.06877658, -0.117199585, 0.08505142, 0.04497...",111,2003
9996,173135185,"[-0.045556977, -0.18906558, -0.037356168, -0.0...",62,2012
9997,173135215,"[-0.24145196, -0.07898462, 0.18446049, 0.00257...",33,2015
9998,173135764,"[0.010843396, -0.11195288, -0.0343456, -0.1059...",54,2007


## Train Neural Network

In [8]:
# TensorFlow and tf.keras
import tensorflow as tf

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)

2.8.1


In [9]:
# Temporal split: train on papers from before 2019, test on papers from 2019 onward
df_train = df[df.years < 2019]
df_test = df[df.years >= 2019]

In [10]:
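# Labels are the integer class ids stored in `flag`
y_train = df_train.flag
y_test = df_test.flag
# Stack the per-row embedding lists into 2-D arrays of shape (n_samples, embedding_dim)
X_train = np.stack(df_train.graphEmbedding, axis=0)
X_test = np.stack(df_test.graphEmbedding, axis=0)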

In [11]:
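model = tf.keras.Sequential([
    # Linear layer (no activation) matching the 256-dimensional FastRP embedding
    tf.keras.layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(0.0001)),
    tf.keras.layers.Dense(180, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.0001)),
    # One output logit per class; the loss applies softmax when from_logits=True
    tf.keras.layers.Dense(153)
])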

2022-06-11 01:22:49.897366: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


In [12]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
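
Since the model outputs raw logits and the loss is configured with `from_logits=True`, it can help to see what that loss computes for a single example. The following is a pure-Python sketch of softmax followed by the negative log-likelihood of the true class. The helper names and the toy logits are ours, and this is not how TensorFlow implements it internally.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability assigned to the integer class `label`."""
    return -math.log(softmax(logits)[label])

# Toy example with three classes; the true class is 0
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
loss = sparse_categorical_crossentropy(logits, label=0)
print(probs, loss)

# A model that scores all 153 classes equally pays ln(153), roughly 5.03, per example
uninformed_loss = sparse_categorical_crossentropy([0.0] * 153, label=7)
print(uninformed_loss)
```

That `ln(153)` figure is a handy reference point when reading the loss values reported during training and evaluation.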

In [13]:
model.fit(X_train, y_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fc3e562f090>

In [14]:
test_loss, test_acc = model.evaluate(X_test,  y_test, verbose=2)

print('\nTest accuracy:', test_acc)

4343/4343 - 11s - loss: 3.1525 - accuracy: 0.3464 - 11s/epoch - 2ms/step

Test accuracy: 0.3463860750198364
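
To put that accuracy in context, it helps to compare it against trivial baselines. With 153 output classes, uniform random guessing would be right about 1/153 of the time (roughly 0.0065). A slightly stronger baseline always predicts the most common training label. Here is a small sketch of that second baseline; the `majority_class_accuracy` helper and the toy label lists are made up for illustration.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / len(test_labels)

# Toy labels (illustrative only)
toy_train = [28, 60, 141, 43, 141]
toy_test = [141, 42, 65, 141]

label, acc = majority_class_accuracy(toy_train, toy_test)
print(label, acc)  # 141 0.5

# Expected accuracy of uniform random guessing over 153 classes
uniform_chance = 1 / 153
print(round(uniform_chance, 4))  # 0.0065
```

In the notebook, calling `majority_class_accuracy(y_train, y_test)` would give the number to beat for the real data.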


In [15]:
model.save('simple-paper-classifier-2')

2022-06-11 01:33:49.461396: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: simple-paper-classifier-2/assets
