# Loading data into StellarGraph from Neo4j

> This demo explains how to load data from Neo4j into a form that can be used by the StellarGraph library. [See all other demos](../README.md).

[The StellarGraph library](https://github.com/stellargraph/stellargraph) supports loading graph information from Neo4j. [Neo4j](https://en.wikipedia.org/wiki/Neo4j) is a graph database. If your data is already in Neo4j, this is a great way to load it. If not, [loading via Pandas](loading-pandas.ipynb) is likely to be faster and potentially more convenient.

The `StellarGraph` class is available at the top level of the `stellargraph` library:

In [1]:
from stellargraph import StellarGraph

## Dataset

We'll be working with a graph representing a square with a diagonal. We'll give node `a` the label `foo` and the other nodes the label `bar`, and give every node some features. We'll also give each edge a weight and a label matching its orientation.

```
a -- b
| \  |
|  \ |
d -- c
```

We describe the example data with the types from `py2neo` so that we can seed our Neo4j instance with it.

In [183]:
from py2neo.data import Node, Relationship, Subgraph

a = Node("foo", name="a", top=True, left=True, foo_numbers=[0.1, 0.2, 0.3])
b = Node("bar", name="b", top=True, left=False, bar_numbers=[1, -2])
c = Node("bar", name="c", top=False, left=False, bar_numbers=[34, 5.6])
d = Node("bar", name="d", top=False, left=True, bar_numbers=[0.7, -98])

ab = Relationship(a, "horizontal", b, weight=1.0)
bc = Relationship(b, "vertical", c, weight=0.2)
cd = Relationship(c, "horizontal", d, weight=3.4)
da = Relationship(d, "vertical", a, weight=5.67)
ac = Relationship(a, "diagonal", c, weight=1.0)

subgraph = Subgraph([a, b, c, d], [ab, bc, cd, da, ac])
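
For reference, the same square can be written down as plain Python data, which can be handy for checking the query results later in this demo. This is only a sketch: the names `expected_nodes` and `expected_edges` are made up for illustration and are not part of py2neo or StellarGraph.

```python
# Hypothetical plain-Python description of the example graph (illustration only).
# Node name -> (label, top, left)
expected_nodes = {
    "a": ("foo", True, True),
    "b": ("bar", True, False),
    "c": ("bar", False, False),
    "d": ("bar", False, True),
}

# (source, target, relationship type, weight)
expected_edges = [
    ("a", "b", "horizontal", 1.0),
    ("b", "c", "vertical", 0.2),
    ("c", "d", "horizontal", 3.4),
    ("d", "a", "vertical", 5.67),
    ("a", "c", "diagonal", 1.0),
]

# Every edge should connect two known nodes
assert all(s in expected_nodes and t in expected_nodes for s, t, _, _ in expected_edges)
print(len(expected_nodes), "nodes,", len(expected_edges), "edges")
```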

## Connecting to Neo4j

To read anything from Neo4j, we'll need a connection to a running instance.

In [184]:
import os
import py2neo

default_host = os.environ.get("STELLARGRAPH_NEO4J_HOST")

# Create the Neo4j Graph database object; edit the arguments to specify the location and authentication details
neo4j_graph = py2neo.Graph(host=default_host, port=None, user=None, password=None)
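
One possible way to avoid hard-coding the connection arguments is to collect them all from the environment. The sketch below assumes a helper named `neo4j_settings` and three extra environment variables (`STELLARGRAPH_NEO4J_PORT`, `STELLARGRAPH_NEO4J_USER`, `STELLARGRAPH_NEO4J_PASSWORD`); all of these are hypothetical and only `STELLARGRAPH_NEO4J_HOST` appears in this demo.

```python
import os


def neo4j_settings(environ=os.environ):
    """Collect connection settings, leaving anything that is unset as None."""
    port = environ.get("STELLARGRAPH_NEO4J_PORT")  # hypothetical variable name
    return {
        "host": environ.get("STELLARGRAPH_NEO4J_HOST"),
        "port": int(port) if port is not None else None,
        "user": environ.get("STELLARGRAPH_NEO4J_USER"),  # hypothetical variable name
        "password": environ.get("STELLARGRAPH_NEO4J_PASSWORD"),  # hypothetical variable name
    }


# Example with an explicit mapping instead of the real environment
settings = neo4j_settings(
    {"STELLARGRAPH_NEO4J_HOST": "localhost", "STELLARGRAPH_NEO4J_PORT": "7687"}
)
print(settings)
```

The resulting dictionary uses the same keyword names as the call above, so it could be passed on as `py2neo.Graph(**settings)`.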

Just to be sure we're not overwriting important data, let's make sure the database is empty.

In [185]:
num_nodes = len(neo4j_graph.nodes)
num_relationships = len(neo4j_graph.relationships)
if num_nodes > 0 or num_relationships > 0:
    raise ValueError(
        f"neo4j_graphdb: expected an empty database to give a reliable result and so mutations do not corrupt your data, found {num_nodes} nodes and {num_relationships} relationships in the database already. "
        "Please clear this instance of Neo4j or start a new instance (and edit the arguments above to connect to it)"
    )

Finally, we can create our example data.

In [186]:
neo4j_graph.create(subgraph)

# check that we wrote as many nodes and relationships as expected
assert len(neo4j_graph.nodes) == 4
assert len(neo4j_graph.relationships) == 5

## Loading all data

We're going to load the data by creating Pandas DataFrames first. To do that, we'll need to import Pandas.

In [112]:
import pandas as pd

### Homogeneous graph without features (edges only)

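For problems with only edges, and no node features or edge weights, the process is simple: a single Cypher query that returns the source and target ID of each relationship is enough. The resulting DataFrame can be passed directly as the `edges` argument.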

In [237]:
edges = neo4j_graph.run(
    """
    MATCH (s) --> (t) 
    RETURN id(s) AS source, id(t) AS target
    """
).to_data_frame()
edges.head()

| | source | target |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 2 | 1 |
| 2 | 3 | 2 |
| 3 | 0 | 2 |
| 4 | 1 | 3 |
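
To experiment without a running Neo4j instance, a DataFrame of the same shape can be written by hand. The sketch below copies the IDs from the query result above; the name `edges_by_hand` is made up for illustration. It relies on `source` and `target` being the column names that StellarGraph looks for by default.

```python
import pandas as pd

# Hand-written stand-in for the query result above (IDs copied from that output)
edges_by_hand = pd.DataFrame(
    {"source": [3, 2, 3, 0, 1], "target": [0, 1, 2, 2, 3]}
)

# With no node DataFrame, the nodes are whatever IDs appear in either column
node_ids = sorted(
    set(edges_by_hand["source"].tolist()) | set(edges_by_hand["target"].tolist())
)
print(node_ids)
```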


In [238]:
edges_only = StellarGraph(edges=edges)
print(edges_only.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]


### Homogeneous graph with features

Node features come from node properties. Here we query the boolean `top` and `left` properties of every node and index the resulting DataFrame by node ID, so that its index matches the `source` and `target` values of the edges.



In [239]:
raw_nodes = neo4j_graph.run(
    """
    MATCH (n) 
    RETURN id(n) AS id, n.top, n.left
    """
).to_data_frame()

homogeneous_nodes = raw_nodes.set_index("id")
homogeneous_nodes

| id | n.top | n.left |
|---|---|---|
| 0 | True | False |
| 1 | False | True |
| 2 | False | False |
| 3 | True | True |


In [214]:
homogeneous = StellarGraph(homogeneous_nodes, edges)
print(homogeneous.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
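
The `info()` output above reports the two boolean properties as a `float32` vector of length 2. The sketch below shows what an equivalent conversion looks like in plain Pandas, using a hand-written copy of the node table; it assumes StellarGraph does something along these lines internally, and is not a description of its actual implementation.

```python
import pandas as pd

# Hand-written copy of the node features queried above
features = pd.DataFrame(
    {"n.top": [True, False, False, True], "n.left": [False, True, False, True]},
    index=pd.Index([0, 1, 2, 3], name="id"),
)

# True becomes 1.0 and False becomes 0.0
as_floats = features.astype("float32")
print(as_floats)
```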


### Homogeneous graph with edge weights

Edge weights are read from relationship properties. Here the query also returns the `weight` property of each relationship, in a column named `weight`.

In [240]:
weighted_edges = neo4j_graph.run(
    """
    MATCH (s) -[r]-> (t) 
    RETURN id(s) AS source, id(t) AS target, r.weight AS weight
    """
).to_data_frame()
weighted_edges

| | source | target | weight |
|---|---|---|---|
| 0 | 3 | 0 | 1.0 |
| 1 | 2 | 1 | 3.4 |
| 2 | 3 | 2 | 1.0 |
| 3 | 0 | 2 | 0.2 |
| 4 | 1 | 3 | 5.67 |
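
The `AS weight` alias in the query matters: without it the column would presumably be named `r.weight`, in the same way the node query earlier produced `n.top` and `n.left`. As far as we are aware, StellarGraph looks for weights in a column named `weight` by default (its `edge_weight_column` parameter). A sketch of renaming the column afterwards in Pandas, with hand-written data standing in for an un-aliased query result:

```python
import pandas as pd

# Hypothetical result of a query that returned r.weight without an alias
raw_weighted = pd.DataFrame(
    {"source": [3, 2], "target": [0, 1], "r.weight": [1.0, 3.4]}
)

# Rename to the column name that the edge weights are expected under
renamed = raw_weighted.rename(columns={"r.weight": "weight"})
print(renamed.columns.tolist())
```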


In [241]:
weighted_homogeneous = StellarGraph(homogeneous_nodes, weighted_edges)
print(weighted_homogeneous.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]


### Directed graphs

The graphs above are undirected, as their `info()` output shows. To preserve the direction of the Neo4j relationships, use `StellarDiGraph` with the same DataFrames.

In [242]:
from stellargraph import StellarDiGraph
directed_weighted_homogeneous = StellarDiGraph(homogeneous_nodes, weighted_edges)
print(directed_weighted_homogeneous.info())

StellarDiGraph: Directed multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
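
To make the difference between the two kinds of graph concrete, the sketch below works out the neighbours of a node by hand from the edge list queried earlier. It uses plain Python rather than the StellarGraph API, and the function names are made up for illustration.

```python
# (source, target) pairs copied from the edges queried earlier
edge_list = [(3, 0), (2, 1), (3, 2), (0, 2), (1, 3)]


def out_neighbours(node):
    """Nodes reached by following an edge forwards from `node`."""
    return sorted(t for s, t in edge_list if s == node)


def in_neighbours(node):
    """Nodes that have an edge pointing at `node`."""
    return sorted(s for s, t in edge_list if t == node)


def undirected_neighbours(node):
    """Neighbours when the edge direction is ignored."""
    return sorted(set(out_neighbours(node)) | set(in_neighbours(node)))


print(out_neighbours(0), in_neighbours(0), undirected_neighbours(0))
```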


### Heterogeneous graphs

#### Multiple node types

Neo4j node labels map naturally to StellarGraph node types: we query each label separately and build one DataFrame per type. The list-valued `foo_numbers` and `bar_numbers` properties are expanded so that each list element becomes one feature column.

In [261]:
raw_foo_nodes = neo4j_graph.run(
    """
    MATCH (n:foo) 
    RETURN id(n) AS id, n.foo_numbers AS numbers
    """
).to_data_frame()

foo_nodes = pd.DataFrame(
    raw_foo_nodes["numbers"].tolist(),
    index=raw_foo_nodes["id"]
)
foo_nodes

| id | 0 | 1 | 2 |
|---|---|---|---|
| 3 | 0.1 | 0.2 | 0.3 |


In [260]:
raw_bar_nodes = neo4j_graph.run(
    """
    MATCH (n:bar) 
    RETURN id(n) AS id, n.bar_numbers AS numbers
    """
).to_data_frame()

bar_nodes = pd.DataFrame(
    raw_bar_nodes["numbers"].tolist(),
    index=raw_bar_nodes["id"]
)
bar_nodes

| id | 0 | 1 |
|---|---|---|
| 0 | 1.0 | -2.0 |
| 1 | 0.7 | -98.0 |
| 2 | 34.0 | 5.6 |
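
Both cells above rely on the same Pandas idiom for turning a column of lists into one column per element. The sketch below isolates that step, using a hand-written DataFrame in place of the query result; the variable names are illustrative only.

```python
import pandas as pd

# Hand-written stand-in for the result of the bar query: one list per node
raw = pd.DataFrame(
    {"id": [0, 1, 2], "numbers": [[1, -2], [0.7, -98], [34, 5.6]]}
)

# Each list becomes a row, and each list position becomes a column (0, 1, ...);
# the node IDs are reused as the index
expanded = pd.DataFrame(raw["numbers"].tolist(), index=raw["id"])
print(expanded)
```

This only works cleanly when every node of a given type has a list of the same length; shorter lists would leave missing values in the trailing columns.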


In [245]:
heterogeneous_nodes = StellarGraph({"foo": foo_nodes, "bar": bar_nodes}, edges)
print(heterogeneous_nodes.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float32 vector, length 2
    Edge types: bar-default->bar, bar-default->foo
  foo: [1]
    Features: float32 vector, length 3
    Edge types: foo-default->bar

 Edge types:
    foo-default->bar: [2]
    bar-default->bar: [2]
    bar-default->foo: [1]


#### Multiple edge types

Similarly, the relationship type can serve as the edge type: the query returns it with `TYPE(r)`, and the edges are then split into one DataFrame per type.

In [247]:
neo4j_relationships = neo4j_graph.relationships.match()

labelled_edges = neo4j_graph.run(
    """
    MATCH (s) -[r]-> (t) 
    RETURN id(s) AS source, id(t) AS target, TYPE(r) AS label
    """
).to_data_frame()

labelled_edges

| | source | target | label |
|---|---|---|---|
| 0 | 3 | 0 | horizontal |
| 1 | 2 | 1 | horizontal |
| 2 | 3 | 2 | diagonal |
| 3 | 0 | 2 | vertical |
| 4 | 1 | 3 | vertical |


In [248]:
# FIXME https://github.com/stellargraph/stellargraph/issues/1183
# Split the edges into one DataFrame per relationship type, dropping the now-redundant label column
grouped = {name: df.drop(columns="label") for name, df in labelled_edges.groupby("label")}
heterogeneous_everything = StellarGraph({"foo": foo_nodes, "bar": bar_nodes}, grouped)
print(heterogeneous_everything.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float32 vector, length 2
    Edge types: bar-diagonal->foo, bar-horizontal->bar, bar-horizontal->foo, bar-vertical->bar, bar-vertical->foo
  foo: [1]
    Features: float32 vector, length 3
    Edge types: foo-diagonal->bar, foo-horizontal->bar, foo-vertical->bar

 Edge types:
    foo-horizontal->bar: [1]
    foo-diagonal->bar: [1]
    bar-vertical->foo: [1]
    bar-vertical->bar: [1]
    bar-horizontal->bar: [1]
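
The grouping step can be tried on its own, without Neo4j or StellarGraph. The sketch below applies the same `groupby` expression to a hand-written copy of the labelled edges; the names `labelled` and `by_type` are made up for illustration.

```python
import pandas as pd

# Hand-written copy of the labelled edges queried above
labelled = pd.DataFrame(
    {
        "source": [3, 2, 3, 0, 1],
        "target": [0, 1, 2, 2, 3],
        "label": ["horizontal", "horizontal", "diagonal", "vertical", "vertical"],
    }
)

# One DataFrame per label, keyed by the label, with the label column removed
by_type = {name: df.drop(columns="label") for name, df in labelled.groupby("label")}

for name, df in by_type.items():
    print(name, len(df))
```

Each DataFrame keeps the original row index, so no two edge types share an edge ID.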


## Subgraphs

Cypher queries can also select just part of the database, by filtering nodes and relationships with `WHERE` clauses, and can compute features as part of the query. The node query below uses `apoc.coll.sum`, which requires the APOC plugin to be installed in Neo4j.

In [279]:
raw_subgraph_nodes = neo4j_graph.run(
    """
    MATCH (n) 
    WHERE n.left 
    RETURN id(n) AS id, n.top, apoc.coll.sum(coalesce(n.bar_numbers, n.foo_numbers))
    """
).to_data_frame()

subgraph_nodes = raw_subgraph_nodes.set_index("id")
subgraph_nodes

| id | n.top | apoc.coll.sum(coalesce(n.bar_numbers, n.foo_numbers)) |
|---|---|---|
| 1 | False | -97.3 |
| 3 | True | 0.6 |


In [265]:
subgraph_edges = neo4j_graph.run(
    """
    MATCH (s:bar) -[r]-> (t:bar)
    WHERE r.weight > 1
    RETURN id(s) AS source, id(t) AS target
    """
).to_data_frame()

subgraph_edges

| | source | target |
|---|---|---|
| 0 | 2 | 1 |
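
Because the node query and the edge query above use different filters, an edge can refer to a node that the node query left out: here the edge from 2 to 1 has source 2, which is not among the selected nodes 1 and 3. The sketch below shows one way to keep only the edges whose endpoints are both present, assuming StellarGraph expects every edge endpoint to appear in the node data. The DataFrames are hand-written stand-ins, and the second edge (1 to 3) is added purely for illustration.

```python
import pandas as pd

# Stand-in for the selected nodes (IDs 1 and 3, as in the node query above)
nodes = pd.DataFrame({"n.top": [False, True]}, index=pd.Index([1, 3], name="id"))

# Stand-in for candidate edges; the 1 -> 3 row is made up for illustration
candidate_edges = pd.DataFrame({"source": [2, 1], "target": [1, 3]})

# An edge is usable only if both of its endpoints are among the selected nodes
known = candidate_edges["source"].isin(nodes.index) & candidate_edges["target"].isin(nodes.index)
kept = candidate_edges[known]
dangling = candidate_edges[~known]

print(kept)
print(dangling)
```

An alternative is to put the same condition into the Cypher query itself, so that the filtering happens inside Neo4j.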


In [266]:
subgraph = StellarGraph(subgraph_nodes, subgraph_edges)
print(subgraph.info())

StellarGraph: Undirected multigraph
 Nodes: 2, Edges: 1

 Node types:
  default: [2]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [1]


## Conclusion

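You should now know how to build a `StellarGraph` from data stored in Neo4j in a range of configurations: homogeneous or heterogeneous, weighted or unweighted, directed or undirected, from either the entire database or a subset selected with Cypher queries.

Revisit this document whenever you need a reminder.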

Once you've loaded your data, you can start doing machine learning: a good place to start is the [demo of the GCN algorithm on the Cora dataset for node classification](../node-classification/gcn/gcn-cora-node-classification-example.ipynb). Additionally, StellarGraph includes [many other demos of other algorithms, solving other tasks](../README.md).

Please [let us know](https://github.com/stellargraph/stellargraph#getting-help) your experience of using StellarGraph with Neo4j, both positive and negative.

In [176]:
# clean everything up, so that we're not leaving the square graph in the Neo4j instance
neo4j_graph.delete_all()