# Optimize fetching data from Neo4j with Apache Arrow
## High-performance data retrieval from Neo4j with Apache Arrow

The year is 2022, and graph machine learning is one of the rising trends in data analytics. While Neo4j has a Graph Data Science library that supports multiple graph algorithms and machine learning workflows, sometimes you want to export data from Neo4j and run it through your favorite machine learning framework, such as PyTorch or TensorFlow. In that scenario, you want to export data from Neo4j in a fast and scalable way. Unfortunately, the Neo4j Python driver is not the most efficient way of retrieving large volumes of data. No need to worry, though: Dave Voutila has got your back. Over the past couple of months, he has been developing an Apache Arrow plugin for Neo4j.

https://github.com/neo4j-field/neo4j-arrow


The goal of the Neo4j Arrow project is to expose data available in Neo4j via high-performance Arrow Flight APIs. You can retrieve data via Cypher queries or even fetch information from the GDS in-memory graphs. I am not familiar with the underlying Apache Arrow infrastructure, so I won't try to explain how it works. Just know that it is blazing fast.
## Preparing the environment
We will use a subset of the Pokec dataset to demonstrate data retrieval with Neo4j Arrow. The dataset contains roughly 1.1 million nodes and 10.8 million relationships. You can download the Neo4j [database dump at this link](https://drive.google.com/file/d/176-Vmdn2fqy4KLPygl-yjJLiIhMXzKOL/view). To instantiate the database from the dump, open the Neo4j Desktop application and copy the dump under the Files section. Next, click on the three dots on the right-hand side of the file, and select "Create new DBMS from dump".

It is crucial that you select database version 4.4.0 or later; otherwise, the Arrow plugin might not work. The Neo4j Arrow project is still in its early stages, so not all database versions are supported. I have prepared the latest build of the [Arrow plugin for download](https://drive.google.com/file/d/1Xi0jwqFiJx_aKtZZ-Rd4XZrpMrJke6u5/view), or you can build it locally yourself. To activate the plugin, simply copy it into the database's plugins folder.
To follow this blog post, you will also need to install the Graph Data Science library.

## Retrieve data via Cypher query
You can now start the database instance and learn how to retrieve data via the Arrow plugin. I have prepared a Jupyter notebook that you can use to follow the examples.
First, we will use the official Neo4j Python driver to retrieve all the nodes and their properties from the database. As mentioned, there are around a million nodes in the database dump I prepared. The Cypher statement we will use is pretty straightforward.

In [2]:
cypher = """
MATCH (u:User)
RETURN u.id AS id, u.gender AS gender, u.age AS age
"""

Now, we can go ahead and execute the Cypher statement. We will also measure the execution time.

In [1]:
from neo4j import GraphDatabase
import time

# Connection details for the local Neo4j instance
host = 'bolt://localhost:7687'
user = 'neo4j'
password = 'letmein'
driver = GraphDatabase.driver(host, auth=(user, password))

def execute_query(query):
    # Run the query and materialize all records before the session closes
    with driver.session() as session:
        data = list(session.run(query))
    return data

In [3]:
start = time.time()
data = execute_query(cypher)
delta = round(time.time() - start, 1)
print(f'Neo4j driver took {delta} seconds to fetch the data')

Neo4j driver took 32.6 seconds to fetch the data
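
The start/stop timing pattern used above can be wrapped in a small context manager so that every fetch is measured the same way. This is only a minimal sketch using the standard library; the name `timed` is a hypothetical helper of my own and not part of the Neo4j driver or the Arrow plugin, and the workload is a stand-in so the snippet runs without a database.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Measure the wrapped block and print the elapsed time (hypothetical helper)."""
    result = {}
    # perf_counter is a monotonic clock, which suits interval measurements
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label} took {result['seconds']:.1f} seconds to fetch the data")

# Stand-in workload so the sketch runs without a database;
# in the notebook you would call execute_query(cypher) here instead.
with timed("Dummy workload") as t:
    rows = [i * i for i in range(100_000)]
```

The elapsed time stays available afterwards in `t["seconds"]`, which is handy if you want to compare several runs.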


The Neo4j Python driver fetches around a million rows in about 33 seconds, which is not that bad. However, when dealing with tens or hundreds of millions of rows, it doesn't scale well enough. Here is where the Neo4j Arrow plugin comes in handy.
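
One thing worth keeping in mind is that the driver hands back one record per row, while most data science tooling (and Arrow itself, which is a columnar format) works on whole columns. The sketch below illustrates the kind of per-row pivoting you end up doing in Python with driver results. The helper `rows_to_columns` is hypothetical, and the toy rows merely mimic the shape of the query result; with real driver records you could pass `record.data()` for each record instead.

```python
from collections import defaultdict

def rows_to_columns(rows):
    """Pivot an iterable of row mappings into a dict of column lists (hypothetical helper)."""
    columns = defaultdict(list)
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return dict(columns)

# Toy rows shaped like the result of the Cypher query above
sample_rows = [
    {"id": 1, "gender": 1, "age": 26},
    {"id": 19, "gender": 1, "age": 21},
    {"id": 21, "gender": 0, "age": 17},
]

columns = rows_to_columns(sample_rows)
print(columns["age"])  # [26, 21, 17]
```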
A nice benefit of the Arrow plugin is that it is language-independent, so you could use it to retrieve data from many programming languages. At the moment, there is a Python wrapper object available, but you could develop an Arrow client in other languages as well.
The code to fetch data using a Cypher statement via the Python Arrow wrapper is pretty straightforward.

In [5]:
import neo4j_arrow as na
client = na.Neo4jArrow(user, password, ('localhost', 9999))

start = time.time()
ticket = client.cypher(cypher)
table = client.stream(ticket).read_all()
delta = round(time.time() - start, 1)

print(f'Neo4j arrow took {delta} seconds to fetch the data')

Neo4j arrow took 2.8 seconds to fetch the data


It took 2.8 seconds to retrieve a million rows from the Neo4j database on my laptop. While this is only a roughly 12x improvement, the Arrow plugin is more scalable and can outperform the Python driver by up to 450x on larger datasets.
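
For the curious, here is the arithmetic behind that comparison as a quick sketch. The two timings are the ones measured on my laptop above; the row count is an assumption, taken from the node count that GDS reports when the graph is projected later in this post.

```python
driver_seconds = 32.6   # Neo4j Python driver timing measured above
arrow_seconds = 2.8     # Neo4j Arrow timing measured above
row_count = 1_099_121   # assumption: one row per User node in the dump

speedup = driver_seconds / arrow_seconds
driver_rows_per_s = row_count / driver_seconds
arrow_rows_per_s = row_count / arrow_seconds

print(f"Speedup: {speedup:.1f}x")  # Speedup: 11.6x
print(f"Driver throughput: {driver_rows_per_s:,.0f} rows/s")
print(f"Arrow throughput:  {arrow_rows_per_s:,.0f} rows/s")
```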
Since data gurus have a soft spot for Pandas, you can easily convert the table object to a Pandas dataframe using the `to_pandas()` method.

In [7]:
df = table.to_pandas()
df.head()

   id  gender  age
0   1       1   26
1  19       1   21
2  21       0   17
3  36       1   28
4  39       1   18


## Retrieve data from GDS in-memory graph
What is remarkable about the Neo4j Arrow project is that it also allows high-performance reads from the Graph Data Science library in-memory graphs. You could perform feature engineering using any of the graph or embedding algorithms, or you could simply evaluate the algorithm's result in Python.
We will construct a GDS in-memory graph and execute the Weakly Connected Components algorithm to demonstrate how to retrieve data efficiently from the GDS in-memory graph. Let's start by constructing the in-memory graph called `pokec` that contains nodes with the label `User` and `FRIEND` relationships. Note that the call uses the GDS 1.x procedure name `gds.graph.create`; in GDS 2.x, the equivalent procedure is named `gds.graph.project`.

In [8]:
execute_query("""
CALL gds.graph.create('pokec', 'User', 'FRIEND')
""")

[<Record nodeProjection={'User': {'label': 'User', 'properties': {}}} relationshipProjection={'FRIEND': {'orientation': 'NATURAL', 'aggregation': 'DEFAULT', 'type': 'FRIEND', 'properties': {}}} graphName='pokec' nodeCount=1099121 relationshipCount=10794057 createMillis=2679>]

Now, we will execute the mutate mode of the Weakly Connected Components algorithm to store the results back to the in-memory graph.

In [9]:
execute_query("""
CALL gds.wcc.mutate('pokec', {mutateProperty:'wcc'})
""")

[<Record mutateMillis=1 nodePropertiesWritten=1099121 componentCount=921 componentDistribution={'p99': 5, 'min': 2, 'max': 1097079, 'mean': 1193.3984799131379, 'p90': 3, 'p50': 2, 'p999': 30, 'p95': 3, 'p75': 2} postProcessingMillis=99 createMillis=0 computeMillis=765 configuration={'seedProperty': None, 'consecutiveIds': False, 'threshold': 0.0, 'relationshipWeightProperty': None, 'nodeLabels': ['*'], 'sudo': False, 'relationshipTypes': ['*'], 'mutateProperty': 'wcc', 'username': None, 'concurrency': 4}>]

With the prepared GDS in-memory graph, we can use the Arrow plugin to extract the algorithm results from Neo4j. Thanks to Dave Voutila and the Python wrapper he prepared, this is a pretty simple process.

In [13]:
# Submit our GDS job to retrieve some node properties from a graph projection
print('>> Reading the result of our GDS job...')
start = time.time()
ticket = client.gds_nodes('pokec', properties=['wcc'])
# Retrieve and consume the stream into a PyArrow Table
table = client.stream(ticket).read_all()
delta = round(time.time() - start, 2)
print(f'>> Took {delta:,}s to consume stream into a PyArrow table.')


>> Reading the result of our GDS job...
>> Took 1.85s to consume stream into a PyArrow table.


We retrieved the WCC results from the projected in-memory graph in 1.85 seconds. You could now use the WCC algorithm results to efficiently split the train-test data, use them as a feature in a downstream ML workflow, or simply evaluate and plot the results.
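
To make the train-test idea a bit more concrete, here is a rough sketch of a component-aware split, where whole components go to either the train or the test set so that no component straddles both. The function `split_by_component` and its parameters are hypothetical, and the node-to-component mapping is toy data. Also note that on a graph like this one, where the WCC output above reports a single component holding 1,097,079 of the 1,099,121 nodes, such a split would be very lopsided, so treat it as an illustration of the technique rather than a recipe for this dataset.

```python
import random

def split_by_component(component_of, test_fraction=0.2, seed=42):
    """Assign whole components to train or test (hypothetical helper).

    component_of maps node id -> component id.
    test_fraction is a fraction of components, not of nodes.
    Returns (train_node_ids, test_node_ids).
    """
    components = sorted(set(component_of.values()))
    rng = random.Random(seed)
    rng.shuffle(components)
    # At least one component always goes to the test set
    n_test = max(1, round(len(components) * test_fraction))
    test_components = set(components[:n_test])

    train, test = [], []
    for node, component in component_of.items():
        (test if component in test_components else train).append(node)
    return train, test

# Toy mapping: ten nodes spread over five components
toy = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3, 10: 4}
train, test = split_by_component(toy)
print(len(train), len(test))
```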
Of course, the popular way of interacting with data in Python is using the Pandas dataframe. We can easily transform the retrieved data into a dataframe with the `to_pandas()` method.

In [15]:
wcc_df = table.to_pandas()
print(wcc_df.head())
print(wcc_df['wcc'].nunique())

   _node_id_ _labels_  wcc
0       1365   [User]    0
1       1378   [User]    0
2       1386   [User]    0
3       1393   [User]    0
4       1400   [User]    0
921
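
If you want to go one step beyond counting components, you can also look at how large each one is. The sketch below does this with the standard library on a toy list of component ids; the helper name `component_sizes` is hypothetical. On the real data you could feed it `wcc_df['wcc'].tolist()`, or get the same information directly from Pandas with `wcc_df['wcc'].value_counts()`.

```python
from collections import Counter

def component_sizes(component_ids):
    """Return (component_id, size) pairs, largest component first (hypothetical helper)."""
    return Counter(component_ids).most_common()

# Toy component ids: one larger component and two small ones
toy_wcc = [0, 0, 0, 0, 7, 7, 9]

sizes = component_sizes(toy_wcc)
print(sizes)       # [(0, 4), (7, 2), (9, 1)]
print(len(sizes))  # 3
```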


## Conclusion
The Neo4j Arrow project is still in the early stages, so please test it out and report any bugs or feature requests directly in the [repository](https://github.com/neo4j-field/neo4j-arrow). If you like it, please give it a star. I am very excited about this project and how it can help graph data scientists efficiently retrieve data from Neo4j, which makes it more enjoyable to build machine learning pipelines around Neo4j.