# How to connect Neo4j to Hopsworks
In this notebook we will:
* import data into Neo4j
* use Neo4j's Graph Data Science (GDS) library to calculate node2vec node embeddings and store them on the nodes in the graph database
* read these embeddings into a dataframe
* create a feature group in a Hopsworks feature store

## Step 1: Importing the data into Neo4j

First we import the Neo4j driver and the Graph Data Science client, and set the connection parameters.

In [1]:
from neo4j import GraphDatabase
from graphdatascience import GraphDataScience

URI = "bolt://localhost:7687"
AUTH = ("neo4j", "changeme")
DATABASE = "gdshopsworksdemo"

Then we create uniqueness constraints and indexes in Neo4j.

In [2]:
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    # Uniqueness constraint and text index on the (:Party) nodes
    driver.execute_query("CREATE CONSTRAINT party_id_constraint FOR (p:Party) REQUIRE p.partyId IS UNIQUE", database_=DATABASE)
    driver.execute_query("CREATE TEXT INDEX party_type_index FOR (p:Party) ON (p.partyType)", database_=DATABASE)
    # Uniqueness constraint on the [:TRANSACTION] relationships
    driver.execute_query("CREATE CONSTRAINT transaction_id_constraint FOR ()-[r:TRANSACTION]-() REQUIRE r.tran_id IS UNIQUE", database_=DATABASE)
    # tran_timestamp is stored as a datetime, which text indexes do not cover, so a range index is used
    driver.execute_query("CREATE INDEX transaction_timestamp_index FOR ()-[r:TRANSACTION]-() ON (r.tran_timestamp)", database_=DATABASE)
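
Re-running a schema cell like the one above typically raises an error once the constraints and indexes exist. A minimal sketch of a re-runnable variant, assuming a Neo4j version that supports `IF NOT EXISTS` on schema commands. It covers three of the statements; `SCHEMA_STATEMENTS` and `ensure_schema` are hypothetical names, not part of the original notebook:

```python
# Hypothetical helper: schema statements that are safe to run more than once.
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT party_id_constraint IF NOT EXISTS "
    "FOR (p:Party) REQUIRE p.partyId IS UNIQUE",
    "CREATE TEXT INDEX party_type_index IF NOT EXISTS "
    "FOR (p:Party) ON (p.partyType)",
    "CREATE CONSTRAINT transaction_id_constraint IF NOT EXISTS "
    "FOR ()-[r:TRANSACTION]-() REQUIRE r.tran_id IS UNIQUE",
]


def ensure_schema(driver, database):
    """Run every schema statement and return how many were sent.

    `driver` is expected to be a neo4j driver (anything with execute_query).
    """
    for statement in SCHEMA_STATEMENTS:
        driver.execute_query(statement, database_=database)
    return len(SCHEMA_STATEMENTS)
```

With a live connection this would be called as `ensure_schema(driver, DATABASE)` inside a `with GraphDatabase.driver(URI, auth=AUTH) as driver:` block.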

Next we import the first .csv file, which holds the (:Party) nodes. This finishes quickly, as there are only about 7,300 nodes.

In [3]:
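# The driver from the previous cell was closed when its `with` block ended, so open a new one
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    with driver.session(database=DATABASE) as session:
        # Every CSV column becomes a property on the new (:Party) node
        result = session.run("""
            LOAD CSV WITH HEADERS FROM "https://repo.hops.works/master/hopsworks-tutorials/data/aml/party.csv" AS parties
            CREATE (p:Party)
            SET p = parties
        """)
        # Read the summary while the session is still open
        print(result.consume().counters)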

{'_contains_updates': True, 'labels_added': 7347, 'nodes_created': 7347, 'properties_set': 14694}
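
The counters double as a quick sanity check: dividing `properties_set` by `nodes_created` gives the number of CSV columns copied onto each node by `set p = parties`, assuming every row has a value in every column. A small sketch using the values printed above, copied into a plain dict (the helper name is hypothetical):

```python
def properties_per_entity(counters, created_key):
    """Average number of properties set per created node or relationship."""
    return counters["properties_set"] / counters[created_key]


# Values copied from the output of the import cell above.
party_counters = {
    "_contains_updates": True,
    "labels_added": 7347,
    "nodes_created": 7347,
    "properties_set": 14694,
}

print(properties_per_entity(party_counters, "nodes_created"))  # 2.0
```

Two properties per node is consistent with the `partyId` and `partyType` properties that the later queries rely on.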


Next we will import the relationships. There are about 438k [:TRANSACTION] relationships, and importing these will take a few minutes.

In [4]:
# Open a fresh driver, since earlier `with` blocks close theirs on exit
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    with driver.session(database=DATABASE) as session:
        # session.run() uses an auto-commit transaction, which CALL { ... } IN TRANSACTIONS requires
        result = session.run("""
            LOAD CSV WITH HEADERS FROM "https://repo.hops.works/master/hopsworks-tutorials/data/aml/transactions.csv" AS Transaction
            MATCH (startNode:Party)
            WHERE startNode.partyId = Transaction.src
            CALL {
                WITH Transaction, startNode
                MATCH (endNode:Party)
                WHERE endNode.partyId = Transaction.dst
                CREATE (startNode)-[rel:TRANSACTION {tran_id: Transaction.tran_id, tx_type: Transaction.tx_type, base_amt: Transaction.base_amt, tran_timestamp: datetime(Transaction.tran_timestamp)}]->(endNode)
            } IN TRANSACTIONS OF 2500 ROWS
        """)
        # Read the summary while the session is still open
        print(result.consume().counters)

{'_contains_updates': True, 'relationships_created': 438386, 'properties_set': 1753544}
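
`IN TRANSACTIONS OF 2500 ROWS` commits the inner `CREATE` in batches instead of one large transaction. A rough sketch of the arithmetic, assuming every CSV row reaches the subquery (which the relationship count above suggests); `batch_sizes` is an illustrative helper, not a driver or Cypher API:

```python
def batch_sizes(total_rows, rows_per_batch):
    """Sizes of the consecutive batches needed to process total_rows rows."""
    full_batches, remainder = divmod(total_rows, rows_per_batch)
    sizes = [rows_per_batch] * full_batches
    if remainder:
        sizes.append(remainder)  # the last, partially filled batch
    return sizes


sizes = batch_sizes(438386, 2500)
print(len(sizes), sizes[-1])  # 176 886

# Each relationship carries four properties (tran_id, tx_type, base_amt,
# tran_timestamp), which matches the reported properties_set value.
print(438386 * 4)  # 1753544
```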


This completes the importing of the data into Neo4j.

## Step 2: Calculating the node embeddings in Neo4j
This uses Neo4j's Graph Data Science (GDS) library, which has to be installed on the Neo4j server. We will use its `node2vec` algorithm to calculate the embeddings.

In [5]:
try:
    # The GDS client opens its own connection, so no separate driver is needed here
    gds = GraphDataScience(URI, auth=AUTH, database=DATABASE)

    # Uncomment this if you need to drop the `transaction_graph` from the gds catalog in memory
    # gds.run_cypher(
    #     """
    #     CALL gds.graph.drop('transaction_graph') YIELD graphName
    #     """
    # )

    # Note: the result of this query is not passed to the projection below,
    # which covers all Party nodes and all TRANSACTION relationships.
    gds.run_cypher(
        """
        MATCH (p1:Party)-[t:TRANSACTION]->(p2:Party)
            WHERE t.tran_timestamp >= datetime("2020-11-01")
                AND t.tran_timestamp < datetime("2020-12-01")
            RETURN p1, t, p2
        """
    )
    G, project_result = gds.graph.project("transaction_graph", "Party", "TRANSACTION")
    node2vec_result = gds.node2vec.write(
        G,                         # Graph object returned by the projection
        embeddingDimension=10,     # length of each embedding vector
        walkLength=80,             # number of steps per random walk
        inOutFactor=1,
        returnFactor=1,
        writeProperty="node2vec"   # node property that receives the embedding
    )
    # Every projected node should have received an embedding
    assert node2vec_result["nodePropertiesWritten"] == G.node_count()

except Exception as e:
    print(e)
    # further logging/processing
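
For intuition about `returnFactor` and `inOutFactor`, assuming they correspond to the p and q parameters of the node2vec paper: a walk that has just moved from `prev` to `curr` weights each candidate next node by 1/returnFactor if it is `prev`, by 1 if it is also a neighbour of `prev`, and by 1/inOutFactor otherwise. The sketch below illustrates that rule in plain Python on a toy undirected graph. It is not the GDS implementation, and all names in it are made up for the example.

```python
def transition_weights(graph, prev, curr, return_factor=1.0, in_out_factor=1.0):
    """Unnormalised weights for the next step from curr, having arrived from prev."""
    weights = {}
    for candidate in graph[curr]:
        if candidate == prev:
            weights[candidate] = 1.0 / return_factor    # step back to prev
        elif candidate in graph[prev]:
            weights[candidate] = 1.0                    # stay close to prev
        else:
            weights[candidate] = 1.0 / in_out_factor    # move further away
    return weights


def transition_probabilities(graph, prev, curr, return_factor=1.0, in_out_factor=1.0):
    """Normalise the weights so that they sum to 1."""
    weights = transition_weights(graph, prev, curr, return_factor, in_out_factor)
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


# Toy undirected graph as adjacency lists (hypothetical data).
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}

# The walk arrived at B from A. With both factors at 1 the step is uniform.
print(transition_probabilities(toy_graph, "A", "B"))
# A larger return factor and a smaller in-out factor push the walk outwards.
print(transition_weights(toy_graph, "A", "B", return_factor=4.0, in_out_factor=0.5))
```

With both factors left at 1, as in the cell above, the walks reduce to unbiased random walks over the projected graph.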

## Step 3: Reading the embeddings from Neo4j and storing them in the Hopsworks feature store

In [6]:
import datetime
import pandas as pd
import numpy as np
import neo4j
from tqdm import tqdm

import hopsworks
# Connect to Hopsworks Serverless; login() asks which project to use if several exist
project = hopsworks.login()
fs = project.get_feature_store()

# Connecting to Neo4j, getting the embeddings, and putting them into a dataframe
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    graph_embeddings_df = driver.execute_query(
        """
        MATCH (p:Party)-[t:TRANSACTION]->(:Party)
        RETURN
            p.partyId AS party_id,
            p.partyType AS party_type,
            p.node2vec AS party_graph_embedding,
            datetime(t.tran_timestamp).epochMillis AS timestamp
        """,
        database_=DATABASE,
        result_transformer_=neo4j.Result.to_df
    )
    print(type(graph_embeddings_df))  # <class 'pandas.core.frame.DataFrame'>

# Infer the feature schema from the dataframe, then store the embedding
# arrays as VARBINARY in the online feature store
from hsfs import engine
features = engine.get_instance().parse_schema_feature_group(graph_embeddings_df)
for f in features:
    if f.type in ("array<double>", "array<float>"):
        f.online_type = "VARBINARY(20000)"

# Create the feature group in Hopsworks Serverless (or fetch it if it already exists) and insert the data
graph_embeddings_fg = fs.get_or_create_feature_group(
    name="graph_embeddings",
    version=1,
    primary_key=["party_id"],
    description="node embeddings from transactions graph",
    event_time="timestamp",
    online_enabled=True,
    features=features,
    statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False},
)
graph_embeddings_fg.insert(graph_embeddings_df)

Connected. Call `.close()` to terminate connection gracefully.

To ensure compatibility please install the latest bug fix release matching the minor version of your backend (3.4) by running 'pip install hopsworks==3.4.*'


Multiple projects found. 

	 (1) rixdemo
	 (2) BeerVolumePrediction

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/189590
Connected. Call `.close()` to terminate connection gracefully.
<class 'pandas.core.frame.DataFrame'>
Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/189590/fs/189509/fg/551261


Launching job: graph_embeddings_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/189590/jobs/named/graph_embeddings_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x13b1bde90>, None)
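
The `timestamp` column used as `event_time` holds epoch milliseconds, as returned by the Cypher query. A small standard-library sketch for converting between those values and timezone-aware datetimes, for example when inspecting rows or building time filters. The helper names are illustrative and UTC is assumed for display:

```python
from datetime import datetime, timezone


def millis_to_datetime(epoch_millis):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)


def datetime_to_millis(moment):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return round(moment.timestamp() * 1000)


start_of_november = datetime(2020, 11, 1, tzinfo=timezone.utc)
print(datetime_to_millis(start_of_november))   # 1604188800000
print(millis_to_datetime(1604188800000))       # 2020-11-01 00:00:00+00:00
```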

That's it. The Neo4j-based feature engineering is done, and the embeddings are now available in the Hopsworks feature store.