# Part 7: Feature Engineering in Neo4j and GDS

This notebook covers:

1. Native Graph Projection with Properties
2. Generating FastRP Features
3. Subgraph Projection and Data Export

## Connection Setup and Helper Functions

In [1]:
from neo4j import GraphDatabase
HOST = 'neo4j://localhost:7687'
USERNAME = 'neo4j'
DATABASE = 'ogblsc'
PASSWORD = 'neo'

def run(driver, query, params=None):
    # Run a Cypher query against DATABASE and return all records as a list.
    with driver.session(database=DATABASE) as session:
        # session.run accepts params=None, so no branching is needed.
        return list(session.run(query, params))

def clear_graph(driver, graph_name):
    if run(driver, f"CALL gds.graph.exists('{graph_name}') YIELD exists RETURN exists")[0].get("exists"):
        run(driver, f"CALL gds.graph.drop('{graph_name}')")

def clear_all_graphs(driver):
    graphs = run(driver, 'CALL gds.graph.list() YIELD graphName RETURN collect(graphName) as graphs')[0].get('graphs')
    for g in graphs:
        run(driver, f"CALL gds.graph.drop('{g}')")
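
The helpers above interpolate the graph name directly into the Cypher string. As a minimal alternative sketch, assuming a helper with the same shape as `run(driver, query, params)`, the name can be passed as a query parameter instead. The function name `drop_graph_if_exists` and its `run_fn` argument are hypothetical, introduced here only so the logic can be exercised without a database.

```python
def drop_graph_if_exists(run_fn, driver, graph_name):
    """Drop a named in-memory graph if it exists; return True if it was dropped.

    `run_fn` is assumed to behave like the `run` helper:
    run_fn(driver, query, params) -> list of records supporting .get().
    """
    params = {"name": graph_name}
    records = run_fn(
        driver,
        "CALL gds.graph.exists($name) YIELD exists RETURN exists",
        params,
    )
    if records and records[0].get("exists"):
        # The graph name travels as a parameter, not as part of the query text.
        run_fn(
            driver,
            "CALL gds.graph.drop($name) YIELD graphName RETURN graphName",
            params,
        )
        return True
    return False
```

With the helpers from this notebook, a call would look like `drop_graph_if_exists(run, driver, 'proj-features')`.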

In [2]:
driver = GraphDatabase.driver(HOST, auth=(USERNAME, PASSWORD))

## Native Graph Projection with Properties

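For this demo we project only the Paper nodes and the CITES relationships. The `subject` and `encoding` node properties are loaded with the projection, and CITES is projected as `UNDIRECTED`, so each citation is counted in both directions in the reported relationship count.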

In [3]:
run(driver, '''
    CALL gds.graph.create('proj-features',
        {Paper:{properties: ['subject', 'encoding']}},
        {CITES:{orientation:'UNDIRECTED'}},
        {readConcurrency: 60}
    ) YIELD nodeCount, relationshipCount, createMillis
''')

[<Record nodeCount=121751666 relationshipCount=2595497852 createMillis=386135>]

## Generating FastRP Features

Fast Random Projection, or FastRP for short, is a node embedding algorithm. Node embedding algorithms compute low-dimensional vector representations of the nodes in a graph. These vectors, also called embeddings, can be used as features for machine learning models, as well as for tasks such as visualization and exploratory data analysis (EDA).

FastRP uses sparse random projections to scale the computation of embeddings to large graphs. More information can be found in [our documentation](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).

In the example below we set `propertyRatio` to 0.5. With an `embeddingDimension` of 256, this means half of the embedding dimensions (128) are initialized from linear combinations of the RoBERTa `encoding` components with random weights, while the other half are initialized randomly from the graph structure alone. In plain terms, the embeddings combine the graph structure with the NLP encodings to generate (hopefully predictive) node features.
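
To make the role of `propertyRatio` more concrete, the sketch below assembles the initial vector of a single node from a random part and a property-derived part. It illustrates the idea only and is not the GDS implementation: the function names, the sparsity value of 3, and the toy feature values are all assumptions made for this example.

```python
import math
import random


def split_dimensions(embedding_dimension, property_ratio):
    """Return (structural_dim, property_dim) for a given property ratio."""
    property_dim = int(embedding_dimension * property_ratio)
    return embedding_dimension - property_dim, property_dim


def sparse_entry(rng, sparsity=3):
    """Draw +sqrt(s), -sqrt(s), or 0 with probabilities 1/(2s), 1/(2s), 1 - 1/s."""
    u = rng.random()
    if u < 1 / (2 * sparsity):
        return math.sqrt(sparsity)
    if u < 1 / sparsity:
        return -math.sqrt(sparsity)
    return 0.0


def make_property_weights(num_features, property_dim, rng):
    """One random weight per (output dimension, input feature), shared by all nodes."""
    return [[sparse_entry(rng) for _ in range(num_features)] for _ in range(property_dim)]


def initial_vector(features, property_weights, structural_dim, rng):
    """Concatenate a random structural part with a feature-derived part."""
    structural = [sparse_entry(rng) for _ in range(structural_dim)]
    from_properties = [
        sum(f * w for f, w in zip(features, weights)) for weights in property_weights
    ]
    return structural + from_properties


rng = random.Random(7474)
structural_dim, property_dim = split_dimensions(256, 0.5)
toy_encoding = [0.1, -0.3, 0.7, 0.2]  # hypothetical stand-in for a node's `encoding`
weights = make_property_weights(len(toy_encoding), property_dim, rng)
vector = initial_vector(toy_encoding, weights, structural_dim, rng)
print(structural_dim, property_dim, len(vector))  # 128 128 256
```

FastRP then repeatedly averages these initial vectors over each node's neighbors to produce the final embedding; that propagation step is omitted here.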

In [4]:
run(driver, '''
    CALL gds.fastRP.mutate('proj-features',
        {
          embeddingDimension: 256,
          randomSeed: 7474,
          propertyRatio: 0.5,
          featureProperties: ['encoding'],
          mutateProperty: 'embedding',
          concurrency: 60
        }
    ) YIELD nodePropertiesWritten, createMillis, computeMillis, mutateMillis
''')

[<Record nodePropertiesWritten=121751666 createMillis=0 computeMillis=430806 mutateMillis=0>]

## Subgraph Projection and Data Export

To test predicting subject labels with the new FastRP features, we only need to export the fraction of papers with known labels. We can use a subgraph projection with the node filter `n.subject > -1` to keep only these papers, and then export the subgraph to CSV.

In [5]:
# subgraph projection
run(driver, '''
    CALL gds.beta.graph.create.subgraph(
        'proj-features-labeled',
        'proj-features',
        'n.subject > -1',
        '*',
        {concurrency: 60}
    ) YIELD nodeCount, createMillis
''')

[<Record nodeCount=1251341 createMillis=11923>]

In [7]:
# csv export
run(driver, '''
    CALL gds.beta.graph.export.csv('proj-features-labeled', {
      exportName: 'proj-features-labeled',
      additionalNodeProperties: ['ogbIndex', 'split_segment', 'subject_status', 'year'],
      writeConcurrency: 16
    }) YIELD writeMillis
''')

[<Record writeMillis=41485>]
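
Once the export finishes, the files can be read back for model training. The sketch below is a minimal reader under several assumptions that should be checked against the actual export directory: that each node label gets a header file and one or more data files (for example `nodes_Paper_header.csv` and `nodes_Paper_0.csv`), that header columns look like `name:type`, and that array values are separated by `;`. The function names are hypothetical.

```python
import csv


def parse_header(header_line):
    """Split 'name:type' column specs into (name, type) pairs."""
    columns = []
    for spec in header_line.strip().split(","):
        name, _, col_type = spec.partition(":")
        # An ID column such as ':ID' has no name; call it 'id' here.
        columns.append((name or "id", col_type))
    return columns


def convert(value, col_type):
    """Convert one CSV cell to a Python value based on its declared type."""
    if value == "":
        return None
    if col_type.endswith("[]"):
        # Assumed array format: elements separated by ';'.
        cast = int if col_type[:-2] in ("long", "int") else float
        return [cast(v) for v in value.split(";")]
    if col_type in ("long", "int", "ID"):
        return int(value)
    if col_type in ("double", "float"):
        return float(value)
    return value


def read_nodes(header_path, data_path):
    """Yield one dict per node row, using the header file for names and types."""
    with open(header_path, newline="") as f:
        columns = parse_header(f.readline())
    with open(data_path, newline="") as f:
        for row in csv.reader(f):
            yield {name: convert(v, t) for (name, t), v in zip(columns, row)}
```

A call such as `rows = list(read_nodes('nodes_Paper_header.csv', 'nodes_Paper_0.csv'))` would then give one dictionary per labeled paper, with `embedding` as a list of floats, provided the assumptions above hold for your GDS version.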