# Exercise 5c: More ML on graphs

## Introduction

In this notebook, we assume that you have already populated your Wikidata (AKA "Method 2") database, which was shown in Exercises 3 and 4. We also will assume that you have run the Cypher queries found in cypher_queries/method2_queries.cql to do things like update the node labels to the P31 values, segment out both model and holdback data, and create some basic embeddings.

No worries if you need to spin up a new Sandbox instance.  There is an optional cell below for repopulating it.

In [None]:
%matplotlib inline

import json
import re
import urllib
from pprint import pprint
import time
from tqdm import tqdm

from neo4j import GraphDatabase

import numpy as np
import pandas as pd

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

In [None]:
class Neo4jConnection:
    
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response

In [None]:
uri = ''
user = 'neo4j'
pwd = ''

conn = Neo4jConnection(uri=uri, user=user, pwd=pwd)
conn.query("MATCH (n) RETURN COUNT(n)")

## If you need to repopulate the Sandbox instance, run the following two cells...

In [None]:
wiki_url = 'https://resources.oreilly.com/binderhub/introduction-to-knowledge-graphs/raw/master/data/wiki.json'

query = "CALL apoc.import.json('" + wiki_url + "')"
conn.query(query)

conn.query("MATCH (n) RETURN COUNT(n)")

In [None]:
query1 = """MATCH (n:Node) 
           WITH n.name AS name, COLLECT(n) AS nodes 
           WHERE SIZE(nodes)>1 
           FOREACH (el in nodes | DETACH DELETE el)
"""

query2 = """MATCH (n:Node) 
            SET n.type_ls = apoc.convert.toStringList(n.type)
"""

query3 = """MATCH (n:Node) 
            CALL apoc.create.addLabels(n, n.type_ls) 
            YIELD node 
            RETURN COUNT(node)
"""

conn.query(query1)
conn.query(query2)
conn.query(query3)

## Binary classification example: can we identify whether a node is a place or not a place?

We are going to try to determine whether a node is a place or not based on our graph.  Let's create a property for our nodes, `is_place` (1 = a place, 0 = otherwise), and see how this works.  I have created an arbitrary list of node labels.  While I tried to be complete, I am sure there are errors.

In [None]:
query1 = """MATCH (n)
            WHERE ANY (x in n.type WHERE x IN 
                        ['county of Illinois', 
                        'state of the United States',
                        'oblast of Russia',
                        'province of Afghanistan',
                        'province of Iran',
                        'oblast of Ukraine',
                        'district of Libya',
                        'governorate of Iraq',
                        'province of Cuba',
                        'governorate of Syria',
                        'sovereign state',
                        'autonomous okrug of Russia',
                        'city',
                        'krai of Russia',
                        'city of the United States',
                        'territory of the United States',
                        'capital',
                        'geographic region',
                        'continent',
                        'county of Hawaii',
                        'village',
                        'historical country',
                        'autonomous republic',
                        'organized incorporated territory',
                        'unincorporated territory',
                        'census-designated place',
                        'human settlement',
                        'borough of New York City',
                        'Commonwealth realm',
                        'city of Pennsylvania',
                        'neighborhood of Washington, D.C.',
                        'country']
                      )
            SET n.is_place=1
"""

query2 = """MATCH (n) WHERE n.is_place IS NULL SET n.is_place=0"""

conn.query(query1)
conn.query(query2)
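
The `ANY (x IN n.type WHERE x IN [...])` predicate above is just a set-overlap test. As a rough illustration in plain Python (the `PLACE_TYPES` subset, the `is_place` helper, and the sample records are made up for this example and are not pulled from the database):

```python
# Hypothetical subset of the place labels used in the Cypher query above.
PLACE_TYPES = {"city", "country", "sovereign state", "continent"}


def is_place(node_types):
    """Return 1 if any of the node's types is a place type, else 0.

    Mirrors the two Cypher statements: nodes with a matching type get 1,
    everything else (including nodes with no type at all) gets 0.
    """
    if not node_types:
        return 0
    return int(any(t in PLACE_TYPES for t in node_types))


# Made-up sample records shaped like (name, list of types).
samples = [
    ("France", ["sovereign state", "country"]),
    ("Bill Clinton", ["human"]),
    ("Untyped node", None),
]

labels = {name: is_place(types) for name, types in samples}
print(labels)
```

Any label missing from the list silently turns a real place into a 0, which is worth keeping in mind when reading the classification results later on.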


## Generate in-memory graph

We are now going to create our in-memory graph.  We do this for all nodes and all relationships which, recall, is not a great idea in general.  However, since our graph is so small, we are going to go with it for the sake of demonstration.  Also recall that most of GDS is looking for an undirected graph.

In [None]:
query = """CALL gds.graph.create(
               'all_nodes',
               'Node',
               {
                   RELS: {
                           type: '*',
                           orientation: 'UNDIRECTED'
                   }
               }
           )
"""

conn.query(query)
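
To make the `orientation: 'UNDIRECTED'` setting concrete, here is a small sketch of what it means conceptually: every stored relationship becomes traversable in both directions. The edge list and the helper name are made up, and this only illustrates the idea, not how GDS stores its projection internally.

```python
from collections import defaultdict

# Made-up directed relationships (source, target).
directed_edges = [("A", "B"), ("B", "C"), ("D", "A")]


def to_undirected_adjacency(edges):
    """Build an adjacency map in which each edge can be walked both ways."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)  # the reverse direction
    return {node: sorted(neighbors) for node, neighbors in adjacency.items()}


adjacency = to_undirected_adjacency(directed_edges)
print(adjacency)
```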

## Create some embeddings

We are going to create two different kinds of 10-dimensional graph embeddings using:

1. [node2vec](https://neo4j.com/docs/graph-data-science/current/algorithms/node2vec/)
2. [Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/) (AKA "FastRP")

There are many hyperparameters that we are not tuning in this section.  If you would like to know more about them, you can read [this blog post](https://dev.neo4j.com/fastrp_background) on the math behind FastRP and [this blog post](https://dev.neo4j.com/bratanic_node2vec) on node2vec.

In [None]:
query = """CALL gds.beta.node2vec.write(
               'all_nodes', 
               { 
                   embeddingDimension: 10, 
                   writeProperty: 'n2v_all_nodes'
               } 
           )
"""

conn.query(query)

In [None]:
query = """CALL gds.fastRP.write(
               'all_nodes',
               {
                   embeddingDimension: 10, 
                   writeProperty: 'frp_all_nodes'
               }
           )
"""

conn.query(query)

## t-SNE of our embeddings

Let's now use the [t-distributed Stochastic Neighbor Embedding (t-SNE)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html?highlight=tsne#sklearn.manifold.TSNE) approach of dimensionality reduction to try and visualize the quality of our embeddings.  Recall that we have a node property called `is_place`, which is 1 for all nodes we called a place and 0 otherwise.  So we are going to use this binary classification to see if we can get our classes to form separable clusters.

In [None]:
def create_tsne_plot(emb_name='n.n2v_all_nodes', n_components=2, debug=False):
    
    query_string = '''
        MATCH (n:Node)
        RETURN n.name, n.type, n.is_place AS category, {} AS vec
    '''.format(emb_name)
    model_df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
    
    if debug:
        uniqueValues = model_df['category'].nunique()
        print(uniqueValues)
    
    X_emb = TSNE(n_components=n_components).fit_transform(list(model_df['vec']))
    
    tsne_df = pd.DataFrame(data = {
        'x': [value[0] for value in X_emb],
        'y': [value[1] for value in X_emb],
        'label': model_df['category']
    })
    
    plt.figure(figsize=(16,10))
    s = 30
    ax = sns.scatterplot(
        x='x', y='y',
        palette=sns.color_palette('hls',2),
        data=tsne_df,
        hue='label',
        legend=True, 
        s=500,
        alpha=0.75
    )
    ax.legend(prop={'size': 20})
    plt.xlabel('X Component', fontsize=16)
    plt.ylabel('Y Component', fontsize=16)
    plt.show()

    return tsne_df

In [None]:
tsne_df = create_tsne_plot(emb_name='n.n2v_all_nodes', n_components=2, debug=False)

In [None]:
tsne_df = create_tsne_plot(emb_name='n.frp_all_nodes', n_components=2, debug=False)

## Observation

OK, that is not so hot.  There are a few small clusters that have very few false positives, but all in all this is nothing to write home about.
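
A numeric complement to eyeballing the plot is to compare how similar embeddings are within a class versus across classes. The sketch below does that with cosine similarity on a handful of made-up 3-dimensional vectors; with real data you would feed in the `vec` and `category` columns instead. The function names are illustrative, not part of the notebook.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity_by_pair_type(vectors, labels):
    """Average cosine similarity for same-class and different-class pairs."""
    same, different = [], []
    for i, j in combinations(range(len(vectors)), 2):
        bucket = same if labels[i] == labels[j] else different
        bucket.append(cosine(vectors[i], vectors[j]))
    return sum(same) / len(same), sum(different) / len(different)


# Made-up embeddings: two "place" vectors and two "not place" vectors.
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
labels = [1, 1, 0, 0]

within, across = mean_similarity_by_pair_type(vectors, labels)
print(f"within-class: {within:.3f}, across-class: {across:.3f}")
```

If the two averages come out close to each other, the embedding is probably not separating the classes, which would be consistent with a muddled t-SNE plot.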

## _EXERCISE:_ Try some different hyperparameters for the two embedding approaches to see if you can do better.

It will help to consult the docs for each, linked above.
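
One way to organize this exercise is to generate one procedure call per hyperparameter combination and write each result to its own property. The sketch below only builds the query text and parameter maps; it does not touch the database. The helper name `build_fastrp_runs`, the `frp_run_N` property naming scheme, and the particular grid values are assumptions chosen for illustration. `embeddingDimension`, `iterationWeights`, and `normalizationStrength` are FastRP configuration keys described in the docs linked above.

```python
from itertools import product

# Assumed grid of values to try; adjust to taste.
DIMENSIONS = [10, 64]
ITERATION_WEIGHTS = [[0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
NORMALIZATION_STRENGTHS = [0.0, -0.5]


def build_fastrp_runs(graph_name="all_nodes"):
    """Return (query, parameters) pairs, one per hyperparameter combination."""
    query = "CALL gds.fastRP.write($graph_name, $config)"
    runs = []
    grid = product(DIMENSIONS, ITERATION_WEIGHTS, NORMALIZATION_STRENGTHS)
    for run_id, (dim, weights, norm) in enumerate(grid):
        config = {
            "embeddingDimension": dim,
            "iterationWeights": weights,
            "normalizationStrength": norm,
            # Hypothetical naming scheme so runs do not overwrite each other.
            "writeProperty": f"frp_run_{run_id}",
        }
        runs.append((query, {"graph_name": graph_name, "config": config}))
    return runs


runs = build_fastrp_runs()
for query, parameters in runs[:2]:
    print(query, parameters["config"])
```

Each pair should be usable as `conn.query(query, parameters)`, after which the new property name (prefixed with `n.`) can be handed to `create_tsne_plot` to compare runs.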

## Binary classification model with support vector machines

As in the previous ML exercise, we will convert our graph data into a format suitable for `scikit-learn` and run it through an SVC classifier, averaging accuracy over five random train/test splits (a simple stand-in for 5-fold cross-validation).  Unlike previous examples, we will use the built-in `class_weight` parameter to attempt to handle the class imbalance problem.

In [None]:
def create_X(df2, emb):

    n2v_an_ls = df2[emb].to_list()
    n2v_arr = np.array([np.array(x) for x in n2v_an_ls], dtype=object)

    print(n2v_arr.shape)
    
    return n2v_arr


def modeler(df, emb_name, y_column_name, k_folds=5, model='linear', show_matrix=True):
    
    y = df[y_column_name].fillna(0.0).to_numpy()
    vec_array = create_X(df, emb_name)
    acc_scores = []
    
    pos = np.count_nonzero(y == 1.0)
    neg = y.shape[0] - pos
    print('Number of positive: ', pos, ' Number of negative: ', neg)
    
    for i in range(0, k_folds):
        
        X_train, X_test, y_train, y_test = train_test_split(vec_array, y, test_size=0.25)
        # Use the kernel requested via the `model` argument (defaults to 'linear')
        clf = svm.SVC(kernel=model, class_weight='balanced')
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)

        acc = accuracy_score(y_test, pred)
        acc_scores.append(acc)        
        
    print('Accuracy scores: ', acc_scores)
    print('Mean accuracy: ', np.mean(acc_scores))
    
    if show_matrix:
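        # Shows the classifier and test split from the last iteration only.
        # plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions
        # use ConfusionMatrixDisplay.from_estimator with the same arguments.
        plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues, normalize='true')
        plt.show()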
        
    return clf

In [None]:
query_string = '''
    MATCH (n:Node)
    RETURN n.name, n.type, n.is_place AS category, n.n2v_all_nodes AS n2v_vec, n.frp_all_nodes AS frp_vec
'''
model_df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
model_df.head()

In [None]:
n2v_clf = modeler(model_df, emb_name='n2v_vec', y_column_name='category')

In [None]:
frp_clf = modeler(model_df, emb_name='frp_vec', y_column_name='category')
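
For reference, the `class_weight='balanced'` option used in `modeler` weights each class by `n_samples / (n_classes * class_count)`, so errors on the rarer class cost more during training. A small sketch of that arithmetic with made-up counts (the function name is illustrative):

```python
from collections import Counter


def balanced_class_weights(y):
    """Compute n_samples / (n_classes * count) for each class label."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up labels: 20 places among 100 nodes.
y = [1] * 20 + [0] * 80
weights = balanced_class_weights(y)
print(weights)
```

With a 20/80 split, the minority class ends up weighted four times as heavily as the majority class.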

## Observation

Those results aren't _horrible,_ but we definitely see the impact of the class imbalance.  This is one of those ML problems with several possible remedies, most of which involve getting more data or manually balancing the classes further.  I leave that as an exercise to try after the workshop.
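
Accuracy alone can hide the effect of the imbalance, since always predicting "not a place" already scores well. Per-class recall and balanced accuracy make that visible. Here is a minimal sketch with made-up predictions; `binary_report` is an illustrative helper, and `scikit-learn` offers `recall_score` and `balanced_accuracy_score` for the same purpose.

```python
def binary_report(y_true, y_pred):
    """Return accuracy, recall per class, and balanced accuracy for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall_place": recall_pos,
        "recall_not_place": recall_neg,
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
    }


# Made-up example: 2 places, 8 non-places, and a model that always says 0.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

report = binary_report(y_true, y_pred)
print(report)
```

The do-nothing model reaches 80% accuracy here while finding none of the places, which is why the normalized confusion matrix is the more honest view.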

## _EXERCISE:_ Play with the embeddings above to try to get a better result.

## Let's see what happens if we give the classifier some nodes that we happen to know the answer for and see how it does

Good node names to try are "Illinois," "Bill Clinton," and "city of the United States."

In [None]:
def predict_unknown(node_name, emb_name, clf, debug=False):
    
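    # Pass the node name as a query parameter rather than splicing it into the Cypher text.
    # emb_name is still concatenated, so only pass trusted property names.
    query_string = "MATCH (n:Node {name: $name}) RETURN n.name AS name, n." + emb_name + " AS vec"
    
    if debug:
        print(query_string)
        print(type(query_string))
        print(emb_name)

    unknown_df = pd.DataFrame([dict(_) for _ in conn.query(query_string, parameters={'name': node_name})])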
    
    vec_array = create_X(unknown_df, emb='vec')
    pred = clf.predict(vec_array)
    print('Predicted Class: ', pred[0])
    
    return

In [None]:
predict_unknown('Illinois', 'n2v_all_nodes', clf=n2v_clf, debug=False)

## Trying a new classifier: K Neighbors

For the sake of comparison, we can try other classifiers as well.  I have chosen to use the [K Neighbors Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) in this example, but you could pick any classifier really.

In [None]:
def knc_modeler(df, emb_name, y_column_name, k_folds=5, model='linear', show_matrix=True):
    
    y = df[y_column_name].fillna(0.0).to_numpy()
    vec_array = create_X(df, emb_name)
    acc_scores = []
    
    pos = np.count_nonzero(y == 1.0)
    neg = y.shape[0] - pos
    print('Number of positive: ', pos, ' Number of negative: ', neg)
    
    for i in range(0, k_folds):
        
        X_train, X_test, y_train, y_test = train_test_split(vec_array, y, test_size=0.25)
        clf = KNeighborsClassifier(n_neighbors=10, weights='distance')
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)

        acc = accuracy_score(pred, y_test)
        acc_scores.append(acc)        
        
    print('Accuracy scores: ', acc_scores)
    print('Mean accuracy: ', np.mean(acc_scores))
    
    if show_matrix:
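        # Shows the classifier and test split from the last iteration only.
        # plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions
        # use ConfusionMatrixDisplay.from_estimator with the same arguments.
        plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues, normalize='true')
        plt.show()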
        
    return clf

In [None]:
n2v_knc = knc_modeler(model_df, emb_name='n2v_vec', y_column_name='category')

In [None]:
frp_knc = knc_modeler(model_df, emb_name='frp_vec', y_column_name='category')
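
To see what `weights='distance'` changes, here is a stripped-down sketch of a distance-weighted nearest-neighbor vote on made-up 2-D points. It only illustrates the idea (closer neighbors get votes proportional to 1/distance) and is not scikit-learn's implementation; `knn_predict` is a hypothetical helper.

```python
import math
from collections import defaultdict


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by letting the k nearest points vote with weight 1/distance."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        if distance == 0:
            return label  # an exact match decides the vote outright
        votes[label] += 1.0 / distance
    return max(votes, key=votes.get)


# Made-up training data: one nearby "place" (1) and two farther "non-places" (0).
train_points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
train_labels = [1, 0, 0]

prediction = knn_predict(train_points, train_labels, query=(0.5, 0.5), k=3)
print(prediction)
```

In this toy case a plain majority vote would pick class 0 (two neighbors against one), while the distance-weighted vote picks class 1 because the single place is much closer to the query point.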

## Observation

That is slightly better, if we ignore the class imbalance.  And again, remember that we really haven't spent any time optimizing the embeddings or the model (beyond averaging over a few random train/test splits).  As per the earlier caveats, these demonstrations are for educational purposes and are not intended to be optimized, especially for such small graphs.
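
Note that `modeler` and `knc_modeler` draw a fresh random 75/25 split on each pass, so some nodes may be tested several times and others never. A true k-fold scheme tests every node exactly once, and a stratified one also keeps the place/non-place ratio steady across folds. The sketch below shows the index bookkeeping in plain Python on made-up labels; the function name is illustrative, and in practice `sklearn.model_selection.StratifiedKFold` does this for you.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k=5, seed=42):
    """Yield (train_indices, test_indices) with class ratios kept roughly equal per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for test_fold in range(k):
        test = sorted(folds[test_fold])
        train = sorted(i for f, fold in enumerate(folds) if f != test_fold for i in fold)
        yield train, test


# Made-up labels: 10 places among 50 nodes.
labels = [1] * 10 + [0] * 40
splits = list(stratified_k_folds(labels, k=5))
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))
```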

## Next Steps

1. Add more data to the graph (always a good idea when you can and time permits!)
2. Optimize the above embeddings (obvious)
3. Try the [GraphSAGE](https://neo4j.com/docs/graph-data-science/current/algorithms/graph-sage/) or [Fast RP Extended](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/#algorithms-embeddings-fastrp-extended) embedding algorithms.  These two have the advantage of taking into account the node properties themselves, in addition to the graph structure we have used above.

## Built-in ML Algorithms with GDS

The Graph Data Science library does much more than embeddings! In particular, I recommend you check out the [node classification](https://neo4j.com/docs/graph-data-science/current/algorithms/ml-models/node-classification/) and [link prediction](https://neo4j.com/docs/graph-data-science/current/algorithms/ml-models/linkprediction/) modeling capabilities. This small graph might not be able to take much advantage of them, but as the graph gets bigger you might find them to be really helpful!