# Collaborative Filtering Example

In [1]:
from neo4j import GraphDatabase
import pandas as pd
import textwrap
import configparser
import os
pd.set_option('display.width', 0)
pd.set_option('display.max_colwidth', 500)

In [2]:
## Using an ini file for credentials, otherwise providing defaults
HOST = 'neo4j://localhost'
DATABASE = 'neo4j'
PASSWORD = 'password'
credential_properties = '/Users/zachblumenfeld/devtools/aura-news-demo.ini'
if os.path.exists(credential_properties):
    config = configparser.RawConfigParser()
    config.read(credential_properties)
    HOST = config['NEO4J']['HOST']
    DATABASE = config['NEO4J']['DATABASE']
    PASSWORD = config['NEO4J']['PASSWORD']
    print('Using custom database properties')
else:
    print('Could not find database properties file, using defaults')

Using custom database properties


In [3]:
# helper functions
def run(driver, query, params=None):
    """Run a Cypher query in a fresh session and return all records as a list."""
    with driver.session() as session:
        # session.run accepts params=None, so no branching is needed
        return list(session.run(query, params))

def clear_graph(driver, graph_name):
    """Drop the named GDS graph projection if it exists."""
    if run(driver, f"CALL gds.graph.exists('{graph_name}') YIELD exists RETURN exists")[0].get("exists"):
        run(driver, f"CALL gds.graph.drop('{graph_name}')")

def clear_model(driver, model_name):
    """Drop the named GDS model if it exists."""
    if run(driver, f"CALL gds.beta.model.exists('{model_name}') YIELD exists RETURN exists")[0].get("exists"):
        run(driver, f"CALL gds.beta.model.drop('{model_name}')")

In [4]:
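# The auth tuple is (username, password). DATABASE is reused as the username here,
# which only works when the username matches the database name (e.g. both 'neo4j').
driver = GraphDatabase.driver(HOST, auth=(DATABASE, PASSWORD))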

## Labeling Recent News
News tends to be most relevant when it is recent and can lose relevance quickly with the passing of time.
As such, the date the news is published is important to consider for recommendation.

Unfortunately, we do not have exact publish dates in this case. As an approximation, I used the minimum impression
time of each news article. This means that news with a NULL approximate time only showed up in historic clicks by
users and was not included in any impressions inside our sample.

For our Collaborative Filtering, we are only interested in recommending recent content, so we will add a
'RecentNews' label that lets us easily filter the graph in Cypher queries and native projections. Remember that Neo4j
allows a node to have multiple labels, so the original 'News' label will still be retained.

In [5]:
run(driver, textwrap.dedent("""\
    MATCH(n:News) WHERE n.approxTime IS NOT NULL
    SET n:RecentNews
    RETURN count(n)
    """)
)

[<Record count(n)=22771>]
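
To make the approximation above concrete, here is a minimal pure-Python sketch of the same idea on made-up data. The `impressions` list, the `all_news` list, and the `approximate_times` helper are hypothetical and only illustrate the logic; they are not how `approxTime` was actually loaded into the graph.

```python
from datetime import datetime

# Hypothetical sample data: (newsId, impression time) pairs.
impressions = [
    ("N1", datetime(2019, 11, 12, 8, 0)),
    ("N1", datetime(2019, 11, 11, 9, 30)),
    ("N2", datetime(2019, 11, 14, 6, 30)),
]
# N3 appears only in historic clicks, so it has no impression at all.
all_news = ["N1", "N2", "N3"]

def approximate_times(all_news, impressions):
    """Map each newsId to its earliest impression time, or None if never shown."""
    approx = {news_id: None for news_id in all_news}
    for news_id, seen_at in impressions:
        current = approx.get(news_id)
        if current is None or seen_at < current:
            approx[news_id] = seen_at
    return approx

approx_time = approximate_times(all_news, impressions)

# News with a non-null approximate time would receive the 'RecentNews' label.
recent_news = sorted(n for n, t in approx_time.items() if t is not None)
print(recent_news)
```

News without any impression keeps a `None` time here, which mirrors the NULL `approxTime` values excluded by the `WHERE` clause in the query above.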

## Basic Recommendation with Cypher
From here we could try using Cypher alone to accomplish Collaborative Filtering. For example, we can run a three-hop
query to find potential recommendations for a given user based on the news clicked by other users who clicked some of
the same news.

In [6]:
result = run(driver, textwrap.dedent("""\
    MATCH (u1:User {userId: "U91836"})
           -[r1:CLICKED]->(n1:RecentNews)
           <-[r2:CLICKED]-(u2:User)
           -[r3:CLICKED]->(n2:RecentNews)
    RETURN count(DISTINCT n1) AS clickedNews,
           count(DISTINCT u2) AS likeUsers,
           count(DISTINCT n2) AS potentialRecommendations
    """)
)

pd.DataFrame([dict(record) for record in result])

Unnamed: 0,clickedNews,likeUsers,potentialRecommendations
0,11,3512,4111
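
The traversal pattern itself is easy to reproduce on a toy example. The sketch below uses a hypothetical `clicks` dictionary in place of the graph and a hypothetical `three_hop_candidates` helper. Note one deliberate difference: unlike the count query above, it removes news the user has already clicked from the candidates.

```python
# Hypothetical toy click data: user -> set of clicked news.
clicks = {
    "U1": {"N1", "N2"},
    "U2": {"N1", "N3"},
    "U3": {"N2", "N3", "N4"},
    "U4": {"N5"},
}

def three_hop_candidates(clicks, user):
    """Follow user -> news -> like users -> other news on the toy data."""
    clicked = clicks[user]
    # Hop 2: other users who clicked at least one of the same news.
    like_users = {u for u, news in clicks.items() if u != user and news & clicked}
    # Hop 3: everything those users clicked, minus what the user already clicked.
    candidates = set()
    for u in like_users:
        candidates |= clicks[u]
    return clicked, like_users, candidates - clicked

clicked, like_users, candidates = three_hop_candidates(clicks, "U1")
print(len(clicked), len(like_users), len(candidates))
```

Even in this tiny example, every like user contributes all of their other clicks, which is why the candidate set grows so quickly on real data.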


## Scaling with FastRP Embeddings and K-Nearest-Neighbor (KNN)

While the above can work well in some cases, and can certainly be a massive improvement over joining SQL tables or cross-walking over document stores, notice that we get a lot of results back (over 4k potential recommendations), and this is just a small sub-sample of the complete Microsoft dataset.

For a production use case where recommendations will need to be queried frequently, this method will have trouble
scaling as the graph grows.  We need some other strategy to help narrow down the results.

One way to do this is with embeddings. Specifically, we can use FastRP to reduce the dimensionality of the problem, then
use an unsupervised ML technique called K-Nearest Neighbor (KNN) to identify news with similar (close) embeddings and
draw recommendation relationships between them. Because the FastRP embeddings are based on the graph
structure, news with similar embeddings should also be closely connected in the graph by being clicked on by the
same or similar users.

### Graph Projection
We will start with a graph projection leveraging just the User and RecentNews nodes. We will include both historic and
recent impression clicks, but we will give less weight to historic clicks so as to favor more recent user activity.

We will use an `UNDIRECTED` orientation so FastRP can traverse the graph bidirectionally.

In [7]:
clear_graph(driver, 'cf-projection')
run(driver, textwrap.dedent("""\
    CALL gds.graph.create(
        'cf-projection',
        ['User', 'RecentNews'],
        {
            CLICKED:{
                orientation:'UNDIRECTED',
                properties: {weight: {property: 'weight', defaultValue: 1.0}}
            },
            HISTORICALLY_CLICKED:{
                orientation:'UNDIRECTED',
                properties: {weight: {property: 'weight', defaultValue: 0.2}}
            }
        }
    ) YIELD nodeCount, relationshipCount, createMillis""")
)

[<Record nodeCount=116828 relationshipCount=1250672 createMillis=239>]
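
As a rough mental model of what this projection configuration expresses, the sketch below builds a weighted, undirected adjacency list from two relationship types, filling in the same default weights (1.0 and 0.2) when a relationship has no `weight` property. The sample relationships and the `project_undirected` helper are hypothetical; this is not how GDS stores the projection internally.

```python
from collections import defaultdict

# Hypothetical relationships: (user, news, weight), where None means the
# 'weight' property is missing on that relationship.
clicked = [("U1", "N1", None), ("U2", "N1", 1.0)]
historically_clicked = [("U1", "N2", None)]

# Same defaults as in the projection above.
DEFAULT_WEIGHTS = {"CLICKED": 1.0, "HISTORICALLY_CLICKED": 0.2}

def project_undirected(rels_by_type):
    """Build node -> [(neighbor, weight)] with every edge stored in both directions."""
    adjacency = defaultdict(list)
    for rel_type, rels in rels_by_type.items():
        for source, target, weight in rels:
            if weight is None:
                weight = DEFAULT_WEIGHTS[rel_type]
            adjacency[source].append((target, weight))
            adjacency[target].append((source, weight))
    return adjacency

adjacency = project_undirected({
    "CLICKED": clicked,
    "HISTORICALLY_CLICKED": historically_clicked,
})
print(dict(adjacency))
```

Storing each edge in both directions is what lets a walk move from a user to a news item and back to another user.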

### FastRP
When running FastRP, we will make sure to include the relationship weight property.

In [8]:
run(driver, textwrap.dedent("""\
    CALL gds.fastRP.mutate(
        'cf-projection',
        {
            mutateProperty: 'embedding',
            embeddingDimension: 196,
            randomSeed: 7474,
            relationshipWeightProperty: 'weight'
        }
    ) YIELD nodePropertiesWritten, computeMillis""")
)

[<Record nodePropertiesWritten=116828 computeMillis=1246>]
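
For intuition about why structurally similar news end up with similar embeddings, here is a heavily simplified sketch in the spirit of FastRP. It assigns each node a very sparse random vector and performs a single weighted propagation step. It is an illustration under stated assumptions, not the GDS implementation: it omits degree normalization and the combination of several iterations, and the toy graph, the dimension of 64, and all helper names are hypothetical.

```python
import math
import random

def sparse_random_vector(rng, dim, sparsity=3):
    """Entries are +sqrt(s) or -sqrt(s) with probability 1/(2s) each, else 0."""
    scale = math.sqrt(sparsity)
    vec = []
    for _ in range(dim):
        r = rng.random()
        if r < 1 / (2 * sparsity):
            vec.append(scale)
        elif r < 1 / sparsity:
            vec.append(-scale)
        else:
            vec.append(0.0)
    return vec

def normalize(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def propagate(adjacency, vectors):
    """Replace each node's vector by the normalized weighted sum of its neighbors' vectors."""
    result = {}
    for node, neighbors in adjacency.items():
        total = [0.0] * len(vectors[node])
        for neighbor, weight in neighbors:
            for i, x in enumerate(vectors[neighbor]):
                total[i] += weight * x
        result[node] = normalize(total)
    return result

# Hypothetical toy graph: (user, news, weight). N1 and N2 share the same users.
edges = [
    ("U1", "N1", 1.0), ("U1", "N2", 1.0),
    ("U2", "N1", 1.0), ("U2", "N2", 1.0),
    ("U3", "N3", 1.0), ("U4", "N3", 0.2),
]
adjacency = {}
for user, news, weight in edges:
    adjacency.setdefault(user, []).append((news, weight))
    adjacency.setdefault(news, []).append((user, weight))

rng = random.Random(7474)
initial = {node: sparse_random_vector(rng, dim=64) for node in adjacency}
embedding = propagate(adjacency, initial)

print(cosine(embedding["N1"], embedding["N2"]))  # same users, so the vectors coincide
print(cosine(embedding["N1"], embedding["N3"]))  # disjoint users, so typically much lower
```

Because N1 and N2 are clicked by exactly the same users with the same weights, they receive the same propagated vector, while N3 is built from unrelated random vectors.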

### K-Nearest-Neighbor (KNN)
We can then run KNN and write similarity (a.k.a. `USERS_ALSO_LIKED`) relationships back to the graph.

In [9]:
result = run(driver, textwrap.dedent("""\
    CALL gds.beta.knn.write('cf-projection', {
        nodeLabels: ['RecentNews'],
        nodeWeightProperty: 'embedding',
        writeRelationshipType: 'USERS_ALSO_LIKED',
        writeProperty: 'score'
    }) YIELD *""")
)
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,createMillis,computeMillis,writeMillis,postProcessingMillis,nodesCompared,relationshipsWritten,similarityDistribution,configuration
0,0,7798,4409,-1,22771,227710,"{'p1': 0.0, 'max': 1.000007152557373, 'p5': 0.0, 'p90': 0.48069334030151367, 'p50': 0.0, 'p95': 0.5845332145690918, 'p10': 0.0, 'p75': 0.3537726402282715, 'p99': 0.7942309379577637, 'p25': 0.0, 'p100': 1.000007152557373, 'min': 0.0, 'mean': 0.1761598638488499, 'stdDev': 0.2251492896885192}","{'topK': 10, 'maxIterations': 100, 'writeConcurrency': 4, 'randomJoins': 10, 'perturbationRate': 0.0, 'sampleRate': 0.5, 'concurrency': 4, 'writeProperty': 'score', 'writeRelationshipType': 'USERS_ALSO_LIKED', 'nodeWeightProperty': 'embedding', 'nodeLabels': ['RecentNews'], 'sudo': False, 'relationshipTypes': ['*'], 'deltaThreshold': 0.001, 'username': None}"
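
Note that the configuration in the output above shows a `topK` of 10, which matches the 227710 relationships written for 22771 compared nodes. To illustrate what "k nearest neighbors" means here, the sketch below does an exhaustive top-k search on a few made-up two-dimensional embeddings, assuming cosine similarity as the metric. The `knn` helper and the embeddings are hypothetical, and the brute-force comparison of every pair is only practical for tiny inputs; GDS uses an approximate, sampling-based search instead.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def knn(embeddings, top_k):
    """Exhaustive k-nearest neighbors: node -> [(neighbor, score)], best first."""
    result = {}
    for node, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != node
        ]
        # Highest similarity first; ties broken by node id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[node] = [(other, score) for score, other in scored[:top_k]]
    return result

# Hypothetical embeddings: N1/N2 point one way, N3/N4 another.
embeddings = {
    "N1": [1.0, 0.0],
    "N2": [0.9, 0.1],
    "N3": [0.0, 1.0],
    "N4": [0.1, 0.9],
}

neighbors = knn(embeddings, top_k=1)
for node, similar in neighbors.items():
    print(node, "->", similar)
```

Each `(neighbor, score)` pair plays the role of one `USERS_ALSO_LIKED` relationship with its `score` property.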


### Collaborative Filtering Query with USERS_ALSO_LIKED Relationships
Now we can structure a similar Collaborative Filtering query, but with
1. more refined results,
2. fewer traversal steps, and
3. a score from KNN that allows us to rank the results by aggregate similarity.

In [10]:
result = run(driver, textwrap.dedent("""\
    MATCH(u:User {userId: "U91836"})-[:CLICKED|HISTORICALLY_CLICKED]->(n:RecentNews)
    WITH collect(id(n)) AS clickedNewsIds

    //get similar News according to KNN and exclude previously clicked news
    MATCH (clickedNews)-[s:USERS_ALSO_LIKED]->(similarNews:News)
        WHERE id(clickedNews) IN clickedNewsIds AND NOT id(similarNews) IN clickedNewsIds

    //aggregate and return ranked results
    RETURN DISTINCT similarNews.newsId as newsId, similarNews.title AS title, similarNews.approxTime AS time,
        sum(s.score) AS totalScore ORDER BY totalScore DESC""")
)
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,newsId,title,time,totalScore
0,N54655,Peter Luger's says its steaks are still 'the best you can eat' after zero-star review from the New York Times,2019-11-14T06:30:58.000000000+00:00,1.446628
1,N2678,'The bridge has definitely been burned': Williams says Redskins have smeared him in aftermath of cancer diagnosis,2019-11-09T00:00:19.000000000+00:00,1.427643
2,N12174,McCandless Police Looking To Identify 3 Women Credited With Giving Life-Saving CPR At North Park,2019-11-12T16:33:27.000000000+00:00,1.286185
3,N50215,Icy conditions lead to multiple crashes on Pittsburgh bridges and throughout Allegheny County,2019-11-12T14:23:18.000000000+00:00,1.286185
4,N50135,Nike will look into runner Mary Cain's allegations of abuse,2019-11-09T00:00:19.000000000+00:00,1.179345
...,...,...,...,...
129,N5454,Veteran JPSO deputy arrested for payroll fraud in overtime scheme,2019-11-14T19:26:21.000000000+00:00,0.189312
130,N3182,Brightline will be called Virgin Trains USA. So when will you see the new name on trains?,2019-11-12T08:18:05.000000000+00:00,0.185018
131,N56479,Washington Judge Rules Value Village Misled Shoppers,2019-11-09T00:03:28.000000000+00:00,0.179169
132,N32753,eBay Find: Actual 'I Am Legend' 2007 Mustang Shelby GT500 Movie Car,2019-11-09T00:24:15.000000000+00:00,0.177132


And of course one can also add filters for score thresholds, like so:

In [11]:
result = run(driver, textwrap.dedent("""\
    MATCH(u:User {userId: "U91836"})-[:CLICKED|HISTORICALLY_CLICKED]->(n:RecentNews)
    WITH collect(id(n)) AS clickedNewsIds

    //get similar News according to KNN and exclude previously clicked news
    MATCH (clickedNews)-[s:USERS_ALSO_LIKED]->(similarNews:News)
        WHERE id(clickedNews) IN clickedNewsIds AND NOT id(similarNews) IN clickedNewsIds

    //aggregate and return ranked results
    WITH DISTINCT similarNews.newsId as newsId, similarNews.title AS title, similarNews.approxTime AS time,
        sum(s.score) AS totalScore
    WHERE totalScore >= $threshold RETURN * ORDER BY totalScore DESC"""),
    params={'threshold': 0.5}
)
pd.DataFrame([dict(record) for record in result])

Unnamed: 0,newsId,time,title,totalScore
0,N54655,2019-11-14T06:30:58.000000000+00:00,Peter Luger's says its steaks are still 'the best you can eat' after zero-star review from the New York Times,1.446628
1,N2678,2019-11-09T00:00:19.000000000+00:00,'The bridge has definitely been burned': Williams says Redskins have smeared him in aftermath of cancer diagnosis,1.427643
2,N12174,2019-11-12T16:33:27.000000000+00:00,McCandless Police Looking To Identify 3 Women Credited With Giving Life-Saving CPR At North Park,1.286185
3,N50215,2019-11-12T14:23:18.000000000+00:00,Icy conditions lead to multiple crashes on Pittsburgh bridges and throughout Allegheny County,1.286185
4,N50135,2019-11-09T00:00:19.000000000+00:00,Nike will look into runner Mary Cain's allegations of abuse,1.179345
5,N50451,2019-11-10T21:33:23.000000000+00:00,"Police: Man killed in crash, fire in Pleasant Ridge",0.857061
6,N64938,2019-11-12T13:07:06.000000000+00:00,"Operation Hallowed Streets Checks On 5,000 Sex Predators",0.792616
7,N58153,2019-11-09T15:14:37.000000000+00:00,Standing-room only tickets will go on sale day of MLS Cup,0.787982
8,N8085,2019-11-09T16:59:14.000000000+00:00,Muskegon passes Cedar Springs district test in a rout; East Grand Rapids up next,0.758322
9,N37889,2019-11-09T09:20:40.000000000+00:00,Florida man demands deputies take down his mug shot. They replaced it with his booking photo.,0.756572
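
The scoring logic of the two queries above can be summarized in a few lines of plain Python: sum the KNN scores of every similar news item across the user's clicked news, skip items the user already clicked, rank by the total, and optionally apply a threshold. The `users_also_liked` data and the `recommend` helper below are hypothetical stand-ins for the relationships in the graph.

```python
from collections import defaultdict

# Hypothetical KNN output: news -> list of (similar news, score).
users_also_liked = {
    "N1": [("N3", 0.8), ("N4", 0.3)],
    "N2": [("N3", 0.6), ("N1", 0.9)],
}
# News the user has clicked (recently or historically).
clicked = {"N1", "N2"}

def recommend(clicked, users_also_liked, threshold=0.0):
    """Rank unseen news by the sum of similarity scores from clicked news."""
    totals = defaultdict(float)
    for news in clicked:
        for similar, score in users_also_liked.get(news, []):
            if similar not in clicked:  # exclude previously clicked news
                totals[similar] += score
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(news, total) for news, total in ranked if total >= threshold]

all_results = recommend(clicked, users_also_liked)
filtered = recommend(clicked, users_also_liked, threshold=0.5)
print(all_results)
print(filtered)
```

News reachable from several clicked items accumulates a higher total, which is why a `totalScore` can exceed 1 even though individual similarity scores do not.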


This is just another example of how, with GDS, we can leverage powerful graph analytics in only a few simple steps
to scale a real-world use case.

And always remember to clean up and close your driver connections! :)

In [12]:
run(driver,'MATCH (n:RecentNews) REMOVE n:RecentNews')
run(driver,'MATCH ()-[r:USERS_ALSO_LIKED]->() DELETE r')
run(driver,'CALL gds.graph.drop("cf-projection")')
driver.close()