# ResearchOps Community Toolbox Census + Graph
## Demo III - Graph Analysis & Algorithms
#### Author: Pete Tunkis
#### Date: 2024-10-29

This notebook provides replicable code that interacts with the graph directly (no more dataframes/pandas, just neo4j via the python API).

Please note that for this notebook's code to work, you must have successfully completed the steps from the previous two notebooks in the demo series. The notebook begins with a handful of queries before the demo transitions to neodash; after returning to this notebook, we build the Recommendation System.

The second half of this notebook connects to an AuraDB instance (a graph database hosted in the cloud) to run a couple of queries there, before the demo transitions to Bloom (a.k.a. AuraDB Explore).

**Note**: Neodash and Bloom widgets/visualizations are built (largely) using cypher queries; appropriately-named `*.cql` scripts are available in this repository for reference.

In [1]:
### Environment

import os
from dotenv import load_dotenv
from graphdatascience import GraphDataScience
import pandas as pd
import numpy as np

# Initialization
load_dotenv()

### Parameters
URI = os.getenv('NEO4J_URI')
USER = os.getenv('NEO4J_USER')
PASS = os.getenv('NEO4J_PASS')
AUTH = (USER, PASS)


Notice that this notebook uses a different module than the previous notebook to connect to the graph database--this is to show that both accomplish the same goals (running queries/pushing commands to the graph database, returning results accordingly). However, you need the `graphdatascience` module to run any graph-native algorithms.

In [2]:
### Initialize/connect to database
gds = GraphDataScience(URI, auth = AUTH)

### A Few Diagnostic Queries
One of the great things about the `graphdatascience` client is that `run_cypher` automatically returns any (tabular) results from a query as a pandas dataframe! This is a great time-saver--if we imagine ahead to integration and deployment, it would be very easy to rig up a responsive API between your front-end and the graph database to query the graph, return results as a dataframe, easily convert that to json, and send it to the front-end. Alternatively, if you need to integrate the graph with extant relational data on which your applications rely, graph query results could be pushed straight from the dataframe into a view, rows and columns and all!
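As a minimal sketch of that hand-off, the dataframe-to-json step can be a single pandas call. The dataframe below is built by hand as a stand-in for a query result (it mirrors the node counts returned later in this notebook), so the example runs without a database connection; the variable names are illustrative only.

```python
import json

import pandas as pd

# Stand-in for a dataframe returned by a query; built by hand here
# so the example runs without a database connection.
node_counts = pd.DataFrame(
    {"label": ["Respondent", "Tool"], "nodeCount": [205, 461]}
)

# One JSON object per row, a convenient shape to send to a front-end.
payload = node_counts.to_json(orient="records")

# Round-trip to confirm that the payload parses as valid JSON.
records = json.loads(payload)
print(records)
```

The `orient="records"` option is one choice among several; other orientations may suit a front-end that expects column-wise data.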

In [3]:
### Basic/EDA queries

# Explore graph node counts via apoc
gds.run_cypher(
    '''
    CALL apoc.meta.stats()
    YIELD labels AS nodeCounts
    UNWIND keys(nodeCounts) AS label
    WITH label, nodeCounts[label] AS nodeCount
    WHERE label IN ['Respondent', 'Tool']
    RETURN label, nodeCount
    '''
    )

| | label | nodeCount |
|---|---|---|
| 0 | Respondent | 205 |
| 1 | Tool | 461 |


In [4]:
# Explore rel counts via apoc
gds.run_cypher(
    '''
    CALL apoc.meta.stats()
    YIELD relTypesCount AS relationshipCounts
    UNWIND keys(relationshipCounts) AS type
    WITH type, relationshipCounts[type] AS relationshipCount
    WHERE type IN ['USES']
    RETURN type, relationshipCount
    '''
    )

| | type | relationshipCount |
|---|---|---|
| 0 | USES | 6807 |


In [5]:
# Count of tools in use
gds.run_cypher(
    '''
    MATCH p = (t:Tool)<-[u:USES]-()
    WITH t.tool AS tool_name, COUNT(p) AS tool_count
    RETURN tool_name, tool_count ORDER BY tool_name    
    '''
    )

| | tool_name | tool_count |
|---|---|---|
| 0 | 11FS | 1 |
| 1 | ACPA/WCAG Figma | 1 |
| 2 | Abstract | 4 |
| 3 | Accessibility Insights | 1 |
| 4 | Acoustic | 1 |
| ... | ... | ... |
| 456 | quantilope | 1 |
| 457 | slite | 16 |
| 458 | tremendous | 1 |
| 459 | typeform | 1 |


In [6]:
# Count of tools by use-case
gds.run_cypher(
    '''
    MATCH p = (t:Tool)<-[u:USES]-()
    WITH t.tool AS tool_name, u.use_case AS use_case, COUNT(p) AS num_use_cases
    RETURN use_case, tool_name, num_use_cases ORDER BY use_case, tool_name ASC
    '''
    )

| | use_case | tool_name | num_use_cases |
|---|---|---|---|
| 0 | A/B testing | Adobe | 1 |
| 1 | A/B testing | Adobe Target | 2 |
| 2 | A/B testing | Airtable | 1 |
| 3 | A/B testing | Alchemer | 1 |
| 4 | A/B testing | Amplitude | 3 |
| ... | ... | ... | ... |
| 1753 | Video editing | UserZoom GO | 1 |
| 1754 | Video editing | Windows Video Editor | 1 |
| 1755 | Video editing | Zoom | 3 |
| 1756 | Video editing | dscout | 1 |


Instead of getting bogged down in a jupyter notebook, scrolling through miles of tables and charts, we can consolidate down to a dashboard--neodash is a neat product that lets you build a dashboard on top of a neo4j graph database.

### Building the Recommendation System
Building a recommendation system using neo4j essentially consists of two steps:
1. Project specific node(s) and relationship(s) subject to any analytics to an in-memory 'sub-graph'
2. Run algorithm on in-memory graph, then generate new nodes/relationships/properties
  - Write them back to your 'main' graph, or
  - Write them to your projection and rinse/repeat step two

However, there may be some prerequisites, depending on *what* you are doing to *which* node(s) and/or relationship(s). In order to build the Recommendation System for the ReOps Toolbox, we need to run the K-NN algorithm over respondent nodes, comparing them on their properties. However, graph projections limit how those properties may be expressed (data types!): we cannot use strings as they are; we need integers, floats, or *lists of integers/floats*. The latter, of course, imply some form of descriptive vector--since our variables of interest are all strings, we can one-hot encode these and create new properties in-graph on the respondent nodes. K-NN will then take all OHE properties together and calculate a similarity score for each node-pair, returning the top-10 most similar respondents for each individual. (A note on the metric: GDS's K-NN defaults to Jaccard similarity for integer lists and to cosine similarity for float lists; since the OHE properties here are written with `toFloatList`, the default applied is cosine unless a metric is set explicitly.)
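To make the encoding step concrete, here is a minimal pure-Python sketch of one-hot encoding. It is not the GDS implementation; `one_hot_encode` is a hypothetical helper, and the three industry values are an illustrative subset rather than the full list in the graph.

```python
def one_hot_encode(categories, selected):
    """Return a float list with 1.0 wherever a category is in `selected`."""
    chosen = set(selected)
    return [1.0 if category in chosen else 0.0 for category in categories]


# Illustrative subset of category values, kept in a fixed (sorted) order.
industries = ["Design", "Education", "Tech"]

# One selected value, several selected values, and none at all.
single = one_hot_encode(industries, ["Tech"])
multiple = one_hot_encode(industries, ["Design", "Tech"])
empty = one_hot_encode(industries, [])

print(single, multiple, empty)
```

Keeping the category list in a fixed order matters: every position in the vector must refer to the same category for every respondent, otherwise the vectors are not comparable.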

#### Why not bring in the nodes rather than re-specifying them as properties on the Respondent?
This is in part another limitation of graph projections (many algorithms don't work with more than two node labels or relationship types, and many only work with one of each). However, it's also a product of what we are trying to do: we want explicit respondent-to-respondent relationships to be created if there is sufficient similarity between the two, and we need to pack all the info about respondents into a "small package" for our projection.

I do two things for each demographic element:
1. Create a one-hot encoding for each element
2. Write back the string element (or list, if more than one) 

I do the second bit mainly for the purpose of simplifying some demo queries.

In [7]:
### Start by one-hot encoding the connected descriptor nodes and collecting them as (list) properties
# Business
gds.run_cypher(
    '''
    MATCH (bus:Business)
    WITH bus
        ORDER BY bus.business
    WITH collect(bus) AS businesses
    MATCH (resp:Respondent)
    SET resp.OHE_business = toFloatList(gds.alpha.ml.oneHotEncoding(businesses, [(resp)-[:IS_BUSINESS_TYPE]->(bus) | bus]))
    '''
    )
gds.run_cypher(
    '''
    MATCH (resp:Respondent)-[]-(bus:Business)
    WITH resp, collect(bus.business) AS businesses
    SET resp.business = businesses
    '''
    )

# Company
gds.run_cypher(
    '''
    MATCH (com:Company)
    WITH com
        ORDER BY com.company
    WITH collect(com) AS companies
    MATCH (resp:Respondent)
    SET resp.OHE_company = toFloatList(gds.alpha.ml.oneHotEncoding(companies, [(resp)-[:IS_COMPANY_TYPE]->(com) | com]))
    '''
    )
gds.run_cypher(
    '''
    MATCH (resp:Respondent)-[]-(com:Company)
    WITH resp, collect(com.company) AS companies
    SET resp.company = companies
    '''
    )

# Industry 
gds.run_cypher(
    '''
    MATCH (ind:Industry)
    WITH ind
        ORDER BY ind.industry
    WITH collect(ind) AS industries
    MATCH (resp:Respondent)
    SET resp.OHE_industry = toFloatList(gds.alpha.ml.oneHotEncoding(industries, [(resp)-[:IN_INDUSTRY]->(ind) | ind]))
    '''
    )
gds.run_cypher(
    '''
    MATCH (resp:Respondent)-[]-(ind:Industry)
    WITH resp, collect(ind.industry) AS industries
    SET resp.industry = industries
    '''
    )

# Locations (researchers, participants)
gds.run_cypher(
    '''
    MATCH (loc:Location)
    WITH loc
        ORDER BY loc.location
    WITH collect(loc) AS locations
    MATCH (resp:Respondent)
    SET resp.OHE_researchers = toFloatList(gds.alpha.ml.oneHotEncoding(locations, [(resp)-[:HAS_RESEARCHERS_IN]->(loc) | loc]))
    SET resp.OHE_participants = toFloatList(gds.alpha.ml.oneHotEncoding(locations, [(resp)-[:HAS_PARTICIPANTS_IN]->(loc) | loc]))
    '''
    )
gds.run_cypher(
    '''
    MATCH (resp:Respondent)-[:HAS_RESEARCHERS_IN]->(loc:Location)
    WITH resp, collect(loc.location) AS locations
    SET resp.researchers = locations
    '''
    )
gds.run_cypher(
    '''
    MATCH (resp:Respondent)-[:HAS_PARTICIPANTS_IN]->(loc:Location)
    WITH resp, collect(loc.location) AS locations
    SET resp.participants = locations
    '''
    )

# Responsibilities
gds.run_cypher(
    '''
    MATCH (r:Responsibility)
    WITH r
        ORDER BY r.responsibility
    WITH collect(r) AS responsibilities
    MATCH (resp:Respondent)
    SET resp.OHE_responsibility = toFloatList(gds.alpha.ml.oneHotEncoding(responsibilities, [(resp)-[:HAS_RESPONSIBILITY]->(r) | r]))
    '''
    )
gds.run_cypher(
    '''
    MATCH (resp:Respondent)-[]-(r:Responsibility)
    WITH resp, collect(r.responsibility) AS responsibilities
    SET resp.responsibility = responsibilities
    '''
    )

# OHE the remaining respondent properties intrinsic to that node
gds.run_cypher(
    '''
    MATCH (resp:Respondent)
    WITH collect(DISTINCT resp.discipline) AS disciplines
        , collect(DISTINCT resp.len_experience) AS experiences
        , collect(DISTINCT resp.maturity) AS maturities
        , collect(DISTINCT resp.num_researchers) AS size
    MATCH (resp:Respondent)
    SET resp.OHE_discipline = toFloatList(gds.alpha.ml.oneHotEncoding(disciplines, [resp.discipline]))
    SET resp.OHE_len_experience = toFloatList(gds.alpha.ml.oneHotEncoding(experiences, [resp.len_experience]))
    SET resp.OHE_maturity = toFloatList(gds.alpha.ml.oneHotEncoding(maturities, [resp.maturity]))
    SET resp.OHE_num_researchers = toFloatList(gds.alpha.ml.oneHotEncoding(size, [resp.num_researchers]))
    '''
    )

### Graph Projection
This is very simple! We just need to define a name for the projection, which node label(s) to project, which relationship type(s) to project (if any), and any node properties that the algorithms will need to reference in the projection.
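The same ten property names are needed twice in this notebook: once for the projection and once for K-NN. One option, sketched here, is to build the list a single time and reuse it. `DESCRIPTORS` and `OHE_PROPERTIES` are hypothetical names introduced for illustration; the property names themselves come from the `OHE_*` properties written above.

```python
# Descriptor names matching the OHE_* properties set on Respondent nodes.
DESCRIPTORS = [
    "business",
    "company",
    "discipline",
    "industry",
    "len_experience",
    "maturity",
    "num_researchers",
    "participants",
    "researchers",
    "responsibility",
]

# Hypothetical shared constant: one list for both projection and K-NN.
OHE_PROPERTIES = [f"OHE_{name}" for name in DESCRIPTORS]

print(OHE_PROPERTIES)
```

The list could then be passed as `nodeProperties` in both places, which avoids the two copies drifting apart if a descriptor is added later.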

In [8]:
### Project the respondent nodes with their OHE properties into an in-memory graph for analytics
G, result = gds.graph.project(
    "respondent-similarity"
    , "Respondent"
    , "*"
    , nodeProperties = ['OHE_business'
                        , 'OHE_company'
                        , 'OHE_discipline'
                        , 'OHE_industry'
                        , 'OHE_len_experience'
                        , 'OHE_maturity'
                        , 'OHE_num_researchers'
                        , 'OHE_participants'
                        , 'OHE_researchers'
                        , 'OHE_responsibility']
    )

In [9]:
# Take a look
print(G)

Graph(name=respondent-similarity, node_count=205, relationship_count=0)


In [10]:
display(result)

nodeProjection            {'Respondent': {'label': 'Respondent', 'proper...
relationshipProjection    {'__ALL__': {'aggregation': 'DEFAULT', 'orient...
graphName                                             respondent-similarity
nodeCount                                                               205
relationshipCount                                                         0
projectMillis                                                           372
Name: 0, dtype: object

In [11]:
# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

Graph 'respondent-similarity' node count: 205
Graph 'respondent-similarity' node labels: ['Respondent']


### The Recommendation System: K-Nearest Neighbours
In their simplest form, Recommendation Systems are built on some similarity metric. I use K-NN as it's intuitive and transparent; the OHE properties compose the data for which pairwise similarities will be calculated (cosine by default for float-list properties in GDS; Jaccard is the default for integer lists), and new `SIMILAR_TO` relationships with a `score` property will be written back to the main graph accordingly.
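The idea can be illustrated with a brute-force sketch in plain Python. This is not how GDS computes K-NN (its implementation and exact scoring may differ); it assumes cosine similarity per property, averaged across properties, and the three respondents and their vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_similarity(props_a, props_b):
    """Average the per-property similarities of two nodes."""
    return sum(cosine(props_a[k], props_b[k]) for k in props_a) / len(props_a)


def top_k(nodes, k):
    """Brute-force top-k neighbours for every node (fine for a toy example)."""
    neighbours = {}
    for name, props in nodes.items():
        scored = [
            (other, mean_similarity(props, other_props))
            for other, other_props in nodes.items()
            if other != name
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        neighbours[name] = scored[:k]
    return neighbours


# Hypothetical respondents with two one-hot encoded properties each.
respondents = {
    "a": {"OHE_industry": [0.0, 0.0, 1.0], "OHE_maturity": [1.0, 0.0]},
    "b": {"OHE_industry": [1.0, 0.0, 1.0], "OHE_maturity": [1.0, 0.0]},
    "c": {"OHE_industry": [0.0, 1.0, 0.0], "OHE_maturity": [0.0, 1.0]},
}

neighbours = top_k(respondents, k=2)
for name, ranked in neighbours.items():
    print(name, [(other, round(score, 3)) for other, score in ranked])
```

Each `(other, score)` pair corresponds conceptually to one `SIMILAR_TO` relationship carrying a `score` property.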

**Nota bene**: neo4j on your local machine is free (a.k.a. the "Community Edition"), but this has limitations, such as a maximum concurrency of 4 CPU cores for parallelizing procedures or algorithms. The free version of AuraDB has similar limitations, including the number of nodes/relationships you can store, but also precludes you from running any algorithms ("Graph Data Science"); for that you must subscribe/pay to run neo4j's AuraDS offering! Similarly, if building a database within another cloud platform (e.g., AWS or Azure) and ruling out Aura, you may need to look into procuring an "Enterprise Edition" license.

In [12]:
### Estimate similarities using K-NN
# Features required for KNN on respondent nodes
featureProperties = ['OHE_business'
                       , 'OHE_company'
                       , 'OHE_discipline'
                       , 'OHE_industry'
                       , 'OHE_len_experience'
                       , 'OHE_maturity'
                       , 'OHE_num_researchers'
                       , 'OHE_participants'
                       , 'OHE_researchers'
                       , 'OHE_responsibility']

# Run KNN - return default K = 10
result = gds.knn.write(G, nodeProperties = featureProperties, writeRelationshipType = 'SIMILAR_TO', writeProperty = 'score', sampleRate = 1.0, maxIterations = 1000)
print(result)

ranIterations                                                             4
didConverge                                                            True
nodePairsConsidered                                                   92917
preProcessingMillis                                                      14
computeMillis                                                           150
writeMillis                                                             354
postProcessingMillis                                                      0
nodesCompared                                                           205
relationshipsWritten                                                   2050
similarityDistribution    {'min': 0.39999961853027344, 'p5': 0.649999618...
configuration             {'writeProperty': 'score', 'writeRelationshipT...
Name: 0, dtype: object


# Query Aura for the next bits
Now that the new similarity relationships have been generated, we can further query the database. For demonstration, I exported a `*.dump` of the graph database and mounted it on an AuraDB instance in the cloud, but all queries were first built and tested on the local instance that's been used up to this point.
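Because `os.getenv` quietly returns `None` for a missing variable, a small guard can surface configuration problems before any connection is attempted. This is a sketch only: `require_env` is a hypothetical helper, and the URI below is a made-up placeholder so the example runs without a real `.env` file.

```python
import os


def require_env(name):
    """Return an environment variable's value, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Placeholder value for demonstration only (not a real instance).
os.environ.setdefault("AURA_URI", "neo4j+s://example.databases.neo4j.io")

aura_uri = require_env("AURA_URI")
print(aura_uri)
```

The same helper could wrap the user and password lookups, so that a typo in a variable name fails immediately rather than at connection time.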

In [13]:
### Parameters
AURA_URI = os.getenv('AURA_URI')
AURA_USER = os.getenv('AURA_USER')
AURA_PASS = os.getenv('AURA_PASS')
AURA_AUTH = (AURA_USER, AURA_PASS)

In [14]:
### Initialize/connect to database
# Use the Aura credentials defined above (not the local URI/AUTH)
aura_db = GraphDataScience(AURA_URI, auth = AURA_AUTH)

In [15]:
# Query: Get tools based on a use_case alone - more like a traditional item-based CF
tool_query = '''
    MATCH (:Respondent)-[u:USES]->(t:Tool)
    WHERE u.use_case = "Research planning"
    RETURN DISTINCT t.tool
'''
tool_search_res = aura_db.run_cypher(tool_query)
display(tool_search_res)

| | t.tool |
|---|---|
| 0 | Asana |
| 1 | Figma |
| 2 | Microsoft Word |
| 3 | Microsoft Excel |
| 4 | Microsoft Powerpoint |
| 5 | Miro |
| 6 | Dovetail |
| 7 | Google Workspace |
| 8 | Confluence |
| 9 | Microsoft Sharepoint |


In [16]:
# Query: Get all the use cases for a given tool, because it's interesting to see how versatile stuff is
use_case_query = '''
    MATCH (:Respondent)-[u:USES]->(t:Tool)
    WHERE t.tool = "Figma"
    RETURN DISTINCT u.use_case
'''
use_case_search_res = aura_db.run_cypher(use_case_query)
display(use_case_search_res)

| | u.use_case |
|---|---|
| 0 | Research planning |
| 1 | Collaborative brainstorming |
| 2 | Diagramming and mapping |
| 3 | Creating wireframes |
| 4 | Mockups |
| 5 | Prototyping |
| 6 | Participatory design |
| 7 | Sharing findings |
| 8 | Competitive analysis |
| 9 | Storytelling & presentations |


In [17]:
# Query: Get tools used for use_case by respondents similar to target description
recco_query = '''
    MATCH (r1:Respondent)-[u1:USES]->(t:Tool)<-[u2:USES]-(r2:Respondent)
        , (r1)-[s:SIMILAR_TO]-(r2)
    WHERE r1 <> r2
        AND 'Tech' IN r1.industry
        AND 'Start-up or small corporation' IN r1.business
        AND ('Canada' IN r1.participants 
             OR 'Canada' IN r1.researchers)
        AND r1.num_researchers = "2-5"
        AND r1.discipline = "User or design research"
        AND (u1.use_case = "Research planning" 
             OR u1.use_case = "Research roadmapping")
        AND u1.use_case = u2.use_case
    RETURN r2.respondent_id
        , s.score AS similarity_pct
        , u1.use_case
        , t.tool
        , r2.num_researchers
        , r2.maturity
        , r2.len_experience
        , r2.discipline
        , r2.business
        , r2.company
        , r2.industry
        , r2.responsibility
        , r2.participants
        , r2.researchers
    ORDER BY u1.use_case ASC
    '''
reccos_res = aura_db.run_cypher(recco_query)
display(reccos_res)

| | r2.respondent_id | similarity_pct | u1.use_case | t.tool | r2.num_researchers | r2.maturity | r2.len_experience | r2.discipline | r2.business | r2.company | r2.industry | r2.responsibility | r2.participants | r2.researchers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cf0c6378a527cf714644e116cc19533e | 0.870711 | Research planning | Google Docs | 2-5 | 5: Complete UX research culture | 1 - 3 years | User or design research | [Start-up or small corporation] | [Business-to-customer (B2C)] | [Education] | [Generalist] | [Canada] | [Canada] |
| 1 | cf0c6378a527cf714644e116cc19533e | 0.870711 | Research planning | Google Docs | 2-5 | 5: Complete UX research culture | 1 - 3 years | User or design research | [Start-up or small corporation] | [Business-to-customer (B2C)] | [Education] | [Generalist] | [Canada] | [Canada] |
| 2 | d8890f137d3866309aa6e6bcf7d71d05 | 0.685355 | Research planning | Google Docs | 2-5 | 5: Complete UX research culture | 7 - 10 years | Research operations | [Government/public sector] | [Business-to-business-to-customer (B2B2C)] | [Education] | [Generalist] | [Canada] | [Canada] |
| 3 | 1cb3453974ee2eab0090b3771b6f7bab | 0.785355 | Research planning | Google Docs | 2-5 | 3: Maturing of UX research into an organizatio... | 1 - 3 years | User or design research | [Start-up or small corporation] | [Business-to-customer (B2C)] | [Education] | [Generalist] | [UK] | [Global] |
| 4 | c54d3902a28087ac49a50a5a87a67753 | 0.707735 | Research planning | Google Docs | 2-5 | 3: Maturing of UX research into an organizatio... | 1 - 3 years | User or design research | [In-house design in a medium or large corporat... | [Business-to-business-to-customer (B2B2C)] | [Tech] | | [Kenya, USA, Canada] | [Costa Rica, Kenya, USA, Canada, Uganda, UK] |
| 5 | e34ce11ccf880df14e6f8a52e75436c9 | 0.670412 | Research planning | Notion | 6-10 | 5: Complete UX research culture | 1 - 3 years | User or design research | [Start-up or small corporation] | [Business-to-business-to-customer (B2B2C)] | [Tech] | | [Sweden, Kenya, Mexico, USA, Canada, Spain] | [Global] |
| 6 | fd97b117861abc546a404c8c1089f027 | 0.785355 | Research planning | Notion | 2-5 | 1: UX Research Awareness - Ad Hoc Research in ... | 7 - 10 years | User or design research | [Start-up or small corporation] | [Business-to-business-to-customer (B2B2C)] | [Design, Tech] | [Generalist] | [North America, Europe] | [North America, Europe] |
| 7 | e34ce11ccf880df14e6f8a52e75436c9 | 0.670412 | Research roadmapping | Notion | 6-10 | 5: Complete UX research culture | 1 - 3 years | User or design research | [Start-up or small corporation] | [Business-to-business-to-customer (B2B2C)] | [Tech] | | [Sweden, Kenya, Mexico, USA, Canada, Spain] | [Global] |
| 8 | fd97b117861abc546a404c8c1089f027 | 0.785355 | Research roadmapping | Notion | 2-5 | 1: UX Research Awareness - Ad Hoc Research in ... | 7 - 10 years | User or design research | [Start-up or small corporation] | [Business-to-business-to-customer (B2B2C)] | [Design, Tech] | [Generalist] | [North America, Europe] | [North America, Europe] |
| 9 | 2f34906a14573b33f64a4db789452bba | 0.789223 | Research roadmapping | Miro | 2-5 | 2: Adoption of UX research into projects | 1 - 3 years | User or design research | [In-house design in a medium or large corporat... | [Business-to-customer (B2C)] | [Design, Tech, Entertainment] | [Data governance and guidelines, Generalist, R... | [USA] | [USA, Canada] |


The last query illustrates the Recommendation System - based on our input parameters, we can examine tool recommendations along with specific use cases alongside the demographics of those other respondents from which recommendations were drawn. This is ostensibly effective, but as it turns out, this is a lot more interesting if we can explore the data visually! Cue neo4j Bloom/Aura Explore...
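Before moving on, note that the tabular result can also be condensed with pandas, for instance to collapse repeated rows and rank tools. The sketch below copies a few rows and four columns from the output above into a hand-built dataframe (with shortened, hypothetical column names); the ranking rule is one possible choice, not part of the original query.

```python
import pandas as pd

# A few rows from the recommendation output (four columns only).
reccos = pd.DataFrame(
    {
        "respondent_id": [
            "cf0c6378a527cf714644e116cc19533e",
            "cf0c6378a527cf714644e116cc19533e",
            "d8890f137d3866309aa6e6bcf7d71d05",
            "e34ce11ccf880df14e6f8a52e75436c9",
            "fd97b117861abc546a404c8c1089f027",
        ],
        "similarity": [0.870711, 0.870711, 0.685355, 0.670412, 0.785355],
        "use_case": ["Research planning"] * 5,
        "tool": ["Google Docs", "Google Docs", "Google Docs", "Notion", "Notion"],
    }
)

# Drop exact repeats, then summarise each tool by how many similar
# respondents use it and how similar those respondents are on average.
summary = (
    reccos.drop_duplicates()
    .groupby(["use_case", "tool"], as_index=False)
    .agg(
        num_respondents=("respondent_id", "nunique"),
        mean_similarity=("similarity", "mean"),
    )
    .sort_values(["num_respondents", "mean_similarity"], ascending=False)
)
print(summary)
```

Deduplicating on a subset of scalar columns, as here, sidesteps the list-valued columns in the full result, which pandas cannot hash for `drop_duplicates`.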