# TigerGraph Data Science Library 101 - Similarity Algorithms
This notebook shows examples of using the most common similarity algorithms in the TigerGraph Graph Data Science Library. More detailed explanations of these algorithms can be found in the official documentation (https://docs.tigergraph.com/graph-ml/current/similarity-algorithms/).


## Step 1: Setting things up
- Connect and Load data
- Visualize the graph schema 
- Get basic stats, e.g., counts of nodes & edges

### Create connection

In [1]:
import json
import pandas as pd
from pyTigerGraph import TigerGraphConnection

# Read in DB configs
with open('../config.json', "r") as config_file:
    config = json.load(config_file)

conn = TigerGraphConnection(
    host=config["host"],
    username=config["username"],
    password=config["password"],
)
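
The cells in this notebook read several keys from `config.json`. As a rough sketch, the helper below checks that a config dictionary contains the keys used here (`host`, `username`, `password`, `getToken`, `job_id`). The helper name `missing_config_keys` and all values in `example_config` are placeholders introduced for illustration, not part of pyTigerGraph.

```python
import json

# Keys that cells in this notebook read from the config (inferred from the code here).
REQUIRED_KEYS = ("host", "username", "password", "getToken", "job_id")


def missing_config_keys(config):
    """Return the required keys that are absent from a config dict."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values for illustration only; replace them with your own settings.
example_config = json.loads("""
{
    "host": "http://localhost",
    "username": "tigergraph",
    "password": "<your-password>",
    "getToken": false,
    "job_id": "demo"
}
""")

print(missing_config_keys(example_config))
```

Running a check like this before creating the connection makes a missing key show up as a readable list rather than a `KeyError` several cells later.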

### Download movie dataset

In [2]:
from pyTigerGraph.datasets import Datasets
dataset_movie = Datasets("movie")

Downloading:   0%|          | 0/2623 [00:00<?, ?it/s]

### Ingest data

In [3]:
conn.ingestDataset(dataset_movie, getToken=config["getToken"])

---- Checking database ----
A graph with name movie already exists in the database. Skip ingestion.


### Visualize schema

In [4]:
from pyTigerGraph.visualization import drawSchema
drawSchema(conn.getSchema(force=True))

CytoscapeWidget(cytoscape_layout={'name': 'circle', 'animate': True, 'padding': 1}, cytoscape_style=[{'selecto…

### Print graph stats

In [5]:
vertices = conn.getVertexTypes()
total_count = 0
for vertex in vertices:
    vertex_cnt = conn.getVertexCount(vertex)
    total_count += vertex_cnt
    print("Node count: ({} : {}) ".format(vertex, vertex_cnt))
print("Total node count: ", total_count)

Node count: (Person : 7) 
Node count: (Movie : 9) 
Total node count:  16


In [6]:
import pprint
edge_count = conn.getEdgeCount()
print("Edges count: total ", sum(edge_count.values()))
pprint.pprint(edge_count) 

Edges count: total  30
{'Likes': 15, 'Similarity': 0, 'reverse_Likes': 15}


## Step 2: Leveraging pyTigerGraph’s featurizer to run Similarity algorithms

pyTigerGraph provides a full suite of data science capabilities. In this tutorial, we show how to use the featurizer to list all Similarity algorithms available in our GDS library and to run a few popular ones as examples.

In [7]:
feat = conn.gds.featurizer()

In [8]:
feat.listAlgorithms("Similarity")

Available algorithms for Similarity:
  cosine:
    single_source:
      01. name: tg_cosine_nbor_ss
  jaccard:
    all_pairs:
      02. name: tg_jaccard_nbor_ap_batch
    single_source:
      03. name: tg_jaccard_nbor_ss
Call runAlgorithm() with the algorithm name to execute it


## tg_cosine_nbor_ss
This algorithm calculates the similarity between a given vertex and every other vertex in the graph using cosine similarity (https://docs.tigergraph.com/graph-ml/current/similarity-algorithms/cosine-similarity-of-neighborhoods-single-source).

## Input Parameters

* VERTEX source: Source vertex, given as {"id": "vertex_id", "type": "vertex_type"}
* SET<STRING> e_type_set: Edge types to traverse
* SET<STRING> reverse_e_type_set: Reverse edge types to traverse
* STRING weight_attribute: The edge attribute to use as the weight of the edge
* INT top_k: The number of most similar vertices to return
* INT print_limit: The maximum number of vertices to print
* BOOL print_results: Whether to output the final results to the console in JSON format
* STRING file_path: If provided, the algorithm will save the output in CSV format to this file
* STRING similarity_edge: If provided, the similarity score will be saved to this edge type

In [15]:
params = {
    "source": {"id": "Alex", "type": "Person"},
    "e_type_set": ["Likes"],
    "reverse_e_type_set": ["reverse_Likes"],
    "weight_attribute": "weight",
    "top_k": 5,
    "print_limit": 5,
    "print_results": True,
    "file_path": "",
    "similarity_edge": "Similarity"
}

In [16]:
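import csv
import os
import time
import psutil

# Install and load memory_profiler, which provides the %memit magic used below
!pip install memory_profiler
%load_ext memory_profiler

# CSV file that collects one row of performance metrics per algorithm run
algo_performance_out = '/home/tigergraph/GraphML/output/algorithm_' + config["job_id"] + '.csv'

start_time = time.time()

# Run the algorithm once (-r 1) and return the memory measurement as an object (-o)
algo_memory = %memit -r 1 -o feat.runAlgorithm("tg_cosine_nbor_ss", params=params)

# Stop the clock right after the run so that the parsing below is not timed
execution_time = time.time() - start_time

# str(algo_memory) looks like "peak memory: 135.39 MiB, increment: 0.43 MiB";
# keep only the peak value, i.e. the text between the first ": " and the first "M"
algo_memory = str(algo_memory)
start = algo_memory.find(": ") + 1
end = algo_memory.find("M")
algo_memory = algo_memory[start:end].strip()

# System-wide CPU utilization, sampled over a 4-second interval
cpu_usage = psutil.cpu_percent(4)

print('The CPU usage is: ', cpu_usage)

# print('RAM memory % used:', psutil.virtual_memory()[2])

# Index 3 of virtual_memory() is the `used` field in bytes; convert it to GB
host_memory = psutil.virtual_memory()[3]/1000000000

print('RAM Used (GB):', host_memory)

print ('tg_cosine_nbor_ss executed successfully')

print ('execution time: ' + str(execution_time) + ' seconds\n')

algo_id = "tg_cosine_nbor_ss_" + config["job_id"]

nb_id = "similarity.ipynb_" + config["job_id"]

keyword = "tg_cosine_nbor_ss"

data = [algo_id, "false" ,cpu_usage, algo_memory, execution_time, host_memory, "3.8", "no error", nb_id, keyword]

# Append one row of metrics to the performance log
with open(algo_performance_out, mode='a+', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(data)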

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler
peak memory: 135.39 MiB, increment: 0.43 MiB
The CPU usage is:  34.8
RAM Used (GB): 11.187646464
tg_cosine_nbor_ss executed successfully
execution time: 0.36521410942077637 seconds



## Results

The output contains top_k vertices, unless fewer than top_k vertices are available to compare with the source. If similarity scores are tied, the algorithm may arbitrarily choose to output one vertex over another.

Using the Movie graph, one way to calculate the similarity between two people is to compare how they rated the movies they have both rated. Starting from one person’s name, this algorithm calculates the cosine similarity between the given person and every other person in the graph, as long as there is at least one movie they have both rated.
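
To make the idea concrete, here is a minimal sketch of the textbook cosine similarity between two people's rating vectors, followed by a top-k ranking that skips people with no shared movie. It is a plain-Python illustration, not the GSQL query itself; the installed query may differ in details such as normalization. The function names, person names `A`, `B`, `C`, and all ratings are made up for this example.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two sparse rating vectors (dict: movie -> weight)."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0  # no movie rated by both people
    dot = sum(ratings_a[movie] * ratings_b[movie] for movie in shared)
    norm_a = math.sqrt(sum(w * w for w in ratings_a.values()))
    norm_b = math.sqrt(sum(w * w for w in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid dividing by zero for all-zero ratings
    return dot / (norm_a * norm_b)


def top_k_similar(source, ratings, k):
    """Rank every other person by similarity to `source`, dropping zero scores."""
    scores = []
    for person, person_ratings in ratings.items():
        if person == source:
            continue
        score = cosine_similarity(ratings[source], person_ratings)
        if score > 0:
            scores.append((person, score))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


# Hypothetical ratings, not taken from the movie dataset.
ratings = {
    "A": {"m1": 5, "m2": 3},
    "B": {"m1": 4, "m3": 2},
    "C": {"m3": 1},
}

print(top_k_similar("A", ratings, k=5))
```

Even though `k` is 5, only `B` is returned for source `A`, because `C` shares no rated movie with `A`.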

Given the source vertex "Alex" and top_k set to 5, the algorithm calculates the cosine similarity between Alex and two other persons, Jing and Kevin; the example graph does not have enough data to return 5 Person vertices. The output shows the most similar vertices and their similarity scores in descending order.

In [17]:
results = feat.runAlgorithm("tg_cosine_nbor_ss", params=params)

In [18]:
df_cosine_nbor_ss = pd.json_normalize(results, record_path =['neighbours'])

# display(df_cosine_nbor_ss.columns)

display(df_cosine_nbor_ss.sort_values(by='attributes.neighbours.@sum_similarity', ascending=False))

|   | v_id | v_type | attributes.neighbours.@sum_similarity |
|---|---|---|---|
| 0 | Jing | Person | 0.42173 |
| 1 | Kevin | Person | 0.14248 |


## tg_jaccard_nbor_ss
The Jaccard index measures the relative overlap between two sets. To compare two vertices by Jaccard similarity, first select a set of attribute values for each vertex (https://docs.tigergraph.com/graph-ml/current/similarity-algorithms/jaccard-similarity-of-neighborhoods-single-source).
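
The relative overlap described above is commonly written as the following formula. The symbols $A$ and $B$ are introduced here for illustration and stand for the two sets being compared; in this notebook they would be the sets of Movie vertices liked by two Person vertices.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

The score ranges from 0 (no element in common) to 1 (identical sets).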

## Input Parameters

* VERTEX source: Source vertex {"id": "vertex_id", "type": "vertex_type"}
* STRING e_type: Edge type to traverse
* STRING reverse_e_type: Reverse edge type to traverse
* INT top_k: The number of vertices to return
* BOOL print_results: Whether to output the final results to the console in JSON format
* STRING similarity_edge_type: If provided, the similarity score will be saved to this edge
* STRING file_path: If provided, the algorithm will save the output in CSV format to this file

In [19]:
params = {
    "source": {"id": "Neil", "type": "Person"},
    "e_type": "Likes",
    "reverse_e_type": "reverse_Likes",
    "top_k": 5,
    "print_results": True,
    "similarity_edge_type": "Similarity",
    "file_path": "",
}

In [20]:
start_time = time.time()

algo_memory = %memit -r 1 -o feat.runAlgorithm("tg_jaccard_nbor_ss", params=params)

algo_memory = str(algo_memory)

start = algo_memory.find(": ") + 1
end = algo_memory.find("M")

algo_memory = algo_memory[start:end].strip()

execution_time = time.time() - start_time

cpu_usage = psutil.cpu_percent(4)

print('The CPU usage is: ', cpu_usage)

# print('RAM memory % used:', psutil.virtual_memory()[2])

host_memory = psutil.virtual_memory()[3]/1000000000

print('RAM Used (GB):', host_memory)

print ('tg_jaccard_nbor_ss executed successfully')

print ('execution time: ' + str(execution_time) + ' seconds\n')

algo_id = "tg_jaccard_nbor_ss_" + config["job_id"]

nb_id = "similarity.ipynb_" + config["job_id"]

keyword = "tg_jaccard_nbor_ss"

data = [algo_id, "false" ,cpu_usage, algo_memory, execution_time, host_memory, "3.8", "no error", nb_id, keyword]

with open(algo_performance_out, mode='a+', encoding='utf-8') as f:
    writer = csv.writer(f) 
    writer.writerow(data)

Installing and optimizing the queries, it might take a minute...
Queries installed successfully
peak memory: 135.49 MiB, increment: 0.08 MiB
The CPU usage is:  37.5
RAM Used (GB): 11.238305792
tg_jaccard_nbor_ss executed successfully
execution time: 36.686015605926514 seconds



## Results

This example uses the Movie graph, which consists of Person and Movie vertices. Likes edges are weighted according to how much the person liked the movie. Each person in the dataset liked at least one movie, but not all movies were liked by all people.

When comparing similarity to Neil, Kat is ranked higher than Kevin. This makes intuitive sense, because Kat likes two movies, both of which are also liked by Neil. Kevin also likes two movies that Neil likes. However, Kevin likes a third movie that Neil doesn’t like, and is therefore less similar to Neil than Kat is.

Although we set top_k to 5, only three vertices were returned because neither Alex nor Elena likes any movies that Neil likes.

If the source vertex (Person) doesn’t have any common neighbors (Movie) with any other vertex (Person), such as Elena in our example, the result is an empty list.
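
The ranking described above can be sketched in plain Python. This is only an illustration of the logic, not the GSQL query. The movie IDs `m1` to `m8` are hypothetical placeholders chosen so that the set sizes and overlaps match the description (for example, Kat likes two movies, both also liked by Neil); they are not the titles in the dataset.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def jaccard_single_source(source, likes, top_k):
    """Rank other people by Jaccard similarity to `source`, dropping zero scores."""
    scores = []
    for person, movies in likes.items():
        if person == source:
            continue
        score = jaccard(likes[source], movies)
        if score > 0:  # people with no common movie are not returned
            scores.append((person, score))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Hypothetical movie IDs; only the overlaps with Neil are meant to mirror the text.
likes = {
    "Neil": {"m1", "m2", "m3", "m4"},
    "Kat": {"m1", "m2"},
    "Kevin": {"m1", "m3", "m5"},
    "Jing": {"m4", "m6"},
    "Alex": {"m6", "m7"},
    "Elena": {"m8"},
}

print(jaccard_single_source("Neil", likes, top_k=5))
print(jaccard_single_source("Elena", likes, top_k=5))
```

With these placeholder sets, Kat scores 2/4, Kevin 2/5, and Jing 1/5 against Neil, while a source with no common neighbors yields an empty list.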

In [23]:
results = feat.runAlgorithm("tg_jaccard_nbor_ss", params=params)

In [24]:
df_jaccard_nbor_ss = pd.json_normalize(results, record_path =['Others'])

display(df_jaccard_nbor_ss.sort_values(by='attributes.Others.@sum_similarity', ascending=False))

|   | v_id | v_type | attributes.Others.@sum_similarity |
|---|---|---|---|
| 1 | Kat | Person | 0.5 |
| 2 | Kevin | Person | 0.4 |
| 0 | Jing | Person | 0.2 |


## tg_jaccard_nbor_ap_batch
This algorithm computes the same similarity scores as the single-source Jaccard similarity of neighborhoods algorithm. Instead of selecting a single source vertex, however, it calculates similarity scores for all vertex pairs in the graph in parallel. Because this is a memory-intensive operation, the computation is split into batches to reduce peak memory usage, and the user can specify the number of batches. (https://docs.tigergraph.com/graph-ml/current/similarity-algorithms/jaccard-similarity-of-neighborhoods-batch)
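
The batching idea can be sketched as follows. This is a simplified, sequential Python illustration under stated assumptions: it splits only the source vertices into batches (mirroring `src_batch_num`), does not model `nbor_batch_num` or parallel execution, and uses hypothetical movie IDs. The function names are invented for this example.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def batches(items, num_batches):
    """Split a list into at most `num_batches` contiguous chunks."""
    size = max(1, -(-len(items) // num_batches))  # ceiling division
    for start in range(0, len(items), size):
        yield items[start:start + size]


def all_pairs_top_k(likes, top_k, src_batch_num):
    """Top-k Jaccard neighbors for every person, processed one source batch at a time."""
    people = sorted(likes)
    results = {}
    for batch in batches(people, src_batch_num):
        # Candidate scores are built for one batch of sources at a time.
        for source in batch:
            scored = [
                (other, jaccard(likes[source], likes[other]))
                for other in people
                if other != source
            ]
            scored = [(other, score) for other, score in scored if score > 0]
            scored.sort(key=lambda item: (-item[1], item[0]))
            results[source] = scored[:top_k]
    return results


# Hypothetical movie IDs, not the titles in the dataset.
likes = {
    "Neil": {"m1", "m2", "m3", "m4"},
    "Kat": {"m1", "m2"},
    "Kevin": {"m1", "m3", "m5"},
    "Jing": {"m4", "m6"},
    "Elena": {"m8"},
}

print(all_pairs_top_k(likes, top_k=10, src_batch_num=50)["Neil"])
```

The number of batches changes how much intermediate data is alive at once, not the scores: calling the function with `src_batch_num=1` or `src_batch_num=50` gives the same result.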

## Input Parameters

* INT top_k: The number of most similar vertices to return for each source vertex
* SET<STRING> v_type_set: Vertex types for which to calculate similarity scores
* SET<STRING> feat_v_type: Feature vertex types
* SET<STRING> e_type_set: Edge types to traverse
* SET<STRING> reverse_e_type_set: Reverse edge types to traverse
* STRING similarity_edge: If provided, the similarity score will be saved to this edge type
* INT src_batch_num: The number of batches to split the source vertices into
* INT nbor_batch_num: The number of batches to split the 2-hop neighbor vertices into
* BOOL print_results: Whether to output the final results to the console in JSON format
* INT print_limit: The number of source vertices to print, -1 to print all
* STRING file_path: If provided, the algorithm will save the output in CSV format to this file

In [25]:
params = {
    "top_k": 10,
    "v_type_set": ["Person"],
    "feat_v_type": ["Movie"],
    "e_type_set": ["Likes"],
    "reverse_e_type_set": ["reverse_Likes"],
    "similarity_edge": "Similarity",
    "src_batch_num": 50,
    "nbor_batch_num": 10,
    "print_results": True,
    "print_limit": 50,
    "file_path": ""
}

In [26]:
start_time = time.time()

algo_memory = %memit -r 1 -o feat.runAlgorithm("tg_jaccard_nbor_ap_batch", params=params)

algo_memory = str(algo_memory)

start = algo_memory.find(": ") + 1
end = algo_memory.find("M")

algo_memory = algo_memory[start:end].strip()

execution_time = time.time() - start_time

cpu_usage = psutil.cpu_percent(4)

print('The CPU usage is: ', cpu_usage)

# print('RAM memory % used:', psutil.virtual_memory()[2])

host_memory = psutil.virtual_memory()[3]/1000000000

print('RAM Used (GB):', host_memory)

print ('tg_jaccard_nbor_ap_batch executed successfully')

print ('execution time: ' + str(execution_time) + ' seconds\n')

algo_id = "tg_jaccard_nbor_ap_batch_" + config["job_id"]

nb_id = "similarity.ipynb_" + config["job_id"]

keyword = "tg_jaccard_nbor_ap_batch"

data = [algo_id, "false" ,cpu_usage, algo_memory, execution_time, host_memory, "3.8", "no error", nb_id, keyword]

with open(algo_performance_out, mode='a+', encoding='utf-8') as f:
    writer = csv.writer(f) 
    writer.writerow(data)

Installing and optimizing the queries, it might take a minute...
Queries installed successfully
peak memory: 135.75 MiB, increment: 0.49 MiB
The CPU usage is:  30.7
RAM Used (GB): 11.342086144
tg_jaccard_nbor_ap_batch executed successfully
execution time: 41.26648306846619 seconds



## Results

The result contains the top k Jaccard similarity scores for each vertex and its corresponding pair. A pair is only included if its similarity is greater than 0, meaning there is at least one common neighbor between the pair.

In [27]:
results = feat.runAlgorithm("tg_jaccard_nbor_ap_batch", params=params)

In [28]:
df_jaccard_nbor_ap_batch = pd.json_normalize(results, record_path =['print_batch'])

df_jaccard_nbor_ap_batch.columns = ['v_id', 'v_type', 'sim_heap']

df_jaccard_nbor_ap_batch = df_jaccard_nbor_ap_batch.reset_index()

for index, row in df_jaccard_nbor_ap_batch.iterrows():
    print(row['v_id'], row['v_type'])
    for p in row['sim_heap']:
        print(p)

Elena Person
Neil Person
{'ver': 'Kat', 'val': 0.5}
{'ver': 'Kevin', 'val': 0.4}
{'ver': 'Jing', 'val': 0.2}
Jing Person
{'ver': 'Alex', 'val': 0.25}
{'ver': 'Neil', 'val': 0.2}
Kat Person
{'ver': 'Neil', 'val': 0.5}
{'ver': 'Kevin', 'val': 0.25}
Alex Person
{'ver': 'Jing', 'val': 0.25}
{'ver': 'Kevin', 'val': 0.2}
Kevin Person
{'ver': 'Neil', 'val': 0.4}
{'ver': 'Kat', 'val': 0.25}
{'ver': 'Alex', 'val': 0.2}
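
For further analysis it can be convenient to turn the nested output into one row per pair. The sketch below assumes each record is a dict with the keys `v_id` and `sim_heap`, matching the column names assigned above; the helper name `flatten_sim_heaps` is introduced here for illustration. The sample records copy a few of the values printed above.

```python
def flatten_sim_heaps(records):
    """Turn per-vertex similarity heaps into (source, other, score) rows, best first."""
    rows = []
    for record in records:
        for entry in record["sim_heap"]:
            rows.append((record["v_id"], entry["ver"], entry["val"]))
    # sorted() is stable, so tied scores keep their original order.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# A subset of the output printed above, in the assumed record shape.
sample_records = [
    {"v_id": "Elena", "v_type": "Person", "sim_heap": []},
    {"v_id": "Neil", "v_type": "Person", "sim_heap": [
        {"ver": "Kat", "val": 0.5},
        {"ver": "Kevin", "val": 0.4},
        {"ver": "Jing", "val": 0.2},
    ]},
    {"v_id": "Kat", "v_type": "Person", "sim_heap": [
        {"ver": "Neil", "val": 0.5},
        {"ver": "Kevin", "val": 0.25},
    ]},
]

for row in flatten_sim_heaps(sample_records):
    print(row)
```

In the notebook, the same helper could be fed with `df_jaccard_nbor_ap_batch.to_dict("records")`. Each pair appears once per direction (for example, Neil to Kat and Kat to Neil), and vertices with an empty heap, such as Elena, contribute no rows.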
