## Weighted Jaccard Similarity
Weighted Jaccard similarity computes the Jaccard similarity between two vertices while taking into account weights assigned to the vertices of the graph.

To compute the weighted Jaccard similarity between each pair of vertices connected by an edge in cuGraph, use:
**nvJaccard_w(input_graph, vect_weights_ptr)**
* input_graph: A cugraph.Graph object
* vect_weights_ptr: An array of vertex weights

Returns: df: cudf.DataFrame with three columns:
* df['source']: The source vertex id.
* df['destination']: The destination vertex id.
* df['jaccard_coeff']: The weighted Jaccard coefficient computed between the source and destination vertices.
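To make the idea concrete, here is a minimal pure-Python sketch. It assumes the weighted variant replaces the set sizes in the ordinary Jaccard formula with summed vertex weights, that is, the total weight of the shared neighbors divided by the total weight of all neighbors of either vertex. The helper `weighted_jaccard`, the toy adjacency lists, and the weights are hypothetical and only illustrate the computation; they are not cuGraph's implementation.

```python
def weighted_jaccard(adjacency, weights, u, v):
    """Weighted Jaccard of u and v under the assumption stated above:
    weight of the common neighbors / weight of the union of neighbors."""
    neighbors_u = adjacency[u]
    neighbors_v = adjacency[v]
    union = neighbors_u | neighbors_v
    if not union:
        # Neither vertex has neighbors, so define the score as 0.
        return 0.0
    common = neighbors_u & neighbors_v
    return sum(weights[x] for x in common) / sum(weights[x] for x in union)


# Hypothetical toy graph: vertex -> set of neighboring vertices.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# Hypothetical vertex weights (the notebook uses PageRank scores instead).
weights = {0: 1.0, 1: 1.0, 2: 2.0, 3: 4.0}

weighted = weighted_jaccard(adjacency, weights, 0, 1)

# With every weight equal to 1 the score reduces to the ordinary Jaccard value.
uniform = {vertex: 1.0 for vertex in adjacency}
unweighted = weighted_jaccard(adjacency, uniform, 0, 1)

print(weighted, unweighted)
```

Vertices 0 and 1 share only vertex 2 as a neighbor. Because vertex 2 carries a larger weight than the other neighbors in this toy example, the weighted score is higher than the unweighted one.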

In [1]:
# Import needed libraries
import cugraph
import cudf

import os.path
from collections import OrderedDict

## Read the data using cuDF

In [2]:
# Test file - using the classic Karate club dataset.
datafile='../data/networks/karate-data.csv'

In [3]:
# Make sure that the dataset is available
if not os.path.isfile(datafile):
    print("Data File NOT found")

In [4]:
# Read the data file
cols = ["src", "dst"]

dtypes = OrderedDict([
        ("src", "int32"), 
        ("dst", "int32")
        ])

gdf = cudf.read_csv(datafile, names=cols, delimiter='\t', dtype=list(dtypes.values()))

In [5]:
# Let's look at the DataFrame. There should be two columns and 154 records
gdf

<cudf.DataFrame ncols=2 nrows=154 >

In [6]:
# Look at the first few data records - the output should be two columns, src and dst
gdf.head().to_pandas()

   src  dst
0    1    2
1    1    3
2    1    4
3    1    5
4    1    6


In [7]:
# Create a Graph
G1 = cugraph.Graph()
G1.add_edge_list(gdf["src"], gdf["dst"])

In [8]:
# How many vertices are in the graph?
G1.num_vertices()

35

In [9]:
# Call PageRank on the graph to get the vertex weights to use:
pr_df = cugraph.pagerank(G1)

In [14]:
# Call weighted Jaccard using the PageRank scores as weights:
df = cugraph.nvJaccard_w(G1, pr_df['pagerank'])

In [15]:
# Find the most similar pair of vertices
bestEdge = 0
for i in range(len(df)):
    if df['jaccard_coeff'][i] > df['jaccard_coeff'][bestEdge]:
        bestEdge = i
print("Vertices " + str(df['source'][bestEdge]) + " and " + str(df['destination'][bestEdge]) + " are most similar with score: " + str(df['jaccard_coeff'][bestEdge]))

Vertices 1 and 2 are most similar with score: 1.0
