# Jaccard Similarity
----

In this notebook we will explore the Jaccard vertex similarity metrics available in cuGraph. cuGraph supports:
- Jaccard Similarity (also called the Jaccard Index)
- Weighted Jaccard

Similarity can be computed between neighboring vertices (the default) or between second-hop neighbors.

Notebook Credits

    Original Authors: Brad Rees
    Created:   10/14/2019
    Last Edit: 06/22/2022

RAPIDS Versions: 22.08

Test Hardware
* Tesla V100 32G, CUDA 11.5


## Introduction - Common Neighbor Similarity 

One of the most common types of vertex similarity evaluates the neighborhoods of a vertex pair and looks at the number of common neighbors.  That type of similarity comes from statistics and is based on set comparison.  Both Jaccard and the Overlap Coefficient operate on sets, and in a graph setting, those sets are the lists of neighboring vertices. <br>
For those that like math:  The neighbors of a vertex, _v_, are defined as the set, _U_, of vertices connected by way of an edge to vertex v, or _N(v) = {U} where v ∈ V and ∀ u ∈ U ∃ edge(v,u)∈ E_.

For the rest of this introduction, set __A__ will equate to _A = N(i)_ and set __B__ will equate to _B = N(j)_.  That just makes the rest of the text more readable.
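
To make the neighbor-set definition concrete, here is a minimal pure-Python sketch that builds _N(v)_ for every vertex from an undirected edge list. The helper name `neighbor_sets` and the tiny edge list are hypothetical and only for illustration; they are not part of cuGraph and are not the karate club data.

```python
from collections import defaultdict


def neighbor_sets(edges):
    """Return a dict mapping each vertex v to its neighbor set N(v).

    `edges` is an iterable of (u, v) pairs and is treated as undirected.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return dict(neighbors)


# A small hypothetical graph: a triangle 0-1-2 with a tail 2-3
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
N = neighbor_sets(edges)

A = N[0]  # A = N(i) with i = 0
B = N[1]  # B = N(j) with j = 1
print("A =", A, "B =", B)
```

With the sets A and B in hand, every similarity measure in this notebook reduces to set operations on them.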

### Jaccard Similarity

The Jaccard similarity between two sets is defined as the ratio of the size of their intersection to the size of their union. 

The Jaccard Similarity can then be defined as

$js(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$
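
As a quick sanity check of the definition, the sketch below computes the Jaccard similarity of two plain Python sets, once directly from the union and once through the inclusion-exclusion identity |A ∪ B| = |A| + |B| - |A ∩ B|. The function name `jaccard` is a hypothetical helper, and returning 0.0 for two empty sets is an assumed convention, not something defined by this notebook.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union_size = len(a | b)
    if union_size == 0:
        # Assumed convention: two empty sets get a score of 0.0
        return 0.0
    return len(a & b) / union_size


A = {1, 2, 3, 4}
B = {3, 4, 5}

score = jaccard(A, B)

# Same value via inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
common = len(A & B)
alt = common / (len(A) + len(B) - common)

print(score, alt)
```

Both expressions give the same value because the denominator counts each shared element exactly once.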


To compute the Jaccard similarity between all pairs of vertices connected by an edge in cuGraph use: <br>
__df = cugraph.jaccard(G)__

    G: A cugraph.Graph object

Returns:

    df: cudf.DataFrame with three named columns:
        df["source"]: The source vertex id.
        df["destination"]: The destination vertex id.
        df["jaccard_coeff"]: The jaccard coefficient computed between the source and destination vertex.
<br>

__References__ 
- https://research.nvidia.com/publication/2017-11_Parallel-Jaccard-and


### Weighted Jaccard

Weighted Jaccard is similar to the Jaccard Similarity but takes into account weights placed on the vertices.  

Given:
The neighbors of a vertex, v, are defined as the set, U, of vertices connected by way of an edge to vertex v, or N(v) = {U} where v ∈V and ∀ u∈U ∃ edge(v,u)∈E.
and
wt(i) is the weight on vertex i

we can now define the weight summing function as<br>
$WT(U) = \sum_{v \in U} {wt(v)}$

$WtJaccard(i, j) = \frac{WT(N(i) \cap N(j))}{WT(N(i) \cup N(j))}$
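
The two formulas above can be sketched in a few lines of pure Python. This is only an illustration of the math under stated assumptions: `weight_sum` and `weighted_jaccard` are hypothetical helper names, the weights are made up, and returning 0.0 for a zero denominator is an assumed convention. It is not how cuGraph implements the computation.

```python
def weight_sum(vertices, wt):
    """WT(U): sum of the vertex weights wt[v] over all v in U."""
    return sum(wt[v] for v in vertices)


def weighted_jaccard(n_i, n_j, wt):
    """WT(N(i) ∩ N(j)) / WT(N(i) ∪ N(j)) for neighbor sets n_i and n_j."""
    n_i, n_j = set(n_i), set(n_j)
    denom = weight_sum(n_i | n_j, wt)
    if denom == 0:
        # Assumed convention for an empty or zero-weight union
        return 0.0
    return weight_sum(n_i & n_j, wt) / denom


# Hypothetical vertex weights and neighbor sets
wt = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
N_i = {1, 2}
N_j = {2, 3}

score = weighted_jaccard(N_i, N_j, wt)
print(score)

# With every weight equal to 1 the result reduces to the plain Jaccard score
unit = {v: 1 for v in wt}
print(weighted_jaccard(N_i, N_j, unit))
```

Here the shared neighbor is vertex 2 (weight 0.125) and the union {1, 2, 3} has total weight 0.5, so a heavily weighted shared neighbor raises the score more than a lightly weighted one.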

To compute the weighted Jaccard similarity between each pair of vertices connected by an edge in cuGraph use:<br>

__df = cugraph.jaccard_w(input_graph, vect_weights_ptr)__

    input_graph: A cugraph.Graph object
    vect_weights_ptr: An array of vertex weights

Returns: 

    df: cudf.DataFrame with three named columns:
        df['source']: The source vertex id.
        df['destination']: The destination vertex id.
        df['jaccard_coeff']: The weighted jaccard coefficient computed between the source and destination vertex.
 

__Note:__ For this example we will be using the PageRank scores as the vertex weights.  Please review the PageRank notebook if you have any questions about running PageRank.


### Additional Reading
- [Wikipedia: Jaccard](https://en.wikipedia.org/wiki/Jaccard_index)


### Some notes about vertex IDs...
* The current version of cuGraph requires that vertex IDs be representable as 32-bit integers, meaning graphs currently can contain at most 2^32 unique vertex IDs. However, this limitation is being actively addressed and a version of cuGraph that accommodates more than 2^32 vertices will be available in the near future.
* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller. If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.
  * To skip auto-renumbering, set the `renumber` boolean arg to `False` when calling the appropriate graph creation API (e.g. `G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).
  * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`
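
The idea behind renumbering can be illustrated with a short pure-Python sketch: external vertex IDs are mapped to a contiguous range starting at 0, and the mapping is kept so that results can be translated back. The function `renumber` below is a hypothetical helper written for this illustration; cuGraph's own renumbering is implemented differently and should be used through the graph creation API.

```python
def renumber(edges):
    """Map arbitrary vertex IDs to contiguous integers starting at 0.

    Returns the renumbered edge list and a list `to_external` where
    to_external[internal_id] is the original vertex ID.
    """
    to_external = sorted({v for edge in edges for v in edge})
    to_internal = {ext: i for i, ext in enumerate(to_external)}
    internal_edges = [(to_internal[u], to_internal[v]) for u, v in edges]
    return internal_edges, to_external


# Hypothetical edge list with non-contiguous IDs
edges = [(10, 20), (20, 35), (35, 10)]

internal_edges, to_external = renumber(edges)
print(internal_edges)

# Convert back to the original IDs, as is done when returning results
restored = [(to_external[u], to_external[v]) for u, v in internal_edges]
print(restored)
```

If the IDs are already contiguous and zero-based, the mapping is the identity, which is why the step can be skipped in that case.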


## Test Data
We will be using the Zachary Karate club dataset 
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*


![Karate Club](../img/zachary_black_lines.png)

This is a small graph which allows for easy visual inspection to validate results.  

---
# Let's get started!

In [1]:
# Import needed libraries
import cugraph
import cudf
from collections import OrderedDict

----
### Define some Print functions
(the `del` statements are not needed since going out of scope should free memory; they are just good practice)

In [2]:
# define a function for printing the top most similar vertices
def print_most_similar_jaccard(df):
    
    jmax = df['jaccard_coeff'].max()
    dm = df.query('jaccard_coeff >= @jmax')    
    
    #find the best
    for i in range(len(dm)):    
        print("Vertices " + str(dm['source'].iloc[i]) + " and " + 
              str(dm['destination'].iloc[i]) + " are most similar with score: " 
              + str(dm['jaccard_coeff'].iloc[i]))
    del jmax
    del dm

In [3]:
# define a function for printing jaccard similar vertices based on a threshold
def print_jaccard_threshold(_d, limit):
    
    filtered = _d.query('jaccard_coeff > @limit')
    
    for i in range(len(filtered)):
        print("Vertices " + str(filtered['source'].iloc[i]) + " and " + 
            str(filtered['destination'].iloc[i]) + " are similar with score: " + 
            str(filtered['jaccard_coeff'].iloc[i]))

### Create an Edgelist

In [4]:
from cugraph.experimental.datasets import karate
gdf = karate.get_edgelist()

In [5]:
# Let's look at the DataFrame. There should be two columns and 156 records
gdf.shape

(156, 2)

In [6]:
# Look at the first few data records - the output should be two columns: 'src' and 'dst'
gdf.head()

Unnamed: 0,src,dst
0,1,2
1,1,3
2,1,4
3,1,5
4,1,6


### Create a Graph

In [7]:
# create a Graph 
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

In [8]:
# How many vertices are in the graph?  Remember that Graph is zero based
G.number_of_vertices()

34

_The test graph has 34 vertices, and the Graph reports 34, even though the dataset's vertex IDs are 1-based._

As mentioned above, cuGraph vertex numbering is zero-based, meaning that the first internal vertex ID is zero, while the test dataset is 1-based.  Because the graph was created with the default auto-renumbering, the IDs are mapped to a contiguous zero-based range internally and no extra vertex is introduced.  
Without renumbering, the Graph object would add an extra isolated vertex with an ID of zero and report 35 vertices.  In that case we could update the value of each element, for example _gdf['src'] = gdf['src'] - 1_ (and likewise for 'dst'),    
or simply state that vertex 0 is not part of the dataset and can be ignored.

--- 
# Jaccard 

In [9]:
#%%time
# Call cugraph.jaccard
jdf = cugraph.jaccard(G)

In [10]:
# Which two vertices are the most similar?
print_most_similar_jaccard(jdf)

Vertices 34 and 33 are most similar with score: 0.5263158
Vertices 33 and 34 are most similar with score: 0.5263158


The most similar pair should be 33 and 34.
Vertex 33 has 12 neighbors, vertex 34 has 17 neighbors.  They share 10 neighbors in common:
$jaccard = 10 / (10 + (12 -10) + (17-10)) = 10 / 19 = 0.526$
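
The arithmetic above can be checked with a couple of lines of Python. The helper `jaccard_from_counts` is a hypothetical name used only here; it applies the inclusion-exclusion form of the Jaccard formula to the neighbor counts quoted in the text.

```python
def jaccard_from_counts(common, deg_a, deg_b):
    """Jaccard score from the number of shared neighbors and the two degrees."""
    return common / (deg_a + deg_b - common)


# Counts quoted above: vertex 33 has 12 neighbors, vertex 34 has 17, 10 are shared
score = jaccard_from_counts(common=10, deg_a=12, deg_b=17)
print(round(score, 7))
```

Rounded to seven decimal places this is 0.5263158, which agrees with the score printed for vertices 33 and 34.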

In [11]:
### let's look at all similarities over a threshold
print_jaccard_threshold(jdf, 0.4)

Vertices 4 and 8 are similar with score: 0.42857143
Vertices 8 and 4 are similar with score: 0.42857143
Vertices 34 and 33 are similar with score: 0.5263158
Vertices 33 and 34 are similar with score: 0.5263158


In [12]:
# Since it is a small graph we can print all scores, notice that only vertices that are neighbors are being compared
#
# Before printing, let's get rid of the duplicates (x compared to y is the same as y compared to x).  We will do that
# by performing a query.  Then let's sort the data by score

jdf_s = jdf.query('source < destination').sort_values(by='jaccard_coeff', ascending=False)

print_jaccard_threshold(jdf_s, 0.0)

Vertices 33 and 34 are similar with score: 0.5263158
Vertices 4 and 8 are similar with score: 0.42857143
Vertices 1 and 2 are similar with score: 0.3888889
Vertices 4 and 14 are similar with score: 0.375
Vertices 2 and 4 are similar with score: 0.36363637
Vertices 3 and 4 are similar with score: 0.33333334
Vertices 6 and 7 are similar with score: 0.33333334
Vertices 2 and 8 are similar with score: 0.3
Vertices 1 and 4 are similar with score: 0.29411766
Vertices 9 and 31 are similar with score: 0.2857143
Vertices 24 and 30 are similar with score: 0.2857143
Vertices 3 and 8 are similar with score: 0.27272728
Vertices 2 and 14 are similar with score: 0.27272728
Vertices 2 and 3 are similar with score: 0.26666668
Vertices 3 and 14 are similar with score: 0.25
Vertices 1 and 3 are similar with score: 0.23809524
Vertices 9 and 33 are similar with score: 0.21428572
Vertices 7 and 17 are similar with score: 0.2
Vertices 5 and 11 are similar with score: 0.2
Vertices 25 and 26 are similar with s

---
# Expanding vertex pair similarity scoring to 2-hop vertex pairs
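
A two-hop pair can be thought of as two vertices that share at least one common neighbor. The following is a minimal pure-Python sketch of that idea on a small hypothetical graph. `two_hop_pairs` and `jaccard` are illustrative helpers, not cuGraph functions, and details such as pair ordering or whether directly connected pairs are reported may differ from what `G.get_two_hop_neighbors()` returns.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty neighbor sets."""
    return len(a & b) / len(a | b)


def two_hop_pairs(edges):
    """Return sorted (a, b) pairs, a < b, that share at least one neighbor."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    pairs = set()
    for nbrs in neighbors.values():
        # Any two neighbors of the same vertex are two hops apart
        pairs.update(combinations(sorted(nbrs), 2))
    return sorted(pairs), dict(neighbors)


# Hypothetical graph: hub 0 with leaves 1 and 2, plus a path 0-3-4
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
pairs, N = two_hop_pairs(edges)

scores = {(a, b): jaccard(N[a], N[b]) for a, b in pairs}
for pair, s in scores.items():
    print(pair, s)
```

Leaves 1 and 2 are not connected to each other but have exactly the same neighbor set, so their score is 1.0. In general, a score of 1.0 for a pair means the two vertices have identical neighbor sets.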

In [13]:
# get all two-hop vertex pairs
p = G.get_two_hop_neighbors()

In [14]:
# Let's look at the Jaccard score
j2 = cugraph.jaccard(G, vertex_pair=p)

In [15]:
print_most_similar_jaccard(j2)

Vertices 15 and 16 are most similar with score: 1.0
Vertices 15 and 19 are most similar with score: 1.0
Vertices 15 and 21 are most similar with score: 1.0
Vertices 15 and 23 are most similar with score: 1.0
Vertices 21 and 15 are most similar with score: 1.0
Vertices 21 and 16 are most similar with score: 1.0
Vertices 21 and 19 are most similar with score: 1.0
Vertices 21 and 23 are most similar with score: 1.0
Vertices 18 and 22 are most similar with score: 1.0
Vertices 23 and 15 are most similar with score: 1.0
Vertices 23 and 16 are most similar with score: 1.0
Vertices 23 and 19 are most similar with score: 1.0
Vertices 23 and 21 are most similar with score: 1.0
Vertices 19 and 15 are most similar with score: 1.0
Vertices 19 and 16 are most similar with score: 1.0
Vertices 19 and 21 are most similar with score: 1.0
Vertices 19 and 23 are most similar with score: 1.0
Vertices 16 and 15 are most similar with score: 1.0
Vertices 16 and 19 are most similar with score: 1.0
Vertices 16 

---

# Weighted Jaccard

For the vertex weights, we are going to use the PageRank scores.  If you are unfamiliar with PageRank, please see the notebook on PageRank.

In [16]:
# Call PageRank on the graph to get weights to use:
pr_df = cugraph.pagerank(G)

In [17]:
# take a peek at the PageRank values
pr_df.head()

Unnamed: 0,pagerank,vertex
0,0.021979,5
1,0.021979,11
2,0.019605,20
3,0.021076,25
4,0.021006,26


### Now compute the Weighted Jaccard 

In [18]:
pr_df.rename(columns={'pagerank': 'weight'}, inplace=True)
# Call weighted Jaccard using the PageRank scores as weights:
wdf = cugraph.jaccard_w(G, pr_df)

In [19]:
print_most_similar_jaccard(wdf)

Vertices 1 and 2 are most similar with score: 0.5324633
Vertices 2 and 1 are most similar with score: 0.5324633


---
### It's that easy with cuGraph

Copyright (c) 2019-2022, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___