# Jaccard Similarity
----

In this notebook we will explore the Jaccard vertex similarity metrics available in cuGraph.

cuGraph supports Jaccard similarity for both unweighted and weighted graphs, but this notebook 
will demonstrate Jaccard similarity only on unweighted graphs. A future update will include an 
example using a graph with edge weights, where the weights are used to influence the Jaccard 
similarity coefficients.

## Introduction

The Jaccard similarity between two sets is defined as the size of their intersection 
divided by the size of their union. For graphs, the sets compared are the sets of neighboring 
vertices of each vertex in a pair.

The neighbors of a vertex _v_ are defined as the set of vertices connected to _v_ by an edge, or _N(v) = {u ∈ V : edge(v,u) ∈ E}_.

If we then let set __A__ be the set of neighbors for vertex _a_, and set __B__ be the set of neighbors for vertex _b_, then the Jaccard Similarity for the vertex pair _(a, b)_ can be expressed as

$\text{Jaccard similarity} = \frac{|A \cap B|}{|A \cup B|}$
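
As a quick sanity check of the formula, here is a minimal sketch in plain Python using built-in sets. The neighbor sets and the `jaccard` helper are hypothetical and only illustrate the definition; this is not how cuGraph computes the coefficient on the GPU.

```python
# Toy neighbor sets (hypothetical vertex ids, not taken from the karate graph).
A = {1, 2, 3, 4}   # neighbors of vertex a
B = {3, 4, 5}      # neighbors of vertex b


def jaccard(x, y):
    """Return |x ∩ y| / |x ∪ y|, or 0.0 when both sets are empty."""
    union = x | y
    if not union:
        return 0.0
    return len(x & y) / len(union)


score = jaccard(A, B)
print(score)  # 2 shared neighbors out of 5 distinct ones -> 0.4
```

Identical neighbor sets give a coefficient of 1.0 and disjoint neighbor sets give 0.0, which are the two extremes that show up in the results later in this notebook.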


cuGraph's Jaccard function will, by default, compute the Jaccard similarity coefficient for every pair of 
vertices in the two-hop neighborhood for every vertex.

```df = cugraph.jaccard(G, vertex_pair=None)```

Parameters:

    G: A cugraph.Graph object

    vertex_pair: cudf.DataFrame, optional (default=None)
        A GPU dataframe consisting of two columns representing pairs of
        vertices. If provided, the jaccard coefficient is computed for the
        given vertex pairs.  If the vertex_pair is not provided then the
        current implementation computes the jaccard coefficient for all
        pairs of vertices in the two-hop neighborhood of each vertex.

Returns:

    df: cudf.DataFrame with three columns:
        df["first"]: The first vertex id of each pair.
        df["second"]: The second vertex id of each pair.
        df["jaccard_coeff"]: The jaccard coefficient computed between the vertex pairs.

To limit the computation to specific vertex pairs, including those not in the same two-hop 
neighborhood, pass a `vertex_pair` value (see example below).
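
To make the "two-hop neighborhood" idea concrete, the sketch below enumerates vertex pairs that share at least one neighbor in a tiny, hypothetical graph. The `adjacency` mapping and the `two_hop_pairs` helper are assumptions made for illustration; they are a conceptual picture only and do not reflect cuGraph's internal implementation or its exact output.

```python
from itertools import combinations

# Hypothetical undirected toy graph stored as an adjacency mapping.
adjacency = {
    0: {1, 2},
    1: {0, 3},
    2: {0},
    3: {1},
}


def two_hop_pairs(adj):
    """Return unordered vertex pairs that share at least one neighbor."""
    pairs = set()
    for neighbors in adj.values():
        # Any two distinct neighbors of the same vertex are two hops apart.
        for u, v in combinations(sorted(neighbors), 2):
            pairs.add((u, v))
    return pairs


pairs = two_hop_pairs(adjacency)
print(sorted(pairs))  # [(0, 3), (1, 2)]
```

In this toy graph, vertices 2 and 3 share no neighbor, so the pair (2, 3) is not produced. Its Jaccard coefficient would be 0.0, and it is the kind of pair that has to be requested explicitly.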

__References__ 
- https://research.nvidia.com/publication/2017-11_Parallel-Jaccard-and 

__Additional Reading__ 
- [Intro to Graph Analysis using cuGraph: Similarity Algorithms](https://medium.com/rapids-ai/intro-to-graph-analysis-using-cugraph-similarity-algorithms-64fa923791ac)
- [Wikipedia: Jaccard](https://en.wikipedia.org/wiki/Jaccard_index)


## Test Data
We will be using the Zachary Karate club dataset.
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*

<img src="../../img/karate_similarity.png" width="50%"/>

This is a small graph which allows for easy visual inspection to validate results.

---
# Let's get started!

In [1]:
# Import needed libraries
import cugraph
import cudf

# The cugraph.datasets package contains several common graph datasets useful
# for testing and demonstrations.
from cugraph.datasets import karate

### Create the Graph object

In [2]:
# Create a cugraph.Graph object from the karate dataset. Download the karate
# dataset if not already present on disk.
G = karate.get_graph(download=True)

### Run `jaccard`

In [3]:
# Compute Jaccard coefficients for all pairs of vertices that are part of the
# two-hop neighborhood for each vertex.
jaccard_coeffs = cugraph.jaccard(G)

### Analyze the results

In [4]:
# Remove redundancies (remove (b, a) if (a, b) is present) and pairs consisting
# of the same vertices (a, a) from the results, then sort from most similar to
# least.
jaccard_coeffs = jaccard_coeffs.query("first < second")
jaccard_coeffs = jaccard_coeffs.sort_values("jaccard_coeff", ascending=False)
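
The effect of the `first < second` filter can be sketched in plain Python on a handful of made-up result rows. The `rows` list below is hypothetical data, not output from cuGraph; it only shows why a single comparison removes both mirrored pairs and self-pairs.

```python
# Hypothetical result rows as (first, second, jaccard_coeff) tuples.
rows = [
    (0, 1, 0.5),
    (1, 0, 0.5),   # mirror of (0, 1)
    (1, 1, 1.0),   # self-pair
    (1, 2, 0.25),
    (2, 1, 0.25),  # mirror of (1, 2)
]

# Keep one orientation of each pair and drop self-pairs.
unique = [row for row in rows if row[0] < row[1]]

# Sort from most similar to least similar.
unique.sort(key=lambda row: row[2], reverse=True)
print(unique)  # [(0, 1, 0.5), (1, 2, 0.25)]
```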

In [5]:
# Show the 20 most similar vertex pairs.
jaccard_coeffs.head(20)

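|     | first | second | jaccard_coeff |
|-----|-------|--------|---------------|
| 541 | 14    | 15     | 1.0           |
| 542 | 14    | 18     | 1.0           |
| 543 | 14    | 20     | 1.0           |
| 544 | 14    | 22     | 1.0           |
| 561 | 15    | 18     | 1.0           |
| 562 | 15    | 20     | 1.0           |
| 563 | 15    | 22     | 1.0           |
| 587 | 17    | 21     | 1.0           |
| 605 | 18    | 20     | 1.0           |
| 606 | 18    | 22     | 1.0           |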


We can see that several pairs have a coefficient of 1.0, meaning they have
the same set of neighbors. This can be easily verified in the plot above.

If we want to see the similarity of a pair of vertices that are not part of 
the same two-hop neighborhood, we have to specify them in a `cudf.DataFrame` 
to pass to the `jaccard` call.

In [6]:
cugraph.jaccard(G, cudf.DataFrame([(16, 33)]))

|     | first | second | jaccard_coeff |
|-----|-------|--------|---------------|
| 0   | 16    | 33     | 0.0           |


As expected, the coefficient is 0.0 because vertices 16 and 33 do not share any
neighbors.

We can use the `cudf.DataFrame` argument to pass in any number of specific vertex pairs, 
regardless of whether they are included by default. This limits the computation and the 
result size when only specific vertex similarities are needed.

In [7]:
pairs = cudf.DataFrame([(16, 33), (32, 33), (0, 23)])
cugraph.jaccard(G, pairs)

|     | first | second | jaccard_coeff |
|-----|-------|--------|---------------|
| 0   | 16    | 33     | 0.0           |
| 1   | 32    | 33     | 0.526316      |
| 2   | 0     | 23     | 0.0           |


---
### It's that easy with cuGraph

Copyright (c) 2019-2024, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___

#### Revision History

| Author        | Date       | Update           | cuGraph Version | Test Hardware             |
| --------------|------------|------------------|-----------------|---------------------------|
| Brad Rees     | 10/14/2019 | created          | 0.14            | GV100 32 GB, CUDA 10.2    |
| Don Acosta    | 07/20/2022 | tested/updated   | 22.08 nightly   | DGX Tesla V100, CUDA 11.5 |
| Ralph Liu     | 06/29/2023 | updated          | 23.08 nightly   | DGX Tesla V100, CUDA 12.0 |
| Rick Ratzel   | 02/23/2024 | tested/updated   | 24.04 nightly   | DGX Tesla V100, CUDA 12.0 |