# Jaccard Similarity
----

In this notebook we will explore the Jaccard vertex similarity metrics available in cuGraph.

## Introduction

The Jaccard similarity between two sets is defined as the ratio of the size of their intersection to the size of their union.

For two sets $A$ and $B$, the Jaccard similarity can then be expressed as

$\text{Jaccard similarity} = \frac{|A \cap B|}{|A \cup B|}$
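
As a quick illustration of the formula, here is a minimal pure-Python sketch that runs on the CPU and does not use cuGraph. The function name `jaccard_similarity` and the convention of returning 0.0 for two empty sets are assumptions made for this example only.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption for this sketch: two empty sets get similarity 0.0,
        # which avoids dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# The sets share 2 elements out of 4 distinct elements overall.
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Identical sets give 1.0 and disjoint sets give 0.0, which are the two extremes of the metric.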


To compute the Jaccard similarity between all pairs of vertices that are part of the same two-hop neighborhood (the default when no vertex pairs are specified) in cuGraph use: <br>
__df = cugraph.jaccard(G)__

    G: A cugraph.Graph object

Returns:

    df: cudf.DataFrame with three columns:
        df["first"]: The first vertex id of each pair.
        df["second"]: The second vertex id of each pair.
        df["jaccard_coeff"]: The Jaccard coefficient computed between the vertex pairs.
<br>

__References__ 
- https://research.nvidia.com/publication/2017-11_Parallel-Jaccard-and 

__Additional Reading__ 
- [Wikipedia: Jaccard](https://en.wikipedia.org/wiki/Jaccard_index)


## Test Data
We will be using the Zachary Karate club dataset.
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*

<img src="../../img/karate_similarity.png" width="50%"/>

This is a small graph which allows for easy visual inspection to validate results.

---
# Let's get started!

In [None]:
# Import needed libraries
import cugraph
import cudf

# The cugraph.datasets package contains several common graph datasets useful
# for testing and demonstrations.
from cugraph.datasets import karate

### Create the Graph object

In [None]:
# Create a cugraph.Graph object from the karate dataset. Download the karate
# dataset if not already present on disk.
G = karate.get_graph(download=True)

### Run `jaccard`

In [None]:
# Compute Jaccard coefficients for all pairs of vertices that are part of the
# two-hop neighborhood for each vertex.
jaccard_coeffs = cugraph.jaccard(G)

### Analyze the results

In [None]:
# Remove redundancies (remove (b, a) if (a, b) is present) and pairs consisting
# of the same vertices (a, a) from the results, then sort from most similar to
# least.
jaccard_coeffs = jaccard_coeffs.query("first < second")
jaccard_coeffs = jaccard_coeffs.sort_values("jaccard_coeff", ascending=False)

In [None]:
# Show the 20 most similar vertex pairs.
jaccard_coeffs.head(20)

We can see that several pairs have a coefficient of 1.0, meaning they have
the same set of neighbors. This can be easily verified in the plot above.

To see the similarity of two vertices that are not part of the same two-hop
neighborhood, we have to pass the vertex pair explicitly in a DataFrame.

In [None]:
cugraph.jaccard(G, cudf.DataFrame([(16, 33)]))

As expected, the coefficient is 0.0 because vertices 16 and 33 do not share any
neighbors.
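
The two extreme values discussed above can be reproduced with a small CPU-only sketch that applies the Jaccard formula to neighbor sets. The edge list below is a hypothetical toy graph, not the karate data, and `vertex_jaccard` is a helper name invented for this example rather than a cuGraph function.

```python
# Hypothetical toy edge list for illustration (not the karate dataset).
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (4, 5)]

# Build an undirected adjacency map: vertex -> set of neighbors.
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)


def vertex_jaccard(u, v):
    """Jaccard similarity of the neighbor sets of vertices u and v."""
    nu = neighbors.get(u, set())
    nv = neighbors.get(v, set())
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


print(vertex_jaccard(0, 1))  # 1.0: both vertices have neighbors {2, 3}
print(vertex_jaccard(0, 4))  # 0.0: {2, 3} and {5} share no vertex
```

Vertices 0 and 1 have identical neighbor sets, so their coefficient is 1.0, while vertices 0 and 4 share no neighbors and score 0.0.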

---
# Now we look at weighted Jaccard!

A full explanation of the weighted Jaccard similarity can be found [here](https://en.wikipedia.org/wiki/Jaccard_index#Weighted_Jaccard_similarity_and_distance).
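
For non-negative weight vectors $x$ and $y$, the definition on the linked page is $J_W(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$. The sketch below applies that definition on the CPU to two hypothetical vertices, each described by a dictionary mapping a neighbor to an edge weight, with a missing neighbor treated as weight 0. The function name and the sample weights are assumptions for illustration, and cuGraph's own handling of weights may differ in detail.

```python
def weighted_jaccard(x, y):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(x) | set(y)
    # A neighbor missing from one side contributes weight 0.0 for that side.
    numerator = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    denominator = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator else 0.0


# Hypothetical neighbor -> weight maps for two vertices.
x = {"a": 1.0, "b": 2.0}
y = {"b": 1.0, "c": 2.0}

# Minima: a -> 0, b -> 1, c -> 0 (sum 1); maxima: a -> 1, b -> 2, c -> 2 (sum 5).
print(weighted_jaccard(x, y))  # 0.2
```

When every weight equals 1, this reduces to the unweighted Jaccard similarity of the two neighbor sets.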

The Dining Preferences dataset is a staple of small-scale social network analysis.
The data represents the first (weight = 1) and second (weight = 2) dining partner preferences from a survey conducted in a small school dormitory.

This data originated in a social network publication by J. L. Moreno.

Reference: J. L. Moreno (1960). The Sociometry Reader. The Free Press, Glencoe, Illinois, pg.35


Here is a visualization of the dataset:
<img src="../../img/dorm_data_diagram.png" width="100%"/>


### First, pull in the dining preferences dataset and load it into a cuGraph Graph

In [None]:
# Import the dining preferences dataset from cugraph.datasets
from cugraph.datasets import dining_prefs
# Load the graph, making sure not to ignore the edge weights
G = dining_prefs.get_graph(download=True, store_transposed=True, ignore_weights=False)


Compute the unweighted and weighted Jaccard coefficients

In [None]:
# calculate both the unweighted and weighted Jaccard
jaccard_coeffs = cugraph.jaccard(G)
jaccard_weighted = cugraph.jaccard(G, use_weight=True)
# Rename the weighted result column so the two results can be told apart
jaccard_weighted = jaccard_weighted.rename(columns={'jaccard_coeff': 'weighted_jaccard'})

Join the two result DataFrames

In [19]:
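# Merge the two results, joining on the vertex pairs
jaccard_merged = jaccard_coeffs.merge(jaccard_weighted, on=['first', 'second'], how='left')
# sort_values returns a new DataFrame rather than sorting in place, so keep
# the sorted copy under its own name
jaccard_sorted = jaccard_merged.sort_values('weighted_jaccard', ascending=False)
# Preview the first rows of the merged (unsorted) results
jaccard_merged.head()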

| | first | second | jaccard_coeff | weighted_jaccard |
|---|---|---|---|---|
| 0 | Lena | Marion | 0.125 | 0.076923 |
| 1 | Lena | Adele | 0.142857 | 0.090909 |
| 2 | Lena | Ellen | 0.166667 | 0.1 |
| 3 | Lena | Louise | 0.2 | 0.111111 |
| 4 | Louise | Eva | 0.111111 | 0.076923 |


---
### It's that easy with cuGraph

Copyright (c) 2019-2024, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___