# Overlap Similarity
----

In this notebook we will explore the Overlap Coefficient and compare it against Jaccard.  Similarity can be computed between neighboring vertices (default) or between second-hop neighbors.


Notebook Credits

| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware     |
| --------------|------------|------------------|-----------------|--------------------|
| Brad Rees     | 10/14/2019 | created          | 0.08            | GV100, CUDA 10.0   |
|               | 08/16/2020 | updated          | 0.12            | GV100, CUDA 10.0   |
|               | 08/05/2021 | tested / updated | 21.10 nightly   | RTX 3090 CUDA 11.4 |
| Ralph Liu     | 06/22/2022 | updated/tested   | 22.08           | TV100, CUDA 11.5   |


## Introduction - Common Neighbor Similarity 

One of the most common types of vertex similarity evaluates the neighborhoods of a vertex pair and looks at the number of common neighbors.  That type of similarity comes from statistics and is based on set comparison.  Both Jaccard and the Overlap Coefficient operate on sets, and in a graph setting, those sets are the lists of neighboring vertices. <br>
For those who like math:  The neighbors of a vertex, _v_, are defined as the set, _U_, of vertices connected by way of an edge to vertex v, or _N(v) = {U} where v ∈ V and ∀ u ∈ U ∃ edge(v,u) ∈ E_.

For the rest of this introduction, set __A__ will equate to _A = N(i)_ and set __B__ will equate to _B = N(j)_.  That just makes the rest of the text more readable.
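
As a minimal sketch of the definition above, the neighbor sets can be built from an undirected edge list in plain Python. The toy edge list and the helper name `neighbor_sets` are assumptions made for illustration; they are not part of cuGraph.

```python
from collections import defaultdict

def neighbor_sets(edges):
    """Build N(v) for every vertex of an undirected graph given as (u, v) pairs."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return dict(nbrs)

# Hypothetical toy graph: a triangle 1-2-3 plus a pendant vertex 4 attached to 3
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
N = neighbor_sets(toy_edges)

A = N[1]  # {2, 3}
B = N[3]  # {1, 2, 4}
print(A & B)  # common neighbors of vertices 1 and 3
```

Here `A` and `B` play the role of _N(i)_ and _N(j)_ for _i = 1_ and _j = 3_.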


### Overlap Coefficient

The Overlap Coefficient between two sets is defined as the size of their intersection divided by the size of the smaller set.
It can be written as

<a href="https://www.codecogs.com/eqnedit.php?latex=oc(A,B)&space;=&space;\frac{|A&space;\cap&space;B|}{min(|A|,&space;|B|)&space;}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?oc(A,B)&space;=&space;\frac{|A&space;\cap&space;B|}{min(|A|,&space;|B|)&space;}" title="oc(A,B) = \frac{|A \cap B|}{min(|A|, |B|) }" /></a>
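
The formula can be sketched on plain Python sets as follows. The function name `overlap_coefficient` and the choice to return 0.0 when either set is empty are assumptions for this illustration; they do not describe how cuGraph computes the score.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); returns 0.0 if either set is empty (an assumed convention)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller

A = {2, 3}
B = {1, 2, 4}
print(overlap_coefficient(A, B))  # 1 common element / smaller set of size 2 -> 0.5
```

Note that the score reaches 1.0 whenever the smaller set is fully contained in the larger one, regardless of how large the larger set is.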

To compute the Overlap Coefficient between all pairs of vertices connected by an edge in cuGraph use: <br>

__df = cugraph.overlap(G)__

    G: A cugraph.Graph object

Returns:

    df: cudf.DataFrame with three named columns:
        df["source"]: The source vertex id.
        df["destination"]: The destination vertex id.
        df["overlap_coeff"]: The overlap coefficient computed between the source and destination vertex.

__References__
- https://en.wikipedia.org/wiki/Overlap_coefficient


#### Refresh on Jaccard
The Jaccard similarity between two sets is defined as the size of their intersection divided by the size of their union.

The Jaccard Similarity can then be written as

<a href="https://www.codecogs.com/eqnedit.php?latex=js(A,B)&space;=&space;\frac{|A&space;\cap&space;B|}{|A&space;\cup&space;B&space;|&space;}&space;=&space;\frac{|A&space;\cap&space;B|}{&space;|A|&space;&plus;&space;|B|&space;-&space;|A&space;\cap&space;B&space;|&space;}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?js(A,B)&space;=&space;\frac{|A&space;\cap&space;B|}{|A&space;\cup&space;B&space;|&space;}&space;=&space;\frac{|A&space;\cap&space;B|}{&space;|A|&space;&plus;&space;|B|&space;-&space;|A&space;\cap&space;B&space;|&space;}" title="js(A,B) = \frac{|A \cap B|}{|A \cup B | } = \frac{|A \cap B|}{ |A| + |B| - |A \cap B | }" /></a>
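
A small sketch on plain Python sets, assuming a hypothetical helper named `jaccard_similarity`, shows the definition and checks the inclusion-exclusion identity for the size of the union. It is an illustration only and not cuGraph's implementation.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty (an assumed convention)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

A = {2, 3}
B = {1, 2, 4}

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
assert len(A | B) == len(A) + len(B) - len(A & B)

print(jaccard_similarity(A, B))  # 1 common element / union of size 4 -> 0.25
```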


To compute the Jaccard similarity between all pairs of vertices connected by an edge in cuGraph use: <br>
__df = cugraph.jaccard(G)__

    G: A cugraph.Graph object

Returns:

    df: cudf.DataFrame with three named columns:
        df["source"]: The source vertex id.
        df["destination"]: The destination vertex id.
        df["jaccard_coeff"]: The jaccard coefficient computed between the source and destination vertex.
<br>

See the Jaccard notebook for additional information and background.

### Additional Reading
- [Similarity in graphs: Jaccard versus the Overlap Coefficient](https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d)
- [Wikipedia: Overlap Coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient)


### Some notes about vertex IDs...

* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller. If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.
  * To skip auto-renumbering, set the `renumber` boolean arg to `False` when calling the appropriate graph creation API (e.g. `G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).
  * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`
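
To make the idea of renumbering concrete, here is a conceptual sketch in plain Python that maps arbitrary vertex IDs to a contiguous range starting at 0 and back again. The helper `renumber` and the sample edge list are hypothetical; cuGraph's internal renumbering is not shown here and may work differently.

```python
def renumber(edges):
    """Map arbitrary vertex IDs to 0..n-1 and return (renumbered_edges, id_list)."""
    ids = sorted({v for edge in edges for v in edge})
    to_internal = {ext: i for i, ext in enumerate(ids)}
    renumbered = [(to_internal[u], to_internal[v]) for u, v in edges]
    return renumbered, ids  # ids[i] recovers the original ID of internal vertex i

# Hypothetical edge list with non-contiguous IDs
edges = [(100, 250), (250, 7), (7, 100)]
internal_edges, ids = renumber(edges)
print(internal_edges)  # [(1, 2), (2, 0), (0, 1)]

# Convert back to the original IDs
restored = [(ids[u], ids[v]) for u, v in internal_edges]
assert restored == edges
```

If the IDs are already 0..n-1, the mapping is the identity, which is why the step can be skipped in that case.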


## Test Data
We will be using the Zachary Karate club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*


![Karate Club](../img/zachary_black_lines.png)

This is a small graph which allows for easy visual inspection to validate results.  

---
# Let's get started!

In [1]:
# Import needed libraries
import cugraph
import cudf

----
### Define some Print functions
(the `del` statements are not strictly needed, since going out of scope should free memory)

In [2]:
# define a function for printing the top most similar vertices
def print_most_similar_jaccard(df):
    
    jmax = df['jaccard_coeff'].max()
    dm = df.query('jaccard_coeff >= @jmax')    
    
    #find the best
    for i in range(len(dm)):    
        print("Vertices " + str(dm['source'].iloc[i]) + " and " + 
              str(dm['destination'].iloc[i]) + " are most similar with score: " 
              + str(dm['jaccard_coeff'].iloc[i]))
    del jmax
    del dm

In [3]:
# define a function for printing the top most similar vertices
def print_most_similar_overlap(df):
    
    smax = df['overlap_coeff'].max()
    dm = df.query('overlap_coeff >= @smax and source < destination')      
    
    for i in range(len(dm)):
        print("Vertices " + str(dm['source'].iloc[i]) + " and " + 
          str(dm['destination'].iloc[i]) + " are most similar with score: " 
          + str(dm['overlap_coeff'].iloc[i]))
        
    del smax
    del dm

In [4]:
# define a function for printing jaccard similar vertices based on a threshold
def print_jaccard_threshold(_d, limit):
    
    filtered = _d.query('jaccard_coeff > @limit')
    
    for i in range(len(filtered)):
        print("Vertices " + str(filtered['source'].iloc[i]) + " and " + 
            str(filtered['destination'].iloc[i]) + " are similar with score: " + 
            str(filtered['jaccard_coeff'].iloc[i]))

In [5]:
# define a function for printing similar vertices based on a threshold
def print_overlap_threshold(_d, limit):
    
    filtered = _d.query('overlap_coeff > @limit')
    
    for i in range(len(filtered)):
        if filtered['source'].iloc[i] != filtered['destination'].iloc[i] :
            print("Vertices " + str(filtered['source'].iloc[i]) + " and " + 
                str(filtered['destination'].iloc[i]) + " are similar with score: " + 
                str(filtered['overlap_coeff'].iloc[i]))

### Import a Dataset Object

In [6]:
from cugraph.experimental.datasets import karate
gdf = karate.get_edgelist()

In [7]:
# Let's look at the DataFrame. There should be two columns and 156 records
gdf.shape

(156, 2)

In [8]:
# Look at the first few data records - the output should be two columns: 'src' and 'dst'
gdf.head()

|   | src | dst |
|---|-----|-----|
| 0 | 1   | 2   |
| 1 | 1   | 3   |
| 2 | 1   | 4   |
| 3 | 1   | 5   |
| 4 | 1   | 6   |


### Create a Graph

In [9]:
# create a Graph 
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

In [10]:
# How many vertices are in the graph?  
G.number_of_vertices()

34

--- 
# Jaccard 

In [11]:
#%%time
# Call cugraph.jaccard
jdf = cugraph.jaccard(G)

In [12]:
# Which two vertices are the most similar?
print_most_similar_jaccard(jdf)

Vertices 34 and 33 are most similar with score: 0.5263158
Vertices 33 and 34 are most similar with score: 0.5263158


The most similar pair should be 33 and 34.
Vertex 33 has 12 neighbors and vertex 34 has 17 neighbors.  They share 10 neighbors in common:
$jaccard = 10 / (10 + (12 -10) + (17-10)) = 10 / 19 = 0.526$
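
The arithmetic above can be checked with a few lines of Python, using only the neighbor counts stated in the text (12, 17, and 10 shared). The same counts also give the Overlap Coefficient for this pair, which the notebook computes further down.

```python
from fractions import Fraction

deg_33, deg_34, common = 12, 17, 10

union = deg_33 + deg_34 - common                  # 12 + 17 - 10 = 19
jaccard = Fraction(common, union)                 # 10/19
overlap = Fraction(common, min(deg_33, deg_34))   # 10/12 = 5/6

print(float(jaccard))  # about 0.526
print(float(overlap))  # about 0.833
```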

In [13]:
### let's look at all similarities over a threshold
print_jaccard_threshold(jdf, 0.4)

Vertices 4 and 8 are similar with score: 0.42857143
Vertices 8 and 4 are similar with score: 0.42857143
Vertices 34 and 33 are similar with score: 0.5263158
Vertices 33 and 34 are similar with score: 0.5263158


In [14]:
# Since it is a small graph we can print all scores, notice that only vertices that are neighbors are being compared
#
# Before printing, let's get rid of the duplicates (x compared to y is the same as y compared to x).  We will do that
# by performing a query.  Then let's sort the data by score

jdf_s = jdf.query('source < destination').sort_values(by='jaccard_coeff', ascending=False)

print_jaccard_threshold(jdf_s, 0.0)

Vertices 33 and 34 are similar with score: 0.5263158
Vertices 4 and 8 are similar with score: 0.42857143
Vertices 1 and 2 are similar with score: 0.3888889
Vertices 4 and 14 are similar with score: 0.375
Vertices 2 and 4 are similar with score: 0.36363637
Vertices 3 and 4 are similar with score: 0.33333334
Vertices 6 and 7 are similar with score: 0.33333334
Vertices 2 and 8 are similar with score: 0.3
Vertices 1 and 4 are similar with score: 0.29411766
Vertices 9 and 31 are similar with score: 0.2857143
Vertices 24 and 30 are similar with score: 0.2857143
Vertices 3 and 8 are similar with score: 0.27272728
Vertices 2 and 14 are similar with score: 0.27272728
Vertices 2 and 3 are similar with score: 0.26666668
Vertices 3 and 14 are similar with score: 0.25
Vertices 1 and 3 are similar with score: 0.23809524
Vertices 9 and 33 are similar with score: 0.21428572
Vertices 7 and 17 are similar with score: 0.2
Vertices 5 and 11 are similar with score: 0.2
Vertices 25 and 26 are similar with s

---
# Overlap Coefficient

Notice that the Jaccard score is based on the number of common items over the combined (union) set of items.  That makes sense when the two sets being compared are relatively close in size.  However, when one set is considerably larger, it is important to know whether the smaller set is a subset of the larger one. <br>
See:  [Similarity in graphs: Jaccard versus the Overlap Coefficient](https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d)
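
A small sketch with made-up sets illustrates the difference: when a small set is entirely contained in a much larger one, Jaccard is dragged down by the size of the union while the Overlap Coefficient reports full containment. The two helper functions and the sets are assumptions for this example only.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

small = {1, 2, 3}
large = set(range(1, 31))  # 30 elements, contains every element of `small`

print(jaccard(small, large))  # 3 / 30 = 0.1
print(overlap(small, large))  # 3 / 3  = 1.0
```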

In [15]:
#%%time
# Call cugraph.overlap
odf = cugraph.overlap(G)

In [16]:
# print the top similar pair - this function includes code to drop duplicates
print_most_similar_overlap(odf)

Vertices 1 and 4 are most similar with score: 0.8333333
Vertices 33 and 34 are most similar with score: 0.8333333


In [17]:
# print all similarities over a threshold, in this case 0.5
# also, drop duplicates
odf_s = odf.query('source < destination').sort_values(by='overlap_coeff', ascending=False)

print_overlap_threshold(odf_s, 0.5)

Vertices 1 and 4 are similar with score: 0.8333333
Vertices 33 and 34 are similar with score: 0.8333333
Vertices 1 and 2 are similar with score: 0.7777778
Vertices 4 and 8 are similar with score: 0.75
Vertices 1 and 8 are similar with score: 0.75
Vertices 30 and 34 are similar with score: 0.75
Vertices 3 and 8 are similar with score: 0.75
Vertices 2 and 8 are similar with score: 0.75
Vertices 1 and 5 are similar with score: 0.6666667
Vertices 1 and 11 are similar with score: 0.6666667
Vertices 3 and 4 are similar with score: 0.6666667
Vertices 2 and 4 are similar with score: 0.6666667
Vertices 4 and 14 are similar with score: 0.6
Vertices 9 and 33 are similar with score: 0.6
Vertices 1 and 14 are similar with score: 0.6
Vertices 3 and 14 are similar with score: 0.6
Vertices 2 and 14 are similar with score: 0.6
Vertices 24 and 34 are similar with score: 0.6


---

# Expanding similarity scoring to 2-hop vertex pairs
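
As a rough illustration of what a two-hop pair is, the sketch below collects every pair of distinct vertices that share at least one neighbor. The helper `two_hop_pairs` is hypothetical, and details such as whether directly connected pairs or both orderings are returned by `G.get_two_hop_neighbors()` are not covered here; this is only a plain-Python approximation of the concept.

```python
from collections import defaultdict
from itertools import combinations

def two_hop_pairs(edges):
    """Return all unordered pairs (u, w), u != w, that share at least one neighbor."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    pairs = set()
    for mid in nbrs:
        # any two neighbors of `mid` are reachable from each other in two hops
        for u, w in combinations(sorted(nbrs[mid]), 2):
            pairs.add((u, w))
    return pairs

# Hypothetical path graph 1 - 2 - 3 - 4
print(sorted(two_hop_pairs([(1, 2), (2, 3), (3, 4)])))  # [(1, 3), (2, 4)]
```

Because such pairs always share at least one neighbor, a low-degree vertex often has all of its neighbors contained in the other vertex's neighborhood, which gives an overlap score of 1.0.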

In [18]:
# get all two-hop vertex pairs
p = G.get_two_hop_neighbors()

In [19]:
# Let's look at the Overlap score for the two-hop pairs
ol2 = cugraph.overlap(G, vertex_pair=p)

In [20]:
# Note: odf holds the 1-hop results; pass ol2 to print the top 2-hop pairs
print_most_similar_overlap(odf)

Vertices 1 and 4 are most similar with score: 0.8333333
Vertices 33 and 34 are most similar with score: 0.8333333


In [21]:
# print all similarities over a threshold, in this case 0.74
# also, drop duplicates
odf_s2 = ol2.query('source < destination').sort_values(by='overlap_coeff', ascending=False)

print_overlap_threshold(odf_s2, 0.74)

Vertices 15 and 16 are similar with score: 1.0
Vertices 15 and 19 are similar with score: 1.0
Vertices 15 and 21 are similar with score: 1.0
Vertices 15 and 23 are similar with score: 1.0
Vertices 16 and 32 are similar with score: 1.0
Vertices 16 and 24 are similar with score: 1.0
Vertices 15 and 31 are similar with score: 1.0
Vertices 1 and 17 are similar with score: 1.0
Vertices 21 and 23 are similar with score: 1.0
Vertices 18 and 20 are similar with score: 1.0
Vertices 18 and 22 are similar with score: 1.0
Vertices 19 and 31 are similar with score: 1.0
Vertices 12 and 32 are similar with score: 1.0
Vertices 12 and 14 are similar with score: 1.0
Vertices 10 and 29 are similar with score: 1.0
Vertices 13 and 14 are similar with score: 1.0
Vertices 24 and 27 are similar with score: 1.0
Vertices 16 and 30 are similar with score: 1.0
Vertices 16 and 19 are similar with score: 1.0
Vertices 16 and 21 are similar with score: 1.0
Vertices 16 and 23 are similar with score: 1.0
Vertices 19 an

---

## Let's now compare the Overlap Coefficient with the Jaccard Similarity 

In [22]:
# Call cugraph.jaccard
jdf = cugraph.jaccard(G)

In [23]:
# Which two vertices are the most similar?
print_most_similar_jaccard(jdf)

Vertices 33 and 34 are most similar with score: 0.5263158
Vertices 34 and 33 are most similar with score: 0.5263158


In [24]:
# Let's combine the Jaccard and Overlap scores
mdf = jdf.merge(odf, on=['source','destination'])

In [25]:
# Also want to include the vertex degree
degree = G.degree()

In [26]:
dS = degree.rename(columns={'vertex':'source','degree': 'src_degree'})
dD = degree.rename(columns={'vertex':'destination','degree': 'dst_degree'})

In [27]:
m = mdf.merge(dS, how="left", on='source')
m = m.merge(dD, how="left", on='destination')

In [28]:
m.query('source < destination').sort_values(by='jaccard_coeff', ascending=False).head(20)

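|     | jaccard_coeff | source | destination | overlap_coeff | src_degree | dst_degree |
|-----|---------------|--------|-------------|---------------|------------|------------|
| 113 | 0.526316      | 33     | 34          | 0.833333      | 24         | 34         |
| 142 | 0.428571      | 4      | 8           | 0.75          | 12         | 8          |
| 82  | 0.388889      | 1      | 2           | 0.777778      | 32         | 18         |
| 131 | 0.375         | 4      | 14          | 0.6           | 12         | 10         |
| 8   | 0.363636      | 2      | 4           | 0.666667      | 18         | 12         |
| 0   | 0.333333      | 3      | 4           | 0.666667      | 20         | 12         |
| 43  | 0.333333      | 6      | 7           | 0.5           | 8          | 8          |
| 10  | 0.3           | 2      | 8           | 0.75          | 18         | 8          |
| 83  | 0.294118      | 1      | 4           | 0.833333      | 32         | 12         |
| 32  | 0.285714      | 9      | 31          | 0.5           | 10         | 8          |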


In [29]:
# Now sort on the overlap
m.query('source < destination').sort_values(by='overlap_coeff', ascending=False).head(20)

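|     | jaccard_coeff | source | destination | overlap_coeff | src_degree | dst_degree |
|-----|---------------|--------|-------------|---------------|------------|------------|
| 83  | 0.294118      | 1      | 4           | 0.833333      | 32         | 12         |
| 113 | 0.526316      | 33     | 34          | 0.833333      | 24         | 34         |
| 82  | 0.388889      | 1      | 2           | 0.777778      | 32         | 18         |
| 3   | 0.272727      | 3      | 8           | 0.75          | 20         | 8          |
| 10  | 0.3           | 2      | 8           | 0.75          | 18         | 8          |
| 22  | 0.166667      | 30     | 34          | 0.75          | 8          | 34         |
| 89  | 0.176471      | 1      | 8           | 0.75          | 32         | 8          |
| 142 | 0.428571      | 4      | 8           | 0.75          | 12         | 8          |
| 0   | 0.333333      | 3      | 4           | 0.666667      | 20         | 12         |
| 8   | 0.363636      | 2      | 4           | 0.666667      | 18         | 12         |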


___
Copyright (c) 2019-2020, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___