# Overlap Similarity
----

In this notebook we will explore the Overlap Coefficient and compare it against Jaccard.  Similarity can be computed between neighboring vertices (the default) or between second-hop neighbors.


Notebook Credits

    Original Authors: Brad Rees
    Created:   10/14/2019
    Last Edit: 01/23/2020

RAPIDS Versions: 0.12.0a

Test Hardware
* GV100 32G, CUDA 10.0


## Introduction - Common Neighbor Similarity 

One of the most common types of vertex similarity evaluates the neighborhoods of a vertex pair and looks at the number of common neighbors.  That type of similarity comes from statistics and is based on set comparison.  Both Jaccard and the Overlap Coefficient operate on sets, and in a graph setting, those sets are the lists of neighboring vertices. <br>
For those that like math:  The neighbors of a vertex, _v_, are defined as the set, _U_, of vertices connected by way of an edge to vertex _v_, or _N(v) = {U} where v ∈ V and ∀ u ∈ U ∃ edge(v,u) ∈ E_.

For the rest of this introduction, set __A__ will equate to _A = N(i)_ and set __B__ will equate to _B = N(j)_.  That just makes the rest of the text more readable.
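
To make the neighbor-set definition concrete, here is a minimal pure-Python sketch that builds _N(v)_ from an edge list. The helper name `neighbor_sets` and the tiny edge list are made up for illustration, and the sketch assumes the graph is undirected; it is not how cuGraph stores neighborhoods internally.

```python
from collections import defaultdict

def neighbor_sets(edges):
    """Build N(v) for every vertex of an undirected edge list."""
    nbrs = defaultdict(set)
    for u, v in edges:
        # Assumption: edges are undirected, so record both directions
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

# Hypothetical edge list, not the Karate club data
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
N = neighbor_sets(edges)

A = N[1]   # neighbors of vertex i = 1
B = N[4]   # neighbors of vertex j = 4
print(A, B)
```

With these sets in hand, both similarity scores reduce to plain set operations on `A` and `B`.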


### Overlap Coefficient

The Overlap Coefficient between two sets is defined as the ratio of the size of their intersection to the size of the smaller set.
The Overlap Coefficient can be defined as

<a href="https://www.codecogs.com/eqnedit.php?latex=oc(A,B)&space;=&space;\frac{|A&space;\cap&space;B|}{min(|A|,&space;|B|)&space;}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?oc(A,B)&space;=&space;\frac{|A&space;\cap&space;B|}{min(|A|,&space;|B|)&space;}" title="oc(A,B) = \frac{|A \cap B|}{min(|A|, |B|) }" /></a>
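
As a quick sanity check of the definition, the formula can be sketched in plain Python with built-in sets. The function name `overlap_coefficient` and the example sets are illustrative only, and returning 0.0 for an empty set is an assumption of this sketch, not documented cuGraph behavior.

```python
def overlap_coefficient(a, b):
    """Overlap coefficient of two sets: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        # Assumption: score an empty set as 0.0 instead of dividing by zero
        return 0.0
    return len(a & b) / smaller

# Hypothetical neighbor sets for two vertices i and j
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8}

oc_ab = overlap_coefficient(A, B)   # 2 shared items / min(4, 6)
print(oc_ab)
```

Note that the score reaches 1.0 whenever the smaller set is fully contained in the larger one, no matter how large the larger set is.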

To compute the Overlap Coefficient between all pairs of vertices connected by an edge in cuGraph use: <br>

__df = cugraph.overlap(G)__

    G: A cugraph.Graph object

Returns:

    df: cudf.DataFrame with three named columns:
        df["source"]: The source vertex id.
        df["destination"]: The destination vertex id.
        df["overlap_coeff"]: The overlap coefficient computed between the source and destination vertex.

__References__
- https://en.wikipedia.org/wiki/Overlap_coefficient


#### Refresh on Jaccard
The Jaccard similarity between two sets is defined as the ratio of the size of their intersection to the size of their union. 

The Jaccard Similarity can then be defined as

<a href="https://www.codecogs.com/eqnedit.php?latex=js(A,B)&space;=&space;\frac{|A&space;\cap&space;B|}{|A&space;\cup&space;B&space;|&space;}&space;=&space;\frac{|A&space;\cap&space;B|}{&space;|A|&space;&plus;&space;|B|&space;-&space;|A&space;\cap&space;B&space;|&space;}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?js(A,B)&space;=&space;\frac{|A&space;\cap&space;B|}{|A&space;\cup&space;B&space;|&space;}&space;=&space;\frac{|A&space;\cap&space;B|}{&space;|A|&space;&plus;&space;|B|&space;-&space;|A&space;\cap&space;B&space;|&space;}" title="js(A,B) = \frac{|A \cap B|}{|A \cup B | } = \frac{|A \cap B|}{ |A| + |B| - |A \cap B | }" /></a>
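
The same kind of sketch works for Jaccard. The union size can be obtained either directly or by inclusion-exclusion, |A| + |B| - |A ∩ B|, and the snippet below checks that both agree. The function name `jaccard_similarity` and the example sets are illustrative, and scoring two empty sets as 0.0 is an assumption of this sketch.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        # Assumption: two empty sets are scored as 0.0
        return 0.0
    return len(a & b) / union

# Hypothetical neighbor sets for two vertices i and j
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8}

union_size = len(A | B)
inclusion_exclusion = len(A) + len(B) - len(A & B)   # 4 + 6 - 2

js_ab = jaccard_similarity(A, B)   # 2 shared items / 8 items in the union
print(union_size, inclusion_exclusion, js_ab)
```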


To compute the Jaccard similarity between all pairs of vertices connected by an edge in cuGraph use: <br>
__df = cugraph.jaccard(G)__

    G: A cugraph.Graph object

Returns:

    df: cudf.DataFrame with three named columns:
        df["source"]: The source vertex id.
        df["destination"]: The destination vertex id.
        df["jaccard_coeff"]: The Jaccard coefficient computed between the source and destination vertex.
<br>

See the Jaccard notebook for additional information and background.

### Additional Reading
- [Similarity in graphs: Jaccard versus the Overlap Coefficient](https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d)
- [Wikipedia: Overlap Coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient)


#### cuGraph Notice 
The current version of cuGraph has some limitations:

* Vertex IDs need to be 32-bit integers.
* Vertex IDs are expected to be contiguous integers starting from 0.

cuGraph provides the renumber function to mitigate this problem. Input vertex IDs for the renumber function can be either 32-bit or 64-bit integers, can be non-contiguous, and can start from an arbitrary number. The renumber function maps the provided input vertex IDs to 32-bit contiguous integers starting from 0. cuGraph still requires the renumbered vertex IDs to be representable in 32-bit integers. These limitations are being addressed and will be fixed soon.
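
To illustrate what renumbering means, here is a small pure-Python sketch that maps arbitrary vertex IDs to contiguous integers starting from 0. This is not cuGraph's renumber function: the helper name `renumber_edges`, its return shape, and the first-seen ordering of new IDs are all assumptions made for this example.

```python
def renumber_edges(edges):
    """Map arbitrary vertex IDs to contiguous integers starting at 0."""
    mapping = {}
    renumbered = []
    for src, dst in edges:
        for v in (src, dst):
            if v not in mapping:
                # Assumption: new IDs are handed out in first-seen order
                mapping[v] = len(mapping)
        renumbered.append((mapping[src], mapping[dst]))
    return renumbered, mapping

# Hypothetical IDs: non-contiguous, and one does not fit in 32 bits
edges = [(1001, 5000000000), (5000000000, 42), (42, 1001)]
new_edges, id_map = renumber_edges(edges)
print(new_edges)
print(id_map)
```

Keeping `id_map` around makes it possible to translate results back to the original vertex IDs afterwards.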

## Test Data
We will be using the Zachary Karate club dataset:  
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*


![Karate Club](../img/zachary_black_lines.png)

This is a small graph which allows for easy visual inspection to validate results.  

---
# Let's get started!

In [None]:
# Import needed libraries
import cugraph
import cudf
from collections import OrderedDict

----
### Define some Print functions
(the `del` statements are not strictly needed, since the variables are freed when they go out of scope)

In [None]:
# define a function for printing the most similar vertices by Jaccard score
def print_most_similar_jaccard(df):
    
    jmax = df['jaccard_coeff'].max()
    dm = df.query('jaccard_coeff >= @jmax')    
    
    # print every pair that ties for the best score
    for i in range(len(dm)):    
        print("Vertices " + str(dm['source'].iloc[i]) + " and " + 
              str(dm['destination'].iloc[i]) + " are most similar with score: " 
              + str(dm['jaccard_coeff'].iloc[i]))
    del jmax
    del dm

In [None]:
# define a function for printing the most similar vertices by Overlap score
def print_most_similar_overlap(df):
    
    smax = df['overlap_coeff'].max()
    dm = df.query('overlap_coeff >= @smax and source < destination')      
    
    for i in range(len(dm)):
        print("Vertices " + str(dm['source'].iloc[i]) + " and " + 
          str(dm['destination'].iloc[i]) + " are most similar with score: " 
          + str(dm['overlap_coeff'].iloc[i]))
        
    del smax
    del dm

In [None]:
# define a function for printing jaccard similar vertices based on a threshold
def print_jaccard_threshold(_d, limit):
    
    filtered = _d.query('jaccard_coeff > @limit')
    
    for i in range(len(filtered)):
        print("Vertices " + str(filtered['source'].iloc[i]) + " and " + 
            str(filtered['destination'].iloc[i]) + " are similar with score: " + 
            str(filtered['jaccard_coeff'].iloc[i]))

In [None]:
# define a function for printing similar vertices based on a threshold
def print_overlap_threshold(_d, limit):
    
    filtered = _d.query('overlap_coeff > @limit')
    
    for i in range(len(filtered)):
        if filtered['source'].iloc[i] != filtered['destination'].iloc[i] :
            print("Vertices " + str(filtered['source'].iloc[i]) + " and " + 
                str(filtered['destination'].iloc[i]) + " are similar with score: " + 
                str(filtered['overlap_coeff'].iloc[i]))

### Read the CSV datafile using cuDF
The data file is actually _tab_ separated, so we need to set the delimiter.

In [None]:
# Test file  
datafile='../data/karate-data.csv'

gdf = cudf.read_csv(datafile, delimiter='\t', names=['src', 'dst'], dtype=['int32', 'int32'] )

In [None]:
# Let's look at the DataFrame. There should be two columns and 156 records
gdf.shape

In [None]:
# Look at the first few data records - the output should be two columns: src and dst
gdf.head()

### Create a Graph

In [None]:
# create a Graph 
G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='src', destination='dst')

In [None]:
# How many vertices are in the graph?  Remember that Graph is zero based
G.number_of_vertices()

_The test graph has only 34 vertices, so why is the Graph listing 35?_

As mentioned above, cuGraph vertex numbering is zero-based, meaning that the first vertex ID is zero.  The test dataset is 1-based.  Because of that, the Graph object adds an extra isolated vertex with an ID of zero, hence the difference in vertex count.  
We could have run _renumbering_ on the data, or shifted each ID down by one, for example _gdf['src'] = gdf['src'] - 1_ (and the same for _dst_).    
For now, we will just state that vertex 0 is not part of the dataset and can be ignored.
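
The off-by-one count can be reproduced without a GPU. The sketch below assumes that the vertex count is derived from the largest vertex ID plus one, which matches the behavior described above but is a simplification made for this example. The edge list is made up and is not the Karate club data.

```python
# Hypothetical 1-based edge list with 4 real vertices (IDs 1..4)
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]

def zero_based_vertex_count(edge_list):
    """Count vertices the way a zero-based numbering would: max ID + 1."""
    return max(max(s, d) for s, d in edge_list) + 1

count_as_is = zero_based_vertex_count(edges)          # counts an unused vertex 0

# Shift both columns down by one so the IDs start at 0
shifted = [(s - 1, d - 1) for s, d in edges]
count_after_shift = zero_based_vertex_count(shifted)

print(count_as_is, count_after_shift)
```

Both endpoints have to be shifted together; shifting only the source column would rewire the graph instead of relabeling it.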

--- 
# Jaccard 

In [None]:
#%%time
# Call cugraph.jaccard
jdf = cugraph.jaccard(G)

In [None]:
# Which two vertices are the most similar?
print_most_similar_jaccard(jdf)

The most similar pair should be 33 and 34.
Vertex 33 has 12 neighbors and vertex 34 has 17 neighbors.  They share 10 neighbors in common:
$jaccard = 10 / (10 + (12 -10) + (17-10)) = 10 / 19 = 0.526$
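
The arithmetic can be double-checked with two stand-in sets that have the same sizes: 12 and 17 elements with 10 in common. The element values below are placeholders, not the actual neighbor IDs of vertices 33 and 34. The sketch also computes the Overlap Coefficient for the same sizes, for comparison.

```python
# Placeholder sets with the sizes quoted above (not real neighbor IDs)
common = set(range(10))                 # 10 shared elements
set_a = common | {100, 101}             # 12 elements in total
set_b = common | set(range(200, 207))   # 17 elements in total

shared = len(set_a & set_b)
jaccard = shared / len(set_a | set_b)             # 10 / 19
overlap = shared / min(len(set_a), len(set_b))    # 10 / 12

print(round(jaccard, 3), round(overlap, 3))
```

For these sizes the Overlap Coefficient comes out noticeably higher than Jaccard, because the 7 extra elements of the larger set only enter the Jaccard denominator.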

In [None]:
### let's look at all similarities over a threshold
print_jaccard_threshold(jdf, 0.4)

In [None]:
# Since it is a small graph we can print all scores, notice that only vertices that are neighbors are being compared
#
# Before printing, let's get rid of the duplicates (x compared to y is the same as y compared to x).  We will do that
# by performing a query.  Then let's sort the data by score

jdf_s = jdf.query('source < destination').sort_values(by='jaccard_coeff', ascending=False)

print_jaccard_threshold(jdf_s, 0.0)

---
# Overlap Coefficient

Notice that the Jaccard score is based on the number of common items over the combined (union) set of items.  That makes sense when the two sets being compared are relatively close in size.  However, when one set is considerably larger than the other, it is important to know whether the smaller set is a subset of the larger one, and that is what the Overlap Coefficient captures. <br>
See:  [Similarity in graphs: Jaccard versus the Overlap Coefficient](https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d)
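
A tiny example shows the subset effect. The two sets below are made up: a 3-element set that is fully contained in a 30-element set.

```python
small = {1, 2, 3}
large = set(range(1, 31))   # 30 elements, contains every element of `small`

shared = len(small & large)
jaccard = shared / len(small | large)             # 3 / 30
overlap = shared / min(len(small), len(large))    # 3 / 3

print(jaccard, overlap)
```

Jaccard reports a low score because the union is dominated by the large set, while the Overlap Coefficient reports 1.0 because the small set is completely covered.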

In [None]:
#%%time
# Call cugraph.overlap
odf = cugraph.overlap(G)

In [None]:
# print the top similar pair - this function includes code to drop duplicates  
print_most_similar_overlap(odf)

In [None]:
# print all similarities over a threshold, in this case 0.5
#also, drop duplicates
odf_s = odf.query('source < destination').sort_values(by='overlap_coeff', ascending=False)

print_overlap_threshold(odf_s, 0.5)

---

# Expanding similarity scoring to 2-hop vertex pairs

In [None]:
# get all two-hop vertex pairs
p = G.get_two_hop_neighbors()
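
For intuition, two-hop pairs can be enumerated in pure Python as all pairs of vertices that share at least one common neighbor. This is only a sketch: the helper name `two_hop_pairs` and the small graph are made up, and whether `get_two_hop_neighbors` also returns pairs that are directly connected is not stated in this notebook, so this version simply keeps them.

```python
from itertools import combinations

def two_hop_pairs(adj):
    """Return sorted (u, v) pairs, u < v, that share a common neighbor."""
    pairs = set()
    for nbrs in adj.values():
        # Any two neighbors of the same vertex are at most two hops apart
        for u, v in combinations(sorted(nbrs), 2):
            pairs.add((u, v))
    return sorted(pairs)

# Hypothetical undirected graph with edges 0-1, 1-2, 1-3, 2-3
adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}

pairs = two_hop_pairs(adj)
print(pairs)

# Vertices 0 and 2 are not adjacent, yet N(0) is a subset of N(2)
overlap_0_2 = len(adj[0] & adj[2]) / min(len(adj[0]), len(adj[2]))
print(overlap_0_2)
```

Scoring two-hop pairs is what lets non-adjacent vertices such as 0 and 2 show up with a high similarity.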

In [None]:
# Let's look at the Overlap score for the two-hop pairs
ol2 = cugraph.overlap(G, vertex_pair=p)

In [None]:
print_most_similar_overlap(ol2)

In [None]:
# print all similarities over a threshold, in this case 0.74
#also, drop duplicates
odf_s2 = ol2.query('source < destination').sort_values(by='overlap_coeff', ascending=False)

print_overlap_threshold(odf_s2, 0.74)

---

## Let's now compare the Overlap Coefficient with the Jaccard Similarity 

In [None]:
# Call cugraph.jaccard
jdf = cugraph.jaccard(G)

In [None]:
# Which two vertices are the most similar?
print_most_similar_jaccard(jdf)

In [None]:
# Let's combine the Jaccard and Overlap scores
mdf = jdf.merge(odf, on=['source','destination'])

In [None]:
# Also want to include the vertex degree
degree = G.degree()

In [None]:
dS = degree.rename(columns={'vertex':'source','degree': 'src_degree'})
dD = degree.rename(columns={'vertex':'destination','degree': 'dst_degree'})

In [None]:
m = mdf.merge(dS, how="left", on='source')
m = m.merge(dD, how="left", on='destination')

In [None]:
m.query('source < destination').sort_values(by='jaccard_coeff', ascending=False).head(20)

In [None]:
# Now sort on the overlap
m.query('source < destination').sort_values(by='overlap_coeff', ascending=False).head(20)
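
The same side-by-side comparison can be sketched on the CPU with plain Python, which is handy for checking what the two sort orders mean. The small graph below is made up so that the two rankings disagree; it is not the Karate club data, and the tuple layout of `rows` is just a choice for this example.

```python
# Hypothetical undirected graph: vertex 0 is a hub
adj = {
    0: {1, 2, 3, 4, 5, 6},
    1: {0, 2},
    2: {0, 1},
    3: {0, 4, 5},
    4: {0, 3},
    5: {0, 3},
    6: {0},
}

rows = []
for u in sorted(adj):
    for v in sorted(adj[u]):
        if u < v:  # keep each undirected edge once
            shared = len(adj[u] & adj[v])
            jac = shared / len(adj[u] | adj[v])
            ovl = shared / min(len(adj[u]), len(adj[v]))
            # (source, destination, jaccard, overlap, src_degree, dst_degree)
            rows.append((u, v, round(jac, 3), round(ovl, 3),
                         len(adj[u]), len(adj[v])))

by_jaccard = sorted(rows, key=lambda r: r[2], reverse=True)
by_overlap = sorted(rows, key=lambda r: r[3], reverse=True)

print(by_jaccard[0])
print(by_overlap[0])
```

In this toy graph Jaccard ranks the two low-degree vertices 1 and 2 first, while the Overlap Coefficient ranks the hub 0 and vertex 3 first, which is the kind of difference the degree columns help explain.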

___
Copyright (c) 2019-2020, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___