# Sørensen Coefficient
----

In this notebook we will explore the Sørensen Coefficient available in cuGraph.
The Sørensen coefficient, often referred to as the Sørensen-Dice coefficient, is used in many fields to measure the similarity between two samples.




| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware        |
| --------------|------------|------------------|-----------------|-----------------------|
| Don Acosta    | 07/19/2023 | created          | 23.08 nightly   | AMPERE A6000 CUDA 11.7  |


## Introduction - Sørensen



### Sørensen Coefficient

The Sørensen Coefficient quantifies the similarity between two samples. It is twice the number of elements common to both sets divided by the sum of the number of elements in each set.

Sørensen coefficient = $\frac{2 \, |A \cap B|}{|A| + |B|}$
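
As a quick sanity check of the formula, here is a minimal plain-Python sketch that evaluates it on two small sets. The function name `sorensen_coefficient_sets` and the sample sets are made up for illustration, and returning `0.0` when both sets are empty is an assumption of this sketch, not something the definition prescribes.

```python
def sorensen_coefficient_sets(a, b):
    """Return 2 * |A ∩ B| / (|A| + |B|) for two collections treated as sets."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    if total == 0:
        # Assumption for this sketch: two empty sets score 0.0
        return 0.0
    return 2 * len(a & b) / total


A = {1, 2, 3, 4}
B = {3, 4, 5}

# Two shared elements (3 and 4), so the score is 2 * 2 / (4 + 3) = 4/7
score = sorensen_coefficient_sets(A, B)
print(score)
```

Identical sets score 1.0 and disjoint sets score 0.0, so the coefficient always falls between 0 and 1.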


To compute the Sørensen coefficient between pairs of vertices in cuGraph, use: <br>
__df = cugraph.sorensen(G, vertex_pair)__

    G: A cugraph.Graph object
    vertex_pair:  A GPU dataframe consisting of two columns representing pairs of
        vertices. 

    Note: if the vertex_pair argument is not provided, the algorithm runs comparisons on ALL pairs in the graph, which can easily balloon runtimes or fail due to memory constraints.

Returns:

    df: cudf.DataFrame with three named columns:
        df["first"]: The first vertex id of each pair.
        df["second"]: The second vertex id of each pair.
        df["sorensen_coeff"]: The Sørensen coefficient computed between the vertex pairs.
<br>
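
In a graph setting, the two sets being compared are the neighbor sets of the two vertices in each pair. The following plain-Python sketch mimics the shape of the result described above (`first`, `second`, `sorensen_coeff`) on a toy graph. It only illustrates the arithmetic and is not cuGraph's GPU implementation; the helper names `neighbor_sets` and `sorensen_for_pairs`, the toy edge list, and the choice to score vertices without neighbors as 0.0 are all assumptions of this sketch.

```python
def neighbor_sets(edges):
    """Build an undirected adjacency map: vertex -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def sorensen_for_pairs(edges, pairs):
    """Score each (first, second) pair by the overlap of their neighbor sets."""
    adj = neighbor_sets(edges)
    rows = []
    for first, second in pairs:
        a = adj.get(first, set())
        b = adj.get(second, set())
        total = len(a) + len(b)
        # Assumption: a pair where neither vertex has neighbors scores 0.0
        score = 2 * len(a & b) / total if total else 0.0
        rows.append({"first": first, "second": second, "sorensen_coeff": score})
    return rows


# Hypothetical toy graph: a triangle 0-1-2 with vertex 3 hanging off vertex 2
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
rows = sorensen_for_pairs(toy_edges, [(0, 1), (0, 3), (2, 3)])
for row in rows:
    print(row)
```

Vertices 0 and 3 are not adjacent, yet they score higher than the adjacent pair (2, 3) because they share a neighbor while 2 and 3 share none. This is why scoring non-adjacent pairs can be informative.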

__References__ 
- https://research.nvidia.com/publication/2017-11_Parallel-Jaccard-and

### Additional Reading
- [Wikipedia: Sørensen-Dice](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)


## Test Data
We will be using the Zachary Karate club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*

<img src="../../img/karate_similarity.png" width="45%"/>

This is a small graph which allows for easy visual inspection to validate results.  

---
# Let's get started!

In [None]:
# Import needed libraries
import cugraph
import cudf
from collections import OrderedDict

----
### Define some Print functions
(the `del` statements are not needed since going out of scope should free the memory; they are just good practice)

In [None]:
# define a function for printing the top most similar vertices
def print_most_similar_sorensen(df):
    
    # highest score in the DataFrame (several pairs may tie for it)
    jmax = df['sorensen_coeff'].max()
    dm = df.query('sorensen_coeff >= @jmax')

    # print every pair that reaches the maximum score
    for i in range(len(dm)):
        print("Vertices " + str(dm['first'].iloc[i]) + " and " +
              str(dm['second'].iloc[i]) + " are most similar with score: "
              + str(dm['sorensen_coeff'].iloc[i]))
    del jmax
    del dm

In [None]:
# define a function for printing Sørensen similar vertices based on a threshold
def print_sorensen_threshold(_d, limit):
    
    filtered = _d.query('sorensen_coeff > @limit')
    
    for i in range(len(filtered)):
        print("Vertices " + str(filtered['first'].iloc[i]) + " and " + 
            str(filtered['second'].iloc[i]) + " are similar with score: " + 
            str(filtered['sorensen_coeff'].iloc[i]))

### Use the cuGraph Datasets API to get the DataFrame containing edge data


In [None]:
# Load the karate club edge list through the cuGraph datasets API
from cugraph.experimental.datasets import karate
gdf = karate.get_edgelist(fetch=True)

In [None]:
# Let's look at the DataFrame. There are three columns and 156 records but weight, the 3rd column, is not used.
gdf.shape

In [None]:
# Look at the first few data records - the output should be three columns: 'src', 'dst' and 'wgt'.
# This is the renumbered version of the data, starting at zero.
# The 3rd column, 'wgt' (weight), is not used.
gdf.head()

### Create a Graph

In [None]:
# create a Graph from the edge list
G = cugraph.from_cudf_edgelist(gdf, source='src', destination='dst', renumber=False)


--- 
# Sørensen coefficient algorithm call

In [None]:
#%%time
# Call cugraph.sorensen_coefficient
jdf = cugraph.sorensen_coefficient(G)
# to compare to the graph image above we will convert the vertices to start with one instead of zero
jdf['first'] += 1
jdf['second'] += 1
print(jdf[jdf['sorensen_coeff'] == jdf['sorensen_coeff'].max()])


In [None]:
# Which two vertices are the most similar?
# the vertex ids in jdf were already shifted to start at one in the previous cell
print_most_similar_sorensen(jdf)


The most similar pair should be 33 and 34.

In [None]:
# Let's look at all similarities over a threshold
print_sorensen_threshold(jdf, 0.5)

Since the algorithm processes each vertex independently, each pair appears twice: once as (x, y) and once as (y, x).

In [None]:
# Since it is a small graph we can print all scores, notice that only vertices that are neighbors are being compared
#
# Before printing, let's get rid of the duplicates (x compared to y is the same as y compared to x).  We will do that
# by performing a query.  Then let's sort the data by score

jdf_s = jdf.query('first < second').sort_values(by='sorensen_coeff', ascending=False)

print_sorensen_threshold(jdf_s, 0.5)

---
# Expanding vertex pair similarity scoring to 2-hop vertex pairs
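
To make the idea of a two-hop pair concrete, here is a small plain-Python sketch that lists the vertex pairs sharing at least one neighbor in a toy path graph. The helper name `two_hop_pairs` and the edge list are hypothetical, and the exact set of pairs returned by `G.get_two_hop_neighbors()` (for example, whether both orderings of a pair appear) may differ from this simplified version.

```python
from itertools import combinations


def two_hop_pairs(edges):
    """Return sorted (a, b) pairs, a < b, that share at least one neighbor."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    pairs = set()
    # Any two neighbors of the same vertex are connected by a path of length two
    for shared in neighbors.values():
        for a, b in combinations(sorted(shared), 2):
            pairs.add((a, b))
    return sorted(pairs)


# Hypothetical path graph 0 - 1 - 2 - 3
path_edges = [(0, 1), (1, 2), (2, 3)]
pairs = two_hop_pairs(path_edges)
print(pairs)
```

Every pair produced this way has a non-empty neighbor intersection, so its Sørensen coefficient is strictly greater than zero.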

In [None]:
# get all two-hop vertex pairs
p = G.get_two_hop_neighbors()

In [None]:
# Compute the Sørensen scores for the two-hop pairs
j2 = cugraph.sorensen_coefficient(G, ebunch=p)

In [None]:
# again to compare to the graph image above we will convert the vertices to start with one instead of zero
j2['first'] += 1
j2['second'] += 1
print_most_similar_sorensen(j2)

---
### It's that easy with cuGraph

Copyright (c) 2023, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___