# Sørensen Coefficient
----

In this notebook we will explore the Sørensen Coefficient available in cuGraph.
The Sørensen coefficient, often referred to as the Sørensen-Dice coefficient, is used in many fields to measure the similarity between two samples.




| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware        |
| --------------|------------|------------------|-----------------|-----------------------|
| Don Acosta    | 06/28/2023 | created          | 23.08 nightly   | DGX Tesla V100 CUDA 11.7  |


## Introduction - Sørensen



### Sørensen Coefficient

The Sørensen Coefficient quantifies the similarity between two samples: it is twice the number of elements common to both sets divided by the sum of the number of elements in each set.

$S = \frac{{2 \cdot |A \cap B|}}{{|A| + |B|}} $
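
As a quick illustration of the formula, here is a minimal pure-Python sketch that works on ordinary sets. The function name `sorensen` and the sample sets are made up for this example and are not part of cuGraph; returning 0.0 for two empty sets is a convention chosen here.

```python
def sorensen(a, b):
    """Sørensen coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return 2 * len(a & b) / (len(a) + len(b))


A = {1, 2, 3, 4}
B = {3, 4, 5}

# Two shared elements (3 and 4), set sizes 4 and 3: 2 * 2 / (4 + 3)
score = sorensen(A, B)
print(score)
```

Identical sets score 1.0 and disjoint sets score 0.0, so the coefficient always falls between 0 and 1.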


To compute the Sørensen coefficient between a set of pairs of vertices connected by an edge in cuGraph use: <br>
__df = cugraph.sorensen(G, vertex_pair)__

    G: A cugraph.Graph object
    vertex_pair:  A GPU dataframe consisting of two columns representing pairs of
        vertices. 

    Note: if the vertex_pair argument is not provided, the algorithm will run comparisons on ALL pairs in the graph which can easily balloon runtimes or fail due to memory constraints.

Returns:

    df: cudf.DataFrame with three named columns:
        df["first"]: The first vertex id of each pair.
        df["second"]: The second vertex id of each pair.
        df["sorensen_coeff"]: The Sørensen coefficient computed between the vertex pairs.
<br>
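
To make the shape of the result concrete, here is a small CPU-only sketch that applies the formula to the neighbor sets of each requested vertex pair and returns rows in the same `first`, `second`, `sorensen_coeff` order. This is an illustration of the definition, not cuGraph's implementation; the helper `sorensen_for_pairs` and the tiny edge list are hypothetical.

```python
from collections import defaultdict


def sorensen_for_pairs(edges, vertex_pairs):
    """Return (first, second, sorensen_coeff) rows for the requested pairs."""
    # Build an undirected adjacency structure: vertex -> set of neighbors
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    rows = []
    for first, second in vertex_pairs:
        n1, n2 = neighbors[first], neighbors[second]
        total = len(n1) + len(n2)
        # Guard against two isolated vertices (no neighbors at all)
        coeff = 2 * len(n1 & n2) / total if total else 0.0
        rows.append((first, second, coeff))
    return rows


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
rows = sorensen_for_pairs(edges, [(0, 1), (0, 3)])
for first, second, coeff in rows:
    print(first, second, coeff)
```

Note that the sets being compared are the neighbor sets of the two vertices, so two vertices can score above zero even when they are not directly connected, as with the pair (0, 3) here.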

__References__ 
- https://research.nvidia.com/publication/2017-11_Parallel-Jaccard-and

### Additional Reading
- [Wikipedia: Sørensen-Dice](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)


### Some notes about vertex IDs...
* cuGraph will automatically renumber graphs to an internal format consisting of a contiguous series of integers starting from 0, and convert back to the original IDs when returning data to the caller. If the vertex IDs of the data are already a contiguous series of integers starting from 0, the auto-renumbering step can be skipped for faster graph creation times.
  * To skip auto-renumbering, set the `renumber` boolean arg to `False` when calling the appropriate graph creation API (e.g., `G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)`).
  * For more advanced renumbering support, see the examples in `structure/renumber.ipynb` and `structure/renumber-2.ipynb`
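
The idea behind renumbering can be sketched in a few lines of plain Python: assign each distinct vertex ID the next free integer, and keep the reverse mapping so results can be translated back. This is only a conceptual sketch; the `renumber` helper below is hypothetical and is not how cuGraph implements the step internally.

```python
def renumber(edges):
    """Map arbitrary vertex IDs to 0..n-1 and return the renumbered edges."""
    to_internal = {}
    for u, v in edges:
        for vertex in (u, v):
            if vertex not in to_internal:
                # Next free contiguous integer, in order of first appearance
                to_internal[vertex] = len(to_internal)

    # Reverse mapping, used to convert results back to the original IDs
    to_external = {i: v for v, i in to_internal.items()}
    renumbered = [(to_internal[u], to_internal[v]) for u, v in edges]
    return renumbered, to_external


edges = [("a", "c"), ("c", "x"), ("x", "a")]
internal_edges, to_external = renumber(edges)
print(internal_edges)
print(to_external)
```

When the IDs are already 0, 1, 2, ... this mapping is the identity, which is why the step can be skipped in that case.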


## Test Data
We will be using the Zachary Karate club dataset:
*W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of
Anthropological Research 33, 452-473 (1977).*

<img src="../../img/zachary_black_lines.png" width="35%"/>

This is a small graph which allows for easy visual inspection to validate results.  

---
# Let's get started!

In [1]:
# Import needed libraries
import cugraph
import cudf
from collections import OrderedDict

----
### Define some Print functions
(the `del` statements are not strictly needed, since the variables are freed when they go out of scope; they are just good practice)

In [2]:
# define a function for printing the top most similar vertices
def print_most_similar_sorensen(df):
    
    jmax = df['sorensen_coeff'].max()
    dm = df.query('sorensen_coeff >= @jmax')    
    
    #find the best
    for i in range(len(dm)):    
        print("Vertices " + str(dm['first'].iloc[i]) + " and " + 
              str(dm['second'].iloc[i]) + " are most similar with score: " 
              + str(dm['sorensen_coeff'].iloc[i]))
    del jmax
    del dm

In [3]:
# define a function for printing Sørensen similar vertices based on a threshold
def print_sorensen_threshold(_d, limit):
    
    filtered = _d.query('sorensen_coeff > @limit')
    
    for i in range(len(filtered)):
        print("Vertices " + str(filtered['first'].iloc[i]) + " and " + 
            str(filtered['second'].iloc[i]) + " are similar with score: " + 
            str(filtered['sorensen_coeff'].iloc[i]))

### Read the data using the cuGraph datasets API
The edge list is loaded through cuGraph's datasets API, so there is no need to read the file and set a delimiter by hand

In [4]:
# Test file  
from cugraph.experimental.datasets import karate
gdf = karate.get_edgelist(fetch=True)

In [5]:
# Let's look at the DataFrame. There should be three columns and 156 records; the weight column is not used.
gdf.shape

(156, 3)

In [6]:
# Look at the first few data records - the output should be three columns: 'src', 'dst' and 'wgt'. The third column, 'wgt' (weight), is not used.
gdf.head()

|   | src | dst | wgt |
|---|-----|-----|-----|
| 0 | 1   | 0   | 1.0 |
| 1 | 2   | 0   | 1.0 |
| 2 | 3   | 0   | 1.0 |
| 3 | 4   | 0   | 1.0 |
| 4 | 5   | 0   | 1.0 |


### Create a Graph

In [7]:
# create a Graph 
G = karate.get_graph()
G = G.to_undirected()

In [8]:
# How many vertices are in the graph?  Remember that Graph is zero based
G.number_of_vertices()

34

_The club has 34 members, and the Graph reports 34 vertices._

cuGraph vertex numbering is zero-based, meaning that the first vertex ID is zero.  This copy of the dataset is also zero-based (vertex 0 appears in the records shown earlier), so the IDs run from 0 to 33 and no extra isolated vertex appears.  The original paper numbers the members 1 to 34, so vertex _n_ here corresponds to member _n + 1_ there.  
Had the data been 1-based, we could have run _renumbering_ on it, or shifted each element with _gdf['src'] = gdf['src'] - 1_ (and likewise for _dst_).

--- 
# Sørensen

In [9]:
#%%time
# Call cugraph.sorensen_coefficient
jdf = cugraph.sorensen_coefficient(G)

In [10]:
# Which two vertices are the most similar?
print_most_similar_sorensen(jdf)

Vertices 32 and 33 are most similar with score: 0.6896552
Vertices 33 and 32 are most similar with score: 0.6896552


In the 1-based numbering of the original dataset, the most similar pair is 33 and 34, which are vertices 32 and 33 here.
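
The top score can be checked by hand from the neighbor sets of vertices 32 and 33. The sketch below assumes the two neighbor lists as transcribed (0-based) from the commonly distributed Karate club edge list; they are not read from the `gdf` object in this notebook, so treat them as an assumption.

```python
# 0-based neighbor sets, transcribed by hand (assumption, not read from gdf)
n32 = {2, 8, 14, 15, 18, 20, 22, 23, 29, 30, 31, 33}
n33 = {8, 9, 13, 14, 15, 18, 19, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32}

common = n32 & n33

# S = 2 * |A ∩ B| / (|A| + |B|)
score = 2 * len(common) / (len(n32) + len(n33))

print(len(common), len(n32), len(n33))  # 10 12 17
print(round(score, 7))                  # 0.6896552
```

With 10 common neighbors and degrees 12 and 17, the score is 20 / 29, which agrees with the value reported by cuGraph.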

In [11]:
# Let's look at all similarities over a threshold
print_sorensen_threshold(jdf, 0.5)

Vertices 0 and 1 are similar with score: 0.56
Vertices 1 and 0 are similar with score: 0.56
Vertices 1 and 3 are similar with score: 0.53333336
Vertices 3 and 1 are similar with score: 0.53333336
Vertices 3 and 7 are similar with score: 0.59999996
Vertices 3 and 13 are similar with score: 0.54545456
Vertices 7 and 3 are similar with score: 0.59999996
Vertices 13 and 3 are similar with score: 0.54545456
Vertices 32 and 33 are similar with score: 0.6896552
Vertices 33 and 32 are similar with score: 0.6896552
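
Every pair in this output appears twice, once in each direction. A minimal sketch of removing the mirrored rows, using plain tuples in place of a DataFrame: keep only rows where the first ID is smaller than the second, then sort by score. The scores below are rounded copies of a few values from the output, used only for illustration.

```python
# (first, second, score) rows; scores rounded for illustration
rows = [
    (0, 1, 0.56), (1, 0, 0.56),
    (3, 7, 0.60), (7, 3, 0.60),
    (32, 33, 0.69), (33, 32, 0.69),
]

# Keep one direction of each symmetric pair
unique = [r for r in rows if r[0] < r[1]]

# Highest score first
unique.sort(key=lambda r: r[2], reverse=True)

for first, second, score in unique:
    print(first, second, score)
```

The filter relies on the score being symmetric, so dropping the mirrored row loses no information.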


In [12]:
# Since it is a small graph we can print all scores, notice that only vertices that are neighbors are being compared
#
# Before printing, let's get rid of the duplicates (x compared to y is the same as y compared to x).  We will do that
# by performing a query.  Then let's sort the data by score

jdf_s = jdf.query('first < second').sort_values(by='sorensen_coeff', ascending=False)

print_sorensen_threshold(jdf_s, 0.0)

Vertices 32 and 33 are similar with score: 0.6896552
Vertices 3 and 7 are similar with score: 0.59999996
Vertices 0 and 1 are similar with score: 0.56
Vertices 3 and 13 are similar with score: 0.54545456
Vertices 1 and 3 are similar with score: 0.53333336
Vertices 2 and 3 are similar with score: 0.5
Vertices 5 and 6 are similar with score: 0.5
Vertices 1 and 7 are similar with score: 0.4615385
Vertices 0 and 3 are similar with score: 0.45454547
Vertices 8 and 30 are similar with score: 0.44444448
Vertices 23 and 29 are similar with score: 0.44444448
Vertices 1 and 13 are similar with score: 0.42857146
Vertices 2 and 7 are similar with score: 0.42857146
Vertices 1 and 2 are similar with score: 0.42105266
Vertices 2 and 13 are similar with score: 0.4
Vertices 0 and 2 are similar with score: 0.38461536
Vertices 8 and 32 are similar with score: 0.3529412
Vertices 4 and 10 are similar with score: 0.3333333
Vertices 5 and 16 are similar with score: 0.3333333
Vertices 6 and 16 are similar wit

---
# Expanding vertex pair similarity scoring to 2-hop vertex pairs
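
Two vertices form a two-hop pair when they share at least one neighbor. The pure-Python sketch below shows that idea on a small path graph; the helper `two_hop_pairs` is hypothetical, and the exact set of pairs returned by cuGraph may differ in ordering or in how it treats pairs that are also directly connected.

```python
from collections import defaultdict


def two_hop_pairs(edges):
    """Return sorted (a, b) pairs of distinct vertices sharing a neighbor."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    pairs = set()
    # Every two distinct neighbors of the same middle vertex are two hops apart
    for middle in list(neighbors):
        for a in neighbors[middle]:
            for b in neighbors[middle]:
                if a != b:
                    pairs.add((a, b))
    return sorted(pairs)


# Path graph 0 - 1 - 2 - 3
edges = [(0, 1), (1, 2), (2, 3)]
pairs = two_hop_pairs(edges)
print(pairs)
```

On the path graph, vertices 0 and 2 share neighbor 1, and vertices 1 and 3 share neighbor 2, so those are the only two-hop pairs (each listed in both directions).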

In [13]:
# get all two-hop vertex pairs
p = G.get_two_hop_neighbors()

In [14]:
# Compute the Sørensen score for the two-hop vertex pairs
j2 = cugraph.sorensen_coefficient(G, ebunch=p)

In [15]:
print_most_similar_sorensen(j2)

Vertices 32 and 33 are most similar with score: 0.6896552
Vertices 33 and 32 are most similar with score: 0.6896552


---
### It's that easy with cuGraph

Copyright (c) 2023, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___