# Renumbering Test

In this notebook, we will use the _renumber_ function to compute new vertex IDs.

Under the covers, cuGraph represents a graph as a matrix in Compressed Sparse Row (CSR) format (see https://en.wikipedia.org/wiki/Sparse_matrix).  The problem with a matrix representation is that there is a row and a column for every possible vertex ID.  Therefore, if the data contains vertex IDs that are non-contiguous, or that start at a large initial value, a lot of memory is spent on empty space.

Another use of renumbering is to convert vertex IDs of some other data type into a contiguous sequence of integer IDs.  This is useful when the dataset contains vertex IDs that are not integers.


| Author Credit                   |    Date         |  Update            | cuGraph Version |  Test Hardware   |
| --------------------------------|-----------------|--------------------|-----------------|------------------|
| Brad Rees and Chuck Hastings    | 08/13/2019      | created            | 0.10            | GV100, CUDA 11.0 |
| Brad Rees                       | 06/22/2020      | updated            | 0.15            | GV100, CUDA 11.0 |
| Don Acosta                      | 08/28/2022      | updated/tested     | 22.10           | TV100, CUDA 11.5 |

## Introduction

Demonstrate creating a graph with renumbering.

Most cuGraph algorithms operate on a CSR representation of a graph.  A CSR representation requires an indices array that is as long as the number of edges and an offsets array that has one entry for every possible vertex id up to the largest one, plus one extra entry.  This makes the size of the offsets array depend entirely on the largest vertex id, not on the number of vertices actually present.  For data sets that have a sparse range of vertex ids, the CSR can be unnecessarily large.  It is easy to construct an example where the memory required for the offsets array exceeds the memory in the GPU (not to mention the performance cost of having a large number of offsets that are empty but still have to be read in order to be skipped).
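
To make the memory argument concrete, here is a minimal pure-Python sketch (not cuGraph's implementation) that builds CSR arrays for the same three-edge cycle, once with sparse vertex ids and once with contiguous ids. The helper `build_csr` and the sample ids are made up for illustration.

```python
def build_csr(edges, num_vertices):
    """Build CSR offsets/indices for a directed edge list (illustrative helper)."""
    # Count the out-degree of each vertex, shifted by one slot.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix-sum the counts so offsets[v] is where vertex v's edges begin.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Place each destination into its source's slice of the indices array.
    indices = [0] * len(edges)
    cursor = offsets[:-1]  # slicing copies, so offsets is left untouched
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return offsets, indices


# The same 3-vertex cycle, first with sparse ids, then with contiguous ids.
sparse_edges = [(2, 900), (900, 905), (905, 2)]
dense_edges = [(0, 1), (1, 2), (2, 0)]

sparse_offsets, sparse_indices = build_csr(sparse_edges, 905 + 1)
dense_offsets, dense_indices = build_csr(dense_edges, 3)

print(len(sparse_indices), len(dense_indices))  # 3 3
print(len(sparse_offsets), len(dense_offsets))  # 907 4
```

Both graphs store three edges, but the sparse ids force an offsets array of 907 entries where 4 would do.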

The renumbering feature allows us to generate unique identifiers for every vertex identified in the input data frame.

Renumbering can happen automatically as part of graph generation.  It can also be done explicitly by the caller.  This notebook provides examples of both techniques.

The fundamental requirement for the user of the renumbering software is to specify how to identify a vertex.  We will refer to this as the *external* vertex identifier.  This will typically be done by specifying a cuDF DataFrame, and then identifying which columns within the DataFrame constitute source vertices and which columns constitute destination vertices.

Let us consider that a vertex is uniquely defined as a tuple of elements from the rows of a cuDF DataFrame.  The primary restriction is that the number of elements in the tuple must be the same for both source vertices and destination vertices, and that the type of each element in the source tuple must be the same as the type of the corresponding element in the destination tuple.  This restriction is natural: the same vertex can appear as a source in one row and as a destination in another, so both must be described by the same kind of tuple.

Renumbering takes the collection of tuples that uniquely identify vertices in the graph, eliminates duplicates, and assigns integer identifiers to the unique tuples.  These integer identifiers are used as *internal* vertex identifiers within the cuGraph software.
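
The idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not cuGraph's algorithm: the function `renumber_edges` and the sample data are hypothetical, and sorting the unique tuples is an arbitrary choice made here to get deterministic ids.

```python
def renumber_edges(sources, destinations):
    """Map external vertex tuples to contiguous internal integer ids."""
    # Collect every tuple seen as a source or a destination, dropping duplicates.
    unique_vertices = sorted(set(sources) | set(destinations))
    # The internal id of a vertex is its position in the de-duplicated list.
    to_internal = {vertex: i for i, vertex in enumerate(unique_vertices)}
    renumbered = [(to_internal[s], to_internal[d])
                  for s, d in zip(sources, destinations)]
    return renumbered, unique_vertices


# Each external vertex is a 2-element tuple (hypothetical sample data).
sources = [("a", 10), ("b", 20), ("c", 30)]
destinations = [("b", 20), ("c", 30), ("a", 10)]

renumbered, internal_to_external = renumber_edges(sources, destinations)
print(renumbered)               # [(0, 1), (1, 2), (2, 0)]
print(internal_to_external[2])  # ('c', 30)
```

The returned list doubles as the reverse map: indexing it with an internal id gives back the external tuple.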


The first step is to import the needed libraries.

In [None]:
import cugraph
import cudf
import socket
import struct
import pandas as pd
import numpy as np
import networkx as nx
from cugraph.structure import NumberMap

# Create some test data

This creates a small circle (a four-vertex cycle) using some IPv4 addresses, storing the columns in a GPU data frame.

This first example renumbers integer columns, so we translate the IPv4 strings into integers.  Each address unpacks to a 32-bit unsigned value, which is then stored in a 64-bit integer column.

In [None]:
source_list = [ '192.168.1.1', '172.217.5.238', '216.228.121.209', '192.16.31.23' ]
dest_list = [ '172.217.5.238', '216.228.121.209', '192.16.31.23', '192.168.1.1' ]
source_as_int = [ struct.unpack('!L', socket.inet_aton(x))[0] for x in source_list ]
dest_as_int = [ struct.unpack('!L', socket.inet_aton(x))[0] for x in dest_list ]


print("sources came from: " + str([ socket.inet_ntoa(struct.pack('!L', x)) for x in source_as_int ]))
print("  sources as int = " + str(source_as_int))
print("destinations came from: " + str([ socket.inet_ntoa(struct.pack('!L', x)) for x in dest_as_int ]))
print("  destinations as int = " + str(dest_as_int))
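
As a cross-check, the standard library's `ipaddress` module performs the same conversion. The sketch below is an alternative to the `socket`/`struct` approach used in the cell above, not part of the original notebook; it round-trips one of the sample addresses.

```python
import ipaddress

address = "192.168.1.1"

# IPv4Address converts to the same big-endian 32-bit integer as struct.unpack('!L', ...).
as_int = int(ipaddress.IPv4Address(address))
print(as_int)  # 3232235777

# Constructing an IPv4Address from the integer converts back to dotted-quad notation.
round_trip = str(ipaddress.IPv4Address(as_int))
print(round_trip)  # 192.168.1.1
```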


# Create our GPU Dataframe

In [None]:
df = pd.DataFrame({
        'source_list': source_list,
        'dest_list': dest_list,
        'source_as_int': source_as_int,
        'dest_as_int': dest_as_int
        })

gdf = cudf.DataFrame.from_pandas(df[['source_as_int', 'dest_as_int']])

gdf.to_pandas()

# Run renumbering

The output from renumbering is a data frame and a NumberMap object.  The data frame contains the renumbered sources and destinations.  The NumberMap allows you to translate between external and internal vertex identifiers.  The renumbering call renames the specified source and destination columns to indicate that they were renumbered and no longer contain the original data.  The new names are guaranteed to be unique and not to collide with other column names.

Note that renumbering does not guarantee that the output data frame is in the same order as the input data frame (although in our simple example it will match).  To address this we will add the index as a column of gdf before renumbering.


In [None]:
gdf['order'] = gdf.index

renumbered_df, numbering = NumberMap.renumber(gdf, ['source_as_int'], ['dest_as_int'])
new_src_col_name = numbering.renumbered_src_col_name
new_dst_col_name = numbering.renumbered_dst_col_name

renumbered_df

# Now combine renumbered df with original df

We can use the order column to merge the data frames together.

In [None]:
renumbered_df = renumbered_df.merge(gdf, on='order').sort_values('order').reset_index(drop=True)

renumbered_df
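
The same reordering idea can be shown without a GPU. In this pure-Python sketch, the rows, column names, and ids are hypothetical stand-ins for the data frames above: the renumbered rows arrive shuffled, and the `order` key is used to join them to the original rows and restore the input order.

```python
# Original rows, tagged with their position before renumbering.
original = [
    {"order": 0, "src": 1000, "dst": 2000},
    {"order": 1, "src": 2000, "dst": 3000},
    {"order": 2, "src": 3000, "dst": 4000},
    {"order": 3, "src": 4000, "dst": 1000},
]

# Renumbered rows, deliberately listed in a different order.
renumbered = [
    {"order": 2, "renumbered_src": 2, "renumbered_dst": 3},
    {"order": 0, "renumbered_src": 0, "renumbered_dst": 1},
    {"order": 3, "renumbered_src": 3, "renumbered_dst": 0},
    {"order": 1, "renumbered_src": 1, "renumbered_dst": 2},
]

# Join on the 'order' key, then sort to recover the input order.
by_order = {row["order"]: row for row in original}
merged = sorted(
    ({**row, **by_order[row["order"]]} for row in renumbered),
    key=lambda row: row["order"],
)

for row in merged:
    print(row["order"], row["src"], row["dst"],
          row["renumbered_src"], row["renumbered_dst"])
```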

# Data types

Just to confirm: the data types of the renumbered columns should be int32, and the original data should be int64.  The numbering map needs to be int64, since the values it contains map to the original int64 types.

In [None]:
renumbered_df.dtypes

# Quick verification

The NumberMap object allows us to translate back and forth between *external* vertex identifiers and *internal* vertex identifiers.

To understand the renumbering, here's an ugly block of verification logic.

In [None]:
numbering.from_internal_vertex_id(cudf.Series([0]))['0'][0]

for i in range(len(renumbered_df)):
    print(" ", i,
          ": (",  source_as_int[i], ",", dest_as_int[i],
          "), renumbered: (", renumbered_df[new_src_col_name][i], ",", renumbered_df[new_dst_col_name][i], 
          "), translate back: (",
          numbering.from_internal_vertex_id(cudf.Series([renumbered_df[new_src_col_name][i]]))['0'][0], ",",
          numbering.from_internal_vertex_id(cudf.Series([renumbered_df[new_dst_col_name][i]]))['0'][0], ")"
         )


# Now let's do some graph things...

To start, let's run PageRank.  This is not particularly interesting on our circle, since every vertex should have an equal rank.

Note, we passed in the renumbered columns as our input, so the output is based upon the internal vertex ids.

In [None]:
G = cugraph.Graph()
gdf_r = cudf.DataFrame()
gdf_r["src"] = renumbered_df[new_src_col_name]
gdf_r["dst"] = renumbered_df[new_dst_col_name]

G.from_cudf_edgelist(gdf_r, source='src', destination='dst', renumber=False)

pr = cugraph.pagerank(G)

pr.to_pandas()
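
To see why every vertex ends up with the same rank, here is a small power-iteration sketch in plain Python. It is not how cuGraph computes PageRank; the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration, and the helper assumes every vertex has at least one outgoing edge.

```python
def pagerank(edges, num_vertices, damping=0.85, iterations=50):
    """Power-iteration PageRank for a graph where every vertex has out-edges."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution.
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Every vertex receives the teleport share ...
        new_rank = [(1.0 - damping) / num_vertices] * num_vertices
        # ... plus an equal split of the rank of each vertex that points to it.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# The four-vertex circle, using internal ids 0..3.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
ranks = pagerank(cycle_edges, 4)
print(ranks)
```

Because each vertex has exactly one incoming and one outgoing edge, the uniform distribution is already a fixed point, so each rank stays at approximately 0.25.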

# Convert vertex IDs back

To be useful, the results should have the vertex ids converted back into the original ids.  This can be done with the NumberMap object.

Note again, the un-renumber call does not guarantee order.  If order matters you would need to do something to regenerate the desired order.

In [None]:
numbering.unrenumber(pr, 'vertex')

# Try to run Jaccard

This is not at all an interesting result, but it demonstrates a more complicated case.  Jaccard returns a coefficient for each edge.  In order to show the original ids, we need to add a column to the data frame for each column that contains renumbered vertices.  In this case, the columns `first` and `second` contain renumbered vertex ids.

In [None]:
jac = cugraph.jaccard(G)

jac = numbering.unrenumber(jac, 'first')
jac = numbering.unrenumber(jac, 'second')

jac.insert(len(jac.columns),
           "original_source",
           [ socket.inet_ntoa(struct.pack('!L', x)) for x in jac['first'].values_host ])

jac.insert(len(jac.columns),
           "original_destination",
           [ socket.inet_ntoa(struct.pack('!L', x)) for x in jac['second'].values_host ])

jac.to_pandas()
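
For intuition, the Jaccard coefficient of two vertices is the size of the intersection of their neighbor sets divided by the size of the union. The sketch below computes it in plain Python for the four-vertex circle, treating the edges as undirected (an assumption made for this illustration; it does not reproduce cuGraph's implementation).

```python
def jaccard(neighbors, u, v):
    """Jaccard coefficient of the neighbor sets of u and v."""
    union = neighbors[u] | neighbors[v]
    if not union:
        return 0.0
    return len(neighbors[u] & neighbors[v]) / len(union)


# The four-vertex circle, using internal ids 0..3.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Build undirected adjacency sets.
neighbors = {v: set() for v in range(4)}
for src, dst in cycle_edges:
    neighbors[src].add(dst)
    neighbors[dst].add(src)

# Adjacent vertices on the circle share no neighbors.
edge_scores = [jaccard(neighbors, s, d) for s, d in cycle_edges]
print(edge_scores)               # [0.0, 0.0, 0.0, 0.0]

# Opposite vertices share both of their neighbors.
print(jaccard(neighbors, 0, 2))  # 1.0
```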


# Working from the strings

Starting with version 0.15, the base renumbering feature supports columns of arbitrary type, so we can now work with strings directly.

Renumbering also happens automatically in the graph.  So let's combine all of this into a simpler example with the same data.

In [None]:
gdf = cudf.DataFrame.from_pandas(df[['source_list', 'dest_list']])

G = cugraph.Graph()
G.from_cudf_edgelist(gdf, source='source_list', destination='dest_list', renumber=True)

pr = cugraph.pagerank(G)

print('pagerank output:\n', pr)

jac = cugraph.jaccard(G)

print('jaccard output:\n', jac)


___
Copyright (c) 2019-2022, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___