# Renumbering Test

Demonstrate creating a graph with renumbering.

Most cugraph algorithms operate on a CSR representation of a graph.  A CSR representation requires an indices array that is as long as the number of edges and an offsets array whose length is determined by the largest vertex id (one entry for each id from 0 through the largest id, plus one final end marker).  This makes the memory used by the offsets array depend entirely on the largest vertex id.  For data sets that have a sparse range of vertex ids, the CSR can be unnecessarily large.  It is easy to construct an example where the memory required for the offsets array exceeds the memory available on the GPU (not to mention the performance cost of a large number of empty offsets that still have to be read in order to be skipped).
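To make the memory argument concrete, here is a small back-of-the-envelope sketch in plain Python.  It does not use cugraph; it only does the arithmetic for the offsets array, counting one slot per id from 0 through the largest id plus one end marker.  The helper name `csr_offsets_length` and the 4-byte offset size are assumptions made for illustration, and the edge list reuses the IPv4-derived ids from the test data in this notebook.

```python
# Vertex ids drawn from a sparse range (IPv4 addresses converted to integers).
edges = [
    (3232235777, 2899903982),
    (2899903982, 3638852049),
    (3638852049, 3222282007),
    (3222282007, 3232235777),
]

BYTES_PER_OFFSET = 4  # assumption for illustration: 32-bit offsets


def csr_offsets_length(edge_list):
    """Hypothetical helper: offsets length when raw vertex ids index the array."""
    largest_id = max(max(src, dst) for src, dst in edge_list)
    # One slot for every id in 0..largest_id, plus one end marker.
    return largest_id + 2


# Offsets array sized by the largest raw id.
sparse_len = csr_offsets_length(edges)

# Offsets array sized by the number of distinct vertices (dense ids 0..n-1).
unique_ids = {v for edge in edges for v in edge}
dense_len = len(unique_ids) + 1

print("raw ids:  ", sparse_len, "offsets,", sparse_len * BYTES_PER_OFFSET, "bytes")
print("dense ids:", dense_len, "offsets,", dense_len * BYTES_PER_OFFSET, "bytes")
```

Under these assumptions, a graph with only four edges would need an offsets array of several billion entries when indexed by raw id, versus five entries when the ids are densely packed.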

The cugraph renumbering feature allows us to take two columns of any integer type and translate them into a densely packed contiguous array numbered from 0 to (num_unique_values - 1).  These renumbered vertices can be used to create a graph much more efficiently.
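The idea can be sketched in a few lines of plain Python.  This is only an illustration of the concept, not cugraph's implementation: `renumber_sketch` is a hypothetical helper, and it assigns new ids in first-seen order, which will generally differ from the order cugraph chooses.

```python
def renumber_sketch(sources, destinations):
    """Hypothetical stand-in: map arbitrary integer ids to dense ids 0..n-1."""
    new_id = {}      # original id -> new dense id
    numbering = []   # new dense id -> original id

    def lookup(value):
        # Assign the next free dense id the first time a value is seen.
        if value not in new_id:
            new_id[value] = len(numbering)
            numbering.append(value)
        return new_id[value]

    src_r = [lookup(v) for v in sources]
    dst_r = [lookup(v) for v in destinations]
    return src_r, dst_r, numbering


sources = [3232235777, 2899903982, 3638852049, 3222282007]
destinations = [2899903982, 3638852049, 3222282007, 3232235777]

src_r, dst_r, numbering = renumber_sketch(sources, destinations)
print(src_r, dst_r, numbering)
```

The returned `numbering` list plays the same role as the numbering map described in this notebook: indexing it with a new id gives back the original id.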

Another feature of the renumbering function is that it can take vertex ids that are 64-bit values and map them down into a range that fits into 32-bit integers.  The current cugraph algorithms are limited to 32-bit signed integers as vertex ids, and the renumbering feature allows the caller to translate 64-bit ids into a densely packed 32-bit array of ids that can be used in cugraph algorithms.  Note that if there are more than 2^31 - 1 unique vertex ids then the renumber method will fail with an error indicating that there are too many vertices to renumber into a 32-bit signed integer.
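As a rough illustration of the limit, the following plain-Python sketch checks two things: whether raw ids already fit in a signed 32-bit integer, and whether the number of unique ids is small enough to be renumbered.  Both helper names are hypothetical, and the `ValueError` only mimics the kind of failure described above; it is not the error cugraph raises.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def fits_in_int32(values):
    """True if every value can be stored directly as a signed 32-bit integer."""
    return all(0 <= v <= INT32_MAX for v in values)


def check_renumberable(values):
    """Hypothetical check: are there few enough unique ids for 32-bit renumbering?"""
    unique_count = len(set(values))
    if unique_count > INT32_MAX:
        raise ValueError("too many vertices to renumber into a 32-bit signed integer")
    return unique_count


ids = [3232235777, 2899903982, 3638852049, 3222282007]

print(fits_in_int32(ids))       # the raw ids are too large for int32
print(check_renumberable(ids))  # but there are only a handful of unique ids
```

The raw ids here exceed the signed 32-bit range even though there are only four of them, which is exactly the situation renumbering is meant to handle.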

Note that this version (0.7) is limited to integer types.  The intention is to extend the renumbering function to be able to handle strings and other types.

The first step is to import the needed libraries.

In [1]:
import cugraph
import cudf
import socket
import struct
import pandas as pd
import numpy as np
import networkx as nx


# Create some test data

This creates a small circle using some IPv4 addresses, storing the columns in a GPU data frame.

The current version of renumbering operates only on integer types, so we translate the IPv4 strings into 64-bit integers.

In [2]:
source_list = [ '192.168.1.1', '172.217.5.238', '216.228.121.209', '192.16.31.23' ]
dest_list = [ '172.217.5.238', '216.228.121.209', '192.16.31.23', '192.168.1.1' ]
source_as_int = [ struct.unpack('!L', socket.inet_aton(x))[0] for x in source_list ]
dest_as_int = [ struct.unpack('!L', socket.inet_aton(x))[0] for x in dest_list ]


print("sources came from: " + str([ socket.inet_ntoa(struct.pack('!L', x)) for x in source_as_int ]))
print("  sources as int = " + str(source_as_int))
print("destinations came from: " + str([ socket.inet_ntoa(struct.pack('!L', x)) for x in dest_as_int ]))
print("  destinations as int = " + str(dest_as_int))


sources came from: ['192.168.1.1', '172.217.5.238', '216.228.121.209', '192.16.31.23']
  sources as int = [3232235777, 2899903982, 3638852049, 3222282007]
destinations came from: ['172.217.5.238', '216.228.121.209', '192.16.31.23', '192.168.1.1']
  destinations as int = [2899903982, 3638852049, 3222282007, 3232235777]
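For comparison, the same conversion can be written with Python's standard `ipaddress` module.  This is an alternative sketch, not what the notebook uses; the helper names `ip_to_int` and `int_to_ip` are made up for illustration, and the loop cross-checks them against the `socket`/`struct` approach shown above.

```python
import ipaddress
import socket
import struct


def ip_to_int(address):
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(address))


def int_to_ip(value):
    """Convert an integer back to a dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(value))


source_list = ['192.168.1.1', '172.217.5.238', '216.228.121.209', '192.16.31.23']

for ip in source_list:
    # Both approaches read the address as a big-endian unsigned 32-bit value.
    packed = struct.unpack('!L', socket.inet_aton(ip))[0]
    assert ip_to_int(ip) == packed
    assert int_to_ip(packed) == ip
    print(ip, "->", packed)
```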


# Create our GPU data frame

In [3]:
df = pd.DataFrame({
        'source_list': source_list,
        'dest_list': dest_list,
        'source_as_int': source_as_int,
        'dest_as_int': dest_as_int
        })

gdf = cudf.DataFrame.from_pandas(df[['source_as_int', 'dest_as_int']])

gdf.to_pandas()

Unnamed: 0,source_as_int,dest_as_int
0,3232235777,2899903982
1,2899903982,3638852049
2,3638852049,3222282007
3,3222282007,3232235777


# Run renumbering

The current version of renumbering takes a column of source vertex ids and a column of dest vertex ids.  As mentioned above, these must be integer columns.

Output from renumbering is 3 cudf.Series structures representing the renumbered sources, the renumbered destinations and the numbering map which maps the new ids back to the original ids.

In this case,
 * gdf['source_as_int'] is a column of type int64
 * gdf['dest_as_int'] is a column of type int64
 * src_r will be a series of type int32 (we translate back to 32-bit integers)
 * dst_r will be a series of type int32
 * numbering will be a series of type int64 that translates the elements of src and dst back to their original 64-bit values
 
Note that because renumbering translates the ids to 32-bit integers, passing more than 2^31 - 1 unique 64-bit values in the source/dest columns would exceed the range of a 32-bit signed integer, so you will get an error from the renumber call.

In [4]:
src_r, dst_r, numbering = cugraph.renumber(gdf['source_as_int'], gdf['dest_as_int'])

gdf.add_column("original id", numbering)
gdf.add_column("src_renumbered", src_r)
gdf.add_column("dst_renumbered", dst_r)

gdf.to_pandas()

Unnamed: 0,source_as_int,dest_as_int,original id,src_renumbered,dst_renumbered
0,3232235777,2899903982,3638852049,1,2
1,2899903982,3638852049,3232235777,2,0
2,3638852049,3222282007,2899903982,0,3
3,3222282007,3232235777,3222282007,3,1


# Data types

Just to confirm: the renumbered columns should be int32 and the original data should be int64.  The numbering map also needs to be int64, since the values it contains are the original int64 ids.

In [5]:
gdf.dtypes

source_as_int     int64
dest_as_int       int64
original id       int64
src_renumbered    int32
dst_renumbered    int32
dtype: object

# Quick verification

To understand the renumbering, here's a block of verification logic.  In the renumbered series we created a new id for each unique value in the original series.  The numbering map identifies that mapping.  For any vertex id X in the new numbering, numbering[X] should refer to the original value.

In [6]:
for i in range(len(src_r)):
    print(" " + str(i) +
          ": (" + str(source_as_int[i]) + "," + str(dest_as_int[i]) +")"
          ", renumbered: (" + str(src_r[i]) + "," + str(dst_r[i]) +")"
          ", translate back: (" + str(numbering[src_r[i]]) + "," + str(numbering[dst_r[i]]) +")"
         )


 0: (3232235777,2899903982), renumbered: (1,2), translate back: (3232235777,2899903982)
 1: (2899903982,3638852049), renumbered: (2,0), translate back: (2899903982,3638852049)
 2: (3638852049,3222282007), renumbered: (0,3), translate back: (3638852049,3222282007)
 3: (3222282007,3232235777), renumbered: (3,1), translate back: (3222282007,3232235777)
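The same check can be expressed as assertions instead of printed output.  The sketch below runs without a GPU by copying the values shown in the output above into plain Python lists; on real data the lists would come from the `cugraph.renumber` result instead.

```python
# Original ids, as created earlier in the notebook.
source_as_int = [3232235777, 2899903982, 3638852049, 3222282007]
dest_as_int = [2899903982, 3638852049, 3222282007, 3232235777]

# Renumbered values copied from the output above (plain lists for illustration).
src_r = [1, 2, 0, 3]
dst_r = [2, 0, 3, 1]
numbering = [3638852049, 3232235777, 2899903982, 3222282007]

# Every new id must translate back to the original id in the same row.
for i in range(len(src_r)):
    assert numbering[src_r[i]] == source_as_int[i]
    assert numbering[dst_r[i]] == dest_as_int[i]

# The new ids should be densely packed: exactly 0 .. num_unique_values - 1.
new_ids = sorted(set(src_r) | set(dst_r))
assert new_ids == list(range(len(numbering)))

print("renumbering round-trips for", len(src_r), "edges")
```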


# Now let's do some graph things...

To start, let's run PageRank.  The result is not particularly interesting on our circle, since every vertex should have an equal rank.

In [7]:
G = cugraph.Graph()
G.add_edge_list(src_r, dst_r)

pr = cugraph.pagerank(G)

pr.add_column("original id", numbering)
pr.to_pandas()


Unnamed: 0,vertex,pagerank,original id
0,0,0.25,3638852049
1,1,0.25,3232235777
2,2,0.25,2899903982
3,3,0.25,3222282007
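To see why every vertex ends up with a rank of 0.25, here is a simplified power-iteration sketch in plain Python.  It is not cugraph's implementation: the function name `pagerank_sketch`, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration, and the sketch assumes every vertex has at least one outgoing edge.

```python
def pagerank_sketch(edges, num_vertices, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (no handling of dangling vertices)."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution over the vertices.
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Every vertex receives the same teleport share...
        new_rank = [(1.0 - damping) / num_vertices] * num_vertices
        # ...plus a damped share of the rank of each vertex that links to it.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# The renumbered edges of the circle used in this notebook.
cycle_edges = [(1, 2), (2, 0), (0, 3), (3, 1)]
ranks = pagerank_sketch(cycle_edges, num_vertices=4)
print(ranks)
```

On a circle each vertex has exactly one incoming and one outgoing edge, so the uniform distribution is already a fixed point of the update and no vertex can pull ahead of the others.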


# Try to run jaccard

Not at all an interesting result, but it demonstrates a more complicated case.  Jaccard returns a coefficient for each edge.  To show the original ids, we need to add a column to the data frame for each column that contains renumbered vertices.  In this case, the columns source and destination contain renumbered vertex ids.

In [8]:
jac = cugraph.jaccard(G)

jac.add_column("original_source",
               [ socket.inet_ntoa(struct.pack('!L', numbering[x])) for x in jac['source'] ])

jac.add_column("original_destination",
               [ socket.inet_ntoa(struct.pack('!L', numbering[x])) for x in jac['destination'] ])

jac.to_pandas()


Unnamed: 0,source,destination,jaccard_coeff,original_source,original_destination
0,0,3,0.0,216.228.121.209,192.16.31.23
1,1,2,0.0,192.168.1.1,172.217.5.238
2,2,0,0.0,172.217.5.238,216.228.121.209
3,3,1,0.0,192.16.31.23,192.168.1.1
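The zero coefficients can be reproduced by hand.  The sketch below computes a Jaccard coefficient per edge from out-neighbor sets in plain Python, then translates both endpoint columns back to IPv4 strings through the numbering map.  Using out-neighbors is an assumption about how the neighborhoods are defined (for this circle, undirected neighborhoods also give 0.0), and the `numbering` values are copied from the renumbering output earlier in the notebook.

```python
import socket
import struct

# Renumbered edges and the numbering map, copied from earlier output.
edges = [(1, 2), (2, 0), (0, 3), (3, 1)]
numbering = [3638852049, 3232235777, 2899903982, 3222282007]

# Build the out-neighbor set of every vertex.
neighbors = {}
for src, dst in edges:
    neighbors.setdefault(src, set()).add(dst)
    neighbors.setdefault(dst, set())


def jaccard(u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, defined as 0.0 when both sets are empty."""
    union = neighbors[u] | neighbors[v]
    if not union:
        return 0.0
    return len(neighbors[u] & neighbors[v]) / len(union)


def to_ip(vertex):
    """Translate a renumbered vertex id back to its original IPv4 string."""
    return socket.inet_ntoa(struct.pack('!L', numbering[vertex]))


# One row per edge, with a translated column for each renumbered column.
rows = [(src, dst, jaccard(src, dst), to_ip(src), to_ip(dst))
        for src, dst in sorted(edges)]
for row in rows:
    print(row)
```

No two adjacent vertices on the circle share a neighbor, so every intersection is empty and every coefficient is 0.0.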


___
Copyright (c) 2019, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___