# Renumber

In this notebook, we will use the _renumber_ function to compute new vertex IDs.

Under the covers, cuGraph represents a graph as a matrix in Compressed Sparse Row format (see https://en.wikipedia.org/wiki/Sparse_matrix).  The problem with a matrix representation is that there is a row and a column for every possible vertex ID.  Therefore, if the data contains vertex IDs that are non-contiguous, or that start at a large initial value, a lot of memory is spent on empty space.
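
To make the memory cost concrete, here is a minimal pure-Python sketch of building a CSR row-offsets array from an edge list. It is not cuGraph's actual implementation, and the helper name `csr_offsets` is hypothetical. The point is only that the offsets array needs one slot per possible vertex ID, so its length follows the largest ID rather than the number of vertices actually present.

```python
def csr_offsets(sources, num_vertices):
    """Return the CSR row-offsets array for the given source vertex IDs."""
    counts = [0] * num_vertices
    for s in sources:
        counts[s] += 1
    # offsets[v] is where the edges of vertex v start in the column array
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]
    return offsets

# Three edges between vertices with large, sparse IDs (illustrative values).
src = [1000, 1000, 5000]
dst = [5000, 9000, 9000]
sparse = csr_offsets(src, max(src + dst) + 1)

# The same edges after renumbering 1000 -> 0, 5000 -> 1, 9000 -> 2.
src_r = [0, 0, 1]
dense = csr_offsets(src_r, 3)

print(len(sparse))  # 9002 offset entries for only 3 distinct vertices
print(dense)        # [0, 2, 3, 3]
```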

Another use case is renumbering to convert from a different data type down to a contiguous sequence of integer IDs.  This is useful when the dataset contains vertex IDs that are not integers.


Notebook Credits
* Original Authors: Bradley Rees
* Created:   08/13/2019
* Updated:   07/08/2020

RAPIDS Versions: 0.13    

Test Hardware

* GV100 32G, CUDA 10.2


## Introduction
The renumber function takes an edge list (source, destination) and renumbers the vertices so that the IDs start at 0 and are contiguous.  The function also converts the data type and returns int32 values.

To renumber an edge list (COO data) use:<br>

**cugraph.renumber(source, destination)**
* __source__: cudf.Series
* __destination__: cudf.Series


Returns:
* __triplet__: three variables are returned:
    * 'src': the new source vertex IDs
    * 'dst': the new destination IDs
    * 'mapping': a mapping of new IDs to original IDs.  Since the new IDs are sequential from 0, the index value represents the new vertex ID
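
The return values described above can be illustrated with a small pure-Python sketch. It is not the cuGraph implementation, and the function name `renumber_sketch` is made up for this example. The order in which new IDs are assigned here (first appearance) is also an assumption and may differ from what cuGraph produces.

```python
def renumber_sketch(source, destination):
    """Assign contiguous IDs starting at 0 to every distinct vertex.

    Returns (src, dst, mapping), where mapping[new_id] is the original ID.
    """
    new_id = {}    # original ID -> new ID
    mapping = []   # new ID -> original ID (the list index is the new ID)

    def lookup(v):
        if v not in new_id:
            new_id[v] = len(mapping)
            mapping.append(v)
        return new_id[v]

    src, dst = [], []
    # Walk the edges in order so IDs are assigned by first appearance
    for s, d in zip(source, destination):
        src.append(lookup(s))
        dst.append(lookup(d))
    return src, dst, mapping

# Vertex IDs may be any hashable type, for example strings.
src, dst, mapping = renumber_sketch(["a", "a", "c"], ["c", "z", "z"])
print(src)      # [0, 0, 1]
print(dst)      # [1, 2, 2]
print(mapping)  # ['a', 'c', 'z']
```

Because `mapping` is indexed by the new ID, looking up `mapping[2]` recovers the original vertex `'z'`.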




### Test Data
A cyber data set from the University of New South Wales is used, from which just the IP edge pairs have been extracted.

### Prep

In [1]:
# Import needed libraries
import cugraph
import cudf
import cupy as cp

from cugraph.structure import NumberMap

In [3]:
# Read the data
# the file contains an index column that will be ignored

datafile='../data/cyber.csv'

gdf = cudf.read_csv(datafile, delimiter=',', names=['idx','srcip','dstip'], dtype=['int32','str', 'str'], skiprows=1, usecols=['srcip', 'dstip'] )

### Look at the data

In [4]:
# take a peek at the data
gdf.head()

| | srcip | dstip |
|---|---|---|
| 0 | 59.166.0.0 | 149.171.126.6 |
| 1 | 59.166.0.0 | 149.171.126.9 |
| 2 | 59.166.0.6 | 149.171.126.7 |
| 3 | 59.166.0.5 | 149.171.126.5 |
| 4 | 59.166.0.3 | 149.171.126.0 |


In [5]:
# Since IP columns are strings, we first need to convert them to integers
gdf['src_ip'] = gdf['srcip'].str.ip2int()
gdf['dst_ip'] = gdf['dstip'].str.ip2int()
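
For readers without a GPU at hand, the conversion can be approximated with Python's standard `ipaddress` module. This sketch assumes that `ip2int` yields the usual 32-bit big-endian interpretation of a dotted-quad IPv4 address. The helper name `ip_to_int` is hypothetical.

```python
import ipaddress

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))

print(ip_to_int("59.166.0.0"))     # 1000734720
print(ip_to_int("149.171.126.6"))  # 2511044102

# Equivalent arithmetic: each octet is one byte of the integer.
a, b, c, d = (int(part) for part in "59.166.0.0".split("."))
assert (a << 24) | (b << 16) | (c << 8) | d == ip_to_int("59.166.0.0")
```

Both values agree with `src_ip` and `dst_ip` entries that appear in the notebook output further down.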

In [6]:
# look at that data and the range of values
maxT = max(gdf['src_ip'].max(), gdf['dst_ip'].max())
minT = min(gdf['src_ip'].min(), gdf['dst_ip'].min())

r = maxT - minT +1
print("edges: " + str(len(gdf)))
print("max: " + str(maxT) + " min: " + str(minT) + " range: " + str(r))

edges: 2546575
max: 3758096389 min: 59 range: 3758096331


The data has about 2.5 million edges whose vertex IDs span a range of 3,758,096,331 values.
Even if every vertex ID were unique per edge, that would only be about 5 million values versus the 3.7 billion currently covered.  
In the current state, the produced matrix would be 3.7 billion by 3.7 billion, which is a lot of wasted space.
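
The arithmetic behind that estimate, using the numbers printed above, can be checked in a few lines of plain Python. The 4-bytes-per-entry figure is an assumption (one 32-bit entry per possible vertex ID), used only to give a sense of scale.

```python
edges = 2546575
max_id, min_id = 3758096389, 59

id_range = max_id - min_id + 1   # size of the ID range
max_vertices = 2 * edges         # upper bound: every endpoint is distinct

print(id_range)      # 3758096331
print(max_vertices)  # 5093150

# Share of the ID range that could possibly be in use.
print(f"{max_vertices / id_range:.4%}")

# Rough size of one array with a 4-byte entry per possible ID (assumed width).
print(f"{id_range * 4 / 2**30:.1f} GiB")
```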

### Time to Renumber
A good practice is to append the renumbered edge pairs to the original DataFrame. That makes it easier to merge results back into the dataset.

In [7]:
gdf['order'] = gdf.index

tmp_df, numbering = NumberMap.renumber(gdf, ['src_ip'], ['dst_ip'])
new_src_col_name = numbering.renumbered_src_col_name
new_dst_col_name = numbering.renumbered_dst_col_name

gdf = gdf.merge(tmp_df, on='order').sort_values('order').set_index(keys='order', drop=True)
gdf = gdf.rename(columns={new_src_col_name: 'src_r', new_dst_col_name: 'dst_r'})
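
Keeping the renumbered columns next to the originals pays off when an algorithm returns results keyed by the new IDs. The following pure-Python sketch illustrates the idea with made-up data. It does not call cuGraph, and the names `mapping` and `out_degree` are placeholders for whatever a real workflow produces.

```python
from collections import Counter

# Renumbered edges plus the new-to-original lookup for three vertices.
# Values are illustrative; mapping[new_id] gives the original ID.
mapping = [1000734720, 1000734726, 2511044102]
src_r = [0, 0, 1]
dst_r = [2, 1, 2]

# Some per-vertex result computed on the renumbered graph, e.g. out-degree.
out_degree = Counter(src_r)

# Translate the result keys back to the original vertex IDs.
by_original = {mapping[v]: out_degree.get(v, 0) for v in range(len(mapping))}
print(by_original)
# {1000734720: 2, 1000734726: 1, 2511044102: 0}
```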

In [8]:
gdf.head()

| order | srcip_x | dstip_x | src_ip | dst_ip | srcip_y | dstip_y | src_r | dst_r |
|---|---|---|---|---|---|---|---|---|
| 0 | 59.166.0.0 | 149.171.126.6 | 59 | 2511044102 | 59.166.0.0 | 149.171.126.6 | 43 | 30 |
| 1 | 59.166.0.0 | 149.171.126.9 | 1000734720 | 2511044105 | 59.166.0.0 | 149.171.126.9 | 4 | 33 |
| 2 | 59.166.0.6 | 149.171.126.7 | 1000734726 | 2511044103 | 59.166.0.6 | 149.171.126.7 | 7 | 34 |
| 3 | 59.166.0.5 | 149.171.126.5 | 1000734725 | 2511044101 | 59.166.0.5 | 149.171.126.5 | 2 | 26 |
| 4 | 59.166.0.3 | 149.171.126.0 | 1000734723 | 2511044096 | 59.166.0.3 | 149.171.126.0 | 5 | 35 |


Let's now look at the renumbered range of values

In [9]:
# look at that data and the range of values
maxT = max(gdf['src_r'].max(), gdf['dst_r'].max())
minT = min(gdf['src_r'].min(), gdf['dst_r'].min())

r = maxT - minT + 1
print("edges: " + str(len(gdf)))
print("max: " + str(maxT) + " min: " + str(minT) + " range: " + str(r))

edges: 2546575
max: 51 min: 0 range: 52


We just saved about 3.7 billion unneeded rows and columns in the matrix!<br>
We can also see that there are only 52 unique IP addresses in the dataset.<br>
Let's confirm the number of unique values.

In [10]:
# Merge the renumbered columns
src, dst = gdf['src_r'].to_cupy(), gdf['dst_r'].to_cupy()
merged = cp.concatenate((src, dst))

print("Unique IPs: " + str(len(cp.unique(merged))))

Unique IPs: 52


As we can see, the values match!
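
The same check can be phrased as a general property: if renumbering worked, the IDs are exactly 0 through n-1, so the maximum plus one equals the number of distinct values. A small pure-Python sketch of that test follows. `is_contiguous` is a hypothetical helper, not part of cuGraph.

```python
def is_contiguous(ids):
    """True if the distinct values in ids are exactly 0, 1, ..., n-1."""
    unique = set(ids)
    return unique == set(range(len(unique)))

print(is_contiguous([0, 2, 1, 1, 0]))  # True
print(is_contiguous([0, 2, 5]))        # False
```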

___
Copyright (c) 2019-2020, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___