# How to compute a _Cost Matrix_ by replicating data
# Skip notebook test

### Approach
A simple approach to creating a cost matrix is to run All-Source Shortest Path (ASSP). However, cuGraph does not currently have an ASSP algorithm. One is on the roadmap, based on Floyd-Warshall, but that doesn't help us today. Luckily there is a workaround if the graph to be processed is small. The hack is to emulate ASSP by creating many copies of the graph and running Single Source Shortest Path (SSSP) with one seed per graph copy. Since each SSSP runs within its own disjoint component, there is no issue with path collisions between seeds.
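
To make the goal concrete, here is a minimal CPU-only sketch of what an all-source shortest path produces on a tiny unweighted graph. It uses plain Python rather than cuGraph; the helper name `bfs_distances` and the four-node directed cycle are illustrative assumptions, not part of this notebook.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from `source` to every reachable vertex (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# hypothetical four-node directed cycle: 0 -> 1 -> 2 -> 3 -> 0
adj = {0: [1], 1: [2], 2: [3], 3: [0]}

# one single-source run per vertex yields the full cost matrix
cost_matrix = {s: bfs_distances(adj, s) for s in adj}
print(cost_matrix[0])
```

The rest of the notebook gets the same kind of result from a single SSSP call by packing all of the per-source runs into one larger graph.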


### Notebook Organization
The first portion of the notebook discusses each step independently.  It gives insight into what is going on and how long each step takes.

The second section puts all the steps together in a single function and times how long it would take to compute the matrix.


### Data

In this notebook we will use the email-Eu-core dataset.

* Number of Vertices:  1,005
* Number of Edges:    25,571

We are using this dataset since it is small with a few communities, meaning that there are paths to be found.

### Notebook Revisions

| Author Credit |    Date    |  Update          | cuGraph Version |  Test Hardware |
| --------------|------------|------------------|-----------------|----------------|
| Brad Rees     | 06/21/2022 | created          | 22.08           | V100 w 32 GB, CUDA 11.5 |
| Don Acosta    | 06/28/2022 | modified         | 22.08           | V100 w 32 GB, CUDA 11.5 |

### References

* https://www.sciencedirect.com/topics/mathematics/cost-matrix
* https://en.wikipedia.org/wiki/Shortest_path_problem

Dataset
* Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. Local Higher-order Graph Clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.

* J. Leskovec, J. Kleinberg and C. Faloutsos. Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 1(1), 2007. http://www.cs.cmu.edu/~jure/pubs/powergrowth-tkdd.pdf


In [None]:
# system and other
import time
from time import perf_counter
import math

# rapids
import cugraph
import cudf

-----
# Reading the data

Let's start by reading the data.

In [None]:
# simple function to read in the CSV data file
def read_data_cudf(datafile):
    gdf = cudf.read_csv(datafile,
                     delimiter=" ",
                     header=None,
                     names=['src','dst', 'wt'])
    return gdf

In [None]:
# function to determine the number of nodes in the dataset
def find_number_of_nodes(df):
    node = cudf.concat([df['src'], df['dst']])
    node = node.unique()
    return len(node)

### Read the data and verify that it is zero based (e.g. first vertex is 0)
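**IMPORTANT:** The node numbering must be zero based and contiguous. Each replicated copy is shifted by the number of vertices, which is one larger than the maximum vertex ID only when the IDs run from 0 to n-1. If the IDs do not start at zero, the copy number can no longer be recovered from a vertex ID, and if the IDs have gaps, the graph copies can overlap in index space and not be independent (disjoint).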
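
As a small illustration of why the numbering matters, the sketch below applies the same integer-division rule that is used later in this notebook to recover the copy number from a vertex ID. The four-vertex ID lists and the helper name `copy_index` are hypothetical.

```python
def copy_index(vertex_id, offset):
    """Copy number of a vertex when each copy is shifted by `offset`."""
    return vertex_id // offset

offset = 4                     # number of vertices in the base graph
zero_based = [0, 1, 2, 3]      # IDs of the unshifted copy, starting at 0
one_based = [1, 2, 3, 4]       # same graph, but IDs start at 1

# every vertex of the unshifted copy should map to copy 0
zero_based_copies = {copy_index(v, offset) for v in zero_based}
one_based_copies = {copy_index(v, offset) for v in one_based}

print(zero_based_copies)   # only copy 0
print(one_based_copies)    # vertex 4 is wrongly assigned to copy 1
```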

In [None]:
t1 = perf_counter()
gdf = read_data_cudf('../data/email-Eu-core.csv')
read_t = perf_counter() - t1

In [None]:
print(f" read {len(gdf)} edges in {read_t} seconds")

In [None]:
# verify that the starting ID is zero
min([gdf['src'].min(), gdf['dst'].min()])

In [None]:
# check the max ID
max([gdf['src'].max(), gdf['dst'].max()])

In [None]:
# the number of nodes should be one greater than the max ID
# that is the ID that we start the next instance of the data at
offset = find_number_of_nodes(gdf)
print(offset)

## Now let's dive into how to replicate the data
We will use a model that doubles the data at each pass, which is a lot faster
than adding one copy at a time.
As a result, the number of disjoint copies of the data will be a power of 2.
Although doubling grows the data set and builds the graph faster, the figure below shows simple one-copy-at-a-time replication for illustration purposes.
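
The doubling scheme can be sketched on the CPU with plain Python lists. This is only an illustration under assumed inputs: the function name `replicate` and the four-node cycle are hypothetical, and the notebook's own cuDF version follows further down.

```python
def replicate(edges, num_nodes, passes):
    """Double an edge list `passes` times, shifting IDs to keep copies disjoint."""
    shift = num_nodes
    out = list(edges)
    for _ in range(passes):
        shifted = [(s + shift, d + shift) for s, d in out]
        out = out + shifted
        shift *= 2          # the next pass must skip everything built so far
    return out

# hypothetical four-node directed cycle
base_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

replicated = replicate(base_edges, num_nodes=4, passes=3)
nodes = {v for edge in replicated for v in edge}

print(len(replicated), len(nodes))   # 2^3 = 8 copies of 4 edges and 4 nodes
```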


![Data Duplicated](../../notebooks/img/graph_after_replication.png)

In [None]:
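# This function creates additional copies of the data.
# Each pass doubles the data, so N passes yield 2^N disjoint copies.
def make_data(base_df, N):
    # amount to shift the vertex IDs by on the next pass
    shift = find_number_of_nodes(base_df)
    _d = base_df

    for x in range(N):
        tmp = _d.copy()
        tmp['src'] += shift
        tmp['dst'] += shift
        _d = cudf.concat([_d,tmp])
        shift = shift * 2
    return _d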

In [None]:
%%timeit
_ = make_data(gdf, 3)

In [None]:
%%time
gdf2 = make_data(gdf, 3)
print()

In [None]:
# simple print to show how much the data has grown
# print # of Edges and # of Nodes
print(f"Old {len(gdf)} {find_number_of_nodes(gdf)}")
print(f"New {len(gdf2)} {find_number_of_nodes(gdf2)}")

## Build the ghost node connection set
A ghost node is an artificially added node used to emulate the All-Source Shortest Path algorithm, which is not yet supported.
The ghost node has one outgoing edge to a seed node in each copy, so everything after the first hop of an SSSP from the ghost node is the shortest path from that copy's seed.
The ghost node is removed after the shortest path algorithm has run.

![Ghost Node](../../notebooks/img/graph_after_ghost.png)

The Ghost Node is connected to a different corresponding node in each replication so all sources are covered.

In this simple example of a four-node 'square' graph after complete replication and adding the ghost node, the graph looks like this:

![Ghost Node](../../notebooks/img/Full-four_node_replication.png)
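
For the four-node example, the ghost edges can be written out directly. The sketch below assumes four copies (two doubling passes) and the same seed rule as the notebook's `add_ghost_node` function, where copy `x` is seeded at original vertex `x`; the variable names are illustrative.

```python
offset = 4                      # vertices in the hypothetical base graph
passes = 2                      # two doubling passes -> 4 copies
num_copies = 2 ** passes

# the first unused ID after all copies becomes the ghost node
ghost_id = offset * num_copies

# copy x starts at ID offset * x, so its vertex x has ID offset * x + x;
# there are only `offset` distinct vertices to seed
seeds = [offset * x + x for x in range(min(num_copies, offset))]

# directed edges from the ghost node to one seed per copy
ghost_edges = [(ghost_id, s) for s in seeds]
print(ghost_edges)
```

Note that covering every source requires at least as many copies as there are vertices in the base graph.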





In [None]:
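def add_ghost_node(_df, N):
    # get the size of the graph.  That number will be the ghost node ID
    ghost_node_id = find_number_of_nodes(_df)

    # N doubling passes produce 2^N copies, but only `offset` distinct
    # vertices exist, so never create more seeds than that
    num_seeds = min(2 ** N, offset)

    # copy x is seeded at original vertex x, whose ID is (offset * x) + x
    seeds = cudf.DataFrame()
    seeds['dst'] = [((offset * x) + x) for x in range(num_seeds)]
    seeds['src'] = ghost_node_id

    _d = cudf.concat([_df, seeds])

    return _d, ghost_node_id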

In [None]:
%%timeit
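# N matches the 3 doubling passes used to build gdf2
_, _ = add_ghost_node(gdf2, 3)

In [None]:
# N matches the 3 doubling passes used to build gdf2
gdf_with_ghost, ghost_id = add_ghost_node(gdf2, 3)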

## Create an Empty directed Graph

In [None]:
G = cugraph.Graph(directed=True)

Populate the new graph with an edgelist containing
* The original Data
* The replicated data copies
* One edge from the Ghost Node to a different node in each copy of the graph

In [None]:
%%time
G.from_cudf_edgelist(gdf_with_ghost, source='src', destination='dst', renumber=False)

In [None]:
%%time
G.number_of_edges()

### Run Single Source Shortest Path (SSSP) from the ghost node
Running SSSP from the single Ghost node is equivalent to an all-source shortest path after the first hop, since every
replicated copy is reachable through that node. The result includes extraneous ghost node data, which is removed in later steps.
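
The effect can be checked on the CPU with the four-node example. The sketch below is a plain-Python breadth-first search, used here as a stand-in for `cugraph.sssp` on an unweighted graph; the graph, the four copies, and all variable names are assumptions made for illustration.

```python
from collections import deque

offset = 4
num_copies = 4
base_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # hypothetical directed cycle

# four disjoint copies, each shifted by a multiple of offset
edges = [(s + c * offset, d + c * offset)
         for c in range(num_copies) for s, d in base_edges]

# ghost node with one edge to vertex c of copy c
ghost_id = offset * num_copies
edges += [(ghost_id, c * offset + c) for c in range(num_copies)]

adj = {}
for s, d in edges:
    adj.setdefault(s, []).append(d)

# breadth-first search from the ghost node
dist = {ghost_id: 0}
queue = deque([ghost_id])
while queue:
    u = queue.popleft()
    for v in adj.get(u, []):
        if v not in dist:
            dist[v] = dist[u] + 1
            queue.append(v)

# copy 1 is seeded at ID 5, so its distances start at 1, not 0
print(dist[5], dist[6], dist[7], dist[4])
```

Every distance is one hop too large because of the ghost-to-seed edge, which is why the later steps subtract 1.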

In [None]:
%%timeit
X = cugraph.sssp(G, ghost_id)

In [None]:
X = cugraph.sssp(G, ghost_id)

This result will contain a ghost node like the simple example.

In [None]:
X.head(5)

## Now reset vertex IDs and convert to a cost matrix
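Rows whose predecessor is the ghost node (the seed vertices themselves) are removed here, along with the row for the ghost node itself.

In [None]:
# drop the seed rows (reached directly from the ghost node) and the ghost
# node's own row, since the ghost node does not exist in the original graph
X = X[(X['predecessor'] != ghost_id) & (X['vertex'] != ghost_id)]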

Apply the cuGraph filter that removes all vertices not reached during the graph traversal, in this case the SSSP.

In [None]:
# drop unreachable
X = cugraph.filter_unreachable(X)

Remove the path cost that was incurred by going to the single seed in each copy from the ghost node.

In [None]:
# subtract the one-hop cost of the edge from the ghost node to the seed
X['distance'] -= 1

## Now the Ghost node and tangential edges are removed.

In [None]:
X.head(5)

Calculate the seed for each copy. This is where it is critical that the original graph node numbering is zero based.
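
The decoding step can be sketched in plain Python with `divmod`, which returns the copy number and the original vertex ID in one call. The rows below are hypothetical `(vertex, distance, predecessor)` values for copy 1 of a four-node cycle, not output from this notebook.

```python
offset = 4   # vertices in the hypothetical base graph

# hypothetical SSSP rows for copy 1, whose seed is original vertex 1 (ID 5)
rows = [(6, 2, 5), (7, 3, 6), (4, 4, 7)]

cost = []
for vertex, distance, _pred in rows:
    # integer division gives the copy (seed), the remainder the original ID
    seed, v2 = divmod(vertex, offset)
    # subtract the hop from the ghost node to the seed
    cost.append((seed, v2, distance - 1))

print(cost)
```

This only works when the base graph uses IDs 0 to `offset - 1`, so that each copy occupies exactly one block of `offset` consecutive IDs.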

In [None]:
# add a new column for the seed
# since each seed was a different component with a different offset amount, exploit that to determine the seed number
X['seed'] = (X['vertex'] / offset).astype(int)

In [None]:
X.head(5)

In [None]:
# Now adjust all vertices to be in the correct range
# maps each replicated vertex ID back to its ID in the original graph
X['v2'] = X['vertex'] - (X['seed'] * offset)

In [None]:
# Finally just pull out the cost matrix
cost = X.drop(columns=['vertex', 'predecessor'])

In [None]:
cost.head(10)

In [None]:
# cleanup 
del G
del X
del gdf_with_ghost
del gdf2

----
# Section 2: Do it all in a single function

In [None]:
# Set the number of replications - 10 will produce 1,024 graphs
N = 10

In [None]:
def build_cost_matrix(_gdf):
    data = make_data(_gdf, N)
    gdf_with_ghost, ghost_id = add_ghost_node(data, N)
    
    G = cugraph.Graph(directed=True)
    G.from_cudf_edgelist(gdf_with_ghost, source='src', destination='dst', renumber=False)
    
    X = cugraph.sssp(G, ghost_id)
    
    # drop the seed rows and the ghost node's own row
    X = X[(X['predecessor'] != ghost_id) & (X['vertex'] != ghost_id)]
    X = cugraph.filter_unreachable(X)
    X['distance'] -= 1
    X['seed'] = (X['vertex'] / offset).astype(int)
    X['v2'] = X['vertex'] - (X['seed'] * offset)
    cost = X.drop(columns=['vertex', 'predecessor'])
    
    return cost

In [None]:
%%timeit
CM = build_cost_matrix(gdf)
CM

In [None]:
CM = build_cost_matrix(gdf)
CM.head(5)

___
Copyright (c) 2022, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___