# PageRank
#### Author : Alex Fender

In this notebook, we will show how to use multi-GPU features in cuGraph to compute the PageRank of each user in Twitter's dataset.

Please be aware that your system may be different, and you may need to modify the code or install packages to run the examples below. If you think you have found a bug or an error, please file an issue in [cuGraph](https://github.com/rapidsai/cugraph/issues).

This notebook was run on 2 NVIDIA Tesla V100 GPUs with RAPIDS 0.9.0 and CUDA 10.0. 

## Introduction
PageRank is a measure of the relative importance of a vertex based on the relative importance of its neighbors. PageRank was invented by Google Inc. and is (was) used to rank its search results. PageRank uses the connectivity information of a graph to rank the importance of each vertex. See [Wikipedia](https://en.wikipedia.org/wiki/PageRank) for more details on the algorithm.
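
To make the idea concrete, here is a minimal CPU sketch of PageRank as a plain power iteration, written with the Python standard library only. It is an illustration under stated assumptions, not cuGraph's implementation: the function name `pagerank_sketch`, the uniform handling of vertices without outgoing edges, and the tiny four-vertex graph are all choices made for this example.

```python
def pagerank_sketch(edges, num_vertices, alpha=0.85, max_iter=30):
    """Plain power iteration on a directed edge list of (src, dst) pairs."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution over the vertices.
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(max_iter):
        # Rank held by vertices without outgoing edges is spread uniformly
        # (an assumption of this sketch).
        dangling = sum(rank[v] for v in range(num_vertices) if out_degree[v] == 0)
        base = (1.0 - alpha) / num_vertices + alpha * dangling / num_vertices
        new_rank = [base] * num_vertices
        # Each vertex passes an equal share of its rank along every outgoing edge.
        for src, dst in edges:
            new_rank[dst] += alpha * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Tiny example: vertices 0, 1 and 2 all point to vertex 3, which points back to 0.
edges = [(0, 3), (1, 3), (2, 3), (3, 0)]
scores = pagerank_sketch(edges, num_vertices=4)
print(scores)
```

Vertex 3 collects rank from three neighbors and ends up with the highest score, while vertices 1 and 2, which nobody points to, keep only the teleport share.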

cuGraph's multi-GPU features leverage Dask. RAPIDS has other Dask-based projects, such as dask-cudf and dask-cuda, which are also used in this example. Check out [RAPIDS.ai](https://rapids.ai/) to learn more about these technologies.

To compute the PageRank scores for a graph in cuGraph, we use:<br>

```python
cugraph.dask.pagerank.pagerank(edge_list, alpha=0.85, max_iter=30)
```
Parameters

*  *edge_list* : `dask_cudf.DataFrame`<br>
Contains the connectivity information as an edge list. The source 'src' and destination 'dst' columns must be of type 'int32'. Edge weights are not used for this algorithm. Indices must be in the range [0, V-1], where V is the global number of vertices. The input edge list should be provided as a dask-cudf DataFrame with one partition per GPU.
*  *alpha* : `float`<br>
The damping factor alpha represents the probability of following an outgoing edge; the standard value is 0.85. Thus, 1.0-alpha is the probability of “teleporting” to a random vertex. Alpha should be greater than 0.0 and strictly less than 1.0.
* *max_iter* : `int`<br>
The maximum number of iterations before an answer is returned. If this value is less than or equal to 0, cuGraph will use the default value, which is 30.<br>

Returns

* *PageRank* : `dask_cudf.DataFrame`<br>
Dask GPU DataFrame containing two columns of size V: the vertex identifiers and the corresponding PageRank values.
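
The requirement that vertex indices lie in [0, V-1] and fit in 'int32' means that raw identifiers may need to be renumbered before calling the algorithm. The following is a small CPU sketch of such a renumbering using only the standard library. The helper name `renumber_edges` and the sample identifiers are hypothetical and are not part of cuGraph.

```python
INT32_MAX = 2**31 - 1


def renumber_edges(edges):
    """Map arbitrary hashable vertex labels to contiguous integers 0..V-1."""
    ids = {}
    renumbered = []
    for src, dst in edges:
        # setdefault assigns the next free index the first time a label is seen.
        s = ids.setdefault(src, len(ids))
        d = ids.setdefault(dst, len(ids))
        renumbered.append((s, d))
    return renumbered, ids


# Hypothetical raw identifiers that are neither contiguous nor zero-based.
raw_edges = [(1001, 42), (42, 7), (7, 1001)]
renumbered, ids = renumber_edges(raw_edges)
num_vertices = len(ids)

# The largest index is V-1, so it must still fit in a signed 32-bit integer.
fits_int32 = num_vertices - 1 <= INT32_MAX
print(renumbered, num_vertices, fits_int32)
```

Keeping the `ids` mapping around makes it possible to translate the vertex column of the result back to the original identifiers.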

## Data
We will be analyzing 41.7 million user profiles and 1.47 billion social relations from the Twitter dataset. The file is 24 GB, and the data was collected for:<br>
*What is Twitter, a social network or a news media? Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010.*<br> 

Please refer to the readme to obtain this dataset.

## Multi-GPU PageRank with cuGraph
### Basic setup

In [None]:
# Let's check out our hardware setup
!nvidia-smi

In [None]:
# Import needed libraries
import time
from dask.distributed import Client, wait
import dask_cudf
from dask_cuda import LocalCUDACluster
import cugraph.dask.pagerank as dcg

### Setup multi-GPU and dask

Before we get started, we need to set up a local Dask cluster of workers to execute our work and a client to coordinate and schedule work for that cluster. As we see below, we can initiate a `cluster` and a `client` using only a few lines of code.

In [None]:
cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

### Load the data
cuGraph depends on dask-cudf for data loading and the initial DataFrame creation. The CSV data file contains an edge list, in which each row represents a connection from one vertex to another. This list of source-destination pairs is known as Coordinate Format (COO). In this test case, the data has just two columns.
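
As a CPU illustration of the COO layout, the sketch below parses a few space-delimited lines into separate source and destination columns using only the standard library. The three sample edges are made up for this example and are not taken from the Twitter file.

```python
import io

# Stand-in for a space-delimited edge list file: one "src dst" pair per line.
sample = io.StringIO("0 1\n0 2\n2 1\n")

src, dst = [], []
for line in sample:
    s, d = line.split()
    src.append(int(s))
    dst.append(int(d))

# In COO form, edge i goes from src[i] to dst[i].
num_edges = len(src)
num_vertices = max(max(src), max(dst)) + 1
print(src, dst, num_edges, num_vertices)
```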

In [None]:
# File path, assuming current directory
input_data_path = r"twitter.csv"

# Helper function to set the reader chunk size to automatically get one partition per GPU
chunksize = dcg.get_chunksize(input_data_path)

# Multi-GPU CSV reader
e_list = dask_cudf.read_csv(input_data_path, chunksize = chunksize, delimiter=' ', names=['src', 'dst'], dtype=['int32', 'int32'])
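
To give a feel for what "one partition per GPU" implies for the chunk size, here is a rough standard-library sketch that divides the file size by the number of partitions. The helper `estimate_chunksize` is hypothetical and is only an assumption about the general idea; it does not claim to reproduce how `dcg.get_chunksize` is implemented.

```python
import math
import os
import tempfile


def estimate_chunksize(path, n_partitions):
    """Bytes per chunk so that the file splits into at most n_partitions chunks."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    size = os.path.getsize(path)
    # Round up so the last chunk never spills into an extra partition.
    return max(1, math.ceil(size / n_partitions))


# Demo on a small temporary file instead of the 24 GB Twitter edge list.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0 1\n" * 250)  # 1000 bytes in total
demo_path = tmp.name

chunk_for_two_gpus = estimate_chunksize(demo_path, 2)
chunk_for_three_gpus = estimate_chunksize(demo_path, 3)
os.remove(demo_path)
print(chunk_for_two_gpus, chunk_for_three_gpus)
```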

### Call the Multi-GPU PageRank algorithm


In [None]:
# Get the pagerank scores
pr = dcg.pagerank(e_list, max_iter=10)

It was that easy! PageRank should take only a few seconds to run on 2 V100 GPUs.

In [None]:
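# Gather the distributed result into a single cuDF DataFrame.
# Note: `gdf_page` was not defined earlier in this notebook; building it
# from `pr` with compute() is an assumption made to keep the cell runnable.
gdf_page = pr.compute().reset_index(drop=True)

# Find the most important vertex using the scores
# This method should only be used for small graphs
bestScore = gdf_page['pagerank'][0]
bestVert = gdf_page['vertex'][0]

for i in range(len(gdf_page)):
    if gdf_page['pagerank'][i] > bestScore:
        bestScore = gdf_page['pagerank'][i]
        bestVert = gdf_page['vertex'][i]

print("Best vertex is " + str(bestVert) + " with score of " + str(bestScore))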

The top PageRank vertex and score match what was found by NetworkX.

In [None]:
# A better way to do this is to find the max and then use that value in a query
pr_max = gdf_page['pagerank'].max()

In [None]:
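def print_pagerank_threshold(_df, t=0):
    # Keep only the rows whose score is at least the threshold t
    filtered = _df.query('pagerank >= @t')

    # Use positional access: the filtered rows keep their original index labels
    for i in range(len(filtered)):
        print("Best vertex is " + str(filtered['vertex'].iloc[i]) +
              " with score of " + str(filtered['pagerank'].iloc[i]))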

In [None]:
print_pagerank_threshold(gdf_page, pr_max)

----

A PageRank score of _0.10047_ is very low, which can be an indication that no vertex is more central than any other. Rather than just looking at the top score, let's look at the top three vertices and see if any insights can be inferred.

Let's sort the scores and get the first three records.

In [None]:
sort_pr = gdf_page.sort_values('pagerank', ascending=False)

In [None]:
sort_pr.head(3).to_pandas()
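
When only a handful of top scores is needed, a full sort does more work than necessary. The sketch below shows the partial-selection idea on the CPU with the standard library's `heapq.nlargest`. The (vertex, score) pairs are hypothetical values chosen for this example, not results from the Twitter graph.

```python
import heapq

# Hypothetical (vertex, pagerank) pairs standing in for the real result.
scores = [(0, 0.05), (1, 0.30), (2, 0.10), (3, 0.25), (4, 0.20)]

# Select the three highest-scoring pairs without sorting the whole list.
top3 = heapq.nlargest(3, scores, key=lambda pair: pair[1])
top3_vertices = [vertex for vertex, _ in top3]
print(top3_vertices)
```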

Going back and looking at the graph with the top three vertices highlighted (illustration below), it is easy to see that the top-scoring vertices also appear to be the vertices with the most connections.
Let's look at the sorted list of degrees.

In [None]:
d = G.degree()

In [None]:
# divide the degree by two since this is an undirected graph
d['degree'] = d['degree'] / 2

In [None]:
d.sort_values('degree', ascending=False).head(3).to_pandas()

<img src="./img/zachary_graph_pagerank.png" width="600">

___
Copyright (c) 2019, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___