# End-to-End Demo
## Running PageRank on Wikipedia With vs. Without `nx-cugraph`

This notebook demonstrates a zero-code-change, end-to-end workflow using `cudf.pandas` and `nx-cugraph`.

Before running this notebook, check the [System Requirements](https://docs.rapids.ai/api/cugraph/stable/nx_cugraph/installation/#system-requirements).

In [None]:
# These two lines enable GPU acceleration.
# Comment them out to run on the CPU; the rest of the code stays the same!

%load_ext cudf.pandas
%env NX_CUGRAPH_AUTOCONFIG=True
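
Outside Jupyter the two magics above are not available. The following is a rough script equivalent, assuming that `cudf` exposes `cudf.pandas.install()` and that `nx-cugraph` reads `NX_CUGRAPH_AUTOCONFIG` when `networkx` is first imported. The `gpu_pandas` flag is a hypothetical name used only for this sketch.

```python
import os

# Assumption: the variable must be set before networkx is imported
# for nx-cugraph to pick it up.
os.environ["NX_CUGRAPH_AUTOCONFIG"] = "True"

try:
    # Assumption: this mirrors what `%load_ext cudf.pandas` does in a notebook.
    import cudf.pandas

    cudf.pandas.install()
    gpu_pandas = True
except ImportError:
    # cudf is not installed: pandas will run on the CPU as usual.
    gpu_pandas = False

print(f"GPU-accelerated pandas enabled: {gpu_pandas}")
```

Import `pandas` and `networkx` only after this block so that both settings take effect.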

In [2]:
import pandas as pd
import networkx as nx

## Downloading the data

In [5]:
import gzip
import shutil
import urllib.request
from pathlib import Path

# Get the data
def download_datafile(url, file_path):
    compressed_path = file_path + ".gz"

    if not Path(file_path).exists():
        print(f"File not found. Downloading from {url}...")
        urllib.request.urlretrieve(url, compressed_path)

        print(f"\tDownloaded to {compressed_path}. Unzipping...")
        with gzip.open(compressed_path, 'rb') as f_in, open(file_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

        print("Done.")
    else:
        print(f"File already exists at {file_path}. Skipping download")
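
The decompression step in `download_datafile` can be tried offline without touching the network. This is a minimal sketch that writes a small gzip archive into a temporary directory and unpacks it with the same streaming pattern; the file name `sample.csv` and its contents are made up for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    file_path = str(Path(tmp) / "sample.csv")  # hypothetical file name
    compressed_path = file_path + ".gz"

    # Stand-in for the downloaded archive: two edges, one per line.
    with gzip.open(compressed_path, "wb") as f:
        f.write(b"0 1\n1 2\n")

    # Same streaming decompression as in download_datafile:
    # copyfileobj moves the data in chunks, so the whole file
    # never has to fit in memory at once.
    with gzip.open(compressed_path, "rb") as f_in, open(file_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    unzipped_text = Path(file_path).read_text()

print(unzipped_text)
```

Note that `download_datafile` leaves the `.gz` archive on disk next to the unpacked file.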

In [None]:
nodedata_url = "https://data.rapids.ai/cugraph/benchmark/enwiki-20240620-nodeids.csv.gz"
nodedata_path = "enwiki-20240620-nodeids.csv"
download_datafile(nodedata_url, nodedata_path)

edgelist_url = "https://data.rapids.ai/cugraph/benchmark/enwiki-20240620-edges.csv.gz"
edgelist_path = "enwiki-20240620-edges.csv"
download_datafile(edgelist_url, edgelist_path)

The dataset used in this notebook is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License, available at https://creativecommons.org/licenses/by-sa/4.0/legalcode.en

## Timed end-to-end code

In [None]:
%%time

# Read the Wikipedia connectivity data (one "src dst" edge per line) from `edgelist_path`
edgelist_df = pd.read_csv(
    edgelist_path,
    sep=" ",
    names=["src", "dst"],
    dtype="int32",
)

In [None]:
%%time

# Read the Wikipedia page metadata (node ID and page title) from `nodedata_path`
nodedata_df = pd.read_csv(
    nodedata_path,
    sep="\t",
    names=["nodeid", "title"],
    dtype={"nodeid": "int32", "title": "str"},
)

In [None]:
%%time

# Create a NetworkX graph from the connectivity info
G = nx.from_pandas_edgelist(
    edgelist_df,
    source="src",
    target="dst",
    create_using=nx.DiGraph,
)
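
To see what `nx.from_pandas_edgelist` builds from the `src`/`dst` columns, here is a small sketch on a made-up four-edge DataFrame (`toy_edges` and `toy_G` are illustrative names, not part of the notebook). With `create_using=nx.DiGraph`, each row becomes one directed edge.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy edge list with the same column names as edgelist_df.
toy_edges = pd.DataFrame({"src": [0, 0, 1, 2], "dst": [1, 2, 2, 0]})

toy_G = nx.from_pandas_edgelist(
    toy_edges,
    source="src",
    target="dst",
    create_using=nx.DiGraph,
)

n_nodes = toy_G.number_of_nodes()
n_edges = toy_G.number_of_edges()
has_forward = toy_G.has_edge(0, 1)  # row (0, 1) exists
has_reverse = toy_G.has_edge(1, 0)  # no row (1, 0), so no reverse edge

print(n_nodes, n_edges, has_forward, has_reverse)
```

Because the graph is directed, the edge from 0 to 1 does not imply an edge from 1 to 0.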

In [None]:
%%time

# Run PageRank with NetworkX
nx_pr_vals = nx.pagerank(G)
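
For intuition about what the call above computes, here is a plain-Python power-iteration sketch of PageRank on a toy graph. It is not the NetworkX or cuGraph implementation; the function name `pagerank_sketch`, its defaults, and the toy edges are assumptions made for illustration, and the edge list is assumed to contain no duplicates.

```python
def pagerank_sketch(edges, alpha=0.85, tol=1e-6, max_iter=100):
    """Return a dict node -> score for a list of unique directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(max_iter):
        # Score held by nodes without out-links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - alpha) / n + alpha * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes an equal share of its damped score to its targets.
        for src in nodes:
            if out_links[src]:
                share = alpha * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share

        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break

    return rank


toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
toy_ranks = pagerank_sketch(toy_edges)
print(toy_ranks)
```

In this toy graph node 2 receives links from both other nodes, so it ends up with the highest score; the scores always sum to 1.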

In [None]:
%%time

# Create a DataFrame containing the results
pagerank_df = pd.DataFrame({
    "nodeid": nx_pr_vals.keys(),
    "pagerank": nx_pr_vals.values()
})

In [None]:
%%time
# Add the NetworkX PageRank results to `nodedata_df` as a new column
nodedata_df = nodedata_df.merge(pagerank_df, how="left", on="nodeid")

# Show the top 25 pages by PageRank value
nodedata_df.sort_values(by="pagerank", ascending=False).head(25)
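
The left merge keeps every row of the page metadata, including pages that have no PageRank value. A small sketch on made-up data (the `toy_*` names and scores are illustrative only) shows the behavior:

```python
import pandas as pd

# Hypothetical metadata: node 3 has a title but no score.
toy_nodes = pd.DataFrame({"nodeid": [0, 1, 2, 3], "title": ["A", "B", "C", "D"]})
toy_pagerank = pd.DataFrame({"nodeid": [0, 1, 2], "pagerank": [0.39, 0.21, 0.40]})

# how="left" keeps all rows of toy_nodes; unmatched rows get NaN.
toy_merged = toy_nodes.merge(toy_pagerank, how="left", on="nodeid")

# sort_values places NaN last by default, so unscored pages never reach the top.
toy_top = toy_merged.sort_values(by="pagerank", ascending=False).head(2)

top_titles = toy_top["title"].tolist()
missing = int(toy_merged["pagerank"].isna().sum())
print(top_titles, missing)
```

Pages that appear in the metadata but not in the edge list would show up this way, with `NaN` in the `pagerank` column.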