# `nx-cugraph` Demo - Wikipedia PageRank

This notebook demonstrates a zero-code-change, end-to-end workflow using `cudf.pandas` and `nx-cugraph`.

In [1]:
# Uncomment these two lines to enable GPU acceleration
# The rest of the code stays the same!

# %load_ext cudf.pandas
# %env NETWORKX_BACKEND_PRIORITY=cugraph

import pandas as pd
import networkx as nx
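
Outside a notebook the `%load_ext` and `%env` magics are not available. The following is a minimal sketch of one way to request the same backend from a plain Python script, assuming NetworkX picks up `NETWORKX_BACKEND_PRIORITY` when it is first imported. The helper name `request_cugraph_backend` is hypothetical and is not part of either library.

```python
import os
import sys


def request_cugraph_backend() -> bool:
    """Ask NetworkX to prefer the cugraph backend (hypothetical helper).

    Returns False when networkx was already imported, in which case the
    environment variable may be set too late to take effect.
    """
    already_imported = "networkx" in sys.modules
    os.environ["NETWORKX_BACKEND_PRIORITY"] = "cugraph"
    return not already_imported


set_in_time = request_cugraph_backend()
# Import networkx only after this point so it can see the variable.
# For the pandas side, a script can be launched as:
#   python -m cudf.pandas my_script.py
```

The return value is only a hint: it tells you whether the variable was set before `networkx` was loaded in the current process.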

Downloading the data

In [None]:
# wget "https://data.rapids.ai/cugraph/datasets/"  # Use this command to download datasets from the web

In [3]:
# Folder containing the downloaded dataset files; adjust this path for your environment
dataset_folder = "~/nvrliu/notebooks/demo/data/wikipedia"

edgelist_csv = f"{dataset_folder}/enwiki-20240620-edges.csv"
nodedata_csv = f"{dataset_folder}/enwiki-20240620-nodeids.csv"

Timed end-to-end code

Read the Wikipedia connectivity data (one `src dst` edge per line) from `edgelist_csv`.

In [4]:
%%time
edgelist_df = pd.read_csv(
    edgelist_csv,
    sep=" ",
    names=["src", "dst"],
    dtype="int32",
)
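
To see what these `read_csv` arguments do without downloading the full dataset, here is a small sketch that parses three made-up edges from an in-memory string. The sample rows are invented for illustration and are not taken from the Wikipedia data.

```python
import io

import pandas as pd

# Three invented edges in the same "src dst" layout as the edge list file.
sample_edges = io.StringIO("0 1\n0 2\n1 2\n")

toy_edgelist_df = pd.read_csv(
    sample_edges,
    sep=" ",               # columns are separated by a single space
    names=["src", "dst"],  # the file has no header row, so name the columns here
    dtype="int32",         # 32-bit ids halve memory use compared with the int64 default
)
```

Passing `names` tells pandas the file has no header row, so the first line is read as data rather than as column labels.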

Read in the Wikipedia pages metadata from `nodedata_csv`

In [None]:
%%time
nodedata_df = pd.read_csv(
    nodedata_csv,
    sep="\t",
    names=["nodeid", "title"],
    dtype={"nodeid": "int32", "title": "str"},
)

Create a NetworkX graph from the connectivity info we just loaded

In [None]:
%%time
G = nx.from_pandas_edgelist(
    edgelist_df,
    source="src",
    target="dst",
    create_using=nx.DiGraph,
)
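
As a small illustration of why `create_using=nx.DiGraph` matters, the sketch below builds a graph from three invented edges and checks that each edge exists in one direction only. The toy DataFrame is an assumption made for the example.

```python
import networkx as nx
import pandas as pd

# Invented edges: 0 -> 1, 0 -> 2, 1 -> 2
toy_edges_df = pd.DataFrame({"src": [0, 0, 1], "dst": [1, 2, 2]})

toy_graph = nx.from_pandas_edgelist(
    toy_edges_df,
    source="src",
    target="dst",
    create_using=nx.DiGraph,  # keep link direction; the default is an undirected Graph
)

forward_link = toy_graph.has_edge(0, 1)  # the edge as listed in the DataFrame
reverse_link = toy_graph.has_edge(1, 0)  # the opposite direction was never listed
```

With the default undirected `nx.Graph`, both checks would succeed, which would lose the difference between a page linking out and a page being linked to.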

Run the PageRank algorithm on the NetworkX graph.

In [None]:
%%time
nx_pr_vals = nx.pagerank(G)
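
For intuition about what `nx.pagerank` returns, here is a rough pure-Python sketch of PageRank by power iteration on a four-node toy graph. It is not the NetworkX or cuGraph implementation. It assumes a damping factor of 0.85 (the `nx.pagerank` default for `alpha`) and spreads the rank of nodes without outgoing links evenly over all nodes. The function name and toy edges are made up for the example.

```python
def pagerank_power_iteration(edges, alpha=0.85, tol=1e-10, max_iter=200):
    """Illustrative PageRank on a list of (src, dst) pairs.

    Duplicate pairs are counted as separate links in this sketch.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {u: [] for u in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {u: 1.0 / n for u in nodes}  # start from a uniform distribution
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing links is shared by everyone.
        dangling_mass = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {u: (1.0 - alpha) / n + alpha * dangling_mass / n for u in nodes}
        # Every other node passes its damped rank equally to its targets.
        for u in nodes:
            if out_links[u]:
                share = alpha * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: nodes 0, 1 and 3 all link to node 2, and node 2 links back to node 0.
toy_ranks = pagerank_power_iteration([(0, 2), (1, 2), (2, 0), (3, 2)])
top_node = max(toy_ranks, key=toy_ranks.get)
```

Like the `nx_pr_vals` dictionary above, the result maps each node id to a score, and the scores sum to 1.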

Create a DataFrame containing the resulting PageRank value for each `nodeid`.

In [None]:
%%time
pagerank_df = pd.DataFrame({
    "nodeid": nx_pr_vals.keys(),
    "pagerank": nx_pr_vals.values()
})

Finally, add the NetworkX results to `nodedata_df` as a new `pagerank` column.

In [None]:
%%time
nodedata_df = nodedata_df.merge(pagerank_df, how="left", on="nodeid")
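
The `how="left"` argument keeps every row of the left-hand frame even when no matching `nodeid` exists on the right. A minimal sketch with invented page titles and scores:

```python
import pandas as pd

# Invented metadata for three pages; only two of them have a score.
toy_pages_df = pd.DataFrame({"nodeid": [1, 2, 3], "title": ["A", "B", "C"]})
toy_scores_df = pd.DataFrame({"nodeid": [1, 3], "pagerank": [0.6, 0.4]})

toy_merged_df = toy_pages_df.merge(toy_scores_df, how="left", on="nodeid")

# Pages with no match on the right keep their row and get NaN as the score.
unscored_titles = toy_merged_df.loc[toy_merged_df["pagerank"].isna(), "title"].tolist()
```

In the notebook this means a page listed in `nodedata_csv` that never appears in the edge list still shows up in the result, with a missing `pagerank` value.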

Show the top 25 pages ranked by PageRank value.

In [None]:
nodedata_df.sort_values(by="pagerank", ascending=False).head(25)