# A Movie Recommendation System Using Jaccard Similarity in NetworkX and the cuGraph Backend
This notebook demonstrates a simple and effective movie recommendation system based on MovieLens<sup>1</sup> data, the Jaccard Similarity algorithm in NetworkX, and the NVIDIA cuGraph backend to NetworkX (`nx-cugraph`) to provide GPU acceleration.

## Let's get the environment set up
Let's begin by importing some modules from the standard library, and Pandas for reading in and preprocessing the MovieLens data.

In [1]:
import os
import requests
from zipfile import ZipFile

import pandas as pd

`nx-cugraph` is available as a package installable using `pip`, `conda`, and [from source](https://github.com/rapidsai/nx-cugraph). Before importing `networkx`, let's install `nx-cugraph` so NetworkX can register it as an available backend. We'll use `pip` to install.

### NOTES:
* `nx-cugraph` requires a compatible NVIDIA GPU, NVIDIA CUDA, its associated drivers, and a supported OS. Details about these and other installation prerequisites can be seen [here](https://docs.rapids.ai/install/system-req).
* The `nx-cugraph` package is currently hosted by NVIDIA, therefore the `--extra-index-url` option must be used.
* `nx-cugraph` is supported on specific 11.x and 12.x CUDA versions, and the major version number must be known in order to install the correct build (this is determined automatically when using `conda`).

To find the CUDA major version on your system, run the following command:

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0


From the above output we can see that we're using CUDA 12.x so we'll be installing `nx-cugraph-cu12`. If we were using CUDA 11.x the package name would be `nx-cugraph-cu11`. Also note the additional index URLs specified using `--extra-index-url`:

In [None]:
!pip install "nx-cugraph-cu12>=25.2.0a0" --extra-index-url=https://pypi.nvidia.com --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple
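The package-name choice described above can also be scripted. The following is a minimal sketch, assuming the `nvcc --version` output contains a `release X.Y` string as in the sample shown earlier; `nx_cugraph_package` is a hypothetical helper written for illustration and is not part of `nx-cugraph`.

```python
import re


def nx_cugraph_package(nvcc_output: str) -> str:
    """Return the pip package name matching the CUDA major version.

    Hypothetical helper for illustration; not part of nx-cugraph.
    """
    match = re.search(r"release (\d+)\.\d+", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release number in the nvcc output")
    major = int(match.group(1))
    # This notebook only discusses the CUDA 11.x and 12.x builds.
    if major not in (11, 12):
        raise ValueError(f"CUDA {major}.x is not covered by this notebook")
    return f"nx-cugraph-cu{major}"


sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(nx_cugraph_package(sample))  # nx-cugraph-cu12
```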

This notebook will be using features added in NetworkX 3.4, so we'll import it here to verify we have a compatible version.

We'll also set the `NX_CUGRAPH_AUTOCONFIG` environment variable, which nx-cugraph (version 24.10 and newer) reads on initialization to configure NetworkX to use the "cugraph" backend by default. This variable is unique to the cuGraph backend and must be set before importing `networkx`.

Finally, this notebook will automatically use a NetworkX caching feature. Caching is enabled by default in NetworkX, but it produces a courtesy warning aimed at users who modify graphs in ways that are no longer recommended. The cell below suppresses that warning by adding `"cache"` to `nx.config.warnings_to_ignore`; omit that line to see the warning and its details. This notebook uses only recommended APIs, so the warning does not apply here.

In [4]:
%env NX_CUGRAPH_AUTOCONFIG=True

import networkx as nx
print(f"using networkx version {nx.__version__}")

nx.config.warnings_to_ignore.add("cache")

env: NX_CUGRAPH_AUTOCONFIG=True
using networkx version 3.4.2


### The MovieLens dataset
The MovieLens dataset<sup>1</sup> is generously made available for download to the public [here](https://files.grouplens.org/datasets/movielens/ml-latest.zip), and is described in more detail in the README file [here](https://files.grouplens.org/datasets/movielens/ml-latest-README.html). The full set includes approximately 331k anonymized users reviewing 87k movies, resulting in 34M ratings.

Let's download the archive and extract the two files needed by this notebook:

In [5]:
ratings_csv = "ml-latest/ratings.csv"
movies_csv = "ml-latest/movies.csv"

if not os.path.exists(ratings_csv) or not os.path.exists(movies_csv):
    zip_file = "ml-latest.zip"
    if not os.path.exists(zip_file):
        req = requests.get(
            "https://files.grouplens.org/datasets/movielens/" + zip_file)
        # Fail early on an HTTP error instead of saving an error page as the zip
        req.raise_for_status()
        with open(zip_file, "wb") as f:
            f.write(req.content)
    with ZipFile(zip_file, "r") as z:
        z.extract(ratings_csv)
        z.extract(movies_csv)

### Reading the data
The ratings data is read into a Pandas DataFrame for preprocessing.

In [6]:
ratings_df = pd.read_csv(ratings_csv,
                         dtype={"userId": "int32",
                                "movieId": "int32",
                                "rating": "float32",
                                "timestamp": "int32",
                                }
                         )
# Not using timestamp
ratings_df.drop(columns="timestamp", inplace=True)

# Both user and movie IDs start at 1
# Add offset to make userId and movieId values unique
max_movie_id = int(ratings_df["movieId"].max())
ratings_df["userId"] = ratings_df["userId"] + max_movie_id

all_user_ids = ratings_df["userId"].unique()
all_movie_ids = ratings_df["movieId"].unique()
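To see why the offset matters, consider a toy version of the same step. This is a sketch on made-up `(userId, movieId)` pairs, not MovieLens data: without the shift, user 1 and movie 1 would become the same graph node.

```python
# Toy ratings as (userId, movieId) pairs; both ID ranges start at 1,
# so user 1 and movie 1 would collapse into a single graph node.
ratings = [(1, 1), (1, 2), (2, 1), (3, 3)]

max_movie_id = max(movie for _, movie in ratings)

# Shift user IDs past the largest movie ID so the two ranges cannot overlap.
shifted = [(user + max_movie_id, movie) for user, movie in ratings]

user_nodes = {user for user, _ in shifted}
movie_nodes = {movie for _, movie in shifted}
print(sorted(user_nodes))   # [4, 5, 6]
print(sorted(movie_nodes))  # [1, 2, 3]

# The original user ID is recovered by subtracting the same offset.
original_users = sorted(user - max_movie_id for user in user_nodes)
```

One consequence is that user IDs printed later in this notebook include the offset.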

The movies CSV file contains a mapping from movie IDs to movie titles. It also contains the movie's genres, which will not be used in this notebook. The file will be read in and saved as a dictionary so movie titles can be easily retrieved using a movie ID.

In [7]:
movie_id_name_map = {}
with open(movies_csv, encoding="utf-8") as f:
    for line in f:
        # Line format is: id,title,genres
        # A title that contains "," is quoted in the file, so rejoin the
        # middle fields; the surrounding quotes stay in the stored title.
        items = line.split(",")
        try:
            mid = int(items[0])
        except ValueError:
            # Skip the header row, whose first field is not an integer
            continue
        mname = ",".join(items[1:-1])
        movie_id_name_map[mid] = mname
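As an alternative to splitting on commas by hand, Python's standard `csv` module understands quoted fields. The sketch below runs on two illustrative rows in the `id,title,genres` layout (the genre values are placeholders, not real MovieLens data). Note that `csv.reader` strips the surrounding quotes, so titles stored this way would print without the quotes seen in this notebook's outputs.

```python
import csv
import io

# Two sample rows in the movies.csv layout; genre values are placeholders.
sample = io.StringIO(
    'movieId,title,genres\n'
    '1907,Mulan (1998),GenreA|GenreB\n'
    '4016,"Emperor\'s New Groove, The (2000)",GenreA|GenreB\n'
)

titles = {}
reader = csv.reader(sample)
next(reader)  # skip the header row
for movie_id, title, _genres in reader:
    titles[int(movie_id)] = title

print(titles[4016])  # Emperor's New Groove, The (2000)
```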

The next cell creates a separate DataFrame containing only "good" reviews (rating &ge; 3). It is used to find similarities between well-liked movies, because `jaccard_coefficient()` does not consider edge weights (the rating values) and would otherwise treat bad and good reviews equally.

In [40]:
good_ratings_df = ratings_df[ratings_df["rating"] >= 3]
good_user_ids = good_ratings_df["userId"].unique()
good_movie_ids = good_ratings_df["movieId"].unique()

print(f"total number of users: {len(all_user_ids)}")
print(f"total number of reviews: {len(ratings_df)}")
print("average number of total reviews/user: "
      f"{len(ratings_df)/len(all_user_ids):.2f}")
print(f"total number of users with good ratings: {len(good_user_ids)}")
print(f"total number of good reviews: {len(good_ratings_df)}")
print("average number of good reviews/user: "
      f"{len(good_ratings_df)/len(good_user_ids):.2f}")

total number of users: 330975
total number of reviews: 33832162
average number of total reviews/user: 102.22
total number of users with good ratings: 329127
total number of good reviews: 27782577
average number of good reviews/user: 84.41


## Running `jaccard_coefficient` to recommend movies

Now that the data is prepared, the actual NetworkX graph object can be created.

In [41]:
good_user_movie_G = nx.from_pandas_edgelist(
    good_ratings_df, source="userId", target="movieId", edge_attr="rating")

An arbitrary user (index 321 in `good_user_ids`) is selected, and one of their highest-rated movies is chosen. The goal is to find movies similar to this highly-rated one, filter out movies the user has already seen, and recommend the most similar movie.

In [47]:
# Pick a user and one of their highly-rated movies
user = good_user_ids[321]
user_reviews = good_user_movie_G[user]
highest_rated_movie = max(
    user_reviews,
    key=lambda n: user_reviews[n].get("rating", 0)
)

print(f"Highest rated movie for user {int(user)} is "
      f"{movie_id_name_map[highest_rated_movie]}, "
      f"id: {highest_rated_movie}, "
      f"rated: {user_reviews[highest_rated_movie]['rating']}")

Highest rated movie for user 289308 is Mulan (1998), id: 1907, rated: 5.0


In [48]:
# Create a list of nodes to compare the user's highest
# rated movie to all other movies in the graph.
ebunch = [(highest_rated_movie, n) for n in good_movie_ids[1:]
          if n != highest_rated_movie]

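The Jaccard Similarity function calculates a value between 0 and 1 that measures how similar two nodes are: the number of neighbors they share divided by the number of distinct neighbors they have in total. In this graph a movie's neighbors are the users who rated it well, so two movies score highly when largely the same users liked both. Jaccard similarity is described in more detail [here](https://en.wikipedia.org/wiki/Jaccard_index).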
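The calculation can be illustrated with plain Python sets. This is a minimal sketch on hypothetical data; the `jaccard` helper and the `fans` mapping are made up for illustration and are separate from the NetworkX and cuGraph implementations used in this notebook.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient of two neighbor sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical data: for each movie, the set of users who rated it 3 or higher.
fans = {
    "movie_a": {"u1", "u2", "u3", "u4"},
    "movie_b": {"u2", "u3", "u4", "u5"},
    "movie_c": {"u5", "u6"},
}

# movie_a and movie_b share 3 of 5 distinct fans.
print(jaccard(fans["movie_a"], fans["movie_b"]))  # 0.6
# movie_a and movie_c share no fans.
print(jaccard(fans["movie_a"], fans["movie_c"]))  # 0.0
```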

Because the `NX_CUGRAPH_AUTOCONFIG` environment variable was set to `True`, NetworkX will use the Jaccard implementation provided by cuGraph.

In [51]:
%%time
# Run Jaccard Similarity
jacc_coeffs = list(nx.jaccard_coefficient(good_user_movie_G, ebunch))

CPU times: user 173 ms, sys: 4.07 ms, total: 177 ms
Wall time: 175 ms


The default NetworkX implementation can be used if specified using the `backend=` kwarg. This will override the backend priority set by the environment variable.

Let's run using the default implementation to see how much time was saved using cuGraph.

In [15]:
%%time
# Run Jaccard Similarity
jacc_coeffs = list(nx.jaccard_coefficient(good_user_movie_G, ebunch, backend="networkx"))

CPU times: user 1min 5s, sys: 65 ms, total: 1min 5s
Wall time: 1min 5s
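
Using the two wall times reported above, a rough speedup figure can be worked out as follows. These are single runs on the hardware listed at the end of this notebook, so the ratio is only indicative and will vary between runs and systems.

```python
gpu_wall_s = 0.175  # cuGraph backend: "Wall time: 175 ms"
cpu_wall_s = 65.0   # default NetworkX: "Wall time: 1min 5s"

speedup = cpu_wall_s / gpu_wall_s
print(f"approximate speedup: {speedup:.0f}x")  # approximate speedup: 371x
```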


To generate recommendations for this user, we identify the movies most similar to a movie they rated highly using `jaccard_coefficient()`. These movies are sorted by the Jaccard coefficient value, and any movies the user has already rated well (their neighbors in the good-ratings graph) are filtered out.

In [52]:
# Sort by coefficient value, which is the 3rd item in the tuples
jacc_coeffs.sort(key=lambda t: t[2], reverse=True)

# Create a list of recommendations ordered from "best" to "worst" by
# Jaccard coefficient, skipping movies the user has already rated well.
# A set keeps each membership test fast.
movies_seen = set(good_user_movie_G.neighbors(user))
recommendations = [mid for (_, mid, _) in jacc_coeffs
                   if mid not in movies_seen]
if len(recommendations) > 0:
    mid = recommendations[0]
    print(f"User ID {user} might like {movie_id_name_map[mid]} "
          f"(movie ID: {mid})")

User ID 289308 might like Tarzan (1999) (movie ID: 2687)


To further demonstrate how effective Jaccard similarity can be, especially when used with the cuGraph backend, a helper function can be created to find and print movie similarities.

In [67]:
def print_similar_movies(movie_id, n=10):
    # ebunch is the list of node pairs to generate Jaccard Similarity
    # coefficients for. This will generate a list of comparisons between
    # movie_id and the other movies in the graph
    ebunch = [(movie_id, other) for other in good_movie_ids[1:]
              if other != movie_id]

    jacc_coeffs = list(nx.jaccard_coefficient(good_user_movie_G, ebunch))

    jacc_coeffs.sort(key=lambda t: t[2], reverse=True)
    print(f"Movies similar to {movie_id_name_map[movie_id]}:")
    # Slicing avoids an IndexError when fewer than n results are available
    for (_, similar_id, _) in jacc_coeffs[:n]:
        print(f"ID {int(similar_id)}, {movie_id_name_map[similar_id]}")
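
When only the top few results are needed, sorting the whole list is not strictly required. As a sketch of one alternative, the standard library's `heapq.nlargest` selects the `n` highest-scoring tuples directly. The tuples below are made-up values in the `(source, target, coefficient)` shape that `nx.jaccard_coefficient` yields.

```python
import heapq

# Hypothetical (source, target, coefficient) tuples; values are illustrative.
coeffs = [(10, 1, 0.12), (10, 2, 0.40), (10, 3, 0.05), (10, 4, 0.33)]

# Pick the two entries with the largest coefficient (third tuple item).
top_two = heapq.nlargest(2, coeffs, key=lambda t: t[2])
print(top_two)  # [(10, 2, 0.4), (10, 4, 0.33)]
```

For the list sizes in this notebook the difference is likely small next to the Jaccard computation itself, so the full sort used in the helper is a reasonable choice.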

The helper function can be used to show other movies similar to the highly-rated one.

In [68]:
%%time
print_similar_movies(highest_rated_movie)

Movies similar to Mulan (1998):
ID 1566, Hercules (1997)
ID 2687, Tarzan (1999)
ID 4016, "Emperor's New Groove, The (2000)"
ID 2081, "Little Mermaid, The (1989)"
ID 5444, Lilo & Stitch (2002)
ID 2078, "Jungle Book, The (1967)"
ID 2355, "Bug's Life, A (1998)"
ID 81847, Tangled (2010)
ID 3114, Toy Story 2 (1999)
ID 2096, Sleeping Beauty (1959)
CPU times: user 191 ms, sys: 12.3 ms, total: 203 ms
Wall time: 202 ms


The `nx.config` configuration namespace can also be used to choose which backend(s) NetworkX will use. This is useful when the NetworkX call is not directly accessible for passing a `backend=` kwarg, as is the case inside our helper function. Keep in mind that the `backend=` kwarg will override this setting.

Try running the following cells yourself to compare the difference in performance by using the cuGraph backend and the default NetworkX implementation.

In [None]:
%%time
current_priority = nx.config.backend_priority
nx.config.backend_priority = ["cugraph"]  # select between "networkx" and "cugraph"
print_similar_movies(highest_rated_movie)
nx.config.backend_priority = current_priority

In [None]:
%%time
# 1367: "101 Dalmatians (1996)"
print_similar_movies(1367)

In [None]:
%%time
# 1196: "Star Wars: Episode V - The Empire Strikes Back (1980)"
print_similar_movies(1196)

In [None]:
%%time
# 2105: "Tron (1982)"
print_similar_movies(2105)

In [None]:
%%time
# 4878: "Donnie Darko (2001)"
print_similar_movies(4878)

In [None]:
%%time
# 1301: "Forbidden Planet (1956)"
print_similar_movies(1301)

In [None]:
%%time
# 2139: "Secret of NIMH, The (1982)"
print_similar_movies(2139)

In [None]:
%%time
# 106072: "Thor: The Dark World (2013)"
print_similar_movies(106072)

In [None]:
%%time
# 318: "Shawshank Redemption, The (1994)"
print_similar_movies(318)

<br>
<sup>1</sup> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

---

### Hardware Information

In [38]:
# code used to retrieve CPU/GPU info

import psutil
import torch

with open("/proc/cpuinfo") as f:
    cpu_name = next(line.strip().split(": ")[1] for line in f if "model name" in line)
    cpu_name += f" ({psutil.cpu_count(logical=False)} cores)"
print(f"CPU: {cpu_name}")
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
    print(f"GPU: {gpu_name}")
    print(f"GPU Memory: {gpu_memory:.2f} GB")

CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (40 cores)
GPU: Tesla V100-SXM2-32GB
GPU Memory: 31.74 GB
