# Real-world Application: Movie Recommendation System

In this notebook, we'll build a simple item-based collaborative-filtering movie recommendation system using `cuml.neighbors.NearestNeighbors`.

The goal is to find movies that are similar to a given movie based on how users have rated them. We will use the popular [MovieLens dataset](https://grouplens.org/datasets/movielens/) for this task. The core idea is to treat each movie as a vector in a "user-rating space" and find the nearest neighbors for a given movie in that space.
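
To make the "user-rating space" idea concrete before bringing in the GPU, here is a minimal CPU-only sketch using only the Python standard library. The three movies and their ratings are made up for illustration, and the helper names `cosine_distance` and `nearest_movies` are hypothetical; they are not part of cuML.

```python
import math

# Toy movie-user matrix: one row per movie, one column per user.
# A 0 means "not rated", mirroring the fillna(0) convention used later on.
ratings = {
    "Movie A": [5.0, 4.0, 0.0, 1.0],
    "Movie B": [4.0, 5.0, 0.0, 0.0],
    "Movie C": [0.0, 1.0, 5.0, 4.0],
}

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_movies(title, k=2):
    """Rank the other movies by cosine distance to `title`, closest first."""
    query = ratings[title]
    scored = [(cosine_distance(query, vec), other)
              for other, vec in ratings.items() if other != title]
    return [other for _, other in sorted(scored)[:k]]

print(nearest_movies("Movie A"))
```

Movie A and Movie B were rated highly by the same two users, so B ranks ahead of C. The cuML model below performs the same kind of comparison, only across every movie in the dataset at once.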

In [None]:
# %%
import cudf
import cupy as cp
import pandas as pd
from scipy.sparse import csr_matrix

from cuml.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import requests
import zipfile
import io

In [None]:
# %%
# URL for the MovieLens Small dataset
url = "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"

print("Downloading and extracting the MovieLens dataset...")

# Download the file and stop early if the request failed
r = requests.get(url)
r.raise_for_status()
z = zipfile.ZipFile(io.BytesIO(r.content))

# Extract the zip file to a directory
z.extractall("movielens_small")

print("Dataset downloaded and extracted.")

# Load the data using pandas
movies_df = pd.read_csv("movielens_small/ml-latest-small/movies.csv")
ratings_df = pd.read_csv("movielens_small/ml-latest-small/ratings.csv")

print("\nMovies DataFrame:")
display(movies_df.head())

print("\nRatings DataFrame:")
display(ratings_df.head())

In [None]:
# %%
print("Preparing the data for the model...")

# Create a pivot table: movies as rows, users as columns, ratings as values
movie_user_matrix_df = ratings_df.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0) # Fill missing ratings with 0

# Convert the dense pandas DataFrame into a SciPy CSR sparse matrix.
# Most entries are 0 (unrated), so storing only the non-zero ratings is much more memory-efficient.
movie_user_matrix_sparse = csr_matrix(movie_user_matrix_df.values)

print("Movie-user matrix created successfully.")
print(f"Sparse matrix shape: {movie_user_matrix_sparse.shape}")
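
As a rough illustration of why the sparse format saves memory, the sketch below builds the three CSR arrays by hand in plain Python. The rating triples are invented for the example and `dense_row` is a hypothetical helper; the array names follow the `data`, `indices`, and `indptr` attributes of a SciPy `csr_matrix`.

```python
# (movie_row, user_col, rating) triples, sorted by row and then by column.
# The values are made up for illustration.
triples = [
    (0, 0, 4.0), (0, 3, 5.0),
    (1, 1, 3.5),
    (2, 0, 2.0), (2, 2, 4.5), (2, 3, 1.0),
]
n_rows, n_cols = 3, 4

data = [rating for _, _, rating in triples]   # the non-zero values
indices = [col for _, col, _ in triples]      # the column of each value
indptr = [0] * (n_rows + 1)                   # row i occupies data[indptr[i]:indptr[i + 1]]
for row, _, _ in triples:
    indptr[row + 1] += 1                      # count the entries in each row
for i in range(n_rows):
    indptr[i + 1] += indptr[i]                # turn counts into running offsets

def dense_row(i):
    """Rebuild row i as a dense list to check the encoding."""
    row = [0.0] * n_cols
    for pos in range(indptr[i], indptr[i + 1]):
        row[indices[pos]] = data[pos]
    return row

density = len(data) / (n_rows * n_cols)
print(indptr, dense_row(2), f"density={density:.2f}")
```

Only the rated cells are stored, so the memory cost grows with the number of ratings rather than with movies × users. The real movie-user matrix is far emptier than this toy one, which is where the saving comes from.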

In [None]:
# %%
print("Training the k-Nearest Neighbors model...")

# Instantiate the model
# n_neighbors=11 because the query movie is normally returned as its own nearest neighbor (distance 0)
model_knn = NearestNeighbors(n_neighbors=11, 
                             metric='cosine', 
                             algorithm='brute')

# Train the model with the sparse matrix
model_knn.fit(movie_user_matrix_sparse)

print("Model trained successfully!")

In [None]:
# %%
def get_recommendations(movie_title, model, matrix, movie_df, matrix_df):
    """
    Return the titles of the 10 nearest neighbors of `movie_title` in rating space,
    or None if the movie cannot be looked up.
    """
    print(f"Finding recommendations for: '{movie_title}'")
    
    # 1. Find the movie ID
    try:
        movie_id = movie_df[movie_df['title'] == movie_title].iloc[0]['movieId']
    except IndexError:
        print(f"Movie '{movie_title}' not found.")
        return

    # 2. Find the matrix index for this movie
    try:
        movie_index = matrix_df.index.get_loc(movie_id)
    except KeyError:
        print(f"Movie '{movie_title}' does not have enough ratings for recommendation.")
        return

    # 3. Use the model to find the nearest neighbors
    # We get the data vector for our movie and reshape it
    movie_vector = matrix[movie_index].reshape(1, -1)
    distances, indices = model.kneighbors(movie_vector)
    
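    # 4. Get the top 10 neighbors (skipping the first one, which is normally the movie itself).
    # cp.asnumpy copies the result to the host in case it lives on the GPU,
    # so it can be used to index the pandas Index below.
    neighbor_indices = cp.asnumpy(indices).flatten()[1:]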
    
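    # 5. Convert the indices back to movie titles, keeping nearest-first order
    # (an isin() filter would return the titles in movies.csv order instead)
    recommended_movie_ids = matrix_df.index[neighbor_indices]
    recommendations = movie_df.set_index('movieId').loc[recommended_movie_ids, 'title']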
    
    return recommendations
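
Skipping the first neighbor assumes the query movie always comes back in position 0. If two movies have identical rating vectors, a tie could put another movie first. A slightly more defensive sketch filters by index instead; `drop_query` is a hypothetical helper and the index values are invented for the example.

```python
def drop_query(neighbor_indices, query_index, k=10):
    """Return up to k neighbors, skipping the query's own index wherever it appears."""
    kept = [int(i) for i in neighbor_indices if int(i) != query_index]
    return kept[:k]

# Hypothetical search result for query row 7: a tied movie (row 3) ranks first.
print(drop_query([3, 7, 12, 5], query_index=7, k=3))
```

Because the model asks for 11 neighbors, trimming to `k` also covers the case where the query index is missing from the result entirely.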

In [None]:
# %%
# Test with "Toy Story"
recommendations = get_recommendations(
    'Toy Story (1995)', 
    model_knn, 
    movie_user_matrix_sparse, 
    movies_df, 
    movie_user_matrix_df
)

if recommendations is not None:
    print("\nRecommended movies:")
    display(recommendations)

In [None]:
# %%
# Test with an adventure movie
print("-" * 50)
recommendations = get_recommendations(
    'Jumanji (1995)', 
    model_knn, 
    movie_user_matrix_sparse, 
    movies_df, 
    movie_user_matrix_df
)

if recommendations is not None:
    print("\nRecommended movies:")
    display(recommendations)

# Test with a movie from another genre
print("-" * 50)
recommendations = get_recommendations(
    'Pulp Fiction (1994)', 
    model_knn, 
    movie_user_matrix_sparse, 
    movies_df, 
    movie_user_matrix_df
)

if recommendations is not None:
    print("\nRecommended movies:")
    display(recommendations)

In [None]:
# %%
from cuml.manifold import UMAP

print("Reducing dimensionality with UMAP for visualization...")

# Instantiate and fit UMAP, then copy the embedding to the host so matplotlib can plot it
umap = UMAP(n_components=2, random_state=42)
movie_vectors_2d = cp.asnumpy(umap.fit_transform(movie_user_matrix_sparse))

# Create the plot
fig, ax = plt.subplots(figsize=(16, 12))
ax.scatter(movie_vectors_2d[:, 0], movie_vectors_2d[:, 1], s=1, alpha=0.5)
ax.set_title("2D Map of All Movies in Rating Space")
ax.set_xlabel("UMAP Component 1")
ax.set_ylabel("UMAP Component 2")
plt.show()

## Conclusion and Next Steps

In this notebook, we built a small GPU-accelerated movie recommendation system on top of cuML.

By transforming the raw user ratings into a sparse movie-user matrix, we were able to represent each movie as a vector. Using `cuml.neighbors.NearestNeighbors` with the cosine metric, we could then find the movies whose rating patterns are closest to a given input and return them as recommendations.

This example demonstrates how a powerful primitive like k-Nearest Neighbors can be applied to solve a common, real-world problem in data science. From here, this system could be improved by incorporating movie genres or exploring more advanced recommendation algorithms like Matrix Factorization.
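
As one possible starting point for the genre idea, the sketch below compares two movies by the overlap of their genre sets. It assumes genres arrive as a pipe-separated string, as in the `genres` column of `movies.csv`. The helper names `genre_set` and `jaccard` are hypothetical, and the genre strings are typed in by hand rather than read from the dataset.

```python
def genre_set(genres):
    """Split a pipe-separated genre string into a set of genre names."""
    return set(genres.split("|"))

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Illustrative genre strings in the movies.csv style; not read from the dataset.
toy = genre_set("Adventure|Animation|Children|Comedy|Fantasy")
jumanji = genre_set("Adventure|Children|Fantasy")
print(jaccard(toy, jumanji))
```

A score like this could be blended with the rating-based distance, for example to break ties or to handle movies with very few ratings. How to weight the two signals is a design choice this notebook does not settle.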