# Real-World Application: Movie Recommendation System

In this notebook, we'll build a simple movie recommendation system using `cuml.neighbors.NearestNeighbors`, the GPU-accelerated k-nearest neighbors (k-NN) estimator in cuML.

The goal is to find movies that are "similar" to a given movie based on user rating patterns. We will use the popular **[MovieLens dataset](https://grouplens.org/datasets/movielens/)** for this task. The core idea is to treat each movie as a vector in a high-dimensional "user-rating space" and then use k-NN to find the closest vectors (i.e., the most similar movies).
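
To make the idea concrete before touching the real data, here is a minimal CPU-only sketch of the same search on a toy matrix. The titles, ratings, and the helper names `cosine_distances_to` and `nearest_movies` are made up for illustration. The code uses plain NumPy rather than cuML, so it shows the principle, not the library's implementation.

```python
import numpy as np

# Toy ratings: rows are movies, columns are users, 0 means "not rated".
# Titles and ratings are invented for illustration.
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # Movie A
    [4.0, 5.0, 0.0, 0.0],   # Movie B
    [0.0, 0.0, 5.0, 4.0],   # Movie C
    [1.0, 0.0, 4.0, 5.0],   # Movie D
])

def cosine_distances_to(query, matrix):
    """Cosine distance (1 - cosine similarity) from `query` to every row."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return 1.0 - (matrix @ query) / norms

def nearest_movies(query_index, matrix, k):
    """Indices of the k rows closest to row `query_index`, excluding itself."""
    distances = cosine_distances_to(matrix[query_index], matrix)
    order = np.argsort(distances)
    return [int(i) for i in order if i != query_index][:k]

toy_neighbors = nearest_movies(0, toy_ratings, k=2)
print([toy_titles[i] for i in toy_neighbors])
```

Movie B shares its high ratings with Movie A, so it ranks first. The rest of the notebook does the same thing at scale, with cuML performing the distance computation on the GPU.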

In [None]:
import cudf
import cupy as cp
import pandas as pd
from scipy.sparse import csr_matrix
import requests
import zipfile
import io

from cuml.neighbors import NearestNeighbors
from cuml.manifold import UMAP
import matplotlib.pyplot as plt

## 1. Data Loading and Exploration

First, we'll download the small MovieLens dataset and load `movies.csv` and `ratings.csv` into pandas DataFrames so we can inspect the raw data before processing it.

In [None]:
# URL for the MovieLens Small dataset
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
print("Downloading and extracting the MovieLens dataset...")

# Download the archive into memory and fail early on HTTP errors
r = requests.get(url, timeout=60)
r.raise_for_status()

# Extract the archive to the local "movielens_small" directory
with zipfile.ZipFile(io.BytesIO(r.content)) as z:
    z.extractall("movielens_small")
print("Dataset downloaded and extracted.")

# Load the data using pandas
movies_df = pd.read_csv("movielens_small/ml-latest-small/movies.csv")
ratings_df = pd.read_csv("movielens_small/ml-latest-small/ratings.csv")

print("\nMovies DataFrame:")
display(movies_df.head())

print("\nRatings DataFrame:")
display(ratings_df.head())

## 2. Data Preparation: Creating the User-Item Matrix

The raw ratings data is in a "long" format (one row per rating). To use it with k-NN, we need to transform it into a "wide" movie-user matrix (the transpose of the usual user-item matrix) where:
- Each **row** is a **movie**.
- Each **column** is a **user**.
- Each **value** is the **rating** the user gave the movie.

Most users have rated only a small fraction of the movies, so this matrix is mostly empty (sparse). We fill the gaps with 0 and store the result as a `scipy.sparse.csr_matrix`, which keeps only the nonzero entries.

In [None]:
print("Preparing the data for the model...")

# Create a pivot table: movies as rows, users as columns, ratings as values
movie_user_matrix_df = ratings_df.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)  # 0 marks "not rated"; real MovieLens ratings start at 0.5

# Convert the Pandas DataFrame into a memory-efficient SciPy sparse matrix
movie_user_matrix_sparse = csr_matrix(movie_user_matrix_df.values)

print("Movie-user matrix created successfully.")
print(f"Sparse matrix shape: {movie_user_matrix_sparse.shape}")
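
The pivot above first materializes a dense movies-by-users table, which is fine for the small dataset but may not scale to larger ones. As a sketch of one alternative, the sparse matrix can be assembled directly from the long-format rows. The helper name `build_sparse_movie_user_matrix` and the sample data are hypothetical, and the `float32` dtype is an assumption about what GPU libraries typically prefer.

```python
import pandas as pd
from scipy.sparse import csr_matrix

def build_sparse_movie_user_matrix(ratings):
    """Build a movies-by-users CSR matrix straight from long-format ratings."""
    # Map raw IDs to contiguous row/column positions (categories are sorted by ID).
    movie_cat = pd.Categorical(ratings["movieId"])
    user_cat = pd.Categorical(ratings["userId"])
    rows = movie_cat.codes.astype("int64")
    cols = user_cat.codes.astype("int64")
    values = ratings["rating"].to_numpy(dtype="float32")
    matrix = csr_matrix(
        (values, (rows, cols)),
        shape=(len(movie_cat.categories), len(user_cat.categories)),
    )
    return matrix, movie_cat.categories, user_cat.categories

# Tiny invented sample with the same columns as ratings.csv
sample_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 30, 10, 20],
    "rating": [4.0, 5.0, 3.5, 2.0],
})
sample_matrix, sample_movie_ids, sample_user_ids = build_sparse_movie_user_matrix(sample_ratings)
density = sample_matrix.nnz / (sample_matrix.shape[0] * sample_matrix.shape[1])
print(sample_matrix.shape, f"density={density:.2%}")
```

The returned `sample_movie_ids` plays the role of `movie_user_matrix_df.index`: position `i` in it is the movie ID of matrix row `i`. One difference to keep in mind is that this constructor sums duplicate (movie, user) pairs, whereas `pivot` raises an error on them.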

## 3. Training the k-NN Model

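Now we can fit our `NearestNeighbors` model. We'll use the **cosine** metric. The model reports cosine distance (1 minus cosine similarity), which compares the direction of two rating vectors rather than their magnitude, so two movies count as similar when the same users rate them in the same relative pattern.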
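
The following sketch illustrates why the choice of metric matters. The three rating vectors are invented, and the helper `cosine_distance` is written out by hand for illustration rather than taken from cuML.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented rating vectors over the same four users.
movie_a = np.array([5.0, 5.0, 0.0, 0.0])
movie_b = np.array([1.0, 1.0, 0.0, 0.0])  # same two users, same pattern, lower scores
movie_c = np.array([3.0, 3.0, 3.0, 3.0])  # rated identically by everyone

cos_ab = cosine_distance(movie_a, movie_b)
cos_ac = cosine_distance(movie_a, movie_c)
euc_ab = np.linalg.norm(movie_a - movie_b)
euc_ac = np.linalg.norm(movie_a - movie_c)

print(f"cosine:    a-b={cos_ab:.3f}  a-c={cos_ac:.3f}")
print(f"euclidean: a-b={euc_ab:.3f}  a-c={euc_ac:.3f}")
```

Cosine distance treats `movie_b` as the closer match to `movie_a` because both are rated by the same users in the same proportions, while Euclidean distance prefers `movie_c` simply because its values are numerically nearer.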

In [None]:
print("Training the k-Nearest Neighbors model...")

# Instantiate the model
# n_neighbors is 11 because the query movie is normally returned as its own
# nearest neighbor, which leaves 10 actual recommendations
model_knn = NearestNeighbors(n_neighbors=11,
                             metric='cosine',
                             algorithm='brute')

# Fit the model on our sparse matrix (brute-force k-NN simply stores the data)
model_knn.fit(movie_user_matrix_sparse)

print("Model trained successfully!")

## 4. Building the Recommendation Function

With a trained model, we can create a function that takes a movie title, finds its nearest neighbors in the rating space, and returns them as a list of recommendations.

In [None]:
def get_recommendations(movie_title, model, matrix, movie_df, matrix_df):
    """
    Return up to 10 movies most similar to `movie_title`, closest first,
    or None if the title is unknown or has no ratings.
    """
    print(f"Finding recommendations for: '{movie_title}'")

    # 1. Find the movie's ID from its title (exact match required)
    matches = movie_df.loc[movie_df['title'] == movie_title, 'movieId']
    if matches.empty:
        print(f"--> Movie '{movie_title}' not found in the dataset.")
        return None
    movie_id = matches.iloc[0]

    # 2. Find the matrix row index for that movie ID
    try:
        movie_index = matrix_df.index.get_loc(movie_id)
    except KeyError:
        print(f"--> Movie '{movie_title}' has no ratings and cannot be used for recommendations.")
        return None

    # 3. Use the k-NN model to find the nearest neighbors
    movie_vector = matrix[movie_index].reshape(1, -1)
    distances, indices = model.kneighbors(movie_vector)

    # 4. Drop the query movie itself and keep the 10 closest neighbors.
    #    cp.asnumpy accepts both NumPy and CuPy arrays.
    neighbor_indices = [i for i in cp.asnumpy(indices).ravel() if i != movie_index][:10]

    # 5. Convert matrix indices back to movies, preserving nearest-first order
    recommended_movie_ids = list(matrix_df.index[neighbor_indices])
    recommendations = movie_df.set_index('movieId').loc[recommended_movie_ids].reset_index()

    return recommendations
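
Because the lookup needs the exact title, including the release year, a small search helper can make the function easier to use. This is a sketch with a hypothetical helper name, `find_titles`, and a hand-written sample in which the movie IDs are placeholders.

```python
import pandas as pd

def find_titles(query, movie_df, limit=5):
    """Return up to `limit` titles containing `query`, ignoring case."""
    mask = movie_df["title"].str.contains(query, case=False, regex=False)
    return movie_df.loc[mask, "title"].head(limit).tolist()

# Hand-written sample with the same column names as movies.csv;
# the IDs are placeholders.
sample_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Toy Story 2 (1999)"],
})
found_titles = find_titles("toy story", sample_movies)
print(found_titles)
```

In the notebook, `find_titles("toy story", movies_df)` would list candidate titles, and any of them can then be passed to `get_recommendations` unchanged.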

## 5. Testing the Recommender

Let's test our system with a few different movies to see the recommendations in action.

In [None]:
# Test with a classic animation
recommendations = get_recommendations(
    'Toy Story (1995)', 
    model_knn, 
    movie_user_matrix_sparse, 
    movies_df, 
    movie_user_matrix_df
)

if recommendations is not None:
    print("\nRecommended movies:")
    display(recommendations[['title', 'genres']])

# Test with a family adventure movie
print("\n" + "-" * 50)
recommendations = get_recommendations(
    'Jumanji (1995)', 
    model_knn, 
    movie_user_matrix_sparse, 
    movies_df, 
    movie_user_matrix_df
)

if recommendations is not None:
    print("\nRecommended movies:")
    display(recommendations[['title', 'genres']])

## 6. Bonus: Visualizing the Movie Space with UMAP

To better understand our data, we can use another `cuml` algorithm, **UMAP**, for dimensionality reduction. It projects our high-dimensional movie vectors (one dimension per user) down to 2D, giving us a "map" of all movies on which movies with similar rating patterns tend to appear close together. UMAP is used here with its default distance metric, so neighborhoods on the map need not match the cosine neighbors from the recommender exactly.

In [None]:
print("Reducing dimensionality with UMAP for visualization...")

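# Instantiate and fit the UMAP model
umap_model = UMAP(n_components=2, random_state=42)
movie_vectors_2d = umap_model.fit_transform(movie_user_matrix_sparse)

# Make sure the embedding is a host (NumPy) array before plotting
movie_vectors_2d = cp.asnumpy(movie_vectors_2d)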

# Create the scatter plot
fig, ax = plt.subplots(figsize=(16, 12))
ax.scatter(movie_vectors_2d[:, 0], movie_vectors_2d[:, 1], s=1, alpha=0.5)
ax.set_title("2D Map of All Movies in the User-Rating Space")
ax.set_xlabel("UMAP Component 1")
ax.set_ylabel("UMAP Component 2")
plt.show()
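
A scatter plot of unlabeled points is hard to interpret on its own. One way to make it explorable, sketched here under the assumption that the embedding is a NumPy array whose rows follow the matrix row order, is to join the coordinates with titles and genres. The helper name `build_map_frame` and the coordinates are made up for illustration.

```python
import numpy as np
import pandas as pd

def build_map_frame(coords, movie_ids, movie_df):
    """Attach titles and genres to 2D coordinates, one row per movie."""
    coords = np.asarray(coords)
    frame = pd.DataFrame({
        "movieId": np.asarray(movie_ids),
        "x": coords[:, 0],
        "y": coords[:, 1],
    })
    # A left join keeps the matrix row order and every embedded movie.
    return frame.merge(movie_df, on="movieId", how="left")

# Invented coordinates standing in for the UMAP output
toy_coords = np.array([[0.1, 0.2], [1.5, -0.3]])
toy_ids = [2, 1]
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})
map_frame = build_map_frame(toy_coords, toy_ids, toy_movies)
print(map_frame[["title", "x", "y"]])
```

With the notebook's variables this would be `build_map_frame(movie_vectors_2d, movie_user_matrix_df.index, movies_df)`, after which points can be filtered or colored by genre with ordinary pandas operations.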

## 7. Conclusion

In this notebook, we built a working, GPU-accelerated movie recommendation system. By representing movies as vectors in a user-rating space, we used `cuml.neighbors.NearestNeighbors` to find the movies whose rating patterns are closest to a given title. The final UMAP visualization gives a 2D view of the same rating-pattern relationships that drive the recommendations.

This example demonstrates how a powerful primitive like k-NN can be applied to solve a common, real-world data science problem.