In [2]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


In [3]:
movies = pd.read_csv("https://raw.githubusercontent.com/taylorduncan/DSC630/main/movies.csv")
ratings = pd.read_csv("https://raw.githubusercontent.com/taylorduncan/DSC630/main/ratings.csv")


In [4]:
# Merge ratings with movie titles
movie_ratings = pd.merge(ratings, movies, on='movieId')

# Create user-movie matrix (rows: users, columns: movie titles)
user_movie_matrix = movie_ratings.pivot_table(index='userId', columns='title', values='rating')

# Fill NaNs with 0 so cosine similarity can be computed (0 here means "not rated")
user_movie_matrix_filled = user_movie_matrix.fillna(0)


In [5]:
# Transpose so that rows are movies and columns are users
movie_user_matrix = user_movie_matrix_filled.T


In [6]:
# Compute cosine similarity
movie_similarity = cosine_similarity(movie_user_matrix)

# Convert to DataFrame for easy lookup
movie_similarity_df = pd.DataFrame(movie_similarity, index=movie_user_matrix.index, columns=movie_user_matrix.index)


In [7]:
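def recommend_movies(movie_title, similarity_df, n=10):
    """Return the n movies most similar to movie_title as a Series of scores.

    If the title is not in similarity_df, a message string is returned instead.
    """
    if movie_title not in similarity_df.columns:
        return f"'{movie_title}' not found in the dataset."

    # Get similarity scores for the input movie, highest first
    similar_scores = similarity_df[movie_title].sort_values(ascending=False)

    # Exclude the input movie itself (its similarity with itself is 1)
    similar_scores = similar_scores.drop(movie_title)

    return similar_scores.head(n)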


In [8]:
recommend_movies("Toy Story (1995)", movie_similarity_df)


title
Toy Story 2 (1999)                                   0.572601
Jurassic Park (1993)                                 0.565637
Independence Day (a.k.a. ID4) (1996)                 0.564262
Star Wars: Episode IV - A New Hope (1977)            0.557388
Forrest Gump (1994)                                  0.547096
Lion King, The (1994)                                0.541145
Star Wars: Episode VI - Return of the Jedi (1983)    0.541089
Mission: Impossible (1996)                           0.538913
Groundhog Day (1993)                                 0.534169
Back to the Future (1985)                            0.530381
Name: Toy Story (1995), dtype: float64

In this project, I developed a movie recommender system using the small MovieLens dataset, which includes three CSV files: movies.csv (movie titles and genres), ratings.csv (user ratings for movies), and links.csv. Only movies.csv and ratings.csv are used here. The goal of the recommender system is to suggest ten similar movies for a movie title supplied by the user.

I merged the ratings and movies datasets on movieId to connect each rating with its movie title. From this combined dataset, I created a user-movie ratings matrix, where each row represents a user, each column a movie, and each value that user's rating for the movie. Because users do not rate every movie, the matrix contains missing values, which I filled with zeros so that similarity could be calculated. A zero in this matrix therefore means "not rated" rather than a very low rating.
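As a minimal sketch of this step, the toy example below repeats the merge, pivot, and fill on made-up data. The names toy_ratings, toy_movies, and the "Movie A/B/C" titles are hypothetical and are not part of the MovieLens files; only the column names match the code above.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv and movies.csv
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.5],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# Attach titles to ratings, then pivot to one row per user, one column per title
toy_merged = pd.merge(toy_ratings, toy_movies, on="movieId")
toy_matrix = toy_merged.pivot_table(index="userId", columns="title", values="rating")

# Movies a user did not rate appear as NaN; replace them with 0
toy_filled = toy_matrix.fillna(0)
print(toy_filled)
```

In this toy matrix, user 3 rated only "Movie B", so the other two cells in that row start as NaN and become 0 after the fill.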

To find movies that are similar based on user ratings, I transposed the matrix to a movie-user format and then calculated the cosine similarity between all movie pairs. Cosine similarity measures the angle between two movies' rating vectors, that is, how closely their rating patterns align across all users. Because the ratings are non-negative, the scores fall between 0 and 1. A score close to 1 indicates that two movies were rated by largely the same users and received similar ratings from them, while a score close to 0 indicates that few users rated both movies, so there is little to no similarity.
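To make the calculation concrete, here is a small sketch that computes cosine similarity for two made-up rating vectors by hand, as the dot product divided by the product of the vector lengths, and compares it with scikit-learn's cosine_similarity. The vectors movie_a and movie_b are hypothetical and do not come from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings of two movies by the same four users (0 = not rated)
movie_a = np.array([5.0, 4.0, 0.0, 3.0])
movie_b = np.array([4.0, 5.0, 0.0, 0.0])

# Cosine similarity by hand: dot product over the product of the norms
manual = movie_a @ movie_b / (np.linalg.norm(movie_a) * np.linalg.norm(movie_b))

# The same value from scikit-learn, which expects 2D inputs
library = cosine_similarity([movie_a], [movie_b])[0, 0]

print(manual, library)
```

The fourth user rated only the first movie, which lowers the score: with zero-filling, users who rated one movie but not the other pull the similarity toward 0.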

I then defined a Python function that takes a movie title as input and returns the 10 most similar movies by default, excluding the input movie itself. This approach is an example of item-based collaborative filtering, which relies on the rating behavior of users rather than on movie attributes such as genre. It rests on the assumption that movies rated in similar ways by the same users are likely to appeal to the same audience.
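The lookup itself can be illustrated on a tiny similarity table. The sketch below uses a hypothetical helper, top_similar, and made-up scores for three placeholder titles; it follows the same steps as the function above (select the column, drop the input title, sort, take the top n) but is not the notebook's function.

```python
import pandas as pd

# Hypothetical 3x3 similarity table; the diagonal is each movie with itself
titles = ["Movie A", "Movie B", "Movie C"]
toy_similarity_df = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=titles,
    columns=titles,
)

def top_similar(title, similarity_df, n=2):
    """Return the n highest-scoring titles for `title`, excluding itself."""
    scores = similarity_df[title].drop(title)
    return scores.sort_values(ascending=False).head(n)

result = top_similar("Movie A", toy_similarity_df)
print(result)
```

Dropping the input title matters because every movie has a similarity of 1 with itself and would otherwise always rank first.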

The technique applied here is inspired by collaborative filtering methods found in data science tutorials and articles. It also draws on the general principles described in the article “Content-Based Recommender Systems” published on Towards Data Science (https://towardsdatascience.com/content-based-recommender-systems-1c3c8c19e665), although the implementation here is collaborative rather than content-based.