### Ryan Frank
### DSC630
### 05/18/2025
# <center>Week 10 Assignment: Recommender System</center>

The goal is to build a recommender system on the small MovieLens data set: given a single movie, the recommender suggests 10 other movies to watch.  Because there is no user profile to work with (the only information we have on the user is the one selected movie), I will use item-to-item Collaborative Filtering to build my recommender system.  I want to leverage both the movie genres and the user ratings to make the recommendations.

Resources used: <br>
https://www.geeksforgeeks.org/item-to-item-based-collaborative-filtering/ <br>
https://analyticsindiamag.com/deep-tech/how-to-build-your-first-recommender-system-using-python-movielens-dataset/ <br>
https://towardsdatascience.com/using-cosine-similarity-to-build-a-movie-recommendation-system-ae7f20842599/ <br>

In [None]:
# Import modules used in code
import pandas
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# load movie data
movieData = pandas.read_csv('movies.csv')
movieData.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
# get distinct list of genres from the loaded data
distinctGenres = []
for i, row in movieData.iterrows():
    # do nothing if no genre data
    if row['genres'] == '(no genres listed)':
        continue
    # split genres on the | delimiter and add any genre not already in the list of unique genres
    genreList = str(row['genres']).split("|")
    for item in genreList:
        if item not in distinctGenres:
            distinctGenres.append(item)

In [35]:
# turn genres into a 0 or 1 flag and add as columns to the movie data
def checkGenre(movieGenres, genre):
    # compare against the split list so one genre name can never match as a substring of another
    if genre in str(movieGenres).split("|"):
        return 1
    return 0

# for each distinct genre discovered, create a column flagging whether that genre is present
for genre in distinctGenres:
    movieData[genre] = movieData['genres'].apply(lambda x: checkGenre(x, genre))


In [58]:
# create a new column that is the count of the number of genres that movie has in data
# Need this to test for movies with no genre data in the recommender (as we will not want to use genre in that case)
movieData['numGenres'] = movieData[distinctGenres].sum(axis=1)

In [59]:
# check resulting dataframe
movieData.head(10)

Unnamed: 0,movieId,title,genres,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,numGenres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,5
1,2,Jumanji (1995),Adventure|Children|Fantasy,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,3
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,2
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,0,0,0,1,0,1,1,...,0,0,0,0,0,0,0,0,0,3
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,6,Heat (1995),Action|Crime|Thriller,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
6,7,Sabrina (1995),Comedy|Romance,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,2
7,8,Tom and Huck (1995),Adventure|Children,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
8,9,Sudden Death (1995),Action,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,10,GoldenEye (1995),Action|Adventure|Thriller,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
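
The genre flag columns shown above could probably also be built in one step with pandas' `Series.str.get_dummies`. The sketch below is only an illustration on a made-up three-row frame (`toyMovies` and `genreFlags` are hypothetical names, not part of the notebook). Note that `get_dummies` returns the genre columns in alphabetical order rather than first-seen order, which should not affect cosine similarity.

```python
import pandas

# toy stand-in for movieData (the real notebook loads movies.csv); assumed for illustration only
toyMovies = pandas.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Untitled'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               '(no genres listed)'],
})

# one 0/1 column per distinct genre, splitting on the | delimiter
genreFlags = toyMovies['genres'].str.get_dummies(sep='|')
# drop the placeholder so movies without genre data end up with all-zero flags
genreFlags = genreFlags.drop(columns='(no genres listed)', errors='ignore')

# attach the flags and count how many genres each movie has
toyMovies = toyMovies.join(genreFlags)
toyMovies['numGenres'] = genreFlags.sum(axis=1)
print(toyMovies[['title', 'numGenres']])
```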


In [None]:
# create a matrix of genre flags indexed by movieId (one row per movie, one column per genre)
genreMatrix = movieData.set_index('movieId')[distinctGenres]
movieIDlist = genreMatrix.index.tolist()


In [None]:
# create a matrix of genre similarity between movies; both the index and the column names are movie IDs, which we can use to look up values later
# sorting on a movie ID column gives the movies whose genre sets are most similar to that movie
genreSimilarityMatrix = pandas.DataFrame(cosine_similarity(genreMatrix), columns=movieIDlist, index=movieIDlist)
genreSimilarityMatrix.head(10)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.0,0.774597,0.316228,0.258199,0.447214,0.0,0.316228,0.632456,0.0,0.258199,...,0.447214,0.316228,0.316228,0.447214,0.0,0.67082,0.774597,0.0,0.316228,0.447214
2,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
3,0.316228,0.0,1.0,0.816497,0.707107,0.0,1.0,0.0,0.0,0.0,...,0.353553,0.0,0.5,0.0,0.0,0.353553,0.408248,0.0,0.0,0.707107
4,0.258199,0.0,0.816497,1.0,0.57735,0.0,0.816497,0.0,0.0,0.0,...,0.288675,0.408248,0.816497,0.0,0.0,0.288675,0.333333,0.57735,0.0,0.57735
5,0.447214,0.0,0.707107,0.57735,1.0,0.0,0.707107,0.0,0.0,0.0,...,0.5,0.0,0.707107,0.0,0.0,0.5,0.57735,0.0,0.0,1.0
6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.57735,0.666667,...,0.288675,0.0,0.0,0.0,0.0,0.288675,0.0,0.0,0.408248,0.0
7,0.316228,0.0,1.0,0.816497,0.707107,0.0,1.0,0.0,0.0,0.0,...,0.353553,0.0,0.5,0.0,0.0,0.353553,0.408248,0.0,0.0,0.707107
8,0.632456,0.816497,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.408248,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,1.0,0.57735,...,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.707107,0.0
10,0.258199,0.333333,0.0,0.0,0.0,0.666667,0.0,0.408248,0.57735,1.0,...,0.288675,0.0,0.0,0.0,0.0,0.288675,0.0,0.0,0.408248,0.0
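
To make the numbers in this matrix easier to read, here is a small sketch that recomputes one entry by hand. It assumes only the five genre flags relevant to Toy Story and Jumanji (every other genre is 0 for both, so it contributes nothing), and the helper name `cosine` is hypothetical. The result should agree with the 0.774597 shown for movies 1 and 2 above.

```python
import numpy

def cosine(a, b):
    # cosine similarity = dot product divided by the product of the vector lengths
    return float(a @ b / (numpy.linalg.norm(a) * numpy.linalg.norm(b)))

# genre flags in the order Adventure, Animation, Children, Comedy, Fantasy
toyStory = numpy.array([1, 1, 1, 1, 1])
jumanji = numpy.array([1, 0, 1, 0, 1])

# 3 shared genres / (sqrt(5) * sqrt(3)) = 3 / sqrt(15)
similarity = cosine(toyStory, jumanji)
print(round(similarity, 6))
```

Because the flags are binary, two movies with identical genre sets always score exactly 1.0, so ties at the top of a sorted column are expected.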


In [46]:
# test pull - sort movies based on similarity to movieId 2
genreSimilarityMatrix.sort_values(by=2,ascending=False).head(10)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
50601,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
2043,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
59501,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
104074,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
173873,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
56915,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
1009,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
160573,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
56171,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0
4896,0.774597,1.0,0.0,0.0,0.0,0.0,0.0,0.816497,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.288675,0.333333,0.0,0.0,0.0


In [81]:
# build lookup dictionaries in a single pass over the movie data:
#   movieLookup: movie title -> movieId and genre count (links a movie title to the IDs used in the dataframes)
#   titleLookup: movieId -> movie title
movieLookup = {}
titleLookup = {}
for i, row in movieData.iterrows():
    movieLookup[row['title']] = {'movieID': row['movieId'], 'numGenres': row['numGenres']}
    titleLookup[row['movieId']] = row['title']
# test lookup
movieLookup['Father of the Bride Part II (1995)']

In [45]:
# load rating data
ratingData = pandas.read_csv('ratings.csv')
ratingData.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [47]:
# create a matrix of userID x movieID with the value being the rating given
ratingMatrix = ratingData.pivot_table(index='userId',columns='movieId',values='rating')
ratingMatrix.head(10)

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
6,,4.0,5.0,3.0,5.0,4.0,4.0,3.0,,3.0,...,,,,,,,,,,
7,4.5,,,,,,,,,,...,,,,,,,,,,
8,,4.0,,,,,,,,2.0,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,
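
A matrix this sparse matters for how correlations behave. The sketch below uses a made-up four-user matrix (`toyRatings` and its column labels are hypothetical) to illustrate how `DataFrame.corrwith` handles `NaN`: as far as I understand it, each correlation is computed only over the users who rated both movies, so a pair with just two raters in common comes out as exactly +1 or -1, and a pair with a single shared rater comes out as `NaN` (possibly with a NumPy runtime warning).

```python
import numpy
import pandas

nan = numpy.nan
# rows are users, columns are movies; NaN means the user did not rate that movie
toyRatings = pandas.DataFrame(
    {
        'A': [1.0, 2.0, 3.0, 4.0],
        'B': [1.0, 3.0, 3.0, nan],   # three raters in common with A
        'C': [nan, nan, 3.0, 1.0],   # only two raters in common with A
        'D': [5.0, nan, nan, nan],   # a single rater in common with A
    },
    index=pandas.Index([1, 2, 3, 4], name='userId'),
)

# Pearson correlation of every movie with movie A, using only the shared raters
correlations = toyRatings.corrwith(toyRatings['A'])

# number of users who rated both the movie and movie A
overlap = toyRatings.loc[toyRatings['A'].notna()].count()

print(pandas.DataFrame({'Correlation': correlations, 'SharedRaters': overlap}))
```

The perfect score for a pair with two shared raters says little about real similarity, which is one plausible way rarely rated titles can rise to the top of a correlation ranking.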


In [87]:
def makeRecommendations(movieTitle):
    # get the movieID of the movie we are basing recommendations on
    movieID = movieLookup[movieTitle]['movieID']
    # get the top 500 movies with similar genres based on the cosine similarity matrix created earlier
    similarGenres = genreSimilarityMatrix.sort_values(by=movieID,ascending=False).head(500)
    # get a list of the movieIDs with similar genres
    similarGenresList = similarGenres.index.values.tolist()
    # if the original movieID is in the list, remove it (so we don't recommend the movie selected in the first place)
    if movieID in similarGenresList:
        similarGenresList.remove(movieID)
    # I'm using corrwith instead of cosine similarity here because of all the NaN data in the matrix - I didn't want to zero them out as that would create similarity where none exists
    # this creates correlation between movies based on the same users providing similar scores
    correlations = ratingMatrix.corrwith(ratingMatrix[movieID])
    # turn correlations into a dataframe
    recommendations = pandas.DataFrame(correlations,columns=['Correlation'])
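    # remove entries with NaN correlation value, and the selected movie itself (it always correlates perfectly with itself)
    recommendations = recommendations.dropna().drop(index=movieID, errors='ignore')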
    if movieLookup[movieTitle]['numGenres'] != 0:
        # only do this if the number of genres in the movie data is non-zero
        # otherwise we would be prioritizing other movies with no genres, which doesn't make sense
        finalRecommendation = recommendations[recommendations.index.isin(similarGenresList)].sort_values(by='Correlation', ascending=False).head(10)
    else:
        # when there is no genre data, make recommendations based on just the rating correlation
        finalRecommendation = recommendations.sort_values(by='Correlation', ascending=False).head(10)
    # report recommendations, numbered from 1
    print(f"Based on your interest in {movieTitle} we recommend: ")
    for rank, recommendedID in enumerate(finalRecommendation.index, start=1):
        print(f"{rank}) {titleLookup[recommendedID]}")


In [88]:
makeRecommendations("Grumpier Old Men (1995)")

Based on your interest in Grumpier Old Men (1995) we recommend: 
1) Blame It on Rio (1984)
2) Down with Love (2003)
3) World According to Garp, The (1982)
4) Lost & Found (1999)
5) The Big Sick (2017)
6) Booty Call (1997)
7) Sweetest Thing, The (2002)
8) Love Potion #9 (1992)
9) Mr. Baseball (1992)
10) Heartbreakers (2001)


In [91]:
makeRecommendations("Toy Story (1995)")


Based on your interest in Toy Story (1995) we recommend: 
1) Mind Game (2004)
2) Great Yokai War, The (Yôkai daisensô) (2005)
3) Land Before Time III: The Time of the Great Giving (1995)
4) Wizard, The (1989)
5) Rio 2 (2014)
6) It's a Very Merry Muppet Christmas Movie (2002)
7) Zathura (2005)
8) For the Birds (2000)
9) Ewok Adventure, The (a.k.a. Caravan of Courage: An Ewok Adventure) (1984)
10) Planes: Fire & Rescue (2014)


My final version of the recommender uses the similarity of genres to filter movies before comparing the user rating profiles.  It seems to produce recommendations that make sense, although because it does not require a minimum number of ratings it tends to return some obscure titles along with the more obvious ones.  If I were to continue refining it, I would add a filter so that a movie needs a minimum number of ratings to make it onto the recommendation list.
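
A minimal sketch of what such a filter might look like is below. It is not part of the recommender above: the function name `filterByMinRatings`, the toy data, and the `minRatings` threshold are all hypothetical, and a sensible threshold for the real data set would need to be chosen by experiment.

```python
import numpy
import pandas

def filterByMinRatings(recommendations, ratingMatrix, minRatings):
    """Keep only the movies that at least minRatings users have rated."""
    # count() ignores NaN, so this is the number of ratings per movieId column
    ratingCounts = ratingMatrix.count()
    wellRated = ratingCounts[ratingCounts >= minRatings].index
    return recommendations[recommendations.index.isin(wellRated)]

nan = numpy.nan
# toy userId x movieId matrix: movie 10 has 3 ratings, movie 20 has 1, movie 30 has 2
toyRatingMatrix = pandas.DataFrame(
    {10: [4.0, 5.0, 3.0], 20: [nan, 2.0, nan], 30: [5.0, nan, 4.0]},
    index=pandas.Index([1, 2, 3], name='userId'),
)
# toy correlation table in the same shape the recommender builds
toyRecommendations = pandas.DataFrame({'Correlation': [0.9, 1.0, 0.5]}, index=[10, 20, 30])

# movie 20 has the highest correlation but too few ratings, so it is dropped
filtered = filterByMinRatings(toyRecommendations, toyRatingMatrix, minRatings=2)
print(filtered.sort_values(by='Correlation', ascending=False))
```

In the recommender, a call like this would presumably sit just before the final `sort_values(...).head(10)` step, applied to `recommendations` with `ratingMatrix`.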