# System 1 - Recommendations by Genre

Group Members: rpberry2, sarahxy2, sauravv2

We created three different algorithms for recommending movies by genre:

1.   Highest Rated
2.   Highest Rated and Recently Created
3.   Most Popular

We will detail each of the three algorithms further below. All algorithms went through the same data loading and preprocessing steps.

# Data Loading

Load data to prepare for preprocessing.

In [17]:
import pandas as pd


genre_list = ["Action", "Adventure", "Animation",
               "Children's", "Comedy", "Crime",
               "Documentary", "Drama", "Fantasy",
               "Film-Noir", "Horror", "Musical",
               "Mystery", "Romance", "Sci-Fi",
               "Thriller", "War", "Western"]

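def load_data():
    # movies.dat holds three '::'-separated fields; Year is derived later in preprocess().
    # engine='python' is needed because the C parser does not support multi-character separators.
    movies = pd.read_csv('movies.dat', sep='::', names=['MovieID', 'Title', 'Genre'],
                         engine='python', encoding='latin-1')
    ratings = pd.read_csv('ratings.dat', sep='::',
                          names=['UserID', 'MovieID', 'Rating', 'Timestamp'],
                          engine='python')
    return movies, ratings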
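
To make the file layout concrete, here is a minimal plain-Python sketch of how one `::`-delimited movie record splits into fields. The sample line and the helper name `parse_movie_line` are illustrative assumptions, not part of the notebook, which relies on `pd.read_csv` instead.

```python
def parse_movie_line(line):
    """Split one '::'-delimited record into (movie_id, title, genres)."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    # Genres are stored as a single '|'-separated string.
    return int(movie_id), title, genres.split("|")


# Assumed sample record: ID, title with the year in parentheses, genres.
sample = "1::Toy Story (1995)::Animation|Children's|Comedy"
movie_id, title, genres = parse_movie_line(sample)
print(movie_id, title, genres)
```

The year only appears inside the title string, which is why preprocessing has to extract it into its own column.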

# Data Preprocessing
All algorithms share the same preprocessing steps; algorithm 2 adds one step that removes movies released before a given year.

Our steps are as follows:


*   Remove year from title and assign it to its own column
*   Remove movies that were reviewed fewer than 30 times, because an average over so few ratings is not reliable
*   (Algo 2 only) - remove movies released before 1995
*   Compute the average rating for each movie

In [11]:
def preprocess(movies, ratings, year_filter=None):
    # extract the four-digit year from the "(YYYY)" suffix of the title
    movies['Year'] = movies['Title'].str[-5:-1]
    movies['Year'] = movies['Year'].str.extract(r'(\d+)', expand=False)
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    movies = movies[movies['Year'].notna()].copy()
    movies['Year'] = movies['Year'].astype(int)

    # remove the "(YYYY)" suffix from the title
    movies['Title'] = movies['Title'].str[:-6]

    # remove movies with fewer than 30 total reviews
    ratings_grouped = ratings.groupby('MovieID', as_index=False).size()
    movies = movies.join(ratings_grouped.set_index('MovieID'), on='MovieID')
    movies = movies.rename(columns={'size': 'RatingCount'})
    size1 = len(movies)
    movies = movies[movies['RatingCount'] >= 30]
    size2 = len(movies)
    print('\n\nRemoved', size1-size2, 'movies because there aren\'t enough reviews\n\n')

    # keep only movies released in year_filter or later
    if year_filter is not None:
        movies = movies[movies['Year'] >= year_filter]

    # assign average ratings
    ratings_avg = ratings.groupby('MovieID', as_index=False)['Rating'].mean()
    movies = movies.join(ratings_avg.set_index('MovieID'), on='MovieID')
    movies = movies.rename(columns={'Rating': 'AvgRating'})

    return movies
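
The preprocessing steps can also be traced on toy data without pandas. The sketch below is a hedged, plain-Python illustration: the helper names `split_title_year` and `summarize_ratings`, the constant `MIN_REVIEWS`, and the toy ratings are all assumptions made for this example, not code from the notebook.

```python
import re
from collections import defaultdict

MIN_REVIEWS = 30  # same threshold as the notebook's filter


def split_title_year(raw_title):
    """Return (title, year) for a string such as 'Toy Story (1995)'."""
    match = re.search(r"\((\d{4})\)\s*$", raw_title)
    if match is None:
        return raw_title, None
    return raw_title[:match.start()].rstrip(), int(match.group(1))


def summarize_ratings(ratings):
    """Map movie_id -> (rating_count, average_rating) from (movie_id, rating) pairs."""
    totals = defaultdict(lambda: [0, 0])  # movie_id -> [count, sum]
    for movie_id, rating in ratings:
        totals[movie_id][0] += 1
        totals[movie_id][1] += rating
    return {mid: (count, total / count) for mid, (count, total) in totals.items()}


# Hypothetical toy data: movie 1 has 30 ratings, movie 2 has only 2.
toy_ratings = [(1, 4)] * 20 + [(1, 5)] * 10 + [(2, 5)] * 2
summary = summarize_ratings(toy_ratings)
kept = {mid: stats for mid, stats in summary.items() if stats[0] >= MIN_REVIEWS}
print(split_title_year("Toy Story (1995)"), kept)
```

Movie 2 has a perfect 5.0 average from only two ratings, so without the threshold it would outrank movie 1. Unlike the fixed-width slice in preprocess, this sketch matches the year with a regular expression and strips the space left in front of it.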

# Data Analysis

In this step, we identify the top n movies in each genre, ranked by a chosen column (AvgRating or RatingCount).

In [12]:
def get_top_by_genre(movies, category, top_n):
    top_dict = {}

    for genre in genre_list:
        # regex=False matches the genre name literally rather than as a regular expression
        genre_movies = movies[movies['Genre'].str.contains(genre, regex=False)]
        genre_top = genre_movies.sort_values(by=[category], ascending=[False]).head(top_n)
        top_dict[genre] = genre_top.to_dict('records')

    return top_dict
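
Because a movie's Genre value can list several genres, the same movie can appear in more than one genre's top list. The following plain-Python sketch shows that behavior on toy records. The function `top_by_genre` and the toy data are hypothetical, and it matches genres by exact membership after splitting on `|`, which is a slightly stricter test than the substring match used above.

```python
def top_by_genre(movies, genres, key, top_n):
    """Return {genre: top_n movies sorted by `key`, highest first}."""
    result = {}
    for genre in genres:
        in_genre = [m for m in movies if genre in m["Genre"].split("|")]
        result[genre] = sorted(in_genre, key=lambda m: m[key], reverse=True)[:top_n]
    return result


# Hypothetical records using the same column names as the notebook.
toy_movies = [
    {"Title": "A", "Genre": "Action|Adventure", "AvgRating": 4.6},
    {"Title": "B", "Genre": "Action|Drama", "AvgRating": 4.5},
    {"Title": "C", "Genre": "Drama", "AvgRating": 4.7},
]
top = top_by_genre(toy_movies, ["Action", "Adventure", "Drama"], "AvgRating", top_n=2)
for genre, rows in top.items():
    print(genre, [m["Title"] for m in rows])
```

Movie A is listed under both Action and Adventure, which mirrors how Sanjuro shows up in both of those genres in the results further down.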

# Data Display

This step helps to display recommendation results more clearly.

In [13]:
def display_top_results(top_dict, title=None):
    if title:  # skip the header when no title is given instead of printing "None"
        print(title)
    for key, value in top_dict.items():
        print('Genre:', key)
        for movie in value:
            print('Movie:', movie)
        print('\n')
    print('\n')
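
The function above prints each record as a raw dictionary. If a more compact line per movie is preferred, one possible formatter is sketched below. The name `format_movie` and the output layout are assumptions for illustration, not part of the notebook.

```python
def format_movie(movie):
    """Render one movie record as a single readable line."""
    return (
        f"{movie['Title'].strip()} ({movie['Year']}): "
        f"{movie['AvgRating']:.2f} avg over {int(movie['RatingCount'])} ratings"
    )


# Record fields copied from the first Action result shown in this notebook.
record = {"MovieID": 2905, "Title": "Sanjuro ", "Genre": "Action|Adventure",
          "Year": 1962, "RatingCount": 69.0, "AvgRating": 4.608695652173913}
print(format_movie(record))
```

The formatter trims the trailing space left in the title and shows the rating count as an integer.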

# Testing Algorithm 1 (Highest Rated)
Highest Rated ranks the movies in each genre by average rating. To get this metric, we take the mean rating across all reviews for each movie that survived preprocessing.

Notice how the AvgRating dictates the order of results.

In [14]:
# Get the top 5 movies per genre by average rating
movies_df, ratings_df = load_data()

cleaned_movies_df = preprocess(movies_df, ratings_df)
genre_dict = get_top_by_genre(cleaned_movies_df, 'AvgRating', top_n=5)
display_top_results(genre_dict, '\n\nEvaluating movies by average rating\n')





Removed 1047 movies because there aren't enough reviews




Evaluating movies by average rating

Genre: Action
Movie: {'MovieID': 2905, 'Title': 'Sanjuro ', 'Genre': 'Action|Adventure', 'Year': 1962, 'RatingCount': 69.0, 'AvgRating': 4.608695652173913}
Movie: {'MovieID': 2019, 'Title': 'Seven Samurai (The Magnificent Seven) (Shichinin no samurai) ', 'Genre': 'Action|Drama', 'Year': 1954, 'RatingCount': 628.0, 'AvgRating': 4.560509554140127}
Movie: {'MovieID': 858, 'Title': 'Godfather, The ', 'Genre': 'Action|Crime|Drama', 'Year': 1972, 'RatingCount': 2223.0, 'AvgRating': 4.524966261808367}
Movie: {'MovieID': 1198, 'Title': 'Raiders of the Lost Ark ', 'Genre': 'Action|Adventure', 'Year': 1981, 'RatingCount': 2514.0, 'AvgRating': 4.477724741447892}
Movie: {'MovieID': 260, 'Title': 'Star Wars: Episode IV - A New Hope ', 'Genre': 'Action|Adventure|Fantasy|Sci-Fi', 'Year': 1977, 'RatingCount': 2991.0, 'AvgRating': 4.453694416583082}


Genre: Adventure
Movie: {'MovieID': 2905, 'Title': 'Sa

# Testing Algorithm 2 (Highest Rated and Recently Created)

This algorithm is identical to algorithm 1, but includes a condition to remove older movies. All movies released before 1995 are excluded. Since this algorithm is so similar to the first, we added an extra algorithm (algo 3) for this assignment.

Notice how the year is always 1995 or later.
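
The year filter uses an inclusive comparison, so movies from the cutoff year itself are kept. A minimal sketch of that boundary follows, using three titles and years taken from the results in this notebook. The names `YEAR_FILTER` and `toy_years` are illustrative.

```python
YEAR_FILTER = 1995

# Titles and years as they appear in this notebook's output.
toy_years = {"Braveheart": 1995, "Matrix, The": 1999, "Jurassic Park": 1993}

# Same comparison as preprocess(): keep movies released in YEAR_FILTER or later.
recent = {title: year for title, year in toy_years.items() if year >= YEAR_FILTER}
print(recent)
```

Braveheart (1995) passes the filter, while Jurassic Park (1993) does not.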

In [15]:
# Get the top 5 recently created movies per genre by average rating (only movies from 1995 or later)
movies_df, ratings_df = load_data()

cleaned_movies_df_year = preprocess(movies_df, ratings_df, year_filter=1995)
genre_year_dict = get_top_by_genre(cleaned_movies_df_year, 'AvgRating', top_n=5)
display_top_results(genre_year_dict, '\n\nEvaluating movies after 1995 by average rating\n')



Removed 1047 movies because there aren't enough reviews




Evaluating movies after 1995 by average rating

Genre: Action
Movie: {'MovieID': 2028, 'Title': 'Saving Private Ryan ', 'Genre': 'Action|Drama|War', 'Year': 1998, 'RatingCount': 2653.0, 'AvgRating': 4.337353938937053}
Movie: {'MovieID': 2571, 'Title': 'Matrix, The ', 'Genre': 'Action|Sci-Fi|Thriller', 'Year': 1999, 'RatingCount': 2590.0, 'AvgRating': 4.315830115830116}
Movie: {'MovieID': 110, 'Title': 'Braveheart ', 'Genre': 'Action|Drama|War', 'Year': 1995, 'RatingCount': 2443.0, 'AvgRating': 4.234957020057307}
Movie: {'MovieID': 2692, 'Title': 'Run Lola Run (Lola rennt) ', 'Genre': 'Action|Crime|Romance', 'Year': 1998, 'RatingCount': 1072.0, 'AvgRating': 4.224813432835821}
Movie: {'MovieID': 3000, 'Title': 'Princess Mononoke, The (Mononoke Hime) ', 'Genre': 'Action|Adventure|Animation', 'Year': 1997, 'RatingCount': 345.0, 'AvgRating': 4.147826086956521}


Genre: Adventure
Movie: {'MovieID': 3000, 'Title': 'Princess Mononok

# Testing Algorithm 3 (Most Popular)

Most Popular ranks the movies in each genre by how many ratings they received. To get this metric, we count the reviews for each movie; the movies with the most reviews are considered the most popular.

Notice how RatingCount, not AvgRating, dictates the order of results.
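
Counting reviews per movie can be illustrated without pandas. The sketch below uses `collections.Counter` on hypothetical (MovieID, Rating) pairs. The toy ratings are made up for the example, and only the counting idea corresponds to the notebook.

```python
from collections import Counter

# Hypothetical (movie_id, rating) pairs.
toy_ratings = [(260, 5), (260, 4), (260, 4), (480, 3), (480, 4), (2905, 5)]

# Popularity ignores the rating values and only counts rows per movie.
counts = Counter(movie_id for movie_id, _ in toy_ratings)
most_popular = counts.most_common(2)
print(most_popular)
```

Movie 2905 has the highest rating in the toy data but only one review, so it does not make the top two by popularity.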

In [16]:
# Get the top 5 movies per genre by popularity (quantity of reviews)
movies_df, ratings_df = load_data()

cleaned_movies_df_pop = preprocess(movies_df, ratings_df)
genre_pop_dict = get_top_by_genre(cleaned_movies_df_pop, 'RatingCount', top_n=5)
display_top_results(genre_pop_dict, '\n\nEvaluating movies by popularity\n')



Removed 1047 movies because there aren't enough reviews




Evaluating movies by popularity

Genre: Action
Movie: {'MovieID': 260, 'Title': 'Star Wars: Episode IV - A New Hope ', 'Genre': 'Action|Adventure|Fantasy|Sci-Fi', 'Year': 1977, 'RatingCount': 2991.0, 'AvgRating': 4.453694416583082}
Movie: {'MovieID': 1196, 'Title': 'Star Wars: Episode V - The Empire Strikes Back ', 'Genre': 'Action|Adventure|Drama|Sci-Fi|War', 'Year': 1980, 'RatingCount': 2990.0, 'AvgRating': 4.292976588628763}
Movie: {'MovieID': 1210, 'Title': 'Star Wars: Episode VI - Return of the Jedi ', 'Genre': 'Action|Adventure|Romance|Sci-Fi|War', 'Year': 1983, 'RatingCount': 2883.0, 'AvgRating': 4.022892819979188}
Movie: {'MovieID': 480, 'Title': 'Jurassic Park ', 'Genre': 'Action|Adventure|Sci-Fi', 'Year': 1993, 'RatingCount': 2672.0, 'AvgRating': 3.7638473053892216}
Movie: {'MovieID': 2028, 'Title': 'Saving Private Ryan ', 'Genre': 'Action|Drama|War', 'Year': 1998, 'RatingCount': 2653.0, 'AvgRating': 4.337353938937