This project analyzes a movie dataset and uses that analysis to build a recommendation engine.
The scripts folder contains all the raw Python files used for the analysis. Please download the Data-set rar file and extract it into the same folder as the scripts so that they can read the data.
Introduction and Analysis
Introduction
The dataset I will be using is hosted on Kaggle and is called “The Movies Dataset”. It contains over 45,000 movies with information about ratings, keywords, directors, cast, and much more. I would like to create a recommendation engine that suggests similar movies based on a movie the user inputs.
Assessing What Makes a Movie Good
To do this, I first have to define what indicates that a user will like a certain movie and, more basically, what I can use to measure how much a user will like it. This means that before building the engine I must analyze which pieces of information are good factors to consider when recommending movies.
Filtering Data
The average rating of a movie will be the metric I use to measure how “good” a movie is. Any factor that shows evidence of increasing a movie's average rating will be treated as important when building the engine. To remove outliers and data that would make the analysis inaccurate, we will filter the dataset to movies that have been rated by at least 100 people and have a runtime between 40 and 180 minutes.
## Import Packages
import pandas as pd  # importing all the important packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from ast import literal_eval
from operator import itemgetter
pd.options.mode.chained_assignment = None

## Define data frames
df = pd.read_csv('movies_metadata.csv', encoding = 'UTF-8', low_memory=False)
df2 = pd.read_csv('credits.csv', encoding = 'UTF-8', low_memory=False)
df3 = pd.read_csv('keywords.csv', encoding = 'UTF-8', low_memory=False)

## Filter Movies according to definition
df_fil = df[df["vote_count"] >= 100]
df_fil2 = df_fil[df_fil["runtime"] >= 40]
movie_data = df_fil2[df_fil2["runtime"] <= 180]
Let's see what this looks like:
movie_data.head(n = 5)
| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |
5 | False | NaN | 60000000 | [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam... | NaN | 949 | tt0113277 | en | Heat | Obsessive master thief, Neil McCauley leads a ... | ... | 1995-12-15 | 187436818.0 | 170.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | A Los Angeles Crime Saga | Heat | False | 7.7 | 1886.0 |
6 | False | NaN | 58000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '... | NaN | 11860 | tt0114319 | en | Sabrina | An ugly duckling having undergone a remarkable... | ... | 1995-12-15 | 0.0 | 127.0 | [{'iso_639_1': 'fr', 'name': 'Français'}, {'is... | Released | You are cordially invited to the most surprisi... | Sabrina | False | 6.2 | 141.0 |
5 rows × 24 columns
Graphing and Analysis
Now that we have filtered our dataset, we can look for information that correlates with the average user rating. To visualize this we will use Seaborn and Matplotlib. First, we will start with a graph of runtime in relation to rating.
## Graph Runtime vs Ratings
# Note: seaborn 0.9 renamed lmplot's `size` argument to `height`
g = sns.lmplot(x = "runtime", y = "vote_average", data = movie_data, height = 6, palette = "YlOrRd", hue = "runtime", fit_reg=False, legend = False)
plt.title("Runtime in Relation to Ratings")
plt.show()
As we can see, there is no clear relationship between the runtime and the average rating. There is just a mass of points between 80 and 140 minutes, which makes sense as most movies are around that length, but it is not useful for the engine. The next relationship we will try is between movie popularity and ratings.
plt.close()  # Close previous plot
# Note: seaborn 0.9 renamed lmplot's `size` argument to `height`
f = sns.lmplot(x = "popularity", y = "vote_average", data = movie_data, height = 6, fit_reg=False, legend = False)
f.set(xticklabels = [])  # Remove x labels as they were overcrowded
plt.title("Popularity in Relation to Ratings")
plt.show()
Again, this graph shows no clear relationship between these two attributes. This leads me to believe that general movie data alone is not sufficient. Instead, I will use the credits and keywords files to add more information about each movie, such as the director, cast, and keywords that describe the movie. The following are function definitions that help retrieve the JSON-like data stored in the csv files.
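## Returns the names of the first three billed actors
def get_actors(x):
    list_acts = []
    for i in x:
        if i['order'] < 3:
            list_acts.append( i['name'] )
    if not list_acts:
        return ["None"]  # list("None") would split the string into single letters
    else:
        return list_acts

## Returns the name of the first crew member whose job is Director
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return "None"

## Returns the names of all keywords
def get_keywords(x):
    list_keywords = []
    for i in x:
        list_keywords.append(i['name'])
    return list_keywords

def get_collection(x):
    for i in x:
        return i['name']
    return "None"

## Returns the names of all genres, or ["None"] if there are none
def get_genres(x):
    list_genres = []
    for i in x:
        list_genres.append(i['name'])
    if not list_genres:
        return ["None"]
    else:
        return list_genres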
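The helpers above expect values that have already been parsed into Python lists. As a minimal sketch of the full round trip, the snippet below parses one stringified list of dictionaries with `ast.literal_eval` and pulls out the names. The sample string and the `names_from` helper are invented for illustration; they only mimic the shape of the `genres` column shown earlier.

```python
from ast import literal_eval

# Hypothetical cell value, shaped like an entry in the genres column
raw_cell = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def names_from(cell):
    """Parse a stringified list of dicts and return the 'name' of each entry."""
    parsed = literal_eval(cell)  # evaluates Python literals only, never arbitrary code
    return [entry["name"] for entry in parsed]

genre_names = names_from(raw_cell)
print(genre_names)
```

`literal_eval` is used rather than `json.loads` because the stored strings use single quotes, which strict JSON does not accept.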
In the following cell, I will cast some columns to the types needed for later calculations, and I will merge the keywords and credits tables into the existing movie data table.
movie_data['id'] = movie_data['id'].astype('int')
movie_data['genres'] = movie_data['genres'].apply(literal_eval)
movie_data['genres'] = movie_data['genres'].apply(get_genres)
df2['id'] = df2['id'].astype('int')
df3['id'] = df3['id'].astype('int')
df2['cast'] = df2['cast'].apply(literal_eval)
df2['crew'] = df2['crew'].apply(literal_eval)
df2['director'] = df2['crew'].apply(get_director)
df2['leads'] = df2['cast'].apply(get_actors)
df3['keywords'] = df3['keywords'].apply(literal_eval)
df3['keywordz'] = df3['keywords'].apply(get_keywords)
movie_data = movie_data.merge(df2, on='id')
movie_data = movie_data.merge(df3, on='id')
Let's take a look at our new dataset:
movie_data.head(n = 5)
| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | title | video | vote_average | vote_count | cast | crew | director | leads | keywords | keywordz |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [Animation, Comedy, Family] | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | Toy Story | False | 7.7 | 5415.0 | [{'cast_id': 14, 'character': 'Woody (voice)',... | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | John Lasseter | [Tom Hanks, Tim Allen, Don Rickles] | [{'id': 931, 'name': 'jealousy'}, {'id': 4290,... | [jealousy, toy, boy, friendship, friends, riva... |
1 | False | NaN | 65000000 | [Adventure, Fantasy, Family] | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | Jumanji | False | 6.9 | 2413.0 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | Joe Johnston | [Robin Williams, Jonathan Hyde, Kirsten Dunst] | [{'id': 10090, 'name': 'board game'}, {'id': 1... | [board game, disappearance, based on children'... |
2 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [Comedy] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | Father of the Bride Part II | False | 5.7 | 173.0 | [{'cast_id': 1, 'character': 'George Banks', '... | [{'credit_id': '52fe44959251416c75039ed7', 'de... | Charles Shyer | [Steve Martin, Diane Keaton, Martin Short] | [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n... | [baby, midlife crisis, confidence, aging, daug... |
3 | False | NaN | 60000000 | [Action, Crime, Drama, Thriller] | NaN | 949 | tt0113277 | en | Heat | Obsessive master thief, Neil McCauley leads a ... | ... | Heat | False | 7.7 | 1886.0 | [{'cast_id': 25, 'character': 'Lt. Vincent Han... | [{'credit_id': '52fe4292c3a36847f802916d', 'de... | Michael Mann | [Al Pacino, Robert De Niro, Val Kilmer] | [{'id': 642, 'name': 'robbery'}, {'id': 703, '... | [robbery, detective, bank, obsession, chase, s... |
4 | False | NaN | 58000000 | [Comedy, Romance] | NaN | 11860 | tt0114319 | en | Sabrina | An ugly duckling having undergone a remarkable... | ... | Sabrina | False | 6.2 | 141.0 | [{'cast_id': 1, 'character': 'Linus Larrabee',... | [{'credit_id': '52fe44959251416c75039da9', 'de... | Sydney Pollack | [Harrison Ford, Julia Ormond, Greg Kinnear] | [{'id': 90, 'name': 'paris'}, {'id': 380, 'nam... | [paris, brother brother relationship, chauffeu... |
5 rows × 30 columns
Now, using this data frame, I will create lists that store the data from specific columns. I will also create functions that compute the average rating for each director and each actor.
titles = list(movie_data['title'])
keywords = list(movie_data['keywordz'])
directors = list(movie_data['director'])
leads = list(movie_data['leads'])
genres = list(movie_data['genres'])
collections = list(movie_data['belongs_to_collection'])
year = list(movie_data['release_date'])
adult = list(movie_data['adult'])
ratings = list(movie_data['vote_average'])
language = list(movie_data['original_language'])
directors2 = np.array(directors)
titles2 = np.array(titles)

## Rounds a number to two decimal points
def round_two (floating):
    return float(format(floating, ".2f"))

## Remove duplicates
def remove_dups (list_ex):
    already_seen = []
    for x in range(0, len(list_ex)):
        if list_ex[x] != "None" and list_ex[x] not in already_seen:
            already_seen.append(list_ex[x])
    return already_seen

## Given an input of a directors name, output the directors average rating
def get_dir_ave (dir_name):
    indices = np.where(directors2 == dir_name)[0]
    total = 0
    for x in range(0, len(indices)):
        index = indices[x]
        total = total + ratings[index]
    return (total / len(indices))
## Create genres list
def create_genre_list ():
    gen_ave = []
    count = []
    genre_fil = []
    for x in range(0, len(genres)):
        if genres[x] != ["None"]:  # skip movies with no genre listed
            for y in range(0, len(genres[x])):
                genre = genres[x][y]
                if genre not in genre_fil:
                    genre_fil.append(genre)
                    count.append(1)
                    gen_ave.append((ratings[x]))
                else:
                    index = genre_fil.index(genre)
                    count[index] += 1
                    # Add the rating of movie x (the original indexed ratings by y, the genre position)
                    gen_ave[index] = gen_ave[index] + ratings[x]
    for x in range(0, len(gen_ave)):
        gen_ave[x] = round_two(gen_ave[x] / count[x])
    return [gen_ave, genre_fil, count]

# Creates list of all actor names and list which contains their ratings
def create_act_list ():
    act_ave = []
    count = []
    acts = []
    for x in range(0, len(leads)):
        if leads[x] != ["None"]:  # skip movies with no lead actors listed
            for y in range(0, len(leads[x])):
                act = leads[x][y]
                if act not in acts:
                    acts.append(act)
                    count.append(1)
                    act_ave.append((ratings[x]))
                else:
                    index = acts.index(act)
                    count[index] += 1
                    act_ave[index] = act_ave[index] + ratings[x]
    for x in range(0, len(act_ave)):
        act_ave[x] = round_two(act_ave[x] / count[x])
    return [act_ave, acts]

## Create list of average director ratings
def create_dir_ave ():
    dir_ave = []
    for x in range(0, len(dir_fil)):
        average = get_dir_ave(dir_fil[x])
        dir_ave.append(round_two(average))
    return dir_ave
## Define some lists to be used later on in data visualization
dir_fil = remove_dups(directors2)
list_actors_and_ave = create_act_list()
actor_average = list_actors_and_ave[0]
actors = list_actors_and_ave[1]
gen_and_ave = create_genre_list()
gen_count = gen_and_ave[2]
genres_filtered = gen_and_ave[1]
gen_averages = gen_and_ave[0]
dir_ave_rate = create_dir_ave()
mean_actor = round_two(np.array(actor_average).mean())
mean_genre = round_two(np.array(gen_averages).mean())
mean_director = round_two(np.array(dir_ave_rate).mean())
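`get_dir_ave` rescans the full director array once per director. One possible alternative, sketched here on invented sample data, accumulates totals and counts in a single pass. `average_by_key` is a hypothetical helper and not part of the original scripts.

```python
from collections import defaultdict

def average_by_key(keys, values, skip="None"):
    """Return {key: (mean of values rounded to 2 places, count)} in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        if key == skip:  # ignore films with no credited director
            continue
        totals[key] += value
        counts[key] += 1
    return {key: (round(totals[key] / counts[key], 2), counts[key]) for key in totals}

# Invented sample data for illustration only
sample_directors = ["A", "B", "A", "None", "B", "A"]
sample_ratings = [7.0, 6.0, 8.0, 5.0, 6.5, 7.5]
director_stats = average_by_key(sample_directors, sample_ratings)
print(director_stats)
```

Keeping the count next to the mean also makes it easy to separate directors with many films from those with only one or two.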
Now we will look at a sample of directors and see where each director's average movie rating sits in relation to the mean director rating. I will use a small sample, as displaying all of the directors in one graph would be very difficult to read.
sample_dir = []
sample_ave = []
for x in range(0, len(dir_fil)):
    if x % 84 == 0:
        sample_dir.append(dir_fil[x])
        sample_ave.append(dir_ave_rate[x])
sample_dir.append("Mean Rating")
sample_ave.append(mean_director)
plt.close()
f = sns.barplot(x = sample_dir, y = sample_ave)
f.set_xticklabels(sample_dir, rotation=90)
f.set_title("Directors Average Film Rating", size = 30)
f.set_ylabel("Average Film Rating", size = 15)
f.set_xlabel("Director", size = 15)
plt.show()
From this graph we can see that specific directors have significantly lower ratings than others. This provides some indication that the director of a movie affects the movie's rating. Continuing from this, let's see if directors with more movies under their belt have higher ratings than those with fewer.
sample_dir2 = []
sample_ave2 = []
for x in range(0, len(dir_fil)):
    indices = np.where(directors2 == dir_fil[x])[0]
    if x % 10 == 0 and len(indices) > 5:
        sample_dir2.append(dir_fil[x])
        sample_ave2.append(dir_ave_rate[x])
sample_dir2.append("Mean Rating")
sample_ave2.append(mean_director)
plt.close()
h = sns.barplot(x = sample_dir2, y = sample_ave2)
h.set_xticklabels(sample_dir2, rotation=90)
h.set_title("Directors Average Film Rating", size = 30)
h.set_ylabel("Average Film Rating", size = 15)
h.set_xlabel("Director", size = 15)
plt.show()
Now we can see that directors who have made more films are more likely to have a higher average rating. This could have multiple explanations: specific directors may make movies in a way that people enjoy and so earn higher ratings, or directors may simply get better the more films they make. Regardless, not only does the director affect a film's rating, but the more movies a director has made, the stronger that effect appears to be. Now we will move on to the actors.
sample_act = []
sample_ave3 = []
for x in range(0, len(actors)):
    if x % 253 == 0:
        sample_act.append(actors[x])
        sample_ave3.append(actor_average[x])
sample_act.append("Mean Rating")
sample_ave3.append(mean_actor)
plt.clf()
i = sns.barplot(x = sample_act, y = sample_ave3)
i.set_xticklabels(sample_act, rotation=90)
i.set_title("Actors Average Film Rating", size = 30)
i.set_ylabel("Average Film Rating", size = 15)
i.set_xlabel("Actor", size = 15)
plt.show()
We can see that the actors follow a similar pattern to the directors, where specific actors sit significantly above or below the mean rating. This means that, just like the directors, the actors play a significant role in the movie ratings.
Next we will look at movie genres and how they relate to the average rating using a violin plot. To create this plot I will select a few genres and generate a sample set of ratings for each genre to see how they shape up against each other. Movies usually have multiple genres, but we will refer to the first genre in the database list as the "main genre".
## Collects up to 10 ratings for a genre, sampling every 15th movie
def grab_10_data_points(genre):
    rates = []
    for x in range(0, len(genres)):
        if x % 15 == 0:
            if genre in genres[x]:
                rates.append(round_two(ratings[x]))
                if len(rates) == 10:
                    return rates
    return rates  # fewer than 10 matches found (the original returned None here)

## Collects the rating at position pos from each genre sample
def re_arrange_data(pos):
    dp_list = []
    dp_list.append(action[pos])
    dp_list.append(adventure[pos])
    dp_list.append(comedy[pos])
    dp_list.append(horror[pos])
    dp_list.append(family[pos])
    dp_list.append(mystery[pos])
    dp_list.append(romance[pos])
    return dp_list
horror = grab_10_data_points('Horror')
comedy = grab_10_data_points('Comedy')
adventure = grab_10_data_points('Adventure')
romance = grab_10_data_points('Romance')
action = grab_10_data_points('Action')
mystery = grab_10_data_points('Mystery')
family = grab_10_data_points('Family')
dp1 = re_arrange_data(0)
dp2 = re_arrange_data(1)
dp3 = re_arrange_data(2)
dp4 = re_arrange_data(3)
dp5 = re_arrange_data(4)
dp6 = re_arrange_data(5)
dp7 = re_arrange_data(6)
dp8 = re_arrange_data(7)
dp9 = re_arrange_data(8)
dp10 = re_arrange_data(9)
df_genre = pd.DataFrame(data = [dp1, dp2, dp3, dp4, dp5, dp6, dp7, dp8, dp9, dp10],
                        columns = ["Action", "Adventure", "Comedy", "Horror", "Family", "Mystery", "Romance"],
                        index = ["DP1", "DP2", "DP3", "DP4", "DP5", "DP6", "DP7", "DP8", "DP9", "DP10"])
plt.close()
j = sns.violinplot(data = df_genre)
j.set_title("Genre Film Ratings", size = 25)
j.set_ylabel("Rating", size = 15)
j.set_xlabel("Genre", size = 15)
plt.show()
This violin plot shows that each genre has its own typical rating and a high density of points around it. This indicates that genre will be key in recommending movies: since each genre has a different shape, the viewers who watch it have different preferences in terms of what they want in a movie. For example, in this graph the family genre has more movies with ratings in the range of 7 to 9 than any other genre. This suggests what kind and quality of movie the viewers of that genre want, and so we should take the genres, and specifically the main genre, of a movie into account when recommending similar ones.
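Ten sampled points per genre give only a rough picture. As a cross-check, a sketch like the following summarizes every rating for each genre. The sample lists stand in for the `genres` and `ratings` lists and are invented, and `ratings_by_genre` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean, median

def ratings_by_genre(genre_lists, rating_list):
    """Group each film's rating under every genre the film is tagged with."""
    grouped = defaultdict(list)
    for film_genres, rating in zip(genre_lists, rating_list):
        for genre in film_genres:
            if genre != "None":  # placeholder used when a film has no genre
                grouped[genre].append(rating)
    return grouped

# Invented sample data for illustration only
sample_genres = [["Family", "Comedy"], ["Horror"], ["Family"], ["Comedy"], ["Horror"]]
sample_ratings = [7.5, 5.0, 8.0, 6.0, 6.0]

grouped = ratings_by_genre(sample_genres, sample_ratings)
summary = {g: (mean(r), median(r), len(r)) for g, r in grouped.items()}
for genre, (avg, mid, n) in sorted(summary.items()):
    print(f"{genre}: mean={avg:.2f} median={mid:.2f} n={n}")
```

Reporting the count alongside the mean and median helps show whether a genre's apparent shape rests on many films or only a handful.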
Now I believe this information we have gathered from our analysis, and some extra information in the database such as category, keywords, and year released will be enough to make our recommendation engine.
Recommendation Engine
Description
This recommendation engine will consider the following attributes of each movie: directors, keywords, actors, genres, collections, years, ratings, adult, language, and index. It scores all other movies based on these categories. I believe the director should be weighted slightly heavier than the actors, as there are more possibilities for movies to share lead actors, but there is only one director. The genres and keywords should be weighted roughly the same to ensure that recommended movies are similar to the one inputted. The collection, or series, that a movie belongs to will be considered so that if a movie from a series is inputted, only sequels are recommended and not prequels. The language and adult flags are used to ensure that movies of the same language, or of an erotic nature, are recommended if the inputted movie is of that nature.
Function Definitions
To begin, we must create a function that retrieves this data based on a movie index and stores it in a list so that it can be easily accessed.
## Function retrieves data of movie based on index inputted
def retrieve_data (index):
    return [titles[index], directors[index], keywords[index], leads[index], genres[index],
            collections[index], year[index], adult[index], ratings[index], index, language[index]]
Next we must create a function which can score a movie based on the data retrieved from the previous function.
## Scores each movie based on the inputted movie (data_list1) and the movie being scored (data_list2)
def scorer(data_list1, data_list2):
    score = 0
    year1 = int(str(data_list1[6])[0:4])
    year2 = int(str(data_list2[6])[0:4])
    if data_list1[1] == data_list2[1]:  ## Same director
        indices = np.where(directors2 == data_list2[1])[0]
        if len(indices) > 5:  ## Checks if director has made several movies
            score += 70  ## Higher score if so
        else:
            score += 40  ## Lower if not
    for x in data_list2[2]:  ## Number of keywords that are the same
        if x in data_list1[2]:
            score += 30
    for x in data_list2[3]:  ## Number of actors that are the same
        if x in data_list1[3]:
            score += 30
    for x in data_list2[4]:  ## Number of genres that are the same
        if x in data_list1[4]:
            score += 20
    if "Family" in data_list1[4] and "Family" not in data_list2[4]:  ## Family input: penalize non-family movies
        score -= 75
    if "Family" not in data_list1[4] and "Family" in data_list2[4]:  ## Non-family input: penalize family movies
        score -= 75
    if data_list1[4][0] == data_list2[4][0]:  ## If main genre is the same
        score += 10
    if data_list1[5] == data_list2[5]:  ## Movies belong to the same series
        if year1 > year2:  ## Removes prequels
            score -= 300
        else:
            ## Sequels recommended in order; max(..., 1) avoids dividing by zero
            ## when both movies were released in the same year
            score += 50000 / (max(year2 - year1, 1) * 7)
    if data_list1[7] == data_list2[7]:  ## Adult nature check
        score += 100
    score += int(data_list2[8]) * 3  ## Scores movies of higher rating higher
    if data_list1[9] == data_list2[9]:  ## Remove duplicate movies
        score -= 500
    if data_list1[10] == data_list2[10]:  ## Scores movies of same language higher
        score += 100
    return score
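To make the weighting concrete, the sketch below reproduces only the list-overlap part of the score (keywords, lead actors, genres, main genre) for two invented films. The `WEIGHTS` dictionary and `overlap_score` are hypothetical names introduced for illustration; the point values are the ones used in `scorer`.

```python
# Point values taken from scorer; the dictionary itself is illustrative
WEIGHTS = {"keyword": 30, "actor": 30, "genre": 20, "main_genre": 10}

def overlap_score(film_a, film_b):
    """Score film_b against film_a using only the list-overlap terms."""
    score = 0
    score += WEIGHTS["keyword"] * sum(k in film_a["keywords"] for k in film_b["keywords"])
    score += WEIGHTS["actor"] * sum(a in film_a["leads"] for a in film_b["leads"])
    score += WEIGHTS["genre"] * sum(g in film_a["genres"] for g in film_b["genres"])
    if film_a["genres"][0] == film_b["genres"][0]:  # same main genre
        score += WEIGHTS["main_genre"]
    return score

# Invented films for illustration only
film_a = {"keywords": ["toy", "friendship"], "leads": ["Actor X", "Actor Y"],
          "genres": ["Animation", "Comedy"]}
film_b = {"keywords": ["toy", "rivalry"], "leads": ["Actor X", "Actor Z"],
          "genres": ["Animation", "Family"]}

# One shared keyword (30) + one shared actor (30) + one shared genre (20) + same main genre (10)
print(overlap_score(film_a, film_b))
```

Collecting the weights in one place makes it easier to experiment with, for example, a heavier keyword weight without editing the scoring logic.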
Finally we must create the recommendations function and the engine function.
## Scores every movie against the inputted movie and returns the ten best [index, score] pairs
def create_reccomendations(data_list1):
    scores = []
    for x in range(0, len(titles)):
        if data_list1[9] == x:
            scores.append([x, -1000])
        else:
            scores.append([x, scorer(data_list1, retrieve_data(x))])
    scores = sorted(scores, key = itemgetter(1), reverse = True)
    suggestions = []
    for x in range(0, 10):
        suggestions.append(scores[x])
    return suggestions

## Looks up a title and returns a formatted list of recommendations
def engine(title):
    string = "Your recommendations are:\n\n"
    try:
        index = titles.index(title)
        list_data = retrieve_data(index)
        scores = create_reccomendations(list_data)
        movie_suggestions = []
        for x in scores:
            movie_suggestions.append(titles[x[0]])
        for x in range(0, len(movie_suggestions)):
            y = x + 1
            string = string + str(y) + ". " + str(movie_suggestions[x]) + "\n"
        return string
    except ValueError:
        string = "There is no movie of that name located in the database."
        return string
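`create_reccomendations` sorts every score and then keeps the first ten. For a large catalogue, one alternative is `heapq.nlargest` from the standard library, which returns the top entries without a full sort. The scores below are invented and `top_n` is a hypothetical helper, not part of the original scripts.

```python
import heapq
from operator import itemgetter

def top_n(scored, n=10):
    """Return the n [index, score] pairs with the highest scores, best first."""
    return heapq.nlargest(n, scored, key=itemgetter(1))

# Invented [index, score] pairs for illustration only
sample_scores = [[0, 120], [1, -1000], [2, 310], [3, 95], [4, 310]]
best = top_n(sample_scores, n=3)
print(best)
```

If fewer than `n` movies are available, `nlargest` simply returns all of them, so no separate length check is needed.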
Now our Movie Recommendation Engine is complete! Let's test it.
print(engine('Cars'))
Your recommendations are:
- Cars 2
- Cars 3
- Toy Story 2
- A Bug's Life
- Brother Bear
- The Pirates! In an Adventure with Scientists!
- Toy Story
- Wreck-It Ralph
- The Lego Movie
- Kung Fu Panda
print(engine('The Hunger Games'))
Your recommendations are:
- The Hunger Games: Catching Fire
- The Hunger Games: Mockingjay - Part 1
- The Hunger Games: Mockingjay - Part 2
- Divergent
- Cirque du Freak: The Vampire's Assistant
- Allegiant
- The 5th Wave
- Pleasantville
- The Time Machine
- Final Fantasy: The Spirits Within
print(engine('asdsarasdsa'))
There is no movie of that name located in the database.
print(engine('Deadpool'))
Your recommendations are:
- Captain America: The Winter Soldier
- Doctor Strange
- Thor: The Dark World
- The Avengers
- Ant-Man
- Guardians of the Galaxy Vol. 2
- Captain America: Civil War
- Iron Man
- Avengers: Age of Ultron
- Captain America: The First Avenger
print(engine('The Wolf of Wall Street'))
Your recommendations are:
- War Dogs
- The Aviator
- GoodFellas
- Gangs of New York
- The Departed
- The Big Short
- Taxi Driver
- Raging Bull
- Shutter Island
- Bringing Out the Dead
Conclusion
There we have it: our engine is running and appears to output a specialized recommendation list that closely resembles the movie that is inputted.