yashkp1234/Movie-Recommendation-Engine

Movie-Recommendation-Engine

My project on analyzing the movie data set, and creating a recommendation engine using that analysis.

The scripts folder contains all the raw Python files used for the analysis. Please download the Data-set rar file and extract it into the same folder as the scripts so that they can find the data.

Introduction and Analysis

Introduction

The dataset I will be using is hosted on Kaggle and is called “The Movies Dataset”. It contains over 45,000 movies with information about ratings, keywords, directors, cast, and much more. I would like to create a recommendation engine that suggests similar movies based on the movie a user inputs.

Assessing What Makes a Movie Good

To do this, I first have to define what indicates that a user will like a certain movie and, even more simply, what I can use to measure how much a user will like it. This means that before I can create the engine, I must analyze which pieces of information are good factors to consider when recommending movies.

Filtering Data

The average rating of a movie is the metric I will use to measure how “good” a movie is. Any factor that shows evidence of increasing a movie's average rating will be treated as important when building the engine. To remove outliers and data that would make the analysis inaccurate, we filter the movies, considering only those that have been rated by at least 100 people and have a runtime between 40 and 180 minutes.

In [1]:
## Import Packages
import pandas as pd   #importing all the important packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from ast import literal_eval
from operator import itemgetter
pd.options.mode.chained_assignment = None

## Define data frames
df = pd.read_csv('movies_metadata.csv', encoding = 'UTF-8', low_memory=False)
df2 = pd.read_csv('credits.csv', encoding = 'UTF-8', low_memory=False)
df3 = pd.read_csv('keywords.csv', encoding = 'UTF-8', low_memory=False)

## Filter Movies according to definition
df_fil = df[df["vote_count"] >= 100]
df_fil2 = df_fil[df_fil["runtime"] >= 40]
movie_data = df_fil2[df_fil2["runtime"] <= 180]
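The three chained filters above can also be expressed as one boolean mask. The following is a minimal sketch on a made-up frame (the `toy` data and the `mask` and `toy_filtered` names are illustrative, not part of the project), assuming the same thresholds of 100 votes and 40 to 180 minutes:

```python
import pandas as pd

# Toy stand-in for movies_metadata.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_count": [250, 40, 1200, 100],
    "runtime": [95, 110, 240, 40],
})

# One combined mask: at least 100 votes and a runtime of 40 to 180 minutes.
# Series.between is inclusive on both ends by default, matching >= and <=.
mask = (toy["vote_count"] >= 100) & toy["runtime"].between(40, 180)
toy_filtered = toy[mask]

print(toy_filtered["title"].tolist())
```

Movie B is dropped for having too few votes and movie C for being too long, while D sits exactly on both boundaries and is kept.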

Let's see what this looks like.

In [2]:
movie_data.head(n = 5)
Out[2]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [{'id': 35, 'name': 'Comedy'}] NaN 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... 1995-02-10 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173.0
5 False NaN 60000000 [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam... NaN 949 tt0113277 en Heat Obsessive master thief, Neil McCauley leads a ... ... 1995-12-15 187436818.0 170.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released A Los Angeles Crime Saga Heat False 7.7 1886.0
6 False NaN 58000000 [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '... NaN 11860 tt0114319 en Sabrina An ugly duckling having undergone a remarkable... ... 1995-12-15 0.0 127.0 [{'iso_639_1': 'fr', 'name': 'Français'}, {'is... Released You are cordially invited to the most surprisi... Sabrina False 6.2 141.0

5 rows × 24 columns

Graphing and Analysis

Now that we have filtered our dataset, we can try to see what information correlates with the average user rating. To visualize this we will use the Seaborn and Matplotlib libraries. First, we will start with a graph of runtime in relation to rating.

In [3]:
## Graph Runtime vs Ratings
g = sns.lmplot(x = "runtime", y = "vote_average", data = movie_data,
           height = 6, palette = "YlOrRd", hue = "runtime", fit_reg=False,
           legend = False)  # seaborn 0.9 renamed the "size" argument to "height"
plt.title("Runtime in Relation to Ratings")
plt.show()

As we can see, there is no clear relationship between the runtime and the average rating. There is just a mass of points between 80 and 140 minutes, which makes sense as most movies are around that length, but it is not useful for the engine. The next relationship we will try is between movie popularity and ratings.

In [4]:
plt.close() # Close previous plot
f = sns.lmplot(x = "popularity", y = "vote_average", data = movie_data,
               height = 6, fit_reg=False, legend = False)  # "height" replaces the older "size" argument
f.set(xticklabels = []) # Remove x labels as they were over crowded
plt.title("Popularity in Relation to Ratings")
plt.show()
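One way to put a number on what these scatter plots show is a correlation coefficient. The sketch below uses invented values, and the `toy` and `corr` names are illustrative. It also assumes that `popularity` might have been read as text, in which case it needs to be converted to numbers first:

```python
import pandas as pd

# Invented values standing in for a few rows of movie_data.
toy = pd.DataFrame({
    "runtime": [90, 100, 110, 120, 130],
    "popularity": ["3.2", "8.1", "1.5", "9.9", "4.4"],  # stored as text on purpose
    "vote_average": [6.5, 7.1, 5.9, 7.0, 6.4],
})

# Convert text to numbers; anything unparseable becomes NaN instead of raising.
toy["popularity"] = pd.to_numeric(toy["popularity"], errors="coerce")

# Pearson correlation of each candidate column with the average rating.
corr = toy[["runtime", "popularity"]].corrwith(toy["vote_average"])
print(corr.round(2))
```

Values near 0 would agree with the "no clear relationship" reading of a plot, while values near 1 or -1 would point to a strong linear trend.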

Again, this graph shows no clear relationship between these two attributes. This leads me to believe that using general movie data is not sufficient. Instead, I will use the credits and keywords files so that we can add more information about each movie, such as the director, cast, and keywords that describe the movie. The following are function definitions that help retrieve the JSON-like data stored in the csv files.

In [5]:
def get_actors(x):
    list_acts = []
    for i in x:
        if i['order'] < 3:   # keep only the top three billed actors
            list_acts.append( i['name'] )
    if not list_acts:
        return ["None"]      # list("None") would split the string into single letters
    else:
        return list_acts

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return "None"

def get_keywords(x):
    list_keywords = []
    for i in x:
        list_keywords.append(i['name'])
    return list_keywords

def get_collection(x):
    for i in x:
        return i['name']
    return "None"

def get_genres(x):
    list_genres = []
    for i in x:
        list_genres.append(i['name'])
    if not list_genres:
        return ["None"]
    else:
        return list_genres
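To show why `literal_eval` is used for these columns, here is a small sketch. The `raw` string is a made-up cell in the same shape as the `genres` column (a Python-style list of dictionaries with single quotes), and `parsed` and `names` are illustrative names:

```python
import json
from ast import literal_eval

# A made-up cell value in the same shape as the genres column.
raw = "[{'id': 1, 'name': 'Drama'}, {'id': 2, 'name': 'Comedy'}]"

# The single quotes make this invalid JSON, so json.loads rejects it.
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# literal_eval safely evaluates it as a Python literal instead.
parsed = literal_eval(raw)
names = [entry["name"] for entry in parsed]

print(json_ok)
print(names)
```

Once the string has been turned into a list of dictionaries, helper functions such as `get_genres` only need to pull out the `name` field of each entry.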

In the following input, I will convert some columns in the table to the right types so that they can be used for different calculations, and I will merge the keywords and credits tables into the existing movie data table.

In [6]:
movie_data['id'] = movie_data['id'].astype('int')
movie_data['genres'] = movie_data['genres'].apply(literal_eval)
movie_data['genres'] = movie_data['genres'].apply(get_genres)
df2['id'] = df2['id'].astype('int')
df3['id'] = df3['id'].astype('int')
df2['cast'] = df2['cast'].apply(literal_eval)
df2['crew'] = df2['crew'].apply(literal_eval)
df2['director'] = df2['crew'].apply(get_director)
df2['leads'] = df2['cast'].apply(get_actors)
df3['keywords'] = df3['keywords'].apply(literal_eval)
df3['keywordz'] = df3['keywords'].apply(get_keywords)
movie_data = movie_data.merge(df2, on='id')
movie_data = movie_data.merge(df3, on='id')

Let's take a look at our new dataset.

In [7]:
movie_data.head(n = 5)
Out[7]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... title video vote_average vote_count cast crew director leads keywords keywordz
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [Animation, Comedy, Family] http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... Toy Story False 7.7 5415.0 [{'cast_id': 14, 'character': 'Woody (voice)',... [{'credit_id': '52fe4284c3a36847f8024f49', 'de... John Lasseter [Tom Hanks, Tim Allen, Don Rickles] [{'id': 931, 'name': 'jealousy'}, {'id': 4290,... [jealousy, toy, boy, friendship, friends, riva...
1 False NaN 65000000 [Adventure, Fantasy, Family] NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... Jumanji False 6.9 2413.0 [{'cast_id': 1, 'character': 'Alan Parrish', '... [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... Joe Johnston [Robin Williams, Jonathan Hyde, Kirsten Dunst] [{'id': 10090, 'name': 'board game'}, {'id': 1... [board game, disappearance, based on children'...
2 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [Comedy] NaN 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... Father of the Bride Part II False 5.7 173.0 [{'cast_id': 1, 'character': 'George Banks', '... [{'credit_id': '52fe44959251416c75039ed7', 'de... Charles Shyer [Steve Martin, Diane Keaton, Martin Short] [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n... [baby, midlife crisis, confidence, aging, daug...
3 False NaN 60000000 [Action, Crime, Drama, Thriller] NaN 949 tt0113277 en Heat Obsessive master thief, Neil McCauley leads a ... ... Heat False 7.7 1886.0 [{'cast_id': 25, 'character': 'Lt. Vincent Han... [{'credit_id': '52fe4292c3a36847f802916d', 'de... Michael Mann [Al Pacino, Robert De Niro, Val Kilmer] [{'id': 642, 'name': 'robbery'}, {'id': 703, '... [robbery, detective, bank, obsession, chase, s...
4 False NaN 58000000 [Comedy, Romance] NaN 11860 tt0114319 en Sabrina An ugly duckling having undergone a remarkable... ... Sabrina False 6.2 141.0 [{'cast_id': 1, 'character': 'Linus Larrabee',... [{'credit_id': '52fe44959251416c75039da9', 'de... Sydney Pollack [Harrison Ford, Julia Ormond, Greg Kinnear] [{'id': 90, 'name': 'paris'}, {'id': 380, 'nam... [paris, brother brother relationship, chauffeu...

5 rows × 30 columns

Now, using this data frame, I will create lists that store the data from specific columns. I will also create functions that compute the average rating for each director and each actor.

In [66]:
titles = list(movie_data['title'])
keywords = list(movie_data['keywordz'])
directors = list(movie_data['director'])
leads = list(movie_data['leads'])
genres = list(movie_data['genres'])
collections = list(movie_data['belongs_to_collection'])
year = list(movie_data['release_date'])
adult = list(movie_data['adult'])
ratings = list(movie_data['vote_average'])
language = list(movie_data['original_language'])
directors2 = np.array(directors)
titles2 = np.array(titles)

## Rounds a number to two decimal points
def round_two (floating):
    return float(format(floating, ".2f"))

## Remove duplicates
def remove_dups (list_ex):
    already_seen = []
    for x in range(0, len(list_ex)):
        if list_ex[x] != "None" and list_ex[x] not in already_seen:
            already_seen.append(list_ex[x])
    return already_seen

## Given an input of a directors name, output the directors average rating
def get_dir_ave (dir_name):
    indices = np.where(directors2 == dir_name)[0]
    total = 0
    for x in range(0, len(indices)):
        index = indices[x]
        total = total + ratings[index]
    return (total / len(indices))

## Create genres list
def create_genre_list ():
    gen_ave = []
    count = []
    genre_fil = []
    for x in range(0, len(genres)):
        if genres[x] != ["None"]:   # skip movies without genre data
            for y in range(0, len(genres[x])):
                genre = genres[x][y]
                if genre not in genre_fil:
                    genre_fil.append(genre)
                    count.append(1)
                    gen_ave.append((ratings[x]))
                else:
                    index = genre_fil.index(genre)
                    count[index] += 1
                    # add the rating of movie x (indexing with y picked the wrong movie)
                    gen_ave[index] = gen_ave[index] + ratings[x]
    for x in range(0, len(gen_ave)):
        gen_ave[x] = round_two(gen_ave[x] / count[x])
    return [gen_ave, genre_fil, count]

# Creates list of all actor names and list which contains their ratings
def create_act_list ():
    act_ave = []
    count = []
    acts = []
    for x in range(0, len(leads)):
        if leads[x] != ["None"]:    # skip movies without cast data
            for y in range(0, len(leads[x])):
                act = leads[x][y]
                if act not in acts:
                    acts.append(act)
                    count.append(1)
                    act_ave.append((ratings[x]))
                else:
                    index = acts.index(act)
                    count[index] += 1
                    act_ave[index] = act_ave[index] + ratings[x]
    for x in range(0, len(act_ave)):
        act_ave[x] = round_two(act_ave[x] / count[x])
    return [act_ave, acts]

## Create list of average director ratings
def create_dir_ave ():
    dir_ave = []
    for x in range(0, len(dir_fil)):
        average = get_dir_ave(dir_fil[x])
        dir_ave.append(round_two(average))
    return dir_ave

## Define some lists to be used later on in data visualization
dir_fil = remove_dups(directors2)
list_actors_and_ave = create_act_list()
actor_average = list_actors_and_ave[0]
actors = list_actors_and_ave[1]
gen_and_ave = create_genre_list()
gen_count = gen_and_ave[2]
genres_filtered = gen_and_ave[1]
gen_averages = gen_and_ave[0]
dir_ave_rate = create_dir_ave()
mean_actor = round_two(np.array(actor_average).mean())
mean_genre = round_two(np.array(gen_averages).mean())
mean_director = round_two(np.array(dir_ave_rate).mean())
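The per-director averages can also be computed with a pandas `groupby`, which may be easier to check by eye than the index-based loops. This is only a sketch on invented rows. The director names and the `toy`, `known` and `stats` variables are hypothetical:

```python
import pandas as pd

# Invented rows standing in for the director and vote_average columns.
toy = pd.DataFrame({
    "director": ["Ann Lee", "Ann Lee", "Bo Kim", "None", "Bo Kim"],
    "vote_average": [7.0, 8.0, 6.0, 5.0, 6.5],
})

# Drop the "None" placeholder, as remove_dups does, then aggregate per director.
known = toy[toy["director"] != "None"]
stats = known.groupby("director")["vote_average"].agg(["mean", "count"])

print(stats)
```

The `count` column doubles as the number of films per director, which is the quantity the later "more than 5 films" comparison relies on.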

Now we will look at a sample of directors and see where each individual director's average movie rating sits in relation to the mean director rating. I will use a small sample of the directors, as displaying all of them in one graph would be very difficult to read.

In [9]:
sample_dir = []
sample_ave = []

for x in range(0, len(dir_fil)):
    if x % 84 == 0:
        sample_dir.append(dir_fil[x])
        sample_ave.append(dir_ave_rate[x])

sample_dir.append("Mean Rating")
sample_ave.append(mean_director)

plt.close()
f = sns.barplot(x = sample_dir, y = sample_ave)
f.set_xticklabels(sample_dir, rotation=90)
f.set_title("Directors Average Film Rating", size = 30)
f.set_ylabel("Average Film Rating", size = 15)
f.set_xlabel("Director", size = 15)
plt.show()

From this graph we can see that specific directors have significantly lower ratings than others. This provides some indication that the director of a movie affects the movie's rating. Continuing on from this, let’s see if directors with more movies under their belt have higher ratings than those who do not.

In [10]:
sample_dir2 = []
sample_ave2 = []

for x in range(0, len(dir_fil)):
    indices = np.where(directors2 == dir_fil[x])[0]
    if x % 10 == 0 and len(indices) > 5:
        sample_dir2.append(dir_fil[x])
        sample_ave2.append(dir_ave_rate[x])

sample_dir2.append("Mean Rating")
sample_ave2.append(mean_director)

plt.close()
h = sns.barplot(x = sample_dir2, y = sample_ave2)
h.set_xticklabels(sample_dir2, rotation=90)
h.set_title("Directors Average Film Rating", size = 30)
h.set_ylabel("Average Film Rating", size = 15)
h.set_xlabel("Director", size = 15)
plt.show()

Now we can see that directors who have made more films are more likely to have a higher average rating. This could have multiple explanations: specific directors may produce movies in a way that people enjoy and thus earn higher ratings, or directors may get better the more films they make. Regardless, not only does the director affect a film's rating, but the more movies a director has made, the stronger that effect appears to be. Now we will move on to the actors.

In [65]:
sample_act = []
sample_ave3 = []

for x in range(0, len(actors)):
    if x % 253 == 0:
        sample_act.append(actors[x])
        sample_ave3.append(actor_average[x])

sample_act.append("Mean Rating")
sample_ave3.append(mean_actor)

plt.clf()
i = sns.barplot(x = sample_act, y = sample_ave3)
i.set_xticklabels(sample_act, rotation=90)
i.set_title("Actors Average Film Rating", size = 30)
i.set_ylabel("Average Film Rating", size = 15)
i.set_xlabel("Actor", size = 15)
plt.show()

We can see that the actors follow a similar pattern to the directors, where specific actors are significantly above or below the mean rating. This suggests that, just like the directors, the actors play a significant role in the movie ratings.

Next we will look at movie genres and how they relate to the average rating using a violin plot. To create this plot I will select a few genres and generate a sample set of ratings for each genre to see how they shape up against each other. Movies usually have multiple genres, but we will refer to the first genre in the database list as the "main genre".

In [198]:
def grab_10_data_points(genre):
    rates = []
    for x in range(0, len(genres)):
        if x % 15 == 0:
            if genre in genres[x]:
                rates.append(round_two(ratings[x]))
            if len(rates) == 10:
                return rates

def re_arrange_data(pos):
    dp_list = []
    dp_list.append(action[pos])
    dp_list.append(adventure[pos])
    dp_list.append(comedy[pos])
    dp_list.append(horror[pos])
    dp_list.append(family[pos])
    dp_list.append(mystery[pos])
    dp_list.append(romance[pos])
    return dp_list

horror = grab_10_data_points('Horror')
comedy = grab_10_data_points('Comedy')
adventure = grab_10_data_points('Adventure')
romance = grab_10_data_points('Romance')
action = grab_10_data_points('Action')
mystery = grab_10_data_points('Mystery')
family = grab_10_data_points('Family')

dp1 = re_arrange_data(0)
dp2 = re_arrange_data(1)
dp3 = re_arrange_data(2)
dp4 = re_arrange_data(3)
dp5 = re_arrange_data(4)
dp6 = re_arrange_data(5)
dp7 = re_arrange_data(6)
dp8 = re_arrange_data(7)
dp9 = re_arrange_data(8)
dp10 = re_arrange_data(9)

df_genre = pd.DataFrame(data = [dp1, dp2, dp3, dp4, dp5, dp6, dp7, dp8, dp9, dp10],
                        columns = ["Action", "Adventure", "Comedy", "Horror", "Family", "Mystery", "Romance"],
                        index = ["DP1", "DP2", "DP3", "DP4", "DP5", "DP6", "DP7", "DP8", "DP9", "DP10"])
plt.close()
j = sns.violinplot(data = df_genre)
j.set_title("Genre Film Ratings", size = 25)
j.set_ylabel("Rating", size = 15)
j.set_xlabel("Genre", size = 15)
plt.show()

This violin plot helps us see that each genre has its own typical rating, with a high density of points around it. This indicates that genre will be key in recommending movies: since each genre has a different shape, the viewers who watch it have different preferences in terms of what they want in a movie. For example, in this graph the family genre has more movies with ratings in the range of 7 - 9 than any other genre. This shows what kind and quality of movie the viewers of that genre want, and as such we should take into account the genres, and specifically the main genre, of a movie when recommending similar ones.
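The "main genre" idea can be made concrete by taking the first entry of each genre list and summarizing ratings per main genre. The sketch below runs on invented rows, and the `main_genre` column and `summary` table are illustrative names rather than part of the project:

```python
import pandas as pd

# Invented rows; genres is already a list of names, as after get_genres.
toy = pd.DataFrame({
    "genres": [["Family", "Comedy"], ["Horror"], ["Family"], ["Horror", "Mystery"]],
    "vote_average": [7.5, 5.0, 8.0, 6.0],
})

# The first genre in each list is treated as the main genre.
toy["main_genre"] = toy["genres"].apply(lambda g: g[0])

# Per-genre count, mean and range of ratings.
summary = toy.groupby("main_genre")["vote_average"].agg(["count", "mean", "min", "max"])
print(summary)
```

A table like this gives the center and spread that the violin shapes display, without being limited to ten sampled points per genre.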

I believe the information gathered from this analysis, together with some extra information in the database such as category, keywords, and year released, will be enough to make our recommendation engine.

Recommendation Engine

Description

This recommendation engine will consider the following for each movie: director, keywords, actors, genres, collection, year, rating, adult flag, language, and index. It will score all other movies based on these categories. I believe the director should be weighted slightly more heavily than the actors, as there are more opportunities for movies to share lead actors, but there is only one director. The genres and keywords should be weighted roughly the same to ensure that the recommended movies are similar to the one inputted. The collection, or series, a movie belongs to is considered so that if a movie from a certain series is inputted, only its sequels are recommended and not its prequels. The language and adult fields are used so that movies in the same language, or of an adult nature, are recommended when the inputted movie is of that kind.
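The weighting idea described here can be sketched as a small table of weights plus an overlap count. The `WEIGHTS` dictionary, the `overlap_score` function, and the two toy movies are hypothetical. The numbers mirror the per-match values used in `scorer` (40 for a shared director, 30 per shared keyword or actor, 20 per shared genre) and leave out the other adjustments:

```python
# Hypothetical weight table; one entry per kind of match.
WEIGHTS = {"director": 40, "keyword": 30, "actor": 30, "genre": 20}

def overlap_score(movie_a, movie_b, weights=WEIGHTS):
    """Score movie_b against movie_a by counting shared attributes."""
    score = 0
    if movie_a["director"] == movie_b["director"]:
        score += weights["director"]
    for field, key in (("keywords", "keyword"), ("leads", "actor"), ("genres", "genre")):
        shared = set(movie_a[field]) & set(movie_b[field])
        score += weights[key] * len(shared)
    return score

# Two invented movies sharing a director, one keyword, one actor and one genre.
movie_a = {"director": "X", "keywords": ["toy", "friendship"],
           "leads": ["P", "Q"], "genres": ["Animation", "Family"]}
movie_b = {"director": "X", "keywords": ["toy"],
           "leads": ["Q", "R"], "genres": ["Family"]}

print(overlap_score(movie_a, movie_b))  # 40 + 30 + 30 + 20 = 120
```

Keeping the weights in one place makes it easier to see, and to tune, how much each category contributes relative to the others.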

Function Definitions

To begin, we must create a function that retrieves this data for a given movie index and stores it in a list so that it can be easily accessed.

In [173]:
## Function retrieves data of movie based on index inputted
def retrieve_data (index):
    return [titles[index], directors[index], keywords[index],
            leads[index], genres[index], collections[index],
            year[index], adult[index], ratings[index], index, language[index]]
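Because `retrieve_data` returns a plain list, later code has to remember that position 1 is the director, position 6 the release date, and so on. As a possible alternative, the same eleven fields could be given names with a `NamedTuple`. The `MovieRecord` class and the sample values below are hypothetical:

```python
from typing import List, NamedTuple

class MovieRecord(NamedTuple):
    """Same field order as the list returned by retrieve_data."""
    title: str
    director: str
    keywords: List[str]
    leads: List[str]
    genres: List[str]
    collection: object
    release_date: str
    adult: object
    rating: float
    index: int
    language: str

# Invented example record.
record = MovieRecord("Example Film", "Ann Lee", ["robot"], ["P. Actor"],
                     ["Drama"], None, "2001-05-04", False, 6.8, 0, "en")

# Positional access still works, so list-style code keeps running.
print(record[1] == record.director)
print(int(record.release_date[0:4]))
```

Named fields keep the positional layout intact while making expressions such as `record.genres` easier to read than `data_list1[4]`.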

Next we must create a function which can score a movie based on the data retrieved from the previous function.

In [174]:
## Scores a candidate movie (data_list2) against the inputted movie (data_list1)
def scorer(data_list1, data_list2):
    score = 0
    year1 = int(str(data_list1[6])[0:4])
    year2 = int(str(data_list2[6])[0:4])
    if data_list1[1] == data_list2[1]: ## Same director
        indices = np.where(directors2 == data_list2[1])[0]
        if len(indices) > 5: ## Checks if director has made several movies
            score += 70 ## Higher score if so
        else:
            score += 40 ## Lower if not
    for x in data_list2[2]: ## Number of keywords that are the same
        if x in data_list1[2]: 
            score += 30
    for x in data_list2[3]: ## Number of actors that are the same
        if x in data_list1[3]:
            score += 30
    for x in data_list2[4]: ## Number of genres that are the same
        if x in data_list1[4]:
            score += 20
    if "Family" in data_list1[4] and "Family" not in data_list2[4]: ## Penalize non-family movies for a family input
        score -= 75
    if "Family" not in data_list1[4] and "Family" in data_list2[4]: ## Penalize family movies for a non-family input
        score -= 75
    if data_list1[4][0] == data_list2[4][0]: ## If main genre is the same
        score += 10
    if data_list1[5] == data_list2[5]: ## Movies belong to the same series
        if year1 > year2: ## Removes prequels
            score -= 300
        else:
            score += 50000 / ( max(year2 - year1, 1) * 7) ## Sequels recommended in order; max() avoids dividing by zero for same-year releases
    if  data_list1[7] == data_list2[7]: ## Adult nature check
        score += 100
    score += int(data_list2[8]) * 3 ## Scores movies of higher rating higher
    if data_list1[9] == data_list2[9]: ## Remove duplicate movies
        score -= 500
    if data_list1[10] == data_list2[10]: ## Scores movies of same language higher
        score += 100
    return score
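To see how strongly the collection term steers the ranking, it helps to evaluate the sequel formula `50000 / ((year2 - year1) * 7)` for a few year gaps. The `sequel_bonus` helper below is a hypothetical stand-alone version that only covers candidates released at least one year after the input movie:

```python
def sequel_bonus(gap_years):
    """Bonus for a same-collection movie released gap_years after the input."""
    if gap_years < 1:
        raise ValueError("this sketch only covers later releases")
    return 50000 / (gap_years * 7)

# Bonus for a few illustrative gaps, rounded for display.
for gap in (1, 2, 5, 10):
    print(gap, round(sequel_bonus(gap), 1))
```

Even at a ten-year gap the bonus is about 714, far above the tens to hundreds of points contributed by the other terms. This is consistent with sequels leading the recommendation lists, ordered by how soon they followed the input movie.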

Finally we must create the recommendations function and the engine function.

In [175]:
def create_reccomendations(data_list1):
    scores = []
    for x in range(0, len(titles)):
        if data_list1[9] == x:
            scores.append([x, -1000])
        else:
            scores.append([x, scorer(data_list1, retrieve_data(x))])
    scores = sorted(scores, key = itemgetter(1), reverse = True)
    suggestions = []
    for x in range(0, 10):
        suggestions.append(scores[x])
    return suggestions

def engine(title):
    string = "Your recommendations are:\n\n"
    try:
        index = titles.index(title)
        list_data = retrieve_data(index)
        scores = create_reccomendations(list_data)
        movie_suggestions = []
        for x in scores:
            movie_suggestions.append(titles[x[0]])
        for x in range(0, len(movie_suggestions)):
            y = x + 1
            string = string + str(y) + ". " + str(movie_suggestions[x]) + "\n"
        return string
    except ValueError:
        string = "There is no movie of that name located in the database."
        return string
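Since only the ten best scores are needed, the sort-then-slice step in `create_reccomendations` could also be written with `heapq.nlargest` from the standard library. The sketch below uses an invented `scores` list of `[index, score]` pairs and picks the top three:

```python
import heapq
from operator import itemgetter

# Invented [movie_index, score] pairs in the same shape as the scores list.
scores = [[0, 120], [1, -1000], [2, 340], [3, 95], [4, 340]]

# Keep the three highest-scoring entries without sorting the whole list.
top3 = heapq.nlargest(3, scores, key=itemgetter(1))

# Same result as sorting everything in descending order and slicing.
same_as_sorted = top3 == sorted(scores, key=itemgetter(1), reverse=True)[:3]

print(top3)
print(same_as_sorted)
```

For a few thousand movies the difference in speed is unlikely to matter, so this is mainly a readability choice.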

Now our Movie Recommendation Engine is complete! Let's test it.

In [176]:
print(engine('Cars'))
Your recommendations are:
  1. Cars 2
  2. Cars 3
  3. Toy Story 2
  4. A Bug's Life
  5. Brother Bear
  6. The Pirates! In an Adventure with Scientists!
  7. Toy Story
  8. Wreck-It Ralph
  9. The Lego Movie
  10. Kung Fu Panda
In [177]:
print(engine('The Hunger Games'))
Your recommendations are:
  1. The Hunger Games: Catching Fire
  2. The Hunger Games: Mockingjay - Part 1
  3. The Hunger Games: Mockingjay - Part 2
  4. Divergent
  5. Cirque du Freak: The Vampire's Assistant
  6. Allegiant
  7. The 5th Wave
  8. Pleasantville
  9. The Time Machine
  10. Final Fantasy: The Spirits Within
In [178]:
print(engine('asdsarasdsa'))
There is no movie of that name located in the database.
In [179]:
print(engine('Deadpool'))
Your recommendations are:
  1. Captain America: The Winter Soldier
  2. Doctor Strange
  3. Thor: The Dark World
  4. The Avengers
  5. Ant-Man
  6. Guardians of the Galaxy Vol. 2
  7. Captain America: Civil War
  8. Iron Man
  9. Avengers: Age of Ultron
  10. Captain America: The First Avenger
In [180]:
print(engine('The Wolf of Wall Street'))
Your recommendations are:
  1. War Dogs
  2. The Aviator
  3. GoodFellas
  4. Gangs of New York
  5. The Departed
  6. The Big Short
  7. Taxi Driver
  8. Raging Bull
  9. Shutter Island
  10. Bringing Out the Dead

Conclusion

There we have it: our engine is running and appears to output a specialized recommendation list made up of movies similar to the one inputted.
