# Movies Recommender System

In this notebook, I will implement a few recommendation algorithms (content-based, popularity-based and collaborative filtering) and try to build an ensemble of these models to arrive at a final recommendation system. Two MovieLens datasets are available:

* **The Full Dataset:** Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
* **The Small Dataset:** Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

I will build the Simple Recommender using movies from the *Full Dataset*, whereas all personalised recommender systems will use the *Small Dataset* (because the computing power available to me is very limited). As a first step, I will build the Simple Recommender.

In [241]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import KFold
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

## (1) Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea is that movies that are more popular and more critically acclaimed have a higher probability of being liked by the average audience. This model does not give personalized recommendations.

The implementation is straightforward: sort the movies by rating and popularity and display the top of the list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

In [242]:

md = pd.read_csv('C:/Users/sb/Desktop/input/movies_metadata.csv')

In [243]:
# Cleaning the genre field: keep only the genre names, as a list
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# Cleaning the country field: keep only the production country names, as a list
md['countries'] = md['production_countries'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# Cleaning the release date field: keep only the year (NaN when the date is missing or cannot be parsed)
# pd.notnull is used because a comparison such as x != np.nan is always True
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

# Counting the null values in each column
null_values = {}
for col in md.columns:
    null_values[col] = md[col].isna().sum()
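
The list-parsing step in the cell above can be tried on a single value. The sketch below is only an illustration: the helper name `parse_names` and the sample string are assumptions, shaped like the stringified lists the cleaning code expects, and the helper returns an empty list for anything it cannot parse.

```python
from ast import literal_eval

# Hypothetical raw cell, shaped like the stringified lists the cleaning code expects
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def parse_names(raw):
    """Return the 'name' of every dict in a stringified list, or [] if unusable."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(parsed, list):
        return []
    return [item['name'] for item in parsed if isinstance(item, dict) and 'name' in item]

print(parse_names(raw_genres))          # names extracted from the sample
print(parse_names('[]'))                # empty list stays empty
print(parse_names('False'))             # valid literal, but not a list
print(parse_names('not valid python'))  # unparsable text
```

Catching the parse errors is a design choice for the sketch; the notebook itself lets `literal_eval` raise on malformed input.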

I will use IMDB's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean rating across all movies in the dataset

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use the **95th percentile** as our cutoff. In other words, for a movie to feature in the chart, it must have at least as many votes as 95% of the movies in the list.
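
To make the cutoff concrete, here is a minimal pure-Python sketch of a percentile with linear interpolation between neighbouring sorted values, which is the default behaviour of pandas' `quantile`. The function name `percentile_cutoff` and the vote counts are hypothetical.

```python
# Illustrative sketch: a percentile cutoff with linear interpolation between
# neighbouring sorted values (the default behaviour of pandas' quantile).
def percentile_cutoff(values, q):
    ordered = sorted(values)
    position = (len(ordered) - 1) * q      # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction

# Hypothetical vote counts: 100 movies with 1, 2, ..., 100 votes
vote_counts_demo = list(range(1, 101))
m_demo = percentile_cutoff(vote_counts_demo, 0.95)

# Only movies with at least m_demo votes qualify
qualified_demo = [v for v in vote_counts_demo if v >= m_demo]
print(m_demo, qualified_demo)
```

With these hypothetical counts, the cutoff falls between the 95th and 96th values, so only the five most-voted movies qualify.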

In [244]:
# Getting the values for v, m and C
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
# Note: casting to int truncates the averages (for example, 7.9 becomes 7)
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()

# m is the 95th percentile of the vote counts
m = vote_counts.quantile(0.95)

# Selecting the films with at least m votes and a non-null score, and storing them in the qualified set
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']].copy()
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')

Therefore, to qualify for the chart, a movie must have at least **434 votes** on TMDB. We also see that the average rating for a movie on TMDB is **5.244** on a scale of 10. **2274** movies qualify for our chart.
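
As a quick sanity check on the formula, the sketch below plugs the cutoff and mean quoted above into plain Python. The function name `weighted_rating_value` and its explicit `m` and `C` arguments are illustrative assumptions; the vote figures for Inception are the ones shown in this notebook's chart.

```python
# Illustrative sketch: the IMDB-style weighted rating as a standalone function.
# m and C default to the rounded values quoted in the text (434 votes, mean 5.244).
def weighted_rating_value(v, R, m=434, C=5.244):
    """Blend a movie's own average R with the global mean C, weighted by votes v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Inception: 14075 votes, (truncated) average of 8
popular = weighted_rating_value(14075, 8)

# A hypothetical movie with the same average but exactly m votes lands halfway to C
borderline = weighted_rating_value(434, 8)

print(round(popular, 3), round(borderline, 3))
```

With many votes the score stays close to the movie's own average; with few votes it is pulled toward *C*.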

In [245]:
# Weighted Rating Formula
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [246]:
# Applying the formula to the entire qualified set and storing the result in the wr column
qualified['wr'] = qualified.apply(weighted_rating, axis=1)
qualified = qualified.sort_values('wr', ascending=False)

### Top Films

In [247]:
qualified.head(15)

|  | title | year | vote_count | vote_average | popularity | genres | wr |
|---|---|---|---|---|---|---|---|
| 15480 | Inception | 2010 | 14075 | 8 | 29.108149 | [Action, Thriller, Science Fiction, Mystery, A... | 7.917588 |
| 12481 | The Dark Knight | 2008 | 12269 | 8 | 123.167259 | [Drama, Action, Crime, Thriller] | 7.905871 |
| 22879 | Interstellar | 2014 | 11187 | 8 | 32.213481 | [Adventure, Drama, Science Fiction] | 7.897107 |
| 2843 | Fight Club | 1999 | 9678 | 8 | 63.869599 | [Drama] | 7.881753 |
| 4863 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 8892 | 8 | 32.070725 | [Adventure, Fantasy, Action] | 7.871787 |
| 292 | Pulp Fiction | 1994 | 8670 | 8 | 140.950236 | [Thriller, Crime] | 7.86866 |
| 314 | The Shawshank Redemption | 1994 | 8358 | 8 | 51.645403 | [Drama, Crime] | 7.864 |
| 7000 | The Lord of the Rings: The Return of the King | 2003 | 8226 | 8 | 29.324358 | [Adventure, Fantasy, Action] | 7.861927 |
| 351 | Forrest Gump | 1994 | 8147 | 8 | 48.307194 | [Comedy, Drama, Romance] | 7.860656 |
| 5814 | The Lord of the Rings: The Two Towers | 2002 | 7641 | 8 | 29.423537 | [Adventure, Fantasy, Action] | 7.851924 |


We see that three Christopher Nolan films, **Inception**, **The Dark Knight** and **Interstellar**, occur at the very top of our chart. The chart also indicates a strong bias of TMDB users towards particular genres and directors.

Let us now construct a function that builds charts for particular genres. For this, we will relax our default condition to the **85th** percentile instead of the 95th.

In [248]:
# The "genres" column is expanded, creating a new movie entry for each genre the movie belongs to.
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'

# A new dataset (gen_md), so that md keeps its original list column
gen_md = md.drop('genres', axis=1).join(s)
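
The effect of expanding the genre lists can be sketched without pandas. The rows and the helper name `explode_genres` below are hypothetical; the point is only that a movie with several genres ends up with one row per genre, so it can appear in several genre charts.

```python
# Illustrative sketch with hypothetical rows: one output row per (movie, genre) pair
movies_demo = [
    {'title': 'Movie A', 'genres': ['Drama', 'Crime']},
    {'title': 'Movie B', 'genres': ['Comedy']},
    # No genres: this sketch emits no row for it (the notebook's join instead
    # keeps such a movie with a missing genre value)
    {'title': 'Movie C', 'genres': []},
]

def explode_genres(rows):
    exploded = []
    for row in rows:
        for genre in row['genres']:
            exploded.append({'title': row['title'], 'genre': genre})
    return exploded

gen_rows = explode_genres(movies_demo)
drama = [r['title'] for r in gen_rows if r['genre'] == 'Drama']
print(gen_rows)
print(drama)
```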

In [249]:
# Function to create lists of movies of a specific genre
def build_chart(genre):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.85)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

In [250]:
# The unique genres usable for creating the lists
s.unique()

array(['Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy', 'Romance',
       'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Science Fiction', 'Mystery', 'War', 'Foreign', 'Music',
       'Documentary', 'Western', 'TV Movie', 'Carousel Productions',
       'Vision View Entertainment', 'Telescene Film Group Productions',
       'Aniplex', 'GoHands', 'BROSTA TV',
       'Mardock Scramble Production Committee', 'Sentai Filmworks',
       'Odyssey Media', 'Pulser Productions', 'Rogue State', 'The Cartel'],
      dtype=object)

In [251]:
build_chart('Adventure').head(15)

|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 15480 | Inception | 2010 | 14075 | 8 | 29.108149 | 7.906526 |
| 22879 | Interstellar | 2014 | 11187 | 8 | 32.213481 | 7.883426 |
| 4863 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 8892 | 8 | 32.070725 | 7.854939 |
| 7000 | The Lord of the Rings: The Return of the King | 2003 | 8226 | 8 | 29.324358 | 7.843867 |
| 5814 | The Lord of the Rings: The Two Towers | 2002 | 7641 | 8 | 29.423537 | 7.832647 |
| 256 | Star Wars | 1977 | 6778 | 8 | 42.149697 | 7.812801 |
| 1225 | Back to the Future | 1985 | 6239 | 8 | 25.778509 | 7.797828 |
| 1154 | The Empire Strikes Back | 1980 | 5998 | 8 | 19.470959 | 7.790329 |
| 5481 | Spirited Away | 2001 | 3968 | 8 | 41.048867 | 7.695056 |
| 9698 | Howl's Moving Castle | 2004 | 2049 | 8 | 16.136048 | 7.465435 |


### (1.1) Biasing the Simple Recommender System

Now let's modify the code so that movies can be listed by the countries that produced them. To do this, we split the films by production country and list a film under a country if that country produced or co-produced it. The distortion introduced here is that a film co-produced by several countries appears on the list of every one of them.

In [252]:
# The "countries" column is expanded, creating a new movie entry for each country the movie belongs to.
c = md.apply(lambda x: pd.Series(x['countries']),axis=1).stack().reset_index(level=1, drop=True)
c.name = 'country'

# A new dataset (con_md), so that md keeps its original list column
con_md = md.drop('countries', axis=1).join(c)

In [253]:
# Function to create the movie lists of a specific country
def build_chart_country(country):
    df = con_md[con_md['country'] == country]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.85)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

In [254]:
# The unique countries usable for creating the lists
c.unique()

array(['United States of America', 'Germany', 'United Kingdom', 'France',
       'Italy', 'Spain', 'China', 'Australia', 'South Africa', 'Canada',
       'Switzerland', 'Belgium', 'Japan', 'Iran', 'Netherlands',
       'Hong Kong', 'Tunisia', 'Ireland', 'Dominican Republic', 'Croatia',
       'Russia', 'Macedonia', 'Austria', 'Taiwan', 'New Zealand',
       'Mexico', 'Poland', 'Peru', 'Cuba', 'Liechtenstein', 'Denmark',
       'Portugal', 'Finland', 'Sweden', 'Argentina', 'Iceland',
       'South Korea', 'Serbia', 'Hungary', 'Czech Republic', 'India',
       'Brazil', 'Greece', 'Congo', 'Senegal', 'Burkina Faso', 'Romania',
       'Philippines', 'Vietnam', 'Trinidad and Tobago', 'Bulgaria',
       'Chile', 'Norway', 'Kazakhstan', 'Algeria', 'Luxembourg',
       'Georgia', 'Ukraine', 'Botswana', 'Aruba', 'Israel', 'Turkey',
       'Ecuador', 'Lebanon', 'Morocco', 'Bosnia and Herzegovina',
       'Bahamas', 'Malaysia', 'Bhutan', 'Jamaica', 'Pakistan', 'Nepal',
       "Cote D'Ivoire", 'Th

In [255]:
build_chart_country('France').head(15)

|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 18465 | The Intouchables | 2011 | 5410 | 8 | 16.086919 | 7.953508 |
| 289 | Leon: The Professional | 1994 | 4293 | 8 | 20.477329 | 7.941722 |
| 5857 | The Pianist | 2002 | 1927 | 8 | 14.811594 | 7.87413 |
| 5878 | City of God | 2002 | 1852 | 8 | 14.95927 | 7.869326 |
| 1132 | Cinema Paradiso | 1988 | 834 | 8 | 14.177005 | 7.728855 |
| 25044 | Song of the Sea | 2014 | 420 | 8 | 6.967358 | 7.518248 |
| 31783 | Mustang | 2015 | 378 | 8 | 6.49255 | 7.477039 |
| 2616 | The 400 Blows | 1959 | 363 | 8 | 7.268688 | 7.460559 |
| 16881 | Senna | 2010 | 282 | 8 | 11.331859 | 7.34994 |
| 23325 | Jodorowsky's Dune | 2013 | 172 | 8 | 6.18442 | 7.099039 |


In [256]:
build_chart_country('Argentina').head(15)

|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 24902 | Wild Tales | 2014 | 878 | 7 | 8.764888 | 6.962866 |
| 14115 | The Secret in Their Eyes | 2009 | 638 | 7 | 11.251632 | 6.949364 |
| 3786 | Dancer in the Dark | 2000 | 392 | 7 | 10.684806 | 6.91928 |
| 8146 | The Motorcycle Diaries | 2004 | 295 | 7 | 7.667491 | 6.894584 |
| 5185 | Nine Queens | 2000 | 156 | 7 | 5.324962 | 6.812287 |
| 36246 | Embrace of the Serpent | 2015 | 129 | 7 | 5.061041 | 6.778734 |
| 35007 | Truman | 2015 | 101 | 7 | 4.778359 | 6.728385 |
| 17730 | Chinese Take-Out | 2011 | 95 | 7 | 3.843784 | 6.714462 |
| 19245 | Sidewalls | 2011 | 86 | 7 | 3.649501 | 6.690678 |
| 5470 | Son of the Bride | 2001 | 37 | 7 | 3.319598 | 6.434001 |


The film **City of God**, even though it is Brazilian, appears near the top of the list of French films because it is a co-production. This can mislead the viewer into thinking the film is French. Another example is **Embrace of the Serpent**, a Colombian film that appears in the lists of Argentina and Venezuela.

In [257]:
build_chart_country('Mexico').head(15)

|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 11308 | Pan's Labyrinth | 2006 | 3122 | 7 | 26.817754 | 6.97523 |
| 24464 | The Book of Life | 2014 | 778 | 7 | 10.337994 | 6.904955 |
| 33362 | Silence | 2016 | 754 | 7 | 14.16938 | 6.902111 |
| 12228 | The Orphanage | 2007 | 652 | 7 | 16.213592 | 6.887851 |
| 4108 | Amores perros | 2000 | 537 | 7 | 5.958138 | 6.865813 |
| 5653 | Frida | 2002 | 397 | 7 | 11.283472 | 6.823615 |
| 5091 | Y Tu Mamá También | 2001 | 321 | 7 | 13.962548 | 6.787306 |
| 16124 | Biutiful | 2010 | 299 | 7 | 7.131568 | 6.773829 |
| 4772 | The Devil's Backbone | 2001 | 277 | 7 | 11.111299 | 6.758529 |
| 21551 | Instructions Not Included | 2013 | 210 | 7 | 6.507895 | 6.69587 |


A similar problem occurs in the Mexico listing: among the top 5 titles, one has the US as its main producer and another has Spain, with Mexico only a co-producer of those films.

## (2) Content-Based Recommendation

The previous recommendations, by country or by genre, were limited to general knowledge of the dataset and did not consider a person's preferences. The recommendations are the same for every user of a streaming platform. To customize the recommendations, we will create a method that is based on the contents of the titles and looks for similarities between them. Thus, after a person indicates a movie they like, similar movies will be recommended. This type of recommendation is commonly referred to as **Content-Based Filtering**.

Two Content-Based Recommendations will be constructed, taking into account the following aspects:

* Movie synopsis and tagline
* Director, Main actors/actresses, Keywords and Genre

As previously stated, this filter will be based on a smaller dataset, with only 9,099 movies (versus 45,000 for the larger dataset), so that recommendations can be computed more quickly.

In [258]:
# Loading the dataset
links_small = pd.read_csv('C:/Users/sb/Desktop/input/links_small.csv')

# Dropping three movies that cause problems due to wrong data in their records
md = md.drop([19730, 29503, 35587])

# Taking IDs from the smaller dataset and selecting only those movies from the larger dataset
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
md['id'] = md['id'].astype('int')
# .copy() so that later column assignments on smd do not act on a view of md
smd = md[md['id'].isin(links_small)].copy()

### Recommendation based on movie description

First, let's build a recommendation system using movie descriptions and taglines. We don't have a quantitative metric to judge the system's performance, so it will have to be assessed qualitatively.

In [259]:
# Cleaning up the tagline and synopsis (overview) fields and merging them into the description field
smd['tagline'] = smd['tagline'].fillna('')
# Filling missing overviews first, so that a missing overview does not discard the tagline
smd['description'] = smd['overview'].fillna('') + ' ' + smd['tagline']

In [260]:
# Building the TF-IDF (Term Frequency-Inverse Document Frequency) matrix from the descriptions
# min_df=1 keeps every term, the same effect as the original 0, which newer scikit-learn releases may reject
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

#### Cosine similarity

We will use cosine similarity to quantify how similar two films are. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \cdot \|y\|}$

Since the TF-IDF vectorizer L2-normalizes each row by default, the dot product directly gives the cosine similarity score. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity**, as it is faster.
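
The shortcut relies on the vectors having unit length. A minimal pure-Python sketch, with hypothetical term-weight vectors and helper names, shows that the dot product of L2-normalized vectors equals the full cosine formula:

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Full formula: dot product divided by the product of the norms
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical term-weight vectors for two descriptions
x = [3.0, 0.0, 4.0]
y = [1.0, 2.0, 2.0]

full = cosine(x, y)
shortcut = dot(l2_normalize(x), l2_normalize(y))  # plain dot product on unit vectors
print(full, shortcut)
```

Both values agree up to floating-point error, which is why a plain linear kernel is enough once the rows are normalized.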

In [261]:
# Applying the linear kernel to the terms matrix
cosine_sim = linear_kernel(tfidf_matrix)

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on their cosine similarity score.

In [262]:
# Preserving the original Index in the index column and resetting the index so that titles can be easily found and related in the dataset
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [263]:
# Function that recommends according to the title, searching for similar ones
def get_recommendations(title, cosine_sim):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
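
The ranking step inside `get_recommendations` can be seen in isolation on a toy similarity row. In the sketch below, the helper name `top_similar` and the scores are hypothetical. Unlike the `[1:31]` slice, which assumes the queried movie sorts first, the sketch filters the query out by index, which is one possible way to stay correct when another movie ties with a score of 1.0.

```python
# Illustrative sketch: rank one row of a similarity matrix, skipping the query itself
def top_similar(sim_row, query_idx, n=3):
    ranked = sorted(enumerate(sim_row), key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in ranked if i != query_idx][:n]

# Hypothetical similarity scores of movie 0 against five movies (itself included)
sim_row_demo = [1.0, 0.2, 0.75, 0.0, 0.4]
top_demo = top_similar(sim_row_demo, query_idx=0)
print(top_demo)
```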

We are ready. Let's now try to get top recommendations for some movies and see how good the recommendations are.

In [264]:
get_recommendations('The Dark Knight', cosine_sim).head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

We see that for **The Dark Knight**, our system is able to identify it as a Batman movie and subsequently recommend other Batman movies as its top recommendations. But unfortunately, that's all this system can do at the moment. This isn't very helpful for most people, as it doesn't take into account very important characteristics like cast, crew, director, and genre that determine a movie's rating and popularity. Someone who liked **The Dark Knight** probably likes it more because of Nolan and would hate **Batman Forever** and every other substandard movie in the Batman franchise.

In [265]:
get_recommendations('City of God', cosine_sim).head(10)

1230           The Designated Mourner
8323                 What Maisie Knew
3318    Long Night's Journey Into Day
2411                      The Insider
4215                       Ratcatcher
5488             How the West Was Won
6797                      Elite Squad
2846                       Bossa Nova
4529                   Salaam Bombay!
5578             The Boston Strangler
Name: title, dtype: object

The recommendations for **City of God** are poor: the suggested films are not particularly similar to it in script, country or aesthetics.

Therefore, we will use metadata that is much more informative than the **synopsis** and **tagline**. In the next subsection, we'll create a more sophisticated recommender that takes into account **genre**, **keywords**, **cast**, and **crew**.

### Recommendation based on metadata

To build the metadata-based content recommendation system, we'll need to merge our current dataset with the credits (cast and crew) and keywords datasets.

In [266]:
# loading the data
credits = pd.read_csv('C:/Users/sb/Desktop/input/credits.csv')
keywords = pd.read_csv('C:/Users/sb/Desktop/input/keywords.csv')

In [267]:
# Handling the indexing field to associate the records
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')
# .copy() so that later column assignments on smd do not act on a view of md
smd = md[md['id'].isin(links_small)].copy()

Now we have our cast, crew, genres and keywords all in one dataframe. We will select features using the following intuitions:

1. **Crew:** Of the crew, we'll only pick the director, as the others don't contribute much to the *feel* of the movie.
2. **Cast:** Casting is a bit more complicated. Lesser known actors and minor roles don't affect people's opinion of a movie. Therefore, we will only select the main characters and their respective actors. We will arbitrarily choose the first 3 actors that appear in the credits list.

In [268]:
# Handling the data format
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [269]:
# Function to select the director of the film
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [270]:
# Storing the director's name
smd['director'] = smd['crew'].apply(get_director)

# Storing the name of the first three actors that appear in the movie credits
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3])

# Storing movie keywords
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

To simplify this recommender, let's create a single field that contains, in plain text, all the information about **genres, director, main actors and keywords**. Next, I use a **CountVectorizer** to build the count matrix, in the same way we built the TF-IDF matrix in the previous system. The remaining steps are similar to what we did earlier: we compute the cosine similarities and return the most similar movies. We'll also normalize the data by **stripping spaces and converting to lower case** in all entries. This way, our engine will not treat **John Goodman** and **John Cusack** as related through the shared token *john*.
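
The effect of stripping spaces can be checked on the two names mentioned above. The helper `tokens` is a hypothetical stand-in for word-level tokenization, not the vectorizer itself:

```python
def tokens(names, squash):
    """Split names into word tokens, optionally removing inner spaces first."""
    result = set()
    for name in names:
        cleaned = name.lower().replace(' ', '') if squash else name.lower()
        result.update(cleaned.split())
    return result

goodman = ['John Goodman']
cusack = ['John Cusack']

# Without squashing, both actors contribute the token 'john'
shared_raw = tokens(goodman, squash=False) & tokens(cusack, squash=False)
# With squashing, 'johngoodman' and 'johncusack' have nothing in common
shared_squashed = tokens(goodman, squash=True) & tokens(cusack, squash=True)
print(shared_raw, shared_squashed)
```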

In [271]:
# Removing spaces from names and putting all characters in lowercase
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

# The same with the director's name
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))

#### Keywords

We'll do a little pre-processing of our keywords before putting them to use. As a first step, we calculate frequency counts for each keyword that appears in the dataset.

In [272]:
# Preparing a list of keywords to be treated before adding them to the dataset
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()

# Ignoring keywords that only occur once
s = s[s > 1]

# Stemmer used to reduce words to their root form, so that plurals and singulars match
stemmer = SnowballStemmer('english')

In [273]:
# Function to filter the keywords, keeping only those found in the index of s (keywords that occur more than once)
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words
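
The frequency filter can be sketched with the standard library alone. The keyword lists below are hypothetical, and `collections.Counter` stands in for the `value_counts` call used in the notebook:

```python
from collections import Counter

# Hypothetical keyword lists for three movies
keywords_demo = [
    ['heist', 'dream', 'subconscious'],
    ['heist', 'bank robbery'],
    ['dream', 'time travel'],
]

# Count every keyword across all movies, then keep those seen more than once
counts = Counter(k for kws in keywords_demo for k in kws)
frequent = {k for k, c in counts.items() if c > 1}

# Apply the filter movie by movie, preserving the original keyword order
filtered = [[k for k in kws if k in frequent] for kws in keywords_demo]
print(filtered)
```

Keywords that occur only once cannot link two movies, so dropping them shrinks the vocabulary without removing any shared signal.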

In [274]:
# Entering keywords in the data set, removing plurals and spaces
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [275]:
# Wrapping the director's name in a list so that it can be concatenated with the other list columns
smd['director'] = smd['director'].apply(lambda row: row.split())

In [276]:
# Creating the column with all the information together
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [277]:
# Building the token count matrix from the soup and computing the pairwise cosine similarity between movies
# min_df=1 keeps every term, the same effect as the original 0, which newer scikit-learn releases may reject
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(smd['soup'])
new_cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [278]:
# Resetting the dataset index and rebuilding the title lookup
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

Let's reuse the get_recommendations function we wrote earlier. As our cosine similarity scores have changed, we expect this to give us different (and probably better) results. Let's check out **The Dark Knight** again and see what recommendations I get this time.

In [279]:
get_recommendations('The Dark Knight', new_cosine_sim).head(10)

8031         The Dark Knight Rises
6218                 Batman Begins
7659    Batman: Under the Red Hood
8927       Kidnapping Mr. Heineken
1134                Batman Returns
5943                      Thursday
1260                Batman & Robin
2448                    Nighthawks
5098                  The Enforcer
8026            Bullet to the Head
Name: title, dtype: object

I am much more satisfied with the results this time around. The system seems to have recognized other Christopher Nolan films and placed them among the top recommendations. I enjoyed watching **The Dark Knight** as well as some of the others on the list, including **Batman Begins** and **The Dark Knight Rises**.

In [280]:
get_recommendations('City of God', new_cosine_sim).head(10)

1412                         Men with Guns
5436                              Mitchell
154                                   Kids
4584                        Salaam Bombay!
146                          Feast of July
943                       The Bloody Child
15                                  Casino
9108                          Rubble Kings
6868                            Persepolis
7128    Angus, Thongs and Perfect Snogging
Name: title, dtype: object

In the case of **City of God**, the recommendations are still not very good, which suggests that more detailed metadata is needed for the recommendations to be effective.

Of course, we can experiment with this mechanism by testing different weights for our features (directors, actors, genres), limiting the number of keywords that can be used in the soup column, weighting genres based on their frequency, showing only movies in the same language, etc.

#### Popularity and ratings

One thing we noticed about our recommendation system is that it recommends movies regardless of ratings and popularity. It's true that **Batman & Robin** shares many characters with **The Dark Knight**, but it was a terrible movie that shouldn't be recommended to anyone.

Therefore, we will add a mechanism to remove bad movies and return movies that are popular and have a good critical response.

I'll take the top 25 movies based on similarity scores and calculate the 60th percentile of their vote counts. Movies with at least that many votes qualify, and each qualifying movie is then scored with the IMDB weighted rating formula, as we did in the Simple Recommender section.

In [281]:
# Function to recommend similar movies based on rating
def improved_recommendations(title, cosine_var):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_var[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'genres', 'countries']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    # weighted_rating reads the module-level m and C, so the local m above only sets the qualification cutoff
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified
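
Note that `improved_recommendations` computes a local `m` and `C`, while `weighted_rating` reads the module-level `m` and `C`. If the intent is for the candidate-specific values to enter the score as well, one possible variant passes them explicitly. The sketch below is not the notebook's code: the helper `weighted_rating_with` and the candidate tuples are hypothetical.

```python
# Illustrative variant (not the notebook's code): the rating helper takes m and C
# as arguments, so the candidate-specific cutoff is also used in the score.
def weighted_rating_with(v, R, m, C):
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical candidates: (title, vote_count, vote_average)
candidates = [('A', 1000, 8), ('B', 50, 9), ('C', 400, 6), ('D', 600, 7), ('E', 20, 5)]

m_local = 480  # 60th percentile of the vote counts above (linear interpolation)
C_local = sum(rating for _, _, rating in candidates) / len(candidates)

# Keep candidates with at least m_local votes, then rank them by weighted rating
qualified_local = [c for c in candidates if c[1] >= m_local]
ranked = sorted(
    qualified_local,
    key=lambda c: weighted_rating_with(c[1], c[2], m_local, C_local),
    reverse=True,
)
ranked_titles = [title for title, _, _ in ranked]
print(ranked_titles)
```

Whether the global or the local values are preferable is a judgment call: the global ones keep scores comparable across queries, while the local ones adapt to each candidate set.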

In [282]:
improved_recommendations('Lone Survivor', new_cosine_sim)

|  | title | vote_count | vote_average | year | genres | countries | wr |
|---|---|---|---|---|---|---|---|
| 8800 | The Imitation Game | 5895 | 8 | 2014 | [History, Drama, Thriller, War] | [United Kingdom] | 7.811074 |
| 8777 | Fury | 4028 | 7 | 2014 | [War, Drama, Action] | [United Kingdom, United States of America, China] | 6.829288 |
| 3823 | The Last Castle | 290 | 7 | 2001 | [Action, Drama, Thriller] | [United States of America] | 5.947908 |
| 7335 | Prison Break: The Final Break | 174 | 7 | 2009 | [Action, Drama, Thriller] | [United States of America] | 5.747179 |
| 7555 | Green Zone | 730 | 6 | 2010 | [War, Action, Adventure, Drama, Thriller] | [Spain, United Kingdom, United States of Ameri... | 5.718458 |
| 4561 | Tears of the Sun | 582 | 6 | 2003 | [Action, Drama, War] | [United States of America] | 5.677446 |
| 3872 | Behind Enemy Lines | 464 | 6 | 2001 | [Action, Drama, Thriller, War] | [United States of America] | 5.635061 |
| 1275 | G.I. Jane | 404 | 6 | 1997 | [Action, Drama] | [United States of America] | 5.608932 |
| 2857 | U-571 | 345 | 6 | 2000 | [Action, Drama, Thriller, War] | [France, United States of America] | 5.579313 |
| 8061 | Seeking Justice | 314 | 5 | 2011 | [Action, Drama, Thriller] | [United States of America] | 5.142092 |


Unfortunately, **Batman & Robin** does not disappear from the recommendations for **The Dark Knight**. This is likely because it is rated 4, which is only slightly below average on TMDB. It certainly doesn't deserve a 4 when amazing movies like **The Dark Knight Rises** only have a 7. However, there's not much we can do about that. Therefore, we will conclude our Content-Based Recommendation section here and return to it when we build a hybrid engine.

### (2.1) Content-Based Recommendation - Biased

In the same way as in section 1.1, let's now bias the results of the last function we created, `improved_recommendations`. To do so, we will change the content of the **soup** column and recompute the cosine similarity, then check what is recommended. Let's start by adding each movie's first production country to the soup.

In [283]:
# Normalizing the country column so that it holds only the first production country, as a single-item list
smd['country'] = smd['countries'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
smd['country'] = smd['country'].str.get(0)
smd['country'] = smd['country'].astype('str').apply(lambda x: x.split()) 

In [284]:
# Creating the column with all the information together
smd['soup_country'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres'] + smd['country']
smd['soup_country'] = smd['soup_country'].apply(lambda x: ' '.join(x))

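# Building the token count matrix from the new soup and computing the pairwise cosine similarity between movies
# min_df=1 keeps every term, the same effect as the original 0, which newer scikit-learn releases may reject
count_country = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
# Fitting the new vectorizer (count_country) rather than refitting the earlier count object
count_matrix_country = count_country.fit_transform(smd['soup_country'])
cosine_sim_country = cosine_similarity(count_matrix_country, count_matrix_country)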

In [285]:
improved_recommendations('Lone Survivor', cosine_sim_country)

|  | title | vote_count | vote_average | year | genres | countries | wr |
|---|---|---|---|---|---|---|---|
| 3823 | The Last Castle | 290 | 7 | 2001 | [Action, Drama, Thriller] | [United States of America] | 5.947908 |
| 7335 | Prison Break: The Final Break | 174 | 7 | 2009 | [Action, Drama, Thriller] | [United States of America] | 5.747179 |
| 7555 | Green Zone | 730 | 6 | 2010 | [War, Action, Adventure, Drama, Thriller] | [Spain, United Kingdom, United States of Ameri... | 5.718458 |
| 7469 | Brothers | 650 | 6 | 2009 | [Drama, Thriller, War] | [United States of America] | 5.69768 |
| 4561 | Tears of the Sun | 582 | 6 | 2003 | [Action, Drama, War] | [United States of America] | 5.677446 |
| 3872 | Behind Enemy Lines | 464 | 6 | 2001 | [Action, Drama, Thriller, War] | [United States of America] | 5.635061 |
| 2857 | U-571 | 345 | 6 | 2000 | [Action, Drama, Thriller, War] | [France, United States of America] | 5.579313 |
| 3029 | Missing in Action | 107 | 5 | 1984 | [Action, Adventure, Thriller, War] | [United States of America] | 5.19646 |
| 6606 | The Marine | 165 | 5 | 2006 | [Action, Adventure, Drama, Thriller] | [United States of America] | 5.177438 |
| 8061 | Seeking Justice | 314 | 5 | 2011 | [Action, Drama, Thriller] | [United States of America] | 5.142092 |


Adding the film's country of origin to the soup resulted in more recommendations from the same country as the queried title. That is, if a user habitually watches only US movies, more movies from that country will be recommended to them, reducing their chances of discovering Latin American movies, for example.
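
To make the mechanism concrete, here is a minimal sketch of how one extra country token changes the cosine similarity between two soups. It is a toy, not the notebook's pipeline: the token lists are invented for illustration, the `cosine` helper is hypothetical, and it counts unigrams only (the `CountVectorizer` above also counts bigrams).

```python
from collections import Counter
from math import sqrt

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (unigram counts only)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical soups: both candidates have exactly the same content tokens
query = ["war", "soldier", "drama"]
candidate_us = ["war", "rescue", "drama"]
candidate_br = ["war", "rescue", "drama"]

# Without a country token the two candidates are tied
before_us = cosine(query, candidate_us)
before_br = cosine(query, candidate_br)

# With a country token the same-country candidate moves up and the other moves down
after_us = cosine(query + ["unitedstatesofamerica"], candidate_us + ["unitedstatesofamerica"])
after_br = cosine(query + ["unitedstatesofamerica"], candidate_br + ["brazil"])

print(round(before_us, 3), round(before_br, 3))  # 0.667 0.667
print(after_us, after_br)                        # 0.75 0.5
```

The content of the two candidates is identical, so the only thing separating them after the change is the country token.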

Now, on top of this last modification, let's reinforce the director's name by tripling it in the soup so that it carries more weight in the cosine similarity. This change is enough to illustrate how easily a recommendation can be manipulated to favor one criterion or another.

In [286]:
# Taking the director's name again and repeating it three times in a list, to increase its weight
smd['director'] = smd['crew'].apply(get_director)
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x,x])

In [287]:
# Creating the column with all the information together
smd['soup_director'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres'] + smd['country']
smd['soup_director'] = smd['soup_director'].apply(lambda x: ' '.join(x))

# Vectorizing the soup into unigram/bigram counts, then computing pairwise cosine similarity
count_director = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
# Fit the vectorizer created on the line above (the original cell reused the earlier `count` object)
count_matrix_director = count_director.fit_transform(smd['soup_director'])
cosine_sim_director = cosine_similarity(count_matrix_director, count_matrix_director)

In [288]:
improved_recommendations('City of God', cosine_sim_director)

Unnamed: 0,title,vote_count,vote_average,year,genres,countries,wr
1000,Once Upon a Time in America,1104,8,1984,"[Drama, Crime]","[United States of America, Italy]",7.222552
2839,American Psycho,2128,7,2000,"[Thriller, Drama, Crime]","[Canada, United States of America]",6.702687
15,Casino,1343,7,1995,"[Drama, Crime]","[France, United States of America]",6.571348
6878,Elite Squad,519,7,2007,"[Drama, Action, Crime]",[Brazil],6.200719
7834,Elite Squad: The Enemy Within,477,7,2010,"[Drama, Action, Crime]",[Brazil],6.16387
6868,Persepolis,461,7,2007,"[Animation, Drama]","[France, United States of America]",6.148922
8050,The Iron Lady,540,6,2011,"[History, Drama]","[France, United Kingdom]",5.663537
6275,The Constant Gardener,428,6,2005,"[Drama, Mystery, Thriller]","[China, Germany, United Kingdom, United States...",5.61982
1691,The Outsiders,293,6,1983,"[Crime, Drama]",[United States of America],5.549223
154,Kids,280,6,1995,"[Drama, Crime]",[United States of America],5.541016


We can see that more films by the same director were recommended among the 10 films closest to the queried title. This example shows that human design choices can steer an AI's recommendations, placing more weight on one aspect to the detriment of others. Even inadvertently, such a system can favor recommendations within a specific bubble, whether defined by the film's theme, its director or its country of origin.
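
The effect of repeating a token can be sketched in a few lines. The example below assumes two hypothetical films that share a director and nothing else; the token names and the `cosine` helper are invented for illustration, and only unigram counts are used.

```python
from collections import Counter
from math import sqrt

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (unigram counts only)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

director = "somedirector"                  # hypothetical shared director token
film_a = ["crime", "drama", "favela"]      # hypothetical content tokens
film_b = ["thriller", "mystery", "africa"]

# Similarity when the director token is repeated 1, 2 or 3 times in both soups
sims = {w: cosine([director] * w + film_a, [director] * w + film_b) for w in (1, 2, 3)}
for w, sim in sims.items():
    print(w, round(sim, 3))
# 1 0.25
# 2 0.571
# 3 0.75
```

With three other tokens per film, the similarity is w² / (w² + 3), so each extra repetition lets the director outweigh everything else in the soup.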

## (3) Collaborative filtering

Our content-based engine suffers from some severe limitations. It is only able to suggest movies that are *close* to a certain movie. That is, it is not able to capture tastes and provide recommendations across genres.

Also, the engine we built is not really personal, since it doesn't capture a user's individual tastes and biases. Anyone who queries our engine for recommendations based on a movie will receive the same recommendations for that movie, no matter who they are.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to movie viewers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a certain product or service that these users have used/tried but I have not.

I'm not going to implement Collaborative Filtering from scratch. Instead, I'll use the **Surprise** library, which provides powerful algorithms such as **Singular Value Decomposition (SVD)**, a matrix factorization model trained to minimize prediction error. We will evaluate it with RMSE (Root Mean Square Error) and use it to produce recommendations.
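
For intuition only, here is a toy sketch of the kind of model behind this approach: each rating is predicted as a global mean plus a user bias, an item bias and a dot product of latent factors, and the parameters are fitted by stochastic gradient descent. This is not Surprise's implementation; the function names, the tiny dataset and the hyperparameters are all assumptions chosen for illustration.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + p_u . q_i with plain SGD (toy settings)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu, bi, pu, qi = {}, {}, {}, {}
    for u, i, _ in ratings:
        bu.setdefault(u, 0.0)
        bi.setdefault(i, 0.0)
        pu.setdefault(u, [rng.gauss(0, 0.1) for _ in range(n_factors)])
        qi.setdefault(i, [rng.gauss(0, 0.1) for _ in range(n_factors)])

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(pu[u], qi[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient steps on the regularized squared error
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr * (err * q - reg * p)
                qi[i][f] += lr * (err * p - reg * q)
    return predict

def rmse(predict, ratings):
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Hypothetical (user, movie, rating) triples
ratings = [("ana", "m1", 5.0), ("ana", "m2", 4.5), ("ana", "m3", 1.0),
           ("bob", "m1", 4.5), ("bob", "m3", 1.5),
           ("cy", "m1", 1.0), ("cy", "m3", 5.0)]

mu = sum(r for _, _, r in ratings) / len(ratings)
rmse_before = rmse(lambda u, i: mu, ratings)   # always predicting the global mean
predict = train_mf(ratings)
rmse_after = rmse(predict, ratings)            # training error of the fitted model
print(rmse_before, rmse_after)
print(predict("bob", "m2"))                    # a rating bob never gave
```

Note that the model never looks at what a movie is about: it only sees identifiers and ratings.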

In [289]:
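# Loading user ratings
# Reader() defaults to a 1-5 rating scale; MovieLens ratings start at 0.5,
# so Reader(rating_scale=(0.5, 5)) would match the data more closely
reader = Reader()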
ratings = pd.read_csv('C:/Users/sb/Desktop/input/ratings_small.csv')
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [290]:
# 5-fold cross-validation of the SVD model
kf = KFold(n_splits=5)
svd = SVD()
# Pass the splitter explicitly (the original cell created it but never used it)
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=kf)

{'test_rmse': array([0.89761827, 0.89085431, 0.89475097, 0.89329432, 0.90626256]),
 'test_mae': array([0.69570297, 0.68491747, 0.68767285, 0.68696854, 0.69734107]),
 'fit_time': (0.8597822189331055,
  0.8537769317626953,
  0.8277530670166016,
  0.826751708984375,
  0.8347592353820801),
 'test_time': (0.1311194896697998,
  0.1141045093536377,
  0.11510419845581055,
  0.10509657859802246,
  0.11310362815856934)}

We get a mean **Root Mean Square Error** of about 0.8966 across the five folds, which is more than enough for our case. Let's now train on the full dataset and get to the predictions.
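
The averages can be recomputed from the per-fold scores printed above. This is a small check with the values copied by hand from that output.

```python
from statistics import mean

# Per-fold scores copied from the cross_validate output above
test_rmse = [0.89761827, 0.89085431, 0.89475097, 0.89329432, 0.90626256]
test_mae = [0.69570297, 0.68491747, 0.68767285, 0.68696854, 0.69734107]

mean_rmse = mean(test_rmse)
mean_mae = mean(test_mae)
print(f"mean RMSE = {mean_rmse:.4f}, mean MAE = {mean_mae:.4f}")
# mean RMSE = 0.8966, mean MAE = 0.6905
```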

In [291]:
# Training the model on the full dataset
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1afdfcda880>

In [292]:
svd.predict(50, 8554, 3)

Prediction(uid=50, iid=8554, r_ui=3, est=3.3319703218706187, details={'was_impossible': False})

For user 50 and the movie with ID 8554 (Lone Survivor), we get an estimated rating of about **3.33**. A surprising feature of this recommendation system is that it doesn't care what the movie is (or what's in it). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users rated the movie.

## Hybrid Recommendation

In this section, I'll try to build a simple hybrid recommender that pulls together techniques we've implemented in content-based and collaborative filter-based engines. This is how it will work:

* **Input:** User ID and the title of a movie
* **Result:** Ranked similar movies based on expected ratings by that specific user.
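
The idea can be sketched with toy data before wiring it to the real objects: first shortlist the titles most similar to the query, then re-rank that shortlist by the rating predicted for the given user. Everything in this sketch is hypothetical (the function name, the similarity values and the predicted ratings); in the notebook the two ingredients are the cosine similarity matrix and `svd.predict`.

```python
def hybrid_sketch(user_id, title, similarity, predict, shortlist=3, top_n=2):
    """Shortlist by content similarity, then re-rank by predicted rating."""
    candidates = sorted(
        (t for t in similarity[title] if t != title),
        key=lambda t: similarity[title][t],
        reverse=True,
    )[:shortlist]
    ranked = sorted(candidates, key=lambda t: predict(user_id, t), reverse=True)
    return ranked[:top_n]

# Hypothetical similarities to movie "A" and hypothetical predicted ratings
similarity = {"A": {"A": 1.0, "B": 0.9, "C": 0.8, "D": 0.7, "E": 0.1}}
fake_est = {(1, "B"): 2.0, (1, "C"): 4.5, (1, "D"): 3.0, (1, "E"): 5.0,
            (2, "B"): 4.0, (2, "C"): 2.5, (2, "D"): 3.5, (2, "E"): 1.0}

def predict(user_id, title):
    return fake_est[(user_id, title)]

recs_user_1 = hybrid_sketch(1, "A", similarity, predict)
recs_user_2 = hybrid_sketch(2, "A", similarity, predict)
print(recs_user_1)  # ['C', 'D']
print(recs_user_2)  # ['B', 'D']
```

Movie "E" has the highest predicted rating for user 1 but never appears, because it does not make the similarity shortlist. The content-based step decides which titles are eligible, and the collaborative step only decides their order.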

In [None]:
smd.columns

In [294]:
def convert_int(x):
    # Return x as an int, or NaN when the value is missing or not numeric
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [295]:
# Mapping MovieLens movieId values to TMDB ids and titles
id_map = pd.read_csv('C:/Users/sb/Desktop/input/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')
indices_map = id_map.set_index('id')

In [296]:
# Hybrid recommendation function
def hybrid(userId, title, cosine_var):
    idx = indices[title]

    # 25 titles most similar to the query, skipping position 0 (the query itself)
    sim_scores = list(enumerate(cosine_var[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    # .copy() so that adding 'est' does not write into a slice of smd
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id', 'genres', 'countries']].copy()
    # Rating the SVD model predicts for this user on each candidate
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [297]:
# Hybrid recommendation function with weighted rating
def hybrid_wr(userId, title, cosine_var):
    idx = indices[title]

    # 25 titles most similar to the query, skipping position 0 (the query itself)
    sim_scores = list(enumerate(cosine_var[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id', 'genres', 'countries', 'keywords']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.50)
    # Keep candidates with at least the median vote count; .copy() avoids writing into a slice of smd
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    # weighted_rating is defined earlier in the notebook. C and m are local to this function:
    # a weighted_rating that reads module-level m and C will not see them, so here m only drives the filter above.
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)

    # Rank the remaining candidates by the rating the SVD model predicts for this user
    qualified['est'] = qualified['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    qualified = qualified.sort_values('est', ascending=False)

    return qualified.head(10)

In [298]:
# Organizing a dataset with all ratings from all users
id_map_movies = pd.read_csv('C:/Users/sb/Desktop/input/links_small.csv')[['movieId', 'tmdbId']]
id_map_movies['tmdbId'] = id_map_movies['tmdbId'].apply(convert_int)
id_map_movies.columns = ['movieId', 'id']
id_map_movies = id_map_movies.merge(smd[['title', 'id']], on='id')
user_movie_rating = id_map_movies.merge(ratings)

In [299]:
user_movie_rating.query('rating == 5.0 and title == "The Hurt Locker"')

Unnamed: 0,movieId,id,title,userId,rating,timestamp
93479,69481,12162.0,The Hurt Locker,178,5.0,1437424744
93482,69481,12162.0,The Hurt Locker,298,5.0,1447366104
93490,69481,12162.0,The Hurt Locker,483,5.0,1465387515
93491,69481,12162.0,The Hurt Locker,544,5.0,1435786462


Above, we searched for users who gave the maximum score of 5 to **The Hurt Locker**, a film set during the Iraq War. Below, we generate recommendations for each of them. Although all 4 users really liked the movie, their lists are not identical: most titles are shared, but a few appear for only some of the users, and the order differs from one user to the next.

In [300]:
hybrid_wr(178, 'The Hurt Locker', cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
3477,Tootsie,429,6,1982,9576,"[Comedy, Romance]",[United States of America],"[infidel, love, friend, crossdress, reput, apo...",5.620261,3.608138
6263,Grizzly Man,213,7,2005,501,[Documentary],[United States of America],"[lifeanddeath, camp, rain, camcord, self-aband...",5.822697,3.588392
964,The Thin Blue Line,109,7,1988,14285,"[Mystery, Documentary, Crime]",[United States of America],[],5.59721,3.506911
1534,Kramer vs. Kramer,483,7,1979,12102,[Drama],[United States of America],"[wifehusbandrelationship, custodybattl, marria...",6.16934,3.472306
6595,Borat: Cultural Learnings of America for Make ...,1617,6,2006,496,[Comedy],[United States of America],"[malenud, usa, california, prostitut, journali...",5.840217,3.206967
4089,My Big Fat Greek Wedding,686,6,2002,8346,"[Comedy, Drama, Romance]","[Canada, United States of America]","[usa, parentchildrelationship, greec, culturec...",5.707397,3.2022
6647,Tenacious D in The Pick of Destiny,419,6,2006,2179,"[Comedy, Music]",[United States of America],"[guitar, rockstar, rock, musiccompetit, satan,...",5.615809,3.1999
1072,Field of Dreams,383,7,1989,2323,"[Drama, Fantasy]",[United States of America],"[regret, basedonnovel, basebal, miracl, farm, ...",6.067668,3.13022
4292,White Oleander,111,7,2002,10994,[Drama],"[Germany, United States of America]","[poison, prison, lossoffamili, lossoffath, fos...",5.602358,3.119782
6665,Curse of the Golden Flower,206,6,2006,1494,"[Action, Drama, Fantasy]","[China, Hong Kong]","[poison, china, martialart, swordplay, fight, ...",5.487946,3.092538


In [301]:
hybrid_wr(298, 'The Hurt Locker', cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
1534,Kramer vs. Kramer,483,7,1979,12102,[Drama],[United States of America],"[wifehusbandrelationship, custodybattl, marria...",6.16934,4.961829
964,The Thin Blue Line,109,7,1988,14285,"[Mystery, Documentary, Crime]",[United States of America],[],5.59721,4.909913
3477,Tootsie,429,6,1982,9576,"[Comedy, Romance]",[United States of America],"[infidel, love, friend, crossdress, reput, apo...",5.620261,4.901354
4292,White Oleander,111,7,2002,10994,[Drama],"[Germany, United States of America]","[poison, prison, lossoffamili, lossoffath, fos...",5.602358,4.795872
6263,Grizzly Man,213,7,2005,501,[Documentary],[United States of America],"[lifeanddeath, camp, rain, camcord, self-aband...",5.822697,4.715716
7385,Post Grad,102,5,2009,25704,[Comedy],[United States of America],"[career, famili, unemploy, womandirector]",5.198293,4.714487
6647,Tenacious D in The Pick of Destiny,419,6,2006,2179,"[Comedy, Music]",[United States of America],"[guitar, rockstar, rock, musiccompetit, satan,...",5.615809,4.653242
1072,Field of Dreams,383,7,1989,2323,"[Drama, Fantasy]",[United States of America],"[regret, basedonnovel, basebal, miracl, farm, ...",6.067668,4.628965
6665,Curse of the Golden Flower,206,6,2006,1494,"[Action, Drama, Fantasy]","[China, Hong Kong]","[poison, china, martialart, swordplay, fight, ...",5.487946,4.452938
4939,21 Grams,943,7,2003,470,"[Drama, Crime, Thriller]",[United States of America],"[lifeanddeath, transplant, suicideattempt, des...",6.44683,4.398819


In [302]:
hybrid_wr(483, 'The Hurt Locker', cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
1534,Kramer vs. Kramer,483,7,1979,12102,[Drama],[United States of America],"[wifehusbandrelationship, custodybattl, marria...",6.16934,3.923066
3477,Tootsie,429,6,1982,9576,"[Comedy, Romance]",[United States of America],"[infidel, love, friend, crossdress, reput, apo...",5.620261,3.875917
1072,Field of Dreams,383,7,1989,2323,"[Drama, Fantasy]",[United States of America],"[regret, basedonnovel, basebal, miracl, farm, ...",6.067668,3.835392
6263,Grizzly Man,213,7,2005,501,[Documentary],[United States of America],"[lifeanddeath, camp, rain, camcord, self-aband...",5.822697,3.666686
6647,Tenacious D in The Pick of Destiny,419,6,2006,2179,"[Comedy, Music]",[United States of America],"[guitar, rockstar, rock, musiccompetit, satan,...",5.615809,3.644456
7385,Post Grad,102,5,2009,25704,[Comedy],[United States of America],"[career, famili, unemploy, womandirector]",5.198293,3.527046
4292,White Oleander,111,7,2002,10994,[Drama],"[Germany, United States of America]","[poison, prison, lossoffamili, lossoffath, fos...",5.602358,3.503965
964,The Thin Blue Line,109,7,1988,14285,"[Mystery, Documentary, Crime]",[United States of America],[],5.59721,3.362961
6595,Borat: Cultural Learnings of America for Make ...,1617,6,2006,496,[Comedy],[United States of America],"[malenud, usa, california, prostitut, journali...",5.840217,3.35431
4939,21 Grams,943,7,2003,470,"[Drama, Crime, Thriller]",[United States of America],"[lifeanddeath, transplant, suicideattempt, des...",6.44683,3.312268


In [303]:
hybrid_wr(544, 'The Hurt Locker', cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
964,The Thin Blue Line,109,7,1988,14285,"[Mystery, Documentary, Crime]",[United States of America],[],5.59721,4.699401
1534,Kramer vs. Kramer,483,7,1979,12102,[Drama],[United States of America],"[wifehusbandrelationship, custodybattl, marria...",6.16934,4.588431
6263,Grizzly Man,213,7,2005,501,[Documentary],[United States of America],"[lifeanddeath, camp, rain, camcord, self-aband...",5.822697,4.538559
3477,Tootsie,429,6,1982,9576,"[Comedy, Romance]",[United States of America],"[infidel, love, friend, crossdress, reput, apo...",5.620261,4.492115
7385,Post Grad,102,5,2009,25704,[Comedy],[United States of America],"[career, famili, unemploy, womandirector]",5.198293,4.433949
1072,Field of Dreams,383,7,1989,2323,"[Drama, Fantasy]",[United States of America],"[regret, basedonnovel, basebal, miracl, farm, ...",6.067668,4.406373
6595,Borat: Cultural Learnings of America for Make ...,1617,6,2006,496,[Comedy],[United States of America],"[malenud, usa, california, prostitut, journali...",5.840217,4.388175
6647,Tenacious D in The Pick of Destiny,419,6,2006,2179,"[Comedy, Music]",[United States of America],"[guitar, rockstar, rock, musiccompetit, satan,...",5.615809,4.372088
4292,White Oleander,111,7,2002,10994,[Drama],"[Germany, United States of America]","[poison, prison, lossoffamili, lossoffath, fos...",5.602358,4.339604
4089,My Big Fat Greek Wedding,686,6,2002,8346,"[Comedy, Drama, Romance]","[Canada, United States of America]","[usa, parentchildrelationship, greec, culturec...",5.707397,4.191689


Now let's analyze **Brokeback Mountain**, controversial in its time for presenting a homosexual romance between cowboys, an archetype traditionally associated with rough, tough masculinity. Let's list the movie's ratings and then look at the recommendations for one user who gave it the highest rating, 5.0 (user 78), and one who gave it the lowest, 0.5 (user 244). The two lists share 8 of their 10 titles, with 2 titles exclusive to each list, and the order differs considerably.

In [304]:
user_movie_rating.query('title == "Brokeback Mountain"').head(10)

Unnamed: 0,movieId,id,title,userId,rating,timestamp
87387,39183,142.0,Brokeback Mountain,23,4.5,1148671229
87388,39183,142.0,Brokeback Mountain,52,4.5,1231769880
87389,39183,142.0,Brokeback Mountain,78,5.0,1328636566
87390,39183,142.0,Brokeback Mountain,133,0.5,1416150529
87391,39183,142.0,Brokeback Mountain,244,0.5,1448175972
87392,39183,142.0,Brokeback Mountain,255,4.5,1236984187
87393,39183,142.0,Brokeback Mountain,273,4.0,1466944177
87394,39183,142.0,Brokeback Mountain,283,4.5,1150010196
87395,39183,142.0,Brokeback Mountain,303,4.5,1299590603
87396,39183,142.0,Brokeback Mountain,316,2.0,1460823439


In [305]:
hybrid_wr(78, 'Brokeback Mountain', new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
204,Eat Drink Man Woman,76,7,1994,10451,"[Comedy, Drama, Romance]","[Taiwan, United States of America]","[cook, senseoflif, daughter, date, famili]",5.506441,4.694301
873,Jude,38,6,1996,25147,"[Drama, Romance]",[United Kingdom],[],5.305689,4.611381
112,The Star Maker,29,6,1995,37975,"[Drama, Romance]",[Italy],[],5.292193,4.442504
4844,My Life Without Me,78,7,2003,20,"[Drama, Romance]","[Canada, Spain]","[farewel, respons, dyinganddeath, nightshift, ...",5.512276,4.415204
8556,Geography Club,55,6,2013,197033,"[Comedy, Drama, Romance]",[United States of America],"[gay, friendship, youngpeopl]",5.329826,4.386457
2875,Two Moon Junction,23,5,1988,5241,"[Drama, Romance]",[United States of America],[],5.232571,4.308281
5090,Wuthering Heights,61,6,1992,25095,"[Drama, Romance]","[United Kingdom, United States of America]","[costum, drama, reveng]",5.33795,4.295521
3356,I Know Where I'm Going!,23,6,1945,56137,"[Drama, Romance]",[United Kingdom],[],5.2829,4.265748
3615,Torch Song Trilogy,25,7,1988,8463,"[Comedy, Drama, Romance]",[],"[gay, newyork, homophobia, homosexu, basedonpl...",5.34049,4.264805
6051,In Good Company,198,5,2004,1901,"[Comedy, Drama, Romance]",[United States of America],"[midlifecrisi, daughter, badboss]",5.168173,4.261694


In [306]:
hybrid_wr(244, 'Brokeback Mountain', new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
204,Eat Drink Man Woman,76,7,1994,10451,"[Comedy, Drama, Romance]","[Taiwan, United States of America]","[cook, senseoflif, daughter, date, famili]",5.506441,4.239597
873,Jude,38,6,1996,25147,"[Drama, Romance]",[United Kingdom],[],5.305689,3.874311
8556,Geography Club,55,6,2013,197033,"[Comedy, Drama, Romance]",[United States of America],"[gay, friendship, youngpeopl]",5.329826,3.727031
4844,My Life Without Me,78,7,2003,20,"[Drama, Romance]","[Canada, Spain]","[farewel, respons, dyinganddeath, nightshift, ...",5.512276,3.611915
649,Stealing Beauty,99,6,1996,14553,"[Drama, Romance]","[France, Italy, United Kingdom]",[],5.38515,3.558994
3356,I Know Where I'm Going!,23,6,1945,56137,"[Drama, Romance]",[United Kingdom],[],5.2829,3.537214
3615,Torch Song Trilogy,25,7,1988,8463,"[Comedy, Drama, Romance]",[],"[gay, newyork, homophobia, homosexu, basedonpl...",5.34049,3.529919
8799,Free Fall,164,7,2013,167581,"[Drama, Romance]",[Germany],"[gay, comingout, policeacademi, lgbt]",5.726229,3.514222
112,The Star Maker,29,6,1995,37975,"[Drama, Romance]",[Italy],[],5.292193,3.443016
6051,In Good Company,198,5,2004,1901,"[Comedy, Drama, Romance]",[United States of America],"[midlifecrisi, daughter, badboss]",5.168173,3.379197
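
The overlap between the two lists above can be checked directly with set operations. This is a small sketch with the titles copied by hand from the outputs for users 78 and 244; the variable names are illustrative and not part of the notebook.

```python
# Top-10 titles copied from the two hybrid_wr outputs above
top_78 = ["Eat Drink Man Woman", "Jude", "The Star Maker", "My Life Without Me",
          "Geography Club", "Two Moon Junction", "Wuthering Heights",
          "I Know Where I'm Going!", "Torch Song Trilogy", "In Good Company"]
top_244 = ["Eat Drink Man Woman", "Jude", "Geography Club", "My Life Without Me",
           "Stealing Beauty", "I Know Where I'm Going!", "Torch Song Trilogy",
           "Free Fall", "The Star Maker", "In Good Company"]

shared = set(top_78) & set(top_244)
only_78 = sorted(set(top_78) - set(top_244))
only_244 = sorted(set(top_244) - set(top_78))

print(len(shared))  # 8
print(only_78)      # ['Two Moon Junction', 'Wuthering Heights']
print(only_244)     # ['Free Fall', 'Stealing Beauty']
```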


## (3.1) Hybrid Recommendation Variations

We will now use different cosine similarity matrices and compare the recommendations they produce. Let's start with **Green Zone**, a war thriller set in Iraq after the 2003 US-led invasion. We'll see the recommendations for two users, one who liked the movie, giving it a 4.0, and another who didn't like it that much, giving it a 1.5.

In [307]:
user_movie_rating.query('title == "Green Zone"')

Unnamed: 0,movieId,id,title,userId,rating,timestamp
94480,74795,22972.0,Green Zone,15,1.5,1367764758
94481,74795,22972.0,Green Zone,84,4.0,1429911567
94482,74795,22972.0,Green Zone,380,3.5,1304471397
94483,74795,22972.0,Green Zone,475,3.5,1446908825
94484,74795,22972.0,Green Zone,547,3.5,1296916834
94485,74795,22972.0,Green Zone,580,3.5,1280152650
94486,74795,22972.0,Green Zone,624,3.0,1312638561


In [308]:
hybrid_wr(15, 'Green Zone', new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
5556,The Enemy Below,46,7,1957,15876,"[Action, Adventure, Drama, History, Thriller, ...",[United States of America],"[submarin, worldwarii, u.s.navi]",5.413094,3.052273
4274,Bloody Sunday,54,7,2002,4107,"[Action, Adventure, Drama, History]","[Ireland, United Kingdom]","[northernireland, independentfilm, civilright]",5.439109,2.969771
4221,Undisputed,141,6,2002,15070,"[Action, Adventure, Drama, Thriller]",[United States of America],[sport],5.430061,2.736548
3470,Uncommon Valor,54,6,1983,9967,"[Action, Adventure, Drama, Thriller]",[United States of America],"[vietnamveteran, lossofson, vietnam, vietnamwa...",5.328453,2.646576
3626,American Ninja 3: Blood Hunt,33,4,1989,25682,"[Action, Adventure, Drama]","[United States of America, Canada, South Africa]",[ninja],5.156927,2.631705
5308,Gunga Din,38,6,1939,24965,"[Action, Adventure, Drama, War]",[United States of America],"[armi, upris]",5.305689,2.602716
3950,The Count of Monte Cristo,463,7,2002,11362,"[Action, Adventure, Drama, Thriller]",[United States of America],"[lossoflov, lover(female), ex-lov, tortur, nap...",6.15082,2.548974
7,Tom and Huck,45,5,1995,45325,"[Action, Adventure, Drama, Family]",[United States of America],[],5.22189,2.481086
4727,Flight of the Intruder,31,5,1991,18886,"[Action, Adventure, Drama, Thriller]",[United States of America],"[bomber, vietnamwar, u.s.navi, aviat, combat, ...",5.22857,2.263693
8882,The Pacific,318,7,2010,189197,"[Action, Adventure, Drama, History, War]",[United States of America],"[worldwarii, miniseri]",5.987081,2.17379


In [309]:
hybrid_wr(15, 'Green Zone', cosine_sim_country)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
5556,The Enemy Below,46,7,1957,15876,"[Action, Adventure, Drama, History, Thriller, ...",[United States of America],"[submarin, worldwarii, u.s.navi]",5.413094,3.052273
5110,To Have and Have Not,112,7,1944,22584,"[War, Action, Adventure, Drama, Romance, Thril...",[United States of America],"[fish, island, nazi, frenchresist, singer, alc...",5.604918,3.015172
4274,Bloody Sunday,54,7,2002,4107,"[Action, Adventure, Drama, History]","[Ireland, United Kingdom]","[northernireland, independentfilm, civilright]",5.439109,2.969771
4221,Undisputed,141,6,2002,15070,"[Action, Adventure, Drama, Thriller]",[United States of America],[sport],5.430061,2.736548
3470,Uncommon Valor,54,6,1983,9967,"[Action, Adventure, Drama, Thriller]",[United States of America],"[vietnamveteran, lossofson, vietnam, vietnamwa...",5.328453,2.646576
5308,Gunga Din,38,6,1939,24965,"[Action, Adventure, Drama, War]",[United States of America],"[armi, upris]",5.305689,2.602716
3950,The Count of Monte Cristo,463,7,2002,11362,"[Action, Adventure, Drama, Thriller]",[United States of America],"[lossoflov, lover(female), ex-lov, tortur, nap...",6.15082,2.548974
7,Tom and Huck,45,5,1995,45325,"[Action, Adventure, Drama, Family]",[United States of America],[],5.22189,2.481086
4727,Flight of the Intruder,31,5,1991,18886,"[Action, Adventure, Drama, Thriller]",[United States of America],"[bomber, vietnamwar, u.s.navi, aviat, combat, ...",5.22857,2.263693
8882,The Pacific,318,7,2010,189197,"[Action, Adventure, Drama, History, War]",[United States of America],"[worldwarii, miniseri]",5.987081,2.17379


In [310]:
hybrid_wr(84, 'Green Zone', new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
3950,The Count of Monte Cristo,463,7,2002,11362,"[Action, Adventure, Drama, Thriller]",[United States of America],"[lossoflov, lover(female), ex-lov, tortur, nap...",6.15082,3.856349
4274,Bloody Sunday,54,7,2002,4107,"[Action, Adventure, Drama, History]","[Ireland, United Kingdom]","[northernireland, independentfilm, civilright]",5.439109,3.739339
8882,The Pacific,318,7,2010,189197,"[Action, Adventure, Drama, History, War]",[United States of America],"[worldwarii, miniseri]",5.987081,3.679704
4727,Flight of the Intruder,31,5,1991,18886,"[Action, Adventure, Drama, Thriller]",[United States of America],"[bomber, vietnamwar, u.s.navi, aviat, combat, ...",5.22857,3.671982
3470,Uncommon Valor,54,6,1983,9967,"[Action, Adventure, Drama, Thriller]",[United States of America],"[vietnamveteran, lossofson, vietnam, vietnamwa...",5.328453,3.618182
4221,Undisputed,141,6,2002,15070,"[Action, Adventure, Drama, Thriller]",[United States of America],[sport],5.430061,3.615671
5308,Gunga Din,38,6,1939,24965,"[Action, Adventure, Drama, War]",[United States of America],"[armi, upris]",5.305689,3.608074
5556,The Enemy Below,46,7,1957,15876,"[Action, Adventure, Drama, History, Thriller, ...",[United States of America],"[submarin, worldwarii, u.s.navi]",5.413094,3.597488
7,Tom and Huck,45,5,1995,45325,"[Action, Adventure, Drama, Family]",[United States of America],[],5.22189,3.586623
3626,American Ninja 3: Blood Hunt,33,4,1989,25682,"[Action, Adventure, Drama]","[United States of America, Canada, South Africa]",[ninja],5.156927,3.55974


In [311]:
hybrid_wr(84, 'Green Zone', cosine_sim_country)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
3950,The Count of Monte Cristo,463,7,2002,11362,"[Action, Adventure, Drama, Thriller]",[United States of America],"[lossoflov, lover(female), ex-lov, tortur, nap...",6.15082,3.856349
5110,To Have and Have Not,112,7,1944,22584,"[War, Action, Adventure, Drama, Romance, Thril...",[United States of America],"[fish, island, nazi, frenchresist, singer, alc...",5.604918,3.748232
4274,Bloody Sunday,54,7,2002,4107,"[Action, Adventure, Drama, History]","[Ireland, United Kingdom]","[northernireland, independentfilm, civilright]",5.439109,3.739339
8882,The Pacific,318,7,2010,189197,"[Action, Adventure, Drama, History, War]",[United States of America],"[worldwarii, miniseri]",5.987081,3.679704
4727,Flight of the Intruder,31,5,1991,18886,"[Action, Adventure, Drama, Thriller]",[United States of America],"[bomber, vietnamwar, u.s.navi, aviat, combat, ...",5.22857,3.671982
3470,Uncommon Valor,54,6,1983,9967,"[Action, Adventure, Drama, Thriller]",[United States of America],"[vietnamveteran, lossofson, vietnam, vietnamwa...",5.328453,3.618182
4221,Undisputed,141,6,2002,15070,"[Action, Adventure, Drama, Thriller]",[United States of America],[sport],5.430061,3.615671
5308,Gunga Din,38,6,1939,24965,"[Action, Adventure, Drama, War]",[United States of America],"[armi, upris]",5.305689,3.608074
5556,The Enemy Below,46,7,1957,15876,"[Action, Adventure, Drama, History, Thriller, ...",[United States of America],"[submarin, worldwarii, u.s.navi]",5.413094,3.597488
7,Tom and Huck,45,5,1995,45325,"[Action, Adventure, Drama, Family]",[United States of America],[],5.22189,3.586623


We see that our hybrid recommender returns different recommendations for different users even though the movie is the same, so the recommendations are more personalized and tailored to specific users. The choice of similarity matrix also matters: without the country token, the lists include **Bloody Sunday** (Ireland, United Kingdom) and **American Ninja 3: Blood Hunt** (a co-production of the United States, Canada and South Africa); with the country token, American Ninja 3 is replaced by the US-only **To Have and Have Not**, while Bloody Sunday remains. In other words, a similarity measure that takes the country of origin into account tends to reinforce the consumption of titles from the same country.

Below we look at the recommendations for the Spanish-Mexican film **Pan's Labyrinth**. We use two similarity matrices, with and without country, for two different users: one who gave the film the maximum score and another who gave it the minimum. Interestingly, the same user who gave Brokeback Mountain a 0.5 also gave Pan's Labyrinth a low score.

Because the film is not a US production, adding the country token shifts the recommendations toward films from its producing countries (mostly Spanish productions and co-productions in the lists below), whereas the similarity matrix without country tends to feature more US films.

In [312]:
user_movie_rating.query('title == "Pan\'s Labyrinth"').head(15)

Unnamed: 0,movieId,id,title,userId,rating,timestamp
89105,48394,1417.0,Pan's Labyrinth,15,1.5,1348976680
89106,48394,1417.0,Pan's Labyrinth,31,4.5,1274149580
89107,48394,1417.0,Pan's Labyrinth,42,4.0,1473258404
89108,48394,1417.0,Pan's Labyrinth,48,4.0,1316495346
89109,48394,1417.0,Pan's Labyrinth,61,4.5,1216050612
89110,48394,1417.0,Pan's Labyrinth,72,3.5,1461783951
89111,48394,1417.0,Pan's Labyrinth,73,3.5,1466320346
89112,48394,1417.0,Pan's Labyrinth,94,3.0,1291780552
89113,48394,1417.0,Pan's Labyrinth,96,3.5,1223256776
89114,48394,1417.0,Pan's Labyrinth,97,1.5,1460343121


In [313]:
hybrid_wr(106, "Pan's Labyrinth", new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
6565,The Illusionist,1657,7,2006,1491,"[Fantasy, Drama, Thriller, Romance]","[Czech Republic, United States of America]","[magic, princess, love, rivalri, viennaaustria...",6.635717,3.996511
39,Richard III,50,6,1995,31174,"[Drama, War]","[United Kingdom, United States of America]","[england, shakespear, kidnap, murder, king]",5.322903,3.893436
3845,The Devil's Backbone,277,7,2001,1433,"[Fantasy, Drama, Horror, Thriller, Science Fic...","[Spain, Mexico]","[bomb, sex, fight, nuditi, war, supernatur, or...",5.928671,3.86737
3549,Bat*21,42,5,1988,14911,"[Action, Drama, War]",[United States of America],[basedonnovel],5.223288,3.863873
8348,Blancanieves,67,7,2012,127864,"[Drama, Fantasy]","[France, Spain]","[matador, referencetosnowwhit]",5.479611,3.836177
7557,Dorian Gray,588,5,2009,23488,"[Fantasy, Drama, Thriller]",[United Kingdom],[],5.103997,3.797013
3979,Hart's War,246,5,2002,10592,"[Drama, War]",[United States of America],"[blackpeopl, worldwarii, prisonersofwar, u.s.a...",5.156302,3.721959
3551,The Beast of War,53,6,1988,15267,"[Drama, War]",[United States of America],[],5.327074,3.713023
4561,Tears of the Sun,582,6,2003,9567,"[Action, Drama, War]",[United States of America],"[u.s.armi, presid]",5.677446,3.680711
5308,Gunga Din,38,6,1939,24965,"[Action, Adventure, Drama, War]",[United States of America],"[armi, upris]",5.305689,3.635466


In [314]:
hybrid_wr(106, "Pan's Labyrinth", cosine_sim_country)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
6565,The Illusionist,1657,7,2006,1491,"[Fantasy, Drama, Thriller, Romance]","[Czech Republic, United States of America]","[magic, princess, love, rivalri, viennaaustria...",6.635717,3.996511
5422,Viridiana,93,7,1961,4497,[Drama],"[Spain, Mexico]","[spain, rape, nun, faith, wine, devout, orphan...",5.554621,3.930505
7174,Che: Part Two,155,6,2009,8880,"[Drama, History, War]","[Spain, France, United States of America]","[biographi, cheguevara]",5.443608,3.903483
3845,The Devil's Backbone,277,7,2001,1433,"[Fantasy, Drama, Horror, Thriller, Science Fic...","[Spain, Mexico]","[bomb, sex, fight, nuditi, war, supernatur, or...",5.928671,3.86737
7555,Green Zone,730,6,2010,22972,"[War, Action, Adventure, Drama, Thriller]","[Spain, United Kingdom, United States of Ameri...",[],5.718458,3.838479
4400,Talk to Her,273,7,2002,64,"[Drama, Romance]",[Spain],"[suicid, spain, rape, coma, bullfight, journal...",5.92261,3.811261
7173,Che: Part One,209,6,2008,8881,"[Drama, History, War]","[Spain, France, United States of America]",[cheguevara],5.490335,3.75386
3979,Hart's War,246,5,2002,10592,"[Drama, War]",[United States of America],"[blackpeopl, worldwarii, prisonersofwar, u.s.a...",5.156302,3.721959
3551,The Beast of War,53,6,1988,15267,"[Drama, War]",[United States of America],[],5.327074,3.713023
7461,Broken Embraces,167,7,2009,8088,"[Drama, Romance, Thriller]",[Spain],"[madrid, jealousi, obsess, love, hitchcockian]",5.732588,3.694825


In [315]:
hybrid_wr(133, "Pan's Labyrinth", new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
39,Richard III,50,6,1995,31174,"[Drama, War]","[United Kingdom, United States of America]","[england, shakespear, kidnap, murder, king]",5.322903,2.588181
2115,A Midsummer Night's Dream,95,6,1999,10210,"[Fantasy, Drama, Comedy, Romance]","[Italy, United States of America, United Kingdom]","[shakespear, fairi, lover, promis, luck]",5.380501,2.57397
3549,Bat*21,42,5,1988,14911,"[Action, Drama, War]",[United States of America],[basedonnovel],5.223288,2.501019
3551,The Beast of War,53,6,1988,15267,"[Drama, War]",[United States of America],[],5.327074,2.461984
3979,Hart's War,246,5,2002,10592,"[Drama, War]",[United States of America],"[blackpeopl, worldwarii, prisonersofwar, u.s.a...",5.156302,2.40181
6565,The Illusionist,1657,7,2006,1491,"[Fantasy, Drama, Thriller, Romance]","[Czech Republic, United States of America]","[magic, princess, love, rivalri, viennaaustria...",6.635717,2.375757
4561,Tears of the Sun,582,6,2003,9567,"[Action, Drama, War]",[United States of America],"[u.s.armi, presid]",5.677446,2.314684
3845,The Devil's Backbone,277,7,2001,1433,"[Fantasy, Drama, Horror, Thriller, Science Fic...","[Spain, Mexico]","[bomb, sex, fight, nuditi, war, supernatur, or...",5.928671,2.257898
5308,Gunga Din,38,6,1939,24965,"[Action, Adventure, Drama, War]",[United States of America],"[armi, upris]",5.305689,2.246081
7557,Dorian Gray,588,5,2009,23488,"[Fantasy, Drama, Thriller]",[United Kingdom],[],5.103997,2.166639


In [316]:
hybrid_wr(133, "Pan's Labyrinth", cosine_sim_country)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
553,Land and Freedom,53,7,1995,38884,"[Drama, History, War, Romance]","[France, Germany, Spain, United Kingdom, Italy]","[spain, solidar, fascism, revolut, communist, ...",5.435904,2.70992
4400,Talk to Her,273,7,2002,64,"[Drama, Romance]",[Spain],"[suicid, spain, rape, coma, bullfight, journal...",5.92261,2.543049
3551,The Beast of War,53,6,1988,15267,"[Drama, War]",[United States of America],[],5.327074,2.461984
5422,Viridiana,93,7,1961,4497,[Drama],"[Spain, Mexico]","[spain, rape, nun, faith, wine, devout, orphan...",5.554621,2.409892
7173,Che: Part One,209,6,2008,8881,"[Drama, History, War]","[Spain, France, United States of America]",[cheguevara],5.490335,2.40582
3979,Hart's War,246,5,2002,10592,"[Drama, War]",[United States of America],"[blackpeopl, worldwarii, prisonersofwar, u.s.a...",5.156302,2.40181
6565,The Illusionist,1657,7,2006,1491,"[Fantasy, Drama, Thriller, Romance]","[Czech Republic, United States of America]","[magic, princess, love, rivalri, viennaaustria...",6.635717,2.375757
7555,Green Zone,730,6,2010,22972,"[War, Action, Adventure, Drama, Thriller]","[Spain, United Kingdom, United States of Ameri...",[],5.718458,2.340651
4561,Tears of the Sun,582,6,2003,9567,"[Action, Drama, War]",[United States of America],"[u.s.armi, presid]",5.677446,2.314684
3845,The Devil's Backbone,277,7,2001,1433,"[Fantasy, Drama, Horror, Thriller, Science Fic...","[Spain, Mexico]","[bomb, sex, fight, nuditi, war, supernatur, or...",5.928671,2.257898


In [317]:
user_movie_rating.query('rating == 5.0 and title == "City of God"').head(5)

Unnamed: 0,movieId,id,title,userId,rating,timestamp
77408,6016,598.0,City of God,17,5.0,1127470498
77409,6016,598.0,City of God,23,5.0,1148670462
77410,6016,598.0,City of God,26,5.0,1352597728
77412,6016,598.0,City of God,56,5.0,1467002967
77414,6016,598.0,City of God,73,5.0,1255503493


In [318]:
user_movie_rating.query('rating == 1.0 and title == "City of God"').head(5)

Unnamed: 0,movieId,id,title,userId,rating,timestamp
77439,6016,598.0,City of God,355,1.0,1207512737


In [319]:
hybrid_wr(17, "City of God", new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
154,Kids,280,6,1995,9344,"[Drama, Crime]",[United States of America],"[puberti, firsttim]",5.541016,4.121999
1000,Once Upon a Time in America,1104,8,1984,311,"[Drama, Crime]","[United States of America, Italy]","[lifeanddeath, corrupt, streetgang, rape, sadi...",7.222552,3.938564
15,Casino,1343,7,1995,524,"[Drama, Crime]","[France, United States of America]","[poker, drugabus, 1970s, overdos, illegalprost...",6.571348,3.915265
2839,American Psycho,2128,7,2000,1359,"[Thriller, Drama, Crime]","[Canada, United States of America]","[basedonnovel, wallstreet, psychopath, childof...",6.702687,3.801185
6948,In Bruges,1414,7,2008,8321,"[Comedy, Drama, Crime]","[United Kingdom, United States of America]",[vietnames],6.587817,3.793614
7116,Gomorrah,228,6,2008,8882,"[Drama, Crime]",[Italy],"[moneydeliveri, gangwar, power, mafiaboss, gan...",5.504962,3.564977
1691,The Outsiders,293,6,1983,227,"[Crime, Drama]",[United States of America],"[streetgang, children'shom, comingofag, gang, ...",5.549223,3.559231
7128,"Angus, Thongs and Perfect Snogging",231,6,2008,12689,"[Comedy, Drama, Family, Romance]",[United Kingdom],"[adolesc, parentchildrelationship, becomingana...",5.507196,3.46387
6868,Persepolis,461,7,2007,2011,"[Animation, Drama]","[France, United States of America]","[civilwar, parentchildrelationship, 1970s, pub...",6.148922,3.45995
8050,The Iron Lady,540,6,2011,71688,"[History, Drama]","[France, United Kingdom]","[capit, primeminist, argentina, warpropaganda,...",5.663537,3.430207


In [320]:
hybrid_wr(17, "City of God", cosine_sim_country)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
6878,Elite Squad,519,7,2007,7347,"[Drama, Action, Crime]",[Brazil],"[slum, policebrut, brazilian, warondrug, riode...",6.200719,4.315954
154,Kids,280,6,1995,9344,"[Drama, Crime]",[United States of America],"[puberti, firsttim]",5.541016,4.121999
6974,City of Men,66,6,2007,7343,[Drama],[Brazil],"[slum, fathersonrelationship, streetgang, braz...",5.34457,4.016414
15,Casino,1343,7,1995,524,"[Drama, Crime]","[France, United States of America]","[poker, drugabus, 1970s, overdos, illegalprost...",6.571348,3.915265
2839,American Psycho,2128,7,2000,1359,"[Thriller, Drama, Crime]","[Canada, United States of America]","[basedonnovel, wallstreet, psychopath, childof...",6.702687,3.801185
7834,Elite Squad: The Enemy Within,477,7,2010,47931,"[Drama, Action, Crime]",[Brazil],"[brazilian, drugtraff, penitentiari, polic, ja...",6.16387,3.785188
1691,The Outsiders,293,6,1983,227,"[Crime, Drama]",[United States of America],"[streetgang, children'shom, comingofag, gang, ...",5.549223,3.559231
7128,"Angus, Thongs and Perfect Snogging",231,6,2008,12689,"[Comedy, Drama, Family, Romance]",[United Kingdom],"[adolesc, parentchildrelationship, becomingana...",5.507196,3.46387
6868,Persepolis,461,7,2007,2011,"[Animation, Drama]","[France, United States of America]","[civilwar, parentchildrelationship, 1970s, pub...",6.148922,3.45995
8050,The Iron Lady,540,6,2011,71688,"[History, Drama]","[France, United Kingdom]","[capit, primeminist, argentina, warpropaganda,...",5.663537,3.430207


In [321]:
hybrid_wr(355, "City of God", new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
15,Casino,1343,7,1995,524,"[Drama, Crime]","[France, United States of America]","[poker, drugabus, 1970s, overdos, illegalprost...",6.571348,4.113503
1000,Once Upon a Time in America,1104,8,1984,311,"[Drama, Crime]","[United States of America, Italy]","[lifeanddeath, corrupt, streetgang, rape, sadi...",7.222552,3.924078
154,Kids,280,6,1995,9344,"[Drama, Crime]",[United States of America],"[puberti, firsttim]",5.541016,3.884756
6948,In Bruges,1414,7,2008,8321,"[Comedy, Drama, Crime]","[United Kingdom, United States of America]",[vietnames],6.587817,3.739247
6868,Persepolis,461,7,2007,2011,"[Animation, Drama]","[France, United States of America]","[civilwar, parentchildrelationship, 1970s, pub...",6.148922,3.591435
7698,Wall Street: Money Never Sleeps,504,5,2010,33909,"[Drama, Crime]",[United States of America],[duringcreditssting],5.11331,3.516735
2839,American Psycho,2128,7,2000,1359,"[Thriller, Drama, Crime]","[Canada, United States of America]","[basedonnovel, wallstreet, psychopath, childof...",6.702687,3.422974
7128,"Angus, Thongs and Perfect Snogging",231,6,2008,12689,"[Comedy, Drama, Family, Romance]",[United Kingdom],"[adolesc, parentchildrelationship, becomingana...",5.507196,3.400694
7116,Gomorrah,228,6,2008,8882,"[Drama, Crime]",[Italy],"[moneydeliveri, gangwar, power, mafiaboss, gan...",5.504962,3.371279
1691,The Outsiders,293,6,1983,227,"[Crime, Drama]",[United States of America],"[streetgang, children'shom, comingofag, gang, ...",5.549223,3.254615


In [322]:
hybrid_wr(355, "City of God", cosine_sim_country)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
15,Casino,1343,7,1995,524,"[Drama, Crime]","[France, United States of America]","[poker, drugabus, 1970s, overdos, illegalprost...",6.571348,4.113503
154,Kids,280,6,1995,9344,"[Drama, Crime]",[United States of America],"[puberti, firsttim]",5.541016,3.884756
7834,Elite Squad: The Enemy Within,477,7,2010,47931,"[Drama, Action, Crime]",[Brazil],"[brazilian, drugtraff, penitentiari, polic, ja...",6.16387,3.86973
8804,The Way He Looks,262,8,2014,237791,"[Drama, Romance]",[Brazil],"[gay, brazil, friendship, comingofag, blind]",6.282019,3.701712
6878,Elite Squad,519,7,2007,7347,"[Drama, Action, Crime]",[Brazil],"[slum, policebrut, brazilian, warondrug, riode...",6.200719,3.67023
6868,Persepolis,461,7,2007,2011,"[Animation, Drama]","[France, United States of America]","[civilwar, parentchildrelationship, 1970s, pub...",6.148922,3.591435
6974,City of Men,66,6,2007,7343,[Drama],[Brazil],"[slum, fathersonrelationship, streetgang, braz...",5.34457,3.547765
7698,Wall Street: Money Never Sleeps,504,5,2010,33909,"[Drama, Crime]",[United States of America],[duringcreditssting],5.11331,3.516735
2839,American Psycho,2128,7,2000,1359,"[Thriller, Drama, Crime]","[Canada, United States of America]","[basedonnovel, wallstreet, psychopath, childof...",6.702687,3.422974
7128,"Angus, Thongs and Perfect Snogging",231,6,2008,12689,"[Comedy, Drama, Family, Romance]",[United Kingdom],"[adolesc, parentchildrelationship, becomingana...",5.507196,3.400694


In [323]:
user_movie_rating.query('title == "Born on the Fourth of July"')

Unnamed: 0,movieId,id,title,userId,rating,timestamp
67720,4008,2604.0,Born on the Fourth of July,23,4.0,1166036125
67721,4008,2604.0,Born on the Fourth of July,30,5.0,994457943
67722,4008,2604.0,Born on the Fourth of July,95,4.0,1016317157
67723,4008,2604.0,Born on the Fourth of July,202,4.0,1046145972
67724,4008,2604.0,Born on the Fourth of July,214,3.0,978128437
67725,4008,2604.0,Born on the Fourth of July,220,4.0,975086737
67726,4008,2604.0,Born on the Fourth of July,295,4.0,1100131025
67727,4008,2604.0,Born on the Fourth of July,299,3.5,1344186070
67728,4008,2604.0,Born on the Fourth of July,311,3.0,1080681055
67729,4008,2604.0,Born on the Fourth of July,350,4.0,1011046800


In [324]:
hybrid_wr(30, "Born on the Fourth of July", new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
2992,Coming Home,38,6,1978,31657,"[Drama, Romance, War]",[United States of America],"[wheelchair, vietnam, war, volunt, love, hospit]",5.305689,4.224086
1034,The Deer Hunter,943,7,1978,11778,"[Drama, War]","[United Kingdom, United States of America]","[vietnamveteran, pennsylvania, vietnam, parti,...",6.44683,4.198818
899,Platoon,1236,7,1986,792,"[Drama, War, Action]","[United Kingdom, United States of America]","[famousscor, hero, mine, vietnamwar, villag, g...",6.543883,4.070104
3174,Tigerland,155,6,2000,10687,"[Drama, War]","[Germany, United States of America]","[independentfilm, awol, peacesign]",5.443608,4.051862
3549,Bat*21,42,5,1988,14911,"[Action, Drama, War]",[United States of America],[basedonnovel],5.223288,3.855685
8835,Unbroken,1437,7,2014,227306,"[Drama, War]",[United States of America],"[worldwarii, prisonersofwar, biographi, sport,...",6.592884,3.718039
2400,Red Sorghum,36,7,1987,42006,"[Drama, War]",[China],[china],5.37933,3.673446
3304,Hope and Glory,44,6,1987,32054,"[Drama, War]","[United Kingdom, United States of America]","[londonengland, wartim]",5.314404,3.6449
9082,Beasts of No Nation,476,7,2015,283587,"[Drama, War]",[United States of America],"[africa, basedonnovel]",6.162951,3.635951
3551,The Beast of War,53,6,1988,15267,"[Drama, War]",[United States of America],[],5.327074,3.59612


In [325]:
hybrid_wr(30, "Born on the Fourth of July", cosine_sim_country)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
899,Platoon,1236,7,1986,792,"[Drama, War, Action]","[United Kingdom, United States of America]","[famousscor, hero, mine, vietnamwar, villag, g...",6.543883,4.070104
6789,Rescue Dawn,339,6,2006,9952,"[Adventure, Drama, War]",[United States of America],"[warcrim, violenc, pipesmok]",5.576048,3.94799
3799,The Big Red One,83,6,1980,16121,"[Drama, War]",[United States of America],"[africa, sicili, concentrationcamp, omahabeach...",5.366122,3.901622
3979,Hart's War,246,5,2002,10592,"[Drama, War]",[United States of America],"[blackpeopl, worldwarii, prisonersofwar, u.s.a...",5.156302,3.726647
8835,Unbroken,1437,7,2014,227306,"[Drama, War]",[United States of America],"[worldwarii, prisonersofwar, biographi, sport,...",6.592884,3.718039
8554,Lone Survivor,1549,7,2013,193756,"[Action, Drama, Thriller, War]",[United States of America],"[war, surviv, navys, militari, dangerousmiss]",6.615878,3.683362
9082,Beasts of No Nation,476,7,2015,283587,"[Drama, War]",[United States of America],"[africa, basedonnovel]",6.162951,3.635951
8015,War Horse,1016,7,2011,57212,"[Drama, War]",[United States of America],"[worldwari, hors, farmlif, execut, trap, alcoh...",6.474679,3.599995
3551,The Beast of War,53,6,1988,15267,"[Drama, War]",[United States of America],[],5.327074,3.59612
3030,Braddock: Missing in Action III,55,4,1988,27352,"[Action, War]",[United States of America],"[vietnamveteran, orphanag, rescuemiss]",5.104878,3.530617


In [326]:
hybrid_wr(605, "Born on the Fourth of July", new_cosine_sim)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
1034,The Deer Hunter,943,7,1978,11778,"[Drama, War]","[United Kingdom, United States of America]","[vietnamveteran, pennsylvania, vietnam, parti,...",6.44683,3.671427
899,Platoon,1236,7,1986,792,"[Drama, War, Action]","[United Kingdom, United States of America]","[famousscor, hero, mine, vietnamwar, villag, g...",6.543883,3.56874
9082,Beasts of No Nation,476,7,2015,283587,"[Drama, War]",[United States of America],"[africa, basedonnovel]",6.162951,3.170338
3551,The Beast of War,53,6,1988,15267,"[Drama, War]",[United States of America],[],5.327074,3.122821
2992,Coming Home,38,6,1978,31657,"[Drama, Romance, War]",[United States of America],"[wheelchair, vietnam, war, volunt, love, hospit]",5.305689,3.085845
7698,Wall Street: Money Never Sleeps,504,5,2010,33909,"[Drama, Crime]",[United States of America],[duringcreditssting],5.11331,3.066807
3174,Tigerland,155,6,2000,10687,"[Drama, War]","[Germany, United States of America]","[independentfilm, awol, peacesign]",5.443608,3.061042
4561,Tears of the Sun,582,6,2003,9567,"[Action, Drama, War]",[United States of America],"[u.s.armi, presid]",5.677446,3.049391
413,Heaven & Earth,51,6,1993,31642,"[History, War, Action, Drama]","[France, United States of America]","[vietnam, vietnamwar, biographi, dog]",5.324299,3.004159
3549,Bat*21,42,5,1988,14911,"[Action, Drama, War]",[United States of America],[basedonnovel],5.223288,2.977221


In [327]:
hybrid_wr(605, "Born on the Fourth of July", cosine_sim_country)

Unnamed: 0,title,vote_count,vote_average,year,id,genres,countries,keywords,wr,est
3946,M*A*S*H,238,6,1970,651,"[Comedy, Drama, War]",[United States of America],"[dailylif, americanwayoflif, militaryspoof, do...",5.512329,3.631315
899,Platoon,1236,7,1986,792,"[Drama, War, Action]","[United Kingdom, United States of America]","[famousscor, hero, mine, vietnamwar, villag, g...",6.543883,3.56874
3030,Braddock: Missing in Action III,55,4,1988,27352,"[Action, War]",[United States of America],"[vietnamveteran, orphanag, rescuemiss]",5.104878,3.252955
8015,War Horse,1016,7,2011,57212,"[Drama, War]",[United States of America],"[worldwari, hors, farmlif, execut, trap, alcoh...",6.474679,3.243166
3799,The Big Red One,83,6,1980,16121,"[Drama, War]",[United States of America],"[africa, sicili, concentrationcamp, omahabeach...",5.366122,3.197168
9082,Beasts of No Nation,476,7,2015,283587,"[Drama, War]",[United States of America],"[africa, basedonnovel]",6.162951,3.170338
3551,The Beast of War,53,6,1988,15267,"[Drama, War]",[United States of America],[],5.327074,3.122821
7698,Wall Street: Money Never Sleeps,504,5,2010,33909,"[Drama, Crime]",[United States of America],[duringcreditssting],5.11331,3.066807
4561,Tears of the Sun,582,6,2003,9567,"[Action, Drama, War]",[United States of America],"[u.s.armi, presid]",5.677446,3.049391
3979,Hart's War,246,5,2002,10592,"[Drama, War]",[United States of America],"[blackpeopl, worldwarii, prisonersofwar, u.s.a...",5.156302,2.997252


We can draw some conclusions by crossing two factors: whether the seed film is a US production, and whether the similarity matrix includes the country feature (`cosine_sim_country`) or not (`new_cosine_sim`):

* **US title, similarity with country:** The recommendations are almost exclusively US films, practically 100% of the list.
* **US title, similarity without country:** The recommendations are still mostly US films, but a few non-US titles appear.
* **Non-US title, similarity with country:** The recommendations lean towards non-US films.
* **Non-US title, similarity without country:** The recommendations are mostly US films, although non-US titles appear.

That is, in three of the four scenarios, US films are recommended more often than films of any other nationality, which reflects the cultural imperialism of Hollywood cinema.
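The comparison above can be made a little more concrete by counting how many recommended titles list the United States among their production countries. The snippet below is a minimal sketch, not part of the original pipeline: `us_share` is a hypothetical helper, and the two lists are the `countries` values copied by hand from the `hybrid_wr(30, "Born on the Fourth of July", ...)` outputs. Note that a co-production is counted as a US film here, which is an assumption that inflates the US share somewhat.

```python
US = "United States of America"


def us_share(countries_per_movie):
    """Fraction of recommended movies whose country list includes the US.

    Hypothetical helper: co-productions that list the US count as US films.
    """
    if not countries_per_movie:
        return 0.0
    hits = sum(1 for countries in countries_per_movie if US in countries)
    return hits / len(countries_per_movie)


# `countries` column for user 30, seed "Born on the Fourth of July",
# using new_cosine_sim (similarity without the country feature).
without_country = [
    [US],                    # Coming Home
    ["United Kingdom", US],  # The Deer Hunter
    ["United Kingdom", US],  # Platoon
    ["Germany", US],         # Tigerland
    [US],                    # Bat*21
    [US],                    # Unbroken
    ["China"],               # Red Sorghum
    ["United Kingdom", US],  # Hope and Glory
    [US],                    # Beasts of No Nation
    [US],                    # The Beast of War
]

# Same user and seed, using cosine_sim_country (similarity with country).
with_country = [["United Kingdom", US]] + [[US]] * 9  # Platoon + nine US-only titles

print(us_share(without_country))
print(us_share(with_country))
```

With a real result, the same helper could be applied to the `countries` column of the DataFrame returned by `hybrid_wr`, provided that column holds Python lists rather than their string representation.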

## Conclusion

In this notebook, I built four different recommendation engines based on different ideas and algorithms. They are as follows:

1. **Simple Recommender:** This system used TMDB's overall vote counts and vote averages to build charts of top movies, both in general and for a specific genre. The IMDB weighted rating formula was used to compute the score on which the movies were ultimately ranked.
2. **Content-Based Recommender:** I built two content-based engines: one that took the film's synopsis and tagline as input, and another that took metadata such as cast, crew, genre and keywords to make predictions. I also implemented a simple filter that gives preference to movies with more votes and higher ratings.
3. **Collaborative Filtering:** I used the Surprise library to build a collaborative filter based on singular value decomposition (SVD). The RMSE obtained was less than 1, and the engine provided estimated ratings for a given user and movie.
4. **Hybrid Engine:** I combined ideas from content-based and collaborative filtering to create an engine that suggests movies to a given user based on ratings estimated internally for that user.
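As a reminder of how the `wr` column shown in the tables above is obtained, here is a minimal sketch of the IMDB weighted rating formula, WR = v/(v+m) · R + m/(v+m) · C, where v is the movie's vote count, R its vote average, m the minimum number of votes required to enter the chart and C the mean vote average across all movies. The function name and the values of `m` and `C` in the example are illustrative assumptions, not the ones used earlier in the notebook.

```python
def weighted_rating(vote_count, vote_average, m, C):
    """IMDB-style weighted rating.

    vote_count   (v): number of votes the movie received
    vote_average (R): mean rating of the movie
    m: minimum votes required to be listed (assumed threshold)
    C: mean vote average over the whole catalogue (assumed value)
    """
    v, R = vote_count, vote_average
    return (v / (v + m)) * R + (m / (v + m)) * C


# Illustrative values only: a movie with as many votes as the threshold
# lands exactly halfway between its own average and the global mean.
print(weighted_rating(vote_count=100, vote_average=8.0, m=100, C=6.0))

# A movie with very few votes is pulled towards the global mean C.
print(weighted_rating(vote_count=5, vote_average=8.0, m=100, C=6.0))
```

The shrinkage towards C explains why, in the tables above, titles with few votes have `wr` values well below their `vote_average`.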