# Building your own recommender system

Today, we will start working on building our own recommender system. For this assignment, we will work with movie data.
Download the following datasets [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata):
- `tmdb_5000_credits.csv`
- `tmdb_5000_movies.csv`

Place the files in a folder in the current working directory, which you can call `data/`.


## Task 1. Explore and preprocess the data.

### a.  Explore the data

- As a first step, explore the datasets. Inspect what data you have at hand and decide which variables might be interesting and which are not. Make a selection of interesting columns.
  -  Keep in mind that ultimately you want to build knowledge-based and content-based recommender systems. Hence, look for columns that might be suitable to use later on.
- Combine (merge) both datasets. Can you identify a variable that can be used for matching?
Think about a good way to do this. Can you write a function that returns the merged data? An example is provided below.

- Check whether the data is ready to use, or whether you need to transform or pre-process it somehow.
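
One way to check a candidate matching variable before merging is to verify that it is unique in both tables and that every row finds a partner. The sketch below uses two tiny hand-made frames (hypothetical stand-ins for the credits and movies files, not the real data) together with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy stand-ins for the two CSV files (not the real data).
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "cast": ["[]", "[]", "[]"],
})
movies = pd.DataFrame({
    "id": [1, 2, 4],
    "title": ["A", "B", "D"],
    "budget": [10, 20, 40],
})

# A usable key should be unique in each table.
key_is_unique = credits["movie_id"].is_unique and movies["id"].is_unique

# An outer merge with indicator=True shows which rows found a partner;
# validate="one_to_one" raises if the key is duplicated on either side.
check = pd.merge(
    credits,
    movies.rename(columns={"id": "movie_id"}),
    on=["movie_id", "title"],
    how="outer",
    indicator=True,
    validate="one_to_one",
)
match_counts = check["_merge"].value_counts().to_dict()
print(key_is_unique)
print(match_counts)
```

Rows labelled `left_only` or `right_only` would be dropped silently by the default inner merge, so it is worth looking at them before deciding how to join.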

In [2]:
import pandas as pd
from ast import literal_eval
from sklearn.metrics.pairwise import cosine_similarity

#### Read the data

<br>
<div class="alert-block alert-warning">
For now, only keep movies that received frequent votes (note that this makes the dataset rather small, so computation is faster).
</div>



   

In [3]:
PATH = 'data/'

VOTE_COUNT = 2000  # Decrease this value to work with a larger dataset.

def get_data(path_to_data):
    # Load both TMDB files; the movies file calls the key 'id', the credits file 'movie_id'.
    data1 = pd.read_csv(f'{path_to_data}tmdb_5000_credits.csv')
    data2 = pd.read_csv(f'{path_to_data}tmdb_5000_movies.csv')
    data2 = data2.rename(columns={'id': 'movie_id'})

    # Inner join on the shared key columns.
    data = pd.merge(data1, data2, on=['movie_id', 'title'])
    data["original_title"] = data["original_title"].str.lower()

    # For now, only keep movies with frequent votes (smaller dataset, faster computation).
    data = data[data['vote_count'] > VOTE_COUNT]
    data = data.reset_index(drop=True)  # renumber rows 0..n-1 after filtering
    return data

In [4]:
data = get_data(PATH)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124



### b.  Pre-processing and feature engineering

As a first step, some data wrangling techniques are needed to get the data into the right shape.
- Think about relevant attributes of movies that you want to use later on when designing a recommender system.
- Can you convert `release_date` to a yearly-level variable (`release_year`)?
- Can you clean up the `genres` column?

In the next code block, we will convert `release_date` to a yearly-level variable, `release_year`. In addition, we will clean up the `genres` column.

In [5]:
data['release_year'] = pd.DatetimeIndex(data['release_date']).year

def get_genres(x):
    # x is a string holding a list of dicts; keep only the genre names, space-separated
    return " ".join(e['name'] for e in literal_eval(x))

data['genres'] = data['genres'].apply(get_genres)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012
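
`literal_eval` raises an error when a cell is missing or malformed. A minimal sketch of a more forgiving variant follows; the helper name `safe_names` and the toy values are hypothetical, and whether the real data contains such cells is something to check rather than assume.

```python
from ast import literal_eval

import pandas as pd


def safe_names(x):
    """Return space-separated 'name' values, or '' if x cannot be parsed."""
    if not isinstance(x, str):
        return ""  # e.g. NaN for a missing cell
    try:
        items = literal_eval(x)
    except (ValueError, SyntaxError):
        return ""  # malformed string
    return " ".join(e["name"] for e in items)


# Hypothetical toy column: one valid cell, one missing, one malformed.
toy = pd.Series([
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    float("nan"),
    "[{broken",
])
cleaned = toy.apply(safe_names)
print(cleaned.tolist())  # ['Action Adventure', '', '']
```

Returning an empty string keeps the column usable as text input later on, at the cost of hiding which rows failed; counting the empty results afterwards is a cheap sanity check.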


... <br>
Skipping a large part of CCS2 - for full code, see "build_a_recommender.ipynb" <br>
...

## Task 3. Create a content-based recommender system using cosine similarity

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

For this task, we go back to the dataset in its original format (that is, before exploding the data to a long format).

In [7]:
data = get_data(PATH)
data['release_year'] = pd.DatetimeIndex(data['release_date']).year
data['genres'] = data['genres'].apply(get_genres)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012


... <br>
Skipping a large part of CCS2 - for full code, see "build_a_recommender.ipynb" <br>
... 
### e. Putting the pieces together: Content-based recommender system using cosine similarity

In [8]:
def transform_data(data):

    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(data['combined_features'])
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    return cosine_sim

def recommender(movie_title, data, transformed_data):
    # Map each (lower-cased) title to its row position in the similarity matrix.
    indices = pd.Series(data.index, index=data['original_title'])
    index = indices[movie_title]

    # Pair every movie with its similarity to the requested one, most similar first.
    sim_scores = list(enumerate(transformed_data[index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # skip position 0 (normally the movie itself), keep the next 10

    movie_indices = [i[0] for i in sim_scores]

    # Collect id, title and genres of the recommended movies.
    recommendation = pd.DataFrame({
        'Id': data['movie_id'].iloc[movie_indices],
        'title': data['original_title'].iloc[movie_indices],
        'genres': data['genres'].iloc[movie_indices],
    })
    return recommendation

def results(insert_movie_name):
    insert_movie_name = insert_movie_name.lower()

    # Note: get_data() reloads the raw CSV files, so cleaning applied to the global
    # `data` (e.g. get_genres, get_name) is not reflected inside this function.
    data = get_data(PATH)
    data_with_combined_features = combine_features(data)
    transformed_data = transform_data(data_with_combined_features)

    if insert_movie_name not in data['original_title'].unique():
        return 'Movie not in Database'

    recommendations = recommender(insert_movie_name, data_with_combined_features, transformed_data)
    return recommendations.to_dict('records')
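
To see what `transform_data` produces, it can help to run the same two steps on a few hand-written strings. The toy texts below are hypothetical; only `TfidfVectorizer` and `cosine_similarity` are taken from the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy texts standing in for the 'combined_features' column.
texts = [
    "space alien planet war",
    "alien planet marine battle",
    "romantic comedy wedding",
]

tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(texts)      # one row per text
cosine_sim = cosine_similarity(tfidf_matrix)   # square matrix, shape (3, 3)

# Each text is identical to itself, so the diagonal is 1.
# Texts 0 and 1 share words; texts 0 and 2 share none.
print(cosine_sim.round(2))
```

This is also why `recommender` slices `sim_scores[1:11]`: after sorting, the top entry is normally the query movie itself.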


### Condition A

In [9]:
def combine_features(data): 
    data['combined_features'] = data[['genres', 'overview']].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return data
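
The `apply` call in `combine_features` joins the selected columns row by row and skips missing values. A small sketch on hypothetical toy rows shows the effect:

```python
import pandas as pd

# Hypothetical toy rows; the second one has no overview.
toy = pd.DataFrame({
    "genres": ["Action Adventure", "Comedy"],
    "overview": ["A marine is sent to an alien moon.", None],
})

combined = toy[["genres", "overview"]].apply(
    lambda row: ",".join(row.dropna().astype(str)), axis=1
)
print(combined.tolist())
# ['Action Adventure,A marine is sent to an alien moon.', 'Comedy']
```

Because the default `TfidfVectorizer` tokenizer splits on punctuation, joining with a comma does not glue the last genre word to the first word of the overview.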

In [10]:
result1 = results('Avatar')
result1

[{'Id': 603,
  'title': 'the matrix',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 68721,
  'title': 'iron man 3',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 1726,
  'title': 'iron man',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 18,
  'title': 'the fifth element',
  'genres': '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}, {"id": 53, "name": "Thriller"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 91314,
  'title': 'transformers: age of extinction',
  'genres': '[{"id": 878, "name": "Science Fiction"}, {"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 601,
  'title': 'e.t. the extra-terrestrial',
  'genres': '[{"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}, {"id": 10

### Condition B

In [11]:
data['keywords'][0]

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [12]:
def get_name(x):
    return " ".join( [e['name'] for e in literal_eval(x)] )

data['keywords'] = data['keywords'].apply(get_name)

data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,culture clash future space war space colony so...,en,avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,ocean drug abuse exotic island east india trad...,en,pirates of the caribbean: at world's end,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,spy based on novel secret agent sequel mi6 bri...,en,spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,dc comics crime fighter terrorist secret ident...,en,the dark knight rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,based on novel mars medallion space travel pri...,en,john carter,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012


In [13]:
def combine_features(data): 
    data['combined_features'] = data[['keywords', 'overview']].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return data

In [14]:
result2 = results('Avatar')
result2

[{'Id': 679,
  'title': 'aliens',
  'genres': '[{"id": 27, "name": "Horror"}, {"id": 28, "name": "Action"}, {"id": 53, "name": "Thriller"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 157336,
  'title': 'interstellar',
  'genres': '[{"id": 12, "name": "Adventure"}, {"id": 18, "name": "Drama"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 18,
  'title': 'the fifth element',
  'genres': '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}, {"id": 53, "name": "Thriller"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 348,
  'title': 'alien',
  'genres': '[{"id": 27, "name": "Horror"}, {"id": 28, "name": "Action"}, {"id": 53, "name": "Thriller"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 62,
  'title': '2001: a space odyssey',
  'genres': '[{"id": 878, "name": "Science Fiction"}, {"id": 9648, "name": "Mystery"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 95,
  'title': 'armageddon',
  'genres': '[{"id": 28, "name": "Acti
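
One simple way to compare the two conditions is to measure how many recommended movies they share. The sketch below uses only the six IDs per condition that are visible in the truncated outputs above; the full lists contain ten entries each, so the resulting number is illustrative only. The helper `jaccard` is a hypothetical name.

```python
def jaccard(a, b):
    """Share of items in either set that appear in both sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# First six IDs visible in the (truncated) outputs of each condition.
ids_a = [603, 68721, 1726, 18, 91314, 601]   # Condition A: genres + overview
ids_b = [679, 157336, 18, 348, 62, 95]       # Condition B: keywords + overview

shared = set(ids_a) & set(ids_b)
overlap = jaccard(ids_a, ids_b)
print(shared, round(overlap, 3))  # {18} 0.091
```

With the complete results, the ID lists could be built as `[r['Id'] for r in result1]` and `[r['Id'] for r in result2]`.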

### Randomly select 3 movies

In [25]:
import random

def random_movies(x):
    # Draw x distinct row positions; random.sample picks without replacement.
    random_indices = random.sample(range(len(data)), k=x)

    # Collect id, title and genres of the selected movies.
    recommendation = pd.DataFrame({
        'Id': data['movie_id'].iloc[random_indices],
        'title': data['original_title'].iloc[random_indices],
        'genres': data['genres'].iloc[random_indices],
    })
    return recommendation

In [29]:
random_movies(3)

Unnamed: 0,Id,title,genres
389,137,groundhog day,Romance Fantasy Drama Comedy
255,1271,300,Action Adventure War
11,58,pirates of the caribbean: dead man's chest,Adventure Fantasy Action
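
pandas also offers a built-in way to draw random rows. The following is a sketch using `DataFrame.sample` on a hypothetical toy frame (the column names mirror the ones used above; `random_state` is set only to make the draw repeatable):

```python
import pandas as pd

# Hypothetical toy frame with the three columns used above.
toy = pd.DataFrame({
    "movie_id": [137, 1271, 58, 603],
    "original_title": [
        "groundhog day",
        "300",
        "pirates of the caribbean: dead man's chest",
        "the matrix",
    ],
    "genres": [
        "Romance Fantasy Drama Comedy",
        "Action Adventure War",
        "Adventure Fantasy Action",
        "Action Science Fiction",
    ],
})

picked = (
    toy.sample(n=3, random_state=0)   # 3 distinct rows, drawn without replacement
       .rename(columns={"movie_id": "Id", "original_title": "title"})
       [["Id", "title", "genres"]]
)
print(picked)
```

Dropping `random_state` gives a different selection on every call, which is usually what a random baseline condition needs.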
