# Building your own recommender system

Today, we will start working on building our own recommender system. For this assignment, we will work with movie data.
Download the following datasets [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata):
- `tmdb_5000_credits.csv`
- `tmdb_5000_movies.csv`

Place the files in a folder in the current working directory, which you can call `data/`.


## Task 1. Explore and preprocess the data.

### a.  Explore the data

- As a first step, explore the datasets. Inspect what data you have at hand, what might be interesting variables and what not. Make a selection of interesting columns.
  -  Keep in mind that ultimately you want to build both a knowledge-based and a content-based recommender system. Hence, look for columns that might be suitable to use later on.
- Combine (merge) both datasets. Can you identify a variable that can be used for matching?
Think about a good way to do this. Can you write a function that will return the merged data? An example is provided below. 

- Check whether the data is ready to use, or whether you need to transform or pre-process it somehow.
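
One way to check a merge is to let `pandas` verify the key relationship for you. The sketch below uses two tiny made-up frames (not the TMDB files) and assumes that `movie_id` identifies a movie in both; `validate` and `indicator` are existing parameters of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values)
credits = pd.DataFrame({'movie_id': [1, 2, 3], 'cast': ['a', 'b', 'c']})
movies = pd.DataFrame({'id': [1, 2, 4], 'budget': [10, 20, 40]})

# Align the key name, then merge; validate raises an error if a key is duplicated
movies = movies.rename(columns={'id': 'movie_id'})
merged = pd.merge(credits, movies, on='movie_id', how='outer',
                  validate='one_to_one', indicator=True)

# The _merge column shows which rows were found in both frames
print(merged['_merge'].value_counts())
```

Rows marked `left_only` or `right_only` did not find a partner, which is worth inspecting before you switch to an inner merge.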

In [1]:
import pandas as pd
from ast import literal_eval
from sklearn.metrics.pairwise import cosine_similarity

#### Read the data

<br>
<div class="alert-block alert-warning">
For now, only keep movies that received frequent votes (note that this will make the dataset rather small; computation is therefore faster).
</div>



   

In [3]:
PATH = 'data/'

VOTE_COUNT = 2000 #If you want to work with a larger dataset, decrease this value.

def get_data(path_to_data):

    data1 = pd.read_csv(f'{path_to_data}tmdb_5000_credits.csv')
    data2 = pd.read_csv(f'{path_to_data}tmdb_5000_movies.csv')
    data2.rename(columns={'id': 'movie_id'}, inplace=True)

    data = pd.merge(data1,data2,  on=['movie_id', 'title'])
    data["original_title"] = data["original_title"].str.lower()

    data = data[data['vote_count'] > VOTE_COUNT] # for now, only keep movies with frequent votes (this will keep the dataset rather small and therefore computation is faster)
    data = data.reset_index(drop=True) # renumber the rows from 0 after filtering
    return data

In [4]:
data = get_data(PATH)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124



### b.  Pre-processing and feature engineering

As a first step, some data wrangling techniques are needed to get the data into the right shape.
- Think about relevant attributes of movies that you want to use later on when designing a recommender system.
- Can you derive a yearly-level variable (`release_year`) from `release_date`?
- Can you clean up the `genres` column?

In the next code block, we will derive a yearly-level variable, `release_year`, from `release_date`. In addition, we will clean up the `genres` column.

In [5]:
data['release_year'] = pd.DatetimeIndex(data['release_date']).year

def get_genres(x):
    return " ".join( [e['name'] for e in literal_eval(x)] )

data['genres'] = data['genres'].apply(get_genres)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012


Next, we will explode the `genres` column into a long format, so that each row holds a single genre. The knowledge-based recommender system in the next step filters on genre, so it needs each movie-genre combination as a separate observation:

In [6]:
s = data.apply(lambda x: pd.Series(x['genres'].split()),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'single_genre'
data = data.join(s)

data[['single_genre', 'title', 'vote_average', 'vote_count', 'release_year']].head(10)

Unnamed: 0,single_genre,title,vote_average,vote_count,release_year
0,Action,Avatar,7.2,11800,2009
0,Adventure,Avatar,7.2,11800,2009
0,Fantasy,Avatar,7.2,11800,2009
0,Science,Avatar,7.2,11800,2009
0,Fiction,Avatar,7.2,11800,2009
1,Adventure,Pirates of the Caribbean: At World's End,6.9,4500,2007
1,Fantasy,Pirates of the Caribbean: At World's End,6.9,4500,2007
1,Action,Pirates of the Caribbean: At World's End,6.9,4500,2007
2,Action,Spectre,6.3,4466,2015
2,Adventure,Spectre,6.3,4466,2015
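
Note that joining genre names with spaces and then splitting on whitespace breaks multi-word genres apart: `Science Fiction` ends up as the two rows `Science` and `Fiction` in the output above. A possible alternative, sketched here on a toy frame, keeps the names in a list and uses `DataFrame.explode`. The helper `get_genre_list` is a hypothetical name, not part of the assignment code.

```python
import pandas as pd
from ast import literal_eval

def get_genre_list(x):
    # Parse the JSON-like string and keep each genre name as one list item
    return [e['name'] for e in literal_eval(x)]

# Toy frame with one movie in the raw TMDB genre format
toy = pd.DataFrame({
    'title': ['Avatar'],
    'genres': ['[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'],
})

toy['single_genre'] = toy['genres'].apply(get_genre_list)
long_format = toy.explode('single_genre')  # one row per genre
print(long_format[['title', 'single_genre']])
```

If you adopt this, the genre names that users can choose from later on will contain spaces, so the input handling needs to accept them.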


## Task 2.   Create a knowledge-based recommender system

Below, you will find an example of a simple knowledge-based recommender system.

First, play around a bit with the `input()` function: 

In [None]:
print("Enter whatever:")
a_test = input()

In [None]:
print(a_test)

### Example of a simple knowledge-based recommender

In [7]:
def knowledge_based_recommender(data):

    data = data[data['single_genre'].notna()].copy() # copy, so the assignment below does not write to a slice of the original
    data['single_genre'] = data['single_genre'].str.lower()

    print(f"What type of genre do you like? \n\nYou can choose from the following:\n\n{set(data['single_genre'])}")
    genre = input().lower()

    print("What is the minimum release year of movies you are interested in? (e.g., how 'old' may a movie be?)" )
    release_year = int(input())

    movies = data[(data['single_genre'] == genre) &
    (data['release_year'] >= release_year) ]

    recommend_movies = movies.sort_values('vote_average', ascending=False)

    return recommend_movies[['title', 'vote_average', 'genres']].head(5)

### Run the knowledge-based recommender...

In [8]:
knowledge_based_recommender(data)

What type of genre do you like? 

You can choose from the following:

{'family', 'thriller', 'fiction', 'crime', 'animation', 'western', 'music', 'science', 'horror', 'comedy', 'action', 'war', 'romance', 'drama', 'mystery', 'history', 'fantasy', 'adventure'}


KeyboardInterrupt: Interrupted by user

<u>Question: can you adjust or improve this recommender?</u> 
<br>
<br>
<div class="alert-info">
Can you improve instructions for users, give hints for correct answers (such as the range of available values in the dataset, etc?)
</div>
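
As one possible starting point, the sketch below keeps asking until the answer matches one of the allowed options. The function `ask_choice` and its `read` parameter are hypothetical names introduced for illustration; `read` defaults to the built-in `input()` and only exists so the function can be tried without typing.

```python
def ask_choice(prompt, options, read=input):
    # Keep asking until the answer is one of the allowed options
    options = sorted(options)
    while True:
        answer = read(f"{prompt} Options: {', '.join(options)}\n").strip().lower()
        if answer in options:
            return answer
        print(f"'{answer}' is not available, please try again.")

# Example call inside the recommender (assumes a lower-cased 'single_genre' column):
# genre = ask_choice("What genre do you like?", set(data['single_genre']))
```

The same idea works for the release year: show the minimum and maximum of `data['release_year']` in the prompt and reject values outside that range.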


<u>Question: can you make this recommender more elaborate?</u> 
<br>
<br>
<div class="alert-info">
Can you make this recommender more elaborate, taking in different types of user-generated input?
</div>


<u>BONUS Question: can you improve the scoring algorithm?</u> 
<br>
<br>
<div class="alert-info">
We can further improve this algorithm by accounting for the fact that some movies have received only a few ratings. An improved scoring algorithm, of the kind typically employed by IMDB, is described <a href=https://www.datacamp.com/community/tutorials/recommender-systems-python>here</a>.
</div>
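
A commonly cited form of such a weighted rating is `(v / (v + m)) * R + (m / (v + m)) * C`, where `R` is a movie's average rating, `v` its number of votes, `m` a minimum-votes threshold and `C` the mean rating over all movies. The sketch below applies it to a made-up frame; the threshold `m = 5000` is an arbitrary assumption that you would tune on the real data.

```python
import pandas as pd

def weighted_rating(df, m, C):
    # v: number of votes, R: average rating of each movie
    v = df['vote_count']
    R = df['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C

# Made-up movies: 'A' has a high average but hardly any votes
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'vote_average': [9.0, 8.0, 5.0],
    'vote_count': [10, 10000, 5000],
})

C = toy['vote_average'].mean()  # mean rating over all movies
m = 5000                        # hypothetical vote threshold
toy['score'] = weighted_rating(toy, m, C)
ranked = toy.sort_values('score', ascending=False)
print(ranked[['title', 'score']])
```

Movies with few votes are pulled towards the overall mean, so a single enthusiastic rating no longer puts a movie at the top of the list.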




## Task 3. Create a content-based recommender system using cosine similarity

In [9]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

For this task, we go back to the dataset in the original format (hence, before exploding the data to a long format).

In [11]:
data = get_data(PATH)
data['release_year'] = pd.DatetimeIndex(data['release_date']).year
data['genres'] = data['genres'].apply(get_genres)
data.head()

Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012


### a. Create a combined feature column.
Create a combined feature column. More specifically, combine (textual) data of several columns, that can be used later on.

<u>Question: Create your own combined feature column</u> 
<br>
<br>
<div class="alert-info">
Can you adjust this, to make your own combined feature column? Are there different columns that might be interesting to consider here? Do you need to preprocess them?
</div>
<br>
<div class="alert-block alert-warning">
Make sure you are making an <b>informed decision</b> about merging these columns together. Why do you think this is a good idea when designing a recommender system?
</div>



If you want to 'glue' several columns with textual data together in `pandas`, you can do something like this:

```python
data['combined'] = data[['genres', 'overview']].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
```
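
Some columns, such as `keywords`, are stored as JSON-like strings and need cleaning before they can be glued to other text. The following sketch shows one way to do this on a toy row; `get_names` is a hypothetical helper, and joining with a space rather than a comma is just one option.

```python
import pandas as pd
from ast import literal_eval

def get_names(x):
    # Turn '[{"id": 470, "name": "spy"}, ...]' into 'spy ...'
    return " ".join(e['name'] for e in literal_eval(x))

# Toy row in the style of the TMDB data
toy = pd.DataFrame({
    'genres': ['Action Adventure'],
    'overview': [None],  # missing text is skipped by dropna()
    'keywords': ['[{"id": 470, "name": "spy"}, {"id": 818, "name": "based on novel"}]'],
})

toy['keywords'] = toy['keywords'].apply(get_names)
toy['combined'] = toy[['genres', 'overview', 'keywords']].apply(
    lambda x: ' '.join(x.dropna().astype(str)), axis=1)
print(toy['combined'].iloc[0])
```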




In [12]:
def combine_features(data): 
    data['combined_features'] = data[['original_title', 'genres', 'overview', 'tagline']].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return data

data = combine_features(data)
data.head()


Unnamed: 0,movie_id,title,cast,crew,budget,genres,homepage,keywords,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,release_year,combined_features
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,avatar,...,2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,2009,"avatar,Action Adventure Fantasy Science Fictio..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,pirates of the caribbean: at world's end,...,2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500,2007,"pirates of the caribbean: at world's end,Adven..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,spectre,...,2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466,2015,"spectre,Action Adventure Crime,A cryptic messa..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,the dark knight rises,...,2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106,2012,"the dark knight rises,Action Crime Drama Thril..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,john carter,...,2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124,2012,"john carter,Action Adventure Science Fiction,J..."


### b. Transform your data

Think about a strategy for transforming your combined data column, as designed in the previous step. More specifically, `fit_transform` the combined data column using `tfidf` or `count` vectorizer.


<u>Question: can you adjust settings of the vectorizer?</u> 
<br>
<br>
<div class="alert-info">
When initializing the vectorizer, think about some of the settings we've discussed in earlier weeks. Do you, for example, want to remove stopwords manually, or use pruning?
</div>
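
To get a feel for these settings, the sketch below fits a `TfidfVectorizer` on three made-up sentences. `stop_words` and `min_df` are existing parameters of the vectorizer; the values chosen here are only illustrative and far too aggressive for a small real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus
docs = [
    "a space adventure on a distant planet",
    "a romantic comedy in the city",
    "an adventure in the city of robots",
]

# stop_words drops common English words;
# min_df=2 prunes terms that occur in fewer than 2 documents
vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
matrix = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(matrix.shape)
```

Inspecting `vocabulary_` after fitting is a quick way to see which terms survive your pruning choices.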

In [79]:
data = get_data(PATH)
data['release_year'] = pd.DatetimeIndex(data['release_date']).year
data['genres'] = data['genres'].apply(get_genres)

def combine_features(data): 
    data['combined_features'] = data[['original_title', 'genres', 'overview', 'tagline']].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return data

data = combine_features(data)

QUERY = 'romance thriller Bike horse animal'.lower() # later, you can replace it with something like:
#print("Hello user! What type of genres or things do you like in general? Just insert some key words!")
#QUERY = input()

tfidf = TfidfVectorizer(stop_words='english')
r = data['combined_features'].values.tolist()
r.append(QUERY)

tfidf_matrix = tfidf.fit_transform(r)
cosine_sim = cosine_similarity(tfidf_matrix)
index = -1 # QUERY is now the last item in the list 'r', so its scores are in the last row
sim_scores = list(enumerate(cosine_sim[index])) 
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
movie_indices = [i[0] for i in sim_scores[1:10]]

data.iloc[movie_indices]['title']

329                            Warm Bodies
6                                  Tangled
205                                    Rio
352                                    Her
82     The Curious Case of Benjamin Button
354                Silver Linings Playbook
413                                   Juno
360                    Edward Scissorhands
399                           American Pie
Name: title, dtype: object

329                                  Warm Bodies
352                                          Her
82           The Curious Case of Benjamin Button
354                      Silver Linings Playbook
413                                         Juno
360                          Edward Scissorhands
399                                 American Pie
118    The Twilight Saga: Breaking Dawn - Part 2
390                          Slumdog Millionaire
Name: title, dtype: object

In [None]:
indices = pd.Series(data.index, index = data['original_title'])
index = -1

In [None]:
sim_scores = list(enumerate(cosine_sim[index])) 
sim_scores

### c. Calculate cosine similarity

Using the vectorized (sparse) matrix, calculate cosine similarity.
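
As a reminder of what this score means: cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for two made-up vectors and compares the result with `cosine_similarity` from `sklearn`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up document vectors (one row each)
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 1.0, 1.0]])

# Dot product divided by the product of the vector lengths
manual = (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))
from_sklearn = cosine_similarity(a, b)

print(manual[0, 0], from_sklearn[0, 0])
```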

In [55]:
cosine_sim = cosine_similarity(tfidf_matrix)

### d. Find the most similar movies

How can we, based on the matrix of cosine scores, find the most similar movie?

As an example, let's try to look up the movie that is most similar to `the dark knight rises`. This movie is at index 3 in our dataframe, as you can see by simply looking at the dataframe; in other words, we know the index because we looked it up manually. We can get the similarity scores between one entry and all other entries by printing the corresponding row of the matrix. In the cell below we print row `-1`, which holds the scores for `QUERY` (the last item we appended); use `cosine_sim[3]` to inspect `the dark knight rises` instead:

In [56]:
cosine_sim[-1]

array([0.        , 0.        , 0.        , 0.03315757, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.03561449, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.04956019, 0.        , 0.        , 0.        ,
       0.        , 0.02430372, 0.        , 0.03419494, 0.04335396,
       0.03330675, 0.        , 0.        , 0.        , 0.05681992,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.06133154, 0.03525206, 0.05863823, 0.04265735,
       0.        , 0.03500906, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.03642425, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.03828726, 0.06434099,
       0.        , 0.        , 0.        , 0.        , 0.03088697,
       0.        , 0.        , 0.        , 0.        , 0.04943328,
       0.        , 0.        , 0.06910266, 0.02805485, 0.     

There are, however, more systematic ways of retrieving an index value. 
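For example, we can create a list of indices and simply look our movie up (e.g., `indices['the dark knight rises']`). In the cell below we keep using `index = -1`, the position of the query: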

In [58]:
indices = pd.Series(data.index, index = data['original_title'])
index = -1
print(index)

indices

-1


original_title
avatar                                        0
pirates of the caribbean: at world's end      1
spectre                                       2
the dark knight rises                         3
john carter                                   4
                                           ... 
saw                                         428
reservoir dogs                              429
il buono, il brutto, il cattivo             430
taxi driver                                 431
12 angry men                                432
Length: 433, dtype: int64

Now that we have the index value, we need to sort the associated vector of cosine similarity scores to get the highest values.

In order to sort the cosine scores while keeping track of the entries to which these scores belong, we map the movie index to the cosine value. We can do so by simply enumerating the cosine scores:

In [59]:
sim_scores = list(enumerate(cosine_sim[index])) 
sim_scores

[(0, 0.0),
 (1, 0.0),
 (2, 0.0),
 (3, 0.03315756888826011),
 (4, 0.0),
 (5, 0.0),
 (6, 0.0),
 (7, 0.0),
 (8, 0.0),
 (9, 0.0),
 (10, 0.03561448946600001),
 (11, 0.0),
 (12, 0.0),
 (13, 0.0),
 (14, 0.0),
 (15, 0.0),
 (16, 0.04956018795258567),
 (17, 0.0),
 (18, 0.0),
 (19, 0.0),
 (20, 0.0),
 (21, 0.02430371581960795),
 (22, 0.0),
 (23, 0.03419494057326246),
 (24, 0.04335396069191257),
 (25, 0.03330674561737957),
 (26, 0.0),
 (27, 0.0),
 (28, 0.0),
 (29, 0.056819920882220876),
 (30, 0.0),
 (31, 0.0),
 (32, 0.0),
 (33, 0.0),
 (34, 0.0),
 (35, 0.0),
 (36, 0.061331535685228984),
 (37, 0.03525206078961171),
 (38, 0.058638229089765966),
 (39, 0.042657354558874656),
 (40, 0.0),
 (41, 0.035009063864343756),
 (42, 0.0),
 (43, 0.0),
 (44, 0.0),
 (45, 0.0),
 (46, 0.0),
 (47, 0.0),
 (48, 0.0),
 (49, 0.0),
 (50, 0.03642425407571395),
 (51, 0.0),
 (52, 0.0),
 (53, 0.0),
 (54, 0.0),
 (55, 0.0),
 (56, 0.0),
 (57, 0.0),
 (58, 0.038287263547207205),
 (59, 0.06434098689571639),
 (60, 0.0),
 (61, 0.0),
 (62

Now, we can sort this list on the basis of the cosine scores, while keeping track of the indices.

In [60]:
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores

[(433, 1.0000000000000002),
 (413, 0.16486949267731266),
 (161, 0.13858094088230746),
 (261, 0.1314753044697095),
 (204, 0.12729254349485455),
 (421, 0.12673325935526056),
 (338, 0.12379449958472115),
 (154, 0.11258354941742732),
 (226, 0.10373668648068857),
 (382, 0.09859237223911227),
 (370, 0.09633581864775007),
 (309, 0.09387121321892666),
 (293, 0.09003488402413566),
 (354, 0.08267259243342336),
 (144, 0.07810550314768919),
 (319, 0.0773586740075473),
 (197, 0.07497935162947626),
 (399, 0.07450999777197148),
 (104, 0.0742543297397583),
 (203, 0.073889913856295),
 (329, 0.07377283812716218),
 (277, 0.07273851522343731),
 (376, 0.06970843638935603),
 (237, 0.06952775730751076),
 (350, 0.06949245312660551),
 (209, 0.06946142624113992),
 (72, 0.0691026643544251),
 (239, 0.06880039576193905),
 (373, 0.06852834112832047),
 (420, 0.06781069383740596),
 (93, 0.065848564949443),
 (108, 0.0648867580484384),
 (59, 0.06434098689571639),
 (330, 0.06406141887001554),
 (212, 0.06361622095432197)

We are only interested in the most similar ones, so we keep the top 10:

In [61]:
sim_scores = sim_scores[1:11]
sim_scores

[(413, 0.16486949267731266),
 (161, 0.13858094088230746),
 (261, 0.1314753044697095),
 (204, 0.12729254349485455),
 (421, 0.12673325935526056),
 (338, 0.12379449958472115),
 (154, 0.11258354941742732),
 (226, 0.10373668648068857),
 (382, 0.09859237223911227),
 (370, 0.09633581864775007)]

To which movies do these scores belong? Let's map the index values back to the dataframe and find them.

In [62]:
movie_indices = [i[0] for i in sim_scores]
movie_indices

[413, 161, 261, 204, 421, 338, 154, 226, 382, 370]

Now we can simply look those movies up in our dataframe:

In [63]:
data.iloc[movie_indices]['title']

413                 Juno
161     Mr. & Mrs. Smith
261                  RED
204     Now You See Me 2
421    Shaun of the Dead
338        The Big Short
154           The Croods
226      Despicable Me 2
382           Home Alone
370            Neighbors
Name: title, dtype: object
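
The enumerate-and-sort steps above can also be written more compactly with `numpy`. The sketch below is an alternative, not the approach used in this notebook; the scores are made up, and `self_index` is a hypothetical name for the position of the item we compare against.

```python
import numpy as np

# Made-up similarity scores; index 3 is the item itself (score 1.0)
scores = np.array([0.10, 0.00, 0.35, 1.00, 0.20])
self_index = 3

order = np.argsort(scores)[::-1]      # indices from highest to lowest score
order = order[order != self_index]    # drop the item itself
top_k = order[:3]                     # keep the three most similar items
print(top_k)
```

Dropping the item by its index, rather than skipping the first sorted entry, also behaves sensibly when the item has a similarity of 0 with itself (for example, a query that contains only stop words).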

### e. Putting the pieces together: Content-based recommender system using cosine similarity

In [26]:
def transform_data(data):

    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(data['combined_features'])
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    return cosine_sim

def recommender(movie_title, data, transformed_data):

    indices = pd.Series(data.index, index = data['original_title'])
    index = indices[movie_title]

    sim_scores = list(enumerate(transformed_data[index]))

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]

    movie_indices = [i[0] for i in sim_scores]

    movie_id = data['movie_id'].iloc[movie_indices]

    movie_title = data['original_title'].iloc[movie_indices]
    movie_genres = data['genres'].iloc[movie_indices]

    recommendation = pd.DataFrame(columns=['Id','title', 'genres'])

    recommendation['Id'] = movie_id
    recommendation['title'] = movie_title
    recommendation['genres'] = movie_genres

    return recommendation

def results(insert_movie_name):
    insert_movie_name =insert_movie_name.lower()

    data = get_data(PATH)
    data_with_combined_features = combine_features(data)
    transformed_data = transform_data(data_with_combined_features)

    if insert_movie_name not in data['original_title'].unique():
        return 'Movie not in Database'

    else:
        recommendations = recommender(insert_movie_name, data_with_combined_features, transformed_data)
        return recommendations.to_dict('records')


Now, try it out. You can get recommendations by simply calling the `results()` function:

In [27]:
results('Avatar')

[{'Id': 603,
  'title': 'the matrix',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 68721,
  'title': 'iron man 3',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 1726,
  'title': 'iron man',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 601,
  'title': 'e.t. the extra-terrestrial',
  'genres': '[{"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}, {"id": 10751, "name": "Family"}, {"id": 14, "name": "Fantasy"}]'},
 {'Id': 18,
  'title': 'the fifth element',
  'genres': '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}, {"id": 53, "name": "Thriller"}, {"id": 878, "name": "Science Fiction"}]'},
 {'Id': 72190,
  'title': 'world war z',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}, {"id": 27

In [None]:
def transform_data(data, query):
    # Vectorize the movie texts together with the query, so both share one vocabulary
    tfidf = TfidfVectorizer(stop_words='english')

    d = data['combined_features'].values.tolist()
    d.append(query)  # the query becomes the last row of the matrix

    tfidf_matrix = tfidf.fit_transform(d)
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    return cosine_sim

def recommender(data, transformed_data):

    index = -1  # the query is the last row

    # drop the last entry of the row: the similarity of the query with itself
    sim_scores = list(enumerate(transformed_data[index][:-1]))

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[:10]

    movie_indices = [i[0] for i in sim_scores]

    recommendation = pd.DataFrame({
        'Id': data['movie_id'].iloc[movie_indices],
        'title': data['original_title'].iloc[movie_indices],
        'genres': data['genres'].iloc[movie_indices],
    })

    return recommendation

def results(query):
    query = query.lower()

    data = get_data(PATH)
    data['genres'] = data['genres'].apply(get_genres)  # clean up genres before combining
    data_with_combined_features = combine_features(data)
    transformed_data = transform_data(data_with_combined_features, query)

    recommendations = recommender(data_with_combined_features, transformed_data)
    return recommendations.to_dict('records')



<u>Question: can you improve this recommender?</u> 
<br>
<br>
<div class="alert-info">
Think about alternative ways to find relevant movies. Consider, for example, which features you want to include.
</div>

<u>Question: adjust the (amount of) information returned to the user</u> 
<br>
<br>
<div class="alert-info">
What information do you want to return to the user?
</div>
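
One possible starting point, sketched under the assumption that the `genres` column still holds the raw JSON strings shown in the output above: return only the genre names instead of the full list of dictionaries. The helper name `genre_names` is hypothetical and not part of the course code.

```python
import json

def genre_names(raw_genres):
    """Turn a raw TMDB genres string into a plain list of genre names."""
    return [genre["name"] for genre in json.loads(raw_genres)]

# Example string copied from the recommender output above
example = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}]'
print(genre_names(example))  # ['Action', 'Science Fiction', 'Adventure']
```

If the column has this format, something like `recommendation['genres'].apply(genre_names)` inside `recommender` would make the returned records easier to read.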

# BONUS:
## 4. Content-based recommender system using Soft Cosine Similarity

Next, we will introduce code that helps you build a content-based recommender system using soft cosine scores. This material is not required for this course, but merely provided here as example code in case you would like to use soft cosine in a recommender system. 

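By now, you may wonder whether we can improve our recommender system by using a "better" similarity measure, such as soft cosine similarity. Unlike regular cosine similarity, soft cosine similarity also gives credit to different words that are related in meaning. Let's try it out.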
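
To make the idea concrete, here is a minimal pure-Python sketch of the soft cosine measure on a toy vocabulary. The term-similarity values are made up for illustration; in the cells below, gensim derives them from word embeddings instead. The function name `soft_cosine` is hypothetical.

```python
import math

def soft_cosine(a, b, s):
    """Soft cosine between bag-of-words vectors a and b, given a term-similarity matrix s."""
    def inner(x, y):
        # Generalized inner product: every pair of terms contributes, weighted by s[i][j]
        return sum(s[i][j] * x[i] * y[j] for i in range(len(x)) for j in range(len(y)))
    return inner(a, b) / math.sqrt(inner(a, a) * inner(b, b))

# Toy vocabulary: ["wizard", "sorcerer", "car"]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]      # no related terms: this is regular cosine
related = [[1, 0.8, 0], [0.8, 1, 0], [0, 0, 1]]   # made-up similarity between "wizard" and "sorcerer"

doc_a = [1, 0, 0]  # mentions only "wizard"
doc_b = [0, 1, 0]  # mentions only "sorcerer"

print(soft_cosine(doc_a, doc_b, identity))  # 0.0
print(soft_cosine(doc_a, doc_b, related))   # 0.8
```

With the identity matrix the two documents share no words and score 0; once the two terms are treated as related, the same documents become similar.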

For the next steps, we largely rely on the code we've used for week 4 as well, so go back to the slides and exercises of that week if you are confused. 

In [34]:
import gensim
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import WordEmbeddingSimilarityIndex
print(gensim.__version__)

from gensim.corpora import Dictionary
import numpy as np

4.1.2


In [30]:
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')

In [40]:
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in data['combined_features']]) #initialize a Dictionary. This step assigns a token_id to each word
bag_of_words_vectors = [ dictionary.doc2bow(simple_preprocess(doc)) for doc in data['combined_features']]

In [35]:
## This step also takes quite a while.... 
similarity_index = WordEmbeddingSimilarityIndex(fasttext_model300)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary) # Build a term similarity matrix and compute the Soft Cosine Measure.

100%|███████████████████████████████████████| 5849/5849 [04:13<00:00, 23.09it/s]


Next, let's create a matrix of soft cosine similarity scores.

<div class="alert-danger">
Only run the following code on a dataset that is not too large, otherwise it will be very slow...
</div>


In [43]:
def create_soft_cossim_matrix(documents):
    # reference: https://www.machinelearningplus.com/nlp/cosine-similarity/
    # Entry [i][j] is the soft cosine similarity between documents i and j, rounded to 2 decimals
    n = len(documents)
    cossim_mat = pd.DataFrame([
        [round(similarity_matrix.inner_product(documents[i], documents[j], normalized=(True, True)), 2)
         for j in range(n)]
        for i in range(n)])
    return cossim_mat
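
Because the similarity between documents i and j equals the similarity between j and i, roughly half of these computations are redundant. The sketch below shows one way to exploit that symmetry. It is not the course's implementation: `pairwise_symmetric` is a hypothetical helper, and the Jaccard function is only a stand-in to keep the example runnable without gensim.

```python
def pairwise_symmetric(items, similarity):
    """Fill an n x n matrix while calling `similarity` only for pairs with i <= j."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = similarity(items[i], items[j])
            matrix[i][j] = score
            matrix[j][i] = score  # mirror the value instead of recomputing it
    return matrix

# Stand-in similarity (Jaccard overlap of word sets), used only for illustration
def jaccard(a, b):
    return len(a & b) / len(a | b)

docs = [{"space", "alien"}, {"space", "war"}, {"love", "war"}]
matrix = pairwise_symmetric(docs, jaccard)
print(matrix[0][1] == matrix[1][0])  # True
```

With the objects defined above, the stand-in could be swapped for `lambda a, b: similarity_matrix.inner_product(a, b, normalized=(True, True))`, applied to `bag_of_words_vectors`.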

In [44]:
transformed_data = create_soft_cossim_matrix(bag_of_words_vectors)

Just like we did with the cosine similarity scores, enumerate the similarity scores so that we can sort them while keeping track of their index (i.e., the movie they belong to).

In [46]:
sim_scores = list(enumerate(transformed_data[index].to_list()))

The remainder of the code is the same as we've also used for cosine similarity: sort the values, keep the top 10 most similar ones, and retrieve the associated titles... 

In [48]:
#sim_scores = list(enumerate(cosine_sim[index])) 
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores = sim_scores[1:11]
movie_indices = [i[0] for i in sim_scores]
data.iloc[movie_indices]['title']

95                                     Batman Begins
311                                           Batman
114                       How to Train Your Dragon 2
185    The Lord of the Rings: The Return of the King
17         The Hobbit: The Battle of the Five Armies
44                             Jack the Giant Slayer
141                           The Day After Tomorrow
149     Star Wars: Episode III - Revenge of the Sith
318                               Return of the Jedi
Name: title, dtype: object
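
The enumerate, sort, and slice pattern used above can be traced on a toy list. The scores below are made up, and only the top 2 are kept instead of the top 10.

```python
# Made-up similarity scores for one movie; position 0 is the movie itself
scores_for_one_movie = [1.0, 0.2, 0.7, 0.0, 0.5]

sim_scores = list(enumerate(scores_for_one_movie))                 # (position, score) pairs
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)  # highest score first
top = sim_scores[1:3]                                              # skip the first entry, keep the next two
movie_indices = [i[0] for i in top]
print(movie_indices)  # [2, 4]
```

Skipping the first entry assumes that the queried movie itself ranks first. This can fail when another movie ties at the maximum score, which is more likely once scores are rounded to two decimals.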

<u>Question: What do you think of the quality of these recommendations?</u> 
<br>
<br>
<div class="alert-info">
Create recommendations for different movies. How do these results compare to the recommendations based on regular cosine similarity? Which one do you think performs better?
</div>

### Putting the pieces together: building a content-based recommender system based on soft cosine similarity

In [49]:

def recommender(movie_title, data, transformed_data):

    indices = pd.Series(data.index, index = data['original_title'])
    index = indices[movie_title]

    sim_scores = list(enumerate(transformed_data[index].to_list()))

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]

    movie_indices = [i[0] for i in sim_scores]

    movie_id = data['movie_id'].iloc[movie_indices]

    movie_title = data['original_title'].iloc[movie_indices]
    movie_genres = data['genres'].iloc[movie_indices]

    recommendation = pd.DataFrame(columns=['Id','title', 'genres'])

    recommendation['Id'] = movie_id
    recommendation['title'] = movie_title
    recommendation['genres'] = movie_genres

    return recommendation

def results(insert_movie_name, transformed_data):
    insert_movie_name = insert_movie_name.lower()

    # Reload the data and rebuild the combined features used to look up titles
    data = combine_features(get_data(PATH))

    if insert_movie_name not in data['original_title'].unique():
        return 'Movie not in Database'

    else:
        recommendations = recommender(insert_movie_name, data, transformed_data)
        return recommendations.to_dict('records')


In [50]:
results('Saw', transformed_data)

[{'Id': 70160,
  'title': 'the hunger games',
  'genres': '[{"id": 878, "name": "Science Fiction"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}]'},
 {'Id': 81005,
  'title': 'jack the giant slayer',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 10751, "name": "Family"}, {"id": 14, "name": "Fantasy"}]'},
 {'Id': 82702,
  'title': 'how to train your dragon 2',
  'genres': '[{"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}, {"id": 10751, "name": "Family"}]'},
 {'Id': 98,
  'title': 'gladiator',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 203801,
  'title': 'the man from u.n.c.l.e.',
  'genres': '[{"id": 35, "name": "Comedy"}, {"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'},
 {'Id': 76341,
  'title': 'mad max: fury road',
  'genres': '[{"id": 28, "name": "Action"}, {"id": 

<u>Question: Can you improve this recommender?</u> 
<br>
<br>
<div class="alert-info">
Can you adjust this recommender? Can you for example think of alternative ways to select and combine features?
</div>