# Introduction

In this file, we collect user input in the form of answers to questions about the kind of movies they would like to watch. We then use a content-based recommender, built from the data generated in the 'get_data' file, to give real-time recommendations to the user.

In [2]:
import numpy as np
import pandas as pd
import os

In [3]:
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = 99

We start by reading the 'tags' and the 'principalComponents_df' dataframes from previously generated files. These files contain the movie mapping and embeddings respectively, generated using TF-IDF and PCA from the MovieLens user-generated tag data.

In [4]:
cwd = os.getcwd()

tags = pd.read_csv(os.path.join(cwd, "tags_with_document.csv"))
tags.drop(['Unnamed: 0'], axis=1, inplace=True)
principalComponents_df = pd.read_csv(os.path.join(cwd, "principal_Components.csv"))
principalComponents_df.drop(['Unnamed: 0'], axis=1, inplace=True)

We also read the 'movieid_df' and the 'bert_movie_embeddings_df' dataframes from previously generated files. These files contain the movie mappings and embeddings respectively, generated from the CMU Movie Summary corpus using a BERT-family model (the embeddings file name indicates roberta-large).

In [5]:
cwd = os.getcwd()

movieid_df = pd.read_csv(os.path.join(cwd, "movieId_bert.csv"))
movieid_df.drop(['Unnamed: 0'], axis=1, inplace=True)
bert_movie_embeddings_df = pd.read_csv(os.path.join(cwd, "movie_roberta-large_embeddings.csv"))
bert_movie_embeddings_df.drop(['Unnamed: 0'], axis=1, inplace=True)
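
The 'Unnamed: 0' column dropped above typically appears when a dataframe is saved together with its index and read back without declaring that index. As a small sketch of an alternative, assuming the first column of each CSV file really is the saved index, `pd.read_csv` can restore it directly through `index_col=0`. The in-memory CSV text below is made up for illustration and is not one of the notebook's files.

```python
import io
import pandas as pd

# Hypothetical CSV text, shaped like a file written by DataFrame.to_csv with its index.
csv_text = ",title\n0,Movie A\n1,Movie B\n"

# Reading it naively yields an extra 'Unnamed: 0' column ...
naive_df = pd.read_csv(io.StringIO(csv_text))
# ... whereas index_col=0 restores the saved index in one step.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive_df.columns.tolist())
print(indexed_df.columns.tolist())
```

With this option the separate `drop(['Unnamed: 0'], ...)` call would no longer be needed, provided the files were written the way assumed here.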

Next, we create a content-based recommender. We take as input the indices of movies that a user presumably likes. We then locate these movies in the embeddings dataframe ('principalComponents_df' or 'bert_movie_embeddings_df') and compute the cosine similarity between each of them and every other movie in the database. Finally, we average these similarities for each candidate movie and recommend the movies with the highest average score.

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

def get_content_recommendations(movie_indices, mapping_df, embeddings_df, num=10):
    """Return the titles of the `num` movies most similar, on average, to `movie_indices`."""
    if len(movie_indices) == 0:
        return "Sorry, please try again."

    # Embeddings of the movies the user likes.
    liked_embeddings = embeddings_df.loc[movie_indices]
    # Rows: liked movies; columns: every movie in the database.
    similarity_matrix_df = pd.DataFrame(
        cosine_similarity(liked_embeddings.values, embeddings_df.values),
        index=movie_indices,
        columns=mapping_df.index.tolist(),
    )
    # Do not recommend the movies the user already picked.
    similarity_matrix_df = similarity_matrix_df.drop(labels=movie_indices, axis=1)
    # Average over the liked movies and keep the highest-scoring candidates.
    average_similarity = similarity_matrix_df.mean().sort_values(ascending=False)
    # Returning the Series lets the notebook display it as the cell output.
    return mapping_df.loc[average_similarity.head(num).index, 'title']
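
To make the averaging step concrete, here is a minimal NumPy-only sketch on made-up 2-D embeddings. The array `toy_embeddings` and the helper `average_cosine_scores` are hypothetical names used only for illustration; they are not part of the notebook's data or functions.

```python
import numpy as np

def average_cosine_scores(embeddings, liked):
    """Average cosine similarity of every row to the rows listed in `liked`."""
    # Normalise rows so that a dot product equals the cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit[liked] @ unit.T   # shape: (len(liked), n_movies)
    scores = similarity.mean(axis=0)    # average over the liked movies
    scores[liked] = -np.inf             # never recommend the liked movies themselves
    return scores

# Hypothetical toy embeddings: rows 0-2 point one way, rows 3-4 another.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.1, 0.9],
])

scores = average_cosine_scores(toy_embeddings, liked=[0, 1])
ranking = np.argsort(-scores)[:3]       # indices of the three best candidates
print(ranking.tolist())                 # prints [2, 4, 3]
```

Row 2 ranks first because it points in nearly the same direction as the two liked rows, which is the behaviour the recommender relies on at full scale.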

We can test whether this recommendation scheme works well or not. To start with, we try to figure out the movie index corresponding to a given movie. For example, consider "Star Wars". We first start with the MovieLens data.

In [36]:
movie_title = "Star Wars"
tags[tags.title.str.contains(movie_title)][['title']]

index,title
248,Star Wars: Episode IV - A New Hope (1977)
1079,Star Wars: Episode V - The Empire Strikes Back (1980)
1092,Star Wars: Episode VI - Return of the Jedi (1983)
2388,Star Wars: Episode I - The Phantom Menace (1999)
5011,Star Wars: Episode II - Attack of the Clones (2002)
9517,Star Wars: Episode III - Revenge of the Sith (2005)
12064,Star Wars: The Clone Wars (2008)
14218,Empire of Dreams: The Story of the 'Star Wars' Trilogy (2004)
19766,Star Wars: Threads of Destiny (2014)
23031,Star Wars: Episode VII - The Force Awakens (2015)


Similarly, the movie indices with respect to the CMU corpus are given by:

In [11]:
movie_title = "Star Wars"
movieid_df[movieid_df.title.str.contains(movie_title)][['title']]

index,title
1468,Star Wars: The Clone Wars (2008-08-10)
1843,Star Wars Episode II: Attack of the Clones (2002-05-16)
2754,Saving Star Wars (2004-06-09)
4781,Star Wars Episode VI: Return of the Jedi (1983-05-25)
9692,The Making of Star Wars (1977-09-16)
10270,LEGO Star Wars: Revenge of the Brick (2005-05-08)
15224,Star Wars Episode I: The Phantom Menace (1999-05-19)
16127,Lego Star Wars: Bombad Bounty (2010-11-27)
17002,Star Wars Episode V: The Empire Strikes Back (1980-05-21)
17070,The Star Wars Holiday Special (1978-11-17)


Suppose we pick Star Wars: Episode IV and Star Wars: Episode V as movies that we like. Then the content-based recommendations obtained from the MovieLens data with TF-IDF and PCA are given by:

In [9]:
get_content_recommendations([248, 1079], tags, principalComponents_df)

1092                        Star Wars: Episode VI - Return of the Jedi (1983)
2388                         Star Wars: Episode I - The Phantom Menace (1999)
5011                      Star Wars: Episode II - Attack of the Clones (2002)
20515                                                Jupiter Ascending (2015)
23993                                                    Velocity Trap (1997)
33572                                  Buck Rogers in the 25th Century (1979)
9517                      Star Wars: Episode III - Revenge of the Sith (2005)
15139    Message from Space (Uchu Kara no Messeji) (Return to Jelucia) (1978)
28652                                         In the Dust of the Stars (1976)
43225                                       Polish Legends: Twardowsky (2015)
Name: title, dtype: object

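Similarly, the content-based recommendations obtained from the CMU corpus with BERT are given below. Here index 24522 is Star Wars Episode IV: A New Hope and index 17002 is Star Wars Episode V: The Empire Strikes Back.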

In [13]:
get_content_recommendations([24522, 17002], movieid_df, bert_movie_embeddings_df)

4781       Star Wars Episode VI: Return of the Jedi (1983-05-25)
10270          LEGO Star Wars: Revenge of the Brick (2005-05-08)
18861    Star Wars Episode III: Revenge of the Sith (2005-05-15)
12909                   The Transformers: The Movie (1986-08-08)
20202                Ultraman: The Adventure Begins (1987-10-12)
19795                                     Starcrash (1978-12-21)
15224       Star Wars Episode I: The Phantom Menace (1999-05-19)
16365                                      Heart of Steel (2006)
10772                              Batman: Dead End (2003-07-19)
20337              He-Man and She-Ra: A Christmas Special (1985)
Name: title, dtype: object

Looking at these recommendations, the scheme works reasonably well with both data sources. Our next step is to construct a function that queries a user about the kind of movies they are currently interested in watching. Once we have collected the user's immediate preferences, we can make a recommendation using the function above.

The querying scheme is based on hierarchical binary clustering. We start with the user's history and split it into two clusters. We then pick a movie at random from one cluster and ask whether the user is currently interested in a movie like it. If the user says "Yes", we restrict ourselves to that cluster and repeat the exercise. If not, we move to the other cluster and do the same. We stop when no further clustering is possible.

This scheme is desirable because the number of questions a user has to answer is roughly of the order $\log_2 N$, where $N$ is the size of the user's history. If $N = 200$, then $\log_2 200 \approx 7.6$, so the user only has to answer roughly 7 or 8 such questions. The scheme can easily be implemented in a Tinder-like user interface where each question is framed as a swipe.
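
As a rough check of the question count quoted above, the following sketch evaluates $\lceil \log_2 N \rceil$ for a few history sizes. The helper name `max_questions` is hypothetical, and the estimate assumes every split halves the remaining history, which real clusters need not do.

```python
import math

def max_questions(history_size):
    """Rough question count, assuming each binary split halves the history."""
    if history_size <= 1:
        return history_size  # zero movies: no question; one movie: one question
    return math.ceil(math.log2(history_size))

for n in (9, 50, 200):
    print(n, max_questions(n))
# prints:
# 9 4
# 50 6
# 200 8
```

Unbalanced clusters can push the actual number of questions above or below this estimate.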

In order to perform clustering, we use the AgglomerativeClustering class from the sklearn library. We thus write the following function:

In [23]:
from sklearn.cluster import AgglomerativeClustering

def ask_yes_no(movie_index_list, mapping_df, embeddings_df):
    movies_liked_indices = []
    
    if len(movie_index_list) == 0:
        return movies_liked_indices

    elif len(movie_index_list) == 1:
        response = input("Do you like the movie \" {} \" ?".format(mapping_df['title'].iloc[movie_index_list[0]]))
        
        if response == "Yes":
            movies_liked_indices.append(movie_index_list[0])
            
        return movies_liked_indices
    else:
        movie_index_embeddings = np.take(embeddings_df.values, movie_index_list, axis=0)
        # Note: scikit-learn 1.2 and later name this parameter `metric`; `affinity` was removed in 1.4.
        clt = AgglomerativeClustering(linkage='average', affinity='cosine', n_clusters=2)
        model = clt.fit(movie_index_embeddings) 
    
        movies_label_zero = []
        movies_label_one = []
        
        for i in range(len(model.labels_)):
            if model.labels_[i] == 0:
                movies_label_zero.append(movie_index_list[i])
            else:
                movies_label_one.append(movie_index_list[i])
          
        if len(movies_label_zero) != 0:
            random_movie_index = np.random.choice(movies_label_zero, replace=False)
            random_response = input("Do you like the movie \" {} \" ?".format(mapping_df['title'].iloc[random_movie_index]))
            
            if random_response == "Yes":
                movies_label_zero.remove(random_movie_index)
                return ask_yes_no(movies_label_zero, mapping_df, embeddings_df) + [random_movie_index]
            else: 
                return ask_yes_no(movies_label_one, mapping_df, embeddings_df)
        else:
            random_movie_index = np.random.choice(movies_label_one, replace=False)
#             random_movie_index = movies_label_one[0] 
            random_response = input("Do you like the movie \" {} \" ?".format(mapping_df['title'].iloc[random_movie_index]))
                        
            if random_response == "Yes":
                movies_label_one.remove(random_movie_index)
                return ask_yes_no(movies_label_one, mapping_df, embeddings_df) + [random_movie_index]
            else: 
                movies_label_one.remove(random_movie_index)
                return ask_yes_no(movies_label_one, mapping_df, embeddings_df)
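
The binary split at the heart of the function can be tried in isolation on a few made-up vectors. The sketch below is only an illustration: the helper `make_binary_clusterer` and the array `toy` are hypothetical, and the helper inspects the installed scikit-learn to decide whether the distance argument is called `metric` (newer releases) or `affinity` (older ones).

```python
import inspect
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def make_binary_clusterer():
    """Build a two-cluster, average-linkage, cosine-distance clusterer."""
    params = inspect.signature(AgglomerativeClustering.__init__).parameters
    # Assumption: when `metric` is accepted, it is the preferred keyword.
    distance_kwarg = "metric" if "metric" in params else "affinity"
    return AgglomerativeClustering(
        n_clusters=2, linkage="average", **{distance_kwarg: "cosine"}
    )

# Hypothetical embeddings: rows 0-1 point one way, rows 2-3 another.
toy = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
labels = make_binary_clusterer().fit(toy).labels_
print(labels)
```

Which group receives label 0 is arbitrary, so only the grouping matters: rows 0 and 1 share one label and rows 2 and 3 share the other.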

We can readily test whether our function works well. We start with the 'tags' dataset. As a test case, we pick a user who has watched nine movies: four from the Star Wars universe and the remaining five from the Marvel Cinematic Universe. The specific movies are as follows:

In [28]:
user_history_index_sample_tfidf = [248, 1079, 2388, 9517, 11808, 13962, 15830, 15469, 16133]
tags.loc[user_history_index_sample_tfidf]['title']

248                  Star Wars: Episode IV - A New Hope (1977)
1079     Star Wars: Episode V - The Empire Strikes Back (1980)
2388          Star Wars: Episode I - The Phantom Menace (1999)
9517       Star Wars: Episode III - Revenge of the Sith (2005)
11808                                          Iron Man (2008)
13962                                        Iron Man 2 (2010)
15830                Captain America: The First Avenger (2011)
15469                                              Thor (2011)
16133                                     Avengers, The (2012)
Name: title, dtype: object

Now we can watch the two functions in action. The function ask_yes_no queries the user about the type of movies they want to watch, and get_content_recommendations gives recommendations based on the user's responses.

In [34]:
indices_tfidf = ask_yes_no(user_history_index_sample_tfidf, tags, principalComponents_df)
get_content_recommendations(indices_tfidf, tags, principalComponents_df)

Do you like the movie " Captain America: The First Avenger (2011) " ?No
Do you like the movie " Star Wars: Episode V - The Empire Strikes Back (1980) " ?Yes
Do you like the movie " Star Wars: Episode I - The Phantom Menace (1999) " ?Yes


1092       Star Wars: Episode VI - Return of the Jedi (1983)
5011     Star Wars: Episode II - Attack of the Clones (2002)
248                Star Wars: Episode IV - A New Hope (1977)
9517     Star Wars: Episode III - Revenge of the Sith (2005)
20515                               Jupiter Ascending (2015)
42867                         Solo: A Star Wars Story (2018)
33572                 Buck Rogers in the 25th Century (1979)
28652                        In the Dust of the Stars (1976)
23993                                   Velocity Trap (1997)
23063                                     Scorpio One (1998)
Name: title, dtype: object

We consider the same user for the CMU corpus dataset as well. Once again, we have:

In [32]:
user_history_index_sample_bert = [24522, 17002, 15224, 18861, 14213, 743, 12984, 8478, 8642]
movieid_df['title'].loc[user_history_index_sample_bert]

24522                Star Wars Episode IV: A New Hope (1977-05-25)
17002    Star Wars Episode V: The Empire Strikes Back (1980-05-21)
15224         Star Wars Episode I: The Phantom Menace (1999-05-19)
18861      Star Wars Episode III: Revenge of the Sith (2005-05-15)
14213                                        Iron Man (2008-04-14)
743                                        Iron Man 2 (2010-04-26)
12984              Captain America: The First Avenger (2011-07-22)
8478                                             Thor (2011-05-06)
8642                                     The Avengers (2012-04-11)
Name: title, dtype: object

In [35]:
indices_bert = ask_yes_no(user_history_index_sample_bert, movieid_df, bert_movie_embeddings_df)
get_content_recommendations(indices_bert, movieid_df, bert_movie_embeddings_df)

Do you like the movie " Iron Man (2008-04-14) " ?Yes
Do you like the movie " Captain America: The First Avenger (2011-07-22) " ?Yes
Do you like the movie " The Avengers (2012-04-11) " ?No
Do you like the movie " Thor (2011-05-06) " ?Yes


10630                                                       Northern Pursuit (1943-11-13)
8642                                                            The Avengers (2012-04-11)
4546                                                                      X2 (2003-04-24)
11023                                                              Max Manus (2008-12-19)
18703                                                             Space Raiders (1983-07)
22242                                                             Undersea Kingdom (1936)
10422                                                         Batman Forever (1995-06-09)
14956                                                       S.S. Doomtrooper (2006-04-01)
6482     The Adventures of Young Van Helsing: The Quest for the Lost Scepter (2004-04-20)
9919                                               King of the Royal Mounted (1940-09-20)
Name: title, dtype: object

From the above, the tags-tfidf-PCA method seems to produce "better" recommendations than the plot-bert method. One possible reason is that the plot-bert method uses only details of a movie's plot, rather than the more salient features that users pick up on when generating tags. The plot-bert method could be more useful to experimental viewers who are interested in exploring new movies. More work is needed to leverage the metadata available in the CMU dataset. Pruning the dataset to remove older movies might also be worth considering.