# Introduction

In this file, we take user input in the form of answers to questions about the kind of movies the user would like to watch. Once we have collected this information, we use a content-based recommender, built from the data generated in the 'get_data' file, to give real-time recommendations to the user.

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = 99

We start by reading the 'tags' and the 'principalComponents_df' dataframes from previously generated files. 

In [3]:
cwd = os.getcwd()

tags = pd.read_csv(os.path.join(cwd, "tags_with_document.csv"))
tags.drop(['Unnamed: 0'], axis=1, inplace=True)
principalComponents_df = pd.read_csv(os.path.join(cwd, "principal_Components.csv"))
principalComponents_df.drop(['Unnamed: 0'], axis=1, inplace=True)

Next, we create a content-based recommender. We take an input of movie indices that a user presumably likes. We then locate these movies in the principalComponents_df dataframe and calculate the cosine similarity for every other movie in the database with these movies. Once done, we shall compute the average cosine similarity for each movie and recommend the ones that have the highest score. 

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

def get_content_recommendations(movie_indices, tags, principalComponents_df, num = 5):
    
    if len(movie_indices) == 0:
        return "Sorry"
    else:
        principalComponents_movie_indices_df = principalComponents_df.loc[movie_indices]
        similarity_matrix_df = pd.DataFrame(cosine_similarity(principalComponents_movie_indices_df.values,principalComponents_df.values), index=movie_indices, columns = tags.index.tolist())
        similarity_matrix_df.drop(labels=movie_indices, axis=1, inplace=True)
        similarity_score_df = pd.DataFrame(similarity_matrix_df.mean(), columns=['average similarity score'])
        similarity_score_df.sort_values(by=['average similarity score'], ascending=False, inplace=True)
        
        return display(tags.loc[similarity_score_df.head(num).index.values.tolist()]['title'])
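
To see the scoring rule in isolation, here is a minimal sketch on toy two-dimensional vectors. It uses only the standard library rather than pandas and scikit-learn, and the names `cosine`, `rank_by_average_similarity` and `toy_features` are hypothetical, introduced only for illustration. The liked items are excluded from the candidates, and every remaining item is ranked by its mean cosine similarity to the liked items.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_average_similarity(features, liked, num=5):
    """Return the indices of the `num` candidates with the highest
    mean cosine similarity to the liked items (liked items excluded)."""
    scores = {}
    for idx, vec in enumerate(features):
        if idx in liked:
            continue  # never recommend something the user already picked
        scores[idx] = sum(cosine(features[j], vec) for j in liked) / len(liked)
    return sorted(scores, key=scores.get, reverse=True)[:num]

# Toy feature vectors (assumed data, standing in for principal components).
toy_features = [
    [1.0, 0.0],   # 0: liked
    [0.9, 0.1],   # 1: liked
    [0.8, 0.2],   # 2: points in nearly the same direction as the liked items
    [0.0, 1.0],   # 3: orthogonal to item 0
    [-1.0, 0.0],  # 4: opposite direction to item 0
]

top = rank_by_average_similarity(toy_features, liked=[0, 1], num=2)
print(top)  # [2, 3]
```

Item 2 ranks first because it is almost parallel to both liked vectors, while item 4 ranks last because it points the opposite way.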

We can test whether this recommendation scheme works well or not. To start with, we try to figure out the movie index corresponding to a given movie. For example, consider "Star Wars".

In [5]:
movie_title = "Star Wars"
tags[tags.title.str.contains(movie_title)]['title']

248                          Star Wars: Episode IV - A New Hope (1977)
1079             Star Wars: Episode V - The Empire Strikes Back (1980)
1092                 Star Wars: Episode VI - Return of the Jedi (1983)
2388                  Star Wars: Episode I - The Phantom Menace (1999)
5011               Star Wars: Episode II - Attack of the Clones (2002)
9517               Star Wars: Episode III - Revenge of the Sith (2005)
12064                                 Star Wars: The Clone Wars (2008)
14218    Empire of Dreams: The Story of the 'Star Wars' Trilogy (2004)
19766                             Star Wars: Threads of Destiny (2014)
23031                Star Wars: Episode VII - The Force Awakens (2015)
26845                             The Star Wars Holiday Special (1978)
31328               Plastic Galaxy: The Story of Star Wars Toys (2014)
36262                              Rogue One: A Star Wars Story (2016)
40428                                  Star Wars: The Last Jedi (2017)
41381 

Suppose we pick Star Wars: Episode IV and Star Wars: Episode V as movies that we like. Then, the content-based recommendations are given by:

In [6]:
get_content_recommendations([248, 1079], tags, principalComponents_df)

1092       Star Wars: Episode VI - Return of the Jedi (1983)
2388        Star Wars: Episode I - The Phantom Menace (1999)
5011     Star Wars: Episode II - Attack of the Clones (2002)
20515                               Jupiter Ascending (2015)
23993                                   Velocity Trap (1997)
Name: title, dtype: object

The top three recommendations are other Star Wars episodes, so the recommendation scheme appears to work reasonably well. Our next step is to construct a function that queries a user about the kind of movies they are currently interested in watching. Once the user's immediate preferences are collected, we can make a recommendation using our function above.

The idea behind this way of querying is to use hierarchical binary clustering. We start with the user's history and split it into two clusters. We then pick a movie at random from one cluster and ask whether the user is currently interested in a movie like the one we picked. If the user says "Yes", we limit ourselves to that cluster and repeat the exercise. If the user says "No", we move to the other cluster and do the same. We stop when no further clustering is possible.

This scheme is desirable because, provided the clusters are reasonably balanced, the number of questions that a user has to answer is roughly of the order $\log_2 N$, where $N$ is the size of the user's history. If $N = 200$, then $\log_2 200 \approx 7.6$, so the user only has to answer roughly 7 or 8 such questions. This scheme can easily be implemented in a Tinder-like user interface where each question is framed as a swipe.
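
To make the $\log_2 N$ estimate concrete, the sketch below counts questions under the simplifying assumption that every answer leaves about half of the candidates. `count_questions` is a hypothetical helper, not part of this notebook, and real clusters are rarely this balanced, so the actual count can differ.

```python
import math

def count_questions(n):
    """Count the questions asked when every answer halves the candidate set."""
    questions = 0
    while n > 1:
        questions += 1  # one question per binary split
        n //= 2         # assumption: about half of the movies remain
    if n == 1:
        questions += 1  # final question about the single remaining movie
    return questions

print(math.log2(200))        # about 7.64
print(count_questions(200))  # 8
```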

To perform the clustering, we use the AgglomerativeClustering class from the sklearn library. We thus write the following function:

In [11]:
from sklearn.cluster import AgglomerativeClustering 

def ask_yes_no(movie_index_list, tags, principalComponents_df):
    movies_liked_indices = []
    
    if len(movie_index_list) == 0:
        return movies_liked_indices

    elif len(movie_index_list) == 1:
        response = input("Do you like the movie \" {} \" ?".format(tags['title'].iloc[movie_index_list[0]]))
        
        if response == "Yes":
            movies_liked_indices.append(movie_index_list[0])
            
        return movies_liked_indices
    else:
        movie_index_principalComponents = np.take(principalComponents_df.values, movie_index_list, axis=0)
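        # Note: scikit-learn 1.2 renamed `affinity` to `metric` (and 1.4 removed `affinity`);
        # on newer versions, pass metric='cosine' instead.
        clt = AgglomerativeClustering(linkage='average', affinity='cosine', n_clusters=2)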
        model = clt.fit(movie_index_principalComponents) 
    
        movies_label_zero = []
        movies_label_one = []
        
        for i in range(len(model.labels_)):
            if model.labels_[i] == 0:
                movies_label_zero.append(movie_index_list[i])
            else:
                movies_label_one.append(movie_index_list[i])
          
        if len(movies_label_zero) != 0:
            random_movie_index = np.random.choice(movies_label_zero, replace=False)
            random_response = input("Do you like the movie \" {} \" ?".format(tags['title'].iloc[random_movie_index]))
            
            if random_response == "Yes":
                movies_label_zero.remove(random_movie_index)
                return ask_yes_no(movies_label_zero, tags, principalComponents_df) + [random_movie_index]
            else: 
                return ask_yes_no(movies_label_one, tags, principalComponents_df)
        else:
            random_movie_index = np.random.choice(movies_label_one, replace=False)
#             random_movie_index = movies_label_one[0] 
            random_response = input("Do you like the movie \" {} \" ?".format(tags['title'].iloc[random_movie_index]))
                        
            if random_response == "Yes":
                movies_label_one.remove(random_movie_index)
                return ask_yes_no(movies_label_one, tags, principalComponents_df) + [random_movie_index]
            else: 
                movies_label_one.remove(random_movie_index)
                return ask_yes_no(movies_label_one, tags, principalComponents_df)
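
Note that the function above treats only the exact string "Yes" as agreement, so a reply such as "yes" or "y" counts as a "No". One possible way to relax this is sketched below. `parse_yes_no` is a hypothetical helper that is not part of the notebook, and the set of accepted replies is an assumption.

```python
def parse_yes_no(response):
    """Return True for an affirmative reply, ignoring case and surrounding spaces."""
    # Assumption: only "yes" and "y" count as agreement; anything else is a "No".
    return response.strip().lower() in {"yes", "y"}

print(parse_yes_no(" yes "))  # True
print(parse_yes_no("No"))     # False
```

With such a helper, each `response == "Yes"` comparison could be replaced by `parse_yes_no(response)`.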

We can readily test whether our function works well. As a test case, we pick a user who has watched 13 movies: 4 from the Star Wars Universe and the remaining 9 from the Marvel Cinematic Universe. The specific movies are as follows:

In [8]:
user_history_index_sample = [248, 1079, 2388, 9517, 11808, 18376, 19854, 23047, 15469, 23037, 23046, 16133, 23034]
print(tags.loc[user_history_index_sample]['title'])

248                  Star Wars: Episode IV - A New Hope (1977)
1079     Star Wars: Episode V - The Empire Strikes Back (1980)
2388          Star Wars: Episode I - The Phantom Menace (1999)
9517       Star Wars: Episode III - Revenge of the Sith (2005)
11808                                          Iron Man (2008)
18376                                        Iron Man 3 (2013)
19854               Captain America: The Winter Soldier (2014)
23047                        Captain America: Civil War (2016)
15469                                              Thor (2011)
23037                                           Ant-Man (2015)
23046                         Guardians of the Galaxy 2 (2017)
16133                                     Avengers, The (2012)
23034                           Avengers: Age of Ultron (2015)
Name: title, dtype: object


Now we can watch the two functions in action. The function ask_yes_no queries the user about the kind of movies they want to watch, and the function get_content_recommendations gives the recommendations based on the user's responses.

In [12]:
indices = ask_yes_no(user_history_index_sample, tags, principalComponents_df)
get_content_recommendations(indices, tags, principalComponents_df)

Do you like the movie " Guardians of the Galaxy 2 (2017) " ?No
Do you like the movie " Thor (2011) " ?Yes
Do you like the movie " Avengers: Age of Ultron (2015) " ?Yes
Do you like the movie " Captain America: The Winter Soldier (2014) " ?Yes
Do you like the movie " Captain America: Civil War (2016) " ?Yes


9633              Fantastic Four (2005)
17151    Amazing Spider-Man, The (2012)
16133              Avengers, The (2012)
11904       Incredible Hulk, The (2008)
6107                        Hulk (2003)
Name: title, dtype: object