# DSC 550 Assignment 7.2
## Chase Lemons

In this exercise, you will create a command line interactive movie recommender system. Your program will ask the user to rate ten movies and recommend other movies they might like by finding users with similar tastes in movies and recommending movies they rated highly. This technique is often called collaborative filtering.

You can find the data for this exercise in the movielens folder. You will need the movies.csv and ratings.csv files to complete this assignment.

In [2]:
import pandas as pd
import numpy as np

In [20]:
import sys
!{sys.executable} -m pip install ipywidgets



In [21]:
from ipywidgets import widgets

In [13]:
# Need to read in the data to use. 

movies = pd.read_csv("movies.csv", sep = ',')
ratings = pd.read_csv("ratings.csv", sep = ',')

print(ratings.head())
print(movies.head())

   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


###### 1. Recommendation Engine

The first step in the process is creating a matrix of user ratings for each movie. If there are M users and N movies, the user rating matrix will be M x N, with one row per user and one column per movie.

The recommendation engine should take as input user ratings for one or more movies and output an ordered list of movie recommendations. Test your recommendation engine by selecting and rating ten movies from the top 200 most rated movies. The following is an example of test input.

To determine which movies to recommend, first compute the cosine similarity between each existing user’s ratings and the input ratings. You can compute the cosine distance between two vectors using scipy or scikit-learn; the cosine similarity is one minus the cosine distance. Next, for each movie, compute a weighted average of all users’ ratings using the cosine similarity as the weight. In other words, users whose ratings are similar to the input ratings are weighted more heavily than users whose ratings are less similar.

In mathematical terms, if there are N movies and M users, the user rating matrix is an M x N matrix. The user input is a vector of length N. One row of the user rating matrix represents the movie ratings for one user. If you compute the cosine similarity between that row and the user’s input, you have a measure of how close that user’s ratings are to the input ratings. If the other user rated everything the opposite way, the cosine similarity is -1. If both rate everything the same, the cosine similarity is 1.
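As a quick check of the two extremes described above, here is a minimal pure-Python sketch of cosine similarity. The function name `cosine_sim` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration, not part of the assignment.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a vector with no ratings at all is treated as "no similarity".
        return 0.0
    return dot / (norm_u * norm_v)

# Identical vectors point in the same direction.
same = cosine_sim([2, -1, 1], [2, -1, 1])
# Negating every component points in the opposite direction.
opposite = cosine_sim([2, -1, 1], [-2, 1, -1])
print(same, opposite)
```

Note that raw ratings from 0 to 5 are never negative, so their cosine similarity stays between 0 and 1. Negative values only appear if the ratings are centered first, for example by subtracting each user's mean rating.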

The following Python code demonstrates how to calculate the movie recommendations without using Numpy’s matrix math operations. It uses Python list comprehensions and for loops for the purpose of clarity. The output of this function is a list of length N (the number of unique movies) with a suggested rating for each movie.

This function does not take into account movies the user has already rated, so you should make sure you remove those movies from your recommendations. Additionally, an optimal solution would use Numpy’s matrix mathematics, rather than Python lists.
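The sketch below is one possible list-based version of the approach described above; it is not the assignment's original code. The names `cosine_sim`, `recommend`, and `top_unrated` are hypothetical, a rating of 0 is assumed to mean "not rated", and dividing by the sum of the absolute weights is an assumed normalization.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(user_matrix, user_input):
    """Suggested rating per movie: similarity-weighted average over existing users."""
    weights = [cosine_sim(row, user_input) for row in user_matrix]
    total = sum(abs(w) for w in weights)
    n_movies = len(user_input)
    if total == 0:
        return [0.0] * n_movies
    return [
        sum(w * row[j] for w, row in zip(weights, user_matrix)) / total
        for j in range(n_movies)
    ]

def top_unrated(scores, user_input, titles, k=5):
    """Drop movies the user already rated (rating != 0) and return the k best titles."""
    unrated = [(s, t) for s, t, r in zip(scores, titles, user_input) if r == 0]
    unrated.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in unrated[:k]]

# Toy data: three existing users (rows) and four movies (columns).
titles = ["A", "B", "C", "D"]
user_matrix = [
    [5, 4, 0, 5],
    [1, 0, 5, 1],
    [4, 5, 1, 0],
]
user_input = [5, 4, 0, 0]  # the new user rated only A and B

scores = recommend(user_matrix, user_input)
print(top_unrated(scores, user_input, titles))
```

The same computation can be written with NumPy as a single matrix-vector product of the weights with the rating matrix, which is the direction the assignment suggests for an optimal solution.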

In [86]:
# Merge the ratings and movies tables, then keep only the 200 most-rated movies.

combined = pd.merge(ratings, movies, on='movieId', how='left')

combined_count = pd.DataFrame(combined.groupby('title')['rating'].count().sort_values(ascending=False).head(200)) 

combined = pd.merge(combined_count, combined, on='title', how='left')
combined.head()


                 title  rating_x  userId  movieId  rating_y   timestamp                    genres
0  Forrest Gump (1994)       341       2      356       3.0   835355628  Comedy|Drama|Romance|War
1  Forrest Gump (1994)       341       3      356       5.0  1298862167  Comedy|Drama|Romance|War
2  Forrest Gump (1994)       341       4      356       5.0   949919763  Comedy|Drama|Romance|War
3  Forrest Gump (1994)       341       5      356       4.0  1163374152  Comedy|Drama|Romance|War
4  Forrest Gump (1994)       341       7      356       3.0   851868188  Comedy|Drama|Romance|War


In [88]:
# Build the user rating matrix (one row per user, one column per movie title), then create
# dropdown widgets so the user can select ten movies and rate each one.
# The widgets default to the movies and ratings from the test input provided in the assignment.

user_rat_mat = pd.pivot_table(combined, values='rating_y', index=['userId'],columns=['title'],aggfunc=np.sum)

movie_titles = combined['title'].tolist()

movie_1 = widgets.Dropdown(
    options=movie_titles,
    value='Lethal Weapon (1987)',
    description='Movie 1:',
    disabled=False,
)

movie_1_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='4',
    description='Rating:',
    disabled=False,
)

movie_2 = widgets.Dropdown(
    options=movie_titles,
    value='Natural Born Killers (1994)',
    description='Movie 2:',
    disabled=False,
)

movie_2_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='3',
    description='Rating:',
    disabled=False,
)
movie_3 = widgets.Dropdown(
    options=movie_titles,
    value='Inception (2010)',
    description='Movie 3:',
    disabled=False,
)

movie_3_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='1',
    description='Rating:',
    disabled=False,
)
movie_4 = widgets.Dropdown(
    options=movie_titles,
    value='Heat (1995)',
    description='Movie 4:',
    disabled=False,
)

movie_4_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='0',
    description='Rating:',
    disabled=False,
)
movie_5 = widgets.Dropdown(
    options=movie_titles,
    value='Finding Nemo (2003)',
    description='Movie 5:',
    disabled=False,
)

movie_5_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='3',
    description='Rating:',
    disabled=False,
)
movie_6 = widgets.Dropdown(
    options=movie_titles,
    value='Office Space (1999)',
    description='Movie 6:',
    disabled=False,
)

movie_6_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='1',
    description='Rating:',
    disabled=False,
)
movie_7 = widgets.Dropdown(
    options=movie_titles,
    value='Home Alone (1990)',
    description='Movie 7:',
    disabled=False,
)

movie_7_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='0',
    description='Rating:',
    disabled=False,
)
movie_8 = widgets.Dropdown(
    options=movie_titles,
    value='High Fidelity (2000)',
    description='Movie 8:',
    disabled=False,
)

movie_8_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='2',
    description='Rating:',
    disabled=False,
)
movie_9 = widgets.Dropdown(
    options=movie_titles,
    value='Donnie Darko (2001)',
    description='Movie 9:',
    disabled=False,
)

movie_9_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='3',
    description='Rating:',
    disabled=False,
)
movie_10 = widgets.Dropdown(
    options=movie_titles,
    value='Lion King, The (1994)',
    description='Movie 10:',
    disabled=False,
)

movie_10_rating = widgets.Dropdown(
    options=['0','1','2','3','4','5'],
    value='2',
    description='Rating:',
    disabled=False,
)


In [89]:
display(movie_1,movie_1_rating)

Dropdown(description='Movie 1:', index=25132, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', index=4, options=('0', '1', '2', '3', '4', '5'), value='4')

In [90]:
display(movie_2,movie_2_rating)

Dropdown(description='Movie 2:', index=20275, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', index=3, options=('0', '1', '2', '3', '4', '5'), value='3')

In [91]:
display(movie_3,movie_3_rating)

Dropdown(description='Movie 3:', index=18969, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', index=1, options=('0', '1', '2', '3', '4', '5'), value='1')

In [92]:
display(movie_4,movie_4_rating)

Dropdown(description='Movie 4:', index=21748, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', options=('0', '1', '2', '3', '4', '5'), value='0')

In [93]:
display(movie_5,movie_5_rating)

Dropdown(description='Movie 5:', index=16282, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', index=3, options=('0', '1', '2', '3', '4', '5'), value='3')

In [94]:
display(movie_6,movie_6_rating)

Dropdown(description='Movie 6:', index=20700, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', index=1, options=('0', '1', '2', '3', '4', '5'), value='1')

In [95]:
display(movie_7,movie_7_rating)

Dropdown(description='Movie 7:', index=14279, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', options=('0', '1', '2', '3', '4', '5'), value='0')

In [96]:
display(movie_8,movie_8_rating)

Dropdown(description='Movie 8:', index=25757, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', index=2, options=('0', '1', '2', '3', '4', '5'), value='2')

In [97]:
display(movie_9,movie_9_rating)

Dropdown(description='Movie 9:', index=24399, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', index=3, options=('0', '1', '2', '3', '4', '5'), value='3')

In [98]:
display(movie_10,movie_10_rating)

Dropdown(description='Movie 10:', index=6053, options=('Forrest Gump (1994)', 'Forrest Gump (1994)', 'Forrest …

Dropdown(description='Rating:', index=2, options=('0', '1', '2', '3', '4', '5'), value='2')

In [130]:
# Collect the selected movies and ratings into lists, then restrict the user rating matrix
# to the movies that were rated through the widgets.

from sklearn.metrics.pairwise import cosine_similarity

user_input_movies = [movie_1.value,movie_2.value,movie_3.value,movie_4.value,movie_5.value,
                     movie_6.value,movie_7.value,movie_8.value,movie_9.value, movie_10.value]
user_input_ratings = [movie_1_rating.value,movie_2_rating.value,movie_3_rating.value,
                      movie_4_rating.value,movie_5_rating.value,movie_6_rating.value,
                      movie_7_rating.value,movie_8_rating.value,movie_9_rating.value, movie_10_rating.value]


req_movie_df = user_rat_mat.filter(items=user_input_movies)

In [190]:
# Compute the cosine similarity between each existing user's ratings and the widget input ratings.
# The similarities are then merged back into the full ratings table to score every movie, so that
# movies rated highly by users with similar tastes rank first.

# Treat movies a user has not rated as -1 (done once, outside the loop).
req_movie_df = req_movie_df.fillna(-1)

# Widget values are strings, so convert them to floats before comparing.
b = np.array(user_input_ratings, dtype=float).reshape(1, -1)

keys = []
vals = []
for m in range(len(req_movie_df)):
    # The row label of the pivot table is the userId.
    key = req_movie_df.index[m]

    # Cosine similarity between this user's ratings and the input ratings.
    a = np.array(req_movie_df.iloc[m]).reshape(1, -1)
    similarity = cosine_similarity(a, b)

    keys.append(key)
    vals.append(similarity[0][0])

similarity_mat = list(zip(keys,vals))
similarity_df = pd.DataFrame(similarity_mat, columns=['userId', 'similarity'])    


all_user_ratings = pd.merge(combined, similarity_df, on='userId', how='left')

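# fillna returns a new object, so assign the result; users without a similarity get weight 0.
all_user_ratings["similarity"] = all_user_ratings["similarity"].fillna(0)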
all_user_ratings["score_1"] = all_user_ratings.rating_y * all_user_ratings.similarity

Suggested_movies_rats = pd.pivot_table(all_user_ratings, values='score_1', index=['title'],aggfunc=np.mean)
Suggested_movies_rats = Suggested_movies_rats.sort_values(by='score_1', ascending=False)
Suggested_movies_rats

                                              score_1
title
Donnie Darko (2001)                          1.417064
Lethal Weapon (1987)                         1.311427
Finding Nemo (2003)                          1.192757
High Fidelity (2000)                         1.160440
O Brother, Where Art Thou? (2000)            0.980193
Lost in Translation (2003)                   0.977083
Catch Me If You Can (2002)                   0.883215
Full Metal Jacket (1987)                     0.881003
Indiana Jones and the Temple of Doom (1984)  0.859678
Office Space (1999)                          0.847523


###### 2. Movie Search Engine

Now that you have a recommendation engine, you need to provide a way for users to find movies to rate. You will need to create a function that takes in a search parameter and returns a ranked list of movies that best match the input. This data set contains fewer than 10,000 movies, and you only need to search the titles of those movies. Therefore, you do not need to worry as much about coming up with an optimal solution that scales to larger datasets.

When returning candidate movie titles, you will want to return the titles that match the search input with the highest probability. Consider dividing up the titles and the user input into n-grams, but instead of using n-grams of words, use n-grams of the characters in the string.

For example, the title Batman contains the bigrams [‘ba’, ‘at’, ‘tm’, ‘ma’, ‘an’]. You could then match that input title to titles that contain those bigrams with the highest probability. Find a search method that generally returns correct recommendations based on the search input.
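To make the character-bigram idea concrete, here is a small pure-Python sketch. The function names (`char_bigrams`, `bigram_score`, `search_titles`) are hypothetical, and lowercasing the text, dropping spaces, and scoring by the fraction of query bigrams found in the title are assumptions chosen for illustration rather than requirements of the assignment.

```python
from collections import Counter

def char_bigrams(text):
    """Lowercase the text, drop spaces, and return adjacent character pairs."""
    chars = text.lower().replace(" ", "")
    return [chars[i:i + 2] for i in range(len(chars) - 1)]

def bigram_score(query, title):
    """Fraction of the query's bigrams that also occur in the title (0.0 to 1.0)."""
    query_counts = Counter(char_bigrams(query))
    title_counts = Counter(char_bigrams(title))
    total = sum(query_counts.values())
    if total == 0:
        return 0.0
    shared = sum(min(n, title_counts[bg]) for bg, n in query_counts.items())
    return shared / total

def search_titles(query, titles, k=5):
    """Return the k titles with the highest bigram score, best match first."""
    ranked = sorted(titles, key=lambda t: bigram_score(query, t), reverse=True)
    return ranked[:k]

sample_titles = ["Toy Story (1995)", "Batman (1989)", "Jumanji (1995)"]
print(char_bigrams("Batman"))
print(search_titles("batman", sample_titles, k=1))
```

Capping each bigram's contribution at the number of times it appears in the query keeps long titles from scoring well simply because they contain many bigrams.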

In [217]:
# Split the search text and every movie title into character bigrams, count how many of the
# search bigrams appear in each title, then sort by that count and return the top 5 titles.

# word_tokenize requires NLTK's Punkt tokenizer data, downloaded once with nltk.download().
import nltk

def search(word):
    
    movie_word_bigram_match_count = []
    
    token=nltk.word_tokenize(word)
    tokens = []
    
    for i in token:
        for x in i:
            tokens.append(x)
    
    word_bigrams = list(nltk.bigrams(tokens))
            
    for movie in movie_titles:
        movie_token = nltk.word_tokenize(movie)
        movie_tokens = []
        for i in movie_token:
            for x in i:
                movie_tokens.append(x)
        movie_bigrams = list(nltk.bigrams(movie_tokens))
        movie_counts = []
        for z in word_bigrams:
            bigram_count = movie_bigrams.count(z)
            movie_counts.append(bigram_count)
        totals = sum(movie_counts)
        movie_word_bigram_match_count.append(totals)
        
    
    movie_match = list(zip(movie_titles,movie_word_bigram_match_count))
    Search_df = pd.DataFrame(movie_match, columns=['movie', 'match'])
    Search_df = Search_df.sort_values(by='match', ascending=False).drop_duplicates()
    return Search_df['movie'].iloc[0:5]


###### 3. Movie Recommendation Application

In this part, you create an interactive movie recommendation application by combining the movie recommendation engine and the movie search engine. To accomplish this, you will create a simple command line application. Upon starting the application, you should ask the user to find a movie to rate. It should return a list of numbered movies or an “I don’t see what I’m looking for” option.

If the user selects a movie, they need to enter a rating from zero to five and that movie will be added to their list of movies.


After the user has rated at least five movies, give them the option to rate more movies or to get recommendations. Your application should return a ranked list of five movie recommendations.
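The interaction described above could be structured roughly as follows. This is only a sketch: `collect_ratings` is a hypothetical helper, `search_fn` stands in for whatever search function is available, and the prompt wording is made up. The `input_fn` parameter exists so the loop can be driven by scripted answers instead of the keyboard.

```python
def collect_ratings(search_fn, input_fn=input, min_ratings=5):
    """Prompt for searches and ratings until the user asks for recommendations.

    search_fn(query) must return a sequence of candidate titles.
    Returns a dict mapping title -> integer rating from 0 to 5.
    """
    rated = {}
    while True:
        query = input_fn("Search for a movie to rate: ")
        matches = list(search_fn(query))
        for i, title in enumerate(matches, start=1):
            print(f"{i}. {title}")
        print(f"{len(matches) + 1}. I don't see what I'm looking for")

        choice = input_fn("Pick a number: ")
        if not choice.isdigit() or not 1 <= int(choice) <= len(matches):
            continue  # "not listed" or invalid input: search again

        rating = input_fn("Rate it from 0 to 5: ")
        if rating.isdigit() and 0 <= int(rating) <= 5:
            rated[matches[int(choice) - 1]] = int(rating)

        # Once enough movies are rated, let the user stop and get recommendations.
        if len(rated) >= min_ratings:
            again = input_fn("Rate another movie? (y/n): ")
            if again.strip().lower() != "y":
                return rated

# Scripted demonstration with a stand-in search function and canned answers.
def fake_search(query):
    return ["Toy Story (1995)"] if query == "toy" else ["Jumanji (1995)"]

answers = iter(["toy", "1", "5", "jum", "2", "jum", "1", "3", "n"])
demo_ratings = collect_ratings(fake_search, input_fn=lambda prompt: next(answers),
                               min_ratings=2)
print(demo_ratings)
```

In interactive use, the call would be `collect_ratings(search_bar)` with the default `input_fn`, and the returned dictionary would then be turned into the input vector for the recommendation engine.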

In [223]:
def search_bar(word):
    
    movie_word_bigram_match_count = []
    
    token=nltk.word_tokenize(word)
    tokens = []
    
    for i in token:
        for x in i:
            tokens.append(x)
    
    word_bigrams = list(nltk.bigrams(tokens))
            
    for movie in movie_titles:
        movie_token = nltk.word_tokenize(movie)
        movie_tokens = []
        for i in movie_token:
            for x in i:
                movie_tokens.append(x)
        movie_bigrams = list(nltk.bigrams(movie_tokens))
        movie_counts = []
        for z in word_bigrams:
            bigram_count = movie_bigrams.count(z)
            movie_counts.append(bigram_count)
        totals = sum(movie_counts)
        movie_word_bigram_match_count.append(totals)
        
    
    movie_match = list(zip(movie_titles,movie_word_bigram_match_count))
    Search_df = pd.DataFrame(movie_match, columns=['movie', 'match'])
    Search_df = Search_df.sort_values(by='match', ascending=False).drop_duplicates()
    G = list(Search_df['movie'].iloc[0:5])
    return G

In [224]:

search("Beauty and the Beast")

8452                           Beauty and the Beast (1991)
3770     Raiders of the Lost Ark (Indiana Jones and the...
12453    Pirates of the Caribbean: The Curse of the Bla...
21122    Harry Potter and the Sorcerer's Stone (a.k.a. ...
20977    Dr. Strangelove or: How I Learned to Stop Worr...
Name: movie, dtype: object

In [231]:
# This section builds an interactive menu: the search text populates a drop-down menu with the
# top 5 matching movie titles, and a rating slider defaults to 3.
# I ran out of time to feed these selections back into the workflow to recommend a list of movies.

# However, the earlier cells already provide interactive drop-down menus for movie titles and
# ratings, and those selections are fed into the recommendation workflow.

from ipywidgets import interact, interact_manual


@interact
def show_movies(Search='Enter Text Here', Rating = widgets.IntSlider(min=0,max=5,step=1,value=3)):
    return widgets.Dropdown(options = search(Search)) 

interactive(children=(Text(value='claps', description='Search'), IntSlider(value=3, description='Rating', max=…