# User-based filtering

An item's predicted rating for a user is computed from the ratings that other, similar users gave to that item.

In other words, ratings are predicted from the ratings of neighbouring users.
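
As a rough illustration of this idea (it is not the code used later in this notebook), a predicted rating can be sketched as a similarity-weighted average of the neighbours' ratings. The toy matrix, the helper name `predict_rating`, and the use of cosine similarity over raw rating rows (with 0 meaning "not rated") are all assumptions made for this example.

```python
import numpy as np

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    `ratings` is a users x items array where 0 means "not rated".
    """
    target = ratings[user]
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        # Skip the user themself and anyone who has not rated the item
        if other == user or ratings[other, item] == 0:
            continue
        norm = np.linalg.norm(target) * np.linalg.norm(ratings[other])
        if norm == 0:
            continue
        sim = float(target @ ratings[other]) / norm  # cosine similarity
        num += sim * ratings[other, item]
        den += abs(sim)
    return num / den if den else 0.0

toy = np.array([
    [5.0, 4.0, 0.0],  # user 0 has not rated item 2
    [5.0, 4.0, 5.0],  # user 1 rates like user 0 and loved item 2
    [1.0, 0.0, 1.0],  # user 2 has different taste and disliked item 2
])
pred = predict_rating(toy, user=0, item=2)
print(round(pred, 2))
```

Because user 1 is the closer neighbour, the prediction for user 0 lands nearer to user 1's rating than to user 2's.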


In [140]:
# Import the libraries we are going to use
from datetime import datetime
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import warnings
warnings.filterwarnings("ignore")

Load the `movies` and `ratings` tables into pandas dataframes below:

In [2]:
movies= pd.read_csv("Data/movies.csv")
movies.drop("Unnamed: 0",axis=1,inplace=True)

In [3]:
ratings_sample = pd.read_csv("Data/ratings_sample.csv")

### User-Item matrix / DataFrame

To relate a new user, who has just entered a movie that they like, to the existing users, we need a user-item matrix.

Below I create a dataframe whose rows correspond to user IDs and whose columns correspond to movie IDs.
Each value is the rating that the user has given to the movie.

In [4]:
# Create a user-item matrix
# Identify unique users and items
unique_users = ratings_sample['userId'].unique()
unique_items = ratings_sample['movieId'].unique()

# Create an empty user-item matrix
user_item_matrix = np.zeros((len(unique_users), len(unique_items)))

# Populate the user-item matrix
for index, row in ratings_sample.iterrows():
    user_id = row['userId']
    item_id = row['movieId']
    rating = row['rating']

    user_index = np.where(unique_users == user_id)[0][0]
    item_index = np.where(unique_items == item_id)[0][0]

    user_item_matrix[user_index, item_index] = rating

# Print the user-item matrix
print(user_item_matrix)

[[4.  3.5 4.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]
 ...
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]
 [0.  0.  0.  ... 0.  0.  0. ]]
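
The loop above looks up every row's position with `np.where`, which can become slow on a large ratings table. A possible vectorised alternative is sketched below on a made-up stand-in for `ratings_sample`; the toy values and the names `toy_ratings` and `toy_user_item` are assumptions for illustration. `pd.factorize` keeps labels in order of first appearance, like `unique()`. The sketch assumes each (user, movie) pair appears only once.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ratings_sample (values are made up for illustration)
toy_ratings = pd.DataFrame({
    "userId":  [3, 3, 9, 12],
    "movieId": [356, 4167, 356, 4306],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# factorize returns integer positions plus the unique labels
# in order of first appearance
user_pos, users = pd.factorize(toy_ratings["userId"])
item_pos, items = pd.factorize(toy_ratings["movieId"])

# Fill all ratings in one indexed assignment instead of a Python loop
matrix = np.zeros((len(users), len(items)))
matrix[user_pos, item_pos] = toy_ratings["rating"].to_numpy()

toy_user_item = pd.DataFrame(matrix, index=users, columns=items)
print(toy_user_item)
```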


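In matrix-factorisation approaches, the rating matrix is approximated by the dot product of a user matrix and an item matrix.
Here we work with the observed ratings directly: each number is the rating that the user has given to the movie, and `0` means the user has not rated it.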
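
To make the dot-product remark concrete, here is a minimal sketch with invented numbers. The two-factor matrices `user_factors` and `item_factors` are hypothetical and are not derived from the notebook's data; the notebook itself uses the observed ratings rather than a factorisation.

```python
import numpy as np

# Hypothetical 2-factor representations (made-up numbers)
user_factors = np.array([[1.0, 0.5],
                         [0.2, 1.0]])   # 2 users x 2 factors
item_factors = np.array([[4.0, 1.0],
                         [0.5, 4.0],
                         [2.0, 2.0]])   # 3 items x 2 factors

# Each entry is the dot product of one user's row and one item's row
approx_ratings = user_factors @ item_factors.T   # 2 users x 3 items
print(approx_ratings)
```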

In [167]:
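# User-item matrix as a dataframe: rows are user IDs, columns are movie IDs.
# Note: wrapping the labels in a list gives single-level MultiIndex axes,
# so row and column labels are 1-tuples such as (356,).
df = pd.DataFrame(data=user_item_matrix, index=[unique_users],
                  columns=[unique_items])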

Let's look at the user-item matrix, in which the columns are movie IDs and the rows are user IDs.

In [6]:
df

userId,356,4167,4306,4979,5574,6156,6213,6333,6383,6595,...,120905,182415,34229,35826,70978,72027,81665,3291,5457,6246
3,4.0,3.5,4.0,4.0,4.0,3.0,3.0,4.0,4.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162521,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162529,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162533,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
# df.to_csv("user_item.csv")

### Search for the movie

The function below finds the movie titles that are similar to the title that was searched for
(the same function as in the content-based notebook):

In [183]:
# Define a TF-IDF vectorizer object for titles
# (min_df=1 keeps every term; newer scikit-learn releases validate parameters and may reject min_df=0)
tfidf_title = TfidfVectorizer(stop_words='english', min_df=1, ngram_range=(1, 2))

# Construct the required TF-IDF matrix by fitting and transforming the data
title_matrix = tfidf_title.fit_transform(movies['title'])

# Pairwise cosine similarity between all titles
title_similarity = cosine_similarity(title_matrix, title_matrix)

In [184]:
def find_similar_movies(movie_title, top_n=10, threshold=60):
    # All movie titles we can match against
    titles = movies['title'].tolist()

    # Fuzzy similarity score between the searched title and the known titles.
    # Note: process.extract returns at most 5 matches by default (limit=5).
    similarity_scores = process.extract(movie_title, titles, scorer=fuzz.partial_ratio)

    # Keep the titles whose similarity score is at or above the threshold
    similar_movies = [name for name, score in similarity_scores if score >= threshold]

    # Transform the candidates plus the searched title (last row) with the
    # fitted TF-IDF vectorizer and compute their cosine similarities
    query_matrix = tfidf_title.transform(similar_movies + [movie_title])
    title_sim = cosine_similarity(query_matrix)

    # The last row holds the similarity of the searched title to each candidate.
    # Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
    rows = [{"title": name, "similarity": title_sim[-1][i]}
            for i, name in enumerate(similar_movies)]
    title_similarity_df = pd.DataFrame(rows, columns=["title", "similarity"])

    title_similarity_df = title_similarity_df.sort_values(by='similarity', ascending=False).head(top_n)
    return title_similarity_df.title.values
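
The TF-IDF re-ranking step can be tried in isolation on a few titles. The sketch below is only an illustration: the three toy titles, the query string, and the variable names are assumptions, and the fuzzy-matching stage is left out so that the example depends on scikit-learn alone.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_titles = ["Point Break", "Breaking Point", "The Searchers"]

# Same settings as the notebook's vectorizer: English stop words, unigrams and bigrams
toy_vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# Project the query into the fitted vocabulary and score it against every title
query_vec = toy_vectorizer.transform(["point break"])
scores = cosine_similarity(query_vec, toy_matrix)[0]

# Sort titles from most to least similar
ranked = [title for _, title in sorted(zip(scores, toy_titles), reverse=True)]
print(ranked)
```

Titles that share both words and word order with the query score highest, titles that share a single word score lower, and unrelated titles score zero.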

## Recommend based on similar users' data

`get_top_recommendations` recommends the movies liked by the users who rated the movie that was entered.
The hope is that these users share the new user's taste, which is why we recommend based on their likes.

The function takes a title and, if there is no exact match, passes it to `find_similar_movies` to search the movies dataframe for the most similar title.

It then takes that movie's `movieId` and the `userId` of everyone who rated the same movie.

After collecting the user IDs, it iterates through the rows of the user-item dataframe for those users and looks for movies that they rated higher than `3.5` (the rating threshold).

The movies those users liked are then added to the `top_movies` dataframe.

At the end, the function returns the top 10 movies based on the ratings the users have given.
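
The steps described above can also be sketched with plain pandas filtering on a long-format ratings table. This is a simplified illustration rather than the notebook's function: the toy data, the helper name `liked_by_same_users`, and its parameters are assumptions. Like the notebook's function, it does not exclude the searched movie from the result.

```python
import pandas as pd

# Toy ratings table (values are made up for illustration)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 30, 10, 40, 20, 50],
    "rating":  [5.0, 4.5, 2.0, 4.0, 5.0, 5.0, 4.0],
})

def liked_by_same_users(ratings, movie_id, min_rating=3.5, top_n=10):
    # Users who rated the searched movie
    fans = ratings.loc[ratings["movieId"] == movie_id, "userId"]
    # Everything those users rated above the threshold
    liked = ratings[ratings["userId"].isin(fans) & (ratings["rating"] > min_rating)]
    # Highest-rated movies first, limited to top_n rows
    return liked.sort_values("rating", ascending=False).head(top_n)[["movieId", "rating"]]

recs = liked_by_same_users(toy_ratings, movie_id=10)
print(recs)
```

Here users 1 and 2 rated movie 10, so only their highly rated movies appear; user 3's ratings are ignored.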


In [209]:
# Look up the users who rated the searched movie and collect the movies they liked
def get_top_recommendations(title):
    # Use the exact title if it exists; otherwise fall back to the closest matches
    if (movies['title'] == title).any():
        candidates = [title]
    else:
        candidates = list(find_similar_movies(title, top_n=2))

    # Take the first candidate that at least one user has rated
    user_ids = set()
    for candidate in candidates:
        movie_id = movies.loc[movies['title'] == candidate, 'movieId'].iloc[0]
        user_ids = set(ratings_sample.loc[ratings_sample['movieId'] == movie_id, 'userId'])
        if user_ids:
            break

    known_movie_ids = set(movies['movieId'])
    liked = []

    # Iterate over each row of the dataframe.
    # df has single-level MultiIndex axes, so row and column labels are 1-tuples.
    for index, row in df.iterrows():
        if index[0] in user_ids:
            # Iterate over each column in the row (Series.iteritems was removed in pandas 2.0)
            for column, val in row.items():
                # A rating higher than 3.5 indicates the user liked the movie
                if val > 3.5 and column[0] in known_movie_ids:
                    liked.append({'movieID': column[0], 'rating': val})

    # Dataframe holding the result (DataFrame.append was removed in pandas 2.0)
    top_movies = pd.DataFrame(liked, columns=['movieID', 'rating'])

    # Return the top 10 in terms of ratings
    top10 = top_movies.sort_values(by='rating', ascending=False).head(10)
    return top10



The function below takes the top-10 movie list and displays each title along with some information about the movie.

In [210]:
# Print the results in a readable format
def get_movie_info(top_movies_df, movies):
    # Use the dataframe passed in rather than the global top10
    for movieId in top_movies_df.movieID.values:
        movie = movies[movies['movieId'] == movieId].iloc[0]  # first matching row
        print("\n---------------------------------------------\n")
        print("Movie Title: ", movie['title'])
        print("Overview: ", movie['overview'])
        print("Genre: ", movie['genre'])
        print("Runtime: ", movie['runtime'])


In [212]:
# Type any movie title you want
# This may take a while (1-2 minutes)
top10 = get_top_recommendations('Love actua')
get_movie_info(top10, movies)




---------------------------------------------

Movie Title:  Point Break
Overview:  In the coastal town of Los Angeles, a gang of bank robbers call themselves The Ex-Presidents commit their crimes while wearing masks of Reagan, Carter, Nixon and Johnson. The F.B.I. believes that the members of the gang could be surfers and send young agent Johnny Utah undercover at the beach to mix with the surfers and gather information.
Genre:  ['Action', 'Thriller', 'Crime']
Runtime:  120.0

---------------------------------------------

Movie Title:  The Searchers
Overview:  As a Civil War veteran spends years searching for a young niece captured by Indians, his motivation becomes increasingly questionable.
Genre:  ['Western']
Runtime:  119.0

---------------------------------------------

Movie Title:  The Poseidon Adventure
Overview:  The Poseidon Adventure was one of the first Catastrophe films and began the Disaster Film genre. Director Neame tells the story of a group of people that must figh