# Item-Based Collaborative Filtering Practice

In [27]:
import pandas as pd
import numpy as np

In the real dataset, ratings range from 0.5 to 5.0, so a zero in the rating matrix can only mean a missing rating.

For practice, let's say we have 10 movies and 10 users, and the ratings of each user are as follows:

In [28]:
df = pd.DataFrame({'user_0':[0,3,0,5,0,0,4,5,0,2], 'user_1':[0,0,3,2,5,0,4,0,3,0], 'user_2':[3,1,0,3,5,0,0,4,0,0], 'user_3':[4,3,4,2,0,0,0,2,0,0], 
                   'user_4':[2,0,0,0,0,4,4,3,5,0], 'user_5':[1,0,2,4,0,0,4,0,5,0], 'user_6':[2,0,0,3,0,4,3,3,0,0], 'user_7':[0,0,0,3,0,2,4,3,4,0], 
                   'user_8':[5,0,0,0,5,3,0,3,0,4], 'user_9':[1,0,2,0,4,0,4,3,0,0]}, index=['movie_0','movie_1','movie_2','movie_3','movie_4','movie_5','movie_6','movie_7','movie_8','movie_9'])
df

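| | user_0 | user_1 | user_2 | user_3 | user_4 | user_5 | user_6 | user_7 | user_8 | user_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| movie_0 | 0 | 0 | 3 | 4 | 2 | 1 | 2 | 0 | 5 | 1 |
| movie_1 | 3 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| movie_2 | 0 | 3 | 0 | 4 | 0 | 2 | 0 | 0 | 0 | 2 |
| movie_3 | 5 | 2 | 3 | 2 | 0 | 4 | 3 | 3 | 0 | 0 |
| movie_4 | 0 | 5 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | 4 |
| movie_5 | 0 | 0 | 0 | 0 | 4 | 0 | 4 | 2 | 3 | 0 |
| movie_6 | 4 | 4 | 0 | 0 | 4 | 4 | 3 | 4 | 0 | 4 |
| movie_7 | 5 | 0 | 4 | 2 | 3 | 0 | 3 | 3 | 3 | 3 |
| movie_8 | 0 | 3 | 0 | 0 | 5 | 5 | 0 | 4 | 0 | 0 |
| movie_9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |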


In [29]:
df.values

array([[0, 0, 3, 4, 2, 1, 2, 0, 5, 1],
       [3, 0, 1, 3, 0, 0, 0, 0, 0, 0],
       [0, 3, 0, 4, 0, 2, 0, 0, 0, 2],
       [5, 2, 3, 2, 0, 4, 3, 3, 0, 0],
       [0, 5, 5, 0, 0, 0, 0, 0, 5, 4],
       [0, 0, 0, 0, 4, 0, 4, 2, 3, 0],
       [4, 4, 0, 0, 4, 4, 3, 4, 0, 4],
       [5, 0, 4, 2, 3, 0, 3, 3, 3, 3],
       [0, 3, 0, 0, 5, 5, 0, 4, 0, 0],
       [2, 0, 0, 0, 0, 0, 0, 0, 4, 0]])

Let's use **NearestNeighbors()** with **metric='cosine'** to compute the **cosine distance** (1 minus the cosine similarity) between movies and find the most similar movies for each movie.

In [30]:
from sklearn.neighbors import NearestNeighbors

In [31]:
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df.values)
distances, indices = knn.kneighbors(df.values, n_neighbors=3)

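The number of nearest neighbors is set to 3. Because we query with the same rows we fitted on, each movie's neighbor list normally includes the movie itself, which leaves two other movies.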

In [32]:
indices

array([[0, 7, 5],
       [1, 3, 7],
       [2, 1, 6],
       [3, 6, 7],
       [4, 0, 7],
       [5, 7, 0],
       [6, 8, 3],
       [7, 3, 0],
       [8, 6, 3],
       [9, 0, 7]])

**indices** shows the nearest movies to each movie. Row *i* corresponds to row *i* of **df**. The first element in a row is the nearest movie, which is normally the movie itself (distance 0). The second element is the second nearest, and the third is the third nearest. For example, in the first row **[0,7,5]**, the nearest movie to **'movie_0'** is itself, the second nearest is **'movie_7'**, and the third is **'movie_5'**.
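
As a small illustration of how to read this matrix, the sketch below translates row positions into movie titles. The array is copied from the output above; the name `neighbor_titles` is only an illustrative choice, and the movie itself is filtered out by position rather than by assuming it always comes first.

```python
import numpy as np

# Neighbor indices copied from the output above (one row per movie).
indices = np.array([[0, 7, 5], [1, 3, 7], [2, 1, 6], [3, 6, 7], [4, 0, 7],
                    [5, 7, 0], [6, 8, 3], [7, 3, 0], [8, 6, 3], [9, 0, 7]])
titles = [f'movie_{i}' for i in range(10)]

# Translate row positions into titles, skipping the movie itself.
neighbor_titles = {
    titles[m]: [titles[j] for j in row if j != m]
    for m, row in enumerate(indices)
}

print(neighbor_titles['movie_0'])  # ['movie_7', 'movie_5']
```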

In [33]:
distances

array([[0.00000000e+00, 3.19586183e-01, 4.03404722e-01],
       [4.44089210e-16, 3.68421053e-01, 3.95436458e-01],
       [0.00000000e+00, 5.20766162e-01, 5.24329288e-01],
       [2.22044605e-16, 2.72367798e-01, 2.86615021e-01],
       [0.00000000e+00, 4.04534842e-01, 4.80655057e-01],
       [0.00000000e+00, 3.87174123e-01, 4.03404722e-01],
       [0.00000000e+00, 2.33726809e-01, 2.72367798e-01],
       [1.11022302e-16, 2.86615021e-01, 3.19586183e-01],
       [2.22044605e-16, 2.33726809e-01, 4.96677704e-01],
       [1.11022302e-16, 4.22649731e-01, 4.81455027e-01]])

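**distances** shows the cosine distances between movies. Each entry is the distance to the movie at the same position in the **indices** matrix. Values such as 4.44e-16 in the first column are floating-point rounding error and are effectively zero.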
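
To see where these numbers come from, the following sketch recomputes one entry by hand with NumPy, assuming the usual definition of cosine distance as 1 minus the cosine similarity. The two vectors are the rating rows of **movie_0** and **movie_7** from the table above.

```python
import numpy as np

movie_0 = np.array([0, 0, 3, 4, 2, 1, 2, 0, 5, 1])
movie_7 = np.array([5, 0, 4, 2, 3, 0, 3, 3, 3, 3])

# Cosine similarity: dot product divided by the product of the vector norms.
cosine_similarity = movie_0 @ movie_7 / (np.linalg.norm(movie_0) * np.linalg.norm(movie_7))
cosine_distance = float(1 - cosine_similarity)

print(round(cosine_distance, 4))  # 0.3196
```

The result agrees with the second entry in the first row of **distances**.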

Let's print the information in these two matrices in a readable form.

In [34]:
for title in df.index:

  index_user_likes = df.index.tolist().index(title) # get an index for a movie
  sim_movies = indices[index_user_likes].tolist() # make list for similar movies
  movie_distances = distances[index_user_likes].tolist() # the list for distances of similar movies
  id_movie = sim_movies.index(index_user_likes) # get the position of the movie itself in indices and distances

  print('Similar Movies to '+str(df.index[index_user_likes])+':\n')


  sim_movies.remove(index_user_likes) # remove the movie itself in indices
  movie_distances.pop(id_movie) # remove the movie itself in distances

  j = 1
  
  for i in sim_movies:
    print(str(j)+': '+str(df.index[i])+', the distance with '+str(title)+': '+str(movie_distances[j-1]))
    j = j + 1
      
  print('\n')

Similar Movies to movie_0:

1: movie_7, the distance with movie_0: 0.3195861825602283
2: movie_5, the distance with movie_0: 0.40340472183738674


Similar Movies to movie_1:

1: movie_3, the distance with movie_1: 0.3684210526315791
2: movie_7, the distance with movie_1: 0.39543645824165696


Similar Movies to movie_2:

1: movie_1, the distance with movie_2: 0.5207661617014769
2: movie_6, the distance with movie_2: 0.5243292879915494


Similar Movies to movie_3:

1: movie_6, the distance with movie_3: 0.27236779788557686
2: movie_7, the distance with movie_3: 0.2866150207251553


Similar Movies to movie_4:

1: movie_0, the distance with movie_4: 0.40453484184315647
2: movie_7, the distance with movie_4: 0.4806550570967598


Similar Movies to movie_5:

1: movie_7, the distance with movie_5: 0.38717412297165876
2: movie_0, the distance with movie_5: 0.40340472183738674


Similar Movies to movie_6:

1: movie_8, the distance with movie_6: 0.23372680904614496
2: movie_3, the distance with m

## Recommend Similar Movies to a Selected Movie

Using the algorithm above, we can build a recommender that suggests movies similar to one a user selects.

In [35]:
def recommend_movie(title):

  index_user_likes = df.index.tolist().index(title) # get an index for a movie
  sim_movies = indices[index_user_likes].tolist() # make list for similar movies
  movie_distances = distances[index_user_likes].tolist() # the list for distances of similar movies
  id_movie = sim_movies.index(index_user_likes) # get the position of the movie itself in indices and distances

  print('Similar Movies to '+str(df.index[index_user_likes])+': \n')

  sim_movies.remove(index_user_likes) # remove the movie itself in indices
  movie_distances.pop(id_movie) # remove the movie itself in distances

  j = 1
    
  for i in sim_movies:
    print(str(j)+': '+str(df.index[i])+', the distance with '+str(title)+': '+str(movie_distances[j-1]))
    j = j + 1

In [36]:
recommend_movie('movie_3')

Similar Movies to movie_3: 

1: movie_6, the distance with movie_3: 0.27236779788557686
2: movie_7, the distance with movie_3: 0.2866150207251553
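
If the results are needed for further processing rather than for display, a variant that returns them may be more convenient. The sketch below is one possible way to do this with NumPy alone; the function name `similar_movies` and its parameter `k` are hypothetical, and it assumes no movie has an all-zero rating row (which would make the norm zero).

```python
import numpy as np

ratings = np.array([
    [0, 0, 3, 4, 2, 1, 2, 0, 5, 1],
    [3, 0, 1, 3, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 4, 0, 2, 0, 0, 0, 2],
    [5, 2, 3, 2, 0, 4, 3, 3, 0, 0],
    [0, 5, 5, 0, 0, 0, 0, 0, 5, 4],
    [0, 0, 0, 0, 4, 0, 4, 2, 3, 0],
    [4, 4, 0, 0, 4, 4, 3, 4, 0, 4],
    [5, 0, 4, 2, 3, 0, 3, 3, 3, 3],
    [0, 3, 0, 0, 5, 5, 0, 4, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 0, 4, 0],
])
titles = [f'movie_{i}' for i in range(10)]

def similar_movies(title, k=2):
    """Return the k nearest (title, cosine distance) pairs, excluding the movie itself."""
    m = titles.index(title)
    norms = np.linalg.norm(ratings, axis=1)
    # Cosine distance from movie m to every movie (assumes no all-zero rows).
    dist = 1 - ratings @ ratings[m] / (norms * norms[m])
    order = [j for j in np.argsort(dist) if j != m][:k]
    return [(titles[j], float(dist[j])) for j in order]

print(similar_movies('movie_3'))
```

For **movie_3** this returns **movie_6** and **movie_7**, with the same distances as printed above.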


## Recommend Movies for a Selected User

Now, instead of recommending movies based on a movie the user selected, let's build a recommender that suggests movies based on the predicted ratings for movies the user hasn't watched.

The algorithm of this recommendation engine is as follows:

*  Step 1: Calculate similarity between movies using KNN. (The similarities are calculated based on all the ratings by all users. We already did this previously.)
*  Step 2: For each movie a user has not watched, predict the rating:
   *  Find the closest movies based on the similarity calculated in Step 1.
   *  For the closest movies, calculate the weighted average of the ratings by the user. The similarity (1 minus the cosine distance) is used as the weight.
   *  Use the weighted average of the ratings as the predicted rating for the movie by the user.

* Step 3: Recommend movies which have the highest predicted ratings for the user.
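
The three steps above can be sketched compactly with NumPy. This is only an illustrative outline, not the implementation used in this notebook: the names `cosine_distances` and `recommend` are hypothetical, neighbors the user has not rated are skipped (an assumption), and all-zero rating rows are assumed not to occur.

```python
import numpy as np

ratings = np.array([          # rows: movies, columns: users
    [0, 0, 3, 4, 2, 1, 2, 0, 5, 1],
    [3, 0, 1, 3, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 4, 0, 2, 0, 0, 0, 2],
    [5, 2, 3, 2, 0, 4, 3, 3, 0, 0],
    [0, 5, 5, 0, 0, 0, 0, 0, 5, 4],
    [0, 0, 0, 0, 4, 0, 4, 2, 3, 0],
    [4, 4, 0, 0, 4, 4, 3, 4, 0, 4],
    [5, 0, 4, 2, 3, 0, 3, 3, 3, 3],
    [0, 3, 0, 0, 5, 5, 0, 4, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 0, 4, 0],
])

def cosine_distances(x):
    """Pairwise cosine distances between rows (assumes no all-zero rows)."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1 - unit @ unit.T

def recommend(user, n_neighbors=2, top=3):
    dist = cosine_distances(ratings)                  # Step 1: movie-to-movie distances
    np.fill_diagonal(dist, np.inf)                    # a movie is not its own neighbor
    predictions = {}
    for m in np.flatnonzero(ratings[:, user] == 0):   # Step 2: each unrated movie
        nearest = np.argsort(dist[m])[:n_neighbors]
        sims = 1 - dist[m, nearest]
        user_ratings = ratings[nearest, user]
        rated = user_ratings > 0                      # assumption: skip unrated neighbors
        weight_sum = sims[rated].sum()
        if weight_sum > 0:
            predictions[int(m)] = float(sims[rated] @ user_ratings[rated] / weight_sum)
        else:
            predictions[int(m)] = 0.0
    # Step 3: highest predicted ratings first
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:top]

print(recommend(user=4))
```

Here `n_neighbors=2` counts only the other movies, which corresponds to `n_neighbors=3` in **NearestNeighbors()** when the movie itself is included.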

### Predict a Rating for a Movie by a User

In this part, let's practice predicting a rating for a movie by a user. For example, **user_7** has not given a rating for **movie_0**. Let's predict this rating.

The first thing to do is to find the nearest neighbors for this movie using **KNN (n_neighbors = 3)**. 

In [37]:
df

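| | user_0 | user_1 | user_2 | user_3 | user_4 | user_5 | user_6 | user_7 | user_8 | user_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| movie_0 | 0 | 0 | 3 | 4 | 2 | 1 | 2 | 0 | 5 | 1 |
| movie_1 | 3 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| movie_2 | 0 | 3 | 0 | 4 | 0 | 2 | 0 | 0 | 0 | 2 |
| movie_3 | 5 | 2 | 3 | 2 | 0 | 4 | 3 | 3 | 0 | 0 |
| movie_4 | 0 | 5 | 5 | 0 | 0 | 0 | 0 | 0 | 5 | 4 |
| movie_5 | 0 | 0 | 0 | 0 | 4 | 0 | 4 | 2 | 3 | 0 |
| movie_6 | 4 | 4 | 0 | 0 | 4 | 4 | 3 | 4 | 0 | 4 |
| movie_7 | 5 | 0 | 4 | 2 | 3 | 0 | 3 | 3 | 3 | 3 |
| movie_8 | 0 | 3 | 0 | 0 | 5 | 5 | 0 | 4 | 0 | 0 |
| movie_9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |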


In [38]:
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df.values)
distances, indices = knn.kneighbors(df.values, n_neighbors=3)

In [39]:
index_for_movie = df.index.tolist().index('movie_0') # it returns 0
sim_movies = indices[index_for_movie].tolist() # make list for similar movies
movie_distances = distances[index_for_movie].tolist() # the list for distances of similar movies
id_movie = sim_movies.index(index_for_movie) # get the position of the movie itself in indices and distances
sim_movies.remove(index_for_movie) # remove the movie itself in indices
movie_distances.pop(id_movie) # remove the movie itself in distances

print('The Nearest Movies to movie_0:', sim_movies)
print('The Distance from movie_0:', movie_distances)

The Nearest Movies to movie_0: [7, 5]
The Distance from movie_0: [0.3195861825602283, 0.40340472183738674]


#### Predict a Rating

The formula to calculate the predicted rating is as follows:

$\hat R_{mu}=\frac{\sum_{j} R_{ju}S_{mj}}{\sum_{j} S_{mj}}$

where $\hat R_{mu}$ is the predicted rating for movie $m$ by user $u$, $R_{ju}$ is the actual rating for movie $j$ by user $u$, and $S_{mj}$ is the similarity between movie $m$ and movie $j$.

This formula simply implies that the predicted rating for a movie is the weighted average of ratings for similar movies. The weight for each rating, $\frac{S_{mj}}{\sum_{k} S_{mk}}$, becomes greater when movie $m$ and movie $j$ are more similar. The denominator of this term makes the sum of all the weights in $\hat R_{mu}$ become 1. 
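
To make the role of the weights concrete, here is a minimal sketch with made-up numbers (the similarities and ratings below are hypothetical, chosen only for illustration). It shows that the normalized weights sum to 1 and that the more similar movie pulls the prediction toward its rating.

```python
import numpy as np

similarities = np.array([0.8, 0.2])    # hypothetical S_mj for two neighbor movies
neighbor_ratings = np.array([5, 3])    # hypothetical R_ju for the same two movies

weights = similarities / similarities.sum()    # S_mj / sum_k S_mk
predicted = float(weights @ neighbor_ratings)  # weighted average of the ratings

print(np.isclose(weights.sum(), 1.0))  # True
print(round(predicted, 2))             # 4.6

# np.average with the raw similarities as weights gives the same value.
same = float(np.average(neighbor_ratings, weights=similarities))
```

The prediction (4.6) lies much closer to 5 than to 3 because the first movie carries four times the weight of the second.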




Let's predict the rating for **movie_0** by **user_7**, $\hat R_{0,7}$. The closest movies to **movie_0** are **movie_5** and **movie_7**, at distances **0.4034** and **0.3196**, respectively.

Therefore, the predicted rating is:

$\hat R_{0,7} = \frac{S_{0,5} R_{5,7} + S_{0,7} R_{7,7}}{S_{0,5} + S_{0,7}}$

Since similarity is 1 minus the cosine distance,
* $S_{0,5}$ = 1 - 0.4034 = 0.5966
* $S_{0,7}$ = 1 - 0.3196 = 0.6804

Also, $R_{5,7}$ = 2 and $R_{7,7}$ = 3. Therefore, the predicted $\hat R_{0,7}$ is 2.5328.
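
Substituting these values into the formula (with the similarities rounded to four decimal places, so the intermediate figures are approximate) gives:

$\hat R_{0,7} \approx \frac{0.5966 \times 2 + 0.6804 \times 3}{0.5966 + 0.6804} = \frac{3.2344}{1.2770} \approx 2.5328$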

In [40]:
movie_similarity = [1 - x for x in movie_distances] # cosine similarity = 1 - cosine distance

predicted_rating = (movie_similarity[0]*df.iloc[sim_movies[0],7] + movie_similarity[1]*df.iloc[sim_movies[1],7])/sum(movie_similarity)
print(predicted_rating)

2.5328183015946415


### Build a Recommender

In [41]:
# find the nearest neighbors using NearestNeighbors(n_neighbors=3)
number_neighbors = 3
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df.values)
distances, indices = knn.kneighbors(df.values, n_neighbors=number_neighbors)

# copy df
df1 = df.copy()

# convert user_name to user_index
user_index = df.columns.tolist().index('user_4')

# t: movie_title, m: the row number of t in df
for m,t in list(enumerate(df.index)):
  
  # find movies without ratings by user_4
  if df.iloc[m, user_index] == 0:
    sim_movies = indices[m].tolist()
    movie_distances = distances[m].tolist()
    
    # Usually the movie itself comes first, e.g. indices[3] = [3 6 7].
    # In that case, drop 3 so that only the nearest NEIGHBORS remain: [6 7].
    if m in sim_movies:
      id_movie = sim_movies.index(m)
      sim_movies.remove(m)
      movie_distances.pop(id_movie) 

    # However, if the rating matrix is very sparse, some movies may have no ratings at all.
    # Movies whose ratings are all 0 are treated as identical by NearestNeighbors(),
    # so the movie itself may be missing from its own row of indices.
    # For example, indices[3] = [2 4 7] is possible if movie_2, movie_3, movie_4, and movie_7 have all 0s for their ratings.
    # In that case, drop the farthest movie in the list: 7 is removed, leaving [2 4].
    else:
      sim_movies = sim_movies[:number_neighbors-1]
      movie_distances = movie_distances[:number_neighbors-1]
        
    # movie_similarity = 1 - movie_distance
    movie_similarity = [1-x for x in movie_distances]
    movie_similarity_copy = movie_similarity.copy()
    nominator = 0

    # for each similar movie
    for s in range(0, len(movie_similarity)):
      
      # check if the rating of a similar movie is zero
      if df.iloc[sim_movies[s], user_index] == 0:

        # if the rating is zero, ignore the rating and the similarity in calculating the predicted rating
        if len(movie_similarity_copy) == (number_neighbors - 1):
          movie_similarity_copy.pop(s)
          
        else:
          movie_similarity_copy.pop(s-(len(movie_similarity)-len(movie_similarity_copy)))

      # if the rating is not zero, use the rating and similarity in the calculation
      else:
        nominator = nominator + movie_similarity[s]*df.iloc[sim_movies[s],user_index]

    # check that at least one similar movie has a non-zero rating from the user
    if len(movie_similarity_copy) > 0:
      
      # check that the sum of the remaining similarities (the denominator) is positive
      if sum(movie_similarity_copy) > 0:
        predicted_r = nominator/sum(movie_similarity_copy)

      # A movie can be selected as a neighbor and still have zero similarity.
      # If all remaining similarities are zero, the predicted rating is set to zero.
      else:
        predicted_r = 0

    # if the user rated none of the similar movies, the predicted rating is zero
    else:
      predicted_r = 0

    # place the predicted rating into the copy of the original dataset
    df1.iloc[m,user_index] = predicted_r
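
The effect of ignoring unrated neighbors can be checked on a single case. The sketch below uses the values for **movie_2** and **user_4** from the tables above (neighbors **movie_1** and **movie_6**) and a boolean mask instead of the list bookkeeping; it is an illustrative alternative under those assumptions, not a replacement for the cell above.

```python
import numpy as np

# Distances from movie_2 to its neighbors movie_1 and movie_6 (from the distances output).
neighbor_distances = np.array([0.5207661617014769, 0.5243292879915494])
# user_4's ratings of movie_1 and movie_6 (0 means not rated).
neighbor_ratings = np.array([0, 4])

similarities = 1 - neighbor_distances
rated = neighbor_ratings > 0   # keep only neighbors the user actually rated

if similarities[rated].sum() > 0:
    predicted = float(similarities[rated] @ neighbor_ratings[rated] / similarities[rated].sum())
else:
    predicted = 0.0

# For comparison: treating the missing rating as a real 0 drags the prediction down.
naive = float(similarities @ neighbor_ratings / similarities.sum())

print(predicted)  # 4.0
print(naive)      # roughly 1.99
```

With the unrated neighbor excluded, the prediction equals the single available rating, 4.0.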

In [42]:
def recommend_movies(user, num_recommended_movies):
  """Print the movies `user` has rated and the top predicted ratings stored in df1."""

  print('The list of the Movies {} Has Watched \n'.format(user))

  for m in df[df[user] > 0][user].index.tolist():
    print(m)
  
  print('\n')

  recommended_movies = []

  for m in df[df[user] == 0].index.tolist():

    index_df = df.index.tolist().index(m)
    predicted_rating = df1.iloc[index_df, df1.columns.tolist().index(user)]
    recommended_movies.append((m, predicted_rating))

  sorted_rm = sorted(recommended_movies, key=lambda x:x[1], reverse=True)
  
  print('The list of the Recommended Movies \n')
  rank = 1
  for recommended_movie in sorted_rm[:num_recommended_movies]:
    
    print('{}: {} - predicted rating:{}'.format(rank, recommended_movie[0], recommended_movie[1]))
    rank = rank + 1

In [43]:
recommend_movies('user_4',5)

The list of the Movies user_4 Has Watched 

movie_0
movie_5
movie_6
movie_7
movie_8


The list of the Recommended Movies 

1: movie_2 - predicted rating:4.0
2: movie_3 - predicted rating:3.504943460433221
3: movie_1 - predicted rating:3.0
4: movie_9 - predicted rating:2.473170201830165
5: movie_4 - predicted rating:2.4658595597666277


In [44]:
df1 = df.copy()

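def movie_recommender(user, num_neighbors, num_recommendation):
  """Fill df1 with predicted ratings for the movies `user` has not rated, then print recommendations."""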
  
  number_neighbors = num_neighbors

  knn = NearestNeighbors(metric='cosine', algorithm='brute')
  knn.fit(df.values)
  distances, indices = knn.kneighbors(df.values, n_neighbors=number_neighbors)

  user_index = df.columns.tolist().index(user)

  for m,t in list(enumerate(df.index)):
    if df.iloc[m, user_index] == 0:
      sim_movies = indices[m].tolist()
      movie_distances = distances[m].tolist()
    
      if m in sim_movies:
        id_movie = sim_movies.index(m)
        sim_movies.remove(m)
        movie_distances.pop(id_movie) 

      else:
        sim_movies = sim_movies[:num_neighbors-1]
        movie_distances = movie_distances[:num_neighbors-1]
           
      movie_similarity = [1-x for x in movie_distances]
      movie_similarity_copy = movie_similarity.copy()
      nominator = 0

      for s in range(0, len(movie_similarity)):
        if df.iloc[sim_movies[s], user_index] == 0:
          if len(movie_similarity_copy) == (number_neighbors - 1):
            movie_similarity_copy.pop(s)
          
          else:
            movie_similarity_copy.pop(s-(len(movie_similarity)-len(movie_similarity_copy)))
            
        else:
          nominator = nominator + movie_similarity[s]*df.iloc[sim_movies[s],user_index]
          
      if len(movie_similarity_copy) > 0:
        if sum(movie_similarity_copy) > 0:
          predicted_r = nominator/sum(movie_similarity_copy)
        
        else:
          predicted_r = 0

      else:
        predicted_r = 0
        
      df1.iloc[m,user_index] = predicted_r
  recommend_movies(user,num_recommendation)

In [45]:
movie_recommender('user_4', 4, 5)

The list of the Movies user_4 Has Watched 

movie_0
movie_5
movie_6
movie_7
movie_8


The list of the Recommended Movies 

1: movie_3 - predicted rating:3.504943460433221
2: movie_2 - predicted rating:3.0283245929027296
3: movie_1 - predicted rating:3.0
4: movie_9 - predicted rating:2.473170201830165
5: movie_4 - predicted rating:2.4658595597666277


### Applying to the Real Movie Data

The dataset is from MovieLens.

In [46]:
ratings = pd.read_csv('ratings.csv', usecols=['userId','movieId','rating'])
movies = pd.read_csv('movies.csv', usecols=['movieId','title'])
ratings2 = pd.merge(ratings, movies, how='inner', on='movieId')

In [47]:
df = ratings2.pivot_table(index='title',columns='userId',values='rating').fillna(0)
df1 = df.copy()
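
To illustrate what this reshaping does, here is a small sketch on a made-up long-format table (the rows in `toy` are hypothetical and only mimic the columns of the merged frame). Each title becomes a row, each user a column, and missing ratings become 0.

```python
import pandas as pd

# Hypothetical long-format ratings, shaped like the merged MovieLens frame.
toy = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title': ['Alien (1979)', 'Aladdin (1992)', 'Alien (1979)', 'Aladdin (1992)'],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# One row per title, one column per user; unrated pairs are filled with 0.
wide = toy.pivot_table(index='title', columns='userId', values='rating').fillna(0)
print(wide)
```

Note that `pivot_table` aggregates with the mean by default, so if the same user rated the same title more than once, the ratings would be averaged.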

In [48]:
def recommend_movies(user, num_recommended_movies):
  """Print the movies `user` has rated and the top predicted ratings stored in df1."""

  print('The list of the Movies {} Has Watched \n'.format(user))

  for m in df[df[user] > 0][user].index.tolist():
    print(m)
  
  print('\n')

  recommended_movies = []

  for m in df[df[user] == 0].index.tolist():

    index_df = df.index.tolist().index(m)
    predicted_rating = df1.iloc[index_df, df1.columns.tolist().index(user)]
    recommended_movies.append((m, predicted_rating))

  sorted_rm = sorted(recommended_movies, key=lambda x:x[1], reverse=True)
  
  print('The list of the Recommended Movies \n')
  rank = 1
  for recommended_movie in sorted_rm[:num_recommended_movies]:
    
    print('{}: {} - predicted rating:{}'.format(rank, recommended_movie[0], recommended_movie[1]))
    rank = rank + 1

In [49]:
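def movie_recommender(user, num_neighbors, num_recommendation):
  """Fill df1 with predicted ratings for the movies `user` has not rated, then print recommendations."""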
  
  number_neighbors = num_neighbors

  knn = NearestNeighbors(metric='cosine', algorithm='brute')
  knn.fit(df.values)
  distances, indices = knn.kneighbors(df.values, n_neighbors=number_neighbors)

  user_index = df.columns.tolist().index(user)

  for m,t in list(enumerate(df.index)):
    if df.iloc[m, user_index] == 0:
      sim_movies = indices[m].tolist()
      movie_distances = distances[m].tolist()
    
      if m in sim_movies:
        id_movie = sim_movies.index(m)
        sim_movies.remove(m)
        movie_distances.pop(id_movie) 

      else:
        sim_movies = sim_movies[:num_neighbors-1]
        movie_distances = movie_distances[:num_neighbors-1]
           
      movie_similarity = [1-x for x in movie_distances]
      movie_similarity_copy = movie_similarity.copy()
      nominator = 0

      for s in range(0, len(movie_similarity)):
        if df.iloc[sim_movies[s], user_index] == 0:
          if len(movie_similarity_copy) == (number_neighbors - 1):
            movie_similarity_copy.pop(s)
          
          else:
            movie_similarity_copy.pop(s-(len(movie_similarity)-len(movie_similarity_copy)))
            
        else:
          nominator = nominator + movie_similarity[s]*df.iloc[sim_movies[s],user_index]
          
      if len(movie_similarity_copy) > 0:
        if sum(movie_similarity_copy) > 0:
          predicted_r = nominator/sum(movie_similarity_copy)
        
        else:
          predicted_r = 0

      else:
        predicted_r = 0
        
      df1.iloc[m,user_index] = predicted_r
  recommend_movies(user,num_recommendation)

In [50]:
movie_recommender(15, 10, 10)

The list of the Movies 15 Has Watched 

(500) Days of Summer (2009)
10 Cloverfield Lane (2016)
101 Dalmatians (One Hundred and One Dalmatians) (1961)
28 Days Later (2002)
9 (2009)
A.I. Artificial Intelligence (2001)
Adjustment Bureau, The (2011)
Aladdin (1992)
Alien (1979)
Aliens (1986)
American Beauty (1999)
American History X (1998)
American Psycho (2000)
Apocalypto (2006)
Avatar (2009)
Avengers, The (2012)
Back to the Future (1985)
Back to the Future Part II (1989)
Back to the Future Part III (1990)
Beautiful Mind, A (2001)
Bicentennial Man (1999)
Bolt (2008)
Bridge of Spies (2015)
Captain America: The Winter Soldier (2014)
Captain Phillips (2013)
Casper (1995)
Cast Away (2000)
Catch Me If You Can (2002)
Chappie (2015)
Children of Men (2006)
Cloudy with a Chance of Meatballs (2009)
Dark Knight Rises, The (2012)
Dark Knight, The (2008)
Deadpool (2016)
District 9 (2009)
Django Unchained (2012)
Doctor Strange (2016)
Edge of Tomorrow (2014)
Escape from L.A. (1996)
Ex Machina (2015)
Fift