## User-Based Collaborative Filtering

In [19]:
import pandas as pd
import numpy as np

In [4]:
movie = pd.read_csv('ml-25m/movies.csv')
ratings = pd.read_csv('ml-25m/ratings.csv')

In [5]:
movie.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [7]:
ratings = ratings.drop('timestamp', axis=1)

In [8]:
ratings.head()

,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
3,1,665,5.0
4,1,899,3.5


In [9]:
import pandas as pd

# Reload the raw CSVs and attach movie metadata to every rating.
ratings_df = pd.read_csv('ml-25m/ratings.csv')
movies_df = pd.read_csv('ml-25m/movies.csv')
movies_ratings_df = pd.merge(ratings_df, movies_df, on="movieId")

# Keep only the first 10,000 rows so the pivot below stays small.
ratings_df = movies_ratings_df.iloc[:10000].reset_index(drop=True)

# Rows are users, columns are movies; unrated cells are set to 0 here.
user_ratings_pivot = ratings_df.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# Per-user mean over ALL columns (zero-filled cells count too), subtracted row-wise.
avg_ratings = user_ratings_pivot.mean(axis=1)
filled_user_item_matrix = user_ratings_pivot.sub(avg_ratings, axis=0)


In [10]:
filled_user_item_matrix.head()

userId (rows) / movieId (columns),1,2,3,5,6,7,9,10,11,14,...,182715,182823,187541,187593,189333,195159,200818,200838,203375,203519
1,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,...,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229,-0.081229
2,3.296775,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,...,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225,-0.203225
3,3.262093,-0.737907,-0.737907,-0.737907,-0.737907,-0.737907,-0.737907,-0.737907,-0.737907,-0.737907,...,-0.737907,2.262093,2.762093,3.762093,-0.737907,-0.737907,-0.737907,-0.737907,-0.737907,-0.737907
4,2.751293,-0.248707,-0.248707,-0.248707,-0.248707,-0.248707,-0.248707,-0.248707,-0.248707,-0.248707,...,4.251293,-0.248707,3.251293,3.251293,3.751293,4.751293,4.751293,2.751293,4.251293,2.251293
5,3.884697,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,...,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303,-0.115303


We use the ratings data to build a user-item matrix, in which each row represents a user and each column represents a movie. Each cell holds the rating that user gave that movie, and missing ratings are filled with 0. We then center every row by subtracting that user's average rating. Because the zeros are filled in before the average is taken, the average runs over every movie column, rated or not, which is why the unrated cells in the output above are small negative numbers rather than 0.
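As a small illustration of mean-centering, the sketch below uses a made-up three-user table (the `toy` frame and its values are hypothetical, not taken from MovieLens). It computes each user's mean over rated movies only and fills the unrated cells with 0 after the subtraction. This is one common variant, shown for comparison; it differs from the cell above, where the zero-fill happens before the mean is taken.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; NaN marks a movie the user has not rated.
toy = pd.DataFrame(
    {10: [5.0, np.nan, 2.0], 20: [3.0, 4.0, np.nan], 30: [np.nan, 2.0, 4.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# mean(axis=1) skips NaN by default, so each mean covers rated movies only.
user_means = toy.mean(axis=1)

# Subtract each user's mean row-wise, then treat unrated cells as "no deviation".
centered = toy.sub(user_means, axis=0).fillna(0)
print(centered)
```

With this ordering an unrated movie sits exactly at 0, the user's own neutral point, instead of looking like a mildly disliked movie.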

In [55]:
# DataFrame.corr correlates columns, so transpose to put one user in each column.
user_ratings_transposed = user_ratings_pivot.T

# Pairwise Pearson correlation between users; rows and columns are both userId.
user_corr_matrix_df = user_ratings_transposed.corr(method='pearson')

user_corr_matrix_df.head()

userId,1,2,3,4,5,6,7,8,9,10,...,66,67,68,69,70,71,72,73,74,75
1,1.0,0.010201,-0.001074,0.004636,-0.008629,-0.012413,0.082918,-0.008348,-0.009415,0.009031,...,0.0062,-0.012525,0.109552,0.050013,0.066039,-0.014767,-0.021501,-0.015032,0.027486,-0.002442
2,0.010201,1.0,0.093823,0.149475,0.125682,0.113875,0.047515,0.137379,0.083273,0.134199,...,0.131331,0.087306,0.084431,0.252501,0.290441,0.224253,0.141783,0.035913,0.080509,0.131457
3,-0.001074,0.093823,1.0,0.281502,-0.015372,0.086019,-0.006997,-0.01248,-0.043644,0.088017,...,0.040714,-0.044089,0.110521,0.201226,0.13887,0.127157,-0.104538,-3.5e-05,0.001178,0.016536
4,0.004636,0.149475,0.281502,1.0,0.02308,0.051967,-0.006703,0.037243,0.008647,0.054529,...,0.019619,0.022974,0.149917,0.138742,0.147264,0.163868,-0.026823,-0.003309,-0.001956,0.021134
5,-0.008629,0.125682,-0.015372,0.02308,1.0,0.102048,0.191285,0.282524,0.185378,0.254675,...,0.123302,0.09624,0.018173,0.135603,0.153964,0.167625,0.189669,0.264584,0.251744,0.263324


We can measure the similarity between users with the Pearson correlation coefficient, which measures the strength of the linear relationship between two variables. Here the two variables are a pair of users' rating vectors over every movie in the pivot table, with unrated movies counted as 0. Values near 1 indicate similar rating patterns, and values near 0 indicate little linear relationship.
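To make the coefficient concrete, here is a minimal sketch that computes it by hand and checks it against NumPy. The function name `pearson` and the three rating vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: centered dot product over the product of centered norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical rating vectors for three users over the same four movies.
u1 = [5.0, 3.0, 4.0, 2.0]
u2 = [4.0, 2.0, 3.0, 1.0]   # u1 shifted down by 1: same taste, stricter scale
u3 = [2.0, 4.0, 3.0, 5.0]   # 7 - u1: exactly reversed preferences

print(pearson(u1, u2))
print(pearson(u1, u3))
```

Because both vectors are centered before they are compared, a user who rates everything one point lower than another still correlates perfectly with them, which is the property that makes the measure useful for comparing users with different rating habits.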

In [56]:
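from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Despite the name, each row of this matrix is one user's mean-centered rating vector.
movie_features_df_matrix = csr_matrix(filled_user_item_matrix.values)

# Brute-force neighbor search under cosine distance (1 - cosine similarity).
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(movie_features_df_matrix)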

We use scikit-learn's NearestNeighbors to find the k nearest neighbors of a user. The model compares users' mean-centered rating vectors using cosine distance, which is 1 minus cosine similarity, so a smaller distance means two users are more alike.
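The relationship between cosine similarity and cosine distance can be sketched with plain NumPy. The helper `cosine_distance` and the example vectors are hypothetical; the sketch assumes neither vector is all zeros.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - similarity)

# Same direction, different magnitude: distance close to 0.
same_direction = cosine_distance([1.0, -1.0, 0.0], [2.0, -2.0, 0.0])
# No overlap in rated items: distance 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite deviations from the mean: distance close to 2.
opposite = cosine_distance([1.0, -1.0], [-1.0, 1.0])

print(same_direction, orthogonal, opposite)
```

The distance therefore ranges from 0 to 2, and it ignores vector length, so a user who rates many movies is not automatically far from one who rates few.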

In [76]:
query_index = np.random.choice(user_corr_matrix_df.shape[0])
print(query_index)

30


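We pick a random row position to act as the query user whose neighbors we want to find. The printed value is a position in the matrix, not a userId: position 30 corresponds to userId 31 in the output below.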
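If a repeatable pick is wanted, one option is a seeded generator. The sketch below is an assumption about how that could look, not what the notebook does; `user_ids` is a hypothetical stand-in for `user_corr_matrix_df.index`, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical user index standing in for user_corr_matrix_df.index.
user_ids = pd.Index([1, 2, 3, 5, 8], name="userId")

# A seeded generator makes the "random" pick repeatable across runs.
rng = np.random.default_rng(seed=0)
query_position = int(rng.integers(len(user_ids)))

# The position is a row number, not a userId, so look the label up explicitly.
query_user_id = user_ids[query_position]
print(query_position, query_user_id)
```

Keeping the position and the label in separate variables avoids mixing them up when userIds are not a contiguous range.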

In [77]:
# Ask for 6 neighbors because the closest match is expected to be the query user itself.
distances, indices = model_knn.kneighbors(movie_features_df_matrix[query_index, :].reshape(1, -1), n_neighbors = 6)


In [78]:
similar_users = []
flat_distances = distances.flatten()
flat_indices = indices.flatten()
for i in range(len(flat_distances)):
    if i == 0:
        # Position 0 is the query user itself, so only print a header for it.
        print('Recommendations for {0}:\n'.format(user_corr_matrix_df.index[query_index]))
    else:
        # Map the neighbor's row position back to its userId label.
        neighbor_id = user_corr_matrix_df.index[flat_indices[i]]
        print('{0}: {1}, with distance of {2}:'.format(i, neighbor_id, flat_distances[i]))
        similar_users.append(neighbor_id)

Recommendations for 31:

1: 43, with distance of 0.6475665679593291:
2: 70, with distance of 0.6747446411980205:
3: 18, with distance of 0.6892149751942906:
4: 69, with distance of 0.7132821352368135:
5: 8, with distance of 0.7835943998361411:


In [80]:
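# All ratings made by the similar users, taken from the full ratings table.
similar_users_ratings = ratings[ratings['userId'].isin(similar_users)]

# Sort so the highest ratings come first.
highest_rated_movies = similar_users_ratings.sort_values(by='rating', ascending=False)

# Attach titles; a movie rated by several neighbors appears once per rating.
highest_rated_movies_names = pd.merge(highest_rated_movies, movie, on='movieId')['title']

print(highest_rated_movies_names)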

0                                      Caddyshack (1980)
1                                          Batman (1989)
2                 Monty Python and the Holy Grail (1975)
3                                          Batman (1989)
4                                  Reservoir Dogs (1992)
                              ...                       
1102    Star Wars: Episode I - The Phantom Menace (1999)
1103                      Lara Croft: Tomb Raider (2001)
1104                                        Speed (1994)
1105                                      Ben-Hur (1959)
1106                                         Fame (2009)
Name: title, Length: 1107, dtype: object


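We have now found the k nearest neighbors of a user and listed the movies those similar users rated, sorted by rating before the titles are attached. The list is not deduplicated (Batman (1989) appears twice above) and does not exclude movies the query user has already rated, so it is a pool of candidates rather than a final ranking.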
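One way to turn neighbor ratings into a ranked list is sketched below on made-up data. The frame `toy_ratings`, the user IDs, and the scoring rule (mean neighbor rating, ties broken by number of raters) are all assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical ratings table with the same columns as `ratings` above.
toy_ratings = pd.DataFrame({
    "userId":  [31, 31, 43, 43, 43, 70, 70, 18],
    "movieId": [1, 2, 1, 3, 4, 3, 4, 3],
    "rating":  [4.0, 3.0, 5.0, 4.5, 2.0, 3.5, 4.0, 5.0],
})
target_user = 31
neighbors = [43, 70, 18]

# Movies the target user has already rated should not be recommended again.
seen = set(toy_ratings.loc[toy_ratings["userId"] == target_user, "movieId"])

# Keep only neighbor ratings for movies the target user has not seen.
candidates = toy_ratings[
    toy_ratings["userId"].isin(neighbors) & ~toy_ratings["movieId"].isin(seen)
]

# One row per movie: mean neighbor rating and how many neighbors rated it.
scores = (
    candidates.groupby("movieId")["rating"]
    .agg(["mean", "count"])
    .sort_values(["mean", "count"], ascending=False)
)

top_movie_ids = scores.head(2).index.tolist()
print(scores)
```

Grouping by movie gives each title a single score, and the `count` column makes it easy to require a minimum number of raters before trusting an average.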