Recommender system - Collaborative filtering 


In this tutorial, I'm going to introduce a recommender system method called "Collaborative filtering". Nowadays, many big companies such as Spotify, Netflix, Amazon, Facbook are using Collaborative filtering to create their recommender system. While their algorithm might be way more complicated than this tutorial, it's good to get the basic idea here.
First of all, I will show how to introduce the concept of this method and implement a simple version. Then show how to use a recommender system library called "Crab". The dataset I use for testing is "MovieLens 100K Dataset" that downloaded from http://grouplens.org/datasets/movielens/100k/.

Concept: The way how Collaborative filtering recommends an item to an user is that, it compares the similarity among all users to find the one with the most similar taste with you and recommend what he/she likes for you. In order to show the basic idea of this method, I simply use "Euclidean Distance" to calculate the similarity. Let's say user1 and user2 both rated movie1, movie2 and movie3. The similarity between them is sum of the square of each rating difference and take an square root of this summation, just like how we compute distance in high dimension space. Once this is done, we roughly know who has the similar taste with you, so we can recommend his/her favorite movie that you haven't watched to you. Let's look more into the implementation part.

First of all, import the library and all the dependency that we need. Here I also import the needed module for the later introduction of Crab library.

In [8]:
import pandas as pd
import numpy as np
import random
from collections import Counter
from copy import deepcopy
from scikits.crab import datasets
from scikits.crab.models import MatrixPreferenceDataModel
from scikits.crab.metrics import pearson_correlation
from scikits.crab.similarities import UserSimilarity
from scikits.crab.recommenders.knn import UserBasedRecommender
from scikits.crab.recommenders.knn.neighborhood_strategies import NearestNeighborsStrategy 

ImportError: No module named crab

Let's load the movie dataset and take an simple processing. There are three columns, 'userId', 'itemId', and 'rating'. Each row is a particular user gives an particular a rating.

In [9]:
def loadData(filename):
	data = pd.read_csv(filename, header=None, sep=r"\s+")
	data = data.rename(columns = {0:'userId', 1:'itemId', 2:'rating', 3:'timestamp'})
	data = data.sort_values('userId', ascending=True)
	data = data.drop('timestamp', 1)
	return data

filename = "data/u.data"
data = loadData(filename)
print data


       userId  itemId  rating
66567       1      55       5
62820       1     203       4
10207       1     183       5
9971        1     150       5
22496       1      68       4
9811        1     201       3
9722        1     157       4
9692        1     184       4
9566        1     210       4
9382        1     163       4
35409       1     271       2
22669       1     146       4
22845       1     176       5
35389       1     144       4
9307        1      53       3
334         1     160       4
333         1      33       4
9255        1      44       5
9224        1      97       3
9170        1      14       5
62649       1     263       1
62631       1       4       3
1780        1      17       3
820         1     265       4
56881       1      25       4
37820       1      37       2
57256       1     251       4
57423       1     195       5
11080       1     148       2
21152       1     179       3
...       ...     ...     ...
68943     943     808       4
84061     

Then we have to process the data into the typical format that Collaborative filtering needs. The format of the input data should be formatted like this { {userId1: {itemId1: rating1, itemId2: rating2.....} {userId2: {itemId1: rating1, itemId2: rating2.....} ... ... ... } 


Example: { 1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0}, 2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0}, 3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0}, 4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0}, 5: {2: 4.5, 3: 1.0, 4: 4.0}, 6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5}, 7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}} }

Also, since different users have different rating habit, I added normalization to elimiate this bias.

In [11]:
def normalize(rating_dict):
	rating_max = max(rating_dict.values())
	rating_min = min(rating_dict.values())
	rating_range = rating_max - rating_min
	for rating in rating_dict:
		rating_dict[rating] = (rating_dict[rating] - rating_min) / rating_range
	return rating_dict

def processData(data):
	data_dict = {}
	rating_dict = {}
	userId_prev = -1 
	for i in xrange(data.shape[0]):
		userId_curr = data.userId.values[i]
		itemId = data.itemId.values[i]
		rating = float(data.rating.values[i])
		if userId_curr != userId_prev and i != 0:
			rating_dict = normalize(rating_dict)
			data_dict[userId_prev] = deepcopy(rating_dict)
			rating_dict = {}
			# userId_prev = userId_curr
		elif i == data.shape[0] - 1 and userId_prev == userId_curr:
			rating_dict[itemId] = rating
			rating_dict = normalize(rating_dict)
			data_dict[userId_prev] = deepcopy(rating_dict)
		elif i == data.shape[0] - 1 and userId_prev != userId_curr:
			rating_dict = normalize(rating_dict)
			data_dict[userId_prev] = deepcopy(rating_dict)
			rating_dict = {}
			rating_dict[itemId] = rating
			rating_dict = normalize(rating_dict)
			data_dict[userId_curr] = deepcopy(rating_dict)
			break
		rating_dict[itemId] = rating
		userId_prev = userId_curr
	return data_dict

data_dict = processData(data)
print data_dict

{1: {1: 1.0, 2: 0.5, 3: 0.75, 4: 0.5, 5: 0.5, 6: 1.0, 7: 0.75, 8: 0.0, 9: 1.0, 10: 0.5, 11: 0.25, 12: 1.0, 13: 1.0, 14: 1.0, 15: 1.0, 16: 1.0, 17: 0.5, 18: 0.75, 19: 1.0, 20: 0.75, 21: 0.0, 22: 0.75, 23: 0.75, 24: 0.5, 25: 0.75, 26: 0.5, 27: 0.25, 28: 0.75, 29: 0.0, 30: 0.5, 31: 0.5, 32: 1.0, 33: 0.75, 34: 0.25, 35: 0.0, 36: 0.25, 37: 0.25, 38: 0.5, 39: 0.75, 40: 0.5, 41: 0.25, 42: 1.0, 43: 0.75, 44: 1.0, 45: 1.0, 46: 0.75, 47: 0.75, 48: 1.0, 49: 0.5, 50: 1.0, 51: 0.75, 52: 0.75, 53: 0.5, 54: 0.5, 55: 1.0, 56: 0.75, 57: 1.0, 58: 0.75, 59: 1.0, 60: 1.0, 61: 0.75, 62: 0.5, 63: 0.25, 64: 1.0, 65: 0.75, 66: 0.75, 67: 0.5, 68: 0.75, 69: 0.5, 70: 0.5, 71: 0.5, 72: 0.75, 73: 0.5, 74: 0.0, 75: 0.75, 76: 0.75, 77: 0.75, 78: 0.0, 79: 0.75, 80: 0.75, 81: 1.0, 82: 1.0, 83: 0.5, 84: 0.75, 85: 0.5, 86: 1.0, 87: 1.0, 88: 0.75, 89: 1.0, 90: 0.75, 91: 1.0, 92: 0.5, 93: 1.0, 94: 0.25, 95: 0.75, 96: 1.0, 97: 0.5, 98: 0.75, 99: 0.5, 100: 1.0, 101: 0.25, 102: 0.25, 103: 0.0, 104: 0.0, 105: 0.25, 106: 0.75,

Next, we have to define our similarity method. As mentioned above, I'm using Eucliean Distance here. The output "similarity" is a dictionary of Euclidean distance of every other users to the target user, which is user number 5.

In [14]:
def computeSimilarity(target_user, data_dict):
	similarity = {}
	for user in data_dict.keys():
		distance = 0;
		if user != target_user:
			for itemId in data_dict[target_user]:
				if itemId in data_dict[user]:
					distance += (data_dict[target_user][itemId] - data_dict[user][itemId])**2
		if distance != 0:
			similarity[user] = distance
	return similarity
target_user = 5
similarity = computeSimilarity(target_user, data_dict)

Based on the similarity calculated above, we can find several users with the similar movie taste. Let's do top 10 here.

In [15]:
def getTopSimilarUsers(top, similarity):
	similar_users = []
	similarity = sorted(similarity.items(), key=lambda x:x[1])
	for i in xrange(top):
		similar_users.append(similarity[i][0])
	return similar_users
top = 10
similar_users = getTopSimilarUsers(top, similarity)
print similar_users

[319, 728, 3, 34, 61, 107, 134, 212, 242, 444]


Imagine a kind of user that always rate movies the same score, the infomation they provide might be relatively useless. In order to eliminate this factor, we take the variance of user's rating into account. The greater a user's rating variance is, the more important his rating is to our recommender system. Hence, we are going to calculate users' rating variance as their weights.

In [16]:
def calculateVariance(similar_users, data_dict):
	variance = {}
	for user in similar_users:
		rating_list = []
		for rating in data_dict[user].itervalues():
			rating_list.append(rating)
		variance[user] = np.var(rating_list)
	return variance
user_variance = calculateVariance(similar_users, data_dict)

Having the variance to be user's weights, I can calculate the average rating with the weights of each movies from the users. Hence, we can find the common favorite movie that your similar users have. The output is the ordered recommended movies for the target user.

In [20]:
def findCommonFavoriteItem(similar_users, user_variance, data_dict):
	items = []
	score = {}
	favorite_items = []
	for user in similar_users:
		for itemId, rating in data_dict[user].iteritems():
			if not itemId in score:
				score[itemId] = rating * user_variance[user] 
			else:
				score[itemId] += rating * user_variance[user] 
			items.append(itemId)
	for itemId in items:
		score[itemId] = score[itemId] / items.count(itemId)

	score = sorted(score.items(), key=lambda x:x[1], reverse=True)
	for item, score in score:	
		favorite_items.append(item)
	return favorite_items
recommended_list = findCommonFavoriteItem(similar_users, user_variance, data_dict)
print recommended_list

[1137, 1152, 740, 361, 934, 1357, 1024, 242, 292, 898, 899, 990, 528, 87, 527, 631, 197, 199, 246, 382, 902, 320, 991, 539, 316, 86, 735, 423, 427, 511, 9, 50, 251, 1127, 111, 283, 181, 260, 344, 348, 906, 912, 305, 645, 191, 1243, 1, 275, 329, 347, 303, 334, 343, 345, 349, 351, 354, 355, 1355, 267, 880, 318, 315, 15, 25, 116, 117, 147, 742, 282, 285, 287, 471, 299, 312, 310, 879, 326, 916, 324, 100, 515, 317, 321, 346, 304, 690, 237, 863, 127, 272, 342, 330, 352, 682, 689, 750, 261, 358, 508, 124, 1483, 264, 350, 325, 339, 319, 306, 331, 340, 332, 243, 245, 327, 289, 307, 338, 288, 322, 301, 328, 313, 751, 678, 748, 333, 302, 271, 259, 268, 258, 300, 294, 323, 269, 286, 546, 179, 180, 291, 335, 336, 337, 341, 353, 871, 892, 475, 1011]


Finally, we iterate through the recommended list and find the movie the target user haven't watched, which is our recommended movie.

In [22]:
for item in recommended_list:
	if not item in data_dict[target_user]:
		recommended_item = item
		break

movie_list = pd.read_csv("data/u.item", header=None, sep="|")
print 'The movie recommended by my simple implementation for user no.' ,target_user, ' is :' , movie_list.iloc[recommended_item,1]

The movie recommended by my simple implementation for user no. 5  is : Best Men (1997)


With the introduction shown above, you might have a basic idea of how Collaborative filtering works. This is the simplest implementation of it. There are many variation that you could try. Such as different way to calculate similarity between users, and different weighting method. 


In the next section, I'm going to show the usage of "Crab", a recommender system for Python.

First of all, let's take a look at what Crab is.


"Crab as known as scikits.recommender is a Python framework for building recommender engines integrated with the world of scientific Python packages (numpy, scipy, matplotlib).
The engine aims to provide a rich set of components from which you can construct a customized recommender system from a set of algorithms and be usable in various contexts: ** science and engineering ** ." 
-- from https://muricoca.github.io/crab/

Installation:
If you go to Crab website, it might ask you to install with pip. However, I would recommend to go to its github source and install it from there.
1. git clone https://github.com/muricoca/crab.git
2. cd crab
3. python setup.py install


Unfortunately, since Crab library is not perfectly developed yet, there are still some bugs and version issues that user might have to google search and solve it. Here's the issue that I met, and thanks to this link showing how to solve these issues. http://datapandas.com/index.php/tag/no-module-named-learn-base/

Issue1:

from scikits.crab.recommenders.knn import UserBasedRecommender.
...
ImportError: No module named learn.base

Solution1:

go to crab/base.py and replace the whole of “scikits.learn” with “sklearn”

Issue2:

No Attribute named _set_params

Solution2:

go to crab/scikit/crab/recommenders/knn/class.py, in line 138 and 600 replace ”self._set_params (** params )" with "self.set_params (** params)"

Then you are good to compile the code!

First let's initiate some variables. The target of this setting is to use 1000 rating data from multiple users to multiple movie items, to recommend top 3 potentially wanted movies to user number 5. The reason why I used only 1000 rating data is because the computational speed of running this library on the whole dataset will be very slow. Since we want to understand the idea of this library and algorithm, we might not want to use the whole 100K data because it's taking so long. Here we cut the raw dataset into 1000 rows, which can be processed within 5 seconds.

In [23]:

rows = 1000
top = 3
data = data.iloc[:rows,:]
data_dict = processData(data)

Now, we are ready to feed data and use this library! First choose a preference model within the library. While there are several models that we can use, and we can also modify the source code to customize our application, we are using the simplest preference model here.
For the similarity function, here I chose pearson correlation method. You could try different method that I commented below. if you are interested. Also, we could add a neighborhood strategy to the recommender.

In [13]:
model = MatrixPreferenceDataModel(data_dict)
similarity = UserSimilarity(model, pearson_correlation)
neighborhood = NearestNeighborsStrategy()
# similarity = UserSimilarity(model, euclidean_distances, 3)
# similarity = UserSimilarity(model, cosine_distances)
# similarity = UserSimilarity(model, jaccard_coefficient)

NameError: name 'MatrixPreferenceDataModel' is not defined

Everything's ready, we can create the recommender and ask it to recommend the movies for us.

In [12]:
recommender = UserBasedRecommender(model, similarity, neighborhood, with_preference=True)
results = recommender.recommend(target_user)

NameError: name 'UserBasedRecommender' is not defined

Since the results we got from the above contains more than 3 best movies with the same highest rating, I randomly selected 3 of these best recommended movies to be the top 3 for our targeting user.

In [24]:
rating_best = max(results,key=lambda item:item[1])[1]
top_results = Counter(elem[1] for elem in results)[rating_best] # all the best rating movies(more than wanted)
selected_best = random.sample(range(0, top_results-1),  top) # randomly select number of top movies
print 'The top', top, 'movies recommended by Crab library for user no.', target_user, 'are: ', [movie_list[1][i] for i in selected_best]

NameError: name 'results' is not defined

Above is my tutorial of the recommender method Collaborative Filtering. By my understanding, there are many different application with different kinds of restriction. This tutorial is intended to make people understand the idea behind. Hope you like it!