# Recommendation System
Student Name: Dacheng Wen (dachengw)

## Introduction

This tutorial introduces an approach to building a simple recommendation system.
According to Wikipedia, a recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" that a user would give to an item.

An everyday example is Amazon's recommendation engine:
[<img src="http://netdna.webdesignerdepot.com/uploads/amazon//recommended.jpg">](http://netdna.webdesignerdepot.com/uploads/amazon//recommended.jpg)

In principle, Amazon analyzes users' information (purchase history, browsing history, and more) to recommend what they may want to buy.

## Tutorial content

In this tutorial, we will build a simple offline recommendation system that recommends movies. This recommendation system is not practical or sophisticated enough for commercial use, but working through this tutorial should give you a sense of how a recommendation system works.

We will cover the following topics in this tutorial:
- [Expectation](#Expectation)
- [Downloading and loading data](#Downloading-and-loading-data)
- [Item-based collaborative filtering](#Item-based-collaborative-filtering)
- [Recommendation for new users](#Recommendation-for-new-users)
- [Summary](#Summary)

## Expectation

The recommendation system we will build can:
1. Take the existing rating data as input.
2. For each user, recommend at most k (k = 5 for this tutorial) movies that the user has not rated yet.

In [87]:
k = 5

## Downloading and loading data

We are going to use the open dataset provided by MovieLens (https://movielens.org/).
The dataset can be downloaded from http://grouplens.org/datasets/movielens/.
For this tutorial, we will use the u.data file from the smallest dataset (100K records).
According to the ReadMe (http://files.grouplens.org/datasets/movielens/ml-100k/README), this file contains ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered.
Each line is a tab-separated record of:

	         user id | item id | rating | timestamp. 
Note: 
1. An item is a movie, so the item id is the movie id. We use "item" and "movie" interchangeably in this tutorial.
2. For the simple recommendation system we are going to build, we only use the first three fields: user id, item id and rating. The timestamp is valuable information, but we ignore it in this tutorial for simplicity.
3. Ratings range from 1 to 5, where 5 is the best.

Although not necessary, it would be nice to be able to look up a movie title by its id. Therefore we also need to download the u.item file. The first two fields of every record in this file are

	         movie id | movie title | ...

Let's download these files:

In [88]:
import requests

def download_file(link_address, filename):
    response = requests.get(link_address, stream=True)   
    if (response.status_code == requests.codes.ok) :
        with open(filename, 'wb') as handle:
            for block in response.iter_content(1024):
                handle.write(block)
        print "Successfully downloaded " + filename
        return True
    else:
        print "Sorry, " + filename + " download failed"
        return False

# download user - movie ratings
download_file('http://files.grouplens.org/datasets/movielens/ml-100k/u.data', 'u.data')

# download movie id - movie map
download_file('http://files.grouplens.org/datasets/movielens/ml-100k/u.item', 'u.item')


Successfully downloaded u.data
Successfully downloaded u.item


True

Then read the files to memory:

In [105]:
# read u.data
user_rating_raw = []
with open('u.data') as f:
    for line in f:
        fields = line.split('\t')
        user_rating_raw.append([int(fields[0]), 
                                int(fields[1]), 
                                float(fields[2]), 
                                int(fields[3])])
        
print "Read u.data, got " + str(len(user_rating_raw)) + " rating records."
print
print "The first 5 records are:"

for row_index in range(5):
    print user_rating_raw[row_index]
    print

Read u.data, got 100000 rating records.

The first 5 records are:
[196, 242, 3.0, 881250949]

[186, 302, 3.0, 891717742]

[22, 377, 1.0, 878887116]

[244, 51, 2.0, 880606923]

[166, 346, 1.0, 886397596]



In [106]:
# read u.item
movie_title_map = {};
with open('u.item') as f:
    for line in f:
        fields = line.split('|')
        movie_title_map[int(fields[0])] = fields[1]

print "Read id-title map for " + str(len(movie_title_map)) + " movies."
print
print "The first 5 movies in the map are:"

for movie_id in range(1, 6):
    print (movie_id, movie_title_map[movie_id])
    print 

Read id-title map for 1682 movies.

The first 5 movies in the map are:
(1, 'Toy Story (1995)')

(2, 'GoldenEye (1995)')

(3, 'Four Rooms (1995)')

(4, 'Get Shorty (1995)')

(5, 'Copycat (1995)')



## Item-based collaborative filtering

Among the many recommendation algorithms, item-based collaborative filtering is one of the most popular. The recommendation algorithms used by Amazon and other websites are based on item-based collaborative filtering (https://en.wikipedia.org/wiki/Item-item_collaborative_filtering; see also the reference below).

We are going to implement a simple version of item-based collaborative filtering in this tutorial.
The idea of item-based collaborative filtering is to find similar items, and then recommend items that are similar to the ones in each user's history.

Let's say we found that _Star Wars (1977)_ is similar to _Return of the Jedi (1983)_. We then assume that users who like _Star Wars (1977)_ are going to enjoy _Return of the Jedi (1983)_ too. Therefore, if we find a user who has watched (rated) _Star Wars (1977)_ but has not watched (rated) _Return of the Jedi (1983)_, we will recommend _Return of the Jedi (1983)_ to that user.

For our MovieLens scenario, we need to:
1. Compute the similarity between movies based on the ratings
2. For each user, recommend movies that are similar to the movies rated by that user, excluding the movies that user has already rated.

Reference: 
* Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1), 76-80.

Before computing the similarity between movies, let's convert the raw data, user_rating_raw, into a matrix (NumPy 2D array), movie_user_mat.
Each element of movie_user_mat stores a rating. movie_user_mat is of size num_movie by num_user, and movie_user_mat\[i\]\[j\] is the j-th user's rating for the i-th movie. Therefore, each row stores all users' ratings for one movie, and each column stores one user's ratings for all movies.
Note that ratings range from 1 to 5, so we can use 0 to indicate that a user has not rated a movie.

In [92]:
import numpy as np

# number of movies and number of users, 
# these two numbers are from ReadMe (http://files.grouplens.org/datasets/movielens/ml-100k/README)
num_user = 943
num_movie = 1682
movie_user_mat = np.zeros((num_movie, num_user));

for user_rating_record in user_rating_raw:
    # minus 1 to convert the index (id) to 0 based
    user_index = user_rating_record[0] - 1
    movie_index = user_rating_record[1] - 1
    rating = user_rating_record[2]
    movie_user_mat[movie_index][user_index] = rating
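
As a side note, the same matrix can be filled without an explicit loop by using NumPy integer-array indexing. The sketch below is only an illustration on three made-up records in the u.data layout; the `toy_` names are hypothetical and not part of the tutorial's code.

```python
import numpy as np

# Three made-up records: [user id, item id, rating, timestamp]
toy_ratings = [[1, 2, 4.0, 0],
               [2, 1, 5.0, 0],
               [3, 2, 1.0, 0]]
toy_num_user = 3
toy_num_movie = 2

records = np.asarray(toy_ratings)
# ids are 1-based, array indices are 0-based
user_index = records[:, 0].astype(int) - 1
movie_index = records[:, 1].astype(int) - 1

toy_movie_user_mat = np.zeros((toy_num_movie, toy_num_user))
# a single assignment fills every (movie, user) cell that has a rating
toy_movie_user_mat[movie_index, user_index] = records[:, 2]

print(toy_movie_user_mat)
```

Rows are still movies and columns are still users, so the rest of the tutorial would work the same way with a matrix built like this.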

Now that we have the movie-user matrix, we can perform the first step: computing the similarity between movies. We will use cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity). Because each row holds all users' ratings for one movie, we treat the rows as the input vectors. Note that the similarity matrix, movie_similarity_mat, is symmetric (movie_similarity_mat\[i\]\[j\] = movie_similarity_mat\[j\]\[i\]).

In [93]:
import scipy.spatial as scp

movie_similarity_mat = np.zeros((num_movie, num_movie))
for i in range(num_movie):
    movie_i_rating = movie_user_mat[i]
    for j in range(i, num_movie):
        movie_j_rating = movie_user_mat[j]
        cos_similarity = 1.0 - scp.distance.cosine(movie_i_rating, movie_j_rating)
        movie_similarity_mat[i][j] = cos_similarity
        movie_similarity_mat[j][i] = cos_similarity
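
The double loop above calls the distance function once per pair of movies. A possible alternative, sketched here under stated assumptions, normalizes every row to unit length and takes one matrix product, since the cosine similarity of two vectors equals the dot product of their unit vectors. The helper name `cosine_similarity_rows` and the toy matrix are hypothetical, and treating an all-zero row as having similarity 0 to everything is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity_rows(mat):
    """Cosine similarity between every pair of rows of a 2-D array."""
    mat = np.asarray(mat, dtype=float)
    norms = np.sqrt((mat ** 2).sum(axis=1))
    # Assumption: a row without any rating (zero norm) is left as all zeros,
    # so its similarity to every row, itself included, comes out as 0
    norms[norms == 0] = 1.0
    unit_rows = mat / norms[:, np.newaxis]
    return unit_rows.dot(unit_rows.T)

# made-up ratings: 3 movies (rows) by 2 users (columns)
toy_ratings = [[1, 0],
               [1, 1],
               [0, 2]]
toy_similarity = cosine_similarity_rows(toy_ratings)
print(toy_similarity)
```

The result is symmetric by construction, and rows that share no raters (the first and third here) get a similarity of 0.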

Finally, we can compute which movies should be recommended to each user.
To do so, for each user we need to compute his / her interest in each movie. We represent this interest with a coefficient.

The coefficient that indicates the j-th user's interest in the i-th movie (a larger coefficient means the user is more interested in that movie) is
$$ coefficient[i][j]= \sum_{k=1}^n similarity[k-1][i] * rating[k-1][j]$$
where n is the number of movies, similarity\[k-1\]\[i\] is movie_similarity_mat\[k-1\]\[i\] (the similarity between the (k-1)-th movie and the i-th movie), and rating\[k-1\]\[j\] is movie_user_mat\[k-1\]\[j\] (the j-th user's rating of the (k-1)-th movie).

Note that this equation is equivalent to
$$ coefficient[i][j]= \sum_{k=1}^n similarity[i][k-1] * rating[k-1][j]$$
because movie_similarity_mat is symmetric.

This may look confusing, so let's take a small dataset (stored in test_rat) as an example.

In [95]:
test_rat = np.asarray([[0,1,5],
                       [1,0,5],
                       [5,0,0],
                       [0,5,3]]);
test_simi = np.zeros((4, 4))
for i in range(4):
    movie_i_rating = test_rat[i]
    for j in range(i, 4):
        movie_j_rating = test_rat[j]
        cos_similarity = 1.0 - scp.distance.cosine(movie_i_rating, movie_j_rating)
        test_simi[i][j] = cos_similarity
        test_simi[j][i] = cos_similarity
print "movie-rating:"
print test_rat
print
print "similarities:"
print test_simi

movie-rating:
[[0 1 5]
 [1 0 5]
 [5 0 0]
 [0 5 3]]

similarities:
[[ 1.          0.96153846  0.          0.67267279]
 [ 0.96153846  1.          0.19611614  0.5045046 ]
 [ 0.          0.19611614  1.          0.        ]
 [ 0.67267279  0.5045046   0.          1.        ]]


For the first user (user 0), his / her interest in the first movie (movie 0) should be:
$$ coefficient[0][0] = rating[0][0] * similarity[0][0] + rating[1][0] * similarity[1][0] + rating[2][0] * similarity[2][0] + rating[3][0] * similarity[3][0] $$
$$ coefficient[0][0] = 0 * 1 + 1 * 0.96153846 + 5 * 0 + 0 * 0.67267279 = 0.96153846 $$

His / her interest in the last movie (movie 3) should be:

$$ coefficient[3][0] = 0 * 0.67267279 + 1 * 0.5045046 + 5 * 0 + 0 * 1 = 0.5045046 $$

Because 0.96153846 > 0.5045046, we should recommend the first movie rather than the last one if we can recommend only one movie.

Note that the equation
$$ coefficient[i][j]= \sum_{k=1}^n similarity[i][k-1] * rating[k-1][j]$$
is simply a matrix dot operation:
$$coefficient = similarity.dot(rating)$$
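
As a quick sanity check, the sketch below recomputes the two hand-calculated coefficients for the small example, once with the explicit sum over k and once with the matrix product. It rebuilds the example matrix and computes the cosine similarities with plain NumPy so that it runs on its own; the variable names are chosen for this illustration only.

```python
import numpy as np

# the small example: 4 movies (rows) by 3 users (columns)
rating = np.asarray([[0, 1, 5],
                     [1, 0, 5],
                     [5, 0, 0],
                     [0, 5, 3]], dtype=float)

# cosine similarity between rows: normalize each row, then multiply
unit_rows = rating / np.sqrt((rating ** 2).sum(axis=1))[:, np.newaxis]
similarity = unit_rows.dot(unit_rows.T)

# explicit sum over k for movie i = 0 and user j = 0
by_hand = sum(similarity[0][k] * rating[k][0] for k in range(4))

# the same quantity for every (movie, user) pair at once
coefficient = similarity.dot(rating)

print(by_hand)
print(coefficient[0][0])
print(coefficient[3][0])
```

Both routes give about 0.9615 for movie 0 and user 0, and the matrix product gives about 0.5045 for movie 3, matching the numbers worked out by hand.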

The last detail we need to take care of is that we should not recommend a movie that the user has already rated. If a user has already rated _Star Wars (1977)_, we should not recommend _Star Wars (1977)_ to this user.

We store the coefficients in recommendation_coefficient_mat, and store the ids of the recommended movies for each user in a dictionary, recommendation_per_user.

In [110]:
import heapq
# find the n keys with the largest values in a dictionary
# http://www.pataprogramming.com/2010/03/python-dict-n-largest/
def dict_nlargest(d,n):
    return heapq.nlargest(n, 
                          d, 
                          key = lambda t: d[t])

# num_movie by num_user = (num_movie by num_movie) * (num_movie by num_user)
recommendation_coefficient_mat = movie_similarity_mat.dot(movie_user_mat)
recommendation_per_user = {}

for user_index in range(num_user):
    recommendation_coefficient_vector = recommendation_coefficient_mat.T[user_index]
    # zero out the coefficients of movies this user has already rated
    unrated_movie = (movie_user_mat.T[user_index] == 0)
    recommendation_coefficient_vector *= unrated_movie
    recommendation_coefficient_dict = {movie_id:coefficient 
                                      for movie_id, coefficient 
                                      in enumerate(recommendation_coefficient_vector)}
    recommendation_per_user[user_index] = dict_nlargest(recommendation_coefficient_dict, k)
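
The cell above sets the coefficient of every rated movie to 0, so a rated movie can tie with an unrated movie that has no similarity to anything the user rated. One possible variation, shown here only as a sketch, masks rated movies with negative infinity and keeps only positive scores, which returns fewer than k movies when there are not enough candidates. The function name `top_k_unrated` and the toy vectors are hypothetical, and dropping zero-score movies is an assumption of this sketch rather than something the tutorial's code does.

```python
import numpy as np

def top_k_unrated(coefficients, ratings, k):
    """Indices of up to k unrated movies with the largest positive coefficient."""
    # work on a copy so the caller's coefficient array is left untouched
    scores = np.array(coefficients, dtype=float)
    # a rated movie (rating != 0) can never be selected
    scores[np.asarray(ratings) != 0] = -np.inf
    # sort by descending score
    ranked = np.argsort(-scores)
    # Assumption: a score of 0 carries no evidence of interest, so skip it
    return [int(i) for i in ranked[:k] if scores[i] > 0]

# one made-up user: coefficients for 5 movies and that user's ratings
toy_coefficients = [0.9, 7.5, 0.0, 0.5, 3.0]
toy_user_ratings = [0, 5, 0, 0, 4]

print(top_k_unrated(toy_coefficients, toy_user_ratings, 3))
```

Here movies 1 and 4 are excluded because they are already rated, and movie 2 is excluded because its coefficient is 0, so only movies 0 and 3 are returned even though k is 3.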

So the recommended movies for the first user are:

In [107]:
print "(movie id, title)"
for movie_id in recommendation_per_user[0]:
    # movie_id is a 0-based index here; add 1 to get the 1-based id used by the title map
    print (movie_id, movie_title_map[movie_id + 1])
    print 

(movie id, title)
(422, 'E.T. the Extra-Terrestrial (1982)')

(654, 'Stand by Me (1986)')

(567, 'Speed (1994)')

(402, 'Batman (1989)')

(384, 'True Lies (1994)')



## Recommendation for new users

We mentioned that we can use users' information to recommend movies, but what if we have a new user that we have no information about? The coefficients for that user will all be zero, and it makes no sense to pick the top 5 elements from an array of zeros.

What movies should we recommend? One option is to recommend the movies that have been rated by the largest number of users. This is similar to recommending "best sellers" on Amazon.com to new users.

In [109]:
import collections

movie_rated_counter = collections.Counter([rating_record[1] 
                                           for rating_record in user_rating_raw])
most_rated_movies = movie_rated_counter.most_common(k)

print "The most rated 5 movies are:\n"
for movie_id, rated_count in most_rated_movies:
    print (movie_id, movie_title_map[movie_id], rated_count)
    print 

The most rated 5 movies are:

(50, 'Star Wars (1977)', 583)

(258, 'Contact (1997)', 509)

(100, 'Fargo (1996)', 508)

(181, 'Return of the Jedi (1983)', 507)

(294, 'Liar Liar (1997)', 485)



So we can recommend these five movies to new users: Star Wars (1977), Contact (1997), Fargo (1996), Return of the Jedi (1983), and Liar Liar (1997).
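
To show how the two strategies could be combined, here is a minimal sketch of a lookup that returns personalized picks when a user has any, and otherwise falls back to the most-rated movies. The function `recommend`, its parameters, and the toy data are all hypothetical names introduced for this illustration.

```python
import collections

def recommend(user_id, personal_recommendations, rating_records, k):
    """Return personalized picks if there are any, else the k most-rated movie ids."""
    picks = personal_recommendations.get(user_id, [])
    if picks:
        return picks[:k]
    # fallback for new users: count how many ratings each movie received
    counts = collections.Counter(record[1] for record in rating_records)
    return [movie_id for movie_id, _ in counts.most_common(k)]

# made-up records in the [user id, item id, rating] layout
toy_records = [[1, 10, 4.0], [2, 10, 5.0], [3, 10, 3.0],
               [1, 20, 2.0], [2, 20, 3.0],
               [3, 30, 5.0]]
toy_personal = {1: [30], 2: [30]}

print(recommend(1, toy_personal, toy_records, 2))   # known user
print(recommend(99, toy_personal, toy_records, 2))  # new user
```

The known user gets the personalized list, while the unknown user 99 gets the two movies with the most ratings in the toy data.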

## Summary

In short, we implemented a simple recommendation system using item-based collaborative filtering :)

However, this recommendation system is very simple, and there are many details we have not taken care of. For example, a movie that has not been rated by any user will likely never be recommended, which means the coverage of this system (the share of movies it can ever recommend) is limited.
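
One way to put a number on coverage is to count how many distinct movies appear in at least one user's recommendation list and divide by the size of the catalog. This definition is an assumption made for the sketch below, since the tutorial does not define the metric, and the helper name `catalog_coverage` and the toy dictionary are hypothetical.

```python
def catalog_coverage(recommendations_per_user, num_items):
    """Fraction of the catalog that shows up in at least one recommendation list."""
    recommended = set()
    for items in recommendations_per_user.values():
        recommended.update(items)
    return len(recommended) / float(num_items)

# made-up recommendation lists for 3 users over a catalog of 10 movies
toy_recommendations = {0: [1, 2], 1: [2, 3], 2: [1, 3]}

print(catalog_coverage(toy_recommendations, 10))
```

In this toy case only 3 of the 10 movies are ever recommended. The same function could be applied to a dictionary shaped like recommendation_per_user, with num_movie as the catalog size.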

For the curious minds, you are welcome to explore more sophisticated systems :)