# Basics of Model Based Collaborative Filtering

This excerpt and technique are taken from: https://blog.dominodatalab.com/recommender-systems-collaborative-filtering/

Now that we have a concrete method for defining the similarity between vectors, we can discuss how to use it to identify similar users. The problem setup is as follows:

1. We have an $N \times M$ matrix of the ratings given by $N$ users to $M$ items. Each element $(i, j)$ of the matrix represents how user $i$ rated item $j$. Since we are working with movie ratings, each rating is an integer from 1 to 5 (one star to five stars) if user $i$ has rated movie $j$, and 0 if the user has not rated that movie.

2. For each user, we want to recommend a set of movies that they have not seen yet (the movie rating is 0). To do this, we will effectively use an approach that is similar to **weighted K-Nearest Neighbors.**

3. For each movie $j$ that user $i$ has not seen yet, we find the set of users $U$ who are similar to user $i$ and have seen movie $j$. For each similar user $u$, we take $u$'s rating of movie $j$ and multiply it by the cosine similarity of user $i$ and user $u$. We sum these weighted ratings and divide by the number of users in $U$ to get a weighted average rating for movie $j$.

4. Finally, we sort the movies by their weighted average ratings. These averages serve as an estimate of how the user will rate each movie. Movies with higher weighted averages are more likely to be favored by the user, so we recommend the movies with the highest weighted average ratings.
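
The four steps above can be sketched in plain Python. This is only an illustration under a few assumptions: the function names and the toy `ratings` matrix are made up for this example, and every user who has rated the movie is treated as a neighbor (no cutoff at the K most similar users).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings is similar to nobody
    return dot / (norm_a * norm_b)


def predict_rating(ratings, i, j):
    """Step 3: similarity-weighted average of other users' ratings of movie j."""
    weighted = []
    for u, row in enumerate(ratings):
        if u == i or row[j] == 0:
            continue  # skip user i and users who have not rated movie j
        weighted.append(cosine_similarity(ratings[i], row) * row[j])
    if not weighted:
        return 0.0
    # Divide by the number of contributing users, as described in step 3.
    return sum(weighted) / len(weighted)


def recommend(ratings, i, top_n=3):
    """Steps 2 and 4: score every unseen movie and return the best top_n."""
    unseen = [j for j, r in enumerate(ratings[i]) if r == 0]
    scored = [(j, predict_rating(ratings, i, j)) for j in unseen]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy data: 3 users x 4 movies, 0 means "not rated".
ratings = [
    [5, 4, 0, 0],
    [5, 4, 4, 1],
    [1, 2, 5, 5],
]
recs = recommend(ratings, 0)
print(recs)
```

Dividing by the number of neighbors follows the description above. Many implementations, including Surprise's `KNNBasic`, divide by the sum of the similarities instead, which keeps the estimate on the original rating scale.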

**I said earlier that this procedure is similar to a weighted K-Nearest Neighbors algorithm. We take the set of users who have seen movie j as the training set for K-NN and each user who has not seen the movie as a test point. For each user who has not seen the movie (test point), we compute the similarity to users who have seen the movie, and assign an estimated rating based on the known ratings of the neighbors.**

For a concrete example, let's say I have not seen the movie *Hidden Figures*. I have seen and rated (on a 5-star scale) a ton of other movies though. With this information, you want to predict what I will rate Hidden Figures. **Based on my rating history, you can find a group of users who rate movies similarly to me and have also seen Hidden Figures.** 

To keep this example simple, let's say we look at the two users who are most similar to me and have seen the movie. Let's say User 1 has 95% similarity to me and gave the movie a four-star rating, and User 2 has 80% similarity to me and gave the movie a five-star rating. Now my predicted rating is the average of $0.95 \times 4 = 3.8$ (similarity $\times$ rating of User 1) and $0.80 \times 5 = 4$ (similarity $\times$ rating of User 2), so I am predicted to give the movie a rating of 3.9.
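
Written out as a single expression, with $\hat{r}$ used here (as an assumed symbol, not one from the original post) for my predicted rating of *Hidden Figures*:

$$\hat{r} = \frac{0.95 \times 4 + 0.80 \times 5}{2} = \frac{3.8 + 4.0}{2} = 3.9$$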

# The code

I am going to use the *Surprise* library to build the recommendation engine on the standard **MovieLens** dataset.

In [22]:
# Only Dataset and KNNBasic are used below; the unused `evaluate` import is dropped.
from surprise import Dataset
from surprise import KNNBasic

In [4]:
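# Load the built-in MovieLens 100k dataset (downloaded on first use).
data = Dataset.load_builtin("ml-100k")
# Use every available rating for training.
trainingSet = data.build_full_trainset()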

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /home/uddeshya/.surprise_data/ml-100k


For our task, we want to use the cosine similarity between movies to make new recommendations. Although I explained collaborative filtering based on user similarity, we can just as easily use item-item similarity to make recommendations. With item-item collaborative filtering, each movie has a vector of all its ratings, and we compute the cosine similarity between two movies' rating vectors.
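
To make the item-item idea concrete, here is a minimal sketch in plain Python. The helper names and the toy matrix are hypothetical, and this is not how Surprise computes its similarity matrix internally. Unrated entries are treated as 0, following the matrix convention described earlier; a library may instead restrict the computation to users who rated both movies.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def item_similarity_matrix(ratings):
    """Build the movie-by-movie cosine similarity matrix.

    `ratings` is a user x movie matrix; each movie's vector is one column.
    """
    item_vectors = [list(column) for column in zip(*ratings)]
    return [[cosine(a, b) for b in item_vectors] for a in item_vectors]


# Hypothetical toy data: 3 users x 3 movies, 0 means "not rated".
ratings = [
    [5, 4, 0],
    [4, 5, 1],
    [0, 1, 5],
]
sim = item_similarity_matrix(ratings)
for row in sim:
    print([round(value, 3) for value in row])
```

In this toy matrix the first two movies are rated alike by the same users, so their similarity is much higher than the similarity between the first and third movies.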

In [11]:
sim_options = {
    'name': 'cosine',      # similarity measure
    'user_based': False    # False: compare items (movies) instead of users
}

knn = KNNBasic(sim_options=sim_options)

In [16]:
# fit() replaces the older train() method, which recent Surprise releases no longer provide.
knn.fit(trainingSet)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f1437ffee90>

Now that we have trained our model, we want to make movie recommendations for users. Using the `build_anti_testset` method, we can find all user-movie pairs in the training set where the user has not rated the movie and create a "testset" out of these entries.
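
Conceptually, an anti-testset is just the complement of the known ratings. The sketch below shows that idea on a toy dictionary; the function name and the data are hypothetical, and it is not Surprise's own code. The third element of each tuple is a placeholder rating, and to my understanding Surprise fills it with the global mean rating by default.

```python
def build_anti_testset_sketch(rated, fill):
    """Return (user, item, fill) for every item a user has NOT rated.

    `rated` maps user id -> {item id: rating}.
    """
    all_items = {iid for items in rated.values() for iid in items}
    anti = []
    for uid, items in rated.items():
        for iid in sorted(all_items - set(items)):
            anti.append((uid, iid, fill))
    return anti


# Hypothetical toy data.
rated = {
    "alice": {"m1": 5, "m2": 3},
    "bob": {"m2": 4, "m3": 2},
}
all_ratings = [r for items in rated.values() for r in items.values()]
global_mean = sum(all_ratings) / len(all_ratings)

anti = build_anti_testset_sketch(rated, fill=global_mean)
print(anti)
```

Because every user is paired with every unrated movie, this list grows quickly on a real dataset, so predicting over it can take a while.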

In [17]:
testSet=trainingSet.build_anti_testset()

In [18]:
predictions=knn.test(testSet)

In [23]:
from collections import defaultdict

# Utility function: keep the topN highest-estimated movies for each user
def get_top3_recommendations(predictions,topN=3):
    top_recs=defaultdict(list)
    for uid,iid,true_r,est,_ in predictions:
        top_recs[uid].append((iid,est))
        
    for uid,user_ratings in top_recs.items():
        user_ratings.sort(key=lambda x:x[1],reverse=True)
        top_recs[uid]=user_ratings[:topN]
        
    return top_recs


In [24]:
import os, io

def read_item_names():
    """Read the u.item file from the MovieLens 100k dataset and return a
    mapping from raw ids to movie names.
    """

    file_name = (os.path.expanduser('~') +
                 '/.surprise_data/ml-100k/ml-100k/u.item')
    rid_to_name = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]

    return rid_to_name


In [25]:
top3_recommendations = get_top3_recommendations(predictions)
rid_to_name = read_item_names()
for uid, user_ratings in top3_recommendations.items():
    print(uid, [rid_to_name[iid] for (iid, _) in user_ratings])


(u'344', [u'Raising Arizona (1987)', u'Taxi Driver (1976)', u'To Kill a Mockingbird (1962)'])
(u'345', [u"One Flew Over the Cuckoo's Nest (1975)", u'Godfather, The (1972)', u'Casablanca (1942)'])
(u'346', [u'Entertaining Angels: The Dorothy Day Story (1996)', u'Searching for Bobby Fischer (1993)', u'Lion King, The (1994)'])
(u'347', [u'Mystery Science Theater 3000: The Movie (1996)', u'Clerks (1994)', u'King of New York (1990)'])
(u'340', [u"I Don't Want to Talk About It (De eso no se habla) (1993)", u'Every Other Weekend (1990)', u'Homage (1995)'])
(u'341', [u'Small Faces (1995)', u'Deceiver (1997)', u'Getting Away With Murder (1996)'])
(u'342', [u'Very Natural Thing, A (1974)', u'Walk in the Sun, A (1945)', u'Substance of Fire, The (1996)'])
(u'343', [u'12 Angry Men (1957)', u'North by Northwest (1959)', u'Six Degrees of Separation (1993)'])
(u'810', [u'All Things Fair (1996)', u'Mamma Roma (1962)', u'Substance of Fire, The (1996)'])
(u'811', [u'Aiqing wansui (1994)', u'Cyclo (1995)'