## Collaborative Filtering with SVD for Movie Recommendation

&nbsp;

* A recommender system predicts a user’s future preferences over a set of items and recommends the top ones. One key reason we need recommender systems today is that the Internet gives people too many options to choose from. In the past, people shopped in physical stores, where the selection was limited: the number of movies a Blockbuster store could stock depended on the size of that store. By contrast, the Internet now gives people access to abundant resources online. Netflix, for example, has an enormous collection of movies.

&nbsp;

* In this project, my main goal is to implement collaborative filtering with **SVD (Singular Value Decomposition)**.

&nbsp;

### What is Collaborative Filtering?

&nbsp;

![](images/coll.png)

&nbsp;

* Collaborative filtering is based on users’ ratings: it recommends movies that we haven’t watched yet but that users similar to us have watched and liked. To determine whether two users are similar, it considers the movies both of them watched and how they rated them. By looking at the items in common, the algorithm predicts the rating a user would give to a movie they haven’t watched yet, based on the ratings of similar users.

&nbsp;

* There are two kinds of collaborative filtering:
    * User Based: measures the similarity between the target user and other users.
    * Item Based: measures the similarity between the items that the target user rates or interacts with and other items.
    
&nbsp;

* Imagine there are m users and n items. We create a matrix of dimensions m × n to hold all the past ratings that users have given to items; for example, let m{i,j} be the rating given by user i to item j. Many cells in this matrix can be missing. Collaborative filtering involves filling in those missing entries, which correspond to items the user hasn’t seen or rated.

&nbsp;

![](images/m.png)
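
As a rough illustration of such a matrix, the sketch below pivots a handful of made-up (user, item, rating) triples into a user-by-item table with pandas. The frame `toy_ratings` and its values are hypothetical and are not taken from the MovieLens data; unrated cells show up as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings as (user, item, rating) triples; not real data.
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [5, 3, 4, 2, 1],
})

# Rows are users, columns are items; cells without a rating become NaN.
utility = toy_ratings.pivot(index="user_id", columns="movie_id", values="rating")

print(utility)
print("missing cells:", int(utility.isna().sum().sum()))
```

The `NaN` cells are exactly the entries that collaborative filtering tries to fill in.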

&nbsp;

* **User-Based Collaborative Filtering**

&nbsp;

* For user-based CF, we need to find similar users based on their interests. There are two approaches to measuring similarity: Pearson correlation and cosine similarity. Let u{i,k} denote the similarity between user i and user k, and let v{i,j} denote the rating that user i gives to item j, with v{i,j} = ? if the user hasn’t rated it. The two methods can be expressed as follows.

&nbsp;

![](images/pc.png)

&nbsp;

![](images/cs.png)
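
A minimal sketch of both measures in NumPy follows, assuming each user is a vector of ratings with `NaN` for unrated items and that both measures (including the Pearson means) are computed over co-rated items only. The exact formulas in the figures may differ in such details, and the helper names here are made up for illustration.

```python
import numpy as np

def corated(a, b):
    """Return the ratings of two users restricted to items both have rated."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    return a[mask], b[mask]

def cosine_similarity(a, b):
    """Cosine of the angle between the two co-rated rating vectors."""
    x, y = corated(a, b)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

def pearson_similarity(a, b):
    """Cosine similarity after centering each user on their co-rated mean."""
    x, y = corated(a, b)
    if len(x) < 2:
        return 0.0  # not enough overlap to measure correlation
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    return float(xc @ yc / denom) if denom > 0 else 0.0

# Hypothetical users; NaN marks an unrated item.
user_i = np.array([5.0, 3.0, np.nan, 1.0])
user_k = np.array([4.0, 2.0, 5.0, np.nan])

cos_ik = cosine_similarity(user_i, user_k)
pearson_ik = pearson_similarity(user_i, user_k)
print(cos_ik, pearson_ik)
```

Pearson correlation removes each user’s average rating first, so a consistently harsh rater and a consistently generous rater can still come out as similar.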

&nbsp;

* **Item-Based Collaborative Filtering**

&nbsp;

* Instead of measuring the similarity between users, item-based CF recommends items based on their similarity to the items that the target user has rated. Likewise, the similarity can be computed with Pearson correlation or cosine similarity. The major difference is that in item-based CF we fill in the blanks vertically, as opposed to horizontally in user-based CF.

&nbsp;

* However, there are major limitations with this approach:
    * The first is scalability: the computation grows with both the number of customers and the number of products. The worst-case complexity is O(mn) with m users and n items.
    * The second is sparsity: there will be a lot of empty cells in the matrix. For example, with 2000 users and 1000 movies we have a matrix with 2 million entries (2000 × 1000), and typically each user rates only a small fraction of the movies.
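
To make the sparsity point concrete, here is a small back-of-the-envelope calculation. The figure of 20 ratings per user is an assumption chosen only for illustration; it is not a statistic from the dataset.

```python
m_users, n_items = 2000, 1000
total_cells = m_users * n_items            # 2,000,000 cells in the matrix

# Assumption for illustration only: every user rates 20 movies.
ratings_per_user = 20
known_cells = m_users * ratings_per_user

# Fraction of cells that have no rating.
sparsity = 1 - known_cells / total_cells
print(f"{total_cells:,} cells, {known_cells:,} rated, sparsity = {sparsity:.1%}")
```

Under that assumption, all but a few percent of the matrix is empty, which is why storing and comparing full rows or columns is wasteful.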
    
&nbsp;

* **So how are we going to solve it?**

&nbsp;

* One approach is **Low Rank Matrix Factorization** using **SVD**. It leverages latent factors to capture the similarity between users and items. Essentially, we want to turn the recommendation problem into an optimization problem.

&nbsp;

* Given a rating matrix, the task is to come up with concise vector representations of users and items such that the rating given by user u to item v can be obtained by simply computing the dot product of the user vector and the item vector. We also want this to be fast, so the vectors should have as few dimensions as possible, which is where low rank comes into play.
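
The dot-product idea can be sketched as follows. The latent vectors below are hypothetical values with r = 2 dimensions, chosen by hand rather than learned from data.

```python
import numpy as np

# Hypothetical latent vectors with r = 2 dimensions (values are made up).
user_vectors = np.array([[1.0, 0.5],
                         [0.2, 1.5]])        # shape (m users, r)
item_vectors = np.array([[2.0, 1.0],
                         [0.5, 2.0],
                         [1.0, 1.0]])        # shape (n items, r)

# Predicted rating of user u for item v is a single dot product.
u, v = 0, 1
predicted = user_vectors[u] @ item_vectors[v]    # 1.0*0.5 + 0.5*2.0

# All m x n predictions at once: one matrix product.
all_predictions = user_vectors @ item_vectors.T  # shape (m, n)
print(predicted)
print(all_predictions)
```

With r much smaller than m and n, storing the two factor matrices takes (m + n) × r numbers instead of m × n.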

&nbsp;

* One common metric is **Root Mean Square Error (RMSE)**: the lower the RMSE, the better the performance. Since we do not know the ratings of the unseen items, we temporarily ignore them; that is, we minimize RMSE only on the known entries of the matrix. To minimize RMSE, **SVD (Singular Value Decomposition)** is adopted, as shown in the formula below.
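
Scoring only the known entries can be sketched like this, assuming unknown ratings are stored as `NaN`. The function name `rmse_known` and the toy arrays are hypothetical.

```python
import numpy as np

def rmse_known(actual, predicted):
    """RMSE computed only over cells where the actual rating is known."""
    mask = ~np.isnan(actual)
    errors = actual[mask] - predicted[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical 2 x 2 example: two known ratings, two unknown (NaN).
actual = np.array([[5.0, np.nan],
                   [np.nan, 2.0]])
predicted = np.array([[4.0, 3.0],
                      [1.0, 4.0]])

score = rmse_known(actual, predicted)
print(score)
```

Predictions for the `NaN` cells do not affect the score at all, which matches the idea of temporarily ignoring unseen items.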

&nbsp;

![](images/svd.png)

&nbsp;

* X denotes the utility matrix. U is the left singular matrix, representing the relationship between users and latent factors. S is a diagonal matrix describing the strength of each latent factor, while V transpose is the right singular matrix, indicating the similarity between items and latent factors. What is a latent factor? It is a broad idea describing a property or concept that a user or an item has. For music, for instance, a latent factor can correspond to the genre a piece belongs to. SVD reduces the dimension of the utility matrix by extracting its latent factors: essentially, we map each user and each item into a latent space of dimension r. This helps us better understand the relationship between users and items, as they become directly comparable.
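
The decomposition and its rank-r truncation can be sketched with `numpy.linalg.svd`. Plain SVD needs a complete matrix, so this toy `X` is fully observed and its values are made up; how missing ratings are handled in the actual project is not shown here.

```python
import numpy as np

# Hypothetical, fully observed 4 x 4 utility matrix (users x items).
X = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])

# Thin SVD: X = U @ diag(s) @ Vt, singular values in s sorted descending.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the r strongest latent factors.
r = 2
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print("singular values:", np.round(s, 3))
print("rank-2 approximation:\n", np.round(X_r, 2))
print("approximation error:", np.linalg.norm(X - X_r))
```

Each row of `U[:, :r]` places a user in the r-dimensional latent space and each column of `Vt[:r, :]` places an item there, while `s[:r]` weights the factors. The Frobenius error of the truncation equals the root of the sum of the squared discarded singular values.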

In [1]:
# importing libraries
import pandas as pd
import numpy as np

In [None]:
# importing user data (assumes the MovieLens 100K archive is extracted to ./ml-100k)
user_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=user_cols, encoding='latin-1')

# importing movie ratings (tab-separated)
ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

# importing movie metadata (first five columns of u.item)
movies_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=movies_cols, usecols=range(5),
                     encoding='latin-1')

In [None]:
# importing the 19 genre indicator columns (columns 5-23 of u.item)
genres_list = ['unknown','Action','Adventure','Animation','Children','Comedy','Crime',
               'Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery',
               'Romance','Sci-Fi','Thriller','War','Western']
genre = pd.read_csv('ml-100k/u.item', sep='|',names = genres_list,usecols = range(5,24),encoding = 'latin-1')

In [None]:
# dropping redundant columns
movies.drop(['video_release_date','imdb_url'],inplace=True,axis = 1)
ratings.drop('unix_timestamp',axis = 1,inplace=True)

In [None]:
# merge movies, ratings, and users into a single dataset (joined on movie_id, then user_id)
dataset = pd.merge(pd.merge(movies, ratings),users)
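
When `pd.merge` is called without an `on` argument, it joins on the columns the two frames share. The sketch below mirrors the nested merge on three tiny hypothetical frames (the `_toy` names and values are made up) to show which keys end up being used.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
movies_toy = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"]})
ratings_toy = pd.DataFrame({"user_id": [10, 10, 11],
                            "movie_id": [1, 2, 1],
                            "rating": [4, 5, 3]})
users_toy = pd.DataFrame({"user_id": [10, 11], "age": [24, 53]})

# Inner merge joins movies and ratings on movie_id (their only shared column),
# then the outer merge joins the result with users on user_id.
merged = pd.merge(pd.merge(movies_toy, ratings_toy), users_toy)
print(merged)
```

The result has one row per rating, carrying both the movie’s and the user’s attributes. Passing `on=` explicitly would make the join keys visible in the code.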