### Parallel movie recommender system

This notebook demonstrates a collaborative-filtering movie recommender system in Julia. [RecSys.jl](https://github.com/abhijithch/RecSys.jl/) is a Julia package for recommender systems; it currently works with explicit ratings data. The notebook uses its parallel implementation of ALS-factorization-based collaborative filtering for movie recommendations, following [this research article](http://dl.acm.org/citation.cfm?id=1424269). A detailed report on the system is available [here](http://juliacomputing.com/blog/2016/04/22/a-parallel-recommendation-engine-in-julia.html).

### Collaborative filtering using weighted ALS factorization

<img src="./images/als.png" width="550">

Let $U=[u_i]$ be the user feature matrix, where $u_i \in \mathbb{R}^{n_f}$ for $i=1,2,\ldots,n_u$, and let $M=[m_j]$ be the item (movie) feature matrix, where $m_j \in \mathbb{R}^{n_f}$ for $j=1,2,\ldots,n_m$. Here $n_f$ is the number of factors, i.e., the reduced dimension (the rank of the approximation), which is determined by cross-validation. The prediction for any user-movie pair $(i,j)$ is $r_{ij} = u_i \cdot m_j$ for all $i,j$.
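
As a minimal sketch of the prediction rule above, the following uses random toy factors in place of trained ones. The names `predict` and `R_hat` and the toy sizes are illustrative assumptions, not part of RecSys.jl, and the code targets Julia 1.x with only the standard library.

```julia
using LinearAlgebra, Random

Random.seed!(42)            # reproducible toy data
n_f, n_u, n_m = 3, 4, 5     # factors, users, movies (toy sizes)

U = rand(n_f, n_u)          # column i is the user feature vector u_i
M = rand(n_f, n_m)          # column j is the movie feature vector m_j

# Predicted rating for one user-movie pair: r_ij = u_i ⋅ m_j
predict(U, M, i, j) = dot(view(U, :, i), view(M, :, j))

# All predictions at once: U' * M is an n_u × n_m matrix of r_ij values
R_hat = U' * M

r_23 = predict(U, M, 2, 3)
println("r[2,3] = ", r_23)
```

Storing the feature vectors as columns keeps each $u_i$ and $m_j$ contiguous in memory, which suits Julia's column-major arrays.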

**Credits:**

[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dl.acm.org/citation.cfm?id=1424269)

[Movielens dataset](http://grouplens.org/datasets/movielens/)

In [4]:
using RecSys
include(joinpath(Pkg.dir("RecSys"), "examples", "movielens", "movielens.jl"))

test_chunks (generic function with 1 method)

### Dataset

GroupLens Research has collected and made available rating data sets from the [MovieLens](http://movielens.org) web site. 

#### MovieLens 20M Dataset

We use the 20 million ratings dataset, which must be downloaded into the `/data/recommender` folder. The ratings data form a sparse matrix of roughly `138,000 × 27,000` (users × movies) holding 20 million ratings that range from 0.5 to 5.
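
To make the sparse-matrix representation concrete, here is a small sketch that builds a users × movies matrix from a handful of made-up `(userId, movieId, rating)` triples. The toy data and variable names are assumptions for illustration; RecSys.jl performs its own loading from the `DlmFile` handles created below.

```julia
using SparseArrays

# Toy stand-in for ratings.csv rows: (userId, movieId, rating)
ratings = [(1, 1, 4.0), (1, 3, 2.0), (2, 2, 5.0), (3, 1, 3.0), (3, 3, 1.0)]

users  = [r[1] for r in ratings]
movies = [r[2] for r in ratings]
vals   = [r[3] for r in ratings]

# Rows are users, columns are movies; unrated pairs are simply not stored
R = sparse(users, movies, vals)

println(size(R))   # (3, 3): dimensions inferred from the largest ids
println(nnz(R))    # 5: number of stored ratings
```

Only the rated pairs are stored, which is what makes a matrix of this shape practical at the 20 million ratings scale.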

In [1]:
# Please specify path to the data folder which includes the 20 million ratings data folder "ml-20m"
dataset_path = joinpath("C:\\dsvm\\notebooks\\Julia_notebooks\\","data","recommender")

"C:\\dsvm\\notebooks\\Julia_notebooks\\data\\recommender"

In [2]:
data_folder = "ml-20m"

"ml-20m"

Create file handles for the movie ratings file and the movie list file.

In [5]:
ratings_file = DlmFile(joinpath(dataset_path,data_folder, "ratings.csv"); dlm=',', header=true)
movies_file = DlmFile(joinpath(dataset_path,data_folder, "movies.csv"); dlm=',', header=true)

RecSys.DlmFile("C:\\dsvm\\notebooks\\Julia_notebooks\\data\\recommender\\ml-20m\\movies.csv",',',true,true)

#### Parallel implementations

This package offers three modes of parallelism:

1. Multi-threading: Julia's native threading infrastructure provides an easy way to make use of threads.
2. Shared memory: multiprocessing in which the worker processes share the data.
3. Distributed memory: multiprocessing in which each process has its own memory, which requires the data to be split into chunks. There is code to do this, refer ...
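
To illustrate why ALS suits these modes, note that with the movie factors held fixed, each user's factor vector is the solution of an independent regularized least-squares problem, so the per-user updates can run concurrently. The sketch below shows this for the multi-threading case, using the weighted-λ update from the cited paper. It is not the RecSys.jl implementation: `update_users!`, the toy data, and the value of `λ` are assumptions, and the code targets Julia 1.x.

```julia
using LinearAlgebra, SparseArrays, Random

# Update every user's factor vector with the movie factors M held fixed.
# Rt is movies × users (sparse), so column i holds the ratings of user i.
function update_users!(U, M, Rt, λ)
    rows, vals = rowvals(Rt), nonzeros(Rt)
    Threads.@threads for i in 1:size(Rt, 2)
        idx = nzrange(Rt, i)               # storage positions of user i's ratings
        if !isempty(idx)                   # users without ratings are left unchanged
            Mi = M[:, rows[idx]]           # features of the movies user i rated
            A = Mi * Mi' + λ * length(idx) * I   # λ weighted by the rating count
            U[:, i] = A \ (Mi * vals[idx])       # each iteration writes its own column
        end
    end
    return U
end

Random.seed!(1)
n_f, n_u, n_m = 2, 3, 4
Rt = sparse([1, 3, 2, 1, 4], [1, 1, 2, 3, 3], [4.0, 2.0, 5.0, 3.0, 1.0], n_m, n_u)
M = rand(n_f, n_m)
U = zeros(n_f, n_u)
update_users!(U, M, Rt, 0.05)
println(U)
```

Start Julia with several threads (for example `julia -t 4`) to see the loop actually run in parallel; with a single thread it produces the same result serially. The movie update is symmetric, with the roles of `U` and `M` swapped.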

Multiple dispatch is a useful Julia feature: a call is dispatched to the correct implementation based on the types of the arguments passed.

For example, to train the model using shared-memory multiprocessing, construct the recommender as `MovieRec(trainingset::FileSpec, movie_names::FileSpec)`; for the distributed-memory model, construct it as `MovieRec(user_item_ratings::FileSpec, item_user_ratings::FileSpec, movie_names::FileSpec)`.
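
The dispatch mechanism described above can be sketched with stand-in types. Everything in this block (`FileSpecDemo`, `CsvSpec`, `make_inputs`, `mode`, and the file names) is hypothetical and only mirrors the shape of the two `MovieRec` signatures; it does not reproduce the RecSys.jl types.

```julia
# Hypothetical stand-ins; these are not the RecSys.jl types.
abstract type FileSpecDemo end

struct CsvSpec <: FileSpecDemo
    path::String
end

struct SharedMemoryDemo
    ratings::FileSpecDemo
end

struct DistributedDemo
    user_item::FileSpecDemo
    item_user::FileSpecDemo
end

# Two methods of one function, selected by the number and types of arguments
make_inputs(ratings::FileSpecDemo) = SharedMemoryDemo(ratings)
make_inputs(user_item::FileSpecDemo, item_user::FileSpecDemo) =
    DistributedDemo(user_item, item_user)

# Downstream code dispatches again on the resulting input type
mode(::SharedMemoryDemo) = "shared memory"
mode(::DistributedDemo)  = "distributed memory"

println(mode(make_inputs(CsvSpec("ratings.csv"))))               # shared memory
println(mode(make_inputs(CsvSpec("user_item"), CsvSpec("item_user"))))  # distributed memory
```

The caller never names a parallel mode explicitly; the arguments it supplies determine which method, and therefore which implementation, is used.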


In [11]:
rec = MovieRec(ratings_file, movies_file)

MovieRec(RecSys.DlmFile("C:\\dsvm\\notebooks\\Julia_notebooks\\data\\recommender\\ml-20m\\movies.csv",',',true,true),RecSys.ALSWR{RecSys.ParShmem,RecSys.SharedMemoryInputs,RecSys.SharedMemoryModel}(RecSys.SharedMemoryInputs(RecSys.DlmFile("C:\\dsvm\\notebooks\\Julia_notebooks\\data\\recommender\\ml-20m\\ratings.csv",',',true,true),0,0,Nullable{Union{ParallelSparseMatMul.SharedSparseMatrixCSC{Float64,Int64},RecSys.MatrixBlobs.SparseMatBlobs{Tv,Ti},SparseMatrixCSC{Float64,Int64}}}(),Nullable{Union{ParallelSparseMatMul.SharedSparseMatrixCSC{Float64,Int64},RecSys.MatrixBlobs.SparseMatBlobs{Tv,Ti},SparseMatrixCSC{Float64,Int64}}}(),Nullable{Union{Array{Int64,1},SharedArray{Int64,1}}}(),Nullable{Union{Array{Int64,1},SharedArray{Int64,1}}}()),Nullable{RecSys.SharedMemoryModel}(),RecSys.ParShmem()),Nullable{SparseVector{AbstractString,Int64}}())

Let us train the model with `10` factors and `10` iterations.

In [10]:
 
@time train(rec, 10, 10)

113.227545 seconds (286.02 M allocations: 68.991 GB, 23.24% gc time)


In [14]:
err = rmse(rec)

0.7530518439786652
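
For reference, and assuming `rmse` follows the usual definition of root-mean-square error over the observed ratings, the quantity reported above can be sketched as follows. `rmse_demo` and the toy vectors are illustrative and are not the package's function.

```julia
# Root-mean-square error between observed ratings and their predictions
rmse_demo(actual, predicted) = sqrt(sum(abs2, actual .- predicted) / length(actual))

actual    = [4.0, 3.0, 5.0, 2.0]   # observed ratings (toy values)
predicted = [3.5, 3.0, 4.0, 2.5]   # model predictions for the same pairs

println(rmse_demo(actual, predicted))   # sqrt(1.5 / 4) ≈ 0.612
```

A lower value means the predicted ratings sit closer to the observed ones, measured in rating units.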

#### Select a user and show the movies they have watched along with their recommendations

In [15]:
user = 100
print_recommendations(rec, recommend(rec, user)...)

Already watched:
  [1 ]  =  "Nixon (1995) - Drama"
  [2 ]  =  "Leaving Las Vegas (1995) - Drama|Romance"
  [3 ]  =  "Twelve Monkeys (a.k.a. 12 Monkeys) (1995) - Mystery|Sci-Fi|Thriller"
  [4 ]  =  "Clueless (1995) - Comedy|Romance"
  [5 ]  =  "Usual Suspects, The (1995) - Crime|Mystery|Thriller"
  [6 ]  =  "From Dusk Till Dawn (1996) - Action|Comedy|Horror|Thriller"
  [7 ]  =  "Crimson Tide (1995) - Drama|Thriller|War"
  [8 ]  =  "Crumb (1994) - Documentary"
  [9 ]  =  "Net, The (1995) - Action|Crime|Thriller"
  [10]  =  "Smoke (1995) - Comedy|Drama"
  [11]  =  "Clerks (1994) - Comedy"
  [12]  =  "Ed Wood (1994) - Comedy|Drama"
  [13]  =  "Star Wars: Episode IV - A New Hope (1977) - Action|Adventure|Sci-Fi"
  [14]  =  "Like Water for Chocolate (Como agua para chocolate) (1992) - Drama|Fantasy|Romance"
  [15]  =  "Natural Born Killers (1994) - Action|Crime|Thriller"
  [16]  =  "Léon: The Professional (a.k.a. The Professional) (Léon) (1994) - Action|Crime|Drama|Thriller"
  [17]  =  "Pulp