# Implicit ALS

We usually consider using ALS on a set of user/product ratings. But what if the data isn't so self-explanatory?

### A day trip to the library
Consider, for example, the data collected by a local library. The library records which users took out each book and how long they kept the books before returning them.

As such, we have no explicit indication that a user liked or disliked the books they took out. Just because you borrowed a book does not mean that you enjoyed it, or even read it.
The missing data is also of interest: the fact that a user has not taken out a specific book could indicate that they dislike that genre, or simply that they haven't been to that section of the library.

Furthermore, the same user action could have many different causes. Suppose you withdraw a book three times. That might indicate that you loved the book, but it may also indicate that the book doesn't appeal to you as strongly as some other books you withdrew, so you never got round to reading it the first two times.

To make the situation even worse, implicit data is often dirty. For example, a user may withdraw a library book for their child using their account, or they may accidentally pick up a book that was sitting on the counter. 

### The solution
Based on the standard ALS implementation, [Hu et al. (2008)](https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwi899eAu6baAhUurlkKHaVvB6UQFggsMAA&url=http%3A%2F%2Fyifanhu.net%2FPUB%2Fcf.pdf&usg=AOvVaw3WIcPGTpxR8m7C32F8whFx) presented a methodology for carrying out ALS when dealing with implicit data.

The general idea is that we have some recorded observations $r_{u,i}$ denoting the level of interaction user $u$ had with product $i$. For example, if user $1$ borrowed book $4$ once we may set $r_{1,4}=1$. Alternatively, we may wish to let $r_{u,i}$ hold information about how many days the book was borrowed for. (There is a lot of freedom in this setup, so we need to make some data-specific decisions about how to choose $r_{u,i}$.)

Given the set of observations $r_{u,i}$, a binary indicator $p_{u,i}$ is introduced where:

$ p_{u,i} = \begin{cases} 1 & \mbox{if } r_{u,i}>0 \\
0 & \mbox{otherwise.} \end{cases} $


A confidence parameter $\alpha$ lets the user determine how much importance they wish to place on the recorded $r_{u,i}$. This leads to the introduction of $c_{u,i}$ which we take to be the confidence we have in the strength of user $u$'s reaction to product $i$: 
$c_{u,i} = 1 + \alpha r_{u,i}$.
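
To make the two definitions concrete, here is a minimal NumPy sketch that builds $p_{u,i}$ and $c_{u,i}$ from a small matrix of observations. The matrix `r` and the value of `alpha` are made up purely for illustration; they are not taken from the library data or from the paper.

```python
import numpy as np

# Hypothetical observations r[u, i]: 3 users (rows) x 4 books (columns),
# each entry counting how often the user borrowed the book.
r = np.array([
    [1, 0, 3, 0],
    [0, 0, 0, 2],
    [0, 1, 0, 0],
])

alpha = 2.0  # illustrative value only

# Preference: 1 wherever any interaction was observed, 0 otherwise.
p = (r > 0).astype(float)

# Confidence: 1 for unobserved pairs, growing linearly with r otherwise.
c = 1.0 + alpha * r

print(p)
print(c)
```

Note that unobserved pairs still get a confidence of 1 rather than 0, so the missing data contributes to the fit, just with less weight than the recorded interactions.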

Let $N_u$ denote the number of users, and $N_p$ denote the number of products. Let $k\in \mathbb{Z}^+$ be a user-defined number of factors.
Now, in implicit ALS the goal is to find matrices $X\in \mathbb{R}^{N_u \times k}$ and $Y\in \mathbb{R}^{N_p \times k}$ such that the following cost function is minimised:

$\sum_{u,i} c_{u,i}(p_{u,i}-X_u^T Y_i)^2 + \lambda (\sum_u \| X_u\|^2 + \sum_{i} \| Y_i\|^2), $


where
$X_u$ is the $u$th row of $X$, 
$Y_i$ is the $i$th row of $Y$, and
$\lambda$ is a user-defined regularisation parameter which prevents overfitting.

With this minimisation at hand, we are able to recover estimates $\hat{p}_{u,i} = X_u^T Y_i$ of the preference $p_{u,i}$, including for interactions which have not yet occurred.
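
The "alternating" part of ALS comes from the fact that, with $Y$ held fixed, the cost is quadratic in each $X_u$ and is minimised by solving $(Y^T C^u Y + \lambda I) X_u = Y^T C^u p_u$, where $C^u$ is the diagonal matrix of confidences for user $u$; the same holds for each $Y_i$ with $X$ fixed. The following is a small dense NumPy sketch of that idea, not the Spark implementation: the function names `cost` and `solve_rows`, the toy matrix `r` and all parameter values are assumptions made for illustration.

```python
import numpy as np

def cost(r, X, Y, alpha, lam):
    """Evaluate the regularised cost over every (u, i) pair."""
    p = (r > 0).astype(float)
    c = 1.0 + alpha * r
    err = p - X @ Y.T
    return np.sum(c * err ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2))

def solve_rows(r, fixed, alpha, lam):
    """Minimise the cost over one factor matrix while `fixed` stays constant.

    Each row of r is paired with one row of the result; the columns of r
    line up with the rows of `fixed`.
    """
    k = fixed.shape[1]
    out = np.empty((r.shape[0], k))
    for u in range(r.shape[0]):
        c_u = 1.0 + alpha * r[u]           # confidences for this row
        p_u = (r[u] > 0).astype(float)     # preferences for this row
        A = fixed.T @ (c_u[:, None] * fixed) + lam * np.eye(k)
        b = fixed.T @ (c_u * p_u)
        out[u] = np.linalg.solve(A, b)     # closed-form least-squares step
    return out

rng = np.random.default_rng(0)
r = np.array([[1, 0, 3, 0],
              [0, 0, 0, 2],
              [0, 1, 0, 0]], dtype=float)  # hypothetical observations
alpha, lam, k = 2.0, 0.1, 2                # illustrative values only

X = rng.normal(scale=0.1, size=(r.shape[0], k))
Y = rng.normal(scale=0.1, size=(r.shape[1], k))

costs = [cost(r, X, Y, alpha, lam)]
for _ in range(5):
    X = solve_rows(r, Y, alpha, lam)       # update users with items fixed
    Y = solve_rows(r.T, X, alpha, lam)     # update items with users fixed
    costs.append(cost(r, X, Y, alpha, lam))

p_hat = X @ Y.T                            # estimated preferences
print(costs)
```

Because each half-step minimises the cost exactly over one factor matrix, the recorded cost should never increase from one sweep to the next. The dense loops are for clarity only and would not scale to a matrix the size of the book data.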

### Let's get going
We are going to run implicit ALS using the implementation given in the pyspark.mllib.recommendation module. 

The data I'm using can be found at http://www2.informatik.uni-freiburg.de/~cziegler/BX/

In [2]:
#Set up a spark context

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("implicitALS")
sc = SparkContext(conf=conf)

In [3]:
#Load in the data
#The data is csv, with ';' as a delimiter, hence the split command. 
#The data has quote marks around all info, so I remove these with a replace mapping. 
#The first bit of data is user id, the second is the book isbn number, 
# and the third is the observation. 
ratings = sc.textFile('implicit.csv').map(lambda x: x.replace('"',"")) \
            .map(lambda x:x.split(";"))\
            .map(lambda x:(int(x[0]), str(x[1]), int(x[2])))

In [4]:
#Need the isbns to be linked to an int for item id
isbns=ratings.map(lambda x:x[1]).distinct()
isbns_with_indices=isbns.zipWithIndex()
reordered_ratings = ratings.map(lambda x:(x[1], (x[0], x[2])))
joined = reordered_ratings.join(isbns_with_indices)
ratings_int_nice = joined.map(lambda x: (x[1][0][0], x[1][1], x[1][0][1]))
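
The cell above replaces each ISBN string with an integer index, because the ALS implementation expects integer product ids. The same idea can be sketched in plain Python on a handful of made-up triples (the ISBN strings and the names `ratings_local`, `isbn_to_index` and `index_to_isbn` are hypothetical, not part of the notebook):

```python
# Hypothetical (user, isbn, observation) triples standing in for the ratings RDD.
ratings_local = [
    (8, "ISBN-A", 1),
    (8, "ISBN-B", 1),
    (9, "ISBN-A", 1),
]

# Give each distinct ISBN an integer index. Sorting is used here only to make
# the numbering reproducible; the indices Spark assigns with zipWithIndex
# depend on how the RDD is partitioned.
distinct_isbns = sorted({isbn for _, isbn, _ in ratings_local})
isbn_to_index = {isbn: idx for idx, isbn in enumerate(distinct_isbns)}

# Keep the inverse mapping so that item indices can be turned back into ISBNs.
index_to_isbn = {idx: isbn for isbn, idx in isbn_to_index.items()}

# Same layout as ratings_int_nice: (user, item index, observation).
ratings_indexed = [(user, isbn_to_index[isbn], obs)
                   for user, isbn, obs in ratings_local]
print(ratings_indexed)
```

Holding on to the inverse mapping is worthwhile, since the model only ever reports integer product ids.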

In [5]:
#Every recorded interaction needs to be a 1, not a 0: the matrix is singular if it is all 0s,
#and it is 1 (not 0) that indicates a response.
ratings_ones = ratings_int_nice.map(lambda x:(x[0], x[1], 1))

In [14]:
from pyspark.mllib.recommendation import ALS
model=ALS.trainImplicit(ratings_ones, rank=5, iterations=3, alpha=0.99)

In [32]:
#Pretend to be user 8. 
#Compare with == rather than is: is tests object identity, not numeric equality.
users_books = ratings_ones.filter(lambda x: x[0] == 8).map(lambda x:x[1])

books_for_them = users_books.collect()
books_for_them

[83211,
 35018,
 149718,
 201265,
 170534,
 168010,
 176140,
 139616,
 219591,
 222162,
 244957]

In [35]:
#Pair user 8 with every book index they have not already borrowed.
unseen = isbns_with_indices.map(lambda x:x[1]).filter(lambda x: x not in books_for_them).map(lambda x: (8, int(x)))
unseen.take(10)

[(8, 0),
 (8, 1),
 (8, 2),
 (8, 3),
 (8, 4),
 (8, 5),
 (8, 6),
 (8, 7),
 (8, 8),
 (8, 9)]

In [37]:
#Use predictAll to score every (user, book) pair in unseen.
predictions = model.predictAll(unseen)


In [38]:
predictions.take(10)

[Rating(user=8, product=185544, rating=4.2271030054993974e-06),
 Rating(user=8, product=152288, rating=2.0750792412236027e-05),
 Rating(user=8, product=143464, rating=1.0009329102651996e-05),
 Rating(user=8, product=23776, rating=4.4093986779524005e-05),
 Rating(user=8, product=155312, rating=7.568740395670591e-06),
 Rating(user=8, product=82512, rating=-2.370503477555326e-06),
 Rating(user=8, product=170792, rating=-1.5023831712554203e-08),
 Rating(user=8, product=103184, rating=-1.21955109500547e-08),
 Rating(user=8, product=40888, rating=-8.983168069646947e-06),
 Rating(user=8, product=200376, rating=1.584185955094937e-05)]

In [40]:
predictions.takeOrdered(20, lambda x: -x[2])

[Rating(user=8, product=218987, rating=0.005679459189748123),
 Rating(user=8, product=37124, rating=0.0056743738510453045),
 Rating(user=8, product=222388, rating=0.003184519694282364),
 Rating(user=8, product=61040, rating=0.002636430903826744),
 Rating(user=8, product=278, rating=0.0023415537266199974),
 Rating(user=8, product=224398, rating=0.0023259600211237154),
 Rating(user=8, product=232462, rating=0.002162405558826569),
 Rating(user=8, product=170112, rating=0.002145339089236659),
 Rating(user=8, product=215022, rating=0.0021304683630739008),
 Rating(user=8, product=127891, rating=0.002008426523603443),
 Rating(user=8, product=74810, rating=0.001978164272045892),
 Rating(user=8, product=111968, rating=0.0019664236210441216),
 Rating(user=8, product=226501, rating=0.0019470256951683467),
 Rating(user=8, product=155093, rating=0.0017297421250543385),
 Rating(user=8, product=134108, rating=0.0016752060544477018),
 Rating(user=8, product=158268, rating=0.0016475328355897317),
 ...]

In [41]:
model.recommendProducts(8,10)

[Rating(user=8, product=218987, rating=0.005679459189748123),
 Rating(user=8, product=37124, rating=0.0056743738510453045),
 Rating(user=8, product=222388, rating=0.003184519694282364),
 Rating(user=8, product=61040, rating=0.002636430903826744),
 Rating(user=8, product=278, rating=0.0023415537266199974),
 Rating(user=8, product=224398, rating=0.0023259600211237154),
 Rating(user=8, product=232462, rating=0.002162405558826569),
 Rating(user=8, product=170112, rating=0.002145339089236659),
 Rating(user=8, product=215022, rating=0.0021304683630739008),
 Rating(user=8, product=127891, rating=0.002008426523603443)]
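
The first ten entries returned by takeOrdered and the list returned by recommendProducts are identical here. In a matrix factorisation model a predicted score is the dot product of a user's factor vector with an item's factor vector, so ranking items for one user amounts to sorting those dot products. A minimal NumPy sketch of that ranking step, using randomly generated factors rather than the ones learned above (`user_vec` and `item_factors` are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5                                     # same rank as used for training above
user_vec = rng.normal(size=k)             # hypothetical factors for one user
item_factors = rng.normal(size=(20, k))   # hypothetical factors for 20 items

scores = item_factors @ user_vec          # one predicted preference per item
top = np.argsort(-scores)[:10]            # indices of the 10 highest scores

for idx in top:
    print(int(idx), float(scores[idx]))
```

The printed indices play the same role as the product field in the Rating objects above; they would still need to be mapped back to ISBNs to be meaningful to a reader.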