# Implicit ALS

We usually consider using ALS on a set of user/product ratings. But what if the data isn't so self-explanatory?

### A day trip to the library
Consider, for example, the data collected by a local library. The library records which users took out each book and how long they kept it before returning it.

As such, we have no explicit indication that a user liked or disliked the books they took out. Just because you borrowed a book does not mean that you enjoyed it, or even read it.
The missing data is also of interest: the fact that a user has not taken out a specific book could indicate that they dislike that genre, or simply that they haven't been to that section of the library.

Furthermore, the same user action could have many different causes. Suppose you borrow a book three times. That might indicate that you loved the book, but it may also indicate that the book doesn't appeal to you as strongly as some other books you borrowed, so you never got round to reading it the first two times.

To make the situation even worse, implicit data is often dirty. For example, a user may borrow a library book for their child on their own account, or they may accidentally pick up a book that was sitting on the counter.

### The solution
Building on the standard ALS algorithm, [Hu et al. (2008)](http://yifanhu.net/PUB/cf.pdf) presented a methodology for carrying out ALS when dealing with implicit data.

The general idea is that we have some recorded observations $r_{u,i}$ denoting the level of interaction user $u$ had with product $i$. For example, if user $1$ borrowed book $4$ once, we may set $r_{1,4}=1$. Alternatively, we may wish to let $r_{u,i}$ hold the number of days the book was borrowed for. (There is a lot of freedom in this setup, so we need to make some data-specific decisions about how to choose $r_{u,i}$.)

Given the set of observations $r_{u,i}$, a binary indicator $p_{u,i}$ is introduced where:

$p_{u,i} = \begin{cases} 1 & \mbox{if } r_{u,i}>0 \\
0 & \mbox{otherwise.} \end{cases}$


A confidence parameter $\alpha$ lets us decide how much importance to place on the recorded $r_{u,i}$. This leads to the introduction of $c_{u,i}$, which we take to be the confidence we have in the strength of user $u$'s reaction to product $i$:
$c_{u,i} = 1 + \alpha r_{u,i}$.
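As a small illustration of these two definitions, the sketch below computes $p_{u,i}$ and $c_{u,i}$ for a toy interaction matrix. The values in `r` are made up for the example, and `alpha` simply reuses the value passed to the model later in this notebook.

```python
# Hypothetical interaction counts r[u][i]: rows are users, columns are books.
r = [
    [1, 0, 3],
    [0, 2, 0],
]
alpha = 0.99  # same value as used for ALS.trainImplicit in this notebook

# Preference p: 1 if any interaction was observed, 0 otherwise.
p = [[1 if r_ui > 0 else 0 for r_ui in row] for row in r]

# Confidence c: starts at 1 and grows linearly with the amount of interaction.
c = [[1 + alpha * r_ui for r_ui in row] for row in r]

print(p)
print(c)
```

Note that unobserved pairs still get a confidence of 1, so they contribute to the fit, just with less weight than observed ones.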

Let $N_u$ denote the number of users, and $N_p$ denote the number of products. Let $k\in \mathbb{Z}^+$ be a number of latent factors chosen by us.
In implicit ALS the goal is to find matrices $X\in \mathbb{R}^{N_u \times k}$ and $Y\in \mathbb{R}^{N_p \times k}$ such that the following cost function is minimised:

$\sum_{u,i} c_{u,i}(p_{u,i}-X_u^T Y_i)^2 + \lambda (\sum_u \| X_u\|^2 + \sum_{i} \| Y_i\|^2),$


where
$X_u$ is the $u$th row of $X$,
$Y_i$ is the $i$th row of $Y$, and
$\lambda$ is a regularisation parameter, chosen by us, which prevents overfitting.

With this minimisation in hand, we obtain estimates $\hat{p}_{u,i} = X_u^T Y_i$ of the preference of user $u$ for product $i$, including for interactions which have not yet occurred.
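To see how the alternating minimisation works, here is a minimal NumPy sketch of the cost function and one alternating step. It uses the closed-form row update that follows from setting the gradient to zero with the other matrix held fixed, $X_u = (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p(u)$, where $C^u$ is the diagonal matrix of confidences for user $u$. The toy matrix `R`, the function names and the parameter values are assumptions made for illustration; this is not a description of how Spark implements the algorithm internally.

```python
import numpy as np

def cost(X, Y, P, C, lam):
    """Implicit ALS objective: confidence-weighted squared error plus L2 penalty."""
    err = P - X @ Y.T
    return np.sum(C * err ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2))

def update_rows(fixed, P, C, lam):
    """Solve the regularised weighted least-squares problem for each row of P,
    holding the other factor matrix (`fixed`) constant."""
    k = fixed.shape[1]
    out = np.empty((P.shape[0], k))
    for u in range(P.shape[0]):
        Cu = np.diag(C[u])                      # confidences for this row
        A = fixed.T @ Cu @ fixed + lam * np.eye(k)
        b = fixed.T @ Cu @ P[u]
        out[u] = np.linalg.solve(A, b)
    return out

# Toy data (hypothetical): 3 users, 3 books.
R = np.array([[1.0, 0.0, 3.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]])
P = (R > 0).astype(float)
C = 1 + 0.99 * R
lam, k = 0.1, 2

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(3, k))
Y = rng.normal(scale=0.1, size=(3, k))

before = cost(X, Y, P, C, lam)
X = update_rows(Y, P, C, lam)          # update user factors, Y fixed
after_users = cost(X, Y, P, C, lam)
Y = update_rows(X, P.T, C.T, lam)      # update item factors, X fixed
after_items = cost(X, Y, P, C, lam)
print(before, after_users, after_items)
```

Each half-step solves its subproblem exactly, so the cost cannot increase from one printed value to the next.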

### Let's get going
We are going to run implicit ALS using the implementation given in the pyspark.mllib.recommendation module. 

The data we will be using can be found at http://www2.informatik.uni-freiburg.de/~cziegler/BX/

In [1]:
#Set up a spark context

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("implicitALS")
sc = SparkContext(conf=conf)

# The Data


In the cell below, we download and unzip the data. The two files we are interested in are BX-Books.csv and BX-Book-Ratings.csv, which follow these schemas:

### BX-Books.csv
| Field Name |  Type | Description |
|------------|------|-------------|
|ISBN |  String | length 10, alphanumeric |
| Book-Title | String | Title of book |
|Book-Author | String| Name of author |
| Year-Of-Publication | String | yyyy|
|Publisher| String |Name of publisher |
|Image-URL-S | String| URL for small image on amazon.com |
|Image-URL-M | String| URL for medium image on amazon.com |
|Image-URL-L | String| URL for large image on amazon.com|


### BX-Book-Ratings.csv
| Field Name |  Type | Description |
|------------|------|-------------|
|User-ID |  Integer | Range from 2 to 278854 |
| ISBN | String| length 10, alphanumeric |
|Book-Rating| Integer | 1-10 denotes dislike-like. 0 denotes implicit interaction|

In [2]:
#Downloading and unzipping the data
!wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
!unzip BX-CSV-Dump.zip

--2018-04-09 15:29:29--  http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
Resolving www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)... 132.230.105.133
Connecting to www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)|132.230.105.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘BX-CSV-Dump.zip’


2018-04-09 15:29:37 (3.11 MB/s) - ‘BX-CSV-Dump.zip’ saved [26085508/26085508]

Archive:  BX-CSV-Dump.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


The cell above extracts three .csv files into the working directory. We are interested in the files "BX-Book-Ratings.csv" and "BX-Books.csv". The three columns of "BX-Book-Ratings.csv" are the user id, an ISBN which identifies the book, and a rating. A '0' in the rating column denotes that an implicit interaction occurred between the user and the book. It is this data that we are interested in, and we extract such rows using the following grep command:

In [3]:
!grep '"0"' BX-Book-Ratings.csv > implicit.csv

In [4]:
#Load in the data
#The data is csv, with ';' as a delimiter, hence the split command. 
#The data has quote marks around all info, so I remove these with a replace mapping. 
#The first bit of data is user id, the second is the book isbn number, 
# and the third is the observation. 
ratings = sc.textFile('implicit.csv').map(lambda x: x.replace('"',"")) \
            .map(lambda x:x.split(";"))\
            .map(lambda x:(int(x[0]), str(x[1]), int(x[2])))

Let's have a look at the first 10 entries in the ratings file: 

In [5]:
ratings.take(10)

[(276725, '034545104X', 0),
 (276727, '0446520802', 0),
 (276733, '2080674722', 0),
 (276746, '0425115801', 0),
 (276746, '0449006522', 0),
 (276746, '0553561618', 0),
 (276746, '055356451X', 0),
 (276746, '0786013990', 0),
 (276746, '0786014512', 0),
 (276747, '0451192001', 0)]
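To make the parsing step concrete, the same replace/split/cast chain can be applied to a single line in plain Python. The sample line below is reconstructed from the first output row and the format described in the comments (quoted fields, `;` delimiter), so treat it as an assumption about the raw file; `parse_line` is a hypothetical helper name.

```python
def parse_line(line):
    """Turn one raw line of implicit.csv into (user_id, isbn, rating)."""
    fields = line.replace('"', "").split(";")
    return int(fields[0]), fields[1], int(fields[2])

# Assumed raw form of the first row shown above.
sample = '"276725";"034545104X";"0"'
print(parse_line(sample))
```

The ISBN is kept as a string because it can contain letters such as a trailing `X`.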

The implicit ALS function we are going to use requires that product ids are integers. At the moment we have unique ISBNs, which contain a mixture of numbers and letters, so we must convert them to integers. This can be done using the zipWithIndex() function, which pairs each element of an RDD with a unique integer index.

In [6]:
# Extract unique isbns.
isbns=ratings.map(lambda x:x[1]).distinct()
#Associates an integer with each unique isbn.
isbns_with_indices=isbns.zipWithIndex() 
#sets isbn as the key
reordered_ratings = ratings.map(lambda x:(x[1], (x[0], x[2]))) 
joined = reordered_ratings.join(isbns_with_indices) #joins with indexes 
joined.take(10)

[('0425103528', ((41455, 0), 103633)),
 ('0425103528', ((95991, 0), 103633)),
 ('0425103528', ((102967, 0), 103633)),
 ('0425103528', ((186570, 0), 103633)),
 ('0756401836', ((93047, 0), 30136)),
 ('0756401836', ((110483, 0), 30136)),
 ('0756401836', ((170415, 0), 30136)),
 ('0756401836', ((176875, 0), 30136)),
 ('321630681X', ((245839, 0), 61121)),
 ('0439110246', ((114414, 0), 2))]

In [7]:
#The data above is of the form :
    #(isbn, ((userid, rating), isbn-id-integer))
#We use the map function to get to the form :
    #(user id, isbn-id-integer, rating)
#This is the form expected by the ALS function
ratings_int_nice = joined.map(lambda x: (x[1][0][0], x[1][1], x[1][0][1]))
ratings_int_nice.take(10)

[(41455, 103633, 0),
 (95991, 103633, 0),
 (102967, 103633, 0),
 (186570, 103633, 0),
 (93047, 30136, 0),
 (110483, 30136, 0),
 (170415, 30136, 0),
 (176875, 30136, 0),
 (245839, 61121, 0),
 (114414, 2, 0)]
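The indexing and reshaping steps above can be mirrored in plain Python, which may help when checking the logic on a few rows. The rows below are toy data (the user/ISBN pairings are made up), and `sorted` is used only to make the toy indices deterministic; Spark's zipWithIndex assigns indices according to partition order instead.

```python
# Toy (user_id, isbn, rating) rows; pairings are hypothetical.
ratings = [
    (1, "034545104X", 0),
    (2, "0425115801", 0),
    (3, "034545104X", 0),
]

# distinct() + zipWithIndex(): one integer per unique ISBN.
unique_isbns = sorted({isbn for _, isbn, _ in ratings})
isbn_to_index = {isbn: idx for idx, isbn in enumerate(unique_isbns)}

# join() + map(): replace each ISBN with its integer id.
ratings_int = [(user, isbn_to_index[isbn], rating) for user, isbn, rating in ratings]
print(ratings_int)
```

Both rows for the same ISBN end up with the same integer id, which is the property the ALS input needs.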

In [8]:
#We need 1s, not 0s. Under the model above, r = 0 means p = 0 (no preference),
#so all-zero input carries no signal. We use 1 to mark an observed interaction.
ratings_ones = ratings_int_nice.map(lambda x:(x[0], x[1], 1))

We now import the ALS function from the mllib module, and build the model. 

In [9]:
from pyspark.mllib.recommendation import ALS
model=ALS.trainImplicit(ratings_ones, rank=5, iterations=3, alpha=0.99)

Let's have a look at user 8. We wish to make predictions on what books the user will like, based on their interactions. 

In [10]:
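#Collect the ids of all books that user 8 has interacted with.
#Use == rather than `is`: identity comparison on integers is unreliable.
users_books = ratings_ones.filter(lambda x: x[0] == 8).map(lambda x:x[1])
books_for_them = set(users_books.collect()) #Collect as a set for fast membership tests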

In the next cell, we make an RDD of (user = 8, book id) pairs, with an entry for every book the user has not yet interacted with.

In [11]:
unseen = isbns_with_indices.map(lambda x:x[1]) \
                            .filter(lambda x: x not in books_for_them) \
                            .map(lambda x: (8, int(x)))
unseen.take(10)

[(8, 0),
 (8, 1),
 (8, 2),
 (8, 3),
 (8, 4),
 (8, 5),
 (8, 6),
 (8, 7),
 (8, 8),
 (8, 9)]

In [12]:
#Use predictAll to score every (user, book) pair in the unseen RDD.
predictions = model.predictAll(unseen)

We can now look at predictions for a range of user, product pairs:

In [13]:
predictions.take(10)

[Rating(user=8, product=185544, rating=3.5499178936374144e-06),
 Rating(user=8, product=152288, rating=1.1140932904340358e-05),
 Rating(user=8, product=143464, rating=1.2945490545668097e-05),
 Rating(user=8, product=23776, rating=3.676099823410155e-05),
 Rating(user=8, product=155312, rating=1.923518143573988e-05),
 Rating(user=8, product=82512, rating=-2.2283889456646724e-06),
 Rating(user=8, product=170792, rating=-9.112801659288064e-09),
 Rating(user=8, product=103184, rating=4.111202468665078e-09),
 Rating(user=8, product=40888, rating=2.4160142131097027e-06),
 Rating(user=8, product=200376, rating=1.281169533897479e-05)]

We can use .takeOrdered to view the 20 highest rated items for that user. 

In [14]:
predictions.takeOrdered(20, lambda x: -x[2])

[Rating(user=8, product=218987, rating=0.006400773040138297),
 Rating(user=8, product=37124, rating=0.004201500598023086),
 Rating(user=8, product=226501, rating=0.003013403659823166),
 Rating(user=8, product=222388, rating=0.0019098079964174936),
 Rating(user=8, product=224398, rating=0.001893200744164136),
 Rating(user=8, product=135249, rating=0.0018104361395196718),
 Rating(user=8, product=158268, rating=0.0017122310913895078),
 Rating(user=8, product=115200, rating=0.0015320753848580505),
 Rating(user=8, product=126542, rating=0.0015233736116273327),
 Rating(user=8, product=153367, rating=0.001396647908855291),
 Rating(user=8, product=104641, rating=0.0013685284571707751),
 Rating(user=8, product=56184, rating=0.0012524267983160201),
 Rating(user=8, product=61040, rating=0.0012331378504750984),
 Rating(user=8, product=215022, rating=0.0011415533435755764),
 Rating(user=8, product=170112, rating=0.0011322398700266108),
 Rating(user=8, product=194437, rating=0.0011117130375315337),


The .recommendProducts function returns the top-rated products for a given user directly. Here we ask for the top 10 products for user 8.

In [15]:
model.recommendProducts(8,10)

[Rating(user=8, product=218987, rating=0.006400773040138297),
 Rating(user=8, product=37124, rating=0.004201500598023086),
 Rating(user=8, product=226501, rating=0.003013403659823166),
 Rating(user=8, product=222388, rating=0.0019098079964174936),
 Rating(user=8, product=224398, rating=0.001893200744164136),
 Rating(user=8, product=135249, rating=0.0018104361395196718),
 Rating(user=8, product=158268, rating=0.0017122310913895078),
 Rating(user=8, product=115200, rating=0.0015320753848580505),
 Rating(user=8, product=126542, rating=0.0015233736116273327),
 Rating(user=8, product=153367, rating=0.001396647908855291)]
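The product ids above are the integers assigned earlier, not ISBNs. A minimal sketch of mapping them back is shown below, using small hypothetical stand-ins for the recommendations and the index so that it runs without Spark. In the notebook itself, the lookup table could be built with something like `isbns_with_indices.map(lambda x: (x[1], x[0])).collectAsMap()`, assuming the mapping fits in driver memory.

```python
# Hypothetical stand-ins for model.recommendProducts(...) output and the ISBN index.
recommendations = [(8, 2, 0.0064), (8, 0, 0.0042), (8, 1, 0.0030)]
isbn_to_index = {"034545104X": 0, "0425115801": 1, "0439110246": 2}

# Invert the mapping so integer ids can be looked up.
index_to_isbn = {idx: isbn for isbn, idx in isbn_to_index.items()}

# Replace each integer product id with its ISBN, keeping the score.
recommended_isbns = [
    (index_to_isbn[product], score) for _, product, score in recommendations
]
print(recommended_isbns)
```

The resulting ISBNs could then be joined against BX-Books.csv to show titles and authors.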

### Conclusion 
In this notebook we saw how to build a basic implicit ALS model in Spark. However, the data used was fairly plain, with "0"s being used for all implicit interactions. Further work should consider a dataset more suited to implicit ALS.