# Implicit ALS

ALS is usually applied to a set of explicit user/product ratings. But what if the data isn't so self-explanatory?

### A day trip to the library
Consider, for example, the data collected by a local library. The library records which users took out each book and how long they kept it before returning it.

As such, we have no explicit indication that a user liked or disliked the books they took out: just because you borrowed a book does not mean that you enjoyed it, or even read it.
The missing data is also of interest: the fact that a user has not taken out a specific book could indicate that they dislike that genre, or simply that they haven't been to that section of the library.

Furthermore, the same user action could have many different causes. Suppose you borrow a book three times. That might indicate that you loved the book, but it may also indicate that it appealed to you less than the other books you borrowed, so you never got round to reading it the first two times.

To make the situation even worse, implicit data is often dirty. For example, a user may withdraw a library book for their child using their account, or they may accidentally pick up a book that was sitting on the counter. 

### The solution
Building on the standard ALS implementation, [Hu et al. (2008)](https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwi899eAu6baAhUurlkKHaVvB6UQFggsMAA&url=http%3A%2F%2Fyifanhu.net%2FPUB%2Fcf.pdf&usg=AOvVaw3WIcPGTpxR8m7C32F8whFx) presented a methodology for carrying out ALS when dealing with implicit data.

The general idea is that we have some recorded observations $r_{u,i}$ denoting the level of interaction user $u$ had with product $i$. For example, if user $1$ borrowed book $4$ once we may set $r_{1,4}=1$. Alternatively, we may wish to let $r_{u,i}$ hold information about how many days the book was borrowed for. (There is a lot of freedom in this setup, so we need to make some data-specific decisions about how to define $r_{u,i}$.)
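
As a rough illustration of these choices, the sketch below aggregates a loan log into $r_{u,i}$ in two ways: by counting loans, or by summing the days each book was kept. The `loans` records and the `build_observations` helper are hypothetical and made up for this example; they are not part of the dataset used later.

```python
from collections import defaultdict

# Hypothetical loan log: (user_id, book_id, days_kept). Values are made up.
loans = [
    (1, 4, 14),
    (1, 4, 7),
    (2, 4, 21),
    (2, 9, 3),
]

def build_observations(loans, use_days=False):
    """Aggregate the loan log into a dict {(user, book): r_ui}."""
    r = defaultdict(int)
    for user_id, book_id, days_kept in loans:
        # Either count each loan once, or accumulate the days the book was kept
        r[(user_id, book_id)] += days_kept if use_days else 1
    return dict(r)

r_counts = build_observations(loans)                # r_ui = number of loans
r_days = build_observations(loans, use_days=True)   # r_ui = total days kept
```

Pairs that never appear in the log are simply absent from the dictionary, which corresponds to $r_{u,i}=0$.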

Given the set of observations $r_{u,i}$, a binary indicator $p_{u,i}$ is introduced where:

$ p_{u,i} = \begin{cases} 1 & \mbox{if } r_{u,i}>0 \\
0 & \mbox{otherwise.} \end{cases} $


A confidence parameter $\alpha$ lets the user determine how much importance they wish to place on the recorded $r_{u,i}$. This leads to the introduction of $c_{u,i}$ which we take to be the confidence we have in the strength of user $u$'s reaction to product $i$: 
$c_{u,i} = 1 + \alpha r_{u,i}$.
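
The two definitions above translate directly into code. This is a minimal sketch: the function names are made up for illustration, and the value of `alpha` is an arbitrary choice rather than a recommendation.

```python
def preference(r_ui):
    """Binary preference p_ui: 1 if any interaction was observed, else 0."""
    return 1 if r_ui > 0 else 0

def confidence(r_ui, alpha):
    """Confidence c_ui = 1 + alpha * r_ui."""
    return 1 + alpha * r_ui

alpha = 40  # illustrative value only
for r_ui in (0, 1, 3):
    print(r_ui, preference(r_ui), confidence(r_ui, alpha))
```

Note that an unobserved pair ($r_{u,i}=0$) still receives the minimal confidence $c_{u,i}=1$, so missing data contributes to the fit, just weakly.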

Let $N_u$ denote the number of users, and $N_p$ denote the number of products. Let $k\in \mathbb{Z}^+$ be a user-defined number of factors.
Now, in implicit ALS the goal is to find matrices $X\in \mathbb{R}^{N_u \times k}$ and $Y\in \mathbb{R}^{N_p \times k}$ such that the following cost function is minimised:

$\sum_{u,i} c_{u,i}(p_{u,i}-X_u^T Y_i)^2 + \lambda (\sum_u \| X_u\|^2 + \sum_{i} \| Y_i\|^2), $


where
$X_u$ is the $u$th row of $X$,
$Y_i$ is the $i$th row of $Y$, and
$\lambda$ is a user-defined regularisation parameter which prevents overfitting.
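
To make the cost function concrete, here is a small NumPy sketch that evaluates it for given factor matrices. It assumes the observations are stored in a dense array `R` of shape $N_u \times N_p$, which is only practical for toy data; the function name `implicit_als_cost` is made up for this example.

```python
import numpy as np

def implicit_als_cost(R, X, Y, alpha, lam):
    """Evaluate the implicit ALS cost for dense observations R (N_u x N_p)."""
    P = (R > 0).astype(float)        # binary preferences p_ui
    C = 1.0 + alpha * R              # confidences c_ui
    residual = P - X @ Y.T           # p_ui - X_u^T Y_i for every pair
    fit = np.sum(C * residual ** 2)  # confidence-weighted squared error
    penalty = lam * (np.sum(X ** 2) + np.sum(Y ** 2))
    return fit + penalty

# Toy example: 2 users, 2 products, 1 factor
R = np.array([[1.0, 0.0], [0.0, 2.0]])
X = np.ones((2, 1))
Y = np.ones((2, 1))
cost = implicit_als_cost(R, X, Y, alpha=1.0, lam=0.1)
```

The sum runs over every user/product pair, observed or not, which is what distinguishes this objective from the explicit-rating case.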

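With this minimisation in hand, the product $X_u^T Y_i$ gives an estimate of the preference $p_{u,i}$ for every user/product pair, including those for which no interaction has yet occurred. The products with the highest estimated preference are the ones to recommend.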
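
The "alternating" part of ALS comes from holding one factor matrix fixed while solving for the other. With $Y$ fixed, the cost is quadratic in each $X_u$, and Hu et al. give the closed-form solution $X_u = (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p(u)$, where $C^u$ is the diagonal matrix of user $u$'s confidences and $p(u)$ the vector of their preferences. The sketch below implements that user half-step with NumPy on a dense `R`; the function name and the dense representation are assumptions made for readability, not how Spark implements it.

```python
import numpy as np

def update_user_factors(R, Y, alpha, lam):
    """One ALS half-step: solve for every user's factors with Y held fixed."""
    n_users = R.shape[0]
    k = Y.shape[1]
    X = np.zeros((n_users, k))
    for u in range(n_users):
        c_u = 1.0 + alpha * R[u]            # confidences for user u
        p_u = (R[u] > 0).astype(float)      # preferences for user u
        # Left-hand side: Y^T C^u Y + lambda * I
        A = Y.T @ (c_u[:, None] * Y) + lam * np.eye(k)
        # Right-hand side: Y^T C^u p(u)
        b = Y.T @ (c_u * p_u)
        X[u] = np.linalg.solve(A, b)
    return X

# Toy example: 2 users, 3 products, 2 factors
R = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = update_user_factors(R, Y, alpha=1.0, lam=0.1)
scores = X @ Y.T   # model scores X_u^T Y_i for every user/product pair
```

The product half-step is symmetric: swap the roles of `X` and `Y` and work on the columns of `R`. Repeating the two half-steps is what the `iterations` setting controls in a full implementation.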

### Let's get going
We are going to run implicit ALS using the implementation given in the pyspark.mllib.recommendation module. 

The data we will be using can be found at http://www2.informatik.uni-freiburg.de/~cziegler/BX/

In [19]:
import os
import pandas as pd
import numpy as np

os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'

In [10]:
#Set up a spark context
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("implicitALS")
#getOrCreate reuses an existing context, so this cell can safely be re-run
sc = SparkContext.getOrCreate(conf=conf)

In [38]:
with open("new training list.csv",'r') as f:
    with open("updated training list.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)

In [47]:
#Load in the data
#The data is csv with ',' as a delimiter, hence the split command.
#Column 0 is the user id, column 3 is the integer book id
# (the New_id column used later), and column 2 is the observation.
ratings = sc.textFile('updated training list.csv').map(lambda x: x.split(",")) \
            .map(lambda x: (int(x[0]), int(x[3]), int(x[2])))

In [75]:
ratingsdf=pd.read_csv('new training list.csv')

In [48]:
ratings

PythonRDD[49] at RDD at PythonRDD.scala:52

Let's have a look at the first 10 entries in the ratings file: 

In [54]:
ratings.take(10)

[(2, 508, 1),
 (9, 34, 1),
 (12, 508, 1),
 (24, 522, 39),
 (29, 522, 1),
 (42, 1, 1),
 (42, 34, 1),
 (42, 628, 1),
 (42, 522, 2),
 (61, 508, 1)]

In [58]:
from pyspark.mllib.recommendation import ALS
model=ALS.trainImplicit(ratings, rank=5, iterations=5)

Let's have a look at user 42. We wish to predict which books the user will like, based on their interactions, excluding books they have already borrowed.

In [153]:
def rec(user_id, k):
    #Score every book for this user; results come back best first
    n_books = ratingsdf['New_id'].nunique()
    predicted = model.recommendProducts(user_id, n_books)

    #Books the user has already interacted with
    seen = set(ratingsdf.loc[ratingsdf['user_id'] == user_id, 'New_id'])

    #Keep the k highest-scoring books the user has not borrowed yet
    unseen = [(r.product, r.rating) for r in predicted if r.product not in seen]
    return unseen[:k]

In [155]:
rec(42,10)

[(6, 0.174345308161444),
 (106, 0.12648781554226715),
 (3, 0.11752252422730258),
 (15, 0.06500946377697087),
 (153, 0.05178953708412176),
 (28, 0.04814923991183831),
 (5, 0.046985803697913664),
 (441, 0.04260351458528505),
 (52, 0.040151053789649185),
 (10, 0.03865064728491016)]