<a href="https://colab.research.google.com/github/yexf308/AdvancedMachineLearning/blob/main/Recommender_Systems_implicit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$\def\m#1{\mathbf{#1}}$
$\def\mb#1{\mathbb{#1}}$
$\def\c#1{\mathcal{#1}}$

# Implicit feedback
We have assumed that the user gives explicit ratings for each item that they interact with.
This is a very restrictive assumption. More generally, we would like to learn from the **implicit feedback** that users give just by interacting with a system.

For example, we can view the fact that a user watched movie $i$ but did not watch movie $j$ as an
implicit signal that they prefer $i$ to $j$.

Even when we do have explicit data, it can still be a good idea to incorporate implicit data into the model. Consider, for example, listening to songs. Users of a music streaming service rarely rate the songs they like or dislike. Far more often, they skip a song they dislike or listen to only part of it, and they come back to a song they really like. So, to infer a user's musical taste profile, their listens, repeat listens, skips, fraction of each track listened to, etc. might be far more valuable signals than explicit ratings.

<img src="https://github.com/yexf308/AdvancedMachineLearning/blob/main/image/implicit1.png?raw=true" width="300" />

### Possible method
- Treat all the missing entries as zeros and minimize the loss over all the entries.

- Problem: with 95% of the entries equal to zero, the model tends to fit the zeros instead of the ones.

<img src="https://github.com/yexf308/AdvancedMachineLearning/blob/main/image/implicit2.png?raw=true" width="500" />


- Alternatively, one can use exactly the same method as for explicit feedback. We assume that if user $u$ has interacted at all with item $i$, then $u$ has a preference for $i$, and we set $A_{ui}=1$. When $A_{ui}=0$, however, the entry should carry lower confidence, because there are many reasons other than dislike for why the user has not interacted with the item, e.g. being unaware of its existence.
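
As a small illustration of the binary matrix $A$ described above, the sketch below builds it from an interaction log. The log, the sizes `M` and `N`, and the variable names are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical interaction log: (user index, item index) pairs,
# e.g. "user u watched movie i". A repeated pair still yields a single 1.
interactions = [(0, 1), (0, 3), (1, 0), (2, 3), (2, 3)]
M, N = 3, 5  # number of users and items (assumed known in advance)

A = np.zeros((M, N))
for u, i in interactions:
    A[u, i] = 1.0  # any interaction at all is treated as a positive signal

# Fraction of entries equal to 1; the remaining entries are unobserved zeros.
density = A.sum() / A.size
```

In realistic data the density is very small, which is why treating every zero as a confident negative is problematic.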

## Method 1: Weighted matrix factorization 
- Give lower weights to 0 and higher weights to 1:
\begin{align}
\min_{W\in \mb{R}^{M\times k}, H\in \mb{R}^{N\times k}}\sum_{(u,i), A_{ui}=1}(A_{ui}-W_u H_i^\top)^2 +\alpha \sum_{(u,i), A_{ui}=0}(A_{ui}-W_u H_i^\top)^2 +\lambda (\|W\|_F^2 + \|H\|_F^2)
\end{align}
   
    - $0<\alpha< 1$: the weight for 0s is smaller.

    - This makes the loss more balanced (a standard remedy for class-imbalanced classification).

    -  Another motivation from PU (positive-and-unlabeled) learning:
        - 0 can be either “negative” or “unobserved” ⇒ lower weight

        - 1 is always “positive” ⇒ higher weight

- Optimization:

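  - Assume $M$ users, $N$ items, and hidden dimension $k$.

  - Use the ALS (alternating least squares) algorithm, which can reach a time complexity of $O(\|A\|_0k+Mk^2+Nk^2)$.

  - The training time then grows with the number of 1s in $A$, that is $\|A\|_0$, rather than with all $MN$ entries.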
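
The alternating updates can be sketched in NumPy as follows. This is a minimal illustration, not an optimized implementation: the function names, the toy matrix, and the values of `k`, `alpha`, `lam` are assumptions made for the example. With $H$ fixed, each row $W_u$ solves a weighted ridge regression whose normal equations are $\left(\alpha H^\top H + (1-\alpha) H_P^\top H_P + \lambda I\right) W_u^\top = H_P^\top \mathbf{1}$, where $H_P$ stacks the rows $H_i$ with $A_{ui}=1$. The term $\alpha H^\top H$ is shared by all users, so the per-user work only touches the positive entries. This straightforward version costs roughly $O(\|A\|_0 k^2 + (M+N)k^3)$ per sweep, so it is slower than the bound quoted above, which needs a more careful update scheme.

```python
import numpy as np

def weighted_loss(A, W, H, alpha, lam):
    """Weighted squared loss: weight 1 on ones, alpha on zeros, plus L2 penalty.
    Dense O(MN) evaluation, used here only to monitor progress."""
    C = np.where(A == 1, 1.0, alpha)
    R = A - W @ H.T
    return np.sum(C * R**2) + lam * (np.sum(W**2) + np.sum(H**2))

def solve_rows(A, F, alpha, lam):
    """For each row of A, solve the weighted ridge problem with factor F fixed."""
    k = F.shape[1]
    G = alpha * (F.T @ F) + lam * np.eye(k)    # shared by all rows, computed once
    out = np.zeros((A.shape[0], k))
    for r in range(A.shape[0]):
        Fp = F[A[r] == 1]                      # factor rows of the positive entries
        lhs = G + (1.0 - alpha) * (Fp.T @ Fp)
        rhs = Fp.sum(axis=0)                   # zeros drop out because A = 0 there
        out[r] = np.linalg.solve(lhs, rhs)
    return out

def weighted_als(A, k=2, alpha=0.1, lam=0.1, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    M, N = A.shape
    W = 0.1 * rng.standard_normal((M, k))
    H = 0.1 * rng.standard_normal((N, k))
    losses = [weighted_loss(A, W, H, alpha, lam)]
    for _ in range(n_iters):
        W = solve_rows(A, H, alpha, lam)       # update all user factors
        H = solve_rows(A.T, W, alpha, lam)     # update all item factors
        losses.append(weighted_loss(A, W, H, alpha, lam))
    return W, H, losses

# Toy usage on a hypothetical 4 x 4 interaction matrix.
A_toy = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 0, 1]], dtype=float)
W, H, losses = weighted_als(A_toy)
```

Because every half-step solves its subproblem exactly, the recorded loss should never increase from one sweep to the next.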


## Method 2: Matrix factorization with pairwise loss

- Idea: use pairwise comparisons to design the loss function.

  - If $A_{ui}=1$ and $A_{uj}=0$, then user $u$ prefers $i$ over $j$. 

  - Therefore $W_uH_i^\top$ should be larger than $W_uH_j^\top$.

- Matrix factorization with pairwise loss $f(W,H)$:
\begin{align}
f(W,H) = \sum_{u=1}^M \left(\sum_{(i,j): A_{ui}=1, A_{uj}=0}\ell\left(W_u(H_i^\top - H_j^\top)\right)\right)
\end{align}
The optimization problem is 
\begin{align}
\min_{W\in \mb{R}^{M\times k}, H\in \mb{R}^{N\times k}} f(W,H)
\end{align}

- Choice of the loss function $\ell$: a classification loss applied to the margin $z = W_u(H_i^\top - H_j^\top)$, which we want to be positive. 

  - Hinge loss: 
     \begin{align}
     \ell\left(z\right) = \max(0, 1-z)
     \end{align}

  - Logistic loss (also called Bayesian Personalized Ranking, BPR)
    \begin{align}
     \ell\left(z\right) = \log(1+e^{-z})
     \end{align}
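
The objective $f(W,H)$ and the two losses above can be evaluated directly on a small matrix. The sketch below is only meant to make the double sum concrete; the function names are assumptions for this example, and the explicit enumeration of all $(i,j)$ pairs is far too slow for realistic data.

```python
import numpy as np

def hinge(z):
    # max(0, 1 - z), applied elementwise
    return np.maximum(0.0, 1.0 - z)

def logistic(z):
    # log(1 + exp(-z)), computed stably as logaddexp(0, -z)
    return np.logaddexp(0.0, -z)

def pairwise_objective(A, W, H, loss):
    """f(W, H): sum over users u and over pairs (i, j) with A_ui = 1, A_uj = 0."""
    scores = W @ H.T
    total = 0.0
    for u in range(A.shape[0]):
        pos = scores[u, A[u] == 1]          # scores of interacted items
        neg = scores[u, A[u] == 0]          # scores of unobserved items
        z = pos[:, None] - neg[None, :]     # all margins W_u (H_i - H_j)^T
        total += loss(z).sum()
    return total
```

With all-zero factors every margin is 0, so each pair contributes $\log 2$ under the logistic loss and $1$ under the hinge loss.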
    
### Algorithm: SGD
- In each iteration, sample a triple $(u,i,j)$:
   - Sample a user $u$.

   - Sample a positive item $i$ ($A_{ui}=1$).

   - Sample a negative (unobserved) item $j$ ($A_{uj}=0$).

- The sampled loss function: 
 $$\ell(W_u, H_i, H_j)=\log(1+\exp(-W_u(H_i^\top - H_j^\top))) $$    


- Update $W_u, H_i, H_j$ by SGD with step size $\eta>0$:
  - $W_u\leftarrow W_u -\eta\nabla_{W_u}\ell(W_u, H_i, H_j)$ 

  - $H_i\leftarrow H_i -\eta\nabla_{H_i}\ell(W_u, H_i, H_j)$

  - $H_j\leftarrow H_j -\eta\nabla_{H_j}\ell(W_u, H_i, H_j)$

