## Collaborative filtering

This week's learning objectives:
* Implement collaborative filtering recommender systems in TensorFlow
* Implement deep learning content-based filtering using a neural network in TensorFlow
* Understand ethical considerations in building recommender systems

**This week uses a movie platform's recommender system as the running example.**

#### What is the Collaborative filtering algorithm?

It is an unsupervised learning algorithm that predicts a user's preferences from the preferences of other users, and that can also find similar items (products).

#### How does Collaborative filtering work?

Suppose we want to predict how a user would rate a movie, based on that user's ratings of other movies. A natural starting point is a linear regression model that predicts the rating.

However, the inputs to this model do not include any features that describe the movies, so we need a model that learns those features itself. We can then predict a movie's rating from the user's preferences and the learned features. Simple!

**Also**, we can build a system that predicts whether a user will like/favourite a movie, based on other users' likes and the current user's own likes/favourites. Social media timelines already work this way: the posts you see are based on your likes and the likes of similar people. Such a system follows the same steps as above, except that we only decide whether a user will like a movie or not, so the label is binary.
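
For the binary case, one common choice (an assumption here, since these notes do not spell it out) is to pass the linear score through a sigmoid and replace the squared error with the binary cross-entropy loss. Writing $\mathbf{x}^{(i)}$ for the learned feature vector of movie $i$ and $\mathbf{w}^{(j)}, b^{(j)}$ for the parameters of user $j$:

$$
f^{(i,j)} = g\left(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}\right), \qquad g(z) = \frac{1}{1+e^{-z}}
$$

$$
L\left(f^{(i,j)}, y^{(i,j)}\right) = -y^{(i,j)}\log f^{(i,j)} - \left(1-y^{(i,j)}\right)\log\left(1-f^{(i,j)}\right)
$$

Here $f^{(i,j)}$ can be read as the predicted probability that user $j$ likes movie $i$, and $y^{(i,j)} \in \{0, 1\}$ is the observed label.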

#### What is the process of creating a model that implements the Collaborative filtering algorithm?

Let's take the system that predicts movie ratings from a specific user's preferences. After loading all users' movie ratings, we do the following:

1. Normalize the movie ratings using mean normalization: subtract each movie's average rating from its ratings. Without this, a new user who has not rated anything would be predicted a rating of 0 for every movie, which is a poor starting point; with it, the prediction defaults to the movie's average rating.
   <figure>
   <img src="./resources/collaborative-filtering-mean-normalization-2.png"  style="width:400px;height:250px;" >
</figure>

2. As in linear/logistic regression, define a cost function and minimize it to get better predictions. The difference is that the cost now has an extra set of variables, X: the feature vectors that are learned to best describe each movie.
<figure>
   <img src="./resources/ColabFilterLearn.PNG"  style="width:740px;height:250px;" >
</figure>
The collaborative filtering cost function is given by

$$J({\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},\mathbf{w}^{(0)},b^{(0)},...,\mathbf{w}^{(n_u-1)},b^{(n_u-1)}})= \left[ \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+ \underbrace{\left[
\frac{\lambda}{2}
\sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2
+ \frac{\lambda}{2}\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2
\right]}_{regularization}
\tag{1}$$
The first summation in (1) is "for all $i$, $j$ where $r(i,j)$ equals $1$" and could be written:

$$
= \left[ \frac{1}{2}\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1}r(i,j)*(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+\text{regularization}
$$
Where 

| General Notation | Description | Python (if any)|
| :-----------------| :-----------: | :-------------|
| $r(i,j)$     | scalar; 1 if user j rated movie i, 0 otherwise ||
| $y^{(i,j)}$  | scalar; rating given by user j to movie i (defined only if $r(i,j) = 1$) ||
|$\mathbf{w}^{(j)}$ | vector; parameters for user j ||
|$b^{(j)}$     |  scalar; bias parameter for user j ||
| $\mathbf{x}^{(i)}$ |   vector; features of movie i        ||
| $n_u$        | number of users |num_users|
| $n_m$        | number of movies | num_movies |
| $n$          | number of features | num_features                    |
| $\mathbf{X}$ |  matrix of vectors $\mathbf{x}^{(i)}$         | X |
| $\mathbf{W}$ |  matrix of vectors $\mathbf{w}^{(j)}$         | W |
| $\mathbf{b}$ |  vector of bias parameters $b^{(j)}$ | b |
| $\mathbf{R}$ | matrix of elements $r(i,j)$                    | R |
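
As a rough illustration of equation (1), the cost can be computed in vectorized form with NumPy. This is a minimal sketch, not the lab's implementation: the array names `X`, `W`, `b` and `R` follow the table above, while the function name `cofi_cost`, the ratings matrix `Y`, the argument `lambda_` and the toy numbers are assumptions made for this example.

```python
import numpy as np

def cofi_cost(X, W, b, Y, R, lambda_):
    """Regularized collaborative filtering cost, equation (1).

    X: (num_movies, num_features) movie feature vectors x^(i)
    W: (num_users, num_features)  user parameter vectors w^(j)
    b: (1, num_users)             user bias terms b^(j)
    Y: (num_movies, num_users)    ratings y^(i,j)
    R: (num_movies, num_users)    1 where user j rated movie i, else 0
    """
    # Prediction error for every (movie, user) pair, zeroed where no rating exists.
    err = (X @ W.T + b - Y) * R
    squared_error = 0.5 * np.sum(err ** 2)
    regularization = (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))
    return squared_error + regularization

# Hypothetical toy data: 3 movies, 2 users, 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.array([[2.0, 0.0], [0.0, 3.0]])
b = np.array([[0.5, -0.5]])
Y = np.array([[3.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
R = np.array([[1, 0], [0, 1], [1, 0]])

cost = cofi_cost(X, W, b, Y, R, lambda_=1.0)
print(cost)  # 0.375 from the squared error plus 8.5 from regularization
```

Multiplying by `R` plays the role of the $r(i,j)$ factor in the second form of the sum, so unrated entries contribute nothing to the cost.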

3. Calculate the gradients (this lab uses TensorFlow's `tf.GradientTape` for automatic differentiation), and use an optimizer to update the parameters so that the cost converges towards a minimum faster.
   
4. Use the learned movie features X together with the user's learned parameters to predict this user's rating for a movie, and decide whether to recommend it accordingly.
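
The four steps above can be sketched end to end on a toy ratings matrix. To keep the example dependency-light, this sketch swaps the lab's `tf.GradientTape` and optimizer for plain gradient descent with hand-derived gradients of cost (1) in NumPy. The data, the hyperparameters (`num_features`, `lambda_`, `alpha`, the iteration count) and the helper `squared_error` are all assumptions for illustration, and the code assumes every movie has at least one rating.

```python
import numpy as np

# Hypothetical toy data: 4 movies x 3 users; 0 marks "not rated".
Y = np.array([[5.0, 4.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 1.0, 5.0],
              [1.0, 2.0, 4.0]])
R = (Y > 0).astype(float)

# Step 1: mean normalization per movie, using rated entries only.
Y_mean = (Y * R).sum(axis=1, keepdims=True) / R.sum(axis=1, keepdims=True)
Y_norm = (Y - Y_mean) * R

num_movies, num_users = Y.shape
num_features = 2             # assumed size of the learned feature vectors
lambda_, alpha = 0.1, 0.05   # assumed regularization strength and learning rate

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(num_movies, num_features))
W = rng.normal(scale=0.1, size=(num_users, num_features))
b = np.zeros((1, num_users))

def squared_error(X, W, b):
    """Unregularized part of cost (1) on the normalized ratings."""
    return 0.5 * np.sum(((X @ W.T + b - Y_norm) * R) ** 2)

initial_error = squared_error(X, W, b)

# Steps 2-3: gradient descent on cost (1); X, W and b are updated together.
for _ in range(2000):
    err = (X @ W.T + b - Y_norm) * R          # (num_movies, num_users)
    grad_X = err @ W + lambda_ * X
    grad_W = err.T @ X + lambda_ * W
    grad_b = err.sum(axis=0, keepdims=True)   # the bias is not regularized
    X -= alpha * grad_X
    W -= alpha * grad_W
    b -= alpha * grad_b

final_error = squared_error(X, W, b)

# Step 4: predict every rating and add the per-movie mean back.
predictions = X @ W.T + b + Y_mean
print(initial_error, final_error)
print(np.round(predictions, 2))
```

Entries of `predictions` where `R` is 0 are the model's guesses for unrated movies, which is what a recommender would sort by.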

#### Content-based filtering vs Collaborative filtering

Content-based filtering, similar to Collaborative filtering, is an unsupervised learning algorithm that tries to match items to users, but it does so by analysing the item's features and the user's features too! User features can be things like gender, age, favourite genre, etc.

This means we now also have a feature vector for each user, which is used to match the user to an item. So we drop $\mathbf{w}^{(j)}$ and use $v_u$ for the user's vector, and use $v_m$ instead of $\mathbf{x}^{(i)}$ for the item's vector. This keeps things simpler than carrying a large set of per-user parameters that would make the model more complex.

The dot product of $v_u$ and $v_m$ should help us find a good match.

#### How can we calculate $v_u$ & $v_m$?

The model architecture that is widely used in big commercial systems is built from two neural networks: one for the user's features ($v_u$) and one for the movie's features ($v_m$). The dot product of these two vectors gives a prediction of how this user would rate this movie, or, with binary labels, the probability of the user liking/favouriting this movie. \
**Note:** The output layer of each network has more than one unit, because its output is the vector $v_u$ or $v_m$. Both vectors must have the same size for the dot product to be defined.

Cost function would be:
$$ J = \sum_{(i,j):r(i,j)=1}(\mathbf{v_u}^{(j)} \cdot \mathbf{v_m}^{(i)} - y^{(i,j)})^2 + \text{NN regularization term} $$
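
To make the two-network idea concrete, here is a forward-pass-only sketch in NumPy. It is not a trained model and not the lab's TensorFlow code: the layer sizes, the random weights and the helper names `init_tower` and `tower_forward` are hypothetical. It only shows that each network ends in a multi-unit output layer and that the prediction is the dot product of the two output vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tower(in_dim, hidden_dim, out_dim):
    """Random (untrained) weights for a tiny two-layer network."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def tower_forward(params, features):
    """Map raw features to an embedding vector (v_u or v_m)."""
    hidden = np.maximum(0.0, features @ params["W1"] + params["b1"])  # ReLU
    # Linear output layer with one unit per embedding dimension.
    return hidden @ params["W2"] + params["b2"]

# Hypothetical sizes: 5 raw user features, 12 raw movie features, and
# 8-dimensional embeddings. The output sizes of the two towers must match.
user_tower = init_tower(in_dim=5, hidden_dim=16, out_dim=8)
movie_tower = init_tower(in_dim=12, hidden_dim=16, out_dim=8)

user_features = rng.random((3, 5))     # 3 users
movie_features = rng.random((4, 12))   # 4 movies

v_u = tower_forward(user_tower, user_features)     # shape (3, 8)
v_m = tower_forward(movie_tower, movie_features)   # shape (4, 8)

# Predicted score for every (movie, user) pair: scores[i, j] = v_m[i] . v_u[j]
scores = v_m @ v_u.T                               # shape (4, 3)
print(scores.shape)
```

In training, both sets of weights would be updated together by minimizing the cost above over the rated pairs.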

To find movies similar to movie i, we can use the squared distance between feature vectors, $\lVert \mathbf{v_m}^{(k)} - \mathbf{v_m}^{(i)} \rVert^2$, and pick the movies k with the smallest distance to movie i.
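
A small sketch of that nearest-neighbour lookup, assuming the movie vectors are stacked as rows of a NumPy array. The function name `most_similar` and the example vectors are made up for illustration.

```python
import numpy as np

def most_similar(item_vectors, i, k=3):
    """Indices of the k items closest to item i by squared distance."""
    diffs = item_vectors - item_vectors[i]
    sq_dist = np.sum(diffs ** 2, axis=1)
    sq_dist[i] = np.inf          # exclude the item itself
    return np.argsort(sq_dist)[:k]

# Hypothetical 2-dimensional movie vectors.
v_m = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [0.5, 0.5]])

neighbours = most_similar(v_m, i=0, k=2)
print(neighbours)  # movies 1 and 3 are closest to movie 0
```

Because the movie vectors do not depend on the user, these neighbour lists can be computed ahead of time.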

#### Recommending from a large catalogue

Platforms can have tens of millions of items to recommend from, and running the model on every item for every user would be computationally very expensive.

One way to handle this better is to use two steps: retrieval and ranking.

During retrieval:
* Generate a large list of plausible item candidates:
  * For each of the last 10 movies watched by the user, find the 10 most similar movies
  * For the user's 3 most viewed genres, find the top 10 movies
  * Top 20 movies in the country

This produces a list of hundreds of movies, which is good enough. The next step is ranking these plausible candidates. We can now rank the retrieved list using our NN, since we have gone down from tens of millions to hundreds of movies. If the movies' feature vectors have been computed in advance, the model only has to run the user network once and then score each candidate movie.
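
The two steps can be sketched as follows. Everything here is illustrative: the lookup tables (`similar_to`, `top_by_genre`, `top_in_country`), the function names `retrieve` and `rank`, and the toy vectors are assumptions, and dropping already-watched movies from the candidates is an extra design choice made for this example.

```python
import numpy as np

def retrieve(last_watched, similar_to, top_genres, top_by_genre,
             top_in_country, watched):
    """Union the candidate sources, dropping movies the user has already seen."""
    candidates = set()
    for movie in last_watched:
        candidates.update(similar_to.get(movie, []))
    for genre in top_genres:
        candidates.update(top_by_genre.get(genre, []))
    candidates.update(top_in_country)
    return sorted(candidates - set(watched))

def rank(candidates, v_u, v_m):
    """Order candidates by the dot product of user and movie vectors."""
    scores = {m: float(v_m[m] @ v_u) for m in candidates}
    return sorted(candidates, key=lambda m: scores[m], reverse=True)

# Hypothetical precomputed lookups for a catalogue of 6 movies (ids 0..5).
similar_to = {0: [1, 2], 1: [0, 3]}
top_by_genre = {"comedy": [2, 4]}
top_in_country = [5, 0]

candidates = retrieve(last_watched=[0, 1], similar_to=similar_to,
                      top_genres=["comedy"], top_by_genre=top_by_genre,
                      top_in_country=top_in_country, watched=[0, 1])

# Precomputed movie vectors and one user vector (made-up values).
v_m = np.array([[0.0, 0.0], [0.0, 0.0], [0.2, 0.1],
                [0.9, 0.4], [0.1, 0.9], [0.5, 0.0]])
v_u = np.array([1.0, 0.5])

ranked = rank(candidates, v_u, v_m)
print(candidates)  # [2, 3, 4, 5]
print(ranked)      # [3, 4, 5, 2]
```

Retrieval only uses cheap lookups, so the more expensive scoring step runs on a short list instead of the whole catalogue.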