# General overview of recommendation systems for movies

There are basically three types of recommender systems:

* **Demographic Filtering**: offers generalized recommendations to every user, based on movie popularity and/or genre. The system recommends the same movies to users with similar demographic features. Since each user is different, this approach is considered too simple. The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.

* **Content Based Filtering**: suggests similar items based on a particular item. This system uses item metadata (for movies: genre, director, description, actors, etc.) to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, they will probably also like an item that is similar to it.

* **Collaborative Filtering**: this system matches users with similar interests and provides recommendations based on this matching. Unlike their content-based counterparts, collaborative filters do not require item metadata.

# Analyzing different model proposals

**Demographic Filtering** would produce a general recommender that doesn't fit our final product. Although we could add some bias in favour of the most popular movies, this is not the type of recommender that we aim to build. A combination of **Content Based Filtering** and **Collaborative Filtering** (more so the latter) seems to be the correct approach to the recommendation problem.

The nature of our data, which consists mainly of user ratings, suggests that collaborative filtering is appropriate for the baseline model. **Similarity measures** and **matrix factorization (SVD)** are the natural first candidates.
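As a rough illustration of these two baseline ideas, the sketch below computes a cosine similarity between users and a rank-$k$ truncated SVD of a small rating matrix. The matrix `R`, its values and the choice `k = 2` are made up for the example, and treating unrated entries as zeros is a simplification that a real baseline would likely avoid (for instance, by fitting only on observed ratings).

```python
import numpy as np

# Hypothetical 4 users x 5 movies rating matrix (0 = not rated).
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Users 0 and 1 rate the same movies highly; users 0 and 2 do not.
sim_01 = cosine_similarity(R[0], R[1])
sim_02 = cosine_similarity(R[0], R[2])

# Rank-k truncated SVD: R is approximated by user_emb @ item_emb.T
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_emb = U[:, :k] * s[:k]   # one k-dimensional embedding per user
item_emb = Vt[:k].T           # one k-dimensional embedding per movie
R_hat = user_emb @ item_emb.T # predicted scores for every (user, movie) pair
```

The entries of `R_hat` at positions where `R` is zero can be read as predicted scores for unseen movies.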

However, matrix factorization has some limitations. If we consider our input $x$ as an initial user query, it is difficult to use side features (any feature beyond the user or item ID) and, as a result, the model can only be queried with a user or item present in the training set. The relevance of the recommendations is also a concern: popular items tend to be recommended to everyone, especially when the dot product is used as a similarity measure. It is better to capture **specific user interests**.
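The popularity effect can be seen in a toy example. The sketch below assumes, as is often observed in practice, that frequently rated items end up with larger-norm embeddings; the vectors themselves are invented for illustration. The dot product rewards the larger norm, while cosine similarity only looks at direction.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.5])          # hypothetical user embedding
niche_item = np.array([1.0, 0.5])     # perfectly aligned with the query
popular_item = np.array([3.0, 0.0])   # larger norm, less aligned

dot_niche = float(query @ niche_item)
dot_popular = float(query @ popular_item)
cos_niche = cosine(query, niche_item)
cos_popular = cosine(query, popular_item)

# The dot product ranks the popular item first,
# cosine similarity ranks the niche item first.
print(dot_niche, dot_popular)
print(cos_niche, cos_popular)
```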

**Deep neural network (DNN)** models can address these limitations of matrix factorization. DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations. The input to a DNN can include:

* **Dense features**: Watch time (which we don't really have access to), ratings...

* **Sparse features**: For example, watch history and country.
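One simple way to turn such features into a single input vector is sketched below. The corpus size, the country list and the helper `encode_query` are hypothetical, and the multi-hot encoding is only the most basic option; sparse features are commonly passed through embedding layers instead.

```python
import numpy as np

NUM_MOVIES = 6                    # hypothetical corpus size
COUNTRIES = ["ES", "FR", "US"]    # hypothetical country vocabulary

def encode_query(watched_ids, country, mean_rating):
    """Concatenate sparse (multi-hot / one-hot) and dense features."""
    history = np.zeros(NUM_MOVIES)
    history[list(watched_ids)] = 1.0              # multi-hot watch history
    country_vec = np.zeros(len(COUNTRIES))
    country_vec[COUNTRIES.index(country)] = 1.0   # one-hot country
    dense = np.array([mean_rating / 5.0])         # rating scaled to [0, 1]
    return np.concatenate([history, country_vec, dense])

x = encode_query({0, 3}, "FR", 4.0)
print(x.shape)
```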

In summary, we have the following:

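* **Matrix Factorization** is usually the better choice for large corpora. It is easier to scale for an input matrix as sparse as ours, cheaper to query, and less prone to the "folding" phenomenon (unrelated groups of items ending up close together in the embedding space, which produces spurious recommendations).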

* **DNN models** can better capture personalized preferences, but are harder to train and more expensive to query. DNN models are preferable to matrix factorization for scoring because DNN models can use more features to better capture relevance. Also, it is usually acceptable for DNN models to "fold", since you mostly care about ranking a pre-filtered set of candidates assumed to be relevant.

In this notebook, we will explore the DNN alternative for movie recommendation and, as our dataset is refined, carry out a meaningful implementation of the model of our choice.

# Model architecture

One possible DNN model is **softmax**, which treats the problem as a multiclass prediction problem in which:

 * The input is the user query.
 
 * The output is a probability vector with size equal to the number of items in the corpus, representing the probability of interacting with each item; for example, the probability of watching one movie/director/actor or another.


The model architecture determines the complexity and expressivity of the model. By adding hidden layers and non-linear activation functions (for example, **ReLU**), the model can capture more complex relationships in the data. However, increasing the number of parameters also typically makes the model harder to train and more expensive to serve. We will denote the output of the last hidden layer by $\psi(x)\in\mathbb{R}^d$.

The model maps the output of the last layer, $\psi(x)$, through a softmax layer to a probability distribution $\hat{p}=h(\psi(x)V^T)$, where $h$ is the well-known softmax function, $h(y)_i = e^{y_i}/\sum_{j} e^{y_j}$, and $V\in\mathbb{R}^{n\times d}$ is the weight matrix of the softmax layer. The softmax layer maps a vector of scores $y\in\mathbb{R}^n$ (sometimes called the logits) to a probability distribution.
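A minimal NumPy sketch of this forward pass follows, assuming a single ReLU hidden layer. The layer sizes, the random initialization and the names `W`, `b`, `psi` and `predict` are illustrative choices, not part of any fixed design.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, d, n = 10, 4, 6    # hypothetical sizes: input, hidden, corpus

W = rng.normal(scale=0.1, size=(input_dim, d))  # hidden layer weights
b = np.zeros(d)                                 # hidden layer bias
V = rng.normal(scale=0.1, size=(n, d))          # softmax layer weights

def psi(x):
    """Output of the last hidden layer, a vector in R^d."""
    return np.maximum(0.0, x @ W + b)           # ReLU activation

def softmax(y):
    """Map logits y in R^n to a probability distribution."""
    e = np.exp(y - np.max(y))                   # shift for numerical stability
    return e / e.sum()

def predict(x):
    """p_hat = h(psi(x) V^T)."""
    return softmax(psi(x) @ V.T)

x = rng.random(input_dim)     # stand-in for an encoded user query
p_hat = predict(x)
print(p_hat.sum())
```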

Basically, the loss function must compare the following:

* $\hat p$, the output of the softmax layer (a probability distribution).

* $p$, the ground truth, representing the items the user has interacted with.

For example, you can use the **cross-entropy loss**, since you are comparing two probability distributions.
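For reference, the cross-entropy between $p$ and $\hat p$ is $-\sum_i p_i \log \hat p_i$. The sketch below evaluates it on invented vectors; the small constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import numpy as np

def cross_entropy(p, p_hat, eps=1e-12):
    """Cross-entropy -sum_i p_i * log(p_hat_i) between two distributions."""
    return float(-np.sum(p * np.log(p_hat + eps)))

# Ground truth: the user interacted with items 0 and 1 (normalized to sum to 1).
p = np.array([0.5, 0.5, 0.0, 0.0])

good_prediction = np.array([0.45, 0.45, 0.05, 0.05])
bad_prediction = np.array([0.05, 0.05, 0.45, 0.45])

loss_good = cross_entropy(p, good_prediction)
loss_bad = cross_entropy(p, bad_prediction)
print(loss_good, loss_bad)   # the better prediction has the lower loss
```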


# Softmax training

The softmax training data consists of the query features, $x$, and a vector of items the user interacted with, $p$ (represented as a probability distribution). The variables of the model are the weights in the different layers (depth is up to us to choose). The model is typically trained using a variant of **stochastic gradient descent**.
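To make one training step concrete, here is a hedged sketch of a single gradient descent update on the softmax weights $V$ only, with the hidden-layer output $\psi(x)$ held fixed. It relies on the standard result that, when $p$ sums to 1, the gradient of the cross-entropy with respect to the logits is $\hat p - p$. The values of `psi_x`, `p` and `learning_rate` are made up; a full implementation would also backpropagate into the hidden layers.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def loss_and_grad_V(psi_x, V, p):
    """Cross-entropy loss and its gradient with respect to V."""
    p_hat = softmax(V @ psi_x)                  # same as h(psi(x) V^T)
    loss = float(-np.sum(p * np.log(p_hat + 1e-12)))
    grad_V = np.outer(p_hat - p, psi_x)         # valid when p sums to 1
    return loss, grad_V

psi_x = np.array([1.0, 0.5, 0.0])     # hypothetical last-hidden-layer output
V = np.zeros((4, 3))                  # n = 4 items, d = 3
p = np.array([1.0, 0.0, 0.0, 0.0])    # the user interacted with item 0
learning_rate = 0.5

loss_before, grad_V = loss_and_grad_V(psi_x, V, p)
V = V - learning_rate * grad_V        # one gradient descent step
loss_after, _ = loss_and_grad_V(psi_x, V, p)
print(loss_before, loss_after)
```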

Computing the gradient of the loss (for a single query $x$) can be prohibitively expensive if the corpus size $n$ is too big. You could set up a system to compute gradients only on the positive items (items that are active in the ground truth vector). However, if the system only trains on positive pairs, the model may suffer from folding.

Instead of using all items to compute the gradient (which can be too expensive) or using only positive items (which makes the model prone to folding), you can use **negative sampling**. More precisely, you compute an approximate gradient, using the following items:

* All positive items (the ones that appear in the target label).

* A sample of negative items ($j\in\{1,...,n\}$). We can sample them uniformly, or give a higher sampling probability to the items with higher scores.
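The sketch below illustrates the idea under simplifying assumptions: the softmax is computed only over the positives plus the sampled negatives, without the correction terms that production sampled-softmax implementations usually apply, and only the softmax weights $V$ are updated. The helper names, the sizes and the score-weighted sampling scheme are illustrative.

```python
import numpy as np

def sample_negatives(positives, n, num_neg, rng, scores=None):
    """Draw negative item ids, uniformly or weighted by model score."""
    candidates = np.setdiff1d(np.arange(n), positives)
    if scores is None:
        probs = None                            # uniform sampling
    else:
        w = np.exp(scores[candidates] - scores[candidates].max())
        probs = w / w.sum()                     # favour high-scoring items
    return rng.choice(candidates, size=num_neg, replace=False, p=probs)

def sampled_grad_V(psi_x, V, positives, negatives):
    """Approximate gradient for the rows of V in the sampled subset."""
    subset = np.concatenate([positives, negatives])
    logits = V[subset] @ psi_x
    e = np.exp(logits - logits.max())
    p_hat = e / e.sum()                         # softmax over the subset only
    p = np.zeros(len(subset))
    p[:len(positives)] = 1.0 / len(positives)   # ground truth on the subset
    return subset, np.outer(p_hat - p, psi_x)

rng = np.random.default_rng(0)
n, d = 1000, 8                                  # hypothetical corpus and embedding sizes
V = rng.normal(scale=0.1, size=(n, d))
psi_x = rng.random(d)                           # stand-in for psi(x)
positives = np.array([3, 42])

negatives = sample_negatives(positives, n, num_neg=20, rng=rng, scores=V @ psi_x)
subset, grad_rows = sampled_grad_V(psi_x, V, positives, negatives)
V[subset] -= 0.1 * grad_rows                    # update 22 rows instead of 1000
```

Passing `scores=None` to `sample_negatives` gives the uniform variant.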