# Meta vector model (with no normalization)

# Model 1
We have a dataset of covariates $(x_u, y_i)$ and a binary response variable $C \in \{0,1\}$.

Assume that user $u$ clicks on a particular item $i$ with probability $\theta_{ui}$.

$$ \theta_{ui} = Pr(click \mid data, par) = Pr(C = 1 \mid (x_u, y_i), W) = \sigma( x_u^T W y_i) $$

where $\sigma(x) = \frac{\exp(x)}{1+\exp(x)}$ is the logistic function (the inverse of the logit transform), and $W \in \Re^{d \times d}$ holds the model's parameters.
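As a minimal sketch of the click probability, assuming the score is the bilinear form $x_u^T W y_i$, the computation might look like the following in NumPy. The function names and the toy vectors are made up for illustration and are not part of the model definition.

```python
import numpy as np

def sigmoid(z):
    """Logistic function exp(z) / (1 + exp(z)), written to avoid overflow."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) is always in (0, 1], so it cannot overflow.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def click_probability(x_u, y_i, W):
    """theta_ui = sigmoid(x_u^T W y_i) for a single user/item pair."""
    return sigmoid(x_u @ W @ y_i)

# Toy example with made-up vectors (d = 3).
d = 3
x_u = np.array([0.5, -1.0, 2.0])
y_i = np.array([1.0, 0.0, 0.5])
W = np.eye(d)  # with the identity, the score is the plain dot product x_u . y_i
theta = click_probability(x_u, y_i, W)
```

With $W$ equal to the identity the model reduces to a logistic function of the raw dot product, which is a convenient sanity check.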

## Likelihood function

The likelihood of one observation is $ P(c_k | W) =  \theta_k^{c_k} (1-\theta_k) ^{1-c_k} $.
Assuming that each observation is independent, the log likelihood can be written as

$$ loglik = \sum_{k=1}^{n} \left[ c_k \ln(\theta_k) + (1-c_k)\ln(1-\theta_k) \right] $$

where $\theta_k = \theta_{u,i} = \sigma( x_u^T W y_i)$ for the user $u$ and item $i$ of observation $k$.
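The log likelihood can be evaluated for a whole batch at once. The sketch below assumes the observations are stacked row-wise into arrays `X`, `Y` and `c` (hypothetical names) and that the score is $x_u^T W y_i$. It uses `np.logaddexp` so that $\ln(\theta_k)$ and $\ln(1-\theta_k)$ stay finite for large scores.

```python
import numpy as np

def log_likelihood(W, X, Y, c):
    """Bernoulli log likelihood over n independent observations.

    X: (n, d) user vectors, Y: (n, d) item vectors, c: (n,) clicks in {0, 1}.
    """
    # Row-wise bilinear score s_k = x_k^T W y_k.
    s = np.einsum("nd,de,ne->n", X, W, Y)
    # ln(sigmoid(s)) = -ln(1 + exp(-s)), ln(1 - sigmoid(s)) = -ln(1 + exp(s)).
    log_theta = -np.logaddexp(0.0, -s)
    log_one_minus_theta = -np.logaddexp(0.0, s)
    return np.sum(c * log_theta + (1 - c) * log_one_minus_theta)

# Toy data, randomly generated for illustration only.
rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d))
c = np.array([1, 0, 1, 1, 0])
ll = log_likelihood(np.eye(d), X, Y, c)
```

With $W = 0$ every $\theta_k$ equals 0.5, so the log likelihood is $n \ln(0.5)$, which gives a quick check of the implementation.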

## Prior
We know from before that the product $x_u^T y_i$ is correlated with relevance.
Therefore, it is natural to center the prior on the identity matrix: mean 1 on the diagonal and 0 elsewhere.
For simplicity, assume all entries in $W$ are independent.

$$P(w_{ij}) = N( 1_{i=j}, \sigma_0)$$

where $\sigma_0$ is a fixed constant, e.g. $\sigma_0 = 1$.

## Posterior

$$ \ln P(W \mid data) = loglik + \sum_{i = 1}^d \sum_{j = 1}^d \ln P(w_{ij}) + const $$
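A minimal sketch of the unnormalized log posterior follows. It assumes that $\sigma_0$ is the standard deviation of the Gaussian prior (the text does not say whether it is a standard deviation or a variance), that the prior covers every entry of $W$, and that the score is $x_u^T W y_i$. The function and array names are illustrative.

```python
import numpy as np

def log_prior(W, sigma0=1.0):
    """Independent Gaussian log densities, mean 1 on the diagonal and 0 elsewhere.

    Assumption: sigma0 is a standard deviation.
    """
    mean = np.eye(W.shape[0])
    z = (W - mean) / sigma0
    return np.sum(-0.5 * z**2 - np.log(sigma0) - 0.5 * np.log(2.0 * np.pi))

def log_posterior(W, X, Y, c, sigma0=1.0):
    """Log posterior up to an additive constant: loglik + log prior."""
    s = np.einsum("nd,de,ne->n", X, W, Y)  # s_k = x_k^T W y_k
    loglik = np.sum(-c * np.logaddexp(0.0, -s) - (1 - c) * np.logaddexp(0.0, s))
    return loglik + log_prior(W, sigma0)

# Toy data, made up for illustration: two observations in d = 2.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [1.0, 0.0]])
c = np.array([1, 0])
lp_identity = log_posterior(np.eye(2), X, Y, c)
lp_zero = log_posterior(np.zeros((2, 2)), X, Y, c)
```

Maximizing this quantity over $W$ gives a MAP estimate. The prior term acts as an L2 penalty that pulls $W$ toward the identity rather than toward zero.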


## Restrictions:
- $|Wy_i| = 1$ or $|x_u^T W| = 1$ is required for the normalized dot product to work in production.
- This model does not consider normalization factors. The model should be expanded to take them into account.
- A dimension of 300 in the final model is too big to run in production.
It needs to be reduced to a manageable size.

## Model 2 (Work In Progress)

We need to fix the limitations of the first model. 
- We need vectors that have norm equal to one.
- We need to add a parameter for each normalizing variable. We will encode all of these variables as one-hot in the model.
- Further, we need to combine the score from the user/item context with the normalizing terms. We do that by adding a linear layer that takes the concatenated sub-results as input.

We have a dataset with the following covariates:  
$x_u \in \Re^{d_{in}}$ is a user vector of dimension $d_{in}$,  
$y_i \in \Re^{d_{in}}$ is an item vector of dimension $d_{in}$,  
$h \in \Re^{24}$ is a one-hot encoded categorical variable for the hour at which the event happened,  
$p \in \Re^{48}$ is a one-hot encoded categorical variable for the position at which the event happened,  
$w \in \Re^{7}$ is a one-hot encoded categorical variable for the weekday on which the event happened,  
$d \in \Re^{3}$ is a one-hot encoded categorical variable for the device on which the event happened,  
$C \in \{0,1\}$ is a binary response variable.

$$ h_u(x) = \frac{W_u x}{\|W_u x\|_2} \in \Re^{d_{out}}$$

$$h_i(y) = \frac{W_i y}{\|W_i y\|_2} \in \Re^{d_{out}}$$

for parameters $W_i, W_u \in \Re^{d_{out} \times d_{in}}$, so that both projected vectors have norm equal to one.

For all the normalizing factors:
$$h_p = W_p p \in \Re^{1}$$
$$h_h = W_h h \in \Re^{1}$$
$$h_w = W_w w \in \Re^{1}$$
$$h_d = W_d d \in \Re^{1}$$

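for parameters $W_h \in \Re^{1 \times 24}$, $W_p \in \Re^{1 \times 48}$, $W_w \in \Re^{1 \times 7}$, $W_d \in \Re^{1 \times 3}$. Because each input is one-hot, each product simply selects one learned scalar per category.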


Concatenating all the values gathered above, we get:
$$ H_1 = [h_u(x)^T h_i(y), h_p, h_h, h_w, h_d]^T$$

Applying a linear transform to create logits for click/no click:
$$ H_2 = W_1 H_1 \in \Re^{2}$$
for parameter $W_1 \in \Re^{2 \times 5}$.

Then we can define the probability of click as 

$$ \theta_{ui} = Pr(click \mid data, par) = Pr(C = 1 \mid x_u, y_i, p, h, w, d, W) = \sigma( H_2) $$

where $\sigma(x) = \frac{e^{x_1}}{e^{x_1} + e^{x_2}}$ is the first component of the softmax of a vector $x \in \Re^2$.
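The forward pass of Model 2 can be sketched end to end as below. This is an illustration under stated assumptions, not a reference implementation: it assumes the user and item projections are divided by their Euclidean norm so that they have unit length (the stated goal of Model 2), the parameter dictionary `params` and the helper names are hypothetical, and all weights are random toy values with $d_{in} = 4$ and $d_{out} = 2$.

```python
import numpy as np

def unit_normalize(v):
    """Scale v to Euclidean norm 1 (assumes v is not the zero vector)."""
    return v / np.linalg.norm(v)

def one_hot(index, size):
    """One-hot vector of the given size with a 1 at position index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def click_probability(params, x, y, p, h, w, d):
    """theta = first softmax component of H_2 = W_1 H_1."""
    h_u = unit_normalize(params["W_u"] @ x)
    h_i = unit_normalize(params["W_i"] @ y)
    # H_1 = [cosine-style score, position, hour, weekday, device] terms.
    H1 = np.array([
        h_u @ h_i,
        params["W_p"] @ p,
        params["W_h"] @ h,
        params["W_w"] @ w,
        params["W_d"] @ d,
    ])
    H2 = params["W_1"] @ H1
    e = np.exp(H2 - H2.max())  # subtract the max for numerical stability
    return e[0] / e.sum()

# Random toy parameters and inputs, for illustration only.
rng = np.random.default_rng(0)
d_in, d_out = 4, 2
params = {
    "W_u": rng.normal(size=(d_out, d_in)),
    "W_i": rng.normal(size=(d_out, d_in)),
    "W_p": rng.normal(size=48),
    "W_h": rng.normal(size=24),
    "W_w": rng.normal(size=7),
    "W_d": rng.normal(size=3),
    "W_1": rng.normal(size=(2, 5)),
}
theta = click_probability(
    params,
    x=rng.normal(size=d_in),
    y=rng.normal(size=d_in),
    p=one_hot(3, 48),
    h=one_hot(14, 24),
    w=one_hot(2, 7),
    d=one_hot(0, 3),
)
```

Since both projections have unit norm, the first entry of $H_1$ always lies in $[-1, 1]$, so the scale of the user/item score relative to the normalizing terms is set entirely by $W_1$.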


### Likelihood function
Let $W$ be the collection of all weights defined above. The likelihood of one observation is $ P(c_k | W) =  \theta_k^{c_k} (1-\theta_k) ^{1-c_k} $.
Assuming that each observation is independent, the log likelihood can be written as

$$ loglik = \sum_{k=1}^{n} \left[ c_k \ln(\theta_k) + (1-c_k)\ln(1-\theta_k) \right] $$

where $\theta_k = \theta_{u,i} = \sigma( H_2)$, evaluated on the covariates of observation $k$. Fully expanded, the equation is messy enough that I do not think writing it out would help.
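
One simplification that may help when working with this likelihood: a two-class softmax is a logistic function of the difference between the two logits. Writing $a^T$ and $b^T$ for the first and second rows of $W_1$, and $z_k$ for the logit difference of observation $k$ (these three names are introduced here for illustration only):

$$ \theta_k = \frac{e^{a^T H_1}}{e^{a^T H_1} + e^{b^T H_1}} = \frac{1}{1 + e^{-z_k}}, \qquad z_k = (a - b)^T H_1 $$

$$ loglik = -\sum_{k=1}^{n} \left[ c_k \ln(1 + e^{-z_k}) + (1-c_k) \ln(1 + e^{z_k}) \right] $$

Only the difference $a - b$ enters the likelihood, so the two rows of $W_1$ are not separately identified by the data alone.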


## Priors