# Learning to Rank

## 1 Problem Statement

Let $Q = \{q_1, \dots, q_m\}$ be the set of queries,  
$D_i = \{d_{i,1}, \dots, d_{i,n_i}\}$ - the set of candidate documents for $q_i$,  
$y_i = \{y_{i,1}, \dots, y_{i,n_i}\}$ - the relevance labels for $D_i$,  
$x_{i,j} = \phi(q_i, d_{i,j})$ - the feature vector for query $q_i$ and document $d_{i,j}$,  
$x_i = \{x_{i,j}\}_{j=1}^{n_i}$ - the features for query $q_i$ and all documents in $D_i$,  
$f: \mathbb{R}^d \rightarrow \mathbb{R}$ - the ranking model,  
$s_{i,j} = f(x_{i,j})$ - the predicted ranking scores.  


We need to train a model that predicts the correct order of documents for each query, sorted from most relevant to least relevant.

## 2 Evaluation Metrics

### Precision@k

$$Pr@k = \frac {|\text{relevant} \cap \text{retrieved}_k|} {|\text{retrieved}_k|}$$
where $\text{retrieved}_k$ is the set of top-$k$ retrieved documents.

### Recall@k

$$Re@k = \frac {|\text{relevant} \cap \text{retrieved}_k|} {|\text{relevant}|}$$


### F1@k

$$F1@k = \frac {2 \cdot Pr@k \cdot Re@k} {Pr@k + Re@k}$$
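The three set-based metrics above can be computed together for a single query. The sketch below assumes a ranked list of document ids and a set of relevant ids (the function name and the example ids are hypothetical), and it computes F1 as the harmonic mean of precision and recall, i.e. with the conventional factor of 2.

```python
def precision_recall_f1_at_k(ranked_ids, relevant_ids, k):
    """Return (Pr@k, Re@k, F1@k) for one query."""
    top_k = ranked_ids[:k]          # retrieved_k: the top-k documents
    relevant = set(relevant_ids)
    hits = sum(1 for d in top_k if d in relevant)  # |relevant ∩ retrieved_k|
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical example: 5 retrieved documents, 3 relevant ones.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}
print(precision_recall_f1_at_k(ranked, relevant, k=3))
```

Note that recall is divided by the total number of relevant documents, so a relevant document that is never retrieved (here `d9`) caps the achievable recall.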


### MAP@k

$$AP@K = \sum_{k=1}^{K} (Re@k - Re@(k-1)) \cdot Pr@k, \quad Re@0 = 0 $$
$$MAP@K =  \frac {1} {|Q|} \sum_{q = 1}^{|Q|} AP@K(q)$$
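A minimal sketch of the recall-difference form of average precision: recall only changes at positions that hold a relevant document, where it grows by $1/|relevant|$, so the sum reduces to the precision values at the hit positions. Normalizing by the total number of relevant documents follows the formula above; other conventions (for example dividing by $\min(k, |relevant|)$) exist, and the function names here are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k as the sum of (recall increment) * (precision) over positions."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    ap = 0.0
    for pos, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant:
            hits += 1
            # Recall increases by 1/|relevant| here; precision at pos is hits/pos.
            ap += (hits / pos) / len(relevant)
    return ap


def mean_average_precision_at_k(rankings, relevants, k):
    """MAP@k: the mean of AP@k over all queries."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)


rankings = [["a", "b", "c", "d"], ["x", "y"]]
relevants = [{"a", "c"}, {"y"}]
print(mean_average_precision_at_k(rankings, relevants, k=4))
```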

### MRR@k

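$$MRR = \frac 1 {|Q|} \sum^{|Q|}_{i = 1} \frac 1 {rank_i}$$
where $rank_i$ is the position of the first relevant document for query $q_i$ (for MRR@k the term is 0 if no relevant document appears in the top $k$).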
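As a sketch, MRR only needs the position of the first relevant document per query. The function below is a hypothetical helper; it assumes that a query with no relevant document in the (optionally truncated) list contributes 0.

```python
def mean_reciprocal_rank(rankings, relevants, k=None):
    """MRR over queries; pass k to get MRR@k."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        relevant = set(relevant)
        top = ranked if k is None else ranked[:k]
        for pos, doc in enumerate(top, start=1):
            if doc in relevant:
                total += 1.0 / pos  # reciprocal rank of the first hit
                break               # later hits do not matter for MRR
    return total / len(rankings)


rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevants = [{"b"}, {"z"}]
print(mean_reciprocal_rank(rankings, relevants))
```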

### NDCG@k

$$ DCG@k = \sum_{i=1}^k \frac {2^{rel_i} - 1} {\log_2 (i + 1)} $$

where $rel_i$ is the relevance label of the document at position $i$.

$$ NDCG@k = \frac {DCG@k} {IdealDCG@k} $$
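A small sketch of DCG@k and NDCG@k with the exponential gain $2^{rel} - 1$ used above. The input is assumed to be the list of relevance labels in the order the model ranked the documents, the sum is truncated at $k$, and the ideal DCG is obtained by sorting the same labels in decreasing order.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@k for labels given in ranked order (position 1 first)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """NDCG@k: DCG@k divided by the DCG@k of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([3, 2, 1, 0], k=4))  # already ideally ordered
print(ndcg_at_k([0, 1, 2, 3], k=4))  # worst ordering of the same labels
```

Returning 0 for a query without any relevant document is a convention chosen here, not something the formula prescribes.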

### Kendall rank correlation coefficient

$$ \tau = \frac {(\# concordant\ pairs) - (\# discordant\ pairs)} {\binom {n} {2}} $$
$\tau \in [-1, 1]$
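The coefficient can be sketched by counting pairs directly. This hypothetical helper compares two score lists over the same documents; tied pairs are counted as neither concordant nor discordant (the tau-a variant), which is an assumption since the formula above does not discuss ties.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau between two scorings of the same n documents, O(n^2)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1   # both scorings order the pair the same way
        elif sign < 0:
            discordant += 1   # the scorings disagree on this pair
    n = len(scores_a)
    return (concordant - discordant) / (n * (n - 1) / 2)


print(kendall_tau([1, 2, 3], [1, 3, 2]))
```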

### Q

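* Why can't we optimize evaluation metrics directly?

As functions of the model scores, all of these metrics are piecewise constant (flat almost everywhere) and discontinuous where the order of documents changes, so their gradients carry no useful signal.

* Can we use a classifier to solve the ranking problem?

We can predict the probability of a document being top-1, but this is not very effective.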

## 3 Pointwise Models

We want to correctly predict the relevance score for each (query, document) pair independently.

$$  \sum_{q=1}^Q \sum_{j=1}^{n_q} L(f( {x}_{q,j}),\ y_{q,j}) \to \min $$

## 4 Pairwise Models

$$ \sum_{q=1}^Q \sum_{i, j:\ y_{q,i} \gt y_{q,j}} L(f(x_{q,i}) - f(x_{q,j})) \to \min $$

## RankNet

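Consider two documents $i$ and $j$ for a single query:

Let $P_{ij}$ be the model's predicted probability that document $i$ should be ranked above document $j$,  
$\bar P_{ij}$ - the ground-truth probability,  
$o_{ij} = f(x_i) - f(x_j)$.

The loss between the $i$-th and $j$-th documents for query $q$ is the cross-entropy: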
$$ L_{ij} = L(o_{ij}) = -\bar P_{ij} \log P_{ij} - (1 - \bar P_{ij}) \log(1 - P_{ij}) $$

$$ P_{ij} = \frac {e^{o_{ij}}} {1 + e^{o_{ij}}} $$

$$ L_{ij} = -\bar P_{ij} o_{ij} + \log(1 + e^{o_{ij}}) $$
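The simplified form $-\bar P_{ij} o_{ij} + \log(1 + e^{o_{ij}})$ can be evaluated without ever forming $P_{ij}$. The sketch below is a hypothetical helper for a single pair; it computes $\log(1 + e^{o})$ in a numerically stable way, which is an implementation choice rather than part of the derivation.

```python
import math


def ranknet_pair_loss(s_i, s_j, p_bar):
    """RankNet loss for one pair: -p_bar * o + log(1 + e^o), o = s_i - s_j."""
    o = s_i - s_j
    # Stable log(1 + e^o): max(o, 0) + log(1 + e^{-|o|}) avoids overflow.
    softplus = max(o, 0.0) + math.log1p(math.exp(-abs(o)))
    return -p_bar * o + softplus


# Document i is truly more relevant (p_bar = 1).
print(ranknet_pair_loss(2.0, 0.0, 1.0))   # correctly ordered pair: small loss
print(ranknet_pair_loss(-2.0, 0.0, 1.0))  # misordered pair: larger loss
```

For a badly misordered pair the loss is close to $|o_{ij}|$, which is the linear growth mentioned in the properties.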


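Properties:
* linear asymptotics: the loss grows only linearly in $|o_{ij}|$ for badly misordered pairs, which makes it robust to label noise
* reducing the number of pairwise errors does not always imply an increase in a ranking metric such as NDCG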

Q:
* Monotonic behavior of $P_{ij}$  
* What is the full expression for loss?
* Asymptotic complexity $O(n^2)$

## 5 Listwise Models

$$ \sum_{q=1}^Q L(\{f(x_{q,j})\}^{n_q}_{j=1},\ \{y_{q,j}\}^{n_q}_{j=1}) \to \min $$

## ListNet

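Consider a permutation $\pi = \{ \pi(1), \dots, \pi(n)\}$ of the $n$ documents of a query, where $\pi(j)$ is the document at position $j$.

Let $\phi$ be a strictly positive, monotonically increasing function, for example  
$\phi(s) = e^s$.

Probability of a permutation given the scores $s$: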
$$ P_s (\pi) = \prod^n_{j = 1} \frac {\phi(s_{\pi(j)})} {\sum^n_{k = j} \phi(s_{\pi(k)})} $$


For example, for $n = 3$ and $\pi = (1, 2, 3)$:

$$ P_s (\pi) = \prod^n_{j = 1} \frac {\phi(s_{\pi(j)})} {\sum^n_{k = j} \phi(s_{\pi(k)})} = \frac {\phi(s_1)} {\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac {\phi(s_2)} {\phi(s_2) + \phi(s_3)} \cdot \frac {\phi(s_3)} {\phi(s_3)} $$

Note: "The permutations sorted by the ranking function are most likely to occur."

These permutation probabilities form a probability distribution over the set $\Omega_n$ of all permutations of $n$ documents:
$$ \sum_{\pi \in \Omega_n} P_s (\pi) = 1 $$
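For a small list both claims can be checked by brute force. The sketch below assumes $\phi(s) = e^s$ and uses 0-based document indices; `permutation_probability` is a hypothetical helper and the scores are made up for illustration.

```python
import math
from itertools import permutations


def permutation_probability(scores, perm):
    """P_s(pi): product over positions j of phi(s_pi(j)) / sum_{k>=j} phi(s_pi(k))."""
    phi = [math.exp(s) for s in scores]
    prob = 1.0
    for j in range(len(perm)):
        # Normalize over the documents not yet placed (positions j..n-1).
        remaining = sum(phi[perm[k]] for k in range(j, len(perm)))
        prob *= phi[perm[j]] / remaining
    return prob


scores = [2.0, 1.0, 0.5]  # document 0 has the highest score
probs = {perm: permutation_probability(scores, perm)
         for perm in permutations(range(len(scores)))}

print(sum(probs.values()))           # total probability over all 3! permutations
print(max(probs, key=probs.get))     # most likely permutation
```

Enumerating all $n!$ permutations is only feasible for tiny $n$, which is the motivation for working with top-one probabilities instead.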


Top one probability

$$ P_s(j) = \sum_{\pi(1) = j,\ \pi \in \Omega_n} P_s(\pi)  =  \frac {\phi(s_j)} {\sum^n_{k = 1} \phi(s_k)} $$


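Loss function: the cross-entropy between the top-one distributions induced by the ground-truth scores $y^{(i)}$ and the predicted scores $z^{(i)}$ for query $q_i$

$$ L(y^{(i)}, z^{(i)}) = -\sum^n_{j = 1} P_{y^{(i)}}(j) \log(P_{z^{(i)}}(j)) $$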
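With $\phi(s) = e^s$ the top-one probability is a softmax over the scores of one query, so the loss is a cross-entropy between two softmax distributions. A minimal sketch under that assumption (function names are hypothetical; subtracting the maximum score is only for numerical stability):

```python
import math


def top_one_probabilities(scores):
    """P_s(j) = exp(s_j) / sum_k exp(s_k) for every document j."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(y, z):
    """Cross-entropy between top-one distributions of labels y and scores z."""
    p_y = top_one_probabilities(y)
    p_z = top_one_probabilities(z)
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))


y = [3.0, 1.0, 0.0]                       # ground-truth relevance scores
print(listnet_loss(y, [3.0, 1.0, 0.0]))   # predictions match the labels
print(listnet_loss(y, [0.0, 1.0, 3.0]))   # predictions reverse the order
```

The loss is minimized when the two distributions coincide, and its cost is linear in the list length rather than factorial.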

### Q:

* Properties of $P_s (\pi)$
* Properties of $P_s(j)$
* Can we use another loss function in this model? 
Yes, for example KL divergence.

## 6 LambdaRank

The original RankNet loss is rewritten with

$ \bar P_{ij} = \frac {1 + S_{ij}} 2 $,  
$ S_{ij} \in \{-1, 0, 1\}$ ($1$ if document $i$ is more relevant than $j$, $-1$ if less relevant, $0$ if equally relevant),  
$ P_{ij} = p(s_i > s_j) = \sigma (\alpha(s_i - s_j)) = \frac 1 {1 + e^{-\alpha(s_i - s_j)}}$

$$ L_{ij} = -\bar P_{ij} \log P_{ij} - (1 - \bar P_{ij}) \log(1 - P_{ij}) $$ 

$$ L_{ij} = \frac {1 + S_{ij}} 2 \log (1 + e^{-\alpha(s_i-s_j)}) - \frac {1 - S_{ij}} 2 (-\alpha(s_i-s_j) - \log (1 + e^{-\alpha(s_i-s_j)})) $$ 

$$ L_{ij} = \frac {1 - S_{ij}} 2 \alpha(s_i-s_j) + \log (1 + e^{-\alpha(s_i-s_j)}) $$ 

The loss is now symmetric: swapping $i$ and $j$ and flipping the sign of $S_{ij}$ leaves $L_{ij}$ unchanged.


$$ \dfrac{\partial L\left(s_{i}-s_{j}\right)}{\partial s_{i}}=-\dfrac{\partial L\left(s_{i}-s_{j}\right)}{\partial s_{j}}=\alpha\left(\frac {1 - S_{ij}} 2-\dfrac{1}{1+e^{\alpha\left(s_{i}-s_{j}\right)}}\right) = \lambda_{ij}  $$

$$ \dfrac{\partial L}{\partial w}=\sum_{(i,j) \in D} \dfrac{\partial L}{\partial s_{i}} \dfrac{\partial s_{i}}{\partial w}+ \dfrac{\partial L}{\partial s_{j}} \dfrac{\partial s_{j}}{\partial w}
= \sum_{(i,j) \in D} \lambda_{ij} (\dfrac{\partial s_{i}}{\partial w}-\dfrac{\partial s_{j}}{\partial w}) $$

This can be rewritten more efficiently by collecting the terms for each document:

$$ \dfrac{\partial L}{\partial w} = \sum_{i} \dfrac{\partial s_{i}}{\partial w} (\sum_{j:(i,j) \in D} \lambda_{ij} - \sum_{j:(j,i) \in D} \lambda_{ji}) $$

Let $$\lambda_{i}=\sum_{j:(i, j) \in D} \lambda_{i j}-\sum_{j:(j, i) \in D} \lambda_{j i} $$

$$ \dfrac{\partial L}{\partial w} = \sum_{i} \dfrac{\partial s_{i}}{\partial w} \lambda_i $$

$\lambda_i$ can be loosely interpreted as a "force", which moves the document up or down in the list.
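The accumulation of pairwise $\lambda_{ij}$ into per-document $\lambda_i$ can be sketched as follows. The code assumes that $D$ contains the ordered pairs $(i, j)$ in which document $i$ has a higher label than document $j$ (so $S_{ij} = 1$); the function names are hypothetical.

```python
import math


def pairwise_lambda(s_i, s_j, S_ij, alpha=1.0):
    """lambda_ij = dL_ij/ds_i = alpha * ((1 - S_ij)/2 - 1/(1 + e^{alpha (s_i - s_j)}))."""
    return alpha * ((1 - S_ij) / 2 - 1 / (1 + math.exp(alpha * (s_i - s_j))))


def accumulate_lambdas(scores, labels, alpha=1.0):
    """lambda_i: sum of lambda_ij over pairs (i, j) in D minus lambda_ji over pairs (j, i) in D."""
    n = len(scores)
    lambdas = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:          # (i, j) in D, hence S_ij = 1
                lam = pairwise_lambda(scores[i], scores[j], 1, alpha)
                lambdas[i] += lam              # document i is the first element
                lambdas[j] -= lam              # document j is the second element
    return lambdas


# Two documents with equal scores; the first one is more relevant.
print(accumulate_lambdas([0.0, 0.0], [1, 0]))
```

Since $\lambda_i$ is a gradient of the loss, a gradient descent step $s_i \leftarrow s_i - \eta \lambda_i$ pushes the more relevant document up and the less relevant one down, and the forces within a query sum to zero.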

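Choice of $\lambda$

LambdaRank scales each pairwise gradient by the change in NDCG obtained by swapping the two documents in the current ranking. With $r_i$ the current rank position of document $i$:

$$ DCG = \sum_{i} \dfrac{2^{\mathrm{rel}_{i}}-1}{\log _{2}(r_i+1)} $$

For a pair with $S_{ij} = 1$:

$$ \lambda'_{ij}=\lambda_{ij}|\Delta NDCG_{ij}|= -\dfrac{\alpha}{1+e^{\alpha(s_{i}-s_{j})}} \cdot \dfrac{1}{IdealDCG} \left| \left(2^{\mathrm{rel}_{i}}-2^{\mathrm{rel}_{j}}\right)\left(\dfrac{1}{\log_2 (r_i+1)}-\dfrac{1}{\log_2 (r_j+1)}\right) \right| $$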
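A sketch of how the NDCG-weighted lambdas could be computed for one query. Several conventions here are assumptions rather than something stated above: rank positions are taken from the current scores, $|\Delta NDCG_{ij}|$ is the absolute DCG change from swapping the two positions divided by the ideal DCG, only pairs with $S_{ij} = 1$ are visited, and the sign and the factor $\alpha$ follow the RankNet gradient $\lambda_{ij}$.

```python
import math


def lambdarank_lambdas(scores, rels, alpha=1.0):
    """Per-document lambdas: RankNet gradients scaled by |delta NDCG| of each swap."""
    n = len(scores)
    # Rank positions (1-based) induced by the current model scores.
    order = sorted(range(n), key=lambda d: scores[d], reverse=True)
    rank = {doc: pos for pos, doc in enumerate(order, start=1)}
    # Ideal DCG: labels sorted in decreasing order.
    ideal = sorted(rels, reverse=True)
    idcg = sum((2 ** r - 1) / math.log2(p + 1) for p, r in enumerate(ideal, start=1))
    lambdas = [0.0] * n
    if idcg == 0:
        return lambdas                      # no relevant documents, no signal
    for i in range(n):
        for j in range(n):
            if rels[i] > rels[j]:           # pair with S_ij = 1
                gain_diff = 2 ** rels[i] - 2 ** rels[j]
                disc_diff = 1 / math.log2(rank[i] + 1) - 1 / math.log2(rank[j] + 1)
                delta_ndcg = abs(gain_diff * disc_diff) / idcg
                lam = -alpha / (1 + math.exp(alpha * (scores[i] - scores[j]))) * delta_ndcg
                lambdas[i] += lam
                lambdas[j] -= lam
    return lambdas


# The relevant document (index 0) is currently ranked below the irrelevant one.
print(lambdarank_lambdas([0.1, 0.9], [1, 0]))
```

Swaps near the top of the list change NDCG the most, so pairs involving top positions receive the largest forces.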

## LambdaMART

Gradient Boosting

$$ F_{N}(x)=\sum_{i=1}^{N} \alpha_{i} f_{i}(x) $$

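At step $m$ the new tree is fit to the negative gradient of the loss (the pseudo-residuals):

$$ \bar{y}_{i}=-\left[\dfrac{\partial L\left(y_{i}, F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F(x)=F_{m-1}(x)} $$

This is very similar to the LambdaRank gradients, which LambdaMART plugs in (up to sign) as these gradients:

$$ \lambda_{i} = \sum_{j \in P_{i}} \dfrac{\partial L\left(s_{i}, s_{j}\right)}{\partial s_{i}} $$

where $P_i$ is the set of documents paired with document $i$.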