## Review of IR Metrics

Given the predictions $\mathbf{y}=f(\mathbf{x})$ and the corresponding ground-truth
labels $\mathbf{y}^{*}$, we use $Y^{+}=\{i \mid y_{i}^{*}>0\}$ and $Y^{-}=\{j \mid y_{j}^{*}=0\}$ to denote the sets of relevant and non-relevant documents, respectively. We use $\mathbf{b}^{*}=\mathbb{I}\{\mathbf{y}^{*}>0\}$, with $b_{j}^{*}=\mathbb{I}\{y_{j}^{*}>0\}$,
to denote the binarized ground truth, and the cumulative sum over
$\mathbf{b}^{*}$ is given as $B_{k}^{*}=\sum_{j=1}^{k}b_{j}^{*}$. The scoring function $f$ induces
a ranking $\bar{\mathbf{y}}$.
The ground-truth labels arranged in the order of this ranking are denoted $\mathbf{y}^{**}$, so that $y_{j}^{**}$ is the label of the document at rank $j$. Furthermore, we denote the binarized ranked labels as $\mathbf{b}^{**}=\mathbb{I}\{\mathbf{y}^{**}>0\}$,
with $b_{j}^{**}=\mathbb{I}\{y_{j}^{**}>0\}$, and the cumulative
sum over $\mathbf{b}^{**}$ is given as $B_{k}^{**}=\sum_{j=1}^{k}b_{j}^{**}$.
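
The notation above can be made concrete with a minimal sketch. It assumes that the ranking is obtained by sorting documents in descending order of predicted score; the helper names (`rank_labels`, `binarize`, `cumulative_sum`) and the toy scores and labels are hypothetical and serve only as illustration.

```python
from itertools import accumulate


def rank_labels(scores, labels):
    """Return the ground-truth labels reordered by descending score (y**)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def binarize(labels):
    """Map graded labels to binary relevance (b**): 1 if label > 0, else 0."""
    return [1 if y > 0 else 0 for y in labels]


def cumulative_sum(values):
    """Running sum, so that element k-1 holds B_k."""
    return list(accumulate(values))


# Toy example: four documents with predicted scores and graded labels.
scores = [0.2, 0.9, 0.5, 0.1]
labels = [0, 2, 0, 1]

y_ranked = rank_labels(scores, labels)   # labels in ranked order
b_ranked = binarize(y_ranked)            # binarized ranked labels
B_ranked = cumulative_sum(b_ranked)      # cumulative count of relevant documents
```

Here the document with score 0.9 (label 2) is ranked first and the document with score 0.1 (label 1) is ranked last.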

To evaluate the effectiveness of a scoring function, a number of IR
metrics have been proposed that emphasize the items ranked
at higher positions. In general, IR metrics are computed
based on the ground-truth labels ordered by the ranking $\bar{\mathbf{y}}$ induced by $f$. Binary-relevance IR metrics, such as precision and AP, measure the performance of
a ranking model based on $\mathbf{b}^{**}$. Graded-relevance IR metrics, such as nDCG and ERR, measure the performance of a ranking model based on $\mathbf{y}^{**}$.

## Precision

Precision@k measures the proportion
of relevant documents among the top $k$ retrieved documents, and is defined as:
\begin{equation}
Pre@k=\frac{1}{k}\sum_{j=1}^{k}b_{j}^{**}
\end{equation}

where $k$ denotes the truncation position.
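
The definition translates directly into code. The following is a minimal sketch; the function name `precision_at_k` is hypothetical, and the input is assumed to be the binarized labels already arranged in ranked order.

```python
def precision_at_k(b_ranked, k):
    """Precision@k for binarized labels given in ranked order.

    If the list holds fewer than k documents, the missing positions
    are treated as non-relevant (the divisor stays k).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(b_ranked[:k]) / k


# Ranked binary labels: relevant documents at ranks 1 and 4.
b_ranked = [1, 0, 0, 1]
p_at_2 = precision_at_k(b_ranked, 2)
p_at_4 = precision_at_k(b_ranked, 4)
```

Note that Precision@4 is unchanged if the relevant document at rank 4 moves to rank 2, which is the insensitivity to position within the top $k$ that rank-sensitive metrics address.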

## Average Precision

Unlike Precision@k, which ignores the positions
at which documents are ranked within the top $k$, Average Precision (AP) is a **rank-sensitive**
metric, which builds upon Precision as follows:
\begin{equation}
AP=\frac{1}{|Y^{+}|}\sum_{j}b_{j}^{**}\times Pre@j
\end{equation}

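Here the sum runs over all rank positions $j$ of the ranked list, so only positions holding a relevant document ($b_{j}^{**}=1$) contribute. Mean Average Precision (MAP) is then defined
as the mean of the AP scores over a set of queries.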
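
A minimal sketch of AP and MAP follows. The function names are hypothetical. By default the number of relevant documents $|Y^{+}|$ is taken from the ranked list itself, which assumes that every relevant document for the query appears in the list; returning 0 for a query without relevant documents is likewise an assumption of this sketch.

```python
def average_precision(b_ranked, num_relevant=None):
    """AP for binarized labels given in ranked order."""
    if num_relevant is None:
        num_relevant = sum(b_ranked)  # assumes all relevant docs are listed
    if num_relevant == 0:
        return 0.0  # convention chosen for this sketch
    hits = 0
    total = 0.0
    for j, b in enumerate(b_ranked, start=1):
        if b:
            hits += 1
            total += hits / j  # Pre@j, counted only at relevant positions
    return total / num_relevant


def mean_average_precision(ranked_lists):
    """MAP: the mean of AP over a collection of queries."""
    return sum(average_precision(b) for b in ranked_lists) / len(ranked_lists)


ap = average_precision([1, 0, 0, 1])             # (1/1 + 2/4) / 2
map_score = mean_average_precision([[1, 0, 0, 1], [0, 1]])
```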

## Normalized Discounted Cumulative Gain (nDCG)

Normalized Discounted Cumulative Gain (nDCG) is a **graded-relevance
rank-sensitive** metric. The discounted cumulative
gain (DCG) of a ranked list is given as $DCG@k=\sum_{j=1}^{k}\frac{2^{y_{j}^{**}}-1}{\log_{2}(j+1)}$,
where $G_{j}=2^{y_{j}^{**}}-1$ is usually referred to as the gain
value of the document at rank $j$, and $\log_{2}(j+1)$ discounts that gain according to the rank position.
Let $DCG^{*}@k$ denote the maximum DCG@k value, which is attained by the ideal ranking.
Normalizing DCG with $DCG^{*}$ then gives nDCG as follows:

\begin{equation}
nDCG@k=\frac{DCG@k}{DCG^{*}@k}
\end{equation}
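
The computation can be sketched as below. The function names are hypothetical. The ideal ranking is obtained here by sorting the labels of the given list in descending order, which assumes the list contains all judged documents for the query; returning 0 when the ideal DCG is 0 is a convention of this sketch.

```python
import math


def dcg_at_k(y_ranked, k):
    """DCG@k with gain 2^y - 1 and discount log2(j + 1)."""
    return sum(
        (2 ** y - 1) / math.log2(j + 1)
        for j, y in enumerate(y_ranked[:k], start=1)
    )


def ndcg_at_k(y_ranked, k):
    """nDCG@k: DCG@k divided by the DCG@k of the ideal ranking."""
    ideal = sorted(y_ranked, reverse=True)  # labels in descending order
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents; convention of this sketch
    return dcg_at_k(y_ranked, k) / ideal_dcg


y_ranked = [2, 0, 0, 1]
dcg = dcg_at_k(y_ranked, 4)    # 3 + 1/log2(5)
ndcg = ndcg_at_k(y_ranked, 4)  # divided by the ideal 3 + 1/log2(3)
```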

## Expected Reciprocal Rank (ERR)

Expected Reciprocal Rank (ERR) is another popular graded-relevance rank-sensitive metric. Let $Pr(j)$ be the relevance probability of the document at rank
$j$. In accordance with the gain function for nDCG, the relevance
probability is commonly calculated as $Pr(j)=\frac{2^{y_{j}^{**}}-1}{2^{\max(\mathbf{y}^{**})}}$.
ERR interprets the relevance probability as the probability that the
user is satisfied with the document at a given rank position. Thus the probability
that the user is dissatisfied with all documents at ranks $1$
to $k$ is given as $Disp(1,k)=\prod_{i=1}^{k}(1-Pr(i))$, with $Disp(1,0)=1$ as the empty product. ERR is
then defined as

\begin{equation}
ERR@k=\sum_{j=1}^{k}\frac{Disp(1,j-1)\cdot Pr(j)}{j}
\end{equation}
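
As a small worked example (the labels are illustrative, not taken from any dataset), consider a ranked list with $\mathbf{y}^{**}=(2,0,1)$, so that $\max(\mathbf{y}^{**})=2$. The relevance probabilities are $Pr(1)=\frac{3}{4}$, $Pr(2)=0$ and $Pr(3)=\frac{1}{4}$. Taking the empty product $Disp(1,0)$ to be $1$, we get $Disp(1,1)=\frac{1}{4}$ and $Disp(1,2)=\frac{1}{4}$, and therefore:

\begin{equation}
ERR@3=\frac{1\cdot\frac{3}{4}}{1}+\frac{\frac{1}{4}\cdot 0}{2}+\frac{\frac{1}{4}\cdot\frac{1}{4}}{3}=\frac{3}{4}+\frac{1}{48}=\frac{37}{48}\approx 0.771
\end{equation}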


Let $ERR^{*}$ denote the maximum ERR value, which is attained by the ideal ranking. The normalized ERR is then given as follows:

\begin{equation}
nERR@k=\frac{ERR@k}{ERR^{*}@k}
\end{equation}
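
ERR and its normalized variant can be sketched as follows. The function names and the optional `max_grade` parameter are hypothetical. The sketch assumes that the ideal ranking is obtained by sorting the labels of the given list in descending order, and it returns 0 when the ideal ERR is 0.

```python
def err_at_k(y_ranked, k, max_grade=None):
    """ERR@k for graded labels given in ranked order."""
    if max_grade is None:
        max_grade = max(y_ranked, default=0)
    not_satisfied = 1.0  # Disp(1, j-1); the empty product for j = 1
    total = 0.0
    for j, y in enumerate(y_ranked[:k], start=1):
        pr = (2 ** y - 1) / 2 ** max_grade  # relevance probability Pr(j)
        total += not_satisfied * pr / j
        not_satisfied *= 1 - pr
    return total


def nerr_at_k(y_ranked, k):
    """nERR@k: ERR@k divided by the ERR@k of the ideal ranking."""
    max_grade = max(y_ranked, default=0)
    ideal = sorted(y_ranked, reverse=True)  # assumed ideal ordering
    ideal_err = err_at_k(ideal, k, max_grade)
    if ideal_err == 0:
        return 0.0  # no relevant documents; convention of this sketch
    return err_at_k(y_ranked, k, max_grade) / ideal_err


err = err_at_k([2, 0, 1], 3)    # 3/4 + 1/48
nerr = nerr_at_k([2, 0, 1], 3)  # divided by the ideal 3/4 + 1/32
```

Passing the same `max_grade` to both calls keeps the relevance probabilities of the actual and ideal rankings on the same scale.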