# Overview
I wanted to understand the various ways we can test LLMs.

# LLM Use Cases 

LLMs have a number of use cases, which largely fall into the following categories:
- Translation
- Sequence Generation
    - Summarization
    - Prompt/Response

# Test Objectives


The intuition for evaluating generated text is the same as that for evaluating labels. 

## Accuracy

## Precision

## Recall

## Fluency

> Fluency measures how well the model's responses are structured, grammatically correct, and linguistically coherent. It assesses the model's ability to generate smooth and natural-sounding language. To measure Fluency: Fluency is measured by the perplexity metric. Perplexity = normalized inverse probability of the test set normalized by number of words.
>
> [source](https://www.linkedin.com/pulse/key-metrics-consider-llm-based-ai-products-ganeshwaran-jayachandran/)
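
To make the perplexity definition quoted above concrete, here is a minimal sketch. It assumes we already have the probability the model assigned to each token of a test sequence; the function name `perplexity` and the probability lists are made up for illustration. Perplexity is the inverse probability of the sequence, normalized by the number of tokens, which is the same as exponentiating the average negative log-probability.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the probability assigned to each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities from two models on the same text.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident_model))  # lower perplexity: less "surprised" by the text
print(perplexity(uncertain_model))  # higher perplexity
```

A lower value means the model found the text more predictable, which is why it is used as a rough proxy for fluency.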

## Engagement and Interactivity

> This metric evaluates the model's ability to engage users in a conversation and promote interactivity. It examines whether the model asks relevant follow-up questions, seeks clarification when needed, and maintains an engaging dialogue flow. To measure Engagement: Well known usage metrics obtained through surveys or by other means can be used (e.x Avg number of queries, Avg size of queries, Response feedback rating, Avg session length, etc).
>
> [source](https://www.linkedin.com/pulse/key-metrics-consider-llm-based-ai-products-ganeshwaran-jayachandran/)

# Algorithms

# Timeline & Source Materials

## Reeder (2001) - Human Evaluation Techniques

In 2001, Reeder [published](https://aclanthology.org/www.mt-archive.info/HLT-2001-Reeder.pdf) *Is That Your Final Answer?*.

The paper describes an experiment to determine whether people can distinguish machine translations from human translations. Ultimately, this finding is likely intended to feed into research on why that is the case.

> The purpose of this research is to test the efficacy of applying automated evaluation techniques ... to the output of machine translation (MT) systems. 
> ...
> Subjects were given a set of up to six extracts of translated newswire text. Some of the extracts were expert human translations, others were machine translation outputs. The subjects were given three minutes per extract to determine whether they believed the sample output to be an expert human translation or a machine translation. Additionally, they were asked to mark the word at which they made this decision.

According to Papineni et al. (2002), Reeder (2001) provides "A comprehensive catalog of MT evaluation techniques". Reading through the paper, there is a discussion of the various techniques used to assess the proficiency of a human translator. One notable technique involves counting the average number of words in a translation to determine whether it was produced by a human or a machine.


## Reeder (2001) - list of SMT evaluation metrics

I cannot find this paper, but according to the papers that reference it, Reeder catalogues the then-current state of statistical machine translation evaluation techniques.

Additional mt-eval references.
Technical report, International Standards for, 2001.

## IBM (2001) - Evaluation Understudy

> the July 2001 TIDES PI meeting in
Philadelphia, IBM described an automatic MT evaluation technique
that can provide immediate feedback and guidance in MT research.
Their idea, which they call an "evaluation understudy", compares
MT output with expert reference translations in terms of the
statistics of short sequences of words (word N-grams). The more
of these N-grams that a translation shares with the reference
translations, the better the translation is judged to be. The idea is
elegant in its simplicity. But far more important, IBM showed a
strong correlation between these automatically generated scores
and human judgments of translation quality. As a result, DARPA
commissioned NIST to develop an MT evaluation facility based on
the IBM work. This utility is now available from NIST and serves
as the primary evaluation measure for TIDES MT research.
>
> Doddington (2002) 

## DoD/DARPA (2001) - MT Evaluation Series Launches

The U.S. Department of Defense originally led an effort to centralize and push forward machine translation research with the Machine Translation (MT) evaluation series.

> The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction) program. 

## Papineni et al. (2002) - BLEU - Automated Evaluation

In July 2002, Papineni et al., of the IBM T. J. Watson Research Center, [published](https://aclanthology.org/P02-1040.pdf) *BLEU: a Method for Automatic Evaluation of Machine Translation*.

The paper is framed in the context of machine translation and proposes an automated method for evaluating the quality of a machine translation. Interestingly, they refer to the method as an "understudy", implying that it can fill in for a human when needed.

> We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.

They describe the philosophy of the approach as follows:

> How does one measure translation performance? The closer a machine translation is to a professional human translation, the better it is. This is the central idea behind our proposal. To judge the quality of a machine translation, one measures its closeness to one or more reference human translations according to a numerical metric. Thus, our MT evaluation system requires two ingredients:
>
> 1. a numerical “translation closeness” metric
> 2. a corpus of good quality human reference translations

The authors also repeatedly note that they believe a source sentence can have multiple good translations, so their algorithm is designed to compare a given translation against a set of reference translations.

They also note that the method can be thought of as an extension of the methods used in speech recognition:

> We fashion our closeness metric after the highly successful word error rate metric used by the speech recognition community, appropriately modified for multiple reference translations and allowing for legitimate differences in word choice and word order

The paper assumes familiarity with n-grams. In BLEU, an n-gram is a sequence of n consecutive tokens (words) in a text. For example, the sentence "the brown fox" contains the bigrams (n=2) "the brown" and "brown fox", and the single trigram (n=3) "the brown fox".
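
A minimal sketch of n-gram extraction, to make the idea concrete. The helper name `ngrams` and the whitespace tokenization are my own simplifications; BLEU works on word tokens, but the same function yields character n-grams if it is given a list of characters instead.

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Word-level n-grams, the unit BLEU counts.
words = "the brown fox".split()
print(ngrams(words, 2))  # word bigrams
print(ngrams(words, 3))  # the single word trigram

# Character-level n-grams: same function, different tokens.
chars = list("the brown fox")
print(ngrams(chars, 3)[:3])
```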

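The paper then describes how it computes a modified precision over n-grams, clipping each candidate n-gram's count at the maximum number of times it appears in any single reference. It adds a brevity penalty, which penalizes translations shorter than the reference (overly long translations are already penalized by the precision term). The paper also explains why it does not use recall directly: with multiple references, a candidate that recalls word choices from all of the references at once is not a better translation.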
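
The pieces of BLEU can be sketched in a few lines. This is a simplified, single-segment version under stated assumptions: the paper computes BLEU over a whole test corpus, there is no smoothing here (any zero precision gives a score of 0), and ties in the closest reference length are broken toward the shorter reference. All function names are my own.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    # For each n-gram, the most times it occurs in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def brevity_penalty(candidate, references):
    """1 if the candidate is longer than the closest reference, else exp(1 - r/c)."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter one (an assumption).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of the modified precisions."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
# The paper's degenerate example: "the" is clipped to 2 occurrences, giving 2/7.
print(modified_precision("the the the the the the the".split(), refs, 1))
print(bleu("the cat is on the mat".split(), refs))
```

The clipping step is what stops a candidate from scoring well by repeating a common word, and the brevity penalty is what stops it from scoring well by emitting only a few safe words.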

> The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. It is important to note that the more reference translations per sentence there are, the higher the score is. Thus, one must be cautious making even “rough” comparisons on evaluations with different numbers of reference translations: on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references

A concise summary of BLEU is also provided by Przybocki et al. (2010) and Wołk and Marasek (2015):

> BLEU (1) is a precision-based metric that counts the number of n-grams (sequences of n consecutive
tokens) that a candidate translation and a corresponding reference translation have in common. The
different precision scores (one per n-gram length) are combined using the geometric mean. Once the
overall precision score is computed, a brevity penalty is computed over the entire corpus. The purpose
of this brevity penalty is to penalize candidate translations that are shorter (overall) than the reference
translations.

Looking back, Przybocki et al. (2010) note the monumental impact of BLEU:

> It is not inconceivable to claim that IBM’s introduction of BLEU (1) in 2001 has had a greater impact on
the advancement of statistical machine translation (MT) technology than any other single contribution
to the field over the succeeding five years. BLEU was the first automated, and more importantly
repeatable, metric to demonstrate general correlation with human judgments of translation quality (1),
(2). As such, BLEU provided a means for instituting large-scale MT technology evaluations. As the
popularity of these evaluations grew, BLEU quickly became the de facto standard metric for MT
evaluation.

## Doddington (2002) - NIST Score

In 2002, Doddington, from NIST, [published](https://aclanthology.org/www.mt-archive.info/HLT-2002-Doddington.pdf) *Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics*.

> The NIST score (Doddington 2002) was the official metric in early DARPA TIDES MT evaluations. It is based on
information weighted n-gram co-occurrences. Some of the differences between BLEU and the NIST
score include the method of co-occurrence measures (arithmetic mean replacing geometric mean), a
modified brevity penalty, and a modified weighting of n-grams, depending on the frequency of specific
n-grams.
>
> Przybocki et al. (2010)

The NIST score, like BLEU, is based on n-gram scoring. However, it makes several enhancements, which are discussed in the paper.


> The NIST metric was designed to improve BLEU by
rewarding the translation of infrequently used words.
This was intended to further prevent inflation of SMT
evaluation scores by focusing on common words and
high confidence translations. As a result, the NIST metric
uses heavier weights for rarer words. The final NIST
score is calculated using the arithmetic mean of the n-gram matches between SMT and reference translations.
In addition, a smaller brevity penalty is used for smaller
variations in phrase lengths. The reliability and quality of
the NIST metric has been shown to be superior to the
BLEU metric. 
>
> Wołk and Marasek (2015)
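
The "heavier weights for rarer words" idea can be sketched as follows. As I understand Doddington's formulation, the information weight of an n-gram is the base-2 log of how often its (n-1)-word prefix occurs in the reference set divided by how often the full n-gram occurs; for unigrams, the numerator is the total number of reference words. The function name `info_weights` and the toy reference are my own, and this shows only the weights, not the full NIST score.

```python
import math
from collections import Counter

def info_weights(references, max_n=2):
    """Information weight per n-gram, estimated from the reference translations."""
    counts = Counter()
    total_words = 0
    for ref in references:
        total_words += len(ref)
        for n in range(1, max_n + 1):
            for i in range(len(ref) - n + 1):
                counts[tuple(ref[i:i + n])] += 1
    weights = {}
    for gram, count in counts.items():
        # Unigrams are weighed against the corpus size, longer n-grams against their prefix.
        prefix_count = total_words if len(gram) == 1 else counts[gram[:-1]]
        weights[gram] = math.log2(prefix_count / count)
    return weights

weights = info_weights(["the cat sat on the mat".split()])
print(weights[("the",)])        # frequent word, lower weight
print(weights[("cat",)])        # rarer word, higher weight
print(weights[("the", "cat")])  # "the" is followed by "cat" half the time
```

An n-gram that is fully predictable from its prefix gets a weight of zero, so matching it adds nothing to the score.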

Ultimately, the paper discusses the performance of the NIST score compared to the original BLEU. It looks at the correlation between BLEU scores and human judge scores. It then looks at a set of possible n-gram scoring variations and evaluates the F-scores of those variations. Based on the results, NIST selected the best method as the official NIST formula, given in equation 3. Additionally, NIST makes a change to the brevity penalty.

> This change
was made to minimize the impact on the score of small variations
in the length of a translation. This preserves the original motivation
of including a brevity penalty (which is to help prevent gaming the
evaluation measure) while reducing the contributions of length
variations to the score for small variations. 
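
A small comparison of the two brevity penalties, as I understand them. The NIST penalty is `exp(beta * log(min(L_sys / L_ref, 1))^2)`, with `beta` chosen so that the penalty is 0.5 when the system output is two thirds the length of the average reference; BLEU uses `exp(1 - r/c)` for candidates no longer than the reference. The function names are my own, and the lengths are made-up values.

```python
import math

# beta makes the NIST penalty exactly 0.5 at a length ratio of 2/3 (it is negative).
BETA = math.log(0.5) / math.log(2 / 3) ** 2

def nist_brevity_penalty(sys_len, avg_ref_len):
    """NIST-style penalty: flat near a ratio of 1, steep for very short output."""
    ratio = min(sys_len / avg_ref_len, 1.0)
    return math.exp(BETA * math.log(ratio) ** 2)

def bleu_brevity_penalty(sys_len, ref_len):
    """BLEU-style penalty for comparison."""
    return 1.0 if sys_len > ref_len else math.exp(1 - ref_len / sys_len)

for sys_len in (100, 95, 90, 67, 50):
    print(sys_len, nist_brevity_penalty(sys_len, 100), bleu_brevity_penalty(sys_len, 100))
```

For an output that is only slightly short (95 words against 100), the NIST penalty stays much closer to 1 than BLEU's, which matches the stated goal of reducing the effect of small length variations.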


> The NIST evaluation score is compared with IBM's original BLEU score in Figure 5 and Figure 6. Figure 5 demonstrates that the NIST score provides significant improvement in score stability and reliability for all four of the corpora studied. Figure 6 demonstrates that, for human judgments of Adequacy, the NIST score correlates better than the BLEU score on all of the corpora. For Fluency judgments, however, the NIST score correlates better than the BLEU score only on the Chinese corpus. This may be a mere random statistical difference between corpora. Or alternatively, this may be a consequence of different human judgment criteria or procedures. (The Chinese-to-English translations were judged at LDC using a different procedure than that used by John White at PRC for the 1994 corpora.)
>
> Footnote 7: Large amounts of data are required to estimate N-gram statistics for N > 2. In the current implementation, however, the N-gram statistics are computed only from the reference translations for the evaluation corpus.

<table>
    <tbody>
        <tr>
            <td>
                <img src="images/NIST_score_graph_1.png" >
            </td>
            <td>
                <img src="images/NIST_score_graph_2.png" >
            </td>
        </tr>
    </tbody>
</table>

## NIST (2006) - OpenMT Takes Over MT Evaluation Series

The National Institute of Standards and Technology (NIST) is an agency of the United States Department of Commerce whose mission is to promote American innovation and industrial competitiveness. According to their website, they facilitate a series of experiments called "challenge events", which test and observe various machine translation evaluation techniques.

The MT evaluation series was handed over to NIST around 2006:

> Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT.
>
> [NIST](https://catalog.ldc.upenn.edu/LDC2010T21)

According to their website:

> The objective of the NIST Open Machine Translation (OpenMT) evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies - technologies that translate text between human languages.

The source data, reference translations, and scoring software used in the NIST OpenMT evaluations are cataloged and made available by the Linguistic Data Consortium (LDC). The NIST 2008 Open Machine Translation (OpenMT) Evaluation, for example, is available [here](https://catalog.ldc.upenn.edu/LDC2010T21).

## NIST (2008) - NIST Launches Sister Evaluation Series re. Metrology

Around 2008, NIST launched MetricsMaTr, a complementary MT evaluation series that dealt with metrology rather than the MT technology itself.



> NIST coordinates Metrics for Machine Translation Evaluation (MetricsMaTr), a series of research challenge events for machine translation (MT) metrology.
>
> [MetricsMaTr](https://www.nist.gov/itl/iad/mig/metrics-machine-translation-evaluation)

> Metrology is the science of measurement and its application. NIST's work in metrology focuses on advancing measurement science to enhance economic security and improve quality of life. Almost all of NIST's research has a metrology component to it.
>
> [NIST > Metrology](https://www.nist.gov/metrology)


In 2008, a MetricsMaTr challenge was conducted in which various automated metric scores were compared with those given by human judges. The write-up offers a general discussion of the usefulness of automated metrics and puts forward suggestions for improving future evaluations of MT metrics.

This challenge is documented in detail by [Przybocki et al. (2010)](https://www.nist.gov/publications/nist-2008-metrics-machine-translation-challenge-overview-methodology-metrics-and), and NIST provides a high-level [summary](https://www.nist.gov/itl/iad/mig/metrics-machine-translation-evaluation) of the challenge results.

## Przybocki et al. (2010) - Metrics For Machine Translation Challenge

In March 2010, Przybocki et al. [published](https://www.nist.gov/publications/nist-2008-metrics-machine-translation-challenge-overview-methodology-metrics-and) the paper titled *The NIST 2008 Metrics for Machine Translation Challenge - Overview, Methodology, Metrics, and Results*, describing the 2008 MetricsMaTr challenge results and recommendations for future research.

In section 3, the paper notes that 39 metrics were evaluated. Seven were preexisting baseline metrics, that is, metrics that have been used prominently in past evaluations. The remaining 32 were submitted by the participants.

The baseline metrics were as follows:

> 3.1. Baseline metrics
>
> 3.1.1. Variants of the BLEU metric
>
> MetricsMATR evaluated four baseline variants of the BLEU metric using case-sensitive scoring:
> - BLEU-1: (IBM version 1.04) limited to unigram precision
> - BLEU-4: (IBM version 1.04) precision scores for n-grams of size between 1 and 4 tokens
> - BLEU-v11b: (NIST mteval-v11b) similar to BLEU-4, but with a modified brevity penalty
> - BLEU-v12: (NIST mteval-v12) similar to BLEU-v11b, but with a modified tokenization scheme
>
> 3.1.2. NIST score
>
> NIST-v11b: (NIST mteval-v11b) scores case-sensitive n-grams of size varying between 1 and 5
>
> 3.1.3. TER
>
> TER (16) is a measure of edit distance which captures the number of edits required to make a candidate
translation identical to a reference translation, counting block moves as a single error. Scoring was case
sensitive and uses similar text normalization as the variants of BLEU.
>
> TER-v0.7.25: (BBN/UMD version 0.7.25) TERCOM scoring software
>
> 3.1.4. METEOR
>
> METEOR-v0.6: (CMU version 0.66) modules used: exact, porter, wn_stem and wn_synonymy
>
> 
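
To get a feel for the edit-distance idea behind TER, here is a deliberately simplified sketch. It counts only word insertions, deletions, and substitutions against a single reference; real TER also allows block moves counted as a single edit, which this version omits. The function names are my own.

```python
def word_edit_distance(candidate, reference):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(reference) + 1))
    for i, c_tok in enumerate(candidate, start=1):
        curr = [i]
        for j, r_tok in enumerate(reference, start=1):
            cost = 0 if c_tok == r_tok else 1
            curr.append(min(prev[j] + 1,          # delete a candidate word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def simplified_ter(candidate, reference):
    """Edits needed per reference word; lower is better. No block moves."""
    return word_edit_distance(candidate, reference) / len(reference)

reference = "the cat sat on the mat".split()
print(simplified_ter("the cat sat on mat".split(), reference))  # one missing word
print(simplified_ter(reference, reference))                      # identical
```

Unlike BLEU and NIST, this is an error rate, so a perfect translation scores 0 rather than 1.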



The submitted metrics were as follows:

> | Affiliation | Metric name(s) |
> |-------------|----------------|
> | BabbleQuest | Badger, BadgerLite |
> | Carnegie Mellon University | METEOR-v0.7, METEOR-ranking, mBLEU, mTER |
> | City University of Hong Kong, Department of Chinese, Translation and Linguistics | ATEC1, ATEC2, ATEC3, ATEC4 |
> | Columbia University | SEPIA1, SEPIA2 |
> | Harbin Institute of Technology, School of Computer Science and Technology | SVM-Rank, SNR, LET |
> | National University of Singapore | MaxSim |
> | RWTH Aachen University | BleuSP, invWer, CDer |
> | Stanford University | RTE, RTE-MT |
> | University of Maryland / BBN Technologies | TERp |
> | Universitat Politècnica de Catalunya, LSI | ULCh, ULCopt, DP-Or, SR-Or, DR-Or, DP-Orp |
> | USC, Information Sciences Institute (Team 1) | BEwT-E |
> | USC, Information Sciences Institute (Team 2) | Bleu-sbp, 4-GRR |
> | University of Washington | EDPM |


The paper uses Spearman's rho statistic to show the degree of agreement between the automated metric scores and human judgments at three levels: segment, document, and system.

> We report correlations for the Spearman’s rho statistic (Spearman’s correlation coefficient for ranked
data) as our primary correlation measure. Although we found that the correlations statistics for
Spearman’s Rho, Kendall’s Tau, and Pearson’s R to closely track each other, Spearman’s provides the
benefit of not showing sensitivity to outliers (as does Pearson’s R), and being based on ranks,
Spearman’s does not assume samples from a bivariate normal distribution (21).
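
A minimal sketch of Spearman's rho, to show why it is less sensitive to outliers than Pearson's r: it is simply Pearson's correlation computed on ranks. The helper names and the metric and human scores below are made up for illustration; tied values get the average of their ranks.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson's r; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for four systems; one human score is an extreme outlier.
metric_scores = [0.31, 0.25, 0.40, 0.28]
human_scores = [3.5, 2.0, 45.0, 3.0]
print(spearman_rho(metric_scores, human_scores))  # ranking agrees fully
print(pearson(metric_scores, human_scores))       # pulled down by the outlier
```

In practice, `scipy.stats.spearmanr` computes the same statistic; the hand-rolled version is only here to show the mechanics.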

Looking at the results table in section 5, METEOR is consistently in the top two at each level and is the only metric to be in the top two at all levels.

## NIST (2010) - MetricsMaTr 2010 Challenge

NIST published an [evaluation plan](https://www.nist.gov/system/files/documents/itl/iad/mig/NISTMetricsMaTr10EvalPlan.pdf) for the MetricsMaTr 2010 challenge.

## Przybocki et al. (2010) - MetricsMaTr 2010 Challenge Results

In July 2010, Przybocki et al. [published](https://www.nist.gov/publications/findings-2010-joint-workshop-statistical-machine-translation-and-metrics-machine) a paper outlining the challenge results. The abstract reads:

> This paper presents the results of the WMT10 and MetricsMATR10 shared tasks, which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 104 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 26 metrics. This year we also investigated increasing the number of human judgments by hiring non-expert annotators through Amazon's Mechanical Turk.


An interesting note from the authors is that, for some language pairs, rule-based machine translation (RBMT) systems tend to outperform statistical ones. Recall that RBMT systems rely on hand-coded, language-specific rules to perform translation.

> Unfortunately, fewer rule-based systems participated in this year’s edition of WMT, compared
to previous editions. We hope to attract more
rule-based systems in future editions as they increase the variation of translation output and for
some language pairs, such as German-English,
tend to outperform statistical machine translation
systems.

Looking at the system-level results of the study, we see that there is no clear, consistent winner. Instead, a number of metrics compete closely.


<table>
    <tbody>
        <tr>
            <td>
                <img src="images/metricsmatr_2010_system_results.png">
            </td>
            <td>
                <img src="images/metricsmatr_2010_system_results_2.png">
            </td>
        </tr>
    </tbody>
</table>

At the segment level, SVM-rank outperforms the other metrics in every category, with Bkars a close second.

<table>
    <tbody>
        <tr>
            <td>
                <img src="images/metricsmatr_2010_segment_results.png">
            </td>
            <td>
                <img src="images/metricsmatr_2010_segment_results_2.png">
            </td>
        </tr>
    </tbody>
</table>

## Wołk and Marasek (2015) - Enhanced BLEU

In September 2015, Wołk and Marasek [published](https://arxiv.org/abs/1509.09088) a paper titled *Enhanced Bilingual Evaluation Understudy*.

Metrics mentioned by Wołk and Marasek (2015):
- National Institute of Standards and Technology (NIST) metric
- Translation Error Rate (TER)
- Metric for Evaluation of Translation with Explicit Ordering (METEOR)
- Length Penalty, Precision, n-gram Position difference Penalty and Recall (LEPOR)
- Rank-based Intuitive Bilingual Evaluation Score (RIBES)