# Model evaluation

In the paper, we consider a classification problem in which each question $x$ is concatenated with a candidate answer $y$ to form an input sequence.
The generative model scores each concatenated question-answer pair, and the prediction $\hat{y}$ is the most probable answer among the provided choices $Y$ for a given $x$:
\begin{align*}
\hat{y} = \underset{y \in Y}{\arg\max}\; p_{\text{LM}}(y|x).
\end{align*}
Here, the probability of the token sequence $y$ is the product of the probabilities of its individual tokens $y_{[i]}$, each conditioned on $x$ and the preceding tokens $y_{[1:i-1]}$:
\begin{align*}
p_{\text{LM}}(y|x) = \prod_{i=1}^{|y|} p_{\text{LM}}(y_{[i]}|x, y_{[1:i-1]}),
\end{align*}
where $|y|$ is the number of tokens composing the answer $y$.
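
A minimal sketch of the selection rule above, assuming the per-token log-probabilities of each candidate answer have already been obtained from the model. The function names and the toy numbers are illustrative and are not part of the evaluation framework. Summing log-probabilities is equivalent to taking the product in the equation above and avoids numerical underflow for long answers.

```python
import math
from typing import Dict, List


def sequence_log_prob(token_log_probs: List[float]) -> float:
    """log p_LM(y|x): the sum of the per-token log-probabilities of y given x."""
    return sum(token_log_probs)


def predict_answer(candidates: Dict[str, List[float]]) -> str:
    """Return the candidate y with the highest log p_LM(y|x)."""
    return max(candidates, key=lambda y: sequence_log_prob(candidates[y]))


# Toy per-token log-probabilities (made up for illustration).
candidates = {
    "answer A": [math.log(0.5), math.log(0.4)],  # p = 0.20
    "answer B": [math.log(0.9), math.log(0.3)],  # p = 0.27
    "answer C": [math.log(0.1)],                 # p = 0.10
}

y_hat = predict_answer(candidates)
confidence = math.exp(sequence_log_prob(candidates[y_hat]))
print(y_hat, round(confidence, 2))
```

Note that this sketch compares unnormalized sequence probabilities, exactly as in the equation; it applies no length normalization.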

For the entailment generation benchmarks, we use texts concatenated with possible completions as inputs to the model.
We compare the quantized and full-precision models by the difference in the sequence probabilities $p_{\text{LM}}(y|x)$, referred to below as confidences.
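
The comparison can be sketched as follows, assuming that the sequence log-probabilities of the same answers are available for both models. The helper name, the sign convention (quantized minus full-precision), and the numbers are assumptions made for illustration only.

```python
import math
from typing import List


def confidence_differences(
    full_log_probs: List[float], quant_log_probs: List[float]
) -> List[float]:
    """Per-sample difference in confidence p_LM(y|x).

    Assumed convention: quantized minus full-precision, so a negative
    value means the quantized model is less confident on that sample.
    """
    if len(full_log_probs) != len(quant_log_probs):
        raise ValueError("Both models must be scored on the same samples.")
    return [
        math.exp(q) - math.exp(f)
        for f, q in zip(full_log_probs, quant_log_probs)
    ]


# Toy sequence log-probabilities for two samples (made up for illustration).
full_precision = [math.log(0.60), math.log(0.20)]
quantized = [math.log(0.50), math.log(0.25)]

diffs = confidence_differences(full_precision, quantized)
print([round(d, 2) for d in diffs])
```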

To compute the predictions $\hat{y}$, we use the lm-evaluation-harness framework (https://github.com/EleutherAI/lm-evaluation-harness) together with the detailed per-sample output that it writes for each evaluation when the `write_out` argument is set.

*Note that while we use the December 2023 version of the framework, you can instead use the current version (master branch) with its updated arguments:*
```
!lm_eval --model hf \
    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
    --tasks hellaswag
```
* The `write_out` argument was replaced with `log_samples`, which must be used together with `output_path`.

In [1]:
!pip install auto-gptq==0.7.1 torch==2.3.0 -q

In [2]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness.git
%cd lm-evaluation-harness
!git checkout "add-siqa"
!pip install -e . -q

Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 34827, done.
remote: Counting objects: 100% (916/916), done.
remote: Compressing objects: 100% (530/530), done.
remote: Total 34827 (delta 531), reused 604 (delta 382), pack-reused 33911
Receiving objects: 100% (34827/34827), 23.58 MiB | 17.64 MiB/s, done.
Resolving deltas: 100% (24266/24266), done.
/content/lm-evaluation-harness
Branch 'add-siqa' set up to track remote branch 'add-siqa' from 'origin'.
Switched to a new branch 'add-siqa'
  Preparing metadata (setup.py) ... done

In [3]:
# !export LC_ALL="en_US.UTF-8"
# !export LD_LIBRARY_PATH="/usr/lib64-nvidia"
# !export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
# !ldconfig /usr/lib64-nvidia

In [4]:
#@title Model type and tokenizer
model_path = "iproskurina/bloom-1b7-gptq-4bit"  #@param {type:"string"}
tokenizer_path = "iproskurina/bloom-1b7-gptq-4bit"  #@param {type:"string"}

In [5]:
output_base_path=model_path
output_path=output_base_path+"_suite.json"

In [6]:
!python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=$model_path,tokenizer=$tokenizer_path,quantized="model.safetensors",gptq_use_triton=True \
    --device cuda:0 \
    --tasks hellaswag,piqa,boolq,truthfulqa_mc,arc_easy,xstory_cloze_en,openbookqa \
    --write_out \
    --no_cache \
    --output_path $output_path \
    --output_base_path $output_base_path


For non-quantized models, remove the `quantized` and `gptq_use_triton` arguments from `--model_args`.
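
One way to switch between the two cases is to build the `--model_args` string programmatically. The sketch below uses a hypothetical helper, `build_model_args`, that is not part of the framework; it only assembles the keys that appear in the command above, and the full-precision model name is a placeholder.

```python
from typing import Optional


def build_model_args(
    model_path: str,
    tokenizer_path: str,
    quantized_file: Optional[str] = None,
    use_triton: bool = True,
) -> str:
    """Assemble the value passed to --model_args (hypothetical helper).

    The GPTQ-specific keys are added only when a quantized weights file
    is given; for full-precision models they are left out entirely.
    """
    args = [f"pretrained={model_path}", f"tokenizer={tokenizer_path}"]
    if quantized_file is not None:
        args.append(f"quantized={quantized_file}")
        args.append(f"gptq_use_triton={use_triton}")
    return ",".join(args)


# Quantized model, matching the command used in this notebook.
quantized_args = build_model_args(
    "iproskurina/bloom-1b7-gptq-4bit",
    "iproskurina/bloom-1b7-gptq-4bit",
    quantized_file="model.safetensors",
)

# Full-precision model: placeholder name, no GPTQ-specific keys.
full_precision_args = build_model_args(
    "org/full-precision-model", "org/full-precision-model"
)

print(quantized_args)
print(full_precision_args)
```

In the notebook, the resulting string can be interpolated into the shell command in the same way as the other variables, for example `--model_args $quantized_args`.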