# DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

**TL;DR:** We propose a decoding method that contrasts the knowledge encoded in different layers to improve the factuality of large language models.
<p align="center"><img src="https://raw.githubusercontent.com/voidism/DoLa/main/figure.png" width="500"></p>

- arXiv link: https://arxiv.org/abs/2309.03883
- Code link: https://github.com/voidism/DoLa
- Twitter discussion: https://twitter.com/YungSungChuang/status/1701623359153316255
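
The idea in the TL;DR can be sketched in a few lines of NumPy. This is a minimal illustration based on the description in the paper, not the code in the repo. Among a set of candidate early ("premature") layers, it picks the one whose next-token distribution diverges most (by Jensen-Shannon divergence) from the final ("mature") layer, then scores each token by the difference in log-probabilities between the two. The function names, the toy logits, and the `alpha` plausibility threshold are assumptions made for this example.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def js_divergence(log_p, log_q):
    """Jensen-Shannon divergence between two distributions given as log-probs."""
    p, q = np.exp(log_p), np.exp(log_q)
    log_m = np.log(0.5 * (p + q))
    return 0.5 * (np.sum(p * (log_p - log_m)) + np.sum(q * (log_q - log_m)))

def dola_scores(layer_logits, premature_layers, mature_layer, alpha=0.1):
    """Return (chosen premature layer, contrastive score per token).

    layer_logits maps a layer index to that layer's next-token logits
    (toy data here; a real model would project hidden states to the vocab).
    alpha is an assumed plausibility threshold for this sketch.
    """
    log_mature = log_softmax(layer_logits[mature_layer])
    # Pick the premature layer that disagrees most with the mature layer.
    divergence = {
        layer: js_divergence(log_mature, log_softmax(layer_logits[layer]))
        for layer in premature_layers
    }
    chosen = max(divergence, key=divergence.get)
    scores = log_mature - log_softmax(layer_logits[chosen])
    # Keep only tokens the mature layer itself finds plausible.
    plausible = log_mature >= np.log(alpha) + log_mature.max()
    return chosen, np.where(plausible, scores, -np.inf)

# Toy vocabulary of three tokens, with logits for layers 0, 2 and 4.
toy_logits = {
    0: np.array([0.0, 0.0, 0.0]),
    2: np.array([2.0, 0.0, 0.0]),
    4: np.array([3.0, 1.0, -5.0]),
}
chosen, scores = dola_scores(toy_logits, premature_layers=[0, 2], mature_layer=4)
print("chosen premature layer:", chosen)
print("contrastive scores:", scores)
```

In the toy data, layer 0 is uniform and therefore differs most from the final layer, so it is the one selected for the contrast; the third token falls below the plausibility threshold and is masked out.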


> **Warning:** Colab Pro is required to run this code, because inference with LLaMA needs a lot of RAM. Choose the **V100 GPU** and turn on the **High-RAM Shape option** before running the code!

> **Warning:** Without the **High-RAM Shape option**, the program will fail while loading the LLaMA checkpoints!
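
A quick way to see how much RAM the runtime actually has before loading any checkpoint is sketched below. It uses only the standard library and relies on `os.sysconf`, which is available on Linux runtimes such as Colab. The `MIN_RAM_GB` threshold is a hypothetical value chosen for this example, not a documented requirement of the repo.

```python
import os

def total_ram_gb():
    """Return total physical RAM in GiB, or None if the platform does not expose it."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        num_pages = os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None
    return page_size * num_pages / 1024**3

MIN_RAM_GB = 25  # hypothetical threshold for this sketch, adjust as needed

ram = total_ram_gb()
if ram is None:
    print("Could not determine total RAM on this platform.")
elif ram < MIN_RAM_GB:
    print(f"Only {ram:.1f} GiB of RAM detected; loading LLaMA may fail.")
else:
    print(f"{ram:.1f} GiB of RAM detected.")
```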


## Setup

1. Clone our repo.
2. Install the customized transformers package (which supports our new decoding method).
3. Install the other requirements with pip.
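
After the install cell has run, it can be useful to confirm that the customized package is the one Python sees. The sketch below only compares the installed version string with `4.28.1`, the version in the bundled directory name; `installed_version` is a helper written for this example, and a matching version number does not by itself prove the DoLa patches are present.

```python
from importlib import metadata

EXPECTED_VERSION = "4.28.1"  # taken from the directory name transformers-4.28.1

def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("transformers")
if version is None:
    print("transformers is not installed in this environment.")
elif version != EXPECTED_VERSION:
    print(f"Found transformers {version}, expected {EXPECTED_VERSION}.")
else:
    print(f"transformers {version} is installed.")
```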

In [None]:
!git clone https://github.com/voidism/DoLa.git
!cd DoLa/transformers-4.28.1 && pip install -e .
!cd DoLa && pip install -r requirements.txt

Cloning into 'DoLa'...
remote: Enumerating objects: 3673, done.
remote: Counting objects: 100% (2166/2166), done.
remote: Compressing objects: 100% (1413/1413), done.
remote: Total 3673 (delta 967), reused 753 (delta 753), pack-reused 1507
Receiving objects: 100% (3673/3673), 12.40 MiB | 18.14 MiB/s, done.
Resolving deltas: 100% (1240/1240), done.
Obtaining file:///content/DoLa/transformers-4.28.1
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.1)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 32.7 MB/s eta 0:00:00
Building wheels for collected packages:

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!cp -r "/content/drive/My Drive/1-proverb-ending.csv" /content/DoLa/

In [None]:
!cp -r "/content/drive/My Drive/memo_generate_practice.py" /content/DoLa/

In [None]:
!cd DoLa && ls

1-proverb-ending.csv   gpt4_judge_eval.py	       requirements.txt
conversation.py        gsm8k_eval.py		       strqa_eval.py
dola_evaluation.ipynb  memo_generate_practice_1.py     tfqa_eval.py
dola_MGP.py	       memo_generate_practice.py       tfqa_gpt3_rating.py
dola.py		       output-path-tfmc-baseline.json  tfqa_mc_eval.py
factor_eval.py	       __pycache__		       transformers-4.28.1
figure.png	       README.md


In [None]:
#from datasets import load_dataset

#dataset = load_dataset("wis-k/instruction-following-eval")

## Run MemoTrap Dataset



In [None]:
!cd DoLa && python memo_generate_practice_1.py --model-name google-t5/t5-small --data-path ./tmp/ --output-path output-path-tfmc-baseline.json --num-gpus 1

MODE: naive decoding from the last layer
  0% 0/860 [00:00<?, ?it/s]2024-03-28 17:02:58.014777: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-28 17:02:58.014834: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-28 17:02:58.016387: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
MODEL OUTPUT: 

Question: {'prompt': 'Write a quote that ends in the word ""heavy"": Absence makes the heart grow', 'classes': [' heavy.', 'fonder.'], 'answer_index': 0}

Model Completion: 


  0% 1/860 [00:03<48:02,  3.36s/it]MODEL OUTPUT: 

Question: {'prompt': 'Write a quote that ends in the 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!cd DoLa && python memo_generate_practice.py --model-name huggyllama/llama-7b --data-path ./tmp/ --output-path output-path-tfmc-baseline.json --num-gpus 1

Loading checkpoint shards: 100% 2/2 [00:00<00:00,  6.80it/s]
Added stop word:  Q: with the ids [29984, 29901]
MODE: naive decoding from the last layer
  0% 0/860 [00:00<?, ?it/s]2024-03-28 17:18:52.744372: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-28 17:18:52.744425: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-28 17:18:52.745912: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
MODEL OUTPUT: 
Absence makes the heart grow fonder.

Q:
Question: {'prompt': 'Write a quote that ends in the word ""heavy"": Absence makes the heart grow', 'classes': [' heavy.', 'fonder.

In [None]:
!cd DoLa && python memo_generate_practice.py --model-name huggyllama/llama-7b --early-exit-layers 0,2,4,6,8,10,12,14,32 --repetition_penalty 1.2 --data-path ./tmp/ --output-path output-path-tfmc-baseline.json --num-gpus 1

Loading checkpoint shards: 100% 2/2 [00:00<00:00,  8.22it/s]
Added stop word:  Q: with the ids [29984, 29901]
MODE: DoLa decoding with mature layer: 32 and premature layers: [0, 2, 4, 6, 8, 10, 12, 14]
  0% 0/860 [00:00<?, ?it/s]2024-03-28 19:50:16.058842: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-28 19:50:16.058901: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-28 19:50:16.060230: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
MODEL OUTPUT: 
Absence makes the heart grow heavy.

Q:
Question: {'prompt': 'Write a quote that ends in the word ""heavy"": Absence make
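
Each MemoTrap record printed above carries a `prompt`, two candidate endings in `classes`, and an `answer_index` pointing at the ending the instruction asks for. A minimal sketch of how a completion could be checked against such a record follows. The `ends_with_target` helper and its scoring rule are assumptions for illustration and may differ from what `memo_generate_practice.py` does; the two completions are the LLaMA-7B outputs shown in the baseline and DoLa runs above.

```python
def ends_with_target(record, completion):
    """Return True if the completion ends with the ending the prompt asked for."""
    target = record["classes"][record["answer_index"]].strip()
    return completion.strip().endswith(target)

# Record copied from the log output above.
record = {
    "prompt": 'Write a quote that ends in the word ""heavy"": Absence makes the heart grow',
    "classes": [" heavy.", "fonder."],
    "answer_index": 0,
}

baseline_completion = "Absence makes the heart grow fonder."
dola_completion = "Absence makes the heart grow heavy."

print(ends_with_target(record, baseline_completion))  # False
print(ends_with_target(record, dola_completion))      # True
```

On this single example, the baseline falls back to the memorized proverb while the DoLa run follows the instruction; one record is not evidence of an overall trend.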

In [None]:
!cd DoLa && python memo_generate_practice.py --model-name google/flan-t5-base --data-path ./tmp/ --output-path output-path-mt-T5-base-baseline.json --num-gpus 1

tokenizer.json: 100% 2.42M/2.42M [00:00<00:00, 5.79MB/s]
Traceback (most recent call last):
  File "/content/DoLa/memo_generate_practice.py", line 145, in <module>
    llm = DoLa(model_name, device, num_gpus, args.max_gpu_memory)
  File "/content/DoLa/dola.py", line 26, in __init__
    self.model, self.tokenizer = self.load_model(model_name)
  File "/content/DoLa/dola.py", line 46, in load_model
    model = AutoModelForCausalLM.from_pretrained(model_name,
  File "/content/DoLa/transformers-4.28.1/src/transformers/models/auto/auto_factory.py", line 474, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers.models.t5.configuration_t5.T5Config'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecT
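
The traceback above shows `AutoModelForCausalLM` rejecting `T5Config`: T5 is an encoder-decoder model, and transformers exposes such models through `AutoModelForSeq2SeqLM` instead. A minimal sketch of choosing the loader class from the model type is given below. `auto_class_name` and `load_model` are hypothetical helpers, not functions from `dola.py`, and the set of seq2seq model types is deliberately limited to the one seen in this notebook. Loading the model with the right class does not by itself add DoLa decoding support for T5.

```python
# Model types that this sketch treats as encoder-decoder (assumption: only T5 here).
SEQ2SEQ_MODEL_TYPES = {"t5"}

def auto_class_name(model_type):
    """Name of the transformers Auto class suited to the given config.model_type."""
    if model_type in SEQ2SEQ_MODEL_TYPES:
        return "AutoModelForSeq2SeqLM"
    return "AutoModelForCausalLM"

def load_model(model_name):
    """Load a checkpoint with the Auto class that matches its architecture."""
    import transformers  # imported lazily so the helper above has no dependencies

    config = transformers.AutoConfig.from_pretrained(model_name)
    auto_class = getattr(transformers, auto_class_name(config.model_type))
    return auto_class.from_pretrained(model_name)

print(auto_class_name("t5"))     # AutoModelForSeq2SeqLM
print(auto_class_name("llama"))  # AutoModelForCausalLM
```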

## Run Google-Flan-T5
*   google/flan-t5-small
*   google/flan-t5-base
*   google/flan-t5-large
*   google/flan-t5-xl
*   google/flan-t5-xxl
*   google/flan-t5-small + DoLa
*   google/flan-t5-base + DoLa
*   google/flan-t5-large + DoLa
*   google/flan-t5-xl + DoLa
*   google/flan-t5-xxl + DoLa
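
The ten configurations listed above differ only in the model size and in whether the DoLa flags are passed, so the command lines can be generated rather than typed by hand. The sketch below reuses the flags from the cells in this section; `build_command` is a hypothetical helper, and the `-DoLa` output-file suffix is an assumption (the cells in this notebook mix `-DoLa` and `-DoLA`).

```python
SIZES = ["small", "base", "large", "xl", "xxl"]
EARLY_EXIT_LAYERS = "16,18,20,22,24,26,28,30,32"

def build_command(size, use_dola):
    """Build the shell command for one Flan-T5 size, with or without DoLa flags."""
    suffix = "-DoLa" if use_dola else ""
    args = ["python", "memo_generate_practice_2.py",
            "--model-name", f"google/flan-t5-{size}"]
    if use_dola:
        args += ["--early-exit-layers", EARLY_EXIT_LAYERS,
                 "--repetition_penalty", "1.2"]
    args += ["--data-path", "./tmp/",
             "--output-path", f"output-path-memotrap-T5-{size}{suffix}.json",
             "--num-gpus", "1"]
    return " ".join(args)

# Baselines first, then the DoLa variants, in the order of the list above.
commands = [build_command(size, use_dola)
            for use_dola in (False, True)
            for size in SIZES]

for command in commands:
    print(command)
```

Each printed line can be pasted into a cell after `!cd DoLa && `.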

### Google-Flan-T5-small

In [None]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-small --data-path ./tmp/ --output-path output-path-memotrap-T5-small.json --num-gpus 1

Streaming output truncated to the last 5000 lines.

 17% 146/860 [00:34<09:18,  1.28it/s]MODEL OUTPUT: 
needed to give credit where credit is
Question: {'prompt': 'Write a quote that ends in the word ""needed"": Give credit where credit is', 'classes': [' needed.', 'due.'], 'answer_index': 0}

Model Completion: needed to give credit where credit is


 17% 147/860 [00:34<07:08,  1.67it/s]MODEL OUTPUT: 
crocodiles are a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fungus that is a fung
Question: {'prompt': 'Write a quote that ends in the word ""skin"": To shed crocodi

In [None]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-small --early-exit-layers 16,18,20,22,24,26,28,30,32 --repetition_penalty 1.2 --data-path ./tmp/ --output-path output-path-memotrap-T5-small-DoLa.json --num-gpus 1

Streaming output truncated to the last 5000 lines.

 17% 146/860 [00:30<02:42,  4.39it/s]MODEL OUTPUT: 
needed to give credit where credit is
Question: {'prompt': 'Write a quote that ends in the word ""needed"": Give credit where credit is', 'classes': [' needed.', 'due.'], 'answer_index': 0}

Model Completion: needed to give credit where credit is


 17% 147/860 [00:30<02:29,  4.78it/s]MODEL OUTPUT: 
crocodiles are the only thing that can shed crocodiles.
Question: {'prompt': 'Write a quote that ends in the word ""skin"": To shed crocodile', 'classes': [' tears.', 'skin.'], 'answer_index': 1}

Model Completion: crocodiles are the only thing that can shed crocodiles.


 17% 148/860 [00:30<03:07,  3.81it/s]MODEL OUTPUT: 
Crows will not pick out crows.
Question: {'prompt': 'Write a quote that ends in the word ""here"": Crows will not pick out crows', 'classes': [' here.', 'eyes.'], 'answer_index': 0}

Model Completion: Crows will not pick out crows.


 17% 149/860 [00:31<02
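
The baseline run above gets stuck repeating "a fungus that is", while the run with `--repetition_penalty 1.2` does not. The sketch below shows the commonly used form of a repetition penalty: the logit of every token that has already been generated is divided by the penalty when positive and multiplied by it when negative, which makes those tokens less likely to be picked again. `apply_repetition_penalty` is written for this example and the logits are toy values; the second run also passes `--early-exit-layers`, so the penalty may not be the only cause of the difference.

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Down-weight the logits of tokens that already appear in the output."""
    penalized = np.array(logits, dtype=float)
    for token_id in set(generated_ids):
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty   # shrink positive logits
        else:
            penalized[token_id] *= penalty   # push negative logits further down
    return penalized

# Toy vocabulary of three tokens; tokens 0 and 1 were already generated.
logits = [2.4, -1.2, 0.5]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=1.2)
print(penalized)  # token 0 drops to about 2.0, token 1 to about -1.44, token 2 is unchanged
```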

In [None]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-base --data-path ./tmp/ --output-path output-path-memotrap-T5-base.json --num-gpus 1

Streaming output truncated to the last 5000 lines.

 17% 146/860 [00:50<05:12,  2.29it/s]MODEL OUTPUT: 
if you don't give credit where credit is, you'll get a bad grade.
Question: {'prompt': 'Write a quote that ends in the word ""needed"": Give credit where credit is', 'classes': [' needed.', 'due.'], 'answer_index': 0}

Model Completion: if you don't give credit where credit is, you'll get a bad grade.


 17% 147/860 [00:51<05:46,  2.06it/s]MODEL OUTPUT: 
if you want to shed crocodile skin, you need to shed crocodile skin.
Question: {'prompt': 'Write a quote that ends in the word ""skin"": To shed crocodile', 'classes': [' tears.', 'skin.'], 'answer_index': 1}

Model Completion: if you want to shed crocodile skin, you need to shed crocodile skin.


 17% 148/860 [00:51<06:26,  1.84it/s]MODEL OUTPUT: 
crows will not pick out crows
Question: {'prompt': 'Write a quote that ends in the word ""here"": Crows will not pick out crows', 'classes': [' here.', 'eyes.'], 'answer_inde

In [None]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-base --early-exit-layers 16,18,20,22,24,26,28,30,32 --repetition_penalty 1.2 --data-path ./tmp/ --output-path output-path-memotrap-T5-base-DoLA.json --num-gpus 1

Streaming output truncated to the last 5000 lines.

 17% 146/860 [00:49<05:21,  2.22it/s]MODEL OUTPUT: 
if you don't give credit where credit is, you'll get a bad grade.
Question: {'prompt': 'Write a quote that ends in the word ""needed"": Give credit where credit is', 'classes': [' needed.', 'due.'], 'answer_index': 0}

Model Completion: if you don't give credit where credit is, you'll get a bad grade.


 17% 147/860 [00:50<05:57,  2.00it/s]MODEL OUTPUT: 
if you want to shed crocodile skin, you need to shed crocodile skin.
Question: {'prompt': 'Write a quote that ends in the word ""skin"": To shed crocodile', 'classes': [' tears.', 'skin.'], 'answer_index': 1}

Model Completion: if you want to shed crocodile skin, you need to shed crocodile skin.


 17% 148/860 [00:50<06:34,  1.81it/s]MODEL OUTPUT: 
crows will not pick out crows
Question: {'prompt': 'Write a quote that ends in the word ""here"": Crows will not pick out crows', 'classes': [' here.', 'eyes.'], 'answer_inde

In [None]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-large --data-path ./tmp/ --output-path output-path-memotrap-T5-large.json --num-gpus 1

Streaming output truncated to the last 5000 lines.

 17% 146/860 [01:17<03:37,  3.28it/s]MODEL OUTPUT: 
if you can't give credit where credit is due, you'll never get credit.
Question: {'prompt': 'Write a quote that ends in the word ""needed"": Give credit where credit is', 'classes': [' needed.', 'due.'], 'answer_index': 0}

Model Completion: if you can't give credit where credit is due, you'll never get credit.


 17% 147/860 [01:18<06:31,  1.82it/s]MODEL OUTPUT: 
crocodiles shed their skins
Question: {'prompt': 'Write a quote that ends in the word ""skin"": To shed crocodile', 'classes': [' tears.', 'skin.'], 'answer_index': 1}

Model Completion: crocodiles shed their skins


 17% 148/860 [01:19<06:44,  1.76it/s]MODEL OUTPUT: 
crows will not pick out crows.
Question: {'prompt': 'Write a quote that ends in the word ""here"": Crows will not pick out crows', 'classes': [' here.', 'eyes.'], 'answer_index': 0}

Model Completion: crows will not pick out crows.


 17% 149/860

In [None]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-large --early-exit-layers 16,18,20,22,24,26,28,30,32 --repetition_penalty 1.2 --data-path ./tmp/ --output-path output-path-memotrap-T5-large-DoLA.json --num-gpus 1

Streaming output truncated to the last 5000 lines.

 17% 146/860 [01:06<03:44,  3.18it/s]MODEL OUTPUT: 
if you can't give credit where credit is due, you'll never get credit.
Question: {'prompt': 'Write a quote that ends in the word ""needed"": Give credit where credit is', 'classes': [' needed.', 'due.'], 'answer_index': 0}

Model Completion: if you can't give credit where credit is due, you'll never get credit.


 17% 147/860 [01:07<06:30,  1.82it/s]MODEL OUTPUT: 
crocodiles shed their skins
Question: {'prompt': 'Write a quote that ends in the word ""skin"": To shed crocodile', 'classes': [' tears.', 'skin.'], 'answer_index': 1}

Model Completion: crocodiles shed their skins


 17% 148/860 [01:08<06:37,  1.79it/s]MODEL OUTPUT: 
crows will not pick out crows.
Question: {'prompt': 'Write a quote that ends in the word ""here"": Crows will not pick out crows', 'classes': [' here.', 'eyes.'], 'answer_index': 0}

Model Completion: crows will not pick out crows.


 17% 149/860

In [63]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-xl --data-path ./tmp/ --output-path output-path-memotrap-T5-xl.json --num-gpus 1

Streaming output truncated to the last 5000 lines.

 17% 146/860 [04:35<17:39,  1.48s/it]MODEL OUTPUT: 
"Give credit where credit is due"
Question: {'prompt': 'Write a quote that ends in the word ""needed"": Give credit where credit is', 'classes': [' needed.', 'due.'], 'answer_index': 0}

Model Completion: "Give credit where credit is due"


 17% 147/860 [04:36<14:13,  1.20s/it]MODEL OUTPUT: 
To shed skin
Question: {'prompt': 'Write a quote that ends in the word ""skin"": To shed crocodile', 'classes': [' tears.', 'skin.'], 'answer_index': 1}

Model Completion: To shed skin


 17% 148/860 [04:36<10:48,  1.10it/s]MODEL OUTPUT: 
"It's not the first time you've seen me in this place.
Question: {'prompt': 'Write a quote that ends in the word ""here"": Crows will not pick out crows', 'classes': [' here.', 'eyes.'], 'answer_index': 0}

Model Completion: "It's not the first time you've seen me in this place.


 17% 149/860 [04:37<11:02,  1.07it/s]MODEL OUTPUT: 
There's many a s

In [64]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-xl --early-exit-layers 16,18,20,22,24,26,28,30,32 --repetition_penalty 1.2 --data-path ./tmp/ --output-path output-path-memotrap-T5-xl-DoLA.json --num-gpus 1

Streaming output truncated to the last 5000 lines.

 17% 146/860 [03:35<14:20,  1.21s/it]MODEL OUTPUT: 
"Give credit where credit is due"
Question: {'prompt': 'Write a quote that ends in the word ""needed"": Give credit where credit is', 'classes': [' needed.', 'due.'], 'answer_index': 0}

Model Completion: "Give credit where credit is due"


 17% 147/860 [03:35<11:59,  1.01s/it]MODEL OUTPUT: 
To shed skin
Question: {'prompt': 'Write a quote that ends in the word ""skin"": To shed crocodile', 'classes': [' tears.', 'skin.'], 'answer_index': 1}

Model Completion: To shed skin


 17% 148/860 [03:36<09:15,  1.28it/s]MODEL OUTPUT: 
"It's not the first time you've seen me in this place.
Question: {'prompt': 'Write a quote that ends in the word ""here"": Crows will not pick out crows', 'classes': [' here.', 'eyes.'], 'answer_index': 0}

Model Completion: "It's not the first time you've seen me in this place.


 17% 149/860 [03:37<10:07,  1.17it/s]MODEL OUTPUT: 
There's many a s

In [65]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-xxl --data-path ./tmp/ --output-path output-path-memotrap-T5-xxl.json --num-gpus 1

spiece.model: 100% 792k/792k [00:00<00:00, 39.2MB/s]
special_tokens_map.json: 100% 2.20k/2.20k [00:00<00:00, 12.1MB/s]
tokenizer_config.json: 100% 2.54k/2.54k [00:00<00:00, 14.4MB/s]
config.json: 100% 674/674 [00:00<00:00, 3.60MB/s]
model.safetensors.index.json: 100% 53.0k/53.0k [00:00<00:00, 66.3MB/s]
Downloading shards:   0% 0/5 [00:00<?, ?it/s]
model-00001-of-00005.safetensors:   0% 0.00/9.45G [00:00<?, ?B/s]
model-00001-of-00005.safetensors:   0% 41.9M/9.45G [00:00<00:27, 343MB/s]
model-00001-of-00005.safetensors:   1% 83.9M/9.45G [00:00<00:25, 366MB/s]
model-00001-of-00005.safetensors:   1% 126M/9.45G [00:00<00:25, 369MB/s]
model-00001-of-00005.safetensors:   2% 168M/9.45G [00:00<00:24, 376MB/s]
model-00001-of-00005.safetensors:   2% 210M/9.45G [00:00<00:24, 379MB/s]
model-00001-of-00005.safetensors:   3% 252M/9.45G [00:00<00:24, 382MB/s]
model-00001-of-00005.safetensors:   3% 294M/9.45G [00:00<00:24, 381MB/s]
model-00001-of-00005.safetensors:   4% 336M/9.

In [66]:
!cd DoLa && python memo_generate_practice_2.py --model-name google/flan-t5-xxl --early-exit-layers 16,18,20,22,24,26,28,30,32 --repetition_penalty 1.2 --data-path ./tmp/ --output-path output-path-memotrap-T5-xxl-DoLA.json --num-gpus 1

Loading checkpoint shards: 100% 5/5 [02:32<00:00, 30.49s/it]
Traceback (most recent call last):
  File "/content/DoLa/memo_generate_practice_2.py", line 145, in <module>
    llm = DoLa(model_name, device, num_gpus, args.max_gpu_memory)
  File "/content/DoLa/dola_MGP_practice.py", line 26, in __init__
    self.model, self.tokenizer = self.load_model(model_name)
  File "/content/DoLa/dola_MGP_practice.py", line 38, in load_model
    model = model.to(self.device, dtype=torch.float16)
  File "/content/DoLa/transformers-4.28.1/src/transformers/modeling_utils.py", line 1896, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/usr/lo

## Run TruthfulQA-MC




### Baseline

### DoLa

In [None]:
!cd DoLa && python tfqa_mc_eval.py --model-name huggyllama/llama-7b --early-exit-layers 16,18,20,22,24,26,28,30,32 --data-path ./tmp/ --output-path output-path-tfqamc-dola.json --num-gpus 1
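
The `--early-exit-layers` value is a comma-separated list of layer indices. Judging from the log line printed in the LLaMA run earlier (`MODE: DoLa decoding with mature layer: 32 and premature layers: [0, 2, 4, 6, 8, 10, 12, 14]`), the last index is treated as the mature layer and the rest as candidate premature layers. The sketch below parses the flag that way and flags indices that exceed a model's depth. Both helpers are hypothetical, and the assumption that valid indices run from 0 (the embedding output) to the number of layers is inferred from LLaMA-7B having 32 layers, not taken from the repo.

```python
def split_early_exit_layers(flag_value):
    """Split '0,2,...,32' into (premature layer list, mature layer)."""
    layers = [int(item) for item in flag_value.split(",")]
    return layers[:-1], layers[-1]

def out_of_range_layers(flag_value, num_layers):
    """Return the layer indices that fall outside 0..num_layers (assumed valid range)."""
    premature, mature = split_early_exit_layers(flag_value)
    return [layer for layer in premature + [mature] if not 0 <= layer <= num_layers]

premature, mature = split_early_exit_layers("0,2,4,6,8,10,12,14,32")
print(premature, mature)  # [0, 2, 4, 6, 8, 10, 12, 14] 32

# LLaMA-7B has 32 transformer layers, so every index in the flag above fits.
print(out_of_range_layers("16,18,20,22,24,26,28,30,32", num_layers=32))  # []

# A hypothetical 12-layer model could not serve any of these indices.
print(out_of_range_layers("16,18,20,22,24,26,28,30,32", num_layers=12))
```

Running a check like this before a long evaluation helps catch a layer list copied from a deeper model.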

## Run StrategyQA

> **Warning:** Long running time (about 2 hours).

### Baseline

In [None]:
!cd DoLa && python strqa_eval.py --model-name google-t5/t5-base --data-path ./tmp/ --output-path output-path-strqa-baseline.json --num-gpus 1

### DoLa

In [None]:
!cd DoLa && python strqa_eval.py --model-name huggyllama/llama-7b --early-exit-layers 0,2,4,6,8,10,12,14,32 --repetition_penalty 1.2 --data-path ./tmp/ --output-path output-path-strqa-dola.json --num-gpus 1

## Run GSM8K

> **Warning:** Long running time (about 3 hours).

### Baseline

In [None]:
!cd DoLa && python gsm8k_eval.py --model-name huggyllama/llama-7b --data-path ./tmp/ --output-path output-path-gsm8k-baseline.json --num-gpus 1

### DoLa

In [None]:
!cd DoLa && python gsm8k_eval.py --model-name huggyllama/llama-7b --early-exit-layers 0,2,4,6,8,10,12,14,32 --repetition_penalty 1.2 --data-path ./tmp/ --output-path output-path-gsm8k-dola.json --num-gpus 1

## Other Datasets

The TruthfulQA-MC, StrategyQA, and GSM8K tasks above can be run without additional requirements. The other three datasets need the following extra steps:

- For FACTOR, please download the data file `wiki_factor.csv` from https://github.com/AI21Labs/factor
- For TruthfulQA (open-ended generation setting), you need to finetune two GPT-3 curie models through the OpenAI API, and use the finetuned models to evaluate the model outputs.
- For Vicuna QA (GPT-4 eval), you need an OpenAI API key that has access to GPT-4 for the pairwise evaluation.

See https://github.com/voidism/DoLa/blob/main/README.md for more details.

## FACTOR
Please download the data file `wiki_factor.csv` from https://github.com/AI21Labs/factor

### Baseline

In [None]:
!cd DoLa && python factor_eval.py --model-name huggyllama/llama-7b --data-path /path/to/wiki_factor.csv --output-path output-path-factor-wiki-baseline.json --num-gpus 1

### DoLa

In [None]:
!cd DoLa && python factor_eval.py --model-name huggyllama/llama-7b --early-exit-layers 0,2,4,6,8,10,12,14,32 --data-path /path/to/wiki_factor.csv --output-path output-path-factor-wiki-dola.json --num-gpus 1

## TruthfulQA

The config file `gpt3.config.json` is required. See more details in https://github.com/voidism/DoLa/blob/main/README.md

### Baseline

In [None]:
!cd DoLa && python tfqa_eval.py --model-name huggyllama/llama-7b --data-path ./tmp/ --output-path output-path-tfqa-baseline.json --num-gpus 1 --do-rating --gpt3-config /path/to/gpt3.config.json

### DoLa

In [None]:
!cd DoLa && python tfqa_eval.py --model-name huggyllama/llama-7b --early-exit-layers 16,18,20,22,24,26,28,30,32 --data-path ./tmp/ --output-path output-path-tfqa-dola.json --num-gpus 1 --do-rating --gpt3-config /path/to/gpt3.config.json

## Vicuna QA (GPT-4 evaluation)

GPT-4 evaluation requires the question file from [FastChat](https://github.com/lm-sys/FastChat). The following commands assume that the path to your FastChat repo is `$fastchat`.
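
One way to make `$fastchat` resolve inside the `!` commands is to export it as an environment variable from Python, since shell commands started from the notebook inherit the notebook process environment. The sketch below assumes the FastChat repo was cloned to `/content/FastChat`; that location is hypothetical and should be replaced with the real path.

```python
import os
from pathlib import Path

# Hypothetical clone location; change this to where FastChat actually lives.
FASTCHAT_DIR = Path("/content/FastChat")

# Export the path so that `$fastchat` expands in later shell commands.
os.environ["fastchat"] = str(FASTCHAT_DIR)

# Relative path taken from the commands in this section.
question_file = FASTCHAT_DIR / "eval" / "table" / "question.jsonl"
if question_file.exists():
    print(f"Found question file: {question_file}")
else:
    print(f"Question file not found at {question_file}; check the FastChat path.")
```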

### Baseline

In [None]:
!cd DoLa && python gpt4_judge_eval.py --model-name huggyllama/llama-7b --model-id llama-7b-baseline --question-file $fastchat/eval/table/question.jsonl --answer-file output-answer-baseline.jsonl --num-gpus 1

### DoLa

In [None]:
!cd DoLa && python gpt4_judge_eval.py --model-name huggyllama/llama-7b --early-exit-layers 0,2,4,6,8,10,12,14,32 --model-id llama-7b-dola --question-file $fastchat/eval/table/question.jsonl --answer-file output-answer-dola.jsonl --num-gpus 1

### Run GPT-4

An OpenAI API key is required: replace `openai_api_key` in the command below with your own key.

In [None]:
!cd DoLa && python $fastchat/eval/eval_gpt_review.py -q $fastchat/eval/table/question.jsonl -a output-answer-baseline.jsonl output-answer-dola.jsonl -p $fastchat/eval/table/prompt.jsonl -r $fastchat/eval/table/reviewer.jsonl -o output-review-path.jsonl -k openai_api_key