# Using LM-Evaluation-Harness

The LM Evaluation Harness (LM-Eval) is a widely used tool for evaluating language models. HuggingFace uses it for its leaderboard, and we use it here to evaluate our subquadratic LMs.

This notebook is a short tutorial: it installs LM-Eval, defines custom multiple-choice tasks as YAML configs, and runs them against several models.

`Note`: Some fields are still being cleaned up.

## Install LM-Eval

In [1]:
# Install LM-Eval
!pip install -q git+https://github.com/EleutherAI/lm-evaluation-harness.git


In [2]:
from lm_eval import api

Downloading builder script:   0%|          | 0.00/5.67k [00:00<?, ?B/s]

## Create tasks



In [None]:
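# Raw string: keeps "\n" as a YAML escape instead of a literal line break inside the quoted template.
YAML_flamethrower_string = r'''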
task: flamethrower
dataset_path: TrevorAsbery/scottsus_flamethrower_questions
output_type: multiple_choice
test_split: test
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
'''

with open('flamethrower.yaml', 'w') as f:
    f.write(YAML_flamethrower_string)

In [None]:
YAML_research_string = r'''
task: research
dataset_path: scottsus/papers-test-v2
output_type: multiple_choice
test_split: test
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
'''

with open('research.yaml', 'w') as f:
    f.write(YAML_research_string)

In [None]:
YAML_products_string = r'''
task: products
dataset_path: slyq/wdc-products-mcq
output_type: multiple_choice
test_split: test
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
'''

with open('products.yaml', 'w') as f:
    f.write(YAML_products_string)

In [3]:
YAML_flamethrower_rag_string = r'''
task: flamethrower_rag
dataset_path: slyq/code-mcq
output_type: multiple_choice
test_split: test
doc_to_text: "{{contexts[0].strip()}}\n{{contexts[1].strip()}}\n{{contexts[2].strip()}}\n{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
'''

with open('flamethrower_rag.yaml', 'w') as f:
    f.write(YAML_flamethrower_rag_string)

In [4]:
YAML_research_rag_string = r'''
task: research_rag
dataset_path: slyq/papers-mcq
output_type: multiple_choice
test_split: test
doc_to_text: "{{contexts[0].strip()}}\n{{contexts[1].strip()}}\n{{contexts[2].strip()}}\n{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
'''

with open('research_rag.yaml', 'w') as f:
    f.write(YAML_research_rag_string)

In [5]:
YAML_products_rag_string = r'''
task: products_rag
dataset_path: slyq/wdc-products-mcq
dataset_name: ragged
output_type: multiple_choice
test_split: test
doc_to_text: "{{contexts[0].strip()}}\n{{contexts[1].strip()}}\n{{contexts[2].strip()}}\n{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
'''

with open('products_rag.yaml', 'w') as f:
    f.write(YAML_products_rag_string)
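
The six task configs above differ only in the task name, the dataset path, an optional `dataset_name`, and whether retrieved contexts are prepended to the prompt. A minimal sketch of generating them from one template follows. `build_task_yaml` and `write_task_yaml` are hypothetical helper names, not part of LM-Eval, and the sketch keeps only the `acc` metric for brevity.

```python
from pathlib import Path

# Prompt pieces copied from the configs above. Raw strings keep "\n" as a
# two-character escape, so the YAML parser (not Python) turns it into a newline.
MCQ_PROMPT = (
    r"{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}"
    r"\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
)
RAG_PREFIX = (
    r"{{contexts[0].strip()}}\n{{contexts[1].strip()}}\n{{contexts[2].strip()}}\n"
)


def build_task_yaml(task, dataset_path, dataset_name=None, rag=False):
    """Return the text of a multiple-choice task config (hypothetical helper)."""
    prompt = (RAG_PREFIX if rag else "") + MCQ_PROMPT
    lines = [f"task: {task}", f"dataset_path: {dataset_path}"]
    if dataset_name is not None:
        lines.append(f"dataset_name: {dataset_name}")
    lines += [
        "output_type: multiple_choice",
        "test_split: test",
        f'doc_to_text: "{prompt}"',
        'doc_to_choice: ["A", "B", "C", "D"]',
        "doc_to_target: answer",
        "metric_list:",
        "  - metric: acc",
    ]
    return "\n".join(lines) + "\n"


def write_task_yaml(directory, task, **kwargs):
    """Write <task>.yaml into `directory` and return its path."""
    path = Path(directory) / f"{task}.yaml"
    path.write_text(build_task_yaml(task, **kwargs))
    return path


print(build_task_yaml("products_rag", "slyq/wdc-products-mcq",
                      dataset_name="ragged", rag=True))
```

Calling `write_task_yaml("./", "flamethrower", dataset_path="TrevorAsbery/scottsus_flamethrower_questions")` would then write a config equivalent in structure to the first cell of this section.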

## Log into HuggingFace (required for Gemma)

In [6]:
!pip install huggingface_hub


In [7]:
from huggingface_hub import notebook_login

notebook_login()


# Mamba Instruct-Tuned: Base

## Code (Flamethrower)

In [None]:
# !accelerate launch --no_python
!lm_eval \
    --model hf \
    --model_args pretrained=scottsus/mamba-2.8b-instruct-hf \
    --include_path ./ \
    --tasks flamethrower \
    --output output/flamethrower/ \
    --log_samples

2024-04-26 05:48:12.955357: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 05:48:12.955411: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 05:48:12.957228: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:05:48:18,211 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:05:48:18,211 INFO     [__main__.py:262] Including path: ./
2024-04-26:05:48:24,450 INFO     [__main__.py:335] Selected Tasks: ['flamethrower']
2024-04-26:05:48:24,455 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1
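
Every evaluation cell in this notebook passes the same flags and changes only the model, the task, and the output directory. The sketch below builds that command in Python so it can be reused in a loop. `lm_eval_command` is a hypothetical helper, and the flags are exactly the ones used in the cells of this notebook.

```python
import shlex


def lm_eval_command(model, task, output_dir, include_path="./", limit=None):
    """Build the argv list for one `lm_eval` run (hypothetical helper)."""
    argv = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model}",
        "--include_path", include_path,
        "--tasks", task,
        "--output", output_dir,
        "--log_samples",
    ]
    if limit is not None:
        # --limit caps the number of evaluated documents.
        argv += ["--limit", str(limit)]
    return argv


cmd = lm_eval_command(
    "scottsus/mamba-2.8b-instruct-hf", "flamethrower", "output/flamethrower/"
)
print(shlex.join(cmd))
# To launch it from Python: subprocess.run(cmd, check=True)
```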

## Research

In [None]:
# !accelerate launch --no_python
!lm_eval \
    --model hf \
    --model_args pretrained=scottsus/mamba-2.8b-instruct-hf \
    --include_path ./ \
    --tasks research \
    --output output/research/ \
    --log_samples

2024-04-26 05:20:37.451575: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 05:20:37.451627: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 05:20:37.453242: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:05:20:41,902 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:05:20:41,902 INFO     [__main__.py:262] Including path: ./
2024-04-26:05:20:48,106 INFO     [__main__.py:335] Selected Tasks: ['research']
2024-04-26:05:20:48,108 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234


## Products

In [None]:
# !accelerate launch --no_python
!lm_eval \
    --model hf \
    --model_args pretrained=scottsus/mamba-2.8b-instruct-hf \
    --include_path ./ \
    --tasks products \
    --output output/products/ \
    --log_samples

2024-04-26 06:01:46.362762: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 06:01:46.362813: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 06:01:46.364600: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:06:01:51,613 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:06:01:51,613 INFO     [__main__.py:262] Including path: ./
2024-04-26:06:01:57,888 INFO     [__main__.py:335] Selected Tasks: ['products']
2024-04-26:06:01:57,890 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234


# Mamba Fine-Tuned + RAG

## Code (Flamethrower)

In [15]:
!lm_eval \
    --model hf \
    --model_args pretrained=scottsus/mamba-2.8b-flamethrower-trained \
    --include_path ./ \
    --tasks flamethrower_rag \
    --output output/mb_flamethrower_ragft/ \
    --log_samples

2024-05-03:05:17:49,906 INFO     [__main__.py:251] Verbosity set to INFO
2024-05-03:05:17:49,906 INFO     [__main__.py:262] Including path: ./
2024-05-03:05:17:56,404 INFO     [__main__.py:335] Selected Tasks: ['flamethrower_rag']
2024-05-03:05:17:56,406 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-03:05:17:56,406 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'scottsus/mamba-2.8b-flamethrower-trained'}
2024-05-03:05:17:57,206 INFO     [huggingface.py:164] Using device 'cuda'
config.json: 100%|█████████████████████████████| 897/897 [00:00<00:00, 4.61MB/s]
model.safetensors.index.json: 100%|█████████| 50.9k/50.9k [00:00<00:00, 493kB/s]

## Research

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=scottsus/mamba-2.8b-papers-trained \
    --include_path ./ \
    --tasks research_rag \
    --output output/mb_research_ragft/ \
    --log_samples

2024-05-03:05:56:33,708 INFO     [__main__.py:251] Verbosity set to INFO
2024-05-03:05:56:33,708 INFO     [__main__.py:262] Including path: ./
2024-05-03:05:56:40,373 INFO     [__main__.py:335] Selected Tasks: ['research_rag']
2024-05-03:05:56:40,375 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-03:05:56:40,375 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'scottsus/mamba-2.8b-papers-trained'}
2024-05-03:05:56:41,179 INFO     [huggingface.py:164] Using device 'cuda'
config.json: 100%|█████████████████████████████| 897/897 [00:00<00:00, 4.21MB/s]
model.safetensors.index.json: 100%|█████████| 50.9k/50.9k [00:00<00:00, 495kB/s]

## Products

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=scottsus/mamba-2.8b-wdc-trained-v2 \
    --include_path ./ \
    --tasks products_rag \
    --output output/mb_products_ragft/ \
    --log_samples

# Gemma Instruct-Tuned: Base

## Code (Flamethrower)

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-2b-it \
    --include_path ./ \
    --tasks flamethrower \
    --output output/flamethrower/ \
    --log_samples

2024-04-26 06:30:21.848504: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 06:30:21.848561: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 06:30:21.850341: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:06:30:26,760 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:06:30:26,761 INFO     [__main__.py:262] Including path: ./
2024-04-26:06:30:33,087 INFO     [__main__.py:335] Selected Tasks: ['flamethrower']
2024-04-26:06:30:33,088 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1

## Research

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-2b-it \
    --include_path ./ \
    --tasks research \
    --output output/research/ \
    --log_samples

2024-04-26 05:10:20.903394: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 05:10:20.903445: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 05:10:20.904866: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:05:10:25,067 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:05:10:25,067 INFO     [__main__.py:262] Including path: ./
2024-04-26:05:10:31,393 INFO     [__main__.py:335] Selected Tasks: ['research']
2024-04-26:05:10:31,394 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234


## Products

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-2b-it \
    --include_path ./ \
    --tasks products \
    --output output/products/ \
    --log_samples

2024-04-26 06:22:50.693174: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 06:22:50.693223: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 06:22:50.694950: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:06:22:55,547 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:06:22:55,547 INFO     [__main__.py:262] Including path: ./
2024-04-26:06:23:01,716 INFO     [__main__.py:335] Selected Tasks: ['products']
2024-04-26:06:23:01,717 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234


## Closer Look at YAML Fields

To prepare a task we can simply fill in a YAML config with the relevant information.

`output_type`
The evaluation types currently provided are:
1.   `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.
2.   `loglikelihood_rolling`: Evaluates the loglikelihood of producing a string, conditioned on the empty string (used for perplexity evaluations).
3.   `multiple_choice`: Evaluates the loglikelihood of each of a number of choices and takes the most likely one as the model's prediction.
4.   `greedy_until`: The model generates output greedily (can be configured to use beam search and other generation-related parameters). Newer LM-Eval releases call this type `generate_until`.
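
For `multiple_choice`, each choice gets a loglikelihood score and the highest-scoring choice counts as the prediction. The sketch below illustrates the idea behind the `acc` and `acc_norm` metrics used in the task configs. It assumes that `acc_norm` divides each score by the character length of its choice string, and the loglikelihood values are made up for illustration.

```python
def predict(loglikelihoods, choices, normalize=False):
    """Return the index of the best-scoring choice.

    With normalize=True each score is divided by the character length of
    its choice string (the assumed behaviour of `acc_norm`).
    """
    scores = [
        ll / len(choice) if normalize else ll
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up loglikelihoods, for illustration only.
long_choices = ["Paris", "The city of Lyon", "Nice", "Marseille"]
lls = [-6.0, -8.0, -7.0, -9.5]

print(predict(lls, long_choices))                  # raw scores favour "Paris"
print(predict(lls, long_choices, normalize=True))  # per-character scores favour the longest string

# Single-letter choices all have length 1, so both variants agree.
letters = ["A", "B", "C", "D"]
assert predict(lls, letters) == predict(lls, letters, normalize=True)
```

Under that assumption, tasks whose `doc_to_choice` is `["A", "B", "C", "D"]` should report the same value for `acc` and `acc_norm`.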

The core prompt revolves around three fields.
1. `doc_to_text`: The prompt template that is used as input to the model.
2. `doc_to_choice`: The available choices, used as continuations for the model. This applies when `output_type` is `multiple_choice`; otherwise it can be left as `None`.
3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (which must be one of the entries in `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset as long as the resulting feature meets the conditions described.

These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.
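
To make the three fields concrete, the sketch below reproduces in plain Python what the multiple-choice `doc_to_text` template from this notebook's task configs is meant to produce for one document, and how an integer or string `doc_to_target` maps onto `doc_to_choice`. The sample document and the helper names are made up for illustration, and the real templates are Jinja2 rather than Python functions.

```python
CHOICE_LABELS = ["A", "B", "C", "D"]  # same as doc_to_choice in the configs


def doc_to_text(doc):
    """Plain-Python equivalent of the multiple-choice prompt template."""
    lines = [doc["question"].strip()]
    lines += [
        f"{label}. {choice}"
        for label, choice in zip(CHOICE_LABELS, doc["choices"])
    ]
    lines.append("Answer:")
    return "\n".join(lines)


def target_index(target):
    """Accept either an index into CHOICE_LABELS or one of the labels."""
    if isinstance(target, int):
        return target
    return CHOICE_LABELS.index(target)


# Hypothetical document with the columns the templates expect.
doc = {
    "question": "  Which keyword defines a function in Python? ",
    "choices": ["func", "def", "lambda", "fn"],
    "answer": 1,
}

print(doc_to_text(doc))
print("gold:", CHOICE_LABELS[target_index(doc["answer"])])
```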


# Gemma Instruct-Tuned: RAG

## Code (Flamethrower)

In [9]:
!lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-2b-it \
    --include_path ./ \
    --tasks flamethrower_rag \
    --output output/flamethrower_rag/ \
    --log_samples

2024-05-03:03:58:49,047 INFO     [__main__.py:251] Verbosity set to INFO
2024-05-03:03:58:49,048 INFO     [__main__.py:262] Including path: ./
2024-05-03:03:58:55,458 INFO     [__main__.py:335] Selected Tasks: ['flamethrower_rag']
2024-05-03:03:58:55,460 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-03:03:58:55,460 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'google/gemma-2b-it'}
2024-05-03:03:58:56,239 INFO     [huggingface.py:164] Using device 'cuda'
config.json: 100%|█████████████████████████████| 627/627 [00:00<00:00, 5.48MB/s]
model.safetensors.index.json: 100%|████████| 13.5k/13.5k [00:00<00:00, 35.1MB/s]

## Research

In [10]:
!lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-2b-it \
    --include_path ./ \
    --tasks research_rag \
    --output output/research_rag/ \
    --log_samples

2024-05-03:04:02:06,687 INFO     [__main__.py:251] Verbosity set to INFO
2024-05-03:04:02:06,688 INFO     [__main__.py:262] Including path: ./
2024-05-03:04:02:13,168 INFO     [__main__.py:335] Selected Tasks: ['research_rag']
2024-05-03:04:02:13,169 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-03:04:02:13,170 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'google/gemma-2b-it'}
2024-05-03:04:02:13,783 INFO     [huggingface.py:164] Using device 'cuda'
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00,  1.02s/it]
2024-05-03:04:02:17

## Products

In [11]:
!lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-2b-it \
    --include_path ./ \
    --tasks products_rag \
    --output output/products_rag/ \
    --log_samples

2024-05-03:04:08:03,484 INFO     [__main__.py:251] Verbosity set to INFO
2024-05-03:04:08:03,484 INFO     [__main__.py:262] Including path: ./
2024-05-03:04:08:09,984 INFO     [__main__.py:335] Selected Tasks: ['products_rag']
2024-05-03:04:08:09,985 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-03:04:08:09,985 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'google/gemma-2b-it'}
2024-05-03:04:08:10,287 INFO     [huggingface.py:164] Using device 'cuda'
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00,  1.07it/s]
2024-05-03:04:08:13


# Gemma Fine-Tuned

## Code (Flamethrower)

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=TrevorAsbery/trevorasbery-gemma-2b-flamethrower-hf \
    --include_path ./ \
    --tasks flamethrower \
    --output output/flamethrower/ \
    --log_samples

2024-04-26 06:35:35.466640: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 06:35:35.466688: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 06:35:35.468458: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:06:35:40,707 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:06:35:40,707 INFO     [__main__.py:262] Including path: ./
2024-04-26:06:35:46,977 INFO     [__main__.py:335] Selected Tasks: ['flamethrower']
2024-04-26:06:35:46,978 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1

## Research

In [None]:
# NOTE: this checkpoint does not exist on the Hub.

!lm_eval \
    --model hf \
    --model_args pretrained=TrevorAsbery/trevorasbery-gemma-2b-research-hf \
    --include_path ./ \
    --tasks research \
    --output output/research/ \
    --log_samples

2024-04-26 07:15:29.454561: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 07:15:29.454612: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 07:15:29.456345: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:07:15:34,359 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:07:15:34,359 INFO     [__main__.py:262] Including path: ./
Traceback (most recent call last):
  File "/usr/local/bin/lm_eval", line 8, in <module>

^C


## Products

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=TrevorAsbery/trevorasbery-gemma-2b-products-hf \
    --include_path ./ \
    --tasks products \
    --output output/products/ \
    --log_samples

Traceback (most recent call last):
  File "/usr/local/bin/lm_eval", line 5, in <module>
    from lm_eval.__main__ import cli_evaluate
  File "/usr/local/lib/python3.10/dist-packages/lm_eval/__init__.py", line 1, in <module>
    from .evaluator import evaluate, simple_evaluate
  File "/usr/local/lib/python3.10/dist-packages/lm_eval/evaluator.py", line 11, in <module>
    import lm_eval.api.metrics
  File "/usr/local/lib/python3.10/dist-packages/lm_eval/api/metrics.py", line 7, in <module>
    import evaluate as hf_evaluate
  File "/usr/local/lib/python3.10/dist-packages/evaluate/__init__.py", line 29, in <module>
    from .evaluation_suite import EvaluationSuite
  File "/usr/local/lib/python3.10/dist-packages/evaluate/evaluation_suite/__init__.py", line 7, in <module>
    from datasets import Dataset, DownloadConfig, DownloadMode, load_dataset
  File "/usr/local/lib/python3.10/dist-packages/datasets/__init__.py", line 18, in <module>
    from .arrow_dataset import Dataset
  File "/usr/l

# Gemma Fine-Tuned + RAG

## Code (Flamethrower)

In [12]:
!lm_eval \
    --model hf \
    --model_args pretrained=TrevorAsbery/trevorasbery-gemma-2b-flamethrower-hf \
    --include_path ./ \
    --tasks flamethrower_rag \
    --output output/flamethrower_ragft/ \
    --log_samples

2024-05-03:04:13:32,047 INFO     [__main__.py:251] Verbosity set to INFO
2024-05-03:04:13:32,047 INFO     [__main__.py:262] Including path: ./
2024-05-03:04:13:38,590 INFO     [__main__.py:335] Selected Tasks: ['flamethrower_rag']
2024-05-03:04:13:38,591 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-03:04:13:38,592 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'TrevorAsbery/trevorasbery-gemma-2b-flamethrower-hf'}
2024-05-03:04:13:39,270 INFO     [huggingface.py:164] Using device 'cuda'
config.json: 100%|█████████████████████████████| 696/696 [00:00<00:00, 3.13MB/s]
model.safetensors.index.json: 100%|████████| 13.5k/13.5k [00:00<00:00, 96.4MB/s]

## Research

In [13]:
!lm_eval \
    --model hf \
    --model_args pretrained=scottsus/gemma-2b-papers-trained \
    --include_path ./ \
    --tasks research_rag \
    --output output/research_ragft/ \
    --log_samples

2024-05-03:04:22:35,915 INFO     [__main__.py:251] Verbosity set to INFO
2024-05-03:04:22:35,915 INFO     [__main__.py:262] Including path: ./
2024-05-03:04:22:42,419 INFO     [__main__.py:335] Selected Tasks: ['research_rag']
2024-05-03:04:22:42,421 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-03:04:22:42,421 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'scottsus/gemma-2b-papers-trained'}
2024-05-03:04:22:43,205 INFO     [huggingface.py:164] Using device 'cuda'
config.json: 100%|█████████████████████████████| 696/696 [00:00<00:00, 6.97MB/s]
model.safetensors.index.json: 100%|████████| 13.5k/13.5k [00:00<00:00, 67.5MB/s]

## Products

In [14]:
!lm_eval \
    --model hf \
    --model_args pretrained=TrevorAsbery/trevorasbery-gemma-2b-products-hf \
    --include_path ./ \
    --tasks products_rag \
    --output output/products_ragft/ \
    --log_samples

2024-05-03:04:58:13,367 INFO     [__main__.py:251] Verbosity set to INFO
2024-05-03:04:58:13,368 INFO     [__main__.py:262] Including path: ./
2024-05-03:04:58:19,875 INFO     [__main__.py:335] Selected Tasks: ['products_rag']
2024-05-03:04:58:19,876 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-03:04:58:19,876 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'TrevorAsbery/trevorasbery-gemma-2b-products-hf'}
2024-05-03:04:58:20,666 INFO     [huggingface.py:164] Using device 'cuda'
config.json: 100%|█████████████████████████████| 696/696 [00:00<00:00, 5.02MB/s]
model.safetensors.index.json: 100%|████████| 13.5k/13.5k [00:00<00:00, 64.9MB/s]

# LM-Eval Sanity Check

LM-Eval is giving terrible results on these models, so let's try it on a widely used public dataset (cais/mmlu) and a larger model (Gemma-7B).

In [None]:
YAML_mmlu_geo_string = r'''
group: mmlu
task: demo_mmlu_high_school_geography
dataset_path: cais/mmlu
dataset_name: high_school_geography
description: "The following are multiple choice questions (with answers) about high school geography.\n\n"
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
'''
with open('mmlu_high_school_geography.yaml', 'w') as f:
    f.write(YAML_mmlu_geo_string)

In [None]:
# !accelerate launch --no_python
!lm_eval \
    --model hf \
    --model_args pretrained=google/gemma-7b \
    --include_path ./ \
    --tasks demo_mmlu_high_school_geography \
    --limit 10 \
    --output output/mmlu_high_school_geography/ \
    --log_samples

2024-04-26 19:32:20.553802: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-26 19:32:20.553853: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-26 19:32:20.555638: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-26:19:32:24,692 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-26:19:32:24,692 INFO     [__main__.py:262] Including path: ./
2024-04-26:19:32:30,908 INFO     [__main__.py:335] Selected Tasks: ['demo_mmlu_high_school_geography']
2024-04-26:19:32:30,913 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting tor
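
When judging whether a score is really terrible, it helps to compare it with random guessing. The sketch below is an addition that is not part of LM-Eval: it computes the chance accuracy for four-way multiple choice and uses a binomial standard error (a rough normal approximation) to test whether a measured accuracy is clearly above chance. The accuracy values passed in are placeholders, not results from this notebook.

```python
import math


def chance_accuracy(num_choices):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices


def accuracy_stderr(accuracy, num_docs):
    """Binomial standard error of an accuracy measured on num_docs items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_docs)


def above_chance(accuracy, num_docs, num_choices=4, z=1.96):
    """True if accuracy exceeds chance by more than z standard errors.

    The standard error is evaluated at the chance rate, i.e. under the
    hypothesis that the model is guessing.
    """
    p0 = chance_accuracy(num_choices)
    return accuracy > p0 + z * accuracy_stderr(p0, num_docs)


# Placeholder numbers: the same 50% accuracy on 10 vs. 200 documents.
print(above_chance(0.5, num_docs=10))   # 10 documents, as with --limit 10
print(above_chance(0.5, num_docs=200))
```

Under this approximation, a run with `--limit 10` cannot separate even 50% accuracy from the 25% chance rate, so a larger limit gives a more meaningful sanity check.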

