# "Openness" in LLM benchmarks

### 1. Introduction

The question of "openness" in AI research is wide open, marked with controversies regarding how and what should be made available.

This issue is particularly heated in large language model (LLM) research, given the unparalleled commercial interest but also, and most importantly, its concrete and potential impact on several areas of endeavor, from the arts to the sciences. What does it mean to "open" a large language model application or the research and development process of an LLM? What "levels of openness" do practitioners perceive to be necessary across the data "pipeline," from data collection, preparation, and training to post-training, fine-tuning, and deployment? What is not "open" but is forcibly made open when incorporated into the foundation models of so-called "open" AI, reigniting the debate about "fair use" and "intellectual property" rights?

These questions are just the tip of the proverbial iceberg of AI systems. They concern an emergent problem space where new understandings and practices of "openness" are being portrayed in the context of AI research and development. 

This problem space also bears on the question of participation in the present and future of computing. Given the new prevalence of private over publicly funded research, the future of open AI scholarship is currently being put to the test. And the question of "openness" takes us to the key concerns of research and development in AI: the examination of its sociotechnical conditions, processes, and resources.

In this notebook, we will bracket the sociotechnical complexity of AI systems to focus on the question of "openness" of benchmarking practices of model evaluation. We will do so with a focus on one of the most popular "open AI" systems, Meta Llama v3. We will describe the benchmarking of a large language model (LLM) to discuss the affordances, but also the foreclosures of "openness." We will operate with the distinction between "open AI" (as an industry practice of releasing documentation and "open weights") versus "Open Source AI" (as defined and vetted by the Open Source Initiative), but will take a naive approach to "openness" through a simple exercise: we will test whether we can reproduce a benchmark of an "open" foundation model family without reference (initially) to any particular understanding or vetted definition of "openness" (whether de facto or de jure).

### 2. Data Sources

One of the most controversial aspects in the debate about openness, accessibility, and transparency concerns the question of training data. Where is it obtained? How is it obtained? What is the licensing status of the data obtained? Who owns the copyright of the training data? Who benefits and who does not in the process?

This is an area shrouded in mystery and secrecy. Most of what is currently known was obtained through court documents and leaked information from internal processes of big tech companies with LLM products.

On this topic, open AI systems usually do not disclose as much as is necessary for an informed debate about the ethics of LLM research. For example, the whitepaper that presents the technical aspects of Meta Llama 3, "The Llama 3 Herd of Models," does not detail the initial amount of data, nor how the data was collected and processed. Descriptions of the data are vague. They refer to "a large, multilingual text corpus" of 15.6 trillion tokens. We are informed that the quality is substantially better than that of previous models, but also that "much of the data is obtained from the web." In their terms, "a new mix of publicly available online data." Personally identifiable information (PII) and adult content are claimed to have been removed from the final training set.

To find more information about the corpus, we cannot rely on official documentation. We need to collect information through informal ways. We find, for example, that Llama has, at least, a known combination of sources, such as:
- Common Crawl (billions of web pages)
- Colossal Clean Crawled Corpus (C4, based on Common Crawl)
- Books3 collection (191,000 pirated books)
- Project Gutenberg and Wikipedia
- arXiv, GitHub and Stack Overflow
- Corporate news outlets (e.g. NYT)

Copyrighted material can be found in most of these datasets, but claims of fair use have been made in the past (for Common Crawl) and prevailed, more recently, in a copyright infringement case against Meta (over its training of Llama). Another fundamental aspect is that the conditions of production of the language practices that are "scraped" from the web are all erased in the process of data cleaning and preparation, before the corpus is fed into a "tokenizer" that divides it into elementary tokens.

### 3. Public Repository

TODO

### 4. Open Tools

AI research was an arcane and fairly insular domain of scholarship before it became a profitable industry. As such, it inherited the moral economy of contemporary academic scholarship, with its incorporation of Open Access and Open Science practices. There is a common practice among practitioners, even newcomers, of preparing "preprints" that are circulated before and after important academic conferences. As part of the "open science" ethos, papers often include code and, at times, data (for training, post-training or benchmarking).

Open Source development practices and tools have likewise extended from the tech industry into the AI research process. We find, for example, repositories that pair papers with code. We find widely distributed frameworks and tools, including key ones developed and released under open licenses by big industry players (such as Meta, OpenAI, Google, and IBM). There is a general orientation, we could say, toward sharing tools for AI research and development. Practitioners usually cite this tendency as the reason why the field of AI research is moving so fast, in terms of its results but also in terms of investment and interest.

One of the open tools of interest in the space of benchmarking is LLM Harness (the LM Evaluation Harness, distributed as `lm-evaluation-harness`), developed by EleutherAI (a research collective and non-profit that advocates for "Open AI research"). LLM Harness was created to fulfill an important need in LLM research: addressing the problem of reproducibility, where new research would present improvements on the state of the art but not offer the code and data that independent researchers need to reproduce the results.

How reproducible is a benchmark? (Or, better, how reproducible should it be, given all the challenges of reproducibility?) The answer to this question is, interestingly, far from straightforward.

### 5. Benchmarking

Language model benchmarking is one of the most discussed, yet most problematized, aspects of current LLM research. Many practitioners are dismissive of the industry's reliance on benchmarks, given the reproducibility problems identified in the literature, such as:
* Ambiguous or polysemic preparation and interpretation of prompts
* Differences in GPU hardware
* Benchmark data (or tasks similar to the benchmark's) being incorporated into the model's training data
* Poorly designed benchmarks (with questions that have more than one correct answer)
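
The third problem above, contamination, is often probed with a simple n-gram overlap check between benchmark items and training text. The following is a minimal sketch under stated assumptions: the function names, the window size and the toy strings are all illustrative, and nothing here is taken from Meta's or EleutherAI's tooling. Since Llama's training corpus is not published, a check like this cannot actually be run against it, which is part of the point.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercase word n-grams found in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:  # item shorter than n words: nothing to compare
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Hypothetical strings, for illustration only.
question = "What is the capital of France? A. Paris B. Rome"
leaked = "quiz dump: what is the capital of france? a. paris b. rome answer a"
clean = "the weather in the capital was mild for most of the year"

print(overlap_ratio(question, leaked, n=4))
print(overlap_ratio(question, clean, n=4))
```

A ratio near 1 suggests that the benchmark item appears nearly verbatim in the text, while a ratio near 0 only shows that no exact match was found; paraphrased or translated copies would go undetected.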

The technical evaluation of Llama v3 was conducted in three steps. First, we identified the models to study based on 1) how active their community was online; 2) how performant they were in previous benchmarks; and 3) how available (and accepted) their models and tools were for the research community. We identified the Llama family of models as a good candidate, since it sits at the center of the controversy regarding "open" versus "open source" AI. Second, we prepared the environment and datasets to run the benchmarks. Third and last, we ran the evaluations and documented the results, saving the outputs to see how the models actually responded to automated tasks.

To test whether the benchmark results from Meta could be reproduced, we adopted one of the most popular benchmarks, which is also the one the company reports first when sharing its progress on LLM training. The MMLU (Measuring Massive Multitask Language Understanding) benchmark consists of a series of general-knowledge questions, organized in a dataset that contains responses (annotations) to multiple-choice questions. The MMLU dataset is widely recognized in the literature as a robust tool for evaluating language models, covering 116,000 multiple-choice questions distributed across 57 subjects. However, its robustness is also the object of strong critique. Gema et al. (2025), in their article "Are We Done with MMLU?", describe serious problems with MMLU that compromise its usefulness as an evaluation for comparison across models. They report, for example, that around 57% of the questions they analysed in the subject of virology contain errors. Overall, they identify several problems in the benchmark, including ambiguous questions, questions that refer to objects that are not part of the benchmark (wrongly extracted from online sources), questions that have more than one answer, and biased answers in the legal subject that do not state a jurisdiction (but assume the US as default). For our purposes in testing the "openness" of evaluation tools, however, MMLU remains the most adequate tool, precisely because it is the benchmark Meta reports first.
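
To make the scoring of a multiple-choice benchmark concrete, here is a minimal sketch. It assumes, as is common for tasks of type `multiple_choice`, that the evaluator compares the log-likelihood the model assigns to each answer label and counts the highest-scoring label as the prediction. The helper names and the log-likelihood values are invented for illustration; they are not model outputs.

```python
def pick_choice(loglikelihoods: dict) -> str:
    """Return the answer label that received the highest log-likelihood."""
    return max(loglikelihoods, key=loglikelihoods.get)


def accuracy(predictions: list, gold: list) -> float:
    """Share of questions whose predicted label matches the annotation."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up scores for three questions (higher means more likely).
scores = [
    {"A": -1.2, "B": -0.3, "C": -2.5, "D": -3.1},
    {"A": -0.9, "B": -1.4, "C": -0.8, "D": -2.0},
    {"A": -0.2, "B": -1.9, "C": -2.2, "D": -2.7},
]
gold = ["B", "C", "D"]

predictions = [pick_choice(s) for s in scores]
print(predictions, accuracy(predictions, gold))
```

Under this scheme a wrong annotation in the dataset penalizes a model that answers correctly, which is why the annotation errors reported for MMLU matter for any cross-model comparison.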

#### 5.1. Running the benchmarks

When attempting to use Llama models, we encounter the first restriction: they are "gated models," for which users need to register with Meta to gain access, after declaring that they accept the terms of the "META LLAMA 3 COMMUNITY LICENSE AGREEMENT."

This is the message users will encounter if they have not registered or have not been granted access:

```
Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-8B/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-8B is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Meta-Llama-3-8B to ask for access.
```

Interestingly, denial of access accounts for one of the largest groups of "orphaned tickets" opened on the repository of the model (meta/meta-llama). Many users are turned down without explanation.

Below, we will run the MMLU benchmark on models of 3 sizes (1B, 3B and 8B parameters) and report the results:

In [1]:
#
# Run this for the first time to install the requirements:
#
#!pip install -r https://huggingface.co/flunardelli/llm-metaeval/raw/main/requirements.txt
#
import torch
from huggingface_hub import notebook_login
#
# We need to pass a HF token to continue
#
notebook_login()

In [3]:
#
# Here we declare the parameters of the eval
#
# Raw string: keeps the "\n" escapes literal so that YAML, not Python, turns them into newlines
YAML_mmlu_en_us_string = r"""
task: mmlu_all
dataset_path: cais/mmlu
dataset_name: all
description: "MMLU dataset"
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
num_fewshot: 5
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
"""
with open("mmlu_en_us.yaml", "w") as f:
    f.write(YAML_mmlu_en_us_string)
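
To see what prompt this configuration is meant to produce, the sketch below mirrors the `doc_to_text` template in plain Python and builds a few-shot prompt from the first `k` solved examples, as the `first_n` sampler suggests. It rests on assumptions: the records are invented, `answer` is taken to be an integer index into `choices`, and the separators (a space before the target, a blank line between examples) are our guess at the harness defaults rather than something stated in this notebook.

```python
LABELS = "ABCD"


def render(doc: dict) -> str:
    """Mirror the `doc_to_text` template for a single record."""
    lines = [doc["question"].strip()]
    lines += [f"{label}. {choice}" for label, choice in zip(LABELS, doc["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)


def few_shot_prompt(dev_docs: list, test_doc: dict, k: int = 5) -> str:
    """Prepend the first `k` solved dev examples to the unanswered test question."""
    shots = [render(d) + " " + LABELS[d["answer"]] for d in dev_docs[:k]]
    return "\n\n".join(shots + [render(test_doc)])


# Hypothetical records, not taken from the MMLU dataset.
dev_docs = [{"question": " 2+2? ", "choices": ["3", "4", "5", "6"], "answer": 1}]
test_doc = {"question": "3+3?", "choices": ["5", "6", "7", "8"], "answer": 1}

print(few_shot_prompt(dev_docs, test_doc))
```

Small differences at this level (spacing, label format, which examples are used as shots) are among the prompt-preparation choices that can shift reported accuracy between two runs of the "same" benchmark.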

---

Here is the result for MMLU, accuracy test of the language model with 1B parameters:

In [4]:
!lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct \
  --include_path ./ \
  --tasks mmlu_all \
  --output output/mmlu/ \
  --use_cache cache \
  --device cuda:0 \
  --log_samples

2025-10-07:16:50:07,114 INFO     [__main__.py:279] Verbosity set to INFO
2025-10-07:16:50:07,114 INFO     [__main__.py:303] Including path: ./
2025-10-07:16:50:11,812 INFO     [__main__.py:376] Selected Tasks: ['mmlu_all']
2025-10-07:16:50:11,812 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-10-07:16:50:11,813 INFO     [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Llama-3.2-1B-Instruct'}
2025-10-07:16:50:11,979 INFO     [huggingface.py:131] Using device 'cuda:0'
2025-10-07:16:50:12,888 INFO     [huggingface.py:368] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
`torch_dtype` is deprecated! Use `dtype` instead!
2025-10-07:16:50:13,894 INFO     [evaluator.py:221] Using cache at cache_rank0.db
'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read

---

Here is the result for MMLU, accuracy test of the language model with 3B parameters:

In [5]:
!lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct \
  --include_path ./ \
  --tasks mmlu_all \
  --output output/mmlu/ \
  --use_cache cache2 \
  --device cuda:0 \
  --log_samples

2025-10-07:16:51:41,421 INFO     [__main__.py:279] Verbosity set to INFO
2025-10-07:16:51:41,421 INFO     [__main__.py:303] Including path: ./
2025-10-07:16:51:46,094 INFO     [__main__.py:376] Selected Tasks: ['mmlu_all']
2025-10-07:16:51:46,095 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-10-07:16:51:46,095 INFO     [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Llama-3.2-3B-Instruct'}
2025-10-07:16:51:46,261 INFO     [huggingface.py:131] Using device 'cuda:0'
2025-10-07:16:51:47,137 INFO     [huggingface.py:368] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  2.59it/s]
2025-10-07:16:51:48,509 INFO     [evaluator.py:221] Using cache at cache2_rank0.db
2025-10-07:16

---

Here is the result for MMLU, accuracy test of the language model with 8B parameters:


In [None]:
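# `!export` only affects a throwaway subshell; `%env` sets the variable for the kernel and the commands it launches
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True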
#torch.cuda.empty_cache()
!lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B,revision=62bd457b6fe961a42a631306577e622c83876cb6,dtype=float16 \
  --include_path ./ \
  --tasks mmlu_all \
  --output output/mmlu/ \
  --use_cache cache3 \
  --device cuda:0 \
  --log_samples \
  --batch_size 1
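
The `dtype=float16` and `--batch_size 1` settings in this cell point to memory pressure on a single GPU. As back-of-the-envelope arithmetic, the sketch below estimates the memory needed for the weights alone, ignoring activations and the key-value cache. The parameter count of roughly 8.03 billion is our assumption; it is not stated in this notebook.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS_8B = 8.03e9  # assumed parameter count for the 8B model

for name, nbytes in [("float32", 4), ("float16", 2)]:
    print(f"{name}: {weight_memory_gib(N_PARAMS_8B, nbytes):.1f} GiB")
```

Under these assumptions, half precision brings the weights from about 30 GiB down to about 15 GiB, which is the difference between fitting and not fitting on many single-GPU setups. Access to such hardware is itself one of the conditions of reproducing an "open" benchmark.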

### 6. Conclusion

What was the result of our benchmarking exercise and what does it tell us about the "openness" of benchmarks themselves for "open" AI systems? How did our results compare with those of Meta?

For the MMLU benchmark, all 57 available subgroups or topics were included, using the arithmetic mean of accuracy to aggregate the results. The few-shot strategy (with 5 examples) was applied in the tests to improve overall performance. Among the models we evaluated, Meta-Llama-3 with 8 billion parameters achieved the highest accuracy on the MMLU benchmark (57.54%), surpassing the 3 billion (53.23%) and 1 billion (34.13%) models, which shows the influence of model size on test performance. On the other hand, the gain in accuracy from the 3B to the 8B model (4.31 points) was much smaller than the gain from 1B to 3B (19.10 points). Although model size is relevant, its impact on performance may have limits, contrary to the commonly held assumption that scaling is commensurate with better model performance. Note that the 8B run used the base Meta-Llama-3-8B checkpoint while the smaller runs used Llama 3.2 Instruct checkpoints, so the comparison across sizes is not strictly like-for-like.
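
How accuracies are aggregated matters when comparing a local run against a vendor's figure. The sketch below contrasts a micro average (the mean over all questions, which is what a single task with `aggregation: mean` would appear to compute) with a macro average (the mean of per-subject accuracies). The subjects, counts and outcomes are invented for illustration, and we do not know which convention Meta's reported numbers follow.

```python
from collections import defaultdict


def micro_accuracy(records: list) -> float:
    """Mean correctness over all questions, regardless of subject."""
    return sum(r["correct"] for r in records) / len(records)


def macro_accuracy(records: list) -> float:
    """Mean of the per-subject accuracies (each subject weighs the same)."""
    by_subject = defaultdict(list)
    for r in records:
        by_subject[r["subject"]].append(r["correct"])
    per_subject = [sum(v) / len(v) for v in by_subject.values()]
    return sum(per_subject) / len(per_subject)


# Invented outcomes: a large subject answered poorly, a small one answered well.
records = (
    [{"subject": "law", "correct": c} for c in [1, 0, 0, 0, 0, 0, 0, 0]]
    + [{"subject": "virology", "correct": c} for c in [1, 1]]
)

print(micro_accuracy(records), macro_accuracy(records))
```

Because MMLU subjects differ widely in size, the two conventions can diverge on the same set of model answers, so a reproduction should state which one it reports.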

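| Model                 | LLM Harness (our run), MMLU acc. % | Meta (reported), MMLU acc. % |
| :-------------------- | :--------------------------------: | ---------------------------: |
| Llama 3.2 1B Instruct | 34.13                              | 49.3                         |
| Llama 3.2 3B Instruct | 53.23                              | 63.4                         |
| Llama 3 8B            | 57.54                              | ?                            |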
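
The gaps in the table can be made explicit with a little arithmetic. The sketch below uses only the figures reported above; the 8B model is left out of the comparison with Meta because no reference figure is recorded for it here.

```python
# Accuracy (%) from the table above.
ours = {"1B": 34.13, "3B": 53.23, "8B": 57.54}
meta = {"1B": 49.3, "3B": 63.4}  # no Meta figure recorded for 8B

# Shortfall of our reproduction relative to Meta's reported score, in points.
gaps = {size: round(meta[size] - ours[size], 2) for size in meta}

# Gain from one model size to the next in our own runs, in points.
gains = {
    "1B->3B": round(ours["3B"] - ours["1B"], 2),
    "3B->8B": round(ours["8B"] - ours["3B"], 2),
}

print(gaps)
print(gains)
```

Our runs fall about 15 points (1B) and about 10 points (3B) short of Meta's reported accuracy, a gap large enough that it is more plausibly explained by differences in evaluation setup than by run-to-run noise.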


### REFERENCES

### AUTHORS

* Text / Analysis: LF Murillo and Sara E. Berger;
* Benchmarking: Fernando Lunardelli, CC-BY-SA 10-06-2025.