# BRIGHT EDA

In [1]:
from datasets import load_dataset

ds = load_dataset("xlangai/BRIGHT", "documents")

In [2]:
ds.keys()

dict_keys(['biology', 'earth_science', 'economics', 'psychology', 'robotics', 'stackoverflow', 'sustainable_living', 'pony', 'leetcode', 'aops', 'theoremqa_theorems', 'theoremqa_questions'])

In [3]:
ds["biology"]

Dataset({
    features: ['id', 'content'],
    num_rows: 57359
})

In [4]:
for i in range(5):
    print(f"Row {i}")
    print(ds["biology"][i]["content"])
    print("\n")

Row 0
 pelvises; and proportionally shorter forearms and forelegs.
Based on 45 Neanderthal long bones from 14 men and 7 women, the average height was 164 to 168 cm (5 ft 5 in to 5 ft 6 in) for males and 152 to 156 cm (5 ft 0 in to 5 ft 1 in) for females. For comparison, the average height of 20 males and 10 females Upper Palaeolithic humans is, respectively, 176.2 cm (5 ft 9.4 in) and 162.9


Row 1
 Gorham's Cave, Gibraltar, were discovered, dated to older than 39,000 years ago, which the discoverers have interpreted as Neanderthal abstract art. The scratches could have also been produced by a bear. In 2021, an Irish elk phalanx with five engraved offset chevrons stacked above each other was discovered at the entrance to the Einhornhöhle cave in Germany, dating to about 51,000 years ago.
In 2018, some red-painted dots, disks, lines and hand stencils on the cave walls of the Spanish La Pasie


Row 2
 similar to modern hunter gatherers, and was born in the spring, which is consistent wit

In [5]:
questions = load_dataset("xlangai/BRIGHT", "examples")

In [12]:
questions["biology"][0]

{'query': 'Claim in article about why insects are attracted to light\nIn this article they are addressing the reason insects are attracted to light when they say\nHeat radiation as an attractive component is refuted by the effect of LED lighting, which supplies negligible infrared radiation yet still entraps vast numbers of insects.\nI don\'t see why attraction to LEDs shows they\'re not seeking heat. Could they for example be evolutionarily programmed to associate light with heat? So that even though they don\'t encounter heat near/on the LEDs they still "expect" to?',
 'reasoning': 'The question probes why insects are drawn to low-heat LED lights, challenging the idea that their attraction to light is heat-based. The document helps distinguish between heat attraction and evolved behaviors, shedding light on why insects might be attracted to LEDs despite their minimal heat.',
 'id': '0',
 'excluded_ids': ['N/A'],
 'gold_ids_long': ['insects_attracted_to_light/Proximate_and_ultimate_ca

In [13]:
ds["biology"][0]

{'id': 'neanderthals_vitamin_C_diet/Neanderthal_0_43.txt',
 'content': ' pelvises; and proportionally shorter forearms and forelegs.\nBased on 45 Neanderthal long bones from 14 men and 7 women, the average height was 164 to 168\xa0cm (5\xa0ft 5\xa0in to 5\xa0ft 6\xa0in) for males and 152 to 156\xa0cm (5\xa0ft 0\xa0in to 5\xa0ft 1\xa0in) for females. For comparison, the average height of 20 males and 10 females Upper Palaeolithic humans is, respectively, 176.2\xa0cm (5\xa0ft 9.4\xa0in) and 162.9'}
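
Each example lists its relevant documents in `gold_ids`, and each corpus row carries an `id`, so the two can be joined with a dictionary lookup. The following is a minimal sketch, assuming the `gold_ids` values use the same identifiers as the `id` field of the matching `documents` split (not verified in this notebook). The helper names `build_doc_index` and `gold_documents` are hypothetical, and the toy rows are made-up stand-ins so the block runs without downloading anything.

```python
def build_doc_index(docs):
    """Map document id -> content for an iterable of {'id', 'content'} rows."""
    return {row["id"]: row["content"] for row in docs}


def gold_documents(example, doc_index):
    """Split an example's gold_ids into those found in the index and those missing."""
    found = {}
    missing = []
    for doc_id in example["gold_ids"]:
        if doc_id in doc_index:
            found[doc_id] = doc_index[doc_id]
        else:
            missing.append(doc_id)
    return found, missing


# Toy rows (illustrative only, not taken from BRIGHT)
toy_docs = [
    {"id": "topic_a/doc_0.txt", "content": "first passage"},
    {"id": "topic_a/doc_1.txt", "content": "second passage"},
]
toy_example = {"query": "toy query", "gold_ids": ["topic_a/doc_1.txt", "topic_b/doc_9.txt"]}

toy_index = build_doc_index(toy_docs)
toy_found, toy_missing = gold_documents(toy_example, toy_index)
print(toy_found)
print(toy_missing)
```

With the objects loaded above, the same helpers would be called as `build_doc_index(ds["biology"])` and `gold_documents(questions["biology"][0], index)`. A non-empty `missing` list would indicate that the ID assumption does not hold for that split.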

In [6]:
for key in questions.keys():
    print(f"{key}: {len(questions[key])} examples")

biology: 103 examples
earth_science: 116 examples
economics: 103 examples
psychology: 101 examples
robotics: 101 examples
stackoverflow: 117 examples
sustainable_living: 108 examples
pony: 112 examples
leetcode: 142 examples
aops: 111 examples
theoremqa_theorems: 76 examples
theoremqa_questions: 194 examples


In [7]:
import tiktoken
import numpy as np

# Initialize the tokenizer
enc = tiktoken.get_encoding("cl100k_base")

# Get all queries and count their tokens
token_counts = []
queries = []

for example in questions["biology"]:
    query = example["query"]
    token_count = len(enc.encode(query))
    token_counts.append(token_count)
    queries.append(query)

# Print statistics
print(f"Number of questions: {len(token_counts)}")
print(f"Mean tokens: {np.mean(token_counts):.2f}")
print(f"Median tokens: {np.median(token_counts):.2f}")
print(f"Min tokens: {min(token_counts)}")
print(f"Max tokens: {max(token_counts)}")


# Print examples of shortest and longest questions
print("\nShortest question:")
min_idx = token_counts.index(min(token_counts))
print(f"Tokens: {token_counts[min_idx]}")
print(f"Query: {queries[min_idx]}")

print("\nLongest question:")
max_idx = token_counts.index(max(token_counts))
print(f"Tokens: {token_counts[max_idx]}")
print(f"Query: {queries[max_idx]}")

Number of questions: 103
Mean tokens: 111.91
Median tokens: 93.00
Min tokens: 19
Max tokens: 501

Shortest question:
Tokens: 19
Query: Which organism has the smallest genome length?
Which animal/plant/anything has smallest length genome?

Longest question:
Tokens: 501
Query: Why does my room suddenly look 'reddish'? My eyes seem to adapt to color
To get the context of this question clear, I would like you to walk through some parts of my house.
We'll start with one of my rooms as it appears normally - area Y
As evident, this part of my house has a creamish tinge to it, also the balcony door is open which further gives this room a yellow tint. Nothing special. I'll call it "area Y" (for yellow)*. Let's move on.
area G
Here we arrive in another part of my house which has greenish/blue shades acting as a sunlight blocker. This gives this entire place a greenish/blue tint as shown. (Ref. "area G")
So, now let's visit the area Y again. I am always surprised with what my eyes now see. This. 
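
The statistics above cover only the biology split. A reusable helper makes it easy to repeat the measurement for every split. This is a sketch, not the notebook's own code: `length_stats` and `whitespace_count` are hypothetical names, and the whitespace tokenizer is only a stand-in so the block runs without `tiktoken`.

```python
import statistics


def length_stats(texts, count_tokens):
    """Summarize token counts for a non-empty list of texts.

    count_tokens is any callable mapping a string to an integer length.
    """
    counts = [count_tokens(text) for text in texts]
    return {
        "n": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "min": min(counts),
        "max": max(counts),
    }


def whitespace_count(text):
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


demo_stats = length_stats(["a b c", "a", "a b c d e"], whitespace_count)
print(demo_stats)
```

To reproduce the numbers above for all splits, one could pass `lambda t: len(enc.encode(t))` as `count_tokens` and loop over `questions.keys()`, feeding in each example's `query` field.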

In [10]:
from collections import Counter

# Count how many gold documents each question in the biology split has
gold_ids_counts = [len(q["gold_ids"]) for q in questions["biology"]]
count_distribution = Counter(gold_ids_counts)

# Print the distribution, sorted by number of gold_ids
print("Distribution of gold_ids counts:")
for count in sorted(count_distribution):
    print(f"{count} gold_ids: {count_distribution[count]} examples")

Distribution of gold_ids counts:
1 gold_ids: 10 examples
2 gold_ids: 26 examples
3 gold_ids: 27 examples
4 gold_ids: 17 examples
5 gold_ids: 7 examples
6 gold_ids: 10 examples
7 gold_ids: 1 examples
8 gold_ids: 1 examples
9 gold_ids: 1 examples
10 gold_ids: 1 examples
13 gold_ids: 1 examples
19 gold_ids: 1 examples
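
The distribution is right-skewed, so a mean and median help summarize it. The sketch below recomputes them from the counts printed above; the dictionary is copied from that output, and `summarize_distribution` is a hypothetical helper name.

```python
import statistics

# {number of gold_ids: number of examples}, copied from the printed output
gold_distribution = {1: 10, 2: 26, 3: 27, 4: 17, 5: 7, 6: 10,
                     7: 1, 8: 1, 9: 1, 10: 1, 13: 1, 19: 1}


def summarize_distribution(distribution):
    """Return totals, mean, and median for a {value: frequency} mapping."""
    # Expand the histogram back into one value per example
    expanded = [value for value, freq in distribution.items() for _ in range(freq)]
    return {
        "examples": len(expanded),
        "total_gold_ids": sum(expanded),
        "mean": statistics.mean(expanded),
        "median": statistics.median(expanded),
    }


gold_summary = summarize_distribution(gold_distribution)
print(gold_summary)
```

For the biology split this gives 103 examples and 372 gold IDs in total, a mean of about 3.61 per question, and a median of 3.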
