# ELI5 Dataset - Gemini Answer Generation

This notebook processes the ELI5 (Explain Like I'm 5) dataset and generates LLM answers using Google's Gemini.

## Dataset Information
- **Source**: HuggingFace dataset `rexarski/eli5_category`
- **Period**: January 2017 - June 2021
- **Content**: Human-written questions and answers from the ELI5 subreddit
- **Purpose**: Generate one LLM answer per unique question and merge with human responses

####  **<span style="color:red">IMPORTANT:</span>**

**Workflow & Integration:**
1. This notebook: Load dataset → Test API → Generate Gemini answers → Merge with human answers
2. Model: gemini-2.5-flash
3. Each unique question gets ONE LLM answer replicated across all human responses
4. **To combine with OpenAI:**
   - Run `openai_generate_dataset_clean.ipynb` first (generates human + chatgpt answers)
   - Run this notebook (generates human + gemini answers)
   - Use the merge function in Cell 10 to combine both without duplicating human answers
   - Final dataset: human + chatgpt + gemini sources
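
The combination step in item 4 can be sketched as follows. This is a minimal illustration, not the merge function referenced above: it assumes both notebooks emit long-format tables with the columns `q_id`, `title`, `text`, and `source`, and that the human rows are identical in both. `combine_sources` is a hypothetical helper name, and the demo rows are made up.

```python
import pandas as pd


def combine_sources(df_openai: pd.DataFrame, df_gemini: pd.DataFrame) -> pd.DataFrame:
    """Stack two long-format answer tables, keeping each human answer once.

    Assumes both frames share the columns q_id, title, text, source.
    """
    combined = pd.concat([df_openai, df_gemini], ignore_index=True)
    # Human rows appear in both inputs; an identical (q_id, text, source)
    # triple is treated as the same answer and kept only once.
    combined = combined.drop_duplicates(subset=["q_id", "text", "source"], keep="first")
    return combined.sort_values(["q_id", "source"]).reset_index(drop=True)


# Tiny illustrative frames (made-up rows, not real dataset content)
df_a = pd.DataFrame({
    "q_id": ["q1", "q1"],
    "title": ["Why is the sky blue?"] * 2,
    "text": ["human answer", "chatgpt answer"],
    "source": ["human", "chatgpt"],
})
df_b = pd.DataFrame({
    "q_id": ["q1", "q1"],
    "title": ["Why is the sky blue?"] * 2,
    "text": ["human answer", "gemini answer"],
    "source": ["human", "gemini"],
})
df_all = combine_sources(df_a, df_b)
print(df_all["source"].tolist())
```

Deduplicating on the (`q_id`, `text`, `source`) triple rather than on `q_id` alone keeps every distinct human answer while dropping the copy contributed by the second notebook.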

## 1. Install and Import Required Libraries

In [26]:
# Install required packages (run once)
# !pip install pandas numpy datasets
# !pip install google-generativeai
# !pip install gdown matplotlib seaborn tqdm
# !pip install fastparquet
# !pip install python-dotenv

In [27]:
# Import libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
import time
import json
import os
import glob
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed

# Gemini API
import google.generativeai as genai

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)

## 2. Set Up Gemini API Key

In [28]:
import dotenv
dotenv.load_dotenv('.env')
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

if GEMINI_API_KEY:
    genai.configure(api_key=GEMINI_API_KEY)
    print("Gemini API key configured successfully")
else:
    print("Gemini API key not set")

Gemini API key configured successfully


## 3. Load the ELI5 Dataset

## Configuration - Set Parameters Here

In [29]:
num_questions_to_generate = 4000
delay_between_api_calls = 0.7
test_sample_size = 2
max_workers = 6  # Number of concurrent API calls

gemini_model = "gemini-2.5-flash"

print("Configuration:")
print(f"  Questions to generate: {num_questions_to_generate}")
print(f"  Delay between calls: {delay_between_api_calls}s")
print(f"  Max concurrent workers: {max_workers}")
print(f"  Test sample size: {test_sample_size}")
print(f"  Model: {gemini_model}")

Configuration:
  Questions to generate: 4000
  Delay between calls: 0.7s
  Max concurrent workers: 6
  Test sample size: 2
  Model: gemini-2.5-flash


In [30]:
path = "./human_data/output/eli5_cleaned.csv"

df_human = pd.read_csv(path)
print(f"Dataset loaded with {len(df_human)} records")

unique_questions = df_human[['q_id', 'title']].drop_duplicates().reset_index(drop=True)
print(f"Found {len(unique_questions)} unique questions")
print(f"Average answers per question: {len(df_human) / len(unique_questions):.1f}")

Dataset loaded with 228637 records
Found 93893 unique questions
Average answers per question: 2.4
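
Because `unique_questions` is deduplicated on the (`q_id`, `title`) pair, a `q_id` that appears with two different titles would be counted as two questions. A quick check for that case is sketched below on made-up rows; `find_ambiguous_ids` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def find_ambiguous_ids(questions: pd.DataFrame) -> list:
    """Return the q_ids that map to more than one distinct title."""
    titles_per_id = questions.groupby("q_id")["title"].nunique()
    return sorted(titles_per_id[titles_per_id > 1].index)


# Made-up rows for illustration only
demo = pd.DataFrame({
    "q_id": ["a1", "a1", "b2"],
    "title": [
        "How do magnets work?",
        "How do magnets work? (edited)",
        "Why is ice slippery?",
    ],
})
print(find_ambiguous_ids(demo))
```

An empty result on the real data would confirm that the unique-question count equals the number of distinct `q_id` values.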


In [31]:
# Display first few rows
print("\nFirst few rows of dataset:")
print("=" * 80)
df_human[['q_id', 'title', 'text']].head(3)


First few rows of dataset:


Unnamed: 0,q_id,title,text
0,5lchat,Why there was a 'leap second' added to the end of 2016?,"the rotation of the earth is not a constant. in fact the rotation of the earth is slowing down, which means that a full day is getting slightly longer. without leap seconds our clocks would slowly drift ever so slightly out of sync with the actual day. we could deal with this by redefining how how long 1 second is, making it slightly longer so that one day is still exactly 24*60*60 seconds. but in practice that is really inconvenient for a lot of our technology which relies on very precise timing. its easier to just move us ahead one second every couple of years or so."
1,5lchat,Why there was a 'leap second' added to the end of 2016?,"The Earth's rotation is not regular. It varies a bit, so sometimes we add a second. We do this to ensure that noon is always going to be sometime around mid-day. If we did not add leap seconds, over a very long period of time where the Earth's rotation slowly changed, noon could end up being at dusk. We want to keep 7am in the morning, noon at mid-day, 7pm around evening, etc. Though we have never had one, it's also possible to have a negative leap second. That is, taking away a second from the year. This has never happened, but if the Earth's rotation were to speed up, it could happen. The biggest thing to know about leap seconds is that they can cause computer problems. You might remember the Y2K bug. A leap second can cause similar problems, and they actually have caused problems in the past. The reason for this is that generally we expect a day to have 24 hours, and for time to always move forward. With a leap second this is not true. When writing software, programers try to think of all the possible exceptions that could happen withing their code. For example, the program might expect a word, but instead get a number. A good programmer will check for these exceptions and deal with them. However, a programer can easily forget about leap seconds and not have a fail-safe in their code for when a day have more than 24 hours. When such an exception happens, the program can produce errors or crash. It is an interesting topic, you can read more about it here: URL_0"
2,5lchat,Why there was a 'leap second' added to the end of 2016?,"Because the Earth's rotation is slowing. If you multiply 24 hours by 60 minutes by 60 seconds, you find that there are 86400 seconds per day. The problem is that our definition of the second is based on [an average that is a century old.] In modern times, the average day is about 2 thousandths of a second longer—again, because of Earth's slowing rotation. Those thousandths of a second add up, so every few years we have to slip in an extra second to account for them. Without leap seconds, we'd eventually end up with noon at 7 o'clock, though admittedly, this would take a very long time."


In [32]:
# Load existing generated answers from output folder to avoid regenerating
output_folder = "gemini-output"
existing_q_ids = set()

if os.path.exists(output_folder):
    # Find all CSV files in the output folder
    csv_files = [f for f in os.listdir(output_folder) if f.startswith('eli5_gemini_answers_') and f.endswith('.csv')]
    
    if csv_files:
        print(f"Found {len(csv_files)} existing output file(s):")
        for csv_file in csv_files:
            filepath = os.path.join(output_folder, csv_file)
            try:
                df_existing = pd.read_csv(filepath)
                existing_q_ids.update(df_existing['q_id'].unique())
                print(f"  - {csv_file}: {df_existing['q_id'].nunique()} questions")
            except Exception as e:
                print(f"  - {csv_file}: Error reading file - {str(e)}")
        
        print(f"\nTotal unique questions already generated: {len(existing_q_ids)}")
    else:
        print("No existing output files found")
else:
    print(f"Output folder '{output_folder}' does not exist - starting fresh")

# Filter out already-processed questions
unique_questions_filtered = unique_questions[~unique_questions['q_id'].isin(existing_q_ids)].reset_index(drop=True)

print(f"Total unique questions: {len(unique_questions)}")
print(f"Already generated: {len(existing_q_ids)}")
print(f"Remaining to generate: {len(unique_questions_filtered)}")

if len(unique_questions_filtered) == 0:
    print("\nAll questions have already been processed")


Found 1 existing output file(s):
  - eli5_gemini_answers_20260215_230432.csv: 1000 questions

Total unique questions already generated: 1000
Total unique questions: 93893
Already generated: 1000
Remaining to generate: 92893


## 4. Define LLM Answer Generation Function

In [33]:
def generate_gemini_answer(question, model=None, max_retries=3):
    """
    Generate an ELI5-style answer using Gemini.
    
    Args:
        question: The question to answer
        model: Gemini model to use (default: gemini-2.5-flash)
        max_retries: Number of retry attempts on failure
    
    Returns:
        Generated answer as string, or error message if failed
    """
    if model is None:
        model = gemini_model
        
    if not GEMINI_API_KEY:
        return "ERROR: Gemini API key not configured"
    
    prompt = f"""You are answering questions in the style of the ELI5 (Explain Like I'm 5) subreddit. 
Provide a clear, simple explanation that a 5-year-old could understand, but still be informative.
Keep everything as one block of text.

Question: {question}

Answer:"""
    
    for attempt in range(max_retries):
        try:
            model_instance = genai.GenerativeModel(model)
            response = model_instance.generate_content(
                prompt,
                generation_config=genai.types.GenerationConfig(
                    temperature=0.7,
                    max_output_tokens=1000,
                )
            )
            return response.text.strip()
        
        except Exception as e:
            print(f"  Attempt {attempt + 1} failed: {str(e)[:100]}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                return f"ERROR: {str(e)}"
    
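    # Only reached when max_retries < 1; keep the documented string return type
    return "ERROR: no attempts were made (max_retries must be at least 1)"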
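
The `response.text` shortcut raises when a candidate finishes without emitting any text part, which is the error that shows up in the Section 6 run log. A more defensive accessor might look like the sketch below. It assumes the response exposes `candidates`, each with `content.parts` and a `finish_reason`, as the `google-generativeai` response objects do; `extract_text` is a hypothetical helper, and the demo uses stand-in objects rather than a real API response.

```python
from types import SimpleNamespace


def extract_text(response):
    """Return the joined text parts of the first candidate, or None.

    Unlike the `response.text` shortcut, this does not raise when the
    candidate has no parts; the caller decides how to handle None.
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return None
    content = getattr(candidates[0], "content", None)
    parts = getattr(content, "parts", None) or []
    texts = [p.text for p in parts if getattr(p, "text", None)]
    return "".join(texts).strip() or None


# Stand-in objects for illustration only (not real API responses)
ok = SimpleNamespace(candidates=[SimpleNamespace(
    content=SimpleNamespace(parts=[SimpleNamespace(text="Imagine a clock... ")]),
    finish_reason="STOP",
)])
empty = SimpleNamespace(candidates=[SimpleNamespace(
    content=SimpleNamespace(parts=[]),
    finish_reason="MAX_TOKENS",
)])

print(extract_text(ok))
print(extract_text(empty))
```

Returning `None` for an empty response lets the caller log the candidate's `finish_reason` and decide whether a retry is worthwhile, instead of treating every empty response as a generic exception.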

## 5. Test API with Sample Questions

In [34]:
# Test with sample questions
print(f"Testing API with {test_sample_size} sample questions...")
print("=" * 80)

test_questions = unique_questions.head(test_sample_size)

for idx, row in test_questions.iterrows():
    question = row['title']
    print(f"\n[Test {idx + 1}] {question[:70]}...")
    answer = generate_gemini_answer(question)
    
    if answer.startswith("ERROR"):
        print(f"{answer}")
    else:
        print(f"Generated ({len(answer)} chars): {answer[:100]}...")
    
    time.sleep(1)

print("\n" + "=" * 80)
print("API test complete!")

Testing API with 2 sample questions...

[Test 1] Why there was a 'leap second' added to the end of 2016?...
Generated (844 chars): Imagine you're running a race, and your friend is also running, but your friend is just a tiny, tiny...

[Test 2] How do you claim undiscovered land?...
Generated (849 chars): Imagine you're playing in your backyard and you find a brand new patch of grass that no one has ever...

API test complete!


## 6. Generate LLM Answer for Each Unique Question

In [35]:
# Generate one LLM answer per unique question using parallel processing
llm_answers_map = {}  # Map from q_id to llm_answer

questions_to_process = unique_questions_filtered.head(min(num_questions_to_generate, len(unique_questions_filtered)))
print(f"Generating {len(questions_to_process)} LLM answers with {max_workers} concurrent workers...")
print("=" * 80)

def process_single_question(row):
    """
    Process a single question and return (q_id, answer) tuple.
    """
    q_id = row['q_id']
    question = row['title']
    
    try:
        llm_answer = generate_gemini_answer(question)
        
        if not llm_answer.startswith("ERROR"):
            return (q_id, llm_answer, True)
        else:
            return (q_id, llm_answer[:80], False)
    except Exception as e:
        return (q_id, f"ERROR: {str(e)[:80]}", False)

# Check if there are questions to process
if len(questions_to_process) == 0:
    print("No questions to process. All questions have been generated already.")
else:
    # Use ThreadPoolExecutor for concurrent API calls
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        futures = {executor.submit(process_single_question, row): idx 
                   for idx, row in questions_to_process.iterrows()}
        
        # Process completed futures with progress bar
        with tqdm(total=len(futures), desc="Generating answers") as pbar:
            for future in as_completed(futures):
                q_id, result, success = future.result()
                
                if success:
                    llm_answers_map[q_id] = result
                else:
                    print(f"\nSkipped {q_id}: {result}")
                
                pbar.update(1)
                
                # Brief pause between handling results. All tasks are already
                # submitted, so this does not throttle the worker threads.
                time.sleep(delay_between_api_calls / max_workers)

    print(f"\n{'=' * 80}")
    print(f"Generated {len(llm_answers_map)} / {len(questions_to_process)} LLM answers")

Generating 4000 LLM answers with 6 concurrent workers...


Generating answers:   9%|▉         | 378/4000 [05:10<1:14:50,  1.24s/it]

  Attempt 1 failed: Invalid operation: The `response.text` quick accessor requires the response to contain a valid `Part


Generating answers:  39%|███▉      | 1552/4000 [20:56<34:46,  1.17it/s]  

  Attempt 1 failed: Invalid operation: The `response.text` quick accessor requires the response to contain a valid `Part


Generating answers:  49%|████▉     | 1954/4000 [26:15<24:22,  1.40it/s]

  Attempt 1 failed: Invalid operation: The `response.text` quick accessor requires the response to contain a valid `Part


Generating answers: 100%|██████████| 4000/4000 [53:10<00:00,  1.25it/s]


Generated 4000 / 4000 LLM answers
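
The `time.sleep` in the result loop runs in the main thread after every task has been submitted, so it delays result handling rather than spacing out the requests made by the workers. One way to pace the requests themselves is a shared limiter that each worker calls before its API request. The sketch below is an illustration under assumptions: `RateLimiter`, `min_interval`, and `paced_call` are hypothetical names, not part of the notebook.

```python
import threading
import time


class RateLimiter:
    """Allow at most one call per `min_interval` seconds across all threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay


limiter = RateLimiter(min_interval=0.05)


def paced_call(i):
    limiter.wait()  # hypothetical: call this right before each API request
    return i


start = time.monotonic()
results = [paced_call(i) for i in range(3)]
elapsed = time.monotonic() - start
print(results, f"{elapsed:.2f}s")
```

Sleeping outside the lock lets several workers reserve consecutive slots without blocking each other while they wait.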





## 7. Create LLM Answers Dataset

In [36]:
# Create LLM answers dataset
llm_rows_list = []

# Build a q_id -> title lookup once instead of filtering the DataFrame per answer
title_by_id = unique_questions.drop_duplicates(subset='q_id').set_index('q_id')['title']

for q_id, llm_answer in llm_answers_map.items():
    llm_rows_list.append({
        'q_id': q_id,
        'title': title_by_id.loc[q_id],
        'text': llm_answer,
        'source': 'gemini'
    })

df_llm = pd.DataFrame(llm_rows_list).reset_index(drop=True)


print(f"Total new rows: {len(df_llm)}")
print(f"Unique questions: {df_llm['q_id'].nunique()}")
print(f"Columns: {list(df_llm.columns)}")

if len(df_llm) == 0:
    print("\nNo new answers generated")

Total new rows: 4000
Unique questions: 4000
Columns: ['q_id', 'title', 'text', 'source']


In [37]:
# Show sample rows
print("\nSample LLM answers:")
print("=" * 80)

for idx, row in df_llm.head(3).iterrows():
    print(f"\n[{idx + 1}] Question: {row['title'][:70]}...")
    print(f"    Answer: {row['text'][:100]}...")
    print(f"    Source: {row['source']}")


Sample LLM answers:

[1] Question: Why we have a dominant hand/foot?...
    Answer: Imagine you have two best friends, but you play with one friend a *tiny* bit more often. That friend...
    Source: gemini

[2] Question: Why are the flags on the moon only faded from the sun, and not entirel...
    Answer: Imagine you have a favorite toy that you leave outside in the backyard for a very, very long time. T...
    Source: gemini

[3] Question: How did we discover planets that are so far away?...
    Answer: Imagine you have a super-duper strong flashlight, but it's really, really far away. You can see its ...
    Source: gemini


## 8. Save LLM Answers Dataset

In [38]:
# Create output folder
output_folder = "gemini-output"
os.makedirs(output_folder, exist_ok=True)

# Only save if there are new answers generated
if len(df_llm) > 0:
    # Generate filename with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Save to CSV
    csv_filename = os.path.join(output_folder, f"eli5_gemini_answers_{timestamp}.csv")
    df_llm.to_csv(csv_filename, index=False)
    print(f"CSV saved to: {csv_filename}")

    # Save as Parquet
    parquet_filename = os.path.join(output_folder, f"eli5_gemini_answers_{timestamp}.parquet")
    df_llm.to_parquet(parquet_filename, index=False)
    print(f"Parquet saved to: {parquet_filename}")

    # Save metadata
    metadata = {
        'timestamp': timestamp,
        'total_rows': int(len(df_llm)),
        'unique_questions': int(df_llm['q_id'].nunique()),
        'gemini_answers': int(len(df_llm)),
        'llm_model': gemini_model,
        'max_workers': max_workers,
        'columns': list(df_llm.columns)
    }

    metadata_filename = os.path.join(output_folder, f"metadata_{timestamp}.json")
    with open(metadata_filename, 'w') as f:
        json.dump(metadata, f, indent=2)
    print(f"Metadata saved to: {metadata_filename}")
    print(f"Generated {len(df_llm)} new LLM answers")
    print(f"Output folder: {output_folder}")
else:
    print("No new answers to save (all questions were already processed, or every request failed).")

CSV saved to: gemini-output\eli5_gemini_answers_20260216_021315.csv
Parquet saved to: gemini-output\eli5_gemini_answers_20260216_021315.parquet
Metadata saved to: gemini-output\metadata_20260216_021315.json
Generated 4000 new LLM answers
Output folder: gemini-output
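
A quick consistency check after saving can catch truncated or mismatched writes. The sketch below assumes the metadata keys written above (`total_rows`, `columns`) and runs on made-up rows in a temporary directory; `verify_output` is a hypothetical helper.

```python
import csv
import json
import os
import tempfile


def verify_output(csv_path, metadata_path):
    """Check that the CSV header and row count match the metadata file."""
    with open(metadata_path, encoding="utf-8") as f:
        meta = json.load(f)
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return header == meta["columns"] and n_rows == meta["total_rows"]


# Demo with made-up content written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "answers.csv")
    meta_path = os.path.join(tmp, "metadata.json")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["q_id", "title", "text", "source"])
        writer.writerow(["q1", "Why is the sky blue?", "line one\nline two", "gemini"])
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump({"total_rows": 1, "columns": ["q_id", "title", "text", "source"]}, f)
    is_consistent = verify_output(csv_path, meta_path)

print(is_consistent)
```

Counting rows with the `csv` module rather than counting lines matters here, because generated answers may contain embedded newlines inside quoted fields.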


## 9. Merge All Generated Answers

In [39]:
# Merge all CSVs under gemini folder
output_folder = "gemini-output"

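# Find the per-run CSV files, skipping the merged file written by an earlier run of this cell
csv_pattern = os.path.join(output_folder, "eli5_gemini_answers_*.csv")
csv_files = sorted(f for f in glob.glob(csv_pattern) if not f.endswith("_merged.csv"))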

if not csv_files:
    print(f"No CSV files found in {output_folder}")
else:
    for f in csv_files:
        print(f"  - {os.path.basename(f)}")
    
    # Load and concatenate all CSV files
    dfs = []
    for csv_file in csv_files:
        try:
            df = pd.read_csv(csv_file)
            dfs.append(df)
        except Exception as e:
            print(f"Error loading {csv_file}: {str(e)}")
    
    if dfs:
        # Merge all dataframes
        df_merged = pd.concat(dfs, ignore_index=True)
        
        # Remove duplicates based on q_id (keep first occurrence)
        df_merged_unique = df_merged.drop_duplicates(subset=['q_id'], keep='first')
        
        # Save merged file
        merged_filename = os.path.join(output_folder, "eli5_gemini_answers_merged.csv")
        df_merged_unique.to_csv(merged_filename, index=False)
        
        # Save as parquet
        merged_parquet = os.path.join(output_folder, "eli5_gemini_answers_merged.parquet")
        df_merged_unique.to_parquet(merged_parquet, index=False)
    else:
        print("No data loaded to merge")

  - eli5_gemini_answers_20260215_230432.csv
  - eli5_gemini_answers_20260216_021315.csv
