# Wikipedia Infobox Generation and Model Fine-tuning

This notebook demonstrates how to:
1. Load Wikipedia stub articles about women in religion
2. Use GPT-4o-mini to generate appropriate infoboxes for these stubs
3. Fine-tune a T5 model to learn this stub → infobox transformation
4. Publish the resulting dataset and model to Hugging Face Hub

The goal is to create a model that can automatically generate Wikipedia infoboxes from article content.

## Step 1: Install Required Libraries

We need:
- `datadreamer.dev`: Framework for LLM-powered data generation and model training
- `datasets`: Hugging Face library for handling datasets
- `openai`: OpenAI's Python client, used to call GPT models that generate the synthetic infoboxes

In [None]:
!pip3 install datadreamer.dev datasets==3.2.0 openai
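
Because `datasets` is pinned to a specific version, it can help to confirm what actually got installed before running the rest of the notebook. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution names are taken from the install command above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the version of each package this notebook installs.
for name in ("datadreamer.dev", "datasets", "openai"):
    print(name, installed_version(name))
```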

## Step 2: Import Core Libraries

Setting up the main components we'll use for data processing and LLM interactions.

In [None]:
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import ProcessWithPrompt, HFHubDataSource

## Step 3: Configure API Keys

Using Colab's secure userdata to access API keys for:
- **OpenAI**: To generate infoboxes using GPT-4o-mini
- **Hugging Face**: To download datasets and upload results
- **Weights & Biases**: For experiment tracking during training

In [None]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
os.environ['WANDB_API_KEY'] = userdata.get('WANDB_API_KEY')
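
Outside Colab, where `google.colab.userdata` is not available, the same three variables can be set in the shell instead. The sketch below reports which of them are still unset or empty; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced for illustration, not part of Colab or DataDreamer.

```python
import os

# Variable names used by this notebook (see the cell above).
REQUIRED_KEYS = ("OPENAI_API_KEY", "HF_TOKEN", "WANDB_API_KEY")


def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.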

## Step 4: Initialize DataDreamer Session

DataDreamer manages the entire pipeline and saves intermediate results to `./output/` for reproducibility.

In [None]:
dd = DataDreamer('./output/')
dd.start()

## Step 5: Load Source Dataset

Loading a curated dataset of Wikipedia stub articles about women in religion. These are short, incomplete articles that would benefit from having infoboxes added.

In [None]:
wiki_stubs_dataset = HFHubDataSource(
    "Get Women in Religion Stubs",
    "andersoncliffb/women-in-religion-stubs",
    split="train",
).select_columns(["Wiki_Content"])

## Step 6: Configure the LLM

Setting up GPT-4o-mini as our generation model. This is cost-effective for generating structured content like infoboxes.

In [None]:
gpt4 = OpenAI(
    model_name="gpt-4o-mini",
)

## Step 7: Generate Infoboxes from Stubs

This is the core synthetic data generation step. For each Wikipedia stub:
1. Send it to GPT-4o-mini with instructions to create an appropriate infobox
2. Store both the original stub and generated infobox as training pairs

This creates the input-output pairs we'll use to train our T5 model.

In [None]:
stubs_and_infoboxes = ProcessWithPrompt(
    "Generate Infoboxes from Stubs",
    inputs={"inputs": wiki_stubs_dataset.output["Wiki_Content"]},
    args={
        "llm": gpt4,
        "instruction": (
            # Adjacent string literals are concatenated, so the first one ends with a space.
            "Extract the infobox from the Wikipedia stub. If there is no infobox, generate an appropriate Wikipedia infobox for the stub. "
            "Return only the infobox, nothing else."
        ),
    },
    outputs={"inputs": "stub", "generations": "infobox"},
).select_columns(["stub", "infobox"])
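
Since the prompt asks for the infobox and nothing else, it may be worth spot-checking that the generations really are bare infobox templates before training on them. The following is a rough heuristic sketch, not part of DataDreamer: `looks_like_infobox` is a hypothetical helper that assumes infoboxes use the `{{Infobox ...}}` wikitext template form, and the sample text is made up for illustration.

```python
def looks_like_infobox(text):
    """Heuristic: text starts with '{{Infobox' and that template closes at the very end."""
    stripped = text.strip()
    if not stripped.lower().startswith("{{infobox"):
        return False
    depth = 0
    i = 0
    while i < len(stripped) - 1:
        pair = stripped[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            if depth == 0:
                # The outermost template must be the last thing in the text.
                return i + 2 == len(stripped)
            i += 2
        else:
            i += 1
    # Ran out of text before the outermost template closed.
    return False


# Made-up sample with a nested template inside the infobox.
sample = "{{Infobox person\n| name = Jane Doe\n| birth_date = {{birth date|1900|1|1}}\n}}"
print(looks_like_infobox(sample))
```

A check like this only catches structural problems such as extra commentary or a truncated template; it says nothing about whether the field values are factually correct.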


## Step 8: Publish Generated Dataset

Uploading our synthetic dataset to Hugging Face Hub with:
- 90% for training
- 10% for validation

This makes the dataset publicly available and creates the train/validation splits we need.

In [None]:
stubs_and_infoboxes.publish_to_hf_hub(
    "andersoncliffb/women-religion-stubs-with-infoboxes",
    train_size=0.90,
    validation_size=0.10,
)

## Step 9: Create Local Data Splits

Creating local train/validation splits from our generated data for the fine-tuning process.

In [None]:
splits = stubs_and_infoboxes.splits(train_size=0.90, validation_size=0.10)
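
Conceptually, a 90/10 split shuffles the rows and cuts the list once. The following pure-Python sketch illustrates that idea only; it is not DataDreamer's implementation, and `train_validation_split` and its `seed` parameter are hypothetical.

```python
import random


def train_validation_split(rows, train_size=0.90, seed=0):
    """Shuffle `rows` deterministically and cut them into train and validation lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_size)
    return rows[:cut], rows[cut:]


train, validation = train_validation_split(range(20))
print(len(train), len(validation))  # 18 2
```

Fixing the seed keeps the split reproducible, so validation rows never leak into training between runs.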

## Step 10: Import Training Libraries

Setting up for model fine-tuning:
- `TrainHFFineTune`: DataDreamer's wrapper for Hugging Face model training
- `LoraConfig`: Parameter-efficient fine-tuning using Low-Rank Adaptation

In [None]:
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

## Step 11: Configure the Training Setup

Creating a trainer that will:
- Use Google's T5-v1.1-base as the foundation model
- Apply LoRA for efficient fine-tuning (the base model weights stay frozen and only small low-rank adapter matrices are trained)
- Learn to transform Wikipedia stubs into appropriate infoboxes

In [None]:
trainer = TrainHFFineTune(
    "Train a Wiki Article => Infoboxes Model",
    model_name="google/t5-v1_1-base",
    peft_config=LoraConfig(),  # default LoRA settings
)
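
To get a feel for how few parameters LoRA trains: for a frozen weight matrix of shape d × k, LoRA learns two low-rank factors of shapes d × r and r × k, which adds r(d + k) trainable parameters. The numbers below are illustrative assumptions rather than values read from this notebook: a 768 × 768 projection matrix and rank r = 8.

```python
def lora_trainable_params(d, k, r):
    """Parameters added by LoRA factors of shapes (d, r) and (r, k)."""
    return r * (d + k)


# Assumed sizes for illustration: one 768 x 768 projection, LoRA rank 8.
d = k = 768
r = 8
full = d * k
added = lora_trainable_params(d, k, r)
print(full, added, f"{added / full:.2%}")  # 589824 12288 2.08%
```

Under these assumptions, each adapted matrix trains roughly 2% as many parameters as the frozen matrix it sits beside.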

## Step 12: Train the Model

Starting the fine-tuning process with:
- **Input**: Wikipedia stub articles
- **Output**: Generated infoboxes
- **30 epochs**: 30 full passes through the training data
- **Batch size 8**: Number of examples processed together in each training step

Training takes a while and needs a GPU runtime; this notebook uses an L4 GPU.

In [None]:
trainer.train(
    train_input=splits["train"].output["stub"],
    train_output=splits["train"].output["infobox"],
    validation_input=splits["validation"].output["stub"],
    validation_output=splits["validation"].output["infobox"],
    epochs=30,
    batch_size=8,
)
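
For a rough sense of training length, the number of optimizer steps follows from the epoch count and batch size. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; the training-set size of 900 is a made-up figure, not the size of this dataset.

```python
import math


def optimizer_steps(n_train, batch_size, epochs):
    """Steps per epoch (rounded up for the last partial batch) times the number of epochs."""
    return math.ceil(n_train / batch_size) * epochs


# Hypothetical training-set size of 900 examples.
print(optimizer_steps(900, 8, 30))  # 3390
```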

## Step 13: Publish the Fine-tuned Model

Uploading the trained model to Hugging Face Hub so it can be:
- Downloaded and used by others
- Integrated into applications
- Further fine-tuned on different data

In [None]:
trainer.publish_to_hf_hub("andersoncliffb/stubs-and-infoboxes")


## Step 14: Clean Up

Properly closing the DataDreamer session and saving all pipeline metadata.

In [None]:
dd.stop()

## Summary

This notebook demonstrates a complete pipeline for:
1. **Synthetic data generation**: Using GPT-4o-mini to create training examples
2. **Model fine-tuning**: Training T5 to learn the stub→infobox transformation
3. **Knowledge sharing**: Publishing both dataset and model to Hugging Face Hub

The resulting model can generate Wikipedia infoboxes from article stubs, potentially helping editors improve Wikipedia coverage of underrepresented topics.