# Chapter 3: Text Generation

## Overview of Text Generation Models

Text generation uses a model to produce human-like text from an input prompt. Modern models such as GPT (Generative Pre-trained Transformer) generate text one token at a time, conditioning each new token on the prompt and on the text produced so far, which is what keeps the output contextually relevant.
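
One way to picture generation is as a loop that repeatedly picks the next word given what came before. The sketch below is a toy stand-in, not how GPT works internally: it assumes a bigram table built from a single sentence, and the helper names (`build_bigram_table`, `generate`) are made up for illustration.

```python
import random
from collections import defaultdict

def build_bigram_table(corpus):
    """Map each word to the list of words observed right after it."""
    table = defaultdict(list)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table

def generate(table, prompt, max_new_tokens=8, seed=0):
    """Extend the prompt one word at a time, conditioning on the last word only."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(max_new_tokens):
        candidates = table.get(words[-1])
        if not candidates:  # no continuation was ever observed: stop early
            break
        words.append(rng.choice(candidates))
    return " ".join(words)

# Toy corpus (an assumption for illustration only)
corpus = "the model reads the prompt and the model writes the next word"
table = build_bigram_table(corpus)
print(generate(table, "the model"))
```

A real language model replaces the lookup table with a neural network that scores every token in its vocabulary, but the outer loop has the same shape.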

---

### Fine-Tuning Open-Source LLMs
Open-source large language models (LLMs), like GPT-Neo, can be fine-tuned for domain-specific applications such as:
- Writing assistance (e.g., emails, blogs).
- Code generation.
- Personalized chatbots.

#### Steps for Fine-Tuning:
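1. Prepare a domain-specific dataset (e.g., medical reports, product reviews) and hold out part of it for validation.
2. Load a pre-trained model and its tokenizer as the starting point.
3. Continue training the model on the dataset with a library such as Hugging Face Transformers.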
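
The first step above can be sketched with the standard library alone. This is a minimal example under assumptions: the sample reviews are invented, the 80/20 split is an arbitrary choice, and the one-record-per-line `{"text": ...}` layout is picked because it matches the `text` column used in the training code later in this chapter.

```python
import json
import random

def prepare_dataset(raw_texts, validation_fraction=0.2, seed=42):
    """Clean, de-duplicate, shuffle, and split raw texts into (train, validation)."""
    cleaned = []
    seen = set()
    for text in raw_texts:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    random.Random(seed).shuffle(cleaned)
    # Keep at least one validation example when there is more than one text.
    n_val = max(1, int(len(cleaned) * validation_fraction)) if len(cleaned) > 1 else 0
    return cleaned[n_val:], cleaned[:n_val]

def to_jsonl(texts):
    """Serialize one {"text": ...} record per line."""
    return "\n".join(json.dumps({"text": t}) for t in texts)

# Hypothetical product reviews, including a near-duplicate and an empty entry
raw = [
    "Great battery life,  lasts two days.",
    "Great battery life, lasts two days.",
    "",
    "Screen cracked after a week.",
    "Fast shipping and solid build quality.",
    "Stopped charging after a month.",
    "Comfortable to hold and easy to set up.",
]
train, validation = prepare_dataset(raw)
print(len(train), len(validation))
print(to_jsonl(validation))
```

Writing the two lists to `.jsonl` files gives a dataset that can be loaded and tokenized in the same way as the training pipeline shown later.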

---

## Concept Sketch
Below is a sketch illustrating the text generation process:

![Text Generation Flow](https://upload.wikimedia.org/wikipedia/commons/1/10/Transformer.svg)

The Transformer architecture forms the backbone of modern text generation models.
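
The core operation inside a Transformer is attention. The following is a simplified sketch of causal scaled dot-product attention in plain Python, assuming the inputs are used directly as queries, keys, and values. Real models add learned projection matrices, multiple heads, and tensor libraries; the function names here are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions <= i."""
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Score the query against keys at positions 0..i (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Three toy 2-dimensional token vectors (made up for illustration)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = causal_attention(x, x, x)
print(result[0])  # the first position can only attend to itself: [1.0, 0.0]
```

The causal mask is what makes the architecture suitable for generation: each position is computed without looking at later tokens.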

---

## Code Examples

### Example 1: Text Generation Using GPT-Neo
Let's use Hugging Face's Transformers library to generate text with GPT-Neo.

```python
from transformers import pipeline

# Load GPT-Neo for text generation
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

# Generate text (max_length counts the prompt tokens plus the newly generated ones)
prompt = "The future of AI is"
output = generator(prompt, max_length=50, num_return_sequences=1)

# Display output
print("Generated Text:")
print(output[0]['generated_text'])
```
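
Behind a call like the one above, the model assigns a score (logit) to every candidate next token, and a decoding strategy picks one. The sketch below illustrates two common controls, temperature and top-k, on a hand-written score table. The scores and the `sample_next_token` helper are assumptions for illustration, not part of the Transformers API.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Pick a token from a {token: logit} dict using temperature and optional top-k."""
    rng = rng or random.Random()
    # Keep only the k highest-scoring tokens when top_k is set.
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    # Divide logits by the temperature (must be > 0), then apply a stable softmax.
    scaled = [score / temperature for _, score in items]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in items]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the token after "The future of AI is"
logits = {" bright": 2.0, " uncertain": 1.5, " here": 1.0, " banana": -3.0}
print(sample_next_token(logits, temperature=0.7, top_k=3, rng=random.Random(0)))
```

Lower temperatures concentrate probability on the top-scoring tokens, while top-k removes unlikely tokens from consideration entirely.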

---

### Example 2: Training and Inference Pipelines

#### Training Pipeline
```python
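from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer
import datasets

# Load dataset and tokenizer
dataset = datasets.load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
# GPT-Neo ships without a padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# Drop blank lines, which would otherwise become padding-only examples
dataset = dataset.filter(lambda example: example["text"].strip() != "")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

# Tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Load model
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

# Build causal-LM batches: mlm=False copies input_ids into labels and ignores padding in the loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # newer transformers releases name this eval_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training, then write the final model to output_dir for later inference
trainer.train()
trainer.save_model()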
```

#### Inference Pipeline
```python
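from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model
model = AutoModelForCausalLM.from_pretrained("./results")
# Fine-tuning did not change the vocabulary, so the base tokenizer still applies
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

# Generate text using the fine-tuned model
inputs = tokenizer("Fine-tuned model example:", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))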
```

---

## Quiz

1. What is the primary role of tokenization in text generation?
   - A. To generate new tokens.
   - B. To split text into tokens (small units mapped to integer IDs) that the model can process.
   - C. To train models directly on raw text.

2. Which library is primarily used for text generation tasks with GPT-Neo?
   - A. TensorFlow
   - B. Hugging Face Transformers
   - C. NumPy

3. What is the function of the training pipeline?
   - A. Generating text based on prompts.
   - B. Fine-tuning a model on a specific dataset.

---

### Answers:
1. **B**: To split text into tokens (small units mapped to integer IDs) that the model can process.
2. **B**: Hugging Face Transformers.
3. **B**: Fine-tuning a model on a specific dataset.
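
To make answer 1 concrete, here is a deliberately simple tokenizer that splits on whitespace and maps each word to an integer ID. It is only a sketch: GPT-Neo's real tokenizer uses byte-pair encoding over subword units, and the function names and `<unk>` placeholder below are assumptions for illustration.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word; ID 0 is reserved."""
    vocab = {"<unk>": 0}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Turn text into a list of IDs, using <unk> for words not in the vocabulary."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids, vocab):
    """Turn a list of IDs back into text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab(["the future of AI is open"])
ids = encode("The future is quantum", vocab)
print(ids)                 # [1, 2, 5, 0]
print(decode(ids, vocab))  # the future is <unk>
```

The model itself only ever sees the integer IDs, which is why the same tokenizer must be used for training and for inference.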

---

## Exercise

### Task:
Fine-tune a small GPT model on a custom dataset of your choice (e.g., a collection of poems or product reviews). Measure its performance by generating text based on test prompts.
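
Reading generated samples is the most direct check, but a simple number can help compare runs. The sketch below computes distinct-n, the share of unique n-grams in a text, as a rough repetition measure. This metric is a suggestion rather than a required part of the exercise, and the sample generations are invented; perplexity on held-out text is another common option.

```python
def distinct_n(text, n=2):
    """Fraction of n-grams in the text that are unique (1.0 means no repetition)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

def summarize_generations(generations):
    """Report average length in words and average distinct-2 for each prompt."""
    report = {}
    for prompt, outputs in generations.items():
        report[prompt] = {
            "avg_words": sum(len(o.split()) for o in outputs) / len(outputs),
            "avg_distinct_2": sum(distinct_n(o, 2) for o in outputs) / len(outputs),
        }
    return report

# Hypothetical outputs; in practice these come from the fine-tuned model
generations = {
    "The rose": [
        "The rose opens slowly in the morning light",
        "The rose the rose the rose the rose",
    ],
}
print(summarize_generations(generations))
```

A score near 1.0 means little repetition, while a low score flags outputs that loop over the same phrases, a common failure after fine-tuning on very small datasets.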

---

### Example Solution:
```python
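# Fine-tune a small GPT model on custom data (sample implementation)
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Prepare custom dataset (two toy sentences; a real run needs far more text)
data = {"text": ["AI is transforming industries.", "Generative AI creates text and images."]}
custom_dataset = Dataset.from_dict(data)

# Tokenize and fine-tune as in the training pipeline (the output_dir name is illustrative)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo defines no padding token
tokenized = custom_dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=64),
                               batched=True, remove_columns=["text"])
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
args = TrainingArguments(output_dir="./custom-results", num_train_epochs=1, per_device_train_batch_size=2)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()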
```

---
