A generative poetry system that learns from your uploaded PDF.

!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.6.0/en_core_web_md-3.6.0-py3-none-any.whl
!pip install transformers datasets torch PyMuPDF accelerate

from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset
import fitz
import tempfile

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def preprocess_text(text):
    # Basic text cleaning (adjust for your needs)
    text = text.replace("\n", " ")
    return text

def load_pdf_data(uploaded):
    text = ""
    for fn in uploaded.keys():
        with fitz.open(fn) as doc:
            for page in doc:
                text += page.get_text()
    return text

def build_text_dataset(text, tokenizer, block_size=128):
    # Create a temporary file to store the text
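    # Write as UTF-8, since poetry PDFs often contain non-ASCII characters
    with tempfile.NamedTemporaryFile(mode='w', delete=False, encoding='utf-8', suffix='.txt') as tmp_file: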
        tmp_file.write(text)
        tmp_file_path = tmp_file.name

    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=tmp_file_path,  # Use the temporary file path
        block_size=block_size,
    )
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )

    return dataset, data_collator

def train_model(dataset, data_collator, model):
    training_args = TrainingArguments(
        output_dir="./results",
        overwrite_output_dir=True,
        num_train_epochs=100,  # Adjust as needed
        per_device_train_batch_size=4,  # Adjust as needed
        save_steps=10_000,
        save_total_limit=2,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
    )
    trainer.train()

import torch

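def generate_poem(model, tokenizer, prompt="", max_length=100):
    # Ensure model and input are on the same device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # Move model to the device
    model.eval()  # Training leaves dropout on; switch it off for generation

    # An empty prompt encodes to zero tokens, so fall back to GPT-2's BOS token
    input_ids = tokenizer.encode(prompt or tokenizer.bos_token, return_tensors="pt").to(device)
    output = model.generate(
        input_ids,
        max_length=max_length,  # Counts prompt tokens plus generated tokens
        num_return_sequences=1,
        do_sample=True,  # Sample instead of greedy decoding, which tends to repeat itself
        top_k=50,  # Adjust as needed
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )
    poem = tokenizer.decode(output[0], skip_special_tokens=True)
    return poem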

# Upload PDF
from google.colab import files
uploaded = files.upload()

# Load and preprocess data
text = load_pdf_data(uploaded)
dataset, data_collator = build_text_dataset(text, tokenizer)

# Train the model
train_model(dataset, data_collator, model)

# Generate a poem based on the uploaded PDF
generated_poem = generate_poem(model, tokenizer, prompt="The moon shines dimly, weeping in the sky for the fallen, shards of light piercing dark dimensions")
print(generated_poem)
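
To make `block_size` in `build_text_dataset` more concrete, here is a minimal sketch of how a long token sequence gets cut into fixed-size training blocks. `chunk_token_ids` is a hypothetical helper written for illustration, not part of `transformers`. It assumes, as `TextDataset` appears to do, that a short tail that does not fill a whole block is dropped.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split a flat list of token ids into full blocks, dropping any short tail.

    Hypothetical helper for illustration only.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Last index at which a full block can still start
    last_start = len(token_ids) - block_size
    return [token_ids[i:i + block_size] for i in range(0, last_start + 1, block_size)]


# 300 fake token ids give two full blocks; the remaining 44 ids are dropped
blocks = chunk_token_ids(list(range(300)), block_size=128)
print(len(blocks), [len(b) for b in blocks])  # prints: 2 [128, 128]
```

One practical consequence under this assumption: if your PDF yields fewer than `block_size` tokens, no blocks are produced and there is nothing to train on, so a short collection may need a smaller `block_size`.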

**Explanation:**

1. **Libraries:**
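    - `transformers`: Provides pre-trained language models (like GPT-2) and utilities for fine-tuning.
    - `datasets`: Facilitates loading and processing text data for training. It is imported here, although this script builds its dataset with `TextDataset` from `transformers`.
    - `torch`: The deep learning framework the GPT-2 model runs on.
    - `PyMuPDF` (imported as `fitz`): Reads the uploaded PDF and extracts its text.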

2. **Model Setup:**
    - Loads GPT-2 tokenizer and model (pre-trained on a large corpus of text).

3. **Data Preparation:**
    - `load_pdf_data`: Extracts text from the uploaded PDF.
    - `build_text_dataset`: Creates a dataset from the text for training.

4. **Model Training:**
    - Fine-tunes the GPT-2 model on the dataset derived from your PDF. This adapts the model toward the vocabulary, style, and patterns of your poetry; how closely it matches depends on how much text the PDF contains.

5. **Poem Generation:**
    - `generate_poem`: Uses the fine-tuned model to generate a new poem.
        - You can provide an optional `prompt` to guide the poem's theme or starting point.
        - `max_length` controls the maximum length of the output in tokens, counting the prompt as well as the generated text.
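
For intuition about the generation step, here is a toy sketch of choosing the next token with temperature and top-k sampling. It is a simplified illustration, not the `transformers` implementation: `sample_next_token` and the example scores are made up for this sketch.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick one token from a {token: score} dict (hypothetical helper).

    Lower temperature sharpens the distribution; top_k keeps only the
    k highest-scoring tokens before sampling.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank candidates from highest to lowest score
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    scaled = [score / temperature for _, score in ranked]
    # Softmax weights, shifted by the max for numerical stability
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up scores for three candidate next words
scores = {"moon": 2.0, "sky": 1.0, "the": 0.5}
print(sample_next_token(scores, top_k=1))  # top_k=1 always picks the best-scoring token: moon
print(sample_next_token(scores, temperature=0.7, top_k=2))  # "moon" or "sky", never "the"
```

With `top_k=1` this reduces to greedy decoding, which is deterministic and tends to repeat itself on longer outputs; widening `top_k` or raising `temperature` trades predictability for variety.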

**How to Use:**

1. **Upload your PDF** of poems in Colab.
2. **Run the code.** It will fine-tune GPT-2 on your poems.
3. **A poem in the style of your PDF will be printed.**  Feel free to experiment with different prompts!
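
If you upload several files at once, it may help to pick out only the PDFs before extracting text. A minimal sketch, assuming `files.upload()` returns a dict that maps each filename to its bytes; `pdf_names` is a hypothetical helper, not part of Colab or PyMuPDF.

```python
def pdf_names(uploaded):
    """Return the uploaded filenames that look like PDFs, in sorted order.

    Hypothetical helper; it only checks the file extension.
    """
    return sorted(name for name in uploaded if name.lower().endswith(".pdf"))


# Stand-in for the dict returned by files.upload()
example_upload = {"poems.pdf": b"", "notes.txt": b"", "more_poems.PDF": b""}
print(pdf_names(example_upload))  # prints: ['more_poems.PDF', 'poems.pdf']
```

Checking the result before training also gives you a clear place to stop early when no PDF was uploaded.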

**Key Improvements:**

*   **Uses GPT-2:** A pre-trained language model that can be adapted to generate creative text.
*   **Fine-tuning:** Adapts the model to your specific writing style.
*   **Flexibility:** Allows you to provide a prompt to guide the generated poem.
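
To get a feel for how long fine-tuning will run, you can estimate the number of optimizer steps from the dataset size. This is a rough sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_training_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def total_training_steps(num_blocks, batch_size=4, epochs=100):
    """Estimate optimizer steps for a run (hypothetical helper).

    Assumes one device, no gradient accumulation, and that the final
    partial batch in each epoch still counts as a step.
    """
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return steps_per_epoch * epochs


# A small PDF that produced 50 training blocks
steps = total_training_steps(50, batch_size=4, epochs=100)
print(steps)  # prints: 1300
```

In this example the run ends after 1300 steps, well short of `save_steps=10_000`, so the step-based checkpoint interval is never reached during training. For a small PDF you may want a lower `save_steps`, and fewer epochs if the output starts copying your poems verbatim.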


Let me know if you have any other questions.
