<a href="https://colab.research.google.com/github/rockingboi/task-ai/blob/main/text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Step 1: Install Dependencies
!pip install transformers datasets torch

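After the install step, it can help to confirm that the three packages are importable before running the longer cell. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is an illustrative assumption and is not part of the original notebook.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages installed in Step 1
required = ["transformers", "datasets", "torch"]
missing = missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

If anything is reported missing, re-running the install cell (and restarting the runtime if needed) is usually enough.
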
In [None]:
# Step 2: Load a Small Public Dataset (Wikitext-2, downloaded on first use)
from datasets import load_dataset

# Load the Wikitext-2 training split (raw variant: rare words are kept instead of being replaced by <unk>)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Combine all text data from the dataset into one string
text_data = "\n".join(dataset["text"])

print(f"Sample Data:\n{text_data[:500]}")  # Display the first 500 characters

# Step 3: Preprocess the Dataset (Tokenization)
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

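# Tokenize the text. truncation=True with max_length=512 keeps only the first
# 512 tokens, so the rest of Wikitext-2 is discarded at this point.
tokens = tokenizer(text_data, return_tensors="pt", max_length=512, truncation=True)

# input_ids has shape (1, 512); index the single row to show its first 10 token ids
print(f"Tokenized Example:\n{tokens['input_ids'][0, :10]}")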

# Step 4: Load the GPT-2 Model
from transformers import GPT2LMHeadModel
import torch

# Load the pre-trained GPT-2 model with a language modeling head
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Step 5: Train the Model (Fine-tuning on Wikitext-2)
from torch.optim import AdamW

# Prepare input data for training
input_ids = tokens["input_ids"].to(device)
attention_mask = tokens["attention_mask"].to(device)

# Define the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training loop: each "epoch" here is one gradient step on the single 512-token sample
epochs = 3  # Can be increased for better results
model.train()

for epoch in range(epochs):
    optimizer.zero_grad()  # Clear gradients from the previous step
    outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
    loss = outputs.loss  # Calculate the loss
    print(f"Epoch {epoch + 1} | Loss: {loss.item()}")

    loss.backward()  # Backpropagate the loss
    optimizer.step()  # Update model parameters

print("\nTraining Complete!")

# Step 6: Generate Text in a Continuous Loop Until User Stops
model.eval()  # Set the model to evaluation mode

print("\nType 'exit' or 'quit' to stop the text generation loop.")

# Store the last user prompt for regeneration
last_prompt = None

while True:
    # Once a prompt exists, offer to reuse it; otherwise go straight to a new prompt
    choice = ""
    if last_prompt:
        choice = input("Type 'regenerate' to generate again, or press Enter to enter a new prompt: ").strip().lower()

    if choice == "regenerate":
        user_prompt = last_prompt
    elif choice in ["exit", "quit"]:
        # Allow leaving the loop from the regenerate prompt as well
        print("Exiting the text generation loop. Goodbye!")
        break
    else:
        user_prompt = input("Enter a prompt to generate text: ")
        if user_prompt.strip().lower() in ["exit", "quit"]:
            print("Exiting the text generation loop. Goodbye!")
            break
        if not user_prompt.strip():
            print("Empty prompt, please type some text.")
            continue
        last_prompt = user_prompt  # Store the prompt for future regeneration

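    # Tokenize the prompt and keep the attention mask so generate() does not have to guess it
    prompt_inputs = tokenizer(user_prompt, return_tensors="pt").to(device)

    # Generate text using the fine-tuned model
    # (early_stopping is omitted: it only affects beam search, which is not used here)
    output = model.generate(
        prompt_inputs["input_ids"],
        attention_mask=prompt_inputs["attention_mask"],
        max_length=100,  # Cap on total length: prompt tokens plus generated tokens
        num_return_sequences=1,
        no_repeat_ngram_size=2,  # Never repeat the same 2-token sequence
        do_sample=True,  # Sample so that 'regenerate' can return a different continuation
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
    )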

    # Decode and display the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\nGenerated Text:\n{generated_text}\n")

# Step 7: Save the Fine-tuned Model (Optional)
model.save_pretrained("./fine_tuned_gpt2")
tokenizer.save_pretrained("./fine_tuned_gpt2")

print("\nModel and tokenizer saved to './fine_tuned_gpt2'")


Sample Data:

 = Valkyria Chronicles III = 


 Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs
Tokenized Example:
tensor([[  198,   796,   569, 18354,  7496, 17740,  6711,   796,   220,   628,
           198,  2311,    73, 13090,   645,   569, 18354,  7496,   513,  1058,
           791, 47398, 17740,   357,  4960,  1058, 10545,   230,    99,   161,
           254,   112,  5641, 44444,  9202, 25084, 24440, 12675, 11839,    18,
           837,  6578,   764,   569, 18354,  7496,   286,   262, 30193,   513,
          1267,   837,  8811,  6412,   284,   355,   569, 18354,  7496, 

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated Text:
hello, NY) – The New York Red Bulls have signed midfielder Jordan Morris to a one-year contract, the club announced today.

Morris, 23, joins the Red Bull Arena as a member of the New England Revolution's academy. He joins a group of players who have been with the team since the beginning of last season, including midfielder Christian Pulisic, midfielder David Villa, defender Michael Bradley, and midfielder Michael Orozco. Morris, who has played in all 16 MLS

Type 'regenerate' to generate again, or press Enter to enter a new prompt: regenerate


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated Text:
hello, NY) – The New York Red Bulls have signed midfielder Jordan Morris to a one-year contract, the club announced today.

Morris, 23, joins the Red Bull Arena as a member of the New England Revolution's academy. He joins a group of players who have been with the team since the beginning of last season, including midfielder Christian Pulisic, midfielder David Villa, defender Michael Bradley, and midfielder Michael Orozco. Morris, who has played in all 16 MLS
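
The two generations recorded above are identical. That is what greedy decoding produces: it always picks the most probable next token, so the same prompt yields the same continuation. Sampling from the next-token distribution is one way to let "regenerate" return different text. The sketch below illustrates the difference in plain Python with a made-up distribution; the vocabulary, probabilities, and function names are hypothetical and stand in for the model's real output.

```python
import random

# Hypothetical next-token distribution, for illustration only
next_token_probs = {"the": 0.5, "a": 0.3, "one": 0.2}

def greedy_pick(probs):
    """Always return the most probable token (deterministic)."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
greedy_runs = [greedy_pick(next_token_probs) for _ in range(5)]
sampled_runs = [sample_pick(next_token_probs, rng) for _ in range(200)]

print(greedy_runs)                # the same token every time
print(sorted(set(sampled_runs)))  # more than one distinct token appears
```

With the real model, the closest equivalent is passing `do_sample=True` (optionally with `top_k` or `top_p`) to `model.generate`.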

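Step 3 passes `max_length=512` with `truncation=True`, so only the first 512 tokens of Wikitext-2 reach the training loop. A common alternative is to split the full token stream into fixed-length blocks and train on all of them. The following is a minimal sketch of that chunking step in plain Python; `chunk_token_ids` and the toy ids are illustrative assumptions, and a real run would feed it the tokenizer's output instead.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of block_size.

    A trailing remainder shorter than block_size is dropped so that every
    block has the same length and the blocks can be stacked into a batch.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]

# Toy example with made-up ids
blocks = chunk_token_ids(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Each block could then be used as one training sample, with the block itself serving as the labels, as in Step 5.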

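The loss printed in Step 5 is the mean cross-entropy per predicted token, measured in nats. Converting it to perplexity, `exp(loss)`, often makes it easier to interpret: a perplexity of 4 means the model is about as uncertain as a uniform choice among 4 tokens. The sketch below works through that arithmetic on made-up probabilities; the function names are illustrative and not part of the notebook.

```python
import math

def mean_cross_entropy(target_probs):
    """Average negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(target_probs):
    """Exponential of the mean cross-entropy."""
    return math.exp(mean_cross_entropy(target_probs))

# Made-up case: the model gives probability 0.25 to every correct token
uniform_case = [0.25] * 8
print(round(mean_cross_entropy(uniform_case), 4))  # 1.3863, i.e. ln(4)
print(round(perplexity(uniform_case), 4))          # 4.0
```

In the notebook itself, the same conversion would be `math.exp(loss.item())` inside the training loop.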