<a href="https://colab.research.google.com/github/raja-coder-del/GENERATIVE-TEXT-MODEL/blob/main/task_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os  # only needed for the optional cleanup at the end of the notebook
import re

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# --- 1. Setup and Imports ---
print("--- 1. Setting up environment and importing libraries ---")
# Install the packages this notebook actually uses if they are not already installed
# !pip install torch transformers

# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- 2. Dataset Preparation ---
print("\n--- 2. Preparing the dataset ---")

# Create a dummy text file for demonstration.
# In a real scenario, you would replace this with your actual, larger dataset.
# The content here is just to give the model some context, even if minimal.
text_data = """
The quick brown fox jumps over the lazy dog. This is a classic sentence for testing.
Artificial intelligence is a rapidly evolving field. It has the potential to transform many aspects of our lives.
Machine learning is a subset of AI that focuses on algorithms that learn from data. Deep learning is a further subset of machine learning.
The future of technology is exciting, with new innovations emerging constantly. We are living in a truly remarkable era of advancements.
Climate change is a pressing global issue that requires urgent action. Sustainable solutions are crucial for our planet's future.
Generative AI models are revolutionizing creative industries and content creation.
Large Language Models (LLMs) like GPT are capable of understanding and generating human-like text.
Natural Language Processing (NLP) is a branch of AI that deals with the interaction between computers and human language.
The history of computing is filled with groundbreaking discoveries, from early calculating machines to modern supercomputers.
Space exploration continues to push the boundaries of human knowledge and our understanding of the universe.
"""

file_path = "my_gpt_corpus.txt"
with open(file_path, "w", encoding="utf-8") as f:
    f.write(text_data)

print(f"Dummy text data saved to: {file_path}")

# Load the text data
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

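# Basic cleaning of the text data. Note: cleaned_text is only used for the length
# report below; the fine-tuning sketch in section 3 reads file_path directly.
cleaned_text = raw_text.lower()
# Collapse runs of spaces, newlines, and tabs into a single space
cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()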
print(f"Corpus length after cleaning: {len(cleaned_text)} characters")

# --- 3. Model Building and (Conceptual) Fine-tuning ---
print("\n--- 3. Loading GPT-2 model and tokenizer ---")

# Load pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Set a pad token for the tokenizer. The GPT-2 tokenizer has no default pad token,
# so we reuse the EOS (end-of-sequence) token, which is a common workaround.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"Tokenizer pad_token set to eos_token: {tokenizer.pad_token}")

# Move model to the appropriate device (GPU if available)
model.to(device)
print(f"Model moved to {device}")

# --- Conceptual Fine-tuning (Simplified for demonstration) ---
# For actual fine-tuning, you would use `TextDataset`, `DataCollatorForLanguageModeling`,
# and `Trainer` from the Hugging Face `transformers` library. (`TextDataset` is marked as
# deprecated in recent `transformers` releases in favor of the `datasets` library.)
# This part only illustrates that fine-tuning is a step; with this small dummy dataset,
# the effect on generation would be minimal compared to the base GPT-2.

# Example of how you would prepare data for actual fine-tuning (not executed as a full training loop here):
# from transformers import TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments
#
# # Create a TextDataset from your corpus
# print("Creating TextDataset (conceptual for fine-tuning)...")
# dataset_for_training = TextDataset(
#     tokenizer=tokenizer,
#     file_path=file_path,
#     block_size=128 # The maximum sequence length for input to the model
# )
#
# # Data collator for language modeling (batches the examples and builds the labels;
# # with mlm=False no tokens are masked)
# data_collator = DataCollatorForLanguageModeling(
#     tokenizer=tokenizer, mlm=False # mlm=False for causal language modeling (GPT-style)
# )
#
# # Define training arguments
# training_args = TrainingArguments(
#     output_dir="./gpt2_fine_tuned_model", # Directory to save checkpoints
#     overwrite_output_dir=True,
#     num_train_epochs=3, # Number of training epochs
#     per_device_train_batch_size=2, # Batch size per device during training
#     save_steps=10_000, # Save checkpoint every X steps
#     save_total_limit=2, # Limit the total number of checkpoints
#     logging_dir="./logs", # Directory for storing logs
#     logging_steps=500,
# )
#
# # Create a Trainer instance
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     data_collator=data_collator,
#     train_dataset=dataset_for_training,
# )
#
# print("Starting conceptual fine-tuning... (This is a simplified demonstration, not full training)")
# # To actually fine-tune, uncomment the line below. This will train the model.
# # trainer.train()
# print("Conceptual fine-tuning completed. (Model weights are not significantly updated with this small data)")
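
# Illustration only (hypothetical helper, not a transformers API): block-based
# datasets for causal language modeling cut the tokenized corpus into fixed-size
# chunks of block_size tokens. This sketch ASSUMES that a trailing remainder
# shorter than block_size is dropped, so a corpus with fewer than block_size
# tokens would yield no training examples at all.
def split_into_blocks(token_ids, block_size=128):
    """Split a list of token ids into consecutive blocks of exactly block_size."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Possible usage with the objects defined above (left commented out):
# blocks = split_into_blocks(tokenizer.encode(raw_text), block_size=128)
# print(f"The dummy corpus would yield {len(blocks)} training block(s)")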

print("GPT-2 model loaded. For optimal results on specific topics, fine-tuning on a larger, domain-specific dataset is highly recommended.")
print("Proceeding with text generation using the base GPT-2 model (or minimally 'tuned' if you uncommented trainer.train()).")

# --- 4. Text Generation Function ---
print("\n--- 4. Defining the text generation function ---")
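
# Illustration only: a pure-Python sketch of how the temperature, top_k, and top_p
# arguments of the generation function below reshape a next-token distribution.
# The helper name and the toy logits are hypothetical; this is a simplified model
# of the idea, not the code path that model.generate() actually runs.
import math

def sketch_sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} for the tokens that remain sampleable."""
    # Temperature: divide the logits before the softmax (lower = sharper distribution)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax weights
    # Rank token indices from most to least probable
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    # Top-k: keep only the k most probable tokens (0 disables the filter)
    if top_k > 0:
        order = order[:top_k]
    total = sum(weights[i] for i in order)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break
    # Renormalize over the surviving tokens
    mass = sum(weights[i] for i in kept)
    return {i: weights[i] / mass for i in kept}

# Possible usage with toy logits for a four-token vocabulary (left commented out):
# print(sketch_sampling_distribution([2.0, 1.0, 0.0, -1.0], temperature=0.7, top_k=3, top_p=0.9))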

def generate_text_gpt2(prompt, model, tokenizer, max_length=100, temperature=0.7, top_k=50, top_p=0.9, num_return_sequences=1):
    """
    Generates text using a GPT-2 model.

    Args:
        prompt (str): The starting text for generation.
        model: The GPT2LMHeadModel instance.
        tokenizer: The GPT2Tokenizer instance.
        max_length (int): Maximum total length in tokens (prompt tokens + newly generated tokens).
        temperature (float): Controls randomness. Lower values (e.g., 0.5) make output more predictable/conservative.
                             Higher values (e.g., 1.0+) make output more creative/diverse.
        top_k (int): Limits sampling to the k most probable tokens. Set to 0 to disable.
        top_p (float): Nucleus sampling: limits sampling to the smallest set of tokens whose
                       cumulative probability reaches top_p. Can be used instead of or together with top_k.
        num_return_sequences (int): Number of independent sequences to generate for the given prompt.

    Returns:
        list: A list of generated text strings.
    """
    # Tokenize the prompt. Calling the tokenizer directly (rather than encode)
    # also returns the attention mask that generate() expects.
    inputs = tokenizer(prompt, return_tensors="pt")

    # Move the inputs to the model's own device instead of relying on the global `device`
    inputs = inputs.to(model.device)

    print(f"Generating text for prompt: '{prompt}' with max_length={max_length}, temp={temperature}, top_k={top_k}, top_p={top_p}")

    # Generate text using the model's generate method,
    # which supports several decoding strategies.
    output_sequences = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length, # max_length includes the prompt tokens
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        repetition_penalty=1.2, # Discourage repeating the same words/phrases
        do_sample=True, # Enable sampling (vs. greedy decoding or beam search)
        num_return_sequences=num_return_sequences,
        pad_token_id=tokenizer.pad_token_id, # GPT-2 has no pad token of its own, so pass it explicitly
        eos_token_id=tokenizer.eos_token_id, # Generation stops when the EOS token is produced
    )

    generated_texts = []
    for sequence in output_sequences:
        # Decode the generated sequence back to text, skipping special tokens
        text = tokenizer.decode(sequence, skip_special_tokens=True)

        # GPT-2's generate method returns the prompt + generated text.
        # We might want to remove the exact prompt from the beginning of the generated text
        # to see only the newly generated part, or keep it to see the full output.
        # Here, we'll keep the full output as it demonstrates coherence.
        generated_texts.append(text.strip())

    return generated_texts
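
# Hypothetical helper (not part of the original demo), sketched for callers who
# want only the newly generated continuation. It assumes the decoded text begins
# with the prompt verbatim, and falls back to the full text when it does not.
def strip_prompt(generated_text, prompt):
    """Return generated_text without the leading prompt, if it is present."""
    prompt = prompt.strip()
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text

# Possible usage (left commented out):
# continuation = strip_prompt(generate_text_gpt2(prompt, model, tokenizer)[0], prompt)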

# --- 5. Demonstration with User Prompts ---
print("\n--- 5. Demonstrating Text Generation with User Prompts ---")

# Define a list of prompts related to various topics
prompts = [
    "The future of artificial intelligence will be",
    "Sustainable solutions for energy are essential because",
    "Once upon a time, in a land far away, there lived",
    "The latest scientific discovery indicates that",
    "Generative AI models are transforming",
    "Climate change impacts all of us, therefore we must",
    "In the realm of natural language processing,"
]

# Iterate through prompts and generate text
for i, prompt in enumerate(prompts):
    print(f"\n===== PROMPT {i+1} =====")
    print(f"User Prompt: '{prompt}'")

    # Generate text with the baseline sampling parameters (max_length shortened from the default 100 to 80)
    generated_default = generate_text_gpt2(prompt, model, tokenizer,
                                           max_length=80, # Generate up to 80 tokens in total
                                           temperature=0.7,
                                           top_k=50,
                                           top_p=0.9,
                                           num_return_sequences=1)
    print("\n--- Generated (Default Params) ---")
    print(generated_default[0])

    # Experiment with different parameters for a single prompt
    if i == 0: # Only for the first prompt to keep output manageable
        print("\n--- Experimenting with different parameters for the first prompt ---")
        print("\n--- High Temperature (More Creative/Random) ---")
        generated_high_temp = generate_text_gpt2(prompt, model, tokenizer,
                                                 max_length=80,
                                                 temperature=1.0, # Higher temperature
                                                 top_k=0, top_p=0.9, # Use only top_p
                                                 num_return_sequences=1)
        print(generated_high_temp[0])

        print("\n--- Low Temperature (More Conservative/Coherent) ---")
        generated_low_temp = generate_text_gpt2(prompt, model, tokenizer,
                                                max_length=80,
                                                temperature=0.5, # Lower temperature
                                                top_k=50, top_p=0.9,
                                                num_return_sequences=1)
        print(generated_low_temp[0])

        print("\n--- Generating Multiple Sequences ---")
        generated_multiple = generate_text_gpt2(prompt, model, tokenizer,
                                                max_length=60,
                                                temperature=0.8,
                                                top_k=50, top_p=0.9,
                                                num_return_sequences=3) # Generate 3 sequences
        for j, text in enumerate(generated_multiple):
            print(f"Sequence {j+1}:\n{text}\n")

print("\n--- GPT-2 Text Generation Demo Complete ---")

# Optional: Clean up the dummy file
# os.remove(file_path)
# print(f"Cleaned up dummy file: {file_path}")


--- 1. Setting up environment and importing libraries ---
Using device: cpu

--- 2. Preparing the dataset ---
Dummy text data saved to: my_gpt_corpus.txt
Corpus length after cleaning: 1143 characters

--- 3. Loading GPT-2 model and tokenizer ---
Tokenizer pad_token set to eos_token: <|endoftext|>
Model moved to cpu
GPT-2 model loaded. For optimal results on specific topics, fine-tuning on a larger, domain-specific dataset is highly recommended.
Proceeding with text generation using the base GPT-2 model (or minimally 'tuned' if you uncommented trainer.train()).

--- 4. Defining the text generation function ---

--- 5. Demonstrating Text Generation with User Prompts ---

===== PROMPT 1 =====
User Prompt: 'The future of artificial intelligence will be'
Generating text for prompt: 'The future of artificial intelligence will be' with max_length=80, temp=0.7, top_k=50, top_p=0.9

--- Generated (Default Params) ---
The future of artificial intelligence will be decided by the technology itself