Capstone Project Title: "Text Generation with Fine-tuned Pre-trained Language Models"

Project Description:
For the capstone project, students will develop a text generation system using pre-trained language models. The project aims to showcase their expertise in fine-tuning a pre-trained language model and using it for generating coherent and contextually relevant text.

Project Components:

Data Collection and Preprocessing:

Input Dataset: Use a publicly available dataset such as the "IMDb Movie Reviews" dataset, which contains 50,000 movie reviews, each labeled as positive or negative for sentiment analysis.
Data Preprocessing: Students will clean and preprocess the IMDb dataset to prepare it for fine-tuning, for example by removing the HTML line-break tags embedded in the raw reviews and tokenizing the text.

Model Selection and Fine-tuning:

Model Choice: Students will select a pre-trained language model, like GPT-2, from the Hugging Face Model Hub.
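Fine-tuning: Using the IMDb dataset, students will fine-tune the selected model for sentiment classification. This step involves adjusting hyperparameters such as the learning rate, batch size, and number of epochs, and then training the model. Note that a model loaded with only a classification head cannot generate text, so the text generation step requires a checkpoint that keeps its causal language-modeling head.

Text Generation: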

After fine-tuning, students will implement a text generation pipeline using their fine-tuned model.
Input Prompt: Users will be able to input movie review prompts, and the system will generate a coherent review text based on the sentiment provided in the prompt.

Evaluation:

Students will evaluate the generated text using sentiment analysis metrics to measure how well the generated reviews align with the sentiment requested in the prompts.
Metrics: A sentiment classifier labels each generated review, and those labels are compared with the requested sentiment using metrics such as accuracy, precision, recall, and F1-score.
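
As a rough illustration of the evaluation idea above, the sentiment requested in each prompt can be treated as the true label and the sentiment a classifier assigns to the generated review as the prediction. The sketch below assumes binary labels (1 = positive, 0 = negative); the function name `sentiment_alignment` and the sample label lists are hypothetical and are not taken from the project materials.

```python
def sentiment_alignment(requested, predicted, positive=1):
    """Score how often generated reviews carry the requested sentiment.

    requested: sentiment labels asked for in the prompts
    predicted: labels a sentiment classifier assigned to the generated reviews
    """
    if len(requested) != len(predicted):
        raise ValueError("requested and predicted must have the same length")

    pairs = list(zip(requested, predicted))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    correct = sum(1 for r, p in pairs if r == p)

    # Guard every ratio against an empty denominator
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical example: five prompts and the classifier's verdicts
requested = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
scores = sentiment_alignment(requested, predicted)
print(scores)
```

A low recall here would suggest that prompts asking for a positive review often produce text the classifier reads as negative.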

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset

# Load the IMDb dataset
dataset = load_dataset("imdb")
train_data = dataset["train"]
test_data = dataset["test"]

# Define the models to compare
models = [
    {"model_name": "gpt2", "num_labels": 2},
    {"model_name": "bert-base-uncased", "num_labels": 2},
    {"model_name": "roberta-base", "num_labels": 2},
]
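
The preprocessing step from the project description is not shown in the cells. One possible sketch is given here, assuming the main cleanup needed is removing the `<br />` tags that appear in raw IMDb reviews, decoding HTML entities, and collapsing whitespace. The helper name `clean_review` is hypothetical.

```python
import html
import re

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
WHITESPACE = re.compile(r"\s+")


def clean_review(text: str) -> str:
    """Remove HTML line breaks, decode entities, and normalize whitespace."""
    text = BR_TAG.sub(" ", text)      # "<br />" -> space
    text = html.unescape(text)        # "&amp;" -> "&"
    return WHITESPACE.sub(" ", text).strip()


sample = "Great film.<br /><br />Loved it!"
cleaned = clean_review(sample)
print(cleaned)

# With the `datasets` library, the helper could be applied to a split as:
# train_data = train_data.map(lambda ex: {"text": clean_review(ex["text"])})
```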




In [None]:
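import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import DataCollatorWithPadding


def compute_metrics(eval_pred):
    # Turn logits into class predictions and score them against the true labels
    logits, labels = eval_pred
    if isinstance(logits, tuple):
        logits = logits[0]
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


results = {}

for model_config in models:
    model_name = model_config["model_name"]
    num_labels = model_config["num_labels"]

    # Load pre-trained model and tokenizer
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # GPT-2 has no padding token by default; reuse the end-of-sequence token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.pad_token_id

    # Tokenize the raw text; the Trainer cannot consume untokenized strings
    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    train_tokenized = train_data.map(tokenize, batched=True)
    test_tokenized = test_data.map(tokenize, batched=True)

    # Define training arguments and trainer
    training_args = TrainingArguments(
        per_device_train_batch_size=8,
        output_dir=f"./fine-tuned-model-{model_name}",
        evaluation_strategy="steps",  # named eval_strategy in recent transformers releases
        eval_steps=100,
        save_steps=500,
        num_train_epochs=3,
    )

    # Fine-tune the model
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_tokenized,
        eval_dataset=test_tokenized,
        data_collator=DataCollatorWithPadding(tokenizer),  # pad each batch dynamically
        compute_metrics=compute_metrics,  # provides the eval_* keys printed below
    )

    trainer.train()

    # Evaluate the model on test data
    eval_results = trainer.evaluate()

    results[model_name] = eval_results

# Compare evaluation results
for model_name, eval_result in results.items():
    print(f"Model: {model_name}")
    print(f"Accuracy: {eval_result['eval_accuracy']:.4f}")
    print(f"Precision: {eval_result['eval_precision']:.4f}")
    print(f"Recall: {eval_result['eval_recall']:.4f}")
    print(f"F1 Score: {eval_result['eval_f1']:.4f}")
    print()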
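
The cells above cover only the classification side of the project. A minimal sketch of the text generation step from the project description might look like the following. It assumes a causal language model checkpoint (the base `gpt2` here, standing in for a checkpoint fine-tuned on reviews with a language-modeling objective) and a hypothetical prompt template of the form `Sentiment: ... / Review: ...`. The names `build_prompt` and `generate_review` are illustrative and are not part of the original notebook.

```python
def build_prompt(sentiment: str, opening: str = "") -> str:
    """Format a sentiment-conditioned prompt (hypothetical template)."""
    sentiment = sentiment.strip().lower()
    if sentiment not in {"positive", "negative"}:
        raise ValueError("sentiment must be 'positive' or 'negative'")
    return f"Sentiment: {sentiment}\nReview: {opening.strip()}".rstrip()


def generate_review(prompt: str, model_dir: str = "gpt2", max_new_tokens: int = 80) -> str:
    """Generate a review continuation with a causal language model.

    model_dir is an assumption: point it at a checkpoint that keeps its
    language-modeling head, not at a sequence-classification checkpoint.
    """
    # Imported lazily so build_prompt can be used without loading transformers
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_dir)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample instead of greedy decoding for varied reviews
        top_p=0.95,
    )
    return outputs[0]["generated_text"]


prompt = build_prompt("Positive", " This movie ")
print(prompt)

# Uncomment to run generation (downloads the model on first use):
# print(generate_review(prompt))
```

Each generated review can then be passed to one of the fine-tuned classifiers to check whether its predicted sentiment matches the one requested in the prompt.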