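## 1. Data Loading and Preprocessing

The data needs to be loaded into PyTorch `DataLoader` objects. The train, validation, and test splits are loaded separately so that each can be used for model training and evaluation.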

In [None]:
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained('google/bigbird-pegasus-large-arxiv')  # Hub model ID, or a path to a saved tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained('google/bigbird-pegasus-large-arxiv')  # Hub model ID, or a path to a saved model

class NewsDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data.iloc[index]
        article = row['article']
        highlights = row['highlights']

        # Tokenize the input article, truncated and padded to max_len tokens
        input_ids = self.tokenizer.encode(
            f"summarize: {article}",
            return_tensors='pt',
            truncation=True,
            max_length=self.max_len,
            padding='max_length'
        ).squeeze(0)  # drop the batch dimension of size 1

        # Tokenize the target summary, allowing half the input length
        target_ids = self.tokenizer.encode(
            highlights,
            return_tensors='pt',
            truncation=True,
            max_length=self.max_len // 2,
            padding='max_length'
        ).squeeze(0)  # drop the batch dimension of size 1

        return {
            'input_ids': input_ids,
            'target_ids': target_ids
        }

# Load CSV files and prepare dataset
train_data_path = '/content/drive/MyDrive/cnn_dailymail_dataset/cnn_dailymail/train.csv'
val_data_path = '/content/drive/MyDrive/cnn_dailymail_dataset/cnn_dailymail/validation.csv'
test_data_path = '/content/drive/MyDrive/cnn_dailymail_dataset/cnn_dailymail/test.csv'

# Load dataframes
train_df = pd.read_csv(train_data_path)
val_df = pd.read_csv(val_data_path)
test_df = pd.read_csv(test_data_path)

# Prepare datasets
train_dataset = NewsDataset(train_df, tokenizer, max_len=1500)
val_dataset = NewsDataset(val_df, tokenizer, max_len=1500)
test_dataset = NewsDataset(test_df, tokenizer, max_len=1500)

# DataLoader for training, validation, and testing
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)

ImportError: 
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.
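
The ImportError above says that the `sentencepiece` package was missing when the tokenizer was loaded. A minimal sketch for checking this before loading anything, using only the standard library; the helper name `missing_packages` is hypothetical and not part of transformers:

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `sentencepiece` is the package named in the ImportError above.
missing = missing_packages(["sentencepiece"])
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If the package is reported as missing, `pip install sentencepiece` followed by a runtime restart is the usual remedy, as the error message itself suggests.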


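## 2. Model Training and Evaluation

This is the code for training the model and evaluating its performance. The model is evaluated periodically on the validation split of the dataset, and the checkpoint with the best validation performance is saved.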
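
The "save only when validation improves" rule can be isolated from the training loop. The sketch below is one way to express it; the class name `BestLossTracker` and the loss values are made up for illustration and do not come from an actual run:

```python
class BestLossTracker:
    """Remember the lowest validation loss seen so far."""

    def __init__(self):
        self.best = float("inf")
        self.best_step = None

    def update(self, step, val_loss):
        """Return True when `val_loss` beats the best value, i.e. when to save."""
        if val_loss < self.best:
            self.best = val_loss
            self.best_step = step
            return True
        return False


tracker = BestLossTracker()
# Made-up validation losses, for illustration only.
for step, loss in [(800, 2.31), (1600, 2.05), (2400, 2.17)]:
    if tracker.update(step, loss):
        print(f"step {step}: new best {loss}, save the checkpoint here")
```

Starting from `float("inf")` guarantees that the first validation run always triggers a save, which matches how `best_val_loss` is initialized in the cell that follows.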

In [None]:
import random
import torch

# Adam over all model parameters.
# Note: 1e-3 is a high learning rate for fine-tuning a pretrained transformer;
# values between 1e-5 and 1e-4 are more common if training diverges.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)
model.to(device)

# Training and evaluation loop
best_val_loss = float('inf')

save_interval = 800  # Validate and save the model every 800 batches
log_interval = 500  # Print the training loss every 500 batches
val_chunk_size = 100  # Number of validation examples to evaluate each time

model.train()
step = 0

for batch_idx, batch in enumerate(train_loader):
    step += 1

    input_ids = batch['input_ids'].to(device)
    target_ids = batch['target_ids'].to(device)

    # Mask out padding: the encoder should not attend to pad tokens,
    # and label positions set to -100 are ignored by the loss.
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    labels = target_ids.masked_fill(target_ids == tokenizer.pad_token_id, -100)

    # Forward pass
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print training loss at intervals
    if step % log_interval == 0:
        print(f"Step {step} - Training Loss: {loss.item()}")

    # Validation and save model at intervals
    if step % save_interval == 0:
        print("Running validation on a random chunk...")
        model.eval()
        val_loss = 0.0

        # Randomly sample individual validation examples (not batches),
        # indexing over the dataset rather than over the loader
        num_val = min(val_chunk_size, len(val_loader.dataset))
        sampled_indices = random.sample(range(len(val_loader.dataset)), num_val)

        with torch.no_grad():
            for i in sampled_indices:
                val_example = val_loader.dataset[i]
                val_input_ids = val_example['input_ids'].unsqueeze(0).to(device)
                val_target_ids = val_example['target_ids'].unsqueeze(0).to(device)
                val_attention_mask = (val_input_ids != tokenizer.pad_token_id).long()
                val_labels = val_target_ids.masked_fill(
                    val_target_ids == tokenizer.pad_token_id, -100
                )

                # Forward pass
                val_outputs = model(
                    input_ids=val_input_ids,
                    attention_mask=val_attention_mask,
                    labels=val_labels,
                )
                val_loss += val_outputs.loss.item()

        avg_val_loss = val_loss / num_val
        print(f"Validation Loss at Step {step}: {avg_val_loss}")

        # Save the model if validation loss improves
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            model.save_pretrained('models/summarize_model.pth')
            tokenizer.save_pretrained('tokenizer/summarize_tokenizer')
            print(f"Model and tokenizer saved at Step {step} with Validation Loss: {avg_val_loss}")

        model.train()  # Return to training mode

print("Training complete!")
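
Because the dataset pads every input and target to a fixed length, it helps to see what a padding mask and masked labels look like for a single sequence. This is a plain-Python sketch on a toy list of token IDs; the helper name `mask_padding` and the IDs are hypothetical, and the pad ID of 0 is an assumption (the real value comes from `tokenizer.pad_token_id`):

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(token_ids, pad_id):
    """Return (attention_mask, labels) for one padded sequence of token IDs."""
    attention_mask = [0 if t == pad_id else 1 for t in token_ids]
    labels = [IGNORE_INDEX if t == pad_id else t for t in token_ids]
    return attention_mask, labels


# Toy IDs with two trailing pad tokens, assuming pad_id = 0.
attention_mask, labels = mask_padding([37, 512, 9, 1, 0, 0], pad_id=0)
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
print(labels)          # [37, 512, 9, 1, -100, -100]
```

On tensors the same idea is `(ids != pad_id)` for the mask and `ids.masked_fill(ids == pad_id, -100)` for the labels, so padded positions neither receive attention nor contribute to the loss.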

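## Evaluation and Testing

The overall process is to train the model, evaluate it on the validation dataset so that the best model is saved, and then measure performance on the test dataset. This cell handles the last step: it reloads the saved model and evaluates it on a random sample of the test set.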

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
from torch.utils.data import DataLoader, Subset
import random

# Load the tokenizer and the best model saved during training.
# The Auto classes pick the architecture recorded in the saved config,
# so this works for whichever seq2seq model was fine-tuned above.
tokenizer = AutoTokenizer.from_pretrained('tokenizer/summarize_tokenizer')
best_model = AutoModelForSeq2SeqLM.from_pretrained('models/summarize_model.pth')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
best_model.to(device)

# Randomly sample 200 indices from the test dataset
num_samples = 200
random_indices = random.sample(range(len(test_loader.dataset)), num_samples)
subset = Subset(test_loader.dataset, random_indices)
sampled_test_loader = DataLoader(subset, batch_size=4, shuffle=False)

best_model.eval()
test_loss = 0.0

# Store a few examples for comparison
examples_to_show = 5
examples = []

with torch.no_grad():
    for batch_idx, batch in enumerate(sampled_test_loader):
        input_ids = batch['input_ids'].to(device)
        target_ids = batch['target_ids'].to(device)

        # Mask padding in the inputs and ignore padded label positions in the loss
        attention_mask = (input_ids != tokenizer.pad_token_id).long()
        labels = target_ids.masked_fill(target_ids == tokenizer.pad_token_id, -100)

        # Forward pass
        outputs = best_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        test_loss += outputs.loss.item()

        # Decode a few examples for comparison
        if len(examples) < examples_to_show:
            for i in range(input_ids.size(0)):
                if len(examples) >= examples_to_show:
                    break
                generated_ids = best_model.generate(
                    input_ids[i].unsqueeze(0),
                    attention_mask=attention_mask[i].unsqueeze(0),
                    max_length=512,
                )[0]
                input_text = tokenizer.decode(input_ids[i], skip_special_tokens=True)
                predicted_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
                target_text = tokenizer.decode(target_ids[i], skip_special_tokens=True)
                examples.append((input_text, predicted_text, target_text))

avg_test_loss = test_loss / len(sampled_test_loader)
print(f"Test Loss: {avg_test_loss}")

# Display examples
print("\nSample Predictions:")
for idx, (input_text, predicted_text, target_text) in enumerate(examples):
    print(f"\nExample {idx + 1}")
    print(f"Input: {input_text}")
    print(f"Predicted Output: {predicted_text}")
    print(f"Target Output: {target_text}")
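
The test cell draws its 200 examples with the global `random` module, so each run evaluates a different subset. If comparable numbers across runs are wanted, one option is to sample with a seeded generator. A minimal sketch; the helper name `sample_indices`, the seed, and the dataset size of 1000 are assumptions for illustration:

```python
import random


def sample_indices(dataset_size, num_samples, seed=0):
    """Pick up to `num_samples` distinct indices, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return rng.sample(range(dataset_size), min(num_samples, dataset_size))


# 1000 is a made-up dataset size; in the notebook it would be len(test_loader.dataset).
indices = sample_indices(dataset_size=1000, num_samples=200, seed=42)
print(len(indices), len(set(indices)))  # 200 200
```

The returned list can be passed to `Subset` in place of `random_indices`. The `min(...)` guard also avoids the `ValueError` that `random.sample` raises when asked for more items than the dataset holds.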


## Testing on a Real News Article

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

# Load the tokenizer and the model
tokenizer = T5Tokenizer.from_pretrained('/content/drive/MyDrive/tokenizer/e2e_summarize_tokenizer')  # path to the saved tokenizer
model = T5ForConditionalGeneration.from_pretrained('/content/drive/MyDrive/models/e2e_summarize_model.pth')  # path to the saved model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Function to summarize an article
def summarize_article(news):
    # Preprocess the input text
    input_text = f"summarize: {news}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=1024).to(device)

    # Generate summary
    summary_ids = model.generate(input_ids, max_length=300, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary


news = '''There’s some kind of magic afoot. If, like me, you’re one of the very few people who hasn’t already seen the blockbuster stage musical Wicked (it’s the second-highest-grossing Broadway show of all time, so that’s an awful lot of bums on seats), you may approach this shiny, high-energy, relentlessly marketed movie adaptation with low to moderate expectations. There’s the unwieldy running time, for a start – two hours and 40 minutes – and the cynical, box-office-gouging decision to carve the story into two films (fans will have to wait almost a year to the day before they get to watch the concluding chapter).
But here’s the thing: reservations are soon extinguished and grumbles about the release strategy swiftly quashed. Wicked matches its polished razzle-dazzle with real heart. Driven by knockout performances from Cynthia Erivo and Ariana Grande, Jon M Chu’s impossibly slick charm assault of an adaptation zips along so enjoyably that you almost wish it were longer (your bladder may disagree). With its all too timely themes of bullying, corrupt leaders and the demonisation of difference, this is a movie that promises a froth of pink and green escapism but delivers considerably more in the way of depth and darkness.
For those who have somehow evaded Wicked’s cultural reach over the past couple of decades, here’s a brief primer. The film and the stage show are both loosely based on Gregory Maguire’s 1995 novel Wicked: The Life and Times of the Wicked Witch of the West, which offers an alternative backstory for the Wicked Witch from The Wizard of Oz. This film focuses on Wicked’s early years and two young witches-to-be: green-skinned outcast Elphaba (Erivo), who will go on to become the Wicked Witch of the West, and the vain, popular Galinda (Grande), who will eventually blossom into Glinda the Good.
Both Elphaba and Galinda are newly arrived at Shiz University (think the student politics of Mean Girls’s North Shore High and the curriculum of Hogwarts).
Although not enrolled as a student, Elphaba is on site to help her paraplegic younger sister, Nessarose (Marissa Bode). But the formidable teacher Madame Morrible (a haughtily fabulous Michelle Yeoh, breezing over any vocal limitations in a cloud of intimidating glamour) spots potential in Elphaba and offers her one-to-one tutoring in the art of enchantment. To the mutual disgust of both girls, Elphaba and Galinda find themselves assigned as roomies.
It’s not just their personalities that clash: the film’s colour palette is at first a battleground between the chlorophyll green of Elphaba’s skin and the candyfloss pink of Galinda’s wardrobe. But visuals that initially seem jarring start to find harmony as the movie progresses. A scene in a forest full of mossy tuffets of vegetation and garlands of delicate, rosy blooms is lush and lovely, one of several notable triumphs for the production design department, led by Nathan Crowley, whose credits include the similarly lavish Wonka. Likewise, Elphaba and Galinda warm to each other and a genuine connection is forged between them.
Both lead actors impress. Erivo is terrific, her rich, velvety voice cracking under the weight of rejections and ridicule suffered by Elphaba; her eyes showing the bruises that her skin cannot. And Grande is supremely well cast. It’s not just the voice: the singer has a vocal range so extensive that some of it is only audible to bats, and she uses every last note of it here. But more crucial is her gift for physical comedy – each flouncy hair toss, each ditsy heel kick, is a precision-tooled punchline.
Elsewhere, Bridgerton’s Jonathan Bailey, as the shallow and self-absorbed Prince Fiyero, skips away with every one of his scenes – in particular, a dizzyingly complex song-and-dance sequence in the college library. Kudos, too, to the choreographer Christopher Scott for dreaming it all up, and to cinematographer Alice Brooks for capturing the magic.
Does it all work? There are moments that get too caught up in their own whirl of CGI pageantry and empty spectacle. And certainly some scenes could be tightened up a little – it’s worth noting that the running time of this first film instalment is longer than the stage version in its entirety. But for the most part, Wicked the movie takes flight and lifts our hearts along with it. We’re caught in the slipstream of Elphaba and her knobbly and uncomfortable-looking broomstick as she whooshes off into the second half of the story.
'''
# Generate and print the summary
summary = summarize_article(news)
print("News : ", news)
print("Summary : ", summary)
