<a href="https://colab.research.google.com/github/youzhanghe123/A-simple-BART-summary-pipeline/blob/main/BART_summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [65]:
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from transformers import BartTokenizer
from transformers import BartForConditionalGeneration
from transformers import AdamW
import torch
from rouge_score import rouge_scorer

In [3]:
file_path="/content/drive/MyDrive/Reviews.csv"

In [4]:
df = pd.read_csv(file_path).head(10)

In [44]:
df_eval=pd.read_csv(file_path).head(20)[10:]

In [5]:
df=df[["Summary","Text"]]

In [62]:
df.to_csv("train.csv")

In [45]:
df_eval=df_eval[["Summary","Text"]]

In [63]:
df_eval.to_csv("evaluate.csv")

In [6]:
df.iloc[0]["Text"]

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

In [7]:
df.iloc[0]["Summary"]

'Good Quality Dog Food'

In [48]:
print(df_eval.iloc[0]["Text"])

I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan.  Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!


In [8]:
# Check if GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU")

GPU is available
Using GPU: Tesla T4


## Tokenize the Data

In [9]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

def tokenize_data(dataset):
    '''
    args: inpuut dataset, which is a dataframe contain two columns: Text, Summary
    returns :
    1.input_ids: This is a list (or tensor) of token indices that represent the tokenized input text. Each integer in this list corresponds to a specific token in the tokenizer's vocabulary. The input_ids are used by the model to look up the token embeddings during the forward pass.
    2.attention_mask: This is a list (or tensor) of the same length as input_ids, containing 1s and 0s.
    It is used to indicate which tokens should be attended to and which should be ignored. A value of 1 in the attention_mask means that the corresponding token in input_ids is a real token that should be attended to, while a value of 0 means that it is a padding token and should be ignored.
    The attention_mask is important because it allows the model to handle variable-length input sequences.
    The model can ignore padding tokens and focus only on the meaningful parts of the input.
    The final output is a list of dictionaries, each dictionary contians the input_ids, attention_mask and labels
    '''
    tokenized_data = []
    for index in range(len(dataset)):
        text_encodings = tokenizer(dataset.iloc[index]["Text"], truncation=True, padding="max_length", max_length=512)
        summary_encodings = tokenizer(dataset.iloc[index]["Summary"], truncation=True, padding="max_length", max_length=128)
        tokenized_data.append({"input_ids": torch.tensor(text_encodings["input_ids"]), "attention_mask": torch.tensor(text_encodings["attention_mask"]), "labels": torch.tensor(summary_encodings["input_ids"])})
    return tokenized_data


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [10]:
tokenized_dataset = tokenize_data(df)

##Create a Dataloader

In [11]:
class SummarizationDataset(Dataset):
    '''
    args: Dataset, which is the tokenized dataset, a list contain dictionaries
    returns: a pytorch Dataset, where each element is a dictionary from the tokenized dataset
    '''
    def __init__(self, data):
        self.data=data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

In [12]:
train_dataset = SummarizationDataset(tokenized_dataset)
#each batch in the dataloader is a dictionary, contain "input_ids","attention_mask" and "labels" (each of hem is a 32*512 tensor)
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)

In [13]:
for i,batch in enumerate(train_dataloader):
  if i==0:
    batch=batch
    break

In [14]:
batch["input_ids"].shape #example of the first batch

torch.Size([10, 512])

##Initialize the model

In [15]:
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

In [16]:
model.to(device)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerN

## Set up the training loop

In [17]:
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 1

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        print(f"Epoch {epoch}, Loss: {loss.item()}")



Epoch 0, Loss: 12.198269844055176


## Evaluate the model

In [30]:
#generate summary for the texts in the evaluation dataset
#pre-train model does not change the tokenizer

def tokenize_data_eval(dataset):
    tokenized_data = []
    for index in range(len(dataset)):
        text_encodings = tokenizer(dataset.iloc[index]["Text"], truncation=True, padding="max_length", max_length=512)
        input_ids=torch.tensor(text_encodings["input_ids"]).to(device)
        attention_mask=torch.tensor(text_encodings["attention_mask"]).to(device)
        tokenized_data.append({"input_ids": input_ids , "attention_mask":attention_mask })
    return tokenized_data

In [31]:
tokenized_dataset_eval=tokenize_data_eval(df_eval)

In [56]:
def generate(eval_tokenized):
  summary_lis=[]
  for token in eval_tokenized:

    summary_ids = model.generate(token["input_ids"].unsqueeze(0),attention_mask=token["attention_mask"].unsqueeze(0), num_beams=4, max_length=60, early_stopping=True)

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    summary_lis.append(summary)
  return summary_lis

In [57]:
summary=generate(tokenized_dataset_eval)

In [58]:
for i in range(len(df_eval)):
  print("original text: ", df_eval.iloc[i]["Text"])
  print("generated summary: ", summary[i] )
  print("**********")

original text:  I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan.  Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!
generated summary:  I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind! We picked up a bottle once on a 

In [59]:
with open("summary.txt","w") as f:
  for summar in summary:
    f.write(summar+"/n")

In [68]:
#use rouge_score to measure the overlap between the generated summary and the reference summary in terms of n-grams, word sequences, and word pairs
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Assuming you have a list of reference summaries and generated summaries
reference_summaries = list(df_eval["Text"])
generated_summaries = summary
scores = [scorer.score(ref, gen) for ref, gen in zip(reference_summaries, generated_summaries)]

# Calculate average scores
avg_scores = {key: sum(score[key].fmeasure for score in scores) / len(scores) for key in scores[0].keys()}
print(avg_scores)


{'rouge1': 0.6680995380365928, 'rouge2': 0.6481616838899383, 'rougeL': 0.6569884269254817}
