<a href="https://colab.research.google.com/github/vitsiupia/projektPython/blob/main/t5_abstractive_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers==4.20.0
!pip install keras_nlp==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

In [2]:
# Importing the necessary libraries
import os
import logging
import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras
import zipfile
import pandas as pd

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['TF_CPP_MIN_LOG_LEVEL']

'1'

In [3]:
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)

Let's prepare our corpus so that it looks exactly like a dataset downloaded from Hugging Face, i.e. with columns such as 'document', 'summary', and 'id'. We will build it as a pandas DataFrame first...

... and later convert it to a Hugging Face dataset with code like this:

```
from datasets import Dataset
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
dataset = Dataset.from_pandas(df)
```
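
The cells below keep the column names `transcript`, `summary`, and `code` rather than the Hugging Face-style names mentioned above. If those names are preferred, the records could be renamed before the DataFrame is built. A minimal sketch, assuming a hypothetical helper `to_hf_style`, a hypothetical `COLUMN_MAP`, and a made-up sample record:

```python
# Hypothetical mapping from this notebook's field names to the
# Hugging Face-style names mentioned above.
COLUMN_MAP = {"transcript": "document", "summary": "summary", "code": "id"}


def to_hf_style(records):
    """Return copies of the records with keys renamed via COLUMN_MAP.

    Keys that are not in COLUMN_MAP (for example "split") are kept as-is.
    """
    return [
        {COLUMN_MAP.get(key, key): value for key, value in record.items()}
        for record in records
    ]


# Made-up sample record, for illustration only.
sample = [
    {"transcript": "hello everyone", "summary": "A greeting.", "code": "X0001", "split": "train"}
]
renamed = to_hf_style(sample)
print(sorted(renamed[0].keys()))
```

The renamed list can then be passed to `pd.DataFrame(...)` and `Dataset.from_pandas(...)` in the same way as the snippet above.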

In [4]:
# Download our dataset with splits.
!wget https://github.com/vitsiupia/projektPython/raw/main/meetings_split.zip

--2023-05-20 19:55:50--  https://github.com/vitsiupia/projektPython/raw/main/meetings_split.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vitsiupia/projektPython/main/meetings_split.zip [following]
--2023-05-20 19:55:50--  https://raw.githubusercontent.com/vitsiupia/projektPython/main/meetings_split.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1034366 (1010K) [application/zip]
Saving to: ‘meetings_split.zip.1’


2023-05-20 19:55:50 (20.0 MB/s) - ‘meetings_split.zip.1’ saved [1034366/1034366]



In [5]:
# Unpack the dataset.
with zipfile.ZipFile('meetings_split.zip', 'r') as zip_file:
  zip_file.extractall()

In [6]:
data = []
folder_path = "meetings_split"
transcripts_path="transcripts"
summaries_path="summaries"

# Walk the train/ and val/ folders and build a DataFrame
for split in ['train','val']:
    transcripts_dir = os.path.join(folder_path, split, transcripts_path)
    summaries_dir = os.path.join(folder_path, split, summaries_path)

    # Iterate over every transcript in the folder
    for transcript_name in os.listdir(transcripts_dir):
        code = transcript_name.split(".")[0]
        summary_name = code + ".abssumm.txt"

        # Read the text inside the transcript.
        with open(os.path.join(transcripts_dir, transcript_name), "r") as file:
            transcript = file.read().strip()

        # Now read the text in the respective summary.
        with open(os.path.join(summaries_dir, summary_name), "r") as file:
            summary = file.read().strip()

        data.append({
            "transcript": transcript,
            "summary": summary,
            "code": code,
            "split": split,
        })
        # Drop the transcript and summary variables
        del transcript, summary

# Build a DataFrame from the collected data
ami_df = pd.DataFrame(data)

In [7]:
ami_df

Unnamed: 0,transcript,summary,code,split
0,hi hi everyone havent yes actually maybe uhhuh...,The project manager opened the meeting and sta...,IB4010,train
1,thanks coming meeting remote ideas ideas peopl...,The project manager recapped the events and de...,ES2016b,train
2,names andrew market research person meeting pr...,The project manager opens the meeting by welco...,ES2012a,train
3,talk functional design hopefully weve got bett...,The Project Manager reviewed new requirements ...,IS1002b,train
4,hi good going nex next maybe participant two n...,The project manager opens the meeting by going...,IS1009c,train
...,...,...,...,...
137,name francina user interface role main respons...,The meeting opens with the group doing introdu...,IS1009a,val
138,think youve got control f eight shift f eight ...,The Project Manager gave the group new require...,ES2014b,val
139,wi project project documents yes think last mi...,The project manager opens the meeting by stati...,ES2006b,val
140,p well problem think youve got lot people dont...,The project manager opens the meeting by going...,ES2012b,val


In [8]:
ami_df.loc[ami_df.code.duplicated()].sort_values(by='code')

Unnamed: 0,transcript,summary,code,split


In [9]:
for folder_path in ["meetings_split/train/transcripts", "meetings_split/train/summaries", 
                    "meetings_split/val/transcripts", "meetings_split/val/summaries", "meetings_split/test/transcripts"]:
  file_count = len(os.listdir(folder_path))
  print(f"Number of files in folder '{folder_path}': {file_count}")

Number of files in folder 'meetings_split/train/transcripts': 120
Number of files in folder 'meetings_split/train/summaries': 120
Number of files in folder 'meetings_split/val/transcripts': 22
Number of files in folder 'meetings_split/val/summaries': 22
Number of files in folder 'meetings_split/test/transcripts': 29
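
Matching file counts do not by themselves prove that every transcript has its own summary. The sketch below checks the pairing by name, using the same `<code>.abssumm.txt` convention as the loading loop. The helper `find_unpaired` and the file names in the throwaway directory are assumptions made for illustration, not part of the dataset:

```python
import os
import tempfile


def find_unpaired(transcripts_dir, summaries_dir, suffix=".abssumm.txt"):
    """Return meeting codes that have a transcript but no matching summary."""
    summaries = set(os.listdir(summaries_dir))
    missing = []
    for transcript_name in sorted(os.listdir(transcripts_dir)):
        code = transcript_name.split(".")[0]
        if code + suffix not in summaries:
            missing.append(code)
    return missing


# Tiny throwaway directory tree with made-up file names, for illustration only.
with tempfile.TemporaryDirectory() as root:
    transcripts_dir = os.path.join(root, "transcripts")
    summaries_dir = os.path.join(root, "summaries")
    os.makedirs(transcripts_dir)
    os.makedirs(summaries_dir)
    for name in ["A1.transcript.txt", "B2.transcript.txt"]:
        open(os.path.join(transcripts_dir, name), "w").close()
    open(os.path.join(summaries_dir, "A1.abssumm.txt"), "w").close()

    unpaired = find_unpaired(transcripts_dir, summaries_dir)

print(unpaired)  # ['B2']
```

On the real data the same function could be called with the `transcripts` and `summaries` folders of each split; an empty list would mean every transcript is paired.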


In [10]:
from datasets import Dataset
meetings_dataset = Dataset.from_pandas(ami_df)

Defining variables

In [11]:
# Define certain variables

# Fraction of the dataset held out as the test split
TRAIN_TEST_SPLIT = 0.15

# From the earlier notebook:
# ...The longest transcript has 3870 words.
# ...The longest summary has 530 words.
# ...The shortest summary has 41 words.

# Note: these lengths are whitespace-separated word counts, while the
# tokenizer's max_length used later counts subword tokens.
MAX_INPUT_LENGTH = int(ami_df.transcript.map(lambda row: len(row.split())).max())  # Maximum length of the input to the model
MIN_TARGET_LENGTH = int(ami_df.summary.map(lambda row: len(row.split())).min())  # Minimum length of the output by the model
MAX_TARGET_LENGTH = int(ami_df.summary.map(lambda row: len(row.split())).max())  # Maximum length of the output by the model
BATCH_SIZE = 50  # Batch-size for training our model
LEARNING_RATE = 2e-3  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"
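
The three length constants above come from splitting each text on whitespace and counting the pieces. The plain-Python sketch below reproduces that computation on made-up text and shows one way the input length could be capped. The helper `word_length_range` and the 512-token limit are assumptions for illustration, not values taken from this notebook:

```python
def word_length_range(texts):
    """Return (shortest, longest) whitespace-separated word counts."""
    lengths = [len(text.split()) for text in texts]
    return min(lengths), max(lengths)


# Made-up texts, for illustration only.
summaries = [
    "The manager opened the meeting.",
    "The team discussed the remote control design and agreed on a budget.",
]
shortest, longest = word_length_range(summaries)

# Assumption: a 512-token input limit; not taken from this notebook.
ASSUMED_TOKEN_LIMIT = 512
# 3870 is the longest transcript length (in words) quoted in the comments above.
capped_input_length = min(3870, ASSUMED_TOKEN_LIMIT)

print(shortest, longest, capped_input_length)
```

Because a subword tokenizer usually produces more tokens than there are words, a word-based maximum is only a rough guide to how much of each text survives truncation.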

In [12]:
print(meetings_dataset)

Dataset({
    features: ['transcript', 'summary', 'code', 'split'],
    num_rows: 142
})


In [13]:
meetings_dataset = meetings_dataset.train_test_split(test_size=TRAIN_TEST_SPLIT)

In [14]:
meetings_dataset

DatasetDict({
    train: Dataset({
        features: ['transcript', 'summary', 'code', 'split'],
        num_rows: 120
    })
    test: Dataset({
        features: ['transcript', 'summary', 'code', 'split'],
        num_rows: 22
    })
})
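
Note that this split is drawn at random from all 142 rows and does not use the existing `split` column. The 120/22 sizes can be reproduced with simple arithmetic, assuming scikit-learn-style rounding in which the test share is rounded up. That rounding rule is an assumption here, although it agrees with the output above, and `split_sizes` is a hypothetical helper:

```python
import math


def split_sizes(n_rows, test_fraction):
    """Sketch of a fractional split where the test share is rounded up."""
    n_test = math.ceil(test_fraction * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


print(split_sizes(142, 0.15))  # (120, 22)
```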

Data Pre-Processing

In [15]:
import transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

In [16]:
if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

In [17]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["transcript"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

In [18]:
tokenized_datasets = meetings_dataset.map(preprocess_function, batched=True)


Defining the model

In [19]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [20]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [21]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(10)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

Building and compiling the model

In [22]:
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Training and evaluating the model

In [23]:
import keras_nlp

rouge_l = keras_nlp.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result
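
To make the reported number easier to interpret, here is a simplified sketch of what a ROUGE-L F1 score measures: the longest common subsequence (LCS) of tokens shared by reference and prediction, turned into precision, recall, and their harmonic mean. It uses lowercased whitespace tokens, which is an assumption; `keras_nlp.metrics.RougeL` applies its own tokenization, so its scores may differ. The function names are hypothetical:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, prediction):
    """Simplified ROUGE-L F1 on lowercased, whitespace-separated tokens."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))  # 0.833
```

In the example, five of the six tokens on each side form the common subsequence, so precision, recall, and F1 all equal 5/6.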

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)



Inference
Now we will run inference with the model we trained, using a transcript from the test split. To do so, we use the pipeline function from Hugging Face Transformers, which offers a variety of task-specific pipelines. For our task, we use the summarization pipeline.

The pipeline function takes the trained model and tokenizer as arguments. The framework="tf" argument tells the pipeline that the model being passed is a TensorFlow model.

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    meetings_dataset["test"][0]["transcript"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)

In [None]:
meetings_dataset["test"][0]["summary"]