## Environment setup & constant variables declaration

In [24]:
!pip install transformers==4.20.0
!pip install keras_nlp==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.20.0
  Using cached transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1 (from transformers==4.20.0)
  Using cached tokenizers-0.12.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.12.1 transformers-4.20.0


In [None]:
# Importing the necessary libraries
import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# Define certain variables

# Fraction of the dataset used for each of the train and test splits
TRAIN_TEST_SPLIT = 0.1

MAX_INPUT_LENGTH = 1024 # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

## Load the dataset

We will now download the Extreme Summarization (XSum) dataset. It consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (i.e. the summary), which is professionally written, typically by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

Following much of the literature, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric to evaluate our sequence-to-sequence abstractive summarization approach.
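
To make the metric more concrete, here is a minimal sketch of a ROUGE-L F1 score based on the longest common subsequence (LCS) of whitespace tokens. It is a simplified illustration only, not the implementation used by `keras_nlp` or `rouge-score` (which tokenize and normalize text differently); the function names `lcs_length` and `rouge_l_f1` are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over lowercased whitespace tokens (simplified sketch)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of the candidate that is matched
    recall = lcs / len(ref_tokens)      # share of the reference that is recovered
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

In this toy pair the LCS is "the cat on the mat" (5 of 6 tokens on each side), so precision, recall and F1 all equal 5/6.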

We will use the Hugging Face Datasets library to download the data we need to use for training and evaluation. This can be easily done with the `load_dataset` function.

In [4]:
from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")


KeyboardInterrupt: ignored

The dataset has the following fields:

* **document**: the original BBC article to be summarized
* **summary**: the single sentence summary of the BBC article
* **id**: ID of the document-summary pair

In [None]:
print(raw_datasets)

In [None]:
print(raw_datasets[0])

For the sake of demonstrating the workflow, in this notebook we will only take small random subsets (10% each) of the training split as our training and test sets. We can easily split the dataset using the `train_test_split` method, which accepts the sizes of the train and test splits.

In [None]:
raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)

## Data Pre-processing

Before we can feed those texts to our model, we need to pre-process them and get them ready for the task. This is done by a Hugging Face Transformers `Tokenizer`, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs that the model requires.

The `from_pretrained()` method expects the name of a model from the Hugging Face Model Hub. This is exactly the `MODEL_CHECKPOINT` declared earlier, so we will just pass that.



In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We will write a simple pre-processing function that is compatible with Hugging Face Datasets. To summarize, our pre-processing function should:

* Add the prefix to the input text
* Tokenize the text (inputs and targets) into the corresponding token IDs that will be used for embedding look-up in T5
* Create additional inputs for the model, such as `attention_mask`

In [None]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply this function to all the document-summary pairs in our dataset, we just use the `map` method of the dataset object we created earlier. This applies the function to all the elements of all the splits in the dataset, so our training and test data will be preprocessed in a single command.

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
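
With `batched=True`, the function receives a dictionary of column lists rather than a single example, and returns a dictionary of lists that is merged back into the batch. The toy sketch below imitates that contract in plain Python. `toy_map` and `add_prefix` are hypothetical helpers, not part of the Datasets library, which additionally takes care of storage, caching and much more.

```python
def toy_map(columns, function, batch_size=2):
    """Apply `function` to slices of a column dict and merge the results."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        # A batch is a dict mapping each column name to a list of values
        batch = {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }
        # Columns returned by the function are added to (or replace) the batch
        result = {**batch, **function(batch)}
        for name, values in result.items():
            output.setdefault(name, []).extend(values)
    return output


def add_prefix(batch):
    """Batched function: one output value per input document."""
    return {"inputs": ["summarize: " + doc for doc in batch["document"]]}


columns = {"document": ["a b c", "d e", "f"], "summary": ["x", "y", "z"]}
mapped = toy_map(columns, add_prefix)
```

This is why `preprocess_function` iterates over `examples["document"]`: each call sees many documents at once.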

## Defining the model

# Our own data

Let's prepare our corpus so that it looks exactly like the dataset downloaded from Hugging Face, i.e. so that it has columns such as 'document', 'summary' and 'id'. Let's do this in a pandas DataFrame...

... and later convert it into an HF dataset with code like this:

```
from datasets import Dataset
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
dataset = Dataset.from_pandas(df)
```
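
As a rough sketch of that target layout, records holding a transcript, a summary and a meeting code could be rearranged into XSum-style columns before building the DataFrame. The sample records and the helper `to_xsum_columns` are hypothetical and only illustrate the shape of the data.

```python
def to_xsum_columns(records):
    """Turn a list of per-meeting dicts into a dict of XSum-style columns."""
    return {
        "document": [r["transcript"] for r in records],
        "summary": [r["summary"] for r in records],
        "id": [r["code"] for r in records],
    }


# Made-up records, for illustration only
records = [
    {"transcript": "hello everybody welcome", "summary": "The meeting opened.", "code": "M001"},
    {"transcript": "let us review the design", "summary": "The design was reviewed.", "code": "M002"},
]

columns = to_xsum_columns(records)
# A dict of equal-length lists like this can be passed to pd.DataFrame(columns)
```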




In [8]:
# Importing the necessary libraries
import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [14]:
!pip install transformers==4.20.0
!pip install keras_nlp==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.20.0
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0 (from transformers==4.20.0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1 (from transformers==4.20.0)
  Downloading tokenizers-0.12.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.12.1 transforme

In [1]:
!wget https://github.com/vitsiupia/projektPython/raw/main/meetings_split.zip

--2023-05-18 08:32:09--  https://github.com/vitsiupia/projektPython/raw/main/meetings_split.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vitsiupia/projektPython/main/meetings_split.zip [following]
--2023-05-18 08:32:09--  https://raw.githubusercontent.com/vitsiupia/projektPython/main/meetings_split.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1027422 (1003K) [application/zip]
Saving to: ‘meetings_split.zip’


2023-05-18 08:32:09 (20.7 MB/s) - ‘meetings_split.zip’ saved [1027422/1027422]



In [2]:
import zipfile
# Use a name other than `zip` so the built-in function is not shadowed
with zipfile.ZipFile("meetings_split.zip", "r") as archive:
    archive.extractall()

In [9]:
import os
import pandas as pd

# Path to the meetings_split folder
folder_path = "meetings_split"

# Collect the transcript and summary of each meeting, keyed by meeting code,
# so that a pair is never built from two different meetings, whatever order
# os.listdir returns the files in
records = {}

# Walk through the train/ and val/ folders
for folder_name in ["train", "val"]:
    folder_dir = os.path.join(folder_path, folder_name)

    # Go through the files in the folder
    for file_name in os.listdir(folder_dir):
        if file_name.endswith("transcript.txt"):
            field = "transcript"
        elif file_name.endswith("abssumm.txt"):
            field = "summary"
        else:
            continue

        code = file_name.split(".")[0]
        with open(os.path.join(folder_dir, file_name), "r") as file:
            records.setdefault(code, {"code": code})[field] = file.read().strip()

# Keep only the meetings that have both a transcript and a summary
data = [
    {"transcript": r["transcript"], "summary": r["summary"], "code": r["code"]}
    for r in records.values()
    if "transcript" in r and "summary" in r
]

# Build the DataFrame from the collected data
ami_df = pd.DataFrame(data)

In [10]:
ami_df

Unnamed: 0,transcript,summary,code
0,yes well well see made one didnt enough yellow...,The group introduced themselves to each other....,ES2013d
1,morning sure sheep gonna draw head lets see kn...,The User Interface Designer presented the majo...,IS1001b
2,forgot name youre sorry forgot write know b su...,The project manager opened the meeting and int...,TS3004b
3,everybody welcome detailed design meeting lets...,The User Interface Designer and the Industrial...,ES2011d
4,tha gonna powerpoint presentation well catheri...,The meeting begins with the group trying to re...,IS1000a
...,...,...,...
94,sa pierrette professor shes shes fifty per cen...,The Project Manager gave the group new require...,IB4005
95,think youve got control f eight shift f eight ...,The Project Manager introduced the project to ...,ES2014b
96,hi everybody welcome kickoff meeting new produ...,The project manager opens the meeting by going...,ES2010a
97,wear thing many cables stuff original one thin...,The project manager opened the meeting and int...,ES2008a


In [11]:
import os

folder_path = "meetings_split/train"

file_count = len(os.listdir(folder_path))

print(f"Number of files in folder '{folder_path}': {file_count}")

Number of files in folder 'meetings_split/train': 240


In [15]:

folder_path = "meetings_split/val"

file_count = len(os.listdir(folder_path))

print(f"Number of files in folder '{folder_path}': {file_count}")

Number of files in folder 'meetings_split/val': 44


In [16]:
from datasets import Dataset
meetings_dataset = Dataset.from_pandas(ami_df)

### Defining variables

In [17]:
# Define certain variables

# Fraction of the dataset to hold out as the test split
TRAIN_TEST_SPLIT = 0.10

# From the earlier notebook:
# ...The longest transcript has 3870 words.
# ...The longest summary has 201 words.

MAX_INPUT_LENGTH = 4000 # Maximum length of the input to the model
MIN_TARGET_LENGTH = 80  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 150  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 5  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

In [18]:
print(meetings_dataset)

Dataset({
    features: ['transcript', 'summary', 'code'],
    num_rows: 99
})


In [19]:
meetings_dataset = meetings_dataset.train_test_split(
    train_size=0.9, test_size=TRAIN_TEST_SPLIT
)

In [20]:
meetings_dataset

DatasetDict({
    train: Dataset({
        features: ['transcript', 'summary', 'code'],
        num_rows: 89
    })
    test: Dataset({
        features: ['transcript', 'summary', 'code'],
        num_rows: 10
    })
})
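
The 89/10 split of the 99 rows is consistent with rounding the test size up and the training size down. The sketch below reproduces that arithmetic under this assumption; it matches the output above but is not a statement about the library's internals, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(num_rows, train_size, test_size):
    """Row counts for fractional split sizes (assumed ceil/floor rounding)."""
    n_test = math.ceil(test_size * num_rows)     # test size rounded up
    n_train = math.floor(train_size * num_rows)  # train size rounded down
    if n_train + n_test > num_rows:
        raise ValueError("train_size and test_size exceed the dataset size")
    return n_train, n_test


n_train, n_test = split_sizes(99, train_size=0.9, test_size=0.10)
```

For 99 rows this gives 89 training rows and 10 test rows, so with `BATCH_SIZE = 8` an epoch has 12 training steps, the last one on a partial batch.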

### Data Pre-processing

In [21]:
import transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)


In [22]:
if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

In [23]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["transcript"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

In [24]:
meetings_dataset = meetings_dataset.map(preprocess_function, batched=True)


## Defining the model

In [25]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [26]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [27]:
train_dataset = meetings_dataset["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = meetings_dataset["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    meetings_dataset["test"]
    .shuffle()
    .select(list(range(10)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)
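
Within each batch, the collator pads the sequences to a common length. The sketch below imitates this with plain lists, assuming inputs are padded with the tokenizer's pad ID and labels with -100 so that padded label positions can be ignored by the loss (which is why `metric_fn` later replaces negative label values before decoding). `pad_batch` and the token IDs are purely illustrative, not the library's implementation.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every example in a batch to the longest input and longest label."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_inputs - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_labels - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


# Made-up token IDs, for illustration only
features = [
    {"input_ids": [21, 22, 23, 1], "labels": [31, 1]},
    {"input_ids": [24, 1], "labels": [32, 33, 34, 1]},
]
padded = pad_batch(features)
```

Because padding is done per batch, each batch is only as long as its longest example rather than `MAX_INPUT_LENGTH`.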

## Building and compiling the model

In [28]:
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


## Training and evaluating our model

In [29]:
import keras_nlp

rouge_l = keras_nlp.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)



Epoch 1/5
