<a href="https://colab.research.google.com/github/vitsiupia/projektPython/blob/main/t5_abstractive_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Business Meeting Summary Generation Using T5

In [None]:
!pip install keras_nlp==0.3.0
!pip install huggingface-hub
!pip install datasets transformers rouge-score nltk
!apt install git-lfs

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
# Importing the necessary libraries
import os
import logging
import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras
import zipfile
import pandas as pd

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['TF_CPP_MIN_LOG_LEVEL']

In [4]:
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)

# Loading the Data

Przygotujmy nasz korpus, żeby wyglądał dokładnie jak datasets pobrany z Huggingface, czyli miał kolumny jak: 'document' 'summary' 'id'. Zróbmy to w pandas dataframe...

... A później przerobimy na HF dataset takim kodem:

```
from datasets import Dataset
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
dataset = Dataset.from_pandas(df)
```

In [5]:
# Download our dataset with splits.
!wget https://github.com/vitsiupia/projektPython/raw/main/meetings_split.zip

--2023-06-01 16:12:02--  https://github.com/vitsiupia/projektPython/raw/main/meetings_split.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vitsiupia/projektPython/main/meetings_split.zip [following]
--2023-06-01 16:12:02--  https://raw.githubusercontent.com/vitsiupia/projektPython/main/meetings_split.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1034366 (1010K) [application/zip]
Saving to: ‘meetings_split.zip.1’


2023-06-01 16:12:03 (18.1 MB/s) - ‘meetings_split.zip.1’ saved [1034366/1034366]



In [6]:
# Unpack the dataset.
with zipfile.ZipFile('meetings_split.zip', 'r') as zip:
  zip.extractall()

In [7]:
data = []
meetings_folder = "meetings_split"
transcripts_folder, summaries_folder = "transcripts", "summaries"

# Przeszukiwanie folderów train/, val/, test/
for split in ['train','val', 'test']:
    transcripts_dir = os.path.join(meetings_folder, split, transcripts_folder)
    
    # Przechodzenie przez każdy transkrypt w folderzeю
    for transcript_name in os.listdir(transcripts_dir):
          code = transcript_name.split(".")[0]
            
          # Read the text inside the transcript.
          with open(os.path.join(transcripts_dir, transcript_name), "r") as file:
              transcript = file.read().strip()

          # Now read the text in the respective summary for train, val folders.
          if split != 'test':
              summary_name = code + ".abssumm.txt"
              summaries_dir = os.path.join(meetings_folder, split, summaries_folder)
              with open(os.path.join(summaries_dir,summary_name), "r") as file:
                  summary = file.read().strip()
          else:
              summary = ''
            
          data.append({
              "transcript": transcript,
               "summary": summary,
               "code": code,
               "split" : split,
          })
          # Usunięcie zmiennych transcript_text i summary_text
          del transcript, summary

# Tworzenie DataFrame z zebranych danych
ami_df = pd.DataFrame(data)

In [8]:
ami_df

Unnamed: 0,transcript,summary,code,split
0,nick industrial designer tool training yes you...,The group introduced themselves and their role...,ES2014a,train
1,maybe maybe maybe maybe bra thats thats mean g...,The User Interface Designer and the Industrial...,IS1002d,train
2,help screens black thats fine done right secon...,The Industrial Designer gave his presentation ...,ES2004b,train
3,go slides number three number two sorry final ...,The project manager goes through the minutes o...,IS1004d,train
4,making slide open underneath fold open dont kn...,The project manager presented the agenda and t...,TS3004d,train
...,...,...,...,...
166,make point well make point make status shot de...,,IN1001,test
167,assume well youve sort information well got gd...,,EN2009d,test
168,summary basing certain thresholds special that...,,EN2001d,test
169,wonder much meetings talking stuff meetings lo...,,EN2002a,test


In [9]:
ami_df.loc[ami_df.code.duplicated()].sort_values(by='code')

Unnamed: 0,transcript,summary,code,split


In [10]:
for folder_path in ["meetings_split/train/transcripts", "meetings_split/train/summaries", 
                    "meetings_split/val/transcripts", "meetings_split/val/summaries", "meetings_split/test/transcripts"]:
  file_count = len(os.listdir(folder_path))
  print(f"Ilość plików w folderze '{folder_path}': {file_count}") 

Ilość plików w folderze 'meetings_split/train/transcripts': 120
Ilość plików w folderze 'meetings_split/train/summaries': 120
Ilość plików w folderze 'meetings_split/val/transcripts': 22
Ilość plików w folderze 'meetings_split/val/summaries': 22
Ilość plików w folderze 'meetings_split/test/transcripts': 29


We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to prepare the data and get the metric we need to use for evaluation (to compare our model to the benchmark).

In [11]:
# # Define certain variables

# # Z wcześniejszego notatnika:
# # ...Najdłuższy transkrypt ma 3870 słów.
# # ...Najdłuższe podsumowanie ma 530 słów.
# # ...Najkrótsze podsumowanie ma 41 słów.

# MAX_INPUT_LENGTH = 1024
# MIN_TARGET_LENGTH = 10
# MAX_TARGET_LENGTH = 128
# BATCH_SIZE = 50  # Batch-size for training our model
# LEARNING_RATE = 2e-3  # Learning-rate for training our model
# MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# # This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
# MODEL_CHECKPOINT = "t5-small"

In [12]:
import datasets

# Split the DataFrame into train, validation and test DataFrames based on the 'split' column
train_df = ami_df[ami_df['split'] == 'train']
val_df = ami_df[ami_df['split'] == 'val']
test_df = ami_df[ami_df['split'] == 'test']

# Create datasets.Dataset objects from the train and validation DataFrames
train_dataset = datasets.Dataset.from_pandas(train_df.drop('split', axis=1))
val_dataset = datasets.Dataset.from_pandas(val_df.drop('split', axis=1))
test_dataset = datasets.Dataset.from_pandas(test_df.drop('split', axis=1))

# Create a DatasetDict containing train, val, test datasets
raw_datasets = datasets.DatasetDict({
    'train': train_dataset,
    'val': val_dataset,
    'test': test_dataset
})

The `dataset` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation and test set:

In [13]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['transcript', 'summary', 'code', '__index_level_0__'],
        num_rows: 120
    })
    val: Dataset({
        features: ['transcript', 'summary', 'code', '__index_level_0__'],
        num_rows: 22
    })
    test: Dataset({
        features: ['transcript', 'summary', 'code', '__index_level_0__'],
        num_rows: 29
    })
})

In [14]:
raw_datasets["train"][0]

{'transcript': 'nick industrial designer tool training yes youve lost microphone ill ill shall see get across without tangling everything theres one didnt think pens ill try red pen gonna go bear able draw well ill bash ooh ooh lost think ive knocked microphone well g well go small small bear animal looks nothing bear dunno maybe theres many cartoon characters made bear jungle book characters stuff great yes remote controls buttons little small theyre quite hard press maybe make something easy press buttons main function yes sort easy use buttons accessible easy use see yes sorry go go yes maybe could better instructions remote design remote control sort instructions would come slide four yes yes think said start television remote control maybe stick unless get told otherwise thats great right start first meeting right agenda first meeting twenty five minutes meeting get acquainted everyone want say seem sensible alastair project leader tool training project plan anyone thoughts tool t

In [15]:
metric = datasets.load_metric("rouge")
metric

  metric = datasets.load_metric("rouge")


Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be list of decoded strings:

In [16]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}

# Data Pre-Processing

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [18]:
MODEL_CHECKPOINT = "t5-small"
MAX_INPUT_LENGTH = 1024
MAX_TARGET_LENGTH = 128  # the notebook might crash if MAX_TARGET_LENGTH > 128

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later), you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

In [20]:
tokenizer("Hello, this is one sentence")

{'input_ids': [8774, 6, 48, 19, 80, 7142, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [21]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [22]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




In [23]:
if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This will ensure that an input longer that what the model selected can handle will be truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator) so we pad examples to the longest length in the batch and not the whole dataset.

In [24]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["transcript"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply this function on all the pairs of sentences in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [25]:
preprocess_function(raw_datasets["train"][:1])

{'input_ids': [[21603, 10, 3, 11191, 2913, 4378, 1464, 761, 4273, 25, 162, 1513, 18701, 3, 1092, 3, 1092, 1522, 217, 129, 640, 406, 3, 8967, 697, 762, 132, 7, 80, 737, 17, 317, 3, 3801, 3, 1092, 653, 1131, 4550, 3, 13366, 281, 4595, 3, 179, 3314, 168, 3, 1092, 3905, 107, 3, 32, 32, 107, 3, 32, 32, 107, 1513, 317, 3, 757, 7673, 15, 26, 18701, 168, 3, 122, 168, 281, 422, 422, 4595, 2586, 1416, 1327, 4595, 146, 29, 29, 32, 2087, 132, 7, 186, 15074, 2850, 263, 4595, 19126, 484, 2850, 2005, 248, 4273, 4322, 7415, 10634, 385, 422, 79, 60, 882, 614, 2785, 2087, 143, 424, 514, 2785, 10634, 711, 1681, 4273, 1843, 514, 169, 10634, 3551, 514, 169, 217, 4273, 8032, 281, 281, 4273, 2087, 228, 394, 3909, 4322, 408, 4322, 610, 1843, 3909, 133, 369, 9116, 662, 4273, 4273, 317, 243, 456, 4390, 4322, 610, 2087, 4372, 3, 3227, 129, 1219, 2904, 24, 7, 248, 269, 456, 166, 1338, 269, 8101, 166, 1338, 6786, 874, 676, 1338, 129, 29740, 921, 241, 497, 1727, 11743, 491, 9, 15303, 516, 2488, 1464, 761, 516, 515,

In [26]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/120 [00:00<?, ? examples/s]

Map:   0%|          | 0/22 [00:00<?, ? examples/s]

Map:   0%|          | 0/29 [00:00<?, ? examples/s]


Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.

# Fine-Tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [27]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Next we set some parameters like the learning rate and the `batch_size`and customize the weight decay. 

In [28]:
batch_size = 8
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!

We also want to compute `ROUGE` metrics, which will require us to generate text from our model.

In [29]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np")

generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np", 
                                                  pad_to_multiple_of=128)

Next, we convert our datasets to `tf.data.Dataset`, which Keras understands natively. There are two ways to do this - we can use the slightly more low-level [`Dataset.to_tf_dataset()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset) method, or we can use [`Model.prepare_tf_dataset()`](https://huggingface.co/docs/transformers/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset). The main difference between these two is that the `Model` method can inspect the model to determine which column names it can use as input, which means you don't need to specify them yourself. Make sure to specify the collator we just created as our `collate_fn`!

In [31]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
val_dataset = tokenized_datasets["val"].to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset =  tokenized_datasets["val"].to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=generation_data_collator
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Now we initialize our loss and optimizer and compile the model. Note that most Transformers models compute loss internally - we can train on this as our loss value simply by not specifying a loss when we `compile()`.

In [32]:
from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


### Training and evaluating the model

Now we can train our model. We can also add a few optional callbacks here.
- TensorBoard is a built-in Keras callback that logs TensorBoard metrics.
- KerasMetricCallback is a callback for computing advanced metrics. There are a number of common metrics in NLP like ROUGE which are hard to fit into your compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer, and calling arbitrary Python functions to compute the metric. The KerasMetricCallback will wrap a metric function, outputting metrics as training progresses.

The KerasMetricCallback callback takes two main arguments - a `metric_fn` and an `eval_dataset`. It then iterates over the `eval_dataset` and collects the model's outputs for each sample, before passing the `list` of predictions and the associated `list` of labels to the user-defined `metric_fn`. If the `predict_with_generate` argument is `True`, then it will call `model.generate()` for each input sample instead of `model.predict()` - this is useful for metrics that expect generated text from the model, like `ROUGE`.

This callback allows complex metrics to be computed each epoch that would not function as a standard Keras Metric. Metric values are printed each epoch, and can be used by other callbacks like `TensorBoard` or `EarlyStopping`.

In [33]:
import numpy as np
import nltk


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Rouge expects a newline after each sentence
    decoded_predictions = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    result = metric.compute(predictions=decoded_predictions, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return result

And now we can try training our model. By default, we only do a single epoch of training here, as the inputs are very long, which means training is quite slow. However, you may wish to experiment with larger pre-trained models and longer training runs if you want to maximize the quality of your summaries.

In [34]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [35]:
from transformers.keras_callbacks import KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(log_dir="./summarization_model_save/logs")
metric_callback = KerasMetricCallback(metric_fn, eval_dataset=generation_dataset, 
                                      predict_with_generate=True, use_xla_generation=True)
callbacks = [metric_callback, tensorboard_callback]



In [36]:
model.fit(train_dataset, validation_data=val_dataset, epochs=num_train_epochs, callbacks=callbacks)



  return py_builtins.overload_of(f)(*args)




<keras.callbacks.History at 0x7fcb63b90ca0>

# Inference

### Pipeline API

Now we will try to infer the model we trained on an arbitary transcript. To do so, we will use the pipeline method from Hugging Face Transformers. Hugging Face Transformers provides us with a variety of pipelines to choose from. For our task, we use the summarization pipeline.

The pipeline method takes in the trained model and tokenizer as arguments. The framework="tf" argument ensures that you are passing a model that was trained with TF.

In [44]:
doc = raw_datasets["val"][0]["transcript"]
sum = raw_datasets["val"][0]["summary"]

In [41]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")
# summarizer = pipeline('text2text-generation', model_name, framework="tf")

summarizer(doc, min_length=10, max_length=MAX_TARGET_LENGTH)

[{'summary_text': 'names tara user interface designer also responsible functional design phase conceptual design phase user interface design elk vicious sheep really good quite good think favourite animal would dog really sure draw one ive never drawn dog dont think tempted draw snail cause draw sometimes theyre really easy draw right gonna really funny dog cause sure draw dog suppose lon god right yous know supposed dog dogs theyre good humans trained police dogs . theyre friendly animals dont look yous find theres lot buttons remote control dont know half dont know nothing welcome meeting everyone gonna attempt make'}]

In [45]:
sum

'The project manager opens the meeting by introducing herself and asking everyone to say their name and role in the group. She then states the agenda of the meeting and tells them that they will be designing and creating a new remote control that should be trendy and user-friendly. The meetings will focus on functional, conceptual, and detailed design. Next, each group member draws their favorite animal on the whiteboard and explains the characteristics of that animal. After that the project manager covers the project budget, and then they begin discussing their personal experiences with remote controls and how they want their remote to look. Then the project manager closes the meeting by telling each group member what to do in preparation for the next meeting.'

In [59]:
doc = raw_datasets["val"][1]["transcript"]
sum = raw_datasets["val"][1]["summary"]

In [60]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")
# summarizer = pipeline('text2text-generation', model_name, framework="tf")

summarizer(doc, min_length=10, max_length=MAX_TARGET_LENGTH)

[{'summary_text': 'uhoh maybe full page spent lots time talking about remote control . sammy benjo expert design artist thats gonna less technical functions user interface current intentions everything linked next slide .'}]

In [61]:
sum

"The project manager opened the meeting and stated the agenda to the team members. The marketing expert discussed the findings of a survey which indicated that current remotes are ugly, difficult to use, have a number of unused buttons, frustrate users when misplaced, and contribute to RSI. The marketing expert also stated that young users like speech recognition and that users in general want buttons for power, channel selection, volume control, and a few lesser used settings. The user interface designer presented existing remotes to exemplify the need for simpler designs, discussed the use of components such as titanium and a back-lit LCD screen, and discussed other features to consider such as color options. The industrial designer discussed the interior workings of a remote and how to handle universal capability and speech recognition. After the project manager's closing, the project manager recapped some decisions and the team discussed how to handle the issue of locating a remote

In [54]:
doc = raw_datasets["test"][0]["transcript"]
doc

'get seven ten one bad think topic box based whats called nom nite text area something text area sort highlighting stuff probably look sure inbuilt though honest wh wh dont ill w thinking going direction anyway see mean think think already calculated data supposed right come back numbers whatever give think mean whenever open window one speaker sh pretty much would thought definitely one present although could th could one highlighted suppose mean would would implementing said wanna see per topic wanna see one spoken something could could zap topics meetings sorry would come much spoke pick one looks interesting wanna definitely random things return talking time good mean put topic spoken well could use topic pop window thatd r wasnt moment rightclick topic window one topics option pop popping open window gives list meetings containing topic would would calculate would summarisation populate know global right sta stored together right sounds good done yet gotta decide wanna definitely 

In [57]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")
# summarizer = pipeline('text2text-generation', model, tokenizer=tokenizer, framework="tf")

summarizer(doc, min_length=10, max_length=MAX_TARGET_LENGTH)

[{'generated_text': "nxt is a nfl search engine that uses a search engine to search for a specific topic . the search engine has compiled a list of ten different types of information that can be accessed by a single user . a number of 'seens' are used to find a topic that has been viewed by users . it's a good idea to use a different type of information to search ."}]

### Generate Manually

Now let's try tokenizing a document from the training set. Don't forget to add 'summarize:' at the start if you're using a `T5` model.

In [49]:
doc = raw_datasets["val"][0]["transcript"]
sum = raw_datasets["val"][0]["summary"]

if 't5' in MODEL_CHECKPOINT: 
    document = "summarize: " + doc
tokenized = tokenizer([doc], return_tensors='np')
out = model.generate(**tokenized, max_length=128)

In [50]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0]))

<pad><extra_id_0> id camera thats gonna favourite animal snow seemed say spring finally go yes well hundred percent profit twelve fifty find dark often hard know button youre pushing theres remote controls theres kind hidden panel buttons dont really use unless youre programming something thats useful number buttons power button shaped remote interface industrial marketing expert.</s><pad><pad>


In [48]:
sum

'The project manager opens the meeting by introducing herself and asking everyone to say their name and role in the group. She then states the agenda of the meeting and tells them that they will be designing and creating a new remote control that should be trendy and user-friendly. The meetings will focus on functional, conceptual, and detailed design. Next, each group member draws their favorite animal on the whiteboard and explains the characteristics of that animal. After that the project manager covers the project budget, and then they begin discussing their personal experiences with remote controls and how they want their remote to look. Then the project manager closes the meeting by telling each group member what to do in preparation for the next meeting.'

In [42]:
doc = raw_datasets["test"][0]['transcript']

if 't5' in MODEL_CHECKPOINT: 
    document = "summarize: " + doc
tokenized = tokenizer([doc], return_tensors='np')
out = model.generate(**tokenized, max_length=128)

In [43]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0]))

<pad> nxt is a nfl search engine that uses a search engine to search for a specific topic. the search engine has compiled a list of ten different types of information that can be accessed by a single user. a number of'seens' are used to find a topic that has been viewed by users. it's a good idea to use a different type of information to search.</s>


