When you want machine learning to convey the meaning of a text, it can do one of two things: rephrase the information, or just show you the most important parts of the content. The first approach is called **abstractive summarization**, while the second is called **extractive summarization**. Neither task is easy, and both have their own limitations even in the current state of the art. 

Extractive summarization often fails to organize sentences in a natural way, so the resulting summaries can be hard to read and sometimes fail to convey the gist of the content. Meanwhile, current state-of-the-art deep learning models like GPT-3, GPT-2, BERT, etc. help us generate paraphrased, human-like summaries that read well, but their correctness is often questionable. Here we give an overview of both techniques, apply them, and look at what each can extract or paraphrase from the source text.


https://blog.paperspace.com/generating-text-summaries-gpt-2/

Remember to RESTART the Runtime after you run the following cell to apply the changes.

In [None]:
!pip install bert-extractive-summarizer
!pip install sentencepiece
# Restart the runtime after this cell finishes.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-extractive-summarizer
  Downloading bert_extractive_summarizer-0.10.1-py3-none-any.whl (25 kB)
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers, bert-extractive-summarizer
Successfully installed bert-extractive-summarizer-0.10.1 huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.24.0


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

This tool utilizes the HuggingFace Pytorch transformers library to run extractive summarizations. This works by first embedding the sentences, then running a clustering algorithm, finding the sentences that are closest to the cluster's centroids.
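
To make the embed, cluster, and pick-nearest idea concrete, here is a minimal pure-Python sketch. It is not the library's implementation: a toy bag-of-words vector stands in for the BERT sentence embeddings, k-means is assumed as the clustering algorithm, and the names `embed`, `kmeans`, and `extractive_summary` are hypothetical helpers introduced only for illustration.

```python
import math
import re


def embed(sentence, vocab):
    """Toy bag-of-words embedding; a stand-in for BERT sentence vectors."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [float(words.count(term)) for term in vocab]


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=10):
    """Plain k-means, seeded with the first k vectors so runs are repeatable."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: distance(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def extractive_summary(sentences, k):
    """Pick, for each cluster centroid, the sentence closest to it."""
    vocab = sorted({w for s in sentences for w in re.findall(r"[a-z]+", s.lower())})
    vectors = [embed(s, vocab) for s in sentences]
    centroids = kmeans(vectors, k)
    chosen = {
        min(range(len(sentences)), key=lambda i: distance(vectors[i], c))
        for c in centroids
    }
    # Return the picked sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "The model trains on the device.",
    "Training on the device keeps data private.",
    "The paper will be presented at a conference.",
    "The conference is held every year.",
]
summary = extractive_summary(sentences, k=2)
print(summary)
```

Because every selected sentence is copied verbatim from the input, the result is extractive by construction; only the choice of which sentences to keep depends on the embeddings.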


In [None]:
body = '''
Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. For instance, training a model on a smart keyboard could enable the keyboard to continually learn from the user's writing. However, the training process requires so much memory that it is typically done using powerful computers at a data center, before the model is deployed on a device. This is more costly and raises privacy issues since user data must be sent to a central server.

To address this problem, researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory. Other training solutions designed for connected devices can use more than 500 megabytes of memory, greatly exceeding the 256-kilobyte capacity of most microcontrollers (there are 1,024 kilobytes in one megabyte).

The intelligent algorithms and framework the researchers developed reduce the amount of computation required to train a model, which makes the process faster and more memory efficient. Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes.

This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. It also could enable customization of a model based on the needs of users. Moreover, the framework preserves or improves the accuracy of the model when compared to other training approaches.

"Our study enables IoT devices to not only perform inference but also continuously update the AI models to newly collected data, paving the way for lifelong on-device learning. The low resource utilization makes deep learning more accessible and can have a broader reach, especially for low-power edge devices," says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the paper describing this innovation.

Joining Han on the paper are co-lead authors and EECS PhD students Ji Lin and Ligeng Zhu, as well as MIT postdocs Wei-Ming Chen and Wei-Chen Wang, and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the Conference on Neural Information Processing Systems.

Han and his team previously addressed the memory and computational bottlenecks that exist when trying to run machine-learning models on tiny edge devices, as part of their TinyML initiative.
The intelligent algorithms and framework the researchers developed reduce the amount of computation required to train a model, which makes the process faster and more memory efficient. Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes. This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. It also could enable customization of a model based on the needs of users. Moreover, the framework preserves or improves the accuracy of the model when compared to other training approaches.
'''


# Extractive Text Summarization

This is the traditional approach: its main objective is to identify the most significant sentences in the text and add them to the summary. Note that the resulting summary consists of exact sentences from the original text.

In [None]:
from summarizer import Summarizer

bert_model = Summarizer()


Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
result = bert_model(body, min_length=60, num_sentences=5)
BERT_Summary = ''.join(result).replace('. ', '. \n')
print(BERT_Summary)

Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. 
Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes. 
This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. 
Moreover, the framework preserves or improves the accuracy of the model when compared to other training approaches. 
Han and his team previously addressed the memory and computational bottlenecks that exist when trying to run machine-learning models on tiny edge devices, as part of their TinyML initiative.


In [None]:
from summarizer import Summarizer, TransformerSummarizer

Let's explore the power of another beast: the Generative Pre-trained Transformer 2 (GPT-2). Its largest variant has about 1.5 billion parameters, while the `gpt2-medium` checkpoint used below has roughly 355 million. The more recent GPT-3 has 175 billion parameters and can write everything from software code to mind-blowing stories!


In [None]:
GPT2_model = TransformerSummarizer(transformer_type="GPT2",transformer_model_key="gpt2-medium")



In [None]:
GPT_Summary = ''.join(GPT2_model(body, min_length=60, num_sentences=5)).replace('. ', '. \n')
print(GPT_Summary)

Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. 
The intelligent algorithms and framework the researchers developed reduce the amount of computation required to train a model, which makes the process faster and more memory efficient. 
This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. 
It also could enable customization of a model based on the needs of users. 
The low resource utilization makes deep learning more accessible and can have a broader reach, especially for low-power edge devices," says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the paper describing this innovation.


In [None]:
print('SUMMARY RESULT:\n\n\tBERT: {}\n\n\tGPT-2: {}'.format(BERT_Summary, GPT_Summary))

SUMMARY RESULT:

	BERT: Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. 
Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes. 
This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. 
Moreover, the framework preserves or improves the accuracy of the model when compared to other training approaches. 
Han and his team previously addressed the memory and computational bottlenecks that exist when trying to run machine-learning models on tiny edge devices, as part of their TinyML initiative.

	GPT-2: Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. 
The intelligent algorithms and framework the researchers developed reduce the amount of computation required to train a model, which mak

# Abstractive Summarization

In [None]:
from transformers import pipeline

In [None]:
from transformers import pipeline

# With no model specified, the pipeline falls back to sshleifer/distilbart-cnn-12-6 (a distilled BART).
summarizer = pipeline("summarization")
summarizer_t5 = pipeline("summarization", model="t5-base")


ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
ARTICLE = '''Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. For instance, training a model on a smart keyboard could enable the keyboard to continually learn from the user's writing. However, the training process requires so much memory that it is typically done using powerful computers at a data center, before the model is deployed on a device. This is more costly and raises privacy issues since user data must be sent to a central server.

To address this problem, researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory. Other training solutions designed for connected devices can use more than 500 megabytes of memory, greatly exceeding the 256-kilobyte capacity of most microcontrollers (there are 1,024 kilobytes in one megabyte).

The intelligent algorithms and framework the researchers developed reduce the amount of computation required to train a model, which makes the process faster and more memory efficient. Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes.

This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. It also could enable customization of a model based on the needs of users. Moreover, the framework preserves or improves the accuracy of the model when compared to other training approaches.

"Our study enables IoT devices to not only perform inference but also continuously update the AI models to newly collected data, paving the way for lifelong on-device learning. The low resource utilization makes deep learning more accessible and can have a broader reach, especially for low-power edge devices," says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the paper describing this innovation.

Joining Han on the paper are co-lead authors and EECS PhD students Ji Lin and Ligeng Zhu, as well as MIT postdocs Wei-Ming Chen and Wei-Chen Wang, and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the Conference on Neural Information Processing Systems.

Han and his team previously addressed the memory and computational bottlenecks that exist when trying to run machine-learning models on tiny edge devices, as part of their TinyML initiative.
The intelligent algorithms and framework the researchers developed reduce the amount of computation required to train a model, which makes the process faster and more memory efficient. Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes. This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. It also could enable customization of a model based on the needs of users. Moreover, the framework preserves or improves the accuracy of the model when compared to other training approaches.
'''

In [None]:
summary_BART = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)[0]['summary_text']
summary_t5 = summarizer_t5(ARTICLE, max_length=130, min_length=30, do_sample=False)[0]['summary_text']



Token indices sequence length is longer than the specified maximum sequence length for this model (691 > 512). Running this sequence through the model will result in indexing errors
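
The warning above means the article (691 tokens) is longer than the 512 tokens the model was configured for. One common workaround is to split the input into smaller pieces and summarize each piece separately. The sketch below is only an illustration under stated assumptions: `chunk_by_words` is a hypothetical helper, whitespace-separated words are used as a rough proxy for tokens, and the 300-word budget is an arbitrary choice rather than a recommended value.

```python
def chunk_by_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


sample = "word " * 700
chunks = chunk_by_words(sample, max_words=300)
print([len(c.split()) for c in chunks])  # [300, 300, 100]
```

Each chunk could then be passed to a summarization pipeline on its own and the partial summaries joined, at the cost of losing context that spans chunk boundaries.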


In [None]:
print('BART Summary: {}\n\n'.format(summary_BART.replace(' . ', '.\n')))
print('T5 Summary: {}\n\n'.format(summary_t5.replace(' . ', '.\n')))

BART Summary:  Researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory.
The research will be presented at the Conference on Neural Information Processing Systems .


T5 Summary: researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique.
it enables on-device training using less than a quarter of a megabyte of memory.
technique can be used to train a machine-learning model on a microcontroller in minutes .




https://huggingface.co/blog/how-to-generate

# Assignment

* Use another model for extractive summarization.
* Use another model for abstractive summarization (from `bart-large-cnn`, `t5-small`, `t5-base`, `t5-large`, `t5-3b`, `t5-11b`). Modify some generation parameters.
* Perform extractive and abstractive summarization on another real data source (e.g., Reddit or Twitter), covering various topics and threads. Automatically summarize user posts, comments, and discussions.
* Compare the quality of the two approaches (e.g., semantic accuracy, correctness, coherence, and natural flow).
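
For the comparison step, one simple and admittedly shallow signal is word overlap between two summaries. The sketch below computes a unigram-overlap F1 score in the spirit of ROUGE-1. This metric is not part of the original notebook, `unigram_f1` is a hypothetical helper name, and surface overlap says nothing about correctness or coherence, so treat it as one input among several.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two texts (a ROUGE-1-style score)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # words shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


a = "researchers developed a new technique"
b = "researchers at MIT developed a technique"
print(round(unigram_f1(a, b), 3))  # 0.727
```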

# Extractive Summarization

In [None]:
from transformers import pipeline
#summarizer1 = pipeline('summarization', model="sshleifer/distilbart-cnn-12-6")
#print(summarizer1(ARTICLE))

# Note: mT5 XLSum is a sequence-to-sequence model, so its summaries are generated (abstractive).
summarizer_XLSum = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")
summarizer_XLSum(ARTICLE)


  "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"


[{'summary_text': 'Researchers at the Massachusetts Institute of Technology (MIT) and IBM have developed a new technique to train artificial intelligence (AI) models on connected devices.'}]

In [None]:
print(summarizer_XLSum(ARTICLE))

[{'summary_text': 'Researchers at the Massachusetts Institute of Technology (MIT) and IBM have developed a new technique to train artificial intelligence (AI) models on connected devices.'}]


In [None]:
from transformers import pipeline
summarizer_distilbart = pipeline('summarization', model="sshleifer/distilbart-cnn-12-6")
print(summarizer_distilbart(ARTICLE))

[{'summary_text': ' Researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory . Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes . The research will be presented at the Conference on Neural Information Processing Systems .'}]


In [None]:

import numpy as np
import pandas as pd
import tensorflow as tf
import transformers



In [None]:
max_length = 128  # Maximum length of input sentence to the model.
batch_size = 32
epochs = 2
# Labels in our dataset.
labels = ["contradiction", "entailment", "neutral"]

# Build the model for similarity

In [None]:
# Create the model under a distribution strategy scope.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Encoded token ids from BERT tokenizer.
    input_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="input_ids"
    )
    # Attention masks indicate to the model which tokens should be attended to.
    attention_masks = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="attention_masks"
    )
    # Token type ids are binary masks identifying different sequences in the model.
    token_type_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="token_type_ids"
    )
    # Loading pretrained BERT model.
    bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
    # Freeze the BERT model to reuse the pretrained features without modifying them.
    bert_model.trainable = False

    bert_output = bert_model.bert(
        input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids
    )
    sequence_output = bert_output.last_hidden_state
    pooled_output = bert_output.pooler_output
    # Add trainable layers on top of frozen layers to adapt the pretrained features on the new data.
    bi_lstm = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    )(sequence_output)
    # Applying hybrid pooling approach to bi_lstm sequence output.
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
    concat = tf.keras.layers.concatenate([avg_pool, max_pool])
    dropout = tf.keras.layers.Dropout(0.3)(concat)
    output = tf.keras.layers.Dense(3, activation="softmax")(dropout)
    model = tf.keras.models.Model(
        inputs=[input_ids, attention_masks, token_type_ids], outputs=output
    )

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="categorical_crossentropy",
        metrics=["acc"],
    )


print(f"Strategy: {strategy}")
model.summary()





Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Strategy: <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7fe1b3252f90>
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 128)]        0           []                               
                                                                                                  
 attention_masks (InputLayer)   [(None, 128)]        0           []                               
                                                                                                  
 token_type_ids (InputLayer)    [(None, 128)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',        
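
The "hybrid pooling" step in the model above concatenates a global average and a global maximum taken over the time axis of the Bi-LSTM output. The pure-Python sketch below illustrates that operation on a tiny made-up sequence; the helper names are hypothetical and the numbers are for illustration only.

```python
def global_average_pool(sequence):
    """Average each feature over all time steps."""
    steps = len(sequence)
    return [sum(feature) / steps for feature in zip(*sequence)]


def global_max_pool(sequence):
    """Take the maximum of each feature over all time steps."""
    return [max(feature) for feature in zip(*sequence)]


# Toy sequence: 3 time steps, 2 features per step.
sequence = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
pooled = global_average_pool(sequence) + global_max_pool(sequence)
print(pooled)  # [2.0, 2.0, 3.0, 4.0]
```

In the Keras model, the bidirectional LSTM with 64 units per direction yields 128 features per time step, so the concatenated vector fed to the dropout layer has 256 features.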

# Fine-tune the BERT model for similarity

In [None]:
# Unfreeze the bert_model.
bert_model.trainable = True
# Recompile the model to make the change effective.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 128)]        0           []                               
                                                                                                  
 attention_masks (InputLayer)   [(None, 128)]        0           []                               
                                                                                                  
 token_type_ids (InputLayer)    [(None, 128)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_masks[0][0]',    

In [None]:
class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    """Generates batches of data.

    Args:
        sentence_pairs: Array of premise and hypothesis input sentences.
        labels: Array of labels.
        batch_size: Integer batch size.
        shuffle: boolean, whether to shuffle the data.
        include_targets: boolean, whether to include the labels.

    Returns:
        Tuples `([input_ids, attention_mask, token_type_ids], labels)`
        (or just `[input_ids, attention_mask, token_type_ids]`
         if `include_targets=False`)
    """

    def __init__(
        self,
        sentence_pairs,
        labels,
        batch_size=batch_size,
        shuffle=True,
        include_targets=True,
    ):
        self.sentence_pairs = sentence_pairs
        self.labels = labels
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_targets = include_targets
        # Load our BERT Tokenizer to encode the text.
        # We will use the bert-base-uncased pretrained model.
        self.tokenizer = transformers.BertTokenizer.from_pretrained(
            "bert-base-uncased", do_lower_case=True
        )
        self.indexes = np.arange(len(self.sentence_pairs))
        self.on_epoch_end()

    def __len__(self):
        # Denotes the number of batches per epoch.
        return len(self.sentence_pairs) // self.batch_size

    def __getitem__(self, idx):
        # Retrieves the batch of index.
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        sentence_pairs = self.sentence_pairs[indexes]

        # With BERT tokenizer's batch_encode_plus, both sentences of each pair are
        # encoded together and separated by the [SEP] token.
        encoded = self.tokenizer.batch_encode_plus(
            sentence_pairs.tolist(),
            add_special_tokens=True,
            max_length=max_length,
            padding="max_length",  # replaces the deprecated pad_to_max_length=True
            truncation=True,  # explicitly truncate pairs longer than max_length
            return_attention_mask=True,
            return_token_type_ids=True,
            return_tensors="tf",
        )

        # Convert batch of encoded features to numpy array.
        input_ids = np.array(encoded["input_ids"], dtype="int32")
        attention_masks = np.array(encoded["attention_mask"], dtype="int32")
        token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")

        # Set to true if data generator is used for training/validation.
        if self.include_targets:
            labels = np.array(self.labels[indexes], dtype="int32")
            return [input_ids, attention_masks, token_type_ids], labels
        else:
            return [input_ids, attention_masks, token_type_ids]

    def on_epoch_end(self):
        # Shuffle indexes after each epoch if shuffle is set to True.
        if self.shuffle:
            np.random.RandomState(42).shuffle(self.indexes)
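
To show what the three arrays returned by the generator contain, here is a small sketch of how a sentence pair is laid out for BERT. It works on already-split tokens instead of vocabulary ids and does not call the real tokenizer; `encode_pair` is a hypothetical helper, and it simply rejects pairs that do not fit rather than truncating them.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] plus padding."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("pair is longer than max_length; truncate it first")
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    padding = max_length - len(tokens)
    tokens = tokens + ["[PAD]"] * padding
    token_type_ids = token_type_ids + [0] * padding
    attention_mask = attention_mask + [0] * padding  # 0 marks padding to ignore
    return tokens, attention_mask, token_type_ids


tokens, attention_mask, token_type_ids = encode_pair(
    ["a", "dog", "runs"], ["an", "animal", "moves"], max_length=10
)
print(tokens)
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
```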


In [None]:
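def check_similarity(sentence1, sentence2):
    """Predict the NLI label for a sentence pair and the model's confidence in it."""
    # Inputs are converted with str(), so pass plain summary strings.
    sentence_pairs = np.array([[str(sentence1), str(sentence2)]])
    test_data = BertSemanticDataGenerator(
        sentence_pairs, labels=None, batch_size=1, shuffle=False, include_targets=False,
    )

    # Softmax probabilities over the three labels for the single pair.
    proba = model.predict(test_data[0])[0]
    idx = np.argmax(proba)
    # proba[idx] is a probability in [0, 1]; it is not scaled by 100 before the "%" is appended.
    proba = f"{proba[idx]: .2f}%"
    pred = labels[idx]
    return pred, proba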
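
The softmax layer returns probabilities between 0 and 1, so reporting them as percentages requires scaling by 100. The sketch below shows that decoding step in isolation; `decode_prediction` is a hypothetical helper and the probability values are made up for illustration.

```python
def decode_prediction(probabilities, labels):
    """Return the most likely label and its probability as a percentage string."""
    idx = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[idx], f"{probabilities[idx] * 100:.2f}%"


labels = ["contradiction", "entailment", "neutral"]
print(decode_prediction([0.20, 0.66, 0.14], labels))  # ('entailment', '66.00%')
```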


# Calculate Similarity

In [None]:
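# Each pipeline call returns a list of dicts, so extract the summary strings before comparing.
check_similarity(
    summarizer_XLSum(ARTICLE)[0]["summary_text"],
    summarizer_distilbart(ARTICLE)[0]["summary_text"],
)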

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.




('entailment', ' 0.66%')

# Abstractive Summarization

In [None]:
from transformers import pipeline

# With no model specified, the pipeline falls back to sshleifer/distilbart-cnn-12-6 (a distilled BART).
summarizer = pipeline("summarization")
summarizer_t53 = pipeline("summarization", model="t5-small")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
summary_BART = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)[0]['summary_text']
summary_t53 = summarizer_t53(ARTICLE, max_length=130, min_length=30, do_sample=False)[0]['summary_text']

Token indices sequence length is longer than the specified maximum sequence length for this model (691 > 512). Running this sequence through the model will result in indexing errors


In [None]:
print('BART Summary: {}\n\n'.format(summary_BART.replace(' . ', '.\n')))
print('T5-small Summary: {}\n\n'.format(summary_t53.replace(' . ', '.\n')))

BART Summary:  Researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory.
The research will be presented at the Conference on Neural Information Processing Systems .


T5-small Summary: training a machine-learning model on an intelligent edge device allows it to adapt to new data.
this is more costly and raises privacy issues since user data must be sent to a central server.
other training solutions can use more than 500 megabytes of memory, exceeding the 256-kilobyte capacity of most microcontrollers .




# Calculate Similarity

In [None]:
check_similarity(summary_BART, summary_t53)





('entailment', ' 0.65%')

Note: the `t5-3b` checkpoint is too large to fit in the RAM available on the free runtime, so `t5-small` was used above.

# Applying the Summarizers to a Different Kind of Writing

# Extractive Summarization

In [None]:
para = '''

The epic’s prelude offers a general introduction to Gilgamesh, king of Uruk, who was two-thirds god and one-third man. He built magnificent ziggurats, or temple towers, surrounded his city with high walls, and laid out its orchards and fields. He was physically beautiful, immensely strong, and very wise. Although Gilgamesh was godlike in body and mind, he began his kingship as a cruel despot. He lorded over his subjects, raping any woman who struck his fancy, whether she was the wife of one of his warriors or the daughter of a nobleman. He accomplished his building projects with forced labor, and his exhausted subjects groaned under his oppression. The gods heard his subjects’ pleas and decided to keep Gilgamesh in check by creating a wild man named Enkidu, who was as magnificent as Gilgamesh. Enkidu became Gilgamesh’s great friend, and Gilgamesh’s heart was shattered when Enkidu died of an illness inflicted by the gods. Gilgamesh then traveled to the edge of the world and learned about the days before the deluge and other secrets of the gods, and he recorded them on stone tablets.

The epic begins with Enkidu. He lives with the animals, suckling at their breasts, grazing in the meadows, and drinking at their watering places. A hunter discovers him and sends a temple prostitute into the wilderness to tame him. In that time, people considered women and sex calming forces that could domesticate wild men like Enkidu and bring them into the civilized world. When Enkidu sleeps with the woman, the animals reject him since he is no longer one of them. Now, he is part of the human world. Then the harlot teaches him everything he needs to know to be a man. Enkidu is outraged by what he hears about Gilgamesh’s excesses, so he travels to Uruk to challenge him. When he arrives, Gilgamesh is about to force his way into a bride’s wedding chamber. Enkidu steps into the doorway and blocks his passage. The two men wrestle fiercely for a long time, and Gilgamesh finally prevails. After that, they become friends and set about looking for an adventure to share.

Gilgamesh and Enkidu decide to steal trees from a distant cedar forest forbidden to mortals. A terrifying demon named Humbaba, the devoted servant of Enlil, the god of earth, wind, and air, guards it. The two heroes make the perilous journey to the forest, and, standing side by side, fight with the monster. With assistance from Shamash the sun god, they kill him. Then they cut down the forbidden trees, fashion the tallest into an enormous gate, make the rest into a raft, and float on it back to Uruk. Upon their return, Ishtar, the goddess of love, is overcome with lust for Gilgamesh. Gilgamesh spurns her. Enraged, the goddess asks her father, Anu, the god of the sky, to send the Bull of Heaven to punish him. The bull comes down from the sky, bringing with him seven years of famine. Gilgamesh and Enkidu wrestle with the bull and kill it. The gods meet in council and agree that one of the two friends must be punished for their transgression, and they decide Enkidu is going to die. He takes ill, suffers immensely, and shares his visions of the underworld with Gilgamesh. When he finally dies, Gilgamesh is heartbroken.

Gilgamesh can’t stop grieving for Enkidu, and he can’t stop brooding about the prospect of his own death. Exchanging his kingly garments for animal skins as a way of mourning Enkidu, he sets off into the wilderness, determined to find Utnapishtim, the Mesopotamian Noah. After the flood, the gods had granted Utnapishtim eternal life, and Gilgamesh hopes that Utnapishtim can tell him how he might avoid death too. Gilgamesh’s journey takes him to the twin-peaked mountain called Mashu, where the sun sets into one side of the mountain at night and rises out of the other side in the morning. Utnapishtim lives beyond the mountain, but the two scorpion monsters that guard its entrance refuse to allow Gilgamesh into the tunnel that passes through it. Gilgamesh pleads with them, and they relent.

After a harrowing passage through total darkness, Gilgamesh emerges into a beautiful garden by the sea. There he meets Siduri, a veiled tavern keeper, and tells her about his quest. She warns him that seeking immortality is futile and that he should be satisfied with the pleasures of this world. However, when she can’t turn him away from his purpose, she directs him to Urshanabi, the ferryman. Urshanabi takes Gilgamesh on the boat journey across the sea and through the Waters of Death to Utnapishtim. Utnapishtim tells Gilgamesh the story of the flood—how the gods met in council and decided to destroy humankind. Ea, the god of wisdom, warned Utnapishtim about the gods’ plans and told him how to fashion a gigantic boat in which his family and the seed of every living creature might escape. When the waters finally receded, the gods regretted what they’d done and agreed that they would never try to destroy humankind again. Utnapishtim was rewarded with eternal life. Men would die, but humankind would continue.

When Gilgamesh insists that he be allowed to live forever, Utnapishtim gives him a test. If you think you can stay alive for eternity, he says, surely you can stay awake for a week. Gilgamesh tries and immediately fails. So Utnapishtim orders him to clean himself up, put on his royal garments again, and return to Uruk where he belongs. Just as Gilgamesh is departing, however, Utnapishtim’s wife convinces him to tell Gilgamesh about a miraculous plant that restores youth. Gilgamesh finds the plant and takes it with him, planning to share it with the elders of Uruk. But a snake steals the plant one night while they are camping. As the serpent slithers away, it sheds its skin and becomes young again.

When Gilgamesh returns to Uruk, he is empty-handed but reconciled at last to his mortality. He knows that he can’t live forever but that humankind will. Now he sees that the city he had repudiated in his grief and terror is a magnificent, enduring achievement—the closest thing to immortality to which a mortal can aspire.
'''

In [None]:
from summarizer import Summarizer

bert_model = Summarizer()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
result = bert_model(para, min_length=60, num_sentences=5)
BERT_Summary = ''.join(result).replace('. ', '. \n')
print(BERT_Summary)

The epic’s prelude offers a general introduction to Gilgamesh, king of Uruk, who was two-thirds god and one-third man. 
When Enkidu sleeps with the woman, the animals reject him since he is no longer one of them. 
Exchanging his kingly garments for animal skins as a way of mourning Enkidu, he sets off into the wilderness, determined to find Utnapishtim, the Mesopotamian Noah. 
Gilgamesh finds the plant and takes it with him, planning to share it with the elders of Uruk. 
When Gilgamesh returns to Uruk, he is empty-handed but reconciled at last to his mortality.


In [None]:
from summarizer import Summarizer, TransformerSummarizer

In [None]:
GPT2_model = TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium")

In [None]:
GPT_Summary = ''.join(GPT2_model(para, min_length=60, num_sentences=5)).replace('. ', '. \n')
print(GPT_Summary)

The epic’s prelude offers a general introduction to Gilgamesh, king of Uruk, who was two-thirds god and one-third man. 
The gods meet in council and agree that one of the two friends must be punished for their transgression, and they decide Enkidu is going to die. 
He takes ill, suffers immensely, and shares his visions of the underworld with Gilgamesh. 
Ea, the god of wisdom, warned Utnapishtim about the gods’ plans and told him how to fashion a gigantic boat in which his family and the seed of every living creature might escape. 
When Gilgamesh insists that he be allowed to live forever, Utnapishtim gives him a test.


In [None]:
print('SUMMARY RESULT:\n\n\tBERT: {}\n\n\tGPT-2: {}'.format(BERT_Summary, GPT_Summary))

SUMMARY RESULT:

	BERT: The epic’s prelude offers a general introduction to Gilgamesh, king of Uruk, who was two-thirds god and one-third man. 
When Enkidu sleeps with the woman, the animals reject him since he is no longer one of them. 
Exchanging his kingly garments for animal skins as a way of mourning Enkidu, he sets off into the wilderness, determined to find Utnapishtim, the Mesopotamian Noah. 
Gilgamesh finds the plant and takes it with him, planning to share it with the elders of Uruk. 
When Gilgamesh returns to Uruk, he is empty-handed but reconciled at last to his mortality.

	GPT-2: The epic’s prelude offers a general introduction to Gilgamesh, king of Uruk, who was two-thirds god and one-third man. 
The gods meet in council and agree that one of the two friends must be punished for their transgression, and they decide Enkidu is going to die. 
He takes ill, suffers immensely, and shares his visions of the underworld with Gilgamesh. 
Ea, the god of wisdom, warned Utnapishtim 

# Calculate Similarity

In [None]:
check_similarity(BERT_Summary, GPT_Summary)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.




('entailment', ' 0.65%')
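
The NLI label above says whether one summary is judged to entail the other, not how much text the two share. Because both extractive summaries are built from sentences copied out of `para`, a simple complementary check is to count the sentences they have in common. The sketch below is one way to do that. The helper names `split_sentences` and `sentence_overlap` are hypothetical, and the code assumes the summaries were formatted with one sentence per line, as `BERT_Summary` and `GPT_Summary` are.

```python
def split_sentences(summary):
    """Split a summary that holds one sentence per line into clean sentences."""
    # Drop surrounding whitespace and the trailing period so that the same
    # sentence compares equal whether or not it ends the summary.
    return [line.strip().rstrip('.') for line in summary.split('\n') if line.strip()]


def sentence_overlap(summary_a, summary_b):
    """Return the shared sentences and the Jaccard overlap of two summaries."""
    a = set(split_sentences(summary_a))
    b = set(split_sentences(summary_b))
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Toy example with one shared sentence out of four distinct ones.
first = "A cat sat. \nA dog ran. \nThe end."
second = "A cat sat. \nBirds fly."
print(sentence_overlap(first, second))
```

Called as `sentence_overlap(BERT_Summary, GPT_Summary)`, this would list the sentences both models selected. In the printed summaries above, the two models agree only on the opening sentence.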

# Abstractive Summarization

In [None]:
from transformers import pipeline

summarizer_t5_small = pipeline("summarization", model="t5-small")
summarizer_t5 = pipeline("summarization", model="t5-base")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
#summary_t5_small = summarizer(para, max_length=130, min_length=30, do_sample=False)[0]['summary_text']
summary_t5_base = summarizer_t5(para, max_length=130, min_length=30, do_sample=False)[0]['summary_text']

Token indices sequence length is longer than the specified maximum sequence length for this model (1645 > 512). Running this sequence through the model will result in indexing errors


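The `t5-small` call above is commented out because it failed with an index-out-of-range error on this input, which the log reports as 1,645 tokens against a 512-token limit. Note that the commented-out line calls `summarizer`, the DistilBART pipeline from the earlier section, rather than `summarizer_t5_small`, so the error may have come from that model instead.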
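
One common way to work around the length warning is to split the text into smaller pieces, summarize each piece, and join the partial summaries. The sketch below shows only the splitting step, under stated assumptions: the helper name `chunk_text` is hypothetical, the word count is used as a rough stand-in for the model's token count, and the default of 350 words is an arbitrary choice rather than a tuned value.

```python
import re


def chunk_text(text, max_words=350):
    """Group whole sentences into chunks of at most max_words words each."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(' '.join(current))
    return chunks


print(chunk_text("One two three. Four five six. Seven eight nine.", max_words=6))

# Possible use with the pipeline defined above (not run here):
# partial = [summarizer_t5(c, max_length=60, min_length=10, do_sample=False)[0]['summary_text']
#            for c in chunk_text(para)]
# combined = ' '.join(partial)
```

A single sentence longer than `max_words` still ends up as its own oversized chunk, so this is a starting point rather than a guarantee that every chunk fits the model. Passing `truncation=True` to the pipeline call is a simpler alternative, at the cost of discarding everything past the model's limit.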

In [None]:
#print('BART Summary: {}\n\n'.format(summary_BART.replace(' . ', '.\n'), '\n'))
print('T5 Summary: {}\n\n'.format(summary_t5_base.replace(' . ', '.\n')))

T5 Summary: after the flood, the gods decided to destroy humankind.
a snake steals a plant that restores youth.
when he returns to Uruk, he sees that the city he repudiated in his grief is a magnificent achievement .




# Comparison Between Extractive and Abstractive Summaries

## Calculate Similarity

In [None]:
check_similarity(BERT_Summary, summary_t5_base)





('entailment', ' 0.67%')

In [None]:
check_similarity(GPT_Summary, summary_t5_base)





('entailment', ' 0.66%')