<a href="https://colab.research.google.com/github/tejpat98/Textbook-Summarisation/blob/main/HuggingFace_GooglePegasus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Pegasus documentation: https://huggingface.co/transformers/model_doc/pegasus.html

- Google Pegasus is a pretrained seq-to-seq model whose pre-training objective (gap-sentence generation) was designed specifically for abstractive summarisation.

In [None]:
!pip install transformers
!pip install sentencepiece

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=e4212ecc19

In [None]:
!nvidia-smi

Mon Mar 22 14:57:33 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

**Check that HuggingFace Transformers installed correctly**

In [None]:
!python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"
#Expected Output: "[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]"

2021-03-22 14:57:38.230585: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Downloading: 100% 629/629 [00:00<00:00, 957kB/s]
Downloading: 100% 268M/268M [00:05<00:00, 48.6MB/s]
Downloading: 100% 232k/232k [00:00<00:00, 13.7MB/s]
Downloading: 100% 48.0/48.0 [00:00<00:00, 80.7kB/s]
[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]


**Fine-tuning Pegasus**

Create a preconfigured model, then pass it to the fine-tuning script.

HuggingFace's own fine-tuning script: https://github.com/huggingface/transformers/blob/master/examples/seq2seq/run_summarization.py
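
The script is driven from the command line and takes the model as `--model_name_or_path` (a hub name, or a local directory written by `save_pretrained`). The following is only a sketch of how such a command might be assembled. The helper `build_finetune_command`, the file names `train.csv` and `val.csv`, the column names, and the output directory are all hypothetical, and the flags are the ones I believe the script accepted around transformers 4.4, so check them against `python run_summarization.py --help` before relying on them.

```python
import shlex


def build_finetune_command(model_name_or_path, train_file, validation_file,
                           output_dir, text_column="text",
                           summary_column="summary", batch_size=1):
    """Assemble an argument list for run_summarization.py (hypothetical helper).

    The flag names are assumed from the example script; the defaults for the
    column names and batch size are placeholders, not recommendations.
    """
    return [
        "python", "run_summarization.py",
        "--model_name_or_path", model_name_or_path,
        "--do_train",
        "--do_eval",
        "--train_file", train_file,
        "--validation_file", validation_file,
        "--text_column", text_column,
        "--summary_column", summary_column,
        "--output_dir", output_dir,
        "--per_device_train_batch_size", str(batch_size),
        "--per_device_eval_batch_size", str(batch_size),
        "--predict_with_generate",
    ]


# Placeholder file names and output directory, for illustration only.
cmd = build_finetune_command(
    "google/pegasus-large", "train.csv", "val.csv", "pegasus-finetuned"
)

# Print a shell-quoted command that could be pasted after "!" in a notebook cell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps each flag next to its value and avoids quoting mistakes when paths contain spaces.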

In [None]:
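from transformers import PegasusModel, PegasusConfig

# Initialize a PEGASUS configuration with the library's default values.
# Nothing is read from the google/pegasus-large checkpoint here; to load that
# checkpoint's configuration, use PegasusConfig.from_pretrained('google/pegasus-large').
configuration = PegasusConfig()

# Initialize a model from that configuration.
# The weights are randomly initialized, not pretrained.
model = PegasusModel(configuration)

# Access the model configuration
configuration = model.config

print(model.config)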

PegasusConfig {
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "attention_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 0,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 1,
  "forced_eos_token_id": 1,
  "gradient_checkpointing": false,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "max_position_embeddings": 1024,
  "model_type": "pegasus",
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "scale_embedding": false,
  "transformers_version": "4.4.2",
  "use_cache": true,
  "vocab_size": 50265
}
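
The printed values are enough for a back-of-the-envelope estimate of the model size. The sketch below is a rough count, not the library's own figure. It assumes a BART-style layout (biased linear projections, one layer norm per sub-block plus a final layer norm in the encoder and in the decoder, a single embedding matrix shared by both) and leaves out positional tables and any buffers. `estimate_parameters` is a hypothetical helper.

```python
def estimate_parameters(d_model, ffn_dim, encoder_layers, decoder_layers, vocab_size):
    """Rough parameter count for an encoder-decoder transformer.

    Assumed layout (not read from the library): q/k/v/out projections with
    biases, a two-layer feed-forward block with biases, one layer norm per
    sub-block, and a final layer norm on the encoder and on the decoder.
    """
    attention = 4 * (d_model * d_model + d_model)      # q, k, v, out projections
    feed_forward = 2 * d_model * ffn_dim + ffn_dim + d_model
    layer_norm = 2 * d_model                           # scale and shift

    encoder_layer = attention + feed_forward + 2 * layer_norm
    # A decoder layer adds cross-attention and a third layer norm.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    shared_embedding = vocab_size * d_model
    final_layer_norms = 2 * layer_norm

    return (encoder_layers * encoder_layer
            + decoder_layers * decoder_layer
            + shared_embedding
            + final_layer_norms)


# Values copied from the PegasusConfig printed above.
total = estimate_parameters(
    d_model=1024, ffn_dim=4096, encoder_layers=12, decoder_layers=12, vocab_size=50265
)
print(f"{total:,} parameters (estimate)")
```

With the printed configuration this comes to roughly 404 million parameters. `model.num_parameters()` may report a somewhat different number if the library counts tensors that this sketch leaves out.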



In [None]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from datetime import datetime
import torch

# Example input from the Pegasus documentation; it is not used in this run.
src_text = [""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""]
# The input that is actually summarised below.
text = [
        "I am currently learning some machine learning and I know how to calculate\
         the euclidean distance between different data points; however, I was wondering\
          if anyone knows how to calculate the accuracy by hand in order to see which k-value is the best as the choice of k?\
          I know how to implement basic python classes to calculate the accuracy for me, but want to learn how to do it by hand as well.\
           I tried googling it, but they all just show python implementations.For instance, let's pretend you only have 8 data points,\
            4 red and 4 orange; I pick for instance k = 3 and get 2 red and 1 orange (so the new data point is classified as red).\
             Now I want to calculate the accuracy of this K value?"
]
# Start the timer. It also covers downloading and loading the model, not just generation.
st = datetime.now()

model_name = 'google/pegasus-large'
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
# Inputs longer than the model's maximum length are truncated.
batch = tokenizer(text, truncation=True, padding='longest', return_tensors="pt").to(device)
summary_ids = model.generate(**batch)
tgt_text = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

et = datetime.now()
dur = et - st

# From the Pegasus documentation example (google/pegasus-xsum applied to src_text); it does not apply to `text`.
#assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
print("Time Taken: " + str(dur) + " Output: ", tgt_text)

Time Taken: 0:01:29.234186 Output:  ['I am currently learning some machine learning and I know how to calculate the euclidean distance between different data points; however, I was wondering if anyone knows how to calculate the accuracy by hand in order to see which k-value is the best as the choice of k?']
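
The "Time Taken" figure covers everything between the two `datetime.now()` calls, including the one-off checkpoint download and model load, so it says little about generation speed on its own. A minimal sketch of timing each stage separately is given below. `timed` is a hypothetical helper, and the `time.sleep` calls are stand-ins for `from_pretrained` and `model.generate` so that the snippet runs without downloading anything.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with timed("load", timings):
    time.sleep(0.05)  # stand-in for tokenizer/model from_pretrained(...)

with timed("generate", timings):
    time.sleep(0.01)  # stand-in for model.generate(**batch)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

On a GPU, calling `torch.cuda.synchronize()` before reading the clock would likely give a more faithful generation time, since CUDA work is queued asynchronously.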
