<a href="https://colab.research.google.com/github/vahedshaik/cmpe-297-project/blob/main/Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook provides a simple UI for demonstrating text summarization, specifically in the financial domain.

## Features
This notebook allows for:

*   Summarizing any financial newswire document.
*   Choosing among different decoding strategies (note that each strategy behaves differently).
*   Changing the hyperparameters of each decoding strategy.
*   Interpreting the model's final decision.


## Decoding Strategies

*   **Greedy Search** chooses the token with the highest probability at each timestep.
*   **Beam Search** keeps the most probable partial sequences at each timestep, as many as the number of beams (beam width), and returns the highest-scoring ones.
*   [**Diverse Beam Search**](https://arxiv.org/pdf/1610.02424.pdf) adds a diversity penalty to beam search to enhance diversity between beam groups.
*   **Random Sampling** draws the next token at random from the vocabulary distribution. We combine random sampling with a temperature parameter to reduce the randomness.
*   [**Top-*k* Sampling**](https://arxiv.org/pdf/1805.04833.pdf) samples the next token from the *k* most probable words in the vocabulary distribution.
*   [**Top-*p* or "nucleus" Sampling**](https://arxiv.org/abs/1904.09751) samples the next token from the smallest set of most probable words whose cumulative probability reaches *p*. Top-*p* can be combined with top-*k* sampling.
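
To make the sampling strategies above concrete, here is a minimal NumPy sketch on a made-up five-word distribution. The function names (`apply_temperature`, `top_k_filter`, `top_p_filter`) and the toy probabilities are illustrative assumptions, not part of the `transformers` API, and the library's own filtering may differ in details such as tie handling.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    logits = np.log(probs) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical next-token distribution over a five-word vocabulary.
toy_probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

greedy_choice = int(np.argmax(toy_probs))          # greedy search: always the top token
top_k_probs = top_k_filter(toy_probs, k=2)         # only the 2 best tokens stay eligible
top_p_probs = top_p_filter(toy_probs, p=0.8)       # tokens kept until their mass reaches 0.8
cooled_probs = apply_temperature(toy_probs, 0.7)   # lower temperature favours likely tokens

# Sampling then draws from the filtered distribution.
rng = np.random.default_rng(0)
sampled = int(rng.choice(len(toy_probs), p=top_p_probs))
```

In this toy case top-*k* with `k=2` leaves two candidate tokens, while top-*p* with `p=0.8` leaves three, because the two most probable tokens only cover 0.7 of the probability mass.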



# Imports

In [None]:
!pip install transformers==4.1.0
!pip install sentencepiece==0.1.95


In [None]:
import copy, itertools, html, torch, nltk, string

import ipywidgets as widgets
import pandas as pd
import numpy as np

from transformers import PegasusConfig, PegasusForConditionalGeneration, PegasusTokenizer
from IPython.display import display, HTML
from PIL import ImageColor
from ipywidgets import Layout
from nltk import tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f'[{device}] will be used')

[cuda] will be used


# Helper functions


### Extract weights

In [None]:

def extract_attention_weights(text_to_summarize, colour):
  '''
  This is a function for extracting the attention weights.
  '''
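  # Tokenize the input (truncated to 256 tokens) and move it to the model's device.
  input_ids = tokenizer(text_to_summarize, max_length=256, truncation=True, return_tensors="pt").to(device).input_ids
  tokens = tokenizer.tokenize(text_to_summarize)  # Get the list of tokens from the original text

  # Run a forward pass without tracking gradients; only the attentions are needed.
  with torch.no_grad():
    output = model(input_ids=input_ids)
  encoder_attentions = output.encoder_attentions  # one (batch, heads, seq, seq) tensor per layer

  # Average over layers, then heads, then query positions to get one weight per input token.
  mean_layers = torch.mean(torch.stack(encoder_attentions), dim=0)
  mean_layers = torch.squeeze(mean_layers)
  mean_average_heads = torch.mean(mean_layers, dim=0)
  np_attentions = mean_average_heads.detach().cpu().numpy()
  final_attentions = np.mean(np_attentions, axis=0)

  # Rescale the weights to [0, 1] and drop the final position (the end-of-sequence token).
  final_attentions = rescale(final_attentions)
  final_attentions = final_attentions[:-1]

  # Align the token list with the (possibly truncated) attention vector.
  tokens = tokens[:len(final_attentions)]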

  punctuation_tokens = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '“', '”', '‘', '’', '.“', '.”', '.‘', '.’', ".'", '."', '.!', '.?', ',“', ',”', ',‘', ',’', ",'", ',"', ',!', ',?', '--']

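  # Zero out stop words and punctuation so that they are not highlighted.
  stop_words = set(stopwords.words('english'))  # build the set once instead of on every iteration
  for i, token in enumerate(tokens):
    token = token.replace("▁", "")
    if token.lower() in stop_words or token in punctuation_tokens:
        final_attentions[i] = 0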

  # Convert the chosen colour to its RGB components.
  r, g, b = ImageColor.getrgb(colour)[:3]
  intensity_value = 0.1

  # Pair each token with its own weight by position, so that repeated tokens
  # keep distinct weights instead of sharing the last one seen.
  attentions = []
  for token, attention in zip(tokens, final_attentions):
      alpha = min(attention / intensity_value, 1.0)  # CSS alpha is capped at 1
      attentions.append(f'<span style="background-color:rgba({r},{g},{b},{alpha});">{html.escape(token)}</span>')

  attentions = ' '.join(attentions)

  return attentions


### Summarize

In [None]:
#@title
def get_summaries(outputs):
  '''
  Get summaries / sentences
  '''
  summaries = []

  for output in outputs:
    summary = tokenizer.decode(output, skip_special_tokens=True)

    # Make sure every summary ends with sentence-final punctuation (also safe for empty strings).
    if not summary.endswith(('.', '!', '?')):
      summary = summary + "."

    summaries.append(summary)

  sentences = []

  for summary in summaries:
    sentences.append(tokenize.sent_tokenize(summary))

  return summaries, sentences

### Rescale

In [None]:
#@title
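def rescale(input_array):
    '''
    Min-max rescale the weights to [0, 1] in order to have more consistent results.
    '''
    min_value = np.min(input_array)
    max_value = np.max(input_array)
    value_range = max_value - min_value
    if value_range == 0:
        # All weights are equal; avoid dividing by zero.
        return np.zeros_like(input_array)
    return (input_array - min_value) / value_range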



### Other functionalities


In [None]:
#@title
def on_generate_button_click(button):
  show_loading()

  generate_output.clear_output()

  with generate_output:
    if not text_to_summarize.value or len(text_to_summarize.value.strip()) == 0:
      hide_loading()
      print("You have to enter text!")
      return

    input_ids = tokenizer(text_to_summarize.value, max_length=256, truncation=True, return_tensors="pt").to(device).input_ids

    if strategy.value == "Greedy Search":
      output = model.generate(input_ids, max_length=32)
    elif strategy.value == 'Beam Search':
      if seq_no.value > beams.value:
        print("The number of return sequences should be smaller than or equal to the number of beams. Try again!")
        hide_loading()
        return

      output = model.generate(input_ids,
                              max_length=32,
                              num_beams=beams.value,
                              num_return_sequences=seq_no.value,
                              early_stopping=True)
    elif strategy.value == 'Diverse Beam Search':
      if seq_no.value > beams_for_diverse.value:
        print("The number of return sequences should be smaller than or equal to the number of beams. Try again!")
        hide_loading()
        return

      if beams_for_diverse.value < beam_groups.value:
        print("The number of beam groups should be smaller than or equal to the number of beams. Try again!")
        hide_loading()
        return

      if beams_for_diverse.value % beam_groups.value  != 0:
        print("The number of beams should be divisible by the number of beam groups. Try again!")
        hide_loading()
        return

      output = model.generate(input_ids,
                              max_length=32,
                              num_beams=beams_for_diverse.value,
                              num_return_sequences= seq_no.value,
                              num_beam_groups = beam_groups.value,
                              diversity_penalty=penalty.value,
                              early_stopping=True)
    elif strategy.value == "Random Sampling":
      output = model.generate(input_ids,
                              do_sample=True,
                              max_length=32,
                              top_k=0,
                              temperature=temp.value,
                              num_return_sequences=seq_no.value)
    elif strategy.value == "Top-k Sampling":
      output = model.generate(input_ids,
                              do_sample=True,
                              max_length=32,
                              top_k=k.value,
                              num_return_sequences=seq_no.value)
    else:
      output = model.generate(input_ids,
                              do_sample=True,
                              max_length=32,
                              top_k=k.value,
                              top_p=p.value,
                              num_return_sequences=seq_no.value)

  summaries_returned, sentences_returned = get_summaries(output)

  summaries_grouped = copy.deepcopy(sentences_returned)

  for index, summary in enumerate(summaries_returned):
    summaries_grouped[index].insert(0, summary)

  global summary_texts
  global summary_checkboxes
  global summary_orders

  summary_texts = list(itertools.chain(*summaries_grouped))
  summary_checkboxes = [ widgets.Checkbox(layout=Layout(width='35px', padding='10px'), indent=False) for i in range(len(summary_texts)) ]
  summary_orders = [ widgets.BoundedIntText(value=1, min=1, max=len(summary_texts), step=1, layout=Layout(width='60px', padding='10px 10px 10px 0px')) for i in range(len(summary_texts)) ]

  # Refresh the synthesized summary whenever a checkbox or an order box changes.
  for widget in summary_checkboxes + summary_orders:
    widget.observe(on_summary_selection_change, names='value')

  synthesize_widgets = []

  summary_index = 0

  for group_index, group_values in enumerate(summaries_grouped):
    synthesize_widgets.append(widgets.HTML(f'<h2>Summaries Group {group_index + 1}</h2>'))

    for summary in group_values:
      synthesize_widgets.append(widgets.HBox([ summary_checkboxes[summary_index], summary_orders[summary_index], widgets.HTML(value=summary, layout=Layout(border='1px solid black', padding='10px'))]))

      summary_index = summary_index + 1

  synthesize_widgets.append(widgets.HTML('<h2>Synthesize Summary</h2>'))

  summary_to_synthesize_textarea.value = ''

  synthesize_widgets.append(summary_to_synthesize_textarea)

  with generate_output:
    display(widgets.VBox(synthesize_widgets))

  hide_loading()

def on_strategy_change(change):
  generate_output.clear_output()

  strategy_controls_output.clear_output()

  with strategy_controls_output:
    generate_ui_elements = []

    if strategy.value == "Beam Search":
      generate_ui_elements.extend([beams, seq_no, generate_button])
    elif strategy.value == "Diverse Beam Search":
      generate_ui_elements.extend([beams_for_diverse, beam_groups, penalty, seq_no, generate_button])
    elif strategy.value == "Top-k Sampling":
      generate_ui_elements.extend([k, seq_no, generate_button])
    elif strategy.value == "Top-p Sampling":
      generate_ui_elements.extend([k, p, seq_no, generate_button])
    elif strategy.value == "Random Sampling":
      generate_ui_elements.extend([temp, seq_no, generate_button])
    else:
      generate_ui_elements.append(generate_button)

    display(widgets.VBox(generate_ui_elements))

def on_summary_selection_change(change):
  summaries_with_orders = []

  for index, checkbox in enumerate(summary_checkboxes):
    if checkbox.value:
      summaries_with_orders.append((summary_orders[index].value, summary_texts[index]))

  sorted_summaries = map(lambda o: o[1], sorted(summaries_with_orders, key=lambda x: x[0]))

  summary_to_synthesize_textarea.value = ' '.join(sorted_summaries)

def on_visualize_button_click(button):
  show_loading()

  visualize_output.clear_output()

  with visualize_output:
    if not text_to_summarize.value or len(text_to_summarize.value.strip()) == 0:
      hide_loading()
      print("You have to enter text!")
      return

    display(HTML(extract_attention_weights(text_to_summarize.value, colour.value)))

  hide_loading()

def show_loading():
  with loading_output:
    display(loading)

def hide_loading():
  loading_output.clear_output()

# Load the model

In [None]:
model_name = 'human-centered-summarization/financial-summarization-pegasus'
config = PegasusConfig.from_pretrained(model_name, output_attentions=True)
model = PegasusForConditionalGeneration.from_pretrained(model_name, config=config).to(device).eval()
tokenizer = PegasusTokenizer.from_pretrained(model_name)


# Interface












## How to use

1.   Enter the text that you want to summarize into the text box.
2.   Press **Visualize Weights** if you want to visualize the self-attention weights of the input document.
3.   Optionally, choose the desired colour for the visualization.
4.   Select one of the decoding strategies (greedy search, beam search, diverse beam search, random sampling, top-*k* sampling, top-*p* or "nucleus" sampling).
5.   If the decoding strategy has hyperparameters, the corresponding widgets appear. You can keep the default values or change them as you wish.
6.   Press **Generate Summaries**.
7.   Choose among the model's generated summaries by ticking the checkbox (✔) next to each summary or sentence that you approve.
8.   Define the order of the selected sentences by adjusting the value of the box next to each checkbox.
9.   Your synthesized summary is refreshed in real time in the text box below the model's summaries. You can also edit the synthesized summary or even write a completely new version.
10.   You can repeat this process, changing the decoding strategy at any time.
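
The selection and ordering in steps 7 to 9 can be sketched without any widgets. The helper name `synthesize_summary`, the tuple layout, and the example sentences below are hypothetical and only illustrate the idea; the notebook itself reads these values from the checkboxes and order boxes.

```python
def synthesize_summary(candidates):
    """Join the checked candidates, sorted by their user-assigned order.

    `candidates` is a list of (text, checked, order) tuples, one per
    summary or sentence shown in the interface.
    """
    selected = [(order, text) for text, checked, order in candidates if checked]
    selected.sort(key=lambda pair: pair[0])  # stable sort: ties keep display order
    return ' '.join(text for _, text in selected)

# Made-up candidates: two are ticked, and the third is ordered first.
example_candidates = [
    ("Central banks sold gold.", True, 2),
    ("Prices were near record highs.", False, 1),
    ("Net sales totaled 12.1 tons.", True, 1),
]

synthesized = synthesize_summary(example_candidates)
print(synthesized)  # Net sales totaled 12.1 tons. Central banks sold gold.
```

Unticked candidates are ignored, so their order value has no effect on the result.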

## Interpretability

In order to interpret the final decisions behind the generated summaries, we extract the self-attention weights of the encoder layers. We assume that if a word is attended to by many other words, it is more important for the final generation. Based on this assumption, we highlight the words of the original input document according to their self-attention weight.
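
As a rough illustration of this aggregation, the sketch below uses a random array as a stand-in for the encoder self-attention tensors (the shapes and values are assumptions, not model output). It averages over layers and heads, then over the attending positions, and finally min-max rescales the result so that the most attended token gets a score of 1.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, num_heads, seq_len = 2, 4, 5

# Random stand-in for encoder self-attention: for every layer and head,
# each row (attending token) sums to 1 over the attended tokens.
raw = rng.random((num_layers, num_heads, seq_len, seq_len))
attentions = raw / raw.sum(axis=-1, keepdims=True)

# Average over layers and heads -> one (seq_len, seq_len) matrix.
per_token = attentions.mean(axis=(0, 1))

# Average over the attending positions -> attention received by each token.
received = per_token.mean(axis=0)

# Min-max rescale to [0, 1], guarding against a constant vector.
span = received.max() - received.min()
scores = (received - received.min()) / span if span > 0 else np.zeros_like(received)
```

Each entry of `scores` can then drive the opacity of the highlight behind the corresponding token.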



In [None]:
#@title Enter text to summarise { form-width: "20px" }

text_to_summarize = widgets.Textarea(
    placeholder='Type here the text you want to summarize.',
    value='Central banks became gold sellers for the first time since 2010 as some producing nations exploited near-record prices to soften the blow from the coronavirus pandemic. Net sales totaled 12.1 tons of bullion in the third quarter, compared with purchases of 141.9 tons a year earlier, according to a report by the World Gold Council. Selling was driven by Uzbekistan and Turkey, while Russia’s central bank posted its first quarterly sale in 13 years, the WGC said. While inflows into exchange-traded funds have driven gold’s advance in 2020, buying by central banks has helped underpin bullion in recent years. Citigroup Inc. last month predicted that central bank demand would rebound in 2021, after slowing this year from near-record purchases in both 2018 and 2019. “It’s not surprising that in the circumstances banks might look to their gold reserves,” said Louise Street, lead analyst at the WGC. “Virtually all of the selling is from banks who buy from domestic sources taking advantage of the high gold price at a time when they are fiscally stretched.” The central banks of Turkey and Uzbekistan sold 22.3 tons and 34.9 tons of gold, respectively, in the third quarter, the WGC said. Uzbekistan has been diversifying international reserves away from gold as the central Asian nation unwinds decades of isolation.',
    description='Text',
    layout={'height': '300px', 'width': '600px'}
)

summary_to_synthesize_textarea = widgets.Textarea(
    layout={'height': '150px', 'width': '400px'}
)

strategy = widgets.Dropdown(
    options=['Greedy Search', 'Beam Search', 'Diverse Beam Search', 'Random Sampling', 'Top-k Sampling', 'Top-p Sampling'],
    value=None,
    description='Strategy'
)

colour = widgets.ColorPicker(
    concise=False,
    description='Pick a color',
    value='red'
)

beams = widgets.IntSlider(
    value=5,
    min=2,
    max=10,
    step=1,
    description='#Beams',
    continuous_update=True,
)

beams_for_diverse = widgets.IntSlider(
    value=10,
    min=2,
    max=20,
    step=1,
    description='#Beams',
    continuous_update=True,
)

beam_groups = widgets.IntSlider(
    value=5,
    min=2,
    max=10,
    step=1,
    description='#Groups',
    continuous_update=True,
)

penalty = widgets.FloatSlider(
    value=2.3,
    min=1.0,
    max=10.0,
    step=0.1,
    description='Penalty',
    continuous_update=True,
)

seq_no = widgets.BoundedIntText(
    value=3,
    min = 1,
    max=10,
    step = 1,
    description='Sequences'
)

temp = widgets.FloatSlider(
    value=0.7,
    min=0.1,  # the temperature must be strictly positive
    max=1,
    step=0.1,
    description='Temperature',
    continuous_update=True,
)

k = widgets.BoundedIntText(
    value=50,
    min = 0,
    max = 100,
    step = 1,
    description='Top-k'
)

p = widgets.BoundedFloatText(
    value=0.90,
    min = 0.0,
    max= 1.0,
    step = 0.1,
    description = 'Top-p'
)

loading = widgets.HTML('<strong>Please wait...</strong>')

generate_button = widgets.Button(description='Generate Summaries')
visualize_button = widgets.Button(description='Visualize Weights')

generate_button.on_click(on_generate_button_click)
visualize_button.on_click(on_visualize_button_click)

strategy.observe(on_strategy_change, names='value')

strategy_controls_output = widgets.Output()

loading_output = widgets.Output()
visualize_output = widgets.Output()
generate_output = widgets.Output()

# These variables are used for communicating information between callbacks.
summary_checkboxes = []
summary_texts = []
summary_orders = []

display(loading_output)
display(text_to_summarize)
display(widgets.VBox([colour, visualize_button]))
display(visualize_output)
display(strategy)
display(strategy_controls_output)
display(generate_output)
