<img src='image.jpeg' alt='image' width='300'/>

# Meeting Recording Summarisation Internship Project

Text summarization is one of the challenges of Natural Language Processing. Given the volume of text produced daily on the Internet, managers can no longer read exhaustively through current events, progress reports from their employees, and so on. They need tools that automatically produce a summary of this flow of information. As a first approach, extractive summarization tools have been developed, and commercial tools are now available. However, this family of systems is not well suited to certain types of text, such as written transcriptions of dialogues or meetings. In those cases, abstractive summarization tools are needed. Research in this field goes back decades, but it has been particularly stimulated since the mid-2010s by the successes of deep learning.

Text summarization is a well-explored area in NLP. As shown in Figure 1, the field can be split by input document type, output type and purpose. Regarding output type, text summarization divides into extractive and abstractive methods.

• Extractive: In extractive methods, a summarizer tries to find and combine the most significant sentences of the source text to form a summary. Techniques such as Topic Representation and Indicator Representation are used to identify the principal sentences and measure their importance.

• Abstractive: Abstractive Text Summarization (ATS) is the process of finding the most essential meaning of a text and rewriting it as a summary. The resulting summary is an interpretation of the source. Abstractive summarization is closer to what a human usually does: a reader understands the text, compares it with memory and related information, and then re-creates its core in a brief text. That is why abstractive summarization is more challenging than the extractive method: the model has to break the source text down to individual tokens and regenerate the target sentences. Producing meaningful and grammatically correct sentences in the summaries is hard and demands precise, sophisticated models.
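
To make the extractive idea concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is only an illustration and is not the method used later in this notebook: the function name `extractive_summary`, the regex-based sentence split and the word-frequency scoring are all assumptions chosen to keep the example dependency-free.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # a real pipeline would use a proper sentence tokenizer).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word frequencies over the whole text act as a crude topic representation.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average frequency of the words in the sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, keep the best ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


example = "Pricing matters. Pricing should follow value. The weather was nice."
print(extractive_summary(example, num_sentences=1))
```

Because the summary can only reuse sentences that already exist in the source, this approach copes poorly with dialogue transcripts, which is the motivation for the abstractive models used in the rest of the notebook.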



In [None]:
import warnings
warnings.filterwarnings("ignore")

# 1.0.0 Colab output wrapper

In [None]:
#wrap the output in colab cells
import IPython
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
      white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

For CPU-support only, you can conveniently install 🤗 Transformers and a deep learning library in one line. For example, install 🤗 Transformers and PyTorch with:

# 2.0.0  Install Transformers And Import Dependencies

In [None]:
!pip install transformers[torch]
!pip install transformers[sentencepiece]
from transformers import PegasusTokenizer, PegasusXForConditionalGeneration
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DistilBertTokenizer, DistilBertModel

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers[torch]
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x

Finally, check that 🤗 Transformers has been properly installed by running a short test command. It downloads a pretrained model and then prints a label and score, for example: [{'label': 'POSITIVE', 'score': 0.9998704791069031}]

# 3.0.0  Mount Google Drive on Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 4.0.0 Read input file from Google Drive

In [None]:
# read the transcript; the with-block closes the file automatically
with open('/content/drive/MyDrive/meeting_recording_to_text.txt', 'r', encoding='utf-8') as file:
  FileContent = file.read().strip()

# 5.0.0 display file content

In [None]:
FileContent

'Das : Hi and welcome to the a16z podcast. I’m Das, and in this episode, I talk SaaS go-to-market with David Ulevitch and our newest enterprise general partner Kristina Shen. The first half of the podcast looks at how remote work impacts the SaaS go-to-market and what the smartest founders are doing to survive the current crisis. The second half covers pricing approaches and strategy, including how to think about free versus paid trials and navigating the transition to larger accounts. But we start with why it’s easier to move upmarket than down… and the advantage that gives a SaaS startup against incumbents.\nDavid : If you have a cohort of customers that are paying you $10,000 a year for your product, you’re going to find a customer that self-selects and is willing to pay $100,000 a year. Once you get one of those, your organization will figure out how you sell to, how you satisfy and support, customers at that price point and that size. But it’s really hard for a company that sells 

# 6.0.0  Bart-Transformer Model

The Bart model was proposed in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.

According to the abstract,

• Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).

• The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.

• BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.

This model was contributed by sshleifer. The Authors’ code can be found here.

## 6.1.0 Load the Distilbart-Transformer Model and Tokenizer

In [None]:
# import and initialize the tokenizer and model from the checkpoint
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "sshleifer/distilbart-cnn-12-6"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Downloading:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

## 6.1.1 Some Distilbart-Transformer model statistics

### max tokens including the special tokens

First, we check some statistics of our model, starting with the maximum number of tokens it can accept, including the special tokens.

In [None]:
# max tokens including the special tokens
tokenizer.model_max_length

1024

Next we want to check the number of tokens our model can support excluding the special tokens

In [None]:
# max tokens excluding the special tokens
tokenizer.max_len_single_sentence

1022

From these results we can see that the tokenizer adds two special tokens to each input sequence (1024 - 1022 = 2), which we can verify with the next cell.

In [None]:
# number of special tokens
tokenizer.num_special_tokens_to_add()

2
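
The relationship between the three numbers above can be written as a one-line calculation. This is a small sketch with a hypothetical helper name (`usable_tokens`); the values are the ones reported by the tokenizer in the preceding cells.

```python
def usable_tokens(model_max_length, num_special_tokens):
    """Tokens left for the text itself once the special tokens are reserved."""
    return model_max_length - num_special_tokens


# Values reported by the tokenizer in the cells above.
print(usable_tokens(model_max_length=1024, num_special_tokens=2))
```

This is the budget each chunk of the transcript has to fit into, which is why the chunking step later compares against `tokenizer.max_len_single_sentence` rather than `tokenizer.model_max_length`.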

## 6.2.0 Text Preprocessing

Before you can run a model on your data, the data needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model.

For text, use a tokenizer to convert the text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.

The main tool for preprocessing textual data is a tokenizer. A tokenizer splits text into tokens according to a set of rules. The tokens are converted into numbers and then tensors, which become the model inputs. Any additional inputs required by the model are added by the tokenizer.
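
The text-to-tokens-to-numbers pipeline can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the 🤗 tokenizer: it splits on whitespace instead of using subwords, and the helper names `build_vocab` and `encode` as well as the `<s>`, `</s>` and `<unk>` entries are chosen purely for illustration.

```python
def build_vocab(texts):
    """Assign an integer id to every whitespace-separated token."""
    # Ids 0-2 are reserved for special tokens (illustrative choice).
    vocab = {"<s>": 0, "</s>": 1, "<unk>": 2}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Split text into tokens, map them to ids, and add the special tokens."""
    tokens = text.lower().split()
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    return [vocab["<s>"]] + ids + [vocab["</s>"]]


vocab = build_vocab(["price on value"])
print(encode("Price on trust", vocab))  # unknown words map to the <unk> id
```

A real tokenizer follows the same steps but uses a learned subword vocabulary, so rare words are broken into smaller known pieces instead of being mapped to a single unknown id.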

### 6.2.1  Convert file content to sentences

In [None]:
# extract the sentences from the document
import nltk
nltk.download('punkt')
sentences = nltk.tokenize.sent_tokenize(FileContent)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# find the max tokens in the longest sentence
max([len(tokenizer.tokenize(sentence)) for sentence in sentences])

92

## 6.3.0 Create the Sentence chunks for  Distilbart-Transformer model

In [None]:
# greedily pack consecutive sentences into chunks that fit the model's input limit
length = 0
chunk = ""
chunks = []
for sentence in sentences:
  sentence_length = len(tokenizer.tokenize(sentence)) # no. of tokens in this sentence
  combined_length = sentence_length + length # tokens in the chunk if we add the sentence

  if combined_length <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter
  else:
    chunks.append(chunk.strip()) # save the full chunk

    # start a new chunk with the overflow sentence
    chunk = sentence + " "
    length = sentence_length

# save the final chunk (also covers the case where the last sentence overflowed)
if chunk:
  chunks.append(chunk.strip())
len(chunks)

6
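
The chunking loop above can also be packaged as a reusable function. The following is a minimal sketch under stated assumptions: the name `chunk_sentences` and the `count_tokens` callback are hypothetical, and the demo counts whitespace-separated words so that it runs without downloading a model.

```python
def chunk_sentences(sentences, count_tokens, max_tokens):
    """Greedily pack consecutive sentences into chunks of at most max_tokens."""
    chunks, current, length = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush the current chunk if adding this sentence would overflow it.
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += n
    # Keep whatever is left after the last sentence.
    if current:
        chunks.append(" ".join(current))
    return chunks


def count_words(sentence):
    # Stand-in token counter for the demo: one token per word.
    return len(sentence.split())


print(chunk_sentences(["a b c", "d e", "f g h i", "j"], count_words, max_tokens=5))
```

With the real tokenizer, the same function could be called with `lambda s: len(tokenizer.tokenize(s))` as the counter and `tokenizer.max_len_single_sentence` as the limit. A single sentence longer than the limit still ends up in a chunk of its own, so it would need truncation before being passed to the model.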

### 6.3.1 Some checks on the chunked output

We have created 6 chunks from our file. Let's find the number of tokens, excluding the special tokens, in each of the 6 chunks.

In [None]:
[len(tokenizer.tokenize(c)) for c in chunks]

[1014, 1019, 1005, 1019, 1000, 389]

Let us also check the number of tokens, including the special tokens, in each chunk.

In [None]:
[len(tokenizer(c).input_ids) for c in chunks]

[1016, 1021, 1007, 1021, 1002, 391]

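Next, we compare the total number of tokens across all the chunks with the total number of tokens in the original file content. The two totals should be close. As the outputs show, they are not identical (5446 vs. 5502), most likely because sentence splitting discards the newlines and extra whitespace between sentences, which the tokenizer would otherwise encode as tokens.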

In [None]:
sum([len(tokenizer.tokenize(c))for c in chunks])

5446

In [None]:
len(tokenizer.tokenize(FileContent))

Token indices sequence length is longer than the specified maximum sequence length for this model (5502 > 1024). Running this sequence through the model will result in indexing errors


5502
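
The two totals above differ (5446 vs. 5502). One plausible reason, offered as an assumption rather than a verified diagnosis, is that splitting the text into sentences and rejoining them with single spaces drops the newlines between speaker turns, which a tokenizer may otherwise count as tokens. The sketch below demonstrates the effect with a toy tokenizer (`toy_tokenize` is a hypothetical stand-in, not the model's tokenizer) that treats every word and every newline as one token.

```python
import re


def toy_tokenize(text):
    """Toy tokenizer: each newline and each run of non-space characters is a token."""
    return re.findall(r"\n|[^\s]+", text)


original = "Hello there.\nHow are you?\n\nFine."
# What a sentence splitter would hand back: the newlines are gone.
sentences = ["Hello there.", "How are you?", "Fine."]
rejoined = " ".join(sentences)

print(len(toy_tokenize(original)))  # counts the three newline tokens as well
print(len(toy_tokenize(rejoined)))  # only the six words remain
```

If exact token parity matters, the comparison should be made against the rejoined sentences rather than the raw file content.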

## 6.4.0  Summarization Modelling 

### 6.4.1 Get the inputs

In [None]:
inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

### 6.4.2 Get the outputs



In [None]:
# summarize each chunk separately; each batch holds a single sequence
for chunk_input in inputs:
  output = model.generate(**chunk_input)
  print(tokenizer.decode(output[0], skip_special_tokens=True))

 A16z podcast talks SaaS go-to-market with David Ulevitch and Kristina Shen. The first half of the podcast looks at how remote work impacts the Saa-market. The second half covers pricing approaches and strategy, including how to think about free versus paid trials and navigating the transition to larger accounts.
 Remote work and working from home is only going to catalyze more of the conversion from on-premise over to cloud and SaaS. Kristina: In general, software spend declines 20% during an economic downturn, but in the last downturn in ’08, Saa’S spend actually increased 10%.
 New modern SaaS pricing is keep it simple, keep it tied to value, and make sure you’re solving one thing really, really well. David: You want to make it easy for your customers to give you money. If your customers don’t understand your pricing, that’s a huge red flag. Kristina: The most common that most people know about is PEPM or per employee per month.
 David: People need to price on value, and they don't 
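
The per-chunk summaries printed above can be collected into a single document-level summary. The sketch below shows one possible way to do that; `summarize_document` is a hypothetical helper, and `first_sentence` is a deliberately trivial stand-in for the model call so that the example runs without downloading any weights.

```python
def summarize_document(chunks, summarize_chunk):
    """Summarize each chunk independently and join the non-empty results."""
    partial_summaries = [summarize_chunk(chunk).strip() for chunk in chunks]
    return "\n".join(summary for summary in partial_summaries if summary)


def first_sentence(text):
    # Stand-in "summarizer" for the demo: keep only the first sentence.
    return text.split(". ")[0].rstrip(".") + "."


demo_chunks = ["Alpha one. Alpha two.", "Beta one. Beta two."]
print(summarize_document(demo_chunks, first_sentence))
```

In the notebook, `summarize_chunk` could wrap the tokenizer, `model.generate` and `tokenizer.decode` calls used in the loop above. Because each chunk is summarized in isolation, the combined text may repeat points or lose context that spans a chunk boundary.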

# 7.0.0 PEGASUS-X Transformer Model

The PEGASUS-X model was proposed in Investigating Efficiently Extending Transformers for Long Input Summarization by Jason Phang, Yao Zhao and Peter J. Liu.
PEGASUS-X (PEGASUS eXtended) extends the PEGASUS models for long input summarization through additional long input pretraining and using staggered block-local attention with global tokens in the encoder.

The abstract from the paper is the following:

While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.
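
To give an intuition for block-local attention with staggered blocks, here is a simplified sketch of the attention mask only. It is an illustration under assumptions, not the PEGASUS-X implementation: the function name `block_local_mask` is hypothetical, the stagger is modelled as a half-block shift of the block boundaries, and the global tokens described in the abstract are left out.

```python
def block_local_mask(seq_len, block_size, stagger=False):
    """Return mask[i][j] == True when token i may attend to token j."""
    # With stagger=True the block boundaries are shifted by half a block,
    # so tokens separated by a boundary in one layer share a block in another.
    shift = block_size // 2 if stagger else 0
    block_id = [(i + shift) // block_size for i in range(seq_len)]
    return [[block_id[i] == block_id[j] for j in range(seq_len)]
            for i in range(seq_len)]


plain = block_local_mask(seq_len=8, block_size=4)
shifted = block_local_mask(seq_len=8, block_size=4, stagger=True)

# Tokens 3 and 4 sit on opposite sides of a block boundary in the plain layout
# but fall into the same block once the boundaries are shifted.
print(plain[3][4], shifted[3][4])
```

Each token attends only to its own block, so the cost grows linearly with sequence length for a fixed block size instead of quadratically. Alternating the two layouts across layers lets information cross block boundaries.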



## 7.1.0 Load the PEGASUS-X Transformer Model and Tokenizer

In [None]:
# import and initialize the tokenizer and model from the checkpoint

from transformers import PegasusTokenizer, PegasusXForConditionalGeneration
checkpoint ="google/pegasus-x-base"
model = PegasusXForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = PegasusTokenizer.from_pretrained(checkpoint)


Downloading:   0%|          | 0.00/1.49k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.09G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.02k [00:00<?, ?B/s]

## 7.1.1 Some PEGASUS-X Transformer Model statistics

###  max tokens including the special tokens

First, we check some statistics of our model, starting with the maximum number of tokens it can accept, including the special tokens.

In [None]:
# max tokens including the special tokens
tokenizer.model_max_length

1024

Next we want to check the number of tokens our model can support excluding the special tokens

In [None]:
# max tokens excluding the special tokens
tokenizer.max_len_single_sentence

1023

From these results we can see that the tokenizer adds one special token to each input sequence (1024 - 1023 = 1), which we can verify with the next cell.

In [None]:
# number of special tokens
tokenizer.num_special_tokens_to_add()

1

## 7.2.0 Create the Sentence chunks for PEGASUS-X Transformer Model

In [None]:
# greedily pack consecutive sentences into chunks that fit the model's input limit
length = 0
chunk = ""
chunks = []
for sentence in sentences:
  sentence_length = len(tokenizer.tokenize(sentence)) # no. of tokens in this sentence
  combined_length = sentence_length + length # tokens in the chunk if we add the sentence

  if combined_length <= tokenizer.max_len_single_sentence: # if it doesn't exceed
    chunk += sentence + " " # add the sentence to the chunk
    length = combined_length # update the length counter
  else:
    chunks.append(chunk.strip()) # save the full chunk

    # start a new chunk with the overflow sentence
    chunk = sentence + " "
    length = sentence_length

# save the final chunk (also covers the case where the last sentence overflowed)
if chunk:
  chunks.append(chunk.strip())
len(chunks)

6

## 7.4.0  Summarization Modelling for PEGASUS-X Transformer Model

### 7.4.1 Get the inputs

In [None]:
inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

### 7.4.2 Get the outputs


In [None]:
# summarize each chunk separately; each batch holds a single sequence
for chunk_input in inputs:
  output = model.generate(chunk_input["input_ids"])
  print(tokenizer.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])