# Project Implementation
## Implementation for the Lecture Summarizer
This Jupyter Notebook contains the code for this project. All techniques and models are grouped under their respective headers. When running the cells, you can change any input parameter, including the text you want to summarize.

## Overview
The Lecture Summarizer Project is a Natural Language Processing (NLP) project that aims to summarize lectures automatically using various text summarization techniques. 

## Text Summarization Techniques

Text summarization techniques can be divided into three categories:

1. **Extractive**: This technique extracts the most important sentences or phrases from the original text to create a summary. It is widely used due to its simplicity and effectiveness.

2. **Abstractive**: This technique generates new sentences that paraphrase the original text instead of copying from it. It is more challenging than extractive summarization, but it can produce summaries that are more concise and coherent.

3. **Descriptive**: This technique provides a description of the main topics and ideas discussed in the original text. It is less common than the other two techniques but can be useful for certain applications.
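
To make the extractive category concrete, here is a minimal sketch of frequency-based sentence scoring. It is not the method used later in this notebook, which relies on BERT; the function name `extractive_summary` and the scoring rule are illustrative assumptions.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order.

    Hypothetical helper for illustration: a sentence's score is the sum of
    the document-wide frequencies of its words.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text, case-insensitive.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        return sum(freqs[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Rank sentence indices by score, keep the best ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


lecture = (
    "Transformers process text with attention. "
    "Attention lets transformers weigh every word in the text. "
    "Lunch is at noon."
)
print(extractive_summary(lecture, num_sentences=2))
```

Because the score is a plain sum, this sketch favors long sentences; a practical scorer would likely normalize by sentence length and ignore stop words.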

## Transformers

Transformers are a type of neural network architecture that has revolutionized NLP in recent years. Some popular transformer models used in text summarization are:

1. **T5**: T5 (Text-to-Text Transfer Transformer) is a transformer model that casts every NLP task, including text summarization, as mapping an input text to an output text. It achieved state-of-the-art results on many NLP benchmarks when it was released.

2. **BERT**: BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer model that has shown excellent results in various NLP tasks. Because it does not generate text by itself, this project uses it for extractive summarization.

3. **Longformer Encoder-Decoder (LED)**: This is a transformer model that can handle long sequences of text, which is important for text summarization since it often involves processing large amounts of text.

## Relevant Information

The Lecture Summarizer Project can have various applications, such as helping students to study more efficiently, enabling researchers to scan through a large number of papers quickly, and assisting professionals to prepare for meetings and presentations. However, developing an accurate and reliable lecture summarizer is still a challenging task, and researchers are continuously working on improving the existing techniques and models.

## Abstractive Text Summarization (Base Models)
This cell contains the implementation of abstractive text summarization using the base T5 and BART transformer models.

In [5]:
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import T5Tokenizer, T5ForConditionalGeneration

print("""Select Model:
       1. BART 
       2. T5""")
model = input("Enter Model Number:")

if model == '1':
    _num_beams = 4
    _no_repeat_ngram_size = 3
    _length_penalty = 1
    _min_length = 12
    _max_length = 128
    _early_stopping = True
else:
    _num_beams = 4
    _no_repeat_ngram_size = 3
    _length_penalty = 2
    _min_length = 30
    _max_length = 200
    _early_stopping = True

text = """***ENTER TEXT TO SUMMARIZE HERE***"""
text = input("ENTER TEXT TO SUMMARIZE HERE:")

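def run_model(input_text):
    # Use the GPU when available; the model and its inputs must be on the same device.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # `model` holds the menu choice entered above: '1' selects BART, anything else T5.
    if model == '1':
        bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").to(device)
        bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")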
        input_text = str(input_text)
        input_text = ' '.join(input_text.split())
        input_tokenized = bart_tokenizer.encode(input_text, return_tensors='pt').to(device)
        summary_ids = bart_model.generate(input_tokenized,
                                          num_beams=_num_beams,
                                          no_repeat_ngram_size=_no_repeat_ngram_size,
                                          length_penalty=_length_penalty,
                                          min_length=_min_length,
                                          max_length=_max_length,
                                          early_stopping=_early_stopping)

        output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in
                  summary_ids]
        print('\n\n\n***Summary***')
        print(output[0])

    else:
        t5_model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)
        t5_tokenizer = T5Tokenizer.from_pretrained("t5-base")
        # Replace newlines with spaces so words on adjacent lines are not fused together.
        input_text = str(input_text).replace('\n', ' ')
        input_text = ' '.join(input_text.split())
        input_tokenized = t5_tokenizer.encode(input_text, return_tensors="pt").to(device)
        # Prepend the task prefix; 21603 and 10 are assumed to be the t5-base token IDs for "summarize" and ":".
        summary_task = torch.tensor([[21603, 10]]).to(device)
        input_tokenized = torch.cat([summary_task, input_tokenized], dim=-1).to(device)
        summary_ids = t5_model.generate(input_tokenized,
                                        num_beams=_num_beams,
                                        no_repeat_ngram_size=_no_repeat_ngram_size,
                                        length_penalty=_length_penalty,
                                        min_length=_min_length,
                                        max_length=_max_length,
                                        early_stopping=_early_stopping)
        output = [t5_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in
                  summary_ids]
        print('\n\n\n***Summary***')
        print(output[0])
if text!="""***ENTER TEXT TO SUMMARIZE HERE***""":
    run_model(text)

Select Model:
       1. BART 
       2. T5
Enter Model Number:1
ENTER TEXT TO SUMMARIZE HERE:BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning model that has been widely used in natural language processing (NLP) tasks. BERT was introduced in a 2018 paper by researchers at Google, and it has since become one of the most popular and widely used NLP models.  BERT is characterized by its ability to generate contextualized word embeddings, which are vector representations of words in a text that capture their meaning in context. Unlike traditional word embeddings that have a fixed representation for each word, contextualized word embeddings can change depending on the context in which the word appears. This allows BERT to better capture the nuances and complexities of natural language.  BERT is pre-trained on large amounts of text data using an unsupervised learning technique called masked language modeling. During training, a certain per
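
The T5 branch of the cell above prepends the task prefix as hand-written token IDs. The same idea can be expressed by adding the prefix as text before tokenizing. The sketch below covers only the text preparation; `build_t5_input` is a hypothetical helper, not part of the `transformers` library.

```python
def build_t5_input(text, task_prefix="summarize: "):
    """Collapse whitespace and prepend the T5 task prefix.

    Hypothetical helper: T5 is trained with textual task prefixes, so the
    prefix can be added to the string before tokenization.
    """
    # split() with no arguments splits on any whitespace run, including newlines.
    normalized = " ".join(str(text).split())
    return task_prefix + normalized


prepared = build_t5_input("BERT was introduced\nin 2018   by Google.")
print(prepared)
```

The returned string could then be passed to `t5_tokenizer.encode(prepared, return_tensors="pt")`, which should make the manual concatenation of ID tensors unnecessary.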

## Abstractive Text Summarization (Fine-Tuned Models)
This cell contains the implementation of abstractive text summarization using the Long T5, LED Base, and LED Long transformer models, which are fine-tuned on the BookSum dataset to generate abstractive and descriptive summaries.

In [7]:
from transformers import pipeline
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

#define model maps
model_map={'Long T5':'pszemraj/long-t5-tglobal-base-16384-book-summary',
           'LED Base':'pszemraj/led-base-book-summary',
           'LED Long':'pszemraj/led-large-book-summary'
           }

#Select the model to be used for the summary
print("""Select Model:
       1. Long T5
       2. LED Base
       3. LED Long""")
model = input("Enter Model Number:")

if model == '1':
    model = 'Long T5'
elif model == '2':
    model = 'LED Base'
else:
    model = 'LED Long'

#Read the generation hyperparameters from the user
_min_length = input("Minimum Length:")
_max_length = input("Maximum Length:")
_num_beams = input("Number of Beams:")
_repetition_penalty = float(input("Repetition Penalty:"))
_no_repeat_ngram_size = input("N-Gram Repeats:")
_encoder_no_repeat_ngram_size = input("Encoder N-Gram Repeats:")
        
#Provide the input text to be summarized
text = input("ENTER TEXT TO SUMMARIZE HERE:")

#Build the summarization pipeline for the selected model and print the summary
def run_model(input_text):
    hf_name = model_map[model]
    #All three checkpoints are loaded and called the same way, so one code path serves them
    summarizer = pipeline(
        "summarization",
        hf_name,
        device=0 if torch.cuda.is_available() else -1,
    )
    result = summarizer(
        input_text,
        min_length=int(_min_length),
        max_length=int(_max_length),
        no_repeat_ngram_size=int(_no_repeat_ngram_size),
        encoder_no_repeat_ngram_size=int(_encoder_no_repeat_ngram_size),
        repetition_penalty=_repetition_penalty,
        num_beams=int(_num_beams),
        do_sample=False,
        early_stopping=True
    )
    summary = result[0]['summary_text']
    print('\n\n\n***Summary***')
    print(summary)
    

#Run the summarizer only if some text was entered
if text:
    run_model(text)

Select Model:
       1. Long T5
       2. LED Base
       3. LED Long
Enter Model Number:2
Minimum Length:16
Maximum Length:256
Number of Beams:4
Repetition Penalty:3.5
N-Gram Repeats:3
Encoder N-Gram Repeats:3
ENTER TEXT TO SUMMARIZE HERE:BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning model that has been widely used in natural language processing (NLP) tasks. BERT was introduced in a 2018 paper by researchers at Google, and it has since become one of the most popular and widely used NLP models.  BERT is characterized by its ability to generate contextualized word embeddings, which are vector representations of words in a text that capture their meaning in context. Unlike traditional word embeddings that have a fixed representation for each word, contextualized word embeddings can change depending on the context in which the word appears. This allows BERT to better capture the nuances and complexities of natural language.  BERT is
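
The cell above converts each hyperparameter with `int(...)` only when the summarizer is called, so a mistyped value surfaces after the model has already loaded. Below is a minimal sketch of validating the raw strings up front; `parse_generation_params` and its checks are illustrative assumptions, not part of the notebook.

```python
def parse_generation_params(raw):
    """Convert raw input strings into typed generation arguments.

    Hypothetical helper: `raw` maps parameter names to the strings
    returned by input(). Raises ValueError on invalid values.
    """
    int_keys = ("min_length", "max_length", "num_beams",
                "no_repeat_ngram_size", "encoder_no_repeat_ngram_size")
    params = {key: int(raw[key]) for key in int_keys}
    params["repetition_penalty"] = float(raw["repetition_penalty"])

    if params["min_length"] > params["max_length"]:
        raise ValueError("min_length must not exceed max_length")
    if params["num_beams"] < 1:
        raise ValueError("num_beams must be at least 1")
    return params


# Values taken from the sample run shown above.
raw_inputs = {
    "min_length": "16", "max_length": "256", "num_beams": "4",
    "repetition_penalty": "3.5", "no_repeat_ngram_size": "3",
    "encoder_no_repeat_ngram_size": "3",
}
params = parse_generation_params(raw_inputs)
print(params)
```

The resulting dictionary could be unpacked into the pipeline call, for example `summarizer(input_text, do_sample=False, early_stopping=True, **params)`.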

## Extractive Text Summarization
This cell contains the implementation of extractive text summarization using a BERT transformer model; the log output at the end of this section shows the default `Summarizer()` loading the `bert-large-uncased` checkpoint. The `summarizer` library used here is a generalization of dmmiller612's lecture-summarizer repo.
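
The three parameters read in the cell below control which sentences are eligible and how many are kept. The following sketch illustrates that interaction under stated assumptions: `min_length` and `max_length` are treated as character limits on candidate sentences, and `ratio` as the fraction of candidates retained. `plan_extraction` is a hypothetical helper, and the library's exact filtering and rounding may differ.

```python
def plan_extraction(sentences, min_length=40, max_length=600, ratio=0.2):
    """Estimate which sentences are candidates and how many a summary keeps.

    Hypothetical helper. Assumes min_length and max_length are character
    limits on candidate sentences and ratio is the fraction of candidates
    kept; this is an illustration, not the library's implementation.
    """
    candidates = [s for s in sentences if min_length < len(s) < max_length]
    # Keep at least one sentence whenever any candidate survives the filter.
    keep = max(1, int(len(candidates) * ratio)) if candidates else 0
    return candidates, keep


sentences = [
    "BERT is a transformer-based machine learning model for NLP tasks.",
    "It was introduced in 2018.",
    "Contextualized embeddings change depending on the surrounding words.",
    "This allows BERT to capture the nuances of natural language.",
    "Masked language modeling hides words during pre-training of the model.",
]
candidates, keep = plan_extraction(sentences, min_length=40, max_length=600, ratio=0.5)
print(len(candidates), keep)
```

Under these assumptions, the short second sentence is dropped as a candidate and half of the remaining four sentences would be kept.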

In [9]:
### This module currently fails with an error caused by issues accessing Python libraries on macOS

from summarizer import Summarizer

#Provide the input area for text to be summarized
text = input("ENTER TEXT TO SUMMARIZE HERE:")

# Take parameters as input
_min_length = int(input("Minimum Sentence Length:"))
_max_length = int(input("Maximum Sentence Length:"))
_ratio = float(input("Reduction Ratio (0-1):"))

def run_model(input_text):
    model = Summarizer()
    result = model(input_text, min_length=_min_length,max_length=_max_length,ratio=_ratio)
    summary = ''.join(result)
    print('\n\n\n***Summary***')
    print(summary)

#Run the summarizer only if some text was entered
if text:
    run_model(text)

ENTER TEXT TO SUMMARIZE HERE:BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning model that has been widely used in natural language processing (NLP) tasks. BERT was introduced in a 2018 paper by researchers at Google, and it has since become one of the most popular and widely used NLP models.  BERT is characterized by its ability to generate contextualized word embeddings, which are vector representations of words in a text that capture their meaning in context. Unlike traditional word embeddings that have a fixed representation for each word, contextualized word embeddings can change depending on the context in which the word appears. This allows BERT to better capture the nuances and complexities of natural language.  BERT is pre-trained on large amounts of text data using an unsupervised learning technique called masked language modeling. During training, a certain percentage of the words in the input text are masked, and the model

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


AttributeError: 'NoneType' object has no attribute 'split'