# Note:
- This notebook file may contain methods or algorithms that are NOT covered by the teaching content of BT4222 and hence will not be assessed in your midterm exam.
- It serves to increase your exposure, in both depth and breadth, to practical methods for addressing the specific project topic. We believe it will be helpful for your current project and for your future internship endeavors.

# BT4222 Text Summarization Project

Text summarization is an essential natural language processing (NLP) task that aims to condense a document while preserving its key information. In this project, we will explore text summarization techniques that generate concise and coherent summaries from lengthy textual content, such as blog posts or news articles. The project involves implementing both extractive and abstractive summarization methods, the latter using a pre-trained language model.

## Extractive Summarization
Extractive summarization is a text summarization technique that involves selecting and extracting a subset of sentences or phrases from the original text to form the summary. The goal is to identify the most informative and important sentences that convey the key information of the document.

Here we implement an algorithm that follows a basic approach to extractive summarization by ranking sentences based on word frequencies and selecting the most important sentences for the summary.

In [None]:
!pip -q install wikipedia

In [None]:
import wikipedia
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

The `punkt` tokenizer is a pre-trained sentence tokenizer provided by NLTK, which can split raw text into sentences.

`stopwords` are common words (e.g., "the," "is," "and") that are often filtered out from text during text processing because they do not contribute significant meaning to the overall context.

In [None]:
# Download necessary NLTK data
# Note: newer NLTK releases (3.9+) may also need nltk.download('punkt_tab') for sent_tokenize
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Preprocess the raw text to prepare it for summarization.

In [None]:
def clean_text(text):
    # Replace line breaks with spaces, then drop digits and non-printable characters
    text = text.replace('\n', ' ').replace('\r', '')
    text = ''.join(c for c in text if not c.isdigit() and c.isprintable())
    return text
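
To see what this preprocessing does in practice, here is a small sketch that applies the same logic to a made-up two-line sample (the sample text is an assumption for illustration, not taken from the article). Note that dropping digits also removes years, which is why the printed original text further down contains fragments such as "in ." where a year used to be.

```python
def clean_text(text):
    # Same logic as the cell above: flatten line breaks, then drop digits
    # and non-printable characters
    text = text.replace('\n', ' ').replace('\r', '')
    return ''.join(c for c in text if not c.isdigit() and c.isprintable())

# Made-up sample text, for illustration only
sample = "AI was founded as a discipline in 1956.\nFunding grew after 2012."
cleaned = clean_text(sample)
print(cleaned)
# prints: AI was founded as a discipline in . Funding grew after .
```

If the digits matter for your summaries, one option is to keep them and only filter non-printable characters.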

Calculate a score for each sentence by summing the frequencies of the words it contains.

In [None]:
def calculate_sentence_scores(sentences, word_frequencies):
    sentence_scores = {}
    for sentence in sentences:
        # Skip sentences with 30 or more words to avoid a bias towards long sentences
        if len(sentence.split()) >= 30:
            continue
        for word in nltk.word_tokenize(sentence.lower()):
            if word in word_frequencies:
                # Add the frequency of each known word to the sentence's score
                sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_frequencies[word]
    return sentence_scores

Rank the sentences based on these scores and then choose the top-ranked sentences to create the final summary.


In [None]:
def summarize_wikipedia_article(article_title, num_sentences=3):
    try:
        # Fetch the Wikipedia article
        article_text = wikipedia.page(article_title).content

        # Clean the text (line breaks, digits, non-printable characters)
        cleaned_text = clean_text(article_text)

        print("Original text:", cleaned_text)

        # Tokenize the sentences
        sentences = sent_tokenize(cleaned_text)

        # Tokenize the words
        words = nltk.word_tokenize(cleaned_text.lower())

        # Remove stop words and punctuation
        stop_words = set(stopwords.words('english'))
        words = [word for word in words if word.isalpha() and word not in stop_words]

        # Calculate word frequencies
        word_frequencies = FreqDist(words)

        # Calculate sentence scores
        sentence_scores = calculate_sentence_scores(sentences, word_frequencies)

        # Select top sentences based on scores
        top_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:num_sentences]

        # Construct the summary
        summary = ' '.join(top_sentences)
        return summary

    except wikipedia.exceptions.PageError:
        return "Article not found."
    except wikipedia.exceptions.DisambiguationError:
        return "Multiple articles found. Please specify the exact article title."

Here we summarize the article "Artificial Intelligence" from Wikipedia.

In [None]:
article_title = "Artificial Intelligence"

num_sentences = 1
summary = summarize_wikipedia_article(article_title, num_sentences)
print("Summary:", summary)

Original text: Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of human beings or animals. AI applications include advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Waymo), generative or creative tools (ChatGPT and AI art), and competing at the highest level in strategic games (such as chess and Go).Artificial intelligence was founded as an academic discipline in . The field went through multiple cycles of optimism followed by disappointment and loss of funding, but after , when deep learning surpassed all previous AI techniques, there was a vast increase in funding and interest. The various sub-fields of AI research are centered around particular goals and the use of particular tools. The traditional goals of AI research include reasoning, knowledge representation, planning, learning, natu

In extractive summarization, there are some alternative scoring schemes to rank sentences apart from simply counting the most frequent words. Here are a few more advanced summarization approaches along with their pros and cons:

1. **TF-IDF** (Term Frequency-Inverse Document Frequency):
TF-IDF is a widely used weighting scheme that combines how often a word occurs in a sentence (term frequency) with how rare it is across the sentences of the document (inverse document frequency), treating each sentence as a small "document". This highlights words that are important to a specific sentence but not overly common in the entire document. Sentences with higher total TF-IDF scores are considered more relevant.
- Pros: Considers the importance of words within a sentence in the context of the entire document. Handles common words well.
- Cons: May not capture semantic relationships between words effectively. Might not handle synonyms and related terms perfectly.
- Resources: You can refer to https://en.wikipedia.org/wiki/Tf%E2%80%93idf.


2. **TextRank**:
TextRank is a graph-based algorithm inspired by PageRank, which is used in web search. Sentences are treated as nodes in a graph, and edges are weighted by the similarity between sentences (for example, word overlap). Sentences that are similar to many other sentences, and in particular to other important sentences, tend to have higher scores.
- Pros: Captures sentence relationships and context well, and does not rely solely on word frequency.
- Cons: May not handle complex semantic relationships and nuances effectively.
- Resources: https://towardsdatascience.com/textrank-for-keyword-extraction-by-python-c0bae21bcec0

3. **LDA** (Latent Dirichlet Allocation):
LDA is a topic modeling technique that represents documents as mixtures of topics and assigns words to topics probabilistically. Sentences can be ranked based on their distribution of topics, capturing the underlying themes in the text.
- Pros: Captures the underlying topics and themes in the document. Useful for longer texts.
- Cons: Requires topic modeling techniques which can be computationally expensive and complex.
- Resources: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
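
The first two scoring schemes above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `tokenize`, `tfidf_sentence_scores`, and `textrank_sentence_scores` are hypothetical, a simple regex stands in for `nltk.word_tokenize`, each sentence is treated as its own "document" for IDF, and the TextRank edge weight is word overlap normalised by the log of the sentence lengths.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    # Lowercased alphabetic tokens; a simple stand-in for nltk.word_tokenize
    return re.findall(r"[a-z]+", sentence.lower())

def tfidf_sentence_scores(sentences):
    # Assumption: each sentence counts as one "document" when computing IDF
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Sum of (relative term frequency * inverse document frequency)
        scores.append(sum((count / len(tokens)) * math.log(n / doc_freq[word])
                          for word, count in tf.items()))
    return scores

def textrank_sentence_scores(sentences, damping=0.85, iterations=50):
    token_sets = [set(tokenize(s)) for s in sentences]
    n = len(sentences)
    # Edge weight: shared words, normalised by the log of both sentence lengths
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and len(token_sets[i]) > 1 and len(token_sets[j]) > 1:
                overlap = len(token_sets[i] & token_sets[j])
                norm = math.log(len(token_sets[i])) + math.log(len(token_sets[j]))
                weights[i][j] = overlap / norm
    out_weight = [sum(row) for row in weights]
    scores = [1.0] * n
    # PageRank-style update, repeated a fixed number of times
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores

# Toy input, for illustration only
sentences = [
    "Cats chase mice.",
    "Dogs chase cats.",
    "Mice fear cats and dogs.",
    "Bananas are yellow.",
]
tfidf = tfidf_sentence_scores(sentences)
textrank = textrank_sentence_scores(sentences)
for sentence, a, b in zip(sentences, tfidf, textrank):
    print(f"{a:.3f}  {b:.3f}  {sentence}")
```

On this toy input the two schemes disagree: the unrelated last sentence gets the highest TF-IDF score because all of its words are rare, while TextRank ranks it last because it shares no words with the other sentences. This is one reason to compare scoring schemes on your own data before choosing one.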

## Abstractive Summarization

Abstractive summarization is a text summarization technique that goes beyond the extractive approach of selecting and copying sentences from the original text. Instead, abstractive summarization generates new sentences that capture the main ideas of the input document while potentially using different words and phrases. This process mimics human summarization, where the summary is written in a more concise and coherent manner, often rephrasing the content in a novel way.

We leverage a pre-trained language model called T5-small and use the `pipeline()` function from the Hugging Face `transformers` library: https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html#transformers.SummarizationPipeline.


The T5 model was presented in "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" by Colin Raffel et al. It is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a task-specific prefix to the input, e.g., `translate English to German: …` for translation and `summarize: …` for summarization.
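
To make the prefix mechanism concrete, here is a minimal sketch of calling T5 directly rather than through `pipeline()`. The helper names `add_task_prefix` and `summarize_with_t5` are hypothetical, and the generation settings (`num_beams=4`, `max_new_tokens=60`, truncation at 512 tokens) are illustrative assumptions rather than tuned values.

```python
def add_task_prefix(text, task="summarize"):
    # T5 selects the task from a plain-text prefix such as "summarize: "
    return f"{task}: {text.strip()}"

def summarize_with_t5(text, model_name="t5-small", max_new_tokens=60):
    # Imported lazily so the prefix helper can be used without transformers installed
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Truncate long inputs to the token limit assumed here (512 tokens)
    inputs = tokenizer(add_task_prefix(text), return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(add_task_prefix("Artificial intelligence is the intelligence of machines."))
```

Calling `summarize_with_t5(some_text)` downloads the model on first use, so it is best run in an environment with network access such as Colab.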


T5 comes in different sizes. We use T5-small because it fits into the Google Colab environment. You can check https://huggingface.co/docs/transformers/model_doc/t5 for more information.


There are some other popular pretrained models that can be used for summarization.
- GPT-2
- PEGASUS
- mT5
- BART

Feel free to explore other pre-trained language models https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf



**Please make sure that you have selected GPU under Colab runtime -> change runtime type -> hardware accelerator.**

In [None]:
!pip -q install transformers

In [None]:
import wikipedia
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline

def clean_text(text):
    # Replace line breaks with spaces, then drop digits and non-printable characters
    text = text.replace('\n', ' ').replace('\r', '')
    text = ''.join(c for c in text if not c.isdigit() and c.isprintable())
    return text

def summarize(article_title, max_length=100, min_length=10):
    try:
        # Fetch the Wikipedia article
        article_text = wikipedia.page(article_title).content

        article_text = clean_text(article_text)

        # Load the T5 model and tokenizer
        model_name = "t5-small"
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Create the summarization pipeline
        summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device = "cuda")

        # Summarize the article (only the first 600 characters are passed to the model)
        summary = summarizer(article_text[:600], max_length=max_length, min_length=min_length, do_sample=True, clean_up_tokenization_spaces=True)
        return summary[0]['summary_text']

    except wikipedia.exceptions.PageError:
        return "Article not found."
    except wikipedia.exceptions.DisambiguationError:
        return "Multiple articles found. Please specify the exact article title."


In [None]:
article_title = "artificial intelligence"
summary = summarize(article_title)
print("Summary:", summary)

Summary: artificial intelligence (AI) is the intelligence of machines or software. applications include advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), understanding human speech (such as Siri and Alexa)


## Evaluation

Evaluating different text summarization methods is crucial to determine their effectiveness and choose the most suitable one for your task. Here are a few ways to evaluate these techniques:

- **Human Evaluation**: Direct human evaluation involves having human annotators rate or compare summaries based on various criteria such as fluency, coherence, informativeness, and relevance. This approach provides qualitative insights into summary quality.

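- **F1 Score**: You can treat summarization as a binary classification problem over words: precision is the fraction of words in the generated summary that also appear in the reference summary, and recall is the fraction of words in the reference summary that appear in the generated summary. The F1 score is the harmonic mean of the two and balances precision and recall.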

- **Content Overlap**: Measure the overlap of content words (excluding stop words) between the generated summary and the reference summary. This provides an indication of how well the summary captures important information.

- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a widely used family of metrics for evaluating summarization systems. It compares the overlap of n-grams (usually unigrams and bigrams, i.e., ROUGE-1 and ROUGE-2) or the longest common subsequence (ROUGE-L) between the generated summary and the reference (human-created) summary.

- **BLEU** (Bilingual Evaluation Understudy): Initially developed for machine translation, BLEU is also used for summarization evaluation. It measures the precision of n-grams in the generated summary compared to reference summaries. However, BLEU has limitations when it comes to capturing fluency and coherence.
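
The overlap-based metrics above share the same core computation. The following is a simplified sketch of n-gram precision, recall, and F1 in the spirit of ROUGE-N; the function names `ngrams` and `rouge_n` are hypothetical, and whitespace tokenization without stemming is a simplifying assumption, so the numbers will not match an official ROUGE implementation exactly.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token windows, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(generated, reference, n=1):
    gen = Counter(ngrams(generated.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Clipped matches: each n-gram counts at most as often as it occurs in both texts
    overlap = sum((gen & ref).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy sentences, for illustration only
reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
rouge_1 = rouge_n(generated, reference, n=1)
rouge_2 = rouge_n(generated, reference, n=2)
print("ROUGE-1:", rouge_1)
print("ROUGE-2:", rouge_2)
```

With `n=1` this corresponds to the word-level F1 score described above; filtering stop words from both texts before calling it gives a content-overlap variant.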

It is important to use a combination of these evaluation methods to get a comprehensive understanding of how well different summarization techniques perform. Different techniques may excel in different aspects, so a holistic evaluation approach is recommended.

Evaluating various summarization techniques can help you understand the strengths and limitations of different approaches. Feel free to discuss your findings and insights with your peers or ask questions if you come across something interesting.