<a href="https://colab.research.google.com/github/theviperyt/AI-text-summarizer/blob/main/text_summarizer_ml_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install nltk transformers torch --quiet

In [2]:
import nltk
import heapq
import re
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
def extractive_summarizer(text, num_sentences=3):
    # Collapse runs of whitespace so sentence splitting sees clean text.
    text = re.sub(r'\s+', ' ', text)

    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))

    # Count every alphanumeric token that is not a stop word.
    word_frequencies = {}
    for word in words:
        if word.isalnum() and word not in stop_words:
            word_frequencies[word] = word_frequencies.get(word, 0) + 1
    if not word_frequencies:
        return text

    # Normalize counts by the count of the most frequent word.
    max_freq = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] /= max_freq

    # Score sentences by index rather than by text, so repeated
    # sentences do not share a single dictionary entry.
    sentence_scores = {}
    for i, sent in enumerate(sentences):
        for word in word_tokenize(sent.lower()):
            if word in word_frequencies:
                sentence_scores[i] = sentence_scores.get(i, 0) + word_frequencies[word]

    # Take the top-scoring sentences, then restore their original order.
    top_indices = heapq.nlargest(num_sentences, sentence_scores, key=sentence_scores.get)
    return ' '.join(sentences[i] for i in sorted(top_indices))
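
To see how the frequency scoring above ranks sentences, here is a minimal, dependency-free sketch of the same idea. It swaps NLTK's tokenizers for simple regular expressions and uses a tiny hard-coded stop-word set, so `toy_sentence_scores`, `TOY_STOP_WORDS`, and the sample text are illustrative assumptions rather than part of the notebook.

```python
import re

# Hypothetical, deliberately tiny stop-word set (NLTK's list is much larger).
TOY_STOP_WORDS = {"the", "a", "is", "it", "of", "and", "on"}


def _tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r'[a-z0-9]+', text.lower())


def toy_sentence_scores(text):
    """Return (sentence, score) pairs based on normalized word frequencies."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    freqs = {}
    for word in _tokenize(text):
        if word not in TOY_STOP_WORDS:
            freqs[word] = freqs.get(word, 0) + 1
    if not freqs:
        return [(s, 0.0) for s in sentences]

    # A sentence's score is the sum of its words' normalized frequencies;
    # stop words are absent from freqs and therefore contribute 0.
    max_freq = max(freqs.values())
    return [
        (s, sum(freqs.get(w, 0) / max_freq for w in _tokenize(s)))
        for s in sentences
    ]


sample = "Cats sleep a lot. Cats chase mice. Dogs bark."
scores = toy_sentence_scores(sample)
for sentence, score in scores:
    print(f"{score:.2f}  {sentence}")
```

Because each matching word adds to the total, longer sentences tend to score higher under this scheme; dividing by sentence length is one common adjustment.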

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the BART checkpoint fine-tuned on CNN/DailyMail for summarization.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

def custom_abstractive_summarizer_func(text, max_length=120, min_length=40, do_sample=False):
    # Inputs longer than 1024 tokens are truncated before generation.
    inputs = tokenizer(text, return_tensors='pt', max_length=1024, truncation=True)

    # max_length is applied as a cap on newly generated tokens.
    summary_ids = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        min_length=min_length,
        max_new_tokens=max_length,
        num_beams=4,
        do_sample=do_sample
    )

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    # Return a list of dicts so callers can read summary[0]['generated_text'].
    return [{'generated_text': summary}]

abstractive_model = custom_abstractive_summarizer_func

print("Abstractive summarizer (custom function) initialized successfully.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Please make sure the generation config includes `forced_bos_token_id=0`. 


Abstractive summarizer (custom function) initialized successfully.
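
The tokenizer call above truncates input at 1024 tokens, so text beyond that limit never reaches the model. One possible workaround, sketched here under stated assumptions, is to split long text into word-based chunks and summarize each chunk separately. `chunk_by_words` and its 400-word default are hypothetical choices, and word counts only approximate token counts.

```python
def chunk_by_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    # Step through the word list in strides of max_words.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_by_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to `custom_abstractive_summarizer_func` and the partial summaries joined. Splitting on sentence boundaries instead of raw word counts would avoid cutting sentences in half.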


In [5]:
def abstractive_summarizer(text, num_lines):
    # Heuristic token budget: roughly 20 to 40 tokens per requested line.
    max_len = 40 * num_lines
    min_len = 20 * num_lines

    summary = abstractive_model(
        text,
        max_length=max_len,
        min_length=min_len,
        do_sample=False
    )

    return summary[0]['generated_text']
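
The length heuristic above grows with `num_lines` regardless of how long the input is, so for a short input the minimum length can approach the length of the input itself. A minimal sketch of one way to cap the bounds follows. `capped_length_bounds` is a hypothetical helper, and it approximates the token count by whitespace-separated words, which is an assumption and not what the tokenizer would report.

```python
def capped_length_bounds(text, num_lines, max_per_line=40, min_per_line=20):
    """Return (min_len, max_len), capped by an approximate input length."""
    # Rough stand-in for the token count (assumption: one token per word).
    approx_tokens = max(1, len(text.split()))
    max_len = min(max_per_line * num_lines, approx_tokens)
    # Keep the floor at or below half of the ceiling.
    min_len = min(min_per_line * num_lines, max_len // 2)
    return min_len, max_len


short_bounds = capped_length_bounds("a short input of six words", num_lines=3)
long_bounds = capped_length_bounds("word " * 500, num_lines=3)
print(short_bounds, long_bounds)
```

For long inputs the result matches the original `20 * num_lines` and `40 * num_lines` values; only short inputs are affected.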

In [7]:
long_text = input("Paste your text and press Enter when done:\n")

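try:
    num_lines = int(input("How many lines should the summary have? Enter any positive number: "))
except ValueError:
    # Non-numeric input: use 0 so the check below applies the default.
    num_lines = 0

if num_lines <= 0:
    print("Invalid number. Defaulting to 2 lines.")
    num_lines = 2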

Paste your text and press Enter when done:
The solar system consists of the Sun and the objects that orbit it. The Sun is the central star of the solar system. Eight planets orbit the Sun, including Earth, Mars, and Jupiter. The Sun provides the energy necessary for life on Earth. Because the Sun is so massive, its gravity keeps all the planets in their orbits. Most of the mass in the solar system is contained within the Sun itself
How many lines should the summary have? Enter any positive number: 3


In [8]:
choice = input("Choose summarization method (1 = Extractive, 2 = Abstractive): ")

if choice == "1":
    summary = extractive_summarizer(long_text, num_lines)
elif choice == "2":
    summary = abstractive_summarizer(long_text, num_lines)
else:
    print("Invalid choice. Defaulting to Abstractive.")
    summary = abstractive_summarizer(long_text, num_lines)

print(summary)

Choose summarization method (1 = Extractive, 2 = Abstractive): 1
The solar system consists of the Sun and the objects that orbit it. Eight planets orbit the Sun, including Earth, Mars, and Jupiter. Most of the mass in the solar system is contained within the Sun itself


In [9]:
print("\nExtractive Summary:\n")
print(extractive_summarizer(long_text, num_lines))

print("\nAbstractive Summary:\n")
print(abstractive_summarizer(long_text, num_lines))


Extractive Summary:

The solar system consists of the Sun and the objects that orbit it. Eight planets orbit the Sun, including Earth, Mars, and Jupiter. Most of the mass in the solar system is contained within the Sun itself

Abstractive Summary:

The solar system consists of the Sun and the objects that orbit it. Eight planets orbit the Sun, including Earth, Mars, and Jupiter. The Sun provides the energy necessary for life on Earth. Because the Sun is so massive, its gravity keeps all the planets in their orbits. Most of the mass in the solar system is contained within the Sun itself.
