# Automatic Text Summarization with Python
### In today's age of information overload, it can be difficult to keep up with the amount of content available online. Automatic text summarization is a technique that can help us quickly extract the most important information from a piece of text.

### In this Jupyter notebook, we'll walk through an implementation of an extractive text summarization algorithm using Python. We'll be using the requests, re, beautifulsoup4 and nltk libraries to fetch and process the content from a webpage, create a frequency table of stemmed words, score each sentence based on the frequency of its non-stop words, and generate a summary based on the top-scoring sentences.

### Let's get started!

### > Importing libraries

In [1]:
import requests
import re
import warnings
warnings.filterwarnings("ignore")
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

### > Step 1: Fetch the Article Content

In [2]:
def fetch_article_content(url):
    """
    Fetches the content of the webpage at the given URL and returns it as a string.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    article_content = ''
    for paragraph in soup.find_all('p'):
        article_content += paragraph.text.strip() + ' '
    article_content = re.sub(r'\[[0-9]*\]', '', article_content)
    article_content = re.sub(r'\s+', ' ', article_content)
    return article_content

### > Step 2: Create a Frequency Table

In [3]:
def create_frequency_table(text_string):
    """
    Creates a frequency table of stemmed words from the input string, excluding stopwords.
    """
    stop_words = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    stemmer = PorterStemmer()

    freq_table = dict()
    for word in words:
        word = stemmer.stem(word)
        if word in stop_words:
            continue
        if word in freq_table:
            freq_table[word] += 1
        else:
            freq_table[word] = 1

    return freq_table

### > Step 3: Score Sentences

In [4]:
def score_sentences(sentences, freq_table):
    """
    Scores each sentence based on the frequency of its non-stop words.
    """
    sentence_value = dict()
    for sentence in sentences:
        word_count_in_sentence = len(word_tokenize(sentence))
        word_count_in_sentence_except_stop_words = 0
        for word_value in freq_table:
            if word_value in sentence.lower():
                word_count_in_sentence_except_stop_words += 1
                if sentence[:10] in sentence_value:
                    sentence_value[sentence[:10]] += freq_table[word_value]
                else:
                    sentence_value[sentence[:10]] = freq_table[word_value]

        if sentence[:10] in sentence_value:
            sentence_value[sentence[:10]] = sentence_value[sentence[:10]] / word_count_in_sentence_except_stop_words

    return sentence_value

### > Step 4: Find the Average Score

In [5]:
def find_average_score(sentence_value):
    """
    Calculates the average score of the sentences.
    """
    sum_values = sum(sentence_value.values())
    average = sum_values / len(sentence_value)
    return average

### > Step 5: Generate the Summary

In [6]:
def generate_summary(sentences, sentence_value, threshold):
    """
    Generates the summary of the article using the sentence scores and threshold.
    """
    summary = ''
    for sentence in sentences:
        if sentence[:10] in sentence_value and sentence_value[sentence[:10]] >= threshold:
            summary += " " + sentence
    return summary

### > Step 6: Summarization Algorithm

In [7]:
def run_summarization(article_content):
    """
    Runs the summarization algorithm on the input article content.
    """
    # Create frequency table
    freq_table = create_frequency_table(article_content)

    # Tokenize sentences
    sentences = sent_tokenize(article_content)

    # Score the sentences
    sentence_value = score_sentences(sentences, freq_table)

    # Find the threshold
    threshold = 1.3 * find_average_score(sentence_value)

    # Generate the summary
    summary = generate_summary(sentences, sentence_value, threshold)

    return summary

### > Step 7: Executing the code.

In [8]:
# 
if __name__ == '__main__':
    # Define the web page URL and fetch the content
    url = 'https://en.wikipedia.org/wiki/Machine_Learning'
    article_content = fetch_article_content(url)
    clean_string = url.split('/')[-1].replace('_', ' ')
    # Generate the summary
    summary = clean_string+": "+ run_summarization(article_content)

    # Print the summary
    print(summary)

Machine Learning:  A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. In its application across business problems, machine learning is also referred to as predictive analytics. As a scientific endeavor, machine learning grew out of the quest for artificial intelligence (AI). In the early days of AI as an academic discipline, some researchers were interested in having machines learn from data. : 25 Machine learning (ML), reorganized and recognized as its own field, started to flourish in the 1990s. Some statisticians have adopted methods from machine learning, leading to a combined field that they call statistical learning. Instead, probabilistic bounds on the performance are quite common. If the hypothesis is less complex than the func

### Bonus: If you found the above code interesting, here is something for you.
This is a Python code that generates a summary of a given input text using BART (Bidirectional and Auto-Regressive Transformer) transformer. BART is a state-of-the-art pre-trained language model developed by Facebook AI that can perform a wide range of natural language processing tasks. The code uses the BART model to generate a summary by first tokenizing the input text and then generating a summary using beam search. The resulting summary is decoded from the tokenized summary IDs and returned as a string.

In [9]:
# from transformers import BartTokenizer, TFBartForConditionalGeneration

# def BART_generate_summary(text: str, model_name: str = "facebook/bart-large-cnn") -> str:
#     """
#     Generate summary using BART transformer
#     """
#     # Initialize tokenizer and model
#     tokenizer = BartTokenizer.from_pretrained(model_name, from_tf=True)
#     model = TFBartForConditionalGeneration.from_pretrained(model_name)

#     # Tokenize input text
#     inputs = tokenizer.batch_encode_plus([text], max_length=1024, return_tensors='tf', truncation=True)

#     # Generate summary
#     summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=200, early_stopping=True)
#     summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

#     return summary

# bart_summary = BART_generate_summary(article_content)
# print(bart_summary)