# Intro

As a tutor, I run into a particular situation fairly often: my student needs help with a narrow topic, and although it may belong to a course I'm confident in, I'm often not well versed in that particular topic and run out of time to review it. This text summarizer is fairly rudimentary, but I'm proud of it because it was born of necessity and has been markedly handy for extracting valuable information from notes, so I can better help my student without giving them bad information.

My original text summarizer was a Flask-based app that used extractive summarization (selecting the most pertinent sentences already in the text for a summary) by way of term frequency - inverse document frequency and cosine distance, both of which are discussed further below. It was implemented with NumPy and NLTK, and it condensed a chunk of plain text into a number of sentences chosen by the user. I also included web scraping with Beautiful Soup to generate summaries in the same format for text such as Wikipedia articles.

I decided to improve on this old design by building a more robust layout with the Tkinter GUI toolkit and adding the ability to read in text from a text file. Additionally, I wanted to experiment with abstractive summarization (generating a summary using completely new sentences and language).

I constructed this Jupyter Notebook to detail the summarization process, but to see the full app architecture, check out the full code on my GitHub: https://github.com/wrikych/tldr_tkinter_app

## Basics

Imports first

In [6]:
import re ## regex to handle stripping punctuation and annotation symbols
import nltk
import nltk.tokenize 
from nltk.corpus import stopwords ## stop words
from nltk.tokenize import sent_tokenize, word_tokenize ## to make list of lists 
import heapq # handle table creation
import math # math lol
import urllib.request # web scraping (urlopen lives in the urllib.request submodule)
import bs4 as bs 
import numpy as np 

stop_words = set(stopwords.words('english'))

Beautiful Soup parses the HTML of the page into a searchable tree. From that tree we find every paragraph tag (`<p>`) and join the text inside those tags into one string:

In [7]:
## Pull Wikipedia Text 
def pull_text(article_url):
    
    ## Read in article
    scraped_data = urllib.request.urlopen(article_url) 
    article = scraped_data.read()
    parsed_article = bs.BeautifulSoup(article,'lxml')
    paragraphs = parsed_article.find_all('p')
    article_text = ""
    
    for p in paragraphs:
        article_text += p.text
    
    return article_text
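
To make the paragraph-tag step concrete without hitting the network, here is a minimal sketch that runs the same idea on a small hand-written HTML string. The HTML snippet is made up for illustration, and the sketch uses Python's built-in `html.parser` backend rather than `lxml`, and joins paragraphs with a space, both of which are choices made here rather than what `pull_text` does.

```python
from bs4 import BeautifulSoup

# Made-up page: two paragraphs plus some navigation text we do not want.
html = """
<html><body>
  <h1>Ice cream</h1>
  <p>Ice cream is a frozen dessert.</p>
  <div class="nav">Jump to navigation</div>
  <p>It is served cold.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the text that sits inside <p> tags.
paragraphs = [p.get_text() for p in soup.find_all("p")]
article_text = " ".join(paragraphs)

print(article_text)
```

Only the two paragraphs survive; the heading and the navigation text are left out because they are not inside `<p>` tags.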

In [8]:
## example -->  

URL = 'https://en.wikipedia.org/wiki/Ice_cream'

wiki_text = pull_text(URL)
wiki_text

'\nIce cream is a frozen dessert typically made from milk or cream that has been flavoured with a sweetener, either sugar or an alternative, and a spice, such as cocoa or vanilla, or with fruit, such as strawberries or peaches. Food colouring is sometimes added in addition to stabilizers. The mixture is cooled below the freezing point of water and stirred to incorporate air spaces and prevent detectable ice crystals from forming. It can also be made by whisking a flavoured cream base and liquid nitrogen together. The result is a smooth, semi-solid foam that is solid at very low temperatures (below 2\xa0°C or 35\xa0°F). It becomes more malleable as its temperature increases.\nIce cream may be served in dishes, eaten with a spoon, or licked from edible wafer ice cream cones held by the hands as finger food. Ice cream may be served with other desserts—such as cake or pie—or used as an ingredient in cold dishes—like ice cream floats, sundaes, milkshakes, and ice cream cakes—or in baked ite

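We first format the text and prepare it for tokenization (splitting the text into sentences, and each sentence into a list of words), getting rid of any errant punctuation. We keep the original text as a variable to pull from at the end.

The cleaned copy, which contains only letters and spaces, is what we count words from. The lightly cleaned original keeps its punctuation so that it can still be split into sentences. Stop words are filtered out later, at the counting step.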

In [12]:
## Format Text
def fix_it_up(article_text):
    ## Remove bracketed citation markers such as [12], then collapse extra whitespace
    article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
    article_text = re.sub(r'\s+', ' ', article_text)
    ## Replace everything that is not a letter with a space, then collapse whitespace again
    formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
    formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
    ## Return the letters-only copy and the lightly cleaned original
    return formatted_article_text, article_text

In [13]:
formatted_wiki_text, wiki_text = fix_it_up(wiki_text)
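
To see what the two return values look like, here is the same sequence of regular expressions applied to a short made-up string. The sample text and the name `clean_text` are assumptions for illustration; the patterns are the ones used in `fix_it_up`.

```python
import re

def clean_text(text):
    # Drop bracketed citation markers such as [1], then collapse whitespace.
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    # Letters-only copy: every other character becomes a space.
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    letters_only = re.sub(r'\s+', ' ', letters_only)
    return letters_only, text

sample = "Ice cream[1] is  sweet.[23] It's cold!"
letters_only, cleaned = clean_text(sample)

print(repr(cleaned))       # "Ice cream is sweet. It's cold!"
print(repr(letters_only))  # 'Ice cream is sweet It s cold '
```

The letters-only copy is crude ("It's" becomes "It s", and a trailing space is left behind), which is fine because it is only used for counting words, never for display.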

# Extractive Summarization

In essence, extractive summarization is the process of constructing an overview of a piece of text by using sentences already included within. In other words, extractive summarization focuses on picking the most important points of a passage to generate a summary, instead of creating totally new sentences. 

The method I chose to use for extractive text summarization is called term frequency - inverse document frequency, or tf-IDF for short. At its core, tf-IDF is a commonly used statistical measure in natural language processing that records how important a word is within a "document", relative to the other documents in the "corpus". The measure is a floating point number: the higher the number, the more important the word. In our context, one document is a paragraph, with the corpus being the entire passage. (The code below works at a finer grain and treats each sentence as a document.)

For example, if I read a passage about ice cream and its creation, the word "milk" would likely have a higher tf-IDF value than the word "butter", because milk is more likely to be an integral part of the passage's paragraphs (since milk is one of ice cream's key components). The measurement is two-tiered: it looks at a word's importance within a paragraph, and then at the number of paragraphs the word shows up in. This offsets any single paragraph that is dominated by a different word. Even if there were a separate paragraph in the piece about ice cream's similarities to and differences from butter (just go with me on this), it is one paragraph of many, so butter would still have a lower tf-IDF value than milk.
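
For reference, one common textbook form of the measure can be written as follows. This is the standard definition rather than the exact calculation in this notebook, which uses a simplified variant; the symbols $t$ (term), $d$ (document), $D$ (corpus), and $N$ (number of documents in the corpus) are notation introduced here.

$$
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d}, \qquad
\mathrm{idf}(t, D) = \ln\frac{N}{|\{d \in D : t \in d\}|}
$$

$$
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
$$

As a quick check of the arithmetic: with $N = 10$ documents, a word that appears in 2 of them has $\mathrm{idf} = \ln(10/2) = \ln 5 \approx 1.61$, while a word that appears in all 10 has $\mathrm{idf} = \ln 1 = 0$.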

## A Few Things to Consider 

There is one glaring pitfall that a frequency-based measure like tf-IDF can fall victim to: filler words (what the Natural Language Toolkit, a Python library for natural language processing, calls "stop words"). These are words used to glue ideas together: the, like, as, to, etc. Left in, their sheer frequency could lead the program into believing every passage is a passage about the word "the".

As we've seen, the Natural Language Toolkit (NLTK) ships a corpus called stopwords that contains a list of common English filler words. With it, handling filler words becomes as simple as teaching the program to detect them and omit them from the tf-IDF calculation.

### Tokenization

In [None]:
## Tokenize Sentences, Find Word Frequency
def sentence_tokenize(formatted_article_text, article_text, stop_words = stop_words):
    ## Split the lightly cleaned original text into sentences
    sentence_list = nltk.sent_tokenize(article_text)
    word_frequencies = {} # A dictionary of words and how often they show up
    ## Count every word in the letters-only text that is not a stop word.
    ## Note: the check is case-sensitive, so a capitalized "The" is still counted.
    for word in nltk.word_tokenize(formatted_article_text):
        if word not in stop_words:
            word_frequencies[word] = word_frequencies.get(word, 0) + 1
    return word_frequencies, sentence_list

In [15]:
word_freq, sent_list = sentence_tokenize(formatted_wiki_text, wiki_text, stop_words=stop_words)
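
The counting step can be tried on a two-sentence example without downloading any NLTK data. The sketch below is a simplified stand-in, not the notebook's function: the tiny `STOP_WORDS` set is hypothetical (NLTK's English list is much longer), a regular expression replaces `nltk.word_tokenize`, and words are lowercased before the stop-word check.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"is", "a", "with", "and", "the"}

text = ("Ice cream is a delicious summer treat made with milk and flavoring. "
        "The milk is mixed with ingredients, churned, and cooled.")

# Simplified tokenizer: runs of letters, lowercased so "The" matches "the".
tokens = [word.lower() for word in re.findall(r"[A-Za-z]+", text)]

# Count every token that is not a stop word.
word_counts = Counter(token for token in tokens if token not in STOP_WORDS)

print(word_counts.most_common(1))  # [('milk', 2)]
```

`collections.Counter` does the same bookkeeping as the hand-written dictionary above; "milk" is the only word that appears twice once the filler words are gone.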

## tf-IDF

In [6]:
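## IDF Calculation
## Note: word_sent is the word's total count across the text (from word_freq),
## used here as a stand-in for the number of sentences that contain the word
def idf(sent_list, word_sent):
    return math.log(len(sent_list)/word_sent)

# Create IDF values for individual words 
def word_idf_create(sent_list, word_freq):
    word_idf = {}
    for word, count in word_freq.items():
        word_idf[word] = idf(sent_list, count)
    return word_idf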

In [16]:
word_idf = word_idf_create(sent_list, word_freq)
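
A few numbers help show how this score behaves. The sketch below repeats the arithmetic of `idf` on made-up counts and, for comparison, computes the textbook variant that divides by the number of sentences containing the word. The function names and the toy sentences are assumptions for illustration.

```python
import math

def idf_from_count(num_sentences, word_count):
    """Same arithmetic as the notebook's idf(): log(sentences / total count)."""
    return math.log(num_sentences / word_count)

def idf_from_doc_freq(tokenized_sentences, word):
    """Textbook variant: log(sentences / sentences that contain the word)."""
    containing = sum(1 for sentence in tokenized_sentences if word in sentence)
    return math.log(len(tokenized_sentences) / containing)

rare = idf_from_count(10, 2)      # log(5), roughly 1.61
common = idf_from_count(10, 20)   # log(0.5), roughly -0.69

# Toy data: "milk" appears twice in the first sentence and once in the second.
sentences = [["milk", "milk", "cream"], ["milk", "sugar"], ["ice"]]
by_doc_freq = idf_from_doc_freq(sentences, "milk")  # log(3 / 2)
by_count = idf_from_count(len(sentences), 3)        # log(3 / 3) = 0.0

print(rare, common, by_doc_freq, by_count)
```

Rare words get a high score, and a word that occurs more often than there are sentences gets a negative one. The two variants only differ when a word repeats inside a single sentence, and the textbook variant would need a guard for words that appear in no sentence at all.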

## Cosine Distance 

As mentioned above, tf-IDF gives us floating point values for the importance of words within a paragraph and in the passage as a whole. However, using this information to select important sentences takes some additional calculation. Before moving forward, let's establish a workflow:

Our example: "Ice cream is a delicious summer treat made with milk and flavoring. The milk is mixed with ingredients, churned, and cooled."

- Sentences are pulled from the passage, stripped of punctuation and errant symbols, and organized into a list of lists using NLTK, where each sentence becomes a list of strings, one for each word that shows up
	- i.e. [["Ice", "cream", "is", "a", "delicious", ...], ["The", "milk", "is", ...]]
- Sentences are cleaned of filler words
	- i.e. [["Ice", "cream", "delicious", "summer", "treat", "made", "milk", "flavoring"], ["milk", "mixed", "ingredients", "churned", "cooled"]]
- The tf-IDF value is calculated for each word, and then stored back into its corresponding place in the list
	- i.e. [[0.012, 0.032, 0.0003, 0.001, ...], [0.043, 0.002, 0.001, ...]]

Stopping here, we notice that we've taken our text and converted it into a list of lists containing numerical values - in reality, we've created vectors. Since the most important sentences are bound to have similar words with high tf-IDF values, theoretically we can find similar sentences with the highest values to make our summary. 

To handle this, we look to vector operations, particularly the dot product. The dot product of two vectors equals the product of their magnitudes multiplied by the cosine of the angle between them. The more closely two vectors point in the same direction, the smaller the angle between them and the closer the cosine is to 1. The cosine itself is the dot product divided by the product of the two magnitudes, so by looking for the largest cosine values (the smallest angles) we can find the vectors that are closest together.

- Compare each vector with every other vector to find their cosine values, and place them in a table
- Using the number of sentences specified by the user, find the top n vectors in the table, grab the corresponding sentences, and construct the summary.
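
The cosine calculation described above can be sketched in a few lines. This is a minimal illustration on hand-made vectors, not the notebook's own code (the cells that follow rank sentences by summing their words' IDF values instead). The function name `cosine_similarity` and the toy vectors are assumptions made for the example, and both vectors are assumed to have the same length.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Toy sentence vectors over a shared three-word vocabulary.
sentence_a = [1.0, 0.0, 1.0]
sentence_b = [2.0, 0.0, 2.0]   # same direction as sentence_a
sentence_c = [0.0, 1.0, 0.0]   # shares no words with sentence_a
sentence_d = [1.0, 1.0, 0.0]   # shares one word with sentence_a

print(cosine_similarity(sentence_a, sentence_b))
print(cosine_similarity(sentence_a, sentence_c))
print(cosine_similarity(sentence_a, sentence_d))
```

The three results come out close to 1, 0, and 0.5 (up to floating point rounding): identical direction, no overlap, and partial overlap. Note that scaling a vector, as with `sentence_b`, does not change the cosine.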

In [17]:
## Find IDF Values for Sentences
def sent_idf_create(sent_list, word_idf):
    sent_idf = {}
    ## Score each sentence directly; no index lookup is needed
    for sent in sent_list:
        sent_counter = 0.0
        ## Add up the IDF values of the words we have scores for
        for word in word_tokenize(sent):
            if word in word_idf:
                sent_counter += word_idf[word]
        sent_idf[sent] = sent_counter
    return sent_idf

In [19]:
sent_idf = sent_idf_create(sent_list, word_idf)

In [18]:
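def top_n(sent_idf, num_sents):
    all_sents = list(sent_idf.keys())
    all_stats = list(sent_idf.values())
    if num_sents <= 0:
        return [] # a slice of [-0:] would otherwise return every sentence
    ## Indices of the num_sents highest scores
    top_idx = list(np.argsort(all_stats)[-num_sents:])
    ## Put them back in the order the sentences appear in the text
    top_idx.sort()
    return [all_sents[idx] for idx in top_idx]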
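
The same selection can also be written with `heapq`, which is imported at the top of the notebook. The sketch below is an alternative under stated assumptions, not the notebook's implementation: the name `top_n_in_order` and the toy scores are made up, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
import heapq

def top_n_in_order(sent_scores, num_sents):
    """Return the num_sents highest-scoring sentences in their original order."""
    items = list(sent_scores.items())
    # Positions of the best-scoring sentences, highest score first.
    top_idx = heapq.nlargest(num_sents, range(len(items)),
                             key=lambda i: items[i][1])
    # Sorting the positions restores the order the sentences appeared in.
    return [items[i][0] for i in sorted(top_idx)]

scores = {"s0": 1.0, "s1": 5.0, "s2": 3.0, "s3": 4.0}

print(top_n_in_order(scores, 2))  # ['s1', 's3']
print(top_n_in_order(scores, 3))  # ['s1', 's2', 's3']
```

Keeping the original order matters for readability: the summary reads like a shortened version of the passage rather than a ranked list.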

In [22]:
### for 12 sentences:

final_sents = top_n(sent_idf, num_sents=12)

for sent in final_sents:
    print(sent)
    print("")

The earliest known written process to artificially make ice is known not from culinary texts but from the 13th-century writings of Arab historian Ibn Abi Usaybi'a in his book Kitab Uyun al-anba fi tabaqat-al-atibba (Book of Sources of Information on the Classes of Physicians) concerning medicine, in which Ibn Abu Usaybi’a attributes the process to an even older author, Ibn Bakhtawayhi, of whom nothing is known.

One hundred years later, Charles I of England was reportedly so impressed by the "frozen snow" that he offered his own ice cream maker a lifetime pension in return for keeping the formula secret, so that ice cream could be a royal prerogative.

Take Tin Ice-Pots, fill them with any Sort of Cream you like, either plain or sweeten’d, or Fruit in it; shut your Pots very close; to six Pots you must allow eighteen or twenty Pound of Ice, breaking the Ice very small; there will be some great Pieces, which lay at the Bottom and Top: You must have a Pail, and lay some Straw at the Bott

## The Full Workflow

Let's put the individual steps together to see the full workflow.

In [24]:
## Full pipeline from wikipedia article
def wiki_to_sents(article_url, num_sents):
    article_text = pull_text(article_url) ## get text from URL 
    formatted_article_text, article_text = fix_it_up(article_text) ## format text 
    word_freq, sent_list = sentence_tokenize(formatted_article_text, article_text, stop_words) ## tokenize 
    word_F = word_idf_create(sent_list, word_freq) ## get IDF values for words
    sent_F = sent_idf_create(sent_list, word_F) ## sum word IDF values into one score per sentence
    top_sents = top_n(sent_F, num_sents=num_sents) ## keep the n highest-scoring sentences, in original order
    return top_sents


## Full pipeline for plain text 
def text_to_sents(article_text, num_sents):
    formatted_article_text, article_text = fix_it_up(article_text)
    word_freq, sent_list = sentence_tokenize(formatted_article_text, article_text, stop_words)
    word_F = word_idf_create(sent_list, word_freq)
    sent_F = sent_idf_create(sent_list, word_F)
    top_sents = top_n(sent_F, num_sents=num_sents)
    return top_sents

Let's run an example of it using the following text: 

```
Members of the northern business elite forged close ties with each other to protect and expand their economic interests. Marriages between leading families formed a crucial strategy to advance economic advantage, and the homes of the northern elite became important venues for solidifying social bonds. Exclusive neighborhoods started to develop as the wealthy distanced themselves from the poorer urban residents, and cities soon became segregated by class.

Industrial elites created chambers of commerce to advance their interests; by 1858 there were ten in the United States. These networking organizations allowed top bankers and merchants to stay current on the economic activities of their peers and further strengthen the bonds among themselves. The elite also established social clubs to forge and maintain ties. The first of these, the Philadelphia Club, came into being in 1834. Similar clubs soon formed in other cities and hosted a range of social activities designed to further bind together the leading economic families. Many northern elites worked hard to ensure the transmission of their inherited wealth from one generation to the next. Politically, they exercised considerable power in local and state elections. Most also had ties to the cotton trade, so they were strong supporters of slavery.

The Industrial Revolution led some former artisans to reinvent themselves as manufacturers. These enterprising leaders of manufacturing differed from the established commercial elite in the North and South because they did not inherit wealth. Instead, many came from very humble working-class origins and embodied the dream of achieving upward social mobility through hard work and discipline. As the beneficiaries of the economic transformations sweeping the republic, these newly established manufacturers formed a new economic elite that thrived in the cities and cultivated its own distinct sensibilities. They created a culture that celebrated hard work, a position that put them at odds with southern planter elites who prized leisure and with other elite northerners who had largely inherited their wealth and status.

Peter Cooper provides one example of the new northern manufacturing class. Ever inventive, Cooper dabbled in many different moneymaking enterprises before gaining success in the glue business. He opened his Manhattan glue factory in the 1820s and was soon using his profits to expand into a host of other activities, including iron production. One of his innovations was the steam locomotive, which he invented in 1827). Despite becoming one of the wealthiest men in New York City, Cooper lived simply. Rather than buying an ornate bed, for example, he built his own. He believed respectability came through hard work, not family pedigree. Those who had inherited their wealth derided self-made men like Cooper, and he and others like him were excluded from the social clubs established by the merchant and financial elite of New York City. Self-made northern manufacturers, however, created their own organizations that aimed to promote upward mobility. The Providence Association of Mechanics and Manufacturers was formed in 1789 and promoted both industrial arts and education as a pathway to economic success. In 1859, Peter Cooper established the Cooper Union for the Advancement of Science and Art, a school in New York City dedicated to providing education in technology. Merit, not wealth, mattered most according to Cooper, and admission to the school was based solely on ability; race, sex, and family connections had no place. The best and brightest could attend Cooper Union tuition-free, a policy that remained in place until 2014.
```

In [2]:
## Adjacent string literals inside parentheses are joined into one string
new_article_text = (
    "Members of the northern business elite forged close ties with each other to protect and expand their economic interests. Marriages between leading families formed a crucial strategy to advance economic advantage, and the homes of the northern elite became important venues for solidifying social bonds. Exclusive neighborhoods started to develop as the wealthy distanced themselves from the poorer urban residents, and cities soon became segregated by class. Industrial elites created chambers of commerce to advance their interests; by 1858 there were ten in the United States. These networking organizations allowed top bankers and merchants to stay current on the economic activities of their peers and further strengthen the bonds among themselves. The elite also established social clubs to forge and maintain ties. The first of these, the Philadelphia Club, came into being in 1834. Similar clubs soon formed in other cities and hosted a range of social activities designed to further bind together the leading economic families. Many northern elites worked hard to ensure the transmission of their inherited wealth from one generation to the next. Politically, they exercised considerable power in local and state elections. Most also had ties to the cotton trade, so they were strong supporters of slavery. The Industrial Revolution led some former artisans to reinvent themselves as manufacturers. These enterprising leaders of manufacturing differed from the established commercial elite in the North and South because they did not inherit wealth. Instead, many came from very humble working-class origins and embodied the dream of achieving upward social mobility through hard work and discipline. As the beneficiaries of the economic transformations sweeping the republic, these newly established manufacturers formed a new economic elite that thrived in the cities and cultivated its own distinct sensibilities. "
    "They created a culture that celebrated hard work, a position that put them at odds with southern planter elites who prized leisure and with other elite northerners who had largely inherited their wealth and status. Peter Cooper provides one example of the new northern manufacturing class. Ever inventive, Cooper dabbled in many different moneymaking enterprises before gaining success in the glue business. He opened his Manhattan glue factory in the 1820s and was soon using his profits to expand into a host of other activities, including iron production. One of his innovations was the steam locomotive, which he invented in 1827). Despite becoming one of the wealthiest men in New York City, Cooper lived simply. Rather than buying an ornate bed, for example, he built his own. He believed respectability came through hard work, not family pedigree. Those who had inherited their wealth derided self-made men like Cooper, and he and others like him were excluded from the social clubs established by the merchant and financial elite of New York City. Self-made northern manufacturers, however, created their own organizations that aimed to promote upward mobility. The Providence Association of Mechanics and Manufacturers was formed in 1789 and promoted both industrial arts and education as a pathway to economic success. In 1859, Peter Cooper established the Cooper Union for the Advancement of Science and Art, a school in New York City dedicated to providing education in technology. Merit, not wealth, mattered most according to Cooper, and admission to the school was based solely on ability; race, sex, and family connections had no place. The best and brightest could attend Cooper Union tuition-free, a policy that remained in place until 2014."
)

In [27]:
top_sentences = text_to_sents(new_article_text, num_sents=8)

In [28]:
top_sentences

['Marriages between leading families formed a crucial strategy to advance economic advantage, and the homes of the northern elite became important venues for solidifying social bonds.',
 'Exclusive neighborhoods started to develop as the wealthy distanced themselves from the poorer urban residents, and cities soon became segregated by class.',
 'These networking organizations allowed top bankers and merchants to stay current on the economic activities of their peers and further strengthen the bonds among themselves.',
 'As the beneficiaries of the economic transformations sweeping the republic, these newly established manufacturers formed a new economic elite that thrived in the cities and cultivated its own distinct sensibilities.',
 'They created a culture that celebrated hard work, a position that put them at odds with southern planter elites who prized leisure and with other elite northerners who had largely inherited their wealth and status.',
 'Those who had inherited their wealt

# Abstractive Summarization


Abstractive summarization is a text generation technique that aims to produce concise and coherent summaries by understanding the context of the input document and generating new sentences. Unlike extractive summarization, which selects and rearranges existing sentences, abstractive summarization can capture the main ideas and express them in a more human-like manner.

The advantages of abstractive summarization lie in its ability to generate concise summaries that convey the essential meaning of the source text. It can go beyond the limitations of extractive methods by generating novel sentences that capture the overall message and tone of the original document. This makes abstractive summarization more suitable for tasks where brevity and readability are crucial.

The T5 (Text-to-Text Transfer Transformer) architecture is a powerful model for abstractive summarization. It employs a transformer-based neural network that uses self-attention mechanisms to capture the relationships between words and generate high-quality summaries. T5 is pre-trained on a large corpus of text data and fine-tuned for specific summarization tasks, enabling it to understand diverse writing styles and produce coherent and contextually relevant summaries. Its flexibility and performance make T5 a popular choice for various natural language processing applications.

In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration


def abstract_summarize(input_text):
    # Load the T5 tokenizer and model
    tokenizer = T5Tokenizer.from_pretrained('t5-small')
    model = T5ForConditionalGeneration.from_pretrained('t5-small')

    # Preprocess the input text
    inputs = tokenizer.encode("summarize: " + input_text, return_tensors='pt', max_length=512, truncation=True)

    # Generate the summary
    outputs = model.generate(inputs, max_length=150, min_length=40, num_beams=4, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return summary


In [11]:
new_summary = abstract_summarize(new_article_text)

In [13]:
def print_sentences_on_new_line(text):
    sentences = text.split('. ')
    for sentence in sentences:
        print(sentence)

print_sentences_on_new_line(new_summary)

the northern business elite established social clubs to forge and maintain ties
by 1858 there were ten in the united states
many northern elites worked hard to ensure the transmission of their inherited wealth from one generation to the next.


While the T5 transformer has numerous advantages, it also has a few disadvantages to consider. Firstly, the computational resources required to train and deploy the T5 model can be substantial, making it challenging for individuals or organizations with limited access to high-performance computing infrastructure. Additionally, the large size of the model may hinder its deployment on resource-constrained devices or in scenarios where real-time processing is essential. Moreover, the T5 model's complexity can lead to difficulties in interpreting its decisions and understanding the underlying reasoning behind its generated outputs, which may raise concerns regarding transparency and trustworthiness. Finally, as with any language model, the T5 transformer is not immune to biases present in the training data, which may inadvertently influence the generated summaries. Careful evaluation and mitigation strategies are necessary to address these limitations effectively.

# Final Thoughts and Conclusions

In conclusion, the T5 transformer offers the advantage of describing the overall theme of a passage, while extractive summarization excels in highlighting important details. To achieve a comprehensive summary, a hybrid approach that combines both styles seems promising. Although my experience was limited to the "T5-small" version due to CPU limitations, an ideal scenario would involve training the transformer on extensive textbook data and utilizing a more complex model with tuned hyperparameters. Through this journey, I have come to appreciate the intricacy of text summarization and natural language processing, recognizing the importance of considering a word's innate significance and its balance within the larger context. Abstractive and extractive summarization demonstrate the remarkable power of algorithms in distilling complex subtleties into binary form, paving the way for further advancements in this field.
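
One way the hybrid idea might be wired together is sketched below: pick the key sentences first, then hand only those to the abstractive model. This is an untested design sketch, not something the notebook implements. The name `hybrid_summarize` is hypothetical, and the two stand-in functions exist only so the example runs without NLTK or transformers.

```python
def hybrid_summarize(text, extract_fn, abstract_fn, num_sents=8):
    """Pick the top sentences first, then rewrite them as a short summary."""
    key_sentences = extract_fn(text, num_sents)
    return abstract_fn(" ".join(key_sentences))

# Stand-in functions so the sketch runs on its own.
def fake_extract(text, num_sents):
    return text.split(". ")[:num_sents]

def fake_abstract(text):
    return text.upper()

result = hybrid_summarize("One. Two. Three", fake_extract, fake_abstract, num_sents=2)
print(result)  # ONE TWO
```

In this notebook, `text_to_sents` and `abstract_summarize` have matching signatures and could be passed in as `extract_fn` and `abstract_fn`. A possible side benefit is that the extractive pass shortens the input before it reaches the 512-token truncation in `abstract_summarize`.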