Let's use TF-IDF to build a basic extractive text summarization function. This is only a proof of concept that will be improved upon later.

Importing relevant packages

In [1]:
# for TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# allows us to compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# nlargest returns the n largest elements of an iterable, in descending order
from heapq import nlargest

# sent_tokenize splits a text into sentences
from nltk.tokenize import sent_tokenize

Suppose we want a function that takes a string (text) and an integer (n) and returns the n most relevant sentences. For example:

In [2]:
# generated by ChatGPT
text = '''Having a bad day is an inevitable part of life that we all experience at some point. It's one of those days where everything seems to go wrong, leaving us feeling frustrated, overwhelmed, and emotionally drained. Whether it's a series of unfortunate events or simply waking up on the wrong side of the bed, a bad day can cast a shadow over our mood and outlook.
From the moment we open our eyes, the signs of a bad day start to manifest. The alarm clock fails to go off, resulting in a frantic rush to get ready for the day ahead. Breakfast burns, leaving a lingering smell of charred toast in the air. Rushing out the door, we're met with unexpected traffic or delayed public transportation, further exacerbating the sense of being behind schedule.
Once at work or school, tasks and responsibilities pile up, seeming insurmountable. Miscommunications and conflicts arise, adding to the mounting stress. The day progresses in a seemingly endless cycle of setbacks, small annoyances, and disappointments. Technology malfunctions, important documents go missing, and unforeseen obstacles arise at every turn.
Amidst the chaos, personal struggles and emotional burdens may weigh heavily on our minds. Coping with personal issues while simultaneously navigating the challenges of the day can feel overwhelming. Negative thoughts and self-doubt creep in, intensifying the feeling of having a bad day.
Physically, exhaustion sets in as the day drags on, making it difficult to focus or find motivation. Even the simplest tasks become arduous, and the desire to retreat and escape from the world grows stronger. Social interactions may feel strained, as our own negative emotions color our interactions with others.
As the day draws to a close, we may find solace in the knowledge that tomorrow is a fresh start—a chance to leave the bad day behind and embrace the possibility of a better one. Reflecting on the day's challenges can provide insights into areas for growth and improvement, allowing us to learn from the experience.
Though having a bad day can be disheartening, it is important to remember that it is just a passing phase. Life is a mixture of ups and downs, and bad days serve as reminders of our resilience and ability to overcome adversity. By practicing self-care, seeking support from loved ones, and maintaining a positive mindset, we can weather the storm and look forward to brighter days ahead.'''

In [3]:
# Let's check out a summary with the top 5 sentences
n = 5

In [4]:
# split the text into a list of sentences
sentences = sent_tokenize(text)
sentences

['Having a bad day is an inevitable part of life that we all experience at some point.',
 "It's one of those days where everything seems to go wrong, leaving us feeling frustrated, overwhelmed, and emotionally drained.",
 "Whether it's a series of unfortunate events or simply waking up on the wrong side of the bed, a bad day can cast a shadow over our mood and outlook.",
 'From the moment we open our eyes, the signs of a bad day start to manifest.',
 'The alarm clock fails to go off, resulting in a frantic rush to get ready for the day ahead.',
 'Breakfast burns, leaving a lingering smell of charred toast in the air.',
 "Rushing out the door, we're met with unexpected traffic or delayed public transportation, further exacerbating the sense of being behind schedule.",
 'Once at work or school, tasks and responsibilities pile up, seeming insurmountable.',
 'Miscommunications and conflicts arise, adding to the mounting stress.',
 'The day progresses in a seemingly endless cycle of setback

Here we create the TF-IDF matrix, with one row per sentence and one column per term

In [5]:
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)
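To make the shape of this matrix concrete, here is a minimal sketch on three made-up sentences. The names toy_sentences, toy_vectorizer and toy_matrix are hypothetical and exist only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sentences, not taken from the text above
toy_sentences = [
    "The alarm clock fails to go off.",
    "The toast burns at breakfast.",
    "The alarm rings late and breakfast is skipped.",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_sentences)

# One row per sentence, one column per term that survives stop-word removal
print(toy_matrix.shape)

# vocabulary_ maps each term to its column index; sorting the terms gives column order
print(sorted(toy_vectorizer.vocabulary_))
```

Stop words such as "the" never get a column, so only content-bearing terms contribute to the similarity scores computed later.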

Using cosine similarity, we score every other sentence against the last sentence of the text (tfidf_matrix[-1]), which serves here as a rough stand-in for the text as a whole

In [6]:
scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]
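The call above scores each sentence against the last sentence only, so the last sentence itself can never be selected. One possible alternative, sketched here as an assumption rather than the method this notebook uses, is to vectorize the whole document with the fitted vectorizer and compare every sentence against that single vector. The names demo_sentences, demo_vectorizer, sentence_matrix, document_vector and doc_scores are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pre-split sentences standing in for the output of sent_tokenize
demo_sentences = [
    "A bad day can start with a broken alarm clock.",
    "Traffic makes the bad day worse.",
    "Cats enjoy sleeping in the sun.",
]

demo_vectorizer = TfidfVectorizer(stop_words='english')
sentence_matrix = demo_vectorizer.fit_transform(demo_sentences)

# Represent the whole document as one vector in the same vocabulary
document_vector = demo_vectorizer.transform([" ".join(demo_sentences)])

# One similarity score per sentence, including the last one
doc_scores = cosine_similarity(document_vector, sentence_matrix)[0]
print(doc_scores)
```

With this variant the scores array has the same length as the sentence list, so indices map directly onto sentences without the off-by-one introduced by slicing off the last row.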

Using the nlargest function, we can pick the indices of the n highest-scoring sentences

In [7]:
summary_sentences = nlargest(n, range(len(scores)), key=scores.__getitem__)
summary_sentences

[13, 4, 1, 20, 0]
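The idiom of passing range(len(scores)) with key=scores.__getitem__ ranks indices by their score instead of ranking the scores themselves. A small sketch with made-up scores (demo_scores and top_indices are hypothetical names) shows the behaviour:

```python
from heapq import nlargest

# Hypothetical similarity scores for five sentences
demo_scores = [0.10, 0.45, 0.00, 0.30, 0.45]

# Rank the indices 0..4 by their score and keep the top 3
top_indices = nlargest(3, range(len(demo_scores)), key=demo_scores.__getitem__)
print(top_indices)          # → [1, 4, 3]

# Sorting the indices restores document order
print(sorted(top_indices))  # → [1, 3, 4]
```

When two scores tie, nlargest keeps the earlier index first, which is why index 1 comes before index 4 here.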

Finally, we join the selected sentences in their original order so that the summary reads coherently

In [8]:
sorted(summary_sentences)

[0, 1, 4, 13, 20]

In [9]:
# join the selected sentences in document order
result = " ".join(sentences[i] for i in sorted(summary_sentences))

In [10]:
result

"Having a bad day is an inevitable part of life that we all experience at some point. It's one of those days where everything seems to go wrong, leaving us feeling frustrated, overwhelmed, and emotionally drained. The alarm clock fails to go off, resulting in a frantic rush to get ready for the day ahead. Negative thoughts and self-doubt creep in, intensifying the feeling of having a bad day. Life is a mixture of ups and downs, and bad days serve as reminders of our resilience and ability to overcome adversity."

Using ROUGE to evaluate the summary

In [11]:
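from rouge import Rouge
rouge = Rouge()

# get_scores takes the hypothesis first and the reference second;
# here the reference is the full original text, not a human-written summary
scores = rouge.get_scores(result, text)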


The ROUGE-L F-score ranges from 0 to 1, with 1 indicating the closest possible match between summary and reference

In [12]:
scores[0]['rouge-l']['f']

0.4358974324884944
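ROUGE-L is based on the longest common subsequence (LCS) of tokens shared by the summary and the reference. The sketch below is a simplified, from-scratch illustration of that idea using whitespace tokens and a plain F1. It is an assumption for teaching purposes and does not reproduce the exact computation of the rouge package, so its numbers will differ from the score above. The function names lcs_length and rouge_l_f1 are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # previous[j] holds the LCS length of the tokens of a seen so far and b[:j]
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(hypothesis, reference):
    """Simplified ROUGE-L F1 on whitespace tokens (illustrative assumption)."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    lcs = lcs_length(hyp_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp_tokens)  # share of the summary found in the reference
    recall = lcs / len(ref_tokens)     # share of the reference covered by the summary
    return 2 * precision * recall / (precision + recall)


demo_f1 = rouge_l_f1("the cat sat", "the cat sat on the mat")
print(demo_f1)
```

In this toy case the LCS has 3 tokens, so precision is 1 and recall is 0.5, giving an F1 of about 0.667. The same reasoning explains why an extractive summary scored against its own source text tends to have high precision but limited recall.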

In [14]:
from datasets import load_dataset
dataset = load_dataset("billsum", split="test")

Found cached dataset billsum (/Users/kevinhamakawa/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


In [15]:
import pandas as pd
dataset.set_format("pandas")
df = dataset[0:]
df

Unnamed: 0,text,summary,title
0,SECTION 1. ENVIRONMENTAL INFRASTRUCTURE.\n\n ...,Amends the Water Resources Development Act of ...,To make technical corrections to the Water Res...
1,That this Act may be cited as the ``Federal Fo...,Federal Forage Fee Act of 1993 - Subjects graz...,Federal Forage Fee Act of 1993
2,SECTION 1. SHORT TITLE.\n\n This Act may be...,. Merchant Marine of World War II Congression...,Merchant Marine of World War II Congressional ...
3,SECTION 1. SHORT TITLE.\n\n This Act may be...,Small Business Modernization Act of 2004 - Ame...,To amend the Internal Revenue Code of 1986 to ...
4,SECTION 1. SHORT TITLE.\n\n This Act may be...,Fair Access to Investment Research Act of 2016...,Fair Access to Investment Research Act of 2016
...,...,...,...
3264,SECTION 1. PLACEMENT PROGRAMS FOR FEDERAL EMPL...,Public Servant Priority Placement Act of 1995 ...,Public Servant Priority Placement Act of 1995
3265,SECTION 1. SHORT TITLE.\n\n This Act may be...,Sportsmanship in Hunting Act of 2008 - Amends ...,"A bill to amend title 18, United States Code, ..."
3266,SECTION 1. SHORT TITLE.\n\n This Act may be...,Helping College Students Cross the Finish Line...,Helping College Students Cross the Finish Line...
3267,SECTION 1. SHORT TITLE.\n\n This Act may be...,Makes proceeds from such conveyances available...,Texas National Forests Improvement Act of 2000


In [16]:
def summarizer(text, n):
    """Return up to n sentences of text, kept in their original order."""
    sentences = sent_tokenize(text)
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # score every sentence except the last against the last sentence
    scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]
    # indices of the n highest-scoring sentences
    summary_sentences = nlargest(n, range(len(scores)), key=scores.__getitem__)
    # restore document order before joining
    result = " ".join(sentences[i] for i in sorted(summary_sentences))

    return result

In [17]:
summarizer(df["text"][0], 5)

'SECTION 1. ENVIRONMENTAL INFRASTRUCTURE. (a) Jackson County, Mississippi.--Section 219 of the Water \nResources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is \namended--\n        (1) in subsection (c), by striking paragraph (5) and inserting \n    the following:\n        ``(5) Jackson county, mississippi.--Provision of an alternative \n    water supply and a project for the elimination or control of \n    combined sewer overflows for Jackson County, Mississippi.'

In [18]:
df["summary"][0]

"Amends the Water Resources Development Act of 1999 to: (1) authorize appropriations for FY 1999 through 2009 for implementation of a long-term resource monitoring program with respect to the Upper Mississippi River Environmental Management Program (currently, such funding is designated for a program for the planning, construction, and evaluation of measures for fish and wildlife habitat rehabilitation and enhancement); (2) authorize the Secretary of the Army to carry out modifications to the navigation project for the Delaware River, Pennsylvania and Delaware, if such project as modified is technically sound, environmentally (currently, economically) acceptable, and economically justified; (3) subject certain previously deauthorized water resources development projects to the seven-year limitation governing project deauthorizations under the Act, with the exception of such a project for Indian River County, Florida; (4) except from a certain schedule of the non-Federal cost of the per

In [25]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kevinhamakawa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True
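The PorterStemmer imported above reduces inflected words to a common stem, so that variants of the same word can share one TF-IDF feature. A minimal sketch of what it does to individual words follows. The name demo_stemmer and the word list are chosen only for illustration.

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

# Print each word next to its stem
for word in ["days", "running", "setbacks", "challenges"]:
    print(word, "->", demo_stemmer.stem(word))
```

Stems are not always dictionary words, which is why a summarizer built on stemming should match on the stemmed text but return the original sentences.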

In [26]:
# stemmer instance used by summarizer below
stemmer = PorterStemmer()

def summarizer(text, n):
    """Like the previous summarizer, but stems each word before vectorizing."""
    sentences = sent_tokenize(text)

    # stem every word so that inflected forms (e.g. "day" and "days") share one feature
    stemmed_sentences = []
    for sentence in sentences:
        words = word_tokenize(sentence)
        stemmed_words = [stemmer.stem(word) for word in words]
        stemmed_sentences.append(' '.join(stemmed_words))

    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(stemmed_sentences)
    # score every sentence except the last against the last sentence
    scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]
    summary_sentences = nlargest(n, range(len(scores)), key=scores.__getitem__)
    # return the original (unstemmed) sentences in document order
    result = " ".join(sentences[i] for i in sorted(summary_sentences))

    return result

In [27]:
summarizer(text, 10)

"Having a bad day is an inevitable part of life that we all experience at some point. It's one of those days where everything seems to go wrong, leaving us feeling frustrated, overwhelmed, and emotionally drained. From the moment we open our eyes, the signs of a bad day start to manifest. The alarm clock fails to go off, resulting in a frantic rush to get ready for the day ahead. The day progresses in a seemingly endless cycle of setbacks, small annoyances, and disappointments. Coping with personal issues while simultaneously navigating the challenges of the day can feel overwhelming. Negative thoughts and self-doubt creep in, intensifying the feeling of having a bad day. Physically, exhaustion sets in as the day drags on, making it difficult to focus or find motivation. As the day draws to a close, we may find solace in the knowledge that tomorrow is a fresh start—a chance to leave the bad day behind and embrace the possibility of a better one. Though having a bad day can be dishearte