In [1]:
from datasets import load_dataset
dataset = load_dataset("billsum", split = "test")

Found cached dataset billsum (/Users/stae/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


In [71]:
import pandas as pd
dataset.set_format("pandas")
df = dataset[0:]
df

Unnamed: 0,text,summary,title
0,SECTION 1. ENVIRONMENTAL INFRASTRUCTURE.\n\n ...,Amends the Water Resources Development Act of ...,To make technical corrections to the Water Res...
1,That this Act may be cited as the ``Federal Fo...,Federal Forage Fee Act of 1993 - Subjects graz...,Federal Forage Fee Act of 1993
2,SECTION 1. SHORT TITLE.\n\n This Act may be...,. Merchant Marine of World War II Congression...,Merchant Marine of World War II Congressional ...
3,SECTION 1. SHORT TITLE.\n\n This Act may be...,Small Business Modernization Act of 2004 - Ame...,To amend the Internal Revenue Code of 1986 to ...
4,SECTION 1. SHORT TITLE.\n\n This Act may be...,Fair Access to Investment Research Act of 2016...,Fair Access to Investment Research Act of 2016
...,...,...,...
3264,SECTION 1. PLACEMENT PROGRAMS FOR FEDERAL EMPL...,Public Servant Priority Placement Act of 1995 ...,Public Servant Priority Placement Act of 1995
3265,SECTION 1. SHORT TITLE.\n\n This Act may be...,Sportsmanship in Hunting Act of 2008 - Amends ...,"A bill to amend title 18, United States Code, ..."
3266,SECTION 1. SHORT TITLE.\n\n This Act may be...,Helping College Students Cross the Finish Line...,Helping College Students Cross the Finish Line...
3267,SECTION 1. SHORT TITLE.\n\n This Act may be...,Makes proceeds from such conveyances available...,Texas National Forests Improvement Act of 2000


Let's take a look at a particular row in the 'text' column:

In [72]:
df["text"][7]

"SECTION 1. SHORT TITLE.\n\n    This Act may be cited as the ``Special Agent Scott K. Carey Public \nSafety Officer Benefits Enhancement Act''.\n\n  TITLE I--EDUCATIONAL ASSISTANCE TO OFFICERS DISABLED IN THE LINE OF \n                                  DUTY\n\nSEC. 101. BASIC ELIGIBILITY.\n\n    Section 1212(a)(1) of the Omnibus Crime Control and Safe Streets \nAct of 1968 (42 U.S.C. 3796d-1(a)(1)) is amended--\n            (1) by striking ``a dependent'' and inserting ``an eligible \n        dependent''; and\n            (2) by striking ``education'' and all that follows through \n        the period at the end and inserting ``education.''.\n\nSEC. 102. APPLICATIONS; APPROVAL.\n\n    Section 1213 of the Omnibus Crime Control and Safe Streets Act of \n1968 (42 U.S.C. 3796d-2) is amended--\n            (1) in subsection (b)--\n                    (A) by striking ``the dependent'' each place it \n                appears and inserting ``the applicant''; and\n                    (B) by stri

We can see that the text contains a lot of special characters, such as line breaks (\n) and dashes, as well as code citations in parentheses (e.g. '(U.S.C. 1087kk)'), that we do not need for a summary. So, we get rid of them using regular expressions (regex).

In [73]:
import re

# first, get rid of line breaks
df["text"] = [re.sub("\n", "", df["text"][row]) for row in range(len(df))]
df["text"][7]

"SECTION 1. SHORT TITLE.    This Act may be cited as the ``Special Agent Scott K. Carey Public Safety Officer Benefits Enhancement Act''.  TITLE I--EDUCATIONAL ASSISTANCE TO OFFICERS DISABLED IN THE LINE OF                                   DUTYSEC. 101. BASIC ELIGIBILITY.    Section 1212(a)(1) of the Omnibus Crime Control and Safe Streets Act of 1968 (42 U.S.C. 3796d-1(a)(1)) is amended--            (1) by striking ``a dependent'' and inserting ``an eligible         dependent''; and            (2) by striking ``education'' and all that follows through         the period at the end and inserting ``education.''.SEC. 102. APPLICATIONS; APPROVAL.    Section 1213 of the Omnibus Crime Control and Safe Streets Act of 1968 (42 U.S.C. 3796d-2) is amended--            (1) in subsection (b)--                    (A) by striking ``the dependent'' each place it                 appears and inserting ``the applicant''; and                    (B) by striking ``the dependent's'' each place it          

So what's going on here?

The first argument of the re.sub() function is the pattern you are looking for (in this case '\n'). The second argument is the string that replaces each match (in this case we just want to remove line breaks, so we pass an empty string). The third argument is the string you are performing this operation on.
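As a minimal sketch of those three arguments, the snippet below runs re.sub() on a short string (the variable `sample` is made up for illustration, shaped like the bill text, and is not a row from the dataset). It also shows that replacing '\n' with an empty string fuses the words on either side of a line break, which is where 'DUTYSEC.' in the output above comes from. Replacing with a space instead is one possible way to avoid that.

```python
import re

# Hypothetical sample string, shaped like the bill text above.
sample = "LINE OF \n DUTY\n\nSEC. 101. BASIC ELIGIBILITY."

# re.sub(pattern, replacement, string): delete every line break.
removed = re.sub("\n", "", sample)

# Alternative: replace each line break with a space so words stay separated.
spaced = re.sub("\n", " ", sample)

print(repr(removed))
print(repr(spaced))
```

The extra spaces this leaves behind are collapsed by the r'\s+' substitution in the next cell. In pandas, the same replacement can also be applied to a whole column at once with df["text"].str.replace("\n", " ", regex=False).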

In [78]:
# Next, remove parentheses and everything inside them
df["text"] = [re.sub(r'\([^)]*\)', '', df["text"][row]) for row in range(len(df))]
# Removing dashes
df["text"] = [re.sub("--", '', df["text"][row]) for row in range(len(df))]
# Removing extra spaces
df["text"] = [re.sub(r'\s+', ' ', df["text"][row]) for row in range(len(df))]

df["text"][7]

"SECTION 1. SHORT TITLE. This Act may be cited as the ``Special Agent Scott K. Carey Public Safety Officer Benefits Enhancement Act''. TITLE IEDUCATIONAL ASSISTANCE TO OFFICERS DISABLED IN THE LINE OF DUTYSEC. 101. BASIC ELIGIBILITY. Section 1212 of the Omnibus Crime Control and Safe Streets Act of 1968 is amended by striking ``a dependent'' and inserting ``an eligible dependent''; and by striking ``education'' and all that follows through the period at the end and inserting ``education.''.SEC. 102. APPLICATIONS; APPROVAL. Section 1213 of the Omnibus Crime Control and Safe Streets Act of 1968 is amended in subsection by striking ``the dependent'' each place it appears and inserting ``the applicant''; and by striking ``the dependent's'' each place it appears and inserting ``the applicant's''; and in subsection , by striking ``a dependent'' and inserting ``an applicant''.SEC. 103. RETROACTIVE BENEFITS. Section 1216 of the Omnibus Crime Control and Safe Streets Act of 1968 is amended to r

In the first regex call, we write the pattern as a raw string, r''. The r prefix tells Python not to interpret backslashes inside the string as escape sequences, so they reach the regex engine unchanged. We still need the backslashes themselves, because they escape characters that have a special meaning in regex, such as parentheses.

' \( ' matches an opening round bracket.
' [^)]* ' matches zero or more characters, other than a closing round bracket, that come after the opening round bracket.
' \) ' matches a closing round bracket.
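The three parts can be tried out on their own. This is a small sketch on two sample strings (`simple` and `nested` are names chosen here for illustration). One thing to keep in mind: because the pattern stops at the first closing bracket, it does not treat nested parentheses as a single unit.

```python
import re

# The pattern from the cell above: "(", then any run of non-")" characters, then ")".
pattern = r'\([^)]*\)'

# Sample strings for illustration, modeled on the bill text.
simple = "Section 1213 of the Act (42 U.S.C. 3796d-2) is amended"
nested = "Act of 1968 (42 U.S.C. 3796d-1(a)(1)) is amended"

simple_out = re.sub(pattern, '', simple)
nested_out = re.sub(pattern, '', nested)

print(repr(simple_out))
print(repr(nested_out))
```

In the simple case the whole citation disappears. In the nested case the first match ends at the ')' after 'a', the second match removes '(1)', and the outermost ')' is left behind.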

Regex is a complicated tool, so I suggest reading up on it yourself (or just search for it on Slack every time you use it lol).

Now moving on to tokenizing, stemming, and stopwords:

In [81]:
from nltk.tokenize import sent_tokenize, word_tokenize
# tokenize; first try one row 
test_token = sent_tokenize(df["text"][7])
test_token

['SECTION 1.',
 'SHORT TITLE.',
 "This Act may be cited as the ``Special Agent Scott K. Carey Public Safety Officer Benefits Enhancement Act''.",
 'TITLE IEDUCATIONAL ASSISTANCE TO OFFICERS DISABLED IN THE LINE OF DUTYSEC.',
 '101.',
 'BASIC ELIGIBILITY.',
 "Section 1212 of the Omnibus Crime Control and Safe Streets Act of 1968 is amended by striking ``a dependent'' and inserting ``an eligible dependent''; and by striking ``education'' and all that follows through the period at the end and inserting ``education.''.SEC.",
 '102.',
 'APPLICATIONS; APPROVAL.',
 "Section 1213 of the Omnibus Crime Control and Safe Streets Act of 1968 is amended in subsection by striking ``the dependent'' each place it appears and inserting ``the applicant''; and by striking ``the dependent's'' each place it appears and inserting ``the applicant's''; and in subsection , by striking ``a dependent'' and inserting ``an applicant''.SEC.",
 '103.',
 'RETROACTIVE BENEFITS.',
 "Section 1216 of the Omnibus Crime Con

We tokenize by sentence rather than by word because we plan to use TF-IDF to summarize our text. For TF-IDF, we need to calculate the frequency of each word within each sentence.
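To make the per-sentence idea concrete, here is a minimal sketch of one common TF-IDF formulation, where each sentence plays the role of a document: term frequency within the sentence, multiplied by the log of (number of sentences / number of sentences containing the word). The function name `sentence_tf_idf` and the toy input are assumptions for illustration, and the scoring used later in this notebook may differ.

```python
import math
from collections import Counter

def sentence_tf_idf(tokenized_sentences):
    """Return one {word: score} dict per sentence, treating each sentence as a document."""
    n = len(tokenized_sentences)
    # Sentence frequency: in how many sentences does each word appear?
    sentence_counts = Counter(
        word for sent in tokenized_sentences for word in set(sent)
    )
    scores = []
    for sent in tokenized_sentences:
        counts = Counter(sent)
        scores.append({
            # tf = share of the sentence taken up by the word; idf = log(n / sentence count)
            word: (count / len(sent)) * math.log(n / sentence_counts[word])
            for word, count in counts.items()
        })
    return scores

# Toy input: three already-tokenized sentences.
toy = [["act", "amend", "section"], ["act", "benefit"], ["act", "benefit", "offic"]]
scores = sentence_tf_idf(toy)
print(scores[0])
```

A word that appears in every sentence (here 'act') gets a score of zero, while a word found in only one sentence (such as 'amend') scores highest within that sentence.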

In [86]:
# stemming and stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

ps = PorterStemmer()
processed_sentence = []
stop_words = set(stopwords.words('english'))

for sentence in test_token:
    words = word_tokenize(sentence)
    for word in words:
        word = word.lower()
        word = ps.stem(word)
        if word not in stop_words:
            processed_sentence.append(word)
    
processed_sentence

['section',
 '1',
 '.',
 'short',
 'titl',
 '.',
 'thi',
 'act',
 'may',
 'cite',
 '``',
 'special',
 'agent',
 'scott',
 'k.',
 'carey',
 'public',
 'safeti',
 'offic',
 'benefit',
 'enhanc',
 'act',
 "''",
 '.',
 'titl',
 'ieduc',
 'assist',
 'offic',
 'disabl',
 'line',
 'dutysec',
 '.',
 '101',
 '.',
 'basic',
 'elig',
 '.',
 'section',
 '1212',
 'omnibu',
 'crime',
 'control',
 'safe',
 'street',
 'act',
 '1968',
 'amend',
 'strike',
 '``',
 'depend',
 "''",
 'insert',
 '``',
 'elig',
 'depend',
 "''",
 ';',
 'strike',
 '``',
 'educ',
 "''",
 'follow',
 'period',
 'end',
 'insert',
 '``',
 'educ',
 '.',
 "''.sec",
 '.',
 '102',
 '.',
 'applic',
 ';',
 'approv',
 '.',
 'section',
 '1213',
 'omnibu',
 'crime',
 'control',
 'safe',
 'street',
 'act',
 '1968',
 'amend',
 'subsect',
 'strike',
 '``',
 'depend',
 "''",
 'place',
 'appear',
 'insert',
 '``',
 'applic',
 "''",
 ';',
 'strike',
 '``',
 'depend',
 "'s",
 "''",
 'place',
 'appear',
 'insert',
 '``',
 'applic',
 "'s",
 "''",


We stem the words so that words sharing a root but differing in tense or form are counted together. Stopwords are common words that carry little meaning on their own, such as 'and' and 'or', so we remove them.
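Two details in the output above are worth a second look. The token 'thi' survives because the loop stems each word before checking it against the stop-word list, so 'this' has already become 'thi' and no longer matches. Also, all tokens end up in one flat list, so the sentence boundaries needed for per-sentence frequencies are lost. The sketch below is one possible variation, not the notebook's own code: it checks stop words before stemming and keeps one list per sentence. The helper `preprocess`, the toy stemmer, and the tiny stop-word set are stand-ins for illustration.

```python
def preprocess(sentences, stem, stop_words):
    """Lowercase, drop stop words, then stem; returns one token list per sentence."""
    processed = []
    for words in sentences:
        kept = []
        for word in words:
            word = word.lower()
            # Check the unstemmed word, because the stop-word list holds full words.
            if word not in stop_words:
                kept.append(stem(word))
        processed.append(kept)
    return processed

def toy_stem(word):
    """Stand-in stemmer for illustration only: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") else word

# Stand-in stop-word set for illustration only.
toy_stop_words = {"this", "be", "as", "the", "for"}

result = preprocess(
    [["This", "Act", "may", "be", "cited"], ["Benefits", "for", "officers"]],
    toy_stem,
    toy_stop_words,
)
print(result)
```

With the objects already defined in this notebook, the call would look like preprocess([word_tokenize(s) for s in test_token], ps.stem, stop_words).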