In [1]:
from datasets import load_dataset
dataset = load_dataset("billsum", split = "test")

Found cached dataset billsum (/Users/stae/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


In [2]:
import pandas as pd
dataset.set_format("pandas")
df = dataset[0:]
df

Unnamed: 0,text,summary,title
0,SECTION 1. ENVIRONMENTAL INFRASTRUCTURE.\n\n ...,Amends the Water Resources Development Act of ...,To make technical corrections to the Water Res...
1,That this Act may be cited as the ``Federal Fo...,Federal Forage Fee Act of 1993 - Subjects graz...,Federal Forage Fee Act of 1993
2,SECTION 1. SHORT TITLE.\n\n This Act may be...,. Merchant Marine of World War II Congression...,Merchant Marine of World War II Congressional ...
3,SECTION 1. SHORT TITLE.\n\n This Act may be...,Small Business Modernization Act of 2004 - Ame...,To amend the Internal Revenue Code of 1986 to ...
4,SECTION 1. SHORT TITLE.\n\n This Act may be...,Fair Access to Investment Research Act of 2016...,Fair Access to Investment Research Act of 2016
...,...,...,...
3264,SECTION 1. PLACEMENT PROGRAMS FOR FEDERAL EMPL...,Public Servant Priority Placement Act of 1995 ...,Public Servant Priority Placement Act of 1995
3265,SECTION 1. SHORT TITLE.\n\n This Act may be...,Sportsmanship in Hunting Act of 2008 - Amends ...,"A bill to amend title 18, United States Code, ..."
3266,SECTION 1. SHORT TITLE.\n\n This Act may be...,Helping College Students Cross the Finish Line...,Helping College Students Cross the Finish Line...
3267,SECTION 1. SHORT TITLE.\n\n This Act may be...,Makes proceeds from such conveyances available...,Texas National Forests Improvement Act of 2000


Let's take a look at a particular row in the 'text' column:

In [3]:
"{:.1000}".format(df["text"][0])

"SECTION 1. ENVIRONMENTAL INFRASTRUCTURE.\n\n    (a) Jackson County, Mississippi.--Section 219 of the Water \nResources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is \namended--\n        (1) in subsection (c), by striking paragraph (5) and inserting \n    the following:\n        ``(5) Jackson county, mississippi.--Provision of an alternative \n    water supply and a project for the elimination or control of \n    combined sewer overflows for Jackson County, Mississippi.''; and\n        (2) in subsection (e)(1), by striking ``$10,000,000'' and \n    inserting ``$20,000,000''.\n    (b) Manchester, New Hampshire.--Section 219(e)(3) of the Water \nResources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is \namended by striking ``$10,000,000'' and inserting ``$20,000,000''.\n    (c) Atlanta, Georgia.--Section 219(f)(1) of the Water Resources \nDevelopment Act of 1992 (106 Stat. 4835; 113 Stat. 335) is amended by \nstriking ``$25,000,000 for''.\n    (d) Paterson, Pas

We can see that the text contains a lot of special characters, such as line breaks (\n) and dashes, as well as code references in parentheses (e.g. '(U.S.C. 1087kk)') that we do not need for a summary. So, we get rid of these characters using regular expressions (regex).

In [4]:
import re

# first, get rid of line breaks
df["text"] = [re.sub("\n", "", df["text"][row]) for row in range(len(df))]
"{:.1000}".format(df["text"][0])

"SECTION 1. ENVIRONMENTAL INFRASTRUCTURE.    (a) Jackson County, Mississippi.--Section 219 of the Water Resources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is amended--        (1) in subsection (c), by striking paragraph (5) and inserting     the following:        ``(5) Jackson county, mississippi.--Provision of an alternative     water supply and a project for the elimination or control of     combined sewer overflows for Jackson County, Mississippi.''; and        (2) in subsection (e)(1), by striking ``$10,000,000'' and     inserting ``$20,000,000''.    (b) Manchester, New Hampshire.--Section 219(e)(3) of the Water Resources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is amended by striking ``$10,000,000'' and inserting ``$20,000,000''.    (c) Atlanta, Georgia.--Section 219(f)(1) of the Water Resources Development Act of 1992 (106 Stat. 4835; 113 Stat. 335) is amended by striking ``$25,000,000 for''.    (d) Paterson, Passaic County, and Passaic Valley, New

So what's going on here?

The first argument to re.sub() is the pattern you are searching for (in this case '\n'). The second argument is the string that replaces each match (we just want to remove line breaks, so we pass an empty string). The third argument is the string that the operation is performed on.
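As a small illustration of that argument order, here is a minimal sketch on a made-up string (the `raw` value is hypothetical, not taken from the dataset):

```python
import re

# Hypothetical sample containing two line breaks.
raw = "Water \nResources Development Act\n"

# re.sub(pattern, replacement, string)
cleaned = re.sub(r"\n", "", raw)

print(repr(cleaned))
```

On a whole column, pandas' `Series.str.replace` can do the same thing without a Python-level loop; the list comprehension used here does the same job, one row at a time.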

In [5]:
# Now moving onto parentheses and characters within
df["text"] = [re.sub(r'\([^)]*\)', '', df["text"][row]) for row in range(len(df))]
# Removing dashes
df["text"] = [re.sub("--", '', df["text"][row]) for row in range(len(df))]
# Removing extra spaces
df["text"] = [re.sub(r'\s+', ' ', df["text"][row]) for row in range(len(df))]

"{:.1000}".format(df["text"][0])

"SECTION 1. ENVIRONMENTAL INFRASTRUCTURE. Jackson County, Mississippi.Section 219 of the Water Resources Development Act of 1992 is amended in subsection , by striking paragraph and inserting the following: `` Jackson county, mississippi.Provision of an alternative water supply and a project for the elimination or control of combined sewer overflows for Jackson County, Mississippi.''; and in subsection , by striking ``$10,000,000'' and inserting ``$20,000,000''. Manchester, New Hampshire.Section 219 of the Water Resources Development Act of 1992 is amended by striking ``$10,000,000'' and inserting ``$20,000,000''. Atlanta, Georgia.Section 219 of the Water Resources Development Act of 1992 is amended by striking ``$25,000,000 for''. Paterson, Passaic County, and Passaic Valley, New Jersey.Section 219 of the Water Resources Development Act of 1992 is amended by striking ``$20,000,000 for''. Elizabeth and North Hudson, New Jersey.Section 219 of the Water Resources Development Act of 1992 

In the first re.sub() call, we write the pattern as a raw string, r'...', which tells Python not to interpret backslashes inside it as Python escape sequences, so they reach the regex engine unchanged. We still need a backslash ('\') in front of characters that have a special meaning in regex, such as round brackets, when we want to match them literally.

' \( ' looks for an opening round bracket.
' [^)]* ' looks for zero or more characters other than a closing round bracket, coming after the opening round bracket.
' \) ' looks for a closing round bracket.
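To see the three pieces working together, here is a minimal sketch on a shortened, hand-written sample (the `sample` string is an assumption modelled on the bill text). It also tries a variant that is not what the cell above does: replacing '--' with a space instead of an empty string, which keeps 'Mississippi.' and 'Section' from being fused into one token.

```python
import re

# Hypothetical sample modelled on the bill text.
sample = "(a) Jackson County, Mississippi.--Section 219 (106 Stat. 4835) is amended--"

# \(      an opening round bracket
# [^)]*   zero or more characters that are not a closing round bracket
# \)      a closing round bracket
no_parens = re.sub(r"\([^)]*\)", "", sample)

# Variant (assumption): replace dashes with a space rather than nothing.
spaced = re.sub(r"--", " ", no_parens)

# Collapse runs of whitespace and trim the ends.
cleaned = re.sub(r"\s+", " ", spaced).strip()

print(cleaned)
```

Whether to use a space or an empty string is a judgment call; the space mainly matters later, when the text is split into sentences and words.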

Regex is a complicated tool, so I suggest reading up on it yourself (or just searching for it on Slack every time you use it lol).

Now moving on to tokenizing, stemming, and stopwords:

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize
# tokenize; first try one row 
test_token = sent_tokenize(df["text"][0])
test_token[0:10]

['SECTION 1.',
 'ENVIRONMENTAL INFRASTRUCTURE.',
 'Jackson County, Mississippi.Section 219 of the Water Resources Development Act of 1992 is amended in subsection , by striking paragraph and inserting the following: `` Jackson county, mississippi.Provision of an alternative water supply and a project for the elimination or control of combined sewer overflows for Jackson County, Mississippi.',
 "''; and in subsection , by striking ``$10,000,000'' and inserting ``$20,000,000''.",
 "Manchester, New Hampshire.Section 219 of the Water Resources Development Act of 1992 is amended by striking ``$10,000,000'' and inserting ``$20,000,000''.",
 "Atlanta, Georgia.Section 219 of the Water Resources Development Act of 1992 is amended by striking ``$25,000,000 for''.",
 "Paterson, Passaic County, and Passaic Valley, New Jersey.Section 219 of the Water Resources Development Act of 1992 is amended by striking ``$20,000,000 for''.",
 "Elizabeth and North Hudson, New Jersey.Section 219 of the Water Reso

We tokenize by sentence first rather than by word, because we plan to use tf-idf to summarize our text. For tf-idf, we need the frequency of each word within each sentence, so the sentence boundaries have to be kept.
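To make the per-sentence idea concrete, here is a rough sketch of one common tf-idf variant, treating each sentence as a "document". The token lists, the exact formula (term frequency times log of inverse sentence frequency), and the idea of scoring a sentence by summing its word weights are all assumptions for illustration, not something this notebook has defined yet.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized sentences (one inner list per sentence).
sentences = [
    ["water", "supply", "project"],
    ["water", "act", "amended"],
    ["act", "amended", "act"],
]

n_sentences = len(sentences)

# Sentence frequency: in how many sentences does each word appear?
doc_freq = Counter(word for sent in sentences for word in set(sent))

def tf_idf(sentence):
    """Return {word: tf * idf} for one sentence."""
    counts = Counter(sentence)
    return {
        word: (count / len(sentence)) * math.log(n_sentences / doc_freq[word])
        for word, count in counts.items()
    }

# One simple way to rank sentences: sum the weights of their words.
scores = [sum(tf_idf(sent).values()) for sent in sentences]
print(scores)
```

Words that appear in every sentence get a weight of zero under this formula, which is why rare words end up dominating the scores.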

In [7]:
# stemming and stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

ps = PorterStemmer()
processed_sentence = []
stop_words = set(stopwords.words('english'))

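for sentence in test_token:
    words = word_tokenize(sentence)
    for word in words:
        word = word.lower()
        # check the stop-word list before stemming, since stemming can
        # alter a stop word (e.g. 'this') so that it no longer matches
        if word not in stop_words:
            processed_sentence.append(ps.stem(word))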
    
processed_sentence[0:10]

['section',
 '1',
 '.',
 'environment',
 'infrastructur',
 '.',
 'jackson',
 'counti',
 ',',
 'mississippi.sect']

We stem the words so that words with the same root but different tenses or endings are weighted together. Stopwords are common words that carry little meaning on their own, such as 'and' and 'or', so we remove them.
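The loop above collects every token into one flat list. Since tf-idf is meant to work per sentence, it may be more convenient to keep one token list per sentence. The sketch below shows that shape only; to stay dependency-free it uses a tiny hand-written stop-word set and a regex in place of NLTK's stop-word list and word_tokenize, and it leaves stemming out (all of these are stand-ins, not the notebook's method).

```python
import re

# Tiny illustrative stop-word set (assumption; NLTK's English list is much longer).
STOP_WORDS = {"the", "of", "and", "or", "is", "in", "by", "this"}

def preprocess(sentences):
    """Return one list of lowercase, non-stop-word tokens per sentence."""
    processed = []
    for sentence in sentences:
        # Stand-in tokenizer: runs of letters or digits, punctuation dropped.
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        processed.append([tok for tok in tokens if tok not in STOP_WORDS])
    return processed

example = [
    "This Act is amended.",
    "Section 219 of the Water Resources Development Act.",
]
print(preprocess(example))
```

With NLTK available, a stemming call such as ps.stem(tok) would slot into the list comprehension after the stop-word check.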