# Scraping lecture

We have some information about pages we want to scrape in a file called `bills.json`. The ultimate goal is to download the full text of each bill and count the number of words.

## Import modules

In [1]:
# parse json file
import json

# what we need for scraping
import requests # request HTTP
from bs4 import BeautifulSoup # parse HTML

# helpful modules for cleaning up text
import re # regex
import string

# good ole pandas to structure our data
import pandas as pd

## Bring in the data

In [2]:
with open('bills.json') as file:
    bills = json.load(file)

I've commented out the below code because a lot of text gets printed out; watch the lecture screen to view the results.

In [4]:
# # this is a way to 'pretty-print' a JSON file
# print(json.dumps(bills, indent=2))

In [6]:
len(bills)

40

## Start with a test page

We'll start with the first item in the `bills` list.

In [5]:
test_bill = bills[0]
test_bill

{'congress': 116,
 'chamber': 'house',
 'bill_url': 'https://www.congress.gov/bill/116th-congress/house-bill/133/text?r=1&s=3',
 'bill_number': 133}

Create a variable called `test_url` that gets the value of `bill_url` from `test_bill`:

In [7]:
test_url = test_bill['bill_url']
test_url

'https://www.congress.gov/bill/116th-congress/house-bill/133/text?r=1&s=3'

Before we download this page, let's look at its HTML and see if we can find where the bill text lives.

### Request the url

In [8]:
test_page = requests.get(test_url)

### Save the HTML so we don't have to re-download it later

If you're going to scrape tens or hundreds or thousands of URLs, it could be helpful to save the HTML so you don't have to re-download thousands of pages later. I don't want to clutter up this coding folder so I'm going to create a new directory to save all these pages.

One very cool thing about Jupyter notebooks is that you can execute some basic terminal commands by using an exclamation point. Below, I'm going to create a new directory called 'pages'. When you use the `-p` flag, you won't get an error if the directory already exists.

In [11]:
!mkdir -p pages

I included `pages` in your `.gitignore` file, which means it'll save on your hard drive but it won't be pushed to git.

In [None]:
# save the test page so i don't have to dl again
with open('pages/test_page.html', 'w', encoding='utf-8') as file:
    file.write(test_page.text)
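When you scale this up to many pages, one possible approach is to check for a saved copy before downloading. The sketch below is only an illustration: the helper names `page_path` and `load_or_download`, the filename pattern, and the `download` argument (any function that takes a URL and returns HTML) are all assumptions, not part of the lecture code.

```python
from pathlib import Path


def page_path(bill, pages_dir='pages'):
    """Build a file path such as pages/116_house_133.html for one bill (hypothetical naming)."""
    filename = f"{bill['congress']}_{bill['chamber']}_{bill['bill_number']}.html"
    return Path(pages_dir) / filename


def load_or_download(bill, download, pages_dir='pages'):
    """Return the page HTML, calling `download(url)` only if no saved copy exists."""
    path = page_path(bill, pages_dir)
    if path.exists():
        # Reuse the copy saved on an earlier run
        return path.read_text(encoding='utf-8')
    html = download(bill['bill_url'])
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding='utf-8')
    return html
```

In the notebook, you could call it as `load_or_download(test_bill, lambda url: requests.get(url).text)`; a second call with the same bill would read from disk instead of hitting the website again.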

### Parse the test page with Beautiful Soup

We'll use Python's built-in HTML parser (`html.parser`), which Beautiful Soup supports out of the box. Parsing the page lets us search for nested elements.

In [12]:
test_soup = BeautifulSoup(test_page.text, features='html.parser')

_The following code is commented out because it outputs a LOT of text. Look at the screen during lecture to see what the output looks like._

In [14]:
# test_soup

In [15]:
type(test_soup)

bs4.BeautifulSoup

### Find and get what's inside `id='billTextContainer'`
Because we know that all of a bill's text is contained within an element with the ID of 'billTextContainer', we can use bs4's `.find(id='')` method:

In [16]:
bill_text_container = test_soup.find(id='billTextContainer')

Remember that the result is still a bs4 object, not a plain string:

In [17]:
type(bill_text_container)

bs4.element.Tag
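One thing to keep in mind: `.find()` returns `None` when no element matches, so calling `.get_text()` on the result would raise an error on a page that lacks the container. Here is a minimal sketch using a made-up miniature page (`sample_html` is an invented stand-in, not the real congress.gov HTML):

```python
from bs4 import BeautifulSoup

# A made-up miniature page, standing in for the real HTML
sample_html = """
<html><body>
  <div id="billTextContainer"><p>Be it enacted</p></div>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, features='html.parser')

found = sample_soup.find(id='billTextContainer')   # a bs4 Tag
missing = sample_soup.find(id='noSuchId')          # None, because nothing matches

print(type(found))
print(missing)

# Check for None before calling methods on the result
if missing is None:
    print('No element with that id on this page')
```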

_The following code is commented out because it outputs a LOT of text. Look at the screen during lecture to see what the output looks like._

In [19]:
# bill_text_container

If we want to extract only the text, we'll use the bs4 method `.get_text()`:

In [20]:
bill_text = bill_text_container.get_text()

_The following code is commented out because it outputs a LOT of text. Look at the screen during lecture to see what the output looks like._

In [22]:
# bill_text

What is the type of `bill_text`?

In [23]:
type(bill_text)

str

In [24]:
len(bill_text)

7692027

In [28]:
test_string = "The lazy fox jumps over the quick dog."

In [29]:
test_string[0:2]

'Th'

In [27]:
# Show first 500 characters of the string bill_text
bill_text[0:500]

'[116th Congress Public Law 260]\n[From the U.S. Government Publishing Office]\n\n\n\n[[Page 1181]]\n\n                  CONSOLIDATED APPROPRIATIONS ACT, 2021\n\n                                     \n\n                                     \n\n                                     \n\n                                     \n\n\n\n__________\n\n    * Editorial note: Part 1 contains pages 134 Stat. 1182 through 134 \nStat. 2247. See note at the end.\n\n\n[[Page 134 STAT. 1182]]\n\nPublic Law 116-260\n116th Congress\n\n           '

In [32]:
# Show last 500 characters of the string bill_text
bill_text[-500:]

'                                                        Vol. 166 (2020):\n                                    Jan. 15, considered and passed \n                                        Senate, amended.\n                                    Dec. 21, House concurred in Senate \n                                        amendment with an amendment. \n                                        Senate concurred in House \n                                        amendment.\n\n                                  <all>\n\n'

### Clean up `bill_text`

The text is pretty messy. We want to:
- replace punctuation with spaces
- replace newlines with spaces (`\n` means "newline")
- replace 2+ spaces with 1 space

#### Replace punctuation with space

In [33]:
# got the code from here: https://stackoverflow.com/a/37221663
punctuation_table = str.maketrans({key: ' ' for key in string.punctuation})
bill_text_cleaned = bill_text.translate(punctuation_table)  

Read more about these string methods in the Python documentation:

- [str.maketrans()](https://docs.python.org/3/library/stdtypes.html#str.maketrans)
- [str.translate()](https://docs.python.org/3/library/stdtypes.html#str.translate)

Here are the first 500 characters after replacing punctuation:

In [41]:
print(bill_text_cleaned[0:500])

 116th Congress Public Law 260 
 From the U S  Government Publishing Office 



  Page 1181  

                  CONSOLIDATED APPROPRIATIONS ACT  2021

                                     

                                     

                                     

                                     



          

      Editorial note  Part 1 contains pages 134 Stat  1182 through 134 
Stat  2247  See note at the end 


  Page 134 STAT  1182  

Public Law 116 260
116th Congress

           


In [39]:
asdf = "Text \n asdf"
print(asdf)

Text 
 asdf


In [40]:
asdf2 = "Text \\n asdf"
print(asdf2)

Text \n asdf


#### Replace newlines with space

In [42]:
bill_text_cleaned = re.sub('\\n', ' ', bill_text_cleaned)

In [43]:
bill_text_cleaned[0:500]

' 116th Congress Public Law 260   From the U S  Government Publishing Office       Page 1181                      CONSOLIDATED APPROPRIATIONS ACT  2021                                                                                                                                                                                  Editorial note  Part 1 contains pages 134 Stat  1182 through 134  Stat  2247  See note at the end      Page 134 STAT  1182    Public Law 116 260 116th Congress             '

#### Replace multiple spaces with one space

In [44]:
bill_text_cleaned = re.sub(r'\s{2,}', ' ', bill_text_cleaned)

In [45]:
bill_text_cleaned[0:500]

' 116th Congress Public Law 260 From the U S Government Publishing Office Page 1181 CONSOLIDATED APPROPRIATIONS ACT 2021 Editorial note Part 1 contains pages 134 Stat 1182 through 134 Stat 2247 See note at the end Page 134 STAT 1182 Public Law 116 260 116th Congress An Act Making consolidated appropriations for the fiscal year ending September 30 2021 providing coronavirus emergency response and relief and for other purposes NOTE Dec 27 2020 H R 133 Be it enacted by the Senate and House of Repres'

What are some problems you see in the final `bill_text_cleaned`? Do you think it's OK for the purposes of this project?
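If you end up repeating these cleaning steps for many pages, they could be folded into one helper. This is a sketch, not the lecture's own code: the name `clean_text` is hypothetical, and it collapses the "newlines" and "multiple spaces" steps into a single regex, since `\s` already matches newlines.

```python
import re
import string

# Map every punctuation character to a space (same idea as punctuation_table)
PUNCTUATION_TABLE = str.maketrans({key: ' ' for key in string.punctuation})


def clean_text(text):
    """Replace punctuation with spaces, then collapse every run of whitespace."""
    text = text.translate(PUNCTUATION_TABLE)
    # \s matches spaces, tabs, and newlines, so one pattern covers
    # both the newline step and the multiple-spaces step
    return re.sub(r'\s+', ' ', text)


sample = '[116th Congress Public Law 260]\n\n   Public Law 116-260'
print(clean_text(sample))
```

Note that the leading space survives, just as it does in `bill_text_cleaned`; that is harmless for counting words.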

### Word count

#### Get the word count

You can get the word count of a string by splitting it. By default, `str.split()` splits on any run of whitespace (spaces, tabs, newlines), leaving you with a list of words. The length of that list, from `len()`, is the number of words in the string.

In [46]:
test_string = 'This class ends at 9 pm.'

In [53]:
type(test_string)

str

In [55]:
test_list = test_string.split()
test_list

['This', 'class', 'ends', 'at', '9', 'pm.']

In [56]:
type(test_list)

list

In [58]:
len(test_list)

6

In [51]:
bill_word_count = len(bill_text_cleaned.split())

In [52]:
bill_word_count

967689
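To see why calling `split()` with no argument matters, here is a small comparison on an invented string (`messy` is just an example value):

```python
messy = 'Public  Law\n116   260'

# No argument: split on any run of whitespace (spaces, tabs, newlines)
print(messy.split())      # ['Public', 'Law', '116', '260']

# Explicit single space: every space is a separator, so empty strings
# appear and the newline is not treated as a break
print(messy.split(' '))
```

Because the no-argument form ignores repeated whitespace, the word count comes out the same whether or not the extra spaces were collapsed first.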


#### Create the dataframe

Let's make a pandas dataframe where we can save the word count.

The neat thing about `bills` is that it's already structured in a way that makes it very easy to create a dataframe. It's a list of dictionaries that only have one level. (If this doesn't sound familiar to you, you might want to brush up on [lists and dictionaries in the Python documentation](https://docs.python.org/3/tutorial/datastructures.html).)

In [60]:
# bills

In [61]:
bills_df = pd.DataFrame(bills)
bills_df

| | congress | chamber | bill_url | bill_number |
|---|---|---|---|---|
| 0 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 133 |
| 1 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 150 |
| 2 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 251 |
| 3 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 259 |
| 4 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 263 |
| 5 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 266 |
| 6 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 276 |
| 7 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 299 |
| 8 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 430 |
| 9 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 434 |


##### Create a new column, method 1

In [62]:
new_columns = list(bills_df.columns) + ['word_count']
new_columns

['congress', 'chamber', 'bill_url', 'bill_number', 'word_count']

In [63]:
bills_df = bills_df.reindex(columns=list(bills_df.columns) + ['word_count'])

In [64]:
bills_df

| | congress | chamber | bill_url | bill_number | word_count |
|---|---|---|---|---|---|
| 0 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 133 | |
| 1 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 150 | |
| 2 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 251 | |
| 3 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 259 | |
| 4 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 263 | |
| 5 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 266 | |
| 6 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 276 | |
| 7 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 299 | |
| 8 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 430 | |
| 9 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 434 | |


##### Create a new column, method 2
You need to `import numpy as np` for this but it's easier!

In [65]:
import numpy as np
bills_df['word_count'] = np.nan

In [66]:
bills_df

| | congress | chamber | bill_url | bill_number | word_count |
|---|---|---|---|---|---|
| 0 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 133 | |
| 1 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 150 | |
| 2 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 251 | |
| 3 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 259 | |
| 4 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 263 | |
| 5 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 266 | |
| 6 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 276 | |
| 7 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 299 | |
| 8 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 430 | |
| 9 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 434 | |


In [69]:
bills_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   congress     40 non-null     int64  
 1   chamber      40 non-null     object 
 2   bill_url     40 non-null     object 
 3   bill_number  40 non-null     int64  
 4   word_count   0 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.7+ KB


In [70]:
bills_df['bill_number'].nunique()

40

In [67]:
bills[0]

{'congress': 116,
 'chamber': 'house',
 'bill_url': 'https://www.congress.gov/bill/116th-congress/house-bill/133/text?r=1&s=3',
 'bill_number': 133}

#### Save the word count

How do I update Bill 133's 'word_count'? 

You'll use `df.loc`:

```python
df.loc[subset_expression, 'column_to_change'] = new_value
```
In effect, you're subsetting the dataframe and applying a value to a column.

In the below code, we subset for rows where 'bill_number' is 133: `bills_df['bill_number'] == 133`


In [71]:
bills_df.loc[bills_df['bill_number'] == 133, 'word_count'] = bill_word_count

In [68]:
bills_df[bills_df['bill_number'] == 133]

| | congress | chamber | bill_url | bill_number | word_count |
|---|---|---|---|---|---|
| 0 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 133 | |


In [72]:
bills_df.head()

| | congress | chamber | bill_url | bill_number | word_count |
|---|---|---|---|---|---|
| 0 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 133 | 967689.0 |
| 1 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 150 | |
| 2 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 251 | |
| 3 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 259 | |
| 4 | 116 | house | https://www.congress.gov/bill/116th-congress/h... | 263 | |
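To make the `df.loc` pattern easier to see, here is the same idea on a tiny made-up dataframe (`demo_df` and its three rows are invented for illustration):

```python
import pandas as pd

# A made-up three-row stand-in for bills_df
demo_df = pd.DataFrame({'bill_number': [133, 150, 251]})
demo_df['word_count'] = float('nan')   # empty column, as in method 2

# The boolean mask is True only for the row(s) where bill_number is 133
mask = demo_df['bill_number'] == 133
print(mask.tolist())

# Assign the new value to 'word_count' for just the rows the mask selects
demo_df.loc[mask, 'word_count'] = 967689
print(demo_df)
```

Only the row matching the mask is updated; the other rows keep their missing values.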


## Time for a loop

We wrote all the code for ONE test page. But we have more than one item in `bills`.

### How do we loop through bills?

In [75]:
for bill in bills:
    pass
    # print(bill['bill_number'])

At this point, it'll be useful to check out the Table of Contents of this notebook in JupyterLab. What are the steps we need to take?

- Request the URL
- Save the HTML of the URL
- Parse the page with bs4
- Find and get what's inside `id='billTextContainer'`
- Clean up the bill text
  - Replace punctuation with space
  - Replace newlines with space
  - Replace multiple spaces with one space
- Get the word count
- Save the word count in the dataframe

We're going to switch up a couple things though. The following steps only need to be done once, so they should be executed BEFORE we go through the loop.
- Create the folder for saving all the HTML
- Create the dataframe to save all the information
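As a rough preview of how these steps might fit together (the classwork notebook may organize things differently), here is one possible skeleton. The function names `count_words_in_page` and `scrape_bills`, the `fetch` argument, and the filename pattern are all assumptions made for this sketch; it also returns a plain dictionary of word counts rather than writing into the dataframe.

```python
import string
from pathlib import Path

from bs4 import BeautifulSoup

PUNCTUATION_TABLE = str.maketrans({key: ' ' for key in string.punctuation})


def count_words_in_page(html):
    """Return the word count of the bill text in one page, or None if it is missing."""
    soup = BeautifulSoup(html, features='html.parser')
    container = soup.find(id='billTextContainer')
    if container is None:
        return None
    text = container.get_text().translate(PUNCTUATION_TABLE)
    # split() with no argument already ignores newlines and repeated spaces
    return len(text.split())


def scrape_bills(bills, fetch, pages_dir='pages'):
    """Loop over bills; `fetch` is any function that takes a URL and returns HTML."""
    # One-time setup, done BEFORE the loop
    Path(pages_dir).mkdir(parents=True, exist_ok=True)
    word_counts = {}
    for bill in bills:
        html = fetch(bill['bill_url'])                              # request the URL
        page_file = Path(pages_dir) / f"{bill['bill_number']}.html"
        page_file.write_text(html, encoding='utf-8')                # save the HTML
        word_counts[bill['bill_number']] = count_words_in_page(html)
    return word_counts
```

With `requests` imported, a call might look like `scrape_bills(bills, fetch=lambda url: requests.get(url).text)`. Passing the download function in as an argument also makes it easy to try the loop on fake HTML before hitting the real site.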

We'll write the loop in a new notebook for classwork: [`scraping_classwork.ipynb`](scraping_classwork.ipynb).

But before we do, I want to introduce you to another Python module that is really helpful when you're scraping: `tqdm`.

## tqdm

You can wrap `tqdm()` around any iterable (list, array, etc.) to create a progress bar.

In [76]:
from tqdm.notebook import tqdm
from time import sleep # sleep() pauses the loop so we can watch the bar move

In [77]:
for n in tqdm(range(20)):
    sleep(0.2)

  0%|          | 0/20 [00:00<?, ?it/s]