## Congressional Committees Links Scraper

Write a scraper <a href="https://www.congress.gov/committees">for this page</a> that gathers all the standing committee names and related links.

In the near future, you will write code that follows those links and scrapes data from each link destination.

**Note**: this assignment will require you to use several skills covered in this course so far:

- ```List slicing```
- ```for loops``` or ```list comprehension```
- ```BeautifulSoup``` scraping techniques

Most importantly, it will call on you to conceptualize and apply these techniques to the current goal.



### Scrape plan:

1 - Explore the target data using the browser inspector.

2 - Notice which element holds the two columns we want.

3 - Write the scraper



In [1]:
## import libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
## url to scrape
url = "https://www.congress.gov/committees"

In [3]:
## capture response
response = requests.get(url)
response.status_code

200

In [4]:
## make some soup
soup = BeautifulSoup(response.text, "html.parser")
soup

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>Committees of the U.S. Congress | Congress.gov | Library of Congress</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="http://www.congress.gov/" name="canonical"/>
<meta content="http://www.congress.gov/" name="dc.identifier"/>
<meta content="eng" name="dc.language"/>
<meta content="Text is government work" name="dc.rights"/>
<meta content="U.S. Congress Committees" name="dc.subject"/>
<meta content="Legislative Data" name="dc.subject"/>
<meta content="Congress" name="dc.subject"/>
<meta content="Committees of the U.S. Congress" name="dc.title"/>
<meta content="legislation" name="dc.type"/>
<meta content="webpage" name="dc.type"/>
<meta content="Congress.gov covers the activities of the standing committees of the House and Senate, which provide legislative, oversight and administrative services." name="description"/>
<meta content="width=device-width,initial-s

In [5]:
## target the class "plain" or the "margin7" class which holds ALL the committees. 
## No need to target both.

all_comms = soup.find_all("ul", class_="plain")
all_comms

[<ul class="plain margin7">
 <li><a href="/committee/house-agriculture/hsag00">Agriculture</a></li>
 <li><a href="/committee/house-appropriations/hsap00">Appropriations</a></li>
 <li><a href="/committee/house-armed-services/hsas00">Armed Services</a></li>
 <li><a href="/committee/house-budget/hsbu00">Budget</a></li>
 <li><a href="/committee/house-education-and-labor/hsed00">Education and Labor</a></li>
 <li><a href="/committee/house-energy-and-commerce/hsif00">Energy and Commerce</a></li>
 <li><a href="/committee/house-ethics/hsso00">Ethics</a></li>
 <li><a href="/committee/house-financial-services/hsba00">Financial Services</a></li>
 <li><a href="/committee/house-foreign-affairs/hsfa00">Foreign Affairs</a></li>
 <li><a href="/committee/house-homeland-security/hshm00">Homeland Security</a></li>
 <li><a href="/committee/committee-on-house-administration/hsha00">House Administration</a></li>
 <li><a href="/committee/house-judiciary/hsju00">Judiciary</a></li>
 <li><a href="/committee/hous

In [7]:
## check the length
## note that it holds MORE than we actually want
len(all_comms)

7

In [6]:
## type of data
type(all_comms)

bs4.element.ResultSet
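
A `ResultSet` can be indexed, sliced and looped over like a regular list. The offline sketch below illustrates this on a small inline HTML snippet rather than the live page. The first `<ul>` reuses two items from the output above; the second `<ul>`, its link, and the `demo_` variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the live page.
# The second <ul> and its link are hypothetical placeholders.
html = """
<ul class="plain margin7">
  <li><a href="/committee/house-agriculture/hsag00">Agriculture</a></li>
  <li><a href="/committee/house-budget/hsbu00">Budget</a></li>
</ul>
<ul class="plain">
  <li><a href="/example/path">Example</a></li>
</ul>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# class_="plain" matches any tag whose class list includes "plain",
# so both <ul> tags are returned.
demo_lists = demo_soup.find_all("ul", class_="plain")

# ResultSet is a subclass of list, so list operations work on it.
is_list_like = isinstance(demo_lists, list)

# Each item is a <ul> tag; searching inside it gives that list's links.
first_names = [a.get_text() for a in demo_lists[0].find_all("a")]

print(len(demo_lists), is_list_like, first_names)
```

Because each match is its own `<ul>` tag, getting at the individual links takes a second `find_all` on every item.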

## Things to realize:

1. ```find_all``` returned a list-like ```ResultSet``` with one ```<ul>``` element per match, because the same class is used in multiple places on the page. Each ```<ul>``` holds its own list of links, so in effect we have a list of lists (nested lists).
2. At some point, we'll have to iterate through the large enclosing list to isolate each nested list, and then iterate through each nested list to grab what it holds. Basically we have to ```for loop``` within a ```for loop```.

The next cell just illustrates that when we ```for loop``` through the enclosing list, it isolates each nested list. I have separated each nested list with ```"********"```.

In [8]:
## Here you go:
for item in all_comms:
    print(item.find_all("a"))
    print(type(item.find_all("a")))
    print("********")

[<a href="/committee/house-agriculture/hsag00">Agriculture</a>, <a href="/committee/house-appropriations/hsap00">Appropriations</a>, <a href="/committee/house-armed-services/hsas00">Armed Services</a>, <a href="/committee/house-budget/hsbu00">Budget</a>, <a href="/committee/house-education-and-labor/hsed00">Education and Labor</a>, <a href="/committee/house-energy-and-commerce/hsif00">Energy and Commerce</a>, <a href="/committee/house-ethics/hsso00">Ethics</a>, <a href="/committee/house-financial-services/hsba00">Financial Services</a>, <a href="/committee/house-foreign-affairs/hsfa00">Foreign Affairs</a>, <a href="/committee/house-homeland-security/hshm00">Homeland Security</a>, <a href="/committee/committee-on-house-administration/hsha00">House Administration</a>, <a href="/committee/house-judiciary/hsju00">Judiciary</a>, <a href="/committee/house-natural-resources/hsii00">Natural Resources</a>, <a href="/committee/house-oversight-and-reform/hsgo00">Oversight and Reform</a>, <a href=

In [9]:
## the a tags hold everything we need – the names and the links.
a_tags_list = [item.find_all("a") for item in all_comms]
a_tags_list

[[<a href="/committee/house-agriculture/hsag00">Agriculture</a>,
  <a href="/committee/house-appropriations/hsap00">Appropriations</a>,
  <a href="/committee/house-armed-services/hsas00">Armed Services</a>,
  <a href="/committee/house-budget/hsbu00">Budget</a>,
  <a href="/committee/house-education-and-labor/hsed00">Education and Labor</a>,
  <a href="/committee/house-energy-and-commerce/hsif00">Energy and Commerce</a>,
  <a href="/committee/house-ethics/hsso00">Ethics</a>,
  <a href="/committee/house-financial-services/hsba00">Financial Services</a>,
  <a href="/committee/house-foreign-affairs/hsfa00">Foreign Affairs</a>,
  <a href="/committee/house-homeland-security/hshm00">Homeland Security</a>,
  <a href="/committee/committee-on-house-administration/hsha00">House Administration</a>,
  <a href="/committee/house-judiciary/hsju00">Judiciary</a>,
  <a href="/committee/house-natural-resources/hsii00">Natural Resources</a>,
  <a href="/committee/house-oversight-and-reform/hsgo00">Oversig

In [10]:
## we still have lists within a big list.
## let's check what one nested list holds
a_tags_list[0]

[<a href="/committee/house-agriculture/hsag00">Agriculture</a>,
 <a href="/committee/house-appropriations/hsap00">Appropriations</a>,
 <a href="/committee/house-armed-services/hsas00">Armed Services</a>,
 <a href="/committee/house-budget/hsbu00">Budget</a>,
 <a href="/committee/house-education-and-labor/hsed00">Education and Labor</a>,
 <a href="/committee/house-energy-and-commerce/hsif00">Energy and Commerce</a>,
 <a href="/committee/house-ethics/hsso00">Ethics</a>,
 <a href="/committee/house-financial-services/hsba00">Financial Services</a>,
 <a href="/committee/house-foreign-affairs/hsfa00">Foreign Affairs</a>,
 <a href="/committee/house-homeland-security/hshm00">Homeland Security</a>,
 <a href="/committee/committee-on-house-administration/hsha00">House Administration</a>,
 <a href="/committee/house-judiciary/hsju00">Judiciary</a>,
 <a href="/committee/house-natural-resources/hsii00">Natural Resources</a>,
 <a href="/committee/house-oversight-and-reform/hsgo00">Oversight and Reform<

In [11]:
## how many are there?
len(a_tags_list)

7

In [12]:
## always check the datatype
type(a_tags_list[0])

bs4.element.ResultSet

## Things to realize:

1. The links are relative links (partial). We'll need to convert them to full absolute paths.
2. We need to target just two of the lists within the large list, since the assignment calls for the main Senate and House committees.
3. We'll have to ```for loop``` within a ```for loop``` to capture what we need.
4. As a bonus, I wrote a conditional to add a column that tells us which branch of Congress the committee is in, based on info in the url itself (each url says either "house" or "senate").
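
Points 1 and 4 can be tried out without touching the live page. The sketch below is one possible approach, not the only one: it uses the standard library's `urljoin` instead of string concatenation, and adds an `"other"` fallback for urls that mention neither chamber. The helper names `to_absolute` and `branch_from_url`, and the senate and joint paths in the usage lines, are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.congress.gov"


def to_absolute(href):
    """Turn a relative link such as /committee/... into a full url."""
    return urljoin(BASE_URL, href)


def branch_from_url(url):
    """Guess the branch from the words in the url itself."""
    if "senate" in url:
        return "senate"
    if "house" in url:
        return "house"
    # Assumption: anything else is labeled "other" instead of defaulting to house.
    return "other"


# A real href from the output above.
house_link = to_absolute("/committee/house-agriculture/hsag00")

# Hypothetical paths, used only to exercise the other two outcomes.
senate_link = to_absolute("/committee/senate-example/xxxx00")
joint_link = to_absolute("/committee/joint-example/xxxx00")

print(house_link, branch_from_url(house_link))
print(branch_from_url(senate_link), branch_from_url(joint_link))
```

The explicit `"other"` case matters only if the slice ever includes lists beyond the first two; with just the House and Senate lists, a plain `if`/`else` gives the same labels.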

In [13]:
## code here
base_url = "https://www.congress.gov"  ## a base url
final_list = [] ## initialize an empty list
for items in a_tags_list[:2]: ## this slices our list to target only the first two
    for item in items: ## this second for loop goes through one nested list at a time
        name = item.get_text() ## pull out the name of the committee
        ## pull out the link and add the base_url to it
        ## (named "link" so we don't overwrite the "url" variable from earlier)
        link = base_url + item.get("href")
        ## BONUS: which branch of congress
        if "senate" in link:
            branch = "senate"
        else:
            branch = "house"

        ## append dictionary to our list
        final_list.append({"committee_name": name, "branch": branch, "link": link})

In [14]:
## check our list
final_list

[{'committee_name': 'Agriculture',
  'branch': 'house',
  'link': 'https://www.congress.gov/committee/house-agriculture/hsag00'},
 {'committee_name': 'Appropriations',
  'branch': 'house',
  'link': 'https://www.congress.gov/committee/house-appropriations/hsap00'},
 {'committee_name': 'Armed Services',
  'branch': 'house',
  'link': 'https://www.congress.gov/committee/house-armed-services/hsas00'},
 {'committee_name': 'Budget',
  'branch': 'house',
  'link': 'https://www.congress.gov/committee/house-budget/hsbu00'},
 {'committee_name': 'Education and Labor',
  'branch': 'house',
  'link': 'https://www.congress.gov/committee/house-education-and-labor/hsed00'},
 {'committee_name': 'Energy and Commerce',
  'branch': 'house',
  'link': 'https://www.congress.gov/committee/house-energy-and-commerce/hsif00'},
 {'committee_name': 'Ethics',
  'branch': 'house',
  'link': 'https://www.congress.gov/committee/house-ethics/hsso00'},
 {'committee_name': 'Financial Services',
  'branch': 'house',
  '

In [15]:
## turn our list into a df
df = pd.DataFrame(final_list)
df

| | committee_name | branch | link |
|---|---|---|---|
| 0 | Agriculture | house | https://www.congress.gov/committee/house-agric... |
| 1 | Appropriations | house | https://www.congress.gov/committee/house-appro... |
| 2 | Armed Services | house | https://www.congress.gov/committee/house-armed... |
| 3 | Budget | house | https://www.congress.gov/committee/house-budge... |
| 4 | Education and Labor | house | https://www.congress.gov/committee/house-educa... |
| 5 | Energy and Commerce | house | https://www.congress.gov/committee/house-energ... |
| 6 | Ethics | house | https://www.congress.gov/committee/house-ethic... |
| 7 | Financial Services | house | https://www.congress.gov/committee/house-finan... |
| 8 | Foreign Affairs | house | https://www.congress.gov/committee/house-forei... |
| 9 | Homeland Security | house | https://www.congress.gov/committee/house-homel... |
