# Project: Track bills related to voting rights

### Purpose: Scrape, Loop, Publish

In an ideal world, this program would check the bulk data for a new zip file to download every 30 days. It would then loop through the XML files to determine whether any bills relate to voting rights. If any do, it would build a dataframe containing the relevant bill data. Finally, it would export that information to a spreadsheet, an R Shiny app, or some other data visualization for easy web viewing.
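The "is this bill related to voting rights" step could start as a simple keyword match on each bill's title and official description. The sketch below shows one way to do that. The `VOTING_KEYWORDS` list and the `mentions_voting_rights` helper are hypothetical names, not part of this notebook, and the keyword list would need tuning against real bill text.

```python
import re

# Hypothetical keyword list; adjust after reviewing real bill titles.
VOTING_KEYWORDS = ["voting rights", "voter registration", "ballot", "election"]

# One case-insensitive pattern. Each keyword must begin at a word boundary,
# but may be followed by more letters, so plurals such as "ballots" also match.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in VOTING_KEYWORDS) + r")",
    re.IGNORECASE,
)

def mentions_voting_rights(*texts):
    """Return True if any of the given strings contains a keyword."""
    return any(_PATTERN.search(t or "") for t in texts)
```

In this notebook the arguments would likely be the `dc:title` and `official-title` text of each bill.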

#### Remaining To-Dos:
- Figure out why `find_all` is not finding all values for tags
- Automate the download of the zipped XML files directly from the govinfo website every 30 days
- Recreate this for the House bills and joint resolutions **cries single tear**
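For the 30-day download item, two pieces can be sketched without touching the network: deciding whether the local zip is stale, and unpacking only the XML members. The download itself is left out because the notebook does not record the govinfo URL. The helper names `needs_refresh` and `extract_xmls` are hypothetical.

```python
import time
import zipfile
from pathlib import Path

MAX_AGE_DAYS = 30  # refresh interval taken from the to-do list

def needs_refresh(zip_path, max_age_days=MAX_AGE_DAYS, now=None):
    """Return True if the zip is missing or at least max_age_days old."""
    zip_path = Path(zip_path)
    if not zip_path.exists():
        return True
    now = time.time() if now is None else now
    age_days = (now - zip_path.stat().st_mtime) / 86400  # seconds per day
    return age_days >= max_age_days

def extract_xmls(zip_path, dest_dir):
    """Extract only the .xml members of the zip into dest_dir; return their paths."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".xml"):
                extracted.append(Path(zf.extract(name, dest_dir)))
    return extracted
```

A scheduled job could call `needs_refresh` first and only download and extract when it returns `True`.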

#### Note:
    
- I am not parsing the XML directly from govinfo because I couldn't figure out how. Instead, I opted to download one big zip file of XML files for all bills.
- I may have to scrape a different set of XML files to track bill statuses. **cries blood**

In [2]:
from bs4 import BeautifulSoup as bs
import glob
import os
from pathlib import Path
import pandas as pd
import re
import numpy as np

<div class="alert alert-block alert-danger">
<b>Do not change 'Making the soup'</b>
    <br>It broke once already and we don't know how we fixed it.
</div>

In [3]:
# Making the soup

os.chdir("/Users/jessicayanez/Desktop/senate")

folder = Path("/Users/jessicayanez/Desktop/senate/")

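# Note: `soup` is reassigned on each pass, so after the loop it holds
# only the last file that was parsed.
for p in Path('.').glob('*.xml'):
    with p.open() as f:
        soup = bs(f,'lxml')
#         print(soup)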
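Because the loop above assigns every file to the same `soup` variable, later cells only ever see one bill. One possible way around that, without editing the cell above, is to keep a separate parsed object per file. The sketch below is an assumption about how that could look: `collect_parsed` is a hypothetical helper, and the parser is passed in as a function so the example stays self-contained.

```python
from pathlib import Path

def collect_parsed(folder, parse):
    """Return {filename: parse(open_file)} for every .xml file in folder."""
    parsed = {}
    for path in sorted(Path(folder).glob("*.xml")):
        with path.open() as f:
            # parse() must finish reading before the file is closed.
            parsed[path.name] = parse(f)
    return parsed

# In this notebook the call might look like (not run here):
# soups = collect_parsed("/Users/jessicayanez/Desktop/senate",
#                        lambda f: bs(f, "lxml"))
```

Each value in the returned dictionary could then be queried with `find_all` the same way `soup` is queried below.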

In [4]:
sponsors = [x.get_text() for x in soup.find_all(r'sponsor')]
s = "\n"
sponsor_name = s.join(sponsors)
sponsors

['Mr. Markey']

In [5]:
cosponsors = [x.get_text() for x in soup.find_all(r'cosponsor')]
s = "\n"
cosponsor_names = s.join(cosponsors)
cosponsors

['Mr. Sasse', 'Mr. Blunt', 'Mr. Schatz', 'Ms. Collins', 'Mr. Bennet']

### Looking for other table variables in our BeautifulSoup

##### Note: Element Descriptions and Content Models for Bills, Resolutions, and Amendments can be found [HERE](https://xml.house.gov/)

In [6]:
title = [x.get_text() for x in soup.find_all(re.compile(r'\bdc:title\b'))]
title

['117 S971 IS: Children and Media Research Advancement Act ']

In [7]:
official = [x.get_text() for x in soup.find_all(re.compile(r'\bofficial-title\b'))]
official

['To amend the Public Health Service Act to authorize a program on children and the media within the National Institute of Health to study the health and developmental effects of technology on infants, children, and adolescents. ']

In [8]:
date_created = [x.get_text() for x in soup.find_all(re.compile(r'\bdc:date\b'))]
date_created

['2021-03-25']

In [9]:
published_by = [x.get_text() for x in soup.find_all(re.compile(r'\bdc:publisher\b'))]
published_by

['U.S. Senate']

### First Attempt at building the bill table

<div class="alert alert-block alert-warning">
<b>Pros: All bill cosponsors show up</b>
<br>
Cons: Row one repeats for length of index; Data for other bills does not appear.
</div>

In [10]:
index_values = (1,2,3)

cols = ('Bill Sponsor',
        'Bill Cosponsors',
        'Title',
        'Official Description',
        'Date Created',
        'Publisher')

data = {'Bill Sponsor':[sponsor_name],
       'Bill Cosponsors':[cosponsor_names],
       'Title':[title],
       'Official Description':[official],
       'Date Created':[date_created],
       'Publisher':[published_by]}

bill_df = pd.DataFrame(data, columns=cols, index=index_values)
bill_df

| | Bill Sponsor | Bill Cosponsors | Title | Official Description | Date Created | Publisher |
|---|---|---|---|---|---|---|
| 1 | Mr. Markey | `Mr. Sasse\nMr. Blunt\nMr. Schatz\nMs. Collins\...` | [117 S971 IS: Children and Media Research Adva... | [To amend the Public Health Service Act to aut... | [2021-03-25] | [U.S. Senate] |
| 2 | Mr. Markey | `Mr. Sasse\nMr. Blunt\nMr. Schatz\nMs. Collins\...` | [117 S971 IS: Children and Media Research Adva... | [To amend the Public Health Service Act to aut... | [2021-03-25] | [U.S. Senate] |
| 3 | Mr. Markey | `Mr. Sasse\nMr. Blunt\nMr. Schatz\nMs. Collins\...` | [117 S971 IS: Children and Media Research Adva... | [To amend the Public Health Service Act to aut... | [2021-03-25] | [U.S. Senate] |
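In the first attempt, every column holds a single value while the index has three labels, which lines up with the same row appearing three times. The list-valued columns (`[2021-03-25]`, `[U.S. Senate]`) come from wrapping lists that `find_all` already returned in another list. A possible alternative is to flatten each bill into one plain dictionary and let the number of records set the number of rows. `bill_record` is a hypothetical helper, and the values below are copied from the outputs above.

```python
def bill_record(sponsors, cosponsors, titles, officials, dates, publishers):
    """Flatten the per-tag text lists for ONE bill into a single row dict."""
    def first(values):
        # Single-valued fields: first entry, stripped, or "" if the tag is absent.
        return values[0].strip() if values else ""

    return {
        "Bill Sponsor": first(sponsors),
        "Bill Cosponsors": "; ".join(name.strip() for name in cosponsors),
        "Title": first(titles),
        "Official Description": first(officials),
        "Date Created": first(dates),
        "Publisher": first(publishers),
    }

# One record per bill; with more files parsed, append one record for each.
records = [
    bill_record(
        ["Mr. Markey"],
        ["Mr. Sasse", "Mr. Blunt", "Mr. Schatz", "Ms. Collins", "Mr. Bennet"],
        ["117 S971 IS: Children and Media Research Advancement Act "],
        ["To amend the Public Health Service Act to authorize a program on "
         "children and the media within the National Institute of Health to "
         "study the health and developmental effects of technology on infants, "
         "children, and adolescents. "],
        ["2021-03-25"],
        ["U.S. Senate"],
    )
]
```

Passing `records` to `pd.DataFrame(records)` should give one row per bill with plain string cells, with no explicit index needed.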


## Simplified Version

<div class="alert alert-block alert-warning">
<b>Pros: Concise script</b>
<br>
Cons: Only first cosponsor name appears; Only first Bill information appears
</div>

In [11]:
s = soup.find_all('sponsor')
c = soup.find_all('cosponsor')
t = soup.find_all('dc:title')
o = soup.find_all('official-title')
d = soup.find_all('dc:date')
p = soup.find_all('dc:publisher')

In [12]:
bill_data = []
# Iterates once per <sponsor> tag (one here), so c[i] only ever reads
# the first cosponsor.
for i in range(len(s)):
    rows = [s[i].get_text(),
            c[i].get_text(),
            t[i].get_text(),
            o[i].get_text(),
            d[i].get_text(),
            p[i].get_text()]
    bill_data.append(rows)
print(bill_data)

[['Mr. Markey', 'Mr. Sasse', '117 S971 IS: Children and Media Research Advancement Act ', 'To amend the Public Health Service Act to authorize a program on children and the media within the National Institute of Health to study the health and developmental effects of technology on infants, children, and adolescents. ', '2021-03-25', 'U.S. Senate']]


In [13]:
bill_df_2 = pd.DataFrame(bill_data, columns=['Bill Sponsor',
                                      'Bill Cosponsors',
                                      'Title',
                                      'Official Description',
                                      'Date Created',
                                      'Publisher'],dtype = str)
bill_df_2

| | Bill Sponsor | Bill Cosponsors | Title | Official Description | Date Created | Publisher |
|---|---|---|---|---|---|---|
| 0 | Mr. Markey | Mr. Sasse | 117 S971 IS: Children and Media Research Advan... | To amend the Public Health Service Act to auth... | 2021-03-25 | U.S. Senate |
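The simplified version loops `len(s)` times and reads `c[i]`, so with one sponsor tag it only reaches the first cosponsor. If every cosponsor should be visible, one option is a "long" layout with one row per bill and cosponsor pair. This is a sketch of that layout, not the notebook's method; `cosponsor_rows` is a hypothetical helper and the values are copied from the outputs above.

```python
def cosponsor_rows(bill_title, sponsor, cosponsors):
    """Return one row dict per cosponsor of a single bill.

    A bill with no cosponsors still gets one row, with an empty cosponsor.
    """
    names = cosponsors or [""]
    return [
        {
            "Title": bill_title.strip(),
            "Bill Sponsor": sponsor,
            "Bill Cosponsor": name,
        }
        for name in names
    ]

rows = cosponsor_rows(
    "117 S971 IS: Children and Media Research Advancement Act ",
    "Mr. Markey",
    ["Mr. Sasse", "Mr. Blunt", "Mr. Schatz", "Ms. Collins", "Mr. Bennet"],
)
```

Here `rows` holds five dictionaries, one per cosponsor, and could be handed to `pd.DataFrame(rows)`. The trade-off is that bill-level fields repeat on every row.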
