# Collecting Legislation Data

## Introduction

Maryland's General Assembly publishes a wide range of data about the legislative process. There's an [Excel file of all proposed bills](), [information on each legislator](), and [even a fiscal analysis of each bill proposed](). In this scraper, I'll be gathering information on each bill proposed in the 2018 session.

There's a good deal of data available on the bill's "homepage".

![web page of bill](documents/bill_homepage.PNG)

In addition to the information on the website, each bill is published as a PDF. That PDF contains the text of the bill, including the new legislation or the legislation that will be repealed.

![PDF of bill](documents/bill_pdf.PNG)

For this topic modeling project, I'm more interested in the actual text of the bill. The additional information [could help improve the topic modeling process](https://link.springer.com/chapter/10.1007/978-3-642-40501-3_39), but I'm going to start simple and just use the text of the bill. Luckily, each bill begins with a sort of "executive summary", which is what I'll use for my topic model.

## Requirements

You can find the requirements for the entire project in the home directory of this project, but I'd also like to list the tools I used to collect the data:

* **Scrapy**==1.4.0
* **pdfminer.six**==20170720
* **requests**==2.18.4

## Building the Web Scraping Script


### Defining Items

There are a few different web scraping tools out there, but I prefer `Scrapy`, so that's what I'll be using here. After using `scrapy startproject` to set up my work environment, I'm first going to define the items: the data I'll be collecting from the web scrape.

```python
import scrapy


class LegislationScraperItem(scrapy.Item):
    bill_name = scrapy.Field()
    bill_number = scrapy.Field()
    url = scrapy.Field()
    sponsor = scrapy.Field()
    status = scrapy.Field()
    committee = scrapy.Field()
    broad_subjects = scrapy.Field()
    narrow_subjects = scrapy.Field()
    purpose = scrapy.Field()
```

I want to collect all the data from each bill web page, as well as the "executive summary" of the bill that I mentioned in the introduction. I'll call this the "purpose" of the bill.

### Building the Spider

Now I'm going to build out the main part of the spider. This is the script that will point the web crawler towards the data we want to collect, and define how the spider should move through the website.

First, I'll import some required packages and create the spider object. When creating the spider object, I need to give it a name, define the allowed domains, and provide some start URLs. Since I want to get information on every bill that was proposed last session, I'm starting from the web pages that list each bill proposed in a given committee, and including every committee's web page in that list.

![web page highlighting bills proposed in a committee](documents/committee_bills.PNG)

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from legislation_scraper.items import LegislationScraperItem


class LegislationSpider(CrawlSpider):
    name = 'legislation_spider'
    allowed_domains = ['mgaleg.maryland.gov']
    start_urls = [
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=app&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=b%26t&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=ecm&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=ehe&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=env&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=exn&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=fin&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=hgo&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=jpr&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=jud&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=sru&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=hru&stab=02&pid=cmtepage&tab=subject3&ys=2018RS',
        'http://mgaleg.maryland.gov/webmga/frmMain.aspx?id=w%26m&stab=02&pid=cmtepage&tab=subject3&ys=2018RS'
    ]
```

With these start URLs, I have access to each bill proposed in 2018, but I'll need to define a path for the spider to follow in order to access the bill's web page and PDF.

The bill numbers on this page are actually hyperlinks, and the spider can follow those links to reach the relevant information. Using XPath, the Google Chrome developer tools, and a helpful add-on called [Xpath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en), I was able to find and define an XPath that directs the spider to all the hyperlinks that lead to a bill's web page.

By including it in the rules below and restricting the link extractor to that XPath, the spider follows only those links and uses the `parse_item` callback to collect all the items I defined above.

```python
    rules = [
        # Follow only the bill links in the first column of the grid table.
        Rule(
            LinkExtractor(restrict_xpaths='//table[@class = "grid"]//tr/td[1]/a[1]'),
            callback='parse_item', follow=True)
    ]
```
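
To see what that XPath selects, here's a minimal sketch that uses only the standard library rather than Scrapy. The HTML snippet, its links, and the `first_column_links` helper are made up for illustration; the real committee page is more complex. `ElementTree` supports only a subset of XPath, so the expression is written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a committee page.
SAMPLE_PAGE = """
<html><body>
  <table class="grid">
    <tr><td><a href="/bill/hb0001">HB0001</a></td><td><a href="/sponsor/1">Sponsor A</a></td></tr>
    <tr><td><a href="/bill/sb0002">SB0002</a></td><td><a href="/sponsor/2">Sponsor B</a></td></tr>
  </table>
  <table class="other"><tr><td><a href="/ignored">Ignored</a></td></tr></table>
</body></html>
"""


def first_column_links(page):
    """Return the href of the first link in the first cell of each grid row."""
    root = ET.fromstring(page.strip())
    links = root.findall(".//table[@class='grid']//tr/td[1]/a[1]")
    return [link.get("href") for link in links]


print(first_column_links(SAMPLE_PAGE))
```

Only the two bill links come back; the sponsor links in the second column and the link in the other table are skipped, which is the behavior I want from the rule.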

The `parse_item` function uses xpaths to extract text information from the web page. It was created using the same tools I used to find the xpath above, along with some trial and error.

```python
    def parse_item(self, response):
        item = LegislationScraperItem()
        item['bill_name'] = response.xpath(
            '//table[@class = "billheader"]//tr[1]/td[3]/h3/text()').extract()
        item['bill_number'] = response.xpath(
            '//table[@class = "billheader"]//tr[2]/td[1]/h2/a/text()').extract()
        item['url'] = ''.join(["http://mgaleg.maryland.gov", response.xpath(
            '//table[@class = "billheader"]//tr[2]/td[1]/h2/a/@href').extract()[0]])
        item['sponsor'] = response.xpath(
            '//table[@class = "billheader"]//tr[2]/td[3]/h3/a/text()').extract()
        item['status'] = response.xpath(
            '//table[@class = "billheader"]//tr[3]/td[3]/h3/text()').extract()
        item['committee'] = response.xpath(
            '//table[@class = "subcomm"]//td/a/text()').extract()
        item['broad_subjects'] = response.xpath(
            '//table[@class = "billsum"]//tr[descendant::text()[contains(.,"Broad Subject")]]/td/a/text()').extract()
        item['narrow_subjects'] = response.xpath(
            '//table[@class = "billsum"]//tr[descendant::text()[contains(.,"Narrow Subject")]]/td/a/text()').extract()
        yield item
```
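
One detail worth a closer look is how the PDF URL is built by gluing the site root onto the extracted `href`, which also raises an `IndexError` if nothing was extracted. A slightly more forgiving alternative is `urljoin` from the standard library, which copes with relative and already-absolute links. The sketch below is only illustrative: the `build_pdf_url` helper and the example paths are hypothetical, not taken from the site.

```python
from urllib.parse import urljoin

# The page the links are resolved against (same host as the spider).
BASE_URL = "http://mgaleg.maryland.gov/webmga/frmMain.aspx"


def build_pdf_url(hrefs, base_url=BASE_URL):
    """Return an absolute URL for the first href, or None if none was extracted."""
    if not hrefs:
        return None
    return urljoin(base_url, hrefs[0])


# Hypothetical href, for illustration only.
print(build_pdf_url(["/example/hb0001.pdf"]))
```

Inside the spider, Scrapy's `response.urljoin(href)` does the same thing relative to the URL of the page being parsed.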

### Building the Item Pipeline

The only item I haven't collected is `purpose`. Remember, this data is text that is only found in the bill's text, which is saved as a PDF; it's not available in full on the website. Luckily, we can use `pdfminer` to extract this information (you'll need to use `pdfminer.six` if you're on Python 3.x like I am).

I'm going to use a `Scrapy` item pipeline (defined in the project's `pipelines` module) together with `pdf2txt` to extract the `purpose` from each document.

In building the `process_item` function, the first step is to request the URL that I collected from the bill homepage.

```python
import os
import re
import requests
import time


class LegislationScraperPipeline(object):

    def process_item(self, item, spider):
        url = item['url']
        bill_number = item['bill_number'][0]
        r = requests.get(url, stream=True)
```

`r` contains the PDF for a specific bill. I'll need to save the content to a PDF file and then convert it to a `txt` file using `pdf2txt`. I'm going to save the PDF file in the `data` folder and name it after the bill number.

```python
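        # Build the temporary file paths inside the data folder.
        pdf_filename = os.path.join('data', bill_number + '.pdf')
        txt_filename = os.path.join('data', bill_number + '.txt')
        # Write the downloaded bytes to disk so pdf2txt can read them.
        with open(pdf_filename, 'wb') as f:
            f.write(r.content)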
```

I didn't find a module I could import in a Python script to call `pdf2txt` directly. Instead, I'm going to use `os.system` to run `pdf2txt` as if I were calling it from the command shell.

```python
        pdf2txt_path = 'legislation_scraper/pdf2txt.py'
        print("python " + pdf2txt_path + " -o " +
              txt_filename + " " + pdf_filename)
        os.system("python " + pdf2txt_path + " -o " +
                  txt_filename + " " + pdf_filename)
```
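
String concatenation with `os.system` works, but it breaks if a path contains spaces, and the call above ignores the exit status. As a sketch of an alternative, assuming the same `pdf2txt.py` script and `-o` flag used above, `subprocess.run` takes the arguments as a list and exposes the exit code. The helper names `build_pdf2txt_command` and `convert_pdf` are my own.

```python
import subprocess
import sys


def build_pdf2txt_command(pdf2txt_path, pdf_filename, txt_filename):
    """Build the argument list for: python pdf2txt.py -o <txt> <pdf>."""
    # sys.executable is the interpreter running this script, so the
    # converter runs in the same environment as the scraper.
    return [sys.executable, pdf2txt_path, '-o', txt_filename, pdf_filename]


def convert_pdf(pdf2txt_path, pdf_filename, txt_filename):
    """Run pdf2txt and report whether it exited successfully."""
    command = build_pdf2txt_command(pdf2txt_path, pdf_filename, txt_filename)
    # run() blocks until the converter exits; no shell is involved,
    # so paths with spaces need no quoting.
    completed = subprocess.run(command)
    return completed.returncode == 0
```

With something like this, the pipeline could skip the regular expression step and leave `purpose` empty whenever `convert_pdf` returns `False`.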

Now I'm going to use a regular expression to extract the purpose from the bill text and store it on the item.

I'm adding in a `while` statement to wait until the text file is written.

Finally, I'm closing the text file and deleting both temporary files to wrap things up.

```python
        # Wait until pdf2txt has written the text file.
        while not os.path.exists(txt_filename):
            time.sleep(0.001)
        with open(txt_filename, 'r', encoding='utf8') as file:
            filetext = file.read()
        # Collapse the text onto one line so the pattern can span line breaks.
        filetext = filetext.replace('\n', '').replace('\r', '')
        # Capture from "FOR" or "AN ACT" up to the first marker that ends the summary.
        mat = re.search(
            r'(?=(FOR|AN ACT)).+?(?=( BY|(|\(1\) )SECTION 1\.| WHEREAS| EXPLANATION))', filetext)
        if mat:
            item['purpose'] = mat.group(0)
        else:
            item['purpose'] = ""
        # Clean up the temporary files.
        os.remove(pdf_filename)
        os.remove(txt_filename)
        return item
```
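
To make the regular expression less opaque, here's a small standalone sketch that runs the same pattern (with the period after `SECTION 1` escaped so it matches a literal dot) against a made-up snippet. The sample text and the `extract_purpose` helper are invented for illustration; real bill text is longer and messier.

```python
import re

# Start at "FOR" or "AN ACT" and stop just before the first closing marker.
PURPOSE_PATTERN = re.compile(
    r'(?=(FOR|AN ACT)).+?(?=( BY|(|\(1\) )SECTION 1\.| WHEREAS| EXPLANATION))')


def extract_purpose(text):
    """Return the summary between the opening and closing markers, or ''."""
    match = PURPOSE_PATTERN.search(text)
    return match.group(0) if match else ""


# Invented sample, already collapsed onto a single line.
SAMPLE_TEXT = (
    "HOUSE BILL 1 AN ACT concerning Public Safety "
    "FOR the purpose of requiring an annual report; "
    "and generally relating to public safety. "
    "BY adding to Article - Public Safety SECTION 1. BE IT ENACTED"
)

print(extract_purpose(SAMPLE_TEXT))
```

The match starts at `AN ACT` because that marker appears first, and it stops just before ` BY`, so the header before the summary and the statutory references after it are both dropped.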

### Changing Scraper Settings

Finally, I need to change some of the settings to make sure the spider and pipeline are enabled, and to ensure that I don't flood the website with requests.

```python
BOT_NAME = 'legislation_scraper'

SPIDER_MODULES = ['legislation_scraper.spiders']
NEWSPIDER_MODULE = 'legislation_scraper.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'legislation_scraper.pipelines.LegislationScraperPipeline': 300,
}

AUTOTHROTTLE_ENABLED = True

AUTOTHROTTLE_START_DELAY = 5

AUTOTHROTTLE_MAX_DELAY = 60
```


## Executing the Spider

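Finally, we'll open up a command prompt and run the following command to save the data as a CSV file. The name passed to `scrapy crawl` has to match the spider's `name` attribute, which is `legislation_spider`.

    scrapy crawl legislation_spider -o data.csv -t csv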
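
Once the crawl finishes, the export can be read back with the standard library. This is a minimal sketch, assuming the crawl produced a `data.csv` with one column per item field; the `load_bills` helper is my own name.

```python
import csv


def load_bills(csv_path):
    """Read the exported CSV into a list of dicts keyed by item field name."""
    with open(csv_path, newline='', encoding='utf8') as f:
        return list(csv.DictReader(f))
```

For example, `load_bills('data.csv')` returns one dictionary per bill, so `bills[0]['purpose']` would give the first bill's summary text for the topic model.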