# Setting up your development environment

1. Clone the state scraper repo from GitHub

  ```
  git clone https://github.com/influence-usa/scrapers-us-state.git
  ```

2. Make a new virtualenv

  ```
  mkvirtualenv --python=$(which python3) iusa-scrape
  ```
  
3. Use `pip` to install requirements

  ```
  pip install -r requirements.txt
  ```
  
4. If you don't see a folder for the state you're working on, run the following:

  ```
    (iusa-scrape)$>pupa init arizona
    no pupa_settings on path, using defaults
    jurisdiction name (e.g. City of Seattle): Arizona
    division id (e.g. ocd-division/country:us/state:wa/place:seattle): ocd-division/country:us/state:az
    classification (can be: government, legislature, executive, school_system): government
    official URL: http://www.az.gov/
    create disclosures scraper? [Y/n]: Y
    create bills scraper? [y/N]: n
    create events scraper? [y/N]: y
    create votes scraper? [y/N]: n
    create people scraper? [y/N]: y
  ```

This created a new folder for the state, named after the argument we passed to `pupa init` (in this example, `arizona`):

  ```
    (iusa-scrape)$>tree
    .
    ├── ak
    │   └── __init__.py
    ├── al
    │   ├── __init__.py
    │   └── people.py
    ├── arizona
    │   ├── disclosures.py
    │   ├── events.py
    │   ├── __init__.py
    │   └── people.py
    ├── md
    │   └── __init__.py
    ├── README.md
    ├── requirements.txt
    ├── Untitled.ipynb
    └── utils
        ├── __init__.py
        └── lxmlize.py
  ```
  
To follow the broader pupa convention, we'll change the directory name to `az`:

```
    (iusa-scrape)$>mv arizona az
    (iusa-scrape)$>tree
    .
    ├── ak
    │   └── __init__.py
    ├── al
    │   ├── __init__.py
    │   └── people.py
    ├── az
    │   ├── disclosures.py
    │   ├── events.py
    │   ├── __init__.py
    │   └── people.py
    ├── md
    │   └── __init__.py
    ├── README.md
    ├── requirements.txt
    ├── Untitled.ipynb
    └── utils
        ├── __init__.py
        └── lxmlize.py

5 directories, 13 files
```
  
Because of how we answered the questions above, it also created starter code for our scrapers: one each for disclosures, events, and people.

Also interesting is the `__init__.py` file in our state's directory. `pupa init` used the answers to our questions to build a `Jurisdiction` class that represents the state government:

```
class Arizona(Jurisdiction):
    division_id = "ocd-division/country:us/state:az"
    classification = "government"
    name = "Arizona"
    url = "http://www.az.gov/"
    scrapers = {
        "events": ArizonaEventScraper,
        "people": ArizonaPersonScraper,
        "disclosures": ArizonaDisclosureScraper,
    }

    def get_organizations(self):
        yield Organization(name=None, classification=None)
```
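The `division_id` above is a path-like Open Civic Data identifier: an `ocd-division` prefix followed by `type:value` segments. As a rough illustration, it can be pulled apart with a few lines of Python. The `parse_division_id` helper is hypothetical and is not part of pupa.

```python
def parse_division_id(division_id):
    """Split an OCD division ID into a dict of its type:value segments.

    Hypothetical helper for illustration only; pupa does not provide it.
    """
    prefix, *segments = division_id.split("/")
    if prefix != "ocd-division":
        raise ValueError("not an OCD division ID: %r" % division_id)
    # Each remaining segment looks like "country:us" or "state:az".
    return dict(segment.split(":", 1) for segment in segments)


parts = parse_division_id("ocd-division/country:us/state:az")
print(parts["state"])  # az
```

The `state` segment is the same two-letter abbreviation we used for the directory name.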

# Version control

This is a good time to add and commit our changes so far.

```
(iusa-scrape)$>git add az
(iusa-scrape)$>git commit -m "initialized arizona"
[master 3e622ef] initialized arizona
 4 files changed, 47 insertions(+)
 create mode 100644 az/__init__.py
 create mode 100644 az/disclosures.py
 create mode 100644 az/events.py
 create mode 100644 az/people.py
 ```

# Where is the data?

# Creating global authority organizations

## Create the Secretary of State

The `get_organizations` method of the `Jurisdiction` class lets us define some global organizations for all of the data that we'll be scraping from Arizona's sites. For campaign finance disclosures, we'll have to define the Arizona Secretary of State's Office.

```
def get_organizations(self):                                        
```

First, initialize using the `Organization` class.

```
    secretary_of_state = Organization(
        name="Office of the Secretary of State, State of Arizona",
        classification="office"
    )
```
    
Here, we're able to set particular attributes using `kwargs`.  To get a sense of which attributes you can set at this point, check out the [source](https://github.com/influence-usa/pupa/blob/disclosures/pupa/scrape/popolo.py#L132-L182).

Now, we can add other attributes, using the helper methods found on the `Organization` class:

```
    secretary_of_state.add_contact_detail(                                
        type="voice",                                                     
        value="602-542-4285"                                              
    )                    

    secretary_of_state.add_contact_detail(                                
        type="address",                                                   
        value="1700 W Washington St Fl 7, Phoenix AZ 85007-2808"          
    )                                                                     
    secretary_of_state.add_link(                                          
        url="http://www.azsos.gov/",                                      
        note="Home page"                                                  
    )                                                                     
```

We should add the organization we've created to the `Jurisdiction` object as a semi-private attribute. This is useful because the `Jurisdiction` object will essentially always be accessible to all of our scrapers. Whenever we want to refer to the AZ Secretary of State, we can access it from the `Arizona` jurisdiction object.

```
    self._secretary_of_state = secretary_of_state                   
```

Finally, yield the organization we created. This is because `get_organizations` is actually the first scraper that runs each time we run Arizona scrapers of any kind.

```
    yield secretary_of_state                                          
```
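Put together, the pieces above form a single generator method. The sketch below shows the assembled shape so it can be run without pupa installed. The `Organization` and `Arizona` classes here are minimal stand-ins written for this example, not pupa's real classes, and they only mimic the calls used in this walkthrough.

```python
class Organization:
    """Minimal stand-in for pupa's Organization, for illustration only."""

    def __init__(self, name, classification=None):
        self.name = name
        self.classification = classification
        self.contact_details = []
        self.links = []

    def add_contact_detail(self, *, type, value, note=""):
        self.contact_details.append({"type": type, "value": value, "note": note})

    def add_link(self, url, *, note=""):
        self.links.append({"url": url, "note": note})


class Arizona:
    """Stand-in for the generated Jurisdiction subclass."""

    def get_organizations(self):
        secretary_of_state = Organization(
            name="Office of the Secretary of State, State of Arizona",
            classification="office",
        )
        secretary_of_state.add_contact_detail(type="voice", value="602-542-4285")
        secretary_of_state.add_contact_detail(
            type="address",
            value="1700 W Washington St Fl 7, Phoenix AZ 85007-2808",
        )
        secretary_of_state.add_link(url="http://www.azsos.gov/", note="Home page")

        # Keep a reference on the jurisdiction so scrapers can reach it later.
        self._secretary_of_state = secretary_of_state
        yield secretary_of_state


jurisdiction = Arizona()
organizations = list(jurisdiction.get_organizations())
print(organizations[0].name)
print(jurisdiction._secretary_of_state is organizations[0])  # True
```

Because `get_organizations` is a generator, `_secretary_of_state` is only set once something iterates over it.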

## Test what we have so far!

Cool, let's try out what we have so far.  From the project root (`scrapers-us-state`), run the command:

```
(iusa-scrape)$>pupa update az --scrape
```

This will throw a `ScrapeError` because we haven't written any of the main scrapers yet, but before it does, we'll see that it creates our `Jurisdiction` object and the `Organization` representing the Arizona Secretary of State.

```
no pupa_settings on path, using defaults
az (scrape)
  events: {}
  people: {}
  disclosures: {}
Not checking sessions...
13:30:10 INFO pupa: save jurisdiction Arizona as jurisdiction_ocd-jurisdiction-country:us-state:az-government.json
13:30:10 INFO pupa: save organization Office of the Secretary of State, State of Arizona as organization_1e330580-e20b-11e4-a4f5-e90fe0697b56.json
```
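The jurisdiction filename in that log can be read back to the answers we gave `pupa init`. The sketch below reproduces it from the `division_id` and `classification`. Both helper functions are hypothetical names written for this example, and the rules inside them are inferred from the log line rather than taken from pupa's documentation.

```python
def jurisdiction_id(division_id, classification):
    # Assumption: swap the prefix and append the classification,
    # which reproduces the ID seen in the log above.
    return "{}/{}".format(
        division_id.replace("ocd-division", "ocd-jurisdiction", 1),
        classification,
    )


def output_filename(object_type, object_id):
    # Assumption: slashes become dashes so the ID is safe as a filename.
    return "{}_{}.json".format(object_type, object_id.replace("/", "-"))


jid = jurisdiction_id("ocd-division/country:us/state:az", "government")
print(output_filename("jurisdiction", jid))
# jurisdiction_ocd-jurisdiction-country:us-state:az-government.json
```

The organization file follows the same `type_id.json` shape, with what looks like a generated UUID in place of a readable ID.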

# Starting a new scraper

Now it's time to start writing the real meat and potatoes of our scraping code.

## Locate the source of the data

Check out the [Big Board](https://docs.google.com/spreadsheets/d/18-MvVJXg8TkUUNhtBmWoCEPUWEMf7F6-YVV6x7CWrg4/pubhtml) to see which URL you should use to start. Explore the links on that page until you find the data you're looking for.

For this example, we'll look at the Arizona Super PAC list. 

```
PAC_LIST_URL = "http://apps.azsos.gov/apps/election/cfs/search/SuperPACList.aspx"
```

## Adding new scrape routines

We're going to add our code to `az/disclosures.py`. 

```
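class ArizonaDisclosureScraper(Scraper):

    def scrape_super_pacs(self):
        # placeholder: yields nothing until the real routine is written
        yield from ()

    def scrape(self):
        # pupa calls scrape(); hand off to one subroutine per type of data
        yield from self.scrape_super_pacs()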
```

When we're through, the `pupa` CLI will call the `scrape` method. It's good practice to break that method down into a series of subroutines like this, one for each type of data you're returning. The pupa software doesn't actually care, though; it just expects a stream of Open Civic Data scrape objects (`Person`, `Organization`, `Event`, etc.).
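To make the "stream of objects" idea concrete, here is a small sketch of how `yield from` chains several subroutines into one stream. The `Scraper` base class is an empty stand-in, `scrape_committees` is a hypothetical second routine, and plain dicts stand in for real pupa scrape objects.

```python
class Scraper:
    """Stand-in for pupa's Scraper base class, for illustration only."""


class ExampleDisclosureScraper(Scraper):

    def scrape_super_pacs(self):
        # Placeholder data; a real routine would yield pupa scrape objects.
        yield {"type": "organization", "name": "Example Super PAC"}

    def scrape_committees(self):
        # Hypothetical second routine, to show how routines chain together.
        yield {"type": "organization", "name": "Example Committee"}

    def scrape(self):
        # Each `yield from` passes a subroutine's objects straight through.
        yield from self.scrape_super_pacs()
        yield from self.scrape_committees()


for scraped in ExampleDisclosureScraper().scrape():
    print(scraped["name"])
```

The caller sees one flat sequence and never needs to know how many subroutines produced it.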

## Developing your scraper

At this point, you might want to move to a REPL (or, even better, to an IPython notebook) so that you can start figuring out how to obtain the target data.

In this example, things are fairly straightforward.  There's a `<table>` element in the middle of the page that has all the information we need to generate the `Organization` objects that we want.

```
from lxml import etree
from lxml.html import HTMLParser

import scrapelib

resp = scrapelib.urlopen('http://localhost:8000/SuperPACList.aspx.html')

d = etree.fromstring(resp, parser=HTMLParser())
```

The easiest thing to do is look for the table we're interested in by writing an XPath query.

```
d.xpath('//table')

[<Element table at 0x7fbe9c435138>, <Element table at 0x7fbe9c435188>]
```

Hm, looks like there's more than one, so we're going to have to narrow our XPath query.
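One common way to narrow a query like this is an attribute predicate. The sketch below uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet so that it runs without lxml. The `id` values and table contents are invented for the example, so inspect the real page to find an attribute that actually distinguishes the table you want.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the page; the real markup will differ.
PAGE = """
<html><body>
  <table id="layout"><tr><td>navigation</td></tr></table>
  <table id="pac-list">
    <tr><th>Name</th></tr>
    <tr><td>Example Super PAC</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(PAGE)

# An unqualified search matches both tables, as in the session above.
print(len(root.findall(".//table")))  # 2

# Adding an attribute predicate narrows the match to the table we want.
target = root.find(".//table[@id='pac-list']")
names = [td.text for td in target.findall("./tr/td")]
print(names)  # ['Example Super PAC']
```

With lxml, the same idea would be written as `d.xpath('//table[@id="pac-list"]')`, again with a made-up `id`.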