# Chicago City Council Voting Records
### 2 April 2017 (v1)
### 8 April 2017 (v2)

As part of the Data for Democracy "Chicago Lobbyists" project, I've scraped historical voting records for each of the city's Alder-persons. 

The outputs of this notebook are at https://data.world/stephen-hoover/chicago-city-council-votes . 

You can find the Chicago Lobbyists project at https://data.world/lilianhj/chicago-lobbyists . 

To learn more about Data for Democracy, go to https://github.com/Data4Democracy/read-this-first .

In [1]:
from collections import Counter
from glob import glob
from itertools import chain
import os
import re
import string
import subprocess
import sys
import tempfile
import urllib.request

import bs4
import numpy as np
import pandas as pd

from typing import Dict, List, Tuple

In [2]:
WORKING_DIR = os.path.join(os.path.expanduser('~'), 'projects', 'd4d', 'chicago-lobbyists')

# Download Data

Chicago City Council voting records are published as pdfs on the city clerk's website. The votes all seem to be in pdfs titled "Attendance and Divided Roll Call Vote" (some are titled "... Report" instead), so download only those pdfs. The website is paginated (as of April 2017). We can find the number of pages by inspecting the "Go to last page" link.

The table on the city clerk's website includes a meeting date with each download link. Use that to create the file name.

In [497]:
base = 'http://www.chicityclerk.com/'
suffix = 'legislation-records/journals-and-reports/council-meeting-reports'
page_query = '?field_publish_date_value[value]&page={num}'
resp = urllib.request.urlopen(base + suffix + page_query.format(num=0))
soup = bs4.BeautifulSoup(resp.read(), "html5lib")

In [498]:
last_page_link = [l for l in soup.find_all('a') if l.attrs.get('title') == 'Go to last page'][0].attrs['href']
print(last_page_link)
last_page = int(last_page_link.split('=')[-1])
print(last_page)

/legislation-records/journals-and-reports/council-meeting-reports?field_publish_date_value[value]=&page=27
27
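
Splitting the link on `=` works because `page` happens to be the last query parameter. A minimal sketch of a slightly more defensive alternative, assuming the link keeps a `page` query parameter; the helper name `last_page_from_href` is hypothetical and not part of the notebook:

```python
from urllib.parse import parse_qs, urlparse


def last_page_from_href(href: str) -> int:
    """Return the integer value of the ``page`` query parameter in ``href``."""
    # keep_blank_values keeps the empty "field_publish_date_value[value]=" entry
    query = parse_qs(urlparse(href).query, keep_blank_values=True)
    return int(query['page'][0])


example_href = ('/legislation-records/journals-and-reports/council-meeting-reports'
                '?field_publish_date_value[value]=&page=27')
print(last_page_from_href(example_href))
```

This reads the parameter by name, so it would keep working if the site appended further parameters after `page`.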


In [499]:
download_links = []
for page_num in range(last_page + 1):
    resp = urllib.request.urlopen(base + suffix + page_query.format(num=page_num))
    soup = bs4.BeautifulSoup(resp.read(), "html5lib")
    for link in soup.find_all('a', string='Download'):
        link_info = {'title': list(link.parent.parent.parent.children)[3].text.strip(),
                     'date': list(link.parent.parent.parent.children)[1].text.strip(),
                     'href': link.attrs['href']}
        download_links.append(link_info)
len(download_links)

1097

In [500]:
roll_call_votes = [l for l in download_links if 'roll call' in l['title'].lower()]
roll_call_votes

[{'date': '2017-03-29',
  'href': '/file/7189/download?token=bxMpAQh6',
  'title': 'City Council - Attendance and Divided Roll Call Vote 3-29-2017'},
 {'date': '2017-02-22',
  'href': '/file/7165/download?token=kgouFPr6',
  'title': 'City Council - Attendance and Divided Roll Call Vote 2-22-2017'},
 {'date': '2017-01-25',
  'href': '/file/7139/download?token=WbxqcBhj',
  'title': 'City Council - Attendance and Divided Roll Call Vote 1-25-2017'},
 {'date': '2016-12-14',
  'href': '/file/7070/download?token=gQTk6kbQ',
  'title': 'City Council - Attendance and Divided Roll Call Vote 12-14-2016'},
 {'date': '2016-09-14',
  'href': '/file/6986/download?token=LaaZKUeq',
  'title': 'Attendance and Divided Roll Call Report 09/14/2016'},
 {'date': '2016-06-23',
  'href': '/file/6888/download?token=0xaUDVRg',
  'title': 'Attendance and Divided Roll Call Report 06/22/2016'},
 {'date': '2016-04-13',
  'href': '/file/7150/download?token=5GN2as0G',
  'title': 'Attendance and Divided Roll Call Report

In [501]:
print(len(roll_call_votes))

153


In [502]:
def download_link(doc_link):
    """Download one roll call pdf, naming the local file after the meeting date."""
    local_name = doc_link['date'].replace('-', '') + '_roll_call_report.pdf'
    path = os.path.join(WORKING_DIR, 'roll_calls', local_name)
    url = base + doc_link['href'].lstrip('/')
    try:
        print('Downloading "{}" from {}'.format(doc_link['title'], url))
        return urllib.request.urlretrieve(url, filename=path)
    except Exception as exc:
        # Report the failure and move on; the caller gets None for this link.
        print("Failed! {}".format(exc))

In [None]:
download_resp = []
for doc_link in roll_call_votes:
    local_name = doc_link['date'].replace('-', '') + '_roll_call_report.pdf'
    path = os.path.join(WORKING_DIR, 'roll_calls', local_name)
    url = base + doc_link['href'].lstrip('/')
    try:
        print("Downloading {}".format(url))
        download_resp.append(urllib.request.urlretrieve(url, filename=path))
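    except Exception as exc:
        # Catch Exception rather than everything, so Ctrl-C still stops the loop.
        print("Failed: {}".format(exc))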

In [None]:
download_resp

# Parse Voting Records

There are two formats of voting records. The older format has a single table indexed by ward number and Alderman name, with a matrix of votes. The column names are the record numbers of the measures considered that day. Titles of each measure are in the block of text above the voting records. I've called this format the "vote table". Parsing is controlled by the "parse_file_with_table" function.

The newer format has one vote per page. Each page has the record name, title, and a block of ward number and Alderman name with votes in two columns. I've called this the "vote block" format. Parsing is controlled by the "parse_file_with_blocks" function. First I locate each block (each measure considered), then parse that block. Some tables translated cleanly from pdf -- these use the "parse_vote_block_clean" function. Some got a bit scrambled -- those use the "parse_vote_block_dirty" function.

For both formats, I've special-cased a few documents that didn't parse cleanly for one reason or another.
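
To make the "vote block" layout concrete, here is a minimal sketch of how one well-formed line splits into records. It is a simplified stand-in for the clean parser, not the notebook's implementation: the function name `parse_clean_line` and the sample lines are made up for illustration.

```python
from typing import Dict, List

VOTE_TOKENS = {'Y', 'N', 'A', 'NV', 'V', 'E', 'R', 'Recused', '-'}


def parse_clean_line(line: str) -> List[Dict[str, str]]:
    """Split 'WARD NAME VOTE WARD NAME VOTE' into one record per Alderman."""
    records, ward, name_tokens = [], None, []
    for token in line.split():
        if ward is None:
            ward = token                  # first token of each group is the ward
        elif token in VOTE_TOKENS:
            records.append({'Ward': ward,
                            'Alderman': ' '.join(name_tokens),
                            'Vote': token})
            ward, name_tokens = None, []  # start the next group
        else:
            name_tokens.append(token)     # everything in between is the name
    return records


# Hypothetical line in the two-column layout
print(parse_clean_line('1 Moreno Y 26 Maldonado Y'))
```

A name token that happens to equal a vote code would end the group early, which is one reason the scrambled pages need a separate, more forgiving parser.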

In [3]:
# This cell contains the code for converting vote blocks 
# of the newer format into DataFrames of voting records.

columns = ['Ward', 'Alderman', 'Vote']
VOTE_TOKENS = ['Y', 'N', 'A', 'NV', 'V', 'E', 'R', 'Recused', '-']


def standardize_name(name: str) -> str:
    """Standardize names of Aldermen"""
    if name.lower() in ['vacant', 'vacancy']:
        name = 'vacant'
    elif name == 'Ramirez Garza Rosa':
        # This is a common scrambling of "Ramirez Rosa" and "Sadlowski Garza"
        name = 'Ramirez Rosa'
    elif name == 'Sadlowski':
        name = 'Sadlowski Garza'
    elif name == 'Timothy M.':
        name = 'Timothy M. Cullerton'
    elif name == 'George A.':
        name = 'George A. Cárdenas'
    return name


def is_number(token: str):
    try:
        int(token)
        return True
    except ValueError:
        return False

    
def next_index_with_condition(line: List[str], condition):
    for i_token, token in enumerate(line):
        if condition(token):
            return i_token


def vote_from_token_end(token: str) -> Tuple[str, str]:
    if token == 'VACANCY':
        return None, token
    elif token[-2:] in VOTE_TOKENS:
        return token[-2:], token[:-2]
    elif token[-1] in VOTE_TOKENS:
        return token[-1], token[:-1]
    else:
        return None, token
    

# Special casing for corner cases and single document errors.
# Use these to clean lines.
SPECIAL_LINE_CLEANING = [('YO’Connor', 'Y O’Connor'),
                        ('AO’Connor', 'A O’Connor'),
                        ('V41', 'V 41'),
                        ('Brookins45 Cappleman YY', 'Brookins 45 Cappleman Y Y'),
                        ('PaY war', 'Pawar Y'),
                        ('OstermaY n', 'Osterman Y'),
                        ('Jackson32', 'Jackson 32'),
                        ('Thompson36', 'Thompson 36'),
                        ('denas37', 'denas 37'),
                        ('Napolitano NY', 'Napolitano N Y'),
                        ('Napolitano YY', 'Napolitano Y Y'),
                        ('34 Y', 'Y 34'),
                        ('Brookins45 Capple Y man Y',
                         'Brookins 45 Cappleman Y Y'),
                        ('N Rosa', 'Rosa N'),
                        ('Y Rosa', 'Rosa Y'),
                        ('Sadlowski 35 RaGarza mirez Rosa',
                         'Sadlowski Garza 35 Ramirez Rosa')]


def correct_record_miscodings(record: Dict[str, str]) -> Dict[str, str]:
    """Correct known data entry errors or inconsistencies
    
    Input should be a dictionary with keys "Ward", "Alderman", "Vote"
    """
    record = record.copy()  # Don't alter input
    if record['Alderman'] == 'Cappleman' and record['Ward'] == '45':
        record['Ward'] = '46'
    if record['Vote'] == 'Recused':
        record['Vote'] = 'R'
    return record


def parse_vote_block_dirty(lines: List[str]) -> pd.DataFrame:
    # This works on a vote block with messed up formatting.
    # Assume lines are organized as
    # WARD1 NAME1 WARD2 NAME2 VOTE1 VOTE2
    # The "VOTE1" may or may not have whitespace between it and NAME2
    # Possibly VOTE1 is between NAME1 and WARD2 for some lines.
    
    votes = []
    for i, line in enumerate(lines):
        if not line:
            break
        #print(line)
        line = ' '.join(line.split())  # Convert all whitespace to one space
        for special in SPECIAL_LINE_CLEANING:
            line = line.replace(*special)
        #print(line)
        tokens = line.split()
        #tokens = list(chain.from_iterable([split_vote_token(t) for t in line.split()]))
        clean_tokens = [], []
        stage = 0
        clean_tokens[0].append(tokens.pop(0))
        
        i_next_ward = next_index_with_condition(tokens, is_number)
        name_0 = ' '.join(tokens[:i_next_ward]).strip()
        vote_0, name_0 = vote_from_token_end(name_0)
        clean_tokens[0].append(name_0.strip())
        clean_tokens[1].append(tokens[i_next_ward])
        
        tokens = tokens[i_next_ward + 1:]  # Remove consumed tokens
        vote_1 = tokens.pop(len(tokens) - 1)  # Last token is the second vote
        
        # Remaining tokens are a combination of first vote and second name
        if not vote_0:
            vote_0, tokens[-1] = vote_from_token_end(tokens[-1])
        if not vote_0:
            vote_0 = tokens.pop(1)
        clean_tokens[0].append(vote_0)
        
        name_1 = ' '.join(tokens).strip()
        if not name_1 and vote_1 not in VOTE_TOKENS:
            # Probably the name and last vote merged together.
            vote_1, name_1 = vote_from_token_end(vote_1)
        clean_tokens[1].append(name_1)
        clean_tokens[1].append(vote_1)
        
        for cleaned in clean_tokens:
            record = dict(zip(columns, cleaned))
            votes.append(correct_record_miscodings(record))
    df = pd.DataFrame.from_records(votes)
    df['Ward'] = df['Ward'].astype(int)
    df['Alderman'] = df['Alderman'].apply(standardize_name)
    
    assert len(df) == 50, "There should be 50 wards"
    assert (np.sort(df['Ward'].values) == np.arange(1, 51)).all(),\
        "Wards should be 1-50"
    
    return df.sort_values(by='Ward').reset_index(drop=True)


def parse_vote_block_clean(lines: List[str]) -> pd.DataFrame:
    # This works on a well-formatted vote block
    votes = []
    # Go over enough lines to be sure you have it all. Break when done.
    for i, line in enumerate(lines):
        if not line:
            break
        tokens = line.split()
        clean_tokens = []
        stage = 0
        name_tokens = []
        for t in tokens:
            if stage == 0:
                clean_tokens.append(t)
                stage += 1
            elif stage == 1:
                if t in VOTE_TOKENS:
                    clean_tokens.append(' '.join(name_tokens))
                    clean_tokens.append(t)
                    record = dict(zip(columns, clean_tokens))
                    votes.append(correct_record_miscodings(record))
                    clean_tokens, name_tokens = [], []
                    stage = 0
                else:
                    name_tokens.append(t)
    df = pd.DataFrame.from_records(votes)
    df['Ward'] = df['Ward'].astype(int)
    df['Alderman'] = df['Alderman'].apply(standardize_name)
    
    assert len(df) == 50, "There should be 50 wards"
    assert (np.sort(df['Ward'].values) == np.arange(1, 51)).all(),\
        "Wards should be 1-50"
    
    return df.sort_values(by='Ward').reset_index(drop=True)
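
The dirty parser leans on splitting a vote code off the end of a token that the pdf conversion fused to a name. A standalone sketch of that step follows; `split_fused_vote` is a hypothetical copy of the idea in `vote_from_token_end`, and the fused tokens are invented examples.

```python
from typing import Optional, Tuple

VOTE_TOKENS = ['Y', 'N', 'A', 'NV', 'V', 'E', 'R', 'Recused', '-']


def split_fused_vote(token: str) -> Tuple[Optional[str], str]:
    """Return (vote, remainder); vote is None if the token has no vote suffix."""
    if token == 'VACANCY':
        return None, token                # ends in 'Y' but is not a vote
    if token[-2:] in VOTE_TOKENS:         # check two-character codes such as 'NV' first
        return token[-2:], token[:-2]
    if token[-1:] in VOTE_TOKENS:         # slice, so an empty token does not raise
        return token[-1:], token[:-1]
    return None, token


for fused in ['MorenoY', 'BurkeNV', 'Dowell', 'VACANCY']:
    print(fused, '->', split_fused_vote(fused))
```

Checking the two-character suffix first matters: otherwise "NV" would be read as a "V" vote attached to a name ending in "N".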


In [4]:
# This cell finds the portions of data in the newer format -- 
# record indicator, title, and voting data -- and cleans and 
# wraps everything into one dictionary per vote.

def startswith_index(page, i_start, substr):
    for i, l in enumerate(page[i_start:]):
        if l.strip().lower().startswith(substr.lower()):
            return i + i_start

def select_chunks(page):
    chunks, starts, stops = {}, {}, {}
    end_index = None
    
    i_roll_call_page = startswith_index(page, 0, 'Roll Call Vote')
    if i_roll_call_page is None:  # index 0 is a valid match, so test for None explicitly
        return chunks, end_index
    
    starts['record'] = startswith_index(page, i_roll_call_page, 'Record')
    stops['record'] = starts['record'] + 1
    starts['title'] = startswith_index(page, starts['record'], 'Title')
    stops['title'] = startswith_index(page, starts['title'], 'Vote')
    starts['votes'] = startswith_index(page, stops['title'], 'Ward') + 1
    if page[starts['votes']].strip().startswith('Ward'):
        # Sometimes the table header splits onto two lines
        # E.g. roll_calls/20131211_roll_call_report.txt
        starts['votes'] += 1
    for i in range(starts['votes'], len(page)):
        if not page[i].strip() or page[i].strip()[0] not in string.digits:
            stops['votes'] = i
            break
    else:
        raise RuntimeError("Couldn't find the end of the vote block.")
    
    for name in starts:
        chunks[name] = page[starts[name]: stops[name]]
        
    return chunks, stops['votes'] + 1

   
def clean_record(lines: List[str]) -> str:
    # The record chunk is a single line; the record number is its last token.
    if len(lines) != 1:
        raise ValueError('Expected one line!')
    return lines[0].split()[-1]
    #return lines[0][len('Record No.:'):].strip()

def clean_title(lines: List[str]) -> str:
    line = ' '.join(lines)
    tokens = line.split()
    return ' '.join(tokens[1:])
    #return line[len('Title/Description:'):].strip()

def clean_votes(lines: List[str]) -> pd.DataFrame:
    try:
        return parse_vote_block_clean(lines)
    except (ValueError, KeyError):
        return parse_vote_block_dirty(lines)

CLEAN = {'record': clean_record, 'title': clean_title, 'votes': clean_votes}


def read_page(file_name: str) -> List[str]:
    with open(file_name) as _fin:
        page = [l.strip() for l in _fin.readlines()]
    return page


def parse_file_with_blocks(file_name: str, date_str: str) -> List[Dict]:
    records = []
    page = read_page(file_name)
    index = 0
    while True:
        this_rec, next_index = select_chunks(page[index:])
        if this_rec:
            records.append({k: CLEAN[k](v) for k, v in this_rec.items()})
            records[-1]['date'] = date_str
            index += next_index
        else:
            break
    return records
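
The chunk selection above amounts to walking down the page from one label to the next. A minimal sketch on a made-up page: the layout is inferred from the parsing code, and the "Vote Date" line, the header text, and the helper `find_prefix` are assumptions for illustration only.

```python
import string
from typing import List, Optional


def find_prefix(lines: List[str], start: int, prefix: str) -> Optional[int]:
    """Index of the first line at or after ``start`` that begins with ``prefix``."""
    for i in range(start, len(lines)):
        if lines[i].strip().lower().startswith(prefix.lower()):
            return i
    return None


# Hypothetical page, laid out the way the parser expects
page = [
    'Roll Call Vote',
    'Record No.: O2014-500',
    'Title/Description: City of Chicago General Obligation',
    'and Refunding Bonds, Series 2014',
    'Vote Date: 2/5/2014',
    'Ward Alderman Vote Ward Alderman Vote',
    '1 Moreno Y 25 Solis Y',
    '2 Fioretti N 26 Maldonado Y',
    '',
]

i_record = find_prefix(page, 0, 'Record')
i_title = find_prefix(page, i_record, 'Title')
i_title_end = find_prefix(page, i_title, 'Vote')       # title runs until the "Vote" line
i_votes = find_prefix(page, i_title_end, 'Ward') + 1   # votes start after the header
# The vote block ends at the first blank line or line not starting with a digit
i_votes_end = next(i for i in range(i_votes, len(page))
                   if not page[i].strip() or page[i].strip()[0] not in string.digits)

chunks = {'record': page[i_record].split()[-1],
          'title': ' '.join(' '.join(page[i_title:i_title_end]).split()[1:]),
          'votes': page[i_votes:i_votes_end]}
print(chunks)
```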

In [5]:
# This cell parses voting records in the older style.

def select_vote_table(lines: List[str], votes_only: bool=False) -> List[str]:
    # Search for a unified table of votes
    i_start, i_stop = None, None
    for i_line, line in enumerate(lines):
        if ('issue:' in line.strip().lower() and 
                (lines[i_line + 1].strip().startswith('1st') or
                 lines[i_line + 2].strip().startswith('1st'))):
            i_start = i_line
        elif votes_only and line.strip().startswith('1st'):
            # We already have the header, so we only need the block of votes
            i_start = i_line
        elif i_start is not None and line.strip().startswith('50th'):
            i_stop = i_line + 1
            break
    else:
        return []
    return lines[i_start: i_stop]

def parse_table_line(line: str, n_issues: int) -> Tuple[str, str, str]:
    tokens = line.split()
    if len(tokens) < 3:
        return None
    ward = tokens.pop(0)[:-2]  # Remove "st", "nd", "th", etc.
    tokens.pop(0)  # Should be "Ward:"
    
    votes = []
    while tokens[-1] in VOTE_TOKENS:
        votes.append(tokens.pop(len(tokens) - 1))
    additional_vote, tokens[-1] = vote_from_token_end(tokens[-1])
    while additional_vote:
        votes.append(additional_vote)
        additional_vote, tokens[-1] = vote_from_token_end(tokens[-1])
    votes = votes[::-1]  # Invert list -- we processed from the end inward
        
    name = ' '.join(tokens)
    if name.lower() == 'vacant' and not votes:
        votes = n_issues * ['V']
    return [ward, name] + votes


def get_title(page: List[str], issue: str) -> str:
    if issue.startswith('ADJOURN'):
        return "Motion to adjourn"
    if issue.lower().startswith('case of'):
        return issue
    if issue == 'O2009-6230':
        # The title detail was miscoded in the pdfs as "O2010-6230"
        return ("Amendment of Chapter 2-32 of Municipal "
                "Code by addition of new Section 6267 establishing "
                "property tax relief program for qualified homeowners.")
    issue = issue.replace(';', ',')
    issue = issue.replace('(v', ' (v')
    for i_line, line in enumerate(page):
        if line.strip().startswith(issue):
            break
            
    possible_title_lines = [page[i_line]]
    for line in page[i_line+1: i_line+10]:
        line = line.strip()
        if line.lower().lstrip('(').strip().startswith('click here'):
            continue
        elif (not line or 
                line.lower().startswith("key: ") or 
                re.search(r'^[A-Z][a-zA-Z]?\d{4}-', line)):
            break
        possible_title_lines.append(line)
    blob = ' '.join(possible_title_lines)
    i_end = blob.lower().find('click here')
    title = blob[: i_end] if i_end > 0 else blob
    title = title[len(issue) + 1:].strip().lstrip('-').strip()
    title = title.rstrip('(').strip()
    return title


def parse_issues(lines: List[str]) -> List[str]:
    # Convert lines of text with issue record names into a list of names
    # Need to handle line breaks.
    # See e.g. roll_calls/20071113_roll_call_report.txt

    # Differently-broken:
    # roll_calls/20061115_roll_call_report.txt
    
    # Remove spaces from "Motion to Adjourn" to aid in processing
    lines[0] = lines[0].replace('Motion to Adjourn', 'ADJOURN')
    
    if len(lines) == 1:
        return ''.join([l.strip().replace('--', '-') for l in lines]).split()[1:]
    if len(lines) == 2:
        if ';' in lines[0]:
            # str.replace returns a new string, so the result must be assigned
            lines[0] = lines[0].replace('; ', ';')
        elif Counter(lines[0])['-'] > len(lines[0].split()) - 1:
            lines[0] = lines[0].replace('-', '- ').replace(' - ', ' ')
        lines[1] = lines[1].replace(' (', '(')
        tokens1 = lines[0].split()[1:]  # Remove "Issue:"
        tokens2 = lines[1].split()
        if len(tokens2) == 1:
            return ''.join([l.strip().replace('--', '-') for l in lines]).split()[1:]
        elif len(tokens2) == len(tokens1):
            return [''.join(pair).replace('--', '-') for pair in zip(tokens1, tokens2)]
    raise RuntimeError('Unable to parse issues: \n{}'.format(lines))

        
def parse_file_with_table(file_name: str, date_str: str, issues: List[str]=None) -> List[Dict]:
    records = []
    page = read_page(file_name)
    
    vote_table = select_vote_table(page, votes_only=(issues is not None))
    if not vote_table:
        return
    
    if not issues:
        i_start = [i for i, l in enumerate(vote_table) if l.startswith('1st')][0]
        issues = parse_issues(vote_table[: i_start])
    else:
        i_start = 0
    parsed_lines = [parse_table_line(l, len(issues)) for l in vote_table[i_start:] if len(l.split()) >= 3]
    df = pd.DataFrame.from_records(parsed_lines, columns=['Ward', 'Alderman'] + issues)
    df['Ward'] = df['Ward'].astype(int)
    df['Alderman'] = df['Alderman'].apply(standardize_name)
    
    assert len(df) == 50, "There should be 50 wards"
    assert (np.sort(df['Ward'].values) == np.arange(1, 51)).all(),\
        "Wards should be 1-50"
    
    for issue in issues:
        record = {'record': issue,
                  'title': get_title(page, issue),
                  'date': date_str,
                  'votes': df[['Alderman', issue, 'Ward']].rename(columns={issue: 'Vote'})}
        records.append(record)
    return records
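
For the older "vote table" format, each row carries an ordinal ward, the label "Ward:", a name, and one vote per issue. A simplified sketch of parsing one such row, assuming the votes are cleanly separated by whitespace; `parse_table_row` and the sample rows are hypothetical, and the real `parse_table_line` also handles votes fused onto the name.

```python
import re
from typing import List

VOTE_TOKENS = {'Y', 'N', 'A', 'NV', 'V', 'E', 'R', '-'}


def parse_table_row(line: str) -> List[str]:
    """Turn '<ordinal> Ward: <name> <vote> ...' into [ward, name, vote, ...]."""
    tokens = line.split()
    ward = re.match(r'\d+', tokens[0]).group()   # '21st' -> '21'
    rest = tokens[2:]                            # skip the 'Ward:' label
    votes = []
    while rest and rest[-1] in VOTE_TOKENS:      # peel votes off the right-hand end
        votes.append(rest.pop())
    return [ward, ' '.join(rest)] + votes[::-1]  # restore left-to-right vote order


# Made-up rows: the names appear in the data, the votes are invented
print(parse_table_row('1st Ward: Manuel Flores Y N NV'))
print(parse_table_row('28th Ward: Ed H. Smith N'))
```

Working inward from the right-hand end avoids having to know how many tokens the name contains.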

In [6]:
# Hand-inspected records, verified to have no votes that day (attendance only)
known_empty = ['/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20110309_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20110413_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20110504_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20110518_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20110706_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20111005_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20111012_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20111102_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20111109_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20111214_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20120314_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20120509_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20121003_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20130117_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20130213_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20130313_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20130717_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20131016_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20140402_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20141210_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20150121_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20150415_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20150506_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20150520_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20151014_roll_call_report.txt',
 '/Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20151021_roll_call_report.txt']

# There are oddities in these files that keep them from being parsed easily.
# Hand-code the issue headers. (These are for old-style formatted documents only.)
issues_for_date = {'2009-10-07': ['SO2009-5597','SO2009-5542', 
                                  'PO2009-4114: Motion to lay on table', 
                                  'PO2009-4114: Motion to Re-refer'],
                   '2006-11-19': ['O2008-6775', 'O2008-6776', 'O2008-6782', 'O2008-6778', 'SO2008-6777'],
                   '2010-09-08': ['Case of Gary Kamen v. City of Chicago.'],
                   '2011-02-09': ['SO2010-7086; O2010-6824'],
                   '2012-04-24': ['Table Fioretti Amendment', 'Table Waguespack Substitute', 'SO2012-1366']}

In [7]:
# Use GhostScript to convert all of the pdfs into text files,
# then parse the text files.
cmd = 'gs -sDEVICE=txtwrite -o {output} {input}'

vote_records = []
empty = []
unknown = []
success = []
fail = []
val = None
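# Look in WORKING_DIR (where the pdfs were downloaded) and sort for a stable order.
pdfs = sorted(glob(os.path.join(WORKING_DIR, 'roll_calls', '*.pdf')))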
for i_name, fname in enumerate(pdfs):
    out_fname = os.path.splitext(fname)[0] + '.txt'
    if not os.path.exists(out_fname):
        # Only reprocess the pdfs if the outputs don't exist already.
        retval = subprocess.run(cmd.format(output=out_fname, input=fname),
                                shell=True, check=False,
                                stderr=subprocess.PIPE, stdout=subprocess.PIPE)
    #print(retval.stdout, retval.stderr)
    print("Parsing {} from {}".format(fname, out_fname))
    try:
        with open(out_fname) as _fin:
            full_page = _fin.read()
            
        # Many documents have only attendance and no votes. Skip those.
        if (out_fname in known_empty or
                'There were no divided roll call votes' in full_page or
                'There was no divided roll call' in full_page):
            empty.append(out_fname)
            continue
            
        # First try to parse the document as if it were in the new-style format.
        yyyymmdd = os.path.basename(out_fname).split('_')[0]
        date_str = "{}-{}-{}".format(yyyymmdd[:4], yyyymmdd[4:6], yyyymmdd[6:])
        parsed = parse_file_with_blocks(out_fname, date_str=date_str)
        if not parsed:
            # If that didn't work, try again with the older format.
            parsed = parse_file_with_table(out_fname, date_str=date_str, 
                                           issues=issues_for_date.get(date_str))
            
        # Record success or failure.
        if not parsed:
            unknown.append(out_fname)
        else:
            success.append(out_fname)
            vote_records.extend(parsed)
    except Exception as exc:
        fail.append((out_fname, exc))
        print(exc)

Parsing /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20060524_roll_call_report.pdf from /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20060524_roll_call_report.txt
Parsing /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20060628_roll_call_report.pdf from /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20060628_roll_call_report.txt
Parsing /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20060726_roll_call_report.pdf from /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20060726_roll_call_report.txt
Parsing /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20060913_roll_call_report.pdf from /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20060913_roll_call_report.txt
Parsing /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20061004_roll_call_report.pdf from /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/20061004_roll_call_report.txt
Parsing /Users/shoover/projects/d4d/chicago-lobbyists/roll_calls/
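
The conversion cell formats a shell string, which would break on paths containing spaces. A minimal sketch of the same GhostScript call built as an argument list, assuming `gs` is on the PATH; the flags are the ones used in the cell above, while the helper names `gs_text_command` and `convert_pdf` are hypothetical.

```python
import subprocess
from typing import List


def gs_text_command(pdf_path: str, txt_path: str) -> List[str]:
    """Build the GhostScript text-extraction call as an argument list."""
    return ['gs', '-sDEVICE=txtwrite', '-o', txt_path, pdf_path]


def convert_pdf(pdf_path: str, txt_path: str) -> bool:
    """Run GhostScript; return True if it exited with status 0."""
    result = subprocess.run(gs_text_command(pdf_path, txt_path),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode == 0


# Paths with spaces stay intact because no shell parsing is involved
print(gs_text_command('roll calls/20060524.pdf', 'roll calls/20060524.txt'))
```

Passing a list without `shell=True` hands each path to GhostScript as a single argument, so no quoting is needed.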

In [8]:
print('{} vote records successfully parsed from {} files.'.format(len(vote_records), len(success)))
print(len(empty), ' files have no votes.')
print(len(fail), ' files had an error in parsing.')
print(len(unknown), ' files I don\'t know how to parse.')

217 vote records successfully parsed from 90 files.
58  files have no votes.
0  files had an error in parsing.
0  files I don't know how to parse.


In [9]:
def read_date(date):
    return read_page(os.path.join(WORKING_DIR, 'roll_calls', '{}_roll_call_report.txt'.format(date.replace('-', ''))))

In [10]:
fail

[]

In [11]:
vote_records[130]

{'date': '2014-02-05',
 'record': 'O2014-500',
 'title': 'City of Chicago General Obligation and Refunding Bonds, Series 2014 and amend Chapter 2-32 of Municipal Code of Chicago concerning debt management policies',
 'votes':        Alderman Vote  Ward
 0        Moreno    Y     1
 1      Fioretti    N     2
 2        Dowell    Y     3
 3         Burns    Y     4
 4      Hairston    Y     5
 5        Sawyer    Y     6
 6        Holmes    Y     7
 7        Harris    Y     8
 8         Beale    Y     9
 9          Pope    Y    10
 10       Balcer    Y    11
 11     Cárdenas    Y    12
 12        Quinn    Y    13
 13        Burke    R    14
 14      Foulkes    Y    15
 15     Thompson    Y    16
 16       Thomas    Y    17
 17         Lane    Y    18
 18       O’Shea    Y    19
 19      Cochran    Y    20
 20     Brookins    Y    21
 21        Muñoz    Y    22
 22     Zalewski    Y    23
 23     Chandler    Y    24
 24        Solis    Y    25
 25    Maldonado    Y    26
 26      Burnett   

In [12]:
# Inspect pages
read_page(empty[4])

['Attendance and Divided Roll Call                               Close',
 'Vote',
 'Attendance for the February 7th, 2007 Meeting of the Chicago City',
 'Council',
 'Present - The Honorable Richard M. Daley, Mayor, and Aldermen Flores, Haithcock,',
 'Tillman, Preckwinkle, Hairston, Lyle, Beavers, Stroger, Beale, Pope, Balcer, Cardenas,',
 'Olivo, Burke, T. Thomas, Coleman, L.Thomas, Murphy, Rugai, Troutman, Brookins,',
 'Munoz, Zalewski, Chandler, Solis, Ocasio, Burnett, E. Smith, Carothers, Reboyras,',
 "Suarez, Matlak, Mell, Austin, Colon, Banks, Mitts, Allen, Laurino, O'Connor, Doherty,",
 'Natarus, Daley, Tunney, Levar, Shiller, Schulter, M. Smith, Moore, Stone.',
 'Absent - None.',
 'Divided Roll Call Voting February 7th, 2007 Meeting of the Chicago City',
 'Council',
 'There were no divided roll call votes in the February 7th, 2007 meeting of the Chicago City',
 'Council.']

In [None]:
# Error checking / debugging

fname = fail[0][0]
print(fname)
page = read_page(fname)
votes = select_chunks(page)
print(votes[1])
parse_vote_block_dirty(votes[0]['votes'])
#parse_file_with_table(fname, 'yyyy')

# Output Data

Write the data to disk!

In [13]:
df_records_list = []
titles = []
for record in vote_records:
    _df = record['votes'].copy()
    _df['Date'] = record['date']
    _df['Record'] = record['record']
    df_records_list.append(_df)
    titles.append([record['date'], record['record'], record['title']])
    
df_records = pd.concat(df_records_list)
df_titles = pd.DataFrame(titles, columns=['Date', 'Record', 'Title'])

In [14]:
df_records.to_csv(os.path.join(WORKING_DIR, 'alderman_votes.csv'), index=False)
df_titles.to_csv(os.path.join(WORKING_DIR, 'legislation_titles.csv'), index=False)
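
The two files share the `Date` and `Record` columns, so titles can be attached back onto the votes when needed. A minimal sketch on toy frames shaped like `df_records` and `df_titles`; the rows are abbreviated examples, and the `validate` argument is an optional extra check I'm assuming is wanted.

```python
import pandas as pd

# Toy stand-ins for alderman_votes.csv and legislation_titles.csv
votes = pd.DataFrame({'Alderman': ['Moreno', 'Fioretti'],
                      'Vote': ['Y', 'N'],
                      'Ward': [1, 2],
                      'Date': ['2014-02-05', '2014-02-05'],
                      'Record': ['O2014-500', 'O2014-500']})
titles = pd.DataFrame({'Date': ['2014-02-05'],
                       'Record': ['O2014-500'],
                       'Title': ['General Obligation and Refunding Bonds, Series 2014']})

# many_to_one raises MergeError if a (Date, Record) pair has more than one title
merged = votes.merge(titles, on=['Date', 'Record'], how='left',
                     validate='many_to_one')
print(merged[['Ward', 'Alderman', 'Vote', 'Title']])
```

Joining on both columns matters because some record numbers appear on more than one meeting date.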

# Quality Checks


In [15]:
from collections import Counter
Counter(df_records.Ward)

Counter({1: 217,
         2: 217,
         3: 217,
         4: 217,
         5: 217,
         6: 217,
         7: 217,
         8: 217,
         9: 217,
         10: 217,
         11: 217,
         12: 217,
         13: 217,
         14: 217,
         15: 217,
         16: 217,
         17: 217,
         18: 217,
         19: 217,
         20: 217,
         21: 217,
         22: 217,
         23: 217,
         24: 217,
         25: 217,
         26: 217,
         27: 217,
         28: 217,
         29: 217,
         30: 217,
         31: 217,
         32: 217,
         33: 217,
         34: 217,
         35: 217,
         36: 217,
         37: 217,
         38: 217,
         39: 217,
         40: 217,
         41: 217,
         42: 217,
         43: 217,
         44: 217,
         45: 217,
         46: 217,
         47: 217,
         48: 217,
         49: 217,
         50: 217})

In [16]:
df_records.Alderman.unique()

array(['Manuel Flores', 'Madeline L. Haithcock', 'Dorothy J. Tillman',
       'Toni Preckwinkle', 'Leslie Hairston', 'Freddrenna Lyle',
       'William M. Beavers', 'Todd Stroger', 'Anthony Beale',
       'John A. Pope', 'James A. Balcer', 'George A. Cardenas',
       'Frank J. Olivo', 'Edward M. Burke', 'Theodore Thomas',
       'Shirley A. Coleman', 'LaTasha R. Thomas', 'Thomas W. Murphy',
       'Virginia A. Rugai', 'Arenda Troutman', 'Howard Brookins Jr.',
       'Ricardo Munoz', 'Michael R. Zalewski', 'Michael D. Chandler',
       'Daniel S. Solis', 'Billy Ocasio', 'Walter Burnett, Jr.',
       'Ed H. Smith', 'Isaac S. Carothers', 'Ariel E. Reboyras',
       'Ray Suarez', 'Ted Matlak', 'Richard F. Mell', 'Carrie M. Austin',
       'Rey Colon', 'William J.P. Banks', 'Emma Mitts', 'Thomas R. Allen',
       'Margaret Laurino', "Patrick J. O'Connor", 'Brian G. Doherty',
       'Burton F. Natarus', 'Vi Daley', 'Thomas Tunney',
       'Patrick J. Levar', 'Helen Shiller', 'Eugene C. Schu

In [17]:
df_records.Vote.unique()

array(['N', 'Y', 'NV', 'A', 'E', 'V', '-', 'R'], dtype=object)

In [18]:
print(df_records.Record.nunique())
df_records.Record.unique()

211


array(['SO2006-3086', 'SO2006-2971', 'FL2006-14', 'SO2006-3936',
       'F2006-230', 'A2006-103', 'SO2006-3966', 'O2006-4519', 'O2006-4522',
       'OR2006-1552', 'O2006-4891', 'O2006-4893', 'O2006-4894',
       'SO2006-5062', 'O2008-6775', 'O2008-6776', 'O2008-6782',
       'O2008-6778', 'SO2008-6777', 'O2006-5187', 'S02007-464',
       'O2007-1402', 'R2007-1058', 'O2007-3106', 'OR2007-1009',
       'SO2007-3954', 'SO2007-5794', 'O2007-5503', 'SO2007-5799',
       'SO2007-5800', 'SO2007-5802', 'SO2007-5801', 'A2008-2',
       'SO2008-537', 'O2008-538', 'O2008-2041', 'S02008-2623',
       'SO2008-4315', 'SO2008-4302', 'O2008-6777', 'O2008-6774',
       'O2008-7304', 'Or2009-1', 'SO2009-1115', 'SO2009-2414',
       'O2009-2415', 'SO2009-3410', 'SO2009-5597', 'SO2009-5542',
       'PO2009-4114: Motion to lay on table',
       'PO2009-4114: Motion to Re-refer', 'R2009-1210', 'O2009-6224',
       'O2009-6223', 'O2009-6230', 'SO2010-163', 'ADJOURN#1', 'ADJOURN#2',
       'PO2010-1842', 'SO2
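
A few identifiers in this list look like transcription slips in the source PDFs: `'S02007-464'` and `'S02008-2623'` appear to use a zero where other records use the letter O, and `'Or2009-1'` has a lowercase `r` where other records use `OR`. The sketch below shows one way to normalize them, assuming those readings are correct. `normalize_record_id` is a hypothetical helper that is not in the original notebook, and it leaves anything that does not look like `PREFIX + year + '-' + number` untouched.

```python
import re

# Letters (optionally followed by one stray zero), then a 4-digit year,
# a dash, and a number. Anything else is left alone.
_RECORD_ID = re.compile(r'^([A-Za-z]+0?)(\d{4}-\d+)$')


def normalize_record_id(record: str) -> str:
    """Upper-case the prefix and turn a stray zero in it into the letter O."""
    match = _RECORD_ID.match(record)
    if match is None:
        return record  # e.g. 'ADJOURN#1' or 'PO2009-4114: Motion to Re-refer'
    prefix, rest = match.groups()
    return prefix.upper().replace('0', 'O') + rest


for raw in ['S02007-464', 'Or2009-1', 'SO2006-3086', 'ADJOURN#1']:
    print(raw, '->', normalize_record_id(raw))
```

Applying it would be a one-liner such as `df_records.Record.map(normalize_record_id)`, but it is worth checking each changed identifier against the clerk's PDFs first, since the correction is a guess.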

In [19]:
Counter(df_records.Record)

Counter({'A2006-103': 50,
         'A2008-2': 50,
         'A2011-176': 50,
         'A2011-56': 50,
         'A2013-92': 50,
         'A2016-15': 50,
         'ADJOURN#1': 50,
         'ADJOURN#2': 50,
         'Case of Gary Kamen v. City of Chicago.': 50,
         'F2006-230': 50,
         'FL2006-14': 50,
         'O2006-4519': 50,
         'O2006-4522': 50,
         'O2006-4891': 50,
         'O2006-4893': 50,
         'O2006-4894': 50,
         'O2006-5187': 50,
         'O2007-1402': 50,
         'O2007-3106': 50,
         'O2007-5503': 50,
         'O2008-2041': 50,
         'O2008-538': 50,
         'O2008-6774': 50,
         'O2008-6775': 100,
         'O2008-6776': 100,
         'O2008-6777': 50,
         'O2008-6778': 100,
         'O2008-6782': 50,
         'O2008-7304': 50,
         'O2009-2415': 50,
         'O2009-6223': 50,
         'O2009-6224': 50,
         'O2009-6230': 50,
         'O2010-3644': 50,
         'O2010-4213': 150,
         'O2010-5922': 50,
         'O2

In [20]:
df_records[df_records.Record == 'O2010-4213'].Date.unique()

array(['2010-10-06', '2010-11-03', '2011-01-13'], dtype=object)
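
The counts of 100 and 150 above come from records that were voted on at more than one meeting, as `O2010-4213` shows. Rather than checking records one at a time, a grouped count can list all of them at once. This is a sketch under the assumption that the table has `Record` and `Date` columns; `records_on_multiple_dates` is a hypothetical helper, and the `EXAMPLE-1` row in the sample is made up.

```python
import pandas as pd


def records_on_multiple_dates(df: pd.DataFrame) -> pd.Series:
    """Return the number of distinct meeting dates for each record
    that appears on more than one date."""
    n_dates = df.groupby('Record')['Date'].nunique()
    return n_dates[n_dates > 1].sort_values(ascending=False)


# Small stand-in for df_records: the O2010-4213 dates are the ones shown
# above; EXAMPLE-1 is a made-up record that appears on a single date.
sample_records = pd.DataFrame({
    'Record': ['O2010-4213'] * 4 + ['EXAMPLE-1'] * 2,
    'Date': ['2010-10-06', '2010-10-06', '2010-11-03', '2011-01-13',
             '2010-10-06', '2010-10-06'],
})

print(records_on_multiple_dates(sample_records))
```

Because a record identifier alone is not unique, any later join between votes and titles probably needs to key on both `Record` and `Date`.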

In [21]:
read_date('2012-01-18')

['Attendance and Divided Roll Call                                     Close',
 'Vote',
 'Attendance for the January 18, 2012 Meeting of the Chicago City',
 'Council',
 'Present - Aldermen Moreno, Fioretti, Dowell, Burns, Hairston, Sawyer, Jackson,',
 'Harris, Beale, Pope, Balcer, Cárdenas, Quinn, Burke, Foulkes, Thompson,',
 "Thomas, Lane, O'Shea, Cochran, Brookins, Muñoz, Zalewski, Chandler, Solis,",
 'Maldonado, Burnett, Ervin, Graham, Reboyras, Suarez, Waguespack, Mell, Austin,',
 "Colón, Sposato, Mitts, Cullerton, Laurino, P. O'Connor, Reilly, Smith, Tunney,",
 '.',
 'Arena, Cappleman, Pawar, Osterman, Moore, Silverstein -- 49',
 'Absent -- Alderman M. O’Connor-- 1',
 'Divided Roll Call Vote January 18, 2012 Meeting of the Chicago City',
 'Council',
 'There was five divided roll call votes for the January 18, 2012 meeting of the',
 'Chicago City Council.',
 'SO2011-9743: Amendment of Sections 2-84-053 and 10-36-110 of Municipal',
 'Code to Authorize Execution of Agreements with Pu

In [22]:
df_titles[df_titles.Title.str.contains('Amendment of Title 2, Chapter 8 of Municipal Code Concerning')].Title.values

array([ 'Amendment of Title 2, Chapter 8 of Municipal Code Concerning Redistricting of City Wards'], dtype=object)

In [23]:
df_titles[df_titles.Title == '']

| | Date | Record | Title |
|---|---|---|---|
| 52 | 2009-10-07 | PO2009-4114: Motion to lay on table | |
| 53 | 2009-10-07 | PO2009-4114: Motion to Re-refer | |
| 74 | 2010-11-03 | O2010-4213 | |
| 80 | 2011-01-13 | O2010-4213 | |
| 88 | 2012-01-18 | O2011-9744 | |
| 89 | 2012-01-18 | SO2011-9778 | |
| 90 | 2012-01-18 | SO2011-6726 | |
| 91 | 2012-01-18 | SO2011-9742 | |
| 93 | 2012-02-15 | O2012-53 | |
| 97 | 2012-04-24 | Table Fioretti Amendment | |
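
The `Title == ''` filter only catches exact empty strings. If the titles are ever reloaded from CSV, blanks may come back as `NaN` or as whitespace instead. The sketch below is a slightly more defensive version of the same check. `missing_titles` is a hypothetical helper, and in the sample only the `O2010-4213` dates come from the output above; the titles and the whitespace-only entry are made up.

```python
import pandas as pd


def missing_titles(df_titles: pd.DataFrame) -> pd.DataFrame:
    """Return Date and Record for rows whose Title is empty, whitespace, or missing."""
    is_blank = df_titles['Title'].fillna('').str.strip() == ''
    return df_titles.loc[is_blank, ['Date', 'Record']]


# Stand-in for df_titles with one titled row and three kinds of blank.
sample_titles = pd.DataFrame({
    'Date': ['2010-10-06', '2010-11-03', '2011-01-13', '2012-02-15'],
    'Record': ['O2010-4213', 'O2010-4213', 'O2010-4213', 'O2012-53'],
    'Title': ['Placeholder title (made up)', '', None, '   '],
})

print(missing_titles(sample_titles))
```

The resulting `(Date, Record)` pairs are the ones whose titles would need to be filled in by hand or recovered from the `read_date` text.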


In [24]:
df_titles[df_titles.Date == '2011-09-08'].Title.values

array([ 'An order authorizing to enter into and execute a settlement order for Connie Coleman, as special administrator of the Estate of John Coleman Jr. Deceased v. City of Chicago, et al.',
       'An ordinance to amend Chapter 7-36 of the Municipal Code of Chicago regarding crib bumper pads.'], dtype=object)

In [25]:
df_titles.Title.values

array([ 'Amendment of Title 2, Chapter 8, Section 041 of Municipal Code of Chicago by Establishment of New Aldermanic Compensation Schedule. ( entire text of legislation )',
       'Amendment of Title 4 of Municipal Code of Chicago by Creation of New Chapter 404 Entitled "Large Retailers".',
       'Failed To Pass  -- Amendment of Title 4, Chapter 208 of Municipal Code of Chicago by Addition of New Section 077 Requiring Notification of Guests Concerning Work Stoppage.',
       'Rejection of Bids for purchase of City-Owned property at 6238 South Kimbark Avenue.',
       'Mayoral Veto of Ordinance adding New Chapter 4-404 of Municipal Code of Chicago entitled "Large Retailers".',
       'Appointment of Mr. Peter M. Holsten as Member of Uptown Commission (Special Service Area Number 34).',
       'Amendment of Title 4, Chapter 144 and Title 8, Chapter 24, Section 040 of Municipal Code of Chicago to Prohibit Sale, Transfer or Discharging of Replica Air Guns within City of Chicago.',
      