# Scraping Wikipedia: Bullet Points

We split the scraping into two tasks: scraping tables and scraping bullet points. This Jupyter notebook walks you through Libby's bullet-scraping process.

## Ultimate Goal
Compile a database of all the aldermen who served between 2012 and 2023. 

## Steps
This notebook is broken into two parts. In Part 1, I grab all the bullet points from the Wikipedia list page. In Part 2, I scrape the pages of the people who have profile links.

# Part 1

In [200]:
# Imports
import lxml.html
import httpx 
import pandas as pd
import re
import json

In [201]:
# Define key elements
url = "https://en.wikipedia.org/wiki/List_of_Chicago_alderpersons_since_1923"
url_text = httpx.get(url)
root = lxml.html.fromstring(url_text.text)

In [202]:
def compile_aldermen_for_wards(ul_element, ward_number, aldermen_dict):
    """
    Gather the aldermen for a ward whose aldermen are listed as bullet points.
    
    inputs:
        ul_element: element known to contain bullet points
        ward_number: str holding the ward's number
        aldermen_dict: dictionary keyed by ward number; each value is a list 
        of (name, link) tuples, where link is the alderman's profile URL or 
        None if no usable link was found
    
    outputs: 
        aldermen_dict: the same dictionary, with an entry added for 
        ward_number
    
    """
    ward_aldermen = []
    bullet_points = ul_element.cssselect("li")
    for bullet in bullet_points:
        links = bullet.cssselect('a')
        profile_link = None        
        if links is not None:
            for element in links:
                # Wikipedia will often link to a file or its bibliography. I 
                # only want a link if it has the person's biography there
                href = element.get('href')
                if re.search("File", str(href), re.IGNORECASE) or \
                    re.search("cite_note", str(href), re.IGNORECASE):
                    continue
                else:
                    profile_link = "https://en.wikipedia.org/" + href
                    break
        # Now I build a (name, link) tuple for the individual and add it to a 
        # list. That list then becomes the value in the final dictionary, 
        # where the key is the ward
        
        # clean name courtesy of: https://stackoverflow.com/questions/640001/how-can-i-remove-text-within-parentheses-with-a-regex
        clean_name = re.sub(r'\([^)]*\)', '', bullet.text_content())
        person_tuple = (str.strip(clean_name), profile_link)
        ward_aldermen.append(person_tuple)
    
    aldermen_dict[ward_number] = ward_aldermen
    
    return aldermen_dict
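
The output below shows two side effects of building links by string concatenation: a doubled slash after the domain, and "red links" that point at an edit form for an article that does not exist (for example the A. A. Rayner Jr. entry, which later comes back as a bad link). Here is a minimal sketch of one way to handle both, assuming the standard library's `urllib.parse`. The helper name `build_profile_link` and the constant `BASE_URL` are hypothetical and are not part of the notebook.

```python
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = "https://en.wikipedia.org/"  # hypothetical constant for this sketch


def build_profile_link(href):
    """Return an absolute article URL, or None for links we want to skip."""
    if not href:
        return None
    # Skip file pages and citation anchors, as the notebook already does
    if "file" in href.lower() or "cite_note" in href.lower():
        return None
    # Red links carry redlink=1 and lead to an edit form, not a biography
    query = parse_qs(urlparse(href).query)
    if query.get("redlink") == ["1"]:
        return None
    # urljoin resolves the path against the base without doubling the slash
    return urljoin(BASE_URL, href)


print(build_profile_link("/wiki/Leon_Despres"))
```

Swapping this in for the concatenation would change the links stored in `aldermen_dict`, so the recorded output in this notebook would no longer match exactly.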
    

In [203]:
aldermen_dict = {}

body_content = root.cssselect('div.mw-body-content')[0]
heading = body_content.cssselect('div.mw-heading.mw-heading3')
list_elements = body_content.cssselect('ul')

for ward_number in heading:
    num = re.findall("\\d+", ward_number.text_content())
    ward = num[0]
    
    # Each part of the page is structured a bit differently. 
    # Sometimes, a ul element is found immediately after the heading. Other 
    # times, there are figures or paragraphs after the heading and before the 
    # list of aldermen. For that reason, I grab the next five elements.
    
    siblings = []
    current = ward_number
    for _ in range(5):
        current = current.getnext()
        # Stop early if the heading has fewer than five following siblings
        if current is None:
            break
        siblings.append(current)
    
    for element in siblings:
        if element.tag == "table":
            # if there's a table associated with this ward, I don't care about
            # it, so I move on to the next ward
            break
        elif element.tag == "ul":
            aldermen_dict = \
                compile_aldermen_for_wards(element, ward, aldermen_dict)
            break

aldermen_dict
        
    

{'5': [('Charles S. Eaton', None),
  ('Leonard J. Grossman', None),
  ('Charles S. Eaton', None),
  ('Irving J. Schreiber', None),
  ('James J. Cusack Jr.', None),
  ('Paul Howard Douglas',
   'https://en.wikipedia.org//wiki/Paul_Howard_Douglas'),
  ('Bertram B. Moss', None),
  ('Robert E. Merriam', 'https://en.wikipedia.org//wiki/Robert_E._Merriam'),
  ('Leon Despres', 'https://en.wikipedia.org//wiki/Leon_Despres'),
  ('Ross Lathrop', None),
  ('Lawrence Bloom', None),
  ('Barbara Holt', None),
  ('Leslie Hairston', 'https://en.wikipedia.org//wiki/Leslie_Hairston')],
 '6': [('Guy Guernsey', None),
  ('John F. Healy', None),
  ('Patrick Sheridan Smith', None),
  ('Francis J. Hogan', None),
  ('David R. Muir', None),
  ('Sydney A. Jones Jr.', None),
  ('Robert H. Miller', None),
  ('A. A. Rayner Jr.',
   'https://en.wikipedia.org//w/index.php?title=A._A._Rayner_Jr.&action=edit&redlink=1'),
  ('Eugene Sawyer', 'https://en.wikipedia.org//wiki/Eugene_Sawyer'),
  ('John O. Steele',
   'http

# Part 2
Great! Now that I have my dictionary, I find the start and end dates for each person who has a profile link. I am particularly interested in the years 2012-2023, so I won't bother to get data for everyone.
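
The infobox text that holds these dates comes with labels ("In office", "Assumed office"), parenthetical notes, bracketed footnote markers, and non-breaking spaces. Here is a minimal sketch of that cleanup as a single helper. The name `clean_office_dates` and the sample string are made up for illustration, and the final whitespace-collapsing step is an addition that the notebook's own code does not perform.

```python
import re


def clean_office_dates(raw_text):
    """Strip labels, parentheticals, footnote markers and odd spacing."""
    text = re.sub(r"In office|Assumed office", "", raw_text)
    text = re.sub(r"\([^)]*\)", " ", text)   # parenthetical notes
    text = re.sub(r"\[[^\]]*\]", " ", text)  # footnote markers such as [2]
    text = text.replace("\xa0", " ")         # non-breaking spaces
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    return text.strip()


# Hypothetical raw infobox text, not copied from a real page
sample = "In office\nMay 1999[2]\xa0– May 15, 2023 (retired)"
print(clean_office_dates(sample))  # May 1999 – May 15, 2023
```

Collapsing whitespace would also tidy values such as `1967  –1976`, which keep a double space in the recorded output.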

In [None]:
aldermen_dates_dict = {}
for ward, aldermen_and_links in aldermen_dict.items():
    all_info = []
    for tup in aldermen_and_links:
        alderperson = tup[0]
        link = tup[1]
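        # dates starts as a status label ("link exists", "Unknown", or an 
        # ERROR value) so that I can tell afterwards what happened for each 
        # person; it is overwritten when real dates are found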
        if link is not None:
            dates = "link exists"
            resp = httpx.get(link)
            if resp.status_code == 200:
                root = lxml.html.fromstring(resp.text)
                # This is the box that has the in office dates, but sometimes 
                # it doesn't exist
                info_box = root.cssselect("table.infobox.vcard")
                if len(info_box) == 1:
                    info_elements = info_box[0].cssselect('th.infobox-header')
                    for element in info_elements:
                        # we want to find the row header that is associated 
                        # with being a Chicago alderperson
                        if re.search("ward", element.text_content(), re.IGNORECASE) \
                            or re.search("alder", element.text_content(), \
                                re.IGNORECASE) or re.search("Chicago City Council", \
                                    element.text_content(), re.IGNORECASE):
                            # We want the table row this header sits in 
                            # rather than the header cell itself
                            row_parent = element.getparent()
                            # The "office" row we want is one of the next two rows
                            header_option1 = row_parent.getnext()
                            header_option2 = (header_option1.getnext()
                                              if header_option1 is not None else None)
                            dates = "unknown from link"
                            for option in (header_option1, header_option2):
                                if option is None:
                                    continue
                                if re.search("office", option.text_content(), re.IGNORECASE):
                                    dates_raw = option.text_content()
                                    dates_raw = re.sub("In office", "", dates_raw)
                                    dates_raw = re.sub("Assumed office", "", dates_raw)
                                    # parenthesis removal courtesy of: https://stackoverflow.com/questions/640001/how-can-i-remove-text-within-parentheses-with-a-regex
                                    dates_raw = re.sub(r'\([^)]*\)', ' ', dates_raw)
                                    dates_raw = re.sub(r'\xa0', ' ', dates_raw)
                                    # drop bracketed footnote markers such as [1]
                                    dates = str.strip(re.sub(r'\[[^\]]*\]', ' ', dates_raw))
                                    break
                            break
                else:
                    dates = "ERROR, BOX ELEMENT"
                
            else:
                dates = "ERROR, BAD LINK"       
        else:
            dates = "Unknown"
            
        person_tup = (alderperson, dates)
        all_info.append(person_tup)
        
    aldermen_dates_dict[ward] = all_info
        
        

1939–1942
1955–1975
May 1999 – May 15, 2023
February 28, 1971  – December 2, 1987
February 8, 1998 – May 15, 2011
May 16, 2011 – May 15, 2023
1983–2006
May 21, 2007 – January 15, 2013
February 13, 2013 – May 18, 2015
May 18, 2015
April 1923 – December 1930
1967  –1976
September 5, 2001 – December 4, 2006
December 13, 2006
May 1987 – December 1998
May 1999
1927–1942
1971–1987
May 1999   – May 2015
May 18, 2015 – May 15, 2023
June 12, 1969 – June 7, 1977
1997–2015
May 18, 2015 – February 14, 2022
March 28, 2022
2003–2022
December 14, 2022 – May 15, 2023
May 15, 2023
1994-2011
May 2011
2015–2019
May 18, 2015
February 1983 – January 27, 1991
February 1983 – January 27, 1991
January 27, 1991 – May 30, 2007
2015–2019
May 20, 2019
2000–2015
May 18, 2015
May 18, 2015
April 1990 – February 2007
2007–2019
May 20, 2019
1971–1987
1971–1979
May 2003 – May 15, 2023
March 25, 1986 – January 1, 1993
January 1993   – May 20, 2019
May 20, 2019
1975–1983
May 1995   – May 31, 2018
June 28, 2018
April 8, 1

In [205]:
aldermen_dates_dict

{'5': [('Charles S. Eaton', 'Unknown'),
  ('Leonard J. Grossman', 'Unknown'),
  ('Charles S. Eaton', 'Unknown'),
  ('Irving J. Schreiber', 'Unknown'),
  ('James J. Cusack Jr.', 'Unknown'),
  ('Paul Howard Douglas', '1939–1942'),
  ('Bertram B. Moss', 'Unknown'),
  ('Robert E. Merriam', 'ERROR, BOX ELEMENT'),
  ('Leon Despres', '1955–1975'),
  ('Ross Lathrop', 'Unknown'),
  ('Lawrence Bloom', 'Unknown'),
  ('Barbara Holt', 'Unknown'),
  ('Leslie Hairston', 'May 1999 – May 15, 2023')],
 '6': [('Guy Guernsey', 'Unknown'),
  ('John F. Healy', 'Unknown'),
  ('Patrick Sheridan Smith', 'Unknown'),
  ('Francis J. Hogan', 'Unknown'),
  ('David R. Muir', 'Unknown'),
  ('Sydney A. Jones Jr.', 'Unknown'),
  ('Robert H. Miller', 'Unknown'),
  ('A. A. Rayner Jr.', 'ERROR, BAD LINK'),
  ('Eugene Sawyer', 'February 28, 1971  – December 2, 1987'),
  ('John O. Steele', 'ERROR, BAD LINK'),
  ('Freddrenna Lyle', 'February 8, 1998 – May 15, 2011'),
  ('Roderick Sawyer', 'May 16, 2011 – May 15, 2023')],
 '7

I've checked through the bad links and decided I'm okay with letting those go. 
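
Before letting those entries go, it can help to see how many people fall under each status label. Here is a minimal sketch, assuming a hypothetical helper named `summarize_labels`. The sample entries are copied from the output shown above, and any value containing a digit is treated as a real date.

```python
from collections import Counter

# A slice of aldermen_dates_dict, copied from the output shown above
sample_dates = {
    "5": [("Robert E. Merriam", "ERROR, BOX ELEMENT")],
    "6": [
        ("Robert H. Miller", "Unknown"),
        ("A. A. Rayner Jr.", "ERROR, BAD LINK"),
        ("Eugene Sawyer", "February 28, 1971  – December 2, 1987"),
        ("John O. Steele", "ERROR, BAD LINK"),
    ],
}


def summarize_labels(dates_dict):
    """Count status labels; any value containing a digit counts as 'dated'."""
    counts = Counter()
    for people in dates_dict.values():
        for _name, dates in people:
            label = "dated" if any(ch.isdigit() for ch in dates) else dates
            counts[label] += 1
    return counts


print(summarize_labels(sample_dates))
```

Running the same helper on the full `aldermen_dates_dict` would show how many aldermen are dropped for each reason.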

Now it's time to finalize the list of aldermen who have a start date (and, where available, an end date) and collect them as a list of dictionaries.

In [213]:
all_data = []

for ward, values in aldermen_dates_dict.items():
    for alder_info in values:
        alderperson = alder_info[0]
        dates = alder_info[1]
        if re.search("\\d+", dates):
            print(alderperson, dates)
            # Date ranges use either an en dash (–) or a plain hyphen (-)
            if re.search('–', dates):
                beg_end = re.split('–', dates)
                begin = beg_end[0]
            elif re.search('-', dates):
                beg_end = re.split('-', dates)
                begin = beg_end[0]
            else:
                beg_end = []
                begin = dates
            if len(beg_end) == 2:
                end = beg_end[1]
            else:
                end = "present"
                
            # To ensure I'll match Getnet's data, I'll copy the format he used
            row_data = {
                "Ward": ward,
                "Alderperson": alderperson,
                "Start Date": str.strip(begin),
                "End Date": str.strip(end),
                "Party": None,
                "Notes": None
            }
            all_data.append(row_data)
                
        else:
            continue

all_data
        
            
        

Paul Howard Douglas 1939–1942
Leon Despres 1955–1975
Leslie Hairston May 1999 – May 15, 2023
Eugene Sawyer February 28, 1971  – December 2, 1987
Freddrenna Lyle February 8, 1998 – May 15, 2011
Roderick Sawyer May 16, 2011 – May 15, 2023
William Beavers 1983–2006
Sandi Jackson May 21, 2007 – January 15, 2013
Natashia Holmes February 13, 2013 – May 18, 2015
Gregory Mitchell May 18, 2015
William D. Meyering April 1923 – December 1930
William Cousins Jr. 1967  –1976
Todd Stroger September 5, 2001 – December 4, 2006
Michelle A. Harris December 13, 2006
Robert Shaw May 1987 – December 1998
Anthony Beale May 1999
William A. Rowan 1927–1942
Edward Vrdolyak 1971–1987
John Pope May 1999   – May 2015
Susan Sadlowski Garza May 18, 2015 – May 15, 2023
Michael Anthony Bilandic June 12, 1969 – June 7, 1977
James Balcer 1997–2015
Patrick Daley Thompson May 18, 2015 – February 14, 2022
Nicole Lee March 28, 2022
George Cardenas 2003–2022
Anabel Abarca December 14, 2022 – May 15, 2023
Julia Ramirez May 1

[{'Ward': '5',
  'Alderperson': 'Paul Howard Douglas',
  'Start Date': '1939',
  'End Date': '1942',
  'Party': None,
  'Notes': None},
 {'Ward': '5',
  'Alderperson': 'Leon Despres',
  'Start Date': '1955',
  'End Date': '1975',
  'Party': None,
  'Notes': None},
 {'Ward': '5',
  'Alderperson': 'Leslie Hairston',
  'Start Date': 'May 1999',
  'End Date': 'May 15, 2023',
  'Party': None,
  'Notes': None},
 {'Ward': '6',
  'Alderperson': 'Eugene Sawyer',
  'Start Date': 'February 28, 1971',
  'End Date': 'December 2, 1987',
  'Party': None,
  'Notes': None},
 {'Ward': '6',
  'Alderperson': 'Freddrenna Lyle',
  'Start Date': 'February 8, 1998',
  'End Date': 'May 15, 2011',
  'Party': None,
  'Notes': None},
 {'Ward': '6',
  'Alderperson': 'Roderick Sawyer',
  'Start Date': 'May 16, 2011',
  'End Date': 'May 15, 2023',
  'Party': None,
  'Notes': None},
 {'Ward': '7',
  'Alderperson': 'William Beavers',
  'Start Date': '1983',
  'End Date': '2006',
  'Party': None,
  'Notes': None},
 {'W

In [207]:
with open('libbys_scraped_data.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=4)
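
The saved rows cover every alderperson with a known date, while the ultimate goal is limited to those who served between 2012 and 2023. Here is a minimal sketch of that final filter. The helpers `extract_year` and `served_between` are hypothetical, the sample rows keep only the keys the sketch needs, and an end date of "present" is assumed to mean the person was still serving at the end of the window.

```python
import re

# Rows shaped like all_data, trimmed to the keys used here; the values
# are taken from the printed output earlier in the notebook
sample_rows = [
    {"Alderperson": "Leon Despres",
     "Start Date": "1955", "End Date": "1975"},
    {"Alderperson": "Leslie Hairston",
     "Start Date": "May 1999", "End Date": "May 15, 2023"},
    {"Alderperson": "Freddrenna Lyle",
     "Start Date": "February 8, 1998", "End Date": "May 15, 2011"},
    {"Alderperson": "Gregory Mitchell",
     "Start Date": "May 18, 2015", "End Date": "present"},
]


def extract_year(date_text, default=None):
    """Return the first four-digit year in date_text, else default."""
    match = re.search(r"\b(\d{4})\b", date_text)
    return int(match.group(1)) if match else default


def served_between(row, first_year=2012, last_year=2023):
    """True if the term overlaps the window first_year..last_year."""
    start = extract_year(row["Start Date"])
    # "present" contains no year, so fall back to the end of the window
    end = extract_year(row["End Date"], default=last_year)
    if start is None:
        return False
    return start <= last_year and end >= first_year


recent = [row["Alderperson"] for row in sample_rows if served_between(row)]
print(recent)  # ['Leslie Hairston', 'Gregory Mitchell']
```

Comparing years only keeps the check simple; terms that end partway through 2012 or begin partway through 2023 still count as overlapping.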