# Web Scraper for County Name Changes

When working with the Sanborn geographic data, I noticed that several of the listed counties no longer exist. With some research, I found that many of them had been renamed, merged with another county, or split into two counties. To resolve some of these cases programmatically, we can use web scraping to build a list of those changes.

First, import the necessary packages. We'll need requests and json, as usual, but to do web scraping, we'll also need these other packages:

In [2]:
import requests
import json
import urllib.request
import time
from bs4 import BeautifulSoup

I've found a website that lists the county equivalents in a table. We can request that page and use BeautifulSoup to parse the response.

In [3]:
url = 'https://wwwn.cdc.gov/eworld/Appendix/CountyEquivalents'
response = requests.get(url)

In [4]:
soup = BeautifulSoup(response.text, 'html.parser')
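
Network requests can fail or hang, so it may be worth wrapping the fetch in a small retry loop. The sketch below is one way to do that with the `urllib.request` and `time` modules imported above. The `fetch_html` name, its parameters, and the injectable `opener` argument are assumptions made for illustration, not part of the original notebook.

```python
import time
import urllib.request


def fetch_html(url, retries=3, delay=1.0, timeout=10,
               opener=urllib.request.urlopen):
    """Return the decoded body of `url`, retrying on network errors.

    `opener` is a hypothetical hook so the function can be exercised
    without a network connection; by default it is urllib's urlopen.
    """
    last_error = None
    for attempt in range(max(1, retries)):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read().decode('utf-8', errors='replace')
        except OSError as err:  # URLError and timeouts are OSError subclasses
            last_error = err
            if attempt < retries - 1:
                time.sleep(delay)  # pause before trying again
    raise last_error
```

The returned string could then be passed to `BeautifulSoup(..., 'html.parser')` in the same way as `response.text`.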

Let's look at what that response output gives us. Since we have a table, we want to get what's within the 'td' (table data cell) elements.

In [5]:
soup.find_all('td')

[<td rowspan="26" valign="top">
                 Alaska
             </td>, <td>
                 Aleutians East Borough
             </td>, <td>
                 Aleutian Islands<sup>1</sup>
 </td>, <td>
                 1968–1993
             </td>, <td>
                 Aleutians West Census Area
             </td>, <td>
                 Aleutian Islands
             </td>, <td>
                 1994–Present
             </td>, <td>
                 Anchorage Borough
             </td>, <td>
                 Anchorage District
             </td>, <td>
                 1968–Present
             </td>, <td>
                 Bethel Census Area
             </td>, <td>
                 Bethel District &amp; Kuskokwim District
             </td>, <td>
                 1968–Present
             </td>, <td>
                 Bristol Bay Borough<sup>2</sup>
 </td>, <td>
                 Bristol Bay Division
             </td>, <td>
                 1968–Present
             </td>, <td>
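
The output shows that annotation markers sit inside `<sup>` tags. As a possible alternative to cleaning digits out of the names later, the markers could be dropped from the parse tree before the text is extracted. This is only a sketch on a hand-written HTML fragment modelled on the output above; the `cell_texts` helper is a hypothetical name.

```python
from bs4 import BeautifulSoup

# A small stand-in for the real page, modelled on the cells shown above.
html = """
<table>
  <tr><td rowspan="2">Alaska</td><td>Aleutians East Borough</td>
      <td>Aleutian Islands<sup>1</sup></td><td>1968-1993</td></tr>
  <tr><td>Bristol Bay Borough<sup>2</sup></td>
      <td>Bristol Bay Division</td><td>1968-Present</td></tr>
</table>
"""


def cell_texts(markup):
    """Return the text of every <td>, with <sup> footnote markers removed."""
    soup = BeautifulSoup(markup, 'html.parser')
    for sup in soup.find_all('sup'):
        sup.decompose()  # delete the tag and its contents from the tree
    return [td.get_text(strip=True) for td in soup.find_all('td')]


cells = cell_texts(html)
print(cells)
```

With this approach a name that legitimately contains a digit would be left intact, since only the `<sup>` elements are removed.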
     

## Processing Output

Next, we need to pick out the useful parts of the page. Each item returned when finding all of the 'td' elements is one cell of the table. Based on the table on the site, each row has 3 cells (current name, previous name, and dates), with one additional cell each time a new state begins. To loop through the cells, then, the code skips the state names when they show up, adding them to neither the current nor the previous name list and leaving the count unchanged. The count then tells us whether a cell contains a current name, a previous name, or the date information (which also gets ignored).

In [8]:
# State names sit in their own cell (spanning several rows), so skip them
state_names = ['ALASKA', 'ARIZONA', 'COLORADO', 'FLORIDA', 'HAWAII',
               'NEW MEXICO', 'NEW YORK', 'SOUTH DAKOTA', 'VIRGINIA']
count = 1  # count % 3: 1 = current name, 2 = previous name, 0 = dates
current_name = []
previous_name = []
for cell in soup.find_all('td'):
    text = cell.get_text().strip().upper()
    if text in state_names:
        continue  # a state cell does not advance the count
    if count % 3 == 1:
        current_name.append(text)
    elif count % 3 == 2:
        previous_name.append(text)
    count += 1
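
To make the counting logic easier to check by hand, here is a small self-contained sketch of the same idea on a plain list of strings: drop the state cells, then read the remaining cells three at a time. The `pair_names` helper is a hypothetical name, and returning (current, previous) tuples instead of two parallel lists is a design choice of this sketch rather than what the notebook does.

```python
def pair_names(cells, state_names):
    """Group table cells into (current, previous) pairs, skipping state cells."""
    states = {s.upper() for s in state_names}
    # Remove state cells first so every remaining run of 3 cells is one row.
    data = [c.strip().upper() for c in cells if c.strip().upper() not in states]
    pairs = []
    # Ignore any incomplete trailing row.
    for i in range(0, len(data) - len(data) % 3, 3):
        current, previous, _dates = data[i:i + 3]
        pairs.append((current, previous))
    return pairs


cells = ['Alaska',
         'Aleutians East Borough', 'Aleutian Islands1', '1968–1993',
         'Aleutians West Census Area', 'Aleutian Islands', '1994–Present',
         'Anchorage Borough', 'Anchorage District', '1968–Present']
pairs = pair_names(cells, ['Alaska'])
for current, previous in pairs:
    print(previous, '->', current)
```

Keeping each pair together makes it harder for the two name lists to drift out of alignment if a row is malformed.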

In [9]:
current_name

['ALEUTIANS EAST BOROUGH',
 'ALEUTIANS WEST CENSUS AREA',
 'ANCHORAGE BOROUGH',
 'BETHEL CENSUS AREA',
 'BRISTOL BAY BOROUGH2',
 'DILLINGHAM CENSUS AREA3',
 'LAKE AND PENINSULA BOROUGH',
 'FAIRBANKS NORTH STAR BOROUGH',
 'SOUTHEAST FAIRBANKS CENSUS AREA',
 'JUNEAU BOROUGH',
 'KENAI PENINSULA BOROUGH',
 'KETCHIKAN GATEWAY BOROUGH',
 'KODIAK ISLAND BOROUGH',
 'MATANUSKA-SUSITNA BOROUGH',
 'NOME CENSUS AREA',
 'NORTH SLOPE BOROUGH',
 'NORTHWEST ARCTIC BOROUGH',
 'PRINCE OF WALES-OUTER KETCHIKAN CENSUS AREA',
 'SITKA BOROUGH',
 'SKAGWAY-HOONAH-ANGOON CENSUS AREA5',
 'YAKUTAT CENSUS AREA',
 'VALDEZ-CORDOVA CENSUS AREA',
 'WADE HAMPTON CENSUS AREA',
 'WRANGELL-PETERSBURG CENSUS AREA',
 'YUKON-KOYUKUK CENSUS AREA7',
 'DENALI BOROUGH',
 'YUMA COUNTY8',
 'LA PAZ COUNTY',
 'ADAMS, BOULDER, JEFFERSON, AND WELD COUNTIES9',
 'BROOMFIELD COUNTY',
 'MIAMI-DADE COUNTY',
 'MAUI COUNTY10',
 'KALAWAO COUNTY',
 'VALENCIA COUNTY11',
 'CIBOLA COUNTY',
 'BRONX COUNTY',
 'KINGS COUNTY',
 'NEW YORK COUNTY',
 '

## Clean up

As we can see from the printed output, some names have extra numbers at the end. Looking at the site, these come from annotations. So, let's write a function to remove them:

In [81]:
def remove_nums(strings):
    # Strip every digit from each name, editing the list in place
    for c in range(len(strings)):
        strings[c] = ''.join([i for i in strings[c] if not i.isdigit()])

In [82]:
remove_nums(current_name)
remove_nums(previous_name)
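
A slightly narrower variant, sketched here as an assumption about what is wanted, removes digits only at the end of a name and returns a new list instead of editing in place. The `strip_footnotes` name is hypothetical.

```python
import re


def strip_footnotes(names):
    """Return a new list with trailing footnote digits removed from each name."""
    return [re.sub(r'\d+$', '', name).rstrip() for name in names]


sample = ['BRISTOL BAY BOROUGH2', 'MAUI COUNTY10', 'NOME CENSUS AREA']
cleaned = strip_footnotes(sample)
print(cleaned)
# ['BRISTOL BAY BOROUGH', 'MAUI COUNTY', 'NOME CENSUS AREA']
```

This leaves any digit in the middle of a name untouched, which may or may not be desirable depending on where the annotations appear.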

In [83]:
previous_name

['ALEUTIAN ISLANDS',
 'ALEUTIAN ISLANDS',
 'ANCHORAGE DISTRICT',
 'BETHEL DISTRICT & KUSKOKWIM DISTRICT',
 'BRISTOL BAY DIVISION',
 'BRISTOL BAY BOROUGH (IN PART)',
 'DILLINGHAM CENSUS AREA (IN PART)',
 'FAIRBANKS DISTRICT',
 'SOUTHEAST FAIRBANKS DISTRICT',
 'JUNEAU DISTRICT',
 'KENAI-COOK INLET DISTRICT & SEWARD DISTRICT',
 'KETCHIKAN DISTRICT',
 'KETCHIKAN DISTRICT',
 'PALMER-WASILLA DISTRICT',
 'NOME DISTRICT',
 'BARROW DISTRICT',
 'KOBUK CENSUS AREA',
 'OUTER KETCHIKAN DISTRICT & PRINCE OF WALES DISTRICT',
 'SITKA DISTRICT',
 'SKAGWAY-YAKUTAT-ANGOON CENSUS AREA',
 'SKAGWAY-YAKUTAT-ANGOON CENSUS AREA (IN PART)',
 'CORDOVA-MCCARTHY DISTRICT & VALDEZ-CHITINA-WHITTIER DISTRICT',
 'WADE HAMPTON DISTRICT',
 'WRANGELL DISTRICT',
 'UPPER YUKON DISTRICT & YUKON-KOYUKUK DISTRICT',
 'YUKON-KOYUKUK CENSUS AREA (IN PART)',
 'YUMA COUNTY',
 'YUMA COUNTY (IN PART)',
 'ADAMS, BOULDER, JEFFERSON, AND WELD COUNTIES',
 'ADAMS, BOULDER, JEFFERSON, AND WELD COUNTIES (IN PART)',
 'DADE COUNTY',
 'MAUI C

I also want to remove suffixes such as COUNTY and BOROUGH so that the names will be easier to match to the Sanborn metadata.

In [96]:
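def remove_suffix(words, remove):
    # Delete each term wherever it occurs (not only at the end), in place,
    # so 'BETHEL DISTRICT & KUSKOKWIM DISTRICT' becomes 'BETHEL & KUSKOKWIM'
    for i in range(len(words)):
        for r in remove:
            words[i] = words[i].replace(r, '')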

In [97]:
to_remove = [' COUNTY', ' COUNTIES', ' DISTRICT', ' DIVISION', ' BOROUGH', ' CENSUS AREA']
remove_suffix(previous_name, to_remove)
remove_suffix(current_name, to_remove)
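
Because plain string replacement would also match the start of a longer word, a regular expression with a word boundary is one possible way to make the removal stricter. The sketch below assumes the same six terms; `UNIT_WORDS`, `UNIT_PATTERN`, and `strip_units` are hypothetical names.

```python
import re

# Same terms as to_remove, without the leading space (the pattern adds it).
UNIT_WORDS = ['CENSUS AREA', 'COUNTIES', 'COUNTY', 'DISTRICT', 'DIVISION', 'BOROUGH']
UNIT_PATTERN = re.compile(r'\s+(?:' + '|'.join(UNIT_WORDS) + r')\b')


def strip_units(name):
    """Remove administrative-unit words that appear as whole words in `name`."""
    return UNIT_PATTERN.sub('', name)


for name in ['BETHEL DISTRICT & KUSKOKWIM DISTRICT',
             'BRISTOL BAY BOROUGH (IN PART)',
             'CHESAPEAKE CITY']:
    print(strip_units(name))
# BETHEL & KUSKOKWIM
# BRISTOL BAY (IN PART)
# CHESAPEAKE CITY
```

The `\b` at the end means a term is removed only when it ends at a word boundary, so a longer word that merely begins with one of the terms is left alone.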

In [98]:
previous_name

['ALEUTIAN ISLANDS',
 'ALEUTIAN ISLANDS',
 'ANCHORAGE',
 'BETHEL & KUSKOKWIM',
 'BRISTOL BAY',
 'BRISTOL BAY (IN PART)',
 'DILLINGHAM (IN PART)',
 'FAIRBANKS',
 'SOUTHEAST FAIRBANKS',
 'JUNEAU',
 'KENAI-COOK INLET & SEWARD',
 'KETCHIKAN',
 'KETCHIKAN',
 'PALMER-WASILLA',
 'NOME',
 'BARROW',
 'KOBUK',
 'OUTER KETCHIKAN & PRINCE OF WALES',
 'SITKA',
 'SKAGWAY-YAKUTAT-ANGOON',
 'SKAGWAY-YAKUTAT-ANGOON (IN PART)',
 'CORDOVA-MCCARTHY & VALDEZ-CHITINA-WHITTIER',
 'WADE HAMPTON',
 'WRANGELL',
 'UPPER YUKON & YUKON-KOYUKUK',
 'YUKON-KOYUKUK (IN PART)',
 'YUMA',
 'YUMA (IN PART)',
 'ADAMS, BOULDER, JEFFERSON, AND WELD',
 'ADAMS, BOULDER, JEFFERSON, AND WELD (IN PART)',
 'DADE',
 'MAUI',
 'MAUI (IN PART)',
 'VALENCIA',
 'VALENCIA (IN PART)',
 'BRONX',
 'BROOKLYN',
 'MANHATTAN',
 'QUEENS',
 'STATEN ISLAND',
 'WASHABAUGH',
 'ALLEGHANY',
 'AUGUSTA',
 'AUGUSTA (IN PART)',
 'CHESAPEAKE CITY',
 'CHESAPEAKE CITY (IN PART)',
 'CHESAPEAKE CITY (IN PART)',
 'FREDERICK',
 'FREDERICK (IN PART)',
 'HALIFAX',

## Connect Previous to Current

Finally, we want to create a map that goes from the previous name to the current name(s).

In [99]:
prev2curr = {}
for i in range(len(previous_name)):
    prev = previous_name[i]
    curr = current_name[i]
    if prev not in prev2curr:
        prev2curr[prev] = []
    prev2curr[prev].append(curr)
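
The same one-to-many mapping can be sketched with `zip` and `collections.defaultdict`, which avoids indexing and the explicit "key not present" check. The `build_mapping` name and the length check are additions for illustration, not part of the original notebook.

```python
from collections import defaultdict


def build_mapping(previous, current):
    """Map each previous name to the list of current names it became."""
    if len(previous) != len(current):
        # zip would silently drop the extra items, hiding a misalignment.
        raise ValueError('previous and current must have the same length')
    mapping = defaultdict(list)
    for prev, curr in zip(previous, current):
        mapping[prev].append(curr)
    return dict(mapping)


previous = ['ALEUTIAN ISLANDS', 'ALEUTIAN ISLANDS', 'ANCHORAGE']
current = ['ALEUTIANS EAST', 'ALEUTIANS WEST', 'ANCHORAGE']
mapping = build_mapping(previous, current)
print(mapping)
# {'ALEUTIAN ISLANDS': ['ALEUTIANS EAST', 'ALEUTIANS WEST'], 'ANCHORAGE': ['ANCHORAGE']}
```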

In [100]:
prev2curr

{'ALEUTIAN ISLANDS': ['ALEUTIANS EAST', 'ALEUTIANS WEST'],
 'ANCHORAGE': ['ANCHORAGE'],
 'BETHEL & KUSKOKWIM': ['BETHEL'],
 'BRISTOL BAY': ['BRISTOL BAY'],
 'BRISTOL BAY (IN PART)': ['DILLINGHAM'],
 'DILLINGHAM (IN PART)': ['LAKE AND PENINSULA'],
 'FAIRBANKS': ['FAIRBANKS NORTH STAR'],
 'SOUTHEAST FAIRBANKS': ['SOUTHEAST FAIRBANKS'],
 'JUNEAU': ['JUNEAU'],
 'KENAI-COOK INLET & SEWARD': ['KENAI PENINSULA'],
 'KETCHIKAN': ['KETCHIKAN GATEWAY', 'KODIAK ISLAND'],
 'PALMER-WASILLA': ['MATANUSKA-SUSITNA'],
 'NOME': ['NOME'],
 'BARROW': ['NORTH SLOPE'],
 'KOBUK': ['NORTHWEST ARCTIC'],
 'OUTER KETCHIKAN & PRINCE OF WALES': ['PRINCE OF WALES-OUTER KETCHIKAN'],
 'SITKA': ['SITKA'],
 'SKAGWAY-YAKUTAT-ANGOON': ['SKAGWAY-HOONAH-ANGOON'],
 'SKAGWAY-YAKUTAT-ANGOON (IN PART)': ['YAKUTAT'],
 'CORDOVA-MCCARTHY & VALDEZ-CHITINA-WHITTIER': ['VALDEZ-CORDOVA'],
 'WADE HAMPTON': ['WADE HAMPTON'],
 'WRANGELL': ['WRANGELL-PETERSBURG'],
 'UPPER YUKON & YUKON-KOYUKUK': ['YUKON-KOYUKUK'],
 'YUKON-KOYUKUK (IN PART

As an additional clean-up step, note that some of the previous names contain & or AND, which suggests that historical records may use either name on its own. So, we split those names and give each part its own entry in the dictionary.

In [101]:
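# Collect an entry for each part of a compound previous name
prev2curr2 = {}
for prev in prev2curr:
    if ' & ' in prev:
        prev_split = prev.split(' & ')
        for x in prev_split:
            prev2curr2[x] = prev2curr[prev]
    # Note: a comma list such as 'A, B, AND C' splits into 'A, B,' and 'C'
    if ' AND ' in prev:
        prev_split = prev.split(' AND ')
        for x in prev_split:
            prev2curr2[x] = prev2curr[prev]

In [102]:
# Merge the split-out names back into the main dictionary
for prev in prev2curr2:
    if prev not in prev2curr:
        prev2curr[prev] = []
    for x in prev2curr2[prev]:
        prev2curr[prev].append(x)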
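
Names joined by commas, such as 'ADAMS, BOULDER, JEFFERSON, AND WELD', do not split cleanly on ' AND ' alone. One possible refinement, sketched below under the assumption that commas, ' & ', and ' AND ' are all separators, uses a single regular expression and copies the lists so that entries do not share state. `SEPARATOR`, `split_compound`, and `expand_mapping` are hypothetical names.

```python
import re

# Order matters: try ', AND ' before a bare comma or a bare ' AND '.
SEPARATOR = re.compile(r',\s*AND\s+|,\s*|\s+&\s+|\s+AND\s+')


def split_compound(name):
    """Split a compound name into its parts, dropping empty pieces."""
    return [part for part in SEPARATOR.split(name) if part]


def expand_mapping(mapping):
    """Return a copy of `mapping` with an extra entry for each part of a compound key."""
    expanded = {prev: list(currs) for prev, currs in mapping.items()}
    for prev, currs in mapping.items():
        parts = split_compound(prev)
        if len(parts) < 2:
            continue  # not a compound name
        for part in parts:
            targets = expanded.setdefault(part, [])
            for curr in currs:
                if curr not in targets:  # avoid duplicate entries
                    targets.append(curr)
    return expanded


sample = {'KENAI-COOK INLET & SEWARD': ['KENAI PENINSULA'], 'NOME': ['NOME']}
expanded = expand_mapping(sample)
print(split_compound('ADAMS, BOULDER, JEFFERSON, AND WELD'))
print(expanded)
```

This sketch would also split a single name that happens to contain AND, and it does not treat a trailing "(IN PART)" specially, so both cases would need extra handling.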

In [103]:
prev2curr

{'ALEUTIAN ISLANDS': ['ALEUTIANS EAST', 'ALEUTIANS WEST'],
 'ANCHORAGE': ['ANCHORAGE'],
 'BETHEL & KUSKOKWIM': ['BETHEL'],
 'BRISTOL BAY': ['BRISTOL BAY'],
 'BRISTOL BAY (IN PART)': ['DILLINGHAM'],
 'DILLINGHAM (IN PART)': ['LAKE AND PENINSULA'],
 'FAIRBANKS': ['FAIRBANKS NORTH STAR'],
 'SOUTHEAST FAIRBANKS': ['SOUTHEAST FAIRBANKS'],
 'JUNEAU': ['JUNEAU'],
 'KENAI-COOK INLET & SEWARD': ['KENAI PENINSULA'],
 'KETCHIKAN': ['KETCHIKAN GATEWAY', 'KODIAK ISLAND'],
 'PALMER-WASILLA': ['MATANUSKA-SUSITNA'],
 'NOME': ['NOME'],
 'BARROW': ['NORTH SLOPE'],
 'KOBUK': ['NORTHWEST ARCTIC'],
 'OUTER KETCHIKAN & PRINCE OF WALES': ['PRINCE OF WALES-OUTER KETCHIKAN'],
 'SITKA': ['SITKA'],
 'SKAGWAY-YAKUTAT-ANGOON': ['SKAGWAY-HOONAH-ANGOON'],
 'SKAGWAY-YAKUTAT-ANGOON (IN PART)': ['YAKUTAT'],
 'CORDOVA-MCCARTHY & VALDEZ-CHITINA-WHITTIER': ['VALDEZ-CORDOVA'],
 'WADE HAMPTON': ['WADE HAMPTON'],
 'WRANGELL': ['WRANGELL-PETERSBURG'],
 'UPPER YUKON & YUKON-KOYUKUK': ['YUKON-KOYUKUK'],
 'YUKON-KOYUKUK (IN PART

As the last step, we need to write this data out to a JSON file so that other scripts can access it.

In [104]:
with open('county-namechanges.json', 'w') as f:
    json.dump(prev2curr, f)
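
For completeness, here is a sketch of how another script might read the file back and look up a name. It writes to a temporary directory so that it can be run without touching the real output file; `save_mapping`, `load_mapping`, and `current_names` are hypothetical helper names.

```python
import json
import os
import tempfile


def save_mapping(mapping, path):
    """Write the mapping as JSON; sorted keys keep the file stable across runs."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(mapping, f, indent=2, sort_keys=True)


def load_mapping(path):
    """Read a mapping previously written by save_mapping."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def current_names(name, mapping):
    """Return the current name(s) for a previous name, or [] if unknown."""
    return mapping.get(name.strip().upper(), [])


sample = {'DADE': ['MIAMI-DADE'],
          'ALEUTIAN ISLANDS': ['ALEUTIANS EAST', 'ALEUTIANS WEST']}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'county-namechanges.json')
    save_mapping(sample, path)
    loaded = load_mapping(path)

print(current_names('Dade', loaded))
# ['MIAMI-DADE']
```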