# Web scraping part 1: permits

## Lecture objectives

1. Understand how to scrape web pages and other data where an API doesn't exist
2. Introduce the `BeautifulSoup` library
3. Learn how to parse unstructured text data
4. More practice with `pandas`

APIs make it relatively simple to get data from the web. But sometimes an API doesn't exist, because setting one up and maintaining it takes effort on the part of the agency.

In these cases, we can still obtain data from the web. But rather than dropping it directly into a (geo)pandas `DataFrame`, we'll need to do more work to understand the structure of the webpage, and to clean and process the results. 

## Example: Land use permit data
Often, cities make their building and land use permit data available for download, and/or accessible through an API. But these datasets are typically incomplete: they provide the subset of fields that are most relevant to most users (e.g., permit approval date and number of units), and exclude more esoteric fields. Parking, sadly, is one of the fields that is often excluded.

For a [recent project](https://www.tandfonline.com/doi/full/10.1080/01944363.2021.1873824), I looked at the impacts of TOD plans in Seattle and San Francisco on development outcomes, including parking ratios. Let's walk through how I obtained the data for the Seattle analysis.

The basic Seattle land use permit dataset [is available through the city's Socrata API](https://data.seattle.gov/Permitting/Land-Use-Permits/ht3q-kdvx). That's a good starting point for our work. Let's get this into a `pandas` dataframe, in the same way that we did with the Los Angeles data.

In [1]:
import json
import requests
import pandas as pd
url = 'https://data.seattle.gov/resource/ht3q-kdvx.json' # copied and pasted from the webpage
r = requests.get(url)
df = pd.DataFrame(json.loads(r.text))
print(df.head())

    permitnum           permitclass permitclassmapped   permittypemapped  \
0  3001212-LU  Single Family/Duplex       Residential  Master Use Permit   
1  3001271-LU  Single Family/Duplex       Residential  Master Use Permit   
2  3001310-LU  Single Family/Duplex       Residential  Master Use Permit   
3  3001312-LU                   N/A               N/A  Master Use Permit   
4  3001440-LU            Commercial   Non-Residential  Master Use Permit   

                                         description statuscurrent  \
0  PROJECT CANCELLED 12/8/2010 -- This short plat...      Canceled   
1  Land Use Permit to adjust the boundary between...     Completed   
2  Land use application to adjust the boundary be...     Completed   
3  Cancelled due to no activity for more than 9 y...      Canceled   
4  PROJECT CANCELLED 5/23/2011 -- Project On Hold...      Canceled   

                            relatededg_landusepermit   originaladdress1  \
0  {'type': 'Point', 'coordinates': [-122.25172

There are lots of columns, so the output is truncated.

But we can explore the contents of the dataframe in other ways. For example, `.info()` gives us the column names and variable types. (An `object` dtype normally means a string or a mix of types.)

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   permitnum                 1000 non-null   object
 1   permitclass               1000 non-null   object
 2   permitclassmapped         1000 non-null   object
 3   permittypemapped          1000 non-null   object
 4   description               999 non-null    object
 5   statuscurrent             1000 non-null   object
 6   relatededg_landusepermit  915 non-null    object
 7   originaladdress1          1000 non-null   object
 8   originalcity              1000 non-null   object
 9   originalstate             1000 non-null   object
 10  originalzip               866 non-null    object
 11  link                      1000 non-null   object
 12  latitude                  915 non-null    object
 13  longitude                 915 non-null    object
 14  location1                
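The `RangeIndex` above shows exactly 1000 rows, which suggests a default page size rather than the complete dataset. Socrata endpoints generally accept `$limit` and `$offset` query parameters for paging; whether this particular endpoint honors them, and what maximum it allows, is an assumption to check against the API documentation. The helper name `page_url` is made up for illustration, and the sketch only builds the URLs without sending any requests.

```python
from urllib.parse import urlencode

url = 'https://data.seattle.gov/resource/ht3q-kdvx.json'

def page_url(base, limit=1000, offset=0):
    """Build a URL asking for `limit` rows, starting at row `offset`."""
    # ASSUMPTION: the endpoint honors Socrata's $limit and $offset parameters
    query = urlencode({'$limit': limit, '$offset': offset}, safe='$')
    return base + '?' + query

# URLs for the first three pages of 1000 rows each
pages = [page_url(url, limit=1000, offset=start) for start in (0, 1000, 2000)]
for page in pages:
    print(page)
```

Each of these URLs could then be passed to `requests.get` in the same way as the original one.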

Notice that there is a `link` field. Let's take a look at the first one. 

In [3]:
# The .loc operator gives us an extract from the dataframe. 0 is the row index, 'link' is the column
# So this gives us the contents of the 'link' column for the first row.

print(df.loc[0, 'link'])   

{'url': 'https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU'}


Notice that the entry in this column is a dictionary, not a plain string. That's perhaps a surprise, but we know how to deal with dictionaries. 

For now, [let's take a look at what this link looks like](https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU). Clearly, there is a lot more information here about the specific permit than is provided via the API!

How do we bring the information in that webpage into Python? Remember, the `requests` library is our friend in this circumstance. While we've used it to get data from an API, `requests` can retrieve pretty much anything from the web.

First, let's extract the text string that gives the URL for this row.

In [4]:
urldict = df.loc[0,'link']
print(urldict)

{'url': 'https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU'}


As we saw before, it's a dictionary with a key of 'url', so let's extract the value.

In [5]:
permiturl = urldict['url']
print(permiturl)

# or we could do this in one step: permiturl = df.loc[0,'link']['url']


https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=3001212-LU
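The same extraction can be done for a whole column at once. The sketch below uses a made-up two-row DataFrame called `toy` with placeholder `example.com` addresses, so that it runs on its own; the pandas `apply` method it relies on is covered later in this lecture.

```python
import pandas as pd

# A small stand-in for the 'link' column: each cell holds a dictionary
toy = pd.DataFrame({'link': [
    {'url': 'https://example.com/a'},
    {'url': 'https://example.com/b'},
]})

# apply runs the lambda on each cell and collects the results in a new Series
toy['permiturl'] = toy['link'].apply(lambda d: d['url'])
print(toy['permiturl'].tolist())
```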


Now, pass that URL to `requests` in the same way that we did for the API.

In [6]:
r = requests.get(permiturl)
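Before parsing, it can be worth confirming that the request actually succeeded, because an error page is also HTML and would otherwise be parsed silently. The sketch below wraps the call in a small function; the name `fetch_html` and the 30-second timeout are illustrative choices, not part of the original workflow.

```python
import requests

def fetch_html(url, timeout=30):
    """Return the page text, raising an exception on an HTTP error status."""
    # timeout stops the call from hanging forever if the server never answers
    response = requests.get(url, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(permiturl)` would return the same string as `r.text`, or fail loudly if the page could not be retrieved.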

Let's look at what requests has returned. 

Remember, the `.text` attribute gives us the text of what's retrieved.

In [7]:
print(r.text)



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html ng-app="appAca" xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
<head id="ctl00_Head1"><link href="../App_Themes/Default/_progressbar.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Default/breadcrumb.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Default/Calendar.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Default/custom.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Default/font.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Default/form.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Default/grid.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Default/layout.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Defau

### Using BeautifulSoup
It looks like we've got the whole HTML webpage. The relevant information is buried in there, but how can we fish it out of the sea of HTML code?

This is where the `BeautifulSoup` library comes in ([documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)). Let's convert our text to a "soup" object.

In [8]:
from bs4 import BeautifulSoup
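# name the parser explicitly; otherwise BeautifulSoup picks one and prints a warning
soup = BeautifulSoup(r.text, 'html.parser')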
print(type(soup))

<class 'bs4.BeautifulSoup'>


This soup object has a lot of attributes and functions (type `soup.` and press tab to autocomplete). 

We can use the `.prettify()` function to give us a better sense of what we are looking at.

In [9]:
# Not very pretty IMHO, but we can look and see where the data we want are buried
# and cross-reference that to the webpage in our browser
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-US" ng-app="appAca" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head id="ctl00_Head1">
  <link href="../App_Themes/Default/_progressbar.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/breadcrumb.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/Calendar.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/custom.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/font.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/form.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/grid.css" rel="stylesheet" type="text/css"/>
  <link href="../App_Themes/Default/layout.css" rel="stylesheet" type="text/css"/>
  <link href="../Ap

Let's suppose we want to get the project description (where the parking information might be included, since there isn't a separate parking field). (In reality, the "description" field is now in the API version, but that wasn't the case originally, and extracting it is good practice.)

Just like with the API output that we saw earlier, extracting this is a case of step-by-step detective work.

If you look at the output above, it seems that Project Description is contained within a `<td>` tag. 

We'll use the `.find_all()` function to find the relevant text.

In [10]:
links = soup.find_all('td') # returns a "list-like" object, i.e. we can loop through it or slice it like a list

What is returned? Let's have a look.

In [11]:
type(links)

bs4.element.ResultSet

What on earth is a `ResultSet`? The [docs](https://tedboy.github.io/bs4_doc/generated/generated/bs4.ResultSet.html) tell us that it's a list. So we can use our regular methods to look at a list.
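To see the list-like behavior in isolation, here is a minimal sketch on a tiny, made-up HTML string (not the permit page). The names `toy_soup` and `cells` are only for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, standing in for the permit page
html = '<table><tr><td>first</td><td>second</td><td>third</td></tr></table>'
toy_soup = BeautifulSoup(html, 'html.parser')
cells = toy_soup.find_all('td')

print(isinstance(cells, list))        # is a ResultSet a kind of list?
print(len(cells))                     # number of <td> tags found
print(cells[0].text)                  # index it like a list
print([c.text for c in cells[1:]])    # slice it and loop over it like a list
```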

In [12]:
# look at the first element
print(links[0])


<td>
<div id="ctl00_HeaderNavigation_beforeLogin">
<!--Login link-->
<div class="ACA_FRight">
<table border="0" cellpadding="0" cellspacing="0" role="presentation">
<tr>
<td>
<a href="/Portal/Login.aspx" id="ctl00_HeaderNavigation_btnLogin">
<span class="ACA_Body_Text ACA_Body_Text_FontSize" id="ctl00_HeaderNavigation_lblLogin"><span class="ssp-login">Login</span></span>
</a>
</td>
<td class="ACA_TabRow_Line">
                                              
                                        </td>
</tr>
</table>
</div>
<!--Login link-->
<!--Report-->
<div class="ACA_FRight">
<div>
<table border="0" cellpadding="0" cellspacing="0" role="presentation">
<tr id="reportLink">
<td class="ACA_TabRow_Line">
<a class="nav_more_arro ACA_Report_Arrow NotShowLoading" href="javascript:void(0);" onclick="showReports();">
<span id="ctl00_HeaderNavigation_lblReports"></span>
<span class="ACA_Body_Text ACA_Body_Text_FontSize" id="ctl00_HeaderNavigation_lblAdminReports" style="display:none;"></span>

More systematically, let's loop through to find the element that contains Project Description.

In [13]:
for link in links:
    if 'Project Description' in link.text:
        # found it: stop here and exit the loop
        break

# note: if nothing had matched, link would simply be the last <td> on the page
print(link)

<td class="td_parent_left"><div>
<h1 style="font-size:1.4em;"><span id="ctl00_PlaceHolderMain_PermitDetailList1_per_permitDetail_label_projectl637847733792449758">Project Description</span></h1><span class="ACA_SmLabel ACA_SmLabel_FontSize"><table class="table_child" role="presentation" style="TEMPLATE_STYLE"><tr><td class="td_child_left font12px"></td><td>PROJECT CANCELLED 12/8/2010 -- This short plat has an ECA exemption in the project planning template. A limited exemption was granted. Processing short plat with the ECA exemption #3002070.</td></tr></table></span>
</div></td>


Now we are getting closer! It looks like the Project Description is contained in another `<td>` tag, nested one level down. So let's do the same thing again at this second-level link.

In [14]:
sublinks = link.find_all('td')
print(sublinks)

[<td class="td_child_left font12px"></td>, <td>PROJECT CANCELLED 12/8/2010 -- This short plat has an ECA exemption in the project planning template. A limited exemption was granted. Processing short plat with the ECA exemption #3002070.</td>]


We've obtained a list! And the information we need is in the second element of that list.

In [15]:
description = sublinks[1]
print(description.text)

PROJECT CANCELLED 12/8/2010 -- This short plat has an ECA exemption in the project planning template. A limited exemption was granted. Processing short plat with the ECA exemption #3002070.
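The nesting is easier to see on a small page. The HTML below is made up to mimic the structure we just worked through; it is not copied from the city's site. It illustrates two things: the `.text` of an outer `<td>` includes the text of everything nested inside it, and calling `find_all` on a tag searches only within that tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the nesting on the permit page
html = """
<table><tr>
  <td class="outer"><h1><span>Project Description</span></h1>
    <table><tr><td class="label"></td><td>Short plat of one parcel.</td></tr></table>
  </td>
  <td class="outer">Something else</td>
</tr></table>
"""
toy_soup = BeautifulSoup(html, 'html.parser')

# The outer cell matches because its .text includes the nested heading
for td in toy_soup.find_all('td'):
    if 'Project Description' in td.text:
        break
print(td['class'])

# find_all on a tag looks only at that tag's descendants
inner = td.find_all('td')
print(len(inner))
print(inner[1].text)
```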


Now, let's take everything we've done so far, and put it in a function.
 
The function takes a single argument: the dictionary in the `link` column of the pandas DataFrame.
 
It returns the Description text, unless that's not found, in which case it returns an empty string `''`.  

In [16]:
def getDescription(urldict):
    permiturl = urldict['url']
    r = requests.get(permiturl)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = soup.find_all('td')
    for link in links:
        if 'Project Description' in link.text: 
            sublinks = link.find_all('td')
            description = sublinks[1].text
            # once we find a description, we return it and exit the function
            return description 
    
    return '' # if we don't find it, return an empty string

# Now let's apply this function to the first link in our dataframe
urldict = df.loc[0,'link']
getDescription(urldict)

'PROJECT CANCELLED 12/8/2010 -- This short plat has an ECA exemption in the project planning template. A limited exemption was granted. Processing short plat with the ECA exemption #3002070.'
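One possible refinement, sketched here as an alternative rather than as the approach used in the original project, is to separate the parsing from the downloading. A function that takes HTML text can be tried out on a saved or hand-written snippet without contacting the website. The name `parse_description`, the length check, and the `.strip()` call are additions for illustration.

```python
from bs4 import BeautifulSoup

def parse_description(html):
    """Return the Project Description text from a page's HTML, or '' if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    for cell in soup.find_all('td'):
        if 'Project Description' in cell.text:
            subcells = cell.find_all('td')
            # guard against a page whose layout differs from the one we inspected
            if len(subcells) > 1:
                return subcells[1].text.strip()
    return ''

# A hand-written snippet with the same nesting as the real page
sample = ('<table><tr><td class="td_parent_left"><div><h1><span>Project Description'
          '</span></h1><table><tr><td></td><td> Short plat with an ECA exemption. '
          '</td></tr></table></div></td></tr></table>')
print(parse_description(sample))
print(repr(parse_description('<p>No tables here</p>')))
```

For a live page, the call would be `parse_description(requests.get(permiturl).text)`.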

The advantage of a function is that we can now apply this procedure to every row of our pandas DataFrame.

Let's do this for 10 rows (so we are nice and don't disrupt the City's website).

The `apply` method in `pandas` calls a function on each element of a Series (here, on the `link` entry of each row).

In [17]:
# create a copy of the first 10 rows of the dataframe.
smalldf = df.head(10).copy()  

# for each row in smalldf, we pass the link column to getDescription
descriptions = smalldf['link'].apply(getDescription)  
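When scraping more than a handful of pages, another way to be nice is to pause between requests. The sketch below wraps any function so that each call is followed by a short sleep. The helper name `with_delay` is made up, the pause length is an arbitrary choice, and `fake_scraper` is a stand-in for `getDescription` so that the example runs without touching the City's website.

```python
import time

def with_delay(func, seconds=1.0):
    """Wrap func so that every call is followed by a pause."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        time.sleep(seconds)   # wait before the next request goes out
        return result
    return wrapper

# Stand-in for getDescription: takes the same kind of dictionary
def fake_scraper(urldict):
    return 'description for ' + urldict['url']

slow_scraper = with_delay(fake_scraper, seconds=0.1)
print(slow_scraper({'url': 'https://example.com/a'}))
```

With the real function, the call would look like `smalldf['link'].apply(with_delay(getDescription, seconds=1.0))`.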

In [18]:
# what's the descriptions object? It's a pandas Series (basically, a one-column DataFrame)
print(type(descriptions))
print(descriptions)

<class 'pandas.core.series.Series'>
0    PROJECT CANCELLED 12/8/2010 -- This short plat...
1    Land Use Permit to adjust the boundary between...
2    Land use application to adjust the boundary be...
3    Cancelled due to no activity for more than 9 y...
4    PROJECT CANCELLED 5/23/2011 -- Project On Hold...
5    Land Use Permit to subdivde two parcels into t...
6    Land use permit to subdivide 1 parcel into 6 u...
7    Land Use Application to subdivide one developm...
8    PROJECT CANCELLED 2/23/12 -- PROJECT HOLD 11/2...
9    Land Use Application to allow a 3-story buildi...
Name: link, dtype: object


In [19]:
# So we can insert that into the dataframe as a new column
smalldf['newdescription'] = descriptions
# we could have done this in one step: 
# smalldf['newdescription'] = smalldf['link'].apply(getDescription) 
smalldf

Unnamed: 0,permitnum,permitclass,permitclassmapped,permittypemapped,description,statuscurrent,relatededg_landusepermit,originaladdress1,originalcity,originalstate,...,housingunitsremoved,housingunitsadded,applieddate,issueddate,expiresdate,decisiondate,permittypedesc,contractorcompanyname,estprojectcost,newdescription
0,3001212-LU,Single Family/Duplex,Residential,Master Use Permit,PROJECT CANCELLED 12/8/2010 -- This short plat...,Canceled,"{'type': 'Point', 'coordinates': [-122.2517206...",6519 S BANGOR ST,SEATTLE,WA,...,,,,,,,,,,PROJECT CANCELLED 12/8/2010 -- This short plat...
1,3001271-LU,Single Family/Duplex,Residential,Master Use Permit,Land Use Permit to adjust the boundary between...,Completed,"{'type': 'Point', 'coordinates': [-122.3569286...",4226 1ST AVE NW,SEATTLE,WA,...,0.0,0.0,2005-12-16,2006-05-15,2007-11-15,2006-05-10,,,,Land Use Permit to adjust the boundary between...
2,3001310-LU,Single Family/Duplex,Residential,Master Use Permit,Land use application to adjust the boundary be...,Completed,"{'type': 'Point', 'coordinates': [-122.3026217...",941 23RD AVE S,SEATTLE,WA,...,,,2007-02-14,2008-08-28,2011-08-14,2008-08-13,,,,Land use application to adjust the boundary be...
3,3001312-LU,,,Master Use Permit,Cancelled due to no activity for more than 9 y...,Canceled,"{'type': 'Point', 'coordinates': [-122.2913013...",3131 E MADISON ST,SEATTLE,WA,...,,,,,,,,,,Cancelled due to no activity for more than 9 y...
4,3001440-LU,Commercial,Non-Residential,Master Use Permit,PROJECT CANCELLED 5/23/2011 -- Project On Hold...,Canceled,"{'type': 'Point', 'coordinates': [-122.3721853...",9030 13TH AVE NW,SEATTLE,WA,...,,,2005-08-12,,,,,,,PROJECT CANCELLED 5/23/2011 -- Project On Hold...
5,3001442-LU,Single Family/Duplex,Residential,Master Use Permit,Land Use Permit to subdivde two parcels into t...,Completed,"{'type': 'Point', 'coordinates': [-122.2742069...",7960 46TH AVE S,SEATTLE,WA,...,0.0,0.0,2005-10-26,2007-04-23,2009-08-24,2006-04-27,,,,Land Use Permit to subdivde two parcels into t...
6,3001452-LU,Multifamily,Residential,Master Use Permit,Land use permit to subdivide 1 parcel into 6 u...,Completed,"{'type': 'Point', 'coordinates': [-122.3822076...",4017 SW ADMIRAL WAY,SEATTLE,WA,...,,,2005-12-02,2006-05-26,2007-11-26,2006-04-04,,,,Land use permit to subdivide 1 parcel into 6 u...
7,3001610-LU,Multifamily,Residential,Master Use Permit,Land Use Application to subdivide one developm...,Completed,"{'type': 'Point', 'coordinates': [-122.3587864...",918 2ND AVE W,SEATTLE,WA,...,,,2005-12-14,2012-07-12,2015-02-28,2012-02-15,,,,Land Use Application to subdivide one developm...
8,3001719-LU,Multifamily,Residential,Master Use Permit,PROJECT CANCELLED 2/23/12 -- PROJECT HOLD 11/2...,Canceled,"{'type': 'Point', 'coordinates': [-122.2864473...",14244 WESTWOOD PL NE,SEATTLE,WA,...,,,2006-04-29,,2012-03-13,2010-09-02,,,,PROJECT CANCELLED 2/23/12 -- PROJECT HOLD 11/2...
9,3001776-LU,Commercial,Non-Residential,Master Use Permit,Land Use Application to allow a 3-story buildi...,Completed,"{'type': 'Point', 'coordinates': [-122.3261037...",2701 EASTLAKE AVE E,SEATTLE,WA,...,0.0,0.0,2008-02-12,2009-11-05,2012-10-09,2009-09-25,,,,Land Use Application to allow a 3-story buildi...


### Parsing text
Now we have scraped the description for each project!

How do we get the number of parking spaces? Well, that depends on whether the city uses consistent terminology. 

You'll need to design a set of rules that cover different possibilities. For example, the description might say "2 parking spaces" or "TWO PARKING SPACES" or "1 uncovered and 1 covered parking space." Looking at your data is key.

For starters, let's take the simplest case. We'll add a column to our dataframe that indicates whether there is "no parking" in the project description.

In [20]:
# import the numpy library, which underlies pandas
# we'll use its nan (null) value to indicate missing data
import numpy as np

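def noparking(description):
    # a missing description (NaN) is not a string, so there is nothing to search
    if not isinstance(description, str):
        return np.nan
    # convert the description to lower case
    text = description.lower()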
    if 'no parking' in text:
        return True
    elif 'parking' in text:
        return False
    else:
        # capture all other possibilities
        return np.nan

# Now apply our function
smalldf['noparking'] = smalldf.description.apply(noparking)

In [21]:
# look at the output (just the noparking column)
smalldf.noparking

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9    False
Name: noparking, dtype: object

<div class="alert alert-block alert-info">
<strong>Thought exercise:</strong> If you want to get the number of parking spaces for each project, what would be your next step? In principle, how might you do that?
</div>
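One possible direction for the thought exercise above is a regular expression. The sketch below is a starting point under strong assumptions, not a complete parser: it assumes the count appears immediately before "parking space(s)", with at most one qualifying word in between. The `NUMBER_WORDS` lookup and the `parking_spaces` function name are made up for illustration.

```python
import re

# Hypothetical lookup for spelled-out counts; extend it to match your data
NUMBER_WORDS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
                'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10}

# a count (digits or a number word), an optional qualifier such as "covered",
# then "parking space" or "parking spaces"
PATTERN = re.compile(
    r'\b(\d+|' + '|'.join(NUMBER_WORDS) + r')\s+(?:\w+\s+)?parking\s+spaces?\b',
    re.IGNORECASE)

def parking_spaces(description):
    """Return the first parking count found in the text, or None."""
    match = PATTERN.search(description)
    if match is None:
        return None
    token = match.group(1).lower()
    return int(token) if token.isdigit() else NUMBER_WORDS[token]

print(parking_spaces('2 parking spaces'))
print(parking_spaces('TWO PARKING SPACES'))
print(parking_spaces('1 uncovered and 1 covered parking space'))
```

The third call picks up only the "1 covered parking space" part and so undercounts the total, which shows why compound phrasings need a rule of their own and why looking at your data is key.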

<div class="alert alert-block alert-info">
<strong>Let's generalize.</strong> What did we do here?
    
1. We obtained the URL for each page to scrape. (Here, it was given to us in the city's data file, but sometimes we'll have to reverse-engineer the composition of the URL.)
2. We examined a sample page, and identified the html tags that enclose the data we wanted to extract.
3. We wrote a function that pulled out the data for a specific page.
4. We applied that function to each URL / page. Since our URLs were in a pandas DataFrame, we could use the pandas <strong>apply</strong> method.
    
Every scraping project will pose different challenges, but normally it will involve each of these four steps.
</div>
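As an illustration of step 1, the link we inspected earlier has a simple composition: a fixed address followed by an `altId` parameter that holds the permit number. Assuming that pattern holds for other permits (which the single example here cannot confirm), a URL could be built directly from the `permitnum` column. The function name `permit_url` is made up for this sketch.

```python
from urllib.parse import urlencode

# Base address copied from the link we inspected earlier
BASE = 'https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx'

def permit_url(permitnum):
    """Build the detail-page URL for a permit number such as '3001212-LU'."""
    # ASSUMPTION: every permit page follows the same ?altId=<permitnum> pattern
    return BASE + '?' + urlencode({'altId': permitnum})

print(permit_url('3001212-LU'))
```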