## Logging on

Use Selenium to visit https://webapps1.chicago.gov/buildingrecords/ and accept the agreement.

> Think about when you use `.find_element_...` and when you use `.find_elementSSS_...`

In [1]:
from selenium import webdriver

In [2]:
driver = webdriver.Chrome()

In [3]:
driver.get('https://webapps1.chicago.gov/buildingrecords/')

In [4]:
driver.page_source

'<html lang="en"><head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n\n\t<!-- Global site tag (gtag.js) - Google Analytics -->\n\t<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=UA-5653376-8&amp;l=dataLayer&amp;cx=c"></script><script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-5653376-2"></script>\n\t<script>\n\t  window.dataLayer = window.dataLayer || [];\n\t  function gtag(){dataLayer.push(arguments);}\n\t  gtag(\'js\', new Date());\n\t  gtag(\'config\', \'UA-5653376-8\');\t \n\t</script>\n\n    <title>Building Permit and Inspection Records: Agreement</title>\n\n    <!-- CSS -->\n\t<link href="https://webapps1.cityofchicago.org/cdn/Bootstrap-4.0.0-beta.2/css/bootstrap.min.css" rel="stylesheet">\n    <link href="h

In [5]:
driver.find_element_by_id("rbnAgreement1").click()

In [6]:
submit_button = driver.find_element_by_xpath('//*[@id="submit"]')
submit_button.click()

## Searching

Search for **400 E 41ST ST**.

In [7]:
textbox = driver.find_element_by_id("fullAddress")

In [8]:
textbox.send_keys("400 E 41ST ST")

In [9]:
submit_button = driver.find_element_by_xpath('//*[@id="submit"]')

In [10]:
submit_button.click()

## Saving tables with pandas

Use pandas to save a CSV of all **permits** to `Permits - 400 E 41ST ST.csv`. Note that there are **different sections of the page**, not just one long permits table.
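Because the page has several tables, `tables[0]` only works as long as the permits table happens to come first. Here is a minimal sketch of a more defensive lookup, assuming the permits table is the only one with a `PERMIT #` column. The helper name `pick_table` is made up for illustration and is not part of pandas.

```python
def pick_table(tables, required_columns):
    """Return the first table whose columns include every required name.

    `tables` is expected to be the list of DataFrames from pd.read_html,
    but anything with a `.columns` attribute works.
    """
    for table in tables:
        columns = [str(col).strip() for col in table.columns]
        if all(name in columns for name in required_columns):
            return table
    raise ValueError(f"No table has columns {required_columns}")


# Hypothetical usage in the notebook:
# permits = pick_table(pd.read_html(driver.page_source), ["PERMIT #"])
# permits.to_csv("Permits - 400 E 41ST ST.csv", index=False)
```

Matching on column names keeps the code working even if the city reorders the sections of the page.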

In [11]:
import pandas as pd



In [12]:

tables = pd.read_html(driver.page_source)
tables[0]

| | PERMIT # | DATE ISSUED | DESCRIPTION OF WORK |
|---|---|---|---|
| 0 | 100845718 | | ERECT TWO SCAFFOLDS FROM 10/14/2019 TO 10/14/2... |
| 1 | 100778302 | | PERMIT EXPIRES ON 10/17/2018 Erection Starts: ... |
| 2 | 100721255 | | PERMIT EXPIRES ON 10/24/2017 ERECTION STARTS: ... |
| 3 | 100693399 | | INSTALLATION OF LOW VOLTAGE BURGLAR ALARM INTE... |
| 4 | 100665436 | | PERMIT EXPIRES ON 10/24/2016 ERECTION STARTS: ... |
| 5 | 100610771 | | PERMIT EXPIRES ON 10/28/2015 ERECTION STARTS: ... |
| 6 | 100581991 | | TRACE AND REPAIR BROKEN UNDERGROUND FEED TO EX... |
| 7 | 100479194 | | INTERNALLY LIT SIGN CABINET ON SOUTH ELEVATION |
| 8 | 100385721 | | RPACE CONCRETE SLAB WITH NEW AT GROUNGD FLOOR ... |
| 9 | 100267298 | | INTERIOR ALTERATIONS TO MEDICAL OFFICE SUITE 1... |


In [13]:
tables[0].to_csv('Permits - 400 E 41ST ST.csv', index=False)


## Saving tables the long way

Save a CSV of all DOB inspections to `Inspections - 400 E 41ST ST.csv`, but **you also need to save the URL to the inspection**. As a result, you won't be able to use pandas; you'll need to use a loop and create a list of dictionaries.

You can use Selenium (my recommendation) or you can feed the source to BeautifulSoup. You should have approximately 157 rows.

You'll probably need to find the table first, then the rows inside, then the cells inside of each row. You'll probably use lots of list indexing. I might recommend XPath for finding the table.

*Tip: If you get a "list index out of range" error, it's probably due to an issue involving `thead` vs `tbody` elements. What are they? What are they for? What's in them? There are a few ways to troubleshoot it.*
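To see where that error comes from, here is a small sketch on a made-up miniature table (not the real page source). It uses the standard library's `xml.etree.ElementTree` instead of Selenium so it runs without a browser, but the same idea applies when you look for `td` elements inside each `tr`: the header row in `thead` holds `th` cells, so it has no `td` to index.

```python
import xml.etree.ElementTree as ET

# A made-up miniature table, not the real page source
html = """
<table id="demo">
  <thead>
    <tr><th>INSP #</th><th>STATUS</th></tr>
  </thead>
  <tbody>
    <tr><td>111</td><td>PASSED</td></tr>
    <tr><td>222</td><td>FAILED</td></tr>
  </tbody>
</table>
"""
table = ET.fromstring(html.strip())

# Every <tr>, including the header row inside <thead>
all_rows = table.findall(".//tr")

# The header row has <th> cells, so searching it for <td> finds nothing
header_cells = all_rows[0].findall("td")
print(len(header_cells))

# Indexing [0] on that empty list is what raises "list index out of range"
try:
    header_cells[0]
except IndexError as error:
    print("IndexError:", error)

# Fix 1: only look at rows inside <tbody>
body_rows = table.findall("./tbody/tr")

# Fix 2: skip any row that has no <td> cells
data_rows = [row for row in all_rows if row.findall("td")]

numbers = [row.findall("td")[0].text for row in body_rows]
print(numbers)
```

Either fix works; restricting the search to `tbody` is usually the more readable of the two.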

In [14]:
items = driver.find_elements_by_id('resultstable_inspections')
for item in items:
    insp_table= item.find_elements_by_tag_name('tbody')
    for insp in insp_table:
        cells = insp.find_elements_by_tag_name('tr')
        for cell in cells:
            number = cell.find_elements_by_tag_name('td')[0].text
            print(number)
            date = cell.find_elements_by_tag_name('td')[1].text
            print(date)
            status = cell.find_elements_by_tag_name('td')[2].text
            print(status)
            description = cell.find_elements_by_tag_name('td')[3].text
            print(description)

13175960
11/30/2020
FAILED
ANNUAL INSPECTION
12770690
05/30/2019
PASSED
BOILER ANNUAL INSPECTION
12670542
05/21/2019
FAILED
CONSERVATION ANNUAL
12277260
08/27/2018
FAILED
CONSERVATION ANNUAL
12418304
05/30/2018
PASSED
BOILER ANNUAL INSPECTION
12136453
06/21/2017
PASSED
ANNUAL INSPECTION
12226018
06/21/2017
PASSED
ANNUAL INSPECTION
11228963
06/19/2017
FAILED
CONSERVATION ANNUAL
12101602
04/21/2017
PASSED
ANNUAL INSPECTION
12214968
03/22/2017
PASSED
SIGN ANNUAL INSPECTION
12051724
12/21/2016
FAILED
CONSERVATION COMPLAINT INSPECT
11750904
11/03/2016
PASSED
BOILER ANNUAL INSPECTION
11986288
09/01/2016
PASSED
REFRIGERATION ANNUAL
11787131
08/24/2016
PASSED
REFRIGERATION ANNUAL
11835125
08/02/2016
PASSED
ANNUAL INSPECTION
11933971
08/02/2016
PASSED
ANNUAL INSPECTION
11413712
10/23/2015
PASSED
BOILER ANNUAL INSPECTION
11623884
08/28/2015
PASSED
ANNUAL INSPECTION
11014536
07/28/2015
PASSED
ANNUAL INSPECTION
11244373
07/28/2015
PASSED
ANNUAL INSPECTION
11542833
07/20/2015
PASSED
SIGN ANNUAL INS

In [15]:
for item in items:
    urls = item.find_elements_by_tag_name('td')
    for url in urls:
        links = url.find_elements_by_tag_name('a')
        for link in links:
            print(link.get_attribute('href'))
        

https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=13175960
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12770690
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12670542
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12277260
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12418304
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12136453
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12226018
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=11228963
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12101602
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12214968
https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12051724
https://webapps1.chic

In [16]:
inspections = []
# The inspections table has a unique id, so grab that one element directly
table = driver.find_element_by_id('resultstable_inspections')
# Only rows inside tbody hold data; the header row lives in thead
rows = table.find_elements_by_css_selector('tbody tr')
for row in rows:
    # Look up the cells once per row instead of once per column
    cells = row.find_elements_by_tag_name('td')
    inspection = {}
    inspection['number'] = cells[0].text
    inspection['date'] = cells[1].text
    inspection['status'] = cells[2].text
    inspection['description'] = cells[3].text
    # The inspection number links to the details page
    links = row.find_elements_by_tag_name('a')
    for link in links:
        inspection['urls'] = link.get_attribute('href')
    print(inspection)
    inspections.append(inspection)

{'number': '13175960', 'date': '11/30/2020', 'status': 'FAILED', 'description': 'ANNUAL INSPECTION', 'urls': 'https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=13175960'}
{'number': '12770690', 'date': '05/30/2019', 'status': 'PASSED', 'description': 'BOILER ANNUAL INSPECTION', 'urls': 'https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12770690'}
{'number': '12670542', 'date': '05/21/2019', 'status': 'FAILED', 'description': 'CONSERVATION ANNUAL', 'urls': 'https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12670542'}
{'number': '12277260', 'date': '08/27/2018', 'status': 'FAILED', 'description': 'CONSERVATION ANNUAL', 'urls': 'https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=364923&insp=12277260'}
{'number': '12418304', 'date': '05/30/2018', 'status': 'PASSED', 'description': 'BOILER ANNUAL INSPECTION', 'urls': 'https://webapps1.chicago.gov/buildingrecords/inspectiondetails?addr=

In [17]:
df = pd.DataFrame(inspections)
df.head()

| | number | date | status | description | urls |
|---|---|---|---|---|---|
| 0 | 13175960 | 11/30/2020 | FAILED | ANNUAL INSPECTION | https://webapps1.chicago.gov/buildingrecords/i... |
| 1 | 12770690 | 05/30/2019 | PASSED | BOILER ANNUAL INSPECTION | https://webapps1.chicago.gov/buildingrecords/i... |
| 2 | 12670542 | 05/21/2019 | FAILED | CONSERVATION ANNUAL | https://webapps1.chicago.gov/buildingrecords/i... |
| 3 | 12277260 | 08/27/2018 | FAILED | CONSERVATION ANNUAL | https://webapps1.chicago.gov/buildingrecords/i... |
| 4 | 12418304 | 05/30/2018 | PASSED | BOILER ANNUAL INSPECTION | https://webapps1.chicago.gov/buildingrecords/i... |


In [18]:
df.to_csv("Inspections - 400 E 41ST ST.csv", index=False)

### Loopity loops

> If you used Selenium for the last question, copy the code and use it as a starting point for what we're about to do!

If you click the inspection number, it'll open up a new window that shows you details of the violations from that visit. Count the number of violations for each visit and save it in a new column called **num_violations**.

Save this file as `Inspections - 400 E 41ST ST - with counts.csv`.

Since it opens in a new window, we have to say "Hey Selenium, pay attention to that new window!" We do that with `driver.switch_to.window(driver.window_handles[-1])` (each window gets a `window_handle`, and we're just asking the driver to switch to the last one). A rough sketch of what your code will look like is here:

```python
# Click the link that opens the new window

# Switch to the new window/tab
driver.switch_to.window(driver.window_handles[-1])

# Do your scraping in here

# Close the new window/tab
driver.close()

# Switch back to the original window/tab
driver.switch_to.window(driver.window_handles[0])
```
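If the scraping step raises an error halfway through, the steps above can leave the driver pointed at the wrong window. One way to guard against that, as a sketch, is to wrap the switch in a context manager. The name `newest_window` is made up for illustration; it relies only on `driver.current_window_handle`, `driver.window_handles`, `driver.switch_to.window(...)` and `driver.close()`.

```python
from contextlib import contextmanager


@contextmanager
def newest_window(driver):
    """Switch to the most recently opened window, then close it and switch back."""
    original = driver.current_window_handle
    driver.switch_to.window(driver.window_handles[-1])
    try:
        yield driver
    finally:
        # Runs even if the scraping code raises an error
        driver.close()
        driver.switch_to.window(original)


# Hypothetical usage, after clicking a link that opens a new window:
# with newest_window(driver):
#     print(driver.title)
```

Remembering the original handle, rather than assuming it is `window_handles[0]`, keeps this working if more than two windows are open.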

You'll want to try a few inspection pages individually before you run the whole set: the ones that pass look very different from the ones with violations! There are a few ways to get the number of violations, some easier than others.

In [21]:
for item in items:
    insp_table= item.find_elements_by_tag_name('tbody')
    for insp in insp_table:
        cells = insp.find_elements_by_tag_name('tr')
        for cell in cells:
            links = cell.find_elements_by_tag_name('a')
            for link in links:
                link.click()
                

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=87.0.4280.88)
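
A `StaleElementReferenceException` means Selenium is holding on to an element that is no longer part of the page it is looking at, which can happen when elements found in an earlier cell are reused after the browser has moved on. One possible way around it, sketched below, is to copy each `href` into a plain string first (strings can't go stale) and then visit the URLs one at a time. This assumes the details pages can be loaded directly with `driver.get` in the same session. The function names and the selector are placeholders, not something taken from the real page. The locator strings `"tag name"` and `"css selector"` are the values behind `By.TAG_NAME` and `By.CSS_SELECTOR`.

```python
def collect_hrefs(rows):
    """Copy the first link's href out of each row as a plain string."""
    hrefs = []
    for row in rows:
        links = row.find_elements("tag name", "a")
        if links:
            hrefs.append(links[0].get_attribute("href"))
    return hrefs


def count_matches_per_page(driver, urls, selector):
    """Visit each URL and count the elements matching a CSS selector."""
    counts = []
    for url in urls:
        driver.get(url)
        counts.append(len(driver.find_elements("css selector", selector)))
    return counts


# Hypothetical usage; the selector is a placeholder to replace after
# inspecting a real inspection details page:
# rows = driver.find_elements("css selector", "#resultstable_inspections tbody tr")
# urls = collect_hrefs(rows)
# counts = count_matches_per_page(driver, urls, "YOUR-SELECTOR-HERE")
```

If every row has a link, `counts` lines up with the rows of `df` and can be assigned to `df["num_violations"]`.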
