# Scraping many pages + Using Selenium

## The pages we'll be looking at

If I want to read information about a specific mine, it takes a few steps. **Do these steps with your browser before you try any programming.**

1. Visit the [Mine Data Retrieval System](https://arlweb.msha.gov/drs/drshome.htm)
2. Scroll down to **Mine Identification Number (ID) Search**
3. Type in a mine ID number, such as `3503598`, and click **Search**
4. I'm on a page! It lists the MINE NAME and MINE OWNER.

After searching for and finding a mine, I can use this page to **find reports about this mine**. Some of the reports are on accidents, violations, inspections, health samples and more. To get those reports:

1. Search for a mine (if you haven't already)
2. Scroll down and change **Beginning Date** to `1/1/1995` (violation reports begin in 1995, accidents begin in 1983)
3. Select the report type of `Violations`
4. Click **Get Report**
5. I'm on a page! It lists ALL OF THE MINE'S VIOLATIONS.

By changing the report type, you can find all sorts of different data.

# Researching mine information

## Preparation 

### When you search for information on a specific mine, what URL should Selenium visit first?

- *TIP: the answer is NOT `https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp`*

We should start with this URL: `https://arlweb.msha.gov/drs/drshome.htm`

### How can you identify the text field we're going to type the Mine ID into?

Selenium can find elements by:

- Name
- Class name
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

It's an `<input>` with the attribute `name="MineId"`.

### How can you identify the search button we're going to click, or the form we're going to submit?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

- The page has several `<form>`s that all share `name="search"`, so selecting by name could grab the wrong one. We can identify the right form by its XPath: `//*[@id="content"]/table[3]/form`
- The search button is an `<input>` with no distinctive attributes to search on. We can identify it by its XPath: `//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input`
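
To see why a shared `name` is ambiguous and how a positional XPath resolves it, here is a minimal sketch using Python's built-in `xml.etree.ElementTree`, which supports a small subset of XPath. The markup is a made-up, simplified stand-in for the search page, not the real MSHA HTML, and Selenium evaluates full XPath inside the browser rather than through this module.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the search page (NOT the real markup).
page = """
<div id="content">
  <table><form name="search"><input name="MineName"/></form></table>
  <table><form name="search"><input name="MineId"/></form></table>
</div>
"""
root = ET.fromstring(page)

# Searching by name alone is ambiguous: both forms are called "search".
forms = root.findall(".//form[@name='search']")
print(len(forms))

# A positional path picks exactly one form: the one in the second table.
form = root.find("./table[2]/form")
print(form.find("input").get("name"))
```

The trade-off is that positional paths are brittle: if the page gains or loses a table, the index has to change too.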

### Use Selenium to search using the mine ID `3901432`. Get me the operator's name by scraping.

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source, "html.parser")`*

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

In [3]:
from bs4 import BeautifulSoup

In [4]:
driver = webdriver.Chrome()

In [4]:
driver.get("https://arlweb.msha.gov/drs/drshome.htm")

In [5]:
driver.find_element_by_name("MineId").send_keys("3901432")

In [6]:
driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input').click()

In [7]:
doc = BeautifulSoup(driver.page_source, "html.parser")

In [8]:
doc.find("table", bgcolor="#FFFFBF").find_all("tr")[2].find_all("td")[1].text

'Krueger Brothers Gravel & Dirt '

# Using .apply to find data about SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the operator's name for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [9]:
import pandas as pd

In [10]:
df = pd.read_csv("mines-subset.csv")
df

Unnamed: 0,id
0,4104757
1,801306
2,3609931


### Open up `mines-subset.csv` in a text editor, then look at your dataframe. Is something different about them?

The mine ID in row 1 was imported without its leading zero, because pandas read the `id` column as numbers. Searching with the shortened ID will not work.

In [11]:
df = pd.read_csv("mines-subset.csv", dtype=str)
df

Unnamed: 0,id
0,4104757
1,801306
2,3609931
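
The leading-zero problem is easy to reproduce without the real file. This sketch uses only the standard library (not pandas) and a made-up three-line CSV. The `0801306` value and the idea that mine IDs are always seven digits are assumptions for illustration, so check them against your own data.

```python
import csv
import io

# Hypothetical file contents, assuming the second ID starts with a zero.
raw = "id\n4104757\n0801306\n3609931\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Converting to a number silently drops the leading zero...
as_numbers = [int(row["id"]) for row in rows]
print(as_numbers)

# ...while keeping the value as text preserves it.
as_text = [row["id"] for row in rows]
print(as_text)

# If the zero is already gone, zfill can pad it back (assuming 7-digit IDs).
repaired = [str(n).zfill(7) for n in as_numbers]
print(repaired)
```

In pandas, `dtype=str` plays the "keep it as text" role.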


### Scrape the operator's name for each of those mines and print it

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*

In [12]:
def scrape_operator_name(row):
    driver.get("https://arlweb.msha.gov/drs/drshome.htm")
    driver.find_element_by_name("MineId").send_keys(row['id'])
    driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input').click()
    doc = BeautifulSoup(driver.page_source, "html.parser")
    operator_name = doc.find("table", bgcolor="#FFFFBF").find_all("tr")[2].find_all("td")[1].text
    print(operator_name)

In [13]:
df.apply(scrape_operator_name, axis=1)

Dirt Works 
Holley Dirt Company, Inc 
M.R. Dirt Inc. 


0    None
1    None
2    None
dtype: object
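
The column of `None` values appears because the function prints the name but does not return anything. A minimal sketch of the difference, using a plain list and hypothetical function names instead of the scraper:

```python
names = ["Dirt Works", "M.R. Dirt Inc."]

def show(name):
    # Prints the value, but implicitly returns None.
    print(name)

def give_back(name):
    # Returns the value so the caller can store it.
    return name

printed = [show(n) for n in names]
returned = [give_back(n) for n in names]

print(printed)
print(returned)
```

`.apply` behaves the same way: whatever the function returns for each row becomes that row's value in the result.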

### Scrape the operator's name and save it into a new column

- *TIP: Use .apply and a function*
- *TIP: Remember to use `return`*

In [14]:
def scrape_operator_name_2(row):
    driver.get("https://arlweb.msha.gov/drs/drshome.htm")
    driver.find_element_by_name("MineId").send_keys(row['id'])
    driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input').click()
    doc = BeautifulSoup(driver.page_source, "html.parser")
    operator_name = doc.find("table", bgcolor="#FFFFBF").find_all("tr")[2].find_all("td")[1].text
    return operator_name

In [15]:
df['name'] = df.apply(scrape_operator_name_2, axis=1)

In [16]:
df

Unnamed: 0,id,name
0,4104757,Dirt Works
1,801306,"Holley Dirt Company, Inc"
2,3609931,M.R. Dirt Inc.


# Researching mine violations

Read the very top again to remember how to find mine violations.

### When you search for a mine's violations, what URL is Selenium going to start on?

- *TIP: `requests` can send form data directly, which lets it jump into the middle of a series of steps, but Selenium has to start at the beginning*

We need to start back at the beginning: `https://arlweb.msha.gov/drs/drshome.htm`

### When you're searching for violations from the Mine Information page, how are you going to identify the "Beginning Date" field?

We can identify it by the `<input>` tag with `name="BDate"`

### When you're searching for violations from the Mine Information page, how are you going to identify the "Violations" button?

We need to go by the xpath: `//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input`

### When you're searching for violations from the Mine Information page, how are you going to identify the form or the button to click to get a list of the violations?

We also need to go by the xpath: `//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input`

### Using the mine ID `3901432`, scrape all of their violations since 1/1/1995

**Save this into a CSV called `3901432-violations.csv`.** This CSV must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

**Tips:**

- *TIP: It's probably worth it to print them all first, then save them to a CSV once you know it's all working.*
- *TIP: You'll use the parent pattern - get the ROWS first (tr), then loop through and get the TABLE CELLS (td)*
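
The parent pattern from the tip can be tried on a tiny table first. This sketch uses the standard library's `xml.etree.ElementTree` on made-up, well-formed markup. The notebook itself uses BeautifulSoup on the real page, where the column positions differ.

```python
import xml.etree.ElementTree as ET

# Made-up table: a header row followed by two data rows.
table = ET.fromstring("""
<table>
  <tr><th>Citation</th><th>Standard</th></tr>
  <tr><td>111</td><td><a href="http://example.com/a">56.1</a></td></tr>
  <tr><td>222</td><td><a href="http://example.com/b">56.2</a></td></tr>
</table>
""")

records = []
# Parent first: get the rows, skipping the header row.
for tr in table.findall("tr")[1:]:
    # Then the children: the cells inside this row.
    tds = tr.findall("td")
    link = tds[1].find("a")
    records.append({
        "citation_no": tds[0].text.strip(),
        "standard_no": link.text.strip(),
        "standard_link": link.get("href"),
    })

print(records)
```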

In [5]:
driver.get("https://arlweb.msha.gov/drs/drshome.htm")
driver.find_element_by_name("MineId").send_keys("3901432")
driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input').click()

In [6]:
driver.find_element_by_name("BDate").send_keys("1/1/1995")

In [7]:
driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input').click()

In [8]:
driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input').click()

In [9]:
doc = BeautifulSoup(driver.page_source, "html.parser")

In [10]:
trs = doc.find_all("table")[7].find_all("tr")[1:]

In [11]:
violations = []
for tr in trs:
    violation = {}
    tds = tr.find_all("td")
    violation['violator_id'] = "3901432"
    violation['violator_name'] = tds[0].text.strip()
    violation['citation_no'] = tds[2].text.strip()
    violation['case_no'] = tds[3].text.strip()
    violation['standard_no'] = tds[10].find_all("font")[2].text.strip()
    violation['standard_link'] = tds[10].a["href"]
    violation['proposed_penalty'] = tds[11].text.strip()
    violation['amount_paid'] = tds[14].text.strip()
    violations.append(violation)

In [31]:
df = pd.DataFrame(violations, columns=["violator_id", "violator_name", "citation_no", "case_no", "standard_no", "standard_link", "proposed_penalty", "amount_paid"])
df.head()

Unnamed: 0,violator_id,violator_name,citation_no,case_no,standard_no,standard_link,proposed_penalty,amount_paid
0,3901432,Krueger Brothers Gravel & Dirt,8750964,361866,56.18010,http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...,100.0,100.0
1,3901432,Krueger Brothers Gravel & Dirt,6426439,260865,56.4201(a)(2),http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,100.0
2,3901432,Krueger Brothers Gravel & Dirt,6426438,260865,56.4101,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,100.0
3,3901432,Krueger Brothers Gravel & Dirt,6588189,260865,56.14200,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,100.0
4,3901432,Krueger Brothers Gravel & Dirt,6588210,238554,50.30(a),http://www.gpo.gov/fdsys/pkg/CFR-2010-title30-...,100.0,100.0


In [32]:
df.to_csv("3901432-violations.csv", index=False)

In [33]:
df_test = pd.read_csv("3901432-violations.csv", dtype=str)
df_test.head()

Unnamed: 0,violator_id,violator_name,citation_no,case_no,standard_no,standard_link,proposed_penalty,amount_paid
0,3901432,Krueger Brothers Gravel & Dirt,8750964,361866,56.18010,http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...,100.0,100.0
1,3901432,Krueger Brothers Gravel & Dirt,6426439,260865,56.4201(a)(2),http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,100.0
2,3901432,Krueger Brothers Gravel & Dirt,6426438,260865,56.4101,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,100.0
3,3901432,Krueger Brothers Gravel & Dirt,6588189,260865,56.14200,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...,100.0,100.0
4,3901432,Krueger Brothers Gravel & Dirt,6588210,238554,50.30(a),http://www.gpo.gov/fdsys/pkg/CFR-2010-title30-...,100.0,100.0
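
The notebook saves through pandas. As an alternative sketch, the same kind of list of dictionaries can be written with the standard library's `csv.DictWriter`. The single record below reuses three values from the first row of the output above, and it is written to an in-memory buffer rather than a real file.

```python
import csv
import io

violations = [
    {"citation_no": "8750964", "case_no": "361866", "standard_no": "56.18010"},
]

# fieldnames fixes the column order, like columns= in pd.DataFrame.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["citation_no", "case_no", "standard_no"])
writer.writeheader()
writer.writerows(violations)

# The csv module reads every value back as text, so "56.18010" is unchanged.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)
```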


# Using .apply to save mine data for SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the violations for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [34]:
df_mines = pd.read_csv("mines-subset.csv", dtype=str)
df_mines

Unnamed: 0,id
0,4104757
1,801306
2,3609931


### Scrape the violations for each mine

**Save each mine's violations into separate CSV files.** Each CSV file must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

Make sure you are saving them into **separate files.** It might be nice to name them after the mine id.

- *TIP: Use .apply for this*
- *TIP: Print out the ID before you start scraping. That way you can take that ID and search manually to see if there is anything weird about the results.*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*
- *TIP: It's probably worth it to print the fields first, then save them to a CSV once you know it's all working.*

In [35]:
def scrape_all_violations(row):
    
    # Direct the browser from the entry page to the mine overview page
    driver.get("https://arlweb.msha.gov/drs/drshome.htm")
    driver.find_element_by_name("MineId").send_keys(row['id'])
    driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input').click()
       
    # Direct the browser to the violations report page
    driver.find_element_by_name("BDate").send_keys("1/1/1995")
    driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input').click()
    driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input').click()
    
    # Scrape the violations report, guarding against blank standards and unassessed penalties
    doc = BeautifulSoup(driver.page_source, "html.parser")
    trs = doc.find_all("table")[7].find_all("tr")[1:]
    violations = []
    for tr in trs:
        violation = {}
        tds = tr.find_all("td")
        violation['violator_id'] = row['id']
        violation['violator_name'] = tds[0].text.strip()
        violation['citation_no'] = tds[2].text.strip()
        violation['case_no'] = tds[3].text.strip()
        if tds[10].text.strip() == "":
            violation['standard_no'] = ""
            violation['standard_link'] = ""
        else:
            violation['standard_no'] = tds[10].find_all("font")[2].text.strip()
            violation['standard_link'] = tds[10].a["href"]
        if tds[11].text == "Not  Assessed Yet" or tds[11].text == "Non-Assessable":
            violation['proposed_penalty'] = ""
            violation['amount_paid'] = ""
        else:
            violation['proposed_penalty'] = tds[11].text.strip()
            violation['amount_paid'] = tds[14].text.strip()
        violations.append(violation)
    df = pd.DataFrame(violations, columns=["violator_id", "violator_name", "citation_no", "case_no", "standard_no", "standard_link", "proposed_penalty", "amount_paid"])
    df.to_csv(row['id']+"-violations.csv", index=False)

In [36]:
df_mines.apply(scrape_all_violations, axis=1)

0    None
1    None
2    None
dtype: object
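
The placeholder check inside `scrape_all_violations` depends on exact spacing (note the two spaces in `Not  Assessed Yet`). A slightly more forgiving variant, sketched here as a hypothetical helper `clean_penalty` that is not part of the original notebook, collapses whitespace before comparing:

```python
# Placeholder texts taken from the notebook, written with single spaces.
PLACEHOLDERS = {"Not Assessed Yet", "Non-Assessable"}

def clean_penalty(text):
    """Return '' for placeholder cells, otherwise the trimmed cell text."""
    # split() + join collapses runs of whitespace and trims both ends.
    normalized = " ".join(text.split())
    return "" if normalized in PLACEHOLDERS else normalized

print(repr(clean_penalty("Not  Assessed Yet")))
print(repr(clean_penalty(" Non-Assessable ")))
print(repr(clean_penalty(" 100.00 ")))
```

Unlike the original check, which blanks both `proposed_penalty` and `amount_paid` based on one cell, this helper would be applied to each cell on its own, so compare the results before switching.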