# Scraping many pages + Using Selenium

## The pages we'll be looking at

Reading specific information about a specific mine takes a few steps. **Do these steps with your browser before you try any programming.**

1. Visit the [Mine Data Retrieval System](https://arlweb.msha.gov/drs/drshome.htm)
2. Scroll down to **Mine Identification Number (ID) Search**
3. Type in a mine ID number, such as `3503598`, and click **Search**
4. I'm on a page! It lists the MINE NAME and MINE OWNER.

After searching for and finding a mine, I can use this page to **find reports about this mine**. Some of the reports are on accidents, violations, inspections, health samples and more. To get those reports:

1. Search for a mine (if you haven't already)
2. Scroll down and change **Beginning Date** to `1/1/1995` (violation reports begin in 1995, accidents begin in 1983)
3. Select the report type of `Violations`
4. Click **Get Report**
5. I'm on a page! It lists ALL OF THE MINE'S VIOLATIONS.

By changing the report type you're searching for you can find all sorts of different data.

# Researching mine information

## Preparation 

### When you search for information on a specific mine, what URL should Selenium visit first?

- *TIP: the answer is NOT `https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp`*

In [142]:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://arlweb.msha.gov/drs/drshome.htm")


### How can you identify the text field we're going to type the Mine ID into?

Selenium can find elements by:

- Name
- Class name
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?
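One thing to watch for: the `find_element_by_*` methods used in this notebook were removed in Selenium 4.3, where the same lookups are written `driver.find_element(by, value)`. Here's a minimal sketch of that spelling, assuming the search box is still named `MineId`. The `enter_mine_id` helper is made up for illustration, and it passes the strategy as the plain string `"name"` (the value of Selenium's `By.NAME` constant) so the block needs no imports of its own.

```python
def enter_mine_id(driver, mine_id):
    """Type a mine ID into the search box and return the field element."""
    # "name" is the same string as selenium.webdriver.common.by.By.NAME;
    # "xpath", "id" and "css selector" work the same way.
    field = driver.find_element("name", "MineId")
    field.send_keys(mine_id)
    return field

# Usage, assuming Chrome and a matching chromedriver are installed:
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://arlweb.msha.gov/drs/drshome.htm")
#   enter_mine_id(driver, "3901432")
```

In everyday code you would more likely import `By` and write `driver.find_element(By.NAME, "MineId")`, which does the same thing.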

In [143]:
# The Mine ID field can be found by its name attribute
# (XPath noted while inspecting the page: //*[@id="inputdrs"])
id_input = driver.find_element_by_name('MineId')
id_input.send_keys("3901432")



### How can you identify the search button we're going to click, or the form we're going to submit?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [144]:
search_id = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
search_id.click()

### Use Selenium to search using the mine ID `3901432`. Get me the operator's name by scraping.

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

In [145]:
from bs4 import BeautifulSoup
import requests

In [146]:
doc = BeautifulSoup(driver.page_source, 'html.parser')
doc.find_all('table')[1].find_all('tr')[2].find_all('td')[1].text

'Krueger Brothers Gravel & Dirt '

# Using .apply to find data about SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the operator's name for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [147]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [148]:
df = pd.read_csv("mines-subset.csv", dtype = {'id' :'str'})
df

Unnamed: 0,id
0,4104757
1,0801306
2,3609931


### Open up `mines-subset.csv` in a text editor, then look at your dataframe. Is something different about them?

In [149]:
# Raw file contents (note the leading zero on the second ID):
#id
#4104757
#0801306
#3609931
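The difference is the leading zero. Here's a minimal sketch of why it matters, using the three IDs above as an in-memory CSV (the `csv_text` string stands in for the real file): by default pandas reads the column as integers and the zero disappears, while `dtype={"id": str}` keeps each ID exactly as typed.

```python
import io

import pandas as pd

# Stand-in for mines-subset.csv, reduced to just the id column
csv_text = "id\n4104757\n0801306\n3609931\n"

# Default parsing turns the IDs into integers, so 0801306 becomes 801306
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Reading the column as text keeps the leading zero
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(as_numbers["id"].tolist())
print(as_text["id"].tolist())
```

Since the website's search box expects the ID with its zero, the text version is the one to send to Selenium.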

In [150]:
def printrow(row):
    print(row['id'])
    
df.apply(printrow, axis = 1)

4104757
0801306
3609931


0    None
1    None
2    None
dtype: object

### Scrape the operator's name for each of those mines and print it

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*

In [151]:
def getoperator(row):
    # Reuse the one browser opened below instead of launching a new one per row
    driver.get("https://arlweb.msha.gov/drs/drshome.htm")
    
    # Type the mine ID into the search box and run the search
    id_input = driver.find_element_by_name('MineId')
    id_input.send_keys(row['id'])
    
    search_id = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    search_id.click()
    
    # Read the operator's name from the results page
    name_tag = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b')
    print(name_tag.text)
    print(row['id'])

driver = webdriver.Chrome()
df.apply(getoperator, axis=1)
driver.close()

Dirt Works
4104757
Holley Dirt Company, Inc
0801306
M.R. Dirt Inc.
3609931


### Scrape the operator's name and save it into a new column

- *TIP: Use .apply and a function*
- *TIP: Remember to use `return`*
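Why `return` matters: with `axis=1`, `.apply` builds a new Series out of whatever your function returns for each row. A function that only prints gives you a column of `None`. Here's a small sketch with no browser involved; the `FAKE_OPERATORS` lookup and its names are made up and stand in for the scraping step.

```python
import pandas as pd

# Made-up lookup standing in for the website; real code would scrape instead
FAKE_OPERATORS = {"4104757": "Operator A", "0801306": "Operator B"}

def print_operator(row):
    # Prints the value but returns nothing, so .apply collects None
    print(FAKE_OPERATORS[row["id"]])

def get_operator(row):
    # The returned value becomes this row's entry in the new column
    return FAKE_OPERATORS[row["id"]]

mines = pd.DataFrame({"id": ["4104757", "0801306"]})
printed = mines.apply(print_operator, axis=1)
mines["operator"] = mines.apply(get_operator, axis=1)

print(printed.tolist())
print(mines)
```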

In [152]:
def setoperator(row):
    driver.get("https://arlweb.msha.gov/drs/drshome.htm")
    
    id_input = driver.find_element_by_name('MineId')
    id_input.send_keys(row['id'])
    
    search_id = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    search_id.click()
    
    name_tag = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b')
    return name_tag.text

driver = webdriver.Chrome()
df['operator'] = df.apply(setoperator, axis = 1)
driver.close()

In [153]:
df

Unnamed: 0,id,operator
0,4104757,Dirt Works
1,0801306,"Holley Dirt Company, Inc"
2,3609931,M.R. Dirt Inc.


# Researching mine violations

Read the very top again to remember how to find mine violations.

### When you search for a mine's violations, what URL is Selenium going to start on?

- *TIP: `requests` can send form data to load in the middle of a bunch of steps, but Selenium has to start at the beginning.*

In [154]:
#https://arlweb.msha.gov/drs/drshome.htm

### When you're searching for violations from the Mine Information page, how are you going to identify the "Beginning Date" field?

In [155]:
#name = BDate

### When you're searching for violations from the Mine Information page, how are you going to identify the "Violations" button?

In [156]:
#//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input

### When you're searching for violations from the Mine Information page, how are you going to identify the form or the button to click to get a list of the violations?

In [157]:
#//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input

### Using the mine ID `3901432`, scrape all of their violations since 1/1/1995

**Save this into a CSV called `3901432-violations.csv`.** This CSV must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

**Tips:**

- *TIP: It's probably worth it to print them all first, then save them to a CSV once you know it's all working.*
- *TIP: You'll use the parent pattern - get the ROWS first (tr), then loop through and get the TABLE CELLS (td)*
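The parent pattern from the tip above can be sketched on a tiny page. The HTML here is made up (three columns instead of the real table's many) and only the `drsviols` class name comes from the actual site; the point is the shape of the loop: find the rows, then pull the cells out of each row, and check for a link before reading it.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the violations page, with far fewer columns
html = """
<table>
  <tr class="drsviols"><td>A1</td><td>B1</td><td><a href="https://example.com/1">S1</a></td></tr>
  <tr class="drsviols"><td>A2</td><td>B2</td><td></td></tr>
</table>
"""

doc = BeautifulSoup(html, "html.parser")

rows = []
for tr in doc.find_all("tr", class_="drsviols"):   # parents first: each row
    cells = tr.find_all("td")                      # then the cells inside it
    link = cells[2].find("a")                      # None when the cell has no link
    rows.append({
        "first": cells[0].text.strip(),
        "second": cells[1].text.strip(),
        "standard": link.text.strip() if link else None,
        "link": link["href"] if link else None,
    })

print(rows)
```

Setting the key to `None` when the link is missing means every dictionary ends up with the same keys, which makes the later DataFrame easier to reason about.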

In [158]:
driver = webdriver.Chrome()
driver.get("https://arlweb.msha.gov/drs/drshome.htm")

id_input = driver.find_element_by_name('MineId')
id_input.send_keys("3901432")

search_id = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
search_id.click()

beginning = driver.find_element_by_name('BDate')
beginning.send_keys('1/1/1995')

button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')
button.click()

button2 = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')
button2.click()




In [159]:
doc = BeautifulSoup(driver.page_source, 'html.parser')

In [160]:
lines = doc.find_all('tr', class_= 'drsviols')
mylist = []
for line in lines:
    dic = {}
    cells = line.find_all('td')
    dic['citation number'] = cells[2].text.strip()
    dic['Case number'] = cells[3].text.strip()
    if (cells[10].a):
        dic['Standard violated'] = cells[10].a.text.strip()
        dic['link to standard'] = cells[10].find('a')['href']
    dic['Proposed penalty'] = cells[11].text.strip()
    dic['Amount paid to date'] = cells[-1].text.strip()
    mylist.append(dic)

mylist
    

[{'Amount paid to date': '100.00',
  'Case number': '000361866',
  'Proposed penalty': '100.00',
  'Standard violated': '56.18010',
  'citation number': '8750964',
  'link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-vol1/pdf/CFR-2014-title30-vol1-sec56-18010.pdf'},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Proposed penalty': '100.00',
  'Standard violated': '56.4201(a)(2)',
  'citation number': '6426439',
  'link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4201.pdf'},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Proposed penalty': '100.00',
  'Standard violated': '56.4101',
  'citation number': '6426438',
  'link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4101.pdf'},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Proposed penalty': '100.00',
  'Standard violated': '56.14200',
  ...

In [161]:
import pandas as pd

df2 = pd.DataFrame(mylist)
df2.to_csv('3901432-violations.csv', index = False)
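Because the loop only sets `Standard violated` and `link to standard` when a link exists, some rows can be missing those keys, and a mine with no violations gives an empty list. Here's a small sketch, with made-up values, of how passing `columns=` to `pd.DataFrame` pins the field order, fills the gaps with `NaN`, and still writes a header row for an empty result. The `FIELDS` name is just for illustration.

```python
import pandas as pd

FIELDS = ["citation number", "Case number", "Standard violated",
          "link to standard", "Proposed penalty", "Amount paid to date"]

# Made-up record that lacks the two "standard" keys, as happens with no link
records = [{"citation number": "1", "Case number": "2",
            "Proposed penalty": "100.00", "Amount paid to date": "100.00"}]

with_gap = pd.DataFrame(records, columns=FIELDS)   # missing keys become NaN
empty = pd.DataFrame([], columns=FIELDS)           # no rows, but columns exist

# to_csv with no path returns the CSV text instead of writing a file
print(with_gap.to_csv(index=False))
print(empty.to_csv(index=False))
```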

# Using .apply to save mine data for SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the violations for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [162]:
df

Unnamed: 0,id,operator
0,4104757,Dirt Works
1,0801306,"Holley Dirt Company, Inc"
2,3609931,M.R. Dirt Inc.


### Scrape the violations for each mine

**Save each mine's violations into separate CSV files.** Each CSV file must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

Make sure you are saving them into **separate files.** It might be nice to name them after the mine id.

- *TIP: Use .apply for this*
- *TIP: Print out the ID before you start scraping. That way you can take that ID and search manually to see if there is anything weird about the results.*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*
- *TIP: It's probably worth it to print the fields first, then save them to a CSV once you know it's all working.*
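Before wiring in Selenium, it can help to test the "one file per mine" part on its own. This is a rough sketch under a few assumptions: `fake_violations` is a made-up stand-in for the scraping step, the rows it returns are invented, and the files go into a temporary folder so nothing in your project gets overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd

def fake_violations(mine_id):
    # Hypothetical stand-in for the Selenium scrape: returns invented rows
    return [{"citation number": "0000001", "Case number": "000000001"}]

def save_violations(row, out_dir):
    # Build the filename from the mine ID so each mine gets its own CSV
    path = Path(out_dir) / f"{row['id']}-violations.csv"
    pd.DataFrame(fake_violations(row["id"])).to_csv(path, index=False)
    return path.name

mines = pd.DataFrame({"id": ["4104757", "0801306", "3609931"]})
out_dir = tempfile.mkdtemp()

# args passes out_dir along to save_violations after the row
written = mines.apply(save_violations, axis=1, args=(out_dir,))
print(written.tolist())
```

Returning the filename from the function gives you a quick list of what was written, instead of a column of `None`.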

In [163]:
def violations(row):
    driver = webdriver.Chrome()
    driver.get("https://arlweb.msha.gov/drs/drshome.htm")
    
    id_input = driver.find_element_by_name('MineId')
    id_input.send_keys(row['id'])
    
    search_id = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    search_id.click()
    
    beginning = driver.find_element_by_name('BDate')
    beginning.send_keys('1/1/1995')

    button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')
    button.click()

    button2 = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')
    button2.click()

    doc = BeautifulSoup(driver.page_source, 'html.parser')
    lines = doc.find_all('tr', class_= 'drsviols')
    
    mylist = []
    for line in lines:
        dic = {}
        cells = line.find_all('td')
        dic['citation number'] = cells[2].text.strip()
        dic['Case number'] = cells[3].text.strip()
        if (cells[10].a):
            dic['Standard violated'] = cells[10].a.text.strip()
            dic['link to standard'] = cells[10].find('a')['href']
        dic['Proposed penalty'] = cells[11].text.strip()
        dic['Amount paid to date'] = cells[-1].text.strip()
        mylist.append(dic)

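    # Name the file after the mine ID, matching 3901432-violations.csv above
    violations_df = pd.DataFrame(mylist)
    violations_df.to_csv(row['id'] + '-violations.csv', index = False)
    driver.quit()  # quit() closes the window and also ends the driver session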
    
df.apply(violations, axis = 1)
    
    



0    None
1    None
2    None
dtype: object