# Scraping many pages + Using requests

## The pages we'll be looking at

If I want to read information about a specific mine, it takes a few steps. **Do these steps with your browser before you try any programming.**

1. Visit the [Mine Data Retrieval System](https://arlweb.msha.gov/drs/drshome.htm)
2. Scroll down to **Mine Identification Number (ID) Search**
3. Type in a mine ID number, such as `3503598`, and click **Search**
4. I'm on a page! It lists the MINE NAME and MINE OWNER.

After searching for and finding a mine, I can use this page to **find reports about this mine**. The reports cover accidents, violations, inspections, health samples and more. To get those reports:

1. Search for a mine (if you haven't already)
2. Scroll down and change **Beginning Date** to `1/1/1995` (violation reports begin in 1995, accidents begin in 1983)
3. Select the report type of `Violations`
4. Click **Get Report**
5. I'm on a page! It lists ALL OF THE MINE'S VIOLATIONS.

By changing the report type you're searching for, you can find all sorts of different data.

# Researching mine information

## Preparation 

### When you search for information on a specific mine, what URL are you going to be scraping?

- *TIP: the answer is NOT `https://arlweb.msha.gov/drs/drshome.htm`*

We need info from this page here: `https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp`

### When you search for information on a specific mine, do you need form data? If so, what is your form data going to be?

Inspecting the Form Data in the Headers section for the request at `https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp` yields the following information:

- MineId: 3503598 
- x: 30 (not sure if we need this)
- y: 7 (not sure if we need this)
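
To see what "form data" looks like on the wire, here is a minimal sketch that form-encodes the fields listed above with the standard library. When you pass a dict as `data=`, `requests` form-encodes it in the same way. Nothing is sent to the server here, and whether `x` and `y` are actually required is still an open question.

```python
from urllib.parse import urlencode

# The fields seen in the browser's Form Data panel
formdata = {
    "MineId": "3503598",
    "x": "30",  # not sure if we need this
    "y": "7",   # not sure if we need this
}

# Form-encode the dict into the body of a POST request
body = urlencode(formdata)
print(body)  # MineId=3503598&x=30&y=7
```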

### Use `requests` to search using the mine ID `3901432`. Get me the operator's name by scraping.

In [1]:
import requests

In [8]:
from bs4 import BeautifulSoup

In [13]:
import pandas as pd

In [47]:
formdata = {
    "MineId": "3901432"
}

In [48]:
response = requests.post("https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp", data=formdata)

In [49]:
doc = BeautifulSoup(response.text, "html.parser")

In [50]:
doc.find("table", bgcolor="#FFFFBF").find_all("tr")[2].find_all("td")[1].text

'Krueger Brothers Gravel & Dirt '
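
The chained `.find(...).find_all(...)[2]` call raises an error if the page has no matching table or fewer rows than expected. Below is a sketch of a more defensive version. The function name `extract_operator_name` and the sample HTML are made up for illustration; the real page layout is only assumed to match the selector used above.

```python
from bs4 import BeautifulSoup

def extract_operator_name(html):
    """Return the operator name from a mine info page, or None if it is missing."""
    doc = BeautifulSoup(html, "html.parser")
    table = doc.find("table", bgcolor="#FFFFBF")
    if table is None:
        return None
    rows = table.find_all("tr")
    if len(rows) < 3:
        return None
    cells = rows[2].find_all("td")
    if len(cells) < 2:
        return None
    # strip() removes the trailing space seen in the output above
    return cells[1].text.strip()

# Hypothetical HTML that mimics the assumed structure (third row, second cell)
sample_html = """
<table bgcolor="#FFFFBF">
  <tr><td>Mine ID</td><td>0000000</td></tr>
  <tr><td>Mine Name</td><td>Example Pit</td></tr>
  <tr><td>Operator</td><td>Example Operator Inc </td></tr>
</table>
"""

print(extract_operator_name(sample_html))
print(extract_operator_name("<p>No results</p>"))
```

With the real site you would pass `response.text` instead of `sample_html`.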

# Using .apply to find data about SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the operator's name for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [14]:
df = pd.read_csv("mines-subset.csv")
df

|   | id |
|---|---|
| 0 | 4104757 |
| 1 | 801306 |
| 2 | 3609931 |


### Open up `mines-subset.csv` in a text editor, then look at your dataframe. Is something different about them? If so, make them match.

- *TIP: I can help with this.*

In [15]:
df = pd.read_csv("mines-subset.csv", dtype=str)
df

|   | id |
|---|---|
| 0 | 4104757 |
| 1 | 801306 |
| 2 | 3609931 |
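
One common difference between a CSV in a text editor and the dataframe is that pandas reads ID-like columns as numbers, which drops leading zeros. The sketch below shows the effect on an in-memory CSV. The leading zero on `0801306` is an assumption made for illustration, not something confirmed from `mines-subset.csv`.

```python
import io
import pandas as pd

# A tiny stand-in for the CSV file; the leading zero is hypothetical
csv_text = "id\n0801306\n4104757\n"

# Default behaviour: the column is parsed as integers
as_numbers = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every value exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(as_numbers["id"].tolist())  # [801306, 4104757]
print(as_text["id"].tolist())     # ['0801306', '4104757']
```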


### Scrape the operator's name for each of those mines and print it

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*

In [16]:
def scrape_operator_name(row):
    formdata = {
        "MineId": row['id']
    }
    response = requests.post("https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp", data=formdata)
    doc = BeautifulSoup(response.text, "html.parser")
    operator_name = doc.find("table", bgcolor="#FFFFBF").find_all("tr")[2].find_all("td")[1].text
    print(operator_name)

In [17]:
df.apply(scrape_operator_name, axis=1)

Dirt Works 
Holley Dirt Company, Inc 
M.R. Dirt Inc. 


0    None
1    None
2    None
dtype: object
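
The column of `None` values appears because the function only prints and never returns anything, so `.apply` collects `None` for every row. A small offline sketch of the difference, using a made-up dataframe and two hypothetical functions:

```python
import pandas as pd

# Made-up IDs, only for demonstrating .apply
df_demo = pd.DataFrame({"id": ["0000001", "0000002"]})

def print_only(row):
    # Prints, but implicitly returns None
    print("mine", row["id"])

def return_value(row):
    # Returns a value that .apply can collect
    return "mine " + row["id"]

printed = df_demo.apply(print_only, axis=1)
returned = df_demo.apply(return_value, axis=1)

print(printed.isna().all())  # True
print(returned.tolist())     # ['mine 0000001', 'mine 0000002']
```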

### Scrape the operator's name and save it into a new column

- *TIP: Use .apply and a function*
- *TIP: Remember to use `return`*

In [18]:
def scrape_operator_name_2(row):
    formdata = {
        "MineId": row['id']
    }
    response = requests.post("https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp", data=formdata)
    doc = BeautifulSoup(response.text, "html.parser")
    operator_name = doc.find("table", bgcolor="#FFFFBF").find_all("tr")[2].find_all("td")[1].text
    return operator_name

In [19]:
df['name'] = df.apply(scrape_operator_name_2, axis=1)
df

|   | id | name |
|---|---|---|
| 0 | 4104757 | Dirt Works |
| 1 | 801306 | Holley Dirt Company, Inc |
| 2 | 3609931 | M.R. Dirt Inc. |


# Researching mine violations

Read the very top again to remember how to find mine violations.

### When you search for a mine's violations, what URL are you going to be scraping?

The url of the page listing all violations is `https://arlweb.msha.gov/drs/ASP/MineAction.asp`

### When you search for a mine's violations, do you need form data? If so, what is your form data going to be?

Inspecting the Form Data for the request at `https://arlweb.msha.gov/drs/ASP/MineAction.asp` yields the following information:

- MineId: 3503598
- BDate: 1/1/1995
- EDate: 
- Submit: Violations*
- Sort: 1 (not sure if we need this)
- submit.x: 58 (not sure if we need this)
- submit.y: 12 (not sure if we need this)
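
Since the same fields are needed for every mine, it may help to build them in one place. The helper below is a hypothetical sketch (the name `build_violation_formdata` is not from the original notebook). It keeps all seven fields from the list above, because it is not yet clear which ones the server ignores.

```python
def build_violation_formdata(mine_id, begin_date="1/1/1995", end_date=""):
    """Build the form data for a violations report for one mine."""
    return {
        "MineId": str(mine_id),
        "BDate": begin_date,
        "EDate": end_date,
        "Submit": "Violations*",
        "Sort": "1",       # not sure if we need this
        "submit.x": "58",  # not sure if we need this
        "submit.y": "12",  # not sure if we need this
    }

formdata_demo = build_violation_formdata("3901432")
print(formdata_demo["MineId"], formdata_demo["BDate"])  # 3901432 1/1/1995
```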

### Using the mine ID `3901432`, scrape all of their violations since 1/1/1995

**Save this into a CSV called `3901432-violations.csv`.** This CSV must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

**Tips:**

- *TIP: It's probably worth it to print them all first, then save them to a CSV once you know it's all working.*
- *TIP: You'll use the parent pattern - get the ROWS first (tr), then loop through and get the TABLE CELLS (td)*
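
As a reminder of the parent pattern, here is a minimal sketch on a made-up table: grab the rows first, skip the header row, then read the cells inside each row. The HTML and the field names are placeholders, not the real report layout.

```python
from bs4 import BeautifulSoup

# Hypothetical table standing in for the violations report
sample_html = """
<table>
  <tr><th>Citation</th><th>Case</th></tr>
  <tr><td>1111111</td><td>222222</td></tr>
  <tr><td>3333333</td><td>444444</td></tr>
</table>
"""

doc_demo = BeautifulSoup(sample_html, "html.parser")

# Parent first: every row except the header
rows = doc_demo.find("table").find_all("tr")[1:]

records = []
for tr in rows:
    # Then the children: the cells of this one row
    tds = tr.find_all("td")
    records.append({
        "citation_no": tds[0].text.strip(),
        "case_no": tds[1].text.strip(),
    })

print(records)
```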

In [84]:
formdata = {
    "MineId": "3901432",
    "BDate": "1/1/1995",
    "EDate": "",
    "Submit": "Violations*",
    "Sort": "1",
    "submit.x": "58",
    "submit.y": "12"
}

In [72]:
# Headers copied from the browser's request; this dict is not passed to requests.post below
headerdata = {
    "Host": "arlweb.msha.gov",
    "Connection": "keep-alive",
    "Content-Length": "90",
    "Cache-Control": "max-age=0",
    "Origin": "https://arlweb.msha.gov",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Referer": "https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4",
    "Cookie": "__utmt_GSA_CP=1; ASPSESSIONIDQGTDRRTS=GEODEIECCLNBELFBBPOFPBKC; __utma=258023386.1291007958.1497667810.1497716828.1497719646.4; __utmb=258023386.3.10.1497719646; __utmc=258023386; __utmz=258023386.1497667810.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _ga=GA1.3.1291007958.1497667810; _gid=GA1.3.687359427.1497667950; _ga=GA1.2.1291007958.1497667810; _gid=GA1.2.687359427.1497667950"
}

In [101]:
response = requests.post("https://arlweb.msha.gov/drs/ASP/MineAction.asp", data=formdata)

In [102]:
doc = BeautifulSoup(response.text, "html5lib")

In [103]:
trs = doc.find_all("table")[7].find_all("tr")[1:]

In [104]:
violations = []
for tr in trs:
    violation = {}
    tds = tr.find_all("td")
    violation['violator_id'] = "3901432"
    violation['violator_name'] = tds[0].text.strip()
    violation['citation_no'] = tds[2].text.strip()
    violation['case_no'] = tds[3].text.strip()
    violation['standard_no'] = tds[10].find_all("font")[2].text.strip()
    violation['standard_link'] = "" #tds[10].a["href"]
    violation['proposed_penalty'] = tds[11].text.strip()
    violation['amount_paid'] = tds[14].text.strip()
    violations.append(violation)


In [79]:
# Stopping at this point because the HTML is faulty
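
One possible way to fill in `standard_link` without crashing on cells that lack a link is to look for an `<a>` tag with an `href` and fall back to an empty string. This is only a sketch: `extract_link` is a hypothetical helper, the sample HTML is invented, and it does not repair whatever is wrong with the real page's HTML.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://arlweb.msha.gov/drs/ASP/MineAction.asp"

def extract_link(cell, base_url=BASE_URL):
    """Return the absolute URL of the first link in a cell, or '' if there is none."""
    anchor = cell.find("a", href=True)
    if anchor is None:
        return ""
    # urljoin turns a relative href into an absolute URL
    return urljoin(base_url, anchor["href"])

# Hypothetical cells: one with a link, one without
sample_html = """
<table><tr>
  <td><a href="/example/standard.htm">56.0000</a></td>
  <td>56.0001</td>
</tr></table>
"""
cells = BeautifulSoup(sample_html, "html.parser").find_all("td")

print(extract_link(cells[0]))  # https://arlweb.msha.gov/example/standard.htm
print(extract_link(cells[1]) == "")  # True
```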

In [105]:
df = pd.DataFrame(violations, columns=["violator_id", "violator_name", "citation_no", "case_no", "standard_no", "standard_link", "proposed_penalty", "amount_paid"])
df.head()

|   | violator_id | violator_name | citation_no | case_no | standard_no | standard_link | proposed_penalty | amount_paid |
|---|---|---|---|---|---|---|---|---|
| 0 | 3901432 | Krueger Brothers Gravel & Dirt | 8750964 | 361866 | 56.18010 |  | 100.0 | 100.0 |
| 1 | 3901432 | Krueger Brothers Gravel & Dirt | 6426438 | 260865 | 56.4101 |  | 100.0 | 100.0 |
| 2 | 3901432 | Krueger Brothers Gravel & Dirt | 6426439 | 260865 | 56.4201(a)(2) |  | 100.0 | 100.0 |
| 3 | 3901432 | Krueger Brothers Gravel & Dirt | 6588189 | 260865 | 56.14200 |  | 100.0 | 100.0 |
| 4 | 3901432 | Krueger Brothers Gravel & Dirt | 6588210 | 238554 | 50.30(a) |  | 100.0 | 100.0 |


### After you save the CSV, open it and check it doesn't have a weird extra column.
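
The "weird extra column" usually comes from pandas writing the dataframe's index into the file; when the file is read back, that column shows up as `Unnamed: 0`. A small sketch with made-up data shows the difference that `index=False` makes. `to_csv` returns the CSV text when no filename is given, which makes it easy to inspect.

```python
import pandas as pd

# Made-up row, only for demonstrating to_csv
violations_demo = pd.DataFrame([{"citation_no": "0000001", "case_no": "000001"}])

with_index = violations_demo.to_csv()
without_index = violations_demo.to_csv(index=False)

print(with_index.splitlines()[0])     # ,citation_no,case_no
print(without_index.splitlines()[0])  # citation_no,case_no
```

For the real data, something like `df.to_csv("3901432-violations.csv", index=False)` should avoid the extra column.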

# Using .apply to save mine data for SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the violations for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

### Scrape the violations for each mine

**Save each mine's violations into separate CSV files.** Each CSV file must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

Make sure you are saving them into **separate files.** It might be nice to name them after the mine id.

- *TIP: Use .apply for this*
- *TIP: Print out the ID before you start scraping. That way you can take that ID and search manually to see if there is anything weird about the results.*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*
- *TIP: It's probably worth it to print the fields first, then save them to a CSV once you know it's all working.*
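
A rough sketch of how the pieces could fit together: one function that prints the ID, scrapes, and saves a CSV named after the mine, applied to every row. To keep the sketch runnable offline, the scraping step is passed in as a function and replaced by a stand-in (`fake_scrape`) that returns made-up rows. The names `save_violations_for_mine` and `fake_scrape`, the IDs, and the temporary output folder are all assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_violations_for_mine(row, scrape, out_dir):
    """Scrape one mine's violations and save them to <id>-violations.csv."""
    mine_id = row["id"]
    print("Scraping", mine_id)  # print the ID first so odd results can be checked by hand
    violations = scrape(mine_id)
    path = Path(out_dir) / f"{mine_id}-violations.csv"
    pd.DataFrame(violations).to_csv(path, index=False)
    return str(path)

def fake_scrape(mine_id):
    # Stand-in for the real scraping code; returns made-up data
    return [{"citation_no": "0000001", "case_no": "000001"}]

# Made-up IDs; with the real data this would be the dataframe read from mines-subset.csv
mines = pd.DataFrame({"id": ["0000001", "0000002"]})
out_dir = tempfile.mkdtemp()

# Extra keyword arguments to .apply are forwarded to the function
mines["csv_path"] = mines.apply(
    save_violations_for_mine, axis=1, scrape=fake_scrape, out_dir=out_dir
)
print(mines["csv_path"].tolist())
```

Swapping `fake_scrape` for a function that posts the form data and parses the table would turn this into the real scraper.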