# Print debugging + scraping with requests/BeautifulSoup

## The pages we'll be looking at

If we were doing this in a **browser**, we would follow these steps:

1. Visit the [Mine Data Retrieval System](https://arlweb.msha.gov/drs/drshome.htm)
2. Scroll down to **Mine Identification Number (ID) Search**
3. Type in a mine ID number, such as `3503598`, and click **Search**
4. I'm on a page! It lists the MINE NAME and MINE OWNER.

After searching for and finding a mine, I can use this page to **find reports about this mine**. Some of the reports are on accidents, violations, inspections, health samples and more. To get those reports:

1. Search for a mine (if you haven't already)
2. Scroll down and change **Beginning Date** to `1/1/1995` (violation reports begin in 1995, accidents begin in 1983)
3. Select the report type of `Violations`
4. Click **Get Report**
5. I'm on a page! It lists ALL OF THE MINE'S VIOLATIONS.

By changing the report type, you can get the other kinds of data (accidents, inspections, health samples and so on) in the same way.

# Doing this with BeautifulSoup

To do this with requests instead of with Selenium, we need to use `requests.post` to pretend to submit a form.

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
data = {
    'MineId': '3503598',
    'BDate': '1/1/1995',
    'EDate': '',
    'Submit': 'Violations*',
    'Sort': 1,
    'submit.x': 50,
    'submit.y': 10
}
response = requests.post('https://arlweb.msha.gov/drs/ASP/MineAction.asp', data=data)
doc = BeautifulSoup(response.text, "html5lib")
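
When a form post comes back with an unexpected page, it can help to look at what is actually being sent. As a minimal sketch (standard library only, no request is made), the dictionary above can be form-encoded by hand. This should be close to the body that `requests` builds from a `data=` dictionary. The names `form_body` and `decoded` are just for illustration.

```python
from urllib.parse import urlencode, parse_qs

data = {
    'MineId': '3503598',
    'BDate': '1/1/1995',
    'EDate': '',
    'Submit': 'Violations*',
    'Sort': 1,
    'submit.x': 50,
    'submit.y': 10
}

# Form-encode the payload the way an HTML form submission is encoded
form_body = urlencode(data)
print(form_body)

# Decode it again to check that every field survives the round trip
# (keep_blank_values keeps the empty EDate field)
decoded = parse_qs(form_body, keep_blank_values=True)
for field, values in decoded.items():
    print(field, '=', values)
```

Comparing this printout with the form data shown in the browser's network tab is a quick way to spot a misspelled or missing field.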

## First, let's scrape a single page

In [3]:
violations = doc.find_all('tr', class_='drsviols')
for violation in violations:
    # First attempt at debugging, using prints
    print('––––––––––––––––––––––––––')
    cells = violation.find_all('td')
    # Store each violation in its own dict so the form payload in `data`
    # is not overwritten
    record = {}
    record['violator'] = cells[0].text
    record['contract_id'] = cells[1].text
    record['citation_no'] = cells[2].text
    record['case_no'] = cells[3].text
    record['date_issued'] = cells[4].text
    record['final_order_date'] = cells[5].text
    record['section_of_act'] = cells[6].text
    record['date_terminated'] = cells[7].text
    record['citation'] = cells[8].text
    record['s_and_s'] = cells[9].text
    record['standard'] = cells[10].text
    # print(cells[10])
    # record['standard_url'] = cells[10].find('a')['href']
    # It turns out the website uses JavaScript to generate the link, so the
    # <a> tag only exists after the JavaScript has run.
    # The tags that follow are not recognised by 'html.parser'. Switching to
    # 'html5lib' makes it possible to scrape them all; the tradeoff is speed.
    record['proposed_penalty'] = cells[11].text
    record['citation_status'] = cells[12].text
    record['current_penalty'] = cells[13].text
    record['amount_paid'] = cells[14].text
len(violations)

––––––––––––––––––––––––––
(the same separator line is printed once for each violation row)

49
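
The fifteen assignments above all follow the same pattern, so the column names can live in one list and be paired with the cells using `zip`. The sketch below is one possible way to do that, not the notebook's own method. `COLUMNS` and `cells_to_record` are hypothetical names, and the function takes plain strings (for example `[cell.text for cell in cells]`) so it can be tried without hitting the website. The sample rows are made up. It also prints a message when a row has an unexpected number of cells, which tells you more than a bare separator line.

```python
COLUMNS = [
    'violator', 'contract_id', 'citation_no', 'case_no', 'date_issued',
    'final_order_date', 'section_of_act', 'date_terminated', 'citation',
    's_and_s', 'standard', 'proposed_penalty', 'citation_status',
    'current_penalty', 'amount_paid',
]

def cells_to_record(cell_texts):
    """Pair each cell's text with its column name, or return None if the
    row does not have the expected number of cells."""
    if len(cell_texts) != len(COLUMNS):
        # Print debugging: show what was expected and what actually arrived
        print('Expected', len(COLUMNS), 'cells but got', len(cell_texts))
        print(cell_texts)
        return None
    # strip() trims surrounding whitespace, which the code above keeps
    return {name: text.strip() for name, text in zip(COLUMNS, cell_texts)}

# Made-up sample rows: one complete, one cut short
good_row = ['value ' + str(i) for i in range(15)]
short_row = good_row[:10]

print(cells_to_record(good_row))
print(cells_to_record(short_row))
```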

## Turn it into a function to scrape everything

In [4]:
import pandas as pd

def scrape_mine_data(row):
    data = {
        'MineId': row['id'],
        'BDate': '1/1/1995',
        'EDate': '',
        'Submit': 'Violations*',
        'Sort': 1,
        'submit.x': 50,
        'submit.y': 10
    }
    response = requests.post('https://arlweb.msha.gov/drs/ASP/MineAction.asp', data=data)
    doc = BeautifulSoup(response.text, 'html5lib')

    datapoints = []

    violations = doc.find_all('tr', class_='drsviols')
    for violation in violations:
        cells = violation.find_all('td')

        # Start a new dict for every violation; reusing a single dict would
        # make every entry in datapoints point at the same (last) row
        record = {}
        record['violator'] = cells[0].text
        record['contract_id'] = cells[1].text
        record['citation_no'] = cells[2].text
        record['case_no'] = cells[3].text
        record['date_issued'] = cells[4].text
        record['final_order_date'] = cells[5].text
        record['section_of_act'] = cells[6].text
        record['date_terminated'] = cells[7].text
        record['citation'] = cells[8].text
        record['s_and_s'] = cells[9].text
        record['standard'] = cells[10].text
        record['proposed_penalty'] = cells[11].text
        record['citation_status'] = cells[12].text
        record['current_penalty'] = cells[13].text
        record['amount_paid'] = cells[14].text

        datapoints.append(record)

    citation_df = pd.DataFrame(datapoints)

    # to_csv does not create the data/ folder, so it has to exist already
    citation_df.to_csv("data/" + row['id'] + "-violations.csv", index=False)
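
A detail worth checking whenever dictionaries are appended to a list in a loop: `append` stores a reference to the dictionary, not a copy. If one dictionary is created before the loop and reused, every entry in the list ends up showing the last row's values. The toy example below (made-up names and values, nothing from the scraper) shows the difference a fresh dictionary per iteration makes.

```python
rows = ['first', 'second', 'third']

# One dictionary created before the loop and reused for every row
shared = {}
reused = []
for value in rows:
    shared['value'] = value
    reused.append(shared)   # appends the same object each time

# A new dictionary created inside the loop for every row
fresh = []
for value in rows:
    record = {}
    record['value'] = value
    fresh.append(record)

print(reused)  # every entry shows the last value
print(fresh)   # each entry keeps its own value
```

Printing the list of dictionaries (or the first few rows of the resulting DataFrame) before saving is a cheap way to catch this kind of problem.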

# Read in our mines data

In [5]:
df = pd.read_csv("10-classwork/mines-edited.csv", dtype='str')
df.head()

Unnamed: 0,id
0,2501216
1,1401575
2,1600956
3,2200033
4,504953
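
Reading the file with `dtype='str'` keeps the IDs as text, so they can be joined into a filename and any leading zeros present in the file are preserved. The last ID shown, `504953`, has six digits while the others have seven, which may mean a leading zero was lost before the file was saved. That is an assumption about this dataset, not something the notebook confirms. If it holds, `str.zfill` can pad the IDs back to seven characters:

```python
ids = ['2501216', '1401575', '1600956', '2200033', '504953']

# Pad each ID on the left with zeros up to seven characters;
# IDs that are already seven characters long are left unchanged
padded = [mine_id.zfill(7) for mine_id in ids]
print(padded)
```

With the DataFrame above, the pandas equivalent would be `df['id'].str.zfill(7)`.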


## Try it on the first three rows

In [6]:
df.head(3).apply(scrape_mine_data, axis=1)

FileNotFoundError: [Errno 2] No such file or directory: 'data/2501216-violations.csv'
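
The error comes from the last line of the function: `to_csv` tries to open `data/2501216-violations.csv` for writing, but it does not create the `data` folder, so the write fails if that folder does not exist yet. One way around it is to create the folder before writing. The sketch below uses only the standard library and a temporary directory so it can be run anywhere. The file contents and the `output_dir` name are placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / 'data'
    target = output_dir / '2501216-violations.csv'

    # Writing before the folder exists raises FileNotFoundError
    try:
        target.write_text('placeholder\n')
        failed_first = False
    except FileNotFoundError as error:
        failed_first = True
        print('Write failed:', type(error).__name__)

    # Create the folder (no error if it is already there), then write again
    output_dir.mkdir(parents=True, exist_ok=True)
    target.write_text('placeholder\n')
    written = target.exists()
    print('File written:', written)
```

In the notebook, the same idea would be a call such as `os.makedirs('data', exist_ok=True)` (after `import os`) before running `apply`.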