# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm#MID). Thankfully, the Mine Data Retrieval System lets us search for this data.

## Preparation: Knowing your tags

These questions are asked of every data set, so they might not map exactly onto yours.

**Search for every operator with 'dirt' in their name, including abandoned mines.**

### What is the tag and class name for every row of data?

In [1]:
# every row is in a <tr> tag with no class name

### What is the tag and class name for every mine operator's name?

In [2]:
# every operator name is in a <td> and a <font> tag with no class name

### What is the tag and class name for every mine's name?

In [3]:
# every mine's name is in a <td> and a <font> tag with no class name

## Being lazy

If you only needed these results, what would you do instead of scraping them?

In [4]:
# It would be easier to copy the results table into Excel

## Setup: Import what you'll need to scrape the page

Use `requests`, not `urllib`.

In [5]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Try to scrape the page

To test if you requested the page correctly, save the BeautifulSoup document as `doc` and run the code `doc.find_all('tr')[-1].text` to get the text of the last `<tr>` element.

- If the result starts with **Total Number of Mines Found**, you were successful.

In [6]:
data = {
    'OperSearch':'dirt',
    'Abandoned':'No',
    'MineName':'',
    'StateSearch':'None',
    'CM':'All',
    'x':'0',
    'y':'0',
    'MC':'Opersearch'
}

url = 'https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp'

response = requests.post(url, data=data)
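To get a feel for what that POST sends, the form fields can be URL-encoded by hand. This is only an illustrative sketch using the standard library's `urllib.parse.urlencode`. `requests` does the equivalent encoding itself when given `data=`, so none of this is needed for the scrape. The name `form_fields` is made up for the example.

```python
from urllib.parse import urlencode

# Same form fields as the request above (copied, not re-derived from the site)
form_fields = {
    'OperSearch': 'dirt',
    'Abandoned': 'No',
    'MineName': '',
    'StateSearch': 'None',
    'CM': 'All',
    'x': '0',
    'y': '0',
    'MC': 'Opersearch',
}

# urlencode joins key=value pairs with '&', in dictionary order
encoded_body = urlencode(form_fields)
print(encoded_body)
```

Empty values such as `MineName` still appear in the body, just with nothing after the `=`.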

In [7]:
doc = BeautifulSoup(response.text,'html.parser')

In [8]:
doc.find_all('tr')[-1].text

'\nTotal Number of Mines Found:\xa0\xa019'
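The count in that last row is worth keeping, since it tells you how many results to expect later. A minimal sketch of pulling the number out of the text, assuming the row always has the shape shown above (the variable names are made up for the example):

```python
import re

# Text of the last <tr>, copied from the output above
last_row_text = '\nTotal Number of Mines Found:\xa0\xa019'

# \s also matches the non-breaking spaces (\xa0) before the number
match = re.search(r'Total Number of Mines Found:\s*(\d+)', last_row_text)
total_mines = int(match.group(1)) if match else None
print(total_mines)
```

If the page layout changes and the pattern no longer matches, `total_mines` is `None` rather than raising an error.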

## Actually scraping

### Hopefully you know that each `tr` is supposed to be your data. What is the index of the first row element that is actually a result?

`.text` will help you here.

In [9]:
result_data = doc.find_all('tr')
result_data

[<tr>
 <td width="30%"><a href="/drs/drshome.htm"><img alt="Mine Data Retrieval System" border="0" height="75" src="/drs/images/drslogo.png" width="300"/></a></td>
 <th width="40%"><font style="FONT-SIZE:1.20em;">Operator Name or Mine Name<br/> Search</font></th>
 <td width="30%"> </td></tr>, <tr>
 <td valign="top" width="50%">
 <table width="100%">
 <tr>
 <td><font style="FONT-SIZE:.80em;"><b>Abandoned*</b></font></td></tr>
 <tr>
 <td valign="top" width="95%"><font style="FONT-SIZE:.75em;">Indicates Mine is Abandoned and Sealed</font></td></tr></table></td>
 <td align="right" valign="top" width="50%">
 <table align="right" width="100%">
 <tr>
 <td align="right" colspan="2"><font style="FONT-SIZE:.80em;"><b>*CM (Coal or Metal Mine/Nonmetal Mine)</b></font></td></tr>
 <tr>
 <td align="right" width="46%"><font style="FONT-SIZE:.80em;">C<br/>M</font></td>
 <td width="54%"><font style="FONT-SIZE:.80em;">...... Coal<br/>...... Metal/Nonmetal</font></td></tr>
 </table></td></tr>, <tr>
 <td><

In [10]:
# the first data row is the first <tr> that contains an <input> named "MineId"
# the counter tracks the index; the for loop breaks as soon as that first data row comes up

counter = 0
for result in result_data:
    if result.find("input", attrs={'name':'MineId'}):
        break
    counter += 1
print("The first row result is at index", counter)

The first row result is at index 7
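The same search can be written without a manual counter. The sketch below uses `enumerate` with `next` on a made-up HTML snippet. The sample markup is a hypothetical stand-in for the real page, and it assumes (as the loop above does) that result rows contain an `<input>` named `MineId`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two layout rows, then two result rows
sample_html = """
<table>
  <tr><th>Operator Name or Mine Name Search</th></tr>
  <tr><td>Legend</td></tr>
  <tr><td><input name="MineId" value="3503598"/>3503598</td><td>OR</td></tr>
  <tr><td><input name="MineId" value="4801789"/>4801789</td><td>ND</td></tr>
</table>
"""
sample_rows = BeautifulSoup(sample_html, 'html.parser').find_all('tr')

# next() returns the index of the first matching row, or None if there is none
first_result_index = next(
    (i for i, row in enumerate(sample_rows)
     if row.find('input', attrs={'name': 'MineId'}) is not None),
    None,
)
print(first_result_index)
```

Storing the index in a variable also means later slices can use `result_data[first_result_index:]` instead of a hard-coded number.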


### Loop through each operator result, printing its name

Use LIST SLICING to skip the non-data row(s).

In [11]:
for result in result_data[7:]:
    cells = result.find_all("td")
    if len(cells) > 1:  # making sure not to get empty rows
        print(cells[2].text.strip())

Newberg Rock & Dirt
AM Dirtworks & Aggregate Sales
Dirt Company
Dirt Con
Dirt Doctor Inc
Dirt Works
Holley Dirt Company, Inc
Krueger Brothers Gravel & Dirt
M R Dirt
M.R. Dirt Inc.
P B Dirt Movers, Inc
PB Dirt Movers
PB Dirt Movers, Inc
Prescott Dirt, LLC
R D Blankenship Dirt Work LLC
SIMPSON DIRTWORX LLC
SIMPSON DIRTWORX LLC
Spry's Dirt & Gravel, Inc.
Vogt Dirt Service


### Loop through each operator result, printing its ID

There should be ONE code per row, and NO empty rows between them.

In [12]:
for result in result_data[7:]:
    cells = result.find_all("td")
    if len(cells) > 1:  # skip empty rows so there are no gaps in the output
        print(cells[0].text.strip())

3503598
4801789
5001797
4608254
2103723
4104757
0801306
3901432
3609624
3609931
1519799
4407296
4407270
0203332
2901986
4300768
4300776
2302283
2103518
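A quick way to confirm "one code per row, no empty rows" is to collect the IDs into a list and check it. The sketch below hard-codes the IDs printed above for illustration. In the notebook you would build the list inside the loop instead. All variable names here are made up for the example.

```python
# IDs copied from the output above
printed_ids = [
    '3503598', '4801789', '5001797', '4608254', '2103723', '4104757',
    '0801306', '3901432', '3609624', '3609931', '1519799', '4407296',
    '4407270', '0203332', '2901986', '4300768', '4300776', '2302283',
    '2103518',
]
expected_total = 19  # from the "Total Number of Mines Found" row

has_blank = any(not mine_id.strip() for mine_id in printed_ids)
all_numeric = all(mine_id.isdigit() for mine_id in printed_ids)
count_matches = len(printed_ids) == expected_total

print(has_blank, all_numeric, count_matches)
```

Note that some IDs start with `0`, so they need to stay as strings to keep their full form.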


## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.

In [13]:
miners = []

for result in result_data[7:]:
    miner = result.find_all("td")
    if len(miner) > 1:  # skip empty rows
        # a fresh dictionary for every data row
        current = {
            'Operator ID': miner[0].text.strip(),
            'Operator Name': miner[2].text.strip(),
            'Mine name': miner[3].text.strip(),
            'State': miner[1].text.strip(),
            'Mine type': miner[4].text.strip(),
            'Coal or metal': miner[5].text.strip(),
            'Status': miner[6].text.strip(),
            'Commodity': miner[7].text.strip(),
        }
        miners.append(current)
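When every cell maps to exactly one field, the index-by-index assignments can be collapsed with `zip`. This is an alternative sketch, not the approach used above. `FIELD_NAMES` and `cells_to_record` are hypothetical names, and the column order is assumed from the indexes used in the loop (ID, state, operator, mine, type, coal or metal, status, commodity).

```python
# Column order as implied by the cell indexes used in the loop above
FIELD_NAMES = [
    'Operator ID', 'State', 'Operator Name', 'Mine name',
    'Mine type', 'Coal or metal', 'Status', 'Commodity',
]

def cells_to_record(cell_texts):
    """Pair each stripped cell text with its field name."""
    return dict(zip(FIELD_NAMES, (text.strip() for text in cell_texts)))

# First result row, with stray whitespace added to show the stripping
sample_cells = [
    ' 3503598 ', 'OR', 'Newberg Rock & Dirt', 'Newberg Rock & Dirt',
    'Surface', 'M', 'Active', 'Crushed, Broken Stone NEC',
]
record = cells_to_record(sample_cells)
print(record['Operator ID'], record['State'], record['Commodity'])
```

With real rows you would pass `[td.text for td in result.find_all("td")]` as `cell_texts`.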

### Save that to a CSV

In [14]:
df = pd.DataFrame(miners)
df.to_csv("Miners.csv",index=False)
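Two details matter when this file is read back: `index=False` keeps the row index out of the CSV, and IDs with leading zeros are parsed as numbers unless told otherwise. The sketch below shows both on a one-row made-up frame, writing to an in-memory string instead of a file.

```python
import io
import pandas as pd

# One-row stand-in for the scraped data
sample = pd.DataFrame([{'Operator ID': '0801306', 'State': 'FL'}])

# index=False keeps the row index out of the CSV text
csv_text = sample.to_csv(index=False)

# Default parsing treats the ID as an integer; dtype=str keeps it as text
as_numbers = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'Operator ID': str})

print(list(as_numbers.columns))
print(as_numbers['Operator ID'][0])  # leading zero dropped
print(as_text['Operator ID'][0])     # leading zero kept
```

Whether the leading zero matters depends on how the IDs will be used later, for example when building URLs or joining against another MSHA table.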

### Open the CSV file and examine the first few rows. Make sure you didn't save an extra weird unnamed column.

In [15]:
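# DataFrame.from_csv was deprecated and later removed from pandas; pd.read_csv replaces it.
# index_col=0 mirrors from_csv's default of using the first column as the index.
# Operator IDs are parsed as integers here, so a leading zero (0801306) is dropped.
df_new = pd.read_csv("Miners.csv", index_col=0)
df_new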

Coal or metal,Commodity,Mine name,Mine type,Operator ID,Operator Name,State,Status
M,"Crushed, Broken Stone NEC",Newberg Rock & Dirt,Surface,3503598,Newberg Rock & Dirt,OR,Active
M,Construction Sand and Gravel,AM Dirtworks & Aggregate Sales,Surface,4801789,AM Dirtworks & Aggregate Sales,ND,Intermittent
M,Construction Sand and Gravel,Bush Pilot,Surface,5001797,Dirt Company,AK,Intermittent
M,"Crushed, Broken Limestone NEC",Hog Lick Quarry,Surface,4608254,Dirt Con,WV,Temporarily Idled
M,Construction Sand and Gravel,Rock Lake Plant,Surface,2103723,Dirt Doctor Inc,MN,Intermittent
M,Construction Sand and Gravel,Portable #1,Surface,4104757,Dirt Works,TX,Intermittent
M,"Sand, Common",River Road Pit,Surface,801306,"Holley Dirt Company, Inc",FL,Active
M,Construction Sand and Gravel,PORTABLE SCREENER,Surface,3901432,Krueger Brothers Gravel & Dirt,SD,Intermittent
M,Construction Sand and Gravel,Forbes Pit,Surface,3609624,M R Dirt,PA,Intermittent
M,Dimension Stone NEC,Camptown Quarry,Surface,3609931,M.R. Dirt Inc.,PA,Intermittent
