# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm#MID). Thank goodness we can search for these things.

## Preparation: Knowing your tags

These questions are the same for every data set, so they might not fit yours exactly.

**Search for every operator with 'dirt' in their name, including abandoned mines.**

### What is the tag and class name for every row of data?

In [9]:
# tr



### What is the tag and class name for every mine operator's name?

In [10]:
#font

### What is the tag and class name for every mine's name?

In [11]:
#font


## Being lazy

If you only needed these results, what would you do instead of scraping them?

## Setup: Import what you'll need to scrape the page

Use `requests`, not `urllib`.

In [14]:
from bs4 import BeautifulSoup
import requests



data = {
    'OperSearch':'dirt',
    'Abandoned':'No',
    'MineName':'',
    'StateSearch':'None',
    'CM':'All',
    'x':'0',
    'y':'0',
    'MC':'Opersearch'
}

header = {
    'Referer' : 'https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp'
}

url = 'https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp'

response = requests.post(url,headers=header,data=data)
response

<Response [200]>

## Try to scrape the page

To test if you requested the page correctly, save the BeautifulSoup document as `doc` and run the code `doc.find_all('tr')[-1].text` to get the text of the last `<tr>` element.

- If the result starts with **Total Number of Mines Found**, you were successful.

In [15]:
doc = BeautifulSoup(response.text, 'html.parser')
doc.find_all('tr')[-1].text

'\nTotal Number of Mines Found:\xa0\xa019'
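The same check can be done in code rather than by eye. The following is a minimal sketch, assuming the last row's text has the shape shown in the output above; `parse_total` is a hypothetical helper name, not part of the assignment.

```python
import re

def parse_total(last_row_text):
    """Return the mine count from the summary row, or None if the text is not the summary row."""
    # Replace non-breaking spaces (\xa0) with ordinary spaces, then trim the ends.
    cleaned = last_row_text.replace('\xa0', ' ').strip()
    match = re.match(r'Total Number of Mines Found:\s*(\d+)', cleaned)
    return int(match.group(1)) if match else None

# The text of the last <tr>, copied from the output above.
print(parse_total('\nTotal Number of Mines Found:\xa0\xa019'))
```

A `None` result would suggest the request did not return the results page.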

## Actually scraping

### Hopefully you know that each `tr` is supposed to be your data. What is the index of the first row element that is actually a result?

`.text` will help you here.

In [16]:
result_data = doc.find_all('tr')

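# Count rows until we reach the one holding the first result.
# 3503598 is the ID of the first result, which sits in an <input> in that row.
counter = 0
for result in result_data:
    if result.find("input", attrs={'value': '3503598'}):
        break
    counter += 1
print("The first row result is at index", counter)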

The first row result is at index 7
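Looking up a known ID works, but it ties the code to this one search. Following the `.text` hint, a more general approach might look for the first row whose text begins with digits. This is only a sketch: `first_result_index` is a hypothetical helper, the sample row texts are invented stand-ins for `[row.text for row in result_data]`, and it assumes result rows start with the numeric ID while header rows do not.

```python
import re

def first_result_index(row_texts):
    """Return the index of the first row whose text begins with digits, or None if there is none."""
    for index, text in enumerate(row_texts):
        # Assumption: header and form rows start with words; result rows start with the numeric ID.
        if re.match(r'\d+', text.strip()):
            return index
    return None

# Hypothetical row texts standing in for [row.text for row in result_data].
sample_rows = [
    '\nMine Search Results',
    '\nID State Operator',
    '\n3503598 OR Newberg Rock & Dirt',
    '\nTotal Number of Mines Found:\xa0\xa019',
]
print(first_result_index(sample_rows))
```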


### Loop through each operator result, printing its name

Use LIST SLICING to skip the non-data row(s).

In [17]:
lines = result_data[7:-1]
for line in lines:
    print(line.find_all('td')[2].font.text)



 Newberg Rock & Dirt  
AM Dirtworks & Aggregate Sales  
Dirt Company  
Dirt Con  
Dirt Doctor Inc  
Dirt Works  
Holley Dirt Company, Inc  
Krueger Brothers Gravel & Dirt  
M R Dirt  
M.R. Dirt Inc.  
P B Dirt Movers, Inc  
PB Dirt Movers  
PB Dirt Movers, Inc  
Prescott Dirt, LLC  
R D Blankenship Dirt Work LLC  
SIMPSON DIRTWORX LLC  
SIMPSON DIRTWORX LLC  
Spry's Dirt & Gravel, Inc.  
Vogt Dirt Service  
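The printed names carry stray padding (note the leading space on the first one), and the cell text on this page also contains non-breaking spaces (`\xa0`). A small helper can tidy each value before it is printed or saved. This is a sketch with a hypothetical `clean` function; the sample strings are shaped like the values scraped from this page.

```python
def clean(text):
    """Collapse runs of whitespace, including non-breaking spaces, into single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace, \xa0 included,
    # and drops leading and trailing whitespace.
    return ' '.join(text.split())

print(repr(clean(' Newberg Rock & Dirt \xa0')))
print(repr(clean('Surface             ')))
```

In the loop above, this would be applied as `clean(line.find_all('td')[2].font.text)`.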


### Loop through each operator result, printing its ID

There should be ONE code per row, and NO empty rows between them.

In [18]:
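lines = result_data[7:-1]
for line in lines:
    # Column 1 is the state abbreviation, as the output below shows;
    # the ID itself is in column 0, i.e. line.find_all('td')[0].font.text.
    print(line.find_all('td')[1].font.text)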

OR 
ND 
AK 
WV 
MN 
TX 
FL 
SD 
PA 
PA 
KY 
VA 
VA 
AZ 
NM 
VT 
VT 
MO 
MN 


## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.

In [19]:
lines = result_data[7:-1]
mylist = []
for line in lines:
    # Look up the row's cells once instead of once per column.
    cells = line.find_all('td')
    dic = {
        'ID': cells[0].font.text,
        'State': cells[1].font.text,
        'Operator': cells[2].font.text,
        'Mine Name': cells[3].font.text,
        'Type': cells[4].font.text,
        'CM*': cells[5].font.text,  # coal or metal
        'Status': cells[6].font.text,
        'Commodity': cells[7].font.text,
    }
    mylist.append(dic)

mylist


[{'CM*': 'M\xa0',
  'Commodity': 'Crushed, Broken Stone NEC\xa0 ',
  'ID': '3503598',
  'Mine Name': 'Newberg Rock & Dirt',
  'Operator': ' Newberg Rock & Dirt \xa0',
  'State': 'OR\xa0',
  'Status': 'Active\xa0 ',
  'Type': 'Surface             '},
 {'CM*': 'M\xa0',
  'Commodity': 'Construction Sand and Gravel\xa0 ',
  'ID': '4801789',
  'Mine Name': 'AM Dirtworks & Aggregate Sales',
  'Operator': 'AM Dirtworks & Aggregate Sales \xa0',
  'State': 'ND\xa0',
  'Status': 'Intermittent\xa0 ',
  'Type': 'Surface             '},
 {'CM*': 'M\xa0',
  'Commodity': 'Construction Sand and Gravel\xa0 ',
  'ID': '5001797',
  'Mine Name': 'Bush Pilot',
  'Operator': 'Dirt Company \xa0',
  'State': 'AK\xa0',
  'Status': 'Intermittent\xa0 ',
  'Type': 'Surface             '},
 {'CM*': 'M\xa0',
  'Commodity': 'Crushed, Broken Limestone NEC\xa0 ',
  'ID': '4608254',
  'Mine Name': 'Hog Lick Quarry',
  'Operator': 'Dirt Con \xa0',
  'State': 'WV\xa0',
  'Status': 'Temporarily Idled\xa0 ',
  'Type': 'Sur

### Save that to a CSV

In [20]:
import pandas as pd

df = pd.DataFrame(mylist)
df.to_csv('dirt_mines.csv', index = False)

### Open the CSV file and examine the first few rows. Make sure you didn't save an extra, unnamed column.
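When pandas writes the index to a CSV, the index column gets a blank header, which `pd.read_csv` then shows as `Unnamed: 0`. One way to check for it without opening the file by hand is to inspect the header row. This is a sketch using only the standard library; `unnamed_columns` is a hypothetical helper, and the two in-memory files are stand-ins for a CSV saved with and without the index.

```python
import csv
import io

def unnamed_columns(csv_file):
    """Return header names that are blank or look like pandas' 'Unnamed: N' placeholders."""
    header = next(csv.reader(csv_file))
    return [name for name in header if name == '' or name.startswith('Unnamed')]

# Stand-ins for the saved file: one written without the index, one with it.
without_index = io.StringIO('ID,State\n3503598,OR\n')
with_index = io.StringIO(',ID,State\n0,3503598,OR\n')

print(unnamed_columns(without_index))  # []
print(unnamed_columns(with_index))     # ['']
```

For the real file, the call would be `unnamed_columns(open('dirt_mines.csv', newline=''))`; an empty list means no extra column was saved.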