# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm#MID), thank goodness we can search for these things.

## Preparation: Knowing your tags

These questions are the same for every data set, and might not work exactly for yours.

**Search for every operator with 'dirt' in their name, including abandoned mines.**

### What is the tag and class name for every row of data?

In [2]:
from bs4 import BeautifulSoup
import requests

### What is the tag and class name for every mine operator's name?

### What is the tag and class name for every mine's name?

### What is the tag and class name for every mine operator's name?

### What is the tag and class name for every mine operator's name?

## Being lazy

If you only needed these results, what would you do instead of scraping them?

## Setup: Import what you'll need to scrape the page

Use `requests`, not `urllib`.

## Try to scrape the page

To test if you requested the page correctly, save the BeautifulSoup document as `doc` and run the code `doc.find_all('tr')[-1].text` to get the text of the last `<tr>` element.

- If the result starts with **Total Number of Mines Found**, you were successful.

In [4]:
data = {
    'OperSearch':'dirt',
    'Abandoned':'No',
    'MineName':'',
    'StateSearch':'None',
    'CM':'All',
    'x':'0',
    'y':'0',
    'MC':'Opersearch'
}

In [6]:
response = requests.post("https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp", data = data)
doc = BeautifulSoup(response.text, "html.parser")
doc.prettify()

'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<head>\n <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n <!-- ****************************************** Begin META TAGS ********************************************* -->\n <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>\n <!-- ****************************************** End META TAGS *********************************************** -->\n <title>\n  MSHA  - Mine  Data Retrieval System - Basic Mine Information Page\n </title>\n <script src="/2010redesign/Scripts/federated-analytics.js" type="text/javascript">\n </script>\n <script src="/2010redesign/Scripts/AC_RunActiveContent.js" type="text/javascript">\n </script>\n <link href="/2010Redesign/includes/Print.css" media="print" rel="stylesheet" type="text/css"/>\n <link href="/2010Redesign/Includes/MSHAwebnew.css" media="screen" rel="stylesheet" type="text/css">\n  <link href="/2010Redesign/includes/style-screen.css" media=

## Actually scraping

### Hopefully you know that each `tr` is supposed to be your data. What is the index of the first row element that is actually a result?

`.text` will help you here.

In [9]:
doc.text

'\n\n\n\n\n\nMSHA  - Mine  Data Retrieval System - Basic Mine Information Page\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n\t$(document).ready(function() {\r\n\t\tW_Helpful.init(\'#fbcontent\');\r\n\t});(jQuery);\r\n\n<!-- ************* GOOGLE ANALYTICS ****************** -->\r\n\t(function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\r\n\t(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\r\n\tm=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\r\n\t})(window,document,\'script\',\'//www.google-analytics.com/analytics.js\',\'ga\');\r\n\r\n\tga(\'create\', \'UA-52441844-1\', \'auto\');\r\n\tga(\'send\', \'pageview\');\r\n\r\n\t$(document).ready( function() {\r\n\t\t$("a[href$=\'pdf\']").each(function() {\r\n\t\t\teventName=$(this).attr(\'href\');\r\n\t\t\teventClick="ga(\'send\',\'event\',\'PDF\',\'Download\',\'"+eventName+"\');";\r\n\t\t\t$(this).attr("onClick",eventClick);\r\n\t\t});\r\n\t});\r\n\nSkip to content\n\n\n\n

### Loop through each operator result, printing its name

Use LIST SLICING to skip the non-data row(s).

In [32]:
b = doc.find_all("tr")[7:-1]

### Loop through each operator result, printing its ID

There should be ONE code per row, and NO empty rows between them.

In [33]:
for x in b:
    print(x.td.text.strip())

3503598
4801789
5001797
4608254
2103723
4104757
0801306
3901432
3609624
3609931
1519799
4407296
4407270
0203332
2901986
4300768
4300776
2302283
2103518


## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.

In [50]:
mine_list = []

for x in b:
    mine_dict = {}
    opr_id = x.td.text.strip()
    mine_dict['Operator ID'] = opr_id
    
    opr_name = x.find_all('td')[2].text.strip()
    mine_dict['Operator name'] = opr_name
    
    mine_name = x.find_all('td')[3].text.strip()
    mine_dict['Mine name'] = mine_name
    
    state = x.find_all('td')[1].text.strip()
    mine_dict['State'] = state
    
    mine_type = x.find_all('td')[4].text.strip()
    mine_dict['State'] = mine_type
    
    cm = x.find_all('td')[5].text.strip()
    mine_dict['Coal or metal'] = cm
    
    status = x.find_all('td')[6].text.strip()
    mine_dict['Status'] = status
    
    commodity = x.find_all('td')[7].text.strip()
    mine_dict['Commidity'] = commodity
    
    mine_list.append(mine_dict)

mine_list

[{'Coal or metal': 'M',
  'Commidity': 'Crushed, Broken Stone NEC',
  'Mine name': 'Newberg Rock & Dirt',
  'Operator ID': '3503598',
  'Operator name': 'Newberg Rock & Dirt',
  'State': 'Surface',
  'Status': 'Active'},
 {'Coal or metal': 'M',
  'Commidity': 'Construction Sand and Gravel',
  'Mine name': 'AM Dirtworks & Aggregate Sales',
  'Operator ID': '4801789',
  'Operator name': 'AM Dirtworks & Aggregate Sales',
  'State': 'Surface',
  'Status': 'Intermittent'},
 {'Coal or metal': 'M',
  'Commidity': 'Construction Sand and Gravel',
  'Mine name': 'Bush Pilot',
  'Operator ID': '5001797',
  'Operator name': 'Dirt Company',
  'State': 'Surface',
  'Status': 'Intermittent'},
 {'Coal or metal': 'M',
  'Commidity': 'Crushed, Broken Limestone NEC',
  'Mine name': 'Hog Lick Quarry',
  'Operator ID': '4608254',
  'Operator name': 'Dirt Con',
  'State': 'Surface',
  'Status': 'Temporarily Idled'},
 {'Coal or metal': 'M',
  'Commidity': 'Construction Sand and Gravel',
  'Mine name': 'Rock 

### Save that to a CSV

In [53]:
import pandas as pd
df = pd.DataFrame(mine_list)

### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [54]:
df.head()

Unnamed: 0,Coal or metal,Commidity,Mine name,Operator ID,Operator name,State,Status
0,M,"Crushed, Broken Stone NEC",Newberg Rock & Dirt,3503598,Newberg Rock & Dirt,Surface,Active
1,M,Construction Sand and Gravel,AM Dirtworks & Aggregate Sales,4801789,AM Dirtworks & Aggregate Sales,Surface,Intermittent
2,M,Construction Sand and Gravel,Bush Pilot,5001797,Dirt Company,Surface,Intermittent
3,M,"Crushed, Broken Limestone NEC",Hog Lick Quarry,4608254,Dirt Con,Surface,Temporarily Idled
4,M,Construction Sand and Gravel,Rock Lake Plant,2103723,Dirt Doctor Inc,Surface,Intermittent
