## Webscraping with Beautiful Soup

I was looking at some enrollment data for universities. The programs in it are identified by CIP (Classification of Instructional Programs) codes, and I need to convert each code to something meaningful. So I need to read an HTML page and extract the information from it.
Let's try it. The codes are listed at
https://nces.ed.gov/Ipeds/cipcode/browse.aspx?y=55
so the plan is to read the codes from that page.


In [1]:
from bs4 import BeautifulSoup
import requests
import sys  # needed for sys.exit() below

In [2]:
url = 'https://nces.ed.gov/Ipeds/cipcode/browse.aspx?y=55'
page = requests.get(url)
if page.status_code != 200:
    # passing a string to sys.exit() prints it to stderr and exits with status 1
    sys.exit(f'could not retrieve HTML for {url} (status {page.status_code})')
else:
    tree = BeautifulSoup(page.content, 'html.parser')

Looking at the HTML, the parts I want look like this:
<li><a href="cipdetail.aspx?y=55&amp;cipid=87979" title="View this CIP">01.0101) Agricultural Business and Management, General.</a></li>
I want both the numeric code and the description. Both sit in the text of the links whose title is "View this CIP".
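Before running this against the live page, the idea of picking links by their title can be tried on a small snippet. The sketch below is only an illustration: `sample_html` is a hand-written stand-in that reuses the `<li>` line above plus one navigation link, and the names `sample_tree`, `cip_links`, `link_texts` and `link_hrefs` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption) that mimics the structure of the page:
# one ordinary navigation link and one CIP link.
sample_html = """
<ul>
  <li><a href="/surveys">Surveys &amp; Programs</a></li>
  <li><a href="cipdetail.aspx?y=55&amp;cipid=87979" title="View this CIP">01.0101) Agricultural Business and Management, General.</a></li>
</ul>
"""

sample_tree = BeautifulSoup(sample_html, "html.parser")

# Keep only <a> tags whose title attribute is exactly "View this CIP".
cip_links = sample_tree.find_all("a", title="View this CIP")

link_texts = [link.get_text() for link in cip_links]  # text between <a> and </a>
link_hrefs = [link["href"] for link in cip_links]     # attribute values are entity-decoded

print(link_texts)
print(link_hrefs)
```

The navigation link is skipped because it has no matching title, so only the CIP entry comes back.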

In [3]:
# this will get all the links in the whole page. 
tree.find_all('a')

[<a name="top"></a>,
 <a href="#content" title="Skip Navigation">Skip Navigation</a>,
 <a href="//ies.ed.gov">IES</a>,
 <a class="top_nav" href="/">
 <div class="l ctr">NCES</div>
 <span class="long_name">National Center for<br/> Education Statistics</span>
 <div class="l menuIcon"><img alt="Menu" src="//nces.ed.gov/images/icons/menuIcon.png"/></div>
 </a>,
 <a href="/surveys">Surveys &amp; Programs</a>,
 <a href="/annuals/">Annual Reports</a>,
 <a href="/programs/coe/" title="Condition of Education">Condition of Education</a>,
 <a href="/programs/digest/" title="Digest of Education Statistics">Digest of Education Statistics</a>,
 <a href="/programs/pes/" title="Projections of Education Statistics">Projections of Education Statistics</a>,
 <a href="/surveys/annualreports/topical-studies/summary/">Topical Studies</a>,
 <a href="/surveys/SurveyGroups.asp?group=4">National Assessments</a>,
 <a href="/nationsreportcard" title="National Assessment of Educational Progress (NAEP)">National As

In [4]:
# this finds all the tags whose title attribute is exactly "View this CIP"
tree.find_all(title="View this CIP")

[<a href="cipdetail.aspx?y=55&amp;cipid=87977" title="View this CIP">01) AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES.</a>,
 <a href="cipdetail.aspx?y=55&amp;cipid=87176" title="View this CIP">01.00) Agriculture, General.</a>,
 <a href="cipdetail.aspx?y=55&amp;cipid=87742" title="View this CIP">01.0000) Agriculture, General.</a>,
 <a href="cipdetail.aspx?y=55&amp;cipid=87978" title="View this CIP">01.01) Agricultural Business and Management.</a>,
 <a href="cipdetail.aspx?y=55&amp;cipid=87979" title="View this CIP">01.0101) Agricultural Business and Management, General.</a>,
 <a href="cipdetail.aspx?y=55&amp;cipid=87980" title="View this CIP">01.0102) Agribusiness/Agricultural Business Operations.</a>,
 <a href="cipdetail.aspx?y=55&amp;cipid=87981" title="View this CIP">01.0103) Agricultural Economics.</a>,
 <a href="cipdetail.aspx?y=55&amp;cipid=87982" title="View this CIP">01.0104) Farm/Farm and Ranch Management.</a>,
 <a href="cipdetail.aspx?y=55&amp;cipid=87743" title="

In [5]:
# save the results in a list and look at one of them
cips = tree.find_all(title="View this CIP")
cips[22]

<a href="cipdetail.aspx?y=55&amp;cipid=87180" title="View this CIP">01.0307) Horse Husbandry/Equine Science and Management.</a>

In [6]:
# need the part between CIP"> and </a>; .text returns exactly that link text
cips[22].text

'01.0307) Horse Husbandry/Equine Science and Management.'

In [7]:
# split at the )
cips[22].text.split(')')

['01.0307', ' Horse Husbandry/Equine Science and Management.']
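One thing to watch with a plain `split(')')`: the description keeps its leading space, and a description that itself contains a `)` would be cut into more than two pieces. A possible alternative, sketched here with a hypothetical helper named `split_cip`, is `str.partition`, which splits only at the first `)`. The second test string is invented purely to show the parenthesis case.

```python
def split_cip(text):
    """Split 'CODE) Description' into (code, description), both stripped."""
    # partition() splits at the first ')' only and always returns three parts.
    code, sep, description = text.partition(")")
    if not sep:
        raise ValueError(f"no ')' found in {text!r}")
    return code.strip(), description.strip()


horse = split_cip("01.0307) Horse Husbandry/Equine Science and Management.")
# Invented entry (not from the page) with parentheses in the description.
made_up = split_cip("99.9999) Example Program (Hypothetical).")

print(horse)
print(made_up)
```

Stripping here changes the stored description slightly (no leading space), so it is a choice about how clean the lookup values should be rather than a required fix.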

In [8]:
# set up a dictionary entry with the first part as the key and the second as the value
my_dict = {cips[22].text.split(')')[0]: cips[22].text.split(')')[1]}
my_dict

{'01.0307': ' Horse Husbandry/Equine Science and Management.'}

## Create the dictionary of CIP codes

In [9]:
# build the whole lookup table in one pass: code -> description
cips_dict = {entry.text.split(')')[0]: entry.text.split(')')[1] for entry in cips}
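Since the point of the dictionary is to turn codes in the enrollment data into descriptions, here is a rough sketch of that lookup step. It uses a few link texts copied from the page rather than the live `cips` list, and the names `sample_texts`, `clean_dict` and `describe`, as well as the "unknown CIP code" fallback, are assumptions made for this example.

```python
# A few link texts copied from the page, standing in for the scraped results.
sample_texts = [
    "01) AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES.",
    "01.00) Agriculture, General.",
    "01.0103) Agricultural Economics.",
]

# Build code -> description, stripping the space left after the ')'.
clean_dict = {}
for text in sample_texts:
    code, _, description = text.partition(")")
    clean_dict[code.strip()] = description.strip()


def describe(code, table):
    """Return the description for a code, or a placeholder if it is missing."""
    return table.get(code, "unknown CIP code")


print(describe("01.0103", clean_dict))
print(describe("99.9999", clean_dict))
```

Keeping the codes as strings matters here: a code like `01.00` would lose its leading and trailing zeros if it were read as a number.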



In [10]:
cips_dict

{'01': ' AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES.',
 '01.00': ' Agriculture, General.',
 '01.0000': ' Agriculture, General.',
 '01.01': ' Agricultural Business and Management.',
 '01.0101': ' Agricultural Business and Management, General.',
 '01.0102': ' Agribusiness/Agricultural Business Operations.',
 '01.0103': ' Agricultural Economics.',
 '01.0104': ' Farm/Farm and Ranch Management.',
 '01.0105': ' Agricultural/Farm Supplies Retailing and Wholesaling.',
 '01.0106': ' Agricultural Business Technology.',
 '01.0199': ' Agricultural Business and Management, Other.',
 '01.02': ' Agricultural Mechanization.',
 '01.0201': ' Agricultural Mechanization, General.',
 '01.0204': ' Agricultural Power Machinery Operation.',
 '01.0205': ' Agricultural Mechanics and Equipment/Machine Technology.',
 '01.0299': ' Agricultural Mechanization, Other.',
 '01.03': ' Agricultural Production Operations.',
 '01.0301': ' Agricultural Production Operations, General.',
 '01.0302': ' Animal/Li