In [35]:
import requests
from bs4 import BeautifulSoup
import json

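HTTP status codes you will see when calling APIs:
200 - OK: the request succeeded.
301 - Moved Permanently: the server is redirecting to a different endpoint.
400 - Bad Request: the request was malformed.
401 - Unauthorized: the request is not authenticated.
403 - Forbidden: access to the resource is not allowed.
404 - Not Found: the resource does not exist.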
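
As a quick reference, the standard library already knows the official phrase for each code. The sketch below is only an illustration; the helper name describe_status and the category labels are my own, not part of any API.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return 'code phrase (category)' for an HTTP status code."""
    # Category labels are informal names chosen for this note.
    categories = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    phrase = HTTPStatus(code).phrase          # official reason phrase
    category = categories.get(code // 100, "other")
    return f"{code} {phrase} ({category})"


for code in (200, 301, 400, 401, 403, 404):
    print(describe_status(code))
```

With requests, the same number is available as the status_code attribute of the response object.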

Initial practice with data.gov API 

In [5]:
requests.get('http://catalog.data.gov/api/3/')

<Response [200]>

In [13]:
response = requests.get('http://catalog.data.gov/api/3').text
response

'{"version": 3}'

In [12]:
json.loads(response)

{'version': 3}
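
A slightly more defensive way to do the same parse, sketched on the literal body returned above so it runs without a network call. The helper name parse_api_body is hypothetical. requests also provides Response.json(), which does the decoding in one step.

```python
import json


def parse_api_body(body: str) -> dict:
    """Parse a JSON response body and insist on a top-level object."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as err:
        # An HTML error page or an empty body ends up here.
        raise ValueError(f"response body is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


info = parse_api_body('{"version": 3}')
print(info["version"])
```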

Admission requirements page for UC Admissions. 

In [49]:
page = requests.get('http://admission.universityofcalifornia.edu/freshman/requirements/index.html')

In [50]:
page.status_code

200

In [51]:
page.content

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en">\n<head>\n<meta content="text/html; charset=UTF-8" http-equiv="Content-Type" />\n<meta content="index,follow" name="robots" />\n<meta content="7 days" name="revisit-after" />\n<meta content="General" name="rating" />\n<meta content="English" name="language" />\n<meta content="width=1115" name="viewport" />\n<meta content="UC, University of California, freshman admission requirements" name="keywords" /><meta content="Review the admission standards for freshmen applying to UC" name="description" />\n<title>Admission requirements | UC Admissions</title>\n<link href="../../_files2/css/reset.css" media="all" rel="stylesheet" type="text/css" />\n<link href="../../_files2/css/main.css" media="all" rel="stylesheet" type="text/css" />\n<link href="../../_files2/css/print.css" media="print" rel="stylesheet" type="text/css" />\n<scri

In [85]:
soup = BeautifulSoup(page.content, 'html.parser')

Pretty-print the parsed HTML so the nesting is easier to read.

In [86]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <!-- Meta -->
  <meta content="width=device-width" name="viewport"/>
  <link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
  <title>
   National Weather Service
  </title>
  <meta content="National Weather Service" name="DC.title">
   <meta content="NOAA National Weather Service National Weather Service" name="DC.description"/>
   <meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
   <meta content="" name="DC.date.created" scheme="ISO8601"/>
   <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
   <meta content="weather, National Weather Service" name="DC.keywords"/>
   <meta content="NOAA's National Weather Service" name="DC.publisher"/>
   <meta content="National Weather Service" name="DC.contributor"/>
   <meta content="http://www.weather.gov/disclaimer.php" name="DC.rights"/>
   <meta content="General" name="rating"/>
   <meta content="index,follow" name="robots"/>

In [54]:
[type(item) for item in list(soup.children)] 

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

All of the items are BeautifulSoup objects:
Doctype: holds the document type declaration.
NavigableString: represents text found in the HTML document.
Tag: represents an HTML element, which can contain other nested tags and text.

The Tag object allows us to navigate through an HTML document, and extract other tags and text. 
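
To make the navigation concrete, here is a minimal sketch on a tiny made-up document (sample_html is not the UC page). It picks the html Tag out of the top-level children by type rather than by position, then walks down to the title.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature document, used only for illustration.
sample_html = (
    "<!DOCTYPE html>\n"
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello</p></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Names of the top-level object types, as in the cell above.
child_types = [type(child).__name__ for child in sample_soup.children]
print(child_types)

# Select the first Tag instead of relying on index 2.
html_tag = next(child for child in sample_soup.children if isinstance(child, Tag))

# Attribute access on a Tag returns the first nested tag with that name.
title_text = html_tag.head.title.get_text()
print(html_tag.name, title_text)
```

Filtering by type avoids the hard-coded index, which depends on whether the page has a doctype and leading whitespace.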

In [55]:
html = list(soup.children)[2]

In [56]:
list(html.children)

['\n', <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="index,follow" name="robots"/>
 <meta content="7 days" name="revisit-after"/>
 <meta content="General" name="rating"/>
 <meta content="English" name="language"/>
 <meta content="width=1115" name="viewport"/>
 <meta content="UC, University of California, freshman admission requirements" name="keywords"/><meta content="Review the admission standards for freshmen applying to UC" name="description"/>
 <title>Admission requirements | UC Admissions</title>
 <link href="../../_files2/css/reset.css" media="all" rel="stylesheet" type="text/css"/>
 <link href="../../_files2/css/main.css" media="all" rel="stylesheet" type="text/css"/>
 <link href="../../_files2/css/print.css" media="print" rel="stylesheet" type="text/css"/>
 <script src="../../_files2/js/jquery-1.8.1.js" type="text/javascript"></script>
 <script src="../../_files2/js/swfobject.js" type="text/javascript"></script>
 <script src="../..

In [57]:
soup.find_all('p')  # explore: list every <p> tag on the page

[<p>
 <input name="cx" type="hidden" value="001353086028328797155:-5vn6spszms"/>
 <input name="ie" type="hidden" value="UTF-8"/>
 <input id="search-txt" maxlength="300" name="q" placeholder="Search" title="Enter keywords to search the University of California website" type="text"/>
 <input alt="click here" id="search-btn" name="sa" type="image" value="submit"/>
 </p>,
 <p>Our admission guidelines are designed to ensure you are well-prepared to succeed at UC.</p>,
 <p>If you're interested in entering the University of California as a freshman, you'll have to satisfy these requirements:</p>,
 <p>a. <a href="a-g-requirements/index.html#history">History</a></p>,
 <p>2 years</p>,
 <p>b. <a href="a-g-requirements/index.html#english">English</a></p>,
 <p>4 years</p>,
 <p>c. <a href="a-g-requirements/index.html#math">Mathematics</a></p>,
 <p>3 years</p>,
 <p>d. <a href="a-g-requirements/index.html#lab">Laboratory science</a></p>,
 <p>2 years</p>,
 <p>e. <a href="a-g-requirements/index.html#lan

In [63]:
classes_info = soup.find(id="main")
classes_info

<div class="group" id="main">
<div class="col col-sml" id="col-1"><h2><a href="../index.html">Freshman</a></h2><ul class="nav" id="secondary-nav"><li><a class="expanded" href="index.html">Admission requirements</a><ul><li><a href="a-g-requirements/index.html">Subject requirement</a></li><li><a href="gpa-requirement/index.html">GPA requirement</a></li><li><a href="examination-requirement/index.html">Examination requirement</a></li><li><a href="examination/index.html">Admission by exam</a></li><li><a href="admission-by-exception/index.html">Admission by exception </a></li><li><a href="english-proficiency/index.html">English language proficiency</a></li></ul></li><li><a href="../california-residents/index.html">California residents</a></li><li><a href="../out-of-state/index.html">Out-of-state students</a></li><li><a href="../homeschool/index.html">Home-schooled students</a></li><li><a href="../how-applications-reviewed/index.html">How applications are reviewed</a></li><li><a href="../addi

In [76]:
classes_group = classes_info.find_all(class_="col col-sml")
first = classes_group[0]
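
One thing to keep in mind with the lookup above: passing a multi-class string to class_ matches the exact attribute value, so the class order matters. A CSS selector via select() matches regardless of order. The sketch below uses a small made-up fragment (the col-2 and col-3 divs are invented for the comparison).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: same classes in different orders.
fragment = (
    '<div id="main">'
    '<div class="col col-sml" id="col-1">A</div>'
    '<div class="col-sml col" id="col-2">B</div>'
    '<div class="col" id="col-3">C</div>'
    "</div>"
)
demo = BeautifulSoup(fragment, "html.parser")

# Exact string match on the class attribute: order-sensitive.
exact_ids = [div["id"] for div in demo.find_all(class_="col col-sml")]

# CSS selector: any div carrying both classes, in any order.
css_ids = [div["id"] for div in demo.select("div.col.col-sml")]

print(exact_ids, css_ids)
```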

In [77]:
print(first.prettify())

<div class="col col-sml" id="col-1">
 <h2>
  <a href="../index.html">
   Freshman
  </a>
 </h2>
 <ul class="nav" id="secondary-nav">
  <li>
   <a class="expanded" href="index.html">
    Admission requirements
   </a>
   <ul>
    <li>
     <a href="a-g-requirements/index.html">
      Subject requirement
     </a>
    </li>
    <li>
     <a href="gpa-requirement/index.html">
      GPA requirement
     </a>
    </li>
    <li>
     <a href="examination-requirement/index.html">
      Examination requirement
     </a>
    </li>
    <li>
     <a href="examination/index.html">
      Admission by exam
     </a>
    </li>
    <li>
     <a href="admission-by-exception/index.html">
      Admission by exception
     </a>
    </li>
    <li>
     <a href="english-proficiency/index.html">
      English language proficiency
     </a>
    </li>
   </ul>
  </li>
  <li>
   <a href="../california-residents/index.html">
    California residents
   </a>
  </li>
  <li>
   <a href="../out-of-state/index.html">

In [81]:
first.find_all('li')

[<li><a class="expanded" href="index.html">Admission requirements</a><ul><li><a href="a-g-requirements/index.html">Subject requirement</a></li><li><a href="gpa-requirement/index.html">GPA requirement</a></li><li><a href="examination-requirement/index.html">Examination requirement</a></li><li><a href="examination/index.html">Admission by exam</a></li><li><a href="admission-by-exception/index.html">Admission by exception </a></li><li><a href="english-proficiency/index.html">English language proficiency</a></li></ul></li>,
 <li><a href="a-g-requirements/index.html">Subject requirement</a></li>,
 <li><a href="gpa-requirement/index.html">GPA requirement</a></li>,
 <li><a href="examination-requirement/index.html">Examination requirement</a></li>,
 <li><a href="examination/index.html">Admission by exam</a></li>,
 <li><a href="admission-by-exception/index.html">Admission by exception </a></li>,
 <li><a href="english-proficiency/index.html">English language proficiency</a></li>,
 <li><a href=".

In [84]:
links = first.find('li').get_text()
print(links)

Admission requirementsSubject requirementGPA requirementExamination requirementAdmission by examAdmission by exception English language proficiency
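
The text runs together because get_text() joins every nested string with no separator by default. A possible fix, sketched on a shortened copy of the navigation markup from above, is to pass a separator, or to pull each link out on its own.

```python
from bs4 import BeautifulSoup

# Shortened copy of the nav list shown earlier (only three links kept).
nav_html = """
<ul class="nav" id="secondary-nav">
  <li><a class="expanded" href="index.html">Admission requirements</a>
    <ul>
      <li><a href="a-g-requirements/index.html">Subject requirement</a></li>
      <li><a href="gpa-requirement/index.html">GPA requirement</a></li>
    </ul>
  </li>
</ul>
"""
nav = BeautifulSoup(nav_html, "html.parser")

# Option 1: keep one string, but separate the pieces.
joined = nav.find("li").get_text(separator=" | ", strip=True)
print(joined)

# Option 2: one record per link, with its text and target.
link_rows = [
    {"text": a.get_text(strip=True), "href": a["href"]}
    for a in nav.find_all("a")
]
for row in link_rows:
    print(row)
```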


OVERALL:

This is not the best page for finding the data or for shaping it the way I want. My goal is to build a dataframe from what I extract from the HTML so that I can work with it.
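
A rough sketch of how that could go for the subject requirements, using a copy of four of the <p> pairs from the find_all('p') output above. It assumes the paragraphs strictly alternate between a subject and its number of years, which may not hold for the full page.

```python
from bs4 import BeautifulSoup

# Copied from the find_all('p') output; the rest of the page is omitted.
requirements_html = """
<p>a. <a href="a-g-requirements/index.html#history">History</a></p>
<p>2 years</p>
<p>b. <a href="a-g-requirements/index.html#english">English</a></p>
<p>4 years</p>
<p>c. <a href="a-g-requirements/index.html#math">Mathematics</a></p>
<p>3 years</p>
<p>d. <a href="a-g-requirements/index.html#lab">Laboratory science</a></p>
<p>2 years</p>
"""
req_soup = BeautifulSoup(requirements_html, "html.parser")
paragraphs = req_soup.find_all("p")

rows = []
# Assumption: even positions hold the subject, odd positions hold the years.
for subject_p, years_p in zip(paragraphs[::2], paragraphs[1::2]):
    rows.append(
        {
            "subject": subject_p.a.get_text(strip=True),
            "years": int(years_p.get_text(strip=True).split()[0]),
        }
    )

for row in rows:
    print(row)
```

A list of dicts like rows can be passed directly to pandas.DataFrame to get the table.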

Next step: work more with APIs and learn how to retrieve data as JSON.