# Education Locations

## Overview

This activity allows you to practice using the [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) library to scrape data from the web. It also allows you to practice using a **Jupyter Notebook** to both document and perform your work. As you can see, you can write _Markdown_ as well as Python.

_(quick tip: hit `esc` then `m` to start writing Markdown rather than Python. Then, hit `shift` and `enter` to run a code section)_.

## Set up

In order to use the Beautiful Soup library, we'll need to ensure it's installed on your machine. You can do this by running the following command in your terminal:

```
# Install beautifulsoup using pip on the terminal
pip install beautifulsoup4
```

You should now be able to import the library inside this notebook by running the following line of Python code:

In [1]:
from bs4 import BeautifulSoup as bs, SoupStrainer as ss


We'll also need to import a few other libraries, such as `pandas` to manage our data and `requests` to make URL requests:

In [133]:
import requests as r
import pandas as p
import re

## Identify Institution Links

Our first task is to use Python to identify the **links to institution pages** on this [website](https://collegecost.ed.gov/catc/Default.aspx). We would normally begin by requesting the page content. However, due to peculiarities of how the page is built on the client side, we'll instead read a locally saved copy of the page using the `codecs` package.

In [134]:
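import codecs
# Read the saved copy of the page; the `with` block closes the file for us
with codecs.open("college-site.html", 'r') as f:
    page_content = f.read()
# Parse the HTML with Python's built-in parser
soup = bs(page_content, 'html.parser')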

Now that we have all the page content, you should open up the [website](https://collegecost.ed.gov/catc/Default.aspx) in your browser to _identify the part of the DOM_ where the relevant information is.

In [135]:
# Find the TuitionGrid table
table = soup.find(id = 'dvCATWTuitionGrid')

In [136]:
# Extract each row from the table
table_rows = table.find_all('tr', recursive=True)
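The two lookups above can be tried on a small inline document. This is a minimal sketch: the HTML below is a made-up stand-in for the saved page (only the `dvCATWTuitionGrid` id comes from the notebook, and the `example.com` URLs and row contents are hypothetical), so the real markup may be shaped differently.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the saved page. Only the id "dvCATWTuitionGrid"
# is taken from the notebook; everything else is hypothetical.
html = """
<div id="dvCATWTuitionGrid">
  <table>
    <tr onclick="window.open('http://example.com/?id=1')"><td>School A</td></tr>
    <tr onclick="window.open('http://example.com/?id=2')"><td>School B</td></tr>
  </table>
</div>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# find(id=...) returns the first element with a matching id attribute
grid = sample_soup.find(id="dvCATWTuitionGrid")

# find_all searches all descendants by default (recursive=True)
rows = grid.find_all("tr")

print(len(rows))
print(rows[0]["onclick"])
```

Indexing a row with `["onclick"]` reads that attribute's raw string, which is what the next step picks apart.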

In [150]:
# Look at a single row
table_rows[0]['onclick']
re.findall(r"'(.*?)'", table_rows[0]['onclick'], re.DOTALL)

[u'http://nces.ed.gov/collegenavigator/?id=142328']
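To see why the pattern `'(.*?)'` picks out the URL, here is a minimal sketch run against a made-up `onclick` value. The `window.open(...)` wrapper is an assumption about the shape of the real attribute, not something copied from the page.

```python
import re

# Hypothetical onclick value; the real attribute on the page may differ.
onclick = "window.open('http://nces.ed.gov/collegenavigator/?id=142328')"

# '(.*?)' lazily captures the text between each pair of single quotes.
# re.DOTALL lets '.' match newlines too, in case the attribute spans lines.
matches = re.findall(r"'(.*?)'", onclick, re.DOTALL)
print(matches)
```

Because `findall` returns every captured group, an attribute with several quoted strings would yield several matches, which is why the notebook takes the first element.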

## Extracting links from table rows

In this section, we'll iterate through the table rows and extract the links from each one.

In [154]:
# Function to extract url
def extract_url(row):
    links = re.findall(r"'(.*?)'", row['onclick'], re.DOTALL)
    return links[0]

In [157]:
# Collect the link from every table row
links = []
for tr in table_rows:
    link = extract_url(tr)
    links.append(link)
print(len(links))

33
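The loop above assumes every row has an `onclick` attribute containing a quoted URL. If some rows did not (for example, a header row), `row['onclick']` would raise a `KeyError`. One possible defensive variant is sketched below; `extract_url_safe` is a hypothetical helper, not part of the original notebook, and plain dictionaries stand in for Beautiful Soup rows because both support `.get`.

```python
import re

def extract_url_safe(row):
    """Return the first single-quoted string in the row's onclick, or None."""
    # .get returns None instead of raising when the attribute is missing
    onclick = row.get("onclick")
    if not onclick:
        return None
    found = re.findall(r"'(.*?)'", onclick, re.DOTALL)
    return found[0] if found else None

# Plain dicts stand in for table rows here (an assumption for illustration)
sample_rows = [
    {"onclick": "window.open('http://example.com/?id=1')"},
    {},                              # e.g. a header row with no onclick
    {"onclick": "doSomething()"},    # onclick without a quoted URL
]

# Keep only the rows that actually yielded a link
safe_links = [url for url in map(extract_url_safe, sample_rows) if url is not None]
print(safe_links)
```

Whether this extra checking is needed depends on the page: in the run above, all 33 rows produced a link.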


## Iterate through links and extract content from webpage