# Web scraping
A common data collection task is to gather data from web pages and transform it into an analysis-ready format. In this exercise, you'll scrape the [Informatics Course Information](https://www.washington.edu/students/crscat/info.html) page to answer some basic questions about the courses offered.

## Set up
The `beautifulsoup4` package (imported as `bs4`) should already be installed as part of your Anaconda distribution. You can import it using the following syntax:

```python
from bs4 import BeautifulSoup as bs
```

## Scraping

In [1]:
# Import the libraries you need: beautiful soup (bs4) and `requests`
from bs4 import BeautifulSoup as bs
import requests as r

In [2]:
# Use the `get` method of the requests library to fetch the page content
content = r.get("https://www.washington.edu/students/crscat/info.html")

In [3]:
# Use bs to create a BeautifulSoup object of the page content
soup = bs(content.text, "html.parser")

In [4]:
# We can now use the `find_all` method to find all course title elements
# Iterate through these elements and store the *text* of each course title in a variable
# Hint: You'll need to review the HTML to figure out how to identify them
# Hint: use a list comprehension!
titles = [t.text for t in soup.find_all('b')]

In [5]:
# We can now use the `find_all` method to find all course description elements
# Iterate through these elements and store the *text* of each course description in a variable
# Hint: You'll need to review the HTML to figure out how to identify them
# Hint: you may have to skip certain elements...
descriptions = [t.next_element.next_element.next_element for t in soup.find_all('b')]
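
Chaining `next_element` three times works only while every entry has exactly the same layout. As a hedged alternative, the sketch below walks the siblings of each `<b>` tag and takes the first non-blank text node. The HTML snippet and the helper name `first_text_after` are assumptions made for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical snippet shaped like one course entry (not copied from the live page).
sample_html = (
    "<p><b>INFO 101 Social Networking Technologies (5) I&amp;S/NW</b><br/>"
    "Explores popular social networks.</p>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")


def first_text_after(tag):
    """Return the first non-blank text node among the siblings that follow `tag`."""
    for sibling in tag.next_siblings:
        if isinstance(sibling, NavigableString) and sibling.strip():
            return str(sibling).strip()
    return ""  # no text node followed the tag


# Pair each bold heading with the text that follows it.
pairs = [(b.get_text(), first_text_after(b)) for b in sample_soup.find_all("b")]
```

Skipping tags such as `<br/>` by type rather than by position means an extra line break or link in one entry should not shift the result.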

## Data processing
Now that you have the data, we'll restructure it so that we can easily ask questions about it.

In [6]:
# Create a dictionary where the *keys are course numbers* (e.g., INFO 370), and the values are *dictionaries* 
# With the following values: 
#     - "title": title of the course (from above)
#     - "description": description of the course (from above)
#     - "credits": can be a string of the number of credits (some are a range)
#     - "level": 100, 200, 300, or 400 (an *integer*)
#     - "meets_requirements": string of requirement(s) met (i.e., VLPA, I&S, etc.)
# Hint: start with an empty dictionary and use a loop, keeping track of the *index* 
# Hint: think of creative ways to get the credits/level from your string

import re

courses = {}

for i in range(len(titles)):
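    # Raw strings (r"...") pass backslashes such as \s and \( to the regex engine unchanged
    course_number = re.findall(r"^INFO\s[0-9]{3}", titles[i])[0]
    title = re.findall(r"(.*) \(", titles[i])[0]
    description = descriptions[i]
    credits = re.findall(r"\(.*\)", titles[i])[0]
    # First digit of the course number plus "00"; kept as a string, e.g. '100'
    level = re.findall(r"\d", titles[i])[0] + "00"
    # A list: empty when nothing follows the credits
    meets_requirements = re.findall(r"\) (.*)", titles[i])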
    
    
    courses[course_number] = {"title": title, "description": description, "credits": credits, "level": level, 
                             "meets_requirements": meets_requirements}

In [7]:
courses["INFO 101"]

{'title': 'INFO 101 Social Networking Technologies',
 'description': "Explores today's most popular social networks, gaming applications, and messaging applications. Examines technologies, social implications, and information structure. Focuses on logic, databases, networked delivery, identity, access, privacy, ecommerce, organization, and retrieval.",
 'credits': '(5)',
 'level': '100',
 'meets_requirements': ['I&S/NW']}
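
The separate `re.findall` calls can also be folded into a single pattern with named groups. This is a minimal sketch, assuming every heading follows the layout of the INFO 101 entry shown above (number, title, credits in parentheses, optional requirement codes). `TITLE_PATTERN` and `parse_title` are hypothetical names. Unlike the loop above, it strips the course number from the title, drops the parentheses from the credits, and returns the level as an integer.

```python
import re

# Assumed layout: "DEPT NNN Title (credits) requirements"
TITLE_PATTERN = re.compile(
    r"^(?P<number>[A-Z]+\s\d{3})\s+"   # e.g. "INFO 101"
    r"(?P<title>.*)"                   # greedy, so it stops at the last "("
    r"\((?P<credits>[^)]*)\)\s*"       # text inside the final parentheses
    r"(?P<requirements>.*)$"           # whatever follows, possibly empty
)


def parse_title(raw):
    """Split one raw course heading into its parts, or return None if it does not match."""
    match = TITLE_PATTERN.match(raw)
    if match is None:
        return None
    parts = match.groupdict()
    parts["title"] = parts["title"].strip()
    # The hundreds digit of the course number gives the level.
    parts["level"] = int(parts["number"].split()[-1][0]) * 100
    return parts


parsed = parse_title("INFO 101 Social Networking Technologies (5) I&S/NW")
```

Returning `None` for headings that do not match makes unexpected entries easy to spot, whereas indexing `[0]` on an empty `findall` result raises an `IndexError`.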

## Asking questions of the data
Now we can filter the dataset to answer questions of interest.

In [8]:
# How many courses are 300 level courses?
# Hint: use a list comprehension! 
level_300 = sum([1 for course in courses.values() if course['level'] == '300'])

In [9]:
# What are the course titles of courses that meet *some* university requirement?
# `meets_requirements` is a list, so test for emptiness by truthiness rather than `is not ''`
requirement_courses = [course['title'] for course in courses.values() if course['meets_requirements']]
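
One thing to watch when filtering on `meets_requirements`: `re.findall` returns a list, so a course with no requirement holds `[]`, not `''`. Below is a minimal sketch of a truthiness-based check on two sample headings. `INFO 999` is a made-up course used only to show the empty case.

```python
import re

# Two sample headings: the second (hypothetical) one has nothing after the credits.
headings = [
    "INFO 101 Social Networking Technologies (5) I&S/NW",
    "INFO 999 Example Course (3)",
]

# Map each heading's title part to the list that findall returns.
requirements = {h.split(" (")[0]: re.findall(r"\) (.*)", h) for h in headings}

# An empty list is falsy, so a plain truth test keeps only courses with a match.
with_requirement = [title for title, found in requirements.items() if found]
```

A comparison against `''` would let the empty list through, because `[]` and `''` are different objects and are not equal.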

In [10]:
# Write a function that takes in your courses object and a course level (100, 200, etc.) and 
# returns all of the *course titles* of courses that are that level

# Make sure to use a doc string to document your function
def find_courses_by_level(courses, level):
    """Return the titles of all courses at the given level (e.g., 100 or '100')."""
    return [course['title'] for course in courses.values() if course['level'] == str(level)]

In [11]:
# Demonstrate that your function works
level_100 = find_courses_by_level(courses, 100)
level_100

['INFO 101 Social Networking Technologies',
 'INFO 102 Gender and Information Technology',
 'INFO 180 Introduction to Data Science',
 'INFO 198 Exploring Informatics']
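
If you want every level at once rather than one per call, the same filter can be turned into a grouping. This is a sketch on a small sample in the same shape as `courses`. The name `group_titles_by_level` is hypothetical, and the INFO 370 title is a placeholder because its real title does not appear above.

```python
from collections import defaultdict

# Illustrative subset; only the keys this sketch needs are included.
sample_courses = {
    "INFO 101": {"title": "INFO 101 Social Networking Technologies", "level": "100"},
    "INFO 198": {"title": "INFO 198 Exploring Informatics", "level": "100"},
    "INFO 370": {"title": "INFO 370 (placeholder title)", "level": "300"},
}


def group_titles_by_level(courses):
    """Map each level to the list of course titles at that level."""
    grouped = defaultdict(list)
    for course in courses.values():
        grouped[course["level"]].append(course["title"])
    return dict(grouped)


titles_by_level = group_titles_by_level(sample_courses)
```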