# Lab 6 - Getting Data

The topic of week 6 is getting data with web scraping. 

In this lab notebook you will gain experience web scraping data in various forms.

We will be looking at various websites and using them to get information (these sites were selected because they publish public information). 

## Lab Setup 

In [None]:
# Run this cell on your first run on cloud resources. 
!pip install html5lib lxml

In [None]:
from bs4 import BeautifulSoup
import requests 
import pandas as pd
import numpy as np
import re

import os
if os.environ.get("HOME") == '/home/jovyan':  # .get avoids a KeyError when HOME is not set
    !pip install --upgrade otter-grader
    
import otter
grader = otter.Notebook()

## Exercise 1 

Identify the name of the data set most relevant to 'climate' on https://www.data.gov 

You will want to use the **url = https://catalog.data.gov/dataset?q=climate&sort=score+desc%2C+name+asc**.

Then, you will use BeautifulSoup to parse the website to find the name of the data set.  
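
As a minimal sketch of the parsing step, the snippet below runs BeautifulSoup on a small hand-written HTML string instead of the live site. The tag names and the `dataset-heading` class are assumptions made up for illustration; inspect the real page in your browser to find its actual structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates a search-results list (not the real page).
sample_html = """
<ul>
  <li><h3 class="dataset-heading"><a href="/d/1">  First Result </a></h3></li>
  <li><h3 class="dataset-heading"><a href="/d/2">Second Result</a></h3></li>
</ul>
"""

# "html.parser" ships with Python; "lxml" or "html5lib" also work if installed.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, which is the top-ranked result here.
first_heading = sample_soup.find("h3", class_="dataset-heading")

# get_text(strip=True) drops the surrounding whitespace.
first_name = first_heading.get_text(strip=True)
print(first_name)
```

With a live page you would pass the response text (for example `resp_q1.text`) to `BeautifulSoup` in place of `sample_html`.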

In [None]:
# Identify the name of the data set most relevant to "climate" on data.gov
# Use BeautifulSoup to parse the website to find the name of the data set. 
url = 'https://catalog.data.gov/dataset?q=climate&sort=score+desc%2C+name+asc'
resp_q1 = requests.get(url) 
soup_q1 = BeautifulSoup(...)

name = ...
name

In [None]:
grader.check("q1")

## Exercise 2

Identify the most viewed data set on the State of Michigan's open data portal.   
https://data.michigan.gov/browse?sortBy=most_accessed

Store its name in the provided "most_viewed_data_nm" variable and the number of views in "num_views".

Note: you only have to consider the items listed on the current page of the website; you do not have to cycle through all entries. 
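
View counts scraped from a page arrive as text, often with thousands separators, while the `%d` in the print statement expects a number. The sketch below shows one way to convert such text with a regular expression; the sample string is hypothetical and the real page may format its counts differently.

```python
import re

# Hypothetical text as it might appear next to a data set listing.
views_text = "  12,345 views "

# Capture the first run of digits and commas, then drop the commas.
match = re.search(r"[\d,]+", views_text)
views = int(match.group().replace(",", "")) if match else None
print(views)
```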

In [None]:

most_viewed_data_nm = ...
num_views = ...


print("Most visited data: %s  with %d views" % (most_viewed_data_nm, num_views))

In [None]:
grader.check("q2")

## Exercise 3

Travel advisories are given by the US government to international travelers. 

Data available at: https://travel.state.gov/content/travel/en/traveladvisories/traveladvisories.html/


You will be asked to sum the number of travel advisories by level. 

Here we get the HTML from the site; you can use it in each of the sub-questions below. 



In [None]:
site = requests.get('https://travel.state.gov/content/travel/en/traveladvisories/traveladvisories.html/').text
q3 = BeautifulSoup(site, 'html5lib')

### Exercise 3a 

Create a solution for this question using the pandas `read_html` function. 
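
For orientation, here is a small sketch of how `read_html` behaves on a hand-written table. The rows and country names are invented, and the column names only imitate what an advisory table might contain. `read_html` needs a parser such as `lxml` (or `html5lib` with BeautifulSoup), which the setup cell installs.

```python
from io import StringIO

import pandas as pd

# Hypothetical table markup; the real page will differ.
sample_table = """
<table>
  <thead>
    <tr><th>Advisory</th><th>Level</th></tr>
  </thead>
  <tbody>
    <tr><td>Country A</td><td>Level 1: Exercise Normal Precautions</td></tr>
    <tr><td>Country B</td><td>Level 2: Exercise Increased Caution</td></tr>
    <tr><td>Country C</td><td>Level 1: Exercise Normal Precautions</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(sample_table))
toy_df = tables[0]

# Count the rows whose Level text starts with "Level 1".
n_level1 = int(toy_df["Level"].str.startswith("Level 1").sum())
print(n_level1)
```

Wrapping the markup in `StringIO` matters because recent pandas versions expect a file-like object or URL rather than a literal HTML string.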

In [None]:
# Exercise 3a
# Read in the table using pandas "read_html" function, store DataFrame in q3a_df 
# Report the number of warnings at each level 

...
q3a_df = ...

l1Num1 = ...
l2Num1 = ...
l3Num1 = ...
l4Num1 = ...


print("Number of travel Level 1 warnings : " +  str(l1Num1)) 
print("Number of travel Level 2 warnings : " +  str(l2Num1))
print("Number of travel Level 3 warnings : " +  str(l3Num1))
print("Number of travel Level 4 warnings : " +  str(l4Num1))

In [None]:
grader.check("q3a")

### Exercise 3b 

Create another solution using only Beautiful Soup, regular expressions, and other Python functions to scrape the table.
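
The general pattern (walk the table rows, pull the cell text, match it with a regular expression) can be sketched on toy data as follows. The markup and the `Level N` text format are assumptions for illustration only; check the real table before relying on them.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical table markup; the real page will differ.
sample_rows = """
<table>
  <tr><th>Advisory</th><th>Level</th></tr>
  <tr><td>Country A</td><td>Level 1: Exercise Normal Precautions</td></tr>
  <tr><td>Country B</td><td>Level 2: Exercise Increased Caution</td></tr>
  <tr><td>Country C</td><td>Level 1: Exercise Normal Precautions</td></tr>
</table>
"""

table_soup = BeautifulSoup(sample_rows, "html.parser")
level_counts = Counter()

for row in table_soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) < 2:
        continue  # the header row has <th> cells only, so it is skipped
    level_match = re.match(r"Level\s+(\d)", cells[1])
    if level_match:
        level_counts[int(level_match.group(1))] += 1

print(level_counts[1], level_counts[2])
```

A `Counter` returns 0 for levels that never appear, so missing levels do not need special handling.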

In [None]:
# Exercise 3b
# Write another solution using only Beautiful Soup, regex, and other Python functions to 
#   scrape the table information
# Report the number of warnings at each level 

...

l1Num2 = ...
l2Num2 = ...
l3Num2 = ...
l4Num2 = ...


print("Number of travel Level 1 warnings : " +  str(l1Num2)) 
print("Number of travel Level 2 warnings : " +  str(l2Num2))
print("Number of travel Level 3 warnings : " +  str(l3Num2))
print("Number of travel Level 4 warnings : " +  str(l4Num2))

In [None]:
grader.check("q3b")

### Exercise 3c

Using the DataFrame `q3a_df` from part (a), use the `to_datetime` function to add a new `Date` column with the `Date Updated` converted to a datetime object. 

https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

Using this new column determine the following: 

* `q3c_df_yr2023` a DataFrame with only those travel advisories updated during 2023. 
* `q3c_df_aug` a DataFrame with only those travel advisories updated in August (any year).
* `q3c_df_tues` a DataFrame with only those travel advisories updated on a Tuesday. 
* `oldest_adv` a string with the oldest travel advisory. 
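
The date filters above rely on the `.dt` accessor of a datetime column. The sketch below demonstrates the idea on an invented three-row DataFrame; the dates and country names are hypothetical, and the date format is an assumption about how the page writes dates.

```python
import pandas as pd

# Hypothetical advisories with made-up dates.
toy = pd.DataFrame({
    "Advisory": ["Country A", "Country B", "Country C"],
    "Date Updated": ["August 1, 2023", "July 4, 2022", "August 15, 2022"],
})

# Convert the text column to datetime values.
toy["Date"] = pd.to_datetime(toy["Date Updated"])

in_2023 = toy[toy["Date"].dt.year == 2023]
in_august = toy[toy["Date"].dt.month == 8]
on_tuesday = toy[toy["Date"].dt.dayofweek == 1]  # Monday is 0, Tuesday is 1

# idxmin gives the row label of the earliest date.
oldest = toy.loc[toy["Date"].idxmin(), "Advisory"]
print(oldest)
```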

In [None]:
# Exercise 3c 
# Copy the q3a_df DataFrame to q3c_df. 
# Add a 'Date' column 
# Answer several questions 

q3c_df = q3a_df.copy()

q3c_df_yr2023 = ...

q3c_df_aug = ... 

q3c_df_tues = ...

oldest_adv = ...



In [None]:
grader.check("q3c")

##  Exercise 4

We want to create a function `make_faculty_df(site)` that takes in the URL of an MTU department's faculty listing, e.g., [CS faculty listing](https://www.mtu.edu/cs/department/people/), and returns a DataFrame with information on each of the faculty members.  

We will define several helper functions to aid in this: 

* `process_bio(div)` which takes in a `<div>` tree corresponding to a single faculty bio and returns a dictionary of information on that faculty member: `Name`, `Title`, `Email`, `Office` 

* `process_page(divs)` which takes in a list of `<div>` trees corresponding to the faculty page and returns a DataFrame with all of the relevant faculty information. 

Use regular expressions and string operations to clean up the text returned, e.g., 

* remove extra white space 
* clean up titles so you only use options such as: "Professor", "Associate Professor", "Assistant Professor", "Teaching Professor", "Associate Teaching Professor", "Assistant Teaching Professor", "Research Assistant Professor", "Senior Research Scientist", "Professor Emeritus", "Adjunct Professor", "Adjunct Associate Professor", "Adjunct Assistant Professor"
    * pay attention to the white space and remove extra blank spaces 
    * Note: do not perform this operation with a giant if-elif chain or switch statement. Use regular expressions to help with this code. 
* For office locations, remove the word "Hall" from the locations. 
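
The cleanup rules above can be sketched with a few small regular expressions. The helper names (`clean_text`, `clean_title`, `clean_office`) are hypothetical, and the title pattern is only illustrative: it covers the professor-style titles and leaves others, such as "Senior Research Scientist", for you to handle.

```python
import re

def clean_text(raw):
    """Collapse runs of whitespace (including newlines) to single spaces."""
    return re.sub(r"\s+", " ", raw).strip()

# One pattern built from optional parts instead of a long if/elif chain.
TITLE_RE = re.compile(
    r"\b(?:(?:Adjunct|Research)\s+)?"
    r"(?:(?:Associate|Assistant)\s+)?"
    r"(?:Teaching\s+)?"
    r"Professor(?:\s+Emeritus)?\b"
)

def clean_title(raw):
    """Return the first professor-style title found, or None if there is none."""
    title_match = TITLE_RE.search(clean_text(raw))
    return title_match.group() if title_match else None

def clean_office(raw):
    """Remove the word 'Hall' and tidy the remaining whitespace."""
    return clean_text(re.sub(r"\bHall\b", "", raw))

print(clean_title("Associate   Professor,\n Computer Science"))
print(clean_office("Rekhi  Hall 307"))
```

Normalizing whitespace before matching keeps the title pattern simple, since it never has to deal with stray newlines or double spaces.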

If any information is missing, then use `NaN` in the DataFrame. 

Your code should not be hard-coded to the CS faculty webpage; I may test it on another department's faculty webpage. 

Remove duplicate listings, e.g., Dr. Wang is listed at top given his role as chair of the department and also appears in the main program faculty listing. 
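
A sketch of the last two requirements, missing values and duplicates, on invented records is shown below. The names, emails, and the `to_record` helper are hypothetical and only illustrate one possible approach.

```python
import numpy as np
import pandas as pd

COLUMNS = ["Name", "Title", "Email", "Office"]

def to_record(fields):
    """Fill any missing key with NaN so every record has the same columns."""
    return {key: fields.get(key, np.nan) for key in COLUMNS}

# Hypothetical scraped bios; the second has no email or office.
raw_bios = [
    {"Name": "Ada Example", "Title": "Professor",
     "Email": "ada@example.edu", "Office": "Rekhi 101"},
    {"Name": "Bo Sample", "Title": "Assistant Professor"},
    {"Name": "Ada Example", "Title": "Professor",
     "Email": "ada@example.edu", "Office": "Rekhi 101"},  # duplicate listing
]

toy_faculty = pd.DataFrame([to_record(bio) for bio in raw_bios], columns=COLUMNS)

# drop_duplicates keeps the first occurrence of each identical row.
toy_faculty = toy_faculty.drop_duplicates().reset_index(drop=True)
print(toy_faculty)
```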

In [None]:
def process_bio(div): 
    # Input: a <div> tree of a single faculty bio 
    # Return: a dict of information on that faculty member: Name, Title, Email, Office 
    ...
    return None

def process_page(divs): 
    # Input: a list of <div> trees for the faculty page 
    # Return: a DataFrame with all of the relevant faculty information 
    ...
    return None

def make_faculty_df(site):



    # Download page and create BeautifulSoup object of the response
    ...
    soup_q4 = ...

    # Create DataFrame using information on the page and requested helper functions
    ...
    
    return df

url = 'https://www.mtu.edu/cs/department/people/'
q4_df = make_faculty_df(url)
q4_df.head()

In [None]:
grader.check("q4")

## Congratulations! You have finished Lab 6! 

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Submit this file to the Lab 6 assignment on Gradescope. 


<div class="alert alert-warning">
<strong>Warning!</strong> 
    This lab notebook will be graded a bit differently.  The `requests` module does not run properly on Gradescope.  Normally, when you upload your submission to Gradescope, it runs your code and reports the results on the test cases.  
</div>

For this assignment, the results and variables you create in your notebook will be saved out to a log file `.OTTER_LOG`.  This file will be included in your zip, when you run the export function below. 



Make sure you have run all cells in your notebook **in order** before running the cell below, so that all information gets saved to the log file correctly. The cell below will generate a zip file for you to submit. **Please save before exporting!**

If you run the notebook repeatedly, more and more information gets added to the `.OTTER_LOG` file. 

<div class="alert alert-warning">
<strong>Warning! - Clean log file</strong>     
    Before your final run through the notebook, clear all outputs, restart the kernel, and delete the `.OTTER_LOG` file so that a fresh one is created. 
</div>

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)