# Week 7 Assignment

_McKinney 6.1_

This week has been all about getting information off the internet, both in structured data formats (CSV, JSON, etc.) and as HTML.  For these exercises, we're going to use two practical examples of fetching data from web pages to show how to use Pandas and BeautifulSoup to extract structured information from the web.

---

### 33.1 Parsing a list in HTML

Go to the Banner Health Price Transparency Page: https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency

Notice that there is a list of hospitals grouped by the state they are in.  We want to parse the underlying HTML to create a list of all the hospitals along with which state they're in.

```json
[
    ["Banner - University Medical Center Phoenix", "Arizona"],
    ["Banner - University Medical Center South ", "Arizona"],
    ...
]
```

To examine the underlying HTML code, you can use Chrome, right-click, and choose **Inspect**.

For reference, the documentation for BeautifulSoup is here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
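
As a warm-up for this kind of parsing, here is a minimal sketch on a made-up HTML fragment. The markup and the function name `pair_items_with_labels` are assumptions for illustration and are not copied from the Banner page. It pairs each bold paragraph label with the items of the list that follows it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a bold <p> label followed by a <ul> of items.
SAMPLE_HTML = """
<div>
  <p><strong>Fruits</strong></p>
  <ul><li>Apple</li><li>Pear</li></ul>
  <p>A paragraph without a bold label is skipped.</p>
  <p><strong>Vegetables</strong></p>
  <ul><li>Carrot</li></ul>
</div>
"""

def pair_items_with_labels(html):
    """Return [item, label] pairs for every <li> in the <ul> after a bold <p>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for paragraph in soup.find_all("p"):
        if paragraph.strong is None:
            continue  # not a label paragraph
        items = paragraph.find_next_sibling("ul")
        if items is None:
            continue  # label with no list after it
        for item in items.find_all("li"):
            pairs.append([item.get_text(), paragraph.strong.get_text()])
    return pairs

print(pair_items_with_labels(SAMPLE_HTML))
```

`find_next_sibling` returns `None` when nothing matches, so checking for that before looping avoids a `TypeError` on pages whose layout differs from what you expect.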


In the cell below, create a function called **parse_banner(url)** that takes as its one parameter the URL of the webpage to be parsed for links.  Make sure you include docstrings and a good test case using the URL provided above.

In [1]:
from bs4 import BeautifulSoup
import requests
def parse_banner(url):
    """(string) -> list
    This function returns a list of 2-element lists.
    Each inner list holds a hospital name and the state it is in.
    
    >>> parse_banner('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency')
    [['Banner - University Medical Center Phoenix', 'Arizona'], ['Banner - University Medical Center South\xa0', 'Arizona'], ['Banner - University Medical Center Tucson', 'Arizona'], ['Banner Baywood Medical Center\xa0', 'Arizona'], ['Banner Behavioral Health Hospital', 'Arizona'], ['Banner Boswell Medical Center', 'Arizona'], ['Banner Casa Grande Medical Center', 'Arizona'], ['Banner Del E. Webb Medical Center', 'Arizona'], ["Banner Desert Medical Center/Cardon Children's Medical Center\xa0\xa0", 'Arizona'], ['Banner Estrella Medical Center', 'Arizona'], ['Banner Gateway Medical Center/Banner MD Anderson Cancer Center', 'Arizona'], ['Banner Goldfield Medical Center\xa0\xa0', 'Arizona'], ['Banner Heart Hospital', 'Arizona'], ['Banner Ironwood Medical Center', 'Arizona'], ['Banner Ocotillo Medical Center', 'Arizona'], ['Banner Payson Medical Center', 'Arizona'], ['Banner Rehabilitation Hospitals', 'Arizona'], ['Banner Thunderbird Medical Center', 'Arizona'], ['Page Hospital', 'Arizona'], ['Banner Lassen Medical Center', 'California'], ['Banner Fort Collins Medical Center', 'Colorado'], ['McKee Medical Center', 'Colorado'], ['North Colorado Medical Center', 'Colorado'], ['Sterling Regional Medical Center', 'Colorado'], ['East Morgan County Hospital', 'Colorado'], ['Ogallala Community Hospital', 'Nebraska'], ['Banner Churchill Community Hospital', 'Nevada'], ['Banner Wyoming Medical Center\xa0Central Campus', 'Wyoming'], ['Banner Wyoming Medical Center East Campus', 'Wyoming'], ['Community Hospital', 'Wyoming'], ['Washakie Medical Center', 'Wyoming'], ['Platte County Memorial Hospital', 'Wyoming']]
    """
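    returnlist = []
    # Note that you'll need to fetch the data using the following syntax to include headers
    # that make the web server think you're a real web browser.
    headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }
    response = requests.get(url, headers=headers)
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(response.text, "html.parser")
    # Each state name is a bold <p>; the <ul> that follows it lists that state's hospitals.
    for paragraph in soup.find_all('p'):
        if paragraph.strong:
            state = paragraph.strong.text
            hospital_list = paragraph.find_next_sibling("ul")
            if hospital_list is None:
                # A bold paragraph with no list after it is not a state heading.
                continue
            for item in hospital_list.find_all("li", recursive=False):
                returnlist.append([item.text, state])
    return returnlist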

In [2]:
import doctest
doctest.run_docstring_examples(parse_banner, globals(), verbose=True)

Finding tests in NoName
Trying:
    parse_banner('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency')
Expecting:
    [['Banner - University Medical Center Phoenix', 'Arizona'], ['Banner - University Medical Center South ', 'Arizona'], ['Banner - University Medical Center Tucson', 'Arizona'], ['Banner Baywood Medical Center ', 'Arizona'], ['Banner Behavioral Health Hospital', 'Arizona'], ['Banner Boswell Medical Center', 'Arizona'], ['Banner Casa Grande Medical Center', 'Arizona'], ['Banner Del E. Webb Medical Center', 'Arizona'], ["Banner Desert Medical Center/Cardon Children's Medical Center  ", 'Arizona'], ['Banner Estrella Medical Center', 'Arizona'], ['Banner Gateway Medical Center/Banner MD Anderson Cancer Center', 'Arizona'], ['Banner Goldfield Medical Center  ', 'Arizona'], ['Banner Heart Hospital', 'Arizona'], ['Banner Ironwood Medical Center', 'Arizona'], ['Banner Ocotillo Medical Center', 'Arizona'], ['Banner Payson Medical Center', '

In [3]:
banner = parse_banner('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency')
assert len(banner)==32, 'Length of result should have been 32, but {} returned.'.format(len(banner))
assert banner[0][1]=='Arizona', 'Wrong data found in the first result item: {}'.format(banner[0])
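
Several names in the result above end with `\xa0`, a non-breaking space carried over from the page's HTML. If clean names are wanted, one option is to normalize whitespace after parsing. This is only a sketch: the helper name `clean_name` is hypothetical, and applying it inside `parse_banner` would change the values the doctest expects.

```python
def clean_name(name):
    """Replace non-breaking spaces with regular spaces and collapse extra whitespace."""
    # str.split() with no arguments splits on any Unicode whitespace, including \xa0.
    return " ".join(name.split())

# Two rows taken from the output above.
rows = [
    ["Banner Baywood Medical Center\xa0", "Arizona"],
    ["Banner Wyoming Medical Center\xa0Central Campus", "Wyoming"],
]
cleaned = [[clean_name(hospital), state] for hospital, state in rows]
print(cleaned)
```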

---

### 33.2 Using a REST API (from GitHub.com)

Many websites provide something called a REST API to access information from their site programmatically, rather than relying on HTML.  One example is GitHub.com, whose API allows you to do things like "list all the public repositories for a user."

The documentation for GitHub.com's REST API can be found here: https://docs.github.com/en/rest/guides/getting-started-with-the-rest-api

Create a function called **repo_summary(user)** that takes a GitHub.com user name as its parameter and retrieves a list of all the repositories you can see for that user.  The specific documentation for this kind of request can be found here: https://docs.github.com/en/rest/reference/repos#list-repositories-for-a-user. Make sure your function is well documented with a docstring and includes a simple test to verify that you get back 12 repositories when querying for the repositories for user **paulboal**.

I've provided a related example to help you out.

In [4]:
# Example -- this example of code shows how to get basic information on the user paulboal
# For your solution, make sure you meet the requirements in the instructions above.

import requests

response = requests.get('https://api.github.com/users/paulboal')
data = response.json()

print('This information is about {}. His website is {}.'.format(data.get('login'), data.get('blog')))

This information is about paulboal. His website is www.amitechsolutions.com.


In [5]:
def print_array(array):
    """(list) -> None
    This function prints each element of the given list on its own line.
    
    >>> print_array([{"a", 1}, {"b",2}])
    {1, 'a'}
    {2, 'b'}

    """   
    for i in array:
        print(i)

In [6]:
import doctest
doctest.run_docstring_examples(print_array, globals(), verbose=True)

Finding tests in NoName
Trying:
    print_array([{"a", 1}, {"b",2}])
Expecting:
    {1, 'a'}
    {2, 'b'}
ok


In [7]:
import requests

def repo_summary(github_user):
    """(string) -> list of dict
    This function returns one dictionary per public repository of the given
    GitHub user, with the repository's web address under 'url' and its
    description under 'desc'.
    
    >>> repo_summary('paulboal')
    [{'url': 'https://github.com/paulboal/ajaxterm', 'desc': 'Patched copy of Ajaxterm that allows connecting to remote SSH servers'}, {'url': 'https://github.com/paulboal/cms_hospital_compare', 'desc': 'Hadoop Sandbox demo with CMS Hospital Compare data'}, {'url': 'https://github.com/paulboal/collibra-scripts', 'desc': 'Scripts (mostly Python) for automating tasks with the Collibra API'}, {'url': 'https://github.com/paulboal/coronadatascraper', 'desc': 'COVID-19 Coronavirus data scraped from government and curated data sources.'}, {'url': 'https://github.com/paulboal/hadoop-heuristicsminer', 'desc': 'The project implements the HeuristicsMiner process mining algorithm in Hadoop MapReduce.  See the README for additional information.'}, {'url': 'https://github.com/paulboal/hds5210-2021', 'desc': 'Course content for HDS5210 Spring 2021'}, {'url': 'https://github.com/paulboal/hds5210-2022', 'desc': 'Main class repository for HDS5210-2022'}, {'url': 'https://github.com/paulboal/jupyterhub-nbgrader', 'desc': 'Reference deployment of JupyterHub with docker'}, {'url': 'https://github.com/paulboal/nppes_demo', 'desc': 'Loading NPPES data into Hadoop for search'}, {'url': 'https://github.com/paulboal/pexpect-curses', 'desc': 'An extension to pexpect (Python expect) that allows scraping curses screens using the vt102 module.'}, {'url': 'https://github.com/paulboal/scm-products', 'desc': 'Supply Chain Project'}, {'url': 'https://github.com/paulboal/tdwi-accelerate-2017-python', 'desc': 'Code and materials for the TDWI Accelerate 2017 Python Quick Camp'}]
    """  
    url = "https://api.github.com/users/" + github_user + "/repos"
    results = []

    response = requests.get(url)
    # Fail loudly on an HTTP error (unknown user, rate limit) instead of parsing an error payload.
    response.raise_for_status()
    data = response.json()
    # Keep only the web address and description of each repository.
    for repo in data:
        repo_url = repo.get('html_url')
        desc = repo.get('description')
        results.append({'url':repo_url, 'desc': desc})
    return results


In [8]:
import doctest
doctest.run_docstring_examples(repo_summary, globals(), verbose=True)

Finding tests in NoName
Trying:
    repo_summary('paulboal')
Expecting:
    [{'url': 'https://github.com/paulboal/ajaxterm', 'desc': 'Patched copy of Ajaxterm that allows connecting to remote SSH servers'}, {'url': 'https://github.com/paulboal/cms_hospital_compare', 'desc': 'Hadoop Sandbox demo with CMS Hospital Compare data'}, {'url': 'https://github.com/paulboal/collibra-scripts', 'desc': 'Scripts (mostly Python) for automating tasks with the Collibra API'}, {'url': 'https://github.com/paulboal/coronadatascraper', 'desc': 'COVID-19 Coronavirus data scraped from government and curated data sources.'}, {'url': 'https://github.com/paulboal/hadoop-heuristicsminer', 'desc': 'The project implements the HeuristicsMiner process mining algorithm in Hadoop MapReduce.  See the README for additional information.'}, {'url': 'https://github.com/paulboal/hds5210-2021', 'desc': 'Course content for HDS5210 Spring 2021'}, {'url': 'https://github.com/paulboal/hds5210-2022', 'desc': 'Main class reposito

In [10]:
repos = repo_summary('paulboal')
assert len(repos)==12, 'Expecting 12, but {} were found'.format(len(repos))
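
GitHub returns list results one page at a time (30 items per page by default), so a single request is enough for 12 repositories but not for a user with many more. The sketch below shows one way to loop over pages. The names `collect_all_pages` and `fake_fetch` are hypothetical, and the fetch step is a stand-in that serves made-up data so the example runs without network access.

```python
def collect_all_pages(fetch_page, per_page=100):
    """Call fetch_page(page, per_page) until a short or empty page comes back."""
    items = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:
            break  # a page that is not full is the last one
        page += 1
    return items


# Offline stand-in for the API: 5 made-up repositories.
FAKE_REPOS = [
    {"html_url": "https://example.com/repo{}".format(i), "description": None}
    for i in range(5)
]

def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return FAKE_REPOS[start:start + per_page]

all_repos = collect_all_pages(fake_fetch, per_page=2)
print(len(all_repos))
```

Against the real API, `fetch_page` could wrap a call such as `requests.get(url, params={'page': page, 'per_page': per_page}).json()`, since `page` and `per_page` are the query parameters GitHub documents for this endpoint.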

---

### 33.3 Find Something of Your Own

Do some web searches and find an HTML page with some data that is relevant to something you're studying.  You can extract and parse that information using either BeautifulSoup or Pandas.  If you're using Pandas, then do something interesting to format and structure your data.  If you're using BeautifulSoup, you'll just need to do the work of parsing the data out of HTML -- that's hard enough!

You don't need to build this as a function.  Just use notebook cells as I've done above.  You will be graded based on _style_.  Use variable names that make sense for your problem / solution. Clean up anything you don't need before you submit your work.

### This next section obtains a specific table from the Wikipedia page for Texas: the daily max/min temperatures for major cities in Texas
#### 1. Pick the August and January temperatures in Fahrenheit
#### 2. List the important cities 
#### 3. Split each column into 2 columns, "max" and "min"
#### 4. Find the city with the max daily temperature and the one with the min temperature

In [11]:
import pandas as pd
dfs = pd.read_html('https://en.wikipedia.org/wiki/Texas', match = "Average daily maximum and minimum temperatures for selected cities in Texas")
txinfo = dfs[0]
txinfo.info()
txinfo

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Location      8 non-null      object
 1   August (°F)   8 non-null      object
 2   August (°C)   8 non-null      object
 3   January (°F)  8 non-null      object
 4   January (°C)  8 non-null      object
dtypes: object(5)
memory usage: 448.0+ bytes


|   | Location | August (°F) | August (°C) | January (°F) | January (°C) |
|---|---|---|---|---|---|
| 0 | Houston | 94/75 | 34/24 | 63/54 | 17/12 |
| 1 | San Antonio | 96/74 | 35/23 | 63/40 | 17/5 |
| 2 | Dallas | 96/77 | 36/25 | 57/37 | 16/3 |
| 3 | Austin | 97/74 | 36/23 | 61/45 | 16/5 |
| 4 | El Paso | 92/67 | 33/21 | 57/32 | 14/0 |
| 5 | Laredo | 100/77 | 37/25 | 67/46 | 19/7 |
| 6 | Amarillo | 89/64 | 32/18 | 50/23 | 10/−4 |
| 7 | Brownsville | 94/76 | 34/24 | 70/51 | 21/11 |


In [12]:
print("1. Dropping the columns with the Celsius Info")
print("----------------------------------------------------------------------")

# Drop the 2 Celsius columns in one call
txinfo.drop(columns=['January (°C)', 'August (°C)'], inplace=True)
txinfo.info()
type(txinfo['Location'])

1. Dropping the columns with the Celsius Info
----------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Location      8 non-null      object
 1   August (°F)   8 non-null      object
 2   January (°F)  8 non-null      object
dtypes: object(3)
memory usage: 320.0+ bytes


pandas.core.series.Series

In [13]:
print("2. Important Cities")
print("----------------------------------------------------------------------")
print(txinfo['Location'])

2. Important Cities
----------------------------------------------------------------------
0        Houston
1    San Antonio
2         Dallas
3         Austin
4        El Paso
5         Laredo
6       Amarillo
7    Brownsville
Name: Location, dtype: object


In [14]:
print("3. Split each column on the / delimiter for January and August, then drop the original columns")
print("----------------------------------------------------------------------")

# Each cell holds "max/min"; split it into two new columns.
txinfo[['January max (F)', 'January min (F)']] = txinfo['January (°F)'].str.split('/', expand=True)
txinfo[['August max (F)', 'August min (F)']] = txinfo['August (°F)'].str.split('/', expand=True)
txinfo.drop('January (°F)', axis = 1, inplace = True)
txinfo.drop('August (°F)', axis = 1, inplace = True)

# Convert the new columns from strings to numbers so that max/min comparisons are numeric

txinfo["January max (F)"] = pd.to_numeric(txinfo["January max (F)"])
txinfo["January min (F)"] = pd.to_numeric(txinfo["January min (F)"])
txinfo["August max (F)"] = pd.to_numeric(txinfo["August max (F)"])
txinfo["August min (F)"] = pd.to_numeric(txinfo["August min (F)"])


3. Split each column on the / delimiter for January and August, then drop the original columns
----------------------------------------------------------------------


In [15]:
print("4.1 Find the city with the max January daily temperature")
print("-----------------------------------------------------------------------------------------")
txinfo.loc[txinfo['January max (F)'].idxmax()]['Location']

4.1 Find the city with the max January daily temperature
-----------------------------------------------------------------------------------------


'Brownsville'

In [16]:
print("4.2 Find the city with the min January daily temperature")
print("-----------------------------------------------------------------------------------------")
txinfo.loc[txinfo['January min (F)'].idxmin()]['Location']

4.2 Find the city with the min January daily temperature
-----------------------------------------------------------------------------------------


'Amarillo'

In [17]:
print("4.3 Find the city with the max August daily temperature")
print("-----------------------------------------------------------------------------------------")
txinfo.loc[txinfo['August max (F)'].idxmax()]['Location']

4.3 Find the city with the max August daily temperature
-----------------------------------------------------------------------------------------


'Laredo'

In [18]:
print("4.4 Find the city with the min August daily temperature")
print("-----------------------------------------------------------------------------------------")
txinfo.loc[txinfo['August min (F)'].idxmin()]['Location']

4.4 Find the city with the min August daily temperature
-----------------------------------------------------------------------------------------


'Amarillo'
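
The four lookups above follow the same pattern, so they could also be done in one pass. A minimal sketch, using a small hand-typed frame with four rows from the table above rather than the live Wikipedia page (the variable names are illustrative):

```python
import pandas as pd

# Four rows copied from the table above, already split into numeric columns.
sample = pd.DataFrame({
    "Location": ["Houston", "Laredo", "Amarillo", "Brownsville"],
    "January max (F)": [63, 67, 50, 70],
    "January min (F)": [54, 46, 23, 51],
    "August max (F)": [94, 100, 89, 94],
    "August min (F)": [75, 77, 64, 76],
})

# With the city as the index, idxmax/idxmin return city names directly.
by_city = sample.set_index("Location")
hottest = by_city[["January max (F)", "August max (F)"]].idxmax()
coldest = by_city[["January min (F)", "August min (F)"]].idxmin()
print(hottest)
print(coldest)
```

Each result is a Series keyed by column name, so `hottest["August max (F)"]` gives the city with the highest August maximum.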

---

## Check your work above

If you didn't get them all correct, take a few minutes to think through those that aren't correct.


## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Follow the instructions on the prompt below to either save and submit your work, or continue working.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

---

In [19]:
a=input('''
Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

''')

if a=='yes':
    !git add week07_assignment_2.ipynb
    !git commit -a -m "Submitting the week 7 programming exercises"
    !git push
else:
    print('''
    
OK. We can wait.
''')


Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

 yes


[main 3380c47] Submitting the week 7 programming exercises
 2 files changed, 804 insertions(+), 2 deletions(-)
 create mode 100644 week07/week07_assignment_2.ipynb
Counting objects: 5, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 7.41 KiB | 1.48 MiB/s, done.
Total 5 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To github.com:rasalt/hds5210-2022.git
   6542e21..3380c47  main -> main
