# Week 8 Assignment

_McKinney 6.1_

This week has been all about getting information off the internet, both in structured data formats (CSV, JSON, etc.) and as HTML.  For these exercises, we're going to work through two practical examples of fetching data from web pages, using Pandas and BeautifulSoup to extract structured information from the web.

---

### 33.1 Parsing a list in HTML

Go to the Banner Health Price Transparency Page: https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency

Notice that there is a list of hospitals grouped by the state they are in.  We want to parse the underlying HTML to create a list of all the hospitals along with which state they're in.

```json
[
    ["Banner - University Medical Center Phoenix", "Arizona"],
    ["Banner - University Medical Center South ", "Arizona"],
    ...
]
```

To examine the underlying HTML code, you can use Chrome, right-click, and choose **Inspect**.

For reference, the documentation for BeautifulSoup is here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [1]:
from bs4 import BeautifulSoup
import requests
headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }

response = requests.get('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [2]:
div = soup.find_all('div', {"class":"col-md-8"})[0]
for hospital_list in div.find_all('ul'):
    state = hospital_list.previous_sibling.previous_sibling.string
    for hospital in hospital_list.find_all('li'):
        print(state, hospital.text)

Arizona Banner - University Medical Center Phoenix
Arizona Banner - University Medical Center South 
Arizona Banner - University Medical Center Tucson
Arizona Banner Baywood Medical Center 
Arizona Banner Behavioral Health Hospital
Arizona Banner Boswell Medical Center
Arizona Banner Casa Grande Medical Center
Arizona Banner Del E. Webb Medical Center
Arizona Banner Desert Medical Center/Cardon Children's Medical Center  
Arizona Banner Estrella Medical Center
Arizona Banner Gateway Medical Center/Banner MD Anderson Cancer Center
Arizona Banner Goldfield Medical Center  
Arizona Banner Heart Hospital
Arizona Banner Ironwood Medical Center
Arizona Banner Ocotillo Medical Center
Arizona Banner Payson Medical Center
Arizona Banner Thunderbird Medical Center
Arizona Page Hospital
California Banner Lassen Medical Center
Colorado Banner Fort Collins Medical Center
Colorado McKee Medical Center
Colorado North Colorado Medical Center
Colorado Sterling Regional Medical Center
Colorado East Morg
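
The loop above prints each pair, while the exercise asks for a list of `[hospital, state]` pairs. A minimal sketch of collecting them into a list follows. It runs on a small, hand-written HTML snippet rather than the live page, and the `<h3>` tag used for the state heading is an assumption, since the real page's markup may differ. `sample_html`, `sample_soup`, and `hospitals` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup: a heading holding the
# state name, followed by a <ul> of hospitals. The real page may differ.
sample_html = """
<div class="col-md-8">
  <h3>Arizona</h3>
  <ul>
    <li>Banner - University Medical Center Phoenix</li>
    <li>Page Hospital</li>
  </ul>
  <h3>California</h3>
  <ul>
    <li>Banner Lassen Medical Center</li>
  </ul>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
container = sample_soup.find("div", {"class": "col-md-8"})

hospitals = []
for hospital_list in container.find_all("ul"):
    # find_previous_sibling skips whitespace text nodes, so this does not
    # depend on how many siblings sit between the heading and the list.
    state = hospital_list.find_previous_sibling("h3").get_text(strip=True)
    for item in hospital_list.find_all("li"):
        hospitals.append([item.get_text(strip=True), state])

print(hospitals)
```

Note that `get_text(strip=True)` trims leading and trailing whitespace from each name; drop `strip=True` if the raw text should be kept exactly as it appears on the page.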

---

### 33.2 Using Pandas to Read Tables


Pandas documentation for loading data: https://pandas.pydata.org/pandas-docs/version/0.23.4/api.html#input-output

Pandas documentation for describing the shape of data: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.shape.html

In [3]:
import pandas as pd

In [4]:
tables = pd.read_html('https://en.wikipedia.org/wiki/Diagnosis-related_group')
len(tables)

5

In [5]:
for index,table in enumerate(tables):
    print("**************TABLE {}".format(index))
    print(table)

**************TABLE 0
    0                                                  1
0 NaN  This article has multiple issues. Please help ...
1 NaN  This article needs to be updated. Please updat...
2 NaN  This article needs additional citations for ve...
**************TABLE 1
    0                                                  1
0 NaN  This article needs to be updated. Please updat...
**************TABLE 2
    0                                                  1
0 NaN  This article needs additional citations for ve...
**************TABLE 3
   Hypothetical patient at Generic Hospital in San Francisco, CA, DRG 482, HIP & FEMUR PROCEDURES EXCEPT MAJOR JOINT W/O CC/MCC (2001)[15]:8  \
0                                         Description                                                                                          
1                              Average length of stay                                                                                          
2                      L

In [6]:
drgs = tables[4]
drgs

Unnamed: 0,Name,Version,Start date,Notes
0,MS-DRG,25,"October 1, 2007","Group numbers resequenced, so that for instanc..."
1,MS-DRG,26,"October 1, 2008",One main change: implementation of Hospital Ac...
2,MS-DRG,27,"October 1, 2009",Changes involved are mainly related to Influen...
3,MS-DRG,31,"October 1, 2013",
4,MS-DRG,32,"October 1, 2014",
5,MS-DRG,33,"October 1, 2015",Convert from ICD-9-CM to ICD-10-CM.[17]
6,MS-DRG,34,"October 1, 2016",Address ICD-10 replication issues introduced i...
7,MS-DRG,35,"October 1, 2017",MS-DRGs 984 through 986 deleted and reassigned...
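
The `shape` attribute referenced in the documentation links for this section reports a DataFrame's size as `(rows, columns)`. As a rough sketch, the frame below is rebuilt by hand from the `Name`, `Version`, and `Start date` values shown in the output above instead of being fetched from Wikipedia. The `Notes` column is left out because its text is truncated in that output, and `drgs_sample` is an illustrative name.

```python
import pandas as pd

# Hand-copied from the first three data columns of the table output above.
# The Notes column is omitted because its text is truncated there.
drgs_sample = pd.DataFrame(
    {
        "Name": ["MS-DRG"] * 8,
        "Version": [25, 26, 27, 31, 32, 33, 34, 35],
        "Start date": [
            "October 1, 2007",
            "October 1, 2008",
            "October 1, 2009",
            "October 1, 2013",
            "October 1, 2014",
            "October 1, 2015",
            "October 1, 2016",
            "October 1, 2017",
        ],
    }
)

# shape is a (rows, columns) tuple
n_rows, n_cols = drgs_sample.shape
print(n_rows, n_cols)
```

Checking `shape` for each element of `tables` is one quick way to tell the small banner tables apart from the table you actually want.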


---

### 33.3 Find Something of Your Own

Do some web searches and find an HTML page with some data that is relevant to something you're studying.  You can extract and parse that information using either BeautifulSoup or Pandas.  If you're using Pandas, then do something interesting to format and structure your data.  If you're using BeautifulSoup, you'll just need to do the work of parsing the data out of HTML -- that's hard enough!

You don't need to build this as a function.  Just use notebook cells as I've done above.  You will be graded based on _style_.  Use variable names that make sense for your problem / solution. Clean up anything you don't need before you submit your work.

In [7]:
# Extract the reference list from a PMC free article page
# The output is a list of dictionaries
# Each dictionary contains the information of one reference 
from bs4 import BeautifulSoup
import requests
headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }
article_link = 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2815940/'
response = requests.get(article_link, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [8]:
# Extract reference information and compile the list of dictionaries
import re

div = soup.find_all('div', {"id": "reference-list"})[0]
ref_list = []
for ref in div.find_all('li'):
    # Look up the spans shared by every field once per reference
    citation = ref.find('span', {"class": "element-citation"})
    journal_span = citation.find('span', {"class": "ref-journal"})
    vol_span = citation.find('span', {"class": "ref-vol"})

    # The PMID is the last path segment of the PubMed link, if there is one
    pubmed_link = citation.find('a', string="PubMed")
    pmid = '' if pubmed_link is None else pubmed_link['href'].split("/")[-1]

    if vol_span is None:
        # No volume span: treat the reference as a book
        after_title = journal_span.next_sibling
        ref_list.append({'type': 'book',
                         'author': journal_span.previous_sibling.split('.')[0].strip(),
                         'title': journal_span.text.strip(". "),
                         'year': re.split('[;.]', after_title)[1].strip('.; '),
                         'publisher': after_title.split(';')[0].strip('.; '),
                         'PMID': pmid})
    else:
        # Author and title sit in the text just before the journal span's parent
        before_journal = journal_span.parent.previous_sibling
        ref_list.append({'type': 'article',
                         'author': before_journal.split('.')[0].strip(),
                         'title': before_journal.split('.')[1].strip(),
                         'journal': journal_span.text.strip(". "),
                         'year': vol_span.previous_sibling.strip('; '),
                         'vol': vol_span.text.strip(". "),
                         'pages': citation.text.split(".")[-2].strip().split(":")[-1],
                         'PMID': pmid})

In [9]:
ref_list

[{'type': 'article',
  'author': 'Addessi E, Mancini A, Crescimbene L, Padoa-Schioppa C, Visalberghi E',
  'title': 'Preference transitivity and symbolic representation in capuchin monkeys (cebus apella)',
  'journal': 'PLoS ONE',
  'year': '2008',
  'vol': '3',
  'pages': 'e2414',
  'PMID': '18545670'},
 {'type': 'article',
  'author': 'Alajouanine T',
  'title': 'Aphasia and artistic realization',
  'journal': 'Brain',
  'year': '1948',
  'vol': '71',
  'pages': '229–241',
  'PMID': '18099548'},
 {'type': 'book',
  'author': 'Bahn PG',
  'title': 'The Cambridge Illustrated History of Prehistoric Art',
  'year': '1998',
  'publisher': 'Cambridge: Cambridge University Press',
  'PMID': ''},
 {'type': 'article',
  'author': 'Behar DM, Villems R, Soodyall H, et al',
  'title': 'The dawn of human matrilineal diversity',
  'journal': 'Am J Hum Gen',
  'year': '2008',
  'vol': '82',
  'pages': '1130–1140',
  'PMID': '18439549'},
 {'type': 'article',
  'author': 'Berridge KC',
  'title': 'Pl

In [10]:
# Calculate frequency of journals in the reference list
journal_list = []
journal_freq = {}
for ref in ref_list:
    if ref['type'] == 'article':
        journal_list.append(ref['journal'])
journal_set = set(journal_list)
for journal in journal_set:
    count = 0
    for name in journal_list:
        if name == journal:
            count += 1
    journal_freq[journal] = count
journal_freq_sorted = dict(sorted(journal_freq.items(), key=lambda item: (item[1], item[0]), reverse=True))
for j, c in journal_freq_sorted.items():
    print(j, c)

Science 4
Nature 3
Brain 3
New Scientist 2
Lancet 2
Trans Am Ophthalmol Soc 1
Spat Vis 1
Sci American 1
Prog Neurobio 1
PLoS ONE 1
Neurosci Biobehav Rev 1
Neuropsychologia 1
Neurology 1
NeuroReport 1
NeuroImage 1
Nat Rev Neurosci 1
Nat Neurosci 1
J World Prehis 1
J Neurophys 1
J Hum Evol 1
Horm Behav 1
Heredity 1
Geology 1
Funct Neurol 1
Eur J Neurol 1
Curr Anthro 1
Conscious Cogn 1
Cognition 1
Clin Exp Optom 1
CNS Spectrum 1
Bull Psycho Arts 1
Brain Cogn 1
Am J Hum Gen 1
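
The same tally can be written more compactly with `collections.Counter` from the standard library. The sketch below is one possible alternative, not a required approach. It uses a small hypothetical `sample_refs` list with only the keys this step needs, in the same shape as `ref_list`.

```python
from collections import Counter

# Hypothetical sample in the same shape as ref_list, reduced to the keys
# that the frequency count actually reads.
sample_refs = [
    {"type": "article", "journal": "Science"},
    {"type": "book", "publisher": "Basel: Karger"},
    {"type": "article", "journal": "Brain"},
    {"type": "article", "journal": "Science"},
    {"type": "article", "journal": "Nature"},
]

# Count journal names for articles only; books have no 'journal' key
journal_counts = Counter(
    ref["journal"] for ref in sample_refs if ref["type"] == "article"
)

# Same ordering as the cell above: highest count first, ties broken by
# journal name in reverse alphabetical order.
ranked = sorted(
    journal_counts.items(), key=lambda item: (item[1], item[0]), reverse=True
)
for journal, count in ranked:
    print(journal, count)
```

`Counter` replaces the nested loop over `journal_set` and `journal_list` with a single pass over the references.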


In [11]:
# Calculate frequency of years of publication in the reference list
year_list = []
year_freq = {}
for ref in ref_list:
    year_list.append(ref['year'])
year_set = set(year_list)
for year in year_set:
    count = 0
    for y in year_list:
        if y == year:
            count += 1
    year_freq[year] = count
year_freq_sorted = dict(sorted(year_freq.items(), key=lambda item: item[0], reverse=True))
for y, c in year_freq_sorted.items():
    print(y, c)

2009 1
2008 10
2007 3
2006 6
2005 8
2004 4
2003 6
2002 2
2001 2
2000 2
1999 2
1998 2
1996 1
1992 2
1990 1
1982 1
1978 1
1974 1
1962 1
1956 1
1948 1
1871 1


In [12]:
# Print the reference list
for ref in ref_list:
    if ref['type']=='article':
        print(ref['author']+'. '+ref['title']+'. '+ref['journal']+'. '+ref['year']+';'+ref['vol']+':'+ref['pages']+'.')
    else:
        print(ref['author']+'. '+ref['title']+'. '+ref['publisher']+'; '+ref['year']+'.')

Addessi E, Mancini A, Crescimbene L, Padoa-Schioppa C, Visalberghi E. Preference transitivity and symbolic representation in capuchin monkeys (cebus apella). PLoS ONE. 2008;3:e2414.
Alajouanine T. Aphasia and artistic realization. Brain. 1948;71:229–241.
Bahn PG. The Cambridge Illustrated History of Prehistoric Art. Cambridge: Cambridge University Press; 1998.
Behar DM, Villems R, Soodyall H, et al. The dawn of human matrilineal diversity. Am J Hum Gen. 2008;82:1130–1140.
Berridge KC. Pleasures of the brain. Brain Cogn. 2003;52:106–128.
Blanke O, Ortigue S, Landis T. Colour neglect in an artist. Lancet. 2003;361:264.
Bogousslavsky J, Boller F, editors. Frontiers in Neurological Neuroscience. Basel: Karger; 2005.
Boller F, Sinforiani E, Mazzucchi A. Preserved painting abilities after a stroke. Funct Neurol. 2005;20:151–155.
Burgdorf J, Panksepp J. The neurobiology of positive emotions. Neurosci Biobehav Rev. 2006;30:173–187.
Carstairs-McCarthy A. Language: many perspectives, no consensu
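
The string concatenation in the printing cell can also be expressed with f-strings, which keep the citation layout visible in one place. The sketch below is one way to do that; `format_reference` is a hypothetical helper name, and the two sample dictionaries are copied from the `ref_list` output shown earlier.

```python
def format_reference(ref):
    """Render one reference dictionary in the citation style printed above."""
    if ref["type"] == "article":
        return (
            f"{ref['author']}. {ref['title']}. {ref['journal']}. "
            f"{ref['year']};{ref['vol']}:{ref['pages']}."
        )
    # Anything that is not an article is formatted as a book
    return f"{ref['author']}. {ref['title']}. {ref['publisher']}; {ref['year']}."


# Two entries copied from the ref_list output shown earlier
article = {
    "type": "article",
    "author": "Alajouanine T",
    "title": "Aphasia and artistic realization",
    "journal": "Brain",
    "year": "1948",
    "vol": "71",
    "pages": "229–241",
    "PMID": "18099548",
}
book = {
    "type": "book",
    "author": "Bahn PG",
    "title": "The Cambridge Illustrated History of Prehistoric Art",
    "year": "1998",
    "publisher": "Cambridge: Cambridge University Press",
    "PMID": "",
}

for ref in (article, book):
    print(format_reference(ref))
```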

---

## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Save this notebook with Ctrl-S (or Cmd-S).
3. Skip down to the last command cell (the one starting with `%%bash`) and run that cell.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

In [13]:
assert False, "DO NOT REMOVE THIS LINE"

AssertionError: DO NOT REMOVE THIS LINE

---

In [14]:
%%bash
git pull
git add week08_assignment_2.ipynb
git commit -a -m "Submitting the week 8 programming assignment"
git push

Already up to date.
On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
	../practice.ipynb
	../week02/week02_inclass.ipynb
	../week04/week04_lookups.ipynb
	../week05_assignment_2.ipynb
	../week06/2019_nCoV_20200121_20200206.csv
	../week06/allergies.json
	../week06/output.csv
	../week06/week06_assignment_2copy.ipynb
	../week07-midterm/midterm-2021_copy.ipynb
	../week07-midterm/test-patients.csv
	week08_assignment_2-Copy1.ipynb
	week08_assignment_2copy1.ipynb
	../week09/

nothing added to commit but untracked files present


Everything up-to-date



---

If the message above says something like _Submitting the week 8 programming assignment_ or _Everything up-to-date_, then your work was submitted correctly.