# DNDS6013 Scientific Python: 10th Class
## Central European University, Winter 2019/2020

Instructor: Márton Pósfai, TA: Luis Natera Orozco

Emails: posfaim@ceu.edu, natera_luis@phd.ceu.edu



Go through this notebook and
* Follow links to videos explaining some basics of the XML and HTML file formats
* Study the example codes
* Solve the exercises in the notebook. Try looking at solutions only when you are done.
* Complete a final task and upload only the resulting figure to Moodle in pdf format; do not upload your code.

If you have any questions or get stuck with one of the exercises, I will be available on the [slack channel](http://sp2020winter.slack.com). I will be online during regular class hours; outside of that, I will try to get back to you as soon as possible.


## Today

Many of your final project proposals include scraping data from online sources. To give you a powerful tool for this, we will learn how to use `BeautifulSoup` to parse and search websites and XML files. We use these tools to plot the change of the human population over history using data obtained from Wikipedia.


## XML

[Watch video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=0ba489d3-a006-4e9d-adf1-ab8200ccdb32).

XML stands for Extensible Markup Language.
* A general-purpose markup language
* Both human and computer readable
* Represents information in a hierarchical way
* Strict syntax rules → effective and unambiguous parsing
* Stored in plain text
* Many APIs use it for communication
* Many file formats are special cases of XML:
    * SVG: Scalable Vector Graphics
    * RSS feeds
    * Microsoft Word docx

Let's take a look at an example (for further details see slides):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu tasty="True">
  <food kind="vegan">
    <item>Belgian Waffles</item>
    <price>$5.95</price>
    <description>Two of our famous...</description>
    <calories>650</calories>
  </food>
  <food>
    <item>French Toast</item>
    <price>$4.50</price>
    <description>Thick slices made...</description>
    <calories>600</calories>
  </food>
  <food>
    <item>Homestyle Breakfast</item>
    <price>$6.95</price>
    <description>Two eggs, bacon...</description>
    <calories>950</calories>
  </food>
</breakfast_menu>
```


An XML parser can help you navigate and manipulate the XML tree. We will use the module [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Technically, it is built on top of a lower-level parser and adds many useful functions.

In [None]:
from bs4 import BeautifulSoup

with open("test.xml",encoding='utf-8') as f:
    #from the file we create a soup object
    #we also specify the parser it uses as "xml", later we will use BeautifulSoup to parse html files too
    soup = BeautifulSoup(f,"xml")
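
`BeautifulSoup` also accepts a plain string instead of a file object, which is handy for quick experiments. The following is a minimal sketch under a few assumptions: the menu is a shortened, hypothetical version of the example above, the name `mini_soup` is made up for this illustration, and Python's built-in `"html.parser"` is used only so that the snippet runs without any extra parser installed. For real XML files, `"xml"` as used above remains the better choice.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical version of the menu, kept as a Python string
menu_text = """<breakfast_menu tasty="True">
  <food kind="vegan">
    <item>Belgian Waffles</item>
    <calories>650</calories>
  </food>
</breakfast_menu>"""

# Assumption: "html.parser" (built into Python) is lenient enough for this
# small fragment; it is used here only to avoid extra dependencies
mini_soup = BeautifulSoup(menu_text, "html.parser")

print(mini_soup.breakfast_menu["tasty"])
print(mini_soup.breakfast_menu.food.item.string)
```
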

### Tags and attributes

In [None]:
#access child by name
print(type(soup.breakfast_menu))
print()

#get name of the tag
print(soup.breakfast_menu.name)
print()

#access an attribute of a tag like a dictionary entry
print(soup.breakfast_menu['tasty'])
print()

#if multiple children have the same tag name, we get the first one
print(soup.breakfast_menu.food)
print()

#text: all text, including text contained by descendants
print(soup.breakfast_menu.food.text)
print()

#string: only the text directly inside the tag
print(soup.breakfast_menu.food.item.string)
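
The difference between `.string` and `.text` is easy to mix up, so here is a small sketch that puts them side by side. The one-line fragment and the variable names are hypothetical, and `"html.parser"` is chosen only so the snippet needs nothing beyond `bs4`.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment, not the contents of test.xml
fragment = "<food><item>French Toast</item><price>$4.50</price></food>"
food = BeautifulSoup(fragment, "html.parser").food

item_string = food.item.string  # <item> holds exactly one string
food_string = food.string       # <food> has two child tags, so no single string
food_text = food.text           # all strings below <food>, joined together

print(repr(item_string))
print(repr(food_string))
print(repr(food_text))
```

In other words, `.string` is only useful on tags that wrap a single piece of text, while `.text` always returns a (possibly long) string.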


We can also change things, for example:

In [None]:
print(soup.breakfast_menu.food.item.string)
soup.breakfast_menu.food.item.string = "Yummy Belgian Waffles"
print(soup.breakfast_menu.food.item.string)

### Navigating the tree
* Moving down: iterate through the children of a tag

In [None]:
#the first food on the menu
food1 = soup.breakfast_menu.food

list_of_children = list(food1.children)
print(list_of_children)
print()

#children also include strings containing end of line characters, to exclude them
for child in food1.children:
    if child != '\n':
        print(child)

* Moving up: get the parent of a tag

In [None]:
print(food1.parent.name)

* You can also move sideways: check documentation for `next_siblings` and `previous_siblings`
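
As a rough illustration of moving sideways, the sketch below lists the siblings of a tag in both directions. The fragment is hypothetical and deliberately written on one line, so there are no whitespace strings between the tags; in an indented file the siblings would also include such strings, just like `.children` does.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment without whitespace between the tags
fragment = ("<food><item>French Toast</item><price>$4.50</price>"
            "<calories>600</calories></food>")
food = BeautifulSoup(fragment, "html.parser").food

# everything that comes after <item> on the same level
after_item = [sib.name for sib in food.item.next_siblings]

# everything that comes before <calories>, walking backwards
before_calories = [sib.name for sib in food.calories.previous_siblings]

print(after_item)
print(before_calories)
```
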

## Searching

Most often instead of navigating up and down the tree we search for tags that match some requirements. For this we can use:
* Find the first match: `soup.find()`
* Find all matches: `soup.find_all()`

We can specify what we are searching for in various ways:
* Search based on tag names:

In [None]:
for food in soup.find_all("food"):
    print(food.item.text)

* Search based on attributes:

In [None]:
for vegan_stuff in soup.find_all(kind="vegan"):
    print(vegan_stuff.item.text)

* We can even use functions! This is a powerful tool that allows for very complex searches. For example, we can look for food options that have waffles in their description:

In [None]:
#the function takes a tag as input
#outputs True if it's a match, False if it's not
def match(tag):
    if tag.name=="food": #check if it is a food
        if "waffles" in tag.text: #check if waffles are mentioned 
            return True
    #if we didn't return True, we return False
    return False

for waffles in soup.find_all(match):
    print(waffles.item.string)


* Or we can do the same thing using lambda functions:

In [None]:
for waffles in soup.find_all(lambda tag: tag.name=="food" and "waffles" in tag.text):
    print(waffles.item.string)
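
The search criteria can also be combined. The sketch below, which uses a hypothetical menu fragment (the `<drink>` entry is made up for illustration), filters by tag name and attribute at the same time and shows that `find()` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; "html.parser" is used only to avoid extra installs
fragment = """<breakfast_menu>
<food kind="vegan"><item>Belgian Waffles</item></food>
<food><item>French Toast</item></food>
<drink kind="vegan"><item>Orange Juice</item></drink>
</breakfast_menu>"""
demo_soup = BeautifulSoup(fragment, "html.parser")

# tag name and attribute filter combined: only vegan <food> tags match
vegan_foods = [tag.item.string for tag in demo_soup.find_all("food", kind="vegan")]

# the attribute filter alone matches tags of any name
vegan_anything = [tag.item.string for tag in demo_soup.find_all(kind="vegan")]

# find() returns None when there is no match, so check before using the result
missing = demo_soup.find("food", kind="gluten-free")

print(vegan_foods)
print(vegan_anything)
print(missing)
```
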

### Exercise

Print out the price of all food items.

<details><summary><u>Hint</u></summary>
<p>

You are looking for the `<price>` tags, use the `soup.find_all()` function to get all matches.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
for price in soup.find_all("price"):
    print(price.string)
```
    
</p>
</details>

### Exercise
Calculate the average calorie content of the food options.

<details><summary><u>Hint</u></summary>
<p>

The calories are in string format, you have to convert them to a number using the `float()` function. If you store these numbers in a list, you can use `numpy`'s `np.mean()` function to get the average.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
import numpy as np
calories = [float(cal.string) for cal in soup.find_all("calories")]
print("%.2f"%np.mean(calories))
```
    
</p>
</details>

### Exercise
Print out all food items that have fewer than 800 calories.
<details><summary><u>Hint</u></summary>
<p>

This is a bit more tricky. You can define a `match(tag)` function to use with `find_all()` as we did when we were searching for waffles. To get the calories within the `match(tag)` function, use `tag.calories.string`.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
def match(tag):
    if tag.name=="food":
        if float(tag.calories.string)<800:
            return True
    return False

for food in soup.find_all(match):
    print(food.item.string)

#or with a lambda function
for food in soup.find_all(lambda tag: tag.name=="food" and float(tag.calories.text)<800):
    print(food.item.string)
```
    
</p>
</details>

### Exercise
You are worried about global warming and you would like to encourage people to reduce their carbon footprint. Reduce the price of all vegan items by 10 percent!

<details><summary><u>Hint</u></summary>
<p>

In a previous example, we already searched for vegan items. Scroll back if you don't remember how.

You can modify the string contained in a tag simply by overwriting it, e.g., `food.price.string = new_price`.
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
for food in soup.find_all(kind='vegan'):
    #convert price to number
    price = float(food.price.string[1:])
    #calculate new price
    new_price = .9*price
    #convert new price to string
    new_str = "$%.2f"%new_price
    #update xml
    food.price.string = new_str
    
#test
for food in soup.find_all('food'):
    print(food.item.string, food.price.string)
```
    
</p>
</details>

## Webscraping by parsing HTML

[Watch video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=276ebf77-45c6-464b-9ba4-ab8200e4194f)

HTML stands for Hypertext Markup Language. An HTML file is a text file that describes the content and structure of a web page. When you open a website in your browser, it downloads the HTML file and translates it into what you see.

You can view the HTML source code of a website by hitting ctrl-u (or Command+Option+u in Safari). For a simple example visit [this site](http://posfaim.web.elte.hu/example.html) and look at the source code.

It looks very similar to an XML file, but there are some differences. Check out the additional slides and the next video for more details.

We can use `BeautifulSoup` to parse and search the website as we did with XML before! Let's open the website using the `urllib.request` module and create a soup object:

In [None]:
import urllib.request
webpage = urllib.request.urlopen("http://posfaim.web.elte.hu/example.html")
soup = BeautifulSoup(webpage,"lxml") 

Note that we use a different parser here. Somewhat confusingly, `"lxml"` selects the HTML parser of the `lxml` library, while `"xml"` selects its XML parser.

Now we can navigate and search the HTML tree. For example, we can access tags by their names:

In [None]:
#for example, we can access tags by their names
title = soup.html.head.title.string
print(title)

Or we can find the table and access the data in it:

In [None]:
#there is only one table on the webpage, so we can use the find() function
#which returns the first match
table = soup.find("table") 

for row in table.find_all("tr")[1:]: #iterate through the rows of the table, we skip the header
    animal = row.th.string.strip().lower() #get the row headers, and make the string look prettier
    
    cells = row.find_all("td") # get all cells in the row
    legs = cells[2].string.strip() #grab the third one
    print("The "+animal+" has "+legs+" legs.")
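
If you need more than one column, it can be convenient to turn every row into a dictionary keyed by the column names. The following is only a sketch: the table is a hypothetical stand-in (the real example page may be laid out differently), and the names `demo_table`, `columns` and `records` are made up here.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the data is invented for illustration
html = """<table>
<tr><th>Animal</th><th>Sound</th><th>Legs</th></tr>
<tr><th> Dog </th><td>woof</td><td> 4 </td></tr>
<tr><th> Hen </th><td>cluck</td><td> 2 </td></tr>
</table>"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

rows = demo_table.find_all("tr")
# column names come from the header row
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    # row header first, then the data cells, with whitespace stripped
    values = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    records.append(dict(zip(columns, values)))

for record in records:
    print("The " + record["Animal"].lower() + " has " + record["Legs"] + " legs.")
```

Passing a list to `find_all()` matches any of the listed tag names, and `get_text(strip=True)` removes the surrounding whitespace in one step.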

### Exercise

Find all links on the page and print out all urls that they point to. If you forgot what tags represent links, look at the website's source code.

<details><summary><u>Hint</u></summary>
<p>

To access the attribute `att` of a tag use `tag['att']`.    
</p>
</details>


<details><summary><u>Solution.</u></summary>
<p>
    
```python
for a in soup.find_all("a"):
    print(a['href'])
```
    
</p>
</details>
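
On real websites, two complications are common: some `<a>` tags have no `href` attribute (so `a['href']` raises a `KeyError`), and many links are relative. A possible way to handle both is sketched below; the page fragment and `base_url` are hypothetical, and `urljoin` comes from Python's standard `urllib.parse` module.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page address and fragment, for illustration only
base_url = "http://example.org/dir/page.html"
html = ('<p><a href="/about.html">About</a> '
        '<a name="top">anchor without href</a> '
        '<a href="other.html">Other</a></p>')
demo_soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute;
# urljoin turns relative links into absolute ones
links = [urljoin(base_url, a["href"]) for a in demo_soup.find_all("a", href=True)]
print(links)
```
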

## World population over time

Now for some real webscraping. Our task is to plot the world population as a function of time based on this [table](https://en.wikipedia.org/wiki/World_population#Past_population) found on Wikipedia. Your first step is to investigate the source code of the website; you will find that it is longer and more complex than our first little example.

### Exercise

Download the https://en.wikipedia.org/wiki/World_population webpage and create a soup object called `pop_soup`.

<details><summary><u>Hint</u></summary>
<p>

You can use the same code as we used to download the simple example website, you only have to change the URL.    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
webpage = urllib.request.urlopen("https://en.wikipedia.org/wiki/World_population")
pop_soup = BeautifulSoup(webpage,"lxml") 
```
    
</p>
</details>

[Watch video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=3d773823-6623-4da1-ac87-ab82012d0aa2)

Open the source code of the website https://en.wikipedia.org/wiki/World_population in your browser. Let's try to figure out a way to find the `<table>` tag containing the "Past population" table.

One possibility is to search for the pattern "<table" in your browser using `ctrl-f`; this will find and count the tables. Iterating through them by hand, we can see that our table of interest is the 11th.

In [None]:
table = pop_soup.find_all("table")[10]

Another possibility is to notice that this is the only table that has BC dates in it.

In [None]:
table = pop_soup.find(lambda tag: tag.name=="table" and "BC" in tag.text)

Here we used `tag.text`. Remember: `tag.string` is the text directly in the tag, while `tag.text` is all the text stored in the tag, its children and other descendants.

Now that we have the table, let's extract the year column:

In [None]:
year_list = []
for row in table.find_all("tr")[1:]: #we leave out the first row, because that is just a header
    #the year is in the row headers
    year_list.append(row.th.string)
    
print(year_list)
    

If we want to use this for a plot, we have to convert these strings into numbers.

In [None]:
num_year_list = []
for str_year in year_list:
    #remove any commas
    str_year = str_year.replace(",","")
    #remove AD
    str_year = str_year.replace("AD","")
    
    #if the year is BC, remove BC and add a negative sign to the front
    if "BC" in str_year:
        str_year = "-"+str_year.replace("BC","")
    
    #convert it to a number
    num_year_list.append(float(str_year))

print(num_year_list)
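
The same conversion can be wrapped in a small function, which makes it easy to test on a few labels before running it on the whole column. This is just a sketch: `parse_year` is a name made up here, and the sample labels are assumptions about what the table may contain, not copied from it.

```python
def parse_year(label):
    """Convert a year label such as '10,000 BC' or 'AD 1' to a number."""
    # remove thousands separators and the AD marker
    cleaned = label.replace(",", "").replace("AD", "").strip()
    # BC years become negative numbers
    if "BC" in cleaned:
        return -float(cleaned.replace("BC", ""))
    return float(cleaned)

# hypothetical sample labels
sample_years = [parse_year(s) for s in ["10,000 BC", "AD 1", "2020"]]
print(sample_years)
```
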

### Exercise

Obtain a list `num_pop_list` containing the world population as a number. Treat "<0.015" as "0.015".

<details><summary><u>Hint</u></summary>
<p>

Do something similar to the previous example. Remove `,` and `<` characters and convert the string to a float. 
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
num_pop_list = []
for row in table.find_all("tr")[1:]: #we leave out the first row, because that is just a header
    #the world population is the first column
    #if we refer to the tag <td> by name, it returns the first instance
    pop = row.td.string
    #remove '<'
    pop = pop.replace("<","")
    #remove ','
    pop = pop.replace(",","")
    num_pop_list.append(float(pop))
    
print(num_pop_list)
```
    
</p>
</details>

### Exercise

Plot the world population as a function of the year. The oldest datapoints are rough estimates, try excluding them.

Bonus: Is the growth exponential? Try setting the y axis to log scale too! 

<details><summary><u>Hint</u></summary>
<p>

Import `matplotlib.pyplot` and use the `plot()` function. To set the y axis to logscale, you can use `plt.yscale("log")`. To exclude the 3 oldest datapoints use `num_pop_list[3:]`.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1,2,figsize=(14,5))
plt.sca(axes[0])
plt.xlabel("Year")
plt.ylabel("Population [million]")
plt.plot(num_year_list[3:],num_pop_list[3:],'o-')

plt.sca(axes[1])
plt.plot(num_year_list[3:],num_pop_list[3:],'o-g')
plt.xlabel("Year")
plt.ylabel("Population [million]")
plt.yscale('log')
plt.show()
```
    
</p>
</details>

## Saving figures in pdfs

Some of you had trouble saving your plots in pdf files. You can do this using the `plt.savefig("myfigure.pdf")` function; the format of the file is decided based on the extension of the filename.

A common pitfall is calling `plt.savefig("myfigure.pdf")` after `plt.show()`. While you are plotting, there is an active figure that matplotlib keeps track of in the background. Calling `plt.show()` clears the active figure, so a subsequent `plt.savefig()` ends up saving an empty figure. To avoid this, just run `plt.savefig()` first, or leave `plt.show()` out altogether. You can try this out using the following example:

In [None]:
plt.plot(range(10))
#plt.show()
plt.savefig("myfigure.pdf")
#plt.show()
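
An alternative that should sidestep the ordering issue is to keep a reference to the figure object and save through it, since `fig.savefig()` does not rely on which figure is currently active. The sketch below assumes a hypothetical file name and uses `pathlib` only to check that the file was written.

```python
from pathlib import Path
import matplotlib.pyplot as plt

# keep explicit references to the figure and its axes
fig, ax = plt.subplots()
ax.plot(range(10))

out_file = Path("myfigure_from_fig.pdf")  # hypothetical file name
fig.savefig(out_file)                     # saves this particular figure

print(out_file.exists())
```
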


## Final exercise

Pick **one** of the following problems and upload **only** the resulting figure as a pdf to Moodle. Successfully completing this task counts as attendance.
1. Scrape and plot a table from Wikipedia, for example,
    * [Voter turnout in US elections](https://en.wikipedia.org/wiki/Voter_turnout_in_the_United_States_presidential_elections)
    * [The monthly average temperature for a city of your choice](https://en.wikipedia.org/wiki/List_of_cities_by_average_temperature)
    * Anything else you find interesting <br><br>
2. [RSS](https://en.wikipedia.org/wiki/RSS) (Really Simple Syndication) feeds publish updates of websites or other information in XML format. Download, parse and create a plot using one of the following feeds:
    * Plot the maximum temperature forecast for the next three days in [Budapest](https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/3054643)
    * Create a bar chart showing the number of recent observations of Great Bustards (Túzok) and Peregrine Falcons (Vándorsólyom) in Hungary using the [feed of birding.hu](http://www.birding.hu/rss.php?rss=erdekes) *This feed is in Hungarian.*
    * Find an interesting feed and plot something. To locate feeds look for the <img src=https://upload.wikimedia.org/wikipedia/en/thumb/4/43/Feed-icon.svg/256px-Feed-icon.svg.png style="width: 2%; display: inline"> icon on websites.