### Working with APIs
# Exercises



---


<font color='violet'>
Hints are written in white, so you do not see them immediately. If you highlight them (or double-click on them), they will appear!
<font color='white'> I am a hint! :-)


---


## 1. Basic exercises

Note: In these exercises, you will have to work with dictionaries, list comprehensions and loops quite often. If you don't remember/understand them very well, you should first go through the respective concepts once more!

### Exercise 1.1

Import the ``requests`` and the ``BeautifulSoup`` library.

In [1]:
import requests
from bs4 import BeautifulSoup

Use the Wikipedia API to retrieve the Wikipedia page on tigers (Tiger).

In [2]:
URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "parse",
    "page": "Tiger",
    "format": "json",
}

r = requests.get(url=URL, params=PARAMS)

Parse the JSON response object (so that it is converted to a Python dictionary).

In [3]:
tiger = r.json()

What is the URL you retrieved the data from? Type it into your browser and inspect the structure of the data.

In [4]:
r.url

'https://en.wikipedia.org/w/api.php?action=parse&page=Tiger&format=json'
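The URL shown above is the endpoint followed by the URL-encoded parameters. As a minimal sketch using only the standard library (it reproduces the same string for these simple parameters, but is not meant to describe everything `requests` does internally), the URL can be rebuilt by hand:

```python
from urllib.parse import urlencode

URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {"action": "parse", "page": "Tiger", "format": "json"}

# Join the endpoint and the encoded key=value pairs with "?"
full_url = URL + "?" + urlencode(PARAMS)
print(full_url)
```

Changing a value in `PARAMS` (for example `"page": "Cat"`) changes only the corresponding part of the query string.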

Now inspect the keys of your dictionary (and those of the dictionary within the "parse" key).

In [5]:
tiger.keys()

dict_keys(['parse'])

In [6]:
tiger["parse"].keys()
# Note: To save some coding below, you could type: tiger = tiger["parse"]



Now try to retrieve the following:

1. The title of the page
2. All external links
3. All section headings (extra task: all main section headings, i.e. headings on level 2)
4. The number of images in the article
5. All URLs to Wikipedia articles on tigers in other languages

In [7]:
# 1. The title of the page
tiger["parse"]["title"]

'Tiger'

In [8]:
# 2. All external links
print(tiger["parse"]["externallinks"])

['https://www.iucnredlist.org/species/15955/214862019', 'https://doi.org/10.2305%2FIUCN.UK.2022-1.RLTS.T15955A214862019.en', 'https://archive.org/stream/mobot31753000798865#page/41/mode/2up', 'https://archive.org/details/checklistofpalae00elle/page/318/mode/2up', 'https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0057:entry%3Dti/gris', 'https://web.archive.org/web/20201021200154/http://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0057:entry=ti/gris', 'https://archive.org/details/onlatinlanguage01varruoft/page/96/mode/2up', 'https://doi.org/10.1086%2F693884', 'https://www.jstor.org/stable/26560471', 'https://api.semanticscholar.org/CorpusID:165388712', 'https://archive.org/details/journalofbomb33341929bomb/page/n133', 'https://archive.org/stream/PocockMammalia1/pocock1#page/n247/mode/2up', 'http://www.departments.bucknell.edu/biology/resources/msw3/browse.asp?id=14000259', 'http://www.google.com/books?id=JgAMbNSt8ikC&pg=PA546', 'https://search.worldcat.org

In [9]:
# 3. All section headings
sections = tiger["parse"]["sections"]
[section["line"] for section in sections]

['Etymology',
 'Taxonomy',
 'Subspecies',
 'Evolution',
 'Hybrids',
 'Characteristics',
 'Coat',
 'Colour variations',
 'Distribution and habitat',
 'Population density',
 'Behaviour and ecology',
 'Social spacing',
 'Communication',
 'Hunting and diet',
 'Competitors',
 'Reproduction and life cycle',
 'Health and diseases',
 'Threats',
 'Conservation',
 'Relationship with humans',
 'Hunting',
 'Attacks',
 'Captivity',
 'Cultural significance',
 'See also',
 'References',
 'Bibliography',
 'External links']

In [10]:
# 3. Extra task: all top level section headings (level 2)
[section["line"] for section in sections if section["level"]=="2"]

['Etymology',
 'Taxonomy',
 'Characteristics',
 'Distribution and habitat',
 'Behaviour and ecology',
 'Threats',
 'Conservation',
 'Relationship with humans',
 'See also',
 'References',
 'External links']
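Note that the filter above compares against the string `"2"`, not the number `2`: the section level is stored as text in the response. The following sketch illustrates the difference on a small hand-written sample (the entries and their levels are assumptions chosen for illustration, not copied from the API):

```python
# Hand-written sample mimicking a few entries of tiger["parse"]["sections"]
sample_sections = [
    {"line": "Taxonomy", "level": "2"},
    {"line": "Subspecies", "level": "3"},
    {"line": "Evolution", "level": "3"},
    {"line": "Characteristics", "level": "2"},
]

# Comparing with the integer 2 matches nothing, because "2" != 2
wrong = [s["line"] for s in sample_sections if s["level"] == 2]

# Converting the level with int() (or comparing with "2") works
right = [s["line"] for s in sample_sections if int(s["level"]) == 2]

print(wrong)
print(right)
```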

In [11]:
# 4. The number of images in the article
len(tiger["parse"]["images"])

63

In [12]:
# 5. All URLs to Wikipedia articles on tigers in other languages
languages = tiger["parse"]["langlinks"]

langlinks = [elem["url"] for elem in languages]
print(langlinks)

['https://ace.wikipedia.org/wiki/Rimu%C3%ABng', 'https://kbd.wikipedia.org/wiki/%D0%A5%D1%8C%D1%8D%D1%89%D0%BE%D0%BC%D1%8B%D1%89', 'https://ady.wikipedia.org/wiki/%D0%9A%D1%8A%D1%8D%D0%BF%D0%BB%D1%8A%D0%B0%D0%BD', 'https://af.wikipedia.org/wiki/Tier', 'https://als.wikipedia.org/wiki/Tiger', 'https://am.wikipedia.org/wiki/%E1%8A%90%E1%89%A5%E1%88%AD', 'https://anp.wikipedia.org/wiki/%E0%A4%AC%E0%A4%BE%E0%A4%98', 'https://ang.wikipedia.org/wiki/Tiger', 'https://ar.wikipedia.org/wiki/%D8%A8%D8%A8%D8%B1', 'https://an.wikipedia.org/wiki/Panthera_tigris', 'https://roa-rup.wikipedia.org/wiki/Tigru', 'https://ast.wikipedia.org/wiki/Panthera_tigris', 'https://gn.wikipedia.org/wiki/Jaguareterusu', 'https://az.wikipedia.org/wiki/P%C9%99l%C9%99ng', 'https://azb.wikipedia.org/wiki/%D9%82%D8%A7%D9%BE%D9%84%D8%A7%D9%86', 'https://ban.wikipedia.org/wiki/Macan', 'https://bn.wikipedia.org/wiki/%E0%A6%AC%E0%A6%BE%E0%A6%98', 'https://zh-min-nan.wikipedia.org/wiki/H%C3%B3%CD%98', 'https://ba.wikipedia.org/

Repetition (dictionaries and loops): Now make a dictionary of all languages for which a Wikipedia page exists, together with the titles of these pages. Use the languages as keys and the titles as values (`{"English":"Tiger", ...}`). What are tigers called in Finnish?

In [13]:
tiger_lang = {"English":"Tiger"}
for elem in languages:
  tiger_lang[elem["langname"]] = elem["*"]
tiger_lang

{'English': 'Tiger',
 'Achinese': 'Rimuëng',
 'Kabardian': 'Хьэщомыщ',
 'Adyghe': 'Къэплъан',
 'Afrikaans': 'Tier',
 'Alemannic': 'Tiger',
 'Amharic': 'ነብር',
 'Angika': 'बाघ',
 'Old English': 'Tiger',
 'Arabic': 'ببر',
 'Aragonese': 'Panthera tigris',
 'Aromanian': 'Tigru',
 'Asturian': 'Panthera tigris',
 'Guarani': 'Jaguareterusu',
 'Azerbaijani': 'Pələng',
 'South Azerbaijani': 'قاپلان',
 'Balinese': 'Macan',
 'Bangla': 'বাঘ',
 'Minnan': 'Hó͘',
 'Bashkir': 'Юлбарыҫ',
 'Belarusian': 'Тыгр',
 'Belarusian (Taraškievica orthography)': 'Тыгр',
 'Bhojpuri': 'बाघ',
 'Central Bikol': 'Tigre',
 'Bulgarian': 'Тигър',
 'Tibetan': 'སྟག',
 'Bosnian': 'Tigar',
 'Breton': 'Tigr',
 'Russia Buriat': 'Бар',
 'Catalan': 'Tigre',
 'Chuvash': 'Тигр',
 'Cebuano': 'Panthera tigris',
 'Czech': 'Tygr',
 'Corsican': 'Tigru',
 'Welsh': 'Teigr',
 'Danish': 'Tiger',
 'German': 'Tiger',
 'Navajo': 'Náshdóítsoh noodǫ́zígíí',
 'Doteli': 'बाग',
 'Dzongkha': 'སྟག',
 'Estonian': 'Tiiger',
 'Greek': 'Τίγρη',
 'Spanish

In [14]:
tiger_lang["Finnish"]

'Tiikeri'
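The same dictionary could also be built with a dictionary comprehension instead of a loop. This is a sketch on a hand-written sample that only uses the two keys needed here (`"langname"` and `"*"`); the variable names `sample_languages` and `sample_lang` are made up for this illustration:

```python
# Hand-written sample with the same keys as the entries in tiger["parse"]["langlinks"]
sample_languages = [
    {"langname": "German", "*": "Tiger"},
    {"langname": "Finnish", "*": "Tiikeri"},
    {"langname": "Czech", "*": "Tygr"},
]

# Dictionary comprehension: language name -> page title
sample_lang = {elem["langname"]: elem["*"] for elem in sample_languages}
sample_lang["English"] = "Tiger"

print(sample_lang["Finnish"])
# .get() returns a default instead of raising KeyError for a missing language
print(sample_lang.get("Klingon", "no page found"))
```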

### Exercise 1.2

Find the HTML text within the structured data on the tiger page and assign it to a variable named ``htmlText``.

In [15]:
htmlText = tiger["parse"]["text"]["*"]

Convert the string to a BeautifulSoup object and assign it to a variable called ``tigerSoup``.

In [16]:
tigerSoup = BeautifulSoup(htmlText, "html.parser")  # name the parser explicitly to avoid a warning

Repetition: Extract (1) the first table and (2) the third paragraph of the article.

In [17]:
tigerSoup.find("table")

<table class="infobox biota" style="text-align: left; width: 200px; font-size: 100%">
<tbody><tr>
<th colspan="2" style="text-align: center; background-color: rgb(235,235,210)">Tiger<br/><div style="font-size: 85%;">Temporal range: <span class="noprint"><span style="display:inline-block;"></span><span style="display:inline-block;">Early Pleistocene – Present</span> <span style="display:inline-block;"></span><div id="Timeline-row" style="margin: 4px auto 0; clear:both; width:220px; padding:0px; height:18px; overflow:visible; white-space:nowrap; border:1px #666; border-style:solid none; position:relative; z-index:0; font-size:97%;">
<div style="position:absolute; height:100%; left:0px; width:207.23076923077px; padding-left:5px; text-align:left; background-color:rgb(254,217,106); background-image: linear-gradient(to right, rgba(255,255,255,1), rgba(254,217,106,1) 15%, rgba(254,217,106,1));"><a href="/wiki/Precambrian" title="Precambrian">PreꞒ</a></div>
<div style="position:absolute; heigh

In [18]:
tigerSoup.find_all("p")[2]

<p>Throughout the tiger's range, it inhabits mainly forests, from <a href="/wiki/Conifer" title="Conifer">coniferous</a> and <a class="mw-redirect" href="/wiki/Temperate_broadleaf_and_mixed_forest" title="Temperate broadleaf and mixed forest">temperate broadleaf and mixed forests</a> in the <a href="/wiki/Russian_Far_East" title="Russian Far East">Russian Far East</a> and <a href="/wiki/Northeast_China" title="Northeast China">Northeast China</a> to <a href="/wiki/Tropical_and_subtropical_moist_broadleaf_forests" title="Tropical and subtropical moist broadleaf forests">tropical and subtropical moist broadleaf forests</a> on the <a href="/wiki/Indian_subcontinent" title="Indian subcontinent">Indian subcontinent</a> and <a href="/wiki/Southeast_Asia" title="Southeast Asia">Southeast Asia</a>. The tiger is an <a href="/wiki/Apex_predator" title="Apex predator">apex predator</a> and preys mainly on <a href="/wiki/Ungulate" title="Ungulate">ungulates</a>, which it takes by ambush. It lives 

### Exercise 1.3

Retrieve the information about 10 Wikipedia pages that match the word "tiger" and convert the response to a dictionary.

<font color='violet'> Hint: <font color='white'> The "srlimit" parameter allows you to specify how many pages you want to retrieve.

In [19]:
# Perform page search
URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srsearch": "tiger",
    "srlimit": 10
}

r = requests.get(url=URL, params=PARAMS)

# Convert to dictionary
tigers = r.json()

Navigate through the dictionary or type the URL into your browser to inspect how the data is structured.

In [20]:
# Get the URL so you can explore the structure of the data in your browser
r.url

'https://en.wikipedia.org/w/api.php?action=query&format=json&list=search&srsearch=tiger&srlimit=10'

In [21]:
# Explore the dictionary
print(tigers.keys())
print(tigers["query"].keys())
type(tigers["query"]["search"]) # tigers["query"]["search"] contains a list of dictionaries!

dict_keys(['batchcomplete', 'continue', 'query'])
dict_keys(['searchinfo', 'search'])


list

Now print out the titles of the 10 tiger pages.

In [22]:
# Extract titles
tiger_pages = tigers["query"]["search"]

[elem["title"] for elem in tiger_pages]

['Tiger',
 'Topologically Integrated Geographic Encoding and Referencing',
 'Tiger Tiger',
 'Tiger Woods',
 'Tiger (zodiac)',
 'Tiger 3',
 'Siberian tiger',
 'Bengal tiger',
 'Tiger I',
 'Tiger II']
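The `continue` key seen in the response indicates that more results are available than the 10 returned. One way to ask for the next batch is to merge the contents of `continue` into the original parameters and send the request again. The sketch below shows only the merging step; the sample response is hand-written and abbreviated, its values are assumptions, and `next_params` is a hypothetical helper name:

```python
PARAMS = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srsearch": "tiger",
    "srlimit": 10,
}

# Hand-written, abbreviated stand-in for r.json() (values are assumptions)
sample_response = {
    "batchcomplete": "",
    "continue": {"sroffset": 10, "continue": "-||"},
    "query": {"search": [{"title": "Tiger"}]},
}

def next_params(params, response):
    """Return parameters for the following batch, or None if there is none."""
    if "continue" not in response:
        return None
    # Build a new dictionary so that the original one is left untouched
    return {**params, **response["continue"]}

followup = next_params(PARAMS, sample_response)
print(followup)
```

Passing `followup` as `params` to `requests.get` would then request the next 10 hits.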

### Exercise 1.4

In the web scraping exercises, you wrote a simple scraper that fetched you some information from the following animal pages:

In [23]:
animals = ["Cat", "Dog", "Tiger", "Giant_panda"]

You will now try to do the same using the API: Write a loop that fetches all the pages and retrieves (1) the title and (2) the number of images on each page.





Let's do this step by step. First try to fetch the "Cat" page and convert the JSON response you get into a Python dictionary.

In [24]:
URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "parse",
    "page": "Cat",
    "format": "json",
}

r = requests.get(url=URL, params=PARAMS)
data = r.json()


Now try to retrieve the title and the number of images. Assign each of them to a variable.

In [25]:
title = data["parse"]["title"]
images = len(data["parse"]["images"])
print(title, images)

Cat 59


Now write a loop that fetches all the pages and writes the response into a list. You will have to create an empty list and ``append`` the new elements to it:

In [26]:
L = []
URL = "https://en.wikipedia.org/w/api.php"  # the endpoint is the same for every page

for pagename in animals:
  PARAMS = {
      "action": "parse",
      "page": pagename,
      "format": "json",
  }

  r = requests.get(url=URL, params=PARAMS)
  L.append(r)  # store the raw response object

L

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]

Finally, you can bring everything together. Improve your loop so that it parses each page, retrieves the title and the number of images and writes them into a nested list.

In [27]:
L = []
URL = "https://en.wikipedia.org/w/api.php"  # the endpoint is the same for every page

for pagename in animals:
  PARAMS = {
      "action": "parse",
      "page": pagename,
      "format": "json",
  }

  r = requests.get(url=URL, params=PARAMS)
  data = r.json()  # convert the JSON response to a dictionary

  title = data["parse"]["title"]
  images = len(data["parse"]["images"])

  L.append([title, images])  # one [title, image count] pair per page

L

[['Cat', 59], ['Dog', 47], ['Tiger', 63], ['Giant panda', 25]]
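Once the results are in a nested list, they can be processed further without calling the API again. As a small sketch, the list from the output above is sorted by image count and converted to a dictionary (the names `ranked` and `images_by_title` are made up for this illustration):

```python
# Result of the loop above, copied from its output
L = [["Cat", 59], ["Dog", 47], ["Tiger", 63], ["Giant panda", 25]]

# Sort by the second element (number of images), largest first
ranked = sorted(L, key=lambda pair: pair[1], reverse=True)

# Turn the nested list into a dictionary for lookup by title
images_by_title = dict(L)

print(ranked[0])
print(images_by_title["Dog"])
```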

## 2. Advanced exercises*

---


<font color='red'>
*Feel free to skip the advanced exercises if you feel overwhelmed or if trying to solve the basic exercises already took you a lot of time!


---




### Exercise 2.1

Write a function called ``getWiki`` that allows you to enter the name of a Wikipedia page and returns the parsed JSON response as a Python dictionary.

In [28]:
def getWiki(pagename):
  ENDPOINT = "https://en.wikipedia.org/w/api.php"

  PARAMS = {
      "action": "parse",
      "page": pagename,
      "format": "json",
  }

  r = requests.get(url=ENDPOINT, params=PARAMS)
  return r.json()

In [29]:
getWiki("Python")

{'parse': {'title': 'Python',
  'pageid': 46332325,
  'revid': 1233294168,
  'text': {'*': '<div class="mw-content-ltr mw-parser-output" lang="en" dir="ltr"><style data-mw-deduplicate="TemplateStyles:r1235681985">.mw-parser-output .side-box{margin:4px 0;box-sizing:border-box;border:1px solid #aaa;font-size:88%;line-height:1.25em;background-color:var(--background-color-interactive-subtle,#f8f9fa);display:flow-root}.mw-parser-output .side-box-abovebelow,.mw-parser-output .side-box-text{padding:0.25em 0.9em}.mw-parser-output .side-box-image{padding:2px 0 2px 0.9em;text-align:center}.mw-parser-output .side-box-imageright{padding:2px 0.9em 2px 0;text-align:center}@media(min-width:500px){.mw-parser-output .side-box-flex{display:flex;align-items:center}.mw-parser-output .side-box-text{flex:1;min-width:0}}@media(min-width:720px){.mw-parser-output .side-box{width:238px}.mw-parser-output .side-box-right{clear:right;float:right;margin-left:1em}.mw-parser-output .side-box-left{margin-right:1em}}</
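`getWiki` returns whatever the API sends back. If a page does not exist, the response usually contains an `error` object instead of `parse`, so `data["parse"]` would raise a `KeyError`. A possible guard is sketched below; `unwrap_parse` is a hypothetical helper, and both sample dictionaries are hand-written stand-ins rather than real API output:

```python
def unwrap_parse(data):
    """Return data["parse"], or raise ValueError if the API reported an error."""
    if "error" in data:
        # .get() keeps this robust if the error object lacks an "info" field
        raise ValueError(data["error"].get("info", "unknown API error"))
    return data["parse"]

# Hand-written stand-ins for getWiki(...) results (not real API output)
ok = {"parse": {"title": "Python", "pageid": 46332325}}
failed = {"error": {"code": "missingtitle", "info": "The page you specified doesn't exist."}}

print(unwrap_parse(ok)["title"])
try:
    unwrap_parse(failed)
except ValueError as err:
    message = str(err)
    print("API error:", message)
```

With the function from this exercise, this would be used as `unwrap_parse(getWiki("Python"))`.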

### Exercise 2.2

You would like to know whether Zürich or Bern is more popular on Wikipedia. For this purpose, you will measure (1) the number of Wikipedia articles within a 1 km radius around the train station and (2) the total number of images in these articles. Try to work with functions instead of copying and pasting code!

You can take the following coordinates:
* Bern: 46.949722, 7.439444
* Zürich: 47.377455, 8.539688


Start by comparing the number of articles:

In [30]:
# Define function to get names of all articles

def getArts(coords):

  URL = "https://de.wikipedia.org/w/api.php"  # Change to German Wikipedia!
  PARAMS = {
      "format": "json",
      "list": "geosearch",
      "gscoord": coords,
      "gslimit": "max",
      "gsradius": 1000,
      "action": "query"
  }

  r = requests.get(url=URL, params=PARAMS)
  DATA = r.json()
  PLACES = DATA['query']['geosearch']
  results = [place["title"] for place in PLACES]
  return results

In [31]:
bern = getArts("46.949722|7.439444")
print(len(bern))
print(bern)

206
['Bahnhof Bern', 'Hauptgebäude der Universität Bern', 'Golatenmatttor', 'Bollwerk (Bern)', 'Universität Bern', 'Hotel Schweizerhof (Bern)', 'Äusseres Aarbergertor', 'Grosse Schanze (Bern)', 'Rotes Quartier', 'Genfergasse (Bern)', 'Burgerspital', 'BLS AG', 'Schanzen (Bern)', 'Heiliggeistkirche (Bern)', 'Neuengasse', 'Aarbergergasse', 'Staatsarchiv des Kantons Bern', 'Ryffligässchen', 'Ryfflibrunnen', 'Schweizerische Theatersammlung', 'Stiftung SAPA, Schweizer Archiv der Darstellenden Künste', 'Speichergasse (Bern)', 'Christoffelturm', 'Bubenbergplatz', 'Haus der Kantone', 'Welle von Bern', 'Loebegge', 'Schweizer Reisekasse', 'Obergericht des Kantons Bern', 'Obere Altstadt', 'Spitalgasse (Bern)', 'Beratungsstelle für Unfallverhütung', 'Pfeiferbrunnen', 'Altes Schützenhaus (Bern)', 'Hodlerstrasse (Bern)', 'Kornhausplatz (Bern)', 'Christoffelgasse', 'Kunstmuseum Bern', 'Bernische Kunstgesellschaft', 'Bubenberg-Denkmal', 'Schwanengasse (Bern)', 'Progr', 'Sommerleist', 'Clientis', 'DC Ba

In [32]:
zurich = getArts("47.377455|8.539688")
len(zurich)

223
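The `gscoord` parameter expects latitude and longitude joined by a `|` character. Instead of typing these strings by hand, they could be generated from the coordinates given in the exercise. This is a sketch; `to_gscoord`, `stations` and `gscoords` are names invented for this illustration:

```python
def to_gscoord(lat, lon):
    """Format a latitude/longitude pair as the 'lat|lon' string used above."""
    return f"{lat}|{lon}"

# Coordinates of the two train stations, as given in the exercise
stations = {
    "Bern": (46.949722, 7.439444),
    "Zürich": (47.377455, 8.539688),
}

gscoords = {city: to_gscoord(*latlon) for city, latlon in stations.items()}
print(gscoords)
```

Each value can then be passed to `getArts`, for example `getArts(gscoords["Bern"])`.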

Now try to compare the total number of images in these articles. You will have to write a loop to retrieve each page.

In [33]:
# Function to get number of images
def nrImg(pagename):
  URL = "https://de.wikipedia.org/w/api.php"

  PARAMS = {
      "action": "parse",
      "page": pagename,
      "format": "json",
  }

  r = requests.get(url=URL, params=PARAMS)
  DATA = r.json()
  nr_images = len(DATA["parse"]["images"])
  return nr_images

#print(nrImg("Burgerspital"))

In [34]:
# Compute total for Bern
bern_total = 0

for i in bern:
  images = nrImg(i)
  bern_total += images

print(bern_total) # total number of images
print(len(bern))
bern_total / len(bern) # average number of images per page

1080
206


5.242718446601942

In [35]:
# Compute total for Zürich
zurich_total = 0

for i in zurich:
  images = nrImg(i)
  zurich_total += images

print(zurich_total) # total number of images
zurich_total / len(zurich) # average number of images per page

1625


7.286995515695067
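The two loops for Bern and Zürich are identical apart from the list they iterate over, so they could be folded into one function, in the spirit of the exercise. The sketch below takes the counting function as an argument, so that it runs here with made-up stub counts instead of network requests; `image_stats` and `fake_counts` are hypothetical names:

```python
def image_stats(pages, count_images):
    """Return (total, average) image counts for a list of page names.

    count_images is any function mapping a page name to a number,
    e.g. nrImg from above.
    """
    total = sum(count_images(page) for page in pages)
    average = total / len(pages) if pages else 0.0
    return total, average

# Stub counts so the sketch runs without network access (made-up numbers)
fake_counts = {"Page A": 4, "Page B": 10, "Page C": 1}
total, average = image_stats(list(fake_counts), fake_counts.get)
print(total, average)
```

With the real data, the calls would be `image_stats(bern, nrImg)` and `image_stats(zurich, nrImg)`.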