# Web APIs and HTTP requests: part II

A brief reminder: a web API (_Application Programming Interface_) is an (often public or semi-public) web-based, programmable interface that allows third parties to communicate with some digital system in an organized and automated manner.

Most importantly, most web APIs use the HTTP protocol to communicate with the external world. HTTP is the standard communication protocol of the World Wide Web (it is exactly the protocol that your browser uses to communicate with the servers hosting the websites you visit). This makes HTTP-based APIs (such as REST APIs) very universal and flexible, as HTTP-based communication is supported by every serious programming language and operating system. Moreover, it even makes it possible to use such APIs via a browser, which is very handy for testing.

So in summary a web API is:

* **Remote.** Users can access the resources from anywhere, provided they have an internet connection.
* **Reliable.** The interface exposed to users is stable, which means that it does not change often and is largely independent of changes within the system on top of which it sits.
* **Programmable.** The API can be interacted with through a predefined set of commands/methods/endpoints (an interface) in a way that can be expressed with a programming language.

Moreover, perhaps one of the most important features of a web API is that it is identified by a unique URL, which resolves to the IP address of a server (in exactly the same way as for any ordinary website). Every particular method/endpoint in the API can be interacted with by extending the base URL of the API with appropriate query parameters and/or subdirectories. For example:

* https://en.wikipedia.org/w/api.php is the base URL of the Wikipedia API.
* https://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=2 is the URL that uses the `query` endpoint and the `list` method nested within it.

Let us remind ourselves of the anatomy of the URL-based communication with web APIs.

The base URL `https://en.wikipedia.org/w/api.php` is a simple, standard URL. There is nothing special about it except for the fact that it points to the location at which the Wikipedia API lives. Now note that the `query` action is an extension of the base URL of the following form:

* `<BASE_URL> ? <QUERY STRING>`
* Where query string is a sequence of key-value pairs of the following form `<key1>=<value1>&<key2>=<value2> ...`.

The `?` sign separates the base URL from the query string. In our example the query parameters (key-value pairs) specify that we want the API to use the `query` endpoint and to execute the `list=random` method with the following arguments: `rnnamespace=0&rnlimit=2`. The `query` endpoint and all its methods, such as `list=random`, are documented at `https://en.wikipedia.org/w/api.php?action=help&modules=query` (note that this help page is itself served through an API query).
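
The `<BASE_URL> ? <QUERY STRING>` structure described above can be illustrated with Python's standard library alone, without sending any request. The snippet below is only a sketch; the variable names (`params`, `query_string`, `url`, `parts`) are chosen here for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://en.wikipedia.org/w/api.php"

# Query parameters expressed as key-value pairs
params = {
    "action": "query",
    "list": "random",
    "rnnamespace": 0,
    "rnlimit": 2,
}

# <key1>=<value1>&<key2>=<value2> ...
query_string = urlencode(params)

# <BASE_URL> ? <QUERY STRING>
url = f"{BASE_URL}?{query_string}"
print(url)

# Going the other way: split a URL back into its parts
parts = urlsplit(url)
print(parts.path)             # the path part of the base URL
print(parse_qs(parts.query))  # every value comes back as a (list of) string(s)
```

Note that after parsing, the numeric arguments are strings (`'0'` and `'2'`): everything that travels in a query string is text.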

# Working with Wikipedia API: part II

The last time we extracted the list of Wikiprojects and counted them. This time we will try to do something a little bit more involved.

1. We are going to take a random sample of 10 Wikipedia articles. There is a method in the Wikipedia API for doing just that. Note, however, that because of the sampling each of you will get different results.
2. The first step will give us only the id numbers and titles of the pages. We will use them to extract the full text of the pages via a different method of the Wikipedia API.
3. We will compute word length distributions of the pages. Reuse the code that you developed earlier for the final exercise from notebook 3.

## Step 1.

First we have to sample 10 random Wikipedia articles. This should not be too hard, since we have a special method for this, so it should be just one simple API call.

The method we are looking for is `list=random` and it is defined within the `query` endpoint (`action=query`). We can read more about it [here](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brandom).

**HINT.** Remember that you can view the results of your queries directly in the browser.

After a quick read of the docpage we can decide that we need only two query parameters:

1. `rnnamespace=0` (which limits the results to namespace `0`, the part of Wikipedia where the actual encyclopedic articles live).
2. `rnlimit=10` (because we want to extract only 10 random articles).

In [25]:
# The above considerations lead us to the following payload
# that we will want to attach to our query URL.

payload = {
    'action': 'query',  # since we want to use the `query` endpoint
    'list': 'random',   # because we want to use the `random` method
    # But we also need to add arguments for the `random` method
    'rnnamespace': 0,
    'rnlimit': 10,
    'format': 'json'    # we need to add it so the data can be read by Python
}

In [26]:
# Now we are ready to make the GET request
# But first we need to import the requests package
import requests as rq
# And define our base URL
BASE_URL = "https://en.wikipedia.org/w/api.php"

response = rq.get(BASE_URL, params=payload)
response

<Response [200]>

In [27]:
# We see that our response is OK (HTTP response code 200 means 'OK')
# So we can extract the data from the response object with `.json()`
# method defined on it.
data = response.json()
data

{'batchcomplete': '',
 'continue': {'rncontinue': '0.842030615149|0.842032552876|22861096|0',
  'continue': '-||'},
 'query': {'random': [{'id': 52666268, 'ns': 0, 'title': 'Henchir-Ezzguidane'},
   {'id': 2592649, 'ns': 0, 'title': 'Herbert Murerwa'},
   {'id': 1190077, 'ns': 0, 'title': 'One-time password'},
   {'id': 42914672, 'ns': 0, 'title': 'Sierra Leone Brewery Limited'},
   {'id': 45338291, 'ns': 0, 'title': 'Gustaf Nielsen'},
   {'id': 25707018, 'ns': 0, 'title': 'Concision'},
   {'id': 29319967, 'ns': 0, 'title': 'Cheese ripening'},
   {'id': 2583375, 'ns': 0, 'title': 'DATAR'},
   {'id': 18445956, 'ns': 0, 'title': 'Asnoun'},
   {'id': 40190989, 'ns': 0, 'title': 'Guillermo de la Dehesa'}]}}
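
The cell above relies on the fact that status code 200 means 'OK'. As a small aside, the standard library module `http` knows the official phrases for status codes, which can be handy when a request does not succeed. The helper names `describe_status` and `is_success` below are hypothetical and introduced only for this sketch; with `requests`, the integer code of a response is available as `response.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    return f"{code} {HTTPStatus(code).phrase}"

def is_success(code):
    """Codes in the 2xx range signal that the request succeeded."""
    return 200 <= code < 300

for code in (200, 404, 500):
    outcome = "success" if is_success(code) else "failure"
    print(describe_status(code), "->", outcome)
```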

In [28]:
# We obtained a relatively simple dictionary.
# We can extract the list of page ids as follows:
page_ids = [ page['id'] for page in data['query']['random'] ]
page_ids

[52666268,
 2592649,
 1190077,
 42914672,
 45338291,
 25707018,
 29319967,
 2583375,
 18445956,
 40190989]

## Step 2.
Now we have a nice list of page ids, so we can use it to extract content of the pages using a different method defined on the `query` endpoint.

We will use the so-called _cirrus doc_ method. _Cirrus_ is a system used by Wikipedia for organizing and storing text documents. Its internals do not really matter to us. What matters is that a method like this exists and that its responses have a particular format.

As we said, _cirrus doc_ is a method on the `query` endpoint and we can call it with `prop=cirrusdoc`. However, to obtain any data we also have to pass a list of page ids in a proper format.

Remember that every piece of data we provide through URL parameters (the query string) is always treated as a string. Because of this, every API has to adopt some convention for passing lists of values. The Wikipedia API uses `|` as the separator, so a list looks as follows:

* `<item 1>|<item 2>| ... |<item n>`

In [29]:
# Thus we have to join our page ids to form a single string
page_ids_string = "|".join(str(p) for p in page_ids)
page_ids_string

'52666268|2592649|1190077|42914672|45338291|25707018|29319967|2583375|18445956|40190989'
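
One detail worth knowing: `|` is not a character that normally appears literally in a URL, so it is typically percent-encoded (as `%7C`) before the request is sent, and decoded again on the receiving side. The sketch below shows this round trip with the standard library only; it uses just the first three page ids from above, and the names `encoded`, `decoded` and `recovered_ids` are illustrative.

```python
from urllib.parse import urlencode, parse_qs

page_ids = [52666268, 2592649, 1190077]
page_ids_string = "|".join(str(p) for p in page_ids)

# Encoding for use in a query string: each `|` becomes %7C
encoded = urlencode({"pageids": page_ids_string})
print(encoded)

# Decoding restores the original `|`-separated string
decoded = parse_qs(encoded)["pageids"][0]
recovered_ids = [int(p) for p in decoded.split("|")]
print(recovered_ids)
```

This also means there is no need to encode the separator by hand: it is enough to pass the plain `|`-joined string as a parameter value.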

In [31]:
# Now, the above considerations already enforce a particular form of a payload
# that we will have to attach to the request URL.

payload = {
    'action': 'query',
    'prop': 'cirrusdoc',
    'pageids': page_ids_string,
    'format': 'json'
}
payload

{'action': 'query',
 'prop': 'cirrusdoc',
 'pageids': '52666268|2592649|1190077|42914672|45338291|25707018|29319967|2583375|18445956|40190989',
 'format': 'json'}

In [32]:
# And now we are ready to make a request
response = rq.get(BASE_URL, params=payload)
response

<Response [200]>

In [38]:
# And parse the response to a json dictionary
data = response.json()
# We can look at the top-level keys of the dict
data.keys()

dict_keys(['batchcomplete', 'query'])

In [40]:
# We should be interested in the query field, since judging by the name
# it should contain the results of our query
data['query'].keys()

dict_keys(['pages'])

In [41]:
# Great, now we have only one key on the lower level, so it has to store the data
pages = data['query']['pages']
pages.keys()

dict_keys(['1190077', '2583375', '2592649', '18445956', '25707018', '29319967', '40190989', '42914672', '45338291', '52666268'])

In [43]:
# We see that the pages dictionary stores all the pages we requested, identified by their ids
# Let us look at the inner keys of sub-dict with data of a single page
key = list(pages)[0]
pages[key].keys()

dict_keys(['pageid', 'ns', 'title', 'cirrusdoc'])

In [48]:
# It seems that the main data is stored under the `cirrusdoc` key.
type(pages[key]['cirrusdoc'])

list

In [50]:
# Hmm, the cirrusdoc property is a list.
# So we have to extract data from it.
pages[key]['cirrusdoc'][0].keys()

dict_keys(['index', 'type', 'id', 'version', 'source'])

In [51]:
# Okay, finally we see the source key, that must store the actual article content
pages[key]['cirrusdoc'][0]['source'].keys()

dict_keys(['template', 'content_model', 'wiki', 'auxiliary_text', 'language', 'title', 'text', 'defaultsort', 'timestamp', 'redirect', 'wikibase_item', 'heading', 'source_text', 'version_type', 'coordinates', 'version', 'external_link', 'namespace_text', 'namespace', 'text_bytes', 'incoming_links', 'category', 'outgoing_link', 'popularity_score', 'create_timestamp', 'opening_text'])

In [53]:
# Bingo!! We see the `text` field. It contains the article text.
# This is exactly what we want to extract.
pages[key]['cirrusdoc'][0]['source']['text'][:500]

'A one-time password (OTP), also known as one-time pin or dynamic password, is a password that is valid for only one login session or transaction, on a computer system or other digital device. OTPs avoid a number of shortcomings that are associated with traditional (static) password-based authentication; a number of implementations also incorporate two-factor authentication by ensuring that the one-time password requires access to something a person has (such as a small keyring fob device with th'

We examined the anatomy of the response of the _cirrus doc_ method in the Wikipedia API. So now we understand it and we can use this new knowledge to automatically extract content of all the articles.

In [57]:
articles = [ p['cirrusdoc'][0]['source']['text'] for p in pages.values() ]
len(articles)

10
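
The list comprehension above assumes that every page entry has a non-empty `cirrusdoc` list. A slightly more defensive variant is sketched below. The helper name `extract_text` and the `sample_pages` dictionary are hypothetical: the sample is a heavily shortened stand-in for a real response, with one entry that deliberately lacks the `cirrusdoc` key.

```python
def extract_text(page):
    """Return the article text of a single page entry, or None if it is missing."""
    docs = page.get("cirrusdoc") or []
    if not docs:
        return None
    return docs[0].get("source", {}).get("text")

# Hypothetical, shortened data used only for illustration
sample_pages = {
    "1": {"pageid": 1, "ns": 0, "title": "A",
          "cirrusdoc": [{"source": {"text": "Some text"}}]},
    "2": {"pageid": 2, "ns": 0, "title": "B"},  # no `cirrusdoc` key at all
}

texts = {p["title"]: extract_text(p) for p in sample_pages.values()}
print(texts)
```

Returning `None` instead of raising a `KeyError` makes it easy to filter out pages without text before any further processing.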

Great!!! We finally extracted the data we want. Now we can apply our method for computing word length distributions to this data.
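
In case you do not have your earlier code at hand, here is one possible sketch of a word length distribution. It is not necessarily the same as the solution from notebook 3: it assumes that a "word" is any maximal run of letters (so `one-time` counts as two words), and the names `word_length_distribution` and `sample_articles` are introduced here only for illustration.

```python
import re
from collections import Counter

def word_length_distribution(text):
    """Map word length -> number of words of that length in `text`."""
    # Assumption: a word is a maximal run of letters (no digits, no underscores)
    words = re.findall(r"[^\W\d_]+", text)
    return Counter(len(w) for w in words)

# Two short strings standing in for full article texts
sample_articles = ["A one-time password (OTP)", "Cheese ripening"]

# Aggregate the distributions of all articles into one
total = Counter()
for article in sample_articles:
    total.update(word_length_distribution(article))

for length, count in sorted(total.items()):
    print(length, count)
```

With the real data you would loop over `articles` instead of `sample_articles`.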

# Homework (deadline: 04.12.2019)

## HW1 | Exercise 1.

Read about the `pageviews` method (`prop=pageviews`) in the `query` endpoint ([docpage](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageviews)). Use this method to extract page view data for the last 60 days for the pages from the previous exercise (if you want, you can sample 10 new pages with the `list=random` method).

The results will be broken down by single days, so you have to aggregate the results (sum) so they give the total page views count for the entire period of 60 days.

Remember that to select pages by page ids you pass `pageids=<id 1>|<id 2>|...|<id n>`

## HW1 | Exercise 2.

(this is a pure Python exercise for practice; not related to web APIs)

Write a function that takes one argument `n` and prints a simple pyramid of the following form:

$n = 3$
```
  *
 ***
*****
```

$n = 5$
```
    *
   ***
  *****
 *******
*********
```

Remember that we define functions in Python like this.

```python
def add_two_numbers(x, y):
    return x + y
```

And that you can print from functions like this.

```python
def print_a_string_from_function(string):
    print(string)
```

Note that to print something you do not use the `return` statement.

In [58]:
def print_pyramid(n):
    pass # Remove this and fill the function with proper code

Example usage:

```python
print_pyramid(4)
```

Should print:
```
   *
  ***
 *****
*******
```