# Web APIs and HTTP requests: part II

A brief reminder: a web API (*Application Programming Interface*) is a web-based, programmable interface (often public or semi-public) that allows third parties to communicate with some digital system in an organized and automated manner.

More importantly, the majority of web APIs use the HTTP protocol to communicate with the external world. HTTP is the standard communication protocol of the World Wide Web (it is exactly the protocol that your browser uses to communicate with the servers hosting the websites you visit). Because of this, REST APIs are universal and flexible: HTTP-based communication is supported by virtually every major programming language and operating system. It is even possible to use many REST APIs directly from a browser, which is very handy for testing.

So, in summary, a web API is:

* **Remote.** Users can access the resources from anywhere, provided they have an internet connection.
* **Reliable.** The interface exposed to users is stable, which means that it does not change often in time and is largely independent of changes within the system on top of which it sits.
* **Programmable.** An API can be interacted with through a predefined set of commands/methods/endpoints (an interface) in a way that can be expressed in a programming language.

Moreover, perhaps one of the most important features of a web API is that it is identified by a unique URL, which resolves to an IP address (exactly in the same way as for any ordinary website). Every particular method/endpoint in the API can be reached by extending the base URL of the API with appropriate query parameters and/or path segments. For example:

* https://en.wikipedia.org/w/api.php is the base URL of the Wikipedia API.
* https://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=2 is the URL that uses the `query` endpoint and the `list` method nested within it.

Let us remind ourselves of the anatomy of URL-based communication with web APIs.

The base URL `https://en.wikipedia.org/w/api.php` is a simple, standard URL. There is nothing special about it except for the fact that it points to the location at which the Wikipedia API lives. Now, note that the `query` action is an extension of the base URL of the following form:

* `<BASE_URL> ? <QUERY STRING>`
* Where query string is a sequence of key-value pairs of the following form `<key1>=<value1>&<key2>=<value2> ...`.

The `?` sign separates the base URL from the query string. In our example, the query parameters (key-value pairs) specify that we want the API to use the `query` endpoint and execute its `list=random` method with the following arguments: `rnnamespace=0&rnlimit=2`. The `query` endpoint and all its methods, such as `list=random`, are documented at: `https://en.wikipedia.org/w/api.php?action=help&modules=query` (note that this page is itself served through an API query).
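
The anatomy described above can be checked with Python's standard library. The following is a minimal sketch that splits the example URL into its base part and its query string using `urllib.parse`; the variable names are ours and purely illustrative.

```python
from urllib.parse import urlsplit, parse_qsl

url = "https://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=2"

# Split the URL into its components (scheme, host, path, query string, ...)
parts = urlsplit(url)

# Everything before the `?` sign is the base URL
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Everything after the `?` sign is a sequence of key=value pairs joined with `&`
query_params = dict(parse_qsl(parts.query))

print(base_url)
print(query_params)
```

Note that all parsed values are strings, even those that look like numbers.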

# Working with Wikipedia API: part II

Last time we extracted the list of Wikiprojects and counted them. This time we will try to do something a little bit more involved.

1. We are going to take a random sample of 10 Wikipedia articles. There is an endpoint in the Wikipedia API for doing just that. However, note that because of the sampling each of you will get different results.
2. The first step will give us only the id numbers and the titles of the pages. We will use them to extract the full text of the pages via a different endpoint of the Wikipedia API.
3. We will compute word length distributions of the pages. Specifically, we will reuse the code that you developed earlier for the final exercise from notebook 3.

## Step 1.

First, we have to sample 10 random Wikipedia articles. This should not be too hard since we have a special method for this, so it should be just one simple API call.

The method we are looking for is `list=random` and it is defined within the `query` endpoint (`action=query`). We can read more about it [here](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brandom).

**HINT.** Remember that you can view the results of your queries directly in the browser.

A quick read of the doc page and we can decide that we need only two query parameters:

1. `rnnamespace=0` (which limits the results to namespace `0`, the part of Wikipedia where actual encyclopedic articles live).
2. `rnlimit=10` (because we want to extract only 10 random articles).

In [None]:
# The above considerations lead us to the following payload
# we will want to attach to our query URL.

payload = {
    'action': 'query',  # since we want to use the `query` endpoint
    'list': 'random',   # because we want to use the `random` method
    # But we also need to add arguments for the `random` method
    'rnnamespace': 0,
    'rnlimit': 10,
    'format': 'json'    # we need to add it so the data can be read by Python
}
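
To preview such a query in the browser (as the hint above suggests), the payload can be turned into a full URL. This is a small sketch using only the standard library; it rebuilds the same payload so that it can be run on its own, and the name `full_url` is ours.

```python
from urllib.parse import urlencode

BASE_URL = "https://en.wikipedia.org/w/api.php"

payload = {
    'action': 'query',
    'list': 'random',
    'rnnamespace': 0,
    'rnlimit': 10,
    'format': 'json'
}

# urlencode turns the dict into `key1=value1&key2=value2...`
full_url = BASE_URL + "?" + urlencode(payload)
print(full_url)
```

The printed URL can be pasted directly into the address bar of a browser.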

In [None]:
# Now we are ready to make the GET request
# But first we need to import the requests package
import requests as rq
# And define our base URL
BASE_URL = "https://en.wikipedia.org/w/api.php"

response = rq.get(BASE_URL, params=payload)
response

In [None]:
# We see that our response is OK (HTTP response code 200 means 'OK')
# So we can extract the data from the response object with `.json()`
# method defined on it.
data = response.json()
data

OK, so now we have the titles of the articles and their unique ids. At this point we are only interested in the ids, because we will need them in a different endpoint to extract the texts of the articles. But how exactly are we going to access them? The data is a dictionary, so it should not be a major problem to access a single value, right?

In [None]:
data['query']['random'][0]['id']

So what exactly happened there? First, we got a mapping with three keys: `batchcomplete`, `continue`, and `query`. We were only interested in the `query` field. Therefore, we typed as follows:
```python
data['query']
```
However, it was again a mapping inside a mapping with only one field: `random`. Therefore:
```python
data['query']['random']
```
Inside this mapping, we had a ten-element list (one element per sampled article). So to access the first element we typed:
```python
data['query']['random'][0]
```
Every element of that list was also a mapping again with three keys: `id`, `ns`, and `title`. We were only interested in the `id` field. So, we just typed:
```python
data['query']['random'][0]['id']
```
We could access all the `ids` manually, one index at a time, but it is easier to use a for-loop. We loop over that list because every element in it has the same fields.

```python
for page in data['query']['random']:
    print(page['id'])
```
A loop like this would work fine if we only wanted to print the `ids`. We could even modify it a bit to store the `ids` in a list (which is what we want to do), for example:
```python
list_ids = []
for page in data['query']['random']:
    list_ids.append(page['id'])
```
So first, we would create a list outside of the loop and then use the `append` method to add each value of `page['id']` as the last element of the list. It is doable. But Python offers a smarter way of saving the results of a loop in a list. It is called a **list comprehension** and in this particular example it looks like this:
```python
page_ids = [ page['id'] for page in data['query']['random'] ]
```
It does exactly the same as the previous example but in a neater way. The difference is that first you write what is happening in the loop `page['id']` and afterward you define the loop `for page in data['query']['random']`.
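
The equivalence of the two approaches can be verified on a small example. The sketch below uses a made-up fragment shaped like the response above (the ids and titles are invented for illustration).

```python
# Made-up data shaped like the API response (ids and titles are invented)
sample_data = {
    'query': {
        'random': [
            {'id': 101, 'ns': 0, 'title': 'First page'},
            {'id': 202, 'ns': 0, 'title': 'Second page'},
            {'id': 303, 'ns': 0, 'title': 'Third page'},
        ]
    }
}

# Version 1: explicit loop with `append`
ids_loop = []
for page in sample_data['query']['random']:
    ids_loop.append(page['id'])

# Version 2: list comprehension
ids_comprehension = [page['id'] for page in sample_data['query']['random']]

print(ids_loop == ids_comprehension)
```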

In [None]:
# From the relatively simple dictionary we obtained
# we can extract the list of page ids as follows:
page_ids = [ page['id'] for page in data['query']['random'] ]
page_ids

## Step 2.
Now we have a nice list of page ids, so we can use it to extract the content of the pages using a different method defined on the `query` endpoint.

We will use a so-called _cirrus doc_ endpoint. _Cirrus_ is a system for organizing and storing text documents used by Wikipedia. It does not really matter to us. What matters is the fact that an endpoint like this exists and that it has a particular format.

As we said, _cirrus doc_ is a method on the `query` endpoint and we can call it with `prop=cirrusdoc`. However, to obtain any data we also have to pass a list of page ids in a proper format.

Remember that every piece of data we provide through URL parameters (the query string) is always treated as a string. Because of this, every API has to adopt some convention for defining lists of values. The Wikipedia API uses `|` as the separator, so a list is written as follows:

* `<item 1>|<item 2>| ... |<item n>`

In [None]:
# Thus we have to join our page ids to form a single string
# (the expression inside `join` is written just like the list comprehension above)
page_ids_string = "|".join(str(p) for p in page_ids)
page_ids_string
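
When such a string travels inside a URL, the `|` separator is percent-encoded as `%7C`. The sketch below illustrates this with the standard library and a few made-up ids; it only shows the encoding convention and does not contact the API.

```python
from urllib.parse import urlencode, parse_qsl

sample_ids = [11, 22, 33]  # made-up page ids, for illustration only

# Join the ids with the `|` separator used by the Wikipedia API
ids_string = "|".join(str(i) for i in sample_ids)

# Encode as a query string: `|` becomes `%7C`
encoded = urlencode({"pageids": ids_string})
print(encoded)

# Decoding the query string and splitting on `|` recovers the original list
decoded = dict(parse_qsl(encoded))["pageids"]
recovered = [int(x) for x in decoded.split("|")]
print(recovered)
```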

In [None]:
# Now, the above considerations already enforce a particular form of a payload
# that we will have to attach to the request URL.

payload = {
    'action': 'query',
    'prop': 'cirrusdoc',
    'pageids': page_ids_string,
    'format': 'json'
}
payload

In [None]:
# And now we are ready to make a request
response = rq.get(BASE_URL, params=payload)
response

In [None]:
# And parse the response to a json dictionary
data = response.json()
# We can look at the top-level keys of the dict
data.keys()

In [None]:
# We should be interested in the query field, since judging by the name
# it should contain the results of our query
data['query'].keys()

In [None]:
# Great, now we have only one key on the lower level, so it has to store the data
pages = data['query']['pages']
pages.keys()

In [None]:
# We see that the pages dictionary stores all the pages we requested, identified by their ids
# Let us look at the inner keys of the sub-dict with the data of a single page
key = list(pages)[0]
pages[key].keys()

In [None]:
# It seems that the main data is stored under the `cirrusdoc` key.
type(pages[key]['cirrusdoc'])

In [None]:
# Hmm, the cirrusdoc property is a list.
# So we have to extract data from it.
pages[key]['cirrusdoc'][0].keys()

In [None]:
# Okay, finally we see the source key, that must store the actual article content
pages[key]['cirrusdoc'][0]['source'].keys()

In [None]:
# Bingo!! We see the `text` field. It contains the article text.
# This is exactly what we want to extract.
pages[key]['cirrusdoc'][0]['source']['text']

We examined the anatomy of the response of the _cirrus doc_ method in the Wikipedia API. So now we understand it and we can use this new knowledge to automatically extract the content of all the articles.

In [None]:
articles = [ p['cirrusdoc'][0]['source']['text'] for p in pages.values() ]
len(articles)

In [None]:
## NOTE THAT THE PREVIOUS EXPRESSION
## DOES THE SAME AS THE FOLLOWING MORE VERBOSE EXPRESSION
articles = []
for page_id in pages.keys():
    page = pages[page_id]
    cirrus = page['cirrusdoc']
    page_data = cirrus[0]
    source = page_data['source']
    text = source['text']
    articles.append(text)

len(articles)
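
Both versions above assume that every page has a complete `cirrusdoc` entry. If that assumption does not hold for some page, indexing will raise an error. Below is a defensive sketch (the helper name `extract_texts` and the sample data are ours, not part of the API) that simply skips incomplete pages.

```python
def extract_texts(pages):
    """Map page id -> article text, skipping pages without usable data."""
    texts = {}
    for page_id, page in pages.items():
        docs = page.get('cirrusdoc') or []
        if not docs:
            continue  # no cirrusdoc entry for this page
        text = docs[0].get('source', {}).get('text')
        if text is not None:
            texts[int(page_id)] = text
    return texts

# Made-up data shaped like `data['query']['pages']`
sample_pages = {
    '1': {'cirrusdoc': [{'source': {'text': 'Some article text'}}]},
    '2': {'title': 'A page without cirrusdoc'},
}
print(extract_texts(sample_pages))
```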

Great!!! We finally extracted the data we want. Now we can apply our method for computing word length distributions to this data.
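
The word-length code from notebook 3 is not reproduced here, so the following is only a stand-in sketch. It assumes that a "word" is a maximal run of letters and counts the lengths with `collections.Counter`; the function name `word_length_distribution` and the sample texts are ours.

```python
import re
from collections import Counter

def word_length_distribution(text):
    """Map word length -> number of words of that length."""
    # A "word" here is a run of letters (digits and punctuation are ignored)
    words = re.findall(r"[^\W\d_]+", text)
    return Counter(len(word) for word in words)

# Made-up stand-ins for the extracted articles
sample_articles = ["The cat sat on a mat.", "Web APIs are handy."]
distributions = [word_length_distribution(a) for a in sample_articles]

for dist in distributions:
    print(sorted(dist.items()))
```

With real data, `sample_articles` would be replaced by the `articles` list extracted above.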

# Homework (deadline: 11.12.2019)

Write solutions for the homework exercises in this notebook. Once the work is done, download the notebook file (`File > Download .ipynb`), rename it so that it follows the template `HW1_<NAME>_<SURNAME>.ipynb`, and send the file to us. Use one (or preferably both) of the following e-mails:

* <stalaga@uw.edu.pl>
* <m.biesaga@uw.edu.pl>

Remember that you can contact us if you have any problems. You can describe your problems in the `hw1` channel or in private messages to us on Slack. You can also write normal e-mails. Moreover, you can also visit us in the ISS on the fourth floor (room 415). Usually, at least one of us is there after 11/12 for at least a few hours. Although it is best to set up a meeting earlier via e-mail or a private message.

## HW1 | Exercise 1.

Read about the `pageviews` method (`prop=pageviews`) in the `query` endpoint ([docpage](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageviews)). Use this method to extract page views data for the last 60 days for the pages from the previous exercise (if you want, you can sample 10 new pages with the `list=random` method).

The results will be broken down by single days, so you have to aggregate the results (sum) so they give the total page views count for the entire period of 60 days.

Remember that to select pages by page ids you pass `pageids=<id 1>|<id 2>|...|<id n>`. We did a very similar thing when we extracted article content through the `cirrusdoc` method in the Wikipedia API in the previous part of this notebook.

Your final output should be a `dict` object that maps page ids to pageviews (total number of pageviews over 60 days). It should look something like this:

```python
results = {
    # page_id: pageviews
    153253: 10204,
    423423: 101,
    11012:  12,
    42435:  546,
    # and so on
}
```

If you want you can sample 10 pages yourself. Otherwise, you may use the following list of page ids that we prepared for you.
Sampling pages yourself will give you extra credit (but it is possible to get maximum points without it as well).

In [2]:
import requests

page_ids = [
    19969580,
    39982842,
    25699035,
    52642931,
    53055349,
    24133565,
    1164662,
    40656459,
    12533026,
    47110862
]

In [19]:
BASE_URL = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',
    'prop': 'pageviews',
    'pageids': '|'.join(str(pid) for pid in page_ids),
    'pvidays': 60,
    'format': 'json'
}
response = requests.get(BASE_URL, params=params)
response

<Response [200]>

In [21]:
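# Each page maps dates to daily view counts
data = response.json()['query']['pages']
# Sum the daily counts per page; `filter(None, ...)` drops missing (None) values
PV = { int(k): sum(filter(None, v.get('pageviews', {}).values())) for k, v in data.items() }
PV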

{1164662: 1697,
 12533026: 845,
 19969580: 407,
 24133565: 2103,
 25699035: 54,
 39982842: 28,
 40656459: 20,
 47110862: 60,
 52642931: 18,
 53055349: 89}
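
For readers who find the one-line dict comprehension above hard to follow, here is a more verbose sketch of the same aggregation. It runs on a small made-up response fragment (the ids, dates, and counts are invented for illustration) and assumes that days without data are reported as `None`.

```python
# Made-up fragment shaped like `response.json()['query']['pages']`
sample_pages = {
    "101": {"pageviews": {"2019-12-01": 10, "2019-12-02": None, "2019-12-03": 5}},
    "202": {"title": "A page without pageviews data"},
}

def total_pageviews(pages):
    """Map page id -> total page views, ignoring missing daily values."""
    totals = {}
    for page_id, page in pages.items():
        daily = page.get("pageviews", {})  # date -> count (or None)
        totals[int(page_id)] = sum(v for v in daily.values() if v is not None)
    return totals

totals = total_pageviews(sample_pages)
print(totals)
```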

## HW 1 | Exercise 2.

(this is a pure Python exercise for practice; not related to web APIs)

Write a function that takes one argument `n` and prints a simple pyramid of the following form:

$n = 3$
```
  *
 ***
*****
```

$n = 5$
```
    *
   ***
  *****
 *******
*********
```

Remember that we define a function in Python like this.

```python
def add_two_numbers(x, y):
    return x + y
```

And that you can print from functions like this.

```python
def print_a_string_from_function(string):
    print(string)
```

Note that to print something you do not use the `return` statement.

HINT. You may want to use the fact that in Python strings can be easily multiplied.
For instance:

```python
'x' * 5 == 'xxxxx'
```

Note that you can do the same with an "empty" space.

```python
" " * 5 == "     "
```

HINT 2. It may be convenient to use a for loop for printing.

Example usage:

```python
print_pyramid(4)
```

Should print:
```
   *
  ***
 *****
*******
```

In [32]:
def print_pyramid(n):
    print("\n".join(' '*(n-1-i) + '*'*(2*i + 1) for i in range(n)))

print_pyramid(10)

         *
        ***
       *****
      *******
     *********
    ***********
   *************
  ***************
 *****************
*******************
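
A variant of the solution above, sketched under the assumption that it is sometimes handy to build the rows first and print them afterwards (for example, to check them before printing). The helper name `pyramid_rows` is hypothetical.

```python
def pyramid_rows(n):
    """Return the rows of the pyramid as a list of strings."""
    rows = []
    for i in range(n):
        padding = ' ' * (n - 1 - i)   # spaces that center the row
        stars = '*' * (2 * i + 1)     # 1, 3, 5, ... stars
        rows.append(padding + stars)
    return rows

def print_pyramid(n):
    for row in pyramid_rows(n):
        print(row)

print_pyramid(3)
```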
