# Python Block Course
# Session 5: Data streaming and parsing

Prof. Dr. Karsten Donnay, Stefan Scholz

Winter Term 2019 / 2020

In this fifth session we will learn how to stream and parse data. To this end, we discuss how data from the internet can be used in general, and then try it out on an example API and an example website.

## 5.1 Data Streams

In our context, a **data stream** is a **sequence** of **information**. In a true data stream, **not** all data is **available** at one point in time. In a pseudo data stream, the data is completely available but usually so **big** that it does not fit on the hard drive or into memory at once. Depending on the type of data stream, it can contain **various information**, e.g. timestamps, attributes, raw data or processed data.

This information is streamed in a broad range of applications. Below are a few **examples**.

| Application | Description |
| -------- | ------- |
| Fraud detection | detect patterns and anomalies in banking transactions with time, amount, location |
| Click analytics | profile advertisements, social media and websites by user behavior |
| News analytics | underpin trading strategies or business decisions by automated text analysis on news |
| High-frequency trading | make automated trading decisions in milliseconds based on live stock prices |


Data streams are usually obtained in one of two ways: from **websites**, where the data comes straight from a data provider without being processed, or from **APIs**, where the data has been modified, validated or cleaned by the data provider. What exactly this means and how we get access to these data streams is the topic of today's session.

## 5.2 Iterators

But before we come to data streams from the internet, we should first learn how data streams are represented in **Python**. To work with changing and large **data streams**, Python offers the `Iterator`.

In our first session, we got to know a similarly named construct called `Iterable`. However, an `Iterable` is a **data structure**, e.g. list, tuple, dictionary, whereas an `Iterator` is more like a **pointer** which points to certain data. Technically, an `Iterator` is an object that can be **iterated upon** with the help of the **methods** `__iter__()` and `__next__()`. In fact, most iterables have a **built-in iterator** to access them. But you can also create your **own iterator object** which implements the two methods above.
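As a small illustration of the built-in iterators mentioned above, the following sketch calls `iter()` on a list and steps through it with `next()`. The variable names are chosen only for this example.

```python
# a list is an iterable, not an iterator
letters = ["a", "b", "c"]

# iter() asks the list for its built-in iterator
letters_iterator = iter(letters)

# next() moves the pointer one element further each time
first = next(letters_iterator)
second = next(letters_iterator)
third = next(letters_iterator)
print(first, second, third)

# the iterator is now exhausted, so next() falls back to the given default
fourth = next(letters_iterator, None)
print(fourth)
```

The list itself is unchanged by this. Calling `iter(letters)` again would hand out a fresh iterator that starts at the first element.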

The big **advantage** of **iterators** is that with **large amounts** of **data** not all data needs to be stored at once. For example, if we want to access all **numbers** from $0$ to $10^{50}$, we could try to store all numbers in a list, but such a list would be far too **large** to fit into your **memory**. Instead we could use an **iterator** that creates one number after the other, such that only **one number** has to be **stored** at a time. This is a typical example where iterators are very helpful.

Let us create this **iterator** and return some **numbers**. 

In [None]:
class Number:
    """
    Class Number
    """
    
    def __iter__(self):
        self.next_number = 1
        return self
    
    def __next__(self):
        current_number = self.next_number
        self.next_number += 1
        return current_number

In [None]:
# iterate over numbers
for number in Number():
    # print next number
    print(number)
    
    # stop iteration
    if number >= 3:
        break

In [None]:
# define iterator for numbers
numbers = iter(Number())

# print first number
print(next(numbers))

# print second number
print(next(numbers))

# print third number
print(next(numbers))

By the way, all this time we have been working with a built-in that generates **numbers** lazily in the same way: `range()`. Strictly speaking, `range()` returns an **iterable** whose built-in iterator produces one number at a time.
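The `Number` iterator above never ends on its own, which is why the loop needed a `break`. A minimal sketch of a bounded variant is shown below. The class name `CountUpTo` and its parameter `limit` are hypothetical and only serve this example. It signals the end of the stream by raising `StopIteration`, which a `for` loop or `list()` handles automatically.

```python
class CountUpTo:
    """Iterator that yields 1, 2, ..., limit and then stops (hypothetical example)."""

    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        # start counting from one whenever iteration begins
        self.next_number = 1
        return self

    def __next__(self):
        # signal the end of the stream once the limit is passed
        if self.next_number > self.limit:
            raise StopIteration
        current_number = self.next_number
        self.next_number += 1
        return current_number


# consume the whole iterator, no break needed
counted = list(CountUpTo(3))
print(counted)

# the same numbers produced with range()
ranged = list(range(1, 4))
print(ranged)
```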

## 5.3 Exceptions

Until now you have probably stumbled across several **error messages** when you wrote Python code. In general, these error messages are divided into two categories. First, there are **syntax errors**, which indicate that at some point your code is **not valid Python**, e.g. you forgot an indent or wrote one colon too many. The interpreter checks for these syntax errors before your code is actually executed, but we do not want to go into detail here. Instead, we want to discuss the second category: problems which the interpreter encounters while it actually executes your code. These errors are called **exceptions**. By default, they are **fatal** and stop your program immediately when they occur.

The following is a list of **common exceptions**:

| Exception | Cause |
| -------- | ------- |
| AttributeError | Raised when attribute assignment or reference fails |
| ImportError | Raised when the imported module is not found |
| IndexError | Raised when index of a sequence is out of range |
| KeyError | Raised when a key is not found in a dictionary |
| KeyboardInterrupt | Raised when the user hits the interrupt key (Ctrl + C or Delete) |
| MemoryError | Raised when an operation runs out of memory |
| NameError | Raised when a variable is not found in local or global scope |
| SyntaxError | Raised by the parser when a syntax error is encountered |
| IndentationError | Raised when there is incorrect indentation |
| TypeError | Raised when a function or operation is applied to an object of incorrect type |
| ValueError | Raised when a function gets an argument of correct type but improper value |
| ZeroDivisionError | Raised when the second operand of a division or modulo operation is zero |

If you are working with **data streams**, e.g. from websites and APIs, it is advisable to take certain **errors** into **account** so that the whole program does not abort because of an **unimportant detail** in the data stream. Besides the data stream itself, there are countless other potential causes for errors.

In these cases, we wrap our code in a `try` **statement** and catch a possible **exception** with an `except` clause. You can also catch **multiple exceptions** by adding more `except` clauses underneath. If you want to have the respective **message** of the exception available in the `except` block, give it a **variable name** with `as`, as in the `with` statement. If you add a `finally` **clause** at the end of your `try` statement, the code inside it will be **executed last**, whether or not the `try` block raised an exception.

Let us **catch** some trivial **exceptions**. 

In [None]:
# error prone code
size = len(w)

In [None]:
try:
    # error prone code
    size = len(x)
except NameError as e:
    # report name error
    print("Got error: {}".format(e))

In [None]:
try: 
    # error prone code
    y = 12
    size = len(y)
except NameError as e:
    # report name error
    print("Got name error: {}".format(e))
except TypeError as e:
    # report type error
    print("Got type error: {}".format(e))

In [None]:
try: 
    # error prone code
    size = len(z)
except NameError as e:
    # report name error
    print("Got name error: {}".format(e))
except TypeError as e:
    # report type error
    print("Got type error: {}".format(e))
finally:
    # report finished block
    print("Finished try block")

Please keep in mind, however, that you should not **abuse** `try` statements to make **poor code** run, but only to deal with **unavoidable problems**. This is also the reason why we have not introduced exception handling earlier on. 
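To connect this back to data streams, here is a minimal sketch of an unavoidable problem: individual records in a stream that cannot be converted. The list `raw_amounts` is made-up data standing in for a stream, and all variable names are hypothetical. The point is that one malformed record is skipped instead of aborting the whole loop.

```python
# made-up records standing in for a data stream
raw_amounts = ["12.50", "7", "n/a", None, "3.20"]

parsed_amounts = []
skipped = 0

for raw in raw_amounts:
    try:
        # error prone code
        parsed_amounts.append(float(raw))
    except ValueError:
        # string that does not represent a number, e.g. "n/a"
        skipped += 1
    except TypeError:
        # value of the wrong type, e.g. None
        skipped += 1

# report converted records and number of skipped records
print(parsed_amounts)
print(skipped)
```

Only the two exceptions that are expected for this kind of data are caught. Any other exception would still stop the program, which is usually what you want for genuine bugs.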

## 5.4 REST APIs

For the purpose of **communication** between **clients** and **servers**, a wide range of application programming interfaces (APIs) has been developed. An API can appear in different forms, e.g. for a web system, operating system or database system. However, in today's session we want to discuss a specific type called **REST APIs**. **REST** is an acronym for **RE**presentational **S**tate **T**ransfer, and **API** stands for **A**pplication **P**rogramming **I**nterface. This form of interface defines a **uniform** and **predefined set** of **stateless operations**.

In this context, **stateless** means that every request carries **all information** needed to process it, and the server does not rely on previous requests. This helps to make the interfaces **fast**, **reliable** and **fault tolerant**.

In particular, the client sends a **request** to a URL that identifies an endpoint, together with an access token and other parameters. You can imagine this as if you access a **webpage**. But an API is much more **generic** and **abstract** and allows for many more kinds of requests. The **HTTP protocol** is used to handle the requests listed below. We will send these **requests** using the **package** `requests`. This package is very powerful and does a lot of work for us in the background. All we have to do is specify our request with a URL, headers and parameters.

The following is a list of the most common **HTTP methods** used together with **REST APIs**.

| Method | Description |
| -------- | ------- |
| GET | requests data from the server |
| POST | sends data to the server |
| PUT | changes data on the server |
| DELETE | deletes existing data on the server |

For us, when we want to stream data, the first method `GET` will be particularly important. These requests will be processed by the server and then the client will receive a **response** according to the corresponding request, usually as **HTML**, **XML** or **JSON**. This response will be processed by the **package** `requests` again, which we can use to access the underlying data. 

We will discuss the **individual steps** to work with a **REST API** in the following. As an **example**, we will use the [News API](https://newsapi.org/). This is a REST API that provides **news headlines** from 30,000 news sources worldwide. However, for copyright reasons, it does not provide the full texts of the articles but only the **link** to these articles. In principle, these steps work similarly for all REST APIs.

### Endpoints

**REST APIs** have so-called **endpoints** against which **requests** can be made. One REST API can have **multiple endpoints** which all provide **different information**. All endpoints of a single REST API share the same base URL, and only the **suffix** behind the base URL changes. For example, [News API](https://newsapi.org/) has **three endpoints** in total, which provide different kinds of information. A detailed **description** of these endpoints is available in the [documentation](https://newsapi.org/docs/endpoints).

Below you find an **overview** of the different **endpoints** of [News API](https://newsapi.org/). 

| Endpoint | Description |
| -------- | ------- |
| https://newsapi.org/v2/top-headlines | breaking news headlines |
| https://newsapi.org/v2/everything | recent news and blog articles |
| https://newsapi.org/v2/sources | available news sources |
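The relation between the base URL and the suffixes in the table above can be sketched with the standard library function `urllib.parse.urljoin()`. The variable names `base_url`, `suffixes` and `endpoints` are only chosen for this illustration.

```python
from urllib.parse import urljoin

# base URL shared by all endpoints
base_url = "https://newsapi.org/v2/"

# suffixes that distinguish the endpoints
suffixes = ["top-headlines", "everything", "sources"]

# combine base URL and suffix into the full endpoint URL
endpoints = {suffix: urljoin(base_url, suffix) for suffix in suffixes}

# print all endpoints
for name, endpoint in endpoints.items():
    print(name, endpoint)
```

Note the trailing slash in `base_url`. Without it, `urljoin()` would replace the last path segment `v2` instead of appending to it.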

### Parameters

**Parameters** are **options** you can pass to the endpoints to **restrict** which specific information you are interested in. Usually three different types of parameters are used. For all REST APIs, the **documentation** will tell you specifically which information you should pass in which parameter type.

The following list shows the different **parameter types**. 

| Type | Description |
| -------- | ------- |
| Header | in request header, e.g. authorization |
| Query string | in request string after endpoint and question mark `?`, e.g. keywords |
| Request body | in request body usually as JSON, e.g. send data |
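To make the three parameter types more tangible, the following sketch builds a request object without sending it, so that we can look at where each parameter ends up. It deliberately uses the standard library module `urllib` instead of the package `requests`, so it runs without a network connection or an API key. The endpoint `https://example.org/v1/items`, the token and the body content are hypothetical placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# hypothetical endpoint, not a real API
endpoint = "https://example.org/v1/items"

# header parameters, e.g. authorization (placeholder token)
headers = {"Authorization": "placeholder-token", "Content-Type": "application/json"}

# query string, appended to the endpoint after a question mark
query = urlencode({"q": "bitcoin", "language": "en"})

# request body, here encoded as JSON
body = json.dumps({"name": "example"}).encode("utf-8")

# build the request object, nothing is sent at this point
request = Request(endpoint + "?" + query, data=body, headers=headers, method="POST")

# inspect where the parameters ended up
print(request.full_url)
print(request.get_method())
print(request.header_items())
print(request.data)
```

The package `requests` performs the same kind of bookkeeping for us when we pass dictionaries as `params` and `headers`.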

We will later combine the various parameters with the package `requests`. This package will take care of the **appropriate handling** of the different **types** of **parameters**. We will come to this when we actually make a request. Before that we should have a look at the **parameters** of [News API](https://newsapi.org/). Specifically, we will only look at the one endpoint which provides recent news and blog articles. You can find a detailed description for all parameters and endpoints in the [documentation](https://newsapi.org/docs/endpoints/).

Most REST APIs require an **API key**, and so does [News API](https://newsapi.org/). This API key should be passed in the **headers** of your requests. The following list shows the **required parameter** in the **header** for a request to the [News API](https://newsapi.org/).

| Header | Description |
| -------- | ------- |
| X-Api-Key | authentication with API key |

While this authentication method in the header is very common and widely used, the **query parameters** differ significantly from API to API. This is also because they depend on what information is available in the API. You can find a detailed description of the query parameters for the endpoint `everything` in the [documentation](https://newsapi.org/docs/endpoints/everything). In the following list we see a few **query parameters** which can be used together with the endpoint `everything`.

| Query string | Description |
| -------- | ------- |
| q | keywords or phrases to search for in the article title and body, e.g. `bitcoin` |
| qInTitle | keywords or phrases to search for in the article title only, e.g. `bitcoin` |
| sources | comma-separated string of identifiers for the news sources or blogs you want headlines from (maximum 20), e.g. `the-new-york-times` |
| from | date and optional time for the oldest article allowed, e.g. `2019-10-17` |
| to | date and optional time for the newest article allowed, e.g. `2019-10-17` |
| language | 2-letter ISO-639-1 code of the language you want to get headlines in, e.g. `de`, `en`, `es`, `fr` |
| sortBy | order to sort the articles in, e.g. `relevancy`, `publishedAt`  |

When we have prepared all the parameters, we are ready to make a request to a REST API. 

### Requests

There are several ways to send a request to a REST API. One of them is the package `requests`. As mentioned above, this package allows you to make **HTTP requests** very **easily** and **quickly**. It provides all functions and methods to write your **parameters** into requests, send your **requests** and work with your **responses**.

Let us first **import** the **package** or install it if necessary. 

In [None]:
import requests

In our example with the [News API](https://newsapi.org/) we are interested in getting **recent news articles**. The best way to understand requests is to prepare an **exemplary request** on some news articles. 

Let us first define the **URL** with the endpoint in the variable `url`, then the **headers** in the variable `header` and the **query strings** in the variable `parameters`. Then we pass these variables into the **function** `requests.get()` to make the corresponding **request**. This request is best wrapped inside a `try` **statement**, because the request itself can fail, e.g. when the connection breaks. In addition, the method `raise_for_status()` raises an **HTTP error** `requests.exceptions.HTTPError` when the response contains an **error status**, i.e. a status code of 400 or above.

In [None]:
# define token
token = ""

# define endpoint
url = "https://newsapi.org/v2/everything"

# define header
header = {"X-Api-Key": token}

# define query strings
parameters = {"qInTitle": "bitcoins", "language": "en"}

try:
    # make request
    response = requests.get(url, params=parameters, headers=header)
    # check response
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("HTTPError: {}".format(e))
except Exception as e:
    print("Error: {}".format(e))

If the request works and throws no error, you can have a first look into the **response** with the attribute `content`. 

In [None]:
# inspect response
print(response.content)
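The status check behind `raise_for_status()` can be sketched roughly as follows. This is not the package's actual code but a simplified illustration: status codes from 400 to 599 count as errors, where 4xx codes indicate a problem with the request and 5xx codes a problem on the server. The function name `is_error_status` is hypothetical, and the phrases come from the standard library enumeration `http.HTTPStatus`.

```python
from http import HTTPStatus


def is_error_status(status_code):
    """Return True for client errors (4xx) and server errors (5xx)."""
    return 400 <= status_code < 600


# print a few common status codes with their phrase and error flag
for code in (200, 401, 404, 429, 500):
    print(code, HTTPStatus(code).phrase, is_error_status(code))
```

With an API, a 401 status typically points to a missing or invalid API key, and a 429 status to too many requests in a short time.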

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Sign up for <a href="https://newsapi.org/register">NewsAPI</a>.
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Request English articles from the NY Times on Brexit from the <a href="https://newsapi.org/">News API</a>. 
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Implement your request inside classes and methods which take all details on the request and return the response. 
</div>

### Responses

After a **request** has been sent to a **REST API**, you will receive a **response** to it. This response consists of several pieces of information, namely a **status code**, a **header** and a **body**. If you have made your request with the package `requests`, then you can easily access this information. So far we have used the **attribute** `content` to inspect the **body** without knowing what it actually is. The attribute `content` holds the body of the response sent by the server. In our case, this body is in **JSON format**, a combination of **attribute-value pairs** and **arrays**.

In Python, there are two ways to **parse** **JSON format**. Either we use the **function** `json.loads()` from the **module** `json` to load the **string** into a **dictionary**. Alternatively, we can directly call the **method** `json()` on our **response**. Both options output dictionaries which we can work with in Python.

Let us demonstrate the two different methods to **parse responses**. 

In [None]:
import json

# parse response with json
parsed_response = json.loads(response.content)

# print type of response
print(type(parsed_response))

In [None]:
# parse response with method
parsed_response = response.json()

# print type of response
print(type(parsed_response))

Even though we have already learned how to access **dictionaries** in the first session, we will show how you can **access certain information** by their keys and values. In case of our response from the <a href="https://newsapi.org/register">News API</a>, we might be interested to access the articles' titles or URLs. 

Let us demonstrate once how to parse the **articles' titles** from the **response**. 

In [None]:
# inspect keys and values
for key, value in parsed_response.items():
    print(key)
    print(value)
    print("-----")

In [None]:
# inspect article
for article in parsed_response["articles"]:
    print(article)
    break

In [None]:
# inspect title
for article in parsed_response["articles"]:
    print(article["title"])
    break

In [None]:
# inspect all titles
for article in parsed_response["articles"]:
    print(article["title"])
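Real responses are not always complete, so a key such as `title` may be missing for some articles. The following sketch shows a defensive way to collect titles with the dictionary method `get()`. The string `sample_body` is made-up placeholder data that only mimics the structure used above (a key `articles` holding a list of dictionaries), not real output of the News API.

```python
import json

# made-up response body with placeholder values, not real articles
sample_body = """
{
  "status": "ok",
  "articles": [
    {"title": "First placeholder title", "url": "https://example.org/1"},
    {"url": "https://example.org/2"},
    {"title": "Third placeholder title", "url": "https://example.org/3"}
  ]
}
"""

# parse JSON string into dictionary
sample_response = json.loads(sample_body)

# collect titles, skipping articles without a title
titles = []
for article in sample_response.get("articles", []):
    title = article.get("title")
    if title is not None:
        titles.append(title)

# inspect collected titles
print(titles)
```

In contrast to `article["title"]`, the call `article.get("title")` returns `None` instead of raising a `KeyError` when the key is missing.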

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Parse the URLs of all articles from the response and store them in a list. 
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Implement your parser inside classes and methods which take the response from requests and return the list of the articles' URLs. 
</div>

## 5.5 Webpages

Another kind of **web resource** in the **world wide web** are **webpages**. Every day we use hundreds of **websites** like [Google](https://www.google.com/), [Wikipedia](https://frr.wikipedia.org/wiki/), [Youtube](https://www.youtube.com/), [Facebook](https://www.facebook.com/), [StackOverflow](https://stackoverflow.com/), and [GitHub](https://github.com/) to get information. These **websites** usually consist of **several webpages** written in **HTML** or a comparable **markup language**. So why should we not also be able to stream data from webpages?

Yes, in general we can also stream data from webpages with Python. But in comparison to REST APIs it requires much more **effort**, **time** and **tears**, because the desired information has **not** been **collected**, **cleaned** or **structured** by a provider. Still, if there is no suitable API available or it is unreasonably expensive, it is a good idea to implement a so-called **web scraper**. Web scrapers **extract data** from **webpages** automatically, often in combination with a bot or web crawler that visits the pages. Usually these webpages are scraped repeatedly to observe changes and generate data streams.

In a first step, we want to extract data from single webpages. To extract their data we can use the **package** `requests` because the communication is based on the **HTTP protocol** again. We will explain step by step how we can make **requests** to **webpages** - which is very similar to REST APIs. As examples we will use the **webpages** behind the **news articles** from the [NY Times](https://www.nytimes.com/) which we collected in the previous section. Then we will show you how to **extract information** from these webpages. 

### URLs

**Uniform resource locators** (URLs) are **references** to all kinds of web resources. In the case of APIs, we called them endpoints. With webpages, we will not go into detail on how they include parameters in their URLs. This is mainly because websites use parameters differently and do not follow a common standard as REST APIs do. Instead we will assume that our **URLs** are already **given**.

In the following you can see the **schema** of a **URL** and **two examples**. 

```
scheme:[//authority]path[?query][#fragment]
```

```
https://www.nytimes.com/2019/10/15/world/europe/brexit-deal-boris-johnson-eu.html
https://en.wikipedia.org/wiki/Web_scraping#Techniques
```
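The components of this schema can be inspected with the standard library function `urllib.parse.urlparse()`. The following sketch splits the two example URLs from above. The variable names are only chosen for this illustration.

```python
from urllib.parse import urlparse

# the two example URLs
urls = [
    "https://www.nytimes.com/2019/10/15/world/europe/brexit-deal-boris-johnson-eu.html",
    "https://en.wikipedia.org/wiki/Web_scraping#Techniques",
]

# split each URL into its components
parsed_urls = [urlparse(url) for url in urls]

# print scheme, authority, path, query and fragment
for parsed in parsed_urls:
    print(parsed.scheme, parsed.netloc, parsed.path, parsed.query, parsed.fragment)
```

Here the attribute `netloc` corresponds to the authority in the schema. Neither of the two URLs contains a query, so `query` is an empty string for both.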

### Requests

There are several ways to send a request to a webpage. One of them is the package `requests`. As mentioned above, this package allows you to make **HTTP requests** very **easily** and **quickly**. It provides all functions and methods to write your **parameters** into requests, send your **requests** and work with your **responses**.

Let us first **import** the **package** or install it if necessary. 

In [None]:
import requests

In our example with the articles of the [NY Times](https://www.nytimes.com/) we are interested in getting the **full texts** of **recent news articles**. The best way to understand requests is to prepare an **exemplary request** on some news articles. 

Let us define some article's **URL** in the variable `url`. Then we pass this variable into the **function** `requests.get()` to make the corresponding **request**. This request is best wrapped inside a `try` **statement**, because the request itself can fail, e.g. when the connection breaks. In addition, the method `raise_for_status()` raises an **HTTP error** `requests.exceptions.HTTPError` when the response contains an **error status**, i.e. a status code of 400 or above.

In [None]:
# define article
url = "https://www.nytimes.com/2019/10/15/world/europe/brexit-deal-boris-johnson-eu.html"

try:
    # make request
    response = requests.get(url)
    # check response
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print("HTTPError: {}".format(e))
except Exception as e:
    print("Error: {}".format(e))

If the request works and throws no exception, you can have a first look into the **response** with the attribute `content`. 

In [None]:
# inspect response
print(response.content)

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Crawl the corresponding webpages for your list of URLs to articles of the <a href="https://www.nytimes.com/">NY Times</a>. 
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Implement your crawler inside classes and methods which take the list of URLs and return the list of webpages. 
</div>

### Responses

After a **request** has been sent to a **webpage**, you will receive a **response** to it. This response consists of several pieces of information, namely a **status code**, a **header** and a **body**. If you have made your request with the package `requests`, then you can easily access this information. With the **attribute** `content` on the response, we can access its **body** again, but this time it is not in JSON but in **HTML format**, a markup language with opening tags `<tag>` and closing tags `</tag>`.

What follows is a very simple **HTML document**. 

```
<!DOCTYPE html>  
<html>  
    <head>
    </head>
    <body>
        <h1> Title </h1>
        <p> Full Text </p>
    </body>
</html>
```
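To see how opening tags, closing tags and the text between them are processed, here is a minimal sketch that uses only the standard library module `html.parser`. The class `TextByTag` is a hypothetical helper written for this illustration, and `document` is a small document with the same structure as the one above. The parser calls the three `handle_...` methods whenever it encounters an opening tag, a closing tag or text.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collect the text found directly inside each tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        # remember which tag was opened last
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # forget the tag once its closing tag appears
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # store non-empty text under the innermost open tag
        text = data.strip()
        if text and self.open_tags:
            self.texts.setdefault(self.open_tags[-1], []).append(text)


document = """<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1> Title </h1>
        <p> Full Text </p>
    </body>
</html>"""

# feed the document to the parser and inspect the collected texts
parser = TextByTag()
parser.feed(document)
parser.close()
print(parser.texts)
```

Real webpages are far messier than this document, which is why a dedicated package is usually the more convenient choice.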

In Python, you can **parse HTML** formatted strings with the **package** `BeautifulSoup`, which is imported as `bs4`. The package provides idiomatic ways of **navigating**, **searching**, and **modifying HTML**. In this way, we can make use of the structure and extract certain information.

Let us first **import** the **package** or install it if necessary. 

In [None]:
from bs4 import BeautifulSoup

To convert an **HTML string** into a `BeautifulSoup` object, we have to pass the **string** and the corresponding **parser** `html.parser` into the class `BeautifulSoup`. On this object we will have various **methods** available to work with the **HTML format**. First, we will print out a structured form of the HTML with the method `prettify()`. 

Let us **parse** one **webpage** with `BeautifulSoup`. 

In [None]:
# parse html of one webpage (webpages is the list of crawled pages from the exercise above)
soup = BeautifulSoup(webpages[3], "html.parser")

# print structured html
print(soup.prettify())

Suppose you are now interested in a certain piece of **information** in your **HTML string**. You can search for this information by its **tag** and **attributes**. For this, you can use the **method** `find()`, which returns the **first tag** matching the given tag name and attributes. However, if you want to find **all tags** that match, then you should use the **method** `find_all()`. Afterwards, you can access the **actual text** behind these tags with the **method** `get_text()`.

But before we can call these methods on our `BeautifulSoup` object, we first have to **find** out under which **tags** and **attributes** our **information** is hidden. The best way to do this is to open the corresponding **webpage** and **search** for the desired information. Once you have found it, you can **right-click** on it in your browser and select `Inspect` to see all the information about the underlying tag. If this method does not work for you, then you have to look into the HTML of the webpage and find the desired information by yourself. This can be done with `BeautifulSoup` together with `prettify()` or your own browser. 

Let us try to **access** the **full text** of one article. 

In [None]:
# find tag with certain tag and class
text = soup.find("p", {"class": "css-exrw3m evys1bk0"})

# print tag
print(text)

In [None]:
# find all tags with certain tag and class
paragraphs = soup.find_all("p", {"class": "css-exrw3m evys1bk0"})

# print number of tags
print(len(paragraphs))

# append paragraphs to full text
full_text = " ".join(paragraph.get_text() for paragraph in paragraphs)

# inspect full text
print(full_text)

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Parse the full texts and headlines from your list of webpages of the <a href="https://www.nytimes.com/">NY Times</a>. 
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Implement your parser inside classes and methods which take the HTML and return the dates, titles and full texts. 
</div>