# Webscraping
The general idea behind webscraping is quite simple. Instead of copying content from a website by hand, you employ a computer program to do it. You have all already seen the scheme presented below.
![Webscraping](png/webscraping.jpg)
Although it is that simple, you need to remember that the robot (the web spider) does not see the webpage exactly as we do in a web browser; it looks only at the HTML code. We talked a bit about this during the introduction to APIs, but now we will dig deeper. Let's first click on the link below and look at the webpage behind it.

[Wikileaks](https://wikileaks.org/fishrot/)

This is how we normally see webpages but, as noted above, it is not a very useful view from the webscraping perspective. Let's now look at what lies behind the nice-looking webpage. To see the HTML source code, open the browser's developer tools by pressing `ctrl + shift + I` (or `cmd + option + I` in Safari on a Mac). What you see is the HTML code of this particular WikiLeaks article. This is what your program will see, so let's talk a bit about what is in there.

## HTML (HyperText Markup Language)

HTML is a markup language in which most of the websites you browse on the Internet are written. The HTML code describes the structure of the webpage and contains multiple elements, which are represented as tags. During the introduction to APIs we talked a bit about tags, but let us now make that more concrete. So, for the time being, let's move away from WikiLeaks and concentrate on an easy example of HTML code which creates a simple webpage (if you do not believe us, you can copy the code below into a text editor and save it with the `.html` extension).

```html
    <!DOCTYPE html>
    <html>
        <head>
            <title>
                Justyna Kowalczyk fandom
            </title>
        </head>
        <body>
            <h1>
                Why is Justyna Kowalczyk the best?
            </h1>
            <p>
                Because she is just <b>the best</b> cross-country skier in the history of the sport.
                You can learn more about her amazing achievements by visiting her Wikipedia webpage:
                https://pl.wikipedia.org/wiki/Justyna_Kowalczyk.
            </p>
        </body>
    </html>
```

### What are tags?

Tags mark the start and the end of an HTML element and are enclosed in **angle brackets**. Above we have a few tags, but the most important is the root tag ```<html>```. All other elements live inside this tag, between ```<html>``` and ```</html>```. In the example above we have the following tags:

* ```<head></head>``` element contains meta information about the document
* ```<title></title>``` element specifies a title for the document
* ```<body></body>``` element contains the visible page content
* ```<h1></h1>``` element defines a large heading
* ```<p></p>``` element defines a paragraph
* ```<b></b>``` element defines bold text

It might not be obvious at first glance, but HTML code has the structure of a tree. Hopefully the scheme below makes this clearer:

![DOM](png/dom.png)
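
The tree structure can also be inspected programmatically. The sketch below is only an illustration, assuming the `beautifulsoup4` package is installed (it is introduced properly later in this section). The page is a shortened version of the example above, and `tag_tree` is a hypothetical helper written just for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened version of the example page above
page = """<!DOCTYPE html>
<html>
    <head>
        <title>Justyna Kowalczyk fandom</title>
    </head>
    <body>
        <h1>Why is Justyna Kowalczyk the best?</h1>
        <p>Because she is just <b>the best</b> cross-country skier.</p>
    </body>
</html>"""


def tag_tree(node, depth=0):
    """Return one indented line per tag, with children listed under their parent."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes and the doctype
            lines.append("    " * depth + child.name)
            lines.extend(tag_tree(child, depth + 1))
    return lines


soup = BeautifulSoup(page, "html.parser")
tree_lines = tag_tree(soup)
print("\n".join(tree_lines))
```

Each level of indentation in the printed output corresponds to one level of nesting in the document, with `html` as the root.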

However, obvious reasons aside, this website has a major drawback: the address of Justyna Kowalczyk's Wikipedia page is not a link. So the website does not really use one of the most basic features of HTML, namely hypertext links. How can we fix it? We will use the attributes of tags.
    
### What are attributes?

Attributes provide extra information about elements and they are always specified in the opening tag of the element. They come in name-value pairs. For example, if in our website about Justyna Kowalczyk, instead of writing the plain address of her Wikipedia page, we tagged the address in the following way:
```html
    <a href="https://pl.wikipedia.org/wiki/Justyna_Kowalczyk"> https://pl.wikipedia.org/wiki/Justyna_Kowalczyk </a>
```
we would get a clickable link. In HTML there are many attributes for specific tags, but since we are only going to scrape data, not create webpages, let's focus on the useful ones:

* ```href``` specifies the URL (web address) for a link
* ```src``` specifies the URL (web address) for an image
* ```id``` specifies a unique id for an element
* ```class``` specifies a class of an element
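
For a scraper, attributes are simply values that can be read off a tag. The following is a minimal sketch, assuming the `beautifulsoup4` package is installed; the fragment is hypothetical (the image path `png/example.jpg` in particular is made up for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the link from the text plus an illustrative image
fragment = """
<a href="https://pl.wikipedia.org/wiki/Justyna_Kowalczyk">Wikipedia page</a>
<img src="png/example.jpg">
"""

soup = BeautifulSoup(fragment, "html.parser")
link = soup.find("a")
image = soup.find("img")

link_url = link["href"]    # index a tag like a dictionary to read an attribute
image_url = image["src"]
missing = link.get("id")   # .get() returns None when the attribute is absent

print(link_url)
print(image_url)
print(missing)
```

Using `.get()` rather than square brackets is the safer choice when an attribute may be absent, because indexing a missing attribute raises a `KeyError`.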

So far so good, right?

### Division tag
Let's now move back to the [Wikileaks](https://wikileaks.org/fishrot/) webpage and open its source code. At first glance, it looks as it should: there are ```<html>```, ```<head>```, and ```<body>``` tags. However, instead of getting straight into paragraphs and titles, the ```<body>``` tag contains some strange ```<div>``` tags. They serve as containers for other HTML elements and, as the name suggests, divide the HTML document into meaningful sections. On the WikiLeaks webpage there is a bunch of different ```<div>``` tags inside the ```<body>``` tag (there are two other tags there as well, but we will ignore them for the time being). In most web browsers, you can either hover over a tag to highlight the corresponding part of the webpage or hover over a part of the webpage to highlight the corresponding container. Either way, you need to know which container stores the data you are interested in, so that you can direct your robot (web spider) to navigate to this specific tag and extract the text from it.
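
To get a feel for how containers partition a page, one can list the top-level ```<div>``` tags of a document together with their `id` and `class` attributes. This is a rough sketch on a made-up page (the ids and classes below are hypothetical and are not taken from WikiLeaks), assuming `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three top-level containers
page = """
<html>
  <body>
    <div id="header">Site header</div>
    <div id="main" class="content"><p>Article text</p></div>
    <div id="footer">Footer</div>
  </body>
</html>
"""

soup = BeautifulSoup(page, "html.parser")

# recursive=False restricts the search to direct children of <body>
containers = [
    (div.get("id"), div.get("class"))
    for div in soup.body.find_all("div", recursive=False)
]

for div_id, div_classes in containers:
    print(div_id, div_classes)
```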

# CSS selectors

The core idea in webscraping is to use the fact that every website is an HTML document and every HTML document is very nicely structured (it has the form of a tree).
As a result, it is (almost) always possible to identify an element uniquely by providing a path leading to it from the root of the document. In practice, we usually only have to specify a partial path.

![HTML document as a tree](png/html-tree.jpg)

Let us assume that we want to extract the title (Google) from the simple document above. To do this, we have to specify a unique path going from the root of the document tree (the `<html>` tag) to the `<title>` tag. Technically speaking, there are many ways to do this. One of the most convenient is to use so-called CSS selectors. CSS selectors are easy to write, as they are just simple strings in which each word (separated with a space) corresponds to a tag at a given level of the tree, counting from the root. So in our case, the appropriate CSS selector is the following:

```python
"html head title"
```

Note that we do not have to use the angle brackets (`<` and `>`). In general, we can also omit some levels of the tree and write a more general selector. Such a selector can give us multiple tags if there are multiple matches. In our case we can simplify the selector to the following form:

```python
"title"
```

Why? Because there is only a single `<title>` tag in our document, so our selector in its simplified form still uniquely determines the part of the webpage we want to extract data from. However, what will a selector like this do? Will it return a single element of the webpage or multiple elements?

```python
"body div"
```
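
One way to check such questions is simply to run the selectors. The sketch below assumes `beautifulsoup4` is installed and uses a stand-in document: only the `Google` title comes from the text, while the two ```<div>``` containers in the body are an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in document; the body content is assumed for illustration
page = """
<html>
  <head><title>Google</title></head>
  <body>
    <div>First container</div>
    <div>Second container</div>
  </body>
</html>
"""

soup = BeautifulSoup(page, "html.parser")

full_path = soup.select("html head title")   # full path from the root
short_path = soup.select("title")            # simplified selector
divs = soup.select("body div")               # matches every <div> inside <body>

print(full_path[0] is short_path[0])   # both selectors find the same element
print(len(divs))                       # number of matched <div> elements
```

In this stand-in document both title selectors return a one-element list, while `"body div"` returns one entry per matching ```<div>```.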

## Advanced selectors: classes and ids

Referring to elements of a webpage only by generic tag names is usually not enough to get what we want, so we have to include additional information in our selectors. Most often we extend selectors with the classes and ids attached to HTML tags. Classes and ids are used in web development to provide fine-grained control over the aesthetics and layout of a webpage, as they make it possible to address different parts with greater precision. Similarly, in the context of webscraping they can be used to write more precise selectors. Consider the example below.

```html
<!DOCTYPE html>
<html>
  <head>
      <title>A website with CSS classes and ids</title>       
  </head>
  <body>
    <div class="outer">
      <div class="inner" id="first-one">
        Example I
      </div>
      <div class="inner active">
        Example II
      </div>
      <div class="inner">
        Last example
      </div>
    </div>
  </body>
</html>
```

The values of the `class` and `id` attributes are just labels that can be used to identify particular elements of the HTML document. A `class` attribute may hold several labels separated with spaces, while an `id` holds a single label. The most important difference between a `class` and an `id` is that the same `class` can be assigned to multiple elements (so classes are in principle used to identify groups of elements that are supposed to be in some sense equivalent), whereas any `id` can be assigned to only one element (so ids are used to identify specific, distinguished elements).
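
This distinction is also visible when the attributes are read in Python. The following is a small sketch, assuming `beautifulsoup4` is installed and using two of the ```<div>``` elements from the example above: Beautiful Soup returns `class` as a list of labels and `id` as a single string.

```python
from bs4 import BeautifulSoup

# Two of the inner <div> elements from the example above
fragment = """
<div class="inner" id="first-one">Example I</div>
<div class="inner active">Example II</div>
"""

soup = BeautifulSoup(fragment, "html.parser")
first, second = soup.find_all("div")

print(first["class"], first["id"])        # class is a list, id is a string
print(second["class"], second.get("id"))  # the second <div> has no id
```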

Classes and ids can be specified in a CSS selector by appending their names to a tag name after a dot (`.`) or a hash (`#`), respectively.

```python
"div.outer"           # selects all <div> elements with class "outer"
"div.inner.active"    # selects all <div> elements with both "inner" and "active" class

"div#first-one"        # selects the unique <div> with id "first-one"
"div#first-one.inner"  # selects the unique <div> with id "first-one" and class "inner"
```

Let us now try to extract data from particular inner `<div>` containers. In order to do this properly, we will have to use CSS selectors with classes and ids.

A selector such as

```python
"body div.inner"
```

will not do, because it matches and returns multiple (exactly three) `<div>` elements with class `"inner"`. The second `<div>`, however, is uniquely specified once we also include the `"active"` class, so we can rewrite the selector like this:

```python
"body div.inner.active"
```

Similarly, we can address the first inner `<div>` like this:

```python
"body div.inner#first-one"
```

In fact, we can even simplify this selector, because we know that HTML ids are always unique. Can you find the simplest (shortest) selector?

However, the last `<div>` is problematic, since we cannot write a simple selector that would select only it. In order to do so, we will have to use additional functionalities provided by our Python tools for webscraping.

NOTE. In general, it is possible to write a selector for this problem, but it is quite technically involved, so we will not do this here.

```python
from bs4 import BeautifulSoup    # This is the Python package that provides nice tools for parsing HTML documents

html_string = """
<!DOCTYPE html>
<html>
  <head>
      <title>A website with CSS classes and ids</title>
  </head>
  <body>
    <div class="outer">
      <div class="inner" id="first-one">
        Example I
      </div>
      <div class="inner active">
        Example II
      </div>
      <div class="inner">
        Last example
      </div>
    </div>
  </body>
</html>
"""

# Here we create an object representing our HTML document
# We will use it to easily work with the data it contains
html = BeautifulSoup(html_string, 'html.parser')
html
```

```python
## Extract the first div
html.select_one("body div.inner#first-one").text
```

```python
## It is better to strip the text in order to get rid of the whitespace
html.select_one("body div.inner#first-one").text.strip()
```

```python
## Extract the middle div
html.select_one("body div.inner.active").text.strip()
```

However, things are not as nice when we try to extract the last `<div>`, because we cannot write a simple unique selector for it.

```python
html.select_one("body div.inner")
```

Note that we did not get an error; `select_one` simply returned the first matched item. If we want to get all the matching elements, we can use `select`:

```python
html.select("body div.inner")
```

This makes it very simple to extract the element we want with plain indexing, because `select` returns a list of matched elements.

```python
html.select("body div.inner")[-1]    # the last matched element
```

```python
# And we can extract text from it in the usual way
html.select("body div.inner")[-1].text.strip()
```
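
One practical difference between the two methods is worth keeping in mind: when nothing matches, `select_one` returns `None` (so calling `.text` on the result raises an `AttributeError`), while `select` returns an empty list. The sketch below assumes `beautifulsoup4` is installed; the class `missing` and the helper `text_or_default` are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="inner">Last example</div>', "html.parser")

missing = soup.select_one("div.missing")   # no match: None
all_missing = soup.select("div.missing")   # no match: empty list


def text_or_default(document, selector, default=""):
    """Return the stripped text of the first match, or a default if nothing matches."""
    tag = document.select_one(selector)
    return tag.text.strip() if tag is not None else default


print(missing)
print(all_missing)
print(text_or_default(soup, "div.inner"))
print(text_or_default(soup, "div.missing", default="n/a"))
```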

# Example: scraping Wikileaks

The WikiLeaks page has a relatively simple structure, so it is a very good playground for us to try to really understand webscraping in Python.

```python
import requests
from bs4 import BeautifulSoup
```

First, we have to use the `requests` package to download the actual HTML document representing the website we are interested in.

```python
response = requests.get("https://wikileaks.org/fishrot/")
html = BeautifulSoup(response.content, 'html.parser')
```
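
Downloads can fail (for example when the server returns an error page), so it may be worth wrapping the two steps above in a small function that checks the response before parsing it. This is only a sketch: `fetch_soup` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it, raising an error on a failed request."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")


# Usage (requires network access):
# html = fetch_soup("https://wikileaks.org/fishrot/")
```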

Now we need to examine the structure of the website and figure out what kind of selector we need. Once we have it, we can use it to extract the data.

```python
for p in html.select('div.content div.release div.leak-content p'):
    text = p.text.strip()    # strip whitespace so that blank paragraphs are skipped
    if text:
        print(text)
```
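
Printing is fine for a quick look, but usually we want to keep the extracted text for later processing. Because the live page may change, the sketch below runs the same selector on a small stand-in document whose structure merely imitates the selector path used above; the paragraph texts are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in document imitating the container structure assumed by the selector
page = """
<div class="content">
  <div class="release">
    <div class="leak-content">
      <p>First paragraph.</p>
      <p>   </p>
      <p>Second paragraph.</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(page, "html.parser")

paragraphs = []
for p in soup.select("div.content div.release div.leak-content p"):
    text = p.get_text(strip=True)   # text of the paragraph without surrounding whitespace
    if text:                        # skip paragraphs that contain only whitespace
        paragraphs.append(text)

print(paragraphs)
```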

# Exercise I.

Extract the titles of articles from the first page of Google Scholar results for the query "computational social science". We have already written the request call for you. Build a list of all the titles and print them.

```python
response = requests.get("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=computational+social+science&btnG=")
```