# Webscraping
The general idea behind webscraping is quite simple. Instead of copying content from a website by hand, you employ a computer program to do it for you. You have all already seen the scheme presented below.
![Webscraping](png/webscraping.jpg)
Simple as it is, you need to remember that the robot (the web spider) does not see the webpage the way we do in a web browser; it looks only at the HTML code. We talked a bit about this during the introduction to APIs, but now we will dig deeper. First, click on the link below and look at the webpage behind it.

[Wikileaks](https://wikileaks.org/fishrot/)

This is how we normally see webpages but, as we wrote above, it is not a very useful view from the webscraping perspective. Let's now look at what is behind the nice-looking webpage. To see the HTML source code, press `ctrl + U` (or, in Safari on Mac, `cmd + option + U`). What you see is the HTML code of this particular WikiLeaks article, and this is what your program will see. So let's talk a bit about what is in there.

## HTML (HyperText Markup Language)

HTML is the markup language in which most of the websites you browse on the Internet are written. The HTML code describes the structure of a webpage and consists of multiple elements, which are marked up with tags. During the introduction to APIs we talked a bit about tags; let us now make that more concrete. For the time being, let's move away from WikiLeaks and concentrate on a simple example of HTML code that creates a basic webpage (if you do not believe us, copy the code below into Notepad and save it with the `.html` extension).

```html
    <!DOCTYPE html>
    <html>
        <head>
            <title>
                Justyna Kowalczyk fandom
            </title>
        </head>
        <body>
            <h1>
                Why Justyna Kowalczyk is the best?
            </h1>
            <p>
                Because she is just <b>the best</b> cross-country skier in the history of the sport.
                You can learn more about her amazing achievements by visiting her Wikipedia webpage:
                https://pl.wikipedia.org/wiki/Justyna_Kowalczyk.
            </p>
        </body>
    </html>
```

### What are tags?

Tags mark where an HTML element starts and ends, and they are enclosed in **angle brackets**. The example above contains a few tags, the most important of which is the root tag ```<html>```. All other elements live inside it, between ```<html>``` and ```</html>```. The example above uses the following tags:

* ```<head></head>``` element contains meta information about the document
* ```<title></title>``` element specifies a title for the document
* ```<body></body>``` element contains the visible page content
* ```<h1></h1>``` element defines a large heading
* ```<p></p>``` element defines a paragraph
* ```<b></b>``` element defines bold text

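To see these tags from the program's side, here is a minimal sketch that parses a shortened copy of the example page with BeautifulSoup and reads a few elements by tag name. The variable names (`page`, `soup`, `tag_names`) are ours and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# A shortened copy of the example page, stored as a plain string.
page = """<!DOCTYPE html>
<html>
    <head>
        <title>Justyna Kowalczyk fandom</title>
    </head>
    <body>
        <h1>Why Justyna Kowalczyk is the best?</h1>
        <p>Because she is just <b>the best</b> skier.</p>
    </body>
</html>"""

soup = BeautifulSoup(page, "html.parser")

# Elements can be reached by their tag name.
title_text = soup.title.get_text(strip=True)
heading_text = soup.h1.get_text(strip=True)
bold_text = soup.b.get_text(strip=True)

# find_all(True) returns every tag in the document, in order of appearance.
tag_names = [tag.name for tag in soup.find_all(True)]

print(title_text)
print(heading_text)
print(bold_text)
print(tag_names)
```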
It might not be clear at first glance, but HTML code has a tree structure. The scheme below should make this clearer:

![DOM](png/dom.png)

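The same tree can be printed as text. The sketch below uses Python's built-in `html.parser` module and indents each tag by its depth in the tree. The `TreePrinter` class is a hypothetical helper written for this illustration, and it assumes that every opening tag has a matching closing tag (elements such as `<br>`, which have no closing tag, would throw off the indentation).

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collects one indented line per opening tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how deep we currently are in the tree
        self.lines = []  # one entry per opening tag

    def handle_starttag(self, tag, attrs):
        self.lines.append("    " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


page = (
    "<html><head><title>Justyna Kowalczyk fandom</title></head>"
    "<body><h1>Why Justyna Kowalczyk is the best?</h1>"
    "<p>Because she is just <b>the best</b>.</p></body></html>"
)

printer = TreePrinter()
printer.feed(page)
printer.close()

# Children appear indented below their parent.
print("\n".join(printer.lines))
```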
However, apart from the obvious shortcomings, this website has a major drawback: the address of Justyna Kowalczyk's Wikipedia page is not a link. So the website does not use one of the most basic features of HTML, namely hyperlinks. How do we fix it? We will use tag attributes.
    
### What are attributes?

Attributes provide extra information about elements and are always specified in the opening tag of the element. They come in name-value pairs. For example, on our website about Justyna Kowalczyk, instead of just writing out the address of her Wikipedia page, we could tag the address in the following way:
```html
    <a href="https://pl.wikipedia.org/wiki/Justyna_Kowalczyk">https://pl.wikipedia.org/wiki/Justyna_Kowalczyk</a>
```
We would then get a clickable link. HTML defines many attributes for specific tags, but since we are only going to scrape data, not create webpages, let's focus on the most useful ones:

* ```href``` specifies the URL (web address) for a link
* ```src``` specifies the URL (web address) for an image
* ```id``` specifies a unique id for an element
* ```class``` specifies a class of an element

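As a rough illustration of how a scraper reads these attributes, the sketch below parses a small made-up snippet with BeautifulSoup. Only the Wikipedia address comes from the text above; the `id`, `class`, and `src` values are invented for the example. Note that BeautifulSoup returns `class` as a list, because an element can belong to several classes.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the id, class, and src values are hypothetical.
snippet = """
<p id="intro" class="fan-text">
    Learn more:
    <a href="https://pl.wikipedia.org/wiki/Justyna_Kowalczyk">her Wikipedia page</a>
    <img src="png/skier.png">
</p>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Attributes are read from a tag like values from a dictionary.
link_target = soup.a["href"]
image_source = soup.img["src"]
paragraph_id = soup.p["id"]
paragraph_classes = soup.p["class"]  # a list, since class can hold many values

# .get() returns None instead of raising an error when an attribute is missing.
missing = soup.a.get("title")

print(link_target)
print(image_source)
print(paragraph_id, paragraph_classes, missing)
```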
So far so good, right?

### Division tag
Let's now move back to the [Wikileaks](https://wikileaks.org/fishrot/) webpage and open its source code. At first glance, it looks as it should: there are ```<html>```, ```<head>```, and ```<body>``` tags. However, instead of getting straight to headings and paragraphs, the ```<body>``` tag contains some strange ```<div>``` tags. They serve as containers for other HTML elements and, as the name suggests, divide the HTML document into meaningful sections. On the WikiLeaks webpage there are a number of different ```<div>``` tags inside the ```<body>``` tag (there are two other tags there as well, but we will ignore them for the time being). In the developer tools of most web browsers, you can either hover over a tag to highlight the corresponding part of the webpage, or hover over a part of the webpage to highlight its container. Either way, you need to know which container holds the data you are interested in, so that you can direct your robot (web spider) to navigate to that specific tag and extract the text from it.

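Before turning to the real page, here is a small sketch of the idea on a made-up document. The class names (`menu`, `content`, `article`, `footer`) are invented for this example and are not taken from WikiLeaks. The CSS selector passed to `select` walks down the containers, so only the paragraphs inside the chosen ```<div>``` are returned.

```python
from bs4 import BeautifulSoup

# Made-up page with three containers; the class names are hypothetical.
page = """
<body>
    <div class="menu"><p>Home</p></div>
    <div class="content">
        <div class="article">
            <p>First paragraph of the article.</p>
            <p>Second paragraph of the article.</p>
        </div>
    </div>
    <div class="footer"><p>Contact</p></div>
</body>
"""

soup = BeautifulSoup(page, "html.parser")

# "div.content div.article p" means: every <p> inside a <div class="article">
# that itself sits inside a <div class="content">.
paragraphs = [p.get_text(strip=True) for p in soup.select("div.content div.article p")]

for text in paragraphs:
    print(text)
```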
```python
import re
import requests
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
```

```python
# Download the page and parse its HTML into a searchable tree.
response = requests.get("https://wikileaks.org/fishrot/")
html = BeautifulSoup(response.content, 'html.parser')
```

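```python
# Walk down the nested <div> containers to the paragraphs of the article
# and print every non-empty one.
for p in html.select('div.content div.release div.leak-content p'):
    text = p.text
    if text:
        print(text)
```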

```python
# The same two steps for a Google Scholar search results page.
response = requests.get("https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=computational+social+science&btnG=")
html = BeautifulSoup(response.content, 'html.parser')
```

```python
# Print the title link of every search result on the page.
item = None
for item in html.select('div#gs_res_ccl_mid div.gs_ri .gs_rt a'):
    text = item.text
    if text:
        print(text)
```
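The requests above assume that every download succeeds. The imports include `RequestException`, which the cells shown here do not use; one plausible use is sketched below. The `fetch_html` helper and the 10-second timeout are our own assumptions, not part of the original notebook. The function returns `None` instead of crashing when the request fails or the server answers with an error status.

```python
import requests
from requests.exceptions import RequestException


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    Hypothetical helper; the timeout value is an arbitrary choice.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn HTTP error codes (4xx, 5xx) into exceptions as well.
        response.raise_for_status()
        return response.content
    except RequestException as error:
        print(f"Request to {url} failed: {error}")
        return None


# An address without "http://" or "https://" is rejected by requests
# before anything is sent, so this call fails without using the network.
result = fetch_html("not-a-valid-url")
print(result)
```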