In [2]:
from bs4 import BeautifulSoup as bs
import requests

[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) & [requests documentation](http://docs.python-requests.org/en/master/)

Fetch the page and store the response in the variable `r`. The HTML itself is then available as `r.text`.

In [3]:
r = requests.get("http://www.eib.org/infocentre/register/all/index.htm")

Create a `soup`: this prepares the HTML to be parsed by BeautifulSoup.

In [4]:
soup = bs(r.text, "html.parser")  # name the parser explicitly to avoid a bs4 warning

## BS basics

Here are some basic operations you will use when scraping with BeautifulSoup.
`.title` will give you the title of the page.

In [5]:
soup.title

<title lang="en">Basic search</title>

However, we are often not really interested in the tags (`<title>`), but in the text itself. Easy enough! Just add `.text` to the end of the expression. This works on any tag: `.text` returns the text without the markup.

In [6]:
soup.title.text

u'Basic search'

How about the text of the first paragraph on the page? For more examples [see the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [8]:
soup.p.text

u'Skip to navigation'
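The basics above can be tried offline as well. The following is a minimal sketch, assuming Python 3 and a made-up miniature page (the `html` string is hypothetical and only mimics the title and first paragraph we saw on the EIB page).

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, standing in for the real EIB search page.
html = """
<html>
  <head><title lang="en">Basic search</title></head>
  <body>
    <p>Skip to navigation</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# Naming the parser explicitly keeps the result the same on every machine.
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title         # the whole <title> tag, markup included
title_text = soup.title.text   # only the text inside the tag
first_paragraph = soup.p.text  # soup.p is the first <p> in the document

print(title_tag)
print(title_text)
print(first_paragraph)
```

Note that `soup.p` only ever gives the first paragraph; to get all of them you would use `soup.find_all("p")`.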

In our EIB example, we want to extract links from a table. We need to specify which table in order to get the right links (and not just any links on the page). Right-clicking on the page and choosing `inspect element` shows us the HTML structure. We see that the table we are interested in has `class="datatable"`. This is how we tell BeautifulSoup to find that table.

In [9]:
table = soup.find("table", {"class" :"datatable"})
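To see why naming the class matters, here is a small sketch on a hypothetical page with two tables (the `html` string and the `layout` class are invented for illustration). It also shows the `class_=` keyword, which is an equivalent spelling of the dictionary form used above, and what `find` returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second is the one we want.
html = """
<table class="layout"><tr><td><a href="/menu.htm">Menu</a></td></tr></table>
<table class="datatable">
  <tr><td><a href="/infocentre/register/all/66903378.htm">Document</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways of asking for the table with class "datatable".
table = soup.find("table", {"class": "datatable"})
same_table = soup.find("table", class_="datatable")

# find() returns None when no tag matches, so check before using the result.
missing = soup.find("table", class_="no-such-class")

# Only links inside the chosen table are collected; the menu link is ignored.
links_in_table = [a.get("href") for a in table.find_all("a")]
print(links_in_table)
```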

## Scraping PDFs

Before extracting the links, let's create a variable that will "hold" them. Brackets `[]` indicate a list. [Click here for more information on what a list is](http://www.tutorialspoint.com/python/python_lists.htm).

In [10]:
pdfs = []

All set! Let's find all the links in the table with `table.find_all('a')`. Do you remember what the HTML structure of a link looks like?

`<a href="http://herebethelink.com/this-you-dont-see">This is what you see</a>`

We extract the `href` attribute from each `a` with `link.get('href')`. Here `link` is each `a` tag BeautifulSoup encounters in `table.find_all('a')`. We then replace `htm` with `pdf`, because we know that is the link to the documents we are after. Finally, we add each link to the list we created with `.append`. See more about [list methods](###) in the documentation.

In [11]:
for link in table.find_all('a'):
    pdfs.append(link.get('href').replace("htm", "pdf"))
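The loop above assumes every `a` tag has an `href` and that `htm` appears only as the file extension. A slightly more defensive variation is sketched below, on a hypothetical table (the `html` string is invented). Note that it behaves differently from the loop above: it skips links that do not end in `.htm` instead of keeping them.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the middle <a> has no href attribute at all.
html = """
<table class="datatable">
  <tr><td><a href="/infocentre/register/all/66903378.htm">Doc A</a></td></tr>
  <tr><td><a name="anchor-without-href">No link here</a></td></tr>
  <tr><td><a href="/infocentre/register/all/66883999.htm">Doc B</a></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", class_="datatable")

pdfs = []
for link in table.find_all("a"):
    href = link.get("href")  # None when the tag has no href attribute
    if href and href.endswith(".htm"):
        # Swap only the trailing extension, not every "htm" in the path.
        pdfs.append(href[:-len(".htm")] + ".pdf")

print(pdfs)
```

Without the `if href` check, calling `.replace` on the missing `href` would fail, because `None` has no `replace` method.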

Let's have a peek at the first 10 items of the list we have created. Read more about [slicing lists](http://www.tutorialspoint.com/python/python_lists.htm).

In [15]:
pdfs[:10]

['/infocentre/register/help/result-page/index',
 '/infocentre/register/all/66903378.pdf',
 '/infocentre/register/all/66903378.pdf',
 '/infocentre/register/all/66883999.pdf',
 '/infocentre/register/all/66883999.pdf',
 '/infocentre/register/all/66889675.pdf',
 '/infocentre/register/all/66889675.pdf',
 '/infocentre/register/all/66882234.pdf',
 '/infocentre/register/all/66882234.pdf',
 '/infocentre/register/all/66900854.pdf']

Oops, some of the links are in there twice. They were also duplicated in the table. That is fine: we just pass the list to `set()`, which keeps one copy of each item and drops the duplicates. Try to do **that** in Excel!

In [13]:
pdfs_clean = set(pdfs)
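One thing to keep in mind is that a set does not remember the order of the original list. If the order matters, a common alternative is `dict.fromkeys`, sketched below on a shortened copy of the list from above (this assumes Python 3.7 or newer, where dictionaries keep insertion order).

```python
# A shortened copy of the scraped list, duplicates included.
pdfs = [
    "/infocentre/register/all/66903378.pdf",
    "/infocentre/register/all/66903378.pdf",
    "/infocentre/register/all/66883999.pdf",
    "/infocentre/register/all/66883999.pdf",
    "/infocentre/register/all/66889675.pdf",
]

# set() removes duplicates but gives no guarantee about order.
unique_unordered = set(pdfs)

# dict.fromkeys() also removes duplicates and keeps first-seen order.
unique_ordered = list(dict.fromkeys(pdfs))

print(len(pdfs), len(unique_unordered))
print(unique_ordered)
```

For downloading files the order rarely matters, so `set()` is perfectly fine here.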

Let's check the first 20 links we have cleaned. A set has no order and cannot be sliced directly, so we sort it into a list first.

In [None]:
sorted(pdfs_clean)[:20]

Looks good! Let's scrape now! There is a lot happening in this bit, so let's go through it piece by piece.

`with open(*something*) as *something*:` is a special statement we use when opening files. Everything indented underneath it happens while the file is still open. As soon as we leave the indented block, the file closes automatically. The `as` part gives the open file a name, much like above when we gave BeautifulSoup the shorter name `bs`.

`open()` opens a file in Python. `"w"` indicates the file is opened for writing; adding `b` (`"wb"`) opens it for writing in binary mode, which non-text data such as a PDF needs. See more about [opening, reading and writing files](###).
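To convince yourself that `with` really closes the file for you, here is a tiny sketch that writes to a throwaway file in a temporary folder (the file name `example.txt` is arbitrary).

```python
import os
import tempfile

# A throwaway file in a fresh temporary folder.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as output:
    output.write("hello")
    open_inside = not output.closed  # True: the file is open inside the block

closed_after = output.closed  # True: leaving the block closed the file

print(open_inside, closed_after)
```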

`.split()` splits a string (text) into a list, using another piece of text as the separator. In our case, we split the URL on `/`. This gives us a list, e.g. ### .

`[-1]` is indexing (a close relative of the slicing we saw above), counting from the end. Instead of giving us the first element from the front, it grabs the last one, which we make our filename.
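Here is how `.split()` and `[-1]` work together on one of the links from our list:

```python
pdf = "/infocentre/register/all/66903378.pdf"

parts = pdf.split("/")  # cut the string at every "/"
filename = parts[-1]    # the last piece is the file name

print(parts)     # ['', 'infocentre', 'register', 'all', '66903378.pdf']
print(filename)  # 66903378.pdf
```

The empty string at the start of the list is there because the link begins with `/`, so there is nothing in front of the first separator.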

`write()` writes to a file.

`requests.get(*something*).content` gets the content of a non-text response as raw bytes (remember that a webpage, i.e. HTML, is a text response; a PDF is not).

In [17]:
for pdf in pdfs_clean:
    # "wb" = write in binary mode, because a PDF is bytes, not text
    with open(pdf.split("/")[-1], "wb") as output:
        output.write(requests.get("http://www.eib.org"+pdf).content)
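If you want to try the saving step without hitting the EIB server, the sketch below does the same thing for a single link, offline. It assumes Python 3. The `content` bytes are a made-up stand-in for `requests.get(url).content`, and the file goes into a temporary folder. It also uses `urljoin` from the standard library as an alternative to gluing the two parts of the URL together with `+`.

```python
import os
import tempfile
from urllib.parse import urljoin

BASE_URL = "http://www.eib.org"
pdf = "/infocentre/register/all/66903378.pdf"

url = urljoin(BASE_URL, pdf)   # full address of the document
filename = pdf.split("/")[-1]  # last piece of the path

# Stand-in for requests.get(url).content, so the sketch runs offline.
content = b"%PDF-1.4\n% placeholder bytes, not a real document\n"

path = os.path.join(tempfile.mkdtemp(), filename)
with open(path, "wb") as output:  # "wb": write bytes, not text
    output.write(content)

# Read the file back in binary mode to check that nothing was altered.
with open(path, "rb") as saved:
    saved_bytes = saved.read()

print(url)
print(saved_bytes == content)
```

In a real run you would replace the placeholder bytes with the response content and probably check that the request succeeded before writing anything to disk.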

All set! Keep it up, scrape, read about how to do it, use [StackOverflow](http://stackoverflow.com/), try it out and make many mistakes in order to learn more. Happy scraping!