# Scraping multiple pages

For this scraping exercise we'll use three libraries: `requests`, `BeautifulSoup` and `pandas`.
- `requests`: downloads a web page
- `BeautifulSoup`: parses the HTML
- `pandas`: for data analysis and transformation.
Let's import them.

- Go to [Regulation No 31: laying down the Staff Regulations of Officials and the Conditions of Employment of Other Servants](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20230101).
- Look at the left-hand side panel: it links to the older versions of this regulation. Each version contains the same table of destinations, hotel ceilings and daily allowances. Let's collect all this data into one table.

## How:
- grab the links for the older regulations from the left-hand side panel
- write a `for` loop over those links that changes the URL to fetch the regulation of each previous year
- reuse the script we wrote in the previous class to scrape all the pages, and put it into a function
- add a `year` column to the data so you know which year the allowances come from

 

### There are multiple ways to find elements in a page

```python
soup.find('table') # returns the first occurrence of the element `table` on the page
soup.find(string='text') # returns the first text node that is exactly 'text'; partial matches are not returned
soup.find_all('div') # returns a list with all occurrences of the element `div`
```
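To see the exact-match behaviour of `string=` in practice, here is a small sketch on a made-up HTML snippet. The snippet and the choice of `html.parser` are assumptions for illustration, not the EUR-Lex page. Passing a compiled regular expression instead of a plain string is one way to allow partial matches.

```python
import re
from bs4 import BeautifulSoup as bs

# Made-up HTML for illustration only
html = """
<table>
  <tr><th>Destination</th><th>Daily allowance</th></tr>
  <tr><td>Belgium</td><td>102</td></tr>
</table>
<div>first</div><div>second</div>
"""
soup = bs(html, "html.parser")

first_table = soup.find("table")              # a single tag, or None if there is none
all_divs = soup.find_all("div")               # a list-like collection of tags
exact = soup.find(string="Daily allowance")   # matches the full text only
partial_miss = soup.find(string="Daily")      # None: no text node equals "Daily"
partial_hit = soup.find(string=re.compile("Daily"))  # a regex is searched inside the text

print(first_table.name, len(all_divs), exact, partial_miss, partial_hit)
```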

### You can also use CSS selectors
`soup.select()` always returns a list of all matches for the selector, even if there is only one.

```python
soup.select('a') # all 'a' elements
soup.select('.classname') # all elements with the class `classname`
soup.select(".consLegNav a") # all 'a' elements inside an element with the class `consLegNav`
```

```python
soup.a.get("href") # the value of the `href` attribute of the first 'a' element
```
You will most often combine them.

![html](img/css_selectors.png)

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

In [3]:
url = "https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20100101"
r = requests.get(url)
soup = bs(r.text, "html.parser")  # name the parser explicitly to avoid a warning

In [12]:
len(soup.select(".consLegNav"))

1

In [18]:
["hello"][0]

'hello'

In [33]:
# create an empty list to hold the dates
dates_list = []

for a in soup.select(".consLegNav a"):
    # equivalent to: soup.select(".consLegNav")[0].find_all('a')

    # this is the whole tag
    print("1." + str(a))

    # extract just the 'href' attribute from the tag
    link = a.get("href")
    print("2." + link)

    # split the string into a list on each dash: https://www.w3schools.com/python/ref_string_split.asp
    date = link.split("-")
    print("3." + str(date))

    # the date is the third item of the list (index 2)
    print("4." + date[2])

    # append the date to the list
    dates_list.append(date[2])

    # stop after the first link so we only print one example
    break

1.<a class="" data-celex="01962R0031-20230101" href="./../../../legal-content/EN/AUTO/?uri=CELEX:01962R0031-20230101" title="" xmlns="http://www.w3.org/1999/xhtml">01/01/2023</a>
2../../../../legal-content/EN/AUTO/?uri=CELEX:01962R0031-20230101
3.['./../../../legal', 'content/EN/AUTO/?uri=CELEX:01962R0031', '20230101']
4.20230101
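
Index `2` works here only because the link happens to contain exactly one other dash (in `legal-content`). A slightly more robust sketch splits once from the right, so the date is always the last piece. `date_from_href` is a hypothetical helper name, not part of the notebook.

```python
def date_from_href(href):
    """Return the text after the last dash in an href (here, the version date)."""
    # rsplit with maxsplit=1 splits only on the final dash
    return href.rsplit("-", 1)[-1]


# href copied from the output above
href = "./../../../legal-content/EN/AUTO/?uri=CELEX:01962R0031-20230101"
print(date_from_href(href))  # 20230101
```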


In [49]:
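# despite the name, url_list holds the date part of each link, not full URLs
url_list = []

# [:-1] leaves out the last link in the panel
for a in soup.select(".consLegNav a")[:-1]:
    url_list.append(a.get("href").split("-")[2])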
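
The `href` values in the panel are relative paths. If you would rather follow the links themselves than rebuild each URL from a date, `urllib.parse.urljoin` from the standard library can resolve them against the URL of the page they came from. This is only a sketch of that alternative. Note that the resolved link keeps the `AUTO` path from the `href` rather than `TXT`, and whether both serve the same content is not checked here.

```python
from urllib.parse import urljoin

# URL of the page the link was found on, and one href copied from the panel
page_url = "https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20100101"
href = "./../../../legal-content/EN/AUTO/?uri=CELEX:01962R0031-20230101"

# urljoin resolves the "." and ".." segments relative to page_url
absolute_url = urljoin(page_url, href)
print(absolute_url)
```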

In [67]:
table_data = []
base_url = "https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-"

for date in url_list:

    # build the URL for this version of the regulation and download it
    url = base_url + date
    print(url)
    r = requests.get(url)
    soup = bs(r.text, "html.parser")

    # find the header cell, then climb four levels up to the enclosing table
    table = soup.find(string="Daily allowance").parent.parent.parent.parent

    # skip the header row
    for row in table.find_all("tr")[1:]:

        elements = row.find_all('td')
        data = {}

        data["Destination"] = elements[0].text.strip()

        # on these dates the two value columns appear in the opposite order
        if date in ["20040501", "20050101", "20060101"]:
            data["Hotel ceiling"] = elements[2].text.strip()
            data["Daily allowance"] = elements[1].text.strip()
        else:
            data["Hotel ceiling"] = elements[1].text.strip()
            data["Daily allowance"] = elements[2].text.strip()
        data["Date"] = date

        table_data.append(data)

https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20230101
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20220701
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20220101
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20210101
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20200101
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20190101
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20180101
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20170101
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20160910
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20160101
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20140701
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A01962R0031-20140501
https://eur-lex.europa.eu/legal-content/
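
The loop above assumes that every page contains a text node that is exactly "Daily allowance" and that the table sits four levels above it. If either assumption fails, `soup.find(...)` returns `None` and the `.parent` chain raises an error. Below is a minimal sketch of the same parsing step wrapped in a function, as the plan suggests, using `find_parent("table")` to walk up to the nearest enclosing table. The function name `parse_allowance_table`, its parameters and the sample HTML are hypothetical, and on a real page `find_parent` may select a different ancestor than the four-level climb if tables are nested.

```python
from bs4 import BeautifulSoup as bs


def parse_allowance_table(html, date, swapped_dates=("20040501", "20050101", "20060101")):
    """Return one dict per table row, or an empty list if the table is missing."""
    soup = bs(html, "html.parser")
    header_cell = soup.find(string="Daily allowance")
    if header_cell is None:
        return []  # no matching header text on this page
    # find_parent walks up to the nearest enclosing <table>
    table = header_cell.find_parent("table")
    if table is None:
        return []
    # the two value columns are in the opposite order on the swapped dates
    hotel_idx, daily_idx = (2, 1) if date in swapped_dates else (1, 2)
    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # skip rows that do not have all three columns
        rows.append({
            "Destination": cells[0].text.strip(),
            "Hotel ceiling": cells[hotel_idx].text.strip(),
            "Daily allowance": cells[daily_idx].text.strip(),
            "Date": date,
        })
    return rows


# Made-up HTML standing in for a downloaded page
sample_html = """
<table>
  <tr><td>Destination</td><td>Hotel ceiling</td><td>Daily allowance</td></tr>
  <tr><td>Belgium</td><td>148</td><td>102</td></tr>
</table>
"""
print(parse_allowance_table(sample_html, "20230101"))
```

Inside the loop you could then call `table_data.extend(parse_allowance_table(r.text, date))` and inspect any date that comes back empty.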

In [66]:
pd.DataFrame(table_data).to_csv("regulation_data.csv", index=False)

| | Destination | Hotel ceiling | Daily allowance | Date |
|---|---|---|---|---|
| 0 | Belgium | 148 | 102 | 20230101 |
| 1 | Bulgaria | 135 | 57 | 20230101 |
| 2 | Czech Republic | 124 | 70 | 20230101 |
| 3 | Denmark | 173 | 124 | 20230101 |
| 4 | Germany | 128 | 97 | 20230101 |
| ... | ... | ... | ... | ... |
| 631 | Slovenia | 11000 | 6000 | 20040501 |
| 632 | Slovakia | 12500 | 5000 | 20040501 |
| 633 | Finland | 14098 | 9234 | 20040501 |
| 634 | Sweden | 14177 | 9291 | 20040501 |
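
The plan at the top asks for a `year` column, while the rows collected so far only carry the full `Date` string. One possible way to derive the year before building the DataFrame is sketched below. The column name `Year` is an assumption, and the two sample rows are copied from the output above so the snippet runs on its own.

```python
from datetime import datetime

# Two rows in the shape produced by the scraping loop
table_data = [
    {"Destination": "Belgium", "Hotel ceiling": "148", "Daily allowance": "102", "Date": "20230101"},
    {"Destination": "Slovenia", "Hotel ceiling": "11000", "Daily allowance": "6000", "Date": "20040501"},
]

for row in table_data:
    # strptime also checks that the string is a valid YYYYMMDD date
    row["Year"] = datetime.strptime(row["Date"], "%Y%m%d").year

print([row["Year"] for row in table_data])  # [2023, 2004]
```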
