# Scraping files from the web

For this scraping exercise we'll use only two libraries: `pandas` and `requests`.
- `pandas`: for data analysis and transformation.
- `requests`: for sending HTTP requests to download pages and files.

Let's import them.

In [33]:
import pandas as pd
import requests

## Your best friend: the Inspector

![img](img/inspector.png)

## Exercise 1: CSV

Go to https://integritywatch.eu/organizations.php and get the table out.
- Open the Inspector
- Go to the `Network` tab & reload the page

![img](img/inspector_network.png)

- Use the pandas [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [`.to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) functions to load the file and save a local copy.

In [40]:
csv_url = "https://integritywatch.eu/data/lobbyists/organizations_new.csv"
df = pd.read_csv(csv_url)
#requests.get(csv_url).text
df.to_csv("organizations_new.csv")
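
The commented-out `requests.get(csv_url).text` line hints at a second route: download the raw text yourself and hand it to pandas. Below is a minimal sketch of that idea, assuming the text is ordinary comma-separated data. `frame_from_csv_text` is a hypothetical helper name, and `sample_text` is made-up stand-in data so the snippet runs without a network connection.

```python
import io

import pandas as pd


def frame_from_csv_text(text: str) -> pd.DataFrame:
    """Parse CSV text that is already in memory into a DataFrame."""
    # pd.read_csv accepts file-like objects, so wrap the string in StringIO.
    return pd.read_csv(io.StringIO(text))


# Made-up stand-in for requests.get(csv_url).text, so this runs offline.
sample_text = "name,meetings\nOrg A,3\nOrg B,5\n"
sample_df = frame_from_csv_text(sample_text)

# index=False leaves out the row numbers, so the output has no extra unnamed column.
# Without a path argument, to_csv returns the CSV as a string.
csv_out = sample_df.to_csv(index=False)
```

Passing `index=False` is optional; without it, pandas writes the row index as a first column with an empty header.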

## Exercise 2: JSON

Go to https://integritywatch.eu/ecmeetings.php and get the table out.
- Find the underlying file with the Inspector's `Network` tab. (*hint: it's a `.json` file*)
- Use the pandas [`read_json`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) and [`.to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) functions.

In [41]:
json_url = "https://integritywatch.eu/data/ecmeetings/ecmeetings.json"
df2 = pd.read_json(json_url)
df2

Unnamed: 0,Cat,Cat2,Host,Org,OrgId,cabinet,date,location,portfolio,subject,type,entities,dgacronym,dgname
0,Trade and business associations,Promotes their own interests or the collective...,[Szabolcs Horvath (Cabinet member)],European Club Association,925747033224-18,Cabinet of Commissioner Tibor Navracsics,2019-12-11,Brussels,"[Education, Culture, Youth and Sport]",new Commissioner and Cabinet,rep,,,
1,Trade unions and professional associations,Promotes their own interests or the collective...,[Zaneta Vegnere (Cabinet member)],spiritsEUROPE,64926487056-58,Cabinet of Executive Vice-President Valdis Dom...,2023-11-08,webex,[An Economy that Works for People],EU-China Working Group on Spirits and follow u...,rep,,,
2,Trade unions and professional associations,Promotes their own interests or the collective...,"[Ylva Johansson (Commissioner), Asa Webber (Ca...",The Swedish Trade Union Confederation,673091017982-82,Cabinet of Commissioner Ylva Johansson,2023-11-08,"Brussels, Belgium",[Home Affairs],Climate\r\nLabour\r\nMigration,rep,,,
3,Professional consultancies,Advances interests of their clients,[Juraj Nociar (Cabinet member)],"EPA Consulting, s.r.o.",210418516054-55,Cabinet of Vice-President Maro&#353; &#352;ef&...,2023-11-08,Brussels,[Interinstitutional Relations and Foresight],Electricity Supply.,rep,,,
4,Companies & groups,Promotes their own interests or the collective...,"[Florian Denis (Cabinet member), Nathalie De B...",SIX GROUP LTD,259182121223-88,Cabinet of Commissioner Mairead Mcguinness,2023-11-08,Brussels,"[Financial services, financial stability and C...","Capital Markets Union, MIFID, EMIR, Listing ac...",rep,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21806,Companies & groups,Promotes their own interests or the collective...,[Roberto Viola (Director-General)],QUALCOMM Incorporated,00358442856-45,,2019-12-02,Brussels,[],"Commission future agenda on 5G, AI and cloud",dg,,CNECT,"Communications Networks, Content and Technology"
21807,Companies & groups,Promotes their own interests or the collective...,[Roberto Viola (Director-General)],Meta Platforms Ireland Limited and its various...,28666427835-74,,2019-12-02,Brussels,[],Platforms and DSA,dg,,CNECT,"Communications Networks, Content and Technology"
21808,Think tanks and research institutions,Does not represent commercial interests,[Jean-Eric Paquet (Director-General)],Institute for future-fit economies gemeinnützi...,630393933743-37,,2019-12-02,Brussels,[Innovation and Youth],COST Action,dg,,RTD,Research and Innovation
21809,Companies & groups,Promotes their own interests or the collective...,[Timo Pesonen (Acting Director-General)],Teollisuuden Voima Oyj,352103717639-15,,2019-12-02,BREY-Brussels,[Internal Market],presenting their views on nuclear energy - lis...,dg,,GROW,"Internal Market, Industry, Entrepreneurship an..."
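
Several columns in the output above, such as `Host` and `portfolio`, appear to hold lists rather than single values. One way to work with them is `DataFrame.explode`, which gives every list element its own row. A minimal sketch on made-up rows (the names and values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows that mimic the shape of the meetings table: Host holds a list.
meetings = pd.DataFrame(
    {
        "Org": ["Org A", "Org B"],
        "Host": [["Host 1", "Host 2"], ["Host 3"]],
    }
)

# explode() repeats the other columns once per list element.
one_host_per_row = meetings.explode("Host").reset_index(drop=True)
```

On the real data the equivalent call would be `df2.explode("Host")`, assuming the column was parsed as Python lists, as the brackets in the output suggest.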


## HTML/CSS

Another important part of the Inspector is the `Inspector` tab.

![img](img/inspector_inspector.png)

Here we find the HTML/CSS source of the website.

### HTML
![html](img/anatomy-of-an-html-element.jpg)

### CSS
![html](img/html-element.png)

We use these elements (tags, classes and ids) to navigate to the part of the website we want to scrape.
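
To make this concrete, here is a minimal sketch of navigating by tag, class and id with CSS selectors, using `BeautifulSoup`. The HTML snippet, its class names and its links are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented HTML, standing in for what the Inspector tab would show.
html = """
<div id="documents">
  <a class="pdf" href="/files/agenda.pdf">Agenda</a>
  <a class="pdf" href="/files/minutes.pdf">Minutes</a>
  <a class="external" href="https://example.org">Elsewhere</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "#documents a.pdf": every <a> with class "pdf" inside the element with id "documents".
pdf_links = [a["href"] for a in soup.select("#documents a.pdf")]

# "a.pdf": the same links selected by tag and class only; get_text() returns the link text.
link_texts = [a.get_text() for a in soup.select("a.pdf")]
```

In a selector, `#` targets an id, `.` targets a class, and a space means "somewhere inside".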

## Exercise 3: PDF

Go to the [Commission expert groups website](https://ec.europa.eu/transparency/expert-groups-register/screen/expert-groups?lang=en)

* Open an expert group meeting
* Find the URL of the meeting minutes and any other documents you can find
* Use `requests` to download them and save them on your computer

In [42]:
pdf_url = "https://ec.europa.eu/transparency/expert-groups-register/core/api/front/expertGroupAddtitionalInfo/32821/download"

In [43]:
r = requests.get(pdf_url)

In [44]:
r

<Response [200]>

In [52]:
with open("myfile.pdf", "wb") as outfile:
    outfile.write(r.content)

In [51]:
with open("myfile.txt", "w") as outfile:
    outfile.write("Haai!")
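
A `200` status does not guarantee that the body is a PDF: a server can also answer with an HTML page. A well-formed PDF starts with the bytes `%PDF-`, so a quick check before writing can catch that case. A minimal sketch; `save_pdf` is a hypothetical helper, not part of `requests`.

```python
from pathlib import Path


def save_pdf(content: bytes, path: str) -> Path:
    """Write content to path, but only if it looks like a PDF."""
    # A well-formed PDF begins with the marker %PDF- followed by a version number.
    if not content.startswith(b"%PDF-"):
        raise ValueError("content does not start with %PDF-, refusing to save")
    target = Path(path)
    target.write_bytes(content)
    return target
```

With the response from the cells above, this would be called as `save_pdf(r.content, "myfile.pdf")`.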

## Exercise 4: get multiple PDFs

* Go to the [Recovery Expert Group (for mutual tax recovery assistance) (E03234)](https://ec.europa.eu/transparency/expert-groups-register/screen/expert-groups/consult?lang=en&groupID=3234)
* Get all the agenda PDFs

In [53]:
# import the BeautifulSoup library
from bs4 import BeautifulSoup as bs

In [None]:
url = "https://ec.europa.eu/transparency/expert-groups-register/screen/expert-groups/consult?lang=en&groupID=3234"
r = requests.get(url)
soup = bs(r.text, "html.parser")  # name the parser explicitly to avoid a bs4 warning
soup.select("a")

In [70]:
import json

In [95]:
import os

additional_info_json = "https://ec.europa.eu/transparency/expert-groups-register/core/api/front/expertGroups/3234/additionalInformation"
r = requests.get(additional_info_json)

# make sure the output folder exists before writing into it
os.makedirs("pdfs", exist_ok=True)

for item in json.loads(r.text)["activityReports"]:
    # `documents` is a list; we take its first entry
    pdf_url = "https://ec.europa.eu" + item["documents"][0]["urlDownload"]

    title = item["documents"][0]["title"]
    r_pdf = requests.get(pdf_url)

    print(title)

    # "wb" because a PDF is binary data, not text
    with open("pdfs/" + title, "wb") as outfile:
        outfile.write(r_pdf.content)

Agenda - 2017 06 16.pdf
Agenda - 2017 02 22.pdf
Summary report - 2016 09 23.pdf
Agenda - 2016 04 15.pdf
Agenda - 2015 10 13.pdf
Summary report - 2015 02 27.pdf
Agenda.pdf
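
The output above shows two things worth handling: the loop also saves summary reports although the exercise asks for agendas, and each title is used directly as a file name. Below is a minimal sketch of both steps, run on the titles printed above. `is_agenda` and `safe_filename` are hypothetical helpers, and the set of allowed characters is just one reasonable choice.

```python
import re

# Titles copied from the output above.
titles = [
    "Agenda - 2017 06 16.pdf",
    "Agenda - 2017 02 22.pdf",
    "Summary report - 2016 09 23.pdf",
    "Agenda - 2016 04 15.pdf",
    "Agenda - 2015 10 13.pdf",
    "Summary report - 2015 02 27.pdf",
    "Agenda.pdf",
]


def is_agenda(title: str) -> bool:
    """True when the title starts with 'agenda', ignoring case."""
    return title.lower().startswith("agenda")


def safe_filename(title: str) -> str:
    """Replace every run of characters other than letters, digits, '.', '-' and '_' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", title)


agenda_files = [safe_filename(t) for t in titles if is_agenda(t)]
```

Inside the download loop, the same idea would be a check such as `if not is_agenda(title): continue`, followed by writing to `"pdfs/" + safe_filename(title)`.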
