# <center>Web Scraping</center>

<img src="../image/text_DALLE.jpeg" width=30% align="right" style="in-line">

>*Data is like garbage.*
>
>*You’d better know what you are going to do with it before you collect it.*
>
>— Mark Twain ? ( source: [Forbes](https://www.forbes.com/councils/forbestechcouncil/2023/05/09/the-delta-between-trust-and-usability-where-data-management-still-falls-short/) )

<img src="../image/quote1_ChatGPT.png" width=70% align="left" style="in-line">

## Agenda

1. Web page basics (see slides)
2. Web scraping with Python

<a name="2"></a>
## Agenda 2. Web scraping with Python

Sometimes web scraping is easy; other times it gets complicated.

- Easy: static HTML
- Hard: HTML and CSS
- Harder: JavaScript, which often requires a "headless" web browser

Let's start the web scraping. We will collect some news regarding mobility and transport.

This is the website: [European Commission - Mobility and Transport News](https://transport.ec.europa.eu/news-events/news_en?page=0).

&#x1F4D6; **<font color=teal>WHAT WE HAVE LEARNED: </font>Legal and ethical considerations** 
* **Terms of Use:** The European Commission allows the reuse of its content under certain conditions.
>Unless otherwise indicated (e.g. in individual copyright notices), content owned by the EU on this website is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence. This means that reuse is allowed, provided appropriate credit is given and changes are indicated.

     For educational purposes, reuse is usually permitted. Review [the Legal Notice](https://commission.europa.eu/legal-notice_en) to ensure compliance.
     

* **Robots.txt:** Check [the website's robots.txt](https://transport.ec.europa.eu/robots.txt) file to see any disallowed paths. We can see that https://transport.ec.europa.eu/news-events is not among the disallowed paths.
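
The robots.txt check described above can also be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_RULES` are hypothetical and written only for illustration; they are not the contents of the real file, and the helper name `is_allowed` is our own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the real file lives at
# https://transport.ec.europa.eu/robots.txt and may look different.
SAMPLE_RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
"""


def is_allowed(rules_text: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


news_url = "https://transport.ec.europa.eu/news-events/news_en?page=0"
admin_url = "https://transport.ec.europa.eu/admin/settings"  # made-up path

print(is_allowed(SAMPLE_RULES, news_url))   # not under a disallowed path
print(is_allowed(SAMPLE_RULES, admin_url))  # under the sample "/admin/" rule
```

To test against the live file instead of sample text, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the rules before you call `can_fetch()`.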

We will use a library called `requests` to download web pages. `requests` makes a [GET request](https://en.wikipedia.org/wiki/HTTP#Request_methods) to a web server and returns the server's response, which contains the HTML of the requested page. We will then use a library called `BeautifulSoup` to parse the HTML document.

&#x270A; **<font color=firebrick>DO THIS: </font> Run the cell below to check if you have the libraries installed. If not, install them now.**

In [20]:
import requests
from bs4 import BeautifulSoup

In [22]:
url = "https://transport.ec.europa.eu/news-events/news_en?page=0"

In [24]:
page = requests.get(url)

In [26]:
print(page)

<Response [200]>


After running our request, we get a Response object. This object has a status code, which shows us if the page was downloaded successfully. A status code of 200 means that the page was downloaded successfully.

&#x1F4A1; **HTTP status codes** (Source: [wikipedia](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes))

>* 1xx informational response – the request was received, continuing process
>* 2xx successful – the request was successfully received, understood, and accepted
>* 3xx redirection – further action needs to be taken in order to complete the request
>* 4xx client error – the request contains bad syntax or cannot be fulfilled
>* 5xx server error – the server failed to fulfil an apparently valid request
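
As a small illustration of the five classes listed above, the sketch below maps a numeric status code to its class. It relies only on the standard library; `status_class` is a helper name made up for this example.

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Map a status code to one of the five classes by its first digit."""
    classes = {
        1: "informational response",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

With `requests`, the integer is available as `page.status_code`, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient guard before parsing.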

We now use `BeautifulSoup` to parse the page.

In [28]:
soup = BeautifulSoup(page.content, 'html.parser')

In [30]:
# dir(soup)
# help(soup)

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'DEFAULT_INTERESTING_STRING_TYPES',
 'EMPTY_ELEMENT_EVENT',
 'END_ELEMENT_EVENT',
 'ROOT_TAG_NAME',
 'START_ELEMENT_EVENT',
 'STRING_ELEMENT_EVENT',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_clone',
 '_decode_markup',
 '_event_stream',
 '_feed',
 '_find_all',
 '_find_one',
 '_format_tag',
 '_indent_string',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_linkage_fixer',
 '_marku

In [34]:
# print out the HTML content of the page
# print(soup)
print(soup.prettify()) #format the page nicely

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="og: https://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="News" name="description"/>
  <meta content="en" http-equiv="content-language"/>
  <link href="https://transport.ec.europa.eu/news-events/news_en" rel="canonical"/>
  <meta content="follow, noindex" name="robots"/>
  <meta content="auto" property="og:determiner"/>
  <meta content="Mobility and Transport" property="og:site_name"/>
  <meta content="website" property="og:type"/>
  <meta content="https://transport.ec.europa.eu/news-events/news_en" property="og:url"/>
  <meta content="News" property="og:title"/>
  <meta content="News" property="og:description"/>
  <meta content="https://transport.ec.europa.eu/profiles/contrib/ewcms/modules/ewcms_seo/assets/images/ec-socialmedia-fallback.png" property="og:image"/>
  <meta content="Mobility and Transport" property="og:image:alt"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="News" name="t

The task now is to **locate the specific content that we want to scrape**. You can either do it by a quick search of keywords, or view the page structure in a browser (for example: in Chrome by clicking `View` -> `Developer` -> `Inspect Elements`).

Search keyboard shortcut: 
* Windows: `Ctrl` + `f`
* Mac: `command` + `f`


Once you have located the content, look for the tag and attributes of the target element.

&#x1F4D6; **<font color=teal>WHAT WE HAVE LEARNED: </font> HTML tags and attributes**

<img src="../image/HTML_element.png" width=50% align="left" >

There could be more than one way to locate the target element. 

&#x1F4A1; **HTML elements:** [documentation](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).
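
To make "tag" and "attribute" concrete, here is a minimal sketch that lists the tags, attributes, and text of a tiny HTML snippet. It deliberately uses the standard library's `html.parser` rather than `BeautifulSoup`, so it runs without extra installation. The snippet and the class name `TagLister` are made up for illustration and are not copied from the live page.

```python
from html.parser import HTMLParser

# Hand-written snippet loosely modelled on a news item (not from the real page)
SNIPPET = (
    '<div class="item">'
    '<a class="link" href="/news/example_en">Example title</a>'
    "</div>"
)


class TagLister(HTMLParser):
    """Collect (tag, attributes) pairs and visible text while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


lister = TagLister()
lister.feed(SNIPPET)
lister.close()

for tag, attrs in lister.tags:
    print(tag, attrs)
print(lister.texts)
```

Here `div` and `a` are tags, while `class` and `href` are attributes. `BeautifulSoup` lets you search by exactly these two pieces of information, for example a tag name together with a `class_` value.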

In [36]:
news = soup.find_all("div", class_="ecl-content-item-block__item")
#news = soup.find_all("article", class_="ecl-content-item")

In [38]:
#print(news)
len(news)

10

In [40]:
print(news[0]) # the first news

<div class="ecl-content-item-block__item contextual-region ecl-u-mb-l ecl-col-12"><div data-contextual-id="node:node=3864:" data-contextual-token="FsIpHy1-eR819dEL6zzN3vB1mwNA5IJpwtqJvFX19qo" data-drupal-ajax-container=""></div><article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2024-10-17T12:00:00Z">17 October 2024</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone" href="/news-events/news/solidarity-lanes-latest-figures-september-2024-2024-10-17_en">Solidarity Lanes: Latest figures – September 2024</a></div><div class="ecl-content-block__description"><p>Latest figures on Ukrainian exports and imports via the EU-Ukraine Solidarity Lane

In [None]:
#copy paste it in a Markdown cell here



In [42]:
# a for loop to get all the titles
for item in news:
    title = item.find("a", class_="ecl-link ecl-link--standalone")
    print(title.get_text())
    print("====")

Solidarity Lanes: Latest figures – September 2024
====
20,400 lives lost in EU road crashes last year
====
October infringements package: key decisions
====
Single European Sky: annual monitoring highlights persistent capacity shortages and failure to meet environmental targets in 2023, as traffic recovers
====
Women in Rail 2024 Award to boost female talent for a sustainable and competitive rail industry
====
Commission seeks feedback on the Flight Emissions Label
====
CEF Transport: €2.5 billion to boost resilience and safety across the EU transport network
====
Rail market opening: competition leads to lower ticket prices, EU study finds
====
Solidarity Lanes: Latest figures – August 2024
====
European Mobility Week kicks off to promote shared public spaces for sustainable urban mobility and improved quality of life
====


A note to myself: go back to the slides before the big task.

&#x270A; **<font color=firebrick>DO THIS: </font>** Here is an example project. We would like to find out what the European Union has done recently (say, since 2023) to advance sustainable mobility and transport. One possible data source is the news we just scraped, but we need more than just the title of each news item.

So now please write some code to collect **the date, the title, the short description, the news type, and the link to the full text** of all news in 2023. Save the data to a **csv** file.

Here are some tips:
1. How many pages do you need to scrape? Observe how the web addresses change between the first page and the second.
2. Remember that we talked about **avoiding server overload** in the ethics discussion. Make sure to use `time.sleep()` between requests.
3. Maybe AI tools such as ChatGPT can help. But you need to make sure its solution works.

If you would like to challenge yourself, see if you can scrape the full text (not the short description) of the news. Trying this with one or two news items is enough.
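
Tips 1 and 2 can be sketched without touching the network. The example below builds the page URLs from the `page` query parameter (the first page used earlier in this notebook is `page=0`) and pauses between visits. The function names, the page range, and the stand-in `fetch` argument are assumptions made for illustration; in a real run `fetch` would be something like `requests.get` and the delay would be a few seconds.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://transport.ec.europa.eu/news-events/news_en"


def page_urls(first_page: int, last_page: int) -> list[str]:
    """Build one URL per page number, both ends inclusive."""
    return [
        f"{BASE_URL}?{urlencode({'page': n})}"
        for n in range(first_page, last_page + 1)
    ]


def visit_politely(urls, delay_seconds, fetch):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


urls = page_urls(0, 2)  # arbitrary small range for the demo

# Stand-in fetch that just echoes the URL, so nothing is downloaded here
visited = visit_politely(urls, delay_seconds=0.1, fetch=lambda u: u)
for url in visited:
    print(url)
```

Passing `fetch` as an argument keeps the pacing logic separate from the download logic, which makes it easy to try the loop on a couple of pages before scraping the full range.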

In [44]:
# the extra packages you will need
import time
import csv

In [None]:
# This pulls everything from page 1 to 17, also including some entries from 2022

In [50]:
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_news():
    base_url = "https://transport.ec.europa.eu/news-events/news_en"
    start_page = 1  # Start from page 1
    end_page = 17   # End at page 17 where 2023 news starts
    news_data = []
    
    while start_page <= end_page:
        url = f"{base_url}?page={start_page}"
        print(f"Scraping {url}")
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        news_items = soup.find_all("div", class_="ecl-content-item-block__item")

        if not news_items:
            print("No more news items found, stopping...")
            break

        for item in news_items:
            title_tag = item.find("a", class_="ecl-link ecl-link--standalone")
            title = title_tag.get_text(strip=True) if title_tag else "No title"
            link = f'https://transport.ec.europa.eu{title_tag["href"]}' if title_tag else "No link"
            date_tag = item.find("time")
            date = date_tag.get_text(strip=True) if date_tag else "No date"
            short_desc_tag = item.find("div", class_="ecl-content-block__description")
            short_desc = short_desc_tag.get_text(strip=True) if short_desc_tag else "No description"

            news_data.append([date, title, short_desc, link])

        start_page += 1
        time.sleep(5)  # Pause to avoid overloading the server

    return news_data

def save_news_to_csv(news_data, filename):
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Date', 'Title', 'Short Description', 'Link'])
        for row in news_data:
            writer.writerow(row)

news_data = scrape_news()
save_news_to_csv(news_data, 'eu_transport_news_2023.csv')
print("News data has been saved to 'eu_transport_news_2023.csv'")


Scraping https://transport.ec.europa.eu/news-events/news_en?page=1
Scraping https://transport.ec.europa.eu/news-events/news_en?page=2
Scraping https://transport.ec.europa.eu/news-events/news_en?page=3
Scraping https://transport.ec.europa.eu/news-events/news_en?page=4
Scraping https://transport.ec.europa.eu/news-events/news_en?page=5
Scraping https://transport.ec.europa.eu/news-events/news_en?page=6
Scraping https://transport.ec.europa.eu/news-events/news_en?page=7
Scraping https://transport.ec.europa.eu/news-events/news_en?page=8
Scraping https://transport.ec.europa.eu/news-events/news_en?page=9
Scraping https://transport.ec.europa.eu/news-events/news_en?page=10
Scraping https://transport.ec.europa.eu/news-events/news_en?page=11
Scraping https://transport.ec.europa.eu/news-events/news_en?page=12
Scraping https://transport.ec.europa.eu/news-events/news_en?page=13
Scraping https://transport.ec.europa.eu/news-events/news_en?page=14
Scraping https://transport.ec.europa.eu/news-events/news_

In [None]:
# This pulls everything from page 1 to 17, but only keeps entries from the years 2024 and 2023

In [54]:
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_news():
    base_url = "https://transport.ec.europa.eu/news-events/news_en"
    start_page = 1  # Start from page 1
    end_page = 17   # End at page 17 where 2023 news starts
    news_data = []
    
    while start_page <= end_page:
        url = f"{base_url}?page={start_page}"
        print(f"Scraping {url}")
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        news_items = soup.find_all("div", class_="ecl-content-item-block__item")

        if not news_items:
            print("No more news items found, stopping...")
            break

        for item in news_items:
            title_tag = item.find("a", class_="ecl-link ecl-link--standalone")
            title = title_tag.get_text(strip=True) if title_tag else "No title"
            link = f'https://transport.ec.europa.eu{title_tag["href"]}' if title_tag else "No link"
            date_tag = item.find("time")
            date = date_tag.get_text(strip=True) if date_tag else "No date"
            year = date[-4:]  # Extracting the year from the date

            # Allow news from 2023 and 2024
            if year not in ["2023", "2024"]:
                continue

            short_desc_tag = item.find("div", class_="ecl-content-block__description")
            short_desc = short_desc_tag.get_text(strip=True) if short_desc_tag else "No description"

            news_data.append([date, title, short_desc, link])

        start_page += 1
        time.sleep(3)  # Pause to avoid overloading the server

    return news_data

def save_news_to_csv(news_data, filename):
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Date', 'Title', 'Short Description', 'Link'])
        for row in news_data:
            writer.writerow(row)

news_data = scrape_news()
save_news_to_csv(news_data, 'eu_transport_news_2023_2024.csv')
print("News data has been saved to 'eu_transport_news_2023_2024.csv'")


Scraping https://transport.ec.europa.eu/news-events/news_en?page=1
Scraping https://transport.ec.europa.eu/news-events/news_en?page=2
Scraping https://transport.ec.europa.eu/news-events/news_en?page=3
Scraping https://transport.ec.europa.eu/news-events/news_en?page=4
Scraping https://transport.ec.europa.eu/news-events/news_en?page=5
Scraping https://transport.ec.europa.eu/news-events/news_en?page=6
Scraping https://transport.ec.europa.eu/news-events/news_en?page=7
Scraping https://transport.ec.europa.eu/news-events/news_en?page=8
Scraping https://transport.ec.europa.eu/news-events/news_en?page=9
Scraping https://transport.ec.europa.eu/news-events/news_en?page=10
Scraping https://transport.ec.europa.eu/news-events/news_en?page=11
Scraping https://transport.ec.europa.eu/news-events/news_en?page=12
Scraping https://transport.ec.europa.eu/news-events/news_en?page=13
Scraping https://transport.ec.europa.eu/news-events/news_en?page=14
Scraping https://transport.ec.europa.eu/news-events/news_

In [None]:
# Displaying the csv file

In [62]:
import pandas as pd

filename = 'eu_transport_news_2023_2024.csv'
df = pd.read_csv(filename)

df.head(50)


| | Date | Title | Short Description | Link |
|---|---|---|---|---|
| 0 | 9 September 2024 | TEN-T Corridor Coordinators | Nine European Coordinators have been designate... | https://transport.ec.europa.eu/news-events/new... |
| 1 | 29 August 2024 | Sustainable Transport Forum - new sub-groups o... | In the wake of the adoption of the new Alterna... | https://transport.ec.europa.eu/news-events/new... |
| 2 | 14 August 2024 | Solidarity Lanes: Latest figures – July 2024 | Latest figures on Ukrainian exports and import... | https://transport.ec.europa.eu/news-events/new... |
| 3 | 31 July 2024 | Commission enforces temporary restrictions on ... | The Commission will temporarily enforce restri... | https://transport.ec.europa.eu/news-events/new... |
| 4 | 29 July 2024 | European Commission launches Ship Financing Po... | The Ship Financing Portal, designed to improve... | https://transport.ec.europa.eu/news-events/new... |
| 5 | 25 July 2024 | July infringement package: key decisions | July infringement package: key decisions | https://transport.ec.europa.euhttps://ec.europ... |
| 6 | 23 July 2024 | FuelEU Maritime Regulation: Q&A on implementation | As the maritime sector prepares for the FuelEU... | https://transport.ec.europa.eu/news-events/new... |
| 7 | 22 July 2024 | Solidarity Lanes: Latest figures – June 2024 | Latest figures on Ukrainian exports and import... | https://transport.ec.europa.eu/news-events/new... |
| 8 | 22 July 2024 | Commission publishes new guidelines for more c... | While more citizens of the European Union are ... | https://transport.ec.europa.eu/news-events/new... |
| 9 | 17 July 2024 | EU invests record €7 billion in sustainable, s... | The Commission has selected 134 projects to re... | https://transport.ec.europa.eu/news-events/new... |
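
Note the link in row 5: it starts with `https://transport.ec.europa.euhttps://ec.europ...`. This suggests that the `href` of that item was already an absolute URL, so prepending the site address produced a broken link. One way to handle both cases is `urllib.parse.urljoin`, sketched below. The absolute address in `absolute_href` is a made-up placeholder, since the real one is truncated in the output above.

```python
from urllib.parse import urljoin

BASE = "https://transport.ec.europa.eu"

# A relative href as seen in the first news item printed earlier
relative_href = "/news-events/news/solidarity-lanes-latest-figures-september-2024-2024-10-17_en"

# Hypothetical absolute href pointing to another site
absolute_href = "https://ec.europa.eu/example-page"

# Relative hrefs are resolved against BASE; absolute ones are returned unchanged
print(urljoin(BASE, relative_href))
print(urljoin(BASE, absolute_href))
```

In the scraping function, this would mean building the link with `urljoin(base, title_tag["href"])` instead of string concatenation.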


In [None]:
# After parsing, the Date column is now the index of the DataFrame

In [66]:
import pandas as pd

# Load the CSV file into a DataFrame and parse the 'Date' column as datetime
filename = 'eu_transport_news_2023_2024.csv'
df = pd.read_csv(filename, parse_dates=['Date'])

# Set the 'Date' column as the index of the DataFrame
df.set_index('Date', inplace=True)

# Display the first few rows of the DataFrame to check its content
df.head(50)


| Date | Title | Short Description | Link |
|---|---|---|---|
| 2024-09-09 | TEN-T Corridor Coordinators | Nine European Coordinators have been designate... | https://transport.ec.europa.eu/news-events/new... |
| 2024-08-29 | Sustainable Transport Forum - new sub-groups o... | In the wake of the adoption of the new Alterna... | https://transport.ec.europa.eu/news-events/new... |
| 2024-08-14 | Solidarity Lanes: Latest figures – July 2024 | Latest figures on Ukrainian exports and import... | https://transport.ec.europa.eu/news-events/new... |
| 2024-07-31 | Commission enforces temporary restrictions on ... | The Commission will temporarily enforce restri... | https://transport.ec.europa.eu/news-events/new... |
| 2024-07-29 | European Commission launches Ship Financing Po... | The Ship Financing Portal, designed to improve... | https://transport.ec.europa.eu/news-events/new... |
| 2024-07-25 | July infringement package: key decisions | July infringement package: key decisions | https://transport.ec.europa.euhttps://ec.europ... |
| 2024-07-23 | FuelEU Maritime Regulation: Q&A on implementation | As the maritime sector prepares for the FuelEU... | https://transport.ec.europa.eu/news-events/new... |
| 2024-07-22 | Solidarity Lanes: Latest figures – June 2024 | Latest figures on Ukrainian exports and import... | https://transport.ec.europa.eu/news-events/new... |
| 2024-07-22 | Commission publishes new guidelines for more c... | While more citizens of the European Union are ... | https://transport.ec.europa.eu/news-events/new... |
| 2024-07-17 | EU invests record €7 billion in sustainable, s... | The Commission has selected 134 projects to re... | https://transport.ec.europa.eu/news-events/new... |
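
The conversion from text such as `9 September 2024` to a date like `2024-09-09` can also be done by hand, which is useful when filtering by year during scraping. The sketch below uses `datetime.strptime` from the standard library. It assumes the "day month-name year" format seen in the scraped data and English month names (the default in most Python setups); `parse_news_date` is a name chosen for this example.

```python
from datetime import datetime


def parse_news_date(text: str) -> datetime:
    """Parse dates written like '9 September 2024' (English month names assumed)."""
    return datetime.strptime(text, "%d %B %Y")


for raw in ("9 September 2024", "29 August 2024"):
    parsed = parse_news_date(raw)
    print(raw, "->", parsed.date().isoformat(), "| year:", parsed.year)
```

Comparing `parsed.year` against a set of wanted years is a little more robust than slicing the last four characters of the string, because a malformed date raises a `ValueError` instead of passing through silently.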


---------
### Congratulations, we are done!

This notebook is written by [Meng Cai](https://www.verkehr.tu-darmstadt.de/vv/das_institut_ivv/team_ivv/wissenschaftliche_mitarbeiter_doktoranden/meng_cai/standardseite_204.de.jsp), Technical University of Darmstadt. Special thanks to [Dirk Colbry](https://icer.msu.edu/contact-directory/Dirk-Colbry) for sharing his course materials on this topic. This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a>