# Scraping German train stations 

The aim of this project is to automate scraping information about departures and arrivals from a particular train station in Germany. To do this we will use the website "https://bahnauskunft.info". The page "https://bahnauskunft.info/bahnhoefe-deutschland/" lists every train station in Germany. After clicking one of them and scrolling down, you will see two tables, which are the ones we want to scrape.

Although many web scraping tools could handle this project, we will use Selenium, because the website uses JavaScript to render its data tables and requires clicking and selecting content before scraping.

In [1]:
import datetime
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pickle

driver = webdriver.Safari()

## Stations dictionary

To scrape information about a train station we need its name and the link to that station's page. The following code collects the name and link of every train station. Its only disadvantage is that the process takes a few minutes, so by default I've commented this part of the code out. To speed things up I've already downloaded the data and stored it as a dictionary { "name_of_station" : link } in a file called 'stations_dictionary.pkl'. If you want to scrape the stations yourself, uncomment the following code.

In [None]:
# website = "https://bahnauskunft.info/bahnhoefe-deutschland/"
# driver.get(website)

In [2]:
# cookie_button = driver.find_element("xpath", '//*[@id="BorlabsCookieBox"]/div/div/div/div[1]/div/div/div[2]/p[1]/a')
# driver.execute_script("arguments[0].click();", cookie_button)

In [3]:
# stations_dictionary = {}
# stations = driver.find_elements("class name", 'wp-show-posts-entry-title')
# for station in stations:
#     link = station.find_element("tag name", 'a')
#     stations_dictionary[link.text] = link.get_attribute('href')
#
# # Drop entries whose titles are general articles rather than station names
# del stations_dictionary["Deutsche Bahn Fahrplan"]
# del stations_dictionary["Fahrgastrechte Bahn"]
# del stations_dictionary["Deutsche Bahn Sitzplatzreservierung"]
# del stations_dictionary["Deutsche Bahn WLAN"]
# del stations_dictionary["ICE Sprinter"]
# del stations_dictionary["Deutsche Bahn mit Kindern"]
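The cells above build `stations_dictionary` but do not show how it ended up in 'stations_dictionary.pkl'. Here is a minimal sketch of that step, assuming the file is written with the same pickle package that is later used to read it. `save_stations` and `load_stations` are hypothetical helper names, and the sample entry uses a placeholder link rather than a real one.

```python
import pickle
import tempfile
from pathlib import Path


def save_stations(stations, path):
    """Write the {station name: link} dictionary to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(stations, f)


def load_stations(path):
    """Read the dictionary back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder entry for illustration only; the real links come from the scrape.
sample = {"Aachen Hauptbahnhof": "https://example.invalid/aachen-hauptbahnhof"}

# Round trip through a temporary directory so no real file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "stations_dictionary.pkl"
    save_stations(sample, target)
    restored = load_stations(target)

print(restored == sample)
```

After a real scrape the call would be `save_stations(stations_dictionary, 'stations_dictionary.pkl')`.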

To load the dictionary we will use the pickle package.

In [2]:
# Use a neutral handle name so the built-in `dict` is not shadowed
with open('stations_dictionary.pkl', 'rb') as f:
    stations_dictionary = pickle.load(f)

Let's see all available options to choose from.

In [3]:
print("List of " + str(len(stations_dictionary)) +" available stations\n")
print(stations_dictionary.keys())

List of 5392 available stations

dict_keys(['Aachen Hauptbahnhof', 'Aachen Schanz', 'Aachen West', 'Aachen-Rothe Erde', 'Aalen Hauptbahnhof', 'Abensberg', 'Achern', 'Achim', 'Achmer', 'Achterwehr', 'Adelebsen', 'Adelschlag', 'Adelsdorf (Mittelfranken)', 'Adelsheim Nord', 'Adelsheim Ost', 'Adorf (Vogtland)', 'Affaltrach', 'Agatharied', 'Agathenburg', 'Aglasterhausen', 'Aha', 'Ahaus', 'Ahlen (Westfalen)', 'Ahlhorn', 'Ahlten (Hannover)', 'Ahrbrück', 'Ahrensburg', 'Ahrensburg-Gartenholz', 'Ahrensfelde', 'Ahrensfelde Friedhof', 'Ahrensfelde Nord', 'Ahrweiler', 'Ahrweiler Markt', 'Aichach', 'Aichstetten', 'Ainring', 'Albbruck', 'Albersdorf', 'Albersweiler (Pfalz)', 'Albig', 'Albrechtshof', 'Albshausen', 'Albsheim (Eis)', 'Albstadt Ebingen West', 'Albstadt-Ebingen', 'Albstadt-Laufen Ort', 'Albstadt-Lautlingen', 'Aldekerk', 'Aldingen', 'Aletshausen', 'Alexanderplatz', 'Alfeld (Leine)', 'Alfter-Impekoven', 'Alfter-Witterschlick', 'Algermissen', 'Alheim-Heinebach', 'Aligse', 'Allee-Center Leipzi
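With several thousand keys, the full printout is hard to scan. A small sketch of a case-insensitive substring filter may help here. `find_stations` is a hypothetical helper, and the short list only stands in for `stations_dictionary.keys()`.

```python
def find_stations(names, fragment):
    """Return the station names that contain `fragment`, ignoring case."""
    needle = fragment.casefold()
    return sorted(name for name in names if needle in name.casefold())


# A few names taken from the printout above, standing in for the full key list.
sample_names = ["Aachen Hauptbahnhof", "Aachen Schanz", "Aachen West", "Aalen Hauptbahnhof", "Abensberg"]

print(find_stations(sample_names, "aachen"))
print(find_stations(sample_names, "hauptbahnhof"))
```

With the loaded dictionary the call would be `find_stations(stations_dictionary, "aachen")`, since iterating a dictionary yields its keys.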

There are a lot of train stations in Germany, so feel free to choose one of them.

In [4]:
while True:
    station = input("Select available station: ")
    if station in stations_dictionary:
        print("Successfully selected station")
        break
    print("Selected station is not available. Try again.")

Successfully selected station
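
The loop above accepts only an exact key, so a single typo sends the user back to the prompt without any hint. One possible refinement is sketched below with the standard library's `difflib.get_close_matches`. `suggest_stations` is a hypothetical helper, and the cutoff of 0.6 is simply difflib's default rather than a tuned value.

```python
import difflib


def suggest_stations(query, names, limit=3):
    """Return up to `limit` station names that closely resemble `query`."""
    return difflib.get_close_matches(query, list(names), n=limit, cutoff=0.6)


sample_names = ["Aachen Hauptbahnhof", "Aachen Schanz", "Aachen West", "Abensberg"]

# A misspelled query still finds the intended station as the best match.
suggestions = suggest_stations("Aachen Hauptbanhof", sample_names)
print(suggestions[0])
```

Inside the loop, the suggestions could be printed next to the "Try again" message whenever the entered name is not a key.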


Now we will connect to the page presenting information about the chosen train station.

In [5]:
website = stations_dictionary[str(station)]
driver.get(str(website))

At the beginning we need to accept cookies by clicking the button.

In [6]:
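# find_element(by, value) replaces the find_element_by_* helpers, which were removed in Selenium 4.3
cookie_button = driver.find_element("xpath", '//*[@id="BorlabsCookieBox"]/div/div/div/div[1]/div/div/div[2]/p[1]/a')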
driver.execute_script("arguments[0].click();", cookie_button)
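
The cookie banner may appear slightly after the page itself has loaded, in which case an immediate lookup fails. A minimal polling sketch in plain Python is shown below, assuming that retrying every half second is acceptable. `wait_for` is a hypothetical helper; Selenium's own `WebDriverWait` class serves the same purpose.

```python
import time as time_module  # aliased: the notebook later binds `time` to a datetime


def wait_for(probe, timeout=10.0, poll=0.5):
    """Call `probe` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time_module.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time_module.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time_module.sleep(poll)


# Stand-in probe that succeeds on the third call, imitating a late banner.
calls = {"count": 0}


def fake_probe():
    calls["count"] += 1
    return ["cookie_button"] if calls["count"] >= 3 else []


found = wait_for(fake_probe, timeout=2.0, poll=0.01)
print(found, calls["count"])
```

With a real driver the probe could be `lambda: driver.find_elements("xpath", cookie_xpath)`, where `cookie_xpath` is an assumed variable holding the XPath string used above; `find_elements` returns an empty list until the button exists.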

As you can see, by default the website presents only 10 rows in each table. Let's change it to display 100 rows.

In [7]:
for table in ["DataTables_Table_0_length", "DataTables_Table_1_length"]:
    dropdown_button = driver.find_element("name", table)
    dropdown_button.click()
    select_dropdown = Select(dropdown_button)
    # Index 3 is the fourth option of the page-length dropdown (100 rows)
    select_dropdown.select_by_index(3)

Now it is time to design what our scraped tables should look like.

In [8]:
# Both tables share the same layout, so the column list is defined once
columns = ["Line", "From", "To", "Delay", "Hour", "Minute", "Day", "Day of Year", "Weekday", "Week number", "Month", "Year"]
arrivals = pd.DataFrame(columns=columns)
departures = pd.DataFrame(columns=columns)

This code scrapes the rows of each table, adds some time information, and stores them in the dataframes.

In [9]:
time = datetime.datetime.now()

for dataset in ["DataTables_Table_0", "DataTables_Table_1"]:
    table = driver.find_element("id", dataset)
    for tr in table.find_elements("tag name", "tr"):
        lst = [td.text for td in tr.find_elements("tag name", "td")]
        if len(lst) == 0:
            # Rows without <td> cells (such as a header row) carry no data
            continue
        # One-row frame: train details from the cells plus the scrape timestamp
        row = pd.DataFrame({"Line": lst[0], "From": lst[2], "To": lst[3], "Delay": lst[5],
                            "Hour": lst[1][:2], "Minute": lst[1][3:5],
                            "Day": time.strftime("%d"), "Day of Year": time.strftime("%j"),
                            "Weekday": time.strftime("%w"), "Week number": time.strftime("%W"),
                            "Month": time.strftime("%m"), "Year": time.strftime("%Y")}, index=[0])
        if dataset == "DataTables_Table_0":
            departures = pd.concat([departures, row], ignore_index=True)
        else:
            arrivals = pd.concat([arrivals, row], ignore_index=True)
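
The row handling can also be written as a pure function, which makes the cell positions and the `strftime` codes easy to check without a browser. The sketch below follows the notebook's own assumptions: cell 0 is the line, cell 1 the time as HH:MM, cells 2 and 3 the endpoints, and cell 5 the delay. `row_to_record` is a hypothetical name, and cell 4, which the notebook never reads, is left as an empty placeholder.

```python
import datetime


def row_to_record(cells, scraped_at):
    """Map the text of one table row to the notebook's column names."""
    return {
        "Line": cells[0],
        "From": cells[2],
        "To": cells[3],
        "Delay": cells[5],
        "Hour": cells[1][:2],      # "11:23" -> "11"
        "Minute": cells[1][3:5],   # "11:23" -> "23"
        "Day": scraped_at.strftime("%d"),
        "Day of Year": scraped_at.strftime("%j"),   # 001 to 366
        "Weekday": scraped_at.strftime("%w"),       # 0 = Sunday ... 6 = Saturday
        "Week number": scraped_at.strftime("%W"),   # weeks counted from Monday
        "Month": scraped_at.strftime("%m"),
        "Year": scraped_at.strftime("%Y"),
    }


# Cells modelled on the first departures row shown in this notebook.
cells = ["OPB RB95", "11:23", "Hof Hbf", "Rehau", "", "+1"]
record = row_to_record(cells, datetime.datetime(2023, 9, 5, 11, 0))
print(record["Day of Year"], record["Weekday"], record["Week number"])
```

All values are still strings at this point, and the date parts are zero-padded ("05", "09"); they only become integers once the columns are converted.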

As you can see, if a connection is delayed, the delay is represented as "+(minutes_of_delay)"; otherwise there is an empty string. Let's change it so that the delay is stored as an integer. Finally, we can also set the datatypes of the dataframe columns to correspond to the data each column stores.

In [10]:
int_columns = ["Delay", "Hour", "Minute", "Day", "Day of Year", "Weekday", "Week number", "Month", "Year"]

for dataset in [departures, arrivals]:
    dataset["Delay"] = dataset["Delay"].apply(lambda x: 0 if x == "" else x.lstrip("+"))
    # Assign column by column: `dataset = dataset.astype(...)` would only rebind the
    # loop variable and leave `departures` and `arrivals` unchanged
    for column in int_columns:
        dataset[column] = dataset[column].astype("int64")
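
The delay rule can be isolated and checked on its own. This is a small sketch, assuming the cell holds either an empty string or a plus sign followed by whole minutes. `parse_delay` is a hypothetical helper, and tolerating surrounding whitespace is an addition of this sketch.

```python
def parse_delay(text):
    """Convert a delay cell such as "+5" to minutes; an empty cell means 0."""
    cleaned = text.strip()
    if cleaned == "":
        return 0
    return int(cleaned.lstrip("+"))


print([parse_delay(value) for value in ["", "+1", "+12", " +3 "]])
```

It could be applied to a dataframe with `dataset["Delay"].apply(parse_delay)`.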

Take a look at the scraped Departures table.

In [11]:
departures

Unnamed: 0,Line,From,To,Delay,Hour,Minute,Day,Day of Year,Weekday,Week number,Month,Year
0,OPB RB95,Hof Hbf,Rehau,1,11,23,5,248,2,36,9,2023
1,ag RB96,Hof Hbf,Rehau,0,11,54,5,248,2,36,9,2023
2,ag RB96,Selb Stadt,Rehau,0,11,54,5,248,2,36,9,2023
3,OPB RB95,Marktredwitz,Rehau,0,12,28,5,248,2,36,9,2023
4,ag RB96,Selb Stadt,Rehau,0,12,54,5,248,2,36,9,2023
5,ag RB96,Hof Hbf,Rehau,0,12,54,5,248,2,36,9,2023
6,OPB RB95,Hof Hbf,Rehau,0,13,22,5,248,2,36,9,2023
7,ag RB96,Selb Stadt,Rehau,0,13,54,5,248,2,36,9,2023
8,ag RB96,Hof Hbf,Rehau,0,13,56,5,248,2,36,9,2023
9,OPB RB95,Marktredwitz,Rehau,0,14,28,5,248,2,36,9,2023


Here are the Arrivals.

In [12]:
arrivals

Unnamed: 0,Line,From,To,Delay,Hour,Minute,Day,Day of Year,Weekday,Week number,Month,Year
0,OPB RB95,Rehau,Marktredwitz,1,11,23,5,248,2,36,9,2023
1,ag RB96,Rehau,Hof Hbf,0,11,55,5,248,2,36,9,2023
2,ag RB96,Rehau,Selb Stadt,0,11,56,5,248,2,36,9,2023
3,OPB RB95,Rehau,Hof Hbf,0,12,28,5,248,2,36,9,2023
4,ag RB96,Rehau,Hof Hbf,0,12,55,5,248,2,36,9,2023
5,ag RB96,Rehau,Selb Stadt,0,12,56,5,248,2,36,9,2023
6,OPB RB95,Rehau,Marktredwitz,0,13,22,5,248,2,36,9,2023
7,ag RB96,Rehau,Hof Hbf,0,13,57,5,248,2,36,9,2023
8,ag RB96,Rehau,Selb Stadt,0,13,58,5,248,2,36,9,2023
9,OPB RB95,Rehau,Hof Hbf,0,14,28,5,248,2,36,9,2023


It's time to save these dataframes to CSV files. Before that we need to make sure that the file names are valid. Some station names contain spaces, so the first step is to replace them with underscores. The files will be named as follows: "Departures/Arrivals_Station_Hour_Minute_Day_Month_Year.csv"

In [13]:
station = station.replace(" ", "_")
# Format the timestamp once so both file names share it
timestamp = time.strftime("%H_%M_%d_%m_%Y")
departures_filename = f"Departures_{station}_{timestamp}.csv"
departures.to_csv(departures_filename)
arrivals_filename = f"Arrivals_{station}_{timestamp}.csv"
arrivals.to_csv(arrivals_filename)
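
The naming scheme can be wrapped in a function so that it is easy to test. The sketch below uses a hypothetical `build_filename` helper. Beyond the notebook's space-to-underscore rule it also replaces any other character that is not a letter, digit, underscore, hyphen, dot or parenthesis. That is an extra precaution of this sketch, not something the station list is known to require.

```python
import datetime
import re


def build_filename(kind, station, scraped_at):
    """Return "<kind>_<station>_<HH_MM_dd_mm_YYYY>.csv" with a file-safe station part."""
    safe_station = re.sub(r"[^\w().-]+", "_", station)
    stamp = scraped_at.strftime("%H_%M_%d_%m_%Y")
    return f"{kind}_{safe_station}_{stamp}.csv"


scraped_at = datetime.datetime(2023, 9, 5, 11, 23)
print(build_filename("Departures", "Aachen Hauptbahnhof", scraped_at))
print(build_filename("Arrivals", "Adelsdorf (Mittelfranken)", scraped_at))
```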

Finally, we disconnect from the website.

In [14]:
driver.quit()
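
If any earlier cell raises an exception, `driver.quit()` is never reached and the browser session stays open. One way to guard against that is sketched below with the standard library's `contextlib`. `managed_driver` is a hypothetical helper, and `FakeDriver` only stands in for a real Selenium driver so that the pattern can be shown without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver):
    """Yield `driver` and always call its quit() method afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium driver; records whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(fake) as driver:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(fake.closed)
```

With Selenium the block would start with `with managed_driver(webdriver.Safari()) as driver:` and contain the scraping steps.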