# 🚌 Data Retrieval and Modelling for Berlin's Trams and Buses

In this notebook, I invite you to witness a rather complicated method of gathering data :D
There is a public API for public transport in Berlin. However, it does not seem possible to simply query all stops in the underlying dataset. When one queries a specific stop, the API returns not one, but hundreds of stops somewhat related to the query. I therefore decided to iterate through the alphabet and pass each single letter to the API in order to get every stop there is... At the time of writing, this approach appears to have worked!
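
The letter-by-letter idea can be sketched as follows. This is only an illustration of the URLs that end up being requested; the helper name `build_query_urls` is hypothetical, and the query parameters are the ones used in the setup cells of this notebook.

```python
import string
from urllib.parse import urlencode

BASE_URL = "https://v6.bvg.transport.rest"  # same base URL the notebook uses


def build_query_urls():
    """Return one /locations URL per lowercase letter (hypothetical helper)."""
    urls = []
    for letter in string.ascii_lowercase:
        params = {
            "results": 10000,    # ask for as many results as possible
            "poi": "false",      # exclude points of interest
            "address": "false",  # exclude street addresses
            "query": letter,     # the single-letter search term
        }
        urls.append(f"{BASE_URL}/locations?{urlencode(params)}")
    return urls


urls = build_query_urls()
print(len(urls))
print(urls[0])
```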

## 🛠️ Setup

### Basics

In [9]:
# import necessary libraries

import requests
import pandas as pd
import string

In [3]:
# define the API URLs

base_url = 'https://v6.bvg.transport.rest'

url_filter = '/locations?results=10000&poi=false&address=false&query='

In [13]:
# create the list of query strings (one per letter)

query_list = list(string.ascii_lowercase)

### Creating functions

In [4]:
# function to query the API

def query_api(query_key):
    
    url = f"{base_url}{url_filter}{query_key}" # build the full API URL

    response = requests.get(url, timeout=30) # send the request; give up after 30 seconds

    if response.status_code == 200: # check for a valid result
        return response.json()

    else:
        print(f"Retrieval failed. Status code: {response.status_code}")
        return [] # return an empty list so the downstream functions still work

The result is a list of nested dictionaries. Not all of them are stops; other kinds of locations are included too.
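
To get a feeling for the shape of such a result, here is a small, hand-written sample. It is an assumption for illustration only: it shows just the fields this notebook reads (`type`, `id`, `name`, `location`, `products`), the real response carries more keys, and the second entry is entirely made up. Counting the `type` values is a quick way to see how mixed a result is.

```python
from collections import Counter

# Abridged, hand-written sample; not an actual API response
sample_result = [
    {
        "type": "stop",
        "id": "900170515",
        "name": "Adersleber Weg (Berlin)",
        "location": {"latitude": 52.537896, "longitude": 13.560318},
        "products": {"tram": True, "bus": True},
    },
    {
        # hypothetical non-stop entry
        "type": "location",
        "name": "Some other place",
    },
]

# Count how many entries of each type the result contains
type_counts = Counter(entry.get("type") for entry in sample_result)
print(type_counts)
```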

In [6]:
# function to filter for bus and tram stops

def filter_for_stops_only(query_result):

    list_with_only_bus_tram_stops = [] # initialize empty list

    for entry in query_result: # iterate through list

        if entry.get('type') != 'stop': # keep entries of type 'stop' only
            continue

        products = entry.get('products', {}) # empty dict if the key is missing

        if products.get('tram') or products.get('bus'): # served by tram or bus
            list_with_only_bus_tram_stops.append(entry) # append the dictionary to new list

    return list_with_only_bus_tram_stops

The list now contains only tram and bus stops. However, each stop still carries nested dicts with irrelevant data.

In [20]:
# function to extract relevant data and create new list

def extract_relevant_stop_data(filter_result):

    list_with_extractions = []

    for stop in filter_result:

        extraction = {
            'id': stop['id'],
            'name': stop['name'],
            'lat': stop['location']['latitude'],
            'lon': stop['location']['longitude'],
            'tram': stop['products']['tram'],
            'bus': stop['products']['bus'],
        }

        list_with_extractions.append(extraction)

    return list_with_extractions

The resulting list has one dictionary containing only relevant data for each stop. These dicts can be added to a dataset, each one as an individual row.
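
As a side note, pandas can also turn such a list of flat dicts into rows in one step. The following is a minimal sketch with two hand-written sample dicts (values taken from stops that appear in the output further down); it is an alternative to adding rows one by one, not the approach this notebook uses.

```python
import pandas as pd

# Two hand-written extraction dicts for illustration
extractions = [
    {"id": "900170515", "name": "Adersleber Weg (Berlin)",
     "lat": 52.537896, "lon": 13.560318, "tram": True, "bus": True},
    {"id": "900140005", "name": "Albertinenstr. (Berlin)",
     "lat": 52.549788, "lon": 13.457778, "tram": True, "bus": True},
]

columns = ["id", "name", "lat", "lon", "tram", "bus"]

# Each dict becomes one row; `columns` fixes the column order
stops_df = pd.DataFrame(extractions, columns=columns)
print(stops_df.shape)
```

Building the frame in one call avoids growing it row by row, which tends to get slow for thousands of rows.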

In [17]:
# function to append the extracted rows to the dataframe

def append_to_dataset(dataframe, extraction_result):

    if isinstance(extraction_result, list):

        for row in extraction_result:

            dataframe.loc[len(dataframe)] = row # add each dict as a new row

    else: 
        print('Something went wrong!')

In [14]:
def main_function():

    properties = ['id', 'name', 'lat', 'lon', 'tram', 'bus']

    bus_tram_df = pd.DataFrame(columns=properties)

    for char in query_list:

        query_result = query_api(char)

        filter_result = filter_for_stops_only(query_result)

        extraction_result = extract_relevant_stop_data(filter_result)

        append_to_dataset(bus_tram_df, extraction_result)

    return bus_tram_df

## 📥 Retrieve data and create initial dataset

In [21]:
# call main function

bus_tram_dataset = main_function()

In [27]:
bus_tram_dataset.head()

| | id | name | lat | lon | tram | bus |
|---|---|---|---|---|---|---|
| 0 | 900260009 | Flughafen BER | 52.36461 | 13.50987 | False | True |
| 1 | 900086106 | Auguste-Viktoria-A./Humboldtstr. (Berlin) | 52.568845 | 13.329852 | False | True |
| 2 | 900170515 | Adersleber Weg (Berlin) | 52.537896 | 13.560318 | True | True |
| 3 | 900151501 | Ahrenshooper Str. (Berlin) | 52.566212 | 13.501888 | True | True |
| 4 | 900140005 | Albertinenstr. (Berlin) | 52.549788 | 13.457778 | True | True |


## 🧽 Clean the data

### Filter for stops that are in Berlin

Apparently, a few stops outside the Berlin area slipped in. Stops in Berlin are typically marked with '(Berlin)'.
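
One detail worth keeping in mind when matching this marker: `str.contains` treats its pattern as a regular expression by default, so the parentheses in '(Berlin)' form a group and the pattern matches any name that contains "Berlin". The sketch below compares both behaviours on three sample names; the third name is hypothetical and only there to show the difference.

```python
import warnings

import pandas as pd

names = pd.Series([
    "Albertinenstr. (Berlin)",
    "Flughafen BER",
    "Berliner Str. (Potsdam)",  # hypothetical name, for illustration only
])

with warnings.catch_warnings():
    # pandas warns that the pattern contains a match group
    warnings.simplefilter("ignore", UserWarning)
    as_regex = names.str.contains("(Berlin)")

# regex=False matches the literal text "(Berlin)", parentheses included
as_literal = names.str.contains("(Berlin)", regex=False)

print(as_regex.tolist())    # [True, False, True]
print(as_literal.tolist())  # [True, False, False]
```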

In [None]:
# create boolean series that marks stops in Berlin with True

stops_inside_berlin = bus_tram_dataset['name'].str.contains('(Berlin)', regex=False) # match the literal text, not a regex group


In [None]:
# use bool series to subset dataset

bus_tram_dataset_filtered = bus_tram_dataset[stops_inside_berlin]

In [31]:
bus_tram_dataset_filtered

| | id | name | lat | lon | tram | bus |
|---|---|---|---|---|---|---|
| 1 | 900086106 | Auguste-Viktoria-A./Humboldtstr. (Berlin) | 52.568845 | 13.329852 | False | True |
| 2 | 900170515 | Adersleber Weg (Berlin) | 52.537896 | 13.560318 | True | True |
| 3 | 900151501 | Ahrenshooper Str. (Berlin) | 52.566212 | 13.501888 | True | True |
| 4 | 900140005 | Albertinenstr. (Berlin) | 52.549788 | 13.457778 | True | True |
| 5 | 900161517 | Alfred-Kowalke-Str. (Berlin) | 52.505714 | 13.519704 | True | True |
| ... | ... | ... | ... | ... | ... | ... |
| 11455 | 900093207 | Amandastr. (Berlin) | 52.626134 | 13.319470 | False | True |
| 11456 | 900085250 | Amendestr. (Berlin) | 52.571515 | 13.373639 | False | True |
| 11462 | 900081271 | Alt-Buckow/Dorfteich (Berlin) | 52.422537 | 13.432833 | False | True |
| 11464 | 900175509 | Am Birkenwerder (Berlin) | 52.485102 | 13.570143 | False | True |


### Check for missing values

In [33]:
bus_tram_dataset_filtered.isnull().any().any()

np.False_

### Check for duplicates

In [None]:
bus_tram_dataset_filtered.duplicated().sum() # ufff

np.int64(5077)

In [48]:
# create df without duplicates

bus_tram_dataset_filtered_no_dups = bus_tram_dataset_filtered.drop_duplicates(keep='first').reset_index(drop=True) # drop=True discards the old index instead of keeping it as a column
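
The many duplicates are presumably a side effect of the letter-by-letter querying, since the same stop can come back for several letters. If one wanted to be stricter than comparing whole rows, the stop `id` alone could serve as the key. A minimal sketch on hand-written sample data, assuming each `id` identifies exactly one stop:

```python
import pandas as pd

# Hand-written sample with one repeated stop
raw = pd.DataFrame({
    "id": ["900170515", "900140005", "900170515"],
    "name": [
        "Adersleber Weg (Berlin)",
        "Albertinenstr. (Berlin)",
        "Adersleber Weg (Berlin)",
    ],
})

# Keep the first occurrence of every id and renumber the rows from 0
deduped = raw.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
print(len(raw), "->", len(deduped))
```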

In [46]:
private_url = '/Users/bastianlenkers/Documents/Masterschool/Webeet/Sprint1_Data_Collection/'

file_name = 'bus_tram_preliminary.csv'

In [49]:
bus_tram_dataset_filtered_no_dups.to_csv(f"{private_url}{file_name}")