Author: Karol Szwed\
Date: 17.02.2023\
Course: Programming in Python\
Project: Final graded assignment - data analysis using the API for ZTM in Warsaw

# Project goal

Bus position data is collected over a given time window from the data available at https://api.um.warszawa.pl/#. Two one-hour periods were analysed: 13:30 to 14:30 on Thursday and 16:30 to 17:30 on Friday.

The collected data was used to analyse:
 - The average speed of buses (e.g. how many buses exceeded 50 km/h?)
 - Whether there were places where the average bus speed was particularly high
 - The punctuality of buses in the observed period (comparing actual arrival times at stops with the timetable).
 
The results of the analysis were then visualised and described.

# Downloading GPS data for buses

First, GPS data for the buses was downloaded, including:
- bus line number
- latitude and longitude
- time of measurement
- brigade number
- vehicle number

In [None]:
# Because we are working with large objects that are collected over long periods of time, let's define two helper
# functions that will save those objects as binary files, able to be quickly loaded later.

In [20]:
# All libraries used across the whole project
import warsaw_data_api
import datetime
import time
import json
import pandas as pd
import csv
import pickle
from collections import defaultdict
import geopy.distance

"""
 * INPUT:
    - "pickle_filename" - string with the name of the pickle file where the object will be saved;
    - "obj_to_save" - the object that will be saved in the pickle file.
 * FUNCTION: Save an object to a pickle file.
 * OUTPUT: None; function has side effect of creating a pickle file with the saved object.
"""
def save_obj_as_pickle_file(pickle_filename, obj_to_save):
    with open(pickle_filename, 'wb') as f:
        pickle.dump(obj_to_save, f)
        

"""
 * INPUT:
    - "pickle_filename" - string with the name of the pickle file from which the object will be loaded.
 * FUNCTION: Load an object from a pickle file.
 * OUTPUT: The object loaded from the pickle file.
"""
def load_obj_from_pickle_file(pickle_filename):
    with open(pickle_filename, 'rb') as f:
        return pickle.load(f)
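
A quick round trip is an easy way to check that the two helpers behave as expected. The sketch below repeats both functions so that it runs on its own; the sample object and the temporary file name are made up for illustration.

```python
import os
import pickle
import tempfile


def save_obj_as_pickle_file(pickle_filename, obj_to_save):
    with open(pickle_filename, 'wb') as f:
        pickle.dump(obj_to_save, f)


def load_obj_from_pickle_file(pickle_filename):
    with open(pickle_filename, 'rb') as f:
        return pickle.load(f)


# Hypothetical sample object shaped like "line -> set of stops"
sample_obj = {"example_line": {"stop_a", "stop_b"}}

# Write to a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pickle")
    save_obj_as_pickle_file(path, sample_obj)
    restored_obj = load_obj_from_pickle_file(path)

# The restored object is an equal copy, not the same object
print(restored_obj == sample_obj, restored_obj is sample_obj)
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.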

The "warsaw_data_api" library, available via pip, provides three functions that were used in the project:
- ztm.get_buses_location() -> returns GPS data for all buses in Warsaw
- ztm.get_lines_for_bus_stop_id(stop_id, stop_pole) -> returns the list of lines that stop at the given bus stop
- ztm.get_bus_stop_schedule_by_name("Banacha-Szpital", "01", "504") -> returns the timetable for the given stop.

The library was chosen because the Warsaw Open Data API documentation does not cover all the API queries that can be made, the request IDs are not always up to date, and the response to a given request is not always described.\
To avoid a long series of trial and error, I decided to use this simple open-source library.

In [None]:
"""
 * INPUT:
     - "filename" - string with name of the csv file where data will be written to;
     - "_MY_API_KEY" - string with my API key needed for API calls;
     - "timespan" - integer representing amount of seconds for how long the data will be collected;
     - "time_delta_for_buses" - maximum age (in seconds) of the bus GPS data to be considered valid;
     - "update_interval" - time (in seconds) between each data import from the Warsaw Open Data API.
 * FUNCTION: Gather live bus GPS data using API calls and save it to csv file (with header).
 * OUTPUT: None; function has side effect of creating csv file with bus GPS data.
"""
def import_bus_gps_data(filename, _MY_API_KEY, timespan, time_delta_for_buses, update_interval):
    ztm = warsaw_data_api.ztm(apikey=_MY_API_KEY)  # Pass API key
    start_time = time.time()

    print("Starting the data import at", datetime.datetime.now())
    print("Expected to end the data import at", datetime.datetime.now() + datetime.timedelta(seconds=timespan))

    with open(filename, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["lines", "latitude", "longitude", "time", "brigade", "vehicle_number"])  # write header

    while True:
        try:
            buses_all = ztm.get_buses_location()
            with open(filename, 'a', newline='') as file:
                writer = csv.writer(file)
                for bus in buses_all:
                    now = datetime.datetime.now()
                    time_diff = now - bus.time

                    # We want to gather data that is current, so we only collect location data
                    # that is less than "time_delta_for_buses" seconds old.
                    # total_seconds() is used because .seconds ignores the days component.
                    if time_diff.total_seconds() < time_delta_for_buses:
                        writer.writerow([bus.lines, bus.location.latitude, bus.location.longitude,
                                         bus.time.time(), bus.brigade, bus.vehicle_number])
        except Exception as error:  # If an error occurs, report it and try again on the next update
            print("Data import failed, retrying:", error)

        # Sleep and check the timespan outside the try block, so that repeated API
        # errors can neither turn into a busy loop nor keep the function running forever
        time.sleep(update_interval)  # wait "update_interval" seconds between updates
        if time.time() - start_time > timespan:  # If the timespan has passed, break the loop
            break
            

# Example of usage:
filename = "Buses_location_afternoon.csv" # Where GPS data will be saved
_MY_API_KEY = "2620c061-1099-44d9-baab-fdc3a772ab29"  # My API key

timespan = 3600  # 60 minutes
time_delta_for_buses = 60  # 1 minute
update_interval = 60  # 1 minute

import_bus_gps_data(filename, _MY_API_KEY, timespan, time_delta_for_buses, update_interval)
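
The freshness filter in `import_bus_gps_data` compares the age of each GPS reading with `time_delta_for_buses`. One detail worth keeping in mind is that `datetime.timedelta.seconds` holds only the seconds component of the difference, while `total_seconds()` also accounts for whole days. The sketch below illustrates the difference; the helper name `is_fresh` and the timestamps are made up for illustration.

```python
import datetime


def is_fresh(measured_at, now, max_age_seconds):
    """Return True if the reading is less than max_age_seconds old (hypothetical helper)."""
    age = now - measured_at
    return age.total_seconds() < max_age_seconds


# Arbitrary reference time, chosen only for the example
now = datetime.datetime(2023, 2, 16, 13, 30, 0)
fresh_reading = now - datetime.timedelta(seconds=30)
stale_reading = now - datetime.timedelta(days=1, seconds=30)

stale_age = now - stale_reading
print(stale_age.seconds)          # seconds component only
print(stale_age.total_seconds())  # full age in seconds

print(is_fresh(fresh_reading, now, 60))
print(is_fresh(stale_reading, now, 60))
```

A reading that is one day and 30 seconds old has a `.seconds` value of 30, so a filter based on `.seconds` alone would treat it as current.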

# Downloading bus stop data

TODO: Describe

In [1]:
# Use a context manager so the file is closed once the JSON has been read
with open("bus_stops.json", "r") as file_json:
    json_dict = json.load(file_json)

# Extract the list of dictionaries
dict_list = json_dict['result']

# Convert each dictionary in the list
new_dict_list = []
for d in dict_list:
    new_dict = {item['key']: item['value'] for item in d['values']}
    new_dict_list.append(new_dict)

# Convert the list of new dictionaries into a DataFrame
bus_stops_table = pd.DataFrame(new_dict_list)

# Drop the last column
last_column = bus_stops_table.columns[-1]
bus_stops_table = bus_stops_table.drop(last_column, axis=1)

# Rename the columns
bus_stops_table.columns = ["stop_id", "stop_pole", "stop_name",
                           "street_id", "latitude", "longitude", "direction"]

print(bus_stops_table.head())
print(bus_stops_table.shape)

  stop_id stop_pole stop_name street_id   latitude  longitude       direction
0    1001        01  Kijowska      2201  52.248455  21.044827  al.Zieleniecka
1    1001        02  Kijowska      2201  52.249078  21.044443       Ząbkowska
2    1001        03  Kijowska      2201  52.248928  21.044169  al.Zieleniecka
3    1001        04  Kijowska      2201  52.249969  21.041588       Ząbkowska
4    1001        05  Kijowska      1203  52.250319  21.043861  al.Zieleniecka
(7058, 7)
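
The flattening step in the cell above can be tried on a small in-memory sample. This is a minimal sketch: the overall shape (`result` -> `values` -> `key`/`value` pairs) follows the code above, but the key names `zespol`, `slupek` and `nazwa_zespolu` are assumptions about the raw file, and the helper name `flatten_record` is hypothetical. The values are taken from the first two rows of the printed table.

```python
# Sample shaped like the parsed bus_stops.json; key names are assumed
sample_response = {
    "result": [
        {"values": [
            {"key": "zespol", "value": "1001"},
            {"key": "slupek", "value": "01"},
            {"key": "nazwa_zespolu", "value": "Kijowska"},
        ]},
        {"values": [
            {"key": "zespol", "value": "1001"},
            {"key": "slupek", "value": "02"},
            {"key": "nazwa_zespolu", "value": "Kijowska"},
        ]},
    ]
}


def flatten_record(record):
    """Turn a list of {'key': ..., 'value': ...} pairs into one flat dictionary."""
    return {item["key"]: item["value"] for item in record["values"]}


flat_records = [flatten_record(record) for record in sample_response["result"]]
for flat_record in flat_records:
    print(flat_record)
```

A list of flat dictionaries like `flat_records` is what gets passed to `pd.DataFrame` in the notebook, which yields one row per stop pole and one column per key.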


## Building a dictionary with route information for each line

TODO: description
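
The cell below maps every bus line to the set of stops it serves. As a rough sketch of the same grouping without any API calls, the example here uses `collections.defaultdict(set)` on a few hard-coded pairs. The stop tuples are shortened to three fields, the line lists are copied from the output printed further down, and the function name `group_stops_by_line` is hypothetical.

```python
from collections import defaultdict

# (stop_info, lines) pairs; in the notebook the lines come from
# ztm.get_lines_for_bus_stop_id(stop_id, stop_pole)
stops_with_lines = [
    (("1129", "01", "Bruszewska"), ["326", "705", "735", "736"]),
    (("1296", "01", "Mochtyńska"), ["326", "N14"]),
    (("1209", "02", "Powstańców"), ["145"]),
]


def group_stops_by_line(pairs):
    """Invert 'stop -> lines' into 'line -> set of stops'."""
    line_stops = defaultdict(set)
    for stop_info, lines in pairs:
        for line in lines:
            line_stops[line].add(stop_info)
    return line_stops


sample_line_stops = group_stops_by_line(stops_with_lines)
print(sorted(sample_line_stops["326"]))
print(sorted(sample_line_stops))
```

Using a set per line means a stop is stored once even if it is seen repeatedly, at the cost of losing the order of stops along the route.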

In [3]:
_MY_API_KEY = "2620c061-1099-44d9-baab-fdc3a772ab29"  # my api key
ztm = warsaw_data_api.ztm(apikey=_MY_API_KEY)  # pass api key

bus_line_stops = {}
every_200th = 0 # For printing every 200th entry
for index, row in bus_stops_table.iterrows():
    every_200th += 1
    stop_id = row['stop_id']
    stop_pole = row['stop_pole']
    stop_info = tuple(row.values)
    bus_lines = ztm.get_lines_for_bus_stop_id(stop_id, stop_pole)

    if every_200th == 200:
        print(stop_info)
        print(bus_lines)
        every_200th = 0

    for bus_line in bus_lines:
        if bus_line not in bus_line_stops:
            bus_line_stops[bus_line] = set()
        bus_line_stops[bus_line].add(stop_info)

print(bus_line_stops['504'])

('1055', '01', 'Gorzykowska', '1909', '52.268593', '21.060081', 'Piotra Skargi')
['120', '160', '190', 'N02', 'N11', 'N61']
('1129', '01', 'Bruszewska', '1839', '52.327496', '21.028882', 'Zakłady Zbożowe')
['326', '705', '735', '736']
('1209', '02', 'Powstańców', '1757', '52.277575', '21.115400', 'Czapelska')
['145']
('1296', '01', 'Mochtyńska', '2960', '52.357728', '21.035482', 'Płochocińska')
['326', 'N14']
('1407', '02', 'Wenecka', '0137', '52.356885', '21.133208', 'Struga')
['140', '738', 'L40', 'L45', 'N61']
('1530', '01', 'Januszewicza', '1799', '52.314983', '21.178599', 'Szymańskiego')
['L27']
('1749', '03', 'Piłsudskiego-Kościół', '1757', '52.350916', '21.230183', 'Lipowa')
['L36']
('1864', '01', 'Andersena', '0926', '52.282917', '21.139519', 'Powstańców')
['199']
('2006', '03', 'Międzyborska', '0802', '52.245864', '21.075239', 'Praga-Płd.-Ratusz')
['3', '6', '9', '22', '24', '26']
('2069', '01', 'Pocisk', '1515', '52.252753', '21.151553', 'Marsa-Las')
['183', 'N21']
('2143', '

In [13]:
def calculate_distance(coords_1, coords_2):
    return geopy.distance.geodesic(coords_1, coords_2).meters  # convert to meters


def calculate_time_diff(time_1, time_2):
    time_format = "%H:%M:%S"
    t1 = datetime.datetime.strptime(time_1, time_format)
    t2 = datetime.datetime.strptime(time_2, time_format)
    return (t2 - t1).seconds  # keep as seconds
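
To show how a distance and a time difference combine into a speed that can be compared with the 50 km/h limit, here is a minimal standalone sketch. It swaps `geopy.distance.geodesic` for the haversine formula so that it needs only the standard library, which gives slightly different distances than the notebook's helper. The function names are hypothetical, the two points are stop coordinates from the table printed earlier used as stand-in GPS readings, and the timestamps are made up.

```python
import datetime
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(coords_1, coords_2):
    """Great-circle distance in metres between two (latitude, longitude) pairs."""
    lat1, lon1 = map(math.radians, coords_1)
    lat2, lon2 = map(math.radians, coords_2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def seconds_between(time_1, time_2, time_format="%H:%M:%S"):
    """Signed difference in seconds between two time strings."""
    t1 = datetime.datetime.strptime(time_1, time_format)
    t2 = datetime.datetime.strptime(time_2, time_format)
    return (t2 - t1).total_seconds()


def speed_kmh(coords_1, time_1, coords_2, time_2):
    """Average speed in km/h, or None if the timestamps do not advance."""
    elapsed = seconds_between(time_1, time_2)
    if elapsed <= 0:
        return None  # duplicate or out-of-order readings: speed is undefined
    return haversine_m(coords_1, coords_2) / elapsed * 3.6  # m/s -> km/h


point_a = (52.248455, 21.044827)
point_b = (52.249969, 21.041588)
speed = speed_kmh(point_a, "13:31:00", point_b, "13:32:00")
print(round(speed, 1), "km/h, over the limit:", speed > 50)
```

Returning `None` for non-advancing timestamps is one possible design choice; it keeps duplicated readings from causing a division by zero.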

In [27]:
def get_vehicle_data(filename, vehicle_nr):
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        rows = list(reader)
        bus_rows = [row for row in rows if row[-1] == vehicle_nr]
        vehicle_rows = defaultdict(list)
        for row in bus_rows:
            vehicle_rows[row[-1]].append(row)
        return vehicle_rows


def get_list_of_vehicle_numbers(filename, bus_line):
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        rows = list(reader)
        bus_rows = [row for row in rows if row[0] == bus_line]
        vehicle_numbers = set(row[-1] for row in bus_rows)
        return list(vehicle_numbers)


def calculate_avg_speeds(filename, vehicle):
    vehicle_rows = get_vehicle_data(filename, vehicle)
    rows = vehicle_rows[vehicle]
    if len(rows) < 2:
        return "Not enough GPS points"
    speeds = []
    for i in range(len(rows) - 1):
        coords_1 = (float(rows[i][1]), float(rows[i][2]))
        coords_2 = (float(rows[i + 1][1]), float(rows[i + 1][2]))
        distance = calculate_distance(coords_1, coords_2)
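        time_diff = calculate_time_diff(rows[i][3], rows[i + 1][3])
        if time_diff == 0:  # identical timestamps (e.g. a duplicated reading): skip to avoid division by zero
            continue
        avg_speed = distance / time_diff  # m/s
        speeds.append((avg_speed, rows[i][3], rows[i + 1][3]))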
    return speeds


def get_avg_speeds_for_all_vehicles(filename, bus_line):
    vehicle_numbers = get_list_of_vehicle_numbers(filename, bus_line)
    avg_speeds = {}
    for vehicle in vehicle_numbers:
        avg_speeds[vehicle] = calculate_avg_speeds(filename, vehicle)
    return avg_speeds


_BUS_GPS_FILENAME = "Buses_location_afternoon.csv"
avg_speed_line = get_avg_speeds_for_all_vehicles(_BUS_GPS_FILENAME, '504')
print(avg_speed_line)

#print(get_list_of_vehicle_numbers(_BUS_GPS_FILENAME, '504'))
#print(get_vehicle_data(_BUS_GPS_FILENAME, '8307'))

#print(calculate_avg_speeds(_BUS_GPS_FILENAME, '8818'))

{'8346': [(0.0, '13:31:03', '13:32:22'), (0.19068514133255204, '13:32:22', '13:33:17'), (0.1872800495230422, '13:33:17', '13:34:13'), (0.0, '13:34:13', '13:35:17'), (0.0, '13:35:17', '13:36:20'), (0.0, '13:36:20', '13:37:23'), (0.03375617844018743, '13:37:23', '13:38:26'), (0.46845361776386363, '13:38:26', '13:39:28'), (0.0, '13:39:28', '13:40:32'), (0.0, '13:40:32', '13:41:20'), (0.0, '13:41:20', '13:42:22'), (0.0, '13:42:22', '13:43:26'), (0.0, '13:43:26', '13:44:29'), (0.0, '13:44:29', '13:45:32'), (0.0, '13:45:32', '13:46:53'), (0.0, '13:46:53', '13:48:14'), (0.0, '13:48:14', '13:49:01'), (0.0, '13:49:01', '13:50:18'), (0.0, '13:50:18', '13:51:04'), (2.5746711245245013, '13:51:04', '13:52:06'), (3.3576591183335087, '13:52:06', '13:52:58'), (5.071592735393419, '13:52:58', '13:54:30'), (4.557416103281722, '13:54:30', '13:55:16'), (7.468715387166985, '13:55:16', '13:56:24'), (5.214644464626516, '13:56:24', '13:57:25'), (8.061532389117282, '13:57:25', '13:58:27'), (3.181924665339443, '