# Example: Data analysis

Two kinds of insight can be extracted from the data: punctuality and speed. Instead of downloading data now, we will use previously downloaded data stored in this repository.

If any data are missing, we can download them during the analysis without saving them. We use an API key created for debugging this project, but you should create your own account and API key if you want to run your own analysis ([website](https://api.um.warszawa.pl/)).

In [1]:
API_KEY = '5fbe79ed-1f5b-4019-ab03-641443842d8b'

In [2]:
from bwaw.io.load import load_response_from_csv
from bwaw.utils.format_conversion import column_str_to_datetime
from bwaw.insights.data import get_all_of_time, get_all_of_line, get_all_of_brigade, remove_duplicates
from bwaw.insights.speed import get_speed_incidents_for_bus, get_all_incidents, get_short_incidents_summary, get_full_incidents_summary
from bwaw.insights.punctuality import get_punctuality_list_for_bus, get_punctuality_list_for_buses, get_punctuality_report

In [3]:
from pathlib import Path

In [4]:
DATA_PATH = Path('../data')

In [5]:
buses_activity = load_response_from_csv(DATA_PATH / 'active_buses_2021_02_09.csv')
coordinates = load_response_from_csv(DATA_PATH / 'coordinates_2021_02_09.csv')

## Speed insights

We can use the `speed` module to:
1. find speed incidents for a single bus,
2. find speed incidents for multiple buses,
3. generate a short, human-readable summary of speed incidents,
4. generate a longer, human-readable summary.

First, we prepare the data using the `data` and `format_conversion` modules: we remove duplicates, convert the `Time` column to datetimes, and cast the coordinates to floats. We can also restrict the data to a chosen time period.

In [6]:
buses_activity = remove_duplicates(buses_activity)
buses_activity['Time'] = column_str_to_datetime(buses_activity['Time'])
buses_activity = get_all_of_time(buses_activity, start='2021-02-09 15:45:00', end='2021-02-09 16:45:00')
buses_activity['Lon'] = buses_activity['Lon'].astype(float)
buses_activity['Lat'] = buses_activity['Lat'].astype(float)

In [7]:
buses_activity.head()

Unnamed: 0,Lines,Lon,VehicleNumber,Time,Lat,Brigade
0,311,21.074444,1000,2021-02-09 15:45:29,52.249487,2
1,213,21.092148,1001,2021-02-09 15:45:27,52.224536,2
2,213,21.18876,1002,2021-02-09 15:45:29,52.14994,1
3,196,21.176237,1005,2021-02-09 15:45:29,52.256781,1
4,130,21.009918,1007,2021-02-09 15:45:24,52.203511,1


Now we generate the insights from points 1-4.

In [8]:
example_line = get_all_of_line(buses_activity, line='213')
example_brigade = get_all_of_brigade(example_line, brigade='1')
print(get_speed_incidents_for_bus(data=example_brigade, speed_limit=50))

      Speed        Lat        Lon                    Time
0  59.78140  52.182806  21.206846 2021-02-09 16:27:10.000
1  52.12374  52.195111  21.169966 2021-02-09 16:36:28.500


For line 213, brigade 1, and a speed limit of 50 km/h, two speed incidents were found. For each incident we record when and where it happened. An incident is any speed above the limit, calculated between two consecutive points on the bus route.
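The definition of an incident can be illustrated with a minimal, standalone sketch. It is not the `bwaw` implementation: the haversine distance, the tuple layout, and reporting the midpoint time between the two points are assumptions made for illustration, and `find_speed_incidents` is a hypothetical helper name.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def find_speed_incidents(points, speed_limit):
    """Return (speed_kmh, midpoint_time) for consecutive pairs above the limit.

    `points` is a list of (datetime, lat, lon) tuples sorted by time
    (a hypothetical layout, not the library's DataFrame format).
    """
    incidents = []
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(points, points[1:]):
        hours = (t2 - t1).total_seconds() / 3600
        if hours <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = haversine_km(lat1, lon1, lat2, lon2) / hours
        if speed > speed_limit:
            # Assumption: an incident is stamped with the midpoint time.
            incidents.append((speed, t1 + (t2 - t1) / 2))
    return incidents


points = [
    (datetime(2021, 2, 9, 16, 0, 0), 52.2000, 21.0000),
    (datetime(2021, 2, 9, 16, 0, 30), 52.2010, 21.0000),  # about 111 m in 30 s
    (datetime(2021, 2, 9, 16, 1, 0), 52.2060, 21.0000),   # about 556 m in 30 s
]
incidents = find_speed_incidents(points, speed_limit=50)
```

Only the second pair of points exceeds 50 km/h here, so `incidents` holds a single entry.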

In [9]:
get_all_incidents(data=buses_activity, speed_limit=50)

Unnamed: 0,Lines,Speed,Lat,Lon,Time
0,172,52.433580,52.208009,20.976122,2021-02-09 15:45:43.000
1,172,86.458391,52.195749,21.008716,2021-02-09 16:15:38.500
2,172,51.883088,52.197099,21.018591,2021-02-09 16:19:11.000
3,733,69.479078,52.099988,20.822176,2021-02-09 16:01:40.000
4,733,69.580050,52.105487,20.835756,2021-02-09 16:02:37.500
...,...,...,...,...,...
939,707,61.292923,52.118680,20.892547,2021-02-09 15:49:02.000
940,707,58.630468,52.083617,20.958537,2021-02-09 16:04:14.500
941,L-3,53.847739,52.033646,20.859375,2021-02-09 16:15:22.500
942,L-3,67.938871,52.047116,20.864655,2021-02-09 16:17:12.500


Instead of analysing a single bus, we can compute incidents for all buses at once. In addition to the columns from the single-bus analysis, the result has a `Lines` column identifying the line on which each incident occurred.

In [10]:
print(get_short_incidents_summary(data=buses_activity, speed_limit=50)[0])

Speed limit: 50 km/h.
Total number of incidents: 943.
165/252 buses had incidents (65.48%).



To make the analysis easier, reports are generated in a human-readable form. We can see that on 2021-02-09 between 15:45 and 16:45 there were 943 speed limit incidents, and about 65% of buses had at least one incident.
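The arithmetic behind such a summary can be sketched as follows. This is an illustration only: `short_summary` is a hypothetical helper, and it assumes that "buses" are counted as distinct labels in the input lists, which the report itself does not state.

```python
def short_summary(incident_labels, all_labels, speed_limit):
    """Build a three-line summary from per-incident bus labels.

    `incident_labels` has one entry per incident; `all_labels` lists every
    bus observed in the period. Both inputs are hypothetical layouts.
    """
    total = len(set(all_labels))
    affected = len(set(incident_labels))
    share = affected / total if total else 0.0
    return (
        f"Speed limit: {speed_limit} km/h.\n"
        f"Total number of incidents: {len(incident_labels)}.\n"
        f"{affected}/{total} buses had incidents ({share:.2%})."
    )


summary = short_summary(['172', '172', '733'], ['172', '733', '213', '311'], 50)
print(summary)
```

The same formatting applied to the figures in the report, `f"{165 / 252:.2%}"`, gives `65.48%`.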

In [11]:
print(get_full_incidents_summary(data=buses_activity, speed_limit=50)[0])

Speed limit: 50 km/h.
Total number of incidents: 943.
165/252 buses had incidents (65.48%).
Top 3 buses with highest number of incidents were:
E-9    56 incidents
511    44 incidents
402    34 incidents
Top 3 places with highest number of incidents were:
(52.23, 21.07) - 23 incidents.
(52.29, 20.99) - 17 incidents.
(52.17, 21.07) - 13 incidents.



In addition to the short report, we can get more insights, such as which buses had the highest number of incidents and where incidents happen most often.

The location of incidents is computed on an 8x8 grid laid over the map of Warsaw. Incidents are counted for each grid cell, and the per-cell statistics are then reported using the cell's geographic location.
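The grid aggregation can be sketched as below. The bounding box, the clamping of out-of-range points, and the use of cell centres as the reported location are all assumptions for illustration; the actual `bwaw` code may differ.

```python
from collections import Counter

# Hypothetical bounding box roughly covering Warsaw (an assumption).
LAT_MIN, LAT_MAX = 52.05, 52.40
LON_MIN, LON_MAX = 20.80, 21.30
GRID_SIZE = 8


def cell_index(value, low, high, n=GRID_SIZE):
    """Map a coordinate to a cell index in [0, n - 1], clamping the edges."""
    idx = int((value - low) / (high - low) * n)
    return min(max(idx, 0), n - 1)


def count_incidents_per_cell(incidents):
    """Count incidents per (row, col) cell; `incidents` is a list of (lat, lon)."""
    return Counter(
        (cell_index(lat, LAT_MIN, LAT_MAX), cell_index(lon, LON_MIN, LON_MAX))
        for lat, lon in incidents
    )


def cell_center(row, col):
    """Return the (lat, lon) of a cell's centre, rounded to two decimals."""
    lat = LAT_MIN + (row + 0.5) * (LAT_MAX - LAT_MIN) / GRID_SIZE
    lon = LON_MIN + (col + 0.5) * (LON_MAX - LON_MIN) / GRID_SIZE
    return round(lat, 2), round(lon, 2)


incidents = [(52.2301, 21.0702), (52.2305, 21.0710), (52.1701, 21.0705)]
counts = count_incidents_per_cell(incidents)
top_cell, top_count = counts.most_common(1)[0]
```

Here the first two points fall into the same cell, so `top_cell` is that cell and `top_count` is 2; `cell_center(*top_cell)` turns the cell back into a coordinate pair for display.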

## Punctuality insights

Similarly to `speed`, the `punctuality` module can generate:
1. a punctuality insight for a single bus,
2. punctuality insights for multiple buses,
3. a human-readable report.

Again, before the analysis we need to prepare some additional data: the bus stop coordinates.

In [12]:
coordinates = remove_duplicates(coordinates)
coordinates['Longitude'] = coordinates['Longitude'].astype(float)
coordinates['Latitude'] = coordinates['Latitude'].astype(float)
coordinates.head()

Unnamed: 0,ID,Number,Latitude,Longitude,Destination,Validity
0,1001,1,52.248455,21.044827,al.Zieleniecka,2020-10-12 00:00:00.0
1,1001,2,52.249078,21.044443,Ząbkowska,2020-10-12 00:00:00.0
2,1001,3,52.248998,21.043983,al.Zieleniecka,2020-10-12 00:00:00.0
3,1001,3,52.248928,21.044169,al.Zieleniecka,2020-11-19 00:00:00.0
4,1001,4,52.249905,21.041726,Ząbkowska,2020-10-12 00:00:00.0


There are two ways to compare the loaded data with timetables: downloading timetables online, or using previously downloaded timetables.

In [13]:
example_line = get_all_of_line(buses_activity, '109')
print(get_punctuality_list_for_bus(example_line, coordinates, path=DATA_PATH, proximity=3))
print(get_punctuality_list_for_bus(example_line, coordinates, api_key=API_KEY, proximity=3, verbosity=True))

  0%|          | 0/440 [00:00<?, ?it/s]

[False, True, True, True, True, False, True, True, True, True, False, False, False, True, False, True, True, True, False, False]


 99%|█████████▉| 436/440 [00:17<00:00, 25.61it/s]

[False, True, True, True, True, False, True, True, True, True, False, False, False, True, False, True, True, True, False, False]


Punctuality is stored in a list. For each match between a bus position and a bus stop location, we determine whether the bus arrived before or after the time given in the timetable. If we download timetables online, we can set the `verbosity` parameter to `True` to see the download progress. For each bus, timetables are downloaded for the whole route within the given time period.
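The matching idea can be sketched in plain Python. Several things here are assumptions, not facts about `bwaw`: that `proximity` is a distance in metres, that `True` means the bus was seen at the stop no later than scheduled, and the tuple layouts of `positions` and `stops`. The function `punctuality_list` is a hypothetical helper.

```python
from datetime import datetime
from math import cos, hypot, radians

EARTH_RADIUS_M = 6_371_000


def distance_m(lat1, lon1, lat2, lon2):
    """Approximate distance in metres (equirectangular; fine for short ranges)."""
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return EARTH_RADIUS_M * hypot(x, y)


def punctuality_list(positions, stops, proximity):
    """Return one boolean per bus/stop match: True if not later than scheduled.

    `positions`: list of (time, lat, lon); `stops`: list of
    (lat, lon, scheduled_time). Both layouts are hypothetical.
    """
    result = []
    for seen_at, lat, lon in positions:
        for stop_lat, stop_lon, scheduled in stops:
            # A position counts as a match only when it is close enough to the stop.
            if distance_m(lat, lon, stop_lat, stop_lon) <= proximity:
                result.append(seen_at <= scheduled)
    return result


stops = [(52.24850, 21.04480, datetime(2021, 2, 9, 16, 0, 0))]
positions = [
    (datetime(2021, 2, 9, 15, 59, 30), 52.24851, 21.04481),  # near stop, early
    (datetime(2021, 2, 9, 16, 2, 0), 52.24850, 21.04481),    # near stop, late
    (datetime(2021, 2, 9, 16, 5, 0), 52.26000, 21.05000),    # far from the stop
]
matches = punctuality_list(positions, stops, proximity=3)
```

Only the first two positions lie within the proximity radius, so `matches` has two entries, one per bus/stop match.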

Similarly, we can generate such a list for all buses in the data using `get_punctuality_list_for_buses`. Instead, we will prepare a punctuality report, which uses this function under the hood.

We can choose a list of lines that interest us and compare their punctuality.

In [14]:
chosen_lines = ['109', '183']
lines = buses_activity[buses_activity['Lines'].isin(chosen_lines)]
print(get_punctuality_report(lines, coordinates, api_key=API_KEY, proximity=5, verbosity=False))

100%|██████████| 440/440 [00:20<00:00, 21.25it/s]


Percentage of punctuality incidents:
- 109 line: 46.67% incidents.
- 183 line: 33.33% incidents.



We can see that line 109 has a higher percentage of punctuality incidents than line 183 (46.67% vs. 33.33%).