I use the FoCS environment with Python 3.11.6.

In [2]:
# Import useful libraries
import os
import json
import re
import pandas as pd
import numpy as np
from collections import defaultdict

In [3]:
# Configuration file
with open('config.json', 'r') as f:
    config = json.load(f)

In [4]:
# Set the display options to show numbers without scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)

You have to work on the [ZTBus: A Large Dataset of Time-Resolved City Bus Driving Missions](https://www.research-collection.ethz.ch/handle/20.500.11850/626723) repository.

It contains:
*  [metaData.csv](https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/626723/metaData.csv?sequence=1&isAllowed=y), *trips* for short
*  several other files containing detailed data on some bus parameters, whose names are listed in the *trips* file. Those files can be downloaded as a [zip file](https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/626723/ZTBus_compressed.zip?sequence=3&isAllowed=y). Let us call those datasets the *details* datasets.

# Import

**ZT_bus Folder Structure**: The 'ZT_bus' folder is the main directory containing data related to driving missions.
Inside this folder, you'll find information on 1409 driving missions.

**Metadata Information**: Additionally, there is a 'metadata' folder that stores metadata related to these missions.
The metadata includes information for all 1409 missions.


## Metadata csv

In [10]:
metadata = pd.read_csv(config['path_metadata'])

In [11]:
metadata.head()

Unnamed: 0,name,busNumber,startTime_iso,startTime_unix,endTime_iso,endTime_unix,drivenDistance,busRoute,energyConsumption,itcs_numberOfPassengers_mean,itcs_numberOfPassengers_min,itcs_numberOfPassengers_max,status_gridIsAvailable_mean,temperature_ambient_mean,temperature_ambient_min,temperature_ambient_max
0,B183_2019-04-30_03-18-56_2019-04-30_08-44-20,183,2019-04-30T03:18:56Z,1556594336,2019-04-30T08:44:20Z,1556613860,77213.87,-,478585200.0,5.539,0,20,0.741,282.378,281.15,293.15
1,B183_2019-04-30_13-22-07_2019-04-30_17-54-02,183,2019-04-30T13:22:07Z,1556630527,2019-04-30T17:54:02Z,1556646842,59029.6,31,402258500.0,33.115,4,74,0.855,287.544,285.15,293.15
2,B183_2019-05-01_05-58-51_2019-05-01_22-32-30,183,2019-05-01T05:58:51Z,1556690331,2019-05-01T22:32:30Z,1556749950,240900.4,33,1445733000.0,19.689,0,55,0.778,288.749,280.15,294.15
3,B183_2019-05-03_02-50-21_2019-05-03_05-53-20,183,2019-05-03T02:50:21Z,1556851821,2019-05-03T05:53:20Z,1556862800,42565.48,-,281986700.0,1.685,0,8,0.767,282.413,281.15,292.15
4,B183_2019-05-03_15-41-57_2019-05-03_23-06-24,183,2019-05-03T15:41:57Z,1556898117,2019-05-03T23:06:24Z,1556924784,125277.2,72,620725800.0,23.754,1,67,0.907,284.733,282.15,287.15


In [5]:
metadata.shape

(1409, 16)

In [26]:
# Check metadata dtypes
metadata.dtypes

name                             object
busNumber                         int64
startTime_iso                    object
startTime_unix                    int64
endTime_iso                      object
endTime_unix                      int64
drivenDistance                  float64
busRoute                         object
energyConsumption               float64
itcs_numberOfPassengers_mean    float64
itcs_numberOfPassengers_min       int64
itcs_numberOfPassengers_max       int64
status_gridIsAvailable_mean     float64
temperature_ambient_mean        float64
temperature_ambient_min         float64
temperature_ambient_max         float64
dtype: object

`startTime_iso` is stored as `object` (string) but represents a timestamp, so it should be a `datetime`. Before using it in any analysis, either convert it or use `startTime_unix` instead. The same applies to `endTime_iso`. The remaining data types match the nature of the variables they represent.
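A minimal sketch of the conversion, run on a hypothetical two-row `sample` that copies the first two timestamps shown in `metadata.head()`. The column name `startTime_dt` is an assumption introduced here for illustration; it is not part of the dataset.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the metadata columns shown above
sample = pd.DataFrame({
    "startTime_iso": ["2019-04-30T03:18:56Z", "2019-04-30T13:22:07Z"],
    "startTime_unix": [1556594336, 1556630527],
})

# Parse the ISO 8601 strings; utc=True yields timezone-aware UTC timestamps
sample["startTime_dt"] = pd.to_datetime(sample["startTime_iso"], utc=True)

# The Unix timestamps (in seconds) should decode to the same instants
from_unix = pd.to_datetime(sample["startTime_unix"], unit="s", utc=True)
same_instants = bool((sample["startTime_dt"] == from_unix).all())
print(same_instants)
```

Comparing the two columns this way is a cheap sanity check that the ISO and Unix representations agree before one of them is dropped.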

TODO: Check if other data types should be different.

### Nan values check

In [29]:
metadata.isna().sum()

name                            0
busNumber                       0
startTime_iso                   0
startTime_unix                  0
endTime_iso                     0
endTime_unix                    0
drivenDistance                  0
busRoute                        0
energyConsumption               0
itcs_numberOfPassengers_mean    0
itcs_numberOfPassengers_min     0
itcs_numberOfPassengers_max     0
status_gridIsAvailable_mean     0
temperature_ambient_mean        0
temperature_ambient_min         0
temperature_ambient_max         0
dtype: int64

In [31]:
print(f'Metadata dataframe has: {metadata.isna().sum().sum()} null values.')

Metadata dataframe has: 0 null values.


Although the check above reports no missing values, `metadata.head()` shows that some `busRoute` cells contain `-`, which stands for a missing value. `isna()` does not detect this placeholder, so I have to take it into account if `busRoute` is used for analysis.
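One possible way to handle the placeholder is to declare it as missing when the file is read. The sketch below uses a hypothetical in-memory CSV excerpt (`csv_text`) rather than the real `metaData.csv`; `na_values` and `dtype` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical CSV excerpt with the same "-" placeholder seen in metadata.head()
csv_text = "name,busRoute\ntrip_a,-\ntrip_b,31\ntrip_c,N1\n"

# na_values turns "-" into NaN at read time;
# dtype=str keeps routes such as "31" as text instead of numbers
trips = pd.read_csv(io.StringIO(csv_text), na_values=["-"], dtype={"busRoute": str})

n_missing = int(trips["busRoute"].isna().sum())
print(n_missing)
```

With this approach `isna()` reports the placeholder rows directly, so no separate replacement step is needed afterwards.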

## Import ZT_bus data

In [5]:
file_names = [file_name for file_name in os.listdir(config['path_ZTbus_folder']) if 
              os.path.isfile(os.path.join(config['path_ZTbus_folder'], file_name))]

In [6]:
# Print only the first 10 results
file_names[:10]

['B183_2020-11-13_14-52-45_2020-11-13_19-13-45.csv',
 'B183_2020-10-06_04-23-44_2020-10-06_07-33-54.csv',
 'B183_2019-10-03_03-04-42_2019-10-03_18-38-45.csv',
 'B183_2022-05-02_03-02-19_2022-05-02_17-07-49.csv',
 'B183_2021-04-23_03-47-54_2021-04-23_07-48-48.csv',
 'B208_2022-08-15_03-31-51_2022-08-15_12-34-10.csv',
 'B183_2020-07-24_04-01-31_2020-07-24_18-04-39.csv',
 'B183_2022-10-28_13-36-23_2022-10-28_16-37-08.csv',
 'B208_2021-04-21_04-10-07_2021-04-21_18-19-32.csv',
 'B183_2022-07-28_14-27-33_2022-07-28_19-17-23.csv']

In [10]:
len(file_names) == metadata.shape[0]

True

As expected, the ZT_bus folder contains data for 1409 driving missions, and the metadata has the same number of rows (1409).

In [7]:
dataframes = {}
for el in file_names:
    dataframes[el[:-4]] = pd.read_csv(f'{config["path_ZTbus_folder"]}/{el}')

In [12]:
type(dataframes)

dict

`dataframes` is a dictionary of 1409 dataframes, each corresponding to one driving mission.
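The dictionary keys are the file names without the `.csv` extension, obtained above by slicing off the last four characters. A small sketch of an equivalent that does not depend on the extension length, using `pathlib.Path.stem` on one of the file names listed earlier:

```python
from pathlib import Path

# One of the file names printed above
file_name = "B183_2020-11-13_14-52-45_2020-11-13_19-13-45.csv"

# .stem drops the final extension, whatever its length
key = Path(file_name).stem
print(key)
```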

In [111]:
# Print the first 10 keys
list(dataframes.keys())[:10]

['B183_2020-11-13_14-52-45_2020-11-13_19-13-45',
 'B183_2020-10-06_04-23-44_2020-10-06_07-33-54',
 'B183_2019-10-03_03-04-42_2019-10-03_18-38-45',
 'B183_2022-05-02_03-02-19_2022-05-02_17-07-49',
 'B183_2021-04-23_03-47-54_2021-04-23_07-48-48',
 'B208_2022-08-15_03-31-51_2022-08-15_12-34-10',
 'B183_2020-07-24_04-01-31_2020-07-24_18-04-39',
 'B183_2022-10-28_13-36-23_2022-10-28_16-37-08',
 'B208_2021-04-21_04-10-07_2021-04-21_18-19-32',
 'B183_2022-07-28_14-27-33_2022-07-28_19-17-23']

In [15]:
len(dataframes.keys())

1409

Let's take two sample dataframes to analyse.

In [32]:
dataframes['B183_2019-04-30_03-18-56_2019-04-30_08-44-20'].head()

Unnamed: 0,time_iso,time_unix,electric_powerDemand,gnss_altitude,gnss_course,gnss_latitude,gnss_longitude,itcs_busRoute,itcs_numberOfPassengers,itcs_stopName,...,odometry_wheelSpeed_mr,odometry_wheelSpeed_rl,odometry_wheelSpeed_rr,status_doorIsOpen,status_gridIsAvailable,status_haltBrakeIsActive,status_parkBrakeIsActive,temperature_ambient,traction_brakePressure,traction_tractionForce
0,2019-04-30T03:18:56Z,1556594336,-13.84551,,,,,-,,-,...,0.0,0.0,0.0,1,1,0,0,293.15,251666.7,0.0
1,2019-04-30T03:18:57Z,1556594337,-3.849362,,,,,-,,-,...,0.0,0.0,0.0,1,1,0,0,292.3688,254876.2,0.0
2,2019-04-30T03:18:58Z,1556594338,-0.672331,,,,,-,,-,...,0.0,0.0,0.0,1,1,0,0,292.931,251783.3,0.0
3,2019-04-30T03:18:59Z,1556594339,-1.087931,,,,,-,,-,...,0.0,0.0,0.0,1,1,0,0,293.15,255000.0,0.0
4,2019-04-30T03:19:00Z,1556594340,-0.811985,,,,,-,,-,...,0.0,0.0,0.0,1,1,0,0,293.15,253000.0,0.0


In [18]:
dataframes['B183_2019-04-30_03-18-56_2019-04-30_08-44-20'].columns

Index(['time_iso', 'time_unix', 'electric_powerDemand', 'gnss_altitude',
       'gnss_course', 'gnss_latitude', 'gnss_longitude', 'itcs_busRoute',
       'itcs_numberOfPassengers', 'itcs_stopName',
       'odometry_articulationAngle', 'odometry_steeringAngle',
       'odometry_vehicleSpeed', 'odometry_wheelSpeed_fl',
       'odometry_wheelSpeed_fr', 'odometry_wheelSpeed_ml',
       'odometry_wheelSpeed_mr', 'odometry_wheelSpeed_rl',
       'odometry_wheelSpeed_rr', 'status_doorIsOpen', 'status_gridIsAvailable',
       'status_haltBrakeIsActive', 'status_parkBrakeIsActive',
       'temperature_ambient', 'traction_brakePressure',
       'traction_tractionForce'],
      dtype='object')

In [19]:
dataframes['B183_2019-04-30_03-18-56_2019-04-30_08-44-20'].shape

(19525, 26)

In [23]:
dataframes['B183_2019-04-30_03-18-56_2019-04-30_08-44-20'].dtypes

time_iso                       object
time_unix                       int64
electric_powerDemand          float64
gnss_altitude                 float64
gnss_course                   float64
gnss_latitude                 float64
gnss_longitude                float64
itcs_busRoute                  object
itcs_numberOfPassengers       float64
itcs_stopName                  object
odometry_articulationAngle    float64
odometry_steeringAngle        float64
odometry_vehicleSpeed         float64
odometry_wheelSpeed_fl        float64
odometry_wheelSpeed_fr        float64
odometry_wheelSpeed_ml        float64
odometry_wheelSpeed_mr        float64
odometry_wheelSpeed_rl        float64
odometry_wheelSpeed_rr        float64
status_doorIsOpen               int64
status_gridIsAvailable          int64
status_haltBrakeIsActive        int64
status_parkBrakeIsActive        int64
temperature_ambient           float64
traction_brakePressure        float64
traction_tractionForce        float64
dtype: object

In [21]:
dataframes['B208_2022-03-25_23-51-19_2022-03-26_03-42-34'].head()

Unnamed: 0,time_iso,time_unix,electric_powerDemand,gnss_altitude,gnss_course,gnss_latitude,gnss_longitude,itcs_busRoute,itcs_numberOfPassengers,itcs_stopName,...,odometry_wheelSpeed_mr,odometry_wheelSpeed_rl,odometry_wheelSpeed_rr,status_doorIsOpen,status_gridIsAvailable,status_haltBrakeIsActive,status_parkBrakeIsActive,temperature_ambient,traction_brakePressure,traction_tractionForce
0,2022-03-25T23:51:19Z,1648252279,2795.944,,,,,-,,-,...,0.0,0.0,0.0,1,0,1,1,293.839,245833.3,0.0
1,2022-03-25T23:51:20Z,1648252280,2717.339,,,,,-,,-,...,0.0,0.0,0.0,1,0,1,1,293.461,245833.3,0.0
2,2022-03-25T23:51:21Z,1648252281,2904.655,,,,,-,,-,...,0.0,0.0,0.0,1,0,1,1,293.15,245833.3,0.0
3,2022-03-25T23:51:22Z,1648252282,2862.673,,,,,-,,-,...,0.0,0.0,0.0,1,0,1,1,293.839,245833.3,0.0
4,2022-03-25T23:51:23Z,1648252283,2927.541,,,,,-,,-,...,0.0,0.0,0.0,1,0,1,1,293.461,245833.3,0.0


In [22]:
dataframes['B208_2022-03-25_23-51-19_2022-03-26_03-42-34'].shape

(13876, 26)

In [24]:
dataframes['B208_2022-03-25_23-51-19_2022-03-26_03-42-34'].dtypes

time_iso                       object
time_unix                       int64
electric_powerDemand          float64
gnss_altitude                 float64
gnss_course                   float64
gnss_latitude                 float64
gnss_longitude                float64
itcs_busRoute                  object
itcs_numberOfPassengers       float64
itcs_stopName                  object
odometry_articulationAngle    float64
odometry_steeringAngle        float64
odometry_vehicleSpeed         float64
odometry_wheelSpeed_fl        float64
odometry_wheelSpeed_fr        float64
odometry_wheelSpeed_ml        float64
odometry_wheelSpeed_mr        float64
odometry_wheelSpeed_rl        float64
odometry_wheelSpeed_rr        float64
status_doorIsOpen               int64
status_gridIsAvailable          int64
status_haltBrakeIsActive        int64
status_parkBrakeIsActive        int64
temperature_ambient           float64
traction_brakePressure        float64
traction_tractionForce        float64
dtype: object

As with the metadata, `time_iso` is stored as `object` but should be a `datetime`. For future analysis it will be better to convert it or to use `time_unix`. The other data types match the nature of the variables they represent. I assume the same holds for the other dataframes.
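The assumption that all dataframes share the same data types can be checked rather than taken on trust. The sketch below uses a hypothetical dictionary `frames` as a stand-in for `dataframes`, with one deliberately inconsistent entry.

```python
import pandas as pd

# Hypothetical stand-in for the `dataframes` dictionary built above
frames = {
    "trip_a": pd.DataFrame({"time_unix": [1, 2], "temperature_ambient": [293.15, 292.4]}),
    "trip_b": pd.DataFrame({"time_unix": [3, 4], "temperature_ambient": [291.0, 290.5]}),
    # Deliberately inconsistent: temperatures stored as text
    "trip_c": pd.DataFrame({"time_unix": [5, 6], "temperature_ambient": ["n/a", "290.5"]}),
}

# Use the first frame as the reference schema (column name -> dtype)
reference = next(iter(frames.values())).dtypes.to_dict()

# Collect the keys whose columns or dtypes differ from the reference
mismatched = [key for key, df in frames.items() if df.dtypes.to_dict() != reference]
print(mismatched)
```

An empty `mismatched` list on the real dictionary would confirm the assumption for all missions.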

In [19]:
diz = defaultdict(int)
# defaultdict(int) initializes the dictionary values to 0

for name in file_names:
    prefix = name[:4]
    diz[prefix] += 1

diz

defaultdict(int, {'B183': 864, 'B208': 545})

* 864 driving missions with bus 183
* 545 driving missions with bus 208

### Nan values check

In [34]:
dataframes['B183_2019-04-30_03-18-56_2019-04-30_08-44-20'].isna().sum()

time_iso                          0
time_unix                         0
electric_powerDemand              0
gnss_altitude                 19328
gnss_course                   19328
gnss_latitude                 19328
gnss_longitude                19328
itcs_busRoute                     0
itcs_numberOfPassengers       19332
itcs_stopName                     0
odometry_articulationAngle        0
odometry_steeringAngle            0
odometry_vehicleSpeed             0
odometry_wheelSpeed_fl            0
odometry_wheelSpeed_fr            0
odometry_wheelSpeed_ml            0
odometry_wheelSpeed_mr            0
odometry_wheelSpeed_rl            0
odometry_wheelSpeed_rr            0
status_doorIsOpen                 0
status_gridIsAvailable            0
status_haltBrakeIsActive          0
status_parkBrakeIsActive          0
temperature_ambient               0
traction_brakePressure            0
traction_tractionForce            0
dtype: int64

In [33]:
dataframes['B208_2022-03-25_23-51-19_2022-03-26_03-42-34'].isna().sum()

time_iso                          0
time_unix                         0
electric_powerDemand              0
gnss_altitude                   115
gnss_course                     110
gnss_latitude                   110
gnss_longitude                  110
itcs_busRoute                     0
itcs_numberOfPassengers       13781
itcs_stopName                     0
odometry_articulationAngle        0
odometry_steeringAngle            0
odometry_vehicleSpeed             0
odometry_wheelSpeed_fl            0
odometry_wheelSpeed_fr            0
odometry_wheelSpeed_ml            0
odometry_wheelSpeed_mr            0
odometry_wheelSpeed_rl            0
odometry_wheelSpeed_rr            0
status_doorIsOpen                 0
status_gridIsAvailable            0
status_haltBrakeIsActive          0
status_parkBrakeIsActive          0
temperature_ambient               0
traction_brakePressure            0
traction_tractionForce            0
dtype: int64

In both samples, the columns `gnss_altitude`, `gnss_course`, `gnss_latitude`, `gnss_longitude` and `itcs_numberOfPassengers` have missing values, and the counts differ considerably between the two: in the first sample the GNSS columns are missing in 19328 of 19525 rows, in the second in at most 115 of 13876 rows. It is worth investigating why these specific columns have missing data and how this could affect later analysis or modelling.

Based on these two samples, I expect missing values in the five columns listed above. Future analysis has to take this into account, and other variables may also contain null values, because so far I have only checked two samples.
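To go beyond two samples, the per-column NaN counts can be collected for every dataframe at once. A minimal sketch on a hypothetical two-entry dictionary `frames` standing in for `dataframes`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `dataframes` dictionary
frames = {
    "trip_a": pd.DataFrame({"gnss_latitude": [np.nan, 47.4, np.nan], "time_unix": [1, 2, 3]}),
    "trip_b": pd.DataFrame({"gnss_latitude": [47.3, 47.4, 47.5], "time_unix": [4, 5, 6]}),
}

# One row per trip, one column per variable, values are NaN counts
nan_counts = pd.DataFrame({key: df.isna().sum() for key, df in frames.items()}).T

# Keep only the variables that have missing values in at least one trip
columns_with_nan = nan_counts.columns[(nan_counts > 0).any()].tolist()
print(columns_with_nan)
```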


TODO: 
* Describe a bit the data

# Project

## 1. Extract all trips with `busRoute` 83

As described before, the variable `busRoute` contains missing values encoded as `-`, as `metadata.head()` shows. Before extracting the trips with `busRoute` 83, I replace `-` with a null value, because the data for those cells is missing.

In [12]:
metadata.head()

Unnamed: 0,name,busNumber,startTime_iso,startTime_unix,endTime_iso,endTime_unix,drivenDistance,busRoute,energyConsumption,itcs_numberOfPassengers_mean,itcs_numberOfPassengers_min,itcs_numberOfPassengers_max,status_gridIsAvailable_mean,temperature_ambient_mean,temperature_ambient_min,temperature_ambient_max
0,B183_2019-04-30_03-18-56_2019-04-30_08-44-20,183,2019-04-30T03:18:56Z,1556594336,2019-04-30T08:44:20Z,1556613860,77213.87,-,478585200.0,5.539,0,20,0.741,282.378,281.15,293.15
1,B183_2019-04-30_13-22-07_2019-04-30_17-54-02,183,2019-04-30T13:22:07Z,1556630527,2019-04-30T17:54:02Z,1556646842,59029.6,31,402258500.0,33.115,4,74,0.855,287.544,285.15,293.15
2,B183_2019-05-01_05-58-51_2019-05-01_22-32-30,183,2019-05-01T05:58:51Z,1556690331,2019-05-01T22:32:30Z,1556749950,240900.4,33,1445733000.0,19.689,0,55,0.778,288.749,280.15,294.15
3,B183_2019-05-03_02-50-21_2019-05-03_05-53-20,183,2019-05-03T02:50:21Z,1556851821,2019-05-03T05:53:20Z,1556862800,42565.48,-,281986700.0,1.685,0,8,0.767,282.413,281.15,292.15
4,B183_2019-05-03_15-41-57_2019-05-03_23-06-24,183,2019-05-03T15:41:57Z,1556898117,2019-05-03T23:06:24Z,1556924784,125277.2,72,620725800.0,23.754,1,67,0.907,284.733,282.15,287.15


In [39]:
metadata.dtypes

name                             object
busNumber                         int64
startTime_iso                    object
startTime_unix                    int64
endTime_iso                      object
endTime_unix                      int64
drivenDistance                  float64
busRoute                         object
energyConsumption               float64
itcs_numberOfPassengers_mean    float64
itcs_numberOfPassengers_min       int64
itcs_numberOfPassengers_max       int64
status_gridIsAvailable_mean     float64
temperature_ambient_mean        float64
temperature_ambient_min         float64
temperature_ambient_max         float64
dtype: object

In [13]:
metadata['busRoute'] = metadata['busRoute'].replace('-', np.nan)

In [14]:
metadata.head()

Unnamed: 0,name,busNumber,startTime_iso,startTime_unix,endTime_iso,endTime_unix,drivenDistance,busRoute,energyConsumption,itcs_numberOfPassengers_mean,itcs_numberOfPassengers_min,itcs_numberOfPassengers_max,status_gridIsAvailable_mean,temperature_ambient_mean,temperature_ambient_min,temperature_ambient_max
0,B183_2019-04-30_03-18-56_2019-04-30_08-44-20,183,2019-04-30T03:18:56Z,1556594336,2019-04-30T08:44:20Z,1556613860,77213.87,,478585200.0,5.539,0,20,0.741,282.378,281.15,293.15
1,B183_2019-04-30_13-22-07_2019-04-30_17-54-02,183,2019-04-30T13:22:07Z,1556630527,2019-04-30T17:54:02Z,1556646842,59029.6,31.0,402258500.0,33.115,4,74,0.855,287.544,285.15,293.15
2,B183_2019-05-01_05-58-51_2019-05-01_22-32-30,183,2019-05-01T05:58:51Z,1556690331,2019-05-01T22:32:30Z,1556749950,240900.4,33.0,1445733000.0,19.689,0,55,0.778,288.749,280.15,294.15
3,B183_2019-05-03_02-50-21_2019-05-03_05-53-20,183,2019-05-03T02:50:21Z,1556851821,2019-05-03T05:53:20Z,1556862800,42565.48,,281986700.0,1.685,0,8,0.767,282.413,281.15,292.15
4,B183_2019-05-03_15-41-57_2019-05-03_23-06-24,183,2019-05-03T15:41:57Z,1556898117,2019-05-03T23:06:24Z,1556924784,125277.2,72.0,620725800.0,23.754,1,67,0.907,284.733,282.15,287.15


In [15]:
metadata.isna().sum()

name                             0
busNumber                        0
startTime_iso                    0
startTime_unix                   0
endTime_iso                      0
endTime_unix                     0
drivenDistance                   0
busRoute                        11
energyConsumption                0
itcs_numberOfPassengers_mean     0
itcs_numberOfPassengers_min      0
itcs_numberOfPassengers_max      0
status_gridIsAvailable_mean      0
temperature_ambient_mean         0
temperature_ambient_min          0
temperature_ambient_max          0
dtype: int64

In [16]:
print(f'busRoute variable contains: {metadata["busRoute"].isna().sum()} missing values')

busRoute variable contains: 11 missing values


In [17]:
metadata.dtypes

name                             object
busNumber                         int64
startTime_iso                    object
startTime_unix                    int64
endTime_iso                      object
endTime_unix                      int64
drivenDistance                  float64
busRoute                         object
energyConsumption               float64
itcs_numberOfPassengers_mean    float64
itcs_numberOfPassengers_min       int64
itcs_numberOfPassengers_max       int64
status_gridIsAvailable_mean     float64
temperature_ambient_mean        float64
temperature_ambient_min         float64
temperature_ambient_max         float64
dtype: object

Let's check the unique values of `busRoute`.

In [18]:
metadata['busRoute'].unique()

array([nan, '31', '33', '72', '46', '32', '83', 'N4', 'N2', 'N1'],
      dtype=object)

The bus routes are not only integers: there are also `N4`, `N2` and `N1`. These probably correspond to night routes. I therefore keep the column as `object` with string values and extract the trips with `busRoute` 83 by comparing against the string `'83'`.

In [19]:
metadata_busRoute_83 = metadata[metadata['busRoute']=='83']
metadata_busRoute_83

Unnamed: 0,name,busNumber,startTime_iso,startTime_unix,endTime_iso,endTime_unix,drivenDistance,busRoute,energyConsumption,itcs_numberOfPassengers_mean,itcs_numberOfPassengers_min,itcs_numberOfPassengers_max,status_gridIsAvailable_mean,temperature_ambient_mean,temperature_ambient_min,temperature_ambient_max
154,B183_2020-03-03_04-42-38_2020-03-03_19-44-51,183,2020-03-03T04:42:38Z,1583210558,2020-03-03T19:44:51Z,1583264691,225047.900,83,1544278000.000,23.475,0,118,0.472,280.545,279.150,289.150
155,B183_2020-03-06_04-53-23_2020-03-06_19-44-42,183,2020-03-06T04:53:23Z,1583470403,2020-03-06T19:44:42Z,1583523882,224512.300,83,1631816000.000,17.416,0,69,0.451,279.885,278.150,289.150
157,B183_2020-03-09_14-16-13_2020-03-09_19-34-17,183,2020-03-09T14:16:13Z,1583763373,2020-03-09T19:34:17Z,1583782457,77824.360,83,540601300.000,23.182,0,74,0.460,281.049,279.150,291.150
158,B183_2020-03-10_04-50-03_2020-03-10_19-51-25,183,2020-03-10T04:50:03Z,1583815803,2020-03-10T19:51:25Z,1583869885,225095.800,83,1692171000.000,20.964,0,86,0.475,279.836,279.150,291.150
159,B183_2020-03-12_04-56-41_2020-03-12_19-44-57,183,2020-03-12T04:56:41Z,1583989001,2020-03-12T19:44:57Z,1584042297,224181.200,83,1145860000.000,17.212,0,80,0.341,287.344,282.150,291.150
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1399,B208_2022-11-30_04-47-53_2022-11-30_19-50-22,208,2022-11-30T04:47:53Z,1669783673,2022-11-30T19:50:22Z,1669837822,223165.000,83,1560888000.000,27.891,2,100,0.456,280.695,279.150,293.150
1400,B208_2022-12-01_05-19-41_2022-12-01_18-20-57,208,2022-12-01T05:19:41Z,1669871981,2022-12-01T18:20:57Z,1669918857,190196.000,83,1418847000.000,26.039,0,96,0.450,279.765,279.150,292.150
1401,B208_2022-12-02_04-47-48_2022-12-02_19-40-01,208,2022-12-02T04:47:48Z,1669956468,2022-12-02T19:40:01Z,1670010001,224473.400,83,1611150000.000,24.804,2,91,0.439,279.789,279.150,291.150
1405,B208_2022-12-07_05-13-02_2022-12-07_19-19-53,208,2022-12-07T05:13:02Z,1670389982,2022-12-07T19:19:53Z,1670440793,210041.600,83,1536697000.000,28.785,0,115,0.435,279.528,278.150,292.666


In [79]:
print(f'The number of trips with bus route 83 is: {metadata_busRoute_83.shape[0]}')

The number of trips with bus route 83 is: 846


In [81]:
# Equivalent to the above, using len()
print(f'The number of trips with bus route 83 is: {len(metadata_busRoute_83)}')

The number of trips with bus route 83 is: 846


## 2. Extract all trips where `busRoute` is not a number 

"Not a number" here means either a NaN value or a night route (`N1`, `N2`, `N4`).

In [20]:
# Filter rows where 'busRoute' is not numeric
metadata_busRoute_NotInt = metadata[~metadata['busRoute'].astype(str).str.isnumeric()]
metadata_busRoute_NotInt

Unnamed: 0,name,busNumber,startTime_iso,startTime_unix,endTime_iso,endTime_unix,drivenDistance,busRoute,energyConsumption,itcs_numberOfPassengers_mean,itcs_numberOfPassengers_min,itcs_numberOfPassengers_max,status_gridIsAvailable_mean,temperature_ambient_mean,temperature_ambient_min,temperature_ambient_max
0,B183_2019-04-30_03-18-56_2019-04-30_08-44-20,183,2019-04-30T03:18:56Z,1556594336,2019-04-30T08:44:20Z,1556613860,77213.870,,478585200.000,5.539,0,20,0.741,282.378,281.150,293.150
3,B183_2019-05-03_02-50-21_2019-05-03_05-53-20,183,2019-05-03T02:50:21Z,1556851821,2019-05-03T05:53:20Z,1556862800,42565.480,,281986700.000,1.685,0,8,0.767,282.413,281.150,292.150
9,B183_2019-05-10_03-16-11_2019-05-10_18-51-37,183,2019-05-10T03:16:11Z,1557458171,2019-05-10T18:51:37Z,1557514297,210577.000,,1303391000.000,8.230,0,43,0.741,287.562,282.150,293.150
10,B183_2019-05-13_03-10-23_2019-05-13_23-16-13,183,2019-05-13T03:10:23Z,1557717023,2019-05-13T23:16:13Z,1557789373,267033.800,,1647432000.000,7.892,0,45,0.804,284.676,280.150,293.150
19,B183_2019-05-24_02-52-47_2019-05-24_22-35-11,183,2019-05-24T02:52:47Z,1558666367,2019-05-24T22:35:11Z,1558737311,263432.600,,1448057000.000,7.520,0,44,0.761,293.144,283.150,299.150
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1373,B208_2022-10-21_22-38-32_2022-10-22_02-42-21,208,2022-10-21T22:38:32Z,1666391912,2022-10-22T02:42:21Z,1666406541,78567.160,N1,434776600.000,16.333,0,45,0.432,289.255,288.150,296.150
1374,B208_2022-10-22_22-34-45_2022-10-23_02-29-59,208,2022-10-22T22:34:45Z,1666478085,2022-10-23T02:29:59Z,1666492199,73427.970,N2,399773700.000,17.711,0,57,0.443,287.349,285.150,295.150
1394,B208_2022-11-25_23-35-16_2022-11-26_03-30-39,208,2022-11-25T23:35:16Z,1669419316,2022-11-26T03:30:39Z,1669433439,72911.260,N2,447553400.000,11.217,1,32,0.465,281.388,280.150,293.150
1407,B208_2022-12-09_23-55-12_2022-12-10_03-24-28,208,2022-12-09T23:55:12Z,1670630112,2022-12-10T03:24:28Z,1670642668,59548.570,N1,451916500.000,20.105,0,74,0.496,279.454,277.150,291.150


`df[~df['column_name'].condition]`

`~` negates a boolean condition. Here I extract the rows whose `busRoute` value is not numeric according to `str.isnumeric()`. Because of `astype(str)`, NaN becomes the string `'nan'`, which is not numeric, so rows with a missing route are selected too.
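The negation pattern can be seen on a small hypothetical Series. Note that this sketch uses `fillna("")` instead of `astype(str)` to make missing routes non-numeric; that is a variation chosen for illustration, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical route values: numeric strings, a night route and a missing value
routes = pd.Series(["31", "N1", np.nan, "83"], dtype=object)

# An empty string is not numeric, so missing routes map to False here
is_numeric = routes.fillna("").str.isnumeric()

# ~ flips every boolean, selecting the non-numeric rows
not_numeric = routes[~is_numeric]
print(not_numeric.index.tolist())
```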

In [21]:
print(f'The number of trips where BusRoute is not Int is: {metadata_busRoute_NotInt.shape[0]}')

The number of trips where BusRoute is not Int is: 84


In [22]:
# Unique values
metadata_busRoute_NotInt['busRoute'].unique()

array([nan, 'N4', 'N2', 'N1'], dtype=object)

To exclude the NaN values:

In [23]:
metadata_busRoute_NotInt_NotNan = metadata[~metadata['busRoute'].astype(str).str.isnumeric() & ~metadata['busRoute'].isna()]
metadata_busRoute_NotInt_NotNan

Unnamed: 0,name,busNumber,startTime_iso,startTime_unix,endTime_iso,endTime_unix,drivenDistance,busRoute,energyConsumption,itcs_numberOfPassengers_mean,itcs_numberOfPassengers_min,itcs_numberOfPassengers_max,status_gridIsAvailable_mean,temperature_ambient_mean,temperature_ambient_min,temperature_ambient_max
533,B183_2021-12-18_23-37-00_2021-12-19_03-38-35,183,2021-12-18T23:37:00Z,1639870620,2021-12-19T03:38:35Z,1639885115,76216.060,N4,481350300.000,9.199,0,37,0.492,276.863,275.150,288.150
553,B183_2022-01-07_23-40-43_2022-01-08_03-31-21,183,2022-01-07T23:40:43Z,1641598843,2022-01-08T03:31:21Z,1641612681,68557.060,N2,453625100.000,4.627,0,13,0.427,276.967,275.150,287.150
554,B183_2022-01-08_23-40-17_2022-01-09_03-35-32,183,2022-01-08T23:40:17Z,1641685217,2022-01-09T03:35:32Z,1641699332,67962.920,N2,475383300.000,7.495,0,26,0.516,278.565,277.150,288.150
561,B183_2022-01-15_23-41-46_2022-01-16_03-40-23,183,2022-01-15T23:41:46Z,1642290106,2022-01-16T03:40:23Z,1642304423,77156.700,N1,525168300.000,6.513,0,32,0.474,274.994,273.150,286.150
568,B183_2022-01-21_23-35-40_2022-01-22_03-26-24,183,2022-01-21T23:35:40Z,1642808140,2022-01-22T03:26:24Z,1642821984,71917.750,N2,455476000.000,5.357,0,23,0.494,275.307,274.150,281.150
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1373,B208_2022-10-21_22-38-32_2022-10-22_02-42-21,208,2022-10-21T22:38:32Z,1666391912,2022-10-22T02:42:21Z,1666406541,78567.160,N1,434776600.000,16.333,0,45,0.432,289.255,288.150,296.150
1374,B208_2022-10-22_22-34-45_2022-10-23_02-29-59,208,2022-10-22T22:34:45Z,1666478085,2022-10-23T02:29:59Z,1666492199,73427.970,N2,399773700.000,17.711,0,57,0.443,287.349,285.150,295.150
1394,B208_2022-11-25_23-35-16_2022-11-26_03-30-39,208,2022-11-25T23:35:16Z,1669419316,2022-11-26T03:30:39Z,1669433439,72911.260,N2,447553400.000,11.217,1,32,0.465,281.388,280.150,293.150
1407,B208_2022-12-09_23-55-12_2022-12-10_03-24-28,208,2022-12-09T23:55:12Z,1670630112,2022-12-10T03:24:28Z,1670642668,59548.570,N1,451916500.000,20.105,0,74,0.496,279.454,277.150,291.150


In [24]:
print(f'The number of trips where BusRoute is not Int and is not Nan is: {metadata_busRoute_NotInt_NotNan.shape[0]}')

The number of trips where BusRoute is not Int and is not Nan is: 73


In [25]:
print(f'The number of trips where BusRoute is Int is: {metadata[metadata["busRoute"].astype(str).str.isnumeric()].shape[0]}')

The number of trips where BusRoute is Int is: 1325


## 3. For each (busNumber, busRoute) pair, determine the number of trips

In [84]:
metadata['busNumber'].dtypes

dtype('int64')

`busNumber` correctly has an integer data type.

Let's check the unique values.

In [87]:
metadata['busNumber'].unique()

array([183, 208])

As expected, since the missions were driven by two buses, `busNumber` has two values: 183 and 208.

To answer this question I use a dataframe without NaN values (11 rows removed): I drop the rows where `busRoute` is NaN. As checked before, `busNumber` has no missing values.

In [26]:
metadata_notNan = metadata.dropna(subset=['busRoute'])
metadata_notNan

Unnamed: 0,name,busNumber,startTime_iso,startTime_unix,endTime_iso,endTime_unix,drivenDistance,busRoute,energyConsumption,itcs_numberOfPassengers_mean,itcs_numberOfPassengers_min,itcs_numberOfPassengers_max,status_gridIsAvailable_mean,temperature_ambient_mean,temperature_ambient_min,temperature_ambient_max
1,B183_2019-04-30_13-22-07_2019-04-30_17-54-02,183,2019-04-30T13:22:07Z,1556630527,2019-04-30T17:54:02Z,1556646842,59029.600,31,402258500.000,33.115,4,74,0.855,287.544,285.150,293.150
2,B183_2019-05-01_05-58-51_2019-05-01_22-32-30,183,2019-05-01T05:58:51Z,1556690331,2019-05-01T22:32:30Z,1556749950,240900.400,33,1445733000.000,19.689,0,55,0.778,288.749,280.150,294.150
4,B183_2019-05-03_15-41-57_2019-05-03_23-06-24,183,2019-05-03T15:41:57Z,1556898117,2019-05-03T23:06:24Z,1556924784,125277.200,72,620725800.000,23.754,1,67,0.907,284.733,282.150,287.150
5,B183_2019-05-05_07-41-02_2019-05-05_23-20-07,183,2019-05-05T07:41:02Z,1557042062,2019-05-05T23:20:07Z,1557098407,283206.900,46,1661700000.000,16.499,0,74,0.998,280.167,277.150,291.150
6,B183_2019-05-06_03-10-43_2019-05-06_19-20-34,183,2019-05-06T03:10:43Z,1557112243,2019-05-06T19:20:34Z,1557170434,224131.600,31,1388008000.000,28.035,0,83,0.871,282.243,277.150,291.150
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1404,B208_2022-12-06_14-43-49_2022-12-06_18-22-52,208,2022-12-06T14:43:49Z,1670337829,2022-12-06T18:22:52Z,1670350972,51798.780,32,426041900.000,39.809,0,83,0.739,279.640,278.150,291.150
1405,B208_2022-12-07_05-13-02_2022-12-07_19-19-53,208,2022-12-07T05:13:02Z,1670389982,2022-12-07T19:19:53Z,1670440793,210041.600,83,1536697000.000,28.785,0,115,0.435,279.528,278.150,292.666
1406,B208_2022-12-08_05-22-20_2022-12-08_18-39-15,208,2022-12-08T05:22:20Z,1670476940,2022-12-08T18:39:15Z,1670524755,190372.700,83,1415700000.000,29.879,0,102,0.440,279.172,277.150,292.150
1407,B208_2022-12-09_23-55-12_2022-12-10_03-24-28,208,2022-12-09T23:55:12Z,1670630112,2022-12-10T03:24:28Z,1670642668,59548.570,N1,451916500.000,20.105,0,74,0.496,279.454,277.150,291.150


In [27]:
metadata_notNan.shape[0]

1398

In [91]:
print(f'The number of rows dropped is: {metadata.shape[0] - metadata_notNan.shape[0]}')

The number of rows dropped is: 11


In [102]:
trips_busNum_busRoute = metadata_notNan.groupby(['busNumber', 'busRoute'], as_index=False).size()

In [103]:
trips_busNum_busRoute = trips_busNum_busRoute.rename(columns={'size': 'numTrips'})

In [104]:
trips_busNum_busRoute

Unnamed: 0,busNumber,busRoute,numTrips
0,183,31,12
1,183,32,12
2,183,33,130
3,183,46,104
4,183,72,114
5,183,83,441
6,183,N1,10
7,183,N2,19
8,183,N4,11
9,208,31,5


In [106]:
trips_busNum_busRoute['numTrips'].sum()

1398
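The same counts can also be laid out as a wide table, which makes it easy to spot (bus, route) pairs that never occur. A sketch on a hypothetical five-row `sample`; `pd.crosstab` is used here as an alternative view, not as the method of the cells above.

```python
import pandas as pd

# Hypothetical trips: only the two columns needed for counting
sample = pd.DataFrame({
    "busNumber": [183, 183, 183, 208, 208],
    "busRoute": ["31", "83", "83", "83", "N1"],
})

# Long format, as in the notebook: one row per (busNumber, busRoute) pair
long_counts = (
    sample.groupby(["busNumber", "busRoute"], as_index=False)
    .size()
    .rename(columns={"size": "numTrips"})
)

# Wide format: buses as rows, routes as columns, 0 where a pair never occurs
wide_counts = pd.crosstab(sample["busNumber"], sample["busRoute"])
print(wide_counts)
```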

## 4. For each trip, compute the ratio between the energy consumption and the average number of passengers

In [28]:
metadata.columns

Index(['name', 'busNumber', 'startTime_iso', 'startTime_unix', 'endTime_iso',
       'endTime_unix', 'drivenDistance', 'busRoute', 'energyConsumption',
       'itcs_numberOfPassengers_mean', 'itcs_numberOfPassengers_min',
       'itcs_numberOfPassengers_max', 'status_gridIsAvailable_mean',
       'temperature_ambient_mean', 'temperature_ambient_min',
       'temperature_ambient_max'],
      dtype='object')

The ratio between energy consumption and the average number of passengers is computed as
`energyConsumption` / `itcs_numberOfPassengers_mean`, i.e. the energy consumed during a trip divided by the mean number of passengers on board during that trip.


In [29]:
metadata['ratio_EnergCons_MeanNumOfPass'] = metadata['energyConsumption'] / metadata['itcs_numberOfPassengers_mean']

In [30]:
metadata[['energyConsumption', 'itcs_numberOfPassengers_mean', 'ratio_EnergCons_MeanNumOfPass']].head()

Unnamed: 0,energyConsumption,itcs_numberOfPassengers_mean,ratio_EnergCons_MeanNumOfPass
0,478585200.0,5.539,86405000.307
1,402258500.0,33.115,12147474.013
2,1445733000.0,19.689,73427940.479
3,281986700.0,1.685,167332785.421
4,620725800.0,23.754,26131895.121


**Comment on first row values**

* `energyConsumption`: The energy consumption for the first bus trip is 478,585,200.
* `itcs_numberOfPassengers_mean`: The average number of passengers for the first bus trip is approximately 5.54.
* `ratio_EnergCons_MeanNumOfPass`: The ratio of energy consumption to the average number of passengers is a derived metric that helps assess the energy efficiency of a bus trip. The value of approximately 86,405,000 means that this trip consumed about 86.4 million energy units per average passenger on board. Lower values indicate more efficient energy use per passenger.

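The ratio tends to increase when the average number of passengers is lower, because the trip's energy is spread over fewer passengers. In the rows shown above, the trip with index 3 has the lowest mean occupancy (1.685) and the highest ratio (about 167.3 million), while the trip with index 1 has the highest mean occupancy (33.115) and the lowest ratio (about 12.1 million). Higher passenger occupancy therefore improves energy efficiency per passenger.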
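
The division above returns `inf` if a trip has a mean occupancy of exactly 0 and `NaN` if the mean is missing. Whether such trips occur in *trips* is not checked here, so the following is only a minimal sketch on a toy table (the `toy` DataFrame and its values are illustrative assumptions, not ZTBus data) showing one way to make the ratio undefined instead of infinite:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trips table (illustrative values, not taken from ZTBus)
toy = pd.DataFrame({
    'energyConsumption': [100.0, 200.0, 300.0],
    'itcs_numberOfPassengers_mean': [4.0, 0.0, np.nan],
})

# Treat a zero mean occupancy as missing so the division yields NaN instead of inf
passengers = toy['itcs_numberOfPassengers_mean'].replace(0, np.nan)
toy['ratio_EnergCons_MeanNumOfPass'] = toy['energyConsumption'] / passengers

print(toy)
```

Rows with an undefined ratio can then be excluded with `dropna(subset=['ratio_EnergCons_MeanNumOfPass'])` before ranking or averaging.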





## 5. For each station (`itcs_stopName`), determine the average number of passengers.

`itcs_stopName` is a column of every DataFrame in the `dataframes` dictionary.

In [8]:
dataframes['B208_2022-03-25_23-51-19_2022-03-26_03-42-34'].columns

Index(['time_iso', 'time_unix', 'electric_powerDemand', 'gnss_altitude',
       'gnss_course', 'gnss_latitude', 'gnss_longitude', 'itcs_busRoute',
       'itcs_numberOfPassengers', 'itcs_stopName',
       'odometry_articulationAngle', 'odometry_steeringAngle',
       'odometry_vehicleSpeed', 'odometry_wheelSpeed_fl',
       'odometry_wheelSpeed_fr', 'odometry_wheelSpeed_ml',
       'odometry_wheelSpeed_mr', 'odometry_wheelSpeed_rl',
       'odometry_wheelSpeed_rr', 'status_doorIsOpen', 'status_gridIsAvailable',
       'status_haltBrakeIsActive', 'status_parkBrakeIsActive',
       'temperature_ambient', 'traction_brakePressure',
       'traction_tractionForce'],
      dtype='object')

In [13]:
dataframes['B208_2022-03-25_23-51-19_2022-03-26_03-42-34']['itcs_stopName'].unique()

array(['-', 'Zürich, Bahnhofplatz/HB', 'Zürich, Löwenplatz',
       'Zürich, Sihlpost / HB', 'Zürich, Kanonengasse',
       'Zürich, Militär-/Langstrasse', 'Zürich, Bäckeranlage',
       'Zürich, Güterbahnhof', 'Zürich, Hardplatz',
       'Zürich, Herdernstrasse', 'Zürich, SBB-Werkstätte',
       'Zürich, Letzipark', 'Zürich, Letzibach',
       'Zürich, Bahnhof Altstetten', 'Zürich, Bristenstrasse',
       'Zürich, Lindenplatz', 'Zürich, Rautistrasse',
       'Zürich, Schulhaus Buchlern', 'Zürich, Rautihalde',
       'Zürich, Salzweg', 'Zürich, Dunkelhölzli', 'Zürich, Sihlquai/HB',
       'Zürich, Museum für Gestaltung', 'Zürich, Limmatplatz',
       'Zürich, Quellenstrasse', 'Zürich, Löwenbräu',
       'Zürich, Escher-Wyss-Platz', 'Zürich, Rosengartenstrasse',
       'Zürich, Lehenstrasse', 'Zürich, Rebbergsteig',
       'Zürich, Kempfhofsteig', 'Zürich, Schwert',
       'Zürich, Meierhofplatz', 'Zürich, Wieslergasse',
       'Zürich, Singlistrasse', 'Zürich, Segantinistrasse',
      

In [14]:
dataframes['B208_2022-03-25_23-51-19_2022-03-26_03-42-34']['itcs_numberOfPassengers'].unique()

array([nan, 46., 55., 60., 67., 42., 40., 37., 43., 41., 19., 13.,  6.,
        5.,  0.,  2.,  4.,  7., 12., 14., 45.,  9., 17., 11.,  8.,  3.,
        1., 15., 16.])

In [15]:
dataframes['B183_2019-04-30_03-18-56_2019-04-30_08-44-20']['itcs_stopName'].unique()

array(['-', 'Zürich, Herdernstrasse', 'Zürich, Hardplatz',
       'Zürich, Güterbahnhof', 'Zürich, Bäckeranlage',
       'Zürich, Militär-/Langstrasse', 'Zürich, Kanonengasse',
       'Zürich, Sihlpost / HB', 'Zürich, Löwenplatz',
       'Zürich, Bahnhofplatz/HB', 'Zürich, Central', 'Zürich, Neumarkt',
       'Zürich, Kunsthaus', 'Zürich, Sprecherstrasse',
       'Zürich, Kreuzplatz', 'Zürich, Signaustrasse',
       'Zürich, Hegibachplatz', 'Zürich, Botanischer Garten',
       'Zürich, Höschgasse', 'Zürich, Fröhlichstrasse',
       'Zürich, Wildbachstrasse', 'Zürich, Bahnhof Tiefenbrunnen',
       'Zürich, Freiestrasse', 'Zürich, Klusplatz',
       'Zürich, Hölderlinsteig', 'Zürich, Klosbach', 'Zürich, Hofstrasse',
       'Zürich, Kirche Fluntern', 'Zürich, Hinterbergstrasse',
       'Zürich, Spyriplatz', 'Zürich, Bethanien', 'Zürich, Toblerplatz'],
      dtype=object)

In [17]:
dataframes['B183_2019-04-30_03-18-56_2019-04-30_08-44-20']['itcs_numberOfPassengers'].unique()

array([nan,  1.,  2.,  3.,  5.,  6.,  0.,  4.,  8.,  9.,  7., 20., 15.,
       12., 17., 10., 14., 13., 18., 11., 16., 19.])
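
The outputs above show two things to handle before averaging: `itcs_stopName` contains the placeholder `'-'`, and `itcs_numberOfPassengers` contains `NaN`. One possible approach, sketched here on a toy dictionary, is to stack the two relevant columns of all trips, drop those rows, and group by stop name. The sketch assumes that `'-'` marks samples not associated with a stop; `toy_dataframes`, its stop names, and its counts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `dataframes` dictionary (stop names and counts are made up)
toy_dataframes = {
    'trip_a': pd.DataFrame({
        'itcs_stopName': ['-', 'Stop A', 'Stop A', 'Stop B'],
        'itcs_numberOfPassengers': [np.nan, 10.0, 20.0, 5.0],
    }),
    'trip_b': pd.DataFrame({
        'itcs_stopName': ['Stop B', '-', 'Stop A'],
        'itcs_numberOfPassengers': [15.0, 3.0, np.nan],
    }),
}

# Stack only the two columns needed from every trip
cols = ['itcs_stopName', 'itcs_numberOfPassengers']
all_rows = pd.concat([df[cols] for df in toy_dataframes.values()], ignore_index=True)

# Drop the '-' placeholder rows and the rows without a passenger count
at_stop = all_rows[all_rows['itcs_stopName'] != '-'].dropna(subset=['itcs_numberOfPassengers'])

# Mean passenger count per stop, over all remaining samples
avg_passengers_per_stop = at_stop.groupby('itcs_stopName')['itcs_numberOfPassengers'].mean()
print(avg_passengers_per_stop)
```

Because each row is one time sample, this mean weights every sample equally, so a long stay at a stop counts more than a short one. Averaging per visit instead would require first collapsing consecutive samples at the same stop.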