04 - Movement Decomposition and Antecedents

This notebook illustrates not only the movement decomposition features of the `wmatarawnav` package, but also the necessary data preparation that takes place before the decomposition and the association of stop-related information that takes place after the decomposition. 

The movement decomposition is intended to support analyses of transit service improvements, but applications of the movement decomposition are not discussed here. These applications remain under development. 

The contents of this notebook include: 

1. Environment Setup
2. Walkthrough of Decomposition Code

    1. Summary of Read-in and Stop Indexing that Precedes Decomposition
    2. Pre-Decomposition Cleaning
    3. Speed and Other Calculations 
    4. Movement Decomposition
    5. Relating Decomposition to Other Stop Locations

3. Summary of Decomposition Fields

# 1. Environment Setup

In [None]:
# TODO: insert here

# 2. Walkthrough of Decomposition Code

## 2.1 Summary of Read-in and Stop Indexing that Precedes Decomposition

The rawnav data to be decomposed is first read into Python using the process described in the first vignette, Intro to Rawnav Parsing. This process is likely to change as the Datamart develops, such that data would be read directly from a WMATA database. 

Next, stop locations for each route pattern are associated with rawnav data using the same process identified in the second vignette, Relate Rawnav to Other Sources. This process is also likely to change as the map matching process used in the Datamart project develops. However, the current process can also be continued by connecting to existing WMATA schedule databases.


## 2.2 Pre-Decomposition Cleaning

Because the movement decomposition algorithm is sensitive to changes in speed and acceleration over intervals as short as one second, clean data is especially important before beginning the decomposition. Two cleaning steps take place to address odometer values that can result in noisy or inaccurate speed calculations. If these issues are unaddressed, rawnav data can briefly show infinite speed (non-zero distance traveled in zero seconds) or show noisy speed values (brief speed increases followed by equally sharp decreases in the next second). 

The cleaning steps are described in more detail below and in the function documentation itself.

### 2.2.1 Aggregating Repeated Timestamp Data

The number of seconds after a trip instance (called a 'run' in previous documentation) begins is an essential element of rawnav data. Referred to in Python code as `sec_past_st` or "seconds past start", it begins at 0 and increments in integer seconds until the end of the trip instance. At times, these second values will repeat, such that a `sec_past_st` value of, say, 30 might appear twice in a row. The reasons for this will not be described in detail here, but the duplicate time values typically occur when other trip tags appear in the data, such as APC tags or calibration tags. 

When these seconds values are repeated, other parts of the data may nevertheless change, such as the door state, the rawnav stop window identifier, or the odometer reading. (As an aside, this suggests that the integer seconds values are rounded from a more precise recording of the time elapsed, rather than marking a regular interval at which data is recorded.) If the odometer increases between two pings but the amount of time elapsed does not, then the bus effectively travels at an infinite speed between these two points. While these infinite speed values could simply be removed (as was done in earlier code for the Queue Jump study), addressing the root of the problem provides some benefits. In particular, converting these integer seconds values into timestamp values allows for the use of additional timeseries-oriented functions in Python. However, many of these functions do not accept repeated timestamp values. 

A sample of rawnav data for a second with repeated `sec_past_st` values is shown below.

In [None]:
# TODO: insert little block showing repeated rows

To address the challenges above, the first step in pre-decomposition cleaning is to aggregate data for repeated seconds values such that each second appears only once. For most variables, the last value observed is recorded, such that if a door open and then a door close value are seen in the same second, only the door close value is maintained. This is based on the assumption that most state changes that last less than a second are not important, and are available in other parts of the Datamart if relevant. There are exceptions to this practice for some variables that are documented within the functions themselves, though the handling of variables is likely to change as the Datamart matures.

For the odometer reading (`odom_ft` in the Python code), the minimum and maximum reading within the second are recorded in new columns. These values play a role in the steps that follow.
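The aggregation rules described above can be sketched with a pandas group-by. This is a minimal illustration on made-up pings, not the package's own implementation: the `door_state`, `odom_ft_min`, and `odom_ft_max` column names are assumptions for this sketch, and the variable-specific exceptions mentioned above are not modeled.

```python
import pandas as pd

# Made-up pings for illustration; second 30 appears twice.
# Column names other than sec_past_st and odom_ft are assumptions.
pings = pd.DataFrame(
    {
        "sec_past_st": [29, 30, 30, 31],
        "door_state": ["C", "O", "C", "C"],
        "odom_ft": [410, 410, 416, 430],
    }
)

aggregated = pings.groupby("sec_past_st", as_index=False).agg(
    door_state=("door_state", "last"),  # keep the last state seen in the second
    odom_ft=("odom_ft", "last"),
    odom_ft_min=("odom_ft", "min"),  # lowest odometer reading in the second
    odom_ft_max=("odom_ft", "max"),  # highest odometer reading in the second
)
print(aggregated)
```

After aggregation each `sec_past_st` value appears once, and the door open event within second 30 is dropped in favor of the later door close value.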

This aggregation is performed with the `agg_sec()` function, which can readily accept other columns that are introduced to rawnav data as Datamart processing code evolves. The result of the `agg_sec()` function is rawnav data with no duplicated seconds values and several additional columns that are the product of the aggregation process. A sample of rows is shown below.

In [None]:
# TODO: print some rows

### 2.2.2 Noisy Odometer Readings

Repeated observations at the same second are one aspect of problematic rawnav data. At other times, rawnav data for a trip instance will have a short, one-second gap between pings. It has been observed that where this occurs, the odometer reading for the ping prior to the short gap will be especially high given the speed observed prior to that point and the speed observed after the gap. This results in noisy speed calculations from one ping to the next. An illustration of this is shown below.

In [None]:
# TODO: insert little chart

While smoothing functions can address this kind of noise, there are again benefits to addressing the problem closer to its root. Noise would exist in the speed values even without this problem, so resolving the issue in advance allows smoothing function parameters to be tuned to the typical noise in odometer and seconds-elapsed measurements, rather than to large, irregular errors. Rolling averages are also used elsewhere in the decomposition code. While these rolling averages could suppress some of this noise, the thresholds and parameters built on them are easier to develop when averaging used for smoothing is kept separate from averaging used to evaluate how the vehicle is moving along its route.

In the cases described above (a series of uninterrupted pings followed by a one-second gap), the last odometer value in the set of uninterrupted pings is set to NA so that the odometer reading at that point can instead be interpolated in the next step.
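One way to flag these pings is sketched below on made-up data. It rests on an assumption about what a "one-second gap" means: here it is read as one missing second, so that `sec_past_st` jumps by 2 to the next ping after a ping that followed its predecessor by exactly 1 second. The package's own rule may differ.

```python
import numpy as np
import pandas as pd

# Made-up pings: second 13 is missing, and the reading at second 12
# is high relative to the speeds before and after it.
pings = pd.DataFrame(
    {
        "sec_past_st": [10, 11, 12, 14, 15, 16],
        "odom_ft": [100.0, 120.0, 155.0, 180.0, 200.0, 220.0],
    }
)

secs_to_next = pings["sec_past_st"].shift(-1) - pings["sec_past_st"]
secs_from_prev = pings["sec_past_st"] - pings["sec_past_st"].shift(1)

# Assumption: the ping to blank out is the last ping of an uninterrupted
# run (1 second after its predecessor) that precedes one missing second.
before_gap = (secs_to_next == 2) & (secs_from_prev == 1)

pings.loc[before_gap, "odom_ft"] = np.nan
print(pings)
```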

### 2.2.3 Interpolating Over Problematic Data

Odometer readings are interpolated for each of the above two cases:

1. Repeated seconds observations with multiple odometer readings
2. Odometer readings that are abnormally high immediately before a short gap in rawnav data

The interpolation is a simple linear interpolation based on the number of seconds elapsed and nearby, non-problematic odometer readings. Because the time over which the interpolation is done is often only a second or two, the use of linear interpolation appeared sufficient. Some checks are performed: if the interpolation is outside the range of the minimum and maximum odometer reading from the aggregation step, the interpolated value is reset to the closest value within that range (e.g., if the interpolated value exceeds the maximum value, it is reset to the maximum value). 
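The interpolation and range check can be sketched as follows for the repeated-seconds case. The data is made up, the `odom_ft_min`, `odom_ft_max`, and `odom_ft_interp` column names are assumptions, and the sketch assumes the problematic reading has already been set to NA.

```python
import numpy as np
import pandas as pd

# Made-up aggregated pings: second 30 had readings between 410 and 416 ft,
# and its odom_ft has been blanked out so it can be interpolated.
pings = pd.DataFrame(
    {
        "sec_past_st": [28, 29, 30, 31],
        "odom_ft": [380.0, 400.0, np.nan, 440.0],
        "odom_ft_min": [380.0, 400.0, 410.0, 440.0],
        "odom_ft_max": [380.0, 400.0, 416.0, 440.0],
    }
)

known = pings["odom_ft"].notna()

# Linear interpolation over elapsed seconds using only the known readings.
interpolated = np.interp(
    pings["sec_past_st"],
    pings.loc[known, "sec_past_st"],
    pings.loc[known, "odom_ft"],
)

# Reset any value outside the observed range to the nearest bound.
pings["odom_ft_interp"] = np.clip(
    interpolated,
    pings["odom_ft_min"].to_numpy(),
    pings["odom_ft_max"].to_numpy(),
)
print(pings)
```

In this example the straight-line estimate at second 30 is 420 ft, which exceeds the maximum reading seen in that second, so it is reset to 416 ft.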

While further experimentation could be done on ways to address the above issues, exploratory analysis showed that these cleaning steps were effective in addressing difficult-to-handle odometer readings. To be clear, this odometer interpolation creates no new rawnav pings or timestamp values; only existing pings with problematic values are modified. After the interpolation takes place, most odometer values will remain integer values, with the exception of the interpolated values, which may have many decimal places.

## 2.3 Speed and Other Calculations 

With more consistent odometer and time data prepared, vehicle speed and acceleration can be calculated. While these calculations are straightforward, there remains some 'noise' in the calculated speed values. This is believed to be the result of both normal variation in speed and the fact that odometer and time values have been rounded to the nearest integer. Short, sharp changes in speed can result in changes in acceleration that can affect the decomposition algorithm below, so additional smoothing is applied to speed values, and rolling values are calculated based on these smoothed values. The subsections below provide more detail.

### 2.3.1 Calculating Speed

The speed at a particular ping is calculated as the difference in odometer reading between that ping and the next ping divided by the difference in time elapsed between that ping and the next ping. The variable is called `fps_next`. The next ping is used instead of the previous ping because the resulting speeds based on the next ping better correlate with other information about vehicle state at a ping (e.g., where the door is open, we expect recorded speed to be 0 or near 0, and this is more often the case when speed is calculated from that ping relative to the next ping).
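A minimal sketch of this forward-looking speed calculation on made-up pings is shown here. Only the `sec_past_st`, `odom_ft`, and `fps_next` names come from the text above; the intermediate variable names are illustrative.

```python
import pandas as pd

# Made-up pings; note the two-second step between the last two rows.
pings = pd.DataFrame(
    {"sec_past_st": [0, 1, 2, 4], "odom_ft": [0.0, 10.0, 25.0, 55.0]}
)

# Distance and time from each ping to the NEXT ping.
feet_to_next = pings["odom_ft"].shift(-1) - pings["odom_ft"]
secs_to_next = pings["sec_past_st"].shift(-1) - pings["sec_past_st"]

pings["fps_next"] = feet_to_next / secs_to_next
print(pings)
```

The last ping of a trip instance has no following ping, so its `fps_next` is missing in this sketch.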

### 2.3.2 Smoothing Speed Values

The `fps_next` speed value is smoothed using the Savitzky-Golay filter, a low-pass filter that tends to preserve the shape of speed curves well while being easy to implement. The key inputs to the filter are the degree of the fitted polynomial (we use a 3rd degree polynomial) and the number of seconds over which the filter applies (we use 21 seconds, a semi-arbitrary choice based on exploratory analysis). The use of the Savitzky-Golay filter requires some temporary transformations of rawnav data that are described in more detail in the Python function documentation.

A Kalman Filter may be the more appropriate choice for smoothing data that can be the product of sensor inaccuracy. However, implementing such a filter would require more development effort and has not been pursued to date.

The smoothing process produces the smoothed value `fps_next_sm`, but the original value `fps_next` still has a role to play in the decomposition.
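As an illustration of the filter settings named above, the sketch below applies `scipy.signal.savgol_filter` with a 21-sample window and a 3rd degree polynomial to a made-up, noisy speed profile. The use of SciPy is an assumption, since the text does not name the library. The sketch also assumes exactly one value per second and does not attempt the temporary transformations of rawnav data mentioned above.

```python
import numpy as np
from scipy.signal import savgol_filter

# Made-up speed profile (feet per second) with added noise, one value per second.
rng = np.random.default_rng(0)
seconds = np.arange(60)
true_speed = 30 * np.sin(seconds / 60 * np.pi) ** 2
fps_next = true_speed + rng.normal(0, 2, size=seconds.size)

# 21-second window, 3rd degree polynomial, as described in the text.
fps_next_sm = savgol_filter(fps_next, window_length=21, polyorder=3)
```

Because the filter fits a cubic within each window, a speed curve that is itself a cubic passes through unchanged, which is one way to see why the filter preserves the shape of smooth curves.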

### 2.3.3 Calculating Rolling Values and Derivatives of Speed

Based on the smoothed speed values, acceleration is calculated. The derivative of acceleration, 'jerk', was briefly investigated for use in the movement decomposition. Though it ultimately appeared unnecessary for decomposition, jerk is still calculated in the Python code.

Rolling averages based on smoothed values are then calculated for windows of three and nine seconds for speed, acceleration, and jerk. While smoothing functions already transform the data in ways similar to a rolling average, a separate calculation of these rolling averages is used to identify meaningful changes in the vehicle's state as described in the next section.
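These calculations can be sketched as below on a made-up smoothed speed series. Several choices here are assumptions rather than the package's behavior: one ping per second, forward differences (mirroring how `fps_next` looks to the next ping), centered windows, and the variable names `accel`, `jerk`, `fps_roll3`, and `accel_roll9`.

```python
import pandas as pd

# Made-up smoothed speeds (feet per second), one value per second.
fps_next_sm = pd.Series([0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0])

# Forward differences; with one ping per second these are per-second rates.
accel = fps_next_sm.shift(-1) - fps_next_sm  # feet per second squared
jerk = accel.shift(-1) - accel  # feet per second cubed

# Rolling averages over three and nine seconds (centered windows assumed).
fps_roll3 = fps_next_sm.rolling(window=3, center=True, min_periods=1).mean()
accel_roll9 = accel.rolling(window=9, center=True, min_periods=1).mean()
jerk_roll3 = jerk.rolling(window=3, center=True, min_periods=1).mean()
```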

### Conclusion

Overall, the process of aggregating, cleaning, interpolating, smoothing, and calculating rolling values of speed data may appear overcomplicated. Because the movement decomposition was developed iteratively, a different solution might be able to reach this result in a more straightforward manner. For now, each of the steps above fills a logical role in the preparation of rawnav data for decomposition.

## 2.4 Movement Decomposition

The movement decomposition takes place at two levels of detail:

- The "basic" decomposition, which decomposes time in a trip instance into acceleration, steady state, deceleration, stopped time, and 'other delay'. 
- The stop decomposition, which further decomposes stopped time, especially in cases where passengers are served. 

In the process of performing these decompositions, other fields that may support analysis of bus performance are also calculated, especially regarding vehicle heading. As these features are further developed, they will be expanded upon in this section.

The basic and stop decomposition steps are described below.

### 2.4.1 Basic Decomposition

The basic decomposition follows a simple logic that relies on three key parameters:

- the maximum speed of a "stopped" vehicle (`stopped_fps`)
- the maximum speed of a slow vehicle (`slow_fps`)
- the range of acceleration values allowable in a steady state (`steady_accel_thresh`)

The application of these parameters to generate the movement decomposition is described in the steps below.

1. A vehicle is stopped when its unsmoothed speed value `fps_next` drops below the `stopped_fps` threshold. The unsmoothed speed value is used because this avoids cases where smoothing or rolling averages would make the vehicle appear to be moving when it is stopped. The `stopped_fps` defaults to 3 feet per second, or about 2 miles per hour, due to a quirk in how rawnav data is recorded. Vehicles tend to stop generating new rawnav pings once stopped, but once moving again, will generate pings. As a result, over the long gap between a ping when the vehicle stops and the next ping recorded, a small, non-zero speed will usually be seen. 
2. A vehicle is in steady state when both of the following conditions hold: 
    - Its average smoothed speed over three seconds exceeds a minimum speed of 10 mph (the `slow_fps`, which defaults to 14.67 feet per second).
    - Its average acceleration over nine seconds does not exceed +/- 2 feet per second squared (the default `steady_accel_thresh`).

    Rolling averages of smoothed speed values are used to avoid cases where steady state is entered or exited frequently based on brief changes in the vehicle's movement.
3. While moving, a vehicle is in a deceleration phase between its last ping in steady state to the next time it stops. Similarly, a vehicle is in an acceleration phase between being stopped and the first steady state ping recorded. 
4. During steady state, a vehicle may at times accelerate or decelerate beyond the acceleration threshold or decrease in speed below the slow vehicle threshold without stopping. This state is identified as "other delay". Reasons for and interpretation of other delay can vary and are not elaborated on here. In some cases, if a vehicle never exceeds the speed of a slow vehicle while moving between stops, it remains in the state of 'other delay' for the entire segment.
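The four steps above can be sketched as a two-pass labeling routine. This is an illustrative reading of the rules, not the package's implementation: the function name, the `fps_roll3` and `accel_roll9` input names, and the `accel`, `decel`, and `steady` labels are assumptions, as is the choice to label unresolved pings at the very start or end of the data as other delay.

```python
def basic_decomposition(
    fps_next,
    fps_roll3,
    accel_roll9,
    stopped_fps=3.0,
    slow_fps=14.67,
    steady_accel_thresh=2.0,
):
    """Label each ping; inputs are parallel sequences of numbers."""
    # First pass: label the states that depend only on the ping itself.
    labels = []
    for raw, speed, accel in zip(fps_next, fps_roll3, accel_roll9):
        if raw < stopped_fps:
            labels.append("stopped")
        elif speed > slow_fps and abs(accel) <= steady_accel_thresh:
            labels.append("steady")
        else:
            labels.append(None)

    # Second pass: resolve each run of unlabeled pings from its neighbors.
    i = 0
    while i < len(labels):
        if labels[i] is not None:
            i += 1
            continue
        j = i
        while j < len(labels) and labels[j] is None:
            j += 1
        before = labels[i - 1] if i > 0 else None
        after = labels[j] if j < len(labels) else None
        if before == "stopped" and after == "steady":
            state = "accel"
        elif before == "steady" and after == "stopped":
            state = "decel"
        else:
            # Between two steady pings, or between two stops without
            # ever reaching steady state (assumed for data edges too).
            state = "other_delay"
        labels[i:j] = [state] * (j - i)
        i = j
    return labels


speeds = [0, 5, 12, 20, 20, 12, 20, 20, 10, 4, 0]
accels = [0, 3, 3, 1, 0, -1, 1, 0, -3, -3, 0]
print(basic_decomposition(speeds, speeds, accels))
```

For simplicity the usage example passes the same made-up series as both the unsmoothed speed and the three-second average; in practice these would differ.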

### 2.4.2 Stop Decomposition

#### Collapsing Stop Activity 

At near side stops, a vehicle will often serve passengers, move forward slightly to the intersection stop bar, and then stop again while waiting for the signal to change. In the basic decomposition framework described above, this would appear as "stopped" -> "other_delay" -> "stopped", because these brief movements rarely meet the steady state conditions of exceeding 10 mph with near-zero acceleration.

Before stop decomposition occurs, this activity around a stop is collapsed into a single group of "stopped" activity, which may in fact include vehicle movement! This process of collapsing occurs under specific conditions described in more detail in the Python code.

#### Characteristics When Stopped

A vehicle's activity in these stop groups is characterized on several dimensions:

- Door case (same for all pings in a stop group): did the doors open (`doors`) or did they remain shut (`nodoors`)?
- Vehicle state: is the vehicle stopped (`S`) with `fps_next` less than `stopped_fps`, or is the vehicle moving (`M`)? Note that while the vehicle state is a variable within the original rawnav data, it is recalculated here. 
- Door state: are doors open (`O`) or closed (`C`)? This variable is sourced directly from the original rawnav data.
- Order relative to the first door open event in the stop group: is this ping:
    - before the first door open (`pre`)
    - the first door open event (`at`)
    - after the first door open (`post`)
    - not in a stop group with a door open event (`na`)

These characteristics are combined into a single field `stop_decomp` that can be used to understand changes in vehicle behavior around a stop, such as the amount of time a vehicle spends stopped after its doors have closed and it is no longer serving passengers.
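To make the combination concrete, the sketch below labels the pings of a single stop group. The codes (`doors`, `nodoors`, `S`, `M`, `O`, `C`, `pre`, `at`, `post`, `na`) come from the list above, but the function name and the underscore-joined format of the combined value are hypothetical; the actual encoding of `stop_decomp` may differ.

```python
def stop_decomp_labels(fps_next, door_state, stopped_fps=3.0):
    """Label each ping of ONE stop group; inputs are parallel lists."""
    # Door case is shared by every ping in the stop group.
    door_case = "doors" if "O" in door_state else "nodoors"
    first_open = door_state.index("O") if door_case == "doors" else None

    labels = []
    for i, (speed, door) in enumerate(zip(fps_next, door_state)):
        veh_state = "S" if speed < stopped_fps else "M"
        if first_open is None:
            order = "na"
        elif i < first_open:
            order = "pre"
        elif i == first_open:
            order = "at"
        else:
            order = "post"
        # Hypothetical format: the four characteristics joined by underscores.
        labels.append(f"{door_case}_{veh_state}_{door}_{order}")
    return labels


print(stop_decomp_labels([5.0, 0.0, 0.0, 0.0, 4.0], ["C", "O", "O", "C", "C"]))
```

With labels of this kind, time spent stopped after the doors have closed corresponds to pings that are stopped, have closed doors, and fall after the first door open event.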

TODO: incorporate the powerpoint slides here

Additional characteristics assigned to pings at a stop are described in subsequent sections. For now, a door state of 'open' is implicitly treated as though passengers were served. This is not necessarily the case, as can be confirmed by common experiences on a bus and by the absence of APC tags on some door open events in the data. 

Note that to this point, there is no assertion that door open events occur at bus stops or that other stopping events are related to intersections. That additional inference takes place in the following step when the decomposition is compared to data sources external to rawnav.

## 2.5 Relating Decomposition to Other Stop Locations

### 2.5.1 Matching Stops

### 2.5.2 Creating Stop Segments

### 2.5.3 Trimming Trip Ends

### 2.5.4 Odometer Alignments / Resets

# 3. Summary of Decomposition Fields



In [None]:
# TODO: create a csv in the style we want, add to version control, then read-in here.