# Introduction to parsing rawnav data with `wmatarawnav`

This notebook offers an introduction to the rawnav parsing code developed for the WMATA Analysis of Fine-Grained Bus AVL to Evaluate Queue Jump Effectiveness study (Queue Jump Effectiveness study). In general, code for this project exists in two forms:

1. **Code usable for any analysis of rawnav data.** These functions for importing and cleaning rawnav data are contained in the in-development Python package `wmatarawnav`.
2. **Code specific to the Queue Jump Effectiveness study**. This code contains project-specific steps to import, clean, and analyze rawnav data using functions in the `wmatarawnav` package along the way. 

In this notebook, code usable for any analysis of rawnav data is illustrated using the custom functions contained in the `wmatarawnav` package. The actual import and cleaning process used for the Queue Jump Effectiveness study will differ slightly in form, but still makes use of these general steps. Future notebooks will illustrate additional `wmatarawnav` processing steps that combine rawnav data with outputs from GTFS, as well as the results of code specific to the Queue Jump Effectiveness study. *Note that these functions remain under development and may evolve to meet future demands during the WMATA Queue Jump Effectiveness study.*

The contents of this notebook include:

1. Creating an Inventory of Rawnav Data
2. Loading Rawnav Data
3. Cleaning Rawnav Data
4. Assembling Rawnav Data

## About Rawnav Data

For this demonstration, we'll use a small subset of rawnav files. Rawnav data is contained in a text file (e.g., "rawnav06431190501.txt") that is zipped (e.g., "rawnav06431190501.txt.zip"), typically with a matching name plus the ".zip" file type. Each text file represents an individual bus on a given day and all of its runs, including pull-out, pull-in, and travel in service. The first five digits after the "rawnav" prefix (e.g., "06431" in "rawnav06431190501.txt.zip") identify the bus ID, while the remaining digits identify the date (e.g., May 1st, 2019 from "190501"). Within each of these files is a set of tags identifying changes in the bus's state (such as the start of a run), the vehicle's location at small intervals (latitude and longitude points and other bus characteristics), and APC data (counts of boardings and alightings at the stop). Additional checks performed on the data and other details about it are included in the sections below.
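
As a minimal sketch of the naming convention described above, the snippet below splits a rawnav filename into its bus ID and date. The helper name `parse_rawnav_filename` and the regular expression are illustrative assumptions and are not part of `wmatarawnav`.

```python
import re
from datetime import datetime

# Hypothetical helper (not part of wmatarawnav). Assumes names of the form
# "rawnav" + 5-digit bus ID + 6-digit YYMMDD date + ".txt" (+ optional ".zip").
_RAWNAV_NAME = re.compile(r"^rawnav(\d{5})(\d{6})\.txt(\.zip)?$")

def parse_rawnav_filename(filename):
    """Return (bus_id, service_date) parsed from a rawnav filename."""
    match = _RAWNAV_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized rawnav filename: {filename!r}")
    bus_id, yymmdd = match.group(1), match.group(2)
    # Keep the bus ID as a string so that leading zeros are preserved.
    return bus_id, datetime.strptime(yymmdd, "%y%m%d").date()

bus_id, service_date = parse_rawnav_filename("rawnav06431190501.txt.zip")
print(bus_id, service_date)
```

Keeping the bus ID as a string avoids losing its leading zeros, which would happen if it were converted to a number.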

## Environment Setup

We begin by importing dependencies required for this notebook and the wmatarawnav package using `import wmatarawnav as wr`. 

In [1]:
import os, sys, glob, pandas as pd
sys.path.append('../..')
import wmatarawnav as wr

These import steps will differ according to the context and as the development of the package continues. Further instructions will be provided for importing these functions for use in future projects and in other situations, as well as for installing the required dependencies of `wmatarawnav`. 

Other notebook-specific options are set below.

In [123]:
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

## 1. Creating an Inventory of Rawnav Data

**Key function: find_rawnav_routes**

Often, only rawnav data specific to a particular route, pattern, day of week, or time of day is needed for further analysis. However, identifying the rawnav files containing relevant data is non-trivial: filenames do not reference particular routes or other information characterizing a bus run. This is especially problematic because the quantity of rawnav files produced in a given time period is vast. Cleaning and parsing *all* available rawnav data before restricting an analysis to a particular route or pattern would be especially time-consuming and taxing on storage resources.

The `find_rawnav_routes` function allows a user to search through rawnav files and return a dataframe containing relevant information about bus runs in each file. The inventory it returns can be used to restrict processing of rawnav data to a much smaller subset of files. In this demo, we'll look for rawnav files related to route U6.

We begin by creating a list of zipped rawnav files to examine. During an actual analysis, such a list might encompass all the rawnav files from a particular analysis period, such as October 2019. The set of zipped files included for this demonstration is listed below. 

In [124]:
ZippedFilesDirParent = os.path.join("../../data/00-raw/demo_data")
os.listdir(ZippedFilesDirParent)

['rawnav00001191015.txt.zip',
 'rawnav00001191016.txt.zip',
 'rawnav00008191007.txt.zip',
 'rawnav00008191008.txt.zip',
 'rawnav00101191016.txt.zip',
 'rawnav00500191001.txt.zip',
 'rawnav00500191002.txt.zip',
 'rawnav00500191003.txt.zip',
 'rawnav00500191004.txt.zip',
 'rawnav00500191005.txt.zip',
 'rawnav00500191006.txt.zip',
 'rawnav00500191007.txt.zip']

In case other non-rawnav files are present in this directory, we'll use `glob` to obtain a list of paths to rawnav files. 

In [125]:
FileUniverse = glob.glob(os.path.join(ZippedFilesDirParent,'rawnav*.zip'))
FileUniverse

['../../data/00-raw/demo_data\\rawnav00001191015.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00001191016.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00008191007.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00008191008.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00101191016.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00500191001.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00500191002.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00500191003.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00500191004.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00500191005.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00500191006.txt.zip',
 '../../data/00-raw/demo_data\\rawnav00500191007.txt.zip']

We pass this list of paths to rawnav files to the `find_rawnav_routes` function. Arguments to this function will be described below. 

In [126]:
rawnav_inventory = wr.find_rawnav_routes(FileUniverse, nmax = None, quiet = True)

A preview of the resulting inventory is shown below. Each row represents a combination of a text file and one of the tags located in that text file. 

Key columns include:

- *fullpath*: The path, drawn from `FileUniverse`, to the zipped rawnav file.
- *filename*: The name of the rawnav text file within the zipped folder.
- *file_busid*: The bus number obtained from the file name.
- *file_id*: The string of characters containing bus id and date information in the filename.
- *taglist*: A 'tag' with identifying information about a run. Rawnav datapoints after each tag and before the next one are associated with a run, aside from other data cleaning that takes place. This tag field is used to generate other fields and is not formatted; in particular, rawnav text files that are blank will contain only commas. 
- *line_num*: The line number on which the given tag is found. This is essential for rawnav parsing conducted later.
- *route_pattern*: The route and pattern information identified in the tag.
- *tag_busid*: The bus id found in the tag. Generally, this value should match *file_busid*.
- *tag_date*: The date found in the tag, stored as a datetime value. This date should match the date implied by *file_id*.
- *Unk1*: An unknown field found in the tag and returned for completeness.
- *tag_time*: The timestamp of the run start time, stored as a character.
- *CanBeMiFt*: An unknown field found in the tag and returned for completeness, believed to be a conversion factor from feet to miles. Its value is always 05280 where present.
- *route*: The route extracted from *route_pattern*
- *pattern*: The pattern extracted from *route_pattern*
- *tag_datetime*: The date and time of the tag, stored as a datetime value. This combines the *tag_date* and *tag_time* fields.
- *tag_starthour*: The hour of the start time of the tag (*tag_time*), stored as a numeric value. 
- *wday*: The day of the week of the tag.

From this, we can see that the files rawnav00001191015.txt and rawnav00001191016.txt were empty, while rawnav00008191007.txt included a number of tags related to route U6.

Note that the function contains several additional arguments:

- *nmax*: For testing purposes, the number of rawnav files parsed can be limited to a set number of files. The default, `None`, makes no limitation. 
- *quiet*: To track the inventorying process, `quiet` may be set to `False` to see each filename printed in turn as it is parsed.

In [127]:
rawnav_inventory

Unnamed: 0,fullpath,filename,file_busid,file_id,taglist,line_num,route_pattern,tag_busid,tag_date,tag_time,Unk1,CanBeMiFt,route,pattern,tag_datetime,tag_starthour,wday
0,../../data/00-raw/demo_data\rawnav00001191015....,rawnav00001191015.txt,1,00001191015,",,,,,,",,,,NaT,,,,,,NaT,,
1,../../data/00-raw/demo_data\rawnav00001191016....,rawnav00001191016.txt,1,00001191016,",,,,,,",,,,NaT,,,,,,NaT,,
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,00008191007,"9,PO04726,8,10/06/19,05:15:24,36476,05280",9,PO04726,8.0,2019-10-06,05:15:24,36476,05280,,,2019-10-06 05:15:24,5.0,Sunday
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,00008191007,"608, U601,8,10/06/19,05:36:41,36476,05280",608,U601,8.0,2019-10-06,05:36:41,36476,05280,U6,01,2019-10-06 05:36:41,5.0,Sunday
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,00008191007,"1819, U602,8,10/06/19,06:00:10,36476,05280",1819,U602,8.0,2019-10-06,06:00:10,36476,05280,U6,02,2019-10-06 06:00:10,6.0,Sunday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7,../../data/00-raw/demo_data\rawnav00500191003....,rawnav00500191003.txt,500,00500191003,",,,,,,",,,,NaT,,,,,,NaT,,
8,../../data/00-raw/demo_data\rawnav00500191004....,rawnav00500191004.txt,500,00500191004,",,,,,,",,,,NaT,,,,,,NaT,,
9,../../data/00-raw/demo_data\rawnav00500191005....,rawnav00500191005.txt,500,00500191005,",,,,,,",,,,NaT,,,,,,NaT,,
10,../../data/00-raw/demo_data\rawnav00500191006....,rawnav00500191006.txt,500,00500191006,",,,,,,",,,,NaT,,,,,,NaT,,
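
To make the derived date and time columns concrete, the sketch below parses a single tag string from the inventory above using only the standard library. The function `parse_tag` is a hypothetical illustration of how such fields could be derived; it is not the parsing logic used inside `find_rawnav_routes`, and it deliberately leaves the route and pattern split to the package.

```python
from datetime import datetime

def parse_tag(taglist):
    """Split one raw tag string into named fields (illustrative only)."""
    parts = [part.strip() for part in taglist.split(",")]
    if len(parts) != 7 or not all(parts):
        # Tags from blank files contain only commas, so there is nothing to parse.
        return None
    line_num, route_pattern, busid, tag_date, tag_time, unk1, can_be_mi_ft = parts
    tag_datetime = datetime.strptime(f"{tag_date} {tag_time}", "%m/%d/%y %H:%M:%S")
    return {
        "line_num": int(line_num),
        "route_pattern": route_pattern,
        "tag_busid": int(busid),
        "tag_datetime": tag_datetime,
        "tag_starthour": tag_datetime.hour,
        "wday": tag_datetime.strftime("%A"),
        "Unk1": unk1,
        "CanBeMiFt": can_be_mi_ft,
    }

tag = parse_tag("608, U601,8,10/06/19,05:36:41,36476,05280")
print(tag["route_pattern"], tag["tag_datetime"], tag["wday"])
```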


With this inventory in hand, we will want to restrict our parsing only to files matching a set of criteria. 

Many options exist for filtering this inventory data frame, many of which will be specific to a given analysis. In this simple case, we'll restrict parsing to files containing route 'U6'. The example below uses pandas to filter the `rawnav_inventory` dataframe against a list of analysis routes, keeping only files that contain at least one of those routes; the result is stored in `rawnav_inventory_filtered`. Currently, the filtered inventory must include all tags from those files (even the irrelevant ones) to accurately track where runs start and end within a file, and the line number field must be converted to an integer for iteration and other checks further below. From this, the first tag in each file is returned (`rawnav_inv_filt_first`) for use in the iteration below. This behavior may change in the future.

In [128]:
AnalysisRoutes = ['U6']
rawnav_inventory_filtered = rawnav_inventory[rawnav_inventory.groupby('filename')['route'].transform(lambda x: x.isin(AnalysisRoutes).any())]
rawnav_inventory_filtered = rawnav_inventory_filtered.astype({"line_num": 'int'})
rawnav_inv_filt_first = rawnav_inventory_filtered.groupby(['fullpath','filename']).line_num.min().reset_index()
rawnav_inventory_filtered.head()

Unnamed: 0,fullpath,filename,file_busid,file_id,taglist,line_num,route_pattern,tag_busid,tag_date,tag_time,Unk1,CanBeMiFt,route,pattern,tag_datetime,tag_starthour,wday
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"9,PO04726,8,10/06/19,05:15:24,36476,05280",9,PO04726,8.0,2019-10-06,05:15:24,36476,5280,,,2019-10-06 05:15:24,5.0,Sunday
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"608, U601,8,10/06/19,05:36:41,36476,05280",608,U601,8.0,2019-10-06,05:36:41,36476,5280,U6,1.0,2019-10-06 05:36:41,5.0,Sunday
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"1819, U602,8,10/06/19,06:00:10,36476,05280",1819,U602,8.0,2019-10-06,06:00:10,36476,5280,U6,2.0,2019-10-06 06:00:10,6.0,Sunday
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"2489, U601,8,10/06/19,06:21:00,36476,05280",2489,U601,8.0,2019-10-06,06:21:00,36476,5280,U6,1.0,2019-10-06 06:21:00,6.0,Sunday
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"3645, U602,8,10/06/19,06:48:51,36476,05280",3645,U602,8.0,2019-10-06,06:48:51,36476,5280,U6,2.0,2019-10-06 06:48:51,6.0,Sunday
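
The file-level filter can be easier to follow on a toy inventory. The sketch below uses made-up filenames and line numbers (assumptions for illustration only) and an equivalent formulation of the mask: it flags tags on an analysis route, then keeps every tag from any file with at least one flagged tag.

```python
import pandas as pd

# Toy inventory: a.txt has a pull-out tag and a U6 tag, b.txt only has S9 tags,
# and c.txt is an empty file with no route or line number.
inventory = pd.DataFrame({
    "filename": ["a.txt", "a.txt", "b.txt", "b.txt", "c.txt"],
    "route": [None, "U6", "S9", "S9", None],
    "line_num": [9, 608, 5, 300, float("nan")],
})
analysis_routes = ["U6"]

# Flag tags on an analysis route, then broadcast "any tag in this file matched"
# back to every row of that file.
on_route = inventory["route"].isin(analysis_routes)
keep = on_route.groupby(inventory["filename"]).transform("any")
filtered = inventory[keep].astype({"line_num": "int"})

# The first tag in each remaining file determines how many lines to skip on load.
first_tags = filtered.groupby("filename")["line_num"].min().reset_index()
print(filtered)
print(first_tags)
```

Note that the pull-out tag in `a.txt` is retained even though its route is missing, and that the integer conversion only succeeds once empty files such as `c.txt` have been filtered out.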


## 2. Loading Rawnav Data

**Key function: load_rawnav_data**

With a filtered inventory of rawnav files to parse, we begin to load each file into memory before cleaning begins in the next step. The process to iterate over files is not included within the `wmatarawnav` package but is left to project-specific implementation; in certain circumstances, it may be more appropriate to run loading and processing steps (see next steps) for each file in turn and then place these cleaned results onto a SQL server table. The process that follows is drawn from the WMATA Queue Jump Effectiveness study process and can be run from a single user's laptop.

The loading process takes place in the cell below. In short:

1. A Python dictionary `RouteRawTagDict` is created to store the loaded rawnav files.
2. For each file in the filtered rawnav inventory, the following steps are performed:
    1. Identify the tags associated with the file
    2. Find the line number on which the first tag appears
    3. Unzip the rawnav file, skipping the appropriate number of lines, using the `load_rawnav_data` function
    4. Store the loaded rawnav data in `RouteRawTagDict`. 
    
Each entry in the `RouteRawTagDict` dictionary is keyed to a particular filename (e.g., "rawnav00008191007.txt"). Each of these dictionary entries will in turn contain another dictionary containing two dataframes:

1. `tagLineInfo`: Tag info for all tags in the file
2. `RawData`: The data found in the file

Note that the data is not yet cleaned, and data not associated with the desired route is likely to be present. This dictionary-based, hierarchical format helps to accommodate the fact that rawnav tables are not yet cleanly formatted.

In [129]:
RouteRawTagDict = {}

for index, row in rawnav_inv_filt_first.iterrows():
    # Copy the slice so that adding a column does not modify (or warn about) the filtered inventory.
    tagInfo_LineNo = rawnav_inventory_filtered[rawnav_inventory_filtered['filename'] == row['filename']].copy()
    Reference = min(tagInfo_LineNo.line_num)
    tagInfo_LineNo.loc[:,"NewLineNo"] = tagInfo_LineNo.line_num - Reference - 1
    # file_id gets messy: string-to-number conversion loses the leading zeros, so "filename" is easier to work with.
    temp = wr.load_rawnav_data(ZipFolderPath = row['fullpath'], skiprows = row['line_num'])
    RouteRawTagDict[row['filename']] = {'RawData':temp,'tagLineInfo':tagInfo_LineNo}

We can see a preview of the contained data for RawData and tagLineInfo using the `get` method.

In [130]:
RouteRawTagDict.get("rawnav00008191007.txt").get("RawData")

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,38.921298,-76.969803,312,C,S,0.0,0.0,17.0,,9.0,38.921298,-76.969803
1,38.921322,-76.969838,314,C,M,15.0,264.0,16.0,,0.0,38.921312,-76.969818
2,38.921350,-76.969887,315,C,M,32.0,266.0,16.0,,0.0,38.921335,-76.969862
3,38.921380,-76.969937,317,C,M,50.0,268.0,16.0,,0.0,38.921365,-76.969912
4,38.921417,-76.969983,322,C,M,69.0,270.0,17.0,,0.0,38.921398,-76.969962
...,...,...,...,...,...,...,...,...,...,...,...,...
30104,/ 21:53:03 Buswares is Shutting down WaitResul...,102276.000000,,,,,,,,,,
30105,/ 22:54:43 BWRawNav Collection Module was STAR...,,,,,,,,,,,
30106,cal,0.000000,0,,,,,,,,,
30107,cal,96448.000000,158,,,,,,,,,


In [131]:
RouteRawTagDict.get("rawnav00008191007.txt").get("tagLineInfo").head()

Unnamed: 0,fullpath,filename,file_busid,file_id,taglist,line_num,route_pattern,tag_busid,tag_date,tag_time,Unk1,CanBeMiFt,route,pattern,tag_datetime,tag_starthour,wday,NewLineNo
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"9,PO04726,8,10/06/19,05:15:24,36476,05280",9,PO04726,8.0,2019-10-06,05:15:24,36476,5280,,,2019-10-06 05:15:24,5.0,Sunday,-1
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"608, U601,8,10/06/19,05:36:41,36476,05280",608,U601,8.0,2019-10-06,05:36:41,36476,5280,U6,1.0,2019-10-06 05:36:41,5.0,Sunday,598
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"1819, U602,8,10/06/19,06:00:10,36476,05280",1819,U602,8.0,2019-10-06,06:00:10,36476,5280,U6,2.0,2019-10-06 06:00:10,6.0,Sunday,1809
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"2489, U601,8,10/06/19,06:21:00,36476,05280",2489,U601,8.0,2019-10-06,06:21:00,36476,5280,U6,1.0,2019-10-06 06:21:00,6.0,Sunday,2479
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"3645, U602,8,10/06/19,06:48:51,36476,05280",3645,U602,8.0,2019-10-06,06:48:51,36476,5280,U6,2.0,2019-10-06 06:48:51,6.0,Sunday,3635
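
The `NewLineNo` values shown above follow directly from the arithmetic in the loading loop: each tag's line number is shifted by the line number of the first tag in the file, minus one. The short sketch below reproduces that calculation for the first five tags of rawnav00008191007.txt in plain Python.

```python
# Line numbers of the first five tags in rawnav00008191007.txt (from the inventory).
line_nums = [9, 608, 1819, 2489, 3645]

# The first tag's line number is the reference point and also the number of
# lines skipped when the file is loaded.
reference = min(line_nums)
new_line_nos = [n - reference - 1 for n in line_nums]

print(new_line_nos)  # [-1, 598, 1809, 2479, 3635]
```

The first tag receives -1 because it sits on the last skipped line, just before row 0 of `RawData`.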


## 3. Cleaning Rawnav Data

**Key function: clean_rawnav_data**

With each rawnav file loaded into memory, the cleaning process begins. Several steps take place within the `clean_rawnav_data` function:

1. End of runs are identified.
2. APC and other tags are removed, though the locations of these removed indicators are noted.
3. A summary of the rawnav run is generated.

As before, iteration is left to project-specific code to provide flexibility for future implementations of the `wmatarawnav` functions. An example is shown below based on the WMATA Queue Jump Effectiveness study.

In [132]:
RawnavDataDict = {}
SummaryDataDict = {}

for key, datadict in RouteRawTagDict.items():
    Temp = wr.clean_rawnav_data(datadict, key)
    RawnavDataDict[key] = Temp['rawnavdata']
    SummaryDataDict[key] = Temp['SummaryData']


The cleaned results for each rawnav file are returned in two dictionaries:

1. RawnavDataDict: A dictionary containing a dataframe for each rawnav file with APC and other tags removed. Added fields include:
    - *route_pattern*, *route*, *pattern*: The route and pattern associated with the rawnav record
    - *IndexTripStartInCleanData*: The row number identifying the start of an associated run in the rawnav data. Rawnav entries from the same run will share the value in this field. Together, *filename* and *IndexTripStartInCleanData* uniquely identify a bus run. 
    - *IndexTripEndInCleanData*: The row number identifying the end of an associated run in the rawnav data. Rawnav entries from the same run will share the value in this field.

An example is shown below for the file rawnav00008191007.txt.

In [133]:
RawnavDataDict.get("rawnav00008191007.txt")

Unnamed: 0,IndexLoc,Lat,Long,Heading,DoorState,VehState,OdomtFt,SecPastSt,SatCnt,StopWindow,Blank,LatRaw,LongRaw,RowBeforeAPC,route_pattern,route,pattern,IndexTripStartInCleanData,IndexTripEndInCleanData,filename
0,0,38.921298,-76.969803,312,C,S,0.0,0.0,17.0,,9.0,38.921298,-76.969803,0,PO04726,,,0.0,597.0,rawnav00008191007.txt
1,1,38.921322,-76.969838,314,C,M,15.0,264.0,16.0,,0.0,38.921312,-76.969818,0,PO04726,,,0.0,597.0,rawnav00008191007.txt
2,2,38.921350,-76.969887,315,C,M,32.0,266.0,16.0,,0.0,38.921335,-76.969862,0,PO04726,,,0.0,597.0,rawnav00008191007.txt
3,3,38.921380,-76.969937,317,C,M,50.0,268.0,16.0,,0.0,38.921365,-76.969912,0,PO04726,,,0.0,597.0,rawnav00008191007.txt
4,4,38.921417,-76.969983,322,C,M,69.0,270.0,17.0,,0.0,38.921398,-76.969962,0,PO04726,,,0.0,597.0,rawnav00008191007.txt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29667,30098,38.922097,-76.969265,215,C,M,31398.0,1018.0,15.0,,0.0,38.922137,-76.969228,0,PI04323,,,29286.0,30102.0,rawnav00008191007.txt
29668,30099,38.922060,-76.969298,215,C,M,31413.0,1019.0,15.0,,0.0,38.922097,-76.969265,0,PI04323,,,29286.0,30102.0,rawnav00008191007.txt
29669,30100,38.922032,-76.969323,214,C,M,31425.0,1020.0,15.0,,0.0,38.922060,-76.969298,0,PI04323,,,29286.0,30102.0,rawnav00008191007.txt
29670,30101,38.922002,-76.969348,214,C,M,31436.0,1022.0,15.0,,0.0,38.922013,-76.969338,0,PI04323,,,29286.0,30102.0,rawnav00008191007.txt


2. SummaryDataDict: A dictionary containing summary data entries for each run in a rawnav file. Additional fields include:
    - *TripDurFromSec*: The duration of the trip calculated as the difference between the timestamp of the first record for a run and the timestamp of the last record for a run.
    - *TripDurationFromTags*: The duration of the trip calculated from the current run's tag date and time to the next run's tag date and time. 
    - *DistOdomMi*: The distance in miles from the first record for a run to the last record for a run based on the odometer.
    - *SpeedOdomMPH*: The average speed of the trip in miles per hour based on *DistOdomMi* and *TripDurFromSec*.
    - *SpeedTripTagMPH*: The average speed of the trip in miles per hour based on *DistOdomMi* and *TripDurationFromTags*.
    - *CrowFlyDistLatLongMi*: The distance between the coordinates of the first and last record for a run.
    - Latitude and Longitude of Start and End observations: The final four columns identify the locations of the first and last rawnav observations for a run. 
   
Note that the runs contained may still include pull-outs, pull-ins, and bus runs where the trips were cut short, ran long, or are otherwise unusable for analysis. Moreover, the first and last records of a trip may later be modified to account for additional dwell at the first and last stops of a run. To minimize computationally intensive data cleaning efforts, additional data cleaning steps related to these items are postponed until the data has been further subset.
   
An example of the Summary output is shown below for the file rawnav00008191007.txt.

In [134]:
SummaryDataDict.get("rawnav00008191007.txt").head()

Unnamed: 0,fullpath,filename,file_busid,file_id,taglist,route_pattern,tag_busid,route,pattern,wday,...,TripDurFromSec,TripDurationFromTags,DistOdomMi,SpeedOdomMPH,SpeedTripTagMPH,CrowFlyDistLatLongMi,LatStart,LongStart,LatEnd,LongEnd
0,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"9,PO04726,8,10/06/19,05:15:24,36476,05280",PO04726,8.0,,,Sunday,...,1277,00:21:17,4.383712,12.358155,12.36,2.005349,38.921298,-76.969803,38.89852,-76.9467
1,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"608, U601,8,10/06/19,05:36:41,36476,05280",U601,8.0,U6,1.0,Sunday,...,1409,00:23:29,5.343561,13.652816,13.65,1.039395,38.89852,-76.9467,38.894867,-76.927972
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"1819, U602,8,10/06/19,06:00:10,36476,05280",U602,8.0,U6,2.0,Sunday,...,752,00:12:32,3.420076,16.372703,16.37,1.034329,38.894867,-76.927972,38.898232,-76.946692
3,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"2489, U601,8,10/06/19,06:21:00,36476,05280",U601,8.0,U6,1.0,Sunday,...,1671,00:27:51,5.295644,11.408928,11.41,1.054953,38.898305,-76.947085,38.89501,-76.927952
4,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"3645, U602,8,10/06/19,06:48:51,36476,05280",U602,8.0,U6,2.0,Sunday,...,830,00:13:50,3.429167,14.873494,14.87,1.030809,38.89501,-76.927952,38.898237,-76.946645
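
The relationships among the summary fields can be checked by hand against the first row above. The sketch below is an illustration, not the package's implementation: the odometer reading of 23,146 ft is an assumption chosen because it reproduces the *DistOdomMi* shown, and the haversine formula is a stand-in for whatever distance method `wmatarawnav` uses, so only approximate agreement should be expected for the crow-fly distance.

```python
import math
from datetime import datetime

FEET_PER_MILE = 5280

def haversine_miles(lat1, lon1, lat2, lon2, radius_mi=3958.8):
    """Great-circle distance in miles (stand-in for the package's method)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_mi * math.asin(math.sqrt(a))

# TripDurationFromTags: time between this run's tag and the next run's tag.
tag_duration = datetime(2019, 10, 6, 5, 36, 41) - datetime(2019, 10, 6, 5, 15, 24)

# DistOdomMi and SpeedOdomMPH from an assumed odometer change and TripDurFromSec.
odom_ft = 23146          # assumed value, consistent with DistOdomMi = 4.383712
trip_dur_sec = 1277      # TripDurFromSec for the first run
dist_odom_mi = odom_ft / FEET_PER_MILE
speed_odom_mph = dist_odom_mi / (trip_dur_sec / 3600)

# CrowFlyDistLatLongMi between the first and last records of the run.
crow_fly_mi = haversine_miles(38.921298, -76.969803, 38.89852, -76.9467)

print(tag_duration, round(dist_odom_mi, 6), round(speed_odom_mph, 2), round(crow_fly_mi, 2))
```

For this pull-out run, the odometer distance is more than twice the straight-line distance, which is one reason both measures are reported.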


## 4. Assembling Rawnav Data

**Key function: subset_rawnav_trip1**

From here, one might begin assembling the cleaned rawnav data and summary data frames into a single dataframe. An example is shown below, first for the summary data. As mentioned earlier, iteration will be customized to fit project-specific needs, but the specific methods shown here are used for the Queue Jump Effectiveness study.

In [135]:
# Stack the per-file summary tables into a single dataframe.
FinSummaryDat = pd.concat(SummaryDataDict.values())
FinSummaryDat.loc[:,"count1"]=FinSummaryDat.groupby(['filename','IndexTripStartInCleanData'])['IndexTripStartInCleanData'].transform('count')
# In cases where two tags appear back-to-back, they may have the same identifying information for 'IndexTripStartInCleanData'.
# For now, only one of these records is kept to avoid creating apparently duplicate entries, though these entries will 
# ultimately be removed downstream in any case. 
# In the future, more of these cleaning steps may be moved to the cleaning function.
IssueDat = FinSummaryDat.query('count1>1')
IssueDat = FinSummaryDat.query('IndexTripStartInCleanData>IndexTripEnd')
FinSummaryDat = FinSummaryDat[~FinSummaryDat.duplicated(['filename','IndexTripStartInCleanData'],keep='last')] 

In [136]:
FinSummaryDat.head()

Unnamed: 0,fullpath,filename,file_busid,file_id,taglist,route_pattern,tag_busid,route,pattern,wday,...,TripDurationFromTags,DistOdomMi,SpeedOdomMPH,SpeedTripTagMPH,CrowFlyDistLatLongMi,LatStart,LongStart,LatEnd,LongEnd,count1
0,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"9,PO04726,8,10/06/19,05:15:24,36476,05280",PO04726,8.0,,,Sunday,...,00:21:17,4.383712,12.358155,12.36,2.005349,38.921298,-76.969803,38.89852,-76.9467,1
1,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"608, U601,8,10/06/19,05:36:41,36476,05280",U601,8.0,U6,1.0,Sunday,...,00:23:29,5.343561,13.652816,13.65,1.039395,38.89852,-76.9467,38.894867,-76.927972,1
2,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"1819, U602,8,10/06/19,06:00:10,36476,05280",U602,8.0,U6,2.0,Sunday,...,00:12:32,3.420076,16.372703,16.37,1.034329,38.894867,-76.927972,38.898232,-76.946692,1
3,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"2489, U601,8,10/06/19,06:21:00,36476,05280",U601,8.0,U6,1.0,Sunday,...,00:27:51,5.295644,11.408928,11.41,1.054953,38.898305,-76.947085,38.89501,-76.927952,1
4,../../data/00-raw/demo_data\rawnav00008191007....,rawnav00008191007.txt,8,8191007,"3645, U602,8,10/06/19,06:48:51,36476,05280",U602,8.0,U6,2.0,Sunday,...,00:13:50,3.429167,14.873494,14.87,1.030809,38.89501,-76.927952,38.898237,-76.946645,1
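
The duplicate handling used here can be seen more clearly on a toy summary table. In the sketch below, the filenames and index values are made up for illustration: two back-to-back tags share the same trip start index, the `count1` column flags them, and `duplicated(..., keep='last')` retains only the later of the two.

```python
import pandas as pd

# Toy summary table: the second and third rows are back-to-back tags that
# share the same start index within the same file.
summary = pd.DataFrame({
    "filename": ["f1.txt", "f1.txt", "f1.txt", "f2.txt"],
    "IndexTripStartInCleanData": [0, 599, 599, 0],
    "route_pattern": ["PO04726", "U601", "U602", "U601"],
})

keys = ["filename", "IndexTripStartInCleanData"]

# Count how many summary rows share each (file, start index) pair.
summary["count1"] = summary.groupby(keys)["IndexTripStartInCleanData"].transform("count")
issues = summary.query("count1 > 1")

# Drop all but the last row of each duplicated pair.
deduped = summary[~summary.duplicated(keys, keep="last")]
print(deduped)
```

Inspecting `issues` before dropping rows gives a record of which runs were affected.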


Similar combining of records for the rawnav data itself is shown below. In this case, additional fields from the summary file (such as the start time and day of week) are joined with the individual rawnav records to provide more context for particular rawnav entries for the Queue Jump Effectiveness study. The process is as follows:

1. Dataframes containing rawnav data are subset to those in the selected routes and combined. A helper function `subset_rawnav_trip1` is used to assist with this process.
2. Relevant data from the summary table is joined to these records. 
3. Minor data cleaning is performed. These steps may be moved into other rawnav functions in the future.

In [137]:
FinDat = wr.subset_rawnav_trip1(RawnavDataDict, rawnav_inventory_filtered, AnalysisRoutes)
temp = FinSummaryDat[['filename','IndexTripStartInCleanData','wday','StartDateTime']]
FinDat = FinDat.merge(temp, on = ['filename','IndexTripStartInCleanData'],how='left')
FinDat = FinDat.assign(Lat = lambda x: x.Lat.astype('float'),
                           Heading = lambda x: x.Heading.astype('float'),
                           IndexTripStartInCleanData =lambda x: x.IndexTripStartInCleanData.astype('int'),
                           IndexTripEndInCleanData =lambda x: x.IndexTripEndInCleanData.astype('int'))


In [138]:
FinDat.head()

Unnamed: 0,IndexLoc,Lat,Long,Heading,DoorState,VehState,OdomtFt,SecPastSt,SatCnt,StopWindow,...,LongRaw,RowBeforeAPC,route_pattern,route,pattern,IndexTripStartInCleanData,IndexTripEndInCleanData,filename,wday,StartDateTime
0,599,38.89852,-76.9467,222.0,C,M,0.0,0.0,17.0,,...,-76.94666,0,U601,U6,1,599,1807,rawnav00008191007.txt,Sunday,2019-10-06 05:36:41
1,600,38.898487,-76.946737,222.0,C,M,12.0,1.0,17.0,,...,-76.9467,0,U601,U6,1,599,1807,rawnav00008191007.txt,Sunday,2019-10-06 05:36:41
2,601,38.898487,-76.946737,222.0,C,M,13.0,1.0,17.0,X-1,...,-76.9467,0,U601,U6,1,599,1807,rawnav00008191007.txt,Sunday,2019-10-06 05:36:41
3,602,38.898457,-76.946775,223.0,C,M,29.0,2.0,17.0,,...,-76.946737,0,U601,U6,1,599,1807,rawnav00008191007.txt,Sunday,2019-10-06 05:36:41
4,603,38.898422,-76.946815,223.0,C,M,47.0,3.0,17.0,,...,-76.946775,0,U601,U6,1,599,1807,rawnav00008191007.txt,Sunday,2019-10-06 05:36:41
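
The join between point-level records and run-level summary fields can be sketched on toy data as follows. The values below are hypothetical, and the `validate="many_to_one"` argument is an optional safeguard added for this illustration rather than something the notebook's code uses: it makes pandas raise an error if the summary table still contains duplicate keys, which would otherwise silently multiply rawnav rows.

```python
import pandas as pd

keys = ["filename", "IndexTripStartInCleanData"]

# Toy point-level records: two points from one run and one from another.
points = pd.DataFrame({
    "filename": ["f1.txt", "f1.txt", "f1.txt"],
    "IndexTripStartInCleanData": [599, 599, 1810],
    "OdomtFt": [0.0, 12.0, 0.0],
})

# Toy run-level summary: one row per (file, start index) pair.
runs = pd.DataFrame({
    "filename": ["f1.txt", "f1.txt"],
    "IndexTripStartInCleanData": [599, 1810],
    "wday": ["Sunday", "Sunday"],
    "StartDateTime": pd.to_datetime(["2019-10-06 05:36:41", "2019-10-06 06:00:10"]),
})

# A left join keeps every point and attaches its run's attributes.
joined = points.merge(runs, on=keys, how="left", validate="many_to_one")
print(joined)
```

Because the join is many-to-one, the number of rows in `joined` matches the number of rows in `points`.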


## Conclusion

This concludes the introduction to parsing rawnav data using the reusable functions developed for the Queue Jump Effectiveness study. Additional documentation on these functions is provided in the package documentation for `wmatarawnav`. Future notebooks will illustrate the process of connecting GTFS data to these entries and making use of this data for the Queue Jump Effectiveness study.