# Clean skims
Skim data used in the West Station Area study were provided as csv files, which are imported into emma's `Skim` class. In some cases, the raw skim data needed to be tidied up before import. For example, transit skims often include records for all possible origin-destination (OD) pairs, even though only a fraction of these pairs are actually connected by transit. Removing pairs where no transit connections are available speeds up the import process.

Moreover, column naming is sometimes inconsistent across scenarios. The skim cleaning process also offers the opportunity to rename columns before importing, making the import process more reliable and predictable.

To begin the cleaning process, import supporting modules. This initial cell shows how to connect to the emma package and the West Station scripting module; the same cell appears throughout the tutorials.

In [1]:
import os
import pathlib
import sys

# Tutorials will routinely add the `weststation` module from a relative path
rel_path = pathlib.Path(os.getcwd()).parents[1]
sys.path.append(str(rel_path))

# Provide a path to emma:
sys.path.append(r"K:\Tools\RP\emma\scripts")

import weststation as wsa


Next, point to a skim and preview its contents. This preview returns the first few rows of a file to review contents and column names.

In [2]:
# Change the path to point to tutorial data.
os.chdir(os.path.join(rel_path, "TutorialData"))

in_file = r"K:\Projects\MAPC\WestStationScenarios\input\skims\LRTP_gc_parking_tt\WAT_AM_full.csv"
wsa.wsafuncs.previewSkim(in_file)

Unnamed: 0,from_zone_id,to_zone_id,pair_ID,Fare,Generalized_Cost,Total_Cost,Total_IVTT,Total_OVTT
0,1000,1,1000_1,3.37,33.564674,7.37,84.828659,28.072369
1,1000,10,1000_10,3.37,31.575066,5.37,79.578659,28.223333
2,1000,100,1000_100,11.02,37.254036,13.02,61.356087,32.407063
3,1000,1000,1000_1000,0.0,0.0,0.0,0.0,0.0
4,1000,1001,1000_1001,1.02,10.841872,1.02,2.340853,23.384254
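
The internals of `previewSkim` are not shown in this tutorial. As a rough sketch of the idea, assuming only the standard library, the header and first few records of a csv can be read without loading the whole file. The function name `preview_csv` and the in-memory sample are hypothetical stand-ins, not part of `weststation`.

```python
import csv
import io
from itertools import islice


def preview_csv(file_obj, n_rows=5):
    """Return the header and the first n_rows records of an open csv file."""
    reader = csv.reader(file_obj)
    header = next(reader)               # first line holds the column names
    rows = list(islice(reader, n_rows))  # stop reading after n_rows records
    return header, rows


# Small in-memory stand-in for a skim file (a subset of the columns above).
sample = io.StringIO(
    "from_zone_id,to_zone_id,Generalized_Cost\n"
    "1000,1,33.564674\n"
    "1000,10,31.575066\n"
    "1000,1000,0.0\n"
)
header, rows = preview_csv(sample, n_rows=2)
print(header)
print(rows)
```

Because only the requested rows are read, a preview like this stays fast even for skims with tens of millions of records.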


Specify columns to be renamed. Defining names consistently during the skim cleaning process simplifies the import step. During the West Station study, inconsistencies in column names were observed for the "Generalized Cost", "Total Cost", and "Access Drive Distance" fields in transit skims. Renaming conventions were applied to set these columns to "GenCost", "TotCost", and "DriveDist", respectively. Column renaming specs are provided in a dictionary. The dictionary only needs to include columns to be renamed (all others keep their existing names), and it can include column names that aren't found in the csv file (these are simply ignored). Thus, a single comprehensive renaming dictionary can be built to cover every naming variant encountered during the skim cleaning process.

In [4]:
# If the key is found as a column in the csv, it will be renamed 
#  to the value.
rename = {
    "Generalized_Cost": "GenCost",
    "GeneralizedCost": "GenCost",
    "Total_Cost": "TotCost",
    "TotalCost": "TotCost",
    "Access_Drive_Distance": "DriveDist",
    "Access_Drive_Dist": "DriveDist",
    "AccessDriveDist": "DriveDist"
}
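
The renaming behavior described above (matching keys are renamed, everything else passes through, and unused keys are ignored) can be illustrated with a minimal sketch. The helper `rename_columns` is hypothetical and is not necessarily how `cleanSkims` applies the dictionary.

```python
def rename_columns(columns, rename):
    """Map each column through the rename dict; unmatched columns pass through."""
    return [rename.get(col, col) for col in columns]


# Header columns taken from the preview of WAT_AM_full.csv
columns = ["from_zone_id", "to_zone_id", "Fare", "Generalized_Cost", "Total_Cost"]

# Keys that do not appear in the header ("GeneralizedCost",
# "Access_Drive_Dist") have no effect.
spec = {
    "Generalized_Cost": "GenCost",
    "GeneralizedCost": "GenCost",
    "Total_Cost": "TotCost",
    "Access_Drive_Dist": "DriveDist",
}

new_columns = rename_columns(columns, spec)
print(new_columns)
```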

Specify criteria for which rows to include in the clean skim. Rows that have valid values for cost variables should be retained. In many cases, transit skims include rows for OD pairs with no transit connections, reflected in suspicious generalized cost estimates (e.g., 0.0 or 99999.0). Filtering criteria are provided as tuples with three components:

1. Column name: The column whose values will be reviewed for filtering. Give the name as it appears after renaming.
2. Comparison method: The comparator to apply when filtering (equal to, less than, greater than or equal to, etc.).
3. Value: The value to compare this column's value against.

The comparison method is provided as a string naming one of Python's rich comparison methods, each of which corresponds to a built-in comparison operator:
 - `__eq__()` = equals [==]
 - `__ne__()` = not equal to [!=]
 - `__lt__()` = less than [<]
 - `__le__()` = less than or equal to [<=]
 - `__gt__()` = greater than [>]
 - `__ge__()` = greater than or equal to [>=]
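
To make the link between a method-name string and a comparison concrete, here is a minimal sketch that looks up the matching function in Python's standard `operator` module. The `COMPARATORS` table and the `compare` helper are hypothetical illustrations; the tutorial does not show how `cleanSkims` resolves these strings internally.

```python
import operator

# Map each method-name string to the equivalent function in `operator`.
COMPARATORS = {
    "__eq__": operator.eq,
    "__ne__": operator.ne,
    "__lt__": operator.lt,
    "__le__": operator.le,
    "__gt__": operator.gt,
    "__ge__": operator.ge,
}


def compare(value, method, reference):
    """Apply the comparison named by `method` to value and reference."""
    return COMPARATORS[method](value, reference)


print(compare(33.564674, "__lt__", 99999))  # True: a plausible cost
print(compare(0.0, "__ne__", 0))            # False: a zero cost fails the test
```

Going through `operator` rather than calling the method on the value directly sidesteps cases where a method such as `int.__lt__` returns `NotImplemented` for mixed integer and float operands.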


In [3]:
# These criteria will retain rows where values in "GenCost"
#  (the renamed "Generalized_Cost" column) are less than 99999
#  and not equal to 0.
criteria = [
    ("GenCost", "__lt__", 99999),
    ("GenCost", "__ne__", 0)
]
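
As a sketch of how a list of criteria like this could be applied, the example below keeps a row only when every criterion holds, matching the "less than 99999 and not equal to 0" reading in the comment above. The `keep_row` helper and the sample rows are hypothetical (the third row is invented to show the 99999 case), and csv values are assumed to arrive as strings that need converting to numbers.

```python
import operator


def keep_row(row, criteria):
    """Return True only if the row satisfies every (column, method, value) test."""
    return all(
        # operator.__lt__, operator.__ne__, etc. are standard aliases
        # for operator.lt, operator.ne, and so on.
        getattr(operator, method)(float(row[column]), value)
        for column, method, value in criteria
    )


criteria = [
    ("GenCost", "__lt__", 99999),
    ("GenCost", "__ne__", 0),
]

rows = [
    {"pair_ID": "1000_1", "GenCost": "33.564674"},
    {"pair_ID": "1000_1000", "GenCost": "0.0"},
    {"pair_ID": "9_9", "GenCost": "99999.0"},  # invented row, no transit link
]

kept = [row["pair_ID"] for row in rows if keep_row(row, criteria)]
print(kept)
```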

Finally, specify an output file and run the `cleanSkims` function.

In [6]:
out_file = r"K:\Projects\MAPC\WestStationScenarios\input\skims\LRTP_gc_parking_tt\WAT_AM.csv"
wsa.wsafuncs.cleanSkims(in_file, out_file, criteria, rename=rename)

Complete. 8154973/32936121 rows retained (24.76%)
