# Augment: Intercity Passenger Rail Service Station Performance Metrics

This notebook augments the quarterly [Amtrak](https://www.amtrak.com/home.html) station performance
metrics with additional information about each station. The station data is sourced from the US
Department of Transportation (DOT), Bureau of Transportation Statistics (BTS), ArcGIS Online
[Amtrak Stations](https://geodata.bts.gov/datasets/1ed62a9f46304679aaa396bed4c8565a_0/about) layer.
It describes the location of each station, including the station name,
city, state, and geo coordinates.

### Variable names

A number of variable names in this project leverage the following abbreviations. The naming
strategy is to strike a balance between brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [None]:
import json
import numpy as np
import pandas as pd
import pathlib as pl
import re
import tomllib as tl

import fra_amtrak.amtk_frame as frm

# Set random seed
rdg = np.random.default_rng(24)

## 1.0 Read files

### 1.1 Resolve paths

Create `pathlib.Path` objects that represent absolute paths to the `data/raw`, `data/interim`, and `data/processed` directories.

In [None]:
parent_path = pl.Path.cwd()  # current working directory
parent_path

data_raw_path = parent_path.joinpath("data", "raw")
data_interim_path = parent_path.joinpath("data", "interim")
data_processed_path = parent_path.joinpath("data", "processed")

### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [None]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
COLS = const["columns"]


### 1.3 Retrieve performance data

In [None]:
filepath = data_interim_path.joinpath("station_performance_metrics-v1p1.csv")
stations = pd.read_csv(filepath)

### 1.4 Review the `DataFrame`

In [None]:
stations.shape

In [None]:
stations.info()

In [None]:
stations.head(3)

## 2.0 Add route miles

Every named train is associated with a route that Amtrak measures in miles. The route miles data was sourced
from the FRA's
[_Methodology Report for the Performance and Service Quality of Intercity Passenger Train Operations_](https://railroads.dot.gov/sites/fra.dot.gov/files/2024-08/Methodology%20Report_FY24Q3_web.pdf) (FY 2024 v.2), pp. 12-15.

### 2.1 Retrieve data

In [None]:
with open(data_processed_path.joinpath("amtk_sub_services.json"), "r") as file:
    amtk_sub_svcs = json.load(file)

route_miles = [
    {"Route": route["sub service"], "Route Miles": sum([host["miles"] for host in route["hosts"]])}
    for route in amtk_sub_svcs
]

# Create DataFrame from the list of dicts (one row per route)
route_miles = pd.DataFrame(route_miles)
route_miles

### 2.2 Combine data [1 pt]

Merge the `route_miles` `DataFrame` into the `stations` `DataFrame`. Once the data is combined, move
the "Route Miles" column from the last position to the fifth (`5th`) position in `stations`. Drop any
redundant columns after reordering the columns.
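
The general pattern can be sketched on toy data. The join keys ("Sub Service" and "Route"), the sample values, and the use of zero-based index `5` as the target position are all assumptions made for illustration, not the expected solution.

```python
import pandas as pd

# Toy stand-ins; column names and values are illustrative only.
stations_demo = pd.DataFrame({
    "Fiscal Year": [2024, 2024],
    "Fiscal Quarter": [3, 3],
    "Service Line": ["Long Distance", "State Supported"],
    "Service": ["Route A", "Route B"],
    "Sub Service": ["Route A", "Route B"],
    "Train Number": [1, 2],
})
route_miles_demo = pd.DataFrame({"Route": ["Route A", "Route B"], "Route Miles": [100, 50]})

# A left join keeps every row of the left frame.
merged = stations_demo.merge(
    route_miles_demo, how="left", left_on="Sub Service", right_on="Route"
)

# Move "Route Miles" to index 5, then drop the now-redundant join key.
cols = merged.columns.tolist()
cols.insert(5, cols.pop(cols.index("Route Miles")))
merged = merged.loc[:, cols].drop(columns="Route")
print(merged.columns.tolist())
```

Reordering through a plain list of labels keeps the operation explicit and easy to verify before the redundant column is dropped.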

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## 3.0 Add location data

The Bureau of Transportation Statistics (BTS) maintains an [Amtrak stations](https://data-usdot.opendata.arcgis.com/datasets/amtrak-stations/about) dataset that provides mapping (i.e., location) information.

### 3.1 Retrieve data

In [None]:
filepath = data_raw_path.joinpath("NTAD_Amtrak_Stations_-3056704789218436106.csv")
ntad_stations = pd.read_csv(filepath)

### 3.2 Filter data [1 pt]

Filter out all bus stations and reset the index. Retain only the "StaType", "ZipCode", "Address2",
"Address1", "Code", "lon", and "lat" columns.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 3.3 Drop columns [1 pt]

Drop the following columns. They are not required for the analysis.

* OBJECTID
* StnType
* State
* Name
* StationName
* StationFacilityName
* StationAliases
* DateModif
* x
* y

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## 4.0 Clean data

### 4.1 Blank and missing values

The check below suggests that there are no empty or missing values.

In [None]:
# Combined condition to check for empty strings or NaN
mask = (ntad_stations == "") | pd.isna(ntad_stations)
empty_nan_values = ntad_stations.columns[mask.any()]
empty_nan_values

# Count empty or NaN values
# empty_nan_counts = ntad_stations[empty_nan_values].apply(lambda x: x.isin(["", np.nan]).sum())
# empty_nan_counts

### 4.2 Normalize strings

Trim leading and trailing spaces from each string value, and remove unnecessary internal spaces matched by a compiled regular expression (`re.Pattern`). Call the function `frm.normalize_dataframe_strings()` to perform this operation.

#### 4.2.1 Locate suspect strings

As illustrated below, the regex pattern to use is `r"\s{2,}"`, which matches two or more consecutive whitespace characters.
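
A small sketch of the pattern in action on a toy `Series`, using plain pandas string methods. This is not the implementation of `frm.normalize_dataframe_strings()`, whose signature is not shown in this notebook, and the sample strings are made up.

```python
import pandas as pd

# Two or more consecutive whitespace characters.
PATTERN = r"\s{2,}"

names = pd.Series(["  Ann  Arbor ", "Chicago", "New   York  Penn"])

# Count the values that contain a run of repeated whitespace.
n_suspect = names.str.contains(PATTERN, regex=True).sum()

# Trim both ends, then collapse each remaining run to a single space.
cleaned = names.str.strip().str.replace(PATTERN, " ", regex=True)
print(n_suspect, cleaned.tolist())
```

Stripping before replacing means leading and trailing runs are removed outright instead of being reduced to a single space.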

In [None]:
# Locate extra spaces in all string columns
extra_spaces = ntad_stations.select_dtypes(include="object").apply(
    lambda x: x.str.contains(r"\s{2,}").sum()
)
extra_spaces

#### 4.2.2 Clean strings [1 pt]

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## 5.0 Manipulate data

### 5.1 Rename the columns

Note the use of the `COLS` constants for the new column names.

In [None]:
mapper = {
    "StaType": COLS["station_type"],
    "ZipCode": COLS["zip_code"],
    "City": COLS["city"],
    "Address2": COLS["address_02"],
    "Address1": COLS["address_01"],
    "Code": "Code",
    "lon": COLS["lon"],
    "lat": COLS["lat"],
}
ntad_stations.rename(columns=mapper, inplace=True)
ntad_stations.head(3)

### 5.2 Reorder columns

:bulb: By convention (e.g., ISO 6709), latitude is listed before longitude.

In [None]:
columns = [
    "Code",
    COLS["station_type"],
    COLS["city"],
    COLS["address_01"],
    COLS["address_02"],
    COLS["zip_code"],
    COLS["lat"],
    COLS["lon"],
]
ntad_stations = ntad_stations.loc[:, columns]
ntad_stations.sample(n=7, random_state=rdg)

## 6.0 Merge data [1 pt]

Merge `stations` and `ntad_stations`. Perform a __left join__ to retain all rows in the `stations` `DataFrame`, joining on the "Arrival Station Code" column in `stations` and the "Code" column in `ntad_stations`.
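
The behavior of a left join with differently named keys can be sketched on toy frames. The sample codes and values are invented, and `indicator=True` is an optional extra used here only to make unmatched rows visible.

```python
import pandas as pd

left = pd.DataFrame({"Arrival Station Code": ["AAA", "BBB", "ZZZ"], "Total": [10, 20, 5]})
right = pd.DataFrame({"Code": ["AAA", "BBB"], "Latitude": [40.0, 41.0]})

# how="left" keeps all left rows; unmatched keys receive NaN in the right-hand columns.
joined = left.merge(
    right,
    how="left",
    left_on="Arrival Station Code",
    right_on="Code",
    indicator=True,  # adds a "_merge" column describing each row's match status
)
print(joined)
```

Because the key columns have different names, both survive the merge, which is why a redundant "Code" column has to be dropped later.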

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## 7.0 Check geo coordinates [1 pt]

Check for missing geo coordinates in the latitude and longitude columns (`COLS["lat"]` and
`COLS["lon"]`) of the merged `DataFrame`. Assign the results to a new `DataFrame` named `missing_coords`.
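
One way to express such a check is a row-wise mask over both coordinate columns. The column names and values below are toy assumptions rather than the notebook's actual data.

```python
import pandas as pd

demo = pd.DataFrame({
    "Arrival Station Code": ["AAA", "BBB", "CCC"],
    "Latitude": [40.0, None, 42.0],
    "Longitude": [-80.0, None, None],
})

# Flag a row when either coordinate is missing.
mask = demo[["Latitude", "Longitude"]].isna().any(axis=1)
missing_demo = demo.loc[mask]
print(missing_demo["Arrival Station Code"].tolist())
```

Using `any(axis=1)` catches rows with only one missing coordinate; `all(axis=1)` would flag only rows missing both.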

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 7.1 Missing geo coordinates

The BTS Amtrak stations dataset does not contain geo coordinates for the following stations:

* CBN: Canadian Border, NY
* FAL: Falmouth, ME
* MCI: Michigan City, IN

#### 7.1.1 CBN

This is not a physical station but an international border crossing in the vicinity of
Niagara Falls that features an exchange of US and Canadian train crews.

Note on MCI: the [Michigan City Station](https://en.wikipedia.org/wiki/Michigan_City_station) is a
former Amtrak station that was closed on 4 April 2022. The geo coordinates for the station can be obtained from
[Google Maps](https://www.google.com/maps/place/41%C2%B043'16.0%22N+86%C2%B054'20.0%22W/@41.721111,-86.905556,15z/data=!4m4!3m3!8m2!3d41.721111!4d-86.905556?hl=en&entry=ttu&g_ep=EgoyMDI0MTAyOS4wIKXMDSoASAFQAw%3D%3D).

#### 7.1.2 FAL

A special event stop for the Amtrak [Downeaster](https://www.amtrak.com/downeaster-train)
in support of _The Live + Work in Maine Open Golf Tournament_ held at the
[Falmouth Country Club](https://www.falmouthcc.org/) during June 24-27, 2021 and June 23-26, 2022
(source: http://www.trainweb.org/usarail/falmouth.htm).

FAL row values can be updated with the following information:

Muirfield Road at Railroad Crossing \
Falmouth, ME 04105 \
Latitude: `43.769600`, Longitude: `-70.259500`

In [None]:
values = ("Falmouth", "Muirfield Road at Railroad Crossing", "04105", 43.769600, -70.259500)
mask = stations[COLS["station_code"]] == "FAL"
stations.loc[
    mask,
    [COLS["city"], COLS["address_01"], COLS["zip_code"], COLS["lat"], COLS["lon"]],
] = values
stations[mask]

#### 7.1.3 MCI

Formerly Amtrak's Michigan City, IN station, closed since April 2022. MCI row values can be
updated with the following information:

Amtrak Michigan City Station (closed) \
100 Washington Street \
Michigan City, Indiana 46360 \
Latitude: `41.721111`, Longitude: `-86.905556`

In [None]:
values = ("Michigan City", "100 Washington Street", "46360", 41.721111, -86.905556)
mask = stations[COLS["station_code"]] == "MCI"
stations.loc[
    mask, [COLS["city"], COLS["address_01"], COLS["zip_code"], COLS["lat"], COLS["lon"]]
] = values
stations[mask]

In [None]:
stations.info()

## 8.0 Reorder columns

The `stations` columns are reordered as follows:

| Position | Column Name | Note |
| :----- | :------------- | :------------- |
| `0`-`1` | "Fiscal Year", "Fiscal Quarter" | &nbsp; |
| `2`-`5` | "Service Line", "Service", "Sub Service", "Train Number" | &nbsp; |
| `6`-`9` | "Arrival Station", "Arrival Station Type", "Code", "Arrival Station Code" | Drop "Code" after confirming column order. |
| `10`-`13` | "City", "Address 01", "Address 02", "ZIP Code" | &nbsp; |
| `14`-`17` | "State", "Division", "Region", "Country" | &nbsp; |
| `18`-`19` | "Latitude", "Longitude" | &nbsp; |
| `20`-`22` | "Total Detraining Customers", "Late Detraining Customers", "Late Detraining Customers Avg Min Late" | &nbsp; |

In [None]:
# Indices of interest
state_idx = stations.columns.get_loc(COLS["state"])
total_detrain_idx = stations.columns.get_loc(COLS["total_detrn"])
code_idx = stations.columns.get_loc("Code")

columns_start = stations.columns[:state_idx].tolist()
columns_start.extend([
    "Code",
    COLS["station_type"],
    COLS["city"],
    COLS["address_01"],
    COLS["address_02"],
    COLS["zip_code"],
])
print(f"columns_start = {columns_start}")

columns_middle = stations.columns[state_idx:total_detrain_idx].tolist()
columns_middle.extend([COLS["lat"], COLS["lon"]])
print(f"columns_middle = {columns_middle}")

columns_end = stations.columns[total_detrain_idx:code_idx].tolist()
print(f"columns_end = {columns_end}")

columns = columns_start + columns_middle + columns_end
print(f"columns = {columns}")

# Reorder DataFrame
stations = stations.loc[:, columns]
stations.shape

In [None]:
stations.head(3)

## 9.0 Drop column [1 pt]

Drop the redundant "Code" column.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## 10.0 Late detraining passengers

Calculate the ratio of late detraining passengers to total detraining passengers _for each station_
and assign the results to a new column named "Late to Total Detraining Customers Ratio" (use the
associated `COLS` constant rather than hard-coding the string name into the code). Round the
values to the fifth (`5th`) decimal place.

Note: Design your `lambda` function carefully to avoid a `ZeroDivisionError`.
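
A guarded division can be sketched as follows on toy data. The column names are placeholders, and returning `0.0` when the total is zero is an assumption; another sentinel such as `NaN` may be more appropriate for the real analysis.

```python
import pandas as pd

demo = pd.DataFrame({"Total": [200, 0, 3], "Late": [50, 0, 1]})

# Check the denominator first so rows with no detraining customers yield 0.0.
demo["Ratio"] = demo.apply(
    lambda row: round(row["Late"] / row["Total"], 5) if row["Total"] > 0 else 0.0,
    axis=1,
)
print(demo["Ratio"].tolist())
```

The conditional expression evaluates the division only when the total is positive, so the zero-total row never reaches the `/` operator.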

### 10.1 Calculate the ratio [1 pt]

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 10.2 Sample the rows

Return a sample of rows to verify row values.

In [None]:
# Weight the sample toward Long Distance rows (weight 3 versus 1 for all other service lines)
weights = stations[COLS["svc_line"]].apply(lambda svc_line: 3 if svc_line == "Long Distance" else 1)
stations.sample(n=7, weights=weights, random_state=rdg)

### 10.3 Reorder columns [1 pt]

Move "Late to Total Detraining Customers Ratio" to the __second to last__ position in `stations`.
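
Moving a column to the second-to-last position can be sketched with a list of labels. The toy column names below are placeholders for the real ones.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1], "B": [2], "C": [3], "Ratio": [0.5]})

cols = demo.columns.tolist()
cols.remove("Ratio")
# Inserting at len(cols) - 1 leaves exactly one label after "Ratio".
cols.insert(len(cols) - 1, "Ratio")
demo = demo.loc[:, cols]
print(demo.columns.tolist())
```

Computing the insert position from the list length avoids hard-coding an index that would break if columns were added upstream.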

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## 11.0 Persist data

### 11.1 Recheck data

In [None]:
stations.info()

### 11.2 Write to file [1 pt]

Write data to a CSV file.

In [None]:
filepath = data_processed_path.joinpath("station_performance_metrics-v1p2.csv")
stations.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

## 12.0 Watermark

In [None]:
%load_ext watermark
%watermark -h -i -iv -m -v