# Explore: Amtrak Trains

## Intercity Passenger Rail Service Station Performance Metrics

The Amtrak [network](https://www.amtrak.com/content/dam/projects/dotcom/english/public/documents/Maps/Amtrak-System-Map-020923.pdf)
provides intercity passenger rail service in the
continental United States and to select Canadian cities. The network is operated by the
[National Railroad Passenger Corporation](https://railroads.dot.gov/passenger-rail/amtrak/amtrak),
a federally chartered for-profit corporation that receives federal and state funding in addition
to the revenue it earns by selling tickets and providing other services.

This notebook commences exploration of the augmented quarterly
[Amtrak](https://www.amtrak.com/home.html) station performance metrics. The goal is to better
understand individual Amtrak train performance and identify potential areas for further analysis.

### Variable names

A number of variable names in this project leverage the following abbreviations. The naming
strategy is to strike a balance between brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [None]:
import numpy as np
import pandas as pd
import pathlib as pl
import tomllib as tl

import fra_amtrak.amtk_detrain as detrn
import fra_amtrak.amtk_frame as frm
import fra_amtrak.amtk_network as ntwk
import fra_amtrak.chart_box_preagg as boxp
import fra_amtrak.chart_hist as hst
import fra_amtrak.chart_title as ttl


## 1.0 Read files

### 1.1 Resolve paths

In [None]:
parent_path = pl.Path.cwd() # current working directory
parent_path

### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [None]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
AGG = const["agg"]
CHRT_BAR = const["chart"]["bar"]
COLORS = const["colors"]
COLS = const["columns"]
DIRECTION = const["train"]["direction"]
SUB_SVC = const["train"]["sub_service"]
TRN = const["train"]


### 1.3 Retrieve performance data

In [None]:
filepath = parent_path.joinpath("data", "processed", "station_performance_metrics-v1p2.csv")
trains = pd.read_csv(
    filepath, dtype={"Address 02": "str", "ZIP Code": "str"}, low_memory=False
)  # avoid DtypeWarning
trains.shape
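The `dtype` argument matters for columns such as ZIP codes, where leading zeros carry meaning. The sketch below, which uses a tiny in-memory CSV with illustrative values (not rows taken from the actual data file), shows what happens with and without the string hint.

```python
import io

import pandas as pd

# Illustrative two-row extract; the values are made up for this example.
csv_text = "Station Code,ZIP Code\nBOS,02110\nWAS,20002\n"

# Without a dtype hint, pandas infers integers and drops the leading zero.
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as a string keeps each value exactly as written.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"ZIP Code": "str"})

print(inferred["ZIP Code"].tolist())
print(typed["ZIP Code"].tolist())
```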

In [None]:
trains.info()

In [None]:
trains.head(3)

## 2.0 Select trains: Northeast Corridor (NEC)

Amtrak's Northeast Corridor (NEC) is the busiest passenger rail corridor in the United States. It is the only Amtrak line that operates high-speed Acela Express service. The NEC is a shared asset with commuter rail operators, including the Massachusetts Bay Transportation Authority (MBTA), Metro-North Railroad, New Jersey Transit, Southeastern Pennsylvania Transportation Authority (SEPTA), and the Maryland Transit Administration (MTA).

### 2.1 _Acela Express_ (Boston - New York - Philadelphia - Washington, D.C.) [1 pt]

Amtrak's [_Acela_](https://www.amtrak.com/acela-train) service is a high-speed rail service (`150` mph | `240` km/h) that operates along the Northeast Corridor (NEC) between Boston, MA and Washington, D.C. The service features multiple departures daily, with express trains that make limited stops and regional trains that make all stops.

This section focuses on the _Acela Express_ service. Retrieve the _Acela Express_ performance data by calling the appropriate `amtk_network` function. Assign the return value of the function call to a variable named `acela_xp`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 2.2 _Acela Express_: on-time performance metrics (entire period) [1 pt]

_Acela Express_ performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.
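The on-time rule can be expressed as a small predicate. The following is a minimal sketch using made-up delay values; the function and variable names are hypothetical and do not come from the `fra_amtrak` package.

```python
ON_TIME_THRESHOLD_MIN = 15  # threshold stated in the text above


def is_on_time(minutes_late: float) -> bool:
    """Return True if arrival is no more than 15 minutes after schedule."""
    return minutes_late <= ON_TIME_THRESHOLD_MIN


# Hypothetical arrival delays in minutes (negative means early).
delays = [-3, 0, 15, 16, 42]

late = [d for d in delays if not is_on_time(d)]
on_time_count = len(delays) - len(late)

print(f"On-time: {on_time_count}, late: {len(late)}")
```

An arrival exactly fifteen minutes behind schedule still counts as on-time; on-time totals follow by subtracting late passengers from all detraining passengers.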

In [None]:
# Total train arrivals
acela_xp_trn_arrivals = acela_xp.shape[0]

# Detraining totals
acela_xp_detrn = acela_xp[COLS["total_detrn"]].sum()
acela_xp_detrn_late = acela_xp[COLS["late_detrn"]].sum()
acela_xp_detrn_on_time = acela_xp_detrn - acela_xp_detrn_late

print(
    f"Train Arrivals: {acela_xp_trn_arrivals}",
    f"Total Detraining Customers: {acela_xp_detrn}",
    f"Late Detraining Customers: {acela_xp_detrn_late}",
    f"On-Time Detraining Customers: {acela_xp_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
acela_xp_stats = detrn.get_sum_stats(acela_xp, AGG["columns"], AGG["funcs"])
acela_xp_stats

In [None]:
#hidden tests are within this cell

### 2.3 _Acela Express_ trains [1 pt]

Each _Acela Express_ train is identified by a unique train number.

Create a `DataFrame` named `acela_xp_trns` that contains one row for each train comprising the _Acela Express_ service. Include the following columns in the `DataFrame` in the order specified:

1. "Service Line"
2. "Service"
3. "Sub Service"
4. "Route Miles"
5. "Train Number"

Reset the index (set `drop=True`) when creating the new `DataFrame`.
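As a generic illustration of the pattern (not the graded solution), the sketch below reduces a toy arrivals table to one row per train. The data and the "Arrival Station Code" column name are hypothetical; only two of the five required columns are used to keep the example short.

```python
import pandas as pd

# Hypothetical rows: one per station arrival, so each train repeats.
arrivals = pd.DataFrame(
    {
        "Train Number": [2154, 2154, 2155, 2155, 2155],
        "Sub Service": ["Acela Express"] * 5,
        "Arrival Station Code": ["NYP", "BOS", "NYP", "PHL", "WAS"],
    }
)

# Select columns in the desired order, keep one row per distinct
# combination, and renumber the rows from zero.
cols = ["Sub Service", "Train Number"]
unique_trains = arrivals[cols].drop_duplicates().reset_index(drop=True)

print(unique_trains)
```

Passing `drop=True` discards the old index rather than adding it to the frame as a new column.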

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 2.4 _Acela Express_: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of _Acela Express_ trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [None]:
# Drop missing values
acela_xp_avg_mm_late = acela_xp[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function
acela_xp_avg_mm_late_describe = frm.describe_numeric_column(acela_xp_avg_mm_late)
acela_xp_avg_mm_late_describe

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of _Acela Express_ trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.
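To make the two shape measures concrete, here is a minimal sketch that computes them from central moments using only the standard library. It uses the simple population formulas on made-up delays; `frm.describe_numeric_column()` may well use a different (for example, bias-corrected) estimator, so the numbers are not expected to match the notebook output.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(values):
    """Return population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2**1.5
    excess_kurt = m4 / m2**2 - 3  # a normal distribution scores 0
    return skew, excess_kurt


# Hypothetical delays: most are short, one is very long.
delays = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]
skew, excess_kurt = skewness_and_excess_kurtosis(delays)

print(f"skewness={skew:.2f}, excess kurtosis={excess_kurt:.2f}")
```

Both values come out positive for this sample: the single long delay stretches the right tail (positive skew) and adds tail weight relative to a normal distribution (positive excess kurtosis).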

### 2.5 _Acela Express_: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 2.5.1 Create the chart data

In [None]:
# Convert to DataFrame
acela_xp_avg_mm_late = acela_xp_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = acela_xp_avg_mm_late_describe["center"]["mean"]
sigma = acela_xp_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = acela_xp_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
acela_xp_mm_late, bins, num_bins, bin_width = frm.create_bins(
    acela_xp_avg_mm_late, COLS["avg_mm_late"], 10
)

# Bin the data
chrt_data = frm.bin_data(acela_xp_mm_late, COLS["avg_mm_late"], bins)
# chrt_data
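The internals of `frm.create_bins()` and `frm.bin_data()` are not shown in this notebook. As a rough sketch of what fixed-width binning generally involves, the example below pads the maximum to the next multiple of ten and counts values per bin. It assumes non-negative values and treats `10` as a bin width, which is an assumption about the helper's third argument.

```python
import math

# Hypothetical mean-minutes-late values.
values = [12.0, 18.5, 21.0, 27.5, 33.0, 47.0, 95.5]
bin_width = 10

# Pad the axis maximum up to the next multiple of the bin width.
max_padded = math.ceil(max(values) / bin_width) * bin_width

# Count values per bin; bin i covers [i * width, (i + 1) * width).
num_bins = max_padded // bin_width
counts = [0] * num_bins
for value in values:
    # Clamp so a value equal to the padded maximum lands in the last bin.
    index = min(int(value // bin_width), num_bins - 1)
    counts[index] += 1

# Bin centers are convenient x positions for the bars.
centers = [i * bin_width + bin_width / 2 for i in range(num_bins)]

print(max_padded, counts)
```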

#### 2.5.2 Generate the histogram

In [None]:
# Chart title
title_txt = f"Amtrak {SUB_SVC['ace_xp']} Service Late Detraining Passengers"
title = ttl.format_title(acela_xp_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 2.6 _Acela Express_, Trains 2155 & 2154

_Acela_ trains 2155 (southbound) and 2154 (northbound) operate daily between [South Station](https://www.amtrak.com/stations/bos) Boston, MA ([BOS](https://www.amtrak.com/stations/bos)) and [Union Station](https://www.amtrak.com/stations/was) Washington, D.C. ([WAS](https://www.amtrak.com/stations/was)).

#### 2.6.1 _Acela Express_ Train 2155, southbound, detraining passengers summary statistics [1 pt]

Departing daily from [South Station](https://www.amtrak.com/stations/bos), Boston, MA ([BOS](https://www.amtrak.com/stations/bos)).

In [None]:
# Base columns for routes
rte_cols = [
    COLS["trn"],
    COLS["station_code"],
    COLS["station"],
    COLS["state"],
    COLS["lat"],
    COLS["lon"],
]

# Train 2155 southbound
amtk_2155 = ntwk.by_train_number(trains, 2155)
amtk_2155_rte = ntwk.create_route(amtk_2155, TRN["2155"]["direction"])
amtk_2155_rte_stats = detrn.get_route_sum_stats(
    amtk_2155_rte,
    COLS["station_code"],
    AGG["columns"],
    AGG["funcs"],
    rte_cols,
)
amtk_2155_rte_stats

In [None]:
#hidden tests are within this cell

##### 2.6.1.1 Write to file [1 pt]

Write `amtk_2155_rte_stats` to a CSV file named `stu-amtk_2155_rte_stats.csv`. Store the file in the `data/student` directory. Then compare it to the accompanying `fxt-amtk_2155_rte_stats.csv` file. It must match line for line, character for character.
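One way to check a strict line-for-line match is sketched below using only the standard library. The helper name `first_mismatch` and the throwaway file names are hypothetical; the files are created in a temporary directory so the example does not touch `data/student`.

```python
import tempfile
from pathlib import Path


def first_mismatch(path_a, path_b):
    """Return the 1-based number of the first differing line, or None."""
    # newline="" disables newline translation so line endings are compared too.
    with open(path_a, newline="", encoding="utf-8") as file_a:
        lines_a = file_a.readlines()
    with open(path_b, newline="", encoding="utf-8") as file_b:
        lines_b = file_b.readlines()
    for number, (line_a, line_b) in enumerate(zip(lines_a, lines_b), start=1):
        if line_a != line_b:
            return number
    if len(lines_a) != len(lines_b):
        return min(len(lines_a), len(lines_b)) + 1
    return None


# Demonstrate with two throwaway files (hypothetical names and content).
with tempfile.TemporaryDirectory() as tmp_dir:
    student = Path(tmp_dir, "stu-example.csv")
    fixture = Path(tmp_dir, "fxt-example.csv")
    student.write_text("Station Code,Total\nBOS,10\n", encoding="utf-8")
    fixture.write_text("Station Code,Total\nBOS,10\n", encoding="utf-8")
    same = first_mismatch(student, fixture)

    fixture.write_text("Station Code,Total\nBOS,11\n", encoding="utf-8")
    different = first_mismatch(student, fixture)

print(same, different)
```

Reporting the first differing line number tends to be more useful than a bare true/false answer when tracking down a stray index column or a float formatting difference.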

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

#### 2.6.2 _Acela Express_ Train 2155: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape for the mean late arrival times of _Acela Express_ Train 2155. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the data for each fiscal year and quarter with a box plot.


In [None]:
# Drop missing values
amtk_2155_avg_mm_late = amtk_2155[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_2155_avg_mm_late_describe = frm.describe_numeric_column(amtk_2155_avg_mm_late)
amtk_2155_avg_mm_late_describe

##### 2.6.2.1 Retrieve the data

In [None]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_2155_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
chrt_data
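The helper `detrn.get_qtr_avg_min_late()` is not shown in this notebook. As a rough sketch of the general idea of grouping values by a combined year and quarter label, consider the example below. The `"YYYY Qn"` label format, the record layout, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records: (fiscal year, quarter, average minutes late).
records = [
    (2022, 1, 20.0),
    (2022, 1, 30.0),
    (2022, 2, 40.0),
]


def year_quarter_label(year: int, quarter: int) -> str:
    """Combine year and quarter into a single sortable label."""
    return f"{year} Q{quarter}"


# Collect the values that belong to each period.
groups = defaultdict(list)
for year, quarter, avg_min_late in records:
    groups[year_quarter_label(year, quarter)].append(avg_min_late)

for label in sorted(groups):
    print(label, groups[label])
```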

##### 2.6.2.2 Preaggregate the data

Attempting to instantiate a Vega-Altair [`alt.Chart()`](https://altair-viz.github.io/user_guide/generated/toplevel/altair.Chart.html) object with a dataset comprising more than `5000` rows will trigger a `MaxRowsError`. You can disable the `MaxRows` check by calling the `alt.data_transformers.disable_max_rows()` method. However, disabling the check may result in performance issues, including browser crashes.

The preferred approach when [working with large datasets](https://altair-viz.github.io/user_guide/large_datasets.html#large-datasets) is to _preaggregate_ the data before generating a plot. This can be achieved "manually" (the approach adopted in this notebook) or by [installing](https://altair-viz.github.io/user_guide/large_datasets.html#installing-vegafusion) and [enabling](https://altair-viz.github.io/user_guide/large_datasets.html#enabling-the-vegafusion-data-transformer) Altair's companion [vegafusion](https://vegafusion.io/) data transformer package.
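For a box plot, preaggregation means reducing each group to a handful of numbers so the chart receives a few rows per period instead of thousands. The sketch below computes Tukey-style box statistics with the standard library; it is an illustration under stated assumptions (inclusive quartiles, 1.5 × IQR fences) and is not the implementation of `frm.aggregate_data()`, which may differ.

```python
from statistics import quantiles


def box_stats(values):
    """Summarize values as Tukey box plot statistics."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme values still inside the fences.
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Hypothetical delays (minutes) for one fiscal quarter.
summary = box_stats([10, 12, 13, 15, 18, 20, 22, 60])
print(summary)
```

Applying a function like this to each period yields one small record per box, which stays well under Altair's row limit regardless of how many raw observations there are.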

In [None]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 2.6.2.3 Generate box plots

Visualize the distribution of mean late arrival times for late detraining passengers. Illustrate with box plots.

In [None]:
# Create chart title
txt = TRN["2155"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_2155_rte_stats, title_txt)

# Create and display the vertical boxplot
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

#### 2.6.3 _Acela Express_ Train 2154, northbound, detraining passengers summary statistics [1 pt]

Departing daily from [Union Station](https://www.amtrak.com/stations/was), Washington, D.C. 
([WAS](https://www.amtrak.com/stations/was)).

Review previous code employed to generate summary statistics for an Amtrak train. Then leverage functions available in the `amtk_network` and `amtk_detrain` modules to create three new `DataFrame` objects named `amtk_2154`, `amtk_2154_rte`, and `amtk_2154_rte_stats`, respectively.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

##### 2.6.3.1 Write to file [1 pt]

Write `amtk_2154_rte_stats` to a CSV file named `stu-amtk_2154_rte_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk_2154_rte_stats.csv` file.
It must match line for line, character for character.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

#### 2.6.4 _Acela Express_ Train 2154: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape for the mean late arrival times of _Acela Express_ Train 2154. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the data for each fiscal year and quarter with a box plot.

In [None]:
# Drop missing values
amtk_2154_avg_mm_late = amtk_2154[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_2154_avg_mm_late_describe = frm.describe_numeric_column(amtk_2154_avg_mm_late)
amtk_2154_avg_mm_late_describe

##### 2.6.4.1 Retrieve the chart data

In [None]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_2154_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
# chrt_data

##### 2.6.4.2 Preaggregate the data

In [None]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 2.6.4.3 Generate box plots

In [None]:
# Create chart title
txt = TRN["2154"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_2154_rte_stats, title_txt)

# Create and display the vertical boxplot
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

## 3.0 Select trains: State Supported Pacific Surfliner Service

Amtrak's state-supported trains are funded by state governments. These services are typically shorter in length and operate within a single state or across multiple states.

### 3.1 _Pacific Surfliner_ Service (San Luis Obispo - Santa Barbara - Los Angeles - San Diego) [1 pt]

Amtrak's [Pacific Surfliner](https://www.amtrak.com/pacific-surfliner-train) service operates between San Luis Obispo, CA ([SLO](https://www.amtrak.com/stations/slo)) and [Santa Fe Depot](https://www.amtrak.com/stations/san), San Diego, CA ([SAN](https://www.amtrak.com/stations/san)). The service features multiple departures daily and serves a number of popular destinations, including Santa Barbara, CA ([SBA](https://www.amtrak.com/stations/sba)), Los Angeles, CA ([LAX](https://www.amtrak.com/stations/lax)), Anaheim, CA ([ANA](https://www.amtrak.com/stations/ana)), and San Juan Capistrano, CA ([SNC](https://www.amtrak.com/stations/snc)).

Retrieve the _Pacific Surfliner_ performance data by calling the appropriate `amtk_network` function. Assign the return value of the function call to a variable named `surf`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 3.2 _Pacific Surfliner_: on-time performance metrics (entire period) [1 pt]

_Pacific Surfliner_ performance data is a compilation of quarterly metrics that focus on late detraining passengers. Detraining passengers are considered on-time if they arrive at their destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other detraining passengers are considered late.

In [None]:
# Total train arrivals
surf_trn_arrivals = surf.shape[0]

# Detraining totals
surf_detrn = surf[COLS["total_detrn"]].sum()
surf_detrn_late = surf[COLS["late_detrn"]].sum()
surf_detrn_on_time = surf_detrn - surf_detrn_late

print(
    f"Train Arrivals: {surf_trn_arrivals}",
    f"Total Detraining Customers: {surf_detrn}",
    f"Late Detraining Customers: {surf_detrn_late}",
    f"On-Time Detraining Customers: {surf_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
surf_stats = detrn.get_sum_stats(surf, AGG["columns"], AGG["funcs"])
surf_stats

In [None]:
#hidden tests are within this cell

### 3.3 _Pacific Surfliner_ trains [1 pt]

Each _Pacific Surfliner_ train is identified by a unique train number. Create a `DataFrame` named `surf_trns` that contains one row for each train comprising the _Pacific Surfliner_ service. Include the following columns in the `DataFrame` in the order specified: 

1. "Service Line"
2. "Service"
3. "Sub Service"
4. "Route Miles"
5. "Train Number"

Reset the index (set `drop=True`) when creating the new `DataFrame`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 3.4 _Pacific Surfliner_: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of _Pacific Surfliner_ trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [None]:
# Drop missing values
surf_avg_mm_late = surf[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
surf_avg_mm_late_describe = frm.describe_numeric_column(surf_avg_mm_late)
surf_avg_mm_late_describe

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of _Pacific Surfliner_ trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.
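A quick sanity check that often accompanies a positive skewness value is to compare the mean with the median: in a distribution with a long right tail, the mean is typically pulled above the median. The sketch below demonstrates this with made-up delays (not _Pacific Surfliner_ data).

```python
from statistics import mean, median

# Hypothetical delays: many short, a few very long (a long right tail).
delays = [5, 6, 7, 8, 9, 10, 12, 15, 60, 120]

avg = mean(delays)
mid = median(delays)

# The long delays drag the mean well above the median.
print(f"mean={avg}, median={mid}, gap={avg - mid}")
```

This is a rule of thumb rather than a guarantee, so the histogram remains the more reliable check.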

### 3.5 _Pacific Surfliner_: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 3.5.1 Create the chart data [1 pt]

In [None]:
# Convert to DataFrame
surf_avg_mm_late = surf_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = surf_avg_mm_late_describe["center"]["mean"]
sigma = surf_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = surf_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
surf_mm_late, bins, num_bins, bin_width = frm.create_bins(surf_avg_mm_late, COLS["avg_mm_late"], 10)

# Bin the data
chrt_data = frm.bin_data(surf_mm_late, COLS["avg_mm_late"], bins)
# chrt_data

In [None]:
#hidden tests are within this cell

#### 3.5.2 Generate the histogram

In [None]:
# Chart title
title_txt = f"Amtrak {SUB_SVC['surf']} Service Late Detraining Passengers"
title = ttl.format_title(surf_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 3.6 _Pacific Surfliner_ Trains 774 & 777

_Pacific Surfliner_ trains 774 (southbound) and 777 (northbound) operate Sunday to Friday between San Luis Obispo, CA ([SLO](https://www.amtrak.com/stations/slo)) and [Santa Fe Depot](https://www.amtrak.com/stations/san), San Diego, CA ([SAN](https://www.amtrak.com/stations/san)).

#### 3.6.1 _Pacific Surfliner_ Train 774, southbound, detraining passengers summary statistics [1 pt]

Departing Sunday to Friday from San Luis Obispo, CA ([SLO](https://www.amtrak.com/stations/slo)).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

##### 3.6.1.1 Write to file [1 pt]

Write `amtk_774_rte_stats` to a CSV file named `stu-amtk_774_rte_stats.csv`. Store the file in the `data/student` directory. Then compare it to the accompanying `fxt-amtk_774_rte_stats.csv` file. It must match line for line, character for character.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

#### 3.6.2 _Pacific Surfliner_ Train 774: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape for the mean late arrival times of _Pacific Surfliner_ Train 774. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the data for each fiscal year and quarter with a box plot.

In [None]:
# Drop missing values
amtk_774_avg_mm_late = amtk_774[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_774_avg_mm_late_describe = frm.describe_numeric_column(amtk_774_avg_mm_late)
amtk_774_avg_mm_late_describe

##### 3.6.2.1 Retrieve the chart data

In [None]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_774_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
chrt_data

##### 3.6.2.2 Preaggregate the data

In [None]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 3.6.2.3 Generate box plots

In [None]:
# Create chart title
txt = TRN["774"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_774_rte_stats, title_txt)

# Create and display the vertical boxplot
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

#### 3.6.3 _Pacific Surfliner_ Train 777, northbound, detraining passengers summary statistics [1 pt]

Departs Sunday to Friday from [Santa Fe Depot](https://www.amtrak.com/stations/san), San Diego, CA ([SAN](https://www.amtrak.com/stations/san)).

Review previous code employed to generate summary statistics for an Amtrak train. Then leverage functions available in the `amtk_network` and `amtk_detrain` modules to create three new `DataFrame` objects named `amtk_777`, `amtk_777_rte`, and `amtk_777_rte_stats`, respectively.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

##### 3.6.3.1 Write to file [1 pt]

Write `amtk_777_rte_stats` to a CSV file named `stu-amtk_777_rte_stats.csv`. Store the file in the `data/student` directory. Then compare it to the accompanying `fxt-amtk_777_rte_stats.csv` file. It must match line for line, character for character.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

#### 3.6.4 _Pacific Surfliner_ Train 777: late detraining metrics (fiscal year and quarter) [1 pt]

Review the central tendency, dispersion, and shape for the mean late arrival times of _Pacific Surfliner_ Train 777. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the data for each fiscal year and quarter with a box plot.

In [None]:
# Drop missing values
amtk_777_avg_mm_late = amtk_777[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_777_avg_mm_late_describe = frm.describe_numeric_column(amtk_777_avg_mm_late)
amtk_777_avg_mm_late_describe

In [None]:
#hidden tests are within this cell

##### 3.6.4.1 Retrieve the chart data

In [None]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_777_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
# chrt_data

##### 3.6.4.2 Preaggregate the data

In [None]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 3.6.4.3 Generate box plots

Visualize the distribution of mean late arrival times for late detraining passengers. Illustrate with box plots.

In [None]:
# Create chart title
txt = TRN["777"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_777_rte_stats, title_txt)

# Create and display the vertical boxplot
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

## 4.0 Long-distance trains

Amtrak's long-distance trains operate across the United States. These trains are typically overnight services that connect major cities and regions. Long-distance trains are known for their scenic routes and dining car services.

### 4.1 _City of New Orleans_ service (Chicago - Memphis - New Orleans) [1 pt]

The [_City of New Orleans_](https://www.amtrak.com/city-of-new-orleans-train) operates daily between [Chicago Union Station](https://www.amtrak.com/stations/chi), Chicago, IL ([CHI](https://www.amtrak.com/stations/chi)) and [Union Passenger Terminal](https://www.amtrak.com/stations/NOL), New Orleans, LA ([NOL](https://www.amtrak.com/stations/NOL)) via [Central Station](https://www.amtrak.com/stations/mem), Memphis, TN ([MEM](https://www.amtrak.com/stations/mem)). The train revives the name of the Illinois Central Railroad's [_City of New Orleans_](https://en.wikipedia.org/wiki/City_of_New_Orleans_(train)) that operated between 1947 and 1971. The train is also known for its association with the classic tune ["City of New Orleans"](https://en.wikipedia.org/wiki/City_of_New_Orleans_(song)) written by Steve Goodman and popularized by Arlo Guthrie.

Retrieve the _City of New Orleans_ performance data by calling the appropriate `amtk_network` function. Assign the return value of the function call to a variable named `cno`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 4.2 _City of New Orleans_: on-time performance metrics (entire period)

_City of New Orleans_ performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

In [None]:
# Total train arrivals
cno_trn_arrivals = cno.shape[0]

# Detraining totals
cno_detrn = cno[COLS["total_detrn"]].sum()
cno_detrn_late = cno[COLS["late_detrn"]].sum()
cno_detrn_on_time = cno_detrn - cno_detrn_late

print(
    f"Train Arrivals: {cno_trn_arrivals}",
    f"Total Detraining Customers: {cno_detrn}",
    f"Late Detraining Customers: {cno_detrn_late}",
    f"On-Time Detraining Customers: {cno_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
cno_stats = detrn.get_sum_stats(cno, AGG["columns"], AGG["funcs"])
cno_stats

### 4.3 _City of New Orleans_ trains [1 pt]

Each _City of New Orleans_ train is identified by a unique train number.

Create a `DataFrame` named `cno_trns` that contains one row for each train comprising the _City of New Orleans_ service. Include the following columns in the `DataFrame` in the order specified:

1. "Service Line"
2. "Service"
3. "Sub Service"
4. "Route Miles"
5. "Train Number"

Reset the index (set `drop=True`) when creating the new `DataFrame`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 4.4 _City of New Orleans_: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of _City of New Orleans_ trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [None]:
# Drop missing values
cno_avg_mm_late = cno[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
cno_avg_mm_late_describe = frm.describe_numeric_column(cno_avg_mm_late)
cno_avg_mm_late_describe

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of _City of New Orleans_ trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.

### 4.5 _City of New Orleans_: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 4.5.1 Create the chart data

In [None]:
# Convert to DataFrame
cno_avg_mm_late = cno_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = cno_avg_mm_late_describe["center"]["mean"]
sigma = cno_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = cno_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
cno_mm_late, bins, num_bins, bin_width = frm.create_bins(cno_avg_mm_late, COLS["avg_mm_late"], 10)

# Bin the data
chrt_data = frm.bin_data(cno_mm_late, COLS["avg_mm_late"], bins)
# chrt_data

#### 4.5.2 Generate the histogram

In [None]:
# Chart title
title_txt = f"Amtrak {SUB_SVC['cno']} Service Late Detraining Passengers"
title = ttl.format_title(cno_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 4.6 _City of New Orleans_, Train 59 and 58

_City of New Orleans_ trains 59 (southbound) and 58 (northbound) operate daily between [Chicago Union Station](https://www.amtrak.com/stations/chi), Chicago, IL ([CHI](https://www.amtrak.com/stations/chi)) and [Union Passenger Terminal](https://www.amtrak.com/stations/NOL), New Orleans, LA ([NOL](https://www.amtrak.com/stations/NOL)).

#### 4.6.1 _City of New Orleans_ Train 59, southbound, detraining passengers summary statistics [1 pt]

Departs daily from [Chicago Union Station](https://www.amtrak.com/stations/chi), Chicago, IL ([CHI](https://www.amtrak.com/stations/chi)).

In [None]:
# Base columns for routes
rte_cols = [COLS["station_code"], COLS["station"], COLS["state"], COLS["lat"], COLS["lon"]]

# Train 59 southbound
amtk_59 = ntwk.by_train_number(trains, 59)
amtk_59_rte = ntwk.create_route(amtk_59, "southbound")
amtk_59_rte_stats = detrn.get_route_sum_stats(
    amtk_59_rte, COLS["station_code"], AGG["columns"], AGG["funcs"], rte_cols
)
amtk_59_rte_stats

In [None]:
#hidden tests are within this cell

##### 4.6.1.1 Write to file [1 pt]

Write `amtk_59_rte_stats` to a CSV file named `stu-amtk_59_rte_stats.csv`. Store the file in the `data/student` directory. Then compare it to the accompanying `fxt-amtk-59_route_stats.csv` file. The two files must match line for line, character for character.
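One way to check an exact match is a byte-for-byte comparison with the standard library. The sketch below uses a temporary directory, a hypothetical frame, and hypothetical file names so that it runs anywhere. Passing `index=False` is an assumption; whether the fixture file includes the index column is not stated here.

```python
import filecmp
import pathlib as pl
import tempfile

import pandas as pd

# Hypothetical frame standing in for a route summary statistics DataFrame
frame = pd.DataFrame({"Station Code": ["AAA", "BBB"], "mean": [12.5, 30.0]})

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = pl.Path(tmp)
    student_file = tmp_path / "stu-example.csv"
    fixture_file = tmp_path / "fxt-example.csv"

    # index=False omits the row labels (assumption: the fixture has no index column)
    frame.to_csv(student_file, index=False, encoding="utf-8")
    frame.to_csv(fixture_file, index=False, encoding="utf-8")  # stand-in for the fixture

    # shallow=False compares file contents rather than only os.stat() metadata
    files_match = filecmp.cmp(student_file, fixture_file, shallow=False)

print(files_match)
```

If the comparison fails, typical causes are an unwanted index column, a different column order, or differences in float formatting.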

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

#### 4.6.2 _City of New Orleans_ Train 59: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape for the mean late arrival times of _City of New Orleans_ Train 59. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the data for each fiscal year and quarter with a box plot.
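A pre-aggregated box plot needs, for each period, the quartiles, the whisker ends, and any outliers. The following sketch computes these with pandas on hypothetical data. The 1.5 × IQR whisker rule and the column names are assumptions for illustration; the actual logic in `frm.aggregate_data()` is not shown in this notebook.

```python
import pandas as pd

# Hypothetical observations: mean minutes late by period
obs = pd.DataFrame(
    {
        "period": ["2023Q1"] * 5 + ["2023Q2"] * 5,
        "avg_min_late": [10.0, 20.0, 30.0, 40.0, 200.0, 15.0, 25.0, 35.0, 45.0, 55.0],
    }
)


def box_stats(values: pd.Series) -> pd.Series:
    """Return quartiles, whisker ends (1.5 * IQR rule), and outliers for one group."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return pd.Series(
        {
            "lower": inside.min(),  # whisker end: smallest value inside the fences
            "q1": q1,
            "median": median,
            "q3": q3,
            "upper": inside.max(),  # whisker end: largest value inside the fences
            "outliers": outliers.tolist(),
        }
    )


# One row of box plot statistics per period
preagg = pd.DataFrame(
    {period: box_stats(grp) for period, grp in obs.groupby("period")["avg_min_late"]}
).T
print(preagg)
```

Computing these statistics up front keeps the chart specification small, since the plotting library only draws the boxes, whiskers, and outlier points rather than processing every observation.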

In [None]:
# Drop missing values
amtk_59_avg_mm_late = amtk_59[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_59_avg_mm_late_describe = frm.describe_numeric_column(amtk_59_avg_mm_late)
amtk_59_avg_mm_late_describe

##### 4.6.2.1 Retrieve the chart data

In [None]:
# Base columns for chart data
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Get the chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_59_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
chrt_data

##### 4.6.2.2 Preaggregate the data

In [None]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 4.6.2.3 Generate box plots

In [None]:
# Create chart title
txt = TRN["59"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_59_rte_stats, title_txt)

# Create and display vertical boxplots
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

#### 4.6.3 _City of New Orleans_ Train 58, northbound, detraining passengers summary statistics [1 pt]

Departs daily from [Union Passenger Terminal](https://www.amtrak.com/stations/NOL), New Orleans, LA ([NOL](https://www.amtrak.com/stations/NOL)). Review the code used earlier to generate summary statistics for an Amtrak train. Then use functions available in the `amtk_network` and `amtk_detrain` modules to create three new `DataFrame` objects named `amtk_58`, `amtk_58_rte`, and `amtk_58_rte_stats`, respectively.

In [None]:
# Train 58 northbound
amtk_58 = ntwk.by_train_number(trains, 58)
amtk_58_rte = ntwk.create_route(amtk_58, "northbound")
amtk_58_rte_stats = detrn.get_route_summary_stats(
    amtk_58_rte, COLS["station_code"], AGG["columns"], AGG["funcs"], rte_cols
)
amtk_58_rte_stats

In [None]:
#hidden tests are within this cell

##### 4.6.3.1 Write to file [1 pt]

Write `amtk_58_rte_stats` to a CSV file named `stu-amtk_58_rte_stats.csv`. Store the file in the `data/student` directory. Then compare it to the accompanying `fxt-amtk-58_route_stats.csv` file. The two files must match line for line, character for character.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

#### 4.6.4 _City of New Orleans_ Train 58: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape for the mean late arrival times of _City of New Orleans_ Train 58. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the data for each fiscal year and quarter with a box plot.

In [None]:
# Drop missing values
amtk_58_avg_mm_late = amtk_58[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_58_avg_mm_late_describe = frm.describe_numeric_column(amtk_58_avg_mm_late)
amtk_58_avg_mm_late_describe

##### 4.6.4.1 Retrieve the chart data [1 pt]

In [None]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_58_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
chrt_data

In [None]:
#hidden tests are within this cell

##### 4.6.4.2 Preaggregate the data

In [None]:
# Base columns for average minutes late
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 4.6.4.3 Generate box plots

In [None]:
# Create chart title
txt = TRN["58"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_58_rte_stats, title_txt)

# Create and display vertical boxplots
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

## 5.0 Watermark

In [None]:
%load_ext watermark
%watermark -h -i -iv -m -v