# Explore: Amtrak Stations

## Intercity Passenger Rail Service Station Performance Metrics

The Amtrak [network](https://www.amtrak.com/content/dam/projects/dotcom/english/public/documents/Maps/Amtrak-System-Map-020923.pdf)
provides intercity passenger rail service in the contiguous United States and to select Canadian
cities. The network is operated by the
[National Railroad Passenger Corporation](https://railroads.dot.gov/passenger-rail/amtrak/amtrak),
a federally chartered corporation that is managed as a for-profit company, receives federal and
state funding, and earns revenue by selling tickets and providing other services.

This notebook commences exploration of the augmented quarterly
[Amtrak](https://www.amtrak.com/home.html) station performance metrics. The goal is to better
understand individual Amtrak station performance and identify potential areas for further analysis.

### Variable names

Many variable names in this project use the following abbreviations. The naming strategy aims to
balance brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [None]:
import numpy as np
import pandas as pd
import pathlib as pl
import tomllib as tl

import fra_amtrak.amtk_detrain as detrn
import fra_amtrak.amtk_frame as frm
import fra_amtrak.amtk_network as ntwk
import fra_amtrak.chart_bar as vis_bar
import fra_amtrak.chart_box as box
import fra_amtrak.chart_hist as hst
import fra_amtrak.chart_title as ttl


## 1.0 Read files

### 1.1 Resolve paths

In [None]:
parent_path = pl.Path.cwd() # current working directory
parent_path


### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [None]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
AGG = const["agg"]
CHRT_BAR = const["chart"]["bar"]
CHRT_BOX = const["chart"]["box"]
COLORS = const["colors"]
COLS = const["columns"]
STNS = const["stations"]


### 1.3 Retrieve performance data

In [None]:
filepath = parent_path.joinpath("data", "processed", "station_performance_metrics-v1p2.csv")
stations = pd.read_csv(
    filepath, dtype={"Address 02": "str", "ZIP Code": "str"}, low_memory=False
)  # avoid DtypeWarning
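
The `dtype` mapping matters for columns such as ZIP codes, which look numeric but are really labels. The sketch below uses two made-up rows and an in-memory buffer in place of the real file to show what type inference does to a leading zero.

```python
import io

import pandas as pd

# Two made-up rows; the leading zero in the first ZIP code is the detail of interest
csv_text = "Station,ZIP Code\nExample Station A,02110\nExample Station B,60661\n"

as_inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"ZIP Code": "str"})

print(as_inferred["ZIP Code"].tolist())  # parsed as integers; the leading zero is lost
print(as_text["ZIP Code"].tolist())  # parsed as strings; the leading zero is kept
```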

### 1.4 Review the `DataFrame`

In [None]:
stations.shape

In [None]:
stations.info()

In [None]:
stations.head(3)

## 2.0 Passenger arrivals

### 2.1 Top 10 stations (entire period)

The function `ntwk.get_n_busiest_stations()` retrieves the `n` busiest stations. The results can be
filtered on a geographical unit (e.g., state, division, region) and/or a fiscal year and its
associated quarters.

Listed below are the ten (`10`) busiest stations based on passenger arrivals for the entire period under review.

In [None]:
# Columns of interest (for display output only)
cols = [
    COLS["station_code"],
    COLS["station"],
    COLS["city"],
    COLS["state"],
    COLS["division"],
    COLS["region"],
    COLS["total_detrn"],
]

top_n_stns = ntwk.get_n_busiest_stations(stations, 10)[cols]
top_n_stns
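
The implementation of `ntwk.get_n_busiest_stations()` is not shown here. As a minimal sketch of the general idea, a top-`n` ranking, optionally per group, could be built with pandas as follows. The function `n_busiest()`, the column names, and the data are all hypothetical.

```python
import pandas as pd

# Made-up arrivals; column names are placeholders, not the dataset's headers
frame = pd.DataFrame(
    {
        "station": ["AAA", "AAA", "BBB", "CCC", "CCC", "DDD"],
        "region": ["East", "East", "East", "West", "West", "West"],
        "detrained": [500, 700, 300, 900, 100, 400],
    }
)


def n_busiest(frame, n, group_col=None):
    """Return the n stations with the most detraining passengers, optionally per group."""
    keys = ["station"] if group_col is None else [group_col, "station"]
    totals = frame.groupby(keys, as_index=False)["detrained"].sum()
    totals = totals.sort_values("detrained", ascending=False)
    if group_col is not None:
        # Keep the first n rows of each group (rows are already ranked)
        totals = totals.groupby(group_col).head(n)
        totals = totals.sort_values([group_col, "detrained"], ascending=[True, False])
    else:
        totals = totals.head(n)
    return totals.reset_index(drop=True)


top_overall = n_busiest(frame, 2)
top_by_region = n_busiest(frame, 1, "region")
print(top_by_region)
```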

### 2.2 Top 10 stations (2023 Q1-Q2) [1 pt]

Top ten (`10`) busiest stations based on passenger arrivals for the year `2023`, quarters `01` and
`02`. This example demonstrates how to filter the data based on a fiscal year and its associated
quarters.

In [None]:
top_n_stns = ntwk.get_n_busiest_stations(stations, 10, None, 2023, [1, 2])[cols]
top_n_stns

In [None]:
#hidden tests are within this cell
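
Fiscal quarters do not line up with calendar quarters. The sketch below assumes the dataset follows a fiscal year that runs from October 1 through September 30, which this notebook does not state explicitly. The helper `fiscal_period()` is hypothetical and only illustrates the mapping.

```python
import datetime as dt


def fiscal_period(day):
    """Return (fiscal_year, fiscal_quarter), assuming the fiscal year starts on October 1."""
    fiscal_year = day.year + 1 if day.month >= 10 else day.year
    # Shift months so that October becomes month 0 of the fiscal year
    fiscal_quarter = ((day.month - 10) % 12) // 3 + 1
    return fiscal_year, fiscal_quarter


first_day = fiscal_period(dt.date(2022, 10, 1))
end_of_march = fiscal_period(dt.date(2023, 3, 31))
last_day = fiscal_period(dt.date(2023, 9, 30))
print(first_day, end_of_march, last_day)
```

Under that assumption, fiscal year `2023` quarters `01` and `02` would cover October 2022 through March 2023.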

### 2.3 Top 3 stations (by region, entire period)

Top three (`3`) busiest stations in each US Census Bureau region based on passenger arrivals.

In [None]:
region_top_n_stns = ntwk.get_n_busiest_stations(stations, 3, COLS["region"])[cols]
region_top_n_stns

### 2.4 Top 3 stations (by division, entire period)

Top three (`3`) busiest stations in each US Census Bureau division based on passenger arrivals.

In [None]:
div_top_n_stns = ntwk.get_n_busiest_stations(stations, 3, COLS["division"])[cols]
div_top_n_stns

### 2.5 Top 3 stations (by state, entire period)

The top three (`3`) busiest stations in each state based on passenger arrivals.

In [None]:
state_top_n_stns = ntwk.get_n_busiest_stations(stations, 3, COLS["state"])[cols]
state_top_n_stns

## 3.0 Select Station metrics

### 3.1 Moynihan Train Hall at Penn Station (NYP), New York, NY

[Moynihan Train Hall](https://www.amtrak.com/stations/nyp) at Penn Station ([NYP](https://www.amtrak.com/stations/nyp)) is a major transportation hub and Amtrak's busiest station.

In [None]:
# All fiscal years and quarters
nyp = ntwk.by_station(stations, "NYP")
nyp.shape

### 3.2 NYP: on-time performance metrics (entire period)

NYP station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

In [None]:
# Train arrivals (total)
nyp_trn_arrivals = nyp.shape[0]

# Detraining totals
nyp_detrn = nyp[COLS["total_detrn"]].sum()
nyp_detrn_late = nyp[COLS["late_detrn"]].sum()
nyp_detrn_on_time = nyp_detrn - nyp_detrn_late

print(
    f"Train Arrivals: {nyp_trn_arrivals}",
    f"Total Detraining Customers: {nyp_detrn}",
    f"Late Detraining Customers: {nyp_detrn_late}",
    f"On-Time Detraining Customers: {nyp_detrn_on_time}",
    sep="\n",
)

nyp_stats = detrn.get_sum_stats(nyp, AGG["columns"], AGG["funcs"])
nyp_stats
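
The dataset arrives pre-aggregated, so the fifteen-minute rule is never applied in this notebook. Purely as an illustration of the definition, the sketch below classifies a handful of made-up arrivals and derives the late, on-time, and ratio figures in the same way the totals above are combined.

```python
ON_TIME_THRESHOLD_MM = 15  # minutes; taken from the on-time definition

# Hypothetical arrivals: (minutes after scheduled arrival, passengers detraining)
arrivals = [(0, 120), (15, 80), (16, 50), (42, 30)]

# Exactly 15 minutes late still counts as on-time; anything beyond is late
late = sum(psgr for mm_late, psgr in arrivals if mm_late > ON_TIME_THRESHOLD_MM)
total = sum(psgr for _, psgr in arrivals)
on_time = total - late
on_time_ratio = on_time / total

print(f"Late: {late}, On-time: {on_time}, On-time ratio: {on_time_ratio:.3f}")
```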

### 3.3 NYP: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of NYP trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [None]:
# Drop missing values
nyp_avg_mm_late = nyp[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function
nyp_avg_mm_late_describe = frm.describe_numeric_column(nyp_avg_mm_late)
nyp_avg_mm_late_describe

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of NYP trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.
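
How `frm.describe_numeric_column()` computes these shape statistics is not shown here. As a point of reference, pandas exposes comparable measures directly. The sketch below uses a made-up, right-skewed sample; note that `Series.kurt()` reports excess kurtosis, so a normal distribution scores `0`.

```python
import pandas as pd

# Made-up right-skewed sample of mean minutes late
sample = pd.Series([18, 20, 21, 22, 24, 25, 27, 30, 45, 120])

skewness = sample.skew()  # > 0 indicates a longer right tail
excess_kurtosis = sample.kurt()  # > 0 indicates heavier tails than a normal distribution

print(f"mean={sample.mean():.1f}, median={sample.median():.1f}")
print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```

A mean that sits well above the median, as it does in this sample, is another quick sign of positive skew.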

### 3.4 NYP: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 3.4.1 Create chart data

In [None]:
# Convert to DataFrame
nyp_avg_mm_late = nyp_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = nyp_avg_mm_late_describe["center"]["mean"]
sigma = nyp_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = nyp_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
nyp_min_late, bins, num_bins, bin_width = frm.create_bins(nyp_avg_mm_late, COLS["avg_mm_late"], 15)

# Bin the data
chrt_data = frm.bin_data(nyp_min_late, COLS["avg_mm_late"], bins)

# chrt_data
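
The internals of `frm.create_bins()` and `frm.bin_data()` are not shown in this notebook. The sketch below is one common way to bin values with NumPy, under the assumption of fixed-width bins starting at zero; the data and the bin width are made up.

```python
import numpy as np

# Made-up mean minutes late values
values = np.array([17.0, 22.5, 24.0, 31.0, 38.5, 41.0, 47.0, 63.0])

# Pad the maximum up to the next multiple of 10, mirroring max_val_ceil
max_ceil = int(np.ceil(values.max() / 10) * 10)

bin_width = 10  # assumed width, in minutes
edges = np.arange(0, max_ceil + bin_width, bin_width)  # 0, 10, ..., max_ceil
counts, _ = np.histogram(values, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2

for center, count in zip(centers, counts):
    print(f"bin center {center:>5.1f}: {count}")
```

Plotting counts against bin centers, rather than against raw values, is what lets the histogram use a quantitative x-axis.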

#### 3.4.2 Generate the histogram

In [None]:
# Chart title
title_txt = f"Late Detraining Passengers: {STNS['nyp']}"
title = ttl.format_title(nyp_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 3.5 NYP: on-time performance metrics (by fiscal year and quarter)

Compute OTP summary statistics per fiscal year and quarter. Add quarterly train arrival metrics to
the `DataFrame` named `nyp_qtr_stats`.

In [None]:
nyp_qtr_stats = detrn.get_sum_stats_by_group(
    nyp,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    nyp_trn_arrivals,
    nyp_detrn,
)
nyp_qtr_stats.sort_values(by=[COLS["year"], COLS["quarter"]], ascending=[True, True])

#### 3.5.1 Write to file [1 pt]

Write `nyp_qtr_stats` to a CSV file named `stu-amtk-nyp_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-nyp_qtr_stats.csv` file. It
must match line for line, character for character.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell
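
A character-for-character comparison can be checked with the standard library. The sketch below is one possible approach, using temporary files with hypothetical names and made-up content; it is not the grading procedure.

```python
import filecmp
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = pathlib.Path(tmp_dir)
    stu_path = tmp_path / "stu-example.csv"  # hypothetical file names
    fxt_path = tmp_path / "fxt-example.csv"

    rows = "Fiscal Year,Fiscal Quarter,Train Arrivals\n2023,1,42\n"  # made-up content
    stu_path.write_text(rows, encoding="utf-8")
    fxt_path.write_text(rows, encoding="utf-8")

    # shallow=False compares file contents rather than os.stat() signatures
    files_match = filecmp.cmp(stu_path, fxt_path, shallow=False)

    # A single trailing space is enough to break a character-for-character match
    stu_path.write_text(rows.replace("42", "42 "), encoding="utf-8")
    files_differ = not filecmp.cmp(stu_path, fxt_path, shallow=False)

print(files_match, files_differ)
```

Common causes of a mismatch include an unwanted index column, a different row order, and inconsistent encodings.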

### 3.6 NYP: visualize detraining passengers

Visualize NYP detraining passengers, both on-time and late, across all years and quarters with a
bar chart.

In [None]:
# Assemble the data for the chart
chrt_data = vis_bar.create_detrain_chart_frame(nyp_qtr_stats, CHRT_BAR["columns"])

# Get station code, station name, city, and state to use in the chart title
text = frm.drop_dups_and_squeeze(
    nyp, [COLS["station_code"], COLS["station"], COLS["city"], COLS["state"]]
)

# Chart title
title_txt = (
    f"Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(nyp_stats, title_txt)

# Create and display grouped bar chart
chart = vis_bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

### 3.7 NYP: On-time performance metrics by service line

Group train arrivals by service line.

In [None]:
nyp_svc_trns = nyp.groupby(COLS["svc_line"]).size().reset_index()  # Includes rows with NaN
nyp_svc_trns.columns = [COLS["svc_line"], COLS["trn_arrivals"]]
nyp_svc_trns.sort_values(by=COLS["trn_arrivals"], ascending=False, inplace=True)
nyp_svc_trns.reset_index(drop=True, inplace=True)

# Add train arrival ratios (service line/total)
nyp_svc_trns.loc[:, COLS["trn_arrival_ratio"]] = (
    nyp_svc_trns[COLS["trn_arrivals"]] / nyp_trn_arrivals
)
nyp_svc_trns
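
One detail worth keeping in mind when counting rows per group: `size()` counts every row in a group, including rows with missing values in other columns, but `groupby()` drops rows whose group key is missing unless `dropna=False` is passed. The sketch below uses made-up rows and placeholder labels to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last train has no service line recorded
frame = pd.DataFrame(
    {
        "Service Line": ["Line A", "Line A", "Line B", "Line C", np.nan],
        "Late Detraining Customers": [10, np.nan, 5, 8, 3],
    }
)

# Rows with a missing value in another column are still counted
by_svc = frame.groupby("Service Line").size()

# Rows with a missing group key are only counted when dropna=False
by_svc_with_nan = frame.groupby("Service Line", dropna=False).size()

# Each group's share of all rows
ratios = by_svc / len(frame)

print(by_svc.sum(), by_svc_with_nan.sum())
print(ratios.round(2).to_dict())
```

If any key is missing, ratios computed against the total row count will sum to less than `1`.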

#### 3.7.1 NYP: compute on-time performance metrics by service line. [1 pt]

In [None]:
# Get summary stats by COLS["svc_line"]
nyp_svc_line_stats = detrn.get_sum_stats_by_group(
    nyp, COLS["svc_line"], AGG["columns"], AGG["funcs"]
)

# Merge train arrivals by service line
nyp_svc_line_stats = nyp_svc_line_stats.merge(nyp_svc_trns, on=COLS["svc_line"], how="inner")

# Move train arrival columns
cols = nyp_svc_line_stats.columns.tolist()
cols = [cols[0]] + cols[-2:] + cols[1:-2]
nyp_svc_line_stats = nyp_svc_line_stats[cols]

# Add service line detraining ratios
nyp_svc_line_stats.loc[:, "Service Line Detraining Ratio"] = (
    nyp_svc_line_stats["Total Detraining Customers sum"] / nyp_detrn
)

# Move service line detraining ratio column
nyp_svc_line_stats.insert(
    3, "Service Line Detraining Ratio", nyp_svc_line_stats.pop("Service Line Detraining Ratio")
)

# Sort by passengers detrained (descending order)
nyp_svc_line_stats.sort_values(by="Total Detraining Customers sum", ascending=False, inplace=True)

# Reset index
nyp_svc_line_stats.reset_index(drop=True, inplace=True)
nyp_svc_line_stats

In [None]:
#hidden tests are within this cell
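
The merge-then-reposition pattern used for the service line statistics can be seen in isolation with two small frames. The column names and values below are placeholders chosen for brevity.

```python
import pandas as pd

# Made-up summary stats and train arrival counts keyed on the same column
stats = pd.DataFrame({"Service Line": ["A", "B"], "Total sum": [100, 300], "Late sum": [10, 60]})
arrivals = pd.DataFrame({"Service Line": ["A", "B"], "Train Arrivals": [20, 30]})

merged = stats.merge(arrivals, on="Service Line", how="inner")

# A newly derived column is appended as the last column
merged["Detraining Ratio"] = merged["Total sum"] / merged["Total sum"].sum()

# pop() removes the column and returns it; insert() places it at position 2
merged.insert(2, "Detraining Ratio", merged.pop("Detraining Ratio"))

# Largest totals first, then renumber the rows
merged = merged.sort_values("Total sum", ascending=False).reset_index(drop=True)
print(merged.columns.tolist())
```

Without `ascending=False`, `sort_values()` sorts in ascending order, which would put the smallest service line first.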

#### 3.7.2 NYP: visualize distribution of mean late arrival times

Illustrate with box plots.

In [None]:
nyp_svc_lines = nyp.groupby(COLS["svc_line"])[[COLS["svc_line"], COLS["late_detrn_avg_mm_late"]]]
chrt_data = nyp_svc_lines.apply(lambda x: x).reset_index(drop=True)  # Flatten for Altair
chrt_data.head()
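
A box plot summarizes each group by its quartiles. As a rough numeric counterpart to the chart, the sketch below computes the quartiles and interquartile range per group from made-up values with placeholder column names.

```python
import pandas as pd

# Made-up mean minutes late for two placeholder service lines
frame = pd.DataFrame(
    {
        "Service Line": ["A"] * 5 + ["B"] * 5,
        "Avg Min Late": [20, 25, 30, 35, 40, 30, 45, 60, 75, 90],
    }
)

# Quartiles per group: the edges of each box and its median line
quartiles = (
    frame.groupby("Service Line")["Avg Min Late"].quantile([0.25, 0.5, 0.75]).unstack()
)

# Interquartile range: the height of each box
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```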

In [None]:
# Chart title
title_txt = (
    f"Late Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(nyp_stats, title_txt)

# Create and display the box plots
chart = box.create_box_plot(
    chrt_data,
    "Late Detraining Customers Avg Min Late:Q",
    "Average Minutes Late",
    "Service Line:N",
    COLS["svc_line"],
    CHRT_BOX["y_axis"]["sort"],
    CHRT_BOX["colors"],
    title,
    CHRT_BOX["padding"],
)

chart.display()

### 3.8 Chicago Union Station (CHI), Chicago, IL

[Chicago Union Station](https://www.amtrak.com/stations/chi) ([CHI](https://www.amtrak.com/stations/chi)) is a key node in the Amtrak
network, supporting both regional services in the Midwest and long-distance routes.

In [None]:
chi = ntwk.by_station(stations, "CHI")
chi.shape

### 3.9 CHI: on-time performance metrics (entire period)

CHI station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

In [None]:
# Train arrivals (total)
chi_trn_arrivals = chi.shape[0]

# Detraining totals
chi_detrn = chi[COLS["total_detrn"]].sum()
chi_detrn_late = chi[COLS["late_detrn"]].sum()
chi_detrn_on_time = chi_detrn - chi_detrn_late

print(
    f"Train Arrivals: {chi_trn_arrivals}",
    f"Total Detraining Customers: {chi_detrn}",
    f"Late Detraining Customers: {chi_detrn_late}",
    f"On-Time Detraining Customers: {chi_detrn_on_time}",
    sep="\n",
)

chi_stats = detrn.get_sum_stats(chi, AGG["columns"], AGG["funcs"])
chi_stats

### 3.10 CHI: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of CHI trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [None]:
# Drop missing values
chi_avg_mm_late = chi[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
chi_avg_mm_late_describe = frm.describe_numeric_column(chi_avg_mm_late)
chi_avg_mm_late_describe

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of CHI trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.

### 3.11 CHI: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 3.11.1 Create chart data [1 pt]

In [None]:
# Convert to DataFrame
chi_avg_mm_late = chi_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = chi_avg_mm_late_describe["center"]["mean"]
sigma = chi_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = chi_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
chi_min_late, bins, num_bins, bin_width = frm.create_bins(chi_avg_mm_late, COLS["avg_mm_late"], 10)

# Bin the data
chrt_data = frm.bin_data(chi_min_late, COLS["avg_mm_late"], bins)
# chrt_data

In [None]:
#hidden tests are within this cell

#### 3.11.2 Generate the histogram

In [None]:
# Chart title
title_txt = f"Late Detraining Passengers: {STNS['chi']}"
title = ttl.format_title(chi_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 3.12 CHI: on-time performance metrics (by fiscal year and quarter) [1 pt]

Compute OTP summary statistics per fiscal year and quarter. Add quarterly train arrival metrics to
the `DataFrame` named `chi_qtr_stats`.

In [None]:
chi_qtr_stats = detrn.get_sum_stats_by_group(
    chi,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    chi_trn_arrivals,
    chi_detrn,
)
chi_qtr_stats.sort_values(by=[COLS["year"], COLS["quarter"]], inplace=True)
chi_qtr_stats

In [None]:
#hidden tests are within this cell

#### 3.12.1 Write to file [1 pt]

Write `chi_qtr_stats` to a CSV file named `stu-amtk-chi_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-chi_qtr_stats.csv` file. It
must match line for line, character for character.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 3.13 CHI: visualize detraining passengers

Visualize CHI detraining passengers, both on-time and late, across all years and quarters with a
bar chart.

In [None]:
# Assemble the data for the chart
chrt_data = vis_bar.create_detrain_chart_frame(chi_qtr_stats, CHRT_BAR["columns"])

# Get station code, station name, city, and state to use in the chart title
text = frm.drop_dups_and_squeeze(
    chi, [COLS["station_code"], COLS["station"], COLS["city"], COLS["state"]]
)

# Chart title
title_txt = (
    f"Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(chi_stats, title_txt)

# Create and display grouped bar chart
chart = vis_bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

### 3.14 CHI: On-time performance metrics by service line [1 pt]

Group train arrivals by service line.

In [None]:
chi_svc_trns = chi.groupby(COLS["svc_line"]).size().reset_index()  # Includes rows with NaN
chi_svc_trns.columns = [COLS["svc_line"], COLS["trn_arrivals"]]
chi_svc_trns.sort_values(by=COLS["trn_arrivals"], ascending=False, inplace=True)
chi_svc_trns.reset_index(drop=True, inplace=True)

# Add train arrival ratios (service line/total)
chi_svc_trns.loc[:, COLS["trn_arrival_ratio"]] = (
    chi_svc_trns[COLS["trn_arrivals"]] / chi_trn_arrivals
)
chi_svc_trns

In [None]:
#hidden tests are within this cell

#### 3.14.1 CHI: compute on-time performance metrics by service line.

In [None]:
# Get summary stats by COLS["svc_line"]
chi_svc_line_stats = detrn.get_sum_stats_by_group(
    chi, COLS["svc_line"], AGG["columns"], AGG["funcs"]
)

# Merge train arrivals by service line
chi_svc_line_stats = chi_svc_line_stats.merge(chi_svc_trns, on=COLS["svc_line"], how="inner")

# Move train arrival columns
cols = chi_svc_line_stats.columns.tolist()
cols = [cols[0]] + cols[-2:] + cols[1:-2]
chi_svc_line_stats = chi_svc_line_stats[cols]

# Add service line detraining ratios
chi_svc_line_stats.loc[:, "Service Line Detraining Ratio"] = (
    chi_svc_line_stats["Total Detraining Customers sum"] / chi_detrn
)

# Move service line detraining ratio column
chi_svc_line_stats.insert(
    3, "Service Line Detraining Ratio", chi_svc_line_stats.pop("Service Line Detraining Ratio")
)

# Sort by passengers detrained (descending order)
chi_svc_line_stats.sort_values(by="Total Detraining Customers sum", ascending=False, inplace=True)

# Reset index
chi_svc_line_stats.reset_index(drop=True, inplace=True)
chi_svc_line_stats

#### 3.14.2 CHI: visualize distribution of mean late arrival times

Illustrate with box plots.

In [None]:
chi_svc_lines = chi.groupby(COLS["svc_line"])[[COLS["svc_line"], COLS["late_detrn_avg_mm_late"]]]
chrt_data = chi_svc_lines.apply(lambda x: x).reset_index(drop=True)  # Flatten for Altair
chrt_data.head()

In [None]:
# Chart title
title_txt = (
    f"Late Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(chi_stats, title_txt)

# Create and display the box plots
chart = box.create_box_plot(
    chrt_data,
    "Late Detraining Customers Avg Min Late:Q",
    "Average Minutes Late",
    "Service Line:N",
    COLS["svc_line"],
    CHRT_BOX["y_axis"]["sort"],
    CHRT_BOX["colors"],
    title,
    CHRT_BOX["padding"],
)

chart.display()

### 3.15 Los Angeles Union Station (LAX), Los Angeles, CA

[Los Angeles Union Station](https://www.amtrak.com/stations/lax) ([LAX](https://www.amtrak.com/stations/lax)) serves the West Coast with
connections to Amtrak's long-distance routes.

In [None]:
lax = ntwk.by_station(stations, "LAX")
lax.shape

### 3.16 LAX: on-time performance metrics (entire period) [1 pt]

LAX station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

In [None]:
# Train arrivals (total)
lax_trn_arrivals = lax.shape[0]

# Detraining totals
lax_detrn = lax[COLS["total_detrn"]].sum()
lax_detrn_late = lax[COLS["late_detrn"]].sum()
lax_detrn_on_time = lax_detrn - lax_detrn_late

print(
    f"Train Arrivals: {lax_trn_arrivals}",
    f"Total Detraining Customers: {lax_detrn}",
    f"Late Detraining Customers: {lax_detrn_late}",
    f"On-Time Detraining Customers: {lax_detrn_on_time}",
    sep="\n",
)

lax_stats = detrn.get_sum_stats(lax, AGG["columns"], AGG["funcs"])
lax_stats

In [None]:
#hidden tests are within this cell

### 3.17 LAX: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of LAX trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [None]:
# Drop missing values
lax_avg_min_late = lax[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
lax_avg_min_late_describe = frm.describe_numeric_column(lax_avg_min_late)
lax_avg_min_late_describe

### 3.18 LAX: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 3.18.1 Create chart data [1 pt]

In [None]:
# Convert to DataFrame
lax_avg_min_late = lax_avg_min_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = lax_avg_min_late_describe["center"]["mean"]
sigma = lax_avg_min_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = lax_avg_min_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
lax_min_late, bins, num_bins, bin_width = frm.create_bins(lax_avg_min_late, COLS["avg_mm_late"], 10)

# Bin the data
chrt_data = frm.bin_data(lax_min_late, COLS["avg_mm_late"], bins)
# chrt_data

In [None]:
#hidden tests are within this cell

#### 3.18.2 Generate the histogram

In [None]:
# Chart title
title_txt = f"Late Detraining Passengers: {STNS['lax']}"
title = ttl.format_title(lax_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()


### 3.19 LAX: on-time performance metrics (by fiscal year and quarter)

Compute OTP summary statistics per fiscal year and quarter. Add quarterly train arrival metrics to
the `DataFrame` named `lax_qtr_stats`.

In [None]:
lax_qtr_stats = detrn.get_sum_stats_by_group(
    lax,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    lax_trn_arrivals,
    lax_detrn,
)
lax_qtr_stats.sort_values(by=[COLS["year"], COLS["quarter"]], ascending=[True, True])

#### 3.19.1 Write to file [1 pt]

Write `lax_qtr_stats` to a CSV file named `stu-amtk-lax_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-lax_qtr_stats.csv` file. It
must match line for line, character for character.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 3.20 LAX: visualize detraining passengers

Visualize LAX detraining passengers, both on-time and late, across all years and quarters with a
bar chart.

In [None]:
# Assemble the data for the chart
chrt_data = vis_bar.create_detrain_chart_frame(lax_qtr_stats, CHRT_BAR["columns"])

# Get station code, station name, city, and state to use in the chart title
text = frm.drop_dups_and_squeeze(
    lax, [COLS["station_code"], COLS["station"], COLS["city"], COLS["state"]]
)

# Chart title
title_txt = (
    f"Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(lax_stats, title_txt)

# Create and display grouped bar chart
chart = vis_bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

### 3.21 LAX: On-time performance metrics by service line [1 pt]

Group train arrivals by service line.

In [None]:
lax_svc_trains = lax.groupby(COLS["svc_line"]).size().reset_index()  # Includes rows with NaN
lax_svc_trains.columns = [COLS["svc_line"], COLS["trn_arrivals"]]
lax_svc_trains.sort_values(by=COLS["trn_arrivals"], ascending=False, inplace=True)
lax_svc_trains.reset_index(drop=True, inplace=True)

# Add train arrival ratios (service line/total)
lax_svc_trains.loc[:, COLS["trn_arrival_ratio"]] = (
    lax_svc_trains[COLS["trn_arrivals"]] / lax_trn_arrivals
)
lax_svc_trains

In [None]:
#hidden tests are within this cell

#### 3.21.1 LAX: compute on-time performance metrics by service line. [1 pt]

In [None]:
# Get summary stats by COLS["svc_line"]
lax_svc_line_stats = detrn.get_sum_stats_by_group(
    lax, COLS["svc_line"], AGG["columns"], AGG["funcs"]
)

# Merge train arrivals by service line
lax_svc_line_stats = lax_svc_line_stats.merge(lax_svc_trains, on=COLS["svc_line"], how="inner")

# Move train arrival columns
cols = lax_svc_line_stats.columns.tolist()
cols = [cols[0]] + cols[-2:] + cols[1:-2]
lax_svc_line_stats = lax_svc_line_stats[cols]

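# Add service line detraining ratios
lax_svc_line_stats.loc[:, "Service Line Detraining Ratio"] = (
    lax_svc_line_stats["Total Detraining Customers sum"] / lax_detrn
)

# Move service line detraining ratio column
lax_svc_line_stats.insert(
    3, "Service Line Detraining Ratio", lax_svc_line_stats.pop("Service Line Detraining Ratio")
)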

# Sort by passengers detrained (descending order)
lax_svc_line_stats.sort_values(by="Total Detraining Customers sum", ascending=False, inplace=True)

# Reset index
lax_svc_line_stats.reset_index(drop=True, inplace=True)
lax_svc_line_stats

In [None]:
#hidden tests are within this cell

#### 3.21.2 LAX: visualize distribution of mean late arrival times

Illustrate with box plots.

In [None]:
lax_svc_lines = lax.groupby(COLS["svc_line"])[[COLS["svc_line"], COLS["late_detrn_avg_mm_late"]]]
chrt_data = lax_svc_lines.apply(lambda x: x).reset_index(drop=True)  # Flatten for Altair
chrt_data.head()

In [None]:
# Chart title
title_txt = (
    f"Late Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(lax_stats, title_txt)

chart = box.create_box_plot(
    chrt_data,
    "Late Detraining Customers Avg Min Late:Q",
    "Average Minutes Late",
    "Service Line:N",
    COLS["svc_line"],
    CHRT_BOX["y_axis"]["sort"],
    CHRT_BOX["colors"],
    title,
    CHRT_BOX["padding"],
)

chart.display()

## 4.0 Watermark

In [None]:
%load_ext watermark
%watermark -h -i -iv -m -v