# Clean: Intercity Passenger Rail Service Station Performance Metrics

This notebook "cleans" the combined [Amtrak](https://www.amtrak.com/home.html) station performance
metrics, addressing issues involving missing values, string formatting, type conversion, and column
redundancy. The notebook also leverages each station's "State" value to add "Division" and "Region"
columns based on [US Census](https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf)
geographic groupings. The notebook then writes the updated dataset to a CSV file for follow-up
cleaning, manipulation, and analysis.

### Variable names

A number of variable names in this project leverage the following abbreviations. The naming
strategy is to strike a balance between brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [None]:
import json
import numpy as np
import pandas as pd
import pathlib as pl
import re
import tomllib as tl

import fra_amtrak.amtk_frame as frm
import fra_amtrak.amtk_network as ntwk

# Create a seeded random number generator
rdg = np.random.default_rng(24)

## 1.0 Read files

### 1.1 Resolve paths

Create `pathlib.Path` instances that represent absolute paths to the `data/interim` and `data/processed` directories.

In [None]:
parent_path = pl.Path.cwd()  # current working directory
parent_path

data_interim_path = parent_path.joinpath("data", "interim")
data_processed_path = parent_path.joinpath("data", "processed")

### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [None]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
COLS = const["columns"]

### 1.3 Retrieve performance data (interim)

In [None]:
filepath = data_interim_path.joinpath("station_performance_metrics-v1p0.csv")
stations = pd.read_csv(filepath)

### 1.4 Review the `DataFrame`

In [None]:
stations.shape

In [None]:
stations.info()

In [None]:
stations.head()

## 2.0 Normalize strings

Trim leading and trailing spaces from each string value. Also search for and remove unnecessary internal spaces that match a regular expression (`re.Pattern` object). Call the function `frm.normalize_dataframe_strings()` to perform this operation.

### 2.1 Locate suspect strings

As is illustrated below, the regex pattern to employ is `"\s{2,}"`.

In [None]:
# Locate extra spaces in all string columns
extra_spaces = stations.select_dtypes(include="object").apply(
    lambda x: x.str.contains(r"\s{2,}").sum()
)
extra_spaces
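
To see what the pattern matches, the sketch below runs it against a small toy `DataFrame` (hypothetical values, not drawn from the Amtrak dataset). It counts the values that contain runs of two or more whitespace characters, then collapses those runs with plain pandas string methods. This is an illustration only and is not the implementation of `frm.normalize_dataframe_strings()`.

```python
import pandas as pd

# Toy data (hypothetical values, not drawn from the Amtrak dataset)
demo = pd.DataFrame(
    {
        "name": ["Ann  Arbor", " Chicago ", "New   York"],
        "code": ["ARB", "CHI", "NYP"],
    }
)

# Count the values in each column that contain two or more consecutive whitespace characters
counts = demo.apply(lambda col: col.str.contains(r"\s{2,}").sum())

# Collapse each whitespace run to a single space, then trim both ends
cleaned = demo.apply(lambda col: col.str.replace(r"\s{2,}", " ", regex=True).str.strip())
print(counts)
print(cleaned)
```

Note that `" Chicago "` is not counted because its leading and trailing spaces are single characters; trimming handles that case separately.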

### 2.2 Clean strings [1 pt]

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## 3.0 Manipulate data

### 3.1 Why two "average min late" columns?

The dataset contains two columns that appear to record the same information: average minutes late. The columns are "Avg Min Late (Lt CS)" and "Avg Min Late (Lt C)". The "Lt CS" column is well-stocked with non-`NaN` values; in contrast, the "Lt C" column contains only `4668` numeric values. Perhaps this data can be moved to the "Avg Min Late (Lt CS)" column. Investigate.

#### 3.1.1 Compare "Avg Min Late (Lt CS)" and "Avg Min Late (Lt C)" values

First, return a `DataFrame` filtered on "Avg Min Late (Lt C)" non-NA values.

In [None]:
mask = stations[COLS["avg_mm_late_c"]].notna()
lt_c_notna = stations[mask].reset_index(drop=True)
lt_c_notna.shape

Check whether the `lt_c_notna` numeric values can be found throughout the dataset or are confined to specific years and/or quarters.

In [None]:
years_qtrs = lt_c_notna[[COLS["year"], COLS["quarter"]]].drop_duplicates().reset_index(drop=True)
years_qtrs

Next, create a second `DataFrame` filtered on "Avg Min Late (Lt C)" non-NA values _and_ "Avg Min Late (Lt CS)" NA values.

In [None]:
mask = (stations[COLS["avg_mm_late_c"]].notna()) & (stations[COLS["avg_mm_late_cs"]].isna())
lt_c_notna_lt_cs_isna = stations[mask].reset_index(drop=True)
lt_c_notna_lt_cs_isna.shape

Check the two `DataFrames` for equality. If they are equal, the non-NA "Avg Min Late (Lt C)" values can be copied to the "Avg Min Late (Lt CS)" column.

In [None]:
assert lt_c_notna.equals(lt_c_notna_lt_cs_isna)

#### 3.1.2 Update the "Avg Min Late (Lt CS)" column with non-NA "Avg Min Late (Lt C)" values

The values are safe to transfer.

In [None]:
mask = stations[COLS["avg_mm_late_c"]].notna()
stations.loc[mask, COLS["avg_mm_late_cs"]] = stations.loc[mask, COLS["avg_mm_late_c"]]
stations[mask].head(3)
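
The check-then-copy logic of sections 3.1.1 and 3.1.2 can be reproduced on toy data. In this sketch the column names `cs` and `c` and their values are hypothetical stand-ins for the two "Avg Min Late" columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "c" is populated only where "cs" is missing
demo = pd.DataFrame(
    {
        "cs": [5.0, np.nan, 12.0, np.nan],
        "c": [np.nan, 7.0, np.nan, 3.0],
    }
)

# Rows where "c" holds a value vs. rows where "c" holds a value AND "cs" is missing
c_notna = demo[demo["c"].notna()].reset_index(drop=True)
c_notna_cs_isna = demo[demo["c"].notna() & demo["cs"].isna()].reset_index(drop=True)

# Equal frames mean no row holds a value in both columns, so nothing is overwritten
no_overlap = c_notna.equals(c_notna_cs_isna)

# Copy the "c" values into "cs" for the masked rows only
mask = demo["c"].notna()
demo.loc[mask, "cs"] = demo.loc[mask, "c"]
print(no_overlap, demo["cs"].tolist())
```

`DataFrame.equals()` treats `NaN` values in the same position as equal, which is why the comparison succeeds even though the `cs` column of both filtered frames is entirely missing.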

#### 3.1.3 Drop the "Avg Min Late (Lt C)" column [1 pt]

The column is now redundant.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 3.2 Split "Arrival Station Name" string into multiple columns [1 pt]

The "Arrival Station Name" column is overloaded with location information. The station name, state,
and country usually all appear in the string.

Split the column values and unpack the substrings into three new columns named "Arrival Station",
"State", and "Country". Use the available `COLS` constants to define the new column names.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

#### 3.2.1 Review "State" column values

Compare the values to the jurisdictions contained in the `states_provinces.json` file. The file contains a list of US states, the District of Columbia, and Canadian provinces. Update values as needed.

In [None]:
with open(data_processed_path.joinpath("states_provinces.json"), "r") as file:
    states_provinces = json.load(file)

# Combine US and Canadian jurisdictions
jurisdictions = states_provinces["United States"] + states_provinces["Canada"]

# Check for missing and/or incorrect values
mask = ~stations[COLS["state"]].isin(jurisdictions)  # negation
bad_values = stations[mask].loc[:, COLS["state"]].unique()
bad_values

#### 3.2.2 Update "State" column CA and VT values [1 pt]

Update the "State" column, replacing the US state codes "CA" and "VT" with "California" and "Vermont", respectively.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

#### 3.2.3 Update "State" column `NaN` values

In [None]:
# Check "State" column for missing values
mask = stations[COLS["state"]].isna()
bad_values = (
    stations[mask]
    .loc[:, [COLS["station_code"], COLS["station"], COLS["state"]]]
    .drop_duplicates()
    .reset_index(drop=True)
)
bad_values

The `NaN` values are associated with the following stations:

* CBN: [Canadian Border (Niagara Falls, NY)](https://www.amtrak.com/stations/cbn)
* NRG: [Northridge, CA](https://www.amtrak.com/stations/nrg)

Update the "State" column values for these stations.

In [None]:
# Update missing "State" values
mapper = {"CBN": "New York", "NRG": "California"}
stations[COLS["state"]] = stations[COLS["station_code"]].map(mapper).fillna(stations[COLS["state"]])
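
The `map()`/`fillna()` idiom used above can be illustrated on a toy `DataFrame`. The column names `code` and `state` and the "CHI" row are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): two stations lack a state value
demo = pd.DataFrame(
    {
        "code": ["CBN", "CHI", "NRG"],
        "state": [None, "Illinois", None],
    }
)
mapper = {"CBN": "New York", "NRG": "California"}

# map() yields NaN for codes absent from the mapper; fillna() restores the existing state
demo["state"] = demo["code"].map(mapper).fillna(demo["state"])
print(demo)
```

One consequence of this approach is that a mapped code always takes the mapper's value, even when the row already holds a state, so the mapper should list only the codes that need correcting.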

Sample to confirm that the "State" column values have been updated.

In [None]:
# Sample to confirm CBN and NRG stations have been updated
mask = (stations[COLS["station_code"]] == "CBN") | (stations[COLS["station_code"]] == "NRG")

# Apply weights to sample (CBN stations are fewer)
weights = stations[mask][COLS["station_code"]].apply(lambda x: 7 if x == "CBN" else 1)
stations[mask].sample(n=7, weights=weights, random_state=rdg)
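
The weighted sampling above can be sketched on toy data. The codes "A" and "B" and the row counts are hypothetical; the weights simply make rows from the rarer group more likely to be drawn.

```python
import numpy as np
import pandas as pd

# Seeded generator so that repeated runs draw the same rows
rng = np.random.default_rng(24)

# Toy data (hypothetical): code "A" is rare, code "B" is common
demo = pd.DataFrame({"code": ["A"] * 2 + ["B"] * 18})

# Weight the rare rows more heavily so they are more likely to appear in the sample
weights = demo["code"].map({"A": 7, "B": 1})
sample = demo.sample(n=5, weights=weights, random_state=rng)
print(sample)
```

pandas normalizes the weights so they sum to one, so only their relative sizes matter. Sampling is without replacement by default, which means each row appears at most once.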

### 3.3 Update the "Country" column [1 pt]

Leverage the "State" column to update each "Country" column row value with either "United States" or "Canada".

In [None]:
# Read states
filepath = data_processed_path.joinpath("states_provinces.json")
with open(filepath, "r") as file_obj:
    states_provinces = json.load(file_obj)

# Count US and Canadian stations
country_counts = stations[COLS["country"]].value_counts()
print(f"country_counts = {country_counts}")

Update the "Country" column with "United States" and "Canada" values by applying the function `get_country()` to each row value.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

Recheck the "Country" column values.

In [None]:
# Count US and Canadian stations
country_counts = stations[COLS["country"]].value_counts()
print(f"country_counts = {country_counts}")

### 3.3 Add region and division columns

Read the `regions_divisions.json` file to acquire region and division values. Then leverage the "State" column to add new "Region" and "Division" columns to the `DataFrame`.

In [None]:
filepath = data_processed_path.joinpath("regions_divisions.json")
with open(filepath, "r") as file_obj:
    regions_divisions = json.load(file_obj)

print(regions_divisions.keys())
print(regions_divisions["West"].keys())
print(regions_divisions["West"].items())

Apply the function `ntwk.get_region_division()` to each "State" value to populate the "Region" and "Division" columns.

In [None]:
# Assign region and division to each state, province, and district
stations.loc[:, [COLS["region"], COLS["division"]]] = (
    stations.loc[:, COLS["state"]]
    .apply(lambda x: pd.Series(ntwk.get_region_division(regions_divisions, x)))
    .values
)
stations.head()
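
The lookup can be sketched as follows, assuming the JSON file nests regions, then divisions, then lists of states (as the printed keys and items above suggest). The function `lookup_region_division()` and the sample mapping are hypothetical stand-ins; the actual behavior of `ntwk.get_region_division()` may differ, for example in how it handles unknown states.

```python
import pandas as pd

# Hypothetical nested mapping: region -> division -> list of states
regions_divisions_demo = {
    "West": {"Pacific": ["California", "Oregon"], "Mountain": ["Utah"]},
    "Northeast": {"New England": ["Vermont"]},
}


def lookup_region_division(mapping, state):
    """Return (region, division) for a state, or (None, None) if not found."""
    for region, divisions in mapping.items():
        for division, states in divisions.items():
            if state in states:
                return region, division
    return None, None


demo = pd.DataFrame({"State": ["California", "Vermont", "Utah"]})

# Resolve each state once, then unpack the pairs into two new columns
pairs = [lookup_region_division(regions_divisions_demo, s) for s in demo["State"]]
demo["Region"] = [region for region, _ in pairs]
demo["Division"] = [division for _, division in pairs]
print(demo)
```

The list comprehension is an alternative to the `apply()`/`pd.Series` pattern used above; it avoids building an intermediate `Series` per row, which can be noticeably slower on large frames.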

### 3.4 Reorder columns [1 pt]

Reorder the columns as specified in the table below.

| Position | Column Name | Note |
| :----- | :------------- | :------------- |
| `0`-`1` | "Fiscal Year", "Fiscal Quarter" | &nbsp; |
| `2`-`5` | "Service Line", "Service", "Sub Service", "Train Number" | &nbsp; |
| `6`-`8` | "Arrival Station Code", "Arrival Station Name", "Arrival Station" | Drop "Arrival Station Name" after confirming column order. |
| `9`-`12` | "State", "Division", "Region", "Country" | &nbsp; |
| `13`-`14` | "Total Detraining Customers", "Late Detraining Customers" | &nbsp; |
| `15` | "Avg Min Late (Lt CS)" | &nbsp; |

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 3.5 Drop "Arrival Station Name" column [1 pt]

The column is now redundant. Remove it.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

### 3.6 Rename the "Avg Min Late (Lt CS)" column [1 pt]

The presence of parentheses `()` in the "Avg Min Late (Lt CS)" column name may cause issues in subsequent analysis. Rename the column to "Late Detraining Avg Min Late".

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## 4.0 Persist data

### 4.1 Recheck data

In [None]:
stations.info()

### 4.2 Write to file [1 pt]

Write data to a CSV file.

In [None]:
filepath = data_interim_path.joinpath("station_performance_metrics-v1p1.csv")
stations.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

## 5.0 Watermark

In [None]:
%load_ext watermark
%watermark -h -i -iv -m -v