# Validation

Using outputs from <i>Los Angeles Spatial Analysis.ipynb</i>, this notebook determines why some stops appear in our layer but not in the Transit Rich Housing (TRH) layer. The goal is to figure out whether we made a mistake in calculating headways or whether our layer is correct.

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
from collections import Counter

In [None]:
# load in stops with headway and TRH comparison information and load in stops2routes data

stops = pd.read_csv("output/Metro LA (HQT) - Transit Rich Housing comparison.csv", index_col=0)
stops2routes = pd.read_csv("LA-Metro/Metro - Los Angeles stops2routes.csv", index_col=0)

In [None]:
# merge route_short_names onto stops
stops = stops.merge(stops2routes, on="stop_id", how="left")

In [None]:
stops.head()

In [None]:
stops.query("within == False")["route_short_names"]

In [None]:
pd.crosstab(stops['route_short_names'], stops['within'])

The following section examines the <i>route_short_names</i> listed for each stop and records every appearance of each route. The goal is a table showing how many times a route appears at a stop that we consider high-quality transit but TRH does not.

This should allow us to find particular routes that TRH has omitted and we have included. From there, we can examine agency timetables for those routes, make a final determination, and evaluate our code accordingly.
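As a rough sketch of the counting described above, the snippet below builds the same kind of per-route table on toy data. It assumes <i>route_short_names</i> is stored as a stringified Python list such as `"['76', '96']"`; the frame `toy_stops`, the helper `parse_routes`, and the status labels are hypothetical names used only for illustration.

```python
import ast

import pandas as pd

# Toy stand-in for the merged `stops` table; the values are made up.
toy_stops = pd.DataFrame({
    "stop_id": [1, 2, 3, 4],
    "route_short_names": ["['76', '96']", "['76']", "['110']", None],
    "within": [True, False, False, True],
})


def parse_routes(value):
    """Turn a stringified list such as "['76', '96']" into a list of route names."""
    if pd.isna(value):
        return []  # stop had no match in the stops2routes table
    return [str(route) for route in ast.literal_eval(value)]


# One row per (stop, route) pair; stops without routes are dropped.
exploded = (
    toy_stops.assign(route_short_name=toy_stops["route_short_names"].map(parse_routes))
    .explode("route_short_name")
    .dropna(subset=["route_short_name"])
    .reset_index(drop=True)
)
exploded["trh_status"] = exploded["within"].map({True: "in_trh", False: "not_in_trh"})

# Rows are routes, columns count stops inside and outside the TRH layer.
toy_counts = pd.crosstab(exploded["route_short_name"], exploded["trh_status"])
print(toy_counts)
```

Parsing with `ast.literal_eval` avoids hand-stripping brackets and quotes, at the cost of failing loudly if a value is not a valid Python literal.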

In [None]:
def split_routes(values):
    """Flatten stringified route lists (e.g. "['76', '96']") into one list of route names."""
    routes = []
    # stops with no match in stops2routes are NaN after the left merge, so skip them
    for item in values.dropna():
        routes += item[1:-1].replace("'", "").replace(" ", "").split(",")
    return routes

false_route_short_names = split_routes(stops.query("within == False")["route_short_names"])
true_route_short_names = split_routes(stops.query("within == True")["route_short_names"])
all_route_short_names = split_routes(stops["route_short_names"])

In [None]:
false_routes = pd.DataFrame(data=dict(Counter(false_route_short_names)), index=["count"]).T.reset_index()
false_routes.rename(columns={"index":"route_short_name"}, inplace=True)

true_routes = pd.DataFrame(data=dict(Counter(true_route_short_names)), index=["count"]).T.reset_index()
true_routes.rename(columns={"index":"route_short_name"}, inplace=True)

all_routes = pd.DataFrame(data=dict(Counter(all_route_short_names)), index=["count"]).T.reset_index()
all_routes.rename(columns={"index":"route_short_name"}, inplace=True)

In [None]:
# merge all these count tables together

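# left merge: keep routes that never appear at a stop inside the TRH layer;
# an inner merge here would drop exactly the whole-route discrepancies we want to find
route_freq = all_routes.merge(true_routes, on="route_short_name", how="left", suffixes=("_all", "_true"))

# inner merge: keep only routes with at least one stop omitted from TRH
route_freq = route_freq.merge(false_routes, on="route_short_name")

route_freq.rename(columns={"count":"count_false"}, inplace=True)

# a route absent from true_routes has zero stops inside the TRH layer
route_freq['count_true'] = route_freq['count_true'].fillna(0)

route_freq['pct_true'] = route_freq['count_true'] / route_freq['count_all']
route_freq['pct_false'] = route_freq['count_false'] / route_freq['count_all']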

In [None]:
route_freq.sort_values("pct_false", ascending=False)

In [None]:
plt.clf()

plt.figure(figsize=(10,10))

plt.hist(route_freq['pct_false'], bins=20)

# median (solid) and quartiles (dashed)
quartiles = route_freq['pct_false'].quantile([0.25, 0.5, 0.75])
plt.vlines(quartiles[0.5], ymin=0, ymax=16, colors='r')
plt.vlines([quartiles[0.25], quartiles[0.75]], ymin=0, ymax=16, colors='r', linestyles='--')


plt.ylabel("Number of routes")
plt.xlabel("Pct. of stops omitted from TRH")
plt.title("Metro - LA")

plt.show()

The histogram above shows, for each route, the share of its stops that we consider high-quality transit but Transit Rich Housing does not.

A high share suggests that Transit Rich Housing treats the route itself as non-qualifying. This is also easy to check: routes with a high share of their stops excluded from Transit Rich Housing can be reevaluated manually against the timetable.

Whole-route discrepancies between us and Transit Rich Housing are where I'd expect us to identify any mistakes most quickly and easily.

This graph and the table behind it can help us identify the routes to prioritize when checking our work.
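A minimal sketch of that prioritization on made-up numbers: it flags routes whose stops are all missing from TRH and routes at or above the 75th percentile of the omitted share. The frame `toy_route_freq` and its counts are hypothetical and only mirror the column names used in this notebook.

```python
import pandas as pd

# Toy stand-in for `route_freq`; the counts are made up for illustration.
toy_route_freq = pd.DataFrame({
    "route_short_name": ["A", "B", "C", "D", "E", "F"],
    "count_all": [10, 8, 20, 5, 10, 10],
    "count_false": [10, 2, 18, 0, 5, 6],
})
toy_route_freq["pct_false"] = toy_route_freq["count_false"] / toy_route_freq["count_all"]

# Whole-route discrepancies: every stop on the route is missing from TRH.
whole_route = toy_route_freq[toy_route_freq["pct_false"] == 1.0]

# Routes at or above the 75th percentile of pct_false, worst first.
cutoff = toy_route_freq["pct_false"].quantile(0.75)
priority = (
    toy_route_freq[toy_route_freq["pct_false"] >= cutoff]
    .sort_values("pct_false", ascending=False)
)
print(priority)
```

Checking the whole-route cases first keeps the manual timetable review short, since a single timetable settles every stop on that route.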

In [None]:
route_freq.describe()['pct_false']

If we are to prioritize routes to double-check, here are all routes at the 75th percentile and above.

In [None]:
p75 = route_freq['pct_false'].quantile(0.75)
route_freq.query("pct_false >= @p75").sort_values("pct_false", ascending=False)

In [None]:
p25 = route_freq['pct_false'].quantile(0.25)
route_freq.query("pct_false >= @p25").sort_values("pct_false", ascending=True)

Let's check these routes against some headways.

In [None]:
stops.head()

Navigate to the agency's PDF timetables, identify a stop by name, and filter with that stop's name.

In [None]:
stops[stops['stop_name'] == "Culver City Transit Center"]

<p>I have checked and confirmed HQT for the following routes:
    <ol>
        <li><b>76</b> - the frequency of this bus alone is enough to make all of its stops qualifying.</li>
        <li><b>96</b> - this bus doesn't have sufficient frequency, but the other routes with which it shares stops enable the stops to meet the required frequency.</li>
        <li><b>110</b> - this bus doesn't have sufficient frequency, but the other routes with which it shares stops enable the stops to meet the required frequency.</li>
    </ol>
</p>
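The shared-stop cases above (routes 96 and 110) can be illustrated by pooling departures at a single stop. This is only a sketch: the departure times and route labels are made up, `mean_headway_minutes` is a hypothetical helper, and the 15-minute threshold is an assumption to be replaced with whatever headway definition the analysis actually uses.

```python
import pandas as pd

# Made-up peak-period departures at one stop for two hypothetical routes.
departures = pd.DataFrame({
    "route_short_name": ["X", "X", "X", "Y", "Y", "Y"],
    "departure": pd.to_datetime([
        "2020-01-01 07:00", "2020-01-01 07:30", "2020-01-01 08:00",
        "2020-01-01 07:15", "2020-01-01 07:45", "2020-01-01 08:15",
    ]),
})


def mean_headway_minutes(times):
    """Average gap, in minutes, between consecutive departures."""
    gaps = times.sort_values().diff().dropna()
    return gaps.dt.total_seconds().mean() / 60


# Headway of each route on its own, then of all routes pooled at the stop.
per_route = departures.groupby("route_short_name")["departure"].apply(mean_headway_minutes)
combined = mean_headway_minutes(departures["departure"])

THRESHOLD_MINUTES = 15  # assumed qualifying headway, not taken from this notebook
stop_qualifies = combined <= THRESHOLD_MINUTES

print(per_route)
print(combined, stop_qualifies)
```

In this toy case each route alone runs every 30 minutes, but the interleaved departures give the stop a combined 15-minute headway, which is the pattern described for routes 96 and 110.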