## 1. Setup

We will be working with data from Los Angeles Geohub: https://geohub.lacity.org/datasets/9b1bc4861f1e4277b6bd6e51f48e0f4d/explore. Let's import `pandas` and `plotly`, which we will use for data cleaning and creating visualizations, respectively.

In [2]:
import pandas as pd
from plotly import express as px
df = pd.read_csv("Dataframes/Metro_Stations.csv")
#The data is from the link above.

## 2. Visualizing the System's Coverage and Structure

This data was much nicer and smaller than the New York Subway's hourly ridership data, but this data is also considerably less informative. We were unable to find ridership data by station, only by line, which made creating nice graphics and visualizations with ridership data much more difficult. Below is a preview of our data:

In [3]:
df

Unnamed: 0,X,Y,OBJECTID,source,ext_id,cat1,cat2,cat3,org_name,Name,...,description,zip,link,use_type,latitude,longitude,date_updated,dis_status,POINT_X,POINT_Y
0,-118.192933,33.768076,72713,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Downtown Long Beach Station,...,Blue Line,,,publish,33.768076,-118.192933,2023/04/04 16:19:54+00,,33.768076,-118.192933
1,-118.193712,33.772263,72714,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Pacific Ave Station,...,Blue Line,,,publish,33.772263,-118.193712,2023/04/04 16:19:54+00,,33.772263,-118.193712
2,-118.189396,33.781835,72715,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Anaheim Street Station,...,Blue Line,,,publish,33.781835,-118.189396,2023/04/04 16:19:54+00,,33.781835,-118.189396
3,-118.189394,33.789095,72716,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Pacific Coast Hwy Station,...,Blue Line,,,publish,33.789095,-118.189394,2023/04/04 16:19:54+00,,33.789095,-118.189394
4,-118.189846,33.807084,72717,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Willow Street Station,...,Blue Line,,,publish,33.807084,-118.189846,2023/04/04 16:19:54+00,,33.807084,-118.189846
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125,-118.378703,33.945678,72838,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Aviation/Century Station,...,K Line,,,publish,33.945678,-118.378703,2023/04/04 16:19:54+00,,33.945678,-118.378703
126,-118.377271,33.929635,72839,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Aviation/LAX Station,...,K Line,,,publish,33.929635,-118.377271,2023/04/04 16:19:54+00,,33.929635,-118.377271
127,-118.251208,34.054751,72840,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Grand Av Arts/Bunker Hill,...,Regional Connector,,,publish,34.054751,-118.251208,2023/04/04 16:19:54+00,,34.054751,-118.251208
128,-118.246166,34.052039,72841,Metropolitan Transportation Authority (MTA),,Transportation,Metro Stations,,,Historic Broadway,...,Regional Connector,,,publish,34.052039,-118.246166,2023/04/04 16:19:54+00,,34.052039,-118.246166


The `description` column holds the line data. Some of it is inconsistent, so we clean up the naming, in particular for the Regional Connector, which is part of the A and E lines. We make two new columns: `Line`, which holds only one of the lines at a station, and `Lines`, which tracks all the lines at a station. This will help us when it comes to graphing without losing too much important information. We can then get rid of many of the extra columns that we don't need.
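To make the two columns concrete, here is a minimal sketch on toy rows. The station names below are made up for illustration, and the sketch assumes the description strings have already been normalized to forms like `Blue/Expo`.

```python
import pandas as pd

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "Name": ["Station X", "Station Y"],
    "description": ["Blue/Expo", "Red"],
})

# `Line` keeps only the first line listed; `Lines` keeps all of them as a list
toy["Line"] = toy["description"].str.split("/").str.get(0)
toy["Lines"] = toy["description"].str.split("/")

print(toy[["Name", "Line", "Lines"]])
```

A station served by two lines ends up with a single value in `Line` (handy for coloring a plot) and the full list in `Lines`.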

In [4]:
#keep only the first word of each description, dropping the redundant "Line" suffix
df['description'] = df['description'].str.split().str.get(0)

#clean up the line names in one vectorized step instead of assigning row by row
df['description'] = df['description'].replace({"Regional": "Blue/Expo",
                                               "Blue/EXPO": "Blue/Expo",
                                               "EXPO": "Expo"})

#"Line" holds the first line at each station, "Lines" holds all of them as a list
df["Line"] = df["description"].str.split('/').str.get(0)
df["Lines"] = df["description"].str.split('/')

#we don't need many of these columns anymore
cols = ['OBJECTID', 'post_id', 'latitude', 'longitude', 'Name', 'Lines', 'Line']
df = df[cols]


Let's send this to a new csv file.

In [5]:
df.to_csv('Dataframes/Clean_Metro_Stations.csv')

From here, the station IDs are jumbled and unordered. It's a bit difficult to sort these IDs computationally due to the nature of the system and the way it has expanded over time, as lines have been merged since the construction of the regional connector and new stations were added to the center of the A and E lines with non-sequential station IDs. So unlike with New York, where this was unfeasible, we manually sorted through the columns, putting the stations on each line in consecutive order. We also made a few minor edits to the station names to maintain the consistency of the data with the other dataset we planned to merge with this one, namely changing "St" to "Street" for a few stations. While editing station names is computationally possible and feasible, it was much more efficient to find inconsistencies by testing merges between the 2 datasets and fixing the inconsistencies or missing data found from merging. 
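The trial-merge check described above can be sketched roughly as follows. The two small frames are stand-ins for the real datasets, and an outer merge with `indicator=True` is just one way to surface mismatched names; it is not necessarily the exact procedure that was followed.

```python
import pandas as pd

# Small stand-ins for the two datasets, with a deliberate "St" vs "Street" mismatch
stations = pd.DataFrame({"Name": ["Anaheim St Station", "Willow Street Station"]})
stops = pd.DataFrame({"Name": ["Anaheim Street Station", "Willow Street Station"],
                      "Station ID": [80105, 80107]})

# An outer merge keeps unmatched rows from both sides; `_merge` records where each row came from
check = pd.merge(stations, stops, on="Name", how="outer", indicator=True)

# Rows that are not "both" are names that failed to match and need fixing by hand
mismatches = check[check["_merge"] != "both"]
print(mismatches[["Name", "_merge"]])
```

Each mismatched name shows up twice, once as `left_only` and once as `right_only`, which makes the spelling difference easy to spot.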

It is important to note that we are not changing or deleting any data here, but the order of the rows will be important for our `line_mapbox` later. Let's read in the edited data.

In [6]:
df = pd.read_csv("Dataframes/Clean_Metro_Stations_Manual_Clean.csv")

Now, we are ready to create visualizations! Let's start with a rough system map showing the locations of all the stations and what line they are on:

In [7]:
#station map, lines not connected, The G-Line (BRT, not Rail) is shown as well
fig = px.scatter_mapbox(df, 
                        lat = "Latitude",
                        lon = "Longitude",
                        color = "Line",
                        hover_name = "Name",
                        zoom = 8.8,
                        height = 800,
                        width = 800,
                        title = "Graph Representation of LA Metro Station Locations by Line",
                        mapbox_style = "carto-positron")

fig.show()

Here's one with the lines connected. Initially, we wanted to weight the lines based on ridership, but this was not possible with `plotly.express.line_mapbox`. Again, for this graph in particular, the order of the rows determines the order in which the line segments are drawn, so keeping the stations in order prevents the plot from drawing lines all across the map.

In [8]:
#line map, stations not shown
fig = px.line_mapbox(df, 
                        lat = "Latitude",
                        lon = "Longitude",
                        color = "Line",
                        hover_name = "Name",
                        line_group='Line',
                        zoom = 8.8,
                        height = 800,
                        width = 800,
                        title = "Approximate Graph Representation of LA Metro Lines",
                        mapbox_style = "carto-positron")

fig.show()
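Because the points are connected in row order within each line, one way to guarantee a sensible path is to sort on an explicit sequence column before plotting. The `Stop Sequence` column below is hypothetical (the data here relies on manually ordered rows instead), and the coordinates are toy values.

```python
import pandas as pd

# Toy data: rows deliberately out of order; "Stop Sequence" is a hypothetical column
toy = pd.DataFrame({
    "Name": ["Stop 3", "Stop 1", "Stop 2"],
    "Line": ["A", "A", "A"],
    "Stop Sequence": [3, 1, 2],
    "Latitude": [33.79, 33.77, 33.78],
    "Longitude": [-118.19, -118.19, -118.19],
})

# Sort by line, then by position along the line, so consecutive rows are adjacent stops
ordered = toy.sort_values(["Line", "Stop Sequence"]).reset_index(drop=True)
print(ordered["Name"].tolist())
```

Passing `ordered` rather than `toy` to the plotting call would then trace each line stop by stop.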

Here is a plot estimating the distance from/to the nearest station. This plot is more of a check for accessibility and reach, highlighting areas that are reasonably close to a station for convenient transportation.

In [9]:
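#density mapbox: highlights areas within a certain radius of a station, in this case roughly 1 mile.
#note that `radius` is measured in pixels, so the 1-mile figure is only approximate and depends on the zoom level.
#the higher the density value, the closer the straight-line distance to a metro station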
fig = px.density_mapbox(df,
                        lat='Latitude',
                        lon='Longitude',
                        radius=15,
                        opacity=0.3,
                        hover_name = "Name",
                        zoom = 8.8,
                        height = 800,
                        width = 800,
                        title = "Areas Within Approximately 1 Mile of a LA Metro Station",
                        mapbox_style = "carto-positron")
fig.show()
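The density plot is only a visual approximation of distance. For an actual straight-line figure, a haversine calculation is one option; the sketch below uses two stations from the preview table and an arbitrary query point. The function names are hypothetical helpers, not part of the notebook.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    earth_radius_miles = 3958.8  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

def nearest_station_miles(lat, lon, stations):
    """Return (distance, name) for the closest station in a list of (name, lat, lon) tuples."""
    return min((haversine_miles(lat, lon, s_lat, s_lon), name) for name, s_lat, s_lon in stations)

# Two stations from the preview table above
stations = [
    ("Downtown Long Beach Station", 33.768076, -118.192933),
    ("Pacific Ave Station", 33.772263, -118.193712),
]

# An arbitrary query point near Long Beach, chosen for illustration
distance, name = nearest_station_miles(33.7700, -118.1900, stations)
print(f"{name}: {distance:.2f} miles")
```

Running this over a grid of points and keeping those under 1 mile would give a zoom-independent version of the coverage check.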

From these plots, we can see that the LA Metro system has a very radial layout; many of the lines converge and interchange in downtown LA. Intuitively, travellers going from one end of the system to another would have to pass through that area, making it appear to be a choke point of sorts. The first map also reveals the wide stop spacing on many of the system's lines. Furthermore, although the system spans a huge geographical area, many parts of LA are not reached by it, due to the region's wide sprawl and other challenges; the last map in particular reveals that huge swaths of LA County, such as Whittier, Torrance, and San Fernando, are not currently served by the system.

Below, we import the March 2023 average weekday ridership data collected by the data scraper. The B (Red) and D (Purple) Line data are merged together, as this is how the data was stored on the website. This is likely because they share tracks for most of their length, with the D Line currently being only a short 2-station spur off the B Line. Also, the L (Gold) Line was only recently split and merged into the A (Blue) and E (Expo) Lines as part of the Regional Connector Project, so the A and E Line data do not include the L Line ridership: without station-by-station ridership data, we could not accurately allocate portions of L Line ridership to the extended A and E Lines based on the realigned system.

In [10]:
#dataframes/csv files obtained from the scraper
Adf = pd.read_csv("Dataframes/gvRailBlueMarch2023.csv")
BDdf = pd.read_csv("Dataframes/gvRailRedMarch2023.csv")
Cdf = pd.read_csv("Dataframes/gvRailGreenMarch2023.csv")
Edf = pd.read_csv("Dataframes/gvRailExpoMarch2023.csv")

We were unable to get station-by-station ridership data, and the only thing we were able to get was line-by-line data. As a result, we can take the ridership number from any of the rows to be our line ridership.

In [11]:
#for each line, convert the ridership value to an integer that we can insert into our main dataframe in a new column.
ARidership = int(Adf["MAR 2023"][0].replace(',', ''))
BDRidership = int(BDdf["MAR 2023"][0].replace(',', ''))
CRidership = int(Cdf["MAR 2023"][0].replace(',', ''))
ERidership = int(Edf["MAR 2023"][0].replace(',', ''))
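The ridership figures are stored as text with a thousands separator, which is why the cell above strips the comma before converting. Here is a small sketch of that conversion, alongside the `thousands` option of `pd.read_csv` as an alternative. The CSV text is a hypothetical stand-in; the real scraped files may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical CSV text mimicking a scraped file; the real layout is an assumption
csv_text = 'Line,MAR 2023\nA,"31,752"\n'

# Option 1: read the value as text, then strip the thousands separator (as in the cell above)
raw = pd.read_csv(io.StringIO(csv_text))
ridership = int(raw["MAR 2023"][0].replace(",", ""))

# Option 2: let pandas handle the thousands separator while reading
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")
ridership_parsed = int(parsed["MAR 2023"][0])

print(ridership, ridership_parsed)
```

Both routes give the same integer; the second avoids repeating the string cleanup for every file.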

For lines that we don't have data on (such as the K line), we will set the ridership to 1 just so the lines still appear on the map. For the other values where such data exists, we will attach the ridership value to the stations on that line.

In [12]:
#create a new column and set all row values in that column equal to 1
df["Average Weekday Ridership"] = 1

In [13]:
#depending on line, set the average weekday ridership to the values we extracted above.
#.loc with a boolean mask assigns directly on df and avoids chained indexing
df.loc[df["Line"] == "A", "Average Weekday Ridership"] = ARidership
df.loc[df["Line"].isin(["B", "D"]), "Average Weekday Ridership"] = BDRidership
df.loc[df["Line"] == "C", "Average Weekday Ridership"] = CRidership
df.loc[df["Line"] == "E", "Average Weekday Ridership"] = ERidership


The B and D line ridership is the highest value, so we scale each line's weight as a fraction of this value. Then, we multiply by a constant so that the resulting marker sizes are proportional to each line's ridership relative to the busiest line, while still being large enough to show up clearly on the plotly scatter mapbox plot.

In [14]:
#ridership plot node size formula
#df["Rough Line Weight"] is the column we are creating with the adjusted weights
#df["Average Weekday Ridership"] is the column storing the line ridership data for each station
#BDRidership is the ridership of the line with the highest ridership, the B and D lines. Their data was combined together as they essentially operate as the same line.
df["Rough Line Weight"] = (df["Average Weekday Ridership"] / BDRidership) * 25
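As a quick check of the formula, the sketch below applies it to the ridership values that appear in the tables in this section (the C line is omitted because its value is not shown there). The value 1 is the placeholder used for lines without data.

```python
# Ridership values as they appear in this section's tables; 1 is the placeholder for missing data
bd_ridership = 81162
ridership_by_line = {"A": 31752, "B/D": 81162, "E": 28174, "K (placeholder)": 1}

# Same formula as the cell above: fraction of the busiest line's ridership, scaled to a maximum of 25
weights = {line: value / bd_ridership * 25 for line, value in ridership_by_line.items()}

for line, weight in weights.items():
    print(f"{line}: {weight:.6f}")
```

The busiest line gets the full weight of 25, and the placeholder lines end up with a weight close to zero, so they are drawn as very small markers.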

Let's preview the data again to check how it currently looks.

In [15]:
df

Unnamed: 0.1,Unnamed: 0,OBJECTID,post_id,Latitude,Longitude,Name,Lines,Line,Average Weekday Ridership,Rough Line Weight
0,31.0,72744,3306158,33.773603,-118.189424,5th Street Station,['Blue'],A,31752,9.780439
1,,72743,3306157,33.768745,-118.189374,1st Street Station,['Blue'],A,31752,9.780439
2,0.0,72713,3306127,33.768076,-118.192933,Downtown Long Beach Station,['Blue'],A,31752,9.780439
3,1.0,72714,3306128,33.772263,-118.193712,Pacific Ave Station,['Blue'],A,31752,9.780439
4,2.0,72715,3306129,33.781835,-118.189396,Anaheim Street Station,['Blue'],A,31752,9.780439
...,...,...,...,...,...,...,...,...,...,...
135,122.0,72835,3306249,33.967299,-118.351393,Downtown Inglewood Station,['K'],K,1,0.000308
136,123.0,72836,3306250,33.962021,-118.374455,Westchester/Veterans Station,['K'],K,1,0.000308
137,124.0,72837,3306251,33.949625,-118.378663,LAX/Metro Transit Center Station,['K'],K,1,0.000308
138,125.0,72838,3306252,33.945678,-118.378703,Aviation/Century Station,['K'],K,1,0.000308


We can now plot the stations based on ridership weight. Again, a line plot would have been ideal for this, but plotly's line_mapbox does not allow for changing the width of the lines, which is pretty unfortunate. For this graph, the G and K lines are minimized as we do not have adequate ridership data for either, but they still appear on the map for the sake of marking their existence.

In [16]:
#disconnected station map with line/ridership weights
fig = px.scatter_mapbox(df, 
                        lat = "Latitude",
                        lon = "Longitude",
                        color = "Line",
                        size = "Rough Line Weight",
                        hover_name = "Name",
                        zoom = 8.8,
                        height = 800,
                        width = 800,
                        title = "Graph Representation of LA Metro Station Locations by Line (With Ridership Weights)",
                        mapbox_style = "carto-positron")

fig.show()

From this, we can see that the B and D lines have the highest ridership, followed by the A and E lines. The B and D lines travel through the high-density areas of Hollywood, Koreatown, and Downtown, and serve a vital corridor, especially for lower-income commuters in the area. The C line is comparatively low in ridership; while it serves the vicinity of LAX, it also completely avoids downtown. There was inadequate data on the K line to perform any sort of analysis, and the L line had been discontinued with the opening of the Regional Connector.

Let's now read in the other dataframe, which has the station IDs we want to use for forming the network with NetworkX. The station name changes from earlier come into play here: the 2 sets of data have different and unrelated station IDs, so we will merge by name instead. Unfortunately, this dataset is a bit older and does not contain certain stations on the newly opened Regional Connector. It also ignores the G line, as the G line is a bus line rather than rail (though there are plans to convert it to rail in the future!).

In [17]:
#read in a dataframe containing the station IDs we want to use.
stops = pd.read_csv("Dataframes/stops_manual.csv")
stops

Unnamed: 0.1,Unnamed: 0,stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,stop_url,location_type,parent_station,tpis_name
0,0,80101,80101,Downtown Long Beach Station,,33.768071,-118.192921,,0,80101S,Long Bch
1,4,80102,80102,Pacific Ave Station,,33.772258,-118.193700,,0,80102S,Pacific
2,8,80105,80105,Anaheim Street Station,,33.781830,-118.189384,,0,80105S,Anaheim
3,12,80106,80106,Pacific Coast Hwy Station,,33.789090,-118.189382,,0,80106S,PCH
4,15,80107,80107,Willow Street Station,,33.807079,-118.189834,,0,80107S,Willow
...,...,...,...,...,...,...,...,...,...,...,...
99,403,80705,80705,Fairview Heights Station,,33.975252,-118.336072,,0,80705S,Fairview
100,407,80706,80706,Hyde Park Station,,33.988187,-118.330816,,0,80706S,Hyde Park
101,411,80707,80707,Leimert Park Station,,34.003909,-118.332016,,0,80707S,Leimert
102,415,80708,80708,Martin Luther King Jr Station,,34.009563,-118.335359,,0,80708S,MLK


We only need the station IDs and the station names for merging, so we can get rid of the excess columns.

In [28]:
#adjust the columns, get rid of redundant info
stops = stops.rename(columns={"stop_name": "Name", "stop_id": "Station ID"})
cols = ["Station ID", "Name"]
stops = stops[cols]

Now, we merge the data into a single dataframe.

In [19]:
#merge the 2 dataframes together
df = pd.merge(df, stops, on="Name")
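One caveat of merging on `Name`: if a name appears more than once on either side, the merge produces every pairing of the matching rows. The sketch below uses two toy frames (only the station name and the two IDs come from the tables in this section) to show how rows can multiply.

```python
import pandas as pd

# Toy frames: one station name appears on two lines and under two stop IDs
stations = pd.DataFrame({
    "Name": ["7th Street / Metro Center Station"] * 2,
    "Line": ["A", "B"],
})
stops = pd.DataFrame({
    "Name": ["7th Street / Metro Center Station"] * 2,
    "Station ID": [80122, 80211],
})

# Merging on a non-unique key pairs every left row with every right row that shares the name
merged = pd.merge(stations, stops, on="Name")
print(len(merged))
print(merged)
```

Two rows on each side become four rows after the merge. Passing `validate="many_to_one"` to `pd.merge` would raise an error instead whenever the right-hand key is not unique, which can help catch this early.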

Again, let's get rid of unnecessary columns. We don't need the old set of station IDs anymore.

In [20]:
#keep only the columns of data that we need, remove the old station ids to avoid confusion
cols = ["Station ID", "Name", "Latitude", "Longitude", "Lines", "Line", "Average Weekday Ridership", "Rough Line Weight"]
df = df[cols]
df

Unnamed: 0,Station ID,Name,Latitude,Longitude,Lines,Line,Average Weekday Ridership,Rough Line Weight
0,80154,5th Street Station,33.773603,-118.189424,['Blue'],A,31752,9.780439
1,80153,1st Street Station,33.768745,-118.189374,['Blue'],A,31752,9.780439
2,80101,Downtown Long Beach Station,33.768076,-118.192933,['Blue'],A,31752,9.780439
3,80102,Pacific Ave Station,33.772263,-118.193712,['Blue'],A,31752,9.780439
4,80105,Anaheim Street Station,33.781835,-118.189396,['Blue'],A,31752,9.780439
...,...,...,...,...,...,...,...,...
109,80708,Martin Luther King Jr Station,34.010167,-118.335347,['K'],K,1,0.000308
110,80707,Leimert Park Station,34.004582,-118.332665,['K'],K,1,0.000308
111,80706,Hyde Park Station,33.988290,-118.330836,['K'],K,1,0.000308
112,80705,Fairview Heights Station,33.975284,-118.336030,['K'],K,1,0.000308


We can create the same ridership-weighted visualization as above using this new dataset.

In [21]:
#disconnected station map with line/ridership weights
fig = px.scatter_mapbox(df, 
                        lat = "Latitude",
                        lon = "Longitude",
                        color = "Line",
                        size = "Rough Line Weight",
                        hover_name = "Name",
                        zoom = 8.8,
                        height = 800,
                        width = 800,
                        title = "Graph Representation of LA Metro Station Locations by Line (With Ridership Weights)",
                        mapbox_style = "carto-positron")

fig.show()

Finally, let's read in the stations ranked in the top 20 by the RLP formula and merge the dataframes again.

In [22]:
#read in a dataframe with the important stations ranked by the formula
ranks = pd.read_csv("Dataframes/Stations by Rank.csv")
ranks

Unnamed: 0.1,Unnamed: 0,Station ID,Station Rank
0,0,80122,1
1,1,80124,2
2,2,80213,3
3,3,80215,4
4,4,80409,5
5,5,80120,6
6,6,80126,7
7,7,80214,8
8,8,80121,9
9,9,80123,10


In [23]:
#merge ranks into df on the station IDs
df = pd.merge(df, ranks, on="Station ID")

Now, let's take a look at our final data.

In [24]:
df

Unnamed: 0.1,Station ID,Name,Latitude,Longitude,Lines,Line,Average Weekday Ridership,Rough Line Weight,Unnamed: 0,Station Rank
0,80118,Washington Station,34.019655,-118.243096,['Blue'],A,31752,9.780439,17,18
1,80119,San Pedro Street Station,34.026812,-118.255517,['Blue'],A,31752,9.780439,13,14
2,80120,Grand / LATTC Station,34.03316,-118.269345,['Blue'],A,31752,9.780439,5,6
3,80121,Pico Station,34.04074,-118.26613,"['Blue', 'Expo']",A,31752,9.780439,8,9
4,80121,Pico Station,34.04074,-118.26613,"['Blue', 'Expo']",E,28174,8.678322,8,9
5,80122,7th Street / Metro Center Station,34.048615,-118.258834,"['Blue', 'Expo']",A,31752,9.780439,0,1
6,80122,7th Street / Metro Center Station,34.048639,-118.258694,"['Red', 'Purple']",B,81162,25.0,0,1
7,80122,7th Street / Metro Center Station,34.048639,-118.258694,"['Red', 'Purple']",D,81162,25.0,0,1
8,80122,7th Street / Metro Center Station,34.048615,-118.258834,"['Blue', 'Expo']",E,28174,8.678322,0,1
9,80211,7th Street / Metro Center Station,34.048615,-118.258834,"['Blue', 'Expo']",A,31752,9.780439,11,12


It's worth noting that some stations have multiple ranks as each of the platforms is ranked individually. 7th Street-Metro Center's A/E Line platform ranks #1 while the B/D line platform ranks #12. Interestingly enough, while the B and D lines have higher ridership than the combined ridership of the A and E lines, disruption at the A/E Line Platform is more impactful than disruption on the B/D Line Platform.

Let's do a similar thing as before. We can determine the relative size of the points by rank, but the ranks run from 1 (the highest) to 20 (the lowest), whereas we want the highest-ranked station to be the largest point on the graph. So we subtract the rank from a constant, and we also make the size discrepancies more apparent by doubling the value before trimming everything down to a reasonable size.

In [25]:
#rank plot node size formula
#df["Station Rank Plot Size"] is the column we are creating with the adjusted weights
#df["Station Rank"] is the column storing the RLP rank for the top 20 stations
df["Station Rank Plot Size"] = (25 - df["Station Rank"]) * 2 - 5
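To see what the formula does across the whole ranking, this short sketch applies it to ranks 1 through 20. Algebraically it simplifies to `45 - 2 * rank`, so rank 1 maps to a size of 43 and rank 20 to a size of 5.

```python
import pandas as pd

# Ranks 1 through 20, matching the top-20 ranking used here
ranks = pd.Series(range(1, 21), name="Station Rank")

# Same formula as the cell above; algebraically this is 45 - 2 * rank
sizes = (25 - ranks) * 2 - 5

print(sizes.iloc[0], sizes.iloc[-1])
```

Every step down in rank shrinks the marker by 2, and the smallest marker stays positive, so all 20 stations remain visible.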

Now, let's plot these top-ranked stations and see the result!

In [26]:
#Plots only the most important stations in the system, as ranked by RLP
fig = px.scatter_mapbox(df, 
                        lat = "Latitude",
                        lon = "Longitude",
                        color = "Line",
                        size = "Station Rank Plot Size",
                        hover_name = "Name",
                        zoom = 8.8,
                        height = 800,
                        width = 800,
                        opacity = 0.5,
                        title = "Important LA Metro Stations, as Ranked by the Formula",
                        mapbox_style = "carto-positron")
fig.show()

Everything appears to be centered in downtown, which may or may not be surprising considering the radial layout of the system, and all of the top-ranked stations are on the A, B, D, or E lines. Let's zoom in on downtown.

In [27]:
#Same plot of the most important stations as ranked by RLP, zoomed in on downtown
fig = px.scatter_mapbox(df, 
                        lat = "Latitude",
                        lon = "Longitude",
                        color = "Line",
                        size = "Station Rank Plot Size",
                        hover_name = "Name",
                        zoom = 11.9,
                        height = 800,
                        width = 800,
                        opacity = 0.5,
                        title = "Important LA Metro Stations, as Ranked by the Formula",
                        mapbox_style = "carto-positron")
fig.show()

The important interchange stations on this map are 7th Street-Metro Center, Union Station, and Pico Station. However, while the A/E Line platforms of 7th Street-Metro Center hold the highest rank, Union Station and Pico Station rank no higher than 5 and 9, respectively, despite Union Station being a major transit hub in Downtown and Pico Station being the location where the A/E Line concurrency splits. Wilshire/Vermont, a major transfer station where the B and D lines separate, doesn't even make the top 20. The remainder of the top 5 (Jefferson/USC, Civic Center/Grand Park, and Wilshire/Normandie) are not major transfer stations: Jefferson/USC serves USC, Civic Center/Grand Park is near City Hall and a number of points of interest in downtown, and Wilshire/Normandie is in the heart of Koreatown. So, perhaps surprisingly and contrary to common expectations, major transfer stations are not necessarily the points where a disruption causes the most impact or damage to the system.