# 1. Overview

Here is the git repo link: https://github.com/sreejeetsreenivasan/PIC16B-Project

The problem we want to solve is understanding and managing the phenomenon of delayed stops: places where ridership is already high and any delay causes large disruptions. We focus on the LA Metro and investigate which delay source stations are most likely to have a large impact on the stations we care about through the propagation of delays from that source. This is important because delays disrupt passenger schedules, degrade overall network performance, and have persistent negative effects throughout the transportation system. Timely and reliable transit service is essential for fostering sustainable urban mobility, reducing congestion in the inner city, increasing economic output, and promoting public transportation as a viable alternative.

We began with large-scale data collection, scraping, and cleaning, because we planned to work with other transit systems in addition to that of Los Angeles, including test examples with New York’s and Boston’s rapid transit systems. The project was initially broader, but as we ran into constraints along the way, such as the lack of suitable data for certain transit systems, we narrowed it to focus on LA’s own rail system. We initially expected that using either Scrapy or BeautifulSoup to parse the data directly from the website would be the most efficient way to retrieve the information. However, we found that the URL of the ridership page did not change when we submitted a new month and year for which we wanted data. As a result, we used Selenium to drive the form and work around this.

Taking the data we obtained through GTFS and the data we scraped, we use the NetworkX package to create graph objects of our rail network. With the stations as nodes and the routes as edges, we can perform mathematical analysis on the network. After some simple analysis, we realized that we did not have enough tools at our disposal to answer our questions, so we looked at the methods other data scientists have employed. We narrowed our research down to 3 papers and, using ideas from all three, improved on the method we had encountered earlier in our exploration. Our analysis culminated in our implementation of the reverse localized pathing (RLP) method. Using the results from our function, we could then visualize the findings.



# 2. First Technical Component: Sreejeet Scraping

We used tools that we had learned both during lecture and discussion to complete this task. Our approach followed a few key steps:

1. Using the "developer tools" window in the Chrome browser, we found the exact table and id tags needed to access each tabular group on the website.
2. We recognized that BeautifulSoup would be the most useful tool for scraping this information. We considered Scrapy at first, but BeautifulSoup got the job done without some of the implementation complexity we encountered with Scrapy.
3. We used the "webdriver" class from the Python Selenium module to fill in the "month" and "year" dropdowns. We needed this module because the page is interactive (i.e. the data loads without changing the URL), and the Requests module was not particularly helpful, even when sending POST request data.



In [1]:
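# selenium_driver implementation
import time
from collections import defaultdict

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By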

def selenium_driver(month: str, year: str, list_of_ids: list):
    """
    This method takes in a month and year, and uses them as input to submit a form on the LA Metro ridership website.
    It returns a list of dataframes (one per table id) containing the ridership information, as well as each
    line's proportion of systemwide boardings by year when possible.
    On invalid input or missing data it returns an error message string instead.

    The LA Metro website returns data for three years ending with the year passed in (e.g. "2023" will provide
    information for 2022 and 2021 as well)
    """
    # Preliminary input checking
    # Reject the input if either argument is not a string
    if not isinstance(month, str) or not isinstance(year, str):
        return "ValueError: Month and year must be strings"
    if int(year) >= 2024 or int(year) < 2009:
        return "Year must be between 2009 and 2023"
    if int(year) == 2023:
        if month == "November" or month == "December":
            return "Month must be October or earlier for the year 2023"

    driver = webdriver.Chrome()
    driver.get("https://isotp.metro.net/MetroRidership/YearOverYear.aspx")

    # Month element
    month_element_tag = driver.find_element(By.XPATH, "//select[@name='ctl00$ContentPlaceHolder1$ddlPeriod']")
    month_options = month_element_tag.find_elements(By.TAG_NAME, 'option')

    for option in month_options:
        month_val = option.text
        if month_val == month:
            option.click()
            break
   
    # Year element
    year_element_tag = driver.find_element(By.XPATH, "//select[@name='ctl00$ContentPlaceHolder1$ddlYear']")
    year_options = year_element_tag.find_elements(By.TAG_NAME, 'option')

    for option in year_options:
        year_val = option.text
        if year_val == year:
            option.click()
            break
   
    # Navigate to submit button
    driver.find_element(By.ID, "ContentPlaceHolder1_btnSubmit").click()

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    rider_dataframes = parse_ridership(list_of_ids, soup)

    # Now rider dataframe has information about the ridership by line
    # We can now use Pandas to manipulate this data and assign weights

    # This gets just the total_ridership row, which is in the first dataframe in the rider_dataframe list
    try:
        sys_ridership = rider_dataframes[0].loc["Total Boardings"].str.replace(",", "").astype(int)
    except ValueError:
        driver.quit()  # close the browser before returning early
        return f"No ridership data for {month} {year}"
    # Now we have total ridership for the month for 3 years (i.e. 2023-2021)
    # We want to consider the proportions for each of the other dataframes, and add this as another row

    for df_index in range(1, len(rider_dataframes)):
        try:
            rider_dataframes[df_index].loc['Proportion of Total Boardings'] = rider_dataframes[df_index].loc['Total Boardings'].str.replace(",", "").astype(int) / sys_ridership
        except ValueError:
            # Skip lines whose boardings cannot be parsed as integers (e.g. missing data)
            pass
    time.sleep(3)
    driver.quit()
    return rider_dataframes

### Overview of "selenium_driver" method

The driver takes in 3 parameters: a month, a year, and a list of ids. The type hints are specified in the method; the list_of_ids parameter is the particular set of table ids found using the developer tools tab in Chrome. In our case, list_of_ids contains:

- "ContentPlaceHolder1_rpRailRidership_rpRailSystemwide_gvRailSYS"
- "ContentPlaceHolder1_rpRailRidership_rpRailRed_gvRailRed"
- "ContentPlaceHolder1_rpRailRidership_rpRailBlue_gvRailBlue"
- "ContentPlaceHolder1_rpRailRidership_rpRailExpo_gvRailExpo"
- "ContentPlaceHolder1_rpRailRidership_rpRailGreen_gvRailGreen"
- "ContentPlaceHolder1_rpRailRidership_rpRailGold_gvRailGold"

We first check that our input is valid (i.e. the month and year inputs are strings, the year is between 2009 and 2023, and the month is not November or December for the year 2023). Then we create the Selenium webdriver via the webdriver.Chrome() command and point it at the ridership page URL.
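The function above reports invalid input by returning message strings, which a caller can easily mistake for data. One alternative, shown here as a minimal sketch, is to raise exceptions instead. The helper name `validate_period`, the `VALID_MONTHS` tuple, and the assumption that the dropdown uses full English month names are ours for illustration; the year bounds mirror the checks described above.

```python
# Hypothetical helper: raises instead of returning error strings.
VALID_MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def validate_period(month, year):
    """Raise TypeError or ValueError if the month/year pair cannot be requested."""
    if not isinstance(month, str) or not isinstance(year, str):
        raise TypeError("month and year must both be strings")
    # Assumption: the dropdown lists full English month names.
    if month not in VALID_MONTHS:
        raise ValueError(f"unknown month: {month!r}")
    if not year.isdigit():
        raise ValueError(f"year must contain only digits: {year!r}")
    year_number = int(year)
    if not 2009 <= year_number <= 2023:
        raise ValueError("year must be between 2009 and 2023")
    if year_number == 2023 and month in ("November", "December"):
        raise ValueError("month must be October or earlier for the year 2023")


validate_period("October", "2023")  # valid input passes silently
```

A caller could then wrap the scraping call in a `try`/`except` block and decide how to report the problem, rather than checking whether the return value is a string.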

Once there, we need to select the particular month and year we passed in. We locate the two dropdowns via XPath using the webdriver's "find_element" method, then retrieve every option each dropdown offers via "find_elements". Since this returns all the possible values our "Month" and "Year" boxes can take on, we call the ".click()" method only on the option whose text matches the month/year passed in by the user.

Once this is done, we click the "submit" button via the same "find_element" method for the webdriver. Selenium now loads the new page with the ridership information for that specific month/year, which means we can now use BeautifulSoup to parse this data. At this point, using this soup object, we can call the helper "parse_ridership" method.

In [2]:
# parse_ridership method

def parse_ridership(list_of_ids, soup):
    """
    This method takes in a list of the ids corresponding to the ridership of the rail systems on the LA Metro,
    and parses them via BeautifulSoup
    """
    dataframes_by_line = []
    for table_id in list_of_ids:
        # Look up the <table> element whose id attribute matches table_id
        rail_ridership = soup.find("table", id=table_id)
        rows = defaultdict(list)

        # First row is the list of dates
        first_row = rail_ridership.find('tr')
        dates = first_row.find_all('th')
        headers = ["Boarding Category"]
        headers += [date.text for date in dates if date.text != ""]

        for row in rail_ridership.find_all('tr'):
            for index, tag in enumerate(row.find_all('td')):
                rows[headers[index]].append(tag.text)
        df = pd.DataFrame(rows)
        df.set_index("Boarding Category", drop=True, inplace=True)
        dataframes_by_line.append(df)
    return dataframes_by_line

### Overview of "parse_ridership" method

Above we have the implementation of the "parse_ridership" method. It takes in the list_of_ids we passed into the "selenium_driver" method, as well as the soup object created in that method.

We first create a list in which to store the dataframes we build with BeautifulSoup from the page loaded by the Selenium webdriver. For each id passed in, we find the table with that id on the webpage and store it in the rail_ridership object. The first row of the table holds the dates, which become the column headers after a leading "Boarding Category" column. Then, for each remaining row, we append each cell's text (the boarding category, i.e. weekday, holiday, total, etc., followed by the value for each date) to the matching column list in the dict named "rows" defined earlier.

By storing our information in this manner, we can simply convert the dict into a Pandas dataframe for analysis.
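The column-by-column storage described above can be sketched without a browser or a parser. The cell text below is made up for illustration (including the labels, the numbers, and the assumption that the first header cell is empty); only the accumulation pattern and the comma stripping follow the code above.

```python
from collections import defaultdict

# Hypothetical cell text for one table; the labels and numbers are made up.
header_cells = ["", "Oct 2023", "Oct 2022", "Oct 2021"]
body_rows = [
    ["Average Weekday Boardings", "1,000", "900", "800"],
    ["Total Boardings", "30,000", "27,000", "24,000"],
]

# The empty header cell sits above the category column, so a fixed name is used for it.
headers = ["Boarding Category"] + [text for text in header_cells if text != ""]

# Accumulate values column by column: rows[column_name] -> list of cell texts.
rows = defaultdict(list)
for cells in body_rows:
    for index, text in enumerate(cells):
        rows[headers[index]].append(text)


def to_int(text):
    """Convert a string such as "30,000" to the integer 30000."""
    return int(text.replace(",", ""))


# Pull out the "Total Boardings" row as integers, keyed by period.
total_position = rows["Boarding Category"].index("Total Boardings")
totals_by_period = {
    column: to_int(values[total_position])
    for column, values in rows.items()
    if column != "Boarding Category"
}
```

Because every key in `rows` maps to a list of equal length, the dict has the shape that `pd.DataFrame(rows)` expects, with one column per key.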

### Running the script 

![Scraping Output](images/ScrapingOutput.png)

# 3. Second Technical Component: NetworkX and Math

(For the sake of legibility, most code is omitted. Look into “BowenMain.ipynb” and “BowenProjectFunctions.py” in the Project_Main directory for in-depth details behind the code)

After collecting the GTFS data, scraping, and finally cleaning the data, we can start working with it. We use the NetworkX package to create graph objects and run our calculations on them. For now, we create a simple graph of our rail network, which required hard-coding some of the connections. From a simple analysis, we can calculate the modularity of our rail system and the centralities of our nodes (here we use the Katz centrality measure). We found that the modularity of the rail network is about 0.7, which indicates that the rail routes do not intersect with each other much. We can also visualize the most “central” stations in our network using the Katz centrality values.
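To make the two measures concrete, here is a minimal sketch on a toy network rather than the real LA Metro graph. The station names, the two-line layout, and the choice of greedy modularity communities as the partition are all assumptions for illustration; our actual graph construction lives in the project notebook.

```python
import networkx as nx
from networkx.algorithms import community

# Hypothetical toy network: two lines that share one transfer station "T".
line_a = ["A1", "A2", "T", "A3", "A4"]
line_b = ["B1", "B2", "T", "B3", "B4"]

G = nx.Graph()
nx.add_path(G, line_a)  # consecutive stations on a line are joined by an edge
nx.add_path(G, line_b)

# Katz centrality: each node's score depends on the scores of its neighbors,
# attenuated by alpha for every extra step.
centrality = nx.katz_centrality(G, alpha=0.1, beta=1.0)
most_central = max(centrality, key=centrality.get)

# Modularity needs a partition of the nodes; here one is found greedily.
communities = community.greedy_modularity_communities(G)
q = community.modularity(G, communities)

print(most_central, round(q, 3))
```

On this toy graph the transfer station comes out as the most central node, which matches the intuition that Katz centrality rewards well-connected stations.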

From here, we wanted to find the most influential spreaders with a recursive formula over target nodes. First, we must decide which nodes are the most important to us. We created two lists of target nodes: the first contains the stations with the highest ridership, and the second contains the stations most important for concerts, large venues, and sports games. We looked at 3 papers that aided the creation of our method:
- https://www.nature.com/articles/srep38865
- https://www.sciencedirect.com/science/article/pii/S0167278923000313?via%3Dihub
- https://www.frontiersin.org/articles/10.3389/fphy.2021.806259/full

Our readings culminated in our implementation of the reverse localized pathing (RLP) method. The math behind it can be simplified, and here is a snippet of the main mathematical function used:

In [None]:
import numpy as np


def rlp(f, adjacency, epsilon, max_l=3):
    """
    Implementation of the RLP algorithm in Python

    :param f: 1xN vector, components corresponding to target nodes are 1 and 0 otherwise
    :param adjacency: NxN adjacency matrix of our network
    :param epsilon: tunable parameter controlling weight of the paths with different lengths
    :param max_l: maximum path length considered (paths of length 1 through max_l)
    :return: 1xN vector that ranks the importance of nodes on our network
    """
    s_rlp = np.zeros(len(f))

    for l in range(0, max_l):
        summation_iteration = np.power(epsilon, l) * f @ np.linalg.matrix_power(adjacency, l + 1)
        s_rlp = np.add(s_rlp, summation_iteration)

    return s_rlp
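
As a sanity check on the summation, the function can be run by hand on a tiny network. The sketch below repeats the function above so the block is self-contained, and applies it to a hypothetical four-station line (0-1-2-3) with station 3 as the only target. The graph, the `epsilon` value of 0.5, and the variable names are assumptions chosen for illustration, not values from our project.

```python
import numpy as np


def rlp(f, adjacency, epsilon, max_l=3):
    """Sum epsilon**l * f @ adjacency**(l + 1) for l = 0 .. max_l - 1."""
    s_rlp = np.zeros(len(f))
    for l in range(0, max_l):
        s_rlp = s_rlp + np.power(epsilon, l) * f @ np.linalg.matrix_power(adjacency, l + 1)
    return s_rlp


# Hypothetical line of four stations: 0 - 1 - 2 - 3
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
f = np.array([0.0, 0.0, 0.0, 1.0])  # station 3 is the only target

scores = rlp(f, adjacency, epsilon=0.5, max_l=3)
ranking = np.argsort(-scores)  # station indices, most influential first
print(scores, ranking)
```

With these inputs the scores are [0.25, 0.5, 1.5, 0.5]. Station 2, the direct neighbor of the target, ranks highest: it is reached by one walk of length 1 (weight 1) and two walks of length 3 (weight 0.25 each). Station 0 is reached only by a single walk of length 3, so it ranks lowest.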

We then combine the LA Metro GTFS data with the data we scraped, and clean and normalize the result. Normalizing the data is important for getting good results from our RLP implementation. We make the values more homogeneous by rescaling them, assuming a normal distribution, so that they are centered at 1 with a standard deviation of 0.2. After normalizing, we can create a new graph with the relevant data weighted onto the edges of our rail network graph and finally use our RLP implementation. We can compare our previous centrality results with our RLP results:

![Katz Centrality Results](images/KatzCentrality.png)
![RLP Results](images/RLPResults.png)

The first figure is a visualization of our Katz centrality results from earlier. The red nodes are the most important stations, and the green nodes are the next most important. The second figure visualizes the results of our RLP implementation, with the most important stations again highlighted in red. While the Katz results are generally intuitive, our method suggests that commonly used centrality measures do a poor job of identifying these influential nodes: the influential nodes found by RLP do not match those ranked highly by Katz centrality. The same holds for other simple centrality measures such as degree centrality, eigenvector centrality, and PageRank. This suggests that the centrality of a station has little to do with the delay impact it propagates to the important stations, and that small stations may be more likely to create major delay impacts.
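The normalization step mentioned before the figures (values centered at 1 with a standard deviation of 0.2) can be sketched as a z-score rescale. This is one plausible reading of that step, not necessarily the exact transformation in our project code; the function name `rescale_to_target` and the sample weights are hypothetical.

```python
import numpy as np


def rescale_to_target(values, target_mean=1.0, target_std=0.2):
    """Shift and scale values so they have the requested mean and standard deviation."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical, so every value maps to the target mean.
        return np.full_like(values, target_mean)
    z_scores = (values - values.mean()) / std
    return target_mean + target_std * z_scores


# Made-up raw edge weights, for illustration only.
raw_weights = [120.0, 300.0, 90.0, 450.0, 240.0]
edge_weights = rescale_to_target(raw_weights)
print(edge_weights.round(3))
```

Rescaling this way preserves the ordering of the weights while keeping them close to 1, which stops a single very large value from dominating the powers of the adjacency matrix used in RLP.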

# 4. Third Technical Component: Archer Visualizations