---
title: >
    Developing the MTA Ridership Dash App: Exploration & Implementation
description: >
    This notebook records the process of exploring the MTA ridership data, experimenting with Plotly visualizations, and the steps taken to develop a Dash application for the Holiday Season App Challenge.
author:
  - name:
      given: Taruma Sakti
      family: Megariansyah
      # literal: Taruma Sakti Megariansyah
    orcid: 0000-0002-1551-7673
    email: hi@taruma.info
    url: https://dev.taruma.info
# abstract: >
#     This notebook records the process of exploring the MTA ridership data, experimenting with Plotly visualizations, and the steps taken to develop a Dash application for the Holiday Season App Challenge.
keywords:
  - Dash App
  - MTA
  - Ridership
  - Data Visualization
  - Plotly
  - Python
license: "CC BY"
copyright: 
  holder: Taruma Sakti Megariansyah
  year: 2024
date: 2024-11-30
date-modified: last-modified
date-format: full
format:
    html:
        # code-fold: true
        css: assets/quarto_styles.css
        # code-tools: true
        number-sections: true
        toc-title: Table of Contents
        other-links:
        - text: My Github
          icon: github
          href: https://github.com/taruma
        - text: My Other Projects
          icon: journals
          href: https://dev.taruma.info/projects
        - text: Sponsor Me
          icon: heart
          href: https://github.com/sponsors/taruma
        - text: Buy Me a Drink
          icon: cup-straw
          href: https://trakteer.id/taruma/tip
        code-links:
        - text: Repository
          icon: github
          href: https://github.com/taruma/mta-dash
        - text: Source Code
          icon: code
          href: https://github.com/taruma/mta-dash/blob/main/notebook_id.ipynb
        theme: flatly
        toc: true
        toc-location: left
        toc-expand: 2
        toc-depth: 4
        embed-resources: true
include-in-header: # from: https://github.com/quarto-dev/quarto-cli/discussions/4618
  - text: |
      <link rel = "shortcut icon" href = "favicon-ti.png" />
execute:
  enabled: false
  # echo: false
lightbox: auto
lang: en
---

This notebook is my personal record of taking part in the _[Holiday Season App Challenge - NYC MTA](https://community.plotly.com/t/holiday-season-app-challenge-nyc-mta/88389/16)_. It covers exploring the provided data and experimenting with visualizations that might be used in the application I plan to build.

My approach to the challenge consists of the following steps:

- [ ] Get to know the dataset
- [ ] Identify the initial problem and what to present
- [ ] Explore the data
- [ ] Visualize the data

## Initial Setup

In [35]:
import pandas as pd
import plotly.express as px
import plotly.io as pio
import pytemplate
from IPython.display import display # noqa: F401

pio.templates.default = pytemplate.mytemplate

PATH_DATASET = 'data/MTA_Daily_Ridership.csv'
PATH_DICTIONARY = 'data/MTA_data_dictionary.csv'

mta_daily_ridership = pd.read_csv(PATH_DATASET)
mta_dictionary = pd.read_csv(PATH_DICTIONARY)

## **NYC MTA** Data

This dataset provides estimated daily ridership and traffic for the services of the Metropolitan Transportation Authority (MTA) in New York, including subways, buses, the Long Island Rail Road, Metro-North Railroad, Access-A-Ride, Bridges and Tunnels, and the Staten Island Railway, starting from 1 March 2020 (or 1 April 2020 for LIRR and Metro-North). It also reports each day's value as a percentage of a comparable pre-pandemic day, which shows how ridership has recovered since the pandemic.

The dataset used in this project comes from the [plotly/datasets](https://github.com/plotly/datasets/tree/master/App-Challenges/MTA-NYC) repository on GitHub, accessed on 7 November 2024.

In this section I get to know the dataset, its data types, and what can be explored before moving on to the next stage.

### Getting to Know the Dataset

The source provided by Plotly contains three files:

1. `MTA_DailyRidershipData_Overview.pdf`: a PDF that describes the dataset.
2. `MTA_Daily_Ridership.csv`: the main dataset.
3. `MTA_data_dictionary.csv`: a description of the columns in the dataset.

The structure of the dataset is detailed in @tbl-mta-dictionary.

In [36]:
#| label: tbl-mta-dictionary
#| column: page
#| tbl-cap: Description of each column in the dataset (`MTA_data_dictionary.csv`)
#| echo: false

pd.set_option('display.max_colwidth', None)
display(mta_dictionary.style.set_properties(**{'text-align': 'right'}))
pd.reset_option('display.max_colwidth') # reset option

|     | Field | Description |
| --: | :---- | :---------- |
| 0 | Date | The date of travel |
| 1 | Subways: Total Estimated Ridership | The daily total estimated subway ridership in New York City (NYC) |
| 2 | Subways: % of Comparable Pre-Pandemic Day | The daily subway ridership estimate as a percentage of subway ridership on an equivalent day prior to the COVID-19 pandemic |
| 3 | Buses: Total Estimated Ridership | The daily total estimated bus ridership in NYC |
| 4 | Buses: % of Comparable Pre-Pandemic Day | The daily bus ridership estimate as a percentage of bus ridership on an equivalent day prior to the COVID-19 pandemic |
| 5 | LIRR: Total Estimated Ridership | The daily total estimated Long Island Rail Road (LIRR) ridership (blank value indicates that the ridership data was not or is not currently available or applicable) |
| 6 | LIRR: % of Comparable Pre-Pandemic Day | The daily LIRR ridership estimate as a percentage of LIRR ridership on an equivalent day prior to the COVID-19 pandemic |
| 7 | Metro-North: Total Estimated Ridership | The daily total estimated Metro-North Railroad (MNR) ridership (blank value indicates that the ridership data was not or is not currently available or applicable) |
| 8 | Metro-North: % of Comparable Pre-Pandemic Day | The daily MNR ridership estimate as a percentage of MNR ridership on an equivalent day prior to the COVID-19 pandemic |
| 9 | Access-A-Ride: Total Scheduled Trips | The daily total scheduled Access-A-Ride (AAR) Paratransit Service trips (blank value indicates that the ridership data was not or is not currently available or applicable) |


From @tbl-mta-dictionary, the dataset consists of 15 columns: the date column plus data for each transportation mode, namely Subway, Bus, Long Island Rail Road (LIRR), Metro-North Railroad (MNR), Access-A-Ride (AAR), Bridges and Tunnels (B&T), and Staten Island Railway (SIR). For each mode, the dataset provides two kinds of data: the estimated daily total ridership/traffic, and that figure as a percentage of the ridership/traffic on an equivalent day before the COVID-19 pandemic.

To make the data easier to process and the columns easier to reference, the column names are shortened. The new column names for the MTA dataset are listed in @tbl-mta-columns.

| Original Column Name                                    | New Column Name  |
| :------------------------------------------------------ | :--------------- |
| Date                                                    | date             |
| Subways: Total Estimated Ridership                      | subway_ridership |
| Subways: % of Comparable Pre-Pandemic Day               | subway_recovery  |
| Buses: Total Estimated Ridership                        | bus_ridership    |
| Buses: % of Comparable Pre-Pandemic Day                 | bus_recovery     |
| LIRR: Total Estimated Ridership                         | lirr_ridership   |
| LIRR: % of Comparable Pre-Pandemic Day                  | lirr_recovery    |
| Metro-North: Total Estimated Ridership                  | mnr_ridership    |
| Metro-North: % of Comparable Pre-Pandemic Day           | mnr_recovery     |
| Access-A-Ride: Total Scheduled Trips                    | aar_trips        |
| Access-A-Ride: % of Comparable Pre-Pandemic Day         | aar_recovery     |
| Bridges and Tunnels: Total Traffic                      | bt_traffic       |
| Bridges and Tunnels: % of Comparable Pre-Pandemic Day   | bt_recovery      |
| Staten Island Railway: Total Estimated Ridership        | sir_ridership    |
| Staten Island Railway: % of Comparable Pre-Pandemic Day | sir_recovery     |

: List of new column names {#tbl-mta-columns .column-page .striped .hover}
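
One way to apply the table above is to rename by header text instead of by position, so that a change in CSV column order cannot silently mislabel the data. The following is a minimal sketch, assuming the CSV headers match the "original" names in the table exactly; `COLUMN_MAP`, `rename_columns`, and the empty `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

# Mapping from original header to short name, copied from the table above.
COLUMN_MAP = {
    "Date": "date",
    "Subways: Total Estimated Ridership": "subway_ridership",
    "Subways: % of Comparable Pre-Pandemic Day": "subway_recovery",
    "Buses: Total Estimated Ridership": "bus_ridership",
    "Buses: % of Comparable Pre-Pandemic Day": "bus_recovery",
    "LIRR: Total Estimated Ridership": "lirr_ridership",
    "LIRR: % of Comparable Pre-Pandemic Day": "lirr_recovery",
    "Metro-North: Total Estimated Ridership": "mnr_ridership",
    "Metro-North: % of Comparable Pre-Pandemic Day": "mnr_recovery",
    "Access-A-Ride: Total Scheduled Trips": "aar_trips",
    "Access-A-Ride: % of Comparable Pre-Pandemic Day": "aar_recovery",
    "Bridges and Tunnels: Total Traffic": "bt_traffic",
    "Bridges and Tunnels: % of Comparable Pre-Pandemic Day": "bt_recovery",
    "Staten Island Railway: Total Estimated Ridership": "sir_ridership",
    "Staten Island Railway: % of Comparable Pre-Pandemic Day": "sir_recovery",
}


def rename_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename by header text and fail loudly if an expected header is missing."""
    missing = [name for name in COLUMN_MAP if name not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df.rename(columns=COLUMN_MAP)


# Empty stand-in frame whose columns are deliberately in reverse order.
demo = pd.DataFrame(columns=list(reversed(COLUMN_MAP)))
renamed = rename_columns(demo)
print(renamed.columns.tolist())
```

Because the mapping is keyed on the header text, the reversed column order in `demo` still yields the right short names.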

From the information available in the MTA dataset (the columns in @tbl-mta-columns), we can draw out various _insights_ into public transportation usage trends in New York City, such as analyzing the post-pandemic _ridership recovery_ of each mode (subway, bus, LIRR, Metro-North, Access-A-Ride, Bridges and Tunnels, and Staten Island Railway) and comparing recovery rates between modes.


### Preparing the Dataset

The column names in the MTA dataset are shortened according to @tbl-mta-columns.

In [37]:
#| column: margin
#| echo: false

new_column_names = [
    "date",
    "subway_ridership",
    "subway_recovery",
    "bus_ridership",
    "bus_recovery",
    "lirr_ridership",
    "lirr_recovery",
    "mnr_ridership",
    "mnr_recovery",
    "aar_trips",
    "aar_recovery",
    "bt_traffic",
    "bt_recovery",
    "sir_ridership",
    "sir_recovery",
]

mta_daily_ridership.columns = new_column_names
mta_daily_ridership.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1706 entries, 0 to 1705
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   date              1706 non-null   object
 1   subway_ridership  1706 non-null   int64 
 2   subway_recovery   1706 non-null   int64 
 3   bus_ridership     1706 non-null   int64 
 4   bus_recovery      1706 non-null   int64 
 5   lirr_ridership    1706 non-null   int64 
 6   lirr_recovery     1706 non-null   int64 
 7   mnr_ridership     1706 non-null   int64 
 8   mnr_recovery      1706 non-null   int64 
 9   aar_trips         1706 non-null   int64 
 10  aar_recovery      1706 non-null   int64 
 11  bt_traffic        1706 non-null   int64 
 12  bt_recovery       1706 non-null   int64 
 13  sir_ridership     1706 non-null   int64 
 14  sir_recovery      1706 non-null   int64 
dtypes: int64(14), object(1)
memory usage: 200.1+ KB


Based on the output of `mta_daily_ridership.info()` in the margin, the dataset consists of 15 columns and 1706 rows, and every column other than `date` is numeric. Every column also reports `1706 non-null`, meaning no column has empty or missing values.

Because the `date` column is not a datetime, it first needs to be converted with `pd.to_datetime(...)`. The `date` column is then set as the _dataframe_ index. The converted _dataframe_ is stored in the variable `mta_daily` (with `mta_daily_ridership` as the original _dataframe_).

In [38]:
#| column: page
#| tbl-cap: Sample dataset `MTA_Daily_Ridership.csv`
#| code-fold: true

mta_daily = (
    mta_daily_ridership
        .assign(
            date=pd.to_datetime(mta_daily_ridership['date'])
        )
        .set_index('date')
)

mta_daily.head()

| date | subway_ridership | subway_recovery | bus_ridership | bus_recovery | lirr_ridership | lirr_recovery | mnr_ridership | mnr_recovery | aar_trips | aar_recovery | bt_traffic | bt_recovery | sir_ridership | sir_recovery |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 2020-03-01 | 2212965 | 97 | 984908 | 99 | 86790 | 100 | 55825 | 59 | 19922 | 113 | 786960 | 98 | 1636 | 52 |
| 2020-03-02 | 5329915 | 96 | 2209066 | 99 | 321569 | 103 | 180701 | 66 | 30338 | 102 | 874619 | 95 | 17140 | 107 |
| 2020-03-03 | 5481103 | 98 | 2228608 | 99 | 319727 | 102 | 190648 | 69 | 32767 | 110 | 882175 | 96 | 17453 | 109 |
| 2020-03-04 | 5498809 | 99 | 2177165 | 97 | 311662 | 99 | 192689 | 70 | 34297 | 115 | 905558 | 98 | 17136 | 107 |
| 2020-03-05 | 5496453 | 99 | 2244515 | 100 | 307597 | 98 | 194386 | 70 | 33209 | 112 | 929298 | 101 | 17203 | 108 |


From here on, `mta_daily` is the _dataframe_ used for processing, with the `date` column set as its index. The output of `mta_daily.info()` is shown below.

In [39]:
#| echo: false
#| label: tbl-mta-daily-info
#| tbl-cap: Output `mta_daily.info()`

mta_daily.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1706 entries, 2020-03-01 to 2024-10-31
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   subway_ridership  1706 non-null   int64
 1   subway_recovery   1706 non-null   int64
 2   bus_ridership     1706 non-null   int64
 3   bus_recovery      1706 non-null   int64
 4   lirr_ridership    1706 non-null   int64
 5   lirr_recovery     1706 non-null   int64
 6   mnr_ridership     1706 non-null   int64
 7   mnr_recovery      1706 non-null   int64
 8   aar_trips         1706 non-null   int64
 9   aar_recovery      1706 non-null   int64
 10  bt_traffic        1706 non-null   int64
 11  bt_recovery       1706 non-null   int64
 12  sir_ridership     1706 non-null   int64
 13  sir_recovery      1706 non-null   int64
dtypes: int64(14)
memory usage: 199.9 KB


From @tbl-mta-daily-info, the dataset runs from 1 March 2020 (`2020-03-01`) to 31 October 2024 (`2024-10-31`), or about 4 years and 8 months.
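
The length of that period can be checked with a few lines of standard-library arithmetic. This is only a sketch: the two dates are the index endpoints reported by `mta_daily.info()`, and `n_days` and `n_years` are names introduced here.

```python
from datetime import date

start = date(2020, 3, 1)   # first index value reported by mta_daily.info()
end = date(2024, 10, 31)   # last index value reported by mta_daily.info()

n_days = (end - start).days + 1  # inclusive of both endpoints
n_years = n_days / 365.25        # approximate, averaging in leap years

print(n_days)             # 1706
print(round(n_years, 2))  # 4.67
```

The inclusive day count equals the 1706 rows of the dataframe, which is consistent with one row per calendar day (assuming no duplicated dates).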

In [40]:
#| label: fig-subway-ridership
#| fig-cap: Plot of `subway_ridership` from the `MTA_Daily_Ridership.csv` dataset
#| fig-subcap:
#|     - Subway
#|     - Bus
#| column: page



From @fig-subway-ridership, the following information is obtained:

In [41]:
estimate_columns_suffix = ["ridership", "trips", "traffic"]

estimate_columns = []
for col in mta_daily.columns:
    for suffix in estimate_columns_suffix:
        if suffix in col:
            estimate_columns.append(col)

recovery_columns = [col for col in mta_daily.columns if "recovery" in col]

# Note: this rebinds `mta_daily_ridership`, which until now held the raw CSV dataframe.
mta_daily_ridership = mta_daily[estimate_columns]
mta_daily_recovery = mta_daily[recovery_columns]
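
The cell above matches suffixes with a substring test (`suffix in col`), which works for the current column names but would append a column twice if its name ever contained two of the suffixes. A possible alternative, sketched here on a plain list of the renamed columns, is `str.endswith` with a tuple of suffixes; `ESTIMATE_SUFFIXES` and the hard-coded `columns` list are stand-ins for illustration.

```python
# Stand-in for mta_daily.columns, using the names assigned earlier in the notebook.
columns = [
    "subway_ridership", "subway_recovery",
    "bus_ridership", "bus_recovery",
    "lirr_ridership", "lirr_recovery",
    "mnr_ridership", "mnr_recovery",
    "aar_trips", "aar_recovery",
    "bt_traffic", "bt_recovery",
    "sir_ridership", "sir_recovery",
]

ESTIMATE_SUFFIXES = ("_ridership", "_trips", "_traffic")

# str.endswith accepts a tuple, so each column is tested once against all suffixes.
estimate_columns = [col for col in columns if col.endswith(ESTIMATE_SUFFIXES)]
recovery_columns = [col for col in columns if col.endswith("_recovery")]

print(estimate_columns)
print(recovery_columns)
```

Each column lands in exactly one of the two lists, seven per list.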



In [42]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

transportation_modes = ["subway", "bus", "lirr", "mnr", "aar", "bt", "sir"]
transportation_names = [
    "Subways",
    "Buses",
    "Long Island Rail Road",
    "Metro North",
    "Access-A-Ride",
    "Bridges and Tunnels",
    "Staten Island Railway",
]
transportation_emoji = ["🚆", "🚌", "🚄", "🚉", "🚐", "🌉", "🚋"]

selected_transportation_mode = "bt"

# get the columns for the selected transportation mode
selected_ridership_columns = [
    col
    for col in mta_daily_ridership.columns
    if col.startswith(selected_transportation_mode)
]
selected_recovery_columns = [
    col
    for col in mta_daily_recovery.columns
    if col.startswith(selected_transportation_mode)
]

fig = make_subplots(specs=[[{"secondary_y": True}]])

resample_period = "W"

selected_ridership_data = (
    mta_daily_ridership[selected_ridership_columns].resample(resample_period).sum()
)
selected_recovery_data = (
    mta_daily_recovery[selected_recovery_columns].resample(resample_period).mean() / 100
)

fig.add_trace(
    go.Scatter(
        x=selected_ridership_data.index,
        y=selected_ridership_data.values.flatten(),
        name="Ridership",
        legendgroup="a",
        legendgrouptitle_text="a",
    ),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(
        x=selected_recovery_data.index,
        y=selected_recovery_data.values.flatten(),
        name="Recovery",
        yaxis="y2",
        legendgroup="a",
        legendgrouptitle_text="a",
    ),
)

fig.update_layout(
    xaxis=dict(title="Date"),
    yaxis=dict(
        title="Estimated Ridership",
        # tickformat=".3s",
        gridwidth=2,
        hoverformat=".3s",
    ),
    yaxis2=dict(
        title="Recovered (%)",
        autorange="reversed",
        hoverformat=".2%",  # https://observablehq.com/@d3/d3-format?collection=@d3/d3-format
        tickformat=".0%",  # https://observablehq.com/@d3/d3-format?collection=@d3/d3-format
        griddash="dot",
    ),
    margin={"t": 30, "l": 10, "r": 10, "b": 10},
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="left",
        x=0,
    ),
)

fig.show()
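
The plot above aggregates the two kinds of columns differently: counts are summed per week, while recovery percentages are averaged and divided by 100 so that the `.0%` tick format applies. The sketch below shows the same two aggregations on made-up data; the column names `demo_ridership` and `demo_recovery` and all values are hypothetical.

```python
import pandas as pd

# Two full Monday-to-Sunday weeks of made-up daily values.
index = pd.date_range("2020-03-02", periods=14, freq="D")
daily = pd.DataFrame(
    {
        "demo_ridership": [100] * 14,          # daily counts
        "demo_recovery": [40] * 7 + [60] * 7,  # daily percentages
    },
    index=index,
)

# Counts add up across the week; percentages are averaged, then scaled to fractions.
weekly_ridership = daily["demo_ridership"].resample("W").sum()
weekly_recovery = daily["demo_recovery"].resample("W").mean() / 100

print(weekly_ridership.tolist())  # [700, 700]
print(weekly_recovery.tolist())   # [0.4, 0.6]
```

With the `"W"` frequency, pandas labels each bin by the Sunday that ends it. Since the real data starts on Sunday, 1 March 2020, its first weekly bin holds a single day, so the first weekly sum is expected to look unusually low.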

In [43]:
transportation_modes

['subway', 'bus', 'lirr', 'mnr', 'aar', 'bt', 'sir']

In [44]:
pytemplate.mytemplate.layout.colorway

('#526074', '#b71e1d', '#1bbd9b', '#ec9d0e', '#78b8ff')

In [45]:
from itertools import cycle, islice

resample_period = "W"

fig_all = make_subplots(specs=[[{"secondary_y": True}]])

colorway = pytemplate.mytemplate.layout.colorway
colors = list(islice(cycle(colorway), len(transportation_modes)))

for counter, (mode, mode_name, color, emoji) in enumerate(zip(transportation_modes, transportation_names, colors, transportation_emoji), 1):
    selected_ridership_columns = [
        col for col in mta_daily_ridership.columns if col.startswith(mode)
    ][0]
    selected_recovery_columns = [
        col for col in mta_daily_recovery.columns if col.startswith(mode)
    ][0]

    selected_ridership_data = (
        mta_daily_ridership[selected_ridership_columns].resample(resample_period).sum()
    )
    selected_recovery_data = (
        mta_daily_recovery[selected_recovery_columns].resample(resample_period).mean()
        / 100
    )

    is_legend_visible = True if counter == 1 else "legendonly"

    ridership_trace = go.Scatter(
        x=selected_ridership_data.index,
        y=selected_ridership_data.values,
        name=f"{mode}_ridership",
        legendgroup=mode,
        legendgrouptitle_text=f"{emoji} {mode_name}",
        line_color=color,
        line_width=3,
        hovertemplate="%{y}",
        visible=is_legend_visible
    )

    recovery_trace = go.Scatter(
        x=selected_recovery_data.index,
        y=selected_recovery_data.values.flatten(),
        name=f"{mode}_recovery",
        yaxis="y2",
        legendgroup=mode,
        legendgrouptitle_text=f"{emoji} {mode_name}",
        line_dash="dot",
        line_color=color,
        line_width=2,
        visible=is_legend_visible
    )

    fig_all.add_trace(ridership_trace)
    fig_all.add_trace(recovery_trace)


layout_all = dict(
    xaxis=dict(title="Date"),
    yaxis=dict(
        title="Estimated Ridership",
        # tickformat=".3s",
        gridwidth=2,
        hoverformat=".3s",
    ),
    yaxis2=dict(
        title="Recovered (%)",
        autorange="reversed",
        hoverformat=".2%",  
        tickformat=".0%",  # https://observablehq.com/@d3/d3-format?collection=@d3/d3-format
        griddash="dashdot",
    ),
    margin={"t": 30, "l": 10, "r": 10, "b": 10},
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="left",
        x=0,
    ),
)

fig_all.update_layout(**layout_all)

fig_all.show()

In [46]:
x = list(zip(transportation_modes, transportation_names, colors, transportation_emoji))
pick_x = ['subway', 'aar']

[
    (mode, mode_name, color, emoji)
    for mode, mode_name, color, emoji in x if mode in pick_x
]

[('subway', 'Subways', '#526074', '🚆'),
 ('aar', 'Access-A-Ride', '#78b8ff', '🚐')]

In [47]:
x

[('subway', 'Subways', '#526074', '🚆'),
 ('bus', 'Buses', '#b71e1d', '🚌'),
 ('lirr', 'Long Island Rail Road', '#1bbd9b', '🚄'),
 ('mnr', 'Metro North', '#ec9d0e', '🚉'),
 ('aar', 'Access-A-Ride', '#78b8ff', '🚐'),
 ('bt', 'Bridges and Tunnels', '#526074', '🌉'),
 ('sir', 'Staten Island Railway', '#b71e1d', '🚋')]

In [48]:
'subway' in pick_x

True

In [49]:
colorway = pytemplate.mytemplate.layout.colorway
list(islice(cycle(colorway), 2))

['#526074', '#b71e1d']
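
Because the colorway has five colors and there are seven transportation modes, `cycle` wraps around and some modes share a color. The sketch below, using the colorway and mode list printed earlier, makes the overlap explicit; `mode_colors`, `seen`, and `clashes` are names introduced for illustration.

```python
from itertools import cycle, islice

colorway = ("#526074", "#b71e1d", "#1bbd9b", "#ec9d0e", "#78b8ff")
modes = ["subway", "bus", "lirr", "mnr", "aar", "bt", "sir"]

# Repeat the colorway as often as needed, then keep one color per mode.
colors = list(islice(cycle(colorway), len(modes)))
mode_colors = dict(zip(modes, colors))

# Record which modes reuse the color of an earlier mode.
seen = {}
clashes = {}
for mode, color in mode_colors.items():
    if color in seen:
        clashes[mode] = seen[color]
    else:
        seen[color] = mode

print(clashes)  # {'bt': 'subway', 'sir': 'bus'}
```

If distinct colors per mode matter in the final app, a colorway with at least seven entries would avoid the clash.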

In [50]:
test = []

test is None

False

In [51]:
import base64

def fig_to_base64(fig):
    # Convert plot to PNG image
    img_bytes = fig.to_image(format="png")
    
    # Encode to base64
    img_base64 = base64.b64encode(img_bytes).decode('utf-8')
    
    # Create base64 string format commonly used in APIs
    img_base64_str = f"data:image/png;base64,{img_base64}"
    
    return img_base64, img_base64_str
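
The encoding step can be tried without rendering a figure, which is useful because `fig.to_image` depends on an image-export engine such as Kaleido being installed. The sketch below uses only the standard library and the 8-byte PNG signature as stand-in image data; `bytes_to_data_uri` and `data_uri_to_bytes` are hypothetical helpers, not part of the notebook.

```python
import base64


def bytes_to_data_uri(data: bytes, mime: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def data_uri_to_bytes(uri: str) -> bytes:
    """Decode a base64 data URI back to raw bytes."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


# Every PNG file starts with this signature, hence the familiar "iVBORw0KGgo" prefix.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

uri = bytes_to_data_uri(PNG_SIGNATURE)
print(uri)  # data:image/png;base64,iVBORw0KGgo=
roundtrip = data_uri_to_bytes(uri)
```

The round trip returns the original bytes, which is a quick way to confirm that a data URI built this way is well formed.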

In [52]:
y1, y2 = fig_to_base64(fig_all)
y1

'iVBORw0KGgoAAAANSUhEUgAAArwAAAH0CAYAAADfWf7fAAAgAElEQVR4Xux9B5QVRfb+...'

In [53]:
def read_text_file(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        return content
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found")
        return None
    except Exception as e:
        print(f"Error reading file: {str(e)}")
        return None

In [54]:
system_prompt = read_text_file("text/system_prompt.txt")
context_overview = read_text_file("text/context_overview.txt")
system_prompt

Error: File 'text/system_prompt.txt' not found
Error: File 'text/context_overview.txt' not found
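
Both reads failed here, so `system_prompt` and `context_overview` are `None`, and any f-string that interpolates them would contain the literal text `None`. Relative paths are resolved against the current working directory, so one plausible cause is that the notebook was run from a different directory, or that the files are simply absent. A possible variant that returns an explicit default instead is sketched below; `read_text_or_default` and the sample file contents are hypothetical.

```python
import tempfile
from pathlib import Path


def read_text_or_default(path, default=""):
    """Return the file's text, or `default` if the file does not exist."""
    file_path = Path(path)
    if not file_path.is_file():
        return default
    return file_path.read_text(encoding="utf-8")


# Demonstrate both branches inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "system_prompt.txt"
    existing.write_text("You are a data analyst.", encoding="utf-8")

    found = read_text_or_default(existing)
    missing = read_text_or_default(
        Path(tmp) / "context_overview.txt", default="(no overview)"
    )

print(found)    # You are a data analyst.
print(missing)  # (no overview)
```

Returning a string in both cases keeps later prompt-building code free of `None` checks, at the cost of hiding a missing file unless the default is made conspicuous.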


In [59]:
mta_daily_recovery.describe()

|       | subway_recovery | bus_recovery | lirr_recovery | mnr_recovery | aar_recovery | bt_recovery | sir_recovery |
| :---- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| count | 1706.0 | 1706.0 | 1706.0 | 1706.0 | 1706.0 | 1706.0 | 1706.0 |
| mean | 55.461313 | 54.692849 | 59.12837 | 51.083236 | 86.165299 | 93.375147 | 37.811254 |
| std | 19.819596 | 19.293307 | 29.297993 | 26.137311 | 24.645063 | 14.641962 | 19.273205 |
| min | 7.0 | 1.0 | 2.0 | 3.0 | 13.0 | 18.0 | 0.0 |
| 25% | 40.0 | 53.0 | 37.0 | 29.0 | 72.0 | 90.0 | 24.0 |
| 50% | 61.0 | 60.0 | 60.0 | 56.0 | 84.0 | 97.0 | 40.0 |
| 75% | 69.0 | 65.0 | 76.0 | 71.0 | 104.0 | 102.0 | 47.0 |
| max | 143.0 | 126.0 | 237.0 | 193.0 | 144.0 | 120.0 | 182.0 |


In [None]:
# import aisuite as ai
# import dotenv

# dotenv.load_dotenv()

# client = ai.Client()

# models = ["openai:gpt-4o-mini"]
# pick_model = "openai:gpt-4o-mini"

# messages = [
#     {"role": "system", "content": system_prompt},
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": f"Here's overview of this project: {context_overview}"},
#             {"type": "text", "text": "Analyze this graph and provide insights."},
#             {
#                 "type": "image_url",
#                 "image_url": {"url": y2},
#             },
#         ],
#     },
# ]

# response = client.chat.completions.create(
#         model=pick_model,
#         messages=messages,
#         temperature=0.75
#     )
# print(response.choices[0].message.content)

### Insights on MTA Ridership Recovery Trends

The graph illustrates the estimated ridership levels for various MTA services, highlighting the recovery patterns post-pandemic. **Subways** show a gradual recovery, with ridership fluctuating but consistently hovering around 30 million passengers in early 2023. The **subway recovery** line indicates a recovery percentage fluctuating between 40% and 60% compared to pre-pandemic levels, suggesting a moderate rebound as commuters adjust to new norms.

In contrast, **Buses** exhibit a more pronounced recovery trajectory, with ridership levels approaching pre-pandemic figures more rapidly. The **bus recovery** percentage has consistently remained above 50%, indicating strong demand for this mode of transport. Meanwhile, the **Long Island Rail Road (LIRR)** and **Metro-North Railroad (MNR)** show slower recovery rates, with ridership levels stabilizing around 20% to 40% of pre-pandemic figures, reflecting ongoing challenges in regaining commute

## ARCHIVE

In [6]:
# # Normalize the subway ridership and scale to 0-100
# df_subway['Normalized_Ridership'] = 100 * (df_subway['Subways_Ridership'] - df_subway['Subways_Ridership'].min()) / (df_subway['Subways_Ridership'].max() - df_subway['Subways_Ridership'].min())

# # Plot the normalized subway ridership and percentage over time
# fig_subway = px.line(df_subway, x='Date', y=['Normalized_Ridership', 'Subways_Percent_PrePandemic'], 
#                      labels={'value': 'Percentage', 'variable': 'Metric'},
#                      title='Normalized Subway Ridership and Percentage Over Time')

# fig_subway.show()

In [7]:
# # Convert the 'Date' column to datetime
# mta_daily_ridership['Date'] = pd.to_datetime(mta_daily_ridership['Date'])

# # Calculate the number of days in the dataset
# num_days = (mta_daily_ridership['Date'].max() - mta_daily_ridership['Date'].min()).days + 1

# # Calculate the number of weeks
# num_weeks = num_days // 7

# # Calculate the number of weekends
# num_weekends = mta_daily_ridership['Date'].dt.dayofweek.isin([5, 6]).sum()

# print(f"Number of days: {num_days}")
# print(f"Number of weeks: {num_weeks}")
# print(f"Number of weekends: {num_weekends}")

In [8]:
# # Create a new column for the day name with Sunday as 1
# mta_daily_ridership['Day_Name'] = mta_daily_ridership['Date'].dt.dayofweek.map({6: 1, 0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 7})

# # Display the updated dataframe
# mta_daily_ridership.head()

In [9]:
# date_series = mta_daily_ridership.Date

# new_df = pd.DataFrame({
#     'Date': date_series,
#     'Day_Name': date_series.dt.day_name(),
#     'Day_of_Week': date_series.dt.dayofweek + 1 ,
#     'Is_Weekday': date_series.dt.dayofweek < 5,
#     'Month_Name': date_series.dt.month_name(),
#     'Month': date_series.dt.month,
#     'Year': date_series.dt.year
# })

# new_df.head(10)
# new_df['Week_Number'] = date_series.dt.isocalendar().week
# new_df.tail(10)