<h1>Change in CTA Ridership, 2019 to 2022</h1>
This analysis looks at which stations and community areas lost the most ridership between 2019 and 2022. I chose 2019 as the baseline year because it is the last full calendar year preceding the closures that began in March 2020 as a result of the COVID-19 pandemic.<br>
<br>
Looking at changes in ridership by geography was inspired by Ben Wellington's <a href="https://iquantny.tumblr.com/post/612712380924903424/mapping-fridays-30-drop-in-nyc-subway-ridership" target="_blank">analysis of NYC following the onset of pandemic closures</a>.
<br>
My analysis steps:
<ol>
<li><a href="#docs">Review API Documentation</a>
<li><a href="#import">Import Libraries</a>
<li><a href="#retrieve_data">Get Data</a>
    <ol>
        <li><a href="#data_riders">Ridership Data</a>
        <li><a href="#data_stations">Station Info (CTA)</a>
        <li><a href="#data_station_lookup">Station Info (manually entered)</a>
        <li><a href="#data_communities">Community Names</a>
    </ol>
<li><a href="#merge">Merge Datasets and Review Station Summaries</a>
<li><a href="#test">Data Quality Checks and Testing</a>
<li><a href="#findings">Key Findings</a>
</ol>

<h3>Possible Next Steps</h3>
<ul>
    <li>integrate demographic info (race, median income) to look for patterns in which communities lost the most ridership
    <li>look at all months for a single station, to see monthly trajectory of ridership
    <li>develop visualizations
    <li>contextualize results. Why do I think some community areas/stations experienced sharper declines in ridership than others?
</ul>

<a name = "docs"></a>
    <h1>1. Review API Documentation</h1>

<h3>Socrata Portal Info</h3>
 <ul>
<li><b>API Docs:</b> <a href="https://dev.socrata.com/">https://dev.socrata.com/</a> (general reference for Socrata)<br>
    </ul>   

<h3>CTA data</h3>
<ul>
    <li><a href="https://data.cityofchicago.org/Transportation/CTA-List-of-CTA-Datasets/pnau-cf66">List of CTA data sets</a>
    <li>Ridership info
        <ul>
            <li><a href="https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Daily-Totals/5neh-572f">Dataset Overview</a><br>
            <li><a href="https://dev.socrata.com/foundry/data.cityofchicago.org/5neh-572f">Developer Portal</a>
            <li><a href="https://data.cityofchicago.org/resource/5neh-572f.json">JSON Explorer</a>
        </ul>
    <li>Station Info
        <ul>
            <li><a href="https://data.cityofchicago.org/Transportation/CTA-System-Information-List-of-L-Stops/8pix-ypme">Dataset Overview</a><br>
            <li><a href="https://dev.socrata.com/foundry/data.cityofchicago.org/8pix-ypme">Developer Portal</a>
            <li><a href="https://data.cityofchicago.org/resource/8pix-ypme.json">JSON Explorer</a>
        </ul>
</ul>

<h3>Community Area Data</h3>
<ul>
    <li><a href="https://hub.arcgis.com/datasets/6ef851bb4765412d95a66fbb54cffc11_0/api">API Explorer</a>
</ul>

<a name = "import"></a>
<h1>2. Import Libraries</h1>

In [1]:
import pandas as pd
import requests
#import datetime as dt #would only need this if I manipulated dates post-API data retrieval

<a name = "retrieve_data"></a>
    <h1>3. Get Data</h1>

In [None]:
url = "https://data.cityofchicago.org/resource/22u3-xenr.json" #not used below; url is reassigned in 3A


<a name = "data_riders"></a>
    <h2>3A. Get Daily Ridership Data</h2>

In [2]:
# build my query
#select = "station_id, stationname, date_extract_y(date) as year,date_extract_m(date) as month,count(rides) as nDays,sum(rides) as nRides"
#where = "month = 1 and year between 2020 and 2023"
#group_by = "station_id,stationname, year, month"
select = "station_id, stationname, date_extract_y(date) as year,count(rides) as nDays,sum(rides) as nRides"
where = "year between 2019 and 2022"
group_by = "station_id,stationname, year"
limit = 9999

url = f"https://data.cityofchicago.org/resource/5neh-572f.json?$SELECT={select}&$WHERE={where}&$GROUP={group_by}&$LIMIT={limit}"
print (url)

https://data.cityofchicago.org/resource/5neh-572f.json?$SELECT=station_id, stationname, date_extract_y(date) as year,count(rides) as nDays,sum(rides) as nRides&$WHERE=year between 2019 and 2022&$GROUP=station_id,stationname, year&$LIMIT=9999
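
The URL above is assembled by hand and still contains raw spaces. As a minimal sketch of an alternative, assuming the same endpoint and SoQL clauses, the parameters can be percent-encoded with the standard library. The names `base`, `params` and `encoded_url` are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# same endpoint and SoQL clauses as the query above
base = "https://data.cityofchicago.org/resource/5neh-572f.json"
params = {
    "$SELECT": "station_id, stationname, date_extract_y(date) as year,"
               "count(rides) as nDays,sum(rides) as nRides",
    "$WHERE": "year between 2019 and 2022",
    "$GROUP": "station_id,stationname, year",
    "$LIMIT": 9999,
}

# urlencode escapes spaces, commas and parentheses in each clause
encoded_url = f"{base}?{urlencode(params)}"

# decoding the query string should give back the original clauses
decoded = parse_qs(urlsplit(encoded_url).query)
print(decoded["$WHERE"][0])
```

With `requests`, passing the same dictionary as `requests.get(base, params=params)` applies equivalent encoding, so the URL string does not need to be built manually.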


In [3]:
#run the query
response = requests.get(url)
data = response.json()
print (response)

<Response [200]>


In [4]:
#create and format dataframe
df_ridership=pd.DataFrame(data)

In [5]:
#should have 4 records per station, x 143 stations
df_ridership.station_id.value_counts()
df_ridership

Unnamed: 0,station_id,stationname,year,nDays,nRides
0,40010,Austin-Forest Park,2019,365,543533
1,40010,Austin-Forest Park,2020,366,207693
2,40010,Austin-Forest Park,2021,365,173234
3,40010,Austin-Forest Park,2022,365,208113
4,40020,Harlem-Lake,2019,365,1101813
...,...,...,...,...,...
568,41690,Cermak-McCormick Place,2022,365,332188
569,41700,Washington/Wabash,2019,365,3126070
570,41700,Washington/Wabash,2020,366,1082287
571,41700,Washington/Wabash,2021,365,1321376


<h4>Clean up Ridership Data</h4>

In [6]:
#check data types
df_ridership.dtypes

station_id     object
stationname    object
year           object
nDays          object
nRides         object
dtype: object

In [7]:
#fix data types
df_ridership['station_id'] = df_ridership['station_id'].astype('int')
df_ridership['stationname'] = df_ridership['stationname'].astype('string')
#df_ridership['year'] = df_ridership['year'].astype('int') #keep year as string to avoid key problems when I make this a column header
#df_ridership['month'] = df_ridership['month'].astype('int')
df_ridership['nDays'] = df_ridership['nDays'].astype('int')
df_ridership['nRides'] = df_ridership['nRides'].astype('int')

<h4>Drop Randolph/Wabash Station</h4>
This station <a href="https://en.wikipedia.org/wiki/Randolph/Wabash_station">closed in 2017</a>, but the dataset still contains a 2019 record for it with 0 rides.

In [8]:
#check that all stations have 4 records, one for each year. Randolph/Wabash has only one record
#It turns out this station closed in 2017, and shouldn't have any data.
#see https://en.wikipedia.org/wiki/Randolph/Wabash_station
df_ridership.station_id.value_counts()

station_id
40010    4
40020    4
41000    4
41010    4
41020    4
        ..
40530    4
40540    4
40550    4
40560    4
40200    1
Name: count, Length: 144, dtype: int64

In [9]:
#look at Randolph/Wabash station
df_ridership.query("station_id == 40200")

Unnamed: 0,station_id,stationname,year,nDays,nRides
72,40200,Randolph/Wabash,2019,31,0


In [10]:
#drop this station
df_ridership = df_ridership.drop(df_ridership.query("station_id == 40200").index)

In [11]:
#count number of distinct stations again
df_ridership['station_id'].nunique()

143
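
The check above relies on reading the `value_counts()` output by eye. A minimal sketch of an automated version, using a toy dataframe rather than the real query result, could flag any station whose number of yearly records differs from the expected four. The names `toy`, `expected_years` and `incomplete_ids` are illustrative.

```python
import pandas as pd

# toy data: station 40010 has all four years, 40200 has a single row
toy = pd.DataFrame({
    "station_id": [40010, 40010, 40010, 40010, 40200],
    "year": ["2019", "2020", "2021", "2022", "2019"],
    "nRides": [543533, 207693, 173234, 208113, 0],
})

expected_years = 4  # 2019 through 2022
counts = toy["station_id"].value_counts()

# station ids whose number of yearly records differs from the expected count
incomplete_ids = counts[counts != expected_years].index.tolist()
print(incomplete_ids)

# keep only stations with a full set of records
complete = toy[~toy["station_id"].isin(incomplete_ids)]
```

This would drop any incomplete station automatically, so each flagged id still deserves a manual look (as with Randolph/Wabash) before it is removed.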

<h4>Create Pivot Table of Ridership by Station by Year</h4>

In [12]:
df_riders_by_year = df_ridership.pivot_table(index="station_id",columns="year",values="nRides")
df_riders_by_year.head()
#df_riders_by_year.style.format({"2019": "{:,.0f}", "2020": "{:,.0f}"})
#df_riders_by_year = df_riders_by_year.rename(columns={'2020': 'Jan2020','2021': 'Jan2021','2022': 'Jan2022','2023': 'Jan2023'})

year,2019,2020,2021,2022
station_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
40010,543533,207693,173234,208113
40020,1101813,489376,455310,511312
40030,455496,234734,216973,206449
40040,2188354,595970,509549,910782
40050,1091440,393596,392194,468102


In [13]:
df_riders_by_year['2019'].sum()

179071205

<a name = "data_stations"></a>
    <h2>3B. Get Station Data</h2>

In [14]:
#build my query
select = "map_id, station_descriptive_name as station, location, :@computed_region_vrxf_vc4k as community_id"
#select = select + ",blue, g, brn, p, pexp, y, pnk, o"
url = f"https://data.cityofchicago.org/resource/8pix-ypme.json?$SELECT={select}"
print (url)

https://data.cityofchicago.org/resource/8pix-ypme.json?$SELECT=map_id, station_descriptive_name as station, location, :@computed_region_vrxf_vc4k as community_id


In [15]:
#run the query
response = requests.get(url)
data = response.json()
print (response)

<Response [200]>


In [16]:
#create and format dataframe
#calling it stations0 b/c this initial dataframe doesn't contain community name. will add this in subsequent dataframe
#...things can get really messed up if I jump around code otherwise
df_stations0=pd.DataFrame(data)

In [17]:
df_stations0

Unnamed: 0,map_id,station,location,community_id
0,40420,Cicero (Pink Line),"{'latitude': '41.85182', 'longitude': '-87.745...",
1,40780,Central Park (Pink Line),"{'latitude': '41.853839', 'longitude': '-87.71...",30
2,40940,Halsted (Green Line),"{'latitude': '41.778943', 'longitude': '-87.64...",66
3,40230,Cumberland (Blue Line),"{'latitude': '41.984246', 'longitude': '-87.83...",75
4,40470,Racine (Blue Line),"{'latitude': '41.87592', 'longitude': '-87.659...",29
...,...,...,...,...
295,40480,Cicero (Green Line),"{'latitude': '41.886519', 'longitude': '-87.74...",26
296,41330,Montrose (Blue Line),"{'latitude': '41.961539', 'longitude': '-87.74...",16
297,40650,North/Clybourn (Red Line),"{'latitude': '41.910655', 'longitude': '-87.64...",37
298,40890,O'Hare (Blue Line),"{'latitude': '41.97766526', 'longitude': '-87....",75


<h3>Clean up data structures</h3>

In [18]:
#fix data types
#errors='ignore' leaves community_id unchanged (as strings) when nulls make the int cast fail, which preserves the nulls
df_stations0['map_id'] = df_stations0['map_id'].astype('int')
df_stations0['station'] = df_stations0['station'].astype('string')
df_stations0['community_id'] = df_stations0['community_id'].astype('int', errors ='ignore')

df_stations0= df_stations0.sort_values(by='map_id', ascending=True)

In [19]:
#parse out lat and long into separate fields. the .str accessor looks up the key in each element, which also works when the elements are dicts
df_stations0["lat"]=df_stations0['location'].str['latitude'].astype(float)
df_stations0["long"]=df_stations0['location'].str['longitude'].astype(float)
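
The `.str[...]` lookup above works because pandas applies it to each element of the column. As a sketch of two more explicit alternatives, assuming each `location` value is a dict with `latitude` and `longitude` keys, the coordinates can be pulled out with `apply` or expanded with `pd.json_normalize`. The toy rows use coordinates shown later in the merged station output.

```python
import pandas as pd

# toy rows shaped like the station response
toy = pd.DataFrame({
    "map_id": [40230, 40010],
    "location": [
        {"latitude": "41.984246", "longitude": "-87.838028"},
        {"latitude": "41.870851", "longitude": "-87.776812"},
    ],
})

# option 1: apply a plain function to pull one key out of each dict
toy["lat"] = toy["location"].apply(lambda loc: float(loc["latitude"]))

# option 2: expand every key into its own column in one step
coords = pd.json_normalize(toy["location"].tolist()).astype(float)
toy["long"] = coords["longitude"].values

print(toy[["map_id", "lat", "long"]])
```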

In [20]:
#drop location field, now that I'm done with it
df_stations0=df_stations0.drop('location', axis=1)

<h3>Drop Duplicate Rows</h3>
Without this step, I found duplicated rows in station_summary:
<ul>
    <li>40570 California Blue Line shows up twice because the name is slightly different
    <li>41400 Roosevelt (Red, Orange & Green lines) shows up twice because the GPS coordinates are slightly different
</ul>

In [21]:
#drop duplicates, because each station has separate inbound and outbound entries that we don't care about
#see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
df_stations0 = df_stations0.drop_duplicates(subset=['map_id'])

In [22]:
#confirm no duplicates
df_stations0.nunique()

map_id          143
station         143
community_id     42
lat             143
long            143
dtype: int64
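
The `subset` argument matters here because the repeated rows are not exact copies. A small sketch with invented station names (made up to mimic the slight differences described above) shows that a whole-row comparison would keep both versions, while comparing on `map_id` alone keeps one row per station.

```python
import pandas as pd

# toy rows: map_id 40570 appears twice with slightly different (invented) names
toy = pd.DataFrame({
    "map_id": [40570, 40570, 41400],
    "station": ["California (Blue Line)", "California  (Blue Line)", "Roosevelt"],
})

# comparing whole rows treats the two 40570 rows as different
by_row = toy.drop_duplicates()

# comparing on map_id only keeps the first row for each station
by_id = toy.drop_duplicates(subset=["map_id"])

print(len(by_row), len(by_id))
```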

<a name = "data_station_lookup"></a>
<h2>3C. Get Station Data (Manually Entered)</h2>
I manually created a lookup table based on station data, to cross-reference train line colors and cities.

In [23]:
df_stations_info=pd.read_csv("source_data/lkup_stations_info.csv")

<a name = "data_communities"></a>
    <h2>3D. Get Community Names</h2>

In [24]:
#created URL from API Explorer on ArcGIS
url = "https://services7.arcgis.com/8kZv9DESIQ1hYuyJ/arcgis/rest/services/Chicago_Community_areas/FeatureServer/0/query?where=1%3D1&outFields=OBJECTID,community&returnGeometry=false&outSR=4326&f=json"

In [25]:
#run the query
response = requests.get(url)
data = response.json()
print (response)

<Response [200]>


In [26]:
#create and format dataframe
df_communities=pd.DataFrame(data['features'])

In [27]:
df_communities

Unnamed: 0,attributes
0,"{'OBJECTID': 1, 'community': 'DOUGLAS'}"
1,"{'OBJECTID': 2, 'community': 'OAKLAND'}"
2,"{'OBJECTID': 3, 'community': 'FULLER PARK'}"
3,"{'OBJECTID': 4, 'community': 'GRAND BOULEVARD'}"
4,"{'OBJECTID': 5, 'community': 'KENWOOD'}"
...,...
72,"{'OBJECTID': 73, 'community': 'MOUNT GREENWOOD'}"
73,"{'OBJECTID': 74, 'community': 'MORGAN PARK'}"
74,"{'OBJECTID': 75, 'community': 'OHARE'}"
75,"{'OBJECTID': 76, 'community': 'EDGEWATER'}"


In [28]:
#parse attributes column (a dictionary containing ID and name) to get ID and name
#keep ID as a string: community_id in df_stations0 is also a string, and the merge keys need matching types
df_communities['community_id']=df_communities['attributes'].str['OBJECTID'].astype(str)
df_communities['community_name']=df_communities['attributes'].str['community'].astype(str)

#drop attributes column b/c we don't need it anymore
df_communities=df_communities.drop('attributes', axis=1)

In [29]:
df_communities.head()

Unnamed: 0,community_id,community_name
0,1,DOUGLAS
1,2,OAKLAND
2,3,FULLER PARK
3,4,GRAND BOULEVARD
4,5,KENWOOD
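
As a sketch of an alternative to parsing the `attributes` column by hand, `pd.json_normalize` can flatten the nested dictionaries directly. This assumes the response keeps the `features` / `attributes` shape shown above; the `features` list below copies the first two records and the name `flat` is illustrative.

```python
import pandas as pd

# first two features copied from the response shown above
features = [
    {"attributes": {"OBJECTID": 1, "community": "DOUGLAS"}},
    {"attributes": {"OBJECTID": 2, "community": "OAKLAND"}},
]

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(features)
flat = flat.rename(columns={
    "attributes.OBJECTID": "community_id",
    "attributes.community": "community_name",
})

# keep the id as a string so it matches the string ids in the station data
flat["community_id"] = flat["community_id"].astype(str)
print(flat)
```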


<a name = "merge"></a>
    <h1>4. Merge Datasets</h1>

<h3>Merge manually-entered station info into station data</h3>

In [30]:
df_stations0 = pd.merge(df_stations0, df_stations_info, left_on='map_id', right_on='map_id', how='left')
df_stations0.head()

Unnamed: 0,map_id,station_x,community_id,lat,long,station_y,city,line
0,40010,Austin (Blue Line),,41.870851,-87.776812,Austin (Blue Line),Oak Park,Blue
1,40020,Harlem/Lake (Green Line),,41.886848,-87.803176,Harlem/Lake (Green Line),Oak Park,Green
2,40030,Pulaski (Green Line),27.0,41.885412,-87.725404,Pulaski (Green Line),Chicago,Green
3,40040,"Quincy/Wells (Brown, Orange, Purple & Pink lines)",38.0,41.878723,-87.63374,"Quincy/Wells (Brown, Orange, Purple & Pink lines)",Chicago,Multi
4,40050,Davis (Purple Line),,42.04771,-87.683543,Davis (Purple Line),Evanston,Purple


<h3>Merge community names into station data</h3>

In [31]:
df_stations = pd.merge(df_stations0, df_communities, left_on='community_id', right_on='community_id', how='left')
#df_stations.rename(columns={'community_area_x': 'community_area'})
df_stations.head()

Unnamed: 0,map_id,station_x,community_id,lat,long,station_y,city,line,community_name
0,40010,Austin (Blue Line),,41.870851,-87.776812,Austin (Blue Line),Oak Park,Blue,
1,40020,Harlem/Lake (Green Line),,41.886848,-87.803176,Harlem/Lake (Green Line),Oak Park,Green,
2,40030,Pulaski (Green Line),27.0,41.885412,-87.725404,Pulaski (Green Line),Chicago,Green,WEST GARFIELD PARK
3,40040,"Quincy/Wells (Brown, Orange, Purple & Pink lines)",38.0,41.878723,-87.63374,"Quincy/Wells (Brown, Orange, Purple & Pink lines)",Chicago,Multi,LOOP
4,40050,Davis (Purple Line),,42.04771,-87.683543,Davis (Purple Line),Evanston,Purple,
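
A left merge silently leaves `community_name` empty when a key finds no match. A minimal sketch on toy data, assuming string ids on both sides as in the merge above, uses the `indicator` option of `merge` to list the unmatched stations. The toy tables and the name `unmatched` are illustrative.

```python
import pandas as pd

# toy station table: ids are strings, with one missing value (a suburban stop)
stations = pd.DataFrame({
    "map_id": [40030, 40040, 40050],
    "community_id": ["27", "38", None],
})

# toy lookup table with string ids, as in df_communities
communities = pd.DataFrame({
    "community_id": ["27", "38"],
    "community_name": ["WEST GARFIELD PARK", "LOOP"],
})

# indicator=True adds a _merge column showing where each row found a match
merged = stations.merge(communities, on="community_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["map_id"].tolist())
```

Rows marked `left_only` are the stations without a community area, which is the same group counted in the data quality checks further down.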


In [32]:
df_stations.query("community_id=='75'")

Unnamed: 0,map_id,station_x,community_id,lat,long,station_y,city,line,community_name
20,40230,Cumberland (Blue Line),75,41.984246,-87.838028,Cumberland (Blue Line),Chicago,Blue,OHARE
81,40890,O'Hare (Blue Line),75,41.977665,-87.904223,O'Hare (Blue Line),Chicago,Blue,OHARE


<h3>Merge ridership with station data</h3>

In [33]:
df_station_summary= pd.merge(df_stations, df_riders_by_year, left_on='map_id', right_on='station_id')

In [34]:
#reorder columns and exclude unnecessary columns
#df_station_summary = df_station_summary.reindex(\
#    columns=['station','2019','2020','2021','2022','lat','long','community_name','map_id'])

In [35]:
#not sure why I need to do this again...
df_station_summary['2019'] = df_station_summary['2019'].astype('int')
df_station_summary['2020'] = df_station_summary['2020'].astype('int')
df_station_summary['2021'] = df_station_summary['2021'].astype('int')
df_station_summary['2022'] = df_station_summary['2022'].astype('int')

<h1>Calculate Summary Metrics</h1>

<h3>Calculate Ridership Remaining in 2022</h3>

In [36]:
#get 2022 as a percent of 2019 ridership
df_station_summary['pct_19in22']=\
    (100*df_station_summary['2022']/df_station_summary['2019']).round(decimals=0)

#exaggerate differences for visual metric
df_station_summary['pct_19in22viz']=\
    df_station_summary['pct_19in22']*df_station_summary['pct_19in22']*df_station_summary['pct_19in22']

#put commas in annual ridership stats, but this is only for temporary output
#df_station_summary.style.format({"2019": "{:,.0f}", "2020": "{:,.0f}", "2021": "{:,.0f}", "2022": "{:,.0f}"})

In [37]:
df_station_summary.head()

Unnamed: 0,map_id,station_x,community_id,lat,long,station_y,city,line,community_name,2019,2020,2021,2022,pct_19in22,pct_19in22viz
0,40010,Austin (Blue Line),,41.870851,-87.776812,Austin (Blue Line),Oak Park,Blue,,543533,207693,173234,208113,38.0,54872.0
1,40020,Harlem/Lake (Green Line),,41.886848,-87.803176,Harlem/Lake (Green Line),Oak Park,Green,,1101813,489376,455310,511312,46.0,97336.0
2,40030,Pulaski (Green Line),27.0,41.885412,-87.725404,Pulaski (Green Line),Chicago,Green,WEST GARFIELD PARK,455496,234734,216973,206449,45.0,91125.0
3,40040,"Quincy/Wells (Brown, Orange, Purple & Pink lines)",38.0,41.878723,-87.63374,"Quincy/Wells (Brown, Orange, Purple & Pink lines)",Chicago,Multi,LOOP,2188354,595970,509549,910782,42.0,74088.0
4,40050,Davis (Purple Line),,42.04771,-87.683543,Davis (Purple Line),Evanston,Purple,,1091440,393596,392194,468102,43.0,79507.0
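
To make the two metrics concrete, here is the arithmetic for a single station, using the Austin (Blue Line) totals from the table above. It is only a worked example of the same formulas, not a replacement for the dataframe code.

```python
# annual station entries for Austin (Blue Line), copied from the table above
rides_2019 = 543533
rides_2022 = 208113

# 2022 ridership as a percent of 2019 ridership, rounded to a whole number
pct_19in22 = round(100 * rides_2022 / rides_2019)

# cubing the percentage spreads the values apart for the visual metric
pct_19in22viz = pct_19in22 ** 3

print(pct_19in22, pct_19in22viz)  # prints: 38 54872
```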


<h3>Get Community Summary</h3>

In [38]:
# #set two-part primary key for stations, including community ID and community name
# df_station_summary_mi = df_station_summary.set_index(['community_id','community_name'])

In [39]:
# #pivot, based on this two-part key
# df_community_summary = df_station_summary_mi.pivot_table(\
#     index=["community_id","community_name"],\
#     values=['2019','2022'],\
#     aggfunc=['sum','count'])

In [40]:
#pivot
df_community_summary = df_station_summary.pivot_table(\
    index=["community_name"],\
    values=['2019','2022'],\
    aggfunc=['sum','count'])

In [41]:
#rename columns and drop second station count
df_community_summary.columns=['2019','2022','stations','stations2']
df_community_summary=df_community_summary.drop('stations2', axis=1)

In [42]:
#get 2022 as a percent of 2019 ridership
df_community_summary['pct_19in22']=\
    (100*df_community_summary['2022']/df_community_summary['2019']).round(decimals=0)
#exaggerate differences for visual metric
df_community_summary['pct_19in22viz']=\
    df_community_summary['pct_19in22']*df_community_summary['pct_19in22']*df_community_summary['pct_19in22']

In [43]:
df_community_summary.head()

Unnamed: 0_level_0,2019,2022,stations,pct_19in22,pct_19in22viz
community_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ALBANY PARK,2312061,1256504,3,54.0,157464.0
ARMOUR SQUARE,2847378,1544260,2,54.0,157464.0
AUSTIN,2236793,1139834,5,51.0,132651.0
AVONDALE,1460132,826171,1,57.0,185193.0
BRIDGEPORT,766196,405877,1,53.0,148877.0
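
The pivot above produces two-level column names that then have to be renamed and trimmed. As a sketch of an alternative, `groupby` with named aggregation yields flat column names directly. The toy rows reuse station totals from earlier outputs, and the row with a missing `community_name` stands in for a suburban stop; `groupby` drops it by default, which matches the behavior noted in the caveats.

```python
import pandas as pd

# toy station rows; None stands in for a suburban stop without a community area
toy = pd.DataFrame({
    "community_name": ["LOOP", "LOOP", "AVONDALE", None],
    "2019": [2900809, 2268194, 1460132, 1091440],
    "2022": [988180, 818687, 826171, 468102],
})

# named aggregation: output column name -> (input column, function)
summary = toy.groupby("community_name").agg(
    **{
        "2019": ("2019", "sum"),
        "2022": ("2022", "sum"),
        "stations": ("2019", "count"),
    }
)

# 2022 as a percent of 2019 ridership
summary["pct_19in22"] = (100 * summary["2022"] / summary["2019"]).round(0)
print(summary)
```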


<h3>Get Citywide Summary</h3>

In [44]:
df_citywide = pd.DataFrame({'year':[2019,2020,2021,2022],'nRides':[df_station_summary['2019'].sum(),df_station_summary['2020'].sum(),df_station_summary['2021'].sum(),df_station_summary['2022'].sum()]})

In [45]:
df_citywide

Unnamed: 0,year,nRides
0,2019,179071205
1,2020,62340303
2,2021,66169687
3,2022,87306908


<a name = "test"></a>
    <table><tr><td bgcolor = grey align="center"><h1>Data Quality Checks and Testing</h1></td></tr></table>
<table align=left>
    <tr><td><b>Observation</b></td><td><b>Interpretation</b></td></tr>
    <tr><td>20 stops do not have community areas assigned</td><td>suburban CTA stops do not have community areas or census tracts assigned</td></tr>
    <tr><td>Lawrence and Berwyn have zero ridership in 2022</td><td>those stations temporarily closed as of May 2021</td></tr>
</table>

<b>Other Caveats</b>
<ul>
    <li>Lawrence (40770) and Berwyn (40340) show the largest ridership drops, because those stations temporarily closed as of May 2021
    <li>Argyle (41200) and Wilson (40540) show the smallest ridership drops, possibly because they picked up riders from the adjacent Lawrence and Berwyn stops
    <li>The CTA dataset does not provide community area or census tract data for suburban stops (e.g. Purple Line, Yellow Line), so those stops do not show up in the community area summary
    <li>This analysis covers entries at CTA train stations only; in reality, riders may also enter the system by bus and then transfer
    </ul>

In [46]:
#how many stations don't have a community area assigned?
df_station_summary.community_name.isna().value_counts()

community_name
False    123
True      20
Name: count, dtype: int64

<h3>Look for Outliers</h3>

In [47]:
#ridership outliers in 2019
df_station_summary.sort_values('2019').loc[:,['station_x','2019']]

Unnamed: 0,station_x,2019
103,King Drive (Green Line),139131
55,Kostner (Pink Line),144034
86,Halsted (Green Line),159214
27,Indiana (Green Line),231963
77,South Boulevard (Purple Line),232812
...,...,...
81,O'Hare (Blue Line),3811167
34,Washington (Blue Line),4176948
132,Chicago (Red Line),4501851
35,"Clark/Lake (Blue, Brown, Green, Orange, Purple...",5830767


In [48]:
#ridership outliers in 2022
df_station_summary.sort_values('2022').loc[:,['station_x','2022']]

Unnamed: 0,station_x,2022
70,Lawrence (Red Line),0
31,Berwyn (Red Line),0
103,King Drive (Green Line),63011
55,Kostner (Pink Line),73788
86,Halsted (Green Line),76495
...,...,...
111,"Fullerton (Red, Brown & Purple lines)",2031024
132,Chicago (Red Line),2079126
35,"Clark/Lake (Blue, Brown, Green, Orange, Purple...",2220778
81,O'Hare (Blue Line),2368464


In [49]:
#does community sum work the way it should? confirmed.

#find community areas with multiple stations
#df_station_summary.community_area.value_counts()

#this community area has 2 stations, good for testing
#df_station_summary.query("community_area=='9'")

<table><tr><td bgcolor = turquoise align="center"><a name = "findings"></a><h1>Key Findings</h1></td></tr></table>

<h3>Context</h3>
I spoke with Maddie Kilgannon, CTA's Media Relations contact.<br>
She suggested looking at the broader context of transit data (bus ridership, cars, etc.) and contacting the RTA.<br>

<h3>By Station - Highest Ridership in 2022 as % of 2019 Ridership</h3>
Excludes Argyle and Wilson, which picked up riders from neighboring stops (Lawrence and Berwyn) that closed for construction effective 5/16/21

In [50]:
#remove Argyle and Wilson, which picked up riders as Lawrence and Berwyn closed for construction in May 2021
df_top5 = df_station_summary.query("map_id != 41200 and map_id != 40540").sort_values("pct_19in22").tail().loc[:,['station_x','2019','2022','pct_19in22']]
df_top5

Unnamed: 0,station_x,2019,2022,pct_19in22
95,Kedzie (Pink Line),325089,213900,66.0
18,Damen (Pink Line),463277,311301,67.0
40,California (Pink Line),431296,290906,67.0
67,Western (Pink Line),331454,223641,67.0
137,Morgan (Green & Pink lines),1105090,789585,71.0


<h3>By Station - Lowest Ridership in 2022 as % of 2019 Ridership</h3>
Excludes <a href="https://www.transitchicago.com/travel-information/alert-detail/?AlertId=75825" target="_blank">Lawrence (Red Line)</a> and <a href="https://www.transitchicago.com/travel-information/alert-detail/?AlertId=75824" target="_blank">Berwyn (Red Line)</a>, which were closed for construction effective 5/16/21

In [51]:
#remove Lawrence and Berwyn, which closed for construction in May 2021
df_bottom5 = df_station_summary.query("map_id != 40770 and map_id != 40340").sort_values("pct_19in22").head().loc[:,['station_x','2019','2022','pct_19in22']]
df_bottom5

Unnamed: 0,station_x,2019,2022,pct_19in22
16,Oak Park (Blue Line),504438,164142,33.0
100,Monroe (Red Line),2900809,988180,34.0
72,Monroe (Blue Line),2268194,818687,36.0
51,Jackson (Red Line),2601587,959018,37.0
89,Harlem (Blue Line - Forest Park Branch),344096,131278,38.0


<h3>By Community Area - Highest Ridership in 2022 as % of 2019 Ridership</h3>

In [52]:
df_community_summary.sort_values("pct_19in22").tail()

Unnamed: 0_level_0,2019,2022,stations,pct_19in22,pct_19in22viz
community_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GAGE PARK,1089224,627572,1,58.0,195112.0
NORTH LAWNDALE,1140691,682927,4,60.0,216000.0
BRIGHTON PARK,1009589,619023,1,61.0,226981.0
LOWER WEST SIDE,1822986,1191416,4,65.0,274625.0
SOUTH LAWNDALE,431296,290906,1,67.0,300763.0


<h3>By Community Area - Lowest Ridership in 2022 as % of 2019 Ridership</h3>

In [53]:
df_community_summary.sort_values("pct_19in22").head()

Unnamed: 0_level_0,2019,2022,stations,pct_19in22,pct_19in22viz
community_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
EDGEWATER,4535625,1660736,4,37.0,50653.0
NORTH CENTER,1671607,705217,2,42.0,74088.0
ENGLEWOOD,1096944,464030,2,42.0,74088.0
WOODLAWN,442000,194267,2,44.0,85184.0
LOOP,43653015,19255145,16,44.0,85184.0


<h3>Change in Stations in the Loop</h3>

In [54]:
df_station_summary_loop = df_station_summary.query("community_name =='LOOP'").sort_values("pct_19in22")
df_station_summary_loop.loc[:,['station_x','2019','2022','pct_19in22']]

Unnamed: 0,station_x,2019,2022,pct_19in22
100,Monroe (Red Line),2900809,988180,34.0
72,Monroe (Blue Line),2268194,818687,36.0
51,Jackson (Red Line),2601587,959018,37.0
35,"Clark/Lake (Blue, Brown, Green, Orange, Purple...",5830767,2220778,38.0
14,"LaSalle/Van Buren (Brown, Orange, Purple & Pin...",830063,343373,41.0
3,"Quincy/Wells (Brown, Orange, Purple & Pink lines)",2188354,910782,42.0
6,Jackson (Blue Line),2031329,858993,42.0
66,"Washington/Wells (Brown, Orange, Purple & Pink...",2214522,948336,43.0
138,Lake (Red Line),6450839,2749185,43.0
34,Washington (Blue Line),4176948,1846738,44.0


<h1>Export Data for Visualizations</h1>
<ul>
    <li>citywide ridership by year
    <li>by station: change in ridership, 2019 to 2022 (station name, lat/long, % change, 2019, 2022)
    <li>by community area: change in ridership, 2019 to 2022 (community area, # stations, % change, 2019, 2022)
    </ul>

In [55]:
df_station_summary.to_csv('results/station_summary.csv')
df_community_summary.to_csv('results/community_summary.csv')
df_citywide.to_csv('results/citywide.csv')
df_station_summary_loop.to_csv('results/loop_stations.csv')

#don't format datawrapper data
df_top5.to_csv('results/top5.csv')
df_bottom5.to_csv('results/bottom5.csv')