In [2]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages
library(readr)
library(janitor)
library(lubridate)
library(ggplot2)
library(maps)
library(tidyr)
library(dplyr)
library(tidyselect)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

##Introduction: 
  I am doing this case study as a capstone project after completion of the Google Intro Data Analytics Certificate Course. This covers all steps of the data analysis process. These steps are ask, prepare, process, analyze, share, and act.
  Growing up in Chicago, I took the train to high school throughout the year. I often wished that the station platforms were enclosed because it was really cold to wait outside. They did have a few heaters on in the winter but if there were too many people those spots were already taken. If there indeed is a significant decline in the number of people using the trains during the winter, this information could be used to help make the case for enclosing stations.
  This case study looks at how many people entered each station each day from January 1, 2001 to December 31, 2020.

##Working Hypothesis
  I have three working hypotheses.
  1. More people use the 'L' during the summer than during the winter.
  2. More people commute to downtown for work than take the 'L' for leisure.
  3. Individual stations, bus stops, or bus lines are likely to have a similar number of riders year over year with little relative movement. In other words, the station that sees the most passengers is likely to be the same from year to year. The pandemic may be an exception that changes the relative positions.

##Stakeholders
  The stakeholders, or people who might be interested in this case study and its results, are the Chicago Transit Authority (CTA), the citizens of Chicago, the government of the City of Chicago, and possibly tourists or visitors.
  
##Description
  To test these hypotheses I looked at open data published by the CTA and the City of Chicago. See the "Resources" section at the bottom of the page for links to the data sources.
  
##End Deliverables
Publish on GitHub, Kaggle, & Personal Website

In [4]:
train_boardings_daily_df <- read_csv("../input/cta-ridership-l-station-entries-daily-totals/CTA_-_Ridership_-__L__Station_Entries_-_Daily_Totals.csv", show_col_types = FALSE)

# View of the data
train_boardings_daily_df <- arrange(train_boardings_daily_df, desc(rides))
View(train_boardings_daily_df)

# Preview of top stations
top_stations <- subset(train_boardings_daily_df, rides > 24000)
View(top_stations)

##Convert and clean the date field for ordering and organization.
In this code chunk we are converting the "date" field from a 'chr' data type to a 'date' data type. We are also identifying the stations with the highest and lowest passenger volumes. For highest volume we looked at both daily values and total passenger volume across the entire data set. For lowest volume, we only looked at total passenger volume across the data set because there were many stations with 0 passengers per day.

In [5]:
# Convert the "date" field data type from character to date.
rides_by_station <- data.frame(
  station_id = train_boardings_daily_df$station_id,
  station_name = train_boardings_daily_df$stationname,
  new_date = mdy(train_boardings_daily_df$date),
  day_type = train_boardings_daily_df$daytype,
  rides = train_boardings_daily_df$rides
)

# Add the day of the week in a new field with the wday function
rides_by_station2 <- mutate(rides_by_station, day_of_week = wday(new_date, label = TRUE, abbr = FALSE))

#Identify the top 3 stations by maximum single day volume (station_id 41320, 41660, 41420).
daily_rides_by_station <- arrange(rides_by_station2, rides)
View(daily_rides_by_station)
tail(daily_rides_by_station, n = 4)

# Arrange the data in chronological order
chrono_rides_by_station <- arrange(rides_by_station2, new_date)
View(chrono_rides_by_station)

#Identify the top 3 stations by total passenger volume, skipping stations already plotted above. IDs wrapped in x...x are excluded (station_id x41660x, 40380, 41450, x41320x, 40450).
#Identify the bottom 3 stations by total passenger volume (station_id x41580x, 41680, 40600, 41690). station_id 41580 looks like an outlier with only 27 total rides, so it was excluded.
total_rides_by_station <- rides_by_station2 %>% group_by(station_id) %>% summarize(total_riders = sum(rides))
total_rides_by_station2 <- arrange(total_rides_by_station, total_riders)
View(total_rides_by_station2)
head(total_rides_by_station2, n = 4)
tail(total_rides_by_station2, n = 5)
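
The ranking step above can also be done without reading `head()` and `tail()` output by eye. The following is a minimal sketch on toy data, not the notebook's actual figures: the station IDs, ride counts, and the cutoff of 50 rides used to drop near-empty stations are all assumptions made for illustration.

```r
library(dplyr)

# Toy data: hypothetical station IDs and ride counts, not real CTA figures
toy_rides <- tibble(
  station_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  rides      = c(100, 150, 900, 800, 0, 27, 400, 500)
)

station_totals <- toy_rides %>%
  group_by(station_id) %>%
  summarize(total_riders = sum(rides), .groups = "drop")

# Busiest stations by total volume
top_2 <- slice_max(station_totals, total_riders, n = 2)

# Quietest stations, after excluding near-empty outliers
# (the cutoff of 50 rides is an arbitrary assumption for this toy data)
bottom_2 <- station_totals %>%
  filter(total_riders > 50) %>%
  slice_min(total_riders, n = 2)

print(top_2)
print(bottom_2)
```

Making the outlier cutoff explicit in code records why a station such as 41580 was left out, instead of relying on a comment.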

##What follows is the data on top three stations with the highest number of passenger entries on a single day.

###Highest Daily Passenger Count First Place

In [6]:
# Filter the data so that we can look at just one station over any custom time range. Note: the == operator was not working as intended for the station_id, so I used a narrow range around the ID instead.
station_id_41320 <- subset(chrono_rides_by_station, station_id > 41319 & station_id < 41321 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_41320)

# plot the data
belmont_main_rides_per_day <- ggplot(data = station_id_41320) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the Belmont-North Main 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
belmont_main_rides_per_day # + scale_x_date(date_labels = "%Y")


###Highest Daily Passenger Count Second Place

In [7]:
station_id_41660 <- subset(chrono_rides_by_station, station_id > 41659 & station_id < 41661 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_41660)

lake_state_rides_per_day <- ggplot(data = station_id_41660) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the Lake/State 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
lake_state_rides_per_day # + scale_x_date(date_labels = "%Y")

###Highest Daily Passenger Count Third Place

In [8]:
station_id_41420 <- subset(chrono_rides_by_station, station_id > 41419 & station_id < 41421 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_41420)

addison_main_rides_per_day <- ggplot(data = station_id_41420) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the Addison-North Main 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
addison_main_rides_per_day # + scale_x_date(date_labels = "%Y")

##Top three stations by total passenger entries across the data set
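The five highest totals belong to station_id 41660, 40380, 41450, 41320, and 40450. Stations 41660 and 41320 were already plotted above, so the remaining three are plotted here.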

###Highest Total Passenger Count First Place is a duplicate with station_id 41660

###Highest Total Passenger Count Second Place

In [9]:
station_id_40380 <- subset(chrono_rides_by_station, station_id > 40379 & station_id < 40381 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_40380)

clark_lake_rides_per_day <- ggplot(data = station_id_40380) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the Clark/Lake 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
clark_lake_rides_per_day # + scale_x_date(date_labels = "%Y")

###Highest Total Passenger Count Third Place

In [10]:
station_id_41450 <- subset(chrono_rides_by_station, station_id > 41449 & station_id < 41451 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_41450)

chicago_state_rides_per_day <- ggplot(data = station_id_41450) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the Chicago/State 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
chicago_state_rides_per_day # + scale_x_date(date_labels = "%Y")

###Highest Total Passenger Count Fourth Place is a duplicate with station_id 41320

###Highest Total Passenger Count Fifth Place

In [11]:
station_id_40450 <- subset(chrono_rides_by_station, station_id > 40449 & station_id < 40451 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_40450)

nintyfifth_danryan_rides_per_day <- ggplot(data = station_id_40450) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the 95th/Dan Ryan 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
nintyfifth_danryan_rides_per_day # + scale_x_date(date_labels = "%Y")

##Bottom three stations by total passenger entries across the data set
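The four lowest totals belong to station_id 41580, 41680, 40600, and 41690. Station 41580 was excluded as an outlier (only 27 total rides), so the remaining three are plotted here.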

###Lowest Total Passenger Count - Last Place

In [12]:
station_id_41680 <- subset(chrono_rides_by_station, station_id > 41679 & station_id < 41681 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_41680)

oakton_skokie_rides_per_day <- ggplot(data = station_id_41680) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the Oakton-Skokie 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
oakton_skokie_rides_per_day # + scale_x_date(date_labels = "%Y")

###Lowest Total Passenger Count - Second to Last Place

In [13]:
station_id_40600 <- subset(chrono_rides_by_station, station_id > 40599 & station_id < 40601 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_40600)

kostner_rides_per_day <- ggplot(data = station_id_40600) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the Kostner 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
kostner_rides_per_day # + scale_x_date(date_labels = "%Y")

###Lowest Total Passenger Count - Third to Last Place

In [14]:
station_id_41690 <- subset(chrono_rides_by_station, station_id > 41689 & station_id < 41691 & new_date >= "2001-01-01" & new_date <= "2020-12-31")
View(station_id_41690)

cermak_mccormic_place_rides_per_day <- ggplot(data = station_id_41690) +
  geom_point(mapping = aes(x = new_date, y = rides, color = day_of_week)) +
  scale_x_date(minor_breaks = seq(as.Date("2001-01-01"), as.Date("2020-01-01"), by = "years")) +
  labs(x = "Year", y = "Number of Riders", title = "Number of Riders Entering at the Cermak-McCormick Place 'L' Station Daily", subtitle = "By day of the week", legend = "Day of the Week", color = "Day of the Week")
cermak_mccormic_place_rides_per_day # + scale_x_date(date_labels = "%Y")
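
The plotting cells above differ only in the station ID and the title, so they could be collapsed into one helper. This is a minimal sketch under stated assumptions: `plot_station_daily` is a hypothetical name, the toy rows stand in for `chrono_rides_by_station`, and `station_id` is assumed to be stored as a number. If the column were read as text, a plain `==` comparison against a number may not behave as expected and the column would need converting first.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: filter one station and build its daily scatter plot
plot_station_daily <- function(df, id, station_label,
                               start = as.Date("2001-01-01"),
                               end = as.Date("2020-12-31")) {
  station_df <- df %>%
    filter(station_id == id, new_date >= start, new_date <= end)

  ggplot(station_df) +
    geom_point(aes(x = new_date, y = rides, color = day_of_week)) +
    labs(
      x = "Year", y = "Number of Riders",
      title = paste0("Number of Riders Entering at the ", station_label,
                     " 'L' Station Daily"),
      subtitle = "By day of the week", color = "Day of the Week"
    )
}

# Toy rows standing in for chrono_rides_by_station (values are made up)
toy <- tibble(
  station_id  = c(41320, 41320, 41660),
  new_date    = as.Date(c("2019-06-29", "2019-06-30", "2019-06-30")),
  rides       = c(9000, 30000, 15000),
  day_of_week = c("Saturday", "Sunday", "Sunday")
)

belmont_plot <- plot_station_daily(toy, 41320, "Belmont-North Main")
```

With the real data, the call would look like `plot_station_daily(chrono_rides_by_station, 41420, "Addison-North Main")`.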

##Analysis
Now that I have seen the characteristics of a few 'L' stations graphed out, I think some statistical analyses could give interesting insights into the system and each station. One of the key differences between the graphs is the gap between weekday and weekend ridership. Perhaps predictably, weekday ridership is usually higher than weekend ridership. However, some stations have much higher ridership during the week, about 400% or more above their weekend numbers. Other stations have a smoother gradation, with weekdays busier by only about 50% to 100%. In general, the spread of the weekday numbers appears to be much narrower than that of the weekend numbers.
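
The observation about spread could be checked numerically. This is a minimal sketch on made-up numbers for one hypothetical station, using the coefficient of variation (standard deviation divided by mean) as one possible measure of relative spread. The `day_type` codes follow the dataset's convention (W = weekday, A = Saturday, U = Sunday/holiday).

```r
library(dplyr)

# Toy data: hypothetical counts for one station, not real CTA figures
toy <- tibble(
  station_id = 1,
  day_type   = c("W", "W", "W", "W", "A", "U", "A", "U"),
  rides      = c(1000, 1020, 980, 1000, 300, 150, 500, 250)
)

spread_by_type <- toy %>%
  mutate(wd_vs_we = if_else(day_type == "W", "weekday", "weekend")) %>%
  group_by(station_id, wd_vs_we) %>%
  summarize(
    mean_rides = mean(rides),
    sd_rides   = sd(rides),
    cv         = sd_rides / mean_rides,  # relative spread
    .groups    = "drop"
  )

print(spread_by_type)
```

A smaller `cv` for weekdays than for weekends would support the impression that weekday ridership is more tightly clustered.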

###Stadiums & Seasonality
However, the Addison station had an interesting mix, with the difference between summer and winter ridership being much larger than the weekly variation. This is most likely due to the influence of Wrigley Field games on that station's ridership patterns. Furthermore, while all of the stations typically see higher ridership in the summer, the relationship is particularly pronounced at the Addison station.
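
One way to put a number on seasonality is a summer-to-winter ratio of average daily rides per station. This is a minimal sketch on toy data; the season definitions (June to August as summer, December to February as winter) are an assumption, and the ride counts are made up.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy data: station 1 is strongly seasonal, station 2 is nearly flat
toy <- tibble(
  station_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  new_date   = as.Date(c("2019-01-15", "2019-02-15", "2019-07-15", "2019-08-15",
                         "2019-01-15", "2019-02-15", "2019-07-15", "2019-08-15")),
  rides      = c(2000, 2200, 6000, 6600, 900, 1100, 1000, 1200)
)

seasonal_ratio <- toy %>%
  mutate(season = case_when(
    month(new_date) %in% c(12, 1, 2) ~ "winter",
    month(new_date) %in% c(6, 7, 8)  ~ "summer",
    TRUE                             ~ "other"
  )) %>%
  filter(season != "other") %>%
  group_by(station_id, season) %>%
  summarize(mean_rides = mean(rides), .groups = "drop") %>%
  pivot_wider(names_from = season, values_from = mean_rides) %>%
  mutate(summer_to_winter = summer / winter)

print(seasonal_ratio)
```

Stations near seasonal venues would be expected to show a noticeably higher `summer_to_winter` value than commuter stations.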

###Events & Daily Ridership Number Outliers
Specific local events near train stations can have an outsized impact on ridership on specific days. For example, the Belmont station has an annual peak on one Sunday every year, ranging from 2 to 4 times its normal daily ridership. These numbers are explained by the annual gay pride parade, which is held only 2 blocks away from the station. Similar patterns can be seen at the Cermak-McCormick Place 'L' station, which has a single weekend every year with about double to triple the normal daily traffic on Friday, Saturday, and Sunday; this coincides with the International Manufacturing Technology Show.
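
Event days like these could be flagged automatically by comparing each day with the typical value for the same station and day of the week. This is a minimal sketch on toy data; the threshold of 2 times the median is an assumption taken from the "2 to 4 times" range mentioned above, and the ride counts are made up.

```r
library(dplyr)

# Toy data: six consecutive Sundays at one hypothetical station
toy <- tibble(
  station_id  = 1,
  new_date    = as.Date("2019-06-02") + 7 * 0:5,
  day_of_week = "Sunday",
  rides       = c(5000, 5200, 4800, 5100, 15000, 4900)
)

event_days <- toy %>%
  group_by(station_id, day_of_week) %>%
  mutate(
    typical  = median(rides),   # median is robust to the spike itself
    multiple = rides / typical
  ) %>%
  ungroup() %>%
  filter(multiple >= 2)

print(event_days)
```

On the full dataset it may make sense to compute the median per year as well, since baseline ridership drifts over two decades.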

###Low Usage Commuter Stations
The Oakton-Skokie station shows some interesting patterns that are harder to see at busier stations. In particular, the seasonal shift differs much more obviously between weekday and weekend ridership. At Oakton-Skokie, weekend daily ridership generally peaks in mid to late summer, while weekday ridership peaks in late fall and then drops drastically during the winter. This might indicate an opportunity for a spring drive at this location to increase daily commuter ridership during the late spring and early summer months.

###The Pandemic
Last but not least, the pandemic. Wow, what a drop in ridership. What is interesting is that weekday ridership numbers remained higher than weekend ridership numbers. Furthermore, while the range tightened significantly, it appears that the ratio of weekday to weekend traffic may have held fairly steady despite the rapid and unprecedented changes in ridership volume.
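
Whether the weekday-to-weekend ratio really held steady could be tested by computing it separately before and during the pandemic. This is a minimal sketch on toy data; the cutoff date of 2020-03-15 is an assumption chosen for illustration, and the ride counts are made up.

```r
library(dplyr)
library(tidyr)

# Toy data: one weekday and one weekend day in each period
toy <- tibble(
  new_date = as.Date(c("2019-10-07", "2019-10-12", "2020-10-05", "2020-10-10")),
  wd_vs_we = c("weekday", "weekend", "weekday", "weekend"),
  rides    = c(10000, 4000, 2500, 1000)
)

# Assumed cutoff for the start of the pandemic period
pandemic_start <- as.Date("2020-03-15")

ratio_by_period <- toy %>%
  mutate(period = if_else(new_date < pandemic_start, "before", "during")) %>%
  group_by(period, wd_vs_we) %>%
  summarize(mean_rides = mean(rides), .groups = "drop") %>%
  pivot_wider(names_from = wd_vs_we, values_from = mean_rides) %>%
  mutate(ratio = weekday / weekend)

print(ratio_by_period)
```

Similar `ratio` values in both rows would support the impression from the plots; adding `station_id` to the grouping would give the comparison per station.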

##Weekday Versus Weekend Ridership
For each station, average daily ridership on weekdays is compared with average daily ridership on weekends, expressed as a ratio. The code below classifies days by day of the week, so holidays that fall on a weekday are counted as weekdays.

In [15]:
#Group the data by station_id and day of week and output the sum total of rides.
rides_station_day <- chrono_rides_by_station %>%
  group_by(station_id, day_of_week) %>%
  dplyr::summarize(daily_rides = sum(rides)) %>% 
  as.data.frame()
View(rides_station_day)

#Group the data to find the total rides by day type (A = Saturday, U = Sunday/Holiday, W = Weekday)
avg_day_type <- chrono_rides_by_station%>%
  group_by(station_id, day_type) %>%
  dplyr::summarize(daily_rides = sum(rides)) %>% 
  as.data.frame()
View(avg_day_type)

#Add a new column to differentiate weekends from weekdays
weekday_vs_end <- mutate(chrono_rides_by_station, wd_vs_we = if_else(day_of_week == "Saturday" | day_of_week == "Sunday", "weekend", "weekday"))
View(weekday_vs_end)

#Group the data to find the total rides for the weekdays and the weekends
avg_weekday <- weekday_vs_end%>%
  group_by(station_id, wd_vs_we) %>%
  dplyr::summarize(daily_rides = sum(rides)) %>% 
  as.data.frame()
View(avg_weekday)

#Convert the dataframe from long to wide with rides split into two columns, one for weekends, one for weekdays
station_wd_and_we <- pivot_wider(avg_weekday, names_from = wd_vs_we, values_from = daily_rides)
View(station_wd_and_we)

#Roughly estimate the ratio of average daily riders on weekdays vs weekends as (weekdays/5)/(weekends/2). Many stations have 0 rides on many weekends, which might throw off these numbers. This is a quick approximation; a more accurate figure would require a count of all weekdays and weekend days for each station. If I have more time, I will do this.
wd_to_we_ratio <- transform(station_wd_and_we, ratio = ((weekday / 5)/(weekend /2)))
View(wd_to_we_ratio)

#Now we need to add this to the total_rides_by_station data frame
total_rides_by_station2 <- left_join(total_rides_by_station, wd_to_we_ratio, by = c('station_id' = 'station_id'))
View(total_rides_by_station2)
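
The more accurate version mentioned in the comment above does not need a separate day count: taking `mean(rides)` per group divides by the actual number of weekday and weekend rows for each station. This is a minimal sketch on toy data with made-up values; the column names `avg_daily_rides` and `n_days` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data: one hypothetical station, five weekdays and two weekend days
toy <- tibble(
  station_id = c(1, 1, 1, 1, 1, 1, 1),
  wd_vs_we   = c(rep("weekday", 5), rep("weekend", 2)),
  rides      = c(1000, 1100, 900, 1000, 1000, 400, 600)
)

wd_to_we_exact <- toy %>%
  group_by(station_id, wd_vs_we) %>%
  summarize(avg_daily_rides = mean(rides), n_days = n(), .groups = "drop") %>%
  pivot_wider(names_from = wd_vs_we, values_from = c(avg_daily_rides, n_days)) %>%
  mutate(ratio = avg_daily_rides_weekday / avg_daily_rides_weekend)

print(wd_to_we_exact)
```

Whether days with 0 rides should count toward the average (for example, when a station was closed) is a judgment call; adding `filter(rides > 0)` before the grouping would exclude them.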

##Mapping the station counts

##Load the relevant data for mapping

In [16]:
# This is the data in CTA's format.
train_station_locations <- read_csv("../input/cta-list-of-l-stops/CTA_-_System_Information_-_List_of__L__Stops.csv", show_col_types = FALSE)
View(train_station_locations)

# This is the data that the CTA has published according to the General Transit Feed Specifications (GTFS)
stops_and_stations <- read_csv("../input/cta-stops-gtfs-format/stops.txt", show_col_types = FALSE)
View(stops_and_stations)

# But first let's just test the GTFS data, since it does not appear to require any latitude or longitude processing for use in ggplot2 other than filtering out any NA values in the Lat/Long fields.
stop_and_station_locations <- subset(stops_and_stations, stop_lat >= 1.0)
View(stop_and_station_locations)

# After initial review of the data available in the passenger count tables and the data available in the CTA and GTFS formats, it appears that the station_id values match with that in the GTFS dataset. Therefore, I will take latitude and longitude coordinates and then put that data back into the passenger count tables in order to display the passenger count intensities on a map.

# To use this data for mapping we must join the data with the common station_id values
total_rides_lat_lon <- left_join(total_rides_by_station2, stop_and_station_locations, by = c('station_id' = 'stop_id'))
View(total_rides_lat_lon)

# filtering out any NA values in the Lat/Long fields created after the merge.
total_rides_lat_lon2 <- subset(total_rides_lat_lon, stop_lat >= 1.0)
View(total_rides_lat_lon2)

# Plot a map of cook county
all_counties <- map_data("county") #pull the map data
cook_county <- subset(all_counties, region == "illinois" & subregion == "cook") #filter for cook county illinois
#View(cook_county)
countymap <- ggplot(cook_county, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
countymap + coord_map("azequalarea") #+ xlim(-88.4, -87.5) + ylim(41.4, 42.3)

####for reference on how to use grepl("searchterm", data) in a dataframe

# Plot a map of the communities in the City of Chicago
# Source https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6
# This is the data in this City of Chicago format.
chicago_communities <- read_csv("../input/chicago-community-areas-map/CommAreas.csv", show_col_types = FALSE)
chicago_communities2 <- arrange(chicago_communities, AREA_NUMBE)
head(chicago_communities2, n = 4)

#After reviewing the data it appears that the shapefile definition is squished into one field. This needs to be systematically separated out into two columns with longitude and latitude and then connected.
#First step is to separate the list of coordinates into rows.
coordinate_rows <- separate_rows(chicago_communities2, the_geom, sep = ",\\s*")
View(coordinate_rows)
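#Then clean the coordinates by removing the MULTIPOLYGON label and every parenthesis (str_remove_all also catches the extra parentheses that separate the parts of a multi-part area, which would otherwise turn into NA when converted to numbers)
clean_coords <- data.frame("lon_lat" = str_remove_all(coordinate_rows$the_geom, "MULTIPOLYGON\\s*|[()]"), "group" = coordinate_rows$AREA_NUMBE, "community_name" = coordinate_rows$COMMUNITY, "area_num_1" = coordinate_rows$AREA_NUM_1)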
View(clean_coords)

#Then separate the coordinates into two columns
sep_coords <- separate(clean_coords, lon_lat, c("longitude", "latitude"), sep = " ")
View(sep_coords)

#Then convert strings to numbers
sep_coords$longitude <- as.numeric(sep_coords$longitude)
sep_coords$latitude <- as.numeric(sep_coords$latitude)
View(sep_coords)

#ggsave(paste(st, "Cook County Map.jpeg", sep = " "), path = "C:/Users/yuserious/Desktop/Data/Projects/2021-12-29 Transportation Case Study/Plots")

# Attempt to plot the total ridership
chicago_ridership <- ggplot(data = total_rides_lat_lon2) +
  geom_point(aes(x = stop_lon, y = stop_lat, size = total_riders))
chicago_ridership + coord_map("azequalarea") + xlim(-88.4, -87.5) + ylim(41.4, 42.3)

# county fips code for cook county Illinois 17031

#ggsave(paste(st, "Ridership Map.jpeg", sep = " "), path = "C:/Users/yuserious/Desktop/Data/Projects/2021-12-29 Transportation Case Study/Plots")

# Plot a map of cook county
all_counties <- map_data("county") #pull the map data
cook_county <- subset(all_counties, region == "illinois" & subregion == "cook") #filter for cook county illinois


# Plot the total ridership on a map of cook county
cook_county_ridership <- ggplot(data = total_rides_lat_lon2) +
  geom_polygon(data = cook_county, aes(x = long, y = lat, group = group), fill = "white", colour = "black", alpha = 0.5) +
  geom_point(aes(x = stop_lon, y = stop_lat, size = total_riders))
cook_county_ridership + coord_map("azequalarea")

#ggsave(paste(st, "'L' Ridership on Cook County Map.png", sep = " "), path = "C:/Users/yuserious/Desktop/Data/Projects/2021-12-29 Transportation Case Study/Plots")

# Plot the total ridership on a map of chicago communities
chicago_ridership <- ggplot(data = total_rides_lat_lon2) +
  geom_polygon(data = sep_coords, aes(x = longitude, y = latitude, group = group), fill = "white", colour = "black", alpha = 0.5) +
  geom_point(aes(x = stop_lon, y = stop_lat, size = total_riders, color = ratio)) +
  scale_color_gradientn(colours = rev(rainbow(5)),
                    trans = "log",
                    breaks = c(1, 2, 3, 4, 5, 6, 7),
                    limits = c(1, 8),
                    labels = c(1, 2, 3, 4, 5, 6, 7),
                    guide = guide_coloursteps(even.steps = FALSE,
                                                show.limits = FALSE)) +
  labs(title = "Cumulative Chicago 'L' Station Rides from 2001-01-01 to 2021-01-01", subtitle = "Colored by ratio of average weekday commuters to average weekend commuters")
chicago_ridership + coord_map("azequalarea")

st = format(Sys.time(), "%Y-%m-%d-%H%M")
#ggsave(paste(st, "'L' Ridership on Chicago Map.png", sep = " "), width = 7.5, height = 10, dpi = 150, units = "in", path = "C:/Users/yuserious/Desktop/Data/Projects/2021-12-29 Transportation Case Study/Plots")

# Plot a map of downtown Chicago "Loop" neighborhood

# Pull the cleaned map data from the chicago communities data set and filter for the loop neighborhood only
chicago_loop <- subset(sep_coords, group == 32) #filter for loop neighborhood = number 32 in dataset

# Plot the total ridership on a map of downtown chicago
downtown_ridership <- ggplot(data = total_rides_lat_lon2) +
  geom_polygon(data = chicago_loop, aes(x = longitude, y = latitude, group = group), fill = "white", colour = "black", alpha = 0.5) +
  geom_point(aes(x = stop_lon, y = stop_lat, size = total_riders, color = ratio)) +
  scale_color_gradientn(colours = rev(rainbow(5)),
                    trans = "log",
                    breaks = c(1, 2, 3, 4, 5, 6, 7),
                    limits = c(1, 8),
                    labels = c(1, 2, 3, 4, 5, 6, 7),
                    guide = guide_coloursteps(even.steps = FALSE,
                                                show.limits = FALSE)) +
  labs(title = "Cumulative Downtown Chicago 'L' Station Rides from 2001-01-01 to 2021-01-01", subtitle = "Colored by ratio of average weekday commuters to average weekend commuters")
downtown_ridership + coord_map("azequalarea") + xlim(-87.65, -87.6) + ylim(41.86, 41.9) 
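
The geometry-cleaning steps in the mapping cell above can be sketched on a toy string. This assumes the `the_geom` field follows a `MULTIPOLYGON (((lon lat, ...)), ((lon lat, ...)))` layout; the coordinates below are made up. The sketch also gives each part of a multi-part area its own `group` value so that `geom_polygon` does not draw a line connecting separate parts. Inner rings (holes) are not handled.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy geometry with two separate polygons in one area (made-up coordinates)
toy_areas <- tibble(
  AREA_NUMBE = 1,
  the_geom   = paste0(
    "MULTIPOLYGON (((-87.60 41.80, -87.50 41.80, -87.50 41.90, -87.60 41.80)), ",
    "((-87.70 41.70, -87.65 41.70, -87.65 41.75, -87.70 41.70)))"
  )
)

toy_coords <- toy_areas %>%
  # Split the geometry into its separate polygons first
  separate_rows(the_geom, sep = "\\)\\)\\s*,\\s*\\(\\(") %>%
  group_by(AREA_NUMBE) %>%
  mutate(part = row_number()) %>%
  ungroup() %>%
  # Then one row per coordinate pair
  separate_rows(the_geom, sep = ",\\s*") %>%
  # Drop the label and all parentheses, leaving "lon lat"
  mutate(the_geom = str_trim(str_remove_all(the_geom, "MULTIPOLYGON|[()]"))) %>%
  separate(the_geom, c("longitude", "latitude"), sep = "\\s+", convert = TRUE) %>%
  # One drawing group per polygon part
  mutate(group = paste(AREA_NUMBE, part, sep = "."))

print(toy_coords)
```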

##Conclusions
The three working hypotheses I started out with at the beginning were:
  1. More people use the 'L' during the summer than during the winter.
  2. More people commute to downtown for work than take the 'L' for leisure.
  3. Individual stations, bus stops, or bus lines are likely to have a similar number of riders year over year with little relative movement. The pandemic may be an exception that changes the relative positions.
  
  Hypothesis 1, that more people use the 'L' during the summer than during the winter, does seem to be mostly true. All of the stations see some seasonal fluctuation, peaking from mid summer to late fall, although the degree varies from station to station. More analysis will have to be done, but it appears that employment commutes are less affected by the seasons than leisure rides. This supports the idea that people choose other modes of transportation when the weather causes discomfort, although to be certain one would need to look at total trips taken across the seasons to see whether it is a general reduction in travel rather than a shift in mode of transport.
  
  Hypothesis 2, that more people commute to downtown for work than take the 'L' for leisure, seems to be supported by the data, although there are some notable exceptions. In particular, stations near major cultural, shopping, or sporting areas seem to have a more equal ratio of weekday to weekend trips, although average daily weekday trips are higher than average daily weekend trips at all of the stations. The map of the city of Chicago shows an interesting pattern. Most of the stations located on feeder train lines into downtown have a ratio of around 2:1 of average weekday daily rides to average weekend daily rides. Where it gets really interesting is downtown, where the ratio either drops closer to 1.5:1 or spikes up to 3:1 or more, and only 4 stations hover around the 2:1 range. The other interesting finding was that the stations by the airports and sports stadiums have a ratio much closer to 1:1.
  
  Hypothesis 3 is that individual stations, bus stops, or bus lines are likely to have a similar number of riders year over year. I didn't actually get around to analyzing this, but based on the data I have looked over it does seem true. The pandemic is the possible exception, where I would need more detailed statistical analysis to see whether the data actually supports the hypothesis.
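
One possible way to test Hypothesis 3 is to compare the rank order of stations between two years. This is a minimal sketch on toy data with made-up values; Spearman rank correlation is a swapped-in technique chosen for illustration, not something the analysis above already uses.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy data: three hypothetical stations observed in two years
toy <- tibble(
  station_id = rep(c(1, 2, 3), times = 2),
  new_date   = as.Date(rep(c("2018-05-01", "2019-05-01"), each = 3)),
  rides      = c(900, 500, 100, 950, 450, 120)
)

yearly_totals <- toy %>%
  mutate(year = year(new_date)) %>%
  group_by(year, station_id) %>%
  summarize(total_riders = sum(rides), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = total_riders, names_prefix = "y")

# Spearman correlation compares the rank order of stations between two years;
# a value near 1 means the ordering barely changed
rank_stability <- cor(yearly_totals$y2018, yearly_totals$y2019,
                      method = "spearman")

print(rank_stability)
```

Repeating this for each pair of consecutive years, including 2019 to 2020, would show whether the pandemic changed the relative positions of stations.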
  
##Act
  Per hypothesis 1, the data suggests that there is an opportunity to increase ridership earlier in the spring and summer months. My personal recommendation would be to enclose the platforms but the cost of implementing this suggestion is most likely prohibitive.
  Per hypothesis 2, the data suggests that there may be an opportunity to increase ridership on the weekends to make better use of existing capacity designed for the higher volume weekdays.
  Hypothesis 3 could be tested further by looking at individual stations before and after renovations to see if renovations that improve rider experience have an effect on the number of people using the station.
  
##Limitations and further research needed:
  Since this is an introductory course to R, I still have to learn how to use many of the statistical analysis and mapping tools that R has in order to evaluate the data and answer many more interesting questions. One question for additional study would be how much the alteration of schedules might impact ridership numbers. It would also be very interesting to see how housing density and commercial density impact 'L' ridership numbers. I hope to come back to this case study at some point to show changes over time and see if population density near a train station is a leading or lagging indicator for public transit usage. Furthermore, I would want to see data on other modes of transportation including but not limited to micro-mobility, pedestrian trips, and cars.

##Resources
The following is a list of resources and where the data is coming from so that anyone looking at this can go back to the original data.

###Datasets
https://www.transitchicago.com/data/
https://data.cityofchicago.org/Transportation/CTA-Ridership-Daily-Boarding-Totals/6iiy-9s97
https://data.cityofchicago.org/Transportation/CTA-Ridership-Avg-Weekday-Bus-Stop-Boardings-in-Oc/mq3i-nnqe
https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Daily-Totals/5neh-572f
https://data.cityofchicago.org/Transportation/CTA-System-Information-List-of-L-Stops/8pix-ypme
https://data.cityofchicago.org/Transportation/CTA-System-Information-Developer-Tool-GTFS-Data/sp6w-yusg
https://data.cityofchicago.org/Transportation/CTA-Bus-Stops-kml/84eu-buny
https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6

###General Info
https://askwonder.com/research/country-level-commuting-mode-share-giaipw40e
https://www7.transportation.gov/transportation-health-tool/indicators/detail/il/msa/chicago#indicators
https://mobilitydata.org/what-we-do/#GTFS
https://www.transportation.gov/mission/health/public-transportation-trips-capita
https://www.bts.gov/statistical-products/surveys/national-household-travel-survey-daily-travel-quick-facts
https://moovitapp.com/insights/en/Moovit_Insights_Public_Transit_Index_Singapore_Singapore_%E6%96%B0%E5%8A%A0%E5%9D%A1-1678
https://blog.batchgeo.com/commute-times-and-transportation/
https://www.worldatlas.com/articles/countries-with-the-highest-public-transit-use.html