# Case Study: How Does a Bike-Share Navigate Speedy Success?

![Cyclistic](https://drive.google.com/file/d/15HNuoexwtilOBqyemiZaXzNI47obJ6vd/view?usp=sharing)

**Please check out the dashboard [here](https://app.powerbi.com/view?r=eyJrIjoiZDA0MThhYTktMjE3NC00M2E4LWJiYmUtY2VmNThlZTI4MTNkIiwidCI6IjYxYjJjY2E1LTA1ZmItNDNkMC05YzcyLTljOWI2MTk4MDhkMCJ9&pageName=ReportSection)**

### Scenario

I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.

### Characters and teams

*  **Cyclistic**: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

*  **Lily Moreno**: The director of marketing and my manager. Moreno is responsible for developing campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

* **Cyclistic marketing analytics team**: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. I joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how I, as a junior data analyst, can help Cyclistic achieve them.

* **Cyclistic executive team**: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

### Approach

In this case study, I will follow the six steps of the data analysis process: **ask, prepare, process, analyze, share,** and **act**.


## Table of Contents
    1. Ask
    2. Prepare
    3. Process
    4. Analysis of the dataset
        4.1 Riders distribution
        4.2 Average ride duration
        4.3 Distribution of the total monthly rides
        4.4 Distribution of weekly bike usage
        4.5 Distribution of hourly bike usage
        4.6 Average ride duration per hour
        4.7 Most popular trip start stations
        4.8 Most popular route
        4.9 Bike type used by riders
    5. Share
    6. Act
    Acknowledgement

# 1. Ask

**The business task:** My manager, Moreno, has asked me to analyze Cyclistic historical data to find out how annual members and casual riders use Cyclistic bikes differently. The insights from my analysis will help my team design a marketing program aimed at converting casual riders into members to drive Cyclistic's growth.

# 2. Prepare

A link to the data for this analysis was provided by Google as part of the case study guide for this capstone project. It was made available by Motivate International Inc. under this [license](https://ride.divvybikes.com/data-license-agreement). The data comprises historical bike trip data from 2016 to 2022 and can be accessed [here](https://divvy-tripdata.s3.amazonaws.com/index.html).

This is public data and can be used to explore how different customer types are using Cyclistic bikes. Privacy has been addressed: users of the data have no access to riders’ personally identifiable information. This means that I will not be able to determine whether casual riders live in the Cyclistic service area or whether they have purchased multiple single passes. Other information that could be useful for the analysis, such as the gender and occupation of the riders, cannot be accessed.

This data is reliable, original, comprehensive, current, and cited. These qualities make the data useful for this project.

# 3. Process

I will be using R to wrangle the data. I chose R because of its flexibility in data manipulation and visualization and also because I want to be proficient in using R for exploratory data analysis.

The cleaning and manipulation steps will be documented using this R notebook. I will be using the trip data from January 2021 to December 2021. Though there are newer data (2022), I chose 2021 data to run my analysis on a complete year.

## 3.1 Load packages
Let's start by loading the required packages for this analysis.

In [None]:
# Loading the required packages
library(tidyverse)
library(lubridate)
library(janitor)
library(readr)

## 3.2 Load the dataset

In [None]:
# Loading the 2021 trip data
trips01 <- read_csv("../input/divvytrips2021/202101-divvy-tripdata.csv")
trips02 <- read_csv("../input/divvytrips2021/202102-divvy-tripdata.csv")
trips03 <- read_csv("../input/divvytrips2021/202103-divvy-tripdata.csv")
trips04 <- read_csv("../input/divvytrips2021/202104-divvy-tripdata.csv")
trips05 <- read_csv("../input/divvytrips2021/202105-divvy-tripdata.csv")
trips06 <- read_csv("../input/divvytrips2021/202106-divvy-tripdata.csv")
trips07 <- read_csv("../input/divvytrips2021/202107-divvy-tripdata.csv")
trips08 <- read_csv("../input/divvytrips2021/202108-divvy-tripdata.csv")
trips09 <- read_csv("../input/divvytrips2021/202109-divvy-tripdata.csv")
trips10 <- read_csv("../input/divvytrips2021/202110-divvy-tripdata.csv")
trips11 <- read_csv("../input/divvytrips2021/202111-divvy-tripdata.csv")
trips12 <- read_csv("../input/divvytrips2021/202112-divvy-tripdata.csv")
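
The twelve `read_csv()` calls above could also be written as a loop over the files in the input folder. The sketch below is only one way to do it: `read_monthly_trips()` is a hypothetical helper name, and the two tiny CSV files are made up so that the example runs without the real data.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read every monthly trip file in `data_dir` and stack them
read_monthly_trips <- function(data_dir) {
  files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
  # Read each file into a tibble, then row-bind the list of tibbles
  bind_rows(lapply(files, read_csv, show_col_types = FALSE))
}

# Demonstration on two made-up files written to a temporary folder
demo_dir <- tempfile("divvy_demo")
dir.create(demo_dir)
write_csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202101-divvy-tripdata.csv"))
write_csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(demo_dir, "202102-divvy-tripdata.csv"))

demo_trips <- read_monthly_trips(demo_dir)
nrow(demo_trips)
```

With the real files, the equivalent call would be `read_monthly_trips("../input/divvytrips2021")`. Keeping the monthly data frames separate, as I did above, makes it easier to inspect individual months before combining them.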

## 3.4 Explore the loaded data

Let's explore the loaded datasets to be sure they were loaded correctly. I explored the datasets individually before loading them into R, so I will just inspect two of them at random.
I will also check whether the columns in all the datasets are the same, which tells us whether the datasets can be combined by row.


In [None]:
head(trips05)
glimpse(trips12)

In [None]:
# Compare the uploaded data to know if the columns are the same

compare_df_cols_same(trips01, trips02, trips03, trips04, trips05, trips06, 
                     trips07, trips08, trips09, trips10, trips11, trips12)

The datasets can be combined by row because they have the same columns. Next, we'll combine them.
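
If the check had returned `FALSE`, the next step would be to find out which columns differ. Here is a minimal sketch using `compare_df_cols()` from janitor with `return = "mismatch"`. The two data frames `jan` and `feb` are made up for illustration, with `started_at` deliberately stored as text in the second one.

```r
library(janitor)

# Made-up monthly extracts: started_at is a date-time in one and text in the other
jan <- data.frame(ride_id = "A1",
                  started_at = as.POSIXct("2021-01-01 08:00:00", tz = "UTC"),
                  stringsAsFactors = FALSE)
feb <- data.frame(ride_id = "B1",
                  started_at = "2021-02-01 08:00:00",
                  stringsAsFactors = FALSE)

# TRUE only if all shared columns have matching classes
same <- compare_df_cols_same(jan, feb)

# List only the columns whose classes differ between the data frames
mismatches <- compare_df_cols(jan, feb, return = "mismatch")
mismatches
```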

## 3.5 Combine the data
Let's combine the data into a single dataset 

In [None]:
# Combine the monthly trip data into one dataset

trips2021 <- rbind(trips01, trips02, trips03, trips04, trips05, trips06, 
                   trips07, trips08, trips09, trips10, trips11, trips12)

head(trips2021)
summary(trips2021)

### 3.5.1 Explore the combined dataset

In [None]:
glimpse(trips2021)

I noticed that there are missing values in the dataset. I will remove the rows with missing values for the sake of this analysis. Ideally, I would ask my manager and/or the team that knows how the data was generated and captured why the values are missing and what should be done about them.
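
Before dropping rows, it helps to know where the missing values are and how many rows would be lost. The sketch below counts `NA`s per column on a made-up data frame (`toy_trips` is not the real data), so the numbers are purely illustrative.

```r
library(dplyr)

# Made-up trips with missing station names
toy_trips <- data.frame(
  ride_id = c("A1", "A2", "A3"),
  start_station_name = c("Clark St & Elm St", NA, NA),
  end_station_name = c("Wells St & Concord Ln", "Millennium Park", NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- toy_trips %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Share of rows that contain at least one NA (the rows drop_na() would remove)
rows_lost <- mean(!complete.cases(toy_trips))
rows_lost
```

Running the same two steps on `trips2021` would show which columns drive the missing rows and what fraction of the year's rides is removed.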

## 3.6 Clean the data
I will check if there are duplicate rows in the combined dataset and remove the missing values and duplicate rows if any.

In [None]:
# Checking for duplicate rows
nrow(distinct(trips2021)) == nrow(trips2021)

The de-duplicated dataset has the same number of rows as the original, indicating that there are no duplicate rows. The missing values will be removed next.
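
The check above only catches rows that are identical in every column. A stricter check is to look for repeated ride identifiers. The sketch below assumes the trip data has a `ride_id` column (an assumption on my part, since it is not shown in this notebook's text) and uses a made-up data frame.

```r
# Made-up trips: no two rows are identical, but one ride ID is repeated
toy_trips <- data.frame(
  ride_id = c("A1", "A2", "A2"),
  started_at = c("2021-01-01 08:00:00", "2021-01-01 09:00:00", "2021-01-01 09:05:00"),
  stringsAsFactors = FALSE
)

# Rows that duplicate an earlier row in every column
full_dupes <- sum(duplicated(toy_trips))

# Rows whose ride_id has already appeared
id_dupes <- sum(duplicated(toy_trips$ride_id))

c(full_dupes = full_dupes, id_dupes = id_dupes)
```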

In [None]:
# remove NAs
trips2021_v2 <- drop_na(trips2021)

glimpse(trips2021_v2)

## 3.7 Compute some new columns 
Now, let's compute some new columns such as ride duration, weekday, and route. These variables will be useful for the analysis.

In [None]:
trips2021_v2 <- trips2021_v2 %>% 
  # Ride duration in whole minutes, stored as a plain number for filtering and plotting
  mutate(ride_duration = as.numeric(round(difftime(ended_at, started_at, units = "mins")))) %>% 
  mutate(month = month(started_at, label = TRUE)) %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  mutate(start_hour = hour(started_at)) %>% 
  mutate(route = str_c(start_station_name, end_station_name, sep = " -to- "))

glimpse(trips2021_v2)
tail(trips2021_v2, 5)

## 3.8 Rearrange the columns
Let's rearrange the columns and rename some of the variables to more intuitive names and drop the columns that will not be used for the analysis

In [None]:
trips2021_v2 <- trips2021_v2 %>% 
  select(member_casual, rideable_type, ends_with("name"),route, ride_duration:start_hour) %>% 
  rename(rider_type = member_casual, bike_type = rideable_type)

head(trips2021_v2, 10)

## 3.9 Explore the new variables

Let's explore the new variables to see if they make sense or need some manipulation/cleaning

In [None]:
head(trips2021_v2)

summary(trips2021_v2)

# Checking for zero trip duration
trips2021_v2 %>% 
  filter(ride_duration == 0) 

# Checking for negative trip duration 
trips2021_v2 %>% 
  filter(ride_duration < 0)

I noticed that there are some zero and negative ride durations. Because the durations are rounded to the nearest minute, a zero duration means the ride lasted less than about half a minute. I gathered that these bikes were taken out of circulation for repairs.

For the negative ride durations, I am not sure what happened. Maybe the trip **start** and **end** times were mistakenly swapped. I also considered daylight saving time adjustments, since these trips started and ended at different locations, but this is unlikely because the negative durations occur in several months (June, August, September, and November), whereas the clock changes in Chicago happen in March and November, according to a Google search.

These rows will be removed for the sake of this analysis.

*PS: I tried to find reasons for the anomalies in the ride duration because it is not good practice to drop rows without verification.*
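
To make the zero and negative values concrete, here is a small sketch of how the rounding behaves. The timestamps are made up, and the calculation mirrors the `round(difftime(..., units = "mins"))` step used for `ride_duration`.

```r
# One made-up start time and three made-up end times
start <- as.POSIXct("2021-06-01 10:00:00", tz = "UTC")
ends <- as.POSIXct(c("2021-06-01 10:00:20",   # ends 20 seconds later
                     "2021-06-01 10:00:40",   # ends 40 seconds later
                     "2021-06-01 09:58:00"),  # ends 2 minutes before the start
                   tz = "UTC")

# Duration in minutes, rounded to the nearest whole minute
durations <- as.numeric(round(difftime(ends, start, units = "mins")))
durations
```

The 20-second ride rounds down to 0 minutes, the 40-second ride rounds up to 1 minute, and the ride whose end time precedes its start time comes out negative.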


## 3.10 Remove the zero and negative ride durations
Let's remove the rows where the ride duration is either zero or negative. Ideally, before removing the negative rows, I would ask what caused the anomaly and what remedial action is required.

In [None]:
trips2021_cleaned <- trips2021_v2 %>% 
  filter(ride_duration > 0)

head(trips2021_cleaned)
summary(trips2021_cleaned)

Now that the data has been wrangled, it is ready for analysis.

# 4. Analysis of the dataset
Let's now make calculations and visualize our data to generate insights. In order to understand how the casual riders use the bikes differently from the members, I will explore the following:
* Riders distribution
* Average ride duration
* Distribution of the total monthly rides
* Distribution of weekly bike usage 
* Distribution of hourly bike usage 
* Average ride duration per hour
* Most popular trip start stations
* Most popular route
* Bike type used by riders



## 4.1 Riders distribution
Let's start by computing and visualizing the percentage of each category of riders

In [None]:
trips2021_cleaned %>% 
  group_by(rider_type) %>% 
  tally() %>% 
  mutate(percentage = round(n/sum(n)*100)) %>% 
  ggplot(aes(x = 1, y = percentage, fill = rider_type)) +
  geom_bar(stat = "identity", width = 1) +
  geom_text(aes(label = str_c(rider_type, str_c(percentage, "%"), sep = "\n")), 
            position = position_stack(vjust = 0.5), color = "white", size = 8) + 
  labs(title = "Riders Distribution", fill = "Rider type") + 
  coord_polar(theta = "y") +
  theme_void()


Members use Cyclistic bikes more than casual riders do. Members account for 55% of the total rides, while casual riders account for 45%.

## 4.2 Average ride duration

Let's examine the average duration of the trips

In [None]:
# Average ride duration
trips2021_cleaned %>% 
  group_by(rider_type) %>% 
  summarise(avg_ride_duration = round(mean(ride_duration)))%>% 
  ggplot(aes(x = rider_type, y = avg_ride_duration)) + 
  geom_col(position = "dodge", fill = "black") +
  labs( x = "Rider type", y = "Average ride duration (mins)") 


It is interesting to see that although members account for most of the rides, casual riders ride the bikes for longer than members. The average ride duration for casual riders is more than twice that of members.
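
A mean can be pulled upward by a few very long rides, so it can be worth putting the median next to it. The sketch below uses made-up durations (not results from the Cyclistic data) to show how the two summaries can differ.

```r
library(dplyr)

# Made-up ride durations in minutes; one casual ride is unusually long
toy <- data.frame(
  rider_type = c("casual", "casual", "casual", "member", "member", "member"),
  ride_duration = c(10, 20, 300, 8, 12, 16),
  stringsAsFactors = FALSE
)

# Mean and median duration per rider type
duration_summary <- toy %>%
  group_by(rider_type) %>%
  summarise(mean_duration = mean(ride_duration),
            median_duration = median(ride_duration))
duration_summary
```

In this toy example the single 300-minute ride lifts the casual mean far above the casual median, while the two summaries agree for members. The same summary on `trips2021_cleaned` would show how much of the gap between rider types survives when outliers carry less weight.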

## 4.3 Distribution of the total monthly rides
Let's compute and visualize the monthly ride distribution


In [None]:
trips2021_cleaned %>% 
  group_by(rider_type, month) %>% 
  summarise(total_rides = n()) %>% 
  arrange(month) %>% 
  ggplot(aes(x = month, y = total_rides, fill = rider_type)) +
  geom_col(position = "dodge") +
  labs(title = "Monthly bike rides by rider type",
       x = "Month", y = "Number of rides", fill = "Rider type") 

Casual riders use the Cyclistic bikes more than members in the summer months. Rides by casual riders caught up with those by members at the beginning of summer in June and surpassed them for most of the summer. Casual usage peaked in July and then dropped drastically in October. Usage by casual riders fell sharply after the end of summer, while usage by members remained fairly high before dropping dramatically going into the winter in December.

Casual riders’ bike usage was significantly lower than members’ usage from February to April. Usage by both groups started to rise in the spring (from March to May) following a dip in the winter months (December to March), with members leading the pack.


## 4.4 Distribution of weekly bike usage

Let's examine how the bikes are used across the week to uncover usage patterns by members and casual riders.

In [None]:
trips2021_cleaned %>% 
  group_by(rider_type, weekday) %>% 
  summarise(total_rides = n()) %>% 
  ggplot(aes(x = weekday, y = total_rides, fill = rider_type)) +
  geom_col(position = "dodge") +
  labs(title = "Weekly bike usage distribution", 
       fill = "Rider type", x = "Weekday",
       y = "Number of rides")


We observed that casual riders seem to use the bikes more for leisure while the members seem more likely to use the bike to commute to and from work. Casual riders used the bikes far more on weekends. Their usage starts to rise on Fridays and moves up significantly on Saturdays and Sundays from the fairly consistent level on weekdays. Members' usage is fairly consistent throughout the week.
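
The weekend pattern can also be expressed as a single number per rider type: the share of rides that start on a Saturday or Sunday. The sketch below uses made-up start times and relies on `wday()` from lubridate with `week_start = 7`, so that 1 is Sunday and 7 is Saturday.

```r
library(dplyr)
library(lubridate)

# Made-up rides; the comments give the weekday of each date
toy <- data.frame(
  rider_type = c("casual", "casual", "casual", "member", "member", "member"),
  started_at = as.POSIXct(c("2021-07-03 14:00:00",   # Saturday
                            "2021-07-04 11:00:00",   # Sunday
                            "2021-07-06 18:00:00",   # Tuesday
                            "2021-07-05 08:00:00",   # Monday
                            "2021-07-07 17:00:00",   # Wednesday
                            "2021-07-10 10:00:00"),  # Saturday
                          tz = "UTC"),
  stringsAsFactors = FALSE
)

# Share of rides per rider type that start on a weekend
weekend_share <- toy %>%
  group_by(rider_type) %>%
  summarise(weekend_share = mean(wday(started_at, week_start = 7) %in% c(1, 7)))
weekend_share
```

Using the numeric weekday instead of the labels avoids depending on the language of the weekday names.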

Next, we will examine the bike usage across the times of the day. 

## 4.5 Distribution of hourly bike usage

Let's look at how the trips are distributed across the hours of the day

In [None]:
trips2021_cleaned %>% 
  group_by(rider_type, start_hour) %>% 
  summarise(total_rides = n()) %>% 
  ggplot(aes(x = start_hour, y = total_rides, fill = rider_type)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 0:23) +
  labs(title = "Hourly bike rides", 
       fill = "Rider type", x = "Hour",
       y = "Number of rides")

The bikes are mostly used during the day by both categories of users. Members use the bikes significantly more than casual riders from 6 a.m. to 9 a.m. and from 4 p.m. to 7 p.m. This pattern agrees with our hypothesis that members use the bikes more for work.

Casual riders use the bikes more than members at night from 9 p.m. and at odd hours.
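
One way to quantify the commuting pattern is the share of rides that start inside the two rush-hour windows. In the sketch below the windows are taken to be start hours 6 to 8 and 16 to 18, which is my reading of "6 a.m. to 9 a.m." and "4 p.m. to 7 p.m." and therefore an assumption. The data is made up.

```r
library(dplyr)

# Made-up start hours (0-23) for a handful of rides
toy <- data.frame(
  rider_type = c(rep("casual", 4), rep("member", 4)),
  start_hour = c(13, 22, 17, 11, 7, 8, 17, 12),
  stringsAsFactors = FALSE
)

# Assumed rush-hour start hours: 6-8 in the morning and 16-18 in the evening
commute_hours <- c(6:8, 16:18)

# Share of rides per rider type that start in a rush-hour window
commute <- toy %>%
  group_by(rider_type) %>%
  summarise(commute_share = mean(start_hour %in% commute_hours))
commute
```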

## 4.6 Average ride duration per hour
Now, let's look at the average ride duration across the hours of the day

In [None]:
trips2021_cleaned %>% 
  group_by(rider_type, start_hour) %>% 
  summarise(avg_duration = mean(ride_duration)) %>% 
  ggplot(aes (x = start_hour, y = avg_duration, fill = rider_type)) + 
  geom_col(position ="dodge") + 
  scale_x_continuous(breaks = c(0:23)) +
  labs(title = "Average ride duration per hour",
       x = "Hour", y = "Average ride duration (mins)")

Casual riders generally ride the bikes longer than members throughout the day. 
The average ride duration across the hours of the day is fairly constant for members. Additional data is required to verify whether this points to a cap on allowable usage under their yearly subscription.

It is interesting to note that although fewer rides happen at odd hours from 12 midnight to 4 a.m., as evident in the **hourly bike rides** graph above, casual riders ride the bikes for longer during those hours.

## 4.7 Most popular trip start stations

Let's now determine the most popular stations where the riders start their trips.

In [None]:
# Top 10 start stations
trips2021_cleaned %>% 
  group_by(rider_type, start_station_name) %>% 
  summarise(Number_of_trips = n()) %>% 
  arrange(desc(Number_of_trips)) %>% 
  head(10)

In [None]:
# Top 10 start stations for members

trips2021_cleaned %>% 
  group_by(rider_type, start_station_name) %>% 
  filter(rider_type == "member") %>% 
  summarise(Number_of_rides = n()) %>% 
  arrange(desc(Number_of_rides)) %>% 
  head(10) %>% 
  ggplot(aes(x = Number_of_rides, y = reorder(start_station_name, Number_of_rides))) + 
  geom_col() +
  labs(title = "Most popular start station for members",
       x = "Number of rides", y = "Start station name")

In [None]:
# Top 10 start stations for casual riders
trips2021_cleaned %>% 
  group_by(rider_type, start_station_name) %>% 
  filter(rider_type == "casual") %>% 
  summarise(Number_of_rides = n()) %>% 
  arrange(desc(Number_of_rides)) %>% 
  head(10) %>% 
  ggplot(aes(x = Number_of_rides, y = reorder(start_station_name, Number_of_rides))) + 
  geom_col() +
  labs(title = "Most popular start station for casual riders",
       x = "Number of rides", y = "Start station name")

The top start stations are different for members and casual riders. **Streeter Dr & Grand Ave** is by far the most popular station for casual riders, followed by **Millennium Park** and **Michigan Ave & Oak St**. The top three start stations for members are **Clark St & Elm St**, **Wells St & Concord Ln**, and **Kingsbury St & Kinzie St**.

The top 2 most popular stations where casual riders start their trips are close to leisure centres, which points to the conclusion that they use Cyclistic bikes primarily for leisure.
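
The two filtered pipelines above can also be collapsed into one that returns the top stations for each rider type side by side. This is a minimal sketch using `slice_max()` from dplyr on made-up rides. It keeps only the single busiest station per group so the toy output stays small.

```r
library(dplyr)

# Made-up rides; the station names are borrowed from the results above
toy <- data.frame(
  rider_type = c(rep("casual", 5), rep("member", 4)),
  start_station_name = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Streeter Dr & Grand Ave", "Millennium Park", "Millennium Park",
                         "Clark St & Elm St", "Clark St & Elm St",
                         "Wells St & Concord Ln", "Kingsbury St & Kinzie St"),
  stringsAsFactors = FALSE
)

# Count rides per rider type and station, then keep the busiest station in each group
top_stations <- toy %>%
  count(rider_type, start_station_name, name = "number_of_rides") %>%
  group_by(rider_type) %>%
  slice_max(number_of_rides, n = 1, with_ties = FALSE) %>%
  ungroup()
top_stations
```

Setting `n = 10` would give a top 10 per rider type in a single table.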


## 4.8 Most popular route

In [None]:
# Most popular routes
trips2021_cleaned %>% 
  group_by(rider_type, route) %>% 
  summarise(Number_of_rides = n()) %>% 
  arrange(desc(Number_of_rides)) %>% 
  head(10) 

In [None]:
# Most popular route for members with average duration
route_member <- trips2021_cleaned %>% 
  group_by(rider_type, route) %>% 
  filter(rider_type == "member") %>% 
  summarise(Number_of_rides = n(), avg_ride_duration = mean(ride_duration)) %>% 
  arrange(desc(Number_of_rides)) %>% 
  head(10)
 # Plot of Most popular routes for members
ggplot(route_member) + 
  geom_col(aes(x = Number_of_rides, y = reorder(route, Number_of_rides))) +
  labs(title = "Most popular route for members",
       x = "Number of rides", y = "Route")

The start and end stations are different for members. End stations for some trips are the start stations for other trips, indicating that these trips complete a loop. This pattern may mean that members primarily use the bikes to commute to and from work.

In [None]:
# Most popular routes for casual riders
route_casual <- trips2021_cleaned %>% 
  group_by(rider_type, route) %>% 
  filter(rider_type == "casual") %>% 
  summarise(Number_of_rides = n(), avg_ride_duration = mean(ride_duration)) %>% 
  arrange(desc(Number_of_rides)) %>% 
  head(10)

 # Plot of Most popular routes for casual riders
ggplot(route_casual) + 
  geom_col(aes(x = Number_of_rides, y = reorder(route, Number_of_rides))) +
  labs(title = "Most Popular route for Casual Riders",
       x = "Number of rides", y = "Route")

The trips started and ended at the same station for 8 of the 10 most popular routes. The other 2 started and ended at the top two most popular start stations. A Google search revealed that Streeter Dr & Grand Ave is situated near Grand & Streeter and close to Jane Addams Memorial Park. Millennium Park is, of course, a park. From this information, it seems reasonable to conclude that casual riders mostly ride for leisure.
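
The observation about trips that start and end at the same station can be checked across all rides, not just the top 10 routes, by computing the share of round trips per rider type. The data below is made up for illustration.

```r
library(dplyr)

# Made-up rides; a round trip starts and ends at the same station
toy <- data.frame(
  rider_type = c("casual", "casual", "casual", "member", "member"),
  start_station_name = c("Streeter Dr & Grand Ave", "Millennium Park",
                         "Streeter Dr & Grand Ave", "Clark St & Elm St",
                         "Wells St & Concord Ln"),
  end_station_name = c("Streeter Dr & Grand Ave", "Millennium Park",
                       "Millennium Park", "Wells St & Concord Ln",
                       "Clark St & Elm St"),
  stringsAsFactors = FALSE
)

# Share of rides per rider type that return to their start station
round_trips <- toy %>%
  group_by(rider_type) %>%
  summarise(round_trip_share = mean(start_station_name == end_station_name))
round_trips
```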

## 4.9 Bike type used by riders
We will now examine the types of bike used by the riders

In [None]:
trips2021_cleaned %>% 
  group_by(rider_type, bike_type) %>% 
  summarise(Number_of_bikes = n()) %>% 
  ggplot(aes(x = bike_type, y = Number_of_bikes, fill = rider_type)) + 
  geom_col(position = "dodge") +
  labs(x = "Bike type", y = "Number of bikes", fill = "Rider type")

The most popular bike among both categories of riders is the classic bike. Casual riders also ride electric and docked bikes, with the docked bikes being the least used. Members also ride electric bikes but barely use docked bikes.

It will be interesting to know why members barely use the docked bikes. I suspect that it may be more expensive to ride than the other bikes but additional data about bike rental costs would be required to verify this.
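
Because members and casual riders take different numbers of rides, raw counts can hide differences in preference. The sketch below computes the share of each bike type within each rider type. The data is made up, and the `bike_type` labels (`classic_bike`, `electric_bike`, `docked_bike`) are assumed values.

```r
library(dplyr)

# Made-up rides with assumed bike type labels
toy <- data.frame(
  rider_type = c(rep("casual", 4), rep("member", 4)),
  bike_type = c("classic_bike", "classic_bike", "electric_bike", "docked_bike",
                "classic_bike", "classic_bike", "classic_bike", "electric_bike"),
  stringsAsFactors = FALSE
)

# Share of each bike type within each rider type
bike_share <- toy %>%
  count(rider_type, bike_type) %>%
  group_by(rider_type) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
bike_share
```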

# 5. Share
The key insights gleaned from the analysis are as follows:

* Members use Cyclistic bikes more than casual riders do. Members account for 55% of the total rides, while casual riders account for 45%. However, casual riders ride the bikes for longer than members. The average ride duration for casual riders is more than twice that of members.

* Casual riders use the Cyclistic bikes more than members in the summer months, July to September. Members use the bikes more than casual riders in the remaining months of the year (outside summer months). 

* Casual riders seem to use the bikes more for leisure while the members seem more likely to use the bike to commute to and from work. Casual riders used the bikes far more on weekends. Their usage starts to rise on Fridays and moves up significantly on Saturdays and Sundays from the fairly consistent level on weekdays. Members' usage is fairly consistent throughout the week. 

* The bikes are mostly used during the day by both categories of users. Members use the bikes significantly more than casual riders from 6 a.m. to 9 a.m. and from 4 p.m. to 7 p.m. Casual riders use the bikes more than members at night from 9 p.m. and at odd hours.

* The top 2 most popular stations where casual riders start their trips are close to leisure centres, which points to the conclusion that they use Cyclistic bikes primarily for leisure.

* Casual riders generally ride the bikes longer than members throughout the hours of the day. The average ride duration is fairly constant for members throughout the day. It is interesting to note that although fewer rides happen at odd hours from 12 midnight to 4 a.m., casual riders ride the bikes for longer during those hours.





# 6. Act
Here I will highlight my top three recommendations based on the insights I gleaned from the analysis:

* **Loyalty points:** Cyclistic should consider introducing a loyalty program where members accumulate points based on the duration of their rides. The loyalty points could be used to offer discounts on rides or tickets to popular leisure centers. Adverts about the loyalty program could be placed at the top 5 stations where casual riders start their trips. I believe this would help to convert casual riders to members because they usually ride the bikes for longer and for leisure.

* **Night life Membership:** An initiative that caters to the needs of night-lifers may help to convert casual riders. This could be a membership that gives discounts on night rides.

* **Further analysis:** Data about the occupations, residential locations, and work locations of casual riders would help to find out whether Cyclistic bikes can be made appealing to casual riders for commuting to work.


*PS: I believe that having a conversation with someone that knows the business really well would be helpful in providing recommendations. As such, conversations with my manager and other stakeholders would have been helpful to provide recommendations that would solve the business problem in the best way.*


# Acknowledgement

This being my first data analysis project with R, completing it was a challenging and interesting task! I had to stop many times to learn new things in R and look through other people's work for inspiration. Many thanks to everyone that has helped me with their work in one way or the other.


***If anyone finds this notebook helpful, please feel free to use it to gain clarity with your own work and kindly upvote it so that others can easily find it. Cheers***