# How Does a Bike-Share Navigate Speedy Success?

**Author:** Yaroslav Lishchuk

**Date:** 11/09/2023

To answer the key business questions, I will follow the steps of the data analysis process: **Ask, Prepare, Process, Analyze, Share, Act**.

# Introduction

**Scenario**

I'm a junior data analyst on the marketing analytics team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.

**About the company**

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. 

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

**The goal of this project**

Design marketing strategies aimed at converting casual riders into annual members. To do that, the marketing analytics team needs to understand:
* how annual members and casual riders differ;
* why casual riders would buy a membership;
* how digital media could affect their marketing tactics.

**My Responsibilities**

The director of marketing has assigned me the question to answer: **How do annual members and casual riders use Cyclistic bikes differently?**

I will produce a report with the following deliverables:
1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of my analysis
5. Supporting visualizations and key findings
6. My top three recommendations based on my analysis

# Ask

**Key tasks**

1. **Identify the business task**
My team will design a new marketing strategy to convert casual riders into annual members. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. 

2. **Consider key stakeholders**
* The director of marketing and my manager: Lily Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
* Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. 
* Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.


**A clear statement of the business task**

Design marketing strategies aimed at converting casual riders into annual members. To do that, the marketing analytics team needs to understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Three questions will guide the future marketing program:
* **How do annual members and casual riders use Cyclistic bikes differently?**
* **Why would casual riders buy Cyclistic annual memberships?**
* **How can Cyclistic use digital media to influence casual riders to become members?**

# Prepare
**Key tasks**
1. I've extracted data from 11/2022 to 10/2023 and stored it.
2. I will sort and filter the data. In total, the data set has 13 columns.
3. I've determined the credibility of the data.

**A description of all data sources used**
The data files carry a different name (Divvy) because Cyclistic is a fictional company. I will use the public dataset located [here](https://divvy-tripdata.s3.amazonaws.com/index.html). The data has been made available by Motivate International Inc. under this
[license](https://divvybikes.com/data-license-agreement).

**ROCCC CHECK:**
* Reliable - the dataset is not biased
* Original - the dataset comes from the original public source
* Comprehensive - no important information is missing
* Current - the dataset is updated monthly
* Cited - the dataset is cited

**DATA TYPE CHECK:**



In [None]:
# Install and load the necessary packages:
install.packages("tidyverse")
install.packages("lubridate")

library(tidyverse)  # data transformation and presentation
library(lubridate)  # easier handling of dates and times
library(ggplot2)    # graphics
library(scales)     # scale and label functions for visualization

In [None]:
# Collect data and combine in the single dataset

# Upload the Divvy datasets (csv files) for the last 12 months.
# Variable creation for each monthly extract
td01 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202211-divvy-tripdata.csv")
td02 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202212-divvy-tripdata.csv")
td03 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202301-divvy-tripdata.csv")
td04 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202302-divvy-tripdata.csv")
td05 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202303-divvy-tripdata.csv")
td06 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202304-divvy-tripdata.csv")
td07 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202305-divvy-tripdata.csv")
td08 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202306-divvy-tripdata.csv")
td09 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202307-divvy-tripdata.csv")
td10 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202308-divvy-tripdata.csv")
td11 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202309-divvy-tripdata.csv")
td12 <- read.csv("/kaggle/input/202211-202310-divvy-tripdata/202310-divvy-tripdata.csv")
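The twelve `read.csv()` calls above can also be written as a loop over the files in the folder. The following is only a sketch: `read_monthly_trips` is a hypothetical helper, and it assumes the folder contains nothing but the monthly extracts, all with matching column types.

```r
library(dplyr)

# Hypothetical helper: read every monthly extract in a folder and stack them.
# Assumes each file name ends in "-divvy-tripdata.csv", as in the cell above.
read_monthly_trips <- function(data_dir) {
  files <- sort(list.files(data_dir,
                           pattern = "-divvy-tripdata\\.csv$",
                           full.names = TRUE))
  # Read each file into a data frame, then bind them row-wise
  bind_rows(lapply(files, read.csv))
}

# Usage (path taken from the cell above):
# trips_data <- read_monthly_trips("/kaggle/input/202211-202310-divvy-tripdata")
```

Reading the files one by one, as done above, has the advantage that each month can be inspected separately before merging.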

In [None]:
# Column names check
colnames(td01)
colnames(td02)
colnames(td03)
colnames(td04)
colnames(td05)
colnames(td06)
colnames(td07)
colnames(td08)
colnames(td09)
colnames(td10)
colnames(td11)
colnames(td12)

# All columns are in the same order
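Comparing twelve printed lists by eye is error-prone, so the same check can be done programmatically. A minimal sketch, where `same_columns` is a hypothetical helper (not part of any package) and the toy data frames stand in for the monthly extracts:

```r
# TRUE only if every data frame has the same column names in the same order
same_columns <- function(...) {
  frames <- list(...)
  reference <- colnames(frames[[1]])
  all(vapply(frames,
             function(df) identical(colnames(df), reference),
             logical(1)))
}

# Toy data frames for illustration only
a <- data.frame(ride_id = "x", started_at = "2022-11-01 08:00:00")
b <- data.frame(ride_id = "y", started_at = "2022-12-01 09:30:00")
c_swapped <- data.frame(started_at = "2023-01-01 10:00:00", ride_id = "z")

same_columns(a, b)          # TRUE
same_columns(a, c_swapped)  # FALSE: same names, different order
```

With the real data the call would be `same_columns(td01, td02, ..., td12)`, listing all twelve data frames.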

In [None]:
# Inspect the data frames and look for discrepancies
str(td01)
str(td02)
str(td03)
str(td04)
str(td05)
str(td06)
str(td07)
str(td08)
str(td09)
str(td10)
str(td11)
str(td12)

# Confirmed

In [None]:
# Merge individual monthly data frames into one big data frame
trips_data <- bind_rows(td01, td02, td03, td04, td05, td06, td07, td08, td09, td10, td11, td12)

# Process


**Data Cleansing**

This analysis focuses on a one-year time frame. Limiting the available data in this way allows more precision and depth and avoids data overload. The business confirmed that the data within this yearly time frame does not change.

In [None]:
# Inspect the new table that has been created

colnames(trips_data)  # List of column names

In [None]:
nrow(trips_data)  # How many rows are in the data frame

In [None]:
dim(trips_data)  # Dimensions of the data frame

In [None]:
head(trips_data)  # First 6 rows of the data frame 

In [None]:
str(trips_data)  # List of columns and data types

In [None]:
summary(trips_data)  # Statistical summary of the data. Mainly for numeric values

**Several steps are needed to make the data clean and ready for analysis:**

1. The "started_at" and "ended_at" columns are stored as character data. Convert them to datetime for further analysis.

2. Add columns such as weekday, month, day, year, and time (hour of day) to provide more ways to aggregate the data.

3. Add a "ride_length" column. Calculate the length of each ride by subtracting the column “started_at” from the column “ended_at”.

4. Clean dirty data. Some rides have a negative trip duration, including several hundred rides where bikes were taken out of circulation for quality control reasons.


In [None]:
# Convert started_at and ended_at to datetime 
trips_data$started_at <- as_datetime(trips_data$started_at)
trips_data$ended_at <- as_datetime(trips_data$ended_at)

In [None]:
# Create a column called “day_of_week”.
trips_data$day_of_week <- weekdays(trips_data$started_at)


# Create a column called “month”
trips_data$month <- format(trips_data$started_at, '%m')


# Create a column called “day”
trips_data$day <- format(trips_data$started_at,'%d')


# Create a column called “year”
trips_data$year <- format(trips_data$started_at,'%Y')

# Create a column called “time”
trips_data$time <- format(trips_data$started_at, format = "%H")

#See the first 6 rows of data frame.
head(trips_data)


In [None]:
# Add a "ride_length" calculation to trips_data (in minutes), then convert it to integer.
trips_data$ride_length <- difftime(trips_data$ended_at, trips_data$started_at, units = "mins")
trips_data$ride_length <- as.integer(trips_data$ride_length)

#See the first 6 rows of data frame.
head(trips_data)
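Note that `as.integer()` truncates the fractional part rather than rounding, so any ride shorter than 60 seconds ends up with a `ride_length` of 0. This matters for the `<= 0` filter applied in the cleaning step. A minimal sketch on made-up timestamps:

```r
# Made-up timestamps for illustration only
started <- as.POSIXct(c("2023-01-01 08:00:00", "2023-01-01 09:00:00"), tz = "UTC")
ended   <- as.POSIXct(c("2023-01-01 08:00:59", "2023-01-01 09:12:30"), tz = "UTC")

# Exact duration in minutes (59 seconds and 12.5 minutes)
minutes_exact <- as.numeric(difftime(ended, started, units = "mins"))

# Whole minutes: the fractional part is dropped, not rounded
minutes_whole <- as.integer(minutes_exact)

minutes_whole  # 0 12
```

If sub-minute rides should be kept, one option is to store `ride_length` as a numeric value instead of an integer.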

**Removing Bad Data**
The data frame includes a few hundred entries where bikes were taken out of docks and checked for maintenance, or where ride_length was negative. For background on dropping rows by condition in R, see this [link](http://www.datasciencemadesimple.com/delete-or-drop-rows-in-r-with-conditions-2/).

We will create a new version of the data frame (v2) since data is being removed.
 


In [None]:
# Cleaning data 1. Create a new data frame without rows where ride_length is 0 minutes or less, or more than 24 hours (1,440 minutes)
trips_data_v2 <- trips_data[!(trips_data$ride_length <= 0 | trips_data$ride_length > 1440),]
dim(trips_data_v2)

# Removed 7,269 rows

In [None]:
# Cleaning data 2. Remove all NA data from the data frame 
trips_data_v2 <- drop_na(trips_data_v2)
dim(trips_data_v2)

# Removed 802 rows.

In [None]:
# Cleaning data 3. Remove all rows with a duplicate ride_id.
trips_data_v2 <- trips_data_v2[!duplicated(trips_data_v2$ride_id),]
dim(trips_data_v2)

# Cleaned. No duplicates.
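To see how `duplicated()` behaves before relying on it, here is a small sketch on toy IDs (the values are made up). It marks the second and later occurrences of a value, so `!duplicated()` keeps the first occurrence of each ID.

```r
# Toy ride IDs for illustration; "A1" appears twice
ride_ids <- c("A1", "B2", "A1", "C3")

# Number of rows that would be removed as duplicates
n_duplicates <- sum(duplicated(ride_ids))

# IDs that remain after keeping only the first occurrence of each
unique_ids <- ride_ids[!duplicated(ride_ids)]

n_duplicates  # 1
unique_ids    # "A1" "B2" "C3"
```

On the real data, `sum(duplicated(trips_data_v2$ride_id))` returning 0 would confirm that no duplicates were present.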

In [None]:
# My data has been processed and cleaned up for further analysis using:
dim(trips_data_v2) 
summary(trips_data_v2) 

**Confirmed: 5,644,751 of 5,652,822 rows remain, each with a unique ride_id. The data is valid.**

# Analyze and Share

Summary of my analysis with supporting visualizations and key findings

In [None]:
# Number of rides by rider type (member vs casual) and their share of all rides
trips_data_v2 %>%
  group_by(member_casual) %>%
  summarize(number_of_rides = n()) %>%
  mutate(percent_of_riders = percent(number_of_rides / sum(number_of_rides)))

In [None]:
# Visualization of the percentage of rides by rider type
trips_data_v2 %>%
  group_by(member_casual) %>%
  summarize(number_of_rides = n()) %>%
  mutate(percent_of_riders = percent(number_of_rides / sum(number_of_rides))) %>%
  # Map the numeric count to y so the slices are proportional;
  # percent_of_riders is a text label and is used only for the annotation
  ggplot(aes(x = "", y = number_of_rides, fill = member_casual)) +
  geom_col(color = "black") +
  coord_polar("y") +
  geom_text(aes(label = percent_of_riders), position = position_stack(vjust = 0.5), color = "white") +
  theme_void() +
  scale_fill_manual(values = c("red", "navy")) +
  labs(title = "Percentage of Rides by Rider Type")

**Observations**

1. These metrics give a snapshot of the difference in ride volume between casual riders and members.
2. The "member" group has a higher number of rides than the "casual" group.
3. Members account for 64% of all rides, and casual riders for 36%.

In [None]:
# Descriptive analysis on ride_length (all figures in minutes)
mean(trips_data_v2$ride_length) #straight average total ride length 

In [None]:
median(trips_data_v2$ride_length) #midpoint number in the ascending array of ride lengths

In [None]:
max(trips_data_v2$ride_length) #longest ride

In [None]:
min(trips_data_v2$ride_length) #shortest ride

In [None]:
# Condense the four lines above to one line using summary() on the specific attribute
summary(trips_data_v2$ride_length)

In [None]:
# Compare members and casual users
aggregate(trips_data_v2$ride_length ~ trips_data_v2$member_casual, FUN = mean)

In [None]:
aggregate(trips_data_v2$ride_length ~ trips_data_v2$member_casual, FUN = median)

In [None]:
aggregate(trips_data_v2$ride_length ~ trips_data_v2$member_casual, FUN = max)

In [None]:
aggregate(trips_data_v2$ride_length ~ trips_data_v2$member_casual, FUN = min)
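The four `aggregate()` calls above can be condensed into a single grouped summary. The sketch below uses toy data with made-up values in place of `trips_data_v2`; the column names match the ones used in this notebook.

```r
library(dplyr)

# Toy data standing in for trips_data_v2 (values are made up)
toy_trips <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "member"),
  ride_length   = c(30, 10, 5, 15, 10)
)

# One row per rider type with all four statistics side by side
ride_stats <- toy_trips %>%
  group_by(member_casual) %>%
  summarise(
    mean_length   = mean(ride_length),
    median_length = median(ride_length),
    max_length    = max(ride_length),
    min_length    = min(ride_length)
  )

ride_stats
```

Having the statistics in one table makes the member versus casual comparison easier to read than four separate outputs.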

In [None]:
# See the average ride time by each day for members vs casual users
aggregate(trips_data_v2$ride_length ~ trips_data_v2$member_casual + trips_data_v2$day_of_week, FUN = mean)

In [None]:
# Notice that the days of the week are out of order. Fix the order, then rerun the aggregation.
trips_data_v2$day_of_week <- ordered(trips_data_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(trips_data_v2$ride_length ~ trips_data_v2$member_casual + trips_data_v2$day_of_week, FUN = mean)
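The names returned by `weekdays()` depend on the system locale, so the English level names above only match when R runs in an English locale. As a sketch of an alternative, assuming a Sunday-first week, the day can be derived from lubridate's numeric `wday()` and labelled explicitly. The dates below are made up.

```r
library(lubridate)

# Made-up dates: 2023-01-01 was a Sunday
dates <- as.POSIXct(c("2023-01-01 10:00:00",
                      "2023-01-02 10:00:00",
                      "2023-01-07 10:00:00"), tz = "UTC")

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# wday() returns 1 for Sunday when week_start = 7, independent of locale
day_of_week <- factor(wday(dates, week_start = 7),
                      levels = 1:7, labels = day_names, ordered = TRUE)

as.character(day_of_week)  # "Sunday" "Monday" "Saturday"
```

The result is already an ordered factor, so no separate reordering step is needed.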

In [None]:
# Analyze ridership data by type and weekday
trips_data_v2 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange( day_of_week)

In [None]:
#  Visualize the number of rides by rider type
trips_data_v2 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge",color = "black")+
  scale_fill_manual(values = c("red", "navy"))+
  labs(title = "No. of Rides by Rider Type")

In [None]:
#  Visualize for average duration
trips_data_v2 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge", color = "black")+
  scale_fill_manual(values = c("red", "navy"))+
  labs(title = "Average Duration")

**Observations**

1. Casual riders tend to have more rides on Saturday, followed by Sunday and Friday. 
2. Member riders have a more consistent number of rides throughout the week, with higher numbers on Tuesday and Wednesday.
3. Casual riders generally have longer average ride durations compared to members.
4. Both casual and member riders have slightly longer average durations on Saturday and Sunday.

In [None]:
# Analyze ridership data by rider type and month
trips_data_v2 %>% 
  group_by(member_casual, month, year) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange( year, month)

In [None]:
#  Visualize the number of rides by rider type
trips_data_v2 %>% 
  group_by(member_casual, month, year) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(year, month) %>% 
  unite(ym, year, month, sep = " ") %>% 
  ggplot(aes(x = ym, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge",color = "black")+
  scale_fill_manual(values = c("red", "navy"))+
  labs(title = "No. of Rides by Rider Type", x = "month")

In [None]:
#  Visualize for average duration
trips_data_v2 %>% 
  group_by(member_casual, month, year) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(year, month) %>%  
  unite(ym, year, month, sep = " ") %>% 
  ggplot(aes(x = ym, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge", color = "black")+
  scale_fill_manual(values = c("red", "navy"))+
  labs(title = "Average Duration", x = "month")

**Observations**
1. For most months, the number of rides for casual users is significantly lower than for member users. 
2. Casual users tend to have longer average ride durations compared to member users. 
3. There are seasonal patterns, as seen in the increase in rides during the warmer months (from May to September). This could be influenced by weather conditions.

In [None]:
# Analyze ridership data by rider type and time of day
trips_data_v2 %>% 
  group_by(member_casual, time) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange( time)

In [None]:
#  Visualize the number of rides by rider type
trips_data_v2 %>% 
  group_by(member_casual, time) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(time) %>% 
  ggplot(aes(x = time, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge",color = "black")+
  scale_fill_manual(values = c("red", "navy"))+
  labs(title = "No. of Rides by Rider Type")

In [None]:
#  Visualize for average duration
trips_data_v2 %>% 
  group_by(member_casual, time) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(time) %>%  
  ggplot(aes(x = time, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge", color = "black")+
  scale_fill_manual(values = c("red", "navy"))+
  labs(title = "Average Duration")
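The weekday, month, and hour analyses repeat the same `group_by()` and `summarise()` steps. One way to avoid the repetition is a small helper that takes the grouping column as a string. This is only a sketch: `summarise_rides` is a hypothetical name, and the toy data uses made-up values.

```r
library(dplyr)

# Hypothetical helper: ride count and mean duration per rider type
# and one additional grouping column passed by name
summarise_rides <- function(data, group_col) {
  data %>%
    group_by(member_casual, across(all_of(group_col))) %>%
    summarise(
      number_of_rides  = n(),
      average_duration = mean(ride_length),
      .groups = "drop"
    )
}

# Toy data standing in for trips_data_v2
toy_trips <- data.frame(
  member_casual = c("casual", "casual", "member", "member"),
  time          = c("08", "08", "08", "17"),
  ride_length   = c(20, 40, 10, 12)
)

by_hour <- summarise_rides(toy_trips, "time")
by_hour
```

On the real data, `summarise_rides(trips_data_v2, "day_of_week")` or `summarise_rides(trips_data_v2, "time")` would produce the tables that feed the plots above.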

**Observations**

1. Member ridership shows peaks during the morning and evening rush hours, typically between 7:00 - 9:00 and between 16:00 - 18:00.
2. Casual riders tend to have peaks during the late afternoon and early evening hours, typically between 16:00 - 20:00.
3. Casual riders tend to have longer average ride duration, and their peak hours align with the higher duration during the late afternoon and early evening.




# Act

**My top three recommendations based on my analysis**

1. **Weekend Membership Promotions:**

Casual riders appear to use the bike-sharing service more on weekends, likely for leisurely rides. Capitalize on this by introducing special weekend membership promotions or discounts that incentivize casual riders to sign up for an annual membership.

2. **Seasonal Promotions for Longer Rides:**

Casual riders take longer rides on average than members. Leverage this insight by introducing seasonal promotions that reward users for longer rides with discounted annual membership rates or other perks. Emphasize the value proposition of an annual membership for users who enjoy extended rides, showcasing it as a more cost-effective option.

3. **Summer Membership Packages:**

Take advantage of the notable increase in the average duration and number of rides for casual riders from May to September. Launch special summer membership packages tailored to the preferences of casual riders during this period. Emphasize the convenience of having an annual membership for spontaneous and frequent summer rides, positioning it as a hassle-free and economical choice.