In [46]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Case Study: How Does a Bike-Share Navigate Speedy Success?

## Introduction
This case study concerns a fictional company, Cyclistic, and introduces its different characters and team members. To answer the key business questions, we will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. 

**Scenario**

I am a junior data analyst working in the marketing analytics team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.


## Characters and teams

* **Cyclistic**: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

* **Lily Moreno**: The director of marketing and my manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

* **Cyclistic marketing analytics team**: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. I joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals, as well as how I, as a junior data analyst, can help Cyclistic achieve them.

* **Cyclistic executive team**: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

## About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analytics team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

## Ask 

Moreno has assigned me the first of the three questions that will guide the future marketing program: 

* How do annual members and casual riders use Cyclistic bikes differently?

The goal of Cyclistic Bike-Share is to increase annual memberships by turning casual customers (single-ride and day passes) into members (annual membership). The purpose of this analysis is to see how casual riders and annual members use Cyclistic bikes differently. As a deliverable, we will provide several recommendations for boosting annual memberships.  

**Stakeholders**:

* Primary stakeholders: The director of marketing and Cyclistic executive team

* Secondary stakeholders: Cyclistic marketing analytics team

## Prepare

For this analysis we use 12 months of Divvy trip data as a stand-in for Cyclistic's data. The datasets contain ride information such as the start date and time, the end date and time, and whether the customer was a casual rider or a member.

Source of Datasets: https://divvy-tripdata.s3.amazonaws.com/index.html

The data is organized into 12 CSV files (one for each month), each with 13 columns, with roughly 5 million rows in total.

The **ROCCC** method is used to assess the data's credibility.

* **Reliable**: It is precise and accurate, and it reflects all Divvy bike trips taken in Chicago during the period chosen for this analysis.
* **Original**: The data has been made available by Motivate International Inc., which operates the City of Chicago's Divvy bicycle-sharing service (powered by Lyft).
* **Comprehensive**: The data includes all ride details, including start and end times, station names, station IDs, membership types, and more.
* **Current**: It is current, since it includes data through the end of October 2021.
* **Cited**: The data is cited and is available under the terms of the Data License Agreement.

In [47]:
# Importing libraries
library(tidyverse)
library(lubridate)
library(ggplot2)

In [48]:
# Loading the datasets

tripdata_202011 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202011-divvy-tripdata.csv")
tripdata_202012 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202012-divvy-tripdata.csv")
tripdata_202101 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202101-divvy-tripdata.csv")
tripdata_202102 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202102-divvy-tripdata.csv")
tripdata_202103 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202103-divvy-tripdata.csv")
tripdata_202104 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202104-divvy-tripdata.csv")
tripdata_202105 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202105-divvy-tripdata.csv")
tripdata_202106 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202106-divvy-tripdata.csv")
tripdata_202107 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202107-divvy-tripdata.csv")
tripdata_202108 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202108-divvy-tripdata.csv")
tripdata_202109 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202109-divvy-tripdata.csv")
tripdata_202110 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202110-divvy-tripdata.csv")
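
The twelve near-identical `read_csv()` calls above could also be expressed as a single loop over the files in the input directory. The following is only a sketch under some assumptions: the helper name `read_trip_files` is hypothetical, the file-name pattern is inferred from the paths above, and the two station ID columns are forced to character on the assumption that some monthly files might otherwise be parsed as numeric and others as text.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read every monthly trip file in `data_dir` into one tibble.
read_trip_files <- function(data_dir) {
  files <- list.files(data_dir,
                      pattern = "-divvy-tripdata\\.csv$",
                      full.names = TRUE)
  # Assumption: station IDs are read as character so that all months share
  # the same column types; every other column type is guessed by readr.
  id_types <- cols(start_station_id = col_character(),
                   end_station_id = col_character())
  # Read each file, then stack the rows; bind_rows() matches columns by name.
  bind_rows(lapply(files, read_csv, col_types = id_types))
}

# Example call (path taken from the cell above):
# all_tripdata <- read_trip_files("../input/divvy-tripdata-nov2020-oct2021")
```

With this approach, adding or removing a month only requires changing the contents of the directory, not the code.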

In [49]:
# Check the class of the imported data
class(tripdata_202011)

In [50]:
# Checking if all columns are same before merging them into one dataframe

colnames(tripdata_202011)
colnames(tripdata_202012)
colnames(tripdata_202101)
colnames(tripdata_202102)
colnames(tripdata_202103)
colnames(tripdata_202104)
colnames(tripdata_202105)
colnames(tripdata_202106)
colnames(tripdata_202107)
colnames(tripdata_202108)
colnames(tripdata_202109)
colnames(tripdata_202110)
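
Comparing twelve printed lists of column names by eye is error-prone, so the same check can be done programmatically. This is a minimal sketch in base R; the helper name `same_columns` is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE when every data frame in `dfs` has the same
# column names, in the same order, as the first one.
same_columns <- function(dfs) {
  reference <- colnames(dfs[[1]])
  all(vapply(dfs,
             function(df) identical(colnames(df), reference),
             logical(1)))
}

# Example call with the monthly data frames loaded earlier:
# same_columns(list(tripdata_202011, tripdata_202012, tripdata_202101))
```

A result of `TRUE` means the data frames can be stacked row-wise without renaming any columns.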

In [51]:
str(tripdata_202011)

In [52]:
# Combining all data frames into one data frame
all_tripdata <- rbind(tripdata_202011,
                  tripdata_202012,
                  tripdata_202101,
                  tripdata_202102,
                  tripdata_202103,
                  tripdata_202104,
                  tripdata_202105,
                  tripdata_202106,
                  tripdata_202107,
                  tripdata_202108,
                  tripdata_202109,
                  tripdata_202110)

In [53]:
class(all_tripdata)

In [54]:
head(all_tripdata)

In [55]:
glimpse(all_tripdata)

In [56]:
colnames(all_tripdata)

In [57]:
nrow(all_tripdata)

In [58]:
summary(all_tripdata)

In [59]:
# check the member_casual columns
table(all_tripdata$member_casual)

## Process


In [60]:
# Count the missing values in each column
colSums(is.na(all_tripdata))

### Data Limitation

Checking the data for completeness shows that the start station name and ID and the end station name and ID are missing for some rides. Further inspection suggests that most of the missing start station names belong to electric bikes: 600,479 out of 5,378,834 electric-bike rides have missing data, which accounts for 11% of all electric-bike rides.

This limitation could slightly affect our analysis of the stations where most electric bikes are picked up, but we can use the end station names to locate our customers, and this can support further analysis and potential marketing campaigns.
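
The share of rides with a missing start station can be broken down by bike type with a short summary. This is a sketch only: the helper name `missing_by_type` is hypothetical, and the column names `rideable_type` and `start_station_name` are assumed to match the columns in the trip data.

```r
library(dplyr)

# Hypothetical helper: count rides and missing start stations per bike type.
# Assumes columns named `rideable_type` and `start_station_name`.
missing_by_type <- function(trips) {
  trips %>%
    group_by(rideable_type) %>%
    summarise(rides = n(),
              missing_start = sum(is.na(start_station_name)),
              pct_missing = 100 * missing_start / rides,
              .groups = "drop")
}

# Example call on the combined data frame:
# missing_by_type(all_tripdata)
```

Running it before the rows with missing values are dropped shows which bike types would be most affected by that cleaning step.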

In [61]:
# Remove rows with missing values and store the clean data in a new data frame

all_trips_cleaned <- all_tripdata[complete.cases(all_tripdata), ]
nrow(all_trips_cleaned)

In [62]:
# Keep only rides where started_at is earlier than ended_at
all_trips_cleaned <- all_trips_cleaned %>%
  filter(started_at < ended_at)

In [63]:
head(all_trips_cleaned)
nrow(all_trips_cleaned)

In [64]:
# Create a new column ride_length (in seconds)

all_trips_cleaned$ride_length <- difftime(all_trips_cleaned$ended_at, all_trips_cleaned$started_at, units = "secs")

In [65]:
head(all_trips_cleaned$ride_length)

In [66]:
# Convert ride_length from seconds to hh:mm:ss (hms) format
all_trips_cleaned$ride_length <- hms::hms(seconds_to_period(all_trips_cleaned$ride_length))

In [67]:
head(all_trips_cleaned$ride_length)

In [68]:
# Create a new column day_of_week

all_trips_cleaned$day_of_week <- wday(all_trips_cleaned$started_at, label = FALSE)
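
With `label = FALSE`, `wday()` returns a number rather than a day name, which matters for any later step that expects names. The small example below illustrates the numbering on three arbitrary dates, assuming lubridate's default week start (Sunday).

```r
library(lubridate)

# Three consecutive dates; 2021-10-03 was a Sunday.
dates <- as.Date(c("2021-10-02", "2021-10-03", "2021-10-04"))

# Numeric codes: 7 1 2 with the default week start (1 = Sunday, 7 = Saturday)
wday(dates)

# The same days as an ordered factor of full day names
wday(dates, label = TRUE, abbr = FALSE)
```

So a value of 1 in `day_of_week` stands for Sunday and 7 for Saturday.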

## Analyze

We now run a descriptive analysis to find patterns that distinguish members from casual riders.

In [69]:
# Mean of the ride_length

all_trips_cleaned %>% 
  summarise(a = hms::hms(seconds_to_period(mean(ride_length)))) %>% 
  rename_at("a", ~ "Mean of ride_length")

In [70]:
# Max of the ride_length

all_trips_cleaned %>% 
  summarise(a = hms::hms(seconds_to_period(max(ride_length)))) %>% 
  rename_at("a", ~ "Max of ride_length")

In [71]:
# Minimum ride_length

all_trips_cleaned %>% 
  summarise(a = hms::hms(seconds_to_period(min(ride_length)))) %>% 
  rename_at("a", ~ "Min of ride_length")

In [72]:
# Mode of the day_of_week

library(DescTools)
Mode(all_trips_cleaned$day_of_week)
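
If the DescTools package is not available, the most frequent value can also be found with base R. This is a minimal sketch; the helper name `most_frequent` is hypothetical, and it returns the value as a character string because it reads the names of a frequency table.

```r
# Hypothetical helper: most frequent value(s) of a vector, using base R only.
most_frequent <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # keep the value(s) with the top count
}

most_frequent(c(1, 7, 7, 3, 7, 1))  # returns "7"

# Example call on the trip data:
# most_frequent(all_trips_cleaned$day_of_week)
```

If several values tie for the highest count, all of them are returned.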

In [73]:
# Average ride_length for members and casual riders

all_trips_cleaned %>% 
  group_by(member_casual) %>% 
  summarise(a = hms::hms(seconds_to_period(mean(ride_length)))) %>% 
  rename_at("a", ~ "Average ride_length")

In [74]:
aggregate(all_trips_cleaned$ride_length ~ all_trips_cleaned$member_casual, FUN = mean)
aggregate(all_trips_cleaned$ride_length ~ all_trips_cleaned$member_casual, FUN = median)
aggregate(all_trips_cleaned$ride_length ~ all_trips_cleaned$member_casual, FUN = max)
aggregate(all_trips_cleaned$ride_length ~ all_trips_cleaned$member_casual, FUN = min)

In [75]:
# average ride_length for users by day_of_week
all_trips_cleaned %>% 
  group_by(day_of_week) %>% 
  summarise(a = hms::hms(seconds_to_period(mean(ride_length)))) %>% 
  rename_at("a", ~ "Average ride_length")

In [76]:
# the average ride time by each day for casual customers and members.

aggregate(all_trips_cleaned$ride_length ~ all_trips_cleaned$member_casual + all_trips_cleaned$day_of_week, FUN = mean)

In [77]:

# day_of_week holds the numbers 1 (Sunday) to 7 (Saturday),
# so map them to day names and put them in the correct order.

all_trips_cleaned$day_of_week <- ordered(all_trips_cleaned$day_of_week, levels = 1:7, labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

In [78]:
all_trips_cleaned %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  
  group_by(member_casual, weekday) %>%  
  summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)

## Share

In [80]:
# visualize number of rides by rider type
all_trips_cleaned %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarize(number_of_rides = n(),
            average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday) %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("#CC6633","#6699CC")) +
  labs(title = "Number of Rides by Days and Rider Type",
       subtitle = "Members versus Casual Users") +
  ylab("Number of Rides") +
  xlab("Day of Week")

The next data visualization is for the average ride length by weekday for casual customers and members.

In [81]:
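all_trips_cleaned %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday) %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("#CC6633","#6699CC")) +
  labs(title = "Average Ride Length by Days and Rider Type",
       subtitle = "Members versus Casual Users") +
  ylab("Average Ride Length") +
  xlab("Day of Week")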