# Introduction

I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director
of marketing believes the company's future success depends on maximizing the number of annual memberships. Therefore,
my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights,
my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives
must approve our recommendations, so they must be backed up with compelling data insights and professional data
visualizations.

# Ask
1. What is the business task?
    
    **Business task**: To use historical data and analytics to determine how casual riders and annual members use Cyclistic bikes differently, and to use these insights to create a marketing strategy that increases annual memberships by converting casual riders into annual members.
 
2.  Who are the key stakeholders?
    
    **Lily Moreno:** The director of marketing and my manager. Moreno is responsible for the development of campaigns
    and initiatives to promote the bike-share program. These may include email, social media, and other channels.
    
    **Cyclistic marketing analytics team:** A team of data analysts who are responsible for collecting, analyzing, and
    reporting data that helps guide Cyclistic marketing strategy.

    **Cyclistic executive team:** The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.


# Prepare
The data we are using for this project is a public dataset made available by 
Motivate International Inc. under this [license](https://ride.divvybikes.com/data-license-agreement). 
The data is located [here](https://divvy-tripdata.s3.amazonaws.com/index.html), and we are analysing the 12 months of trip data 
from January 2021 to December 2021. 
There are 12 datasets, each containing one month of data. 
We found no issues with the credibility of this data: 
the dataset appears to be reliable, original, comprehensive, current, and cited.

# Process

In this step we prepare the data for analysis,
first by merging all the datasets into one single dataframe.

Installing and loading the required packages

In [None]:
install.packages("tidyverse")
install.packages("lubridate")
library("tidyverse")
library("lubridate")

Importing the datasets

In [None]:
january21 <- read_csv("../input/2021-bike-share/202101-divvy-tripdata.csv")
february21 <- read_csv("../input/2021-bike-share/202102-divvy-tripdata.csv")
march21 <- read_csv("../input/2021-bike-share/202103-divvy-tripdata.csv")
april21 <- read_csv("../input/2021-bike-share/202104-divvy-tripdata.csv")
may21 <- read_csv("../input/2021-bike-share/202105-divvy-tripdata.csv")
june21 <- read_csv("../input/2021-bike-share/202106-divvy-tripdata.csv")
july21 <- read_csv("../input/2021-bike-share/202107-divvy-tripdata.csv")
august21 <- read_csv("../input/2021-bike-share/202108-divvy-tripdata.csv")
september21 <- read_csv("../input/2021-bike-share/202109-divvy-tripdata.csv")
october21 <- read_csv("../input/2021-bike-share/202110-divvy-tripdata.csv")
november21 <- read_csv("../input/2021-bike-share/202111-divvy-tripdata.csv")
december21 <- read_csv("../input/2021-bike-share/202112-divvy-tripdata.csv")

Combining the datasets into one single dataframe

In [32]:
bike_data <- bind_rows(january21, february21, march21, april21, may21, june21, july21, august21, september21, october21, november21, december21)
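
The twelve `read_csv()` calls and the `bind_rows()` call can also be written as a loop over file names. The block below is a minimal sketch in base R, run against two tiny made-up files in a temporary directory so that it is self-contained. The file contents and the names `data_dir`, `monthly` and `demo_data` are illustrative assumptions, not part of the notebook. With tidyverse loaded, `read_csv()` and `bind_rows()` could take the place of `read.csv()` and `rbind()`.

```r
# Toy stand-ins for the monthly files (hypothetical contents)
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"),
                     member_casual = c("member", "casual")),
          file.path(data_dir, "202101-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "casual"),
          file.path(data_dir, "202102-divvy-tripdata.csv"), row.names = FALSE)

# List every monthly file, read each one, and stack them row-wise
files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)
demo_data <- do.call(rbind, monthly)

nrow(demo_data)
```

Because the file names start with the year and month, `list.files()` returns them in calendar order.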

In [33]:
head(bike_data)

## Data Cleaning

1. Checking for and removing duplicates

In [38]:
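# Record the row count first so the number of removed rows can be reported
rows_before <- nrow(bike_data)
bike_data <- bike_data[!duplicated(bike_data$ride_id), ]
print(paste("Removed", rows_before - nrow(bike_data), "duplicated rows"))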

A quick description of the columns and data types

In [39]:
str(bike_data)

In [40]:
nrow(bike_data)

In total we have over 5 million rows (observations), so this is a large dataset.

Before summarizing the dataset, we extract the hour, day, month and year from the start timestamp so that the data can be aggregated in more ways.
We also add a new column for the length of each trip in seconds.

In [97]:
bike_data$date <- as.Date(bike_data$started_at) 
bike_data$month <- format(as.Date(bike_data$date), "%m")
bike_data$day <- format(as.Date(bike_data$date), "%d")
bike_data$year <- format(as.Date(bike_data$date), "%Y")
bike_data$day_of_week <- format(as.Date(bike_data$date), "%A")
bike_data$hour_of_day <- strftime(bike_data$started_at, "%H")
bike_data$ride_length <- difftime(bike_data$ended_at, bike_data$started_at, units = "secs")


Convert ride_length to numeric

In [99]:
bike_data$ride_length <- as.numeric(as.character(bike_data$ride_length))
is.numeric(bike_data$ride_length)
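
`difftime()` chooses its units automatically when none are given, so the numbers it returns are not guaranteed to be seconds. The made-up timestamps below illustrate the difference. Requesting `units = "secs"` explicitly is one way to make sure a ride length is in seconds before it is converted to numeric. The variable names are illustrative only.

```r
# Two made-up timestamps, 30 minutes apart
start_time <- as.POSIXct("2021-06-01 10:00:00", tz = "UTC")
end_time   <- as.POSIXct("2021-06-01 10:30:00", tz = "UTC")

# Automatic units: this difference is reported in minutes
auto_diff <- difftime(end_time, start_time)
units(auto_diff)

# Explicit units: the difference is reported in seconds
secs_diff <- difftime(end_time, start_time, units = "secs")
ride_length_secs <- as.numeric(secs_diff)
ride_length_secs
```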

Group length of ride to enable further aggregation

In [100]:
head(bike_data)

In [101]:
bike_data$group_length <- dplyr::case_when(
  bike_data$ride_length < 900 ~ "Less 15 minutes",
  bike_data$ride_length < 3600 ~ "Less 1 hour",
  bike_data$ride_length < 7200 ~ "Less 2 hour",
  bike_data$ride_length < 28800 ~ "Less 8 hour",
  bike_data$ride_length < 86400 ~ "Less 1 day",
  bike_data$ride_length >= 86400 ~ "More than 1 day"
)
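
`case_when()` returns plain text labels, which charts sort alphabetically rather than by duration. As an alternative sketch, base R's `cut()` can produce the same groups as an ordered factor. The toy vector `ride_length_demo` is an assumption for illustration; the labels are copied from the cell above.

```r
ride_length_demo <- c(100, 900, 4000, 30000, 90000)

group_labels <- c("Less 15 minutes", "Less 1 hour", "Less 2 hour",
                  "Less 8 hour", "Less 1 day", "More than 1 day")

# right = FALSE makes each interval [lower, upper), matching the `<` tests above
group_length_demo <- cut(ride_length_demo,
                         breaks = c(-Inf, 900, 3600, 7200, 28800, 86400, Inf),
                         labels = group_labels,
                         right = FALSE,
                         ordered_result = TRUE)

as.character(group_length_demo)
```

With an ordered factor on the x-axis, `ggplot2` would draw the groups from shortest to longest instead of alphabetically.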

In [102]:
summary(bike_data)

Remove bad data

From the above summary we can see that the minimum value of ride_length is -3482. This means that the data frame includes observations where ride_length is negative, so we have to remove these rows.

In [103]:
bike_data_v2<- bike_data[!(bike_data$ride_length<0),]

In [104]:
summary(bike_data_v2$ride_length)
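
One caveat with this kind of logical subsetting: if `ride_length` were ever `NA`, the condition would also be `NA`, and the result would contain an all-`NA` row instead of dropping it. The sketch below uses a made-up data frame (`toy`) to show this and one way to guard against it. Whether the real data contains missing ride lengths is not established here.

```r
toy <- data.frame(ride_id = c("a", "b", "c", "d"),
                  ride_length = c(120, -30, NA, 900))

# Same pattern as above: the NA condition yields an all-NA row
kept_naive <- toy[!(toy$ride_length < 0), ]

# Guarded version: keep only rows with a known, non-negative ride length
kept_safe <- toy[!is.na(toy$ride_length) & toy$ride_length >= 0, ]

nrow(kept_naive)
nrow(kept_safe)
```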

# Analyze

We obtain the mean, median, max and min values in ride length between casual riders and members

In [105]:
aggregate(bike_data_v2$ride_length ~ bike_data_v2$member_casual, FUN = mean)
aggregate(bike_data_v2$ride_length ~ bike_data_v2$member_casual, FUN = median)
aggregate(bike_data_v2$ride_length ~ bike_data_v2$member_casual, FUN = max)
aggregate(bike_data_v2$ride_length ~ bike_data_v2$member_casual, FUN = min)
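
The four summaries can also be collected in one call by passing `aggregate()` a function that returns several named values. This is a minimal sketch on made-up ride lengths; the `toy_rides` data frame is hypothetical and its numbers are not results from the real data.

```r
toy_rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length   = c(600, 1800, 3000, 300, 600, 900)
)

# One row per rider type; ride_length becomes a matrix with one column per statistic
ride_summary <- aggregate(
  ride_length ~ member_casual,
  data = toy_rides,
  FUN = function(x) c(mean = mean(x), median = median(x), max = max(x), min = min(x))
)

ride_summary
```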

We compare the average ride length by day of the week for members and casual riders

In [106]:
aggregate(bike_data_v2$ride_length ~ bike_data_v2$member_casual + bike_data_v2$day_of_week, FUN = mean)

The days of the week are out of order, so we fix that and calculate again 

In [107]:
bike_data_v2$day_of_week <- ordered(bike_data_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

aggregate(bike_data_v2$ride_length ~ bike_data_v2$member_casual + bike_data_v2$day_of_week, FUN = mean)
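
The reordering works because a character column is sorted alphabetically, while an ordered factor is sorted by its declared levels. A small sketch with made-up values (the variable names are illustrative):

```r
days <- c("Monday", "Friday", "Sunday")

# Plain text sorts alphabetically
alphabetical <- sort(days)
alphabetical

# An ordered factor sorts by the order of its levels
week_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")
by_weekday <- as.character(sort(ordered(days, levels = week_levels)))
by_weekday
```

Note that `format(..., "%A")` returns day names in the session's locale, so the English level names assume an English locale.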

We visualize the number of rides per weekday by members and casual riders

In [108]:
bike_data_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")

From the chart above we can see that members take more rides during the week, while casuals ride more during the weekends. While the number of rides for casuals stays roughly the same from Monday to Thursday, there is a significant increase in the number of rides from Friday.

Next we calculate the average duration of rides in minutes

In [124]:
bike_data_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length) / 60) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")

In this chart we can see that casual riders take longer rides than members. Members have a shorter and more stable average duration. This may mean that members use the bikes for routine trips, while casual riders use them more for recreational trips.

We also analyze the number of rides across the months

In [110]:
bike_data_v2 %>% 
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n()) %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")

We discover that there are fewer rides during the winter months and more rides during summer.
This suggests that weather and temperature affect the number of rides for both groups. We also see that casuals take more rides during summer than members, while members take more rides during winter, perhaps out of necessity to keep up with their routine trips.

In [111]:
bike_data_v2 %>% 
  group_by(member_casual, group_length) %>% 
  summarise(number_of_rides = n()) %>% 
  arrange(group_length)  %>% 
  ggplot(aes(x = group_length, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")

The above chart confirms that members take more short rides (under 15 minutes) than casuals.

We analyze the rides by hour of the day

In [114]:
bike_data_v2 %>%
    ggplot(aes(hour_of_day, fill=member_casual)) +
    geom_bar(position = "dodge")

From the chart above we can see that there are more rides during the afternoon. Members also take more rides in the morning between 5 am and 9 am, which may imply that members ride to work. We can also see that casuals take more rides between 11 pm and 4 am. The peak ride time is between 4 pm and 6 pm.
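
The bars in this chart are simply counts of rides per start hour and rider type. A minimal base R sketch of the same tally, on made-up timestamps (the `toy_trips` data frame is hypothetical):

```r
toy_trips <- data.frame(
  started_at = as.POSIXct(c("2021-06-01 08:05:00", "2021-06-01 08:40:00",
                            "2021-06-01 17:15:00", "2021-06-01 23:30:00"),
                          tz = "UTC"),
  member_casual = c("member", "member", "casual", "casual")
)

# Extract the start hour as a two-digit string, then cross-tabulate
toy_trips$hour_of_day <- format(toy_trips$started_at, "%H")
hour_counts <- table(toy_trips$hour_of_day, toy_trips$member_casual)

hour_counts
```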

In [127]:
head(bike_data_v2)

In [129]:
bike_data_v2 %>%
    ggplot(aes(hour_of_day, fill=member_casual)) +
    geom_bar()+
    facet_wrap(~ day_of_week)

The chart above shows a clear distinction in pattern between weekdays and weekends. Casuals ride more during the weekend, peaking in the afternoon, while members ride more during the week, peaking in the early morning and in the evening.

We analyze the bike types to determine the preference between members and casual riders

In [119]:
bike_data_v2 %>% 
  ggplot(aes(rideable_type, fill = member_casual)) +
  geom_bar()

Overall, both groups of riders prefer the classic bike over the electric bike and the docked bike. This may be because the riders prefer to both exercise and commute while using the bikes. It may also mean that the company has more classic bikes in its inventory.

# Share

In this stage we collate all the findings from our analysis in order to arrive at a conclusion

Our findings from the analysis are listed below:
1. Members take more rides on weekdays
2. Casuals take more rides on weekends
3. The number of casual rides increases each week from Friday
4. Casuals take longer rides than members on all days
5. The number of rides is affected by season.
6. There are more rides during summer months than in winter months.
7. Members take more short rides (less than 15 minutes)
8. There are more rides during the afternoon.
9. Members take more rides between 5 am and 9 am, and also between 4 pm and 6 pm.
10. Casuals take more rides between 11 pm and 4 am.
11. The classic bike is preferred by both casuals and members


Conclusions from our findings:
* Members use the bikes for more specific activities, e.g. commuting to and from work
* Casuals use the bikes more for recreational activities, especially during the weekends, e.g. exercising
* Seasons affect bike riding



Recommendations
1. Create two categories of membership, full membership and weekend membership (at a lower price), to target casuals who mostly ride on weekends.
2. Offer more discounts or incentives during winter months to encourage more rides.
3. Create a marketing campaign to showcase the benefits of riding to work, emphasizing the exercise and climate benefits.