### Table of Contents
* #### [Introduction](#section-one)
* #### [Preparation](#section-two)
    * #### [Pre-Process](#sub-section-two-one)
* #### [Analysis](#section-three)
    * #### [Ridership](#sub-section-three-one)
    * #### [Average Ride Length](#sub-section-three-two)
    * #### [Bike Type](#sub-section-three-three)
    * #### [Popular Destinations](#sub-section-three-four)
* #### [Conclusion](#section-four)
    * #### [Key Insights](#sub-section-four-one)
    * #### [Recommendations](#sub-section-four-two)

<a id = "section-one" ></a> 
## Introduction

In this notebook, I analyze the most recent 12 months of bike-sharing data for a fictional bike-sharing company, Cyclistic. The analysis uses comparable real-world data from [Motivate LLC](https://divvy-tripdata.s3.amazonaws.com/index.html), a bike-sharing company based in New York City that operates multiple bike-sharing programs across the nation. The data has been provided under this [license agreement](https://www.divvybikes.com/data-license-agreement). For this case study, I use data from November 2020 through October 2021.

To increase the company's profits, Cyclistic's marketing director has asked how we can grow the number of annual members by converting casual riders, that is, riders who don't purchase annual memberships. To address this business task, we need to show the marketing director how casual and membership riders differ in their usage so the marketing team can target its advertisements more effectively.

Some questions that might help the marketing team better understand the data are:
1. How much does each type of user ride on a given day of the week and in a given month?
2. How long is each user type's average trip, by day of the week and by month?
3. What kind of bike does each user type ride, and does bike type usage differ by day of the week and by month?
4. Which stations does each user type start from most often, and where do their trips end?

<a id= "section-two" ></a>
## Preparation

I've downloaded the 12 most recent monthly datasets locally and unzipped the CSV files. In total, the 12 files contain approximately 1 GB of data. This volume of data, spread across multiple files, is too cumbersome to work with in a spreadsheet program. Accordingly, it is better to use a tool such as SQL or R. In this case, we will use R.

<a id= "sub-section-two-one" ></a>
### Pre-Process

First, we would like to combine the 12 files into one data frame and clean it from there. To do that, we first need to check whether the tidyverse package has been installed:


In [None]:
installed.packages()

We see that tidyverse isn't listed and therefore hasn't been installed. So, we'll install the tidyverse package:

In [None]:
install.packages("tidyverse")

Next, we need to load the tidyverse package:

In [None]:
library(tidyverse)

Next, we import the CSV files and assign each one a short name, which is easier to recognize and handle than typing out each file path:

In [None]:
Nov20 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202011-divvy-tripdata.csv")
Dec20 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202012-divvy-tripdata.csv")
Jan21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202101-divvy-tripdata.csv")
Feb21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202102-divvy-tripdata.csv")
Mar21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202103-divvy-tripdata.csv")
Apr21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202104-divvy-tripdata.csv")
May21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202105-divvy-tripdata.csv")
Jun21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202106-divvy-tripdata.csv")
Jul21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202107-divvy-tripdata.csv")
Aug21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202108-divvy-tripdata.csv")
Sep21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202109-divvy-tripdata.csv")
Oct21 <- read_csv("../input/divvy-tripdata-nov2020-oct2021/202110-divvy-tripdata.csv")


Since each dataset has the same column names, we can stack the datasets vertically into one dataset, which we will assign to Combined_tripdata:

In [None]:
Combined_tripdata <- rbind(Nov20, Dec20, Jan21, Feb21, Mar21, Apr21, May21, Jun21, Jul21, Aug21, Sep21, Oct21)
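The twelve imports and the rbind call above could also be written as a loop over file names. The following is only a sketch under stated assumptions: it writes two tiny made-up CSV files to a temporary folder (the folder name `divvy_demo` and the file contents are hypothetical), and it uses base R's `read.csv` to stay dependency-free, although `read_csv` could be used the same way.

```r
# Create a temporary folder with two small, made-up monthly files.
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a", "b"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202011-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "c", member_casual = "member"),
          file.path(demo_dir, "202012-divvy-tripdata.csv"), row.names = FALSE)

# List every monthly file; the YYYYMM prefix makes alphabetical order chronological.
files <- list.files(demo_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)

# Read each file into a list of data frames, then stack them row-wise.
monthly <- lapply(files, read.csv)
combined_demo <- do.call(rbind, monthly)

nrow(combined_demo)
```

With the real data, pointing `list.files` at the input folder would pick up all twelve files without naming each one.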

Now that the datasets have been stacked into one large dataset, "Combined_tripdata," we can clean it up by removing unnecessary columns, renaming two columns, adding columns for the ride duration, the day of the week, and the month in which each ride started, and removing rows with missing values.

In [None]:
Cleaned_tripdata <- Combined_tripdata %>% #using a pipe to clean up the combined dataframe and assigning the result to 'Cleaned_tripdata'
    select(-start_station_id, -end_station_id, -start_lat, -start_lng, -end_lat, -end_lng) %>% #removing the station id, latitude, and longitude columns because they're not needed
    rename(rider=member_casual) %>% #renaming the member_casual column to rider
    rename(bike_type=rideable_type) %>% #renaming the rideable_type column to bike_type
    mutate(trip_duration_min=difftime(ended_at, started_at, units = "mins")) %>% #adding a column for the trip duration in minutes; stating the units explicitly avoids relying on the units R picks automatically
    mutate(weekday=weekdays(started_at)) %>% #adding a weekday column by extracting the weekday from the time the user used the bike
    mutate(month=months(started_at)) %>% #adding a month column by extracting the month from the time the user used the bike
    na.omit() #removing rows that contain NA values
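A note on units when computing durations from timestamps: subtracting two date-time vectors returns a difftime whose units R chooses from the smallest difference in the data, so stating the units with `difftime()` is the safer route. A minimal sketch with two made-up trips (the timestamps are hypothetical):

```r
# Two made-up trips: one lasting 30 seconds, one lasting 2 hours.
started_at <- as.POSIXct(c("2021-07-01 08:00:00", "2021-07-01 09:00:00"), tz = "UTC")
ended_at   <- as.POSIXct(c("2021-07-01 08:00:30", "2021-07-01 11:00:00"), tz = "UTC")

# Plain subtraction: the units depend on the shortest gap in the vector.
auto_all  <- ended_at - started_at        # shortest gap is 30 s, so units are "secs"
auto_long <- ended_at[2] - started_at[2]  # only the 2 h trip, so units are "hours"
units(auto_all)
units(auto_long)

# Explicit units give minutes regardless of the data.
explicit <- difftime(ended_at, started_at, units = "mins")
as.numeric(explicit)  # 0.5 120.0
```

Dividing a plain subtraction by 60 therefore yields minutes only when R happened to choose seconds.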

In the following, we check the "Cleaned_tripdata" dataset:

In [None]:
str(Cleaned_tripdata)

Next, we'll check if there are any trip durations less than zero:

In [None]:
nrow(subset(Cleaned_tripdata,trip_duration_min < 0))

This tells us that we have 1303 rows of data where the trip duration was negative! We'll now have to remove these negative trip durations:

In [None]:
Clean_data <- Cleaned_tripdata[!(Cleaned_tripdata$trip_duration_min < 0),]

Now that our clean data is free of negative durations, it is nearly ready for analysis. One step remains: converting the trip duration to a plain numeric vector, stored separately as trip_duration_min_v2, that we can apply mathematical operations to:

In [None]:
trip_duration_min_v2 <- as.numeric(Clean_data$trip_duration_min)

Let's also order the weekdays so they are listed from Monday through Sunday:

In [None]:
Clean_data$weekday <- ordered(Clean_data$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", 
"Friday", "Saturday", "Sunday"))

Finally, let's order the months so they are listed from January through December:

In [None]:
Clean_data$month <- ordered(Clean_data$month, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
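To illustrate why this reordering matters, here is a small sketch with a made-up vector of day names. Character vectors sort alphabetically, whereas an ordered factor sorts (and plots) in the order of its levels. One caveat worth knowing: any value that does not match a level exactly becomes NA, which can happen if `weekdays()` returns names in a non-English locale.

```r
# Made-up sample of day names.
days <- c("Sunday", "Monday", "Friday", "Monday")

# As plain text, sorting is alphabetical.
sort(days)  # "Friday" "Monday" "Monday" "Sunday"

# As an ordered factor, sorting follows the calendar week.
week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")
days_ordered <- ordered(days, levels = week_levels)
as.character(sort(days_ordered))  # "Monday" "Monday" "Friday" "Sunday"

# A value that is not among the levels turns into NA.
unmatched <- ordered("Montag", levels = week_levels)
is.na(unmatched)  # TRUE
```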

<a id = "section-three"></a>
## Analysis

To start analyzing the cleaned data, we'll query some basic summary statistics.

<a id = "sub-section-three-one"></a>
### Ridership

First, let's look at the count of casual riders vs membership riders: 

In [None]:
table(Clean_data$rider)

ggplot(data=Clean_data, aes(x=rider)) +
    geom_bar(stat="count", fill="steelblue") +
    labs(title="Ride Share Count: Casual vs. Membership Riders")

We see that membership riders account for about 430,000 more rides than casual riders over the 12-month period from November 2020 through October 2021: 2.46M membership rides versus 2.03M casual rides.

Let's look at how much each rider is using the bike program on a daily basis:

In [None]:
Clean_data %>%
    group_by(rider, weekday) %>%
    summarize(count = n())

Clean_data %>%
    group_by(rider, weekday) %>%
    summarize(count = n()) %>%
    ggplot(aes(x=weekday, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Daily Usage between Riders") +
        scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

From this daily usage chart, we see that casual riders ride more from midweek into the weekend, whereas membership riders ride at a roughly constant level throughout the week, at approximately 350,000 trips per day of the week.
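The table and the chart above run the same group_by/summarize step twice. As a sketch of one alternative, dplyr's `count()` can compute the counts once and store them, so the same object can be printed and then passed to ggplot. The toy data frame below is a hypothetical stand-in for Clean_data with made-up rows.

```r
library(dplyr)

# Hypothetical stand-in for Clean_data.
toy_rides <- data.frame(
    rider   = c("casual", "casual", "member", "member", "member"),
    weekday = c("Saturday", "Saturday", "Monday", "Monday", "Saturday")
)

# count() is shorthand for group_by() followed by summarize(n()).
weekday_counts <- toy_rides %>%
    count(rider, weekday, name = "count")

weekday_counts
```

The stored `weekday_counts` could then be piped into the same ggplot call used above instead of recomputing the summary.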

Continuing, let's see how much each rider is using the bike program on a monthly basis:

In [None]:
Clean_data %>%
    group_by(rider, month) %>%
    summarize(count = n())

Clean_data %>%
    group_by(rider, month) %>%
    summarize(count = n()) %>%
    ggplot(aes(x=month, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Monthly Usage between Riders") +
        scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

From the monthly usage table and chart, we see that casual ridership follows a roughly bell-shaped pattern over the calendar year. It peaks in July, stays high in the months closest to July, and drops off the further a month is from July. Membership ridership shows a similar increase through the first half of the year until July, but, unlike casual ridership, it stays elevated until October before it starts to decline, and the decline is less precipitous than for casual riders.

<a id = "sub-section-three-two"></a>
### Average Ride Length

Next, let's look at the summary statistics of the trip duration:

In [None]:
summary(trip_duration_min_v2)

We see that the average trip duration was 22.31 minutes across all rides.

Next, let's look at the average trip for each user and throughout each day and month:

In [None]:
Clean_data %>%
    group_by(rider) %>%
    summarize(mean_trip = round(mean(as.numeric(trip_duration_min)),2))

Clean_data %>%
    group_by(rider) %>%
    summarize(mean_trip = round(mean(as.numeric(trip_duration_min)),2)) %>%
    ggplot(aes(x=rider, y=mean_trip)) +
        geom_col( fill="steelblue") +
        labs(title = "Average Ride Trip Length between Users")
    

From this table and chart, we see that casual riders take longer trips, more than twice as long on average as those of membership riders.

Let's see what the average trip length looks like for each user type throughout the week:

In [None]:
Clean_data %>%
    group_by(rider, weekday) %>% 
    summarize(mean_trip = round(mean(as.numeric(trip_duration_min)),2))

Clean_data %>%
    group_by(rider, weekday) %>% 
    summarize(mean_trip = round(mean(as.numeric(trip_duration_min)),2)) %>%
    ggplot(aes(x=weekday, y=mean_trip, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Daily Average Ride Trip Length between Users")

From the daily table and bar chart, we see that the average ride length increases on the weekend for both user types, and much more so for casual riders. Membership riders show a steady average ride length on weekdays, whereas casual riders' average ride length declines at the beginning of the week and rises again from midweek into the weekend. Casual riders are not only taking longer rides; they are also increasing their number of trips more than membership riders do at the end of the week and during the weekend.

Let's get a sense of the average ride length throughout the year for each rider:

In [None]:
Clean_data %>%
    group_by(rider, month) %>% 
    summarize(mean_trip = round(mean(as.numeric(trip_duration_min)),2))

Clean_data %>%
    group_by(rider, month) %>% 
    summarize(mean_trip = round(mean(as.numeric(trip_duration_min)),2)) %>%
    ggplot(aes(x=month, y=mean_trip, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Monthly Average Ride Trip Length between Users")

From the monthly table and bar chart, we see that the average ride length for membership riders stays consistent throughout the year, hovering between 12 and 14 minutes. The average ride length for casual riders, by contrast, is greatest in February, settles at around 38 minutes from March through June, and declines from there through the rest of the year.

<a id = "sub-section-three-three"></a>
### Bike Type

Next, let's take a look at the type of bike each user is using:

In [None]:
Clean_data %>%
    group_by(rider, bike_type) %>% 
    summarize(count = n())

Clean_data %>%
    group_by(rider, bike_type) %>% 
    summarize(count = n()) %>%
    ggplot(aes(x=bike_type, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Bike Type between Users")

From the bike type chart, we can see that classic bikes are clearly the most used type, followed by electric bikes at a distant second and, lastly, docked bikes.

Let's also check if there's a difference between days of the week:

In [None]:
Clean_data %>%
    group_by(rider,weekday) %>% 
    filter(bike_type == "classic_bike") %>%
    summarize(count = n())

Clean_data %>%
    group_by(rider, weekday) %>% 
    filter(bike_type == "classic_bike") %>%
    summarize(count = n()) %>%
    ggplot(aes(x=weekday, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Rider's Daily Usage for Classic Bikes") +
        scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

In the classic bike chart, we see that membership riders use classic bikes around 260,000 times on average on each day of the week. Casual riders use classic bikes less during the week, but their usage increases towards the weekend and surpasses membership riders' usage on Saturday and Sunday.

Continuing with the classic bike type, let's see how users compare throughout the year on a monthly basis:

In [None]:
Clean_data %>%
    group_by(rider, month) %>% 
    filter(bike_type == "classic_bike") %>%
    summarize(count = n())

Clean_data %>%
    group_by(rider, month) %>% 
    filter(bike_type == "classic_bike") %>%
    summarize(count = n()) %>%
    ggplot(aes(x=month, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Rider's Monthly Usage for Classic Bikes") +
        scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

From the monthly classic bike chart, we can see that usage for both groups is concentrated in the warmer months: it increases starting in March and peaks in July for casual riders and in August for membership riders. Both groups show a precipitous decline in usage in December, January, and February. The data also shows that membership riders use classic bikes more often than casual riders throughout the year.

Next, let's check the docked bikes situation on a daily and monthly rate:

In [None]:
Clean_data %>%
    group_by(rider, weekday) %>% 
    filter(bike_type == "docked_bike") %>%
    summarize(count = n())

Clean_data %>%
    group_by(rider, weekday) %>% 
    filter(bike_type == "docked_bike") %>%
    summarize(count = n()) %>%
    ggplot(aes(x=weekday, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Rider's Daily Usage for Docked Bikes") +
        scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

From the daily docked bikes chart, we see that casual riders account for more docked bike rides than membership riders do throughout the week. The gap is widest on the weekend, when membership riders averaged approximately 14,000 docked bike rides and casual riders averaged approximately 82,000.

Continuing with the docked bikes data, let's check the monthly data:

In [None]:
Clean_data %>%
    group_by(rider, month) %>% 
    filter(bike_type == "docked_bike") %>%
    summarize(count = n())

Clean_data %>%
    group_by(rider, month) %>% 
    filter(bike_type == "docked_bike") %>%
    summarize(count = n()) %>%
    ggplot(aes(x=month, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Rider's Monthly Usage for Docked Bikes") +
        scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Interestingly, there doesn't seem to be any data for membership riders using docked bikes from February 2021 through October 2021. In the data that is available, membership riders used docked bikes the most in November, whereas casual riders used docked bikes the most in June and July.

Lastly, let's check the electric bike situation for each user on a daily and monthly rate:

In [None]:
Clean_data %>%
    group_by(rider, weekday) %>% 
    filter(bike_type == "electric_bike") %>%
    summarize(count = n())

Clean_data %>%
    group_by(rider, weekday) %>% 
    filter(bike_type == "electric_bike") %>%
    summarize(count = n()) %>%
    ggplot(aes(x=weekday, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Rider's Daily Usage for Electric Bikes") +
        scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

From the daily electric bike chart, we see that membership riders outpace casual riders in electric bike usage on weekdays. During the weekend, however, casual riders outpace membership riders.

Let's check the monthly rate:

In [None]:
Clean_data %>%
    group_by(rider, month) %>% 
    filter(bike_type == "electric_bike") %>%
    summarize(count = n())

Clean_data %>%
    group_by(rider, month) %>% 
    filter(bike_type == "electric_bike") %>%
    summarize(count = n()) %>%
    ggplot(aes(x=month, y=count, fill=rider)) +
        geom_col(width=0.7, position = position_dodge(width=1)) +
        labs(title = "Rider's Monthly Usage for Electric Bikes") +
        scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

From the electric bike monthly chart, we see that, for both casual and membership riders, electric bike usage is highest from May through October.
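The weekday charts in this section repeat one pipeline three times, changing only the bike type in the filter. As a sketch of an alternative layout, the counts can be computed once per bike type and drawn as one panel each with `facet_wrap()`. The toy data frame and the names `toy_rides`, `bike_counts`, and `bike_plot` are hypothetical; the real input would be Clean_data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for Clean_data with made-up rows.
toy_rides <- data.frame(
    rider     = c("casual", "member", "member", "member",
                  "casual", "casual", "member"),
    bike_type = c("classic_bike", "classic_bike", "classic_bike", "classic_bike",
                  "docked_bike", "electric_bike", "electric_bike"),
    weekday   = c("Monday", "Monday", "Monday", "Saturday",
                  "Saturday", "Saturday", "Monday")
)

# One count per bike type, rider type, and weekday.
bike_counts <- toy_rides %>%
    count(bike_type, rider, weekday, name = "count")

# One panel per bike type; free y-scales keep the small docked counts readable.
bike_plot <- ggplot(bike_counts, aes(x = weekday, y = count, fill = rider)) +
    geom_col(width = 0.7, position = position_dodge(width = 1)) +
    facet_wrap(~ bike_type, ncol = 1, scales = "free_y") +
    labs(title = "Riders' Daily Usage by Bike Type")
```

Swapping `weekday` for `month` in both the count and the aesthetics would give the monthly version in the same way.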

<a id = "sub-section-three-four"></a>
### Popular Destinations

Finally, let's analyze the top five stations that are popular among users.

We'll begin with analyzing the top five start and end stations among both users:

In [None]:
Clean_data %>%
    group_by(start_station_name) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    top_n(5, count)

Clean_data %>%
    group_by(end_station_name) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    top_n(5, count)

Then we'll check the top five start and end stations among casual riders:

In [None]:
Clean_data %>%
    group_by(start_station_name, rider) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    filter(rider=="casual") %>%
    head(n=5)

Clean_data %>%
    group_by(end_station_name, rider) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    filter(rider=="casual") %>%
    head(n=5)

Finally, we'll check the top five start and end stations for membership riders:

In [None]:
Clean_data %>%
    group_by(start_station_name, rider) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    filter(rider=="member") %>%
    head(n=5)

Clean_data %>%
    group_by(end_station_name, rider) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    filter(rider=="member") %>%
    head(n=5)
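The per-rider queries above filter one rider type at a time. As a sketch of a more compact variant, dplyr's `slice_max()` (available from dplyr 1.0.0) can return the top stations for every rider type in a single result. The ride counts in the toy data are made up, and `n = 1` is used only because the toy data is tiny; the analysis above would use `n = 5`.

```r
library(dplyr)

# Hypothetical stand-in for Clean_data; station names are real, the rows are made up.
toy_rides <- data.frame(
    rider = c("casual", "casual", "casual",
              "member", "member", "member", "member"),
    start_station_name = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                           "Millennium Park",
                           "Clark St & Elm St", "Clark St & Elm St",
                           "Clark St & Elm St", "Wells St & Concord Ln")
)

# Count rides per rider type and station, then keep the busiest station per rider type.
top_start <- toy_rides %>%
    count(rider, start_station_name, name = "count") %>%
    group_by(rider) %>%
    slice_max(count, n = 1, with_ties = FALSE) %>%
    ungroup()

top_start
```

Setting `with_ties = FALSE` guarantees exactly `n` rows per rider type, which matches the behavior of `head(n=5)` more closely than `top_n()`, which keeps ties.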

<a id = "section-four"></a>
## Conclusion

In conclusion, the data that Motivate LLC has made available gives us valuable information that we can pass on to our stakeholders to support better business decisions. Although starting with clean and complete data would have been preferable, Motivate's data nevertheless answered the questions we posed, and those answers yield key insights for Cyclistic's marketing director and her team as they plan a marketing strategy to convert casual riders into membership riders.

<a id = "sub-section-four-one"></a>
#### Key Insights

* Membership riders account for more rides overall. However, casual riders account for more rides on the weekend. Both casual and membership riders ride more often from March through October, which might be attributed to more favorable weather than in the winter months.

* On average, casual riders' trips last about twice as long as membership riders' trips. Casual riders' average trip is also longer throughout the week and the year.

* The most popular bike among all groups is the classic bike. Classic bikes are especially popular with casual riders during the weekend.

* Popular stations among both groups are:
    1. Streeter Dr & Grand Ave
    2. Michigan Ave & Oak St
    3. Wells St & Concord Ln
    4. Millennium Park
    5. Clark St & Elm St


* Popular stations among casual riders are:

    1. Streeter Dr & Grand Ave
    2. Millennium Park
    3. Michigan Ave & Oak St
    4. Theater on the Lake
    5. Shedd Aquarium


* Popular stations among membership riders are:

    1. Clark St & Elm St
    2. Wells St & Concord Ln
    3. Kingsbury St & Kinzie St
    4. Wells St & Elm St
    5. Dearborn St & Erie St
    
<a id = "sub-section-four-two"></a>
#### Recommendations

* Since riders are most active during the spring and summer months, target advertisements at casual riders during the off-peak months and offer a membership price discount to convert them into membership riders.

* Target advertisements at casual riders at their most popular bike stations and at other points of interest in the surrounding area.

* Seek partnerships in the surrounding area or points of interest for membership discounts or benefits that would attract casual riders into getting a membership.
