The final course of the [Google Data Analytics Professional Certificate](http://www.coursera.org/professional-certificates/google-data-analytics) is a capstone project that applies the technical skills taught in the program (Excel/Google Sheets, SQL, R). For this case study I used RStudio for data preparation, analysis, and visualization.

 **Scenario:** I am a junior data analyst working in the marketing analyst team at Cyclistic, a fictional bike-share company in Chicago. 

* Cyclistic has 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. 

* The bikes can be unlocked from one station and returned to any other station in the system anytime. 

* Until now, the company's marketing strategy relied on appealing to broad consumer segments through flexible pricing plans. 

* These plans consist of single-ride passes, full-day passes, and annual memberships. 

* Customers purchasing single-ride or full-day passes are casual riders.

* Customers purchasing annual memberships are Cyclistic members. 

**Task:** To increase profits, the director of marketing wants to design a marketing strategy aimed at converting casual riders into annual members. I am therefore tasked with analyzing historical bike trip data to better understand how annual members and casual riders use bikes differently. 

**The data:** I will use Cyclistic's historical trip [dataset](http://divvy-tripdata.s3.amazonaws.com/index.html) for the 12 months of 2021 (Jan 2021 to Dec 2021). The datasets carry a different name (Divvy) because Cyclistic is a fictional company, but they are appropriate for answering the business questions in this case. The data has been made available under this [license](http://ride.divvybikes.com/data-license-agreement) by Motivate International Inc., which operates the City of Chicago’s Divvy bicycle sharing service.



In [17]:

library(tidyverse)  #helps wrangle data
library(lubridate)  #helps wrangle date attributes
library(ggplot2)  #helps visualize data
getwd() #displays your working directory

In [25]:
# Load the Divvy datasets (csv files) here
library(readr)
Jan <- read_csv("../input/cyclisticdivvy/January.csv")
Feb <- read_csv("../input/cyclisticdivvy/February.csv")
Mar <- read_csv("../input/cyclisticdivvy/March.csv")
Apr <- read_csv("../input/cyclisticdivvy/April.csv")
May <- read_csv("../input/cyclisticdivvy/May.csv")
June <- read_csv("../input/cyclisticdivvy/June.csv")
July <- read_csv("../input/cyclisticdivvy/July.csv")
Aug <- read_csv("../input/cyclisticdivvy/August.csv")
Sep <- read_csv("../input/cyclisticdivvy/September.csv")
Oct <- read_csv("../input/cyclisticdivvy/October.csv")
Nov <- read_csv("../input/cyclisticdivvy/November.csv")
Dec <- read_csv("../input/cyclisticdivvy/December.csv")
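
The twelve `read_csv()` calls could also be written as a loop over file names. This is a minimal sketch rather than the method used in this notebook. It assumes the files are named after R's built-in `month.name` vector, as the paths in this notebook are. So that it runs anywhere, it first writes two tiny stand-in CSV files to a temporary directory. The names `demo_dir` and `monthly` and the sample rows are hypothetical.

```r
library(readr)

# Hypothetical stand-in for ../input/cyclisticdivvy so the sketch is runnable
demo_dir <- file.path(tempdir(), "cyclisticdivvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
          file.path(demo_dir, "January.csv"))
write_csv(data.frame(ride_id = "b1", member_casual = "member"),
          file.path(demo_dir, "February.csv"))

# Build the file paths from the built-in month names, keeping only files that exist
paths <- file.path(demo_dir, paste0(month.name, ".csv"))
paths <- paths[file.exists(paths)]

# Read each file into a named list of data frames, one element per month
monthly <- lapply(paths, read_csv, show_col_types = FALSE)
names(monthly) <- tools::file_path_sans_ext(basename(paths))
```

A list like `monthly` can be passed directly to `bind_rows()`, which accepts a list of data frames.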

In [26]:
# Compare the column names of each file
# While the names don't have to be in the same order, they DO need to match perfectly before we can use a command to join them into one file
colnames(Jan)
colnames(Feb)
colnames(Mar)
colnames(Apr)
colnames(May)
colnames(June)
colnames(July)
colnames(Aug)
colnames(Sep)
colnames(Oct)
colnames(Nov)
colnames(Dec)
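
Comparing twelve printed lists of column names by eye is error-prone, so the same check can be done programmatically. This is a minimal sketch, assuming the monthly data frames are collected in a named list. The two small data frames and the `same_columns` helper are hypothetical stand-ins, not part of the original analysis.

```r
# Hypothetical stand-ins for the monthly data frames
monthly <- list(
  Jan = data.frame(ride_id = "a1", started_at = "2021-01-01 08:00:00", member_casual = "member"),
  Feb = data.frame(member_casual = "casual", ride_id = "b1", started_at = "2021-02-01 09:00:00")
)

# TRUE when a data frame has the same column names as the reference, in any order
same_columns <- function(df, reference) {
  setequal(colnames(df), colnames(reference))
}

# Compare every month against the first one
matches <- sapply(monthly, same_columns, reference = monthly[[1]])
matches
all(matches)
```

`setequal()` ignores column order, which matches the requirement that names must agree but need not be in the same position.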

In [27]:
# Inspect the dataframes and look for incongruence
str(Jan)
str(Feb)
str(Mar)
str(Apr)
str(May)
str(June)
str(July)
str(Aug)
str(Sep)
str(Oct)
str(Nov)
str(Dec)


In [28]:
# Stack the individual months' data frames into one big data frame
# Since all the datasets have the same column names and data types, we can stack them into one data frame
all_trips <- bind_rows(Jan,Feb,Mar,Apr,May,June,July,Aug,Sep,Oct,Nov,Dec)

In [29]:
# Remove latitude & longitude
all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng))


In [30]:
# Inspect the new table that has been created:

colnames(all_trips)  #List of column names

In [31]:
nrow(all_trips)  #How many rows are in data frame? 5595063 rows      
dim(all_trips)  #Dimensions of the data frame? 5595063 rows, 9 columns

In [32]:
head(all_trips)  #See the first 6 rows of data frame.  

In [33]:
str(all_trips)  #See list of columns and data types (numeric, character, etc)

In [34]:
summary(all_trips)  #Statistical summary of data. 

There are a few problems we will need to fix:

* (1) The data can only be aggregated at the ride level, which is too granular. We will add columns such as day, month, and year to provide additional opportunities to aggregate the data.
* (2) The data has no column for the length of each ride. We will add a calculated "ride_length" field to the entire data frame.


In [35]:
# Add columns that list the date, month, day, and year of each ride
# This will allow us to aggregate ride data for each month, day, or year ... before completing these operations we could only aggregate at the ride level

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

# Add a "ride_length" calculation to all_trips (in seconds)
# Units are set explicitly because difftime() otherwise picks them automatically
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")
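
The derived columns can be checked on a single made-up ride. This is a small sketch with hypothetical timestamps and variable names. It also converts the ride length to a plain number, which this notebook does not do, to show what the value looks like without the `difftime` class.

```r
# Made-up ride used only to illustrate the derived columns
started_at <- as.POSIXct("2021-07-04 10:00:00", tz = "UTC")
ended_at   <- as.POSIXct("2021-07-04 10:30:00", tz = "UTC")

ride_date    <- as.Date(started_at)       # calendar date of the ride
ride_month   <- format(ride_date, "%m")   # two-digit month, as text
ride_day     <- format(ride_date, "%d")   # two-digit day of month, as text
ride_year    <- format(ride_date, "%Y")   # four-digit year, as text
ride_weekday <- format(ride_date, "%A")   # full weekday name in the current locale

# Ride length in seconds; as.numeric() drops the difftime class
ride_length <- as.numeric(difftime(ended_at, started_at, units = "secs"))
ride_length
```

Note that `format()` returns text, so `month`, `day`, and `year` are character columns rather than numbers.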


In [36]:
# Inspect the structure of the columns
str(all_trips)

In [37]:
# Descriptive analysis on ride_length (all figures in seconds)
summary(all_trips$ride_length)

* mean: straight average (total ride length / rides)
* median: midpoint number in the ascending array of ride lengths
* max: longest ride
* min: shortest ride
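
The four statistics can be illustrated on a handful of made-up ride lengths. The values below are hypothetical and chosen so that one long ride pulls the mean well above the median.

```r
# Made-up ride lengths in seconds
ride_length <- c(300, 600, 900, 1200, 6000)

mean(ride_length)    # total ride length / number of rides
median(ride_length)  # middle value once the lengths are sorted
max(ride_length)     # longest ride
min(ride_length)     # shortest ride
```

Here the mean is 1800 seconds while the median is 900 seconds, which is why it helps to look at both when a few very long rides are present.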

In [38]:
# Compare members and casual users
aggregate(all_trips$ride_length ~ all_trips$member_casual, FUN = mean)
aggregate(all_trips$ride_length ~ all_trips$member_casual, FUN = median)
aggregate(all_trips$ride_length ~ all_trips$member_casual, FUN = max)
aggregate(all_trips$ride_length ~ all_trips$member_casual, FUN = min)
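
The four `aggregate()` calls produce four separate tables. As an alternative, the same statistics can be placed side by side with `dplyr`. This is a sketch on a hypothetical miniature data frame named `trips`, with made-up values, not output from the real data.

```r
library(dplyr)

# Hypothetical miniature version of all_trips
trips <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "member"),
  ride_length   = c(1200, 2400, 300, 600, 900)
)

# One row per rider type with all four statistics side by side
ride_stats <- trips %>%
  group_by(member_casual) %>%
  summarise(
    mean_length   = mean(ride_length),
    median_length = median(ride_length),
    max_length    = max(ride_length),
    min_length    = min(ride_length)
  )
ride_stats
```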


In [39]:
# See the average ride time by each day for members vs casual users
aggregate(all_trips$ride_length ~ all_trips$member_casual + all_trips$day_of_week, FUN = mean)


In [40]:
# Notice that the days of the week are out of order. Let's fix that.
all_trips$day_of_week <- ordered(all_trips$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
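
The days come out of order because `day_of_week` is stored as text, and text sorts alphabetically. The sketch below shows the difference on a few made-up values. It assumes English day names; `format(..., "%A")` returns names in the current locale, so the levels would need to match that locale.

```r
# A few made-up day names stored as plain text
days <- c("Wednesday", "Sunday", "Friday", "Monday")
sort(days)  # alphabetical order

# The same values as an ordered factor with Sunday first
week_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")
ordered_days <- ordered(days, levels = week_levels)
sort(ordered_days)  # calendar order
```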


In [41]:
# Now, let's run the average ride time by each day for members vs casual users
aggregate(all_trips$ride_length ~ all_trips$member_casual + all_trips$day_of_week, FUN = mean)


In [42]:
# Analyze ridership data by rider type and weekday
all_trips %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  # creates weekday field using wday()
  group_by(member_casual, weekday) %>%                  # groups by rider type and weekday
  summarise(number_of_rides = n(),                      # calculates the number of rides
            average_duration = mean(ride_length)) %>%   # calculates the average duration
  arrange(member_casual, weekday)                       # sorts


In [43]:
# Let's visualize the number of rides by rider type
all_trips %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")


In [44]:
# Let's create a visualization for average duration
all_trips %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")
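
The chart may be easier to read with the duration shown in minutes and with explicit labels. This is a sketch only: `weekday_summary` is a hypothetical table shaped like the summarised output, the values are made up, and the title and axis labels are my own suggestions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical summary table with made-up average durations in seconds
weekday_summary <- data.frame(
  member_casual    = rep(c("casual", "member"), each = 2),
  weekday          = factor(rep(c("Sat", "Mon"), times = 2), levels = c("Mon", "Sat")),
  average_duration = c(2400, 1500, 900, 780)
)

duration_plot <- weekday_summary %>%
  mutate(average_minutes = average_duration / 60) %>%  # seconds to minutes
  ggplot(aes(x = weekday, y = average_minutes, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average ride duration by weekday",
       x = "Day of week", y = "Average duration (minutes)", fill = "Rider type")
duration_plot
```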

**Conclusion**

To help the marketing team increase profits by converting casual riders to annual members, we now have data insights into how casual riders and annual members use bikes differently. My recommendations for the marketing team, based on the data, are the following: 

1. Casual riders take longer rides: their average ride duration is roughly double that of annual members. 

  * Cyclistic can create an advertisement with brief statistical descriptions that incentivizes casual riders to switch to an annual membership by highlighting how much money they can save. 


2. Casual riders take more bike rides on the weekend and fewer rides during the week, when compared to annual members.

  * Cyclistic can create a special annual membership program that is limited to weekends only. 

  * Cyclistic can also form partnerships with surrounding businesses on the weekends, incentivizing casual riders to purchase goods by offering a Cyclistic discount. 
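
The two comparisons behind these recommendations can be reduced to single numbers: the ratio of average ride durations and the share of rides taken on weekends. This is a sketch on a hypothetical miniature data frame. The values are made up so the arithmetic is easy to check, and they are not results from the real data.

```r
library(dplyr)

# Hypothetical miniature version of all_trips; values are made up
trips <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member", "member"),
  day_of_week   = c("Saturday", "Sunday", "Tuesday", "Monday", "Tuesday", "Friday", "Saturday"),
  ride_length   = c(2400, 1800, 1200, 600, 900, 900, 1200)
)

# Average duration and share of weekend rides for each rider type
rider_summary <- trips %>%
  group_by(member_casual) %>%
  summarise(
    average_duration = mean(ride_length),
    weekend_share    = mean(day_of_week %in% c("Saturday", "Sunday"))
  )
rider_summary

# Ratio of casual to member average duration
duration_ratio <-
  rider_summary$average_duration[rider_summary$member_casual == "casual"] /
  rider_summary$average_duration[rider_summary$member_casual == "member"]
duration_ratio
```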

