# Decoding Cyclist Data

Using publicly available cyclist data from [Divvy](https://divvy-tripdata.s3.amazonaws.com/index.html), a bike share company in Chicago, this analysis aims to build a better understanding of bike users (member and casual riders) for marketing objectives.

Before being made public, the data was stripped of personal identifiers to keep riders anonymous.
The data is provided by Motivate International Inc.

## Structure Breakdown of the Analysis
### 1. Loading and Formatting Data; Data Cleaning
### 2. Data Analysis 
### 3. Data Visualization
### 4. Summary of Findings 

### Prerequisites
Required R packages: install them if needed, then load them.

In [None]:
library(tidyverse)
library(lubridate)
library(ggplot2)
library(readr)
library(broom)
library(graphics)

---------------------------
# 1. Loading and Formatting Data

After loading the CSV files into the environment, a new data frame was created to hold all variables in one place.

The date columns (started_at & ended_at) were parsed as datetime values on import.

In [None]:
jan_2021 <- read_csv("../input/cyclist-data-project/202101-divvy-tripdata.csv", 
                     col_types = cols(started_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
                                      ended_at = col_datetime(format = "%Y-%m-%d %H:%M:%S")))

feb_2021 <- read_csv("../input/cyclist-data-project/202102-divvy-tripdata.csv", 
                     col_types = cols(started_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
                                      ended_at = col_datetime(format = "%Y-%m-%d %H:%M:%S")))

mar_2021 <- read_csv("../input/cyclist-data-project/202103-divvy-tripdata.csv", 
                     col_types = cols(started_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
                                      ended_at = col_datetime(format = "%Y-%m-%d %H:%M:%S")))
apr_2021 <- read_csv("../input/cyclist-data-project/202104-divvy-tripdata.csv", 
                     col_types = cols(started_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
                                      ended_at = col_datetime(format = "%Y-%m-%d %H:%M:%S")))

may_2021 <- read_csv("../input/cyclist-data-project/202105-divvy-tripdata.csv", 
                     col_types = cols(started_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
                                      ended_at = col_datetime(format = "%Y-%m-%d %H:%M:%S")))
  
jun_2021 <- read_csv("../input/cyclist-data-project/202106-divvy-tripdata.csv", 
                     col_types = cols(started_at = col_datetime(format = "%Y-%m-%d %H:%M:%S"), 
                                      ended_at = col_datetime(format = "%Y-%m-%d %H:%M:%S")))

## Combine into a single data-frame
df_main <- rbind(jan_2021,feb_2021,mar_2021,apr_2021,may_2021,jun_2021) 

str(df_main)

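The six nearly identical `read_csv()` calls above could also be written as a loop over file paths. The following is a minimal base R sketch, not the method used in this notebook. It assumes every monthly file shares the same columns; the temporary directory, the two tiny demo files, and the helper `read_month()` are hypothetical stand-ins for the real Divvy CSVs.

```r
# Hypothetical demo data: two tiny monthly files written to a temp directory
demo_dir <- tempfile("divvy_demo_")
dir.create(demo_dir)
write.csv(
  data.frame(ride_id = c("A1", "A2"),
             started_at = c("2021-01-01 08:00:00", "2021-01-02 09:30:00"),
             ended_at = c("2021-01-01 08:15:00", "2021-01-02 10:05:00"),
             member_casual = c("member", "casual")),
  file.path(demo_dir, "202101-divvy-tripdata.csv"), row.names = FALSE)
write.csv(
  data.frame(ride_id = "B1",
             started_at = "2021-02-10 17:45:00",
             ended_at = "2021-02-10 18:00:00",
             member_casual = "member"),
  file.path(demo_dir, "202102-divvy-tripdata.csv"), row.names = FALSE)

# Read one monthly file and parse both timestamp columns as datetimes
read_month <- function(path) {
  trips <- read.csv(path, stringsAsFactors = FALSE)
  fmt <- "%Y-%m-%d %H:%M:%S"
  trips$started_at <- as.POSIXct(trips$started_at, format = fmt, tz = "UTC")
  trips$ended_at <- as.POSIXct(trips$ended_at, format = fmt, tz = "UTC")
  trips
}

# Find every monthly file, read each one, and stack the results
files <- list.files(demo_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
df_demo <- do.call(rbind, lapply(files, read_month))
str(df_demo)
```

With this layout, adding another month only requires placing its file in the folder.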
### Data cleaning

For easier reference, the columns were renamed.
Additionally, the now-redundant monthly data frames were removed, and columns not relevant to the analysis were excluded in a new data frame.

In [None]:
colnames(df_main)

## Change Names of columns
  
df <- df_main %>% 
  rename(trip_id = ride_id
         ,bike_id= rideable_type
         ,start_time=started_at
         ,end_time=ended_at
         ,usertype=member_casual)

## remove duplicate data objects
rm(feb_2021,jan_2021,mar_2021,apr_2021,may_2021,jun_2021)

##remove unnecessary columns 
maindf<- select(df, -c(5,6,7,8,9,10,11,12))

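Dropping columns by position, as in the `select()` call above, removes the wrong columns without warning if the file layout ever changes. A sketch of keeping columns by name instead, in base R; the one-row data frame and its station and coordinate values are made up for illustration.

```r
# Hypothetical one-row trip table with the renamed columns plus a few extras
trips <- data.frame(
  trip_id = "A1",
  bike_id = "classic_bike",
  start_time = as.POSIXct("2021-01-01 08:00:00", tz = "UTC"),
  end_time = as.POSIXct("2021-01-01 08:15:00", tz = "UTC"),
  start_station_name = "Example Station",
  start_lat = 41.9,
  start_lng = -87.6,
  usertype = "member")

# Name the columns the analysis needs and keep only those
keep_cols <- c("trip_id", "bike_id", "start_time", "end_time", "usertype")
trips_small <- trips[, keep_cols]
names(trips_small)
```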
### Check for NA values in time

There are NA values in other columns, but for this part of the analysis the focus is only on ride time.

In [None]:
## Missing values
maindf %>% 
  filter(is.na(start_time)) %>% 
  view()

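To see at a glance how many values are missing in every column, rather than inspecting filtered rows, the NA flags can be summed per column. This is a small base R sketch on made-up rows; the data frame `demo` is hypothetical.

```r
# Hypothetical rows with one missing start time and one missing user type
demo <- data.frame(
  start_time = as.POSIXct(c("2021-01-01 08:00:00", NA, "2021-01-03 10:00:00"), tz = "UTC"),
  end_time = as.POSIXct(c("2021-01-01 08:10:00", "2021-01-02 09:00:00", "2021-01-03 10:30:00"), tz = "UTC"),
  usertype = c("member", "casual", NA))

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(demo))
na_counts
```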
### Working with Dates & Time

The month, day, and hour were separated into their own columns to aid further breakdown.

In [None]:
## separate month and date 
maindf$mon <- format(as.Date(maindf$start_time), "%m")
maindf$day <- format(as.Date(maindf$start_time), "%d")
maindf$hours <- format(as_datetime(maindf$start_time), "%H")
glimpse(maindf)


In [None]:
## values to int
maindf$mon <- as.integer(maindf$mon)
maindf$day <- as.integer(maindf$day)
maindf$hours <- as.integer(maindf$hours)
glimpse(maindf)

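The two cells above first format the date parts as text and then convert them to integers. As an alternative sketch, lubridate's `month()`, `day()`, and `hour()` return numbers directly; the two timestamps below are made up for illustration.

```r
library(lubridate)

# Hypothetical start times, parsed as UTC datetimes
start_time <- ymd_hms(c("2021-01-15 17:45:10", "2021-06-03 06:05:00"), tz = "UTC")

# Each accessor returns a numeric component, so no text round-trip is needed
parts <- data.frame(
  mon = month(start_time),
  day = day(start_time),
  hours = hour(start_time))
parts
```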
### Calculating Ride Duration

Each ride's duration was calculated, along with the day of the week on which it started.

In [None]:
## Add new column for ride time; force seconds so the unit does not depend on the data
maindf <- mutate(maindf, ride_duration = difftime(end_time, start_time, units = "secs"))

## convert the difftime to integer seconds for summaries
maindf <- maindf %>% 
  mutate(int_ride_duration = as.integer(ride_duration))

## Convert date to day of the week 
maindf$day_of_Week <- weekdays(maindf$start_time)
head(maindf)

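Two details of this step are worth illustrating: subtracting datetimes yields a `difftime` whose unit is chosen automatically unless it is stated, and `weekdays()` returns plain text that sorts alphabetically. The base R sketch below fixes the unit to seconds and stores the weekday as an ordered factor running Monday to Sunday. The timestamps and the `day_labels` vector are illustrative assumptions.

```r
# Hypothetical rides; the second one crosses midnight
start_time <- as.POSIXct(c("2021-01-01 08:00:00", "2021-01-03 23:50:00"), tz = "UTC")
end_time <- as.POSIXct(c("2021-01-01 08:12:30", "2021-01-04 00:20:00"), tz = "UTC")

# Ask for seconds explicitly instead of relying on the automatic unit
ride_secs <- as.numeric(difftime(end_time, start_time, units = "secs"))

# "%u" gives the ISO weekday number (1 = Monday), independent of locale
day_labels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
iso_day <- as.integer(format(start_time, "%u"))
day_of_week <- factor(day_labels[iso_day], levels = day_labels, ordered = TRUE)

data.frame(ride_secs, day_of_week)
```

An ordered factor keeps tables and plots in calendar order rather than alphabetical order.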
### Additional Cleaning 

Rides with a negative or zero duration were removed, as these are test entries where bikes were taken out for quality testing.

In [None]:
## Data has negative values, rm values     
maindf %>% arrange(ride_duration) %>% head()

maindf <-subset(maindf, int_ride_duration > 0)
head(maindf)
nrow(df_main) - nrow(maindf)

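Before dropping rows, it can help to count how many fail the check and which user type they belong to. A minimal base R sketch on made-up values; the data frame `rides` is hypothetical.

```r
# Hypothetical rides, two of which have a non-positive duration
rides <- data.frame(
  usertype = c("member", "casual", "casual", "member"),
  int_ride_duration = c(600L, -15L, 0L, 1200L))

# Flag the rows that the cleaning step would remove
invalid <- rides$int_ride_duration <= 0

# How many invalid rows are there, and for which user type?
table(rides$usertype[invalid])

# Keep only rides with a positive duration
clean <- rides[!invalid, ]
nrow(rides) - nrow(clean)
```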
--------------------------------------------------
# 2. Data Analysis

## Overview

Storing the mean ride duration of all trips in a new variable

In [None]:
## Store in new var: average the seconds first, then convert to a readable period
total_ride_summary <- maindf$int_ride_duration %>% 
  mean() %>% 
  seconds_to_period()

In [None]:
## count the rides per user type
table(maindf$usertype)

In [None]:
## table for day of the week 
xtabs(~day_of_Week + usertype, data = maindf)

## Ride Duration 

### Mean and trimmed mean (10%)

In [None]:
## mean
maindf$int_ride_duration %>% 
  mean() %>% 
  seconds_to_period()

## trimmed mean 
maindf$int_ride_duration %>% 
  mean(trim = 0.1) %>% 
  seconds_to_period()

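A 10% trimmed mean sorts the values, discards the lowest 10% and the highest 10%, and averages the rest, which limits the pull of very long rides. The worked example below uses ten made-up durations in seconds, one of them a 24-hour ride.

```r
# Hypothetical ride durations in seconds; the last one is a 24-hour outlier
durations <- c(300, 420, 480, 540, 600, 660, 720, 780, 900, 86400)

plain_mean <- mean(durations)                # pulled up by the outlier
trimmed_mean <- mean(durations, trim = 0.1)  # drops the lowest and highest value
middle <- median(durations)

c(mean = plain_mean, trimmed = trimmed_mean, median = middle)
```

With ten values, trimming 10% removes exactly one value from each end, so the outlier no longer affects the result.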
## Median 

In [None]:
# Median for ride data
maindf$int_ride_duration %>% 
  median() %>% 
  seconds_to_period()

### Standard Deviation

In [None]:
# std
maindf$int_ride_duration %>% 
  sd() %>% 
  seconds_to_period()

## Breakdown of monthly data by type of users

In [None]:
## no of total riders per month
riders_month <- maindf %>% 
  group_by(month= month(start_time)) %>% 
  count() %>% 
  rename(total = n)

## how many casual riders per month 

riders_month_casual <- maindf %>% 
  group_by(month= month(start_time)) %>% 
  count(tf = (usertype== "casual")) %>% 
  filter(tf ==TRUE) %>% 
  select(-2) %>% 
  rename(casual = n)

## how many member riders per month 

riders_month_members <- maindf %>% 
  group_by(month= month(start_time)) %>% 
  count(tf = (usertype== "member")) %>% 
  filter(tf ==TRUE) %>% 
  select(-2) %>% 
  rename(members = n)


## Monthly data joining
riders_month_final <- riders_month %>% 
  left_join(riders_month_casual, by = "month") %>% 
  left_join(riders_month_members, by= "month")


##remove variables
rm(riders_month,riders_month_casual,riders_month_members)

## Add percentage
riders_month_final<- riders_month_final %>% 
  mutate(per_cas = ((casual/total)*100)) %>% 
  mutate(per_mem= (100-per_cas))

tibble(riders_month_final)

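The same monthly counts and percentages can be obtained from a single cross-tabulation, which avoids building and joining three intermediate tables. This is a base R sketch on made-up rows, not the approach used above; the data frame `rides` is hypothetical.

```r
# Hypothetical rides with a month number and a user type
rides <- data.frame(
  mon = c(1, 1, 1, 1, 2, 2),
  usertype = c("member", "member", "member", "casual", "casual", "member"))

# Rows are months, columns are user types
counts <- table(rides$mon, rides$usertype)

# Row-wise percentages: the share of each user type within a month
pct <- round(prop.table(counts, margin = 1) * 100, 1)

counts
pct
```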

In [None]:
# confirm count 
table(maindf$mon)

### Casual Riders: Mean & trimmed mean

In [None]:
## Duration of trip group by usertype

casual_riders<- filter(maindf, usertype=="casual")

## ride duration in minutes, kept numeric so summary() and hist() work as expected
casual_riders$rd_mins <- casual_riders$int_ride_duration / 60
summary(casual_riders$rd_mins)

#trimmed mean
casual_riders$int_ride_duration %>% 
  mean(trim = 0.1) %>% 
  seconds_to_period()

In [None]:
head(casual_riders)

### Member Riders: Mean & trimmed mean

In [None]:
#members
member_riders<- filter(maindf, usertype=="member") 

## ride duration in minutes, kept numeric so summary() and hist() work as expected
member_riders$rd_mins <- member_riders$int_ride_duration / 60
summary(member_riders$rd_mins)

#trimmed mean
member_riders$int_ride_duration %>% 
  mean(trim = 0.1) %>% 
  seconds_to_period()

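The casual and member statistics can also be computed side by side in one step by splitting the durations on user type. A base R sketch on made-up durations in seconds; the data frame `rides` and the function `duration_stats()` are hypothetical.

```r
# Hypothetical durations: ten casual rides (one very long) and ten member rides
rides <- data.frame(
  usertype = rep(c("casual", "member"), each = 10),
  int_ride_duration = c(seq(600, by = 120, length.out = 9), 36000,
                        seq(300, by = 60, length.out = 10)))

# Summary statistics for one group of durations
duration_stats <- function(x) {
  c(mean = mean(x), trimmed = mean(x, trim = 0.1), median = median(x))
}

# One column per user type, one row per statistic
stats <- sapply(split(rides$int_ride_duration, rides$usertype), duration_stats)
stats
```

In this made-up data the casual mean sits far above the casual trimmed mean, the same pattern the summary attributes to outliers.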
### Finding outliers by Usertype & Ride duration

In [None]:
summary(casual_riders$rd_mins)
summary(member_riders$rd_mins)
hist(casual_riders$rd_mins, col ="blue")
hist(member_riders$rd_mins, col ="red")

In [None]:
## outlier data
## select rides longer than 120 mins (7200 secs)
outliers_casual <- subset(casual_riders, int_ride_duration > 7200)
hist(outliers_casual$rd_mins, col = "blue")
summary(outliers_casual$rd_mins)

## member data
outliers_members <- subset(member_riders, int_ride_duration > 7200)
hist(outliers_members$rd_mins, col = "red")
summary(outliers_members$rd_mins)

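To compare how common long rides are in each group, the share of rides above the 120-minute threshold can be computed per user type. A minimal base R sketch on made-up durations; the data frame `rides` is hypothetical, and the threshold mirrors the 7200 seconds used above.

```r
# Hypothetical durations in seconds for four casual and five member rides
rides <- data.frame(
  usertype = c(rep("casual", 4), rep("member", 5)),
  int_ride_duration = c(900, 9000, 10800, 1500,
                        600, 700, 8000, 500, 400))

threshold_secs <- 120 * 60

# The mean of a TRUE/FALSE vector is the proportion of TRUE values
share_long <- tapply(rides$int_ride_duration > threshold_secs,
                     rides$usertype, mean) * 100
share_long
```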
-----------------
# 3. Data Visualization

### All users Overview

In [None]:
## Distribution of ride duration Data
hist(maindf$int_ride_duration)
hist(log(maindf$int_ride_duration))

## More info on the structure of the histogram data 
str(hist(maindf$int_ride_duration, breaks = 12, plot = FALSE))

In [None]:
## Distribution of hours and days
hist(maindf$hours)
hist(maindf$day)

In [None]:
## Ride duration counts for all riders (durations are in seconds)
barplot(table(maindf$int_ride_duration), xlab = "Ride duration in seconds", axes = FALSE)

In [None]:
## Ride duration by bike type, casual riders then members
ggplot(data=casual_riders, mapping = aes(x=rd_mins)) + geom_bar(color='blue') + facet_wrap(~bike_id) +
  labs(title ="Casual Riders Data", subtitle = "Most common bike type and ride duration") 


ggplot(data=member_riders, mapping = aes(x=rd_mins)) + geom_bar(color='red') + facet_wrap(~bike_id) +
  labs(title ="Member Riders Data", subtitle = "Most common bike type and ride duration")   

In [None]:
## total ride time per day of the week for each user type
ggplot() +
  geom_col(data = casual_riders, mapping = aes(y = rd_mins, x = day_of_Week), fill = "blue") +
  labs(title = "Casual Riders: Ride Time by Day of the Week") 

ggplot() +
  geom_col(data = member_riders, mapping = aes(y = rd_mins, x = day_of_Week), fill = "red") +
  labs(title = "Member Riders: Ride Time by Day of the Week") 

In [None]:
## Monthly usage 
ggplot(data = riders_month_final) +
  geom_line(mapping = aes(x = month, y = casual), color = "blue") +
  geom_point(mapping = aes(x = month, y = casual), color = "blue") +
  geom_line(mapping = aes(x = month, y = members), color = "red") +
  geom_point(mapping = aes(x = month, y = members), color = "red") +
  labs(title = "Monthly users", subtitle = "Casual-blue & Members-red")


In [None]:
## Which time of day 
hist(casual_riders$hours, main = paste("Peak time of Casual Ridership in 24 hour format"), 
     col = "blue")

hist(member_riders$hours, main = paste("Peak time of Member Ridership in 24 hour format"), 
     col = "red")


## Day of the week casual vs member
ggplot(data=casual_riders, mapping = aes(x=day_of_Week)) + geom_bar(fill ="blue") + labs(title = "Casual")
ggplot(data=member_riders, mapping = aes(x=day_of_Week)) + geom_bar(fill ="red") + labs(title = "Member")

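Drawing casual and member counts in one chart makes the weekday comparison easier than reading two separate plots. The sketch below maps `usertype` to the fill colour and places the bars side by side; the six rows of data and the `day_labels` ordering are made up for illustration.

```r
library(ggplot2)

day_labels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Hypothetical rides; the factor levels keep the x axis in calendar order
rides <- data.frame(
  usertype = c("casual", "casual", "casual", "member", "member", "member"),
  day_of_Week = factor(c("Saturday", "Sunday", "Saturday",
                         "Monday", "Tuesday", "Saturday"),
                       levels = day_labels))

p <- ggplot(rides, aes(x = day_of_Week, fill = usertype)) +
  geom_bar(position = "dodge") +                                    # bars side by side
  scale_fill_manual(values = c(casual = "blue", member = "red")) +  # same colours as above
  scale_x_discrete(drop = FALSE) +                                  # keep days with no rides
  labs(title = "Rides per day of the week", x = "Day of the week", y = "Rides")
print(p)
```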

--------------------------

# 4. Summary of Findings


## All Riders Summary

This analysis covered over 1.9 million rows of bike ride data.

In [None]:
table(maindf$usertype)

## Ridership percentage per month
View(riders_month_final)

## Member Riders Summary

* Majority preferred classic bikes
* Did not use docked bikes at all

### Preferred Days and Hours
* Spread evenly throughout the week 
* Peak ridership: 5pm to 7pm 

### Ride Duration
* Average time spent: 14m 34s (heavily influenced by outliers)
* 10% trimmed mean: 11m 48s

### Preferred Months
* More data is needed, but in the current 6 months member ridership exceeded casual ridership for the first 4 months (through April), after which the two were close to even

## Casual Riders Summary

* Majority preferred electric bikes
* Also used docked bikes, unlike members

### Preferred Days and Hours
* Weekends (Sat & Sun) 
* Peak ridership: 5pm to 7pm 

### Ride Duration
* Average time spent: 37m 35s (heavily influenced by outliers)
* 10% trimmed mean: 21m 52s

### Preferred Months
* More data is needed, but in the current 6 months casual ridership lagged behind member ridership for the first 4 months (through April), then roughly evened out with it