# Case Study-1
#### How Does A Bike-Share Navigate Speedy Success?

## Introduction:
#### Google Data Analytics Capstone Project 
##### This case study focuses on Cyclistic, a bike-share company in Chicago.

### Ask

##### Background context on the case study:

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
##### Characters and teams:

● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and
reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy
learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic
achieve them.

● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the
recommended marketing program.

#### Guiding questions
1. What is the problem you are trying to solve?
Our primary goal is to analyse the profiles of annual members and casual riders and use that information to develop marketing strategies to help casual riders become annual members. 
2. How can your insights drive business decisions?
This information can help the marketing team increase the number of annual members.

##### Key tasks
1. Identify the business task - Completed
2. Consider key stakeholders - Completed

##### Deliverable
• A clear statement of the business task
Find the difference between annual members and casual riders, and identify the marketing strategy to use to increase annual members.

### Prepare

Cyclistic’s historical trip data is available [here](https://divvy-tripdata.s3.amazonaws.com/index.html). Download the previous 12 months of Cyclistic trip data to analyze and identify trends.

#### Guiding questions

1. Where is your data located?
The data is stored in my local machine as a dataset named "casestudy".

2. How is the data organized?
The datasets contain trip details from January to December 2021. I later combined all of the CSV files into one CSV file.

3. Are there issues with bias or credibility in this data? Does your data ROCCC?
The data comes from a first-party source, the company's own data storage, so the chance of bias is low and the credibility is high. The data also ROCCCs: it is reliable, original, comprehensive, current, and cited.

4. How are you addressing licensing, privacy, security, and accessibility?
The data is publicly available and provided by the company under a license. It does not include any personal details about riders, which protects their privacy.

5. How did you verify the data’s integrity?
While analyzing the data, I found that the data types and the columns (number and names) were consistent across all files.

6. How does it help you answer your question?
Reviewing the ride data for annual members and casual riders can reveal differences in their rides, bike usage, and needs.

7. Are there any problems with the data?
More information about the units of measure, stations, and riders would add to the data’s value.

##### Key tasks
1. Download data and store it appropriately. - Completed
2. Identify how it’s organized. - Completed
3. Sort and filter the data. - Completed
4. Determine the credibility of the data. - Completed

##### Deliverable

• A description of all data sources used
The data source consists of 12 CSV files, one for each month. The period starts in January 2021 and runs until December 2021.

### Process
We can combine all CSV files into one to make it easier to manipulate and analyze. The combined file will be cleaned, and additional columns will be added.

##### Guiding questions
1. What tools are you choosing and why?
We have spreadsheets and R as options, and I'm using R because the dataset is too large for spreadsheets. R will allow in-depth analysis and manipulation.

2. Have you ensured your data’s integrity?
I observed the columns after any changes and determined that the data types were consistent after manipulation.

3. What steps have you taken to ensure that your data is clean?
Rows with missing values and duplicate records were removed, and the dates and times were formatted properly.

4. How can you verify that your data is clean and ready to analyze?
The data is ready for the next step once the cleaning steps complete without errors: no missing values or duplicate ride IDs remain, and no ride has a negative length.

5. Have you documented your cleaning process so you can review and share those results?
The cleaning process has been documented throughout.

##### Key tasks
1. Check the data for errors. - Completed
2. Choose your tools. - Completed
3. Transform the data so you can work with it effectively. - Completed
4. Document the cleaning process. - Completed

##### Deliverable

Documentation of any cleaning or manipulation of data

### Code

In [1]:
# Importing libraries for data processing
library(tidyverse)
library(janitor)
library(lubridate)
library(ggplot2)
# If tidyverse is not already installed, run install.packages('tidyverse') first

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: 'janitor'


The following objects are masked from 'package:stats':

    chisq.test, fisher.test


Loading required package: timechange


Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdi

Importing the individual CSV files and combining them into one

In [2]:
tripdata_202101 <- read.csv("202101-divvy-tripdata.csv")
tripdata_202102 <- read.csv("202102-divvy-tripdata.csv")
tripdata_202103 <- read.csv("202103-divvy-tripdata.csv")
tripdata_202104 <- read.csv("202104-divvy-tripdata.csv")
tripdata_202105 <- read.csv("202105-divvy-tripdata.csv")
tripdata_202106 <- read.csv("202106-divvy-tripdata.csv")
tripdata_202107 <- read.csv("202107-divvy-tripdata.csv")
tripdata_202108 <- read.csv("202108-divvy-tripdata.csv")
tripdata_202109 <- read.csv("202109-divvy-tripdata.csv")
tripdata_202110 <- read.csv("202110-divvy-tripdata.csv")
tripdata_202111 <- read.csv("202111-divvy-tripdata.csv")
tripdata_202112 <- read.csv("202112-divvy-tripdata.csv")


In [3]:

str(tripdata_202101)
str(tripdata_202102)
str(tripdata_202103)
str(tripdata_202104)
str(tripdata_202105)
str(tripdata_202106)
str(tripdata_202107)
str(tripdata_202108)
str(tripdata_202109)
str(tripdata_202110)
str(tripdata_202111)
str(tripdata_202112)

'data.frame':	96834 obs. of  13 variables:
 $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
 $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
 $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.7 -87

'data.frame':	756147 obs. of  13 variables:
 $ ride_id           : chr  "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
 $ ended_at          : chr  "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
 $ start_station_name: chr  "" "" "" "" ...
 $ start_station_id  : chr  "" "" "" "" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.8 41.8 41.9 ...
 $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 42 41.8 41.8 41.9 ...
 $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
'data.frame':	631226 obs. of  13
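
The `str()` calls above let you compare the monthly files by eye. As a minimal sketch of the same check done programmatically, the snippet below uses base R only. The toy data frames and the helper names `col_signature` and `same_structure` are hypothetical and not part of the original notebook.

```r
# Toy stand-ins for the monthly data frames (hypothetical names and columns).
jan <- data.frame(ride_id = "A1", start_lat = 41.9, stringsAsFactors = FALSE)
feb <- data.frame(ride_id = "B1", start_lat = 41.8, stringsAsFactors = FALSE)
mar <- data.frame(ride_id = "C1", start_lat = "41.7", stringsAsFactors = FALSE)  # wrong type on purpose

# Describe a data frame as a named vector: column name -> first class.
col_signature <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# TRUE when every data frame has the same columns, order, and types as the first.
same_structure <- function(dfs) {
  reference <- col_signature(dfs[[1]])
  all(vapply(dfs, function(df) identical(col_signature(df), reference), logical(1)))
}

same_structure(list(jan, feb))
same_structure(list(jan, feb, mar))
```

The first call should return `TRUE` and the second `FALSE`, because `mar` stores `start_lat` as text. In the notebook, the list would hold the twelve `tripdata_2021xx` data frames.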

In [4]:
## Combining all the individual files into one data file
tripdata_2021 <- rbind(tripdata_202101, tripdata_202102, tripdata_202103, tripdata_202104, 
                       tripdata_202105, tripdata_202106, tripdata_202107, tripdata_202108,
                       tripdata_202109, tripdata_202110,tripdata_202111,tripdata_202112)
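
The twelve `read.csv()` calls and the `rbind()` above could also be written as a loop over file names. The sketch below is one possible way to do that, not the notebook's method. It writes two tiny hypothetical files to a temporary folder so that it runs on its own; the folder name `divvy_demo` and the toy rows are assumptions.

```r
# Hypothetical setup: two tiny monthly files in a temp folder, standing in
# for the real "*-divvy-tripdata.csv" downloads.
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(data_dir, "202101-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "casual"),
          file.path(data_dir, "202102-divvy-tripdata.csv"), row.names = FALSE)

# Find every monthly file, read each one, and stack them row-wise.
csv_files <- sort(list.files(data_dir, pattern = "-divvy-tripdata\\.csv$",
                             full.names = TRUE))
monthly <- lapply(csv_files, read.csv)
combined <- do.call(rbind, monthly)

# The combined row count should equal the sum of the monthly row counts.
stopifnot(nrow(combined) == sum(vapply(monthly, nrow, integer(1))))
nrow(combined)
```

The `stopifnot()` line performs the same sanity check as comparing the summed monthly row counts with `nrow(tripdata_2021)`.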

In [5]:
head(tripdata_2021)

Unnamed: 0_level_0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.90034,-87.69674,41.89,-87.72,member
2,DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.90033,-87.69671,41.9,-87.69,member
3,EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,,41.90031,-87.69664,41.9,-87.7,member
4,4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,,41.9004,-87.69666,41.92,-87.69,member
5,BE5E8EB4E7263A0B,electric_bike,2021-01-23 02:24:02,2021-01-23 02:24:45,California Ave & Cortez St,17660,,,41.90033,-87.6967,41.9,-87.7,casual
6,5D8969F88C773979,electric_bike,2021-01-09 14:24:07,2021-01-09 15:17:54,California Ave & Cortez St,17660,,,41.90041,-87.69676,41.94,-87.71,casual


In [6]:
row_total <- sum(nrow(tripdata_202101), nrow(tripdata_202102), nrow(tripdata_202103), nrow(tripdata_202104), 
                       nrow(tripdata_202105), nrow(tripdata_202106), nrow(tripdata_202107), nrow(tripdata_202108),
                       nrow(tripdata_202109), nrow(tripdata_202110),nrow(tripdata_202111),nrow(tripdata_202112))
row_total

In [7]:
print(nrow(tripdata_2021))

[1] 5595063


In [8]:
str(tripdata_2021)

head(tripdata_2021)

'data.frame':	5595063 obs. of  13 variables:
 $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
 $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
 $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.7 -

Unnamed: 0_level_0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.90034,-87.69674,41.89,-87.72,member
2,DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.90033,-87.69671,41.9,-87.69,member
3,EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,,41.90031,-87.69664,41.9,-87.7,member
4,4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,,41.9004,-87.69666,41.92,-87.69,member
5,BE5E8EB4E7263A0B,electric_bike,2021-01-23 02:24:02,2021-01-23 02:24:45,California Ave & Cortez St,17660,,,41.90033,-87.6967,41.9,-87.7,casual
6,5D8969F88C773979,electric_bike,2021-01-09 14:24:07,2021-01-09 15:17:54,California Ave & Cortez St,17660,,,41.90041,-87.69676,41.94,-87.71,casual


### Data cleaning

In [9]:
tripdata_2021$date <- as.Date(tripdata_2021$started_at)
tripdata_2021$month <- format(as.Date(tripdata_2021$date), "%b")
tripdata_2021$day <- format(as.Date(tripdata_2021$date), "%d")
tripdata_2021$year <- format(as.Date(tripdata_2021$date), "%Y")
tripdata_2021$day_of_week <- format(as.Date(tripdata_2021$date), "%A")
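
`format(..., "%A")` and `"%b"` return plain text, so weekdays and months sort alphabetically in tables and plots, and the labels depend on the session locale. The sketch below shows one possible alternative that stores them as factors in calendar order. It uses four start dates from the rows shown in this notebook; the variable names are hypothetical.

```r
# Start dates taken from rows shown in this notebook.
ride_date <- as.Date(c("2021-01-23", "2021-01-27", "2021-01-21", "2021-01-07"))

# Fixed English labels, so the result does not depend on the session locale.
weekday_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday")

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday.
wday_index <- as.POSIXlt(ride_date)$wday + 1
day_of_week <- factor(weekday_levels[wday_index], levels = weekday_levels)

# as.POSIXlt()$mon is 0 for January; month.abb is a built-in constant.
month_index <- as.POSIXlt(ride_date)$mon + 1
month <- factor(month.abb[month_index], levels = month.abb)

# Counts now appear in calendar order rather than alphabetical order.
table(day_of_week)
```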

In [10]:
head(tripdata_2021)

Unnamed: 0_level_0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,date,month,day,year,day_of_week
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<date>,<chr>,<chr>,<chr>,<chr>
1,E19E6F1B8D4C42ED,electric_bike,2021-01-23 16:14:19,2021-01-23 16:24:44,California Ave & Cortez St,17660,,,41.90034,-87.69674,41.89,-87.72,member,2021-01-23,Jan,23,2021,Saturday
2,DC88F20C2C55F27F,electric_bike,2021-01-27 18:43:08,2021-01-27 18:47:12,California Ave & Cortez St,17660,,,41.90033,-87.69671,41.9,-87.69,member,2021-01-27,Jan,27,2021,Wednesday
3,EC45C94683FE3F27,electric_bike,2021-01-21 22:35:54,2021-01-21 22:37:14,California Ave & Cortez St,17660,,,41.90031,-87.69664,41.9,-87.7,member,2021-01-21,Jan,21,2021,Thursday
4,4FA453A75AE377DB,electric_bike,2021-01-07 13:31:13,2021-01-07 13:42:55,California Ave & Cortez St,17660,,,41.9004,-87.69666,41.92,-87.69,member,2021-01-07,Jan,7,2021,Thursday
5,BE5E8EB4E7263A0B,electric_bike,2021-01-23 02:24:02,2021-01-23 02:24:45,California Ave & Cortez St,17660,,,41.90033,-87.6967,41.9,-87.7,casual,2021-01-23,Jan,23,2021,Saturday
6,5D8969F88C773979,electric_bike,2021-01-09 14:24:07,2021-01-09 15:17:54,California Ave & Cortez St,17660,,,41.90041,-87.69676,41.94,-87.71,casual,2021-01-09,Jan,9,2021,Saturday


In [11]:
dim(tripdata_2021)

## Removing Null Values

In [12]:
tripdata_2021 <- drop_na(tripdata_2021)
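
Note that `drop_na()` only removes rows containing `NA`. The `str()` output earlier shows station names stored as empty strings (`""`), which `read.csv()` does not treat as missing in text columns by default, so those rows survive this step. As a hedged sketch, assuming blank station fields should count as missing, the `na.strings` argument of `read.csv()` can convert them at import time. The toy CSV text below is hypothetical.

```r
# Hypothetical three-row extract: two rides have a blank end station name.
toy_csv <- "ride_id,end_station_name,end_lat\nA1,,41.9\nA2,Wood St,41.9\nA3,,NA\n"

# Default import: blank text fields stay as "" and are not counted as missing.
as_read <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# Treat both blank fields and the literal string NA as missing values.
blank_as_na <- read.csv(text = toy_csv, na.strings = c("", "NA"),
                        stringsAsFactors = FALSE)

# Rows that a drop_na()-style filter would keep under each import.
sum(complete.cases(as_read))
sum(complete.cases(blank_as_na))
```

Whether rides without a station name should be dropped is an analysis decision; the sketch only shows how the two imports differ.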

In [13]:
tripdata_2021_no_duplicates <- tripdata_2021[!duplicated(tripdata_2021$ride_id), ]
print(paste("Removed", nrow(tripdata_2021) - nrow(tripdata_2021_no_duplicates), "duplicate rows"))

[1] "Removed 0 duplicate rows"


##### Data manipulation

In [14]:
## Creating a column to determine the ride length 
tripdata_2021_v2 <- mutate(tripdata_2021_no_duplicates, ride_length = difftime(ended_at, started_at, units = "mins"))
str(tripdata_2021_v2)

'data.frame':	5590292 obs. of  19 variables:
 $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
 $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
 $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.7 -
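
`started_at` and `ended_at` are still character columns here, so `difftime()` converts them implicitly. A minimal sketch of the same calculation with an explicit conversion follows, using two rides from the preview above. The time zone is an assumption: the source does not state one, and `"UTC"` is used only to make the example deterministic.

```r
# Two rides copied from the preview earlier in this notebook.
started_at <- c("2021-01-23 16:14:19", "2021-01-27 18:43:08")
ended_at   <- c("2021-01-23 16:24:44", "2021-01-27 18:47:12")

# Assumption: "UTC" is a placeholder time zone, not taken from the source.
start_time <- as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
end_time   <- as.POSIXct(ended_at,   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Ride length in minutes as a plain number, which is easier to summarise or plot.
ride_length <- as.numeric(difftime(end_time, start_time, units = "mins"))
round(ride_length, 2)
```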

In [16]:
## Filtering out trips with a ride length less than 0

nrow(tripdata_2021_v2[tripdata_2021_v2$ride_length < 0,])
# Keep only rides whose length is zero or positive
tripdata_2021_v3 <- tripdata_2021_v2[tripdata_2021_v2$ride_length >= 0,]
glimpse(tripdata_2021_v3)

Rows: 5,590,146
Columns: 19
$ ride_id            <chr> "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683…
$ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
$ started_at         <chr> "2021-01-23 16:14:19", "2021-01-27 18:43:08", "2021…
$ ended_at           <chr> "2021-01-23 16:24:44", "2021-01-27 18:47:12", "2021…
$ start_station_name <chr> "California Ave & Cortez St", "California Ave & Cor…
$ start_station_id   <chr> "17660", "17660", "17660", "17660", "17660", "17660…
$ end_station_name   <chr> "", "", "", "", "", "", "", "", "", "Wood St & Augu…
$ end_station_id     <chr> "", "", "", "", "", "", "", "", "", "657", "13258",…
$ start_lat          <dbl> 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4…
$ start_lng          <dbl> -87.69674, -87.69671, -87.69664, -8

In [17]:
## Determining the number of members vs. casual riders

rider_type_total <- table(tripdata_2021_v3$member_casual)
View(rider_type_total)


 casual  member 
2525456 3064690 
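
The raw counts are easier to compare as shares of all rides. The sketch below recomputes them from the two totals printed above; the variable names are hypothetical.

```r
# Ride counts copied from the table output above.
rider_counts <- c(casual = 2525456, member = 3064690)

# prop.table() divides each count by the total; scale to percent and round.
rider_share <- round(100 * prop.table(rider_counts), 1)
rider_share
```

On these counts, casual riders account for roughly 45% of rides and members for roughly 55%.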

In [18]:
## Statistical analysis
trip_stats <- tripdata_2021_v3 %>% 
  group_by(member_casual) %>% 
  summarise(
    average_ride_length = mean(ride_length),
    standard_deviation = sd(ride_length),
    median_ride_length = median(ride_length),
    min_ride_length = min(ride_length),
    max_ride_length = max(ride_length)
  )
head(trip_stats)

member_casual,average_ride_length,standard_deviation,median_ride_length,min_ride_length,max_ride_length
<chr>,<drtn>,<dbl>,<drtn>,<drtn>,<drtn>
casual,30.23803 mins,248.1527,15.95 mins,0 mins,55944.150 mins
member,13.35480 mins,20.03945,9.60 mins,0 mins,1499.933 mins


In [20]:
## Determine mode for the day of the week 

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

weekday_mode <- getmode(tripdata_2021_v3$day_of_week)

print(weekday_mode)

[1] "Saturday"
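
The mode above covers all riders together. Because the business question is how casual riders and members differ, a natural follow-up is the most common day per rider type. The sketch below reuses `getmode()` with `tapply()` on a small hypothetical data frame; the toy rows are made up and say nothing about the real result.

```r
# Same helper as above: the most frequent value in a vector.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical toy rides, only to show the mechanics.
toy_rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  day_of_week   = c("Saturday", "Saturday", "Sunday", "Tuesday", "Wednesday", "Tuesday"),
  stringsAsFactors = FALSE
)

# Apply getmode() to day_of_week separately within each rider type.
mode_by_type <- tapply(toy_rides$day_of_week, toy_rides$member_casual, getmode)
mode_by_type
```

In the notebook, the same call would take `tripdata_2021_v3$day_of_week` and `tripdata_2021_v3$member_casual` as its first two arguments.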
