# Introduction

This case study is the **Capstone Project** of the ***[Google Data Analytics Professional Certificate](https://www.coursera.org/professional-certificates/google-data-analytics)***. In this case study, I'm a junior data analyst on the marketing analytics team of Cyclistic, a fictitious Chicago bike-share company.

Cyclistic is a bicycle-sharing scheme with over 5,800 bikes and 600 docking stations. Cyclistic distinguishes itself by including reclining bikes, hand tricycles, and cargo bikes in its fleet, making bike-share more accessible to people with disabilities and others who can't ride a regular two-wheeled bike. The bulk of riders choose standard bikes, with only roughly 8% opting for the assistive options. Although riders are more likely to ride for leisure, about 30% of them use the bikes to commute to work every day.

Cyclistic launched a successful bike-share programme in 2016. Since then, the programme has grown to a fleet of 5,824 bicycles that are geotracked and locked into 692 stations throughout Chicago. Bikes can be unlocked at any station and returned to any other station in the system at any time.

According to the director of marketing, the company's future success hinges on increasing the number of annual memberships. As a result, my team wants to learn more about **how casual riders and annual members use Cyclistic bikes differently**. Based on the findings, our team will develop a new marketing plan to convert casual riders into annual members. However, our recommendations must first be approved by Cyclistic management, so they must be backed up with **compelling data insights and professional data visualizations**.

Single-ride passes, full-day passes, and annual memberships are all available. **Casual riders are customers who purchase single-ride or full-day passes**. **Cyclistic members are customers who purchase annual memberships**.

I followed the steps of the data analysis process to answer the key business questions: **ask**, **prepare**, **process**, **analyze**, **share**, and **act**.

# Ask

**The future marketing effort will be guided by three questions:** 
1. What are the differences in how annual members and casual riders use Cyclistic bikes? 
2. Why would casual riders purchase a Cyclistic annual membership? 
3. How can Cyclistic use digital media to persuade casual riders to become members?

**The key stakeholders are:**
1. Cyclistic Executive Team.
2. Lily Moreno (Director of Marketing) & my Manager.

**Key tasks:**
1. Design marketing strategies aimed at converting casual riders into annual members. 


# Prepare

To evaluate and discover trends, I used Cyclistic's historical trip data. You can **[download the previous 12 months](https://divvy-tripdata.s3.amazonaws.com/index.html)** of Cyclistic trip data here (this analysis covers March 2021 to February 2022). **The datasets are appropriate for the objectives of this case study and make it possible to answer the business questions**. **Motivate International Inc. has made the data available under this [licence](https://ride.divvybikes.com/data-license-agreement)**. This is open data that can be used to investigate how different customer types use Cyclistic bikes. However, due to data-privacy concerns, riders' personally identifiable information cannot be used. This means it isn't possible to link pass purchases to credit card numbers to see whether casual riders live in the Cyclistic service area or have bought multiple single passes.

The data is **accurate**, **trustworthy**, and **consistent**. There are no issues with **bias** or **legitimacy** because the data is collected by a legitimate Chicago bike-sharing company. As a result, it is **Reliable**, **Original**, **Current**, and **Cited**, which covers four of the five **ROCCC** criteria. I don't consider it fully **Comprehensive**, however, because certain information is missing.

> **Limitations**

* No **financial information**.
* No **personally identifiable information about riders**.

# Process

In [2]:
# importing libraries for data cleaning, manipulating and visualization.

import os
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import cufflinks as cf



In [4]:
# choosing working directory..

os.chdir("/Users/rohitpawar/Desktop/google-case-study")

# importing 12 csv files ..

mar_2021 = pd.read_csv('202103-divvy-tripdata.csv')
apr_2021 = pd.read_csv('202104-divvy-tripdata.csv')
may_2021 = pd.read_csv('202105-divvy-tripdata.csv')
jun_2021 = pd.read_csv('202106-divvy-tripdata.csv')
jul_2021 = pd.read_csv('202107-divvy-tripdata.csv')
aug_2021 = pd.read_csv('202108-divvy-tripdata.csv')
sep_2021 = pd.read_csv('202109-divvy-tripdata.csv')
oct_2021 = pd.read_csv('202110-divvy-tripdata.csv')
nov_2021 = pd.read_csv('202111-divvy-tripdata.csv')
dec_2021 = pd.read_csv('202112-divvy-tripdata.csv')
jan_2022 = pd.read_csv('202201-divvy-tripdata.csv')
feb_2022 = pd.read_csv('202202-divvy-tripdata.csv')

In [5]:
# lets join all data sets into one data frame..

df = pd.concat([mar_2021,apr_2021,may_2021,jun_2021,jul_2021,aug_2021,sep_2021,oct_2021,nov_2021,dec_2021,jan_2022,feb_2022])
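
The twelve `read_csv` calls and the `concat` above can also be written as a loop. This is only a minimal sketch, not the method used in this notebook: the helper name `load_trip_data` and its `data_dir` argument are hypothetical, and it assumes every monthly file in the folder matches the pattern `*-divvy-tripdata.csv`.

```python
from pathlib import Path

import pandas as pd


def load_trip_data(data_dir):
    """Read every monthly trip file in data_dir into one DataFrame."""
    # Sorting the file names keeps the months in chronological order,
    # because each name starts with YYYYMM.
    files = sorted(Path(data_dir).glob("*-divvy-tripdata.csv"))
    if not files:
        raise FileNotFoundError(f"no trip files found in {data_dir}")
    frames = [pd.read_csv(path) for path in files]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating
    # each file's own row numbers.
    return pd.concat(frames, ignore_index=True)
```

Note that `ignore_index=True` is a deliberate difference from the cell above, which keeps each file's own row numbers in the combined index.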

In [8]:
# know your data..

df.head()

| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CFA86D4455AA1030 | classic_bike | 2021-03-16 08:32:30 | 2021-03-16 08:36:34 | Humboldt Blvd & Armitage Ave | 15651 | Stave St & Armitage Ave | 13266 | 41.917513 | -87.701809 | 41.917741 | -87.691392 | casual |
| 1 | 30D9DC61227D1AF3 | classic_bike | 2021-03-28 01:26:28 | 2021-03-28 01:36:55 | Humboldt Blvd & Armitage Ave | 15651 | Central Park Ave & Bloomingdale Ave | 18017 | 41.917513 | -87.701809 | 41.914166 | -87.716755 | casual |
| 2 | 846D87A15682A284 | classic_bike | 2021-03-11 21:17:29 | 2021-03-11 21:33:53 | Shields Ave & 28th Pl | 15443 | Halsted St & 35th St | TA1308000043 | 41.842733 | -87.635491 | 41.830661 | -87.647172 | casual |
| 3 | 994D05AA75A168F2 | classic_bike | 2021-03-11 13:26:42 | 2021-03-11 13:55:41 | Winthrop Ave & Lawrence Ave | TA1308000021 | Broadway & Sheridan Rd | 13323 | 41.968812 | -87.657659 | 41.952833 | -87.649993 | casual |
| 4 | DF7464FBE92D8308 | classic_bike | 2021-03-21 09:09:37 | 2021-03-21 09:27:33 | Glenwood Ave & Touhy Ave | 525 | Chicago Ave & Sheridan Rd | E008 | 42.012701 | -87.666058 | 42.050491 | -87.677821 | casual |


In [9]:
# converting data type of started_at and ended_at columns into datetime..

df['started_at'] = pd.to_datetime(df['started_at'], format='%Y-%m-%d %H:%M:%S')
df['ended_at'] = pd.to_datetime(df['ended_at'], format='%Y-%m-%d %H:%M:%S')

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5667986 entries, 0 to 115608
Data columns (total 13 columns):
 #   Column              Dtype         
---  ------              -----         
 0   ride_id             object        
 1   rideable_type       object        
 2   started_at          datetime64[ns]
 3   ended_at            datetime64[ns]
 4   start_station_name  object        
 5   start_station_id    object        
 6   end_station_name    object        
 7   end_station_id      object        
 8   start_lat           float64       
 9   start_lng           float64       
 10  end_lat             float64       
 11  end_lng             float64       
 12  member_casual       object        
dtypes: datetime64[ns](2), float64(4), object(7)
memory usage: 605.4+ MB


In [11]:
# getting rid of some columns which are not relevant for this analysis..
df = df.drop(columns=['start_station_name','start_station_id','end_station_name','end_station_id','start_lat',
                      'start_lng','end_lat','end_lng'])

In [12]:
# let's create a new column called 'ride_length' (in minutes) and convert it into int..

df['ride_length'] = (df['ended_at'] - df['started_at'])/pd.Timedelta(minutes=1)
df['ride_length'] = df['ride_length'].astype('int32')
df.head(5)

| | ride_id | rideable_type | started_at | ended_at | member_casual | ride_length |
|---|---|---|---|---|---|---|
| 0 | CFA86D4455AA1030 | classic_bike | 2021-03-16 08:32:30 | 2021-03-16 08:36:34 | casual | 4 |
| 1 | 30D9DC61227D1AF3 | classic_bike | 2021-03-28 01:26:28 | 2021-03-28 01:36:55 | casual | 10 |
| 2 | 846D87A15682A284 | classic_bike | 2021-03-11 21:17:29 | 2021-03-11 21:33:53 | casual | 16 |
| 3 | 994D05AA75A168F2 | classic_bike | 2021-03-11 13:26:42 | 2021-03-11 13:55:41 | casual | 28 |
| 4 | DF7464FBE92D8308 | classic_bike | 2021-03-21 09:09:37 | 2021-03-21 09:27:33 | casual | 17 |
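
Casting to `int32` truncates the fractional minutes rather than rounding them. The sketch below illustrates this on two rows copied from the output above plus one made-up row whose end time is 30 seconds before its start time; the frame name `rides` is hypothetical.

```python
import pandas as pd

# Rows 1 and 2 are copied from the output above; row 3 is made up and
# ends 30 seconds before it starts.
rides = pd.DataFrame({
    "started_at": pd.to_datetime(
        ["2021-03-16 08:32:30", "2021-03-28 01:26:28", "2021-03-01 10:00:30"]
    ),
    "ended_at": pd.to_datetime(
        ["2021-03-16 08:36:34", "2021-03-28 01:36:55", "2021-03-01 10:00:00"]
    ),
})

# Dividing by a one-minute Timedelta gives the duration in float minutes.
minutes = (rides["ended_at"] - rides["started_at"]) / pd.Timedelta(minutes=1)

# astype('int32') drops the fractional part (truncation toward zero).
rides["ride_length"] = minutes.astype("int32")
print(rides["ride_length"].tolist())  # [4, 10, 0]
```

One consequence is that a ride with a negative duration of less than a minute ends up as 0 rather than as a negative value.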


In [14]:
# the ride_length column seems to contain negative and zero values.
# let's make a new data frame that excludes rides shorter than 1 minute (including negative values)..
# trips below 1 min are probably not relevant for this analysis..
# creating a new data frame.

trimmed_df = df[df['ride_length'] >= 1]
trimmed_df.head()

| | ride_id | rideable_type | started_at | ended_at | member_casual | ride_length |
|---|---|---|---|---|---|---|
| 0 | CFA86D4455AA1030 | classic_bike | 2021-03-16 08:32:30 | 2021-03-16 08:36:34 | casual | 4 |
| 1 | 30D9DC61227D1AF3 | classic_bike | 2021-03-28 01:26:28 | 2021-03-28 01:36:55 | casual | 10 |
| 2 | 846D87A15682A284 | classic_bike | 2021-03-11 21:17:29 | 2021-03-11 21:33:53 | casual | 16 |
| 3 | 994D05AA75A168F2 | classic_bike | 2021-03-11 13:26:42 | 2021-03-11 13:55:41 | casual | 28 |
| 4 | DF7464FBE92D8308 | classic_bike | 2021-03-21 09:09:37 | 2021-03-21 09:27:33 | casual | 17 |
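
It can be useful to check how many rides a filter like this removes. The following sketch does so on made-up values; `df_toy`, `keep`, and `trimmed_toy` are hypothetical names, and the `.copy()` call is an optional precaution rather than something the cell above does.

```python
import pandas as pd

# Toy ride lengths in whole minutes (made-up values, not the real data).
df_toy = pd.DataFrame({"ride_length": [4, 10, 0, -3, 16, 0, 28]})

# Same condition as the cell above: keep rides of at least 1 minute.
keep = df_toy["ride_length"] >= 1

# .copy() makes the result an independent frame, so adding columns to it
# later does not trigger chained-assignment warnings.
trimmed_toy = df_toy[keep].copy()

removed = int((~keep).sum())
print(f"removed {removed} of {len(df_toy)} rows ({removed / len(df_toy):.1%})")
```

In this notebook, the two `info()` outputs report 5,667,986 rows before the filter and 5,580,736 after it, so 87,250 rides (about 1.5%) were dropped.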


In [16]:
# converting data type of some columns in order to perform calculations..

trimmed_df = trimmed_df.astype({'ride_id':'string', 'rideable_type':'category', 'member_casual':'category'})
trimmed_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5580736 entries, 0 to 115608
Data columns (total 6 columns):
 #   Column         Dtype         
---  ------         -----         
 0   ride_id        string        
 1   rideable_type  category      
 2   started_at     datetime64[ns]
 3   ended_at       datetime64[ns]
 4   member_casual  category      
 5   ride_length    int32         
dtypes: category(2), datetime64[ns](2), int32(1), string(1)
memory usage: 202.2 MB


In [17]:
trimmed_df.groupby('member_casual').count()

| member_casual | ride_id | rideable_type | started_at | ended_at | ride_length |
|---|---|---|---|---|---|
| casual | 2506221 | 2506221 | 2506221 | 2506221 | 2506221 |
| member | 3074515 | 3074515 | 3074515 | 3074515 | 3074515 |
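
Because every column holds the same count, a single `value_counts()` call on the `member_casual` column gives the same information in a more compact form. The sketch below shows this on a made-up column and then works out each group's share of rides from the totals in the output above; `member_casual`, `totals`, and `shares` are hypothetical names.

```python
import pandas as pd

# Toy column standing in for trimmed_df['member_casual'] (made-up values).
member_casual = pd.Series(["member", "casual", "member", "member"], dtype="category")
print(member_casual.value_counts())                # rides per rider type
print(member_casual.value_counts(normalize=True))  # same, as a fraction of all rides

# Shares implied by the totals in the output above.
totals = pd.Series({"casual": 2_506_221, "member": 3_074_515})
shares = totals / totals.sum()
print(f"member share: {shares['member']:.2%}")  # member share: 55.09%
print(f"casual share: {shares['casual']:.2%}")  # casual share: 44.91%
```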


In [19]:
# lets prepare data for analysis..
# lets create new columns for year, month, day_of_week, date, and hour..

# year.
trimmed_df['year'] = trimmed_df['started_at'].dt.year

# month.
m_name = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
trimmed_df['month'] = trimmed_df['started_at'].dt.month_name()
trimmed_df['month'] = trimmed_df['month'].astype(CategoricalDtype(categories=m_name, ordered=False))

# date.
trimmed_df['date'] = trimmed_df['started_at'].dt.date

# day of week.
d_name = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
trimmed_df['day_of_week'] = trimmed_df['started_at'].dt.day_name()
trimmed_df['day_of_week'] = trimmed_df['day_of_week'].astype(CategoricalDtype(categories=d_name, ordered=False))

# hour.
trimmed_df['hour'] = trimmed_df['started_at'].dt.hour

# converting year and hour data types..
trimmed_df = trimmed_df.astype({'year':'int16', 'hour':'int8'})





In [21]:
trimmed_df.tail()

| | ride_id | rideable_type | started_at | ended_at | member_casual | ride_length | year | month | date | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 115604 | 211BE0DC162D85B7 | electric_bike | 2022-02-23 17:47:49 | 2022-02-23 18:02:29 | member | 14 | 2022 | February | 2022-02-23 | Wednesday | 17 |
| 115605 | D4D53E78000C8CA1 | electric_bike | 2022-02-04 10:43:47 | 2022-02-04 10:50:52 | member | 7 | 2022 | February | 2022-02-04 | Friday | 10 |
| 115606 | 9E85F07D2F94492B | electric_bike | 2022-02-28 09:16:33 | 2022-02-28 09:28:11 | member | 11 | 2022 | February | 2022-02-28 | Monday | 9 |
| 115607 | B61B559F81F1D823 | electric_bike | 2022-02-10 16:55:16 | 2022-02-10 16:57:53 | member | 2 | 2022 | February | 2022-02-10 | Thursday | 16 |
| 115608 | 841C701610CF0609 | electric_bike | 2022-02-21 16:35:20 | 2022-02-21 16:42:53 | member | 7 | 2022 | February | 2022-02-21 | Monday | 16 |
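
The `CategoricalDtype` conversions matter because plain day and month names sort alphabetically, while a categorical column sorts (and groups) in the order of its category list. A small sketch of the difference, using three dates from the output above plus one extra date (2022-02-06) chosen for illustration; `days` and `as_cat` are hypothetical names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

d_name = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Three dates from the output above plus 2022-02-06 (a Sunday).
dates = pd.Series(pd.to_datetime(["2022-02-23", "2022-02-04", "2022-02-28", "2022-02-06"]))
days = dates.dt.day_name()

# Plain strings sort alphabetically.
print(sorted(days))  # ['Friday', 'Monday', 'Sunday', 'Wednesday']

# A categorical column sorts in the order given by d_name.
as_cat = days.astype(CategoricalDtype(categories=d_name, ordered=False))
print(as_cat.sort_values().tolist())  # ['Sunday', 'Monday', 'Wednesday', 'Friday']
```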


# Analyze & Share

**Compare annual members and casual riders on parameters such as:** 
* type of bike, 
* ride length, 
* year and month (season), 
* hour of day, 
* day of week.
 


## figure1

In [24]:
# Grouping..
figure1 = trimmed_df.groupby('member_casual', as_index=False).count()
figure1

| | member_casual | ride_id | rideable_type | started_at | ended_at | ride_length | year | month | date | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | casual | 2506221 | 2506221 | 2506221 | 2506221 | 2506221 | 2506221 | 2506221 | 2506221 | 2506221 | 2506221 |
| 1 | member | 3074515 | 3074515 | 3074515 | 3074515 | 3074515 | 3074515 | 3074515 | 3074515 | 3074515 | 3074515 |


In [37]:
# Plotting..
px.bar(figure1, y = 'member_casual', x = 'ride_length',range_x = [0,3000000], color = 'member_casual', height = 250,text = 'ride_length', 
        labels = {'ride_length': 'Number Of Rides', 'member_casual': 'Member Vs. Casual'},
        hover_name = 'member_casual', hover_data = {'member_casual': False, 'month': False, 'ride_length': True}, 
        color_discrete_map = {'casual': '#F5F5DC', 'member': '#00FFFF'})

* **Annual members** took more rides than **Casual riders** over the period (3,074,515 vs. 2,506,221), as shown in the graph.

## figure2

In [25]:
# Grouping..
figure2 = round(trimmed_df.groupby(['member_casual','rideable_type'], as_index=False).count(),2).dropna()
figure2


| | member_casual | rideable_type | ride_id | started_at | ended_at | ride_length | year | month | date | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | casual | classic_bike | 1254496 | 1254496 | 1254496 | 1254496 | 1254496 | 1254496 | 1254496 | 1254496 | 1254496 |
| 1 | casual | docked_bike | 309359 | 309359 | 309359 | 309359 | 309359 | 309359 | 309359 | 309359 | 309359 |
| 2 | casual | electric_bike | 942366 | 942366 | 942366 | 942366 | 942366 | 942366 | 942366 | 942366 | 942366 |
| 3 | member | classic_bike | 1971782 | 1971782 | 1971782 | 1971782 | 1971782 | 1971782 | 1971782 | 1971782 | 1971782 |
| 4 | member | docked_bike | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | member | electric_bike | 1102733 | 1102733 | 1102733 | 1102733 | 1102733 | 1102733 | 1102733 | 1102733 | 1102733 |


In [36]:
# Plotting.. (bars are coloured by rideable_type, using plotly's default colours)
px.bar(figure2, y = 'member_casual', x = 'ride_length',range_x = [0,3000000], color = 'rideable_type', height = 250,text = 'ride_length', 
        labels = {'ride_length': 'Number Of Rides/Bike', 'member_casual': 'Member Vs. Casual', 'rideable_type': 'Rideable Type'},
        hover_name = 'member_casual', hover_data = {'member_casual': False, 'month': False, 'ride_length': True})

* The **Classic Bike** is the most popular bike type in both groups, and most classic-bike rides are taken by **Annual Members** (1,971,782 vs. 1,254,496 for casual riders).
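
Raw counts are hard to compare because members take more rides overall. As a sketch, the counts from the figure2 table can be turned into the share of each bike type within each rider type; `counts` and `share` are hypothetical names, and the numbers are copied from that table.

```python
import pandas as pd

# Ride counts copied from the figure2 table above.
counts = pd.DataFrame(
    {
        "classic_bike": [1_254_496, 1_971_782],
        "docked_bike": [309_359, 0],
        "electric_bike": [942_366, 1_102_733],
    },
    index=["casual", "member"],
)

# Divide each row by its total so every rider type sums to 100%.
share = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
print(share)
```

On these counts, classic bikes account for roughly 64% of member rides and 50% of casual rides, and docked bikes appear only among casual riders (about 12% of their rides).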

## figure3

In [29]:
# Grouping..
figure3 = trimmed_df.groupby(['year', 'month', 'member_casual'], as_index=False).count()
figure3 = figure3[figure3['ride_id'] != 0]
figure3

| | year | month | member_casual | ride_id | rideable_type | started_at | ended_at | ride_length | date | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 2021 | March | casual | 83148 | 83148 | 83148 | 83148 | 83148 | 83148 | 83148 | 83148 |
| 5 | 2021 | March | member | 142375 | 142375 | 142375 | 142375 | 142375 | 142375 | 142375 | 142375 |
| 6 | 2021 | April | casual | 134945 | 134945 | 134945 | 134945 | 134945 | 134945 | 134945 | 134945 |
| 7 | 2021 | April | member | 197477 | 197477 | 197477 | 197477 | 197477 | 197477 | 197477 | 197477 |
| 8 | 2021 | May | casual | 253346 | 253346 | 253346 | 253346 | 253346 | 253346 | 253346 | 253346 |
| 9 | 2021 | May | member | 269897 | 269897 | 269897 | 269897 | 269897 | 269897 | 269897 | 269897 |
| 10 | 2021 | June | casual | 365023 | 365023 | 365023 | 365023 | 365023 | 365023 | 365023 | 365023 |
| 11 | 2021 | June | member | 352676 | 352676 | 352676 | 352676 | 352676 | 352676 | 352676 | 352676 |
| 12 | 2021 | July | casual | 435927 | 435927 | 435927 | 435927 | 435927 | 435927 | 435927 | 435927 |
| 13 | 2021 | July | member | 373833 | 373833 | 373833 | 373833 | 373833 | 373833 | 373833 | 373833 |


In [38]:
# Plotting..
px.line(figure3, x = 'month', y = 'ride_id', range_y = [0,500000], color = 'member_casual', line_shape = 'spline', markers=True, 
        labels = {'ride_id': 'Number Of Rides', 'month': 'Months (Mar 2021 - Feb 2022)', 'member_casual': 'Member Vs. Casual'},
        hover_name = 'member_casual', hover_data = {'member_casual': False, 'month': True, 'ride_id': True}, 
        color_discrete_map = {'casual': '#151516', 'member': '#E24A33'})

* The number of rides by **Annual Members** is highest from **June to October and stays fairly consistent over that period**.
* For **Casual Riders**, the number of rides is highest from **June to September, with July the peak month**.

## figure4

In [40]:
# Grouping..
figure4 = round(trimmed_df.groupby(['year', 'month', 'member_casual'], as_index=False).mean(numeric_only=True),2).dropna()

In [60]:
# Plotting..
px.bar(figure4, x = 'month', y = 'ride_length',
        color = 'member_casual',
        barmode='group',
        text = 'ride_length', 
        labels = {'ride_length': 'Average Ride Length (mins)', 'member_casual': 'Member Vs. Casual', 'month': 'Months (Mar 2021 - Feb 2022)'},
        hover_name = 'member_casual', hover_data = {'member_casual': False, 'ride_length': True}, 
        color_discrete_map = {'casual': '#F5F5DC', 'member': '#00FFFF'})

* According to this graph, the **Average** ride length for **Casual Riders** is **higher during the entire spring season and the first months of summer**.

* The **Average** ride length for **Annual Members** is very **consistent** throughout the year.
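
The averages behind this chart only need the `ride_length` column. The sketch below is one possible way to compute them on made-up data, not the notebook's own grouping cell: it selects `ride_length` before calling `mean()`, so non-numeric columns are never averaged, and passes `observed=True` so that month and rider-type combinations with no rides are skipped instead of producing empty rows. The names `toy` and `avg` are hypothetical.

```python
import pandas as pd

# Toy rides (made-up values) with the same column names as trimmed_df.
toy = pd.DataFrame({
    "month": pd.Categorical(
        ["March", "March", "March", "April"],
        categories=["March", "April", "May"],
    ),
    "member_casual": pd.Categorical(["casual", "casual", "member", "member"]),
    "ride_length": [30, 20, 10, 12],
})

avg = (
    toy.groupby(["month", "member_casual"], observed=True)["ride_length"]
       .mean()         # average ride length per month and rider type
       .round(2)
       .reset_index()  # turn the group keys back into ordinary columns
)
print(avg)
```

Only the three combinations that actually occur in the toy data appear in `avg`; "May" and the April casual group are left out.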

## figure5

In [48]:
# Grouping..
figure5 = round(trimmed_df.groupby(['day_of_week', 'member_casual'], as_index=False).mean(numeric_only=True),2)

In [49]:
# Plotting..
px.bar(figure5, x = 'day_of_week', y = 'ride_length',
        color = 'member_casual',
        barmode='group',
        text = 'ride_length', 
        labels = {'ride_length': 'Average Ride Length (mins)', 'member_casual': 'Member Vs. Casual', 'day_of_week': 'Day Of Week'},
        hover_name = 'member_casual', hover_data = {'member_casual': False, 'ride_length': True}, 
        color_discrete_map = {'casual': '#F0FFFF', 'member': '#058ED9'})

* **Casual riders**, according to this graph, ride their bikes for **longer periods of time, especially on weekends**.
* **Annual members** ride for a fairly steady length of time during the week, and **a little longer on weekends**.
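
The weekday and weekend comparison can also be made explicit with a boolean flag. This is a sketch on made-up values, not a cell from the analysis; `toy`, `is_weekend`, and `summary` are hypothetical names.

```python
import pandas as pd

# Toy rides (made-up values) with the same column names as trimmed_df.
toy = pd.DataFrame({
    "day_of_week": ["Saturday", "Sunday", "Monday", "Tuesday", "Saturday", "Wednesday"],
    "member_casual": ["casual", "casual", "casual", "member", "member", "member"],
    "ride_length": [40, 36, 20, 11, 15, 12],
})

# Flag rides that start on a Saturday or Sunday.
toy["is_weekend"] = toy["day_of_week"].isin(["Saturday", "Sunday"])

# Average ride length per rider type, weekdays vs. weekends.
summary = toy.groupby(["member_casual", "is_weekend"])["ride_length"].mean().unstack()
summary.columns = ["weekday", "weekend"]  # False sorts before True
print(summary)
```

Applied to `trimmed_df`, the same flag would give one weekday and one weekend average per rider type, which makes the size of the weekend effect easier to state than reading it off seven bars.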

## figure6

In [63]:
# Grouping..
figure6 = trimmed_df.groupby(['rideable_type', 'member_casual'], as_index=False).count()

In [65]:
# Plotting..
px.bar(figure6, x = 'rideable_type', y = 'ride_id',
        color = 'member_casual',
        barmode='group',
        text = 'ride_id', 
        labels = {'ride_id': 'Number Of Rides', 'member_casual': 'Member Vs. Casual', 'rideable_type' : 'Rideable Type'},
        hover_name = 'member_casual', hover_data = {'member_casual': False, 'ride_length': False}, 
        color_discrete_map = {'casual': '#7FFF00', 'member': '#058ED9'})

# Act

### Final Conclusion Based On Analysis

> * Cyclistic bike share is used **differently** by **annual members and casual users**.

> * The analysis suggests that **Annual members** mostly use the bikes for **everyday commuting**: their rides are short and their weekly usage is **consistent**. **Casual riders**, on the other hand, primarily use bikes on **weekends for outings**. Nevertheless, some casual riders probably also use bikes for daily commutes, because some of their rides are short.

> * **Casual Riders'** average ride length is **higher throughout the spring season and the first months of summer**, whereas **Annual Members'** average ride length is **fairly steady throughout the year**. The number of rides by **Annual Members** is highest from **June to October** and stays fairly consistent over that period. The number of rides by **Casual riders** is highest from **June through September**, with **July** the peak month.


### Effective Use Of Insights

> * Create an appealing annual membership package for casual riders who use the bikes for regular commutes. The plan could include a seasonal or festive discount, and its promotion should focus mainly on email marketing, with less emphasis on other media.

> * Offer the same annual membership package to casual riders who primarily ride on weekends, but present it differently. For this segment, the package could include coupons for popular hot spots, and promotion should focus heavily on social media.

> * These strategies can also be used to acquire new clients.

### Recommendation 

> * Collect additional data for deeper analysis, such as financial information, membership plan details, personally identifiable information about customers, and more detail on bike types. This would help make the marketing strategy more effective.