### Google Data Analytics Case Study - Cyclistic

### Lewis Lee | Jun 2021

#### About the company
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that
are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and
returned to any other station in the system anytime.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the
pricing flexibility helps Cyclistic attract more customers, Lily Moreno, the director of marketing, believes that maximizing the number of annual members will be
key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very
good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and
have chosen Cyclistic for their mobility needs.


#### Objectives

Moreno has set a clear goal: design marketing strategies aimed at converting casual riders into annual members. To do
that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual
riders would buy a membership, and how digital media could influence casual riders to become members.

# Ask 

Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?

# Prepare

You will use Cyclistic’s historical trip data to analyze and identify trends. Download the previous 12 months of Cyclistic trip data [here](https://divvy-tripdata.s3.amazonaws.com/index.html). (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this [license](https://www.divvybikes.com/data-license-agreement).)

#### Import Data Sets from May 2020 - Apr 2021

In [None]:
import pandas as pd
import numpy as np

#Import each monthly file and drop any rows that contain null values

tripdata_202005 = pd.read_csv('../input/cyclistic/202005-divvy-tripdata.csv').dropna()
tripdata_202006 = pd.read_csv('../input/cyclistic/202006-divvy-tripdata.csv').dropna()
tripdata_202007 = pd.read_csv('../input/cyclistic/202007-divvy-tripdata.csv').dropna()
tripdata_202008 = pd.read_csv('../input/cyclistic/202008-divvy-tripdata.csv').dropna()
tripdata_202009 = pd.read_csv('../input/cyclistic/202009-divvy-tripdata.csv').dropna()
tripdata_202010 = pd.read_csv('../input/cyclistic/202010-divvy-tripdata.csv').dropna()
tripdata_202011 = pd.read_csv('../input/cyclistic/202011-divvy-tripdata.csv').dropna()
tripdata_202012 = pd.read_csv('../input/cyclistic/202012-divvy-tripdata.csv').dropna()
tripdata_202101 = pd.read_csv('../input/cyclistic/202101-divvy-tripdata.csv').dropna()
tripdata_202102 = pd.read_csv('../input/cyclistic/202102-divvy-tripdata.csv').dropna()
tripdata_202103 = pd.read_csv('../input/cyclistic/202103-divvy-tripdata.csv').dropna()
tripdata_202104 = pd.read_csv('../input/cyclistic/202104-divvy-tripdata.csv').dropna()
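
The twelve near-identical `read_csv` calls above could also be written as a loop. The following is a minimal sketch rather than the notebook's actual code: `load_months` and the dictionary it returns are hypothetical names, and the file naming is assumed to follow the `<YYYYMM>-divvy-tripdata.csv` pattern used above.

```python
from pathlib import Path

import pandas as pd


def load_months(data_dir, months):
    """Read one CSV per month and drop rows that contain any null value.

    `months` is a list of 'YYYYMM' strings. File names are assumed to follow
    the '<YYYYMM>-divvy-tripdata.csv' pattern (hypothetical helper).
    """
    frames = {}
    for month in months:
        path = Path(data_dir) / f"{month}-divvy-tripdata.csv"
        frames[month] = pd.read_csv(path).dropna()
    return frames


# The 12 months from May 2020 to Apr 2021 as 'YYYYMM' strings
months = [p.strftime("%Y%m") for p in pd.period_range("2020-05", "2021-04", freq="M")]
print(months[0], months[-1], len(months))
```

Calling `load_months('../input/cyclistic', months)` would then return one cleaned DataFrame per month, keyed by its `YYYYMM` label.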

#### Check Data Types and Column Headers

In [None]:
tripdata_202005.info()
tripdata_202006.info()
tripdata_202007.info()
tripdata_202008.info()
tripdata_202009.info()
tripdata_202010.info()
tripdata_202011.info()
tripdata_202012.info()
tripdata_202101.info()
tripdata_202102.info()
tripdata_202103.info()
tripdata_202104.info()

# Process

1. Check the data for errors.
2. Choose your tools.
3. Transform the data so you can work with it effectively.
4. Document the cleaning process.

**All Column Headers are consistent.**

`1. Convert start_station_id and end_station_id to Object`

`2. Join all CSV files into one (12 months)`

In [None]:
#Converting Dtypes in order to merge them
tripdata_202005 = tripdata_202005.astype({"start_station_id": object, "end_station_id": object})
tripdata_202006 = tripdata_202006.astype({"start_station_id": object, "end_station_id": object})
tripdata_202007 = tripdata_202007.astype({"start_station_id": object, "end_station_id": object})
tripdata_202008 = tripdata_202008.astype({"start_station_id": object, "end_station_id": object})
tripdata_202009 = tripdata_202009.astype({"start_station_id": object, "end_station_id": object})
tripdata_202010 = tripdata_202010.astype({"start_station_id": object, "end_station_id": object})
tripdata_202011 = tripdata_202011.astype({"start_station_id": object, "end_station_id": object})

In [None]:
#Merging Past 12 Months of Data into one
tripdata = pd.concat([tripdata_202005, tripdata_202006, tripdata_202007,tripdata_202008,
                    tripdata_202009,tripdata_202010,tripdata_202011,tripdata_202012,
                    tripdata_202101,tripdata_202102,tripdata_202103,tripdata_202104])

tripdata.info()
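
After casting to `object` and concatenating, each value in the station id columns keeps its original Python type, so numeric and text ids can sit side by side in one column. The sketch below uses made-up station ids (illustrative values, not taken from the dataset) to show one way of checking for this.

```python
import pandas as pd

# Toy stand-ins for two monthly files: numeric ids in one, text ids in the other
early = pd.DataFrame({"start_station_id": [86.0, 503.0]}).astype({"start_station_id": object})
late = pd.DataFrame({"start_station_id": ["A100", "B200"]})

# ignore_index=True gives the merged frame a fresh 0..n-1 index
merged = pd.concat([early, late], ignore_index=True)

# Each value keeps its own Python type, so the merged column is mixed
types_seen = set(merged["start_station_id"].map(type))
print(types_seen)
```

If the ids are later used for grouping or joins, a mixed column like this is worth knowing about, since `86.0` and `"86"` would be treated as different stations.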

#### Adding New columns for date and ride length

In [None]:
#Changing Dtype to datetime
tripdata['started_at'] = pd.to_datetime(tripdata["started_at"], format="%Y-%m-%d %H:%M:%S")
tripdata['ended_at'] = pd.to_datetime(tripdata["ended_at"], format="%Y-%m-%d %H:%M:%S")

In [None]:
tripdata.dtypes #check for dtype changes

In [None]:
from datetime import datetime as dt
from pandas.api.types import CategoricalDtype

dayofweek_mapping= {
    0: 'Monday', 
    1: 'Tuesday', 
    2: 'Wednesday', 
    3: 'Thursday', 
    4: 'Friday',
    5: 'Saturday', 
    6: 'Sunday'
} 

month_mapping= {
    1: 'Jan', 
    2: 'Feb', 
    3: 'Mar', 
    4: 'Apr',
    5: 'May', 
    6: 'Jun', 
    7: 'Jul', 
    8: 'Aug', 
    9: 'Sep', 
    10: 'Oct',
    11: 'Nov', 
    12: 'Dec'
}

tripdata['Day'] = tripdata['started_at'].dt.day
tripdata['Month'] = tripdata['started_at'].dt.month.map(month_mapping)
tripdata['Year'] = tripdata['started_at'].dt.year
tripdata['Day_of_Week'] = tripdata['started_at'].dt.dayofweek.map(dayofweek_mapping)
tripdata['Starting_Time'] = tripdata['started_at'].dt.strftime('%H:%M')
tripdata['Ride_Length'] = tripdata['ended_at'] - tripdata['started_at']
tripdata['Ride_Length_minute'] = (tripdata['Ride_Length'].dt.total_seconds()/60).round(2) # convert into minutes

tripdata.head()
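
As a quick sanity check of the ride length calculation, here is a small worked example on two made-up rides (the timestamps are illustrative only). It also uses `dt.day_name()`, which is a built-in alternative to the manual day mapping above, not the method this notebook uses.

```python
import pandas as pd

# Two made-up rides, one of which crosses midnight
rides = pd.DataFrame({
    "started_at": ["2020-05-27 10:03:52", "2020-05-30 23:50:00"],
    "ended_at": ["2020-05-27 10:16:49", "2020-05-31 00:20:30"],
})
for col in ["started_at", "ended_at"]:
    rides[col] = pd.to_datetime(rides[col], format="%Y-%m-%d %H:%M:%S")

# Day name taken from the start time, ride length converted from seconds to minutes
rides["Day_of_Week"] = rides["started_at"].dt.day_name()
rides["Ride_Length_minute"] = (
    (rides["ended_at"] - rides["started_at"]).dt.total_seconds() / 60
).round(2)

print(rides[["Day_of_Week", "Ride_Length_minute"]])
```

The first ride lasts 12 minutes 57 seconds (12.95 minutes) and the second 30.5 minutes, which shows that rides crossing midnight are handled correctly by the subtraction.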

#### Filtering out Ride Lengths that are zero or negative, or 12 hours and longer

In [None]:
#Keep rides longer than 0 minutes and shorter than 720 minutes (12 hours)
filtered = tripdata[(tripdata['Ride_Length_minute'] > 0) & (tripdata['Ride_Length_minute'] < 720)]
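
To make the boundary behaviour of this filter explicit, the sketch below applies the same two conditions to a handful of made-up ride lengths. `MAX_MINUTES` is a hypothetical constant introduced here for readability and is set to the 720-minute cutoff used in the code above.

```python
import pandas as pd

MAX_MINUTES = 720  # same cutoff as the filter above (720 minutes)

# Made-up ride lengths covering each edge case
toy = pd.DataFrame({"Ride_Length_minute": [-5.0, 0.0, 12.95, 719.99, 720.0, 1500.0]})

# Both comparisons are strict, so exactly 0 and exactly MAX_MINUTES are dropped
mask = (toy["Ride_Length_minute"] > 0) & (toy["Ride_Length_minute"] < MAX_MINUTES)
kept = toy[mask]

print(kept["Ride_Length_minute"].tolist())
print(f"dropped {len(toy) - len(kept)} of {len(toy)} rows")
```

Printing how many rows are dropped is a cheap way to document the cleaning step when the same filter is run on the full dataset.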

#### Checking Unique Values


Note the 3 different rideable types and the 2 different membership types. The number of unique days, months, and years looks correct.

In [None]:
filtered.nunique()

In [None]:
#Finding out the number of each unique rideable type.
#Naming the Series before converting it keeps the 'Count' header regardless of pandas version
bikes = filtered['rideable_type'].value_counts().rename('Count').to_frame()
bikes

# Analyze

1. Aggregate your data so it’s useful and accessible.
2. Organize and format your data.
3. Perform calculations.
4. Identify trends and relationships.

Calculate the number of rides for each user type by Day_of_Week by counting ride_id.

**Descriptive Statistics**

In [None]:
round(filtered.groupby("member_casual")['Ride_Length_minute'].describe(),2)

**Most Popular Day of the week to Cycle**

In [None]:
filtered.groupby("member_casual")['Day_of_Week'].describe()

**Most Popular Month to Cycle**


In [None]:
filtered.groupby("member_casual")['Month'].describe()

**Average ride_length for members and casual riders by Day of Week**

In [None]:
dayofweek = ['Monday', 'Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
table_1 = filtered.groupby(["member_casual", "Day_of_Week"])['Ride_Length_minute'].mean()
pd.DataFrame(table_1).reindex(index=dayofweek,level=1)

**Average ride_length for members and casual riders by Month**

In [None]:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov","Dec"]
table_2 = filtered.groupby(["member_casual", "Month"])['Ride_Length_minute'].mean()
pd.DataFrame(table_2).reindex(index=months,level=1)

**Number of Rides for members and casual riders by Day of Week**

In [None]:
table_3 = filtered.groupby(["member_casual", "Day_of_Week"])['ride_id'].count()
pd.DataFrame(table_3).reindex(index=dayofweek,level=1)
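
An alternative to reindexing each table is to store the day order in the column itself as an ordered categorical, so every group-by comes out in calendar order. This is a sketch on made-up rides, not the approach used in this notebook; `toy` and `counts` are hypothetical names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Five made-up rides
toy = pd.DataFrame({
    "ride_id": ["r1", "r2", "r3", "r4", "r5"],
    "member_casual": ["casual", "casual", "member", "member", "casual"],
    "Day_of_Week": ["Saturday", "Sunday", "Monday", "Saturday", "Saturday"],
})

# An ordered categorical remembers Monday..Sunday, so no reindex is needed later
toy["Day_of_Week"] = toy["Day_of_Week"].astype(CategoricalDtype(day_order, ordered=True))

# observed=False keeps days with no rides as zero counts
counts = (
    toy.groupby(["member_casual", "Day_of_Week"], observed=False)["ride_id"]
    .count()
    .unstack("Day_of_Week")
)
print(counts)
```

The wide layout (one row per rider type, one column per day) also makes the member and casual rows easy to compare at a glance.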

# Visualisations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sb

#Formatting graph aesthetics
sb.set_style("darkgrid")
sb.set_context('paper', font_scale = 1.4)

#Fixing X-Axis day order
field = filtered['Day_of_Week']
day_order = ['Monday', 'Tuesday','Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday']
filtered = filtered.set_index(field).loc[day_order]

#Fixing X-Axis month order
field2 = filtered['Month']
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
filtered_month = filtered.set_index(field2).loc[month_order]

#Filtering variables via member status
casual = filtered.loc[filtered['member_casual'] == 'casual']
member = filtered.loc[filtered['member_casual'] == 'member']

### Comparing the Yearly Number of Rides

Note that because the input files contain more months of data for 2020 (May to Dec) than for 2021 (Jan to Apr), this graph cannot be used to draw any statistical conclusions. Having 12 full months of data for both 2020 and 2021 would give us more insight.

In [None]:
plt.figure(figsize = (10,6))
sb.countplot(x = "Year", data = filtered, hue = 'member_casual')
plt.title('Yearly Count of Rides')

### Comparing the Monthly Number of Rides

1. As expected, Aug is the month with the highest number of rides. Note that member rides outnumber casual rides.
2. The member and casual trends appear to be closely correlated: rides increase from Jan to Aug, then decrease from Aug to Dec.
    - This could be due to seasonal changes, which are likely the largest contributing factor for cyclists.

In [None]:
plt.figure(figsize = (10,6))
sb.histplot(x = "Month", data = filtered_month, hue = 'member_casual', palette = 'Pastel1')
plt.title('Monthly Count of Rides')

### Comparing the Number of Rides by Day of Week, broken down by rideable type

We can see that casual riders do not ride much on weekdays and prefer to ride on weekends. Members, on the other hand, ride consistently throughout the week.

In [None]:
plt.figure(figsize = (10,6))
sb.histplot(x = "Day_of_Week", data = casual, hue = 'rideable_type')
plt.title("Casual")

plt.figure(figsize = (10,6))
sb.histplot(x = "Day_of_Week", data = member, hue = 'rideable_type')
plt.title("Member")
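
The weekday versus weekend pattern can also be put into a single number per rider type: the share of rides that start on a Saturday or Sunday. The sketch below uses made-up rides, and `is_weekend` and `weekend_share` are hypothetical names, not columns in the dataset.

```python
import pandas as pd

# Eight made-up rides, four per rider type
toy = pd.DataFrame({
    "member_casual": ["casual"] * 4 + ["member"] * 4,
    "Day_of_Week": [
        "Saturday", "Sunday", "Saturday", "Wednesday",
        "Monday", "Tuesday", "Saturday", "Friday",
    ],
})

# Flag weekend rides, then average the flag to get a share per rider type
toy["is_weekend"] = toy["Day_of_Week"].isin(["Saturday", "Sunday"])
weekend_share = toy.groupby("member_casual")["is_weekend"].mean()
print(weekend_share)
```

If rides were spread evenly over the week, the share would be 2/7 (about 0.29), so values well above that indicate a weekend preference.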

### Swarmplot to visualise the timing where people start to cycle

By taking a random sample whose size is based on the average number of rides per day, we can see that most cyclists start between **10:00 and 20:00**, while only a small number start between **00:00 and 05:00**.

In [None]:
#Started_at Timings

sample = filtered.sample(9387)
plt.figure(figsize = (20,10))
sb.swarmplot(data = sample, x = 'Day_of_Week' , y = sample['started_at'].dt.hour, size = 0.8, order = ['Monday', 'Tuesday','Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title('Start Timings for Cyclists')
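
The notebook does not show how the sample size of 9387 was obtained. One plausible reading of "average number of rides per day" is the total number of rides divided by the number of distinct calendar days, which the sketch below computes on made-up data. The names `n_days` and `rides_per_day` are hypothetical, and `random_state` is added only to make the sample reproducible.

```python
import pandas as pd

# Six made-up rides spread over three calendar days
toy = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2020-05-01 08:00", "2020-05-01 17:30", "2020-05-01 18:10",
        "2020-05-02 09:15", "2020-05-02 12:00", "2020-05-03 20:45",
    ])
})

# Count distinct calendar days, ignoring the time of day
n_days = toy["started_at"].dt.normalize().nunique()

# Average rides per day, rounded to a whole number for use as a sample size
rides_per_day = round(len(toy) / n_days)

# Fixing random_state makes the same rows come back on every run
sample = toy.sample(rides_per_day, random_state=0)
print(n_days, rides_per_day, len(sample))
```

A swarm plot draws one marker per row, so keeping the sample to roughly one day's worth of rides keeps the plot readable and quick to render.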

We can see that the graphs of the cyclists' start and end timings are roughly similar. This is within our expectations, since the average ride length is ~15 mins (for members) to ~35 mins (for casuals).

In [None]:
#Ended_at Timings
plt.figure(figsize = (20,10))
sb.swarmplot(data = sample, x = 'Day_of_Week' , y = sample['ended_at'].dt.hour, size = 0.8, order = ['Monday', 'Tuesday','Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title('End Timings for Cyclists')

### Overview of location where cyclists start cycling

In [None]:
#jointplot creates its own figure, so its size is set with height instead of plt.figure
sb.jointplot(data = sample, x = 'start_lat', y = 'start_lng', hue = 'member_casual', palette = 'pastel', height = 10)

In [None]:
#jointplot creates its own figure, so its size is set with height instead of plt.figure
sb.jointplot(data = sample, x = 'end_lat', y = 'end_lng', hue = 'member_casual', palette = 'pastel', height = 10)

## Export as CSV

In [None]:
# filtered.to_csv('cyclistic_cleaned_dataset.csv')