# Google Data Analytics Capstone Project 

## Cyclistic bike-share analysis case study
### How does a bike-share navigate speedy success?

Cyclistic is a fictional bike-share company that has two types of customers: casual riders, who use the service on a pay-as-you-go basis, and members, who purchase an annual subscription.   
The goal of this analysis is to better understand how annual members and casual riders use Cyclistic bikes differently, which is part of a broader objective of designing marketing strategies with the aim of converting casual riders into annual members.

## Business task

The business task can be stated as: Identify key differences in how casual riders and annual members use the bike-share service, to fuel a targeted marketing campaign aimed at increasing the proportion of annual members within the customer pool.

## On the dataset

We will use historical trip data provided by Motivate International Inc. to analyze and identify trends.
This is a public dataset from a credible source. We checked the data and found it to be reliable, original, comprehensive and current (last 12 months), so it will enable us to answer the business questions.

In [None]:
# Import Python libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Load in data from the past 12 months

df01 = pd.read_csv("../input/bikeshare-trip-details-data/01_May_2020.csv", index_col = 'ride_id')
df02 = pd.read_csv("../input/bikeshare-trip-details-data/02_Jun_2020.csv", index_col = 'ride_id')
df03 = pd.read_csv("../input/bikeshare-trip-details-data/03_Jul_2020.csv", index_col = 'ride_id')
df04 = pd.read_csv("../input/bikeshare-trip-details-data/04_Aug_2020.csv", index_col = 'ride_id')
df05 = pd.read_csv("../input/bikeshare-trip-details-data/05_Sept_2020.csv", index_col = 'ride_id')
df06 = pd.read_csv("../input/bikeshare-trip-details-data/06_Oct_2020.csv", index_col = 'ride_id')
df07 = pd.read_csv("../input/bikeshare-trip-details-data/07_Nov_2020.csv", index_col = 'ride_id')
df08 = pd.read_csv("../input/bikeshare-trip-details-data/08_Dec_2020.csv", index_col = 'ride_id')
df09 = pd.read_csv("../input/bikeshare-trip-details-data/09_Jan_2021.csv", index_col = 'ride_id')
df10 = pd.read_csv("../input/bikeshare-trip-details-data/10_Feb_2021.csv", index_col = 'ride_id')
df11 = pd.read_csv("../input/bikeshare-trip-details-data/11_Mar_2021.csv", index_col = 'ride_id')
df12 = pd.read_csv("../input/bikeshare-trip-details-data/12_Apr_2021.csv", index_col = 'ride_id')

In [None]:
# Combine all dataframes

bikeshare_df = pd.concat([df01, df02, df03, df04, df05, df06, df07, df08, df09, df10, df11, df12])
bikeshare_df.head()
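
As an alternative to twelve separate `read_csv` calls, the same loading step can be sketched with a loop. This is only a sketch: the helper name `load_monthly_trips` is hypothetical, and it assumes every CSV in the folder belongs to the analysis and carries a `ride_id` column.

```python
from pathlib import Path

import pandas as pd


def load_monthly_trips(data_dir, pattern="*.csv"):
    """Read every CSV matching `pattern` in `data_dir` and stack them into one DataFrame."""
    # Sorting keeps the files in month order thanks to the numeric prefix (01_, 02_, ...)
    csv_paths = sorted(Path(data_dir).glob(pattern))
    if not csv_paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {data_dir}")
    monthly_frames = [pd.read_csv(path, index_col="ride_id") for path in csv_paths]
    return pd.concat(monthly_frames)


# Possible usage with the folder from the cells above:
# bikeshare_df = load_monthly_trips("../input/bikeshare-trip-details-data")
```

Building the list of frames first and calling `pd.concat` once avoids repeatedly copying the growing DataFrame.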

## Data Cleaning

Let's perform some data cleaning before we proceed with the actual analysis.

In [None]:
# Ensure there are no duplicate entries

bikeshare_df.drop_duplicates(keep = 'first', inplace=True)
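
Because `ride_id` was set as the index when loading, `drop_duplicates` compares column values only and does not look at the index. If repeated ride IDs are a concern, a check along the following lines could be added. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for bikeshare_df: the ride_id "A1" appears twice
# with different column values, so drop_duplicates() keeps both rows.
toy_df = pd.DataFrame(
    {
        "start_station_name": ["Clark St", "Clark St", "Lake Shore Dr"],
        "member_casual": ["casual", "member", "member"],
    },
    index=pd.Index(["A1", "A1", "B2"], name="ride_id"),
)

# Flag rows whose ride_id has already been seen earlier in the frame
repeated_ids = toy_df.index.duplicated(keep="first")
n_repeated = int(repeated_ids.sum())

# Keep only the first occurrence of each ride_id
deduped_df = toy_df[~repeated_ids]
```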


In [None]:
# Check for missing values

for col in bikeshare_df.columns:
    print(col+':', bikeshare_df[col].isnull().sum() )


We'll drop the start_station_id and end_station_id columns, as we don't intend to work with them.

In [None]:
# Drop start_station_id and end_station_id  columns

bikeshare_df.drop(['start_station_id', 'end_station_id'], axis = 1, inplace = True)

In [None]:
# Rename some columns for consistency and clarity

bikeshare_df = bikeshare_df.rename(columns={'started_at': 'start_time', 'ended_at': 'end_time', 
                                            'start_station_name': 'start_station', 'end_station_name': 'end_station',
                                            'member_casual': 'user_type'})

In [None]:
# Remove leading and trailing spaces from all entries in columns with object dtype

object_cols = bikeshare_df.select_dtypes('object').columns

for col in object_cols:
    bikeshare_df[col] = bikeshare_df[col].str.strip()
    

In [None]:
# Convert trip start and end times to datetime format

bikeshare_df['start_time'] = pd.to_datetime(bikeshare_df['start_time'])
bikeshare_df['end_time'] = pd.to_datetime(bikeshare_df['end_time'])

In [None]:
# Calculate trip duration in minutes

bikeshare_df['trip_duration'] = (bikeshare_df['end_time'] - bikeshare_df['start_time'])/np.timedelta64(1, 'm')

In [None]:
# Keep only trips that have a duration greater than 0

bikeshare_df = bikeshare_df[bikeshare_df['trip_duration'] > 0]

In [None]:
# Extract day of week for every ride

bikeshare_df.loc[:, 'day_of_week'] = bikeshare_df.loc[:, 'start_time'].dt.weekday
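
As a small illustration of what `dt.weekday` returns, the sketch below uses two made-up timestamps. The numbering starts at 0 for Monday, and `dt.day_name()` gives the matching labels if readable names are preferred.

```python
import pandas as pd

# Two made-up start times: a Monday morning and a Saturday afternoon
start_times = pd.Series(pd.to_datetime(["2021-04-05 08:15:00", "2021-04-10 14:30:00"]))

weekday_number = start_times.dt.weekday   # 0 = Monday ... 6 = Sunday
weekday_name = start_times.dt.day_name()  # full English day names
```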

In [None]:
bikeshare_df.head()

Let's fill the missing values in the end_lat and end_lng columns with the average end latitude and longitude of the trips that started at the same station.

In [None]:
# Average end-of-trip latitude and longitude for each start station

coord = bikeshare_df.groupby('start_station')[['end_lat', 'end_lng']].mean()
coord.head()

In [None]:
# Index locations where the dataset is missing end latitude and end longitude information

indices_wo_end_coord = bikeshare_df[bikeshare_df['end_lat'].isnull()].index

In [None]:
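# Fill in missing trip end latitude and longitude with the per-station averages

missing_end_coord = bikeshare_df['end_lat'].isnull()
start_stations = bikeshare_df.loc[missing_end_coord, 'start_station']
# Trips whose start station is missing or absent from coord remain NaN
bikeshare_df.loc[missing_end_coord, 'end_lat'] = start_stations.map(coord['end_lat']).to_numpy()
bikeshare_df.loc[missing_end_coord, 'end_lng'] = start_stations.map(coord['end_lng']).to_numpy()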

## Data Analysis

Let's start our analysis by finding the proportions of annual-member rides and casual rides in the dataset.

In [None]:
# casual rides vs members rides shown as percentage

bikeshare_df.groupby('user_type')['user_type'].count().plot(kind = 'pie', autopct='%1.1f%%')
plt.title('Proportion of casual rides vs members rides')
plt.ylabel("")

**More than 40% of bike trips in the last 12 months were taken by casual riders, which is why converting them to annual members is an attractive way to grow revenue.**

Now we'll have a look at how casual riders differ from annual members by calculating the average trip duration for each type of user.

In [None]:
df1 = bikeshare_df.groupby('user_type')['trip_duration'].agg(['count','min', 'mean', 'max', 'median', 'std'])
df1

In [None]:
# Average trip duration by user type

df1['mean'].plot(kind = 'bar', color = 'g')
plt.xlabel ('Type of user')
plt.ylabel ('Average trip duration in minutes')
plt.title('Average trip duration by user type')

**The average trip duration of casual riders is about 3 times higher than that of annual members. So let's estimate the average distance travelled by each type of customer to see what conclusions we can draw.**

We'll use the Haversine formula to calculate the distance travelled on each trip from its latitude and longitude coordinates. We can already point out one evident shortcoming of this calculated distance: round trips, i.e. trips that start and end at the same station, will account for 0 km travelled.
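
For reference, the Haversine formula can be written as follows, where $\varphi_1, \varphi_2$ are the start and end latitudes and $\lambda_1, \lambda_2$ the start and end longitudes (all in radians), and $R$ is the Earth's radius. Taking $R \approx 6371$ km is an assumption here; it is the commonly used mean radius and matches the constant $12742 = 2 \times 6371$ that appears in the distance calculation in this notebook.

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
$$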

So let us first investigate the proportion of round trips in our dataset.

In [None]:
# Proportion of round trips in the entire dataset

bikeshare_df[bikeshare_df['start_station'] == bikeshare_df['end_station']].shape[0] / bikeshare_df.shape[0]


In [None]:
# Proportion of round trips within casual rides

casual_rides = bikeshare_df[bikeshare_df['user_type']=='casual']

casual_rides[casual_rides['start_station'] == casual_rides['end_station']].shape[0] / casual_rides.shape[0]

In [None]:
# Proportion of round trips within annual members rides

member_rides = bikeshare_df[bikeshare_df['user_type']=='member']

member_rides[member_rides['start_station'] == member_rides['end_station']].shape[0] / member_rides.shape[0]

**About 10% of all rides are round trips, so overall they should not significantly affect the average travelled distance calculation.  
However, the average distance travelled for casual rides will be less accurate than the one for annual members, because round trips account for 16.7% of casual rides compared with 5.2% of annual-member rides.** 

In [None]:
# Calculate travelled distance for each ride in the dataset (using Haversine formula)

p = np.pi/180

delta_lat = (bikeshare_df['end_lat']-bikeshare_df['start_lat'])*p 
delta_lng = (bikeshare_df['end_lng']-bikeshare_df['start_lng'])*p
a = np.sin(delta_lat/2)**2 + np.cos(bikeshare_df['start_lat']*p) * np.cos(bikeshare_df['end_lat']*p) * np.sin(delta_lng/2)**2

bikeshare_df['trip_distance'] = 12742 * np.arcsin(np.sqrt(a))
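
To sanity-check the calculation, the same formula can be wrapped in a small function and tried on points whose distance is easy to reason about. This is a sketch only: the function name `haversine_km` and the constant name are hypothetical, and the Earth radius of 6371 km is inferred from the 12742 used above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; 2 * 6371 = 12742


def haversine_km(start_lat, start_lng, end_lat, end_lng):
    """Great-circle distance in km between points given in decimal degrees."""
    start_lat, start_lng, end_lat, end_lng = map(
        np.radians, (start_lat, start_lng, end_lat, end_lng)
    )
    a = (
        np.sin((end_lat - start_lat) / 2) ** 2
        + np.cos(start_lat) * np.cos(end_lat) * np.sin((end_lng - start_lng) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Identical start and end points give zero distance (the round-trip case)
same_point_km = haversine_km(41.9, -87.6, 41.9, -87.6)

# One degree of latitude along a meridian is roughly 111 km
one_degree_km = haversine_km(0.0, 0.0, 1.0, 0.0)
```

The function also accepts pandas Series or NumPy arrays, so it could be applied to the four coordinate columns directly.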

In [None]:
# Average trip distance by user type

bikeshare_df.groupby('user_type')['trip_distance'].mean().plot(kind = 'bar', color = 'g')
plt.xlabel ('Type of user')
plt.ylabel ('Average trip distance in km')
plt.title('Average distance travelled by user type')

**The average distance travelled by the two user types is approximately the same, which implies that, on average, casual rides happen at a much slower pace. This suggests that casual riders use the service for leisure and tourist activities. On the other hand, annual members use the service for more pragmatic goals like working out or commuting to work.**
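
Since round trips count as 0 km, one way to gauge how much they pull the averages down is to recompute the mean distance on one-way trips only. The sketch below uses a made-up toy frame with the column names from this notebook; it is not a result from the real dataset.

```python
import pandas as pd

# Toy stand-in for bikeshare_df with the columns used in this notebook
toy_df = pd.DataFrame({
    "user_type": ["casual", "casual", "casual", "member", "member"],
    "start_station": ["A", "A", "B", "A", "C"],
    "end_station": ["A", "B", "C", "B", "A"],
    "trip_distance": [0.0, 2.0, 4.0, 2.0, 3.0],
})

# A round trip starts and ends at the same station, so its computed distance is 0 km
is_round_trip = toy_df["start_station"] == toy_df["end_station"]

mean_all = toy_df.groupby("user_type")["trip_distance"].mean()
mean_one_way = toy_df[~is_round_trip].groupby("user_type")["trip_distance"].mean()

comparison = pd.DataFrame({"all_trips": mean_all, "one_way_only": mean_one_way})
print(comparison)
```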

Now we'll explore how the day of the week affects the trips taken by each type of customer.

In [None]:
# Number of rides by user type and day of the week

df2 = bikeshare_df.groupby(['user_type','day_of_week'])['trip_duration'].agg(['count', 'mean', 'median'])

for i in range(7):
    daily_users = df2.loc[('casual', i), 'count'] + df2.loc[('member', i), 'count']
    # Share of each day's rides by user type, expressed in percent
    df2.loc[('casual', i), 'percent'] = 100 * df2.loc[('casual', i), 'count'] / daily_users
    df2.loc[('member', i), 'percent'] = 100 * df2.loc[('member', i), 'count'] / daily_users
df2
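
The per-day shares can also be obtained without an explicit loop. The sketch below uses `pd.crosstab` with `normalize="index"` on a made-up toy frame, so each row (day) sums to 100%. It is an alternative formulation, not the method used in the cell above.

```python
import pandas as pd

# Toy stand-in for bikeshare_df: four rides on Monday (0), two on Saturday (5)
toy_df = pd.DataFrame({
    "user_type": ["casual", "member", "member", "member", "casual", "member"],
    "day_of_week": [0, 0, 0, 0, 5, 5],
})

# Rows are days, columns are user types; normalize="index" divides by each day's total
daily_share = pd.crosstab(toy_df["day_of_week"], toy_df["user_type"], normalize="index") * 100
print(daily_share)
```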

In [None]:
# Number of trips by user type on each day of the week

df2.loc[('casual', range(7)), 'count'].plot(label = 'casual')
df2.loc[('member', range(7)), 'count'].plot(label = 'member')

# The seven points are plotted at x positions 0..6, one per day of the week
plt.xticks(range(7), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.xlabel ('Day of the week')
plt.ylabel ('Number of trips')
plt.ylim(0)
plt.title('Daily number of trips by user type')
plt.legend()

In [None]:
# Percentage of trips by user type on each day of the week

df2.loc[('casual', range(7)), 'percent'].plot(label = 'casual')
df2.loc[('member', range(7)), 'percent'].plot(label='member')

# The seven points are plotted at x positions 0..6, one per day of the week
plt.xticks(range(7), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.xlabel ('Day of the week')
plt.ylabel ('% of rides')
plt.title('Percentage of trips by user type')
plt.ylim(0)
plt.legend()

**The share of trips is fairly steady from Monday to Thursday for both types of users: about 65% for members and 35% for casual users.
On Friday, the number of casual rides increases sharply while the number of member rides moves in the opposite direction.
The figures even out over the weekend (Saturday and Sunday) at approximately 50% for each user category.  
This reinforces the hypothesis that casual riders use the bikes for leisure, tourism and other weekend-related activities, while annual members tend to ride for pre-determined activities.**

Finally, we will look at route preferences, if any.

In [None]:
# Concatenate start_station and end_station into a new column called route

bikeshare_df['route'] = bikeshare_df['start_station'] + '_' + bikeshare_df['end_station']

In [None]:
# Proportion of casual rides with missing or incomplete route information
casual_rides = bikeshare_df[bikeshare_df['user_type']=='casual']
casual_rides[casual_rides['route'].isnull()].shape[0] / casual_rides.shape[0]

In [None]:
# Proportion of annual-member rides with missing or incomplete route information
member_rides = bikeshare_df[bikeshare_df['user_type']=='member']
member_rides[member_rides['route'].isnull()].shape[0] / member_rides.shape[0]

Since the above two proportions are nearly the same, we'll ignore all trips that lack a start station and/or end station name, and determine the proportion of bike trips on the most frequently used routes for each user category.  
We'll define a route as frequently used if it totals at least 500 trips.

In [None]:
# Proportion of rides on most frequently used routes by casual riders

s1 = bikeshare_df[bikeshare_df['user_type'] == 'casual'].groupby('route')['route'].count()
s1[s1 >= 500].sum() / s1.sum()

In [None]:
# Proportion of rides on most frequently used routes by annual members

s2 = bikeshare_df[bikeshare_df['user_type'] == 'member'].groupby('route')['route'].count()
s2[s2 >= 500].sum() / s2.sum()

**Casual rides tend to be more concentrated on the same routes as compared to trips taken by members.**
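
The two proportions above follow the same recipe, so they could be wrapped in one helper to keep the numerator and denominator tied to the same user type. The function name `share_on_frequent_routes` is hypothetical; the 500-trip threshold is the one defined earlier.

```python
import pandas as pd


def share_on_frequent_routes(rides, user_type, min_trips=500):
    """Fraction of a user type's routed trips taken on routes with at least `min_trips` trips."""
    # value_counts() skips NaN routes, i.e. trips missing a start or end station
    route_counts = rides.loc[rides["user_type"] == user_type, "route"].value_counts()
    return route_counts[route_counts >= min_trips].sum() / route_counts.sum()


# Possible usage with the DataFrame built in this notebook:
# share_on_frequent_routes(bikeshare_df, "casual")
# share_on_frequent_routes(bikeshare_df, "member")
```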

## Conclusion

To convert casual riders into annual members, my top 3 recommendations for the marketing campaign, based on the analysis we've conducted, are:

* **Emphasize the annual membership benefits from a leisure and weekend-activity standpoint.**  


* **Propose annual memberships with service options such as weekends (to convert casual riders) and weekdays (to make annual membership even more compelling for current members).**  


* **Offer a one-year membership trial at a prescribed discount, targeting the routes most frequently used by casual riders.**


*Thanks for reading. Your feedback will be enthusiastically taken onboard and highly appreciated.*