# Cyclistic Case Study
### An exploratory look at bike-share data using Python

## About the Data  

Cyclistic is a fictitious company based on a real Chicago bike-share business.  
The data has been anonymized and made available for use under 
[this license](https://www.divvybikes.com/data-license-agreement).

The data in this notebook covers the period from June 2020 through May 2021.  
I received this data as coursework in the Google Data Analytics Certificate program.

## Tools Used

I've chosen to use Python to process this dataset for its ease of use and diverse functionality.  
The Pandas library is efficient in working with our large dataset, and the Matplotlib library allows us to create custom plots from Pandas objects.  
This case study was constructed in a Jupyter Notebook IDE, then manually imported onto Kaggle.

## Our Task

Cyclistic offers nearly 6,000 bicycles that service the Chicago area with a network of approximately 700 docking stations.  
Riders can choose between purchasing a single-ride pass, a full-day pass, or an annual membership.  

Annual members are much more profitable than casual riders who purchase passes.  
Our business objective is to maximize the number of annual memberships by developing a marketing strategy which targets the conversion of casual riders into annual members.  
Our marketing analyst team needs to know how members and casual riders are using the bikes differently.  

## Importing the Data

We receive our data as twelve CSV files, each representing a complete month of Cyclistic rides.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from os import listdir

# changing default size and style of plots
plt.rcParams['figure.figsize'] = [12, 12]
plt.style.use('fivethirtyeight')

# adjusting chunksize to afford matplotlib speed
mpl.rcParams['agg.path.chunksize'] = 10000

# merging all csv files into one pandas DataFrame object
csv_list = []
for file in listdir('../input/cyclistic'):
    temp_df = pd.read_csv(f'../input/cyclistic/{file}')
    csv_list.append(temp_df)
df = pd.concat(csv_list)

# a view of our df
df.info(memory_usage='deep')

At a glance we can see that our new dataframe contains over four million observations and is using 2.4 GB of memory.  
> Index: 4073561 entries  
> memory usage: 2.4 GB

## Processing and Cleaning the Dataset

In order to filter and clean our data I will cast some columns into more memory-efficient datatypes, and I will extract relevant information into new columns for later analysis.

In [None]:
# renaming and converting our 'member_casual' column from a list of strings into a series of boolean values
df.rename(columns = {'member_casual': 'is_member'}, inplace=True)
df.replace({'is_member': {'member': True, 'casual': False}}, inplace=True)

# renaming and converting the 'rideable_type' column from a list of strings into an index [0, 1, 2]
df.rename(columns = {'rideable_type': 'is_ebike'}, inplace=True)
df.replace({'is_ebike': {'classic_bike': 0, 'electric_bike': 1, 'docked_bike': 2}}, inplace=True)

# casting our time data into datetime objects
df['started_at'] = pd.to_datetime(df['started_at'])
df['ended_at'] = pd.to_datetime(df['ended_at'])

# creating new columns from our time data
df['hour'] = df['started_at'].dt.hour
df['ord_day'] = df['started_at'].dt.day_of_year
df['week_day'] = df['started_at'].dt.dayofweek
df['month'] = df['started_at'].dt.month

df['ride_minutes'] = (df['ended_at'] - df['started_at']).dt.total_seconds() / 60
    # the difference is a timedelta, so we convert it to seconds and divide by 60 to return minutes
    
df['ride_km'] = np.sqrt( ((df['end_lat'] - df['start_lat'])**2) + ((df['end_lng'] - df['start_lng'])**2) )
    # straight-line distance between two coordinates, measured in degrees:
    # sqrt( (x2-x1)^2 + (y2-y1)^2 ) 
df['ride_km'] = df['ride_km'] * 111
    # 111 is an approximate factor for converting degrees of lat/long separation into kilometers
    # (a degree of longitude is shorter than this at Chicago's latitude, so east-west distances are overstated)

# checking our new dataframe for null values
df.isnull().sum()
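
The distance column above treats degrees of latitude and longitude as equal-length units. As a point of comparison, here is a minimal sketch of the haversine (great-circle) formula, which is not what this notebook uses. The function name `haversine_km`, the constant `EARTH_RADIUS_KM`, and the sample coordinates are assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# one degree of latitude vs. one degree of longitude near Chicago
north_south = haversine_km(41.0, -87.6, 42.0, -87.6)
east_west = haversine_km(41.9, -88.0, 41.9, -87.0)
print(round(north_south, 1), round(east_west, 1))
```

Because the function is built from NumPy operations, it should also accept whole pandas columns in place of the scalar arguments. The gap between the two printed values gives a sense of how much the flat factor of 111 can overstate east-west trips.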

A count of null values in our dataset shows that we are missing information for some columns.  

Another important takeaway is that there are many columns in which we are not missing any data.  
Most notably, all of our time-related data is complete.  
Let's take a closer look at the distribution of total ride times (in minutes) across our dataframe:

In [None]:
# statistics that summarize the shape of our dataset
df['ride_minutes'].describe(include='all')

Taking a glimpse at the length of our rides, we learn that, surprisingly, our shortest ride lengths are of negative value,  
> min     -2.904997e+04  

and our longest ride lengths are over five weeks!
> max      5.428335e+04

We can speculate that these discrepancies may be due to errors in the algorithm which calculates start/end time, faulty mechanical components, or offline bicycles. 

For the purpose of this analysis I decided that any ride length shorter than one minute or longer than four hours should be excluded as outlying data.  
What proportion of our dataset would this be trimming?
  

In [None]:
# percent of rides shorter than one minute or longer than four hours
drop_filter = (df['ride_minutes'] < 1) | (df['ride_minutes'] > 240)
drop_filter.value_counts(normalize=True)

We see that only 2.12% of our rides fall outside of these parameters, so I proceed to remove this data.

In [None]:
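# dropping bad data with a boolean mask
# (row labels repeat across the concatenated monthly files, so dropping by label would remove extra rows)
df = df.loc[~drop_filter]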

In [None]:
# a new search for null values
df.isnull().sum()

Trimming the 'ride_minutes' outliers did remove some of the missing values from our dataframe.  
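
One detail of the trimming step is worth illustrating. The dataframe was assembled with `pd.concat` without resetting the index, so each monthly file contributes its own row labels starting from zero. The sketch below uses two tiny made-up frames (`june`, `july`, and `toy` are hypothetical names) to show how removing rows by label and keeping rows by boolean mask can differ in that situation.

```python
import pandas as pd

# two toy "monthly" frames, each with its own 0..n row labels
june = pd.DataFrame({'ride_minutes': [0.5, 12.0, 30.0]})
july = pd.DataFrame({'ride_minutes': [8.0, 500.0, 15.0]})
toy = pd.concat([june, july])          # labels are now 0, 1, 2, 0, 1, 2

bad = (toy['ride_minutes'] < 1) | (toy['ride_minutes'] > 240)

# dropping by label removes every row that shares a flagged label
by_label = toy.drop(toy.index[bad.to_numpy()])
# keeping rows by boolean mask removes only the flagged rows
by_mask = toy[~bad]

print(len(by_label), len(by_mask))
```

Only two of the six toy rides are out of range, yet the label-based drop discards four rows. Passing `ignore_index=True` to `pd.concat` is another way to avoid repeated labels.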

I have chosen not to remove observations where we are missing start/end station names and ids.  
The approximately 5% of records this would affect are still meaningful to include in our analysis.  

Next let's remove columns that we no longer have use for.

In [None]:
df.drop(columns=['ride_id',
                 'start_station_id', 'end_station_id',
                 'started_at', 'ended_at',
                 'start_lat', 'start_lng',
                 'end_lat', 'end_lng'], inplace=True)

We can now round our floating-point values to two decimal places and cast our columns into more memory-efficient datatypes.

In [None]:
df = df.round(2)

df['is_ebike'] = df['is_ebike'].astype('int8')
df['ride_minutes'] = df['ride_minutes'].astype('float32')
df['month'] = df['month'].astype('int8')
df['ord_day'] = df['ord_day'].astype('int16')
df['week_day'] = df['week_day'].astype('int8')
df['hour'] = df['hour'].astype('int8')
df['ride_km'] = df['ride_km'].astype('float32')

df.info(memory_usage='deep')

A look at our cleaned dataframe shows that we have significantly reduced its size from 2.4 GB to 585.3 MB.
> memory usage: 585.3 MB

Our data is now flexible, relevant, and ready to tell a fascinating story.
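
As a rough illustration of why narrower datatypes shrink a dataframe, the sketch below builds a small synthetic frame and measures it before and after casting. The station names are invented, and the `category` cast is an extra option that this notebook does not apply.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    'hour': rng.integers(0, 24, n),   # stored as 64-bit integers by default
    'start_station_name': rng.choice(['Station A', 'Station B', 'Station C'], n),
})
before = toy.memory_usage(deep=True).sum()

slim = toy.copy()
slim['hour'] = slim['hour'].astype('int8')   # values 0-23 fit in one byte
slim['start_station_name'] = slim['start_station_name'].astype('category')
after = slim.memory_usage(deep=True).sum()

print(before, after)
```

The values themselves are unchanged; only their storage is. Repeated strings such as station names tend to benefit most, since a categorical column stores each distinct name once.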

## Analysis

Our task for this assignment is to maximize new memberships by converting casual riders and to understand how each group uses the bicycles differently.  
We can begin by creating filters that segregate our data by membership type.

In [None]:
casuals = df[df['is_member']==False]
members = df[df['is_member']==True]

Next, we ask questions which our data can answer...  

How is the frequency of rides influenced by the time of day?

In [None]:
plt.figure(figsize=(10,10))
members['hour'].value_counts().sort_index().plot.barh(color='#f49264', align='edge', width=-.5, linewidth=.5, edgecolor='k')
casuals['hour'].value_counts().sort_index().plot.barh(color='#69a6d1', align='edge', width=.5, linewidth=.5, edgecolor='k')
plt.yticks([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23], 
           ['12am','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am','12pm','1pm','2pm','3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm'], rotation=0)
plt.xlabel('Total Rides')
plt.ylabel('Hour')
plt.title('Total Rides by Hour')
plt.legend(['Member', 'Casual'])
plt.show()

We can see that peak hours for casual riders are typically between 11am and 7pm, with the highest number of rides beginning in the 5pm hour.
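
The busiest hour can also be read off numerically rather than from the chart. The following is a minimal sketch on a made-up frame (`toy` is hypothetical and keeps the raw `member_casual` labels for readability) that counts rides per hour for each rider type and finds each group's peak.

```python
import pandas as pd

toy = pd.DataFrame({
    'member_casual': ['member', 'member', 'member', 'casual', 'casual', 'casual', 'casual'],
    'hour':          [8,        17,       17,       13,       17,       17,       17],
})

# rows: hour of day, columns: rider type, values: number of rides
rides_by_hour = toy.groupby(['hour', 'member_casual']).size().unstack(fill_value=0)

# hour with the most ride starts for each rider type
peak_hours = rides_by_hour.idxmax()

print(rides_by_hour)
print(peak_hours)
```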

Are there trends to explore when considering the day of the week?

In [None]:
plt.figure(figsize=(10,6))
members['week_day'].value_counts().sort_index().plot.bar(color='#f49264', align='edge', width=-.4, linewidth=1, edgecolor='k')
casuals['week_day'].value_counts().sort_index().plot.bar(color='#69a6d1', align='edge', width=.4, linewidth=1, edgecolor='k')
plt.xticks([-1,0,1,2,3,4,5,6], ['', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], rotation=60)
plt.ylabel('Total Rides')
plt.title('Total Rides by Day')
plt.legend(['Member', 'Casual'])
plt.show()

It appears that our members are consistently riding throughout the week, while casual riders exhibit a strong preference for riding on Saturdays and Sundays.
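
To put a number on the weekend preference, one option is to compute the share of each group's rides that start on a Saturday or Sunday. This sketch uses invented data; `toy`, `is_weekend`, and `weekend_share` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    'member_casual': ['member'] * 5 + ['casual'] * 5,
    'week_day':      [0, 1, 2, 3, 5, 5, 5, 6, 6, 2],   # 0 = Monday ... 6 = Sunday
})

# True for rides that begin on Saturday (5) or Sunday (6)
toy['is_weekend'] = toy['week_day'].isin([5, 6])

# the mean of a boolean column is the fraction of True values
weekend_share = toy.groupby('member_casual')['is_weekend'].mean()
print(weekend_share)
```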

What can we say about how the time of year affects ridership?

In [None]:
plt.figure(figsize=(10,6))
members['month'].value_counts().sort_index().plot(color='#f49264')
casuals['month'].value_counts().sort_index().plot(color='#69a6d1')
plt.xticks([1,2,3,4,5,6,7,8,9,10,11,12],
           ['January','February','March','April','May','June','July','August','September','October','November','December'],
           rotation=60)
plt.xlabel('Month')
plt.ylabel('Total Rides')
plt.title('Total Rides by Month')
plt.legend(['Member', 'Casual'])
plt.annotate('unexpected\n  decrease\nin ridership',
             xy=(6.3,145000), xycoords='data', xytext=(7.2,88000),
            arrowprops={'facecolor': 'black'})
plt.show()

While it seems that rides are increasing throughout the summer, we see an unexpected decline in the month of June.  

Let's take a closer look at the rides from May through July by plotting the total rides per day.

In [None]:
# creating boolean masks where 'month' is May, June, or July
casual_summer_days = casuals['month'].between(5, 7)
member_summer_days = members['month'].between(5, 7)

# filtering each membership type by the aforementioned masks
casual_summer_rides = casuals.loc[casual_summer_days]
member_summer_rides = members.loc[member_summer_days]

plt.figure(figsize=(14,6))
member_summer_rides['ord_day'].value_counts().sort_index().plot(kind='bar', color='#f49264', linewidth=.4, edgecolor='k')
casual_summer_rides['ord_day'].value_counts().sort_index().plot(kind='bar', color='#69a6d1', linewidth=.4, edgecolor='k')
plt.xticks([0, 31, 62],['May','June','July'], rotation=60)
plt.xlabel('Days')
plt.ylabel('Rides per Day')
plt.title('Ridership Type by Day')
plt.legend(['Member', 'Casual'])
plt.show()

As expected, we see notably fewer rides in the month of June. But what was the cause of this sudden decline?  

The answer becomes evident when we ask if there were any days in the year where service was completely suspended.

In [None]:
# a set of all ordinal calendar days in 2020 (a leap year)
days_in_2020 = set(range(1, 367))

# removing every day on which at least one ride began
days_in_2020 = sorted(days_in_2020 - set(df['ord_day'].unique()))

# the resultant list shows only days for which there is no ride data
print(days_in_2020)

We can see that there are no rides in our database for the 152nd, 153rd, and 154th ordinal days of the year. These three days in 2020 were May 31, June 1, and June 2.  
These dates correspond to the arrival of the National Guard, the declaration of a curfew, and the halting of public transit systems in Chicago in response to nationwide protests over the murder of George Floyd.  
The protests would continue for weeks in the downtown area and undoubtedly impact ridership.  
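
The conversion from ordinal days to calendar dates can be checked with the standard library. In this sketch `ordinal_to_date` is a hypothetical helper, and `observed_days` is a stand-in for the distinct values of `df['ord_day']`. An ordinal day alone does not identify the year, so the sketch assumes 2020, as the discussion above does.

```python
from datetime import date, timedelta

def ordinal_to_date(year, ord_day):
    """Convert a 1-based day-of-year into a calendar date."""
    return date(year, 1, 1) + timedelta(days=ord_day - 1)

# stand-in for set(df['ord_day'].unique()): every day except the three gaps
observed_days = set(range(1, 367)) - {152, 153, 154}

# ordinal days with no rides, then their calendar dates in 2020
missing_days = sorted(set(range(1, 367)) - observed_days)
missing_dates = [ordinal_to_date(2020, d) for d in missing_days]
print(missing_dates)
```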
  
What correlations exist between distance travelled and duration of ride?

In [None]:
fig, (time, kms) = plt.subplots(2,1, figsize=(8,8))

avg_mem_mins = members['ride_minutes'].mean()
avg_cas_mins = casuals['ride_minutes'].mean()
time.barh(['Casual','Member'],[avg_cas_mins,avg_mem_mins], color=['#69a6d1', '#f49264'], linewidth=1, edgecolor='k')
time.title.set_text('Average Ride Minutes')
time.xaxis.set_ticks_position('top')

avg_mem_km = members['ride_km'].mean()
avg_cas_km = casuals['ride_km'].mean()
kms.barh(['Casual','Member'],[avg_cas_km,avg_mem_km], color=['#69a6d1', '#f49264'], linewidth=1, edgecolor='k')
kms.title.set_text('Average Ride Kilometers')
kms.xaxis.set_ticks_position('top')

We can see that the casual riders use the bikes for a considerably longer duration.
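
Averages can be pulled upward by a handful of very long rides, so it may be worth looking at the median alongside the mean. The sketch below does this on invented numbers; the frame `toy` and its values are purely illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'member_casual': ['casual'] * 4 + ['member'] * 4,
    'ride_minutes':  [10.0, 20.0, 30.0, 200.0, 8.0, 10.0, 12.0, 14.0],
    'ride_km':       [1.0, 2.0, 2.5, 3.0, 1.5, 2.0, 2.5, 3.0],
})

# mean and median of duration and distance for each rider type
summary = toy.groupby('member_casual')[['ride_minutes', 'ride_km']].agg(['mean', 'median'])
print(summary)
```

In this toy data a single 200-minute ride lifts the casual mean well above the casual median, which is the kind of skew worth checking for in the real ride durations.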

Let's examine what preferences riders have for bicycle type.

In [None]:
# 0 - 'classic_bike'
# 1 - 'electric_bike'
# 2 - 'docked_bike' (no data about bicycle type)
df['is_ebike'].value_counts(normalize=True)

It is important to notice that for over 58% of our data we do not have any indication of whether a bike is electric or classic.  
Having taken this into account, we may still glean useful insight from analyzing trends where we do have bike type data.

In [None]:
fig, (cas_type, mem_type) = plt.subplots(2,1, figsize=(14,14))
fig.suptitle('Bike Type Preference', fontsize=24)

cas_type.pie(casuals['is_ebike'].value_counts(normalize=True),
        colors=['#306f9c','#69a6d1','#b1d1e7'],
        labels=['No Data','Electric','Classic'],
        shadow=True, explode=[0, 0.05, 0.05], startangle=90, 
        wedgeprops={'edgecolor':'black', 'linewidth':1},
        textprops={'size':18, 'color':'#0c1c27', 'family':'fantasy'},
       autopct='%1.2f%%')
cas_type.title.set_text('Casuals')

mem_type.pie(members['is_ebike'].value_counts(normalize=True),
        colors=['#d74e0f','#f49264','#facdb7'],
        labels=['No Data','Classic','Electric'],
        shadow=True, explode=[0, 0.05, 0.05], startangle=90, 
        wedgeprops={'edgecolor':'black', 'linewidth':1},
        textprops={'size':18, 'color':'#0c1c27', 'family':'fantasy'},
       autopct='%1.2f%%')
mem_type.title.set_text('Members')

When the data is available, it appears that our casual riders do prefer the electric bikes.
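
A tabular version of this comparison keeps each share attached to its label. This is a minimal sketch with made-up rides: the `bike_names` mapping mirrors the 0/1/2 encoding used earlier, and rides with no bike-type information are set aside before computing shares.

```python
import pandas as pd

bike_names = {0: 'classic', 1: 'electric', 2: 'docked (no type data)'}
toy = pd.DataFrame({
    'member_casual': ['casual', 'casual', 'casual', 'member', 'member', 'member', 'member'],
    'is_ebike':      [1,        1,        0,        0,        0,        1,        2],
})

# keep only rides where the bike type is known
known = toy[toy['is_ebike'] != 2]

# each row sums to 1: share of classic vs. electric rides per rider type
shares = pd.crosstab(known['member_casual'],
                     known['is_ebike'].map(bike_names),
                     normalize='index')
print(shares)
```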

Where are the most popular stations for our casual riders located?

In [None]:
# top 10 most common stations for casuals to start a ride
casuals['start_station_name'].value_counts().head(10)

In [None]:
# top 10 most common stations for members to start a ride
members['start_station_name'].value_counts().head(10)

We can see that the stations frequented by our casual riders include parks, theaters, and aquariums - while the stations most visited by members favor city streets.  


While it's helpful to know where our most frequented stations are throughout the year, we must also consider that our advertisement strategy will be more impactful if it targets the highest concentration of casual riders.  
As we discovered previously in our analysis, this comes between the hours of 11am-7pm, on Saturdays and Sundays, in the summer months.

In [None]:
# target_filter is a boolean mask which will then be passed into the .loc indexer
# filtering by time of day, day of week, and month
target_filter = (casuals['hour'].between(11, 19)
                 & casuals['week_day'].isin([5, 6])
                 & casuals['month'].between(5, 9))

target_stations = casuals.loc[target_filter]

target_stations['start_station_name'].value_counts().head(10)

The result is a similar but not identical list of stations. This is precisely the information we need to launch a targeted ad campaign which will have the highest exposure to casual riders.

## Review

Our journey through the Cyclistic bike-share data showed us how annual members and casual riders use the service differently, and we were able to garner some delightful insights along the way.  

We discovered several trends and relationships which aid us in answering our primary business question: how do we increase annual memberships by targeting casual riders?

The principal differences between members and casual riders can be summarized with the following visualization:

In [None]:
# creating the axes first so both scatter plots share one figure of the intended size
fig, ax = plt.subplots(figsize=(12,12))
casuals.plot(x='ride_minutes', y='ride_km', kind='scatter', legend=True, color='#69a6d1', ax=ax, alpha=.5, marker='.')
members.plot(x='ride_minutes', y='ride_km', kind='scatter', legend=True, color='#f49264', ax=ax, alpha=.5, marker='.')
plt.xlabel('Duration (minutes)')
plt.ylabel('Distance (km)')
plt.title('All Rides by Time and Distance')
plt.legend(['Casual', 'Member'])
plt.show()

Casual riders are using the bikes for *significantly longer ride times*, even when distance travelled is relatively minimal.  

Our earlier analysis also indicated that casual riders favor unlocking bikes at *recreational and commercial locations*.

This suggests that our target audience is using the bikes for *tourism, sightseeing, and leisurely rides*.  

Furthermore we determined the highest capture of casual riders exists *between the hours of 11am - 7pm*, especially on *Saturdays and Sundays*, and most definitely *during the months of May - September*.

To this point, we have procured a *list of target stations* at, or around, which our ad campaign would be most effectively deployed.

## Next Steps

Further analysis could help improve membership conversion.  
Additional data would be needed to expand on these findings, preferably:
* Tracking individual customer usage patterns.
* Year over year comparison.
* Cost of membership types.
* Data from competitors.

Having shared these insights with the Cyclistic Marketing Analysis team, the next task is to approve and implement the new ad strategy.  
Pending group consensus, the analysis would be presented to the Director of Marketing and brought before corporate stakeholders for approval.

## Conclusion

This exploratory look at a practical dataset was an exercise that immersed me in the world of data analytics.  
I was challenged to put into practice the tools and methods I studied in the Google Data Analytics Certificate course, and along the way I discovered many new avenues on which to unleash my curiosity.  
Both the most difficult and the most rewarding aspect of studying Computer Science is the depth of the field, in which I have always found encouragement to grow and a challenge for tomorrow.