#### This small project is conducted in partial fulfillment of the Google Data Analytics Certificate.

#### Background:

Cyclistic is a fictitious bike-share company in Chicago. Its marketing department wants to study how casual riders and annual members behave in order to convert more casual riders into members. Based on these customer insights, the team will devise a suitable marketing strategy.

#### Dataset:
Cyclistic dataset (week 2, Google Data Analytics Capstone: Complete a case study)

https://www.divvybikes.com/data-license-agreement

#### 1. Import relevant packages

In [None]:
import pandas as pd
import numpy as np

#### 2. Aggregate all monthly files into a single dataframe, calculate trip length and create the 'day_of_week' column

In [None]:
import glob

path = r'/kaggle/input/cyclistic' 
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col = None, header = 0)
    li.append(df)

df_trips = pd.concat(li, axis = 0, ignore_index = True)
df_trips

In [None]:
#Check the descriptive statistics
df_trips.describe(include="all")

In [None]:
#Check the data types of all columns
df_trips.dtypes

In [None]:
#Transform data type of starting time and ending time to datetime

df_trips['started_at'] = pd.to_datetime(df_trips['started_at'])
df_trips['ended_at'] = pd.to_datetime(df_trips['ended_at'])

#Calculate trip length and store the values in the 'diff' Series
diff = df_trips['ended_at'] - df_trips['started_at']
diff

In [None]:
#As the 'diff' Series shows, trip lengths in this dataset do not exceed 24 hours and often hover around half an hour.
#Trip length is therefore measured in hours in the following calculation and stored in 'hour_length'

df_trips['trip_length'] = np.array((diff.dt.components.days*24 + diff.dt.components.hours).astype(str).str.zfill(2) + ':' + diff.dt.components.minutes.astype(str).str.zfill(2) + ':' + diff.dt.components.seconds.astype(str).str.zfill(2))

df_trips['trip_length'] = pd.to_datetime(df_trips['trip_length'], errors = 'coerce').dt.time

df_trips['hour_length'] = np.array((diff.dt.components.days*24 + diff.dt.components.hours) + diff.dt.components.minutes/60 + diff.dt.components.seconds/3600)

df_trips
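
As a cross-check on the duration arithmetic, the sketch below computes trip length in hours in two ways: by summing the `Timedelta` components, and with `Series.dt.total_seconds()`. The timestamps are toy values invented for illustration (they are not taken from the Cyclistic files), and the names `toy`, `toy_diff`, `hours_from_components` and `hours_from_seconds` are hypothetical. The two results should agree, including for a flawed record whose end time precedes its start time.

```python
import numpy as np
import pandas as pd

# Toy timestamps, invented for illustration only (not from the Cyclistic data)
toy = pd.DataFrame({
    "started_at": pd.to_datetime(["2021-06-01 08:00:00", "2021-06-01 23:30:00", "2021-06-02 10:00:00"]),
    "ended_at": pd.to_datetime(["2021-06-01 08:30:00", "2021-06-02 01:00:00", "2021-06-02 09:45:00"]),
})
toy_diff = toy["ended_at"] - toy["started_at"]

# Option 1: sum the day, hour, minute and second components
comp = toy_diff.dt.components
hours_from_components = comp.days * 24 + comp.hours + comp.minutes / 60 + comp.seconds / 3600

# Option 2: convert the total number of seconds to hours
hours_from_seconds = toy_diff.dt.total_seconds() / 3600

# Both options give the same values; the third record is negative (end before start)
print(pd.DataFrame({"components": hours_from_components, "total_seconds": hours_from_seconds}))
print(np.allclose(hours_from_components, hours_from_seconds))
```

The `total_seconds()` form is shorter and also keeps sub-second precision, which the component sum above ignores.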

In [None]:
#Create a column which indicates the day of the week
#We will use this column later to plot the trends of trip length and rideable types for casual riders and member riders

df_trips['started_at'] = pd.to_datetime(df_trips['started_at'])
df_trips['day_of_week'] = df_trips['started_at'].dt.dayofweek

days = {0:'Mon',1:'Tues',2:'Wed',3:'Thur',4:'Fri',5:'Sat',6:'Sun'}
df_trips['day_of_week'] = df_trips['day_of_week'].apply(lambda x: days[x])

df_trips

In [None]:
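#Eliminate all flawed records whose hour_length is less than or equal to 0
#(a strict '> 0' comparison is used; unary minus on a numeric column negates the values instead of inverting a mask)

df_trips = df_trips[df_trips['hour_length'] > 0].copy()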

df_trips

In [None]:
#Extract the start date of each ride
df_trips['date_start'] = df_trips['started_at'].dt.date
df_trips

#### 3. Visualisation to find trends and patterns

In [None]:
#Import graphing packages

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style = "darkgrid", rc = {"figure.figsize": (16, 8)})
plt.rcParams['figure.dpi'] = 360

#### Trends of ride counts and ride length in the year

In [None]:
#Trends in number of rides during the year

#Create a table demonstrating the number of rides based on date_start and membership status ('member_casual')
df_trips_copy = df_trips.copy()
df_trips_copy['count'] = 1
lt_trend_1 = df_trips_copy.groupby(['date_start','member_casual'], as_index=False)['count'].sum()

#Graph
#In the graph, 'count' on y-axis indicates the number of rides

plt.figure(figsize=(20,5))
lt_plot_1 = sns.lineplot(x='date_start', y='count', hue='member_casual', palette = ['m', 'g'], data=lt_trend_1)
lt_plot_1.set(title = 'Number of rides in a year')

In [None]:
#Trends in mean ride length during the year

#Create a table showing the mean ride length (in hours) by date_start and membership status ('member_casual')

lt_trend_2 = df_trips_copy.groupby(['date_start','member_casual'], as_index=False)['hour_length'].mean()

#Graph
#In the graph, 'hour_length' on the y-axis indicates the mean trip length in hours

plt.figure(figsize=(20,5))
lt_plot_2 = sns.lineplot(x='date_start', y='hour_length', hue='member_casual', palette = ['m', 'g'], data=lt_trend_2)
lt_plot_2.set(title = 'Mean of ride length in a year')

Patterns:
- Ride counts tend to increase during the warm period of the year for both casual and member riders.
- Daily ride counts vary less for member riders than for casual riders.
- On average, casual rides last longer than member rides.
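
The second pattern can be put into numbers rather than read off the plot. The sketch below is one possible way, assuming a table shaped like `lt_trend_1` (columns `date_start`, `member_casual`, `count`). The counts are toy values invented for illustration, and `toy_counts` and `spread` are hypothetical names. It compares the coefficient of variation (standard deviation divided by mean) of daily ride counts between the two groups.

```python
import pandas as pd

# Toy daily ride counts, invented for illustration; same column names as lt_trend_1
toy_counts = pd.DataFrame({
    "date_start": pd.to_datetime(["2021-01-15", "2021-01-15", "2021-04-15",
                                  "2021-04-15", "2021-07-15", "2021-07-15"]),
    "member_casual": ["casual", "member"] * 3,
    "count": [100, 900, 1000, 1500, 4000, 2400],
})

# Mean and sample standard deviation of daily ride counts per rider type
spread = toy_counts.groupby("member_casual")["count"].agg(["mean", "std"])

# Coefficient of variation: spread relative to the group's own average level
spread["cv"] = spread["std"] / spread["mean"]
print(spread)
```

Using a relative measure matters here because the two groups can have different average ride counts, so raw standard deviations alone are not directly comparable.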

Next, we investigate how trip length varies across days of the week and rider types.

#### Riding behavior patterns of riders

In [None]:
#Distribution of trip length (hour_length) of the whole sample

#Set the order of categorical variables
df_trips['day_of_week'] = pd.Categorical(df_trips['day_of_week'], categories=['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'], ordered=True)

#Graph
sum_plot = sns.catplot(x = 'day_of_week', y = 'hour_length', kind = 'boxen', palette = 'Pastel1', data = df_trips)
sum_plot.set(ylim=(0, 2))
sum_plot.set(title = 'Distribution of trip length')

In [None]:
#Distribution of trip length (hour_length) of riders with membership only

#Extract 'member' type only to generate a plot
df_member = df_trips[df_trips['member_casual']=='member'].copy()

#Order of categorical values
df_member['day_of_week'] = pd.Categorical(df_member['day_of_week'], categories=['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'], ordered=True)

#Graph
member_plot = sns.catplot(x = 'day_of_week', y = 'hour_length', kind = 'boxen', palette = 'pastel', data = df_member)
member_plot.set(ylim=(0, 2))
member_plot.set(title = 'Distribution of trip length, member riders')

In [None]:
#Distribution of trip length (hour_length) of casual riders

#Extract 'casual' type only to generate a plot
df_casual = df_trips[df_trips['member_casual']=='casual'].copy()

#Order of categorical values
df_casual['day_of_week'] = pd.Categorical(df_casual['day_of_week'], categories=['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'], ordered=True)

#Graph
casual_plot = sns.catplot(x = 'day_of_week', y = 'hour_length', kind = 'boxen', palette = 'pastel', data = df_casual)
casual_plot.set(ylim=(0, 2))
casual_plot.set(title = 'Distribution of trip length, casual riders')

In [None]:
comp_plot1 = sns.catplot(x='day_of_week', y='hour_length', kind = 'boxen', hue='member_casual', palette=['m', 'g'], data=df_trips)
comp_plot1.set(ylim=(0, 2))
comp_plot1.set(title = 'Trip length of different membership status categories')

In [None]:
comp_plot2 = sns.catplot(x = 'day_of_week', kind = 'count', hue = 'member_casual', palette = ['m', 'g'], data=df_trips)
comp_plot2.set(title = 'Number of trips initiated by casual and member riders, categorised by day of week')

Patterns:
- Among casual riders, rides are longer on weekends than on weekdays.
- Ride length is less dispersed among member riders than among casual riders.
- Casual riders' longer rides suggest that they tend to borrow bikes for trips they expect to be longer than usual.
- Member riders' ride length stays fairly stable throughout the week.
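
These patterns are read off the boxen plots. A numeric summary can back them up. The sketch below is one way to do it, assuming a table with the same `day_of_week`, `member_casual` and `hour_length` columns as `df_trips`. The values are toy data invented for illustration, and `toy_trips` and `summary` are hypothetical names. It reports the median and interquartile range (IQR) of ride length per rider type and day.

```python
import pandas as pd

# Toy ride lengths in hours, invented for illustration; same column names as df_trips
toy_trips = pd.DataFrame({
    "day_of_week": ["Mon"] * 6 + ["Sat"] * 6,
    "member_casual": (["casual"] * 3 + ["member"] * 3) * 2,
    "hour_length": [0.3, 0.4, 0.5, 0.2, 0.2, 0.3,
                    0.5, 0.8, 1.4, 0.2, 0.3, 0.3],
})

# Quartiles of ride length for every (rider type, day) combination
grouped = toy_trips.groupby(["member_casual", "day_of_week"])["hour_length"]
summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q25", "median", "q75"]

# IQR as a robust measure of dispersion (less sensitive to very long rides)
summary["iqr"] = summary["q75"] - summary["q25"]
print(summary)
```

Medians and IQRs are used instead of means and standard deviations because ride lengths are right-skewed, which is also why the plots above cap the y-axis at 2 hours.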

#### Preference in rideable bikes

In [None]:
comp_plot3 = sns.catplot(x = 'day_of_week', kind = 'count', hue = 'rideable_type', palette = 'Pastel1', data=df_trips)
comp_plot3.set(title = 'Rideable type use, categorised by day of week')

In [None]:
rideable_mem = sns.catplot(x = 'day_of_week', kind = 'count', hue = 'rideable_type', palette = 'Pastel1', data=df_member)
rideable_mem.set(title = 'Rideable type use of members, categorised by day of week')

In [None]:
rideable_casual = sns.catplot(x = 'day_of_week', kind = 'count', hue = 'rideable_type', palette = 'Pastel1', data=df_casual)
rideable_casual.set(title = 'Rideable type use of casual riders, categorised by day of week')

Patterns:
- Docked bikes are the most commonly borrowed rideable type.
- (Needs testing) Casual riders might be indifferent between electric bikes and classic bikes.
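
One possible way to run the test flagged in the second pattern is a chi-square goodness-of-fit test on casual riders' classic and electric rides. The sketch below rests on several assumptions: "indifference" is modelled as an equal 50/50 split between the two types, the labels `classic_bike`, `electric_bike` and `docked_bike` are assumed values of `rideable_type`, and the counts are toy data invented for illustration. The names `toy_casual`, `two_types`, `observed` and `expected` are hypothetical. With one degree of freedom, the p-value of the chi-square statistic equals `erfc(sqrt(chi2 / 2))`, so only the standard library is needed beyond pandas.

```python
import math
import pandas as pd

# Toy rides by casual riders, invented for illustration (labels are assumed)
toy_casual = pd.DataFrame({
    "rideable_type": ["classic_bike"] * 520 + ["electric_bike"] * 480 + ["docked_bike"] * 700
})

# Keep only the two bike types being compared
two_types = toy_casual[toy_casual["rideable_type"].isin(["classic_bike", "electric_bike"])]
observed = two_types["rideable_type"].value_counts()

# Null hypothesis (an assumption): rides split equally between the two types
expected = observed.sum() / 2

# Chi-square goodness-of-fit statistic with 1 degree of freedom
chi2 = float((((observed - expected) ** 2) / expected).sum())

# Survival function of the chi-square distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")
```

A large p-value would mean the data are compatible with an equal split. It would not prove indifference, since the number of bikes of each type available at stations also shapes what riders pick.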
