**Introduction**:

We are going to analyze Cyclistic data for the capstone project of the Google Data Analytics certificate. For this task, we will follow the six phases of the data analysis process: Ask, Prepare, Process, Analyze, Share and Act.

**Scenario**:
The director of marketing of Cyclistic, Lily Moreno, believes that the company’s future growth depends on maximizing the number of annual memberships. Hence, the marketing analyst team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the analytics team can design a new marketing strategy to convert casual riders into annual members.

Three questions will guide the future marketing campaign:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?

For this analysis, we are going to focus on the first question: **How do annual members and casual riders use Cyclistic bikes differently?**

**Phase 1: ASK**

**Business Task:**
Analyze the past 12 months of trip data and identify how annual members and casual riders differ in their use of Cyclistic bikes.

Are annual members much more profitable than casual riders? If so, design a marketing strategy or campaign that helps convert casual riders into annual members.
 
**Time period for consideration:**
 January 2021 to December 2021
 
 
**Key Stakeholders:**

**Primary Stakeholders:**
- Lily Moreno, Director of Marketing
- Cyclistic Executive Team
    
**Secondary Stakeholders:**
- Marketing Analytics Team


**Phase 2: PREPARE**

**Data Source:** 
The past 12 months of original bike-share data, from 01/01/2021 to 31/12/2021, were extracted as 12 zipped .csv files.
The source data is publicly available in an [AWS S3 bucket](https://divvy-tripdata.s3.amazonaws.com/index.html).
The data is made available and licensed by Motivate International Inc. under this [license](https://ride.divvybikes.com/data-license-agreement).

For this analysis, we used the Kaggle dataset **divvy-tripdata-2021** by JORGE AREVALO LA ROSA JUNIOR.

**Data credibility:**
The data set is reliable: it is complete and accurate for the chosen time window. The data is original, first-party information, and it is comprehensive, current and cited.

**Data integrity:**
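All 12 datasets consist of 13 columns, and each column holds the expected kind of data. The two timestamp columns are read from the .csv files as text and are converted to datetime in the Process phase.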
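
The column check described above could be automated along these lines. This is only a minimal sketch: the helper `check_schema` and the two tiny frames are made up for illustration and stand in for the real monthly files.

```python
import pandas as pd

def check_schema(frames):
    """Return the names of frames whose columns differ from the first frame's columns.

    `frames` is a dict mapping a label (for example a month) to a DataFrame.
    This helper is hypothetical and not part of the original notebook.
    """
    names = list(frames)
    reference = list(frames[names[0]].columns)
    return [name for name in names if list(frames[name].columns) != reference]

# Tiny made-up frames standing in for two monthly files
frames = {
    "202101": pd.DataFrame({"ride_id": ["a"], "member_casual": ["member"]}),
    "202102": pd.DataFrame({"ride_id": ["b"], "member_casual": ["casual"]}),
}

mismatched = check_schema(frames)
print(mismatched)  # an empty list means every frame has the same columns
```

With the real data, the dict would be built from the 12 frames returned by `pd.read_csv`.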

**Data privacy:**
The company is licensed to use the data. Riders' personally identifiable information (PII) is hidden through tokenisation.



**Phase 3: PROCESS**

In this stage, we are going to read, clean and organize the data so that it is ready for analysis.


In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [4]:
# First, we combine all 12 monthly datasets into one DataFrame to ease our workflow

datasets = '''../input/divvytripdata2021/202101-divvy-tripdata.csv
../input/divvytripdata2021/202102-divvy-tripdata.csv
../input/divvytripdata2021/202103-divvy-tripdata.csv
../input/divvytripdata2021/202104-divvy-tripdata.csv
../input/divvytripdata2021/202105-divvy-tripdata.csv
../input/divvytripdata2021/202106-divvy-tripdata.csv
../input/divvytripdata2021/202107-divvy-tripdata.csv
../input/divvytripdata2021/202108-divvy-tripdata.csv
../input/divvytripdata2021/202109-divvy-tripdata.csv
../input/divvytripdata2021/202110-divvy-tripdata.csv
../input/divvytripdata2021/202111-divvy-tripdata.csv
../input/divvytripdata2021/202112-divvy-tripdata.csv'''.split("\n")

dfs = [pd.read_csv(x,engine="c") for x in datasets]
# ignore_index must be the boolean True, not the string "True"
df = pd.concat(dfs, ignore_index=True)
del dfs

In [5]:
df.head()

In [6]:
df.dtypes

In [7]:
df["started_at"] = pd.to_datetime(df["started_at"])
df["ended_at"] = pd.to_datetime(df["ended_at"])
df.dtypes

In [8]:
df["trip_duration"] = df["ended_at"]-df["started_at"]
df["trip_day"] = df["started_at"].dt.day_name()
df["trip_month"] = df["started_at"].dt.month_name()
df.head()

In [9]:
df.shape

In [10]:
df.duplicated().sum()

In [11]:
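# Count rides whose recorded start time is later than their end time
np.where(df["started_at"]<=df["ended_at"], 0, 1).sum()

In [12]:
# Keep only rides that end at or after the time they start
df = df[df["started_at"]<=df["ended_at"]]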

In [13]:
df.shape
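
Besides duplicates and reversed timestamps, missing values are worth a look before analysis, since the station-based groupings later rely on station names being present. The sketch below uses a small made-up frame with invented station names; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Made-up sample: the station names are placeholders, not real data
sample = pd.DataFrame({
    "ride_id": ["a", "b", "c"],
    "start_station_name": ["Station A", None, "Station B"],
    "end_station_name": ["Station A", "Station B", None],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rides missing a start or an end station name
station_cols = ["start_station_name", "end_station_name"]
rows_missing_station = int(sample[station_cols].isna().any(axis=1).sum())

print(missing_per_column)
print(rows_missing_station)
```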

**Phase 4: ANALYZE**


In [14]:
df.describe()

In [42]:
df1 = df.loc[(df["trip_duration"]>pd.Timedelta(0))] \
    .groupby(['member_casual'], as_index=False) \
    .agg({'trip_duration':'mean', 'ride_id':'count'}) \
    .sort_values('trip_duration') \
    .rename(columns={'trip_duration': 'mean_trip_duration', 'ride_id':'count'}) 

In [80]:
x=df1['member_casual']
y=df1['count']
plt.pie(y, labels = x, autopct = '%1.1f%%', shadow = True, startangle = 100)
plt.title("Member Type percentage distribution")
plt.show()

In [52]:
import math

# Index by rider type so the lookups below do not depend on row order
stats = df1.set_index('member_casual')
total_ridecount = stats['count'].sum()
member_percentage = (stats.loc['member','count']/total_ridecount)*100
casual_percentage = (stats.loc['casual','count']/total_ridecount)*100

# The counts are rides, not distinct users, so the message reports shares of rides
print("{}% of rides were taken by members and {}% by casual riders".format(math.trunc(member_percentage), math.trunc(casual_percentage)))
print("On average, a member ride lasts {} mins while a casual ride lasts {} mins".format(int(stats.loc['member','mean_trip_duration'].total_seconds()//60), int(stats.loc['casual','mean_trip_duration'].total_seconds()//60)))
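
A note on converting a `Timedelta` to minutes: the `.seconds` attribute holds only the seconds component left over after whole days are removed, while `total_seconds()` covers the full duration. The two agree for durations under a day and diverge above that. A small illustration with a made-up duration:

```python
import pandas as pd

# Made-up duration of one day and five minutes
long_ride = pd.Timedelta(days=1, minutes=5)

# .seconds ignores the whole-day part of the duration
minutes_from_seconds_attr = long_ride.seconds // 60

# total_seconds() includes the whole-day part
minutes_from_total = int(long_ride.total_seconds() // 60)

print(minutes_from_seconds_attr)  # 5
print(minutes_from_total)         # 1445
```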

In [84]:
df2 = df.loc[ (df["trip_duration"] > pd.Timedelta(0)) ] \
    .groupby( ['member_casual', 'rideable_type'], as_index=False ) \
    .agg({'ride_id':'count','trip_duration':'mean'}) \
    .sort_values('ride_id', ascending=False) \
    .rename(columns={"trip_duration":"avg_trip_duration","ride_id":"ride_count"})  # rideable_type is a group key, not a count, so it keeps its name

df2

In [86]:
fig = df.groupby(['rideable_type', 'member_casual'], as_index=False).count()

px.bar(fig, x = 'rideable_type', y = 'ride_id',color = 'member_casual', barmode='group',
        text = 'ride_id', 
        labels = {'ride_id': 'No. of Rides', 'member_casual': 'Member/Casual', 'rideable_type' : 'Rideable Type'},
        hover_name = 'member_casual',
        hover_data = {'member_casual': False, 'trip_duration': False},
        color_discrete_map = {'casual': '#FF934F', 'member': '#058ED9'}) 

- From the above data, members clearly prefer classic bikes, followed by electric bikes. Members made almost no use of docked bikes; this may be because the company offers fewer docked bikes or does not make them available to members.
- Casual riders take longer rides on average than members. This suggests that members mostly use their subscription for daily commuting.
- Casual riders prefer classic bikes, followed by electric bikes and docked bikes. Casual riders' docked-bike rides last about 1 hour 20 minutes on average, the longest average duration of any bike type.
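
The comparison of average ride lengths above rests on the mean, which a few very long rides can pull upward. Putting the median next to the mean is one way to check how robust the difference is. The sketch below uses made-up durations, not the real trip data.

```python
import pandas as pd

# Made-up durations in minutes; one casual ride is an extreme outlier
sample = pd.DataFrame({
    "member_casual": ["casual", "casual", "casual", "member", "member", "member"],
    "trip_duration": pd.to_timedelta([10, 20, 600, 8, 10, 12], unit="min"),
})

# Convert to minutes, then compare mean and median per rider type
minutes = sample["trip_duration"].dt.total_seconds() / 60
summary = minutes.groupby(sample["member_casual"]).agg(["mean", "median"])

print(summary)
```

In this toy sample the casual mean is 210 minutes but the casual median is only 20, which shows how a single long ride can dominate the mean.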

In [64]:
df3 = df.loc[ (df["trip_duration"] > pd.Timedelta(0))] \
      .groupby(['member_casual','trip_day'],as_index=False) \
      .agg({'ride_id':'count','trip_duration':'mean'}) \
      .sort_values('ride_id',ascending=False) \
      .rename(columns={"ride_id":"ride_count","trip_duration":"avg_trip_duration"})

In [65]:
df3

- On weekends, casual riders take more rides than members.
- On weekdays, members take more rides than casual riders.
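
The weekend and weekday pattern can also be summarized as a single number per rider type: the share of rides that fall on a Saturday or Sunday. This is a minimal sketch on a made-up sample; the column `is_weekend` is introduced here for illustration only.

```python
import pandas as pd

# Made-up sample of rides
sample = pd.DataFrame({
    "member_casual": ["casual", "casual", "casual", "member", "member", "member", "member"],
    "trip_day": ["Saturday", "Sunday", "Monday", "Monday", "Tuesday", "Wednesday", "Saturday"],
})

# Flag weekend rides, then take the mean of the flag per rider type
sample["is_weekend"] = sample["trip_day"].isin(["Saturday", "Sunday"])
weekend_share = sample.groupby("member_casual")["is_weekend"].mean() * 100

print(weekend_share.round(1))
```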

In [89]:
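fig1 = df.groupby(['trip_day', 'member_casual'], as_index=False).count()

# groupby sorts the day names alphabetically; reorder them into calendar order before plotting
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig1 = fig1.sort_values('trip_day', key=lambda s: s.map(day_order.index))

px.line(fig1, x = 'trip_day', y = 'ride_id', range_y = [0,550000],
        color = 'member_casual',  
        category_orders = {'trip_day': day_order},
        line_shape = 'spline',
        markers=True,
        labels = {'ride_id': 'No. of Rides', 'trip_day': 'Day of Week', 'member_casual': 'Member/Casual'},
        hover_name = 'member_casual', hover_data = {'member_casual': False, 'trip_month': False, 'ride_id': True}, 
        color_discrete_map = {'casual': '#FF934F', 'member': '#058ED9'})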

In [66]:
df4 = df.loc[ (df["trip_duration"] > pd.Timedelta(0)) \
             & (df["member_casual"] == 'casual') ] \
    .groupby( ['trip_month'], as_index=False ) \
    .agg({'ride_id':'count','trip_duration':'mean'}) \
    .sort_values('ride_id', ascending=False) \
    .rename(columns={"ride_id":"ride_count"})

df4

In [90]:
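fig2 = df.groupby(['trip_month', 'member_casual'], as_index=False).count()

# groupby sorts the month names alphabetically; reorder them into calendar order before plotting
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
fig2 = fig2.sort_values('trip_month', key=lambda s: s.map(month_order.index))

px.line(fig2, x = 'trip_month', y = 'ride_id', range_y = [0,600000],
        color = 'member_casual',  
        category_orders = {'trip_month': month_order},
        line_shape = 'spline',
        markers=True,
        labels = {'ride_id': 'No. of Rides', 'trip_month': 'Month', 'member_casual': 'Member/Casual'},
        hover_name = 'member_casual', hover_data = {'member_casual': False, 'trip_month': True, 'ride_id': True}, 
        color_discrete_map = {'casual': '#FF934F', 'member': '#058ED9'})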

In [67]:
df5 = df.loc[ (df["trip_duration"] > pd.Timedelta(0)) \
             & (df["member_casual"] == 'member') ] \
    .groupby( ['trip_month'], as_index=False ) \
    .agg({'ride_id':'count','trip_duration':'mean'}) \
    .sort_values('ride_id', ascending=False) \
    .rename(columns={"ride_id":"ride_count"})

df5

In [70]:
# Most popular weekend routes (start/end station pairs) among casual riders.
# Stored under its own name so the monthly member summary in df5 is not overwritten.
top_routes = df.loc[ (df["member_casual"]=='casual')
             & (df["trip_day"].isin(['Saturday','Sunday']))
             & (df["trip_duration"] > pd.Timedelta(0))
            ] \
    .groupby( ['start_station_name', 'end_station_name'], as_index=False ) \
    .agg({'ride_id':'count','trip_duration':'mean'}) \
    .sort_values('ride_id', ascending=False) \
    .rename(columns={"ride_id":"ride_count", "trip_duration":"avg_trip_duration"})

top_routes[:10]

In [71]:
# Rides that start and end at the same station (rows with a missing station name never match)
df6 = df.loc[ (df["start_station_name"] == df["end_station_name"])
             & (df["trip_duration"] > pd.Timedelta(0))] \
            .agg({'ride_id':'count','trip_duration':'mean'}) \
            .rename({"ride_id":"ride_count", "trip_duration":"avg_trip_duration"})

df6

In [72]:
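# Rides that start and end at different stations.
# Note: a missing station name compares as not-equal, so rides without a recorded station are included here.
df7 = df.loc[ (df["start_station_name"] != df["end_station_name"])
             & (df["trip_duration"] > pd.Timedelta(0))] \
            .agg({'ride_id':'count','trip_duration':'mean'}) \
            .rename({"ride_id":"ride_count", "trip_duration":"avg_trip_duration"})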

df7

In [92]:
# Convert the average durations from Timedelta to minutes
df6["avg_trip_duration"] = df6["avg_trip_duration"].total_seconds()/60.0
df7["avg_trip_duration"] = df7["avg_trip_duration"].total_seconds()/60.0

a = df6["avg_trip_duration"]  # same start and end station
b = df7["avg_trip_duration"]  # different start and end station
c = [a,b]
# The bars compare ride types, not rider types, so label them accordingly
d = ["Same station","Different stations"]

plt.figure(figsize=(6,5), dpi=80)

plt.bar(d,c, color=('orange','b'))
plt.title("Average duration of rides with same start and end station vs rides with different start and end station")
plt.xlabel("Type of ride")
plt.ylabel("Average trip duration (minutes)")

plt.show()

**Phase 5: SHARE**

Insights from the above analysis will be shared with the marketing team.

**Phase 6: ACT**

Based on the above information, the company's marketing team will draw its conclusions.

Recommendations might include:
 - Launch marketing campaigns in July and August, when the number of casual rides peaks
 - Place attractive advertisements and billboards at the top 10 stations
 - Offer incentives to new riders
 - Offer discounts for long rides
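
The station-based recommendation needs a ranking of stations by casual ridership. One possible way to produce it is sketched below, assuming "top stations" means the start stations with the most casual rides. The sample frame and its station names are made up.

```python
import pandas as pd

# Made-up sample; station names are placeholders
sample = pd.DataFrame({
    "member_casual": ["casual", "casual", "casual", "member", "casual"],
    "start_station_name": ["Station A", "Station A", "Station B", "Station A", None],
})

# Count casual rides per start station; value_counts drops missing names by default
top_stations = (
    sample.loc[sample["member_casual"] == "casual", "start_station_name"]
    .value_counts()
    .head(10)
)

print(top_stations)
```

On the full data set the same expression applied to `df` would return the ten busiest start stations for casual riders.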