## Study Objective
The purpose of this case study is to analyze the annual usage of Cyclistic bikes by annual members and casual riders to guide the Cyclistic marketing strategy.

### - Business goals:
* Identify some trends in Cyclistic bikes usage.
* Find how these trends apply to Cyclistic as a company.
* Use the identified trends to guide the Cyclistic marketing strategy.

### - Analysis Approach:
This is a population-based analysis: it examines the trip history of the entire rider population rather than the behavior of individual riders.

### Data Source:
The data has been provided by Motivate International Inc. under the following license - https://www.divvybikes.com/data-license-agreement.
* Note: The datasets are organized on a quarterly basis, from the second quarter of 2019 through the first quarter of 2020.

### Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Step 2: Import the Datasets

In [None]:
quarter2_2019 = pd.read_csv("../input/cyclisticbikeshare2019to2020data/Cyclistic_Bike_Share_2019-2020_Data/Divvy_Trips_2019_Q2.csv")
quarter3_2019 = pd.read_csv("../input/cyclisticbikeshare2019to2020data/Cyclistic_Bike_Share_2019-2020_Data/Divvy_Trips_2019_Q3.csv")
quarter4_2019 = pd.read_csv("../input/cyclisticbikeshare2019to2020data/Cyclistic_Bike_Share_2019-2020_Data/Divvy_Trips_2019_Q4.csv")
quarter1_2020 = pd.read_csv("../input/cyclisticbikeshare2019to2020data/Cyclistic_Bike_Share_2019-2020_Data/Divvy_Trips_2020_Q1.csv")

### Step 3: Observe the Dataset Structures.
(1) Check whether the column names of the tables are consistent with one another.
(2) Let's also check the number of rows and the column datatypes in each table.

In [None]:
quarter2_2019.info()

In [None]:
quarter3_2019.info()

In [None]:
quarter4_2019.info()

In [None]:
quarter1_2020.info()

### Summary 1:
* Observe that the column names of the tables are not consistent.
* The datatypes of some columns in quarter1_2020 (e.g. ride_id and rideable_type) are also not consistent with those in the other tables.
* The number of rows also differs for each quarter.
* NB: Since this is a population-based analysis, we can ignore the differences in the number of rows. However, we need to make the column names consistent through wrangling.
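
The column comparison described above can also be done programmatically instead of reading each `info()` output by eye. The following is a minimal sketch using two small, hypothetical tables (`q_old` and `q_new` are stand-ins, not the real quarterly files); it reports which reference columns each table lacks and which extra columns it carries.

```python
import pandas as pd

# Toy stand-ins for two quarterly tables (hypothetical, not the real data).
q_old = pd.DataFrame({"trip_id": [1, 2], "usertype": ["Subscriber", "Customer"]})
q_new = pd.DataFrame({"ride_id": ["A1", "B2"], "member_casual": ["member", "casual"]})

tables = {"q_old": q_old, "q_new": q_new}
reference = set(q_new.columns)  # use the newest table's columns as the reference

# For each table, list reference columns it lacks and columns it has beyond the reference.
column_report = {
    name: {
        "missing": sorted(reference - set(df.columns)),
        "extra": sorted(set(df.columns) - reference),
    }
    for name, df in tables.items()
}
print(column_report)
```

With the real data, the same dictionary comprehension could be applied to the four quarterly DataFrames to see at a glance which fields need renaming.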

## Step 3: Perform Data Wrangling on the Datasets.
* This step is essential before merging the tables into a single dataset.
* You can choose any of the tables as the reference. For this analysis, let's use the column names in quarter1_2020 as the reference.

### 3(a) Let's start by renaming the essential fields in the other datasets as follows:

In [None]:
# Rename fields!
quarter2_2019.rename(columns = {"01 - Rental Details Rental ID":"ride_id","01 - Rental Details Bike ID":"rideable_type",
                          "01 - Rental Details Local Start Time":"started_at","01 - Rental Details Local End Time":"ended_at",
                          "03 - Rental Start Station Name":"start_station_name","03 - Rental Start Station ID":"start_station_id",
                          "02 - Rental End Station Name":"end_station_name","02 - Rental End Station ID":"end_station_id",
                          "User Type":"member_casual","01 - Rental Details Duration In Seconds Uncapped":"tripduration",
                         "Member Gender":"gender","05 - Member Details Member Birthday Year":"birthyear"}, inplace=True)

# Check the head!
quarter2_2019.head()

In [None]:
quarter3_2019.rename(columns = {"trip_id":"ride_id","bikeid":"rideable_type","start_time":"started_at","end_time":"ended_at",
                          "from_station_name":"start_station_name","from_station_id":"start_station_id",
                          "to_station_name":"end_station_name","to_station_id":"end_station_id","usertype":"member_casual"}, 
               inplace=True)

# Check the head!
quarter3_2019.head()

In [None]:
quarter4_2019.rename(columns = {"trip_id":"ride_id","bikeid":"rideable_type","start_time":"started_at","end_time":"ended_at",
                          "from_station_name":"start_station_name","from_station_id":"start_station_id",
                          "to_station_name":"end_station_name","to_station_id":"end_station_id","usertype":"member_casual"},
              inplace=True)

# Check the head!
quarter4_2019.head()

### (3b) 
* (i) Convert ride_id and rideable_type in the Q2, Q3, and Q4 2019 tables to string (character) type.
* (ii) Note that the date fields are supposed to be in datetime format, but they are currently stored as strings. We will convert them to datetime format later. For now, use either of the methods below for the datatype conversions:

In [None]:
# Method 1:
quarter2_2019['ride_id'] = quarter2_2019.ride_id.astype(str)
quarter2_2019['rideable_type'] = quarter2_2019.rideable_type.astype(str)

quarter3_2019['ride_id'] = quarter3_2019.ride_id.astype(str)
quarter3_2019['rideable_type'] = quarter3_2019.rideable_type.astype(str)

quarter4_2019['ride_id'] = quarter4_2019.ride_id.astype(str)
quarter4_2019['rideable_type'] = quarter4_2019.rideable_type.astype(str)

# Note: A single line of code is used for all these in method 2.

In [None]:
# Method 2: convert both columns at once with astype.
# (DataFrame.apply(str) would stringify each whole column into one value, so astype is used instead.)
quarter2_2019[['ride_id','rideable_type']] = quarter2_2019[['ride_id','rideable_type']].astype(str)
quarter3_2019[['ride_id','rideable_type']] = quarter3_2019[['ride_id','rideable_type']].astype(str)
quarter4_2019[['ride_id','rideable_type']] = quarter4_2019[['ride_id','rideable_type']].astype(str)

### (3c) The datasets are now fit for merging. Here we use the concatenate method (pd.concat) to merge the datasets.
* Note: The datasets are merged by rows to form one long dataset, so axis = 0 is used. Since axis=0 is the default for pd.concat, setting it explicitly is optional.
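
To illustrate how row-wise concatenation behaves when the tables do not share every column, here is a small sketch with two hypothetical frames (`a` and `b` are made up for illustration). Columns are aligned by name, and any column missing from one table is filled with NaN. The `ignore_index=True` argument is an optional addition, not part of the original workflow; it gives the result a fresh 0..n-1 index instead of repeating each table's own index.

```python
import pandas as pd

# Two toy tables that share only the ride_id column (hypothetical data).
a = pd.DataFrame({"ride_id": ["1", "2"], "tripduration": [300.0, 450.0]})
b = pd.DataFrame({"ride_id": ["A", "B"], "member_casual": ["member", "casual"]})

# Stack by rows; columns are matched by name and gaps become NaN.
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)
```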

In [None]:
merged_trips = pd.concat([quarter2_2019, quarter3_2019, quarter4_2019, quarter1_2020], axis=0)

# Check the head!
merged_trips.head()

In [None]:
# Check the info
merged_trips.info()

### (3d) 
* We can remove columns that are not needed for this analysis or that are not available in every quarterly dataset: gender, birthyear, start_lat, start_lng, end_lat, and end_lng. The tripduration field should also be removed, since it is not contained in the quarter1_2020 dataset; ride length is recalculated from the timestamps in Step 4.

In [None]:
# Remove fields and then check the head!
merged_trips.drop(['tripduration','gender','birthyear','start_lat','start_lng','end_lat','end_lng'], axis=1, inplace=True)
merged_trips.head()
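
If it is uncertain whether every listed field actually exists in the merged table, `DataFrame.drop` accepts `errors="ignore"` so that absent columns are skipped rather than raising a `KeyError`. A minimal sketch on a hypothetical one-row table:

```python
import pandas as pd

# Toy table that contains only some of the unwanted fields (hypothetical data).
trips = pd.DataFrame({"ride_id": ["1"], "gender": ["Male"], "birthyear": [1990]})

unwanted = ["tripduration", "gender", "birthyear", "start_lat"]

# errors="ignore" drops the columns that exist and silently skips the rest.
trimmed = trips.drop(columns=unwanted, errors="ignore")
print(trimmed.columns.tolist())
```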

### Step 4: Clean Up and Add Data to Prepare for Analysis
* Here, we need to dive deeper into the merged dataset to fix a few problems.
* (1) Observe that the "member_casual" field uses two labels for member riders ("member" and "Subscriber") and two labels for casual riders ("Customer" and "casual"). We need to consolidate these four labels into two.
* (2) Manipulate the date fields into week_day, day, month, year to expand the scope of data aggregations.
* (3) Also, let's calculate the length of ride and store the result into a new field called length_of_ride in the merged_trips.

In [None]:
# Check the unique values in the member_casual field!
merged_trips['member_casual'].unique()

### (4a) Create a function to replace "Subscriber" with "member", and "Customer" with "casual".

In [None]:
def Replace(string):
    if string == "Subscriber":
        return "member"
    elif string == "Customer":
        return "casual"
    else:
        return string
    
merged_trips['member_casual'] = merged_trips['member_casual'].apply(Replace)

# Check the head!
merged_trips.head()

In [None]:
# Check to be sure the member_casual field was updated! You should see only "member" and "casual" after running this code.
merged_trips['member_casual'].unique()
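
As an alternative to a row-by-row function, the same label consolidation can be sketched with `Series.replace` and a dictionary, which is vectorized and leaves unlisted values unchanged. The `labels` series below is a hypothetical stand-in for the `member_casual` column.

```python
import pandas as pd

# Toy stand-in for the member_casual column (hypothetical data).
labels = pd.Series(["Subscriber", "Customer", "member", "casual"])

# Map the legacy labels onto the two target labels; other values pass through.
consolidated = labels.replace({"Subscriber": "member", "Customer": "casual"})
print(consolidated.unique())
```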

### (4b) Obtain week_day, day, month, and year, respectively from the date (started_at) field.
* (i) Import the datetime library first.
* (ii) Now convert the started_at and ended_at fields from character to datetime datatype.

In [None]:
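# Import datetime helpers.
# Note: pd.to_datetime below does the conversion itself, so these imports are optional.
# (A plain "import datetime" would be shadowed by the line below, so it is omitted.)
from datetime import datetime, date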

In [None]:
# Convert to datetime
merged_trips['started_at'] = pd.to_datetime(merged_trips['started_at'])
merged_trips['ended_at'] = pd.to_datetime(merged_trips['ended_at'])

In [None]:
# Check the datatype now!
type(merged_trips['started_at'].iloc[0])

In [None]:
# Check the date field datatype now!
type(merged_trips['ended_at'].iloc[0])

### (4c) Create the week_day, day, month and year fields

In [None]:
# Use the vectorized .dt accessor instead of a per-row lambda.
merged_trips['week_day'] = merged_trips['started_at'].dt.dayofweek
merged_trips['day'] = merged_trips['started_at'].dt.day
merged_trips['month'] = merged_trips['started_at'].dt.month
merged_trips['year'] = merged_trips['started_at'].dt.year

# Check the head!
merged_trips.head()

*** Oops! Observe that the week_day values are integers, where 0 is Monday and 6 is Sunday. Convert them to day names using the following map.

In [None]:
day_name = {0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'}
merged_trips['week_day'] = merged_trips['week_day'].map(day_name)

# Check the head
merged_trips.head()
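
Pandas can also produce the day names directly from a datetime column via `Series.dt.day_name()`, which avoids the intermediate integer codes. A minimal sketch on two hypothetical timestamps:

```python
import pandas as pd

# Two toy start times (hypothetical data).
starts = pd.Series(pd.to_datetime(["2019-04-01 08:00:00", "2019-04-06 17:30:00"]))

# day_name() returns the English weekday name for each timestamp.
week_day = starts.dt.day_name()
print(week_day.tolist())  # ['Monday', 'Saturday']
```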

### (4d) Calculate the length of ride (in seconds) and store the result into a new field called length_of_ride in the merged_trips table.

In [None]:
# Subtract the timestamps and convert the resulting timedelta to seconds.
merged_trips['length_of_ride'] = (merged_trips['ended_at'] - merged_trips['started_at']).dt.total_seconds()

merged_trips.head()

In [None]:
# Let's check the length_of_ride field for negative values.
merged_trips[merged_trips['length_of_ride'] < 0]['length_of_ride'].count()

# There are 130 times where 'length_of_ride' returned negative values.

### (4e) Remove the data instances where 'length_of_ride' returned negative

In [None]:
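# Save the result with a new variable name.
# .copy() makes an independent table, so adding columns later does not trigger SettingWithCopyWarning.
merged_trips_v2 = merged_trips[merged_trips['length_of_ride'] >= 0].copy()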

merged_trips_v2.head()

In [None]:
# Check for negative values again!
merged_trips_v2[merged_trips_v2['length_of_ride'] < 0]['length_of_ride'].count()

### Lastly, let's create a new field called quarter to aggregate the data by quarter of the year.

In [None]:
# Create a function to organize data by quarters
def quarters(month):
    if month >= 4 and month <= 6:
        return 'Q2_2019'
    elif month >= 7 and month <= 9:
        return 'Q3_2019'
    elif month >= 10 and month <= 12:
        return 'Q4_2019'
    else:
        return 'Q1_2020'
    

merged_trips_v2['quarter'] = merged_trips_v2['month'].apply(quarters)
merged_trips_v2.head()
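
Because the data spans two calendar years, the quarter label can also be derived from the timestamp itself rather than from hard-coded month ranges. The sketch below builds the same `Q<n>_<year>` style of label from `dt.quarter` and `dt.year`; the `starts` series is hypothetical sample data.

```python
import pandas as pd

# Toy start times, one per quarter covered by the study (hypothetical data).
starts = pd.Series(pd.to_datetime(["2019-04-15", "2019-08-20", "2019-12-31", "2020-02-01"]))

# Combine the calendar quarter and year into a label such as 'Q2_2019'.
quarter_label = "Q" + starts.dt.quarter.astype(str) + "_" + starts.dt.year.astype(str)
print(quarter_label.tolist())  # ['Q2_2019', 'Q3_2019', 'Q4_2019', 'Q1_2020']
```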

### Step 5: Conduct Descriptive Analysis

In [None]:
# Let's visualize the number of rides by rider-type.

sns.countplot(x='member_casual', data=merged_trips_v2)
plt.title('Number of Rides by Rider Type')

In [None]:
# Number of Rides by Rider Type Per Week_day.

fig = plt.figure(figsize=(10,6))
sns.countplot(x='week_day', data=merged_trips_v2, hue='member_casual')
plt.title('Number of Rides by Rider Type Per Week_day')

In [None]:
# Monthly Number of Rides by Rider-Type

fig = plt.figure(figsize=(10,6))
sns.countplot(x='month', data=merged_trips_v2, hue='member_casual')
plt.title('Monthly Number of Rides by Ridertype')

In [None]:
# Quarterly Number of Rides by Rider-Type

fig = plt.figure(figsize=(10,6))
sns.countplot(x='quarter', data=merged_trips_v2, hue='member_casual')
plt.title('Quarterly Number of Rides by Ridertype')

In [None]:
# Number of Rides by Rider Type Per Quarter and Week_day.

week_day_quarter = merged_trips_v2.groupby(by=['week_day','member_casual','quarter']).count().unstack()['day']
fig = plt.figure(figsize=(10,5))
sns.heatmap(week_day_quarter)
plt.title('Number of Rides by Rider Type Per Quarter and Week_day')

### Summary 2: Number of Ride By Ridertype.
* (1) Over the past 12 months (Q2_2019 to Q1_2020), member riders took about 3x as many rides as casual riders.
* (2) By day of week (week_day), the number of casual rides increases from Monday to Sunday. The three highest counts of casual rides are recorded on Friday, Saturday, and Sunday.
* (3) By day of week (week_day), the number of member rides decreases from Monday to Sunday.
* (4) The number of rides for both member and casual riders is highest in August.
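
The counts behind plots like these can be read off numerically as well, which makes ratios such as "about 3x" easy to verify. A minimal sketch on a hypothetical four-ride table (the numbers are made up and are not the study's results):

```python
import pandas as pd

# Toy ride table (hypothetical data, not the study's results).
rides = pd.DataFrame({
    "member_casual": ["member", "member", "member", "casual"],
    "week_day": ["Monday", "Saturday", "Monday", "Saturday"],
})

# Rides per rider type, and the member-to-casual ratio.
rides_by_type = rides["member_casual"].value_counts()
ratio = rides_by_type["member"] / rides_by_type["casual"]

# Rides per weekday, split by rider type (the table behind a hue-split countplot).
by_day = pd.crosstab(rides["week_day"], rides["member_casual"])
print(ratio)
print(by_day)
```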

In [None]:
# Let's visualize the length of ride by rider-type

fig = plt.figure(figsize=(10,5))
sns.barplot(x ='member_casual', y ='length_of_ride', data = merged_trips_v2)
plt.title('Length of Ride by Ridertype')

In [None]:
# Let's visualize the length of ride by rider-type per week

fig = plt.figure(figsize=(10,6))
sns.barplot(x ='week_day', y ='length_of_ride', data = merged_trips_v2, hue='member_casual')
plt.title('Length of Ride by Rider Type Per Week_day')

In [None]:
# Let's visualize the monthly length of ride by rider-type

fig = plt.figure(figsize=(10,6))
sns.barplot(x ='month', y ='length_of_ride', data = merged_trips_v2, hue='member_casual')
plt.title('Monthly Length of Ride by Ridertype')

In [None]:
# Let's visualize the month_quarterly length of ride by rider-type

mapm = {1:'2020_1A',2:'2020_1B',3:'2020_1C',4:'2019_2A',5:'2019_2B',6:'2019_2C',7:'2019_3A',8:'2019_3B',9:'2019_3C',
        10:'2019_4A',11:'2019_4B',12:'2019_4C',}
merged_trips_v2['month_quarter'] = merged_trips_v2['month'].map(mapm)

fig = plt.figure(figsize=(10,6))
sns.barplot(x='month_quarter', y='length_of_ride', data=merged_trips_v2, hue='member_casual')
plt.title('Month_Quarterly Length of Ride by Rider-Type')

In [None]:
# Let's visualize the quarterly ride lengths by rider-type

fig = plt.figure(figsize=(10,6))
sns.barplot(x ='quarter', y ='length_of_ride', data = merged_trips_v2, hue='member_casual')
plt.title('Quarterly Length of Ride by Ridertype')

In [None]:
# Daily average length of ride by year-quarter

day_quarter = merged_trips_v2.pivot_table(values='length_of_ride', index='day', columns='quarter')
day_quarter

In [None]:
# Length of ride daily average by year-quarter

day_quarter.plot(kind='bar', stacked=True, figsize=(10,6))
plt.xlabel('Day')
plt.ylabel('Average Length of Ride')
plt.title('Length of Ride Daily Average By Year-Quarter')
plt.legend(loc='right', bbox_to_anchor=(1.15, 0.879))

In [None]:
# Length of Ride Yearly Average for Member and Casual Riders

fig = plt.figure(figsize=(10,6))
sns.barplot(x='year', y='length_of_ride', data=merged_trips_v2, hue='member_casual')
plt.title('Length of Ride Yearly Average for Member and Casual Riders')

### Summary 3: Ride Length By Ridertype.
* (1) The average ride length of member riders is about one-third that of casual riders over the past 12 months (Q2_2019 to Q1_2020).
* (2) By day of week (week_day), the ride length of casual riders is relatively stable across the week.
* (3) By day of week (week_day), the ride length of member riders is relatively stable across the week.
* (4) The ride length of member riders has been decreasing from Q3_2019 to Q1_2020, while an increasing trend is seen in the ride length of casual riders.
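
Note that `sns.barplot` plots the mean of `y` for each category by default, so the bars show average ride length rather than total ride length. The sketch below, on hypothetical data, computes the mean, the sum, and the ride count side by side to show how these can tell different stories.

```python
import pandas as pd

# Toy ride table (hypothetical data): many short member rides, one long casual ride.
rides = pd.DataFrame({
    "member_casual": ["member", "member", "member", "casual"],
    "length_of_ride": [600.0, 900.0, 300.0, 1800.0],
})

# Mean, total, and number of rides per rider type.
summary = rides.groupby("member_casual")["length_of_ride"].agg(["mean", "sum", "count"])
print(summary)
```

In this toy example the casual mean is three times the member mean, while the totals are equal, so it is worth stating which statistic a summary refers to.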

In [None]:
# Let's look at each station's profile in terms of number of rides and length of rides. This can give insight into more actionable
# promotions or incentives for casual riders to become members.

station_no_of_ride = merged_trips_v2.groupby(by=['start_station_name',
                                                 'quarter']).count().unstack()['day'].sort_values(by='Q1_2020', 
                                                                                                  ascending=False).head(10)
station_no_of_ride

In [None]:
# Ten Top-Most Stations with Highest Number of Ride By Year-Quarter

station_no_of_ride.plot(kind='bar', stacked=True, figsize=(10,6))
plt.xlabel('Station Name')
plt.ylabel('Number of Ride')
plt.title('Ten Top-Most Stations with Highest Number of Ride By Year-Quarter')
plt.legend(loc='right', bbox_to_anchor=(1.0,0.85))

In [None]:
# Ten Top-Most Stations with Highest Length of Ride By Year-Quarter

station_length_of_ride = merged_trips_v2.pivot_table(values='length_of_ride', 
                                                     index='start_station_name', 
                                                     columns='quarter').sort_values(by='Q1_2020', ascending=False).head(10)
station_length_of_ride

In [None]:
# Ten Top-Most Stations with Highest Length of Ride By Year-Quarter

station_length_of_ride.plot(kind='bar', stacked=True, figsize=(10,5))
plt.xlabel('Station Name')
plt.ylabel('Length of Ride')
plt.title('Ten Top-Most Stations with Highest Length of Ride By Year-Quarter')
plt.legend(loc='right', bbox_to_anchor=(1.0,0.85))

### Summary 4: Cyclistic Stations with Best Potentials and Opportunities.
* (1) Among the top ten stations, 'Streeter Dr & Grand Ave' has the highest number of rides, while 'Daley Center Plaza' has the lowest number of rides in the past 12 months (Q2_2019 to Q1_2020).
* (2) However, 'Carpenter St & 63rd St' has the highest ride length, while 'South Shore Dr & 74th St' has the lowest ride length in the past 12 months (Q2_2019 to Q1_2020).
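
The station tables above are sorted by the Q1_2020 column only. If a ranking over the full 12 months is wanted, one option is to add a total across quarters and sort by that instead. This is a sketch on hypothetical data; the station names `A`, `B`, and `C` are placeholders.

```python
import pandas as pd

# Toy ride table (hypothetical stations and quarters).
rides = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C"],
    "quarter": ["Q2_2019", "Q3_2019", "Q1_2020", "Q1_2020", "Q1_2020", "Q3_2019"],
})

# Rides per station and quarter.
counts = pd.crosstab(rides["start_station_name"], rides["quarter"])

# Ranking by the latest quarter versus ranking by the 12-month total.
top_by_latest = counts.sort_values("Q1_2020", ascending=False)
top_by_total = counts.assign(total=counts.sum(axis=1)).sort_values("total", ascending=False)
print(top_by_latest.index[0], top_by_total.index[0])
```

Here station `B` leads in Q1_2020 while station `A` leads over the whole period, so the choice of sort key can change which stations appear in the top ten.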

## Conclusions:
### --- Key Insights ----
* The average ride length of casual riders is about 3x that of member riders over the past 12 months (Q2_2019 to Q1_2020).
* The three highest counts of casual rides are recorded on Friday, Saturday, and Sunday.
* In the current quarter (Q1_2020), the 'Carpenter St & 63rd St' station recorded the highest ride length.

## Recommendations:
* Cyclistic can increase its baseline if more casual riders become member riders.
* The best days to market to casual riders are Friday, Saturday, and Sunday, when casual ridership is highest.
* Currently, the casual riders with the best potential are located at the 'Carpenter St & 63rd St' station.