# Toronto Bikeshare

##### Context
The Bike Share Toronto Ridership data contains anonymized trip data, including trip start day and time, trip end day and time, trip duration, trip start station, trip end station, and user type.

##### Content
This dataset contains bike sharing information from 2017 and 2018. This notebook loads the 2018 data only.

##### Acknowledgement
This dataset is from the Toronto Parking Authority, published at https://open.toronto.ca/dataset/bike-share-toronto-ridership-data/. The latest complete dataset is available there.

The data is licensed under the Open Government License - Toronto.

# Business Understanding
##### Problem: 
Someone wants to understand the flow of bikes: are they returned to their initial location or not?

##### Clear Questions: 
- a) Does the reverse direction reduce the value of convenience of bikeshares?
- b) Should they just get their own bikes?
- c) Do we need a crew to pick up the bikes and restore them to their initial distribution?
- d) What percentage of bikes are returned to their initial location?
- e) How many bikeshare trips visit more than just the start and end points, stopping at other points before the end of the day?

##### Analytic Approach: 
Descriptive Analysis

##### Data Requirements / Features: 
- a) can't be answered
- b) can't be answered
- c) to_station_name, from_station_name, weekday_start, hour_start
- d) <b>One Way Trip</b> => to_station_id, from_station_id
- e) can't be answered

# Data Understanding

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# load data 2018 only
# df_2017_q1 = pd.read_csv('../input/toronto-bikeshare-data/bikeshare-ridership-2017/2017 Data/Bikeshare Ridership (2017 Q1).csv')
# df_2017_q2 = pd.read_csv('../input/toronto-bikeshare-data/bikeshare-ridership-2017/2017 Data/Bikeshare Ridership (2017 Q2).csv')
# df_2017_q3 = pd.read_csv('../input/toronto-bikeshare-data/bikeshare-ridership-2017/2017 Data/Bikeshare Ridership (2017 Q3).csv')
# df_2017_q4 = pd.read_csv('../input/toronto-bikeshare-data/bikeshare-ridership-2017/2017 Data/Bikeshare Ridership (2017 Q4).csv')
df_2018_q1 = pd.read_csv('../input/toronto-bikeshare-data/bikeshare2018/bikeshare2018/Bike Share Toronto Ridership_Q1 2018.csv')
df_2018_q2 = pd.read_csv('../input/toronto-bikeshare-data/bikeshare2018/bikeshare2018/Bike Share Toronto Ridership_Q2 2018.csv')
df_2018_q3 = pd.read_csv('../input/toronto-bikeshare-data/bikeshare2018/bikeshare2018/Bike Share Toronto Ridership_Q3 2018.csv')
df_2018_q4 = pd.read_csv('../input/toronto-bikeshare-data/bikeshare2018/bikeshare2018/Bike Share Toronto Ridership_Q4 2018.csv')

In [None]:
# concat all the data into one master df
# df_2017_q1, df_2017_q2, df_2017_q3, df_2017_q4
frames = [df_2018_q1,df_2018_q2,df_2018_q3,df_2018_q4]

df = pd.concat(frames)
df.head()

In [None]:
# check available columns
df.columns

In [None]:
# check shape
df.shape

In [None]:
# describe
df.describe()

In [None]:
# check columns and their dtypes
# -> it looks like some columns need to be converted to the correct dtype
df.info()

In [None]:
# check missing value
df.isnull().sum()

In [None]:
# check unique values
# -> it looks like there are 359 stations in Toronto.
df.nunique()

In [None]:
# check user type values
df["user_type"].unique()

<p style="text-align:center;"> It seems like the data is already cleaned, so we can go to the next phase </p>
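
The individual checks above (dtypes, missing values, unique counts) can also be folded into one summary table. The following is a minimal sketch rather than part of the original analysis: the helper name `quality_report` and the toy frame with its values are assumptions made for illustration.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, and number of unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "unique": frame.nunique(),
    })

# Toy stand-in for the ridership data (hypothetical values)
toy = pd.DataFrame({
    "from_station_id": [7000, 7001, 7000],
    "to_station_id": [7001, 7000, None],
    "user_type": ["Annual Member", "Casual Member", "Annual Member"],
})
report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the concatenated frame would show at a glance whether any column still has missing values before moving on.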


# Data Preparation

In [None]:
# convert the time columns to datetime
df['trip_start_time'] = pd.to_datetime(df['trip_start_time'])
df['trip_stop_time'] = pd.to_datetime(df['trip_stop_time'])

# separate trip_start_time into hour, month, and weekday
df['hour_start'] = df['trip_start_time'].dt.hour
df['month_start'] = df['trip_start_time'].dt.month
df['weekday_start'] = df['trip_start_time'].dt.dayofweek

# separate trip_stop_time into hour, month, and weekday
df['hour_stop'] = df['trip_stop_time'].dt.hour
df['month_stop'] = df['trip_stop_time'].dt.month
df['weekday_stop'] = df['trip_stop_time'].dt.dayofweek

# create more readable month naming
mon = {1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
df['month_start'] = df['month_start'].map(mon)
df['month_stop'] = df['month_stop'].map(mon)

# create more readable day naming
day = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['weekday_start'] = df['weekday_start'].map(day)
df['weekday_stop'] = df['weekday_stop'].map(day)

In [None]:
#check
df.head()


# Exploratory Data Analysis

### Question A
- a) Does the reverse direction reduce the value of convenience of bikeshares?

### Conclusion A

> Can't be answered: the data lacks bike_id information, so we can't track the flow of individual bikes.

### Question B
- b) Should they just get their own bikes?

### Conclusion B

> can't be answered.

### Question C
- c) Do we need a crew to pick up the bikes and restore them to their initial distribution?

In [None]:
# Top 5 Arriving Stations
plt.figure(figsize=(10,5))
sns.set_style('darkgrid')
sns.countplot(y='to_station_name', data=df, palette='coolwarm', order=df['to_station_name'].value_counts().index[:5])
plt.title('Top 5 Arriving Stations')

In [None]:
# Top 5 Departing Stations
plt.figure(figsize=(10,5))
sns.set_style('darkgrid')
sns.countplot(y='from_station_name', data=df, palette='coolwarm', order=df['from_station_name'].value_counts().index[:5])
plt.title('Top 5 Departing Stations')

In [None]:
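# peak hours: trip counts per weekday and start hour
daily_activity = df.groupby(by=['weekday_start','hour_start']).count()['user_type'].unstack()
# order the rows Mon..Sun instead of alphabetically
daily_activity = daily_activity.reindex(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
daily_activity.head()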

In [None]:
# plot the peak hours with heatmap
plt.figure(figsize=(12,6))
sns.heatmap(daily_activity,cmap='coolwarm')

### Conclusion C

> Not sure: the data lacks bike_id.

> Looking at the graphs, though, a crew should pay attention to these factors:
- the peak hours are 08:00-09:00 and 16:00-18:00
- the top 5 busiest stations (departure and arrival) need the most supervision
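
Even without bike_id, the station columns listed for this question allow a rough imbalance check: arrivals minus departures per station. The sketch below is an illustrative addition, not the original author's method. The helper name `net_flow` and the toy station names are assumptions; the column names `to_station_name` and `from_station_name` come from the dataset.

```python
import pandas as pd

def net_flow(trips: pd.DataFrame) -> pd.Series:
    """Arrivals minus departures per station; positive means bikes accumulate."""
    arrivals = trips["to_station_name"].value_counts()
    departures = trips["from_station_name"].value_counts()
    # fill_value=0 covers stations that appear on only one side
    return arrivals.sub(departures, fill_value=0).astype(int).sort_values()

# Toy trips between hypothetical stations A, B, and C
toy = pd.DataFrame({
    "from_station_name": ["A", "A", "B", "C"],
    "to_station_name":   ["B", "B", "C", "B"],
})
flow = net_flow(toy)
print(flow)
```

Stations at the negative end of `net_flow(df)` lose bikes over the period and stations at the positive end collect them, which is one hedged way to judge where a rebalancing crew might be needed.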

### Question D
- d) What percentage of bikes are returned to their initial location?

In [None]:
# since bike_id is missing, we can only estimate bike returns from one-way trips,
# using only to_station_id and from_station_id with this simple algorithm
# the nested loop is quadratic and very slow, so we take the first 1000 rows as a sample
df_sample = df.head(1000)

In [None]:
# initialize the counter
returned = 0

# extract the station ids as arrays for positional access
from_ids = df_sample['from_station_id'].to_numpy()
to_ids = df_sample['to_station_id'].to_numpy()

# one-way return algorithm: for each trip i, look for a later trip j
# that runs in the reverse direction (from i's destination back to i's origin)
for i in range(len(df_sample)):
    for j in range(i + 1, len(df_sample)):
        # compare the ids pairwise instead of concatenating them as strings
        if to_ids[i] == from_ids[j] and from_ids[i] == to_ids[j]:
            returned += 1
            break

In [None]:
# in this sample, 135 trips have a matching return trip
print('returned count: ' + str(returned))
onewaytrip_return_percentage = returned / len(df_sample) * 100
print(str(onewaytrip_return_percentage) + '%')

### Conclusion D

> In the 1,000-trip sample, 13.5% of one-way trips are followed by a return trip in the reverse direction.
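
The pairwise loop above is slow because every trip is compared with every later trip. A single backward pass that remembers the (from, to) pairs already seen should give the same count in linear time, which would allow the full year to be processed instead of a sample. This is a sketch under that assumption, not the original method. The function name `count_reverse_trips` and the toy ids are hypothetical.

```python
import pandas as pd

def count_reverse_trips(trips: pd.DataFrame) -> int:
    """Count trips that are followed later by a trip in the reverse direction."""
    pairs = list(zip(trips["from_station_id"], trips["to_station_id"]))
    seen_later = set()  # (from, to) pairs of trips that start after the current one
    returned = 0
    # Walk backwards so that seen_later only ever holds later trips
    for origin, destination in reversed(pairs):
        if (destination, origin) in seen_later:
            returned += 1
        seen_later.add((origin, destination))
    return returned

# Toy sample with hypothetical station ids
toy = pd.DataFrame({
    "from_station_id": [7000, 7001, 7002, 7000],
    "to_station_id":   [7001, 7000, 7003, 7001],
})
returned = count_reverse_trips(toy)
percentage = returned / len(toy) * 100
print(returned, percentage)
```

As with the loop, this relies on the rows being in chronological order, and it still matches station pairs rather than individual bikes.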

### Question E
- e) How many bikeshare trips visit more than just the start and end points, stopping at other points before the end of the day?

In [None]:
# once again, this can't be answered because the data lacks bike_id and user_id.

### Conclusion E

> can't be answered

# Resources

> Peak Hours and Most Visited Bike Points Graph by Eduardo Sierra
- https://www.kaggle.com/esierr1/exploratory-data-analysis-bike-share-toronto-2018