# Ford GoBike Trip Data Exploration
#### by Keerthana Manoharan

## Preliminary Wrangling

The dataset contains information about individual rides made in 2018 with Ford GoBike, a bike-sharing system covering the greater San Francisco Bay Area.

This document explores the dataset gathered.

### Table of contents:
- <a style="text-decoration: none" href="#gather">Data Gathering</a>
- <a style="text-decoration: none" href="#clean">Data Cleaning</a>
- <a style="text-decoration: none" href="#explore">Data Exploration</a>
    - <a style="text-decoration: none" href="#uni">Univariate Exploration</a>
    - <a style="text-decoration: none" href="#bi">Bivariate Exploration</a>
    - <a style="text-decoration: none" href="#multi">Multivariate Exploration</a>

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id="gather"></a>

## Data Gathering

The Ford GoBike trip data is publicly available and can be downloaded from [here](https://s3.amazonaws.com/fordgobike-data/index.html)

For this analysis, I downloaded the 2018 data, which comes as 12 CSV files, one per month.

**Brief description about the data :**
- Trip Duration (seconds)
- Start Time and Date
- End Time and Date
- Start Station ID
- Start Station Name
- Start Station Latitude
- Start Station Longitude
- End Station ID
- End Station Name
- End Station Latitude
- End Station Longitude
- Bike ID
- User Type (“Subscriber” = Member or “Customer” = Casual)
- Member Year of Birth
- Member Gender
- Bike Share for all Trip

In [None]:
# Read from master dataset
df = pd.read_csv('../input/ford-gobike-tripdata-2018/fordgobike_tripdata.csv', encoding='utf-8')

In [None]:
# display few records randomly 
df.sample(5)

In [None]:
# structure of the dataset
df.shape

In [None]:
# features of the dataset
df.info()

**Observations :** Erroneous datatypes 
> - `start_time` and `end_time` should be of type datetime
> - `start_station_id`, `end_station_id` and `member_birth_year` should be of type int
> - `user_type` and `member_gender` should be of type categorical
> - `bike_share_for_all_trip` should be of type bool

In [None]:
# number of null values in each column
df.isna().sum()

<a id="clean"></a>

## Data Cleaning

### Issue 1 : Erroneous datatypes for `start_time` and `end_time`

### Define

Convert the datatype of `start_time` and `end_time` from string to datetime using pd.to_datetime()
    
### Code    

In [None]:
# convert the data type of start_time and end_time to datetime.
df.start_time = pd.to_datetime(df.start_time)
df.end_time = pd.to_datetime(df.end_time)

### Test

In [None]:
# test if the data type of start_time and end_time is datetime.
assert pd.api.types.is_datetime64_any_dtype(df.start_time)
assert pd.api.types.is_datetime64_any_dtype(df.end_time)

### Issue 2 : Erroneous datatypes for `start_station_id`,  `end_station_id`  and `member_birth_year`

### Define

- Check whether any value has a non-zero decimal part
- If not, convert `start_station_id`, `end_station_id` and `member_birth_year` from float64 to the nullable integer type Int64 using astype() (plain int64 cannot hold NaN)

    
### Code    

In [None]:
# Check if there is any valid decimals
df.start_station_id.unique()

In [None]:
# Check if there is any valid decimals
df.end_station_id.unique()

In [None]:
# Check if there is any valid decimals
df.member_birth_year.unique()

> - No valid decimals present
> - NaN present
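
The check and the conversion can be sketched on a few made-up station IDs (not values taken from the dataset). The names `ids`, `has_fraction` and `converted` are illustrative only. The sketch assumes, as observed above, that every non-missing value is a whole number.

```python
import numpy as np
import pandas as pd

# Made-up station IDs, stored as floats because one value is missing
ids = pd.Series([15.0, 6.0, np.nan, 81.0])

# True if any non-missing value has a fractional part
has_fraction = bool(((ids.dropna() % 1) != 0).any())

# The nullable 'Int64' dtype keeps the missing entry as <NA>
converted = ids.astype('Int64')
print(has_fraction, converted.dtype)
```

A plain `astype('int64')` would raise on the NaN entry, which is why the capitalised nullable dtype is used here.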

In [None]:
# convert the data type of start_station_id, end_station_id and member_birth_year from float64 to int.

df.start_station_id = df.start_station_id.astype('Int64')
df.end_station_id = df.end_station_id.astype('Int64')
df.member_birth_year = df.member_birth_year.astype('Int64')

### Test

In [None]:
# test if the data type of start_station_id is the nullable integer type
# (checking the column dtype also works when the first value is <NA>)
assert df.start_station_id.dtype == 'Int64'

In [None]:
# test if the data type of end_station_id is the nullable integer type
assert df.end_station_id.dtype == 'Int64'

In [None]:
# test if the data type of member_birth_year is the nullable integer type
assert df.member_birth_year.dtype == 'Int64'

### Issue 3 : Erroneous datatypes for `user_type` and `member_gender`

### Define

Convert the datatype of `user_type` and `member_gender` from string to category using astype()
    
### Code    

In [None]:
# Check the unique values of user_type
df.user_type.unique()

In [None]:
# Convert the datatype to category
df.user_type = df.user_type.astype('category')

In [None]:
# Check the unique values of member_gender
df.member_gender.unique()

In [None]:
# count of each unique values
df.member_gender.value_counts()

In [None]:
# Proportion of values in gender being null
df.member_gender.isna().sum() / df.shape[0]

In [None]:
# Percentage of null being almost 6%, 
# fillna with None
df.member_gender = df.member_gender.fillna('None')

In [None]:
# set the base color
base_color = sns.color_palette()[0]

# plot to see the distribution of the gender
sns.countplot(data=df, x='member_gender', color=base_color);

In [None]:
# Convert the datatype into category
df.member_gender = df.member_gender.astype('category')

#### Test

In [None]:
# Test if the data type of user_type and member_gender is category
df.info()

In [None]:
# unique values in member_gender
df.member_gender.unique()

### Issue 4 : Erroneous datatypes for `bike_share_for_all_trip`

### Define

Convert the datatype of `bike_share_for_all_trip` from string to bool by changing the values to
   - True, if the original value is 'Yes'
   - False otherwise
    
### Code    

In [None]:
df.bike_share_for_all_trip = (df.bike_share_for_all_trip == 'Yes')

### Test

In [None]:
# test if the data type of bike_share_for_all_trip is bool.
assert type(df.bike_share_for_all_trip[0]) is np.bool_

In [None]:
df.info()

<a id="explore"></a>

## Data Exploration

### What is the structure of your dataset?

> The dataset has **1,863,721** Ford GoBike trip entries and **16** features (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, bike_share_for_all_trip)

> Of the 16 features, 9 are numerical (2 int64, 3 Int64 and 4 float), 2 are object type, 2 are datetime, 2 are categorical and 1 is boolean

### What is/are the main feature(s) of interest in your dataset?

> The main feature of interest in this dataset is `duration_sec`, and in particular how it depends on the other features.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> I expect `duration_sec` to depend strongly on,
> - `start_station_id` and `end_station_id`, where more crowded areas may have longer trip durations.
> - Day of the week (`start_time` & `end_time`)
> - Hour of the day (`start_time`)
> - `user_type`, their age (`member_birth_year`) and gender (`member_gender`)

<a id="uni"></a>

## Univariate Exploration


In [None]:
# Choose the first tuple of RGB colors
base_color = sns.color_palette()[0]

from matplotlib import rcParams
# Specify the figure size in inches, for both X, and Y axes
rcParams['figure.figsize'] = 9,6

**Distribution of Trip duration**


Let's look at the distribution of the feature of interest, `duration_sec`

In [None]:
# set bins (extend by one full bin width so the maximum value falls inside the last bin)
bins = np.arange(0, df.duration_sec.max() + 1000, 1000)

# Plot the distribution of duration_sec
plt.hist(data=df, x='duration_sec', bins=bins)

# title and labels
plt.title('Distribution of Trip duration in seconds')
plt.xlabel('Duration in seconds')
plt.ylabel('Number of Trips');

> - As almost all of the notable values lie between 0 and 10000 seconds, let's take a closer look by restricting the bins to that range.

In [None]:
# set bins
bins = np.arange(0, 10000 + 1000, 1000)

# Plot the distribution of duration_sec
plt.hist(data=df, x='duration_sec', bins=bins)

# title and labels
plt.title('Distribution of Trip duration in seconds')
plt.xlabel('Duration in seconds')
plt.ylabel('Number of Trips');

> - As the distribution is right skewed, let's view the log scale distribution.

In [None]:
# describe duration_sec
df.duration_sec.describe()

In [None]:
# Transform the describe() output to a log10 scale
np.log10(df.duration_sec.describe())

In [None]:
# set bins
log_binsize = 0.05
log_bins = 10 ** np.arange(1.5, 5 + log_binsize, log_binsize)

# Plot the distribution of duration_sec
plt.hist(data=df, x='duration_sec', bins=log_bins)

# log scale distribution
plt.xscale('log')

# set ticks locations for x axis
plt.xticks([50, 100, 250, 500, 1e3, 2e3, 5e3, 1e4, 2e4, 3e4], [50, 100, 250, 500, '1k', '2k', '5k', '10k', '20k', '30k'])

# title and labels
plt.title('Log scale Distribution of Trip duration in seconds')
plt.xlabel('Duration in seconds')
plt.ylabel('Number of Trips');

> - The minimum trip duration is 61 seconds, while the maximum is 86366 seconds.
> - However, trip durations are mostly concentrated in the range of 250 - 1500 seconds, with a peak around 600 seconds.
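
The log-scale binning used above can be checked on synthetic data. This is only a sketch: the durations are drawn from a made-up lognormal distribution (parameters chosen arbitrarily), not from the trip data. It shows that bin edges spaced by 0.05 in log10 units each span the same ratio, which is what makes a right-skewed distribution readable on a log axis.

```python
import numpy as np

# Synthetic right-skewed durations in seconds (not the real trips)
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=np.log(600), sigma=0.7, size=10_000)

# Bin edges equally spaced in log10 units, as in the cell above
log_binsize = 0.05
log_bins = 10 ** np.arange(1.5, 5 + log_binsize, log_binsize)

# Count the values per bin and compute the ratio between consecutive edges
counts, edges = np.histogram(durations, bins=log_bins)
ratios = edges[1:] / edges[:-1]
print(ratios[:3])
```

Every entry of `ratios` equals `10 ** 0.05`, so each bar covers the same relative range of durations.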

Let's view the distributions of the other features that support our investigation

**Distribution of start station**

In [None]:
# set the figure size
plt.figure(figsize=(18,6))

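# set the bins (one per station ID; the +2 gives the highest ID its own bin)
bins = np.arange(0, df.start_station_id.max()+2, 1)

# Plot start_station_id distribution, dropping only rows without a start station
plt.hist(data=df.dropna(subset=['start_station_id']), x='start_station_id', bins=bins)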

# set tick names
x_ticks = np.arange(0, df.start_station_id.max()+2, 10)
plt.xticks(x_ticks)
plt.yticks(np.arange(0, 40000+1, 5000), [0, '5k', '10k', '15k', '20k', '25k', '30k', '35k', '40k'])

# title and labels
plt.title('Distribution of trips starting at Station')
plt.xlabel('Start Station ID')
plt.ylabel('Number of trips starting');

**Distribution of end station**

In [None]:
# set the figure size
plt.figure(figsize=(18,6))

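# set the bins (one per station ID; the +2 gives the highest ID its own bin)
bins = np.arange(0, df.end_station_id.max()+2, 1)

# Plot end_station_id distribution, dropping only rows without an end station
plt.hist(data=df.dropna(subset=['end_station_id']), x='end_station_id', bins=bins)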

# set tick names
x_ticks = np.arange(0, df.end_station_id.max()+2, 10)
plt.xticks(x_ticks)
plt.yticks(np.arange(0, 50000+1, 5000), [0, '5k', '10k', '15k', '20k', '25k', '30k', '35k', '40k', '45k', '50k'])

# title and labels
plt.title('Distribution of trips ending at Station')
plt.xlabel('End Station ID')
plt.ylabel('Number of trips ending');

> We can see that the distributions of trips starting and ending at each station are almost the same, and that the most used stations have IDs below 90

In [None]:
# busiest route
df[['start_station_id','end_station_id']].dropna().value_counts()

> - The route with the **highest number of trips** is from (start_station_id) **15** to (end_station_id) **6**
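
As a small illustration of how the busiest route is found, the same count can be written as a group-by over the start/end pair. The `trips` frame below is made up for the example, and `route_counts` and `busiest_route` are hypothetical names.

```python
import pandas as pd

# Made-up trips: three rides from station 15 to 6, one ride on each other route
trips = pd.DataFrame({
    'start_station_id': [15, 15, 15, 6, 30, 15],
    'end_station_id':   [6, 6, 6, 15, 15, 30],
})

# Number of trips per (start, end) pair, most frequent first
route_counts = (trips.groupby(['start_station_id', 'end_station_id'])
                     .size()
                     .sort_values(ascending=False))

# The first index entry is the (start, end) pair with the most trips
busiest_route = route_counts.index[0]
print(busiest_route, route_counts.iloc[0])
```

Note that the pair is directional: (15, 6) and (6, 15) are counted as different routes.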

**Distribution of Day of the week**

In order to view the distribution of day of the week, let's use `start_time` to find the day of the week.
Store it in separate column named `start_day`

In [None]:
# create a new column start_day
df['start_day'] = df.start_time.dt.dayofweek

# convert the newly created into categorical type
df['start_day'] = df['start_day'].astype('category')

# Group by start_day to get the count
day_count = df.groupby('start_day').count().start_time

In [None]:
# Plot the day distribution on bar
day_count.plot(kind='bar')

# get the current tick locations and labels
locs, labels = plt.xticks() 

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):

    # get the count for this bar by position
    count = day_count.iloc[int(loc)]

    # print the annotation just above the top of the bar
    plt.text(loc, count+5000, count, ha = 'center', color = 'black')
    
# set ticks    
labels = ['Mon','Tue','Wed','Thurs','Fri','Sat','Sun']
plt.xticks(range(7), labels=labels, rotation=0)
plt.yticks(np.arange(0, 400000, 50000), [0, '50k', '100k', '150k', '200k', '250k', '300k', '350k'])    

# title and labels
plt.title('Distribution of Trips on Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Trips');

> - **Weekdays** have more trips than **weekends**
> - In particular, the trip counts for the three middle days of the week (Tue, Wed, Thurs) range between 314k and 320k
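
To make the day codes concrete: pandas' `dt.dayofweek` numbers Monday as 0 and Sunday as 6, which is why the tick labels start at 'Mon'. A minimal sketch on three made-up timestamps (not rows from the dataset); the `is_weekend` flag is an illustrative addition.

```python
import pandas as pd

# Three made-up start times: a Monday, a Saturday and a Sunday in January 2018
start_time = pd.Series(pd.to_datetime([
    '2018-01-01 08:15', '2018-01-06 14:00', '2018-01-07 09:30',
]))

day_code = start_time.dt.dayofweek    # 0 = Monday ... 6 = Sunday
day_name = start_time.dt.day_name()   # full English day names
is_weekend = day_code >= 5            # Saturday or Sunday

print(pd.DataFrame({'code': day_code, 'name': day_name, 'weekend': is_weekend}))
```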

**Distribution of Time of the Day**

In order to view the distribution of Time of the Day, let's use `start_time` to find the hour of the day.

In [None]:
# create separate column for start and end hour
df['start_hour'] = df.start_time.dt.hour
df['end_hour'] = df.end_time.dt.hour

# groupby start_hour
hour_count = df.groupby('start_hour').count().start_time

In [None]:
# Plot the hour distribution on bar
hour_count.plot(kind='bar')
    
# set ticks
plt.xticks(rotation=0)
plt.yticks(np.arange(0, 250000, 50000), [0, '50k', '100k', '150k', '200k'])

# title and labels
plt.title('Distribution of Trips based on Hour')
plt.xlabel('Hour')
plt.ylabel('Number of Trips');

> - We can see that the peak hours are **8am - 9am** and **5pm - 6pm**

**Distribution of User type**

In [None]:
# calculating the type_counts
user_type_counts = df['user_type'].value_counts()

# Get the unique values of the `user_type` column, in the decreasing order of the frequency.
user_type_order = user_type_counts.index

# Plot user_type on bar
sns.countplot(data=df, x='user_type', color=base_color, order=user_type_order);

# get the current tick locations and labels
locs, labels = plt.xticks() 

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):

    # get the text property for the label to get the correct count
    count = user_type_counts[label.get_text()]    

    # print the annotation just above the top of the bar
    plt.text(loc, count+10000, count, ha = 'center', color = 'black')


# set tick names for y axis
plt.yticks(np.arange(0, 1800000, 200000), [0, '200k', '400k', '600k', '800k', '1M', '1.2M', '1.4M', '1.6M'])

# title and labels
plt.title('Distribution of User Type')
plt.xlabel('User Type')
plt.ylabel('Number of Trips');

> **Subscribers** have made about **1.3M** more trips than **Customers**, which is a huge difference
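
Besides the absolute difference, the split can be expressed as shares of all trips. A minimal sketch on a made-up series of ten trips (the 8:2 split is invented for illustration, not the real proportion):

```python
import pandas as pd

# Made-up user types for ten trips
user_type = pd.Series(['Subscriber'] * 8 + ['Customer'] * 2)

counts = user_type.value_counts()                # absolute trip counts
share = user_type.value_counts(normalize=True)   # fraction of all trips
difference = counts['Subscriber'] - counts['Customer']

print(counts, share, difference, sep='\n')
```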

**Distribution of Gender**

In [None]:
# calculating the gender_counts
gender_counts = df['member_gender'].value_counts()

# Get the unique values of the `member_gender` column, in the decreasing order of the frequency.
gender_order = gender_counts.index

# Plot member_gender
sns.countplot(data=df, x='member_gender', color=base_color, order=gender_order);

# get the current tick locations and labels
locs, labels = plt.xticks() 

# loop through each pair of locations and labels
for loc, label in zip(locs, labels):

    # get the text property for the label to get the correct count
    count = gender_counts[label.get_text()]    

    # print the annotation just above the top of the bar
    plt.text(loc, count+10000, count, ha = 'center', color = 'black')


# set tick names for y axis
plt.yticks(np.arange(0, 1800000, 200000), [0, '200k', '400k', '600k', '800k', '1M', '1.2M', '1.4M', '1.6M'])

# title and labels
plt.title('Distribution of Gender')
plt.xlabel('Gender')
plt.ylabel('Number of Trips');

> - **Male** users have the highest number of trips, followed by **Female**
> - About 5% of the values are NaN (replaced with 'None'). They are included in the plot to check whether adding them to any of the valid gender categories would change the ranking.
> - Even if the 'None' count is added to **Other**, it remains in last place. The same applies to **Female**, which stays in second place. So, let's see whether users without any personal information show a trend of their own.

**Distribution of User Age**

In [None]:
# create age feature based on member_birth_year
# the value in age is the age of the members as of 2018. 
df['age'] = 2018 - df.member_birth_year

In [None]:
df.age.value_counts()

In [None]:
# set the figure size
plt.figure(figsize=(12,8))

# Let's plot the age of the users
# set bins
bins = np.arange(15.5, df.age.max()+1, 1)
# create x ticks
x_ticks = np.arange(10, df.age.max()+10, 10)

# Plot the age distribution, dropping only rows without an age
sns.histplot(data=df.dropna(subset=['age']), x = 'age', color=base_color, bins=bins)
plt.xticks(x_ticks)
plt.title('Distribution of User Age')
plt.xlabel('User Age')
plt.ylabel('Number of Trips');

> - We can see that most trips are made by users in the **25-35** age group, with the number of trips peaking at age **30**.
> - We can also see a few very small bumps for ages between 80 and 118
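
The age derivation, and a way to flag the implausibly high ages behind those small bumps, can be sketched as follows. The birth years are made up, and the cut-off `MAX_PLAUSIBLE_AGE = 100` is a hypothetical threshold chosen for illustration, not something taken from the dataset or its documentation.

```python
import pandas as pd

# Made-up birth years with one missing value, using the nullable integer dtype
birth_year = pd.Series([1988, 1990, pd.NA, 1900], dtype='Int64')

# Age as of 2018; the missing birth year stays missing
age = 2018 - birth_year

# Hypothetical cut-off for ages that are probably data-entry errors
MAX_PLAUSIBLE_AGE = 100
implausible = (age > MAX_PLAUSIBLE_AGE).fillna(False)

print(age.tolist(), implausible.tolist())
```

Whether such rows should be dropped or kept is a judgment call; the notebook keeps them.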

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> - The trip durations are concentrated at low values and the distribution is right skewed.
> - To understand the distribution better, a log scale transformation is applied to `duration_sec`.
> - The log scale plot shows that trip durations are mostly concentrated in the range of **250 - 1500** seconds, with a **peak around 600 seconds**, and that the distribution is **unimodal**.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> **start_station_id** and **end_station_id :**
    > - Rows with null values in these features are dropped before plotting.
    > - The plots use a larger figure size, so that individual stations are easier to read.
    
> **start_time :**
    > - **Day of the week** is derived from `start_time` and stored in `start_day`, to show how the number of trips is distributed across the week.
    > - **Hour of the day** is also derived from `start_time`, to show the busy hours of the day.
 
> **member_birth_year :**
    > - Rows with a null `member_birth_year` are dropped, so that the feature can be investigated accurately.
    > - `age` is calculated by subtracting `member_birth_year` from **2018**, which gives the age as of 2018, the year the data was collected.
    > - `age`, rather than birth year, is plotted against the number of trips, since it is easier to interpret.

> **member_gender :**    
    > - About 5% of the `member_gender` values are NaN (replaced with 'None'). They are included in the plot to check whether adding them to any of the valid gender categories would change the ranking.
    > - Even if the 'None' count is added to **Other**, it remains in last place. The same applies to **Female**, which stays in second place. So, let's see whether users without any personal information show a trend of their own.

<a id="bi"></a>

## Bivariate Exploration

> Let's look at the relationships between pairs of variables here.

**Start, End Station** and **Trip Duration**

In [None]:
# set figure size
plt.figure(figsize=(12,12))

# Group by station id, calculate mean trip duration
# (select duration_sec before mean() so non-numeric columns are not aggregated)
start_station_duration = df.groupby('start_station_id').duration_sec.mean().reset_index()
end_station_duration = df.groupby('end_station_id').duration_sec.mean().reset_index()

# plot the trend start_station_duration
plt.subplot(2,1,1)
sns.lineplot(data=start_station_duration, x='start_station_id', y='duration_sec')
plt.ylim(0,7000)

# label and title
plt.xlabel('Start Station ID')
plt.ylabel('Avg. Trip Duration (sec)')
plt.title('Avg. Trip duration starting at Station ID')

# plot the trend end_station_duration
plt.subplot(2,1,2)
sns.lineplot(data=end_station_duration, x='end_station_id', y='duration_sec')

# label and title
plt.xlabel('End Station ID')
plt.ylabel('Avg. Trip Duration (sec)')
plt.title('Avg. Trip duration ending at Station ID');

There is no visible trend: the average duration fluctuates sharply from one station ID to the next.

**Hour** and **Trip duration**

In [None]:
# Group by hour and calculate mean durations in seconds
# (select duration_sec before mean() so non-numeric columns are not aggregated)
start_hour_duration = df.groupby('start_hour').duration_sec.mean().reset_index()
end_hour_duration = df.groupby('end_hour').duration_sec.mean().reset_index()

# Line plot the trend of start and end hour
sns.lineplot(data=start_hour_duration, x='start_hour', y='duration_sec')
sns.lineplot(data=end_hour_duration, x='end_hour', y='duration_sec',color='r')

# set ticks
plt.xticks(np.arange(0,24,1));
plt.yticks(np.arange(600,2000+200,200),[600,800,'1k','1.2k','1.4k','1.6k','1.8k','2k']);

# title and labels
plt.title('Trips Start and End hour having Avg. trip duration')
plt.xlabel('Hour of the Day')
plt.ylabel('Avg. Trip duration (sec)')

# set the legend
plt.legend(labels=['Start Hour','End Hour'],  title='Hour trend');

> - Both the start-hour and end-hour trends rise as the day begins, from average durations of around 1.15k and 1k seconds respectively.
> - Both reach their peak around **3am**
> - The highest average duration is about **1.95k** seconds, for trips starting at **3am**.
> - Both trends drop sharply after 3am, hold their slope from 5am to 9am and rise steadily again between 9am and 3pm.
> - Both drop slowly after 3pm

> - The daytime hours between **9am - 3pm** have a fairly constant trip duration

**User Age** and **Trip Duration**

In [None]:
# scatter plot between user age and trip duration
plt.scatter(data=df.dropna(), x='age', y='duration_sec', alpha=0.25, marker='.');

# set ticks
plt.xticks(np.arange(0, df.age.max()+10, 10))
plt.yticks(np.arange(0, df.duration_sec.max()+10000, 10000),[0,'10k','20k','30k','40k','50k','60k','70k','80k','90k'])

# labels and title
plt.xlabel('Age in Years')
plt.ylabel('Trip Duration (sec)')
plt.title('Relationship between User Age and Trip duration');

> - We can see that the points are highly concentrated at trip durations below 10k seconds and ages below 80
> - So, let's limit both axes

In [None]:
# set figure size
plt.figure(figsize=(15,6))

plt.suptitle('Relationship between User Age and Trip duration');

# scatter 
plt.subplot(1,2,1)
# scatter plot between user age and trip duration
plt.scatter(data=df.dropna(), x='age', y='duration_sec', alpha=0.25, marker='.');

# set ticks
plt.xticks(np.arange(15, 80+5, 5))
plt.yticks(np.arange(0, 5000+500, 250))
plt.xlim(15,80)
plt.ylim(0,4000)

# labels and title
plt.xlabel('Age in Years')
plt.ylabel('Trip Duration (sec)')


# heat map
plt.subplot(1,2,2)
data = df.dropna()
plt.hist2d(data=data, x='age', y='duration_sec', cmin=10, bins=[40,500],cmap='YlGnBu',linewidths=.5)
plt.colorbar(label = 'Frequency (Trip Duration)')

# set ticks
plt.xticks(np.arange(20, 80+10, 10))
plt.yticks(np.arange(0, 5000+500, 250))
plt.xlim(15,80)
plt.ylim(0,4000)

plt.xlabel('Age in Years')
plt.ylabel('Trip Duration (sec)');

> - The most frequent bike users are aged between **25 - 40**, with an average trip duration of around 500 seconds.
> - Longer trips are made almost uniformly by all age groups, at low frequency.

In [None]:
# value counts of age
val_count = df.dropna().age.value_counts()
val_count.tail(20)

> We can see that some ages have very few trips, which can produce outliers in the averages. So, let's exclude those ages and plot the average duration by age
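
The idea of averaging only over ages with enough trips can be sketched on a made-up frame. `MIN_TRIPS = 10` mirrors the threshold used in this notebook; `trips`, `stats` and `reliable` are illustrative names.

```python
import pandas as pd

# Made-up trips: age 30 has twelve trips, age 118 has only two long ones
trips = pd.DataFrame({
    'age': [30] * 12 + [118] * 2,
    'duration_sec': [600] * 12 + [5000, 7000],
})

MIN_TRIPS = 10

# Trip count and mean duration per age in one pass
stats = trips.groupby('age')['duration_sec'].agg(['count', 'mean'])

# Keep only ages backed by at least MIN_TRIPS trips
reliable = stats[stats['count'] >= MIN_TRIPS]
print(reliable)
```

Here age 118 is dropped because its mean of 6000 seconds rests on just two trips.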

In [None]:
# ages with at least 10 trips
trip_counts = df.groupby('age').duration_sec.count()
filter_age = trip_counts[trip_counts >= 10]
filter_age.index

In [None]:
# Calculate the average trip duration for ages having at least 10 trips
avg_duration = df.groupby('age').duration_sec.mean()
response = avg_duration.loc[filter_age.index].reset_index()
response.tail()

In [None]:
# set figure size
plt.figure(figsize=(18,6))

sns.barplot(data=response, x='age', y='duration_sec', color=base_color);
plt.xticks(rotation=90)
plt.ylabel('Avg. trip duration (sec)');

**Day of the Week** and **Trip Duration**

In [None]:
# box plot between day of the week and trip duration
sns.boxplot(data = df, x = 'start_day', y = 'duration_sec',color=base_color);

# set ticks
labels = ['Mon','Tue','Wed','Thurs','Fri','Sat','Sun']
plt.xticks(range(7), labels=labels, rotation=0)

# labels and title
plt.xlabel('Day of the Week')
plt.ylabel('Trip Duration (sec)')
plt.title('Relationship between Day of the Week and Trip duration');

As most durations are concentrated at the low end, let's limit the y-axis to 2.5k seconds

In [None]:
# box plot between day of the week and trip duration
sns.boxplot(data = df, x = 'start_day', y = 'duration_sec',color=base_color);

# set ticks
labels = ['Mon','Tue','Wed','Thurs','Fri','Sat','Sun']
plt.xticks(range(7), labels=labels, rotation=0)
plt.yticks(np.arange(0, 3000+1, 500),[0,500,'1k','1.5k','2k','2.5k','3k'])

# set y limit
plt.ylim(-100,2500)

# labels and title
plt.xlabel('Day of the Week')
plt.ylabel('Trip Duration (sec)')
plt.title('Relationship between Day of the Week and Trip duration');

> **Note :** Even though the number of trips is greater on **weekdays**, the trip duration is greater on **weekends**

> The trip duration on weekdays stays almost the same
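
The weekday/weekend contrast can also be summarised numerically. The sketch below uses made-up day codes and durations and compares trip count with median duration per day type; the `day_type` column is a hypothetical helper, not a column of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up trips: start_day uses 0 = Monday ... 6 = Sunday
trips = pd.DataFrame({
    'start_day':    [0, 1, 2, 3, 4, 5, 6, 5],
    'duration_sec': [500, 520, 510, 530, 540, 900, 950, 1000],
})

# Label each trip as weekday or weekend
trips['day_type'] = np.where(trips['start_day'] >= 5, 'weekend', 'weekday')

# Count and median duration per day type
summary = trips.groupby('day_type')['duration_sec'].agg(['count', 'median'])
print(summary)
```

The median is used here because, like the box plot's centre line, it is not pulled up by a few very long trips.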

**User Type** and **Trip Duration**

In [None]:
# box plot between user type and trip duration
sns.boxplot(data = df, x = 'user_type', y = 'duration_sec',color=base_color);

# set ticks
plt.yticks(np.arange(0, df.duration_sec.max()+10000, 10000),[0,'10k','20k','30k','40k','50k','60k','70k','80k','90k'])

# labels and title
plt.xlabel('User Type')
plt.ylabel('Trip Duration (sec)')
plt.title('Relationship between User Type and Trip duration');

As most durations are concentrated at the low end, let's limit the y-axis to 3.5k seconds
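
A note on this kind of trimming: `plt.ylim` only zooms the axis, so the boxes are still computed from all trips. Filtering the rows instead would change the quartiles. A small sketch with made-up durations shows the difference:

```python
import pandas as pd

# Made-up durations with two very long trips
durations = pd.Series([300, 500, 700, 900, 1100, 20000, 60000])

# Quartiles over all trips (what the box plot shows, whatever the axis limits)
q_all = durations.quantile([0.25, 0.5, 0.75])

# Quartiles after discarding trips above 3.5k seconds
q_filtered = durations[durations <= 3500].quantile([0.25, 0.5, 0.75])

print(q_all.loc[0.5], q_filtered.loc[0.5])
```

The median moves from 900 to 700 seconds once the long trips are removed, so the two approaches are not interchangeable.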

In [None]:
# box plot between user type and trip duration
sns.boxplot(data = df, x = 'user_type', y = 'duration_sec',color=base_color);

# set ticks
plt.yticks(np.arange(0, 3500+1, 500),[0,500,'1k','1.5k','2k','2.5k','3k','3.5k'])
# set y limit
plt.ylim(-100,3500)

# labels and title
plt.xlabel('User Type')
plt.ylabel('Trip Duration (sec)')
plt.title('Relationship between User Type and Trip duration');

> **Note :** Even though the number of trips is greater for **Subscriber**, the trip duration is greater for **Customer**

**Gender** and **Trip Duration**

In [None]:
# box plot between gender and trip duration
sns.boxplot(data = df, x = 'member_gender', y = 'duration_sec',color=base_color);

# set ticks
plt.yticks(np.arange(0, df.duration_sec.max()+10000, 10000),[0,'10k','20k','30k','40k','50k','60k','70k','80k','90k'])

# labels and title
plt.xlabel('Gender')
plt.ylabel('Trip Duration (sec)')
plt.title('Relationship between Gender and Trip duration');

As most durations are concentrated at the low end, let's limit the y-axis to 3.5k seconds

In [None]:
# box plot between gender and trip duration
sns.boxplot(data = df, x = 'member_gender', y = 'duration_sec',color=base_color);

# set ticks
plt.yticks(np.arange(0, 3500+1, 500),[0,500,'1k','1.5k','2k','2.5k','3k','3.5k'])
# set y limit
plt.ylim(-100,3500)

# labels and title
plt.xlabel('Gender')
plt.ylabel('Trip Duration (sec)')
plt.title('Relationship between Gender and Trip duration');

> **Note :** The categories ordered by number of trips were,
    > - Male > Female > None(Null) > Other

> But the order by trip duration is different,
    > - None(Null) > Female > Other > Male
    
> The null values were replaced by 'None', indicating that there was no information about the user. This may be because these users took the service for a one-time long trip.  

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> - I expected that the station ID would impact the trip duration, but it has no effect on trip duration.
> - Rather, the frequency of trips is highly dependent upon the User **age**
> - The average trip duration depends on the hour at which the trip starts or ends


### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> - Even though the number of trips is greater on **weekdays**, the trip duration is greater on **weekends**
> - Likewise, the number of trips is greater for **Subscriber**, but the trip duration is greater for **Customer**
> - Users aged around 130 have a much higher trip duration (it jumps suddenly)

> - Even though the most trips are made by **Male, Female** in that order, the trip duration is greatest for **None(Null), Female** in that order, with 'Male' contributing least to the longer trip durations.
> - The null values in Gender were replaced by 'None', indicating that there was no information about the user. This may be because these users took the service for a one-time long trip.

<a id="multi"></a>

## Multivariate Exploration

**Relationship between User Type, Gender and Trip Day**

In [None]:
# group by member_gender and user_type, mean trip duration
gender_user = df.groupby(['member_gender','user_type']).duration_sec.mean().reset_index()

In [None]:
# create pivot of group by in previous step
gender_user_pivot = gender_user.pivot(index='member_gender', columns='user_type', values='duration_sec')

In [None]:
# group by trip day and user_type, mean trip duration
day_user = df.groupby(['start_day','user_type']).duration_sec.mean().reset_index()

In [None]:
# create pivot
day_user_pivot = day_user.pivot(index='start_day', columns='user_type', values='duration_sec')

In [None]:
# group by trip day, user type and gender, mean trip duration
day_gender = df.groupby(['start_day','member_gender','user_type']).duration_sec.mean().reset_index()

In [None]:
plt.figure(figsize=(18,4))

# Heat Map Gender and User Type
plt.subplot(1,2,1)
sns.heatmap(gender_user_pivot, annot = True, fmt = '.0f', cbar_kws={'label': 'Avg. Trip Duration (sec)'})
# labels and title
plt.xlabel('User Type')
plt.ylabel('Gender')
plt.title('Heat Map');

# Heat Map Weekday and UserType
plt.subplot(1,2,2)
sns.heatmap(day_user_pivot, annot = True, fmt = '.0f', cbar_kws={'label': 'Avg. Trip Duration (sec)'})
# set ticks
labels = ['Mon','Tue','Wed','Thurs','Fri','Sat','Sun']
plt.yticks(np.arange(0.5, 7.5, 1), labels=labels, rotation=0)
# labels and title
plt.xlabel('User Type')
plt.ylabel('Day of the Week')
plt.title('Heat Map');

> - **Subscriber: Females** have the highest trip duration, followed by **Other, users without personal data, Male**
> - **Customer: Users without personal data** have the highest trip duration, followed by **Female, Other, Male**

> - For both **Customer** and **Subscriber**, **weekends** have the higher trip duration, with **Customer** weekend trips being the longest.
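
The group-by followed by a pivot used for these heat maps can also be done in one step with `pivot_table`. This is an alternative sketch on made-up rows, not the approach taken in the cells above:

```python
import pandas as pd

# Made-up trips with gender, user type and duration
trips = pd.DataFrame({
    'member_gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'user_type': ['Subscriber', 'Customer', 'Subscriber',
                  'Customer', 'Subscriber', 'Customer'],
    'duration_sec': [500, 1500, 700, 1700, 700, 1900],
})

# Mean duration for every gender / user type combination
pivot = trips.pivot_table(index='member_gender', columns='user_type',
                          values='duration_sec', aggfunc='mean')
print(pivot)
```

The resulting frame has the same shape as `gender_user_pivot` and could be passed to `sns.heatmap` in the same way.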

In [None]:
# create color
color = [sns.color_palette()[0],sns.color_palette()[8]]
# create facet grid
g = sns.FacetGrid(data=day_gender, col='member_gender', hue='user_type', col_order = ['Male','Female','Other','None'], palette=color, height=4)
g.map(sns.barplot, 'start_day', 'duration_sec', order=range(7))
g.set_axis_labels("Day of the Week", "Avg. Trip Duration (sec)")
g.add_legend()
labels = ['Mon','Tue','Wed','Thurs','Fri','Sat','Sun']
g.set_xticklabels(labels, rotation=0)
g.set_yticklabels([0,500,'1k','1.5k','2k','2.5k','3k','3.5k']);

> - The **Bivariate** exploration showed that the trip duration is higher for 'None'; here we see that this is driven by **Customer** rather than Subscriber.
> - Also, for every gender category, the average trip duration is higher for **Customer** than for Subscriber.

So, on which days of the week do these user types make their longest trips?

> - Both user types make longer trips on **weekends** than on weekdays, although the number of trips is higher on weekdays

### How does the avg. trip duration vary by hour across age ranges?

In [None]:
# set the figure size
plt.figure(figsize=(15,10))

# Group by age and hour, mean trip duration
age_hour = df.groupby(['age','start_hour']).duration_sec.mean().reset_index()

# plot 
plt.scatter(data = age_hour, x='age', y='start_hour', s=age_hour['duration_sec']/20)

# set ticks
plt.xticks(np.arange(15, age_hour.age.max()+5, 5))
plt.yticks(range(24))
plt.xlim(15, age_hour.age.max()+5)

# labels and title
plt.title('Average Trip Duration associated with Age and Hour')
plt.xlabel('Age')
plt.ylabel('Hour [Day]')

# dummy series for adding legend
# marker area in the scatter above is duration / 20, so the legend markers are scaled the same way
durations = [1000, 2000, 5000, 10000, 20000]
base_color = sns.color_palette()[0]
legend_obj = []
for d in durations:
    legend_obj.append(plt.scatter([], [], s = d/20, color = base_color))
plt.legend(legend_obj, durations, loc='upper left',labelspacing=2, frameon=False, bbox_to_anchor=(1, 1.02), title='Avg. Trip Duration (sec)');

> - In the early morning, around **1am-4am**, users aged **60** take **long-duration trips**.
> - A few elderly users aged above **100** take longer trips during the **daytime (5am-5pm)**.
> - There is also a slight increase in trip duration for users aged **25-50**.
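
Each bubble is a mean over however many trips fall into that (age, hour) cell, so a large bubble can rest on very few rides. A minimal sketch of how one might check this, using an invented toy frame and a hypothetical threshold `min_trips` (neither comes from the dataset):

```python
import pandas as pd

# Toy data; the ages, hours and durations are invented for illustration only
toy = pd.DataFrame({
    'age':          [30, 30, 30, 60, 60, 118],
    'start_hour':   [8, 8, 8, 2, 2, 11],
    'duration_sec': [600, 700, 800, 5000, 300, 9000],
})

# Mean duration and number of trips behind each (age, hour) cell
cell_stats = (toy.groupby(['age', 'start_hour'])['duration_sec']
                 .agg(avg_duration='mean', n_trips='count')
                 .reset_index())

# Hypothetical cut-off: cells with fewer trips than this are treated as sparse
min_trips = 3
sparse_cells = cell_stats[cell_stats['n_trips'] < min_trips]
print(sparse_cells)
```

Cells flagged this way are worth reading with caution, since a single long ride is enough to dominate their average.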

**How does trip frequency vary by hour and day of the week for users aged 25-50?**

In [None]:
# keep only trips by users aged 25-50 (inclusive)
age_range = df[(df['age'] >=25) & (df['age'] <=50)]

In [None]:
# count trips per user type, day and hour; keep user_type as the index for the .loc lookups below
user_grp = age_range.groupby(['user_type','start_day','start_hour']).count().reset_index(['start_day','start_hour'])

# create pivot for customer
cust_pivot = user_grp.loc['Customer'].reset_index().pivot(index='start_hour', columns='start_day', values='duration_sec')

# create pivot for subscriber
subs_pivot = user_grp.loc['Subscriber'].reset_index().pivot(index='start_hour', columns='start_day', values='duration_sec')

In [None]:
# Convert counts to percentages of each day's trips (each day column sums to 100)
cust_pivot = cust_pivot*100/cust_pivot.sum()
subs_pivot = subs_pivot*100/subs_pivot.sum()

In [None]:
# set figure size
plt.figure(figsize=(17,8))

# Overall title
plt.suptitle('Percentage of frequency on each hour on the given day')

# heat map for customer
plt.subplot(1,2,1)
sns.heatmap(data=cust_pivot, cmap='YlGnBu', annot=True, annot_kws={"size": 10}, vmin=0.1 , vmax=15)

# set ticks
labels = ['Mon','Tue','Wed','Thurs','Fri','Sat','Sun']
plt.xticks(np.arange(0.5,7.5,1), labels)
plt.yticks(rotation=360)

# label and title
plt.title('Percentage of frequency for Customer trips')
plt.xlabel('Day [Week]')
plt.ylabel('Hour [Day]');

# heat map for Subscriber
plt.subplot(1,2,2)
sns.heatmap(data=subs_pivot, cmap='YlGnBu', annot=True, annot_kws={"size": 10}, vmin=0.1 , vmax=15, cbar_kws={'label': 'Percentage of trips per day (%)'})

# set ticks
plt.xticks(np.arange(0.5,7.5,1), labels)
plt.yticks(rotation=360)

# label and title
plt.title('Percentage of frequency for Subscriber trips')
plt.xlabel('Day [Week]')
plt.ylabel('');

> - In the **busy hours (8am & 5pm)** on **weekdays**, **Subscriber** trip frequency is higher than **Customer** trip frequency in the morning, while in the evening **Customer** frequency is almost equal to **Subscriber** frequency.
> - On **weekends**, trip frequency during the daytime (10am-5pm) is higher for **Customer** than for **Subscriber**.
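
The percentages in the heatmaps come from dividing each pivot table by its column totals. A minimal sketch of that normalization on invented counts (the numbers and the two day columns are illustrative assumptions):

```python
import pandas as pd

# Invented counts: rows are hours, columns are days (0 = Mon, 5 = Sat)
counts = pd.DataFrame({0: [10, 30, 60], 5: [20, 20, 10]}, index=[8, 12, 17])
counts.index.name = 'start_hour'

# counts.sum() returns one total per day column; the division aligns on column labels
pct = counts * 100 / counts.sum()
print(pct)
```

Because each day column sums to 100, the colors compare the shape of the hourly profile between user types, not the absolute number of trips.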

#### Top 20 busiest routes

In [None]:
# count trips per (start, end) station pair, keep the 20 busiest routes, and reshape into a start x end matrix
busiest_route = df[['start_station_id','end_station_id']].dropna().value_counts().nlargest(20).unstack()
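
To make the chained call above easier to follow, here is a minimal sketch of the same steps on a toy frame. The station IDs and rows are invented, and only the top 2 routes are kept so the output stays small:

```python
import pandas as pd

# Toy rides; one row has a missing start station and is dropped
rides = pd.DataFrame({
    'start_station_id': [15, 15, 15, 6, 6, 81, None],
    'end_station_id':   [6, 6, 6, 16, 16, 15, 15],
})

# Count how often each (start, end) pair occurs
route_counts = rides[['start_station_id', 'end_station_id']].dropna().value_counts()

# Keep the busiest pairs, then pivot end stations into columns
top_routes = route_counts.nlargest(2)
route_matrix = top_routes.unstack()
print(route_matrix)
```

Pairs outside the top list become `NaN` after `unstack()`, which is why most cells of the resulting heatmap are blank.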

In [None]:
# set figure size
plt.figure(figsize=(15,6))

# heat map
sns.heatmap(data=busiest_route, annot = True, annot_kws={"size": 10}, fmt = '.0f', cmap='rocket', cbar_kws={'label': 'Count of Trips'})

# set ticks
plt.xticks(rotation=360)
plt.yticks(rotation=360)

# label and title
plt.title('Top 20 Busiest Routes')
plt.ylabel('Start Station ID')
plt.xlabel('End Station ID');

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> - **Subscriber: Females** have the highest trip duration, followed by **Other, Users without personal data, Male**
> - **Customer: Users without personal data** have the highest trip duration, followed by **Female, Other, Male**
> - In the **busy hours (8am & 5pm)** on **weekdays**, **Subscriber** trip frequency is higher than **Customer** trip frequency in the morning, while in the evening **Customer** frequency is almost equal to **Subscriber** frequency.
> - On **weekends**, trip frequency during the daytime (10am-5pm) is higher for **Customer** than for **Subscriber**.
> - The top 20 busiest routes were identified, among which Route **15-6** is the busiest.

### Were there any interesting or surprising interactions between features?

> - For both **Customer** and **Subscriber**, trip durations are higher on **weekends**, and **Customer** weekend trips are the longest.
> - Surprisingly, in the early morning around **1am-4am**, users aged **60** take **long-duration trips**.
> - Interestingly, a few elderly users aged above **100** take longer trips during the **daytime (5am-5pm)**.