### Get going by asking the following questions and looking for the answers with some code and plots:
    Can you count something interesting?
    Can you find some trends (high, low, increase, decrease, anomalies)?
    Can you make a bar plot or a histogram?
    Can you compare two related quantities?
    Can you make a scatterplot?
    Can you make a time-series plot?

### Having made these plots:
    What are some insights you get from them? 
    Do you see any correlations? 
    Is there a hypothesis you would like to investigate further? 
    What other questions do they lead you to ask?

### By now you’ve asked a bunch of questions, and found some neat insights. 
    Is there an interesting narrative, a way of presenting the insights using text and plots from the above, 
        that tells a compelling story? 
    As you work out this story, what are some other trends/relationships you think will make it more complete?



In [None]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from glob import glob

import seaborn as sns
sns.set()


## Load Trip Data Sets

In [None]:
print('Loading data...')
try:

    file_path_slug = '../../datasets/bayareabikeshare/*_trip_data.csv'

    # glob all files
    file_list = glob(file_path_slug)

    trips = pd.DataFrame()

    counter = 1
    chunks = []
    for file in file_list:
        print('\nReading file [' + str(counter) + ' of ' + str(len(file_list)) + ']\t ' + str(file))

        # import file in chunks to temp DataFrame
        print('\treading chunks...')
        for chunk in pd.read_csv(file, chunksize=10000, iterator=True):

            chunk = chunk.set_index('Trip ID')

            # standardize column names - 201402 dataset uses 'Subscription Type' in place of 'Subscriber Type'
            chunk.columns = ['Duration', 'Start Date', 'Start Station', 'Start Terminal', 'End Date', 
                             'End Station', 'End Terminal', 'Bike #', 'Subscriber Type', 'Zip Code']

            chunks.append(chunk)

        print('\tfinished file!')
        counter += 1

    # status = pd.concat(chunks, ignore_index=True)
    trips = pd.concat(chunks)


    print('data loaded successfully!')
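except Exception as err:
    # report the underlying error instead of silently hiding it
    print('oops... something went wrong loading the data :(')
    print('\t' + str(err))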
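
The loader above standardizes column names by overwriting the whole list, which relies on every file having its columns in the same order. A minimal sketch of a name-based alternative using `rename`, run on two tiny made-up CSV snippets (`csv_a` and `csv_b` are invented for illustration and are not the real trip files):

```python
import io
import pandas as pd

# two invented CSV snippets that differ only in one column name
csv_a = "Trip ID,Duration,Subscription Type\n1,300,Subscriber\n"
csv_b = "Trip ID,Duration,Subscriber Type\n2,450,Customer\n"

frames = []
for text in (csv_a, csv_b):
    frame = pd.read_csv(io.StringIO(text), index_col='Trip ID')
    # rename by name, so column order does not matter; files that
    # already use 'Subscriber Type' pass through unchanged
    frame = frame.rename(columns={'Subscription Type': 'Subscriber Type'})
    frames.append(frame)

combined = pd.concat(frames)
print(combined)
```

Whether this is preferable depends on how consistent the real files are; it only guards against the one naming difference noted in the loader's comment.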


### Initial exploration

In [None]:
trips.head(10)

In [None]:
len(trips.index)

In [None]:
trips.info()

In [None]:
trips.describe()

## Let's clean up the data set a bit

In [None]:
print('Data cleanup started...')

#   cleanup column names
print('\tcleaning up column names...')
new_cols = []
for col in trips.columns:
    new_col = col.replace(' ', '_').lower()
    new_cols.append(new_col)
trips.columns = new_cols

#   extract columns we want to keep
print('Subsetting to useful columns...')
important_cols = ['duration', 'start_date', 'start_terminal', 'end_date', 'end_terminal', 'bike_#', 'subscriber_type', 'zip_code']
trips = trips[important_cols]


#   create duration_minutes column
trips['duration_minutes'] = trips['duration'] / 60.

#   convert end and start dates to datetime objects
print('\tconverting end and start dates to datetime objects...')
trips['start_date'] = pd.to_datetime(trips['start_date'], format="%m/%d/%Y %H:%M")
trips['end_date'] = pd.to_datetime(trips['end_date'], format="%m/%d/%Y %H:%M")
print('\t\tfinished!')

#   create a start and end hour trip column
print('\tcreating start and end hour columns...')
trips['start_hour'] = trips['start_date'].dt.hour
trips['end_hour'] = trips['end_date'].dt.hour
print('\t\tfinished!')

# convert zip codes to strings so they can be cleaned before the numeric conversion
# trips['zip_code'] = pd.to_numeric(trips['zip_code'], errors='coerce')
trips['zip_code'] = trips['zip_code'].astype(str)

print('\tfinished!')

In [None]:
def clean_zipcode(item):
    if len(item) != 5:
        # keep only the part before any '-' (ZIP+4 suffix) or '.' (float artifact such as '94107.0')
        result = item.split('-')[0].split('.')[0]

        # if fewer than 5 characters remain, return 'NaN'
        if len(result) < 5:
            result = 'NaN'
        else:
            # if more than 5 characters remain, keep only the first 5
            result = result[:5]
    else:
        result = item

    # make sure result is all digits
    if result.isdigit():
        return result
    else:
        return 'NaN'

trips.zip_code = trips.zip_code.apply(clean_zipcode)
trips['zip_code'] = pd.to_numeric(trips['zip_code'], errors='coerce')
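
To make the intent of the zip code cleanup easier to check, here is a small self-contained sketch of the same idea written with a regular expression. The helper name `normalize_zip` and the sample inputs are hypothetical and not part of the notebook; it returns `None` rather than the string `'NaN'` for values that cannot be cleaned.

```python
import re

def normalize_zip(raw):
    """Hypothetical helper: return the leading 5 digits of a zip code, or None."""
    match = re.match(r"\d{5}", str(raw).strip())
    return match.group(0) if match else None

# invented sample values covering the cases handled above
for raw in ['94107', '94107-1234', '94107.0', '9410', 'nan']:
    print(repr(raw), '->', normalize_zip(raw))
```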


In [None]:
# look at unique values in each column
print('#' * 80)
print('#\tColumns and unique values')
detail_cols = ['start_terminal', 'end_terminal', 'bike_#', 'subscriber_type', 'zip_code']
for col in detail_cols:
    print('Column : ' + col + '\t' + str(len(pd.unique(trips[col]))))
    print(np.sort(pd.unique(trips[col])))
#     print(pd.unique(trips[col]))
    print()

In [None]:
trips.info()

## Initial Visual investigation

In [None]:
trips.plot(kind='scatter', x='start_hour', y='duration_minutes')
plt.title('Trip Duration by Start Hour')
plt.xlabel('Start Hour')
plt.ylabel('Duration in Minutes')
plt.show()

Whoa... we have found a pretty significant outlier in this data set: someone had a bike for nearly 300,000 minutes (that is around 200 days!), and we also see some trips approaching 50,000 minutes (roughly 35 days).

Let's first remove the largest outlier and take a look at a histogram of ride duration to see the spread of the data.
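
As a quick sanity check on the day counts quoted above, the conversion from minutes to days is just a division by 1,440. The round inputs below are the approximate values read off the plot, not exact figures from the data, and `minutes_to_days` is a helper introduced only for this check.

```python
MINUTES_PER_DAY = 60 * 24  # 1,440 minutes in a day

def minutes_to_days(minutes):
    """Convert a duration in minutes to days."""
    return minutes / MINUTES_PER_DAY

# approximate values read off the scatterplot
for minutes in (300_000, 50_000):
    print(f"{minutes:,} minutes is about {minutes_to_days(minutes):.1f} days")
```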

In [None]:
# our data set shows duration in seconds, here are some handy conversions
second = 1
minute = second * 60
hour = minute * 60
day = hour * 24

# prune data to exclude trips longer than 35 days
print('\tpruning data to trips no more than 35 days long...')
trips = trips[trips['duration'] <= 35 * day].copy()
print('\t\tpruned data set \'trips\' consists of %i entries' % len(trips.index))

# plot histogram of trip duration
trips['duration_minutes'].plot(kind='hist', color='r', alpha=0.25, bins=200, figsize=(20,5))
plt.title('Distribution of Trip Duration in Minutes')
plt.xlabel('Trip Duration (Minutes)')
plt.ylabel('Number of Trips')
plt.legend(loc='best')
plt.show()

Looks like we still have a pretty wide spread. Next, let's drill into trips that are no more than 1000 minutes long and see what that distribution looks like.

In [None]:
# prune data to exclude trips longer than 1000 minutes
print('\tpruning data to trips no more than 1000 minutes long...')
trips = trips[trips['duration'] <= 1000 * minute].copy()
print('\t\tpruned data set \'trips\' consists of %i entries' % len(trips.index))

# plot histogram of trip duration
trips['duration_minutes'].plot(kind='hist', color='r', alpha=0.25, bins=200, figsize=(20,5))
plt.title('Distribution of Trip Duration in Minutes')
plt.xlabel('Trip Duration (Minutes)')
plt.ylabel('Number of Trips')
plt.legend(loc='best')
plt.show()

Getting closer. Let's prune down to trips of no more than 200 minutes; that looks like it will show all the data and leave some breathing room.

In [None]:
# prune data to exclude trips longer than 200 minutes
print('\tpruning data to trips no more than 200 minutes long...')
trips = trips[trips['duration'] <= 200 * minute].copy()
print('\t\tpruned data set \'trips\' consists of %i entries' % len(trips.index))

# plot histogram of trip duration
trips['duration_minutes'].plot(kind='hist', color='r', alpha=0.25, bins=200, figsize=(20,5))
plt.title('Distribution of Trip Duration in Minutes')
plt.xlabel('Trip Duration (Minutes)')
plt.ylabel('Number of Trips')
plt.legend(loc='best')
plt.show()

Heck, let's just go for it. It seems that the vast majority of trips are less than 35 minutes, so let's take a look at the distribution of those trips.
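
Before committing to a cutoff, it can help to see how much data each candidate threshold would keep, without modifying the DataFrame. A minimal sketch on made-up durations (the values in `durations_min` are invented for illustration; the same expression should work on `trips['duration_minutes']`):

```python
import pandas as pd

# invented trip durations in minutes, including a few extreme values
durations_min = pd.Series([4, 9, 12, 18, 30, 45, 250, 1500, 60000])

# candidate cutoffs in minutes: 35 days, 1000 minutes, 200 minutes, 35 minutes
limits = [35 * 24 * 60, 1000, 200, 35]

# the mean of a boolean Series is the fraction of True values
shares = {limit: float((durations_min <= limit).mean()) for limit in limits}

for limit, share in shares.items():
    print(f"<= {limit:>6} min keeps {share:.1%} of trips")
```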

In [None]:
# prune data to exclude trips longer than 35 minutes
print('\tpruning data to trips no more than 35 minutes long...')
trips = trips[trips['duration'] <= 35 * minute].copy()
print('\t\tpruned data set \'trips\' consists of %i entries' % len(trips.index))

# plot histogram of trip duration
trips['duration_minutes'].plot(kind='hist', color='r', alpha=0.25, bins=200, figsize=(20,5))
plt.title('Distribution of Trip Duration in Minutes')
plt.xlabel('Trip Duration (Minutes)')
plt.ylabel('Number of Trips')
plt.legend(loc='best')
plt.show()

Much better! And look: after pruning the original 983,648 trips down to trips no more than 35 minutes long, we still have 942,712 trips, so we are looking at 95.8% of the data.
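
The retained share follows directly from the two trip counts quoted above:

```python
original_trips = 983648   # trips before any pruning
remaining_trips = 942712  # trips of at most 35 minutes

share = remaining_trips / original_trips
removed = original_trips - remaining_trips

print(f"{share:.1%} of trips retained, {removed:,} removed")
```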

## Deeper Investigation of Data

Now that we have a good grasp of the 'important' data, let's see if we can find any interesting trends.

### Number of rides by hour of the day

In [None]:
# plot number of trips by start hour
trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='r', figsize=(10,5))

plt.title('Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['Trips'],loc='best')
plt.show()

Now this is interesting: there are distinct spikes in rides starting during commute hours. Since many commuters start and end their trips at different terminals, let's see whether trips that start and end at different terminals ('one ways') look different from trips that start and end at the same terminal ('round trips').

## One Way and Round Trips

In [None]:
print('Pruning a DataFrame of one way trips...')
one_way_trips = trips.loc[trips['start_terminal'] != trips['end_terminal']].copy()
print('\tpruned data set now consists of %i lines' % len(one_way_trips.index))

print('Pruning a DataFrame of round trips...')
round_trips = trips.loc[trips.loc[:,'start_terminal'] == trips.loc[:,'end_terminal']].copy()
print('\tpruned data set now consists of %i lines' % len(round_trips.index))

In [None]:
# plot number of one way trips by start hour
one_way_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='b', figsize=(10,5))
plt.title('One Way Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['One Way Trips'],loc='best')
plt.show()

In [None]:
round_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='g', figsize=(10,5))
plt.title('Round Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['Round Trips'],loc='best')
plt.show()

In [None]:
# plot number of one way and round trips by start hour, side by side
ax = one_way_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar',position=0, color='b', figsize=(10,5))
round_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar',position=1, color='g', ax=ax)
plt.title('Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['One Way Trips', 'Round Trips'],loc='best')
plt.show()

We can see from these three graphs that there are definite spikes in usage during commute hours and that these come almost entirely from one way trips. The round trip rides, by contrast, are mostly taken during waking hours and even show a noticeable spike around lunch hours.

All together, it is easy to see that the VAST majority of rides taken are one way trips.
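
The "vast majority" claim can be put into numbers by computing the share of each trip type. A minimal sketch on a made-up four-row table (the `toy` DataFrame is invented for illustration; the same comparison should apply to the `start_terminal` and `end_terminal` columns of `trips`):

```python
import pandas as pd

# invented trips: three one way, one round trip
toy = pd.DataFrame({
    'start_terminal': [50, 50, 70, 61],
    'end_terminal':   [70, 50, 50, 55],
})

# a round trip starts and ends at the same terminal
is_round_trip = toy['start_terminal'] == toy['end_terminal']

# label each trip, then compute the fraction of each label
labels = is_round_trip.map({True: 'round trip', False: 'one way'})
shares = labels.value_counts(normalize=True)
print(shares)
```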

### Next Hypothesis:

Looks like we have a lot of one way trips in commute hours, and more round trips during waking hours, even a noteworthy peak right around lunch time! Next we are going to trim down the data sets again to only include trips to and from stations within San Francisco.

In [None]:
# station ID numbers that are in San Francisco
sf_stations = [ 39,41,42,45,46,47,48,49,50,51,54,55,56,57,58,59,60,61,62,63,
                64,65,66,67,68,69,70,71,72,73,74,75,76,77,82,90,91]

In [None]:
# prune each data set to include only trips where both start and end terminals are inside San Francisco
# (both conditions are combined with &, so neither filter overwrites the other)
sf_trips = trips.loc[trips['start_terminal'].isin(sf_stations)
                     & trips['end_terminal'].isin(sf_stations)].copy()

sf_one_way_trips = one_way_trips.loc[one_way_trips['start_terminal'].isin(sf_stations)
                                     & one_way_trips['end_terminal'].isin(sf_stations)].copy()

sf_round_trips = round_trips.loc[round_trips['start_terminal'].isin(sf_stations)
                                 & round_trips['end_terminal'].isin(sf_stations)].copy()
print('complete')
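
Requiring that both the start and the end terminal satisfy a condition can be expressed as a single boolean mask. A minimal sketch on made-up data (`sf_ids` and `toy` are invented for illustration and do not reflect the real station list):

```python
import pandas as pd

sf_ids = [50, 70]  # hypothetical subset of station IDs

toy = pd.DataFrame({
    'start_terminal': [50, 50, 2, 4],
    'end_terminal':   [70, 2, 70, 4],
})

start_in = toy['start_terminal'].isin(sf_ids)
end_in = toy['end_terminal'].isin(sf_ids)

both_in = toy[start_in & end_in]        # starts and ends inside the set
neither_in = toy[~start_in & ~end_in]   # starts and ends outside the set

print(both_in)
print(neither_in)
```

Trips with exactly one endpoint in the set (rows 1 and 2 here) fall into neither group, which is worth keeping in mind when comparing totals.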



### Let's visualize these San Francisco only trips

In [None]:
# plot number of San Francisco trips by start hour
sf_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='r', figsize=(10,5))
plt.title('San Francisco Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['Trips'],loc='best')
plt.show()

In [None]:
# plot number of San Francisco one way trips by start hour
sf_one_way_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='b', figsize=(10,5))
plt.title('San Francisco One Way Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['One Way Trips'],loc='best')
plt.show()

In [None]:
# plot number of San Francisco round trips by start hour
sf_round_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='g', figsize=(10,5))
plt.title('San Francisco Round Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['Round Trips'],loc='best')
plt.show()

### Visualize trips that don't include San Francisco terminals

To be thorough, let's take a quick peek at non San Francisco trips to see if these trends are similar.


In [None]:
# prune each data set to include only trips where both start and end terminals are outside San Francisco
# (both conditions are combined with &, so neither filter overwrites the other)
non_sf_trips = trips.loc[~trips['start_terminal'].isin(sf_stations)
                         & ~trips['end_terminal'].isin(sf_stations)].copy()

non_sf_one_way_trips = one_way_trips.loc[~one_way_trips['start_terminal'].isin(sf_stations)
                                         & ~one_way_trips['end_terminal'].isin(sf_stations)].copy()

non_sf_round_trips = round_trips.loc[~round_trips['start_terminal'].isin(sf_stations)
                                     & ~round_trips['end_terminal'].isin(sf_stations)].copy()
print('complete')

In [None]:
# plot number of non San Francisco trips by start hour
non_sf_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='r', figsize=(10,5))
plt.title('Non San Francisco Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['Trips'],loc='best')
plt.show()

In [None]:
# plot number of non San Francisco one way trips by start hour
non_sf_one_way_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='b', figsize=(10,5))
plt.title('Non San Francisco One Way Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['One Way Trips'],loc='best')
plt.show()

In [None]:
# plot number of non San Francisco round trips by start hour
non_sf_round_trips.groupby(['start_hour'])['duration_minutes'].count().plot(kind='bar', color='g', figsize=(10,5))
plt.title('Non San Francisco Round Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(['Round Trips'],loc='best')
plt.show()

### First Conclusion: Bay Area Bike Share is used by Commuters

Each ride in the data set records a Subscriber Type, which is either 'Subscriber', 'Customer', or 'NaN'. Subscribers are riders who sign up for monthly or yearly plans, so they are more likely than not the same commuters we have identified so far. Let's test that.

In [None]:
# plot trip start time by subscriber type
trips.groupby(['start_hour', 'subscriber_type'])['duration_minutes'].count().plot(kind='bar', color='r', figsize=(10,5))
plt.title('Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(loc='best')
plt.show()

In [None]:
# plot trip start time by subscriber type
one_way_trips.groupby(['start_hour', 'subscriber_type'])['duration_minutes'].count().plot(kind='bar', color='b', figsize=(10,5))
plt.title('One Way Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(loc='best')
plt.show()

In [None]:
# plot trip start time by subscriber type
round_trips.groupby(['start_hour', 'subscriber_type'])['duration_minutes'].count().plot(kind='bar', color='g', figsize=(10,5))
plt.title('Round Trips by Start Hour')
plt.xlabel('Start Hour')
plt.legend(loc='best')
plt.show()
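
Grouping by both `start_hour` and `subscriber_type` yields one bar per (hour, type) pair, which can be hard to read. One possible alternative is to pivot the subscriber types into columns with `unstack`, so each hour gets one row with a count per type. A minimal sketch on made-up rows (the `toy` DataFrame is invented for illustration):

```python
import pandas as pd

# invented rides: start hour and subscriber type only
toy = pd.DataFrame({
    'start_hour': [8, 8, 8, 12, 12, 17],
    'subscriber_type': ['Subscriber', 'Subscriber', 'Customer',
                        'Customer', 'Subscriber', 'Subscriber'],
})

# count rides per (hour, type), then move the types into columns;
# hour/type combinations with no rides are filled with 0
counts = (toy.groupby(['start_hour', 'subscriber_type'])
             .size()
             .unstack(fill_value=0))
print(counts)

# counts.plot(kind='bar') would then draw one bar per subscriber type
# for each hour, with a legend entry per type
```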