# Statistical Analysis of Bay Area Bike Share Data

> From our initial visual exploratory data analysis of the Bay Area Bike Share dataset, we inferred that the vast majority of trips are taken by commuters, who are subscribers.
>
> We also retain from the previous analysis that we only need to concern ourselves with trips no more than 60 minutes in duration.
>
> Predicting ridership appears fairly easy: commuters need to commute, and customers seem to mostly start or end their trips at popular tourist destinations.
>
> 1. What factors, if any, have an impact on the duration of rides for customers and subscribers?
>
> 2. What factors, if any, have an impact on the distance that customer and subscriber trips cover?
- GPS route data is not collected by Bay Area Bike Share, so we simply estimate the distance travelled by calculating the distance between the start and end terminals of each trip.
- Trips that start and end at the same terminal (or 'round trips') are defaulted to a distance of 0.0 km.
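
To make the distance estimate concrete, the straight-line distance between two terminals can be sketched with the haversine formula. This is only an illustration that assumes a spherical Earth; the notebook itself relies on geopy's ellipsoidal distance. The function name `haversine_km` and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation

def haversine_km(start, end):
    """Great-circle distance in km between two (lat, long) tuples given in degrees."""
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# hypothetical coordinates, for illustration only
terminal_a = (37.7766, -122.3955)
terminal_b = (37.7954, -122.3940)
print(round(haversine_km(terminal_a, terminal_b), 2))

# a round trip (same start and end terminal) comes out as 0.0 km
print(haversine_km(terminal_a, terminal_a))
```

Because the estimate ignores the actual route, it should be read as a lower bound on the distance ridden rather than the true distance.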

In [1]:
%matplotlib inline

import matplotlib
import numpy as np
from scipy import stats
import math
import matplotlib.pyplot as plt
import pandas as pd
from glob import glob

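# geopy package used for calculating distance
# between a pair of (lat, long) coordinate tuples
# note: vincenty was removed in geopy 2.0; geodesic is its replacement
try:
    from geopy.distance import vincenty
except ImportError:
    from geopy.distance import geodesic as vincenty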


import seaborn as sns
sns.set()

## Load Data

### Trip Data

In [2]:
print('Loading Trip Data...')

try:
    file_path_slug = '../../datasets/bayareabikeshare/*_trip_data.csv'
    file_list = glob(file_path_slug)

    trip_import = pd.DataFrame()
    
    counter = 1
    chunks = []
    
    for file in file_list:
        for chunk in pd.read_csv(file, chunksize=10000, iterator=True):
            chunk = chunk.set_index('Trip ID')
            chunk.columns = ['Duration', 'Start Date', 'Start Station', 'Start Terminal', 'End Date', 
                             'End Station', 'End Terminal', 'Bike #', 'Subscriber Type', 'Zip Code']
            chunks.append(chunk)
        print('\tFinished file! (%d of %d)' % (counter, len(file_list)))
        counter += 1

    trip_import = pd.concat(chunks)
    print('Data Loaded Successfully!')

except:
    print('oops... something went wrong importing the data :(')

Loading Trip Data...
	Finished file! (1 of 4)
	Finished file! (2 of 4)
	Finished file! (3 of 4)
	Finished file! (4 of 4)
Data Loaded Successfully!


In [3]:
trip_data = trip_import.copy()

### Weather Data

In [4]:
print('Loading Weather Data...')

try:
    file_path_slug = '../../datasets/bayareabikeshare/*_weather_data.csv'
    file_list = glob(file_path_slug)

    weather_import = pd.DataFrame()

    counter = 1
    chunks = []

    for file in file_list:
        for chunk in pd.read_csv(file, chunksize=10000, iterator=True):
            chunk.columns = ['Date', 'Max_Temperature_F', 'Mean_Temperature_F', 'Min_TemperatureF', 'Max_Dew_Point_F', 
                             'MeanDew_Point_F', 'Min_Dewpoint_F', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 
                             'Max_Sea_Level_Pressure_In', 'Mean_Sea_Level_Pressure_In', 'Min_Sea_Level_Pressure_In', 
                             'Max_Visibility_Miles', 'Mean_Visibility_Miles', 'Min_Visibility_Miles', 
                             'Max_Wind_Speed_MPH', 'Mean_Wind_Speed_MPH', 'Max_Gust_Speed_MPH', 'Precipitation_In', 
                             'Cloud_Cover', 'Events', 'Wind_Dir_Degrees', 'zip']
            chunks.append(chunk)
        print('\tfinished file! (%d of %d)'% (counter, len(file_list)))
        counter += 1

    weather_import = pd.concat(chunks)
    print('Data Loaded Successfully!')
except:
    print('oops... something went wrong loading the data :()')

Loading Weather Data...
	finished file! (1 of 4)
	finished file! (2 of 4)
	finished file! (3 of 4)
	finished file! (4 of 4)
Data Loaded Successfully!


In [5]:
weather_data = weather_import.copy()

### Station Data

In [6]:
print('Loading Station Data...')

try:
    file_path_slug = '../../datasets/bayareabikeshare/*_station_data.csv'
    file_list = glob(file_path_slug)

    station_import = pd.DataFrame()

    counter = 1
    chunks = []

    for file in file_list:
        for chunk in pd.read_csv(file, chunksize=10000, iterator=True):
            chunk.columns = ['station_id', 'name', 'lat', 'long', 'dockcount', 'landmark', 'installation']            
            chunks.append(chunk)
        print('\tFinished file! (%d of %d)' % (counter, len(file_list)))
        counter += 1

    station_import = pd.concat(chunks)
    print('Data Loaded Successfully!')
except:
    print('oops... something went wrong importing the data :(')

Loading Station Data...
	Finished file! (1 of 4)
	Finished file! (2 of 4)
	Finished file! (3 of 4)
	Finished file! (4 of 4)
Data Loaded Successfully!


In [7]:
station_data = station_import.copy()

## Cleaning Data

### Trip Data

In [8]:
# our data set shows duration in seconds, here are some handy conversions
second = 1
minute = second * 60
hour = minute * 60

# zipcodes are all over the place: keep only valid 5 digit zipcodes and replace all others with 'NaN'
def clean_zipcode(item):
    if len(item) != 5:
        # keep only the part before any '-' (e.g. a ZIP+4 suffix)
        result = item.split('-')[0]
        # then keep only the part before any '.' (e.g. float formatting)
        result = result.split('.')[0]
        # if len of result is less than 5, return 'NaN'
        if len(result) < 5:
            result = 'NaN'
        else:
            # if len of result is greater than 5, take the first 5 characters
            result = result[:5]
    else:
        result = item
    # make sure result is all digits
    if result.isdigit():
        return result
    else:
        return 'NaN'
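
The same idea can be sketched more compactly with a regular expression. This is an illustrative alternative rather than the notebook's method: `clean_zipcode_sketch` is a hypothetical name, and it is only checked here against the handful of sample inputs shown.

```python
import re

def clean_zipcode_sketch(item):
    """Return the leading 5 digit zipcode as a string, or 'NaN' if there is none."""
    # drop anything after the first '-' or '.', e.g. '94107-1234' or '94107.0'
    head = re.split(r'[-.]', str(item), maxsplit=1)[0]
    # accept only values that start with five ASCII digits
    match = re.match(r'[0-9]{5}', head)
    return match.group(0) if match else 'NaN'

for raw in ['94107', '94107-1234', '94107.0', '9410', 'nan']:
    print(raw, '->', clean_zipcode_sketch(raw))
```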

In [9]:
print('Trip Data Cleanup Started...')

# cleanup column names
print('\tcleaning column names')
new_cols = []
for col in trip_data.columns:
    new_cols.append(col.replace(' ', '_').lower())
trip_data.columns = new_cols

# extract columns we want to keep
print('\tsubsetting to useful columns')
important_cols = ['duration', 'start_date', 'start_terminal', 'end_date', 'end_terminal', 'bike_#', 'subscriber_type', 'zip_code']
trip_data = trip_data[important_cols]

# create duration minutes column
print('\tcreating a duration_minutes column')
trip_data['duration_minutes'] = trip_data['duration'] / 60.0

# convert end and start dates to datetime objects
print('\tconverting end and start dates to datetime objects')
trip_data['start_date'] = pd.to_datetime(trip_data['start_date'], format="%m/%d/%Y %H:%M")
trip_data['end_date']   = pd.to_datetime(trip_data['end_date'],   format="%m/%d/%Y %H:%M")

# create trip date and day-of-week columns
print('\tcreating trip_date and trip_dow columns')
trip_data['trip_date']  = trip_data['start_date'].dt.date
trip_data['trip_dow']  = trip_data['start_date'].dt.weekday
# dt.weekday_name was removed in pandas 1.0; dt.day_name() returns the same labels
trip_data['trip_day']  = trip_data['start_date'].dt.day_name()

print('\tcreating start_hour and end_hour columns')
trip_data['start_hour'] = trip_data['start_date'].dt.hour
trip_data['end_hour']   = trip_data['end_date'].dt.hour

# convert and clean zipcodes
print('\tcleaning zipcodes')
trip_data['zip_code'] = trip_data['zip_code'].astype(str)
trip_data.zip_code = trip_data.zip_code.apply(clean_zipcode)
trip_data['zip_code'] = pd.to_numeric(trip_data['zip_code'], errors='coerce')

# clean up data types
print('cleaning up data types')

trip_data['duration']         = trip_data['duration'].astype('float')
trip_data['start_terminal']   = trip_data['start_terminal'].astype('category')
trip_data['end_terminal']     = trip_data['end_terminal'].astype('category')
trip_data['bike_#']           = trip_data['bike_#'].astype('int')
trip_data['subscriber_type']  = trip_data['subscriber_type'].astype('category')
trip_data['zip_code']         = trip_data['zip_code'].astype('str')
trip_data['duration_minutes'] = trip_data['duration_minutes'].astype('float')
trip_data['trip_dow']         = trip_data['trip_dow'].astype('category')
trip_data['trip_day']         = trip_data['trip_day'].astype('category')

# prune data to exclude trips longer than 60 minutes
print('pruning data to trips no more than 60 minutes long...')
trip_data = trip_data[trip_data['duration_minutes'] <= 60]

# Cleanup
trip_data.sort_index(inplace=True)
print('\tpruned data set \'trip_data\' consists of %i entries' % len(trip_data.index))

print('Trip Data Cleanup complete')
trip_clean = trip_data.copy()

Trip Data Cleanup Started...
	cleaning column names
	subsetting to useful columns
	creating a duration_minutes column
	converting end and start dates to datetime objects
	creating trip_date and trip_dow columns
	creating start_hour and end_hour columns
	cleaning zipcodes
cleaning up data types
pruning data to trips no more than 60 minutes long...
	pruned data set 'trip_data' consists of 955557 entries
Trip Data Cleanup complete


### Weather Data

In [10]:
print('Weather Data Cleanup Started...')

# cleanup column names
print('\tcleaning column names')
new_cols = []
for col in weather_data.columns:
    new_cols.append(col.replace(' ', '_').lower())
weather_data.columns = new_cols

# convert dates to datetime objects
print('\tconverting dates to datetime objects')
weather_data['date'] = pd.to_datetime(weather_data['date'], format="%m/%d/%Y")

# extract columns we want to keep
print('\tsubsetting to useful columns')
important_cols = ['date', 'max_temperature_f', 'mean_temperature_f', 'min_temperaturef',
                  'max_dew_point_f', 'meandew_point_f', 'min_dewpoint_f',
                  'max_wind_speed_mph', 'mean_wind_speed_mph', 'max_gust_speed_mph',
                  'precipitation_in', 'cloud_cover', 'events', 'zip']
weather_data = weather_data[important_cols]

# correct min_temperaturef column name to min_temperature_f
weather_data.rename(columns={'min_temperaturef': 'min_temperature_f'}, inplace=True)

# cleanup and set date as index
weather_data.set_index('date', inplace=True)
weather_data.sort_index(inplace=True)

# cleanup precipitation data to be all float values
weather_data['precipitation_in'] = pd.to_numeric(weather_data['precipitation_in'], errors='coerce')

# we only want San Francisco Weather information, zipcode 94107
weather_data = weather_data[weather_data.zip == 94107]

print('Weather Data Cleanup complete')
weather_clean = weather_data.copy()

Weather Data Cleanup Started...
	cleaning column names
	converting dates to datetime objects
	subsetting to useful columns
Weather Data Cleanup complete


### Station Data

In [11]:
def label_zip(row):
    if row['landmark'] == 'San Francisco':
        return '94107'
    if row['landmark'] == 'Redwood City':
        return '94063'
    if row['landmark'] == 'Palo Alto':
        return '94301'
    if row['landmark'] == 'Mountain View':
        return '94041'
    if row['landmark'] == 'San Jose':
        return '95113'
    return '99999'

def make_lat_long(row):
    lat = row['lat']
    long = row['long']
    return (lat, long)
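
The chain of `if` statements in `label_zip` can also be written as a dictionary lookup. The sketch below is an illustrative alternative, with the hypothetical name `label_zip_lookup`; it accepts anything indexable by `'landmark'`, so a plain dict stands in for a DataFrame row here.

```python
# landmark -> zipcode pairs copied from label_zip above
LANDMARK_ZIP = {
    'San Francisco': '94107',
    'Redwood City': '94063',
    'Palo Alto': '94301',
    'Mountain View': '94041',
    'San Jose': '95113',
}

def label_zip_lookup(row):
    """Dictionary-based equivalent of label_zip; unknown landmarks map to '99999'."""
    return LANDMARK_ZIP.get(row['landmark'], '99999')

print(label_zip_lookup({'landmark': 'Palo Alto'}))
print(label_zip_lookup({'landmark': 'Oakland'}))
```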

In [12]:
station_data = station_import.copy()

# remove duplicates
print('remove dulplicates')
station_data.drop_duplicates(keep='first', inplace=True)
station_data.dropna(how='all', inplace=True)

# set datatype for each column
print('set datatype for each column')
station_data['station_id']   = station_data['station_id'].astype('int')
station_data['name']         = station_data['name'].astype('str')
station_data['lat']          = station_data['lat'].astype('float')
station_data['long']         = station_data['long'].astype('float')
station_data['landmark']     = station_data['landmark'].astype('category')

# add a zipcode column for later comparison with weather data
station_data['zip_code'] = station_data.apply(label_zip, axis=1)
# station_data['zip_code'] = station_data['landmark'].astype('str')

# create lat,lon tuple column
station_data['lat_long'] = station_data.apply(make_lat_long, axis=1)

# reindex to remove some remaining duplicates
print('correcting index')
station_data.reset_index(inplace=True)
station_data.drop_duplicates(['station_id', 'installation'], keep='first', inplace=True)
station_data.set_index('station_id', inplace=True)
station_data.sort_index(inplace=True)
del station_data['index']

station_clean = station_data.copy()
print('Cleaning complete!')
station_clean.info()

remove dulplicates
set datatype for each column
correcting index
Cleaning complete!
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 2 to 91
Data columns (total 8 columns):
name            77 non-null object
lat             77 non-null float64
long            77 non-null float64
dockcount       77 non-null float64
landmark        77 non-null category
installation    77 non-null object
zip_code        77 non-null object
lat_long        77 non-null object
dtypes: category(1), float64(3), object(4)
memory usage: 5.1+ KB


## Appending Distance Data to Trips

In [13]:
def route_distance(row):

    # round trips are defaulting to zero km
    if row['start_terminal'] == row['end_terminal']:
        dist = 0.0
    else:
        # sloppy lookup, uses most recent station coordinates
        # does not correctly account for stations that are relocated over time
        start_gps = station_clean.loc[row['start_terminal']]['lat_long']
        end_gps = station_clean.loc[row['end_terminal']]['lat_long']

        # a station id may match several rows; keep the last one
        if isinstance(start_gps, pd.Series):
            start_gps = start_gps.iloc[-1]
        if isinstance(end_gps, pd.Series):
            end_gps = end_gps.iloc[-1]
        try:
            # .km gives the distance in kilometers as a float
            dist = vincenty(start_gps, end_gps).km
        except Exception:
            # use a real NaN so the column stays numeric
            dist = np.nan
    return dist
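
With only 77 stations and 955,557 trips, most trips repeat a station pair that has already been seen. One possible optimisation, sketched here under assumptions, is to compute each pair's distance once and reuse it. The names `make_cached_distance`, `coords` and `toy_distance` are hypothetical, and the toy distance function only stands in for a real geodesic calculation.

```python
from functools import lru_cache

def make_cached_distance(coords, distance_fn):
    """Return a function (start_id, end_id) -> distance that computes each pair only once.

    coords maps a station id to a (lat, long) tuple; distance_fn takes two such tuples.
    """
    @lru_cache(maxsize=None)
    def cached(start_id, end_id):
        # round trips default to zero, as in route_distance
        if start_id == end_id:
            return 0.0
        return distance_fn(coords[start_id], coords[end_id])
    return cached

# toy stand-ins, for illustration only
calls = []
def toy_distance(a, b):
    calls.append((a, b))
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

coords = {2: (0.0, 0.0), 3: (1.0, 2.0)}
distance = make_cached_distance(coords, toy_distance)
trips = [(2, 3), (2, 3), (3, 3), (2, 3)]
print([distance(s, e) for s, e in trips])
print(len(calls))  # number of times toy_distance actually ran
```

In the notebook, `distance_fn` could plausibly wrap the geopy call already imported above, with `coords` built from the `lat_long` column of `station_clean`.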
    

In [14]:
trip_clean['distance_km'] = trip_clean.apply(route_distance, axis=1)

## Splitting up Rainy and Dry Days

In [15]:
# split up rainy days and dry days
# note: `-series > 0.0` negates the values before comparing, so use ~ to invert the boolean mask
rainy_days = weather_clean[ weather_clean['precipitation_in'] > 0.0].reset_index()
dry_days =   weather_clean[~(weather_clean['precipitation_in'] > 0.0)].reset_index()

# All trips
rainy_trips = trip_clean[ trip_clean['start_date'].dt.date.isin(rainy_days['date'].dt.date)]
dry_trips   = trip_clean[~trip_clean['start_date'].dt.date.isin(rainy_days['date'].dt.date)]

# Customer Trips
customer_rainy_trips = rainy_trips[rainy_trips.subscriber_type == 'Customer']
customer_dry_trips = dry_trips[dry_trips.subscriber_type == 'Customer']

# Subscriber Trips
subscriber_rainy_trips = rainy_trips[rainy_trips.subscriber_type == 'Subscriber']
subscriber_dry_trips = dry_trips[dry_trips.subscriber_type == 'Subscriber']
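
The split logic can be illustrated on a toy example using only the standard library. All dates, precipitation values and trip times below are made up; the point is only that a trip is rainy when its start date is in the set of rainy dates, and dry otherwise.

```python
from datetime import date, datetime

# hypothetical daily precipitation (inches) and trip start times
precipitation_in = {date(2014, 2, 1): 0.0, date(2014, 2, 2): 0.35, date(2014, 2, 3): 0.0}
trip_starts = [
    datetime(2014, 2, 1, 8, 15),
    datetime(2014, 2, 2, 8, 20),
    datetime(2014, 2, 2, 17, 5),
    datetime(2014, 2, 3, 9, 0),
]

# a day counts as rainy when any precipitation was recorded
rainy_dates = {day for day, inches in precipitation_in.items() if inches > 0.0}

# every trip lands in exactly one of the two groups
rainy = [t for t in trip_starts if t.date() in rainy_dates]
dry = [t for t in trip_starts if t.date() not in rainy_dates]

print(len(rainy), len(dry))
```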

In [26]:
def calculate_stats(data1, data2):

    # means
    data1_mean = data1.mean()
    data2_mean = data2.mean()
    diff_mean = data1_mean - data2_mean
    print('Diff of means:\t\t', diff_mean)

    # calculate t statistic and p value with scipy
    t, p = stats.ttest_ind(data1, data2)
    print('T Test')
    print('\tt statistic:\t\t', t)
    print('\tp value:\t\t', p)
    print('')
    u, p2 = stats.mannwhitneyu(data1, data2)
    print('MannWhitneyU Test')
    print('\tu statistic:\t\t', u)
    print('\tp value:\t\t', p2)

## Analysis

### 1. Does Rain Affect the Trip Duration of Customers or of Subscribers?

> A <b>Two Sample T Test</b> is appropriate for this problem as we are testing for a difference between two sample means
- Mean ride duration on rainy days vs mean ride duration on dry days
>
>
> ##### Customer Trips
- $H_{C,0}$ : Customer Mean Trip Duration on Rainy Days = Customer Mean Trip Duration on Dry Days
- $H_{C,a}$ : Customer Mean Trip Duration on Rainy Days ≠ Customer Mean Trip Duration on Dry Days
>
> ##### Subscriber Trips
- $H_{S,0}$ : Subscriber Mean Trip Duration on Rainy Days = Subscriber Mean Trip Duration on Dry Days
- $H_{S,a}$ : Subscriber Mean Trip Duration on Rainy Days ≠ Subscriber Mean Trip Duration on Dry Days

### 1. Results

> #### Customer Trips
> No statistically significant difference was found between mean trip durations on rainy days and on dry days
- T Statistic <b>-1.7365</b>
- P Value <b>0.08248</b>, which is above the 0.05 threshold, thus we <b>cannot reject</b> $H_{C,0}$
- On average, trips are <b>0.2751 minutes</b> shorter on rainy days than on dry days, but this difference is not significant under the t test
- The Mann-Whitney U test reports a P Value of <b>0.00241</b>, which is below 0.05, so the two duration distributions may still differ even though the t test detects no difference in means

> #### Subscriber Trips
> Mean trip durations on rainy days are not equal to mean trip durations on dry days
- T Statistic <b>-11.5929</b>
- P Value <b>4.4994e-31</b>, which is well below the 0.05 threshold, thus we <b>reject</b> $H_{S,0}$
- On average, trips are <b>0.2209 minutes</b> shorter on rainy days than on dry days
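
To put these differences in more familiar units, the short sketch below converts them from minutes to seconds. The two values are the differences of means reported in this section; the dictionary name `diff_minutes` is just for illustration.

```python
# differences of means (rainy minus dry), in minutes, as reported in this section
diff_minutes = {'Customer': -0.2751, 'Subscriber': -0.2209}

for rider, minutes in diff_minutes.items():
    # 1 minute = 60 seconds
    print(rider, round(minutes * 60, 1), 'seconds')
```

Both differences amount to well under half a minute per trip, which is worth keeping in mind when judging their practical importance.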



### 1. Calculations

In [27]:
# Customer Trips Only
customer_rainy_data = customer_rainy_trips.duration_minutes
customer_dry_data = customer_dry_trips.duration_minutes

# Subscriber Trips Only
subscriber_rainy_data = subscriber_rainy_trips.duration_minutes
subscriber_dry_data = subscriber_dry_trips.duration_minutes

print('-' * 40)
print('Customer Trips')
calculate_stats(customer_rainy_data, customer_dry_data)
print()
print('-' * 40)
print('Subscriber Trips')
calculate_stats(subscriber_rainy_data, subscriber_dry_data)
print()
print('-' * 40)

----------------------------------------
Customer Trips
Diff of means:		 -0.275120342289
T Test
	t statistic:		 -1.73648012142
	p value:		 0.082481751059

MannWhitneyU Test
	u statistic:		 308648201.0
	p value:		 0.00241488522356

----------------------------------------
Subscriber Trips
Diff of means:		 -0.220882225077
T Test
	t statistic:		 -11.5929405235
	p value:		 4.49935224043e-31

MannWhitneyU Test
	u statistic:		 24951117385.0
	p value:		 1.51337751004e-27

----------------------------------------


### 2. Does Rain Affect the Trip Distance of Customers or of Subscribers?

> A <b>Two Sample T Test</b> is appropriate for this problem as we are testing for a difference between two sample means
- Mean ride distance on rainy days vs mean ride distance on dry days
>
>
> ##### Customer Trips
- $H_{C,0}$ : Customer Mean Trip Distance on Rainy Days = Customer Mean Trip Distance on Dry Days
- $H_{C,a}$ : Customer Mean Trip Distance on Rainy Days ≠ Customer Mean Trip Distance on Dry Days
>
> ##### Subscriber Trips
- $H_{S,0}$ : Subscriber Mean Trip Distance on Rainy Days = Subscriber Mean Trip Distance on Dry Days
- $H_{S,a}$ : Subscriber Mean Trip Distance on Rainy Days ≠ Subscriber Mean Trip Distance on Dry Days

### 2. Results

> #### Customer Trips
> No statistically significant difference was found between mean trip distances on rainy days and on dry days
- T Statistic <b>-0.5084</b>
- P Value <b>0.61114</b>, which is above the 0.05 threshold, thus we <b>cannot reject</b> $H_{C,0}$
- On average, trips are <b>0.00575 km</b> shorter on rainy days than on dry days, but this difference is not significant

> #### Subscriber Trips
> Mean trip distances on rainy days are not equal to mean trip distances on dry days
- T Statistic <b>-7.6683</b>
- P Value <b>1.7454e-14</b>, which is well below the 0.05 threshold, thus we <b>reject</b> $H_{S,0}$
- On average, trips are <b>0.0208 km</b> shorter on rainy days than on dry days

### 2. Calculations

In [29]:
# Customer Trips Only
customer_rainy_data = customer_rainy_trips.distance_km
customer_dry_data = customer_dry_trips.distance_km

# Subscriber Trips Only
subscriber_rainy_data = subscriber_rainy_trips.distance_km
subscriber_dry_data = subscriber_dry_trips.distance_km

print('-' * 40)
print('Customer Trips')
calculate_stats(customer_rainy_data, customer_dry_data)
print()
print('-' * 40)
print('Subscriber Trips')
calculate_stats(subscriber_rainy_data, subscriber_dry_data)
print()
print('-' * 40)

----------------------------------------
Customer Trips
Diff of means:		 -0.00574703603401
T Test
	t statistic:		 -0.508445482701
	p value:		 0.611141973357

MannWhitneyU Test
	u statistic:		 314552413.5
	p value:		 0.354134642576

----------------------------------------
Subscriber Trips
Diff of means:		 -0.0207895550312
T Test
	t statistic:		 -7.668250999
	p value:		 1.74542010787e-14

MannWhitneyU Test
	u statistic:		 25155397054.5
	p value:		 6.37062320397e-14

----------------------------------------

