# An Exploration of the Los Angeles Metro Bike Share Trip Data

In 2016 the city of Los Angeles launched a large bike sharing project, where customers can pick up bikes at designated stations and drop them off at the same or at a different station. Different passes are available to meet the needs of different types of customers.

This notebook presents a brief exploration of the data gathered so far, which is released under the [Los Angeles Open Data project](https://data.lacity.org/) and hosted on Kaggle [here](https://www.kaggle.com/cityofLA/los-angeles-metro-bike-share-trip-data).

#### Import most packages I might need and set some defaults for plotting

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

#### Read in data and get summary view of available fields:

In [None]:
df = pd.read_csv('../input/metro-bike-share-trip-data.csv', low_memory=False)
df.head(5)

#### I was interested to see the frequency distribution of rides by pass type:

In [None]:
ax = df['Passholder Type'].value_counts().plot.bar();
ax.set_title('Passholder type frequencies');

#### It seems like most rides were taken by people who own monthly passes. Keep in mind that this doesn't directly tell us what share of all riders own monthly passes, just that most individual rides were taken by people who do.

#### We can also find the exact frequency values:

In [None]:
df['Passholder Type'].value_counts()
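
#### The same counts can also be expressed as shares of all rides. The sketch below uses a small made-up Series in place of the real `Passholder Type` column (the category names are placeholders), so the numbers are purely illustrative:

```python
import pandas as pd

# Hypothetical toy data standing in for df['Passholder Type']
rides = pd.Series(
    ['Monthly Pass', 'Walk-up', 'Monthly Pass', 'Flex Pass', 'Monthly Pass'],
    name='Passholder Type',
)

# normalize=True returns each category's share of all rides instead of raw counts
shares = rides.value_counts(normalize=True)
print((shares * 100).round(1))
```

#### On the real data, `df['Passholder Type'].value_counts(normalize=True)` should give the equivalent per-pass shares.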

#### Next I was curious to see how long riders use a bike on a given trip. Here is the distribution of ride durations:

In [None]:
ax = (df.Duration/60).plot.hist(bins=30)
ax.set_title('Duration [min]');

#### Whoops! From this histogram it's clear that there are some outliers. Here is a function that removes rows with values outside specified quantiles:

In [None]:
from pandas.api.types import is_numeric_dtype

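def remove_outlier(df):
    """Drop rows where any numeric column lies outside the 5%-95% quantile range."""
    low = .05
    high = .95
    # numeric_only=True skips text columns, which quantile() cannot handle
    quant_df = df.quantile([low, high], numeric_only=True)
    for name in list(df.columns):
        if is_numeric_dtype(df[name]):
            # Strict inequalities: values equal to a quantile bound are dropped too
            df = df[(df[name] > quant_df.loc[low, name]) & (df[name] < quant_df.loc[high, name])]
    return df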

#### Now I remove all the outliers from the DataFrame:

In [None]:
df = remove_outlier(df)
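
#### Because `remove_outlier` filters on every numeric column, a ride can be dropped because of a field other than its duration. If only the duration outliers matter, one alternative is to trim a single column. The helper `trim_column` and the toy data below are hypothetical, and unlike the function above it keeps values that sit exactly on a quantile bound:

```python
import pandas as pd

def trim_column(df, column, low=0.05, high=0.95):
    """Keep rows whose `column` value lies within the [low, high] quantiles."""
    lower, upper = df[column].quantile([low, high])
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]

# Made-up durations in seconds, with one very short and one very long ride
toy = pd.DataFrame({
    'Duration': [60, 300, 360, 420, 480, 540, 600, 660, 720, 86400],
    'Bike ID': range(10),
})

trimmed = trim_column(toy, 'Duration')
print(trimmed['Duration'].tolist())
```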

#### Regenerating the histogram gives a more realistic picture of ride durations:


In [None]:
ax = (df.Duration/60).plot.hist(bins=30)
ax.set_title('Duration [min]');

#### This shows that most people use the bikes for around 10 minutes, and the distribution is heavily right-skewed: very few riders use the bikes for longer than 20 minutes.

#### Riders can either do a Round Trip and drop the bike off at the same station where they found it, or drop it off at another station. Here I calculate the fraction of riders who did a Round Trip, as opposed to a One Way trip:

In [None]:
nr_route_cat = df['Trip Route Category'].count()
nr_one_way = len(df.loc[(df['Trip Route Category'] == 'One Way')])
nr_round_trip = len(df.loc[(df['Trip Route Category'] == 'Round Trip')])
print("Percentage of round trips: {}%".format(round(nr_round_trip / nr_route_cat * 100, 2)))
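
#### As a shorter variant, the share of round trips can be read off as the mean of a boolean mask. The Series below is made-up toy data. Note one difference from the cell above: the mask treats missing categories as "not a round trip" and divides by all rows, whereas `count()` ignores missing values.

```python
import pandas as pd

# Hypothetical toy data standing in for df['Trip Route Category']
routes = pd.Series(
    ['One Way', 'One Way', 'Round Trip', 'One Way'],
    name='Trip Route Category',
)

# The mean of a True/False mask is the fraction of True values
round_trip_share = (routes == 'Round Trip').mean()
print("Percentage of round trips: {}%".format(round(round_trip_share * 100, 2)))
```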

#### Interesting, most riders do One Way trips.

#### Next I wanted to see how far riders typically ride. Here I calculate the distance of each ride based on the latitude and longitude of the Starting and Ending Stations, using the haversine formula. Note that this calculation assumes the Earth is a perfect sphere. Over the relatively small distances considered here, this should be more than accurate enough!

In [None]:
# Distance between pickup and dropoff

import numpy as np

start_lat = np.deg2rad(df['Starting Station Latitude'])
start_lon = np.deg2rad(df['Starting Station Longitude'])
stop_lat = np.deg2rad(df['Ending Station Latitude'])
stop_lon = np.deg2rad(df['Ending Station Longitude'])

dlon = stop_lon - start_lon
dlat = stop_lat - start_lat

a = np.sin(dlat / 2)**2 + np.cos(start_lat) * np.cos(stop_lat) * np.sin(dlon / 2)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
# approximate radius of earth in km
R = 6373.0
distance = R * c
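
#### The same calculation can be wrapped in a function, which makes it easy to sanity-check. This is a sketch only: `haversine_km` is a name I made up, and the coordinates are arbitrary points near Los Angeles. One degree of latitude should come out at roughly 111 km.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance in km between points given in degrees (spherical Earth)."""
    lat1, lon1, lat2, lon2 = map(np.deg2rad, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Two points on the same meridian, one degree of latitude apart
one_degree = haversine_km(34.0, -118.25, 35.0, -118.25)
# Identical start and end points, as in a round trip
same_point = haversine_km(34.0, -118.25, 34.0, -118.25)
print(round(one_degree, 1), same_point)
```

#### Since it only uses NumPy functions, the function also accepts whole pandas columns, just like the cell above.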

#### I only want to consider One Way trips; otherwise the 4.13% of rides that are round trips would show up as zero distances. Of course this does not give the exact distance the riders travelled, but it can serve as a proxy for it.

#### Here I remove all data with a distance of zero km, i.e. the round trip data:

In [None]:
distance_new = pd.DataFrame(distance)
distance_red = distance_new.loc[(distance_new[0] > 0)][0]

#### Now I plot the distribution of One Way trip distances, or more accurately the distance between pickup and dropoff locations:

In [None]:
ax = distance_red.plot.hist(bins=30, title='One-way trip distances')
ax.set_xlabel('Distance [km]');

#### Based on the duration and distance distributions calculated above, it seems that most riders ride for around 10 minutes and cover a distance of around 1 km. That's a pretty leisurely ride!

#### We can calculate some more detailed summary statistics of those distributions:

In [None]:
print("Trip duration summary statistics:")
print((df.Duration/60).describe())
print('\n')
print("Trip distance proxy summary statistics:")
print((distance_red).describe())

#### Note that the `count` values are different for the two distributions. That's because the distance proxy only includes rides with a non-zero distance, so round trips are left out, while the duration statistics cover every ride that remains after outlier removal.

#### Also note that above, Duration is divided by 60 to give values in minutes rather than seconds, the unit of the raw data. The distance proxy is given in km.

#### A velocity can be calculated by dividing distance by time. Let's calculate a mean velocity based on the mean distance and mean duration calculated above. Keep in mind that this is a proxy of a proxy, since the distance between pickup and dropoff locations is only a proxy for the total distance travelled. We can also plot a distribution of this velocity proxy. Both the mean calculation and the distribution are shown below:

In [None]:
velocity_mean = (distance_red).mean() / (df.Duration/60).mean() * 60  # 60 factor converts from km/min to km/hr
print("Mean velocity of riders: {} km/hr".format(round(velocity_mean,1)))
ax = (distance_red / (df.Duration/60) * 60).plot.hist(bins=30, title='Velocity proxy')
ax.set_xlabel('Velocity [km/hr]');
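
#### One caveat with the mean calculation: the mean duration covers all rides, while the mean distance covers only One Way rides. The toy example below (made-up numbers, hypothetical variable names) shows how pandas aligns the two Series on their index, and how restricting the durations to the same rides changes the result:

```python
import pandas as pd

# Durations in seconds for three rides; ride 1 is a long round trip
duration_s = pd.Series([600, 3000, 1200], index=[0, 1, 2])
# Distances in km exist only for the one-way rides 0 and 2
distance_km = pd.Series([1.5, 3.0], index=[0, 2])

# Division aligns on the index; ride 1 has no distance, so it becomes NaN
speed_kmh = (distance_km / (duration_s / 3600)).dropna()

# Ratio of means using every duration vs. only the matching one-way rides
all_rides_speed = distance_km.mean() / (duration_s.mean() / 3600)
matched_speed = distance_km.mean() / (duration_s.loc[distance_km.index].mean() / 3600)
print(speed_kmh.tolist(), all_rides_speed, matched_speed)
```

#### Including the round trip duration pulls the mean velocity down, so matching the two Series on the same rides may give a fairer estimate.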

#### This brief exploration showed some interesting information on how the bike sharing project is being utilized. Future work is to do some analysis of the utilization of the system over time. Stay tuned!