# Seattle Cycle Sharing Analysis                                         

### Contents
* [Introduction](#intro)
* [Data Cleaning](#cleaning)
* [Analysis](#analysis)


# Introduction<a class="anchor" id="intro"></a>
This is the bike sharing dataset provided by Seattle's Pronto Cycle Share. The system consists of 58 bike stations with over 500 bikes in total. The data is split over 3 files:

* Trips.csv contains the trip details: departure and arrival timestamps, locations, and bike numbers.
* Station.csv contains the list of stations, a record of any updates to each station, and its bike capacity.
* weather.csv is supplementary and contains an array of meteorological measures.

In this study, special attention is given to the manipulation of time series data, particularly how cycling trends evolve over time.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import folium
import warnings
warnings.filterwarnings('ignore')

In [None]:
trips = pd.read_csv('../input/trips-fixed/trip_fixed.csv')
station = pd.read_csv('../input/cycle-share-dataset/station.csv')
weather = pd.read_csv('../input/cycle-share-dataset/weather.csv')

# Data Cleaning <a class="anchor" id="cleaning"></a>

Trips.csv was originally corrupt: the data was interrupted at line 50795, and a second copy of the dataset was appended to the end of that line, so the file could not be imported into pandas. Excel was used to manually remove the initial, duplicated part, and the result was saved as trip_fixed.csv. The file is over 40 MB and has been compressed to satisfy the platform's requirements.
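
A quick structural check can locate this kind of corruption before attempting the import. The sketch below is not the procedure used here (the file was repaired by hand in Excel); it is a minimal illustration on made-up data, and the helper name `find_malformed_rows` is hypothetical.

```python
import csv
import io

def find_malformed_rows(lines, expected_fields=None):
    """Return (line_number, field_count) for rows whose field count differs from the header."""
    reader = csv.reader(lines)
    header = next(reader)
    expected = expected_fields or len(header)
    bad = []
    # line numbers are 1-based and the header is line 1
    # (assumes no quoted field spans several physical lines)
    for line_no, row in enumerate(reader, start=2):
        if len(row) != expected:
            bad.append((line_no, len(row)))
    return bad

# toy file: the third line was cut short and a fresh header was fused onto it
sample = io.StringIO(
    "trip_id,starttime,bikeid\n"
    "1,10/13/2014 10:31,bike_a\n"
    "2,10/13/2014 10:32,trip_id,starttime,bikeid\n"
    "3,10/13/2014 10:33,bike_b\n"
)
malformed = find_malformed_rows(sample)
print(malformed)
```

For a real file, the same function could be given an open file handle (opened with `newline=''`) in place of the `StringIO` object.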

We begin by converting the trip start and stop times to timestamps, as we will need to filter and group the data on them later. The trip duration is given in seconds, so we convert it to minutes, rounded to the nearest minute. Close to 90,000 values are missing for gender and birth year. However, given the volume of data, enough remains to conduct the analysis without replacing these values.
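
As a minimal sketch of these steps on made-up data (the frame `toy` and its values are illustrative only, not taken from the dataset), the conversion and the missing-value count might look like this:

```python
import pandas as pd

# toy stand-in for the trips table; values are illustrative only
toy = pd.DataFrame({
    'starttime': ['10/13/2014 10:31', '10/13/2014 10:32'],
    'tripduration': [985.9, 926.4],   # seconds
    'gender': ['Male', None],
})

# parse with an explicit format so that month and day are not confused
toy['starttime'] = pd.to_datetime(toy['starttime'], format='%m/%d/%Y %H:%M')
# seconds -> minutes, rounded to the nearest minute
toy['tripduration'] = (toy['tripduration'] / 60).round()
# number of missing values in each column
missing = toy.isna().sum()
print(toy.dtypes)
print(missing)
```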

In [None]:
trips.info()

In [None]:
trips['starttime'] = pd.to_datetime(trips['starttime'],format='%m/%d/%Y %H:%M')
trips['stoptime'] = pd.to_datetime(trips['stoptime'],format='%m/%d/%Y %H:%M')
#trip duration converted from seconds to minutes, rounded to the nearest minute.
trips['tripduration'] = (trips['tripduration']/60).round(0)
trips.head(5)

def age(starttime, birthyear):
    """Return the rider's age in the year the trip took place."""
    #year in which the trip started
    trip_year = starttime.year
    return trip_year - birthyear
#we will replace year of birth with age for simplicity
trips['age'] = trips.apply(lambda x: age(x['starttime'],x['birthyear']),axis=1)
trips.drop('birthyear',axis=1,inplace=True)


In [None]:
trips.head()

Similarly, the weather dataset has a Date column that must be converted to a timestamp. We are particularly interested in temperature measurements and weather events, as these features are the most likely to determine whether somebody decides to ride.

In [None]:
weather.info()

From the column summary above we see many missing values in the Events column. We will replace these NaNs with the string 'None', as a missing value most likely indicates the lack of a significant weather event. In the summary below we also see a small number of rare events, as well as redundant labels such as 'Fog, Rain' and 'Rain, Thunderstorm' that duplicate an existing category. We can merge these by similarity to create larger groups.

In [None]:
weather.Events.value_counts()

In [None]:
#Date to datetime object
weather['Date'] = pd.to_datetime(weather['Date'],format='%m/%d/%Y')

def clean_weather(event):
    """
    This function merges the redundant events into
    a similar event
    """
    if event == 'Fog , Rain':
        return 'Fog-Rain'
    if event == 'Rain , Thunderstorm':
        return 'Rain-Thunderstorm'
    if event == 'Rain , Snow':
        return 'Rain-Snow'
    # since snow is very uncommon, it can be merged with Rain-Snow
    if event == 'Snow':
        return 'Rain-Snow'
    return event

#Reducing the large number of features to the ones we are most interested in.
to_keep = ['Date','Max_Temperature_F','Min_TemperatureF','Precipitation_In','Mean_Wind_Speed_MPH','Events']
#copy so that the assignments below do not act on a view of the original frame
weather = weather[to_keep].copy()

weather['Events'] = weather['Events'].fillna('None')
weather['Events'] = weather['Events'].apply(clean_weather)
#our reduced number of events
weather['Events'].value_counts()
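
The same merging can also be written as a lookup table rather than a chain of `if` statements. The sketch below runs on a made-up events column; the name `EVENT_MAP` is hypothetical and simply mirrors the cases handled by `clean_weather`.

```python
import numpy as np
import pandas as pd

# hypothetical lookup table mirroring the merges in clean_weather
EVENT_MAP = {
    'Fog , Rain': 'Fog-Rain',
    'Rain , Thunderstorm': 'Rain-Thunderstorm',
    'Rain , Snow': 'Rain-Snow',
    'Snow': 'Rain-Snow',
}

# toy events column; NaN stands for a day with no recorded event
events = pd.Series(['Rain', np.nan, 'Fog , Rain', 'Snow', 'Rain-Snow'])
# fill the gaps first, then map redundant labels onto their merged group
cleaned = events.fillna('None').replace(EVENT_MAP)
print(cleaned.value_counts())
```

Labels that do not appear in the table are left unchanged, which matches the final `return event` in the function.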

Inspecting the station data, we see a total of 58 unique stations, of which 4 have been decommissioned and 17 have been modified as of 2016. No cleaning is required here, as the data is already in the preferred format.

In [None]:
station.info()

In [None]:
station.head(5)

# Analysis <a class="anchor" id="analysis"></a>

### 1. How does the number of trips fluctuate over the course of the day?

In [None]:
trips[['starttime']].groupby(trips['starttime'].dt.hour).agg('count').plot.bar(
    figsize = (10,5),
    title = 'Departures by hour'
);
plt.xlabel('Hour of day')
plt.ylabel('No of departures');

The average trip duration is about 20 minutes. The number of departures rises sharply from 6am and drops off from 5pm onward. These hours coincide with the average person's work hours, which suggests that the bikes are being used not just for recreation but also as a means of commuting.
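
One way to probe the commuting interpretation is to split the hourly profile into weekdays and weekends, since commute peaks should appear mainly on weekdays. This was not done above; the sketch below only illustrates the idea on made-up timestamps, and the column name `day_type` is an assumption.

```python
import numpy as np
import pandas as pd

# toy departures: two weekday commute-hour trips and one weekend midday trip
toy = pd.DataFrame({'starttime': pd.to_datetime([
    '2015-06-01 08:05',  # Monday
    '2015-06-02 17:10',  # Tuesday
    '2015-06-06 13:00',  # Saturday
])})

# dayofweek runs from Monday=0 to Sunday=6, so values >= 5 are weekend days
toy['day_type'] = np.where(toy['starttime'].dt.dayofweek >= 5, 'weekend', 'weekday')
toy['hour'] = toy['starttime'].dt.hour
# rows: hour of day, columns: weekday / weekend departure counts
profile = toy.groupby(['hour', 'day_type']).size().unstack(fill_value=0)
print(profile)
```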

### 2. What are the most popular travel routes?


In [None]:
g = trips[['from_station_name','to_station_name','starttime']]
#Counting occurrences, ordering in descending order and returning the top 10.
routes = (
    g.groupby(['from_station_name','to_station_name'])
    .agg('count')
    .sort_values(by='starttime',ascending=False)
    .iloc[:10,:]
)
routes= routes.reset_index()
routes.columns = ['departure','destination','count']
routes

The most popular route is from Pier 69 / Alaskan Way & Clay St back to itself. When seen on a map, this bike station is in front of the pronto train station on a scenic esplanade. The second most popular destination is the Seattle Aquarium, which is also along this esplanade. (The map is rendered below with folium.)

In [None]:
#location of Pier 69 / Alaskan Way for the initial viewpoint
start_location = list(station[station['name'] == 'Pier 69 / Alaskan Way & Clay St'][['lat','long']].values[0])

def load_markers(Map, stations):
    """
    Load the markers (stations) into the Map (Map)
    """
    for i in range(stations.shape[0]):
        #Hover over marker for tooltip.
        tip = f'ID: {stations.iloc[i,0]} Name: {stations.iloc[i,1]}'
        folium.Marker(list(stations.iloc[i,2:]),tooltip=tip).add_to(Map)

#build the map centred on Pier 69 / Alaskan Way and add one marker per station.
seattle=folium.Map(location=start_location,zoom_start=16)
load_markers(seattle, station[['station_id','name','lat','long']])
seattle


### 3. How does the traffic of the top 10 stations change over time?
In this segment we will take the 10-day rolling average of the daily traffic at the top 10 destinations from 2014 to 2016.
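
As a small illustration of what a rolling average does (on made-up counts and with a 3-day window rather than 10, to keep the output short):

```python
import pandas as pd

# toy daily arrival counts for a single station
daily = pd.Series(
    [2, 4, 6, 8, 10],
    index=pd.period_range('2015-01-01', periods=5, freq='D'),
)

# 3-day rolling mean: each value is the mean of that day and the two days before it;
# the first two entries are NaN because the window is not yet full
smooth = daily.rolling(3).mean()
print(smooth)
```

Passing `min_periods=1` to `rolling` would fill those leading entries using the partial window instead.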

In [None]:
destinations = trips[['starttime','to_station_name','trip_id']].copy()
#converting the datetime timestamp to a daily period
destinations['starttime'] = destinations['starttime'].dt.to_period('D')
#aggregating by day
bydate = destinations.groupby(['starttime','to_station_name']).agg('count').reset_index()
#column rename
bydate.columns = ['date','destination','count']
#pivoting makes the destinations into columns
#and constructing the time series.
#filling missing values with 0.
station_traffic = bydate.pivot_table(
    index='date',
    columns='destination',
    values='count',
    aggfunc='sum'
).fillna(0)

station_traffic.head()

We would now like to order the columns in descending order of total visits to each station. We can do this by summing along the columns and sorting the resulting series (which also reorders its index). Passing this sorted index into the original data frame yields the result.
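
The reordering step can be seen in isolation on a small made-up frame (the column names `A`, `B`, `C` are placeholders for station names):

```python
import pandas as pd

toy = pd.DataFrame({'A': [1, 0, 2], 'B': [5, 4, 6], 'C': [0, 3, 3]})

# column totals: A=3, B=15, C=6
totals = toy.sum(axis=0)
# sorting the totals also sorts their index, which holds the column labels
order = totals.sort_values(ascending=False).index
# indexing with the sorted labels reorders the columns
ordered = toy[order]
print(list(ordered.columns))
```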

In [None]:
s = station_traffic.sum(axis=0)
#specify the rolling period and take the mean. Keep the first 10 columns.
#increasing the rolling period will yield smoother lines

roll = 10 #adjustable rolling period

station_traffic[s.sort_values(ascending=False).index].iloc[:,0:10].rolling(roll).mean().plot.line(
    figsize=(20,10),
    title = f'{roll} period rolling mean of daily traffic to top 10 stations',
    linewidth = 1.5
);
plt.ylabel('Number of arrivals');

The above plot shows that people's riding habits follow seasonal cycles. These cycles align clearly with the seasons: December and January are the coldest months in the northern hemisphere, and summer peaks during July and August. Another feature of interest is how much riding fell from the start of 2016. This was when the city of Seattle purchased the system for $1.4 million but, due to lower than expected ridership and a lack of funding, its usage has declined since.

### 4. Is there a relationship between temperature and the total number of trips for that day?
In this question we will examine the relationship between climate and decision to ride, as well as the effect of any significant weather events.

In [None]:
# Form the time series of total number of trips each day.
total_trips = trips[['trip_id','starttime']].copy()
#discard the time of day and keep the date.
total_trips['starttime'] = total_trips['starttime'].dt.normalize()
#count trips for each day
trip_counts = total_trips.groupby('starttime').agg('count').reset_index()
trip_counts.columns = ['Date','count']
#left join the trip counts and weather on date.
trip_weather = trip_counts.merge(weather, on='Date', how='left')

#plotting the counts against temperature for each day.
plt.figure(figsize=(20,10))
sns.scatterplot(data = trip_weather, 
                x='Max_Temperature_F',
                y='count',
                hue='Events', #mark any special events rain, snow etc..
                s=60,
                alpha=0.6,
                edgecolor=None);
plt.xlabel('Peak daily temperature (F)')
plt.ylabel('Number of trips')
plt.title('Trips vs Temperature', fontsize=20);

Warmer weather appears to be the main factor in someone's decision to ride. The plot also shows that rain is not necessarily a deterrent to biking, as the most active day was also a rainy day. It is hard to draw conclusions about other weather events because they occur so rarely.
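
The visual impression from a scatter plot can be backed by a correlation coefficient. This was not computed above; the sketch below uses made-up numbers purely to show the call, so the resulting value says nothing about the real data.

```python
import pandas as pd

# toy daily records; the values are illustrative only
toy = pd.DataFrame({
    'Max_Temperature_F': [45, 55, 65, 75, 85],
    'count': [100, 180, 260, 400, 520],
})

# Series.corr computes the Pearson correlation by default
r = toy['Max_Temperature_F'].corr(toy['count'])
print(round(r, 3))
```

On the real data the same call could be applied to the `trip_weather` frame. A strong correlation would still not establish that temperature alone drives ridership, since daylight hours and season move with it.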

### 5. Identify the age distribution of bike riders. Who rides the most?

In [None]:
trip_age = trips[['age','trip_id','gender']]
#filter by sex
female = trip_age[trip_age['gender'] == 'Female'].drop('gender',axis=1)
male = trip_age[trip_age['gender'] == 'Male'].drop('gender',axis=1)
#aggregate by count on age
m_counts = male.groupby(['age']).agg('count').reset_index()
f_counts = female.groupby(['age']).agg('count').reset_index()
#column rename
m_counts.columns = ['age','count']
f_counts.columns = ['age','count']
#plotting
plt.figure(figsize=(15,10))
plt.bar(m_counts['age'],m_counts['count'],label='male')
#female bars are drawn over the male bars (overlaid, not stacked)
plt.bar(f_counts['age'],f_counts['count'],label='female')
plt.legend()
plt.title('The Age Distribution of Cyclists',fontsize=14)
plt.xlabel('Age',fontsize=14)
plt.ylabel('Number of trips',fontsize=14);


From the plot we see that, among males, 28-year-olds ride the most, while among females, 29-year-olds ride the most. Males ride the most overall between the ages of 20 and 60, with female ridership increasing in later age. There is also an increase in ridership in the 50s for both genders.
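
Single-year counts can be noisy, so grouping ages into decades may make statements like these easier to check. The following is a sketch on made-up riders; the bin edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# toy riders; values are illustrative only
toy = pd.DataFrame({
    'age': [22, 28, 28, 35, 52, 61],
    'gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
})

# right=False closes each bin on the left: [20, 30), [30, 40), ...
bins = [20, 30, 40, 50, 60, 70]
labels = ['20s', '30s', '40s', '50s', '60s']
toy['age_group'] = pd.cut(toy['age'], bins=bins, labels=labels, right=False)

# number of trips per age group and gender
by_group = (
    toy.groupby(['age_group', 'gender'], observed=False)
    .size()
    .unstack(fill_value=0)
)
print(by_group)
```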

### 6. Which bikes have seen the most ride time and on which routes were they most used?
What are the top 10 most used bikes and what are their most visited stations?

In [None]:
#aggregate sum of all trip durations for each bike
bikes = trips[['bikeid','tripduration']].groupby('bikeid').agg('sum')
bikes = bikes.sort_values(by='tripduration',ascending=False).reset_index()
topbikes = bikes.iloc[:10,:].bikeid

bikes.iloc[:10,:]

In [None]:
topbikes = list(topbikes)#list of top 10 bikes
mostvisited = []
total_visits = []
for bike in topbikes:
    places = trips[trips['bikeid'] == bike]['from_station_name'].value_counts()
    num_visits = places.iloc[0] #number of departures from this bike's most used station
    place = places.index[0] #name of that station
    mostvisited.append(place)
    total_visits.append(num_visits)
    

In [None]:
#create display DataFrame; building it from a dict keeps numeric columns numeric
most_used = pd.DataFrame({
    'bike-id': list(topbikes),
    'station name': mostvisited,
    'total visits': total_visits,
    #tripduration is in minutes after cleaning, so divide by 60 for hours
    'totaltime (hrs)': bikes['tripduration'].iloc[:10].values/60,
})
most_used

Of the top 10 most used bikes, 9 share the same most visited station. Pier 69 does appear to be the most frequented bike station in all of Seattle, likely due to its scenic esplanade.

In [None]:
plt.figure(figsize=(20,5))
sns.barplot(data=most_used,y='bike-id',x='totaltime (hrs)')
plt.xlim((210,250))
plt.ylabel('bike_id',fontsize=14)
plt.xlabel('cycle time (hrs)',fontsize=14);
plt.title('Top 10 most used bikes',fontsize=14);