In [None]:
# Author: He Yingxu

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.legend import Legend

import geopandas as gpd
import geopy 
from shapely.geometry import Point, Polygon
from pyproj import Proj
import fiona
import contextily as ctx

import math
from tqdm import tqdm

tqdm.pandas()

%matplotlib inline

This notebook works through the Kaggle data set [SF Bay Area Bike Share](https://www.kaggle.com/benhamner/sf-bay-area-bike-share). It explores the data in the following directions:
* The duration of trips: how long trips generally take.
* The volume of scooter trips at different times of day and on different days of the week.
* The travelling behaviour during the peak hours of scooter usage, e.g. whether there is a significant trend of travelling to certain stations at certain times.
* The number of scooters needed at each station on weekdays and weekends.
* The change in the number of scooters at each station after each day of operation.

Possible business insights: 
* The analysis can suggest a reasonable battery capacity for the scooters to the stakeholders.
* It provides insights on optimizing the allocation of scooters so that demand is met as fully as possible.

# 0. Preprocessing

In [None]:
# Loads station information 
stations = pd.read_csv("../input/sf-bay-area-bike-share/station.csv")

# Loads trips information
sf_scooter = pd.read_csv('../input/sf-bay-area-bike-share/trip.csv')
# parse the timestamp columns in one vectorized call instead of row by row
sf_scooter['start_date'] = pd.to_datetime(sf_scooter['start_date'])
sf_scooter['end_date'] = pd.to_datetime(sf_scooter['end_date'])
sf_scooter['start_hour'] = sf_scooter['start_date'].dt.hour.astype('int64')
sf_scooter['end_hour'] = sf_scooter['end_date'].dt.hour.astype('int64')

sf_scooter["date"] = sf_scooter["start_date"].dt.strftime("%Y-%m-%d")

In [None]:
sf_scooter.head()

In [None]:
print("The trips data is from {} to {}".format(sf_scooter['date'].min(), sf_scooter['date'].max()))

In [None]:
# base map crs
CRS = 3857

# Loads the shapefile including all the transit stops in 2017 in SF bay area.
# Source: http://opendata.mtc.ca.gov/datasets/major-transit-stops-2017
major_stations = gpd.read_file("../input/major-transit-stops-2017shp").to_crs(epsg=CRS)

# zipcode_zones = gpd.read_file("./data/ZIPCODE.shp").to_crs(epsg=CRS)
# railways = gpd.read_file("Passenger_Railways_(2019)-shp").to_crs(epsg=CRS)
# railways_stations = gpd.read_file("Passenger_Rail_Stations_(2019)-shp").to_crs(epsg=CRS)

# 1. Duration of trips

In [None]:
sf_scooter['Duration_minutes'] = (sf_scooter['duration']/60).astype('int64')
sf_scooter['Duration_hours'] = (sf_scooter['duration']/3600).astype('int64')

Most trips end within one hour, while a few outliers last 100 or even 200 hours, possibly because users forgot to return the scooters or because of some other issue. We will focus on the trips of around one hour or less to see what the distribution looks like there.

In [None]:
sns.distplot(sf_scooter['Duration_hours'], kde=False)
# sf_scooter['Duration_hours'].unique()

This distribution further confirms that most trips take less than 1 hour. Next, we plot the distribution in minutes, with all trips longer than 60 minutes grouped into a single bin at 61.

In [None]:
sns.distplot(sf_scooter['Duration_minutes'].where(sf_scooter['Duration_minutes'] <= 60, 61))

From the distribution plot, we can see most of the trips end within 20 minutes. 

In [None]:
sf_scooter['Duration_minutes'].describe()

The median duration is 8 minutes, and 75% of the trips finish within 12 minutes. The mean is pulled away from the median by the long tail of the distribution.

In [None]:
sf_scooter['Duration_minutes'].quantile(0.95)

95% of the trips finish within 31 minutes. That may serve as an indicator of the required battery capacity.
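
To illustrate why the median and a high quantile are more useful here than the mean, the following is a minimal sketch on made-up durations (the values in `toy_minutes` are invented and are not taken from the data set). A couple of very long trips are enough to drag the mean far above the median.

```python
import pandas as pd

# Toy durations in minutes (made-up values): mostly short trips plus two
# very long ones that mimic scooters that were not returned.
toy_minutes = pd.Series([4, 5, 6, 7, 8, 8, 9, 10, 12, 15, 20, 31, 600, 6000])

summary = {
    "median": toy_minutes.median(),      # robust to the long tail
    "mean": toy_minutes.mean(),          # dominated by the two outliers
    "p95": toy_minutes.quantile(0.95),   # linear interpolation by default
}
print(summary)
```

With so few toy values the 95th percentile still lands among the outliers; on the full trip table the same `quantile(0.95)` call is what gives the 31-minute figure above.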

# 2. The number of trips across time

In [None]:
sf_scooter["week_day"] = sf_scooter["start_date"].dt.strftime("%w").astype("int64")
sf_scooter.loc[sf_scooter["week_day"] == 0, "week_day"] = 7

In [None]:
sns.distplot(sf_scooter["week_day"])

There are many more trips on weekdays than on weekends. As travelling behaviour may differ between weekdays and weekends, the two are analyzed separately.
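
As a small sketch of the weekday numbering used here (1 for Monday up to 7 for Sunday), the same encoding can be obtained from `dt.dayofweek`. The `toy` frame and the `is_weekend` column are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Toy timestamps: a Monday, a Friday, a Saturday and a Sunday in March 2014.
toy = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2014-03-03 08:15", "2014-03-07 17:40",
        "2014-03-08 11:00", "2014-03-09 14:30",
    ])
})

# dayofweek is 0 for Monday ... 6 for Sunday, so adding 1 gives 1..7.
toy["week_day"] = toy["start_date"].dt.dayofweek + 1

# Same rule as the split below: 6 and 7 are Saturday and Sunday.
toy["is_weekend"] = toy["week_day"] >= 6
print(toy)
```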

In [None]:
weekdays = sf_scooter[sf_scooter['week_day'] < 6]
weekends = sf_scooter[sf_scooter['week_day'] >= 6]

In [None]:
sns.distplot(weekdays['start_hour'])

From the graph, the peak start hours are 7, 8 and 9 in the morning and 16, 17 and 18 in the afternoon. This is probably caused by commuting, as people travel to and from work every day.
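
The peaks can also be read off numerically rather than from the plot. The sketch below counts trips per start hour on a made-up series (`toy_hours` is invented for illustration) and picks the busiest hours.

```python
import pandas as pd

# Toy start hours (made-up values) with a morning and an afternoon peak.
toy_hours = pd.Series([7, 8, 8, 8, 9, 12, 17, 17, 18, 8, 17, 22])

# Number of trips starting in each hour, ordered by hour of day.
trips_per_hour = toy_hours.value_counts().sort_index()

# The two busiest start hours.
peak_hours = trips_per_hour.nlargest(2).index.tolist()
print(trips_per_hour)
print(peak_hours)
```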

In [None]:
sns.distplot(weekends['start_hour'])

On weekends, trips peak around midday and in the afternoon, from 10:00 to 18:00.

# 3. Travelling behaviours in peak hours

## 3.1 Weekdays Morning

We first look at weekday trips that start between 7 and 9 am.

In [None]:
# build the mask from `weekdays` itself so that its index matches the frame being filtered
weekday_morning = weekdays[weekdays['start_hour'].between(7, 9)]

Group by the start station and count the distinct number of trip ids for each start station.
From the table below, 13.9% of the trips start from terminal 70, almost twice the share of the terminal ranked second. Hence there is a significant gap between terminal 70 and the other terminals.

In [None]:
df_start_station = weekday_morning.groupby('start_station_name').agg({'start_station_id': np.mean, 'id': pd.Series.nunique}).sort_values(['id'], ascending=False)
df_start_station['percentage'] = df_start_station['id']/sum(df_start_station['id'])
df_start_station.head(10)

In [None]:
df_start_station[:8]["percentage"].sum()

The top 8 stations account for 50% of the trips.
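
Instead of trying different slice lengths by hand, the number of stations needed to reach half of the trips can be derived from the cumulative share. This is a minimal sketch on invented counts (`toy_counts` and the station labels are hypothetical).

```python
import pandas as pd

# Toy trip counts per start station (made-up values).
toy_counts = pd.Series({"A": 40, "B": 25, "C": 15, "D": 12, "E": 8}, name="trips")

# Share of all trips per station, largest first, and the running total.
share = toy_counts.sort_values(ascending=False) / toy_counts.sum()
cumulative = share.cumsum()

# Number of top stations needed to cover at least half of all trips:
# count the stations that are still below 50% and add the one that crosses it.
n_for_half = int((cumulative < 0.5).sum()) + 1
print(cumulative)
print(n_for_half)
```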

For the end stations, terminal 70 still ranks first, but the trips are more evenly distributed. Combined with the previous analysis, these two tables imply that a large number of people start from terminal 70 and head to different places every morning.

In [None]:
df_end_station = weekday_morning.groupby('end_station_name').agg({'end_station_id': np.mean, 'id': pd.Series.nunique}).sort_values(['id'], ascending=False)
df_end_station['percentage'] = df_end_station['id']/sum(df_end_station['id'])
df_end_station.head(10)

In [None]:
plt.bar(df_end_station['end_station_id'], df_end_station['id'], align='center')
plt.show()

The bar chart shows the number of scooters returned to each station in the morning. The Pareto principle (80/20 rule) is still visible, but there is no obvious long tail, which means the volume of scooters is more evenly distributed than for the start stations.

Next, a heatmap is drawn to show the scooters taken and returned at each station on weekday mornings, so that the direction of travel can be inferred.

In [None]:
def aggregate_start_end_trips(raw_data):
    """
    Aggregate the start and end trips at each station as a daily average.
    
    Returns: a data frame with one row per station, holding the average number of start and end trips per day.
    """
    # the average number of trips started from each stations in weekday morning.
    num_days = len(raw_data["date"].unique())

    start_station = raw_data.groupby(['start_station_name', 'start_station_id']).agg(start_trips=('id', pd.Series.nunique)) \
    .reset_index() \
    .rename(columns={"start_station_name": "station", "start_station_id": "station_id"})

    start_station = start_station.assign(ave_start_trips = start_station["start_trips"] / num_days)

    # the average number of trips ended each stations in weekday morning.
    end_station = raw_data.groupby(['end_station_name', 'end_station_id']).agg(end_trips=('id', pd.Series.nunique)) \
    .reset_index() \
    .rename(columns={"end_station_name": "station", "end_station_id": "station_id"})

    end_station = end_station.assign(ave_end_trips = end_station["end_trips"] / num_days)

    # merge the two data set together with each row represents one station.
    trips = start_station.merge(end_station, on=["station", "station_id"])[["station", "ave_start_trips", "ave_end_trips"]].set_index("station")
    
    return trips
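
Note that the function above joins the start and end tables with the default inner merge, so a station that only has starts or only has ends in the selected period drops out of the result. If that matters, one possible variant is sketched below on a toy table: it aligns the two counts with an outer join and fills the missing side with 0. The name `toy_start_end_averages` and the toy data are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

def toy_start_end_averages(trips: pd.DataFrame) -> pd.DataFrame:
    """Average daily starts and ends per station, keeping one-sided stations."""
    num_days = trips["date"].nunique()
    starts = trips.groupby("start_station_name")["id"].nunique().rename("ave_start_trips")
    ends = trips.groupby("end_station_name")["id"].nunique().rename("ave_end_trips")
    # concat along columns aligns on the station name (outer join);
    # a station missing on one side gets 0 trips there.
    out = pd.concat([starts, ends], axis=1).fillna(0) / num_days
    out.index.name = "station"
    return out

# Toy trips over two days; station C only ever receives scooters.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date": ["2014-03-03", "2014-03-03", "2014-03-04", "2014-03-04"],
    "start_station_name": ["A", "A", "B", "A"],
    "end_station_name": ["B", "C", "A", "B"],
})
result = toy_start_end_averages(toy)
print(result)
```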

In [None]:
weekday_morning_trips = aggregate_start_end_trips(weekday_morning)

In [None]:
fig, ax = plt.subplots(figsize=(10,40))
sns.heatmap(weekday_morning_trips, ax=ax, annot=True, fmt=".2g", linewidths=.5, cmap="RdPu")

The heatmap shows that the numbers of scooters taken and returned are roughly equal at most stations. However, some stations have more scooters taken or more scooters returned. For instance, stations with more scooters taken than returned in the morning are:

* Harry Bridges Plaza (Ferry Building)
* San Francisco Caltrain (Townsend at 4th)
* San Francisco Caltrain 2 (330 Townsend)
* Temporary Transbay Terminal (Howard at Beale)
* Steuart at Market
* Grant Avenue at Columbus Avenue

This is probably because they are located near bus stops or subway stations, so people use the scooters to complete the last mile of their journey.

On the other hand, stations with more returned scooters than taken are:

In [None]:
weekday_morning_trips[(weekday_morning_trips["ave_start_trips"] * 1.2 < weekday_morning_trips["ave_end_trips"]) 
          & (weekday_morning_trips["ave_start_trips"] + 2 < weekday_morning_trips["ave_end_trips"])]

No station shows a big gap between scooters returned and scooters taken.
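
The filter above combines a relative condition (at least 20% more returns than starts) with an absolute one (at least 2 more returns per day). A minimal sketch on invented numbers shows why both are used: the relative test alone would flag tiny stations, and the absolute test alone would flag busy stations with a small proportional gap. The `toy` table is hypothetical.

```python
import pandas as pd

# Toy per-station daily averages (made-up values).
toy = pd.DataFrame(
    {"ave_start_trips": [1.0, 10.0, 0.5, 8.0],
     "ave_end_trips": [1.3, 11.0, 3.0, 12.0]},
    index=["A", "B", "C", "D"],
)

# Relative gap: ends exceed starts by more than 20%.
relative = toy["ave_start_trips"] * 1.2 < toy["ave_end_trips"]
# Absolute gap: ends exceed starts by more than 2 scooters per day.
absolute = toy["ave_start_trips"] + 2 < toy["ave_end_trips"]

flagged = toy[relative & absolute]
print(flagged)
```

Station A passes only the relative test and station B passes neither, so only C and D are flagged.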

A geospatial map may give a better illustration. First, merge the data set with the geospatial information of each station and convert the data frame to a geo data frame.

In [None]:
weekday_morning_trips_geo = weekday_morning_trips.reset_index().merge(stations, left_on="station", right_on="name")
weekday_morning_trips_geo["diff"] = weekday_morning_trips_geo["ave_start_trips"] - weekday_morning_trips_geo["ave_end_trips"]
weekday_morning_trips_geo["type"] = weekday_morning_trips_geo["diff"].apply(lambda x: "more taken than returned scooters" if x > 0 else "fewer taken than returned scooters")
weekday_morning_trips_geo["geometry"] = pd.Series([Point(x,y) for x, y in zip(weekday_morning_trips_geo["long"], weekday_morning_trips_geo["lat"])])

# the station coordinates are longitude/latitude (WGS 84); the {'init': ...} form of the CRS is deprecated
weekday_morning_stations_geo = gpd.GeoDataFrame(weekday_morning_trips_geo, crs="EPSG:4326")
weekday_morning_stations_geo = weekday_morning_stations_geo.to_crs(epsg=3857)
weekday_morning_stations_geo['coords'] = weekday_morning_stations_geo['geometry'].apply(lambda x: x.representative_point().coords[:][0])

In [None]:
# the assigned color for each station
color_map = {"more taken than returned scooters": "#F60712", 
             "fewer taken than returned scooters": "#5254F2", 
             "Bus": "#FFFE08",
             "Light Rail": "#F97100", 
             "Bus Rapid Transit": "#08FFF6", 
             "Ferry": "#2EFF08", 
             "Cable Car": "#C008FF", 
             "Rail": "#000000"}

In [None]:
def plot_start_end_trips(geo_trips, transport_stations = True, annotate = False, xlim=None, ylim=None, color_map=color_map):
    """
    plot a geospatial map with the base map and transit stops such as bus stops, subway stations, etc.
    
    Parameters:
    geo_trips: the data set containing the stations and aggregated start and end trips data.
    transport_stations: Boolean. True if to plot the transit stops as points, False otherwise.
    annotate: Boolean. True if to plot the station id on top of each scooter stations, False otherwise.
    xlim: the limit on x axis for the map. Default None. 
    ylim: the limit on y axis for the map. Default None.
    color_map: A dictionary with the colors assigned for each type of stations.
    
    Returns: NA
    """
    fig, ax = plt.subplots(figsize = (20,20))
    
    if (xlim is not None) and (ylim is not None):
        ax.set_xlim((xlim))
        ax.set_ylim((ylim))


    # plot the stations
    for t, data in geo_trips.groupby("type"):    
        data.plot(ax=ax, 
                  alpha=1,
                  color=color_map[t],
                  edgecolor='k', 
                  markersize=np.abs(data['diff']) * 60,
                  label="station: " + t)
        
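        if annotate:
            for idx, row in data.iterrows():
                x, y = row['coords']
                # label only stations inside the visible window; no limits means label all
                if (xlim is None or xlim[0] < x < xlim[1]) and (ylim is None or ylim[0] < y < ylim[1]):
                    # text is positional: the keyword was renamed from `s` to `text` in matplotlib 3.3
                    ax.annotate(str(row['id']), xy=(x, y), horizontalalignment='center')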

    # plot transportation stations
    if transport_stations:
        for t, data in major_stations.groupby('system_typ'):
            data.plot(ax=ax, alpha=0.4, color=color_map[t], label="station: " + t)

    for diff in [1, 10, 20]:
        ax.scatter([], [], c='k', alpha=0.3, s=diff * 60, label="size: {} scooters".format(str(diff)))

    # set the aesthetics of the legend
    ax.legend(fontsize=15,
              frameon=False,
              loc=(1.01, 0.6),
              labelspacing=1.2,
              title="LEGEND").get_title().set_fontsize(18)

    # add the background map to the 2D plane.
    ctx.add_basemap(ax)
    
    plt.show()

In [None]:
plot_start_end_trips(weekday_morning_stations_geo, transport_stations=True)

The data set contains transit stations around the whole bay area. To zoom into the places with the scooter stations, the map can be cropped. 
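
The `xlim` and `ylim` values passed to the plotting function are in Web Mercator metres (EPSG:3857), not in degrees. As a rough sketch, assuming the standard spherical Web Mercator formulas, a longitude/latitude bounding box can be converted to such limits as follows. The helper names and the example bounding box are illustrative and are not part of the notebook.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project a longitude/latitude pair in degrees to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def bbox_to_limits(lon_min, lat_min, lon_max, lat_max):
    """Turn a lon/lat bounding box into [x_min, x_max] and [y_min, y_max] lists."""
    x_min, y_min = lonlat_to_web_mercator(lon_min, lat_min)
    x_max, y_max = lonlat_to_web_mercator(lon_max, lat_max)
    return [x_min, x_max], [y_min, y_max]

# Hypothetical bounding box roughly around downtown San Francisco.
xlim, ylim = bbox_to_limits(-122.44, 37.76, -122.37, 37.82)
print(xlim, ylim)
```

The resulting lists can be passed as the `xlim` and `ylim` arguments, which is easier to reason about than typing the projected coordinates directly.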

In [None]:
plot_start_end_trips(weekday_morning_stations_geo, xlim=[-13640000, -13550000], ylim=[4450000, 4570000])

Further zoom into San Francisco city. 

In [None]:
plot_start_end_trips(weekday_morning_stations_geo, xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

The map shows that at the stations colored in red, up to about 20 more scooters are taken than returned. These stations are near bus stops or light rail stops. The blue stations seem to be a bit farther from the bus stops and might be close to the riders' workplaces.

In [None]:
plot_start_end_trips(weekday_morning_stations_geo, transport_stations=False, annotate=True, xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

To make this clearer, the transit stops are removed and each scooter station is annotated with its station id. The popular start stations in the morning are 50, 55, 69, 70, and 73.

In San Jose city: 

In [None]:
plot_start_end_trips(weekday_morning_stations_geo, xlim=[-13572000, -13567000], ylim=[4484000, 4488000])

However, the changes in the number of scooters at each station are not that big compared to San Francisco.

## 3.2 Weekday Afternoon

In [None]:
weekday_afternoon = sf_scooter[(sf_scooter['start_hour'].between(16, 18)) & (sf_scooter['week_day'] < 6)]

In [None]:
weekday_afternoon_trips = aggregate_start_end_trips(weekday_afternoon)

Merge the data set with the geospatial information of each station and convert the data frame to a geo data frame.

In [None]:
weekday_afternoon_trips_geo = weekday_afternoon_trips.reset_index().merge(stations, left_on="station", right_on="name")
weekday_afternoon_trips_geo["diff"] = weekday_afternoon_trips_geo["ave_start_trips"] - weekday_afternoon_trips_geo["ave_end_trips"]
weekday_afternoon_trips_geo["type"] = weekday_afternoon_trips_geo["diff"].apply(lambda x: "more taken than returned scooters" if x > 0 else "fewer taken than returned scooters")
weekday_afternoon_trips_geo["geometry"] = pd.Series([Point(x,y) for x, y in zip(weekday_afternoon_trips_geo["long"], weekday_afternoon_trips_geo["lat"])])

# the station coordinates are longitude/latitude (WGS 84); the {'init': ...} form of the CRS is deprecated
weekday_afternoon_stations_geo = gpd.GeoDataFrame(weekday_afternoon_trips_geo, crs="EPSG:4326")
weekday_afternoon_stations_geo = weekday_afternoon_stations_geo.to_crs(epsg=3857)
weekday_afternoon_stations_geo['coords'] = weekday_afternoon_stations_geo['geometry'].apply(lambda x: x.representative_point().coords[:][0])

In [None]:
plot_start_end_trips(weekday_afternoon_stations_geo, xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

It is noticeable that, compared with the morning, the stations with more returned scooters and those with more taken scooters are reversed.

In [None]:
plot_start_end_trips(weekday_afternoon_stations_geo, transport_stations=False, annotate=True, xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

This map shows the station ids more clearly. The pattern may indicate that people travel from their workplaces back to the bus stops and subway stations between 4 and 6 pm.

In [None]:
plot_start_end_trips(weekday_afternoon_stations_geo, xlim=[-13572000, -13567000], ylim=[4484000, 4488000])

However, the changes in the number of scooters in San Jose are still not that big compared to San Francisco.

## 3.3 Weekend 10 am - 6 pm

As there is only one peak during weekends, from 10 am to 6 pm, trips in this time period are selected for analysis.

In [None]:
weekend = sf_scooter[(sf_scooter['start_hour'].between(10, 18)) & (sf_scooter['week_day'] >= 6)]

In [None]:
weekend_trips = aggregate_start_end_trips(weekend)

In [None]:
fig, ax = plt.subplots(figsize=(10,40))
sns.heatmap(weekend_trips, ax=ax, annot=True, fmt=".2g", linewidths=.5, cmap="RdPu")

There is no significant difference between the numbers of start and end trips; they are almost the same at each station. Thus, we only plot the start trips on the map to view the stations with a high volume of trips.

In [None]:
def plot_trips(geo_trips, column_to_plot, annotate=False, transport_stations=True, xlim=None, ylim=None, color_map=color_map):
    """
    plot a geospatial map with the base map and transit stops such as bus stops, subway stations, etc.
    
    Parameters:
    geo_trips: the data set containing the stations and aggregated start and end trips data.
    column_to_plot: the name of the column containing data.
    transport_stations: Boolean. True if to plot the transit stops as points, False otherwise.
    annotate: Boolean. True if to plot the station id on top of each scooter stations, False otherwise.
    xlim: the limit on x axis for the map. Default None. 
    ylim: the limit on y axis for the map. Default None.
    color_map: A dictionary with the colors assigned for each type of stations.
    
    Returns: NA
    """
    fig, ax = plt.subplots(figsize = (20,20))

    if (xlim is not None) and (ylim is not None):
        ax.set_xlim((xlim))
        ax.set_ylim((ylim))


    geo_trips.plot(ax=ax, 
              alpha=1,
              color="#F60712",
              edgecolor='k', 
              markersize=np.abs(geo_trips[column_to_plot]) * 60,
              label="scooter stations")
    
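    if annotate:
        for idx, row in geo_trips.iterrows():
            x, y = row['coords']
            # label only stations inside the visible window; no limits means label all
            if (xlim is None or xlim[0] < x < xlim[1]) and (ylim is None or ylim[0] < y < ylim[1]):
                # text is positional: the keyword was renamed from `s` to `text` in matplotlib 3.3
                ax.annotate(str(row['id']), xy=(x, y), horizontalalignment='center')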

    if transport_stations:
        for t, data in major_stations.groupby('system_typ'):
            data.plot(ax=ax, alpha=0.4, color=color_map[t], label="station: " + t)

    for diff in [1, 10, 20, 30]:
        ax.scatter([], [], c='k', alpha=0.3, s=diff * 60, label="size: {} scooters".format(str(diff)))

    ax.legend(fontsize=15,
              frameon=False,
              loc=(1.01, 0.6),
              labelspacing=1.7,
              title="LEGEND").get_title().set_fontsize(18)

    ctx.add_basemap(ax)

    plt.show()

Merge the data set with the geospatial information of each station and convert the data frame to a geo data frame.

In [None]:
weekend_trips_geo = weekend_trips.reset_index().merge(stations, left_on="station", right_on="name")
weekend_trips_geo["diff"] = weekend_trips_geo["ave_start_trips"] - weekend_trips_geo["ave_end_trips"]
weekend_trips_geo["geometry"] = pd.Series([Point(x,y) for x, y in zip(weekend_trips_geo["long"], weekend_trips_geo["lat"])])

# the station coordinates are longitude/latitude (WGS 84); the {'init': ...} form of the CRS is deprecated
weekend_stations_geo = gpd.GeoDataFrame(weekend_trips_geo, crs="EPSG:4326")
weekend_stations_geo = weekend_stations_geo.to_crs(epsg=3857)
weekend_stations_geo['coords'] = weekend_stations_geo['geometry'].apply(lambda x: x.representative_point().coords[:][0])

In [None]:
plot_trips(weekend_stations_geo, "ave_start_trips", xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

In [None]:
plot_trips(weekend_stations_geo, "ave_start_trips", annotate=True, transport_stations=False, xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

The most popular stations seem to be those near the coast, rather than the San Francisco Caltrain stations. The two most popular stations are 50 and 60. Thus, more scooters might need to be allocated to these two stations on weekends.

In [None]:
plot_trips(weekend_stations_geo, "ave_start_trips", xlim=[-13572000, -13567000], ylim=[4484000, 4488000])

In San Jose, the demand for scooters is not that big.

# 4. Suggested allocation for each station

## 4.1 Weekdays

In [None]:
def calculate_needed_scooters(data):
    """
    Calculate the largest cumulative net number of scooters taken (starts minus returns) in a day.
    
    Parameters: 
    data: the trips for one station in one day. 
    """
    max_taken = 0
    current_taken = 0
    for index, row in data.sort_values(by="date_time", axis=0).reset_index(drop=True).iterrows():
        if row['trip_type'] == "start": 
            current_taken += 1
        elif row['trip_type'] == "end":
            current_taken -= 1
        if current_taken > max_taken:
            max_taken = current_taken
    return max_taken
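
The row-by-row loop above can be slow on the full trip table. As a sketch of the same idea in vectorized form, each start can be counted as +1 and each end as -1, and the peak of the running sum gives the largest net number of scooters taken. The name `peak_net_outflow` and the toy events are hypothetical. The sketch assumes a non-empty group in which every row is either a `start` or an `end`; rows with equal timestamps keep their input order.

```python
import numpy as np
import pandas as pd

def peak_net_outflow(events: pd.DataFrame) -> int:
    """Largest running value of (starts - ends) over one station-day, floored at 0."""
    ordered = events.sort_values("date_time", kind="stable")
    # +1 when a scooter leaves the station, -1 when one is returned
    step = np.where(ordered["trip_type"] == "start", 1, -1)
    running = np.cumsum(step)
    return int(max(running.max(), 0))

# Toy events for one station on one day (made-up values).
toy_events = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2014-03-03 08:00", "2014-03-03 08:05", "2014-03-03 08:10",
        "2014-03-03 08:20", "2014-03-03 17:30",
    ]),
    "trip_type": ["start", "start", "end", "start", "end"],
})
print(peak_net_outflow(toy_events))
```

For the toy events the running balance is 1, 2, 1, 2, 1, so the function returns 2, which matches what the loop version computes for the same rows.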

In [None]:
weekday_all = sf_scooter[sf_scooter['week_day'] < 6]

In [None]:
start = weekday_all[['start_date', 'start_station_name', 'start_station_id', 'date']] \
.rename(columns={'start_date': 'date_time', 'start_station_name': 'station_name', 'start_station_id': 'station_id'})

start['trip_type'] = 'start'

end = weekday_all[['end_date', 'end_station_name', 'end_station_id', 'date']] \
.rename(columns={'end_date': 'date_time', 'end_station_name': 'station_name', 'end_station_id': 'station_id'})

end['trip_type'] = 'end'

station_trips = pd.concat([start, end], ignore_index=True).sort_values(by="date_time", axis=0).reset_index(drop=True)

estimated_scooters = station_trips.groupby(['date', 'station_name']).progress_apply(calculate_needed_scooters)
#agg(calculate_needed_scooters)

In [None]:
estimated_scooters_ave = pd.DataFrame(estimated_scooters).reset_index() \
.rename(columns={0: 'estimated_scooters'}) \
.groupby('station_name') \
.agg(ave_estimated_scooters=("estimated_scooters", np.mean))

In [None]:
estimated_scooters_ave_geo = estimated_scooters_ave.reset_index().merge(stations, left_on="station_name", right_on="name")
estimated_scooters_ave_geo["geometry"] = pd.Series([Point(x,y) for x, y in zip(estimated_scooters_ave_geo["long"], estimated_scooters_ave_geo["lat"])])

# the station coordinates are longitude/latitude (WGS 84); the {'init': ...} form of the CRS is deprecated
weekday_allocations = gpd.GeoDataFrame(estimated_scooters_ave_geo, crs="EPSG:4326")
weekday_allocations = weekday_allocations.to_crs(epsg=3857)
weekday_allocations['coords'] = weekday_allocations['geometry'].apply(lambda x: x.representative_point().coords[:][0])

In [None]:
plot_trips(weekday_allocations, "ave_estimated_scooters", xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

In [None]:
plot_trips(weekday_allocations, "ave_estimated_scooters", transport_stations=False, annotate=True, xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

From the plot, stations 50, 55, 69, 70 and 73, among others, have the highest estimated demand, but none of them needs more than 30 scooters per day.

## 4.2 Weekends

In [None]:
weekend_all = sf_scooter[sf_scooter['week_day'] >= 6]

In [None]:
start = weekend_all[['start_date', 'start_station_name', 'start_station_id', 'date']] \
.rename(columns={'start_date': 'date_time', 'start_station_name': 'station_name', 'start_station_id': 'station_id'})

start['trip_type'] = 'start'

end = weekend_all[['end_date', 'end_station_name', 'end_station_id', 'date']] \
.rename(columns={'end_date': 'date_time', 'end_station_name': 'station_name', 'end_station_id': 'station_id'})

end['trip_type'] = 'end'

station_trips = pd.concat([start, end], ignore_index=True).sort_values(by="date_time", axis=0).reset_index(drop=True)

weekend_estimated_scooters = station_trips.groupby(['date', 'station_name']).progress_apply(calculate_needed_scooters)

In [None]:
weekend_estimated_scooters_ave = pd.DataFrame(weekend_estimated_scooters).reset_index() \
.rename(columns={0: 'estimated_scooters'}) \
.groupby('station_name') \
.agg(ave_estimated_scooters=("estimated_scooters", np.mean))

In [None]:
weekend_estimated_scooters_ave_geo = weekend_estimated_scooters_ave.reset_index().merge(stations, left_on="station_name", right_on="name")
weekend_estimated_scooters_ave_geo["geometry"] = pd.Series([Point(x,y) for x, y in zip(weekend_estimated_scooters_ave_geo["long"], weekend_estimated_scooters_ave_geo["lat"])])

# the station coordinates are longitude/latitude (WGS 84); the {'init': ...} form of the CRS is deprecated
weekend_allocations = gpd.GeoDataFrame(weekend_estimated_scooters_ave_geo, crs="EPSG:4326")
weekend_allocations = weekend_allocations.to_crs(epsg=3857)
weekend_allocations['coords'] = weekend_allocations['geometry'].apply(lambda x: x.representative_point().coords[:][0])

In [None]:
plot_trips(weekend_allocations, "ave_estimated_scooters", xlim=[-13630000, -13622000], ylim=[4545000, 4554000])

On weekends, most of the stations need fewer than 10 scooters.

# 5. Daily update of number of scooters on each station

In [None]:
weekday_all = sf_scooter[sf_scooter['week_day'] < 6]

In [None]:
weekday_all_trips = aggregate_start_end_trips(weekday_all)

In [None]:
fig, ax = plt.subplots(figsize=(10,40))
sns.heatmap(weekday_all_trips, ax=ax, annot=True, fmt=".2g", linewidths=.5, cmap="RdPu")

After one weekday, the number of scooters at "2nd at Folsom", "2nd at South Park", "Beale at Market", etc. decreases, though by no more than 10. Stations such as "San Francisco Caltrain (Townsend at 4th)" end the day with more scooters. Thus, the scooters can be re-allocated based on the estimated demand calculated in section 4.
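
To turn the heatmap into a list of stations to rebalance, the net daily change can be computed as average ends minus average starts and compared with a tolerance. The sketch below uses invented numbers; the station labels, the `THRESHOLD` value and the `action` wording are all hypothetical choices.

```python
import pandas as pd

# Toy per-station daily averages (made-up values).
toy = pd.DataFrame(
    {"ave_start_trips": [30.0, 12.0, 8.0],
     "ave_end_trips": [50.0, 5.0, 8.5]},
    index=["Station X", "Station Y", "Station Z"],
)

# Positive: the station gains scooters over a day; negative: it loses them.
toy["net_change"] = toy["ave_end_trips"] - toy["ave_start_trips"]

THRESHOLD = 2.0  # hypothetical tolerance, in scooters per day

toy["action"] = "leave as is"
toy.loc[toy["net_change"] > THRESHOLD, "action"] = "move scooters out"
toy.loc[toy["net_change"] < -THRESHOLD, "action"] = "bring scooters in"
print(toy)
```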

In [None]:
weekend_all = sf_scooter[sf_scooter['week_day'] >= 6]
weekend_all_trips = aggregate_start_end_trips(weekend_all)

In [None]:
fig, ax = plt.subplots(figsize=(10,40))
sns.heatmap(weekend_all_trips, ax=ax, annot=True, fmt=".2g", linewidths=.5, cmap="RdPu")

The number of scooters at every station is almost unchanged after each weekend day. The biggest difference is at station Embarcadero at Sansome, which gains 5 more scooters per weekend day.

# 6. Conclusion

* 95% of the trips end within 31 minutes. This might be an indicator for the battery capacity.
* Usage differs between weekdays and weekends: people travel the most from 7 am to 9 am and from 4 pm to 6 pm on weekdays, and from 10 am to 6 pm on weekends.
* Trips on weekday mornings tend to start from scooter stations near transit stations, such as 69 and 70, and end at stations farther from the transit stops, while trips in the afternoon show a reversed pattern, i.e. from the workplace back to stations near the transit stops.
* The analysis estimates the minimum number of scooters needed at each station on weekdays and weekends separately. Most stations need fewer than 30 scooters on weekdays and fewer than 10 on weekends.
* The analysis shows the daily change in the number of scooters at each station. San Francisco Caltrain (Townsend at 4th) gains around 20 extra scooters every weekday, which could be re-allocated to other stations. During weekends, the changes are not that big; the most notable one is that station Embarcadero at Sansome gains 4 more scooters per day on average.