# BiciMAD - Station snapshots analysis
Copyright © 2017 Javi Ramírez <javi.rmrz@gmail.com> | [@rameerez [tw]](http://twitter.com/rameerez) | [GitHub](http://github.com/rameerez)

This code is Open Source, released under the MIT License.

## Dataset description

The data provided by BiciMAD is an hourly snapshot of all the stations in the system, with the following structure:

| Field name | Description                                                                               |
|------------|-------------------------------------------------------------------------------------------|
| `_id`      | Timestamp of the `stations` snapshot                                                      |
| `stations` | JSON. Array of all stations in the network. Each element shows the status of one station. |

Each `station` object has the following structure:

| Field name           | Description                                                                                         |
|----------------------|-----------------------------------------------------------------------------------------------------|
| `id`                 | Station unique ID                                                                                   |
| `latitude`           | Station's latitude in WGS84 format                                                                  |
| `longitude`          | Station's longitude in WGS84 format                                                                 |
| `name`               | Station name                                                                                        |
| `light`              | Occupancy level (0=low; 1=medium; 2=high)                                                           |
| `number`             | Station logical designation (some stations have several parts, so `number` may be "1A" or "1B")     |
| `activate`           | Is the station active? (0=not active; 1=active)                                                     |
| `no_available`       | Station availability (0=available; 1=not available)                                                 |
| `total_bases`        | Total number of bike bases within the station                                                       |
| `dock_bikes`         | Number of docked bikes in the station                                                               |
| `free_bases`         | Number of free bases                                                                                |
| `reservations_count` | Number of active reservations                                                                       |

These tables are only provided as reference. Please refer to the latest documentation for the BiciMAD API in the [EMT OpenData website](http://opendata.emtmadrid.es).
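
As a rough illustration of the structure described above, the sketch below builds one snapshot by hand and walks its `stations` array. All field values are made up for illustration, and treating a station as usable when `activate == 1` and `no_available == 0` is an assumption based on the field descriptions, not documented API behavior.

```python
import json

# One hypothetical snapshot, serialized the way a single record might look.
# Values are illustrative only and do not come from the real dataset.
raw_line = json.dumps({
    "_id": "2017-03-22T22:49:17.396000",
    "stations": [
        {"id": 1, "number": "1A", "name": "Example station A",
         "activate": 1, "no_available": 0,
         "total_bases": 24, "dock_bikes": 2, "free_bases": 20},
        {"id": 2, "number": "1B", "name": "Example station B",
         "activate": 1, "no_available": 1,
         "total_bases": 24, "dock_bikes": 0, "free_bases": 0},
    ],
})

snapshot = json.loads(raw_line)

# Assumption: a station is usable when it is active and not flagged unavailable.
usable_ids = [
    station["id"]
    for station in snapshot["stations"]
    if station["activate"] == 1 and station["no_available"] == 0
]

print(snapshot["_id"], "usable stations:", usable_ids)
```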

---

## Global definitions


In [None]:
# Hourly station snapshots: the status of every station, recorded once per hour
STATIONS_SNAPSHOT_DATASET = './data/stations_hour_20170301_20170406.json'

## Imports & data loading

In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
# pandas >= 1.0 exposes json_normalize at the top level
# (older versions: from pandas.io.json import json_normalize)
from pandas import json_normalize

In [3]:
import matplotlib.pyplot as plt
from bokeh.io import output_file, output_notebook, show
from bokeh.models import (
  GMapPlot, GMapOptions, ColumnDataSource, Circle, DataRange1d, PanTool, WheelZoomTool, BoxSelectTool
)
from bokeh.plotting import figure
from bokeh.layouts import gridplot

In [4]:
output_notebook()

In [5]:
# Watch out! The file is line-delimited JSON (one snapshot per line), so lines=True is needed
df_raw = pd.read_json(STATIONS_SNAPSHOT_DATASET, lines=True)
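
To make the `lines=True` requirement concrete, here is a minimal sketch that reads two hypothetical snapshots from an in-memory string instead of the real file. The sample records are invented and contain only a couple of station fields. Each line is a separate JSON object, and the `stations` cell of every row ends up holding a plain Python list of dicts.

```python
import io
import pandas as pd

# Two hypothetical snapshots, one JSON object per line (no enclosing array).
ndjson = (
    '{"_id": "2017-03-01T00:00:00", "stations": [{"id": 1, "free_bases": 20}]}\n'
    '{"_id": "2017-03-01T01:00:00", "stations": [{"id": 1, "free_bases": 18}]}\n'
)

# lines=True tells pandas to parse each line as its own record.
sample_raw = pd.read_json(io.StringIO(ndjson), lines=True)

print(sample_raw.shape)                      # one row per snapshot
print(type(sample_raw.loc[0, "stations"]))   # nested station records stay as a list
```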

## Data preprocessing

To analyze the data, we want tables with this format:

| timestamp                  | activate | address             | dock_bikes | free_bases | id | latitude   | longitude  | total_bases | ... |
|----------------------------|----------|---------------------|------------|------------|----|------------|------------|-------------|-----|
| 2017-03-22T22:49:17.396000 | 1        | Puerta del Sol nº 1 | 2          | 20         | 1  | 40.4168961 | -3.7024255 | 24          | ... |

Note that the table above is abbreviated and does not show all the available fields.

In a nutshell, what we want is a dataframe containing **all** the stations for every single snapshot. Each station (row) will be tagged with the timestamp of the snapshot it came from.

That is, we want one row per station per snapshot, so that each station can be followed over time.

In [6]:
frames = []

# Flatten each snapshot into one row per station, tagged with the snapshot timestamp
for _, row in df_raw.iterrows():
    timestamp = row['_id']
    stations = json_normalize(row['stations'])
    stations['timestamp'] = timestamp
    frames.append(stations)
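
As a possible alternative to the loop above (a sketch, not the approach used in this notebook), `pd.json_normalize` can flatten the nested records in a single call through its `record_path` and `meta` parameters. The miniature `sample_raw` frame below is hypothetical and stands in for `df_raw`; the call assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical miniature of df_raw: two snapshots with two stations each.
sample_raw = pd.DataFrame({
    "_id": ["2017-03-01T00:00:00", "2017-03-01T01:00:00"],
    "stations": [
        [{"id": 1, "free_bases": 20}, {"id": 2, "free_bases": 5}],
        [{"id": 1, "free_bases": 18}, {"id": 2, "free_bases": 7}],
    ],
})

# record_path expands the nested list; meta copies the snapshot id onto each station row.
flat = pd.json_normalize(
    sample_raw.to_dict("records"),
    record_path="stations",
    meta=["_id"],
).rename(columns={"_id": "timestamp"})

print(flat)
```

The result has one row per station per snapshot, which is the same shape the loop builds frame by frame.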

In [1]:
# ignore_index=True gives the combined frame a single, non-repeating index
df = pd.concat(frames, ignore_index=True)
df.head(2) # showing only two rows to keep it short

In [8]:
station1 = df.loc[df['id'] == 43]
station2 = df.loc[df['id'] == 101]
station3 = df.loc[df['id'] == 3]

Yay! Now we can see how a single station has performed over time.
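
The selection above is a boolean mask on the `id` column. A minimal sketch on hypothetical data (the column names follow the flattened dataframe, the values are invented) shows the same filter, followed by sorting and indexing by timestamp so that the result reads as a time series:

```python
import pandas as pd

# Hypothetical flattened data with the same column names as df (subset only).
sample = pd.DataFrame({
    "id": [43, 101, 43, 101],
    "free_bases": [10, 3, 8, 5],
    "timestamp": pd.to_datetime([
        "2017-03-01 00:00", "2017-03-01 00:00",
        "2017-03-01 01:00", "2017-03-01 01:00",
    ]),
})

# Keep only the rows for station 43, then order them chronologically.
station_43 = sample.loc[sample["id"] == 43].sort_values("timestamp")

# Indexing by timestamp turns the column into a time series.
series_43 = station_43.set_index("timestamp")["free_bases"]
print(series_43)
```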

In [9]:
def datetime(x):
    # Convert timestamps to NumPy datetime64 values for Bokeh's datetime axis
    return np.array(x, dtype=np.datetime64)

In [10]:
p1 = figure(x_axis_type="datetime", title="Station free bases")
p1.grid.grid_line_alpha = 0.3
p1.xaxis.axis_label = 'Date'
p1.yaxis.axis_label = '# of free bases'

p1.line(datetime(station1['timestamp'].values), station1['free_bases'].values, color='#A6CEE3', legend=station1['name'].iloc[0])
p1.legend.location = "top_left"

# Raw series for the first selected station
free_bases = np.array(station1['free_bases'].values)
dates = np.array(station1['timestamp'].values, dtype=np.datetime64)

# Moving average over 30 consecutive snapshots (about 30 hours of hourly data)
window_size = 30
window = np.ones(window_size)/float(window_size)
free_bases_avg = np.convolve(free_bases, window, 'same')

p2 = figure(x_axis_type="datetime", title="Station free bases (average)")
p2.grid.grid_line_alpha = 0
p2.xaxis.axis_label = 'Date'
p2.yaxis.axis_label = '# of free bases'
p2.ygrid.band_fill_color = "olive"
p2.ygrid.band_fill_alpha = 0.1

p2.circle(dates, free_bases, size=4, legend='free bases',
          color='darkgrey', alpha=0.2)

p2.line(dates, free_bases_avg, legend='avg', color='navy')
p2.legend.location = "top_left"
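
One caveat about the moving average in the cell above: `np.convolve(..., 'same')` behaves as if the series were padded with zeros, so the first and last few averaged points are pulled toward zero. The sketch below shows the effect on a tiny made-up series with a window of 3, and compares it with a centered pandas rolling mean that averages only the samples that exist. The rolling mean is offered as one possible alternative, not as what the notebook uses.

```python
import numpy as np
import pandas as pd

# Small invented series, so the edge effect is easy to see.
values = np.array([3.0, 6.0, 9.0, 12.0, 15.0])
window_size = 3
window = np.ones(window_size) / window_size

# 'same' mode: output has the input's length, but edges include implicit zeros.
conv_avg = np.convolve(values, window, "same")

# Centered rolling mean; min_periods=1 shrinks the window at the edges instead.
rolling_avg = (
    pd.Series(values)
    .rolling(window_size, center=True, min_periods=1)
    .mean()
    .to_numpy()
)

print(conv_avg)     # edge values are biased low
print(rolling_avg)  # edge values average only the available neighbours
```

With a window of 30 hourly snapshots, roughly the first and last 15 points of the averaged line are affected in the same way.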

In [11]:
output_notebook()
output_file("station_free_bases.html", title="BiciMAD station free bases")
show(gridplot([[p1,p2]], plot_width=400, plot_height=400))  # open a browser

In [2]:
# to-do: compare different stations over time
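
One possible starting point for the to-do above, sketched on hypothetical data: pivot the flattened dataframe so that each station becomes a column indexed by timestamp. The column names follow the flattened dataframe built earlier; the values are invented, and `pivot_table` averages any duplicate (timestamp, station) pairs by default.

```python
import pandas as pd

# Hypothetical flattened data: two stations observed at two timestamps.
sample = pd.DataFrame({
    "id": [43, 101, 43, 101],
    "free_bases": [10, 3, 8, 5],
    "timestamp": pd.to_datetime([
        "2017-03-01 00:00", "2017-03-01 00:00",
        "2017-03-01 01:00", "2017-03-01 01:00",
    ]),
})

# Wide layout: one row per timestamp, one column per station id.
wide = sample.pivot_table(index="timestamp", columns="id", values="free_bases")

print(wide)
```

Each column of `wide` can then be plotted or compared against the others on a shared time axis.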