# Stations analysis

The data provided by BiciMAD is an hourly snapshot of all the stations in the system, and has the following structure:

| Field name | Description                                                                               |
|------------|-------------------------------------------------------------------------------------------|
| `_id`      | Timestamp of the `stations` snapshot                                                      |
| `stations` | JSON. Array of all stations in the network. Each element shows the status of one station. |

Each `station` object has the following structure:

| Field name           | Description                                                                                         |
|----------------------|-----------------------------------------------------------------------------------------------------|
| `id`                 | Station unique ID                                                                                   |
| `latitude`           | Station's latitude in WGS84 format                                                                  |
| `longitude`          | Station's longitude in WGS84 format                                                                 |
| `name`               | Station name                                                                                        |
| `light`              | Occupancy level (0=low; 1=med; 2=high)                                                              |
| `number`             | Station's logical number. Some stations have more than one part, so `number` might be "1A" or "1B". |
| `activate`           | Is the station active? (0=not active; 1=active)                                                     |
| `no_available`       | Station availability (0=available; 1=not available)                                                 |
| `total_bases`        | Total number of bike bases within the station                                                       |
| `dock_bikes`         | Number of docked bikes in the station                                                               |
| `free_bases`         | Number of free bases                                                                                |
| `reservations_count` | Number of active reservations                                                                       |

These tables are provided for reference only. Please refer to the latest documentation for the BiciMAD API on the [EMT OpenData website](http://opendata.emtmadrid.es).
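As a small illustration of the coded fields in the table above, here is a sketch that maps `light`, `activate` and `no_available` to readable labels. The helper name `decode_station_codes`, the new column names and the sample rows are all made up for this example; only the code-to-label mappings come from the table.

```python
import pandas as pd

# Code-to-label mappings taken from the field table above.
LIGHT_LABELS = {0: "low", 1: "med", 2: "high"}
ACTIVATE_LABELS = {0: "not active", 1: "active"}
NO_AVAILABLE_LABELS = {0: "available", 1: "not available"}


def decode_station_codes(stations: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with label columns added (hypothetical helper)."""
    out = stations.copy()
    out["light_label"] = out["light"].map(LIGHT_LABELS)
    out["activate_label"] = out["activate"].map(ACTIVATE_LABELS)
    out["availability_label"] = out["no_available"].map(NO_AVAILABLE_LABELS)
    return out


# Made-up rows, for illustration only (not real BiciMAD data).
example = pd.DataFrame(
    {"id": [1, 2], "light": [0, 2], "activate": [1, 1], "no_available": [0, 1]}
)
decoded = decode_station_codes(example)
```

Note that `no_available` is inverted with respect to `activate`: a value of 0 means the station is available.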

## Imports & data loading

In [1]:
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
# json_normalize is top-level since pandas 1.0; the pandas.io.json path is deprecated
from pandas import json_normalize

In [3]:
from bokeh.io import output_file, output_notebook, show
from bokeh.models import (
  GMapPlot, GMapOptions, ColumnDataSource, Circle, DataRange1d, PanTool, WheelZoomTool, BoxSelectTool
)
from bokeh.plotting import figure
from bokeh.sampledata.sample_geojson import geojson

In [4]:
output_notebook()

In [5]:
# Watch out! The file is read line by line (one JSON object per line), so lines=True is needed
df_raw = pd.read_json('./data/stations_hour_20170301_20170406.json', lines=True)
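
A minimal sketch of what `lines=True` does, assuming the file is in JSON Lines form (one complete JSON object per line, no enclosing array). The two snapshots below are made up and far smaller than the real ones; they only mirror the `_id` / `stations` layout.

```python
import io

import pandas as pd

# Two made-up snapshots in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"_id": "2017-03-01T00:00:00", "stations": [{"id": 1, "dock_bikes": 2}]}\n'
    '{"_id": "2017-03-01T01:00:00", "stations": [{"id": 1, "dock_bikes": 5}]}\n'
)

# Each line becomes one row; the nested station list stays as a Python list.
snapshots = pd.read_json(sample, lines=True)
```

Without `lines=True`, pandas would try to parse the whole input as a single JSON document, which fails for input like this.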

## Data preprocessing

To analyze the data, we want tables with this format:

| timestamp                  | activate | address             | dock_bikes | free_bases | id | latitude   | longitude  | total_bases | ... |
|----------------------------|----------|---------------------|------------|------------|----|------------|------------|-------------|-----|
| 2017-03-22T22:49:17.396000 | 1        | Puerta del Sol nº 1 | 2          | 20         | 1  | 40.4168961 | -3.7024255 | 24          | ... |

Note that this table is abbreviated and does not show all the available fields.

In a nutshell, what we want is a dataframe containing **all** the stations for every single snapshot. Each station (row) will be tagged with its corresponding timestamp from its snapshot.

That is, we want one record per station per snapshot, so that each station can be followed over time.

In [29]:
frames = []

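# Flatten each hourly snapshot into one row per station
for index, row in df_raw.iterrows():
    timestamp = row['_id']
    # Expand the list of station dicts into a DataFrame
    stations = json_normalize(row['stations'])
    # Tag every station row with the timestamp of its snapshot
    stations['timestamp'] = timestamp
    frames.append(stations)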
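
The same flattening can probably be done in a single call with the `record_path` and `meta` parameters of `pd.json_normalize`. This is a sketch on made-up data, not the approach used in this notebook; it assumes pandas 1.0 or later and that every snapshot has a `stations` list.

```python
import pandas as pd

# Made-up snapshots mimicking the raw layout: a timestamp plus a list of stations.
raw = pd.DataFrame({
    "_id": ["2017-03-01T00:00:00", "2017-03-01T01:00:00"],
    "stations": [
        [{"id": 1, "dock_bikes": 2}, {"id": 2, "dock_bikes": 3}],
        [{"id": 1, "dock_bikes": 5}, {"id": 2, "dock_bikes": 0}],
    ],
})

# record_path turns each station into its own row;
# meta copies the snapshot's _id onto every one of those rows.
flat = pd.json_normalize(
    raw.to_dict("records"), record_path="stations", meta=["_id"]
)
flat = flat.rename(columns={"_id": "timestamp"})
```

This avoids building an intermediate list of frames, and the result gets a fresh 0..n-1 index instead of the repeated per-snapshot index that `pd.concat` keeps by default.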

In [38]:
df = pd.concat(frames)
df[:2]

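|   | activate | address             | dock_bikes | free_bases | id | latitude   | light | longitude  | name             | no_available | number | reservations_count | total_bases | timestamp                  |
|---|----------|---------------------|------------|------------|----|------------|-------|------------|------------------|--------------|--------|--------------------|-------------|----------------------------|
| 0 | 1        | Puerta del Sol nº 1 | 2          | 20         | 1  | 40.4168961 | 0     | -3.7024255 | Puerta del Sol A | 0            | 1a     | 0                  | 24          | 2017-03-22T22:49:17.396000 |
| 1 | 1        | Puerta del Sol nº 1 | 3          | 20         | 2  | 40.4170009 | 0     | -3.7024207 | Puerta del Sol B | 0            | 1b     | 0                  | 24          | 2017-03-22T22:49:17.396000 |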


Yay! Now we can see how a single station has performed over time.
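
As a sketch of what following one station could look like: filter by `id`, order by `timestamp`, and derive a simple ratio. The rows below are made up, and `occupancy` (docked bikes divided by total bases) is a definition chosen for this example, not a field provided by BiciMAD.

```python
import pandas as pd

# Made-up flattened rows in the same shape as df (only a few columns kept).
flat = pd.DataFrame({
    "id": [1, 2, 1, 2],
    "dock_bikes": [2, 3, 5, 0],
    "total_bases": [24, 24, 24, 24],
    "timestamp": [
        "2017-03-01T01:00:00", "2017-03-01T01:00:00",
        "2017-03-01T00:00:00", "2017-03-01T00:00:00",
    ],
})

# Parse the timestamps so rows can be ordered chronologically.
flat["timestamp"] = pd.to_datetime(flat["timestamp"])

# Keep a single station and order its snapshots in time.
station_1 = (
    flat[flat["id"] == 1]
    .sort_values("timestamp")
    .set_index("timestamp")
)

# Share of bases holding a bike at each snapshot (definition assumed here).
occupancy = station_1["dock_bikes"] / station_1["total_bases"]
```

With a datetime index in place, `station_1` can be passed directly to plotting or resampling tools.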