# 2. Data Introduction & Process
---
## 2.1 LOOP-SEA
### 2.1.1 Data introduction
#### The data download link contains a list of files:
* `speed_matrix_2015`: Loop Speed Matrix, a pickled file that can be read with pandas or other Python packages.
* `Loop_Seattle_2015_A.npy`: Loop Adjacency Matrix, a NumPy matrix that describes the traffic network structure as a graph.
* `Loop_Seattle_2015_reachability_free_flow_Xmin.npy`: Loop Free-flow Reachability Matrix for an X-minute drive.
* `nodes_loop_mp_list.csv`: List of loop detector mileposts, in the same order as the columns of the Loop Speed Matrix.
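
As a minimal sketch of how an adjacency matrix stored as `.npy` can be loaded and sanity-checked, the example below writes a toy 4-node matrix to a temporary file and reads it back. It assumes the file holds a square 2-D array written by `np.save`; `load_adjacency` and the toy matrix are hypothetical and not part of the dataset.

```python
import os
import tempfile

import numpy as np


def load_adjacency(path, expected_nodes=None):
    """Load a square adjacency matrix from a .npy file (hypothetical helper)."""
    adjacency = np.load(path)
    if adjacency.ndim != 2 or adjacency.shape[0] != adjacency.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {adjacency.shape}")
    if expected_nodes is not None and adjacency.shape[0] != expected_nodes:
        raise ValueError(f"expected {expected_nodes} nodes, got {adjacency.shape[0]}")
    return adjacency


# Toy chain graph 0-1-2-3 with self-loops, standing in for the real file.
toy = np.eye(4) + np.eye(4, k=1) + np.eye(4, k=-1)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_A.npy")
    np.save(toy_path, toy)
    adjacency = load_adjacency(toy_path, expected_nodes=4)

# Number of neighbors per node, excluding the self-loop on the diagonal.
neighbors = adjacency.sum(axis=1) - np.diag(adjacency)
print(neighbors)
```

For the real data, `expected_nodes` could be set to the number of columns in the speed matrix so that a mismatch between the two files is caught early.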

#### Loop detectors distribution (<a href="./img/cabinet.html">Cabinets</a>)
The data is collected by the **inductive loop detectors deployed on freeways in the Seattle area**. The freeways are I-5, I-405, I-90, and SR-520, shown in the picture below. This dataset contains spatio-temporal speed information for the freeway system. In the picture, each blue icon marks the loop detectors at a milepost. The speed at a milepost is averaged over the loop detectors on the main lanes in the same direction at that milepost. The time interval of the dataset is 5 minutes.
> <img src="./img/dataset1-loops.png" width="600" height="400"></img>

#### speed_matrix_2015
A sample of `speed_matrix_2015` is shown in the following figure. The column headers are the milepost names and the row index holds the timestamps.
><img src='./img/dataset1-sample.png' width="800" height="400"></img>

The name of each milepost header contains 11 characters:
  * Char 1: 'd' or 'i', i.e. decreasing direction or increasing direction.
  * Chars 2-4: route name, e.g. '405' denotes the route I-405.
  * Chars 5-6: 'es', which carries no meaning here.
  * Chars 7-11: milepost in hundredths of a mile, e.g. '15036' denotes milepost 150.36.
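
The naming scheme above can be decoded with a few string slices. The following is a minimal sketch; `MilepostID` and `parse_milepost_id` are hypothetical names introduced for illustration and do not ship with the dataset.

```python
from typing import NamedTuple


class MilepostID(NamedTuple):
    direction: str   # 'decreasing' or 'increasing'
    route: str       # three-character route name, e.g. '405'
    milepost: float  # milepost in miles, e.g. 150.36


def parse_milepost_id(name: str) -> MilepostID:
    """Split an 11-character header such as 'd005es15036' into its fields."""
    if len(name) != 11:
        raise ValueError(f"expected 11 characters, got {len(name)}: {name!r}")
    directions = {"d": "decreasing", "i": "increasing"}
    if name[0] not in directions:
        raise ValueError(f"unknown direction flag {name[0]!r} in {name!r}")
    digits = name[6:11]
    if not digits.isdigit():
        raise ValueError(f"milepost part is not numeric in {name!r}")
    # Characters 5-6 ('es') carry no meaning and are skipped.
    return MilepostID(directions[name[0]], name[1:4], int(digits) / 100)


print(parse_milepost_id("d005es15036"))
# MilepostID(direction='decreasing', route='005', milepost=150.36)
```

Applied to every column of the speed matrix, a helper like this makes it possible to group or sort detectors by route and direction.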
  
><img src='./img/dataset1-heatmap.png' width="800" height="400"></img>


### 2.1.2 Data Download Link: [Seattle Loop Dataset](https://drive.google.com/drive/folders/1XuK0fgI6lmSUzmToyDdHQy8CPunlm5yr?usp=sharing)

#### If you use this dataset in your work, please cite the following references:
###### Reference:
* `Cui, Z., Ke, R., & Wang, Y. (2018). Deep Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-wide Traffic Speed Prediction. arXiv preprint arXiv:1801.02143.`
* `Cui, Z., Henrickson, K., Ke, R., & Wang, Y. (2018). High-Order Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting. arXiv preprint arXiv:1802.07007.`
###### BibTex:
```
@article{cui2018deep,
  title={Deep Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-wide Traffic Speed Prediction},
  author={Cui, Zhiyong and Ke, Ruimin and Wang, Yinhai},
  journal={arXiv preprint arXiv:1801.02143},
  year={2018}
}
@article{cui2018high,
  title={High-Order Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting},
  author={Cui, Zhiyong and Henrickson, Kristian and Ke, Ruimin and Wang, Yinhai},
  journal={arXiv preprint arXiv:1802.07007},
  year={2018}
}
```
#### Note: This dataset should only be used for research.


___

## 2.2 METR-LA and PEMS-BAY
### 2.2.1 Data introduction
#### These datasets are provided by [Yaguang Li - DCRNN](https://github.com/liyaguang/DCRNN).
* **METR-LA**: This traffic dataset contains traffic information collected from loop detectors on the highways of Los Angeles County (Jagadish et al., 2014). It covers 207 selected sensors and 4 months of data, from Mar 1st 2012 to Jun 30th 2012. The total number of observed traffic data points is 6,519,002.
* **PEMS-BAY**: This traffic dataset is collected by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). It covers 325 selected sensors in the Bay Area and 6 months of data, from Jan 1st 2017 to May 31st 2017. The total number of observed traffic data points is 16,937,179.

#### The data download link contains a list of files:
* `metr-la.h5`: METR-LA matrix
* `adj_mx.pkl`: METR-LA road network adjacency matrix
* `pems-bay.h5`: PEMS-BAY matrix
* `adj_mx_bay.pkl`: PEMS-BAY road network adjacency matrix

#### Sensors distribution
> <img src="./img/dataset2-sensors.png" width="800" height="400"></img>


#### (1) METR-LA
><img src='./img/dataset2-sample.png' width="800" height="400"></img>
><img src='./img/dataset2-heatmap.png' width="800" height="400"></img>

#### (2) PEMS-BAY
><img src='./img/dataset3-sample.png' width="800" height="400"></img>
><img src='./img/dataset3-heatmap.png' width="800" height="400"></img>


### 2.2.2 Data Download Link: [METR-LA and PEMS-BAY](https://drive.google.com/drive/folders/10FOTa6HXPqX8Pf5WRoRwcFnW9BrNZEIX)
###### Reference:
* `Li, Y., Yu, R., Shahabi, C., & Liu, Y. (2017). Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.`
###### BibTex:
```
@inproceedings{li2018dcrnn_traffic,
  title={Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting},
  author={Li, Yaguang and Yu, Rose and Shahabi, Cyrus and Liu, Yan},
  booktitle={International Conference on Learning Representations (ICLR '18)},
  year={2018}
}
```
---

## 2.3 INRIX-SEA
### 2.3.1 Data introduction
#### The data download link contains a list of files:
* `INRIX_Seattle_Speed_Matrix__2012_v2.pkl`: Speed matrix
* `INRIX_Seattle_Adjacency_matrix_2012_v2.npy`: Road network adjacency matrix
><img src='./img/dataset4-sample.png' width="800" height="400"></img>
><img src='./img/dataset4-heatmap.png' width="800" height="400"></img>

## 2.3 Data Process Demo
### 2.3.1 Read dataset
The Zenodo version of the LOOP-SEA speed matrix, `speed_matrix_2015_1mph`, is a Python pickle file that can be read with [pandas.read_pickle()](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_pickle.html). We load all four datasets from local files and show the first 10 rows (and first 9 columns) of LOOP-SEA as a demonstration:

In [2]:
```
import pandas as pd
dataset1 = pd.read_pickle('./LOOP-SEA/speed_matrix_2015_1mph')
dataset2 = pd.read_hdf('./METR-LA/metr-la.h5')
dataset3 = pd.read_hdf('./PEMS-BAY/pems-bay.h5')
dataset4 = pd.read_pickle('./INRIX-SEA/INRIX_Seattle_Speed_Matrix__2012_v2.pkl')
dataset1.iloc[:,:9].head(10)
```

| stamp | d005es15036 | d005es15125 | d005es15214 | d005es15280 | d005es15315 | d005es15348 | d005es15410 | d005es15465 | d005es15531 |
|---|---|---|---|---|---|---|---|---|---|
| 2015-01-01 00:00:00 | 62 | 64 | 62 | 61 | 63 | 64 | 63 | 65 | 63 |
| 2015-01-01 00:05:00 | 59 | 65 | 65 | 66 | 59 | 62 | 66 | 65 | 61 |
| 2015-01-01 00:10:00 | 62 | 65 | 65 | 64 | 62 | 64 | 60 | 65 | 64 |
| 2015-01-01 00:15:00 | 62 | 65 | 67 | 64 | 66 | 65 | 62 | 66 | 62 |
| 2015-01-01 00:20:00 | 62 | 65 | 67 | 65 | 66 | 65 | 65 | 63 | 62 |
| 2015-01-01 00:25:00 | 63 | 68 | 66 | 63 | 64 | 63 | 62 | 68 | 61 |
| 2015-01-01 00:30:00 | 62 | 65 | 65 | 62 | 62 | 64 | 63 | 68 | 61 |
| 2015-01-01 00:35:00 | 64 | 67 | 67 | 63 | 65 | 65 | 61 | 65 | 61 |
| 2015-01-01 00:40:00 | 63 | 67 | 67 | 65 | 63 | 63 | 60 | 65 | 62 |
| 2015-01-01 00:45:00 | 62 | 66 | 65 | 65 | 64 | 63 | 63 | 67 | 63 |


The data is formatted as a matrix whose columns are loop detector names and whose rows are timestamps.

The **temporal dimension** of the LOOP-SEA dataset is 105120 and the **spatial dimension** is 323.

Here, the temporal dimension is 105120 = 365 (days) * 24 (hours) * 12 (5-minute intervals per hour).
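
The count above can be cross-checked by generating the expected timestamps. The sketch below is only an illustration: it assumes the index covers every 5-minute step of calendar year 2015, and `missing_timestamps` is a hypothetical helper that is demonstrated on a toy index rather than on the real file.

```python
import pandas as pd

# Every 5-minute timestamp in 2015, from 00:00 on Jan 1 to 23:55 on Dec 31.
expected_index = pd.date_range("2015-01-01 00:00", "2015-12-31 23:55", freq="5min")
steps_per_day = 24 * 12
print(len(expected_index), 365 * steps_per_day)  # 105120 105120


def missing_timestamps(index, expected):
    """Return the timestamps in `expected` that are absent from `index`."""
    return expected.difference(index)


# Toy index with two timestamps removed, standing in for a real DataFrame index.
toy_index = expected_index.delete([1, 2])
gaps = missing_timestamps(toy_index, expected_index)
print(list(gaps))
```

With the real data, passing `dataset1.index` as the first argument would list any 5-minute steps that are absent, provided the index is a `DatetimeIndex`.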

In [3]:
```
print('dataset1(Loop-SEA):', dataset1.shape)
print('dataset2(METR-LA):', dataset2.shape)
print('dataset3(PEMS-BAY):', dataset3.shape)
print('dataset4(INRIX_SEA):', dataset4.shape)
```

```
dataset1(Loop-SEA): (105120, 323)
dataset2(METR-LA): (34272, 207)
dataset3(PEMS-BAY): (52116, 325)
dataset4(INRIX_SEA): (105312, 745)
```


### 2.3.2 Format data into training, validation, and testing sets
We first split the data matrix into three subsets: the training, validation, and testing sets.

Common **Training : Validation : Testing** ratios are **7:2:1** and **6:2:2**.
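
To make the ratio concrete, the following sketch computes the split boundaries for a 7:2:1 split. `split_boundaries` is a hypothetical helper; the sample count 105109 is the number of LOOP-SEA windows obtained with `seq_len = 10` and `pred_len = 1` (105120 - 10 - 1).

```python
import math


def split_boundaries(sample_size, train_proportion=0.7, valid_proportion=0.2):
    """Return the end indices of the training and validation slices."""
    train_end = math.floor(sample_size * train_proportion)
    valid_end = math.floor(sample_size * (train_proportion + valid_proportion))
    return train_end, valid_end


sample_size = 105109
train_end, valid_end = split_boundaries(sample_size)

# Sizes of the training, validation, and testing sets.
sizes = (train_end, valid_end - train_end, sample_size - valid_end)
print(sizes)  # (73576, 21022, 10511)
```

Whatever remains after the training and validation slices goes to the testing set, so only two proportions need to be specified.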

The meta parameters are illustrated below:

> <img src="./img/data-process.png" width="800" height="400"></img>
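
As a rough illustration of how an input length `seq_len` and a prediction length `pred_len` turn a time-by-sensor matrix into samples, the sketch below applies a sliding window to a toy array. `sliding_windows` is a hypothetical stand-alone function that uses the same loop bounds as `input_output_extract` in the class that follows; the toy matrix is made up.

```python
import numpy as np


def sliding_windows(matrix, seq_len, pred_len):
    """Cut a (time, sensors) array into input windows and their labels."""
    n_steps = matrix.shape[0]
    inputs, labels = [], []
    for start in range(n_steps - seq_len - pred_len):
        split = start + seq_len
        inputs.append(matrix[start:split])             # seq_len past steps
        labels.append(matrix[split:split + pred_len])  # next pred_len steps
    return np.asarray(inputs), np.asarray(labels)


# Toy matrix: 20 time steps, 3 sensors.
toy = np.arange(20 * 3, dtype=float).reshape(20, 3)
x, y = sliding_windows(toy, seq_len=10, pred_len=1)
print(x.shape, y.shape)  # (9, 10, 3) (9, 1, 3)
```

Each sample pairs `seq_len` consecutive rows with the `pred_len` rows that immediately follow them, so the first label here is row 10 of the toy matrix.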

In [4]:
```
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


class PrepareDataset:
    """Prepare training, validation, and testing datasets and dataloaders.

    Convert a speed/volume/occupancy matrix into training, validation, and
    testing sets. The vertical axis of the matrix is the time axis and the
    horizontal axis is the spatial axis.

    Args:
        batch_size: number of samples per batch
        seq_len: length of input sequence
        pred_len: length of predicted sequence
        train_propotion: fraction of samples used for training
        valid_propotion: fraction of samples used for validation
    """

    def __init__(self, batch_size=40, seq_len=10, pred_len=1,
                 train_propotion=0.7, valid_propotion=0.2):
        self.BATCH_SIZE = batch_size
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.train_propotion = train_propotion
        self.valid_propotion = valid_propotion

    def input_output_extract(self, dataset):
        """Cut a (time, sensors) DataFrame into input windows and labels."""
        np.random.seed(1024)  # makes the shuffle in train_test_split reproducible
        time_len = dataset.shape[0]
        # Normalize by the global maximum of the matrix.
        dataset = dataset / dataset.max().max()

        data_sequences, data_labels = [], []
        for i in range(time_len - self.seq_len - self.pred_len):
            # Input: seq_len consecutive steps; label: the next pred_len steps.
            data_sequences.append(dataset.iloc[i:i + self.seq_len].values)
            data_labels.append(dataset.iloc[i + self.seq_len:i + self.seq_len + self.pred_len].values)
        return np.asarray(data_sequences), np.asarray(data_labels)

    def train_test_split(self, data_sequences, data_labels):
        """Shuffle the samples and split them into three dataloaders."""
        sample_size = data_sequences.shape[0]
        index = np.arange(sample_size, dtype=int)
        np.random.shuffle(index)
        # Apply the shuffled order to inputs and labels before splitting.
        data_sequences, data_labels = data_sequences[index], data_labels[index]

        train_index = int(np.floor(sample_size * self.train_propotion))
        valid_index = int(np.floor(sample_size * (self.train_propotion + self.valid_propotion)))

        train_data, train_label = data_sequences[:train_index], data_labels[:train_index]
        valid_data, valid_label = data_sequences[train_index:valid_index], data_labels[train_index:valid_index]
        test_data, test_label = data_sequences[valid_index:], data_labels[valid_index:]

        train_dataset = TensorDataset(torch.Tensor(train_data), torch.Tensor(train_label))
        valid_dataset = TensorDataset(torch.Tensor(valid_data), torch.Tensor(valid_label))
        test_dataset = TensorDataset(torch.Tensor(test_data), torch.Tensor(test_label))

        train_dataloader = DataLoader(train_dataset, batch_size=self.BATCH_SIZE, shuffle=True, drop_last=True)
        valid_dataloader = DataLoader(valid_dataset, batch_size=self.BATCH_SIZE, shuffle=True, drop_last=True)
        test_dataloader = DataLoader(test_dataset, batch_size=self.BATCH_SIZE, shuffle=True, drop_last=True)

        return train_dataloader, valid_dataloader, test_dataloader

    def fit(self, dataset):
        """Return the training, validation, and testing dataloaders for a matrix."""
        data_sequences, data_labels = self.input_output_extract(dataset)
        return self.train_test_split(data_sequences, data_labels)
```

In [5]:
```
pdt = PrepareDataset(batch_size = 40, seq_len = 10, pred_len = 1, train_propotion = 0.7, valid_propotion = 0.2)
```

In [6]:
```
sequences, labels = pdt.input_output_extract(dataset1)
```

In [7]:
```
print(sequences.shape, labels.shape)
```

```
(105109, 10, 323) (105109, 1, 323)
```

In [8]:
```
train_data, valid_data, test_data = pdt.fit(dataset1)
```