# TRAFFIX

## 1. Introduction to Traffic Forecasting

### 1.1 Problem Definition

The traffic forecasting problem can be framed as learning a function/model $F$ that maps historical traffic state data to the traffic states at one or more future time steps.

> Please note: traffic states normally refer to travel time, speed, volume, etc. Some existing studies have also attempted to forecast the volume of origin-destination (O-D) pairs and metro ridership.

When predicting network-wide traffic states, we usually represent the traffic states of the whole traffic network, with $N$ roads/links/sensor stations, as a vector $x\in R^N$.

A sequence of historical traffic state data with $T$ time steps can be represented by $X=[x_1, x_2, ..., x_T]$. Here, we assume that consecutive traffic states are separated by an identical time interval $\Delta t$.

The future traffic states to be predicted can be a single step $x_{T+1}$ or multiple steps $[x_{T+1}, x_{T+2}, ..., x_{T+n}]$.

The following figure shows the process of forecasting one future step of traffic states:

<img src="img/TrafficForecasting.PNG" alt="drawing" width="600"/>

Hence, the traffic forecasting problem can be defined as learning a function $F$ such that 

\begin{align}
F([x_1, x_2, ..., x_T]) = x_{T+1}
\end{align}

or

\begin{align}
F([x_1, x_2, ..., x_T]) = [x_{T+1}, x_{T+2}, ..., x_{T+n}]
\end{align}
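To make the input and output shapes of $F$ concrete, here is a minimal sketch that uses a naive persistence rule (repeat the last observed state) as a stand-in for a learned model. The function name `persistence_forecast` and the numeric values are hypothetical and chosen only for illustration; this is not a trained forecasting model.

```python
import numpy as np

def persistence_forecast(history, n_steps=1):
    """Naive stand-in for F: repeat the last observed state n_steps times."""
    history = np.asarray(history)             # X, shape (T, N)
    last_state = history[-1]                  # x_T, shape (N,)
    return np.tile(last_state, (n_steps, 1))  # shape (n_steps, N)

# Hypothetical history with T = 3 time steps and N = 3 sensor stations
X = np.array([[60.0, 55.0, 62.0],
              [58.0, 54.0, 61.0],
              [57.0, 50.0, 63.0]])

one_step = persistence_forecast(X)               # plays the role of x_{T+1}
multi_step = persistence_forecast(X, n_steps=4)  # plays the role of [x_{T+1}, ..., x_{T+4}]
print(one_step.shape, multi_step.shape)
```

Any learned model for this problem has the same interface: a $T \times N$ array in, an $n \times N$ array out.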

### 1.2 Traffic State Dataset Demo and Formatting

###### Taking the Seattle freeway loop detector data as an example

**Dataset Description**

The data were collected by inductive loop detectors deployed on freeways in the Seattle area. The freeways include I-5, I-405, I-90, and SR-520, as shown in the picture below. This dataset contains spatio-temporal speed information for the freeway system. In the picture, each blue icon marks the loop detectors at a milepost. The speed at a milepost is averaged over the multiple loop detectors on the main lanes in the same direction at that milepost. The time interval of the dataset is 5 minutes.

The dataset can be downloaded from [GitHub](https://github.com/zhiyongc/Seattle-Loop-Data) or [Zenodo](https://zenodo.org/record/3258904#.Xeb9HldKi70). 


**Region of data collected** 

<img src="img/SeattleLoopData.png" alt="drawing" width="400"/>

**Dataset meta information**

|Attribute | Value |
|:---|:---|
|Number of loop detector stations | 323 |
|Time range | 2015-01-01 00:00:00 to 2015-12-31 23:55:00|
|Time interval | 5 minutes|
|Unit of speed values| miles per hour (mph)|
|Precision of speed values (GitHub version)| Rounded to 6 decimal places|
|Precision of speed values (Zenodo version)| Rounded to integers|


In the Zenodo version, the speed values, in miles per hour (mph), are rounded to integers.

#### 1.2.1 Read data 



- [ ] Read data from Google Drive

Coming soon!
 


- [x] Read data from local files

The downloaded file of the Zenodo version, named "speed_matrix_2015_1mph", is a Python pickle file that can be read with pandas.read_pickle().


Now, we load the data from the local file and show the top 5 rows as a demonstration:

In [60]:
import pandas as pd
speed_matrix =  pd.read_pickle('speed_matrix_2015_1mph')
speed_matrix.head()

|stamp|d005es15036|d005es15125|d005es15214|d005es15280|d005es15315|d005es15348|d005es15410|d005es15465|d005es15531|d005es15569|...|i520es00526|i520es00560|i520es00624|i520es00684|i520es00714|i520es00746|i520es00770|i520es00861|i520es00935|i520es00972|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|2015-01-01 00:00:00|62|64|62|61|63|64|63|65|63|65|...|64|60|62|62|64|64|62|68|67|62|
|2015-01-01 00:05:00|59|65|65|66|59|62|66|65|61|66|...|64|64|65|60|64|48|54|59|59|61|
|2015-01-01 00:10:00|62|65|65|64|62|64|60|65|64|66|...|60|64|57|63|64|58|59|59|57|57|
|2015-01-01 00:15:00|62|65|67|64|66|65|62|66|62|66|...|65|66|62|58|56|60|57|64|58|64|
|2015-01-01 00:20:00|62|65|67|65|66|65|65|63|62|65|...|66|61|62|63|63|63|55|63|63|65|


The data is formatted as a matrix whose columns (horizontal axis) are loop detector IDs and whose rows (vertical axis) are timestamps.
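As a small illustration of this layout, the sketch below rebuilds the top-left corner of the matrix (the first three timestamps and the first three detectors from the preview above) and shows how a single reading, a network-wide state vector, and one detector's time series can be selected. The variable names are hypothetical.

```python
import pandas as pd

# First three timestamps and detectors, copied from the preview above
stamps = pd.date_range("2015-01-01 00:00:00", periods=3, freq="5min")
toy_matrix = pd.DataFrame(
    {
        "d005es15036": [62, 59, 62],
        "d005es15125": [64, 65, 65],
        "d005es15214": [62, 65, 65],
    },
    index=stamps,
)

# One reading: a single detector at a single timestamp
one_reading = toy_matrix.loc[pd.Timestamp("2015-01-01 00:05:00"), "d005es15125"]

# One row: the state of all detectors at one time step (a vector x)
network_state = toy_matrix.iloc[0].to_numpy()

# One column: the time series of a single detector
detector_series = toy_matrix["d005es15036"].to_numpy()

print(one_reading, network_state, detector_series)
```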

The **temporal dimension** of the dataset is 105120 and the **spatial dimension** is 323. 

Here, the temporal dimension is 105120 = 365 (days) * 24 (hours per day) * 12 (5-minute intervals per hour).

In [65]:
speed_matrix.shape

(105120, 323)
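The temporal dimension can be cross-checked without the dataset by generating the 5-minute timestamps for the stated time range. This is only a sanity check on the arithmetic; it assumes that no timestamps are missing from the data.

```python
import pandas as pd

# All 5-minute timestamps in the stated time range of the dataset
timestamps = pd.date_range("2015-01-01 00:00:00", "2015-12-31 23:55:00", freq="5min")

steps_per_day = 24 * 12  # 12 five-minute intervals per hour
expected_steps = 365 * steps_per_day

print(len(timestamps), expected_steps)
```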

#### 1.2.2 Format data into training, validation, and testing sets

We first split the data matrix into three subsets, i.e., the training, validation, and testing sets.

Normally, the **Training : Validation : Testing** ratio can be **7:2:1** or **6:2:2**.  
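As a rough sketch of what these ratios mean in sample counts, the hypothetical helper `split_sizes` below divides a number of samples by an integer ratio. The count 105109 is the number of samples obtained from 105120 time steps with an input length of 10 and a prediction length of 1. Because it uses integer arithmetic, the sizes can differ by one sample from a split based on floating-point proportions.

```python
def split_sizes(sample_size, ratio=(7, 2, 1)):
    """Return (train, valid, test) sample counts for an integer ratio."""
    total = sum(ratio)
    train = sample_size * ratio[0] // total
    valid = sample_size * ratio[1] // total
    test = sample_size - train - valid  # the remainder goes to the test set
    return train, valid, test

sizes_721 = split_sizes(105109)
sizes_622 = split_sizes(105109, ratio=(6, 2, 2))
print(sizes_721, sizes_622)
```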



The hyperparameters are listed below:

In [57]:
import time
import numpy as np

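In [58]:
BATCH_SIZE = 40        # number of samples per mini-batch
seq_len = 10           # number of historical time steps used as input
pred_len = 1           # number of future time steps to predict
train_propotion = 0.7  # fraction of samples used for training
valid_propotion = 0.2  # fraction of samples used for validation; the rest is for testing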

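In [59]:
np.random.seed(1024)
# Number of sliding-window samples that will be built from the speed matrix
sample_size = speed_matrix.shape[0] - seq_len - pred_len
index = np.arange(sample_size, dtype = int)
np.random.shuffle(index)
index

array([ 23996,  16147,   6243, ..., 100844,  74641,  71611])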

In [47]:
time_len = speed_matrix.shape[0]

# Normalize speeds to [0, 1] by the network-wide maximum speed.
# A separate name keeps the raw speed_matrix intact for later cells.
max_speed = speed_matrix.max().max()
speed_matrix_norm = speed_matrix / max_speed

# Build sliding-window samples: seq_len input steps followed by pred_len label steps
speed_sequences, speed_labels = [], []
for i in range(time_len - seq_len - pred_len):
    speed_sequences.append(speed_matrix_norm.iloc[i:i+seq_len].values)
    speed_labels.append(speed_matrix_norm.iloc[i+seq_len:i+seq_len+pred_len].values)
speed_sequences, speed_labels = np.asarray(speed_sequences), np.asarray(speed_labels)

# Split the dataset into training, validation, and testing sets.
# NOTE: the shuffled index is created but not applied to the arrays, so the split
# below is chronological. Index the arrays with speed_sequences[index] and
# speed_labels[index] first if a shuffled split is intended.
sample_size = speed_sequences.shape[0]
index = np.arange(sample_size, dtype = int)
np.random.shuffle(index)

train_index = int(np.floor(sample_size * train_propotion))
valid_index = int(np.floor(sample_size * ( train_propotion + valid_propotion)))

train_data, train_label = speed_sequences[:train_index], speed_labels[:train_index]
valid_data, valid_label = speed_sequences[train_index:valid_index], speed_labels[train_index:valid_index]
test_data, test_label = speed_sequences[valid_index:], speed_labels[valid_index:]
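The window construction above can also be written without an explicit Python loop. The following is a minimal sketch on a toy matrix, assuming NumPy 1.20 or later for `sliding_window_view`; the helper name `make_windows` is hypothetical. Note that it produces every possible window, which is one more than the loop above, because the loop stops one step before the final row.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(matrix, seq_len, pred_len):
    """Split a (T, N) matrix into input windows and label windows."""
    window = seq_len + pred_len
    # sliding_window_view appends the window axis last: (num_windows, N, window).
    # Transpose so that each sample is (window, N), i.e. time steps by detectors.
    windows = sliding_window_view(matrix, window, axis=0).transpose(0, 2, 1)
    return windows[:, :seq_len], windows[:, seq_len:]

# Toy data: 10 time steps, 2 detectors
toy = np.arange(20, dtype=float).reshape(10, 2)
inputs, labels = make_windows(toy, seq_len=3, pred_len=1)
print(inputs.shape, labels.shape)
```

The returned arrays are read-only views on the original matrix, so copy them (for example with `np.array(inputs)`) before modifying them in place.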

In [36]:
import numpy as np
import torch
import torch.utils.data as utils

def PrepareDataset(speed_matrix, BATCH_SIZE = 40, seq_len = 10, pred_len = 1, train_propotion = 0.7, valid_propotion = 0.2):
    """ Prepare training, validation, and testing dataloaders.
    
    Convert speed/volume/occupancy matrix to training, validation, and testing sets. 
    The vertical axis of speed_matrix is the time axis and the horizontal axis 
    is the spatial axis.
    
    Args:
        speed_matrix: a pandas DataFrame containing spatial-temporal speed data for a network
        BATCH_SIZE: number of samples per mini-batch
        seq_len: length of input sequence
        pred_len: length of predicted sequence
        train_propotion: fraction of samples used for training
        valid_propotion: fraction of samples used for validation
    Returns:
        Training dataloader
        Validation dataloader
        Testing dataloader
        max_speed: the maximum speed used to normalize the data
    """
    
    np.random.seed(1024)
    
    time_len = speed_matrix.shape[0]
    
    # Normalize speeds to [0, 1] by the network-wide maximum speed
    max_speed = speed_matrix.max().max()
    speed_matrix =  speed_matrix / max_speed
    
    # Build sliding-window samples: seq_len input steps followed by pred_len label steps
    speed_sequences, speed_labels = [], []
    for i in range(time_len - seq_len - pred_len):
        speed_sequences.append(speed_matrix.iloc[i:i+seq_len].values)
        speed_labels.append(speed_matrix.iloc[i+seq_len:i+seq_len+pred_len].values)
    speed_sequences, speed_labels = np.asarray(speed_sequences), np.asarray(speed_labels)
    
    # Split the dataset into training, validation, and testing sets.
    # NOTE: the shuffled index is created but not applied to the arrays,
    # so the split below is chronological.
    sample_size = speed_sequences.shape[0]
    index = np.arange(sample_size, dtype = int)
    np.random.shuffle(index)
    
    train_index = int(np.floor(sample_size * train_propotion))
    valid_index = int(np.floor(sample_size * ( train_propotion + valid_propotion)))
    
    train_data, train_label = speed_sequences[:train_index], speed_labels[:train_index]
    valid_data, valid_label = speed_sequences[train_index:valid_index], speed_labels[train_index:valid_index]
    test_data, test_label = speed_sequences[valid_index:], speed_labels[valid_index:]
    
    # Convert the NumPy arrays to float tensors
    train_data, train_label = torch.Tensor(train_data), torch.Tensor(train_label)
    valid_data, valid_label = torch.Tensor(valid_data), torch.Tensor(valid_label)
    test_data, test_label = torch.Tensor(test_data), torch.Tensor(test_label)
    
    train_dataset = utils.TensorDataset(train_data, train_label)
    valid_dataset = utils.TensorDataset(valid_data, valid_label)
    test_dataset = utils.TensorDataset(test_data, test_label)
    
    # Batches are shuffled within each set; drop_last discards the final incomplete batch
    train_dataloader = utils.DataLoader(train_dataset, batch_size = BATCH_SIZE, shuffle=True, drop_last = True)
    valid_dataloader = utils.DataLoader(valid_dataset, batch_size = BATCH_SIZE, shuffle=True, drop_last = True)
    test_dataloader = utils.DataLoader(test_dataset, batch_size = BATCH_SIZE, shuffle=True, drop_last = True)
    
    return train_dataloader, valid_dataloader, test_dataloader, max_speed

In [37]:
import time

# Measure how long the dataset preparation takes (in seconds)
starttime = time.time()
train_dataloader, valid_dataloader, test_dataloader, max_speed = PrepareDataset(speed_matrix)
print( time.time() - starttime )
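PrepareDataset returns max_speed alongside the dataloaders, presumably so that normalized values can be mapped back to miles per hour. The sketch below illustrates that round trip on a few hypothetical speed values; it assumes predictions are produced on the same normalized scale as the inputs.

```python
import numpy as np

# Hypothetical speeds in mph (3 time steps, 2 detectors)
speeds_mph = np.array([[62.0, 64.0],
                       [59.0, 65.0],
                       [31.0, 70.0]])

# Normalize by the maximum speed, as in the data preparation step
max_speed = speeds_mph.max()
normalized = speeds_mph / max_speed

# Multiply by max_speed to convert normalized values back to mph
restored = normalized * max_speed

print(normalized.max(), np.allclose(restored, speeds_mph))
```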

Work in progress: reading the data from Google Drive.

In [3]:
import pandas as pd
import requests
from io import StringIO

In [10]:
r = requests.get('https://drive.google.com/file/d/10ZkXiH8eWGcoPdKazYk6JApOlfepz2oN/view?usp=sharing')
data = r.content

In [11]:
# The sharing link returns Google Drive's HTML viewer page rather than the pickled file
data[:15]

b'<!DOCTYPE html>'

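In [5]:
from io import BytesIO

# A pickle is binary, so wrap the raw bytes in BytesIO rather than the decoded text in StringIO.
# The sharing URL may return an HTML page instead of the file, so check before unpickling.
response = requests.get('https://drive.google.com/open?id=10ZkXiH8eWGcoPdKazYk6JApOlfepz2oN')
if response.content.lstrip().startswith(b'<'):
    print('Received an HTML page instead of the pickled file')
else:
    dfs = pd.read_pickle(BytesIO(response.content))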