<img src="images/logo_city.png" align="right" width="20%">

# Training A Graph State Prediction Network (GSPNet): Simplified version

This notebook is a prototype showing how to train a graph state prediction network using a CNN and an LSTM.
The content is divided into __3__ parts:

1. Data preprocessing
2. Model building, Training and Tuning
3. Prediction and per-demand modification

The model is built with [PyTorch](https://pytorch.org/).

<p style='color: darkred'><strong>This version is simplified: most of the intermediate explanatory code and comments have been removed, except those that directly contribute to generating images. For the detailed process, see the prototype version.</strong></p>

## Part 1: Data Preprocessing

### 1. Load Libraries and Data

Load in the data. Check the data integrity.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sqlalchemy
import time
import psycopg2
import warnings
import re
import torch
from dask import dataframe as dd
from PIL import Image
from IPython.display import display
from scipy import stats
from matplotlib import pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Load the data for testing only, using dask.

In [2]:
table = dd.read_csv('dataset/nytaxi_yellow_2017_dec.csv')
table.head()

| Unnamed: 0 | tripid | tpep_pickup_datetime | tpep_dropoff_datetime | pulocationid | dolocationid | trip_distance | passenger_count | total_amount | trip_time | trip_avg_speed | trip_time_sec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 108413760 | 2017-12-02 07:44:47 | 2017-12-02 08:17:03 | 161 | 138 | 11.44 | 1 | 49.87 | 00:32:16 | 21.272727 | 1936 |
| 1 | 108449137 | 2017-12-23 16:30:59 | 2017-12-23 23:56:08 | 138 | 225 | 11.98 | 1 | 38.3 | 07:25:09 | 1.614737 | 26709 |
| 2 | 121063599 | 2017-12-01 00:01:35 | 2017-12-01 00:14:01 | 163 | 224 | 2.44 | 2 | 14.16 | 00:12:26 | 11.774799 | 746 |
| 3 | 121063645 | 2017-12-01 00:01:38 | 2017-12-01 00:10:45 | 229 | 229 | 1.06 | 2 | 11.0 | 00:09:07 | 6.976234 | 547 |
| 4 | 121063776 | 2017-12-01 00:00:48 | 2017-12-01 00:32:21 | 140 | 56 | 9.05 | 3 | 39.96 | 00:31:33 | 17.210777 | 1893 |


Check table shape:

In [3]:
table.shape

(Delayed('int-766e6926-9a8f-44d2-b1bd-24e9e746ef97'), 11)

Check data:

Ratio of trips lasting longer than 30 minutes (preferably below 0.1):

In [4]:
(table.loc[table['trip_time_sec'] > 1800].shape[0] / table.shape[0])

Delayed('truediv-1b5e0e320605aa5b5b3c93065e5fdaa1')
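
The `Delayed` outputs above are placeholders: dask builds the computation lazily, and calling `.compute()` on such an object is what produces the actual number. As a minimal sketch of the same check in plain pandas, assuming a small made-up frame (the `toy` values are hypothetical and not taken from the dataset), the ratio can be written as the mean of a boolean mask:

```python
import pandas as pd

# Hypothetical trip durations in seconds (not from the real dataset).
toy = pd.DataFrame({'trip_time_sec': [600, 1936, 26709, 746, 547]})

# The mean of a boolean mask is the fraction of rows where it is True.
long_trip_ratio = (toy['trip_time_sec'] > 1800).mean()
print(long_trip_ratio)
```

Two of the five toy trips exceed 1800 seconds, so the printed ratio is 0.4.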

### 2. Time Interval Preprocessing

Create the time intervals and process the original table to generate images.

Convert the datetime columns from `str` to `Timestamp`.

Function to create time **intervals**:

In [5]:
def timesplit(stp: str, etp: str, freq='10min'):
    '''
    Create a list of consecutive time intervals.
    
    Params:
        stp: string, starting time point, first left bound
        etp: string, ending time point, last right bound
        freq: frequency string, length of each interval (default '10min')
    Both stp and etp must match the pattern "yyyy-mm-dd hh:mm:ss", otherwise an exception is raised.
    
    Return:
        A list of time interval tuples. Each item is a (left, right) pair of
        pandas Timestamp objects taken from a DatetimeIndex.
        For example, a possible return could be [(2017-01-01 00:00:00, 2017-01-01 00:10:00),
                                                 (2017-01-01 00:10:00, 2017-01-01 00:20:00)]
    '''
    # raw string, so that \s is passed to the regex engine instead of being read as a string escape
    pattern = re.compile(r'^([0-9]{4})-([0-1][0-9])-([0-3][0-9])\s([0-1][0-9]|[2][0-3]):([0-5][0-9]):([0-5][0-9])$')
    if pattern.match(stp) and pattern.match(etp):
        time_bounds = pd.date_range(stp, etp, freq=freq)
        sub_intervals = list(zip(time_bounds[:-1], time_bounds[1:]))
        print(len(time_bounds), len(sub_intervals))
        return sub_intervals
    else:
        raise Exception('Provided time bound is of invalid format.')
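
As a quick check of the interval logic in `timesplit`, the sketch below repeats its two core steps (`pd.date_range` followed by pairing neighbouring bounds) for December 2017. It is a standalone illustration, not a replacement for the function, and it skips the format validation.

```python
import pandas as pd

# Boundaries every 10 minutes across December 2017 (both end points included).
time_bounds = pd.date_range('2017-12-01 00:00:00', '2018-01-01 00:00:00', freq='10min')

# Pair each boundary with the next one to get consecutive (left, right) intervals.
sub_intervals = list(zip(time_bounds[:-1], time_bounds[1:]))

print(len(time_bounds), len(sub_intervals))
print(sub_intervals[0])
```

December has 31 days of 144 ten-minute slots each, so this yields 4465 bounds and 4464 intervals.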


### 3. Location Preprocessing

Preprocess location-related information: map zone ids to matrix indices, then generate adjacency matrices.

In [6]:
# Load zone lookup table
zones = pd.read_csv('dataset/taxi_zone_lookup.csv')
print(f'shape is: {zones.shape}')

shape is: (265, 4)


In [7]:
# very important globals:
yellow_zone = zones.loc[zones['Borough'] == 'Manhattan']
print(yellow_zone.shape)
img_size = yellow_zone.shape[0]

real_id = list(map(str, list(yellow_zone.loc[:,'LocationID'])))
conv_id = [i for i in range(img_size)]
assert len(real_id) == len(conv_id)
mp = dict(zip(real_id, conv_id))

(69, 4)
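
The mapping `mp` turns sparse zone ids into contiguous row and column indices. A minimal sketch of the idea, assuming a made-up list of ids (the values below and the helper `to_index` are hypothetical and only for illustration):

```python
# Made-up zone ids; the real list comes from the zone lookup table.
real_id = ['4', '12', '13', '24']

# The position of an id in the list becomes its row/column index in the matrix.
mp = {zone: idx for idx, zone in enumerate(real_id)}

def to_index(location_id):
    """Return the matrix index for a zone id, or None if the zone is not mapped."""
    return mp.get(str(location_id))

print(to_index(13))
print(to_index(161))
```

Returning `None` for unmapped zones makes it explicit which trips fall outside the selected borough, instead of relying on a `KeyError`.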


### 4. Generate Tensor: 3 matrices (layers) of connection

Generate images that represent traffic states.

Define functions to generate desired outcomes:

In [8]:
left = pd.to_datetime('2017-04-01 00:00:00')
print(f'Generating images on {left.year}-{left.month}-{left.day} ...')
print(left.minute, left.second)

Generating images on 2017-4-1 ...
0 0


In [9]:
def gen_snap_layers(table, bound):
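    '''
    Generate Past layer, Now layer and Future layer for one snapshot.
    Params:
        table: DataFrame of trips; the pickup and dropoff columns must already be datetimes
        bound: tuple (left, right), the bounds of one time interval
    Return:
        PNF layers, a tuple of three DataFrames (p_layer, n_layer, f_layer).
    '''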
    # left bound and right bound of time interval
    assert type(bound) == tuple
    left = bound[0]
    right = bound[1]
    
    if left.hour == 0 and left.minute == 0 and left.second == 0:
        print(f'Generating images on {left.year}-{left.month}-{left.day}...')
    
    # no need to sort table indeed?
    projected_table = table.loc[:, ['tripid',
                                    'tpep_pickup_datetime',
                                    'tpep_dropoff_datetime',
                                    'pulocationid',
                                    'dolocationid']]

    # The condition of making snapshot should be:
    # at least one temporal end of a trip should be within the bounds:
    snap = projected_table.loc[
        ((projected_table['tpep_pickup_datetime'] >= left) &
         (projected_table['tpep_pickup_datetime'] < right)) |
        ((projected_table['tpep_dropoff_datetime'] >= left) &
         (projected_table['tpep_dropoff_datetime'] < right))]

    # temp table to generate F,P,N layers
    # keep snap intact
    temp_snap = snap.copy()

    # Use the interval to 'catch' corresponding trips.
    # future layer
    f_layer = temp_snap.loc[(temp_snap['tpep_pickup_datetime'] < right) &
                             (temp_snap['tpep_pickup_datetime'] >= left) &
                             (temp_snap['tpep_dropoff_datetime'] >= right)]
    # past layer
    p_layer = temp_snap.loc[(temp_snap['tpep_pickup_datetime'] < left) &
                             (temp_snap['tpep_dropoff_datetime'] >= left) &
                             (temp_snap['tpep_dropoff_datetime'] < right)]
    # now layer
    n_layer = temp_snap.loc[(temp_snap['tpep_pickup_datetime'] >= left) &
                             (temp_snap['tpep_dropoff_datetime'] < right)]

    # Their count should add up to total trips caught
    assert temp_snap.shape[0] == f_layer.shape[0] + p_layer.shape[0] + n_layer.shape[0]

    return p_layer, n_layer, f_layer


# Function that combines layers to an image
def gen_image(p_layer, n_layer, f_layer):
    '''
    Generate an image from the trips of the three layers.
    Params:
        p_layer: DataFrame of past-layer trips
        n_layer: DataFrame of now-layer trips
        f_layer: DataFrame of future-layer trips
    Return:
        A PIL image.
    '''
    # create a snapshot
    snapshot = np.zeros([img_size, img_size, 3], dtype='float64')

    # unexpected zones
    left_zones = set()

    # future-Red: 0
    for _, row in f_layer.iterrows():
        try:
            snapshot[mp[str(row['pulocationid'])], mp[str(row['dolocationid'])], 0] += 1
        except Exception as e:
            left_zones.add(str(row['pulocationid']))
            left_zones.add(str(row['dolocationid']))

    # past-Green: 1
    for _, row in p_layer.iterrows():
        try:
            snapshot[mp[str(row['pulocationid'])], mp[str(row['dolocationid'])], 1] += 1
        except Exception as e:
            left_zones.add(str(row['pulocationid']))
            left_zones.add(str(row['dolocationid']))

    # now-Blue: 2
    for _, row in n_layer.iterrows():
        try:
            snapshot[mp[str(row['pulocationid'])], mp[str(row['dolocationid'])], 2] += 1
        except Exception as e:
            left_zones.add(str(row['pulocationid']))
            left_zones.add(str(row['dolocationid']))

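    # normalize to the 0-255 range; true division avoids a zero factor when
    # the peak exceeds 255, and the guard avoids dividing by zero when empty
    peak = snapshot.max()
    if peak > 0:
        snapshot *= 255 / peak
    snapshot = snapshot.astype('uint8')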
    image = Image.fromarray(snapshot)
    return image


# generate tensor, because saving image as intermediate data is not efficient
def gen_tensor(p_layer, n_layer, f_layer):
    '''
    Generate a tensor from the trips of the three layers.
    Params:
        p_layer: DataFrame of past-layer trips
        n_layer: DataFrame of now-layer trips
        f_layer: DataFrame of future-layer trips
    Return:
        A torch tensor.
    '''
    # create a snapshot
    snapshot = np.zeros([img_size, img_size, 3], dtype='float64')

    # unexpected zones
    left_zones = set()

    # future-Red: 0
    for _, row in f_layer.iterrows():
        try:
            snapshot[mp[str(row['pulocationid'])], mp[str(row['dolocationid'])], 0] += 1
        except Exception as e:
            left_zones.add(str(row['pulocationid']))
            left_zones.add(str(row['dolocationid']))

    # past-Green: 1
    for _, row in p_layer.iterrows():
        try:
            snapshot[mp[str(row['pulocationid'])], mp[str(row['dolocationid'])], 1] += 1
        except Exception as e:
            left_zones.add(str(row['pulocationid']))
            left_zones.add(str(row['dolocationid']))

    # now-Blue: 2
    for _, row in n_layer.iterrows():
        try:
            snapshot[mp[str(row['pulocationid'])], mp[str(row['dolocationid'])], 2] += 1
        except Exception as e:
            left_zones.add(str(row['pulocationid']))
            left_zones.add(str(row['dolocationid']))

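    # normalize to the 0-255 range; true division avoids a zero factor when
    # the peak exceeds 255, and the guard avoids dividing by zero when empty
    peak = snapshot.max()
    if peak > 0:
        snapshot *= 255 / peak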
    snapshot = torch.from_numpy(snapshot)
    return snapshot


# Function that gets a specific layer of snapshot.
def get_channel(image, layer:str):
    '''
    Get a layer of the snapshot.
    Params:
        image: PIL image
        layer: one of 'P', 'N', 'F' (stored in the G, B and R channels respectively)
    Return:
        An RGB image in which only the selected channel is kept
    '''
    assert layer in ['P', 'N', 'F']
    namedict = {'P': 'G', 'N': 'B', 'F': 'R'}
    chandict = {'R':0, 'G':1, 'B':2}
    template = np.array(image)
    chan = np.zeros([*template.shape], dtype='uint8')
    chan[:,:,chandict[namedict[layer]]] = image.getchannel(namedict[layer])
    chan = Image.fromarray(chan)
    return chan
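
`gen_image` and `gen_tensor` count trips one row at a time with `iterrows`, which can be slow on large tables. As a possible alternative (not the method used in this notebook), the counts for one layer can be accumulated in a single call with `np.add.at`. The sketch below assumes a tiny hypothetical mapping and a hypothetical list of zone pairs.

```python
import numpy as np

img_size = 3                      # hypothetical number of zones
mp = {'10': 0, '20': 1, '30': 2}  # hypothetical zone id -> matrix index

# Hypothetical (pickup, dropoff) zone ids for one layer; '99' is not mapped.
pairs = [('10', '20'), ('10', '20'), ('30', '10'), ('99', '20')]

# Keep only the pairs where both ends are known zones.
known = [(mp[p], mp[d]) for p, d in pairs if p in mp and d in mp]
rows = np.array([r for r, _ in known], dtype=int)
cols = np.array([c for _, c in known], dtype=int)

layer = np.zeros((img_size, img_size), dtype='float64')
# np.add.at accumulates correctly even when the same (row, col) index repeats.
np.add.at(layer, (rows, cols), 1)
print(layer)
```

A plain `layer[rows, cols] += 1` would count repeated index pairs only once, which is why `np.add.at` is used here.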

Set the save paths:

In [10]:
# image save directory:
visual_path = r'D:\GITHUB\GSPNet\snapshots\visualization'
data_path = r'D:\GITHUB\GSPNet\snapshots\data'
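
File names are later built by joining these directories with `\\` inside f-strings. One alternative is to let `pathlib` handle the separator. The sketch below uses `PureWindowsPath` so that it behaves the same on any operating system; the helper name `tensor_file` is hypothetical.

```python
from pathlib import PureWindowsPath

data_path = PureWindowsPath(r'D:\GITHUB\GSPNet\snapshots\data')

def tensor_file(month_abbr, index):
    """Build the tensor file path for one snapshot (hypothetical helper)."""
    return data_path / f'nytaxi_yellow_2017_{month_abbr}_d{index}.pt'

print(tensor_file('dec', 0))
```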

Start process:

In [11]:
months = ['04', '05', '06', '07', '08', '09', '10', '11', '12']
for m in range(len(months)-1):
    print(months[m], months[m+1])

04 05
05 06
06 07
07 08
08 09
09 10
10 11
11 12
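
The pairs printed above look like the start and end months of each month-long range of intervals; that reading is an assumption. Under it, the two bound strings for any month could be derived as in the sketch below, where `month_bounds` is a hypothetical helper. It also handles the December to January rollover.

```python
import pandas as pd

def month_bounds(year, month):
    """Return (start, end) strings for one month in 'yyyy-mm-dd hh:mm:ss' format."""
    start = pd.Timestamp(year=year, month=int(month), day=1)
    end = start + pd.DateOffset(months=1)  # first day of the following month
    fmt = '%Y-%m-%d %H:%M:%S'
    return start.strftime(fmt), end.strftime(fmt)

print(month_bounds(2017, '04'))
print(month_bounds(2017, '12'))
```

The returned strings follow the format that `timesplit` validates, so they could be passed to it directly.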


In [13]:
month_dict = {
    '01': 'jan',
    '02': 'feb',
    '03': 'mar',
    '04': 'apr',
    '05': 'may',
    '06': 'jun',
    '07': 'jul',
    '08': 'aug',
    '09': 'sep',
    '10': 'oct',
    '11': 'nov',
    '12': 'dec'
}

months = ['12']

start = time.time()
print(f'Process started at {time.ctime()}')

for m in range(len(months)):
    print(f'Generating time intervals...')
    timelist = timesplit(f'2017-{months[m]}-01 00:00:00', f'2018-01-01 00:00:00')

    # convert dtype for entire table here:
    print(f'Preparing table data...')
    table = pd.read_csv(f'dataset/nytaxi_yellow_2017_{month_dict[months[m]]}.csv')

    print(f'table shape: {table.shape}')
    table['tpep_pickup_datetime'] = pd.to_datetime(table['tpep_pickup_datetime'])
    table['tpep_dropoff_datetime'] = pd.to_datetime(table['tpep_dropoff_datetime'])

    print('start generating...')
    for i, bound in enumerate(timelist):
    
        p_layer, n_layer, f_layer = gen_snap_layers(table, bound)
        tensor = gen_tensor(p_layer, n_layer, f_layer)
        torch.save(tensor, f'{data_path}\\nytaxi_yellow_2017_{month_dict[months[m]]}_d{i}.pt')
    
        image = gen_image(p_layer, n_layer, f_layer)
        vimage = image.resize((690,690)) # enlarge the 69x69 image by a factor of 10
        vimage.save(f'{visual_path}\\nytaxi_yellow_2017_{month_dict[months[m]]}_v{i}.jpg')
    
print(f'Image and tensor generation done in {time.time() - start :.2f} seconds.')

Process started at Mon Dec 31 21:49:30 2018
Generating time intervals...
4465 4464
Preparing table data...
table shape: (9446596, 11)
start generating...
Generating images on 2017-12-1...
Generating images on 2017-12-2...
Generating images on 2017-12-3...
Generating images on 2017-12-4...
Generating images on 2017-12-5...
Generating images on 2017-12-6...
Generating images on 2017-12-7...
Generating images on 2017-12-8...
Generating images on 2017-12-9...
Generating images on 2017-12-10...
Generating images on 2017-12-11...
Generating images on 2017-12-12...
Generating images on 2017-12-13...
Generating images on 2017-12-14...
Generating images on 2017-12-15...
Generating images on 2017-12-16...
Generating images on 2017-12-17...
Generating images on 2017-12-18...
Generating images on 2017-12-19...
Generating images on 2017-12-20...
Generating images on 2017-12-21...
Generating images on 2017-12-22...
Generating images on 2017-12-23...
Generating images on 2017-12-24...
Generating imag