# Introduction

This document is meant to get you familiar with my horrible code from this pedestrian crowd project. You can tell how unorganized I was from the number of random notebooks I prototyped over the last one to two years. Ignore those notebooks for now: this notebook alone (with some debugging) should give you the necessary components of a complete pipeline that will:
1. Load the spatio-temporal crowd flow graphs
    This graph dataset was created by integrating spatial and temporal information, following the previous paper: V. W. H. Wong and K. H. Law, "Fusion of CCTV Video and Spatial Information for Automated Crowd Congestion Monitoring in Public Urban Spaces," Algorithms, Mar 2023, 16(3):154. https://doi.org/10.3390/a16030154. Whereas that paper relied on a single-camera dataset for crowd flow estimation, two additional self-collected data sources, recorded with multiple cameras, are now integrated.
2. Forecast crowd flow
3. Visualize hotspots

Proposed title of a publication resulting from this line of work could be something along the lines of: 
"ST-DIF: A spatio-temporal data integration framework to estimate pedestrian crowd flow from multi-camera surveillance in built environments"


# 0. Setup environment

## 0.1. Create a conda environment
Open your terminal and create a new conda virtual environment from environment.yml (`conda env create -f environment.yml`). I always print out a conda cheatsheet (https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf) for this, but now that there's ChatGPT, it's probably unnecessary.

Ideally, in the future we'd use Docker for our projects instead of conda virtual environments.

## 0.2. Import packages
If an import fails, debug version mismatches and packages missing from environment.yml as necessary.

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import cv2, glob, os, torch, math, torchvision, datetime
from tqdm.notebook import tqdm 
import torch.nn.functional as F
import configparser

# my files
from cmgraph import parse_gcs, image_to_world, GCSDatasetLoaderStatic, DatasetLoaderStatic
from models import DenseGCNGRU, GRU_only, GCNGRU

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # should say 'cpu' if you're using mac.
print(device)

# 1. Load data from Campus Crowd

## 1.1. Load CSV tables
The original pedestrian data I collected back at Stanford were videos. Due to IRB regulations, they are anonymized and stored as CSV files. They are publicly available at https://github.com/vivian-wong/Campus-Crowd (though I really need to add a README soon...). Let's take a look at them with the help of pandas.

We have 3 subsets of data: GCS (a public dataset), SEQ, and Stadium (both collected at Stanford as videos and post-processed with a CNN). GCS is loaded with the class GCSDatasetLoaderStatic; SEQ and Stadium are loaded with DatasetLoaderStatic. I wrote both dataset loaders.

I moved some files into the data folder for better organization. If you get a file error you might need to debug a little bit here. 

In [None]:
DATASET_LIST = ['GCS', 'SEQ', 'STADIUM_2023']
DATASET = DATASET_LIST[0]

if DATASET == 'GCS':
    data_dir = "data/GCS"
    trajs= parse_gcs(os.path.join(data_dir,"Annotation")) # 35 seconds to load
    ZONE_LIST =[
        (0, 0, 28, 14),
        (28, 0, 55, 14),
        (0, 14, 28, 24),
        (28, 14, 55, 24),
        (0, 24, 28, 35),
        (28, 24, 55, 35),
        (0, 35, 28, 45),
        (28, 35, 55, 45),
        (0, 45, 55, 55)
    ]
    loader = GCSDatasetLoaderStatic(
        trajs=trajs,
        ZONE_LIST = ZONE_LIST)
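elif DATASET == 'SEQ':
    data_dir = "data/SEQ"
    # NOTE: get_configs_dict is not imported in the import cell above; define or import it before running this branch
    configs_dict = get_configs_dict(os.path.join(data_dir,'SEQ.cfg'))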
    flow_df = pd.read_csv(os.path.join(data_dir,'flow_df_SEQ_1fps.csv'))
    all_zone_dfs = [g for _,g in flow_df.groupby('egress_region')]
    loader = DatasetLoaderStatic(all_zone_dfs,
                                 configs_dict['CMGraph']['adjacency_mat'])
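elif DATASET == 'STADIUM_2023':
    data_dir = "data/Stadium"
    # NOTE: get_configs_dict is not imported in the import cell above; define or import it before running this branch
    configs_dict = get_configs_dict(os.path.join(data_dir,'Stadium_2023.cfg'))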
    flow_df = pd.read_csv(os.path.join(data_dir,'flow_df_stadium_2023_1fps.csv'))
    all_zone_dfs = [g for _,g in flow_df.groupby('egress_region')]
    loader = DatasetLoaderStatic(all_zone_dfs,
                                 configs_dict['CMGraph']['adjacency_mat'])

## 1.2. Convert tables to node features of pytorch-geometric graphs. 
Node features are from the CSVs. Adjacency matrices of Stadium and SEQ are manually entered in the config files.
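
The config files themselves are not shown in this notebook, so the following is only a sketch of how an adjacency matrix stored under a `[CMGraph]` section with an `adjacency_mat` key could be read and turned into the 2 x E edge list that graph libraries such as PyTorch Geometric work with. The JSON-style value format, the sample config text, and the helper names `read_adjacency` and `adjacency_to_edge_index` are all assumptions; the project's own `get_configs_dict` may store and parse things differently.

```python
import configparser
import json

# Hypothetical config text; the real SEQ.cfg / Stadium_2023.cfg may look different.
SAMPLE_CFG = """
[CMGraph]
adjacency_mat = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
"""


def read_adjacency(cfg_text):
    """Parse the (assumed) JSON-style adjacency matrix from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return json.loads(parser["CMGraph"]["adjacency_mat"])


def adjacency_to_edge_index(adj):
    """Convert a dense adjacency matrix into [sources, targets] lists."""
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, weight in enumerate(row):
            if weight != 0:  # a nonzero entry means zone i connects to zone j
                sources.append(i)
                targets.append(j)
    return [sources, targets]


adj = read_adjacency(SAMPLE_CFG)
edge_index = adjacency_to_edge_index(adj)
print(edge_index)
```

Wrapping the result in `torch.tensor(edge_index, dtype=torch.long)` would give the tensor layout that PyTorch Geometric layers take as `edge_index`.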

In [None]:
# Forecasting horizon in time steps; run_experiment.sh sweeps 20, 60, 120 and 240
forecasting_horizon = 20

# Both dataset loaders I wrote have a get_dataset method that returns the dataset
dataset = loader.get_dataset(num_timesteps_in=forecasting_horizon,
                             num_timesteps_out=forecasting_horizon)
print("Dataset type:  ", dataset)
print("Number of samples / sequences: ", len(list(dataset)))
print(next(iter(dataset)))  # show the first sample; see the PyTorch docs for more on dataset iterators
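
To make `num_timesteps_in` and `num_timesteps_out` more concrete, here is a minimal sketch of sliding-window sample generation on a plain Python list. The internals of `get_dataset` are not shown in this notebook, so the stride of 1 and the helper name `sliding_windows` are assumptions made for illustration, not a description of the loader.

```python
def sliding_windows(series, num_timesteps_in, num_timesteps_out):
    """Cut a series into (input window, target window) pairs with stride 1."""
    samples = []
    last_start = len(series) - (num_timesteps_in + num_timesteps_out)
    for start in range(last_start + 1):
        split = start + num_timesteps_in
        inputs = series[start:split]                       # what the model sees
        targets = series[split:split + num_timesteps_out]  # what it must forecast
        samples.append((inputs, targets))
    return samples


toy_series = list(range(10))  # stand-in for one zone's flow over 10 time steps
samples = sliding_windows(toy_series, num_timesteps_in=3, num_timesteps_out=2)
print(len(samples))   # 10 - (3 + 2) + 1 = 6 windows
print(samples[0])
print(samples[-1])
```

Under these assumptions a series of length T yields T - (in + out) + 1 samples, so longer forecasting horizons leave fewer samples to train on.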

In [None]:
# Visualize traffic over time, showing one region at a time for now.
plt.figure(figsize=(20, 5))
region_number = 2
end = -1  # slice end index; note that [:-1] stops one sample before the last (use None to include it)
# first target time step of the chosen region, for every sample in the slice
region_labels = [bucket.y[region_number][0].item() for bucket in list(dataset)[:end]]
plt.plot(region_labels)

Split the dataset into a train set and a test set. The common practice in the field is a train/validation/test split, so we may add a validation set in the future. I worry that our dataset is too small for that, so let's leave it as is for now.

In [None]:
batch_size = 32

input_np = np.array(dataset.features) 
target_np = np.array(dataset.targets) 
input_tensor = torch.from_numpy(input_np).type(torch.FloatTensor).to(device)  # (B, N, F, T)
target_tensor = torch.from_numpy(target_np).type(torch.FloatTensor).to(device)  # (B, N, T)
dataset_new = torch.utils.data.TensorDataset(input_tensor, target_tensor)

proportions = [0.7,0.3] # train: test ratio
lengths = [int(p * len(dataset_new)) for p in proportions]
lengths[-1] = len(dataset_new) - sum(lengths[:-1])
train_dataset_new, test_dataset_new = torch.utils.data.random_split(dataset_new, lengths, generator=torch.Generator().manual_seed(42))

train_loader = torch.utils.data.DataLoader(train_dataset_new, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset_new, batch_size=batch_size, shuffle=False)

print("Number of train buckets: ", len(list(train_dataset_new)))
print("Number of test buckets: ", len(list(test_dataset_new)))
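
If a validation set is added later, the same length computation extends to three parts. The sketch below isolates that arithmetic in a hypothetical helper, `split_lengths`, which is not part of the project code; the 0.7/0.15/0.15 proportions are only an example.

```python
def split_lengths(n_samples, proportions):
    """Turn split proportions into integer lengths that sum to n_samples."""
    lengths = [int(p * n_samples) for p in proportions]
    # Give the rounding remainder to the last split so no sample is dropped.
    lengths[-1] = n_samples - sum(lengths[:-1])
    return lengths


lengths = split_lengths(101, [0.7, 0.15, 0.15])  # train, val, test
print(lengths, sum(lengths))
```

The resulting list can be passed to `torch.utils.data.random_split` in place of the two-element `lengths` used in the cell above, which would then return three subsets instead of two.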

# 2. Forecast 
To easily run multiple training experiments, I had the experiments in a bash file. See "run_experiment.sh", pasted below for reference:
``` bash
#!/bin/bash

################ change forecasting horizon ################################ 
for FORECASTING_HORIZON in 20 60 120 240
do
    for MODEL in DenseGCNGRU GCNGRU GRU
    do
        for DATASET in GCS SEQ STADIUM_2023
        do
            # Call the experiment script with the specified parameters
            python main.py \
                --DATASET $DATASET \
                --MODEL $MODEL \
                --forecasting_horizon $FORECASTING_HORIZON \
                --save_model True \
                --save_dir './checkpoints'
        done
    done
done
```
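
For reference, here is a sketch of an argument parser that would accept the flags used in the script. `main.py` is not reproduced in this notebook, so the defaults, the `choices` lists, and the `str_to_bool` helper are assumptions. One detail worth knowing: `argparse` with `type=bool` treats any non-empty string, including "False", as `True`, so a flag passed as `--save_model True` needs an explicit converter like the one below.

```python
import argparse


def str_to_bool(value):
    """Convert strings such as 'True' or 'false' into a real boolean."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Flag names follow run_experiment.sh; defaults and choices are assumed.
    parser = argparse.ArgumentParser(description="Run one forecasting experiment")
    parser.add_argument("--DATASET", choices=["GCS", "SEQ", "STADIUM_2023"], default="GCS")
    parser.add_argument("--MODEL", choices=["DenseGCNGRU", "GCNGRU", "GRU"], default="GRU")
    parser.add_argument("--forecasting_horizon", type=int, default=20)
    parser.add_argument("--save_model", type=str_to_bool, default=False)
    parser.add_argument("--save_dir", default="./checkpoints")
    return parser


# Parse the same arguments that one iteration of the bash loop would pass.
args = build_parser().parse_args(
    ["--DATASET", "SEQ", "--MODEL", "GCNGRU",
     "--forecasting_horizon", "60", "--save_model", "True",
     "--save_dir", "./checkpoints"]
)
print(args)
```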

An initial "!" in Jupyter runs the rest of the line as a shell command. Before you run the following line, you might need to create a "checkpoints" folder if you don't have one already.

In [None]:
!bash run_experiment.sh

# 3. Comparing results of ST-DIF's forecast with purely temporal (no spatial) information-based forecast
Each model checkpoint should be updated with the test MSE and MAE after running the shell script, because of lines 279-283 in main.py:
``` python
# test model
model, checkpoint_dict = testing_loop(model, test_loader, static_edge_index, checkpoint_dict=checkpoint_dict)
# update save to include test mse and mae. 
if args.save_model:
    save_or_update_checkpoint(checkpoint_dict, path)
```
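
As a reminder of what the two metrics measure, here is a plain-Python sketch of MSE and MAE over flat lists of predictions and targets. `testing_loop` is not shown in this notebook, so how it averages over batches, zones, and time steps is not covered here; averaging over every element is an assumption of this sketch.

```python
def mean_squared_error(pred, target):
    """Average of squared errors; penalizes large misses more heavily."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mean_absolute_error(pred, target):
    """Average of absolute errors; stays in the same units as the data."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Toy values for illustration only, not project results.
pred = [1.0, 2.0, 4.0]
target = [1.0, 3.0, 2.0]
mse = mean_squared_error(pred, target)   # (0 + 1 + 4) / 3
mae = mean_absolute_error(pred, target)  # (0 + 1 + 2) / 3
print(mse, mae)
```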

We can now tabulate these checkpoint results in a pandas dataframe for easy querying, etc. 

In [None]:
checkpoint_dict_list = []
for root, dirs, files in os.walk('./checkpoints'):
    for name in files:
        path = os.path.join(root, name)
        # map_location lets checkpoints saved on a GPU load on a CPU-only machine
        checkpoint_dict = torch.load(path, map_location='cpu')
        new_dict = checkpoint_dict.copy()
        # The file name encodes the run: first token = model,
        # third-to-last = dataset, second-to-last = forecasting horizon
        tokens = name.split('_')
        new_dict['model'] = tokens[0]
        new_dict['path'] = path
        new_dict['forecasting_horizon'] = tokens[-2]
        new_dict['dataset'] = tokens[-3]
        if new_dict['dataset'] == '2023':  # 'STADIUM_2023' itself contains an underscore
            new_dict['dataset'] = 'STADIUM_2023'
        # Drop the weights; they are not needed in the results table
        del new_dict['model_state_dict']
        del new_dict['optimizer_state_dict']
        checkpoint_dict_list.append(new_dict)
results_df = pd.DataFrame(checkpoint_dict_list)
for k in ['forecasting_horizon', 'test_mae', 'test_mse']:
    results_df[k] = pd.to_numeric(results_df[k])
results_df.head(3)
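
The file-name parsing in the cell above can be checked in isolation. In the sketch below, the helper `parse_checkpoint_name` and the example file names are hypothetical; the only thing taken from the cell is the token positions it relies on (model first, dataset third from the end, horizon second from the end).

```python
def parse_checkpoint_name(name):
    """Extract model, dataset and horizon from an underscore-separated file name."""
    tokens = name.split("_")
    dataset = tokens[-3]
    if dataset == "2023":
        # 'STADIUM_2023' is split in two by the underscore, so restore it.
        dataset = "STADIUM_2023"
    return {
        "model": tokens[0],
        "dataset": dataset,
        "forecasting_horizon": int(tokens[-2]),
    }


# Hypothetical file names that follow the assumed pattern.
print(parse_checkpoint_name("DenseGCNGRU_STADIUM_2023_20_checkpoint.pt"))
print(parse_checkpoint_name("GRU_GCS_60_checkpoint.pt"))
```

If the real file names differ from this pattern (for example, no trailing token after the horizon), the indices in both this sketch and the cell above need to be adjusted together.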

In [None]:
replacements = {
    'DenseGCNGRU': 'Dense-GCN-GRU',
    'GCNGRU': 'GCN-GRU',
    'STADIUM_2023': 'Stadium'
}

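# Rename only the label columns (exact matches); a regex replace over the whole
# frame would also rewrite the model names embedded in the 'path' strings
results_df[['model', 'dataset']] = results_df[['model', 'dataset']].replace(replacements)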

In [None]:
results_df.query('forecasting_horizon==20')[['dataset','model','test_mse','test_mae']].sort_values(['dataset', 'model'])
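
To compare the graph-based models against the purely temporal one directly, the table can be pivoted so that each model becomes a column. The sketch below runs on placeholder numbers that are NOT results from this project, and it assumes that the plain GRU is the temporal-only baseline; the helper name `compare_to_baseline` is hypothetical.

```python
import pandas as pd

# Placeholder numbers for illustration only; NOT results from this project.
toy_results = pd.DataFrame({
    "dataset": ["GCS", "GCS", "GCS", "SEQ", "SEQ", "SEQ"],
    "model": ["Dense-GCN-GRU", "GCN-GRU", "GRU"] * 2,
    "forecasting_horizon": [20] * 6,
    "test_mae": [1.0, 1.5, 2.0, 3.0, 3.0, 4.0],
})


def compare_to_baseline(df, metric="test_mae", baseline="GRU"):
    """Pivot models into columns and compute percent change vs the baseline."""
    wide = df.pivot_table(index=["dataset", "forecasting_horizon"],
                          columns="model", values=metric)
    # Negative values mean a lower error than the baseline model.
    rel = wide.sub(wide[baseline], axis=0).div(wide[baseline], axis=0) * 100
    return wide, rel


wide, rel = compare_to_baseline(toy_results)
print(wide)
print(rel)
```

Calling `compare_to_baseline(results_df)` on the real table would give one row per dataset and horizon, which is a convenient layout for a results table in a paper.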

# 4. Potential TODOs, in order from least to most difficult

## 4.1. Merge data from Grand Central Station, so that it's all one package that people can easily experiment with 

## 4.2. Convert to a simple package that's open source on github.

## 4.3. Add visualization tools

## 4.4. Add stochasticity to graph for better crowd flow estimation