# Bayesian Spatio-Temporal Graph Transformer Network (B-STAR) for Multi-Aircraft Trajectory Prediction
Author: Yutian Pang, Arizona State University


Email: yutian.pang@asu.edu

# Part 1: IFF ASDE-X Flight Track Data Processing with PySpark and Hadoop
This is a demonstration of using PySpark and Hadoop for large-scale processing of IFF ASDE-X data. In practice, this data processing would be performed on a server or high-performance cluster via ssh.

## Module Requirements

This Jupyter notebook has been tested with:
- Ubuntu 20.04 LTS (and 18.04 LTS)
- Python 3.8.5 (and 3.8.10)
- Spark 3.1.1 with Hadoop 3.2 (and Spark 3.2.1 with Hadoop 3.2)

The versions in parentheses were tested together. Other combinations of Ubuntu, Python, and Spark should be verified for compatibility. See [this Stack Overflow thread](https://stackoverflow.com/questions/58700384/how-to-fix-typeerror-an-integer-is-required-got-type-bytes-error-when-tryin) for further guidance.

### Instructions for Windows 10 Users

1. **Install Ubuntu on Windows 10 with Windows Subsystem for Linux (WSL)**
    - Windows 10 users with admin privileges can enable Windows Subsystem for Linux (WSL) following [these](https://docs.microsoft.com/en-us/windows/wsl/install-win10) directions.
        
        
2. **Install Anaconda in the Ubuntu terminal**
    - A user can then install Anaconda on their WSL Ubuntu distribution following [these](https://gist.github.com/kauffmanes/5e74916617f9993bc3479f401dfec7da) instructions. 
        
        
3. **Download and unzip Spark on WSL**
    - Identify the distribution of Spark and Hadoop you require [here](https://spark.apache.org/downloads.html). 
    - In your Ubuntu terminal, change to your chosen download directory (likely the ```HOME``` directory) and run the ```wget``` command followed by the download link.
    - Then, unzip the downloaded .tgz file with ```tar -xvzf [fname]```.



## Installing the required Python packages
The required Python packages for this module are:
- **[```pyspark```](http://spark.apache.org/docs/latest/api/python/getting_started/index.html)**
    - This is the Python API for Apache Spark. We will be using the distributed processing features and backend SQL queries for structured data.
- **[```apache-sedona```](https://sedona.apache.org/)**
    - Formerly GeoSpark, Apache Sedona extends the Resilient Distributed Dataset (RDD), the core data structure in Apache Spark, to accommodate big geospatial data in a cluster environment.
    
In the Ubuntu or Anaconda terminal, execute ```pip install pyspark apache-sedona```. This will install both the ```pyspark``` and ```apache-sedona``` packages. 
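
After installation, it can be useful to confirm that both packages are visible to the interpreter that will run the notebook. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and `sedona` is the import name provided by the `apache-sedona` distribution.

```python
from importlib import util


def missing_packages(names):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if util.find_spec(name) is None]


# An empty list means both packages can be imported.
print(missing_packages(["pyspark", "sedona"]))
```

If the list is not empty, the most likely cause is that ```pip``` installed the packages into a different Python environment than the one running the notebook.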

## Setting Environment Variables
PySpark relies on the ```SPARK_HOME```, ```PYTHONPATH```, ```PYSPARK_PYTHON```, and ```PYSPARK_DRIVER_PYTHON``` environment variables. Set them either in the shell environment through the ```.bash_profile``` script (Option 1) or in the Python script before importing the ```pyspark``` module (Option 2).
- ### Option 1: Add environment variables to the ```.bash_profile``` script

    Open the ```.bash_profile``` script in your text editor. On Ubuntu systems, this script is usually found in your ```HOME``` directory ```~/```. If the file does not exist yet, create it. Then add one ```export``` statement per variable and prepend Spark's ```bin``` directory to ```PATH```. For example:

    ```export SPARK_HOME="$HOME/spark-X.X.X-bin-hadoopX.X"```

    ```export PYTHONPATH="$HOME/anaconda3/bin/python3.8"```

    ```export PYSPARK_PYTHON="$HOME/anaconda3/bin/python3.8"```

    ```export PYSPARK_DRIVER_PYTHON="$HOME/anaconda3/bin/python3.8"```

    ```export PATH="$SPARK_HOME/bin:$PATH"```

- ### Option 2: Add the environment variables in the Python script using the ```os``` package

    ```import os```
       
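    ```os.environ["SPARK_HOME"] = os.path.expanduser('~/spark-3.1.1-bin-hadoop3.2')```

    ```os.environ["PYTHONPATH"] = os.path.expanduser('~/anaconda3/bin/python3.8')```

    ```os.environ['PYSPARK_PYTHON'] = os.path.expanduser('~/anaconda3/bin/python3.8')```

    ```os.environ['PYSPARK_DRIVER_PYTHON'] = os.path.expanduser('~/anaconda3/bin/python3.8')```

    Python does not expand ```~``` in plain strings, so the paths are passed through ```os.path.expanduser```.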
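
Whichever option is used, a quick check before starting Spark can catch unset variables or paths that do not exist. The sketch below is one possible way to do this with the standard library; `check_spark_env` and `REQUIRED_VARS` are hypothetical names introduced for illustration.

```python
import os

# The four variables named in this section.
REQUIRED_VARS = ("SPARK_HOME", "PYTHONPATH", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")


def check_spark_env(environ=None):
    """Report, for each variable, whether it is set and points to an existing path."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in REQUIRED_VARS:
        value = environ.get(name)
        if value is None:
            report[name] = "unset"
        elif value.startswith("~"):
            # A literal '~' is only expanded by the shell, not by Python.
            report[name] = "unexpanded '~' in " + value
        elif not os.path.exists(value):
            report[name] = "path not found: " + value
        else:
            report[name] = "ok"
    return report


for name, status in check_spark_env().items():
    print(name, "->", status)
```

Every variable should report ```ok``` before the ```pyspark``` module is imported.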


## Procedure 1: Loading IFF ASDE-X Data into the Python Environment
### Step 1a: Use ```sedona``` to register ```SparkSession``` with geospatial packages

In [12]:
from paraatm.io.iff import IFFSpark

iffspark = IFFSpark()

ImportError: cannot import name 'IFFSpark' from 'paraatm.io.iff' (/home/edecarlo/para-atm/paraatm/io/iff.py)
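
The internals of `IFFSpark` are not shown in this notebook. As a rough illustration of what registering a `SparkSession` with Sedona can involve, the sketch below assumes the Sedona 1.x Python API (`SedonaRegistrator`; newer Sedona releases favor `SedonaContext`) and assumes the Sedona jars are already available to Spark. The function name `build_sedona_session` and the application name are hypothetical, and this is not necessarily how `IFFSpark` is implemented.

```python
def build_sedona_session(app_name="iff-demo"):
    """Create a SparkSession and register Sedona's SQL types and functions.

    Imports happen inside the function so that defining it does not
    require pyspark or apache-sedona to be installed.
    """
    from pyspark.sql import SparkSession
    from sedona.register import SedonaRegistrator
    from sedona.utils import KryoSerializer, SedonaKryoRegistrator

    spark = (
        SparkSession.builder
        .appName(app_name)
        # Sedona recommends Kryo serialization for its geometry types.
        .config("spark.serializer", KryoSerializer.getName)
        .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
        .getOrCreate()
    )
    # Registers the ST_* SQL functions and geometry UDTs on this session.
    SedonaRegistrator.registerAll(spark)
    return spark
```

Log lines such as "The function st_pointfromtext replaced a previously registered function" are consistent with this registration step running more than once in the same session.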

In [6]:
fname = "../../miscellaneous/gnats-fpgen/IFF_SFO_ASDEX_ABC456.csv"
df=iffspark.register_iff_file_as_sql_table(fname,query_name='iffdata')
df.head()
gdf = iffspark.convert_position_to_geometry('iffdata',register_name='iffgeom')
gdf.head()

22/02/01 23:50:51 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.geom.Geometry, which is already registered.
22/02/01 23:50:51 WARN UDTRegistration: Cannot register UDT for org.locationtech.jts.index.SpatialIndex, which is already registered.
22/02/01 23:50:51 WARN SimpleFunctionRegistry: The function st_pointfromtext replaced a previously registered function.
22/02/01 23:50:51 WARN SimpleFunctionRegistry: The function st_polygonfromtext replaced a previously registered function.
22/02/01 23:50:51 WARN SimpleFunctionRegistry: The function st_linestringfromtext replaced a previously registered function.
22/02/01 23:50:51 WARN SimpleFunctionRegistry: The function st_geomfromtext replaced a previously registered function.
22/02/01 23:50:51 WARN SimpleFunctionRegistry: The function st_geomfromwkt replaced a previously registered function.
22/02/01 23:50:51 WARN SimpleFunctionRegistry: The function st_geomfromwkb replaced a previously registered function.
22/02/01 23:50:

|   | recType | recTime | acId | lat | lon | alt | geom |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 1546302315 | ABC123 | 37.61867 | -122.38173 | 0.06 | POINT (37.619 -122.382) |
| 1 | 3 | 1546302316 | ABC123 | 37.6187 | -122.38171 | 0.06 | POINT (37.619 -122.382) |
| 2 | 3 | 1546302318 | ABC123 | 37.61874 | -122.38169 | 0.06 | POINT (37.619 -122.382) |
| 3 | 3 | 1546302319 | ABC123 | 37.61876 | -122.38172 | 0.06 | POINT (37.619 -122.382) |
| 4 | 3 | 1546302320 | ABC123 | 37.61878 | -122.38173 | 0.06 | POINT (37.619 -122.382) |


## Procedure 3: Perform fast SQL queries to retrieve data subsets
We can now use ```spark.sql``` commands to query the newly registered table ```pointtable```, create new data frames from the results, and register them as SQL tables.
### Step 3a: Temporal queries of IFF ASDE-X data
- Define desired time window from a starting timestamp (e.g. Monday, December 31, 2018 at 4:25pm PST)
- Query returning all flight records within **1 hour** time window
- Register query as SQL table

In [3]:
## Define desired time window
duration = 1 #hour
t_start = 1546302340 #Monday, December 31, 2018 at 4:25pm in PST
t_end = t_start + 3600*duration
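
To double-check that the epoch values match the intended wall-clock window, the timestamps can be converted back to local time. This is a small sketch using the standard library; it assumes a fixed UTC-8 offset for PST, and the helper name `to_pst` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # fixed offset; daylight saving is ignored

duration = 1  # hours
t_start = 1546302340
t_end = t_start + 3600 * duration


def to_pst(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware datetime in PST."""
    return datetime.fromtimestamp(epoch_seconds, tz=PST)


print(to_pst(t_start).isoformat())  # 2018-12-31T16:25:40-08:00
print(to_pst(t_end).isoformat())    # 2018-12-31T17:25:40-08:00
```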

In [8]:
## Query returning all flight records within 1 hour time window
df_time=iffspark.query_time("iffgeom",t_start,t_end,register_name='df_time')
df_time.head()

|   | recType | recTime | acId | lat | lon | alt | geom |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 1546302340 | ABC123 | 37.61914 | -122.38157 | 0.06 | POINT (37.619 -122.382) |
| 1 | 3 | 1546302341 | ABC123 | 37.61916 | -122.38156 | 0.06 | POINT (37.619 -122.382) |
| 2 | 3 | 1546302343 | ABC123 | 37.61918 | -122.38155 | 0.13 | POINT (37.619 -122.382) |
| 3 | 3 | 1546302344 | ABC123 | 37.6192 | -122.38155 | 0.13 | POINT (37.619 -122.382) |
| 4 | 3 | 1546302346 | ABC123 | 37.61921 | -122.38152 | 0.13 | POINT (37.619 -122.382) |
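
The implementation of `query_time` is not shown here. A temporal query of this kind can be expressed as a plain SQL statement over the `recTime` column, as in the sketch below. The helper `time_window_sql` is hypothetical, and whether the real method treats the window bounds as inclusive is an assumption.

```python
def time_window_sql(table, t_start, t_end, time_col="recTime"):
    """Build a SQL statement selecting rows whose timestamp lies in [t_start, t_end]."""
    if t_end < t_start:
        raise ValueError("t_end must not be earlier than t_start")
    # BETWEEN is inclusive on both ends.
    return (
        f"SELECT * FROM {table} "
        f"WHERE {time_col} BETWEEN {int(t_start)} AND {int(t_end)}"
    )


sql = time_window_sql("iffgeom", 1546302340, 1546302340 + 3600)
print(sql)
```

With an active session, such a statement could be run with ```spark.sql(sql)``` and the result registered for later queries with ```createOrReplaceTempView('df_time')```.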


In [10]:
## Define desired spatial rectangle around a central point (e.g. KSFO airport) 
apt_coords = [37.6188056,-122.3754167, 0]  # from https://www.airnav.com/airport/ksfo
r = 0.2 # rectangular query range unit: degrees
vs = 0.3 # vertical threshold unit: x100 feet

In [11]:
## Query returning all flight records in df_time around KSFO
df_apt=iffspark.query_fix_and_radius('df_time',apt_coords,r,vs,register_name='df_apt')
df_apt.head()

|   | recType | recTime | acId | lat | lon | alt | geom |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 1546303252 | ABC123 | 37.62887 | -122.36893 | 1.13 | POINT (37.629 -122.369) |
| 1 | 3 | 1546303253 | ABC123 | 37.62963 | -122.36842 | 1.44 | POINT (37.630 -122.368) |
| 2 | 3 | 1546303254 | ABC123 | 37.63041 | -122.36789 | 1.63 | POINT (37.630 -122.368) |
| 3 | 3 | 1546303255 | ABC123 | 37.63116 | -122.36736 | 2.19 | POINT (37.631 -122.367) |
| 4 | 3 | 1546303256 | ABC123 | 37.63192 | -122.36684 | 2.69 | POINT (37.632 -122.367) |
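
The horizontal part of the spatial query can be pictured as a latitude/longitude rectangle centered on the airport. The sketch below illustrates that idea in plain Python; `bounding_box` and `inside` are hypothetical helpers, and treating `r` as the half-width of the rectangle is an assumption. The vertical threshold `vs` is left out because its exact semantics in `query_fix_and_radius` are not stated here, and the altitudes in the sample output above exceed 0.3.

```python
def bounding_box(center, r):
    """Return (lat_min, lon_min, lat_max, lon_max) for a square of half-width r degrees."""
    lat, lon = center[0], center[1]
    return (lat - r, lon - r, lat + r, lon + r)


def inside(lat, lon, box):
    """True if the point lies within the box, edges included."""
    lat_min, lon_min, lat_max, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


apt_coords = [37.6188056, -122.3754167, 0]  # KSFO, as defined above
box = bounding_box(apt_coords, 0.2)

print(inside(37.62887, -122.36893, box))  # True: first row of the query result
print(inside(38.0, -122.37, box))         # False: about 0.38 degrees north of KSFO
```

At this latitude, 0.2 degrees corresponds to roughly 22 km north-south, so the rectangle covers the airport surface and the nearby approach and departure paths.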
