# Feature generation

The first thing I want to note here is that model selection was a very important criterion for feature generation, but I am going to document model selection and feature generation separately for organizational purposes. If you want to skip ahead to the next notebook, feel free!

## Widening the data

Before getting into real feature generation, I made some alterations to the data. As mentioned in the last notebook, the dataset was really big. I wanted to keep the data in its native format for EDA, since it is much easier to analyze columns when they hold single values than it is to analyze semi-structured elements. For the rest of the pipeline, however, I needed to optimize the data, so I reformatted it using PostgreSQL array columns in the scripts `_06_traces_db_partitioning.py` and `_07_traces_to_segments.py`. I went ahead and threw out the fields I determined I wouldn't use, but other than that no mathematical transformations occurred at this step. Traces shorter than 15 minutes were excluded from the output dataset.

Here is a brief summary of what each script did and how:

- `_06_traces_db_partitioning.py`
  - Partitions the aircraft tracking data table into smaller tables based on the first two characters of the ICAO aircraft identifier (hex prefixes '00' to 'ff')
  - Creates up to 256 separate tables with identical schema
  - Transfers data in batches of 100,000 rows with indexes on timestamp and ICAO columns
  - Can process either the full hex range (00-ff) or specific subsets

- `_07_traces_to_segments.py`
  - Processes aircraft tracking data across the different hex-partitioned tables
  - Implements multi-level parallelization: thread pooling for hex partitions and process pooling for aircraft (ICAO codes)
  - Three main stages: extraction of raw trace data, transformation into flight segments based on timestamp gaps, and loading of processed data
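
To make the two ideas above concrete, here is a minimal sketch (not the actual scripts) of how the hex-prefix partitions can be enumerated and how a trace can be split into segments on timestamp gaps. The function names and the `max_gap` value are assumptions for illustration; the only numbers taken from this notebook are the `00`-`ff` prefix range and the 15-minute minimum duration.

```python
from datetime import datetime, timedelta

def hex_prefixes(start="00", end="ff"):
    """Return the two-character hex prefixes from start to end, inclusive."""
    return [f"{i:02x}" for i in range(int(start, 16), int(end, 16) + 1)]

def partition_for(icao):
    """Map an ICAO hex identifier to its partition prefix (first two characters)."""
    return icao[:2].lower()

def split_on_gaps(timestamps, max_gap, min_duration=timedelta(minutes=15)):
    """Split sorted timestamps wherever consecutive points are more than
    max_gap apart, then drop segments shorter than min_duration.
    max_gap is a hypothetical parameter; the real threshold is not stated here."""
    segments, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > max_gap:
            segments.append(current)
            current = []
        current.append(ts)
    if current:
        segments.append(current)
    return [s for s in segments if s[-1] - s[0] >= min_duration]

print(len(hex_prefixes()))       # 256
print(partition_for("407fb9"))   # 40

# Made-up trace: points at 0, 10, 20, 120 and 125 minutes.
t0 = datetime(2024, 8, 1)
trace = [t0 + timedelta(minutes=m) for m in (0, 10, 20, 120, 125)]
kept = split_on_gaps(trace, max_gap=timedelta(minutes=30))
print(len(kept))                 # 1 (the 5-minute tail segment is dropped)
```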

The key tools used to perform the transformation were already covered in the preceding notebook and include the same database and parallelization tools.

Let's go ahead and do the imports for our notebook.

#### Dependencies and config

In [1]:
import pandas as pd

### View of segmented data

In [2]:
# output what the segmented data looks like

# read in sample data
df = pd.read_parquet('../data/_07_traces_to_segments/segments_sample.parquet')

display(df.head())

rows in sample data: 100


Unnamed: 0,icao,start_timestamp,duration,time_offsets,nav_heading,longitude,vertical_rate_baro,latitude,ground_speed,flags,altitude_baro,nav_altitude_mcp,track,indicated_airspeed,roll_angle,squawk,flight,category,emergency
0,407fb9,2024-08-01 15:53:47.540000+00:00,0 days 00:34:07.310000,"[0, 8430000, 48120000, 66400000, 119710000, 13...",,"[2.237915, 2.230303, 2.197187, 2.180001, 2.131...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[53.147003, 53.143295, 53.127274, 53.118942, 5...","[148.7, 148.7, 147.9, 147.9, 147.9, 147.9, 148...","[3, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1600, 1600, 1600, 1600, 1600, 1600, 1600, 160...","[35008, 38000, 33008, 33008, 20000, 20000]","[231.30, 231.30, 231.00, 231.00, 231.00, 231.0...",,"[0.00, 0.00, 0.00, -0.20, 0.00, -0.20, 0.00, -...","[7420, 7420, 7420, 7420, 7420, 7420, 7420, 742...",,,
1,407fb9,2024-08-01 17:27:53.700000+00:00,0 days 00:19:51.340000,"[0, 1080000, 10140000, 11990000, 13470000, 149...",,"[1.287912, 1.287994, 1.288304, 1.288376, 1.288...","[-960, 448, -192, 0, 0, 0, -320, -256, 832, 13...","[52.675347, 52.675415, 52.675859, 52.675919, 5...","[9.9, 12.5, 12.7, 8.1, 8.1, 3.7, 1.2, 0.7, 2.1...","[3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[100, 100, 75, 75, 75, 75, 50, 75, 125, 200, 2...","[35008, 35008, 9008, 35008]","[45.00, 28.60, 18.40, 29.70, 29.70, 56.30, 90....","[264, 274, 258, 260, 260, 259, 259]","[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, -4.20, -8...","[7420, 7420, 7420, 7420, 7420, 7420, 7420, 742...",,,
2,407fb9,2024-08-01 17:58:07.680000+00:00,0 days 00:28:40.530000,"[0, 3300000, 41850000, 44290000, 68970000, 768...",,"[2.283325, 2.283249, 2.288971, 2.288513, 2.271...","[960, 960, 960, 576, 704, 640, 704, 704, 768, ...","[53.051651, 53.051651, 53.059891, 53.060760, 5...","[89.0, 89.0, 89.0, 130.9, 130.7, 131.4, 132.4,...","[1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[50, 50, 475, 500, 800, 875, 900, 1000, 1075, ...",[19008],"[0.60, 0.60, 0.60, 247.60, 240.70, 239.30, 238...","[250, 250, 247, 247, 247, 247, 247, 247]","[-19.50, -19.50, -19.50, -10.60, -10.60, -10.6...","[7420, 7420, 7420, 7420, 7420, 7420, 7420, 742...",,A7,
3,0200ae,2024-08-01 10:47:06.580000+00:00,0 days 02:52:23.940000,"[0, 75440000, 145750000, 180620000, 240330000,...","[224.30, 224.30, 224.30, 224.30, 224.30, 224.3...","[6.110476, 6.110476, 6.110475, 6.110475, 6.110...","[64, 0, 0, 0, 64, 64, 64, 64, 128, 256, 448, 6...","[46.233751, 46.233751, 46.233753, 46.233753, 4...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[3, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[9008, 9008, 9008, 9008, 9008, 9008, 9008, 900...","[135.00, 135.00, 135.00, 135.00, 135.00, 135.0...","[118, 118, 118, 118, 146, 146, 163, 163, 163, ...","[-0.50, -0.50, -0.50, -0.50, -0.50, -0.50, -0....","[5751, 5751, 5751, 5751, 5751, 5751, 5751, 575...",,,none
4,407fc1,2024-08-01 05:16:41.990000+00:00,0 days 01:57:33.120000,"[0, 2610000, 6920000, 25100000, 31650000, 3554...","[75.23, 75.23, 59.77, 59.77, 59.77, 59.77, 59....","[-0.379715, -0.379677, -0.379601, -0.379344, -...","[96, 0, -1376, -1376, -1376, -1376, -1376, -13...","[51.880293, 51.880281, 51.880246, 51.880199, 5...","[2.2, 2.8, 2.2, 1.4, 1.9, 1.1, 0.0, 0.0, 0.0, ...","[1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[4992, 4992, 4992, 4992, 4992, 4992, 4992, 499...","[281.20, 275.60, 264.40, 219.40, 219.40, 216.6...","[256, 256, 262, 262, 262, 262, 262, 262, 152, ...","[0.20, 0.20, 0.50, 0.50, 0.70, 0.70, 0.70, 0.7...","[7305, 7305, 7305, 7305, 7305, 7305, 7305, 730...",,,


# Feature generation strategy

This feature generation approach fundamentally differs from traditional flight data analysis methods by preserving the inherent time-series nature of aircraft behavior while enabling both global pattern recognition and local anomaly detection. Unlike conventional methods that often reduce flight segments to aggregate statistics (like mean, standard deviation, or min/max values), this approach calculates instantaneous rates of change through forward differences, maintaining the temporal relationship between consecutive measurements and capturing the dynamic evolution of flight parameters.
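
A small illustration of why this matters, using made-up numbers rather than project data: the two ground-speed traces below contain the same values in a different order, so their aggregate statistics are identical, while their forward differences are clearly not. The one-second spacing between points is an assumption to keep the arithmetic simple.

```python
from statistics import mean, pstdev

# Two made-up ground-speed traces (knots): same values, different order.
steady = [100.0, 110.0, 120.0, 130.0, 140.0]
erratic = [100.0, 140.0, 110.0, 130.0, 120.0]

def forward_diffs(values):
    """Difference between each point and the next one (1-second spacing assumed)."""
    return [b - a for a, b in zip(values, values[1:])]

# Aggregates cannot tell the traces apart.
print(mean(steady) == mean(erratic))                    # True
print(abs(pstdev(steady) - pstdev(erratic)) < 1e-9)     # True

# Forward differences keep the ordering information.
print(forward_diffs(steady))    # [10.0, 10.0, 10.0, 10.0]
print(forward_diffs(erratic))   # [40.0, -30.0, 20.0, -10.0]
```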

The overlapping 50-point segment design serves multiple purposes. First, it ensures continuous coverage of flight patterns, preventing blind spots that could occur at fixed-window boundaries where important maneuvers or anomalies might be split. Second, by deriving features like ground acceleration, vertical acceleration, turn rates, and climb/descent accelerations using forward differences, the method captures not just the magnitude of changes but their progression over time. This is particularly important for ADS-B flight data, where both commercial and general aviation patterns need to be analyzed, as it preserves the sequential nature of aircraft maneuvers while remaining computationally feasible. The approach strikes a balance between granular anomaly detection (such as unusual accelerations within otherwise normal flight patterns) and the practical constraints of processing large-scale flight data. That makes it well suited for deep learning techniques like autoencoders, which can learn normal patterns while also identifying subtle deviations.

- Preserves time-series relationships
  - Uses forward differences instead of statistical aggregates
  - Enables detection of patterns and anomalies simultaneously
  - Maintains temporal sequence information

- Uses overlapping 50-point segments
  - Provides continuous coverage of flight patterns
  - Works within memory constraints
  - Accommodates variable sampling rates in ADS-B data

- Handles real-world data challenges
  - Explicit rules for null values
  - Clear gap handling procedures
  - Defined sequence ending protocols
  - Works across different aircraft types

- Focuses on physics-based derived features
  - Ground and vertical accelerations
  - Turn rates
  - Climb/descent accelerations
  - Direct capture of aircraft dynamics

- Designed for versatility
  - Works with commercial aviation
  - Handles general aviation patterns
  - Captures maneuver-specific details
  - Processes diverse flight behaviors

- Optimized for autoencoder + HDBSCAN
  - Enables simultaneous pattern learning
  - Supports anomaly detection
  - Provides interpretable latent space
  - Works with high-dimensional data

- Avoids traditional method limitations
  - Better than HMMs with continuous data
  - Superior to isolation forests
  - Maintains temporal relationships
  - Handles high-dimensional spaces effectively

# Feature Transformations

## Input Data Arrays
- `flight_data['time_offsets']`: Microsecond time offsets
- `flight_data['vertical_rate_baro']`: Vertical rate in feet/minute
- `flight_data['ground_speed']`: Ground speed in knots
- `flight_data['track']`: Track/heading in degrees (0-360)
- `flight_data['altitude_baro']`: Barometric altitude in feet

## Segmentation Logic
Data is grouped by ICAO. Each ICAO's data is processed for valid 50-point segments:

- Main segments: Points 1-50, 51-100, etc.
- Overlapping segments: Points 26-75, 76-125, etc. (offset by 25 points)
- End handling: If fewer than 50 points remain at the end of a sequence, take the last 50 points
- Gap handling: If a gap of 10+ seconds is found, take the 50 points preceding the gap, if available
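
The windowing rules can be sketched as follows for a single gap-free run of points. This is my reading of the rules rather than the pipeline code: `segment_windows` is a hypothetical helper, windows are 0-based and half-open (so `(25, 75)` means the 26th through 75th points), and gap handling is assumed to reduce to calling the same function on each gap-free run.

```python
SEGMENT_LEN = 50      # points per segment (from the text)
OVERLAP_OFFSET = 25   # assumed: overlapping windows start half a segment in

def segment_windows(n_points):
    """Return sorted (start, end) half-open index windows for one gap-free run."""
    if n_points < SEGMENT_LEN:
        return []  # not enough points for even one segment
    windows = set()
    last_start = n_points - SEGMENT_LEN
    # Main windows start at 0, overlapping windows at OVERLAP_OFFSET.
    for offset in (0, OVERLAP_OFFSET):
        for start in range(offset, last_start + 1, SEGMENT_LEN):
            windows.add((start, start + SEGMENT_LEN))
    # End handling: always include the last 50 points of the run.
    windows.add((last_start, n_points))
    return sorted(windows)

print(segment_windows(120))  # [(0, 50), (25, 75), (50, 100), (70, 120)]
print(segment_windows(100))  # [(0, 50), (25, 75), (50, 100)]
print(segment_windows(49))   # []
```

Using a set means the end-handling window is not duplicated when the run length already lines up with a main or overlapping window.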

## Validity Check
Data Quality requirements:

- Null values break sequences
- All arrays must maintain 50 points
- No duplicate timestamps (take first point if duplicates exist)
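
A rough sketch of these checks, assuming a segment is held as a dict of equal-length lists keyed by the array names above. The helper names are hypothetical, and treating both `None` and `NaN` as null is my assumption.

```python
import math

def drop_duplicate_timestamps(arrays):
    """Keep the first point for each repeated time offset, across all arrays."""
    seen, keep = set(), []
    for i, t in enumerate(arrays["time_offsets"]):
        if t not in seen:
            seen.add(t)
            keep.append(i)
    return {name: [values[i] for i in keep] for name, values in arrays.items()}

def is_null(x):
    """Treat None and NaN as null (an assumption about how nulls appear)."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def is_valid_segment(arrays, length=50):
    """True if every array has exactly `length` points and contains no nulls."""
    return all(
        len(values) == length and not any(is_null(v) for v in values)
        for values in arrays.values()
    )

# Tiny made-up example (length=3 instead of 50 to keep it readable).
raw = {"time_offsets": [0, 1_000_000, 1_000_000, 2_000_000],
       "ground_speed": [10.0, 11.0, 99.0, 12.0]}
clean = drop_duplicate_timestamps(raw)
print(clean["ground_speed"])               # [10.0, 11.0, 12.0]
print(is_valid_segment(clean, length=3))   # True
```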

## Derived Feature Calculations
- Using forward differences for all calculations:

### Ground Acceleration (ground_accels)
- For point i:
  - Δt = (time_offsets[i+1] - time_offsets[i]) / 1_000_000  # convert to seconds
  - ground_accels[i] = (ground_speed[i+1] - ground_speed[i]) / Δt
- Units: knots/second

### Vertical Acceleration (vertical_accels)
- For point i:
  - Δt = (time_offsets[i+1] - time_offsets[i]) / 1_000_000
  - vertical_accels[i] = (vertical_rate_baro[i+1] - vertical_rate_baro[i]) / Δt
- Units: (feet/minute)/second, since `vertical_rate_baro` is in feet/minute and Δt is in seconds
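
Ground and vertical acceleration share the same forward-difference form, so both can be sketched with one helper. `forward_rate` is a hypothetical name, not the pipeline's function. The inputs below are the first three points of one of the sample segments shown at the end of this notebook, and the results line up with the `ground_accels` and `vertical_accels` values in that row.

```python
def forward_rate(values, time_offsets_us):
    """Forward-difference rate of change per second.

    The last entry repeats the previous value so the output keeps the input
    length. Assumes duplicate timestamps were already removed (no zero dt).
    """
    rates = []
    for i in range(len(values) - 1):
        dt = (time_offsets_us[i + 1] - time_offsets_us[i]) / 1_000_000  # seconds
        rates.append((values[i + 1] - values[i]) / dt)
    if rates:
        rates.append(rates[-1])
    return rates

time_offsets = [0, 19_930_000, 39_840_000]   # microseconds
ground_speed = [426.4, 425.1, 424.7]         # knots
vertical_rate = [0.0, 0.0, 64.0]             # feet/minute

ground_accels = forward_rate(ground_speed, time_offsets)     # knots/second
vertical_accels = forward_rate(vertical_rate, time_offsets)  # (feet/minute)/second
print(ground_accels)
print(vertical_accels)
```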

### Turn Rate (turn_rates)
- For point i:
  - Δt = (time_offsets[i+1] - time_offsets[i]) / 1_000_000
  - heading_diff = track[i+1] - track[i]
  - Handle 0/360 wraparound:
    - if heading_diff > 180: heading_diff -= 360
    - elif heading_diff < -180: heading_diff += 360
  - turn_rates[i] = heading_diff / Δt
- Units: degrees/second
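
The wraparound step is the part that is easy to get wrong, so here is a small sketch with made-up headings that cross north. `turn_rates` here is an illustrative function, not the pipeline's implementation.

```python
def turn_rates(track, time_offsets_us):
    """Forward-difference turn rate in degrees/second with 0/360 wraparound."""
    rates = []
    for i in range(len(track) - 1):
        dt = (time_offsets_us[i + 1] - time_offsets_us[i]) / 1_000_000  # seconds
        heading_diff = track[i + 1] - track[i]
        # Take the shorter way around the compass.
        if heading_diff > 180:
            heading_diff -= 360
        elif heading_diff < -180:
            heading_diff += 360
        rates.append(heading_diff / dt)
    if rates:
        rates.append(rates[-1])  # last point copies the previous value
    return rates

# 350° -> 10° is a 20° right turn, not a 340° left turn.
print(turn_rates([350.0, 10.0, 5.0], [0, 1_000_000, 2_000_000]))
# [20.0, -5.0, -5.0]
```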

### Climb/Descent Acceleration (climb_descent_accels)
- For point i:
  - Δt1 = (time_offsets[i+1] - time_offsets[i]) / 1_000_000
  - Δt2 = (time_offsets[i+2] - time_offsets[i+1]) / 1_000_000

  - first_deriv = (altitude_baro[i+1] - altitude_baro[i]) / Δt1
  - second_deriv = (altitude_baro[i+2] - altitude_baro[i+1]) / Δt2

  - climb_descent_accels[i] = (second_deriv - first_deriv) / Δt1
- Units: feet/second²
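
A sketch of the same calculation in code, again with a hypothetical function rather than the pipeline's own. The inputs are the first four points of one of the sample segments shown at the end of this notebook, and the first two results line up with that row's `climb_descent_accels` values. Padding the last two entries with the previous value is my reading of the "last point(s)" note below.

```python
def climb_descent_accels(altitude_baro, time_offsets_us):
    """Second forward difference of altitude, in feet/second²."""
    accels = []
    for i in range(len(altitude_baro) - 2):
        dt1 = (time_offsets_us[i + 1] - time_offsets_us[i]) / 1_000_000
        dt2 = (time_offsets_us[i + 2] - time_offsets_us[i + 1]) / 1_000_000
        first_deriv = (altitude_baro[i + 1] - altitude_baro[i]) / dt1
        second_deriv = (altitude_baro[i + 2] - altitude_baro[i + 1]) / dt2
        accels.append((second_deriv - first_deriv) / dt1)
    if accels:
        accels.extend([accels[-1]] * 2)  # pad so output length matches input
    return accels

altitudes = [10375.0, 10200.0, 10050.0, 9875.0]          # feet
time_offsets = [0, 8_380_000, 16_960_000, 25_430_000]    # microseconds

accels = climb_descent_accels(altitudes, time_offsets)
print(accels[:2])  # approximately [0.4058, -0.3705]
```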

### Notes on Feature Generation
- Last point(s) copy previous value to maintain 50-point arrays
- No smoothing applied to preserve raw change measurements
- Originally, a data verification step was in place to check for physically unlikely data points
  - this check proved a little too aggressive
  - it was likely filtering out a lot of values because of sensor error
  - I ended up with what looked like cruising data for all clusters
  - and it was making all my visualizations in the final notebook super boring, so I commented it out for now

In [5]:
# load and display what the data looks like after feature engineering
df = pd.read_parquet('../data/_11_analysis_&_conclusions/autoencoder_training_sample.parquet')

display(df.head())

Unnamed: 0,segment_id,icao,start_timestamp,segment_duration,point_count,vertical_rates,ground_speeds,headings,altitudes,time_offsets,vertical_accels,ground_accels,turn_rates,climb_descent_accels
0,9060761,4baa66,2024-08-01 07:42:09.310000+00:00,0 days 00:15:56.990000,50,"[0.0, 0.0, 64.0, 0.0, 0.0, -64.0, -64.0, 0.0, ...","[426.4, 425.1, 424.7, 424.7, 424.7, 424.7, 424...","[331.6, 331.6, 331.4, 331.4, 331.4, 331.4, 331...","[32000.0, 32000.0, 32000.0, 32000.0, 32000.0, ...","[0, 19930000, 39840000, 59750000, 79420000, 98...","[0.0, 3.2144650929181315, -3.2144650929181315,...","[-0.06522829904666104, -0.020090406830740034, ...","[0.0, -0.010045203415371445, 0.0, 0.0, 0.0, 0....","[0.0, 0.0, 0.0, 0.0, -0.06491489916379224, 0.0..."
1,8802576,440d10,2024-08-01 05:39:39.500000+00:00,0 days 00:06:04.750000,50,"[-1216.0, -1216.0, -1152.0, -1216.0, -1216.0, ...","[352.8, 353.2, 352.3, 351.3, 350.0, 345.6, 341...","[111.6, 111.8, 111.8, 111.9, 111.8, 111.7, 111...","[10375.0, 10200.0, 10050.0, 9875.0, 9700.0, 95...","[0, 8380000, 16960000, 25430000, 34220000, 429...","[0.0, 7.459207459207459, -7.55608028335301, 0....","[0.04773269689737199, -0.10489510489510225, -0...","[0.023866348448687687, 0.0, 0.0118063754427400...","[0.4057920537093013, -0.37047080912306557, 0.0..."
2,10958504,a69cb4,2024-08-01 15:50:52.470000+00:00,0 days 00:13:12.430000,50,"[0.0, 0.0, 0.0, 0.0, 0.0, 128.0, 64.0, 0.0, 0....","[385.5, 386.2, 385.5, 385.5, 385.5, 385.5, 386...","[223.8, 223.7, 223.6, 223.6, 223.6, 223.4, 223...","[38000.0, 38000.0, 38000.0, 38000.0, 38000.0, ...","[0, 19470000, 39160000, 58770000, 78360000, 97...","[0.0, 0.0, 0.0, 0.0, 6.580976863753214, -3.235...","[0.03595274781715402, -0.03555104113763274, 0....","[-0.005136106831023253, -0.005078720162518756,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,8656496,3b7567,2024-08-01 10:46:00.580000+00:00,0 days 00:07:34.270000,50,"[-1696.0, -2080.0, -2080.0, -1984.0, -2368.0, ...","[468.7, 468.7, 466.5, 460.2, 456.4, 452.6, 450...","[128.1, 128.1, 128.1, 128.7, 131.1, 133.9, 136...","[27200.0, 27125.0, 26725.0, 26375.0, 26100.0, ...","[0, 1930000, 12840000, 24370000, 32180000, 392...","[-198.96373056994818, 0.0, 8.326105810928015, ...","[0.0, -0.20164986251145633, -0.546400693842152...","[0.0, 0.0, 0.052038161318299594, 0.30729833546...","[1.1380788918256388, 0.5781867335818337, -0.42..."
4,8791211,440209,2024-08-01 09:36:53.580000+00:00,0 days 00:15:22.850000,50,"[0.0, 64.0, 0.0, 0.0, 0.0, 0.0, 64.0, 0.0, 64....","[467.1, 468.1, 469.1, 470.1, 471.2, 472.1, 473...","[5.4, 5.4, 5.4, 5.4, 5.5, 5.3, 5.3, 5.4, 5.6, ...","[37975.0, 37975.0, 37975.0, 38000.0, 38000.0, ...","[0, 19780000, 39280000, 58670000, 78590000, 98...","[3.235591506572295, -3.282051282051282, 0.0, 0...","[0.05055611729019211, 0.05128205128205128, 0.0...","[0.0, 0.0, 0.0, 0.005020080321285122, -0.01008...","[0.0, 0.06611919969320691, -0.0664942957203473..."
