# Feature generation

The first thing I want to note is that model selection was a very important criterion for feature generation, but I am documenting model selection and feature generation separately for organizational purposes. If you want to skip ahead to the next notebook, feel free!

## Widening the data

Before getting into real feature generation, I made some alterations to the data. As mentioned in the last notebook, the dataset was really big. I wanted to keep the data in its native format for EDA, because it is much easier to analyze columns when they hold single objects than it is to analyze semi-structured elements. However, for the rest of the pipeline I needed to optimize the data, so I reformatted it using PostgreSQL array columns in the scripts _06_traces_db_partitioning.py and _07_traces_to_segments.py. I threw out the fields I determined I wouldn't use, but other than that no mathematical transformations occurred at this step. Traces shorter than 15 minutes were not included in the output data set.

Here is a brief summary of what each script did and how:

- _06_traces_db_partitioning.py
  - Partitions the aircraft tracking data table into smaller tables based on the first two characters of the ICAO aircraft identifier (hex prefixes '00' to 'ff')
  - Creates up to 256 separate tables with an identical schema
  - Transfers data in batches of 100,000 rows and indexes the timestamp and ICAO columns
  - Can process either the full hex range (00-ff) or specific subsets

- _07_traces_to_segments.py
  - Processes aircraft tracking data across the hex-partitioned tables
  - Implements multi-level parallelization: thread pooling for hex partitions and process pooling for aircraft (ICAO codes)
  - Runs three main stages: extraction of raw trace data, transformation into flight segments based on timestamp gaps, and loading of the processed data
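To make the two ideas above concrete, here is a minimal sketch of the hex-prefix partitioning and the gap-based splitting. It is not the code from the scripts: the function names, the `max_gap` parameter, and its value in the usage lines are my own placeholders, and the only number taken from the text is the 15-minute minimum duration.

```python
from datetime import datetime, timedelta, timezone


def hex_prefixes(start="00", end="ff"):
    """Two-character lowercase hex prefixes from start to end, inclusive."""
    return [f"{i:02x}" for i in range(int(start, 16), int(end, 16) + 1)]


def split_on_gaps(timestamps, max_gap, min_duration=timedelta(minutes=15)):
    """Split sorted timestamps wherever two neighbours are more than max_gap apart.

    max_gap is a hypothetical parameter; segments shorter than min_duration
    are dropped, mirroring the 15-minute rule described above.
    """
    segments, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > max_gap:
            segments.append(current)
            current = []
        current.append(ts)
    if current:
        segments.append(current)
    return [s for s in segments if s[-1] - s[0] >= min_duration]


prefixes = hex_prefixes()              # full range, 256 prefixes
subset = hex_prefixes("40", "43")      # a specific subset

t0 = datetime(2024, 8, 1, tzinfo=timezone.utc)
# 21 points one minute apart, a one-hour gap, then 5 more points
stamps = [t0 + timedelta(minutes=m) for m in range(21)]
stamps += [t0 + timedelta(minutes=80 + m) for m in range(5)]
kept = split_on_gaps(stamps, max_gap=timedelta(minutes=5))
```

Each prefix would map to one partition table, and each kept segment to one output row.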

The key tools used to perform the transformation were already mentioned in the preceding notebook and include the same database and parallelization tools.

Let's go ahead and do the imports for our notebook.

#### Dependencies and config

In [2]:
import pandas as pd

### View of segmented data

In [3]:
# output what the segmented data looks like

# read in sample data
df = pd.read_parquet('../data/_07_traces_to_segments/segments_sample.parquet')

print("rows in sample data:", len(df))

display(df.head())

rows in sample data: 100


Unnamed: 0,icao,start_timestamp,duration,time_offsets,nav_heading,longitude,vertical_rate_baro,latitude,ground_speed,flags,altitude_baro,nav_altitude_mcp,track,indicated_airspeed,roll_angle,squawk,flight,category,emergency
0,407fb9,2024-08-01 15:53:47.540000+00:00,0 days 00:34:07.310000,"[0, 8430000, 48120000, 66400000, 119710000, 13...",,"[2.237915, 2.230303, 2.197187, 2.180001, 2.131...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[53.147003, 53.143295, 53.127274, 53.118942, 5...","[148.7, 148.7, 147.9, 147.9, 147.9, 147.9, 148...","[3, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1600, 1600, 1600, 1600, 1600, 1600, 1600, 160...","[35008, 38000, 33008, 33008, 20000, 20000]","[231.30, 231.30, 231.00, 231.00, 231.00, 231.0...",,"[0.00, 0.00, 0.00, -0.20, 0.00, -0.20, 0.00, -...","[7420, 7420, 7420, 7420, 7420, 7420, 7420, 742...",,,
1,407fb9,2024-08-01 17:27:53.700000+00:00,0 days 00:19:51.340000,"[0, 1080000, 10140000, 11990000, 13470000, 149...",,"[1.287912, 1.287994, 1.288304, 1.288376, 1.288...","[-960, 448, -192, 0, 0, 0, -320, -256, 832, 13...","[52.675347, 52.675415, 52.675859, 52.675919, 5...","[9.9, 12.5, 12.7, 8.1, 8.1, 3.7, 1.2, 0.7, 2.1...","[3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[100, 100, 75, 75, 75, 75, 50, 75, 125, 200, 2...","[35008, 35008, 9008, 35008]","[45.00, 28.60, 18.40, 29.70, 29.70, 56.30, 90....","[264, 274, 258, 260, 260, 259, 259]","[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, -4.20, -8...","[7420, 7420, 7420, 7420, 7420, 7420, 7420, 742...",,,
2,407fb9,2024-08-01 17:58:07.680000+00:00,0 days 00:28:40.530000,"[0, 3300000, 41850000, 44290000, 68970000, 768...",,"[2.283325, 2.283249, 2.288971, 2.288513, 2.271...","[960, 960, 960, 576, 704, 640, 704, 704, 768, ...","[53.051651, 53.051651, 53.059891, 53.060760, 5...","[89.0, 89.0, 89.0, 130.9, 130.7, 131.4, 132.4,...","[1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[50, 50, 475, 500, 800, 875, 900, 1000, 1075, ...",[19008],"[0.60, 0.60, 0.60, 247.60, 240.70, 239.30, 238...","[250, 250, 247, 247, 247, 247, 247, 247]","[-19.50, -19.50, -19.50, -10.60, -10.60, -10.6...","[7420, 7420, 7420, 7420, 7420, 7420, 7420, 742...",,A7,
3,0200ae,2024-08-01 10:47:06.580000+00:00,0 days 02:52:23.940000,"[0, 75440000, 145750000, 180620000, 240330000,...","[224.30, 224.30, 224.30, 224.30, 224.30, 224.3...","[6.110476, 6.110476, 6.110475, 6.110475, 6.110...","[64, 0, 0, 0, 64, 64, 64, 64, 128, 256, 448, 6...","[46.233751, 46.233751, 46.233753, 46.233753, 4...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[3, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[9008, 9008, 9008, 9008, 9008, 9008, 9008, 900...","[135.00, 135.00, 135.00, 135.00, 135.00, 135.0...","[118, 118, 118, 118, 146, 146, 163, 163, 163, ...","[-0.50, -0.50, -0.50, -0.50, -0.50, -0.50, -0....","[5751, 5751, 5751, 5751, 5751, 5751, 5751, 575...",,,none
4,407fc1,2024-08-01 05:16:41.990000+00:00,0 days 01:57:33.120000,"[0, 2610000, 6920000, 25100000, 31650000, 3554...","[75.23, 75.23, 59.77, 59.77, 59.77, 59.77, 59....","[-0.379715, -0.379677, -0.379601, -0.379344, -...","[96, 0, -1376, -1376, -1376, -1376, -1376, -13...","[51.880293, 51.880281, 51.880246, 51.880199, 5...","[2.2, 2.8, 2.2, 1.4, 1.9, 1.1, 0.0, 0.0, 0.0, ...","[1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[4992, 4992, 4992, 4992, 4992, 4992, 4992, 499...","[281.20, 275.60, 264.40, 219.40, 219.40, 216.6...","[256, 256, 262, 262, 262, 262, 262, 262, 152, ...","[0.20, 0.20, 0.50, 0.50, 0.70, 0.70, 0.70, 0.7...","[7305, 7305, 7305, 7305, 7305, 7305, 7305, 730...",,,
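Since each row stores a start time plus an array of offsets, absolute timestamps can be rebuilt when needed. A minimal sketch, assuming the offsets are microseconds measured from `start_timestamp` (the helper name `absolute_times` is hypothetical, and the values are the first three offsets of row 0 above):

```python
from datetime import datetime, timedelta, timezone


def absolute_times(start_timestamp, time_offsets):
    """Turn microsecond offsets into absolute datetimes.

    Assumes the offsets are relative to start_timestamp.
    """
    return [start_timestamp + timedelta(microseconds=o) for o in time_offsets]


start = datetime(2024, 8, 1, 15, 53, 47, 540000, tzinfo=timezone.utc)
times = absolute_times(start, [0, 8430000, 48120000])
print(times[1].isoformat())
```

Note that not every array column has the same length within a row (nav_altitude_mcp is much shorter in the sample), so this kind of positional pairing should only be applied to columns that line up with time_offsets.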


# Data Feature Generation Documentation

## Input Data Arrays
- flight_data['time_offsets']: Microsecond time offsets
- flight_data['vertical_rate_baro']: Vertical rate in feet/minute
- flight_data['ground_speed']: Ground speed in knots  
- flight_data['track']: Track/heading in degrees (0-360)
- flight_data['altitude_baro']: Barometric altitude in feet

## Segmentation Logic
1. Data is grouped by ICAO
2. Each ICAO's data is processed for valid 50-point segments:
  - Main segments: Points 1-50, 51-100, etc.
  - Overlapping segments: Points 26-75, 76-125, etc.
  - End handling: If the sequence ends with a remainder of fewer than 50 points, take the last 50 points
  - Gap handling: If a gap of 10+ seconds is found, take the 50 points before the gap, if available
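The windowing rules can be sketched as follows. This is one reading of the list above, not the pipeline code: it treats main and overlapping segments as a single stride of 25 points, splits the sequence into runs at gaps of 10 seconds or more, and adds a final window ending at the last point of each run. All function names are hypothetical, and the bounds are 0-based half-open index pairs.

```python
def window_bounds(n_points, size=50, step=25):
    """Half-open (start, stop) windows of `size` points, strided by `step`."""
    if n_points < size:
        return []
    bounds = [(s, s + size) for s in range(0, n_points - size + 1, step)]
    # End handling: cover the tail with the last `size` points
    if bounds[-1][1] != n_points:
        bounds.append((n_points - size, n_points))
    return bounds


def split_runs(time_offsets_us, max_gap_s=10.0):
    """Half-open index ranges of runs containing no gap of max_gap_s or more."""
    runs, start = [], 0
    for i in range(1, len(time_offsets_us)):
        gap_s = (time_offsets_us[i] - time_offsets_us[i - 1]) / 1_000_000
        if gap_s >= max_gap_s:
            runs.append((start, i))
            start = i
    if time_offsets_us:
        runs.append((start, len(time_offsets_us)))
    return runs


def segment_bounds(time_offsets_us, size=50, step=25, max_gap_s=10.0):
    """Window every gap-free run; runs shorter than `size` yield nothing."""
    out = []
    for lo, hi in split_runs(time_offsets_us, max_gap_s):
        out.extend((lo + a, lo + b) for a, b in window_bounds(hi - lo, size, step))
    return out


# 60 points one second apart, a 20-second gap, then 30 more points
offsets = [i * 1_000_000 for i in range(60)]
offsets += [offsets[-1] + 20_000_000 + i * 1_000_000 for i in range(30)]
bounds = segment_bounds(offsets)
```

In this example the 30 points after the gap are too few for a window, and the first run yields its first 50 points plus the 50 points immediately before the gap.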

## Validity Checks
1. Physics Thresholds (2x 737 max performance):
  - vertical_rate_baro: ±12,000 feet/minute
  - ground_speed: 0 to 1,000 knots
  - track: Must be 0-360 degrees
  - altitude_baro: -1,000 to 82,000 feet

2. Data Quality:
  - Null values break sequences
  - All arrays must maintain 50 points
  - No duplicate timestamps (take first point if duplicates exist)
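A minimal sketch of these checks, using the thresholds listed above. The names `LIMITS`, `first_occurrence_indices`, `point_is_valid`, and `segment_is_valid` are hypothetical, and a segment is assumed to be a dict of equal-length lists:

```python
# Inclusive (low, high) bounds taken from the physics thresholds above
LIMITS = {
    "vertical_rate_baro": (-12_000, 12_000),  # feet/minute
    "ground_speed": (0, 1_000),               # knots
    "track": (0, 360),                        # degrees
    "altitude_baro": (-1_000, 82_000),        # feet
}


def first_occurrence_indices(time_offsets):
    """Indices to keep so that each time offset appears once (first one wins)."""
    seen, keep = set(), []
    for i, t in enumerate(time_offsets):
        if t not in seen:
            seen.add(t)
            keep.append(i)
    return keep


def point_is_valid(point):
    """False if any field is missing/None, NaN, or outside its bounds."""
    for name, (lo, hi) in LIMITS.items():
        value = point.get(name)
        if value is None or not (lo <= value <= hi):
            return False
    return True


def segment_is_valid(segment, size=50):
    """Check array lengths, unique timestamps, and every point's bounds."""
    names = ["time_offsets", *LIMITS]
    if any(len(segment[n]) != size for n in names):
        return False
    if len(set(segment["time_offsets"])) != size:
        return False
    return all(
        point_is_valid({n: segment[n][i] for n in LIMITS}) for i in range(size)
    )


good = {
    "time_offsets": [i * 1_000_000 for i in range(50)],
    "vertical_rate_baro": [0] * 50,
    "ground_speed": [148.7] * 50,
    "track": [231.3] * 50,
    "altitude_baro": [1600] * 50,
}
too_fast = {**good, "ground_speed": [148.7] * 49 + [1500]}
with_null = {**good, "track": [231.3] * 49 + [None]}
```

The `not (lo <= value <= hi)` form also rejects NaN, because every comparison against NaN is false.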

## Derived Feature Calculations
Using forward differences for all calculations:

1. Ground Acceleration (ground_accels):
  For point i:
  Δt = (time_offsets[i+1] - time_offsets[i]) / 1_000_000  # convert to seconds
  ground_accels[i] = (ground_speed[i+1] - ground_speed[i]) / Δt
  Units: knots/second

2. Vertical Acceleration (vertical_accels):
  For point i:
  Δt = (time_offsets[i+1] - time_offsets[i]) / 1_000_000
  vertical_accels[i] = (vertical_rate_baro[i+1] - vertical_rate_baro[i]) / Δt
  Units: feet/minute per second (vertical_rate_baro is in feet/minute and Δt is in seconds)

3. Turn Rate (turn_rates):
  For point i:
  Δt = (time_offsets[i+1] - time_offsets[i]) / 1_000_000
  heading_diff = track[i+1] - track[i]
  
  # Handle 0/360 wraparound
  if heading_diff > 180:
      heading_diff -= 360
  elif heading_diff < -180:
      heading_diff += 360
      
  turn_rates[i] = heading_diff / Δt
  Units: degrees/second

4. Climb/Descent Acceleration (climb_descent_accels):
  For point i:
  Δt1 = (time_offsets[i+1] - time_offsets[i]) / 1_000_000
  Δt2 = (time_offsets[i+2] - time_offsets[i+1]) / 1_000_000
  
  first_deriv = (altitude_baro[i+1] - altitude_baro[i]) / Δt1
  second_deriv = (altitude_baro[i+2] - altitude_baro[i+1]) / Δt2
  
  climb_descent_accels[i] = (second_deriv - first_deriv) / Δt1
  Units: feet/second²

For all derivatives:
- The last point (or the last two points, for climb_descent_accels) copies the previous value to maintain 50-point arrays
- No smoothing applied to preserve raw change measurements
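
The formulas above translate fairly directly into Python. This is a sketch rather than the pipeline code: the helper names are my own, it works on plain lists, and it assumes at least three points per segment and strictly increasing time offsets (so Δt is never zero).

```python
def forward_dt(time_offsets):
    """Forward time steps in seconds from microsecond offsets."""
    return [(b - a) / 1_000_000 for a, b in zip(time_offsets, time_offsets[1:])]


def pad_to(values, n):
    """Repeat the last value until the list has n entries."""
    return values + [values[-1]] * (n - len(values))


def rate_of_change(values, time_offsets):
    """Forward difference per second; used for ground and vertical accels."""
    dts = forward_dt(time_offsets)
    rates = [(b - a) / dt for a, b, dt in zip(values, values[1:], dts)]
    return pad_to(rates, len(values))


def wrap_heading_diff(diff):
    """Map a heading difference into [-180, 180] across the 0/360 boundary."""
    if diff > 180:
        diff -= 360
    elif diff < -180:
        diff += 360
    return diff


def turn_rates(track, time_offsets):
    """Wrapped heading change in degrees per second."""
    dts = forward_dt(time_offsets)
    rates = [wrap_heading_diff(b - a) / dt for a, b, dt in zip(track, track[1:], dts)]
    return pad_to(rates, len(track))


def climb_descent_accels(altitude_baro, time_offsets):
    """Second forward difference of altitude, divided by the first time step."""
    dts = forward_dt(time_offsets)
    slopes = [(b - a) / dt for a, b, dt in zip(altitude_baro, altitude_baro[1:], dts)]
    accels = [(s2 - s1) / dt1 for s1, s2, dt1 in zip(slopes, slopes[1:], dts)]
    return pad_to(accels, len(altitude_baro))


offsets = [0, 1_000_000, 3_000_000, 4_000_000]
ground_accels = rate_of_change([100, 102, 106, 106], offsets)
turns = turn_rates([350, 10, 20, 15], offsets)
climbs = climb_descent_accels([1000, 1010, 1050, 1050], offsets)
```

In the usage lines, the heading change from 350 to 10 degrees is treated as a 20-degree right turn rather than a 340-degree left turn, which is the point of the wraparound handling.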
