# Extensive Exploratory Data Analysis (EDA)
## Golden 7-Day ADS-B Research Dataset

**Purpose:** Comprehensive analysis for Deep Neural Networks, LLMs, Academic Research, and Commercial Applications

**Dataset:** Golden 7-Day Sample (2026-01-16 to 2026-01-22)  
**Sensors:** sensor-east (Sipoo), sensor-west (Jorvas)  
**License:** MIT  

---

## Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path

warnings.filterwarnings('ignore')

# Plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Configuration
BASE_DIR = Path('..')
EDA_RESULTS = BASE_DIR / 'analysis' / 'golden_7day_eda_results'

print("✅ Imports complete")

## Load ML-Ready Dataset

In [None]:
# Load the ML-ready dataset
df = pd.read_csv(EDA_RESULTS / 'golden_7day_ml_dataset.csv')

# Convert timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])

print(f"Dataset Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nMemory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

df.head()

## Dataset Overview

In [None]:
print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)

print(f"\n📊 Total Records: {len(df):,}")
print(f"✈️  Unique Aircraft: {df['hex'].nunique():,}")
print(f"🗓️  Date Range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"⏱️  Duration: {(df['timestamp'].max() - df['timestamp'].min()).days} days")
print(f"\n📡 Sensors:")
for sensor in df['sensor'].unique():
    count = len(df[df['sensor'] == sensor])
    aircraft = df[df['sensor'] == sensor]['hex'].nunique()
    print(f"  • {sensor}: {count:,} records, {aircraft} unique aircraft")

## Statistical Summary

In [None]:
# Load statistical summary
stats_summary = pd.read_csv(EDA_RESULTS / 'statistical_summary.csv', index_col=0)
stats_summary

## Visualizations

### 1. Missing Values Pattern

In [None]:
from IPython.display import Image
Image(filename=str(EDA_RESULTS / 'figures' / '01_missing_values_heatmap.png'))

### 2. Temporal Patterns

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '02_temporal_patterns.png'))

### 3. Geospatial Analysis

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '03_geospatial_analysis.png'))

### 4. Signal Quality

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '04_signal_quality.png'))

### 5. Aircraft Behavior

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '05_aircraft_behavior.png'))

### 6. Cross-Sensor Correlation

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '06_cross_sensor_analysis.png'))

### 7. Feature Correlation Matrix

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '07_correlation_matrix.png'))

### 8. 3D Flight Trajectories

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '08_3d_trajectories.png'))

### 9. Detection Heatmap

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '09_detection_heatmap.png'))

### 10. Weekly Activity Heatmap

In [None]:
Image(filename=str(EDA_RESULTS / 'figures' / '10_activity_heatmap.png'))

## Interactive Analysis: Custom Queries

### Example 1: Find Fastest Aircraft

In [None]:
fastest = df.nlargest(10, 'gs')[['hex', 'timestamp', 'gs', 'alt_baro', 'sensor']]
print("Top 10 Fastest Aircraft Observations:")
fastest

### Example 2: Highest Altitude Observations

In [None]:
highest = df.nlargest(10, 'alt_baro')[['hex', 'timestamp', 'alt_baro', 'gs', 'sensor']]
print("Top 10 Highest Altitude Observations:")
highest

### Example 3: Analyze Specific Aircraft

In [None]:
# Get the aircraft (ICAO hex ID) with the most observations
top_aircraft = df['hex'].value_counts().idxmax()
aircraft_data = df[df['hex'] == top_aircraft].sort_values('timestamp')

print(f"Aircraft: {top_aircraft}")
print(f"Total Observations: {len(aircraft_data)}")
print(f"\nTrajectory Sample:")
aircraft_data[['timestamp', 'lat', 'lon', 'alt_baro', 'gs', 'sensor']].head(10)

### Example 4: Signal Quality by Distance

In [None]:
# Plot RSSI vs Distance
valid_data = df.dropna(subset=['distance_km', 'rssi'])

plt.figure(figsize=(12, 6))
# Iterate over sensors that actually have valid rows, so the legend has no empty entries
for sensor in valid_data['sensor'].unique():
    sensor_data = valid_data[valid_data['sensor'] == sensor]
    plt.scatter(sensor_data['distance_km'], sensor_data['rssi'],
                alpha=0.1, s=1, label=sensor)

plt.xlabel('Distance (km)', fontsize=12)
plt.ylabel('RSSI (dBm)', fontsize=12)
plt.title('Signal Strength vs Distance', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
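
As a rough reference for reading the scatter plot above, free-space path loss describes how quickly signal strength is expected to fall with distance. The sketch below is illustrative only: it assumes ideal free-space propagation at the 1090 MHz ADS-B carrier, and `fspl_db` is a hypothetical helper, not a function from this repository. Measured RSSI also depends on antenna gain, receiver scaling and terrain, so only the shape of the curve is comparable.

```python
import math

ADSB_FREQ_MHZ = 1090.0  # ADS-B (Mode S extended squitter) carrier frequency


def fspl_db(distance_km: float, freq_mhz: float = ADSB_FREQ_MHZ) -> float:
    """Free-space path loss in dB for a distance in km and a frequency in MHz."""
    if distance_km <= 0:
        raise ValueError("distance_km must be positive")
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44


# Path loss at a few representative ranges
for d in (10, 50, 100, 200):
    print(f"{d:>4} km: {fspl_db(d):.1f} dB")
```

Under this model every doubling of distance costs about 6 dB, which gives a baseline slope to compare against the per-sensor point clouds.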

## Feature Engineering Insights

### Engineered Features for ML

In [None]:
ml_features = [
    'hour_sin', 'hour_cos', 'day_of_week_sin', 'day_of_week_cos',
    'distance_km', 'signal_deviation', 'altitude_speed_ratio',
    'track_sin', 'track_cos', 'likely_commercial', 'likely_general_aviation'
]

print("ML-Ready Features:")
print("=" * 60)
for feat in ml_features:
    if feat in df.columns:
        valid = df[feat].notna().sum()
        print(f"• {feat:30} {valid:,} valid values")

print("\nFeature Descriptions:")
print("─" * 60)
print("• hour_sin/cos: Cyclic time encoding")
print("• distance_km: Haversine distance from sensor")
print("• signal_deviation: Deviation from expected signal strength")
print("• altitude_speed_ratio: Physics-based feature")
print("• track_sin/cos: Cyclic heading encoding")
print("• likely_commercial: High altitude + high speed indicator")
print("• likely_general_aviation: Low altitude + low speed indicator")
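
The notebook loads these features precomputed, so the exact code that produced them is not shown here. As a minimal sketch of what the cyclic encodings and the Haversine distance typically look like, assuming the standard textbook formulas: `cyclic_encode` and `haversine_km` are hypothetical helper names, and the coordinates are placeholders rather than the real sensor positions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def cyclic_encode(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value (hour of day, heading in degrees) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hour 23 and hour 0 end up close together, unlike the raw integers 23 and 0
hour_sin, hour_cos = cyclic_encode(23, 24)
track_sin, track_cos = cyclic_encode(359, 360)

# Placeholder coordinates: a receiver and an aircraft half a degree further north
distance_km = haversine_km(60.0, 25.0, 60.5, 25.0)
print(f"hour 23 -> ({hour_sin:.3f}, {hour_cos:.3f}), distance = {distance_km:.1f} km")
```

Encoding each periodic quantity as a sine/cosine pair keeps midnight and 360° wrap-arounds continuous, which is why both time and heading appear as two columns in `ml_features`.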

## Recommendations for ML/LLM Development

### For Deep Neural Networks:

1. **Sequence Models (LSTM/Transformers)**
   - Use temporal features (hour_sin, hour_cos)
   - Sort by timestamp and aircraft ID
   - Predict future positions/behaviors

2. **Graph Neural Networks**
   - Multi-sensor data as graph nodes
   - Edges represent spatial relationships
   - Learn sensor fusion patterns

3. **Anomaly Detection**
   - Use signal_deviation as target
   - Physics-based features (altitude_speed_ratio)
   - Multi-sensor consistency checks
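
For the sequence-model recommendation, the "sort by timestamp and aircraft ID" step can be sketched as a sliding-window builder. This is only an illustration on synthetic records: `build_windows`, the `window` and `horizon` parameters, and the sample values are all assumptions, not part of the dataset tooling.

```python
from collections import defaultdict


def build_windows(records, window=3, horizon=1):
    """Group records by aircraft, sort each track by time, and emit
    (aircraft ID, input window, target record) triples."""
    by_aircraft = defaultdict(list)
    for rec in records:
        by_aircraft[rec["hex"]].append(rec)

    pairs = []
    for hex_id, track in by_aircraft.items():
        track.sort(key=lambda r: r["timestamp"])
        # A window never crosses from one aircraft's track into another's
        for start in range(len(track) - window - horizon + 1):
            inputs = track[start:start + window]
            target = track[start + window + horizon - 1]
            pairs.append((hex_id, inputs, target))
    return pairs


# Synthetic records: one track with 5 points, one with only 3 (too short for a pair)
records = [
    {"hex": "abc123", "timestamp": t, "lat": 60.0 + 0.01 * t, "lon": 25.0}
    for t in range(5)
] + [
    {"hex": "def456", "timestamp": t, "lat": 60.3, "lon": 24.5 + 0.01 * t}
    for t in range(3)
]

pairs = build_windows(records, window=3, horizon=1)
print(f"{len(pairs)} training pairs")
```

On real data it may also be worth splitting a track wherever consecutive timestamps are far apart, so that a window does not span two separate flights of the same aircraft.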

### For LLM Training:

1. **Context Generation**
   - Flight narratives from trajectory data
   - Multi-sensor descriptions
   - Temporal event sequences

2. **Classification Tasks**
   - Aircraft type classification
   - Anomaly explanation generation
   - Sensor reliability assessment

3. **Question Answering**
   - "What aircraft were at altitude X at time Y?"
   - "Which sensor has better coverage?"
   - "Explain this signal pattern"
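
As one possible starting point for the context-generation idea, a trajectory row can be rendered as a sentence with a plain template. The sketch below is hypothetical: `describe_observation` is not part of the repository, the record is synthetic, and the units (feet for `alt_baro`, knots for `gs`) are an assumption that should be checked against the dataset documentation.

```python
def describe_observation(obs: dict) -> str:
    """Render one observation as a short English sentence.

    Assumes alt_baro is in feet and gs is in knots.
    """
    return (
        f"Aircraft {obs['hex']} was observed by {obs['sensor']} at {obs['timestamp']}, "
        f"flying at {obs['alt_baro']:,.0f} ft with a ground speed of {obs['gs']:.0f} kt."
    )


# Synthetic example record using the notebook's column names
sample = {
    "hex": "abc123",
    "sensor": "sensor-east",
    "timestamp": "2026-01-16 12:00:00",
    "alt_baro": 35000,
    "gs": 450.4,
}
narrative = describe_observation(sample)
print(narrative)
```

Applied to `aircraft_data` from Example 3, something like `aircraft_data.to_dict("records")` would yield dictionaries in this shape, one per observation.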

### Data Splits:

```python
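# Temporal split (recommended for time-series).
# With data from 2026-01-16 to 2026-01-22 these cut-offs give:
# day 1 for training, day 2 for validation, days 3-7 for testing.
# Move the cut-off dates to change those proportions.
train_data = df[df['timestamp'] < '2026-01-17']
val_data = df[(df['timestamp'] >= '2026-01-17') & (df['timestamp'] < '2026-01-18')]
test_data = df[df['timestamp'] >= '2026-01-18']

# Or stratified split by sensor (fixed seed for reproducibility)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, stratify=df['sensor'], random_state=42)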
```
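
A third option, offered here only as a sketch, is to split by aircraft so that observations of one airframe never appear in both train and test. The helper below assigns each ICAO hex ID to a split deterministically from a hash of the ID; `assign_split` and `test_fraction` are hypothetical names, and the use of MD5 is purely for stable bucketing, not security.

```python
import hashlib


def assign_split(hex_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign an aircraft to 'train' or 'test' from a hash of its ID."""
    digest = hashlib.md5(hex_id.encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000  # stable value in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Synthetic hex IDs to check that the split fraction comes out near the requested 20%
sample_ids = [f"{i:06x}" for i in range(5000)]
test_share = sum(assign_split(h) == "test" for h in sample_ids) / len(sample_ids)
print(f"test share: {test_share:.3f}")
```

With the notebook's DataFrame this could be applied as `df['hex'].map(assign_split)`. The same aircraft always lands in the same split, even if the dataset is later extended with more days.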

## Next Steps

1. **Baseline Models**: Train Random Forest, XGBoost on engineered features
2. **Deep Learning**: Implement LSTM for trajectory prediction
3. **Unsupervised Learning**: Cluster aircraft behaviors
4. **Sensor Fusion**: Combine multi-sensor observations
5. **Real-time Pipeline**: Deploy models for live data

---

## Citation

If you use this analysis or dataset:

```
Wiren, Richard. (2026). ADS-B Research Grid: Distributed Sensor Network 
for Spoofing Detection [Software]. https://github.com/rwiren/adsb-research-grid
```

**License:** MIT  
**Repository:** https://github.com/rwiren/adsb-research-grid