# Analyzing Your 1979 IBTrACS Data

**Goal:** Understand exactly what trajectory data you have for testing

Before reading the detailed guide (`ibtracs_processing_explained.md`), let's explore your data hands-on.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from datetime import datetime, timedelta

print("Libraries loaded!")

## Step 1: Load the Final Processed Data

In [None]:
# Load the final processed IBTrACS file
# This has already been cleaned through 5 processing steps
# Read 'name' as text so the '66666' separator compares reliably as a string
df = pd.read_csv('IBTrACS_fore72.txt', header=None,
                 names=['name','date','lat','lon','ws','p','speed','direct'],
                 dtype={'name': str})

print(f"Total rows: {len(df):,}")
print(f"\nFirst few rows:")
print(df.head(10))

### 🤔 What do you notice?

- The first row has `66666` in the `name` column; this is the separator
- Each cyclone starts with a separator row
- The actual observations follow, with name, time, position, and intensity
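
Because every storm begins with a `66666` row, a running count of separator rows gives each storm its own ID. That can be safer than grouping by name if any names repeat. A minimal sketch on made-up rows (the sample values and the `storm_id` column are illustrative assumptions, not taken from your file):

```python
import io
import pandas as pd

# Made-up rows in the same 8-column layout; real values will differ.
sample = io.StringIO(
    "66666,0,0,0,0,0,0,0\n"
    "ALICE,1979-01-01 00:00,10.0,150.0,35,1000,10,270\n"
    "ALICE,1979-01-01 03:00,10.2,149.5,40,998,10,275\n"
    "66666,0,0,0,0,0,0,0\n"
    "BESS,1979-03-20 00:00,8.0,140.0,30,1004,8,280\n"
)
cols = ['name', 'date', 'lat', 'lon', 'ws', 'p', 'speed', 'direct']
demo = pd.read_csv(sample, header=None, names=cols, dtype={'name': str})

# Each separator row increments the counter, so all rows of one storm share an ID.
is_sep = demo['name'] == '66666'
demo['storm_id'] = is_sep.cumsum()

# Drop the separator rows once the IDs are assigned.
obs = demo[~is_sep].copy()
print(obs.groupby('storm_id')['name'].agg(['first', 'size']))
```

Grouping by `storm_id` instead of `name` keeps two storms apart even when they share a name.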

In [None]:
# Find all cyclone separators
separator_indices = df[df['name'] == '66666'].index.tolist()

print(f"Number of cyclones in full dataset: {len(separator_indices):,}")
print(f"\nFirst 10 cyclone start indices: {separator_indices[:10]}")

## Step 2: Extract ONLY 1979 Data

In [None]:
# Convert date column to datetime (skip separator rows)
df_clean = df[df['name'] != '66666'].copy()
df_clean['datetime'] = pd.to_datetime(df_clean['date'])
df_clean['year'] = df_clean['datetime'].dt.year

# Filter for 1979
df_1979 = df_clean[df_clean['year'] == 1979].copy()

print(f"📊 1979 DATA SUMMARY:")
print(f"  Total observations: {len(df_1979):,}")
print(f"  Date range: {df_1979['datetime'].min()} to {df_1979['datetime'].max()}")
# Note: this counts distinct names, so storms that share a name are counted once
print(f"  Unique cyclone names: {df_1979['name'].nunique()}")
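
Filtering row by row on `year == 1979` cuts any storm that spans New Year in two. One alternative is to keep every observation of a storm whose first fix falls in 1979. The sketch below assumes each storm already carries a unique ID; the `storm_id` column and the sample rows are hypothetical:

```python
import pandas as pd

# Hypothetical observations: storm 1 straddles the 1979/1980 boundary.
obs = pd.DataFrame({
    'storm_id': [1, 1, 2, 2],
    'datetime': pd.to_datetime([
        '1979-12-31 21:00', '1980-01-01 00:00',
        '1980-02-10 00:00', '1980-02-10 03:00',
    ]),
})

# Broadcast each storm's first timestamp to all of its rows.
start = obs.groupby('storm_id')['datetime'].transform('min')

# Keep whole storms that formed in 1979, including their 1980 rows.
storms_1979 = obs[start.dt.year == 1979]
print(storms_1979)
```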

In [None]:
# List all 1979 cyclones with statistics
print("\n🌀 1979 TROPICAL CYCLONES:")
print("="*80)

for i, name in enumerate(df_1979['name'].unique(), 1):
    storm = df_1979[df_1979['name'] == name]
    
    duration = (storm['datetime'].max() - storm['datetime'].min()).total_seconds() / 3600  # hours
    n_obs = len(storm)
    max_wind = storm['ws'].max()
    min_pressure = storm['p'].min()
    
    print(f"{i:2d}. {name:12s} | {n_obs:3d} obs | {duration:6.1f}h | "
          f"Max wind: {max_wind:5.1f} kt | Min P: {min_pressure:7.1f} hPa")

print("="*80)

### 🎯 KEY INSIGHT:

Count how many cyclones you found. This number determines:
- How many training samples you can create
- Whether 1979 alone is sufficient for testing
- If you need data from other years too

In [None]:
# Calculate potential training samples
# Each cyclone can generate multiple samples using sliding windows

print("\n📈 TRAINING SAMPLE POTENTIAL:")
print("="*60)

total_samples = 0
for name in df_1979['name'].unique():
    storm = df_1979[df_1979['name'] == name]
    n_obs = len(storm)
    
    # For 72-hour forecast, need:
    # - Input: 8 observations (24 hours at 3-hr intervals)
    # - Output: up to 24 observations (72 hours)
    # - Minimum: 8 + 24 = 32 observations
    
    if n_obs >= 32:
        # Number of sliding windows
        samples = n_obs - 32 + 1
        total_samples += samples
        print(f"{name:12s}: {n_obs:3d} obs → {samples:3d} samples")

print("="*60)
print(f"\n🎯 TOTAL POTENTIAL SAMPLES FROM 1979: {total_samples:,}")
print(f"\n⚠️  Is this enough for training a diffusion model?")
print(f"    Consider: Typical DiT models need 10k-100k+ samples")
print(f"    Recommendation: Use 1979 for TESTING only, train on other years")
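
The window arithmetic in the cell above reduces to one formula: a gap-free track with `n_obs` fixes yields `n_obs - (n_input + n_output) + 1` windows, or none if it is shorter than one window. A small sketch (the helper name `count_windows` is made up for illustration):

```python
def count_windows(n_obs, n_input=8, n_output=24):
    """Number of sliding windows of length n_input + n_output in a track."""
    window = n_input + n_output
    return max(0, n_obs - window + 1)

# A track shorter than 32 fixes gives no sample; each extra fix adds one.
for n in (20, 32, 40):
    print(n, count_windows(n))
```

Lowering `n_output` (for example, to forecast only 24 hours ahead) raises the count, which is one way to gauge how the forecast horizon limits the usable data.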

## Step 3: Visualize Cyclone Tracks

In [None]:
# Plot all 1979 cyclone tracks
fig = plt.figure(figsize=(15, 10))
ax = plt.axes(projection=ccrs.PlateCarree())

# Add map features
ax.add_feature(cfeature.LAND, facecolor='lightgray')
ax.add_feature(cfeature.COASTLINE, linewidth=0.5)
ax.add_feature(cfeature.BORDERS, linewidth=0.3)
ax.gridlines(draw_labels=True, linewidth=0.5, alpha=0.5)

# Set extent to Western Pacific
ax.set_extent([100, 180, 0, 60], crs=ccrs.PlateCarree())

# Plot each cyclone
colors = plt.cm.tab20(np.linspace(0, 1, df_1979['name'].nunique()))
for i, name in enumerate(df_1979['name'].unique()):
    storm = df_1979[df_1979['name'] == name].sort_values('datetime')
    
    # Plot track
    ax.plot(storm['lon'], storm['lat'], 
            color=colors[i], linewidth=2, 
            transform=ccrs.PlateCarree(),
            label=name)
    
    # Mark start and end
    ax.plot(storm['lon'].iloc[0], storm['lat'].iloc[0], 
            'o', color=colors[i], markersize=8, 
            transform=ccrs.PlateCarree())
    ax.plot(storm['lon'].iloc[-1], storm['lat'].iloc[-1], 
            's', color=colors[i], markersize=8, 
            transform=ccrs.PlateCarree())

ax.legend(loc='upper left', fontsize=8)
ax.set_title('1979 Western Pacific Tropical Cyclones\n(○ = start, □ = end)',
             fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('1979_cyclone_tracks.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ Track map saved as '1979_cyclone_tracks.png'")
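
If a track crosses the antimeridian and the longitudes are stored in the -180..180 convention, the values jump from about +180 to -180 and the plotted line streaks across the map. Whether your file uses that convention is something to check. Assuming it might, this sketch wraps longitudes to 0..360 before plotting (`wrap_lon_360` is a hypothetical helper):

```python
import numpy as np

def wrap_lon_360(lon):
    """Map longitudes in degrees to the range [0, 360)."""
    return np.mod(np.asarray(lon, dtype=float), 360.0)

# Made-up track that crosses the antimeridian heading east.
track_lon = [178.5, 179.6, -179.2, -178.0]
print(wrap_lon_360(track_lon))
```

After wrapping, the longitudes increase steadily across 180°. A map centered on the date line, e.g. `ccrs.PlateCarree(central_longitude=180)` for the axes projection, may also be needed to show the full track.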

## Step 4: Detailed Analysis of One Cyclone

In [None]:
# Pick the first cyclone for detailed analysis
first_cyclone_name = df_1979['name'].unique()[0]
cyclone = df_1979[df_1979['name'] == first_cyclone_name].copy()
cyclone = cyclone.sort_values('datetime').reset_index(drop=True)

print(f"🔍 DETAILED ANALYSIS: {first_cyclone_name}")
print("="*60)
print(f"\nLifetime: {cyclone['datetime'].min()} to {cyclone['datetime'].max()}")
print(f"Duration: {(cyclone['datetime'].max() - cyclone['datetime'].min()).total_seconds()/3600:.1f} hours")
print(f"Number of observations: {len(cyclone)}")
print(f"\nIntensity:")
print(f"  Max wind: {cyclone['ws'].max():.1f} kt")
print(f"  Min pressure: {cyclone['p'].min():.1f} hPa")
print(f"\nPosition range:")
print(f"  Latitude: {cyclone['lat'].min():.2f}°N to {cyclone['lat'].max():.2f}°N")
print(f"  Longitude: {cyclone['lon'].min():.2f}°E to {cyclone['lon'].max():.2f}°E")

In [None]:
# Check time spacing (should be exactly 3 hours everywhere)
time_diffs = cyclone['datetime'].diff().dt.total_seconds() / 3600
time_diffs = time_diffs.dropna()

print(f"\n⏰ TIME SPACING CHECK:")
print(f"  Expected: 3.0 hours between all observations")
print(f"  Actual:")
print(f"    Mean: {time_diffs.mean():.2f} hours")
print(f"    Min: {time_diffs.min():.2f} hours")
print(f"    Max: {time_diffs.max():.2f} hours")
print(f"    Std: {time_diffs.std():.2f} hours")

# Compare every gap with 3.0 directly: a low std alone would also pass
# a track whose gaps are all the same wrong value (e.g. all 6 hours)
if (time_diffs == 3.0).all():
    print(f"\n  ✅ Perfect! All observations are exactly 3 hours apart")
else:
    print(f"\n  ⚠️  Warning: Some irregular time spacing detected")
    print(f"  Irregular gaps: {time_diffs[time_diffs != 3.0].tolist()}")
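
The same check can be wrapped in a function so it can be run on every storm, not just the first. Comparing each gap with the expected value catches both irregular spacing and spacing that is uniform but wrong. A sketch on made-up timestamps (`irregular_gaps` is a hypothetical helper):

```python
import pandas as pd

def irregular_gaps(times, expected_hours=3.0):
    """Return the gaps (in hours) that differ from the expected spacing."""
    gaps = pd.Series(times).sort_values().diff().dropna().dt.total_seconds() / 3600
    return gaps[gaps != expected_hours]

# Made-up track with one missing fix at 06:00.
times = pd.to_datetime([
    '1979-01-01 00:00', '1979-01-01 03:00',
    '1979-01-01 09:00', '1979-01-01 12:00',
])
bad = irregular_gaps(times)
print(bad.tolist())
```

An empty result means the track is evenly spaced at the expected interval.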

In [None]:
# Visualize intensity evolution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Wind speed over time
axes[0, 0].plot(cyclone['datetime'], cyclone['ws'], 'b-', linewidth=2)
axes[0, 0].set_xlabel('Time')
axes[0, 0].set_ylabel('Wind Speed (kt)')
axes[0, 0].set_title(f'{first_cyclone_name}: Wind Speed Evolution')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Pressure over time
axes[0, 1].plot(cyclone['datetime'], cyclone['p'], 'r-', linewidth=2)
axes[0, 1].set_xlabel('Time')
axes[0, 1].set_ylabel('Central Pressure (hPa)')
axes[0, 1].set_title(f'{first_cyclone_name}: Pressure Evolution')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].invert_yaxis()  # Lower pressure = stronger storm

# Plot 3: Movement speed over time
axes[1, 0].plot(cyclone['datetime'], cyclone['speed'], 'g-', linewidth=2)
axes[1, 0].set_xlabel('Time')
axes[1, 0].set_ylabel('Movement Speed (kt)')
axes[1, 0].set_title(f'{first_cyclone_name}: Translation Speed')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Track
axes[1, 1].plot(cyclone['lon'], cyclone['lat'], 'k-', linewidth=2)
axes[1, 1].plot(cyclone['lon'].iloc[0], cyclone['lat'].iloc[0], 
               'go', markersize=15, label='Start')
axes[1, 1].plot(cyclone['lon'].iloc[-1], cyclone['lat'].iloc[-1], 
               'ro', markersize=15, label='End')
axes[1, 1].set_xlabel('Longitude (°E)')
axes[1, 1].set_ylabel('Latitude (°N)')
axes[1, 1].set_title(f'{first_cyclone_name}: Track')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].legend()

plt.tight_layout()
plt.savefig(f'{first_cyclone_name}_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\n✅ Analysis plot saved as '{first_cyclone_name}_analysis.png'")
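
As a sanity check on the translation-speed panel, the speed can be recomputed from consecutive positions and compared with the file's `speed` column, if that column is indeed in knots as the axis label assumes. A sketch using the haversine formula with a spherical Earth of radius 6371 km (the helper names are made up):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation
KM_PER_NMI = 1.852        # one nautical mile in kilometres

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def speed_kt(lat1, lon1, lat2, lon2, hours=3.0):
    """Mean translation speed in knots between two fixes `hours` apart."""
    return haversine_km(lat1, lon1, lat2, lon2) / KM_PER_NMI / hours

# Made-up pair of fixes three hours apart.
print(round(speed_kt(10.0, 150.0, 10.0, 149.5), 1))
```

Large disagreements between the recomputed values and the stored column would suggest a different unit or a different averaging interval.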

## Step 5: Create One Sample (What the Model Will See)

In [None]:
# Create a single training sample
# Input: Past 24 hours (8 observations at 3-hr intervals)
# Output: Future positions for 6, 12, 24, 48, 72 hours

# Use rows 10-17 (8 observations) as input; the last one is t=0
# (assumes this cyclone has at least 18 observations)
input_start = 10
input_end = input_start + 8

input_data = cyclone.iloc[input_start:input_end].copy()
current_time = input_data['datetime'].iloc[-1]
current_pos = (input_data['lat'].iloc[-1], input_data['lon'].iloc[-1])

print(f"\n📦 SAMPLE CREATION EXAMPLE:")
print("="*60)
print(f"\nCurrent time (t=0): {current_time}")
print(f"Current position: ({current_pos[0]:.2f}°N, {current_pos[1]:.2f}°E)")

print(f"\n📥 INPUT (Past 24 hours):")
for i, row in input_data.iterrows():
    hours_ago = (current_time - row['datetime']).total_seconds() / 3600
    print(f"  t-{hours_ago:4.0f}h: ({row['lat']:6.2f}°N, {row['lon']:7.2f}°E) | "
          f"WS: {row['ws']:5.1f} kt | P: {row['p']:7.1f} hPa")

# Find future positions
forecast_hours = [6, 12, 24, 48, 72]
print(f"\n📤 TARGET (Future positions):")
for fh in forecast_hours:
    target_time = current_time + timedelta(hours=fh)
    # Find the observation closest to target time
    time_diff = abs(cyclone['datetime'] - target_time)
    closest_idx = time_diff.idxmin()
    
    if time_diff[closest_idx].total_seconds() / 3600 <= 1.5:  # Within 1.5 hours
        target_row = cyclone.loc[closest_idx]
        print(f"  t+{fh:3d}h: ({target_row['lat']:6.2f}°N, {target_row['lon']:7.2f}°E)")
    else:
        print(f"  t+{fh:3d}h: NO DATA (cyclone ended or data gap)")

print("\n💡 This is ONE training sample!")
print("   With sliding windows, you can create many more from each cyclone.")
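
To turn the one-off example into something reusable, the same logic can be put in a function that is called once per window position. The sketch below is an assumption about how that might look, not the paper's pipeline: `build_sample` is a hypothetical helper, it runs on a synthetic track, and it uses exact timestamp matches rather than the nearest-within-1.5-hours lookup in the cell above.

```python
import pandas as pd

def build_sample(track, end_idx, n_input=8, lead_hours=(6, 12, 24, 48, 72)):
    """Return (input rows, {lead hour: (lat, lon) or None}) for one window.

    `track` must be sorted by 'datetime'; `end_idx` is the row position of t=0.
    """
    if end_idx + 1 < n_input:
        raise ValueError("not enough history before end_idx")
    inputs = track.iloc[end_idx + 1 - n_input:end_idx + 1]
    t0 = track['datetime'].iloc[end_idx]
    by_time = track.set_index('datetime')
    targets = {}
    for h in lead_hours:
        t = t0 + pd.Timedelta(hours=h)
        if t in by_time.index:
            row = by_time.loc[t]
            targets[h] = (float(row['lat']), float(row['lon']))
        else:
            targets[h] = None  # storm ended or fix missing
    return inputs, targets

# Synthetic 3-hourly track, 20 fixes long.
times = pd.date_range('1979-01-01', periods=20, freq=pd.Timedelta(hours=3))
track = pd.DataFrame({'datetime': times,
                      'lat': [10 + 0.2 * i for i in range(20)],
                      'lon': [150 - 0.5 * i for i in range(20)]})
inputs, targets = build_sample(track, end_idx=7)
print(len(inputs), targets)
```

Sliding `end_idx` from `n_input - 1` to the end of the track produces one sample per position; lead times beyond the storm's lifetime come back as `None` and can be masked or dropped.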

## Summary & Next Steps

### What You've Learned:

1. ✅ How many cyclones are in your 1979 data
2. ✅ The data structure and format
3. ✅ How to verify data quality (3-hour spacing, no gaps)
4. ✅ How sliding windows create multiple samples
5. ✅ What input/output pairs look like for the model

### Key Findings:

- **Number of 1979 cyclones:** (printed above)
- **Total potential samples:** (printed above)
- **Data quality:** Time spacing verified

### What's Missing:

This is JUST the trajectory data. To train the model, you also need:

1. **ERA5 environmental data** for each timestamp:
   - Wind fields (u,v) at 300, 500, 700, 850 hPa
   - Sea surface temperature (SST)
   - Geopotential height

2. **Devortexing** of wind fields (removing cyclone's own circulation)

3. **Feature engineering** (19 trajectory features mentioned in the paper)

4. **Normalization** of all variables
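
For item 4, one common choice is z-score normalization with statistics computed on the training set only, so that nothing leaks from the test year. The scheme the paper actually uses is not stated here; this is only a sketch with made-up numbers and hypothetical helper names:

```python
import numpy as np

def fit_zscore(train):
    """Per-column mean and std computed from training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return mean, std

def apply_zscore(x, mean, std):
    """Normalize x with previously fitted statistics."""
    return (x - mean) / std

# Made-up columns standing in for wind speed (kt) and pressure (hPa).
train = np.array([[10.0, 1000.0], [20.0, 990.0], [30.0, 980.0]])
test = np.array([[25.0, 985.0]])
mean, std = fit_zscore(train)
print(apply_zscore(test, mean, std))
```

The same `mean` and `std` are reused for the 1979 test data, which is why they are returned separately from the normalized values.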

### Next Actions:

**Option 1: Connect to ERA5** (Recommended first)
- Use the `era5_data_exploration_1979.ipynb` notebook
- Extract environmental data for one cyclone timestamp
- Verify alignment between IBTrACS and ERA5

**Option 2: Understand Devortexing**
- Study the `meteor_factor.ipynb` code
- Run devortexing on one wind field sample
- Visualize before/after comparison

**Option 3: Full Pipeline**
- Process all 1979 cyclones end-to-end
- Create complete training samples
- Save in format ready for your Diffusion Transformer

**Which one should you tackle first?**