# Appendix: Feature Engineering for OCSF Data

This notebook demonstrates feature engineering techniques for OCSF (Open Cybersecurity Schema Framework) data.

**What you'll learn:**
1. Loading and exploring OCSF parquet data
2. Understanding the schema and available fields
3. Engineering temporal features
4. Handling categorical and numerical features
5. Preparing data for TabularResNet

**Prerequisites:**
- Sample data in `../data/` (included in repository)
- Or generate your own using `../appendix-code/`

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)

## 1. Load OCSF Data

The sample data is already in OCSF-compliant parquet format with flattened fields.

In [None]:
# Load the OCSF logs
df = pd.read_parquet('../data/ocsf_logs.parquet')

print(f"Dataset shape: {df.shape}")
print(f"Columns: {len(df.columns)}")
print(f"\nColumn names:")
for i, col in enumerate(df.columns):
    print(f"  {i+1:2d}. {col}")

In [None]:
# Preview the data
df.head(3)

## 2. Explore OCSF Schema

OCSF events have:
- **Core fields**: class_uid, category_uid, activity_id, severity_id, time
- **Nested objects**: actor, src_endpoint, dst_endpoint, http_request, http_response
- **Flattened fields**: nested paths joined with underscores, e.g. actor_user_name, http_request_method
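
To make the nested-versus-flattened distinction concrete, here is a minimal sketch of how a nested event could be flattened into underscore-joined columns with `pd.json_normalize`. The event below is hypothetical (its values and nested layout are made up for illustration), and this is not necessarily how the sample parquet file was produced.

```python
import pandas as pd

# Hypothetical nested event; values and layout are illustrative only.
nested_event = {
    "class_uid": 4002,
    "time": 1700000000000,
    "actor": {"user": {"name": "alice"}},
    "src_endpoint": {"ip": "10.0.0.5", "port": 443},
}

# sep="_" joins nested keys, so actor.user.name becomes actor_user_name.
flat = pd.json_normalize([nested_event], sep="_")
print(sorted(flat.columns))
```

Top-level fields keep their names, while each nested leaf becomes its own column.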

In [None]:
# Check data types and non-null counts
print("Column count per data type:")
print(df.dtypes.value_counts())
print("\nColumns with the fewest non-null values:")
print(df.notna().sum().sort_values().head(10))
print("\nSample of each column type:")
for dtype in df.dtypes.unique():
    cols = df.select_dtypes(include=[dtype]).columns[:3].tolist()
    print(f"  {dtype}: {cols}")

In [None]:
# Check unique values for key categorical columns
categorical_cols = ['class_name', 'category_name', 'activity_name', 'status', 'level', 'service']

print("Unique values in categorical columns:")
for col in categorical_cols:
    if col in df.columns:
        unique_vals = df[col].nunique()
        # value_counts() already holds the counts, so compute it once
        top_counts = df[col].value_counts().head(5)
        print(f"\n{col} ({unique_vals} unique):")
        for val, count in top_counts.items():
            print(f"  - {val}: {count}")

## 3. Engineer Temporal Features

Time-based patterns are critical for anomaly detection:
- Logins at 3 AM are suspicious
- Attack patterns have timing signatures
- Business hours vs off-hours traffic differs
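
Hour of day wraps around at midnight, so the raw integer is a poor measure of closeness: 23 and 0 are one hour apart but differ by 23. The sketch below checks how a sin/cos encoding handles this. The helper names `encode_hour` and `circular_distance` are introduced here for illustration only.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle as (sin, cos)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def circular_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_midnight = circular_distance(23, 0)   # one hour apart, across midnight
d_adjacent = circular_distance(0, 1)    # one hour apart, no wrap-around
d_opposite = circular_distance(0, 12)   # opposite sides of the clock

print(f"raw difference |23 - 0| = {abs(23 - 0)}")
print(f"encoded distance 23 -> 0: {d_midnight:.3f}")
print(f"encoded distance 0 -> 1:  {d_adjacent:.3f}")
print(f"encoded distance 0 -> 12: {d_opposite:.3f}")
```

In the encoded space, any two hours that are one hour apart sit at the same distance, whether or not they straddle midnight.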

In [None]:
def extract_temporal_features(df, time_col='time'):
    """
    Extract temporal features from Unix timestamp (milliseconds).
    
    Returns DataFrame with new temporal columns.
    """
    result = df.copy()
    
    # Convert milliseconds to datetime (timezone-naive, UTC-based);
    # unparseable values become NaT
    result['datetime'] = pd.to_datetime(result[time_col], unit='ms', errors='coerce')
    
    # Basic temporal features
    result['hour_of_day'] = result['datetime'].dt.hour
    result['day_of_week'] = result['datetime'].dt.dayofweek  # 0=Monday
    result['is_weekend'] = (result['day_of_week'] >= 5).astype(int)
    result['is_business_hours'] = ((result['hour_of_day'] >= 9) & 
                                    (result['hour_of_day'] < 17)).astype(int)
    
    # Cyclical encoding (sin/cos) - preserves circular nature
    # Hour 23 and hour 0 are close in (sin, cos) space
    result['hour_sin'] = np.sin(2 * np.pi * result['hour_of_day'] / 24)
    result['hour_cos'] = np.cos(2 * np.pi * result['hour_of_day'] / 24)
    result['day_sin'] = np.sin(2 * np.pi * result['day_of_week'] / 7)
    result['day_cos'] = np.cos(2 * np.pi * result['day_of_week'] / 7)
    
    return result

# Apply temporal feature extraction
df = extract_temporal_features(df)

# Show sample of temporal features
temporal_cols = ['datetime', 'hour_of_day', 'day_of_week', 'is_weekend', 
                 'is_business_hours', 'hour_sin', 'hour_cos']
df[temporal_cols].head()

In [None]:
# Visualize hour distribution
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

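# Hour distribution (explicit edges give one bin per hour, 0-23,
# even if some hours are absent from the data)
df['hour_of_day'].hist(bins=np.arange(25) - 0.5, ax=axes[0], edgecolor='black')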
axes[0].set_xlabel('Hour of Day')
axes[0].set_ylabel('Count')
axes[0].set_title('Event Distribution by Hour')

# Day of week distribution (one bin per day, 0-6)
df['day_of_week'].hist(bins=np.arange(8) - 0.5, ax=axes[1], edgecolor='black')
axes[1].set_xlabel('Day of Week (0=Mon)')
axes[1].set_ylabel('Count')
axes[1].set_title('Event Distribution by Day')

plt.tight_layout()
plt.show()

## 4. Select Core Features

Not all of the 60+ columns are useful for modeling. We keep two groups:
- **Categorical**: class, activity, status, level, service, user, HTTP method, URL path
- **Numerical**: severity, activity and status IDs, duration, HTTP response code, temporal features
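
One way to sanity-check a selection like this is to profile each candidate column for missingness and cardinality. The sketch below uses a hypothetical helper, `profile_columns`, on a made-up toy frame. It illustrates the idea and is not how the lists in this notebook were chosen.

```python
import pandas as pd

def profile_columns(df):
    """Summarize dtype, fraction of nulls, and distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_fraction": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    }).sort_values("null_fraction")

# Toy data for illustration only.
toy_events = pd.DataFrame({
    "status": ["Success", "Failure", "Success", "Success"],
    "http_request_url_path": ["/a", "/b", "/c", "/d"],
    "duration": [12.0, None, 7.5, None],
})

profile = profile_columns(toy_events)
print(profile)
```

Columns that are mostly null, or categorical columns whose distinct count approaches the row count (as with the URL path here), may deserve a second look before they go into the model.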

In [None]:
# Define feature sets
categorical_features = [
    'class_name',
    'activity_name', 
    'status',
    'level',
    'service',
    'actor_user_name',
    'http_request_method',
    'http_request_url_path',
]

numerical_features = [
    'severity_id',
    'activity_id',
    'status_id',
    'duration',
    'http_response_code',
    'hour_of_day',
    'day_of_week',
    'is_weekend',
    'is_business_hours',
    'hour_sin',
    'hour_cos',
    'day_sin',
    'day_cos',
]

# Filter to columns that exist in our data
categorical_features = [c for c in categorical_features if c in df.columns]
numerical_features = [c for c in numerical_features if c in df.columns]

print(f"Categorical features ({len(categorical_features)}): {categorical_features}")
print(f"\nNumerical features ({len(numerical_features)}): {numerical_features}")

## 5. Handle Missing Values

Many OCSF fields are optional, so missing values are common. Strategy:
- **Categorical**: Use a special 'MISSING' category
- **Numerical**: Fill with 0 (as in the code below) or the median, optionally adding an `_is_missing` indicator column
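
The median-plus-indicator variant can be sketched as follows. The helper name `fill_with_indicator` and the toy data are assumptions made for illustration. In a real pipeline the median would be computed on training data and reused at inference.

```python
import pandas as pd

def fill_with_indicator(df, col):
    """Median-fill one numeric column and record which rows were missing."""
    result = df.copy()
    values = pd.to_numeric(result[col], errors="coerce")
    # Record missingness before filling so the model can still see it.
    result[f"{col}_is_missing"] = values.isna().astype(int)
    result[col] = values.fillna(values.median())
    return result

# Toy data for illustration only.
toy_durations = pd.DataFrame({"duration": [10.0, None, 30.0, None, 50.0]})
filled = fill_with_indicator(toy_durations, "duration")
print(filled)
```

The indicator column matters when a filled value could be mistaken for a real one, for example a duration that is genuinely 0.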

In [None]:
def handle_missing_values(df, categorical_cols, numerical_cols):
    """
    Handle missing values in feature columns.
    """
    result = df.copy()
    
    # Categorical: fill with 'MISSING'
    for col in categorical_cols:
        if col in result.columns:
            result[col] = result[col].fillna('MISSING').astype(str)
            result[col] = result[col].replace('', 'MISSING')
    
    # Numerical: fill with 0
    for col in numerical_cols:
        if col in result.columns:
            result[col] = pd.to_numeric(result[col], errors='coerce').fillna(0)
    
    return result

# Apply missing value handling
df_clean = handle_missing_values(df, categorical_features, numerical_features)

# Check for remaining nulls
all_features = categorical_features + numerical_features
null_counts = df_clean[all_features].isnull().sum()
print("Null counts after handling:")
print(null_counts[null_counts > 0] if null_counts.sum() > 0 else "No nulls remaining!")

## 6. Encode Features for TabularResNet

TabularResNet takes two input arrays:
- **Numerical array**: Standardized floats (mean 0, standard deviation 1)
- **Categorical array**: Integer indices (0, 1, 2, ...), one column per categorical feature
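
Integer indices only work at inference time if values never seen during fitting have somewhere to go, because `LabelEncoder.transform` raises an error on unseen labels. Here is a minimal sketch, assuming the encoder was fitted with a reserved 'UNKNOWN' class. The helper `encode_with_unknown` is hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

def encode_with_unknown(encoder, values, unknown_token="UNKNOWN"):
    """Replace values the encoder has not seen, then integer-encode."""
    known = set(encoder.classes_)
    safe = [v if v in known else unknown_token for v in values]
    return encoder.transform(safe)

# Fit on the training vocabulary plus the reserved token.
demo_encoder = LabelEncoder()
demo_encoder.fit(["GET", "POST", "UNKNOWN"])

# 'DELETE' was never seen during fitting.
codes = encode_with_unknown(demo_encoder, ["GET", "DELETE", "POST"])
decoded = demo_encoder.inverse_transform(codes)
print(codes, decoded)
```

Each unseen value ends up with the index of the reserved token, so the embedding lookup downstream stays within the cardinality recorded at training time.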

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder

def prepare_for_tabular_resnet(df, categorical_cols, numerical_cols):
    """
    Prepare features for TabularResNet.
    
    Returns:
        numerical_array: Normalized numerical features
        categorical_array: Integer-encoded categorical features
        encoders: Dict of LabelEncoders
        scaler: StandardScaler
        cardinalities: List of vocab sizes per categorical
    """
    # Encode categorical features
    encoders = {}
    categorical_data = []
    cardinalities = []
    
    for col in categorical_cols:
        encoder = LabelEncoder()
        # Reserve an 'UNKNOWN' class for values not seen during fitting.
        # At inference, map unseen values to 'UNKNOWN' before calling transform().
        unique_vals = list(df[col].unique()) + ['UNKNOWN']
        encoder.fit(unique_vals)
        encoded = encoder.transform(df[col])
        categorical_data.append(encoded)
        encoders[col] = encoder
        cardinalities.append(len(encoder.classes_))
    
    # Keep a 2-D (n_rows, 0) shape when there are no categorical columns
    categorical_array = (np.column_stack(categorical_data) if categorical_data
                         else np.empty((len(df), 0), dtype=np.int64))
    
    # Scale numerical features
    scaler = StandardScaler()
    numerical_array = scaler.fit_transform(df[numerical_cols])
    
    return numerical_array, categorical_array, encoders, scaler, cardinalities

# Prepare features
numerical_array, categorical_array, encoders, scaler, cardinalities = \
    prepare_for_tabular_resnet(df_clean, categorical_features, numerical_features)

print("Feature arrays ready for TabularResNet:")
print(f"  Numerical shape: {numerical_array.shape}")
print(f"  Categorical shape: {categorical_array.shape}")
print(f"  Categorical cardinalities: {cardinalities}")

In [None]:
# Preview encoded data
print("\nNumerical features (first 3 rows, normalized):")
print(pd.DataFrame(numerical_array[:3], columns=numerical_features).round(3))

print("\nCategorical features (first 3 rows, integer encoded):")
print(pd.DataFrame(categorical_array[:3], columns=categorical_features))

## 7. Save Processed Features

Save the processed data and encoding artifacts for training.
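
The artifacts matter because new data must be transformed with the same fitted objects. The round trip below is a small sketch using a temporary directory and made-up numbers. The column names `a` and `b` are placeholders, not fields from the dataset.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy "training" data (values are illustrative only).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
demo_scaler = StandardScaler().fit(train)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_artifacts.pkl"
    with open(path, "wb") as f:
        pickle.dump({"scaler": demo_scaler, "numerical_cols": ["a", "b"]}, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Transform a new batch with the restored scaler, not a freshly fitted one.
new_batch = np.array([[2.0, 20.0], [4.0, 40.0]])
scaled = restored["scaler"].transform(new_batch)
print(scaled.round(3))
```

Pickle files should only be loaded from trusted sources, and scikit-learn objects are best unpickled with the same library version that saved them.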

In [None]:
import pickle

# Save feature arrays
np.save('../data/numerical_features.npy', numerical_array)
np.save('../data/categorical_features.npy', categorical_array)

# Save encoders and scaler
artifacts = {
    'encoders': encoders,
    'scaler': scaler,
    'categorical_cols': categorical_features,
    'numerical_cols': numerical_features,
    'cardinalities': cardinalities
}

with open('../data/feature_artifacts.pkl', 'wb') as f:
    pickle.dump(artifacts, f)

print("Saved:")
print("  - ../data/numerical_features.npy")
print("  - ../data/categorical_features.npy")
print("  - ../data/feature_artifacts.pkl")

## Summary

In this notebook, we:

1. **Loaded OCSF data** from parquet format
2. **Explored the schema** - 60+ columns, with nested objects flattened
3. **Engineered temporal features** - hour, day, cyclical sin/cos encoding
4. **Selected core features** - categorical and numerical subsets
5. **Handled missing values** - 'MISSING' for categorical, 0 for numerical
6. **Encoded for TabularResNet** - LabelEncoder + StandardScaler
7. **Saved processed features** - `.npy` arrays plus pickled encoders and scaler

**Next**: Use these features in [04-self-supervised-training.ipynb](04-self-supervised-training.ipynb) to train embeddings.