# Transform Traffic Captures to ML Format

This notebook converts raw traffic capture files into a pickle format compatible with the ExplainWF framework for Website Fingerprinting attacks.

## Input Format
- Raw traffic traces from `../../data/traffic_captures/` or `../../data/reduced_list/`
- Each file holds the traces for one website, one trace per line, in the format `<url> <num_packets> <timestamp1>:<size1> <timestamp2>:<size2> ...`
- Negative sizes = outgoing packets (client-to-server)
- Positive sizes = incoming packets (server-to-client)
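
To make the line layout concrete, here is a minimal parsing sketch. The helper name `parse_trace_line` and the sample line are made up for illustration, and the division by 1000 assumes the timestamps are in milliseconds, as the conversion code in this notebook implies.

```python
def parse_trace_line(line):
    """Parse '<url> <num_packets> <ts>:<size> ...' into (url, packets).

    Each packet is (rel_time_seconds, direction, volume), where direction
    is 0 for outgoing (negative size) and 1 for incoming (positive size).
    """
    fields = line.split()
    url, points = fields[0], fields[2:]
    base_time = float(points[0].split(":")[0])
    packets = []
    for point in points:
        ts, size = point.split(":")
        size = float(size)
        # Assumption: timestamps are in milliseconds, so /1000 gives seconds
        rel_time = (float(ts) - base_time) / 1000.0
        packets.append((rel_time, 0 if size < 0 else 1, int(abs(size))))
    return url, packets


# Made-up trace line, for illustration only
url, packets = parse_trace_line("example.com 3 1000:-512 1250:1500 2000:-64")
print(url, packets)
```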

## Output Format
- Pickle file compatible with ExplainWF: https://github.com/explainwf-popets2023/explainwf-popets2023.github.io/tree/main/data
- Contains a tuple: `(X_train, X_test, y_train, y_test)`
- Each sample is a dictionary whose `cells` key holds a list of `[rel_time, direction, direction, volume]` entries (the direction value is stored twice)
- Direction: 0 = outgoing, 1 = incoming
- Relative time in seconds from first packet
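
As a rough illustration of this layout, the sketch below writes a tiny stand-in dataset to a temporary file and reads it back. The values and the file name `toy.pkl` are invented; a real file produced by this notebook would be loaded the same way.

```python
import os
import pickle
import tempfile

# Tiny stand-in dataset with the same layout (all values are made up)
sample = {"cells": [[0.0, 0, 0, 512], [0.25, 1, 1, 1500]]}
split = ([sample, sample], [sample], [0, 1], [0])

path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
with open(path, "wb") as f:
    pickle.dump(split, f)

# Loading returns the same 4-tuple that was saved
with open(path, "rb") as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Unpack the first cell of the first training sample
rel_time, direction, _, volume = X_train[0]["cells"][0]
print(len(X_train), len(X_test), rel_time, direction, volume)
```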

## Processing Steps
1. Read all trace files from input directory
2. Parse each line (up to 200 instances per website)
3. Convert to relative timestamps (seconds from start)
4. Keep only websites with at least 20 instances
5. Split into 60% train / 40% test with stratification
6. Save as pickle file
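
Step 4 can be sketched on its own with the standard library. This is a variant rather than the notebook's exact code: the hypothetical helper `filter_and_encode` assigns class IDs after filtering and in sorted order, so the IDs are contiguous and reproducible. The website names and counts are made up.

```python
from collections import Counter

MIN_INSTANCES = 20  # threshold from step 4


def filter_and_encode(data, labels, min_instances=MIN_INSTANCES):
    """Drop under-represented websites and map the rest to integer IDs."""
    counts = Counter(labels)
    kept = sorted(name for name, c in counts.items() if c >= min_instances)
    label_ids = {name: i for i, name in enumerate(kept)}
    pairs = [(d, label_ids[l]) for d, l in zip(data, labels) if l in label_ids]
    filtered_data = [d for d, _ in pairs]
    filtered_labels = [y for _, y in pairs]
    return filtered_data, filtered_labels, label_ids


# Made-up labels: c.com has too few instances and is dropped
labels = ["b.com"] * 20 + ["a.com"] * 25 + ["c.com"] * 3
data = [{"cells": []} for _ in labels]
X, y, ids = filter_and_encode(data, labels)
print(ids, len(X))
```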

## Usage
Set the `scenario` variable below to the configuration you want to process, then run the cells.

In [5]:
import os
import pickle
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split

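def create_pkl(tcp_path, pkl_name, time_added=0):
    """Convert the trace files under tcp_path into a train/test pickle at pkl_name.

    Cells in the last time_added seconds of each trace are dropped.
    Returns the dict mapping each website label to its integer class ID.
    """
    # Create output directory if it doesn't exist
    output_dir = os.path.dirname(pkl_name)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)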
    
    data = []
    labels = []
    for path, subdirs, files in os.walk(tcp_path):
        index = 1
        for file in files:
            instance = 1  # 1-based counter of traces read from this file
            n = 200  # maximum number of instances kept per website
            DIRECTION = {"client-to-server": -1, "server-to-client": 1}

            with open(os.path.join(path, file), "r") as f:
                print("({}/{}) Reading : {}".format(index, len(files), file))
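                for line in f:
                    # Skip the URL and packet-count fields; the rest are "<timestamp>:<size>" points
                    points = line.split()[2:]
                    if not points:
                        continue  # blank line or trace without packets
                    base_time = points[0].split(":")[0]
                    base_time = float(base_time) if "." in base_time else int(base_time)
                    cells = []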
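                    for point in points:
                        time_str, size_str = point.split(":")[:2]
                        time = float(time_str) if "." in time_str else int(time_str)

                        # Dividing by 1000 yields seconds only if the input timestamps are in milliseconds
                        rel_time = (time - base_time) / 1000.0

                        # In the input, negative sizes are OUT and positive sizes are IN;
                        # the output encodes OUT as 0 and IN as 1
                        size = float(size_str)
                        direction = 0 if np.sign(size) == DIRECTION["client-to-server"] else 1
                        volume = int(np.abs(size))

                        # Direction is stored twice to match the 4-field cell layout
                        cells.append([rel_time, direction, direction, volume])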
                    
                    filtered_cells = [c for c in cells if c[0] <= cells[-1][0] - time_added]
                    labels.append(file)
                    data.append({"cells": filtered_cells})
                    
                    instance += 1
                    if instance > n:
                        break
                index += 1
                if instance <= n:
                    print(f"[ERROR] {file} has only {instance-1} instances")

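    label_counts = Counter(labels)

    # Map each website label to an integer class ID. The mapping covers every
    # label seen, including websites dropped by the filter below, and set order
    # can differ between runs, so the IDs may not be contiguous or stable.
    s = set(labels)
    ds = dict(zip(s, range(len(s))))

    # Keep only websites with at least 20 instances
    filtered_data = []
    filtered_labels = []
    for d, l in zip(data, labels):
        if label_counts[l] >= 20:
            filtered_data.append(d)
            filtered_labels.append(ds[l])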

    X_train, X_test, y_train, y_test = train_test_split(
        filtered_data,
        filtered_labels,
        train_size=0.6,
        shuffle=True,
        stratify=filtered_labels,
    )

    # Save the train/test split as a single tuple
    with open(pkl_name, 'wb') as f:
        pickle.dump((X_train, X_test, y_train, y_test), f)
    return ds
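
The 60/40 stratified split at the end of `create_pkl` can be sanity-checked on toy data. The sketch below uses the same `train_test_split` arguments; the samples and labels are made up, and `random_state=0` is an addition for reproducibility that the function above does not set.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 2 classes with 5 samples each (made-up values)
samples = [{"cells": [[0.0, 0, 0, 100 + i]]} for i in range(10)]
labels = [0] * 5 + [1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    samples,
    labels,
    train_size=0.6,
    shuffle=True,
    stratify=labels,
    random_state=0,  # assumption: added here so the toy split is reproducible
)

# With stratification, each class keeps the 60/40 ratio
print(Counter(y_train), Counter(y_test))
```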

## Example: Process a Single Configuration

Change the `scenario` variable to match the directory name in `../../data/reduced_list/` or `../../data/traffic_captures/`.

Available scenarios include:
- `configuration00_default` - Default Nym setup
- `configuration01_lqp10` - Low queue parameter 10
- `nym_defence_wtfpad` - Nym with WTF-PAD defense
- `tor_defence_front` - Tor with FRONT defense
- ...
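
To process several scenarios in one run, the input and output paths can be derived from each name. This is a minimal sketch: the helper `scenario_paths` is hypothetical, and the call to `create_pkl` is left commented out because it needs the capture data on disk.

```python
from pathlib import PurePosixPath

DATA_ROOT = PurePosixPath("../../data")


def scenario_paths(scenario, source="reduced_list"):
    """Return (input_dir, output_pickle) for a scenario name."""
    in_dir = DATA_ROOT / source / scenario
    out_file = DATA_ROOT / "train_test_WF" / f"{scenario}.pkl"
    return str(in_dir), str(out_file)


for scenario in ["configuration00_default", "nym_defence_wtfpad"]:
    in_dir, out_file = scenario_paths(scenario)
    print(in_dir, "->", out_file)
    # create_pkl(in_dir, out_file)  # uncomment once create_pkl is defined and the data exists
```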


In [None]:
scenario = 'configuration00_default'
base_folder = f'../../data/reduced_list/{scenario}'
create_pkl(base_folder, f"../../data/train_test_WF/{scenario}.pkl")

(1/58) Reading : soundcloud.com
(2/58) Reading : facebook.com
(3/58) Reading : springer.com
(4/58) Reading : latimes.com
(5/58) Reading : medium.com
(6/58) Reading : github.com
(7/58) Reading : yahoo.com
(8/58) Reading : weibo.com
(9/58) Reading : linkedin.com
(10/58) Reading : unsplash.com
(11/58) Reading : yelp.com
(12/58) Reading : baidu.com
(13/58) Reading : nature.com
(14/58) Reading : techcrunch.com
(15/58) Reading : free.fr
(16/58) Reading : dailymail.co.uk
(17/58) Reading : forbes.com
(18/58) Reading : ebay.com
...

{'twitter.com': 0,
 'opera.com': 1,
 'washingtonpost.com': 2,
 'imgur.com': 3,
 'yelp.com': 4,
 'linkedin.com': 5,
 'unsplash.com': 6,
 'creativecommons.org': 7,
 'gov.uk': 8,
 'berkeley.edu': 9,
 'wordpress.org': 10,
 'mozilla.org': 11,
 'nih.gov': 12,
 'free.fr': 13,
 'wikipedia.org': 14,
 'apple.com': 15,
 'gnu.org': 16,
 'flickr.com': 17,
 'europa.eu': 18,
 'etsy.com': 19,
 'apache.org': 20,
 'w3.org': 21,
 'reddit.com': 22,
 'forbes.com': 23,
 'theguardian.com': 24,
 'latimes.com': 25,
 'office.com': 26,
 'soundcloud.com': 27,
 'yahoo.com': 28,
 'huffingtonpost.com': 29,
 'reuters.com': 30,
 'techcrunch.com': 31,
 'mit.edu': 32,
 'slideshare.net': 33,
 'facebook.com': 34,
 'cnn.com': 35,
 'instagram.com': 36,
 'datatracker.ietf.org': 37,
 'ted.com': 38,
 'bing.com': 39,
 'dailymail.co.uk': 40,
 'nature.com': 41,
 'addtoany.com': 42,
 'springer.com': 43,
 'weibo.com': 44,
 'google.com': 45,
 'vimeo.com': 46,
 'npr.org': 47,
 'baidu.com': 48,
 'theverge.com': 49,
 'github.com': 50,
