---
title: "Sample learning dataset"
execute:
  echo: true
  enabled: false
  output: true
  warning: false
---

In [2]:
#| echo: false
#| output: false
basepath = "/home/u1/"

In [4]:
#| echo: false
#| output: false
import os
os.environ["MODIN_ENGINE"] = "dask"
import modin.pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

In [5]:
from detect_common import *
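
The `enum_csv` helper used in the cells below comes from `detect_common`, which is not shown here. As a rough guide to what the loops expect, here is a minimal stand-in. It assumes the helper recursively yields the path of every CSV file under a prefix. The name `enum_csv_sketch` and the sorted order are assumptions of this sketch, not the behavior of the real helper.

```python
from pathlib import Path
from typing import Iterator


def enum_csv_sketch(prefix: str) -> Iterator[Path]:
    """Hypothetical stand-in for enum_csv: yield every *.csv file under prefix."""
    # rglob descends into subdirectories; sorting keeps the order reproducible
    # between runs (an assumption, the real helper may order files differently).
    yield from sorted(p for p in Path(prefix).rglob("*.csv") if p.is_file())
```

The loops below call `str(f)` and `os.path.basename(f)` on each yielded item, which works for both `Path` objects and plain strings.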

## Real world traffic sampling

### Sampled to 10% of DoH and 100% of HTTPS traffic

DoH and HTTPS traffic are sampled separately so that the resulting dataset can use a different ratio of DoH to HTTPS traffic, for example to get more benign samples.
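
Keeping the two samples separate means the class ratio can be chosen later. The sketch below shows one way to do that with plain pandas on toy data. The function name `mix_samples`, the `is_doh` label column, and the `bytes` column are hypothetical and do not come from the dataset.

```python
import pandas as pd


def mix_samples(doh: pd.DataFrame, https: pd.DataFrame, n_total: int,
                doh_share: float, seed: int = 42) -> pd.DataFrame:
    """Draw n_total rows with the requested DoH share from two separate samples."""
    n_doh = round(n_total * doh_share)
    n_https = n_total - n_doh
    parts = [
        # is_doh is a hypothetical label column added for illustration.
        doh.sample(n=n_doh, random_state=seed).assign(is_doh=1),
        https.sample(n=n_https, random_state=seed).assign(is_doh=0),
    ]
    # Shuffle so the two classes are interleaved, then renumber the rows.
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Toy frames standing in for the real DoH and HTTPS samples.
doh = pd.DataFrame({"bytes": range(1000)})
https = pd.DataFrame({"bytes": range(1000, 1400)})
mixed = mix_samples(doh, https, n_total=500, doh_share=0.3)
print(mixed["is_doh"].value_counts().to_dict())
```

Because `sample` draws without replacement by default, each requested count must not exceed the size of the frame it is drawn from.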

In [12]:
prefix = basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100/"
print("files:", len(list(enum_csv(prefix))))

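# Real-world DoH captures only: the path must contain 'DoH-Real-World' and the file name 'DoH'.
for i, f in enumerate(f for f in enum_csv(prefix) if ('DoH-Real-World' in str(f)) and ('DoH' in os.path.basename(f))):
    # The first file creates the output CSV with a header; later files are appended without one.
    if i == 0:
        kwargs = {"mode": "w", "header": True}
    else:
        kwargs = {"mode": "a", "header": False}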

    df = pd.read_csv(f)
    df_sampled = df.sample(frac=0.1, random_state=42)
    print(os.path.basename(f), "size:", len(df), "; sampled size:", len(df_sampled))
    df_sampled.to_csv(
        basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-01-real-world-doh.csv", 
        **kwargs
    )


files: 81
DoH-01082021-48h.pcapng.trapcap.csv size: 1108565 ; sampled size: 110856
DoH-03082021-48h.pcapng.trapcap.csv size: 1011530 ; sampled size: 101153
DoH-06102021-48h.pcapng.trapcap.csv size: 3588446 ; sampled size: 358845
DoH-08102021-48h.pcapng.trapcap.csv size: 1392097 ; sampled size: 139210
DoH-13072021-48h.pcapng.trapcap.csv size: 1272538 ; sampled size: 127254
DoH-15072021-48h.pcapng.trapcap.csv size: 1079644 ; sampled size: 107964
DoH-17072021-48h.pcapng.trapcap.csv size: 787199 ; sampled size: 78720
DoH-19072021-48h.pcapng.trapcap.csv size: 1349750 ; sampled size: 134975
DoH-27072021-48h.pcapng.trapcap.csv size: 1241409 ; sampled size: 124141
DoH-28062021-24h.pcapng.trapcap.csv size: 1019033 ; sampled size: 101903
DoH-30072021-48h.pcapng.trapcap.csv size: 633728 ; sampled size: 63373

In [13]:
prefix = basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100/"
print("files:", len(list(enum_csv(prefix))))

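# Real-world HTTPS captures only: the path must contain 'DoH-Real-World' and the file name 'HTTPS'.
for i, f in enumerate(f for f in enum_csv(prefix) if ('DoH-Real-World' in str(f)) and ('HTTPS' in os.path.basename(f))):
    # The first file creates the output CSV with a header; later files are appended without one.
    if i == 0:
        kwargs = {"mode": "w", "header": True}
    else:
        kwargs = {"mode": "a", "header": False}

    df = pd.read_csv(f)
    df_sampled = df.sample(frac=1.0, random_state=42)  # keeps every row; frac=1.0 only shuffles the order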
    print(os.path.basename(f), "size:", len(df), "; sampled size:", len(df_sampled))
    df_sampled.to_csv(
        basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-1-real-world-https.csv", 
        **kwargs
    )


files: 81
HTTPS-04102021-01h-1.pcapng.trapcap.csv size: 30575 ; sampled size: 30575
HTTPS-04102021-01h-2.pcapng.trapcap.csv size: 35537 ; sampled size: 35537
HTTPS-04102021-02h.pcapng.trapcap.csv size: 33113 ; sampled size: 33113
HTTPS-20102021-10h.pcapng.trapcap.csv size: 104214 ; sampled size: 104214
HTTPS-20102021-12h.pcapng.trapcap.csv size: 85777 ; sampled size: 85777
HTTPS-21102021-12h.pcapng.trapcap.csv size: 81300 ; sampled size: 81300

 - `unirec-csv-p100/unirec/DoH-Gen-C-CFGHOQS/data/generated/pcap/chrome/ffmuc/1_chrome_ffmuc.pcap.trapcap.csv` is empty, so I will remove it.
 - **Question**: The Real-World dataset contains pcap files; however, the README states that they cannot be distributed because they require anonymization.
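
Instead of deleting the empty file by hand, the sampling loop could skip it. The sketch below wraps the sample-and-append pattern from the cells above in a helper, assuming that "empty" means a zero-byte file. The function name `sample_and_append` is hypothetical, and `index=False` is a choice made for this sketch; the cells above keep the default and also write the index column.

```python
import os

import pandas as pd


def sample_and_append(files, out_path, frac, seed=42):
    """Sample each CSV and append it to out_path, writing the header only once."""
    first = True
    total = 0
    for f in files:
        if os.path.getsize(f) == 0:
            # A zero-byte file has no header row, so pd.read_csv would raise
            # pandas.errors.EmptyDataError; skip it instead.
            print("skipping empty file:", os.path.basename(f))
            continue
        df = pd.read_csv(f)
        sampled = df.sample(frac=frac, random_state=seed)
        # The first non-empty file creates the output; later ones are appended.
        sampled.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False
        total += len(sampled)
    return total
```

The function returns the total number of rows written, which can be compared with the row count obtained when the combined CSV is read back.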

In [15]:
df = pd.read_csv(basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-01-real-world-doh.csv")
df.to_feather(basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-01-real-world-doh.ft")
len(df)

1448394

In [16]:
df = pd.read_csv(basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-1-real-world-https.csv")
df.to_feather(basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-1-real-world-https.ft")
len(df)

370516
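
As a quick consistency check, the per-file sampled sizes printed by the two loops above should add up to the row counts of the combined files. The numbers below are copied from those outputs; the variable names are only illustrative.

```python
# Per-file "sampled size" values copied from the loop outputs above.
doh_sampled = [110856, 101153, 358845, 139210, 127254, 107964,
               78720, 134975, 124141, 101903, 63373]
https_sampled = [30575, 35537, 33113, 104214, 85777, 81300]

doh_total = sum(doh_sampled)
https_total = sum(https_sampled)

# Both totals match the len(df) values reported after the Feather conversion.
assert doh_total == 1448394
assert https_total == 370516

# Share of DoH rows if the two samples were simply concatenated.
print(f"DoH share: {doh_total / (doh_total + https_total):.3f}")
```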

### Sample 50k DoH and 50k HTTPS

In [17]:
# df = pd.read_feather("/jupyter/warehouse/Jerabek2022Collection-unirec/unirec-csv-p100-sample-001-real-world-doh.ft")
# df.sample(n=50000, random_state=42).reset_index(drop=True).to_feather(
#     "/jupyter/warehouse/Jerabek2022Collection-unirec/unirec-csv-p100-sample-50k-real-world-doh.ft"
# )

# df = pd.read_feather("/jupyter/warehouse/Jerabek2022Collection-unirec/unirec-csv-p100-sample-001-real-world-https.ft")
# df.sample(n=50000, random_state=42).reset_index(drop=True).to_feather(
#     "/jupyter/warehouse/Jerabek2022Collection-unirec/unirec-csv-p100-sample-50k-real-world-https.ft"
# )
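
The commented-out cell draws a fixed number of rows rather than a fraction. The sketch below shows the same `sample(n=...)` plus `reset_index(drop=True)` pattern on synthetic data with plain pandas. The column names are made up for illustration, and the Feather write is left out so the example has no extra dependencies.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a sampled flow table; the columns are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "duration": rng.random(200_000),
    "packets": rng.integers(1, 100, 200_000),
})

# Draw exactly 50k rows and renumber them 0..49999.
subset = df.sample(n=50_000, random_state=42).reset_index(drop=True)
assert len(subset) == 50_000
assert subset.index.equals(pd.RangeIndex(50_000))

# The same random_state selects the same rows again.
again = df.sample(n=50_000, random_state=42).reset_index(drop=True)
assert subset.equals(again)
```

Sampling is without replacement by default, so `n` must not exceed the number of rows in the input frame.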

## Generated traffic sampling

### Sampled to 10%

In [18]:
prefix = basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100/"
print("files:", len(list(enum_csv(prefix))))

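# Generated traffic only: every CSV whose path does not contain 'DoH-Real-World'.
for i, f in enumerate(f for f in enum_csv(prefix) if 'DoH-Real-World' not in str(f)):
    # The first file creates the output CSV with a header; later files are appended without one.
    if i == 0:
        kwargs = {"mode": "w", "header": True}
    else:
        kwargs = {"mode": "a", "header": False}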

    df = pd.read_csv(f)
    df_sampled = df.sample(frac=0.1, random_state=42)
    print(os.path.basename(f), "size:", len(df), "; sampled size:", len(df_sampled))
    df_sampled.to_csv(
        basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-01-generated.csv", 
        **kwargs
    )


files: 81
0_chrome_adguard.pcap.trapcap.csv size: 23652 ; sampled size: 2365
1_chrome_adguard.pcap.trapcap.csv size: 23001 ; sampled size: 2300
0_chrome_ahadns.pcap.trapcap.csv size: 23618 ; sampled size: 2362
1_chrome_ahadns.pcap.trapcap.csv size: 23063 ; sampled size: 2306
0_chrome_blahdns.pcap.trapcap.csv size: 18648 ; sampled size: 1865
1_chrome_blahdns.pcap.trapcap.csv size: 19073 ; sampled size: 1907
0_chrome_bravedns.pcap.trapcap.csv size: 35271 ; sampled size: 3527
1_chrome_bravedns.pcap.trapcap.csv size: 35753 ; sampled size: 3575
0_chrome_comcast.pcap.trapcap.csv size: 37097 ; sampled size: 3710
1_chrome_comcast.pcap.trapcap.csv size: 38072 ; sampled size: 3807
0_chrome_cznic.pcap.trapcap.csv size: 34198 ; sampled size: 3420
1_chrome_cznic.pcap.trapcap.csv size: 33623 ; sampled size: 3362
0_chrome_cloudflare.pcap.trapcap.csv size: 30804 ; sampled size: 3080
1_chrome_cloudflare.pcap.trapcap.csv size: 34577 ; sampled size: 3458
0_chrome_ffmuc.pcap.trapcap.csv size: 11669 ; samp

In [6]:
df = pd.read_csv(basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-01-generated.csv")
df.to_feather(basepath + "datasets/Jerabek2022Collection-unirec/unirec-csv-p100-sample-01-generated.ft")
len(df)




513386

### Sampled to 50k records

In [21]:
# df = pd.read_feather("/jupyter/warehouse/Jerabek2022Collection-unirec/unirec-csv-p100-sample-005-generated.ft")
# df.sample(n=50000, random_state=42).reset_index(drop=True).to_feather(
#     "/jupyter/warehouse/Jerabek2022Collection-unirec/unirec-csv-p100-sample-50k-generated.ft"
# )