# CAPEv2 Dataset Exploration

This notebook helps us understand the structure and contents of the AVAST-CTU CAPEv2 dataset.

**Dataset Overview:**
- 48,976 labeled malicious samples in `public_labels.csv` (the small-reports directory explored below holds 13,929 JSON reports)
- 6 malware types: banker, trojan, pws, coinminer, rat, keylogger
- 10 malware families: Adload, Emotet, HarHar, Lokibot, njRAT, Qakbot, Swisyn, Trickbot, Ursnif, Zeus

**Files:**
- `public_labels.csv`: Labels with sha256, malware family, type, and date
- `public_small_reports/public_small_reports/`: CAPEv2 JSON reports

In [1]:
import os
import json
import pandas as pd
from pprint import pprint

## Step 1: Load and Explore the Labels CSV

In [2]:
# Load the labels CSV
labels_path = '../data/public_labels.csv'
labels_df = pd.read_csv(labels_path)

print(f"Total samples: {len(labels_df)}")
print(f"\nColumn names: {labels_df.columns.tolist()}")
print("\nFirst 5 rows:")
labels_df.head()

Total samples: 48976

Column names: ['sha256', 'classification_family', 'classification_type', 'date']

First 5 rows:


|   | sha256 | classification_family | classification_type | date |
|---|---|---|---|---|
| 0 | 00003d128a7eb859f65f5780d8fa2b5e52d472665678bf... | Qakbot | banker | 2019-03-20 |
| 1 | 0000698621e473a0828c1c957873e7d086f2d90eebc6a3... | Swisyn | trojan | 2019-11-16 |
| 2 | 0000a65749f5902c4d82ffa701198038f0b4870b00a27c... | Qakbot | banker | 2018-06-25 |
| 3 | 0001770a2a6e888a123775fc24d8a4ffccc5c8191c1c2a... | Emotet | banker | 2019-10-06 |
| 4 | 000376a4b234f57ae0f1fb959817486040d7d8d8be1fbc... | Emotet | banker | 2019-09-30 |


## Step 2: Analyze Label Distribution

Let's see the distribution of malware families and types.

In [3]:
# Distribution of malware families
print("Malware Family Distribution:")
print(labels_df['classification_family'].value_counts())

print("\n" + "="*50)

# Distribution of malware types
print("\nMalware Type Distribution:")
print(labels_df['classification_type'].value_counts())

Malware Family Distribution:
classification_family
Emotet      14429
Swisyn      12591
Qakbot       4895
Trickbot     4202
Lokibot      4191
njRAT        3372
Zeus         2594
Ursnif       1343
Adload        704
HarHar        655
Name: count, dtype: int64


Malware Type Distribution:
classification_type
banker       27463
trojan       13295
pws           4191
rat           3343
coinminer      655
keylogger       29
Name: count, dtype: int64
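
The two counts above are consistent with most families mapping to a single type (for example, Lokibot and `pws` both total 4191), while njRAT's 3372 samples match `rat` plus `keylogger` (3343 + 29). A cross-tabulation would confirm this directly. The sketch below runs on a toy frame, not the real `labels_df`; `toy_labels` and `family_type_table` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for labels_df; the real frame is loaded from public_labels.csv.
toy_labels = pd.DataFrame({
    "classification_family": ["Emotet", "Emotet", "njRAT", "njRAT", "Lokibot"],
    "classification_type": ["banker", "banker", "rat", "keylogger", "pws"],
})


def family_type_table(df: pd.DataFrame) -> pd.DataFrame:
    """Count samples for every (family, type) pair."""
    return pd.crosstab(df["classification_family"], df["classification_type"])


table = family_type_table(toy_labels)
print(table)

# Families that carry more than one type label.
types_per_family = table.gt(0).sum(axis=1)
multi_type_families = types_per_family[types_per_family > 1].index.tolist()
print(multi_type_families)
```

On the real data, calling `family_type_table(labels_df)` should show whether any family other than njRAT spans several types.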


## Step 3: List and Load JSON Reports

The CAPEv2 JSON reports are in `public_small_reports/public_small_reports/`.

In [4]:
# List JSON report files
reports_dir = '../data/public_small_reports/public_small_reports'
# os.listdir returns entries in arbitrary order; sort so json_files[0] is reproducible
json_files = sorted(f for f in os.listdir(reports_dir) if f.endswith('.json'))

print(f"Found {len(json_files)} JSON report files.")
print("\nSample filenames (these are sha256 hashes):")
for f in json_files[:5]:
    print(f"  - {f}")

Found 13929 JSON report files.

Sample filenames (these are sha256 hashes):
  - 000376a4b234f57ae0f1fb959817486040d7d8d8be1fbcb627e0102147192fc6.json
  - 0006655a8a16a0334a991e2bc9c7ed3eb772d2f36546bb00760314f141000d6b.json
  - 00073c9f307ded712caf4959b6a2ca8b812695dc20763dc34d434a5a04224d3f.json
  - 0008f01033ba93d60c8f0ee288f53f26deccb6e402e065c276c9fa0c0030cbec.json
  - 000a891f4bd2cb143cbf8f4764f510ecef3e0632c31fa83a9593bf9bf61f2e9b.json


## Step 4: Load and Explore a Sample JSON Report

In [5]:
# Load a sample JSON report
sample_file = os.path.join(reports_dir, json_files[0])
with open(sample_file, 'r', encoding='utf-8') as f:
    sample_report = json.load(f)

# Show top-level keys
print("Top-level keys in the CAPEv2 report:")
pprint(list(sample_report.keys()))

Top-level keys in the CAPEv2 report:
['behavior', 'static']
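
Top-level keys alone say little about what sits underneath. A small recursive helper can print a depth-limited outline of a report. This is a minimal sketch: `outline` is a hypothetical helper, and the toy report only mimics the two keys seen above; the nested names in it are made up for illustration.

```python
def outline(node, max_depth=2, depth=0):
    """Return lines describing the keys and value types of nested JSON data."""
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            kind = type(value).__name__
            # Show the number of entries for containers only.
            size = f"[{len(value)}]" if isinstance(value, (dict, list)) else ""
            lines.append(f"{'  ' * depth}{key}: {kind}{size}")
            if depth + 1 < max_depth:
                lines.extend(outline(value, max_depth, depth + 1))
    elif isinstance(node, list) and node:
        # Describe only the first element, assuming list items share a shape.
        lines.extend(outline(node[0], max_depth, depth))
    return lines


# Toy report; nested key names are illustrative, not taken from the dataset.
toy_report = {
    "behavior": {"summary": {"files": ["a.exe", "b.dll"], "mutexes": []}},
    "static": {"pe": {}},
}
outline_lines = outline(toy_report, max_depth=2)
print("\n".join(outline_lines))
```

In the notebook, `print("\n".join(outline(sample_report, max_depth=3)))` would give a quick map of the real report before drilling into individual keys.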


## Step 5: Explore Behavioral Data (Processes and API Calls)

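CAPEv2 reports store behavioral data under the `behavior` key. In a full report this includes the spawned processes and their API calls, which are central to signature generation. The cell below checks which of these fields the small report actually contains.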

In [6]:
# Explore behavioral data
behavior = sample_report.get('behavior', {})
print("Keys in 'behavior':")
pprint(list(behavior.keys()))

# Explore processes
processes = behavior.get('processes', [])
print(f"\nNumber of processes: {len(processes)}")

if processes:
    print("\nKeys in first process:")
    pprint(list(processes[0].keys()))

Keys in 'behavior':
['summary']

Number of processes: 0
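
Since `behavior` holds only `summary` in this sample, that sub-dictionary is the natural place to look next. The sketch below counts the entries under each list-valued key of `behavior.summary`. The helper name `summary_counts` and the key names in the toy report (`files`, `mutexes`, `executed_commands`) are assumptions for illustration; the real keys should be read from `sample_report` itself.

```python
def summary_counts(report):
    """Map each list-valued entry of behavior.summary to its length."""
    # `or {}` guards against keys that are missing or explicitly null.
    behavior = report.get("behavior") or {}
    summary = behavior.get("summary") or {}
    return {
        key: len(value)
        for key, value in summary.items()
        if isinstance(value, list)
    }


# Toy report; the summary key names are hypothetical.
toy_report = {
    "behavior": {
        "summary": {
            "files": ["C:\\a.exe", "C:\\b.dll"],
            "mutexes": ["Global\\m1"],
            "executed_commands": [],
        }
    },
    "static": {},
}
counts = summary_counts(toy_report)
print(counts)
```

Calling `summary_counts(sample_report)` in the notebook would show which summary categories are populated for the real sample.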


In [7]:
# Explore API calls from the first process
if processes and 'calls' in processes[0]:
    calls = processes[0]['calls']
    print(f"Number of API calls in first process: {len(calls)}")
    
    if calls:
        print("\nSample API call structure:")
        pprint(calls[0])
        
        print("\nFirst 10 API names:")
        for call in calls[:10]:
            print(f"  - {call.get('api', 'N/A')}")

## Step 6: Explore Network Activity

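Network activity is another important feature for malware signatures. The sample report has no top-level `network` key, so the lookup below falls back to an empty dictionary.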

In [8]:
# Explore network activity
network = sample_report.get('network', {})
print("Keys in 'network':")
pprint(list(network.keys()))

# Check for DNS, HTTP, etc.
if 'dns' in network:
    print(f"\nDNS queries: {len(network['dns'])}")
    if network['dns']:
        print("Sample DNS query:")
        pprint(network['dns'][0])

if 'http' in network:
    print(f"\nHTTP requests: {len(network['http'])}")
    if network['http']:
        print("Sample HTTP request:")
        pprint(network['http'][0])

Keys in 'network':
[]
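
One report cannot tell us whether `network` and `behavior.processes` are missing everywhere or only in this file. A key census over many reports would settle that. The following is a sketch under assumptions: `iter_reports` and `key_census` are hypothetical helpers, and the toy reports (including the second one's `network` and `processes` keys) are invented to exercise the code.

```python
import json
import os
from collections import Counter


def iter_reports(reports_dir, limit=None):
    """Yield parsed JSON reports from a directory, in sorted filename order."""
    names = sorted(n for n in os.listdir(reports_dir) if n.endswith(".json"))
    for name in names[:limit]:
        with open(os.path.join(reports_dir, name), encoding="utf-8") as fh:
            yield json.load(fh)


def key_census(reports):
    """Count how often each top-level and behavior-level key occurs."""
    top_keys, behavior_keys, total = Counter(), Counter(), 0
    for report in reports:
        total += 1
        top_keys.update(report.keys())
        behavior_keys.update((report.get("behavior") or {}).keys())
    return total, top_keys, behavior_keys


# Toy reports; the second one's extra keys are hypothetical.
toy_reports = [
    {"behavior": {"summary": {}}, "static": {}},
    {"behavior": {"summary": {}, "processes": []}, "static": {}, "network": {}},
]
total, top_keys, behavior_keys = key_census(toy_reports)
print(total, dict(top_keys), dict(behavior_keys))
```

On the real data, something like `key_census(iter_reports(reports_dir, limit=500))` would scan a sample of the directory without loading every file.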


## Step 7: Link Labels to Reports

Match sha256 hashes from the CSV to the JSON filenames to get labeled samples.

In [9]:
# Get sha256 from filename (strip only the trailing .json extension)
sample_sha256 = os.path.splitext(json_files[0])[0]
print(f"Sample SHA256: {sample_sha256}")

# Find the corresponding label
sample_label = labels_df[labels_df['sha256'] == sample_sha256]
if not sample_label.empty:
    print("\nLabel information for this sample:")
    print(sample_label)
else:
    print("\nNo label found for this sample in the CSV.")

Sample SHA256: 000376a4b234f57ae0f1fb959817486040d7d8d8be1fbcb627e0102147192fc6

Label information for this sample:
                                              sha256 classification_family  \
4  000376a4b234f57ae0f1fb959817486040d7d8d8be1fbc...                Emotet   

  classification_type        date  
4              banker  2019-09-30  
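
The same lookup can be done for all reports at once with a join on `sha256`. The sketch below is one possible approach, not necessarily the one used later in the project; `labeled_reports` is a hypothetical helper and the short hashes in the toy data are placeholders.

```python
import os

import pandas as pd


def labeled_reports(labels_df, filenames, reports_dir):
    """Inner-join labels with the report files that exist on disk."""
    available = pd.DataFrame({
        "sha256": [os.path.splitext(name)[0] for name in filenames],
        "report_path": [os.path.join(reports_dir, name) for name in filenames],
    })
    # Keep only samples that have both a label and a report.
    return labels_df.merge(available, on="sha256", how="inner")


# Toy data with placeholder hashes.
toy_labels = pd.DataFrame({
    "sha256": ["aaa", "bbb", "ccc"],
    "classification_family": ["Emotet", "Qakbot", "Swisyn"],
})
toy_files = ["aaa.json", "ccc.json", "zzz.json"]

linked = labeled_reports(toy_labels, toy_files, "reports")
print(linked[["sha256", "classification_family"]])
```

With the notebook's variables, `labeled_reports(labels_df, json_files, reports_dir)` would return one row per labeled report, and `value_counts()` on its `classification_family` column would show how the 13,929 reports are spread across families.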


## Summary: What Did We Learn?

**Data Structure:**
- `public_labels.csv` contains metadata: sha256, malware family, type, and detection date
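- JSON reports come from CAPEv2 sandbox execution; the sample inspected here has only the top-level keys `behavior` and `static`, and its `behavior` entry holds only `summary`

**Key Fields for Signature Generation:**
1. **behavior.summary**: The only behavioral data present in the sample report
2. **behavior.processes**: List of processes spawned during execution (absent from the sample report)
3. **behavior.processes[].calls**: API calls made by each process (absent from the sample report)
4. **network**: DNS queries, HTTP requests, connections (absent from the sample report)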

**Next Steps:**
1. Parse all JSON reports and extract features
2. Use LLMs to identify patterns and generate CAPEv2 signatures
3. Map behaviors to MITRE ATT&CK techniques