# EE 467 Lab 7: Instruction Set Architecture (ISA) Identification of Program Binaries

Welcome to Lab 7 of EE 467! Today we apply the machine learning techniques you’ve learned throughout the course to a practical cybersecurity task: Instruction Set Architecture (ISA) detection. We will use a dataset of **50k Base64-encoded binaries** downloaded from Praetorian’s “Machine Learning Binaries” challenge web page [1]. Each encoded binary string in this dataset is 88 characters (66 bytes) long on average and belongs to one of the following **twelve architecture types: avr, alphaev56, arm, m68k, mips, mipsel, powerpc, s390, sh4, sparc, x86_64, and xtensa**.

We will use three different feature extraction methods, **byte-histogram+endianness features, byte-level TF-IDF features, and hex-level TF-IDF features**, to extract features from the binaries. Using these features, we will **train and test SVM, Logistic Regression, Decision Tree, and Random Forest** classification algorithms. As in the previous labs, all algorithms are evaluated by **accuracy, precision, recall, and F1-score**.

## Preparation

As in previous labs, we start by installing all dependencies needed for this lab:

In [None]:
%pip install numpy scipy scikit-learn termcolor

## Data Pre-processing

We load the raw binary data and their labels from separate files. During loading we decode each Base64 string into bytes, preview a handful of samples in three representations (Base64, hex, and byte-level), deduplicate entries, and compute summary statistics over the full dataset:


In [None]:
!tar -xJf binaries-dataset.tar.xz  # Extract the xz-compressed tar archive containing the dataset

In [None]:
import base64
from collections import Counter

from termcolor import colored

# Path of binaries and labels files
BINARIES_PATH = "./binaries-dataset/Base64EcncodedBinaries-50k.txt"
LABELS_PATH = "./binaries-dataset/LabelsOfBinaries-50k.txt"
# Number of samples to display
N_DISPLAY_SAMPLES = 5

# Binaries and labels
binaries = []
binaries_set = set()
raw_labels = []
# Duplicated samples and displayed samples counts
dup_count = 0
display_count = 0

# Decode each Base64-encoded binary into raw bytes, display a few samples in multiple
# representations, and deduplicate entries before storing them.
# Note: Each byte (8 bits, 256 possible values) is represented by 2 hex characters.
with open(BINARIES_PATH) as binaries_file, open(LABELS_PATH) as binaries_label_file:
    for line_tmp, label_tmp in zip(binaries_file, binaries_label_file):
        # Remove EOL characters
        line_eol_removed = line_tmp.rstrip()
        raw_label = label_tmp.rstrip()

        # [ TODO ]
        # 1. Decode Base64-encoded binaries into byte strings
        binary_sample = NotImplemented

        # Display the first N_DISPLAY_SAMPLES binaries in three representations
        if display_count < N_DISPLAY_SAMPLES:
            # [ TODO ]
            # 2. Encode byte strings as hex strings
            hex_encoded = NotImplemented
            # 3. Rewrite encoded hex strings in byte-level granularities
            #    (i.e. Separate byte data by spaces)
            byte_level = NotImplemented

            print(colored("Base64-encoded string:", attrs=["bold"]))
            print(colored(line_eol_removed, "red"))

            print(colored("Hex-encoded string:", attrs=["bold"]))
            print(colored(hex_encoded, "blue"))

            print(colored("Byte level decomposition:", attrs=["bold"]))
            print(colored(byte_level, "green"))
            print()

            display_count += 1

        # Skip duplicates; otherwise add sample to the dataset
        if binary_sample not in binaries_set:
            binaries.append(binary_sample)
            binaries_set.add(binary_sample)
            raw_labels.append(raw_label)
        else:
            dup_count += 1

# Count labels
labels_info = Counter(raw_labels)
min_sample_size = min(labels_info.values())

# Compute and print statistics
print(colored("[[ Dataset Statistics ]]", attrs=["bold"]))

print("* Distinct labels:", labels_info.keys())
print("* # of samples for all classes:", labels_info)
print("* Min # of samples in class:", min_sample_size)
print("* # of samples in total:", sum(labels_info.values()))
print("* # of duplicates:", dup_count)
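
The three representations printed by the loop above can be illustrated on a toy input. The following is only a minimal sketch: the 8-byte sample and all `toy_*` names are made up for illustration and are not part of the dataset or the lab template.

```python
import base64

# Hypothetical 8-byte sample, Base64-encoded the same way as the dataset lines
toy_b64 = base64.b64encode(bytes.fromhex("7f454c4602010100")).decode("ascii")

toy_bytes = base64.b64decode(toy_b64)   # Base64 text -> raw bytes
toy_hex = toy_bytes.hex()               # raw bytes -> hex string (2 characters per byte)
toy_byte_level = toy_bytes.hex(" ")     # same, one space between bytes (Python 3.8+)

print(toy_b64)
print(toy_hex)
print(toy_byte_level)

# Base64 packs 3 bytes into 4 characters, which is why 88 characters
# correspond to 66 bytes.
print(len(base64.b64decode("A" * 88)))
```

Note that `bytes.hex()` only accepts a separator argument from Python 3.8 onwards; on older versions the hex string can be split into two-character chunks manually.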

The statistics indicate that our dataset is largely balanced — all twelve ISA classes have similar sample counts, so no undersampling or oversampling is needed. We only need to map the string labels to integer indices for use with scikit-learn:


In [None]:
import numpy as np

# Map each unique string label to a consecutive integer (0, 1, 2, ...)
raw_labels_to_labels = {label: i for i, label in enumerate(labels_info.keys())}

# Build the integer label array used by all scikit-learn models below
labels = np.array([raw_labels_to_labels[raw_label] for raw_label in raw_labels])
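
Since the confusion matrices and classification reports later in the lab are indexed by these integers, it can help to keep the inverse mapping around. Below is a small sketch on made-up labels; the `toy_*` names are hypothetical and not part of the lab template.

```python
import numpy as np

toy_raw_labels = ["arm", "mips", "arm", "sparc", "mips"]  # hypothetical labels

# dict.fromkeys keeps first-seen order, similar to iterating over Counter keys
toy_label_to_id = {name: i for i, name in enumerate(dict.fromkeys(toy_raw_labels))}
# Inverse mapping: integer index -> architecture name
toy_id_to_label = {i: name for name, i in toy_label_to_id.items()}

toy_labels = np.array([toy_label_to_id[name] for name in toy_raw_labels])
toy_names = [toy_id_to_label[i] for i in toy_labels.tolist()]

print(toy_labels)
print(toy_names)
```

A list of names ordered by index can also be passed to `classification_report` through its `target_names` parameter to make the reports easier to read.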

## Feature Extraction

In this lab we are going to try three different feature extraction techniques:

* Byte-Histogram and Endianness Features
* Byte-level 1,2,3-Gram TF-IDF Features
* Hex-level (4-bit) 1,2,3-Gram TF-IDF Features

### Byte-Histogram and Endianness Features

This feature extraction method was originally proposed in [2]. It builds two histograms from raw binary data:

1. **Byte-value histogram**: Scans the binary one byte at a time, producing a 256-entry histogram (one bin per possible byte value, 0–255).
2. **Endianness histogram**: Scans for four specific two-byte (word) patterns — `00 01`, `01 00`, `ff fe`, and `fe ff` — whose frequency hints at the endianness of the binary data.

Both histograms are concatenated and then **normalized** by the total number of bytes in the binary, yielding a 260-dimensional feature vector that is comparable across binaries of different lengths.
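
To make the layout of the 260-dimensional vector concrete, here is a minimal sketch on a hand-made 6-byte input. It is an illustrative variant that uses `np.bincount` for the byte histogram, not the lab's reference implementation, and the names `toy_hist_feature`, `toy_blob`, and `ENDIAN_PATTERNS` are assumptions introduced only for this example.

```python
import numpy as np

ENDIAN_PATTERNS = (b"\x00\x01", b"\x01\x00", b"\xff\xfe", b"\xfe\xff")

def toy_hist_feature(blob: bytes) -> np.ndarray:
    # 256-bin histogram of byte values
    byte_counts = np.bincount(np.frombuffer(blob, dtype=np.uint8), minlength=256)
    # Count the four endianness patterns over all overlapping two-byte windows
    pair_counts = np.zeros(len(ENDIAN_PATTERNS), dtype=int)
    for i in range(len(blob) - 1):
        pair = blob[i:i + 2]
        if pair in ENDIAN_PATTERNS:
            pair_counts[ENDIAN_PATTERNS.index(pair)] += 1
    # Normalize both parts by the number of bytes
    return np.concatenate((byte_counts, pair_counts)) / len(blob)

toy_blob = bytes.fromhex("00 01 00 01 ff fe")
toy_vec = toy_hist_feature(toy_blob)

print(toy_vec.shape)      # byte histogram (256) + endianness histogram (4)
print(toy_vec[[0, 1]])    # normalized counts of bytes 0x00 and 0x01
print(toy_vec[256:])      # normalized counts of 00 01, 01 00, ff fe, fe ff
```

In this toy input the pattern `00 01` occurs twice, `01 00` and `ff fe` once each, and `fe ff` never, so the last four entries are 2/6, 1/6, 1/6, and 0.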

In [None]:
from collections.abc import Iterator

# The four two-byte patterns whose presence signals endianness
ENDIANNESS_WORDS = (b"\x00\x01", b"\x01\x00", b"\xff\xfe", b"\xfe\xff")

def pairwise(bstr: bytes) -> Iterator[bytes]:
    """
    Yield every consecutive overlapping byte pair in `bstr`.

    Example: b"uvwxyz" -> b"uv", b"vw", b"wx", b"xy", b"yz"
    """
    # Slice instead of iterating: iterating over `bytes` yields ints, and adding
    # two ints would give their sum rather than a two-byte string.
    for i in range(len(bstr) - 1):
        yield bstr[i:i + 2]

def make_byte_hist_endian_feature(binary: bytes) -> np.ndarray:
    """
    Return a 260-dim normalized feature vector for a single binary.

    The vector is the concatenation of:
      - a 256-entry byte-value histogram
      - a 4-entry endianness-word histogram
    divided element-wise by the binary's byte length.
    """
    # Count occurrences of each byte value (0–255)
    byte_hist = np.zeros(256, dtype=int)
    for byte_data in binary:
        byte_hist[byte_data] += 1

    # Count occurrences of the four endianness-indicating word patterns
    word_hist = np.zeros(len(ENDIANNESS_WORDS), dtype=int)
    for word_data in pairwise(binary):
        try:
            word_idx = ENDIANNESS_WORDS.index(word_data)
            word_hist[word_idx] += 1
        except ValueError:
            # Not an endianness marker; skip
            pass

    # Concatenate and normalize by binary length so features are scale-invariant
    concat_hist = np.concatenate((byte_hist, word_hist))
    return concat_hist / len(binary)

feats_byte_hist_endian = np.stack([make_byte_hist_endian_feature(binary) for binary in binaries])

### Byte-level 1,2,3-Gram TF-IDF Features

This method generalizes the byte-histogram approach using TF-IDF weighting over byte n-grams (unigrams, bigrams, and trigrams). A few implementation notes:

* We reuse scikit-learn's `TfidfVectorizer` with the character (`"char"`) analyzer so it treats each byte as a single character. Setting the encoding to `latin1` ensures every possible byte value (0–255) maps to a valid character without errors.
* For trigrams, we cap the vocabulary with `max_features` to keep feature dimensionality (and training time) manageable.
* Unigram+bigram and trigram matrices are merged in the feature dimension using `scipy.sparse.hstack`, matching the approach from homework 1.
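
The notes above can be sketched on a tiny made-up corpus. This is an illustration under stated assumptions rather than the expected solution: the three byte strings, the `toy_*` names, and the small `max_features=5` cap are hypothetical. The `lowercase=False` argument is also our own addition, because the vectorizer lowercases text by default, which would merge byte values that decode to upper- and lower-case forms of the same letter; the lab instructions do not specify it.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw byte strings standing in for decoded binaries
toy_binaries = [
    bytes.fromhex("7f454c4602010100"),
    bytes.fromhex("0000b0e112ff2fe1"),
    bytes.fromhex("4e754e717f45"),
]

# Unigrams + bigrams over bytes; latin1 maps each byte to exactly one character
toy_vec_12 = TfidfVectorizer(analyzer="char", ngram_range=(1, 2),
                             encoding="latin1", lowercase=False)
toy_feats_12 = toy_vec_12.fit_transform(toy_binaries)

# Trigrams only, keeping at most 5 vocabulary entries
toy_vec_3 = TfidfVectorizer(analyzer="char", ngram_range=(3, 3),
                            encoding="latin1", lowercase=False, max_features=5)
toy_feats_3 = toy_vec_3.fit_transform(toy_binaries)

# Column-wise concatenation of the two sparse matrices
toy_feats_123 = hstack([toy_feats_12, toy_feats_3]).tocsr()

print(toy_feats_12.shape, toy_feats_3.shape, toy_feats_123.shape)
```

The row count stays equal to the number of samples, while the column count of the merged matrix is the sum of the two vocabularies.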


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Limit trigram vocabulary to keep memory and training time manageable
TF_IDF_3_MAX_FEATS = 10000

# [ TODO ]
# Extract Byte-level (1,2,3)-gram TF-IDF features using TfidfVectorizer
# 1. Fit character (1,2)-gram TF-IDF model using `binaries` as training data, with `encoding` set to
#    "latin1". Save transformed features as `feats_tf_idf_12`.
feats_tf_idf_12 = NotImplemented

# 2. Fit character trigram TF-IDF model using `binaries` as training data, with `encoding` set to
#    "latin1" and number of features limited by `TF_IDF_3_MAX_FEATS`. Save transformed features as
#    `feats_tf_idf_3`.
feats_tf_idf_3 = NotImplemented

# 3. Concatenate `feats_tf_idf_12` and `feats_tf_idf_3` into `feats_tf_idf_123` using `scipy.sparse.hstack`.
feats_tf_idf_123 = NotImplemented

### Hex-level (4-bit) 1,2,3-Gram TF-IDF Features

This method applies the same TF-IDF n-gram approach but operates on the **hex-encoded** representation of each binary. Because each hex character encodes four bits (half a byte), the resulting vocabulary captures finer-grained patterns than the byte-level approach, at the cost of doubling the sequence length. No `latin1` encoding override is needed here since all hex characters are standard ASCII.
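
As a rough illustration of the difference from the byte-level variant, the sketch below hex-encodes a few made-up byte strings and fits a single (1,2,3)-gram vectorizer on them. The sample bytes and `toy_*` names are hypothetical and not part of the lab template.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw byte strings standing in for decoded binaries
toy_binaries = [
    bytes.fromhex("7f454c4602010100"),
    bytes.fromhex("0000b0e112ff2fe1"),
    bytes.fromhex("4e754e717f45"),
]

# Each byte becomes two lowercase hex characters, doubling the sequence length
toy_binaries_hex = [blob.hex() for blob in toy_binaries]

# Plain ASCII strings, so no encoding override is required
toy_hex_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
toy_feats_hex = toy_hex_vec.fit_transform(toy_binaries_hex)

print(toy_binaries_hex[0], len(toy_binaries[0]), len(toy_binaries_hex[0]))
print(toy_feats_hex.shape)
```

Because the alphabet has only 16 symbols, the vocabulary is bounded by 16 + 16² + 16³ = 4,368 n-grams, so no `max_features` cap is needed in this sketch.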


In [None]:
# [ TODO ]
# Extract Hex-level (1,2,3)-gram TF-IDF features using TfidfVectorizer
# 1. Hex-encode all binary data in `binaries` and save them as `binaries_hex`
binaries_hex = NotImplemented
# 2. Fit character (1,2,3)-gram TF-IDF model using `binaries_hex` as training data. Save transformed features
#    as `feats_tf_idf_hex_123`.
feats_tf_idf_hex_123 = NotImplemented

## Training and Testing Different ML Models

Now that we have multiple feature representations, we evaluate each one across four classifiers: **Linear SVM**, **Logistic Regression**, **Decision Tree**, and **Random Forest**. To keep runtimes reasonable with a 50k-sample dataset, we use a stratified 20%/5% train/test split. We report accuracy, precision, recall, and F1-score, as well as the confusion matrix, for both the training and test sets.
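
When both `train_size` and `test_size` are given and sum to less than 1, `train_test_split` leaves the remaining samples unused. The following sketch shows this on synthetic data: 1,000 sample IDs across 4 perfectly balanced classes, all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 sample IDs, 4 classes with 250 samples each
toy_ids = np.arange(1000).reshape(-1, 1)
toy_y = np.repeat(np.arange(4), 250)

ids_train, ids_test, y_train, y_test = train_test_split(
    toy_ids, toy_y, train_size=0.2, test_size=0.05, stratify=toy_y, random_state=42
)

print(len(ids_train), len(ids_test))   # 20% and 5% of the data; the rest is unused
print(np.bincount(y_train))            # per-class counts in the training subset
print(np.bincount(y_test))             # per-class counts in the test subset
```

Stratification keeps the class proportions in each subset close to those of the full label array, and the two subsets never share a sample.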

We start by implementing a shared `train_test_ml_models` function that encapsulates the complete train-evaluate loop so we can reuse it for each feature type:


In [None]:
import time
from contextlib import contextmanager

from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Default random seed for reproducibility
RNG_SEED = 42

@contextmanager
def timeit(action: str):
    """Context manager that prints elapsed wall-clock time for a code block."""
    start_time = time.time()
    print(f"Timing started for {action} ...")

    yield
    elapsed_time = time.time() - start_time
    print(f"Timing ended for {action}. Elapsed time: {elapsed_time:.2f}s")

def train_test_ml_models(feats: np.ndarray, labels: np.ndarray, train_size: float = 0.2, test_size: float = 0.05,
    rng_seed_data: int = RNG_SEED, rng_seed: int = RNG_SEED):
    """
    Train and evaluate four classifiers on the given features and labels.

    Uses a stratified train/test split to preserve class balance, then reports
    confusion matrices and full classification reports for both subsets.
    """
    # Stratified split keeps class proportions consistent across train and test sets
    binaries_train, binaries_test, labels_train, labels_test = train_test_split(
        feats, labels, train_size=train_size, test_size=test_size, stratify=labels, random_state=rng_seed_data
    )

    # List of ML models to train and evaluate
    ml_models = {
        "SVM": LinearSVC(max_iter=200, random_state=rng_seed),
        "Logistic Regression": LogisticRegression(max_iter=200, random_state=rng_seed),
        "Decision Tree": DecisionTreeClassifier(random_state=rng_seed),
        "Random Forest": RandomForestClassifier(random_state=rng_seed)
    }

    for model_name, model in ml_models.items():
        print(colored(f"[[ Training and Evaluation for {model_name} ]]", attrs=["bold"]))
        print(colored("[ Training ]", attrs=["bold"]))

        with timeit(action=f"training of {model_name}"):
            model.fit(binaries_train, labels_train)

        # Compute and show metrics for training and testing sets
        for subset_name, binaries_eval, labels_eval in (
            ("Training", binaries_train, labels_train),
            ("Testing", binaries_test, labels_test)
        ):
            preds = model.predict(binaries_eval)

            print(colored(f"[ {subset_name} Set ]", attrs=["bold"]))

            print("Confusion matrix:")
            print(confusion_matrix(labels_eval, preds))
            print()

            print("Classification report:")
            print(classification_report(labels_eval, preds))

With the training and evaluation routine implemented, let's first run it on the **Byte-Histogram and Endianness Features**:


In [None]:
train_test_ml_models(feats_byte_hist_endian, labels)

... and for **Byte-level 1,2,3-Gram TF-IDF Features**:

In [None]:
train_test_ml_models(feats_tf_idf_123, labels)

Let's check the shape of the byte-level TF-IDF feature matrix before deciding on dimensionality reduction:


In [None]:
feats_tf_idf_123.shape

We'll find that the **Byte-level 1,2,3-Gram TF-IDF Features** have more than 32,000 dimensions per sample. To ease the computational burden, we apply `KernelPCA` with an RBF kernel to reduce dimensionality to 300 components. Because `KernelPCA` builds a dense kernel matrix whose size grows quadratically with the number of samples it is fitted on, we fit it on a stratified 20% subset of the data to limit memory usage and runtime, then transform the full dataset.
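
The fit-on-a-subset, transform-everything pattern can be sketched on small synthetic data. Everything below is illustrative: the random sparse matrix, the sizes, and the helper `kernel_matrix_gib` are assumptions, with the helper simply estimating the memory of a dense float64 kernel matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)

# Synthetic sparse features: 200 samples, 50 dimensions, about 20% non-zero
dense = rng.random((200, 50)) * (rng.random((200, 50)) < 0.2)
toy_feats = csr_matrix(dense)

# Fit on a 20% subset (40 rows), then project all 200 rows
fit_idx = rng.choice(200, size=40, replace=False)
toy_kpca = KernelPCA(n_components=5, kernel="rbf", random_state=0)
toy_kpca.fit(toy_feats[fit_idx])
toy_reduced = toy_kpca.transform(toy_feats)

print(toy_reduced.shape)

def kernel_matrix_gib(n_samples: int) -> float:
    """Approximate size in GiB of a dense float64 n x n kernel matrix."""
    return n_samples ** 2 * 8 / 2 ** 30

print(f"{kernel_matrix_gib(50_000):.1f} GiB vs {kernel_matrix_gib(10_000):.2f} GiB")
```

For a rough sense of scale, a kernel matrix over about 50,000 samples would need around 20 GB in float64, whereas one over 10,000 samples stays under 1 GB.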


In [None]:
from sklearn.decomposition import KernelPCA

KERNEL_PCA_DIMS = 300

# KernelPCA is fitted on a 20% stratified subset to reduce memory and runtime.
# The full dataset is then transformed using the fitted model.
feats_tf_idf_123_kpca_train, _ = train_test_split(
    feats_tf_idf_123, train_size=0.2, stratify=labels, random_state=RNG_SEED
)

with timeit("Kernel PCA fitting"):
    kpca_tf_idf_123 = KernelPCA(n_components=KERNEL_PCA_DIMS, kernel="rbf", random_state=RNG_SEED)
    kpca_tf_idf_123.fit(feats_tf_idf_123_kpca_train)

with timeit("Kernel PCA dimensionality reduction"):
    feats_tf_idf_123_kpca = kpca_tf_idf_123.transform(feats_tf_idf_123)

We can now run training and evaluation using the dimensionality-reduced (300-component) byte-level TF-IDF features:


In [None]:
train_test_ml_models(feats_tf_idf_123_kpca, labels)

Finally, let's run training and evaluation on the **Hex-level 1,2,3-Gram TF-IDF Features**:


In [None]:
train_test_ml_models(feats_tf_idf_hex_123, labels)

## References

1. “Tech challenge: Machine learning binaries,” Feb 2021. [Online]. Available: https://www.praetorian.com/challenges/machine-learning-challenge/#how-to-play
2. J. Clemens, “Automatic classification of object code using machine learning,” Digital Investigation, vol. 14, pp. S156–S162, 2015.
3. D. Sahabandu, J. S. Mertoguno and R. Poovendran, "A Natural Language Processing Approach for Instruction Set Architecture Identification," in IEEE Transactions on Information Forensics and Security, vol. 18, pp. 4086-4099, 2023, doi: 10.1109/TIFS.2023.3288456.

