## Week 14: Malware Family Classifications

Malware Family Classification refers to the categorization of different types of malicious software based on their characteristics, behavior, and code similarities to help security experts identify and respond to cyber threats effectively.

Grouping related malware strains into families also makes them easier to understand and analyze. With malware organized this way, analysts can track patterns, anticipate likely future threats, and develop targeted defense strategies as the threat landscape evolves.

Distinguishing and classifying different types of malware is important for understanding how they infect computers and devices, the threat level they pose, and how to protect against them.


###### Sample classification tree from Kaspersky:

![kaspersky](malware-tree.jpg)

Kaspersky Lab classifies the entire range of malicious software or potentially unwanted objects detected by Kaspersky’s antivirus engine according to their activity on users’ computers. The classification system used by Kaspersky is also used by a number of other antivirus vendors as the basis for their classifications.

Kaspersky’s classification system gives each detected object a clear description and a specific location in the ‘classification tree’ shown above. In the ‘classification tree’ diagram:

- The types of behaviour that pose the least threat are shown in the lower area of the diagram.
- The types of behaviour that pose a greater threat are displayed in the upper part of the diagram.

#### Malware Types with Multiple Functions

Individual malware programs often include several malicious functions and propagation routines and, without some additional classification rules, this could lead to confusion.

For example, a specific malicious program may be capable of being spread via an email attachment and also as files via P2P networks. The program may also have the ability to harvest email addresses from an infected computer, without the consent of the user. With this range of functions, the program could be correctly classified as an Email-Worm, a P2P-Worm or a Trojan-Mailfinder. To avoid this confusion, Kaspersky applies a set of rules that can unambiguously categorise a malicious program as having a particular behaviour, regardless of the program functions:

- The ‘classification tree’ shows that each behaviour has been assigned its own threat level.
- In the ‘classification tree’ the behaviours that pose a higher risk outrank those behaviours that represent a lower risk.
- In our example, the Email-Worm behaviour represents a higher level of threat than either the P2P-Worm or Trojan-Mailfinder behaviour, so our example malicious program would be classified as an Email-Worm.

##### Multiple functions with equal threat levels
- If a malicious program has two or more functions that all have equal threat levels – such as Trojan-Ransom, Trojan-ArcBomb, Trojan-Clicker, Trojan-DDoS, Trojan-Downloader, Trojan-Dropper, Trojan-IM, Trojan-Notifier, Trojan-Proxy, Trojan-SMS, Trojan-Spy, Trojan-Mailfinder, Trojan-GameThief, Trojan-PSW or Trojan-Banker – the program is classified as a Trojan.
- If a malicious program has two or more functions with equal threat levels – such as IM-Worm, P2P-Worm or IRC-Worm – the program is classified as a Worm.
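
The precedence rules above can be sketched in a few lines of Python. This is only an illustration, not Kaspersky's implementation: the `THREAT_LEVEL` numbers and the `classify` function are assumptions invented for the example, chosen so that Email-Worm outranks P2P-Worm and Trojan-Mailfinder as in the text.

```python
# Hypothetical threat levels (higher = more dangerous). The real ordering comes
# from the classification tree; these numbers are placeholders for the sketch.
THREAT_LEVEL = {
    "Email-Worm": 3,
    "IM-Worm": 2,
    "IRC-Worm": 2,
    "P2P-Worm": 2,
    "Trojan-Downloader": 1,
    "Trojan-Mailfinder": 1,
    "Trojan-Spy": 1,
}


def classify(behaviours):
    """Return a single label for a program that exhibits several behaviours."""
    top = max(THREAT_LEVEL[b] for b in behaviours)
    winners = {b for b in behaviours if THREAT_LEVEL[b] == top}
    if len(winners) == 1:
        return winners.pop()
    # Several behaviours tie at the top level: use the generic parent class.
    if all(w.startswith("Trojan-") for w in winners):
        return "Trojan"
    if all(w.endswith("-Worm") for w in winners):
        return "Worm"
    # Ties across different parent classes are not covered by the rules above.
    raise ValueError(f"no rule for tie between {sorted(winners)}")


print(classify(["Email-Worm", "P2P-Worm", "Trojan-Mailfinder"]))  # Email-Worm
print(classify(["Trojan-Spy", "Trojan-Downloader"]))              # Trojan
print(classify(["IM-Worm", "IRC-Worm"]))                          # Worm
```

The highest-ranked behaviour wins outright. The generic `Trojan` or `Worm` label is used only when several behaviours tie at the top level.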

---

## How classification helps

The core problem in malware analysis is the sheer volume of samples (millions per year) and the need to identify entirely new types of threats that lack a known signature. Clustering addresses this by grouping samples based on behavioral and structural similarities, which can surface new families without any prior labels.

#### Feature Extraction: Creating the Data Vector

Before clustering can occur, raw malware files must be translated into numerical feature vectors. This is the most crucial step. Features can come from two main types of analysis:

|Analysis Type|Focus|Example Features (The Data Points)|
|:--|:--|:--|
|Static Analysis|Examining the file without executing it.|Byte N-grams: Frequencies of small byte sequences (e.g., pairs of bytes) found in the code.|
|||PE Header Info: Metadata like file size, timestamp, or header structure.|
|Dynamic Analysis|Executing the malware in a controlled, isolated environment (sandbox).|API Call Sequences: The order and frequency of system calls (e.g., CreateProcess, RegOpenKey, WriteFile).|
|||Network Activity: IP addresses contacted or specific network protocols used.|
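
As a rough sketch of how two of the features in the table become numbers, the snippet below counts byte 2-grams and API calls and maps them onto a fixed vocabulary. The function names, the tiny byte string, and the vocabularies are all made up for illustration. A real pipeline would read the bytes from a file and the calls from a sandbox report.

```python
from collections import Counter


def byte_bigram_counts(data: bytes) -> Counter:
    """Count every overlapping pair of adjacent bytes (byte 2-grams)."""
    return Counter(zip(data, data[1:]))


def api_call_counts(calls: list[str]) -> Counter:
    """Count how often each API call name appears in a trace."""
    return Counter(calls)


def to_vector(counts: Counter, vocabulary: list) -> list[float]:
    """Map counts onto a fixed vocabulary as relative frequencies."""
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    return [counts[key] / total for key in vocabulary]


# Static feature: toy byte string standing in for file contents.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x4D, 0x5A])
bigram_vocab = [(0x4D, 0x5A), (0x5A, 0x90), (0x00, 0x00)]
print(to_vector(byte_bigram_counts(sample), bigram_vocab))  # [0.4, 0.2, 0.0]

# Dynamic feature: toy API trace as it might appear in a sandbox log.
trace = ["CreateProcess", "WriteFile", "WriteFile"]
api_vocab = ["CreateProcess", "RegOpenKey", "WriteFile"]
print(to_vector(api_call_counts(trace), api_vocab))
```

A fixed vocabulary gives every sample a vector of the same length, which is what lets a clustering algorithm compare them.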

#### Clustering: Identifying New Families

Once features are extracted, a clustering algorithm like K-Means is applied to the high-dimensional feature vectors:

- Grouping Similar Behavior: The algorithm measures the distance between each malware sample and the cluster centroids in the feature space, and assigns each sample to the nearest centroid. Samples whose features are highly similar (e.g., two different files that make the same 10 API calls in the same sequence) end up in the same cluster.

- Known vs. Unknown:
    - Large, established clusters often represent known malware families (e.g., all samples identified as Emotet variants).
    - A new, distinct cluster or an outlier (a sample that doesn't fit well into any existing cluster, for example one that lies far from every centroid) signals a potentially new or emerging malware family. This is unsupervised novelty detection in action.
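
To make the grouping step concrete, here is a dependency-free sketch of K-Means (Lloyd's algorithm) on toy 2-D points. The `kmeans` function, the sample coordinates, and the fixed iteration count are assumptions for illustration. In practice a library implementation such as scikit-learn's `KMeans` would normally be used on real feature vectors.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Toy feature vectors: two tight groups standing in for two families.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(samples, k=2)

# Distance to the assigned centroid is one simple "how well does it fit" score.
scores = [math.dist(p, centroids[lab]) for p, lab in zip(samples, labels)]
print(labels, scores)
```

K-Means assigns every sample to some cluster, so it never reports an outlier by itself. A large distance to the assigned centroid is one way to flag samples that fit poorly.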

#### Dimensionality Reduction: Visualizing the Threat

Dimensionality reduction, particularly t-SNE (or the more modern UMAP), is used to give human analysts a clear visual map of the entire threat landscape:
- PCA might be used before clustering to reduce noise and the huge number of features (e.g., reducing 10,000 opcode frequencies down to 100 components).
- t-SNE is then used to project the high-dimensional samples onto a 2D scatter plot. Analysts can see new clusters forming (potential new threats) and which existing families new samples fall near, helping them prioritize which samples to reverse-engineer first. Distances between well-separated clusters in a t-SNE plot should be read with caution, because the method mainly preserves local neighborhoods.
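
The PCA step can be sketched without any libraries using power iteration. This is one way to compute principal components, chosen here only to keep the example self-contained. The function names and the toy data are assumptions, and a real pipeline would more likely use a library implementation such as scikit-learn's `PCA` (and `TSNE` for the 2D plot).

```python
import math
import random


def center(rows):
    """Subtract the per-feature mean so PCA measures variance, not offset."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_component(rows, iterations=100, seed=0):
    """Power iteration on X^T X: converges to the direction of largest variance."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in rows[0]]
    for _ in range(iterations):
        # Compute X^T (X v) without forming the covariance matrix.
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
        v = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(len(v))]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0:  # no variance left to explain
            break
        v = [w / norm for w in v]
    return v


def pca(rows, n_components):
    """Return (components, projected rows), using deflation between components."""
    data = center(rows)
    residual = [row[:] for row in data]
    components = []
    for _ in range(n_components):
        v = top_component(residual)
        components.append(v)
        # Deflation: remove the part of each row already explained by v.
        residual = [
            [x - sum(a * b for a, b in zip(row, v)) * w for x, w in zip(row, v)]
            for row in residual
        ]
    projected = [
        [sum(x * w for x, w in zip(row, comp)) for comp in components]
        for row in data
    ]
    return components, projected


# Toy data: three features, but all variance lies along the direction (1, 2, 0).
rows = [[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]]
components, projected = pca(rows, n_components=1)
print(components[0])  # points along (1, 2, 0), up to sign and scaling to unit length
print(projected)      # each sample reduced from 3 numbers to 1
```

The same idea scales to the text's example of reducing thousands of opcode frequencies to around 100 components, although at that size an optimized library routine is the practical choice.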

---

#### Malware Classification Process

1. Feature Engineering: The Input Data
- Since malware files are just bytes, they must be converted into numerical feature vectors for ML algorithms to process. The chosen features determine what "similar" means to the clustering algorithm.
    - Action: Malware is analyzed, either statically or by running it in a safe, isolated environment called a sandbox, to extract measurable characteristics:
        - Behavioral Features (Dynamic Analysis): Sequences of API calls (WriteFile, CreateProcess, etc.), registry key modifications, and network connections. This is often the most revealing data because it shows what the malware does.
        - Structural Features (Static Analysis): Frequencies of specific byte sequences (n-grams), metadata from the Portable Executable (PE) header, and code section entropy.

2. Dimensionality Reduction: The Preprocessing
- The extracted feature vectors often have thousands of dimensions (features), making both clustering and visualization computationally difficult.
    - Goal: Use PCA to reduce the complexity.
    - Result: PCA transforms the data into a smaller set of Principal Components that retain the most important variance. This makes the clustering faster, more stable, and helps remove noise features that don't contribute to distinguishing malware types.

3. Clustering: Classification and Novelty Detection
- This step groups the malware samples based on their reduced feature vectors.
    - A. Classifying Existing Threats
        - Mechanism: An algorithm like K-Means groups similar samples. Large, stable clusters are labeled as known malware families (e.g., Emotet, WannaCry).
        - Team Benefit: Analysts only need to reverse-engineer one representative sample from each known cluster. Once the representative sample is analyzed, the other files in that cluster can inherit its label, saving a great deal of time.

    - B. Identifying New Threats (Novelty Detection)
        - Mechanism: Unsupervised learning is crucial because it doesn't need prior labels.
            - If a batch of new samples forms a tight, distinct cluster that is far from any known cluster, it signals a brand new, previously unseen malware family.
            - If a sample is completely isolated and belongs to no cluster (density-based algorithms such as DBSCAN label such points as noise), it's highly anomalous and requires immediate attention.
        - Team Benefit: The security team focuses its limited resources on investigating these new clusters first, since they are the most likely to be emerging, previously unseen threats.
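
The known-versus-new decision can be sketched as a nearest-centroid check with a distance cutoff. Everything specific here is hypothetical: the centroid coordinates attached to the family names, the `triage` function, and the `max_distance` threshold are placeholders, and a density-based algorithm such as DBSCAN would flag isolated samples as part of clustering instead.

```python
import math

# Hypothetical centroids of already-labelled clusters in a reduced 2-D space.
KNOWN_FAMILIES = {
    "Emotet": (0.0, 0.0),
    "WannaCry": (10.0, 10.0),
}


def triage(sample, families=KNOWN_FAMILIES, max_distance=2.0):
    """Return (label, distance); label is "unknown" if no centroid is close enough."""
    name, centroid = min(families.items(), key=lambda kv: math.dist(sample, kv[1]))
    distance = math.dist(sample, centroid)
    if distance <= max_distance:
        return name, distance
    return "unknown", distance


print(triage((0.5, 0.5)))   # close to the first centroid
print(triage((9.0, 10.0)))  # ('WannaCry', 1.0)
print(triage((5.0, 5.0)))   # far from both centroids, flagged as unknown
```

Samples returned as `"unknown"` are the ones an analyst would look at first. The cutoff controls how many samples land in that queue.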

All in all, it's a system designed to manage vast amounts of data by automating classification and flagging novel threats for human analysis.

---

### Reflection Prompt:
How is using clustering to find new groups of data similar to using it to find new types of threats?

Clustering lets us discover new groups and also tells us whether new data belongs to an existing cluster or should form a new one. Knowing this helps the team decide how to mitigate, defend against, and prevent particular types of attacks.

Identifying new patterns or groups is especially important because it tells the team whether an attack is known or unknown, so they can choose the appropriate response when mitigating the threat.

The number of threat variants also keeps increasing. Using clustering to relate a new attack to known families helps determine whether an existing defense is likely to work against it, so that the correct steps are taken to defend against the threat.