
strfx/clava


clava

Generate Code-Based Yara Rules using Machine Learning.


About

I wrote clava for an industry project during my studies at Hochschule Luzern. This project researches how to automatically generate code-based Yara rules for a given malware sample using machine learning. We've kept the machine learning part intentionally rudimentary to demonstrate how much can be achieved with basic methods. The research is documented in a paper (German only). Contact me if you are interested in the paper.

TL;DR: clava creates n-grams of mnemonics (e.g., XOR or PUSH) from goodware and malware and trains a logistic regression classifier on the n-grams' term-frequency weights. We drop the operands, as they are subject to change, and keep only the operation part of each instruction to improve the robustness of the rules. Because the operands are dropped, the corresponding bytes in the generated Yara rules have to be wildcarded. We use mkYARA (kudos!) for that task.
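The feature extraction and wildcarding idea can be illustrated with a minimal Python sketch. It is not clava's or mkYARA's actual code: the `INSTRUCTIONS` list and the helpers `mnemonic_ngrams`, `term_frequencies` and `wildcard_hex` are hypothetical names made up for this example, and treating the ModRM byte of XOR as an "operand byte" is a simplifying assumption.

```python
from collections import Counter

# Hypothetical disassembly of a sample (an assumption for illustration):
# (mnemonic, opcode bytes, operand bytes) per x86 instruction.
INSTRUCTIONS = [
    ("push", "68", "00 10 40 00"),  # push 0x401000
    ("xor", "31", "c0"),            # xor eax, eax
    ("push", "68", "10 20 40 00"),  # push 0x402010
    ("xor", "31", "db"),            # xor ebx, ebx
]


def mnemonic_ngrams(mnemonics, n):
    """Return all contiguous n-grams of a mnemonic sequence as tuples."""
    return [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]


def term_frequencies(ngrams):
    """Map each n-gram to its relative frequency within one sample."""
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def wildcard_hex(instructions):
    """Render a Yara hex string that keeps opcodes and wildcards operands."""
    parts = []
    for _mnemonic, opcode, operands in instructions:
        parts.append(opcode)
        # One '??' wildcard per operand byte.
        parts.extend("??" for _ in operands.split())
    return "{ " + " ".join(parts) + " }"


mnemonics = [mnemonic for mnemonic, _, _ in INSTRUCTIONS]
bigrams = mnemonic_ngrams(mnemonics, 2)
print(term_frequencies(bigrams))
print(wildcard_hex(INSTRUCTIONS))
```

In a setup like the one described above, the term-frequency dictionaries of many goodware and malware samples would form the feature vectors for the logistic regression classifier, and the wildcarded hex string is the kind of pattern that ends up in the generated rule.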

We've kept the methodology deliberately simple to demonstrate what can be achieved and also due to the project's time constraints. Using n-grams of mnemonics (where n is small) is simple but lacks semantic meaning - a malware analyst would have a hard time figuring out the context of the output sequence. Semantic meaning can be achieved by increasing the n-gram size (see also KiloGrams: Very Large N-Grams for Malware Classification) or by using semantically meaningful features in the first place, such as function bodies of the disassembled binaries. Further, one could explore more elaborate models, such as sequence models like RNNs.

The trained models are not public. However, you can train a model on your own dataset. Instructions will follow.

Getting Started

To install clava, clone this repository and run (preferably in a virtualenv):

$ pip install -r requirements.txt
$ python setup.py install

clava offers a simple CLI to interact. To list all available options, run:

$ clava -h

To generate a Yara rule based on a sample:

$ clava yara <path/to/sample>

Use the official Yara binaries to apply the generated rule to your sample and/or a corpus of samples. The binaries can be downloaded from the Yara project.

For example:

# Create yara rule 'detect-evil.yar' for evil.exe:
$ clava yara evil.exe -o detect-evil.yar

# Check if any file in a corpus matches the generated rule:
$ yara detect-evil.yar my-malware-corpus/

# Tip: If you have a large corpus, you can compile the yara rule to
# increase the performance:
$ yarac detect-evil.yar detect-evil-compiled
$ yara -C detect-evil-compiled my-malware-corpus/

Important: Rules created with clava should not be used directly in production, but they can assist during rule development. This project is heavily inspired by yarGen, so see also Florian Roth's blog post "How to post-process YARA rules generated by yarGen".

Development

During development, install clava in editable mode:

$ pip install -e .[dev]

Running the tests

clava uses pytest. To run the test suite with a set of predefined settings, run:

$ make tests

Alternatively, you can run pytest against the tests/ directory with your own settings.

Contribute

Contributions are welcome! If you plan major changes, please create an issue first to discuss the changes.

Resources

Good datasets are essential, but there are not many public datasets of goodware and malware executables. You can assemble your own dataset using projects like:

Public goodware datasets are rare - PRs are welcome :-)

Tools:

  • Capstone.js for interactive disassembling, useful during development.

Credits

clava was heavily inspired by these projects:

I would also like to thank these projects: