VirusShareSant is a malware dataset built from VirusShare samples. It was designed around the following choices:
- Goal of the analysis: malware family classification;
- Execution platform: according to [1], 51.08% of all detected malware runs on the Microsoft Windows platform, so Windows was chosen;
- Number of labeled samples: the dataset provided in the Microsoft Malware Classification Challenge (BIG 2015) [2] contains 21 thousand samples, but only half of them are labeled. VirusShareSant has 28,617 labeled samples;
- Binary availability: all samples are available upon request for research purposes (ricardo.santana@ime.eb.br).
VirusShare.com is a repository that gives security researchers, incident responders, forensic analysts, and the morbidly curious access to samples of live malicious code. The samples are distributed in numbered packages. Packages 0 to 148 contain 131,072 samples each, and packages 149 and later contain 65,536 samples each. As of December 2021, the last available package is Package 400.
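The package layout described above can be captured in a small helper. This is only a sketch: the function and constant names are hypothetical, and the sizes are the nominal ones quoted in the text, not values checked against the actual archives.

```python
# Nominal package sizes quoted in the text (not verified against the archives).
LARGE_PACKAGE_SIZE = 131_072   # packages 0 to 148
SMALL_PACKAGE_SIZE = 65_536    # packages 149 and later
FIRST_SMALL_PACKAGE = 149


def package_size(index: int) -> int:
    """Return the nominal number of samples in the package with this index."""
    if index < 0:
        raise ValueError("package index must be non-negative")
    if index < FIRST_SMALL_PACKAGE:
        return LARGE_PACKAGE_SIZE
    return SMALL_PACKAGE_SIZE


def package_name(index: int) -> str:
    """Format an index the way package names are written in this document."""
    return f"VirusShare_{index:05d}"
```

For example, `package_name(47)` gives `VirusShare_00047`, the form used for the packages listed in this document.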
However, VirusShare does not provide labels for its samples. We therefore used the ldjson files from the ML Sec Project [3], in which all samples from VirusShare packages 0 to 233 were submitted to VirusTotal [4] for analysis using a Python API [5]. The result of submitting a sample to VirusTotal through the API is a JSON object containing the antivirus scan results. The ML Sec Project stored all JSON objects from one VirusShare package in a single file with the ldjson extension (line-delimited JSON, one object per line).
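Because an ldjson file holds one JSON object per line, it can be read one line at a time without loading a whole package's reports into memory. The following is a minimal sketch, assuming UTF-8 text files; `iter_reports` is a hypothetical helper name and is not part of the ML Sec Project tooling.

```python
import json
from typing import Iterator


def iter_reports(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of an ldjson file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at end of file
                yield json.loads(line)
```

Each yielded dictionary would then correspond to one VirusTotal report for one sample.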
First, we selected 7 packages from VirusShare: VirusShare_00015 (2012-10-20), VirusShare_00021 (2012-11-20), VirusShare_00023 (2012-11-30), VirusShare_00024 (2012-12-06), VirusShare_00026 (2012-12-22), VirusShare_00047 (2013-03-25) and VirusShare_00094 (2013-09-08).
The following families were selected (using Microsoft antivirus names): Backdoor:Win32/Bifrose, Trojan:Win32/Vundo, Backdoor:Win32/Cycbot, BrowserModifier:Win32/Zwangi, Rogue:Win32/Winwebsec, Trojan:Win32/Koutodoor, Backdoor:Win32/Rbot, Backdoor:Win32/Hupigon and Trojan:Win32/Startpage.
Using all the ldjson files, a sample from the above packages that belongs to one of the selected families was accepted if:
- it was detected by Microsoft antivirus;
- it was detected by at least 10 antivirus engines;
- at least two other antivirus engines use a family name similar to Microsoft's;
- it is a valid PE file (compatible with pefile).
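The first three criteria can be sketched as a filter over one VirusTotal report. This is an illustration under stated assumptions, not the procedure actually used to build the dataset: it assumes each report has a `scans` dictionary mapping an engine name to an entry with `detected` and `result` fields, and it interprets "similar family name" as a case-insensitive substring match, which is a guess. The names `accept` and `microsoft_family` are hypothetical. The PE check is a separate step on the binary itself.

```python
SELECTED_FAMILIES = (
    "Bifrose", "Vundo", "Cycbot", "Zwangi", "Winwebsec",
    "Koutodoor", "Rbot", "Hupigon", "Startpage",
)
MIN_DETECTIONS = 10        # at least 10 engines must detect the sample
MIN_AGREEING_ENGINES = 2   # at least two other engines must agree on the family


def microsoft_family(scans: dict):
    """Return the selected family named in Microsoft's label, or None."""
    entry = scans.get("Microsoft", {})
    if not entry.get("detected"):
        return None
    label = (entry.get("result") or "").lower()
    for family in SELECTED_FAMILIES:
        if family.lower() in label:
            return family
    return None


def accept(report: dict):
    """Return the family if the report passes the label criteria, else None."""
    scans = report.get("scans", {})  # assumed report layout, see lead-in
    family = microsoft_family(scans)
    if family is None:
        return None
    detected = [name for name, entry in scans.items() if entry.get("detected")]
    if len(detected) < MIN_DETECTIONS:
        return None
    # Assumption: "similar name" means the family name appears in the label.
    agreeing = [
        name for name in detected
        if name != "Microsoft"
        and family.lower() in (scans[name].get("result") or "").lower()
    ]
    if len(agreeing) < MIN_AGREEING_ENGINES:
        return None
    return family
```

A plain substring match is deliberately simple; short names such as Rbot may match unrelated labels, so a stricter comparison could be preferable in practice.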
To verify that a sample is PE compatible, we used the pefile library for Python [6]. The following code is a simple example:
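
    import pefile

    # filename is the path of the sample to check
    try:
        pe = pefile.PE(filename)  # raises PEFormatError if the file is not a valid PE
        print("Accept it")
    except pefile.PEFormatError as e:
        print("{:s}\t ERROR {:s}".format(filename, e.value))
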
The number of samples per family in the VirusShareSant dataset is presented in the following table.
Class | Family | Samples |
---|---|---|
0 | Backdoor:Win32/Bifrose | 2291 |
1 | Trojan:Win32/Vundo | 6794 |
2 | Backdoor:Win32/Cycbot | 3622 |
3 | BrowserModifier:Win32/Zwangi | 920 |
4 | Rogue:Win32/Winwebsec | 4624 |
5 | Trojan:Win32/Koutodoor | 5605 |
6 | Backdoor:Win32/Rbot | 1170 |
7 | Backdoor:Win32/Hupigon | 1943 |
8 | Trojan:Win32/Startpage | 1648 |
Total | | 28,617 |
The available files, X_exp3.npy, y_exp3.npy and D_exp3.npy, are NumPy arrays that contain the file name, the class and the package of each sample, respectively. For instance, the X, y and D entries for the first sample in the dataset are:
Name | Class | Package |
---|---|---|
VirusShare_c99da9702b21eda352d570f5168a5252 | 0 | VirusShare_00047 |
So the malware class is 0, which is Bifrose (Backdoor:Win32/Bifrose), and the sample can be found in package VirusShare_00047.
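The three arrays can be loaded and joined with the class table as sketched below. This is a sketch under assumptions: the files are taken to sit in one directory, and `allow_pickle=True` is passed only in case the string arrays were saved with object dtype (it should be used only on files from a trusted source). `load_dataset` and `describe` are hypothetical helper names.

```python
import os

import numpy as np

# Class index -> family name, in the order given in the table above.
FAMILIES = [
    "Backdoor:Win32/Bifrose",
    "Trojan:Win32/Vundo",
    "Backdoor:Win32/Cycbot",
    "BrowserModifier:Win32/Zwangi",
    "Rogue:Win32/Winwebsec",
    "Trojan:Win32/Koutodoor",
    "Backdoor:Win32/Rbot",
    "Backdoor:Win32/Hupigon",
    "Trojan:Win32/Startpage",
]


def load_dataset(directory: str = "."):
    """Load the name (X), class (y) and package (D) arrays from a directory."""
    names = np.load(os.path.join(directory, "X_exp3.npy"), allow_pickle=True)
    classes = np.load(os.path.join(directory, "y_exp3.npy"), allow_pickle=True)
    packages = np.load(os.path.join(directory, "D_exp3.npy"), allow_pickle=True)
    if not (len(names) == len(classes) == len(packages)):
        raise ValueError("X, y and D must have the same length")
    return names, classes, packages


def describe(names, classes, packages, index: int) -> str:
    """Return a one-line description of the sample at the given position."""
    label = int(classes[index])
    return f"{names[index]}: class {label} ({FAMILIES[label]}), package {packages[index]}"
```

Calling `describe(*load_dataset(), 0)` on the distributed files should describe the first sample shown in the table above, provided the arrays are stored as assumed here.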
[1] AVTEST, T. Heightened threat scenario: all the facts in the AV-TEST Security Report 2018/2019. Available at <https://www.av-test.org/en/news/heightened-threat-scenario-all-the-facts-in-the-av-test-security-report-2018-2019/>
[2] Ronen, Royi, et al. "Microsoft malware classification challenge." arXiv preprint arXiv:1802.10135 (2018).
[3] http://www.mlsec.org/ and https://github.com/seymour1/label-virusshare
[4] https://www.virustotal.com/