
Malware DataSet for Windows Platform containing 28617 labeled samples from VirusShare packages.


ricksant2003/MalwareDatasetVirusShareSant


VirusShareSant

A Malware Dataset for Family Classification

VirusShareSant is a malware dataset built from VirusShare samples. Its design reflects the following choices:

  • Goal of the analysis: malware family classification.
  • Execution platform: according to [1], 51.08% of all detected malware runs on the Microsoft Windows platform, so Windows was chosen.
  • Number of labeled samples: the dataset provided for the Microsoft Malware Classification Challenge (BIG 2015) [2] contains 21 thousand samples, but only half of them are labeled. VirusShareSant has 28,617 labeled samples.
  • Binary availability: all samples are available on request for research purposes (ricardo.santana@ime.eb.br).

Labeling

VirusShare.com is a repository that gives security researchers, incident responders, forensic analysts, and the morbidly curious access to samples of live malicious code. The samples are distributed in numbered packages. Packages 0 to 148 contain 131,072 samples each, and packages 149 and later contain 65,536 samples each. As of December 2021, the latest available package is Package 400.

However, VirusShare does not provide labels for its samples. We therefore used the ldjson files from the ML Sec Project [3], in which all samples from VirusShare packages 0 to 233 were submitted to VirusTotal [4] for analysis through a Python API [5]. The result of submitting a sample to VirusTotal through the API is a JSON object containing the antivirus scan results. The ML Sec Project stored all JSON objects from each VirusShare package in a single file with the ldjson extension.
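As an illustration of how such a file can be read, the sketch below parses a line-delimited JSON stream, one report per line, and collects the labels of the engines that flagged a sample. The field names ("scans", "detected", "result", "md5") follow the VirusTotal v2 file report layout and are an assumption here, since this README does not describe the schema; the file name in the usage comment is hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_reports(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one report (a dict) per non-empty line of an ldjson stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


def detections(report: dict) -> Dict[str, str]:
    """Map each engine that flagged the sample to the label it assigned.

    Assumption: report["scans"][engine] is a dict holding "detected" (bool)
    and "result" (label string or None), as in VirusTotal v2 reports.
    """
    return {
        engine: scan["result"]
        for engine, scan in report.get("scans", {}).items()
        if scan.get("detected") and scan.get("result")
    }


# Usage with a file on disk (hypothetical file name):
# with open("VirusShare_00015.ldjson", encoding="utf-8") as fh:
#     for report in iter_reports(fh):
#         print(report.get("md5"), len(detections(report)))
```

Reading the file line by line keeps memory use flat even when a package holds tens of thousands of reports.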

Selecting Malware Families from VirusShare Packages

First, we selected 7 packages from VirusShare: VirusShare_00015 (2012-10-20), VirusShare_00021 (2012-11-20), VirusShare_00023 (2012-11-30), VirusShare_00024 (2012-12-06), VirusShare_00026 (2012-12-22), VirusShare_00047 (2013-03-25) and VirusShare_00094 (2013-09-08).

The following families were selected (using Microsoft antivirus names): Backdoor:Win32/Bifrose, Trojan:Win32/Vundo, Backdoor:Win32/Cycbot, BrowserModifier:Win32/Zwangi, Rogue:Win32/Winwebsec, Trojan:Win32/Koutodoor, Backdoor:Win32/Rbot, Backdoor:Win32/Hupigon and Trojan:Win32/Startpage.

Using the ldjson files, a sample from the packages above was accepted into one of the selected families if:

  • it was detected by Microsoft antivirus;
  • it was detected by at least 10 antivirus engines;
  • at least two other antivirus engines use a family name similar to Microsoft's;
  • it is a valid PE file (parseable by pefile).
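The first three criteria above can be sketched as a filter over the labels of one sample. This is only an illustration under stated assumptions, not the procedure actually used to build the dataset: the engine key "Microsoft", counting Microsoft among the 10 engines, and treating "similar" as a case-insensitive substring match on the bare family name are all guesses, and the function names are hypothetical.

```python
import re
from typing import Dict, Optional, Set

MIN_ENGINES = 10   # sample must be flagged by at least 10 engines (Microsoft included: assumption)
MIN_AGREEING = 2   # at least two non-Microsoft engines must name the same family


def family_token(microsoft_label: str) -> str:
    """Return the bare family name, e.g. 'bifrose' for 'Backdoor:Win32/Bifrose.AE'."""
    tail = microsoft_label.rsplit("/", 1)[-1]
    return re.split(r"[.!]", tail)[0].lower()


def accepted_family(labels: Dict[str, str], families: Set[str]) -> Optional[str]:
    """Return the family token if the sample passes the three label criteria.

    `labels` maps each engine that detected the sample to its label string.
    """
    ms_label = labels.get("Microsoft")
    if ms_label is None:
        return None  # criterion 1: not detected by Microsoft
    family = family_token(ms_label)
    if family not in families:
        return None  # not one of the selected families
    if len(labels) < MIN_ENGINES:
        return None  # criterion 2: too few detecting engines
    # Criterion 3: "similar" is approximated by a substring match (assumption).
    agreeing = sum(
        1
        for engine, label in labels.items()
        if engine != "Microsoft" and family in label.lower()
    )
    return family if agreeing >= MIN_AGREEING else None
```

A substring match is the simplest reading of "similar"; it will miss vendors that use a different alias for the same family, so a real pipeline would likely need an alias table.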

To check whether a sample is a valid PE file, we used the pefile library for Python [6]. The following code is a simple example:

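import pefile

try:
    # Parsing succeeds only for well-formed PE files
    pe = pefile.PE(filename)
    print("Accept it")
except pefile.PEFormatError as e:
    # e.value holds the parser's error message
    print("{:s}\t ERROR {:s}".format(filename, str(e.value)))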

Files

The number of samples per family in the VirusShareSant dataset is presented in the following table.

Class Family Samples
0 Backdoor:Win32/Bifrose 2291
1 Trojan:Win32/Vundo 6794
2 Backdoor:Win32/Cycbot 3622
3 BrowserModifier:Win32/Zwangi 920
4 Rogue:Win32/Winwebsec 4624
5 Trojan:Win32/Koutodoor 5605
6 Backdoor:Win32/Rbot 1170
7 Backdoor:Win32/Hupigon 1943
8 Trojan:Win32/Startpage 1648
Total 28617

The available files X_exp3.npy, y_exp3.npy and D_exp3.npy are NumPy arrays that contain, respectively, the file name, the class and the source package of each sample. For instance, the X, y and D entries for the first sample in the dataset are:

Name Class Package
VirusShare_c99da9702b21eda352d570f5168a5252 0 VirusShare_00047

So this sample's class is 0, which is Bifrose (Backdoor:Win32/Bifrose), and it can be found in package VirusShare_00047.
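A minimal sketch of loading the three arrays and describing one entry follows. The helper names are hypothetical, and the dtype of the stored arrays is not stated in this README: if they were saved with object dtype, np.load needs allow_pickle=True, which should only be enabled for files you trust.

```python
from pathlib import Path

import numpy as np

# Class index -> family name, taken from the table above.
FAMILIES = [
    "Backdoor:Win32/Bifrose",
    "Trojan:Win32/Vundo",
    "Backdoor:Win32/Cycbot",
    "BrowserModifier:Win32/Zwangi",
    "Rogue:Win32/Winwebsec",
    "Trojan:Win32/Koutodoor",
    "Backdoor:Win32/Rbot",
    "Backdoor:Win32/Hupigon",
    "Trojan:Win32/Startpage",
]


def load_dataset(directory=".", allow_pickle=False):
    """Load names (X), classes (y) and packages (D) from `directory`."""
    base = Path(directory)
    names = np.load(base / "X_exp3.npy", allow_pickle=allow_pickle)
    classes = np.load(base / "y_exp3.npy", allow_pickle=allow_pickle)
    packages = np.load(base / "D_exp3.npy", allow_pickle=allow_pickle)
    # The three arrays are parallel, so their lengths must agree.
    if not (len(names) == len(classes) == len(packages)):
        raise ValueError("X, y and D arrays have different lengths")
    return names, classes, packages


def describe(names, classes, packages, index=0):
    """Return a one-line summary of the sample at `index`."""
    label = int(classes[index])
    return "{} -> class {} ({}), package {}".format(
        names[index], label, FAMILIES[label], packages[index]
    )


# Usage, with the .npy files in the current directory:
# names, classes, packages = load_dataset(".")
# print(describe(names, classes, packages, index=0))
```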

References

[1] AVTEST, T. Heightened threat scenario: all the facts in the AV-TEST Security Report 2018/2019. Available at <https://www.av-test.org/en/news/heightened-threat-scenario-all-the-facts-in-the-av-test-security-report-2018-2019/>

[2] Ronen, Royi, et al. "Microsoft malware classification challenge." arXiv preprint arXiv:1802.10135 (2018).

[3] http://www.mlsec.org/ and https://github.com/seymour1/label-virusshare

[4] https://www.virustotal.com/

[5] https://github.com/dbrennand/virustotal-python

[6] https://github.com/erocarrera/pefile
