I have curated a comprehensive list of datasets for malware analysis and threat intelligence.

Dataset | Description | Link |
---|---|---|
APT Malware Dataset | This dataset contains over 3,500 malware samples related to 12 APT groups that are allegedly sponsored by 5 different nation-states. | https://github.com/cyber-research/APTMalware |
Malware sample library | This dataset contains 26 categories of malware samples | https://github.com/mstfknn/malware-sample-library |
Malware-samples | A collection of 10+ classes of malware samples caught by several honeypots | https://github.com/fabrimagic72/malware-samples |
Malware-API-class | Public malware dataset generated by Cuckoo Sandbox based on Windows OS API calls analysis for cyber security researchers | https://github.com/ocatak-zz/malware_api_class |
MalWAReX | A collection of RAT (Remote Access Trojan) malware samples targeting computer networks | https://github.com/0x48piraj/MalWAReX |
Malicious URL | The data set consists of about 2.4 million URLs (examples) and 3.2 million features | http://www.sysnet.ucsd.edu/projects/url/ |
Malicious URL | A dataset of 651,191 URLs: 428,103 benign (safe) URLs, 96,457 defacement URLs, 94,111 phishing URLs, and 32,520 malware URLs | https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset |
UNSW-NB15 Dataset | This dataset covers nine families of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. The Argus and Bro-IDS tools, together with twelve algorithms, are used to generate a total of 49 features with the class label | https://research.unsw.edu.au/projects/unsw-nb15-dataset |
Microsoft Malware Classification Challenge | A set of known malware files representing a mix of 9 different families. Each malware file has an Id (a 20-character hash value uniquely identifying the file) and a Class (an integer representing one of the 9 family names to which the malware may belong) | https://www.kaggle.com/competitions/malware-classification/data |
CIC-MalMem-2022 | A balanced dataset of memory dumps, 50% malicious and 50% benign. It contains a total of 58,596 records: 29,298 benign and 29,298 malicious | https://www.unb.ca/cic/datasets/malmem-2022.html |

The data in the sources above may be messy and contain duplicates. To address this, I have curated a perfect-ish dataset from these sources that represents all the classes of malware.
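
One way to handle the duplicates mentioned above is to keep a single file per content hash. The following is a minimal sketch, not necessarily the procedure used to build the curated dataset. It assumes the samples from the sources have already been downloaded into a local directory. The function names `sha256_of` and `find_unique_samples` are hypothetical and do not come from any of the listed repositories. The code only reads file bytes and never executes a sample.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large samples do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unique_samples(root: Path) -> dict[str, Path]:
    """Map each distinct digest under `root` to the first file that has it.

    Files are visited in sorted path order, so the kept file is deterministic.
    """
    unique: dict[str, Path] = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # setdefault keeps the first path seen for a digest and ignores repeats.
        unique.setdefault(sha256_of(path), path)
    return unique
```

Calling `find_unique_samples(Path("samples"))` (where `samples` is an assumed directory name) returns one path per distinct file content. Hashing only catches byte-identical duplicates, so repacked or slightly modified variants of the same malware would still count as separate samples.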